
2.2 Measure of association for categorical variables

Pearson’s chi-squared test in the previous section tests whether two categorical variables are independent. If independence is rejected, then some form of relationship exists between the variables, but the test provides no information about what that relationship is.

2.2.1 Odds and log-odds

Definition 2.2.1.

The odds of an event is the ratio of the probability that the event occurs, $p$ , to the probability that it does not occur:

O=\frac{p}{1-p}

For example, the probability of rolling a six with a fair die is $1/6$ , so the odds of this event are:

O_{6}=\frac{1/6}{1-1/6}=\frac{1/6}{5/6}=\frac{1}{5}

Alternatively, the probability of rolling any number other than six with a fair die is $5/6$ , so the odds are:

O_{!6}=\frac{5/6}{1-5/6}=\frac{5/6}{1/6}=5=\frac{1}{O_{6}}

Here we see that the odds of an event and the odds of its negation are reciprocals of one another.

What does it mean for the odds of an event to be large or small? As a reference point, if an event has an equal chance of happening versus not happening, e.g. rolling an even number with a fair die, then the probability of the event is $p=1/2$ . The odds of this event are:

O_{\mathrm{Even}}=\frac{1/2}{1-1/2}=\frac{1/2}{1/2}=1.

Therefore, if the odds of an event are greater than 1, then the event is more likely to happen than not. For instance, with a fair die, not rolling a six is 5 times as likely as rolling a six. Conversely, if the odds of an event are less than 1, then the event is more likely not to happen than to happen.
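The die examples above can be reproduced with a short sketch. The helper name `odds` is an illustrative choice, and exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction


def odds(p):
    """Odds of an event with probability p, where 0 <= p < 1."""
    return p / (1 - p)


# Fair six-sided die.
odds_six = odds(Fraction(1, 6))      # rolling a six
odds_not_six = odds(Fraction(5, 6))  # rolling anything other than a six
odds_even = odds(Fraction(1, 2))     # rolling an even number

print(odds_six, odds_not_six, odds_even)  # prints: 1/5 5 1
```

The product `odds_six * odds_not_six` equals 1, which is the reciprocal relationship noted above.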

In general, we tend to deal with problems where the true probabilities of the events are unknown. However, an unbiased estimate of a probability can be derived from the contingency table as the proportion of cases that match the event of interest relative to the relevant sample size. For example, an estimate of the probability that a consumer uses App2 is $144/480$ , and an estimate of the conditional probability of recommendation given that the consumer uses App1 is $75/96$ .

Exercise 2.10
From the app survey example, estimate the odds for:

1. a consumer who recommends the app they are using.
2. a consumer who uses App2.
3. a consumer who would recommend the app given that they use App1.
4. a consumer who would recommend the app given that they use App3.

Odds take values on the positive real line only. A common way to aid interpretation is to map the odds onto the whole real line by taking logarithms.

Definition 2.2.2.

The log-odds is the natural logarithm of the odds:

\log(O)=\log\left(\frac{p}{1-p}\right)

For example, the log-odds for rolling a six with a fair die is $\log(O_{6})=\log(1/5)=-1.609$ , whereas the log-odds for rolling any number other than a six is $\log(O_{!6})=\log(5)=1.609$ . It follows that negating an event changes the sign of its log-odds.

For events that have an equal chance of occurring and not occurring, such as rolling an even number, we have seen that the odds are 1. Transforming this onto the log scale gives a corresponding log-odds of 0.
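As a quick numerical check of the log-odds values quoted above, here is a minimal sketch using the natural logarithm. The function name `log_odds` is an illustrative choice.

```python
import math


def log_odds(p):
    """Natural-log odds of an event with probability p, where 0 < p < 1."""
    return math.log(p / (1 - p))


print(round(log_odds(1 / 6), 3))  # rolling a six
print(round(log_odds(5 / 6), 3))  # rolling anything other than a six
print(round(log_odds(1 / 2), 3))  # rolling an even number
```

The first two values differ only in sign, and the third is zero, matching the discussion above.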

Exercise 2.11
For each of the scenarios in the previous exercise, estimate the log-odds of the event.

2.2.2 Odds Ratio and Log-Odds Ratio

The odds and log-odds give a measure of the chance of an event happening relative to the chance of it not happening. To investigate the relationship between categorical variables we ideally need a measure of the chances of one event happening relative to the chance of another event happening. Essentially we need to define a baseline event to which we can then compare the relative chances of other events.

Definition 2.2.3.

Denote $p_{1}$ and $p_{2}$ as the probabilities for events $X_{1}$ and $X_{2}$ respectively. The odds ratio of event $X_{1}$ relative to event $X_{2}$ is:

\psi=\frac{p_{1}/(1-p_{1})}{p_{2}/(1-p_{2})},

the odds for event $X_{1}$ divided by the odds for event $X_{2}$ .

In the app survey, how do the odds of a consumer recommending the app they are using to a friend differ for App1 relative to App3? Here, App3 is taken as the baseline, with an estimated probability of $121/240$ of recommending the app to a friend. So an estimate of the odds ratio for recommendation of App1 relative to App3 is:

\hat{\psi}=\frac{(75/96)/(1-75/96)}{(121/240)/(1-121/240)}=\frac{75/21}{121/119}=3.512

So, the odds of recommending the app to a friend are estimated to be 3.512 times as large for users of App1 as for users of App3. Note that this is a statement about odds, not about probabilities.
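The arithmetic for this estimate can be checked with a small sketch. The two proportions are the ones quoted in the text, while the helper names `odds` and `odds_ratio` are illustrative.

```python
from fractions import Fraction


def odds(p):
    """Odds of an event with probability p, where 0 <= p < 1."""
    return p / (1 - p)


def odds_ratio(p1, p2):
    """Odds of event 1 divided by the odds of the baseline event 2."""
    return odds(p1) / odds(p2)


p_app1 = Fraction(75, 96)    # estimated P(recommend | App1)
p_app3 = Fraction(121, 240)  # estimated P(recommend | App3), the baseline

psi_hat = odds_ratio(p_app1, p_app3)
print(psi_hat, round(float(psi_hat), 3))  # prints: 425/121 3.512
```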

Exercise 2.12
Estimate the odds ratio for users of App2 to recommend the app to their friends compared to the users of App3.

Definition 2.2.4.

The log-odds ratio is the logarithmic transform of the odds ratio and equates to the difference in log-odds between events $X_{1}$ and $X_{2}$ :

\log\psi=\log\left(\frac{p_{1}/(1-p_{1})}{p_{2}/(1-p_{2})}\right)=\log\left(\frac{p_{1}}{1-p_{1}}\right)-\log\left(\frac{p_{2}}{1-p_{2}}\right).

Whereas the odds ratio is a multiplicative comparison between two events, the log-odds ratio is an additive comparison.
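The identity in Definition 2.2.4 can be verified numerically. The sketch below uses the App1 and App3 recommendation proportions quoted earlier in the text. The variable names are illustrative.

```python
import math


def log_odds(p):
    """Natural-log odds of an event with probability p, where 0 < p < 1."""
    return math.log(p / (1 - p))


p1, p2 = 75 / 96, 121 / 240  # App1 and App3 (baseline) estimates

# Log of the odds ratio, computed directly.
log_psi_direct = math.log((p1 / (1 - p1)) / (p2 / (1 - p2)))
# Difference of the two log-odds.
log_psi_diff = log_odds(p1) - log_odds(p2)

print(round(log_psi_direct, 3), round(log_psi_diff, 3))
```

Both routes give the same value up to floating-point rounding.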

Exercise 2.13
Estimate the log-odds ratio for users of App2 to recommend the app relative to the users of App1.

When the two events are equally likely ( $p_{1}=p_{2}$ ), the odds ratio is $\psi=1$ (equivalently $\log\psi=0$ ). If the odds ratio is less than one ( $\psi<1$ or $\log\psi<0$ ), then the odds for the baseline event, $X_{2}$ , are higher than the odds for the event of interest, $X_{1}$ . Conversely, if the odds ratio is greater than one ( $\psi>1$ or $\log\psi>0$ ), then the odds for event $X_{1}$ are higher than the odds for the baseline $X_{2}$ .

2.2.3 Confidence Interval

When estimating the (log-)odds ratio for a particular scenario we are essentially reducing the contingency table to a $2{\times}2$ case and event table:

	Case1	Case2
Event	a	c
Not Event	b	d

The odds of the ‘Event’ for ‘Case1’ is estimated by:

\frac{a/(a+b)}{1-a/(a+b)}=\frac{a}{a+b-a}=\frac{a}{b},

and likewise the odds of the ‘Event’ for ‘Case2’ are estimated by $c/d$ . It follows that the (log-)odds ratio of the ‘Event’ for ‘Case1’ relative to ‘Case2’ is estimated by:

\hat{\psi}=\frac{a/b}{c/d}=\frac{ad}{bc}\quad\mathrm{and}\quad\log\hat{\psi}=\log\left(\frac{ad}{bc}\right).

Since the odds ratio is estimated from the contingency table, an estimate that deviates from $1$ could be due to sampling variation rather than a genuine association. To examine which is the case, we can derive an approximate 100(1- $\alpha$ )% confidence interval for the log-odds ratio and check whether or not it contains $0$ .

Definition 2.2.5.

The approximate standard error of the estimated log-odds ratio, $\log\hat{\psi}$ , can be shown to be given by:

\mathrm{std}(\log\hat{\psi})\approx\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}.

From asymptotic normality, an approximate 100(1- $\alpha$ )% confidence interval for $\log\psi$ is:

\left(\log\hat{\psi}-z_{1-\frac{\alpha}{2}}\times\mathrm{std}(\log\hat{\psi}),\ \log\hat{\psi}+z_{1-\frac{\alpha}{2}}\times\mathrm{std}(\log\hat{\psi})\right)

where $\alpha$ is the significance level and $z_{1-\frac{\alpha}{2}}$ is the 100 $(1-\frac{\alpha}{2})$ % quantile of the standard normal distribution (for example, $z_{0.975}\approx 1.96$ for a 95% interval).

Derivation of the standard error utilises a technique called the delta method, which is discussed later in the course.

Exercise 2.14
Calculate the standard error for the log-odds ratio estimated in Exercise 2.2.2. Use this estimate to evaluate a 95% confidence interval.

Since the relationship between the odds ratio and the log-odds ratio is monotonic, we can derive an approximate 100(1- $\alpha$ )% confidence interval for $\psi$ by exponentiating the end points of the confidence interval for $\log\psi$ .
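The interval construction can be sketched as follows. The function name `log_odds_ratio_ci` and the counts passed to it are hypothetical, chosen only for illustration, and are not taken from the app survey. The normal quantile comes from Python's standard-library `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def log_odds_ratio_ci(a, b, c, d, alpha=0.05):
    """Approximate 100(1 - alpha)% CI for the log-odds ratio of a 2x2 table.

    a, b: event / non-event counts for Case1.
    c, d: event / non-event counts for Case2.
    All four counts must be positive.
    """
    log_psi = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return log_psi - z * se, log_psi + z * se


# Hypothetical counts (illustration only).
lo, hi = log_odds_ratio_ci(a=30, b=70, c=20, d=80)

print("log-odds ratio CI:", (round(lo, 3), round(hi, 3)))
# Exponentiating the end points gives the interval for the odds ratio.
print("odds ratio CI:", (round(math.exp(lo), 3), round(math.exp(hi), 3)))
```

An interval for $\log\psi$ that contains 0 corresponds to an interval for $\psi$ that contains 1, so either scale can be used to judge whether the data are consistent with no association.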

Exercise 2.15
Calculate an approximate 95% confidence interval for the odds ratio in the previous example.

2.2.4 Rare events

In a contingency table, the data record both whether the event of interest occurred and whether it did not. However, there are many scenarios where the non-events cannot be recorded, or are so numerous that recording them all is impractical. For example:

• Event: Coronal mass ejections from the sun.
• Non-Event: No coronal mass ejections.
• Event: Wildebeest killed by a crocodile during the great migration.
• Non-Event: Wildebeest survived river crossing.

In both cases, recording the event of interest is possible, but recording the non-events is either impossible or impractical. Here the events of interest are said to be rare.

As half of the contingency table is missing, how can we calculate the odds ratio and log-odds ratio of the event between cases? If the event of interest is rare, then the number of non-events is assumed to be large and of similar magnitude in both cases, i.e. $b\approx d$ . From this, the odds ratio estimate is approximated by:

\hat{\psi}=\frac{ad}{bc}\approx\frac{a}{c}.

The corresponding log-odds ratio estimate is then:

\log\hat{\psi}\approx\log\left(\frac{a}{c}\right)=\log(a)-\log(c)
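To see how good the rare-event approximation can be, the sketch below compares $ad/(bc)$ with $a/c$. The counts are hypothetical, invented for illustration, with few events and a very large, similar number of non-events in each case.

```python
import math

# Hypothetical counts (illustration only).
a, b = 40, 100_000  # Case1: events, non-events
c, d = 25, 100_500  # Case2: events, non-events

psi_full = (a * d) / (b * c)  # requires the non-event counts
psi_rare = a / c              # rare-event approximation, assumes b is close to d

print(round(psi_full, 4), round(psi_rare, 4))
print(round(math.log(psi_full), 4), round(math.log(psi_rare), 4))
```

The two estimates agree closely here because $d/b$ is close to 1. The approximation degrades as the non-event counts of the two cases diverge.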

Exercise 2.16
A manufacturer produces about half a million light bulbs a day, all of which are tested to ensure that they work correctly. Only a small proportion of the light bulbs fail the test. The table below presents the number of light bulbs that were rejected according to the bulb type:

Light bulb type	Halogen	Energy saver	LED	Total
Number rejected	368	186	152	706

Estimate the rejection log-odds ratio for halogen and energy saver bulbs relative to LED bulbs.

Since the number of non-events is large, the reciprocals of these numbers are approximately zero, i.e. $1/b\approx 0$ and $1/d\approx 0$ . Therefore, the approximate standard error for the log-odds ratio estimate is given by:

\mathrm{std}(\log\hat{\psi})\approx\sqrt{\frac{1}{a}+\frac{1}{c}}

Given this, it is then possible to derive confidence intervals for the log-odds ratio and odds ratio as before.
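A minimal sketch of the rare-event interval, under the assumptions stated above (non-events numerous and similar in number for both cases). The function name `rare_event_log_or_ci` and the event counts are hypothetical, chosen for illustration only.

```python
import math
from statistics import NormalDist


def rare_event_log_or_ci(a, c, alpha=0.05):
    """Rare-event log-odds ratio estimate, standard error and approximate CI.

    a, c: event counts for Case1 and Case2 (both positive).
    Non-event counts are assumed large and roughly equal, so they drop out.
    """
    log_psi = math.log(a / c)
    se = math.sqrt(1 / a + 1 / c)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return log_psi, se, (log_psi - z * se, log_psi + z * se)


# Hypothetical event counts (illustration only).
log_psi, se, (lo, hi) = rare_event_log_or_ci(a=40, c=25)

print(round(log_psi, 3), round(se, 3))
print("log-odds ratio CI:", (round(lo, 3), round(hi, 3)))
print("odds ratio CI:", (round(math.exp(lo), 3), round(math.exp(hi), 3)))
```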

Exercise 2.17
Calculate the standard errors of the log-odds ratio estimates derived in the previous exercise. Is the number of rejected halogen and energy saver bulbs different from the number of rejected LED bulbs?