2 Workshop - Voting

The table below presents the voting preferences of 2191 US citizens, separated by gender.

	Democrat	Republican
Female	762	468
Male	484	477

The following extract creates the object vote in R, containing the data from the table above:

vote <- rbind(c(762, 468), c(484, 477))
rownames(vote) <- c("F", "M")
colnames(vote) <- c("Democrat", "Republican")
vote
N <- sum(vote)          # overall total
R <- vote[,1] + vote[,2]   # row totals, try apply(vote, MARGIN=1, FUN=sum)
C <- vote[1,] + vote[2,]   # column totals, try apply(vote, MARGIN=2, FUN=sum)
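The totals can also be obtained with base R helpers rather than by adding rows and columns by hand. The following is a minimal sketch using the same counts as the table above; the names R_alt and C_alt are only illustrative and are not used elsewhere in the workshop.

```r
# Rebuild the table so this block runs on its own
vote <- rbind(c(762, 468), c(484, 477))
rownames(vote) <- c("F", "M")
colnames(vote) <- c("Democrat", "Republican")

R_alt <- rowSums(vote)   # row totals, equivalent to apply(vote, MARGIN=1, FUN=sum)
C_alt <- colSums(vote)   # column totals, equivalent to apply(vote, MARGIN=2, FUN=sum)
R_alt
C_alt
addmargins(vote)         # the table with a "Sum" row and column appended
```

For tables with more than two rows or columns, rowSums and colSums avoid having to list every row or column explicitly.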

Is there any association between gender and political party? To answer this, we need the expected counts for the table and Pearson’s chi-squared test statistic.

Exp_vote <- N * (R/N) %*% t(C/N)  # expected count under null (indep)
Xsq <- sum( (vote - Exp_vote)^2 / Exp_vote ) # Pearson chi-squared test stat
Xsq
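Under independence, the expected count in cell (i, j) is (row total i) x (column total j) / N. As a cross-check on the matrix product above, the same table can be sketched with outer(); the name E is only illustrative.

```r
# Rebuild the table so this block runs on its own
vote <- rbind(c(762, 468), c(484, 477))
rownames(vote) <- c("F", "M")
colnames(vote) <- c("Democrat", "Republican")

# E[i, j] = (row total i) * (column total j) / N
E <- outer(rowSums(vote), colSums(vote)) / sum(vote)
E

# The expected table keeps the observed row and column totals
rowSums(E)
colSums(E)

# Pearson chi-squared statistic from the observed and expected counts
sum((vote - E)^2 / E)
```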

The critical value for this hypothesis test for independence is:

df <- (nrow(vote)-1) * (ncol(vote)-1)  # degree-of-freedom
qchisq(0.95, df)                 # critical value @ 5% signif level

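Compare the test statistic against the critical value: if the statistic exceeds the critical value, the null hypothesis of independence is rejected at the 5% significance level. Are gender and party independent?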
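The comparison can also be phrased in terms of a p-value, the upper-tail probability of the chi-squared distribution beyond the observed statistic. The following is a minimal sketch; the names stat, dof, crit and p_value are illustrative and chosen so as not to overwrite the objects created above.

```r
# Rebuild the table so this block runs on its own
vote <- rbind(c(762, 468), c(484, 477))
rownames(vote) <- c("F", "M")
colnames(vote) <- c("Democrat", "Republican")

E    <- outer(rowSums(vote), colSums(vote)) / sum(vote)  # expected counts
stat <- sum((vote - E)^2 / E)                            # Pearson statistic
dof  <- (nrow(vote) - 1) * (ncol(vote) - 1)              # degrees of freedom
crit <- qchisq(0.95, dof)                                # 5% critical value

stat > crit                                        # TRUE means reject independence
p_value <- pchisq(stat, dof, lower.tail = FALSE)   # upper-tail probability
p_value
```

The two decision rules agree: the statistic exceeds the critical value exactly when the p-value is below 0.05.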

The above procedure can be performed in a single command.

Xsq <- chisq.test(vote, correct=FALSE)
Xsq                   # prints test summary
Xsq$observed          # observed counts (same as vote)
Xsq$expected          # expected counts under the null

The p-value is less than 5%, so we reject the null hypothesis of independence and conclude that there is evidence of an association between the two categorical variables.
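The individual pieces of the test summary can be pulled out of the object returned by chisq.test. The following is a minimal sketch; the name res is illustrative.

```r
# Rebuild the table so this block runs on its own
vote <- rbind(c(762, 468), c(484, 477))
rownames(vote) <- c("F", "M")
colnames(vote) <- c("Democrat", "Republican")

res <- chisq.test(vote, correct = FALSE)  # no continuity correction
res$statistic    # Pearson chi-squared statistic
res$parameter    # degrees of freedom
res$p.value      # p-value of the test
res$residuals    # Pearson residuals: (observed - expected) / sqrt(expected)
```

The residuals indicate which cells depart most from independence: a positive value means more observations than expected in that cell.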

2.1 Odds and log-odds

The odds and log-odds are alternative ways of presenting probabilities. For instance, the odds and log-odds of a US citizen voting Democrat, given their gender, are estimated from the table above by:

p_Dem_given_gender <- vote[,1]/R
odd_Dem_given_gender <- p_Dem_given_gender/(1-p_Dem_given_gender)
odd_Dem_given_gender       # odds of Dem given gender
log(odd_Dem_given_gender)  # log-odds

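Note that the calculation can be simplified to the number of voters for a particular party divided by the number who did not vote for that party, because the row total cancels in p/(1-p). With only two parties in the table, those who did not vote Republican are the Democrat voters, so the odds of voting Republican for each gender are: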

odd_Rep_given_gender <- vote[,2] / vote[,1]
odd_Rep_given_gender
log(odd_Rep_given_gender)
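With only two parties in the table, the two sets of odds are reciprocals of each other, so the log-odds differ only in sign. The following is a minimal sketch that checks this and converts odds back into probabilities; the names odds_dem, odds_rep and p_dem are illustrative.

```r
# Rebuild the table so this block runs on its own
vote <- rbind(c(762, 468), c(484, 477))
rownames(vote) <- c("F", "M")
colnames(vote) <- c("Democrat", "Republican")

odds_dem <- vote[, "Democrat"] / vote[, "Republican"]    # odds of Democrat by gender
odds_rep <- vote[, "Republican"] / vote[, "Democrat"]    # odds of Republican by gender

all.equal(odds_rep, 1 / odds_dem)           # odds are reciprocals
all.equal(log(odds_rep), -log(odds_dem))    # log-odds change sign

# Convert odds back to a probability: p = odds / (1 + odds)
p_dem <- odds_dem / (1 + odds_dem)
all.equal(p_dem, vote[, "Democrat"] / rowSums(vote))
```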

2.2 Odds ratio and log-odds ratio

What is the odds ratio of voting Democrat for a female citizen relative to a male citizen? In this case, the event of interest is a female citizen voting Democrat and the baseline event is a male citizen voting Democrat. The odds for both events are contained in odd_Dem_given_gender. The odds ratio is calculated by:

oddR_Dem_F_to_M <- odd_Dem_given_gender[1]/odd_Dem_given_gender[2]
oddR_Dem_F_to_M
log(oddR_Dem_F_to_M)
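For a 2x2 table, the odds ratio can equivalently be written as the cross-product of the cell counts, (a*d)/(b*c). The following is a minimal sketch that checks this against the ratio of the two odds; the names or_cross and or_ratio are illustrative.

```r
# Rebuild the table so this block runs on its own
vote <- rbind(c(762, 468), c(484, 477))
rownames(vote) <- c("F", "M")
colnames(vote) <- c("Democrat", "Republican")

# Cross-product form: (F-Dem * M-Rep) / (F-Rep * M-Dem)
or_cross <- (vote["F", "Democrat"] * vote["M", "Republican"]) /
            (vote["F", "Republican"] * vote["M", "Democrat"])

# Ratio of the two odds, as in the text
odds_dem <- vote[, "Democrat"] / vote[, "Republican"]
or_ratio <- unname(odds_dem["F"] / odds_dem["M"])

or_cross
all.equal(or_cross, or_ratio)
```

Swapping the baseline (male relative to female) gives the reciprocal odds ratio, and therefore a log-odds ratio of the same size with the opposite sign.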

The log-odds ratio is positive, so the female citizens in this study have higher odds of voting Democrat than the male citizens.

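Is this difference evidence of an association, or could it be due to sampling variation? To answer this, we evaluate the 95% confidence interval for the (log-)odds ratio, using the standard error of the log-odds ratio, which is the square root of the sum of the reciprocal cell counts.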

se <- sqrt(1/vote[1,1] + 1/vote[1,2] + 1/vote[2,1] + 1/vote[2,2])
LogOddR_int <- log(oddR_Dem_F_to_M) + c(-1,1)*1.96*se
LogOddR_int
exp(LogOddR_int)

The 95% confidence interval for the log-odds ratio does not contain 0, and likewise the 95% confidence interval for the odds ratio does not contain 1. So the association identified above is significant at the 5% level.
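The same conclusion can be reached with a Wald z-test on the log-odds ratio, which uses the same normal approximation as the interval. The following is a minimal sketch, assuming that approximation is adequate for these counts; it uses qnorm(0.975) in place of the rounded value 1.96, and the names log_or, se_log_or, z, p_value and ci are illustrative.

```r
# Rebuild the table so this block runs on its own
vote <- rbind(c(762, 468), c(484, 477))
rownames(vote) <- c("F", "M")
colnames(vote) <- c("Democrat", "Republican")

log_or    <- log((vote[1, 1] * vote[2, 2]) / (vote[1, 2] * vote[2, 1]))  # log-odds ratio, F vs M
se_log_or <- sqrt(sum(1 / vote))        # standard error: sqrt of summed reciprocal counts

z       <- log_or / se_log_or           # Wald statistic under H0: log-odds ratio = 0
p_value <- 2 * pnorm(-abs(z))           # two-sided p-value
z
p_value

ci <- log_or + c(-1, 1) * qnorm(0.975) * se_log_or   # 95% interval on the log scale
ci
exp(ci)                                               # 95% interval for the odds ratio
```

The interval excludes 0 exactly when the two-sided p-value is below 0.05, so the two approaches always give the same decision at the 5% level.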