Contingency tables using row or column proportions are especially useful for examining how two categorical variables are related. Segmented bar and mosaic plots provide a way to visualize the information in these tables.
A segmented bar plot is a graphical display of contingency table information. For example, a segmented bar plot representing Table 1.12 is shown in Figure LABEL:emailSpamNumberSegBar, where we have first created a bar plot using the number variable and then divided each group by the levels of spam. The column proportions of Table 1.12 have been translated into a standardized segmented bar plot in Figure LABEL:emailSpamNumberSegBarSta, which is a helpful visualization of the fraction of spam emails in each level of number.
R> tab=table(email[,c(1,21)])[2:1,]  # reverse the rows: spam first, then not spam
R> barplot(tab,legend.text=c("spam", "not spam"))  # legend labels follow the row order of tab
R> barplot(prop.table(tab,2))
Examine both of the segmented bar plots. Which is more useful?
Answer. Figure LABEL:emailSpamNumberSegBar contains more information, but Figure LABEL:emailSpamNumberSegBarSta presents the information more clearly. The second plot shows that emails with no numbers have a relatively high rate of spam: about 27%! On the other hand, fewer than 10% of emails with small or big numbers are spam. Since the proportion of spam changes across the groups in Figure LABEL:emailSpamNumberSegBarSta, we can conclude the variables are dependent, which is something we were also able to discern using table proportions. Because both the none and big groups have relatively few observations compared to the small group, the association is more difficult to see in Figure LABEL:emailSpamNumberSegBar. In some other cases, a segmented bar plot that is not standardized will be more useful in communicating important information. Before settling on a particular segmented bar plot, create standardized and non-standardized forms and decide which is more effective at communicating features of the data.
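The rates quoted in the answer are column proportions, and they can be checked directly. The sketch below is only illustrative: the counts are typed in by hand and are an assumption (the contingency table is not reproduced in this section), chosen to be consistent with the rates quoted above.

```r
# Hypothetical counts standing in for the spam-by-number contingency table.
# They are an assumption, chosen to agree with the rates quoted in the text.
counts <- matrix(c(400, 2659, 495,
                   149,  168,  50),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(spam = c("not spam", "spam"),
                                 number = c("none", "small", "big")))
tab <- as.table(counts)

# Column proportions: each column is divided by its own total, so the
# "spam" row gives the fraction of spam within each level of number.
col_prop <- prop.table(tab, margin = 2)
spam_rate <- round(col_prop["spam", ], 3)
print(spam_rate)
```

Under these assumed counts, the spam rate is roughly 27% for none, 6% for small, and 9% for big, which matches the heights read off the standardized segmented bar plot.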
A mosaic plot is a graphical display of contingency table information that is similar to a bar plot for one variable or a segmented bar plot when using two variables. Figure LABEL:emailNumberMosaic shows a mosaic plot for the number variable. Each column represents a level of number, and the column widths correspond to the proportion of emails of each number type. For instance, there are fewer emails with no numbers than emails with only small numbers, so the column for emails with no numbers is narrower. In general, a mosaic plot uses the area of each box to represent the number of observations in the corresponding category.
This one-variable mosaic plot is further divided into pieces in Figure LABEL:emailSpamNumberMosaic using the spam variable. Each column is split proportionally according to the fraction of emails that were spam in each number category. For example, the second column, representing emails with only small numbers, was divided into emails that were spam (lower) and not spam (upper). As another example, the bottom of the third column represents spam emails that had big numbers, and the upper part of the third column represents regular emails that had big numbers. We can again use this plot to see that the spam and number variables are associated since some columns are divided in different vertical locations than others, which was the same technique used for checking an association in the standardized version of the segmented bar plot.
R> tab=table(email[,c(1,21)]);row.names(tab)=c("not spam","spam")
R> mosaicplot(colSums(tab))
R> mosaicplot(t(tab))
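To make the geometry of the two-variable mosaic plot concrete, the sketch below computes the column widths and the vertical splits by hand and checks that width times height recovers each box's share of all emails. The counts are again hypothetical stand-ins chosen to agree with the rates quoted earlier, and the small gaps that mosaicplot draws between boxes are ignored.

```r
# Hypothetical counts (an assumption; the real table is not shown here).
tab <- as.table(matrix(c(400, 2659, 495,
                         149,  168,  50),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(spam = c("not spam", "spam"),
                                       number = c("none", "small", "big"))))

# Column widths: the share of all emails falling in each level of number.
width <- colSums(tab) / sum(tab)

# Vertical split inside each column: the share of spam / not spam
# within that level of number (column proportions).
height <- prop.table(tab, margin = 2)

# Box area = width x height, computed column by column.
area <- sweep(height, 2, width, `*`)

# Each area equals the box's share of all emails (the joint proportion).
joint <- tab / sum(tab)
print(round(area, 3))
```

Because the areas match the joint proportions, a box that looks twice as large as another represents about twice as many emails.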
In a similar way, a mosaic plot representing row proportions of Table 1.9 could be constructed, as shown in Figure LABEL:emailSpamNumberMosaicRev. However, because it is more insightful for this application to consider the fraction of spam in each category of the number variable, we prefer Figure LABEL:emailSpamNumberMosaic.
R> mosaicplot(tab)
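The two orientations answer different questions, which a short comparison of row and column proportions may make clearer. As before, the counts below are hypothetical and typed in by hand; they are an assumption rather than the table itself.

```r
# Hypothetical counts (an assumption; the real table is not shown here).
tab <- as.table(matrix(c(400, 2659, 495,
                         149,  168,  50),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(spam = c("not spam", "spam"),
                                       number = c("none", "small", "big"))))

# Row proportions: among spam (or non-spam) emails, how are they
# distributed across the levels of number? Each row sums to 1.
row_prop <- prop.table(tab, margin = 1)

# Column proportions: within each level of number, what fraction
# of emails is spam? Each column sums to 1.
col_prop <- prop.table(tab, margin = 2)

print(round(row_prop, 3))
print(round(col_prop, 3))
```

Under these assumed counts, about 41% of spam emails contain no numbers, whereas about 27% of emails with no numbers are spam. The second quantity is the one of interest when judging how the number variable relates to the chance that an email is spam.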