1.7 Considering categorical data

1.7.1 Contingency tables and bar plots

Table 1.9 summarizes two variables: spam and number. Recall that number is a categorical variable that describes whether an email contains no numbers, only small numbers (values under 1 million), or at least one big number (a value of 1 million or more). A table that summarizes data for two categorical variables in this way is called a contingency table. Each value in the table represents the number of times a particular combination of variable outcomes occurred. For example, the value 149 corresponds to the number of emails in the data set that are spam and had no number listed in the email. Row and column totals are also included. The row totals provide the total counts across each row (e.g. 149+168+50=367), and column totals are total counts down each column.

A table for a single variable is called a frequency table. Table 1.10 is a frequency table for the number variable. If we replaced the counts with percentages or proportions, the table would be called a relative frequency table.

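R> # Cross-tabulate spam by number; [2:1,] reverses the row order (spam first, as in Table 1.9)
R> tab <- table(email[,c("spam", "number")])[2:1,]
R> # Row totals, column totals and grand total
R> rowSums(tab); colSums(tab); sum(tab)
R> # Frequency table for the number variable (Table 1.10)
R> table(email[,"number"])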

                       number
                 none  small   big  Total
spam  spam        149    168    50    367
      not spam    400   2659   495   3554
      Total       549   2827   545   3921
Table 1.9: A contingency table for spam and number.

none  small   big  Total
 549   2827   545   3921
Table 1.10: A frequency table for the number variable.
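
The totals and the relative frequency table described above can be checked without the email data set itself. The following is a minimal sketch that assumes only the counts printed in Table 1.9; the object names (counts, tab, with_totals, freq_number, rel_freq_number) are illustrative and do not come from the original notes.

```r
# Counts copied from Table 1.9 (rows: spam status, columns: level of number).
counts <- matrix(
  c(149,  168,  50,
    400, 2659, 495),
  nrow = 2, byrow = TRUE,
  dimnames = list(spam = c("spam", "not spam"),
                  number = c("none", "small", "big"))
)
tab <- as.table(counts)

# Contingency table with row and column totals (R labels the margins "Sum").
with_totals <- addmargins(tab)
print(with_totals)

# Frequency table for number alone: the column totals of the contingency table.
freq_number <- margin.table(tab, margin = 2)
print(freq_number)

# Relative frequency table: each count divided by the grand total.
rel_freq_number <- prop.table(freq_number)
print(round(rel_freq_number, 3))
```

Here addmargins reproduces the Total row and column of Table 1.9, and margin.table recovers the counts of Table 1.10 from the two-way table.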

A bar plot is a common way to display a single categorical variable. The left panel of the accompanying figure shows a bar plot for the number variable. In the right panel, the counts are converted into proportions (e.g. 549/3921=0.140 for none), showing the proportion of observations that are in each level (i.e. in each category).

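R> # Bar plot of counts for the number variable (column 21 of email)
R> barplot(table(email[,"number"]))
R> # Bar plot of proportions: each count divided by the total
R> barplot(prop.table(table(email[,"number"])))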
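
To draw both panels side by side without loading the data, one possible sketch uses the counts from Table 1.10 directly. The names freq_number and prop_number and the axis labels are assumptions made for illustration, not part of the original notes.

```r
# Frequency table for number, with counts copied from Table 1.10.
freq_number <- c(none = 549, small = 2827, big = 545)

# Convert counts to proportions, e.g. 549 / 3921 for "none".
prop_number <- freq_number / sum(freq_number)

# Two panels in one row: counts on the left, proportions on the right.
old_par <- par(mfrow = c(1, 2))
barplot(freq_number, xlab = "number", ylab = "count")
barplot(prop_number, xlab = "number", ylab = "proportion")
par(old_par)  # restore the previous plotting layout
```

The two plots have the same shape; only the scale of the vertical axis changes when counts are replaced by proportions.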