Table 1.9 summarizes two variables: spam and number. Recall that number is a categorical variable that describes whether an email contains no numbers, only small numbers (values under 1 million), or at least one big number (a value of 1 million or more). A table that summarizes data for two categorical variables in this way is called a contingency table. Each value in the table represents the number of times a particular combination of variable outcomes occurred. For example, the value 149 corresponds to the number of emails in the data set that are spam and contain no numbers. Row and column totals are also included. The row totals provide the total counts across each row (e.g. 149 + 168 + 50 = 367), and the column totals provide the total counts down each column.
A table for a single variable is called a frequency table. Table 1.10 is a frequency table for the number variable. If we replaced the counts with percentages or proportions, the table would be called a relative frequency table.
R> tab=table(email[,c("spam", "number")])[2:1,]
R> rowSums(tab); colSums(tab); sum(tab)
R> table(email[,c("number")])
Table 1.9: Contingency table for spam and number.

spam \ number | none | small | big | Total
---|---|---|---|---
spam | 149 | 168 | 50 | 367
not spam | 400 | 2659 | 495 | 3554
Total | 549 | 2827 | 545 | 3921

Table 1.10: Frequency table for the number variable.

none | small | big | Total
---|---|---|---
549 | 2827 | 545 | 3921
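For readers without the email data set at hand, the following is a minimal sketch that rebuilds the contingency table from the counts printed in Table 1.9 and derives the totals, the frequency table for number, and the corresponding relative frequency table. The object names (`counts`, `with_totals`, `freq`, `rel_freq`) are chosen here for illustration and do not appear in the original code.

```r
# Counts copied from Table 1.9: rows are spam status, columns are levels of number.
counts <- as.table(matrix(
  c(149,  168,  50,
    400, 2659, 495),
  nrow = 2, byrow = TRUE,
  dimnames = list(spam   = c("spam", "not spam"),
                  number = c("none", "small", "big"))
))

# Contingency table with row and column totals appended (labelled "Sum" by R).
with_totals <- addmargins(counts)
print(with_totals)

# Frequency table for number alone: the column totals of the contingency table.
freq <- colSums(counts)
print(freq)

# Relative frequency table: each count divided by the overall total.
rel_freq <- freq / sum(freq)
print(round(rel_freq, 3))
```

The column totals reproduce Table 1.10, and the relative frequencies sum to 1 by construction.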
A bar plot is a common way to display a single categorical variable. The left panel of the accompanying figure shows a bar plot for the number variable. In the right panel, the counts are converted into proportions (e.g. 549/3921 = 0.140 for none), showing the proportion of observations that are in each level (i.e. in each category).
R> barplot(table(email$number))
R> barplot(prop.table(table(email$number)))
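The two panels can also be sketched directly from the counts in Table 1.10, without loading the email data. This is an illustrative example only: the vector names `freq` and `prop` and the axis labels are assumptions made here, not part of the original figure code.

```r
# Counts for the number variable, copied from Table 1.10.
freq <- c(none = 549, small = 2827, big = 545)

# Convert counts to proportions, e.g. 549 / 3921 for "none".
prop <- freq / sum(freq)

# Draw both panels side by side: counts on the left, proportions on the right.
old_par <- par(mfrow = c(1, 2))
barplot(freq, xlab = "number", ylab = "count")
barplot(prop, xlab = "number", ylab = "proportion")
par(old_par)  # restore the previous plotting layout
```

The two plots have identical shapes; only the scale of the vertical axis changes when counts are replaced by proportions.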