1.7.1 Contingency tables and bar plots

Table 1.9 summarizes two variables: spam and number. Recall that number is a categorical variable that describes whether an email contains no numbers, only small numbers (values under 1 million), or at least one big number (a value of 1 million or more). A table that summarizes data for two categorical variables in this way is called a contingency table. Each value in the table represents the number of times a particular combination of variable outcomes occurred. For example, the value 149 corresponds to the number of emails in the data set that are spam and had no number listed in the email. Row and column totals are also included. The row totals provide the total counts across each row (e.g. $149+168+50=367$ ), and column totals are total counts down each column.

A table for a single variable is called a frequency table. Table 1.10 is a frequency table for the number variable. If we replaced the counts with percentages or proportions, the table would be called a relative frequency table.

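R> # Table 1.9: cross-tabulate spam and number; [2:1,] reverses the row order to match the table
R> tab=table(email[,c("spam", "number")])[2:1,]
R> # Row totals, column totals, and grand total
R> rowSums(tab); colSums(tab); sum(tab)
R> # Table 1.10: frequency table for number alone
R> table(email[,"number"])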

		number
spam		none	small	big	Total
	spam	149	168	50	367
	not spam	400	2659	495	3554
	Total	549	2827	545	3921

Table 1.9: A contingency table for spam and number.
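
The totals in Table 1.9 can also be checked without the email data set by typing in the six counts by hand. The following is a minimal sketch, not the method used to produce the table: the object names `counts` and `with_totals` are assumptions made for this illustration, and `addmargins()` is the base R helper that appends totals (it labels them "Sum" by default).

```r
# Counts copied from Table 1.9 (rows: spam status, columns: number).
# The object name `counts` is hypothetical, chosen for this sketch.
counts <- matrix(c(149,  168,  50,
                   400, 2659, 495),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(spam   = c("spam", "not spam"),
                                 number = c("none", "small", "big")))
counts <- as.table(counts)

rowSums(counts)   # row totals, one per level of spam
colSums(counts)   # column totals, one per level of number
sum(counts)       # grand total

# Append the totals as an extra row and column (labelled "Sum")
with_totals <- addmargins(counts)
with_totals
```

The entry `with_totals["spam", "Sum"]` is the row total worked out in the text, $149+168+50=367$.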

none	small	big	Total
549	2827	545	3921

Table 1.10: A frequency table for the number variable.
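
To illustrate the relative frequency table mentioned above, the counts in Table 1.10 can be divided by their total. This is a sketch under the assumption that the counts are entered by hand; the names `number_counts` and `number_props` are hypothetical. Base R's `prop.table()` performs the same division.

```r
# Counts copied from Table 1.10; `number_counts` is a hypothetical name.
number_counts <- c(none = 549, small = 2827, big = 545)

# Divide each count by the total to get proportions that sum to 1
number_props <- number_counts / sum(number_counts)
round(number_props, 3)

# prop.table() gives the same proportions in one call
prop.table(number_counts)

# The same table expressed as percentages
round(100 * number_props, 1)
```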

A bar plot is a common way to display a single categorical variable. The left panel of the accompanying figure (emailNumberBarPlot) shows a bar plot of the counts for the number variable. In the right panel, the counts are converted into proportions (e.g. $549/3921=0.140$ for none), showing the proportion of observations that are in each level (i.e. in each category).

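R> # Left panel: counts for each level of number
R> barplot(table(email[,"number"]))
R> # Right panel: counts converted to proportions
R> barplot(prop.table(table(email[,"number"])))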
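
The two panels described above can also be drawn side by side from the counts in Table 1.10 alone. The sketch below assumes the counts are typed in by hand; the names `number_counts`, `old_par`, `mids_counts` and `mids_props`, along with the axis labels, are choices made for this illustration and not taken from the original figure.

```r
# Counts copied from Table 1.10; `number_counts` is a hypothetical name.
number_counts <- c(none = 549, small = 2827, big = 545)

# Put two plots in one row, remembering the previous settings
old_par <- par(mfrow = c(1, 2))

# Left panel: bar heights are the raw counts
mids_counts <- barplot(number_counts, xlab = "number", ylab = "count")

# Right panel: bar heights are proportions of the 3921 emails
mids_props <- barplot(number_counts / sum(number_counts),
                      xlab = "number", ylab = "proportion")

# Restore the previous plotting layout
par(old_par)
```

Only the vertical scale differs between the two panels; the relative heights of the bars are the same.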