1.7.2 Row and column proportions

Table 1.11 shows the row proportions for Table 1.9. The row proportions are computed as the counts divided by their row totals. The value 149 at the intersection of spam and none is replaced by 149/367=0.406, i.e. 149 divided by its row total, 367. So what does 0.406 represent? It corresponds to the proportion of spam emails in the sample that do not have any numbers.

R> g=table(email[,1], email[,21])[2:1,]; g/rep(rowSums(g),3)

           none             small             big              Total
spam       149/367=0.406    168/367=0.458     50/367=0.136     1.000
not spam   400/3554=0.113   2659/3554=0.748   495/3554=0.139   1.000
Total      549/3921=0.140   2827/3921=0.721   545/3921=0.139   1.000
Table 1.11: A contingency table with row proportions for the spam and number variables.
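As a minimal sketch, the same row proportions can be obtained with base R's prop.table(). The counts are entered by hand from the fractions in Table 1.11 instead of being read from the email data set, and the object names counts, row_props and total_row are assumptions made for illustration.

```r
# Counts of spam (rows) by number (columns), typed in from Table 1.11.
# The not spam row sums to 3554, the denominator used in that row.
counts <- matrix(c(149,  168,  50,
                   400, 2659, 495),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(spam   = c("spam", "not spam"),
                                 number = c("none", "small", "big")))

# margin = 1 divides every count by its row total.
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 3))

# The Total row of the table: column totals divided by the grand total.
total_row <- colSums(counts) / sum(counts)
print(round(total_row, 3))
```

Each row of row_props should sum to 1, which matches the Total column of Table 1.11.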

A contingency table of the column proportions is computed in a similar way, where each column proportion is the count divided by the corresponding column total. Table 1.12 shows such a table, and here the value 0.271 indicates that 27.1% of emails with no numbers were spam. This rate of spam is much higher than for emails with only small numbers (5.9%) or big numbers (9.2%). Because these spam rates vary between the three levels of number (none, small, big), this provides evidence that the spam and number variables are associated.

R> g/rep(colSums(g),rep(2,3))

           none            small             big             Total
spam       149/549=0.271   168/2827=0.059    50/545=0.092    367/3921=0.094
not spam   400/549=0.729   2659/2827=0.941   495/545=0.908   3554/3921=0.906
Total      1.000           1.000             1.000           1.000
Table 1.12: A contingency table with column proportions for the spam and number variables.
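The column proportions can be sketched in the same hedged way with prop.table() and margin = 2. Again the counts are typed in by hand from the fractions in Table 1.12, and the names counts, col_props, spam_rate and overall are hypothetical.

```r
# Counts of spam (rows) by number (columns), typed in from Table 1.12.
counts <- matrix(c(149,  168,  50,
                   400, 2659, 495),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(spam   = c("spam", "not spam"),
                                 number = c("none", "small", "big")))

# margin = 2 divides every count by its column total.
col_props <- prop.table(counts, margin = 2)
print(round(col_props, 3))

# Spam rate within each level of number (the first row of the table).
spam_rate <- col_props["spam", ]

# The Total column of the table: row totals divided by the grand total.
overall <- rowSums(counts) / sum(counts)
print(round(overall, 3))
```

Comparing the entries of spam_rate with one another is the comparison made in the text: if the spam rate were the same at every level of number, there would be no sign of an association.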

We could also have checked for an association between spam and number in Table 1.11 using row proportions. When comparing these row proportions, we would look down columns to see if the fraction of emails with no numbers, small numbers, and big numbers varied from spam to not spam.

Example 1.7.1

What does 0.458 represent in Table 1.11? What does 0.059 represent in Table 1.12?

Answer. 0.458 represents the proportion of spam emails that had a small number. 0.059 represents the fraction of emails with small numbers that are spam.

Example 1.7.2

What does 0.139 at the intersection of not spam and big represent in Table 1.11? What does 0.908 represent in Table 1.12?

Answer. 0.139 represents the fraction of non-spam email that had a big number. 0.908 represents the fraction of emails with big numbers that are non-spam emails.

Example 1.7.3

Data scientists use statistics to filter spam from incoming email messages. By noting specific characteristics of an email, a data scientist may be able to classify some emails as spam or not spam with high accuracy. One of those characteristics is whether the email contains no numbers, small numbers, or big numbers. Another characteristic is whether or not an email has any HTML content. A contingency table for the spam and format variables from the email data set is shown in Table 1.13. Recall that an HTML email is an email with the capacity for special formatting, e.g. bold text. In Table 1.13, which would be more helpful to someone hoping to classify email as spam or regular email: row or column proportions?

Answer. Such a person would be interested in how the proportion of spam changes within each email format. This corresponds to column proportions: the proportion of spam in plain text emails and the proportion of spam in HTML emails.

R> tab=table(email[,c(1, 16)])[2:1,]; colSums(tab); rowSums(tab)

If we generate the column proportions, we can see that a higher fraction of plain text emails are spam (209/1195=17.5%) than of HTML emails (158/2726=5.8%). This information on its own is insufficient to classify an email as spam or not spam, as over 80% of plain text emails are not spam. Yet, when we carefully combine this information with many other characteristics, such as number and other variables, we stand a reasonable chance of being able to classify some email as spam or not spam. This is a topic that is covered further in Math333: Statistical Models.

           text   HTML   Total
spam       209    158    367
not spam   986    2568   3554
Total      1195   2726   3921
Table 1.13: A contingency table for spam and format.
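As an illustrative sketch, both sets of proportions for Table 1.13 can be computed side by side to see that they answer different questions. The counts are typed in by hand from Table 1.13, and the names format_counts, col_props and row_props are assumptions.

```r
# Counts of spam (rows) by format (columns), typed in from Table 1.13.
format_counts <- matrix(c(209,  158,
                          986, 2568),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(spam   = c("spam", "not spam"),
                                        format = c("text", "HTML")))

# Column proportions: of the emails in each format, what fraction is spam?
col_props <- prop.table(format_counts, margin = 2)
print(round(col_props, 3))

# Row proportions: of the spam (or not spam) emails, what fraction is in each format?
row_props <- prop.table(format_counts, margin = 1)
print(round(row_props, 3))
```

The spam row of col_props gives the spam rate within each format (209/1195 and 158/2726), which is the quantity a classifier would care about; the corresponding entries of row_props (209/367 and 158/367) describe the formats of spam emails instead.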

Example 1.7.3 points out that row and column proportions are not equivalent. Before settling on one form for a table, it is important to consider each to ensure that the most useful table is constructed.

Example 1.7.4

Look back to Tables 1.11 and 1.12. Which would be more useful to someone hoping to identify spam emails using the number variable?

Answer. The column proportions in Table 1.12 will probably be most useful, as they make it easier to see that emails with small numbers are spam about 5.9% of the time (relatively rare). We would also see that about 27.1% of emails with no numbers are spam, and 9.2% of emails with big numbers are spam.