Table 1.11 shows the row proportions for Table 1.9. The row proportions are computed as the counts divided by their row totals. The value 149 at the intersection of spam and none is replaced by 149/367 = 0.406, i.e. 149 divided by its row total, 367. So what does 0.406 represent? It corresponds to the proportion of spam emails in the sample that do not have any numbers.
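R> g <- table(email[, 1], email[, 21])[2:1, ]  # spam (column 1) by number (column 21), spam row first
R> prop.table(g, 1)  # row proportions: each count divided by its row total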
| | none | small | big | Total |
|---|---|---|---|---|
| spam | 0.406 | 0.458 | 0.136 | 1.000 |
| not spam | 0.113 | 0.748 | 0.139 | 1.000 |
| Total | 0.140 | 0.721 | 0.139 | 1.000 |
A contingency table of the column proportions is computed in a similar way, where each column proportion is the count divided by the corresponding column total. Table 1.12 shows such a table, and here the value 0.271 indicates that 27.1% of emails with no numbers were spam. This spam rate is much higher than for emails with only small numbers (5.9%) or big numbers (9.2%). Because the spam rate varies across the three levels of number (none, small, big), this provides evidence that the spam and number variables are associated.
R> prop.table(g, 2)  # column proportions: each count divided by its column total
| | none | small | big | Total |
|---|---|---|---|---|
| spam | 0.271 | 0.059 | 0.092 | 0.094 |
| not spam | 0.729 | 0.941 | 0.908 | 0.906 |
| Total | 1.000 | 1.000 | 1.000 | 1.000 |
We could also have checked for an association between spam and number in Table 1.11 using row proportions. When comparing these row proportions, we would look down columns to see if the fraction of emails with no numbers, small numbers, and big numbers varied from spam to not spam.
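As a rough illustration of both comparisons, the sketch below computes row and column proportions, including the Total margins, from a small matrix of counts. The counts are assumed to be those of Table 1.9, which is not reproduced in this section (only 149 and the row total 367 are quoted above), and the object names `counts`, `with_totals`, `row_props` and `col_props` are illustrative only.

```r
# Spam status (rows) by number (columns).
# Assumption: these are the Table 1.9 counts; only 149 and the row
# total 367 are quoted in the surrounding text.
counts <- matrix(c(149,  168,  50,
                   400, 2659, 495),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(spam = c("spam", "not spam"),
                                 number = c("none", "small", "big")))

# Append a Total row and column (addmargins labels them "Sum").
with_totals <- addmargins(counts)

# Row proportions: divide every row by its row total.
row_props <- sweep(with_totals, 1, with_totals[, "Sum"], "/")

# Column proportions: divide every column by its column total.
col_props <- sweep(with_totals, 2, with_totals["Sum", ], "/")

round(row_props, 3)  # compare rows by looking down each column
round(col_props, 3)  # compare columns by looking across each row
```

If the spam and not spam rows of `row_props` differ noticeably within a column, or the columns of `col_props` differ noticeably within a row, that points to an association between the two variables.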
Answer. 0.458 represents the proportion of spam emails that had a small number. 0.059 represents the fraction of emails with small numbers that are spam.
Answer. 0.139 represents the fraction of non-spam email that had a big number. 0.908 represents the fraction of emails with big numbers that are non-spam emails.
Data scientists use statistics to filter spam from incoming email messages. By noting specific characteristics of an email, a data scientist may be able to classify some emails as spam or not spam with high accuracy. One of those characteristics is whether the email contains no numbers, small numbers, or big numbers. Another characteristic is whether or not an email has any HTML content. A contingency table for the spam and format variables from the email data set is shown in Table 1.13. Recall that an HTML email is an email with the capacity for special formatting, e.g. bold text. In Table 1.13, which would be more helpful to someone hoping to classify email as spam or regular email: row or column proportions?
Answer. Such a person would be interested in how the proportion of spam changes within each email format. This corresponds to column proportions: the proportion of spam in plain text emails and the proportion of spam in HTML emails.
R> tab <- table(email[, c(1, 16)])[2:1, ]  # spam (column 1) by format (column 16), spam row first
R> colSums(tab); rowSums(tab)  # column totals, then row totals
If we generate the column proportions, we can see that a higher fraction of plain text emails are spam (209/1195 = 17.5%) than HTML emails (158/2726 = 5.8%). This information on its own is insufficient to classify an email as spam or not spam, as over 80% of plain text emails are not spam. Yet, when we carefully combine this information with many other characteristics, such as number and other variables, we stand a reasonable chance of being able to classify some email as spam or not spam. This is a topic that is covered further in Math333: Statistical Models.
| | text | HTML | Total |
|---|---|---|---|
| spam | 209 | 158 | 367 |
| not spam | 986 | 2568 | 3554 |
| Total | 1195 | 2726 | 3921 |
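The column proportions for Table 1.13 can be checked with a short sketch like the one below. It enters the four counts from the table by hand rather than reading the email data set, and the names `tab`, `col_props` and `spam_rate` are illustrative only.

```r
# Counts from Table 1.13: spam status (rows) by email format (columns).
tab <- matrix(c(209,  158,
                986, 2568),
              nrow = 2, byrow = TRUE,
              dimnames = list(spam = c("spam", "not spam"),
                              format = c("text", "HTML")))

# Column proportions: within each format, the share of spam and not spam.
col_props <- prop.table(tab, margin = 2)
round(col_props, 3)

# Spam rate within each format, as a percentage to one decimal place.
spam_rate <- round(100 * col_props["spam", ], 1)
spam_rate
```

The spam rate for plain text comes out well above the rate for HTML, but most emails in both formats are still not spam, which is why format alone cannot classify an email.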
Example 1.7.3 points out that row and column proportions are not equivalent. Before settling on one form for a table, it is important to consider each to ensure that the most useful table is constructed.
Answer. The column proportions in Table 1.12 will probably be most useful, since they make it easier to see that emails with small numbers are spam about 5.9% of the time (relatively rare). We would also see that about 27.1% of emails with no numbers are spam, and 9.2% of emails with big numbers are spam.