Many analyses are motivated by a researcher looking for a relationship between two or more variables. A social scientist may like to answer some of the following questions:
(1) Is federal spending, on average, higher or lower in counties with high rates of poverty?
(2) If homeownership is lower than the national average in one county, will the percent of multi-unit structures in that county likely be above or below the national average?
(3) Which counties have a higher average income: those that enact one or more smoking bans or those that do not?
To answer these questions, data must be collected, such as the county data set shown in Table 1.5. Examining summary statistics could provide insights for each of the three questions about counties. Additionally, graphs can be used to visually summarize data and are useful for answering such questions as well.
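As a rough sketch of how a summary statistic might speak to the question about smoking bans and income, the snippet below compares average income between two groups of counties. The data frame `toy`, its values, and the column names `income` and `smoking_ban` are all made up for illustration; they are assumptions, not values or names taken from the county data set.

```r
# Hypothetical toy data: invented values, NOT from the county data set
toy <- data.frame(
  income      = c(41000, 38500, 52000, 47500, 36000, 55000),
  smoking_ban = c("none", "none", "partial", "none", "none", "comprehensive")
)

# Collapse the ban types into a yes/no indicator: at least one ban vs. none
toy$any_ban <- toy$smoking_ban != "none"

# Mean income within each group (FALSE = no ban, TRUE = at least one ban)
group_means <- tapply(toy$income, toy$any_ban, mean)
print(group_means)
```

A difference between the two group means would only describe these particular counties; it would not by itself explain why the difference exists.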
Scatterplots are one type of graph used to study the relationship between two numerical variables. Figure LABEL:county_fed_spendVsPoverty compares the variables fed_spend and poverty.
R> data(county)
R> plot(county[,6], county[,5], pch=20, cex=0.7, ylim=c(0,31.25), xlab="Poverty rate (percent)", ylab="Federal spending per capita")
Each point on the plot represents a single county. For instance, the highlighted dot corresponds to County 1088 in the county data set: Owsley County, Kentucky, which had a poverty rate of 41.5% and federal spending of $21.50 per capita. The scatterplot suggests a relationship between the two variables: counties with a high poverty rate also tend to have slightly more federal spending. We might brainstorm as to why this relationship exists and investigate each idea to determine which is the most reasonable explanation.
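One possible way to emphasize a single county on a scatterplot, as with the highlighted dot described above, is to draw the plot and then overlay that one point with `points()`. This is a minimal sketch on a hypothetical data frame `toy`: the last row reuses the poverty rate and federal spending quoted above for Owsley County, and every other value is invented for illustration.

```r
# Hypothetical toy data standing in for the poverty and fed_spend variables;
# only the last row (41.5, 21.5) comes from the text, the rest is invented
toy <- data.frame(
  poverty   = c(8.2, 12.5, 15.1, 19.8, 24.3, 41.5),
  fed_spend = c(6.1, 7.4, 9.0, 8.7, 11.2, 21.5)
)

# Row to emphasize (here, the county with the highest poverty rate)
i <- which.max(toy$poverty)

# Base scatterplot: one point per county
plot(toy$poverty, toy$fed_spend, pch = 20, cex = 0.7,
     xlab = "Poverty rate (percent)", ylab = "Federal spending per capita")

# Overlay the chosen county as a larger red circle
points(toy$poverty[i], toy$fed_spend[i], col = "red", cex = 1.5, lwd = 2)
```

With the real data, the row index of the county of interest would replace `i`.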
Examine the variables in the email50 data set, which are described in Table 1.4. Create two questions about the relationships between these variables that are of interest to you.
Answer. Two sample questions: (1) Intuition suggests that if there are many line breaks in an email then there also would tend to be many characters: does this hold true? (2) Is there a connection between whether an email format is plain text (versus HTML) and whether it is a spam message?
The fed_spend and poverty variables are said to be associated because the plot shows a discernible pattern. When two variables show some connection with one another, they are called associated variables. Associated variables can also be called dependent variables and vice-versa.
This example examines the relationship between homeownership and the percent of units in multi-unit structures (e.g. apartments, condos), which is visualized using a scatterplot in Figure LABEL:multiunitsVsOwnership. Are these variables associated?
R> plot(county[,8], county[,7], pch=20, cex=0.7, xlab="Percent of units in multi-unit structures", ylab="Homeownership rate (percent)")
Answer. It appears that the larger the fraction of units in multi-unit structures, the lower the homeownership rate. Since there is some relationship between the variables, they are associated.
Because there is a downward trend in Figure LABEL:multiunitsVsOwnership – counties with more units in multi-unit structures are associated with lower homeownership – these variables are said to be negatively associated. A positive association is shown in the relationship between the poverty and fed_spend variables represented in Figure LABEL:county_fed_spendVsPoverty, where counties with higher poverty rates tend to receive more federal spending per capita.
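Beyond reading the trend off a plot, one rough numerical check of the direction of an association is the sign of the sample correlation. The sketch below is an illustration under stated assumptions, not a method taken from this text: the data frame `toy` and all of its values are invented, and the helper `direction()` is hypothetical. The correlation coefficient reflects only linear trends, so it should be read alongside the scatterplot rather than in place of it.

```r
# Hypothetical toy data: invented values, NOT from the county data set
toy <- data.frame(
  multiunit     = c(5, 10, 20, 35, 50, 70),
  homeownership = c(82, 78, 71, 64, 55, 40),
  poverty       = c(8, 12, 15, 20, 24, 41),
  fed_spend     = c(6, 7, 9, 9, 11, 21)
)

# Pearson correlation: its sign gives the direction of a linear trend
r_neg <- cor(toy$multiunit, toy$homeownership)
r_pos <- cor(toy$poverty, toy$fed_spend)

# Hypothetical helper that turns the sign into a label
direction <- function(r) {
  if (r > 0) "positive" else if (r < 0) "negative" else "none"
}

print(direction(r_neg))
print(direction(r_pos))
```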
If two variables are not associated, then they are said to be independent. That is, two variables are independent if there is no evident relationship between the two.
Associated or independent, not both.
A pair of variables are either related in some way (associated) or not (independent). No pair of variables is both associated and independent.
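To get a feel for what "no evident relationship" can look like, the sketch below simulates two variables that are generated separately, so that neither carries information about the other. The data are simulated rather than taken from the county data set, and the variable names `x` and `y` are placeholders. The sample correlation should land near zero, though a correlation near zero in real data would not by itself establish independence.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Two variables generated separately from one another
# (simulated data, not from the county data set)
x <- rnorm(1000)
y <- rnorm(1000)

# The scatterplot should look like a shapeless cloud with no trend
plot(x, y, pch = 20, cex = 0.7)

# The sample correlation should be close to, but not exactly, zero
r <- cor(x, y)
print(r)
```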