1.6.2 Dot plots and the mean

Sometimes two variables is one too many: only one variable may be of interest. In these cases, a dot plot provides the most basic of displays. A dot plot is a one-variable scatterplot; an example using the number of characters from 50 emails is shown in Figure LABEL:emailCharactersDotPlot. A stacked version of this dot plot is shown in Figure LABEL:emailCharactersDotPlotStacked.

R> dotPlot(email50[,14],pch=20, ylim=c(0.95,1.05))
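The `dotPlot()` call above is assumed here to come from an add-on package (such as openintro) rather than base R. Where it is unavailable, a similar display can be sketched with base R's `stripchart()`. The vector `chars` below is made-up illustrative data, not the email50 values.

```r
# Hypothetical character counts (in thousands); NOT the email50 data.
chars <- c(0.5, 1.2, 1.2, 3.4, 3.4, 3.4, 7.0, 11.6, 15.8, 21.7)

# One-variable scatterplot: every point is drawn at the same height.
stripchart(chars, method = "overplot", pch = 20,
           xlab = "Number of characters (in thousands)")

# Stacked version: repeated values are piled on top of each other.
stripchart(chars, method = "stack", pch = 20, offset = 0.5,
           xlab = "Number of characters (in thousands)")

# Mark the sample mean with a filled triangle just below the points.
chars_mean <- mean(chars)
points(chars_mean, 0.9, pch = 17)
```

In the stacked version, the repeated values 1.2 and 3.4 appear as small columns of points, which makes it easier to see where observations pile up.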

The mean, sometimes called the average, is a common way to measure the centre of a distribution of data. To find the mean number of characters in the 50 emails, we add up all the character counts and divide by the number of emails. For computational convenience, the number of characters is listed in the thousands and rounded to the first decimal.

$\displaystyle\bar{x}=\frac{21.7+7.0+\cdots+15.8}{50}=11.6$ (1.1)

R> mean(email50[,14])

The sample mean is often labelled $\bar{x}$. The letter $x$ is being used as a generic placeholder for the variable of interest, num_char, and the bar over the $x$ indicates that an average has been taken; here, the average number of characters in the 50 emails was 11,600. It is useful to think of the mean as the balancing point of the distribution. The sample mean is shown as a triangle in Figures LABEL:emailCharactersDotPlot and LABEL:emailCharactersDotPlotStacked.

Mean. The sample mean of a numerical variable is computed as the sum of all of the observations divided by the number of observations: $\displaystyle\bar{x}=\frac{x_{1}+x_{2}+\cdots+x_{n}}{n}$ (1.2) where $x_{1},x_{2},\dots,x_{n}$ represent the $n$ observed values.
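As a minimal sketch of Equation (1.2), the mean can be computed by hand and compared with R's built-in `mean()`. The vector `x` holds made-up values chosen only for illustration. The last step checks the "balancing point" idea: the deviations of the observations from the mean should sum to zero, up to rounding error.

```r
# Hypothetical observations x_1, ..., x_n (made-up values for illustration).
x <- c(2.0, 4.5, 7.0, 10.5)

n <- length(x)        # number of observations
x_bar <- sum(x) / n   # Equation (1.2): sum of the observations divided by n

# The built-in mean() should agree with the hand computation.
agrees <- isTRUE(all.equal(x_bar, mean(x)))

# Balancing point: deviations below and above the mean cancel out.
deviation_sum <- sum(x - x_bar)
```

Here the observations sum to 24 and $n=4$, so `x_bar` is 6; the deviations $-4$, $-1.5$, $1$ and $4.5$ add up to zero.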

Example 1.6.4

Examine Equations (1.1) and (1.2) above. What does $x_{1}$ correspond to? And $x_{2}$ ? Can you infer a general meaning to what $x_{i}$ might represent?

Answer. $x_{1}$ corresponds to the number of characters in the first email in the sample (21.7, in thousands), $x_{2}$ to the number of characters in the second email (7.0, in thousands), and $x_{i}$ corresponds to the number of characters in the $i^{th}$ email in the data set.
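With the index $i$ running over the observations, Equation (1.2) can also be written compactly in summation notation. This is a standard shorthand and is added here only as an aside; it says exactly the same thing as (1.2):

$\displaystyle\bar{x}=\frac{x_{1}+x_{2}+\cdots+x_{n}}{n}=\frac{1}{n}\sum_{i=1}^{n}x_{i}$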

Example 1.6.5

What was $n$ in this sample of emails?

Answer. The sample size was $n=50$ .

R> length(email50[,14])

The email50 data set represents a sample from a larger population of emails that were received in January and March. We could compute a mean for this population in the same way as the sample mean; however, the population mean has a special label: $\mu$. The symbol $\mu$ is the Greek letter mu and represents the average of all observations in the population. Sometimes a subscript, such as ${}_{x}$, is used to indicate which variable the population mean refers to, e.g. $\mu_{x}$.

Example 1.6.6

The average number of characters across all emails can be estimated using the sample data. Based on the sample of 50 emails, what would be a reasonable estimate of $\mu_{x}$ , the mean number of characters in all emails in the email data set? (Recall that email50 is a sample from email.)

Answer. The sample mean, 11,600, may provide a reasonable estimate of $\mu_{x}$ . While this number will not be perfect, it provides a point estimate of the population mean. In Chapter 2.6 and beyond, we will develop tools to characterize the accuracy of point estimates, and we will find that point estimates based on larger samples tend to be more accurate than those based on smaller samples.
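The claim that larger samples tend to give more accurate point estimates can be explored with a small simulation. This is only a sketch under stated assumptions: the "population" below is simulated from an exponential distribution, not taken from the email data set, and the names `population`, `sample_means`, `err_50` and `err_500` are hypothetical.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical right-skewed "population" of character counts (in thousands).
population <- rexp(10000, rate = 1 / 10)
mu <- mean(population)  # population mean, known here only because we simulated it

# Draw many samples of a given size and record each sample mean.
sample_means <- function(n, reps = 1000) {
  replicate(reps, mean(sample(population, n)))
}

means_50  <- sample_means(50)
means_500 <- sample_means(500)

# Typical distance between the point estimate and mu for each sample size.
err_50  <- mean(abs(means_50  - mu))
err_500 <- mean(abs(means_500 - mu))
```

Comparing `err_50` with `err_500` shows how far a typical sample mean falls from `mu` at each sample size; the estimates based on 500 observations should cluster more tightly around the population mean.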

Example 1.6.7

We might like to compute the average income per person in the US. To do so, we might first think to take the mean of the per capita incomes across the 3,143 counties in the county data set. What would be a better approach?

Answer. The county data set is special in that each county actually represents many individual people. If we were to simply average across the income variable, we would be treating counties with 5,000 and 5,000,000 residents equally in the calculations. Instead, we should compute the total income for each county, add up all the counties’ totals, and then divide by the number of people in all the counties. If we completed these steps with the county data, we would find that the per capita income for the US is $27,348.43. Had we computed the simple mean of per capita income across counties, the result would have been just $22,504.70!

R> sum(county[,9]*county[,4])/sum(county[,4])
R> mean(county[,9])

Example 1.6.7 used what is called a weighted mean. This will not be a key topic in this course but is highlighted here for completeness.
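The difference between the simple mean and the weighted mean can be seen on a toy example. The two "counties" below are made-up numbers, not rows of the county data set; they echo the contrast between 5,000 and 5,000,000 residents from Example 1.6.7. Base R's `weighted.mean()` is used as a cross-check on the hand computation.

```r
# Hypothetical two-county example (made-up numbers, not the county data set).
pop        <- c(5000, 5000000)   # residents in each county
per_capita <- c(20000, 30000)    # per capita income in each county

# Simple mean: both counties count equally, regardless of size.
simple_mean <- mean(per_capita)

# Weighted mean: total income divided by total population.
total_income <- sum(per_capita * pop)
weighted_by_hand <- total_income / sum(pop)

# Base R's weighted.mean() performs the same calculation.
weighted_builtin <- weighted.mean(per_capita, w = pop)
```

Because almost all residents live in the larger county, the weighted mean lands very close to that county's figure of 30,000, while the simple mean sits halfway between the two at 25,000.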