1.6.4 Variance and standard deviation

The mean was introduced as a method to describe the centre of a data set, but the variability in the data is also important. Here, we introduce two measures of variability: the variance and the standard deviation. Both of these are very useful in data analysis, even though their formulas are a bit tedious to calculate by hand. The standard deviation is the easier of the two to understand, and it roughly describes how far away the typical observation is from the mean.

We call the distance of an observation from its mean its deviation. Below are the deviations for the 1st, 2nd, 3rd, and 50th observations in the num_char variable. For computational convenience, the number of characters is listed in the thousands and rounded to the first decimal.

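\[
\begin{aligned}
x_1 - \bar{x} &= 21.7 - 11.6 = 10.1 \\
x_2 - \bar{x} &= 7.0 - 11.6 = -4.6 \\
x_3 - \bar{x} &= 0.6 - 11.6 = -11.0 \\
&\;\;\vdots \\
x_{50} - \bar{x} &= 15.8 - 11.6 = 4.2
\end{aligned}
\]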

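If we square these deviations and then take an average, the result is about equal to the sample variance, denoted by $s^2$:

\[
\begin{aligned}
s^2 &= \frac{10.1^2 + (-4.6)^2 + (-11.0)^2 + \dots + 4.2^2}{50 - 1} \\
&= \frac{102.01 + 21.16 + 121.00 + \dots + 17.64}{49} \\
&= 172.44
\end{aligned}
\]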
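
As a rough check of the arithmetic, the four deviations listed above can be reproduced in R. This is only a sketch: the names `x_shown` and `x_bar` are introduced here for illustration, and the vector holds just the four observations printed in the text rather than all 50, so summing these squares does not give the sample variance.

```r
# The four observations shown in the text (thousands of characters).
# x_shown and x_bar are illustrative names, not part of the course code.
x_shown <- c(21.7, 7.0, 0.6, 15.8)
x_bar <- 11.6                    # sample mean quoted in the text

deviations <- x_shown - x_bar    # distance of each observation from the mean
squared <- deviations^2          # squaring removes signs and inflates large values

print(round(deviations, 1))
print(round(squared, 2))
```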

We divide by $n-1$, rather than by $n$, when computing the variance. You need not worry about this mathematical nuance for the material in this course; it will be explained further in Math235. Notice that squaring the deviations does two things. First, it makes large values much larger, seen by comparing $10.1^2$, $(-4.6)^2$, $(-11.0)^2$, and $4.2^2$. Second, it gets rid of any negative signs.

The standard deviation is defined as the square root of the variance:

\[
s = \sqrt{172.44} = 13.13.
\]

The standard deviation of the number of characters in an email is about 13.13 thousand. A subscript of $x$ may be added to the variance and standard deviation, i.e. $s_x^2$ and $s_x$, as a reminder that these are the variance and standard deviation of the observations represented by $x_1, x_2, \dots, x_n$. The $x$ subscript is usually omitted when it is clear which data the variance or standard deviation is referencing.



Variance and standard deviation
The sample variance, $s^2$, is roughly the average squared distance from the sample mean. The formal definition is

\[
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \qquad s = \sqrt{s^2}.
\]

The sample standard deviation, $s$, is the square root of the sample variance. The sample standard deviation is useful when considering how close the data are to the sample mean.
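
The definition can be checked against R's built-in functions. The sketch below uses a small made-up vector `x` (hypothetical data, chosen only so the arithmetic is easy to follow) and compares the result of the formula with `var()` and `sd()`.

```r
# Hypothetical data, chosen only for illustration.
x <- c(2, 4, 4, 5, 7, 8)
n <- length(x)

s2_manual <- sum((x - mean(x))^2) / (n - 1)  # sample variance from the definition
s_manual <- sqrt(s2_manual)                  # sample standard deviation

# Base R's var() and sd() use the same n - 1 divisor.
print(c(manual = s2_manual, builtin = var(x)))
print(c(manual = s_manual, builtin = sd(x)))
```

Here the squared deviations sum to 24, so the sample variance is 24/5 = 4.8.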

Formulas and methods used to compute the variance and standard deviation for a population are similar to those used for a sample. (The only difference is that the population variance has a division by $n$ instead of $n-1$.) However, like the mean, the population values have special symbols: $\sigma^2$ for the variance and $\sigma$ for the standard deviation. The symbol $\sigma$ is the Greek letter sigma.
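
To illustrate the difference in divisors, the sketch below treats a small made-up vector as an entire population. The helper `pop_var` is a name introduced here, not a base R function; base R's `var()` always divides by $n-1$.

```r
# Hypothetical helper: population variance, dividing by n instead of n - 1.
pop_var <- function(x) sum((x - mean(x))^2) / length(x)

x <- c(2, 4, 4, 5, 7, 8)   # made-up values, treated here as a whole population
n <- length(x)

sigma2 <- pop_var(x)       # population variance (divide by n)
s2 <- var(x)               # sample variance (divide by n - 1)

print(c(sigma2 = sigma2, s2 = s2))
# The two differ only by the factor (n - 1) / n.
print(isTRUE(all.equal(sigma2, s2 * (n - 1) / n)))
```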



TIP: standard deviation describes variability
Focus on the conceptual meaning of the standard deviation as a descriptor of variability rather than the formulas. Usually about 70% of the data will be within one standard deviation of the mean and about 95% will be within two standard deviations. However, as seen in Figures LABEL:sdAsRuleForEmailNumChar and LABEL:severalDiffDistWithSdOf1, these percentages are not strict rules.
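R> # mean and standard deviation of column 14 of email50, rounded to 1 decimal
R> m = round(mean(email50[,14]), 1); s = round(sd(email50[,14]), 1)
R> # number of observations within one standard deviation of the mean
R> sum((email50[,14] > (m-s)) & (email50[,14] < (m+s)))
R> # number of observations within two standard deviations of the mean
R> sum((email50[,14] > (m-2*s)) & (email50[,14] < (m+2*s)))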
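
The same counting idea can be wrapped in a small function and tried on bell-shaped data. This is a sketch under stated assumptions: `prop_within` is a hypothetical helper, and evenly spaced normal quantiles stand in for symmetric, bell-shaped data. Skewed data, such as email character counts, can give quite different proportions.

```r
# Hypothetical helper: share of observations strictly within k standard
# deviations of the mean.
prop_within <- function(x, k) {
  m <- mean(x)
  s <- sd(x)
  mean(abs(x - m) < k * s)
}

# Evenly spaced normal quantiles as a stand-in for bell-shaped data.
z <- qnorm(ppoints(1000))

print(prop_within(z, 1))   # share within one standard deviation
print(prop_within(z, 2))   # share within two standard deviations
```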

Example 1.6.12

On page 1.7, the concept of shape of a distribution was introduced. A good description of the shape of a distribution should include modality and whether the distribution is symmetric or skewed to one side. Using Figure LABEL:severalDiffDistWithSdOf1 as an example, explain why such a description is important.

R> # first simulate, then standardize to get mean 0, variance 1.
R> x1 = rep(0:1, c(10,10)); x1 = (x1-mean(x1))/sd(x1)
R> x2 = qnorm(seq(0.0025, 0.9975, 0.00049)); x2 = (x2-mean(x2))/sd(x2)
R> x3 = qchisq(seq(0.01, 0.98, 0.0005), 4); x3 = (x3-mean(x3))/sd(x3)
R> lim = c(-1,1)*max(c(x1,x2,x3)) # use the same limits for each plot
R> hist(x1, prob=TRUE, xlim=lim); hist(x2, prob=TRUE, xlim=lim); hist(x3, prob=TRUE, xlim=lim)
Answer. Figure LABEL:severalDiffDistWithSdOf1 shows three distributions that look quite different, but all have the same mean, variance, and standard deviation. Using modality, we can distinguish between the first plot (bimodal) and the last two (unimodal). Using skewness, we can distinguish between the last plot (right skewed) and the first two. While a picture, like a histogram, tells a more complete story, we can use modality and shape (symmetry/skew) to characterize basic information about a distribution.
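
The claim that the three data sets share the same mean and standard deviation can be checked numerically. The sketch below rebuilds the three vectors from the example and adds a simple moment-based skewness; `standardise` and `skew` are helper names introduced here, and this skewness formula is only one of several definitions in use.

```r
# Hypothetical helpers, defined here for illustration.
standardise <- function(x) (x - mean(x)) / sd(x)
skew <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5

# The three data sets from the example, standardised to mean 0 and sd 1.
x1 <- standardise(rep(0:1, c(10, 10)))
x2 <- standardise(qnorm(seq(0.0025, 0.9975, 0.00049)))
x3 <- standardise(qchisq(seq(0.01, 0.98, 0.0005), 4))

# One column per data set; rows are the mean, standard deviation and skewness.
summary_table <- sapply(
  list(x1 = x1, x2 = x2, x3 = x3),
  function(x) c(mean = mean(x), sd = sd(x), skew = skew(x))
)
print(round(summary_table, 3))
```

The means and standard deviations agree across the three columns, while the skewness separates the third data set from the first two. Modality still has to be read from the histograms.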

Example 1.6.13

Describe the distribution of the num_char variable using the histogram in Figure LABEL:email50NumCharHist. The description should incorporate the centre, variability, and shape of the distribution, and it should also be placed in context: the number of characters in emails. Also note any especially unusual cases.

Answer. The distribution of email character counts is unimodal and very strongly skewed to the high end. Many of the counts fall near the mean at 11,600, and most fall within one standard deviation (13,130) of the mean. There is one exceptionally long email with about 65,000 characters.

In practice, the variance and standard deviation are sometimes used as a means to an end, where the "end" is being able to accurately estimate the uncertainty associated with a sample statistic. For example, in Chapter 2.6 we will use the variance and standard deviation to assess how close the sample mean is to the population mean.
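
To express the exceptionally long email from Example 1.6.13 in units of standard deviations, one can divide its deviation by $s$. The sketch below uses only the rounded figures quoted in the text, so the result is approximate; the variable names are introduced here for illustration.

```r
x_bar <- 11.6     # sample mean, thousands of characters (from the text)
s <- 13.13        # sample standard deviation (from the text)
longest <- 65     # approximate length of the longest email (from the text)

# Number of standard deviations the longest email lies above the mean.
z <- (longest - x_bar) / s
print(round(z, 1))
```

The longest email sits roughly four standard deviations above the mean, which is consistent with describing it as an unusual case.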