The second reason we previously required a large sample size was so that we could accurately estimate the standard error using the sample data. When we must use a small sample to calculate the standard error, it is useful to rely on a new distribution for inference calculations: the t distribution. A t distribution, shown as a solid line in Figure LABEL:tDistCompareToNormalDist, has a bell shape. However, its tails are thicker than the normal model’s. This means observations are more likely to fall beyond two standard deviations from the mean than under the normal distribution. (The standard deviation of the t distribution is actually a little more than 1. However, it is useful to always think of the t distribution as having a standard deviation of 1 in all of our applications.) These extra thick tails are exactly the correction we need to resolve the problem of a poorly estimated standard error.
R> X <- seq(-5, 5, 0.01); Y <- dnorm(X)
R> plot(X, Y, type = 'l', lty = 3, lwd = 2.5)   # standard normal density, dotted line
R> Y <- dt(X, 2)
R> lines(X, Y, lwd = 1.8, col = 'blue')         # t density with 2 df, solid blue line
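To put a number on the thicker tails, the following sketch compares the probability of falling more than two units from zero under the standard normal model and under a few t distributions. The degrees of freedom used (2, 5 and 30) are illustrative choices of ours, not values taken from the figure.

```r
# Probability of falling more than 2 units from zero (both tails combined).
# The df values below are illustrative assumptions, not taken from the text.
normal_tail <- 2 * pnorm(-2)
df_values <- c(2, 5, 30)
t_tails <- 2 * pt(-2, df = df_values)   # pt is vectorised over df

tail_table <- data.frame(
  model = c("normal", paste0("t, df = ", df_values)),
  prob_beyond_2 = c(normal_tail, t_tails)
)
print(tail_table)
```

Each t probability is larger than the normal one, and the gap shrinks as the degrees of freedom grow.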
The t distribution, always centred at zero, has a single parameter: degrees of freedom. The degrees of freedom (df) describe the precise form of the bell-shaped t distribution. Several t distributions are shown in Figure LABEL:tDistConvergeToNormalDist. When there are more degrees of freedom, the t distribution looks very much like the standard normal distribution.
See the Moodle file for the code for the simulation.
Degrees of freedom (df)
The degrees of freedom describe the shape of the t distribution. The larger the degrees of freedom, the more closely the t distribution approximates the normal model.
When the degrees of freedom are about 30 or more, the t distribution is nearly indistinguishable from the normal distribution. In Section 3.3.3, we relate degrees of freedom to sample size.
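One rough way to see this convergence numerically is to measure the largest gap between the t density and the standard normal density over a grid of values. The sketch below does so; the grid and the particular degrees of freedom are assumptions made for illustration only.

```r
# Largest absolute gap between the t density and the standard normal
# density over a grid of x values, for several degrees of freedom.
# The grid and the df values are illustrative assumptions.
x <- seq(-5, 5, by = 0.01)
df_values <- c(1, 2, 4, 8, 30, 100)
max_gap <- sapply(df_values, function(df) max(abs(dt(x, df) - dnorm(x))))
names(max_gap) <- paste0("df=", df_values)
print(round(max_gap, 4))
```

The gap falls steadily as the degrees of freedom increase, and by 30 degrees of freedom the two curves differ by well under 0.01 everywhere on the grid.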
We will find it very useful to become familiar with the t distribution, because it plays a very similar role to the normal distribution during inference for small samples of numerical data. For such data we use the qt function in place of qnorm, and pt in place of pnorm. The main difference is that there is no single standard t distribution, so we always need to specify the degrees of freedom as well as our quantile or probability of interest.
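The parallel between the two pairs of functions can be sketched as follows. The value df = 10 is an arbitrary choice made for illustration.

```r
# Probabilities: pnorm needs only the quantile, pt also needs df.
p_normal <- pnorm(-2)
p_t      <- pt(-2, df = 10)   # df = 10 is an arbitrary illustrative value

# Quantiles: the 97.5th percentile under each model.
q_normal <- qnorm(0.975)
q_t      <- qt(0.975, df = 10)

cat("P(Z < -2) =", p_normal, " P(T < -2) =", p_t, "\n")
cat("97.5% quantile: normal =", q_normal, " t(10) =", q_t, "\n")
```

Because of the thicker tails, the t probability below -2 is larger than the normal one, and the t quantile lies further from zero than the normal quantile.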
What proportion of the t distribution with 18 degrees of freedom falls below -2.10?
Answer. Just like for the normal problems, we first draw the picture in Figure LABEL:tDistDF18LeftTail2Point10 and shade the area below -2.10. To find this area, we identify the number of degrees of freedom: df = 18. Then we compute pt(-2.10,df=18) in R, which gives approximately 0.025.
A t distribution with 20 degrees of freedom is shown in the left panel of Figure LABEL:tDistDF20RightTail1Point65. Estimate the proportion of the distribution falling above 1.65.
Answer. We identify the degrees of freedom: df = 20. Then we use R, remembering that we are looking for the upper tail: 1-pt(1.65,df=20), which gives approximately 0.057.
A t distribution with 2 degrees of freedom is shown in the right panel of Figure LABEL:tDistDF20RightTail1Point65. Estimate the proportion of the distribution falling more than 3 units from the mean (above or below).
Answer. As before, first identify the appropriate degrees of freedom: df = 2. Next, we use R to find the upper tail, 1-pt(3,df=2). Finally, because the distribution is symmetric and we want both tails (below -3 and above 3), we double the answer: 2*(1-pt(3,df=2))=0.09546597.
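The doubling step can be wrapped in a small function. In the sketch below, two_tail_t is a hypothetical helper name introduced here for illustration; it is not a built-in R function.

```r
# Two-tailed area beyond +/- q for a t distribution with the given df.
# `two_tail_t` is a hypothetical helper name, not a built-in R function.
two_tail_t <- function(q, df) {
  # By symmetry, the area below -|q| equals the area above |q|.
  2 * pt(-abs(q), df = df)
}

area <- two_tail_t(3, df = 2)
print(area)
```

Using the lower tail at -|q| avoids the subtraction from 1 and gives the same result as 2*(1-pt(3,df=2)).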
What proportion of the t distribution with 19 degrees of freedom falls above -1.79?
Answer. We find the shaded area above -1.79 (we leave the picture to you).
1-pt(-1.79,df=19).
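The same upper-tail area can be obtained in more than one way. The sketch below compares the complement used above with the lower.tail argument of pt and with the symmetry of the t distribution about zero.

```r
# Three equivalent ways to get the area above -1.79 with 19 df.
upper_by_complement <- 1 - pt(-1.79, df = 19)
upper_direct        <- pt(-1.79, df = 19, lower.tail = FALSE)
upper_by_symmetry   <- pt(1.79, df = 19)   # area above -q equals area below q

print(c(upper_by_complement, upper_direct, upper_by_symmetry))
```

All three values agree up to floating-point rounding, so the choice between them is a matter of which reads most clearly for the problem at hand.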