We would like to estimate the average difference in run times for men and women using the LonMar13Samp data set, which was a simple random sample of 65 men and 35 women from all runners in the 2013 London Marathon. Table 3.3 presents relevant summary statistics, and box plots of each sample are shown in the accompanying figure.
R> # assumes column 3 holds run times and column 4 the men/women grouping
R> boxplot(LonMar13Samp[,3] ~ LonMar13Samp[,4])
 | men | women |
$\bar{x}$ | 266.9038 | 285.7438 |
$s$ | 44.48394 | 57.26673 |
$n$ | 65 | 35 |
The two samples are independent of one another, so the data are not paired. Instead, a point estimate of the difference in average 26-mile times for men and women, $\mu_w - \mu_m$, can be found using the two sample means:
$$\bar{x}_w - \bar{x}_m = 285.74 - 266.90 = 18.84$$
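As a quick check of the arithmetic, the point estimate can be reproduced from the sample means in Table 3.3. This is only a sketch: the variable names are ours, and taking the difference as women minus men is an assumed convention rather than something the data set imposes.

```r
# Sample means copied from Table 3.3.
xbar_men   <- 266.9038
xbar_women <- 285.7438

# Point estimate of the difference in mean run times.
# Assumption: the difference is taken as women minus men.
point_estimate <- xbar_women - xbar_men
print(point_estimate)
```

Reversing the order of subtraction changes only the sign of the estimate, not its magnitude.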
Because we are examining two simple random samples from less than 10% of the population, each sample contains at least 30 observations, and neither distribution is strongly skewed, we can safely conclude the sampling distribution of each sample mean is nearly normal. Finally, because each sample is independent of the other (e.g. the data are not paired), we can conclude that the difference in sample means can be modelled using a normal distribution. (Probability theory guarantees that the difference of two independent normal random variables is also normal. Because each sample mean is nearly normal and observations in the samples are independent, we are assured the difference is also nearly normal.)
Conditions for normality of $\bar{x}_1 - \bar{x}_2$
If the sample means, $\bar{x}_1$ and $\bar{x}_2$, each meet the criteria for having nearly normal sampling distributions and the observations in the two samples are independent, then the difference in sample means, $\bar{x}_1 - \bar{x}_2$, will have a sampling distribution that is nearly normal.
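We can quantify the variability in the point estimate, $\bar{x}_w - \bar{x}_m$, using the following formula for its standard error:
$$SE_{\bar{x}_w - \bar{x}_m} = \sqrt{\frac{\sigma_w^2}{n_w} + \frac{\sigma_m^2}{n_m}}$$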
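We usually estimate this standard error using standard deviation estimates based on the samples:
$$SE_{\bar{x}_w - \bar{x}_m} \approx \sqrt{\frac{s_w^2}{n_w} + \frac{s_m^2}{n_m}} = \sqrt{\frac{57.27^2}{35} + \frac{44.48^2}{65}} \approx 11.14$$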
Because each sample has at least 30 observations ($n_w = 35$ and $n_m = 65$), this substitution using the sample standard deviation tends to be very good.
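The estimated standard error can be computed directly from the summary statistics in Table 3.3. The sketch below uses variable names of our own choosing and works from the summary values rather than from the LonMar13Samp data frame itself.

```r
# Sample standard deviations and sample sizes copied from Table 3.3.
s_men   <- 44.48394
n_men   <- 65
s_women <- 57.26673
n_women <- 35

# Estimated standard error of the difference in sample means:
# the square root of the sum of each sample variance over its sample size.
se_diff <- sqrt(s_women^2 / n_women + s_men^2 / n_men)
print(round(se_diff, 2))
```

Because the two variance terms are added, the result is the same whichever group is listed first.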
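Distribution of a difference of sample means
The sample difference of two means, $\bar{x}_1 - \bar{x}_2$, is nearly normal with mean $\mu_1 - \mu_2$ and estimated standard error
$$SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \tag{3.1}$$
when each sample mean is nearly normal and all observations are independent.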
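One informal way to see this result at work is a small simulation. The sketch below is purely illustrative and rests on two assumptions that are not part of the marathon analysis: the sample statistics from Table 3.3 are treated as if they were the true population values, and run times are drawn from normal populations. All variable names are hypothetical.

```r
# Illustrative simulation of the distribution of a difference of sample means.
# Assumption: Table 3.3 statistics stand in for the population parameters,
# and each population is normal. Neither is claimed by the text.
set.seed(1)

mu_w <- 285.7438; sigma_w <- 57.26673; n_w <- 35
mu_m <- 266.9038; sigma_m <- 44.48394; n_m <- 65

# Draw many pairs of independent samples and record the difference in means.
sim_diffs <- replicate(10000, {
  mean(rnorm(n_w, mu_w, sigma_w)) - mean(rnorm(n_m, mu_m, sigma_m))
})

# Theoretical centre and spread of the difference in sample means.
theory_mean <- mu_w - mu_m
theory_se   <- sqrt(sigma_w^2 / n_w + sigma_m^2 / n_m)

# Compare the simulated values with the theoretical ones.
print(c(simulated = mean(sim_diffs), theory = theory_mean))
print(c(simulated = sd(sim_diffs),   theory = theory_se))
```

The mean and standard deviation of the simulated differences should land close to the theoretical mean and standard error, and a histogram of `sim_diffs` (for example with `hist(sim_diffs)`) should look roughly bell-shaped.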