2.10 Examining the Central Limit Theorem

The normal model for the sample mean tends to be very good when the sample consists of at least 30 independent observations and the population data are not strongly skewed. The Central Limit Theorem provides the theory that allows us to make this assumption. The Central Limit Theorem is arguably the most important result for statistics. Certainly, without it most of the inference that is made in the world today would not be possible.



Central Limit Theorem: If $X_1, \ldots, X_n$ are independent and identically distributed random variables with $\mathbb{E}(X_i) = \mu$ and $\operatorname{Var}(X_i) = \sigma^2$, then
\[
  \sum_{i=1}^{n} X_i \sim N(n\mu,\, n\sigma^2), \qquad \bar{X} \sim N(\mu,\, \sigma^2/n),
\]
approximately (as $n \to \infty$), irrespective of the original distribution of the $X_i$.
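As a quick numerical illustration (an illustrative choice of distribution and sample size, not one taken from the figure below): if each $X_i \sim \mathrm{Uniform}(0,1)$, then $\mu = 1/2$ and $\sigma^2 = 1/12$, so for a sample of size $n = 30$ the theorem gives
\[
  \bar{X} \sim N\!\left(\frac{1}{2},\ \frac{1/12}{30}\right) = N\!\left(0.5,\ \tfrac{1}{360}\right)
\]
approximately, i.e. a standard error of $\sqrt{1/360} \approx 0.053$.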

The Central Limit Theorem is an asymptotic result: when the sample size is small, the normal approximation may not be very good, but as the sample size becomes large, the approximation improves. We will investigate three cases to see roughly when the approximation is reasonable.

We consider three data sets: one from a uniform distribution, one from an exponential distribution, and one from a log-normal distribution. These distributions are shown in the top panels of Figure LABEL:cltSimulations. The uniform distribution is symmetric, the exponential distribution may be considered moderately skewed since its right tail is relatively short (few outliers), and the log-normal distribution is strongly skewed and will tend to produce more apparent outliers.

See the Moodle file for the code for the simulation.
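The code itself is in the Moodle file and is not reproduced here. As a rough indication of what such a simulation involves, the Python sketch below draws repeated samples from uniform, exponential and log-normal populations and records the sample means; the distribution parameters, the number of repetitions, and every sample size other than n = 2 are illustrative assumptions rather than the settings used for the figure.

    import numpy as np

    rng = np.random.default_rng(1)
    n_reps = 10_000  # number of simulated samples per setting (illustrative)

    # Population distributions matching the three cases described above;
    # the parameters are illustrative assumptions.
    populations = {
        "uniform": lambda size: rng.uniform(0, 1, size),
        "exponential": lambda size: rng.exponential(1.0, size),
        "log-normal": lambda size: rng.lognormal(0.0, 1.0, size),
    }

    # n = 2 matches the row discussed below; the other sizes are illustrative.
    for n in (2, 5, 12, 30):
        for name, draw in populations.items():
            # Each of the n_reps rows is one simulated sample of size n;
            # its mean is one draw from the sampling distribution of x-bar.
            means = draw((n_reps, n)).mean(axis=1)
            print(f"n={n:3d}  {name:11s}  mean of x-bar = {means.mean():6.3f},"
                  f"  sd of x-bar = {means.std(ddof=1):6.3f}")
    # A histogram of `means` with a normal density overlaid reproduces the
    # kind of panel shown in the figure.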

The left panel in the n=2 row represents the sampling distribution of x¯ if it is the sample mean of two observations from the uniform distribution shown. The dashed line represents the closest approximation of the normal distribution. Similarly, the centre and right panels of the n=2 row represent the respective distributions of x¯ for data from exponential and log-normal distributions.

Example 2.10.1

Examine the distributions in each row of Figure LABEL:cltSimulations. What do you notice about the normal approximation for each sampling distribution as the sample size becomes larger?

Answer. The normal approximation becomes better as larger samples are used.

Example 2.10.2

Would the normal approximation be good in all applications where the sample size is at least 30?

Answer. Not necessarily. For example, the normal approximation for the log-normal example is questionable for a sample size of 30. Generally, the more skewed a population distribution, or the more frequently outliers occur, the larger the sample required for the distribution of the sample mean to be nearly normal.



TIP: With larger n, the sampling distribution of x¯ becomes more normal. As the sample size increases, the normal model for x¯ becomes more reasonable. We can also relax our condition on skew when the sample size is very large.

We discussed in Section 2.7.3 that the sample standard deviation, s, can be used as a substitute for the population standard deviation, σ, when computing the standard error. This estimate tends to be reasonable when n ≥ 30. We will encounter alternative distributions for smaller sample sizes in Chapters 3 and 4.
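To make the plug-in estimate concrete, here is a minimal Python sketch of the calculation SE(x¯) = s/√n; the sample is simulated purely so the snippet runs.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=2.0, size=40)  # hypothetical sample with n >= 30

    n = x.size
    s = x.std(ddof=1)      # sample standard deviation s (substitute for sigma)
    se = s / np.sqrt(n)    # plug-in standard error of the sample mean
    print(f"x-bar = {x.mean():.3f},  s = {s:.3f},  SE = {se:.3f}")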

Example 2.10.3

Figure LABEL:pokerProfitsCanApplyNormalToSampMean shows a histogram of 50 observations: the winnings and losses of a professional poker player over 50 consecutive days. Can the normal approximation be applied to the sample mean, 90.69?

Answer. We should consider each of the required conditions.

  • (1) These are referred to as time series data (explored further in the Math334 Modern Topics in Statistics course), because the data arrived in a particular sequence. If the player wins on one day, it may influence how she plays the next. To make the assumption of independence we should perform careful checks on such data (a sketch of this kind of check appears after this example). While the supporting analysis is not shown, no evidence was found to indicate the observations are not independent.

  • (2) The sample size is 50, satisfying the sample size condition.

  • (3) There are two outliers, one very extreme, which suggests the data are very strongly skewed or that very distant outliers may be common for this type of data. Outliers can play an important role, affecting both the distribution of the sample mean and the estimate of the standard error.

Since we should be sceptical of the independence of the observations, and the very extreme upper outlier poses a challenge, we should not use the normal model for the sample mean of these 50 observations. If we can obtain a much larger sample, perhaps several hundred observations, then the concerns about skew and outliers would no longer apply.
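The supporting analysis mentioned in parts (1) and (3) is not shown in the text. The Python sketch below indicates the kind of diagnostics one might run; the winnings array is a simulated placeholder, since the 50 daily values themselves are not reproduced here.

    import numpy as np

    rng = np.random.default_rng(0)
    # Placeholder data: the player's 50 daily results are not listed in the
    # text, so simulated values stand in purely to make the sketch runnable.
    winnings = rng.normal(loc=90, scale=1000, size=50)

    # (1) Independence check: lag-1 sample autocorrelation of the daily sequence.
    x = winnings - winnings.mean()
    lag1_autocorr = np.sum(x[:-1] * x[1:]) / np.sum(x**2)

    # (3) Skew/outlier check: sample skewness and observations beyond 3 SDs.
    s = winnings.std(ddof=1)
    skewness = np.mean((x / s) ** 3)
    n_extreme = int(np.sum(np.abs(x) > 3 * s))

    print(f"lag-1 autocorrelation     = {lag1_autocorr:.2f}")
    print(f"sample skewness           = {skewness:.2f}")
    print(f"observations beyond 3 SDs = {n_extreme}")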



Caution: Examine data structure when considering independence. Some data sets are collected in such a way that they have a natural underlying structure between observations, e.g. when observations occur consecutively. Be especially cautious about independence assumptions regarding such data sets.



Caution: Watch out for strong skew and outliers. Strong skew is often identified by the presence of clear outliers. If a data set has prominent outliers, or such observations are somewhat common for the type of data under study, then it is useful to collect a sample with many more than 30 observations if the normal model will be used for x¯. There are no simple guidelines for what sample size is big enough for all situations, so proceed with caution when working in the presence of strong skew or more extreme outliers.