2.5 Assessing the fit of a distribution

2.5.1 Q-Q plot

Example 2.3.8 suggests the distribution of heights of UK males is well approximated by the normal model. We are interested in proceeding under the assumption that the data are normally distributed, but first we must check to see if this is reasonable.

There are two visual methods for checking the assumption of normality, both of which can be implemented and interpreted quickly. The first is a simple histogram with the best-fitting normal curve overlaid on the plot, as shown in the left panel of Figure LABEL:fcidMHeights. The sample mean x¯ and standard deviation s are used as the parameters of the best-fitting normal curve. The closer this curve fits the histogram, the more reasonable the normal model assumption. The second, more common, method is to examine a Q-Q plot (also commonly called a quantile-quantile plot), shown in the right panel of Figure LABEL:fcidMHeights. The closer the points are to a perfect straight line, the more confident we can be that the data follow the specified model.

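In R we can use the qqplot function to draw a Q-Q plot for any two "data sets". As we often want to compare with a Normal distribution, for convenience there is the qqnorm function, which takes a single data set and plots its ordered values against the corresponding quantiles of a standard Normal distribution. If the data are Normal, the points lie close to the straight line whose intercept is the mean and whose slope is the standard deviation of the data; this is the line added with abline in the code below.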

R>data(Mheights)
R>hist(Mheights, xlim=c(152.4,203.2), ylim=c(0,.06), probability=TRUE)
R>x=seq(min(Mheights)-5, max(Mheights)+5, 0.01)
R>y=dnorm(x, mean(Mheights), sd=sd(Mheights))
R>lines(x, y, lwd=1.5)
R>qqnorm(Mheights); abline(mean(Mheights),sd(Mheights))
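To make explicit what qqnorm draws, the following is a minimal sketch that builds the same plot by hand. It uses simulated data in place of Mheights; the seed, the simulation parameters and the names heights, theo and samp are illustrative assumptions, not part of the course data. The ordered observations are plotted against standard Normal quantiles evaluated at the plotting positions returned by ppoints.

```r
# Simulated stand-in for the heights data (an assumption; not the Mheights data set)
set.seed(1)
heights <- rnorm(200, mean = 175, sd = 7)

n    <- length(heights)
p    <- ppoints(n)       # plotting positions in (0, 1), one per observation
theo <- qnorm(p)         # theoretical quantiles of the standard Normal
samp <- sort(heights)    # sample quantiles: the ordered data

# Hand-built Q-Q plot with the reference line (intercept = mean, slope = sd)
plot(theo, samp, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
abline(mean(heights), sd(heights))

# qqnorm computes the same coordinates, returned in the original data order
qq <- qqnorm(heights, plot.it = FALSE)
```

Sorting qq$x and qq$y should recover theo and samp, which gives a quick check that the two constructions agree.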

Example 2.5.1

Three data sets of 40, 100, and 400 observations were simulated from a normal distribution; the histograms and Q-Q plots of the data sets are shown in Figure LABEL:normalExamples. These provide a benchmark for what to look for in plots of real data. Describe these plots.

Answer. The left panels show the histogram (top) and Q-Q plot (bottom) for the simulated data set with 40 observations. The data set is too small to show clear structure in the histogram. The Q-Q plot reflects this too: there are some deviations from the line, although they are not strong.

The middle panels show diagnostic plots for the data set with 100 simulated observations. The histogram looks more normal and the Q-Q plot shows a better fit. While there is one observation that deviates noticeably from the line, it is not particularly extreme.

The data set with 400 observations has a histogram that greatly resembles the normal distribution, while the Q-Q plot is nearly a perfect straight line. Again there is one observation in the Q-Q plot (the largest) that deviates slightly from the line. If that observation had deviated 3 times further from the line, it would be of much greater concern in a real data set. Apparent outliers can occur in normally distributed data but they are rare.

Notice that the histograms look more normal as the sample size increases, and the Q-Q plots become straighter and more stable.
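A benchmark of this kind can be reproduced with a short simulation. The sketch below is one possible way to do it and is not the code used to produce the figure: the seed, the use of a standard Normal distribution and the names sizes and samples are assumptions. Each column of the plotting grid shows the histogram (top) and Q-Q plot (bottom) for one sample size.

```r
set.seed(42)                      # arbitrary seed (an assumption), for reproducibility
sizes   <- c(40, 100, 400)
samples <- lapply(sizes, rnorm)   # one standard Normal sample per sample size

op <- par(mfcol = c(2, 3))        # 2 rows, 3 columns, filled column by column
for (i in seq_along(sizes)) {
  z <- samples[[i]]
  # Top panel: histogram with the best-fitting Normal curve overlaid
  hist(z, probability = TRUE, main = paste("n =", sizes[i]), xlab = "")
  curve(dnorm(x, mean(z), sd(z)), add = TRUE, lwd = 1.5)
  # Bottom panel: Q-Q plot with the reference line
  qqnorm(z, main = "")
  abline(mean(z), sd(z))
}
par(op)                           # restore the previous graphics settings
```

Re-running with different seeds gives a feel for how much the plots vary by chance alone at each sample size.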

Example 2.5.2

Can we approximate poker winnings by a normal distribution? We consider the poker winnings of an individual over 50 days. A histogram and Q-Q plot of these data are shown in Figure LABEL:pokerNormal.

R>data(poker)
R>hist(poker[,1], probability=TRUE, xlim=c(-2000,4000))
R>x=seq(min(poker[,1])-3000, max(poker[,1])+3000, 0.01)
R>y=dnorm(x, mean(poker[,1]), sd=sd(poker[,1]))
R>lines(x, y, lwd=1.5)
R>qqnorm(poker[,1])
R>abline(mean(poker[,1]),sd(poker[,1]))

Answer. The histogram shows that the data are very strongly right skewed, which corresponds to the very strong deviations from the line in the upper right of the Q-Q plot. Compared with the sample of 40 normal observations in Example 2.5.1, these data clearly show very strong deviations from the normal model, so a normal approximation does not look reasonable here.
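The pattern that right skew produces in a Q-Q plot can be seen without the poker data. The sketch below uses an idealised right-skewed "sample" made of log-normal quantiles; this is purely a stand-in chosen for illustration (the distribution and the names skewed, theo, samp and dev are assumptions), not the poker winnings themselves. It measures how far selected points sit from the reference line.

```r
# Idealised right-skewed "sample": log-normal quantiles at 50 plotting positions.
# A stand-in chosen for illustration (an assumption); it is not the poker data.
n      <- 50
skewed <- qlnorm(ppoints(n))

theo <- qnorm(ppoints(n))   # standard Normal quantiles
samp <- sort(skewed)        # ordered observations

qqnorm(skewed)
abline(mean(skewed), sd(skewed))

# Vertical distance of each point from the reference line
dev <- samp - (mean(skewed) + sd(skewed) * theo)
dev[c(1, 25, n)]            # smallest, a middle, and largest observation
```

For this stand-in the points at both ends lie above the line and those in the middle lie below it, giving a curved rather than straight pattern, with the largest observation furthest from the line.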