One of the key underlying assumptions of the linear regression model is that the errors $\epsilon_i$ have a Normal distribution. In reality, we do not know the errors, so these are replaced with their estimates, the residuals $\hat\epsilon_i$. We compare these estimated residuals to their model distribution using graphical diagnostics.
PP (probability-probability) and QQ (quantile-quantile) plots can be used to check whether or not a sample of data can be considered to be a sample from a statistical model (usually a probability distribution). In the case of the normal linear regression model, they can be used to check whether or not the estimated residuals are a sample from a $\text{Normal}(0, \sigma^2)$ distribution. PP plots show the same information as QQ plots, but on a different scale.
The PP plot is most useful for checking that values around the average (the body) fit the proposed distribution. It compares the percentiles of the sample of data, predicted under the proposed model, to the percentiles obtained for a sample of the same size, predicted from the empirical distribution.
The QQ plot is most useful for checking whether the largest and smallest values (the tails) fit the proposed distribution. It compares the ordered sample of data to the quantiles obtained for a sample of the same size from the proposed model.
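First define the standardised residuals to be
$$ s_i = \frac{\hat\epsilon_i}{\hat\sigma}, \qquad i = 1, \ldots, n, $$
where $\hat\sigma$ is the estimated residual standard deviation.
From Math230, standardising by $\hat\sigma$ means that these should be a sample from a $\text{Normal}(0, 1)$ distribution.
Denote by $s_{(1)} \le \ldots \le s_{(n)}$ the ordered standardised residuals, so that $s_{(1)}$ is the smallest residual, and $s_{(n)}$ the largest. We compare the standardised residuals to the standard normal distribution, using
A PP plot,
$$ \left\{ \left( \frac{k}{n+1},\ \Phi\left(s_{(k)}\right) \right) \right\} $$
for $k = 1, \ldots, n$. Here $\Phi$ is the standard normal cumulative distribution function.
A QQ plot,
$$ \left\{ \left( \Phi^{-1}\left(\frac{k}{n+1}\right),\ s_{(k)} \right) \right\} $$
for $k = 1, \ldots, n$, where $\Phi^{-1}$ is the inverse of the standard normal cumulative distribution function.
If the standardised residuals follow a $\text{Normal}(0, 1)$ distribution perfectly, both plots lie on the line $y = x$. Because of random variation, even if the model is a good fit, the points won’t lie exactly on this line.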
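As a rough illustration of how the two sets of plotting coordinates can be built by hand, the R sketch below uses simulated standard Normal draws as a stand-in for the standardised residuals. The plotting positions k/(n+1) and all variable names (`s`, `p`, `pp_sample`, `qq_theory`) are assumptions made for this example.

```r
# Simulated stand-in for standardised residuals (assumption: not real data)
set.seed(1)
n <- 100
s <- sort(rnorm(n))            # ordered values s_(1), ..., s_(n)

p <- (1:n) / (n + 1)           # assumed plotting positions k / (n + 1)

pp_sample <- pnorm(s)          # PP plot: Phi(s_(k)) against k / (n + 1)
qq_theory <- qnorm(p)          # QQ plot: s_(k) against Phi^{-1}(k / (n + 1))

op <- par(mfrow = c(1, 2))
plot(p, pp_sample, main = "PP plot",
     xlab = "Theoretical probability", ylab = "Sample probability")
abline(0, 1)                   # reference line y = x
plot(qq_theory, s, main = "QQ plot",
     xlab = "Theoretical quantile", ylab = "Sample quantile")
abline(0, 1)                   # reference line y = x
par(op)
```

Because the draws really are standard Normal here, both sets of points should scatter closely around the reference line; re-running with a different seed gives a feel for how much random variation to expect.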
You have seen QQ plots before in Math104; in that setting, they were used to examine whether data could be considered to be a sample from a Normal distribution.
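In example 10.1.1 we fitted the following linear regression model to try to explain variability in (log) brain weight ($y_i$) using (log) body weight ($x_i$),
$$ y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad \epsilon_i \sim \text{Normal}(0, \sigma^2). $$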
We use R to create PP and QQ plots for the standardised residuals. First we refit the model in R to obtain the required residuals.
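A minimal sketch of this step, assuming the `mammals` data frame from the MASS package (columns `body` and `brain`) as a stand-in for the data of example 10.1.1. The object names `fit` and `res` are illustrative, not those used in the notes.

```r
# Assumption: MASS::mammals stands in for the data set used in the example
data(mammals, package = "MASS")

# Regress log brain weight on log body weight
fit <- lm(log(brain) ~ log(body), data = mammals)

# Estimated residuals, one per observation
res <- resid(fit)
head(res)
```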
Next we need the estimated residual variance, $\hat\sigma^2$, which we can then use to obtain the standardised residuals.
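One way this could be done is sketched below, again assuming the MASS `mammals` data as a stand-in for the example's data. The names `sigma2_hat` and `std_resid` are illustrative.

```r
# Assumption: MASS::mammals stands in for the data set used in the example
data(mammals, package = "MASS")
fit <- lm(log(brain) ~ log(body), data = mammals)

# Residual variance: residual sum of squares over residual degrees of freedom
sigma2_hat <- sum(resid(fit)^2) / df.residual(fit)

# Standardised residuals: divide by the estimated residual standard deviation
std_resid <- resid(fit) / sqrt(sigma2_hat)
summary(std_resid)
```

The same variance estimate is available as `summary(fit)$sigma^2`. R's built-in `rstandard()` also adjusts each residual for its leverage, so its output will not match `std_resid` exactly.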
R does not have an inbuilt function for creating a PP plot, but we can create one using the function qqplot.
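A hedged sketch of such a PP plot is given below. It assumes the MASS `mammals` data as a stand-in for the example's data and takes the theoretical probabilities to be k/(n+1); both choices, and the object names, are assumptions of this example.

```r
# Assumption: MASS::mammals stands in for the data set used in the example
data(mammals, package = "MASS")
fit <- lm(log(brain) ~ log(body), data = mammals)
std_resid <- resid(fit) / summary(fit)$sigma

n <- length(std_resid)

# qqplot() sorts both arguments and plots one against the other, so passing
# probabilities rather than quantiles turns it into a PP plot
pp <- qqplot((1:n) / (n + 1), pnorm(std_resid),
             xlab = "Theoretical probabilities",
             ylab = "Sample probabilities",
             main = "PP plot")
abline(0, 1)  # reference line y = x
```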
Since we are comparing the standardised residuals to the standard Normal distribution, we can use the function qqnorm for the QQ plot.
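A minimal sketch, again assuming the MASS `mammals` data as a stand-in for the example's data. Note that `qqnorm` chooses its theoretical quantiles via `ppoints()`, whose plotting positions can differ slightly from other conventions; the overall shape of the plot is unaffected.

```r
# Assumption: MASS::mammals stands in for the data set used in the example
data(mammals, package = "MASS")
fit <- lm(log(brain) ~ log(body), data = mammals)
std_resid <- resid(fit) / summary(fit)$sigma

# Theoretical standard Normal quantiles on the x-axis, sample quantiles on the y-axis
qq <- qqnorm(std_resid, main = "QQ plot of standardised residuals")
abline(0, 1)  # reference line y = x
```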
In general, a QQ plot is more useful than a PP plot, as it tells us about the more ‘unusual’ values (i.e. the very high and very low residuals). It is the behaviour of these values which is most likely to highlight a lack of model fit.
If the PP and QQ plots suggest that the residuals differ from the $\text{Normal}(0, 1)$ distribution in a systematic way, for example the points curve up (or down) and away from the 45° line at either (or both) of the tails, it may be more appropriate to
Transform your response, e.g. use the log or square root functions, before fitting the model; or
Use a different residual distribution. This is discussed in Math333 Statistical Models.
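To illustrate the effect a transformation can have, the sketch below compares QQ plots for a fit on the original scale with one on the log scale. It assumes the MASS `mammals` data as a stand-in for the example's data, and it logs the explanatory variable as well as the response, mirroring the brain weight example. The helper `std_res` is hypothetical.

```r
# Assumption: MASS::mammals stands in for the data set used in the example
data(mammals, package = "MASS")

# Hypothetical helper: residuals divided by the estimated residual sd
std_res <- function(fit) resid(fit) / summary(fit)$sigma

s_raw <- std_res(lm(brain ~ body, data = mammals))            # original scale
s_log <- std_res(lm(log(brain) ~ log(body), data = mammals))  # log scale

op <- par(mfrow = c(1, 2))
qqnorm(s_raw, main = "Untransformed")
abline(0, 1)
qqnorm(s_log, main = "Log-transformed")
abline(0, 1)
par(op)
```

If the transformation helps, the points in the second panel should sit noticeably closer to the reference line, particularly in the tails.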
A lack of normality might also be due to the residuals having non-constant variance, referred to as heteroscedasticity. This can be assessed by plotting the residuals against the explanatory variables included in the model to see whether there is evidence of variability increasing, or decreasing, with the value of the explanatory variable.
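A plot of this kind might be produced as follows, assuming the MASS `mammals` data as a stand-in for the example's data; the object names are illustrative.

```r
# Assumption: MASS::mammals stands in for the data set used in the example
data(mammals, package = "MASS")
fit <- lm(log(brain) ~ log(body), data = mammals)

x <- log(mammals$body)

# Residuals against the explanatory variable; look for a funnel shape,
# i.e. spread that grows or shrinks as x increases
plot(x, resid(fit),
     xlab = "log(body weight)", ylab = "Residual")
abline(h = 0, lty = 2)  # residuals should scatter evenly around zero
```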