In the simple linear regression model, the errors $\epsilon_i$ are assumed to follow a Normal distribution, $\epsilon_i \sim N(0, \sigma^2)$. In reality, we do not know the errors, and these are replaced with their estimates $\hat{\epsilon}_i = y_i - \hat{\mu}_i$. We can compare these estimates against the model distribution using graphical diagnostics. P-P (probability-probability) and Q-Q (quantile-quantile) plots are useful for checking whether a sample of data can be considered a sample from a statistical model (usually a probability distribution). In this case they can be used to check whether the residuals from a linear model are a sample from the $N(0, \sigma^2)$ distribution.
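We will create P-P and Q-Q plots of the estimated standardised residuals, which are defined as:
$$r_i = \frac{\hat{\epsilon}_i}{\hat{\sigma}} = \frac{y_i - \hat{\mu}_i}{\hat{\sigma}}, \qquad i = 1, \dots, n,$$
where $\hat{\sigma}^2 = \frac{1}{n - p} \sum_{i=1}^{n} \hat{\epsilon}_i^2$ is the estimated residual variance and $p$ is the number of parameters in the model.
Standardising by $\hat{\sigma}$ means that these should approximately be a sample from a $N(0, 1)$ distribution. Given these values, the residuals are then ordered from smallest to largest so that $r_{(1)} \le r_{(2)} \le \dots \le r_{(n)}$. The P-P plot is drawn by plotting the fitted cdf of the ordered residuals against the empirical cdf. That is the set of points:
$$\left\{ \left( \frac{i}{n + 1}, \; \Phi\left(r_{(i)}\right) \right) : i = 1, \dots, n \right\},$$
where $\Phi$ is the standard normal cdf. Both axes have a range of $[0, 1]$, as they represent probabilities. The Q-Q plot is drawn by plotting the fitted quantiles against the sample quantiles. That is the set of points:
$$\left\{ \left( \Phi^{-1}\left( \frac{i}{n + 1} \right), \; r_{(i)} \right) : i = 1, \dots, n \right\},$$
where $\Phi^{-1}$ is the standard normal quantile function. Both axes have a range of $(-\infty, \infty)$.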
If the standardised residuals follow a $N(0, 1)$ distribution perfectly, then the points in both the P-P plot and the Q-Q plot should lie on the line $y = x$. Even if the model is a good fit, the points will not lie exactly on this line because of random variation.
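To get a feel for how much scatter around the 1:1 line is due to random variation alone, we can simulate values that really do come from a $N(0, 1)$ distribution and measure how far the plotted points fall from the line. This is only a sketch on simulated data, not the birthweight data; the seed, the sample size and the names `ppDev` and `qqDev` are arbitrary choices made for illustration.

```r
## Simulated illustration: values that truly come from N(0, 1)
## still deviate from the 1:1 line.
set.seed(1)                         ##Arbitrary seed, for reproducibility
n <- 50                             ##Hypothetical sample size
r <- sort(rnorm(n))                 ##Ordered values r_(1) <= ... <= r_(n)
empP <- (1:n) / (n + 1)             ##Empirical probabilities i/(n+1)

ppDev <- max(abs(pnorm(r) - empP))  ##Largest vertical gap on the P-P plot
qqDev <- max(abs(r - qnorm(empP)))  ##Largest vertical gap on the Q-Q plot
print(c(PP = ppDev, QQ = qqDev))

plot(empP, pnorm(r), xlab = "i / (n + 1)", ylab = "pnorm(r)",
     main = "P-P plot of simulated N(0, 1) values")
abline(a = 0, b = 1, lty = 2)       ##1:1 line
```

Re-running with different seeds or sample sizes gives an informal sense of the deviations to expect when the model is correct. The gaps on the Q-Q plot are on the scale of the residuals, whereas those on the P-P plot are on the probability scale, so the two numbers are not directly comparable.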
Exercise 7.59
Draw the P-P and Q-Q plots for the fitted birthweight additive model with gestational age.
birthweight <- read.table("birthweight.dat")
Model <- glm(weight ~ 1 + age, family = gaussian, data = birthweight)

##Evaluate Standardised Residuals
n <- nrow(birthweight) ##Number of observations
p <- length(coef(Model)) ##Number of parameters
Residuals <- resid(Model) ##Residuals y-mu
sig2 <- sum(Residuals^2) / (n - p) ##Residual variance (37094)
stdResid <- Residuals / sqrt(sig2) ##Standardised residuals

##PP plot
sortedR <- sort(stdResid) ##Order standardised residuals
empP <- (1:n) / (n + 1) ##Empirical probabilities
plot(empP, pnorm(sortedR), xlab = "Theoretical Probabilities", ylab = "Sample Probabilities", main = "Normal P-P plot")
abline(a = 0, b = 1, lty = 2) ##1:1 line

##QQ plot
plot(qnorm(empP), sortedR, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles", main = "Normal Q-Q plot") ##or qqnorm(sortedR)
abline(a = 0, b = 1, lty = 2) ##1:1 line
For a model that fits the data well, the residuals should be independent of the fitted values $\hat{\mu}_i$ as well as the explanatory variables $x_i$.
##Fitted vs. residuals
plot(fitted(Model), resid(Model), xlab = "Fitted", ylab = "Residuals")

##Age vs. residuals
plot(birthweight$age, resid(Model), xlab = "Age", ylab = "Residuals")
Checking the residuals from a Poisson or logistic regression model is not as easy as in the simple linear regression case, as the distribution of the difference $y_i - \hat{\mu}_i$, for some fitted mean $\hat{\mu}_i$, is likely to be different for each observation and to be of a form that is not easy to understand.
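However, we note the following facts about Binomial and Poisson data. For Binomial data, $Y \sim \text{Binomial}(n, \theta)$, as $n$ increases then by the CLT:
$$\frac{Y - n\theta}{\sqrt{n\theta(1 - \theta)}} \xrightarrow{d} N(0, 1).$$
Also, for Poisson data, $Y \sim \text{Poisson}(\mu)$, as $\mu$ increases it is known that:
$$\frac{Y - \mu}{\sqrt{\mu}} \xrightarrow{d} N(0, 1).$$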
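The quality of the Normal approximation for Poisson data can be checked numerically. The sketch below compares the exact cdf of the standardised Poisson variable with the standard normal cdf at a few points, for increasing values of the mean. The evaluation points, the set of means and the helper name `approxError` are hypothetical choices made for illustration.

```r
## Compare P((Y - mu) / sqrt(mu) <= z), Y ~ Poisson(mu), with pnorm(z)
z <- seq(-2, 2, by = 0.5)                ##Hypothetical evaluation points

approxError <- function(mu) {
  ##Exact probability P(Y <= mu + z * sqrt(mu)) from the Poisson cdf
  exact <- ppois(mu + z * sqrt(mu), lambda = mu)
  max(abs(exact - pnorm(z)))             ##Largest gap to the N(0, 1) cdf
}

mus <- c(1, 10, 100, 1000)               ##Hypothetical Poisson means
errs <- sapply(mus, approxError)
names(errs) <- mus
print(round(errs, 3))                    ##Gap for each mean
```

The same check could be written for Binomial data by replacing `ppois` with `pbinom` and standardising by $\sqrt{n\theta(1 - \theta)}$.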
Using the fitted means $\hat{\mu}_i$, the Pearson residuals are defined as:
$$r^P_i = \frac{y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)}},$$
where $V(\cdot)$ is the variance function. These residuals are sometimes used as a measure of model adequacy.
Similar to normal linear models, these residuals could be used to check the adequacy of the models by plotting:
Pearson residuals against fitted values, $r^P_i$ vs. $\hat{\mu}_i$.
Pearson residuals against linear predictors, $r^P_i$ vs. $\hat{\eta}_i$.
Pearson residuals against explanatory variables, $r^P_i$ vs. $x_i$.
P-P and Q-Q plots of the Pearson residuals.
For the last case, care needs to be taken with small data sizes, as the distribution of the Pearson residuals may not yet be close to the standard normal distribution that arises in the asymptotic regimes for Binomial and Poisson data described above.
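The plots listed above can be sketched in R as follows. Since the data file is not reproduced here, the example uses simulated Poisson counts with a hypothetical time covariate; it is not the AIDS dataset, and the names `time`, `PoisModel` and `pearson` are illustrative. The Pearson residuals are computed by hand, using the Poisson variance function $V(\mu) = \mu$, and compared with those returned by `resid(..., type = "pearson")`.

```r
## Simulated count data (hypothetical; NOT the AIDS dataset)
set.seed(2)                                ##Arbitrary seed
time <- 1:20                               ##Hypothetical time periods
y <- rpois(length(time), lambda = exp(0.5 + 0.15 * time))
PoisModel <- glm(y ~ 1 + time, family = poisson)

mu <- fitted(PoisModel)                    ##Fitted means
eta <- predict(PoisModel, type = "link")   ##Linear predictors
pearson <- (y - mu) / sqrt(mu)             ##Pearson residuals, V(mu) = mu

##Check against R's built-in Pearson residuals
agree <- all.equal(unname(pearson),
                   unname(resid(PoisModel, type = "pearson")))
print(agree)

##Diagnostic plots
n <- length(pearson)
empP <- (1:n) / (n + 1)                    ##Empirical probabilities
par(mfrow = c(2, 2))
plot(mu, pearson, xlab = "Fitted", ylab = "Pearson residuals")
plot(eta, pearson, xlab = "Linear predictor", ylab = "Pearson residuals")
plot(time, pearson, xlab = "Time", ylab = "Pearson residuals")
plot(qnorm(empP), sort(pearson), xlab = "Theoretical Quantiles",
     ylab = "Sample Quantiles", main = "Normal Q-Q plot")
abline(a = 0, b = 1, lty = 2)              ##1:1 line
```

With only 20 observations and some small fitted means, the Q-Q plot here should be read with the caution noted above.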
The figure below presents these fit diagnostic plots for the Poisson regression applied to the AIDS dataset, using the additive model containing the time period explanatory variable.