7 Model inference 7.3 Analysis of deviance 8 Censoring

7.4 Assessment of model fit

In the simple linear regression model, the errors $\epsilon_{i}$ are assumed to follow a Normal distribution. In reality, we do not know the errors and these are replaces with their estimates $\hat{\epsilon}_{i}$ . We could compare these against the model distribution, $N(0,~{}\sigma^{2})$ using graphical diagnostics. P-P (probability-probability) and Q-Q (quantile-quantile) plots are useful for checking whether a sample of the data can be considered as a sample from a statistical model (usually a probability distribution). In this case they can be used to check whether the residuals from a linear model are a sample from the $N(0,~{}\sigma^{2})$ distribution.

7.4.1 P-P and Q-Q plot

We will create P-P and Q-Q plots of the estimated standardised residuals, which are defined as:

\displaystyle r_{i}=\frac{y_{i}-\hat{\mu}_{i}}{\hat{\sigma}}.

Standardising by $\hat{\sigma}$ means that these should be approximately be a sample from a $N(0,~{}1)$ distribution. Given these values, the residuals are then ordered from smallest to largest so that $r^{(1)}\leq r^{(2)}\leq\cdots\leq r^{(n)}$ . The P-P plot is drawn by plotting the fitted cdf of the ordered residuals against the empirical cdf. That is the set of points:

\displaystyle\left\{\Phi\left(r^{(i)}\right),~{}~{}\frac{i}{n+1}\right\},

where $\Phi(x)$ is the standard normal cdf. Both axes have a range of $[0,~{}1]$ , as they represent probabilities. The Q-Q plot is drawn by plotting the fitted quantiles against the sample quantiles. That is the set of points:

\displaystyle\left\{\Phi^{-1}\left(\frac{i}{n+1}\right),~{}~{}r^{(i)}\right\},

where $\Phi^{-1}(p)$ is the standard normal quantile function. Both axes have a range of $\mathbb{R}$ .

If the standardised residuals follow a $N(0,~{}1)$ distribution perfectly, then the points in both P-P plot and Q-Q plot should lie on the $y=x$ line. Even of the model is a good fit, the point will not lie exactly on this line because of random variation.

Exercise 7.59
Draw the P-P and Q-Q plots for the fitted birthweight additive model with gestational age.

birthweight <- read.table("birthweight.dat")
Model <- glm(weight ~ 1 + age, family = gaussian, data = birthweight)

##Evaluate Standardised Residuals
n <- nrow(birthweight)      ##Number of observations
p <- length(coef(Model))    ##Number of parameters
Residuals <- resid(Model)   ##Residuals y-mu
sig2 <- sum(Residuals^2) / (n - p)  ##Residual variance (37094)
stdResid <- Residuals / sqrt(sig2)  ##Standardised residuals

##PP plot
sortedR <- sort(stdResid)   ##Order standardised residuals
empP <- (1:n)/(n + 1)       ##Empirical probabilities
plot(empP, pnorm(sortedR), xlab = "Theoretical Probabilities",
    ylab = "Sample Probabilities", main = "Normal P-P plot")
abline(a = 0, b = 1, lty=2) ##1:1 line

##QQ plot
plot(qnorm(empP), sortedR, xlab = "Theoretical Quantiles",
    ylab = "Sample Quantiles", main = "Normal Q-Q plot") #or qqnorm(sortedR)
abline(a = 0, b = 1, lty=2) ##1:1 line

Figure 7.3: Link, Caption: P-P plot (left) and Q-Q plot (right) of the standardised residuals from the birthweight model against the Normal hypothesis.

For a model that fits the data well, the residuals $r_{i}$ should be independent of the fitted values $\hat{\mu}_{i}$ as well as the explanatory variables $x_{i}$ .

##Fitted vs. residuals
plot(fitted(Model), resid(Model), xlab = "Fitted", ylab = "Residuals")
##Age vs. residuals
plot(birthweight$age, resid(Model), xlab = "Age", ylab = "Residuals")

Figure 7.4: Link, Caption: Scatter plot of residuals against fitted values (left) and gestational age (right) from the birthweight model.

7.4.2 Checking fit of Poisson and Logistic Regression

Checking the residuals from a Poisson or Logistic regression model is not as easy as for the simple linear regression case as the distribution of the difference $y_{i}-\hat{\mu}_{i}$ , for some fitted mean $\hat{\mu}_{i}$ , is likely to different for all observations and be of some form that is not easy to understand.

However, we note the following cases about Binomial and Poisson data. For Binomial data, $Y\sim\mathrm{Binomial}(n,p)$ , as $n$ increases then by CLT:

\displaystyle\frac{Y-np}{\sqrt{np(1-p)}}\sim N(0,~{}1).

Also, for Poisson data, $Y\sim\mathrm{Poisson}(\lambda)$ , then as $\lambda$ increases it is known that:

\displaystyle\frac{Y-\lambda}{\sqrt{\lambda}}\sim N(0,~{}1).

Definition 7.4.1.

Using the fitted means $\hat{\mu}_{i}$ , the Pearson residuals is defined as:

\displaystyle r^{p}_{i}=\frac{y_{i}-\hat{\mu}_{i}}{\sqrt{v(\hat{\mu}_{i})}},

where $v(\mu)$ is the variance function. These residuals are sometimes used as a measure of model adequacy.

Similar to normal linear models, these residuals could be used to check the adequacy of the models by plotting:

•

Pearson residuals against fitted values, $r_{i}^{p}$ vs. $\hat{\mu}_{i}$ .
•

Pearson residuals against linear predictors, $r_{i}^{p}$ vs. $\hat{\eta}_{i}$ .
•

Pearson residuals against explanatory variables, $r_{i}^{p}$ vs. $x_{j,i}$ .
•

P-P and Q-Q plot of the Pearson residuals.

For the last case, care needs to be taken with small data sizes as it may not yet be close to a standard normal distribution that arises in the asymptotic regimes for Binomial and Poisson data as defined above.

The figure below presents these fit diagnostic plots for the Poisson regression applied to the AIDS dataset with additive model containing the time period explanatory variable.

Figure 7.5: Link, Caption: Plot of Pearson residuals against the fitted values (TL), linear predictor (TR) and time covariate (BL) for the Poisson regression model with the AIDS dataset. Q-Q plot (BR) of the Pearson residuals against the theoretical Normal hypothesis.