If there is not enough time to discuss all of the problems below in the workshop, please attempt the remaining ones on your own.
We continue to work with the data for the Childhood Respiratory Disease Study, introduced in Question Sheet 2. Consider three models for log FEV,
(0.1) | |||||
(0.2) | |||||
(0.3) |
As on Question Sheet 5, the sex covariate is an indicator function, taking the value 1 if individual $i$ is male, and 0 otherwise.
The parameter estimates for model 0.3 are .
By finding the vector of estimated residuals for this model, calculate the fitted SS.
Test the hypotheses
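For the residual and sum-of-squares calculation in this question, the following is a minimal R sketch rather than a model answer. It uses simulated stand-in data, not the study data. The variable names (`log_fev`, `age`, `male`), the choice of covariates and the coefficient values are all assumptions made for illustration and are not the specification of model 0.3. It also assumes the decomposition total SS = fitted SS + residual SS, with the total SS taken about the mean; check this against the definition in the notes.

```r
# Simulated stand-in data: names, covariates and coefficients are illustrative only
set.seed(1)
n <- 50
age <- runif(n, 3, 19)
male <- rbinom(n, 1, 0.5)          # indicator: 1 if male, 0 otherwise
log_fev <- 0.1 + 0.08 * age + 0.05 * male + rnorm(n, sd = 0.15)

fit <- lm(log_fev ~ age + male)

# Estimated residuals: observed response minus fitted value
res <- residuals(fit)

# Sums of squares (assumed convention: total SS taken about the mean)
residual_ss <- sum(res^2)
total_ss <- sum((log_fev - mean(log_fev))^2)
fitted_ss <- total_ss - residual_ss

c(total = total_ss, fitted = fitted_ss, residual = residual_ss)
```

If the notes define the total SS without centring, replace `total_ss` with `sum(log_fev^2)`; the rest of the calculation is unchanged.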
In this question we explore various diagnostics to assess the goodness-of-fit of a linear regression model.
Explain the difference between the estimated residual and the estimated standardised residual for an individual observation.
If the model fits well, what distribution should the estimated standardised residuals follow? What graphical tools can be used to test this?
Define the hat matrix. How does this matrix relate the vector of predicted values to the response vector?
The hat matrix is also related to the vector of estimated residuals. How?
What do these results tell us about the relationship between the estimated residuals and predicted values? How can this be used as a goodness-of-fit diagnostic?
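Once you have attempted the question, the relationships it asks about can be checked numerically. The R sketch below does this on simulated data, assuming a simple one-covariate model; the names `x`, `y`, `H`, `y_hat` and `e_hat` are illustrative and not notation from the notes. Note that R's `rstandard()` scales each residual by the estimated standard deviation times the square root of one minus the leverage. Terminology for scaled residuals varies between texts, so compare this with the definition used in the notes.

```r
# Simulated data for illustration only
set.seed(2)
n <- 40
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rnorm(n)
fit <- lm(y ~ x)

# Hat matrix built from the design matrix
X <- model.matrix(fit)
H <- X %*% solve(t(X) %*% X) %*% t(X)

# Predicted values and estimated residuals obtained from H and y alone
y_hat <- as.vector(H %*% y)
e_hat <- as.vector((diag(n) - H) %*% y)

# Both should agree with what lm() reports
all.equal(y_hat, unname(fitted(fit)))
all.equal(e_hat, unname(residuals(fit)))

# Inner product of predicted values and residuals (zero up to rounding error)
sum(y_hat * e_hat)

# Two common diagnostic plots
plot(y_hat, e_hat, xlab = "Predicted value", ylab = "Estimated residual")
abline(h = 0, lty = 2)
qqnorm(rstandard(fit))
qqline(rstandard(fit))
```

In a well-fitting model the first plot should show no systematic pattern around zero, and the points in the second should lie close to the reference line.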
We derive the sampling distribution of the estimated residuals. The strategy is very similar to that used to find the sampling distribution of the least squares estimator.
Write the vector of estimated residuals in terms of the vector of responses.
Explain why the vector of estimated residuals therefore has a multivariate Normal distribution.
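What is the expected value of the vector of responses?
Use your answer to part (i) and the linearity property of the expectation to obtain the expected value of the vector of estimated residuals.
What is the variance of the vector of estimated residuals? Hint: You might want to follow a similar procedure to that used in the notes to find the variance of the least squares estimator.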
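After deriving the sampling distribution by hand, a small simulation can serve as a sanity check. The R sketch below is one possible way to do this under assumed settings: a fixed design with five observations, an error variance of 1, and arbitrary coefficients. All names (`X`, `beta`, `Y`, `E`, `n_rep`) are illustrative. It repeatedly simulates responses from the linear model, computes the residual vector for each replicate, and compares the empirical mean and covariance with the values implied by the hat matrix.

```r
# Assumed settings for illustration: small fixed design, unit error variance
set.seed(3)
n <- 5
n_rep <- 20000
X <- cbind(1, c(1, 2, 4, 7, 9))
beta <- c(2, 0.5)
H <- X %*% solve(t(X) %*% X) %*% t(X)

# Each column of Y is one simulated response vector from the linear model
Y <- as.vector(X %*% beta) + matrix(rnorm(n * n_rep), nrow = n)

# Residual vectors for all replicates at once
E <- (diag(n) - H) %*% Y

# Empirical mean of each residual, to compare with your answer for the expectation
round(rowMeans(E), 3)

# Empirical covariance, to compare with your answer for the variance
emp_cov <- cov(t(E))
max(abs(emp_cov - (diag(n) - H)))
```

The last line reports the largest discrepancy between the simulated covariance and the identity matrix minus the hat matrix, which is the theoretical value when the error variance is 1. It should shrink as `n_rep` grows.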
We look at a number of diagnostics for model 0.3 in the previous question.
First we consider outliers.
Define the studentised residual for observation $i$.
Calculate the studentised residual for observation 1. Is there significant evidence that this observation is an outlier?
Define Cook’s distance for observation $i$. What is this used to show?
By using the R package car, look at the influence plot for the model fit.
Which observation has the greatest influence on the fit?
What is the effect on the model fit of removing this point from the data set?
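The outlier and influence diagnostics in this question can be computed along the lines of the R sketch below. It uses simulated data and a one-covariate model as a stand-in for model 0.3, so the data frame `dat`, its columns and all object names are assumptions. Be aware that naming conventions differ: R's `rstandard()` divides each residual by the estimated standard deviation times the square root of one minus the leverage, while `rstudent()` additionally re-estimates the standard deviation with the observation deleted. Check which of these matches the definition of the studentised residual in the notes.

```r
# Simulated stand-in data; replace with the study data and model 0.3
set.seed(4)
n <- 30
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 2 + 0.3 * dat$x + rnorm(n)
fit <- lm(y ~ x, data = dat)

h <- hatvalues(fit)        # leverages: diagonal of the hat matrix
p <- length(coef(fit))     # number of regression coefficients
s <- summary(fit)$sigma    # estimated residual standard deviation
e <- residuals(fit)

# Scaled residuals computed by hand; these match rstandard(fit)
r_int <- e / (s * sqrt(1 - h))

# Cook's distance computed by hand; this matches cooks.distance(fit)
cook <- r_int^2 * h / (p * (1 - h))

# Deletion-based residuals, compared with a t critical value on n - p - 1 df
t_ext <- rstudent(fit)
crit <- qt(0.975, df = n - p - 1)
which(abs(t_ext) > crit)

# Influence plot from the car package, if it is installed
if (requireNamespace("car", quietly = TRUE)) car::influencePlot(fit)

# Refit without the observation that has the largest Cook's distance
worst <- which.max(cook)
fit_drop <- lm(y ~ x, data = dat[-worst, ])
rbind(all = coef(fit), dropped = coef(fit_drop))
```

Comparing the two rows of the final table shows how much each coefficient changes when the most influential observation is removed.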