
MATH235 Week 6 - Workshop problems

If there is not enough time to discuss all of the problems below in the workshop, please have a go at the remaining problems on your own.

We continue to work with the data from the Childhood Respiratory Disease Study, introduced in Question Sheet 2. Consider three models for log FEV:

$\displaystyle\mathbb{E}[\log\mbox{FEV}_{i}]=\beta_{1}+\beta_{2}\mbox{height}_{i},\qquad(0.1)$

$\displaystyle\mathbb{E}[\log\mbox{FEV}_{i}]=\beta_{1}+\beta_{2}\mbox{height}_{i}+\beta_{3}\mbox{age}_{i},\qquad(0.2)$

$\displaystyle\mathbb{E}[\log\mbox{FEV}_{i}]=\beta_{1}+\beta_{2}\mbox{height}_{i}+\beta_{3}I[\mbox{male}_{i}].\qquad(0.3)$

As on Question Sheet 5, $I[\mbox{male}_{i}]$ is an indicator function, taking the value 1 if individual $i$ is male, and 0 otherwise.

WS6.1

(a)
- (i)
  
  Write down appropriate null and alternative hypotheses to test whether the best fitting model is model 0.1 or model 0.2.
- (ii)
  
  Write down the test statistic to test these hypotheses. You should define your notation.
(b)
Using the 20 observations in Table 2 to fit the models, the sums of squares (SS) for models 0.1 and 0.2 are 0.343 and 0.311 respectively.
- (i)
  
  Explain why you would expect the SS for model 0.1 to be greater than the SS for model 0.2.
- (ii)
  
  Test your hypotheses from part (a)(i) at the 5% level. You should give the value of your test statistic and the critical value.
(c)
The parameter estimates for model 0.3 are $\hat{\beta}=(-1.97,0.0463,0.0241)^{\prime}$.
- (i)
  
  By finding the vector of estimated residuals $\hat{\epsilon}$ for this model, calculate the fitted SS.
- (ii)
  
  Test the hypotheses
  
  $H_{0}:\mbox{Model 0.1 is the best fit}\quad\mbox{vs.}\quad H_{1}:\mbox{Model 0.3 is the best fit}.$
(d)

Explain why you cannot compare models 0.2 and 0.3 using the $F$-test.
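
The arithmetic behind the nested-model test in parts (a) and (b) can be sketched as follows. This is only an illustration, not a model answer: it assumes that "SS" on this sheet means the residual sum of squares of each fitted model, and that the test statistic takes the usual nested-model form, with the drop in SS per extra parameter divided by the larger model's SS per residual degree of freedom. The function name `f_statistic` and its argument names are hypothetical.

```python
# F-test arithmetic for two nested linear models (a sketch, not a model answer).
# Assumption: "SS" means the residual sum of squares of each fitted model.

def f_statistic(ss_small, ss_big, n, p_small, p_big):
    """F statistic for H0: smaller model versus H1: larger model that contains it."""
    numerator = (ss_small - ss_big) / (p_big - p_small)   # drop in SS per extra parameter
    denominator = ss_big / (n - p_big)                    # residual variance estimate, larger model
    return numerator / denominator

# Values quoted in part (b): n = 20; model 0.1 has 2 parameters, model 0.2 has 3.
F = f_statistic(ss_small=0.343, ss_big=0.311, n=20, p_small=2, p_big=3)
df = (3 - 2, 20 - 3)   # (numerator, denominator) degrees of freedom

print(f"F = {F:.3f} on {df} degrees of freedom")
```

Under these assumptions the statistic is compared with the upper 5% point of the $F_{1,17}$ distribution, which statistical tables give as roughly 4.45 (in R, `qf(0.95, 1, 17)`).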

WS6.2

In this question we explore various diagnostics to assess the goodness-of-fit of a linear regression model.

(a)
- (i)
  
  Explain the difference between the estimated residual $\hat{\epsilon}_{i}$ and the estimated standardised residual $\hat{r}_{i}$.
- (ii)
  
  If the model fits well, what distribution should the estimated standardised residuals follow? What graphical tools can be used to test this?
(b)
- (i)
  
  Define the hat matrix $H$. How does this matrix relate the predicted values $\hat{\mu}$ to the response vector $\mathbf{Y}$?
- (ii)
  
  The hat matrix is also related to the vector of estimated residuals $\hat{\epsilon}$. How?
- (iii)
  
  What do these results tell us about the relationship between the estimated residuals and predicted values? How can this be used as a goodness-of-fit diagnostic?
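
A small numerical check can make the relationships in part (b) concrete. The sketch below is an illustration under stated assumptions: it uses a simple linear regression (intercept plus one covariate), for which the hat matrix has the closed form $h_{ij}=1/n+(x_{i}-\bar{x})(x_{j}-\bar{x})/S_{xx}$, and the data are hypothetical, not taken from the study. The helper names `hat_matrix_simple` and `matvec` are invented for this example.

```python
# Numerical check of hat-matrix properties for simple linear regression.
# Assumptions: design matrix with columns (1, x); data below are hypothetical.

def hat_matrix_simple(x):
    """Hat matrix for the model E[Y_i] = b1 + b2 * x_i, via its closed form."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [[1 / n + (xi - x_bar) * (xj - x_bar) / sxx for xj in x] for xi in x]

def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical covariate values
y = [1.2, 1.9, 3.1, 3.9, 5.2]   # hypothetical responses

H = hat_matrix_simple(x)
fitted = matvec(H, y)                          # predicted values: H applied to y
resid = [yi - mi for yi, mi in zip(y, fitted)]  # residuals: y minus H y
HH_y = matvec(H, fitted)                       # applying H a second time
inner = sum(e * m for e, m in zip(resid, fitted))

print("residuals . fitted =", round(inner, 12))
```

Here `HH_y` matches `fitted` up to rounding, and the inner product of the residuals with the fitted values is zero up to rounding, which is why a plot of residuals against fitted values should show no systematic pattern when the model fits.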

WS6.3

We derive the sampling distribution of the estimated residuals. The strategy is very similar to that used to find the sampling distribution of the least squares estimator $\hat{\beta}$.

(a)
- (i)
  
  Write the vector of estimated residuals $\hat{\epsilon}$ in terms of the vector of responses $\mathbf{Y}$.
- (ii)
  
  Explain why $\hat{\epsilon}$ therefore has a multivariate Normal distribution.
(b)
- (i)
  
  What is the expected value of $\mathbf{Y}$?
- (ii)
  
  Use your answer to part (i) and the linearity property of the expectation to obtain $\mathbb{E}[\hat{\epsilon}]$.
(c)

What is $\mbox{Var}(\hat{\epsilon})$? Hint: you might want to follow a similar procedure to that used in the notes to find $\mbox{Var}(\hat{\beta})$.
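
As a hedged outline of where this argument leads, assume the standard linear model $\mathbf{Y}\sim\mbox{MVN}(X\beta,\sigma^{2}I)$ with design matrix $X$, and write $H=X(X^{\prime}X)^{-1}X^{\prime}$ for the hat matrix from WS6.2. The symbols $X$, $\sigma^{2}$ and $I$ are not defined on this sheet and are taken here to have their usual meaning from the notes. The steps use only $HX=X$ and the fact that $I-H$ is symmetric and idempotent:

$\displaystyle\hat{\epsilon}=\mathbf{Y}-H\mathbf{Y}=(I-H)\mathbf{Y},$

$\displaystyle\mathbb{E}[\hat{\epsilon}]=(I-H)\mathbb{E}[\mathbf{Y}]=(I-H)X\beta=X\beta-X\beta=0,$

$\displaystyle\mbox{Var}(\hat{\epsilon})=(I-H)\,\sigma^{2}I\,(I-H)^{\prime}=\sigma^{2}(I-H)(I-H)=\sigma^{2}(I-H).$

Because $\hat{\epsilon}$ is a linear transformation of a multivariate Normal vector, it is itself multivariate Normal with this mean and variance.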

WS6.4

We look at a number of diagnostics for model 0.3 from WS6.1.

(a)
First we consider outliers.
- (i)
  
  Define the studentised residual for observation $i$.
- (ii)
  
  Calculate the studentised residual for observation 1. Is there significant evidence that this observation is an outlier?
(b)

Define Cook’s distance for observation $i$. What is this used to show?
(c)
Using the R package car, look at the influence plot for the model fit.
- (i)
  
  Which observation has the greatest influence on the fit?
- (ii)
  
  What is the effect on the model fit of removing this point from the data set?
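
The quantities in this question can be computed by hand for a small example. The sketch below makes several assumptions that should be checked against the notes: it uses a simple linear regression with hypothetical data rather than model 0.3, it takes the studentised residual to be $\hat{\epsilon}_{i}/(\hat{\sigma}\sqrt{1-h_{ii}})$ (some texts instead use a leave-one-out variance estimate), and it takes Cook's distance to be $D_{i}=s_{i}^{2}h_{ii}/\{p(1-h_{ii})\}$, where $s_{i}$ is that residual and $p$ is the number of parameters. The function name `simple_regression_diagnostics` is invented for this example.

```python
import math

# Leverage, studentised residuals and Cook's distance for simple linear regression.
# Assumptions: internally studentised residuals; data below are hypothetical.

def simple_regression_diagnostics(x, y):
    n = len(x)
    p = 2                                   # parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma2_hat = sum(e ** 2 for e in resid) / (n - p)            # residual variance estimate
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]      # diagonal of the hat matrix
    student = [e / math.sqrt(sigma2_hat * (1 - h)) for e, h in zip(resid, leverage)]
    cooks = [s ** 2 * h / (p * (1 - h)) for s, h in zip(student, leverage)]
    return resid, leverage, student, cooks

x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]   # hypothetical; the last point is far from the rest
y = [1.1, 1.9, 3.2, 3.8, 5.1, 7.0]    # hypothetical responses

resid, leverage, student, cooks = simple_regression_diagnostics(x, y)
for i, (h, s, d) in enumerate(zip(leverage, student, cooks), start=1):
    print(f"obs {i}: leverage={h:.3f}  studentised={s:.3f}  Cook's D={d:.3f}")
```

In R, `rstandard()` returns residuals of the form used here, `rstudent()` returns the leave-one-out version, `cooks.distance()` returns Cook's distance, and `influencePlot()` from the car package displays these diagnostics together.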