MATH235

MATH235 Week 5 - Assessed problems (coursework)

Submission is due on Tuesday in Week 6.

CW5.1 

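Consider the linear regression models

$$\mathbb{E}[Y_i] = \beta_1 + \beta_2 x_i, \quad i = 1, \dots, n, \tag{C1}$$

and

$$\mathbb{E}[Y_i] = \gamma_1 + \gamma_2 (x_i - \bar{x}), \quad i = 1, \dots, n. \tag{C2}$$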

These models look very similar, except that the explanatory variable has been centred (about the mean) in the second model. The centring of explanatory variables is common, especially for data from designed experiments.
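As a small illustration of what centring does to an explanatory variable, the sketch below subtracts the sample mean from a made-up vector. The values in `x` are hypothetical and chosen only for illustration; they are not taken from any course data set.

```python
# Hypothetical explanatory values, chosen only for illustration.
x = [1.0, 2.0, 4.0, 7.0]

n = len(x)
x_bar = sum(x) / n                      # sample mean of x
x_centred = [xi - x_bar for xi in x]    # centred explanatory variable

print("mean of x:", x_bar)
print("centred x:", x_centred)
print("sum of centred x:", sum(x_centred))
```

The centred values are the entries that would replace the raw values of the explanatory variable in the second model.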

  1. (a)

    Write down the design matrices $X$ and $A$ for models (C1) and (C2), respectively.

    [marks: 1]

  2. (b)

    Calculate $X^\top X$ and $A^\top A$. Which model gives independent estimators for the two regression coefficients? Explain your answer.

    [marks: 2]

  3. (c)

    Give interpretations of the intercept terms $\beta_1$ and $\gamma_1$. Which do you feel has the more useful interpretation?

    [marks: 1]

  4. (d)

    Explain why $\beta_2$ and $\gamma_2$ have the same interpretation.

    [marks: 1]

CW5.2 

The full FEV data set, first discussed in Question Sheet 2, is available in the file fev. The data frame in this file has 654 records and six variables: age (years), FEV (litres), height (inches), gender (1 for male, 0 for female), smoker (0 for no, 1 for yes) and log FEV.

Consider the following linear regression model,

$$\mathbb{E}[\log Y_i] = \beta_1 + \beta_2 x_{i,1} + \beta_3 x_{i,2} + \beta_4 x_{i,1} x_{i,2},$$

where $Y_i$ is FEV, $x_{i,1}$ is an indicator variable for males and $x_{i,2}$ is age.
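To make the structure of this model concrete, the sketch below evaluates the linear predictor for a given sex indicator and age. The function name `expected_log_fev` and the coefficients in `beta_demo` are hypothetical placeholders introduced for illustration; they are not estimates from the fev data.

```python
import math

def expected_log_fev(beta, male, age):
    """Linear predictor beta1 + beta2*male + beta3*age + beta4*male*age."""
    b1, b2, b3, b4 = beta
    return b1 + b2 * male + b3 * age + b4 * male * age

# Placeholder coefficients, NOT estimates from the fev data.
beta_demo = (0.5, 0.25, 0.125, 0.0625)

log_female = expected_log_fev(beta_demo, male=0, age=10)
log_male = expected_log_fev(beta_demo, male=1, age=10)

# The model is for log FEV, so exponentiating returns a value in litres.
print("female, age 10:", log_female, "->", math.exp(log_female))
print("male, age 10:", log_male, "->", math.exp(log_male))
```

With the indicator set to 0 only the first and third coefficients contribute, whereas setting it to 1 shifts both the intercept and the age slope.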

  1. (a)
    • (i)

      Using the function lm, fit this model to the full FEV data set. What are your least squares estimates $\hat{\beta}$?

      [marks: 1]

    • (ii)

      What is the predicted FEV for a 12-year-old female?

      [marks: 1]

  2. (b)

    For this model and data set,

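    $$(X^\top X)^{-1} = \begin{bmatrix} 0.03868 & -0.03868 & -0.00361 & 0.00361 \\ -0.03868 & 0.07546 & 0.00361 & -0.00699 \\ -0.00361 & 0.00361 & 0.00037 & -0.00037 \\ 0.00361 & -0.00699 & -0.00037 & 0.00070 \end{bmatrix}$$

    and the estimated residual variance is $\hat{\sigma}^2 = 0.04133$.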

    Using these results, decide whether or not there is evidence for an interaction between age and sex. You should explain what is meant by an interaction, and state clearly your hypotheses and conclusions.

    [marks: 3]
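As a hedged sketch of the arithmetic behind a test on a single coefficient: the standard error of the j-th estimate is the square root of the estimated residual variance times the j-th diagonal entry of the inverse matrix, and the estimate divided by that standard error is compared with a t distribution on n - p degrees of freedom (654 - 4 = 650 here). The residual variance and diagonal entry below are copied from the question; `beta4_demo` is a hypothetical placeholder, not the fitted estimate, and the helper name `t_statistic` is an assumption made for illustration.

```python
import math

def t_statistic(beta_hat_j, sigma2_hat, c_jj):
    """Return beta_hat_j / sqrt(sigma2_hat * c_jj).

    c_jj is the j-th diagonal entry of the inverse of X^T X.
    """
    standard_error = math.sqrt(sigma2_hat * c_jj)
    return beta_hat_j / standard_error

sigma2_hat = 0.04133   # estimated residual variance, from the question
c_44 = 0.00070         # (4,4) entry of the inverse matrix, from the question
beta4_demo = 0.01      # placeholder value, NOT the fitted estimate

se_beta4 = math.sqrt(sigma2_hat * c_44)
t_demo = t_statistic(beta4_demo, sigma2_hat, c_44)

print("standard error of the fourth coefficient:", se_beta4)
print("t statistic for the placeholder estimate:", t_demo)
```

Replacing `beta4_demo` with the estimate obtained in part (a) gives the statistic relevant to the question.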

CW5.3 

Challenge
The data frame buttermilk contains the percentage of butterfat in the milk of cows sampled from five breeds (Ayrshire, 1; Canadian, 2; Guernsey, 3; Holstein-Friesian, 4; Jersey, 5). Within each breed, both mature (coded as 1) and young (coded as 2) cows have been sampled. Load the data into R.
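Because the breed and age codes are labels rather than measurements, they are usually entered into a regression through indicator variables. The sketch below shows one possible coding, assuming an additive model with breed 1 and mature cows as the baseline levels; the helper `design_row` and the choice of baseline are assumptions made for illustration, not something the sheet specifies.

```python
BREED_CODES = [1, 2, 3, 4, 5]   # breed codes as listed in the question

def design_row(breed, age):
    """Build one design-matrix row for a cow.

    Layout: intercept, four breed indicators (baseline breed 1),
    and one indicator for young cows (baseline mature, coded 1).
    """
    breed_indicators = [1 if breed == code else 0 for code in BREED_CODES[1:]]
    young = 1 if age == 2 else 0
    return [1] + breed_indicators + [young]

# A mature cow of breed 3 and a young cow of breed 1.
print(design_row(3, 1))
print(design_row(1, 2))
```

Treating the codes 1 to 5 as a single numeric column would instead impose equal spacing between breeds, which is rarely what is intended.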

  1. (a)

    How many cows are in the full sample? How many cows have been sampled from each breed? How many cows are young and how many are mature?

    [marks: 1]

  2. (b)

    By fitting an appropriate multiple linear regression model, assess whether breed or age has an effect on the percentage butterfat.

    [marks: 3]

  3. (c)

    Using the model fitted above in part (b), what is the expected percentage butterfat for a mature Guernsey cow?

    [marks: 1]