6.4.1 Examples

Example 6.4.5 Birth weights cont.

The response variable $Y_{i}$ is birth weight. There are two explanatory variables, gender (factor) and gestational age (continuous). Let $x_{i,1}$ and $x_{i,2}$ be indicator variables for males and females respectively and $x_{i,3}$ be gestational age.

One possible model is

\displaystyle\mathbb{E}[Y_{i}]=\beta_{1}x_{i,1}+\beta_{2}x_{i,2}+\beta_{3}x_{i,3}

(6.2)

This assumes a different intercept for males ( $\beta_{1}$ ) and females ( $\beta_{2}$ ), but a common slope for gestational age ( $\beta_{3}$ ). It does not include an overall intercept term - we will see later why this is.

A second possible model has a common intercept, but allows for separate slopes for males and females; this is an interaction between gender and age.

\displaystyle\mathbb{E}[Y_{i}]=\beta_{1}+\beta_{2}x_{i,1}x_{i,3}+\beta_{3}x_{i,2}x_{i,3}

(6.3)

What are the interpretations of $\beta_{1}$ , $\beta_{2}$ and $\beta_{3}$ ?

1. $\beta_{1}$ is the expected birth weight of a baby born at 0 weeks gestation, regardless of gender;

2. $\beta_{2}$ is the expected change in birth weight for a male with every extra week of gestation;

3. $\beta_{3}$ is the expected change in birth weight for a female with every extra week of gestation.

The design matrix for model 6.2 has three columns; the first is the indicator for males, the second the indicator column for females and the third contains gestational age.

Describe the design matrix for model 6.3.

The design matrix for model 6.3 has three columns. The first is a column of 1’s for the intercept. The second is the product of the indicator variable for males and gestational age. The third is the product of the indicator variable for females and gestational age.
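The two design matrices described above can be sketched in a few lines of Python. This is only an illustration: the four observations are invented, and the variable names (`gender`, `gest_age`, `X_62`, `X_63`) are hypothetical rather than taken from the birth weight data.

```python
# Hypothetical toy data: these values are made up for illustration only.
gender = ["M", "F", "F", "M"]   # factor with two levels
gest_age = [38, 40, 36, 41]     # gestational age in weeks, x_{i,3}

# Indicator variables x_{i,1} (male) and x_{i,2} (female).
x1 = [1 if g == "M" else 0 for g in gender]
x2 = [1 if g == "F" else 0 for g in gender]

# Model 6.2: columns are male indicator, female indicator, gestational age.
X_62 = [[m, f, a] for m, f, a in zip(x1, x2, gest_age)]

# Model 6.3: columns are intercept, male * age, female * age.
X_63 = [[1, m * a, f * a] for m, f, a in zip(x1, x2, gest_age)]

# Print the rows of the two matrices side by side.
for row_62, row_63 in zip(X_62, X_63):
    print(row_62, row_63)
```

In each row of `X_63`, exactly one of the last two entries is non-zero, because every baby belongs to exactly one gender level.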

A third possible model, combining the first two, includes separate intercepts and separate slopes for the two genders.

\displaystyle\mathbb{E}[Y_{i}]=\beta_{1}x_{i,1}+\beta_{2}x_{i,2}+\beta_{3}x_{i,1}x_{i,3}+\beta_{4}x_{i,2}x_{i,3}

(6.4)
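As a quick check on this interpretation, substituting the two possible values of the indicator variables into model 6.4 gives a separate straight line for each gender:

\displaystyle\mathbb{E}[Y_{i}]=\left\{\begin{array}{ll}\beta_{1}+\beta_{3}x_{i,3}&\quad\text{if baby }i\text{ is male }(x_{i,1}=1,\ x_{i,2}=0)\\ \beta_{2}+\beta_{4}x_{i,3}&\quad\text{if baby }i\text{ is female }(x_{i,1}=0,\ x_{i,2}=1)\end{array}\right.

So $\beta_{1}$ and $\beta_{2}$ are the intercepts for males and females respectively, and $\beta_{3}$ and $\beta_{4}$ are the corresponding slopes for gestational age.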

Plots of the three model fits are shown in Figures 6.8, 6.9 and 6.10.

Fig. 6.8: Birthweight (grams) against gestational age (weeks), split by gender. Straight lines show fit of model 6.2.

Fig. 6.9: Birthweight (grams) against gestational age (weeks), split by gender. Straight lines show fit of model 6.3.

Fig. 6.10: Birthweight (grams) against gestational age (weeks), split by gender. Straight lines show fit of model 6.4.

Remark.

How can we choose which of models 6.2, 6.3 and 6.4 fits the data best? Intuitively model 6.3 seems sensible - all babies start at the same weight, but gender may affect the rate of growth. However, since our data only covers births from 35 weeks gestation onwards, we should only think about the model which best reflects growth during this period.

We will look at issues of model selection later.

Example 6.4.6 Gas consumption

Continuing Example 6.4.2, the response $Y_{i}$ is gas consumption. There are two explanatory variables: outside temperature (continuous) and before/after cavity wall insulation (factor).

Let $x_{i,1}$ be outside temperature and $x_{i,2}$ be an indicator variable for after cavity wall insulation, i.e.

\displaystyle x_{i,2}=\left\{\begin{array}{ll}1&\quad\text{if observation }i\text{ is after insulation}\\ 0&\quad\text{if observation }i\text{ is before insulation}\end{array}\right.

The modelling approach is as follows. Example 6.2.2 gave a regression on outside temperature only,

\displaystyle\mathbb{E}[Y_{i}]=\beta_{1}+\beta_{2}x_{i,1}.

(6.5)

Figure LABEL:gas_scatter2gas_scatter3 suggests that the rate of change of gas consumption with outside temperature was altered following insulation. There is also evidence of a difference in intercepts before and after insulation. We could include this information in the model as follows,

\displaystyle\mathbb{E}[Y_{i}]=\beta_{1}+\beta_{2}x_{i,1}+\beta_{3}x_{i,2}+\beta_{4}x_{i,1}x_{i,2}.

(6.6)

What are the interpretations of $\beta_{1}$ and $\beta_{2}$ in this model?

1. $\beta_{1}$ is the expected gas consumption when the outside temperature is 0 ${}^{\circ}$ C, before insulation;

2. $\beta_{2}$ is the change in gas consumption for a 1 ${}^{\circ}$ C change in outside temperature, before insulation.

To interpret $\beta_{3}$ and $\beta_{4}$ :

1. $\beta_{1}+\beta_{3}$ is the expected gas consumption when the outside temperature is 0 ${}^{\circ}$ C, after insulation;

2. $\beta_{2}+\beta_{4}$ is the change in gas consumption for a 1 ${}^{\circ}$ C change in outside temperature, after insulation.

$\beta_{3}$ tells us about the change in intercept following insulation; $\beta_{4}$ tells us how the relationship between gas consumption and outside temperature is altered following insulation.
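These interpretations can be checked numerically by evaluating model 6.6 at a few points. The sketch below uses made-up coefficients chosen for easy arithmetic; they are not fitted values from the gas consumption data, and the function name `expected_gas` is hypothetical.

```python
def expected_gas(temp, after, beta):
    """Mean gas consumption under model 6.6.

    temp  : outside temperature, x_{i,1}
    after : insulation indicator, x_{i,2} (0 = before, 1 = after)
    beta  : tuple (beta_1, beta_2, beta_3, beta_4)
    """
    b1, b2, b3, b4 = beta
    return b1 + b2 * temp + b3 * after + b4 * temp * after

# Hypothetical coefficients, for illustration only (not estimates).
beta = (7.0, -0.5, -2.0, 0.25)

# Intercepts: expected consumption at 0 degrees C.
intercept_before = expected_gas(0, 0, beta)   # beta_1
intercept_after = expected_gas(0, 1, beta)    # beta_1 + beta_3

# Slopes: change in expected consumption for a 1 degree C increase.
slope_before = expected_gas(1, 0, beta) - expected_gas(0, 0, beta)  # beta_2
slope_after = expected_gas(1, 1, beta) - expected_gas(0, 1, beta)   # beta_2 + beta_4

print(intercept_before, intercept_after)
print(slope_before, slope_after)
```

With these assumed values, the difference between the two intercepts is $\beta_{3}$ and the difference between the two slopes is $\beta_{4}$.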

Examples 6.4.5 and 6.4.6 show two different ways of including factors in linear models. In Example 6.4.5, indicator variables for all levels of the factor are included, but there is no intercept. In Example 6.4.6, there is an intercept term, but the indicator variable for only one of the two levels of the factor is included.

In general, we include an intercept term and indicator variables for $p-1$ levels of a $p$ -level factor. This ensures that the columns of the design matrix $X$ are linearly independent - even if we include two or more factors in the model.
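To see why, note that the indicator columns for all levels of a factor sum to the intercept column, so including all of them alongside an intercept makes the columns linearly dependent. The sketch below illustrates this for a hypothetical two-level factor with five invented observations, using a small exact-arithmetic `rank` helper written for this illustration (it is not a library function).

```python
from fractions import Fraction

def rank(matrix):
    """Rank of a matrix (list of rows) by Gaussian elimination, exact arithmetic."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    r = 0  # number of pivots found so far
    for col in range(n_cols):
        # Find a row at or below position r with a non-zero entry in this column.
        pivot = next((i for i in range(r, n_rows) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # Eliminate this column from every row below the pivot row.
        for i in range(r + 1, n_rows):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == n_rows:
            break
    return r

# Hypothetical two-level factor (before/after insulation), five observations.
after = [0, 0, 1, 1, 1]
before = [1 - a for a in after]

# Intercept plus indicators for BOTH levels: intercept = before + after.
X_all = [[1, b, a] for b, a in zip(before, after)]
# Intercept plus an indicator for p - 1 = 1 level only.
X_baseline = [[1, a] for a in after]

print(rank(X_all), "of", len(X_all[0]), "columns")
print(rank(X_baseline), "of", len(X_baseline[0]), "columns")
```

Here `X_all` has three columns but rank two, whereas `X_baseline` has full column rank.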

For interpretation, one level of the factor is set as a ‘baseline’ (in our example this was before insulation) and the regression coefficients for the remaining levels of the factor can be used to report the additional effect of the remaining levels on top of the baseline.

6.5 Summary