
9.3 Summary


1. Collinearity occurs when two explanatory variables are highly correlated.

2. Collinearity makes it hard (sometimes impossible) to disentangle the separate effects of the collinear variables on the response.

3. An interaction occurs when altering the value of one explanatory variable changes the effect of a second explanatory variable on the response.

4. This change could be a change in the size of the effect, in the direction of the effect (positive or negative), or in both of these.

Chapter 10 Covariate selection

Covariate selection refers to the process of deciding which of a number of explanatory variables best explain the variability in the response variable. You can think of it as finding the subset of explanatory variables which have the strongest relationships with the response variable.

We will only look at comparing nested models. Consider two models: the first has $p_{1}$ explanatory variables and the second has $p_{2}>p_{1}$ explanatory variables. We refer to the model with fewer covariates as the simpler model.

An example of a pair of nested models is when the more complicated model contains all the explanatory variables in the simpler model, and an additional $p_{2}-p_{1}$ explanatory variables.

For example, given a response $Y_{i}$ and explanatory variables $x_{i,1}$, $x_{i,2}$ and $x_{i,3}$, we could create three possible models:

A. $\mathbb{E}[Y_{i}]=\beta_{0}+\beta_{1}x_{i,1}$,

B. $\mathbb{E}[Y_{i}]=\beta_{0}+\beta_{1}x_{i,1}+\beta_{2}x_{i,2}$,

C. $\mathbb{E}[Y_{i}]=\beta_{0}+\beta_{1}x_{i,1}+\beta_{2}x_{i,2}+\beta_{3}x_{i,3}$.

Which model(s) are nested inside model C?

Models A and B are nested inside model C.

Is either of models A or C nested inside model B?

Model A is, since model B is model A with an additional covariate.

Neither model B nor model C is nested in model A.

Write down another model that is nested in model C.

$\mathbb{E}[Y_{i}]=\beta_{0}+\beta_{1}x_{i,2}$ .

Definition (Nesting).

Define model 1 as $\mathbb{E}[Y]=X\beta$ and model 2 as $\mathbb{E}[Y]=A\gamma$ , where $X$ is an $n\times p_{1}$ matrix and $A$ is an $n\times p_{2}$ matrix, with $p_{1}<p_{2}$ . Assume $X$ and $A$ are both of full rank, i.e. neither has linearly dependent columns.

Then model 1 is nested in model 2 if the column space of $X$ is a (strict) subspace of the column space of $A$.
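The definition can be checked numerically by comparing matrix ranks. The sketch below is only an illustration: it reads the definition as a statement about column spaces, the helper `is_nested` is a hypothetical name that does not come from these notes, and the covariates are simulated rather than real data.

```r
# Hypothetical helper: TRUE if the column space of X is a strict subspace
# of the column space of A.
is_nested <- function(X, A) {
  rank_A  <- qr(A)$rank
  rank_X  <- qr(X)$rank
  rank_XA <- qr(cbind(A, X))$rank   # rank after appending X's columns to A
  # Appending X adds nothing new, and X spans strictly fewer dimensions.
  rank_XA == rank_A && rank_X < rank_A
}

set.seed(1)                         # simulated covariates, illustration only
n  <- 20
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)

XA <- cbind(1, x1)                  # design matrix of model A
XB <- cbind(1, x1, x2)              # design matrix of model B
XC <- cbind(1, x1, x2, x3)          # design matrix of model C

is_nested(XA, XC)                   # TRUE
is_nested(XB, XC)                   # TRUE
is_nested(XC, XB)                   # FALSE
is_nested(cbind(1, x2), XC)         # TRUE
```

The rank comparison works because appending the columns of $X$ to $A$ leaves the rank unchanged exactly when every column of $X$ is a linear combination of the columns of $A$.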

Given a pair of nested models, we will focus on deciding whether there is enough evidence in the data in favour of the more complicated model; or whether we are justified in staying with the simpler model.

The null hypothesis in this test is always that the simpler model is the best fit.

We start with an example.

Example 10.0.1 Brain weights

In Section 6.2, Example 6.2.3 considered whether the body weight of a mammal could be used to predict its brain weight. In addition, we have the average number of hours of sleep per day for each species in the study.

Let $Y_{i}$ denote brain weight, $x_{i,1}$ denote body weight and $x_{i,2}$ denote number of hours asleep per day. Here $i$ denotes species. We will model the log of both brain and body weight.

Which of the following models fits the data best?

1. $\mathbb{E}[\log Y_{i}]=\beta_{1}+\beta_{2}\log x_{i,1}$,

2. $\mathbb{E}[\log Y_{i}]=\beta_{1}+\beta_{2}x_{i,2}$,

3. $\mathbb{E}[\log Y_{i}]=\beta_{1}+\beta_{2}\log x_{i,1}+\beta_{3}x_{i,2}$.

There are four species for which sleep time is unknown. For a fair comparison between models, we remove these species from the following study completely, leaving $n=58$ observations.

We can fit each of the models in R as follows:

> L1 <- lm(log(sleep$BrainWt)~log(sleep$BodyWt))

> L2 <- lm(log(sleep$BrainWt)~sleep$TotalSleep)

> L3 <- lm(log(sleep$BrainWt)~log(sleep$BodyWt)+sleep$TotalSleep)
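One way to make sure every model is fitted to the same species is to drop incomplete rows before calling `lm`. The sketch below uses a simulated stand-in for the data (the values are invented; only the column names follow the notes), and the object names `sim`, `complete`, `M1`, `M2` and `M3` are assumptions made for illustration.

```r
# Simulated stand-in for the mammal data (NOT the real values).
set.seed(42)
n   <- 62
sim <- data.frame(BodyWt = exp(rnorm(n, mean = 1, sd = 2)))
sim$BrainWt    <- exp(2 + 0.75 * log(sim$BodyWt) + rnorm(n, sd = 0.5))
sim$TotalSleep <- 10 - 0.8 * log(sim$BodyWt) + rnorm(n, sd = 2)
sim$TotalSleep[c(3, 17, 29, 50)] <- NA   # four species with unknown sleep time

# Keep only species with every variable observed, so that all three models
# are fitted to exactly the same rows.
complete <- sim[complete.cases(sim), ]
nrow(complete)                           # 58

M1 <- lm(log(BrainWt) ~ log(BodyWt), data = complete)
M2 <- lm(log(BrainWt) ~ TotalSleep, data = complete)
M3 <- lm(log(BrainWt) ~ log(BodyWt) + TotalSleep, data = complete)

# Number of observations used by each fit.
c(nobs(M1), nobs(M2), nobs(M3))
```

Without this step, `lm` silently drops rows with missing values model by model, so a model that does not involve sleep time would be fitted to more species than the other two.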

Figure 10.1 shows the fitted relationships in models L1 and L2.

Fig. 10.1: Left: log brain weight ( $g$ ) against log body weight. Right: log brain weight ( $g$ ) against sleep per day (hours). Data for 58 species of mammals.

Which of these models are nested?

Models L1 and L2 are both nested in model L3.

Using the summary function, we can obtain parameter estimates, and their standard errors, e.g.


> summary(L1)
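The estimates and standard errors can also be pulled out of the summary as a matrix, which is convenient when building a table. This is a minimal sketch on simulated data; the names `x`, `y`, `fit` and `tab` are illustrative and not part of the notes.

```r
# Simulated data, for illustration only.
set.seed(7)
x   <- rnorm(30)
y   <- 1 + 2 * x + rnorm(30)
fit <- lm(y ~ x)

# Coefficient table: one row per parameter.
tab <- coef(summary(fit))
colnames(tab)    # "Estimate" "Std. Error" "t value" "Pr(>|t|)"

tab[, "Estimate"]      # parameter estimates
tab[, "Std. Error"]    # their standard errors

# The reported t value is the estimate divided by its standard error.
all.equal(tab[, "t value"], tab[, "Estimate"] / tab[, "Std. Error"])
```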

The fitted models are summarised in Table 10.1.

Model	$\beta_{1}$	$\beta_{2}$	$\beta_{3}$
L1	2.15 (0.0991)	0.759 (0.0303)	NA
L2	6.17 (0.675)	-0.299 (0.0588)	NA
L3	2.60 (0.288)	0.728 (0.0352)	-0.0386 (0.0237)

Table 10.1: Parameter estimates, with standard errors in brackets, for each of three possible models for the mammal brain weight data.

For each model, we can test to see which of the explanatory variables is significant.

For model L1, we test $H_{0}:\beta_{2}=0$ vs. $H_{1}:\beta_{2}\neq 0$ by calculating

$$t=\frac{\hat{\beta}_{2}}{\operatorname{se}(\hat{\beta}_{2})}=\frac{0.759}{0.0303}=25.09.$$

Comparing this to $t_{56}(0.975)=2.00$, we see that $\beta_{2}$ is significantly different from zero at the 5% level. We conclude that there is evidence of a significant relationship between (log) brain weight and (log) body weight.

For model L2, to test $H_{0}:\beta_{2}=0$ vs. $H_{1}:\beta_{2}\neq 0$ , calculate

$$t=\frac{\hat{\beta}_{2}}{\operatorname{se}(\hat{\beta}_{2})}=\frac{-0.299}{0.0588}=-5.092.$$

Again, the critical value is $t_{56}(0.975)=2.00$. Since $|-5.092|>2.00$, we conclude that there is evidence of a relationship between hours of sleep per day and (log) brain weight. This is a negative relationship: the more hours of sleep per day, the lighter the brain. We cannot, of course, conclude that this relationship is causal!

For model L3, we first test $H_{0}:\beta_{2}=0$ vs. $H_{1}:\beta_{2}\neq 0$, using

$$t=\frac{\hat{\beta}_{2}}{\operatorname{se}(\hat{\beta}_{2})}=\frac{0.728}{0.0352}=20.67.$$

Next we test $H_{0}:\beta_{3}=0$ vs. $H_{1}:\beta_{3}\neq 0$ , using

$$t=\frac{\hat{\beta}_{3}}{\operatorname{se}(\hat{\beta}_{3})}=\frac{-0.0386}{0.0237}=-1.632.$$

In both cases the critical value is $t_{55}(0.975)=2.00$ ; so, at the 5% level, there is evidence of a relationship between (log) brain weight and (log) body weight, but there is no evidence of a relationship between (log) brain weight and hours of sleep per day.
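The four tests above can be reproduced from the entries of Table 10.1. This is a sketch only: the vector names are illustrative, and because the table entries are rounded, the statistics computed here can differ slightly in the last digit from those quoted in the text.

```r
# Estimates and standard errors copied from Table 10.1.
est <- c(L1_b2 = 0.759,  L2_b2 = -0.299, L3_b2 = 0.728,  L3_b3 = -0.0386)
se  <- c(L1_b2 = 0.0303, L2_b2 = 0.0588, L3_b2 = 0.0352, L3_b3 = 0.0237)

# Residual degrees of freedom: n minus the number of coefficients, n = 58.
df  <- c(L1_b2 = 56, L2_b2 = 56, L3_b2 = 55, L3_b3 = 55)

t_stat <- est / se                    # test statistics
crit   <- qt(0.975, df)               # two-sided 5% critical values
reject <- abs(t_stat) > crit          # TRUE TRUE TRUE FALSE
p_val  <- 2 * pt(-abs(t_stat), df)    # two-sided p-values

data.frame(t = round(t_stat, 2), critical = round(crit, 2),
           reject_H0 = reject, p = signif(p_val, 3))
```

Only the sleep coefficient in model L3 fails to exceed its critical value, which matches the conclusion drawn above.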

To summarise, individually, both explanatory variables appear to be significant. However, when we include both in the model, only one is significant. This appears to be a contradiction. So which is the best model to explain variability amongst brain weights in mammals?

In general, we want to select the simplest possible model that explains the most variation.


Including additional explanatory variables will never decrease the amount of variability explained, and will almost always increase it; but is the increase sufficient to justify the additional parameter that must then be estimated?
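A small simulation illustrates the point. The sketch below uses invented data, and the covariate `noise` is unrelated to the response by construction; all object names are assumptions made for illustration.

```r
# Simulated data, for illustration only.
set.seed(123)
n     <- 58
x1    <- rnorm(n)
noise <- rnorm(n)                 # has no true relationship with y
y     <- 1 + 2 * x1 + rnorm(n)

small <- lm(y ~ x1)               # simpler model
big   <- lm(y ~ x1 + noise)       # same model plus an irrelevant covariate

r2_small <- summary(small)$r.squared
r2_big   <- summary(big)$r.squared

# The larger model explains at least as much variability ...
r2_big >= r2_small                # TRUE

# ... equivalently, its residual sum of squares is no larger.
c(rss_small = deviance(small), rss_big = deviance(big))
```

Even though `noise` carries no information about `y`, the larger model has the higher $R^{2}$, so an improvement in fit alone cannot justify the extra parameter.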