
9.3 Summary

  1. Collinearity occurs when two explanatory variables are highly correlated.

  2. Collinearity makes it hard (sometimes impossible) to disentangle the separate effects of the collinear variables on the response.

  3. An interaction occurs when altering the value of one explanatory variable changes the effect of a second explanatory variable on the response.

  4. This change could be a change in the size of the effect, in the direction of the effect (positive or negative), or in both of these.
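
The first two points can be illustrated with a small simulation. The sketch below is not taken from the notes: the variable names (`x1`, `x2_indep`, `x2_coll`), the sample size and the coefficients are all illustrative assumptions. It fits the same kind of model twice, once with independent explanatory variables and once with nearly identical ones, and compares the standard error of the coefficient of `x1`.

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)

# Case 1: second explanatory variable generated independently of x1
x2_indep <- rnorm(n)
# Case 2: second explanatory variable almost a copy of x1 (collinear)
x2_coll  <- x1 + rnorm(n, sd = 0.05)

# Same true coefficients in both cases (assumed values for illustration)
y_indep <- 1 + 2 * x1 + 3 * x2_indep + rnorm(n)
y_coll  <- 1 + 2 * x1 + 3 * x2_coll  + rnorm(n)

fit_indep <- lm(y_indep ~ x1 + x2_indep)
fit_coll  <- lm(y_coll  ~ x1 + x2_coll)

# Standard error of the x1 coefficient in each fit
se_indep <- summary(fit_indep)$coefficients["x1", "Std. Error"]
se_coll  <- summary(fit_coll)$coefficients["x1", "Std. Error"]

cat("Correlation in collinear case:", cor(x1, x2_coll), "\n")
cat("SE of x1 coefficient (independent):", se_indep, "\n")
cat("SE of x1 coefficient (collinear):  ", se_coll, "\n")
```

In the collinear fit the standard error should be much larger, which is the numerical counterpart of being unable to separate the two effects.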

Chapter 10 Covariate selection

Covariate selection refers to the process of deciding which of a number of explanatory variables best explain the variability in the response variable. You can think of it as finding the subset of explanatory variables which have the strongest relationships with the response variable.

We will only look at comparing nested models. Consider two models: the first has $p_1$ explanatory variables and the second has $p_2 > p_1$ explanatory variables. We refer to the model with fewer covariates as the simpler model.

One way to obtain a pair of nested models is for the more complicated model to contain all the explanatory variables in the simpler model, plus a further $p_2 - p_1$ explanatory variables.

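For example, given a response $Y_i$ and explanatory variables $x_{i,1}$, $x_{i,2}$ and $x_{i,3}$, we could create three possible models:

  A. $\mathbb{E}[Y_i] = \beta_0 + \beta_1 x_{i,1}$,

  B. $\mathbb{E}[Y_i] = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2}$,

  C. $\mathbb{E}[Y_i] = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3}$.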

Which model(s) are nested inside model C?

Models A and B are nested inside model C.

Is either of models A or C nested inside model B?

Model A is, since model B is model A with an additional covariate. Model C is not, since it contains a covariate that model B lacks.

Neither model B nor model C is nested in model A.

Write down another model that is nested in model C.

$\mathbb{E}[Y_i] = \beta_0 + \beta_1 x_{i,2}$.
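
One way to read these answers, offered as a restatement rather than a new result: each nested model is model C with some of its coefficients fixed at zero. The lines below use model C's labelling of the coefficients throughout.

```latex
\begin{align*}
\text{Model C:} \quad & \mathbb{E}[Y_i] = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3}, \\
\text{Model B:} \quad & \text{model C with } \beta_3 = 0, \\
\text{Model A:} \quad & \text{model C with } \beta_2 = \beta_3 = 0, \\
\text{Model with } x_{i,2} \text{ only:} \quad & \text{model C with } \beta_1 = \beta_3 = 0.
\end{align*}
```

In the last case the slope that the answer above calls $\beta_1$ plays the role of model C's $\beta_2$, since it multiplies $x_{i,2}$.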

Definition (Nesting).

Define model 1 as $\mathbb{E}[Y] = X\beta$ and model 2 as $\mathbb{E}[Y] = A\gamma$, where $X$ is an $n \times p_1$ matrix and $A$ is an $n \times p_2$ matrix, with $p_1 < p_2$. Assume $X$ and $A$ are both of full rank, i.e. neither has linearly dependent columns.

Then model 1 is nested in model 2 if the column space of $X$ is a (strict) subspace of the column space of $A$.
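
The definition can be checked numerically by comparing matrix ranks: if appending the columns of the smaller design matrix to the larger one does not raise the rank, the smaller matrix contributes no new directions. The following is a minimal sketch under that reading. The helper `is_nested` and the simulated covariates are hypothetical, introduced only for illustration, and the rank returned by `qr` depends on a numerical tolerance.

```r
# Hypothetical helper: TRUE if the column space of X is a strict subspace
# of the column space of A (both assumed to have full column rank).
is_nested <- function(X, A) {
  rank_A  <- qr(A)$rank            # dimension of the column space of A
  rank_XA <- qr(cbind(A, X))$rank  # dimension after appending X's columns
  # X adds no new directions, and A has strictly more columns than X
  rank_XA == rank_A && ncol(X) < ncol(A)
}

# Simulated covariates (illustrative values only)
set.seed(42)
n  <- 20
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)

# Design matrices: intercept column plus covariates
X_A <- cbind(1, x1)          # model A
X_B <- cbind(1, x1, x2)      # model B
X_C <- cbind(1, x1, x2, x3)  # model C
X_E <- cbind(1, x3)          # intercept and x3 only

print(is_nested(X_A, X_C))   # is A nested in C?
print(is_nested(X_B, X_C))   # is B nested in C?
print(is_nested(X_C, X_B))   # is C nested in B?
print(is_nested(X_E, X_B))   # is the x3-only model nested in B?
```

The last check shows that having fewer columns is not enough: the model with only $x_{i,3}$ is smaller than model B, but it is not nested in it.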

Given a pair of nested models, we will focus on deciding whether there is enough evidence in the data in favour of the more complicated model; or whether we are justified in staying with the simpler model.

The null hypothesis in this test is always that the simpler model is the best fit.

We start with an example.

Example 10.0.1 Brain weights

In Section 6.2, Example 6.2.3 considered whether the body weight of a mammal could be used to predict its brain weight. In addition, we have the average number of hours of sleep per day for each species in the study.

Let $Y_i$ denote brain weight, $x_{i,1}$ denote body weight and $x_{i,2}$ denote number of hours asleep per day. Here $i$ denotes species. We will model the log of both brain and body weight.

Which of the following models fits the data best?

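  1. $\mathbb{E}[\log Y_i] = \beta_1 + \beta_2 \log x_{i,1}$,

  2. $\mathbb{E}[\log Y_i] = \beta_1 + \beta_2 x_{i,2}$,

  3. $\mathbb{E}[\log Y_i] = \beta_1 + \beta_2 \log x_{i,1} + \beta_3 x_{i,2}$.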

There are four species for which sleep time is unknown. For a fair comparison between models, we remove these species from the following study completely, leaving n=58 observations.
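
Removing the incomplete species up front matters because `lm` by default drops rows with missing values separately for each model, so a model without sleep time would otherwise be fitted to more species than the others. The sketch below shows one way to keep only complete rows. The data frame `sleep_toy` and its values are made up for illustration; only the column names follow those used in these notes.

```r
# Toy data frame with made-up values; two rows have unknown sleep time
sleep_toy <- data.frame(
  BodyWt     = c(1.0, 50.0, 0.5, 200.0, 3.0),
  BrainWt    = c(6.0, 150.0, 3.0, 400.0, 25.0),
  TotalSleep = c(12.5, NA, 14.0, 4.0, NA)
)

# Keep only the rows where every variable used in any model is observed
keep <- complete.cases(sleep_toy[, c("BodyWt", "BrainWt", "TotalSleep")])
sleep_complete <- sleep_toy[keep, ]

# Number of observations available to all models
print(nrow(sleep_complete))
```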

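We can fit each of the models in R as follows:

> # L1: log brain weight on log body weight
> L1 <- lm(log(BrainWt) ~ log(BodyWt), data = sleep)
> # L2: log brain weight on hours of sleep per day
> L2 <- lm(log(BrainWt) ~ TotalSleep, data = sleep)
> # L3: both explanatory variables
> L3 <- lm(log(BrainWt) ~ log(BodyWt) + TotalSleep, data = sleep)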

Figure 10.1 shows the fitted relationships in models L1 and L2.

Fig. 10.1: Left: log brain weight (g) against log body weight. Right: log brain weight (g) against sleep per day (hours). Data for 58 species of mammals.

Which of these models are nested?

Models L1 and L2 are both nested in model L3.

Using the summary function, we can obtain parameter estimates and their standard errors, e.g.

> summary(L1)
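
The estimates and standard errors printed by `summary` can also be extracted as a matrix, which is convenient when building a table of results. Since the mammal data are not reproduced here, this sketch uses R's built-in `cars` data set as a stand-in; the object names `fit` and `coef_table` are illustrative.

```r
# Stand-in example: simple linear regression on the built-in cars data
fit <- lm(dist ~ speed, data = cars)

# Coefficient matrix with columns "Estimate", "Std. Error", "t value", "Pr(>|t|)"
coef_table <- coef(summary(fit))

estimates  <- coef_table[, "Estimate"]
std_errors <- coef_table[, "Std. Error"]

# The reported t value is the estimate divided by its standard error
t_values <- estimates / std_errors

print(round(cbind(estimates, std_errors, t_values), 3))
```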

The fitted models are summarised in Table 10.1.

Model   β1              β2                β3
L1      2.15 (0.0991)   0.759 (0.0303)    NA
L2      6.17 (0.675)    -0.299 (0.0588)   NA
L3      2.60 (0.288)    0.728 (0.0352)    -0.0386 (0.0237)

Table 10.1: Parameter estimates, with standard errors in brackets, for each of three possible models for the mammal brain weight data.

For each model, we can test to see which of the explanatory variables is significant.

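For model L1, we test $H_0: \beta_2 = 0$ vs. $H_1: \beta_2 \neq 0$ by calculating

$$t = \frac{\hat{\beta}_2}{\mathrm{se}(\hat{\beta}_2)} = \frac{0.759}{0.0303} = 25.09.$$

Comparing this to $t_{56}(0.975) = 2.00$ (the model has two parameters, leaving $58 - 2 = 56$ degrees of freedom), we see that $\beta_2$ is significantly different from zero at the 5% level. We conclude that there is evidence of a significant relationship between (log) brain weight and (log) body weight.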
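
The critical value and a p-value for this test can be obtained from R's t-distribution functions. The sketch below starts from the rounded figures in Table 10.1, so the ratio comes out slightly different from the 25.09 quoted above, presumably because the notes work from unrounded estimates (an assumption). The variable names are illustrative.

```r
n <- 58            # number of species
p <- 2             # parameters in model L1 (intercept and slope)

estimate  <- 0.759   # slope estimate from Table 10.1 (rounded)
std_error <- 0.0303  # its standard error from Table 10.1 (rounded)

t_stat <- estimate / std_error   # test statistic
df     <- n - p                  # residual degrees of freedom
crit   <- qt(0.975, df)          # two-sided 5% critical value

# Two-sided p-value: probability of a statistic at least this extreme under H0
p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

reject <- abs(t_stat) > crit     # TRUE means reject H0 at the 5% level

cat("t =", round(t_stat, 2), " critical value =", round(crit, 3), "\n")
cat("reject H0:", reject, "\n")
```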

For model L2, to test $H_0: \beta_2 = 0$ vs. $H_1: \beta_2 \neq 0$, calculate

$$t = \frac{\hat{\beta}_2}{\mathrm{se}(\hat{\beta}_2)} = \frac{-0.299}{0.0588} = -5.092.$$

Again, the critical value is $t_{56}(0.975) = 2.00$. Since $|-5.092| > 2.00$, we conclude that there is evidence of a relationship between hours of sleep per day and (log) brain weight. This is a negative relationship: the more hours of sleep per day, the lighter the brain. We cannot, however, claim that this is a causal relationship!

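For model L3, we first test $H_0: \beta_2 = 0$ vs. $H_1: \beta_2 \neq 0$, using

$$t = \frac{\hat{\beta}_2}{\mathrm{se}(\hat{\beta}_2)} = \frac{0.728}{0.0352} = 20.67.$$

Next we test $H_0: \beta_3 = 0$ vs. $H_1: \beta_3 \neq 0$, using

$$t = \frac{\hat{\beta}_3}{\mathrm{se}(\hat{\beta}_3)} = \frac{-0.0386}{0.0237} = -1.632.$$

In both cases the critical value is $t_{55}(0.975) = 2.00$ (model L3 has three parameters, leaving $58 - 3 = 55$ degrees of freedom); so, at the 5% level, there is evidence of a relationship between (log) brain weight and (log) body weight, but there is no evidence of a relationship between (log) brain weight and hours of sleep per day.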
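
The four tests can be laid side by side by applying the same calculation to each row of Table 10.1. This is a sketch using the rounded table values; the data frame `tests` and its column names are illustrative, and the t-statistics differ slightly in the second decimal place from those quoted above.

```r
n <- 58  # species used in every model

# Slope estimates and standard errors copied from Table 10.1 (rounded)
tests <- data.frame(
  model     = c("L1", "L2", "L3", "L3"),
  term      = c("log(BodyWt)", "TotalSleep", "log(BodyWt)", "TotalSleep"),
  estimate  = c(0.759, -0.299, 0.728, -0.0386),
  std_error = c(0.0303, 0.0588, 0.0352, 0.0237),
  n_params  = c(2, 2, 3, 3)   # parameters in each model, including intercept
)

tests$t_stat      <- tests$estimate / tests$std_error
tests$df          <- n - tests$n_params
tests$critical    <- qt(0.975, tests$df)               # two-sided 5% level
tests$significant <- abs(tests$t_stat) > tests$critical

print(tests)
```

Only the last row, sleep time in the model that already contains body weight, fails to reach significance.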

To summarise, individually, both explanatory variables appear to be significant. However, when we include both in the model, only one is significant. This appears to be a contradiction. So which is the best model to explain variability amongst brain weights in mammals?

In general, we want to select the simplest possible model that explains the most variation.


Including additional explanatory variables will never decrease the amount of variability explained, and will almost always increase it. But is the increase sufficient to justify the additional parameter that must then be estimated?
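
This can be seen directly by adding a covariate that is pure noise. In the simulated sketch below, which is not based on the mammal data and whose names and settings are illustrative assumptions, `noise_covariate` has no relationship with the response, yet the larger model still has a residual sum of squares no bigger than, and an $R^2$ no smaller than, the simpler model.

```r
set.seed(10)
n <- 58
x               <- rnorm(n)
noise_covariate <- rnorm(n)          # unrelated to the response by construction
y               <- 1 + 0.5 * x + rnorm(n)

small <- lm(y ~ x)                   # simpler model
big   <- lm(y ~ x + noise_covariate) # nested extension with a useless covariate

# Proportion of variability explained by each model
r2_small <- summary(small)$r.squared
r2_big   <- summary(big)$r.squared

# Residual sums of squares
rss_small <- sum(resid(small)^2)
rss_big   <- sum(resid(big)^2)

cat("R-squared: simpler =", r2_small, " larger =", r2_big, "\n")
cat("RSS:       simpler =", rss_small, " larger =", rss_big, "\n")
```

Because the improvement here comes from fitting noise, a comparison based on variability explained alone would favour the larger model for the wrong reason.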