Collinearity occurs when two explanatory variables are highly correlated.
Collinearity makes it hard (sometimes impossible) to disentangle the separate effects of the collinear variables on the response.
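The effect of collinearity on the precision of the estimates can be illustrated with a small simulation. This is only a sketch on made-up data: the variable names, sample size and coefficients are all assumptions chosen for illustration. The same linear model is fitted twice, once with an unrelated second explanatory variable and once with a second explanatory variable that is almost a copy of the first.

```r
set.seed(1)                              # fixed seed so the sketch is reproducible
n  <- 100
x1 <- rnorm(n)
x2_indep <- rnorm(n)                     # unrelated to x1
x2_coll  <- x1 + rnorm(n, sd = 0.05)     # almost a copy of x1

# Same true coefficients and the same noise in both cases
noise   <- rnorm(n)
y_indep <- 1 + 2 * x1 + 3 * x2_indep + noise
y_coll  <- 1 + 2 * x1 + 3 * x2_coll  + noise

fit_indep <- lm(y_indep ~ x1 + x2_indep)
fit_coll  <- lm(y_coll  ~ x1 + x2_coll)

# Standard error of the x1 coefficient in each fit
se_indep <- coef(summary(fit_indep))["x1", "Std. Error"]
se_coll  <- coef(summary(fit_coll))["x1", "Std. Error"]
c(independent = se_indep, collinear = se_coll)
```

In the collinear fit the standard error for `x1` is much larger, which is what makes the separate effects hard to disentangle.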
An interaction occurs when altering the value of one explanatory variable changes the effect of a second explanatory variable on the response.
This change could be a change in the size of the effect, in the direction of the effect (positive or negative), or in both of these.
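As a rough illustration of an interaction, the sketch below simulates data in which the slope for `x1` is positive when `x2 = 0` and negative when `x2 = 1`, so both the size and the direction of the effect change. The data, names and coefficients are assumptions made for this example only.

```r
set.seed(2)
n  <- 200
x1 <- runif(n)
x2 <- rep(c(0, 1), each = n / 2)     # two-level explanatory variable coded 0/1

# True slope for x1 is +2 when x2 = 0 and 2 - 3 = -1 when x2 = 1
y <- 1 + 2 * x1 + 0.5 * x2 - 3 * x1 * x2 + rnorm(n, sd = 0.1)

fit <- lm(y ~ x1 * x2)               # expands to x1 + x2 + x1:x2
b   <- coef(fit)

slope_x2_0 <- unname(b["x1"])               # estimated effect of x1 when x2 = 0
slope_x2_1 <- unname(b["x1"] + b["x1:x2"])  # estimated effect of x1 when x2 = 1
c(slope_x2_0, slope_x2_1)
```

The coefficient labelled `x1:x2` is the change in the slope of `x1` as `x2` moves from 0 to 1.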
Covariate selection refers to the process of deciding which of a number of explanatory variables best explain the variability in the response variable. You can think of it as finding the subset of explanatory variables which have the strongest relationships with the response variable.
We will only look at comparing nested models. Consider two models, the first has $p_1$ explanatory variables and the second has $p_2$ explanatory variables. We refer to the model with fewer covariates as the simpler model.
An example of a pair of nested models is when the more complicated model contains all the explanatory variables in the simpler model, and one or more additional explanatory variables.
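For example, given a response $Y$ and explanatory variables $x_1$, $x_2$ and $x_3$, we could create three possible models:
A: $\mathbb{E}[Y] = \beta_0 + \beta_1 x_1$,
B: $\mathbb{E}[Y] = \beta_0 + \beta_1 x_1 + \beta_2 x_2$,
C: $\mathbb{E}[Y] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$.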
Which model(s) are nested inside model C?
Models A and B are nested inside model C.
Are either of models A or C nested inside model B?
Model A is, since model B is model A with an additional covariate.
Neither model B nor model C is nested in model A.
Write down another model that is nested in model C.
For example, $\mathbb{E}[Y] = \beta_0 + \beta_1 x_1 + \beta_3 x_3$.
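Define model 1 as $\mathbb{E}[Y] = X_1 \boldsymbol{\beta}^{(1)}$ and model 2 as $\mathbb{E}[Y] = X_2 \boldsymbol{\beta}^{(2)}$, where $X_1$ is an $n \times p_1$ matrix and $X_2$ is an $n \times p_2$ matrix, with $p_1 < p_2$. Assume $X_1$ and $X_2$ are both of full rank, i.e. neither has linearly dependent columns.
Then model 1 is nested in model 2 if the column space of $X_1$ is a (strict) subspace of the column space of $X_2$.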
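The subspace condition can be checked numerically by comparing matrix ranks: the columns of the smaller design matrix add nothing new when appended to the larger one. The helper `is_nested` below is a hypothetical function written for this sketch, not part of R, and the small design matrices are made up for illustration.

```r
# Hypothetical helper: TRUE when the column space of X1 is a strict
# subspace of the column space of X2
is_nested <- function(X1, X2) {
  r1  <- qr(X1)$rank
  r2  <- qr(X2)$rank
  r12 <- qr(cbind(X1, X2))$rank
  # Appending X1's columns to X2 must not raise the rank,
  # and X2 must span strictly more than X1
  r12 == r2 && r1 < r2
}

# Made-up explanatory variables for six observations
x1 <- c(1, 2, 3, 4, 5, 6)
x2 <- c(2, 1, 4, 3, 6, 5)
x3 <- c(1, 0, 0, 1, 1, 0)

XA <- cbind(1, x1)            # intercept and x1
XB <- cbind(1, x1, x2)        # intercept, x1 and x2
XC <- cbind(1, x1, x2, x3)    # intercept, x1, x2 and x3
XD <- cbind(1, x3)            # intercept and x3

is_nested(XA, XB)   # TRUE
is_nested(XB, XC)   # TRUE
is_nested(XD, XB)   # FALSE: x3 is not in the span of 1, x1, x2
is_nested(XC, XA)   # FALSE: the larger model cannot be nested in the smaller
```

Using ranks rather than comparing column names means the check also recognises nesting when one model is a reparameterisation of a sub-model of the other.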
Given a pair of nested models, we will focus on deciding whether there is enough evidence in the data in favour of the more complicated model, or whether we are justified in staying with the simpler model.
The null hypothesis in this test is always that the simpler model is adequate, i.e. that the additional explanatory variables in the more complicated model are not needed.
We start with an example.
In Section 6.2, Example 6.2.3 considered whether the body weight of a mammal could be used to predict its brain weight. In addition, we have the average number of hours of sleep per day for each species in the study.
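Let $Y_i$ denote brain weight, $x_{i,1}$ denote body weight and $x_{i,2}$ denote number of hours asleep per day. Here $i$ denotes species. We will model the log of both brain and body weight.
Which of the following models fits the data best?
L1: $\mathbb{E}[\log Y_i] = \beta_0 + \beta_1 \log x_{i,1}$,
L2: $\mathbb{E}[\log Y_i] = \beta_0 + \beta_1 x_{i,2}$,
L3: $\mathbb{E}[\log Y_i] = \beta_0 + \beta_1 \log x_{i,1} + \beta_2 x_{i,2}$.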
There are four species for which sleep time is unknown. For a fair comparison between models, we remove these species from the following study completely, leaving $n$ observations.
We can fit each of the models in R as follows,
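The original listing is not reproduced here. The following is a minimal sketch of how three such models might be fitted with `lm`, assuming a data frame with hypothetical column names `brain`, `body` and `sleep`. The data are simulated purely so that the code runs; they are not the mammal data analysed in this section, and the coefficients used to generate them are arbitrary.

```r
set.seed(10)
n_sim <- 40
# Simulated stand-in data (assumption): one row per species
sim <- data.frame(body = exp(rnorm(n_sim, mean = 1, sd = 2)))
sim$sleep <- pmax(2, 12 - 0.8 * log(sim$body) + rnorm(n_sim))
sim$brain <- exp(2 + 0.75 * log(sim$body) + rnorm(n_sim, sd = 0.5))
sim$sleep[c(3, 17)] <- NA            # pretend sleep time is unknown for two species

# Drop incomplete rows first so that all three models use the same observations
dat <- sim[complete.cases(sim), ]

L1 <- lm(log(brain) ~ log(body),         data = dat)
L2 <- lm(log(brain) ~ sleep,             data = dat)
L3 <- lm(log(brain) ~ log(body) + sleep, data = dat)

# Number of observations used in each fit
sapply(list(L1 = L1, L2 = L2, L3 = L3), nobs)
```

Removing the incomplete rows before fitting matters: by default `lm` silently drops rows with missing values in the variables it uses, so fitting L1 to the full data frame would use more observations than L3.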
Figure 10.1 shows the fitted relationships in models L1 and L2.
Which of these models are nested?
Models L1 and L2 are both nested in model L3.
Using the summary function, we can obtain parameter estimates, and their standard errors, e.g.
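The original output is not reproduced here. As a sketch of the idea, the coefficient table returned by `summary` can be extracted with `coef`, which gives the estimates and standard errors as columns of a matrix. The data and the object name `fit` below are made up for illustration.

```r
set.seed(11)
x <- rnorm(30)
y <- 1 + 0.5 * x + rnorm(30)     # simulated data, for illustration only
fit <- lm(y ~ x)

tab <- coef(summary(fit))        # columns: Estimate, Std. Error, t value, Pr(>|t|)
est <- tab[, "Estimate"]
se  <- tab[, "Std. Error"]
round(cbind(estimate = est, std_error = se), 3)

# The reported t value is the estimate divided by its standard error
all.equal(unname(est / se), unname(tab[, "t value"]))
```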
The fitted models are summarised in Table 10.1.
Model | $\hat{\beta}_0$ | $\hat{\beta}_1$ | $\hat{\beta}_2$ |
---|---|---|---|
L1 | 2.15 (0.0991) | 0.759 (0.0303) | NA |
L2 | 6.17 (0.675) | -0.299 (0.0588) | NA |
L3 | 2.60 (0.288) | 0.728 (0.0352) | -0.0386 (0.0237) |
For each model, we can test to see which of the explanatory variables is significant.
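For model L1, we test $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$ by calculating
$$t = \frac{\hat{\beta}_1}{\mathrm{se}(\hat{\beta}_1)} = \frac{0.759}{0.0303} \approx 25.05,$$
using the rounded values in Table 10.1.
Comparing this to the critical value $t_{n-2}(0.975)$, where $n$ is the number of observations, we see that $\beta_1$ is significantly different to zero at the 5% level. We conclude that there is evidence of a significant relationship between (log) brain weight and (log) body weight.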
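For model L2, to test $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$, calculate
$$t = \frac{\hat{\beta}_1}{\mathrm{se}(\hat{\beta}_1)} = \frac{-0.299}{0.0588} \approx -5.09.$$
Again, the critical value is $t_{n-2}(0.975)$. Since $|t|$ exceeds this critical value, we conclude that there is evidence of a relationship between hours of sleep per day and (log) brain weight. This is a negative relationship: the more hours of sleep per day, the lighter the brain. We should not, however, conclude that this is a causal relationship!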
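For model L3, we first test $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$, using
$$t = \frac{\hat{\beta}_1}{\mathrm{se}(\hat{\beta}_1)} = \frac{0.728}{0.0352} \approx 20.68.$$
Next we test $H_0: \beta_2 = 0$ vs. $H_1: \beta_2 \neq 0$, using
$$t = \frac{\hat{\beta}_2}{\mathrm{se}(\hat{\beta}_2)} = \frac{-0.0386}{0.0237} \approx -1.63.$$
In both cases the critical value is $t_{n-3}(0.975)$, which is at least 1.96 whatever the sample size; so, at the 5% level, there is evidence of a relationship between (log) brain weight and (log) body weight, but there is no evidence of a relationship between (log) brain weight and hours of sleep per day.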
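The four test statistics can be reproduced directly from the rounded estimates and standard errors in Table 10.1. The sketch below does this and compares them with a $t$ critical value from `qt`. Since the exact degrees of freedom depend on the number of observations, the comparison uses 30 degrees of freedom as a conservative stand-in; this value is an assumption, and the vector names are chosen for illustration.

```r
# Slope estimates and standard errors copied from Table 10.1
est <- c(L1_body = 0.759,  L2_sleep = -0.299, L3_body = 0.728,  L3_sleep = -0.0386)
se  <- c(L1_body = 0.0303, L2_sleep = 0.0588, L3_body = 0.0352, L3_sleep = 0.0237)

t_stat <- est / se
round(t_stat, 2)              # 25.05 -5.09 20.68 -1.63

# Two-sided 5% critical value as a function of the residual degrees of freedom
crit <- function(df) qt(0.975, df)

# Assumption: at least 30 residual degrees of freedom. crit(df) shrinks
# towards qnorm(0.975), about 1.96, as df grows, so crit(30) is an upper bound.
abs(t_stat) > crit(30)        # significant even against the larger bound?
abs(t_stat) < qnorm(0.975)    # not significant for any degrees of freedom?
```

The first three statistics exceed the upper bound, and the last falls below the smallest possible critical value, so under this assumption the conclusions in the text do not depend on the exact sample size.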
To summarise, individually, both explanatory variables appear to be significant. However, when we include both in the model, only one is significant. This appears to be a contradiction. So which is the best model to explain variability amongst brain weights in mammals?
In general, we want to select the simplest possible model that explains the most variation.
Including additional explanatory variables will never decrease the amount of variability explained, and in practice will almost always increase it, but is the increase sufficient to justify the additional parameter that must then be estimated?
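To see why the amount of variability explained cannot be used on its own to choose between nested models, the sketch below adds an explanatory variable that is pure noise and compares $R^2$ before and after. The data and variable names are simulated assumptions for illustration.

```r
set.seed(3)
n    <- 50
x    <- rnorm(n)
junk <- rnorm(n)                 # unrelated to y by construction
y    <- 1 + 2 * x + rnorm(n)

small <- lm(y ~ x)
big   <- lm(y ~ x + junk)        # the simpler model plus a useless covariate

r2_small <- summary(small)$r.squared
r2_big   <- summary(big)$r.squared
c(small = r2_small, big = r2_big, gain = r2_big - r2_small)
```

The gain is never negative, even though `junk` has no real relationship with the response, which is why a formal test is needed to judge whether the gain is large enough.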