Collinearity arises when there is linear dependence (strong correlation) between two or more explanatory variables. We say that two explanatory variables $x_j$ and $x_k$ are
Orthogonal if $\mathrm{corr}(x_j, x_k)$ is close to zero;
Collinear if $|\mathrm{corr}(x_j, x_k)|$ is close to 1.
Collinearity is undesirable because it means that the matrix $X^T X$ is ill-conditioned, so inversion of $X^T X$ is numerically unstable. It can also make results difficult to interpret.
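A minimal sketch of this numerical instability (simulated data, not the cereal example): when two columns of the design matrix are nearly collinear, the condition number of $X^T X$ explodes.

```r
# Illustration: two nearly collinear covariates make t(X) %*% X ill-conditioned
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)  # x2 is almost a copy of x1
X  <- cbind(1, x1, x2)            # design matrix with intercept column
kappa(t(X) %*% X)                 # condition number is very large
```

A large condition number means small changes in the data produce large changes in the estimated coefficients.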
In Example 6.4.4, we related annual maize prices, $M_t$, to annual prices of barley, $B_t$, and wheat, $W_t$. Consider the three models:
Model 1: $M_t = \beta_0 + \beta_1 B_t + \varepsilon_t$,
Model 2: $M_t = \beta_0 + \beta_2 W_t + \varepsilon_t$,
Model 3: $M_t = \beta_0 + \beta_1 B_t + \beta_2 W_t + \varepsilon_t$.
We fit the models using R. First load the data set cereal, and see what it contains,
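A sketch of the loading step, assuming the data arrive as an `.RData` file containing a data frame `cereal` with columns `Maize`, `Barley` and `Wheat` (the file name and column names are assumptions, not taken from the original):

```r
# Load the cereal data set (assumed to be stored as an .RData file)
load("cereal.RData")

# See what it contains: variable names, types, and the first few rows
str(cereal)
head(cereal)
```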
To fit the three models,
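The three fits can be sketched with `lm()`, under the same assumed column names:

```r
# Model 1: maize price on barley price only
model1 <- lm(Maize ~ Barley, data = cereal)
# Model 2: maize price on wheat price only
model2 <- lm(Maize ~ Wheat, data = cereal)
# Model 3: maize price on both barley and wheat prices
model3 <- lm(Maize ~ Barley + Wheat, data = cereal)
```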
Now let’s look at the estimated coefficients in each model.
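Assuming the fitted objects `model1`, `model2`, `model3` from above, the estimated coefficients can be extracted with `coef()`:

```r
# Estimated intercept and slope(s) for each model
coef(model1)
coef(model2)
coef(model3)
```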
In the two-covariate model, the coefficients for both Barley and Wheat are considerably different from the equivalent estimates obtained in the two one-covariate models. In particular, the relationship with Barley is positive in model 1 but negative in model 3. What is going on here?
To investigate, we check which of the covariates has a significant relationship with maize prices in each of the models. We will use the confidence interval method. For this we need the standard errors of the regression coefficients, which can be found by hand or using the summary function, e.g.
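For example, with the fitted objects assumed above, the standard errors appear in the coefficient table of `summary()`:

```r
# The "Std. Error" column of the coefficient table holds the standard errors
summary(model1)$coefficients
```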
so that the standard error for $\hat{\beta}_1$ in model 1 is 0.1919.
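The intervals below can be computed by hand from the estimate and standard error, or directly with `confint()`; a sketch for model 1, again under the assumed column names:

```r
# By hand: estimate +/- t-quantile * standard error
est <- coef(summary(model1))["Barley", "Estimate"]
se  <- coef(summary(model1))["Barley", "Std. Error"]
n   <- nrow(cereal)
est + c(-1, 1) * qt(0.975, df = n - 2) * se

# Equivalently, in one call:
confint(model1, level = 0.95)
```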
For model 1, the 95% confidence interval for $\beta_1$ (barley) is $\hat{\beta}_1 \pm t_{n-2}(0.975)\,\mathrm{se}(\hat{\beta}_1)$.
For model 2, the 95% confidence interval for $\beta_2$ (wheat) is $\hat{\beta}_2 \pm t_{n-2}(0.975)\,\mathrm{se}(\hat{\beta}_2)$.
For model 3, the 95% confidence interval for $\beta_1$ (barley) is $\hat{\beta}_1 \pm t_{n-3}(0.975)\,\mathrm{se}(\hat{\beta}_1)$,
and for $\beta_2$ (wheat) it is $\hat{\beta}_2 \pm t_{n-3}(0.975)\,\mathrm{se}(\hat{\beta}_2)$.
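For the two-covariate model, both intervals can be obtained at once (a sketch, assuming the `model3` fit from above):

```r
# 95% confidence intervals for the intercept and both slopes in model 3
confint(model3, level = 0.95)
```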
We can conclude that
If barley alone is included, then it has a significant relationship with maize price (at the 5% level).
If wheat alone is included, then it has a significant relationship with maize price (at the 5% level).
If both barley and wheat are included, then the relationship with barley is no longer significant (at the 5% level).
Why is this?
The answer comes from looking at the relationship between barley and wheat prices; see Figure 9.1. The sample correlation between these variables is 0.939, indicating a very strong linear relationship. Since their behaviour is so closely related, we do not need both as covariates in the model.
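A plot like Figure 9.1 and the sample correlation can be produced as follows (a sketch, under the assumed column names):

```r
# Scatter plot of barley against wheat prices (cf. Figure 9.1)
plot(cereal$Barley, cereal$Wheat,
     xlab = "Barley price", ylab = "Wheat price")

# Sample correlation between the two covariates
cor(cereal$Barley, cereal$Wheat)
```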
If we do include both, then it is impossible for the model to accurately identify the individual relationships, so we should use either model 1 or model 2. There is no formal statistical way to compare these two models, since neither is nested within the other; one possibility is to select the one with the smallest $p$-value associated with the slope coefficient, i.e. the one with the strongest relationship between the covariate and the response.
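That comparison can be sketched by extracting the slope $p$-values from the two one-covariate fits (assuming the `model1` and `model2` objects and column names from above):

```r
# p-values for the slope coefficients in the two one-covariate models;
# the model with the smaller p-value would be preferred
summary(model1)$coefficients["Barley", "Pr(>|t|)"]
summary(model2)$coefficients["Wheat", "Pr(>|t|)"]
```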