9 Explanatory variables: some interesting issues

9.1 Collinearity

Collinearity arises when there is linear dependence (strong correlation) between two or more explanatory variables. We say that two explanatory variables $x_i$ and $x_j$ are

  • Orthogonal if $\mathrm{corr}(x_i, x_j)$ is close to zero;

  • Collinear if $\mathrm{corr}(x_i, x_j)$ is close to 1 in absolute value.

Collinearity is undesirable because it means that the matrix $X^{\top}X$ is ill-conditioned, so that inversion of $X^{\top}X$ is numerically unstable. It can also make results difficult to interpret.
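The numerical instability can be seen in a small simulated sketch. The data, seed and variable names below are illustrative assumptions and are not part of the cereal example; the code compares the condition number of $X^{\top}X$ for a nearly collinear pair of covariates with that for an independently generated pair.

```r
# Illustrative simulation only: the data, seed and names are assumptions,
# not part of the cereal example.
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2_collinear  <- x1 + rnorm(n, sd = 0.05)  # almost a copy of x1
x2_orthogonal <- rnorm(n)                  # generated independently of x1

# Design matrices with an intercept column
X_col <- cbind(1, x1, x2_collinear)
X_ort <- cbind(1, x1, x2_orthogonal)

# Exact 2-norm condition number of X^T X; large values signal ill conditioning
kappa_col <- kappa(crossprod(X_col), exact = TRUE)
kappa_ort <- kappa(crossprod(X_ort), exact = TRUE)

cor(x1, x2_collinear)   # very close to 1
cor(x1, x2_orthogonal)  # typically small in absolute value
c(collinear = kappa_col, orthogonal = kappa_ort)
```

The condition number for the collinear design should be larger by orders of magnitude, which is what makes the inverse sensitive to small changes in the data.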

Example 9.1.1 Cereal prices

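In Example 6.4.4, we related annual maize prices, $Y_i$, to annual prices of barley, $x_{i,1}$, and wheat, $x_{i,2}$. Consider the three models:

  1. $\mathbb{E}[Y_i] = \beta_1 + \beta_2 x_{i,1}$,

  2. $\mathbb{E}[Y_i] = \beta_1 + \beta_2 x_{i,2}$,

  3. $\mathbb{E}[Y_i] = \beta_1 + \beta_2 x_{i,1} + \beta_3 x_{i,2}$.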

We fit the models using R. First load the data set cereal, and see what it contains,

> names(cereal)
[1] "Year"   "Barley" "Cotton" "Maize"  "Rice"   "Wheat"

To fit the three models,

> lm1 <- lm(cereal$Maize~cereal$Barley)
> lm2 <- lm(cereal$Maize~cereal$Wheat)
> lm3 <- lm(cereal$Maize~cereal$Barley+cereal$Wheat)

Now let’s look at the estimated coefficients in each model.

> lm1$coefficients
(Intercept) cereal$Barley
  -9.484660      1.085748
> lm2$coefficients
(Intercept) cereal$Wheat
-30.8254882    0.9491281
> lm3$coefficients
(Intercept) cereal$Barley  cereal$Wheat
-25.6646279    -0.5095537     1.3207563

In the two-covariate model, the coefficients for both barley and wheat are considerably different from the corresponding estimates obtained in the two one-covariate models. In particular, the relationship with barley is positive in model 1 and negative in model 3. What is going on here?

To investigate, we check which of the covariates has a significant relationship with maize prices in each of the models. We will use the confidence interval method. For this we need the standard errors of the regression coefficients, which can be found by hand or using the summary function, e.g.

> summary(lm1)
Call:
lm(formula = cereal$Maize ~ cereal$Barley)
Residuals:
     Min       1Q   Median       3Q      Max
-106.401  -21.731   -5.482   21.282   89.921
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    -9.4847    32.2742  -0.294    0.772
cereal$Barley   1.0857     0.1919   5.657 1.88e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 45.22 on 19 degrees of freedom
Multiple R-squared: 0.6274,     Adjusted R-squared: 0.6078
F-statistic:    32 on 1 and 19 DF,  p-value: 1.875e-05

so that the standard error for β2 in model 1 is 0.1919.

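For model 1, the 95% confidence interval for $\beta_2$ (barley) is

$$\hat{\beta}_2 \pm t_{21-2}(0.975) \times \mathrm{se}(\hat{\beta}_2) = 1.0857 \pm 2.093 \times 0.1919 = (0.684, 1.487).$$

For model 2, the 95% confidence interval for $\beta_2$ (wheat) is

$$0.949 \pm 2.093 \times 0.1109 = (0.717, 1.181).$$

For model 3, the 95% confidence interval for $\beta_2$ (barley) is

$$\hat{\beta}_2 \pm t_{21-3}(0.975) \times \mathrm{se}(\hat{\beta}_2) = -0.510 \pm 2.101 \times 0.4076 = (-1.366, 0.347),$$

and for $\beta_3$ (wheat) it is

$$1.321 \pm 2.101 \times 0.3167 = (0.655, 1.986).$$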
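The four intervals can be reproduced from the quoted estimates and standard errors alone. The following is a minimal sketch: the helper `ci` and the object names are hypothetical, and the inputs are the rounded values printed above rather than the fitted model objects.

```r
# Estimates and standard errors as quoted in the text (rounded values)
n <- 21  # number of observations; df = n - (number of coefficients)

# Hypothetical helper: estimate +/- t quantile * standard error
ci <- function(estimate, se, df, level = 0.95) {
  tq <- qt(1 - (1 - level) / 2, df)
  c(lower = estimate - tq * se, upper = estimate + tq * se)
}

ci_m1_barley <- ci(1.0857, 0.1919, df = n - 2)
ci_m2_wheat  <- ci(0.9491, 0.1109, df = n - 2)
ci_m3_barley <- ci(-0.5096, 0.4076, df = n - 3)
ci_m3_wheat  <- ci(1.3208, 0.3167, df = n - 3)

round(rbind(ci_m1_barley, ci_m2_wheat, ci_m3_barley, ci_m3_wheat), 3)
```

With the fitted objects available, `confint(lm1)`, `confint(lm2)` and `confint(lm3)` should give the same intervals up to rounding.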

We can conclude that

  • If barley alone is included, then it has a significant relationship with maize price (at the 5% level).

  • If wheat alone is included, then it has a significant relationship with maize price (at the 5% level).

  • If both barley and wheat are included, then the relationship with barley is no longer significant (at the 5% level).

Why is this?

The answer lies in the relationship between barley and wheat prices; see Figure 9.1. The sample correlation between these variables is 0.939, indicating a very strong linear relationship. Since their behaviour is so closely related, we do not need both as covariates in the model.
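One way to gauge the cost of this correlation is the factor $1/(1 - r^2)$, by which the sampling variance of either slope in a two-covariate model is inflated relative to uncorrelated covariates with the same spread and the same error variance. This factor (often called the variance inflation factor) is not used in the text; the sketch below simply evaluates it for the quoted correlation, and the object names are illustrative.

```r
r <- 0.939                 # sample correlation between barley and wheat (from the text)
vif <- 1 / (1 - r^2)       # variance inflation for either slope
se_inflation <- sqrt(vif)  # corresponding inflation of the standard error
round(c(vif = vif, se_inflation = se_inflation), 2)
```

The standard errors actually observed in models 1 and 3 need not differ by exactly this factor, because the residual variance also changes when a covariate is added.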

If we do include both, the model cannot reliably separate the individual relationships, which is why the barley coefficient changes sign and its standard error increases. We should use either model 1 or model 2. Neither model is a special case of the other, so they cannot be compared by testing whether an extra coefficient is zero; one possibility is to select the one with the smaller p-value associated with $\beta_2$, i.e. the one with the stronger relationship between the covariate and the response.
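The p-value rule can be sketched from the quoted estimates and standard errors. The inputs are the rounded values printed earlier, and the object names are illustrative assumptions.

```r
# Rounded estimates and standard errors quoted in the text
df <- 21 - 2                 # residual degrees of freedom in models 1 and 2
t_barley <- 1.0857 / 0.1919  # t statistic for the slope in model 1
t_wheat  <- 0.9491 / 0.1109  # t statistic for the slope in model 2

# Two-sided p-values for H0: beta_2 = 0
p_barley <- 2 * pt(-abs(t_barley), df)
p_wheat  <- 2 * pt(-abs(t_wheat), df)

c(barley = p_barley, wheat = p_wheat)

# Name of the model with the smaller p-value
names(which.min(c(model1 = p_barley, model2 = p_wheat)))
```

For these values the wheat slope has the larger t statistic on the same degrees of freedom, so this rule picks model 2. With the fitted objects available, the same p-values can be read from `summary(lm1)$coefficients[2, 4]` and `summary(lm2)$coefficients[2, 4]`.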

Fig. 9.1: Forecasts of annual prices for wheat against barley. Prices are in dollars per tonne.