Collinearity arises when there is linear dependence (strong correlation) between two or more explanatory variables. We say that two explanatory variables $x_j$ and $x_k$ are
Orthogonal if $\mathrm{corr}(x_j, x_k)$ is close to zero;
Collinear if $|\mathrm{corr}(x_j, x_k)|$ is close to 1.
Collinearity is undesirable because it means that the matrix $X^T X$ is ill-conditioned, so inversion of $X^T X$ is numerically unstable. It can also make results difficult to interpret.
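A minimal sketch of this numerical instability (simulated data, not the cereal example): when two columns of the design matrix are nearly collinear, the condition number of $X^T X$ explodes.

```r
# Illustration: two nearly collinear covariates make t(X) %*% X ill-conditioned
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)  # x2 is almost a copy of x1
X  <- cbind(1, x1, x2)            # design matrix with intercept column
kappa(t(X) %*% X)                 # condition number is very large
```

A large condition number means small changes in the data produce large changes in the estimated coefficients.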
In Example 6.4.4, we related annual maize prices, $M_t$, to annual prices of barley, $B_t$, and wheat, $W_t$. Consider the three models:
Model 1: $M_t = \beta_0 + \beta_1 B_t + \varepsilon_t$,
Model 2: $M_t = \beta_0 + \beta_2 W_t + \varepsilon_t$,
Model 3: $M_t = \beta_0 + \beta_1 B_t + \beta_2 W_t + \varepsilon_t$.
We fit the models using R. First load the data set cereal, and see what it contains,
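A sketch of the loading step, assuming the data arrive as an `.RData` file containing a data frame `cereal` with columns `Maize`, `Barley` and `Wheat` (the file name and column names are assumptions, not taken from the original):

```r
# Load the cereal data set (assumed to be stored as an .RData file)
load("cereal.RData")

# See what it contains: variable names, types, and the first few rows
str(cereal)
head(cereal)
```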
To fit the three models,
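The three fits can be sketched with `lm()`, under the same assumed column names:

```r
# Model 1: maize price on barley price only
model1 <- lm(Maize ~ Barley, data = cereal)
# Model 2: maize price on wheat price only
model2 <- lm(Maize ~ Wheat, data = cereal)
# Model 3: maize price on both barley and wheat prices
model3 <- lm(Maize ~ Barley + Wheat, data = cereal)
```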
Now let’s look at the estimated coefficients in each model.
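Assuming the fitted objects `model1`, `model2`, `model3` from above, the estimated coefficients can be extracted with `coef()`:

```r
# Estimated intercept and slope(s) for each model
coef(model1)
coef(model2)
coef(model3)
```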
In the two-covariate model, the coefficients for both Barley and Wheat are considerably different from the equivalent estimates obtained in the two one-covariate models. In particular, the relationship with Barley is positive in model 1 but negative in model 3. What is going on here?
To investigate, we check which of the covariates has a significant relationship with maize prices in each of the models. We will use the confidence interval method. For this we need the standard errors of the regression coefficients, which can be found by hand or using the summary function, e.g.
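For example, with the fitted objects assumed above, the standard errors appear in the coefficient table of `summary()`:

```r
# The "Std. Error" column of the coefficient table holds the standard errors
summary(model1)$coefficients
```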
so that the standard error for $\hat{\beta}_1$ in model 1 is 0.1919.
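The intervals below can be computed by hand from the estimate and standard error, or directly with `confint()`; a sketch for model 1, again under the assumed column names:

```r
# By hand: estimate +/- t-quantile * standard error
est <- coef(summary(model1))["Barley", "Estimate"]
se  <- coef(summary(model1))["Barley", "Std. Error"]
n   <- nrow(cereal)
est + c(-1, 1) * qt(0.975, df = n - 2) * se

# Equivalently, in one call:
confint(model1, level = 0.95)
```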
For model 1, the 95% confidence interval for $\beta_1$ (barley) is $\hat{\beta}_1 \pm t_{n-2}(0.975)\,\mathrm{se}(\hat{\beta}_1)$.
For model 2, the 95% confidence interval for $\beta_2$ (wheat) is $\hat{\beta}_2 \pm t_{n-2}(0.975)\,\mathrm{se}(\hat{\beta}_2)$.
For model 3, the 95% confidence interval for $\beta_1$ (barley) is $\hat{\beta}_1 \pm t_{n-3}(0.975)\,\mathrm{se}(\hat{\beta}_1)$,
and for $\beta_2$ (wheat) it is $\hat{\beta}_2 \pm t_{n-3}(0.975)\,\mathrm{se}(\hat{\beta}_2)$.
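For the two-covariate model, both intervals can be obtained at once (a sketch, assuming the `model3` fit from above):

```r
# 95% confidence intervals for the intercept and both slopes in model 3
confint(model3, level = 0.95)
```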
We can conclude that
If barley alone is included, then it has a significant relationship with maize price (at the 5% level).
If wheat alone is included, then it has a significant relationship with maize price (at the 5% level).
If both barley and wheat are included, then the relationship with barley is no longer significant (at the 5% level).
Why is this?
The answer comes from looking at the relationship between barley and wheat prices; see Figure 9.1. The sample correlation between these variables is 0.939, indicating a very strong linear relationship. Since their behaviour is so closely related, we do not need both as covariates in the model.
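A plot like Figure 9.1 and the sample correlation can be produced as follows (a sketch, under the assumed column names):

```r
# Scatter plot of barley against wheat prices (cf. Figure 9.1)
plot(cereal$Barley, cereal$Wheat,
     xlab = "Barley price", ylab = "Wheat price")

# Sample correlation between the two covariates
cor(cereal$Barley, cereal$Wheat)
```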
If we do include both, then it is impossible for the model to accurately identify the individual relationships, so we should use either model 1 or model 2. There is no formal statistical way to compare these two models, since neither is nested within the other; one possibility is to select the one with the smallest $p$-value associated with the slope coefficient, i.e. the one with the strongest relationship between the covariate and the response.
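That comparison can be sketched by extracting the slope $p$-values from the two one-covariate fits (assuming the `model1` and `model2` objects and column names from above):

```r
# p-values for the slope coefficients in the two one-covariate models;
# the model with the smaller p-value would be preferred
summary(model1)$coefficients["Barley", "Pr(>|t|)"]
summary(model2)$coefficients["Wheat", "Pr(>|t|)"]
```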