11.4 Outliers

An outlier is an observed response which does not seem to fit in with the general pattern of the other responses. Outliers may be identified using

1. A simple plot of the response against the explanatory variable;
2. Looking for unusually large residuals;
3. Calculating studentized residuals.

The studentized residual for observation $i$ is defined as

$\displaystyle s_{i}=\frac{\hat{\epsilon}_{i}}{\hat{\sigma}\sqrt{1-H_{ii}}}$

where $H_{ii}$ is the $i$-th element on the diagonal of the hat matrix $H=X(X^{\prime}X)^{-1}X^{\prime}$. The term $\hat{\sigma}\sqrt{1-H_{ii}}$ comes from the sampling distribution of the estimated residuals, the derivation of which is left as a workshop question.
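The definition can be checked numerically. The following is a minimal sketch on simulated data (the data, seed and variable names are illustrative assumptions, not part of the notes). It builds $H$ from the design matrix, forms $s_{i}$ by hand, and compares the results with R's built-in `hatvalues()` and `rstandard()`.

```r
# Simulated data, purely illustrative (not taken from the notes)
set.seed(1)
n <- 20
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3)

fit <- lm(y ~ x)

# Design matrix and hat matrix H = X (X'X)^{-1} X'
X <- model.matrix(fit)
H <- X %*% solve(t(X) %*% X) %*% t(X)
h <- diag(H)                        # the diagonal elements H_ii

# Studentized residuals s_i = residual_i / (sigma_hat * sqrt(1 - H_ii))
sigma_hat <- summary(fit)$sigma     # residual standard error
s <- residuals(fit) / (sigma_hat * sqrt(1 - h))

# Cross-check against the built-in helpers (both comparisons should be TRUE)
all.equal(h, hatvalues(fit), check.attributes = FALSE)
all.equal(s, rstandard(fit), check.attributes = FALSE)
```

Note that the quantity called $s_{i}$ here is what `rstandard()` returns; R's `rstudent()` returns the statistic $t_{i}$ introduced further down.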

Remark.

The diagonal terms $H_{ii}$ are referred to as the leverages. The name arises because, as $H_{ii}$ gets closer to one, the fitted value $\hat{\mu}_{i}$ gets closer to the observed value $y_{i}$. That is, an observation with a large leverage will have a considerable influence on its fitted value, and consequently on the model fit.
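One way to see why a leverage near one forces the fitted value towards the observation is sketched below. It uses only the facts that $H$ is symmetric and idempotent ($H^{\prime}=H$, $H^{2}=H$), which follow from its definition.

```latex
\hat{\mu}_{i} = \sum_{j=1}^{n} H_{ij}\,y_{j}
             = H_{ii}\,y_{i} + \sum_{j\neq i} H_{ij}\,y_{j},
\qquad
H_{ii} = (H^{2})_{ii} = \sum_{j=1}^{n} H_{ij}^{2}
       = H_{ii}^{2} + \sum_{j\neq i} H_{ij}^{2}.
```

The second identity gives $\sum_{j\neq i}H_{ij}^{2}=H_{ii}(1-H_{ii})$, so $0\leq H_{ii}\leq 1$, and as $H_{ii}$ approaches one the weights $H_{ij}$ on the other observations shrink to zero, leaving $\hat{\mu}_{i}\approx y_{i}$.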

We can test directly the null hypothesis

$\displaystyle H_{0}:\text{Observation }i\text{ is not an outlier}$

vs.

$\displaystyle H_{1}:\text{Observation }i\text{ is an outlier}$

by calculating the test statistic

$\displaystyle t_{i}=s_{i}\sqrt{\left(\frac{n-p-1}{n-p-s_{i}^{2}}\right)}.$

This is compared to the $t$-distribution with $n-p-1$ degrees of freedom, assuming a two-tailed alternative. If the test is significant, there is evidence that observation $i$ is an outlier.
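As a rough illustration of the test, the sketch below computes $t_{i}$ and the two-tailed $p$-value for every observation in a simulated data set. The data, the seed and the artificially shifted observation 7 are assumptions made for the example. The result is compared with R's `rstudent()`, which returns the same statistic.

```r
# Simulated data with one deliberately shifted response (illustrative only)
set.seed(2)
n <- 25
x <- runif(n, 0, 10)
y <- 1 + 0.8 * x + rnorm(n, sd = 0.5)
y[7] <- y[7] + 3                    # hypothetical contaminated observation

fit <- lm(y ~ x)
p <- length(coef(fit))              # number of regression coefficients

# Studentized residuals s_i, then the test statistic t_i
s <- rstandard(fit)
t_stat <- s * sqrt((n - p - 1) / (n - p - s^2))

# Two-tailed p-values from the t-distribution with n - p - 1 df
p_val <- 2 * pt(-abs(t_stat), df = n - p - 1)

# t_i should agree with R's rstudent() (the comparison should be TRUE)
all.equal(t_stat, rstudent(fit))

# Statistics for the shifted observation
round(cbind(s, t_stat, p_val)[7, ], 4)
```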

An alternative definition of $t_{i}$ is based on fitting the regression model without using observation $i$. This model is then used to predict the observation $y_{i}$, and the difference between the observed and predicted values is calculated. If this difference is small, the observation is unlikely to be an outlier, as it can be predicted well using only information from the model and the remaining data.
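The leave-one-out view can be sketched as follows, again on simulated data (the data, the seed and the choice $i=4$ are illustrative assumptions). The model is refitted without observation $i$, the omitted response is predicted, and the prediction error is divided by its estimated standard deviation. The result should coincide with `rstudent()` from the full fit.

```r
# Simulated data, purely illustrative
set.seed(3)
n <- 15
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 1 + 0.8 * dat$x + rnorm(n, sd = 0.5)

i <- 4                               # observation under scrutiny (arbitrary)

fit_all  <- lm(y ~ x, data = dat)        # fit using all observations
fit_drop <- lm(y ~ x, data = dat[-i, ])  # fit without observation i

# Predict y_i from the reduced fit, keeping the standard error of the fit
pr <- predict(fit_drop, newdata = dat[i, , drop = FALSE], se.fit = TRUE)

# Prediction error and its estimated standard deviation:
# residual variance of the reduced fit plus variance of the fitted mean
pred_err <- dat$y[i] - pr$fit
pred_sd  <- sqrt(pr$residual.scale^2 + pr$se.fit^2)

t_loo <- unname(pred_err / pred_sd)

# Should agree with the closed-form t_i from the full fit (TRUE)
all.equal(t_loo, unname(rstudent(fit_all)[i]))
```

The closed-form expression for $t_{i}$ has the practical advantage that no refitting is needed, since everything comes from the single fit to all $n$ observations.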

The above discussions focus on identifying outliers, but don’t specify what should be done with them. In practice, we should attempt to find out why the observation is an outlier. This reason will indicate whether the observation can safely be ignored (e.g. it occurred due to measurement error) or whether some additional term should be included in the model to explain it.

Example 11.4.1 Atmospheric pressure

Weisberg (2005), p.4 presents data from an experiment by the physicist James D. Forbes (1857) on the relationship between atmospheric pressure and the temperature at which water boils. The 17 observations, and fitted linear regression model, are plotted in Figure 11.5.

Fig. 11.5: Atmospheric pressure against the boiling point of water, with fitted line $\mathbb{E}[\text{Pressure}]=-81.1+0.523\,\text{Temperature}$.

Are any of the observations outliers?

The plot of the residuals against temperature in Figure 11.6 suggests that observation 12 might be an outlier, since its residual ($\hat{\epsilon}_{12}=0.65$) is much larger than the rest.

Fig. 11.6: Residuals from the fitted model against temperature.

To calculate the studentized residuals, we first set up the design matrix $X$ and calculate the hat matrix $H$:

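> load("pressures.Rdata")
> n <- length(pressure$Temp)
> # Design matrix: a column of ones (intercept) and the temperatures
> X <- cbind(rep(1, n), pressure$Temp)
> # Hat matrix H = X (X'X)^{-1} X'
> H <- X %*% solve(t(X) %*% X) %*% t(X)
> # Leverage of observation 12
> H[12, 12]
[1] 0.06393448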

We also need the residual standard error $\hat{\sigma}$, which we obtain by fitting the model:

> L <- lm(pressure$Pressure ~ pressure$Temp)
> summary(L)

From the output of the summary command we see that the estimated residual standard error $\hat{\sigma}$ is 0.2328. Similarly,

> L$residuals[12]

gives the residual $\hat{\epsilon}_{12}=0.65$ .

Combining these results, the studentized residual is

	$\displaystyle s_{12}$	$\displaystyle=\frac{\hat{\epsilon}_{12}}{\hat{\sigma}\sqrt{1-H_{12,12}}}$
		$\displaystyle=\frac{0.65}{0.2328\times\sqrt{1-0.0639}}$
		$\displaystyle=2.89.$

Since $n=17$ and $p=2$, the test statistic (evaluated with the unrounded value of $s_{12}$) is

	$\displaystyle t_{12}$	$\displaystyle=2.89\sqrt{\left(\frac{17-2-1}{17-2-2.89^{2}}\right)}$
		$\displaystyle=4.18.$

The $p$ -value to test whether or not observation 12 is an outlier is then

> 2*(1-pt(4.18,df=14))

which is $9.25\times 10^{-4}$. Since this is far below conventional significance levels such as 0.05, we conclude that there is evidence that observation 12 is an outlier.
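The arithmetic of this example can be re-traced directly from the quoted values. The sketch below uses only the numbers reported above ($\hat{\epsilon}_{12}=0.65$, $\hat{\sigma}=0.2328$, $H_{12,12}=0.06393448$), so it does not need the data file. The variable names are illustrative.

```r
# Values quoted in the example (already rounded in the text)
e12       <- 0.65         # residual for observation 12
sigma_hat <- 0.2328       # residual standard error
h12       <- 0.06393448   # leverage H[12, 12]
n <- 17
p <- 2

# Studentized residual and test statistic, carried at full precision
s12 <- e12 / (sigma_hat * sqrt(1 - h12))
t12 <- s12 * sqrt((n - p - 1) / (n - p - s12^2))

# Two-tailed p-value on n - p - 1 = 14 degrees of freedom
p12 <- 2 * pt(-abs(t12), df = n - p - 1)

round(c(s12 = s12, t12 = t12), 2)
signif(p12, 3)
```

Carrying $s_{12}$ at full precision reproduces the values 2.89 and 4.18. Substituting the rounded 2.89 into the formula for $t_{12}$ gives a slightly larger value, roughly 4.19, which does not change the conclusion. With the fitted model `L` from the example, `rstandard(L)[12]` and `rstudent(L)[12]` should return the same two statistics, up to rounding of the inputs.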