
11.4 Outliers

An outlier is an observed response which does not seem to fit in with the general pattern of the other responses. Outliers may be identified using

  1. A simple plot of the response against the explanatory variable;

  2. Looking for unusually large residuals;

  3. Calculating studentized residuals.

The studentized residual for observation $i$ is defined as

\[ s_i = \frac{\hat{\epsilon}_i}{\hat{\sigma}\sqrt{1-H_{ii}}} \]

where $H_{ii}$ is the $i$-th element on the diagonal of the hat matrix $H = X(X^T X)^{-1} X^T$. The term $\hat{\sigma}\sqrt{1-H_{ii}}$ comes from the sampling distribution of the estimated residuals, the derivation of which is left as a workshop question.
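As a concrete illustration, the studentized residuals can be computed directly from this definition. The following is a minimal sketch, assuming a fitted lm object fit; the helper name studentized is ours, not part of base R.

studentized <- function(fit) {
  e     <- residuals(fit)       # estimated residuals epsilon-hat_i
  h     <- hatvalues(fit)       # leverages, the diagonal elements H_ii
  sigma <- summary(fit)$sigma   # residual standard error sigma-hat
  e / (sigma * sqrt(1 - h))
}

Base R's rstandard() returns the same quantity, so the sketch can be checked against it.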

Remark.

The diagonal terms $H_{ii}$ are referred to as the leverages. The name arises because, as $H_{ii}$ gets closer to one, the fitted value $\hat{\mu}_i$ gets closer to the observed value $y_i$. That is, an observation with a large leverage has a considerable influence on its own fitted value, and consequently on the model fit.
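A small numerical sketch, using made-up data purely for illustration, shows the effect: a point whose $x$ value is far from the others receives a leverage close to one, and its fitted value is pulled close to its observed value.

x <- c(1, 2, 3, 4, 20)           # the fifth x value is extreme
y <- c(1.1, 1.9, 3.2, 3.9, 8.0)
fit <- lm(y ~ x)
hatvalues(fit)                   # H_55 is about 0.98, close to one
fitted(fit)[5]                   # close to the observed y[5] = 8.0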

We can test directly the null hypothesis

\[ H_0: \text{Observation } i \text{ is not an outlier} \]

vs.

\[ H_1: \text{Observation } i \text{ is an outlier} \]

by calculating the test statistic

\[ t_i = s_i \sqrt{\frac{n-p-1}{n-p-s_i^2}}. \]

This is compared to the $t$-distribution with $n-p-1$ degrees of freedom, where $n$ is the number of observations and $p$ the number of regression parameters. We test against a two-tailed alternative. If the test is significant, there is evidence that observation $i$ is an outlier.
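The test can be packaged in a few lines of R. This is a sketch under the notation above; outlier_test is a hypothetical helper name, and base R's rstudent() returns $t_i$ directly.

outlier_test <- function(fit, i) {
  s <- rstandard(fit)[i]                      # studentized residual s_i
  n <- length(residuals(fit))
  p <- length(coef(fit))
  t <- s * sqrt((n - p - 1) / (n - p - s^2))  # test statistic t_i
  2 * (1 - pt(abs(t), df = n - p - 1))        # two-tailed p-value
}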

An alternative definition of $t_i$ is based on fitting the regression model without observation $i$. This reduced model is then used to predict $y_i$, and the difference between the observed and predicted values is calculated. If this difference is small, the observation is unlikely to be an outlier, since it can be predicted well using only the model and the remaining data.
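This leave-one-out idea can be sketched in R as follows, assuming a data frame dat with hypothetical columns y and x; loo_residual is our own helper name.

loo_residual <- function(dat, i) {
  fit_i <- lm(y ~ x, data = dat[-i, ])         # refit without observation i
  pred  <- predict(fit_i, newdata = dat[i, ])  # predict the omitted response
  dat$y[i] - pred                              # observed minus predicted
}

Suitably standardized, this leave-one-out difference leads to the same statistic $t_i$ as above.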

The above discussions focus on identifying outliers, but don’t specify what should be done with them. In practice, we should attempt to find out why the observation is an outlier. This reason will indicate whether the observation can safely be ignored (e.g. it occurred due to measurement error) or whether some additional term should be included in the model to explain it.

Example 11.4.1 Atmospheric pressure

Weisberg (2005), p.4 presents data from an experiment by the physicist James D. Forbes (1857) on the relationship between atmospheric pressure and the temperature at which water boils. The 17 observations, and fitted linear regression model, are plotted in Figure 11.5.

Fig. 11.5: Atmospheric pressure against the boiling point of water, with fitted line $\mathbb{E}[\text{Pressure}] = -81.1 + 0.523\,\text{Temperature}$.

Are any of the observations outliers?

A plot of the residuals against temperature in Figure 11.6 suggests that observation 12 might be an outlier, since its residual is much larger than the rest ($\hat{\epsilon}_{12} = 0.65$).

Fig. 11.6: Residuals from the fitted model against temperature.

To calculate the studentized residuals, we first set up the design matrix $X$ and calculate the hat matrix $H$,

> load("pressures.Rdata")
> n <- length(pressure$Temp)
> X <- cbind(rep(1,n), pressure$Temp)     # design matrix: intercept and Temp
> H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix H = X(X'X)^{-1}X'
> H[12,12]
[1] 0.06393448

We also need the residual standard error

> L <- lm(pressure$Pressure~pressure$Temp)
> summary(L)

From the summary command we see that the estimated residual standard error $\hat{\sigma}$ is 0.2328. Similarly

> L$residuals[12]

gives the residual $\hat{\epsilon}_{12} = 0.65$.

Combining these results, the studentized residual is

\begin{align*}
s_{12} &= \frac{\hat{\epsilon}_{12}}{\hat{\sigma}\sqrt{1-H_{12,12}}} \\
       &= \frac{0.65}{0.2328 \times \sqrt{1-0.0639}} \\
       &= 2.89.
\end{align*}

Since $n=17$ and $p=2$, the test statistic is

\begin{align*}
t_{12} &= 2.89\sqrt{\frac{17-2-1}{17-2-2.89^2}} \\
       &= 4.18.
\end{align*}

The $p$-value to test whether or not observation 12 is an outlier is then

> 2*(1-pt(4.18,df=14))

which is $9.25 \times 10^{-4}$. Since this is extremely small, we conclude that there is evidence that observation 12 is an outlier.
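As a check, the same quantities are available through R's built-in functions, applied to the model L fitted above; these should reproduce the values in the worked calculation.

> hatvalues(L)[12]                        # leverage H_12,12
> rstandard(L)[12]                        # studentized residual s_12
> rstudent(L)[12]                         # test statistic t_12
> 2*(1-pt(abs(rstudent(L)[12]),df=14))    # two-tailed p-value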