An outlier is an observed response which does not seem to fit in with the general pattern of the other responses. Outliers may be identified using
A simple plot of the response against the explanatory variable;
Looking for unusually large residuals;
Calculating studentized residuals.
The studentized residual for observation is defined as
where is the -th element on the diagonal of the hat matrix . The term comes from the sampling distribution of the estimated residuals, the derivation of which is left as a workshop question.
The diagonal terms are referred to as the leverages. This name comes about since, as gets closer to one, so the fitted value gets closer to the observed value . That is an observation with a large leverage will have a considerable influence on its fitted value, and consequently on the model fit.
We can test directly the null hypothesis
vs.
by calculating the test statistic
This is compared to the -distribution with degrees of freedom. We test assuming a two-tailed alternative. If the test is significant, there is evidence that observation is an outlier.
An alternative definition of is based on fitting the regression model, without using observation . This model is then used to predict the observation , and the difference between the observed and predicted values is calculated. If this difference is small, the observation is unlikely to be an outlier as it can be predicted well using only information from the model and the remaining data.
The above discussions focus on identifying outliers, but don’t specify what should be done with them. In practice, we should attempt to find out why the observation is an outlier. This reason will indicate whether the observation can safely be ignored (e.g. it occurred due to measurement error) or whether some additional term should be included in the model to explain it.
Weisberg (2005), p.4 presents data from an experiment by the physicist James D. Forbes (1857) on the relationship between atmospheric pressure and the temperature at which water boils. The 17 observations, and fitted linear regression model, are plotted in Figure 11.5.
Are any of the observations outliers?
A plot of the residuals against temperature in Figure 11.6, suggests that observation 12 might be an outlier, since its residual is much larger than the rest ().
To calculate the standardized residuals, we first set up the design matrix and calculate the hat matrix ,
We also need the residual variance
From the summary command we see that the estimated residual standard error is 0.2328. Similarly
gives the residual .
Combining these results, the studentized residual is
Since and , the test statistic is
The -value to test whether or not observation 12 is an outlier is then
which is . Since this is extremely small, we conclude that there is evidence that observation 12 is an outlier.