Covariate selection is the process of deciding which covariates explain a significant amount of the variability in the response.
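Two nested linear regression models can be compared using an $F$-test. Take two models
$Y = X_1 \beta_1 + \epsilon$,
$Y = X_2 \beta_2 + \epsilon$,
where the design matrices $X_1$ and $X_2$ have the same number of rows, but the number of columns in $X_1$ is less than that in $X_2$. Provided $X_1$ and $X_2$ are of full rank, the first model is nested in the second if the column space of $X_1$ is a subspace of the column space of $X_2$.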
A simple example of nesting is when the more complicated model contains all the explanatory variables in the first model, plus one or more additional ones.
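Introducing extra explanatory variables can never increase the residual sum of squares, and in practice almost always reduces it, so a smaller residual sum of squares is not by itself evidence that the more complicated model is better.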
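To compare two nested models with residual sums of squares $RSS_1$ (simpler model, $p_1$ parameters) and $RSS_2$ (more complicated model, $p_2 > p_1$ parameters), both fitted to the same $n$ observations, calculate
$$F = \frac{(RSS_1 - RSS_2)/(p_2 - p_1)}{RSS_2/(n - p_2)}$$
and compare this to the $F_{p_2 - p_1,\, n - p_2}$ distribution.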
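The comparison of two nested models can be sketched numerically as follows. This is a minimal illustration on simulated data, assuming NumPy is available; the helper names `rss` and `nested_f_statistic`, the variable names, and the simulated covariates are all hypothetical and not part of the notes.

```python
import numpy as np


def rss(X, y):
    """Residual sum of squares from the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def nested_f_statistic(X1, X2, y):
    """F statistic for the simpler model X1 against the larger model X2.

    Assumes the column space of X1 is contained in that of X2 and that
    both matrices have full column rank.
    """
    n, p1 = X1.shape
    p2 = X2.shape[1]
    rss1, rss2 = rss(X1, y), rss(X2, y)
    df1, df2 = p2 - p1, n - p2
    f = ((rss1 - rss2) / df1) / (rss2 / df2)
    return f, df1, df2


# Simulated data (an assumption for illustration): y depends on x1 only.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x1])        # simpler model
X2 = np.column_stack([np.ones(n), x1, x2])    # adds one extra covariate

f_stat, df1, df2 = nested_f_statistic(X1, X2, y)
print(f"F = {f_stat:.3f} on ({df1}, {df2}) degrees of freedom")
```

The statistic would then be compared with the upper tail of the F distribution on `df1` and `df2` degrees of freedom, either from tables or, if SciPy is installed, with `scipy.stats.f.sf(f_stat, df1, df2)`.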
The one-way ANOVA can be shown to be a special case of the multiple linear regression model, in which all the explanatory variables are taken to be indicator functions denoting membership of each of the groups.
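A small numerical check of this equivalence, offered as a sketch rather than a proof: with one indicator column per group, the least-squares coefficients are the group means, and the nested-model F statistic against the intercept-only model matches the classical between/within ANOVA ratio. The data values and variable names below are made up for illustration.

```python
import numpy as np

# Hypothetical data: three groups of three observations each.
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y = np.array([4.1, 3.9, 4.3, 5.2, 5.0, 5.5, 6.1, 5.8, 6.4])
k = 3
n = len(y)

# Design matrix of indicator functions: column g is 1 for members of group g.
X = (groups[:, None] == np.arange(k)).astype(float)

# Regression fit: the coefficients are the group means.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
group_means = np.array([y[groups == g].mean() for g in range(k)])

# F statistic from the nested comparison with the intercept-only model.
rss_full = float(np.sum((y - X @ beta) ** 2))
rss_null = float(np.sum((y - y.mean()) ** 2))
f_regression = ((rss_null - rss_full) / (k - 1)) / (rss_full / (n - k))

# Classical one-way ANOVA F statistic.
ss_between = sum(
    np.sum(groups == g) * (group_means[g] - y.mean()) ** 2 for g in range(k)
)
ss_within = sum(
    np.sum((y[groups == g] - group_means[g]) ** 2) for g in range(k)
)
f_anova = (ss_between / (k - 1)) / (ss_within / (n - k))

print("coefficients:", beta)
print("group means: ", group_means)
print("F (regression):", f_regression, " F (ANOVA):", f_anova)
```

The intercept-only model is nested in the indicator model because the column of ones is the sum of the indicator columns.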
Even if a valid statistical method, such as the $F$-test, has been used to select our preferred linear model, checks should still be made to ensure that this model fits the data well. After all, it could be that all the models that we tried to fit actually described the data poorly, so that we have just made the best of a bad job. If this is the case, we need to go back and think about the underlying physical processes generating the data to suggest a better model.
Diagnostics refers to a set of tools which can be used to check how well the model describes (or fits) the data. We will use diagnostics to check that:
The estimated residuals follow a $N(0, \sigma^2)$ distribution;
The estimated residuals are independent of the covariates used in the model;
The estimated residuals are independent of the fitted values $\hat{y}_i$;
None of the observations is an outlier (perhaps due to measurement error);
No observation has undue influence on the estimated model parameters.
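The quantities usually inspected for these checks can be computed directly. The sketch below is one common choice, not necessarily the approach these notes go on to use: standardised residuals for the distributional and outlier checks, leverages from the hat matrix, and Cook's distance as a measure of influence. The function name `diagnostics`, the simulated data, and the planted outlier are assumptions made for illustration.

```python
import numpy as np


def diagnostics(X, y):
    """Fitted values, standardised residuals, leverages and Cook's distances."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted

    # Leverages are the diagonal of the hat matrix H = Q Q^T,
    # where X = QR is the reduced QR decomposition.
    Q, _ = np.linalg.qr(X)
    leverage = np.sum(Q ** 2, axis=1)

    # Unbiased estimate of the error variance.
    s2 = resid @ resid / (n - p)

    # Standardised residuals have roughly unit variance under the model.
    std_resid = resid / np.sqrt(s2 * (1.0 - leverage))

    # Cook's distance combines residual size and leverage.
    cooks = std_resid ** 2 * leverage / (p * (1.0 - leverage))
    return fitted, std_resid, leverage, cooks


# Simulated straight-line data with one planted outlier (illustrative only).
rng = np.random.default_rng(1)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=n)
y[5] += 10.0

X = np.column_stack([np.ones(n), x])
fitted, std_resid, leverage, cooks = diagnostics(X, y)

worst = int(np.argmax(np.abs(std_resid)))
print("largest standardised residual at observation", worst)
print("largest Cook's distance at observation", int(np.argmax(cooks)))
```

In practice the standardised residuals would be plotted against each covariate and against the fitted values to look for patterns, and in a normal quantile plot to assess the distributional assumption. A simple correlation is not informative here, since least-squares residuals are uncorrelated with the fitted values by construction.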