The first data set examined in this workshop comes from a survey of pupils from three schools, investigating which explanatory variables are important in predicting how many books the pupils read outside of school over the last year.
Book <- read.table("Book.dat")
The data set contains the following variables:
books How many books have you read in the last year out of school?
school School of the pupil: Sch1, Sch2 or Sch3.
gender Gender of the pupil.
math Standardised maths test scores.
eng Standardised English test scores.
The response variable books is a count variable and so we shall consider a Poisson regression model. To determine which of the explanatory variables should be included in defining the linear predictor, we shall adopt the forward variable selection technique.
Step 0: Define our current best description of the data by the null model. We can fit this using the glm command as follows.
M0 <- glm(books ~ 1, family = poisson, data = Book)
deviance(M0)    # 801.52
df.residual(M0) # 236
The command deviance returns the residual deviance of the fitted model and df.residual returns the corresponding degrees of freedom. Note that this information can also be obtained from the summary of the model fit.
Step 1: Define and fit a set of models that includes each of the explanatory variables and perform a deviance hypothesis test where the null hypothesis is defined by our current best model.
For example, the following extract includes school into the current best model (i.e. the null model) and evaluates the residual deviance and degree of freedom of the fit.
M1 <- glm(books ~ 1 + school, family = poisson, data = Book)
deviance(M1)    # 351.10
df.residual(M1) # 234
For this we can perform a deviance hypothesis test by evaluating the test statistic 801.52 - 351.10 = 450.42 on 236 - 234 = 2 degrees of freedom. With a critical value of 5.99 (the 95% point of the chi-squared distribution on 2 degrees of freedom), the test statistic far exceeds the critical value and so we reject the null hypothesis in favour of the alternative.
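Rather than looking up chi-squared tables, the critical value and p-value can be computed directly in R. A small sketch using the deviances reported above:

```r
## Deviance test of M0 (null) vs M1 (school), values from the fits above
test_stat <- 801.52 - 351.10  # 450.42
df_diff <- 236 - 234          # 2
qchisq(0.95, df = df_diff)                          # critical value: 5.99
pchisq(test_stat, df = df_diff, lower.tail = FALSE) # p-value: effectively 0
```

The same comparison can be carried out in one step with anova(M0, M1, test = "Chisq").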
Complete the following table for the other explanatory variables:
| Model | Deviance | df. | Test Stat. | df. | Accept/Reject |
|---|---|---|---|---|---|
| Null Model | 801.52 | 236 | – | – | – |
| 1 + school | 351.10 | 234 | 450.42 | 2 | Reject |
| 1 + gender | | | | | |
| 1 + math | | | | | |
| 1 + eng | | | | | |
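One way to fill in the table is to fit each single-variable model in a loop. A sketch, assuming Book and M0 from the extracts above:

```r
## Fit each candidate single-variable model and report the quantities
## needed for the table (reformulate builds e.g. books ~ school)
for (v in c("school", "gender", "math", "eng")) {
  M <- glm(reformulate(v, response = "books"), family = poisson, data = Book)
  cat(v, "| deviance:", round(deviance(M), 2),
      "| df:", df.residual(M),
      "| test stat:", round(deviance(M0) - deviance(M), 2), "\n")
}
```

Alternatively, add1(M0, scope = ~ school + gender + math + eng, test = "LRT") produces the same set of comparisons in a single call.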
Step 2: Once all of the models have been fitted and the hypothesis tests performed, take as our new current best model the one with the largest test statistic among those where the null hypothesis was rejected. In this example we take the model with linear predictor 1 + eng as our current best model.
Step 3: Repeat the same process by adding one explanatory variable at a time into the current best model that is not already being used and perform a deviance hypothesis test to assess its significance.
Complete the following table.
| Model | Deviance | df. | Test Stat. | df. | Accept/Reject |
|---|---|---|---|---|---|
| 1 + eng | 338.03 | 235 | – | – | – |
| 1 + eng + school | | | | | |
| 1 + eng + gender | | | | | |
| 1 + eng + math | | | | | |
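These comparisons can also be generated with add1, which refits the current best model with each remaining term added in turn. A sketch, assuming Book from the extracts above:

```r
## Current best model from the previous step
M_eng <- glm(books ~ 1 + eng, family = poisson, data = Book)
## Likelihood-ratio (deviance) test for each candidate addition
add1(M_eng, scope = ~ eng + school + gender + math, test = "LRT")
```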
Step 4: Repeat the same process again by adding the remaining explanatory variables into the model one at a time to assess their significance in describing the variability in the model. Continue to do so until you are unable to reject the null hypothesis for any new explanatory variable.
Verify that including gender or math into the current best additive model does not significantly further reduce the deviance.
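This can be checked directly with anova, which performs the deviance test between nested fits. A sketch, assuming Book from the extracts above:

```r
M_es <- glm(books ~ 1 + eng + school, family = poisson, data = Book)
## Deviance tests for adding gender or math to the current best model
anova(M_es, update(M_es, . ~ . + gender), test = "Chisq")
anova(M_es, update(M_es, . ~ . + math), test = "Chisq")
```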
From the forward selection procedure above, we have our best additive model books ~ 1 + eng + school with a residual deviance of 238.94 on 233 degrees of freedom. We now wish to assess whether the interaction between eng and school is significant in further reducing the residual deviance.
Define and fit the interaction model as follows:
M_int <- glm(books ~ 1 + eng + school + eng:school, family = poisson, data = Book)
deviance(M_int)    # 238.64
df.residual(M_int) # 231
The deviance test statistic for including the interaction term is 238.94 - 238.64 = 0.30 on 233 - 231 = 2 degrees of freedom. Comparing the test statistic to the critical value of 5.99, there is insufficient evidence to reject the null hypothesis at the 5% significance level. We therefore conclude that the interaction term should not be included.
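The conclusion can also be checked numerically: the p-value for a test statistic of 0.30 on 2 degrees of freedom is far above 0.05.

```r
## p-value of the deviance test for the interaction term
pchisq(0.30, df = 2, lower.tail = FALSE) # 0.8607
```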
Given our best model for predicting the number of books a pupil reads in a year, it is then important to verify that the fit is good and that the assumptions underpinning the model are satisfied.
M_best <- glm(books ~ 1 + eng + school, family = poisson, data = Book)
summary(M_best)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.92124    0.08783  10.489  < 2e-16 ***
eng          0.50075    0.04678  10.705  < 2e-16 ***
schoolSch2  -0.51954    0.13314  -3.902 9.54e-05 ***
schoolSch3   0.54290    0.12513   4.339 1.43e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Null deviance: 801.52  on 236  degrees of freedom
Residual deviance: 238.94  on 233  degrees of freedom
From the summary, we see that all of the explanatory variables are significant components of the linear predictor. Furthermore, the residual deviance of 238.94 is smaller than the 95% critical value of the chi-squared distribution on 233 degrees of freedom (approximately 269.6), and so we can conclude that the model provides an adequate description of the variability in the data.
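The goodness-of-fit comparison can be computed directly from the residual deviance and degrees of freedom reported in the summary:

```r
## 95% critical value of the chi-squared distribution on 233 df
qchisq(0.95, df = 233)
## p-value of the deviance goodness-of-fit test
pchisq(238.94, df = 233, lower.tail = FALSE)
```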
In addition to checking the estimates and deviance, we may also want to assess the structure of the residuals. For a large sample size, the distribution of the Pearson residuals is approximately standard Normal. We can calculate the Pearson residuals for our best model using the resid command with the additional argument type="pearson":
p_resid <- resid(M_best, type = "pearson")
The following extract provides a visual check of the distribution of the residuals, presented in the figure below, by creating a histogram and a QQ-plot against the theoretical standard normal distribution.
par(mfrow = c(1, 2))
hist(p_resid)
qqnorm(p_resid)      ## QQ-plot against the normal distribution
grid()
abline(a = 0, b = 1) ## add a 1:1 diagonal line