7 Model inference

7.3 Analysis of deviance

Recall the definition of the residual deviance of a fitted GLM ℳ, with fitted means 𝝁^ obtained by maximising the log-likelihood ℓ over g(𝝁)=𝜼:

dev(ℳ)=2{ℓ(𝐲)-ℓ(𝝁^)}.

Consider two models, ℳ and 𝒩, for the predictor 𝜼, where 𝒩 is nested in ℳ, so that 𝒩⊂ℳ. The dimensions of these subspaces are p and q respectively.

Definition 7.3.1.

Nesting means that q<p, and that there are p-q fewer parameters in 𝒩, which can be obtained by setting p-q constraints on the parameters of model ℳ.

For example, consider the linear predictor for model ℳ:

𝜼=β0+β1𝐱1+β2𝐱2.

Model 𝒩 is nested within model ℳ under the constraint β2=0, so that:

𝜼𝒩=β0+β1𝐱1.

Constraints need not set a coefficient equal to zero; they may instead arise from some relationship between the parameters. For example, β1=β2, so that:

𝜼𝒩=β0+β1𝐱1+β1𝐱2=β0+β1(𝐱1+𝐱2).

The deviance difference, or reduction in deviance, in fitting ℳ rather than 𝒩 is

dev(𝒩)-dev(ℳ)=2{ℓ(𝝁^ℳ)-ℓ(𝝁^𝒩)}.

This is simply twice the logarithm of the likelihood ratio between the two models.

This motivates the deviance hypothesis test with null and alternative hypotheses:

H0: 𝜼∈𝒩  vs.  H1: 𝜼∈ℳ.

Suppose that the true model is 𝒩, the simpler model. Then under the null hypothesis the following asymptotic result holds:

X²=dev(𝒩)-dev(ℳ)∼χ²p-q.

This is the exact distribution for normal models with known variance σ2. For other GLMs it is a consequence of asymptotic likelihood theory.

If the deviance difference is large relative to a χ²p-q distribution, then there is evidence that 𝒩 is not true and we reject the null hypothesis in favour of the more complex model ℳ.

If the deviance difference is lower than the critical value from the χ²p-q distribution, then there is insufficient evidence to distinguish between models 𝒩 and ℳ, and so we would favour the simpler model 𝒩 for reasons of parsimony.
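As a sketch, the decision rule above can be computed numerically in R. The deviances and dimensions here are hypothetical placeholder values, not taken from any data set:

```r
## Hypothetical residual deviances for nested models N (q = 2) and M (p = 4)
devN <- 12.7   # residual deviance of the simpler model N
devM <- 5.2    # residual deviance of the larger model M
p <- 4; q <- 2

X2   <- devN - devM                 # deviance difference
crit <- qchisq(0.95, df = p - q)    # 5% critical value from chi-squared on p-q df
pval <- 1 - pchisq(X2, df = p - q)  # equivalently, the p-value of the test

X2 > crit   # TRUE here: reject H0 in favour of the larger model M
```

Comparing X² to the critical value and comparing the p-value to the significance level are equivalent; `pchisq` and `qchisq` are the chi-squared distribution and quantile functions in base R.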

7.3.1 Testing for interactions

Testing for no interaction is equivalent to comparing the fit of 𝜼X1.X2 with the fit of 𝜼X1+X2. In this case we can propose the hypotheses:

H0:  𝜼 = β0+β1𝐱1+β2𝐱2  (X1+X2: no interaction)
H1:  𝜼 = β0+β1𝐱1+β2𝐱2+β3𝐱1.𝐱2  (X1.X2: interaction)

The additive model is nested within the interaction model, as we can apply the constraint β3=0, i.e. setting all coefficients of the interaction terms to zero. Whether the interaction is an important component of the model can be assessed by examining the difference in deviance between the two models.

To illustrate, consider the analgesic data set discussed earlier. We have already fitted the additive model as mod1. From the summary, its residual deviance is 4.2588 on 9 degrees of freedom. The following R extract fits the interaction model, where : specifies an interaction between explanatory variables.

mod2 <- glm(cbind(number, total-number) ~ 1 + ldose + drug + ldose:drug,
  family = binomial(link = "logit"), data = analgesic)
summary(mod2)

The summary of the interaction model returns:

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.65114  -0.28717   0.00848   0.22198   1.11074

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.09676    0.36498  -5.745  9.2e-09 ***
ldose         4.45805    0.65968   6.758  1.4e-11 ***
drugMo        0.03155    0.46967   0.067  0.94643
drugPe       -1.89632    0.71546  -2.650  0.00804 **
drugPh        2.00192    0.39918   5.015  5.3e-07 ***
ldose:drugMo -0.81621    0.83906  -0.973  0.33067
ldose:drugPe -0.61844    0.88482  -0.699  0.48458
ldose:drugPh  0.08731    0.92566   0.094  0.92485
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 249.9579  on 13  degrees of freedom
Residual deviance:   2.5361  on  6  degrees of freedom
AIC: 83.757

Figure 7.2: Fitted curves of the analgesic data for the main effect model (left) and the interaction model (right)

Figure 7.2 presents the fits under the null and alternative hypotheses. The deviance difference between the two models is:

X²=dev(H0)-dev(H1)=4.2588-2.5361=1.7227

The difference in degrees of freedom is df(H0)-df(H1)=9-6=3. The 5% critical value from the χ²3 distribution is 7.81. The test statistic X² is less than the critical value, so we fail to reject the null hypothesis at the 5% significance level. In other words, there is little difference between the two proposed models (as can be seen in Figure 7.2), and so we select the simpler additive model without interactions for reasons of parsimony.
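In R this comparison can be carried out directly with `anova()`, which tabulates the deviance difference between nested fits and, with `test = "Chisq"`, the corresponding p-value. This assumes mod1 and mod2 have been fitted to the analgesic data as above:

```r
## Deviance test of the additive model against the interaction model
## (assumes mod1 and mod2 fitted to the analgesic data as above)
anova(mod1, mod2, test = "Chisq")

## Or compute the p-value of the test by hand:
1 - pchisq(4.2588 - 2.5361, df = 9 - 6)   # approx 0.63: fail to reject H0
```

The large p-value matches the conclusion reached via the critical value.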

7.3.2 Variable selection

Of all of the explanatory variables available to us, how do we determine which we should and should not include in our model when describing the observed realisations? As an initial step, we can use existing knowledge and scientific theory to inform an initial model, but such knowledge is often unavailable or incomplete. We have seen how to use deviance hypothesis testing to select between nested models, but even within this structure there can be many ways of expressing the linear predictor with a small finite number of explanatory variables.

For example, consider three explanatory variables x1, x2 and x3. Then there are 8 possible additive models that we can propose, ranging from the null model containing only the intercept term to the full additive model where all explanatory variables are included. The diagram below illustrates all eight models with dashed lines between nodes indicating which pairs are nested.

Unnumbered figure: the eight additive models, with dashed lines between nodes indicating nesting

As a naive approach, we could fit all 8 models and perform 12 deviance hypothesis tests. With three explanatory variables this is not too bad, but with 10 explanatory variables there would be 1024 different additive models to fit and 5120 deviance hypothesis tests to perform! Clearly this is highly inefficient.
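These counts can be verified directly in R: with k variables there are 2^k additive models, and each model containing j variables can be tested against the j submodels obtained by dropping one variable, giving k·2^(k-1) tests in total:

```r
k <- 10
2^k                           # 1024 possible additive models
sum(choose(k, 0:k) * (0:k))   # 5120 tests: each model vs its one-variable-smaller submodels
```

For k=3 the same formula gives 3·2²=12 tests, as stated above.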

A number of procedural approaches to variable selection exist. Here we look at the forward selection procedure:

  1. Start with the null model.

  2. For each explanatory variable not in the model, add it to the model and assess whether the fit improves. Select the model that best reduces the residual deviance.

  3. Continue until no new predictors can be added.

  4. Add the interactions between the explanatory variables that are in the current best model. Perform deviance tests to assess the significance of the interaction model.

To demonstrate the forward selection procedure, consider the three explanatory variable example illustrated above. The initial step is to fit the null model. Next we propose three models, each adding one of the explanatory variables (i.e. 1+x1, 1+x2 and 1+x3). We then perform the deviance hypothesis test on each of these three models, with the null model under the null hypothesis. Of those tests where there is evidence to reject the null hypothesis, select as the current best model the one that reduces the residual deviance the most (say 1+x2).

Moving on to the next step, propose a new set of models, each including one additional explanatory variable not in the current best model. Since x2 is contained in the current best model, we need only evaluate 1+x1+x2 and 1+x2+x3 as the proposal set. As before, perform the deviance hypothesis test for each proposed model, with the current best model under the null hypothesis. This process continues until either all explanatory variables are in the current best model or there is insufficient evidence to reject the null hypothesis in any of the tests performed at a particular step.

The diagram below illustrates the forward selection procedure where the boxes identify which models were fitted, arrows indicate which hypothesis test was performed and shaded boxes identify which model is taken as the current best at each step. Here, the final best model is taken as 1+x2+x3 due to the failure to reject the null hypothesis with the full additive model taken as the alternative. In this illustration, a total of 7 models were fitted and 6 deviance hypothesis tests were performed.

Unnumbered figure: the forward selection procedure for the three-variable example
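In R, one step of forward selection can be carried out with `add1()`, which fits each one-variable extension of the current model and reports a deviance test for each candidate. The data frame `df` and variable names below are hypothetical, standing in for a binomial response y and candidates x1, x2, x3:

```r
## One step of forward selection (hypothetical data frame df with columns y, x1, x2, x3)
fit0 <- glm(y ~ 1, family = binomial, data = df)     # start from the null model
add1(fit0, scope = ~ x1 + x2 + x3, test = "Chisq")   # deviance test for each candidate

## After selecting, say, x2, the next step considers the remaining variables:
fit1 <- update(fit0, . ~ . + x2)
add1(fit1, scope = ~ x1 + x2 + x3, test = "Chisq")
```

Note that the built-in `step()` function automates a similar search, but it compares models by AIC rather than by deviance hypothesis tests.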

As a final step, we may then test for interactions using the approach described in Section 7.3.1. Unless there is evidence to suggest otherwise, we need only examine the interaction between explanatory variables that are within the best additive model. So for the above example, we only need to test if the interaction x2.x3 is needed as demonstrated below. If there are more than two explanatory variables in the best additive model, then we can adopt the forward selection procedure to assess which interactions are necessary.

Unnumbered figure: testing the interaction x2.x3 after forward selection

There is no such thing as a ‘perfect’ model. Remember: All models are wrong, but some models are more useful than others.