6 Model Choice

Nested Models and Likelihood Ratio Test

Suppose two models, $f_{0}(\cdot,\vec{\theta^{0}})$ and $f_{1}(\cdot,\vec{\theta^{1}})$ , are under consideration. We say $f_{0}$ is nested in $f_{1}$ if the parameter space of $f_{0}$ is a subset of the parameter space of $f_{1}$ . Often, this can be demonstrated by showing that fixing some of the parameters in $f_{1}$ at particular values results in a model of the form $f_{0}$ .

Nested model examples

•

The exponential model is nested within a Weibull model. The Weibull pdf is

$f(x;\lambda,k)=\frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1}e^{-(x/\lambda)^{k}}$

for $x\geq 0$ , $\lambda>0$ , $k>0$ , where $\lambda$ is the scale parameter and $k$ is the shape parameter. If we set $k=1$ we recover the exponential pdf, $f(x;\lambda)=\frac{1}{\lambda}e^{-x/\lambda}$ .
•

The exponential model is nested within a gamma model. The gamma pdf is

$f(x;\alpha,\beta)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x},$

for $x\geq 0$ , $\alpha>0$ , $\beta>0$ , where $\beta$ is the rate parameter and $\alpha$ is the shape parameter. Now, if we set $\alpha=1$ then we recover the exponential pdf (albeit with a slightly different parameterisation: let $\lambda=\beta^{-1}$ ).
•

The Uniform[0,1] distribution is nested within the beta distribution. The beta pdf is

$f(x;\alpha,\beta)=\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}$

for $0\leq x\leq 1$ , $\alpha>0$ , $\beta>0$ , where $\alpha$ and $\beta$ are both shape parameters. If we set $\alpha=\beta=1$ this is the Uniform[0,1] distribution (NB $B(1,1)=1$ ).
•

The negative binomial distribution gives the number of failures $x$ observed before the $r$ th success. The geometric distribution is nested within the negative binomial. The negative binomial pmf, with $p$ denoting the probability of a failure on each trial, is

$p(x;r,p)={\binom{x+r-1}{x}}(1-p)^{r}p^{x}.$

Setting $r=1$ yields the geometric distribution.
•

The normal distribution with known variance is nested within a normal distribution with unknown variance.
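The nesting claims in the list above can be spot-checked numerically. The following is a minimal sketch using base R density functions; the evaluation points and parameter values are arbitrary choices, not taken from the notes. It also assumes that $p$ in the negative binomial pmf above is the per-trial failure probability, so R's `prob` argument (a success probability) is set to $1-p$.

```r
# Numerical spot checks of the nesting claims, using base R (stats) densities.
# All evaluation points and parameter values below are arbitrary illustrations.
x <- c(0.1, 0.5, 1, 2, 5)      # points on the positive half-line
u <- c(0.1, 0.25, 0.5, 0.9)    # points in [0, 1]
k <- 0:10                      # failure counts
lambda <- 2                    # Weibull / exponential scale
beta <- 0.5                    # gamma / exponential rate
p_fail <- 0.3                  # assumed meaning of p: probability of a failure

checks <- c(
  # Weibull with shape k = 1 versus exponential with rate 1 / lambda
  weibull_k1 = isTRUE(all.equal(dweibull(x, shape = 1, scale = lambda),
                                dexp(x, rate = 1 / lambda))),
  # gamma with shape alpha = 1 versus exponential with rate beta
  gamma_a1 = isTRUE(all.equal(dgamma(x, shape = 1, rate = beta),
                              dexp(x, rate = beta))),
  # beta(1, 1) versus Uniform[0, 1]
  beta_11 = isTRUE(all.equal(dbeta(u, 1, 1), dunif(u, 0, 1))),
  # negative binomial with r = 1 versus geometric;
  # R's `prob` is the success probability, i.e. 1 - p_fail here
  nbinom_r1 = isTRUE(all.equal(dnbinom(k, size = 1, prob = 1 - p_fail),
                               dgeom(k, prob = 1 - p_fail)))
)
print(checks)
```

Each entry of `checks` is `TRUE` when the restricted larger model and the smaller model give the same density values at the chosen points.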

If we are comparing two nested models for data $\vec{x}$ we can use the likelihood ratio test. The set-up is to view the choice between the two models as a hypothesis test. We take $f_{0}$ as a nested model within $f_{1}$ and take $\Omega_{i}$ to be the parameter space of $f_{i}$ . Then $\Omega_{0}\subseteq\Omega_{1}$ .

Theorem 13: Likelihood Ratio Test for Model Comparison.

Suppose we have the two models $f_{0}$ and $f_{1}$ as described above. Our hypotheses are

$H_{0}$ : The simpler model adequately describes $\vec{x}$ , i.e. $\theta_{0}\in\Omega_{0}$ ,
$H_{1}$ : The more complex model is required, i.e. $\theta_{0}\in\{\Omega_{1}\setminus\Omega_{0}\}$ .

Then the deviance (likelihood ratio test) for model comparison is

$D(f_{1},f_{0})=2\{\ell_{1}(\hat{\vec{\theta}}^{1})-\ell_{0}(\hat{\vec{\theta}}^{0})\}.$

Under the usual regularity conditions, if $f_{1}$ has $p_{1}$ unknown parameters and $f_{0}$ has $p_{0}$ unknown parameters (of course $p_{1}>p_{0}$ ) then under $H_{0}$ , asymptotically as $n\rightarrow\infty$ ,

$D(f_{1},f_{0})\sim\chi^{2}_{p_{1}-p_{0}}.$

Remarks:

•

It should not surprise you to learn that this theorem is closely related to Theorems 2, 7 and 12. In fact, all three of these are corollaries to the above result. Convincing yourself of this would be a good way of enhancing your understanding.
•

It is always the case that $\ell_{1}(\hat{\vec{\theta}}^{1})\geq\ell_{0}(\hat{\vec{\theta}}^{0})$ , so the deviance is never negative. To see this, remember that $\hat{\vec{\theta}}^{0}$ is a valid parameter choice for model $f_{1}$ , and $\hat{\vec{\theta}}^{1}$ must have at least as high a likelihood under $f_{1}$ by virtue of it being the MLE.

Theorem 13 suggests that we should evaluate $D(f_{1},f_{0})$ and compare it to the corresponding critical value $z^{2}_{c}$ of the $\chi^{2}_{p_{1}-p_{0}}$ distribution. If $D(f_{1},f_{0})<z^{2}_{c}$ we do not reject $H_{0}$ and take the simpler model $f_{0}$ as adequate; otherwise we reject $H_{0}$ in favour of the more complex model $f_{1}$ .
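The decision rule can be sketched as a small R function. The name `lrt_compare`, its arguments and the log-likelihood values in the usage line are hypothetical illustrations and do not come from the notes; the sketch assumes the two maximised log-likelihoods have already been computed.

```r
# Hypothetical helper (not from the notes): likelihood ratio test for two
# nested models, given their maximised log-likelihoods.
#   l1    maximised log-likelihood of the larger model f1
#   l0    maximised log-likelihood of the nested model f0
#   df    p1 - p0, the number of extra parameters in f1
#   level significance level of the test
lrt_compare <- function(l1, l0, df, level = 0.05) {
  stopifnot(df > 0)
  D <- 2 * (l1 - l0)                                 # deviance
  crit <- qchisq(1 - level, df = df)                 # chi-squared critical value
  p_value <- pchisq(D, df = df, lower.tail = FALSE)  # upper-tail probability
  list(deviance = D, critical = crit, p_value = p_value, reject_H0 = D > crit)
}

# Made-up log-likelihood values, purely to show the call
res <- lrt_compare(l1 = -100.0, l0 = -103.5, df = 2)
print(res)
```

Reporting the p-value alongside the comparison with the critical value conveys the same decision while also showing how strong the evidence against $H_{0}$ is.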

Example 6.1: Failure times, ctd.

We revisit the failure times example first encountered in Section 2.6. There we fitted an exponential model to the failure time data; should a gamma model be preferred?

From earlier,

$\ell(\theta)=n\log\theta-\theta n\bar{x};\ \hat{\theta}=\bar{x}^{-1}.$

Using R we evaluate this at the MLE:

fail<-c(90,255,40,143,30,239,484,28,39,15)
exploglhd<-function(theta,x){
 n<-length(x)
 return(n*log(theta) - theta*n*mean(x))
}

exploglhd(1/mean(fail),fail)

This gives $\ell(\hat{\theta})=-59.14858$ , with $\hat{\theta}=0.00734$ .

Now we consider the likelihood under a gamma distribution. Recall from Example 3.7 that maximising the gamma log-likelihood reduces to solving the following equation for $\hat{\alpha}$ , in which $\gamma(\cdot)$ denotes the digamma function:

$n\log\hat{\alpha}-n\log\bar{x}+\sum_{i=1}^{n}\log x_{i}-n\gamma(\hat{\alpha})=0,$

and then

$\hat{\beta}=\frac{\hat{\alpha}}{\bar{x}}.$

The first equation can only be solved numerically.

Using R,

f<-function(alpha,x){
   n<-length(x)
   return(n*log(alpha)-n*log(mean(x))+sum(log(x))-n*digamma(alpha))
}

uniroot(f,lower=0.001,upper=10,x=fail)

giving $\hat{\alpha}=1.013$ , and so $\hat{\beta}=0.00743$ . (Note that the digamma function in R is the derivative of the log of the gamma function.)
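As a cross-check on the root-finding step, the same estimate can be obtained by maximising the profile log-likelihood directly, substituting $\beta=\alpha/\bar{x}$ and optimising over $\alpha$ alone. This is a sketch of an alternative route, not the method used in the notes; the function name `prof_loglik` is hypothetical and the search interval simply reuses the one passed to uniroot above.

```r
fail <- c(90, 255, 40, 143, 30, 239, 484, 28, 39, 15)

# Profile log-likelihood (hypothetical helper): gamma log-likelihood with the
# rate fixed at its conditional MLE, beta = alpha / mean(x).
prof_loglik <- function(alpha, x) {
  beta <- alpha / mean(x)
  sum(dgamma(x, shape = alpha, rate = beta, log = TRUE))
}

# One-dimensional maximisation over alpha on the same interval as uniroot
fit <- optimize(function(a) prof_loglik(a, fail),
                interval = c(0.001, 10), maximum = TRUE)

alpha_hat <- fit$maximum
beta_hat <- alpha_hat / mean(fail)
print(c(alpha_hat = alpha_hat, beta_hat = beta_hat))
```

The two routes should agree up to the numerical tolerance of the optimiser, since the derivative of this profile log-likelihood is exactly the function whose root is found above.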

Again, we evaluate the gamma log likelihood at its MLE in R.

gamloglhd<-function(alpha,beta,x){
    n<-length(x)
    # lgamma(alpha) equals log(gamma(alpha)) but cannot overflow for large alpha
    return(n*alpha*log(beta) + (alpha-1)*sum(log(x))
           - beta*sum(x) - n*lgamma(alpha))
}

gamloglhd(alpha=1.012568,beta=0.007428966,x=fail)

giving $\ell(\hat{\alpha},\hat{\beta})=-59.14808$ , only very slightly larger than the exponential log-likelihood. This gives a deviance of $2\{-59.14808-(-59.14858)\}=0.001$ .

This is much less than the 5% critical value of the $\chi^{2}_{1}$ distribution, 3.84 (one degree of freedom, because the gamma model has one more parameter than the exponential), hence we retain the simpler exponential model. We should not be surprised at our conclusion because $\hat{\alpha}=1.013$ , which is very close to 1, the value that would imply an exponential distribution.
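The whole comparison can be reproduced in a few lines using R's built-in densities rather than hand-coded log-likelihoods. This is a sketch for checking the numbers above; the variable names and the tightened `tol` argument are choices made here, not part of the original analysis.

```r
fail <- c(90, 255, 40, 143, 30, 239, 484, 28, 39, 15)

# Exponential model: closed-form MLE, log-likelihood via dexp
theta_hat <- 1 / mean(fail)
l0 <- sum(dexp(fail, rate = theta_hat, log = TRUE))

# Gamma model: solve the score equation for alpha, then beta = alpha / mean(x)
score <- function(alpha, x) {
  n <- length(x)
  n * log(alpha) - n * log(mean(x)) + sum(log(x)) - n * digamma(alpha)
}
alpha_hat <- uniroot(score, lower = 0.001, upper = 10, x = fail, tol = 1e-10)$root
beta_hat <- alpha_hat / mean(fail)
l1 <- sum(dgamma(fail, shape = alpha_hat, rate = beta_hat, log = TRUE))

# Deviance, 5% critical value and p-value on 1 degree of freedom
D <- 2 * (l1 - l0)
crit <- qchisq(0.95, df = 1)
p_value <- pchisq(D, df = 1, lower.tail = FALSE)
print(c(l0 = l0, l1 = l1, deviance = D, critical = crit, p_value = p_value))
```

Because the deviance is so far below the critical value, the p-value is close to 1, which is another way of saying that the data give essentially no evidence against the exponential model.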

Example 6.2: Surgical Mortality Rates at Hospitals

The table below gives mortality levels for cardiac surgery on babies at 12 hospitals; $r/m$ means $r$ deaths in $m$ operations.

A 0/47	B 18/148	C 8/119	D 46/810	E 8/211	F 13/196
G 9/148	H 31/215	I 14/207	J 8/97	K 29/256	L 24/360

Let $\theta_{s}$ be the mortality rate for hospital $s$ . We would like to know whether mortality rates vary between hospitals. This can be expressed as the hypothesis test

$H_{0}$ : $\theta_{s}=\theta_{*}$ for all $s=A,\ldots,L$ .
$H_{1}$ : Each $\theta_{s}$ is allowed to be different.

Clearly the model suggested by $H_{0}$ is nested within the model for $H_{1}$ .

Letting $r_{s}$ denote the number of deaths in hospital $s$ and $m_{s}$ the total number of operations carried out at hospital $s$ , the likelihood is

$L(\vec{\theta})=\prod_{s=A}^{L}{\binom{m_{s}}{r_{s}}}\theta_{s}^{r_{s}}(1-\theta_{s})^{m_{s}-r_{s}}.$

Under $H_{1}$ this breaks down into separate Binomial distributions for each hospital, so we have

$\hat{\theta}_{s}=\frac{r_{s}}{m_{s}}$

for $s=A,\ldots,L$ .

Under $H_{0}$ this collapses into a single Binomial distribution where operations over all hospitals are added together, yielding

$\hat{\theta}_{*}=\frac{\sum r_{s}}{\sum m_{s}}.$
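To spell out the step behind this estimator: under $H_{0}$ every $\theta_{s}$ equals $\theta_{*}$ , so up to an additive constant that does not depend on $\theta_{*}$ the log-likelihood is

$\ell(\theta_{*})=\sum_{s}r_{s}\log\theta_{*}+\sum_{s}(m_{s}-r_{s})\log(1-\theta_{*}).$

Differentiating with respect to $\theta_{*}$ and setting the derivative to zero gives

$\frac{\sum_{s}r_{s}}{\hat{\theta}_{*}}-\frac{\sum_{s}(m_{s}-r_{s})}{1-\hat{\theta}_{*}}=0,$

which rearranges to the pooled proportion $\sum_{s}r_{s}/\sum_{s}m_{s}$ . For the data in the table, $\sum r_{s}=208$ and $\sum m_{s}=2814$ , so $\hat{\theta}_{*}\approx 0.0739$ .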

We use R to evaluate the log-likelihood under each scenario as follows:

####hospital mortality example
r<-c(0,18,8,46,8,13,9,31,14,8,29,24)
m<-c(47,148,119,810,211,196,148,215,207,97,256,360)

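hosploglhd<-function(theta,r,m){
    # sum the log-probabilities rather than taking log(prod(...)),
    # which can underflow when many small probabilities are multiplied
    return(sum(dbinom(r,m,theta,log=TRUE)))
}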

#h0:
hosploglhd(theta=sum(r)/sum(m),r,m)
#h1:
hosploglhd(theta=r/m,r,m)

This gives a log-likelihood of -44.15 for $H_{0}$ and -24.88 for $H_{1}$ . The likelihood ratio statistic is then

$2\{-24.88-(-44.15)\}=38.54.$

The model under $H_{1}$ has 11 more parameters than the model under $H_{0}$ (12 hospital-specific rates against a single common rate); this gives a critical $\chi^{2}$ value of qchisq(0.95,df=11)=19.68. Since $38.54>19.68$ we reject $H_{0}$ and conclude that mortality rates vary between hospitals.
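Because the binomial coefficients appear in both log-likelihoods, they cancel in the deviance, which can therefore be computed from the fitted rates alone. The sketch below illustrates this and adds a p-value; the helper `xlogy` is a hypothetical convenience introduced here to apply the convention $0\log 0=0$ (needed for hospital A, which has no deaths).

```r
r <- c(0, 18, 8, 46, 8, 13, 9, 31, 14, 8, 29, 24)
m <- c(47, 148, 119, 810, 211, 196, 148, 215, 207, 97, 256, 360)

# Hypothetical helper: x * log(y) with the convention 0 * log(0) = 0
xlogy <- function(x, y) ifelse(x == 0, 0, x * log(y))

theta_s <- r / m            # MLEs under H1, one rate per hospital
theta_0 <- sum(r) / sum(m)  # pooled MLE under H0

# Deviance without binomial coefficients (they cancel between H0 and H1)
D <- 2 * sum(xlogy(r, theta_s / theta_0) +
             xlogy(m - r, (1 - theta_s) / (1 - theta_0)))

df <- length(r) - 1         # 12 rates under H1 versus 1 under H0
p_value <- pchisq(D, df = df, lower.tail = FALSE)
print(c(deviance = D, df = df, p_value = p_value))
```

Working from unrounded log-likelihoods, the deviance may differ from 38.54 in the second decimal place, since that figure is built from log-likelihoods rounded to two decimals.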

What is the interpretation of this?

•

Different case mix encountered by hospitals (e.g. specialised hospital taking more complex cases).
•

Some hospitals not as good as others.

Monitoring surgical mortality between hospitals (and between surgeons) is important to ensure that those performing below par are identified quickly.