Example 6.4: AIC for linear regression.
Suppose we have a normal linear regression model, i.e. let $Y_1, \dots, Y_n$ be independent random variables such that
\[ Y_i \sim N(\mu_i, \sigma^2), \qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}, \]
for $i = 1, \dots, n$, where the $x_{ij}$ are covariates (explanatory variables).
Find an expression for the AIC of the linear regression model.
We saw in previous examples that
\[ \ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu_i)^2. \]
Note that this can be written as
\[ \ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\,SSE(\boldsymbol{\beta}), \]
where $SSE(\boldsymbol{\beta}) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip})^2$ is the sum of squares error (residual) function.
On partial differentiation of the log-likelihood with respect to each $\beta_j$ separately, the first two terms act as constants; the MLEs $\hat{\beta}_0, \dots, \hat{\beta}_p$ are therefore just the least squares estimators of the regression parameters, obtained by minimising $SSE(\boldsymbol{\beta})$ via the design matrix $X$, i.e. $\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}$.
For the MLE of $\sigma^2$, the partial derivative is
\[ \frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\,SSE(\boldsymbol{\beta}), \]
from which we have $\hat{\sigma}^2 = SSE(\hat{\boldsymbol{\beta}})/n$.
Thus, since the model has $p+2$ parameters ($\beta_0, \dots, \beta_p$ and $\sigma^2$), the AIC is
\[ \mathrm{AIC} = -2\ell(\hat{\boldsymbol{\beta}}, \hat{\sigma}^2) + 2(p+2) = n\log(2\pi) + n\log\hat{\sigma}^2 + n + 2(p+2). \]
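As a sanity check, the formula can be compared against R's built-in AIC for a fitted lm. The sketch below uses simulated data (the variable names and parameter values are illustrative, not from the example) and assumes $p+2$ estimated parameters: the regression coefficients plus the variance.

```r
# Numerical check of the AIC formula on simulated data (a sketch)
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)           # true model has one covariate, so p = 1
mod <- lm(y ~ x)
sse <- sum(resid(mod)^2)            # SSE at the least squares estimates
sigma2hat <- sse / n                # MLE of sigma^2
p <- 1
aic_formula <- n * log(2 * pi) + n * log(sigma2hat) + n + 2 * (p + 2)
all.equal(aic_formula, AIC(mod))    # should agree with R's AIC
```

R's `logLik.lm` counts the degrees of freedom as the number of coefficients plus one for $\sigma^2$, which is why the two calculations match.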
Example 6.5: Financial Data
The table gives the closing figures for the FTSE100 between 28 Aug 2012 and 26 Sept 2012.
We may want to fit a suitable model in order to understand how the figures are changing over time, and perhaps make predictions about future stock movements.
As usual, the first step is to plot the data. In this case, the data form a time series, so it makes sense to plot them slightly differently. The data are available in the file ftse-sept12.csv, and the R code to produce the plot in Figure LABEL:fig:ftse is
ftse <- read.csv("ftse-sept12.csv")
# R can understand dates, see ?as.Date
ftse$Date <- as.Date(ftse$Date, format = "%d-%b-%y")
plot(ftse$Date, ftse$Close, type = "l", xlab = "date", ylab = "close")
The plot reveals an interesting property of the data: each closing value is correlated with the neighbouring closing values.
A naive model one could try is, taking $Y_t$ to be the closing price on day $t$, to let
\[ Y_t \sim N(\mu, \sigma^2) \quad \text{independently, for } t = 1, \dots, n. \]
However, this does not take account of the correlation structure of the data. A possible alternative is the AR(1) model (AR stands for autoregressive):
\[ Y_t \mid y_{t-1} \sim N(\alpha + \beta y_{t-1}, \sigma^2), \]
with $y_0$ found from the historical data (in this case $y_0 = 5776.6$, the closing price on 27 Aug 2012). These data are no longer modelled as identically distributed or independent. Notice that these models are nested: the naive model is a special case of the AR(1) model with $\beta = 0$ (and $\mu = \alpha$); we could therefore do model selection using the likelihood ratio test as well as AIC.
An easy way to fit these models in R is to view them both as linear models — the naive model having an intercept only, the AR(1) model having a single covariate — the closing value from the previous day.
# naive model
modNaive <- lm(ftse$Close ~ 1)  # fit model
logLik(modNaive)
AIC(modNaive)
summary(modNaive)

# AR(1) model
closeyest <- c(5776.6, ftse$Close[-22])  # previous day's close; y0 = 5776.6
modAR1 <- lm(ftse$Close ~ closeyest)
logLik(modAR1)
AIC(modAR1)
summary(modAR1)
We see that the AR(1) model has a much lower AIC. Twice the increase in log-likelihood (i.e. the deviance) far exceeds the 3.84 critical value of the $\chi^2_1$ distribution, hence we decisively prefer the more complex model.
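The same comparison can be sketched end-to-end on simulated data (the AR(1) parameter values below are assumptions for illustration only, not the FTSE figures), including the formal likelihood ratio test and a one-step-ahead prediction:

```r
# Sketch: naive vs AR(1) comparison on a simulated series
# (parameter values 5, 0.95 and the starting value 100 are assumptions)
set.seed(42)
n <- 200
y <- numeric(n)
y[1] <- 100
for (t in 2:n) y[t] <- 5 + 0.95 * y[t - 1] + rnorm(1)
yest <- c(y[1], y[-n])                 # previous value; y0 taken as y[1]
naive <- lm(y ~ 1)                     # intercept only
ar1 <- lm(y ~ yest)                    # AR(1) fitted as a linear model
dev <- as.numeric(2 * (logLik(ar1) - logLik(naive)))  # deviance
dev > qchisq(0.95, df = 1)             # TRUE: reject the naive model
AIC(ar1) < AIC(naive)                  # TRUE: AR(1) preferred by AIC too
predict(ar1, newdata = data.frame(yest = y[n]))  # one-step-ahead forecast
```

Because the models are nested and differ by one parameter, the deviance is compared with the $\chi^2_1$ distribution; the AIC comparison reaches the same conclusion here.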
Indeed, we could try an AR(2) model, which regresses the closing price on the previous two days' closing prices:
\[ Y_t \mid y_{t-1}, y_{t-2} \sim N(\alpha + \beta_1 y_{t-1} + \beta_2 y_{t-2}, \sigma^2), \]
where $y_{-1} = y_0 = 5776.6$ (coincidentally, the closing price on 26 Aug was the same as on 27 Aug).
# AR(2) model
close2dago <- c(5776.6, 5776.6, ftse$Close[-c(21, 22)])  # close two days earlier
modAR2 <- lm(ftse$Close ~ closeyest + close2dago)
logLik(modAR2)
AIC(modAR2)
summary(modAR2)
The AIC is higher, and the log-likelihood has barely increased. Hence the AR(1) model is preferred. Via coef(modAR1), the AR(1) model suggests a good prediction for the closing price of the FTSE 100 on 27 September is $\hat{\alpha} + \hat{\beta} y_{22}$, where $y_{22}$ is the closing price on 26 September.
The actual closing price on 27 September was 5779.4.
Historical financial data of this kind can be downloaded from http://uk.finance.yahoo.com/.
Time series data will be covered in more detail in MATH334 Topics in Modern Statistics.
Example 6.6: Conception and Smoking
Women who were pregnant with planned pregnancies were asked how many menstrual cycles it took them to get pregnant (after discontinuing contraception). The women were then categorized as smokers or non-smokers (smoker = smokes an average of at least one cigarette a day). The data are in the table below.
The question of interest is: is the time to conception different in smokers and non-smokers?
Cycles | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | >12
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Smokers | 29 | 16 | 17 | 4 | 3 | 9 | 4 | 5 | 1 | 1 | 1 | 3 | 7
Non-smokers | 198 | 107 | 55 | 38 | 18 | 22 | 7 | 9 | 5 | 3 | 6 | 6 | 12
As usual we start with a plot. The R code for a suitable plot (Figure LABEL:fig:smoke-obs) is given below.
smokeobs <- c(29, 16, 17, 4, 3, 9, 4, 5, 1, 1, 1, 3, 7)
ns <- sum(smokeobs)  # sample size: smokers
nsmokeobs <- c(198, 107, 55, 38, 18, 22, 7, 9, 5, 3, 6, 6, 12)
nn <- sum(nsmokeobs)  # sample size: non-smokers
par(mfrow = c(2, 1))  # plot on a 2x1 grid
barplot(smokeobs, names.arg = 1:13, xlab = "Number of cycles",
        ylab = "Frequency", col = "blue", main = "smokers")
barplot(nsmokeobs, names.arg = 1:13, xlab = "Number of cycles",
        ylab = "Frequency", col = "blue", main = "non smokers")
One could postulate that the number of failed cycles before conception follows a geometric distribution, with parameter $\theta_s$ for smokers and $\theta_n$ for non-smokers. We could construct a hypothesis test with $H_0: \theta_s = \theta_n$ and $H_1: \theta_s \neq \theta_n$.
Similar to Example 6.5, $H_0$ implies a distribution that is nested in the distribution implied by $H_1$, so we could carry out a likelihood ratio test.
As derived in WS2.1, in general for IID random variables $X_1, \dots, X_n$ with a Geometric$(\theta)$ distribution, $P(X_i = x) = (1-\theta)\theta^x$ for $x = 0, 1, 2, \dots$, the log-likelihood is
\[ \ell(\theta) = n \log(1-\theta) + r \log\theta, \qquad \text{where } r = \sum_{i=1}^n x_i. \]
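The estimator used in the code below follows by setting the score to zero; the intermediate step is:

```latex
\frac{d\ell}{d\theta} = \frac{r}{\theta} - \frac{n}{1-\theta} = 0
\quad\Longrightarrow\quad r(1-\theta) = n\theta
\quad\Longrightarrow\quad \hat{\theta} = \frac{r}{n+r}.
```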
We can use R to calculate the log-likelihood implied by each of the hypotheses.
n <- ns + nn  # total sample size
llgeom <- function(theta, n, r) {
  return(n * log(1 - theta) + r * log(theta))
}

# under H0
tobs <- smokeobs + nsmokeobs
r0 <- sum(tobs * (0:12))
thethat0 <- r0 / (n + r0)
llgeom(thethat0, n, r0)
# under H1
r1s <- sum(smokeobs * (0:12))
thethat1s <- r1s / (ns + r1s)
r1n <- sum(nsmokeobs * (0:12))
thethat1n <- r1n / (nn + r1n)
llgeom(thethat1s, ns, r1s) + llgeom(thethat1n, nn, r1n)
Twice the difference between the log-likelihoods exceeds 3.84, the upper 95% point of the $\chi^2_1$ distribution, so $H_0$ is rejected: there is evidence that the time to conception differs between smokers and non-smokers.
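Putting the pieces together, the whole test can be run as one self-contained chunk. This is a sketch that repeats the calculations above (with shorter variable names), so the conclusion can be checked independently of the earlier session:

```r
# Self-contained likelihood ratio test for the conception data
smoke <- c(29, 16, 17, 4, 3, 9, 4, 5, 1, 1, 1, 3, 7)
nonsm <- c(198, 107, 55, 38, 18, 22, 7, 9, 5, 3, 6, 6, 12)
ll <- function(theta, n, r) n * log(1 - theta) + r * log(theta)
ns <- sum(smoke); nn <- sum(nonsm); n <- ns + nn
rs <- sum(smoke * (0:12)); rn <- sum(nonsm * (0:12)); r0 <- rs + rn
l0 <- ll(r0 / (n + r0), n, r0)          # maximised log-likelihood under H0
l1 <- ll(rs / (ns + rs), ns, rs) +
      ll(rn / (nn + rn), nn, rn)        # maximised log-likelihood under H1
dev <- 2 * (l1 - l0)                    # compare with qchisq(0.95, 1) = 3.84
pchisq(dev, df = 1, lower.tail = FALSE) # p-value of the test
```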
Is this geometric distribution, with $\theta$ depending on smoking status, a good model for the data? We can plot observed against expected values to assess this.
Note that R's dgeom(x, prob) has pmf $\mathrm{prob}\,(1-\mathrm{prob})^x$, i.e. prob is the per-cycle success probability, so we must pass $1 - \hat{\theta}$ rather than $\hat{\theta}$.

par(mfrow = c(2, 1))  # plot on a 2x1 grid
# smokers
expdatas <- ns * dgeom(0:12, 1 - thethat1s)
# non-smokers
expdatan <- nn * dgeom(0:12, 1 - thethat1n)
barplot(rbind(smokeobs, expdatas), names.arg = 1:13,
        xlab = "Number of cycles", ylab = "Frequency",
        col = c("blue", "green"), main = "smokers", beside = T)
legend("topright", c("observed", "expected"),
       col = c("blue", "green"), lty = 1)
barplot(rbind(nsmokeobs, expdatan), names.arg = 1:13,
        xlab = "Number of cycles", ylab = "Frequency",
        col = c("blue", "green"), main = "non smokers", beside = T)
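One detail the plot glosses over: the dgeom probabilities for 0 to 12 failures do not sum to one, while the final table column records women taking more than 12 cycles. A hedged sketch of expected counts that lumps the tail into the last category (an assumption about how that final column was recorded) is:

```r
# Expected counts for the smokers with the ">12 cycles" tail lumped into
# the final category, so the 13 probabilities sum exactly to 1
smoke <- c(29, 16, 17, 4, 3, 9, 4, 5, 1, 1, 1, 3, 7)
ns <- sum(smoke)
rs <- sum(smoke * (0:12))
that <- rs / (ns + rs)                     # MLE under pmf (1 - theta) * theta^x
p <- c((1 - that) * that^(0:11), that^12)  # last entry is P(X >= 12)
expected <- ns * p                         # sums exactly to ns
round(expected, 1)
```

This is the kind of adjustment needed before, say, a formal Pearson goodness-of-fit test on these counts.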
Real data are complex and messy: the standard distributions may not fit, and considerable thought may be needed to find an appropriate modelling strategy.
This course has laid the foundations: many more advanced methods exist. Some will be covered in other third year statistics modules; more still on postgraduate statistics courses.