6 Model Choice

More Examples

Example 6.4:  AIC for linear regression.
Suppose we had a normal linear regression model, i.e. let X1,,Xn be independent random variables such that XiN(j=1dβjwi,j,σ2) for i=1,,n, where the wi,j are covariates (explanatory variables).

Find an expression for the AIC of the linear regression model.

We saw in previous examples that

(β,σ2) = i=1n{-12log(2π)-logσ2-12(xi-j=1dβjwi,j)2}
= -n2log(2π)-n2logσ2-12σ2i=1n(xi-j=1dβjwi,j)2

Note that this can be written as

-n2log(2π)-n2logσ2-12σ2SSE(β),

where SSE() is the sum of squares error (residual) function.

By partial differentiation of the log-likelihood with respect to each βj separately, the first two terms of the likelihood act like constants; the MLEs are just the least squares estimators of the regression parameters obtained from minimising the sum of squares error via the design matrix.

For the MLE of σ2, the partial derivative is

σ2(β^,σ2)=-n2σ2+12σ4SSE(β^),

from which we have

σ2^=SSE(β^)/n.

Thus the AIC is

AIC=2{-(β,σ2)+p}=nlog(2π)+nlog(SSE(β^)/n)+n+2(d+1).

Example 6.5:  Financial Data

Date 28/8 29/8 30/8 31/8 3/9 4/9 5/9 6/9 7/9 10/9 11/9 Close 5775.7 5743.5 5719.5 5711.5 5758.4 5672.0 5657.9 5777.3 5794.8 5793.2 5792.2 Date 12/9 13/9 14/9 17/9 18/9 19/9 20/9 21/9 24/9 25/9 26/9 Close 5782.1 5819.9 5915.5 5893.5 5868.2 5888.5 5854.6 5852.6 5838.8 5859.7 5768.1

The table gives the closing figures for the FTSE100 between 28 Aug 2012 and 26 Sept 2012.

We may want to fit a suitable model in order to understand how the figures are changing over time, and perhaps make predictions about future stock movements.

As usual, the first step is to plot the data. In this case, the data are a time series so it makes sense to plot them slightly differently. The data are available in the file ftse-12.csv and the R code to produce the plot in Figure LABEL:fig:ftse is

ftse<-read.csv("ftse-sept12.csv")
# R can understand dates, see ?as.Date
ftse$Date<-as.Date(ftse$Date,format="%d-%b-%y")

plot(ftse$Date,ftse$Close,type="l",xlab="date",ylab="close")

The graph demonstrates that this graph has the interesting property that the closing values are correlated with neighbouring closing values.

Figure 6.4: Link, Caption: FTSE100 closing prices over the period 28 Aug 2012 to 26 Sept 2012

A naive model one could try is, taking Xi to be the closing price on day i, let

X1,,XnN(μ,σ2),iid.

However, this does not take account of the correlation structure of the data. A possible alternative is the AR(1) model (AR stands for autoregressive):

X1,,XnN(μ+ρxi-1,σ2),

with x0 found from the historical data (in this case x0=5776.6, the closing price on 27 Aug 2012). These data are no longer identically distributed or independent. Notice that these models are nested: the naive model is a special case of the AR(1) model with ρ=0; we could therefore do model selection using the likelihood ratio test as well as AIC.

An easy way to fit these models in R is to view them both as linear models — the naive model having an intercept only, the AR(1) model having a single covariate — the closing value from the previous day.

#naive model
modNaive<-lm(ftse$Close~1) #fit model
logLik(modNaive)
AIC(modNaive)
summary(modNaive)

#AR(1) model
closeyest<-c(5776.6,ftse$Close[-22])
modAR1<-lm(ftse$Close~closeyest)
logLik(modAR1)
AIC(modAR1)
summary(modAR1)

We see that the AR(1) model has a much lower AIC. Twice the decrease in log-likelihood (i.e. the deviance) is 203.84, hence we unanimously accept the more complex model.

Indeed, we could try an AR(2) model, which models the closing price on the previous two days’ closing prices as

X1,,XnN(μ+ρ1xi-1+ρ2xi-2,σ2),

where x-1=5776.6 (coincidentally, the closing price on 26 Aug was the same as 27 Aug).

#AR(2) model
close2dago<-c(5776.6,5776.6,ftse$Close[-c(21,22)])
modAR2<-lm(ftse$Close~closeyest+close2dago)
logLik(modAR2)
AIC(modAR2)
summary(modAR2)

The AIC is higher, and the log-likelihood has barely decreased. Hence the AR(1) model is preferred. The AR(1) model suggests a good prediction for the closing price of the FTSE 100 on 27 September is, via coef(modAR1),

Predicted close=1344.68+0.768×5768.1=5774.58.

The actual closing price on 27 September was 5779.4.

Historical financial data of this kind can be downloaded from
http://uk.finance.yahoo.com/.

Time series data will be covered in more detail in MATH334 Topics in Modern Statistics.

Example 6.6:  Conception and Smoking

Women who were pregnant with planned pregnancies were asked how many menstrual cycles it took them to get pregnant (after discontinuing contraception). The women were then categorized as smokers or non-smokers (smoker = smokes an average of at least one cigarette a day). The data are in the table below.

The question of interest is: is the time to conception different in smokers and non-smokers?

Cycles 1 2 3 4 5 6 7 8 9 10 11 12 >12
Smokers 29 16 17 4 3 9 4 5 1 1 1 3 7
Non-smokers 198 107 55 38 18 22 7 9 5 3 6 6 12

As usual we start with a plot. The R code for a suitable plot (Figure LABEL:fig:smoke-obs) is given below.

smokeobs<-c(29,16,17,4,3,9,4,5,1,1,1,3,7)
ns<-sum(smokeobs) #sample size: smokers
nsmokeobs<-c(198,107,55,38,18,22,7,9,5,3,6,6,12)
nn<-sum(nsmokeobs) #sample size: nonsmokers

par(mfrow=c(2,1)) #plot 2x1 grid

barplot(smokeobs,names.arg=1:13,xlab="Number of cycles",
 ylab="Frequency",col="blue",main="smokers")

barplot(nsmokeobs,names.arg=1:13,xlab="Number of cycles",
 ylab="Frequency",col="blue",main="non smokers")
Figure 6.5: Link, Caption: Number of menstrual cycles taken to conceive: smokers and non-smokers

One could postulate that the number of failed cycles required to conceive may follow a geometric distribution, with parameter θs for smokers and θn for non smokers. We could construct a hypothesis test with H0:θs=θn and H1:θsθn.

Similar to Example 6.6, H0 implies a distribution that is nested in the distribution implied by H1, so we could carry out a likelihood ratio test.

As derived in WS2.1, in general for R1,,Rn IID random variables with Geometric(θ) distribution,

(θ)=nlog(1-θ)+rilogθ;θ^=rin+ri.

We can use R to calculate the log-likelihood implied by each of the hypotheses.

n<-ns+nn #total sample size

llgeom<-function(theta,n,r){
      return(n*log(1-theta) + r*log(theta))
}

#under H0
tobs<-smokeobs+nsmokeobs
r0<-sum(tobs*(0:12))
thethat0<-r0/(n+r0)

llgeom(thethat0,n,r0)
#under H1
r1s<-sum(smokeobs*(0:12))
thethat1s<-r1s/(ns+r1s)
r1n<-sum(nsmokeobs*(0:12))
thethat1n<-r1n/(nn+r1n)

llgeom(thethat1s,ns,r1s)+llgeom(thethat1n,nn,r1n)

Twice the difference between the likelihoods exceeds 3.84, so H0 is rejected.

Is this geometric distribution with θ depending on smoking status a good model for the data? We can plot observed against expected values to determine this.

par(mfrow=c(2,1)) #plot 2x1 grid
#smokers
expdatas<-ns*c(dgeom(0:12,thethat1s))

#nonsmokers
expdatan<-nn*c(dgeom(0:12,thethat1n))

barplot(rbind(smokeobs,expdatas), names.arg = 1:13,
    xlab = "Number of cycles", ylab = "Frequency",
    col=c("blue","green"),main="smokers",beside=T)

legend("topright",c("observed","expected"),
Ψ    col=c("blue","green"),lty=1)

barplot(rbind(nsmokeobs,expdatan), names.arg = 1:13,
    xlab="Number of cycles", ylab="Frequency",
    col=c("blue","green"),main="non smokers",beside=T)
Figure 6.6: Link, Caption: Number of menstrual cycles taken to conceive: smokers and non-smokers

The result is Figure Figure 6.6 (Link). The fit is poor. Why might this be?

Real data are complex and messy: the standard distributions may not fit, and considerable thought may be needed to find an appropriate modelling strategy.

This course has laid the foundations: many more advanced methods exist. Some will be covered in other third year statistics modules; more still on postgraduate statistics courses.