Example 6.4: AIC for linear regression.
Suppose we have a normal linear regression model, i.e. let $Y_1, \dots, Y_n$ be independent random variables such that
\[ Y_i \sim N(\mu_i, \sigma^2), \qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}, \]
for $i = 1, \dots, n$, where the $x_{ij}$ are covariates (explanatory variables).
Find an expression for the AIC of the linear regression model.
We saw in previous examples that
\[ \ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu_i)^2. \]
Note that this can be written as
\[ \ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\,SSE(\boldsymbol{\beta}), \]
where $SSE(\boldsymbol{\beta}) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip})^2$ is the sum of squares error (residual) function.
On partial differentiation of the log-likelihood with respect to each $\beta_j$ separately, the first two terms act as constants; the MLEs $\hat{\beta}_0, \dots, \hat{\beta}_p$ are therefore just the least squares estimators of the regression parameters, obtained by minimising $SSE(\boldsymbol{\beta})$ via the design matrix $X$, i.e. $\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}$.
For the MLE of $\sigma^2$, the partial derivative is
\[ \frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\,SSE(\boldsymbol{\beta}), \]
from which we have $\hat{\sigma}^2 = SSE(\hat{\boldsymbol{\beta}})/n$.
Thus, since the model has $p+2$ parameters ($\beta_0, \dots, \beta_p$ and $\sigma^2$), the AIC is
\[ \mathrm{AIC} = -2\ell(\hat{\boldsymbol{\beta}}, \hat{\sigma}^2) + 2(p+2) = n\log(2\pi) + n\log\hat{\sigma}^2 + n + 2(p+2). \]
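As a sanity check, the formula can be compared against R's built-in AIC for a fitted lm. The sketch below uses simulated data (the variable names and parameter values are illustrative, not from the example) and assumes $p+2$ estimated parameters: the regression coefficients plus the variance.

```r
# Numerical check of the AIC formula on simulated data (a sketch)
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)           # true model has one covariate, so p = 1
mod <- lm(y ~ x)
sse <- sum(resid(mod)^2)            # SSE at the least squares estimates
sigma2hat <- sse / n                # MLE of sigma^2
p <- 1
aic_formula <- n * log(2 * pi) + n * log(sigma2hat) + n + 2 * (p + 2)
all.equal(aic_formula, AIC(mod))    # should agree with R's AIC
```

R's `logLik.lm` counts the degrees of freedom as the number of coefficients plus one for $\sigma^2$, which is why the two calculations match.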
Example 6.5: Financial Data
The table gives the closing figures for the FTSE100 between 28 Aug 2012 and 26 Sept 2012.
We may want to fit a suitable model in order to understand how the figures are changing over time, and perhaps make predictions about future stock movements.
As usual, the first step is to plot the data. In this case, the data form a time series, so it makes sense to plot them slightly differently. The data are available in the file ftse-sept12.csv, and the R code to produce the plot in Figure LABEL:fig:ftse is
ftse <- read.csv("ftse-sept12.csv")
# R can understand dates, see ?as.Date
ftse$Date <- as.Date(ftse$Date, format = "%d-%b-%y")
plot(ftse$Date, ftse$Close, type = "l", xlab = "date", ylab = "close")
The plot reveals an interesting property of the data: each closing value is correlated with the neighbouring closing values.
A naive model one could try is, taking $Y_t$ to be the closing price on day $t$, to let
\[ Y_t \sim N(\mu, \sigma^2) \quad \text{independently, for } t = 1, \dots, n. \]
However, this does not take account of the correlation structure of the data. A possible alternative is the AR(1) model (AR stands for autoregressive):
\[ Y_t \mid y_{t-1} \sim N(\alpha + \beta y_{t-1}, \sigma^2), \]
with $y_0$ found from the historical data (in this case $y_0 = 5776.6$, the closing price on 27 Aug 2012). These data are no longer modelled as identically distributed or independent. Notice that these models are nested: the naive model is a special case of the AR(1) model with $\beta = 0$ (and $\mu = \alpha$); we could therefore do model selection using the likelihood ratio test as well as AIC.
An easy way to fit these models in R is to view them both as linear models — the naive model having an intercept only, the AR(1) model having a single covariate — the closing value from the previous day.
# naive model
modNaive <- lm(ftse$Close ~ 1)  # fit model
logLik(modNaive)
AIC(modNaive)
summary(modNaive)

# AR(1) model
closeyest <- c(5776.6, ftse$Close[-22])  # previous day's close; y0 = 5776.6
modAR1 <- lm(ftse$Close ~ closeyest)
logLik(modAR1)
AIC(modAR1)
summary(modAR1)
We see that the AR(1) model has a much lower AIC. Twice the increase in log-likelihood (i.e. the deviance) far exceeds the 3.84 critical value of the $\chi^2_1$ distribution, hence we decisively prefer the more complex model.
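The same comparison can be sketched end-to-end on simulated data (the AR(1) parameter values below are assumptions for illustration only, not the FTSE figures), including the formal likelihood ratio test and a one-step-ahead prediction:

```r
# Sketch: naive vs AR(1) comparison on a simulated series
# (parameter values 5, 0.95 and the starting value 100 are assumptions)
set.seed(42)
n <- 200
y <- numeric(n)
y[1] <- 100
for (t in 2:n) y[t] <- 5 + 0.95 * y[t - 1] + rnorm(1)
yest <- c(y[1], y[-n])                 # previous value; y0 taken as y[1]
naive <- lm(y ~ 1)                     # intercept only
ar1 <- lm(y ~ yest)                    # AR(1) fitted as a linear model
dev <- as.numeric(2 * (logLik(ar1) - logLik(naive)))  # deviance
dev > qchisq(0.95, df = 1)             # TRUE: reject the naive model
AIC(ar1) < AIC(naive)                  # TRUE: AR(1) preferred by AIC too
predict(ar1, newdata = data.frame(yest = y[n]))  # one-step-ahead forecast
```

Because the models are nested and differ by one parameter, the deviance is compared with the $\chi^2_1$ distribution; the AIC comparison reaches the same conclusion here.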
Indeed, we could try an AR(2) model, which regresses the closing price on the previous two days' closing prices:
\[ Y_t \mid y_{t-1}, y_{t-2} \sim N(\alpha + \beta_1 y_{t-1} + \beta_2 y_{t-2}, \sigma^2), \]
where $y_{-1} = y_0 = 5776.6$ (coincidentally, the closing price on 26 Aug was the same as on 27 Aug).
# AR(2) model
close2dago <- c(5776.6, 5776.6, ftse$Close[-c(21, 22)])  # close two days earlier
modAR2 <- lm(ftse$Close ~ closeyest + close2dago)
logLik(modAR2)
AIC(modAR2)
summary(modAR2)
The AIC is higher, and the log-likelihood has barely increased. Hence the AR(1) model is preferred. Via coef(modAR1), the AR(1) model suggests a good prediction for the closing price of the FTSE 100 on 27 September is $\hat{\alpha} + \hat{\beta} y_{22}$, where $y_{22}$ is the closing price on 26 September.
The actual closing price on 27 September was 5779.4.
Historical financial data of this kind can be downloaded from http://uk.finance.yahoo.com/.
Time series data will be covered in more detail in MATH334 Topics in Modern Statistics.
Example 6.6: Conception and Smoking
Women who were pregnant with planned pregnancies were asked how many menstrual cycles it took them to get pregnant (after discontinuing contraception). The women were then categorized as smokers or non-smokers (smoker = smokes an average of at least one cigarette a day). The data are in the table below.
The question of interest is: is the time to conception different in smokers and non-smokers?
Cycles | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | >12
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Smokers | 29 | 16 | 17 | 4 | 3 | 9 | 4 | 5 | 1 | 1 | 1 | 3 | 7
Non-smokers | 198 | 107 | 55 | 38 | 18 | 22 | 7 | 9 | 5 | 3 | 6 | 6 | 12
As usual we start with a plot. The R code for a suitable plot (Figure LABEL:fig:smoke-obs) is given below.
smokeobs <- c(29, 16, 17, 4, 3, 9, 4, 5, 1, 1, 1, 3, 7)
ns <- sum(smokeobs)  # sample size: smokers
nsmokeobs <- c(198, 107, 55, 38, 18, 22, 7, 9, 5, 3, 6, 6, 12)
nn <- sum(nsmokeobs)  # sample size: non-smokers
par(mfrow = c(2, 1))  # plot on a 2x1 grid
barplot(smokeobs, names.arg = 1:13, xlab = "Number of cycles",
        ylab = "Frequency", col = "blue", main = "smokers")
barplot(nsmokeobs, names.arg = 1:13, xlab = "Number of cycles",
        ylab = "Frequency", col = "blue", main = "non smokers")
One could postulate that the number of failed cycles before conception follows a geometric distribution, with parameter $\theta_s$ for smokers and $\theta_n$ for non-smokers. We could construct a hypothesis test with $H_0: \theta_s = \theta_n$ and $H_1: \theta_s \neq \theta_n$.
Similar to Example 6.5, $H_0$ implies a distribution that is nested in the distribution implied by $H_1$, so we could carry out a likelihood ratio test.
As derived in WS2.1, in general for IID random variables $X_1, \dots, X_n$ with a Geometric$(\theta)$ distribution, $P(X_i = x) = (1-\theta)\theta^x$ for $x = 0, 1, 2, \dots$, the log-likelihood is
\[ \ell(\theta) = n \log(1-\theta) + r \log\theta, \qquad \text{where } r = \sum_{i=1}^n x_i. \]
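The estimator used in the code below follows by setting the score to zero; the intermediate step is:

```latex
\frac{d\ell}{d\theta} = \frac{r}{\theta} - \frac{n}{1-\theta} = 0
\quad\Longrightarrow\quad r(1-\theta) = n\theta
\quad\Longrightarrow\quad \hat{\theta} = \frac{r}{n+r}.
```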
We can use R to calculate the log-likelihood implied by each of the hypotheses.
n <- ns + nn  # total sample size
llgeom <- function(theta, n, r) {
  return(n * log(1 - theta) + r * log(theta))
}

# under H0
tobs <- smokeobs + nsmokeobs
r0 <- sum(tobs * (0:12))
thethat0 <- r0 / (n + r0)
llgeom(thethat0, n, r0)
# under H1
r1s <- sum(smokeobs * (0:12))
thethat1s <- r1s / (ns + r1s)
r1n <- sum(nsmokeobs * (0:12))
thethat1n <- r1n / (nn + r1n)
llgeom(thethat1s, ns, r1s) + llgeom(thethat1n, nn, r1n)
Twice the difference between the log-likelihoods exceeds 3.84, the upper 95% point of the $\chi^2_1$ distribution, so $H_0$ is rejected: there is evidence that the time to conception differs between smokers and non-smokers.
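Putting the pieces together, the whole test can be run as one self-contained chunk. This is a sketch that repeats the calculations above (with shorter variable names), so the conclusion can be checked independently of the earlier session:

```r
# Self-contained likelihood ratio test for the conception data
smoke <- c(29, 16, 17, 4, 3, 9, 4, 5, 1, 1, 1, 3, 7)
nonsm <- c(198, 107, 55, 38, 18, 22, 7, 9, 5, 3, 6, 6, 12)
ll <- function(theta, n, r) n * log(1 - theta) + r * log(theta)
ns <- sum(smoke); nn <- sum(nonsm); n <- ns + nn
rs <- sum(smoke * (0:12)); rn <- sum(nonsm * (0:12)); r0 <- rs + rn
l0 <- ll(r0 / (n + r0), n, r0)          # maximised log-likelihood under H0
l1 <- ll(rs / (ns + rs), ns, rs) +
      ll(rn / (nn + rn), nn, rn)        # maximised log-likelihood under H1
dev <- 2 * (l1 - l0)                    # compare with qchisq(0.95, 1) = 3.84
pchisq(dev, df = 1, lower.tail = FALSE) # p-value of the test
```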
Is this geometric distribution, with $\theta$ depending on smoking status, a good model for the data? We can plot observed against expected values to assess this.
Note that R's dgeom(x, prob) has pmf $\mathrm{prob}\,(1-\mathrm{prob})^x$, i.e. prob is the per-cycle success probability, so we must pass $1 - \hat{\theta}$ rather than $\hat{\theta}$.

par(mfrow = c(2, 1))  # plot on a 2x1 grid
# smokers
expdatas <- ns * dgeom(0:12, 1 - thethat1s)
# non-smokers
expdatan <- nn * dgeom(0:12, 1 - thethat1n)
barplot(rbind(smokeobs, expdatas), names.arg = 1:13,
        xlab = "Number of cycles", ylab = "Frequency",
        col = c("blue", "green"), main = "smokers", beside = T)
legend("topright", c("observed", "expected"),
       col = c("blue", "green"), lty = 1)
barplot(rbind(nsmokeobs, expdatan), names.arg = 1:13,
        xlab = "Number of cycles", ylab = "Frequency",
        col = c("blue", "green"), main = "non smokers", beside = T)
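One detail the plot glosses over: the dgeom probabilities for 0 to 12 failures do not sum to one, while the final table column records women taking more than 12 cycles. A hedged sketch of expected counts that lumps the tail into the last category (an assumption about how that final column was recorded) is:

```r
# Expected counts for the smokers with the ">12 cycles" tail lumped into
# the final category, so the 13 probabilities sum exactly to 1
smoke <- c(29, 16, 17, 4, 3, 9, 4, 5, 1, 1, 1, 3, 7)
ns <- sum(smoke)
rs <- sum(smoke * (0:12))
that <- rs / (ns + rs)                     # MLE under pmf (1 - theta) * theta^x
p <- c((1 - that) * that^(0:11), that^12)  # last entry is P(X >= 12)
expected <- ns * p                         # sums exactly to ns
round(expected, 1)
```

This is the kind of adjustment needed before, say, a formal Pearson goodness-of-fit test on these counts.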
Real data are complex and messy: the standard distributions may not fit, and considerable thought may be needed to find an appropriate modelling strategy.
This course has laid the foundations: many more advanced methods exist. Some will be covered in other third year statistics modules; more still on postgraduate statistics courses.