13 Information and Sufficiency


13.1 Introduction

Last time we looked at some more examples of the method of maximum likelihood. When the parameter of interest, θ, is continuous, the MLE, θ^, can be found by differentiating the log-likelihood and setting the derivative equal to zero. We must then check that the second derivative of the log-likelihood is negative at our candidate θ^ to verify that we have found a maximum.

Definition.

Suppose we have a sample 𝐱 = (x_1, …, x_n), drawn from a density f(𝐱|θ) with unknown parameter θ, with log-likelihood l(θ|𝐱). The score function, S(θ), is the first derivative of the log-likelihood with respect to θ:

S(θ|𝐱) = l′(θ|𝐱) = ∂/∂θ l(θ|𝐱).

This is just giving a name to something we have already encountered.

As discussed previously, the MLE solves S(θ^) = 0. Here, f(𝐱|θ) is being used to denote the joint density of 𝐱 = (x_1, …, x_n). For the iid case, f(𝐱|θ) = ∏_{i=1}^n f(x_i|θ). Also, l(θ|𝐱) = log f(𝐱|θ). This is all just from the definitions.
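As a quick numerical illustration of the iid factorisation, here is a minimal R sketch; it uses R's built-in geometric density purely as an example, with an arbitrary sample and parameter value:

```r
# For iid data the joint density factorises, so the log-likelihood is a sum
theta <- 0.5
x <- c(0, 1, 1, 2)                      # a small illustrative sample
joint <- prod(dgeom(x, prob = theta))   # f(x | theta) as a product of marginals
loglik <- sum(dgeom(x, prob = theta, log = TRUE))
all.equal(log(joint), loglik)           # TRUE
```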

Definition.

Suppose we have a sample 𝐱 = (x_1, …, x_n), drawn from a density f(𝐱|θ) with unknown parameter θ, with log-likelihood l(θ|𝐱). The observed information function, I_O(θ), is MINUS the second derivative of the log-likelihood with respect to θ:

I_O(θ|𝐱) = −l″(θ|𝐱) = −∂²/∂θ² l(θ|𝐱).

Remember that the second derivative of l(θ) is negative at the MLE θ^ (that’s how we check it’s a maximum!). So the definition of observed information takes the negative of this to give something positive.

The observed information gets its name because it quantifies the amount of information obtained from a sample. An approximate 95% confidence interval for θ_true (the unobservable true value of the parameter θ) is given by

(θ^ − 1.96/√I_O(θ^), θ^ + 1.96/√I_O(θ^)).

This confidence interval is asymptotic, which means it is accurate when the sample is large. Further justification of where this interval comes from will follow later in the course.

What happens to the confidence interval as I_O(θ^) changes?
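The half-width of the interval is 1.96/√I_O(θ^), so a quick R sketch (with made-up information values) shows how the interval narrows as the information grows:

```r
# half-width of the approximate 95% interval as a function of observed information
halfwidth <- function(info) 1.96 / sqrt(info)
halfwidth(c(100, 400, 1600))   # 0.196 0.098 0.049: quadrupling I_O halves the width
```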

Example 13.1.1 (Mercedes Benz drivers)

You may recall the following example from last year. The website MBClub UK (associated with Mercedes Benz) carried out a poll on the number of times taken to pass a driving test. The results were as follows.

Number of failed attempts | 0   | 1  | 2  | 3 or more
Observed frequency        | 147 | 47 | 20 | 5
Table 13.1: Number of times taken for drivers to pass the driving test.

As always, we begin by looking at the data.

obsdata <- c(147, 47, 20, 5)
barplot(obsdata, names.arg = c(0:2, "3 or more"),
        xlab = "Number of failed attempts",
        ylab = "Frequency", col = "orange")

Next, we propose a model for the data to begin addressing the question.

It is proposed to model the data as iid (independent and identically distributed) draws from a geometric distribution.

Why is this a suitable model?

What assumptions are being made?

Are these assumptions reasonable?

The probability mass function (pmf) for the geometric distribution, where X is defined as the number of failed attempts, is given by

Pr[X = x] = θ(1 − θ)^x,

where x = 0, 1, 2, ….
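This is the same parameterisation as R's built-in geometric distribution (X counting failures before the first success), which we can check directly; θ = 0.5 below is just an arbitrary test value:

```r
theta <- 0.5
x <- 0:3
by_hand <- theta * (1 - theta)^x    # pmf from the formula above
builtin <- dgeom(x, prob = theta)   # R's geometric pmf
all.equal(by_hand, builtin)         # TRUE
```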

Assuming that the people in the ‘3 or more’ column failed exactly three times, the likelihood for general data x_1, …, x_n is

L(θ) = ∏_{i=1}^n θ(1 − θ)^{x_i},

and the log-likelihood is

l(θ) = ∑_{i=1}^n log{θ(1 − θ)^{x_i}}
     = ∑_{i=1}^n {log(θ) + x_i log(1 − θ)}
     = n log(θ) + log(1 − θ) ∑_{i=1}^n x_i.

The score function is therefore

S(θ) = l′(θ) = n/θ − (∑_{i=1}^n x_i)/(1 − θ).

A candidate for the MLE, θ^, solves S(θ^)=0:

  1. n/θ^ = (∑_{i=1}^n x_i)/(1 − θ^),

  2. n(1 − θ^) = θ^ ∑_{i=1}^n x_i,

  3. n = θ^(n + ∑_{i=1}^n x_i),

  4. θ^ = n/(n + ∑_{i=1}^n x_i).

To confirm this really is an MLE we need to verify it is a maximum, i.e. that the second derivative is negative.

l″(θ) = −n/θ² − (∑_{i=1}^n x_i)/(1 − θ)² < 0.

In this case the second derivative is clearly negative for all θ ∈ (0, 1); if it were not, we would just need to check that it is negative at the proposed MLE.

Now plugging in the numbers, n = 219 and ∑_{i=1}^n x_i = 0×147 + 1×47 + 2×20 + 3×5 = 102, we get

θ^ = 219/(219 + 102) = 0.682.

This is the same answer as the ‘obvious one’ from intuition.
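We can check this calculation in R, both from the closed-form expression and by maximising the log-likelihood numerically (the variable names are mine):

```r
obsdata <- c(147, 47, 20, 5)   # frequencies for 0, 1, 2, 3 failed attempts
n <- sum(obsdata)              # 219
sumx <- sum((0:3) * obsdata)   # 102
mle <- n / (n + sumx)          # closed-form MLE, 0.682
# maximise the log-likelihood numerically as a cross-check
loglik <- function(theta) n * log(theta) + sumx * log(1 - theta)
numerical <- optimize(loglik, c(0.01, 0.99), maximum = TRUE)$maximum
round(c(closed_form = mle, numerical = numerical), 3)
```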

But now we can calculate the observed information at θ^, and use this to construct a 95% confidence interval for θ_true.

I_O(θ^) = −l″(θ^)
        = n/θ^² + (∑_{i=1}^n x_i)/(1 − θ^)²
        = 219/0.682² + 102/(1 − 0.682)²
        = 1479.5.

Now the 95% confidence interval is given by

(l, u) = (θ^ − 1.96/√I_O(θ^), θ^ + 1.96/√I_O(θ^))
       = (0.682 − 1.96/√1479.5, 0.682 + 1.96/√1479.5)
       = (0.631, 0.733).
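The same numbers can be reproduced in R (variable names mine; tiny differences in the information come from rounding the MLE to 0.682 above):

```r
n <- 219; sumx <- 102
mle <- n / (n + sumx)
info <- n / mle^2 + sumx / (1 - mle)^2    # observed information at the MLE
ci <- mle + c(-1.96, 1.96) / sqrt(info)   # approximate 95% confidence interval
round(ci, 3)                              # 0.631 0.733
```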

We should also check the fit of the model by plotting the observed data against the theoretical data from the model (with the MLE plugged in for θ).

# value of theta: MLE
mletheta <- 0.682
# expected counts under the fitted geometric model
expdata <- 219 * c(dgeom(0:2, mletheta), 1 - pgeom(2, mletheta))
# plot observed and expected counts side by side
barplot(rbind(obsdata, expdata), names.arg = c(0:2, "3 or more"),
        xlab = "Number of failed attempts", ylab = "Frequency",
        col = c("orange", "red"), beside = TRUE)
# add legend
legend("topright", c("observed", "expected"),
       fill = c("orange", "red"))

We can actually do slightly better than this.

We assumed ‘the people in the “3 or more” column failed exactly three times’. With likelihood we don’t need to do this. Remember: the likelihood is just the joint probability of the data. In fact, people in the “3 or more” group have probability

Pr[X ≥ 3] = 1 − (Pr[X = 0] + Pr[X = 1] + Pr[X = 2])
          = 1 − (θ + (1 − θ)θ + (1 − θ)²θ).

We could therefore write the likelihood more correctly as

L(θ) = ∏_{i=1}^n {θ(1 − θ)^{x_i}}^{z_i} × ∏_{i=1}^n {1 − (θ + (1 − θ)θ + (1 − θ)²θ)}^{1 − z_i},

where z_i = 1 if x_i < 3 and z_i = 0 if x_i ≥ 3.

NOTE: if all we know about an observation x is that it exceeds some value, we say that x is censored. This is an important issue with patient data, as we may lose contact with a patient before we have finished observing them. Censoring is dealt with in more generality in MATH335 Medical Statistics.

What is the MLE of θ using the more correct version of the likelihood?

The bracketed sum in the second product (for the censored observations) is a geometric progression with first term a = θ and common ratio r = 1 − θ, and so Pr(X ≥ 3 | θ) = (1 − θ)³ (check that this is the case).
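A quick numerical check of this identity in R, at an arbitrary value of θ:

```r
theta <- 0.3   # arbitrary test value
lhs <- 1 - (theta + (1 - theta) * theta + (1 - theta)^2 * theta)
rhs <- (1 - theta)^3
all.equal(lhs, rhs)   # TRUE
```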

Hence the likelihood can be written

L(θ) = θ^{n_u} (1 − θ)^{∑ x_i} ((1 − θ)³)^{n_c}
     = θ^{n_u} (1 − θ)^{∑ x_i + 3n_c},

where the sum of the x_i’s only involves the uncensored observations, n_u denotes the number of uncensored observations, and n_c is the number of censored observations.

The log-likelihood becomes l(θ) = n_u log(θ) + (∑ x_i + 3n_c) log(1 − θ).

Differentiating, the score function is

S(θ) = l′(θ) = n_u/θ − (∑ x_i + 3n_c)/(1 − θ).

A candidate MLE solves S(θ^) = 0, giving

n_u/θ^ = (∑ x_i + 3n_c)/(1 − θ^)
n_u(1 − θ^) = θ^(∑ x_i + 3n_c)
n_u = θ^(n_u + ∑ x_i + 3n_c)
θ^ = n_u/(n_u + ∑ x_i + 3n_c).

Plugging in n_u = 147 + 47 + 20 = 214, ∑ x_i = 0×147 + 1×47 + 2×20 = 87 and n_c = 5, the value of the MLE using these data is 214/(214 + 102) = 0.677.
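As before, this can be checked in R against a numerical maximisation of the censored log-likelihood (variable names mine):

```r
nu <- 147 + 47 + 20            # number of uncensored observations, 214
nc <- 5                        # number censored ("3 or more")
sumx <- 0*147 + 1*47 + 2*20    # sum of x_i over the uncensored observations, 87
mle <- nu / (nu + sumx + 3 * nc)   # closed-form MLE, 0.677
# censored log-likelihood, maximised numerically as a cross-check
loglik <- function(theta) nu * log(theta) + (sumx + 3 * nc) * log(1 - theta)
numerical <- optimize(loglik, c(0.01, 0.99), maximum = TRUE)$maximum
round(c(closed_form = mle, numerical = numerical), 3)
```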

Compare this to the original MLE of 0.682.

Why is the new estimate different to this?

Why is the difference small?