We now explore examples of likelihood inference for some common models.
Accident and emergency departments are hard to manage because patients arrive at random (they are unscheduled). Some patients may need to be seen urgently.
Excess staff (doctors and nurses) must be avoided because this wastes NHS money; however, A&E departments also have to adhere to performance targets (e.g. patients dealt with within four hours). Staffing levels therefore need to be balanced: enough staff to meet the targets, but not so many that money is wasted.
A first step in achieving this is to study data on patient arrival times. It is proposed that we model the time between patient arrivals as iid realisations from an Exponential distribution.
Why is this a suitable model?
What assumptions are being made?
Are these assumptions reasonable?
Suppose I stand outside Lancaster Royal Infirmary A&E and record the following inter-arrival times of patients (in minutes):
18.39, 2.70, 5.42, 0.99, 5.42, 31.97, 2.96, 5.28, 8.51, 10.90.
As usual, the first thing we do is look at the data!
The exponential pdf is given by
$$f(x;\lambda) = \lambda e^{-\lambda x}$$
for $x > 0$, $\lambda > 0$. Assuming that the data are iid, the definition of the likelihood function for $\lambda$ gives us, for general data $x_1, \ldots, x_n$,
$$L(\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^{n} e^{-\lambda \sum_{i=1}^{n} x_i}.$$
Note: we usually drop the explicit dependence on the data from the notation, writing $L(\lambda)$ rather than $L(\lambda; x_1, \ldots, x_n)$, whenever possible.
Usually, when we have products and the parameter is continuous, the best way to find the MLE is to find the log-likelihood and differentiate.
So the log-likelihood is
$$\ell(\lambda) = \log L(\lambda) = n \log \lambda - \lambda \sum_{i=1}^{n} x_i.$$
Now we differentiate:
$$\ell'(\lambda) = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i.$$
Now solutions to $\ell'(\lambda) = 0$ are potential MLEs; here the unique solution is
$$\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}.$$
To ensure this is a maximum we check the second derivative is negative:
$$\ell''(\lambda) = -\frac{n}{\lambda^2} < 0 \quad \text{for all } \lambda > 0.$$
So the solution we have found is the MLE, and plugging in our data we find (via 1/mean(arrive) in R) $\hat{\lambda} = 1/\bar{x} = 1/9.254 \approx 0.108$ arrivals per minute.
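A quick sketch of this calculation in R (the vector name `arrive` follows the note above; the code itself is ours):

```r
# Inter-arrival times (minutes) recorded outside A&E
arrive <- c(18.39, 2.70, 5.42, 0.99, 5.42, 31.97, 2.96, 5.28, 8.51, 10.90)
lambda.hat <- 1 / mean(arrive)   # MLE of the exponential rate
lambda.hat                       # approximately 0.108 arrivals per minute
```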
Now that we have our MLE, we should check that the assumed model seems reasonable. Here, we will use a QQ-plot.
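One way to produce an exponential QQ-plot in R (a sketch, reusing `arrive` and `lambda.hat` from above):

```r
# Compare sample quantiles with theoretical Exp(lambda.hat) quantiles
n <- length(arrive)
theoretical <- qexp(ppoints(n), rate = lambda.hat)
plot(theoretical, sort(arrive),
     xlab = "Theoretical quantiles", ylab = "Sample quantiles",
     main = "Exponential QQ-plot")
abline(0, 1)   # points close to this line indicate a reasonable fit
```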
Given the small dataset, this seems ok – there is no obvious evidence of deviation from the exponential model.
Knowing that the exponential distribution is reasonable, and having an estimate of its rate, is useful for calculating staff scheduling requirements in the A&E.
Extensions of this idea consider flows of patients through the various services (take MATH332 Stochastic Processes and/or the STOR-i MRes for more on this).
In an article by Mackowiak et al. (Mackowiak, P.A., Wasserman, S.S. and Levine, M.M. (1992) A Critical Appraisal of 98.6 Degrees F, the Upper Limit of the Normal Body Temperature, and Other Legacies of Carl Reinhold August Wunderlich), the authors measure the body temperatures of a number of individuals to assess whether true mean body temperature is 98.6 degrees Fahrenheit or not. A dataset of individuals is available in the normtemp dataset. The data are assumed to be normally distributed with known standard deviation $\sigma$.
Why is this a suitable model?
What assumptions are being made?
Are these assumptions reasonable?
What do the data look like?
The plot can be produced using
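R code along the following lines (a sketch, assuming the body temperatures are stored in a numeric vector `temp`; the variable name is ours):

```r
# Histogram of the recorded body temperatures
hist(temp, breaks = 15,
     xlab = "Body temperature (degrees Fahrenheit)",
     main = "Histogram of body temperatures")
```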
The histogram of the data is reasonable, but there might be some skew in the data (right tail).
The normal pdf is given by
$$f(x;\mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\},$$
where, in this case, $\sigma$ is known.
The likelihood is then
$$L(\mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x_i-\mu)^2}{2\sigma^2}\right\} = (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right\}.$$
Since the parameter of interest (in this case $\mu$) is continuous, we can differentiate the log-likelihood to find the MLE:
$$\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2,$$
and so
$$\ell'(\mu) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu).$$
For candidate MLEs we set this to zero and solve, i.e.
$$\sum_{i=1}^{n}x_i - n\mu = 0,$$
and so the MLE is $\hat{\mu} = \bar{x}$.
This is also the “obvious” estimate (the sample mean). To check it is indeed an MLE, the second derivative of the log-likelihood is
$$\ell''(\mu) = -\frac{n}{\sigma^2} < 0,$$
which confirms this is the case.
Using the data, we find $\hat{\mu} = \bar{x} \approx 98.25$ degrees Fahrenheit.
This might indicate evidence for the body temperature being different from the assumed 98.6 degrees Fahrenheit.
We now check the fit:
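A sketch of the check in R, again assuming the temperatures are in the vector `temp`:

```r
# Normal QQ-plot of the body temperatures
qqnorm(temp)
qqline(temp)   # reference line through the first and third quartiles
```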
The fit looks good – although (as the histogram previously showed) there is possibly some mild right (positive) skew, indicated by the quantile points above the line.
Why might the QQ-plot show the “stepped” behaviour of the points?
Every day I cycle to Lancaster University, and have to pass through the traffic lights at the crossroads by Booths (heading south down Scotforth Road). I am either stopped or not stopped by the traffic lights. Over a period of a term, I cycle in 50 times. Suppose that the time I arrive at the traffic lights is independent of the traffic light sequence.
On 36 of the 50 days, the lights are on green and I can cycle straight through. Let $\theta$ be the probability that the lights are on green. Write down the likelihood and log-likelihood of $\theta$, and hence calculate its MLE.
With the usual iid assumption we see that, if $X$ is the number of times the lights are on green, then $X \sim \mathrm{Binomial}(50, \theta)$. So we have
$$P(X = x) = \binom{50}{x}\theta^{x}(1-\theta)^{50-x}.$$
We therefore have, for general $n$ and $x$,
$$L(\theta) = \binom{n}{x}\theta^{x}(1-\theta)^{n-x} \propto \theta^{x}(1-\theta)^{n-x},$$
and
$$\ell(\theta) = x\log\theta + (n-x)\log(1-\theta) + \text{constant}.$$
Solutions to $\ell'(\theta) = 0$ are potential MLEs:
$$\ell'(\theta) = \frac{x}{\theta} - \frac{n-x}{1-\theta} = 0,$$
and, provided $0 < x < n$, we have $\hat{\theta} = x/n$.
For this to be an MLE it must have negative second derivative.
In particular we have
$$\ell''(\theta) = -\frac{x}{\theta^2} - \frac{n-x}{(1-\theta)^2} < 0,$$
and so $\hat{\theta} = x/n = 36/50 = 0.72$ is the MLE.
Now suppose that over a two week period, on the 14 occasions I get stopped by the traffic lights (they are on red) my waiting times are given by (in seconds)
Assume that the traffic lights remain on red for a fixed amount of time $\theta$ seconds, regardless of the traffic conditions.
Given the above data, write down the likelihood of $\theta$, and sketch it. What is the MLE of $\theta$?
We are going to assume that these waiting times are drawn independently from a Uniform$(0,\theta)$ distribution, where $\theta$ is the parameter we wish to estimate.
Why is this a suitable model?
What assumptions are being made?
Are these assumptions reasonable?
Constructing the likelihood for this example is slightly different from those we have seen before. The pdf of the Uniform$(0,\theta)$ distribution is
$$f(x;\theta) = \begin{cases} 1/\theta & 0 \le x \le \theta, \\ 0 & \text{otherwise.} \end{cases}$$
The unusual thing here is that the data enter only through the boundary conditions on the pdf. Another way to write the above pdf is
$$f(x;\theta) = \frac{1}{\theta}\,\mathbb{1}\{0 \le x \le \theta\},$$
where $\mathbb{1}\{\cdot\}$ is the indicator function.
For data $x_1, \ldots, x_n$, the likelihood function is then
$$L(\theta) = \prod_{i=1}^{n} \frac{1}{\theta}\,\mathbb{1}\{0 \le x_i \le \theta\}.$$
We can write this as
$$L(\theta) = \theta^{-n}\,\mathbb{1}\{\theta \ge \max_i x_i\}.$$
For our case we have $n = 14$ and the largest observed waiting time $\max_i x_i$, so $L(\theta) = \theta^{-14}$ for $\theta \ge \max_i x_i$, and $L(\theta) = 0$ otherwise.
We are next asked to sketch this likelihood. In R:
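(A sketch: the waiting times themselves are not listed above, so the code assumes they are stored in a vector `wait` of length 14.)

```r
# Likelihood of theta for the Uniform(0, theta) model
n     <- length(wait)
theta <- seq(0, 2 * max(wait), length.out = 500)
lik   <- ifelse(theta >= max(wait), theta^(-n), 0)
plot(theta, lik, type = "l", xlab = expression(theta), ylab = "Likelihood")
abline(v = max(wait), lty = 2)   # the likelihood jumps to its maximum here
```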
From the plot it is clear that $\hat{\theta} = \max_i x_i$, since this is the value that leads to the maximum likelihood. Notice that solving $\ell'(\theta) = 0$ would not work in this case, since the likelihood is not differentiable at the MLE.
However, on the feasible range of $\theta$, i.e. $\theta \ge \max_i x_i$, we have
$$\ell(\theta) = -n\log\theta,$$
and so
$$\ell'(\theta) = -\frac{n}{\theta} < 0.$$
Remember that derivatives express the rate of change of a function. Hence, since the derivative of the (log-)likelihood is negative ($\theta$ is strictly positive), the likelihood is decreasing on the feasible range of parameter values.
Since we are trying to maximise the likelihood, this means we should take the minimum over the feasible range as the MLE. The minimum value of $\theta$ on the feasible range is $\max_i x_i$, so $\hat{\theta} = \max_i x_i$.
One case where differentiation is clearly not the right approach to use for maximisation is when the parameter of interest is discrete.
A computer network comprises $N$ computers. The probability that any one of these computers stores illegally downloaded files is $p$ (assumed known), independently for each computer. In a particular network it is found that exactly one computer contains illegally downloaded files. Our parameter of interest is $N$.
What is a suitable model for the data?
What assumptions are being made?
Are these assumptions reasonable?
What is the likelihood of $N$?
Let $X$ be the number of computers in the network that contain illegally downloaded files. Then $X \sim \mathrm{Binomial}(N, p)$, so the likelihood is
$$L(N) = P(X = 1) = N p (1-p)^{N-1}.$$
Note that the possible values $N$ can take are $1, 2, 3, \ldots$. We can sketch the likelihood for a suitable range of values:
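A sketch in R; the value of $p$ is not reproduced in these notes, so `p <- 0.15` below is purely illustrative:

```r
p   <- 0.15                     # illustrative value only
N   <- 1:30
lik <- N * p * (1 - p)^(N - 1)  # binomial pmf at x = 1, viewed as a function of N
plot(N, lik, xlab = "N", ylab = "Likelihood")   # points only: N is discrete
N[which.max(lik)]               # the MLE: the integer part of 1/p
```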
From the plot, we can see the MLE for $N$. Alternatively, from the likelihood we have
$$\frac{L(N)}{L(N-1)} = \frac{N(1-p)}{N-1}.$$
The likelihood is increasing for $L(N)/L(N-1) > 1$, which is equivalent to $N < 1/p$.
To maximize the likelihood, we want the largest (integer) value of $N$ satisfying this constraint, i.e. $\hat{N} = \lfloor 1/p \rfloor$, the integer part of $1/p$.
The ratio between two likelihood values is useful to look at for other reasons.
Suppose we have data $x_1, \ldots, x_n$ that arise from a population with likelihood function $L(\theta)$, with MLE $\hat{\theta}$. Then the relative likelihood of the parameter value $\theta$ is
$$R(\theta) = \frac{L(\theta)}{L(\hat{\theta})}.$$
The relative likelihood quantifies how likely different values of $\theta$ are relative to the maximum likelihood estimate.
Using this definition, we can construct relative likelihood intervals which are similar to confidence intervals.
A $p\%$ relative likelihood interval for $\theta$ is defined as the set
$$\left\{\theta : \frac{L(\theta)}{L(\hat{\theta})} \ge \frac{p}{100}\right\}.$$
For example, a 50% relative likelihood interval for $N$ in our example would be
$$\left\{N : \frac{L(N)}{L(\hat{N})} \ge 0.5\right\}.$$
By plugging in different values of $N$, we can identify the set of values making up the relative likelihood interval. The values in the interval can be seen in the figure below.
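A sketch in R, continuing the illustrative `p <- 0.15` used for the likelihood sketch above:

```r
# Relative likelihood R(N) = L(N) / L(N.hat); the 50% interval is all N with R(N) >= 0.5
rel.lik <- lik / max(lik)
N[rel.lik >= 0.5]   # the values of N forming the 50% relative likelihood interval
```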
Suppose we are interested in estimating $m$, the number of distinct colours of Smarties.
In order to estimate , suppose members of the class make a number of draws and record the colour.
Suppose that the data collected (seven draws) were:
purple, blue, brown, blue, brown, purple, brown.
We record whether we had a new colour or repeat:
New, New, New, Repeat, Repeat, Repeat, Repeat.
Let $m$ denote the number of unique colours, and assume each draw is equally likely to be any of the $m$ colours. Then the likelihood function for $m$ given the above data is:
$$L(m) = 1 \times \frac{m-1}{m} \times \frac{m-2}{m} \times \left(\frac{3}{m}\right)^{4}, \qquad m \ge 3.$$
If in a second experiment, we observed:
New, New, New, Repeat, New, Repeat, New,
then the likelihood would be:
$$L(m) = 1 \times \frac{m-1}{m} \times \frac{m-2}{m} \times \frac{3}{m} \times \frac{m-3}{m} \times \frac{4}{m} \times \frac{m-4}{m}, \qquad m \ge 5.$$
The MLEs in each case are $\hat{m} = 3$ and $\hat{m} = 8$ respectively.
The plots below show the respective likelihoods.
(Figures: the likelihood of $m$ for the first and second experiments respectively.)
R code for plotting these likelihoods:
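A sketch based on the likelihoods written above:

```r
# Experiment 1 (N, N, N, R, R, R, R) and Experiment 2 (N, N, N, R, N, R, N)
m1 <- 3:15
L1 <- ((m1 - 1) / m1) * ((m1 - 2) / m1) * (3 / m1)^4
m2 <- 5:15
L2 <- ((m2 - 1) / m2) * ((m2 - 2) / m2) * (3 / m2) *
      ((m2 - 3) / m2) * (4 / m2) * ((m2 - 4) / m2)
par(mfrow = c(1, 2))
plot(m1, L1, xlab = "m", ylab = "Likelihood", main = "Experiment 1")  # points only
plot(m2, L2, xlab = "m", ylab = "Likelihood", main = "Experiment 2")
```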
Three randomly selected members of a class of 10 students are canvassed for their opinion on Brexit. Two are in favour of staying in Europe. What can one infer about the overall class opinion?
The parameter in this model is the number of pro-Remain students in the class, $R$ say. It is discrete, and could take values $0, 1, \ldots, 10$. The actual true unknown value of $R$ is designated by $r_0$.
Now the probability of observing 2 pro-Remain students in a sample of 3 is
$$P(X = 2) = \frac{\binom{R}{2}\binom{10-R}{1}}{\binom{10}{3}}.$$
Now since the likelihood function of $R$ is the probability (or density) of the observed data for given values of $R$, we have
$$L(R) = \frac{\binom{R}{2}\binom{10-R}{1}}{\binom{10}{3}}$$
for $R = 2, 3, \ldots, 9$ (and $L(R) = 0$ otherwise).
This function is not continuous (because the parameter is discrete). It can be maximised but not by differentiation.
The maximum likelihood estimate is $\hat{R} = 7$. Note that the points are not joined up in this plot. This is to emphasize the discrete nature of the parameter of interest.
The probability model is an instance of the hypergeometric distribution.
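A sketch of the likelihood calculation in R, using the built-in hypergeometric pmf `dhyper`:

```r
# Likelihood of R, the number of pro-Remain students in a class of 10,
# having observed 2 Remainers in a sample of 3 drawn without replacement
R   <- 0:10
lik <- dhyper(2, m = R, n = 10 - R, k = 3)
plot(R, lik, xlab = "Number of pro-Remain students", ylab = "Likelihood")  # points only
R[which.max(lik)]   # the maximum likelihood estimate
```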
A procedure for modelling and inference:
Subject-matter question needs answering.
Data are, or become, available to address this question.
Look at the data – exploratory analysis.
Propose a model.
Check the model fits.
Use the model to address the question.
The likelihood function is the probability (or density) of the observed data, viewed as a function of the unknown parameter. Often we use the log-likelihood function as it is easier to work with.
The maximum likelihood estimator (MLE) is the value of the parameter that maximises the likelihood. This is intuitively appealing, and later we will show it is a theoretically justified choice. The MLE should be found using an appropriate maximisation technique.
If the parameter is continuous, we can often (but not always) find the MLE by considering the derivative of the log-likelihood. If the parameter is discrete, we usually evaluate the likelihood at a range of possible values.
DON’T JOIN UP POINTS WHEN PLOTTING THE LIKELIHOOD FOR A DISCRETE PARAMETER.
DO NOT DIFFERENTIATE LIKELIHOODS OF DISCRETE PARAMETERS!
Last time we looked at some more examples of the method of maximum likelihood. When the parameter of interest, $\theta$, is continuous, the MLE, $\hat{\theta}$, can be found by differentiating the log-likelihood and setting it equal to zero. We must then check the second derivative of the log-likelihood is negative (at our candidate $\hat{\theta}$) to verify that we have found a maximum.
Suppose we have a sample $x_1, \ldots, x_n$, drawn from a density $f(x;\theta)$ with unknown parameter $\theta$, with log-likelihood $\ell(\theta)$. The score function, $U(\theta)$, is the first derivative of the log-likelihood with respect to $\theta$:
$$U(\theta) = \frac{\mathrm{d}\ell(\theta)}{\mathrm{d}\theta}.$$
This is just giving a name to something we have already encountered.
As discussed previously, the MLE solves $U(\theta) = 0$. Here, $f(\mathbf{x};\theta)$ is being used to denote the joint density of $\mathbf{x} = (x_1, \ldots, x_n)$. For the iid case, $f(\mathbf{x};\theta) = \prod_{i=1}^{n} f(x_i;\theta)$. Also, $\ell(\theta) = \log L(\theta) = \log f(\mathbf{x};\theta)$. This is all just from the definitions.
Suppose we have a sample $x_1, \ldots, x_n$, drawn from a density $f(x;\theta)$ with unknown parameter $\theta$, with log-likelihood $\ell(\theta)$. The observed information function, $I_O(\theta)$, is MINUS the second derivative of the log-likelihood with respect to $\theta$:
$$I_O(\theta) = -\frac{\mathrm{d}^2\ell(\theta)}{\mathrm{d}\theta^2}.$$
Remember that the second derivative of $\ell(\theta)$ is negative at the MLE (that’s how we check it’s a maximum!). So the definition of observed information takes the negative of this to give something positive.
The observed information gets its name because it quantifies the amount of information obtained from a sample. An approximate 95% confidence interval for $\theta_0$ (the unobservable true value of the parameter $\theta$) is given by
$$\hat{\theta} \pm \frac{1.96}{\sqrt{I_O(\hat{\theta})}}.$$
This confidence interval is asymptotic, which means it is accurate when the sample is large. Some further justification on where this interval comes from will follow later in the course.
What happens to the confidence interval as $I_O(\hat{\theta})$ changes?
You may recall the following example from last year. The website MBClub UK (associated with Mercedes-Benz) carried out a poll on the number of times taken to pass a driving test. The results were as follows.
Number of failed attempts | 0 | 1 | 2 | 3 or more
---|---|---|---|---
Observed frequency | 147 | 47 | 20 | 5
As always, we begin by looking at the data.
Next, we propose a model for the data to begin addressing the question.
It is proposed to model the data as iid (independent and identically distributed) draws from a geometric distribution.
Why is this a suitable model?
What assumptions are being made?
Are these assumptions reasonable?
The probability mass function (pmf) for the geometric distribution, where $X$ is defined as the number of failed attempts, is given by
$$P(X = x) = \theta(1-\theta)^{x}, \qquad x = 0, 1, 2, \ldots,$$
where $0 < \theta < 1$ is the probability of passing on any given attempt.
Assuming that the people in the ‘3 or more’ column failed exactly three times, the likelihood for general data $x_1, \ldots, x_n$ is
$$L(\theta) = \prod_{i=1}^{n} \theta(1-\theta)^{x_i} = \theta^{n}(1-\theta)^{\sum_{i=1}^{n} x_i},$$
and the log-likelihood is
$$\ell(\theta) = n\log\theta + \sum_{i=1}^{n} x_i \log(1-\theta).$$
The score function is therefore
$$U(\theta) = \frac{n}{\theta} - \frac{\sum_{i=1}^{n} x_i}{1-\theta}.$$
A candidate for the MLE, $\hat{\theta}$, solves $U(\hat{\theta}) = 0$:
$$\hat{\theta} = \frac{n}{n + \sum_{i=1}^{n} x_i}.$$
To confirm this really is an MLE we need to verify it is a maximum, i.e. a negative second derivative:
$$\ell''(\theta) = -\frac{n}{\theta^2} - \frac{\sum_{i=1}^{n} x_i}{(1-\theta)^2}.$$
In this case the function is clearly negative for all $0 < \theta < 1$; if not, we would just need to check this is the case at the proposed MLE.
Now plugging in the numbers, $n = 219$ and $\sum_{i=1}^{n} x_i = 0 \times 147 + 1 \times 47 + 2 \times 20 + 3 \times 5 = 102$, we get
$$\hat{\theta} = \frac{219}{219 + 102} = 0.682.$$
This is the same answer as the ‘obvious one’ from intuition.
But now we can calculate the observed information at $\hat{\theta}$,
$$I_O(\hat{\theta}) = \frac{n}{\hat{\theta}^2} + \frac{\sum_{i=1}^{n} x_i}{(1-\hat{\theta})^2} \approx 1480,$$
and use this to construct a 95% confidence interval for $\theta$.
Now the 95% confidence interval is given by
$$\hat{\theta} \pm \frac{1.96}{\sqrt{I_O(\hat{\theta})}} \approx 0.682 \pm 0.051 = (0.631, 0.733).$$
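A sketch of these calculations in R (the variable names are ours):

```r
# Driving test data: 147, 47, 20, 5 people with 0, 1, 2, 3+ failed attempts,
# treating '3 or more' as exactly 3 for now
x    <- c(0, 1, 2, 3)
freq <- c(147, 47, 20, 5)
n    <- sum(freq)                 # 219 respondents
sx   <- sum(x * freq)             # total number of failed attempts
theta.hat <- n / (n + sx)         # MLE, approximately 0.682
obs.info  <- n / theta.hat^2 + sx / (1 - theta.hat)^2
theta.hat + c(-1, 1) * 1.96 / sqrt(obs.info)   # approximate 95% CI
```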
We should also check the fit of the model by plotting the observed data against the theoretical data from the model (with the MLE plugged in for ).
We can actually do slightly better than this.
We assumed ‘the people in the “3 or more” column failed exactly three times’. With likelihood we don’t need to do this. Remember: the likelihood is just the joint probability of the data. In fact, people in the “3 or more” group have probability
$$P(X \ge 3) = (1-\theta)^{3}.$$
We could therefore write the likelihood more correctly as
$$L(\theta) = \prod_{i:\,\delta_i = 1} \theta(1-\theta)^{x_i} \times \prod_{i:\,\delta_i = 0} P(X_i \ge 3),$$
where $\delta_i = 1$ if $x_i$ is observed exactly and $\delta_i = 0$ if all we know is that $x_i \ge 3$.
NOTE: if all we know about an observation is that it exceeds some value, we say that the observation is censored. This is an important issue with patient data, as we may lose contact with a patient before we have finished observing them. Censoring is dealt with in more generality in MATH335 Medical Statistics.
What is the MLE of $\theta$ using the more correct version of the likelihood?
The term in the second product (for the censored observations) can be seen as a geometric progression with first term $\theta(1-\theta)^3$ and common ratio $(1-\theta)$, and so $P(X_i \ge 3) = (1-\theta)^3$ (check that this is the case).
Hence the likelihood can be written
$$L(\theta) = \theta^{m}(1-\theta)^{\sum x_i} \times (1-\theta)^{3c},$$
where the sum of $x_i$’s only involves the uncensored observations, $m$ denotes the number of uncensored observations, and $c$ is the number of censored observations.
The log-likelihood becomes $\ell(\theta) = m\log\theta + \left(\sum x_i + 3c\right)\log(1-\theta)$.
Differentiating, the score function is
$$U(\theta) = \frac{m}{\theta} - \frac{\sum x_i + 3c}{1-\theta}.$$
A candidate MLE solves $U(\hat{\theta}) = 0$, giving
$$\hat{\theta} = \frac{m}{m + \sum x_i + 3c}.$$
The value of the MLE using these data is $\hat{\theta} = 214/(214 + 87 + 15) \approx 0.677$.
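A quick check of this in R (continuing the sketch above):

```r
# Censored version: the 5 people in '3 or more' contribute P(X >= 3) = (1 - theta)^3
m  <- 147 + 47 + 20               # number of uncensored observations
sx <- 0 * 147 + 1 * 47 + 2 * 20   # failed attempts among the uncensored
cc <- 5                           # number of censored observations
m / (m + sx + 3 * cc)             # censored-data MLE, approximately 0.677
```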
Compare this to the original MLE of 0.682.
Why is the new estimate different to this?
Why is the difference small?
Last time we introduced the score function (the derivative of the log-likelihood), and the observed information function (MINUS the second derivative of the log-likelihood). The score function is zero at the MLE. The observed information function evaluated at the MLE gives us a method to construct confidence intervals.
We will now study the concept of observed information in more detail.
Humans are a diploid species, which means you have two copies of every gene (one from your father, one from your mother). Genes occur in different forms; this is what leads to some of the different traits you see in humans (e.g. eye colour). Mendelian traits are a special kind of trait that are determined by a single gene.
Having wet or dry earwax is a Mendelian trait. Earwax wetness is controlled by the gene ABCC11 (this gene lives about half way along chromosome 16). We will call the wet earwax version of ABCC11 W, and the dry version w. The wet version is dominant, which means you only need one copy of W to have wet earwax. Both copies of the gene need to be w to get dry earwax.
The Hardy-Weinberg law of genetics states that if W occurs in a (randomly mating) population with proportion $\theta$ (so w occurs with proportion $1-\theta$), potential combinations in humans obey the proportions:

combination | WW | Ww | ww
---|---|---|---
proportion | $\theta^2$ | $2\theta(1-\theta)$ | $(1-\theta)^2$
Suppose I take a sample of 100 people and assess the wetness of their earwax. I observe that 87 of the people have wet earwax and 13 of them have dry earwax.
I am actually interested in $\theta$, the proportion of copies of W in my population.
Show that the probability of a person having wet earwax is $1 - (1-\theta)^2$, and that the probability of a person having dry earwax is $(1-\theta)^2$. Also show that these two probabilities sum to $1$.
The number of people with wet earwax in my sample is therefore $X \sim \mathrm{Binomial}\!\left(100,\, 1-(1-\theta)^2\right)$. So
$$P(X = 87) = \binom{100}{87}\left\{1-(1-\theta)^2\right\}^{87}\left\{(1-\theta)^2\right\}^{13}.$$
IMPORTANT FACT: when writing down the likelihood, we can always omit multiplicative constants, since they become additive in the log-likelihood, then disappear in the differentiation. A multiplicative constant is one that does not depend on the parameter of interest (here $\theta$).
So we can write down the likelihood as
$$L(\theta) = \left\{1-(1-\theta)^2\right\}^{87}(1-\theta)^{26}.$$
So the log likelihood is
$$\ell(\theta) = 87\log\left\{1-(1-\theta)^2\right\} + 26\log(1-\theta)$$
(plus a constant).
Now $\theta$ is a continuous parameter so a suitable way to find a candidate MLE is to differentiate. The score function is
$$U(\theta) = \frac{174(1-\theta)}{1-(1-\theta)^2} - \frac{26}{1-\theta}.$$
We can solve $U(\theta) = 0$; it can be rearranged as a quadratic in $(1-\theta)$:
$$174(1-\theta)^2 = 26\left\{1-(1-\theta)^2\right\} \quad\Longrightarrow\quad 200(1-\theta)^2 = 26 \quad\Longrightarrow\quad (1-\theta)^2 = 0.13.$$
This gives two solutions, $\theta = 1 \pm \sqrt{0.13}$, but we need $0 \le \theta \le 1$ as it is a proportion, so we get $\hat{\theta} = 1 - \sqrt{0.13} \approx 0.639$ as our potential MLE.
The second derivative is
$$\ell''(\theta) = -\frac{174\left\{1+(1-\theta)^2\right\}}{\left\{1-(1-\theta)^2\right\}^2} - \frac{26}{(1-\theta)^2}.$$
This is clearly negative at $\hat{\theta}$ (indeed for all $0 < \theta < 1$), confirming that it is a maximum.
The observed information is obtained by substituting $\hat{\theta}$ into $-\ell''(\theta)$, giving
$$I_O(\hat{\theta}) = \frac{174(1 + 0.13)}{0.87^2} + \frac{26}{0.13} \approx 460.$$
Hence an approximate 95% confidence interval for $\theta$ is given by
$$\hat{\theta} \pm \frac{1.96}{\sqrt{I_O(\hat{\theta})}} \approx 0.639 \pm 0.091 = (0.548, 0.731).$$
After all that derivation, don’t forget the context. This is a 95% confidence interval for the proportion of people with a W variant of ABCC11 gene in the population of interest.
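A numerical cross-check of these calculations in R (a sketch; the function and variable names are ours):

```r
# Phenotype data: 87 wet, 13 dry; the dry contribution is {(1-theta)^2}^13 = (1-theta)^26
loglik <- function(theta) 87 * log(1 - (1 - theta)^2) + 26 * log(1 - theta)
fit <- optimise(loglik, interval = c(0.01, 0.99), maximum = TRUE)
theta.hat <- fit$maximum                      # approximately 1 - sqrt(0.13)
obs.info  <- 174 * (1 + (1 - theta.hat)^2) / (1 - (1 - theta.hat)^2)^2 +
             26 / (1 - theta.hat)^2
theta.hat + c(-1, 1) * 1.96 / sqrt(obs.info)  # approximate 95% CI
```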
Suppose that, instead of looking in people’s ears to see whether their wax is wet or dry we decide to genotype them instead, thereby knowing whether they are WW, Ww or ww.
This is a considerably more expensive option (although perhaps a little less disgusting) so a natural question is: what do we gain by doing this?
We take the same 100 people and find that 42 are WW, 45 are Ww and 13 are ww. Think about how this relates back to the earwax wetness. Did we need to genotype everyone?
The likelihood function for $\theta$ given our new information is
$$L(\theta) = (\theta^2)^{42}\left\{2\theta(1-\theta)\right\}^{45}\left\{(1-\theta)^2\right\}^{13} \propto \theta^{129}(1-\theta)^{71}.$$
The log-likelihood is
$$\ell(\theta) = 129\log\theta + 71\log(1-\theta) + k,$$
where $k$ is a constant.
As before, $\theta$ is continuous so we can find candidates for the MLE by differentiating:
$$U(\theta) = \frac{129}{\theta} - \frac{71}{1-\theta}.$$
Now solving $U(\theta) = 0$ gives a candidate MLE:
$$\frac{129}{\theta} = \frac{71}{1-\theta},$$
$$129(1-\theta) = 71\theta,$$
i.e.
$$\hat{\theta} = \frac{129}{200} = 0.645.$$
This is our potential MLE. Checking the second derivative,
$$\ell''(\theta) = -\frac{129}{\theta^2} - \frac{71}{(1-\theta)^2},$$
which is negative at $\hat{\theta}$ (indeed for all $0 < \theta < 1$), confirming that it is a maximum.
The observed information is obtained by substituting $\hat{\theta}$ into $-\ell''(\theta)$, giving
$$I_O(\hat{\theta}) = \frac{129}{0.645^2} + \frac{71}{0.355^2} \approx 873.$$
Hence an approximate 95% confidence interval for $\theta$ is given by
$$0.645 \pm \frac{1.96}{\sqrt{873}} \approx (0.579, 0.711).$$
Now, compare the confidence intervals and the observed informations from the two separate calculations. What do you conclude?
Of course, genotyping the participants of the study is expensive, so may not be worthwhile. If this was a real problem, the statistician could communicate the figures above to the geneticist investigating gene ABCC11, who would then be able to make an evidence-based decision about how to conduct the experiment.
Recall the driving test data from Example 13.1.
Number of failed attempts | 0 | 1 | 2 | 3 or more
---|---|---|---|---
Observed frequency | 147 | 47 | 20 | 5
We chose to model these data as being geometrically distributed. Assuming that the people in the ‘3 or more’ column failed exactly three times, the log-likelihood for general data $x_1, \ldots, x_n$ is
$$\ell(\theta) = n\log\theta + \sum_{i=1}^{n} x_i \log(1-\theta).$$
Now, suppose that, rather than being presented with the table of passing attempts, you were simply told that, with 219 people filling in the survey, $\sum_{i=1}^{n} x_i = 102$.
Would it still be possible to proceed with fitting the model?
The answer is yes; moreover, we can proceed in exactly the same way, and achieve the same results! This is because, if you look at the log-likelihood, the only way in which the data are involved is through $\sum_{i=1}^{n} x_i$, meaning that in some sense, this is all we need to know.
This is clearly a big advantage: we just have to remember one number rather than an entire table.
We call $\sum_{i=1}^{n} X_i$ a sufficient statistic for $\theta$.
Let $X_1, \ldots, X_n$ be a sample from $f(x;\theta)$. Then a function of the data $T = T(X_1, \ldots, X_n)$ is said to be a sufficient statistic for $\theta$ (or sufficient for $\theta$) if the distribution of $X_1, \ldots, X_n$ is independent of $\theta$ given $T$, i.e.
$$f(x_1, \ldots, x_n \mid T = t;\, \theta) = f(x_1, \ldots, x_n \mid T = t).$$
Some consequences of this definition:
For the objective of learning about $\theta$, if I am told $T(x_1, \ldots, x_n)$, there is no value in being told anything else about $x_1, \ldots, x_n$.
If I have two datasets $\mathbf{x}$ and $\mathbf{y}$, and $T(\mathbf{x}) = T(\mathbf{y})$, then I should make the same conclusions about $\theta$ from both, even if $\mathbf{x} \ne \mathbf{y}$.
Sufficient statistics always exist since, trivially, $T(X_1, \ldots, X_n) = (X_1, \ldots, X_n)$ always satisfies the above definition.
Let $X_1, \ldots, X_n$ be a sample from $f(x;\theta)$. Let $T$ be sufficient for $\theta$. Then $T$ is said to be minimally sufficient for $\theta$ if there is no sufficient statistic with a lower dimension than $T$.
Let $X_1, \ldots, X_n$ be a sample from $f(x;\theta)$. Then a function $T(\mathbf{x})$ is sufficient for $\theta$ if and only if the likelihood function can be factorised in the form
$$L(\theta) = f(\mathbf{x};\theta) = h(\mathbf{x})\, g\!\left(T(\mathbf{x}), \theta\right),$$
where $h(\mathbf{x})$ is a function of the data only, and $g(T(\mathbf{x}), \theta)$ is a function of the data only through $T(\mathbf{x})$.
For a proof see page 276 of Casella and Berger.
We can also express the factorisation result in terms of the log-likelihood, which is often easier, just by taking logs of the above result:
$$\ell(\theta) = c(\mathbf{x}) + w\!\left(T(\mathbf{x}), \theta\right),$$
where $c(\mathbf{x}) = \log h(\mathbf{x})$ and $w(T(\mathbf{x}), \theta) = \log g(T(\mathbf{x}), \theta)$.
We can show that $\sum_{i=1}^{n} X_i$ is sufficient for $\theta$ in the driving test example by inspection of the log-likelihood:
$$\ell(\theta) = n\log\theta + \sum_{i=1}^{n} x_i \log(1-\theta).$$
Letting $T(\mathbf{x}) = \sum_{i=1}^{n} x_i$, $c(\mathbf{x}) = 0$, and $w(T(\mathbf{x}), \theta) = n\log\theta + T(\mathbf{x})\log(1-\theta)$, we have satisfied the factorisation criterion, and hence $\sum_{i=1}^{n} X_i$ is sufficient for $\theta$.
Suppose that I carry out another survey on attempts to pass a driving test, again with 219 participants, and get data $\mathbf{y}$, with $\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} x_i$ but $\mathbf{y} \ne \mathbf{x}$. Are the following statements true or false?
$\hat{\theta}_{\mathbf{y}}$, the MLE based on data $\mathbf{y}$, is the same as $\hat{\theta}_{\mathbf{x}}$, the MLE based on data $\mathbf{x}$.
The confidence intervals based on both datasets will be identical.
The geometric distribution is appropriate for both datasets.
An important shortcoming in only considering the sufficient statistic is that it does not allow us to check how well the chosen model fits.
Recall from the beginning of this section the London homicides data, which we modelled as a random sample from the Poisson distribution. We found
$$L(\lambda) = \prod_{i=1}^{n} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!},$$
and that the log-likelihood function for the Poisson data is consequently
$$\ell(\lambda) = \sum_{i=1}^{n} x_i \log\lambda - n\lambda - \sum_{i=1}^{n}\log(x_i!),$$
with the MLE being
$$\hat{\lambda} = \bar{x}.$$
By differentiating again, we can find the information function
$$I_O(\lambda) = -\ell''(\lambda) = \frac{\sum_{i=1}^{n} x_i}{\lambda^2},$$
and so
$$I_O(\hat{\lambda}) = \frac{n\bar{x}}{\bar{x}^2} = \frac{n}{\bar{x}}.$$
What is a sufficient statistic for the Poisson parameter?
For this case, we can let $T(\mathbf{x}) = \sum_{i=1}^{n} x_i$, $c(\mathbf{x}) = -\sum_{i=1}^{n}\log(x_i!)$, and $w(T(\mathbf{x}), \lambda) = T(\mathbf{x})\log\lambda - n\lambda$; then we have satisfied the factorisation criterion, and hence $\sum_{i=1}^{n} X_i$ is sufficient for $\lambda$.
Suppose the sample $X_1, \ldots, X_n$ comes from an Exponential distribution with mean $\theta$. Find a sufficient statistic for $\theta$. Is the MLE a function of this statistic or of the sample mean? Give a formula for the 95% confidence interval of $\theta$.
First, the density is given by
$$f(x;\theta) = \frac{1}{\theta} e^{-x/\theta}, \qquad x > 0,$$
leading to the likelihood
$$L(\theta) = \prod_{i=1}^{n} \frac{1}{\theta} e^{-x_i/\theta} = \theta^{-n}\exp\left(-\frac{1}{\theta}\sum_{i=1}^{n} x_i\right).$$
Hence, $T(\mathbf{x}) = \sum_{i=1}^{n} x_i$ is a sufficient statistic for $\theta$. The log-likelihood and score functions are
$$\ell(\theta) = -n\log\theta - \frac{1}{\theta}\sum_{i=1}^{n} x_i, \qquad U(\theta) = -\frac{n}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^{n} x_i.$$
Solving $U(\theta) = 0$ gives a candidate MLE
$$\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x},$$
which is a function of the sufficient statistic (and is also the sample mean). To check this is an MLE we calculate
$$\ell''(\theta) = \frac{n}{\theta^2} - \frac{2}{\theta^3}\sum_{i=1}^{n} x_i.$$
In this case it isn’t immediately obvious that $\ell''(\theta) < 0$, but substituting in $\hat{\theta} = \bar{x}$ gives
$$\ell''(\hat{\theta}) = \frac{n}{\bar{x}^2} - \frac{2n\bar{x}}{\bar{x}^3} = -\frac{n}{\bar{x}^2} < 0,$$
confirming that this is an MLE.
The observed information is $I_O(\hat{\theta}) = n/\bar{x}^2$. Therefore a 95% confidence interval is given by
$$\bar{x} \pm 1.96\,\frac{\bar{x}}{\sqrt{n}}.$$
The score function is the first derivative of the log-likelihood. The observed information is MINUS the second derivative of the log-likelihood. It will always be positive when evaluated at the MLE.
DO NOT FORGET THE MINUS SIGN!
The likelihood function adjusts appropriately when more information becomes available. Observed information does what it says. Higher observed information leads to narrower confidence intervals. This is a good thing as narrower confidence intervals mean we are more sure about where the true value lies.
For a continuous parameter of interest, $\theta$, the calculation of the MLE and its confidence interval follows the steps below (a generic numerical version is sketched after the list):
Write down the likelihood, $L(\theta)$.
Write down the log-likelihood, $\ell(\theta) = \log L(\theta)$.
Work out the score function, $U(\theta) = \ell'(\theta)$.
Solve $U(\theta) = 0$ to get a candidate for the MLE, $\hat{\theta}$.
Work out $\ell''(\theta)$. Check it is negative at the MLE candidate to verify it is a maximum.
Work out the observed information, $I_O(\hat{\theta}) = -\ell''(\hat{\theta})$.
Calculate the confidence interval for $\theta$: $\hat{\theta} \pm 1.96\big/\sqrt{I_O(\hat{\theta})}$.
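When the algebra is awkward, the same steps can be carried out numerically. A sketch (the helper name `mle.ci` is ours, not from the notes): supply a log-likelihood function and a search range, and the observed information is taken from the numerically computed second derivative.

```r
# Numerical MLE and approximate 95% CI for a one-dimensional parameter
mle.ci <- function(loglik, lower, upper) {
  fit <- optim(par = (lower + upper) / 2, fn = function(theta) -loglik(theta),
               method = "Brent", lower = lower, upper = upper, hessian = TRUE)
  theta.hat <- fit$par
  obs.info  <- drop(fit$hessian)          # minus the second derivative at the MLE
  list(mle = theta.hat,
       ci  = theta.hat + c(-1, 1) * 1.96 / sqrt(obs.info))
}

# Example: the exponential inter-arrival data from earlier
arrive <- c(18.39, 2.70, 5.42, 0.99, 5.42, 31.97, 2.96, 5.28, 8.51, 10.90)
mle.ci(function(lambda) sum(dexp(arrive, rate = lambda, log = TRUE)),
       lower = 0.001, upper = 10)
```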
Changing the data that your inference is based on will change the amount of information, and subsequent inference (e.g. confidence intervals).
A statistic $T(\mathbf{X})$ is said to be sufficient for a parameter $\theta$ if the distribution of the data $\mathbf{X}$ is independent of $\theta$ when conditioning on $T(\mathbf{X})$.
An equivalent, and easier to demonstrate, condition is that the likelihood can be factorised in the form $L(\theta) = h(\mathbf{x})\,g(T(\mathbf{x}), \theta)$ iff $T(\mathbf{x})$ is sufficient.
We have noted that an asymptotic 95% confidence interval for a true parameter, $\theta_0$, is given by
$$\hat{\theta} \pm \frac{1.96}{\sqrt{I_O(\hat{\theta})}},$$
where $\hat{\theta}$ is the MLE and
$$I_O(\hat{\theta}) = -\ell''(\hat{\theta})$$
is the observed information.
In this lecture we will sketch the derivation of the distribution of the MLE, and show why the above really is an asymptotic 95% confidence interval for $\theta_0$.
Recall the distinction between an estimate and an estimator.
Given a sample $X_1, \ldots, X_n$, an estimator is any function of that sample. An estimate is the particular numerical value produced by the estimator for observed data $x_1, \ldots, x_n$.
The maximum likelihood estimator is a random variable; therefore it has a distribution. A maximum likelihood estimate is just a number, based on fixed data.
For the rest of this lecture we consider an iid sample $X_1, \ldots, X_n$ from some distribution with unknown parameter $\theta$, and the MLE (maximum likelihood estimator) $\hat{\theta}$.
The Fisher information of a random sample is the expected value of minus the second derivative of the log-likelihood, evaluated at the true value of the parameter:
$$I_E(\theta_0) = \mathrm{E}\left[-\frac{\mathrm{d}^2\ell(\theta)}{\mathrm{d}\theta^2}\bigg|_{\theta = \theta_0}\right].$$
This is related to, but different from, the observed information.
The observed information is calculated based on observed data; the Fisher information is calculated taking expectations over random data.
The observed information is calculated at $\hat{\theta}$; the Fisher information is calculated at $\theta_0$.
The observed information can be written down numerically; the Fisher information usually cannot be, since it depends on $\theta_0$, which is unknown.
Suppose $X_1, \ldots, X_n$ is a random sample from Poisson($\lambda$). Find the Fisher information. Remember that $\mathrm{E}(X_i) = \lambda$. For the Poisson model,
$$\ell(\lambda) = \sum_{i=1}^{n} X_i \log\lambda - n\lambda + k,$$
where $k$ is a constant.
Hence
$$-\frac{\mathrm{d}^2\ell(\lambda)}{\mathrm{d}\lambda^2} = \frac{\sum_{i=1}^{n} X_i}{\lambda^2}, \qquad I_E(\lambda) = \mathrm{E}\left[\frac{\sum_{i=1}^{n} X_i}{\lambda^2}\right] = \frac{n\lambda}{\lambda^2} = \frac{n}{\lambda}.$$
We see that our answer is in terms of $\lambda$, which is unknown (and not in terms of the data!). The Fisher information is useful for many things in likelihood inference; to see more, take MATH330 Likelihood Inference.
Here, it features in the most important theorem in the course.
Suppose we have an iid sample from some distribution with unknown parameter $\theta_0$, with maximum likelihood estimator $\hat{\theta}$. Then (under certain regularity conditions), in the limit as $n \to \infty$,
$$\hat{\theta} \;\overset{\text{approx.}}{\sim}\; N\!\left(\theta_0,\, \frac{1}{I_E(\theta_0)}\right).$$
This says that, for $n$ large, the distribution of the MLE is approximately normal with mean equal to the true value of the parameter, and variance equal to the reciprocal of the Fisher information.
We will not prove the result in this course, but it has to do with the central limit theorem (from MATH230).
Turning this around, this means that, for large $n$,
$$\hat{\theta} \pm \frac{1.96}{\sqrt{I_E(\theta_0)}}$$
covers $\theta_0$ with probability approximately 0.95.
This result is useless as it stands, because we can only calculate $I_E(\theta_0)$ when we know $\theta_0$, and if we know it, why are we constructing a confidence interval for it?!
Luckily, the result also works asymptotically if we replace $I_E(\theta_0)$ by $I_O(\hat{\theta})$, giving that
$$\hat{\theta} \pm \frac{1.96}{\sqrt{I_O(\hat{\theta})}}$$
is an approximate 95% confidence interval for $\theta_0$ (as claimed earlier).
A large batch of electrical components contains a proportion $\theta$ which are defective and not repairable, a proportion $2\theta$ which are defective but repairable, and a proportion $1 - 3\theta$ which are satisfactory.
What values of $\theta$ are admissible?
Fifty components are selected at random (with replacement) from the batch, of which 2 are defective and not repairable, 5 are defective and repairable and 43 are satisfactory.
Write down the likelihood function, and make a rough sketch of it.
Obtain the maximum likelihood estimate of $\theta$.
Obtain an approximate 95% confidence interval for $\theta$. A value of $\theta$ equal to 0.02 is believed to represent acceptable quality for the batch. Do the data support the conclusion that the batch is of acceptable quality?
There are 3 types of component, each giving rise to a constraint on $\theta$:
$0 \le \theta \le 1$,
$0 \le 2\theta \le 1$,
$0 \le 1 - 3\theta \le 1$,
as the components each need to have valid probabilities. The third inequality is sufficient for the other two and gives $0 \le \theta \le 1/3$.
Given the data, the likelihood is
$$L(\theta) \propto \theta^{2}(2\theta)^{5}(1-3\theta)^{43} \propto \theta^{7}(1-3\theta)^{43}.$$
For the sketch note that $L(0) = L(1/3) = 0$, and the function is concave and positive between these two with a maximum closer to $0$ than to $1/3$.
To work out the MLE, we differentiate the (log-)likelihood as usual. The log-likelihood is
$$\ell(\theta) = 7\log\theta + 43\log(1-3\theta) + \text{constant}.$$
Differentiating,
$$U(\theta) = \frac{7}{\theta} - \frac{129}{1-3\theta}.$$
A candidate MLE solves $U(\theta) = 0$, giving $\hat{\theta} = 7/150 \approx 0.047$.
Moreover,
$$\ell''(\theta) = -\frac{7}{\theta^2} - \frac{387}{(1-3\theta)^2} < 0,$$
so this is indeed the MLE.
The observed information is
$$I_O(\hat{\theta}) = \frac{7}{\hat{\theta}^2} + \frac{387}{(1-3\hat{\theta})^2} \approx 3740.$$
So a 95% confidence interval for $\theta$ is
$$\hat{\theta} \pm \frac{1.96}{\sqrt{I_O(\hat{\theta})}} \approx 0.047 \pm 0.032 = (0.015, 0.079).$$
As $\theta = 0.02$ is within this confidence interval, there is no evidence of this batch being sub-standard.
Under certain regularity conditions, the maximum likelihood estimator has, asymptotically, a normal distribution with mean equal to the true parameter value, and variance equal to the inverse of the Fisher information.
The Fisher information is minus the expectation of the second derivative of the log-likelihood evaluated at the true parameter value.
Based on this, we can construct approximate 95% confidence intervals for the true parameter value based on the MLE and the observed information.
Importantly, this is an asymptotic result so is only approximate. In particular, it is a bad approximation to a 95% confidence interval when the sample size, , is small.
In the last lecture we showed that the MLE is asymptotically normally distributed, and we used this fact to construct an approximate 95% confidence interval.
In this lecture we will introduce the concept of deviance, and show that this leads to another way to calculate approximate confidence intervals that have various advantages.
We will begin by showing through an example where things can go wrong with the confidence intervals we know (and love?).
On a fair (European) roulette wheel there is a probability of $1/37$ of each number coming up.
In the early 1990s, Gonzalo Garcia-Pelayo believed that casino roulette wheels were not perfectly random, and that by recording the results and analysing them with a computer, he could gain an edge on the house by predicting that certain numbers were more likely to occur next than the odds offered by the house suggested. This he did at the Casino de Madrid in Madrid, Spain, winning 600,000 euros in a single day, and one million euros in total.
Legal action against him by the casino was unsuccessful, it being ruled that the casino should fix its wheel:
Suppose I am curious that the number 17 seems to come up on a casino’s roulette wheel more frequently than other numbers. I track it for 30 spins, during which it comes up 2 times. I decide to carry out a likelihood analysis on $\theta$, the probability of the number 17 coming up, and its confidence interval.
We propose to model the situation as follows. Let $X$ be the number of times the number 17 comes up in 30 spins of the roulette wheel. We decide to model $X \sim \mathrm{Binomial}(30, \theta)$.
Why is this a suitable model?
What assumptions are being made?
Are these assumptions reasonable?
The probability of the observed data is given by
$$P(X = 2) = \binom{30}{2}\theta^{2}(1-\theta)^{28}.$$
The likelihood is simply the probability of the observed data, but we can ignore the multiplicative constants, so
$$L(\theta) = \theta^{2}(1-\theta)^{28}.$$
The log-likelihood is
$$\ell(\theta) = 2\log\theta + 28\log(1-\theta).$$
Differentiating,
$$U(\theta) = \frac{2}{\theta} - \frac{28}{1-\theta}.$$
Now remember solutions to $U(\theta) = 0$ are potential MLEs:
$$\frac{2}{\theta} = \frac{28}{1-\theta} \quad\Longrightarrow\quad \hat{\theta} = \frac{2}{30} \approx 0.067.$$
The second derivative will both tell us whether this is a maximum, and provide the observed information:
$$\ell''(\theta) = -\frac{2}{\theta^2} - \frac{28}{(1-\theta)^2}.$$
This is clearly negative for all $0 < \theta < 1$, so $\hat{\theta} = 2/30$ must be a maximum.
Moreover, the observed information is
$$I_O(\hat{\theta}) = \frac{2}{(2/30)^2} + \frac{28}{(28/30)^2} \approx 482.$$
A 95% confidence interval for $\theta$ is given by
$$\hat{\theta} \pm \frac{1.96}{\sqrt{I_O(\hat{\theta})}},$$
which, on substituting in $\hat{\theta}$ and the observed information, becomes
$$0.067 \pm 0.089 = (-0.023, 0.156).$$
The resulting confidence interval includes negative values (for a probability parameter). What’s the problem??
Let’s look at a plot of the log-likelihood for the above situation.
We notice that the log-likelihood is quite asymmetric. This happens because the MLE is close to the edge of the feasible space (i.e. close to 0). The confidence interval defined above is forced to be symmetric, which seems inappropriate here.
Suppose we have a log-likelihood function $\ell(\theta)$ with unknown parameter $\theta$, and MLE $\hat{\theta}$. Then the deviance function is given by
$$D(\theta) = 2\left\{\ell(\hat{\theta}) - \ell(\theta)\right\}.$$
Notice that $D(\theta) \ge 0$ for all $\theta$, and $D(\hat{\theta}) = 0$.
What can we say about $D(\theta_0)$, the deviance at the true parameter value?
This is a fixed (but unknown) value for fixed data $x_1, \ldots, x_n$. However, in similar spirit to the last lecture, we can consider random data $X_1, \ldots, X_n$. Now, the deviance function depends on $X_1, \ldots, X_n$ (since different data lead to different likelihoods). So, $D(\theta_0)$ is a random variable.
Suppose we have an iid sample from some distribution with unknown parameter $\theta_0$. Then (under certain regularity conditions), in the limit as $n \to \infty$,
$$D(\theta_0) \;\overset{\text{approx.}}{\sim}\; \chi^2_1,$$
i.e. the deviance of the true value of $\theta$ has a $\chi^2$ distribution with one degree of freedom.
The practical upshot of this result is that we have another way to construct a confidence interval for $\theta_0$. A 95% confidence interval for $\theta_0$, for example, is given by $\{\theta : D(\theta) \le 3.84\}$, i.e. any values of $\theta$ whose deviance is smaller than 3.84 (the 95% point of $\chi^2_1$).
This property of the deviance is best seen visually. Going back to the roulette data:
From the graph we can estimate the confidence interval based on the deviance. In fact the answer to three decimal places is $(0.011, 0.192)$. Notice that this is not symmetrical, and that all values in the interval are feasible.
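A sketch of the deviance-based interval calculation in R:

```r
# Deviance-based 95% CI for theta in the roulette example (2 successes in 30 spins)
loglik    <- function(theta) 2 * log(theta) + 28 * log(1 - theta)
theta.hat <- 2 / 30
deviance  <- function(theta) 2 * (loglik(theta.hat) - loglik(theta))
theta <- seq(0.001, 0.4, by = 0.0001)
range(theta[deviance(theta) <= qchisq(0.95, df = 1)])  # approximate interval endpoints
```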
The original motivation for all of this was that we were wondering if the number 17 comes up more often than with the 1/37 that should be observed in a fair roulette wheel.
In fact 1/37=0.027, which is within the 95% confidence interval calculated above. Hence there is insufficient evidence (so far) to support the claim that this number is coming up more often than it should.
We have now seen two different ways to calculate approximate confidence intervals (CI) for an unknown parameter. Previously, we calculated CI based on the asymptotic distribution of the MLE (CI-MLE). Here, we showed how to calculate the CI based on the asymptotic distribution of the deviance (CI-D).
We discussed various differences and pros and cons of the two:
CI-MLE is always symmetric about the MLE. CI-D is not.
CI-MLE can include values with zero likelihood (e.g. infeasible values such as negative probabilities, as seen here). CI-D will only include feasible values.
CI-D is typically harder to calculate than CI-MLE.
For reasons we will not go into here, CI-D is typically more accurate than CI-MLE.
CI-D is invariant to re-parametrization; CI-MLE is not. (This is a good thing for CI-D, that we will learn more about in subsequent lectures).
Overall, CI-D is usually preferred to CI-MLE (since the only disadvantage is that it is harder to compute).
DEVIANCES ARE ALWAYS NON-NEGATIVE!
In our likelihood examples we discussed modelling inter-arrival times at an A&E department using an Exponential distribution. The exponential pdf is given by
$$f(x;\lambda) = \lambda e^{-\lambda x}$$
for $x > 0$ and $\lambda > 0$, where $\lambda$ is the rate parameter.
Based on the inter-arrival times (in minutes)
18.39, 2.70, 5.42, 0.99, 5.42, 31.97, 2.96, 5.28, 8.51, 10.90,
giving $\bar{x} = 9.254$, we came up with the MLE for $\lambda$ of $\hat{\lambda} = 1/\bar{x} \approx 0.108$.
Now, the mean inter-arrival time is $\mu = \mathrm{E}(X) = 1/\lambda$.
How would we go about finding an estimate for $\mu$?
Method 1: re-write the pdf as
$$f(x;\mu) = \frac{1}{\mu} e^{-x/\mu},$$
where $\mu = 1/\lambda$ and $x > 0$, to give a likelihood of
$$L(\mu) = \mu^{-n}\exp\left(-\frac{1}{\mu}\sum_{i=1}^{n} x_i\right),$$
then find the MLE by the usual approach.
Method 2: Since $\mu = 1/\lambda$, presumably $\hat{\mu} = 1/\hat{\lambda}$.
Which method is more convenient?
Which method appears more rigorous?
In fact, both methods give the same solution always. This property is called invariance to reparameterization of the MLE. It is a nice property both because it agrees with our intuition, and saves us a lot of potential calculation.
If $\hat{\theta}$ is the MLE of $\theta$ and $\phi$ is a monotonic function of $\theta$, $\phi = g(\theta)$, then the MLE of $\phi$ is $\hat{\phi} = g(\hat{\theta})$.
Write $\phi = g(\theta)$. The likelihood for $\theta$ is $L_\theta(\theta)$, and for $\phi$ is $L_\phi(\phi)$. Note that $g^{-1}$ exists as $g$ is monotonic, and define $L_\phi$ by $L_\phi(\phi) = L_\theta\!\left(g^{-1}(\phi)\right)$. To show that $g(\hat{\theta})$ is the MLE of $\phi$: for any $\phi$,
$$L_\phi(\phi) = L_\theta\!\left(g^{-1}(\phi)\right) \le L_\theta(\hat{\theta}),$$
as $\hat{\theta}$ is the MLE. But
$$L_\theta(\hat{\theta}) = L_\theta\!\left(g^{-1}(g(\hat{\theta}))\right) = L_\phi\!\left(g(\hat{\theta})\right),$$
so $L_\phi(\phi) \le L_\phi\!\left(g(\hat{\theta})\right)$ for all $\phi$, i.e. $\hat{\phi} = g(\hat{\theta})$.
∎
This means that both methods above must give the same answer.
Show this works for the case above, by demonstrating that Method 1 leads to $\hat{\mu} = \bar{x} = 1/\hat{\lambda}$.
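A quick numerical check of the invariance property in R (a sketch: the two parametrizations are maximised separately and compared):

```r
# Maximise the rate and mean parametrizations of the exponential model
arrive <- c(18.39, 2.70, 5.42, 0.99, 5.42, 31.97, 2.96, 5.28, 8.51, 10.90)
ll.lambda <- function(lambda) sum(dexp(arrive, rate = lambda, log = TRUE))
ll.mu     <- function(mu)     sum(dexp(arrive, rate = 1 / mu, log = TRUE))
lambda.hat <- optimise(ll.lambda, c(0.001, 10), maximum = TRUE)$maximum
mu.hat     <- optimise(ll.mu,     c(0.1, 100),  maximum = TRUE)$maximum
c(mu.hat, 1 / lambda.hat, mean(arrive))   # all (approximately) equal
```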
The following corollary follows immediately from invariance of the MLE to reparametrisation.
Confidence intervals based on the deviance are invariant to reparametrisation, in the sense that
$$D_\phi(\phi) = 2\left\{\ell_\phi(\hat{\phi}) - \ell_\phi(\phi)\right\} = 2\left\{\ell_\theta(\hat{\theta}) - \ell_\theta\!\left(g^{-1}(\phi)\right)\right\},$$
by the Theorem above, which equals
$$D_\theta\!\left(g^{-1}(\phi)\right).$$
∎
The practical consequence of this is that if
$$(\theta_L, \theta_U)$$
is a deviance confidence interval with coverage 95% for $\theta$,
then
$$\left(g(\theta_L), g(\theta_U)\right)$$
is a deviance confidence interval with coverage 95% for $\phi = g(\theta)$.
(Of course, if $g$ is decreasing the two endpoints swap round.)
IMPORTANT: This simple translation does not hold for confidence intervals based on the asymptotic distribution of the MLE. This is because that interval depends on the second derivative of $\ell$ with respect to the parameter, which changes in more complicated ways between different parameter choices.
This will be explored more in MATH330 Likelihood Inference.
The random variables $X_1, \ldots, X_n$ are independent and identically distributed with the geometric distribution
$$P(X = x) = \theta(1-\theta)^{x}, \qquad x = 0, 1, 2, \ldots,$$
where $\theta$ is a parameter in the range $0 < \theta < 1$ to be estimated. The mean of the above geometric distribution is $(1-\theta)/\theta$.
Write down formulae for the maximum likelihood estimator for $\theta$ and for Fisher’s information;
Write down what you know about the distribution of the maximum likelihood estimator for this example when $n$ is large.
In a particular experiment, , .
Compute an approximate 95% confidence interval for $\theta$ based on the asymptotic distribution of the maximum likelihood estimator;
Compute the deviance and sketch it over a suitable range of $\theta$. Use your sketch to describe how to use the deviance to obtain an approximate 95% confidence interval for $\theta$;
If you were asked to produce an approximate 95% confidence interval for the mean of the distribution, $(1-\theta)/\theta$, what would be your recommended approach? Justify your answer.
For the model, the likelihood function is
$$L(\theta) = \prod_{i=1}^{n} \theta(1-\theta)^{x_i} = \theta^{n}(1-\theta)^{\sum_{i=1}^{n} x_i}.$$
The log-likelihood is then
$$\ell(\theta) = n\log\theta + \sum_{i=1}^{n} x_i \log(1-\theta),$$
with derivative
$$U(\theta) = \frac{n}{\theta} - \frac{\sum_{i=1}^{n} x_i}{1-\theta}.$$
A candidate MLE solves $U(\theta) = 0$, giving
$$\hat{\theta} = \frac{n}{n + \sum_{i=1}^{n} x_i}.$$
Moreover,
$$\ell''(\theta) = -\frac{n}{\theta^2} - \frac{\sum_{i=1}^{n} x_i}{(1-\theta)^2} < 0,$$
so this is indeed the MLE.
For the Fisher information,
$$I_E(\theta) = \mathrm{E}\left[\frac{n}{\theta^2} + \frac{\sum_{i=1}^{n} X_i}{(1-\theta)^2}\right] = \frac{n}{\theta^2} + \frac{n(1-\theta)/\theta}{(1-\theta)^2} = \frac{n}{\theta^2(1-\theta)},$$
after simplification, since $\mathrm{E}(X_i) = (1-\theta)/\theta$.
Using the Fisher information, the asymptotic distribution of the MLE is
$$\hat{\theta} \;\overset{\text{approx.}}{\sim}\; N\!\left(\theta,\, \frac{\theta^{2}(1-\theta)}{n}\right).$$
Using the data, the MLE is $\hat{\theta} = n\big/\left(n + \sum_{i=1}^{n} x_i\right)$. The observed information is
$$I_O(\hat{\theta}) = \frac{n}{\hat{\theta}^2} + \frac{\sum_{i=1}^{n} x_i}{(1-\hat{\theta})^2}.$$
Therefore a 95% confidence interval is
$$\hat{\theta} \pm \frac{1.96}{\sqrt{I_O(\hat{\theta})}},$$
with the numerical values for $n$ and $\sum_{i=1}^{n} x_i$ substituted in.
The deviance is given by
$$D(\theta) = 2\left\{\ell(\hat{\theta}) - \ell(\theta)\right\} = 2\left\{n\log\frac{\hat{\theta}}{\theta} + \sum_{i=1}^{n} x_i \log\frac{1-\hat{\theta}}{1-\theta}\right\}.$$
To plot the deviance, calculate $D(\theta)$ over a grid of values of $\theta$, and note that $D(\hat{\theta}) = 0$. A confidence interval is obtained by drawing a horizontal line at 3.84; the interval is all $\theta$ with $D(\theta) \le 3.84$.
To construct a confidence interval for the mean, we would apply the mean function $g(\theta) = (1-\theta)/\theta$ to the deviance-based confidence interval just calculated, as deviance intervals are invariant to re-parametrization.
The simple, intuitive answer is true: if $\hat{\theta}$ is an MLE, then for any monotonic transformation $\phi = g(\theta)$, the MLE of $\phi$ is $\hat{\phi} = g(\hat{\theta})$.
The same simple result can be applied to confidence intervals based on the deviance, but can NOT be applied to confidence intervals based on the asymptotic distribution of the MLE.