Recall the driving test data from Example 13.1.
Number of failed attempts | 0 | 1 | 2 | 3 or more |
Observed frequency | 147 | 47 | 20 | 5 |
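We chose to model these data as being geometrically distributed. Assuming that the people in the ‘3 or more’ column failed exactly three times, the log-likelihood for general data $x_1, \dots, x_n$ is
$$\ell(\theta) = n \log\theta + \log(1-\theta) \sum_{i=1}^n x_i,$$
where $\theta$ is the probability of passing on any given attempt.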
Now, suppose that, rather than being presented with the table of failed attempts, you were simply told that with 219 people filling in the survey, $\sum_{i=1}^{219} x_i = 102$.
Would it still be possible to proceed with fitting the model?
The answer is yes; moreover, we can proceed in exactly the same way, and achieve the same results! This is because, if you look at the log-likelihood, the only way in which the data are involved is through $\sum_{i=1}^n x_i$, meaning that, in some sense, this is all we need to know.
This is clearly a big advantage: we just have to remember one number rather than an entire table.
We call $\sum_{i=1}^n x_i$ a sufficient statistic for $\theta$.
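To make this concrete, here is a minimal Python sketch that fits the geometric model using only the sample size and the sum of the data. It assumes the parameterisation $P(X = x) = \theta(1-\theta)^x$ for the number of failed attempts, which may differ from the one used in Example 13.1; the function names `log_lik` and `mle` are illustrative only.

```python
import math

# Driving test table: number of failed attempts -> observed frequency.
# As in the text, everyone in the '3 or more' column is treated as exactly 3.
table = {0: 147, 1: 47, 2: 20, 3: 5}

n = sum(table.values())                        # number of respondents
total = sum(x * f for x, f in table.items())   # sufficient statistic: sum of x_i


def log_lik(theta, n, total):
    """Geometric log-likelihood (assumed pmf theta * (1 - theta)**x).

    It depends on the data only through n and the sum of the x_i.
    """
    return n * math.log(theta) + total * math.log(1.0 - theta)


def mle(n, total):
    """Maximiser of log_lik, from solving n/theta - total/(1 - theta) = 0."""
    return n / (n + total)


theta_hat = mle(n, total)
print(n, total, round(theta_hat, 4))
```

Neither `log_lik` nor `mle` ever sees the table itself, so anyone told only that $n = 219$ and $\sum x_i = 102$ would obtain the same fit.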
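Let $\mathbf{x} = (x_1, \dots, x_n)$ be a sample from $f(x \mid \theta)$. Then a function $T(\mathbf{x})$ of the data is said to be a sufficient statistic for $\theta$ (or sufficient for $\theta$) if $\mathbf{x}$ is independent of $\theta$ given $T(\mathbf{x})$, i.e.
$$f(\mathbf{x} \mid T(\mathbf{x}), \theta) = f(\mathbf{x} \mid T(\mathbf{x})).$$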
Some consequences of this definition:
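For the objective of learning about $\theta$, if I am told $T(\mathbf{x})$, there is no value in being told anything else about $\mathbf{x}$.
If I have two datasets $\mathbf{x}_1$ and $\mathbf{x}_2$, and $T(\mathbf{x}_1) = T(\mathbf{x}_2)$, then I should make the same conclusions about $\theta$ from both, even if $\mathbf{x}_1 \neq \mathbf{x}_2$.
Sufficient statistics always exist since trivially $T(\mathbf{x}) = \mathbf{x}$ always satisfies the above definition.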
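Let $\mathbf{x} = (x_1, \dots, x_n)$ be a sample from $f(x \mid \theta)$. Let $T(\mathbf{x})$ be sufficient for $\theta$. Then $T(\mathbf{x})$ is said to be minimally sufficient for $\theta$ if there is no sufficient statistic with a lower dimension than $T(\mathbf{x})$.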
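Let $\mathbf{x} = (x_1, \dots, x_n)$ be a sample from $f(x \mid \theta)$. Then a function $T(\mathbf{x})$ is sufficient for $\theta$ if and only if the likelihood function can be factorised in the form
$$L(\theta) = g(\theta, T(\mathbf{x})) \, h(\mathbf{x}),$$
where $h$ is a function of the data only, and $g$ is a function of the data only through $T(\mathbf{x})$.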
For a proof see page 276 of Casella and Berger.
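We can also express the factorisation result in terms of the log-likelihood, which is often easier, just by taking logs of the above result:
$$\ell(\theta) = \tilde{g}(\theta, T(\mathbf{x})) + \tilde{h}(\mathbf{x}),$$
where $\tilde{g} = \log g$ and $\tilde{h} = \log h$.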
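We can show that $\sum_{i=1}^n x_i$ is sufficient for $\theta$ in the driving test example by inspection of the log-likelihood:
$$\ell(\theta) = n \log\theta + \log(1-\theta) \sum_{i=1}^n x_i.$$
Letting $T(\mathbf{x}) = \sum_{i=1}^n x_i$, then $\tilde{g}(\theta, T) = n \log\theta + T \log(1-\theta)$, and $\tilde{h}(\mathbf{x}) = 0$, we have satisfied the factorisation criterion, and hence $\sum_{i=1}^n x_i$ is sufficient for $\theta$.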
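Suppose that I carry out another survey on attempts to pass a driving test, again with $n = 219$ participants and get data $\mathbf{y}$, with $\sum_{i=1}^n y_i = \sum_{i=1}^n x_i$ but $\mathbf{y} \neq \mathbf{x}$. Are the following statements true or false?
$\hat\theta_{\mathbf{x}}$, the MLE based on data $\mathbf{x}$, is the same as $\hat\theta_{\mathbf{y}}$, the MLE based on data $\mathbf{y}$.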
The confidence intervals based on both datasets will be identical.
The geometric distribution is appropriate for both datasets.
An important shortcoming in only considering the sufficient statistic is that it does not allow us to check how well the chosen model fits.
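The following Python sketch illustrates this shortcoming. It compares the driving test table with a second, entirely hypothetical table that has the same $n$ and the same $\sum x_i$, and therefore the same fitted model. The pmf $P(X = x) = \theta(1-\theta)^x$, the helper names, and the use of a Pearson-type discrepancy as a rough measure of fit are all assumptions made for illustration.

```python
def fitted_counts(n, total, max_x=3):
    """Expected frequencies for cells 0, 1, ..., max_x - 1 and 'max_x or more'
    under the fitted geometric model (assumed pmf theta * (1 - theta)**x)."""
    theta = n / (n + total)                  # MLE, a function of (n, total) only
    probs = [theta * (1 - theta) ** x for x in range(max_x)]
    probs.append((1 - theta) ** max_x)       # P(X >= max_x)
    return [n * p for p in probs]


def pearson(observed, expected):
    """Pearson-type discrepancy between observed and expected frequencies."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


obs_x = [147, 47, 20, 5]   # driving test data from the text
obs_y = [117, 102, 0, 0]   # hypothetical data: same n = 219, same sum = 102

n, total = 219, 102
expected = fitted_counts(n, total)           # identical for both datasets

print([round(e, 1) for e in expected])
print(pearson(obs_x, expected), pearson(obs_y, expected))
```

Both datasets lead to the same MLE and the same fitted frequencies, yet the discrepancy for the hypothetical table is far larger. That difference is invisible to anyone who is given only the sufficient statistic.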
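Recall from the beginning of this section the London homicides data, which we modelled as a random sample from the Poisson distribution with parameter $\theta$. We found the likelihood
$$L(\theta) = \prod_{i=1}^n \frac{\theta^{x_i} e^{-\theta}}{x_i!},$$
and that the log-likelihood function for the Poisson data is consequently
$$\ell(\theta) = \log\theta \sum_{i=1}^n x_i - n\theta - \sum_{i=1}^n \log(x_i!),$$
with the MLE being
$$\hat\theta = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x}.$$
By differentiating again, we can find the information function
$$I_O(\theta) = -\ell''(\theta) = \frac{1}{\theta^2} \sum_{i=1}^n x_i,$$
and so
$$I_O(\hat\theta) = \frac{n}{\hat\theta} = \frac{n}{\bar{x}}.$$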
What is a sufficient statistic for the Poisson parameter?
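For this case, we can let $T(\mathbf{x}) = \sum_{i=1}^n x_i$, $\tilde{g}(\theta, T) = T \log\theta - n\theta$, and $\tilde{h}(\mathbf{x}) = -\sum_{i=1}^n \log(x_i!)$. We have then satisfied the factorisation criterion, and hence $\sum_{i=1}^n x_i$ is sufficient for $\theta$.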
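Unlike the geometric case, the Poisson log-likelihood contains a term that depends on the data but not on the parameter. The Python sketch below checks numerically that two samples with the same $n$ and the same sum have log-likelihoods differing only by a constant, so they carry the same information about the parameter. The two samples are hypothetical (they are not the homicide data), and `poisson_log_lik` is an illustrative name.

```python
import math


def poisson_log_lik(theta, xs):
    """Poisson log-likelihood: log(theta) * sum(x) - n * theta - sum(log(x_i!))."""
    total = sum(xs)
    log_fact = sum(math.lgamma(x + 1) for x in xs)   # lgamma(x + 1) = log(x!)
    return math.log(theta) * total - len(xs) * theta - log_fact


# Two hypothetical samples with equal size and equal sum.
xs = [0, 1, 2, 3, 4]
ys = [2, 2, 2, 2, 2]

# The gap between the two log-likelihoods, evaluated at several parameter values.
thetas = [0.5, 1.0, 2.0, 3.5]
gaps = [poisson_log_lik(t, xs) - poisson_log_lik(t, ys) for t in thetas]
print(gaps)
```

The printed gaps are the same at every parameter value, because the only difference between the two log-likelihoods is the $\sum \log(x_i!)$ term. Both curves are therefore maximised at the same point and have the same curvature there.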
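Suppose the sample $x_1, \dots, x_n$ comes from $X \sim N(0, \theta)$. Find a sufficient statistic for $\theta$. Is the MLE a function of this statistic or of the sample mean? Give a formula for the 95% confidence interval of $\theta$.
First, the density is given by
$$f(x \mid \theta) = \frac{1}{\sqrt{2\pi\theta}} \exp\left(-\frac{x^2}{2\theta}\right),$$
leading to the likelihood
$$L(\theta) = (2\pi\theta)^{-n/2} \exp\left(-\frac{1}{2\theta} \sum_{i=1}^n x_i^2\right).$$
Hence, $T(\mathbf{x}) = \sum_{i=1}^n x_i^2$ is a sufficient statistic for $\theta$. The log-likelihood and score functions are
$$\ell(\theta) = -\frac{n}{2} \log(2\pi\theta) - \frac{1}{2\theta} \sum_{i=1}^n x_i^2, \qquad \ell'(\theta) = -\frac{n}{2\theta} + \frac{1}{2\theta^2} \sum_{i=1}^n x_i^2.$$
Solving $\ell'(\hat\theta) = 0$ gives a candidate MLE
$$\hat\theta = \frac{1}{n} \sum_{i=1}^n x_i^2,$$
which is a function of the sufficient statistic. To check this is an MLE we calculate
$$\ell''(\theta) = \frac{n}{2\theta^2} - \frac{1}{\theta^3} \sum_{i=1}^n x_i^2.$$
In this case it isn’t immediately obvious that $\ell''(\hat\theta) < 0$, but substituting in $\sum_{i=1}^n x_i^2 = n\hat\theta$ gives
$$\ell''(\hat\theta) = \frac{n}{2\hat\theta^2} - \frac{n\hat\theta}{\hat\theta^3} = -\frac{n}{2\hat\theta^2} < 0,$$
confirming that this is an MLE.
The observed information is $I_O(\hat\theta) = -\ell''(\hat\theta) = \dfrac{n}{2\hat\theta^2}$.
Therefore a 95% confidence interval is given by
$$\hat\theta \pm 1.96 \, I_O(\hat\theta)^{-1/2} = \hat\theta \pm 1.96 \, \hat\theta \sqrt{\frac{2}{n}}.$$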
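As a rough numerical companion to this example, the Python sketch below computes the MLE and an approximate 95% interval from the sufficient statistic alone. It assumes the zero-mean normal model $N(0, \theta)$ with $\theta$ the variance, and a Wald-type interval based on the observed information; the function name and the input values ($n = 50$, $\sum x_i^2 = 200$) are hypothetical.

```python
import math


def normal_variance_ci(n, sum_sq, z=1.96):
    """MLE and approximate 95% interval for theta in N(0, theta).

    Uses only n and sum_sq = sum of x_i**2, the sufficient statistic.
    """
    theta_hat = sum_sq / n                    # root of the score function
    obs_info = n / (2.0 * theta_hat ** 2)     # minus the second derivative at theta_hat
    half_width = z / math.sqrt(obs_info)      # equals z * theta_hat * sqrt(2 / n)
    return theta_hat, (theta_hat - half_width, theta_hat + half_width)


# Hypothetical summary values, for illustration only.
theta_hat, (lower, upper) = normal_variance_ci(n=50, sum_sq=200.0)
print(theta_hat, lower, upper)
```

The raw observations never enter the calculation; the sample mean in particular plays no role under this model.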