By definition, $L(\theta)$ or $\ell(\theta) = \log L(\theta)$ give measures of how likely $\theta$ is to be the true parameter value for the population, since (up to proportionality) $L(\theta)$ is the probability of obtaining the observed data if $\theta$ had been the true parameter.
Consequently, values of $\theta$ with large (log-)likelihood are more likely to be correct than values with low (log-)likelihood.
This leads to the principle of maximum likelihood estimation, whereby we estimate $\theta$ by the maximum likelihood estimator (MLE) $\hat{\theta}$, defined to be any value of $\theta$ that maximises $L(\theta)$ (or, equivalently, $\ell(\theta)$).
Example 1.2: IID Poisson data.
Find $\hat{\lambda}$ for independent data $x_1, \dots, x_n$ from a Poisson($\lambda$) distribution.
For $X_i \sim \text{Poisson}(\lambda)$ with $\lambda > 0$, $f(x_i; \lambda) = e^{-\lambda} \lambda^{x_i} / x_i!$. Then
\[
\ell(\lambda) = \sum_{i=1}^{n} \left( -\lambda + x_i \log \lambda - \log x_i! \right)
= -n\lambda + \log \lambda \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \log x_i!,
\]
so
\[
\frac{d\ell}{d\lambda} = -n + \frac{1}{\lambda} \sum_{i=1}^{n} x_i = 0
\quad \Longrightarrow \quad \hat{\lambda} = \bar{x}.
\]
We should also check that the second derivative is negative (here $d^2\ell/d\lambda^2 = -\sum_{i=1}^{n} x_i / \lambda^2 < 0$). Differentiating is a common way to find maxima (for a continuous parameter space), but it is not the only way: it will not always work, and we may have to resort to 'brute force' numerical approaches, as in the sketch below.
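To illustrate the numerical route, the following sketch (assuming Python with NumPy and SciPy, and simulated data chosen purely for illustration) maximises the Poisson log-likelihood directly and compares the result with the analytic MLE $\bar{x}$.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

# Hypothetical Poisson data; in practice x would be the observed sample.
rng = np.random.default_rng(1)
x = rng.poisson(lam=3.0, size=50)

def neg_log_lik(lam):
    # Negative Poisson log-likelihood: -sum(-lam + x_i*log(lam) - log(x_i!)).
    return -np.sum(-lam + x * np.log(lam) - gammaln(x + 1))

# 'Brute force' numerical maximisation over a bounded interval for lambda.
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 20.0), method="bounded")

print("numerical MLE:", res.x)
print("analytic MLE (sample mean):", x.mean())
\end{verbatim}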
We define the first derivative of the log-likelihood as the score function,
\[
U(\theta) = \frac{\partial \ell(\theta)}{\partial \theta},
\]
so the MLE is the solution of the score equation $U(\hat{\theta}) = 0$.
The likelihood function can also be used to interpret the plausibility of values of $\theta$ other than the MLE. Define the relative likelihood of $\theta$ as
\[
RL(\theta) = \frac{L(\theta)}{L(\hat{\theta})}.
\]
Example 1.3: Coin tossing.
Suppose a coin is tossed $n$ times and $x$ heads are observed. Then if $p$ represents the probability of a head on a single coin toss, the number of heads is Binomial$(n, p)$, and the likelihood function is
\[
L(p) = \binom{n}{x} p^{x} (1-p)^{n-x},
\]
and this gives
\[
\hat{p} = \frac{x}{n}.
\]
Now we can consider the relative likelihood of $p = 1/2$ (i.e. the coin is fair) as
\[
RL(1/2) = \frac{L(1/2)}{L(\hat{p})} = \frac{(1/2)^{n}}{\hat{p}^{x} (1-\hat{p})^{n-x}}.
\]
One could subjectively interpret the relative likelihood; however, we will develop theoretical tools to compare values of $\theta$ with different likelihoods later.
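For concreteness, here is a short sketch (in Python, with hypothetical counts; the values of $n$ and $x$ are placeholders, not data from the notes) evaluating the relative likelihood of a fair coin.

\begin{verbatim}
import numpy as np
from scipy.stats import binom

# Hypothetical data: n tosses, x heads (placeholders for illustration only).
n, x = 20, 14

p_hat = x / n   # MLE of the head probability

# RL(1/2) = L(1/2) / L(p_hat); the binomial coefficient cancels in the ratio.
rel_lik = binom.pmf(x, n, 0.5) / binom.pmf(x, n, p_hat)

print("MLE p_hat:", p_hat)
print("relative likelihood of p = 1/2:", rel_lik)
\end{verbatim}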
For this development, it is useful to introduce the deviance, which is a relative-likelihood-like measure:
\[
D(\theta) = 2 \left\{ \ell(\hat{\theta}) - \ell(\theta) \right\}.
\]
Exercise: what is the relationship between the relative likelihood and the deviance?
There are two ways to measure the information about $\theta$ contained in the sample: the observed information $I_O(\theta)$ and the expected information $I_E(\theta)$. The observed information is based on the observed data,
\[
I_O(\theta) = -\frac{\partial^2 \ell(\theta)}{\partial \theta^2},
\]
and the expected information, sometimes called Fisher information, is averaged over all possible data sets (expectation over $X_1, \dots, X_n$),
\[
I_E(\theta) = \mathbb{E} \left[ -\frac{\partial^2 \ell(\theta)}{\partial \theta^2} \right].
\]
These quantities also represent the curvature of the likelihood function. Of particular interest is the curvature at the MLE, $I_O(\hat{\theta})$. High information (large curvature) at the MLE corresponds to a tight peak, and indicates less uncertainty about $\theta$.
Example 1.4: IID Poisson data, ctd.
Find the observed and expected information for independent data $x_1, \dots, x_n$ from a Poisson($\lambda$) distribution. How do they change with $n$?
Recall $\frac{d\ell}{d\lambda} = -n + \frac{1}{\lambda} \sum_{i=1}^{n} x_i$. So
\[
\frac{d^2\ell}{d\lambda^2} = -\frac{1}{\lambda^2} \sum_{i=1}^{n} x_i.
\]
Thus $I_O(\lambda) = \frac{1}{\lambda^2} \sum_{i=1}^{n} x_i$.
Now as $\mathbb{E}[X_i] = \lambda$,
\[
I_E(\lambda) = \mathbb{E} \left[ \frac{1}{\lambda^2} \sum_{i=1}^{n} X_i \right] = \frac{n\lambda}{\lambda^2} = \frac{n}{\lambda}.
\]
Note that both $I_O(\lambda)$ and $I_E(\lambda)$ are multiples of $n$ (since $\sum_{i=1}^{n} x_i = n\bar{x}$).
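As a quick numerical check (a sketch with simulated data, not part of the notes), the observed information at the MLE can be compared with a finite-difference estimate of $-d^2\ell/d\lambda^2$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(6)
x = rng.poisson(lam=5.0, size=25)   # hypothetical Poisson sample
lam_hat = x.mean()                  # MLE

def log_lik(lam):
    # Poisson log-likelihood up to an additive constant.
    return -x.size * lam + x.sum() * np.log(lam)

# Central finite difference for the second derivative at the MLE.
h = 1e-3
second_deriv = (log_lik(lam_hat + h) - 2 * log_lik(lam_hat)
                + log_lik(lam_hat - h)) / h**2

print("finite-difference observed information:", -second_deriv)
print("formula sum(x_i)/lambda_hat^2 = n/lambda_hat:", x.sum() / lam_hat**2)
\end{verbatim}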
The pair $\left( \hat{\theta}, I_O(\hat{\theta}) \right)$ provide a useful two-dimensional summary of any (log-)likelihood function. Indeed, consider approximating $\ell(\theta)$ by a second-order Taylor expansion about $\hat{\theta}$:
\[
\ell(\theta) \approx \ell(\hat{\theta}) + (\theta - \hat{\theta}) \, \ell'(\hat{\theta}) + \tfrac{1}{2} (\theta - \hat{\theta})^2 \, \ell''(\hat{\theta}).
\]
Since $\ell'(\hat{\theta}) = 0$ and $\ell''(\hat{\theta}) = -I_O(\hat{\theta})$, this can be written as
\[
\ell(\theta) - \ell(\hat{\theta}) \approx -\tfrac{1}{2} (\theta - \hat{\theta})^2 \, I_O(\hat{\theta}),
\]
so this is a quadratic approximation for the log relative likelihood depending only on our pair $\left( \hat{\theta}, I_O(\hat{\theta}) \right)$. If this approximation is accurate around the MLE, we call the likelihood function regular.
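As a quick empirical check of this approximation (a sketch with simulated Poisson data; the values are illustrative only), one can compare the exact log relative likelihood with its quadratic approximation at a few values of $\lambda$.

\begin{verbatim}
import numpy as np

# Simulated (hypothetical) Poisson data.
rng = np.random.default_rng(2)
x = rng.poisson(lam=4.0, size=40)

lam_hat = x.mean()                  # MLE
obs_info = x.sum() / lam_hat**2     # observed information at the MLE

def log_lik(lam):
    # Poisson log-likelihood up to an additive constant.
    return -x.size * lam + x.sum() * np.log(lam)

for lam in [lam_hat - 1.0, lam_hat - 0.5, lam_hat + 0.5, lam_hat + 1.0]:
    exact = log_lik(lam) - log_lik(lam_hat)         # log relative likelihood
    quad = -0.5 * (lam - lam_hat) ** 2 * obs_info   # quadratic approximation
    print(f"lambda = {lam:.2f}: exact = {exact:.3f}, quadratic = {quad:.3f}")
\end{verbatim}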
As frequentists, however, we must consider that the data $x_1, \dots, x_n$, which we have used to construct our MLE $\hat{\theta}$, are a realisation of the random variables $X_1, \dots, X_n$. Another draw from these random variables would result in different data, and hence a different MLE.
Therefore, $\hat{\theta}$ is itself a random variable, and must have a distribution: we call this the sampling distribution of $\hat{\theta}$.
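To see this empirically, a small simulation (a sketch in Python; the true $\lambda$, sample size and number of replications are arbitrary illustrative choices) draws repeated Poisson samples and records the MLE each time.

\begin{verbatim}
import numpy as np

# Illustrative setting (hypothetical true lambda and sample size).
rng = np.random.default_rng(3)
true_lam, n, n_reps = 3.0, 30, 5000

# Each row is one possible data set; the Poisson MLE is the sample mean of each row.
samples = rng.poisson(lam=true_lam, size=(n_reps, n))
mles = samples.mean(axis=1)

print("first few MLEs:", mles[:5])          # different data give different MLEs
print("mean of the MLEs:", mles.mean())
print("sd of the MLEs:", mles.std())
\end{verbatim}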
The fact that the MLE has a distribution makes it seem somewhat foolish simply to quote the MLE as a point estimate of the true value of the parameter $\theta$. This motivates the idea of a confidence region for $\theta$, which is defined as any rule for constructing a region $C(X)$ with the following property:
\[
\Pr\{\theta \in C(X)\} = 1 - \alpha,
\]
where $1 - \alpha$ is referred to as the confidence level, and is specified in advance.
Where confidence regions are one-dimensional and of the form $(\theta_L, \theta_U)$, they are usually called confidence intervals.
This is a precise and important definition. Any specific confidence region we produce depends upon the observed data $x_1, \dots, x_n$. It either contains, or does not contain, the true parameter $\theta$.
We cannot make a probability statement about this specific confidence region in isolation because there is no repeated sampling. However, there is a collection of confidence regions that would be produced for all the different samples we could draw from the random variables $X_1, \dots, X_n$.
The probability of these regions covering the true value $\theta$ is $1 - \alpha$, where we define probability as a limiting proportion over the samples.
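To illustrate the coverage interpretation, the following simulation sketch (in Python; the Wald-type interval $\hat{\lambda} \pm 1.96\sqrt{\hat{\lambda}/n}$ used here is one common approximate construction, assumed for illustration rather than derived in this section) estimates the limiting proportion of regions that cover the true value.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
true_lam, n, n_reps = 3.0, 50, 10000   # hypothetical true parameter and sample size

covered = 0
for _ in range(n_reps):
    x = rng.poisson(lam=true_lam, size=n)
    lam_hat = x.mean()
    # Approximate 95% interval: MLE +/- 1.96 * sqrt(estimated variance of the MLE).
    half_width = 1.96 * np.sqrt(lam_hat / n)
    if lam_hat - half_width <= true_lam <= lam_hat + half_width:
        covered += 1

print("empirical coverage:", covered / n_reps)   # should be close to 0.95
\end{verbatim}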
It is rarely possible to construct confidence regions exactly; hence, it is usual to rely on asymptotic results (i.e. results as $n \to \infty$) and then assume that these results hold approximately for our observed $n$ (see e.g. Example 2.2).
Before framing our discussion in more formal asymptotic theory, we first recall a result which gives us a bound on how good an inference method can be.
Theorem (CRLB): Cramér-Rao bound for variance of
estimators.
Let $T = T(X_1, \dots, X_n)$ be an estimator with expected value $\mathbb{E}[T] = g(\theta)$, where $g$ is a differentiable function. Then
\[
\operatorname{Var}(T) \ge \frac{\{ g'(\theta) \}^2}{I_E(\theta)}.
\]
In particular, if $T$ is an unbiased estimator of $\theta$, we have that $g(\theta) = \theta$ and $g'(\theta) = 1$. Thus,
\[
\operatorname{Var}(T) \ge \frac{1}{I_E(\theta)}.
\]
This states that an estimator has minimum variance if it achieves the bound on the right-hand side of the relevant expression above. In particular, if an unbiased estimator has variance equal to the reciprocal of the Fisher information of the model, it is a minimum variance unbiased estimator (MVUE).
Example 1.5: The sample mean estimator of the Poisson parameter.
Let $x_1, \dots, x_n$ be independent data from a Poisson($\lambda$) distribution. Is $\bar{X}$ an MVUE for the Poisson parameter $\lambda$?
Firstly, we note that $\mathbb{E}[\bar{X}] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[X_i] = \lambda$; thus $\bar{X}$ is unbiased as an estimator for $\lambda$.
The variance of $\bar{X}$ is
\[
\operatorname{Var}(\bar{X}) = \frac{1}{n^2} \sum_{i=1}^{n} \operatorname{Var}(X_i) = \frac{n\lambda}{n^2} = \frac{\lambda}{n}.
\]
Lastly, recall from Example 1.4 that $I_E(\lambda) = n/\lambda$.
Since the variance of $\bar{X}$ is equal to $1/I_E(\lambda) = \lambda/n$, according to the CRLB Theorem it is a minimum variance estimator (because it achieves the Cramér-Rao lower bound).
From above, it is unbiased too, so $\bar{X}$ is an MVUE.
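A quick numerical sanity check (a sketch with arbitrary illustrative values, not part of the notes) compares the simulated variance of $\bar{X}$ with the Cramér-Rao bound $1/I_E(\lambda) = \lambda/n$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)
true_lam, n, n_reps = 2.5, 40, 20000   # illustrative parameter, sample size, replications

# Simulate the sample mean estimator many times.
x_bars = rng.poisson(lam=true_lam, size=(n_reps, n)).mean(axis=1)

print("simulated Var(x_bar):", x_bars.var())
print("Cramer-Rao lower bound lambda/n:", true_lam / n)
\end{verbatim}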