The Wald statistic and Wald confidence interval are based on the asymptotic distribution of the MLE, as given in Theorem 1 and Note 3.
The statistic can be used to test whether our true parameter $\theta$ takes some specific value $\theta_0$, i.e.
$$H_0: \theta = \theta_0 \quad \text{versus} \quad H_1: \theta \neq \theta_0,$$
using the fact that, under $H_0$,
$$\frac{\hat{\theta} - \theta_0}{\sqrt{I_O(\hat{\theta})^{-1}}} \;\dot{\sim}\; N(0,1),$$
where $I_O(\hat{\theta})$ denotes the observed information. This allows us to calculate a $p$-value for $H_0$ using the statistic (see MATH235).
Example 2.1: Coin tossing, ctd.
Recall: a coin is tossed $n = 10$ times and $x = 6$ heads are observed, and we had
$$\hat{p} = \frac{x}{n} = 0.6$$
and
$$I_O(\hat{p}) = \frac{n}{\hat{p}(1-\hat{p})} = \frac{10}{0.6 \times 0.4} \approx 41.7.$$
Suppose we are interested in testing the hypothesis that the coin is fair:
$$H_0: p = 0.5 \quad \text{versus} \quad H_1: p \neq 0.5.$$
Now,
$$I_O(\hat{p})^{-1} = \frac{\hat{p}(1-\hat{p})}{n} = 0.024,$$
so
$$\sqrt{I_O(\hat{p})^{-1}} \approx 0.155,$$
giving a statistic of
$$\frac{\hat{p} - 0.5}{\sqrt{I_O(\hat{p})^{-1}}} = \frac{0.1}{0.155} \approx 0.645.$$
Then, using R for example, we obtain a p-value of
2*(1 - pnorm(0.645)) = 0.518
suggesting we shouldn’t reject $H_0$; there isn’t evidence for a biased coin. This matches our intuition, since $p = 0.5$ has a high relative likelihood.
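As a quick check, the calculation above can be reproduced with a minimal R sketch (assuming the values $n = 10$ and $x = 6$ stated above):

# Wald test of H0: p = 0.5 for the coin example
n <- 10; x <- 6
p.hat <- x / n                        # MLE
se <- sqrt(p.hat * (1 - p.hat) / n)   # square root of inverse observed information
z <- (p.hat - 0.5) / se               # Wald statistic, approx 0.645
2 * (1 - pnorm(z))                    # two-sided p-value, approx 0.52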
Similarly, we can construct Wald confidence intervals for $\theta$ as
$$\hat{\theta} \pm z_{1-\alpha/2}\sqrt{I_O(\hat{\theta})^{-1}},$$
the most common choice being a 95% confidence interval with $\alpha = 0.05$ and hence $z_{1-\alpha/2} = z_{0.975} \approx 1.96$.
Example 2.2: Coin tossing, ctd. A 95% confidence interval for the coin example is thus
$$0.6 \pm 1.96 \times 0.155 = (0.296,\, 0.904).$$
Recalling the equivalence between hypothesis testing and confidence intervals (MATH235), it should come as no surprise that $p = 0.5$ falls within the confidence interval.
Note that in this binomial proportion case, the form of the confidence interval is
$$\hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},$$
which is precisely the “usual” form of the confidence interval for the binomial proportion taught in introductory statistics (derived from the CLT normal approximation to the binomial distribution).
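A short R sketch of this interval for the same data (again assuming $n = 10$, $x = 6$) is:

# 95% Wald confidence interval for the binomial proportion
n <- 10; x <- 6
p.hat <- x / n
se <- sqrt(p.hat * (1 - p.hat) / n)
p.hat + c(-1, 1) * qnorm(0.975) * se   # approx (0.296, 0.904)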
It is important to remember that confidence intervals based on asymptotic distributions (e.g. of the MLE) are in most cases approximate and will only hold for large $n$.
For example, in the case of the binomial proportion, “exact” confidence intervals can be obtained using quantiles of the Beta distribution (the so-called Clopper-Pearson interval).
Exercise 1: Use your research skills (Google and e.g. R) to find the “exact” interval corresponding to the data in Example 2.2 above.
The next example (and following questions) examines the potential differences and tests your understanding further.
Example 2.3: Normal variance, mean known.
Suppose the sample $x_1, \dots, x_n$ comes from $N(\mu, \sigma^2)$, with $\mu$ known. Give a general formula for the MLE and a corresponding 95% confidence interval for $\sigma^2$. Compute this confidence interval for the data given.
The Normal density is given by
$$f(x; \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\},$$
leading to the likelihood
$$L(\sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right\}.$$
The log-likelihood and score functions are
$$\ell(\sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2, \qquad
\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^2.$$
Solving $\partial \ell / \partial \sigma^2 = 0$ gives an MLE of
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2.$$
The observed information is
$$I_O(\hat{\sigma}^2) = -\left.\frac{\partial^2 \ell}{\partial (\sigma^2)^2}\right|_{\sigma^2 = \hat{\sigma}^2} = \frac{n}{2\hat{\sigma}^4}.$$
Therefore a 95% confidence interval based on the MLE is given by
$$\hat{\sigma}^2 \pm 1.96\sqrt{\frac{2\hat{\sigma}^4}{n}} = \hat{\sigma}^2\left(1 \pm 1.96\sqrt{\frac{2}{n}}\right),$$
which can be evaluated by plugging in the data given.
On the other hand, we also know (e.g. from MATH230) that
$$\frac{n\hat{\sigma}^2}{\sigma^2} = \frac{\sum_{i=1}^{n}(X_i-\mu)^2}{\sigma^2} \sim \chi^2_n$$
(a $\chi^2_n$ distribution is the sum of $n$ squared standard normal random variables). Hence an exact 95% confidence interval is
$$\left(\frac{n\hat{\sigma}^2}{\chi^2_{n,\,0.975}},\; \frac{n\hat{\sigma}^2}{\chi^2_{n,\,0.025}}\right).$$
Here $\chi^2_{n,\,q}$ is the $q$ quantile of a $\chi^2_n$ distribution.
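The two intervals can be compared numerically in R. The sketch below uses simulated data, since the original data values are not reproduced here, so the numbers produced are purely illustrative.

# Compare the approximate (Wald) and exact (chi-squared) intervals for sigma^2, with mu known
set.seed(1)
mu <- 0; n <- 20
x <- rnorm(n, mean = mu, sd = 2)               # illustrative data only
sigma2.hat <- sum((x - mu)^2) / n              # MLE of sigma^2
sigma2.hat * (1 + c(-1, 1) * qnorm(0.975) * sqrt(2 / n))   # approximate Wald interval
n * sigma2.hat / qchisq(c(0.975, 0.025), df = n)           # exact chi-squared interval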
Exercise 2: Does the normal variance maximum likelihood estimator achieve the Cramér-Rao bound? How about the sample variance estimator with divisor $n-1$? Are these estimators unbiased?
Exercise 3: Suppose now we wanted to perform inference on the population mean instead, i.e. the sample comes from $N(\mu, \sigma^2)$, with $\sigma^2$ known. Is the MLE $\hat{\mu} = \bar{x}$ unbiased? Is it a minimum variance unbiased estimator (MVUE)?
Exercise 4: Would confidence intervals based on the MLE in Exercise 3 be approximate or exact?
The main drawback to the Wald procedures is that they rely on the asymptotic normality of the MLE. In finite samples the distribution may be far from normal; the likelihood may not be symmetric (or more specifically, regular, as defined earlier) about the MLE.
One solution in this case is to look for a transformation on the parameter space that makes the likelihood more regular about the MLE, and hence improves the normality approximation.
Example 2.4: Biased coin tossing. We consider a coin tossing example as before, but this time with data for which the MLE $\hat{p}$ lies close to the boundary of the parameter space. (A similar example, with a roulette wheel motivation, was given in MATH235.)
Plugging the numbers into the general form for the binomial Wald confidence interval calculated before gives an interval whose lower limit is negative. Clearly this is invalid, as the confidence interval includes negative values, which are outside the parameter space.
Consider the log-odds transformation
$$\phi = g(p) = \log\left(\frac{p}{1-p}\right),$$
which has inverse
$$p = g^{-1}(\phi) = \frac{e^{\phi}}{1+e^{\phi}}.$$
This is a useful transformation for probabilities as it maps the interval $(0,1)$ onto the whole real line $(-\infty, \infty)$ (it will be revisited in MATH333 Statistical Models).
So the normal approximation to the distribution of $\hat{\phi}$ is going to be much better than that to $\hat{p}$, so we calculate the Wald confidence interval in $\phi$-space and then transform back to $p$-space.
The invariance property of the likelihood tells us that
$$\hat{\phi} = g(\hat{p}) = \log\left(\frac{\hat{p}}{1-\hat{p}}\right).$$
From Theorem 3,
$$I_O(\hat{\phi}) = \frac{I_O(\hat{p})}{g'(\hat{p})^2},$$
and
$$g'(p) = \frac{1}{p(1-p)},$$
so
$$I_O(\hat{\phi}) = \frac{n}{\hat{p}(1-\hat{p})} \times \hat{p}^2(1-\hat{p})^2 = n\hat{p}(1-\hat{p})
\quad \text{and} \quad \operatorname{se}(\hat{\phi}) = \frac{1}{\sqrt{n\hat{p}(1-\hat{p})}}.$$
Therefore the 95% confidence interval for $\phi$ is
$$\hat{\phi} \pm \frac{1.96}{\sqrt{n\hat{p}(1-\hat{p})}}.$$
Transforming this back into a confidence interval about $p$ using the inverse function $g^{-1}$ above gives a new confidence interval for $p$, which is an improvement at least in that the confidence interval is confined to the interval $(0,1)$!
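A rough R sketch of the whole calculation is given below; since the data for this example are not reproduced here, the values n = 10 and x = 1 are purely illustrative.

# Wald interval on the log-odds scale, then back-transformed to (0, 1)
n <- 10; x <- 1                                    # illustrative values only
p.hat <- x / n
p.hat + c(-1, 1) * qnorm(0.975) * sqrt(p.hat * (1 - p.hat) / n)   # naive interval: lower limit negative
phi.hat <- log(p.hat / (1 - p.hat))                # MLE on the log-odds scale
ci.phi <- phi.hat + c(-1, 1) * qnorm(0.975) / sqrt(n * p.hat * (1 - p.hat))
exp(ci.phi) / (1 + exp(ci.phi))                    # back-transformed interval, inside (0, 1)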
Generally, finding an appropriate function $g$ (let alone the best function $g$) is difficult or impossible, so this does not entirely overcome the limitations of Wald procedures. We turn instead to procedures based on likelihood ratios or deviance.
An alternative means by which to proceed is to look instead at the asymptotic distribution of the deviance at the true parameter value, and use this to devise testing procedures and construct confidence intervals.
The testing procedure that relies on the deviance is called the likelihood ratio test. This is the most important hypothesis testing procedure you will learn in undergraduate statistics.
Suppose we are carrying out a hypothesis test of a null hypothesis $H_0: \theta = \theta_0$ against a very simple alternative $H_1: \theta = \theta_1$.
Note this is different from (simpler than) most alternative hypotheses we have considered so far, in that it posits a single value rather than a range of values.
The likelihood ratio test of $H_0$ versus $H_1$ has a rejection region of the form
$$\left\{x : \frac{L(\theta_1; x)}{L(\theta_0; x)} \geq k\right\},$$
where the constant $k$ is chosen to achieve level $\alpha$. (Recall that the level of the test, $\alpha$, is the probability of rejecting the null hypothesis when the null hypothesis is true.)
Intuitively: we reject $H_0$ in favour of $H_1$ if the likelihood of $\theta_1$ is much larger than the likelihood of $\theta_0$, given the data.
It may seem concerning up to now that, in order to carry out a hypothesis test, one can dream up all kinds of procedures (e.g. the Wald test or a likelihood ratio test (LRT)). We now move on to the notion of an optimal hypothesis test, and an important result that establishes when this optimality is achieved.
Remarkably, we can show that, in a sense to be defined, the LRT is the optimal test, so whenever it is possible to carry it out, it is to be preferred over any other test.
To judge this optimality, we introduce an important property of hypothesis tests.
The power of the test, usually denoted $1-\beta$, is the probability of rejecting the null hypothesis when the alternative hypothesis is true.
Therefore, we would like to find tests with small $\alpha$ but large power. Usually, in hypothesis testing, we fix $\alpha$ in advance, then ‘hope’ we have good power.
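Power can be estimated by simulation. The R sketch below estimates the power of the two-sided Wald test of $H_0: p = 0.5$ from the coin example, under an assumed true value $p = 0.7$ chosen purely for illustration.

# Monte Carlo estimate of power for the Wald test with n = 10 tosses
set.seed(1)
n <- 10; p.true <- 0.7; alpha <- 0.05
x <- rbinom(10000, size = n, prob = p.true)    # simulated samples under the alternative
p.hat <- x / n
z <- (p.hat - 0.5) / sqrt(p.hat * (1 - p.hat) / n)
# samples with x = 0 or x = n give an infinite statistic and count as rejections
mean(abs(z) > qnorm(1 - alpha / 2))            # proportion of simulated samples rejecting H0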
Theorem 5 Neyman-Pearson Lemma.
Consider the hypotheses
$$H_0: \theta = \theta_0 \quad \text{versus} \quad H_1: \theta = \theta_1.$$
Then the likelihood ratio test is the most powerful of all tests of level no more than $\alpha$.
Proof (Neyman-Pearson Lemma).
Set-up
First, $L(\theta; x)$ is proportional to the joint density of the data, i.e. $L(\theta; x) = c\,f(x; \theta)$ for some constant $c > 0$. Therefore, letting $f_0(x)$ denote the density under $H_0$ and $f_1(x)$ the density under $H_1$,
$$\frac{L(\theta_1; x)}{L(\theta_0; x)} = \frac{f_1(x)}{f_0(x)},$$
and we can re-write the LRT as rejecting $H_0$ when
$$\frac{f_1(x)}{f_0(x)} \geq k,$$
i.e.
$$f_1(x) - k f_0(x) \geq 0.$$
Now let
$$\psi(x) = \begin{cases} 1 & \text{if } f_1(x) - k f_0(x) \geq 0, \\ 0 & \text{otherwise}, \end{cases}$$
so that $\psi(x) = 1$ exactly when the LRT rejects $H_0$. Consider a new test, $\psi^*$, defined similarly as
$$\psi^*(x) = \begin{cases} 1 & \text{if the new test rejects } H_0, \\ 0 & \text{otherwise}, \end{cases}$$
where the new test has level no more than $\alpha$. Finally for the set-up, let
$$U(x) = \{\psi(x) - \psi^*(x)\}\{f_1(x) - k f_0(x)\}.$$
1. Showing $U(x) \geq 0$
Consider the cases.
If $f_1(x) - k f_0(x) > 0$ then $\psi(x) = 1 \geq \psi^*(x)$, and so $U(x) \geq 0$.
If $f_1(x) - k f_0(x) < 0$ then $\psi(x) = 0 \leq \psi^*(x)$, and so $U(x) \geq 0$.
If $f_1(x) - k f_0(x) = 0$ then clearly $U(x) = 0$.
2. Integrating and showing most powerful
The result that $U(x) \geq 0$ is true for all $x$, and so
$$0 \leq \int U(x)\,dx = \left(\int \psi(x) f_1(x)\,dx - \int \psi^*(x) f_1(x)\,dx\right) - k\left(\int \psi(x) f_0(x)\,dx - \int \psi^*(x) f_0(x)\,dx\right).$$
Now, by the assumption that the new test has a level of at most $\alpha$ (which the LRT achieves exactly), we have
$$\int \psi^*(x) f_0(x)\,dx \leq \alpha = \int \psi(x) f_0(x)\,dx,$$
i.e. the right-hand bracket is $\geq 0$. Therefore the left-hand bracket must also be non-negative, and so
$$\int \psi(x) f_1(x)\,dx \geq \int \psi^*(x) f_1(x)\,dx.$$
But this says that under $H_1$ the likelihood ratio test is more likely to reject $H_0$, and hence is more powerful. So we are done. ∎
Note 1: In cases we have considered so far, we have taken $\theta_0$ to be some specific value and $\theta_1$ to be the MLE $\hat{\theta}$ (e.g. in the coin tossing Example 2.1). In this case, the asymptotic distribution of the deviance (by Theorem 2) implies that
$$D(\theta_0) = 2\left\{\ell(\hat{\theta}) - \ell(\theta_0)\right\} \;\dot{\sim}\; \chi^2_1$$
under $H_0$.
Note 2: If we are interested in a more general alternative hypothesis, e.g. $H_1: \theta \neq \theta_0$, then we require Neyman-Pearson to hold for every value $\theta_1$ in the alternative (with the same rejection region). In this case, we call the likelihood ratio test uniformly most powerful (UMP).
This property will usually hold for the examples we consider in this module, but it is beyond this course to check it. Even when the LRT is not UMP, it is still a ‘good’ test by virtue of it being most powerful for each simple alternative hypothesis.
The asymptotic distribution of the deviance suggests a means of constructing a confidence region for $\theta$ as
$$\left\{\theta : D(\theta) = 2\left[\ell(\hat{\theta}) - \ell(\theta)\right] \leq c_{\alpha}\right\},$$
where $c_{\alpha}$ is the critical value of the $\chi^2_1$ distribution for confidence level $1-\alpha$; for example, for a 95% confidence interval, $c_{0.05} = 3.84$.
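This set typically has to be found numerically. A minimal R sketch for the coin example (assuming $n = 10$, $x = 6$ as before) finds the two endpoints by root-finding:

# Deviance-based 95% confidence interval for the binomial p
n <- 10; x <- 6
p.hat <- x / n
loglik <- function(p) x * log(p) + (n - x) * log(1 - p)
deviance <- function(p) 2 * (loglik(p.hat) - loglik(p))
crit <- qchisq(0.95, df = 1)                                     # 3.84
lower <- uniroot(function(p) deviance(p) - crit, c(1e-6, p.hat))$root
upper <- uniroot(function(p) deviance(p) - crit, c(p.hat, 1 - 1e-6))$root
c(lower, upper)                                                  # roughly (0.30, 0.85)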
Example 2.5: Coin tossing, ctd.
Back to the boring coin tossing example (with $n = 10$, $x = 6$). Again, we could test the hypothesis $H_0: p = 0.5$, this time using the likelihood ratio test.
We have
$$D(0.5) = 2\left\{\ell(\hat{p}) - \ell(0.5)\right\} = 2\left\{6\log 0.6 + 4\log 0.4 - 10\log 0.5\right\} \approx 0.402,$$
which via the R command 1-pchisq(0.402,df=1) gives $p \approx 0.526$, similar to that found with the Wald test.
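The same numbers can be checked directly in R (again assuming $n = 10$, $x = 6$):

# Deviance and LRT p-value for H0: p = 0.5
n <- 10; x <- 6
p.hat <- x / n
D <- 2 * (x * log(p.hat) + (n - x) * log(1 - p.hat) - x * log(0.5) - (n - x) * log(0.5))
D                       # approx 0.402
1 - pchisq(D, df = 1)   # approx 0.526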
Note: All hypothesis tests will be two-tailed unless otherwise specified. Because a $\chi^2_1$ distribution is the square of a standard normal distribution, we DO NOT double the $p$-value produced by R: squaring puts both tails in the same (positive) place, i.e. $|Z| > z$ corresponds to $Z^2 > z^2$.
In the example we have looked at, the Wald test and LRT give similar results. However, the LRT is to be preferred. The first reason is that the Wald test is carried out in $\theta$-space, and hence if we transform into, say, $\phi$-space where $\phi = g(\theta)$, then the results will change. The LRT depends on likelihoods, and the values of the likelihoods do not depend on how they are parameterised.
In fact, this suggests something deeper. For the Wald test to ‘work well’, we have to find a transformation $\phi = g(\theta)$ such that
$$\hat{\phi} = g(\hat{\theta}) \;\dot{\sim}\; N\!\left(\phi,\, I_O(\hat{\phi})^{-1}\right)$$
is a good approximation.
The LRT is invariant to the transformation $g$, and this suggests that the LRT will work well provided a function $g$ exists to satisfy the above, but we do not need to know what this function is.
Suppose such a $g$ exists, and consider a second-order Taylor expansion of the log-likelihood $\ell(\phi)$ about $\hat{\phi}$. Recall from Section 2.2.2 this is given by
$$\ell(\phi) \approx \ell(\hat{\phi}) - \frac{1}{2} I_O(\hat{\phi})(\phi - \hat{\phi})^2,$$
and so
$$D(\phi) = 2\left\{\ell(\hat{\phi}) - \ell(\phi)\right\} \approx I_O(\hat{\phi})(\phi - \hat{\phi})^2.$$
The function $g$ makes the likelihood regular by definition, so the approximation above is good. This means that the Wald confidence interval based on the asymptotic distribution of $\hat{\phi}$ is identical to the deviance-based confidence interval. ∎