2 Hypothesis Tests and Confidence Intervals

Wald Statistic and Wald Confidence Interval

The Wald statistic and Wald confidence interval are based on the asymptotic distribution of the MLE, as given in Theorem 1 and Note 3.

The statistic can be used to test whether our true parameter θ0 has a specific value θ0*, i.e.

H0 : θ0=θ0*
H1 : θ0 ≠ θ0*,

using the fact that, under H0,

Z = (θ^ − θ0*) / {IO(θ^)}^{−1/2} ∼ N(0,1).

This allows us to calculate a p-value for H0 using the Z statistic (see MATH235).

Example 2.1:  Coin tossing, ctd.

Recall: a coin is tossed n=10 times and r=6 heads are observed, and we had

ℓ′(θ) = r/θ − (n−r)/(1−θ)

and

θ^ = r/n = 0.6.

Suppose we are interested in testing the hypothesis that the coin is fair:

H0:θ0=0.5.

Now,

ℓ′′(θ) = −r/θ² − (n−r)/(1−θ)²,

so

IO(θ^) = r/(r/n)² + (n−r)/((n−r)/n)² = n(1/θ^ + 1/(1−θ^)) = 41.67,

giving a Z statistic of

z = (0.6 − 0.5) / 41.67^{−1/2} = 0.645.

Then, using R for example, we obtain a p-value of

2*(1 - pnorm(0.645)) = 0.518

suggesting we should not reject H0: there is no evidence that the coin is biased. This matches our intuition, since θ=0.5 has a high relative likelihood.
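For completeness, here is a short R sketch reproducing the whole calculation (the object names are arbitrary, not part of the example):

r <- 6; n <- 10
theta.hat <- r/n                      # MLE
info <- n^3/(r*(n - r))               # observed information at the MLE, = 41.67
z <- (theta.hat - 0.5)*sqrt(info)     # Wald Z statistic, = 0.645
2*(1 - pnorm(z))                      # two-sided p-value, approx 0.52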

Similarly, we can construct Wald confidence intervals for θ0 as

θ^ ± z_{1−α/2} {IO(θ^)}^{−1/2},

the most common choice being a 95% confidence interval with α = 0.05 and hence z_{1−α/2} = 1.96.

Example 2.2:  Coin tossing, ctd. A 95% confidence interval for the coin example is thus

0.6 ± 1.96 × 41.67^{−1/2} = (0.296, 0.904).

Recalling the equivalence between hypothesis testing and confidence intervals (MATH235), it should come as no surprise that 0.5 falls within the confidence interval.

Note that in this binomial proportion case, the form of the confidence interval is

θ^ ± z_{1−α/2} {IO(θ^)}^{−1/2} = θ^ ± z_{1−α/2} √(θ^(1−θ^)/n),

which is precisely the “usual” form of the confidence interval for the binomial proportion taught in introductory statistics (derived from the normal/CLT approximation to the binomial distribution).
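As a quick numerical check (a minimal R sketch; object names are arbitrary), both expressions above give the interval found in Example 2.2:

r <- 6; n <- 10
theta.hat <- r/n
se.info <- sqrt(r*(n - r)/n^3)                 # {IO(theta.hat)}^(-1/2)
se.usual <- sqrt(theta.hat*(1 - theta.hat)/n)  # "usual" binomial standard error
theta.hat + c(-1, 1)*1.96*se.info              # (0.296, 0.904)
theta.hat + c(-1, 1)*1.96*se.usual             # identical interval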

Confidence: approximate or not?

It is important to remember that confidence intervals based on asymptotic distributions (e.g. of the MLE) are in most cases approximate and will only hold for large n.

For example, in the case of the binomial proportion, “exact” confidence intervals can be obtained using quantiles of the Beta distribution (the so-called Clopper-Pearson interval).

Exercise 1: Use your research skills (Google and e.g. R) to find the “exact” interval corresponding to the data in Example 2.2 above.

The next example (and following questions) examines the potential differences and tests your understanding further.

Example 2.3:  Normal variance, mean known.
Suppose the sample x1, …, xn comes from X ∼ N(0, θ), with mean μ = 0 known. Give a general formula for the MLE θ^ and a corresponding 95% confidence interval of θ. Compute this confidence interval for data with n = 9 and Σᵢxᵢ² = 9.

The Normal(0, θ) density is given by f(xᵢ|θ) = (1/√(2πθ)) exp{−xᵢ²/(2θ)}, leading to the likelihood

L(θ) ∝ (1/θ^{n/2}) exp{−Σᵢxᵢ²/(2θ)}.

The log-likelihood and score functions are

ℓ(θ) = −(n/2) log θ − Σᵢxᵢ²/(2θ)
S(θ) = ℓ′(θ) = −n/(2θ) + Σᵢxᵢ²/(2θ²).

Solving S(θ)=0 gives an MLE of θ^ = Σᵢxᵢ²/n. The observed information is

IO(θ^) = n³ / {2(Σᵢxᵢ²)²}.

Therefore a 95% confidence interval based on the MLE is given by

(l, u) = (θ^ − 1.96/√{IO(θ^)}, θ^ + 1.96/√{IO(θ^)})
       = (Σᵢxᵢ²/n − 1.96·√2·Σᵢxᵢ²/n^{3/2}, Σᵢxᵢ²/n + 1.96·√2·Σᵢxᵢ²/n^{3/2})
       = (0.08, 1.92)

for the given data.

On the other hand, we also know (e.g. from MATH230) that nθ^(X)/θ = ΣᵢXᵢ²/θ ∼ χ²_n (a χ²_n random variable is a sum of n squared standard normal random variables). Hence

1 − α = P(χ²_{α/2,n} < nθ^/θ < χ²_{1−α/2,n})
      = P(χ²_{α/2,n}/(nθ^) < 1/θ < χ²_{1−α/2,n}/(nθ^))
      = P(nθ^/χ²_{1−α/2,n} < θ < nθ^/χ²_{α/2,n}),

so the corresponding exact 95% confidence interval is (nθ^/χ²_{1−α/2,n}, nθ^/χ²_{α/2,n}) = (0.47, 3.33).

Here χ²_{α/2,n} is the α/2 quantile of a χ²_n distribution.
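Both intervals in Example 2.3 can be reproduced numerically; a rough R sketch under the stated data (n = 9, Σᵢxᵢ² = 9), with arbitrary object names:

n <- 9; ss <- 9                             # ss = sum of the x_i^2
theta.hat <- ss/n                           # MLE, = 1
info <- n^3/(2*ss^2)                        # observed information at the MLE
theta.hat + c(-1, 1)*1.96/sqrt(info)        # Wald interval, approx (0.08, 1.92)
n*theta.hat/qchisq(c(0.975, 0.025), df=n)   # exact chi-squared interval, approx (0.47, 3.33)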

Exercise 2: Does the normal variance maximum likelihood estimator θ^(X) achieve the Cramér-Rao bound? How about the estimator θ~(X) = ΣᵢXᵢ²/(n−1)? Are these estimators unbiased?

Exercise 3: Suppose now we wanted to perform inference on the population mean instead, i.e. the sample x1, …, xn comes from X ∼ N(μ, 1), with σ² = 1 known. Is the MLE μ^(X) = ΣᵢXᵢ/n unbiased? Is it a minimum variance unbiased estimator (MVUE)?

Exercise 4: Would confidence intervals based on the MLE in Exercise 3 be approximate or exact?

The main drawback to the Wald procedures is that they rely on the asymptotic normality of the MLE. In finite samples the distribution may be far from normal; the likelihood may not be symmetric (or more specifically, regular, as defined earlier) about the MLE.

One solution in this case is to look for a transformation g on the parameter space that makes the likelihood more regular about the MLE, and hence improves the normality approximation.

Example 2.4:  Biased coin tossing. We consider a coin tossing example as before, but this time take r=2 and n=30. (A similar example, with a roulette wheel motivation, was given in MATH235).

Plugging the numbers into the general form for the binomial Wald confidence interval calculated before gives

  1. θ^=2/30

  2. IO(θ^)=482.143

  3. 95% CI=(-0.023,0.156)

Clearly this is invalid as the confidence interval includes negative values, which are outside of the parameter space.
Consider the log-odds transformation

ϕ = g(θ) = log(θ/(1−θ)),

which has inverse

θ = g⁻¹(ϕ) = 1/(1 + e^{−ϕ}).

This is a useful transformation for probabilities as g: (0,1) → (−∞, ∞) (it will be revisited in MATH333 Statistical Models).

The normal approximation to the distribution of ϕ^ is likely to be much better than that to θ^, so we calculate the Wald confidence interval in ϕ-space and then transform back to θ-space.

The invariance property of maximum likelihood estimation tells us that

ϕ^=g(θ^)=-2.639.

From Theorem 3,

Var[ϕ^] = {g′(θ^)}² Var(θ^),

and

g′(θ) = 1/θ + 1/(1−θ),

so {g′(θ^)}² = 258.29 and Var[ϕ^] = 0.536.

Therefore the 95% confidence interval for ϕ0 is

−2.639 ± 1.96·√0.536 = (−4.074, −1.204).

Transforming this back into a confidence interval for θ using the inverse transformation g⁻¹ above gives a new confidence interval for θ0 of

(1/(1 + e^{4.074}), 1/(1 + e^{1.204})) = (0.017, 0.231),

which is an improvement at least in that the confidence interval is confined to the interval (0,1)!
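The calculations in Example 2.4 can be verified with a few lines of R (a minimal sketch; object names are arbitrary):

r <- 2; n <- 30
theta.hat <- r/n
info <- n^3/(r*(n - r))                            # observed information, = 482.14
theta.hat + c(-1, 1)*1.96/sqrt(info)               # naive Wald interval, includes negative values
phi.hat <- log(theta.hat/(1 - theta.hat))          # MLE on the log-odds scale, = -2.639
var.phi <- (1/(theta.hat*(1 - theta.hat)))^2/info  # delta-method variance, = 0.536
ci.phi <- phi.hat + c(-1, 1)*1.96*sqrt(var.phi)    # interval for phi, (-4.074, -1.204)
1/(1 + exp(-ci.phi))                               # back-transformed interval, approx (0.017, 0.231)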

Generally, finding an appropriate function g (let alone the best function g) is difficult or impossible, so this does not overcome entirely the limitations of Wald procedures. We turn instead to procedures based on likelihood ratios or deviance.

An alternative means by which to proceed is to look instead at the asymptotic distribution of the deviance at θ0, and use this to devise testing procedures and construct confidence intervals.

The testing procedure that relies on the deviance is called the likelihood ratio test. This is the most important hypothesis testing procedure you will learn in undergraduate statistics.

Suppose we are carrying out a hypothesis test of a null hypothesis H0:θ0=a against a very simple alternative H1:θ0=b.

Note this is different from (and simpler than) most alternative hypotheses we have considered so far, in that it posits a single value rather than a range of values.

The likelihood ratio test testing H0 versus H1 has a rejection region of the form

L(θ=a)/L(θ=b) ≤ η,

where the constant η is chosen to achieve level α. (Recall that the level of the test, α, is the probability of rejecting the null hypothesis when the null hypothesis is true).

Intuitively: we reject H0 in favour of H1 if the likelihood of θ=b is much larger than the likelihood of θ=a, given the data.

It may seem concerning that, up to now, in order to carry out a hypothesis test one can dream up all kinds of procedures (e.g. the Wald test or a likelihood ratio test (LRT)). We now move on to the notion of an optimal hypothesis test, and an important result establishing how this optimality can be achieved.

Neyman-Pearson Lemma

Remarkably, we can show that, in a sense to be defined, the LRT is the optimal test, so whenever it is possible to carry it out, it is to be preferred over any other test.

To judge this optimality, we introduce an important property of hypothesis tests.

The power of the test, usually denoted 1-β, is the probability of rejecting the null hypothesis when the alternative hypothesis is true.

Therefore, we would like to find tests with small α but large power. Usually, in hypothesis testing, we fix α in advance, then ‘hope’ we have good power.

Theorem 5 Neyman-Pearson Lemma.
Consider the hypotheses

H0 : θ0=a
H1 : θ0=b.

Then the likelihood ratio test is the most powerful of all tests of level no more than α.

Proof.

(Neyman-Pearson)
Set-up
First, L(θ) is proportional to the joint density of the data, i.e. L(θ; X) = c f(X; θ) for some constant c. Therefore, letting f0 denote the joint density under H0 and f1 the joint density under H1,

L(θ=a)/L(θ=b) = c f0(X) / {c f1(X)} = f0(X)/f1(X),

and we can re-write the LRT as rejecting H0 when

f0(X) ≤ η f1(X)

i.e.

η f1(X) − f0(X) ≥ 0.

Now let

ϕ0(x) = { 1 when the N-P test rejects H0
          0 when the N-P test does not reject H0.

Consider a new test, ϕ, defined similarly as

ϕ(x) = { 1 when the new test rejects H0
         0 when the new test does not reject H0.

Finally for the set-up, let

U(x)={ϕ0(x)-ϕ(x)}{ηf1(x)-f0(x)}.

1. Showing U(x) ≥ 0
Consider the cases.

  • If {η f1(x) − f0(x)} > 0 then ϕ0(x) = 1 and so U(x) ≥ 0.

  • If {η f1(x) − f0(x)} < 0 then ϕ0(x) = 0 and so U(x) ≥ 0.

  • If {ηf1(x)-f0(x)}=0 then clearly U(x)=0.

2. Integrating and showing most powerful
The result U(x) ≥ 0 holds for all x, and so

0 ≤ ∫ {ϕ0(x) − ϕ(x)}{η f1(x) − f0(x)} dx
0 ≤ η {∫ ϕ0(x) f1(x) dx − ∫ ϕ(x) f1(x) dx} + ∫ ϕ(x) f0(x) dx − ∫ ϕ0(x) f0(x) dx
0 ≤ η {EH1[ϕ0(X)] − EH1[ϕ(X)]} + {EH0[ϕ(X)] − EH0[ϕ0(X)]}.

Now, by assumption that the new test ϕ has a level of at most α (which the LRT achieves), we have EH0[ϕ(X)] ≤ α = EH0[ϕ0(X)], i.e. the right-hand bracket is at most 0. Since η > 0, the left-hand bracket must therefore be non-negative, and so

EH1[ϕ0(X)] ≥ EH1[ϕ(X)].

But this says that, under H1, the likelihood ratio test is at least as likely to reject H0 as the new test, and hence is at least as powerful. So we are done. ∎

Note 1: In cases we have considered so far, we have taken a to be some specific value and b to be the MLE (e.g. in the coin tossing Example 2.1). In this case, the asymptotic distribution of the deviance (by Theorem 2) implies that

D(θ0*) = 2{ℓ(θ^) − ℓ(θ0*)} = −2 log{L(θ=θ0*)/L(θ=θ^)} ∼ χ²_1

under H0.

Note 2: If we are interested in a more general alternative hypothesis, e.g. H1: θ0 ≠ a, then we require Neyman-Pearson to hold for all θ ∈ Ω\{a} (with the same η). In this case, we call the likelihood ratio test uniformly most powerful (UMP). This property will usually hold for the examples we consider in this module, but it is beyond this course to check it. Even when the LRT is not UMP, it is still a ‘good’ test by virtue of it being most powerful for each simple alternative hypothesis.

The asymptotic distribution of the deviance suggests a means of constructing a confidence region for θ0 as

{θ : D(θ) < χ²_{1−α,1}},

where χ²_{1−α,1} is the 1−α quantile (critical value) of the χ²_1 distribution for confidence level 1−α; for example, for a 95% confidence interval, χ²_{0.95,1} = 3.84.

Example 2.5:  Coin tossing, ctd.

Back to the boring coin tossing example (with r=6, n=10). Again, we could test the hypothesis H0:θ0=0.5, this time using the likelihood ratio test.

We have

D(0.5) = 2{ℓ(0.6) − ℓ(0.5)} = 2{(−6.730) − (−6.931)} = 0.402,

which via the R command 1-pchisq(0.402,df=1) gives p=0.526, similar to that found with the Wald test.

Note: All hypothesis tests will be two-tailed unless otherwise specified. Because a χ²_1 distribution is the square of a standard normal distribution, we DO NOT double the p-value produced by R: squaring puts both tails in the same (positive) place, i.e. z² > 3.84 corresponds to {z < −1.96} ∪ {z > 1.96}.
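A short R sketch of Example 2.5, together with the corresponding deviance-based 95% confidence interval found by a simple grid search (object and function names are arbitrary):

r <- 6; n <- 10
loglik <- function(theta) r*log(theta) + (n - r)*log(1 - theta)
dev <- function(theta) 2*(loglik(r/n) - loglik(theta))   # deviance
1 - pchisq(dev(0.5), df=1)                               # p-value for H0: theta0 = 0.5, approx 0.53
theta.grid <- seq(0.001, 0.999, by=0.001)
range(theta.grid[dev(theta.grid) < 3.84])                # deviance-based interval, roughly (0.30, 0.85)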

In the example we have looked at, the Wald test and LRT give similar results. However, the LRT is to be preferred. The first reason is that the Wald test is carried out in θ-space, and hence if we transform into, say, ϕ-space where ϕ=g(θ), then the results will change. The LRT depends only on likelihoods, and the values of the likelihoods do not depend on how the model is parameterised.
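To illustrate this, consider testing H0: θ0 = 0.5 with the data of Example 2.4: the Wald statistic changes when we move from the θ scale to the ϕ scale, while the deviance does not. A rough R sketch (reusing the quantities from Example 2.4; names are arbitrary):

r <- 2; n <- 30
theta.hat <- r/n
phi.hat <- log(theta.hat/(1 - theta.hat))
(theta.hat - 0.5)/sqrt(r*(n - r)/n^3)      # Wald Z on the theta scale
(phi.hat - log(0.5/0.5))/sqrt(0.536)       # Wald Z on the phi scale: a different value
loglik <- function(theta) r*log(theta) + (n - r)*log(1 - theta)
2*(loglik(theta.hat) - loglik(0.5))        # deviance: unchanged by the reparameterisation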

In fact, this suggests something deeper. For the Wald test to ‘work well’, we have to find a transformation g such that

{g(θ^) − g(θ0*)} / se(g(θ^)) ∼ N(0,1).

The LRT is invariant to the transformation g, and this suggests that the LRT will work well provided a function g exists to satisfy the above, but we do not need to know what this function is.

Sketch proof.

Suppose such a g exists, let ϕ = g(θ), and consider a second-order Taylor expansion of ℓ(ϕ) about ϕ^. Recall from Section 2.2.2 this is given by

ℓ(ϕ) ≈ ℓ(ϕ^) − (1/2)(ϕ − ϕ^)² IO(ϕ^),

and so

D(ϕ) ≈ (ϕ − ϕ^)² IO(ϕ^).

The function g makes the likelihood regular by definition, so the approximation above is good. This means that the Wald confidence interval based on the asymptotic distribution of ϕ^ is identical to the deviance based confidence interval. ∎