2 Hypothesis Tests and Confidence Intervals

Wald Statistic and Wald Confidence Interval

The Wald statistic and Wald confidence interval are based on the asymptotic distribution of the MLE, as given in Theorem 1 and Note 3.

The statistic can be used to test whether our true parameter θ0 has a specific value θ0*, i.e.

H0 : θ0=θ0*
H1 : θ0 ≠ θ0*,

using the fact that, under H0,

Z = (θ^ − θ0*) / {IO(θ^)}^{−1/2} ∼ N(0,1).

This allows us to calculate a p-value for H0 using the Z statistic (see MATH235).

Example 2.1:  Coin tossing, ctd.

Recall: a coin is tossed n=10 times and r=6 heads are observed, and we had

ℓ′(θ) = r/θ − (n−r)/(1−θ)

and

θ^ = r/n = 0.6.

Suppose we are interested in testing the hypothesis that the coin is fair:

H0:θ0=0.5.

Now,

ℓ′′(θ) = −r/θ² − (n−r)/(1−θ)²,

so

IO(θ^) = r/(r/n)² + (n−r)/((n−r)/n)² = n(1/θ^ + 1/(1−θ^)) = 41.67,

giving a Z statistic of

z = (0.6 − 0.5) / 41.67^{−1/2} = 0.645.

Then, using R for example, we obtain a p-value of

2*(1 - pnorm(0.645)) = 0.518

suggesting we should not reject H0: there is no evidence that the coin is biased. This matches our intuition, since θ=0.5 has a high relative likelihood.
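For completeness, here is a short R sketch reproducing the whole calculation (the object names are arbitrary, not part of the example):

r <- 6; n <- 10
theta.hat <- r/n                      # MLE
info <- n^3/(r*(n - r))               # observed information at the MLE, = 41.67
z <- (theta.hat - 0.5)*sqrt(info)     # Wald Z statistic, = 0.645
2*(1 - pnorm(z))                      # two-sided p-value, approx 0.52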

Similarly, we can construct Wald confidence intervals for θ0 as

θ^ ± z_{1−α/2} {IO(θ^)}^{−1/2},

the most common choice being a 95% confidence interval with α = 0.05 and hence z_{1−α/2} = 1.96.

Example 2.2:  Coin tossing, ctd. A 95% confidence interval for the coin example is thus

0.6 ± 1.96 × 41.67^{−1/2} = (0.296, 0.904).

Recalling the equivalence between hypothesis testing and confidence intervals (MATH235), it should come as no surprise that 0.5 falls within the confidence interval.

Note that in this binomial proportion case, the form of the confidence interval is

θ^ ± z_{1−α/2} {IO(θ^)}^{−1/2} = θ^ ± z_{1−α/2} √(θ^(1−θ^)/n),

which is precisely the “usual” form of the confidence interval for the binomial proportion taught in introductory statistics (derived from the normal/CLT approximation to the binomial distribution).
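As a quick numerical check (a minimal R sketch; object names are arbitrary), both expressions above give the interval found in Example 2.2:

r <- 6; n <- 10
theta.hat <- r/n
se.info <- sqrt(r*(n - r)/n^3)                 # {IO(theta.hat)}^(-1/2)
se.usual <- sqrt(theta.hat*(1 - theta.hat)/n)  # "usual" binomial standard error
theta.hat + c(-1, 1)*1.96*se.info              # (0.296, 0.904)
theta.hat + c(-1, 1)*1.96*se.usual             # identical interval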

Confidence: approximate or not?

It is important to remember that confidence intervals based on asymptotic distributions (e.g. of the MLE) are in most cases approximate and will only hold for large n.

For example, in the case of the binomial proportion, “exact” confidence intervals can be obtained using quantiles of the Beta distribution (the so-called Clopper-Pearson interval).

Exercise 1: Use your research skills (Google and e.g. R) to find the “exact” interval corresponding to the data in Example 2.2 above.

The next example (and following questions) examines the potential differences and tests your understanding further.

Example 2.3:  Normal variance, mean known.
Suppose the sample x1, …, xn comes from X ∼ N(0, θ), with mean μ = 0 known. Give a general formula for the MLE θ^ and a corresponding 95% confidence interval of θ. Compute this confidence interval for data with n = 9 and Σᵢxᵢ² = 9.

The Normal(0, θ) density is given by f(xᵢ|θ) = (1/√(2πθ)) exp{−xᵢ²/(2θ)}, leading to the likelihood

L(θ) ∝ (1/θ^{n/2}) exp{−Σᵢxᵢ²/(2θ)}.

The log-likelihood and score functions are

ℓ(θ) = −(n/2) log θ − Σᵢxᵢ²/(2θ)
S(θ) = ℓ′(θ) = −n/(2θ) + Σᵢxᵢ²/(2θ²).

Solving S(θ)=0 gives an MLE of θ^ = Σᵢxᵢ²/n. The observed information is

IO(θ^) = n³ / {2(Σᵢxᵢ²)²}.

Therefore a 95% confidence interval based on the MLE is given by

(l, u) = (θ^ − 1.96/√{IO(θ^)}, θ^ + 1.96/√{IO(θ^)})
       = (Σᵢxᵢ²/n − 1.96·√2·Σᵢxᵢ²/n^{3/2}, Σᵢxᵢ²/n + 1.96·√2·Σᵢxᵢ²/n^{3/2})
       = (0.08, 1.92)

for the given data.

On the other hand, we also know (e.g. from MATH230) that nθ^(X)/θ = ΣᵢXᵢ²/θ ∼ χ²_n (a χ²_n random variable is a sum of n squared standard normal random variables). Hence

1 − α = P(χ²_{α/2,n} < nθ^/θ < χ²_{1−α/2,n})
      = P(χ²_{α/2,n}/(nθ^) < 1/θ < χ²_{1−α/2,n}/(nθ^))
      = P(nθ^/χ²_{1−α/2,n} < θ < nθ^/χ²_{α/2,n}),

so the corresponding exact 95% confidence interval is (nθ^/χ²_{1−α/2,n}, nθ^/χ²_{α/2,n}) = (0.47, 3.33).

Here χ²_{α/2,n} is the α/2 quantile of a χ²_n distribution.
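Both intervals in Example 2.3 can be reproduced numerically; a rough R sketch under the stated data (n = 9, Σᵢxᵢ² = 9), with arbitrary object names:

n <- 9; ss <- 9                             # ss = sum of the x_i^2
theta.hat <- ss/n                           # MLE, = 1
info <- n^3/(2*ss^2)                        # observed information at the MLE
theta.hat + c(-1, 1)*1.96/sqrt(info)        # Wald interval, approx (0.08, 1.92)
n*theta.hat/qchisq(c(0.975, 0.025), df=n)   # exact chi-squared interval, approx (0.47, 3.33)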

Exercise 2: Does the normal variance maximum likelihood estimator θ^(X) achieve the Cramér-Rao bound? How about the estimator θ~(X) = ΣᵢXᵢ²/(n−1)? Are these estimators unbiased?

Exercise 3: Suppose now we wanted to perform inference on the population mean instead, i.e. the sample x1, …, xn comes from X ∼ N(μ, 1), with σ² = 1 known. Is the MLE μ^(X) = ΣᵢXᵢ/n unbiased? Is it a minimum variance unbiased estimator (MVUE)?

Exercise 4: Would confidence intervals based on the MLE in Exercise 3 be approximate or exact?

The main drawback to the Wald procedures is that they rely on the asymptotic normality of the MLE. In finite samples the distribution may be far from normal; the likelihood may not be symmetric (or more specifically, regular, as defined earlier) about the MLE.

One solution in this case is to look for a transformation g on the parameter space that makes the likelihood more regular about the MLE, and hence improves the normality approximation.

Example 2.4:  Biased coin tossing. We consider a coin tossing example as before, but this time take r=2 and n=30. (A similar example, with a roulette wheel motivation, was given in MATH235).

Plugging the numbers into the general form for the binomial Wald confidence interval calculated before gives

  1. θ^=2/30

  2. IO(θ^)=482.143

  3. 95% CI=(-0.023,0.156)

Clearly this is invalid as the confidence interval includes negative values, which are outside of the parameter space.
Consider the log-odds transformation

ϕ = g(θ) = log(θ/(1−θ)),

which has inverse

θ = g⁻¹(ϕ) = 1/(1 + e^{−ϕ}).

This is a useful transformation for probabilities as g: (0,1) → (−∞, ∞) (it will be revisited in MATH333 Statistical Models).

The normal approximation to the distribution of ϕ^ is likely to be much better than that to θ^, so we calculate the Wald confidence interval in ϕ-space and then transform back to θ-space.

The invariance property of maximum likelihood estimation tells us that

ϕ^=g(θ^)=-2.639.

From Theorem 3,

Var[ϕ^] = {g′(θ^)}² Var(θ^),

and

g′(θ) = 1/θ + 1/(1−θ),

so {g′(θ^)}² = 258.29 and Var[ϕ^] = 0.536.

Therefore the 95% confidence interval for ϕ0 is

−2.639 ± 1.96·√0.536 = (−4.074, −1.204).

Transforming this back into a confidence interval for θ using the inverse transformation g⁻¹ above gives a new confidence interval for θ0 of

(1/(1 + e^{4.074}), 1/(1 + e^{1.204})) = (0.017, 0.231),

which is an improvement at least in that the confidence interval is confined to the interval (0,1)!
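The calculations in Example 2.4 can be verified with a few lines of R (a minimal sketch; object names are arbitrary):

r <- 2; n <- 30
theta.hat <- r/n
info <- n^3/(r*(n - r))                            # observed information, = 482.14
theta.hat + c(-1, 1)*1.96/sqrt(info)               # naive Wald interval, includes negative values
phi.hat <- log(theta.hat/(1 - theta.hat))          # MLE on the log-odds scale, = -2.639
var.phi <- (1/(theta.hat*(1 - theta.hat)))^2/info  # delta-method variance, = 0.536
ci.phi <- phi.hat + c(-1, 1)*1.96*sqrt(var.phi)    # interval for phi, (-4.074, -1.204)
1/(1 + exp(-ci.phi))                               # back-transformed interval, approx (0.017, 0.231)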

Generally, finding an appropriate function g (let alone the best function g) is difficult or impossible, so this does not overcome entirely the limitations of Wald procedures. We turn instead to procedures based on likelihood ratios or deviance.

An alternative means by which to proceed is to look instead at the asymptotic distribution of the deviance at θ0, and use this to devise testing procedures and construct confidence intervals.

The testing procedure that relies on the deviance is called the likelihood ratio test. This is the most important hypothesis testing procedure you will learn in undergraduate statistics.

Suppose we are carrying out a hypothesis test of a null hypothesis H0:θ0=a against a very simple alternative H1:θ0=b.

Note this is different from (and simpler than) most alternative hypotheses we have considered so far, in that it posits a single value rather than a range of values.

The likelihood ratio test testing H0 versus H1 has a rejection region of the form

L(θ=a)/L(θ=b) ≤ η,

where the constant η is chosen to achieve level α. (Recall that the level of the test, α, is the probability of rejecting the null hypothesis when the null hypothesis is true).

Intuitively: we reject H0 in favour of H1 if the likelihood of θ=b is much larger than the likelihood of θ=a, given the data.

It may seem concerning that, up to now, in order to carry out a hypothesis test one can dream up all kinds of procedures (e.g. the Wald test or a likelihood ratio test (LRT)). We now move on to the notion of an optimal hypothesis test, and an important result establishing how this optimality can be achieved.

Neyman-Pearson Lemma

Remarkably, we can show that, in a sense to be defined, the LRT is the optimal test, so whenever it is possible to carry it out, it is to be preferred over any other test.

To judge this optimality, we introduce an important property of hypothesis tests.

The power of the test, usually denoted 1-β, is the probability of rejecting the null hypothesis when the alternative hypothesis is true.

Therefore, we would like to find tests with small α but large power. Usually, in hypothesis testing, we fix α in advance, then ‘hope’ we have good power.

Theorem 5 Neyman-Pearson Lemma.
Consider the hypotheses

H0 : θ0=a
H1 : θ0=b.

Then the likelihood ratio test is the most powerful of all tests of level no more than α.

Proof.

(Neyman-Pearson)
Set-up
First, L(θ) is proportional to the joint density of the data, i.e. L(θ; X) = c f(X; θ) for some constant c. Therefore, letting f0 denote the joint density under H0 and f1 the joint density under H1,

L(θ=a)/L(θ=b) = c f0(X) / {c f1(X)} = f0(X)/f1(X),

and we can re-write the LRT as rejecting H0 when

f0(X) ≤ η f1(X)

i.e.

η f1(X) − f0(X) ≥ 0.

Now let

ϕ0(x) = { 1 when the N-P test rejects H0
          0 when the N-P test does not reject H0.

Consider a new test, ϕ, defined similarly as

ϕ(x) = { 1 when the new test rejects H0
         0 when the new test does not reject H0.

Finally for the set-up, let

U(x)={ϕ0(x)-ϕ(x)}{ηf1(x)-f0(x)}.

1. Showing U(x) ≥ 0
Consider the cases.

  • If {η f1(x) − f0(x)} > 0 then ϕ0(x) = 1 and so U(x) ≥ 0.

  • If {η f1(x) − f0(x)} < 0 then ϕ0(x) = 0 and so U(x) ≥ 0.

  • If {ηf1(x)-f0(x)}=0 then clearly U(x)=0.

2. Integrating and showing most powerful
The result U(x) ≥ 0 holds for all x, and so

0 ≤ ∫ {ϕ0(x) − ϕ(x)}{η f1(x) − f0(x)} dx
0 ≤ η {∫ ϕ0(x) f1(x) dx − ∫ ϕ(x) f1(x) dx} + ∫ ϕ(x) f0(x) dx − ∫ ϕ0(x) f0(x) dx
0 ≤ η {EH1[ϕ0(X)] − EH1[ϕ(X)]} + {EH0[ϕ(X)] − EH0[ϕ0(X)]}.

Now, by assumption that the new test ϕ has a level of at most α (which the LRT achieves), we have EH0[ϕ(X)] ≤ α = EH0[ϕ0(X)], i.e. the right-hand bracket is at most 0. Since η > 0, the left-hand bracket must therefore be non-negative, and so

EH1[ϕ0(X)] ≥ EH1[ϕ(X)].

But this says that, under H1, the likelihood ratio test is at least as likely to reject H0 as the new test, and hence is at least as powerful. So we are done. ∎

Note 1: In cases we have considered so far, we have taken a to be some specific value and b to be the MLE (e.g. in the coin tossing Example 2.1). In this case, the asymptotic distribution of the deviance (by Theorem 2) implies that

D(θ0*) = 2{ℓ(θ^) − ℓ(θ0*)} = −2 log{L(θ=θ0*)/L(θ=θ^)} ∼ χ²_1

under H0.

Note 2: If we are interested in a more general alternative hypothesis, e.g. H1: θ0 ≠ a, then we require Neyman-Pearson to hold for all θ ∈ Ω\{a} (with the same η). In this case, we call the likelihood ratio test uniformly most powerful (UMP). This property will usually hold for the examples we consider in this module, but it is beyond this course to check it. Even when the LRT is not UMP, it is still a ‘good’ test by virtue of it being most powerful for each simple alternative hypothesis.

The asymptotic distribution of the deviance suggests a means of constructing a confidence region for θ0 as

{θ : D(θ) < χ²_{1−α,1}},

where χ²_{1−α,1} is the 1−α quantile (critical value) of the χ²_1 distribution for confidence level 1−α; for example, for a 95% confidence interval, χ²_{0.95,1} = 3.84.

Example 2.5:  Coin tossing, ctd.

Back to the boring coin tossing example (with r=6, n=10). Again, we could test the hypothesis H0:θ0=0.5, this time using the likelihood ratio test.

We have

D(0.5) = 2{ℓ(0.6) − ℓ(0.5)} = 2{(−6.730) − (−6.931)} = 0.402,

which via the R command 1-pchisq(0.402,df=1) gives p=0.526, similar to that found with the Wald test.

Note: All hypothesis tests will be two-tailed unless otherwise specified. Because a χ²_1 distribution is the square of a standard normal distribution, we DO NOT double the p-value produced by R: squaring puts both tails in the same (positive) place, i.e. z² > 3.84 corresponds to {z < −1.96} ∪ {z > 1.96}.
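A short R sketch of Example 2.5, together with the corresponding deviance-based 95% confidence interval found by a simple grid search (object and function names are arbitrary):

r <- 6; n <- 10
loglik <- function(theta) r*log(theta) + (n - r)*log(1 - theta)
dev <- function(theta) 2*(loglik(r/n) - loglik(theta))   # deviance
1 - pchisq(dev(0.5), df=1)                               # p-value for H0: theta0 = 0.5, approx 0.53
theta.grid <- seq(0.001, 0.999, by=0.001)
range(theta.grid[dev(theta.grid) < 3.84])                # deviance-based interval, roughly (0.30, 0.85)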

In the example we have looked at, the Wald test and LRT give similar results. However, the LRT is to be preferred. The first reason is that the Wald test is carried out in θ-space, and hence if we transform into, say, ϕ-space where ϕ=g(θ), then the results will change. The LRT depends only on likelihoods, and the values of the likelihoods do not depend on how the model is parameterised.
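To illustrate this, consider testing H0: θ0 = 0.5 with the data of Example 2.4: the Wald statistic changes when we move from the θ scale to the ϕ scale, while the deviance does not. A rough R sketch (reusing the quantities from Example 2.4; names are arbitrary):

r <- 2; n <- 30
theta.hat <- r/n
phi.hat <- log(theta.hat/(1 - theta.hat))
(theta.hat - 0.5)/sqrt(r*(n - r)/n^3)      # Wald Z on the theta scale
(phi.hat - log(0.5/0.5))/sqrt(0.536)       # Wald Z on the phi scale: a different value
loglik <- function(theta) r*log(theta) + (n - r)*log(1 - theta)
2*(loglik(theta.hat) - loglik(0.5))        # deviance: unchanged by the reparameterisation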

In fact, this suggests something deeper. For the Wald test to ‘work well’, we have to find a transformation g such that

{g(θ^) − g(θ0*)} / se(g(θ^)) ∼ N(0,1).

The LRT is invariant to the transformation g, and this suggests that the LRT will work well provided a function g exists to satisfy the above, but we do not need to know what this function is.

Sketch proof.

Suppose such a g exists, let ϕ = g(θ), and consider a second-order Taylor expansion of ℓ(ϕ) about ϕ^. Recall from Section 2.2.2 this is given by

ℓ(ϕ) ≈ ℓ(ϕ^) − (1/2)(ϕ − ϕ^)² IO(ϕ^),

and so

D(ϕ) ≈ (ϕ − ϕ^)² IO(ϕ^).

The function g makes the likelihood regular by definition, so the approximation above is good. This means that the Wald confidence interval based on the asymptotic distribution of ϕ^ is identical to the deviance based confidence interval. ∎