1 Modelling and Statistical Inference

Maximum likelihood estimation and relative likelihood

By definition, L(θ) or ℓ(θ) gives a measure of how likely θ is to be the true parameter value for the population, since (up to proportionality) L(θ) is the probability of obtaining the observed data if θ had been the true parameter.

Consequently, values of θ with large (log)likelihood are more likely to be correct than values with low (log)likelihood.

This leads to the principle of maximum likelihood estimation, whereby we estimate θ by the maximum likelihood estimator (MLE), defined to be any value θ^ of θ that maximises L(θ) (or ℓ(θ)).

Example 1.2:  IID Poisson data.
Find θ^ for independent data x = (x_1, …, x_n) from a Poisson(θ) distribution.

For θ ∈ Θ with Θ = (0, ∞),

\[
f(x \mid \theta) = \frac{e^{-\theta}\theta^{x}}{x!}, \qquad x = 0, 1, 2, \dots.
\]

Then

\[
\begin{aligned}
L(\theta) &= \prod_{i=1}^{n} \frac{e^{-\theta}\theta^{x_i}}{x_i!} \;\propto\; \prod_{i=1}^{n} e^{-\theta}\theta^{x_i}, \\
\ell(\theta) &= \log\Big\{ \prod_{i=1}^{n} e^{-\theta}\theta^{x_i} \Big\} \;(+\,\mathrm{const}) \\
&= -n\theta + n\bar{x}\log(\theta), \\
\ell'(\theta) &= -n + \frac{n\bar{x}}{\theta}, \\
\ell'(\hat{\theta}) &= -n + \frac{n\bar{x}}{\hat{\theta}} = 0 \\
\hat{\theta} &= \bar{x}.
\end{aligned}
\]

We should also check that the second derivative is negative at θ^, so that we have indeed found a maximum. Differentiation is a common way to find the maximum (for a continuous parameter space), but it is not the only way: it will not always work, and we may have to resort to ‘brute force’ numerical approaches.
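
For instance, the following minimal sketch (not part of the original example; the simulated sample and the use of SciPy's minimize_scalar are illustrative assumptions) maximises the Poisson log-likelihood numerically and confirms that the maximiser agrees with x¯.

```python
# Numerical ("brute force") check of Example 1.2: maximise the Poisson
# log-likelihood over theta and compare the maximiser with the sample mean.
# The simulated data below are purely illustrative.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(1)
x = rng.poisson(lam=3.0, size=50)   # hypothetical Poisson(3) sample

def neg_loglik(theta):
    # -l(theta) = -sum_i [ -theta + x_i*log(theta) - log(x_i!) ]
    return -np.sum(-theta + x * np.log(theta) - gammaln(x + 1))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 20.0), method="bounded")
print(res.x, x.mean())              # the two values should agree closely
```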

We define the first derivative of the log-likelihood as the score function:

\[
S(\theta) = \ell'(\theta) = \frac{\partial}{\partial\theta}\,\log L(\theta),
\]

so the MLE θ^ is the solution of the score equation S(θ)=0.
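
As a sketch (assuming, for illustration, the same kind of simulated Poisson sample as in the previous sketch), the score equation can also be solved numerically with a root finder:

```python
# Solve the score equation S(theta) = 0 for Poisson data and compare the
# root with the closed-form MLE x-bar. Data are simulated for illustration.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
x = rng.poisson(lam=3.0, size=50)

def score(theta):
    # S(theta) = l'(theta) = -n + n * xbar / theta for a Poisson(theta) sample
    return -x.size + x.size * x.mean() / theta

theta_hat = brentq(score, 1e-6, 20.0)   # S changes sign on this interval
print(theta_hat, x.mean())
```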

The likelihood function can also be used to interpret the plausibility of values of θ other than the MLE. Define the relative likelihood of θ as

\[
RL(\theta) = \frac{L(\theta)}{L(\hat{\theta})}.
\]

Example 1.3:  Coin tossing.
Suppose a coin is tossed n=10 times and r=6 heads are observed. Then if θ represents the probability of a head on a single coin toss, the number of heads is Binomial(10,θ), and the likelihood function is

\[
\begin{aligned}
L(\theta) &= \theta^{r}(1-\theta)^{n-r}, \\
\ell(\theta) &= r\log\theta + (n-r)\log(1-\theta), \\
\ell'(\theta) &= \frac{r}{\theta} - \frac{n-r}{1-\theta},
\end{aligned}
\]

and this gives

\[
\hat{\theta} = \frac{r}{n} = 0.6.
\]

Now we can consider the relative likelihood of θ=0.5 (i.e. the coin is fair) as

\[
RL(0.5) = \frac{L(\theta = 0.5)}{L(\theta = 0.6)} = \frac{0.5^{6} \times 0.5^{4}}{0.6^{6} \times 0.4^{4}} = \frac{9.77 \times 10^{-4}}{11.94 \times 10^{-4}} = 0.82.
\]
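
These numbers are easy to reproduce; here is a minimal sketch (the binomial coefficient is omitted, since it cancels in the ratio):

```python
# Relative likelihood of a fair coin (theta = 0.5) in Example 1.3.
n, r = 10, 6

def lik(theta):
    # likelihood up to the binomial constant, which cancels in RL(theta)
    return theta**r * (1 - theta)**(n - r)

theta_hat = r / n
print(lik(0.5), lik(theta_hat))        # ~9.77e-4 and ~11.94e-4
print(lik(0.5) / lik(theta_hat))       # relative likelihood ~0.82
```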

One could subjectively interpret the relative likelihood; however, we will develop theoretical tools to compare values of θ with different likelihoods later.

For this development, it is useful to introduce the deviance, a measure closely related to the relative likelihood:

\[
D(\theta) = 2\{\ell(\hat{\theta}) - \ell(\theta)\}.
\]

Exercise: what is the relationship between the relative likelihood and the deviance?
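
As a small numerical illustration (continuing the coin-tossing example above), the deviance of θ = 0.5 can be computed directly from the log-likelihood; comparing it with RL(0.5) may help with the exercise.

```python
# Deviance of theta = 0.5 relative to the MLE 0.6 in the coin-tossing example.
import math

n, r = 10, 6

def loglik(theta):
    return r * math.log(theta) + (n - r) * math.log(1 - theta)

D = 2 * (loglik(r / n) - loglik(0.5))
print(D)                    # compare with RL(0.5) = 0.82 from Example 1.3
```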

Observed and expected information

There are two ways to measure the information about θ contained in the sample: the observed information IO(θ) and the expected information IE(θ). The observed information is based on the observed data

\[
I_O(\theta) = -\ell''(\theta; x) = -\ell''(\theta),
\]

and the expected information, sometimes called Fisher information, is averaged over all possible data sets (expectation over X)

\[
I_E(\theta) = E\left\{ -\frac{\partial^{2}}{\partial\theta^{2}}\,\ell(\theta \mid X) \right\}.
\]

These quantities also measure the curvature of the log-likelihood function. Of particular interest is the curvature at the MLE, IO(θ^): high information (large curvature) at the MLE corresponds to a tight peak, and indicates less uncertainty about θ0.

Example 1.4:  IID Poisson data, ctd.
Find the observed and expected information for independent data x_1, …, x_n from a Poisson(θ) distribution. How do they change with n?

Recall ℓ(θ) = -nθ + nx¯ log(θ). So

\[
\begin{aligned}
\ell'(\theta) &= -n + \frac{n\bar{x}}{\theta}, \\
\ell''(\theta) &= -\frac{n\bar{x}}{\theta^{2}}.
\end{aligned}
\]

Thus IO(θ) = nx¯/θ².

Now, since

\[
E(\bar{X}) = E\Big( \sum_{i=1}^{n} X_i / n \Big) = \sum_{i=1}^{n} \theta_0 / n = \theta_0,
\]

we have

\[
\begin{aligned}
\ell''(\theta_0; X) &= -\frac{n\bar{X}}{\theta_0^{2}}, \\
I_E(\theta_0) &= \frac{n\,E(\bar{X})}{\theta_0^{2}} = \frac{n\theta_0}{\theta_0^{2}} = \frac{n}{\theta_0}.
\end{aligned}
\]

Note that both IO(θ) and IE(θ) are proportional to n, so the information grows linearly with the sample size.
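
As a short sketch of this (the simulated samples are an illustrative assumption), the observed information nx¯/θ² and the expected information n/θ can be evaluated at the MLE for increasing n:

```python
# Observed and expected information for simulated Poisson data (Example 1.4).
import numpy as np

rng = np.random.default_rng(1)
theta0 = 3.0
for n in (20, 200, 2000):
    x = rng.poisson(lam=theta0, size=n)
    theta_hat = x.mean()
    I_obs = n * x.mean() / theta_hat**2   # n*xbar/theta^2 evaluated at theta_hat
    I_exp = n / theta_hat                 # n/theta evaluated at theta_hat
    print(n, I_obs, I_exp)                # equal here (theta_hat = xbar); grows with n
```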

The pair (θ^, IO(θ^)) provides a useful two-dimensional summary of any (log-)likelihood function. Indeed, consider approximating ℓ(θ) by a second-order Taylor expansion about θ^:

\[
\begin{aligned}
\ell(\theta) &\approx \ell(\hat{\theta}) + (\theta - \hat{\theta})\,\ell'(\hat{\theta}) + \tfrac{1}{2}(\theta - \hat{\theta})^{2}\,\ell''(\hat{\theta}) \\
\ell(\theta) &\approx \ell(\hat{\theta}) + \tfrac{1}{2}(\theta - \hat{\theta})^{2}\,\ell''(\hat{\theta}),
\end{aligned}
\]

since ℓ′(θ^) = 0 at the MLE.

This can be written as

\[
\log\{RL(\theta)\} \approx -\tfrac{1}{2}(\theta - \hat{\theta})^{2}\, I_O(\hat{\theta}),
\]

so this is a quadratic approximation for the log relative likelihood depending only on our pair (θ^,IO(θ^)). If this approximation is accurate around the MLE, we call the likelihood function regular.
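
A sketch of this quadratic approximation for the Poisson example (the simulated data are an assumption): the exact log relative likelihood and -½(θ-θ^)²IO(θ^) are compared at a few values of θ near the MLE.

```python
# Compare the exact log relative likelihood with its quadratic approximation.
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(1)
x = rng.poisson(lam=3.0, size=50)
n, theta_hat = x.size, x.mean()
I_O_hat = n * x.mean() / theta_hat**2           # observed information at the MLE

def loglik(theta):
    return np.sum(-theta + x * np.log(theta) - gammaln(x + 1))

for theta in (theta_hat - 0.5, theta_hat - 0.1, theta_hat + 0.1, theta_hat + 0.5):
    exact = loglik(theta) - loglik(theta_hat)   # log RL(theta)
    quad = -0.5 * (theta - theta_hat)**2 * I_O_hat
    print(round(theta, 3), round(exact, 4), round(quad, 4))
```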

Confidence Regions

As frequentists, however, we must consider that the data, x, which we have used to construct our MLE θ^(x), is a realisation of the random variables X. Another draw from these random variables would result in different data, x′, and hence a different MLE, θ^(x′).

Therefore, θ^(X) is itself a random variable, and must have a distribution: we call this the sampling distribution of θ^.

The fact that the MLE has a distribution makes it seem somewhat foolish simply to quote the MLE as a point estimate of the true value of the parameter θ0. This motivates the idea of a confidence region for θ0, which is defined as any rule for constructing a region C with the following property:

\[
P\big[\, C(X) \ni \theta_0 \,\big] = 1 - \alpha,
\]

where 1-α is referred to as the confidence level, and is specified in advance.

Where confidence regions are one-dimensional and of the form C(x) = (θ_l(x), θ_u(x)), they are usually called confidence intervals.

This is a precise and important definition. Any specific confidence region C(x) we produce depends upon the observed data x. It either contains, or does not contain, the true parameter θ0.

We cannot make a probability statement about this specific confidence region in isolation, because no repeated sampling is involved once the data are fixed. However, we can consider the collection of confidence regions that would be produced across all the different samples x we could draw from the random variables X.

The probability that these regions cover the true value θ0 is 1-α, where probability is understood as a limiting proportion over repeated samples.
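
This repeated-sampling interpretation can be illustrated by simulation. The sketch below rests on assumptions made purely for illustration: a Poisson model with true value θ0 = 3, and an approximate 95% interval rule of the form θ^ ± 1.96/√(IE(θ^)), anticipating the asymptotic results discussed next.

```python
# Monte Carlo illustration of coverage: the proportion of intervals covering
# theta0 should be close to the nominal confidence level 1 - alpha = 0.95.
import numpy as np

rng = np.random.default_rng(1)
theta0, n, z = 3.0, 50, 1.96          # z ~ upper 2.5% point of N(0, 1)
n_rep, covered = 10_000, 0

for _ in range(n_rep):
    x = rng.poisson(lam=theta0, size=n)
    theta_hat = x.mean()
    se = np.sqrt(theta_hat / n)       # 1/sqrt(I_E(theta_hat)) for Poisson data
    covered += (theta_hat - z * se <= theta0 <= theta_hat + z * se)

print(covered / n_rep)                # approximately 0.95
```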

It is rarely possible to construct confidence regions exactly; hence, it is usual to rely on asymptotic results (i.e. results as n → ∞) and then assume that these results hold approximately for our observed n (see e.g. Example 2.2).

The Cramér-Rao bound and MVUEs

Before framing our discussion in more formal asymptotic theory, we first recall a result which gives us a bound on how good an inference method can be.

Theorem (CRLB): Cramér-Rao bound for variance of estimators.
Let T(𝐗) be an estimator with expected value E[T(𝐗)]=g(θ), where g(θ) is a differentiable function. Then

\[
\mathrm{Var}[T(\mathbf{X})] \;\geq\; [g'(\theta_0)]^{2}\,[I_E(\theta_0)]^{-1}.
\]

In particular, if T(𝐗) is an unbiased estimator of θ, we have that g(θ) = θ and g′(θ) = 1. Thus,

\[
\mathrm{Var}[T(\mathbf{X})] \;\geq\; [I_E(\theta_0)]^{-1}.
\]

This states that an estimator attains minimum variance if it achieves the bound on the right-hand side of the corresponding expression above.

In particular, if an unbiased estimator has variance equal to the reciprocal of the Fisher information of the model, it is a minimum variance unbiased estimator (MVUE).

Example 1.5:  The sample mean estimator of the Poisson parameter.
Let x_1, …, x_n be independent data from a Poisson(θ) distribution. Is X¯ an MVUE for the Poisson parameter θ?

  • Firstly, we note that E(X¯) = E(∑_{i=1}^{n} X_i / n) = ∑_{i=1}^{n} θ/n = θ; thus X¯ is unbiased as an estimator of θ.

  • The variance of X¯ is

    \[
    \mathrm{Var}(\bar{X}) = \mathrm{Var}\Big( \sum_{i=1}^{n} X_i / n \Big) = \sum_{i=1}^{n} \frac{1}{n^{2}}\,\mathrm{Var}(X_i) = \frac{n\theta}{n^{2}} = \frac{\theta}{n}.
    \]
  • Lastly, recall from Example 1.4 that IE(θ) = n/θ.

Since the variance of X¯ equals 1/IE(θ), it achieves the Cramér-Rao lower bound and is therefore, by the CRLB Theorem, a minimum variance estimator.

From above, it is also unbiased, so it is an MVUE.
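
As a final sketch (the simulation settings are illustrative assumptions), the variance of X¯ over repeated Poisson samples can be compared with the Cramér-Rao lower bound θ/n = 1/IE(θ).

```python
# Empirical variance of X-bar versus the Cramer-Rao lower bound theta/n.
import numpy as np

rng = np.random.default_rng(1)
theta, n, n_rep = 3.0, 50, 20_000

xbars = rng.poisson(lam=theta, size=(n_rep, n)).mean(axis=1)
print(xbars.var())    # empirical Var(X-bar) over repeated samples
print(theta / n)      # CRLB: 1/I_E(theta) = theta/n = 0.06
```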