13 Information and Sufficiency

13.3 Sufficiency

Recall the driving test data from Example 13.1.

Number of failed attempts:   0     1     2    3+
Observed frequency:        147    47    20     5

Table 13.2: Number of times taken for drivers to pass the driving test.

We chose to model these data as being geometrically distributed. Assuming that the people in the ‘3 or more’ column failed exactly three times, the log-likelihood for general data $x_1, \dots, x_n$ is

\begin{align*}
\ell(\theta) &= \sum_{i=1}^n \log\{\theta(1-\theta)^{x_i}\} \\
&= \sum_{i=1}^n \{\log(\theta) + x_i \log(1-\theta)\} \\
&= n\log(\theta) + \log(1-\theta)\sum_{i=1}^n x_i.
\end{align*}
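As a quick numerical sanity check (a sketch, not part of the original notes), the log-likelihood can be evaluated directly from the frequencies in Table 13.2, treating the final column as exactly three failures:

```python
import math

# Observed frequencies from Table 13.2 (final column treated as exactly 3)
failures = [0, 1, 2, 3]
freqs = [147, 47, 20, 5]

n = sum(freqs)                                        # 219 participants
sum_x = sum(x * f for x, f in zip(failures, freqs))   # 102 failed attempts in total

def loglik(theta):
    """Geometric log-likelihood: sum over people of log{theta * (1 - theta)^x_i}."""
    return sum(f * (math.log(theta) + x * math.log(1 - theta))
               for x, f in zip(failures, freqs))

# The compact form n*log(theta) + log(1-theta)*sum_x agrees term by term
for theta in (0.3, 0.5, 0.7):
    compact = n * math.log(theta) + math.log(1 - theta) * sum_x
    assert abs(loglik(theta) - compact) < 1e-9
```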

Now, suppose that, rather than being presented with the table of passing attempts, you were simply told that, of the 219 people who filled in the survey, $\sum_{i=1}^{219} x_i = 102$.

Would it still be possible to proceed with fitting the model?

The answer is yes; moreover, we can proceed in exactly the same way and achieve the same results! This is because, if you look at the log-likelihood, the only way in which the data enter is through $\sum_{i=1}^n x_i$, meaning that in some sense, this is all we need to know.

This is clearly a big advantage: we need only remember one number rather than an entire table.

We call $\sum_{i=1}^n x_i$ a sufficient statistic for $\theta$.
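To illustrate, here is a short sketch (not from the notes; the closed form $\hat\theta = n/(n + \sum_i x_i)$ follows from setting the derivative of the log-likelihood above to zero):

```python
# Sketch: the geometric MLE depends on the data only through n and sum(x).
# Differentiating n*log(theta) + log(1-theta)*sum_x and solving for zero
# gives theta_hat = n / (n + sum_x) -- a function of the sufficient statistic.
n, sum_x = 219, 102          # all we were told about the survey
theta_hat = n / (n + sum_x)
print(round(theta_hat, 4))   # -> 0.6822; any 219 observations summing to 102 agree
```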

Definition.

Let $\mathbf{x} = x_1, \dots, x_n$ be a sample from $f(\cdot \mid \theta)$. Then a function of the data $T(\mathbf{x})$ is said to be a sufficient statistic for $\theta$ (or sufficient for $\theta$) if $\mathbf{x}$ is independent of $\theta$ given $T(\mathbf{x})$, i.e.

\[
\Pr[\mathbf{X} = \mathbf{x} \mid T(\mathbf{x}), \theta] = \Pr[\mathbf{X} = \mathbf{x} \mid T(\mathbf{x})].
\]

Some consequences of this definition:

  1. For the objective of learning about $\theta$, if I am told $T(\mathbf{x})$, there is no value in being told anything else about $\mathbf{x}$.

  2. If I have two datasets $\mathbf{x}_1$ and $\mathbf{x}_2$, and $T(\mathbf{x}_1) = T(\mathbf{x}_2)$, then I should draw the same conclusions about $\theta$ from both, even if $\mathbf{x}_1 \neq \mathbf{x}_2$.

  3. Sufficient statistics always exist, since trivially $T(\mathbf{x}) = \mathbf{x}$ always satisfies the above definition.
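Consequence 2 can be checked numerically; the mini-datasets below are hypothetical, chosen only to share the same value of $T(\mathbf{x}) = \sum_i x_i$:

```python
import math

def geom_loglik(data, theta):
    """Geometric log-likelihood for a list of failure counts."""
    return sum(math.log(theta) + x * math.log(1 - theta) for x in data)

x1 = [0, 0, 1, 3]   # hypothetical data, sum = 4
x2 = [1, 1, 1, 1]   # different data, same sum = 4

# Same sufficient statistic => identical log-likelihood curves,
# hence identical conclusions about theta.
for theta in (0.2, 0.5, 0.8):
    assert abs(geom_loglik(x1, theta) - geom_loglik(x2, theta)) < 1e-12
```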

Definition.

Let $\mathbf{x} = x_1, \dots, x_n$ be a sample from $f(\cdot \mid \theta)$. Let $T(\mathbf{x})$ be sufficient for $\theta$. Then $T(\mathbf{x})$ is said to be minimally sufficient for $\theta$ if there is no sufficient statistic with a lower dimension than $T$.

Theorem (Neyman factorisation theorem).

Let $\mathbf{x} = x_1, \dots, x_n$ be a sample from $f(\cdot \mid \theta)$. Then a function $T(\mathbf{x})$ is sufficient for $\theta$ if and only if the likelihood function can be factorised in the form

\[
L(\theta) = g(\mathbf{x}) \times h(T(\mathbf{x}), \theta),
\]

where $g$ is a function of the data only, and $h$ depends on the data only through $T(\mathbf{x})$.

For a proof see page 276 of Casella and Berger.

We can also express the factorisation result in terms of the log-likelihood, which is often easier, just by taking logs of the above result:

\begin{align*}
\ell(\theta) &= \log\{g(\mathbf{x}) \times h(T(\mathbf{x}), \theta)\} \\
&= \log\{g(\mathbf{x})\} + \log\{h(T(\mathbf{x}), \theta)\} \\
&= \tilde{g}(\mathbf{x}) + \tilde{h}(T(\mathbf{x}), \theta),
\end{align*}

where $\tilde{g} = \log(g)$ and $\tilde{h} = \log(h)$.

We can show that $\sum_{i=1}^n x_i$ is sufficient for $\theta$ in the driving test example by inspection of the log-likelihood:

\[
\ell(\theta) = n\log(\theta) + \log(1-\theta)\sum_{i=1}^n x_i.
\]

Letting $T(\mathbf{x}) = \sum_{i=1}^n x_i$, $\tilde{h}(T(\mathbf{x}), \theta) = n\log(\theta) + \log(1-\theta)T(\mathbf{x})$ and $\tilde{g}(\mathbf{x}) = 0$, we satisfy the factorisation criterion, and hence $T(\mathbf{x}) = \sum_{i=1}^n x_i$ is sufficient for $\theta$.

Suppose that I carry out another survey on attempts to pass a driving test, again with $n = 219$ participants, and get data $\mathbf{y} = y_1, \dots, y_n$, with $\mathbf{x} \neq \mathbf{y}$ but $\sum_{i=1}^n x_i = \sum_{i=1}^n y_i$. Are the following statements true or false?

  1. $\hat\theta(\mathbf{x})$, the MLE based on data $\mathbf{x}$, is the same as $\hat\theta(\mathbf{y})$, the MLE based on data $\mathbf{y}$.

  2. The confidence intervals based on both datasets will be identical.

  3. The geometric distribution is appropriate for both datasets.

An important shortcoming of considering only the sufficient statistic is that it does not allow us to check how well the chosen model fits.

Example 13.3.1 Poisson parameter (cont.)

Recall from the beginning of this section the London homicides data, which we modelled as a random sample from the Poisson distribution. We found

\begin{align*}
L(\lambda \mid x_1, \dots, x_n) &= \prod_{i=1}^n \frac{\lambda^{x_i} \exp(-\lambda)}{x_i!} \\
&= \lambda^{\sum_i x_i} \exp(-n\lambda) \prod_{i=1}^n \frac{1}{x_i!} \\
&\propto \lambda^{\sum_i x_i} \exp(-n\lambda),
\end{align*}

and that the log-likelihood function for the Poisson data is consequently

\[
\ell(\lambda) = \log(\lambda)\sum_{i=1}^n x_i - n\lambda + c,
\]

with the MLE being

\[
\hat\lambda = \frac{\sum_{i=1}^n x_i}{n} = \bar{x}.
\]

By differentiating again, we can find the information function

\[
\ell''(\lambda \mid \mathbf{x}) = -\lambda^{-2}\sum_{i=1}^n x_i,
\]

and so

\[
I_O(\lambda \mid \mathbf{x}) = \lambda^{-2}\sum_{i=1}^n x_i.
\]
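These two results can be verified numerically (a sketch with hypothetical counts, not the London data):

```python
# Sketch: check the Poisson score is zero at lambda_hat = x_bar, and that
# the observed information is positive there. Hypothetical counts only.
x = [2, 0, 3, 1, 1, 2]
n = len(x)
lam_hat = sum(x) / n                      # MLE: the sample mean

score_at_mle = sum(x) / lam_hat - n       # d/dlam of log(lam)*sum_x - n*lam
info = sum(x) / lam_hat ** 2              # I_O(lam | x) = lam^-2 * sum(x)

assert abs(score_at_mle) < 1e-12          # stationary point at the MLE
assert info > 0                           # information positive => a maximum
```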

What is a sufficient statistic for the Poisson parameter?

For this case, letting $T(\mathbf{x}) = \sum_{i=1}^n x_i$, $\tilde{h}(T(\mathbf{x}), \lambda) = \log(\lambda)T(\mathbf{x}) - n\lambda$ and $\tilde{g}(\mathbf{x}) = c = -\sum_{i=1}^n \log(x_i!)$, we satisfy the factorisation criterion, and hence $T(\mathbf{x}) = \sum_{i=1}^n x_i$ is sufficient for $\lambda$.
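The factorisation itself can also be checked numerically; another sketch with hypothetical counts:

```python
import math

# Sketch: the Poisson log-likelihood splits as h_tilde(T(x), lam) + g_tilde(x).
x = [2, 0, 3, 1]              # hypothetical counts
n, T = len(x), sum(x)
g_tilde = -sum(math.log(math.factorial(xi)) for xi in x)   # free of lambda

def loglik(lam):
    """Full Poisson log-likelihood, including the lambda-free constant."""
    return sum(xi * math.log(lam) - lam - math.log(math.factorial(xi)) for xi in x)

def h_tilde(lam):
    """The part depending on the data only through T(x) = sum(x)."""
    return math.log(lam) * T - n * lam

for lam in (0.5, 1.0, 2.5):
    assert abs(loglik(lam) - (h_tilde(lam) + g_tilde)) < 1e-9
```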

Example 13.3.2 Normal variance

Suppose the sample $x_1, \dots, x_n$ comes from $X \sim N(0, \theta)$. Find a sufficient statistic for $\theta$. Is the MLE a function of this statistic or of the sample mean? Give a formula for the 95% confidence interval for $\theta$.

First, the Normal(0,θ) density is given by

\[
f(x_i \mid \theta) = \frac{1}{\sqrt{2\pi\theta}} \exp\left\{-\frac{x_i^2}{2\theta}\right\},
\]

leading to the likelihood

\begin{align*}
L(\theta) &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\theta}} \exp\left\{-\frac{x_i^2}{2\theta}\right\} \\
&\propto \frac{1}{\theta^{n/2}} \exp\left\{-\frac{\sum_i x_i^2}{2\theta}\right\}.
\end{align*}

Hence, $T(\mathbf{x}) = \sum_i x_i^2$ is a sufficient statistic for $\theta$. The log-likelihood and score functions are

\begin{align*}
\ell(\theta) &= -\frac{n}{2}\log\theta - \frac{\sum_i x_i^2}{2\theta}, \\
S(\theta) = \ell'(\theta) &= -\frac{n}{2\theta} + \frac{\sum_i x_i^2}{2\theta^2}.
\end{align*}

Solving S(θ)=0 gives a candidate MLE

\[
\hat\theta = \frac{\sum_i x_i^2}{n},
\]

which is a function of the sufficient statistic. To check this is an MLE we calculate

\[
\ell''(\theta) = \frac{n}{2\theta^2} - \frac{\sum_i x_i^2}{\theta^3}.
\]

In this case it isn’t immediately obvious that $\ell''(\hat\theta) < 0$, but substituting in gives

\begin{align*}
\ell''(\hat\theta) &= \frac{n}{2\left(\sum_i x_i^2 / n\right)^2} - \frac{\sum_i x_i^2}{\left(\sum_i x_i^2 / n\right)^3} \\
&= \frac{n^3}{2\left(\sum_i x_i^2\right)^2} - \frac{n^3}{\left(\sum_i x_i^2\right)^2} \\
&= -\frac{n^3}{2\left(\sum_i x_i^2\right)^2} < 0,
\end{align*}

confirming that this is an MLE.

The observed information is $I_O(\hat\theta) = -\ell''(\hat\theta)$, so

\[
I_O(\hat\theta) = \frac{n^3}{2\left(\sum_i x_i^2\right)^2}.
\]

Therefore a 95% confidence interval is given by

\begin{align*}
(l, u) &= \left(\hat\theta - \frac{1.96}{\sqrt{I_O(\hat\theta)}},\; \hat\theta + \frac{1.96}{\sqrt{I_O(\hat\theta)}}\right) \\
&= \left(\frac{\sum_i x_i^2}{n} - 1.96\sqrt{2}\, n^{-3/2}\sum_i x_i^2,\; \frac{\sum_i x_i^2}{n} + 1.96\sqrt{2}\, n^{-3/2}\sum_i x_i^2\right).
\end{align*}
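The full Normal-variance calculation can be replicated numerically; the data below are hypothetical:

```python
import math

# Sketch: MLE and 95% CI for theta in N(0, theta), with hypothetical data.
x = [0.3, -1.2, 0.8, -0.5, 1.9, -0.1, 0.6, -1.4]
n = len(x)
sum_x2 = sum(xi * xi for xi in x)

theta_hat = sum_x2 / n                 # MLE, a function of T(x) = sum of x_i^2
info = n**3 / (2 * sum_x2 ** 2)        # observed information I_O(theta_hat)
half = 1.96 / math.sqrt(info)          # half-width of the 95% interval
lo, hi = theta_hat - half, theta_hat + half

# The two forms of the half-width in the notes agree:
assert abs(half - 1.96 * math.sqrt(2) * sum_x2 * n ** -1.5) < 1e-12
assert lo < theta_hat < hi
```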