We motivate the Expectation-Maximisation (EM) algorithm with a classical problem from the nineteenth century!
The first attempt at mixture modelling was by Pearson in 1894, when he was presented by the biologist W.F.R. Weldon with 1000 observations on crabs found in Naples; see [3]. The data are ratios of forehead width to body length; a frequency table is available at www.maths.uq.edu.au/~gjm/DATA/Crab.dat.
The following graph shows a histogram of the data together with a fitted mixture of two Gaussians.
[Figure: histogram of the crab data with the fitted two-component Gaussian mixture density overlaid.]
The red line gives the mixture density obtained using the EM algorithm and the green lines show the individual Gaussian densities, multiplied by the appropriate weighting. Hence the red line is the sum of the two green lines.
See www.math.mcmaster.ca/peter/mix/demex/excrabs.html for more details.
We shall consider the two-group case, although the ideas readily extend to more than two groups and from mixtures of Gaussians to mixtures of other distributions.
If a random variable $X$ is drawn from $N(\mu_1,\sigma_1^2)$ with probability $p_1$ and from $N(\mu_2,\sigma_2^2)$ with probability $p_2$, $p_1+p_2=1$, then the pdf of $X$ is
\[ f(x) = p_1\,\phi(x;\mu_1,\sigma_1^2) + p_2\,\phi(x;\mu_2,\sigma_2^2), \]
where $\phi(x;\mu_j,\sigma_j^2)$ is the pdf of $N(\mu_j,\sigma_j^2)$. This pdf is called a Gaussian mixture with two components.
Proof. Let $Z=j$ if $X$ comes from $N(\mu_j,\sigma_j^2)$, $j=1,2$. By definition, the cdf of $X$ is
\[ F(x) = P(X\le x) = p_1\,P(X\le x\mid Z=1) + p_2\,P(X\le x\mid Z=2) = p_1\,\Phi\!\left(\frac{x-\mu_1}{\sigma_1}\right) + p_2\,\Phi\!\left(\frac{x-\mu_2}{\sigma_2}\right). \]
Thus the pdf is
\[ f(x) = F'(x) = p_1\,\phi(x;\mu_1,\sigma_1^2) + p_2\,\phi(x;\mu_2,\sigma_2^2). \]
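As a quick illustration, the mixture pdf above can be evaluated in R with dnorm; a minimal sketch, where the function name dmix and the parameter values are my own choices, picked only to give two modes on roughly the scale of the crab ratios:

# Density of a two-component Gaussian mixture: p1*N(mu1, s1^2) + (1-p1)*N(mu2, s2^2)
dmix <- function(x, p1, mu1, s1, mu2, s2) {
  p1 * dnorm(x, mean = mu1, sd = s1) + (1 - p1) * dnorm(x, mean = mu2, sd = s2)
}
# Example (purely illustrative parameter values):
curve(dmix(x, p1 = 0.5, mu1 = 0.60, s1 = 0.02, mu2 = 0.66, s2 = 0.02),
      from = 0.55, to = 0.72, ylab = "density")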
Given a random sample $x_1,\dots,x_n$ from this distribution, the log-likelihood function is
\[ \ell(\theta) = \sum_{i=1}^{n} \log\big\{ p_1\,\phi(x_i;\mu_1,\sigma_1^2) + p_2\,\phi(x_i;\mu_2,\sigma_2^2) \big\}, \]
which cannot be maximised easily. Here $\theta = (p_1,\mu_1,\sigma_1^2,\mu_2,\sigma_2^2)$ and $p_2 = 1 - p_1$.
However, with additional information $z_1,\dots,z_n$ on which distribution each $x_i$ comes from, the likelihood function becomes
\[ L(\theta; x, z) = \prod_{i=1}^{n} \prod_{j=1}^{2} \big\{ p_j\,\phi(x_i;\mu_j,\sigma_j^2) \big\}^{I(z_i=j)}. \]
Thus the log-likelihood function becomes
\[ \ell(\theta; x, z) = \sum_{i=1}^{n} \sum_{j=1}^{2} I(z_i=j)\big\{ \log p_j + \log\phi(x_i;\mu_j,\sigma_j^2) \big\}, \]
which is simpler and can be maximised as follows, with ‘obvious’ solutions.
Setting the partial derivatives with respect to $\mu_1$ and $\sigma_1^2$ to zero gives
\[ \hat\mu_1 = \frac{\sum_{i=1}^{n} I(z_i=1)\,x_i}{\sum_{i=1}^{n} I(z_i=1)}, \qquad \hat\sigma_1^2 = \frac{\sum_{i=1}^{n} I(z_i=1)\,(x_i-\hat\mu_1)^2}{\sum_{i=1}^{n} I(z_i=1)}, \]
which are respectively the sample mean and sample variance of the sample from $N(\mu_1,\sigma_1^2)$. Similarly for $\hat\mu_2$ and $\hat\sigma_2^2$.
The rest of the maximisation uses the Lagrange multiplier method with constraint $p_1 + p_2 = 1$, giving
\[ \hat p_j = \frac{1}{n}\sum_{i=1}^{n} I(z_i=j), \qquad j = 1, 2, \]
which are the sample proportions from each population.
However, the data $z_1,\dots,z_n$ are not available in practice. How, then, can we make use of the above maximum likelihood estimator for $\theta$ given $x$ and $z$ when the only data available are $x_1,\dots,x_n$? This is where the EM algorithm comes into play. We iterate between the M-step (the MLE above), computing $\hat\theta$ based on the full data ($x$ and $z$), and an E-step, which estimates the unobserved $z$ given the data $x$ and the current parameter estimates $\hat\theta$.
Before formally introducing the EM algorithm, we outline the procedure for the mixture model.
To initialise, we need to choose initial estimates for the parameters of the model. The key thing is to choose values of $\mu_1$, $\mu_2$, $\sigma_1^2$, $\sigma_2^2$ and $p_1$ that are reasonable. For example, the data range from 0.57 to 0.7, therefore I choose two plausible values in this range for the means, with an initial variance close to the variance of the data. Initially I have no preference for either of the two Gaussian distributions making up the mixture, so take $p_1 = p_2 = 0.5$.
E-Step
The augmented data that we want are $z_1,\dots,z_n$, the mixture component (Gaussian distribution) to which each of the observations belongs. Given $x_1,\dots,x_n$ and the current estimate of $\theta$, we can compute the probability (expectation) of the observations coming from each of the two distributions.
Since the observations are assumed to be independent,
\[ P(z_1,\dots,z_n \mid x, \theta) = \prod_{i=1}^{n} P(z_i \mid x_i, \theta). \]
By Bayes’ theorem, for $j = 1, 2$,
\[ P(Z_i = j \mid x_i, \theta) = \frac{p_j\,\phi(x_i;\mu_j,\sigma_j^2)}{p_1\,\phi(x_i;\mu_1,\sigma_1^2) + p_2\,\phi(x_i;\mu_2,\sigma_2^2)}, \]
with $P(Z_i = 1 \mid x_i, \theta) + P(Z_i = 2 \mid x_i, \theta) = 1$.
Thus in the E-step we compute the expectation of $I(Z_i = 1)$, which we shall call $\gamma_i = P(Z_i = 1 \mid x_i, \theta)$ (with $E[I(Z_i = 2) \mid x_i, \theta] = 1 - \gamma_i$).
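In R, the E-step is a single vectorised application of Bayes’ theorem; a minimal sketch (the function and argument names are my own):

# E-step: probability that each observation belongs to component 1,
# given current parameter estimates (p1, mu1, s1, mu2, s2)
e_step <- function(x, p1, mu1, s1, mu2, s2) {
  d1 <- p1 * dnorm(x, mean = mu1, sd = s1)        # weighted density under component 1
  d2 <- (1 - p1) * dnorm(x, mean = mu2, sd = s2)  # weighted density under component 2
  d1 / (d1 + d2)                                  # vector of responsibilities gamma_i
}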
M-Step
We can now plug the expected values $\gamma_i$ for $I(z_i = 1)$ and $1 - \gamma_i$ for $I(z_i = 2)$, obtained in the E-step, into the full-data MLEs derived above to update our estimates of $\theta$. Specifically,
\[ \hat p_1 = \frac{1}{n}\sum_{i=1}^{n} \gamma_i, \qquad \hat\mu_1 = \frac{\sum_{i=1}^{n} \gamma_i x_i}{\sum_{i=1}^{n} \gamma_i}, \qquad \hat\sigma_1^2 = \frac{\sum_{i=1}^{n} \gamma_i (x_i - \hat\mu_1)^2}{\sum_{i=1}^{n} \gamma_i}, \]
and similarly for $\hat p_2$, $\hat\mu_2$ and $\hat\sigma_2^2$, with $\gamma_i$ replaced by $1 - \gamma_i$.
The algorithm then iterates between the two steps (E and M), updating the probability of each observation belonging to each of the two Gaussian distributions and then updating the estimates (via the MLEs) of the parameters. The process stops when we have convergence, that is, when two successive estimates of the parameters agree to some pre-defined precision.
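For completeness, here is a minimal, self-contained R sketch of the whole procedure for a two-component Gaussian mixture. The function name, default starting values, tolerance and stopping rule are illustrative assumptions of mine, not the code used to produce the results quoted below:

em_mix2 <- function(x, p1 = 0.5, mu1 = min(x), mu2 = max(x),
                    s1 = sd(x), s2 = sd(x), tol = 1e-8, maxit = 10000) {
  for (it in 1:maxit) {
    # E-step: responsibilities for component 1
    d1 <- p1 * dnorm(x, mu1, s1)
    d2 <- (1 - p1) * dnorm(x, mu2, s2)
    g  <- d1 / (d1 + d2)
    # M-step: weighted sample proportions, means and variances
    p1.new  <- mean(g)
    mu1.new <- sum(g * x) / sum(g)
    mu2.new <- sum((1 - g) * x) / sum(1 - g)
    s1.new  <- sqrt(sum(g * (x - mu1.new)^2) / sum(g))
    s2.new  <- sqrt(sum((1 - g) * (x - mu2.new)^2) / sum(1 - g))
    # stop when successive parameter estimates agree to the chosen precision
    if (max(abs(c(p1.new - p1, mu1.new - mu1, mu2.new - mu2,
                  s1.new - s1, s2.new - s2))) < tol) break
    p1 <- p1.new; mu1 <- mu1.new; mu2 <- mu2.new; s1 <- s1.new; s2 <- s2.new
  }
  list(p1 = p1.new, mu1 = mu1.new, s1 = s1.new,
       mu2 = mu2.new, s2 = s2.new, iterations = it)
}
# Usage on simulated data (illustrative only):
# x <- c(rnorm(600, 0.60, 0.02), rnorm(400, 0.66, 0.02)); em_mix2(x)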
For the Naples Bay crab data I required successive parameter estimates to differ by less than a small pre-chosen tolerance. This meant that, with my chosen starting values, 467 iterations were required (taking 0.22 seconds in R!), resulting in the fitted parameter values used to draw the mixture density shown above.
The EM algorithm is an iterative algorithm, introduced in [2], designed to compute the maximum likelihood estimate when there are missing data. It is iterative in that it consists of a series of steps, called iterations, in which the parameter value is repeatedly updated until a convergence criterion is met. The algorithm converges to a local maximum of the likelihood function (i.e. not necessarily the global maximum). Thus if the likelihood function is unimodal, the EM algorithm will converge to the maximum likelihood estimate (MLE).
The EM algorithm is primarily used for maximising the likelihood when the task becomes easier given more information associated with the existing data. This situation is called an incomplete data problem because we do not have this extra information. A prime example is mixture distributions, such as that given above. In the Naples crab example, if we knew from which of the two Gaussian distributions each observation came, parameter estimation would be straightforward.
The EM algorithm gets its name from the two steps involved in each iteration: the E-step (expectation) and the M-step (maximisation). The procedure alternates between these two steps (so it goes E, M, E, M, …) until convergence is achieved (successive values are sufficiently close).
Let $y$ represent our observed data and $\theta$ the parameter(s) of interest. (In the Naples crab example, $y$ represents the crab ratios and $\theta$ represents the parameters of the two Gaussian distributions and the mixture proportions.) By choosing a suitable parametric model we can write down the likelihood function $L(\theta; y)$ and our aim is then to find the value of $\theta$ that maximises $L(\theta; y)$, or equivalently $\log L(\theta; y)$, i.e. find the MLE (maximum likelihood estimate) of $\theta$. Suppose that it is possible to think of some additional data $z$, such that the ‘complete data’ log-likelihood $\log f(y, z \mid \theta)$ is of a simpler form than the ‘incomplete data’ log-likelihood $\log g(y \mid \theta)$ and thus easier to maximise. (In the Naples crab example, $z$ represents from which of the two Gaussian distributions each observation comes.) The log-likelihood of $y$ and $z$ is preferred but $z$ is not available, so we estimate the complete-data log-likelihood by taking its expectation with respect to $z$ given the available data $y$, in such a way that the estimated log-likelihood function remains easy to maximise. However, the expectation requires the value of $\theta$, thus it can only be done iteratively, using the current value of $\theta$. These are the key ideas behind the EM algorithm. Starting with an initial value for $\theta$, we:
1. Estimate the complete-data log-likelihood using the current value of $\theta$ by taking its expectation conditional on $y$.
2. Maximise the estimated log-likelihood with respect to $\theta$ to get an updated value for $\theta$.
3. Continue steps 1 and 2 until convergence is achieved.
Note that the above is given in terms of the log-likelihood, because it is often easier to work with than the likelihood of the complete data $y$ and $z$.
More specifically, the procedure is as follows. Let $\theta^{(t)}$ denote the current estimate of $\theta$; we define the function
\[ Q(\theta \mid \theta^{(t)}) = E\big[\log f(y, Z \mid \theta) \,\big|\, y, \theta^{(t)}\big], \]
that is, in the expectation step,
\[ Q(\theta \mid \theta^{(t)}) = \int \log f(y, z \mid \theta)\, k(z \mid y, \theta^{(t)})\, dz. \]
The expectation is with respect to the conditional distribution of $Z$ given the observed data, using the current estimate of $\theta$, i.e. $k(z \mid y, \theta^{(t)})$. Note that $Q(\theta \mid \theta^{(t)})$ is the expectation of the complete-data log-likelihood function, taking the unavailable data $z$ as random. The E- and M-steps can then formally be specified as:
The E-step: calculation of $Q(\theta \mid \theta^{(t)})$ (key elements of $Q$) as a function of $\theta$.
(Determination of $Q$ by taking expectation might sound daunting, but it is often fairly straightforward. Only the key elements of $Q$ are calculated, so that it can be maximised in the M-step.)
The M-step: maximisation of $Q(\theta \mid \theta^{(t)})$ with respect to $\theta$ to get $\theta^{(t+1)}$, which becomes $\theta^{(t)}$ in the next iteration.
(Maximisation of $Q(\theta \mid \theta^{(t)})$ should be much easier than maximising $\log g(y \mid \theta)$, often with an explicit solution.)
The EM algorithm can be interpreted as ‘data augmentation’ when the complete-data log-likelihood is linear in the missing data $z$, which is estimated in the E-step and substituted in the M-step. That is, rather than the somewhat off-putting $Q(\theta \mid \theta^{(t)})$, the E-step reduces to computing $E[Z \mid y, \theta^{(t)}]$, which is then substituted for $z$ in the M-step. In the Naples crab example this is not quite the case, but we simply computed the probability that each observation belongs to each of the two Gaussian distributions and substituted this, via $\gamma_i$, into the MLEs. In other words, it can be shown formally that the straightforward procedure we employed for the Naples crab data is in fact an implementation of the EM algorithm.
The EM algorithm is therefore a deterministic algorithm that alternates between the expectation and maximisation steps. Even though it works with the (estimated) complete data log-likelihood, it actually increases the incomplete data log-likelihood in each iteration. Let us prove this step by step.
First, we give an overview of the notation. Below we assume the data are continuous, for simplicity of exposition, but the same arguments (with minor modifications) apply to discrete data or a mixture of continuous and discrete data. Note that the Naples crab data are such a mixture: the crab ratios are continuous, whilst the allocation variables are discrete. The full-data likelihood is $f(y, z \mid \theta)$, the probability density function of the observed and augmented data given parameters $\theta$. The observed-data likelihood is $g(y \mid \theta) = \int f(y, z \mid \theta)\, dz$. Finally, $k(z \mid y, \theta) = f(y, z \mid \theta) / g(y \mid \theta)$ is the conditional distribution of $z$ given $y$ and $\theta$.
Because we maximise $Q(\theta \mid \theta^{(t)})$ as a function of its first argument to get $\theta^{(t+1)}$,
\[ Q(\theta^{(t+1)} \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) \ge 0. \qquad (1.1) \]
Since $f(y, z \mid \theta) = g(y \mid \theta)\, k(z \mid y, \theta)$, we can write
\[ Q(\theta^{(t+1)} \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) = \int k(z \mid y, \theta^{(t)}) \log \frac{k(z \mid y, \theta^{(t+1)})}{k(z \mid y, \theta^{(t)})}\, dz + \int k(z \mid y, \theta^{(t)}) \log \frac{g(y \mid \theta^{(t+1)})}{g(y \mid \theta^{(t)})}\, dz. \qquad (1.2) \]
Using the inequality $\log x \le x - 1$, we can deduce that the first integral cannot be positive:
\[ \int k(z \mid y, \theta^{(t)}) \log \frac{k(z \mid y, \theta^{(t+1)})}{k(z \mid y, \theta^{(t)})}\, dz \le \int k(z \mid y, \theta^{(t)}) \left\{ \frac{k(z \mid y, \theta^{(t+1)})}{k(z \mid y, \theta^{(t)})} - 1 \right\} dz = 1 - 1 = 0. \qquad (1.3) \]
Therefore, by (1.1), the second integral must be non-negative:
\[ \int k(z \mid y, \theta^{(t)}) \log \frac{g(y \mid \theta^{(t+1)})}{g(y \mid \theta^{(t)})}\, dz \ge 0. \qquad (1.4) \]
Since the integrand in (1.4) does not depend on $z$ and $k(z \mid y, \theta^{(t)})$ integrates to one, the left-hand side of (1.4) equals $\log g(y \mid \theta^{(t+1)}) - \log g(y \mid \theta^{(t)})$, so the log-likelihood is increasing (or at the very least non-decreasing) from one iteration to the next.
We have shown that at each iteration of the EM algorithm the likelihood is non-decreasing. Furthermore, it can be shown that if the iterations converge, then they converge to a stationary point of the likelihood function. Although this will usually be a local maximum, unless the likelihood function is unimodal it is not necessarily the global maximum, i.e. the MLE.
Advantages
It often has meaningful solutions and interpretation as data augmentation.
Convergence is easy to ‘detect’, since we wait until successive values of $\theta$ fall within a certain level of variation.
Disadvantages
It can be slow to converge.
Need to be able to compute the E-step – restricts applications. (This can be circumvented to some extent by using a Monte Carlo EM algorithm.)
This example actually originates from the paper [2] which introduced the EM algorithm.
The data consist of the genetic linkage of 197 animals, divided into four genetic categories, labelled 1 through to 4; see [4] and [2]. The probabilities that an animal belongs to each of the four categories are $\tfrac{1}{2} + \tfrac{\theta}{4}$, $\tfrac{1-\theta}{4}$, $\tfrac{1-\theta}{4}$ and $\tfrac{\theta}{4}$, respectively. Let $y = (y_1, y_2, y_3, y_4)$ denote the total numbers of animals in the four categories. For example, the probability of belonging to category 1 is $\tfrac{1}{2} + \tfrac{\theta}{4}$ and there are $y_1 = 125$ observed animals in category 1. It is possible to maximise this multinomial likelihood directly and hence obtain the MLE for $\theta$ without recourse to the EM algorithm. However, it is far simpler to apply the EM algorithm. To show that the EM algorithm works, we do both for this example.
MLE - observed data
The likelihood for $\theta$ given $y$ is
\[ L(\theta \mid y) \propto \left(\frac{1}{2} + \frac{\theta}{4}\right)^{y_1} \left(\frac{1-\theta}{4}\right)^{y_2} \left(\frac{1-\theta}{4}\right)^{y_3} \left(\frac{\theta}{4}\right)^{y_4}. \qquad (1.5) \]
Therefore the log-likelihood satisfies
\[ \ell(\theta) = y_1 \log(2+\theta) + (y_2 + y_3) \log(1-\theta) + y_4 \log\theta + C, \qquad (1.6) \]
where $C$ is a constant. Therefore
\[ \frac{d\ell}{d\theta} = \frac{y_1}{2+\theta} - \frac{y_2 + y_3}{1-\theta} + \frac{y_4}{\theta}. \qquad (1.7) \]
By setting $d\ell/d\theta = 0$, we obtain the following quadratic (in $\theta$) equation from (1.7),
\[ n\theta^2 - \{y_1 - 2(y_2 + y_3) - y_4\}\theta - 2 y_4 = 0, \qquad (1.8) \]
where $n = y_1 + y_2 + y_3 + y_4 = 197$. This yields
\[ \theta = \frac{\{y_1 - 2(y_2 + y_3) - y_4\} \pm \sqrt{\{y_1 - 2(y_2 + y_3) - y_4\}^2 + 8 n y_4}}{2n}. \]
Since $\theta$ needs to lie between 0 and 1, we take the positive root, which gives the MLE $\hat\theta$.
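As a hedged illustration, the direct maximisation can be coded in a few lines of R, assuming the cell probabilities $(\tfrac{1}{2}+\tfrac{\theta}{4}, \tfrac{1-\theta}{4}, \tfrac{1-\theta}{4}, \tfrac{\theta}{4})$ used above; the function name, and the example counts quoted in the final comment from Dempster, Laird and Rubin (1977), are my additions:

# Direct MLE: solve the quadratic n*t^2 - b*t - 2*y4 = 0 obtained by setting
# dl/dtheta = 0, and take the root lying in (0, 1)
linkage_mle_direct <- function(y) {
  n <- sum(y)
  b <- y[1] - 2 * (y[2] + y[3]) - y[4]
  (b + sqrt(b^2 + 8 * n * y[4])) / (2 * n)
}
# e.g. linkage_mle_direct(c(125, 18, 20, 34))  # counts as in Dempster, Laird and Rubin (1977)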
EM algorithm - full data
What extra information is going to make computation of the MLE simpler?
Suppose that we can choose $z$ so that we have an augmented-data likelihood of the form
\[ L(\theta \mid y, z) \propto \theta^{a}(1-\theta)^{b}, \qquad (1.9) \]
for some quantities $a$ and $b$ determined by the data. Then
\[ \ell(\theta \mid y, z) = a\log\theta + b\log(1-\theta) + \text{constant}, \]
with
\[ \frac{d\ell}{d\theta} = \frac{a}{\theta} - \frac{b}{1-\theta}. \]
Setting $d\ell/d\theta = 0$ gives the linear equation $a(1-\theta) = b\theta$. It is then trivial to show that $\hat\theta = a/(a+b)$.
Note that likelihoods of the form (1.9) are common in statistics, for example, binomial (geometric, negative binomial) data.
What is stopping the genetic linkage data from having a likelihood of the form (1.9)? The first cell, with probability $\tfrac{1}{2} + \tfrac{\theta}{4}$.
What if we break this probability into $\tfrac{1}{2}$ and $\tfrac{\theta}{4}$, a component independent of $\theta$ and a component proportional to $\theta$? Below we detail how this can be done.
Suppose that the observed cell count $y_1$ could be divided into two subcategories, with counts $y_{11}$ and $y_{12}$, where $y_{11}$ is the number of animals in the subcategory with cell probability $\tfrac{1}{2}$ and $y_{12}$ is the number in the subcategory with cell probability $\tfrac{\theta}{4}$, so that $y_{11} + y_{12} = y_1$. This would give an augmented data set $(y_{11}, y_{12}, y_2, y_3, y_4)$.
The likelihood for $\theta$ given $(y_{11}, y_{12}, y_2, y_3, y_4)$ is
\[ L(\theta \mid y_{11}, y_{12}, y_2, y_3, y_4) \propto \left(\frac{1}{2}\right)^{y_{11}} \left(\frac{\theta}{4}\right)^{y_{12}} \left(\frac{1-\theta}{4}\right)^{y_2} \left(\frac{1-\theta}{4}\right)^{y_3} \left(\frac{\theta}{4}\right)^{y_4}. \qquad (1.10) \]
Hence, there exists a constant $C$ such that
\[ \ell(\theta) = (y_{12} + y_4)\log\theta + (y_2 + y_3)\log(1-\theta) + C. \qquad (1.11) \]
(This yields $\hat\theta = (y_{12} + y_4)/(y_{12} + y_2 + y_3 + y_4)$, using the observations after (1.9).)
E - step
From (1.11), we have that
\[ Q(\theta \mid \theta^{(t)}) = E[\ell(\theta) \mid y, \theta^{(t)}] = \big(E[Y_{12} \mid y, \theta^{(t)}] + y_4\big)\log\theta + (y_2 + y_3)\log(1-\theta) + C, \]
since $y_2$, $y_3$ and $y_4$ are known, fixed constants and $C$ does not depend on $\theta$, so plays no role in the M-step (it can be ignored). Therefore the only quantity that needs to be computed in the E-step is $E[Y_{12} \mid y, \theta^{(t)}]$.
What is the distribution of $Y_{12}$?
There are 125 animals in category 1, which independently can be assigned to either the subcategory with probability $\tfrac{1}{2}$ or the subcategory with probability $\tfrac{\theta}{4}$. The conditional probability of an animal belonging to the latter subcategory, given that it belongs to category 1, is simply
\[ \frac{\theta/4}{1/2 + \theta/4} = \frac{\theta}{2+\theta}. \]
Thus $Y_{12} \mid y_1, \theta \sim \mathrm{Bin}\big(125, \theta/(2+\theta)\big)$, and hence, $E[Y_{12} \mid y, \theta] = 125\theta/(2+\theta)$. This gives
\[ E[Y_{12} \mid y, \theta^{(t)}] = \frac{125\,\theta^{(t)}}{2+\theta^{(t)}}. \qquad (1.12) \]
M - step
The M-step is straightforward and simply involves finding $\theta^{(t+1)}$, the value of $\theta$ maximising $Q(\theta \mid \theta^{(t)})$ (which we have effectively already done above). From (1.12), we have that
\[ Q(\theta \mid \theta^{(t)}) = \left(\frac{125\,\theta^{(t)}}{2+\theta^{(t)}} + y_4\right)\log\theta + (y_2 + y_3)\log(1-\theta) + C. \]
Hence $\theta^{(t+1)}$ solves
\[ \frac{1}{\theta}\left(\frac{125\,\theta^{(t)}}{2+\theta^{(t)}} + y_4\right) - \frac{y_2 + y_3}{1-\theta} = 0, \]
which, following the steps above, yields
\[ \theta^{(t+1)} = \frac{125\,\theta^{(t)}/(2+\theta^{(t)}) + y_4}{125\,\theta^{(t)}/(2+\theta^{(t)}) + y_2 + y_3 + y_4}. \]
For this example, starting from an initial value for $\theta$, successive iterations of the EM algorithm converge to the maximum likelihood estimate (MLE) obtained directly above.
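A minimal R sketch of this iteration (the function name, starting value, tolerance and the example counts quoted in the final comment from Dempster, Laird and Rubin (1977) are my additions):

# EM for the genetic linkage data; y = (y1, y2, y3, y4) are the cell counts
em_linkage <- function(y, theta = 0.5, tol = 1e-8, maxit = 1000) {
  for (it in 1:maxit) {
    y12 <- y[1] * theta / (2 + theta)                       # E-step: E[Y12 | y, theta]
    theta.new <- (y12 + y[4]) / (y12 + y[2] + y[3] + y[4])  # M-step
    if (abs(theta.new - theta) < tol) { theta <- theta.new; break }
    theta <- theta.new
  }
  theta
}
# e.g. em_linkage(c(125, 18, 20, 34))  # counts as in Dempster, Laird and Rubin (1977)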
The EM algorithm is particularly useful where we have censored data, which arise in many experiments and medical trials. In a medical study, we could be interested in the time until death following major surgery. However, we may only be able to follow up patients for 2 years after surgery. That is, for a patient still alive 2 years after surgery we don’t know when the death of the patient occurs, only that death occurs more than 2 years after surgery.
Suppose that $X_1, \dots, X_n$ are independent and identically distributed according to a density $f(\cdot \mid \theta)$. However, assume that there is right censoring at the value $a$. That is, for all observations greater than $a$, the only information we have is that the observation exceeds $a$. The common notation is to denote censored values superscripted with an asterisk. Therefore, we have an incomplete sample (suitably rearranged)
\[ y = (x_1, \dots, x_m, a^{*}, \dots, a^{*}), \]
in which the first $m$ observations are uncensored and the remaining $n - m$ are censored at $a$.
In this case, the likelihood based on all we know about $y$ is
\[ L(\theta \mid y) = \prod_{i=1}^{m} f(x_i \mid \theta) \times \{1 - F(a \mid \theta)\}^{n-m}, \]
where $F$ is the cdf (cumulative distribution function) corresponding to $f$. Typically it is difficult to maximise because of the presence of the non-standard term $\{1 - F(a \mid \theta)\}^{n-m}$. Let $z = (z_{m+1}, \dots, z_n)$ represent the true but unobserved values of the censored observations. Then the pair $(y, z)$ provides the complete data and
\[ L(\theta \mid y, z) = \prod_{i=1}^{m} f(x_i \mid \theta) \times \prod_{i=m+1}^{n} f(z_i \mid \theta), \]
which is usually straightforward to maximise.
Suppose that $X_i \sim \mathrm{Exp}(\lambda)$, where the rate $\lambda$ is unknown, and suppose that we have 15 observations, with the data right-censored at a known value $a$.
Note that the cdf of $X_i$ is given by
\[ F(x \mid \lambda) = 1 - e^{-\lambda x}, \qquad x \ge 0, \]
whilst the pdf of $X_i$ is given by $f(x \mid \lambda) = \lambda e^{-\lambda x}$, $x \ge 0$. We reorder the data into ascending order, so that $x_1, \dots, x_m$ are the uncensored observations and the remaining $n - m$ observations are censored at $a$; then
\[ L(\lambda \mid y) = \prod_{i=1}^{m} \lambda e^{-\lambda x_i} \times \big\{1 - F(a \mid \lambda)\big\}^{n-m}. \]
This is difficult to maximise directly.
Let $z_i$ denote the actual value of the $i$th observation. Then $z_i = x_i$ for $i = 1, \dots, m$, and $z_i > a$ for $i = m+1, \dots, n$.
Then
\[ L(\lambda \mid y, z) = \prod_{i=1}^{n} \lambda e^{-\lambda z_i} = \lambda^{n} \exp\Big(-\lambda \sum_{i=1}^{n} z_i\Big), \]
so that
\[ \ell(\lambda \mid y, z) = n\log\lambda - \lambda\sum_{i=1}^{n} z_i + K, \]
where $K$ does not depend on $\lambda$.
Hence,
\[ \frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} z_i, \]
which when set equal to 0 solves to give $\hat\lambda = 1/\bar{z}$, where $\bar{z}$ is the sample mean of $z_1, \dots, z_n$.
Now conditioning on what we know about $z$, for each censored observation
\[ E[Z_i \mid Z_i > a, \lambda] = a + \frac{1}{\lambda}, \]
where the expectation is computed using the conditional density of $Z_i$ given $Z_i > a$ (the lack-of-memory property of the exponential distribution). This leads to
\[ Q(\lambda \mid \lambda^{(t)}) = n\log\lambda - \lambda\left\{\sum_{i=1}^{m} x_i + (n-m)\left(a + \frac{1}{\lambda^{(t)}}\right)\right\}, \]
which is maximised at
\[ \lambda^{(t+1)} = \frac{n}{\sum_{i=1}^{m} x_i + (n-m)\big(a + 1/\lambda^{(t)}\big)}, \]
where $\sum_{i=1}^{m} x_i$ is the sum of the uncensored observations and $n - m$ is the number of censored observations.
The EM algorithm consists of the calculation of $a + 1/\lambda^{(t)}$ in the E-step and of $\lambda^{(t+1)}$ in the M-step, which are repeated until convergence.
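A minimal R sketch of this scheme, assuming the rate parameterisation used above; the function name, starting value and tolerance are mine, and the usage comment simulates illustrative data rather than reproducing the 15 observations referred to in the notes:

# EM for exponential data right-censored at a; censored observations are recorded as a
em_exp_cens <- function(x, a, lambda = 1, tol = 1e-8, maxit = 10000) {
  n <- length(x)
  cens <- (x >= a)                       # indicator of censoring
  for (it in 1:maxit) {
    # E-step: for censored observations, E[Z | Z > a] = a + 1/lambda
    z <- ifelse(cens, a + 1 / lambda, x)
    # M-step: complete-data MLE of the rate is 1 / mean(z)
    lambda.new <- 1 / mean(z)
    if (abs(lambda.new - lambda) < tol) { lambda <- lambda.new; break }
    lambda <- lambda.new
  }
  lambda
}
# Usage (simulated data, censoring point a = 2, purely illustrative):
# x <- pmin(rexp(15, rate = 1.5), 2); em_exp_cens(x, a = 2)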
For the given data, the EM algorithm converges to the MLE $\hat\lambda$.
Note that at convergence successive iterates are equal: $\lambda^{(t+1)} = \lambda^{(t)} = \lambda^{*}$, say.
Now
\[ \lambda^{*} = \frac{n}{\sum_{i=1}^{m} x_i + (n-m)\big(a + 1/\lambda^{*}\big)}, \]
where rearranging gives $\lambda^{*}\big\{\sum_{i=1}^{m} x_i + (n-m)a\big\} = n - (n-m) = m$. Therefore
\[ \lambda^{*} = \frac{m}{\sum_{i=1}^{m} x_i + (n-m)a}, \]
which is precisely the maximum likelihood estimate based on the observed (censored) data.
As usual in maximum likelihood estimation, the calculation of standard errors requires the calculation of the inverse Hessian matrix (Fisher information):
\[ I(\theta) = -\frac{\partial^2 \log g(y \mid \theta)}{\partial\theta\,\partial\theta^{T}}, \qquad (1.13) \]
evaluated at the MLE $\hat\theta$. For missing data problems this is usually difficult to evaluate directly. However, the structure of the EM algorithm can be exploited to assist in the calculations. Although the methodology can be applied to a vector $\theta$, we will restrict attention to the case where $\theta$ is a single parameter. In this case, the standard error of the estimator is
\[ \mathrm{se}(\hat\theta) = \frac{1}{\sqrt{I(\hat\theta)}}. \qquad (1.14) \]
First note that $g(y \mid \theta) = f(y, z \mid \theta) / k(z \mid y, \theta)$, thus
\[ -\frac{\partial^2 \log g(y \mid \theta)}{\partial\theta^2} = -\frac{\partial^2 \log f(y, z \mid \theta)}{\partial\theta^2} + \frac{\partial^2 \log k(z \mid y, \theta)}{\partial\theta^2}. \qquad (1.15) \]
The equation (1.15) forms the basis for the missing information principle: the observed information equals the complete information minus the missing information.
Multiplying both sides of (1.15) by $k(z \mid y, \theta')$ and integrating with respect to $z$, we get (for any $\theta'$)
\[ I(\theta) = I_c(\theta) - I_m(\theta), \qquad (1.16) \]
where
\[ I_c(\theta) = -\int k(z \mid y, \theta')\,\frac{\partial^2 \log f(y, z \mid \theta)}{\partial\theta^2}\,dz \qquad \text{and} \qquad I_m(\theta) = -\int k(z \mid y, \theta')\,\frac{\partial^2 \log k(z \mid y, \theta)}{\partial\theta^2}\,dz. \]
Here we use $\theta'$ to avoid differentiation with respect to it; it will take the same value as $\theta$ (namely $\hat\theta$) in the following.
We know how to deal with the $I_c$ function. For the $I_m$ function, we can use
\[ I_m(\theta) = \mathrm{Var}\!\left(\frac{\partial \log k(Z \mid y, \theta)}{\partial\theta} \,\middle|\, y, \theta\right) = \mathrm{Var}\!\left(\frac{\partial \log f(y, Z \mid \theta)}{\partial\theta} \,\middle|\, y, \theta\right). \]
(The second equality follows since, conditional upon $y$ and $\theta$, $\partial \log g(y \mid \theta)/\partial\theta$ is a constant and so does not alter the variance.)
The calculation of standard errors is best illustrated using an example. We shall use the genetics problem seen earlier. In that case
\[ I_c(\theta) = \frac{E[Y_{12} \mid y, \theta] + y_4}{\theta^2} + \frac{y_2 + y_3}{(1-\theta)^2}, \]
since
\[ -\frac{\partial^2 \log f(y, z \mid \theta)}{\partial\theta^2} = \frac{y_{12} + y_4}{\theta^2} + \frac{y_2 + y_3}{(1-\theta)^2}, \]
and we simply replace $y_{12}$ by $E[Y_{12} \mid y, \hat\theta] = 125\hat\theta/(2+\hat\theta)$.
This is the information if the estimated data (the expected value for $y_{12}$) had been genuine. We also have that:
\[ I_m(\theta) = \mathrm{Var}\!\left(\frac{\partial \log f(y, Z \mid \theta)}{\partial\theta} \,\middle|\, y, \theta\right) = \frac{\mathrm{Var}(Y_{12} \mid y, \theta)}{\theta^2}. \]
Note that all the observed terms have variance 0 conditioning on $y$, so only the $Y_{12}$ term contributes. Moreover,
\[ \mathrm{Var}(Y_{12} \mid y, \theta) = 125 \cdot \frac{\theta}{2+\theta} \cdot \frac{2}{2+\theta} = \frac{250\,\theta}{(2+\theta)^2}, \]
since $Y_{12} \mid y, \theta \sim \mathrm{Bin}\big(125, \theta/(2+\theta)\big)$. Therefore we have that
\[ I(\hat\theta) = I_c(\hat\theta) - I_m(\hat\theta), \]
giving the standard error of $\hat\theta$ as $1/\sqrt{I(\hat\theta)}$. Note that if the augmented data were genuine then the standard error of $\hat\theta$ would be $1/\sqrt{I_c(\hat\theta)}$. Thus, as we would expect, the presence of missing data leads to larger standard errors for the parameters.
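As a check (not the method used in the notes), the observed-data information for the genetic linkage example can also be written down directly from the log-likelihood (1.6) and compared with $I_c(\hat\theta) - I_m(\hat\theta)$; a short R sketch, with the same assumed cell probabilities as before and a function name of my own:

# Observed information -l''(theta) from the observed-data log-likelihood (1.6)
obs_info <- function(theta, y) {
  y[1] / (2 + theta)^2 + (y[2] + y[3]) / (1 - theta)^2 + y[4] / theta^2
}
# se <- 1 / sqrt(obs_info(theta.hat, y))   # theta.hat: the MLE found by the EM algorithm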
Each week R code associated with the examples in the lecture notes will be provided for practice either before or during the lab session. Ideally you should have a look at the R code in advance of the lab session.
Code for the genetic linkage data and the normal mixture example is available on Moodle, with a supplementary file for simulating data from a mixture of normal distributions. Familiarise yourself with these algorithms and generate different data sets to try out the normal mixture code.
For the right censored data example there is the outline of a function to implement the EM algorithm. Complete the EM algorithm code and test on the data given in the lecture notes.
Spend no more than 40 minutes on the above.
Epidemic Example
The Reed-Frost epidemic model is suitable for modelling a disease outbreak in a household of size $N$. The model assumes that the population is homogeneously mixing (infections are equally likely to occur between any pair of individuals) and is an SIR (Susceptible-Infected-Removed) model, where individuals can only be infected once. Each infectious individual has probability $p$, whilst infectious, of infecting any given susceptible individual in the household, independently for different susceptibles.
Consider a household of size 3 with 1 initial infective and 2 initially susceptible individuals. The initial infective can infect either 0, 1 or 2 of the initial susceptibles, and these events occur with probability $(1-p)^2$, $2p(1-p)$ and $p^2$, respectively. In the case that the initial infective infects nobody, the epidemic finishes with only one person infected. In the case that the initial infective infects two individuals, everybody has been infected and the epidemic infects three people. In the remaining case, where one initial susceptible is infected by the initial infective, either the second infective infects the remaining susceptible (probability $p$) or not (probability $1-p$). Therefore there are four distinct epidemic outcomes in the household.
The first outcome results in one person infected, the second outcome results in two people infected, and the final two outcomes result in all three people infected.
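A small R sketch of the final-size calculation just described, writing p for the person-to-person infection probability and q = 1 - p (the symbols and the function name are my own):

# Final-size probabilities for a Reed-Frost household of size 3 with one initial infective
final_size_probs <- function(p) {
  q <- 1 - p
  c(size1 = q^2,                  # initial infective infects nobody
    size2 = 2 * p * q * q,        # infects exactly one, who then infects nobody
    size3 = p^2 + 2 * p * q * p)  # infects both, or infects one who infects the other
}
# e.g. final_size_probs(0.3); the three probabilities sum to 1 for any p in [0, 1]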
Suppose that a community consists of 334 households of size 3, with the following epidemic final-size data observed.
This is the Rhode Island measles data set given in Bailey (1975), p.254.
Write down the likelihood of $p$ given $n = (n_1, n_2, n_3)$, where $n_j$ denotes the total number of households with $j$ people infected.
(Leave this question to the end, if you are short on time.) Find $\frac{d}{dp}\log L(p \mid n)$. Solve $\frac{d}{dp}\log L(p \mid n) = 0$ to find the MLE, $\hat{p}$.
Let $m$ denote the total number of households in which the initial infective infects both of the other individuals in the household, i.e. the outcome in which the initial infective directly infects both susceptibles occurs.
Write down the likelihood of $p$ given $n$ and $m$.
Simplify $\log L(p \mid n, m)$ and compute the MLE, $\hat{p}$, for $p$. (This will help form the M-step of the EM algorithm.)
What is the distribution of $m$ given $n$ and $p$? Hint: think about the similarities between this problem and the genetics example.
Write down $E[m \mid n, p]$. (This will help form the E-step of the EM algorithm.)
Write an EM algorithm in R to find the MLE, $\hat{p}$.
Calculate the standard error of the MLE, $\hat{p}$.
These are exam type questions for your practice.
In a biological experiment the number of offspring of a fruit fly is assumed to have probability mass function given by
\[ P(X = x) = p\,f_1(x) + (1-p)\,f_2(x), \qquad x = 0, 1, 2, \dots, \]
with
\[ f_1(x) = \begin{cases} 1 & x = 0, \\ 0 & x \ge 1, \end{cases} \]
and
\[ f_2(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \qquad x = 0, 1, 2, \dots \]
That is, $f_1$ is a point mass at 0 and $f_2$ is a Poisson distribution with parameter $\lambda$.
Suppose that $x_1, \dots, x_n$ are the offspring from $n$ fruit flies and that the parameters $p$ and $\lambda$ are unknown.
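To make the model concrete (and not as a solution to the questions below), here is a small R sketch that simulates offspring counts from this distribution; the function name is mine, and p and lambda label the point-mass probability and the Poisson mean as in the reconstruction above:

# Simulate offspring counts: with probability p the count is 0 (the point mass),
# otherwise it is drawn from Poisson(lambda)
r_offspring <- function(n, p, lambda) {
  ifelse(runif(n) < p, 0L, rpois(n, lambda))
}
# x <- r_offspring(100, p = 0.3, lambda = 2); c(prop_zero = mean(x == 0), mean = mean(x))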
Write down the joint likelihood of $p$ and $\lambda$ given $x = (x_1, \dots, x_n)$.
Let $\hat{q}$ denote the proportion of the fruit flies in the sample which have no offspring and let $\bar{x}$ denote the mean number of offspring per fruit fly in the sample.
Let $z_1, \dots, z_n$ be unobserved independent and identically distributed Bernoulli random variables with success probability $p$. Suppose that if $z_i = 1$ then $x_i$ is distributed according to $f_1$ (the point mass at 0), whereas if $z_i = 0$ then $x_i$ is distributed according to $f_2$ (the Poisson distribution).
Show that the joint log-likelihood of $p$ and $\lambda$ given $x$ and $z$ is
\[ \ell(p, \lambda \mid x, z) = m\log p + (n-m)\log(1-p) - (n-m)\lambda + \Big(\sum_{i=1}^{n} (1-z_i)x_i\Big)\log\lambda + K, \]
where $m = \sum_{i=1}^{n} z_i$ and $K$ is a constant independent of $p$ and $\lambda$.
Find the maximum likelihood estimates (MLEs) of $p$ and $\lambda$ given $x$ and $z$.
Show that, for given values of $p$ and $\lambda$,
\[ E[Z_i \mid x_i, p, \lambda] = \begin{cases} \dfrac{p}{p + (1-p)e^{-\lambda}} & \text{if } x_i = 0, \\[1ex] 0 & \text{if } x_i \ge 1. \end{cases} \]
Give a description of the EM algorithm, in the absence of $z$, for finding the MLEs of $p$ and $\lambda$.
Suppose that a group of $n$ patients are tested for angina. The test has a binary outcome: either the patient has angina or not. For $i = 1, \dots, n$, let $y_i = 1$ if patient $i$ has angina and $y_i = 0$ otherwise. Each patient also has their cholesterol level taken, with $x_i$ denoting the cholesterol level of patient $i$.
A probit model is assumed for the probability that a patient has angina. That is, if $Y_i$ is a binary random variable taking the value 1 if patient $i$ has angina and 0 otherwise, then
\[ P(Y_i = 1) = \Phi(\beta x_i), \]
where $\beta$ is an unknown parameter of interest and $\Phi(\cdot)$ denotes the cumulative distribution function of a standard normal. If $W \sim N(0, 1)$, then $\Phi(w) = P(W \le w)$.
It is difficult to obtain the maximum likelihood estimate, $\hat{\beta}$, of $\beta$ given $x$ and $y$. However, data imputation and the EM algorithm can be used to obtain $\hat{\beta}$.
Let $Z_i \sim N(\beta x_i, 1)$ and suppose that $y_i$ satisfies
\[ y_i = \begin{cases} 1 & \text{if } z_i > 0, \\ 0 & \text{if } z_i \le 0. \end{cases} \]
That is, the angina status of a patient ($y_i$) is deterministic (known) on the basis of the unobserved $z_i$, where $z_i$ follows a normal distribution with mean determined by $\beta$ and $x_i$.
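As an illustration of this augmentation (not a solution to the exercise), the latent-variable representation can be simulated directly in R; the function name and the parameter values in the usage comment are mine:

# Latent-variable form of the probit model: z_i ~ N(beta * x_i, 1) and
# y_i = 1 exactly when z_i > 0, so that P(y_i = 1) = pnorm(beta * x_i)
r_probit <- function(x, beta) {
  z <- rnorm(length(x), mean = beta * x, sd = 1)
  list(z = z, y = as.integer(z > 0))
}
# dat <- r_probit(x = rnorm(50, mean = 5), beta = 0.4)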
Show that the likelihood of $\beta$ given $x$ and $z$, $L(\beta \mid x, z)$, satisfies
\[ L(\beta \mid x, z) \propto \exp\Big\{-\frac{1}{2}\sum_{i=1}^{n} (z_i - \beta x_i)^2\Big\}. \]
Compute $E[Z_i \mid y_i, x_i, \beta]$.
Compute the maximum likelihood estimate, $\hat{\beta}$, of $\beta$ given $x$ and $z$.
Hint: Let $W \sim N(\mu, 1)$. Then for any $a$,
\[ E[W \mid W > a] = \mu + \frac{\phi(a-\mu)}{1 - \Phi(a-\mu)} \qquad \text{and} \qquad E[W \mid W \le a] = \mu - \frac{\phi(a-\mu)}{\Phi(a-\mu)}, \]
where $\phi(\cdot)$ denotes the probability density function of a standard normal distribution evaluated at its argument.
Outline an EM algorithm for obtaining $\hat{\beta}$.