We motivate the Expectation-Maximisation (EM) algorithm with a classical problem from the nineteenth century!
The first attempt at mixture modelling was by Pearson in 1894, when he was presented by the biologist W.F.R. Weldon with 1000 observations on crabs found in Naples; see [3]. The data are ratios of forehead width to body length; a frequency table is available at www.maths.uq.edu.au/~gjm/DATA/Crab.dat.
The following graph shows a histogram of the data together with a fitted mixture of two Gaussians.
[Figure: histogram of the crab data with the fitted two-component Gaussian mixture density overlaid.]
The red line gives the mixture density obtained using the EM algorithm and the green lines show the individual Gaussian densities, multiplied by the appropriate weighting. Hence the red line is the sum of the two green lines.
See www.math.mcmaster.ca/peter/mix/demex/excrabs.html for more details.
We shall consider the two-group case, although the ideas readily extend to more than two groups and from mixtures of Gaussians to mixtures of other distributions.
If a random variable $X$ is drawn from $N(\mu_1,\sigma_1^2)$ with probability $p_1$ and from $N(\mu_2,\sigma_2^2)$ with probability $p_2$, $p_1+p_2=1$, then the pdf of $X$ is
\[ f(x) = p_1\,\phi(x;\mu_1,\sigma_1^2) + p_2\,\phi(x;\mu_2,\sigma_2^2), \]
where $\phi(x;\mu_j,\sigma_j^2)$ is the pdf of $N(\mu_j,\sigma_j^2)$. This pdf is called a Gaussian mixture with two components.
Proof. Let $Z=j$ if $X$ comes from $N(\mu_j,\sigma_j^2)$, $j=1,2$. By definition, the cdf of $X$ is
\[ F(x) = P(X\le x) = p_1\,P(X\le x\mid Z=1) + p_2\,P(X\le x\mid Z=2) = p_1\,\Phi\!\left(\frac{x-\mu_1}{\sigma_1}\right) + p_2\,\Phi\!\left(\frac{x-\mu_2}{\sigma_2}\right). \]
Thus the pdf is
\[ f(x) = F'(x) = p_1\,\phi(x;\mu_1,\sigma_1^2) + p_2\,\phi(x;\mu_2,\sigma_2^2). \]
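As a quick illustration, the mixture pdf above can be evaluated in R with dnorm; a minimal sketch, where the function name dmix and the parameter values are my own choices, picked only to give two modes on roughly the scale of the crab ratios:

# Density of a two-component Gaussian mixture: p1*N(mu1, s1^2) + (1-p1)*N(mu2, s2^2)
dmix <- function(x, p1, mu1, s1, mu2, s2) {
  p1 * dnorm(x, mean = mu1, sd = s1) + (1 - p1) * dnorm(x, mean = mu2, sd = s2)
}
# Example (purely illustrative parameter values):
curve(dmix(x, p1 = 0.5, mu1 = 0.60, s1 = 0.02, mu2 = 0.66, s2 = 0.02),
      from = 0.55, to = 0.72, ylab = "density")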
Given a random sample $x_1,\dots,x_n$ from this distribution, the log-likelihood function is
\[ \ell(\theta) = \sum_{i=1}^{n} \log\big\{ p_1\,\phi(x_i;\mu_1,\sigma_1^2) + p_2\,\phi(x_i;\mu_2,\sigma_2^2) \big\}, \]
which cannot be maximised easily. Here $\theta = (p_1,\mu_1,\sigma_1^2,\mu_2,\sigma_2^2)$ and $p_2 = 1 - p_1$.
However, with additional information $z_1,\dots,z_n$ on which distribution each $x_i$ comes from, the likelihood function becomes
\[ L(\theta; x, z) = \prod_{i=1}^{n} \prod_{j=1}^{2} \big\{ p_j\,\phi(x_i;\mu_j,\sigma_j^2) \big\}^{I(z_i=j)}. \]
Thus the log-likelihood function becomes
\[ \ell(\theta; x, z) = \sum_{i=1}^{n} \sum_{j=1}^{2} I(z_i=j)\big\{ \log p_j + \log\phi(x_i;\mu_j,\sigma_j^2) \big\}, \]
which is simpler and can be maximised as follows, with ‘obvious’ solutions.
Setting the partial derivatives with respect to $\mu_1$ and $\sigma_1^2$ to zero gives
\[ \hat\mu_1 = \frac{\sum_{i=1}^{n} I(z_i=1)\,x_i}{\sum_{i=1}^{n} I(z_i=1)}, \qquad \hat\sigma_1^2 = \frac{\sum_{i=1}^{n} I(z_i=1)\,(x_i-\hat\mu_1)^2}{\sum_{i=1}^{n} I(z_i=1)}, \]
which are respectively the sample mean and sample variance of the sample from $N(\mu_1,\sigma_1^2)$. Similarly for $\hat\mu_2$ and $\hat\sigma_2^2$.
The rest of the maximisation uses the Lagrange multiplier method with constraint $p_1 + p_2 = 1$, giving
\[ \hat p_j = \frac{1}{n}\sum_{i=1}^{n} I(z_i=j), \qquad j = 1, 2, \]
which are the sample proportions from each population.
However, the data $z_1,\dots,z_n$ are not available in practice. How, then, can we make use of the above maximum likelihood estimator for $\theta$ given $x$ and $z$ when the only data available are $x_1,\dots,x_n$? This is where the EM algorithm comes into play. We iterate between the M-step (the MLE above), computing $\hat\theta$ based on the full data ($x$ and $z$), and an E-step, which estimates the unobserved $z$ given the data $x$ and the current parameter estimates $\hat\theta$.
Before formally introducing the EM algorithm, we outline the procedure for the mixture model.
To initialise, we need to choose initial estimates for the parameters of the model. The key thing is to choose values of $\mu_1$, $\mu_2$, $\sigma_1^2$, $\sigma_2^2$ and $p_1$ that are reasonable. For example, the data range from 0.57 to 0.7, therefore I choose two plausible values in this range for the means, with an initial variance close to the variance of the data. Initially I have no preference for either of the two Gaussian distributions making up the mixture, so take $p_1 = p_2 = 0.5$.
E-Step
The augmented data that we want are $z_1,\dots,z_n$, the mixture component (Gaussian distribution) to which each of the observations belongs. Given $x_1,\dots,x_n$ and the current estimate of $\theta$, we can compute the probability (expectation) of the observations coming from each of the two distributions.
Since the observations are assumed to be independent,
\[ P(z_1,\dots,z_n \mid x, \theta) = \prod_{i=1}^{n} P(z_i \mid x_i, \theta). \]
By Bayes’ theorem, for $j = 1, 2$,
\[ P(Z_i = j \mid x_i, \theta) = \frac{p_j\,\phi(x_i;\mu_j,\sigma_j^2)}{p_1\,\phi(x_i;\mu_1,\sigma_1^2) + p_2\,\phi(x_i;\mu_2,\sigma_2^2)}, \]
with $P(Z_i = 1 \mid x_i, \theta) + P(Z_i = 2 \mid x_i, \theta) = 1$.
Thus in the E-step we compute the expectation of $I(Z_i = 1)$, which we shall call $\gamma_i = P(Z_i = 1 \mid x_i, \theta)$ (with $E[I(Z_i = 2) \mid x_i, \theta] = 1 - \gamma_i$).
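In R, the E-step is a single vectorised application of Bayes’ theorem; a minimal sketch (the function and argument names are my own):

# E-step: probability that each observation belongs to component 1,
# given current parameter estimates (p1, mu1, s1, mu2, s2)
e_step <- function(x, p1, mu1, s1, mu2, s2) {
  d1 <- p1 * dnorm(x, mean = mu1, sd = s1)        # weighted density under component 1
  d2 <- (1 - p1) * dnorm(x, mean = mu2, sd = s2)  # weighted density under component 2
  d1 / (d1 + d2)                                  # vector of responsibilities gamma_i
}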
M-Step
We can now plug the expected values $\gamma_i$ for $I(z_i = 1)$ and $1 - \gamma_i$ for $I(z_i = 2)$, obtained in the E-step, into the full-data MLEs derived above to update our estimates of $\theta$. Specifically,
\[ \hat p_1 = \frac{1}{n}\sum_{i=1}^{n} \gamma_i, \qquad \hat\mu_1 = \frac{\sum_{i=1}^{n} \gamma_i x_i}{\sum_{i=1}^{n} \gamma_i}, \qquad \hat\sigma_1^2 = \frac{\sum_{i=1}^{n} \gamma_i (x_i - \hat\mu_1)^2}{\sum_{i=1}^{n} \gamma_i}, \]
and similarly for $\hat p_2$, $\hat\mu_2$ and $\hat\sigma_2^2$, with $\gamma_i$ replaced by $1 - \gamma_i$.
The algorithm then iterates between the two steps (E and M), updating the probability of each observation belonging to each of the two Gaussian distributions and then updating the estimates (via the MLEs) of the parameters. The process stops when we have convergence, that is, when two successive estimates of the parameters agree to some pre-defined precision.
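For completeness, here is a minimal, self-contained R sketch of the whole procedure for a two-component Gaussian mixture. The function name, default starting values, tolerance and stopping rule are illustrative assumptions of mine, not the code used to produce the results quoted below:

em_mix2 <- function(x, p1 = 0.5, mu1 = min(x), mu2 = max(x),
                    s1 = sd(x), s2 = sd(x), tol = 1e-8, maxit = 10000) {
  for (it in 1:maxit) {
    # E-step: responsibilities for component 1
    d1 <- p1 * dnorm(x, mu1, s1)
    d2 <- (1 - p1) * dnorm(x, mu2, s2)
    g  <- d1 / (d1 + d2)
    # M-step: weighted sample proportions, means and variances
    p1.new  <- mean(g)
    mu1.new <- sum(g * x) / sum(g)
    mu2.new <- sum((1 - g) * x) / sum(1 - g)
    s1.new  <- sqrt(sum(g * (x - mu1.new)^2) / sum(g))
    s2.new  <- sqrt(sum((1 - g) * (x - mu2.new)^2) / sum(1 - g))
    # stop when successive parameter estimates agree to the chosen precision
    if (max(abs(c(p1.new - p1, mu1.new - mu1, mu2.new - mu2,
                  s1.new - s1, s2.new - s2))) < tol) break
    p1 <- p1.new; mu1 <- mu1.new; mu2 <- mu2.new; s1 <- s1.new; s2 <- s2.new
  }
  list(p1 = p1.new, mu1 = mu1.new, s1 = s1.new,
       mu2 = mu2.new, s2 = s2.new, iterations = it)
}
# Usage on simulated data (illustrative only):
# x <- c(rnorm(600, 0.60, 0.02), rnorm(400, 0.66, 0.02)); em_mix2(x)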
For the Naples Bay crab data I required successive parameter estimates to differ by less than a small pre-chosen tolerance. This meant that, with my chosen starting values, 467 iterations were required (taking 0.22 seconds in R!), resulting in the fitted parameter values used to draw the mixture density shown above.
The EM algorithm is an iterative algorithm, introduced in [2], designed to compute the maximum likelihood estimate when there are missing data. It is iterative in that it consists of a series of steps, called iterations, in which the parameter value is repeatedly updated until a convergence criterion is met. The algorithm converges to a local maximum of the likelihood function (i.e. not necessarily the global maximum). Thus if the likelihood function is unimodal, the EM algorithm will converge to the maximum likelihood estimate (MLE).
The EM algorithm is primarily used for maximising the likelihood when the task becomes easier given more information associated with the existing data. This situation is called an incomplete data problem because we do not have this extra information. A prime example is mixture distributions, such as that given above. In the Naples crab example, if we knew from which of the two Gaussian distributions each observation came, parameter estimation would be straightforward.
The EM algorithm gets its name from the two steps involved in each iteration: the E-step (expectation) and the M-step (maximisation). The procedure alternates between these two steps (so it goes E, M, E, M, …) until convergence is achieved (successive values are sufficiently close).
Let $y$ represent our observed data and $\theta$ the parameter(s) of interest. (In the Naples crab example, $y$ represents the crab ratios and $\theta$ represents the parameters of the two Gaussian distributions and the mixture proportions.) By choosing a suitable parametric model we can write down the likelihood function $L(\theta; y)$ and our aim is then to find the value of $\theta$ that maximises $L(\theta; y)$, or equivalently $\log L(\theta; y)$, i.e. find the MLE (maximum likelihood estimate) of $\theta$. Suppose that it is possible to think of some additional data $z$, such that the ‘complete data’ log-likelihood $\log f(y, z \mid \theta)$ is of a simpler form than the ‘incomplete data’ log-likelihood $\log g(y \mid \theta)$ and thus easier to maximise. (In the Naples crab example, $z$ represents from which of the two Gaussian distributions each observation comes.) The log-likelihood of $y$ and $z$ is preferred but $z$ is not available, so we estimate the complete-data log-likelihood by taking its expectation with respect to $z$ given the available data $y$, in such a way that the estimated log-likelihood function remains easy to maximise. However, the expectation requires the value of $\theta$, thus it can only be done iteratively, using the current value of $\theta$. These are the key ideas behind the EM algorithm. Starting with an initial value for $\theta$, we:
1. Estimate the complete-data log-likelihood using the current value of $\theta$ by taking its expectation conditional on $y$.
2. Maximise the estimated log-likelihood with respect to $\theta$ to get an updated value for $\theta$.
3. Continue steps 1 and 2 until convergence is achieved.
Note that the above is given in terms of the log-likelihood, because it is often easier to work with than the likelihood of the complete data $y$ and $z$.
More specifically, the procedure is as follows. Let $\theta^{(t)}$ denote the current estimate of $\theta$; we define the function
\[ Q(\theta \mid \theta^{(t)}) = E\big[\log f(y, Z \mid \theta) \,\big|\, y, \theta^{(t)}\big], \]
that is, in the expectation step,
\[ Q(\theta \mid \theta^{(t)}) = \int \log f(y, z \mid \theta)\, k(z \mid y, \theta^{(t)})\, dz. \]
The expectation is with respect to the conditional distribution of $Z$ given the observed data, using the current estimate of $\theta$, i.e. $k(z \mid y, \theta^{(t)})$. Note that $Q(\theta \mid \theta^{(t)})$ is the expectation of the complete-data log-likelihood function, taking the unavailable data $z$ as random. The E- and M-steps can then formally be specified as:
The E-step: calculation of $Q(\theta \mid \theta^{(t)})$ (key elements of $Q$) as a function of $\theta$.
(Determination of $Q$ by taking expectation might sound daunting, but it is often fairly straightforward. Only the key elements of $Q$ are calculated, so that it can be maximised in the M-step.)
The M-step: maximisation of $Q(\theta \mid \theta^{(t)})$ with respect to $\theta$ to get $\theta^{(t+1)}$, which becomes $\theta^{(t)}$ in the next iteration.
(Maximisation of $Q(\theta \mid \theta^{(t)})$ should be much easier than maximising $\log g(y \mid \theta)$, often with an explicit solution.)
The EM algorithm can be interpreted as ‘data augmentation’ when the complete-data log-likelihood is linear in the missing data $z$, which is estimated in the E-step and substituted in the M-step. That is, rather than the somewhat off-putting $Q(\theta \mid \theta^{(t)})$, the E-step reduces to computing $E[Z \mid y, \theta^{(t)}]$, which is then substituted for $z$ in the M-step. In the Naples crab example this is not quite the case, but we simply computed the probability that each observation belongs to each of the two Gaussian distributions and substituted this, via $\gamma_i$, into the MLEs. In other words, it can be shown formally that the straightforward procedure we employed for the Naples crab data is in fact an implementation of the EM algorithm.
The EM algorithm is therefore a deterministic algorithm that alternates between the expectation and maximisation steps. Even though it works with the (estimated) complete data log-likelihood, it actually increases the incomplete data log-likelihood in each iteration. Let us prove this step by step.
First, we give an overview of the notation. Below we assume the data are continuous, for simplicity of exposition, but the same arguments (with minor modifications) apply to discrete data or a mixture of continuous and discrete data. Note that the Naples crab data are such a mixture: the crab ratios are continuous, whilst the allocation variables are discrete. The full-data likelihood is $f(y, z \mid \theta)$, the probability density function of the observed and augmented data given parameters $\theta$. The observed-data likelihood is $g(y \mid \theta) = \int f(y, z \mid \theta)\, dz$. Finally, $k(z \mid y, \theta) = f(y, z \mid \theta) / g(y \mid \theta)$ is the conditional distribution of $z$ given $y$ and $\theta$.
Because we maximise $Q(\theta \mid \theta^{(t)})$ as a function of its first argument to get $\theta^{(t+1)}$,
\[ Q(\theta^{(t+1)} \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) \ge 0. \qquad (1.1) \]
Since $f(y, z \mid \theta) = g(y \mid \theta)\, k(z \mid y, \theta)$, we can write
\[ Q(\theta^{(t+1)} \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) = \int k(z \mid y, \theta^{(t)}) \log \frac{k(z \mid y, \theta^{(t+1)})}{k(z \mid y, \theta^{(t)})}\, dz + \int k(z \mid y, \theta^{(t)}) \log \frac{g(y \mid \theta^{(t+1)})}{g(y \mid \theta^{(t)})}\, dz. \qquad (1.2) \]
Using the inequality $\log x \le x - 1$, we can deduce that the first integral cannot be positive:
\[ \int k(z \mid y, \theta^{(t)}) \log \frac{k(z \mid y, \theta^{(t+1)})}{k(z \mid y, \theta^{(t)})}\, dz \le \int k(z \mid y, \theta^{(t)}) \left\{ \frac{k(z \mid y, \theta^{(t+1)})}{k(z \mid y, \theta^{(t)})} - 1 \right\} dz = 1 - 1 = 0. \qquad (1.3) \]
Therefore, by (1.1), the second integral must be non-negative:
\[ \int k(z \mid y, \theta^{(t)}) \log \frac{g(y \mid \theta^{(t+1)})}{g(y \mid \theta^{(t)})}\, dz \ge 0. \qquad (1.4) \]
Since the integrand in (1.4) does not depend on $z$ and $k(z \mid y, \theta^{(t)})$ integrates to one, the left-hand side of (1.4) equals $\log g(y \mid \theta^{(t+1)}) - \log g(y \mid \theta^{(t)})$, so the log-likelihood is increasing (or at the very least non-decreasing) from one iteration to the next.
We have shown that at each iteration of the EM algorithm the likelihood is non-decreasing. Furthermore, it can be shown that if the iterations converge, then they converge to a stationary point of the likelihood function. Although this will usually be a local maximum, unless the likelihood function is unimodal it is not necessarily the global maximum, i.e. the MLE.
Advantages
It often has meaningful solutions and interpretation as data augmentation.
Convergence is easy to ‘detect’, since we wait until successive values of $\theta$ fall within a certain level of variation.
Disadvantages
It can be slow to converge.
Need to be able to compute the E-step – restricts applications. (This can be circumvented to some extent by using a Monte Carlo EM algorithm.)
This example actually originates from the paper [2] which introduced the EM algorithm.
The data consist of the genetic linkage of 197 animals, divided into four genetic categories, labelled 1 through to 4; see [4] and [2]. The probabilities that an animal belongs to each of the four categories are $\tfrac{1}{2} + \tfrac{\theta}{4}$, $\tfrac{1-\theta}{4}$, $\tfrac{1-\theta}{4}$ and $\tfrac{\theta}{4}$, respectively. Let $y = (y_1, y_2, y_3, y_4)$ denote the total numbers of animals in the four categories. For example, the probability of belonging to category 1 is $\tfrac{1}{2} + \tfrac{\theta}{4}$ and there are $y_1 = 125$ observed animals in category 1. It is possible to maximise this multinomial likelihood directly and hence obtain the MLE for $\theta$ without recourse to the EM algorithm. However, it is far simpler to apply the EM algorithm. To show that the EM algorithm works, we do both for this example.
MLE - observed data
The likelihood for $\theta$ given $y$ is
\[ L(\theta \mid y) \propto \left(\frac{1}{2} + \frac{\theta}{4}\right)^{y_1} \left(\frac{1-\theta}{4}\right)^{y_2} \left(\frac{1-\theta}{4}\right)^{y_3} \left(\frac{\theta}{4}\right)^{y_4}. \qquad (1.5) \]
Therefore the log-likelihood satisfies
\[ \ell(\theta) = y_1 \log(2+\theta) + (y_2 + y_3) \log(1-\theta) + y_4 \log\theta + C, \qquad (1.6) \]
where $C$ is a constant. Therefore
\[ \frac{d\ell}{d\theta} = \frac{y_1}{2+\theta} - \frac{y_2 + y_3}{1-\theta} + \frac{y_4}{\theta}. \qquad (1.7) \]
By setting $d\ell/d\theta = 0$, we obtain the following quadratic (in $\theta$) equation from (1.7),
\[ n\theta^2 - \{y_1 - 2(y_2 + y_3) - y_4\}\theta - 2 y_4 = 0, \qquad (1.8) \]
where $n = y_1 + y_2 + y_3 + y_4 = 197$. This yields
\[ \theta = \frac{\{y_1 - 2(y_2 + y_3) - y_4\} \pm \sqrt{\{y_1 - 2(y_2 + y_3) - y_4\}^2 + 8 n y_4}}{2n}. \]
Since $\theta$ needs to lie between 0 and 1, we take the positive root, which gives the MLE $\hat\theta$.
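As a hedged illustration, the direct maximisation can be coded in a few lines of R, assuming the cell probabilities $(\tfrac{1}{2}+\tfrac{\theta}{4}, \tfrac{1-\theta}{4}, \tfrac{1-\theta}{4}, \tfrac{\theta}{4})$ used above; the function name, and the example counts quoted in the final comment from Dempster, Laird and Rubin (1977), are my additions:

# Direct MLE: solve the quadratic n*t^2 - b*t - 2*y4 = 0 obtained by setting
# dl/dtheta = 0, and take the root lying in (0, 1)
linkage_mle_direct <- function(y) {
  n <- sum(y)
  b <- y[1] - 2 * (y[2] + y[3]) - y[4]
  (b + sqrt(b^2 + 8 * n * y[4])) / (2 * n)
}
# e.g. linkage_mle_direct(c(125, 18, 20, 34))  # counts as in Dempster, Laird and Rubin (1977)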
EM algorithm - full data
What extra information is going to make computation of the MLE simpler?
Suppose that we can choose $z$ so that we have an augmented-data likelihood of the form
\[ L(\theta \mid y, z) \propto \theta^{a}(1-\theta)^{b}, \qquad (1.9) \]
for some quantities $a$ and $b$ determined by the data. Then
\[ \ell(\theta \mid y, z) = a\log\theta + b\log(1-\theta) + \text{constant}, \]
with
\[ \frac{d\ell}{d\theta} = \frac{a}{\theta} - \frac{b}{1-\theta}. \]
Setting $d\ell/d\theta = 0$ gives the linear equation $a(1-\theta) = b\theta$. It is then trivial to show that $\hat\theta = a/(a+b)$.
Note that likelihoods of the form (1.9) are common in statistics, for example, binomial (geometric, negative binomial) data.
What is stopping the genetic linkage data from having a likelihood of the form (1.9)? The first cell, with probability $\tfrac{1}{2} + \tfrac{\theta}{4}$.
What if we break this probability into $\tfrac{1}{2}$ and $\tfrac{\theta}{4}$, a component independent of $\theta$ and a component proportional to $\theta$? Below we detail how this can be done.
Suppose that the observed cell count $y_1$ could be divided into two subcategories, with counts $y_{11}$ and $y_{12}$, where $y_{11}$ is the number of animals in the subcategory with cell probability $\tfrac{1}{2}$ and $y_{12}$ is the number in the subcategory with cell probability $\tfrac{\theta}{4}$, so that $y_{11} + y_{12} = y_1$. This would give an augmented data set $(y_{11}, y_{12}, y_2, y_3, y_4)$.
The likelihood for $\theta$ given $(y_{11}, y_{12}, y_2, y_3, y_4)$ is
\[ L(\theta \mid y_{11}, y_{12}, y_2, y_3, y_4) \propto \left(\frac{1}{2}\right)^{y_{11}} \left(\frac{\theta}{4}\right)^{y_{12}} \left(\frac{1-\theta}{4}\right)^{y_2} \left(\frac{1-\theta}{4}\right)^{y_3} \left(\frac{\theta}{4}\right)^{y_4}. \qquad (1.10) \]
Hence, there exists a constant $C$ such that
\[ \ell(\theta) = (y_{12} + y_4)\log\theta + (y_2 + y_3)\log(1-\theta) + C. \qquad (1.11) \]
(This yields $\hat\theta = (y_{12} + y_4)/(y_{12} + y_2 + y_3 + y_4)$, using the observations after (1.9).)
E - step
From (1.11), we have that
\[ Q(\theta \mid \theta^{(t)}) = E[\ell(\theta) \mid y, \theta^{(t)}] = \big(E[Y_{12} \mid y, \theta^{(t)}] + y_4\big)\log\theta + (y_2 + y_3)\log(1-\theta) + C, \]
since $y_2$, $y_3$ and $y_4$ are known, fixed constants and $C$ does not depend on $\theta$, so plays no role in the M-step (it can be ignored). Therefore the only quantity that needs to be computed in the E-step is $E[Y_{12} \mid y, \theta^{(t)}]$.
What is the distribution of $Y_{12}$?
There are 125 animals in category 1, which independently can be assigned to either the subcategory with probability $\tfrac{1}{2}$ or the subcategory with probability $\tfrac{\theta}{4}$. The conditional probability of an animal belonging to the latter subcategory, given that it belongs to category 1, is simply
\[ \frac{\theta/4}{1/2 + \theta/4} = \frac{\theta}{2+\theta}. \]
Thus $Y_{12} \mid y_1, \theta \sim \mathrm{Bin}\big(125, \theta/(2+\theta)\big)$, and hence, $E[Y_{12} \mid y, \theta] = 125\theta/(2+\theta)$. This gives
\[ E[Y_{12} \mid y, \theta^{(t)}] = \frac{125\,\theta^{(t)}}{2+\theta^{(t)}}. \qquad (1.12) \]
M - step
The M-step is straightforward and simply involves finding $\theta^{(t+1)}$, the value of $\theta$ maximising $Q(\theta \mid \theta^{(t)})$ (which we have effectively already done above). From (1.12), we have that
\[ Q(\theta \mid \theta^{(t)}) = \left(\frac{125\,\theta^{(t)}}{2+\theta^{(t)}} + y_4\right)\log\theta + (y_2 + y_3)\log(1-\theta) + C. \]
Hence $\theta^{(t+1)}$ solves
\[ \frac{1}{\theta}\left(\frac{125\,\theta^{(t)}}{2+\theta^{(t)}} + y_4\right) - \frac{y_2 + y_3}{1-\theta} = 0, \]
which, following the steps above, yields
\[ \theta^{(t+1)} = \frac{125\,\theta^{(t)}/(2+\theta^{(t)}) + y_4}{125\,\theta^{(t)}/(2+\theta^{(t)}) + y_2 + y_3 + y_4}. \]
For this example, starting from an initial value for $\theta$, successive iterations of the EM algorithm converge to the maximum likelihood estimate (MLE) obtained directly above.
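A minimal R sketch of this iteration (the function name, starting value, tolerance and the example counts quoted in the final comment from Dempster, Laird and Rubin (1977) are my additions):

# EM for the genetic linkage data; y = (y1, y2, y3, y4) are the cell counts
em_linkage <- function(y, theta = 0.5, tol = 1e-8, maxit = 1000) {
  for (it in 1:maxit) {
    y12 <- y[1] * theta / (2 + theta)                       # E-step: E[Y12 | y, theta]
    theta.new <- (y12 + y[4]) / (y12 + y[2] + y[3] + y[4])  # M-step
    if (abs(theta.new - theta) < tol) { theta <- theta.new; break }
    theta <- theta.new
  }
  theta
}
# e.g. em_linkage(c(125, 18, 20, 34))  # counts as in Dempster, Laird and Rubin (1977)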
The EM algorithm is particularly useful where we have censored data, which arise in many experiments and medical trials. In a medical study, we could be interested in the time until death following major surgery. However, we may only be able to follow up patients for 2 years after surgery. That is, for a patient still alive 2 years after surgery we don’t know when the death of the patient occurs, only that death occurs more than 2 years after surgery.
Suppose that $X_1, \dots, X_n$ are independent and identically distributed according to a density $f(\cdot \mid \theta)$. However, assume that there is right censoring at the value $a$. That is, for all observations greater than $a$, the only information we have is that the observation exceeds $a$. The common notation is to denote censored values superscripted with an asterisk. Therefore, we have an incomplete sample (suitably rearranged)
\[ y = (x_1, \dots, x_m, a^{*}, \dots, a^{*}), \]
in which the first $m$ observations are uncensored and the remaining $n - m$ are censored at $a$.
In this case, the likelihood based on all we know about $y$ is
\[ L(\theta \mid y) = \prod_{i=1}^{m} f(x_i \mid \theta) \times \{1 - F(a \mid \theta)\}^{n-m}, \]
where $F$ is the cdf (cumulative distribution function) corresponding to $f$. Typically it is difficult to maximise because of the presence of the non-standard term $\{1 - F(a \mid \theta)\}^{n-m}$. Let $z = (z_{m+1}, \dots, z_n)$ represent the true but unobserved values of the censored observations. Then the pair $(y, z)$ provides the complete data and
\[ L(\theta \mid y, z) = \prod_{i=1}^{m} f(x_i \mid \theta) \times \prod_{i=m+1}^{n} f(z_i \mid \theta), \]
which is usually straightforward to maximise.
Suppose that $X_i \sim \mathrm{Exp}(\lambda)$, where the rate $\lambda$ is unknown, and suppose that we have 15 observations, with the data right-censored at a known value $a$.
Note that the cdf of $X_i$ is given by
\[ F(x \mid \lambda) = 1 - e^{-\lambda x}, \qquad x \ge 0, \]
whilst the pdf of $X_i$ is given by $f(x \mid \lambda) = \lambda e^{-\lambda x}$, $x \ge 0$. We reorder the data into ascending order, so that $x_1, \dots, x_m$ are the uncensored observations and the remaining $n - m$ observations are censored at $a$; then
\[ L(\lambda \mid y) = \prod_{i=1}^{m} \lambda e^{-\lambda x_i} \times \big\{1 - F(a \mid \lambda)\big\}^{n-m}. \]
This is difficult to maximise directly.
Let $z_i$ denote the actual value of the $i$th observation. Then $z_i = x_i$ for $i = 1, \dots, m$, and $z_i > a$ for $i = m+1, \dots, n$.
Then
\[ L(\lambda \mid y, z) = \prod_{i=1}^{n} \lambda e^{-\lambda z_i} = \lambda^{n} \exp\Big(-\lambda \sum_{i=1}^{n} z_i\Big), \]
so that
\[ \ell(\lambda \mid y, z) = n\log\lambda - \lambda\sum_{i=1}^{n} z_i + K, \]
where $K$ does not depend on $\lambda$.
Hence,
\[ \frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} z_i, \]
which when set equal to 0 solves to give $\hat\lambda = 1/\bar{z}$, where $\bar{z}$ is the sample mean of $z_1, \dots, z_n$.
Now conditioning on what we know about $z$, for each censored observation
\[ E[Z_i \mid Z_i > a, \lambda] = a + \frac{1}{\lambda}, \]
where the expectation is computed using the conditional density of $Z_i$ given $Z_i > a$ (the lack-of-memory property of the exponential distribution). This leads to
\[ Q(\lambda \mid \lambda^{(t)}) = n\log\lambda - \lambda\left\{\sum_{i=1}^{m} x_i + (n-m)\left(a + \frac{1}{\lambda^{(t)}}\right)\right\}, \]
which is maximised at
\[ \lambda^{(t+1)} = \frac{n}{\sum_{i=1}^{m} x_i + (n-m)\big(a + 1/\lambda^{(t)}\big)}, \]
where $\sum_{i=1}^{m} x_i$ is the sum of the uncensored observations and $n - m$ is the number of censored observations.
The EM algorithm consists of the calculation of $a + 1/\lambda^{(t)}$ in the E-step and of $\lambda^{(t+1)}$ in the M-step, which are repeated until convergence.
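A minimal R sketch of this scheme, assuming the rate parameterisation used above; the function name, starting value and tolerance are mine, and the usage comment simulates illustrative data rather than reproducing the 15 observations referred to in the notes:

# EM for exponential data right-censored at a; censored observations are recorded as a
em_exp_cens <- function(x, a, lambda = 1, tol = 1e-8, maxit = 10000) {
  n <- length(x)
  cens <- (x >= a)                       # indicator of censoring
  for (it in 1:maxit) {
    # E-step: for censored observations, E[Z | Z > a] = a + 1/lambda
    z <- ifelse(cens, a + 1 / lambda, x)
    # M-step: complete-data MLE of the rate is 1 / mean(z)
    lambda.new <- 1 / mean(z)
    if (abs(lambda.new - lambda) < tol) { lambda <- lambda.new; break }
    lambda <- lambda.new
  }
  lambda
}
# Usage (simulated data, censoring point a = 2, purely illustrative):
# x <- pmin(rexp(15, rate = 1.5), 2); em_exp_cens(x, a = 2)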
For the given data, the EM algorithm converges to the MLE $\hat\lambda$.
Note that at convergence successive iterates are equal: $\lambda^{(t+1)} = \lambda^{(t)} = \lambda^{*}$, say.
Now
\[ \lambda^{*} = \frac{n}{\sum_{i=1}^{m} x_i + (n-m)\big(a + 1/\lambda^{*}\big)}, \]
where rearranging gives $\lambda^{*}\big\{\sum_{i=1}^{m} x_i + (n-m)a\big\} = n - (n-m) = m$. Therefore
\[ \lambda^{*} = \frac{m}{\sum_{i=1}^{m} x_i + (n-m)a}, \]
which is precisely the maximum likelihood estimate based on the observed (censored) data.
As usual in maximum likelihood estimation, the calculation of standard errors requires the calculation of the inverse Hessian matrix (Fisher information):
\[ I(\theta) = -\frac{\partial^2 \log g(y \mid \theta)}{\partial\theta\,\partial\theta^{T}}, \qquad (1.13) \]
evaluated at the MLE $\hat\theta$. For missing data problems this is usually difficult to evaluate directly. However, the structure of the EM algorithm can be exploited to assist in the calculations. Although the methodology can be applied to a vector $\theta$, we will restrict attention to the case where $\theta$ is a single parameter. In this case, the standard error of the estimator is
\[ \mathrm{se}(\hat\theta) = \frac{1}{\sqrt{I(\hat\theta)}}. \qquad (1.14) \]
First note that $g(y \mid \theta) = f(y, z \mid \theta) / k(z \mid y, \theta)$, thus
\[ -\frac{\partial^2 \log g(y \mid \theta)}{\partial\theta^2} = -\frac{\partial^2 \log f(y, z \mid \theta)}{\partial\theta^2} + \frac{\partial^2 \log k(z \mid y, \theta)}{\partial\theta^2}. \qquad (1.15) \]
The equation (1.15) forms the basis for the missing information principle: the observed information equals the complete information minus the missing information.
Multiplying both sides of (1.15) by $k(z \mid y, \theta')$ and integrating with respect to $z$, we get (for any $\theta'$)
\[ I(\theta) = I_c(\theta) - I_m(\theta), \qquad (1.16) \]
where
\[ I_c(\theta) = -\int k(z \mid y, \theta')\,\frac{\partial^2 \log f(y, z \mid \theta)}{\partial\theta^2}\,dz \qquad \text{and} \qquad I_m(\theta) = -\int k(z \mid y, \theta')\,\frac{\partial^2 \log k(z \mid y, \theta)}{\partial\theta^2}\,dz. \]
Here we use $\theta'$ to avoid differentiation with respect to it; it will take the same value as $\theta$ (namely $\hat\theta$) in the following.
We know how to deal with the $I_c$ function. For the $I_m$ function, we can use
\[ I_m(\theta) = \mathrm{Var}\!\left(\frac{\partial \log k(Z \mid y, \theta)}{\partial\theta} \,\middle|\, y, \theta\right) = \mathrm{Var}\!\left(\frac{\partial \log f(y, Z \mid \theta)}{\partial\theta} \,\middle|\, y, \theta\right). \]
(The second equality follows since, conditional upon $y$ and $\theta$, $\partial \log g(y \mid \theta)/\partial\theta$ is a constant and so does not alter the variance.)
The calculation of standard errors is best illustrated using an example. We shall use the genetics problem seen earlier. In that case
\[ I_c(\theta) = \frac{E[Y_{12} \mid y, \theta] + y_4}{\theta^2} + \frac{y_2 + y_3}{(1-\theta)^2}, \]
since
\[ -\frac{\partial^2 \log f(y, z \mid \theta)}{\partial\theta^2} = \frac{y_{12} + y_4}{\theta^2} + \frac{y_2 + y_3}{(1-\theta)^2}, \]
and we simply replace $y_{12}$ by $E[Y_{12} \mid y, \hat\theta] = 125\hat\theta/(2+\hat\theta)$.
This is the information if the estimated data (the expected value for $y_{12}$) had been genuine. We also have that:
\[ I_m(\theta) = \mathrm{Var}\!\left(\frac{\partial \log f(y, Z \mid \theta)}{\partial\theta} \,\middle|\, y, \theta\right) = \frac{\mathrm{Var}(Y_{12} \mid y, \theta)}{\theta^2}. \]
Note that all the observed terms have variance 0 conditioning on $y$, so only the $Y_{12}$ term contributes. Moreover,
\[ \mathrm{Var}(Y_{12} \mid y, \theta) = 125 \cdot \frac{\theta}{2+\theta} \cdot \frac{2}{2+\theta} = \frac{250\,\theta}{(2+\theta)^2}, \]
since $Y_{12} \mid y, \theta \sim \mathrm{Bin}\big(125, \theta/(2+\theta)\big)$. Therefore we have that
\[ I(\hat\theta) = I_c(\hat\theta) - I_m(\hat\theta), \]
giving the standard error of $\hat\theta$ as $1/\sqrt{I(\hat\theta)}$. Note that if the augmented data were genuine then the standard error of $\hat\theta$ would be $1/\sqrt{I_c(\hat\theta)}$. Thus, as we would expect, the presence of missing data leads to larger standard errors for the parameters.
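As a check (not the method used in the notes), the observed-data information for the genetic linkage example can also be written down directly from the log-likelihood (1.6) and compared with $I_c(\hat\theta) - I_m(\hat\theta)$; a short R sketch, with the same assumed cell probabilities as before and a function name of my own:

# Observed information -l''(theta) from the observed-data log-likelihood (1.6)
obs_info <- function(theta, y) {
  y[1] / (2 + theta)^2 + (y[2] + y[3]) / (1 - theta)^2 + y[4] / theta^2
}
# se <- 1 / sqrt(obs_info(theta.hat, y))   # theta.hat: the MLE found by the EM algorithm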
Each week R code associated with the examples in the lecture notes will be provided for practice either before or during the lab session. Ideally you should have a look at the R code in advance of the lab session.
Code for the genetic linkage data and the normal mixture example is available on Moodle, with a supplementary file for simulating data from a mixture of normal distributions. Familiarise yourself with these algorithms and generate different data sets to try out the normal mixture code.
For the right censored data example there is the outline of a function to implement the EM algorithm. Complete the EM algorithm code and test on the data given in the lecture notes.
Spend no more than 40 minutes on the above.
Epidemic Example
The Reed-Frost epidemic model is suitable for modelling a disease outbreak in a household of size $N$. The model assumes that the population is homogeneously mixing (infections are equally likely to occur between any pair of individuals) and is an SIR (Susceptible-Infected-Removed) model, where individuals can only be infected once. Each infectious individual has probability $p$, whilst infectious, of infecting any given susceptible individual in the household, independently for different susceptibles.
Consider a household of size 3 with 1 initial infective and 2 initially susceptible individuals. The initial infective can infect either 0, 1 or 2 of the initial susceptibles, and these events occur with probability $(1-p)^2$, $2p(1-p)$ and $p^2$, respectively. In the case that the initial infective infects nobody, the epidemic finishes with only one person infected. In the case that the initial infective infects two individuals, everybody has been infected and the epidemic infects three people. In the remaining case, where one initial susceptible is infected by the initial infective, either the second infective infects the remaining susceptible (probability $p$) or not (probability $1-p$). Therefore there are four distinct epidemic outcomes in the household.
The first outcome results in one person infected, the second outcome results in two people infected, and the final two outcomes result in all three people infected.
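A small R sketch of the final-size calculation just described, writing p for the person-to-person infection probability and q = 1 - p (the symbols and the function name are my own):

# Final-size probabilities for a Reed-Frost household of size 3 with one initial infective
final_size_probs <- function(p) {
  q <- 1 - p
  c(size1 = q^2,                  # initial infective infects nobody
    size2 = 2 * p * q * q,        # infects exactly one, who then infects nobody
    size3 = p^2 + 2 * p * q * p)  # infects both, or infects one who infects the other
}
# e.g. final_size_probs(0.3); the three probabilities sum to 1 for any p in [0, 1]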
Suppose that a community consists of 334 households of size 3, with the following epidemic final-size data observed.
This is the Rhode Island measles data set given in Bailey (1975), p.254.
Write down the likelihood of $p$ given $n = (n_1, n_2, n_3)$, where $n_j$ denotes the total number of households with $j$ people infected.
(Leave this question to the end, if you are short on time.) Find $\frac{d}{dp}\log L(p \mid n)$. Solve $\frac{d}{dp}\log L(p \mid n) = 0$ to find the MLE, $\hat{p}$.
Let $m$ denote the total number of households in which the initial infective infects both of the other individuals in the household, i.e. the outcome in which the initial infective directly infects both susceptibles occurs.
Write down the likelihood of $p$ given $n$ and $m$.
Simplify $\log L(p \mid n, m)$ and compute the MLE, $\hat{p}$, for $p$. (This will help form the M-step of the EM algorithm.)
What is the distribution of $m$ given $n$ and $p$? Hint: think about the similarities between this problem and the genetics example.
Write down $E[m \mid n, p]$. (This will help form the E-step of the EM algorithm.)
Write an EM algorithm in R to find the MLE, $\hat{p}$.
Calculate the standard error of the MLE, $\hat{p}$.
These are exam type questions for your practice.
In a biological experiment the number of offspring of a fruit fly is assumed to have probability mass function given by
\[ P(X = x) = p\,f_1(x) + (1-p)\,f_2(x), \qquad x = 0, 1, 2, \dots, \]
with
\[ f_1(x) = \begin{cases} 1 & x = 0, \\ 0 & x \ge 1, \end{cases} \]
and
\[ f_2(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \qquad x = 0, 1, 2, \dots \]
That is, $f_1$ is a point mass at 0 and $f_2$ is a Poisson distribution with parameter $\lambda$.
Suppose that $x_1, \dots, x_n$ are the offspring from $n$ fruit flies and that the parameters $p$ and $\lambda$ are unknown.
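To make the model concrete (and not as a solution to the questions below), here is a small R sketch that simulates offspring counts from this distribution; the function name is mine, and p and lambda label the point-mass probability and the Poisson mean as in the reconstruction above:

# Simulate offspring counts: with probability p the count is 0 (the point mass),
# otherwise it is drawn from Poisson(lambda)
r_offspring <- function(n, p, lambda) {
  ifelse(runif(n) < p, 0L, rpois(n, lambda))
}
# x <- r_offspring(100, p = 0.3, lambda = 2); c(prop_zero = mean(x == 0), mean = mean(x))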
Write down the joint likelihood of $p$ and $\lambda$ given $x = (x_1, \dots, x_n)$.
Let $\hat{q}$ denote the proportion of the fruit flies in the sample which have no offspring and let $\bar{x}$ denote the mean number of offspring per fruit fly in the sample.
Let $z_1, \dots, z_n$ be unobserved independent and identically distributed Bernoulli random variables with success probability $p$. Suppose that if $z_i = 1$ then $x_i$ is distributed according to $f_1$ (the point mass at 0), whereas if $z_i = 0$ then $x_i$ is distributed according to $f_2$ (the Poisson distribution).
Show that the joint log-likelihood of $p$ and $\lambda$ given $x$ and $z$ is
\[ \ell(p, \lambda \mid x, z) = m\log p + (n-m)\log(1-p) - (n-m)\lambda + \Big(\sum_{i=1}^{n} (1-z_i)x_i\Big)\log\lambda + K, \]
where $m = \sum_{i=1}^{n} z_i$ and $K$ is a constant independent of $p$ and $\lambda$.
Find the maximum likelihood estimates (MLEs) of $p$ and $\lambda$ given $x$ and $z$.
Show that, for given values of $p$ and $\lambda$,
\[ E[Z_i \mid x_i, p, \lambda] = \begin{cases} \dfrac{p}{p + (1-p)e^{-\lambda}} & \text{if } x_i = 0, \\[1ex] 0 & \text{if } x_i \ge 1. \end{cases} \]
Give a description of the EM algorithm, in the absence of $z$, for finding the MLEs of $p$ and $\lambda$.
Suppose that a group of $n$ patients are tested for angina. The test has a binary outcome: either the patient has angina or not. For $i = 1, \dots, n$, let $y_i = 1$ if patient $i$ has angina and $y_i = 0$ otherwise. Each patient also has their cholesterol level taken, with $x_i$ denoting the cholesterol level of patient $i$.
A probit model is assumed for the probability that a patient has angina. That is, if $Y_i$ is a binary random variable taking the value 1 if patient $i$ has angina and 0 otherwise, then
\[ P(Y_i = 1) = \Phi(\beta x_i), \]
where $\beta$ is an unknown parameter of interest and $\Phi(\cdot)$ denotes the cumulative distribution function of a standard normal. If $W \sim N(0, 1)$, then $\Phi(w) = P(W \le w)$.
It is difficult to obtain the maximum likelihood estimate, $\hat{\beta}$, of $\beta$ given $x$ and $y$. However, data imputation and the EM algorithm can be used to obtain $\hat{\beta}$.
Let $Z_i \sim N(\beta x_i, 1)$ and suppose that $y_i$ satisfies
\[ y_i = \begin{cases} 1 & \text{if } z_i > 0, \\ 0 & \text{if } z_i \le 0. \end{cases} \]
That is, the angina status of a patient ($y_i$) is deterministic (known) on the basis of the unobserved $z_i$, where $z_i$ follows a normal distribution with mean determined by $\beta$ and $x_i$.
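As an illustration of this augmentation (not a solution to the exercise), the latent-variable representation can be simulated directly in R; the function name and the parameter values in the usage comment are mine:

# Latent-variable form of the probit model: z_i ~ N(beta * x_i, 1) and
# y_i = 1 exactly when z_i > 0, so that P(y_i = 1) = pnorm(beta * x_i)
r_probit <- function(x, beta) {
  z <- rnorm(length(x), mean = beta * x, sd = 1)
  list(z = z, y = as.integer(z > 0))
}
# dat <- r_probit(x = rnorm(50, mean = 5), beta = 0.4)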
Show that the likelihood of $\beta$ given $x$ and $z$, $L(\beta \mid x, z)$, satisfies
\[ L(\beta \mid x, z) \propto \exp\Big\{-\frac{1}{2}\sum_{i=1}^{n} (z_i - \beta x_i)^2\Big\}. \]
Compute $E[Z_i \mid y_i, x_i, \beta]$.
Compute the maximum likelihood estimate, $\hat{\beta}$, of $\beta$ given $x$ and $z$.
Hint: Let $W \sim N(\mu, 1)$. Then for any $a$,
\[ E[W \mid W > a] = \mu + \frac{\phi(a-\mu)}{1 - \Phi(a-\mu)} \qquad \text{and} \qquad E[W \mid W \le a] = \mu - \frac{\phi(a-\mu)}{\Phi(a-\mu)}, \]
where $\phi(\cdot)$ denotes the probability density function of a standard normal distribution evaluated at its argument.
Outline an EM algorithm for obtaining $\hat{\beta}$.