Data augmentation is a useful tool for obtaining samples from the posterior distribution of parameters when the posterior is not in a convenient form for analysis. The Gibbs sampler with data augmentation is primarily used in the following two situations:
In real world problems there is often missing data. Although in theory the distribution of the observed data can be obtained by integrating out the missing data, this is often difficult or impossible to do.
The likelihood of the data is not tractable in its current form, but conditional upon the collection of unobserved (extra) data, the likelihood becomes tractable. This corresponds to the situation we have observed with the EM algorithm and there are similarities (as well as obvious differences) between the EM algorithm and Gibbs sampler in data augmentation problems.
More generally, data augmentation is used extensively within MCMC (Markov chain Monte Carlo) algorithms to assist with obtaining samples from .
The generic set up is as follows. Let denote the observed data and let denote the model parameters. Then
and this can be difficult to work with if the likelihood is not in a convenient form. Now suppose that there is extra information such that the joint distribution of and is more convenient to work with. We can then look to construct a Gibbs sampler (or alternative MCMC algorithm) to obtain samples from . It is then trivial to obtain samples from the marginal distribution by simply ignoring the values. In other words, we focus on the marginal distribution . Note that
Then the Gibbs sampler alternates between updating the parameters and augmented data.
Update given and , i.e. use .
Update given and , i.e. use .
Both the updates of and will often be broken down into a number of steps. Note the similarities to the EM algorithm with step 1 replacing the M-step (updating the parameters given the observed and augmented data) and step 2 replacing the E-step (updating the augmented data given the observed data and parameters).
We illustrate data augmentation within the Gibbs sampler using a range of examples.
We begin by illustrating data augmentation Gibbs sampling with a very simple example, which is a simplified version of the Normal mixtures introduced in the Week 6 notes for the EM algorithm.
Suppose that are iid (independent and identically distributed) from the mixture density
That is,
Thus the mixture can be constructed as follows. For each observation toss a fair coin and let denote the outcome of the coin toss with if the coin shows a head and otherwise. If the coin toss shows a head draw from , otherwise draw from .
The likelihood can be written explicitly:-
where . It is difficult to maximise (for the MLE) or to compute the posterior distribution for using the given likelihood. However, if we knew the results of the coin tosses (i.e. if we knew ), then we would know which observations come from each component, and the posterior distribution (or MLE) would be straightforward to derive.
Let denote the outcome of the coin tosses and let , the set of coin tosses that are tails. Thus for , is drawn from .
Original likelihood:-
Augmented likelihood:-
Therefore if we take a prior for , we have that
Thus the posterior distribution of is
where .
On the other hand let denote the outcome of the coin toss. Then it is straightforward to show that
The Gibbs sampling algorithm is:-
Initial value for . (A reasonable starting value is , the sample mean.)
For , set with probability , and set otherwise. This updates the set .
Sample .
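Since the notation above is abbreviated in these notes, the following is a minimal sketch in R of this data-augmentation Gibbs sampler under an assumed version of the model: a fair 50/50 mixture of a N(mu, 1) and a N(0, 1) component, with a flat (improper uniform) prior on mu. The function name mixture.gibbs and these modelling choices are illustrative assumptions rather than the exact specification above.

```r
## Sketch of a data-augmentation Gibbs sampler for a two-component Normal
## mixture. Assumed model: x_i ~ 0.5*N(mu, 1) + 0.5*N(0, 1), with a flat
## (improper uniform) prior on mu. z_i = 1 indicates that x_i came from the
## N(mu, 1) component.
mixture.gibbs <- function(x, n.iter = 1000, mu.init = mean(x)) {
  n <- length(x)
  mu <- numeric(n.iter + 1)
  mu[1] <- mu.init
  for (t in 1:n.iter) {
    ## Update the augmented data: P(z_i = 1 | x_i, mu)
    p1 <- dnorm(x, mean = mu[t], sd = 1)
    p0 <- dnorm(x, mean = 0, sd = 1)
    z <- rbinom(n, size = 1, prob = p1 / (p1 + p0))
    ## Update mu given the allocations: mu | x, z ~ N(mean(x[z == 1]), 1/n1)
    n1 <- sum(z)
    if (n1 > 0) {
      mu[t + 1] <- rnorm(1, mean = mean(x[z == 1]), sd = 1 / sqrt(n1))
    } else {
      mu[t + 1] <- mu[t]  # no observations allocated; keep the current value
    }
  }
  mu
}

## Example usage on simulated data:
## x <- c(rnorm(50, 2, 1), rnorm(50, 0, 1))
## out <- mixture.gibbs(x, n.iter = 2000)
```

The two updates inside the loop correspond directly to the two steps of the algorithm above.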
We revisit the genetic linkage example from the Week 6 notes on the EM algorithm. Whilst we are now focused on the posterior distribution rather than the MLE of the parameter , we use exactly the same data augmentation as before.
The data consist of the genetic linkage of 197 animals, divided into four genetic categories labelled 1 to 4; see [1]. The probabilities that an animal belongs to each of the four categories are , respectively. Let denote the total number of animals in each category.
We are interested in estimating . Note that for the category probabilities to be valid we require that . Therefore we shall assign an (uninformative) uniform prior to . This is a multinomial experiment (4 different outcomes), so
The posterior distribution of is not of a particularly nice form,
Suppose that the observed cell could be divided into two subcategories and . Suppose that is the number of animals in subcategory with cell probability . This would give an augmented data set . Then
Thus
Therefore the posterior density is proportional to a Beta density giving
What is ?
There are 125 animals in categories and .
The conditional probability that the animal belongs to category given that the animal belongs to category 1 is
Thus .
Therefore the Gibbs sampler iterates between the following two equations:
(Note that the above is for general , not just the data given.)
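As an illustration, here is a hedged R sketch of this Gibbs sampler. It assumes the standard form of the genetic linkage example: cell probabilities (1/2 + theta/4, (1-theta)/4, (1-theta)/4, theta/4), observed counts y = (125, 18, 20, 34) (which sum to 197), and a Uniform(0,1) prior on theta. These specifics should be checked against the notes and [1].

```r
## Sketch of the data-augmentation Gibbs sampler for the genetic linkage example.
## Assumptions (standard in this example, but check against the notes):
##   cell probabilities (1/2 + theta/4, (1-theta)/4, (1-theta)/4, theta/4),
##   counts y = c(125, 18, 20, 34), and a Uniform(0,1) prior on theta.
## The first cell is split into two subcategories with probabilities 1/2 and
## theta/4; z is the (augmented) count in the theta/4 subcategory.
linkage.gibbs <- function(y = c(125, 18, 20, 34), n.iter = 1000, theta.init = 0.5) {
  theta <- numeric(n.iter + 1)
  theta[1] <- theta.init
  for (t in 1:n.iter) {
    ## Augmented data: z | y, theta ~ Binomial(y1, theta / (theta + 2))
    z <- rbinom(1, size = y[1], prob = theta[t] / (theta[t] + 2))
    ## Parameter: theta | y, z ~ Beta(z + y4 + 1, y2 + y3 + 1)
    theta[t + 1] <- rbeta(1, z + y[4] + 1, y[2] + y[3] + 1)
  }
  theta
}

## theta.samples <- linkage.gibbs(n.iter = 1100)[-(1:101)]  # discard burn-in
```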
A sample of size 100 from the posterior distribution was obtained using the above Gibbs sampler and the output is presented below. Initial values and were chosen.
[Figure: Gibbs sampler output for the genetic linkage example]
The algorithm appears to converge immediately, but just to be safe I shall disregard the first 20 iterations as burn-in. The following summary statistics are then obtained for :-
Estimate of posterior mean, 0.6258.
Estimate of posterior variance, .
Posterior density plots (obtained by kernel smoothing) suggest a modal value of about 0.645.
It is important to note that the sequence of realisations of are not independent. This is clear from their construction, via the Markov chain. One useful summary of dependence is an ACF (autocorrelation function) plot. The acf plot (in R) gives the estimated value of for , the correlation between samples from the posterior distribution which differ by a lag of iterations. For independent samples, for all , although an estimate of the correlation will not be exactly 0. Generally for MCMC, the faster the approaches 0 as increases the better. (With very few exceptions, MCMC constructs Markov chains with positive correlation, .)
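As a concrete illustration, the following R snippet shows how such an ACF plot can be produced; theta.samples is a hypothetical vector containing the post-burn-in draws from the Gibbs sampler.

```r
## theta.samples: post-burn-in posterior draws (hypothetical object name).
## acf() estimates and plots the autocorrelations at lags k = 0, 1, 2, ...
acf(theta.samples, lag.max = 20, main = "ACF of posterior samples")
```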
An acf plot for is given based upon a sample of 1000 iterations following a burn-in of 100 iterations. As we can see there is very little dependence with only significantly different to 0. (Note that is always equal to 1.) This is very good. In practice the dependence is usually a lot more marked. We will say more about dependence and the cost of dependent versus independent samples in later weeks.
[Figure: ACF plot of the Gibbs sampler output]
The performance of students in a test is thought to depend upon individual and school variability. In particular the following model is suggested:-
where is the performance of the individual from school and denotes the school effect. The parameters and are assumed to be unknown parameters to be estimated. Whilst are observed, the are unobserved. Therefore we use data augmentation (augment with ) to estimate based upon observations , where .
This is a hierarchical model with students based within schools and the school having an effect on student performance. We could have more hierarchical levels; for example, the schools could be in different cities and we could include a city effect.
Note that given , is independent of and . Moreover, given , and are independent, for . Therefore
(3.1)
Now
(3.2)
Since the performance of students from different schools are independent, we have that
(3.3)
To find the conditional distributions of , we combine the likelihood (3.3) with the priors for , and . For simplicity, we shall assume that the priors on , and are improper uniform priors. That is, and . Therefore, letting and , we have that
Therefore
All that is now required is the conditional distributions of . Note that the ’s are independent with
Therefore, we have that
Note that the mean of is a compromise between the mean of the data and the prior mean with
Furthermore as more data are collected it moves closer to the mean of .
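A sketch of the resulting Gibbs sampler in R is given below. Since the notation is abbreviated in these notes, the sketch assumes the model y_ij | theta_j ~ N(theta_j, sigma2) with theta_j ~ N(mu, tau2), and improper uniform priors on mu, sigma2 and tau2; under these priors the variance parameters have inverse-gamma full conditionals, sampled here as reciprocals of gamma draws. The function name school.gibbs and the data format (a list with one vector of marks per school) are illustrative choices.

```r
## Sketch of a Gibbs sampler for the hierarchical (school) model, assuming
##   y_ij | theta_j ~ N(theta_j, sigma2),  theta_j ~ N(mu, tau2),
## with improper uniform priors on mu, sigma2 and tau2.
## 'y' is a list with one numeric vector of marks per school.
school.gibbs <- function(y, n.iter = 1100) {
  m <- length(y)                 # number of schools
  n.j <- sapply(y, length)       # students per school
  N <- sum(n.j)
  out <- matrix(NA, n.iter, m + 3,
                dimnames = list(NULL, c(paste0("theta", 1:m), "mu", "sigma2", "tau2")))
  ## Initial values
  theta <- sapply(y, mean); mu <- mean(unlist(y))
  sigma2 <- var(unlist(y)); tau2 <- var(theta)
  for (t in 1:n.iter) {
    ## theta_j | rest ~ N((n_j*ybar_j/sigma2 + mu/tau2)/prec_j, 1/prec_j)
    prec <- n.j / sigma2 + 1 / tau2
    mean.j <- (n.j * sapply(y, mean) / sigma2 + mu / tau2) / prec
    theta <- rnorm(m, mean.j, 1 / sqrt(prec))
    ## mu | theta, tau2 ~ N(mean(theta), tau2/m)
    mu <- rnorm(1, mean(theta), sqrt(tau2 / m))
    ## sigma2 | y, theta ~ Inverse-Gamma(N/2 - 1, SS/2) under a uniform prior
    SS <- sum(unlist(mapply(function(yj, thj) (yj - thj)^2, y, theta, SIMPLIFY = FALSE)))
    sigma2 <- 1 / rgamma(1, shape = N / 2 - 1, rate = SS / 2)
    ## tau2 | theta, mu ~ Inverse-Gamma(m/2 - 1, sum((theta - mu)^2)/2)
    tau2 <- 1 / rgamma(1, shape = m / 2 - 1, rate = sum((theta - mu)^2) / 2)
    out[t, ] <- c(theta, mu, sigma2, tau2)
  }
  out
}
```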
Example.
Suppose that a study involves 4 schools. Students from each school take a test and score a mark out of 100. The results are presented in the table below.
Note: The data were generated from the model with and , with the values rounded to give integers. Therefore the ’true’ values are , and .
A Gibbs sampler algorithm was run on the above data set to get a sample of 1100 iterations. After discarding the first 100 iterations, we can use the remaining 1000 iterations to estimate the posterior means and standard deviations of the parameters.
The mean of is almost 5 times the true value of . The reason for this is that there are only four schools and the means for the 4 schools are fairly similar. ( large implies greater precision, i.e. smaller variance.) A time series plot of the 1100 iterations of is given in the figure below. We also see that has a right-skewed distribution, so that the mean is (quite a bit) larger than the posterior mode.
[Figures: Gibbs sampler output for the school performance example]
We consider a probit regression example. Let be binary response variables for a collection of objects with associated covariate measurement . We assume that
where is the standard normal distribution function, is the linear predictor and represents a column vector of regression coefficients. We are interested in the posterior distribution of and we assume a prior for .
In the probit model, the mean (the probability of a ) is given by , hence, it corresponds to the probit link function: .
Note that
Therefore we do not have a nice analytical expression for and resort again to data augmentation.
For the data augmentation it is useful to think how we could simulate data from the probit model with coefficients and covariates . To simulate such that , we could take the following steps:
Simulate .
If , set and otherwise set .
Note that
as required. Given the above construction, the ’s are simple Normal random variables and the ’s are simple deterministic functions of the ’s.
Gibbs sampler
This representation lends itself to efficient simulation using the Gibbs sampler.
The joint posterior of is given by
The conditional posterior distribution of is therefore multivariate normal:
where . This is similar to the block update of in the linear regression in the Week 7 lab session. On the other hand, the conditional posterior distribution for each is truncated normal,
(3.4)
We can therefore create a Gibbs sampler which, at each step
Samples from .
Samples from , for .
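Below is a hedged R sketch of this sampler (the data-augmentation construction often attributed to Albert and Chib). It assumes a flat (improper uniform) prior on beta, so that beta given the latent variables is normal with mean (X'X)^{-1}X'y* and covariance (X'X)^{-1}; if the notes use a proper normal prior, the mean and covariance in the beta update change accordingly. The truncated normal draws use the inverse-CDF method.

```r
## Sketch of the data-augmentation Gibbs sampler for probit regression,
## assuming a flat (improper uniform) prior on beta. X is the n x p design
## matrix (including an intercept column) and y is the 0/1 response vector.
probit.gibbs <- function(y, X, n.iter = 5500) {
  n <- nrow(X); p <- ncol(X)
  XtX.inv <- solve(crossprod(X))          # (X'X)^{-1}: posterior covariance of beta
  R <- chol(XtX.inv)                      # for drawing multivariate normals
  beta <- rep(0, p)
  out <- matrix(NA, n.iter, p, dimnames = list(NULL, colnames(X)))
  for (t in 1:n.iter) {
    ## Latent y*_i ~ N(x_i' beta, 1), truncated to (0, Inf) if y_i = 1 and to
    ## (-Inf, 0] if y_i = 0, drawn via the inverse-CDF method.
    eta <- drop(X %*% beta)
    p0 <- pnorm(0, mean = eta, sd = 1)    # P(y*_i <= 0) for each observation
    u <- ifelse(y == 1, runif(n, p0, 1), runif(n, 0, p0))
    ystar <- qnorm(u, mean = eta, sd = 1)
    ## beta | y* ~ N((X'X)^{-1} X'y*, (X'X)^{-1})
    beta.hat <- XtX.inv %*% crossprod(X, ystar)
    beta <- drop(beta.hat + t(R) %*% rnorm(p))
    out[t, ] <- beta
  }
  out
}
```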
Example.
Response: Occurrence or non-occurrence of infection following birth by Caesarian section.
Covariates:
if Caesarian section was not planned and otherwise;
if there is presence of one or more risk factors, such as diabetes or excessive weight, and otherwise;
if antibiotics were given as prophylaxis and otherwise.
In total: births.
| Covariates | | | Infection: yes | Infection: no |
|---|---|---|---|---|
| 0 | 0 | 0 | 8 | 32 |
| 0 | 0 | 1 | 0 | 2 |
| 0 | 1 | 0 | 28 | 30 |
| 0 | 1 | 1 | 1 | 17 |
| 1 | 0 | 0 | 0 | 9 |
| 1 | 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 23 | 3 |
| 1 | 1 | 1 | 11 | 87 |
The program for implementing the probit model is available in probit.r. I ran the code for 5500 iterations discarding the first 500 iterations as burn-in. The tables below give the estimates of the parameter means and standard deviations along with the probability that a parameter is positive (based on the sample). We can see from the table that the Caesarian section not being planned (emergency) and having risk factors increase the chance of infection, whilst antibiotics (fortunately) reduce the risk of infection.
| Coefficient | Mean | Std. dev. | P(coefficient > 0) |
|---|---|---|---|
| (Intercept) | -0.9854 | 0.2049 | 0 |
| noplan | 0.5063 | 0.2313 | 0.9868 |
| factor | 1.0607 | 0.2386 | 1 |
| antib | -1.7498 | 0.2447 | 0 |
Finally, in the table below, we give the (estimated) posterior probability of becoming infected for each set of covariates. Note that
This can be estimated using samples from by
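As a sketch in R, this Monte Carlo estimate for a given covariate pattern can be computed from the posterior draws; beta.samples is a hypothetical matrix of post-burn-in draws (one row per iteration) and z is the covariate vector including the intercept.

```r
## Average Phi(z' beta) over the posterior draws of beta.
## beta.samples: matrix of post-burn-in draws (rows = iterations).
## z: covariate vector including the intercept, e.g. c(1, 0, 1, 0).
infection.prob <- function(z, beta.samples) mean(pnorm(beta.samples %*% z))
```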
In the table below we compare the posterior probability of infections versus the observed proportion infected for each set of covariates. The table shows generally good agreement between the observed proportions and those given by the model.
| Covariates | | | Infection: estimated | Infection: observed |
|---|---|---|---|---|
| 0 | 0 | 0 | 0.1665 | 0.2000 |
| 0 | 0 | 1 | 0.0047 | 0.0000 |
| 0 | 1 | 0 | 0.5300 | 0.4828 |
| 0 | 1 | 1 | 0.0525 | 0.0556 |
| 1 | 0 | 0 | 0.3202 | 0.0000 |
| 1 | 0 | 1 | 0.0158 | – |
| 1 | 1 | 0 | 0.7149 | 0.8846 |
| 1 | 1 | 1 | 0.1243 | 0.1122 |
We have introduced hierarchical models above and here we consider an alternative parameterisation of the hierarchical Gaussian model. Reparameterisation can be a useful tool in statistics for improving estimation of the model parameters. The reparameterisations presented below are based upon ideas from [2]. Let be observed and be unobserved, with
where and are assumed known and and are independent standard normal random variables (). The parameter is assumed to be unknown and its posterior distribution is of interest. We shall consider the case where for all and consequently drop the subscript . Therefore the model can be rewritten as
Therefore this is a simplified version of the hierarchical model studied above.
An alternative parameterisation is the non-centered parameterisation proposed by [2],
Note that and are a priori independent, but conditional upon the data they are dependent.
Both (centered and non-centered) parameterisations permit a (data-augmentation) Gibbs sampler as outlined below. (Throughout we shall assign an improper, uniform prior to ; .)
Centered algorithm
The joint distribution of given , satisfies
since and given , is independent of .
Thus
Therefore
This gives
(Note that if has pdf , then .)
For , depends only upon and . Therefore
This gives
It is straightforward using the above conditional distributions to construct a Gibbs sampling algorithm which alternates between updating and updating .
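A sketch of the centered algorithm in R is given below. Since the symbols are abbreviated in these notes, it assumes the model Y_i | X_i ~ N(X_i, sigY^2), X_i ~ N(theta, sigX^2), with sigX and sigY known and a flat prior on theta; the names centered.gibbs, sigX and sigY are illustrative.

```r
## Sketch of the centered data-augmentation Gibbs sampler, assuming
##   Y_i | X_i ~ N(X_i, sigY^2),  X_i | theta ~ N(theta, sigX^2),
## with sigX, sigY known and a flat prior on theta.
centered.gibbs <- function(y, sigX, sigY, n.iter = 1000, theta.init = mean(y)) {
  n <- length(y)
  theta <- numeric(n.iter + 1); theta[1] <- theta.init
  prec <- 1 / sigY^2 + 1 / sigX^2          # conditional precision of each X_i
  for (t in 1:n.iter) {
    ## X_i | y_i, theta ~ N((y_i/sigY^2 + theta/sigX^2)/prec, 1/prec)
    x <- rnorm(n, (y / sigY^2 + theta[t] / sigX^2) / prec, 1 / sqrt(prec))
    ## theta | X ~ N(mean(X), sigX^2/n)
    theta[t + 1] <- rnorm(1, mean(x), sigX / sqrt(n))
  }
  theta
}
```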
Non-centered algorithm
The joint distribution of given , satisfies
since and is a priori independent of .
Thus
Therefore, letting , we have that
This gives
For , depends only upon and . Therefore
This gives
It is straightforward using the above conditional distributions to construct a Gibbs sampling algorithm which alternates between updating and updating .
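The corresponding non-centered sampler, under the same assumed model and writing X_i = theta + Xtilde_i with Xtilde_i ~ N(0, sigX^2) a priori, can be sketched as follows.

```r
## Sketch of the non-centered Gibbs sampler for the same assumed model,
## with Xtilde_i = X_i - theta ~ N(0, sigX^2) a priori.
noncentered.gibbs <- function(y, sigX, sigY, n.iter = 1000, theta.init = mean(y)) {
  n <- length(y)
  theta <- numeric(n.iter + 1); theta[1] <- theta.init
  prec <- 1 / sigY^2 + 1 / sigX^2
  for (t in 1:n.iter) {
    ## Xtilde_i | y_i, theta ~ N(((y_i - theta)/sigY^2)/prec, 1/prec)
    xt <- rnorm(n, ((y - theta[t]) / sigY^2) / prec, 1 / sqrt(prec))
    ## theta | Xtilde, y ~ N(mean(y - Xtilde), sigY^2/n)
    theta[t + 1] <- rnorm(1, mean(y - xt), sigY / sqrt(n))
  }
  theta
}
```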
Example.
To illustrate the two algorithms, we apply both algorithms to a data set consisting of observations from with , and . Both algorithms were run to generate a sample of size 1000 from the posterior distribution of . The two algorithms are obtaining samples from the same posterior distribution and should therefore give the same answers (subject to Monte Carlo error, i.e. random variation). The estimated means (standard deviations) of based on the two samples are and for the centered and non-centered algorithms, respectively. Below are trace plots of 100 iterations of each algorithm followed by the acf plots. We can clearly see from the acf plots that the non-centered algorithm is preferable to the centered algorithm.
[Figures: trace plots (100 iterations) and ACF plots for the centered and non-centered algorithms]
For the above model we can actually show analytically that for the given values of and , the non-centered algorithm is preferable by calculating the autocorrelation for both models. Note that it is actually fairly easy to show that the autocorrelation is the autocorrelation to the power for the above model. Let and denote the autocorrelations for the centered and non-centered algorithms, respectively. Then we can show that and .
To prove the above, we need to compute , for which we need the stationary distribution of . Note that
Therefore
giving
In other words, MCMC is not needed for the case, where for all , but MCMC is required if any .
Thus . Therefore we need to compute . This is most easily done by showing that
where is a random variable independent of . In that case,
and .
For the centered algorithm, note that
Hence
Therefore .
A similar calculation holds for . Thus if , the centered algorithm is preferable, otherwise the non-centered algorithm is preferable.
Write a Gibbs sampler algorithm to apply to the school performance data given in Section 3.3 of the lecture notes. Compare your output to that given in the lecture notes.
A recap of the data, the model and the necessary conditional distributions is given below.
Performance of 25 students in a test selected from 4 schools.
Model:
where is the performance of the individual from school and denotes the school effect. Note that the ’s are not observed.
Conditional distributions for the Gibbs sampler:-
This extends the multiple regression example studied in last week's lab to allow for outliers.
When there are outliers suspected in the data, the sensible thing to do is to take a closer look at the data and try and understand where the lack of fit may be coming from. There are several possibilities: e.g. the outliers may be due to errors in recording the data, or it may be that there were certain features corresponding to those observations that our model does not take into account. The latter may be solved e.g. by adding relevant explanatory variables to the model, or by changing other aspects of the model.
Now suppose that none of the above applies to your model: that is, there are no recording errors in the data that you know of, you have no additional explanatory variables that you could possibly use, and you cannot think of a more suitable model specification. In that case, a partial solution is to use a sampling distribution that is more robust to outliers than the Normal distribution. A distribution with thicker tails than the Normal distribution allows for larger departures from the mean and, therefore, outliers are likely to have less impact on the resulting posterior inference on and . That is what we mean by a sampling distribution that is more resistant to outliers than the Normal distribution.
A distribution with this property is the Cauchy distribution. Using a Cauchy distribution leads to the following sampling density for observation :
(3.5)
Consider the same prior distributions for and used in Week 7.
Write down, up to a proportionality constant, the joint posterior density of .
Write down, up to a proportionality constant, the conditional posterior densities of , and (that is, , and ). Is our new model amenable to Gibbs sampling? Explain why or why not.
Now consider the following idea: The Cauchy distribution in (3.5) can also be interpreted as a Normal distribution with an unknown (random) precision parameter.
More specifically, assuming
(3.6)
(3.7)
is, in fact, equivalent to assuming the sampling model in (3.5).
Aside. The Cauchy distribution is a -distribution with degree of freedom, and (abusing notation)
The statement follows since is the same as .
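As a quick (optional) check of this scale-mixture representation, the following R snippet compares simulated draws of a Normal with Gamma(1/2, 1/2) random precision against Cauchy quantiles; the location 0 and scale 1 are illustrative.

```r
## Simulation check that a Normal with Gamma(1/2, 1/2) random precision is
## Cauchy (a t-distribution with 1 degree of freedom). Location 0, scale 1.
set.seed(1)
n <- 1e5
lambda <- rgamma(n, shape = 1/2, rate = 1/2)    # latent precisions
x <- rnorm(n, mean = 0, sd = 1 / sqrt(lambda))  # Normal(0, 1/lambda) draws
qqplot(qcauchy(ppoints(n)), x, main = "Scale-mixture draws vs Cauchy quantiles")
abline(0, 1)
```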
Thus, we can augment the parameters with the variables (note that there is one variable per observation) to obtain the posterior density
Find the conditional posterior density of each , , , where denotes the vector excluding .
Using an almost identical argument to the previous lab, show that
where ,
and
Using an almost identical argument to the previous lab (and the previous question), show that
where ,
and
Using an almost identical argument to the previous lab, show that
Write down a Gibbs sampling algorithm to compute the joint posterior distribution of .
Make a copy of your Gibbs sampling function from last week's lab and alter this to include the , producing a Gibbs sampler for the Cauchy model.
How does the inference on and compare with that obtained in Lab 2 where you used a Normal sampling model?
Suppose that we have independent and identically distributed realisations from the random variable , where is a zero-inflated Poisson. That is,
where are the parameters of interest.
This is an example of a mixture distribution, where is distributed either according to distribution or distribution .
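To make the mixture construction concrete, the following R snippet simulates zero-inflated Poisson data; the mixing probability p and rate lambda used here are illustrative values only, and the convention that p is the probability of a structural zero is an assumption.

```r
## Simulate from a zero-inflated Poisson: with probability p the observation
## is a structural zero, otherwise it is Poisson(lambda).
rzip <- function(n, p, lambda) {
  z <- rbinom(n, size = 1, prob = p)   # z_i = 1: structural zero
  ifelse(z == 1, 0L, rpois(n, lambda))
}
x <- rzip(1000, p = 0.3, lambda = 2)   # illustrative parameter values
table(x)
```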
For , let denote the total number of ’s equal to . Then is a sufficient statistic for obtaining the posterior distribution of . That is,
Suppose that and .
Show that
Let denote the total number of ’s which come from distribution . Write down .
Hence, write down up to proportionality.
Suppose that and . Find the following conditional distributions:-
;
;
.
Write a Gibbs sampler to obtain samples from .
An ecologist is monitoring the number of field mice at sites. Each site is observed for days, with the number of mice caught at each site recorded daily. Let denote the number of mice caught at site on day . The data with are assumed to arise from the following hierarchical model:
Note that are unobserved and let denote a realization from , corresponding to .
Suppose that a prior is assigned to .
Write down the likelihood for given .
Find the conditional probability distribution of given .
For , find and identify the conditional probability distribution of .
Describe a Gibbs sampler algorithm for obtaining samples from , the posterior distribution of given the observed data .