The Gibbs sampler is a special case of the general Metropolis-Hastings algorithm. All MCMC algorithms are based upon the Metropolis-Hastings algorithm but, as with the Gibbs sampler, some special cases are considered in their own right.
Suppose that we wish to simulate from a (multivariate) distribution $\pi$, which we term the target distribution. We introduce an arbitrary but specified transition probability $q(x,y)$ from which simulation is straightforward. The transition probability $q(x,y)$ is the probability density of moving from $x$ to $y$.
If $X_0$ is drawn from $\pi$, the Metropolis-Hastings algorithm will generate (dependent) samples $X_1, X_2, \ldots$ from $\pi$.
The Metropolis-Hastings algorithm is:
Metropolis-Hastings algorithm
Given the current position $X_t = x$, generate a ‘candidate value’, $y$, from $q(x, \cdot)$. Note that $q$ is often referred to as the proposal distribution, in that we are proposing a candidate value using it.
Calculate the ‘acceptance probability’, $\alpha(x,y)$, given by
$$\alpha(x,y) = \min\left\{1, \frac{\pi(y)\,q(y,x)}{\pi(x)\,q(x,y)}\right\}.$$
With probability $\alpha(x,y)$ accept the candidate value and set $X_{t+1} = y$; otherwise reject the candidate value and set $X_{t+1} = x$.
Repeat until a sample of the desired size is obtained.
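To make the algorithm concrete, here is a minimal sketch in R; the function mh and the closing N(0,1) example are illustrative choices for this sketch, not code from the course files.

# Minimal sketch of the Metropolis-Hastings algorithm (illustrative only).
# log_pi: log target density (up to an additive constant);
# rq:     function(x) drawing a candidate from q(x, .);
# ldq:    function(x, y) giving log q(x, y);
# x0:     starting value; n_iter: number of iterations.
mh <- function(log_pi, rq, ldq, x0, n_iter) {
  x <- numeric(n_iter + 1)
  x[1] <- x0
  for (t in 1:n_iter) {
    y <- rq(x[t])                                   # candidate value
    log_alpha <- log_pi(y) - log_pi(x[t]) +         # log acceptance ratio
      ldq(y, x[t]) - ldq(x[t], y)
    x[t + 1] <- if (log(runif(1)) < log_alpha) y else x[t]
  }
  x
}

# Example: N(0,1) target with a random-walk N(x, 1) proposal.
out <- mh(log_pi = function(x) -x^2 / 2,
          rq     = function(x) rnorm(1, x, 1),
          ldq    = function(x, y) dnorm(y, x, 1, log = TRUE),
          x0 = 0, n_iter = 1000)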
It is not difficult to show that $\pi$ is the stationary distribution of this chain; this is done below. Under mild conditions, including ensuring that the proposal distribution does not lead to a reducible Markov chain, the chain will converge to this stationary distribution.
To show that the stationary distribution of the Metropolis-Hastings algorithm is indeed $\pi$, it is necessary to show that, for all $y$,
$$\pi(y) = \int \pi(x)\,p(x,y)\,dx,$$
where $p(x,y)$ is the probability density for $X_{t+1} = y$ given that $X_t = x$. For $y \neq x$, the density consists of two parts: first we need to propose the move from $x$ to $y$, i.e. $q(x,y)$, and secondly we need to accept the move, $\alpha(x,y)$. Thus $p(x,y) = q(x,y)\alpha(x,y)$. However, the probability $p(x,x)$ is slightly more involved. (Note that typically there is a non-zero probability of staying at $x$.) Firstly, we have the probability $q(x,x)$ of proposing that we stay where we are, which is a move we will always accept! Secondly, we could propose a candidate value $z \neq x$ and reject such a move, meaning that we would stay where we are. Thus
$$p(x,x) = q(x,x) + \int q(x,z)\{1 - \alpha(x,z)\}\,dz.$$
In general it is not a good idea to have $q(x,x) > 0$; proposing to stay where you are just slows down the process. Therefore, without loss of generality, we take $q(x,x) = 0$.
Note that, directly from the definition of $\alpha$, we have the key identity $\pi(x)q(x,y)\alpha(x,y) = \min\{\pi(x)q(x,y), \pi(y)q(y,x)\} = \pi(y)q(y,x)\alpha(y,x)$. Therefore, bringing all of the above arguments together, we have that
$$\int \pi(x)\,p(x,y)\,dx = \int \pi(x)q(x,y)\alpha(x,y)\,dx + \pi(y)\int q(y,z)\{1 - \alpha(y,z)\}\,dz = \pi(y)\int q(y,z)\,dz = \pi(y),$$
as required. This is since $q(y,\cdot)$ is a probability density, and hence integrates to 1, and the other two terms cancel with each other (by the key identity, $\int \pi(x)q(x,y)\alpha(x,y)\,dx = \int \pi(y)q(y,z)\alpha(y,z)\,dz$).
There exists a simpler proof based upon detailed balance. $\pi$ is the stationary distribution of a Markov chain with transition kernel $p(x,y)$ if, for all $x, y$,
$$\pi(x)\,p(x,y) = \pi(y)\,p(y,x).$$
(Integrating both sides over $x$ recovers the stationarity equation above.) Specifically for the Metropolis-Hastings algorithm, $p(x,y) = q(x,y)\alpha(x,y)$ for $y \neq x$, and therefore,
$$\pi(x)p(x,y) = \pi(x)q(x,y)\min\left\{1, \frac{\pi(y)q(y,x)}{\pi(x)q(x,y)}\right\} = \min\{\pi(x)q(x,y), \pi(y)q(y,x)\} = \pi(y)q(y,x)\min\left\{\frac{\pi(x)q(x,y)}{\pi(y)q(y,x)}, 1\right\} = \pi(y)p(y,x),$$
as required.
There are (infinitely) many possible choices for the proposal distribution $q$. Two of the most common choices are outlined below.
Independence sampler
For all $x$, let $q(x,y) = f(y)$ for some probability density $f$; that is, the proposal density is independent of the current value $x$. This idea is similar to rejection sampling, in that we sample from one probability density $f$ and then accept or reject the samples as coming from the probability density $\pi$. This method works well when we can find a simple (well understood, easy to simulate from) probability density $f$ which closely approximates $\pi$. Note that in this case the acceptance probability is given by
$$\alpha(x,y) = \min\left\{1, \frac{\pi(y)f(x)}{\pi(x)f(y)}\right\} = \min\left\{1, \frac{w(y)}{w(x)}\right\},$$
where $w(u) = \pi(u)/f(u)$ denotes the relative weight of the two densities at $u$.
Example. Mixture of normal distributions as the target (stationary) distribution, $\pi$, with a normal distribution for the proposal distribution, $f$; the two densities are shown in the figure below.
[Figure: the normal-mixture target density $\pi$ and the normal proposal density $f$.]
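A sketch of this independence sampler in R is given below; the particular mixture weights, means and proposal variance are assumptions for illustration, since the example's exact densities appear only in the figure.

# Independence sampler sketch: normal-mixture target, normal proposal.
# The mixture below is an illustrative assumption, not the notes' example.
log_pi <- function(x) log(0.5 * dnorm(x, -2, 1) + 0.5 * dnorm(x, 2, 1))
log_f  <- function(x) dnorm(x, 0, 3, log = TRUE)   # proposal f = N(0, 9)

n_iter <- 5000
x <- numeric(n_iter)                               # x[1] = 0 is the start
for (t in 2:n_iter) {
  y <- rnorm(1, 0, 3)                              # draw y ~ f
  # log of w(y)/w(x), where w(u) = pi(u)/f(u)
  log_alpha <- (log_pi(y) - log_f(y)) - (log_pi(x[t - 1]) - log_f(x[t - 1]))
  x[t] <- if (log(runif(1)) < log_alpha) y else x[t - 1]
}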
Random-walk Metropolis
Suppose that we are currently at $x$. In stationarity it is likely that $x$ has high posterior probability. (Note that in stationarity the chance of being at $x$ is proportional to $\pi(x)$.) Therefore it might seem sensible to propose a candidate value, $y$, from a distribution which is centred about the current value $x$. For example, $y \sim N(x, \sigma^2)$; that is, we propose from a normal distribution with mean $x$. If the proposal distribution is symmetric about $x$, as in the case of the normal distribution, then we have the nice property that
$$q(x,y) = q(y,x).$$
Hence in this case, the acceptance probability, $\alpha(x,y)$, simplifies to
$$\alpha(x,y) = \min\left\{1, \frac{\pi(y)}{\pi(x)}\right\}.$$
Example. Mixture of normal distributions is the target with a normal distribution for the proposal. The figure below shows the target density together with the proposal density centred at the current value $x$ (marked with a vertical green line).
[Figure: the normal-mixture target density with the random-walk proposal density centred at the current value.]
(Note that the proposal density has been divided by 2 in the above figure.)
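The corresponding random-walk Metropolis sketch in R is shorter still, since the symmetric proposal cancels from the acceptance ratio; this reuses log_pi from the independence sampler sketch above, with an assumed proposal scale.

# Random-walk Metropolis sketch for the same (assumed) mixture target.
sigma <- 1                                   # assumed proposal scale
x <- numeric(5000)
for (t in 2:length(x)) {
  y <- rnorm(1, x[t - 1], sigma)             # symmetric proposal about x
  log_alpha <- log_pi(y) - log_pi(x[t - 1])  # alpha = min{1, pi(y)/pi(x)}
  x[t] <- if (log(runif(1)) < log_alpha) y else x[t - 1]
}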
Why choose $\alpha(x,y) = \min\{1, \pi(y)q(y,x)/(\pi(x)q(x,y))\}$?
The answer is given by Peskun (1973) and is as follows. This is the largest value that the acceptance probability can take, for the given choice of $q$, which ensures that the stationary distribution of the Markov chain is given by $\pi$. Choosing the largest possible acceptance probability means the Markov chain moves around the sample space as fast as possible (given $q$) and therefore converges as quickly as possible.
Accepting a move.
The simplest way to accept a proposed move computationally is to generate a random number $u$ from $U(0,1)$ and accept the move if $u \leq \alpha(x,y)$ (reject otherwise). This leads us to a couple of observations which can assist in implementing MCMC.
Let $r(x,y) = \pi(y)q(y,x)/(\pi(x)q(x,y))$, so that $\alpha(x,y) = \min\{1, r(x,y)\}$. It suffices to accept the move if $u \leq r(x,y)$ (reject otherwise), since if $r(x,y) \geq 1$, $\alpha(x,y) = 1$ and the proposed move is accepted anyway.
Equally, it suffices to accept the move if $\log u \leq \log r(x,y)$ (reject otherwise), as $\log$ is a monotonic function. This can help with numerical stability, although this is unlikely to be needed for the problems we consider in this course.
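Both observations can be rolled into a single accept/reject helper; the following is a sketch, with hypothetical argument names, of how the step might look in R.

# Accept/reject step using both observations: compare u with r(x, y)
# directly (no min needed), and work on the log scale for stability.
accept_move <- function(log_pi_x, log_pi_y, log_q_xy, log_q_yx) {
  log_r <- log_pi_y - log_pi_x + log_q_yx - log_q_xy  # log r(x, y)
  log(runif(1)) <= log_r                              # TRUE means accept
}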
Thus far we have described the Metropolis-Hastings algorithm for obtaining samples from a probability distribution with probability density function $\pi$. Typically, $\pi$ is the posterior distribution for a Bayesian statistical problem; that is, $\pi(\theta) = \pi(\theta \mid x)$. The key to the success of the Metropolis-Hastings algorithm (MCMC) is that we only need to know $\pi$ up to a constant of proportionality. To see this, note that
$$\pi(\theta \mid x) = \frac{\pi(x \mid \theta)\,\pi(\theta)}{\pi(x)} \propto \pi(x \mid \theta)\,\pi(\theta).$$
Then the acceptance probability of the Metropolis-Hastings algorithm can be written as
$$\alpha(\theta, \phi) = \min\left\{1, \frac{\pi(x \mid \phi)\,\pi(\phi)\,q(\phi,\theta)}{\pi(x \mid \theta)\,\pi(\theta)\,q(\theta,\phi)}\right\}.$$
That is, it is not necessary to compute the marginal likelihood $\pi(x)$, the constant of proportionality of the posterior distribution.
The general Metropolis-Hastings algorithm has the major advantage over the Gibbs sampler that we do not need to know any of the conditional distributions. All that is required is to be able to simulate from $q$, which we can choose arbitrarily subject to a few basic constraints. Moreover, and this can be of crucial importance, we only need to know $\pi$ up to proportionality. This is because we only need to consider the ratio of the posterior density at different values, and hence the proportionality constants cancel. For example, with the independence sampler we only need to know the proposal density $f$ up to proportionality as well, since its constant also cancels in the ratio $w(y)/w(x)$. Whilst we have great freedom in choosing $q$, it is important to make good choices: a poor choice of $q$ can seriously affect the efficiency and convergence of the Metropolis-Hastings algorithm.
It is often desirable to create hybrid MCMC chains which combine different types of MCMC moves. A particularly useful application arises with the Gibbs sampler: it is often impossible to obtain the conditional distributions of all the parameters. In this case, it is natural to use a one-dimensional Metropolis-Hastings step to simulate from the one-dimensional conditionals. For example, use the independence sampler or random-walk Metropolis for one of the parameters which does not have a nice conditional distribution. Then the accepted value will be from the correct one-dimensional conditional distribution, but we have not needed to simulate directly from the conditional distribution. This type of algorithm is sometimes referred to as Metropolis-within-Gibbs or componentwise MCMC.
Suppose that we want to obtain samples from $\pi(\theta \mid x)$ (a posterior distribution), where $\theta = (\theta_1, \ldots, \theta_d)$ are the parameters of interest and $x$ denotes data. Componentwise MCMC follows the same procedure as the Gibbs sampler by updating subsets of the parameters, often one parameter, sequentially, keeping the other parameters fixed.
That is, within each iteration, for $j = 1, \ldots, d$:
Propose $\phi_j$ according to some proposal $q_j(\theta, \cdot)$, where $\theta$ denotes the current value of the Markov chain. Note that the proposal distribution could depend on the whole of $\theta$ but often it will only depend on $\theta_j$.
Set $\phi = (\theta_1, \ldots, \theta_{j-1}, \phi_j, \theta_{j+1}, \ldots, \theta_d)$. (Only the $j$th element differs from $\theta$.)
Compute the acceptance probability $\alpha(\theta, \phi)$ for the proposed move from $\theta$ to $\phi$.
With probability $\alpha(\theta, \phi)$, accept the proposed move and set $\theta = \phi$. Otherwise keep $\theta$ unchanged (reject the proposed move).
The above algorithm can be adapted to update a subset of the parameters at once; in other words, componentwise MCMC is not restricted to single-parameter updates. A sketch of a componentwise sampler is given below.
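The following minimal componentwise (Metropolis-within-Gibbs) sketch in R uses random-walk updates for each component in turn; the bivariate log posterior is an illustrative stand-in, not a model from the notes.

# Componentwise MCMC sketch: random-walk update of each component in turn.
# log_post is a stand-in for the log posterior of your model (here an
# arbitrary correlated bivariate normal, for illustration).
log_post <- function(theta) -0.5 * (theta[1]^2 - 1.6 * theta[1] * theta[2] + theta[2]^2)

n_iter <- 5000; d <- 2; sigma <- c(1, 1)     # per-component proposal scales
theta <- matrix(0, n_iter, d)
for (t in 2:n_iter) {
  cur <- theta[t - 1, ]
  for (j in 1:d) {
    prop <- cur
    prop[j] <- rnorm(1, cur[j], sigma[j])    # propose a change to theta_j only
    if (log(runif(1)) < log_post(prop) - log_post(cur)) cur <- prop
  }
  theta[t, ] <- cur
}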
The ideal proposal distribution for an independence sampler is $f = \pi$.
Why? The acceptance rate is always 1 and we get iid samples from $\pi$.
Why is this of no direct use? If we can sample from $\pi$ directly there is no need for MCMC.
How might it, nonetheless, help us to choose a sensible proposal distribution? We can look for an $f$ which is a good approximation to $\pi$.
What might a good acceptance rate for the independence sampler be? As close to 1 as possible whilst exploring the whole of the posterior distribution.
To illustrate the independence sampler, and to illustrate the effect of the choice of $f$ on efficiency, we return to the genetics example, analysed using the Gibbs sampler in Week 8. Note that using the Metropolis-Hastings algorithm, we can use the posterior $\pi(\theta \mid y)$ directly, without any data augmentation.
Remember that there are 197 animals divided into four categories. The numbers of animals in the categories are $y = (y_1, y_2, y_3, y_4)$, with category probabilities
$$\left(\frac{2+\theta}{4}, \frac{1-\theta}{4}, \frac{1-\theta}{4}, \frac{\theta}{4}\right),$$
so that $\pi(\theta \mid y) \propto (2+\theta)^{y_1}(1-\theta)^{y_2+y_3}\theta^{y_4}$. For the proposal we shall use a $\mathrm{Beta}(a,b)$ density $f$ and consider efficiency for different choices of $a$ and $b$. This gives
$$w(\theta) = \frac{\pi(\theta \mid y)}{f(\theta)} = K\,\frac{(2+\theta)^{y_1}(1-\theta)^{y_2+y_3}\theta^{y_4}}{\theta^{a-1}(1-\theta)^{b-1}},$$
where $K$ is a constant independent of $\theta$ which is not important for the MCMC.
We shall consider three choices of $(a,b)$ for the Beta proposal, including the uniform $U(0,1) = \mathrm{Beta}(1,1)$. We can see from the figures below that the choice whose mean is close to the posterior mean, and whose shape best approximates $\pi(\theta \mid y)$, works best; see the plot comparing the posterior density (solid line) with the proposal density (dashed line) below. Even better results can be obtained by matching the proposal still more closely to the posterior. However, there are a number of reasons for preferring a proposal distribution which is over-dispersed relative to the posterior distribution $\pi(\theta \mid y)$; roughly speaking, the proposal distribution should have a larger variance than the posterior distribution. Note that at the other extreme, a Beta proposal concentrated away from the posterior is very inefficient, since the proposal distribution does not then reflect the posterior distribution, i.e. very few of the proposed moves are to high-density regions of the posterior distribution. Finally, note that the naive uniform proposal performs adequately.
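As a sketch, the sampler can be implemented in R as below. The counts y = (125, 18, 20, 34) are the classic form of this 197-animal linkage data, and the Beta parameters are arguments so that different proposals can be compared; treat both as assumptions to be checked against the Week 8 notes.

# Independence sampler for the genetics example (sketch).
# Assumes the classic linkage counts and cell probabilities
# ((2+theta)/4, (1-theta)/4, (1-theta)/4, theta/4).
y <- c(125, 18, 20, 34)
log_post <- function(th) {
  y[1] * log(2 + th) + (y[2] + y[3]) * log(1 - th) + y[4] * log(th)
}

indep_gene <- function(a, b, n_iter = 10000, th0 = 0.5) {
  th <- numeric(n_iter); th[1] <- th0
  for (t in 2:n_iter) {
    phi <- rbeta(1, a, b)                    # candidate from Beta(a, b)
    log_alpha <- (log_post(phi) - dbeta(phi, a, b, log = TRUE)) -
      (log_post(th[t - 1]) - dbeta(th[t - 1], a, b, log = TRUE))
    th[t] <- if (log(runif(1)) < log_alpha) phi else th[t - 1]
  }
  th
}

out <- indep_gene(1, 1)                      # uniform proposal
c(mean(out), sd(out))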
[Figures: traceplots of the independence sampler for the three Beta proposal choices, together with a comparison of the posterior and proposal densities.]
In the final figure the solid line denotes the posterior density $\pi(\theta \mid y)$ and the dashed line denotes the proposal density $f(\theta)$.
The closer $f$ approximates $\pi$, the better. Thus we look to have $f(\theta) \approx \pi(\theta)$ for all $\theta$. However, the theoretical (and practical) performance of the MCMC algorithm relies upon the ratio $w(\theta) = \pi(\theta)/f(\theta)$. In particular, there are convergence problems if $\sup_\theta \pi(\theta)/f(\theta) = \infty$. In fact the algorithm is reducible if $f(\theta) = 0$ on part of the support of $\pi$. There are no such problems caused by having $f(\theta) > 0$ where $\pi(\theta) = 0$, although the efficiency of the algorithm can be affected. The main cause of an unbounded ratio is choosing $f$ to have too small a variance (under-dispersed relative to $\pi$).
Suppose that $w(x_0) = \pi(x_0)/f(x_0)$ is very large. Then the probability of accepting a move away from $x_0$, where the candidate $Y$ is drawn from $f$, is
$$p(x_0) = \int f(y)\min\left\{1, \frac{w(y)}{w(x_0)}\right\}dy.$$
Therefore if the independence sampler is started at $x_0$, it takes a geometrically distributed number of iterations to move from $x_0$, with success probability $p(x_0)$. As $w(x_0) \to \infty$, the probability of accepting a move away from $x_0$ converges to 0, which is clearly problematic. It is worth noting that for any point $x$, the number of iterations the independence sampler spends at the point $x$ before moving is geometrically distributed with success probability (of moving)
$$p(x) = \int f(y)\min\left\{1, \frac{w(y)}{w(x)}\right\}dy.$$
A measure of the efficiency of an independence sampler is the average acceptance rate. That is,
$$\bar{\alpha} = \mathbb{E}[\alpha(X,Y)] = \int\!\!\int \pi(x)f(y)\min\left\{1, \frac{w(y)}{w(x)}\right\}dy\,dx.$$
The acceptance rate is high when $f$ and $\pi$ are in close agreement. Furthermore, if $f = \pi$, then the acceptance rate is 1.
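The average acceptance rate can be estimated by simple Monte Carlo; the following sketch uses an illustrative N(0,1) target and N(0,4) proposal (both assumptions for the sake of a runnable example).

# Monte Carlo estimate of the average acceptance rate of an independence
# sampler: draw x ~ pi and y ~ f, then average min{1, w(y)/w(x)}.
# Illustrative choice: pi = N(0,1), f = N(0,4); if f = pi the answer is 1.
w <- function(u) dnorm(u) / dnorm(u, 0, 2)   # w(u) = pi(u)/f(u)
x <- rnorm(1e5)                              # x ~ pi
y <- rnorm(1e5, 0, 2)                        # y ~ f
mean(pmin(1, w(y) / w(x)))                   # estimated acceptance rate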
We consider a censored data example. The approach has similarities to that taken in Section 1.6 for analysing censored data using the EM algorithm, in that we again use data augmentation.
Suppose that $X_1, \ldots, X_n$ are independent and identically distributed according to $f(\cdot \mid \theta)$. However, assume that there is right censoring at the value $c$. That is, for all observations greater than $c$, the only information we have is that the observation exceeds $c$. Therefore, censored values are recorded as $x^\ast = c$; the common notation is to denote censored values with an asterisk superscript. Usually for convenience we rearrange the sample as
$$x = (x_1, \ldots, x_m, x^\ast_{m+1}, \ldots, x^\ast_n),$$
where $x_1, \ldots, x_m \leq c$ are observed exactly. Then with $x$ denoting the observed data, we have that
$$\pi(\theta \mid x) \propto \pi(\theta)\left\{\prod_{i=1}^m f(x_i \mid \theta)\right\}\{1 - F(c \mid \theta)\}^{n-m},$$
where $F$ is the cdf corresponding to $f$. Typically it is difficult to work with $\pi(\theta \mid x)$ because of the presence of the non-standard term $\{1 - F(c \mid \theta)\}^{n-m}$. Let $y = (y_{m+1}, \ldots, y_n)$ represent the true but unobserved values of $x^\ast_{m+1}, \ldots, x^\ast_n$. Then the pair $(x, y)$ corresponds to complete information and
$$\pi(\theta \mid x, y) \propto \pi(\theta)\prod_{i=1}^m f(x_i \mid \theta)\prod_{i=m+1}^n f(y_i \mid \theta),$$
which is generally straightforward to work with. (Note the obvious similarities with the EM algorithm in Week 6.)
Suppose that $X \sim \mathrm{Gamma}(2, \theta)$, where $\theta$ is unknown with a $\mathrm{Gamma}(\alpha, \beta)$ prior on $\theta$. Suppose that we have $n = 20$ observations with the data censored at $c$, where $x^\ast$ denotes censored observations.
Note that the cumulative distribution function of $X$ is
$$F(x \mid \theta) = 1 - e^{-\theta x}(1 + \theta x), \qquad x > 0,$$
whilst the pdf of $X$ is $f(x \mid \theta) = \theta^2 x e^{-\theta x}$.
Therefore, if we reorder the data into ascending order, we have that
$$\pi(\theta \mid x) \propto \pi(\theta)\,\theta^{2m}\left(\prod_{i=1}^m x_i\right)\exp\left(-\theta\sum_{i=1}^m x_i\right)\left\{e^{-\theta c}(1 + \theta c)\right\}^{n-m},$$
and it is difficult to write down the posterior distribution of $\theta$. We could proceed as above and use an independence sampler to obtain samples from $\pi(\theta \mid x)$, possibly using a Gamma proposal distribution; in other words, take an approach similar to that used for the genetics example above.
However, letting $s = \sum_{i=1}^m x_i + \sum_{i=m+1}^n y_i$,
$$\pi(\theta \mid x, y) \propto \pi(\theta)\,\theta^{2n}e^{-\theta s}.$$
Hence
$$\pi(\theta \mid x, y) \propto \theta^{2n + \alpha - 1}e^{-\theta(\beta + s)},$$
giving $\theta \mid x, y \sim \mathrm{Gamma}(2n + \alpha, \beta + s)$.
For $i = 1, \ldots, m$, $x_i$ is observed, so no updating is required. For $i = m+1, \ldots, n$, $y_i > c$, where $y_i$ is distributed according to $f(\cdot \mid \theta)$ conditioned to be greater than $c$. Simulating values from this truncated distribution directly is not possible, and the following independence sampler is proposed. Propose $\tilde{y}$ from $g(\tilde{y}) = \theta e^{-\theta(\tilde{y}-c)}$ for $\tilde{y} > c$, and $g(\tilde{y}) = 0$ otherwise (a shifted exponential). Simulation from this distribution can be done using inversion of the cdf; that is, simulate $u \sim U(0,1)$ and set $\tilde{y} = c - \log(u)/\theta$. Thus, if $g$ is the pdf of the proposal, for $y, \tilde{y} > c$,
$$\alpha(y, \tilde{y}) = \min\left\{1, \frac{f(\tilde{y} \mid \theta)\,g(y)}{f(y \mid \theta)\,g(\tilde{y})}\right\} = \min\left\{1, \frac{\tilde{y}}{y}\right\}.$$
Note that in computing $\alpha$, we need only compute $f(\cdot \mid \theta)$ up to proportionality, as $\theta$ remains fixed in the update of $y_i$. That is, it suffices to use $f(y \mid \theta) \propto y e^{-\theta y}$ when updating a censored observation.
Independence sampler within Gibbs
Choose initial values for $\theta$ and $y = (y_{m+1}, \ldots, y_n)$.
For $t = 1, \ldots, T$:
Update $\theta \mid x, y \sim \mathrm{Gamma}(2n + \alpha, \beta + s)$, where $s = \sum_{i=1}^m x_i + \sum_{i=m+1}^n y_i$;
For $i = m+1, \ldots, n$, draw $u_i \sim U(0,1)$ and set $\tilde{y}_i = c - \log(u_i)/\theta$.
Accept the proposed value with probability $\alpha(y_i, \tilde{y}_i) = \min\{1, \tilde{y}_i/y_i\}$, setting $y_i = \tilde{y}_i$.
Store $(\theta, y)$.
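A sketch of this algorithm in R is given below. Since the 20 observations themselves are not reproduced in these notes, the data, censoring point and prior parameters are all illustrative assumptions.

# Independence sampler within Gibbs for the censored Gamma(2, theta)
# example (sketch); data, censoring point and prior are assumed values.
set.seed(1)
cens  <- 3                                         # assumed censoring point c
x_obs <- c(0.8, 1.2, 1.9, 2.4, 0.5, 1.1, 2.8,
           1.6, 0.9, 2.2, 1.4, 0.7)                # illustrative observed data
n_cens <- 8                                        # number censored at cens
n <- length(x_obs) + n_cens
a0 <- 1; b0 <- 1                                   # assumed Gamma(a0, b0) prior

n_iter <- 1100
theta <- numeric(n_iter); theta[1] <- 1
yy <- rep(cens + 1, n_cens)                        # initial censored values
for (t in 2:n_iter) {
  s <- sum(x_obs) + sum(yy)
  theta[t] <- rgamma(1, 2 * n + a0, b0 + s)        # Gibbs update for theta
  for (i in 1:n_cens) {
    prop <- cens - log(runif(1)) / theta[t]        # shifted-exponential proposal
    if (runif(1) < min(1, prop / yy[i])) yy[i] <- prop  # alpha = min{1, prop/y}
  }
}
keep <- theta[-(1:100)]                            # discard burn-in
c(mean(keep), sd(keep))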
The algorithm works well; see the traceplot of $\theta$ below. The output is based upon a sample of 1000 iterations following a burn-in of 100 iterations, from which the estimated posterior mean and standard deviation of $\theta$ are obtained.
[Figure: traceplot of $\theta$.]
How do I decide when my chain has converged?
When my chain has converged, how do I decide if it is mixing quickly enough (or equivalently, if the chain is long enough to get accurate answers)?
How do I choose my algorithm so that I do not need "too many" iterations? (e.g. Would a different proposal have been better for the independence sampler? Would a larger or smaller jump proposal have been better for the RWM?)
What do we mean by "better" in the previous question?!
Below we begin to answer the above questions. To assist us with answering the questions, we consider four scenarios for obtaining samples from $N(0,1)$:
Random-walk Metropolis with $N(0,1)$ jump proposals.
Independence sampler with proposal distribution $N(0,4)$.
Random-walk Metropolis started at 100 with $N(0,1)$ jump proposals.
Random-walk Metropolis with $N(0,0.01)$ jump proposals.
[Figure: traceplots for the four chains.]
Let $\bar{x} = \frac{1}{N}\sum_{i=1}^N X_i$; then, if the $X_i$ are independent with variance $\sigma^2$,
$$\mathrm{Var}(\bar{x}) = \frac{\sigma^2}{N}.$$
Recap: ANOVA for $m$ groups of size $n$, a comparison of two variance estimates: between groups (from the group means) and within groups (about the group means).
Suppose that you have $X_{ij} \sim N(\mu, \sigma^2)$, independently, for groups $i = 1, \ldots, m$ and observations $j = 1, \ldots, n$ within each group.
Let $\bar{X}_i = \frac{1}{n}\sum_{j=1}^n X_{ij}$, with $\mathbb{E}[\bar{X}_i] = \mu$ and $\mathrm{Var}(\bar{X}_i) = \sigma^2/n$, and let $\bar{X} = \frac{1}{m}\sum_{i=1}^m \bar{X}_i$ denote the overall mean.
Note that
$$\mathbb{E}\left[\frac{1}{m-1}\sum_{i=1}^m(\bar{X}_i - \bar{X})^2\right] = \frac{\sigma^2}{n}.$$
Hence $B = \frac{n}{m-1}\sum_{i=1}^m(\bar{X}_i - \bar{X})^2$ is an unbiased estimate of $\sigma^2$. (This estimate is based upon between-group variability.)
Also
$$\mathbb{E}\left[\frac{1}{n-1}\sum_{j=1}^n(X_{ij} - \bar{X}_i)^2\right] = \sigma^2.$$
Hence $W = \frac{1}{m}\sum_{i=1}^m\frac{1}{n-1}\sum_{j=1}^n(X_{ij} - \bar{X}_i)^2$ is also an unbiased estimate of $\sigma^2$. (This estimate is based upon within-group variability.)
Finally, $T = \frac{1}{mn-1}\sum_{i=1}^m\sum_{j=1}^n(X_{ij} - \bar{X})^2$ is also an unbiased estimate of $\sigma^2$. (This estimate is based upon total variability.)
Thus $B$, $W$ and $T$ are estimates of the variance based upon the between-group, within-group and total variation, respectively.
We will run $m$ (e.g. 5) separate chains (e.g. of 1000 iterations each) from different starting points which are more spread out than ("overdispersed with respect to") the posterior, and compare two estimates of the variance of the stationary distribution.
The GR Convergence Diagnostic Procedure
Run $m$ chains for $2N$ iterations and discard the first $N$ iterations of each chain as burn-in. (Keep the last $N$ iterations from each chain.)
Let $\bar{X}_i$ denote the mean parameter value of chain $i$ and let $\bar{X} = \frac{1}{m}\sum_{i=1}^m \bar{X}_i$ denote the overall mean. Let
$$s_i^2 = \frac{1}{N-1}\sum_{t=N+1}^{2N}(X_{it} - \bar{X}_i)^2$$
denote the variance of chain $i$.
Calculate the between-chain variance estimate, $B = \frac{N}{m-1}\sum_{i=1}^m(\bar{X}_i - \bar{X})^2$, and the within-chain variance estimate, $W = \frac{1}{m}\sum_{i=1}^m s_i^2$.
Calculate $V = \frac{N-1}{N}W + \frac{1}{N}B$ (nearly the total variance of all the chains together).
If the chains have all converged to the same distribution then $W$, $B$ and $V$ should all be estimates of the variance of the stationary distribution, $\sigma^2$, i.e. they should all be similar.
Evaluate $\hat{R} = \sqrt{V/W}$.
If the chains have not yet converged then $B$ will be too large (because the chains are still too spread out) and $W$ will be too small (since each chain has not yet traversed the state space).
So if the chains have not converged then $\hat{R} > 1$; otherwise $\hat{R} \approx 1$.
Gelman (1996) suggests accepting convergence once $\hat{R}$ is sufficiently close to 1; a common rule of thumb is $\hat{R} < 1.2$.
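A sketch of this calculation in R, for chains stored as the columns of a matrix with burn-in already discarded (function names are my own):

# Gelman-Rubin statistic (sketch): chains are columns of a matrix.
gelman_rubin <- function(chains) {
  N <- nrow(chains); m <- ncol(chains)
  B <- N * var(colMeans(chains))           # between-chain variance estimate
  W <- mean(apply(chains, 2, var))         # within-chain variance estimate
  V <- (N - 1) / N * W + B / N             # ~ total variance
  sqrt(V / W)                              # R-hat
}

# e.g. three RWM chains for a N(0,1) target from overdispersed starts
run_rwm <- function(x0, n = 1000) {
  x <- numeric(n); x[1] <- x0
  for (t in 2:n) {
    y <- rnorm(1, x[t - 1], 1)
    x[t] <- if (log(runif(1)) < (x[t - 1]^2 - y^2) / 2) y else x[t - 1]
  }
  x
}
chains <- sapply(c(-10, 0, 10), run_rwm)
gelman_rubin(chains)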
The figure shows a plot of the GR statistic using 2 chains: the fast-mixing RWM and the independence sampler.
[Figure: Gelman-Rubin statistic for the two chains.]
The figure shows a plot of the GR statistic using 3 chains: the fast-mixing RWM, the independence sampler, and the RWM started at 100.
What does the Gelman-Rubin statistic tell us?
It is very helpful for checking whether or not the distribution is multi-modal, in that if we start at a wide range of different values we will converge to different modes if they exist; we then need to ensure the MCMC can efficiently move between modes. It also gives an indication of a safe burn-in period. For example, if $\hat{R} \approx 1$ for all iterations beyond 1000, this tells us that a burn-in of 1000 iterations should be sufficient to ensure convergence.
If you had independent samples $X_1, \ldots, X_N$ from a random variable $X$ with distribution $\pi$, then you could estimate
$$\mathbb{E}[X] \approx \bar{x} = \frac{1}{N}\sum_{i=1}^N X_i$$
and
$$\mathbb{P}(X \in A) \approx \frac{1}{N}\sum_{i=1}^N \mathbf{1}(X_i \in A).$$
But you could also estimate the error in each of these approximations; for example, the standard error of $\bar{x}$ is $\sigma/\sqrt{N}$.
Suppose that we have a chain of length 1000, $X_1, \ldots, X_{1000}$. Each element of the chain depends on the previous element. Indeed, because the previous element depends on the one before that, there is dependency between $X_t$ and $X_{t+2}$, etc. Because of this (positive) correlation, there is generally less information in the 1000 elements of the chain than there would have been in 1000 independent draws.
If the chain is stationary then $\rho_1 = \mathrm{Corr}(X_t, X_{t+1})$ should not depend on $t$; $\rho_1$ is called the lag-1 autocorrelation. Similarly define $\rho_k = \mathrm{Corr}(X_t, X_{t+k})$ as the lag-$k$ autocorrelation. Note that the lag-0 autocorrelation is 1, in that any random variable is perfectly correlated with itself!
The plots below show the autocorrelations up to lag 50 for the four chains sampling from $N(0,1)$: RWM with $N(0,1)$ jumps, independence sampler with $N(0,4)$ proposal, RWM with $N(0,1)$ jumps (started at 100), and RWM with $N(0,0.01)$ jumps. These correspond to the traceplots given at the start of the Section.
[Figure: acf plots for the four chains.]
The RWM with $N(0,1)$ jumps performs quite well, with $\rho_k$ decaying to (close to) zero within a modest number of lags. The independence sampler works very well, with $\rho_1$ small and all other $\rho_k$ negligible; this means that there is only a small cost for this algorithm relative to an independent sample. For the chain started at 100, the starting value has a significant effect on the acf plot. The RWM with $N(0,0.01)$ jumps mixes slowly, and consequently $\rho_k$ decreases only slowly as $k$ increases.
The integrated autocorrelation time for an infinite Markov chain is defined as
$$\tau = 1 + 2\sum_{k=1}^{\infty}\rho_k.$$
In practice we only have a finite chain and so calculate
$$\hat{\tau} = 1 + 2\sum_{k=1}^{K}\hat{\rho}_k,$$
where $K$ is the first lag at which $\hat{\rho}_k$ falls below some threshold (e.g. 0.05). The larger the integrated ACT, the higher the correlation, and so the less information there is in the chain. In practice we only estimate the $\hat{\rho}_k$ after burn-in (since we only use the part of the chain after burn-in); for the chain started at 100, the estimate is based upon the final 800 iterations.
For $N$ independent samples, $\mathrm{Var}(\bar{x}) = \sigma^2/N$.
Define the effective sample size $N_{\mathrm{eff}} = N/\hat{\tau}$.
Lemma. For a stationary series of correlated variables with $\mathrm{Var}(X_i) = \sigma^2$,
$$\mathrm{Var}(\bar{x}) \approx \frac{\sigma^2\tau}{N} = \frac{\sigma^2}{N_{\mathrm{eff}}}.$$
Proof:
$$\mathrm{Var}(\bar{x}) = \frac{1}{N^2}\left\{\sum_{i=1}^{N}\mathrm{Var}(X_i) + 2\sum_{i<j}\mathrm{Cov}(X_i, X_j)\right\} = \frac{\sigma^2}{N}\left\{1 + 2\sum_{k=1}^{N-1}\left(1 - \frac{k}{N}\right)\rho_k\right\} \approx \frac{\sigma^2}{N}\left(1 + 2\sum_{k=1}^{\infty}\rho_k\right) = \frac{\sigma^2\tau}{N}.$$
The effective sample sizes for our four chains can be computed in the same way (excluding burn-in for the chain started at 100). The 1000 values in the final chain contain far less information than would 1000 independent values!
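A sketch of the corresponding calculation in R, truncating the empirical acf at the first lag below a threshold (run_rwm is the helper from the Gelman-Rubin sketch above):

# Integrated autocorrelation time and effective sample size (sketch).
act_ess <- function(chain, threshold = 0.05, max_lag = 100) {
  rho <- acf(chain, lag.max = max_lag, plot = FALSE)$acf[-1]  # rho_1, rho_2, ...
  K <- which(rho < threshold)[1]           # first lag below the threshold
  tau <- 1 + 2 * sum(rho[1:K])             # estimated integrated ACT
  c(tau = tau, ess = length(chain) / tau)
}

act_ess(run_rwm(0, 10000))                 # compare with coda::effectiveSize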
How do I decide when my chain has converged? Both by eye and (if feasible) using the Gelman-Rubin statistic.
When my chain has converged, how do I decide if it is mixing quickly enough (or equivalently, if the chain is long enough to get accurate answers)? By eye, and by looking at the effective sample size and the standard error of your estimate.
How do I choose my algorithm so that I do not need "too many" iterations? (e.g. Would a different proposal have been better for the independence sampler? Would a larger or smaller jump proposal have been better for the RWM?) To be discussed!
What do we mean by "better" in the previous question?! Larger effective sample size for the same length of chain.
Write R code to implement the independence sampler for the genetics example. (Section 4.3.2).
Write R code to implement the independence sampler within Gibbs for the censored data. (Section 4.4).
Suppose that there exists $\theta$ such that $X$ has pdf
for fixed values of the parameters. Write and implement the following independence sampler programs for obtaining samples from $X$.
for , a uniform proposal on .
and giving
How do the two independence samplers compare?
Theoretical Question: For , comment upon the choice of .
The files background.R, RWM1d.R and Indep1d.R contain the code for applying the random-walk Metropolis algorithm and the independence sampler to five 1-d distributions. The file background.R gives the background functions, whilst RWM1d.R and Indep1d.R run the random-walk and independence sampler algorithms.
One observation with the R code is that it works with the acceptance probability on the log scale, as discussed above.
For the 1D functions you have a choice of target (stationary) distributions:
Gaussian: $\pi(x) \propto e^{-x^2/2}$
Laplace: $\pi(x) \propto e^{-|x|}$
t5: $\pi(x) \propto (1 + x^2/5)^{-3}$
Cauchy: $\pi(x) \propto (1 + x^2)^{-1}$
Bimodal: a target with two modes
Of course in real life it is easy to obtain an independent sample from each of these distributions and we would never run an MCMC chain for this purpose. However these distributions have different properties and it can be instructive to see how the behaviour of the chain varies in these simple examples. These rules of thumb can then be carried over to more complex scenarios.
You may choose your proposal distribution from Gaussian, Laplace, t5, or Cauchy; you also choose the scale parameter for the proposal. For example, if you chose a Gaussian proposal with scale parameter $\sigma$ when calling RWM1d, then the proposed jump is $\epsilon \sim N(0, \sigma^2)$, i.e. $y = x + \epsilon$. The same proposal for the independence sampler would mean $y \sim N(0, \sigma^2)$.
Try running RWM1d or Indep1d using the example function call at the bottom of the file - you will see that it produces some basic output.
Important for intuition: note that the Gaussian has lighter tails than the Laplace, which has lighter tails than the t5, which has lighter tails than the Cauchy.
Introducing coda. Spend no more than 15 minutes on this section!
Run the 1D RWM function for 1000 iterations, starting at 0, using a Gaussian proposal, a Gaussian target, and a scale parameter of 0.5:
a1=RWM1d(1000,0.5,"Gaussian","Gaussian",0)
a1 is a vector of length 1000 (check!). To use the coda package we first load the library
library(coda)
We then need to turn a1 into an mcmc object.
a1.mc=mcmc(a1)
Try out the following, work out and note down what they do; if
necessary use the help facility.
summary(a1.mc)
rejectionRate(a1.mc) (How does this relate to the acceptance rate?)
plot(a1.mc,smooth=F)
traceplot(a1.mc,smooth=F)
densityplot(a1.mc,smooth=F)
effectiveSize(a1.mc)
autocorr.diag(a1.mc)
autocorr.plot(a1.mc,lag.max=50)
cumuplot(a1.mc) (Why might this be useful?)
To use the Gelman-Rubin convergence test we need to have a few more
chains started at very different points.
a2=RWM1d(1000,0.5,"Gaussian","Gaussian",4)
a3=RWM1d(1000,0.5,"Gaussian","Gaussian",-4)
Turn these into mcmc objects a2.mc and a3.mc.
Create an mcmc list
a.list=mcmc.list(a1=a1.mc,a2=a2.mc,a3=a3.mc)
Now try the following
gelman.diag(a.list)
gelman.plot(a.list)
You should find that the chains take about 200 iterations to converge (check using the trace plots). You would therefore only use the last 800 iterations to calculate any quantities of interest (e.g. mean and variance).
coda has a number of other functions - feel free to
investigate in your own time.
https://cran.r-project.org/web/packages/coda/coda.pdf
The main idea of this part of the lab is that you should investigate how the choice of scaling and proposal distribution affects behaviour for different targets. The instructions below comprise a minimal set of tasks - there are many further investigations if you have time!
Throughout this part of the lab it will be better for you to complete one investigation (e.g. Qn1 below) than to partially complete many.
Mixing
First try RWM1d(); each time you should run the algorithm for 10000 iterations and start at a point in the main mass of the target distribution (so we do not have to worry about convergence). You should try a variety of different scalings of the proposal, ranging from much smaller to much larger than the typical scale of the target. For each chain, look at the trace plot to see how the chain behaves; note down the acceptance rate and the effective sample size.
Repeat some of the simulations at least 3 times to check for consistency.
For each of the four stages below (and any other possibilities which you wish to try) answer the following:
Which scaling gives the best effective sample size?
What is that best effective sample size?
What is the acceptance rate at this best scaling?
How did the chains behave?
Be sure to examine the traceplots as well as calculating the effective sample size.
Use a Gaussian target and Gaussian proposal.
Repeat (1) for a Gaussian target but instead using a Cauchy proposal.
Repeat (1) and (2) using a Cauchy target with a Gaussian proposal (also try a Cauchy proposal if time allows).
Repeat (1) and (2) but using the bimodal target with a Gaussian proposal (also try a Cauchy proposal if time allows).
What do you conclude? Can you explain your findings?
Now look at Indep1d() following exactly the same instructions as above; your findings and conclusions should be different!
Convergence
Note: if time is short then the minimal set of investigations for this section comprises Qn1 and Qn3 below for RWM1d().
For all chains in this section you should start at a point far from the main mass of the target. In each case run a chain with the best scaling that you found in the previous section, and with one other scaling. In each case investigate the number of iterations which it takes for the chain to converge, either using the traceplot alone, or by running several chains, looking at the traceplots and using the GR plot.
Investigate RWM1d using the following combinations
Gaussian target and Gaussian proposal.
Gaussian target and Cauchy proposal.
Cauchy target and Gaussian proposal.
Cauchy target and Cauchy proposal.
Feel free to try other combinations if there is time!
Now investigate Indep1d using the same combinations of target and proposal as for RWM1d().
Suppose that $X \sim \pi$ for some target distribution $\pi$. The following 3 independence samplers are suggested for obtaining a sample from $\pi$.
Proposal distribution with proposal probability density function .
Proposal distribution with proposal probability density function .
Proposal distribution with proposal probability density function .
Why is Sampler 3 the best choice for obtaining samples from $\pi$, and why are the other two independence samplers unsuitable?
A sample of size 1000 was obtained using Sampler 3 and the output is summarised in the form of a traceplot and acf plot in Figure 1, with an acceptance rate of 0.883.
[Figure 1: traceplot and acf plot of a sample of size 1000 from $\pi$ using Sampler 3.]
Comment upon the performance of Sampler 3.
A doctor is interested in the survival time of patients (in years) following major surgery. It is proposed that the survival time, $X$, follows a Gamma distribution with shape parameter 2 and scale parameter $\theta$. That is, $X$ has probability density function,
$$f(x \mid \theta) = \theta^2 x e^{-\theta x}, \qquad x > 0.$$
Suppose that $n$ patients were monitored and their survival times $x_1, \ldots, x_n$ are recorded.
Given that the prior distribution on $\theta$ is a Gamma distribution, derive the posterior distribution of $\theta$ given $x = (x_1, \ldots, x_n)$.
Due to limited resources, patients are only monitored for 5 years following surgery. Therefore, if the survival time of a patient is less than or equal to 5 years it is known exactly. However, if the survival time of a patient is more than 5 years, all that is known is that the survival time exceeds 5 years (censored data). Censored survival times are denoted by $x^\ast$.
Suppose that for $i = 1, \ldots, m$, the $x_i$ are observed exactly (survival time less than or equal to 5 years) and for the remaining $n - m$ patients their survival times are censored at 5 years.
The survival time, $X$, given that it exceeds 5 years, has probability density function
$$f(x \mid x > 5, \theta) = \frac{\theta^2 x e^{-\theta x}}{e^{-5\theta}(1 + 5\theta)}, \qquad x > 5.$$
Describe an independence sampler for simulating from this distribution.
Describe a Gibbs sampler for obtaining samples from the posterior distribution of $\theta$ given the observed data $x = (x_1, \ldots, x_m, x^\ast_{m+1}, \ldots, x^\ast_n)$.
Suppose that $X$ is a non-negative random variable; that is, for $x < 0$, $\pi(x) = 0$. A reflected random walk proposal is used. That is, if the MCMC is currently at $x$, the proposed value, $y$, is generated as
$$y = |x + Z|, \qquad Z \sim N(0, 1).$$
Let $\phi$ denote the probability density function of a standard Normal distribution.
Show that $q(x, y) = \phi(y - x) + \phi(y + x)$ for $y \geq 0$.
Hence, or otherwise, show that the acceptance probability, $\alpha(x,y)$, for the proposed move from $x$ to $y$ satisfies
$$\alpha(x,y) = \min\left\{1, \frac{\pi(y)}{\pi(x)}\right\}.$$