The key feature of RWM (random walk Metropolis) is that the proposal distribution is ‘centred’ on the current state of the Markov chain. That is, the proposed or candidate value $y$ is given by
$$y = x + Z,$$
where the random variable $Z$ has density $q$, and $q$ is usually symmetric. When $q$ is symmetric the acceptance probability simplifies to
$$\alpha(x,y) = \min\left\{1, \frac{\pi(y)}{\pi(x)}\right\}.$$
Proof. Let $Z$ have density $q$ and, for $q$ to be symmetric, $q(z) = q(-z)$ for all $z$. Therefore,
$$q(y \mid x) = q(y - x) = q(x - y) = q(x \mid y).$$
Hence,
$$\alpha(x,y) = \min\left\{1, \frac{\pi(y)\, q(x \mid y)}{\pi(x)\, q(y \mid x)}\right\} = \min\left\{1, \frac{\pi(y)}{\pi(x)}\right\}.$$
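To make the algorithm concrete, here is a minimal R sketch of RWM for a univariate target. The standard normal target (dnorm) and the particular values of sigma, x0 and n.iter are illustrative assumptions rather than part of the notes.

# Minimal random walk Metropolis sketch with a symmetric normal proposal.
# The target density and tuning values below are illustrative only.
rwm <- function(n.iter, x0, sigma, target) {
  x <- numeric(n.iter)
  x[1] <- x0
  for (i in 2:n.iter) {
    y <- x[i - 1] + sigma * rnorm(1)                 # propose y = x + Z, Z ~ N(0, sigma^2)
    alpha <- min(1, target(y) / target(x[i - 1]))    # symmetric proposal: pi(y) / pi(x)
    x[i] <- if (runif(1) < alpha) y else x[i - 1]    # accept, or stay at the current value
  }
  x
}

# Example: sampling from a N(0,1) target.
set.seed(1)
out <- rwm(n.iter = 5000, x0 = 0, sigma = 1, target = dnorm)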
RWM is possibly the most widely used MCMC algorithm, but why? There are a number of reasons.
It generally works pretty well. The basic idea is that, in stationarity, the chain spends most of its time in regions of high probability. Therefore, if the posterior distribution is continuous and we are currently at $x$, a point with high posterior density, then for any point $y$ close to $x$ the posterior density will also be high. Thus it makes sense to propose moves relative to the current position $x$.
It is very straightforward to implement, even in high dimensions. All we need is that our choice of $q$ does not result in a reducible Markov chain. A straightforward choice for $q$ when the parameter is $d$-dimensional is $\sigma$ times a $d$-dimensional standard normal distribution, where $\sigma$ is a scaling factor for the proposal standard deviation. This is in contrast to the Gibbs sampler, which requires the full conditional distributions, and the independence sampler, which requires the proposal distribution to be a good approximation to the posterior distribution. Moreover, the independence sampler can get ‘stuck’ for very long periods if the ratio of the posterior density to the proposal density is unbounded.
Correlation/dependence between different parameters can be overcome. The Gibbs sampler was shown to perform badly when parameters are highly correlated. This problem can, to some extent, be circumvented in RWM by choosing a proposal distribution whose correlation structure between pairs of parameters is similar to that of the posterior distribution.
Thus far the RWM algorithm seems ideal but it does have some drawbacks. These include:
It is too general: a general-purpose method can ride roughshod over the nuances of a particular model.
There can still be problems with convergence.
There is still the problem of dependence between samples, and the need for an appropriate choice of variance for the proposal distribution. We will discuss this issue in some detail.
Multi-modal distributions. There can be problems with the chain moving between different modes. This problem is related to convergence: the MCMC chain may appear to have converged when in reality it is only moving around one of the modes. This is an important reason for performing Gelman-Rubin diagnostics and starting the MCMC chain from several different starting values.
The choice of proposal variance is a very important issue for RWM. We shall illustrate why with a simple example. Let $\pi$ be the $N(0,1)$ density, and therefore
$$\pi(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right).$$
Let the proposal distribution be $N(x, \sigma^2)$; that is, $Y = x + \sigma Z$ with $Z \sim N(0,1)$. The algorithm was run with three choices of $\sigma$: a small, an intermediate and a large value. The algorithm starts in stationarity, so there is no need for a burn-in period. The results of the first 500 iterations are plotted below, one plot for each choice of $\sigma$.
Unnumbered Figure: Link
Unnumbered Figure: Link
Unnumbered Figure: Link
All three choices of $\sigma$ result in the same stationary (posterior) distribution for the Markov chain. However, the results are very different. For the smallest value of $\sigma$, the acceptance probability is 0.923, so most proposed moves are accepted. However, most of the moves are very small, and therefore it takes the Markov chain a long time to explore the posterior distribution. At the other extreme, for the largest value of $\sigma$, the acceptance probability is very low (fewer than 1 in 50 proposed moves are accepted). Although accepted moves are large moves in the posterior distribution, the chain spends a long time ‘stuck’ at most of the points visited. Therefore there is a need for a compromise between accepting a reasonable number of proposed moves and making reasonably sized moves when proposals are accepted. The intermediate value of $\sigma$, with an acceptance probability of 0.243, presents a happy compromise. To demonstrate this further, the effective sample size based upon a sample of size 1000 was computed for each of the three choices of $\sigma$.
Therefore we have seen in the above example that choosing $\sigma$ such that the probability of accepting a move is roughly a quarter works well. Can we derive more general conclusions on how to choose $\sigma$? Perhaps somewhat surprisingly, we can. It has been proved that for (almost) all continuous uni-modal posterior distributions an acceptance rate of about 0.234 is (close to) optimal, in terms of maximising the effective sample size. (See [3] if you are feeling brave!) The rule is a limit result as the dimensionality of the target tends to infinity, but it is very good once the dimensionality gets above 5. (For $\pi$ the pdf of a $N(0,1)$, the optimal (average) acceptance probability is about 0.44.) Therefore a very general rule of thumb is to run small MCMC trials with different values of $\sigma$ until the acceptance rate is approximately 0.234, and then perform the main MCMC run with this choice of $\sigma$. There is no need to fine-tune too much, since a fairly wide range of acceptance rates around this value is close to optimal. That is, (almost) regardless of the posterior distribution, we can approximately optimise the RWM algorithm by tuning the proposal standard deviation. For the above Normal example $\sigma \approx 2.4$ is optimal, and when the algorithm was run with this choice of $\sigma$ an effective sample size of 308.3 was obtained.
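The rule of thumb above can be automated with short pilot runs. The sketch below reuses the rwm() function from earlier; the grid of candidate values of sigma, and the use of the proportion of iterations on which the chain moved as a proxy for the acceptance rate, are illustrative choices.

# Pilot runs: pick the sigma whose empirical acceptance rate is closest to 0.234.
acc.rate <- function(x) mean(diff(x) != 0)    # proportion of iterations on which the chain moved

sigmas <- c(0.1, 0.5, 1, 2.4, 5, 10)          # illustrative grid of proposal standard deviations
rates <- sapply(sigmas,
                function(s) acc.rate(rwm(n.iter = 2000, x0 = 0, sigma = s, target = dnorm)))
sigma.star <- sigmas[which.min(abs(rates - 0.234))]

The main run would then use sigma.star; since acceptance rates in a broad band around 0.234 perform almost as well, a coarse grid is sufficient.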
As mentioned above, care needs to be taken when applying the RWM algorithm to multi-modal distributions. The main problem is that the algorithm can get ‘stuck’ at one of the modes: the MCMC chain moves around, but remains close to one modal value and takes a very long time to explore the other modal values. This is most easily seen with an example. Let $x$ have a bimodal posterior density $\pi$ with modes at 0 and 10. Suppose that the proposal distribution is normal with standard deviation $\sigma$; that is, $Y \sim N(x, \sigma^2)$. The RWM algorithm was run to obtain samples of size 1000 from the posterior distribution, once with a small and once with a large value of $\sigma$, with the results presented below.
Unnumbered Figure: Link
Unnumbered Figure: Link
At first glance, both runs appear to have converged, judging by a visual inspection of the output. However, closer inspection of the $y$-axis shows that for the smaller value of $\sigma$ the MCMC chain has spent all 1000 iterations centred about the mode at 10. (Even in a run of length 100000 the MCMC chain did not find the mode at 0.) Therefore the sample obtained is not representative of the posterior distribution, even though the algorithm is correct. On the other hand, with the larger value of $\sigma$ the algorithm moves freely between the two modal values of 0 and 10, and this sample is far more representative of the posterior distribution.
The moral of the story is that MCMC runs can appear to have converged even when they have not. It is important that the MCMC chain explores the whole of the posterior distribution and does not get ‘stuck’ at one mode. This is a primary reason for using Gelman-Rubin diagnostics with different starting points for the MCMC chains. For multi-modal distributions, hybrid RWM algorithms can be very effective; that is, RWM algorithms which propose moves using both low and high proposal variances. The low-variance moves are good for exploring the posterior distribution around a modal value, whereas the high-variance moves are useful for finding other modal values.
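A minimal sketch of such a hybrid scheme is given below: at each iteration the proposal standard deviation is chosen at random from a small and a large value, so the overall proposal density remains symmetric and the acceptance probability is still $\min\{1, \pi(y)/\pi(x)\}$. The two-mode target (modes at 0 and 10, equal weights and unit standard deviations), the mixing probability 0.8 and the two proposal scales are all illustrative assumptions.

# Hybrid RWM: mix a small-variance and a large-variance symmetric proposal.
hybrid.rwm <- function(n.iter, x0, sig.small, sig.large, p.small, target) {
  x <- numeric(n.iter)
  x[1] <- x0
  for (i in 2:n.iter) {
    sig <- if (runif(1) < p.small) sig.small else sig.large   # pick a proposal scale at random
    y <- x[i - 1] + sig * rnorm(1)
    alpha <- min(1, target(y) / target(x[i - 1]))              # proposal is still symmetric
    x[i] <- if (runif(1) < alpha) y else x[i - 1]
  }
  x
}

# Illustrative bimodal target with modes at 0 and 10.
bimodal <- function(x) 0.5 * dnorm(x, 0, 1) + 0.5 * dnorm(x, 10, 1)
out <- hybrid.rwm(n.iter = 10000, x0 = 0, sig.small = 1, sig.large = 10,
                  p.small = 0.8, target = bimodal)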
We consider two simple adaptations of the RWM algorithm. The first, multiplicative RWM, is appropriate for densities on the positive half-line ($\pi(x) > 0$ only for $x > 0$), especially for heavy-tailed distributions. The second, MALA (the Metropolis adjusted Langevin algorithm), is cleverer than RWM in that it takes the derivative of $\pi$ into account when proposing new values.
Let $X$ have pdf $\pi$, where $\pi(x) > 0$ only for $x > 0$. Then multiplicative RWM essentially transforms the problem onto the log scale as follows. Propose
$$y = x \exp(Z),$$
where $Z$ has a symmetric density. Or equivalently, if $w = \log x$ and $v = \log y$, propose
$$v = w + Z.$$
Let $f_W$ denote the pdf of $W = \log X$; then
$$f_W(w) = \pi(e^w)\, e^w,$$
since $\left|\frac{dx}{dw}\right| = e^w$. Therefore, for multiplicative RWM, the acceptance probability is
$$\min\left\{1, \frac{\pi(y)\, y}{\pi(x)\, x}\right\}.$$
(The ratio $\pi(y)/\pi(x)$ is the usual RWM acceptance ratio, with the additional factor $y/x$ arising from the transformation to the log scale.)
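A minimal R sketch of multiplicative RWM is given below. The target (a half-Cauchy density, chosen only because it is heavy tailed and positive), the proposal standard deviation and the starting value are illustrative assumptions.

# Multiplicative RWM: a random walk on log(x), i.e. propose y = x * exp(sigma * Z).
mult.rwm <- function(n.iter, x0, sigma, target) {
  x <- numeric(n.iter)
  x[1] <- x0
  for (i in 2:n.iter) {
    y <- x[i - 1] * exp(sigma * rnorm(1))                      # RWM on the log scale
    ratio <- (target(y) * y) / (target(x[i - 1]) * x[i - 1])   # includes the factor y / x
    x[i] <- if (runif(1) < min(1, ratio)) y else x[i - 1]
  }
  x
}

# Illustrative heavy-tailed target: half-Cauchy density on (0, infinity), started in the tail.
half.cauchy <- function(x) ifelse(x > 0, 2 / (pi * (1 + x^2)), 0)
out <- mult.rwm(n.iter = 1000, x0 = 100, sigma = 1, target = half.cauchy)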
We demonstrate the benefits of multiplicative RWM with
For both standard RWM and multiplicative RWM, I started the MCMC algorithm in the tail of $\pi$, at $x = 100$. Trace plots of the first 1000 iterations of each algorithm are presented below. We can see that the multiplicative RWM quickly converges to the centre of probability mass around 0, whereas the RWM algorithm shows random walk behaviour around 100. The RWM algorithm does eventually get close to 0, but it takes quite a few iterations. The RWM algorithm is correct but performs poorly for heavy-tailed densities (probability densities for which $\pi(x+z)/\pi(x) \to 1$ as $x \to \infty$ for all fixed $z$). This is because RWM can get stuck in the tails of the distribution, wandering apparently aimlessly for a long time before returning to the area of high probability mass: for large $x$, if $y = x + z$ where $z$ is small compared to $x$, then $\pi(y)/\pi(x) \approx 1$, so most moves are accepted, hence the random walk behaviour.
Unnumbered Figure: Link
RWM trace plot.
Unnumbered Figure: Link
Multiplicative RWM trace plot.
In addition, both algorithms were run for 100000 iterations. Both algorithms provide good estimation of $\pi$ in the main body of the distribution. However, if we consider large values of $x$, we can see that multiplicative RWM is better. In 100000 iterations, on average there will be 75 observations greater than 10 and 11 observations greater than 20. For the RWM algorithm there were only 23 observations greater than 10 and none greater than 20 (the maximum value was 13.5). On the other hand, for multiplicative RWM there were 67 observations greater than 10 and 17 observations greater than 20. The maximum value for multiplicative RWM was 41, and under $\pi$ the probability of obtaining a value greater than 41 is very small. Thus the RWM algorithm is less inclined to enter the tail of $\pi$ than the multiplicative RWM algorithm. However, as we saw above, when the RWM algorithm does enter the tail of $\pi$ it spends a long time there before returning to the area of high probability mass.
The standard RWM algorithm proposes a new value centred on the current value. However, it makes sense, where possible, to make use of knowledge of $\pi$. In particular, we want to propose more values where $\pi$ is large than where $\pi$ is small. Therefore if we know $\pi$, and in particular its derivative, we can use this to concentrate the proposal in areas of the distribution with high probability mass. The MALA (Metropolis adjusted Langevin algorithm) is as follows.
Propose a new value $y$ by setting
$$y = x + \frac{\sigma^2}{2} \nabla \log \pi(x) + \sigma Z,$$
where $Z \sim N(0,1)$.
Accept the proposed value with probability
$$\min\left\{1, \frac{\pi(y)\, q(x \mid y)}{\pi(x)\, q(y \mid x)}\right\},$$
where $q(\cdot \mid x)$ is the density of a $N\!\left(x + \tfrac{\sigma^2}{2} \nabla \log \pi(x),\, \sigma^2\right)$ distribution.
If the proposed value is accepted, set the next state of the chain equal to $y$; otherwise set it equal to $x$.
The key point to note is that the proposal is not symmetric, i.e. $q(y \mid x) \neq q(x \mid y)$, and therefore this needs to be taken account of in the acceptance probability. The class of models to which MALA can be applied is smaller than for RWM, and using MALA is more involved as we need the derivative of $\pi$. However, where MALA can be used it is preferable to RWM. For high-dimensional problems MALA is seen to have some very nice properties, see [4]. In particular, if $d$, the number of dimensions, is large, the optimal choice of $\sigma$ for MALA scales as $d^{-1/6}$, compared with $d^{-1/2}$ for RWM, thus potentially offering substantial improvements.
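A minimal R sketch of MALA is given below, written in terms of the log target and its derivative so that the acceptance ratio can be computed stably on the log scale. The standard normal target and the value of sigma are illustrative assumptions.

# MALA sketch: the proposal mean is shifted towards higher density using d/dx log pi(x).
mala <- function(n.iter, x0, sigma, log.target, grad.log.target) {
  x <- numeric(n.iter)
  x[1] <- x0
  for (i in 2:n.iter) {
    xc <- x[i - 1]
    mu.x <- xc + 0.5 * sigma^2 * grad.log.target(xc)   # mean of the forward proposal
    y <- rnorm(1, mu.x, sigma)
    mu.y <- y + 0.5 * sigma^2 * grad.log.target(y)     # mean of the reverse proposal
    log.alpha <- log.target(y) - log.target(xc) +
      dnorm(xc, mu.y, sigma, log = TRUE) - dnorm(y, mu.x, sigma, log = TRUE)
    x[i] <- if (log(runif(1)) < log.alpha) y else xc   # accept, or remain at the current value
  }
  x
}

# Example: N(0,1) target, so log pi(x) = -x^2/2 (up to a constant) and its derivative is -x.
out <- mala(n.iter = 5000, x0 = 0, sigma = 1,
            log.target = function(x) -x^2 / 2,
            grad.log.target = function(x) -x)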
We conclude the course with a few examples of the Metropolis-Hastings algorithm applied to statistical problems.
Suppose that for $i = 1, \ldots, n$, $X_i \sim \mathrm{Po}(\lambda)$ independently. The likelihood is
$$L(\lambda; \mathbf{x}) = \prod_{i=1}^{n} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} \propto \lambda^{\sum_{i=1}^{n} x_i} e^{-n\lambda}.$$
The conjugate prior for $\lambda$ is a Gamma$(\alpha, \beta)$ distribution, with density proportional to $\lambda^{\alpha - 1} e^{-\beta \lambda}$.
But gamma priors are often extremely unsatisfactory for expressing ignorance (e.g. just plot the density for small values of $\alpha$ and $\beta$).
Instead, use a half-Cauchy prior, $\pi(\lambda) \propto \dfrac{1}{1 + (\lambda/A)^2}$ for $\lambda > 0$, for some scale $A$. The posterior is
$$\pi(\lambda \mid \mathbf{x}) \propto \frac{\lambda^{\sum x_i}\, e^{-n\lambda}}{1 + (\lambda/A)^2}.$$
Perhaps the easiest MH algorithm in this case would be an independence sampler with a Gamma$(\sum x_i + 1, n)$ proposal, i.e. the posterior that would arise from a flat prior. Since the gamma terms in the posterior and the proposal cancel, in this case the acceptance probability is
$$\min\left\{1, \frac{1 + (\lambda/A)^2}{1 + (\lambda'/A)^2}\right\},$$
where $\lambda$ is the current and $\lambda'$ the proposed value.
Alternatively, a random walk Metropolis algorithm with proposal
$$\lambda' \sim N\!\left(\lambda,\; 2.4^2 \times \frac{\sum x_i + 1}{n^2}\right)$$
should also work well.
Why the above choice of proposal variance? The variance of the Gamma$(\sum x_i + 1, n)$ distribution is $(\sum x_i + 1)/n^2$. (This is the posterior obtained with the improper, flat prior $\pi(\lambda) \propto 1$.) We scale by a factor of $2.4^2$ as we know this is (approximately) the optimal proposal scaling for a normal target. Note that if $\sum x_i$ is large, then by the central limit theorem the posterior is approximately normal.
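As a hedged illustration, both suggested samplers can be coded in a few lines. The data summaries sumx and n, the half-Cauchy scale A and the run length below are arbitrary illustrative values, and the prior is taken to be proportional to $1/\{1 + (\lambda/A)^2\}$ on $\lambda > 0$, as assumed above.

# Sketch of the independence sampler and RWM for the Poisson rate lambda.
log.post <- function(lambda, sumx, n, A) {
  if (lambda <= 0) return(-Inf)
  sumx * log(lambda) - n * lambda - log(1 + (lambda / A)^2)
}

sumx <- 50; n <- 10; A <- 10; n.iter <- 10000       # illustrative values only
lam.is <- lam.rw <- numeric(n.iter)
lam.is[1] <- lam.rw[1] <- sumx / n                  # start both chains at the MLE
sd.rw <- 2.4 * sqrt(sumx + 1) / n                   # RWM proposal sd, as suggested above

for (i in 2:n.iter) {
  # Independence sampler: propose from the Gamma(sumx + 1, n) 'flat prior' posterior.
  y <- rgamma(1, sumx + 1, n)
  log.a <- log.post(y, sumx, n, A) - log.post(lam.is[i - 1], sumx, n, A) +
    dgamma(lam.is[i - 1], sumx + 1, n, log = TRUE) - dgamma(y, sumx + 1, n, log = TRUE)
  lam.is[i] <- if (log(runif(1)) < log.a) y else lam.is[i - 1]

  # Random walk Metropolis with a normal proposal.
  y <- lam.rw[i - 1] + sd.rw * rnorm(1)
  log.a <- log.post(y, sumx, n, A) - log.post(lam.rw[i - 1], sumx, n, A)
  lam.rw[i] <- if (log(runif(1)) < log.a) y else lam.rw[i - 1]
}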
Here are the trace plots for the two suggested algorithms applied to a sample of Poisson count data. (The MLE of $\lambda$ from a Poisson sample is $\bar{x}$.)
Unnumbered Figure: Link
Independence Sampler
Unnumbered Figure: Link
RWM
Mixing seems reasonable. Acceptance rates are and respectively and effective sample sizes are and .
Estimates of the posterior median and credible intervals for $\lambda$ are
Independence Sampler: 5.35 (3.83,7.11).
RWM: 5.36 (3.88,7.13).
Note: in "real life" you would not use MCMC on the above target; there is a simpler and more appropriate technique, rejection sampling, which you will have met in the Bayesian Statistics course.
In this section we consider using the Metropolis-Hastings algorithm for logistic regression models. We use the Caesarean section example previously studied in Section 3.5 (using a probit regression and data augmentation Gibbs sampler).
Response: occurrence or non-occurrence of infection following birth by Caesarean section.
Covariates:
noplan: equal to 1 if the Caesarean section was not planned, and 0 otherwise;
factor: equal to 1 if one or more risk factors, such as diabetes or excessive weight, were present, and 0 otherwise;
antib: equal to 1 if antibiotics were given as prophylaxis, and 0 otherwise.
In total: 251 births.
noplan | factor | antib | Infection: yes | Infection: no |
0 | 0 | 0 | 8 | 32 |
0 | 0 | 1 | 0 | 2 |
0 | 1 | 0 | 28 | 30 |
0 | 1 | 1 | 1 | 17 |
1 | 0 | 0 | 0 | 9 |
1 | 0 | 1 | 0 | 0 |
1 | 1 | 0 | 23 | 3 |
1 | 1 | 1 | 11 | 87 |
We have aggregated all binary responses in each covariate category into binomial responses. Thus,
$$y_i \sim \mathrm{Bin}(n_i, p_i),$$
where $i$ indexes the covariate combinations, $y_i$ is the number of infections and $n_i$ the total number of observations for the $i$th covariate combination.
Suppose that we assume the binary logistic regression model
$$\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1\,\mathrm{noplan}_i + \beta_2\,\mathrm{factor}_i + \beta_3\,\mathrm{antib}_i,$$
where $p_i$ denotes the probability of an infection for each individual in the $i$th covariate combination and $\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2, \beta_3)$.
A maximum likelihood analysis in R gives the following results.
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.8926 0.4124 -4.590 4.44e-06 ***
noplan 1.0720 0.4253 2.520 0.0117 *
factor 2.0299 0.4552 4.459 8.23e-06 ***
antib -3.2544 0.4813 -6.762 1.37e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Null deviance: 83.491 on 6 degrees of freedom
Residual deviance: 10.997 on 3 degrees of freedom
AIC: 36.178
Number of Fisher Scoring iterations: 4
The value of 10.997 for the residual deviance is not consistent with a $\chi^2_3$ distribution, since we can compute the corresponding $p$-value as
> 1-pchisq(10.997,3)
[1] 0.01174211
So there is a fairly strong warning sign that the model may not fit adequately. To keep things simple we will ignore this (we could consider interactions or probit regression, as seen earlier).
A Bayesian analysis places a prior on $\boldsymbol{\beta}$, usually a multivariate normal prior $\pi(\boldsymbol{\beta})$. The posterior distribution for $\boldsymbol{\beta}$ is
$$\pi(\boldsymbol{\beta} \mid \mathbf{y}) \propto \pi(\boldsymbol{\beta}) \prod_{i} p_i^{y_i} (1 - p_i)^{n_i - y_i},$$
where $p_i = p_i(\boldsymbol{\beta})$ as above.
The posterior density is a complicated, non-linear function of . How can we construct an MCMC algorithm to sample from this distribution?
1. Update $\boldsymbol{\beta}$ jointly: an independence sampler with a multivariate normal (or better, multivariate $t$) candidate generator, with mean equal to the posterior mode and covariance matrix equal to the inverse curvature at the mode. This requires a Newton-Raphson search for the posterior mode and relies on asymptotic normality of the posterior distribution. A multivariate normal may rarely propose values in the tails of the posterior, hence the preference for a $t$-distribution.
2. Update $\boldsymbol{\beta}$ jointly: a multivariate Gaussian random walk candidate generator, with covariance matrix equal to the inverse curvature at the posterior mode times a (scalar) tuning factor, or some other pre-estimated covariance matrix. This needs tuning and pre-calculation of the covariance estimate.
3. Update $\boldsymbol{\beta}$ component-wise via MH: a simple random walk proposal updating one component at a time. This needs tuning and can have problems with collinearity.
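The course implementations are in the files mentioned below. As a hedged sketch of algorithm 2, a multivariate Gaussian random walk update for $\boldsymbol{\beta}$ might look as follows, where X is the design matrix over the covariate combinations (intercept, noplan, factor, antib), y the infection counts and m the binomial totals from the table above. The $N(0, 100I)$ prior, the pilot estimate Sigma.hat and the tuning factor lambda are assumptions for illustration.

# Sketch of algorithm 2: multivariate Gaussian RWM for the logistic regression coefficients.
# X: design matrix (intercept, noplan, factor, antib); y: infections; m: binomial totals.
library(MASS)   # for mvrnorm()

log.post <- function(beta, X, y, m, prior.var = 100) {
  p <- 1 / (1 + exp(-as.vector(X %*% beta)))               # inverse logit link
  sum(dbinom(y, m, p, log = TRUE)) +
    sum(dnorm(beta, 0, sqrt(prior.var), log = TRUE))       # independent normal priors (assumed)
}

mrwm <- function(n.iter, beta0, lambda, Sigma.hat, X, y, m) {
  B <- matrix(NA, n.iter, length(beta0))
  B[1, ] <- beta0
  for (i in 2:n.iter) {
    prop <- mvrnorm(1, B[i - 1, ], lambda * Sigma.hat)     # symmetric multivariate proposal
    log.a <- log.post(prop, X, y, m) - log.post(B[i - 1, ], X, y, m)
    B[i, ] <- if (log(runif(1)) < log.a) prop else B[i - 1, ]
  }
  B
}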
The code implementing algorithms 1, 2 and 3 is stored in the files logisticIP.R, logisticMRWM.R and logisticRWM.R, respectively. The code to run the analysis is given in the file logisticRUN.R.
The output from running these algorithms for 5000 iterations is plotted in the files “logistic1.pdf”, “logistic2.pdf” and “logistic3.pdf”.
Estimates (posterior mean, standard deviation and posterior probability of exceeding 0) from the first algorithm (independence sampler; 5,000 samples after a burn-in of 500) are given in the following table:
Coefficient | Posterior mean | Std. dev. | P(coefficient > 0) |
(Intercept) | -1.9544 | 0.4228 | 0 |
noplan | 1.1071 | 0.4229 | 0.9968 |
factor | 2.0955 | 0.467 | 1 |
antib | -3.3322 | 0.4867 | 0 |
The overall acceptance rate of this run was 87.6%, which is very good for an independence sampler. The multivariate random walk algorithm works well when the covariance matrix of the proposal distribution reflects the correlations between the parameters. For both the independence sampler and the multivariate random walk algorithm, the same mechanism is used to find an appropriate covariance matrix for the proposal distribution. The componentwise (univariate) random walk Metropolis performs more poorly, with slowly decaying correlations in the acf plots. However, the componentwise random walk Metropolis is run without the tuning of the covariance matrix which is important for successful implementation of the other algorithms.
This example is taken from [1] and gives a non-linear model for oxygen demand in a biochemical experiment against incubation time.
Model:
$$y_i = \beta_1\left\{1 - \exp(-\beta_2 x_i)\right\} + \epsilon_i,$$
where $y$ is biochemical oxygen demand (in mg/l), $x$ is incubation time in days and $\epsilon_i \sim N(0, \sigma^2)$ independently.
The data, together with the fitted least squares curve, are plotted below.
Unnumbered Figure: Link
Data source: www.itl.nist.gov/div898/strd/nls/data/boxbod.shtml
Random walk Metropolis was used to obtain a sample from the posterior distribution of $(\beta_1, \beta_2, \sigma)$ with a multivariate Gaussian proposal. Tuning found that a proposal covariance matrix of the form $\lambda \hat{\Sigma}$ worked well, where $\lambda$ is a scalar tuning factor and $\hat{\Sigma}$ is an estimate of the posterior covariance matrix computed on the basis of pilot runs.
The estimated posterior means (and standard deviations) of the three parameters $\beta_1$, $\beta_2$ and $\sigma$ were obtained from a sample of size 10000. (A burn-in of 1000 iterations was used.) Finally, in the figure below $\beta_1$ is plotted against $\beta_2$, which clearly shows a non-linear relationship between the regression parameters.
Unnumbered Figure: Link
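For completeness, a sketch of the log-posterior that such a random walk Metropolis run could target is given below, assuming the model above with independent $N(0, \sigma^2)$ errors and flat priors on $(\beta_1, \beta_2, \log \sigma)$; the flat priors are an assumption, not taken from the notes. This function can be plugged into a multivariate Gaussian random walk update of the kind sketched for the logistic regression example.

# Log-posterior for the BOD model y_i = beta1 * (1 - exp(-beta2 * x_i)) + eps_i.
# theta = (beta1, beta2, log sigma); x and y hold the data from the source cited above.
bod.log.post <- function(theta, x, y) {
  beta1 <- theta[1]
  beta2 <- theta[2]
  sigma <- exp(theta[3])                            # work with log(sigma), so sigma > 0
  mu <- beta1 * (1 - exp(-beta2 * x))               # mean oxygen demand at incubation time x
  sum(dnorm(y, mu, sigma, log = TRUE))              # flat priors contribute only a constant
}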
In this lab you will use Poisson regression to analyse the data in labants.txt, which consist of the ant species richness (number of ant species) found in 64 square metre sampling grids, in 22 bogs and the 22 forests surrounding the bogs, in Connecticut, Massachusetts and Vermont (USA). The sites span a range of latitudes in New England. The data source is [2]. There are 44 observations on four variables, plus the site identifier.
Site: An abbreviation for the name of the site
Srich: The species richness (number of species) for ants
Habitat: Either Forest or Bog
Latitude: The latitude (in decimal degrees) for the site
Elevation: The elevation of the site, in metres above sea level
Create a new directory for this lab and download the data file to it. The file lab5code.r contains some of the R code that you will need. You should download this file to the same directory as the data file and then edit it in order to perform the following tasks. (Note that the MASS library is called to allow straightforward sampling from a multivariate normal distribution.)
The first few lines in lab5code.r read in the data and create a response vector y and the design matrix X. Make sure you understand what the code is doing. Look at the data. Why do we subtract 1 when creating the second column of X?
There are three covariates: habitat, latitude, and elevation, and so there are four components to our parameter vector - why?
We will use independent Gaussian priors for each of the four $\beta$s, each centred on 0 and with the same variance, the scalar value V.prior. The function log.prior() should take the parameter vector beta and the scalar variance V.prior and should return the log of the prior density for beta. Work out the formula for the log prior (up to an arbitrary constant independent of beta) and so fill in the template for this function.
We wish to employ vague but proper priors so V.prior should be large but finite: a value of 100 should suffice in this instance.
We will be using the following model:
$$Y_i \sim \mathrm{Po}(\lambda_i) \text{ independently}, \qquad \log \lambda_i = \mathbf{x}_i^T \boldsymbol{\beta},$$
where $\mathbf{x}_i$ is the $i$th row of the design matrix X. Recall that the density of a single Poisson random variable $Y$ which has mean $\lambda$ is
$$P(Y = y) = \frac{\lambda^y e^{-\lambda}}{y!}, \qquad y = 0, 1, 2, \ldots$$
Use vector arithmetic to fill in the template for the function poisson.log.like(), which returns the log-likelihood of the parameter vector beta given the data vector y and covariate matrix X. Hint: your very first step should be

loglam <- X %*% beta

This gives the vector of $\log \lambda_i$ values.
The inputs and outputs of the function RWM() are detailed in the comments above it. Fill in the blanks as instructed by the comments in the function.
Note that V.prop is the variance of the (multivariate) Gaussian jump proposal in the RWM for beta. Since beta is a 4-component vector, V.prop must be a 4x4 matrix.
Initially you will tune the algorithm by choosing a scaling and a variance matrix. All tuning runs should be for 10000 iterations and should start from the same initial value of beta. Set your plot window to a 2x2 array with the command par(mfrow=c(2,2)).
Perform a run with V.prop=diag(rep(1,4)) (why?) and lambda.prop=0.001. Create an MCMC object from this run and look at the trace plots. What do you notice?
Find the variance matrix from this run using the command V=var(B), where B is the output from the previous run. Change your scale parameter, lambda.prop, to 1 (this is roughly the optimal scaling $2.38^2/d$ with $d = 4$) and try a new run with V.prop set to V. What do the traceplots show?
Repeat step 2 until all four components are mixing well. What is your final variance matrix? What is the correlation matrix? Use these to explain the behaviour of the first run.
Look at the effective sample size for this run (removing burn in by slicing). Can you increase it by changing lambda.prop?
Use slicing on your final run to chop the first few rows from your output matrix (burn-in). Find the posterior median estimate and a CI for each parameter and draw kernel density estimates or histograms from the MCMC output.
Let with .
Consider a MALA algorithm to obtain samples from . That is, propose a new value from
Show that the acceptance probability for the MALA algorithm is
Why choose for the MALA algorithm?
Let for and otherwise.
Let for . Consider the Metropolis-Hastings algorithm for obtaining samples from with proposal density .
What name is given to this type of Metropolis-Hastings algorithm?
Show that the acceptance probability is .
The probability of accepting a proposed move when the current state is is given by
Show that
Which out of and should be used for the algorithm and why?
In an angina study, the sex and cholesterol levels of patients were recorded. Let and denote the sex ( or ) and cholesterol level, respectively, of patient . It is proposed that the probability that patient suffers from angina is
independently of all other patients, and are the unknown parameters of interest. Finally, let if patient suffers from angina and otherwise with .
Write down the probability that given , , , and .
Given independent priors for each of , and , write down, up to proportionality, the joint probability density function of .
Outline a random walk Metropolis algorithm to obtain samples from the joint posterior distribution of .
Patient is male () and his cholesterol level, . Describe how to use a sample from the MCMC algorithm to estimate the probability that patient has angina.