4 Week 9: The Metropolis-Hastings Algorithm

4.1 Introduction to the Metropolis-Hastings algorithm

The Gibbs sampler is a special case of the general Metropolis-Hastings algorithm. All MCMC algorithms are based upon the Metropolis-Hastings algorithm but, as with the Gibbs sampler, other special cases are considered in their own right.

Suppose that we wish to simulate from a (multivariate) distribution f(θ), which we term the target distribution. We introduce an arbitrary but specified transition probability q(θ,ϕ) from which simulation is straightforward. The transition probability q(θ,ϕ) is the probability density of moving from θ to ϕ.

If θ0 is drawn from f(·), the Metropolis-Hastings algorithm will generate (dependent) samples θ0, θ1, … from f(·).

The Metropolis-Hastings algorithm is:

Metropolis-Hastings algorithm

  1. Given the current position θn = ϑ, generate a 'candidate value' ϕ from q(ϑ,ϕ).
     Note that q(·,·) is often referred to as the proposal distribution, in that we are proposing a candidate value ϕ using it.

  2. Calculate the 'acceptance probability' α(ϑ,ϕ), given by

     α(ϑ,ϕ) = min{ f(ϕ)q(ϕ,ϑ) / (f(ϑ)q(ϑ,ϕ)), 1 }.

  3. With probability α(ϑ,ϕ), accept the candidate value and set θ_{n+1} = ϕ; otherwise reject the candidate value and set θ_{n+1} = ϑ.

  4. Repeat until a sample of the desired size is obtained.
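
In code the algorithm is only a few lines. Below is a minimal R sketch (the function mh_sample and its argument names are our own illustrative choices, not part of the course code); the acceptance step is carried out on the log scale, anticipating the observations on accepting a move below.

# Minimal Metropolis-Hastings sampler: an illustrative sketch.
# log_f : log target density f, up to an additive constant
# r_prop: function(theta) drawing a candidate phi from q(theta, .)
# d_prop: function(theta, phi) evaluating the density q(theta, phi)
mh_sample <- function(n, theta0, log_f, r_prop, d_prop) {
  theta <- numeric(n)
  theta[1] <- theta0
  for (i in 2:n) {
    cur <- theta[i - 1]
    phi <- r_prop(cur)                        # step 1: candidate value
    log_alpha <- log_f(phi) - log_f(cur) +    # step 2: acceptance probability
      log(d_prop(phi, cur)) - log(d_prop(cur, phi))
    theta[i] <- if (log(runif(1)) <= log_alpha) phi else cur   # step 3
  }
  theta
}

# Example: random walk proposal N(current, 1) targeting N(0,1).
out <- mh_sample(10000, 0, function(x) -x^2 / 2,
                 function(x) rnorm(1, x, 1),
                 function(x, y) dnorm(y, mean = x))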

It is not difficult to show that f(θ) is the stationary distribution of this chain; this is done below. Under mild conditions, including ensuring that the proposal distribution does not lead to a reducible Markov chain, the chain will converge to this stationary distribution.

To show that the stationary distribution of the Metropolis-Hastings algorithm is indeed f(·), it is necessary to show that, for all ϑ,

∫ f(ϕ) g(ϕ,ϑ) dϕ = f(ϑ),

where g(ϕ,ϑ) is the probability density of θ1 = ϑ given that θ0 = ϕ. For ϕ ≠ ϑ, the density g(ϕ,ϑ) consists of two parts: first we need to propose the move from ϕ to ϑ, i.e. q(ϕ,ϑ), and secondly, we need to accept the move, α(ϕ,ϑ). Thus g(ϕ,ϑ) = q(ϕ,ϑ)α(ϕ,ϑ). However, the probability g(ϑ,ϑ) is slightly more involved. (Note that typically there is a non-zero probability of staying at ϑ.) Firstly, we have the probability of proposing that we stay where we are, q(ϑ,ϑ), which is a move we will always accept! Secondly, we could propose a candidate value φ and reject such a move, meaning that we would stay where we are. Thus

g(ϑ,ϑ) = q(ϑ,ϑ)α(ϑ,ϑ) + ∫ q(ϑ,φ)(1 - α(ϑ,φ)) dφ.

In general it is not a good idea to have q(ϑ,ϑ) > 0: proposing to stay where you are just slows down the process. Therefore, without loss of generality, we take q(ϑ,ϑ) = 0.

Therefore bringing all of the above arguments together, we have that

∫ f(ϕ)g(ϕ,ϑ) dϕ = ∫ f(ϕ)q(ϕ,ϑ)α(ϕ,ϑ) dϕ + f(ϑ) ∫ q(ϑ,φ)(1 - α(ϑ,φ)) dφ
= ∫ f(ϕ)q(ϕ,ϑ) min{1, f(ϑ)q(ϑ,ϕ)/(f(ϕ)q(ϕ,ϑ))} dϕ + f(ϑ) ∫ q(ϑ,φ)(1 - min{1, f(φ)q(φ,ϑ)/(f(ϑ)q(ϑ,φ))}) dφ
= ∫ min{f(ϕ)q(ϕ,ϑ), f(ϑ)q(ϑ,ϕ)} dϕ + f(ϑ) ∫ q(ϑ,φ) dφ - ∫ min{f(φ)q(φ,ϑ), f(ϑ)q(ϑ,φ)} dφ
= ∫ min{f(ϕ)q(ϕ,ϑ), f(ϑ)q(ϑ,ϕ)} dϕ + f(ϑ) ∫ q(ϑ,φ) dφ - ∫ min{f(ϕ)q(ϕ,ϑ), f(ϑ)q(ϑ,ϕ)} dϕ
= f(ϑ),

as required. This is since q(ϑ,·) is a probability density, and hence integrates to 1, and the other two terms cancel with each other.

There exists a simpler proof based upon detailed balance: f(·) is the stationary distribution of a Markov chain with transition kernel g(ϕ,θ) if, for all ϕ, θ,

f(ϕ)g(ϕ,θ)=f(θ)g(θ,ϕ).

(Integrating both sides of this equation over ϕ recovers the stationarity condition above.) Specifically, for the Metropolis-Hastings algorithm with ϕ ≠ θ, g(ϕ,θ) = q(ϕ,θ)α(ϕ,θ), and therefore,

f(ϕ)g(ϕ,θ) = f(ϕ)q(ϕ,θ) min{1, f(θ)q(θ,ϕ)/(f(ϕ)q(ϕ,θ))}
= min{f(ϕ)q(ϕ,θ), f(θ)q(θ,ϕ)}
= f(θ)q(θ,ϕ) min{f(ϕ)q(ϕ,θ)/(f(θ)q(θ,ϕ)), 1}
= f(θ)g(θ,ϕ),

as required.

There are (infinitely) many possible choices for the proposal distribution q. Two of the most common choices are outlined below.

  1. Independence sampler
     For all ϑ, let q(ϑ,ϕ) = q(ϕ) for some probability density q(·); that is, the proposal density is independent of the current value ϑ. This idea is similar to rejection sampling, in that we sample from one probability density q(·) and then accept or reject the samples as coming from the probability density f(·). This method works well when we can find a simple (well understood, easy to simulate from) probability density q(·) which closely approximates f(·). Note that in this case the acceptance probability is given by

     α(ϑ,ϕ) = min{ w(ϕ)/w(ϑ), 1 },

     where w(φ) = f(φ)/q(φ) denotes the relative weight of the two densities at φ.

     Example. Mixture of normal distributions as the target (stationary) distribution, X, with a normal distribution for the proposal distribution, Y:

     f(x) = (1/2){ (1/√(2π)) exp(-(x+1)²/2) + (1/(2√(2π))) exp(-x²/8) }  (black)
     q(y) = (1/√(10π)) exp(-(y+0.5)²/10)  (red)

     That is,

     X ∼ N(-1,1) w.p. 0.5 and N(0,4) w.p. 0.5,
     Y ∼ N(-0.5,5).

    Unnumbered Figure: Link

  2. Random-walk Metropolis
     Suppose that we are currently at ϑ. In stationarity it is likely that ϑ has high posterior probability. (Note that in stationarity the chance of being at ϑ is proportional to f(ϑ).) Therefore it seems sensible to propose a candidate value ϕ from a distribution which is centred about the current value ϑ; for example, ϕ|ϑ ∼ N(ϑ,σ²), that is, we propose ϕ from a normal distribution with mean ϑ. If the proposal distribution is symmetric about ϑ, as in the case of the normal distribution, then we have the nice property that

     q(ϑ,ϕ) = (1/(√(2π)σ)) exp(-(ϕ-ϑ)²/(2σ²)) = (1/(√(2π)σ)) exp(-(ϑ-ϕ)²/(2σ²)) = q(ϕ,ϑ).

     Hence in this case the acceptance probability α(ϑ,ϕ) simplifies to

     α(ϑ,ϕ) = min{ f(ϕ)/f(ϑ), 1 }.

     Example. Mixture of normal distributions is the target with a normal distribution for the proposal:

     f(x) = (1/2){ (1/√(2π)) exp(-(x+1)²/2) + (1/(2√(2π))) exp(-x²/8) }  (black)
     q(y|x) = (1/√(2π)) exp(-(y-x)²/2)  (red)

     That is,

     X ∼ N(-1,1) w.p. 0.5 and N(0,4) w.p. 0.5,
     Y ∼ N(x,1).

     For x = 1 (marked with a vertical green line),

     Unnumbered Figure: Link

     (Note that the proposal density has been divided by 2 in the above figure.)
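
To make the two proposal choices concrete, the sketch below (our own illustrative code, not from the notes) runs both samplers on the mixture target used in the two examples above.

# Mixture target from the examples: 0.5 N(-1,1) + 0.5 N(0,4).
log_f <- function(x) log(0.5 * dnorm(x, -1, 1) + 0.5 * dnorm(x, 0, 2))

n <- 10000
indep <- rwm <- numeric(n)        # both chains started at 0
for (i in 2:n) {
  # Independence sampler: propose from N(-0.5, 5); q(.) ignores the current value.
  phi <- rnorm(1, -0.5, sqrt(5))
  la <- log_f(phi) - log_f(indep[i - 1]) +
    dnorm(indep[i - 1], -0.5, sqrt(5), log = TRUE) -
    dnorm(phi, -0.5, sqrt(5), log = TRUE)
  indep[i] <- if (log(runif(1)) <= la) phi else indep[i - 1]

  # Random-walk Metropolis: symmetric N(current, 1) proposal, so the q terms cancel.
  phi <- rnorm(1, rwm[i - 1], 1)
  rwm[i] <- if (log(runif(1)) <= log_f(phi) - log_f(rwm[i - 1])) phi else rwm[i - 1]
}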

Why choose α(ϑ,ϕ) = min{ f(ϕ)q(ϕ,ϑ) / (f(ϑ)q(ϑ,ϕ)), 1 }?
The answer is given by Peskun (1973) and is as follows: α(ϑ,ϕ) is the largest value that the acceptance probability can take, for the given choice of q(·,·), which ensures that the stationary distribution of the Markov chain is given by f(·). Choosing this α(·,·) means the Markov chain moves around the sample space as fast as possible (given q(·,·)) and therefore converges as quickly as possible.

Accepting a move.
The simplest way to accept a proposed move computationally is to generate a random number u from U(0,1) and accept the move if u ≤ α(ϑ,ϕ) (reject otherwise). This leads us to a couple of observations which can assist in implementing MCMC.

  1. Let R(ϑ,ϕ) = f(ϕ)q(ϕ,ϑ) / (f(ϑ)q(ϑ,ϕ)), so that α(ϑ,ϕ) = min{1, R(ϑ,ϕ)}. It suffices to accept the move if u ≤ R(ϑ,ϕ) (reject otherwise), since whenever R(ϑ,ϕ) ≠ α(ϑ,ϕ) we have R(ϑ,ϕ) > 1 and the proposed move is accepted anyway.

  2. It suffices to accept the move if log(u) ≤ log(R(ϑ,ϕ)) (reject otherwise), as log is a monotonic function. This can help with numerical stability, although this is unlikely to be needed for the problems we consider in this course.
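
In R, the two observations amount to the following fragment, where log_f and log_q are assumed to evaluate the log target and log proposal densities, and cur and phi are the current and candidate values:

# Observation 1: compare u with the raw ratio R; the min is implicit.
logR <- log_f(phi) - log_f(cur) + log_q(phi, cur) - log_q(cur, phi)
accept <- runif(1) <= exp(logR)

# Observation 2: equivalently, and more stably, compare on the log scale.
accept <- log(runif(1)) <= logR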

4.2 The Metropolis-Hastings algorithm in a Bayesian context

Thus far we have described the Metropolis-Hastings algorithm for obtaining samples from a probability distribution with probability density function f(·). Typically, f(·) is the posterior distribution for a Bayesian statistical problem; that is, f(θ) = π(θ|𝐱). The key to the success of the Metropolis-Hastings algorithm (MCMC) is that we only need to know π(θ|𝐱) up to a constant of proportionality. To see this, note that

π(θ|𝐱)=π(𝐱|θ)π(θ)/π(𝐱).

Then the acceptance probability of the Metropolis-Hastings algorithm can be written as

α(ϑ,ϕ) = min{ π(ϕ|𝐱)q(ϕ,ϑ) / (π(ϑ|𝐱)q(ϑ,ϕ)), 1 }
= min{ {π(𝐱|ϕ)π(ϕ)/π(𝐱)} q(ϕ,ϑ) / ({π(𝐱|ϑ)π(ϑ)/π(𝐱)} q(ϑ,ϕ)), 1 }
= min{ π(𝐱|ϕ)π(ϕ)q(ϕ,ϑ) / (π(𝐱|ϑ)π(ϑ)q(ϑ,ϕ)), 1 }.

That is, it is not necessary to compute the marginal likelihood, the constant of proportionality of the posterior distribution.

The general Metropolis-Hastings algorithm has a major advantage over the Gibbs sampler in that we do not need to know any of the conditional distributions. All that is required is to be able to simulate from q, which we can choose arbitrarily subject to a few basic constraints. Moreover, and this can be of crucial importance, we only need to know π(·|𝐱) up to proportionality. This is because we only need to consider the ratio of the posterior density at different values, and hence the proportionality constants cancel. For example, with the independence sampler we only need to know w up to proportionality. Whilst we have great freedom in choosing q, it is important to make good choices: a poor choice of q can seriously affect the efficiency and convergence of the Metropolis-Hastings algorithm.

It is often desirable to create hybrid MCMC algorithms which combine different types of MCMC moves. A particularly useful application arises with the Gibbs sampler, where it is often impossible to obtain the conditional distributions of all the parameters in a standard form. In this case, it is natural to use a one-dimensional Metropolis-Hastings step to simulate from the one-dimensional conditionals. For example, use the independence sampler or random walk Metropolis for one of the parameters which does not have a nice conditional distribution. Then the accepted value will be from the correct one-dimensional conditional distribution, but we have not needed to simulate directly from the conditional distribution. This type of algorithm is sometimes referred to as Metropolis-within-Gibbs or componentwise MCMC.

4.2.1 Componentwise MCMC

Suppose that we want to obtain samples from π(𝜽|𝐱) = π(𝐱|𝜽)π(𝜽)/π(𝐱) (a posterior distribution), where 𝜽 = (θ1, …, θd) are the parameters of interest and 𝐱 denotes data. Componentwise MCMC follows the same procedure as the Gibbs sampler by updating subsets of the parameters, often one parameter at a time, sequentially, keeping the other parameters fixed.

That is, within each iteration, for k = 1, 2, …, d:

  1. Propose θ′k according to some proposal qk(𝜽, θ′k), where 𝜽 denotes the current value of the Markov chain.
     Note that the proposal distribution qk(·,·) could depend on the whole of 𝜽 but often it will only depend on θk.

  2. Set 𝜽′ = (θ1, …, θk-1, θ′k, θk+1, …, θd). (Only the kth element differs from 𝜽.)

  3. Compute the acceptance probability αk(𝜽,𝜽′) for the proposed move from 𝜽 to 𝜽′:

     αk(𝜽,𝜽′) = min{ 1, [π(𝜽′|𝐱) qk(𝜽′,θk)] / [π(𝜽|𝐱) qk(𝜽,θ′k)] }
     = min{ 1, [π(𝐱|𝜽′)π(𝜽′) qk(𝜽′,θk)] / [π(𝐱|𝜽)π(𝜽) qk(𝜽,θ′k)] }.

  4. With probability αk(𝜽,𝜽′), accept the proposed move and set 𝜽 = 𝜽′. Otherwise keep 𝜽 unchanged (reject the proposed move).

The above algorithm can be adapted to update a subset of the parameters at once. In other words, componentwise MCMC is not restricted to single parameter updates.
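
A minimal R sketch of one componentwise sweep is given below (the names are our own illustrative choices; we use a symmetric random walk proposal for each component, so that the qk terms cancel):

# One componentwise (Metropolis-within-Gibbs) sweep: illustrative sketch.
# log_post: log pi(theta | x), up to an additive constant
# sds     : vector of random walk proposal standard deviations, one per component
componentwise_sweep <- function(theta, log_post, sds) {
  for (k in seq_along(theta)) {
    prop <- theta
    prop[k] <- rnorm(1, theta[k], sds[k])    # update component k only
    if (log(runif(1)) <= log_post(prop) - log_post(theta)) theta <- prop
  }
  theta
}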

4.3 Independence sampler

4.3.1 Introduction

The ideal proposal distribution for an independence sampler is q(ϕ) = f(ϕ).
Why? The acceptance rate is always 1 and we get iid samples from f(·).

Why is this of no direct use?
If we can sample from f(·) directly there is no need for MCMC.

How might it, nonetheless, help us to choose a sensible proposal distribution?
We can look for q(·) which is a good approximation to f(·).

What might a good acceptance rate for the independence sampler be?
As close to 1 as possible whilst exploring the whole of the posterior distribution.

4.3.2 Genetics example (revisited)

To illustrate the independence sampler, and to illustrate the effect of the choice of q on efficiency, we return to the genetics example, analysed using the Gibbs sampler in Week 8. Note that using the Metropolis-Hastings algorithm, we can use

(f(θ) =) π(θ|𝐱) ∝ (2+θ)^{x1} (1-θ)^{x2+x3} θ^{x4},

directly.

Remember that there are 197 animals divided into four categories. The number of animals in each category are

𝐱=(x1,x2,x3,x4)=(125,18,20,34),

with category probabilities

(p1, p2, p3, p4) = (1/2 + θ/4, (1-θ)/4, (1-θ)/4, θ/4).

For the proposal q(θ) we shall use θ ∼ Beta(a,b) and consider efficiency for different choices of a and b. This gives

w(θ) = C (2+θ)^{x1} (1-θ)^{x2+x3} θ^{x4} / (θ^{a-1} (1-θ)^{b-1}),

where C is a constant independent of θ which is not important for the MCMC.

We shall consider three choices for (a,b): the uniform (1,1), (6,4) and (1,4). We can see from the figures below that (6,4) works best. This is because the mean of the proposal, 0.6, is close to the mean of the posterior distribution, 0.624, and the shape of the density q(·) is the best approximation of π(·|𝐱); see the plot comparing the density of q(·) (solid line) with π(·|𝐱) (dashed line) below. Even better results are obtained by using, say, (31,19). However, there are a number of reasons for preferring a proposal distribution q(·) which is over-dispersed relative to the posterior distribution π(·|𝐱); roughly speaking, the proposal distribution should have a larger variance than the posterior distribution. Note that at the other extreme, (a,b) = (1,4) is very inefficient since the proposal distribution does not reflect the posterior distribution π(·|𝐱), i.e. very few of the proposed moves are to high density regions of the posterior distribution. Finally, note that the naive uniform proposal performs adequately.
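
A possible R implementation is sketched below (this is also Practice Question 1 of Section 4.7, so treat it as one solution sketch rather than the definitive code); log_w computes log w(θ) up to the unimportant constant C.

# Independence sampler for the genetics example: a solution sketch.
x <- c(125, 18, 20, 34)
log_w <- function(th, a, b)                  # log w(theta), up to a constant
  x[1] * log(2 + th) + (x[2] + x[3]) * log(1 - th) + x[4] * log(th) -
    dbeta(th, a, b, log = TRUE)

genetics_indep <- function(n, a, b, theta0 = 0.5) {
  theta <- numeric(n)
  theta[1] <- theta0
  for (i in 2:n) {
    prop <- rbeta(1, a, b)                   # candidate from the Beta(a,b) proposal
    accept <- log(runif(1)) <= log_w(prop, a, b) - log_w(theta[i - 1], a, b)
    theta[i] <- if (accept) prop else theta[i - 1]
  }
  theta
}
# e.g. compare the mixing of genetics_indep(10000, 6, 4) with genetics_indep(10000, 1, 4).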


Unnumbered Figure: Link

Unnumbered Figure: Link

Unnumbered Figure: Link

In the final figure the solid line denotes the Beta(6,4) density and the dashed line denotes π(·|𝐱).

4.3.3 Why should q be over-dispersed relative to f?

The closer q approximates f, the better; thus we look to have w(θ) ≈ 1 for all θ. However, the theoretical (and practical) performance of the MCMC algorithm relies upon sup_θ w(θ). In particular, there are convergence problems if sup_θ w(θ) = ∞. In fact the algorithm is reducible if sup_θ w(θ) = ∞. There are no such problems caused by having w(θ) ≈ 0, although the efficiency of the algorithm can be affected. The main cause of sup_θ w(θ) = ∞ is choosing q to have too small a variance (under-dispersed relative to f).

Suppose that sup_θ w(θ) = K. Then the probability of accepting a move away from θ, where w(θ) = K, is

∫ min{1, w(ϕ)/w(θ)} q(ϕ) dϕ = ∫ (w(ϕ)/K) q(ϕ) dϕ
= ∫ f(ϕ)/K dϕ
= 1/K.

Therefore if the independence sampler is started at θ, it takes a geometrically distributed number of iterations to move from θ, with success probability 1/K. As K → ∞, the probability of accepting a move away from θ converges to 0, which is clearly problematic. It is worth noting that for any point ϑ, the number of iterations the independence sampler spends at the point ϑ before moving is geometrically distributed with success probability (of moving)

∫ min{1, w(ϕ)/w(ϑ)} q(ϕ) dϕ.

A measure of the efficiency of an independence sampler is the average acceptance rate. That is,

∫ [ ∫ min{1, w(ϕ)/w(ϑ)} q(ϕ) dϕ ] f(ϑ) dϑ
= ∫∫ min{f(ϕ)q(ϑ), f(ϑ)q(ϕ)} dϕ dϑ.

The acceptance rate is high when f and q are in close agreement. Furthermore, if q(·) = f(·), then the acceptance rate is

∫∫ f(ϑ)f(ϕ) dϑ dϕ = 1.
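
The average acceptance rate is easily estimated by simulation: draw ϑ from f, ϕ from q, and average min{1, w(ϕ)/w(ϑ)}. A small R sketch, using a N(0,1) target with an over-dispersed N(0,2²) proposal (our own illustrative choice):

# Monte Carlo estimate of the average acceptance rate of an independence sampler.
w <- function(x) dnorm(x) / dnorm(x, 0, 2)   # w = f/q for f = N(0,1), q = N(0,4)
theta <- rnorm(1e5)                          # current values drawn from f (stationarity)
phi <- rnorm(1e5, 0, 2)                      # candidates drawn from q
mean(pmin(1, w(phi) / w(theta)))             # estimated average acceptance rate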

4.4 Independence sampler within Gibbs: Censored data

We consider a censored data example. The approach has similarities to that taken in Section 1.6 for analysing censored data using the EM algorithm, in that again we use data augmentation.

Suppose that x1,x2,,xn are independent and identically distributed according to f(x|θ). However, assume that there is right censoring at the value a. That is, for all observations greater than a, the only information we have is that the observation exceeds a. Therefore, we have a sample y1,y2,,yn, where yi=min{xi,a}. The common notation is to denote censored values with an asterisk superscript. Usually for convenience we rearrange the sample as

y1, …, ym, y*m+1, …, y*n,

where y*j = a for j = m+1, …, n. Then, with 𝐲 = (y1, y2, …, yn) denoting the observed data, we have that

L(θ|𝐲) = ∏_{i=1}^{m} f(yi|θ) × {1 - Fθ(a)}^{n-m},

where Fθ is the cdf corresponding to f. Typically it is difficult to work with L(θ|𝐲) because of the presence of the non-standard term {1 - Fθ(a)}^{n-m}. Let 𝐳 = (zm+1, zm+2, …, zn) represent the true but unobserved values of ym+1, …, yn. Then the pair (𝐲,𝐳) corresponds to complete information and

L(θ;𝐲,𝐳) = π(𝐲,𝐳|θ)
= ∏_{i=1}^{m} f(yi|θ) × ∏_{i=m+1}^{n} f(zi|θ),

which is generally straightforward to work with. (Note the obvious similarities with the EM algorithm in Week 6.)

Suppose that X ∼ Gamma(2,δ), where δ is unknown, with a Gamma(α,β) prior on δ. Suppose that we have the following 20 observations, with the data censored at a = 2.0:

0.7596, 1.5408, 2.0000*, 0.7261, 0.2157, 1.1026, 2.0000*,
2.0000*, 1.3579, 0.9520, 1.2688, 2.0000*, 2.0000*, 2.0000*,
1.5395, 1.8802, 0.8471, 0.4374, 1.5395, 1.2576,

where * denotes censored observations.

Note that the cumulative distribution function of X is

FX(x) = 1 - e^{-δx}(1 + δx)  (x > 0),

whilst the pdf of X is f(x) = δ²x e^{-δx}  (x > 0).

Therefore if we reorder the data into ascending order, we have that

L(δ;𝐲) = ∏_{i=1}^{14} δ²yi e^{-δyi} × {e^{-2δ}(1 + 2δ)}^{6},

and it is difficult to write down the posterior distribution of δ. We could proceed as above and use an independence sampler to obtain samples from δ, possibly using a Gamma proposal distribution. In other words, take an approach similar to that used for the genetics example above.

However letting 𝐱=(𝐲,𝐳),

L(δ;𝐱) = ∏_{i=1}^{20} δ²xi e^{-δxi}
∝ δ^{40} exp(-δ ∑_{i=1}^{20} xi).

Hence

π(δ|𝐱) ∝ δ^{40} exp(-δ ∑_{i=1}^{20} xi) × δ^{α-1} exp(-δβ)
∝ δ^{40+α-1} exp(-δ(∑_{i=1}^{20} xi + β)),

giving δ|𝐲,𝐳 ∼ Gamma(40 + α, ∑_{i=1}^{20} xi + β).

For i = 1, 2, …, m, yi is observed, so no updating is required. For i = m+1, m+2, …, n, zi|yi = a, δ ∼ Xa, where Xa is distributed according to Gamma(2,δ) conditioned to be greater than a. Simulating values from Xa directly is not straightforward, and the following independence sampler is proposed. Propose z′i from q(z) = 24/z⁴ (z > 2) and q(z) = 0 otherwise. Simulation from this distribution can be done using inversion of the cdf: simulate ui ∼ U(0,1) and set z′i = (8/ui)^{1/3}. Thus, if fa(z) = f(z)/P(X > a) (z > a) is the pdf of Xa, for z > a = 2,

w(z) = fa(z)/q(z)
= [δ²z exp(-δz)/P(X > 2|δ)] / (24/z⁴)
∝ z⁵ exp(-δz) = h(z), say.

Note that in computing w(z), we need only compute w(z) up to proportionality in z as δ remains fixed in the update of zi. That is, it suffices to use h(z) when updating a censored observation.

Independence sampler within Gibbs

  1. Choose initial values for δ and 𝐳 = (z15, z16, …, z20).

  2. For k = 1, 2, …, N, where N is the desired number of iterations:

     • Update δ|𝐲,𝐳 ∼ Gamma(40 + α, ∑_{i=1}^{20} xi + β);

     • For i = 15, 16, …, 20, draw ui ∼ U(0,1) and set z′i = (8/ui)^{1/3}.
       Accept the proposed value z′i with probability min{1, h(z′i)/h(zi)}, where h(z) = z⁵ exp(-δz);

     • Store δ.
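
A sketch of this algorithm in R is given below (this is also Practice Question 2 of Section 4.7, so treat it as one possible solution; the hyper-parameters α = β = 1 and the initial values are our own illustrative choices).

# Independence sampler within Gibbs for the censored Gamma data: a sketch.
y <- c(0.7596, 1.5408, 0.7261, 0.2157, 1.1026, 1.3579, 0.9520,
       1.2688, 1.5395, 1.8802, 0.8471, 0.4374, 1.5395, 1.2576)  # 14 uncensored values
alpha <- 1; beta <- 1                      # illustrative prior hyper-parameters
n_iter <- 1100
z <- rep(3, 6)                             # initial values for the 6 censored times
delta_store <- numeric(n_iter)
h <- function(z, delta) z^5 * exp(-delta * z)   # w(z) up to proportionality
for (k in 1:n_iter) {
  # Gibbs step: delta | y, z ~ Gamma(40 + alpha, sum(x) + beta)
  delta <- rgamma(1, shape = 40 + alpha, rate = sum(y) + sum(z) + beta)
  # Independence-sampler step for each censored observation
  for (i in 1:6) {
    zprop <- (8 / runif(1))^(1/3)          # draw from q(z) = 24/z^4, z > 2
    if (runif(1) <= h(zprop, delta) / h(z[i], delta)) z[i] <- zprop
  }
  delta_store[k] <- delta
}
delta_out <- delta_store[-(1:100)]         # discard a burn-in of 100 iterations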

The algorithm works well: see the traceplot of δ below. The output is based upon a sample of 1000 iterations following a burn-in of 100 iterations. The estimated posterior mean and standard deviation of δ are 1.165 and 0.2095, respectively.

Unnumbered Figure: Link

4.5 Diagnostics and rules of thumb

4.5.1 Motivation

  1. 1.

    How do I decide when my chain has converged?

  2. 2.

    When my chain has converged how do I decide if it is mixing quickly enough (or equivalently, if the chain is long enough to get accurate answers)?

  3. 3.

How do I choose my algorithm so that I do not need "too many" iterations? (e.g. Would a N(0,1) proposal have been better for the Independence Sampler? Would a N(0,1/2) jump proposal have been better for the RWM?)

  4. 4.

What do we mean by "better" in the previous question?!

Below we begin to answer the above questions. To assist us with answering the questions we consider four scenarios for obtaining samples from X ∼ N(0,1).

  • Random walk Metropolis with x(0) = 2 and proposal distribution Y|X=x ∼ N(x,1).

  • Independence sampler with x(0) = 2 and proposal distribution Y|X=x ∼ N(0,2²).

  • Random walk Metropolis with x(0) = 100 and proposal distribution Y|X=x ∼ N(x,1).

  • Random walk Metropolis with x(0) = 2 and proposal distribution Y|X=x ∼ N(x,(0.1)²).

Unnumbered Figure: Link

4.5.2 The Gelman Rubin Convergence Diagnostic

Let σ² = Var[Xi]. Then, if the Xi are independent,

Var[X̄] = Var[(1/n) ∑_{i=1}^{n} Xi] = (1/n²) ∑_{i=1}^{n} Var[Xi] = σ²/n.

Recap: ANOVA for m groups of size n compares two variance estimates: between groups (from the group means) and within groups (about the group means).

Suppose that you have {Xij} (i = 1, …, m; j = 1, …, n).

Let X̄i = (1/n) ∑_{j=1}^{n} Xij, with Var[X̄i] = σ²/n, and let X̄ = (1/m) ∑_{i=1}^{m} X̄i.

Note that

𝔼[(1/(m-1)) ∑_{i=1}^{m} (X̄i - X̄)²] = (1/(m-1)) ∑_{i=1}^{m} 𝔼[X̄i² - 2X̄iX̄ + X̄²]
= (1/(m-1)) ∑_{i=1}^{m} (𝔼[X̄i²] - 𝔼[X̄²])
= (m/(m-1)) { 𝔼[X̄1²] - ((1/m)𝔼[X̄1²] + ((m-1)/m)𝔼[X̄1]𝔼[X̄2]) }
= (m/(m-1)) × ((m-1)/m) (𝔼[X̄1²] - 𝔼[X̄1]²)
= Var[X̄1] = σ²/n.

Hence B = (n/(m-1)) ∑_{i=1}^{m} (x̄i - x̄)² is an unbiased estimate of σ². (This estimate is based upon between group variability.)

Also

𝔼[(1/m) ∑_{i=1}^{m} (1/(n-1)) ∑_{j=1}^{n} (Xij - X̄i)²] = (1/m) ∑_{i=1}^{m} 𝔼[(1/(n-1)) ∑_{j=1}^{n} (Xij - X̄i)²]
= 𝔼[(1/(n-1)) ∑_{j=1}^{n} (X1j - X̄1)²]
= σ².

Hence W = (1/m) ∑_{i=1}^{m} (1/(n-1)) ∑_{j=1}^{n} (xij - x̄i)² is also an unbiased estimate of σ². (This estimate is based upon within group variability.)

Finally, V = (1/(mn-1)) ∑_{i=1}^{m} ∑_{j=1}^{n} (xij - x̄)² is also an unbiased estimate of σ². (This estimate is based upon total variability.)
Let

V = (1/(mn-1)) ∑_{i=1}^{m} ∑_{j=1}^{n} (xij - x̄)²
= (1/(mn-1)) ∑_{i=1}^{m} ∑_{j=1}^{n} (xij - x̄i + x̄i - x̄)²
= (1/(mn-1)) ∑_{i=1}^{m} ∑_{j=1}^{n} (xij - x̄i)² + (1/(mn-1)) ∑_{i=1}^{m} ∑_{j=1}^{n} (x̄i - x̄)²
= (m(n-1)/(mn-1)) W + ((m-1)/(mn-1)) B
≈ (1 - 1/n) W + (1/n) B.

Thus B, W and V are estimates of the variance based upon the between group, within group and total variation, respectively.

We will run m (e.g. 5) separate chains of length n (e.g. 1000) from different starting points which are more spread out than ("overdispersed with respect to") the posterior, and compare two estimates of the variance of the stationary distribution.

The GR Convergence Diagnostic Procedure

  1. Run m chains for 2n iterations and discard the first n iterations of each chain as burn-in. (Keep the last n iterations from each chain.)
     Let θ̄i denote the mean parameter value of chain i and let θ̄ denote the overall mean. Let si² denote the sample variance of chain i. Then

     B = (n/(m-1)) ∑_{i=1}^{m} (θ̄i - θ̄)²

     and

     W = (1/m) ∑_{i=1}^{m} si².
  2. Calculate the between chain variance estimate, B, and the within chain variance estimate, W.

  3. Calculate V = (1 - 1/n)W + (1/n)B (nearly the total variance of all m chains together).

  4. If the chains have all converged to the same distribution then W, B and V should all be estimates of the variance of the stationary distribution, σ²; i.e. they should all be similar.

  5. Evaluate R̂ := V/W.

  6. If the chains have not yet converged then V should be > σ² (because the chains are still too spread out) and W should be < σ² (since each chain has not yet traversed the state-space).

  7. So if the chains have not converged then R̂ > 1; otherwise R̂ ≈ 1.

  8. Gelman (1996) suggests accepting convergence if R̂ < 1.2.
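
The procedure is straightforward to code. A sketch in R, assuming the m chains (post burn-in) are stored as the columns of a matrix:

# Gelman-Rubin statistic: illustrative sketch.
# chains: an n x m matrix; column i holds the n post burn-in draws of chain i.
gelman_rubin <- function(chains) {
  n <- nrow(chains)
  chain_means <- colMeans(chains)
  B <- n * var(chain_means)            # = (n/(m-1)) * sum((mean_i - overall mean)^2)
  W <- mean(apply(chains, 2, var))     # average within-chain variance
  V <- (1 - 1/n) * W + B/n
  V / W                                # R-hat; values near 1 suggest convergence
}

coda's gelman.diag() (used in the lab below) implements a refined version of this calculation.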

The figure shows a plot of the GR statistics using 2 chains: the fast mixing RWM started at x(0)=2, and the independence sampler started at x(0)=2.

Unnumbered Figure: Link

The figure shows a plot of the GR statistics using 3 chains: the fast mixing RWM started at x(0)=2, the independence sampler started at x(0)=2, and the RWM started at x(0)=100.

What does the Gelman-Rubin statistic tell us?

It is very helpful for checking whether or not the distribution is multi-modal, in that, if we start at a wide range of different values, we will converge to different modes if they exist; we then need to ensure the MCMC can move efficiently between modes. It also gives an indication of a safe burn-in period. For example, if R̂ ≈ 1 for n = 1000, this tells us that a burn-in of 1000 iterations should be sufficient to ensure convergence.

4.6 Independent versus dependent samples

If you had 1000 independent samples x1,,x1000 from a random variable X with distribution f, then you could estimate

μ̂ = (1/1000) ∑_{i=1}^{1000} xi ≈ μ = 𝔼[X],

and

σ̂² = (1/999) ∑_{i=1}^{1000} (xi - μ̂)² ≈ σ² = Var[X].

But you could also estimate the error in each of these approximations

Var[μ̂] = σ²/n ≈ σ̂²/n,  Var[σ̂²] = 2σ⁴/(n-1) ≈ 2σ̂⁴/(n-1).
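
For example, in R (an illustrative check with 1000 independent draws from N(0,1)):

x <- rnorm(1000)                      # 1000 independent samples
mu_hat <- mean(x)
s2_hat <- var(x)
sqrt(s2_hat / 1000)                   # estimated standard error of mu_hat
sqrt(2 * s2_hat^2 / 999)              # estimated standard error of s2_hat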

Suppose that we have a chain of length 1000, x(1),,x(1000). Each element of the chain depends on the previous element. Indeed, because the previous element depends on the one before that, there is dependency between x(i) and x(i-2) etc. Because of this (positive) correlation there is generally less information in the 1000 elements of the chain than there would have been in 1000 independent draws.

If the chain is stationary then ρ1:=Corr[X(i),X(i-1)] should not depend on i. ρ1 is called the lag-1 autocorrelation. Similarly define ρk:=Corr[X(i),X(i-k)] as the lag-k autocorrelation. Note that the lag-0 autocorrelation is 1, in that, any random variable is perfectly correlated with itself!

The plots below show the autocorrelations up to lag-50 for four chains sampling from N(0,1): RWM with N(0,1) jumps, the independence sampler with N(0,4) proposal, RWM with N(0,1) jumps started at 100, and RWM with N(0,0.01) jumps. These correspond to the traceplots given at the start of the section.

Unnumbered Figure: Link

The RWM with proposal N(0,1) performs quite well, with ρk ≈ 0 for k > 20, ρ1 = 0.8 and ρk ≈ ρ1^k. The independence sampler works very well, with ρ1 = 0.25 and all other ρk ≈ 0; this means that there is only a small cost for this algorithm relative to an independent sample. For x(0) = 100, the starting value has a significant effect on the acf plot. The RWM with proposal N(0,0.01) mixes slowly and consequently ρk decreases only slowly as k increases.

4.6.1 Integrated Autocorrelation Time and Effective Sample Size

The integrated autocorrelation time for an infinite Markov chain is defined as

IACT := 1 + 2 ∑_{i=1}^{∞} ρi.

In practice we only have a finite chain and so calculate

ÎACT := 1 + 2 ∑_{i=1}^{K-1} ρi,

where K is the first lag at which ρK falls below some threshold (e.g. 0.05). For the examples above, ÎACT is 11.3, 1.6, 159.2 and 278.2 respectively. The larger the integrated ACT, the higher the correlation, and so the less information there is in the chain. In practice we only estimate the IACT after burn-in (since we only use the part of the chain after burn-in). The IACT estimate for the final 800 iterations of the chain started at x(0) = 100 is 14.6.

For n independent samples

Var[μ̂] = σ²/n ≈ σ̂²/n.

Define the effective sample size neff:=n/IACT.

Lemma. For a stationary series of correlated variables with ∑_{i=1}^{∞} |ρi| < ∞,

Var[μ̂] ≈ σ̂²/neff.

Proof:

Var[μ̂] = Var[(1/n) ∑_{i=1}^{n} X(i)]
= (1/n²) ∑_{i=1}^{n} ∑_{j=1}^{n} Cov[X(i), X(j)]
= (1/n²) ∑_{i=1}^{n} Var[X(i)] + (2/n²) ∑_{i=1}^{n-1} ∑_{j=i+1}^{n} Cov[X(i), X(j)]
= (1/n²) { n Var[X(1)] + 2 ∑_{i=1}^{n-1} (n-i) Cov[X(0), X(i)] }
≈ (1/n) { Var[X(1)] + 2 ∑_{i=1}^{n-1} Cov[X(0), X(i)] }
= (1/n) σ² { 1 + 2 ∑_{i=1}^{n-1} ρi }
≈ (1/n) σ̂² { 1 + 2 ∑_{i=1}^{∞} ρi }
= (1/n) σ̂² IACT = σ̂²/neff.

The effective sample sizes for our four chains are therefore: 1000/11.3=88.5, 1000/1.6=625, 800/14.6=54.8 (excluding burn-in), and 1000/278.2=3.6. The 1000 values in the final chain contain less information than would a sample of 4 independent values!
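
The calculation of ÎACT and neff can be sketched in R as follows (acf() supplies the sample autocorrelations; the function name and threshold are our own illustrative choices):

# Estimate the IACT and effective sample size from a single chain: a sketch.
ess <- function(chain, threshold = 0.05, max_lag = 200) {
  rho <- acf(chain, lag.max = max_lag, plot = FALSE)$acf[-1]  # drop lag 0
  K <- which(rho < threshold)[1]       # first lag below the threshold
  if (is.na(K)) K <- max_lag
  iact <- 1 + 2 * sum(rho[seq_len(K - 1)])
  c(IACT = iact, n_eff = length(chain) / iact)
}
# coda::effectiveSize() provides a more sophisticated estimate.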

4.6.2 Answers

  1. How do I decide when my chain has converged? Both by eye and (if feasible) using the Gelman-Rubin statistic.

  2. When my chain has converged how do I decide if it is mixing quickly enough (or equivalently, if the chain is long enough to get accurate answers)? By eye, and by looking at the effective sample size and the standard error of your estimate.

  3. How do I choose my algorithm so that I do not need "too many" iterations? (e.g. Would a N(0,1) proposal have been better for the Independence Sampler? Would a N(0,1/2) jump proposal have been better for the RWM?) To be discussed!

  4. What do we mean by "better" in the previous question?! Larger effective sample size for the same length of chain.

4.7 R Practice Questions: Independence Sampler

  1. Write R code to implement the independence sampler for the genetics example (Section 4.3.2).

  2. Write R code to implement the independence sampler within Gibbs for the censored data (Section 4.4).

  3. Suppose that there exists β > 0 such that 𝐗 has pdf

     πβ(𝐱) ∝ exp(-β(x1 - x2)²) if 0 ≤ x1, x2 ≤ 10, and πβ(𝐱) = 0 otherwise.

     For fixed values of β, write and implement the following independence sampler programs for obtaining samples from 𝐗.

     (a) q(𝐱) = 1/100 for 0 ≤ x1, x2 ≤ 10, a uniform proposal on [0,10]².

     (b) x1 ∼ U(0,10) and x2|x1 ∼ N(x1, σ²), giving

         q(𝐱) = (1/10) × (1/(√(2π)σ)) exp(-(x2 - x1)²/(2σ²)),  0 ≤ x1 ≤ 10, x2 ∈ ℝ.

     How do the two independence samplers compare with β = 2?

     Theoretical question: for β = 2, comment upon the choice of σ.

4.8 Lab Session: Convergence diagnostics

The files background.R, RWM1d.R and Indep1d.R contain the code for applying the random walk Metropolis algorithm and the independence sampler to five 1-d distributions. The file background.R gives the background functions, whilst RWM1d.R and Indep1d.R run the random walk and independence sampler algorithms.

One point to note about the R code is that it uses the log acceptance probability.

For the 1D functions you have a choice of target (stationary) distributions

  • Gaussian: π(x) ∝ e^{-x²/2}

  • Laplace: π(x) ∝ e^{-|x|}

  • t5: π(x) ∝ 1/(1 + x²/5)³

  • Cauchy: π(x) ∝ 1/(1 + x²)

  • Bimodal: π(x) ∝ 0.2 e^{-(x+8)²/2} + 0.8 × (1/2) e^{-(x-6)²/8}

Of course in real life it is easy to obtain an independent sample from each of these distributions and we would never run an MCMC chain for this purpose. However these distributions have different properties and it can be instructive to see how the behaviour of the chain varies in these simple examples. These rules of thumb can then be carried over to more complex scenarios.

You may choose your proposal distribution from Gaussian, Laplace, t5, or Cauchy; you also choose the scale parameter λ for the proposal. For example, if you chose a Gaussian proposal with scale parameter λ = 3 when calling RWM1d, then the proposed jump is N(0,9), i.e. xprop ∼ N(xcurr, 9). The same proposal for the independence sampler would mean xprop ∼ N(0,9).

Try running RWM1d or Indep1d using the example function call at the bottom of the file - you will see that it produces some basic output.

Important for intuition Note that the Gaussian has lighter tails than the Laplace which has lighter tails than the t5, which has lighter tails than the Cauchy.

Introducing coda Spend no more than 15 minutes on this section!

Run the 1D RWM function for 1000 iterations, starting at x0=0, using a Gaussian proposal, a Gaussian target, and a scale parameter of 0.5

a1=RWM1d(1000,0.5,"Gaussian","Gaussian",0)

a1 is a vector of length 1000 (check!). To use the coda package we first load the library

library(coda)

We then need to turn a1 into an mcmc object.

a1.mc=mcmc(a1)

Try out the following, work out and note down what they do; if necessary use the help facility.
summary(a1.mc)
rejectionRate(a1.mc)
How does this relate to acceptance rate?
plot(a1.mc,smooth=F)
traceplot(a1.mc,smooth=F)
densityplot(a1.mc,smooth=F)
effectiveSize(a1.mc)
autocorr.diag(a1.mc)
autocorr.plot(a1.mc,lag.max=50)
cumuplot(a1.mc)
(Why might this be useful?)

To use the Gelman-Rubin convergence test we need to have a few more chains started at very different points.
a2=RWM1d(1000,0.5,"Gaussian","Gaussian",4)
a3=RWM1d(1000,0.5,"Gaussian","Gaussian",-4)

Turn these into mcmc objects a2.mc and a3.mc.

Create an mcmc list
a.list=mcmc.list(a1=a1.mc,a2=a2.mc,a3=a3.mc)

Now try the following
gelman.diag(a.list)
gelman.plot(a.list)

You should find that the chains take about 200 iterations to converge (check using the trace plots). You would therefore only use the last 800 iterations to calculate any quantities of interest (e.g. mean and variance).

coda has a number of other functions - feel free to investigate in your own time.
https://cran.r-project.org/web/packages/coda/coda.pdf

The main idea of this part of the lab is that you should investigate how the choice of scaling and proposal distribution affects behaviour for different targets. The instructions below comprise a minimal set of tasks - there are many further investigations if you have time!

Throughout this part of the lab it will be better for you to complete one investigation (e.g. Qn1 below) than to partially complete many.

Mixing

First try RWM1d(); each time you should run the algorithm for 10000 iterations and start at x0=0 which is in the main mass of the target distribution (so we do not have to worry about convergence). You should try a variety of different scalings of the proposal from λ=0.1 to λ=10. For each chain, look at the trace plot to see how the chain behaves; note down the acceptance rate, and the effective sample size.

Repeat some of the simulations at least 3 times to check for consistency.

For each of the four stages below (and any other possibilities which you wish to try) answer the following:

  • Which scaling gives the best effective sample size?

  • What is that best effective sample size?

  • What is the acceptance rate at this best scaling?

  • How did the chains behave?

Be sure to examine the traceplots as well as calculating the effective sample size.

  1. Use a Gaussian target and Gaussian proposal.

  2. Repeat (1) for a Gaussian target but instead using a Cauchy proposal.

  3. Repeat (1) and (2) using a Cauchy target with a Gaussian proposal (also try a Cauchy proposal if time allows).

  4. Repeat (1) and (2) but using the bimodal target with a Gaussian proposal (also try a Cauchy proposal if time allows).

What do you conclude? Can you explain your findings?

Now look at Indep1d() following exactly the same instructions as above; your findings and conclusions should be different!

Convergence

Note If time is short then the minimal set of investigations for this section comprises of Qn1 and Qn3 below for RWM1d().

For all chains in this section you should start with x0 = 50. In each case run a chain with λ = λ̂, the best scaling that you found in the previous section, and with λ = 2λ̂. In each case investigate the number of iterations which it takes for the chain to converge, either using the traceplot alone or by running several chains, looking at the traceplots and using the GR plot.

Investigate RWM1d using the following combinations

  1. Gaussian target and Gaussian proposal.

  2. Gaussian target and Cauchy proposal.

  3. Cauchy target and Gaussian proposal.

  4. Cauchy target and Cauchy proposal.

Feel free to try other combinations if there is time!

Now investigate Indep1d using the same combinations of target and proposal as for RWM1d().

4.9 Practice Questions

  1. Suppose that f(x) ∝ (1 + x²) exp(-x²/2) (x ∈ ℝ). The following 3 independence samplers are suggested for obtaining a sample from f(x).

     • Sampler 1. Proposal distribution N(0,1), with proposal probability density function q(x) = exp(-x²/2)/√(2π).

     • Sampler 2. Proposal distribution Cauchy(50,1), with proposal probability density function q(x) = 1/(π(1 + (x - 50)²)).

     • Sampler 3. Proposal distribution t5, with proposal probability density function q(x) = (8/(π√45))(1 + x²/5)^{-3}.

     (a) Why is Sampler 3 the best choice for obtaining samples from f(x), and why are the other two independence samplers unsuitable?

     (b) A sample of size 1000 was obtained using Sampler 3 from f(x) and the output is summarised in the form of a traceplot and acf plot in Figure 1, with an acceptance rate of 0.883.

         Unnumbered Figure: Link

         Figure 1: Traceplot and acf plot of a sample of size 1000 from f(x) using Sampler 3.

         Comment upon the performance of Sampler 3 for f(x).

  2. A doctor is interested in the survival time of patients (in years) following major surgery. It is proposed that the survival time, S, follows a Gamma distribution with shape parameter 2 and scale parameter δ. That is, S has probability density function

     fS(s) = δ²s exp(-δs)  (s ≥ 0).

     Suppose that N patients were monitored and their survival times s1, s2, …, sN are recorded.

     (a) Given that the prior distribution on δ is Gamma(A,B), derive the posterior distribution of δ given 𝐬 = (s1, s2, …, sN).

     Due to limited resources patients are only monitored for 5 years following surgery. Therefore if the survival time of the patient is less than or equal to 5 years it is known exactly. However, if the survival time of the patient is more than 5 years, all that is known is that the survival time exceeds 5 years (censored data). Censored survival times are denoted by an asterisk (*).

     Suppose that for 1 ≤ M < N, s1, s2, …, sM are observed exactly (survival time less than or equal to 5 years) and for the remaining N - M patients their survival times s*M+1, s*M+2, …, s*N are censored at 5 years.

     (b) The survival time, S, given that it exceeds 5 years, has probability density function

         fS>5(s|δ) = δ²s exp(-δs) / {(5δ + 1) exp(-5δ)}  (s ≥ 5).

         Describe an independence sampler for simulating from S.

     (c) Describe a Gibbs sampler for obtaining samples from the posterior distribution of δ given the observed data s1, s2, …, sM, 5*, 5*, …, 5*.

  3. Suppose that X is a non-negative random variable; that is, for x < 0, f(x) = 0. A reflected random walk proposal is used. That is, if the MCMC is currently at x, the proposed value y is generated from

     Y ∼ |N(x, σ²)|.

     Let ϕ(z) denote the probability density function of a standard Normal distribution.

     (a) Show that q(x,y) = (1/σ){ϕ((x - y)/σ) + ϕ((x + y)/σ)}.

     (b) Hence, or otherwise, show that the acceptance probability, α(x,y), for the proposed move from x to y satisfies

         α(x,y) = min{1, f(y)/f(x)}.