The key feature of RWM (random walk Metropolis) is that the proposal distribution is ‘centred’ on the current state of the Markov chain. That is, the proposed or candidate value $y$ is given by
$$y = x + Z,$$
where the random variable $Z$ has density $q$, and $q$ is usually symmetric. When $q$ is symmetric the acceptance probability simplifies to
$$\alpha(x,y) = \min\left\{1, \frac{\pi(y)}{\pi(x)}\right\}.$$
Proof. Let $Z$ have density $q$ and, for $q$ to be symmetric, $q(z) = q(-z)$ for all $z$. Therefore,
$$q(y \mid x) = q(y - x) = q(x - y) = q(x \mid y).$$
Hence,
$$\alpha(x,y) = \min\left\{1, \frac{\pi(y)\, q(x \mid y)}{\pi(x)\, q(y \mid x)}\right\} = \min\left\{1, \frac{\pi(y)}{\pi(x)}\right\}.$$
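To make the algorithm concrete, here is a minimal R sketch of RWM for a univariate target. The standard normal target (dnorm) and the particular values of sigma, x0 and n.iter are illustrative assumptions rather than part of the notes.

# Minimal random walk Metropolis sketch with a symmetric normal proposal.
# The target density and tuning values below are illustrative only.
rwm <- function(n.iter, x0, sigma, target) {
  x <- numeric(n.iter)
  x[1] <- x0
  for (i in 2:n.iter) {
    y <- x[i - 1] + sigma * rnorm(1)                 # propose y = x + Z, Z ~ N(0, sigma^2)
    alpha <- min(1, target(y) / target(x[i - 1]))    # symmetric proposal: pi(y) / pi(x)
    x[i] <- if (runif(1) < alpha) y else x[i - 1]    # accept, or stay at the current value
  }
  x
}

# Example: sampling from a N(0,1) target.
set.seed(1)
out <- rwm(n.iter = 5000, x0 = 0, sigma = 1, target = dnorm)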
RWM is possibly the most widely used MCMC algorithm, but why? There are a number of reasons.
It generally works pretty well. The basic idea is that, in stationarity, the chain spends most of its time in regions of high probability. Therefore, if the posterior distribution is continuous and we are currently at $x$, a point with high posterior density, then for any point $y$ close to $x$ the posterior density will also be high. Thus it makes sense to propose moves relative to the current position $x$.
It is very straightforward to implement, even in high dimensions. All we need is that our choice of $q$ does not result in a reducible Markov chain. A straightforward choice for $q$ when the parameter is $d$-dimensional is $\sigma$ times a $d$-dimensional standard normal distribution, where $\sigma$ is a scaling factor for the proposal standard deviation. This is in contrast to the Gibbs sampler, which requires the full conditional distributions, and the independence sampler, which requires the proposal distribution to be a good approximation to the posterior distribution. Moreover, the independence sampler can get ‘stuck’ for very long periods if the ratio of the posterior density to the proposal density is unbounded.
Correlation/dependence between different parameters can be overcome. The Gibbs sampler was shown to perform badly when parameters are highly correlated. This problem can, to some extent, be circumvented in RWM by choosing a proposal distribution whose correlation structure between pairs of parameters is similar to that of the posterior distribution.
Thus far the RWM algorithm seems ideal but it does have some drawbacks. These include:
It is too general: a general-purpose method can ride roughshod over the nuances of a particular model.
There can still be problems with convergence.
There is still the problem of dependence between samples, and the need for an appropriate choice of variance for the proposal distribution. We will discuss this issue in some detail.
Multi-modal distributions. There can be problems with the chain moving between different modes. This problem is related to convergence: the MCMC chain may appear to have converged when in reality it is only moving around one of the modes. This is an important reason for performing Gelman-Rubin diagnostics and starting the MCMC chain from several different starting values.
The choice of proposal variance is a very important issue for RWM. We shall illustrate why with a simple example. Let $\pi$ be the $N(0,1)$ density, and therefore
$$\pi(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right).$$
Let the proposal distribution be $N(x, \sigma^2)$; that is, $Y = x + \sigma Z$ with $Z \sim N(0,1)$. The algorithm was run with three choices of $\sigma$: a small, an intermediate and a large value. The algorithm starts in stationarity, so there is no need for a burn-in period. The results of the first 500 iterations are plotted below, one plot for each choice of $\sigma$.
Unnumbered Figure: Link
Unnumbered Figure: Link
Unnumbered Figure: Link
All three choices of $\sigma$ result in the same stationary (posterior) distribution for the Markov chain. However, the results are very different. For the smallest value of $\sigma$, the acceptance probability is 0.923, so most proposed moves are accepted. However, most of the moves are very small, and therefore it takes the Markov chain a long time to explore the posterior distribution. At the other extreme, for the largest value of $\sigma$, the acceptance probability is very low (fewer than 1 in 50 proposed moves are accepted). Although accepted moves are large moves in the posterior distribution, the chain spends a long time ‘stuck’ at most of the points visited. Therefore there is a need for a compromise between accepting a reasonable number of proposed moves and making reasonably sized moves when proposals are accepted. The intermediate value of $\sigma$, with an acceptance probability of 0.243, presents a happy compromise. To demonstrate this further, the effective sample size based upon a sample of size 1000 was computed for each of the three choices of $\sigma$.
Therefore we have seen in the above example that choosing $\sigma$ such that the probability of accepting a move is roughly a quarter works well. Can we derive more general conclusions on how to choose $\sigma$? Perhaps somewhat surprisingly, we can. It has been proved that for (almost) all continuous uni-modal posterior distributions an acceptance rate of about 0.234 is (close to) optimal, in terms of maximising the effective sample size. (See [3] if you are feeling brave!) The rule is a limit result as the dimensionality of the target tends to infinity, but it is very good once the dimensionality gets above 5. (For $\pi$ the pdf of a $N(0,1)$, the optimal (average) acceptance probability is about 0.44.) Therefore a very general rule of thumb is to run small MCMC trials with different values of $\sigma$ until the acceptance rate is approximately 0.234, and then perform the main MCMC run with this choice of $\sigma$. There is no need to fine-tune too much, since a fairly wide range of acceptance rates around this value is close to optimal. That is, (almost) regardless of the posterior distribution, we can approximately optimise the RWM algorithm by tuning the proposal standard deviation. For the above Normal example $\sigma \approx 2.4$ is optimal, and when the algorithm was run with this choice of $\sigma$ an effective sample size of 308.3 was obtained.
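The rule of thumb above can be automated with short pilot runs. The sketch below reuses the rwm() function from earlier; the grid of candidate values of sigma, and the use of the proportion of iterations on which the chain moved as a proxy for the acceptance rate, are illustrative choices.

# Pilot runs: pick the sigma whose empirical acceptance rate is closest to 0.234.
acc.rate <- function(x) mean(diff(x) != 0)    # proportion of iterations on which the chain moved

sigmas <- c(0.1, 0.5, 1, 2.4, 5, 10)          # illustrative grid of proposal standard deviations
rates <- sapply(sigmas,
                function(s) acc.rate(rwm(n.iter = 2000, x0 = 0, sigma = s, target = dnorm)))
sigma.star <- sigmas[which.min(abs(rates - 0.234))]

The main run would then use sigma.star; since acceptance rates in a broad band around 0.234 perform almost as well, a coarse grid is sufficient.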
As mentioned above, care needs to be taken when applying the RWM algorithm to multi-modal distributions. The main problem is that the algorithm can get ‘stuck’ at one of the modes: the MCMC chain moves around, but remains close to one modal value and takes a very long time to explore the other modal values. This is most easily seen with an example. Let $x$ have a bimodal posterior density $\pi$ with modes at 0 and 10. Suppose that the proposal distribution is normal with standard deviation $\sigma$; that is, $Y \sim N(x, \sigma^2)$. The RWM algorithm was run to obtain samples of size 1000 from the posterior distribution, once with a small and once with a large value of $\sigma$, with the results presented below.
Unnumbered Figure: Link
Unnumbered Figure: Link
At first glance, both runs appear to have converged, judging by a visual inspection of the output. However, closer inspection of the $y$-axis shows that for the smaller value of $\sigma$ the MCMC chain has spent all 1000 iterations centred about the mode at 10. (Even in a run of length 100000 the MCMC chain did not find the mode at 0.) Therefore the sample obtained is not representative of the posterior distribution, even though the algorithm is correct. On the other hand, with the larger value of $\sigma$ the algorithm moves freely between the two modal values of 0 and 10, and this sample is far more representative of the posterior distribution.
The moral of the story is that MCMC runs can appear to have converged even when they have not. It is important that the MCMC chain explores the whole of the posterior distribution and does not get ‘stuck’ at one mode. This is a primary reason for using Gelman-Rubin diagnostics with different starting points for the MCMC chains. For multi-modal distributions, hybrid RWM algorithms can be very effective; that is, RWM algorithms which propose moves using both low and high proposal variances. The low-variance moves are good for exploring the posterior distribution around a modal value, whereas the high-variance moves are useful for finding other modal values.
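A minimal sketch of such a hybrid scheme is given below: at each iteration the proposal standard deviation is chosen at random from a small and a large value, so the overall proposal density remains symmetric and the acceptance probability is still $\min\{1, \pi(y)/\pi(x)\}$. The two-mode target (modes at 0 and 10, equal weights and unit standard deviations), the mixing probability 0.8 and the two proposal scales are all illustrative assumptions.

# Hybrid RWM: mix a small-variance and a large-variance symmetric proposal.
hybrid.rwm <- function(n.iter, x0, sig.small, sig.large, p.small, target) {
  x <- numeric(n.iter)
  x[1] <- x0
  for (i in 2:n.iter) {
    sig <- if (runif(1) < p.small) sig.small else sig.large   # pick a proposal scale at random
    y <- x[i - 1] + sig * rnorm(1)
    alpha <- min(1, target(y) / target(x[i - 1]))              # proposal is still symmetric
    x[i] <- if (runif(1) < alpha) y else x[i - 1]
  }
  x
}

# Illustrative bimodal target with modes at 0 and 10.
bimodal <- function(x) 0.5 * dnorm(x, 0, 1) + 0.5 * dnorm(x, 10, 1)
out <- hybrid.rwm(n.iter = 10000, x0 = 0, sig.small = 1, sig.large = 10,
                  p.small = 0.8, target = bimodal)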
We consider two simple adaptations of the RWM algorithm. The first, multiplicative RWM, is appropriate for densities on the positive half-line ($\pi(x) > 0$ only for $x > 0$), especially for heavy-tailed distributions. The second, MALA (the Metropolis adjusted Langevin algorithm), is cleverer than RWM in that it takes the derivative of $\pi$ into account when proposing new values.
Let $X$ have pdf $\pi$, where $\pi(x) > 0$ only for $x > 0$. Then multiplicative RWM essentially transforms the problem onto the log scale as follows. Propose
$$y = x \exp(Z),$$
where $Z$ has a symmetric density. Or equivalently, if $w = \log x$ and $v = \log y$, propose
$$v = w + Z.$$
Let $f_W$ denote the pdf of $W = \log X$; then
$$f_W(w) = \pi(e^w)\, e^w,$$
since $\left|\frac{dx}{dw}\right| = e^w$. Therefore, for multiplicative RWM, the acceptance probability is
$$\min\left\{1, \frac{\pi(y)\, y}{\pi(x)\, x}\right\}.$$
(The ratio $\pi(y)/\pi(x)$ is the usual RWM acceptance ratio, with the additional factor $y/x$ arising from the transformation to the log scale.)
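A minimal R sketch of multiplicative RWM is given below. The target (a half-Cauchy density, chosen only because it is heavy tailed and positive), the proposal standard deviation and the starting value are illustrative assumptions.

# Multiplicative RWM: a random walk on log(x), i.e. propose y = x * exp(sigma * Z).
mult.rwm <- function(n.iter, x0, sigma, target) {
  x <- numeric(n.iter)
  x[1] <- x0
  for (i in 2:n.iter) {
    y <- x[i - 1] * exp(sigma * rnorm(1))                      # RWM on the log scale
    ratio <- (target(y) * y) / (target(x[i - 1]) * x[i - 1])   # includes the factor y / x
    x[i] <- if (runif(1) < min(1, ratio)) y else x[i - 1]
  }
  x
}

# Illustrative heavy-tailed target: half-Cauchy density on (0, infinity), started in the tail.
half.cauchy <- function(x) ifelse(x > 0, 2 / (pi * (1 + x^2)), 0)
out <- mult.rwm(n.iter = 1000, x0 = 100, sigma = 1, target = half.cauchy)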
We demonstrate the benefits of multiplicative RWM with
For both standard RWM and multiplicative RWM, I started the MCMC algorithm in the tail of $\pi$, at $x = 100$. Trace plots of the first 1000 iterations of each algorithm are presented below. We can see that the multiplicative RWM quickly converges to the centre of probability mass around 0, whereas the RWM algorithm shows random walk behaviour around 100. The RWM algorithm does eventually get close to 0, but it takes quite a few iterations. The RWM algorithm is correct but performs poorly for heavy-tailed densities (probability densities for which $\pi(x+z)/\pi(x) \to 1$ as $x \to \infty$ for all fixed $z$). This is because RWM can get stuck in the tails of the distribution, wandering apparently aimlessly for a long time before returning to the area of high probability mass: for large $x$, if $y = x + z$ where $z$ is small compared to $x$, then $\pi(y)/\pi(x) \approx 1$, so most moves are accepted, hence the random walk behaviour.
Unnumbered Figure: Link
RWM trace plot.
Unnumbered Figure: Link
Multiplicative RWM trace plot.
In addition, both algorithms were run for 100000 iterations. Both algorithms provide good estimation of $\pi$ in the main body of the distribution. However, if we consider large values of $x$, we can see that multiplicative RWM is better. In 100000 iterations, on average there will be 75 observations greater than 10 and 11 observations greater than 20. For the RWM algorithm there were only 23 observations greater than 10 and none greater than 20 (the maximum value was 13.5). On the other hand, for multiplicative RWM there were 67 observations greater than 10 and 17 observations greater than 20. The maximum value for multiplicative RWM was 41, and under $\pi$ the probability of obtaining a value greater than 41 is very small. Thus the RWM algorithm is less inclined to enter the tail of $\pi$ than the multiplicative RWM algorithm. However, as we saw above, when the RWM algorithm does enter the tail of $\pi$ it spends a long time there before returning to the area of high probability mass.
The standard RWM algorithm proposes a new value centred on the current value. However, it makes sense, where possible, to make use of knowledge of $\pi$. In particular, we want to propose more values where $\pi$ is large than where $\pi$ is small. Therefore if we know $\pi$, and in particular its derivative, we can use this to concentrate the proposal in areas of the distribution with high probability mass. The MALA (Metropolis adjusted Langevin algorithm) is as follows.
Propose a new value $y$ by setting
$$y = x + \frac{\sigma^2}{2} \nabla \log \pi(x) + \sigma Z,$$
where $Z \sim N(0,1)$.
Accept the proposed value with probability
$$\min\left\{1, \frac{\pi(y)\, q(x \mid y)}{\pi(x)\, q(y \mid x)}\right\},$$
where $q(\cdot \mid x)$ is the density of a $N\!\left(x + \tfrac{\sigma^2}{2} \nabla \log \pi(x),\, \sigma^2\right)$ distribution.
If the proposed value is accepted, set the next state of the chain equal to $y$; otherwise set it equal to $x$.
The key point to note is that the proposal is not symmetric, i.e. $q(y \mid x) \neq q(x \mid y)$, and therefore this needs to be taken account of in the acceptance probability. The class of models to which MALA can be applied is smaller than for RWM, and using MALA is more involved as we need the derivative of $\pi$. However, where MALA can be used it is preferable to RWM. For high-dimensional problems MALA is seen to have some very nice properties, see [4]. In particular, if $d$, the number of dimensions, is large, the optimal choice of $\sigma$ for MALA scales as $d^{-1/6}$, compared with $d^{-1/2}$ for RWM, thus potentially offering substantial improvements.
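A minimal R sketch of MALA is given below, written in terms of the log target and its derivative so that the acceptance ratio can be computed stably on the log scale. The standard normal target and the value of sigma are illustrative assumptions.

# MALA sketch: the proposal mean is shifted towards higher density using d/dx log pi(x).
mala <- function(n.iter, x0, sigma, log.target, grad.log.target) {
  x <- numeric(n.iter)
  x[1] <- x0
  for (i in 2:n.iter) {
    xc <- x[i - 1]
    mu.x <- xc + 0.5 * sigma^2 * grad.log.target(xc)   # mean of the forward proposal
    y <- rnorm(1, mu.x, sigma)
    mu.y <- y + 0.5 * sigma^2 * grad.log.target(y)     # mean of the reverse proposal
    log.alpha <- log.target(y) - log.target(xc) +
      dnorm(xc, mu.y, sigma, log = TRUE) - dnorm(y, mu.x, sigma, log = TRUE)
    x[i] <- if (log(runif(1)) < log.alpha) y else xc   # accept, or remain at the current value
  }
  x
}

# Example: N(0,1) target, so log pi(x) = -x^2/2 (up to a constant) and its derivative is -x.
out <- mala(n.iter = 5000, x0 = 0, sigma = 1,
            log.target = function(x) -x^2 / 2,
            grad.log.target = function(x) -x)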
We conclude the course with a few examples of the Metropolis-Hastings algorithm applied to statistical problems.
Suppose that for $i = 1, \ldots, n$, $X_i \sim \mathrm{Po}(\lambda)$ independently. The likelihood is
$$L(\lambda; \mathbf{x}) = \prod_{i=1}^{n} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} \propto \lambda^{\sum_{i=1}^{n} x_i} e^{-n\lambda}.$$
The conjugate prior for $\lambda$ is a Gamma$(\alpha, \beta)$ distribution, with density proportional to $\lambda^{\alpha - 1} e^{-\beta \lambda}$.
But gamma priors are often extremely unsatisfactory for expressing ignorance (e.g. just plot the density for small values of $\alpha$ and $\beta$).
Instead, use a half-Cauchy prior, $\pi(\lambda) \propto \dfrac{1}{1 + (\lambda/A)^2}$ for $\lambda > 0$, for some scale $A$. The posterior is
$$\pi(\lambda \mid \mathbf{x}) \propto \frac{\lambda^{\sum x_i}\, e^{-n\lambda}}{1 + (\lambda/A)^2}.$$
Perhaps the easiest MH algorithm in this case would be an independence sampler with a Gamma$(\sum x_i + 1, n)$ proposal, i.e. the posterior that would arise from a flat prior. Since the gamma terms in the posterior and the proposal cancel, in this case the acceptance probability is
$$\min\left\{1, \frac{1 + (\lambda/A)^2}{1 + (\lambda'/A)^2}\right\},$$
where $\lambda$ is the current and $\lambda'$ the proposed value.
Alternatively, a random walk Metropolis algorithm with proposal
$$\lambda' \sim N\!\left(\lambda,\; 2.4^2 \times \frac{\sum x_i + 1}{n^2}\right)$$
should also work well.
Why the above choice of proposal variance? The variance of the Gamma$(\sum x_i + 1, n)$ distribution is $(\sum x_i + 1)/n^2$. (This is the posterior obtained with the improper, flat prior $\pi(\lambda) \propto 1$.) We scale by a factor of $2.4^2$ as we know this is (approximately) the optimal proposal scaling for a normal target. Note that if $\sum x_i$ is large, then by the central limit theorem the posterior is approximately normal.
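As a hedged illustration, both suggested samplers can be coded in a few lines. The data summaries sumx and n, the half-Cauchy scale A and the run length below are arbitrary illustrative values, and the prior is taken to be proportional to $1/\{1 + (\lambda/A)^2\}$ on $\lambda > 0$, as assumed above.

# Sketch of the independence sampler and RWM for the Poisson rate lambda.
log.post <- function(lambda, sumx, n, A) {
  if (lambda <= 0) return(-Inf)
  sumx * log(lambda) - n * lambda - log(1 + (lambda / A)^2)
}

sumx <- 50; n <- 10; A <- 10; n.iter <- 10000       # illustrative values only
lam.is <- lam.rw <- numeric(n.iter)
lam.is[1] <- lam.rw[1] <- sumx / n                  # start both chains at the MLE
sd.rw <- 2.4 * sqrt(sumx + 1) / n                   # RWM proposal sd, as suggested above

for (i in 2:n.iter) {
  # Independence sampler: propose from the Gamma(sumx + 1, n) 'flat prior' posterior.
  y <- rgamma(1, sumx + 1, n)
  log.a <- log.post(y, sumx, n, A) - log.post(lam.is[i - 1], sumx, n, A) +
    dgamma(lam.is[i - 1], sumx + 1, n, log = TRUE) - dgamma(y, sumx + 1, n, log = TRUE)
  lam.is[i] <- if (log(runif(1)) < log.a) y else lam.is[i - 1]

  # Random walk Metropolis with a normal proposal.
  y <- lam.rw[i - 1] + sd.rw * rnorm(1)
  log.a <- log.post(y, sumx, n, A) - log.post(lam.rw[i - 1], sumx, n, A)
  lam.rw[i] <- if (log(runif(1)) < log.a) y else lam.rw[i - 1]
}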
Here are the trace plots for the two suggested algorithms applied to a sample of Poisson count data. (The MLE of $\lambda$ from a Poisson sample is $\bar{x}$.)
Unnumbered Figure: Link
Independence Sampler
Unnumbered Figure: Link
RWM
Mixing seems reasonable. Acceptance rates are and respectively and effective sample sizes are and .
Estimates of the posterior median and credible intervals for $\lambda$ are
Independence Sampler: 5.35 (3.83,7.11).
RWM: 5.36 (3.88,7.13).
Note: in "real life" you would not use MCMC on the above target; there is a simpler and more appropriate technique, rejection sampling, which you will have met in the Bayesian Statistics course.
In this section we consider using the Metropolis-Hastings algorithm for logistic regression models. We use the Caesarean section example previously studied in Section 3.5 (using a probit regression and data augmentation Gibbs sampler).
Response: occurrence or non-occurrence of infection following birth by Caesarean section.
Covariates:
noplan: equal to 1 if the Caesarean section was not planned, and 0 otherwise;
factor: equal to 1 if one or more risk factors, such as diabetes or excessive weight, were present, and 0 otherwise;
antib: equal to 1 if antibiotics were given as prophylaxis, and 0 otherwise.
In total: 251 births.
noplan | factor | antib | Infection: yes | Infection: no |
0 | 0 | 0 | 8 | 32 |
0 | 0 | 1 | 0 | 2 |
0 | 1 | 0 | 28 | 30 |
0 | 1 | 1 | 1 | 17 |
1 | 0 | 0 | 0 | 9 |
1 | 0 | 1 | 0 | 0 |
1 | 1 | 0 | 23 | 3 |
1 | 1 | 1 | 11 | 87 |
We have aggregated all binary responses in each covariate category into binomial responses. Thus,
$$y_i \sim \mathrm{Bin}(n_i, p_i),$$
where $i$ indexes the covariate combinations, $y_i$ is the number of infections and $n_i$ the total number of observations for the $i$th covariate combination.
Suppose that we assume the binary logistic regression model
$$\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1\,\mathrm{noplan}_i + \beta_2\,\mathrm{factor}_i + \beta_3\,\mathrm{antib}_i,$$
where $p_i$ denotes the probability of an infection for each individual in the $i$th covariate combination and $\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2, \beta_3)$.
A maximum likelihood analysis in R gives the following results.
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.8926 0.4124 -4.590 4.44e-06 ***
noplan 1.0720 0.4253 2.520 0.0117 *
factor 2.0299 0.4552 4.459 8.23e-06 ***
antib -3.2544 0.4813 -6.762 1.37e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Null deviance: 83.491 on 6 degrees of freedom
Residual deviance: 10.997 on 3 degrees of freedom
AIC: 36.178
Number of Fisher Scoring iterations: 4
The value of 10.997 for the residual deviance is not consistent with a $\chi^2_3$ distribution, since we can compute the corresponding $p$-value as
> 1-pchisq(10.997,3)
[1] 0.01174211
So there is a fairly strong warning sign that the model may not fit adequately. To keep things simple we will ignore this (we could consider interactions or probit regression, as seen earlier).
A Bayesian analysis places a prior on $\boldsymbol{\beta}$, usually a multivariate normal prior $\pi(\boldsymbol{\beta})$. The posterior distribution for $\boldsymbol{\beta}$ is
$$\pi(\boldsymbol{\beta} \mid \mathbf{y}) \propto \pi(\boldsymbol{\beta}) \prod_{i} p_i^{y_i} (1 - p_i)^{n_i - y_i},$$
where $p_i = p_i(\boldsymbol{\beta})$ as above.
The posterior density is a complicated, non-linear function of . How can we construct an MCMC algorithm to sample from this distribution?
1. Update $\boldsymbol{\beta}$ jointly: an independence sampler with a multivariate normal (or better, multivariate $t$) candidate generator, with mean equal to the posterior mode and covariance matrix equal to the inverse curvature at the mode. This requires a Newton-Raphson search for the posterior mode and relies on asymptotic normality of the posterior distribution. A multivariate normal may rarely propose values in the tails of the posterior, hence the preference for a $t$-distribution.
2. Update $\boldsymbol{\beta}$ jointly: a multivariate Gaussian random walk candidate generator, with covariance matrix equal to the inverse curvature at the posterior mode times a (scalar) tuning factor, or some other pre-estimated covariance matrix. This needs tuning and pre-calculation of the covariance estimate.
3. Update $\boldsymbol{\beta}$ component-wise via MH: a simple random walk proposal updating one component at a time. This needs tuning and can have problems with collinearity.
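The course implementations are in the files mentioned below. As a hedged sketch of algorithm 2, a multivariate Gaussian random walk update for $\boldsymbol{\beta}$ might look as follows, where X is the design matrix over the covariate combinations (intercept, noplan, factor, antib), y the infection counts and m the binomial totals from the table above. The $N(0, 100I)$ prior, the pilot estimate Sigma.hat and the tuning factor lambda are assumptions for illustration.

# Sketch of algorithm 2: multivariate Gaussian RWM for the logistic regression coefficients.
# X: design matrix (intercept, noplan, factor, antib); y: infections; m: binomial totals.
library(MASS)   # for mvrnorm()

log.post <- function(beta, X, y, m, prior.var = 100) {
  p <- 1 / (1 + exp(-as.vector(X %*% beta)))               # inverse logit link
  sum(dbinom(y, m, p, log = TRUE)) +
    sum(dnorm(beta, 0, sqrt(prior.var), log = TRUE))       # independent normal priors (assumed)
}

mrwm <- function(n.iter, beta0, lambda, Sigma.hat, X, y, m) {
  B <- matrix(NA, n.iter, length(beta0))
  B[1, ] <- beta0
  for (i in 2:n.iter) {
    prop <- mvrnorm(1, B[i - 1, ], lambda * Sigma.hat)     # symmetric multivariate proposal
    log.a <- log.post(prop, X, y, m) - log.post(B[i - 1, ], X, y, m)
    B[i, ] <- if (log(runif(1)) < log.a) prop else B[i - 1, ]
  }
  B
}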
The code implementing algorithms 1, 2 and 3 is stored in the files logisticIP.R, logisticMRWM.R and logisticRWM.R, respectively. The code to run the analysis is given in the file logisticRUN.R.
The output from running these algorithms for 5000 iterations is plotted in the files “logistic1.pdf”, “logistic2.pdf” and “logistic3.pdf”.
Estimates (posterior mean, standard deviation and posterior probability of exceeding 0) from the first algorithm (independence sampler; 5,000 samples after a burn-in of 500) are given in the following table:
Coefficient | Posterior mean | Std. dev. | P(coefficient > 0) |
(Intercept) | -1.9544 | 0.4228 | 0 |
noplan | 1.1071 | 0.4229 | 0.9968 |
factor | 2.0955 | 0.467 | 1 |
antib | -3.3322 | 0.4867 | 0 |
The overall acceptance rate of this run was 87.6%, which is very good for an independence sampler. The multivariate random walk algorithm works well when the covariance matrix of the proposal distribution reflects the correlations between the parameters. For both the independence sampler and the multivariate random walk algorithm, the same mechanism is used to find an appropriate covariance matrix for the proposal distribution. The componentwise (univariate) random walk Metropolis performs more poorly, with slowly decaying correlations in the acf plots. However, the componentwise random walk Metropolis is run without the tuning of the covariance matrix which is important for successful implementation of the other algorithms.
This example is taken from [1] and gives a non-linear model for oxygen demand in a biochemical experiment against incubation time.
Model:
$$y_i = \beta_1\left\{1 - \exp(-\beta_2 x_i)\right\} + \epsilon_i,$$
where $y$ is biochemical oxygen demand (in mg/l), $x$ is incubation time in days and $\epsilon_i \sim N(0, \sigma^2)$ independently.
The data, together with the fitted least squares curve, are plotted below.
Unnumbered Figure: Link
Data source: www.itl.nist.gov/div898/strd/nls/data/boxbod.shtml
Random walk Metropolis was used to obtain a sample from the posterior distribution of $(\beta_1, \beta_2, \sigma)$ with a multivariate Gaussian proposal. Tuning found that a proposal covariance matrix of the form $\lambda \hat{\Sigma}$ worked well, where $\lambda$ is a scalar tuning factor and $\hat{\Sigma}$ is an estimate of the posterior covariance matrix computed on the basis of pilot runs.
The estimated posterior means (and standard deviations) of the three parameters $\beta_1$, $\beta_2$ and $\sigma$ were obtained from a sample of size 10000. (A burn-in of 1000 iterations was used.) Finally, in the figure below $\beta_1$ is plotted against $\beta_2$, which clearly shows a non-linear relationship between the regression parameters.
Unnumbered Figure: Link
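For completeness, a sketch of the log-posterior that such a random walk Metropolis run could target is given below, assuming the model above with independent $N(0, \sigma^2)$ errors and flat priors on $(\beta_1, \beta_2, \log \sigma)$; the flat priors are an assumption, not taken from the notes. This function can be plugged into a multivariate Gaussian random walk update of the kind sketched for the logistic regression example.

# Log-posterior for the BOD model y_i = beta1 * (1 - exp(-beta2 * x_i)) + eps_i.
# theta = (beta1, beta2, log sigma); x and y hold the data from the source cited above.
bod.log.post <- function(theta, x, y) {
  beta1 <- theta[1]
  beta2 <- theta[2]
  sigma <- exp(theta[3])                            # work with log(sigma), so sigma > 0
  mu <- beta1 * (1 - exp(-beta2 * x))               # mean oxygen demand at incubation time x
  sum(dnorm(y, mu, sigma, log = TRUE))              # flat priors contribute only a constant
}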
In this lab you will use Poisson regression to analyse the data in labants.txt, which consist of the ant species richness (number of ant species) found in 64 square metre sampling grids, in 22 bogs and the 22 forests surrounding the bogs, in Connecticut, Massachusetts and Vermont (USA). The sites span a range of latitudes in New England. The data source is [2]. There are 44 observations on four variables, plus the site identifier.
Site: An abbreviation for the name of the site
Srich: The species richness (number of species) for ants
Habitat: Either Forest or Bog
Latitude: The latitude (in decimal degrees) for the site
Elevation: The elevation of the site, in metres above sea level
Create a new directory for this lab and download the data file to it. The file lab5code.r contains some of the R code that you will need. You should download this file to the same directory as the data file and then edit it in order to perform the following tasks. (Note that the MASS library is called to allow straightforward sampling from a multivariate normal distribution.)
The first few lines in lab5code.r read in the data and create a response vector y and the design matrix X. Make sure you understand what the code is doing. Look at the data. Why do we subtract 1 when creating the second column of X?
There are three covariates: habitat, latitude, and elevation, and so there are four components to our parameter vector - why?
We will use independent Gaussian priors for each of the four $\beta$s, each centred on 0 and with the same variance, the scalar value V.prior. The function log.prior() should take the parameter vector beta and the scalar variance V.prior and should return the log of the prior density for beta. Work out the formula for the log prior (up to an arbitrary constant independent of beta) and so fill in the template for this function.
We wish to employ vague but proper priors so V.prior should be large but finite: a value of 100 should suffice in this instance.
We will be using the following model:
$$Y_i \sim \mathrm{Po}(\lambda_i) \text{ independently}, \qquad \log \lambda_i = \mathbf{x}_i^T \boldsymbol{\beta},$$
where $\mathbf{x}_i$ is the $i$th row of the design matrix X. Recall that the density of a single Poisson random variable $Y$ which has mean $\lambda$ is
$$P(Y = y) = \frac{\lambda^y e^{-\lambda}}{y!}, \qquad y = 0, 1, 2, \ldots$$
Use vector arithmetic to fill in the template for the function poisson.log.like(), which returns the log-likelihood of the parameter vector beta given the data vector y and covariate matrix X. Hint: your very first step should be

loglam <- X %*% beta

This gives the vector of $\log \lambda_i$ values.
The inputs and outputs of the function RWM() are detailed in the comments above it. Fill in the blanks as instructed by the comments in the function.
Note that V.prop is the variance of the (multivariate) Gaussian jump proposal in the RWM for beta. Since beta is a 4-component vector, V.prop must be a 4x4 matrix.
Initially you will tune the algorithm by choosing a scaling and a variance matrix. All tuning runs should be for 10000 iterations and should start from the same initial value of beta. Set your plot window to a 2x2 array with the command par(mfrow=c(2,2)).
Perform a run with V.prop=diag(rep(1,4)) (why?) and lambda.prop=0.001. Create an MCMC object from this run and look at the trace plots. What do you notice?
Find the variance matrix from this run using the command V=var(B), where B is the output from the previous run. Change your scale parameter, lambda.prop, to 1 (this is roughly the optimal scaling $2.38^2/d$ with $d = 4$) and try a new run with V.prop set to V. What do the traceplots show?
Repeat step 2 until all four components are mixing well. What is your final variance matrix? What is the correlation matrix? Use these to explain the behaviour of the first run.
Look at the effective sample size for this run (removing burn in by slicing). Can you increase it by changing lambda.prop?
Use slicing on your final run to chop the first few rows from your output matrix (burn-in). Find the posterior median estimate and a CI for each parameter and draw kernel density estimates or histograms from the MCMC output.
Let with .
Consider a MALA algorithm to obtain samples from . That is, propose a new value from
Show that the acceptance probability for the MALA algorithm is
Why choose for the MALA algorithm?
Let for and otherwise.
Let for . Consider the Metropolis-Hastings algorithm for obtaining samples from with proposal density .
What name is given to this type of Metropolis-Hastings algorithm?
Show that the acceptance probability is .
The probability of accepting a proposed move when the current state is is given by
Show that
Which out of and should be used for the algorithm and why?
In an angina study, the sex and cholesterol levels of patients were recorded. Let and denote the sex ( or ) and cholesterol level, respectively, of patient . It is proposed that the probability that patient suffers from angina is
independently of all other patients, and are the unknown parameters of interest. Finally, let if patient suffers from angina and otherwise with .
Write down the probability that given , , , and .
Given independent priors for each of , and , write down, up to proportionality, the joint probability density function of .
Outline a random walk Metropolis algorithm to obtain samples from the joint posterior distribution of .
Patient is male () and his cholesterol level, . Describe how to use a sample from the MCMC algorithm to estimate the probability that patient has angina.