Markov Chain Monte Carlo, often referred to simply as MCMC, has probably been the single most important advance in statistics over the last 25 years or so. MCMC has its origins in Metropolis et al. (1953) and Hastings (1970). However, it was only in the late 1980s, with the seminal paper of Gelfand and Smith (1990), that the MCMC idea began to invade mainstream statistics, and a considerable amount of research has been devoted to it ever since.
MCMC utilises a Monte Carlo method based upon Markov chains (as the name suggests!). MCMC is primarily used in Bayesian statistics to provide samples from the posterior distribution.
We cover three preliminary topics before discussing MCMC. These are:-
Monte Carlo methods;
Markov chains;
Bayesian statistics (very brief overview; details in MATH553).
Monte Carlo methods essentially mean simulation. These are used primarily to evaluate integrals (for random variables) where analytical solutions are either not possible or extremely laborious.
The simplest example is the case of the expectation $E[X]$ of a random variable $X$ which has no analytical form. We can estimate the mean of $X$ by taking a sample from the distribution of $X$ and using the sample mean as an estimate of the theoretical mean. That is:-
Take a sample of size $n$ from the distribution of $X$: $x_1, x_2, \ldots, x_n$.
We assume that the observations are independent.
Then $E[X] \approx \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
For example, $X$ could represent the height (in cm) of a 5 year old child. An estimate of $E[X]$ can then be obtained by taking a class of twenty 5 year old children and measuring their heights.
This argument can be generalised. Suppose that we wish to calculate/evaluate $E[\phi(X)]$, for some function $\phi$.
Then if we take a sample $x_1, \ldots, x_n$ from the distribution of $X$, we have that
\[ \bar{\phi} = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i) \]
is an unbiased estimate of $E[\phi(X)]$. This process is exactly the same as for the expectation, in that we use the sample mean to estimate the theoretical mean. This approach is very easy to use, even for multi-dimensional distributions.
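To make this concrete, here is a minimal R sketch; the distribution and function are made-up choices for illustration. It estimates $E[\phi(X)]$ for $X \sim N(0,1)$ with $\phi(x) = x^2$, whose true value is 1.

# Monte Carlo estimate of E[phi(X)] for X ~ N(0, 1) with phi(x) = x^2.
# The true value is E[X^2] = Var(X) = 1.
set.seed(1)
n <- 10000
x <- rnorm(n)            # a sample x_1, ..., x_n from the distribution of X
phi <- function(x) x^2   # the function whose expectation we want
mean(phi(x))             # the Monte Carlo estimate; should be close to 1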
Properties of Monte Carlo integration
Unbiased: $E[\bar{\phi}] = E[\phi(X)]$.
If the variance $\sigma^2 = \text{var}(\phi(X))$ exists, then $\text{var}(\bar{\phi}) = \sigma^2 / n$.
Central limit theorem:
That is, $\sqrt{n}\,(\bar{\phi} - E[\phi(X)]) \to N(0, \sigma^2)$ in distribution as $n \to \infty$.
All the above hold provided $x_1, \ldots, x_n$ are independent. However, these results can be extended to the case where the $x_i$ are dependent, which will be the case when using MCMC.
Drawbacks of Monte Carlo integration
The variance of $\bar{\phi}$ is often large.
This problem can be circumvented to some extent by variance
reduction methods such as importance sampling.
It can be difficult to simulate from the distribution of $X$ (in an efficient manner).
This is a harder problem to deal with and this is where MCMC will be
useful. Simulating from univariate distributions (rejection
sampling, inversion of the cdf) is usually fairly straightforward
but for multivariate distributions the problem becomes a lot harder as methods such as rejection sampling perform much worse as the dimension of the problem grows.
This means a lot of work is required to get even a small sample from the distribution of interest.
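As an illustration of the univariate case, here is a minimal rejection sampling sketch in R; the target (Beta(2,2)) and proposal (Uniform(0,1)) are made-up choices, and the bound M = 1.5 is the maximum of the Beta(2,2) density 6x(1-x).

# Rejection sampling sketch: target Beta(2, 2), proposal Uniform(0, 1).
set.seed(1)
target <- function(x) 6 * x * (1 - x)   # Beta(2, 2) density
M <- 1.5                                # bounds target(x) / proposal density
rejection_sample <- function(n) {
  out <- numeric(n)
  for (i in 1:n) {
    repeat {
      y <- runif(1)                     # propose from Uniform(0, 1)
      if (runif(1) <= target(y) / M) {  # accept with probability target(y) / M
        out[i] <- y
        break
      }
    }
  }
  out
}
hist(rejection_sample(5000), freq = FALSE)  # matches the Beta(2, 2) density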
There is a lot of deep theory related to the use of Markov chains in Monte Carlo methods and we shall barely scratch the surface.
What is a Markov Chain?
A Markov chain is a stochastic process $X_0, X_1, X_2, \ldots$ which satisfies the 'memoryless' property:
\[ P(X_{n+1} = j \,|\, X_n = i, X_{n-1} = x_{n-1}, \ldots, X_0 = x_0) = P(X_{n+1} = j \,|\, X_n = i), \]
where $X_n$ denotes the state of the process after $n$ steps. Basically, the future is independent of the past, given the present state of the process. A good introduction to the theory of Markov chains is given in Grimmett and Stirzaker (1992), Section 6. We restrict attention in our discussion to discrete, time-homogeneous Markov chains, that is, for all $n$, $i$ and $j$,
\[ P(X_{n+1} = j \,|\, X_n = i) = P(X_1 = j \,|\, X_0 = i). \]
A very simple example of a Markov chain is the game Snakes and Ladders. Your future progress only depends upon your current position and not how you have arrived at your current position.
We shall illustrate the basic ideas assuming that the state space $S$ for $X_n$ is countable, although in most examples of MCMC the state space (set of possible values) of the parameters will be $\mathbb{R}^d$, which is uncountable. For all $i, j \in S$, let
\[ p_{ij}(n) = P(X_{n+1} = j \,|\, X_n = i). \]
We shall assume that the Markov chain is time homogeneous, that is, for all $n$,
\[ p_{ij}(n) = p_{ij} = P(X_1 = j \,|\, X_0 = i), \]
the probability of moving from state $i$ to state $j$ in one step. Suppose that the state space is finite. Let $P$ denote the matrix with elements $p_{ij}$. Then the $(i, j)$ element of $P^n$ gives $P(X_n = j \,|\, X_0 = i)$.
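For illustration, the following R sketch uses a made-up two-state transition matrix and computes $P^{50}$ by repeated multiplication; each row of the result approaches the stationary distribution.

# n-step transition probabilities: the (i, j) element of P^n is P(X_n = j | X_0 = i).
P <- matrix(c(0.9, 0.1,
              0.2, 0.8), nrow = 2, byrow = TRUE)
Pn <- diag(2)
for (n in 1:50) Pn <- Pn %*% P  # Pn is now P^50
Pn  # both rows are close to the stationary distribution (2/3, 1/3)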
Snakes and Ladders Example
Suppose that we are currently at square $i$ after $n$ turns, and that there is a ladder at square 40 leading to square 62 and a snake at square 42 leading to square 27. Then, rolling a standard 6-sided die, we have
\[ P(X_{n+1} = j \,|\, X_n = i) = \frac{1}{6} \]
for each of the six squares $j$ reachable from square $i$, depending upon which number we roll from 1 to 6 (a roll landing on square 40 takes us up the ladder to 62, and one landing on square 42 takes us down the snake to 27).
A Markov chain is
irreducible; if for all $i$ and $j$, there exists some $n$ such that $P(X_n = j \,|\, X_0 = i) > 0$;
(we can get from any state to any other state)
aperiodic; if, for all $i$, the greatest common divisor of $\{ n \geq 1 : P(X_n = i \,|\, X_0 = i) > 0 \}$ is 1;
(does not exhibit periodic behaviour)
recurrent; if for all $i$, $P(X_n = i \text{ for some } n \geq 1 \,|\, X_0 = i) = 1$;
(we eventually return to the starting state)
and positive recurrent; if it is irreducible, aperiodic and recurrent, and there exists a unique collection of probabilities $\{\pi_j\}$ such that for all $i$ and $j$,
\[ P(X_n = j \,|\, X_0 = i) \to \pi_j \text{ as } n \to \infty, \qquad (2.1) \]
and
\[ \sum_j \pi_j = 1. \]
The distribution $\{\pi_j\}$ is called the stationary distribution of the Markov chain. Note that (2.1) says that, regardless of the value of $X_0$, for large $n$, $P(X_n = j) \approx \pi_j$. That is, for large $n$, $X_n$ is (approximately) distributed according to the stationary distribution. How large $n$ needs to be for the approximation to be close depends upon the starting value and the rate of convergence. The rate of convergence can be calculated using probability theory, and depends upon how
$\sup_{i \in S} | P(X_n = j \,|\, X_0 = i) - \pi_j |$ behaves as $n \to \infty$, where $S$ is the state space of the parameters (often $\mathbb{R}^d$ or a subset thereof). In practice with MCMC, convergence is often 'detected' by eye without any formal procedures.
A key point is that if, for all $j$, $P(X_0 = j) = \pi_j$, then for all $n$ and $j$, $P(X_n = j) = \pi_j$. Therefore if $X_0$ is distributed according to $\pi$, then for all $n$, $X_n$ is distributed according to $\pi$ (the stationary distribution). Also $\pi = \pi P$ in matrix notation.
Example. Suppose that we have four states labeled 1, 2, 3, 4.
Transition matrix
Then $p_{ij}$ is the probability of moving from state $i$ to state $j$. That is,
\[ p_{ij} = P(X_{n+1} = j \,|\, X_n = i). \]
Note that all the rows must sum to 1, i.e. $\sum_j p_{ij} = 1$ for all $i$.
We consider different transition matrices describing the transitions from one time point to the next.
Matrix 1.
This is an example of a reducible Markov chain since we can't get from state 2 to state 3. Also the Markov chain is not recurrent since, if you start in state 3, the probability of being in state 3 at any later time does not exceed 0.5. (If you go to state 1 or 2 at the first step you will never return to state 3.) The probability of ever returning to state 3 is 0.3.
Matrix 2.
This is periodic: it returns to a given state only after an even number of steps. (It alternates between odd states, 1 and 3, and even states, 2 and 4.)
Matrix 3.
This Markov chain is positive recurrent with stationary distribution $\pi = (\pi_1, \pi_2, \pi_3, \pi_4)$ such that $\pi = \pi P$ and $\sum_j \pi_j = 1$.
For example, the stationary distribution can be found by solving these equations directly, or numerically as in the sketch below.
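Since Matrix 3 is not reproduced here, the sketch below uses a made-up 4-state transition matrix; the method, extracting the left eigenvector of $P$ with eigenvalue 1, applies to any positive recurrent finite chain.

# Stationary distribution: pi P = pi, so pi is the eigenvector of t(P) with
# eigenvalue 1, rescaled to sum to 1. P here is illustrative only.
P <- matrix(c(0.5, 0.2, 0.2, 0.1,
              0.3, 0.3, 0.2, 0.2,
              0.1, 0.4, 0.4, 0.1,
              0.2, 0.2, 0.3, 0.3), nrow = 4, byrow = TRUE)
e <- eigen(t(P))                  # eigenvalues sorted by decreasing modulus
pi_hat <- Re(e$vectors[, 1])
pi_hat <- pi_hat / sum(pi_hat)    # normalise to a probability distribution
pi_hat
pi_hat %*% P                      # equals pi_hat, confirming stationarity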
The aim of MCMC is to construct a Markov chain whose stationary distribution is the posterior distribution that we are interested in. Therefore a sample from the Markov chain (provided the Markov chain is started in stationarity) will form a sample from the posterior distribution of interest. It is important to check that the Markov chain we construct is irreducible and aperiodic, in particular the irreducibility of the Markov chain.
Detailed balance
A discrete Markov chain with transition matrix $P$ is said to satisfy detailed balance if there is a distribution $\pi$ such that for all $i$ and $j$,
\[ \pi_i p_{ij} = \pi_j p_{ji}. \]
Lemma If a Markov chain with transition matrix $P$ satisfies detailed balance with distribution $\pi$, then $\pi$ is the stationary distribution of the chain.
Proof: Suppose that $P$ satisfies detailed balance. Then for any $j$,
\[ \sum_i \pi_i p_{ij} = \sum_i \pi_j p_{ji} = \pi_j \sum_i p_{ji} = \pi_j, \]
since for all $i$, $\sum_j p_{ij} = 1$. (All rows sum to 1.) Hence $\pi$ is the stationary distribution of the Markov chain.
Checking detailed balance is often an easy way of confirming a probability distribution is the stationary distribution of a Markov chain.
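The following R sketch, with made-up values of $\pi$ and $P$, checks detailed balance numerically and confirms that $\pi$ is then stationary.

# Detailed balance check: the matrix with (i, j) element pi_i p_ij should be
# symmetric, and then pi P = pi.
pi_vec <- c(0.2, 0.3, 0.5)
P <- matrix(c(0.40, 0.30, 0.30,
              0.20, 0.40, 0.40,
              0.12, 0.24, 0.64), nrow = 3, byrow = TRUE)
flux <- pi_vec * P               # (i, j) element is pi_i p_ij
all.equal(flux, t(flux))         # TRUE: detailed balance holds
pi_vec %*% P                     # returns (0.2, 0.3, 0.5): pi is stationary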
Continuous state space Markov chains
Usually, but not always, in Bayesian statistics we want to obtain samples from a continuous distribution (or a distribution which is a mixture of continuous and discrete variables). Most of the above discussion for discrete state space Markov chains carries over to continuous state space Markov chains in very natural ways with minor modifications. In this short section, we briefly outline the key differences, and in particular the influence these will have on the MCMC discussion which follows.
Firstly, we replace the transition matrix $P$ by a transition kernel $K$. Specifically, if we let $S$ denote the set of values the Markov chain can take ($S$ will often be $\mathbb{R}^d$ or some subset of it), then for all $x \in S$ we have a density $K(x, \cdot)$ such that
\[ \int_S K(x, y) \, dy = 1, \]
and for a set $A \subseteq S$, the probability of moving from state $x$ to a state in $A$ is
\[ P(X_{n+1} \in A \,|\, X_n = x) = \int_A K(x, y) \, dy. \]
For a continuous distribution the probability of ever returning to a state is 0. (For any continuous distribution the probability of observing any single value is 0.) Therefore for recurrence of a continuous state space Markov chain, we talk in terms of returning to a set $A$. We won't go into the details, but for any $x$ we could define a set of values, $A_x$ say, close to $x$, for example within a certain distance of $x$, and look at the time taken by the Markov chain to return to $A_x$.
Secondly, we can still talk in terms of a stationary distribution $\pi$, but $\pi$ will now be a probability density function. The stationary distribution satisfies
\[ \pi(y) = \int_S \pi(x) K(x, y) \, dx, \]
where in moving from discrete distributions to continuous distributions we have replaced the summation by integration and the transition matrix by the transition kernel. Similarly, detailed balance is given by
\[ \pi(x) K(x, y) = \pi(y) K(y, x). \]
In classical (or frequentist) statistics given a parametric model, we assume that the parameters of the model are fixed, typically unknown, constants. We can then use data to make inference about the parameters, often through the likelihood function. That is, find the maximum likelihood estimate (MLE) of the parameters. We can also find standard errors and confidence intervals for the MLE. See for example the Week 6 notes on the EM algorithm.
Bayesian statistics takes its name from Thomas Bayes. Again we can assume a parametric model for the data. However, rather than assuming that the parameters are fixed constants, we now assume that they are random variables, incorporating uncertainty about the parameters. Moreover, we can make use of any prior knowledge we (or others) may have about the parameter values. This is an important feature of Bayesian statistics, the presence of a prior distribution for the parameters. These priors can be either vague or informative, reflecting how much prior knowledge we may have concerning the parameters. The posterior distribution of the parameters is what we are interested in. The posterior distribution for the parameters is dependent upon the parametric model chosen, the data and the prior distribution. From the posterior distribution, we can obtain the modal value for the parameters (these will typically be very close to the MLE values, although, depending upon the choice of prior distribution, they will rarely agree exactly), the mean of the parameters and the variance (covariance) of the parameters. Thus we can obtain a great deal of information from the posterior distribution, and there is no need to construct confidence intervals to measure uncertainty in the parameter values as these are obtained directly from the posterior distribution. Note that as the amount of data increases, the effects of the prior will diminish, and in the limit as we obtain infinite data, the choice of prior will be inconsequential.
We now consider Bayesian statistics in action and give some motivation of why MCMC methods have been so successful. Consider a parametric model with parameter(s) $\theta$. Let $\pi(\theta)$ denote the pdf (probability density function) of the prior distribution of $\theta$. Let $\mathbf{x}$ denote the observed data and let $f(\mathbf{x}, \theta)$ denote the joint pdf of the data and the parameters. Then by Bayes Theorem,
\[ f(\mathbf{x}, \theta) = f(\mathbf{x} \,|\, \theta) \pi(\theta). \]
We are interested in the posterior distribution of $\theta$, that is, the distribution of $\theta$ given the data $\mathbf{x}$, ie. $\pi(\theta \,|\, \mathbf{x})$. Therefore
\[ \pi(\theta \,|\, \mathbf{x}) = \frac{f(\mathbf{x}, \theta)}{f(\mathbf{x})} = \frac{f(\mathbf{x} \,|\, \theta) \pi(\theta)}{f(\mathbf{x})}. \]
Since the likelihood $L(\theta; \mathbf{x}) = f(\mathbf{x} \,|\, \theta)$, we have that
\[ \pi(\theta \,|\, \mathbf{x}) \propto L(\theta; \mathbf{x}) \pi(\theta). \qquad (2.2) \]
(Note that $f(\mathbf{x})$ does not depend upon $\theta$.) The likelihood and prior are often relatively easy to obtain, and in such circumstances we can find the posterior distribution up to a constant of proportionality. That is,
\[ \pi(\theta \,|\, \mathbf{x}) = \frac{L(\theta; \mathbf{x}) \pi(\theta)}{C}, \]
where $C = f(\mathbf{x}) = \int L(\theta; \mathbf{x}) \pi(\theta) \, d\theta$ needs to be computed. The computation of $C$ is often very difficult, if at all analytically possible. This has been a major stumbling block for Bayesian statistics. Fortunately, MCMC techniques can be applied directly to (2.2) to obtain samples from the posterior distribution, thus circumventing the need to compute $C$. In MATH553 you will see alternatives to MCMC for getting around the problem of computing $C$.
Gaussian Example.
This simple example produces a nice analytical answer.
Suppose that we have independent and identically distributed data according to a random variable $X \sim N(\mu, \sigma^2)$, where $\mu$ is an unknown parameter and $\sigma^2$ is known. We want to estimate $\mu$.
Prior distribution: Suppose that we know nothing about $\mu$ and take $\pi(\mu) \propto 1$, ie. $\pi(\mu) = c$ for all $\mu \in \mathbb{R}$. This is an improper prior since it does not integrate to 1 over the possible values of $\mu$. However, often (but not always) improper prior distributions give rise to proper posterior distributions, so this is not a problem.
Suppose that $x_1, \ldots, x_n$ are independent realisations of $X$.
Then
\[ L(\mu; \mathbf{x}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right) \]
and
\[ \pi(\mu \,|\, \mathbf{x}) \propto \pi(\mu) L(\mu; \mathbf{x}) \propto \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \right). \]
Therefore
\[ \pi(\mu \,|\, \mathbf{x}) \propto \exp\left( -\frac{n}{2\sigma^2} (\mu - \bar{x})^2 \right), \]
where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
A very useful result. Let $\theta$ have probability density function $\pi(\theta)$. Then if
\[ \pi(\theta) \propto \exp\left( -\frac{1}{2} \left( a\theta^2 - 2b\theta \right) \right), \]
then $\theta \sim N(b/a, 1/a)$.
Thus
\[ \mu \,|\, \mathbf{x} \sim N\left( \bar{x}, \frac{\sigma^2}{n} \right). \]
Therefore the mean of $\mu \,|\, \mathbf{x}$ is the sample mean $\bar{x}$ (which is the MLE). Also we have that the variance is $\sigma^2/n$, which agrees with the variance of the MLE. In other words, as the sample size grows, we obtain tighter and tighter estimates of $\mu$.
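A quick numerical check of this result, with made-up data: with a flat prior and known $\sigma^2$, draws from the posterior $N(\bar{x}, \sigma^2/n)$ concentrate around the sample mean.

# Posterior for the Gaussian mean, flat prior, known sigma^2.
set.seed(1)
sigma <- 2; n <- 50
x <- rnorm(n, mean = 5, sd = sigma)                 # simulated data, true mu = 5
mu_draws <- rnorm(10000, mean(x), sigma / sqrt(n))  # draws from mu | x
c(mean(mu_draws), mean(x))                          # posterior mean = sample mean (MLE)
c(var(mu_draws), sigma^2 / n)                       # posterior variance = sigma^2 / n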
Conjugate priors
The choice of prior plays an important part in Bayesian statistics. There are two key features: how informative the priors are, and conjugacy. We illustrate this idea with an extension of the Gaussian example above to the case where both the mean and variance are unknown. The choice of priors, and in particular conjugacy, will be covered in detail in MATH553.
For $i = 1, \ldots, n$, let $X_i \sim N(\mu, \sigma^2)$. Suppose also that we have the following prior distributions for $\mu$ and $\sigma^2$:
\[ \mu \sim N(\mu_0, \sigma_0^2) \quad \text{and} \quad \tau = 1/\sigma^2 \sim \text{Gamma}(\alpha, \beta), \]
where $\mu$ and $\tau$ are assumed to be a priori independent and $(\mu_0, \sigma_0^2)$ and $(\alpha, \beta)$ are assumed to be known hyperparameters. Hyperparameters are simply the parameters of the prior distributions of the model parameters. Letting $\tau = 1/\sigma^2$ ($\tau$ is known as the precision and is the inverse of the variance), we have that $\pi(\mu, \tau \,|\, \mathbf{x}) \propto L(\mu, \tau; \mathbf{x}) \pi(\mu) \pi(\tau)$, so
\[ \pi(\mu, \tau \,|\, \mathbf{x}) \propto \tau^{n/2} \exp\left( -\frac{\tau}{2} \sum_{i=1}^{n} (x_i - \mu)^2 \right) \exp\left( -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right) \tau^{\alpha - 1} e^{-\beta\tau}. \]
Therefore, although the form of the prior distribution is highly tractable, the posterior distribution is a complex two dimensional distribution. This is often the case for multidimensional parameter sets.
However, if we consider the conditional distributions of each of the parameters in turn, these turn out to be rather nice. We exploit the following two observations which are useful for deriving conditional distributions:
Suppose that the random variable $X$ has a pdf of the form,
\[ f(x) \propto \prod_{j=1}^{J} \exp\left( -\frac{\tau_j}{2} (x - m_j)^2 \right), \]
then
\[ X \sim N\left( \frac{\sum_{j=1}^{J} \tau_j m_j}{\sum_{j=1}^{J} \tau_j}, \frac{1}{\sum_{j=1}^{J} \tau_j} \right). \]
That is, if the pdf of $X$ is composed as the product of Normal densities then $X$ has a Normal density with more weight given to those components with larger precisions (smaller variances).
Suppose that the random variable $X$ has a pdf of the form,
\[ f(x) \propto x^{a-1} e^{-bx}, \qquad x > 0, \]
then $X \sim \text{Gamma}(a, b)$.
Firstly, to consider the conditional distribution of $\mu$, we only need to focus upon terms involving $\mu$. Therefore
\[ \pi(\mu \,|\, \tau, \mathbf{x}) \propto \exp\left( -\frac{\tau}{2} \sum_{i=1}^{n} (x_i - \mu)^2 \right) \exp\left( -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right), \]
giving
\[ \mu \,|\, \tau, \mathbf{x} \sim N\left( \frac{n\tau\bar{x} + \mu_0/\sigma_0^2}{n\tau + 1/\sigma_0^2}, \frac{1}{n\tau + 1/\sigma_0^2} \right). \]
Similarly, for $\tau$ we only need to focus upon terms involving $\tau$, giving
\[ \tau \,|\, \mu, \mathbf{x} \sim \text{Gamma}\left( \alpha + \frac{n}{2}, \beta + \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^2 \right). \]
Thus the conditional distributions of each of the parameters have a nice simple form. This is called conditional conjugacy.
Therefore we will proceed by introducing the Gibbs sampler, which is applicable where the conditional distributions of each parameter given the other parameters are known, but the full joint distribution of the parameters is not known explicitly.
Suppose that we wish to obtain a sample from the multivariate posterior distribution $\pi(\theta \,|\, \mathbf{x})$, where $\theta = (\theta_1, \ldots, \theta_p)$ denotes the parameters of the model and $\mathbf{x}$ denotes the observed data. The Gibbs sampler does this by successively and repeatedly simulating from the conditional distributions of each component given the other components. This procedure is particularly useful where we have conditional conjugacy, so that the resulting conditional distributions are from standard distributions.
Gibbs Sampler algorithm
Initialise with $\theta^{(0)} = (\theta_1^{(0)}, \ldots, \theta_p^{(0)})$.
For $t = 1, \ldots, N$,
Simulate $\theta_1^{(t)}$ from the conditional $\pi(\theta_1 \,|\, \theta_2^{(t-1)}, \ldots, \theta_p^{(t-1)}, \mathbf{x})$.
Simulate $\theta_2^{(t)}$ from the conditional $\pi(\theta_2 \,|\, \theta_1^{(t)}, \theta_3^{(t-1)}, \ldots, \theta_p^{(t-1)}, \mathbf{x})$.
Simulate $\theta_p^{(t)}$ from the conditional $\pi(\theta_p \,|\, \theta_1^{(t)}, \ldots, \theta_{p-1}^{(t)}, \mathbf{x})$.
Discard the first $M$ iterations and estimate summary statistics of the posterior distribution using $\theta^{(M+1)}, \ldots, \theta^{(N)}$.
How does it work?
Suppose that $\theta^{(t)}$ comes from $\pi(\theta \,|\, \mathbf{x})$. Then for $j = 1, \ldots, p$, updating the $j$th component involves drawing a new value of $\theta_j$ from $\pi(\theta_j \,|\, \theta_{-j}, \mathbf{x})$, where $\theta_{-j}$ denotes the parameters other than $\theta_j$. Thus the updated set of parameters, with $\theta_j$ replaced by the new value, also comes from the posterior distribution of $\theta$.
Thus, provided $\theta^{(0)}$ is drawn from the posterior distribution of $\theta$, $\theta^{(1)}, \theta^{(2)}, \ldots$ are samples from the posterior distribution. Note that we have a dependent sample.
We have to specify initial values for $\theta^{(0)}$. If $\theta^{(0)}$ were drawn from the stationary distribution of the Markov chain (the distribution we are interested in), then $\theta^{(1)}, \theta^{(2)}, \ldots$ would be realisations from the stationary distribution, $\pi(\theta \,|\, \mathbf{x})$. However, we don't know what the (stationary) distribution is. (If we knew the distribution, there would be no need for MCMC to simulate from it!) Thus typically $\theta^{(0)}$ will not be drawn from the stationary distribution. As $t$ increases, $\theta^{(t)}$ increasingly forgets the initial value $\theta^{(0)}$, and consequently, for large $t$, $\theta^{(t)}$ is approximately from the stationary distribution. Hence $\theta^{(M+1)}, \ldots, \theta^{(N)}$ are approximately from the stationary distribution and thus are considered a sample thereof. In some circumstances it is possible to use perfect simulation, in which case $\theta^{(0)}$ is drawn from the stationary distribution even though the stationary distribution is unknown. Perfect simulation is not covered in this course and is only viable in some special cases.
We recap the normal example given in Section 2.4. Suppose that $x_1, \ldots, x_n$ are independent and identically distributed according to $N(\mu, 1/\tau)$, where $\mu$ and $\tau$ are unknown parameters to be estimated. The following prior distributions are assigned to $\mu$ and $\tau$:-
\[ \mu \sim N(\mu_0, \sigma_0^2) \quad \text{and} \quad \tau \sim \text{Gamma}(\alpha, \beta). \]
We have the following Gibbs sampler algorithm for obtaining a sample of size $N$ from the joint distribution $\pi(\mu, \tau \,|\, \mathbf{x})$.
Initialise with $(\mu^{(0)}, \tau^{(0)})$. Any values of $\mu^{(0)}$ and $\tau^{(0)}$ will be OK. However, a reasonable start would be the sample mean and the inverse of the sample variance. That is, $\mu^{(0)} = \bar{x}$ and $\tau^{(0)} = 1/s^2$, where $s^2$ is the sample variance.
For $t = 1, \ldots, N$:-
Simulate $\mu^{(t)} \sim N\left( \dfrac{n\tau^{(t-1)}\bar{x} + \mu_0/\sigma_0^2}{n\tau^{(t-1)} + 1/\sigma_0^2}, \dfrac{1}{n\tau^{(t-1)} + 1/\sigma_0^2} \right)$.
Simulate $\tau^{(t)} \sim \text{Gamma}\left( \alpha + \frac{n}{2}, \beta + \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu^{(t)})^2 \right)$.
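Before looking at the output, here is a sketch of this sampler in R, under the prior notation assumed above ($\mu_0$, $\sigma_0^2$, $\alpha$, $\beta$); it is an illustration, not the provided course code.

# Gibbs sampler for the normal model: mu ~ N(mu0, sigma0sq), tau ~ Gamma(alpha, beta).
gibbs_normal <- function(x, N, mu0 = 0, sigma0sq = 100, alpha = 1, beta = 1) {
  n <- length(x); xbar <- mean(x)
  mu <- numeric(N); tau <- numeric(N)
  mu_t <- xbar; tau_t <- 1 / var(x)        # reasonable starting values
  for (t in 1:N) {
    prec <- n * tau_t + 1 / sigma0sq       # posterior precision for mu
    mu_t <- rnorm(1, (n * tau_t * xbar + mu0 / sigma0sq) / prec, sqrt(1 / prec))
    tau_t <- rgamma(1, alpha + n / 2, beta + 0.5 * sum((x - mu_t)^2))
    mu[t] <- mu_t; tau[t] <- tau_t
  }
  list(mu = mu, tau = tau)
}
set.seed(1)
out <- gibbs_normal(rnorm(100, 5, 2), N = 1100)
mean(out$mu[101:1100])   # posterior mean of mu after a burn-in of 100 iterations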
The Gibbs sampler was run to produce a sample of size 1100 from the joint distribution of $\mu$ and $\tau$. We discard the first 100 samples as burn-in, using the remaining 1000 samples to estimate $\mu$ and $\tau$.
Verification
We will show that $\pi(\theta \,|\, \mathbf{x})$ is indeed the stationary distribution of the Gibbs sampler in the case where $p = 2$. The proof for $p > 2$ is similar but more long-winded.
Let $\theta = (\theta_1, \theta_2)$ and let $\pi_1(\theta_1 \,|\, \theta_2)$ and $\pi_2(\theta_2 \,|\, \theta_1)$ denote the conditional distributions of $\theta_1$ and $\theta_2$, respectively, suppressing the conditioning on $\mathbf{x}$ throughout. Suppose that $\theta^{(0)} = (\theta_1^{(0)}, \theta_2^{(0)})$ is drawn from $\pi(\cdot)$. Then the pdf of $\theta^{(1)}$ given $\theta^{(0)}$ is
\[ \pi_1(\theta_1^{(1)} \,|\, \theta_2^{(0)}) \, \pi_2(\theta_2^{(1)} \,|\, \theta_1^{(1)}). \]
We want to show that the pdf of $\theta^{(1)}$ is equal to $\pi(\theta^{(1)})$.
For any set $A$, using $\int \pi(\theta_1^{(0)}, \theta_2^{(0)}) \, d\theta_1^{(0)} = \pi(\theta_2^{(0)})$ and $\pi(\theta_2^{(0)}) \, \pi_1(\theta_1^{(1)} \,|\, \theta_2^{(0)}) = \pi(\theta_1^{(1)}, \theta_2^{(0)})$, we have
\[ P(\theta^{(1)} \in A) = \int_A \left\{ \iint \pi(\theta_1^{(0)}, \theta_2^{(0)}) \, \pi_1(\theta_1^{(1)} \,|\, \theta_2^{(0)}) \, \pi_2(\theta_2^{(1)} \,|\, \theta_1^{(1)}) \, d\theta_1^{(0)} \, d\theta_2^{(0)} \right\} d\theta^{(1)} \]
\[ = \int_A \pi_2(\theta_2^{(1)} \,|\, \theta_1^{(1)}) \left\{ \int \pi(\theta_1^{(1)}, \theta_2^{(0)}) \, d\theta_2^{(0)} \right\} d\theta^{(1)} = \int_A \pi_2(\theta_2^{(1)} \,|\, \theta_1^{(1)}) \, \pi(\theta_1^{(1)}) \, d\theta^{(1)} = \int_A \pi(\theta^{(1)}) \, d\theta^{(1)}, \]
as required.
Observations
We make a couple of minor observations about the Gibbs sampler and its output which are very useful in practice.
Firstly, suppose that we are interested in the marginal distribution of one or more of the parameters, for example $\theta_1$. Then $\theta_1^{(M+1)}, \ldots, \theta_1^{(N)}$ represents a sample from $\pi(\theta_1 \,|\, \mathbf{x})$. That is, we automatically get samples from the marginal distributions of parameters from the Gibbs sampler.
Secondly, whilst it will often be the case that univariate conditional distributions are used to construct the Gibbs sampler, it is possible to update more than one parameter together in a block by utilising multi-dimensional conditional distributions, that is, updating a subset of the parameters conditional upon the data and the remaining parameters. This is particularly useful in regression and similar examples which give rise to multivariate Gaussian distributions as conditional distributions.
We shall use this example to further illustrate the Gibbs sampler in action. The data to be analysed are obtained from Jarrett (1979) and concern the number of coal mining disasters per year over the period 1851-1962.
A plot of the data is given in the figure below, and suggests that there has been a reduction in the number of disasters per year over the period. The following model was suggested by Carlin et al. (1992): for $i = 1, \ldots, n$ (with $n = 112$ years of data),
\[ X_i \sim \begin{cases} \text{Po}(\lambda), & i = 1, \ldots, k, \\ \text{Po}(\mu), & i = k+1, \ldots, n. \end{cases} \]
That is, the number of disasters per year is Poisson distributed. Moreover, the first $k$ years have a common mean $\lambda$ and the remaining $n - k$ years have a common mean $\mu$. This is a standard change point problem, in that there is a behavioural change in the model (here a change in mean) at an unknown point in time.
Unnumbered Figure: number of coal mining disasters per year, 1851-1962.
We complete the specification of the model by assuming the following prior structure. Let $\lambda \sim \text{Gamma}(\alpha, \beta)$, $\mu \sim \text{Gamma}(\gamma, \delta)$ and $k$ discrete uniform over $\{1, 2, \ldots, 112\}$, each independent of one another (ie. the parameters are a priori independent), with known hyperparameters $\alpha$, $\beta$, $\gamma$ and $\delta$. More complicated hierarchical models have previously been used to analyse these data; see Carlin et al. (1992).
Therefore
\[ \pi(\lambda, \mu, k \,|\, \mathbf{x}) \propto L(\lambda, \mu, k; \mathbf{x}) \, \pi(\lambda) \, \pi(\mu) \, \pi(k) \qquad (2.3) \]
\[ \propto \left\{ \prod_{i=1}^{k} e^{-\lambda} \lambda^{x_i} \right\} \left\{ \prod_{i=k+1}^{n} e^{-\mu} \mu^{x_i} \right\} \lambda^{\alpha-1} e^{-\beta\lambda} \, \mu^{\gamma-1} e^{-\delta\mu}, \qquad (2.4) \]
where $S_k = \sum_{i=1}^{k} x_i$ and $\tilde{S}_k = \sum_{i=k+1}^{n} x_i$, and
\[ \pi(\lambda, \mu, k \,|\, \mathbf{x}) \propto \lambda^{\alpha + S_k - 1} e^{-(\beta + k)\lambda} \, \mu^{\gamma + \tilde{S}_k - 1} e^{-(\delta + n - k)\mu}. \qquad (2.5) \]
The conditional distributions of $\lambda$, $\mu$ and $k$ can be obtained from (2.3)-(2.5) by focussing upon only those terms involving the parameter of interest. Note that $k$ takes discrete values in the range $\{1, 2, \ldots, 112\}$.
The conditional distributions are given as follows:
\[ \lambda \,|\, \mu, k, \mathbf{x} \sim \text{Gamma}(\alpha + S_k, \beta + k), \]
\[ \mu \,|\, \lambda, k, \mathbf{x} \sim \text{Gamma}(\gamma + \tilde{S}_k, \delta + n - k). \]
Using (2.4) and (2.5), we have that
\[ \pi(k \,|\, \lambda, \mu, \mathbf{x}) \propto \lambda^{S_k} e^{-k\lambda} \, \mu^{\tilde{S}_k} e^{-(n-k)\mu}. \qquad (2.9) \]
Now $k$ does not have a nice conditional distribution, by which we mean an easily recognisable, well known distribution. However, $\pi(k \,|\, \lambda, \mu, \mathbf{x})$ is a discrete distribution and there are only 112 possible values. Therefore we can compute the normalising constant directly to give
\[ \pi(k \,|\, \lambda, \mu, \mathbf{x}) = \frac{\lambda^{S_k} e^{-k\lambda} \, \mu^{\tilde{S}_k} e^{-(n-k)\mu}}{\sum_{j=1}^{n} \lambda^{S_j} e^{-j\lambda} \, \mu^{\tilde{S}_j} e^{-(n-j)\mu}}. \qquad (2.10) \]
Sampling from the conditional distribution is straightforward in R using the sample command.
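For example, a sketch of the update for $k$; lam, mu and x denote the current parameter values and the data, and the names are illustrative. Working with log weights avoids numerical underflow in $\lambda^{S_k}$.

# One Gibbs update of k from its discrete conditional (2.10).
update_k <- function(lam, mu, x) {
  n <- length(x)
  S <- cumsum(x)                        # S_k = x_1 + ... + x_k
  logw <- S * log(lam) - (1:n) * lam +
          (S[n] - S) * log(mu) - (n - 1:n) * mu
  w <- exp(logw - max(logw))            # unnormalised conditional probabilities
  sample(1:n, size = 1, prob = w)       # sample() renormalises prob internally
}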
The Gibbs sampler was applied to the coal mining data set and run for 1100 iterations to obtain samples from $\pi(\lambda, \mu, k \,|\, \mathbf{x})$. The hyperparameters of the priors were set to fixed values. The results of the Gibbs sampler are shown in the figures below.
Unnumbered Figures: trace plots of the Gibbs sampler output for $\lambda$, $\mu$ and $k$.
Convergence of the algorithm seems to be rapid (after compensating for a poor choice of starting values). Therefore I deleted the first 100 values and based all the analysis on the remaining 1000 iterations. In other words, I took the first 100 iterations as burn-in.
A histogram of the posterior distribution of the change point year is given below. There is very strong evidence to suggest that the change point occurs about the posterior mode, which corresponds to the year 1891.
Unnumbered Figure: histogram of the posterior distribution of the change point year.
It is clear from the trace plots above and kernel density estimates for the posterior distributions that $\mu$ is (almost certainly) less than $\lambda$. This corresponds to fewer disasters, and hence a safer working environment, after the change point. Note that any summary statistic or distributional quantity related to the joint distribution of $\lambda$ and $\mu$, such as the distribution of $\lambda - \mu$, can be estimated using the output from the Gibbs sampler. This is simply done by looking at each pair of realisations $(\lambda^{(t)}, \mu^{(t)})$, giving $\lambda^{(t)} - \mu^{(t)}$ as the $t$th realisation from the posterior distribution of the difference.
We can easily sample from the predictive distribution. Let $x_{n+1}$ denote a future observation. Future observations are iid according to $\text{Po}(\mu)$, and $x_{n+1}$ does not depend upon $\mathbf{x}$, $\lambda$ or $k$ given $\mu$. Therefore the predictive distribution of $x_{n+1}$ given $\mathbf{x}$ satisfies
\[ \pi(x_{n+1} = y \,|\, \mathbf{x}) = \int \pi(x_{n+1} = y \,|\, \mu) \, \pi(\mu \,|\, \mathbf{x}) \, d\mu. \]
Samples from $\pi(\mu \,|\, \mathbf{x})$ are given by the Gibbs sampler in the form of $\mu^{(M+1)}, \ldots, \mu^{(N)}$. Therefore we can estimate $\pi(x_{n+1} = y \,|\, \mathbf{x})$ using the Monte Carlo estimate
\[ \frac{1}{N-M} \sum_{t=M+1}^{N} P(x_{n+1} = y \,|\, \mu^{(t)}) = \frac{1}{N-M} \sum_{t=M+1}^{N} \frac{e^{-\mu^{(t)}} (\mu^{(t)})^y}{y!}, \qquad (2.11) \]
or simply simulate $x_{n+1}^{(t)} \sim \text{Po}(\mu^{(t)})$ for $t = M+1, \ldots, N$ and estimate $\pi(x_{n+1} = y \,|\, \mathbf{x})$ by
\[ \frac{1}{N-M} \sum_{t=M+1}^{N} 1_{\{x_{n+1}^{(t)} = y\}}. \qquad (2.12) \]
The estimate in (2.11) has smaller variance, ie. is better. However, (2.12) is easier to compute and only requires the ability to simulate from the Poisson distribution.
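The following sketch compares the two estimators; mu_draws stands in for the post burn-in Gibbs samples of $\mu$ (here replaced by made-up Gamma draws so that the snippet is self-contained).

# Comparing the predictive estimators (2.11) and (2.12).
set.seed(1)
mu_draws <- rgamma(1000, 90, 100)   # stand-in for the Gibbs output for mu
y <- 0:5
p_rao <- sapply(y, function(k) mean(dpois(k, mu_draws)))  # estimator (2.11)
x_new <- rpois(length(mu_draws), mu_draws)                # simulate x_{n+1}
p_sim <- sapply(y, function(k) mean(x_new == k))          # estimator (2.12)
rbind(p_rao, p_sim)   # both estimate P(x_{n+1} = y | x); (2.11) is smoother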
For the coal mining data set, we can use the estimator to predict the number of disasters in future years (Poisson with rate $\mu$). A graph of the predictive distribution, alongside the corresponding estimates based on the posterior mean of $\mu$, is given for a sample of 1000 realisations in the figure below. Since there is limited variation in the posterior distribution of $\mu$, the two distributions are virtually identical. However, in general the predictive distribution will demonstrate greater variability, owing to the uncertainty in $\mu$.
Predictive and estimative (based on the posterior mean of $\mu$) distributions of the number of coal mining disasters per year (solid line is predictive).
R code for applying the Gibbs sampler to generate samples from a bivariate normal distribution and the outline of a Gibbs sampler algorithm for the coal mining data.
Bivariate Normal
The program biv (in biv normR.R) generates values, using the Gibbs sampler,
from the bivariate normal distribution:-
Can you modify the code to produce samples from the bivariate normal distribution:-
given that
Coal Mining Example
Complete the Gibbs sampling algorithm for the coal mining data (in coal outlineR.R).
Remember:-
\[ \lambda \,|\, \mu, k, \mathbf{x} \sim \text{Gamma}(\alpha + S_k, \beta + k), \qquad \mu \,|\, \lambda, k, \mathbf{x} \sim \text{Gamma}(\gamma + \tilde{S}_k, \delta + n - k), \]
where $S_k = \sum_{i=1}^{k} x_i$ and $\tilde{S}_k = \sum_{i=k+1}^{n} x_i$.
Hint: To sample $k$ from its conditional distribution (2.10), use the sample command in R.
Apply the algorithm to the coal mining data.
The aim is to analyse the dataset in the file labhills.txt. The file contains record times (in minutes) for 35 Scottish hill races, together with the length of the race (in miles) and elevation (in feet). The objective is to model the winning times in terms of distances and climbs of the races.
Examine the data file
> hills <- read.table("labhills.txt")
> names(hills)
Now look at the data frame called hills; what are the names of the variables in the data frame? A data frame is just a matrix containing data, and each element in the matrix can be accessed as one usually does with matrices. To refer to the first variable in the data frame (that is, the first column in the data frame), you can either say hills[,1] or call it by its name, hills$dist. Similarly for the other variables in the data frame. If you want to see the entire data frame do
> hills
Next look at the file gibbs_normalR.r. This contains a Gibbs sampler for the normal model that was described in the lectures.
Check that you understand what the code is doing. (There is code at the bottom of the file to simulate data and run the algorithm.) Apply the Gibbs sampler to the times of the hill races with the given hyperparameter values. Why is the prior value for $\tau$ so small? Hint: look at var(y).
The first aim in this lab is for you to extend and alter the code to produce a Gibbs sampler for a linear regression model.
A good starting point is Naismithβs rule, which is used to calculate the length of time a hillwalk should take. You divide the total distance by your average speed on the flat and then add on an allowance for each 100ft of ascent. The form of such a model is:
\[ t_i = \beta_1 d_i + \beta_2 c_i, \qquad (2.13) \]
where $t_i$, $d_i$, $c_i$ respectively denote the time, distance and climb for the $i$th race, and $\beta = (\beta_1, \beta_2)$ is a parameter vector. Allowing for a Normal error distribution, we obtain that the observations are independent, distributed according to
\[ t_i \sim N(\beta_1 d_i + \beta_2 c_i, 1/\tau). \qquad (2.14) \]
To construct a Bayesian model, you also need to specify a prior distribution for the parameters $\beta$ and $\tau$. Take these parameters independent a priori with
\[ \beta \sim N_2\left( \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix} \right), \qquad \tau \sim \text{Gamma}(\alpha, \gamma). \qquad (2.15) \]
In other words, the prior for $\beta$ is bi-variate Normal with mean vector $(b_1, b_2)^T$ and a diagonal covariance matrix. Since the covariance matrix is diagonal, $\beta_1$ and $\beta_2$ are independent a priori. $\tau$ is assigned a Gamma prior distribution.
Write down, up to a proportionality constant, the joint posterior density of $(\beta_1, \beta_2, \tau)$.
Write down, up to a proportionality constant, the posterior density of $\tau$, conditioning on $(\beta_1, \beta_2)$; that is, $\pi(\tau \,|\, \beta_1, \beta_2, \mathbf{t})$.
Thus obtain the conditional distribution, $\tau \,|\, \beta_1, \beta_2, \mathbf{t}$.
Write down, up to a proportionality constant, the posterior density of $\beta_1$, conditioning on $\beta_2$ and $\tau$; that is, $\pi(\beta_1 \,|\, \beta_2, \tau, \mathbf{t})$.
Thus obtain the conditional distribution, $\beta_1 \,|\, \beta_2, \tau, \mathbf{t}$.
Write down, up to a proportionality constant, the posterior density of $\beta_2$, conditioning on $\beta_1$ and $\tau$; that is, $\pi(\beta_2 \,|\, \beta_1, \tau, \mathbf{t})$.
Thus obtain the conditional distribution, $\beta_2 \,|\, \beta_1, \tau, \mathbf{t}$.
Using the conditional posterior distributions in the previous question, write down on paper a Gibbs sampling algorithm to compute the posterior distribution of $(\beta_1, \beta_2, \tau)$ via simulation.
Copy, rename and alter the function in gibbs_normalR.r to create your Gibbs sampler in R.
Choose hyperparameters for the prior distribution in such a way that the prior does not contain a lot of information (ie. choose moderately large prior variances for the parameters, but do not choose a shape parameter for the gamma prior below 1). Also, in the absence of prior information to the contrary, it seems natural to choose $b_1 = 0$ and $b_2 = 0$ in the prior distribution for $\beta$.
Run the program for the hill races dataset.
Choose a suitable burn-in period and a suitable number of draws to be used for inference.
Choosing 5000 draws and discarding the first 1000 as burn-in is reasonable.
Display histograms of the marginal posterior distributions of $\beta_1$, $\beta_2$ and $\tau$.
Compute the following numerical summaries of their posterior distributions: minimum, 1st quartile, median, mean, 3rd quartile, maximum.
Display a scatterplot of the joint posterior distribution of $(\beta_1, \beta_2)$.
What are your conclusions in terms of the effect of distance and climb on the winning time of races?
Some of the observations may be regarded as outliers, i.e. not fitted well by the assumed model. A way to try to detect outliers is to see whether any of the observations is particularly far from the value predicted by the model. For a race of distance $d$ and climb $c$, a natural prediction of its winning time would be
\[ \hat{t} = \hat{\beta}_1 d + \hat{\beta}_2 c, \]
where $\hat{\beta}_1$ and $\hat{\beta}_2$ are the posterior means of $\beta_1$ and $\beta_2$.
For each of the races, compute the residual $t_i - \hat{t}_i$ and plot the resulting numbers. Are there any races that appear out of line?
Block update in Gibbs sampler
The above Gibbs sampler alternates between:
Updating $\tau \,|\, \beta_1, \beta_2, \mathbf{t}$.
Updating $\beta_1 \,|\, \beta_2, \tau, \mathbf{t}$.
Updating $\beta_2 \,|\, \beta_1, \tau, \mathbf{t}$.
This algorithm works well on the above data set. However, a more efficient Gibbs sampler can be obtained by updating $\beta = (\beta_1, \beta_2)$ as a block. That is, a Gibbs sampler which alternates between:
Updating $\tau \,|\, \beta, \mathbf{t}$.
Updating $\beta \,|\, \tau, \mathbf{t}$.
The update of $\tau$ is identical in both algorithms. For the block update of $\beta$, it is fairly straightforward but algebraically tedious to show that:
\[ \beta \,|\, \tau, \mathbf{t} \sim N_2(m^\ast, V^\ast), \]
where the prior on $\beta$ is $N_2(m, V)$ and
\[ V^\ast = \left( V^{-1} + \tau X^T X \right)^{-1}, \qquad m^\ast = V^\ast \left( V^{-1} m + \tau X^T \mathbf{t} \right), \]
with $X$ denoting the design matrix whose $i$th row is $(d_i, c_i)$.
The parameters of the conditional posterior distribution of $\beta$ are a weighted average of the prior and the data. Note the similarity to the conditional posterior distribution of $\mu$ in the univariate Gaussian example. A sketch of this block update is given below.
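A sketch of the block update in R, under the conditional given above; X is the design matrix with rows $(d_i, c_i)$, and all names are illustrative.

# One block draw of beta | tau, t ~ N2(m_star, V_star), prior beta ~ N2(m, V).
draw_beta <- function(tau, X, t_obs, m, V) {
  V_star <- solve(solve(V) + tau * crossprod(X))               # posterior covariance
  m_star <- V_star %*% (solve(V) %*% m + tau * t(X) %*% t_obs) # posterior mean
  as.vector(m_star + t(chol(V_star)) %*% rnorm(2))             # one bivariate draw
}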
Note: I won't expect you to compute multivariate conditional distributions for coursework or the exam.
Using detailed balance, construct a transition matrix $P$ for a Markov chain with a given stationary distribution $\pi$.
Remember for detailed balance, for all $i$ and $j$:-
\[ \pi_i p_{ij} = \pi_j p_{ji}. \]
Let $x$ denote an observation from $X \sim \text{Bin}(N, \theta)$, where both $N$ and $\theta$ are unknown parameters.
Suppose that $\text{Beta}(a, b)$ and discrete uniform on $\{1, 2, \ldots, M\}$ priors are assigned to $\theta$ and $N$, respectively. That is,
\[ \pi(\theta) \propto \theta^{a-1} (1-\theta)^{b-1}, \quad 0 < \theta < 1, \qquad \pi(N) = \frac{1}{M}, \quad N \in \{1, 2, \ldots, M\}. \]
Write down the likelihood $\pi(x \,|\, N, \theta)$.
Write down, up to a constant of proportionality, the joint posterior distribution of $N$ and $\theta$, $\pi(N, \theta \,|\, x)$.
Find the conditional distribution of $\theta$ given $N$ and $x$, $\pi(\theta \,|\, N, x)$.
Find the conditional distribution of $N$ given $\theta$ and $x$, $\pi(N \,|\, \theta, x)$.
Describe a Gibbs sampler for obtaining samples from $\pi(N, \theta \,|\, x)$.
Consider a two-state discrete time Markov process (states 1 and 2). For $t = 1, 2, \ldots, n$, let $X_t$ denote the state of the Markov process at time $t$. The process is observed from time 1 to time $n$ with observed data $\mathbf{x} = (x_1, x_2, \ldots, x_n)$.
There is a (potential) change-point, $k$, in the data, in that, for $t < k$,
\[ P(X_{t+1} \neq X_t \,|\, X_t) = \alpha, \]
and for $t \geq k$,
\[ P(X_{t+1} \neq X_t \,|\, X_t) = \beta, \]
where $\alpha$ and $\beta$ are unknown parameters to be estimated. Throughout assume a $U(0,1)$ prior on $\alpha$ and $\beta$. For $k$, unless specified otherwise, denote the prior probability for $k = j$ by $p_j$.
Given the value of $k$, write down the likelihood of the parameters and compute the marginal posterior distributions of $\alpha$ and $\beta$.
Hint: Note that, regardless of the value of $k$, the distribution of $X_1$ does not depend upon $\alpha$ or $\beta$.
Suppose that the change-point is at an unknown location $k$. Find an expression for the conditional posterior distribution $\pi(k \,|\, \alpha, \beta, \mathbf{x})$.
Outline a Gibbs sampler algorithm to obtain samples from the joint posterior distribution of $(\alpha, \beta, k)$.
Tougher question
For the Gibbs sampler there is inherent dependence between successive realisations of the parameters. This is generally the case for all MCMC algorithms. The dependence between parameters can have an important impact on the convergence/performance of the MCMC chain (the chain of realisations of the MCMC output) to the posterior distribution.
Bivariate Normal
Suppose that $(X, Y)$ is a bivariate normal with standard normals as marginal distributions for $X$ and $Y$, and correlation $\rho$. (This is the bivariate normal example for which the Gibbs sampler code is provided.)
Then $(X, Y)$ has pdf
\[ f(x, y) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left( -\frac{x^2 - 2\rho x y + y^2}{2(1-\rho^2)} \right). \]
That is, $X$ and $Y$ have conditional distributions,
\[ X \,|\, Y = y \sim N(\rho y, 1 - \rho^2) \qquad (2.16) \]
and
\[ Y \,|\, X = x \sim N(\rho x, 1 - \rho^2). \qquad (2.17) \]
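For reference, a minimal sketch of this Gibbs sampler, with a made-up value of $\rho$; this is an illustration, not the provided biv normR.R code.

# Gibbs sampler alternating between (2.16) and (2.17).
set.seed(1)
rho <- 0.8; N <- 5000
x <- numeric(N); y <- numeric(N)
x_t <- 0; y_t <- 0
for (t in 1:N) {
  x_t <- rnorm(1, rho * y_t, sqrt(1 - rho^2))  # draw X | Y = y_t from (2.16)
  y_t <- rnorm(1, rho * x_t, sqrt(1 - rho^2))  # draw Y | X = x_t from (2.17)
  x[t] <- x_t; y[t] <- y_t
}
cor(x, y)  # close to rho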
Let $(x_t, y_t)$ denote the value of $(X, Y)$ obtained from the $t$th iteration of the Gibbs sampler, which alternates between (2.16) and (2.17).
Find the distribution of $x_{t+1}$ given that $x_t = x$.
Find the distribution of $x_{t+k}$ given that $x_t = x$.
Find the correlation between $x_t$ and $x_{t+k}$, given that the chain is in stationarity.