Markov Chain Monte Carlo, often referred to simply as MCMC, has probably been the single most important advance in statistics over the last 25 years or so. MCMC has its origins in Metropolis et al. (1953) and Hastings (1970). However, it was only in the late 1980s, with the seminal paper of Gelfand and Smith (1990), that the MCMC idea began to invade mainstream statistics, and a considerable amount of research has been devoted to it ever since.
MCMC utilises a Monte Carlo method based upon Markov chains (as the name suggests!). MCMC is primarily used in Bayesian statistics to provide samples from the posterior distribution.
We cover three preliminary topics before discussing MCMC. These are:-
Monte Carlo methods;
Markov chains;
Bayesian statistics (very brief overview; details in MATH553).
Monte Carlo methods essentially mean simulation. These are used primarily to evaluate integrals (for random variables) where analytical solutions are either not possible or extremely laborious.
The simplest example is the case of the expectation $E[X]$ of a random variable $X$ which has no analytical form. We can estimate the mean of $X$ by taking a sample from the distribution of $X$ and using the sample mean as an estimate of the theoretical mean. That is:-
Take a sample of size $n$ from the distribution of $X$: $x_1, x_2, \ldots, x_n$.
We assume that the observations are independent.
Then $E[X] \approx \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
For example, $X$ could represent the height (in cm) of a 5 year old child. An estimate of $E[X]$ can then be obtained by taking a class of twenty 5 year old children and measuring their heights.
This argument can be generalised. Suppose that we wish to calculate/evaluate $E[\phi(X)]$, for some function $\phi$.
Then if we take a sample $x_1, \ldots, x_n$ from the distribution of $X$, we have that
\[ \bar{\phi} = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i) \]
is an unbiased estimate of $E[\phi(X)]$. This process is exactly the same as for the expectation, in that we use the sample mean to estimate the theoretical mean. This approach is very easy to use, even for multi-dimensional distributions.
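To make this concrete, here is a minimal R sketch; the distribution and function are made-up choices for illustration. It estimates $E[\phi(X)]$ for $X \sim N(0,1)$ with $\phi(x) = x^2$, whose true value is 1.

# Monte Carlo estimate of E[phi(X)] for X ~ N(0, 1) with phi(x) = x^2.
# The true value is E[X^2] = Var(X) = 1.
set.seed(1)
n <- 10000
x <- rnorm(n)            # a sample x_1, ..., x_n from the distribution of X
phi <- function(x) x^2   # the function whose expectation we want
mean(phi(x))             # the Monte Carlo estimate; should be close to 1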
Properties of Monte Carlo integration
Unbiased: $E[\bar{\phi}] = E[\phi(X)]$.
If the variance $\sigma^2 = \text{var}(\phi(X))$ exists, then $\text{var}(\bar{\phi}) = \sigma^2 / n$.
Central limit theorem:
That is, $\sqrt{n}\,(\bar{\phi} - E[\phi(X)]) \to N(0, \sigma^2)$ in distribution as $n \to \infty$.
All the above hold provided $x_1, \ldots, x_n$ are independent. However, these results can be extended to the case where the $x_i$ are dependent, which will be the case when using MCMC.
Drawbacks of Monte Carlo integration
The variance of $\bar{\phi}$ is often large.
This problem can be circumvented to some extent by variance
reduction methods such as importance sampling.
It can be difficult to simulate from the distribution of $X$ (in an efficient manner).
This is a harder problem to deal with and this is where MCMC will be
useful. Simulating from univariate distributions (rejection
sampling, inversion of the cdf) is usually fairly straightforward
but for multivariate distributions the problem becomes a lot harder as methods such as rejection sampling perform much worse as the dimension of the problem grows.
This means a lot of work is required to get even a small sample from the distribution of interest.
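As an illustration of the univariate case, here is a minimal rejection sampling sketch in R; the target (Beta(2,2)) and proposal (Uniform(0,1)) are made-up choices, and the bound M = 1.5 is the maximum of the Beta(2,2) density 6x(1-x).

# Rejection sampling sketch: target Beta(2, 2), proposal Uniform(0, 1).
set.seed(1)
target <- function(x) 6 * x * (1 - x)   # Beta(2, 2) density
M <- 1.5                                # bounds target(x) / proposal density
rejection_sample <- function(n) {
  out <- numeric(n)
  for (i in 1:n) {
    repeat {
      y <- runif(1)                     # propose from Uniform(0, 1)
      if (runif(1) <= target(y) / M) {  # accept with probability target(y) / M
        out[i] <- y
        break
      }
    }
  }
  out
}
hist(rejection_sample(5000), freq = FALSE)  # matches the Beta(2, 2) density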
There is a lot of deep theory related to the use of Markov chains in Monte Carlo methods and we shall barely scratch the surface.
What is a Markov Chain?
A Markov chain is a stochastic process $X_0, X_1, X_2, \ldots$ which satisfies the 'memoryless' property:
\[ P(X_{n+1} = j \,|\, X_n = i, X_{n-1} = x_{n-1}, \ldots, X_0 = x_0) = P(X_{n+1} = j \,|\, X_n = i), \]
where $X_n$ denotes the state of the process after $n$ steps. Basically, the future is independent of the past, given the present state of the process. A good introduction to the theory of Markov chains is given in Grimmett and Stirzaker (1992), Section 6. We restrict attention in our discussion to discrete, time-homogeneous Markov chains, that is, for all $n$, $i$ and $j$,
\[ P(X_{n+1} = j \,|\, X_n = i) = P(X_1 = j \,|\, X_0 = i). \]
A very simple example of a Markov chain is the game Snakes and Ladders. Your future progress only depends upon your current position and not how you have arrived at your current position.
We shall illustrate the basic ideas assuming that the state space $S$ for $X_n$ is countable, although in most examples of MCMC the state space (set of possible values) of the parameters will be $\mathbb{R}^d$, which is uncountable. For all $i, j \in S$, let
\[ p_{ij}(n) = P(X_{n+1} = j \,|\, X_n = i). \]
We shall assume that the Markov chain is time homogeneous, that is, for all $n$,
\[ p_{ij}(n) = p_{ij} = P(X_1 = j \,|\, X_0 = i), \]
the probability of moving from state $i$ to state $j$ in one step. Suppose that the state space is finite. Let $P$ denote the matrix with elements $p_{ij}$. Then the $(i, j)$ element of $P^n$ gives $P(X_n = j \,|\, X_0 = i)$.
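For illustration, the following R sketch uses a made-up two-state transition matrix and computes $P^{50}$ by repeated multiplication; each row of the result approaches the stationary distribution.

# n-step transition probabilities: the (i, j) element of P^n is P(X_n = j | X_0 = i).
P <- matrix(c(0.9, 0.1,
              0.2, 0.8), nrow = 2, byrow = TRUE)
Pn <- diag(2)
for (n in 1:50) Pn <- Pn %*% P  # Pn is now P^50
Pn  # both rows are close to the stationary distribution (2/3, 1/3)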
Snakes and Ladders Example
Suppose that we are currently at square $i$ after $n$ turns, and that there is a ladder at square 40 leading to square 62 and a snake at square 42 leading to square 27. Then, rolling a standard 6-sided die, we have
\[ P(X_{n+1} = j \,|\, X_n = i) = \frac{1}{6} \]
for each of the six squares $j$ reachable from square $i$, depending upon which number we roll from 1 to 6 (a roll landing on square 40 takes us up the ladder to 62, and one landing on square 42 takes us down the snake to 27).
A Markov chain is
irreducible; if for all $i$ and $j$, there exists some $n$ such that $P(X_n = j \,|\, X_0 = i) > 0$;
(we can get from any state to any other state)
aperiodic; if, for all $i$, the greatest common divisor of $\{ n \geq 1 : P(X_n = i \,|\, X_0 = i) > 0 \}$ is 1;
(does not exhibit periodic behaviour)
recurrent; if for all $i$, $P(X_n = i \text{ for some } n \geq 1 \,|\, X_0 = i) = 1$;
(we eventually return to the starting state)
and positive recurrent; if it is irreducible, aperiodic and recurrent, and there exists a unique collection of probabilities $\{\pi_j\}$ such that for all $i$ and $j$,
\[ P(X_n = j \,|\, X_0 = i) \to \pi_j \text{ as } n \to \infty, \qquad (2.1) \]
and
\[ \sum_j \pi_j = 1. \]
The distribution $\{\pi_j\}$ is called the stationary distribution of the Markov chain. Note that (2.1) says that, regardless of the value of $X_0$, for large $n$, $P(X_n = j) \approx \pi_j$. That is, for large $n$, $X_n$ is (approximately) distributed according to the stationary distribution. How large $n$ needs to be for the approximation to be close depends upon the starting value and the rate of convergence. The rate of convergence can be calculated using probability theory, and depends upon how
$\sup_{i \in S} | P(X_n = j \,|\, X_0 = i) - \pi_j |$ behaves as $n \to \infty$, where $S$ is the state space of the parameters (often $\mathbb{R}^d$ or a subset thereof). In practice with MCMC, convergence is often 'detected' by eye without any formal procedures.
A key point is that if, for all $j$, $P(X_0 = j) = \pi_j$, then for all $n$ and $j$, $P(X_n = j) = \pi_j$. Therefore if $X_0$ is distributed according to $\pi$, then for all $n$, $X_n$ is distributed according to $\pi$ (the stationary distribution). Also $\pi = \pi P$ in matrix notation.
Example. Suppose that we have four states labeled 1, 2, 3, 4.
Transition matrix
Then $p_{ij}$ is the probability of moving from state $i$ to state $j$. That is,
\[ p_{ij} = P(X_{n+1} = j \,|\, X_n = i). \]
Note that all the rows must sum to 1, i.e. $\sum_j p_{ij} = 1$ for all $i$.
We consider different transition matrices describing the transitions from one time point to the next.
Matrix 1.
This is an example of a reducible Markov chain since we can't get from state 2 to state 3. Also the Markov chain is not recurrent since, if you start in state 3, the probability of being in state 3 at any later time does not exceed 0.5. (If you go to state 1 or 2 at the first step you will never return to state 3.) The probability of ever returning to state 3 is 0.3.
Matrix 2.
This is periodic: it returns to a given state only after an even number of steps. (It alternates between odd states, 1 and 3, and even states, 2 and 4.)
Matrix 3.
This Markov chain is positive recurrent with stationary distribution $\pi = (\pi_1, \pi_2, \pi_3, \pi_4)$ such that $\pi = \pi P$ and $\sum_j \pi_j = 1$.
For example, the stationary distribution can be found by solving these equations directly, or numerically as in the sketch below.
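Since Matrix 3 is not reproduced here, the sketch below uses a made-up 4-state transition matrix; the method, extracting the left eigenvector of $P$ with eigenvalue 1, applies to any positive recurrent finite chain.

# Stationary distribution: pi P = pi, so pi is the eigenvector of t(P) with
# eigenvalue 1, rescaled to sum to 1. P here is illustrative only.
P <- matrix(c(0.5, 0.2, 0.2, 0.1,
              0.3, 0.3, 0.2, 0.2,
              0.1, 0.4, 0.4, 0.1,
              0.2, 0.2, 0.3, 0.3), nrow = 4, byrow = TRUE)
e <- eigen(t(P))                  # eigenvalues sorted by decreasing modulus
pi_hat <- Re(e$vectors[, 1])
pi_hat <- pi_hat / sum(pi_hat)    # normalise to a probability distribution
pi_hat
pi_hat %*% P                      # equals pi_hat, confirming stationarity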
The aim of MCMC is to construct a Markov chain whose stationary distribution is the posterior distribution that we are interested in. Therefore a sample from the Markov chain (provided the Markov chain is started in stationarity) will form a sample from the posterior distribution of interest. It is important to check that the Markov chain we construct is irreducible and aperiodic, in particular the irreducibility of the Markov chain.
Detailed balance
A discrete Markov chain with transition matrix $P$ is said to satisfy detailed balance if there is a distribution $\pi$ such that for all $i$ and $j$,
\[ \pi_i p_{ij} = \pi_j p_{ji}. \]
Lemma If a Markov chain with transition matrix $P$ satisfies detailed balance with distribution $\pi$, then $\pi$ is the stationary distribution of the chain.
Proof: Suppose that $P$ satisfies detailed balance. Then for any $j$,
\[ \sum_i \pi_i p_{ij} = \sum_i \pi_j p_{ji} = \pi_j \sum_i p_{ji} = \pi_j, \]
since for all $i$, $\sum_j p_{ij} = 1$. (All rows sum to 1.) Hence $\pi$ is the stationary distribution of the Markov chain.
Checking detailed balance is often an easy way of confirming a probability distribution is the stationary distribution of a Markov chain.
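The following R sketch, with made-up values of $\pi$ and $P$, checks detailed balance numerically and confirms that $\pi$ is then stationary.

# Detailed balance check: the matrix with (i, j) element pi_i p_ij should be
# symmetric, and then pi P = pi.
pi_vec <- c(0.2, 0.3, 0.5)
P <- matrix(c(0.40, 0.30, 0.30,
              0.20, 0.40, 0.40,
              0.12, 0.24, 0.64), nrow = 3, byrow = TRUE)
flux <- pi_vec * P               # (i, j) element is pi_i p_ij
all.equal(flux, t(flux))         # TRUE: detailed balance holds
pi_vec %*% P                     # returns (0.2, 0.3, 0.5): pi is stationary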
Continuous state space Markov chains
Usually, but not always, in Bayesian statistics we want to obtain samples from a continuous distribution (or a distribution which is a mixture of continuous and discrete variables). Most of the above discussion for discrete state space Markov chains carries over to continuous state space Markov chains in very natural ways with minor modifications. In this short section, we briefly outline the key differences, and in particular the influence these will have on the MCMC discussion which follows.
Firstly, we replace the transition matrix $P$ by a transition kernel $K$. Specifically, if we let $S$ denote the set of values the Markov chain can take ($S$ will often be $\mathbb{R}^d$ or some subset of it), then for all $x \in S$ we have a density $K(x, \cdot)$ such that
\[ \int_S K(x, y) \, dy = 1, \]
and for a set $A \subseteq S$, the probability of moving from state $x$ to a state in $A$ is
\[ P(X_{n+1} \in A \,|\, X_n = x) = \int_A K(x, y) \, dy. \]
For a continuous distribution the probability of ever returning to a state is 0. (For any continuous distribution the probability of observing any single value is 0.) Therefore for recurrence of a continuous state space Markov chain, we talk in terms of returning to a set $A$. We won't go into the details, but for any $x$ we could define a set of values, $A_x$ say, close to $x$, for example within a certain distance of $x$, and look at the time taken by the Markov chain to return to $A_x$.
Secondly, we can still talk in terms of a stationary distribution $\pi$, but $\pi$ will now be a probability density function. The stationary distribution satisfies
\[ \pi(y) = \int_S \pi(x) K(x, y) \, dx, \]
where in moving from discrete distributions to continuous distributions we have replaced the summation by integration and the transition matrix by the transition kernel. Similarly, detailed balance is given by
\[ \pi(x) K(x, y) = \pi(y) K(y, x). \]
In classical (or frequentist) statistics given a parametric model, we assume that the parameters of the model are fixed, typically unknown, constants. We can then use data to make inference about the parameters, often through the likelihood function. That is, find the maximum likelihood estimate (MLE) of the parameters. We can also find standard errors and confidence intervals for the MLE. See for example the Week 6 notes on the EM algorithm.
Bayesian statistics takes its name from Thomas Bayes. Again we can assume a parametric model for the data. However, rather than assuming that the parameters are fixed constants, we now assume that they are random variables, incorporating uncertainty about the parameters. Moreover, we can make use of any prior knowledge we (or others) may have about the parameter values. This is an important feature of Bayesian statistics, the presence of a prior distribution for the parameters. These priors can be either vague or informative, reflecting how much prior knowledge we may have concerning the parameters. The posterior distribution of the parameters is what we are interested in. The posterior distribution for the parameters is dependent upon the parametric model chosen, the data and the prior distribution. From the posterior distribution, we can obtain the modal value for the parameters (these will typically be very close to the MLE values, although, depending upon the choice of prior distribution, they will rarely agree exactly), the mean of the parameters and the variance (covariance) of the parameters. Thus we can obtain a great deal of information from the posterior distribution, and there is no need to construct confidence intervals to measure uncertainty in the parameter values as these are obtained directly from the posterior distribution. Note that as the amount of data increases, the effects of the prior will diminish, and in the limit as we obtain infinite data, the choice of prior will be inconsequential.
We now consider Bayesian statistics in action and give some motivation of why MCMC methods have been so successful. Consider a parametric model with parameter(s) $\theta$. Let $\pi(\theta)$ denote the pdf (probability density function) of the prior distribution of $\theta$. Let $\mathbf{x}$ denote the observed data and let $f(\mathbf{x}, \theta)$ denote the joint pdf of the data and the parameters. Then by Bayes Theorem,
\[ f(\mathbf{x}, \theta) = f(\mathbf{x} \,|\, \theta) \pi(\theta). \]
We are interested in the posterior distribution of $\theta$, that is, the distribution of $\theta$ given the data $\mathbf{x}$, ie. $\pi(\theta \,|\, \mathbf{x})$. Therefore
\[ \pi(\theta \,|\, \mathbf{x}) = \frac{f(\mathbf{x}, \theta)}{f(\mathbf{x})} = \frac{f(\mathbf{x} \,|\, \theta) \pi(\theta)}{f(\mathbf{x})}. \]
Since the likelihood $L(\theta; \mathbf{x}) = f(\mathbf{x} \,|\, \theta)$, we have that
\[ \pi(\theta \,|\, \mathbf{x}) \propto L(\theta; \mathbf{x}) \pi(\theta). \qquad (2.2) \]
(Note that $f(\mathbf{x})$ does not depend upon $\theta$.) The likelihood and prior are often relatively easy to obtain, and in such circumstances we can find the posterior distribution up to a constant of proportionality. That is,
\[ \pi(\theta \,|\, \mathbf{x}) = \frac{L(\theta; \mathbf{x}) \pi(\theta)}{C}, \]
where $C = f(\mathbf{x}) = \int L(\theta; \mathbf{x}) \pi(\theta) \, d\theta$ needs to be computed. The computation of $C$ is often very difficult, if at all analytically possible. This has been a major stumbling block for Bayesian statistics. Fortunately, MCMC techniques can be applied directly to (2.2) to obtain samples from the posterior distribution, thus circumventing the need to compute $C$. In MATH553 you will see alternatives to MCMC for getting around the problem of computing $C$.
Gaussian Example.
This simple example produces a nice analytical answer.
Suppose that we have independent and identically distributed data according to a random variable $X \sim N(\mu, \sigma^2)$, where $\mu$ is an unknown parameter and $\sigma^2$ is known. We want to estimate $\mu$.
Prior distribution: Suppose that we know nothing about $\mu$ and take $\pi(\mu) \propto 1$, ie. $\pi(\mu) = c$ for all $\mu \in \mathbb{R}$. This is an improper prior since it does not integrate to 1 over the possible values of $\mu$. However, often (but not always) improper prior distributions give rise to proper posterior distributions, so this is not a problem.
Suppose that $x_1, \ldots, x_n$ are independent realisations of $X$.
Then
\[ L(\mu; \mathbf{x}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right) \]
and
\[ \pi(\mu \,|\, \mathbf{x}) \propto \pi(\mu) L(\mu; \mathbf{x}) \propto \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \right). \]
Therefore
\[ \pi(\mu \,|\, \mathbf{x}) \propto \exp\left( -\frac{n}{2\sigma^2} (\mu - \bar{x})^2 \right), \]
where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
A very useful result. Let $\theta$ have probability density function $\pi(\theta)$. Then if
\[ \pi(\theta) \propto \exp\left( -\frac{1}{2} \left( a\theta^2 - 2b\theta \right) \right), \]
then $\theta \sim N(b/a, 1/a)$.
Thus
\[ \mu \,|\, \mathbf{x} \sim N\left( \bar{x}, \frac{\sigma^2}{n} \right). \]
Therefore the mean of $\mu \,|\, \mathbf{x}$ is the sample mean $\bar{x}$ (which is the MLE). Also we have that the variance is $\sigma^2/n$, which agrees with the variance of the MLE. In other words, as the sample size grows, we obtain tighter and tighter estimates of $\mu$.
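A quick numerical check of this result, with made-up data: with a flat prior and known $\sigma^2$, draws from the posterior $N(\bar{x}, \sigma^2/n)$ concentrate around the sample mean.

# Posterior for the Gaussian mean, flat prior, known sigma^2.
set.seed(1)
sigma <- 2; n <- 50
x <- rnorm(n, mean = 5, sd = sigma)                 # simulated data, true mu = 5
mu_draws <- rnorm(10000, mean(x), sigma / sqrt(n))  # draws from mu | x
c(mean(mu_draws), mean(x))                          # posterior mean = sample mean (MLE)
c(var(mu_draws), sigma^2 / n)                       # posterior variance = sigma^2 / n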
Conjugate priors
The choice of prior plays an important part in Bayesian statistics. There are two key features: how informative the priors are, and conjugacy. We illustrate this idea with an extension of the Gaussian example above to the case where both the mean and variance are unknown. The choice of priors, and in particular conjugacy, will be covered in detail in MATH553.
For $i = 1, \ldots, n$, let $X_i \sim N(\mu, \sigma^2)$. Suppose also that we have the following prior distributions for $\mu$ and $\sigma^2$:
\[ \mu \sim N(\mu_0, \sigma_0^2) \quad \text{and} \quad \tau = 1/\sigma^2 \sim \text{Gamma}(\alpha, \beta), \]
where $\mu$ and $\tau$ are assumed to be a priori independent and $(\mu_0, \sigma_0^2)$ and $(\alpha, \beta)$ are assumed to be known hyperparameters. Hyperparameters are simply the parameters of the prior distributions of the model parameters. Letting $\tau = 1/\sigma^2$ ($\tau$ is known as the precision and is the inverse of the variance), we have that $\pi(\mu, \tau \,|\, \mathbf{x}) \propto L(\mu, \tau; \mathbf{x}) \pi(\mu) \pi(\tau)$, so
\[ \pi(\mu, \tau \,|\, \mathbf{x}) \propto \tau^{n/2} \exp\left( -\frac{\tau}{2} \sum_{i=1}^{n} (x_i - \mu)^2 \right) \exp\left( -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right) \tau^{\alpha - 1} e^{-\beta\tau}. \]
Therefore, although the form of the prior distribution is highly tractable, the posterior distribution is a complex two dimensional distribution. This is often the case for multidimensional parameter sets.
However, if we consider the conditional distributions of each of the parameters in turn, these turn out to be rather nice. We exploit the following two observations which are useful for deriving conditional distributions:
Suppose that the random variable $X$ has a pdf of the form,
\[ f(x) \propto \prod_{j=1}^{J} \exp\left( -\frac{\tau_j}{2} (x - m_j)^2 \right), \]
then
\[ X \sim N\left( \frac{\sum_{j=1}^{J} \tau_j m_j}{\sum_{j=1}^{J} \tau_j}, \frac{1}{\sum_{j=1}^{J} \tau_j} \right). \]
That is, if the pdf of $X$ is composed as the product of Normal densities then $X$ has a Normal density with more weight given to those components with larger precisions (smaller variances).
Suppose that the random variable $X$ has a pdf of the form,
\[ f(x) \propto x^{a-1} e^{-bx}, \qquad x > 0, \]
then $X \sim \text{Gamma}(a, b)$.
Firstly, to consider the conditional distribution of $\mu$, we only need to focus upon terms involving $\mu$. Therefore
\[ \pi(\mu \,|\, \tau, \mathbf{x}) \propto \exp\left( -\frac{\tau}{2} \sum_{i=1}^{n} (x_i - \mu)^2 \right) \exp\left( -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right), \]
giving
\[ \mu \,|\, \tau, \mathbf{x} \sim N\left( \frac{n\tau\bar{x} + \mu_0/\sigma_0^2}{n\tau + 1/\sigma_0^2}, \frac{1}{n\tau + 1/\sigma_0^2} \right). \]
Similarly, for $\tau$ we only need to focus upon terms involving $\tau$, giving
\[ \tau \,|\, \mu, \mathbf{x} \sim \text{Gamma}\left( \alpha + \frac{n}{2}, \beta + \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^2 \right). \]
Thus the conditional distributions of each of the parameters have a nice simple form. This is called conditional conjugacy.
Therefore we will proceed by introducing the Gibbs sampler, which is applicable where the conditional distributions of each parameter given the other parameters are known, but the full joint distribution of the parameters is not known explicitly.
Suppose that we wish to obtain a sample from the multivariate posterior distribution $\pi(\theta \,|\, \mathbf{x})$, where $\theta = (\theta_1, \ldots, \theta_p)$ denotes the parameters of the model and $\mathbf{x}$ denotes the observed data. The Gibbs sampler does this by successively and repeatedly simulating from the conditional distributions of each component given the other components. This procedure is particularly useful where we have conditional conjugacy, so that the resulting conditional distributions are from standard distributions.
Gibbs Sampler algorithm
Initialise with $\theta^{(0)} = (\theta_1^{(0)}, \ldots, \theta_p^{(0)})$.
For $t = 1, \ldots, N$,
Simulate $\theta_1^{(t)}$ from the conditional $\pi(\theta_1 \,|\, \theta_2^{(t-1)}, \ldots, \theta_p^{(t-1)}, \mathbf{x})$.
Simulate $\theta_2^{(t)}$ from the conditional $\pi(\theta_2 \,|\, \theta_1^{(t)}, \theta_3^{(t-1)}, \ldots, \theta_p^{(t-1)}, \mathbf{x})$.
Simulate $\theta_p^{(t)}$ from the conditional $\pi(\theta_p \,|\, \theta_1^{(t)}, \ldots, \theta_{p-1}^{(t)}, \mathbf{x})$.
Discard the first $M$ iterations and estimate summary statistics of the posterior distribution using $\theta^{(M+1)}, \ldots, \theta^{(N)}$.
How does it work?
Suppose that $\theta^{(t)}$ comes from $\pi(\theta \,|\, \mathbf{x})$. Then for $j = 1, \ldots, p$, updating the $j$th component involves drawing a new value of $\theta_j$ from $\pi(\theta_j \,|\, \theta_{-j}, \mathbf{x})$, where $\theta_{-j}$ denotes the parameters other than $\theta_j$. Thus the updated set of parameters, with $\theta_j$ replaced by the new value, also comes from the posterior distribution of $\theta$.
Thus, provided $\theta^{(0)}$ is drawn from the posterior distribution of $\theta$, $\theta^{(1)}, \theta^{(2)}, \ldots$ are samples from the posterior distribution. Note that we have a dependent sample.
We have to specify initial values for $\theta^{(0)}$. If $\theta^{(0)}$ were drawn from the stationary distribution of the Markov chain (the distribution we are interested in), then $\theta^{(1)}, \theta^{(2)}, \ldots$ would be realisations from the stationary distribution, $\pi(\theta \,|\, \mathbf{x})$. However, we don't know what the (stationary) distribution is. (If we knew the distribution, there would be no need for MCMC to simulate from it!) Thus typically $\theta^{(0)}$ will not be drawn from the stationary distribution. As $t$ increases, $\theta^{(t)}$ increasingly forgets the initial value $\theta^{(0)}$, and consequently, for large $t$, $\theta^{(t)}$ is approximately from the stationary distribution. Hence $\theta^{(M+1)}, \ldots, \theta^{(N)}$ are approximately from the stationary distribution and thus are considered a sample thereof. In some circumstances it is possible to use perfect simulation, in which case $\theta^{(0)}$ is drawn from the stationary distribution even though the stationary distribution is unknown. Perfect simulation is not covered in this course and is only viable in some special cases.
We recap the normal example given in Section 2.4. Suppose that $x_1, \ldots, x_n$ are independent and identically distributed according to $N(\mu, 1/\tau)$, where $\mu$ and $\tau$ are unknown parameters to be estimated. The following prior distributions are assigned to $\mu$ and $\tau$:-
\[ \mu \sim N(\mu_0, \sigma_0^2) \quad \text{and} \quad \tau \sim \text{Gamma}(\alpha, \beta). \]
We have the following Gibbs sampler algorithm for obtaining a sample of size $N$ from the joint distribution $\pi(\mu, \tau \,|\, \mathbf{x})$.
Initialise with $(\mu^{(0)}, \tau^{(0)})$. Any values of $\mu^{(0)}$ and $\tau^{(0)}$ will be OK. However, a reasonable start would be the sample mean and the inverse of the sample variance. That is, $\mu^{(0)} = \bar{x}$ and $\tau^{(0)} = 1/s^2$, where $s^2$ is the sample variance.
For $t = 1, \ldots, N$:-
Simulate $\mu^{(t)} \sim N\left( \dfrac{n\tau^{(t-1)}\bar{x} + \mu_0/\sigma_0^2}{n\tau^{(t-1)} + 1/\sigma_0^2}, \dfrac{1}{n\tau^{(t-1)} + 1/\sigma_0^2} \right)$.
Simulate $\tau^{(t)} \sim \text{Gamma}\left( \alpha + \frac{n}{2}, \beta + \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu^{(t)})^2 \right)$.
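Before looking at the output, here is a sketch of this sampler in R, under the prior notation assumed above ($\mu_0$, $\sigma_0^2$, $\alpha$, $\beta$); it is an illustration, not the provided course code.

# Gibbs sampler for the normal model: mu ~ N(mu0, sigma0sq), tau ~ Gamma(alpha, beta).
gibbs_normal <- function(x, N, mu0 = 0, sigma0sq = 100, alpha = 1, beta = 1) {
  n <- length(x); xbar <- mean(x)
  mu <- numeric(N); tau <- numeric(N)
  mu_t <- xbar; tau_t <- 1 / var(x)        # reasonable starting values
  for (t in 1:N) {
    prec <- n * tau_t + 1 / sigma0sq       # posterior precision for mu
    mu_t <- rnorm(1, (n * tau_t * xbar + mu0 / sigma0sq) / prec, sqrt(1 / prec))
    tau_t <- rgamma(1, alpha + n / 2, beta + 0.5 * sum((x - mu_t)^2))
    mu[t] <- mu_t; tau[t] <- tau_t
  }
  list(mu = mu, tau = tau)
}
set.seed(1)
out <- gibbs_normal(rnorm(100, 5, 2), N = 1100)
mean(out$mu[101:1100])   # posterior mean of mu after a burn-in of 100 iterations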
The Gibbs sampler was run to produce a sample of size 1100 from the joint distribution of $\mu$ and $\tau$. We discard the first 100 samples as burn-in, using the remaining 1000 samples to estimate $\mu$ and $\tau$.
Verification
We will show that $\pi(\theta \,|\, \mathbf{x})$ is indeed the stationary distribution of the Gibbs sampler in the case where $p = 2$. The proof for $p > 2$ is similar but more long-winded.
Let $\theta = (\theta_1, \theta_2)$ and let $\pi_1(\theta_1 \,|\, \theta_2)$ and $\pi_2(\theta_2 \,|\, \theta_1)$ denote the conditional distributions of $\theta_1$ and $\theta_2$, respectively, suppressing the conditioning on $\mathbf{x}$ throughout. Suppose that $\theta^{(0)} = (\theta_1^{(0)}, \theta_2^{(0)})$ is drawn from $\pi(\cdot)$. Then the pdf of $\theta^{(1)}$ given $\theta^{(0)}$ is
\[ \pi_1(\theta_1^{(1)} \,|\, \theta_2^{(0)}) \, \pi_2(\theta_2^{(1)} \,|\, \theta_1^{(1)}). \]
We want to show that the pdf of $\theta^{(1)}$ is equal to $\pi(\theta^{(1)})$.
For any set $A$, using $\int \pi(\theta_1^{(0)}, \theta_2^{(0)}) \, d\theta_1^{(0)} = \pi(\theta_2^{(0)})$ and $\pi(\theta_2^{(0)}) \, \pi_1(\theta_1^{(1)} \,|\, \theta_2^{(0)}) = \pi(\theta_1^{(1)}, \theta_2^{(0)})$, we have
\[ P(\theta^{(1)} \in A) = \int_A \left\{ \iint \pi(\theta_1^{(0)}, \theta_2^{(0)}) \, \pi_1(\theta_1^{(1)} \,|\, \theta_2^{(0)}) \, \pi_2(\theta_2^{(1)} \,|\, \theta_1^{(1)}) \, d\theta_1^{(0)} \, d\theta_2^{(0)} \right\} d\theta^{(1)} \]
\[ = \int_A \pi_2(\theta_2^{(1)} \,|\, \theta_1^{(1)}) \left\{ \int \pi(\theta_1^{(1)}, \theta_2^{(0)}) \, d\theta_2^{(0)} \right\} d\theta^{(1)} = \int_A \pi_2(\theta_2^{(1)} \,|\, \theta_1^{(1)}) \, \pi(\theta_1^{(1)}) \, d\theta^{(1)} = \int_A \pi(\theta^{(1)}) \, d\theta^{(1)}, \]
as required.
Observations
We make a couple of minor observations about the Gibbs sampler and its output which are very useful in practice.
Firstly, suppose that we are interested in the marginal distribution of one or more of the parameters, for example $\theta_1$. Then $\theta_1^{(M+1)}, \ldots, \theta_1^{(N)}$ represents a sample from $\pi(\theta_1 \,|\, \mathbf{x})$. That is, we automatically get samples from the marginal distributions of parameters from the Gibbs sampler.
Secondly, whilst it will often be the case that univariate conditional distributions are used to construct the Gibbs sampler, it is possible to update more than one parameter together in a block by utilising multi-dimensional conditional distributions, that is, updating a subset of the parameters conditional upon the data and the remaining parameters. This is particularly useful in regression and similar examples which give rise to multivariate Gaussian distributions as conditional distributions.
We shall use this example to further illustrate the Gibbs sampler in action. The data to be analysed are obtained from Jarrett (1979) and concern the number of coal mining disasters per year over the period 1851-1962.
A plot of the data is given in the figure below, and suggests that there has been a reduction in the number of disasters per year over the period. The following model was suggested by Carlin et al. (1992): for $i = 1, \ldots, n$ (with $n = 112$ years of data),
\[ X_i \sim \begin{cases} \text{Po}(\lambda), & i = 1, \ldots, k, \\ \text{Po}(\mu), & i = k+1, \ldots, n. \end{cases} \]
That is, the number of disasters per year is Poisson distributed. Moreover, the first $k$ years have a common mean $\lambda$ and the remaining $n - k$ years have a common mean $\mu$. This is a standard change point problem, in that there is a behavioural change in the model (here a change in mean) at an unknown point in time.
Unnumbered Figure: number of coal mining disasters per year, 1851-1962.
We complete the specification of the model by assuming the following prior structure. Let $\lambda \sim \text{Gamma}(\alpha, \beta)$, $\mu \sim \text{Gamma}(\gamma, \delta)$ and $k$ discrete uniform over $\{1, 2, \ldots, 112\}$, each independent of one another (ie. the parameters are a priori independent), with known hyperparameters $\alpha$, $\beta$, $\gamma$ and $\delta$. More complicated hierarchical models have previously been used to analyse these data; see Carlin et al. (1992).
Therefore
\[ \pi(\lambda, \mu, k \,|\, \mathbf{x}) \propto L(\lambda, \mu, k; \mathbf{x}) \, \pi(\lambda) \, \pi(\mu) \, \pi(k) \qquad (2.3) \]
\[ \propto \left\{ \prod_{i=1}^{k} e^{-\lambda} \lambda^{x_i} \right\} \left\{ \prod_{i=k+1}^{n} e^{-\mu} \mu^{x_i} \right\} \lambda^{\alpha-1} e^{-\beta\lambda} \, \mu^{\gamma-1} e^{-\delta\mu}, \qquad (2.4) \]
where $S_k = \sum_{i=1}^{k} x_i$ and $\tilde{S}_k = \sum_{i=k+1}^{n} x_i$, and
\[ \pi(\lambda, \mu, k \,|\, \mathbf{x}) \propto \lambda^{\alpha + S_k - 1} e^{-(\beta + k)\lambda} \, \mu^{\gamma + \tilde{S}_k - 1} e^{-(\delta + n - k)\mu}. \qquad (2.5) \]
The conditional distributions of $\lambda$, $\mu$ and $k$ can be obtained from (2.3)-(2.5) by focussing upon only those terms involving the parameter of interest. Note that $k$ takes discrete values in the range $\{1, 2, \ldots, 112\}$.
The conditional distributions are given as follows:
\[ \lambda \,|\, \mu, k, \mathbf{x} \sim \text{Gamma}(\alpha + S_k, \beta + k), \]
\[ \mu \,|\, \lambda, k, \mathbf{x} \sim \text{Gamma}(\gamma + \tilde{S}_k, \delta + n - k). \]
Using (2.4) and (2.5), we have that
\[ \pi(k \,|\, \lambda, \mu, \mathbf{x}) \propto \lambda^{S_k} e^{-k\lambda} \, \mu^{\tilde{S}_k} e^{-(n-k)\mu}. \qquad (2.9) \]
Now $k$ does not have a nice conditional distribution, by which we mean an easily recognisable, well known distribution. However, $\pi(k \,|\, \lambda, \mu, \mathbf{x})$ is a discrete distribution and there are only 112 possible values. Therefore we can compute the normalising constant directly to give
\[ \pi(k \,|\, \lambda, \mu, \mathbf{x}) = \frac{\lambda^{S_k} e^{-k\lambda} \, \mu^{\tilde{S}_k} e^{-(n-k)\mu}}{\sum_{j=1}^{n} \lambda^{S_j} e^{-j\lambda} \, \mu^{\tilde{S}_j} e^{-(n-j)\mu}}. \qquad (2.10) \]
Sampling from the conditional distribution is straightforward in R using the sample command.
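For example, a sketch of the update for $k$; lam, mu and x denote the current parameter values and the data, and the names are illustrative. Working with log weights avoids numerical underflow in $\lambda^{S_k}$.

# One Gibbs update of k from its discrete conditional (2.10).
update_k <- function(lam, mu, x) {
  n <- length(x)
  S <- cumsum(x)                        # S_k = x_1 + ... + x_k
  logw <- S * log(lam) - (1:n) * lam +
          (S[n] - S) * log(mu) - (n - 1:n) * mu
  w <- exp(logw - max(logw))            # unnormalised conditional probabilities
  sample(1:n, size = 1, prob = w)       # sample() renormalises prob internally
}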
The Gibbs sampler was applied to the coal mining data set and run for 1100 iterations to obtain samples from $\pi(\lambda, \mu, k \,|\, \mathbf{x})$. The hyperparameters of the priors were set to fixed values. The results of the Gibbs sampler are shown in the figures below.
Unnumbered Figures: trace plots of the Gibbs sampler output for $\lambda$, $\mu$ and $k$.
Convergence of the algorithm seems to be rapid (after compensating for a poor choice of starting values). Therefore I deleted the first 100 values and based all the analysis on the remaining 1000 iterations. In other words, I took the first 100 iterations as burn-in.
A histogram of the posterior distribution of the change point year is given below. There is very strong evidence to suggest that the change point occurs about the posterior mode, which corresponds to the year 1891.
Unnumbered Figure: histogram of the posterior distribution of the change point year.
It is clear from the trace plots above and kernel density estimates for the posterior distributions that $\mu$ is (almost certainly) less than $\lambda$. This corresponds to fewer disasters, and hence a safer working environment, after the change point. Note that any summary statistic or distributional quantity related to the joint distribution of $\lambda$ and $\mu$, such as the distribution of $\lambda - \mu$, can be estimated using the output from the Gibbs sampler. This is simply done by looking at each pair of realisations $(\lambda^{(t)}, \mu^{(t)})$, giving $\lambda^{(t)} - \mu^{(t)}$ as the $t$th realisation from the posterior distribution of the difference.
We can easily sample from the predictive distribution. Let $x_{n+1}$ denote a future observation. Future observations are iid according to $\text{Po}(\mu)$, and $x_{n+1}$ does not depend upon $\mathbf{x}$, $\lambda$ or $k$ given $\mu$. Therefore the predictive distribution of $x_{n+1}$ given $\mathbf{x}$ satisfies
\[ \pi(x_{n+1} = y \,|\, \mathbf{x}) = \int \pi(x_{n+1} = y \,|\, \mu) \, \pi(\mu \,|\, \mathbf{x}) \, d\mu. \]
Samples from $\pi(\mu \,|\, \mathbf{x})$ are given by the Gibbs sampler in the form of $\mu^{(M+1)}, \ldots, \mu^{(N)}$. Therefore we can estimate $\pi(x_{n+1} = y \,|\, \mathbf{x})$ using the Monte Carlo estimate
\[ \frac{1}{N-M} \sum_{t=M+1}^{N} P(x_{n+1} = y \,|\, \mu^{(t)}) = \frac{1}{N-M} \sum_{t=M+1}^{N} \frac{e^{-\mu^{(t)}} (\mu^{(t)})^y}{y!}, \qquad (2.11) \]
or simply simulate $x_{n+1}^{(t)} \sim \text{Po}(\mu^{(t)})$ for $t = M+1, \ldots, N$ and estimate $\pi(x_{n+1} = y \,|\, \mathbf{x})$ by
\[ \frac{1}{N-M} \sum_{t=M+1}^{N} 1_{\{x_{n+1}^{(t)} = y\}}. \qquad (2.12) \]
The estimate in (2.11) has smaller variance, ie. is better. However, (2.12) is easier to compute and only requires the ability to simulate from the Poisson distribution.
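The following sketch compares the two estimators; mu_draws stands in for the post burn-in Gibbs samples of $\mu$ (here replaced by made-up Gamma draws so that the snippet is self-contained).

# Comparing the predictive estimators (2.11) and (2.12).
set.seed(1)
mu_draws <- rgamma(1000, 90, 100)   # stand-in for the Gibbs output for mu
y <- 0:5
p_rao <- sapply(y, function(k) mean(dpois(k, mu_draws)))  # estimator (2.11)
x_new <- rpois(length(mu_draws), mu_draws)                # simulate x_{n+1}
p_sim <- sapply(y, function(k) mean(x_new == k))          # estimator (2.12)
rbind(p_rao, p_sim)   # both estimate P(x_{n+1} = y | x); (2.11) is smoother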
For the coal mining data set, we can use the estimator to predict the number of disasters in future years (Poisson with rate $\mu$). A graph of the predictive distribution, alongside the corresponding estimates based on the posterior mean of $\mu$, is given for a sample of 1000 realisations in the figure below. Since there is limited variation in the posterior distribution of $\mu$, the two distributions are virtually identical. However, in general the predictive distribution will demonstrate greater variability, owing to the uncertainty in $\mu$.
Predictive and estimative (based on the posterior mean of $\mu$) distributions of the number of coal mining disasters per year (solid line is predictive).
R code for applying the Gibbs sampler to generate samples from a bivariate normal distribution and the outline of a Gibbs sampler algorithm for the coal mining data.
Bivariate Normal
The program biv (in biv normR.R) generates values, using the Gibbs sampler,
from the bivariate normal distribution:-
Can you modify the code to produce samples from the bivariate normal distribution:-
given that
Coal Mining Example
Complete the Gibbs sampling algorithm for the coal mining data (in coal outlineR.R).
Remember:-
\[ \lambda \,|\, \mu, k, \mathbf{x} \sim \text{Gamma}(\alpha + S_k, \beta + k), \qquad \mu \,|\, \lambda, k, \mathbf{x} \sim \text{Gamma}(\gamma + \tilde{S}_k, \delta + n - k), \]
where $S_k = \sum_{i=1}^{k} x_i$ and $\tilde{S}_k = \sum_{i=k+1}^{n} x_i$.
Hint: To sample $k$ from its conditional distribution (2.10), use the sample command in R.
Apply the algorithm to the coal mining data.
The aim is to analyse the dataset in the file labhills.txt. The file contains record times (in minutes) for 35 Scottish hill races, together with the length of the race (in miles) and elevation (in feet). The objective is to model the winning times in terms of distances and climbs of the races.
Examine the data file
> hills <- read.table("labhills.txt")
> names(hills)
Now look at the data frame called hills; what are the names of the variables in the data frame? A data frame is just a matrix containing data, and each element in the matrix can be accessed as one usually does with matrices. To refer to the first variable in the data frame (that is, the first column in the data frame), you can either say hills[,1] or call it by its name, hills$dist. Similarly for the other variables in the data frame. If you want to see the entire data frame do
> hills
Next look at the file gibbs_normalR.r. This contains a Gibbs sampler for the normal model that was described in the lectures.
Check that you understand what the code is doing. (There is code at the bottom of the file to simulate data and run the algorithm.) Apply the Gibbs sampler to the times of the hill races with the given hyperparameter values. Why is the prior value for $\tau$ so small? Hint: look at var(y).
The first aim in this lab is for you to extend and alter the code to produce a Gibbs sampler for a linear regression model.
A good starting point is Naismithβs rule, which is used to calculate the length of time a hillwalk should take. You divide the total distance by your average speed on the flat and then add on an allowance for each 100ft of ascent. The form of such a model is:
\[ t_i = \beta_1 d_i + \beta_2 c_i, \qquad (2.13) \]
where $t_i$, $d_i$, $c_i$ respectively denote the time, distance and climb for the $i$th race, and $\beta = (\beta_1, \beta_2)$ is a parameter vector. Allowing for a Normal error distribution, we obtain that the observations are independent, distributed according to
\[ t_i \sim N(\beta_1 d_i + \beta_2 c_i, 1/\tau). \qquad (2.14) \]
To construct a Bayesian model, you also need to specify a prior distribution for the parameters $\beta$ and $\tau$. Take these parameters independent a priori with
\[ \beta \sim N_2\left( \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix} \right), \qquad \tau \sim \text{Gamma}(\alpha, \gamma). \qquad (2.15) \]
In other words, the prior for $\beta$ is bi-variate Normal with mean vector $(b_1, b_2)^T$ and a diagonal covariance matrix. Since the covariance matrix is diagonal, $\beta_1$ and $\beta_2$ are independent a priori. $\tau$ is assigned a Gamma prior distribution.
Write down, up to a proportionality constant, the joint posterior density of $(\beta_1, \beta_2, \tau)$.
Write down, up to a proportionality constant, the posterior density of $\tau$, conditioning on $(\beta_1, \beta_2)$; that is, $\pi(\tau \,|\, \beta_1, \beta_2, \mathbf{t})$.
Thus obtain the conditional distribution, $\tau \,|\, \beta_1, \beta_2, \mathbf{t}$.
Write down, up to a proportionality constant, the posterior density of $\beta_1$, conditioning on $\beta_2$ and $\tau$; that is, $\pi(\beta_1 \,|\, \beta_2, \tau, \mathbf{t})$.
Thus obtain the conditional distribution, $\beta_1 \,|\, \beta_2, \tau, \mathbf{t}$.
Write down, up to a proportionality constant, the posterior density of $\beta_2$, conditioning on $\beta_1$ and $\tau$; that is, $\pi(\beta_2 \,|\, \beta_1, \tau, \mathbf{t})$.
Thus obtain the conditional distribution, $\beta_2 \,|\, \beta_1, \tau, \mathbf{t}$.
Using the conditional posterior distributions in the previous question, write down on paper a Gibbs sampling algorithm to compute the posterior distribution of $(\beta_1, \beta_2, \tau)$ via simulation.
Copy, rename and alter the function in gibbs_normalR.r to create your Gibbs sampler in R.
Choose hyperparameters for the prior distribution in such a way that the prior does not contain a lot of information (ie. choose moderately large prior variances for the parameters, but do not choose a shape parameter for the gamma prior below 1). Also, in the absence of prior information to the contrary, it seems natural to choose $b_1 = 0$ and $b_2 = 0$ in the prior distribution for $\beta$.
Run the program for the hill races dataset.
Choose a suitable burn-in period and a suitable number of draws to be used for inference.
Choosing 5000 draws and discarding the first 1000 as burn-in is reasonable.
Display histograms of the marginal posterior distributions of $\beta_1$, $\beta_2$ and $\tau$.
Compute the following numerical summaries of their posterior distributions: minimum, 1st quartile, median, mean, 3rd quartile, maximum.
Display a scatterplot of the joint posterior distribution of $(\beta_1, \beta_2)$.
What are your conclusions in terms of the effect of distance and climb on the winning time of races?
Some of the observations may be regarded as outliers, i.e. not fitted well by the assumed model. A way to try to detect outliers is to see whether any of the observations is particularly far from the value predicted by the model. For a race of distance $d$ and climb $c$, a natural prediction of its winning time would be
\[ \hat{t} = \hat{\beta}_1 d + \hat{\beta}_2 c, \]
where $\hat{\beta}_1$ and $\hat{\beta}_2$ are the posterior means of $\beta_1$ and $\beta_2$.
For each of the races, compute the residual $t_i - \hat{t}_i$ and plot the resulting numbers. Are there any races that appear out of line?
Block update in Gibbs sampler
The above Gibbs sampler alternates between:
Updating $\tau \,|\, \beta_1, \beta_2, \mathbf{t}$.
Updating $\beta_1 \,|\, \beta_2, \tau, \mathbf{t}$.
Updating $\beta_2 \,|\, \beta_1, \tau, \mathbf{t}$.
This algorithm works well on the above data set. However, a more efficient Gibbs sampler can be obtained by updating $\beta = (\beta_1, \beta_2)$ as a block. That is, a Gibbs sampler which alternates between:
Updating $\tau \,|\, \beta, \mathbf{t}$.
Updating $\beta \,|\, \tau, \mathbf{t}$.
The update of $\tau$ is identical in both algorithms. For the block update of $\beta$, it is fairly straightforward but algebraically tedious to show that:
\[ \beta \,|\, \tau, \mathbf{t} \sim N_2(m^\ast, V^\ast), \]
where the prior on $\beta$ is $N_2(m, V)$ and
\[ V^\ast = \left( V^{-1} + \tau X^T X \right)^{-1}, \qquad m^\ast = V^\ast \left( V^{-1} m + \tau X^T \mathbf{t} \right), \]
with $X$ denoting the design matrix whose $i$th row is $(d_i, c_i)$.
The parameters of the conditional posterior distribution of $\beta$ are a weighted average of the prior and the data. Note the similarity to the conditional posterior distribution of $\mu$ in the univariate Gaussian example. A sketch of this block update is given below.
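A sketch of the block update in R, under the conditional given above; X is the design matrix with rows $(d_i, c_i)$, and all names are illustrative.

# One block draw of beta | tau, t ~ N2(m_star, V_star), prior beta ~ N2(m, V).
draw_beta <- function(tau, X, t_obs, m, V) {
  V_star <- solve(solve(V) + tau * crossprod(X))               # posterior covariance
  m_star <- V_star %*% (solve(V) %*% m + tau * t(X) %*% t_obs) # posterior mean
  as.vector(m_star + t(chol(V_star)) %*% rnorm(2))             # one bivariate draw
}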
Note: I won't expect you to compute multivariate conditional distributions for coursework or the exam.
Using detailed balance, construct a transition matrix $P$ for a Markov chain with a given stationary distribution $\pi$.
Remember for detailed balance, for all $i$ and $j$:-
\[ \pi_i p_{ij} = \pi_j p_{ji}. \]
Let $x$ denote an observation from $X \sim \text{Bin}(N, \theta)$, where both $N$ and $\theta$ are unknown parameters.
Suppose that $\text{Beta}(a, b)$ and discrete uniform on $\{1, 2, \ldots, M\}$ priors are assigned to $\theta$ and $N$, respectively. That is,
\[ \pi(\theta) \propto \theta^{a-1} (1-\theta)^{b-1}, \quad 0 < \theta < 1, \qquad \pi(N) = \frac{1}{M}, \quad N \in \{1, 2, \ldots, M\}. \]
Write down the likelihood $\pi(x \,|\, N, \theta)$.
Write down, up to a constant of proportionality, the joint posterior distribution of $N$ and $\theta$, $\pi(N, \theta \,|\, x)$.
Find the conditional distribution of $\theta$ given $N$ and $x$, $\pi(\theta \,|\, N, x)$.
Find the conditional distribution of $N$ given $\theta$ and $x$, $\pi(N \,|\, \theta, x)$.
Describe a Gibbs sampler for obtaining samples from $\pi(N, \theta \,|\, x)$.
Consider a two-state discrete time Markov process (states 1 and 2). For $t = 1, 2, \ldots, n$, let $X_t$ denote the state of the Markov process at time $t$. The process is observed from time 1 to time $n$ with observed data $\mathbf{x} = (x_1, x_2, \ldots, x_n)$.
There is a (potential) change-point, $k$, in the data, in that, for $t < k$,
\[ P(X_{t+1} \neq X_t \,|\, X_t) = \alpha, \]
and for $t \geq k$,
\[ P(X_{t+1} \neq X_t \,|\, X_t) = \beta, \]
where $\alpha$ and $\beta$ are unknown parameters to be estimated. Throughout assume a $U(0,1)$ prior on $\alpha$ and $\beta$. For $k$, unless specified otherwise, denote the prior probability for $k = j$ by $p_j$.
Given the value of $k$, write down the likelihood of the parameters and compute the marginal posterior distributions of $\alpha$ and $\beta$.
Hint: Note that, regardless of the value of $k$, the distribution of $X_1$ does not depend upon $\alpha$ or $\beta$.
Suppose that the change-point is at an unknown location $k$. Find an expression for the conditional posterior distribution $\pi(k \,|\, \alpha, \beta, \mathbf{x})$.
Outline a Gibbs sampler algorithm to obtain samples from the joint posterior distribution of $(\alpha, \beta, k)$.
Tougher question
For the Gibbs sampler there is inherent dependence between successive realisations of the parameters. This is generally the case for all MCMC algorithms. The dependence between parameters can have an important impact on the convergence/performance of the MCMC chain (the chain of realisations of the MCMC output) to the posterior distribution.
Bivariate Normal
Suppose that $(X, Y)$ is a bivariate normal with standard normals as marginal distributions for $X$ and $Y$, and correlation $\rho$. (This is the bivariate normal example for which the Gibbs sampler code is provided.)
Then $(X, Y)$ has pdf
\[ f(x, y) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left( -\frac{x^2 - 2\rho x y + y^2}{2(1-\rho^2)} \right). \]
That is, $X$ and $Y$ have conditional distributions,
\[ X \,|\, Y = y \sim N(\rho y, 1 - \rho^2) \qquad (2.16) \]
and
\[ Y \,|\, X = x \sim N(\rho x, 1 - \rho^2). \qquad (2.17) \]
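For reference, a minimal sketch of this Gibbs sampler, with a made-up value of $\rho$; this is an illustration, not the provided biv normR.R code.

# Gibbs sampler alternating between (2.16) and (2.17).
set.seed(1)
rho <- 0.8; N <- 5000
x <- numeric(N); y <- numeric(N)
x_t <- 0; y_t <- 0
for (t in 1:N) {
  x_t <- rnorm(1, rho * y_t, sqrt(1 - rho^2))  # draw X | Y = y_t from (2.16)
  y_t <- rnorm(1, rho * x_t, sqrt(1 - rho^2))  # draw Y | X = x_t from (2.17)
  x[t] <- x_t; y[t] <- y_t
}
cor(x, y)  # close to rho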
Let $(x_t, y_t)$ denote the value of $(X, Y)$ obtained from the $t$th iteration of the Gibbs sampler, which alternates between (2.16) and (2.17).
Find the distribution of $x_{t+1}$ given that $x_t = x$.
Find the distribution of $x_{t+k}$ given that $x_t = x$.
Find the correlation between $x_t$ and $x_{t+k}$, given that the chain is in stationarity.