Data augmentation is a useful tool for obtaining samples from the posterior distribution of parameters when the posterior is not in a convenient form for analysis. The Gibbs sampler with data augmentation is primarily used in the following two situations:
In real world problems there is often missing data. Although in theory the distribution of the observed data can be obtained by integrating out the missing data, this is often difficult or impossible to do.
The likelihood of the data is not tractable in its current form, but conditional upon the collection of unobserved (extra) data, the likelihood becomes tractable. This corresponds to the situation we have observed with the EM algorithm and there are similarities (as well as obvious differences) between the EM algorithm and Gibbs sampler in data augmentation problems.
More generally, data augmentation is used extensively within MCMC (Markov chain Monte Carlo) algorithms to assist with obtaining samples from .
The generic set up is as follows. Let denote the observed data and let denote the model parameters. Then
and this can be difficult to work with if the likelihood is not in a convenient form. Now suppose that there is extra information such that the joint distribution of and is more convenient to work with. We can then look to construct a Gibbs sampler (or alternative MCMC algorithm) to obtain samples from . It is then trivial to obtain samples from the marginal distribution by simply ignoring the values. In other words, we focus on the marginal distribution . Note that
Then the Gibbs sampler alternates between updating the parameters and augmented data.
Update given and , i.e. use .
Update given and , i.e. use .
Both the updates of and will often be broken down into a number of steps. Note the similarities to the EM algorithm with step 1 replacing the M-step (updating the parameters given the observed and augmented data) and step 2 replacing the E-step (updating the augmented data given the observed data and parameters).
We illustrate data augmentation within the Gibbs sampler using a range of examples.
We begin by illustrating data augmentation Gibbs sampling with a very simple example, which is a simplified version of the Normal mixtures introduced in the Week 6 notes for the EM algorithm.
Suppose that are iid (independent and identically distributed) from the mixture density
That is,
Thus the mixture can be constructed as follows. For each observation toss a fair coin and let denote the outcome of the coin toss with if the coin shows a head and otherwise. If the coin toss shows a head draw from , otherwise draw from .
The likelihood can be written explicitly:-
where . It is difficult to maximise (for the MLE) or to compute the posterior distribution for using the given likelihood. However, if we knew the results of the coin tosses (i.e. if we knew ), then we would know which observations come from each component, and the posterior distribution (or MLE) would be straightforward to derive.
Let denote the outcome of the coin tosses and let , the set of coin tosses that are tails. Thus for , is drawn from .
Original likelihood:-
Augmented likelihood:-
Therefore if we take a prior for , we have that
Thus the posterior distribution of is
where .
On the other hand let denote the outcome of the coin toss. Then it is straightforward to show that
The Gibbs sampling algorithm is:-
Initial value for . (A reasonable starting value is , the sample mean.)
For , set with probability , and set otherwise. This updates the set .
Sample .
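Since the notation above is abbreviated in these notes, the following is a minimal sketch in R of this data-augmentation Gibbs sampler under an assumed version of the model: a fair 50/50 mixture of a N(mu, 1) and a N(0, 1) component, with a flat (improper uniform) prior on mu. The function name mixture.gibbs and these modelling choices are illustrative assumptions rather than the exact specification above.

```r
## Sketch of a data-augmentation Gibbs sampler for a two-component Normal
## mixture. Assumed model: x_i ~ 0.5*N(mu, 1) + 0.5*N(0, 1), with a flat
## (improper uniform) prior on mu. z_i = 1 indicates that x_i came from the
## N(mu, 1) component.
mixture.gibbs <- function(x, n.iter = 1000, mu.init = mean(x)) {
  n <- length(x)
  mu <- numeric(n.iter + 1)
  mu[1] <- mu.init
  for (t in 1:n.iter) {
    ## Update the augmented data: P(z_i = 1 | x_i, mu)
    p1 <- dnorm(x, mean = mu[t], sd = 1)
    p0 <- dnorm(x, mean = 0, sd = 1)
    z <- rbinom(n, size = 1, prob = p1 / (p1 + p0))
    ## Update mu given the allocations: mu | x, z ~ N(mean(x[z == 1]), 1/n1)
    n1 <- sum(z)
    if (n1 > 0) {
      mu[t + 1] <- rnorm(1, mean = mean(x[z == 1]), sd = 1 / sqrt(n1))
    } else {
      mu[t + 1] <- mu[t]  # no observations allocated; keep the current value
    }
  }
  mu
}

## Example usage on simulated data:
## x <- c(rnorm(50, 2, 1), rnorm(50, 0, 1))
## out <- mixture.gibbs(x, n.iter = 2000)
```

The two updates inside the loop correspond directly to the two steps of the algorithm above.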
We revisit the genetic linkage example from the Week 6 notes on the EM algorithm. Whilst we are now focused on the posterior distribution rather than the MLE of the parameter , we use exactly the same data augmentation as before.
The data consist of the genetic linkage of 197 animals, divided into four genetic categories labelled 1 to 4; see [1]. The probabilities that an animal belongs to each of the four categories are , respectively. Let denote the total number of animals in each category.
We are interested in estimating . Note that for the category probabilities to be valid we require that . Therefore we shall assign an (uninformative) uniform prior to . This is a multinomial experiment (4 different outcomes), so
The posterior distribution of is not of a particularly nice form,
Suppose that the observed cell could be divided into two subcategories and . Suppose that is the number of animals in subcategory with cell probability . This would give an augmented data set . Then
Thus
Therefore the posterior density is proportional to a Beta density giving
What is ?
There are 125 animals in categories and .
The conditional probability that the animal belongs to category given that the animal belongs to category 1 is
Thus .
Therefore the Gibbs sampler iterates between the following two equations:
(Note that the above is for general , not just the data given.)
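As an illustration, here is a hedged R sketch of this Gibbs sampler. It assumes the standard form of the genetic linkage example: cell probabilities (1/2 + theta/4, (1-theta)/4, (1-theta)/4, theta/4), observed counts y = (125, 18, 20, 34) (which sum to 197), and a Uniform(0,1) prior on theta. These specifics should be checked against the notes and [1].

```r
## Sketch of the data-augmentation Gibbs sampler for the genetic linkage example.
## Assumptions (standard in this example, but check against the notes):
##   cell probabilities (1/2 + theta/4, (1-theta)/4, (1-theta)/4, theta/4),
##   counts y = c(125, 18, 20, 34), and a Uniform(0,1) prior on theta.
## The first cell is split into two subcategories with probabilities 1/2 and
## theta/4; z is the (augmented) count in the theta/4 subcategory.
linkage.gibbs <- function(y = c(125, 18, 20, 34), n.iter = 1000, theta.init = 0.5) {
  theta <- numeric(n.iter + 1)
  theta[1] <- theta.init
  for (t in 1:n.iter) {
    ## Augmented data: z | y, theta ~ Binomial(y1, theta / (theta + 2))
    z <- rbinom(1, size = y[1], prob = theta[t] / (theta[t] + 2))
    ## Parameter: theta | y, z ~ Beta(z + y4 + 1, y2 + y3 + 1)
    theta[t + 1] <- rbeta(1, z + y[4] + 1, y[2] + y[3] + 1)
  }
  theta
}

## theta.samples <- linkage.gibbs(n.iter = 1100)[-(1:101)]  # discard burn-in
```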
A sample of size 100 from the posterior distribution was obtained using the above Gibbs sampler and the output is presented below. Initial values and were chosen.
[Figure: Gibbs sampler output for the genetic linkage example]
The algorithm appears to converge immediately, but just to be safe I shall disregard the first 20 iterations as burn-in. The following summary statistics are then obtained for :-
Estimate of posterior mean, 0.6258.
Estimate of posterior variance, .
Posterior density plots (obtained by kernel smoothing) suggest a modal value of about 0.645.
It is important to note that the sequence of realisations of are not independent. This is clear from their construction, via the Markov chain. One useful summary of dependence is an ACF (autocorrelation function) plot. The acf plot (in R) gives the estimated value of for , the correlation between samples from the posterior distribution which differ by a lag of iterations. For independent samples, for all , although an estimate of the correlation will not be exactly 0. Generally for MCMC, the faster the approaches 0 as increases the better. (With very few exceptions, MCMC constructs Markov chains with positive correlation, .)
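As a concrete illustration, the following R snippet shows how such an ACF plot can be produced; theta.samples is a hypothetical vector containing the post-burn-in draws from the Gibbs sampler.

```r
## theta.samples: post-burn-in posterior draws (hypothetical object name).
## acf() estimates and plots the autocorrelations at lags k = 0, 1, 2, ...
acf(theta.samples, lag.max = 20, main = "ACF of posterior samples")
```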
An acf plot for is given based upon a sample of 1000 iterations following a burn-in of 100 iterations. As we can see there is very little dependence with only significantly different to 0. (Note that is always equal to 1.) This is very good. In practice the dependence is usually a lot more marked. We will say more about dependence and the cost of dependent versus independent samples in later weeks.
[Figure: ACF plot of the Gibbs sampler output]
The performance of students in a test is thought to depend upon individual and school variability. In particular the following model is suggested:-
where is the performance of the individual from school and denotes the school effect. The parameters and are assumed to be unknown parameters to be estimated. Whilst are observed, the are unobserved. Therefore we use data augmentation (augment with ) to estimate based upon observations , where .
This is a hierarchical model with students based within schools and the school having an effect on student performance. We could have more hierarchical levels; for example, the schools could be in different cities and we could include a city effect.
Note that given , is independent of and . Moreover, given , and are independent, for . Therefore
(3.1)
Now
(3.2)
Since the performance of students from different schools are independent, we have that
(3.3)
To find the conditional distributions of , we combine the likelihood (3.3) with the priors for , and . For simplicity, we shall assume that the priors on , and are improper uniform priors. That is, and . Therefore, letting and , we have that
Therefore
All that is now required is the conditional distributions of . Note that the ’s are independent with
Therefore, we have that
Note that the mean of is a compromise between the mean of the data and the prior mean with
Furthermore as more data are collected it moves closer to the mean of .
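A sketch of the resulting Gibbs sampler in R is given below. Since the notation is abbreviated in these notes, the sketch assumes the model y_ij | theta_j ~ N(theta_j, sigma2) with theta_j ~ N(mu, tau2), and improper uniform priors on mu, sigma2 and tau2; under these priors the variance parameters have inverse-gamma full conditionals, sampled here as reciprocals of gamma draws. The function name school.gibbs and the data format (a list with one vector of marks per school) are illustrative choices.

```r
## Sketch of a Gibbs sampler for the hierarchical (school) model, assuming
##   y_ij | theta_j ~ N(theta_j, sigma2),  theta_j ~ N(mu, tau2),
## with improper uniform priors on mu, sigma2 and tau2.
## 'y' is a list with one numeric vector of marks per school.
school.gibbs <- function(y, n.iter = 1100) {
  m <- length(y)                 # number of schools
  n.j <- sapply(y, length)       # students per school
  N <- sum(n.j)
  out <- matrix(NA, n.iter, m + 3,
                dimnames = list(NULL, c(paste0("theta", 1:m), "mu", "sigma2", "tau2")))
  ## Initial values
  theta <- sapply(y, mean); mu <- mean(unlist(y))
  sigma2 <- var(unlist(y)); tau2 <- var(theta)
  for (t in 1:n.iter) {
    ## theta_j | rest ~ N((n_j*ybar_j/sigma2 + mu/tau2)/prec_j, 1/prec_j)
    prec <- n.j / sigma2 + 1 / tau2
    mean.j <- (n.j * sapply(y, mean) / sigma2 + mu / tau2) / prec
    theta <- rnorm(m, mean.j, 1 / sqrt(prec))
    ## mu | theta, tau2 ~ N(mean(theta), tau2/m)
    mu <- rnorm(1, mean(theta), sqrt(tau2 / m))
    ## sigma2 | y, theta ~ Inverse-Gamma(N/2 - 1, SS/2) under a uniform prior
    SS <- sum(unlist(mapply(function(yj, thj) (yj - thj)^2, y, theta, SIMPLIFY = FALSE)))
    sigma2 <- 1 / rgamma(1, shape = N / 2 - 1, rate = SS / 2)
    ## tau2 | theta, mu ~ Inverse-Gamma(m/2 - 1, sum((theta - mu)^2)/2)
    tau2 <- 1 / rgamma(1, shape = m / 2 - 1, rate = sum((theta - mu)^2) / 2)
    out[t, ] <- c(theta, mu, sigma2, tau2)
  }
  out
}
```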
Example.
Suppose that a study involves 4 schools. Students from each school take a test and score a mark out of 100. The results are presented in the table below.
Note: The data were generated from the model with and , with the values rounded to give integers. Therefore the ’true’ values are , and .
A Gibbs sampler algorithm was run on the above data set to get a sample of 1100 iterations. After discarding the first 100 iterations, we can use the remaining 1000 iterations to estimate the posterior means and standard deviations of the parameters.
The mean of is almost 5 times the true value of . The reason for this is that there are only four schools and the means for the 4 schools are fairly similar. ( large implies greater precision, i.e. smaller variance.) A time series plot of the 1100 iterations of is given in the figure below. We also see that has a right-skewed distribution, so that the mean is (quite a bit) larger than the posterior mode.
[Figures: Gibbs sampler output for the school performance example]
We consider a probit regression example. Let be binary response variables for a collection of objects with associated covariate measurement . We assume that
where is the standard normal distribution function, is the linear predictor and represents a column vector of regression coefficients. We are interested in the posterior distribution of and we assume a prior for .
In the probit model, the mean (the probability of a ) is given by , hence, it corresponds to the probit link function: .
Note that
Therefore we do not have a nice analytical expression for and resort again to data augmentation.
For the data augmentation it is useful to think how we could simulate data from the probit model with coefficients and covariates . To simulate such that , we could take the following steps:
Simulate .
If , set and otherwise set .
Note that
as required. Given the above construction, the ’s are simple Normal random variables and the ’s are simple deterministic functions of the ’s.
Gibbs sampler
This representation lends itself to efficient simulation using the Gibbs sampler.
The joint posterior of is given by
The conditional posterior distribution of is therefore multivariate normal:
where . This is similar to the block update of in the linear regression in the Week 7 lab session. On the other hand, the conditional posterior distribution for each is truncated normal,
(3.4)
We can therefore create a Gibbs sampler which, at each step
Samples from .
Samples from , for .
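Below is a hedged R sketch of this sampler (the data-augmentation construction often attributed to Albert and Chib). It assumes a flat (improper uniform) prior on beta, so that beta given the latent variables is normal with mean (X'X)^{-1}X'y* and covariance (X'X)^{-1}; if the notes use a proper normal prior, the mean and covariance in the beta update change accordingly. The truncated normal draws use the inverse-CDF method.

```r
## Sketch of the data-augmentation Gibbs sampler for probit regression,
## assuming a flat (improper uniform) prior on beta. X is the n x p design
## matrix (including an intercept column) and y is the 0/1 response vector.
probit.gibbs <- function(y, X, n.iter = 5500) {
  n <- nrow(X); p <- ncol(X)
  XtX.inv <- solve(crossprod(X))          # (X'X)^{-1}: posterior covariance of beta
  R <- chol(XtX.inv)                      # for drawing multivariate normals
  beta <- rep(0, p)
  out <- matrix(NA, n.iter, p, dimnames = list(NULL, colnames(X)))
  for (t in 1:n.iter) {
    ## Latent y*_i ~ N(x_i' beta, 1), truncated to (0, Inf) if y_i = 1 and to
    ## (-Inf, 0] if y_i = 0, drawn via the inverse-CDF method.
    eta <- drop(X %*% beta)
    p0 <- pnorm(0, mean = eta, sd = 1)    # P(y*_i <= 0) for each observation
    u <- ifelse(y == 1, runif(n, p0, 1), runif(n, 0, p0))
    ystar <- qnorm(u, mean = eta, sd = 1)
    ## beta | y* ~ N((X'X)^{-1} X'y*, (X'X)^{-1})
    beta.hat <- XtX.inv %*% crossprod(X, ystar)
    beta <- drop(beta.hat + t(R) %*% rnorm(p))
    out[t, ] <- beta
  }
  out
}
```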
Example.
Response: Occurrence or non-occurrence of infection following birth by Caesarian section.
Covariates:
if Caesarian section was not planned and otherwise;
if there is presence of one or more risk factors, such as diabetes or excessive weight, and otherwise;
if antibiotics were given as prophylaxis and otherwise.
In total: births.
| Covariates | | | Infection: yes | Infection: no |
|---|---|---|---|---|
| 0 | 0 | 0 | 8 | 32 |
| 0 | 0 | 1 | 0 | 2 |
| 0 | 1 | 0 | 28 | 30 |
| 0 | 1 | 1 | 1 | 17 |
| 1 | 0 | 0 | 0 | 9 |
| 1 | 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 23 | 3 |
| 1 | 1 | 1 | 11 | 87 |
The program for implementing the probit model is available in probit.r. I ran the code for 5500 iterations discarding the first 500 iterations as burn-in. The tables below give the estimates of the parameter means and standard deviations along with the probability that a parameter is positive (based on the sample). We can see from the table that the Caesarian section not being planned (emergency) and having risk factors increase the chance of infection, whilst antibiotics (fortunately) reduce the risk of infection.
| Coefficient | Mean | Std. dev. | P(coefficient > 0) |
|---|---|---|---|
| (Intercept) | -0.9854 | 0.2049 | 0 |
| noplan | 0.5063 | 0.2313 | 0.9868 |
| factor | 1.0607 | 0.2386 | 1 |
| antib | -1.7498 | 0.2447 | 0 |
Finally, in the table below, we give the (estimated) posterior probability of becoming infected for each set of covariates. Note that
This can be estimated using samples from by
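As a sketch in R, this Monte Carlo estimate for a given covariate pattern can be computed from the posterior draws; beta.samples is a hypothetical matrix of post-burn-in draws (one row per iteration) and z is the covariate vector including the intercept.

```r
## Average Phi(z' beta) over the posterior draws of beta.
## beta.samples: matrix of post-burn-in draws (rows = iterations).
## z: covariate vector including the intercept, e.g. c(1, 0, 1, 0).
infection.prob <- function(z, beta.samples) mean(pnorm(beta.samples %*% z))
```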
In the table below we compare the posterior probability of infections versus the observed proportion infected for each set of covariates. The table shows generally good agreement between the observed proportions and those given by the model.
| Covariates | | | Infection: estimated | Infection: observed |
|---|---|---|---|---|
| 0 | 0 | 0 | 0.1665 | 0.2000 |
| 0 | 0 | 1 | 0.0047 | 0.0000 |
| 0 | 1 | 0 | 0.5300 | 0.4828 |
| 0 | 1 | 1 | 0.0525 | 0.0556 |
| 1 | 0 | 0 | 0.3202 | 0.0000 |
| 1 | 0 | 1 | 0.0158 | – |
| 1 | 1 | 0 | 0.7149 | 0.8846 |
| 1 | 1 | 1 | 0.1243 | 0.1122 |
We have introduced hierarchical models above and here we consider an alternative parameterisation of the hierarchical Gaussian model. Reparameterisation can be a useful tool in statistics for improving estimation of the model parameters. The reparameterisations presented below are based upon ideas from [2]. Let be observed and be unobserved, with
where and are assumed known and and are independent standard normal random variables (). The parameter is assumed to be unknown and its posterior distribution is of interest. We shall consider the case where for all and consequently drop the subscript . Therefore the model can be rewritten as
Therefore this is a simplified version of the hierarchical model studied above.
An alternative parameterisation is the non-centered parameterisation proposed by [2],
Note that and are a priori independent, but conditional upon the data they are dependent.
Both (centered and non-centered) parameterisations permit a (data-augmentation) Gibbs sampler as outlined below. (Throughout we shall assign an improper, uniform prior to ; .)
Centered algorithm
The joint distribution of given , satisfies
since and given , is independent of .
Thus
Therefore
This gives
(Note that if has pdf , then .)
For , depends only upon and . Therefore
This gives
It is straightforward using the above conditional distributions to construct a Gibbs sampling algorithm which alternates between updating and updating .
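A sketch of the centered algorithm in R is given below. Since the symbols are abbreviated in these notes, it assumes the model Y_i | X_i ~ N(X_i, sigY^2), X_i ~ N(theta, sigX^2), with sigX and sigY known and a flat prior on theta; the names centered.gibbs, sigX and sigY are illustrative.

```r
## Sketch of the centered data-augmentation Gibbs sampler, assuming
##   Y_i | X_i ~ N(X_i, sigY^2),  X_i | theta ~ N(theta, sigX^2),
## with sigX, sigY known and a flat prior on theta.
centered.gibbs <- function(y, sigX, sigY, n.iter = 1000, theta.init = mean(y)) {
  n <- length(y)
  theta <- numeric(n.iter + 1); theta[1] <- theta.init
  prec <- 1 / sigY^2 + 1 / sigX^2          # conditional precision of each X_i
  for (t in 1:n.iter) {
    ## X_i | y_i, theta ~ N((y_i/sigY^2 + theta/sigX^2)/prec, 1/prec)
    x <- rnorm(n, (y / sigY^2 + theta[t] / sigX^2) / prec, 1 / sqrt(prec))
    ## theta | X ~ N(mean(X), sigX^2/n)
    theta[t + 1] <- rnorm(1, mean(x), sigX / sqrt(n))
  }
  theta
}
```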
Non-centered algorithm
The joint distribution of given , satisfies
since and is a priori independent of .
Thus
Therefore, letting , we have that
This gives
For , depends only upon and . Therefore
This gives
It is straightforward using the above conditional distributions to construct a Gibbs sampling algorithm which alternates between updating and updating .
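The corresponding non-centered sampler, under the same assumed model and writing X_i = theta + Xtilde_i with Xtilde_i ~ N(0, sigX^2) a priori, can be sketched as follows.

```r
## Sketch of the non-centered Gibbs sampler for the same assumed model,
## with Xtilde_i = X_i - theta ~ N(0, sigX^2) a priori.
noncentered.gibbs <- function(y, sigX, sigY, n.iter = 1000, theta.init = mean(y)) {
  n <- length(y)
  theta <- numeric(n.iter + 1); theta[1] <- theta.init
  prec <- 1 / sigY^2 + 1 / sigX^2
  for (t in 1:n.iter) {
    ## Xtilde_i | y_i, theta ~ N(((y_i - theta)/sigY^2)/prec, 1/prec)
    xt <- rnorm(n, ((y - theta[t]) / sigY^2) / prec, 1 / sqrt(prec))
    ## theta | Xtilde, y ~ N(mean(y - Xtilde), sigY^2/n)
    theta[t + 1] <- rnorm(1, mean(y - xt), sigY / sqrt(n))
  }
  theta
}
```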
Example.
To illustrate the two algorithms, we apply both algorithms to a data set consisting of observations from with , and . Both algorithms were run to generate a sample of size 1000 from the posterior distribution of . The two algorithms are obtaining samples from the same posterior distribution and should therefore give the same answers (subject to Monte Carlo error, i.e. random variation). The estimated means (standard deviations) of based on the two samples are and for the centered and non-centered algorithms, respectively. Below are trace plots of 100 iterations of each algorithm followed by the acf plots. We can clearly see from the acf plots that the non-centered algorithm is preferable to the centered algorithm.
[Figures: trace plots (100 iterations) and ACF plots for the centered and non-centered algorithms]
For the above model we can actually show analytically that for the given values of and , the non-centered algorithm is preferable by calculating the autocorrelation for both models. Note that it is actually fairly easy to show that the autocorrelation is the autocorrelation to the power for the above model. Let and denote the autocorrelations for the centered and non-centered algorithms, respectively. Then we can show that and .
To prove the above, we need to compute , for which we need the stationary distribution of . Note that
Therefore
giving
In other words, MCMC is not needed for the case, where for all , but MCMC is required if any .
Thus . Therefore we need to compute . This is most easily done by showing that
where is a random variable independent of . In that case,
and .
For the centered algorithm, note that
Hence
Therefore .
A similar calculation holds for . Thus if , the centered algorithm is preferable, otherwise the non-centered algorithm is preferable.
Write a Gibbs sampler algorithm to apply to the school performance data given in Section 3.3 of the lecture notes. Compare your output to that given in the lecture notes.
A recap of the data, the model and the necessary conditional distributions is given below.
Performance of 25 students in a test selected from 4 schools.
Model:
where is the performance of the individual from school and denotes the school effect. Note that the ’s are not observed.
Conditional distributions for the Gibbs sampler:-
This extends the multiple regression example studied in last week's lab to allow for outliers.
When there are outliers suspected in the data, the sensible thing to do is to take a closer look at the data and try and understand where the lack of fit may be coming from. There are several possibilities: e.g. the outliers may be due to errors in recording the data, or it may be that there were certain features corresponding to those observations that our model does not take into account. The latter may be solved e.g. by adding relevant explanatory variables to the model, or by changing other aspects of the model.
Now suppose that none of the above applies to your model: that is, there are no recording errors in the data that you know of, you have no additional explanatory variables that you could possibly use, and you cannot think of a more suitable model specification. In that case, a partial solution is to use a sampling distribution that is more robust to outliers than the Normal distribution. A distribution with thicker tails than the Normal distribution allows for larger departures from the mean and, therefore, outliers are likely to have less impact on the resulting posterior inference on and . That is what we mean by a sampling distribution that is more resistant to outliers than the Normal distribution.
A distribution with this property is the Cauchy distribution. Using a Cauchy distribution leads to the following sampling density for observation :
(3.5)
Consider the same prior distributions for and used in Week 7.
Write down, up to a proportionality constant, the joint posterior density of .
Write down, up to a proportionality constant, the conditional posterior densities of , and (that is, , and ). Is our new model amenable to Gibbs sampling? Explain why or why not.
Now consider the following idea: The Cauchy distribution in (3.5) can also be interpreted as a Normal distribution with an unknown (random) precision parameter.
More specifically, assuming
(3.6)
(3.7)
is, in fact, equivalent to assuming the sampling model in (3.5).
Aside. The Cauchy distribution is a -distribution with degree of freedom, and (abusing notation)
The statement follows since is the same as .
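As a quick (optional) check of this scale-mixture representation, the following R snippet compares simulated draws of a Normal with Gamma(1/2, 1/2) random precision against Cauchy quantiles; the location 0 and scale 1 are illustrative.

```r
## Simulation check that a Normal with Gamma(1/2, 1/2) random precision is
## Cauchy (a t-distribution with 1 degree of freedom). Location 0, scale 1.
set.seed(1)
n <- 1e5
lambda <- rgamma(n, shape = 1/2, rate = 1/2)    # latent precisions
x <- rnorm(n, mean = 0, sd = 1 / sqrt(lambda))  # Normal(0, 1/lambda) draws
qqplot(qcauchy(ppoints(n)), x, main = "Scale-mixture draws vs Cauchy quantiles")
abline(0, 1)
```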
Thus, we can augment the parameters with the variables (note that there is one variable per observation) to obtain the posterior density
Find the conditional posterior density of each , , , where denotes the vector excluding .
Using an almost identical argument to the previous lab, show that
where ,
and
Using an almost identical argument to the previous lab (and the previous question), show that
where ,
and
Using an almost identical argument to the previous lab, show that
Write down a Gibbs sampling algorithm to compute the joint posterior distribution of .
Make a copy of your Gibbs sampling function from last week's lab and alter this to include the , producing a Gibbs sampler for the Cauchy model.
How does the inference on and compare with that obtained in Lab 2 where you used a Normal sampling model?
Suppose that we have independent and identically distributed realisations from the random variable , where is a zero-inflated Poisson. That is,
where are the parameters of interest.
This is an example of a mixture distribution, where is distributed either according to distribution or distribution .
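To make the mixture construction concrete, the following R snippet simulates zero-inflated Poisson data; the mixing probability p and rate lambda used here are illustrative values only, and the convention that p is the probability of a structural zero is an assumption.

```r
## Simulate from a zero-inflated Poisson: with probability p the observation
## is a structural zero, otherwise it is Poisson(lambda).
rzip <- function(n, p, lambda) {
  z <- rbinom(n, size = 1, prob = p)   # z_i = 1: structural zero
  ifelse(z == 1, 0L, rpois(n, lambda))
}
x <- rzip(1000, p = 0.3, lambda = 2)   # illustrative parameter values
table(x)
```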
For , let denote the total number of ’s equal to . Then is a sufficient statistic for obtaining the posterior distribution of . That is,
Suppose that and .
Show that
Let denote the total number of ’s which come from distribution . Write down .
Hence, write down up to proportionality.
Suppose that and . Find the following conditional distributions:-
;
;
.
Write a Gibbs sampler to obtain samples from .
An ecologist is monitoring the number of field mice at sites. Each site is observed for days, with the number of mice caught at each site recorded daily. Let denote the number of mice caught at site on day . The data with are assumed to arise from the following hierarchical model:
Note that are unobserved and let denote a realization from , corresponding to .
Suppose that a prior is assigned to .
Write down the likelihood for given .
Find the conditional probability distribution of given .
For , find and identify the conditional probability distribution of .
Describe a Gibbs sampler algorithm for obtaining samples from , the posterior distribution of given the observed data .