3 Week 8: Gibbs Sampler – Data augmentation

3.1 Introduction

Data augmentation is a useful tool for obtaining samples from the posterior distribution of parameters where π(𝜽|𝐱) is not in a convenient form for analysis. The Gibbs sampler with data augmentation is primarily used in the following two situations:

  1. In real world problems there is often missing data. Although in theory the distribution of the observed data can be obtained by integrating out the missing data, this is often difficult or impossible to do.

  2. The likelihood of the data is not tractable in its current form, but conditional upon the collection of unobserved (extra) data, the likelihood becomes tractable. This corresponds to the situation we observed with the EM algorithm, and there are similarities (as well as obvious differences) between the EM algorithm and the Gibbs sampler in data augmentation problems.

More generally, data augmentation is used extensively within MCMC (Markov chain Monte Carlo) algorithms to assist with obtaining samples from π(𝜽|𝐱).

The generic set up is as follows. Let 𝐱 denote the observed data and let 𝜽 denote the model parameters. Then

π(𝜽|𝐱) ∝ π(𝐱|𝜽)×π(𝜽)

and this can be difficult to work with if the likelihood π(𝐱|𝜽) is not in a convenient form. Now suppose that there is extra information 𝐲 such that the joint distribution of 𝐱 and 𝐲, π(𝐱,𝐲|𝜽), is more convenient to work with. We can then look to construct a Gibbs sampler (or alternative MCMC algorithm) to obtain samples from π(𝜽,𝐲|𝐱). It is then trivial to obtain samples from the marginal distribution π(𝜽|𝐱) by simply ignoring the 𝐲 values; in other words, we keep only the 𝜽 components of the samples, which are draws from the marginal distribution π(𝜽|𝐱). Note that

π(𝜽,𝐲|𝐱) = π(𝐱,𝐲|𝜽)π(𝜽)/π(𝐱)
          ∝ π(𝐱,𝐲|𝜽)π(𝜽).

Then the Gibbs sampler alternates between updating the parameters and augmented data.

  1. Update 𝜽 given 𝐱 and 𝐲, i.e. use π(𝜽|𝐱,𝐲).

  2. Update 𝐲 given 𝐱 and 𝜽, i.e. use π(𝐲|𝐱,𝜽).

Both the updates of 𝜽 and 𝐲 will often be broken down into a number of steps. Note the similarities to the EM algorithm with step 1 replacing the M-step (updating the parameters given the observed and augmented data) and step 2 replacing the E-step (updating the augmented data given the observed data and parameters).

We illustrate data augmentation within the Gibbs sampler using a range of examples.

3.2 Normal mixture

We begin by illustrating data augmentation Gibbs sampling with a very simple example, which is a simplified version of the Normal mixtures introduced in the Week 6 notes for the EM algorithm.

Suppose that X1, X2, …, Xn are iid (independent and identically distributed) from the mixture density

f(x) = (1/(2√(2π))) (e^{-x²/2} + e^{-(x-θ)²/2}).

That is,

X ~ N(0,1) with prob. 0.5,   X ~ N(θ,1) with prob. 0.5.

Thus the mixture can be constructed as follows. For each observation toss a fair coin and let Y denote the outcome of the coin toss with Y=0 if the coin shows a head and Y=1 otherwise. If the ith coin toss shows a head (Yi=0) draw Xi from N(0,1), otherwise draw Xi from N(θ,1).
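
For concreteness, simulating data from this mixture in R follows the coin-toss construction directly; the sketch below (the variable names are my own, with θ = 2 assumed purely for illustration) generates n = 100 observations.

set.seed(1)
n     <- 100
theta <- 2                                    # assumed value, for illustration only
y     <- rbinom(n, 1, 0.5)                    # fair coin tosses: y = 1 corresponds to a tail
x     <- rnorm(n, mean = y * theta, sd = 1)   # N(0,1) draw if y = 0, N(theta,1) draw if y = 1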

The likelihood can be written explicitly:-

L(θ;𝐱) = π(𝐱|θ) = ∏_{i=1}^n (1/(2√(2π))) (e^{-xi²/2} + e^{-(xi-θ)²/2}),

where 𝐱=(x1, x2, …, xn). It is difficult to maximise (for the MLE) or to compute the posterior distribution for θ using the likelihood in this form. However, if we knew the results of the coin tosses (i.e. if we knew Y1, Y2, …, Yn), then we would know which observations come from N(θ,1) and the posterior distribution (or MLE) would be straightforward to derive.

Let 𝐲=(y1, y2, …, yn) denote the outcome of the n coin tosses and let 𝒜 = {i : yi=1}, the set of coin tosses that are tails. Thus for i ∈ 𝒜, Xi is drawn from N(θ,1).

Original likelihood:-

L(θ;𝐱) = ∏_{i=1}^n (1/(2√(2π))) (e^{-xi²/2} + e^{-(xi-θ)²/2}).

Augmented likelihood:-

L(θ;𝐱,𝐲) = π(𝐱,𝐲|θ) = ∏_{i∈𝒜} ((1/2) × (1/√(2π)) e^{-(xi-θ)²/2}) × ∏_{i∈𝒜^C} ((1/2) × (1/√(2π)) e^{-xi²/2})
          ∝ exp(-Σ_{i∈𝒜} (xi-θ)²/2).

Therefore if we take a N(0,1) prior for θ, we have that

π(θ|𝐱,𝐲) ∝ L(θ;𝐱,𝐲) × π(θ)
          ∝ exp(-Σ_{i∈𝒜} (xi-θ)²/2) × exp(-θ²/2)
          = exp(-(1/2){mθ² - 2θΣ_{i∈𝒜} xi + Σ_{i∈𝒜} xi²}) × exp(-(1/2)θ²)
          ∝ exp(-(1/2){(m+1)θ² - 2θΣ_{i∈𝒜} xi}).

Thus the posterior distribution of θ is

N((1/(m+1)) Σ_{i∈𝒜} xi, 1/(m+1)),

where m=|𝒜|.

On the other hand, let Yi denote the outcome of the ith coin toss. Then it is straightforward to show that

P(Yi=1|θ, everything else) = P(Yi=1, xi)/f(xi)
  = P(Yi=1, xi)/{P(Yi=1, xi) + P(Yi=0, xi)}
  = {exp(-(xi-θ)²/2)/(2√(2π))}/{exp(-(xi-θ)²/2)/(2√(2π)) + exp(-xi²/2)/(2√(2π))}
  = exp(-(xi-θ)²/2)/{exp(-(xi-θ)²/2) + exp(-xi²/2)}.

The Gibbs sampling algorithm is as follows (an R sketch is given after the list):-

  1. Choose an initial value θ0. (A reasonable starting value is θ0 = (1/n)Σ_{i=1}^n xi, the sample mean.)

  2. For i=1,2,…,n, set Yi=1 with probability exp(-(xi-θ)²/2)/{exp(-(xi-θ)²/2) + exp(-xi²/2)}, and set Yi=0 otherwise. This updates the set 𝒜.

  3. Sample θ ~ N((1/(m+1)) Σ_{i∈𝒜} xi, 1/(m+1)).
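
A minimal R sketch of this algorithm (the function name gibbs_mixture and its arguments are mine, not from the notes) is:

# Data-augmentation Gibbs sampler for the two-component normal mixture.
# x: observed data vector; n_iter: number of iterations; returns the theta draws.
gibbs_mixture <- function(x, n_iter = 1000) {
  n     <- length(x)
  theta <- mean(x)                 # step 1: initial value (sample mean)
  out   <- numeric(n_iter)
  for (t in 1:n_iter) {
    # step 2: update the coin tosses y_1, ..., y_n given theta
    p1 <- exp(-(x - theta)^2 / 2)
    p0 <- exp(-x^2 / 2)
    y  <- rbinom(n, 1, p1 / (p1 + p0))
    # step 3: update theta given the set A = {i : y_i = 1}
    m     <- sum(y)
    theta <- rnorm(1, sum(x[y == 1]) / (m + 1), sqrt(1 / (m + 1)))
    out[t] <- theta
  }
  out
}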

3.3 Genetic linkage example

We revisit the genetic linkage example from Week 6 (the EM algorithm). Whilst we are now focused on the posterior distribution rather than the MLE of the parameter θ, we use exactly the same data augmentation as before.

The data consist of 197 animals divided into four genetic categories, labeled 1 through to 4, see [1]. The probabilities that an animal belongs to each of the four categories are 1/2+θ/4, (1-θ)/4, (1-θ)/4 and θ/4, respectively. Let 𝐱=(x1,x2,x3,x4)=(125,18,20,34) denote the total number of animals in each category.

We are interested in estimating θ. Note that for the category probabilities to be valid we require that 0<θ<1. Therefore we shall assign an (uninformative) uniform prior to θ. This is a multinomial experiment (4 different outcomes), so

L(θ;𝐱) = ((x1+x2+x3+x4)!/(x1!x2!x3!x4!)) p1^{x1} p2^{x2} p3^{x3} p4^{x4}
        = (197!/(125!18!20!34!)) (1/2+θ/4)^{125} ((1/4)(1-θ))^{18} ((1/4)(1-θ))^{20} (θ/4)^{34}.

The posterior distribution of θ is not of a particularly nice form,

p(θ|𝐱) ∝ (2+θ)^{x1} (1-θ)^{x2+x3} θ^{x4}.

Suppose that the observed cell x1 could be divided into two subcategories A and B. Suppose that y (x1-y) is the number of animals in subcategory A (B) with cell probability pA=θ/4 (pB=1/2). This would give an augmented data set (𝐱,y). Then

L(θ;𝐱,y) = ((x1+x2+x3+x4)!/((x1-y)!y!x2!x3!x4!)) pA^{y} pB^{x1-y} p2^{x2} p3^{x3} p4^{x4}
          ∝ (θ/4)^{y} ((1/4)(1-θ))^{18} ((1/4)(1-θ))^{20} (θ/4)^{34}.

Thus

p(θ|𝐱,y) ∝ (1-θ)^{38} θ^{34+y}.

Therefore the posterior density is proportional to a Beta density giving

θ|𝐱,y ~ Beta(y+35, 39).

What is y|θ,𝐱?

There are 125 (=x1) animals in categories A and B.

The conditional probability that the animal belongs to category A given that the animal belongs to category 1 is

P(Belongs to category A|Belongs to category 1) = P(Belongs to category A and to category 1)/P(Belongs to category 1)
  = P(Belongs to category A)/P(Belongs to category 1)
  = (θ/4)/(1/2 + θ/4) = θ/(2+θ).

Thus y ~ Bin(125, θ/(2+θ)).

Therefore the Gibbs sampler iterates between the following two equations:

θ|𝐱,y ~ Beta(y+x4+1, x2+x3+1) = Beta(y+35, 39)
y|θ,𝐱 ~ Bin(x1, θ/(θ+2)) = Bin(125, θ/(θ+2))

(Note that the above is for general 𝐱, not just the data given.)
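
A minimal R sketch of this sampler for the given data (the function name linkage_gibbs is mine, not from the notes) is:

# Gibbs sampler for the genetic linkage data x = (125, 18, 20, 34),
# using the initial values theta = 0.5 and y = 25 from the notes.
linkage_gibbs <- function(n_iter = 1100, theta0 = 0.5, y0 = 25) {
  x1 <- 125; x2 <- 18; x3 <- 20; x4 <- 34
  theta <- theta0; y <- y0
  out <- matrix(NA, n_iter, 2, dimnames = list(NULL, c("theta", "y")))
  for (t in 1:n_iter) {
    theta <- rbeta(1, y + x4 + 1, x2 + x3 + 1)    # theta | x, y
    y     <- rbinom(1, x1, theta / (theta + 2))   # y | x, theta
    out[t, ] <- c(theta, y)
  }
  out
}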

A sample of size 100 from the posterior distribution was obtained using the above Gibbs sampler and the output is presented below. Initial values θ=0.5 and y=25 were chosen.

[Unnumbered figure: Gibbs sampler output for θ]

The algorithm appears to converge immediately, but just to be safe I shall disregard the first 20 iterations as burn-in. The following summary statistics are then obtained for θ:-
Estimate of posterior mean, 0.6258.
Estimate of posterior variance, 0.00281 (= 0.053²).
Posterior density plots (obtained by kernel smoothing) suggest a modal value of about 0.645.

It is important to note that the realisations of θ are not independent. This is clear from their construction, via the Markov chain. One useful summary of dependence is an ACF (autocorrelation function) plot. The acf plot (in R) gives the estimated value of Corr[θ0,θk] for k=0,1,…, the correlation between samples from the posterior distribution which differ by a lag of k iterations. For independent samples, Corr[θ0,θk]=0 for all k≥1, although an estimate of the correlation will not be exactly 0. Generally for MCMC, the faster Corr[θ0,θk] approaches 0 as k increases, the better. (With very few exceptions, MCMC constructs Markov chains with positive correlation, Corr[θ0,θk]>0.)

An acf plot for θ is given below, based upon a sample of 1000 iterations following a burn-in of 100 iterations. As we can see there is very little dependence, with only Corr[θ0,θ1] significantly different from 0. (Note that Corr[θ0,θ0] is always equal to 1.) This is very good. In practice the dependence is usually a lot more marked. We will say more about dependence and the cost of dependent versus independent samples in later weeks.

[Unnumbered figure: acf plot for θ]
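
For reference, the acf plot can be produced directly from the stored draws; for example, using the hypothetical linkage_gibbs sketch above:

out         <- linkage_gibbs(1100)
theta_draws <- out[101:1100, "theta"]   # discard the first 100 iterations as burn-in
acf(theta_draws)                        # estimated Corr[theta_0, theta_k] against lag k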

3.4 Hierarchical models

The performance of students in a test is thought to depend upon individual and school variability. In particular the following model is suggested:-

Xi ~ N(θ, 1/τ)   (i=1,2,…,m)
Yij ~ N(Xi, 1/λ)   (i=1,2,…,m; j=1,2,…,ni),

where Yij is the performance of the jth individual from school i and Xi denotes the ith school effect. The parameters θ, λ and τ are assumed to be unknown parameters to be estimated. Whilst {Yij} are observed, the {Xi} are unobserved. Therefore we use data augmentation (augmenting with 𝐱=(x1,x2,…,xm)) to estimate (θ,λ,τ) based upon observations 𝐲=(𝐲1,𝐲2,…,𝐲m), where 𝐲i=(yi1,yi2,…,yini).

This is a hierarchical model with students based within schools and the school having an effect on student performance. We could have more hierarchical levels; for example, the schools could be in different cities and we could have a city effect included.

Note that given xi, 𝐲i is independent of θ and τ. Moreover, given xi, yij and yik are independent, for j≠k. Therefore

f(𝐲i|xi,θ,τ,λ) = ∏_{j=1}^{ni} f(yij|xi,λ)     (3.1)
               = ∏_{j=1}^{ni} √(λ/(2π)) exp(-(λ/2)(yij-xi)²).

Now

f(xi|θ,τ) = √(τ/(2π)) exp(-(τ/2)(xi-θ)²).     (3.2)

Since the performances of students from different schools are independent, we have that

L(θ,τ,λ;𝐱,𝐲) = ∏_{i=1}^{m} {√(τ/(2π)) exp(-(τ/2)(xi-θ)²) × ∏_{j=1}^{ni} √(λ/(2π)) exp(-(λ/2)(yij-xi)²)}.     (3.3)

To find the conditional distributions of θ, τ, λ, x1, x2, …, xm, we combine the likelihood (3.3) with the priors for θ, τ and λ. For simplicity, we shall assume that the priors on θ, τ and λ are improper uniform priors. That is, π(θ) ∝ 1 (θ ∈ ℝ) and π(λ), π(τ) ∝ 1 (λ, τ ∈ ℝ⁺). Therefore, letting x̄ = (1/m)Σ_{i=1}^m xi and N = Σ_{i=1}^m ni, we have that

f(τ|λ,θ,𝐱,𝐲) ∝ τ^{m/2} exp(-τ × (1/2)Σ_{i=1}^m (xi-θ)²)
f(θ|λ,τ,𝐱,𝐲) ∝ exp(-τ × (1/2)Σ_{i=1}^m (xi-θ)²)
             ∝ exp(-τ × (1/2)(mθ² - 2θm x̄))
f(λ|τ,θ,𝐱,𝐲) ∝ λ^{N/2} exp(-λ × (1/2)Σ_{i=1}^m Σ_{j=1}^{ni} (yij-xi)²).

Therefore

τ|λ,θ,𝐱,𝐲 ~ Gamma(m/2+1, (1/2)Σ_{i=1}^m (xi-θ)²)
θ|λ,τ,𝐱,𝐲 ~ N(x̄, 1/(mτ))
λ|τ,θ,𝐱,𝐲 ~ Gamma(N/2+1, (1/2)Σ_{i=1}^m Σ_{j=1}^{ni} (yij-xi)²).

All that is now required are the conditional distributions of x1, x2, …, xm. Note that the xi's are conditionally independent with

f(xi|λ,θ,τ,𝐱_{-i},𝐲) = f(xi|λ,θ,τ,𝐲i)
  ∝ exp(-τ × (1/2)(xi-θ)²) × exp(-λ × (1/2)Σ_{j=1}^{ni} (yij-xi)²)
  ∝ exp(-(1/2){(τ+niλ)xi² - 2xi(τθ + λΣ_{j=1}^{ni} yij)}).

Therefore, we have that

xi|λ,θ,τ,𝐱_{-i},𝐲 ~ N((τθ + λΣ_{j=1}^{ni} yij)/(τ+niλ), 1/(τ+niλ)).

Note that the mean of xi is a compromise between the mean of the data and the prior mean with

(τθ + λΣ_{j=1}^{ni} yij)/(τ+niλ) = (τ/(τ+niλ))θ + (niλ/(τ+niλ))ȳi.

Furthermore, as more data are collected, the mean of xi moves closer to ȳi, the mean of 𝐲i.
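
One possible R sketch of a full sweep over these conditional distributions (function and argument names are mine; the scores are assumed to be supplied as a list y with one numeric vector per school) is:

# Gibbs sampler for the school hierarchical model with improper uniform priors.
gibbs_schools <- function(y, n_iter = 1100) {
  m  <- length(y)                      # number of schools
  ni <- sapply(y, length)              # students per school
  N  <- sum(ni)
  x     <- sapply(y, mean)             # initialise school effects at the school means
  theta <- mean(x)
  out   <- matrix(NA, n_iter, 3, dimnames = list(NULL, c("theta", "tau", "lambda")))
  for (t in 1:n_iter) {
    tau    <- rgamma(1, m / 2 + 1, rate = sum((x - theta)^2) / 2)
    theta  <- rnorm(1, mean(x), sqrt(1 / (m * tau)))
    lambda <- rgamma(1, N / 2 + 1,
                     rate = sum(sapply(1:m, function(i) sum((y[[i]] - x[i])^2))) / 2)
    x <- sapply(1:m, function(i) {
      prec <- tau + ni[i] * lambda
      rnorm(1, (tau * theta + lambda * sum(y[[i]])) / prec, sqrt(1 / prec))
    })
    out[t, ] <- c(theta, tau, lambda)
  }
  out
}

For the data below it could be called as, say, gibbs_schools(list(school1, school2, school3, school4)), where each schoolk is the vector of scores for school k.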

Example.

Suppose that a study involves 4 schools. Students from each school take a test and score a mark out of 100. The results are presented in the table below.

School 1  School 2  School 3  School 4
60        55        54        61
62        52        56        59
58        54        56        58
60        57        56        61
59        52        56        59
58        54        63        58
54

Note: The data were generated from Xi ~ N(60, 4²) and Yij ~ N(Xi, 2²), with the Yij rounded to give integer values. Therefore the 'true' values are θ=60, τ=0.0625 and λ=0.25.

A Gibbs sampler algorithm was run on the above data set to get a sample of 1100 iterations. After discarding the first 100 iterations, we can use the remaining 1000 iterations to estimate the posterior means and standard deviations of the parameters.

       θ        τ        λ
mean   57.426   0.300    0.345
sd     1.344    0.267    0.107

The mean of τ is almost 5 times the true value of τ. The reason for this is that there are only four schools and the means for the 4 schools are fairly similar. (τ large implies greater precision, i.e. smaller variance.) A time series plot of the 1100 iterations of τ is given in the figure below. We also see that τ has a right-skewed distribution, so that the mean is (quite a bit) larger than the posterior mode.

[Unnumbered figure: time series plot of the 1100 iterations of τ]

[Unnumbered figure]

3.5 Probit regression

We consider a probit regression example. Let yi ∈ {0,1}, i=1,…,n, be binary response variables for a collection of n objects with associated covariate measurements 𝐱i. We assume that

yi ~ Bernoulli(Φ(ηi))
ηi = 𝐱iᵀ𝜷,

where Φ(·) is the standard normal distribution function, ηi is the linear predictor and 𝜷 represents a (p×1) column vector of regression coefficients. We are interested in the posterior distribution of 𝜷 and we assume a N(𝝁0,𝐂0) prior for 𝜷.

In the probit model, the mean (the probability of a 1) is given by μi=Φ(ηi); hence it corresponds to the probit link function: g(μi)=Φ⁻¹(μi).

Note that

L(𝜷;𝐲) = ∏_{i=1}^n Φ(𝐱iᵀ𝜷)^{yi} {1-Φ(𝐱iᵀ𝜷)}^{1-yi}.

Therefore we do not have a nice analytical expression for π(𝜷|𝐲) ∝ L(𝜷;𝐲)π(𝜷) and resort again to data augmentation.

For the data augmentation it is useful to think how we could simulate data 𝐲 from the probit model with coefficients 𝜷 and covariates 𝐱1,,𝐱n. To simulate yi such that P(yi=1)=Φ(𝐱iT𝜷), we could take the following steps:

  1. Simulate zi ~ N(𝐱iᵀ𝜷, 1).

  2. If zi > 0, set yi = 1 and otherwise set yi = 0.

Note that

P(yi=1) = P(zi>0) = P(N(𝐱iᵀ𝜷,1) > 0)
        = P(N(0,1) > -𝐱iᵀ𝜷) = Φ(𝐱iᵀ𝜷),

as required. Given the above construction, the zi's are simple Normal random variables and the yi's are simple deterministic functions of the zi's.
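
This two-step construction is easy to check by simulation; a small sketch (with an assumed design matrix and coefficient vector, chosen only for illustration) is:

set.seed(1)
n    <- 10000
X    <- cbind(1, rbinom(n, 1, 0.5))                 # assumed covariates: intercept + one binary
beta <- c(-0.5, 1)                                  # assumed coefficients
z    <- rnorm(n, mean = drop(X %*% beta), sd = 1)   # latent normals z_i ~ N(x_i' beta, 1)
y    <- as.integer(z > 0)                           # y_i = 1 if z_i > 0
mean(y[X[, 2] == 1])                                # should be close to pnorm(-0.5 + 1) = 0.69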

Gibbs sampler
This representation lends itself to efficient simulation using the Gibbs sampler.
The joint posterior of (𝜷,z1,,zn) is given by

π(𝜷,z1,…,zn|𝐲) ∝ ∏_{i:yi=1} exp[-(1/2)(zi-𝐱iᵀ𝜷)²] I[zi>0]
                × ∏_{i:yi=0} exp[-(1/2)(zi-𝐱iᵀ𝜷)²] I[zi<0]
                × exp[-(1/2)(𝜷-𝝁0)ᵀ𝐂0⁻¹(𝜷-𝝁0)].

The conditional posterior distribution of 𝜷 is therefore multivariate normal:

𝜷|𝐳,𝐲 ~ N((𝐂0⁻¹+𝐗ᵀ𝐗)⁻¹(𝐂0⁻¹𝝁0+𝐗ᵀ𝐳), (𝐂0⁻¹+𝐗ᵀ𝐗)⁻¹),

where 𝐳=(z1,…,zn)ᵀ. This is similar to the block update of 𝜷 in the linear regression in the Week 7 lab session. On the other hand, the conditional posterior distribution for each zi is truncated normal,

zi|𝜷,𝐲 ~ N(𝐱iᵀ𝜷,1) I[zi>0] if yi=1, and zi|𝜷,𝐲 ~ N(𝐱iᵀ𝜷,1) I[zi≤0] otherwise.     (3.4)

We can therefore create a Gibbs sampler which, at each step (an R sketch is given after the list):

  • Samples from 𝜷|𝐳,𝐲.

  • Samples from zi|𝜷,𝐲, for i=1,,n.
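
A minimal R sketch of such a sampler (this is my own sketch, not the probit.r program referred to below; X, y, mu0 and C0 are the design matrix, response vector, prior mean and prior covariance) is:

# Data-augmentation Gibbs sampler for probit regression.
probit_gibbs <- function(y, X, mu0, C0, n_iter = 5500) {
  n <- nrow(X); p <- ncol(X)
  C0_inv   <- solve(C0)
  post_var <- solve(C0_inv + crossprod(X))       # (C0^-1 + X'X)^-1, fixed across iterations
  R        <- chol(post_var)
  beta <- rep(0, p)
  out  <- matrix(NA, n_iter, p)
  for (t in 1:n_iter) {
    eta <- drop(X %*% beta)
    # z_i | beta, y: N(eta_i, 1) truncated to (0, Inf) if y_i = 1 and to (-Inf, 0] otherwise,
    # sampled by inverting the truncated normal cdf
    p0 <- pnorm(-eta)                            # P(N(eta_i, 1) <= 0)
    u  <- runif(n)
    z  <- eta + qnorm(ifelse(y == 1, p0 + u * (1 - p0), u * p0))
    # beta | z, y: multivariate normal block update
    post_mean <- post_var %*% (C0_inv %*% mu0 + crossprod(X, z))
    beta <- drop(post_mean + t(R) %*% rnorm(p))
    out[t, ] <- beta
  }
  out
}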

Example.

Response: Occurrence or non-occurrence of infection following birth by Caesarian section.

Covariates:

x1=1 if Caesarian section was not planned and x1=0 otherwise;

x2=1 if there is presence of one or more risk factors, such as diabetes or excessive weight, and x2=0 otherwise;

x3=1 if antibiotics were given as prophylaxis and x3=0 otherwise.

In total: 251 births.

Covariates Infection
x1 x2 x3 yes no
0 0 0 8 32
0 0 1 0 2
0 1 0 28 30
0 1 1 1 17
1 0 0 0 9
1 0 1 0 0
1 1 0 23 3
1 1 1 11 87
Table 1: Data on Caesarian birth study

The program for implementing the probit model is available in probit.r. I ran the code for 5500 iterations discarding the first 500 iterations as burn-in. The tables below give the estimates of the parameter means and standard deviations along with the probability that a parameter is positive (based on the sample). We can see from the table that the Caesarian section not being planned (emergency) and having risk factors increase the chance of infection, whilst antibiotics (fortunately) reduce the risk of infection.

Coefficients: mean Std probability
(Intercept) -0.9854 0.2049 0
noplan 0.5063 0.2313 0.9868
factor 1.0607 0.2386 1
antib -1.7498 0.2447 0

Finally, in the table below, we give the (estimated) posterior probability of becoming infected for each set of covariates. Note that

P(Infected|𝐱) = ∫ Φ(𝐱ᵀ𝜷) π(𝜷|𝐲) d𝜷.

This can be estimated using N samples from 𝜷|𝐲 by

(1/N) Σ_{j=1}^N Φ(𝐱ᵀ𝜷j).
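
In R, assuming the posterior draws are stored row-wise in a matrix beta_samples (e.g. the output of the probit_gibbs sketch above, after discarding burn-in) with columns in the order intercept, noplan, factor, antib, this average is, for a given covariate vector,

x_new <- c(1, 0, 1, 0)                  # e.g. x1 = 0, x2 = 1, x3 = 0
mean(pnorm(beta_samples %*% x_new))     # estimate of P(Infected | x); about 0.53 here per the table below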

In the table below we compare the posterior probability of infections (E) versus the observed proportion infected (O) for each set of covariates. The table shows generally good agreement between the observed proportions and those given by the model.

Covariates Infection?
x1 x2 x3 E O
0 0 0 0.1665 0.2000
0 0 1 0.0047 0.0000
0 1 0 0.5300 0.4828
0 1 1 0.0525 0.0556
1 0 0 0.3202 0.0000
1 0 1 0.0158
1 1 0 0.7149 0.8846
1 1 1 0.1243 0.1122

3.6 Gaussian hierarchical model - revisited

We have introduced hierarchical models above and here we consider an alternative parameterisation of the hierarchical Gaussian model. Reparameterisation can be useful in statistics to improve estimation of the model parameters. The reparameterisations presented below are based upon ideas from [2]. Let Y=(Y11,…,Ymnm) be observed and X=(X1,…,Xm) be unobserved, with

Yij = Xi + σy ϵij,   j=1,2,…,ni,
Xi = θ + σx zi,   i=1,2,…,m,

where σx and σy are assumed known and ϵij and zi are independent standard normal random variables (N(0,1)). The parameter θ is assumed to be unknown and its posterior distribution is of interest. We shall consider the case where ni=1 for all i and consequently drop the subscript j. Therefore the model can be rewritten as

Yi ~ N(Xi, σy²)
Xi ~ N(θ, σx²),   i=1,2,…,m.

Therefore this is a simplified version of the hierarchical model studied above.

An alternative parameterisation is the non-centered parameterisation proposed by [2],

Yi = θ + X̃i + σy ϵi,
X̃i = σx zi,   i=1,2,…,m.

Note that X̃=(X̃1,…,X̃m) and θ are a priori independent, but conditional upon the data they are dependent.

Both (centered and non-centered) parameterisations permit a (data-augmentation) Gibbs sampler as outlined below. (Throughout we shall assign an improper, uniform prior to θ; π(θ) ∝ 1.)

Centered algorithm

The joint distribution of (θ,𝐱) given 𝐲, π(θ,𝐱|𝐲) satisfies

π(θ,𝐱|𝐲) = π(𝐲|𝐱,θ)π(𝐱|θ)π(θ)/π(𝐲)
          ∝ π(𝐲|𝐱)π(𝐱|θ),

since π(θ)=1 and given 𝐱, 𝐲 is independent of θ.

Thus

π(θ,𝐱|𝐲) ∝ ∏_{i=1}^m (1/(√(2π)σY)) exp(-(1/(2σY²))(yi-xi)²) × (1/(√(2π)σX)) exp(-(1/(2σX²))(xi-θ)²).

Therefore

π(θ|𝐱,𝐲) ∝ ∏_{i=1}^m exp(-(1/(2σX²))(xi-θ)²)
          = exp(-(1/(2σX²)) Σ_{i=1}^m (xi-θ)²)
          = exp(-(1/(2σX²)){Σ_{i=1}^m xi² - 2θΣ_{i=1}^m xi + mθ²})
          ∝ exp(-(m/(2σX²)){θ² - 2x̄θ}).

This gives

θ|𝐱,𝐲 ~ N(x̄, σX²/m).

(Note that if W has pdf f(w) ∝ exp(-(1/(2β))(w²-2αw)), then W ~ N(α,β).)

For i=1,2,…,m, xi depends only upon θ and yi. Therefore

π(xi|θ,yi) ∝ exp(-(1/(2σY²))(yi-xi)²) × exp(-(1/(2σX²))(xi-θ)²)
           ∝ exp(-(1/(2σY²))(xi²-2yixi)) × exp(-(1/(2σX²))(xi²-2θxi))
           ∝ exp(-(1/2){(1/σY² + 1/σX²)xi² - 2xi(yi/σY² + θ/σX²)}).

This gives

xi|θ,yi ~ N((yi/σY² + θ/σX²)/(1/σY² + 1/σX²), 1/(1/σY² + 1/σX²)).

It is straightforward using the above conditional distributions to construct a Gibbs sampling algorithm which alternates between updating θ|𝐱,𝐲 and updating 𝐱|θ,𝐲.
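
A minimal R sketch of the centered algorithm (function and argument names are mine; σX and σY are passed in as known constants) is:

# Centered data-augmentation Gibbs sampler; y is the vector (y_1, ..., y_m).
centered_gibbs <- function(y, sigma_x, sigma_y, n_iter = 1000) {
  m     <- length(y)
  theta <- mean(y)
  prec  <- 1 / sigma_y^2 + 1 / sigma_x^2
  out   <- numeric(n_iter)
  for (t in 1:n_iter) {
    x      <- rnorm(m, (y / sigma_y^2 + theta / sigma_x^2) / prec, sqrt(1 / prec))  # x | theta, y
    theta  <- rnorm(1, mean(x), sigma_x / sqrt(m))                                  # theta | x, y
    out[t] <- theta
  }
  out
}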

Non-centered algorithm

The joint distribution of (θ,𝐱̃) given 𝐲, π(θ,𝐱̃|𝐲), satisfies

π(θ,𝐱̃|𝐲) = π(𝐲|𝐱̃,θ)π(𝐱̃|θ)π(θ)/π(𝐲)
          ∝ π(𝐲|𝐱̃,θ)π(𝐱̃),

since π(θ)=1 and 𝐱̃ is a priori independent of θ.

Thus

π(θ,𝐱̃|𝐲) ∝ ∏_{i=1}^m (1/(√(2π)σY)) exp(-(1/(2σY²))(yi-θ-x̃i)²) × (1/(√(2π)σX)) exp(-(1/(2σX²))x̃i²).

Therefore, letting μ = (1/m)Σ_{i=1}^m x̃i, we have that

π(θ|𝐱̃,𝐲) ∝ ∏_{i=1}^m exp(-(1/(2σY²))(yi-x̃i-θ)²)
          = exp(-(1/(2σY²)) Σ_{i=1}^m (yi-x̃i-θ)²)
          = exp(-(1/(2σY²)){Σ_{i=1}^m (yi-x̃i)² - 2Σ_{i=1}^m (yi-x̃i)θ + mθ²})
          ∝ exp(-(m/(2σY²)){θ² - 2(ȳ-μ)θ}).

This gives

θ|𝐱̃,𝐲 ~ N(ȳ-μ, σY²/m).

For i=1,2,…,m, x̃i depends only upon θ and yi. Therefore

π(x̃i|θ,yi) ∝ exp(-(1/(2σY²))(yi-x̃i-θ)²) × exp(-(1/(2σX²))x̃i²)
            ∝ exp(-(1/(2σY²))(x̃i²-2(yi-θ)x̃i+(yi-θ)²)) × exp(-(1/(2σX²))x̃i²)
            ∝ exp(-(1/2){(1/σY² + 1/σX²)x̃i² - 2x̃i((yi-θ)/σY²)}).

This gives

x̃i|θ,yi ~ N(((yi-θ)/σY²)/(1/σY² + 1/σX²), 1/(1/σY² + 1/σX²)).

It is straightforward using the above conditional distributions to construct a Gibbs sampling algorithm which alternates between updating θ|𝐱̃,𝐲 and updating 𝐱̃|θ,𝐲.
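
The corresponding R sketch for the non-centered algorithm (again my own sketch) is:

# Non-centered data-augmentation Gibbs sampler for the same model.
noncentered_gibbs <- function(y, sigma_x, sigma_y, n_iter = 1000) {
  m     <- length(y)
  theta <- mean(y)
  prec  <- 1 / sigma_y^2 + 1 / sigma_x^2
  out   <- numeric(n_iter)
  for (t in 1:n_iter) {
    x_tilde <- rnorm(m, ((y - theta) / sigma_y^2) / prec, sqrt(1 / prec))  # x~ | theta, y
    theta   <- rnorm(1, mean(y - x_tilde), sigma_y / sqrt(m))              # theta | x~, y
    out[t]  <- theta
  }
  out
}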

Example.

To illustrate the two algorithms, we apply both algorithms to a data set consisting of m=100 observations from y with σY=2, σX=1 and θ=2. Both algorithms were run to generate a sample of size 1000 from the posterior distribution of θ. The two algorithms are obtaining samples from the same posterior distribution and should therefore give the same answers (subject to Monte Carlo error - random variation). The estimated means (standard deviations) of θ based on the two samples are 2.435 (0.233) and 2.425 (0.223) for the centered and non-centered algorithms, respectively. Below are trace plots of 100 iterations of each algorithm followed by the acf plots. We can clearly see from the acf plots that the non-centered algorithm is preferable to the centered algorithm.

[Unnumbered figures: trace plots (100 iterations) and acf plots of θ for the centered and non-centered algorithms]

For the above model we can actually show analytically, for the given values of σY and σX, that the non-centered algorithm is preferable, by calculating the lag-1 autocorrelation for both algorithms. Note that it is actually fairly easy to show that the lag-k autocorrelation is the lag-1 autocorrelation to the power k for the above model. Let ρC and ρNC denote the lag-1 autocorrelations for the centered and non-centered algorithms, respectively. Then we can show that ρC = σY²/(σX²+σY²) and ρNC = σX²/(σX²+σY²).

To prove the above, we need to compute Corr[θ0,θ1], for which we need the stationary distribution of θ0. Note that

Yi ~ N(θ, σX²+σY²),   i=1,2,…,m.

Therefore

π(θ|𝐲) ∝ ∏_{i=1}^m (1/√(2π(σX²+σY²))) exp(-(1/(2(σX²+σY²)))(yi-θ)²)
        ∝ exp(-(1/(2(σX²+σY²))) Σ_{i=1}^m (yi-θ)²)
        ∝ exp(-(1/(2(σX²+σY²)))(mθ² - 2mȳθ)),

giving

θ|𝐲 ~ N(ȳ, (σX²+σY²)/m).

In other words, MCMC is not needed in the case where ni=1 for all i, but MCMC is required if any ni>1.

Thus Var[θ0]=Var[θ1]=(σX²+σY²)/m. Therefore we need to compute Cov[θ0,θ1]. This is most easily done by showing that

θ1=αθ0+W,

where W is a random variable independent of θ0. In that case,

Cov[θ0,θ1]=Cov[θ0,αθ0+W]=αVar[θ0]

and Corr[θ0,θ1]=α.

For the centered algorithm, note that

x̄|θ0,𝐲 ~ N((σX²ȳ + σY²θ0)/(σX²+σY²), σX²σY²/(m(σX²+σY²))).

Hence

θ1|θ0,𝐲 ~ x̄ + N(0, σX²/m)
        ~ N((σX²ȳ + σY²θ0)/(σX²+σY²), σX²σY²/(m(σX²+σY²))) + N(0, σX²/m)
        = (σY²/(σX²+σY²))θ0 + N((σX²/(σX²+σY²))ȳ, σX²σY²/(m(σX²+σY²)) + σX²/m).

Therefore ρC = Corr[θ0,θ1] = σY²/(σX²+σY²).

A similar calculation holds for ρNC = σX²/(σX²+σY²). Thus if σY² < σX², the centered algorithm is preferable; otherwise the non-centered algorithm is preferable.
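
These lag-1 autocorrelations are easy to check empirically using the two sketches above and the values from the example (σY = 2, σX = 1, θ = 2), for which ρC = 4/5 and ρNC = 1/5:

set.seed(1)
y <- rnorm(100, mean = 2 + rnorm(100, 0, 1), sd = 2)          # m = 100, sigma_x = 1, sigma_y = 2
acf(centered_gibbs(y, 1, 2, 5000), plot = FALSE)$acf[2]       # estimated lag-1 acf, close to 0.8
acf(noncentered_gibbs(y, 1, 2, 5000), plot = FALSE)$acf[2]    # estimated lag-1 acf, close to 0.2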

3.7 R practice question

Write a Gibbs sampler algorithm to apply to the school performance data given in Section 3.4 of the lecture notes. Compare your output to that given in the lecture notes.

A recap of the data, the model and the necessary conditional distributions is given below.

School 1  School 2  School 3  School 4
60        55        54        61
62        52        56        59
58        54        56        58
60        57        56        61
59        52        56        59
58        54        63        58
54

Performance of 25 students in a test selected from 4 schools.

Model:

Xi ~ N(θ, 1/τ)   (i=1,2,…,m)
Yij ~ N(Xi, 1/λ)   (i=1,2,…,m; j=1,2,…,ni),

where Yij is the performance of the jth individual from school i and Xi denotes the ith school effect. Note that the Xi's are not observed.

Conditional distributions for the Gibbs sampler:-

τ|λ,θ,𝐱,𝐲 ~ Gamma(m/2+1, (1/2)Σ_{i=1}^m (xi-θ)²)
θ|λ,τ,𝐱,𝐲 ~ N(x̄, 1/(mτ))
λ|τ,θ,𝐱,𝐲 ~ Gamma(N/2+1, (1/2)Σ_{i=1}^m Σ_{j=1}^{ni} (yij-xi)²)
xi|λ,θ,τ,𝐱_{-i},𝐲 ~ N((τθ + λΣ_{j=1}^{ni} yij)/(τ+niλ), 1/(τ+niλ)).

3.8 Lab Session: Data augmentation - robust regression


This extends the multiple regression example studied in last week's lab to allow for outliers.

When there are outliers suspected in the data, the sensible thing to do is to take a closer look at the data and try and understand where the lack of fit may be coming from. There are several possibilities: e.g. the outliers may be due to errors in recording the data, or it may be that there were certain features corresponding to those observations that our model does not take into account. The latter may be solved e.g. by adding relevant explanatory variables to the model, or by changing other aspects of the model.

Now suppose that none of the above applies to your model: that is, there are no recording errors in the data that you know of, you have no additional explanatory variables that you could possibly use, and you can not think of a more suitable model specification. In that case, a partial solution is to use a sampling distribution that is more robust to outliers than the Normal distribution. A distribution with thicker tails than the Normal distribution allows for larger departures from the mean and, therefore, outliers are likely to have less impact on the resulting posterior inference on β1 and β2. That is what we mean by a sampling distribution that is more resistant to outliers than the Normal distribution.

A distribution with this property is the Cauchy distribution. Using a Cauchy distribution leads to the following sampling density for observation i:

f(yi|𝜷,τ) = (τ^{1/2}/π) × 1/(1+τ(yi-𝐱iᵀ𝜷)²),   i=1,…,n, independent.     (3.5)

Consider the same prior distributions for 𝜷 and τ used in Week 7.

  1. Write down, up to a proportionality constant, the joint posterior density of (𝜷,τ).

  2. Write down, up to a proportionality constant, the conditional posterior densities of β1, β2 and τ (that is, π(β1|𝐲,β2,τ), π(β2|𝐲,β1,τ) and π(τ|𝐲,β1,β2)). Is our new model amenable to Gibbs sampling? Explain why or why not.

Now consider the following idea: The Cauchy distribution in (3.5) can also be interpreted as a Normal distribution with an unknown (random) precision parameter.

More specifically, assuming

yi|β1,β2,τ,zi ~ N(β1xi1 + β2xi2, 1/(τzi))     (3.6)
zi|β1,β2,τ ~ Gamma(1/2, 1/2)     (3.7)

is, in fact, equivalent to assuming the sampling model in (3.5).

Aside. The Cauchy distribution is a t-distribution with 1 degree of freedom, and (abusing notation)

tν = N(0,1)/√(χ²ν/ν)   (e.g. consider the test statistic (Ȳ-μ)/√(S²/n).)

The statement follows since χ²ν/ν is the same as Gamma(ν/2, ν/2).
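
The equivalence of (3.5) with (3.6)-(3.7) can be verified numerically; a quick sketch (taking τ = 1 and a zero mean purely for simplicity) is:

set.seed(1)
n <- 1e5
z <- rgamma(n, shape = 1/2, rate = 1/2)     # z_i ~ Gamma(1/2, 1/2)
y <- rnorm(n, mean = 0, sd = sqrt(1 / z))   # y_i | z_i ~ N(0, 1/z_i)
ks.test(y, "pcauchy")                       # compare the draws with a standard Cauchy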

Thus, we can augment the parameters with the variables z1,…,zn (note that there is one variable zi per observation) to obtain the posterior density

π(β1,β2,τ,z1,…,zn|𝐲) ∝ {∏_{i=1}^n (τzi)^{1/2} exp(-(τzi/2)(yi-β1xi1-β2xi2)²) zi^{-1/2} exp(-zi/2) I[zi>0]}
                      × ∏_{j=1}^2 (ω0j^{1/2}/√(2π)) exp(-(ω0j/2)(βj-μ0j)²) × τ^{α0-1} e^{-γ0τ} I[τ>0].
  3. Find the conditional posterior density of each zi, i=1,…,n, π(zi|β1,β2,τ,𝐲,𝐳_{-i}), where 𝐳_{-i} denotes the vector (z1,…,zn) excluding zi.

  4. Using an almost identical argument to the previous lab, show that

     β1|τ,β2,𝐲,𝐳 ~ N(ν1, s1²),

     where ν1 = m1 s1²,

     m1 = Σ_{i=1}^n τ zi xi1 (yi - β2 xi2) + ω01 μ01

     and

     s1² = {Σ_{i=1}^n τ zi xi1² + ω01}⁻¹.
  5. Using an almost identical argument to the previous lab (and the previous question), show that

     β2|τ,β1,𝐲,𝐳 ~ N(ν2, s2²),

     where ν2 = m2 s2²,

     m2 = Σ_{i=1}^n τ zi xi2 (yi - β1 xi1) + ω02 μ02

     and

     s2² = {Σ_{i=1}^n τ zi xi2² + ω02}⁻¹.
  6. Using an almost identical argument to the previous lab, show that

     τ|β1,β2,𝐲,𝐳 ~ Gamma(n/2 + α0, Σ_{i=1}^n (zi/2)(yi - β1 xi1 - β2 xi2)² + γ0).
  7. Write down a Gibbs sampling algorithm to compute the joint posterior distribution of (β1,β2,τ,z1,…,zn).

  8. Make a copy of your Gibbs sampling function from last week's lab and alter this to include the zi, producing a Gibbs sampler for the Cauchy model.

  9. How does the inference on β1 and β2 compare with that obtained in Lab 2 where you used a Normal sampling model?

3.9 Practice Question

  1. Suppose that we have independent and identically distributed realisations x1, x2, …, xn from the random variable X, where X is a zero-inflated Poisson. That is,

    X ~ Po(λ) with probability 1-ϵ;   X = 0 with probability ϵ,

    where (ϵ,λ) are the parameters of interest.

    This is an example of a mixture distribution, where X is distributed either according to distribution A (Po(λ)) or distribution B (a point mass at 0).

    For k=0,1,…, let nk = Σ_{j=1}^n 1{xj=k} denote the total number of xj's equal to k. Then 𝐧=(n0,n1,…) is a sufficient statistic for obtaining the posterior distribution of (ϵ,λ). That is,

    π(ϵ,λ|𝐧)=π(ϵ,λ|𝐱).

    Suppose that the priors are ϵ ~ Beta(1,1) and λ ~ Gamma(1,1).

    (a) Show that

      L(ϵ,λ;𝐧) ∝ {ϵ + (1-ϵ)exp(-λ)}^{n0} ∏_{k=1}^∞ {(1-ϵ)(λ^k/k!)exp(-λ)}^{nk}.
    (b) Let z denote the total number of xj's which come from distribution B. Write down π(z,𝐧|ϵ,λ).

      Hence, write down π(ϵ,λ,z|𝐧) up to proportionality.

    (c) With the priors ϵ ~ Beta(1,1) and λ ~ Gamma(1,1), find the following conditional distributions:-

      i. ϵ|λ,𝐧,z;

      ii. λ|ϵ,𝐧,z;

      iii. z|ϵ,λ,𝐧.

    (d) Write a Gibbs sampler to obtain samples from π(ϵ,λ|𝐧).

  2. An ecologist is monitoring the number of field mice at m sites. Each of the m sites is observed for n days, with the number of mice caught at each site recorded daily. Let yij denote the number of mice caught at site i on day j. The data 𝐲=(𝐲1,𝐲2,…,𝐲m) with 𝐲i=(yi1,yi2,…,yin) are assumed to arise from the following hierarchical model:

    Yij ~ Pois(Xi)   (i=1,2,…,m; j=1,2,…,n)
    Xi ~ Exp(θ)   (i=1,2,…,m).

    Note that X1, X2, …, Xm are unobserved and let 𝐱=(x1,x2,…,xm) denote a realization from (X1, X2, …, Xm), corresponding to 𝐲.

    Suppose that a Gamma(α,β) prior is assigned to θ.

    (a) Write down the likelihood for θ given 𝐲, 𝐱.

    (b) Find the conditional probability distribution of θ given 𝐲, 𝐱.

    (c) For i=1,2,…,m, find and identify the conditional probability distribution of xi|𝐲i,θ.

    (d) Describe a Gibbs sampler algorithm for obtaining samples from π(θ|𝐲), the posterior distribution of θ given the observed data 𝐲.