MATH454/554: Project III

TO BE HANDED IN BY MONDAY 15/01/2018 (WEEK 11), 10:00.

This project will contribute 17% towards the final module mark.

Submission: Upload the pdf of your answer and your R code file to the Moodle site. Your R code should be as .r or .txt file so that it can be copied and pasted to run. Submit also a printed copy of your answers (no need for a printed copy of the R code), together with a plagiarism cover sheet, to the MSc submissions pigeon hole. Please write your student ID on your answers, not your name.

Let YGeom(p) with probability mass function

(Y=k)=(1-p)pk  (k=0,1,).

Then Y is a Geometric random variable. The mean and variance of Y are p/(1-p) and p/(1-p)2, respectively. Thus the variance is greater than the mean and the geometric distribution can be preferable to the Poisson distribution for data which is over-dispersed (shows more variability) than we would expect from a Poisson distribution.

We take into account explanatory variables (covariates) by using the following model:-

YiGeom(pi),log(pi/(1-pi))=𝐱iβ  (i=1,2,,n). (1)

Thus for y=0,1,,

(Yi=y|𝐱i,β)=(11+exp(𝐱iβ))(exp(𝐱iβ)1+exp(𝐱iβ))y.

Data

The data is provided in pupil.txt and contains the number of days absent of 314 pupils along with three covariates. The number of absence days are given in column 1 and the three remaining columns are:

  • Gender; 0 - female; 1 - male (column 2);

  • Maths test score; range 0-100 (column 3);

  • Programme pupil is on; 1, 2 or 3 (column 4).

We will analyse the absence data using the geometric regression model above with 𝐱i=(1,gi,mi,1{pi=2},1{pi=3}) including an intercept term and 𝜷=(β1,β2,β3,β4,β5), where gi, mi and pi denote the gender, maths score and programme, respectively, of pupil i. Remember 1{A} is an indicator random variable which is 1 if A occurs and 0 otherwise.

The unknown parameters are β5. The objective here is to conduct posterior inference on these parameters.

In order to conduct Bayesian inference, we need to elicit a prior distribution for 𝜷, the model parameters. Let us consider the following prior distribution

𝜷=(β1β2β3β4β5)Nk(𝝁0,𝐂0=Diag(1κ01,,1κ0k)). (2)

In other words, the prior for 𝜷 is k-variate Normal with mean vector 𝝁0 and diagonal covariance matrix 𝐂0. You may choose the following hyperparameter values 𝝁0=𝟎5 (a 5-dimensional vector of 0 entries) and κ01==κ05=0.01, which correspond to a fairly uninformative prior distributions.

  1. 1.

    Write down, up to a constant of proportionality, the joint posterior distribution of 𝜷. [2]

  2. 2.

    Write a function in R to compute the log-likelihood given 𝜷, 𝐗 and 𝐲 (the absence data). [3]

  3. 3.

    Write a function in R to compute the log of the prior distribution. [1]

  4. 4.

    Write in R a random walk Metropolis algorithm to obtain samples from the joint posterior distribution of 𝜷. Use a multivariate Gaussian proposal for 𝜷 with variance matrix V.prop. [3]
    Hint: Note that you can use the random walk Metropolis for the Poisson regression example in Lab 5 as a template.

Throughout the following apply the random walk Metropolis algorithm to the pupil data set with initial parameter values 𝜷=(0,0,0,0,0) and 10000 iterations of the algorithm.

  1. 6.

    Perform a run with V.prop = diag(rep(0.01,5)) (0.01 times the identity matrix). Comment upon the performance of the random walk Metropolis algorithm. [2]

  2. 7.

    Use tuning runs to find a good choice of V.prop. State your final choice of V.prop. [2]

  3. 8.

    Run the random walk Metropolis algorithm with your chosen V.prop for 110000 iterations and estimate the joint posterior distribution of the parameters.
    Give a short report of the results obtained. [2]

  4. 9.

    Using samples from the posterior distribution for 𝜷, estimate the probability that a male on programme 3 with a maths score of 60 has no absences. [2]