In this first example, we will take a closer look at the Scottish lip cancer data mentioned in the lectures. These data, consisting of rates of lip cancer in 56 counties in Scotland, have been analysed by Breslow and Clayton (1993). The full data include:
the observed ($O_i$) and expected ($E_i$) cases (the expected numbers are based on the population and its age and sex distribution in county $i$);
a covariate ($x_i$) measuring the percentage of the population engaged in agriculture, fishing, or forestry. In the dataset below, this covariate is named AFF;
the “position” of each county expressed as a list of adjacent counties (this will not be included in the version of the data you will receive: we will compute the adjacency matrix later).
County | Observed cases | Expected cases | x (% in agriculture) | Adjacent counties |
---|---|---|---|---|
1 | 9 | 1.4 | 16 | 5,9,11,19 |
2 | 39 | 8.7 | 16 | 7,10 |
… | … | … | … | … |
56 | 0 | 1.8 | 10 | 18,24,30,33,45,55 |
We will use the CARBayes package to analyse these data; the package vignette is available from http://www.jstatsoft.org/v55/i13. First we load the packages required for the analysis. Note that you may have to install these packages first, for example using:
install.packages("rgdal", dependencies = TRUE)
install.packages("spdep", dependencies = TRUE)
install.packages("CARBayes", dependencies = TRUE)
library(rgdal)
library(spdep)
library(CARBayes)

scot <- readOGR(".", "scot", verbose = FALSE, stringsAsFactors = FALSE)

# The following sets up a binary neighbourhood (adjacency) matrix,
# which will be used later
nb <- poly2nb(scot)
W <- nb2mat(nb, style = "B", zero.policy = TRUE)
rownames(W) <- scot$NAME
colnames(W) <- scot$NAME

# Any region with no neighbours (e.g. an island) is linked to the nearest
# other region, measured by the distance between polygon centroids,
# so that every region has at least one neighbour
rs <- rowSums(W)
d <- as.matrix(dist(coordinates(scot)))
for (i in 1:nrow(W)) {
  if (rs[i] == 0) {
    idx <- which(d[i, ] == min(d[i, ][d[i, ] > 0]))[1]
    W[i, idx] <- 1
    W[idx, i] <- 1
  }
}
The above assumes you are running R in the same directory as the file scot.shp (and the other files: scot.dbf, scot.gal, scot.prj and scot.shx); otherwise, the "." in the call to readOGR should be replaced by "name-of-directory", where name-of-directory is the folder in which scot.shp is stored.
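For example, if the shapefile is stored elsewhere, a call of the following form could be used (the path shown is purely illustrative):

scot <- readOGR("path/to/shapefile-folder", "scot",
                verbose = FALSE, stringsAsFactors = FALSE)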
The R object scot is of class SpatialPolygonsDataFrame. Such objects, which are defined in the package sp, can be used to describe data pertaining to complex shapes consisting of many polygons (possibly with holes); they contain both the data attributes and the geography on which they are observed. The types of geographies handled by the package sp include points, lines, pixels and polygons (with holes). Objects from the sp package belong to the S4 class system, which requires slightly different handling compared to the S3 objects we have been working with until now. To access the data.frame containing the observed, expected and covariate information from the SpatialPolygonsDataFrame, we can use scot@data. The top few rows of this object look like:
head(scot@data)
           NAME ID Cases Expected AFF
0 Skye-Lochalsh  1     9      1.4  16
1  Banff-Buchan  2    39      8.7  16
2     Caithness  3    11      3.0  10
3  Berwickshire  4     9      2.5  24
4 Ross-Cromarty  5    15      4.3  10
5        Orkney  6     8      2.4  24
We can summarise and plot the object scot using the following code:
summary(scot)
spplot(scot, "Cases")
Observation equation: $O_i \sim \text{Poisson}(\mu_i)$

Link function: $\log(\mu_i) = \log(E_i) + \beta_0 + \beta_1 x_i$

The term $\log(E_i)$ is an offset term, which accounts for the fact that in different regions the population at risk is not the same. $\beta_0$ is an intercept term representing the baseline ($x_i = 0$) relative risk of disease across the study region, and $\beta_1$ is the (area-level) regression coefficient of the covariate $x_i$.
Fit the Poisson regression model to the data using the glm function. Is there evidence that the estimated standard errors from the model fit may be too small?
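One way to carry out this task is sketched below; the final line is an informal overdispersion check (the ratio of the residual deviance to its degrees of freedom), not something required by the question itself.

# Poisson GLM with log(Expected) as an offset
pois <- glm(Cases ~ AFF + offset(log(Expected)),
            family = poisson, data = scot@data)
summary(pois)

# Informal check: a ratio much larger than 1 suggests overdispersion,
# i.e. that the reported standard errors may be too small
pois$deviance / pois$df.residual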
We can extend the above model to take into account possible additional variation that has not currently been accounted for, by adding a random effect $\phi_i$ for each region.
Observation equation: $O_i \sim \text{Poisson}(\mu_i)$

Link function: $\log(\mu_i) = \log(E_i) + \beta_0 + \beta_1 x_i + \phi_i$

where the $\phi_i$ are mutually independent random effects, $\phi_i \sim \text{Normal}(0, \tau^2)$.
We can fit this model using the CARBayes package as follows:
ireg <- S.CARleroux(Cases ~ AFF + offset(log(Expected)), family = "poisson", data = scot@data, burnin = 100, n.sample = 5000, rho = 0, W = W)
Note that, strictly, we should not need to pass the object W to this function, but this is a newer implementation of the method (it used to be called S.independent).
Before we can make any inferential statements, we need to make sure that the MCMC chain has converged and is mixing satisfactorily. Look at the structure of ireg using str(ireg).
The parameters in this model are: the fixed effects, beta (there are two of these); the random effects, phi (there are 56 of these, one for each region, the $\phi_i$ in the model above); and the parameter governing the variance of the random effects, tau2. The chains for these parameters are stored in:
ireg$samples$beta
ireg$samples$tau2
ireg$samples$phi
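These samples are stored as coda mcmc objects, so the standard coda diagnostics can be applied to them; a minimal sketch (assuming the coda package is installed) is:

library(coda)

# Trace and density plots for the fixed effects
plot(ireg$samples$beta)

# Effective sample sizes and Geweke convergence diagnostics
effectiveSize(ireg$samples$beta)
geweke.diag(ireg$samples$beta)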
Complete the following tasks:
In a Bayesian analysis we need to specify priors, but this was not done explicitly above. Use the help file to find out the default priors used.
Has this chain converged satisfactorily?
If the answer to the above is “no”, re-run the MCMC until convergence (Use the help file to find out the relevant arguments to control the length of chain, burnin and thinning parameters).
Plot the posterior means of the random effects $\phi_i$ using spplot (a sketch of one way to do this is given below). What does this show?
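A minimal sketch of one way to produce such a map, attaching the posterior means to the scot object under a (hypothetical) column name phi.mean:

# Posterior mean of each random effect (one column of samples per region)
scot$phi.mean <- apply(ireg$samples$phi, 2, mean)
spplot(scot, "phi.mean")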
Observation equation: $O_i \sim \text{Poisson}(\mu_i)$

Link function: $\log(\mu_i) = \log(E_i) + \beta_0 + \beta_1 x_i + \phi_i + \nu_i$

Random spatial effects: $\phi_i \mid \phi_j, \ j \in \text{neighbours of } i \ \sim \ \text{Normal}(\bar{\phi}_i, \tau^2 / m_i)$, where

$\bar{\phi}_i$ = mean of $\phi_j$ from neighbouring counties

$\bar{\phi}_i = \frac{1}{m_i} \sum_{j \sim i} \phi_j$, where $m_i$ = number of neighbours of county $i$

Independent random effects: $\nu_i \sim \text{Normal}(0, \sigma^2)$. Hence $\phi_i$ has a Normal distribution with conditional mean given by the average of the neighbouring $\phi_j$'s and conditional variance inversely proportional to the number of neighbours $m_i$.

More formally:

$$\phi_i \mid \{\phi_j : j \text{ a neighbour of } i\} \sim \text{Normal}\!\left(\frac{\sum_j w_{ij} \phi_j}{\sum_j w_{ij}}, \ \frac{\tau^2}{\sum_j w_{ij}}\right)$$

where $w_{ij} = 1$ (equivalently, $j \sim i$) if areas $i$ and $j$ are neighbours and $w_{ij} = 0$ otherwise.
In the initial lines of code above, we computed the neighbours of each of our regions using the poly2nb function; we now aim to pass this information into a model that includes a spatially correlated random effect. We can plot the assumed dependency structure:
plot(scot)
plot(nb, coordinates(scot), add = TRUE, col = "red")
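It can also be useful to summarise the neighbourhood structure numerically; a small sketch using functions from spdep:

# Summary of the neighbour list (number of regions, links, etc.)
summary(nb)

# Number of neighbours for each region
card(nb)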
We then converted the neighbourhood object, nb, into a binary incidence matrix, W, using the function nb2mat. We will pass W to our MCMC algorithm:
sreg <- S.CARbym(Cases ~ AFF + offset(log(Expected)), family = "poisson", data = scot@data, W = W, burnin = 1000, n.sample = 10000)
Note that in this output the random effects are stored in sreg$samples$psi, which represent the combined random effects $\psi_i = \phi_i + \nu_i$ for each region; samples for $\tau^2$ are stored in sreg$samples$tau2; samples for $\sigma^2$ are stored in sreg$samples$sigma2; and samples for beta are stored in sreg$samples$beta.
Complete the following task.
Check to see if the MCMC output from the above is acceptable, i.e. that the chain appears to have converged and is mixing well. If it is not acceptable, re-run using appropriate settings to achieve convergence.
Having now fitted this second model, it is of interest to compare the models ireg and sreg. To do this, we can use the Deviance Information Criterion (DIC), see Spiegelhalter et al. (2002). The DIC is defined as

$$\text{DIC} = \bar{D} + p_D,$$

where $\bar{D} = \mathrm{E}_{\theta \mid y}\left[D(\theta)\right]$ is the mean posterior deviance, i.e. for independent and identically distributed data, $y = (y_1, \ldots, y_n)$, and parameters, $\theta$,

$$D(\theta) = -2 \sum_{i=1}^{n} \log p(y_i \mid \theta),$$

and $p_D$ is the “effective number of parameters”, given by

$$p_D = \bar{D} - D(\bar{\theta}),$$

where $\bar{\theta} = \mathrm{E}[\theta \mid y]$ and $D(\bar{\theta}) = -2 \sum_{i=1}^{n} \log p(y_i \mid \bar{\theta})$. The model with the smaller DIC is the best.
The DIC for the two models is displayed by typing the name of the object
ireg
#################
#### Model fitted
#################

Likelihood model - Poisson (log link function)
Random effects model - Leroux CAR
Regression equation - scot$Cases ~ scot$AFF + offset(log(scot$Expected))
Number of missing observations - 0

############
#### Results
############

Posterior quantities and DIC

             Median    2.5%   97.5% n.sample % accept n.effective
(Intercept) -0.5247 -0.8753 -0.2178     1000     35.8       312.4
scot$AFF     0.0722  0.0394  0.1086     1000     35.8       336.6
tau2         0.3466  0.1901  0.6602     1000    100.0       615.4
rho          0.0000  0.0000  0.0000       NA       NA          NA

            Geweke.diag
(Intercept)         0.3
scot$AFF           -0.6
tau2               -1.0
rho                  NA

DIC = 320.9806       p.d = 44.33652       LMPL = -123.48
sreg
#################
#### Model fitted
#################

Likelihood model - Poisson (log link function)
Random effects model - BYM CAR
Regression equation - Cases ~ AFF + offset(log(Expected))
Number of missing observations - 0

############
#### Results
############

Posterior quantities and DIC

             Median    2.5%  97.5% n.sample % accept n.effective
(Intercept) -0.2071 -0.4281 0.0432     9000     42.5       245.6
AFF          0.0364  0.0090 0.0585     9000     42.5       173.7
tau2         0.3636  0.1683 0.7838     9000    100.0       381.0
sigma2       0.0080  0.0022 0.0514     9000    100.0       102.5

            Geweke.diag
(Intercept)         0.7
AFF                 0.0
tau2               -1.9
sigma2              0.0

DIC = 297.6073       p.d = 28.45902       LMPL = -124.2
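The same fit criteria can also be extracted programmatically; a minimal sketch, assuming both fitted objects contain a modelfit component of the same form (as in recent versions of CARBayes):

# Compare DIC, p.d and related criteria side by side
rbind(ireg = ireg$modelfit, sreg = sreg$modelfit)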
So the model with spatially correlated random effects, sreg, has the smaller DIC and is therefore the preferred model.