In this first example, we will take a closer look at the Scottish lip cancer data mentioned in the lectures. These data, consisting of rates of lip cancer in 56 counties in Scotland, have been analysed by Breslow and Clayton (1993). The full data include:
the observed ($O_i$) and expected ($E_i$) cases (the expected numbers are based on the population and its age and sex distribution in county $i$);
a covariate ($x_i$) measuring the percentage of the population engaged in agriculture, fishing, or forestry. In the dataset below, this covariate is named AFF;
the “position” of each county expressed as a list of adjacent counties (this will not be included in the version of the data you will receive: we will compute the adjacency matrix later).
County | Observed cases | Expected cases | x (% in agriculture) | Adjacent counties |
---|---|---|---|---|
1 | 9 | 1.4 | 16 | 5,9,11,19 |
2 | 39 | 8.7 | 16 | 7,10 |
… | … | … | … | … |
56 | 0 | 1.8 | 10 | 18,24,30,33,45,55 |
We will use the CARBayes package to analyse these data; the package vignette is available from http://www.jstatsoft.org/v55/i13. First we load the packages required for the analysis. Note that you may have to install these packages first, for example using:
install.packages("rgdal", dependencies = TRUE)
install.packages("spdep", dependencies = TRUE)
install.packages("CARBayes", dependencies = TRUE)
library(rgdal)
library(spdep)
library(CARBayes)

scot <- readOGR(".", "scot", verbose = FALSE, stringsAsFactors = FALSE)

# The following sets up a binary neighbourhood (adjacency) matrix,
# which will be used later
nb <- poly2nb(scot)
W <- nb2mat(nb, style = "B", zero.policy = TRUE)
rownames(W) <- scot$NAME
colnames(W) <- scot$NAME

# Any region with no neighbours (e.g. an island) is linked to the nearest
# other region, measured by the distance between polygon centroids,
# so that every region has at least one neighbour
rs <- rowSums(W)
d <- as.matrix(dist(coordinates(scot)))
for (i in 1:nrow(W)) {
  if (rs[i] == 0) {
    idx <- which(d[i, ] == min(d[i, ][d[i, ] > 0]))[1]
    W[i, idx] <- 1
    W[idx, i] <- 1
  }
}
The above assumes you are running R in the same directory as the file scot.shp (and the other files: scot.dbf, scot.gal, scot.prj and scot.shx); otherwise, the "." in the call to readOGR should be replaced by "name-of-directory", where name-of-directory is the folder in which scot.shp is stored.
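For example, if the shapefile is stored elsewhere, a call of the following form could be used (the path shown is purely illustrative):

scot <- readOGR("path/to/shapefile-folder", "scot",
                verbose = FALSE, stringsAsFactors = FALSE)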
The R object scot is of class SpatialPolygonsDataFrame. Such objects, which are defined in the package sp, can be used to describe data pertaining to complex shapes consisting of many polygons (possibly with holes); they contain both the data attributes and the geography on which they are observed. The types of geographies handled by the package sp include points, lines, pixels and polygons (with holes). Objects from the sp package belong to the S4 class system, which requires slightly different handling compared to the S3 objects we have been working with until now. To access the data.frame containing the observed, expected and covariate information from the SpatialPolygonsDataFrame, we can use scot@data. The top few rows of this object look like:
head(scot@data)
           NAME ID Cases Expected AFF
0 Skye-Lochalsh  1     9      1.4  16
1  Banff-Buchan  2    39      8.7  16
2     Caithness  3    11      3.0  10
3  Berwickshire  4     9      2.5  24
4 Ross-Cromarty  5    15      4.3  10
5        Orkney  6     8      2.4  24
We can summarise and plot the object scot using the following code:
summary(scot)
spplot(scot, "Cases")
Observation equation: $O_i \sim \text{Poisson}(\mu_i)$

Link function: $\log(\mu_i) = \log(E_i) + \beta_0 + \beta_1 x_i$

The term $\log(E_i)$ is an offset term, which accounts for the fact that in different regions the population at risk is not the same. $\beta_0$ is an intercept term representing the baseline ($x_i = 0$) relative risk of disease across the study region, and $\beta_1$ is the (area-level) regression coefficient of the covariate $x_i$.
Fit the Poisson regression model to the data using the glm function. Is there evidence that the estimated standard errors from the model fit may be too small?
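One way to carry out this task is sketched below; the final line is an informal overdispersion check (the ratio of the residual deviance to its degrees of freedom), not something required by the question itself.

# Poisson GLM with log(Expected) as an offset
pois <- glm(Cases ~ AFF + offset(log(Expected)),
            family = poisson, data = scot@data)
summary(pois)

# Informal check: a ratio much larger than 1 suggests overdispersion,
# i.e. that the reported standard errors may be too small
pois$deviance / pois$df.residual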
We can extend the above model to take into account possible additional variation that has not currently been accounted for, by adding a random effect $\phi_i$ for each region.
Observation equation: $O_i \sim \text{Poisson}(\mu_i)$

Link function: $\log(\mu_i) = \log(E_i) + \beta_0 + \beta_1 x_i + \phi_i$

where the $\phi_i$ are mutually independent random effects, $\phi_i \sim \text{Normal}(0, \tau^2)$.
We can fit this model using the CARBayes package as follows:
ireg <- S.CARleroux(Cases ~ AFF + offset(log(Expected)), family = "poisson", data = scot@data, burnin = 100, n.sample = 5000, rho = 0, W = W)
Note that, strictly, we should not need to pass the object W to this function, but this is a newer implementation of the method (it used to be called S.independent).
Before we can make any inferential statements, we need to make sure that the MCMC chain has converged and is mixing satisfactorily. Look at the structure of ireg using str(ireg).
The parameters in this model are: the fixed effects, beta (there are two of these); the random effects, phi (there are 56 of these, one for each region, the $\phi_i$ in the model above); and the parameter governing the variance of the random effects, tau2. The chains for these parameters are stored in:
ireg$samples$beta
ireg$samples$tau2
ireg$samples$phi
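These samples are stored as coda mcmc objects, so the standard coda diagnostics can be applied to them; a minimal sketch (assuming the coda package is installed) is:

library(coda)

# Trace and density plots for the fixed effects
plot(ireg$samples$beta)

# Effective sample sizes and Geweke convergence diagnostics
effectiveSize(ireg$samples$beta)
geweke.diag(ireg$samples$beta)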
Complete the following tasks:
In a Bayesian analysis we need to specify priors, but this was not done explicitly above. Use the help file to find out the default priors used.
Has this chain converged satisfactorily?
If the answer to the above is “no”, re-run the MCMC until convergence (Use the help file to find out the relevant arguments to control the length of chain, burnin and thinning parameters).
Plot the posterior means of the random effects $\phi_i$ using spplot (a sketch of one way to do this is given below). What does this show?
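A minimal sketch of one way to produce such a map, attaching the posterior means to the scot object under a (hypothetical) column name phi.mean:

# Posterior mean of each random effect (one column of samples per region)
scot$phi.mean <- apply(ireg$samples$phi, 2, mean)
spplot(scot, "phi.mean")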
Observation equation: $O_i \sim \text{Poisson}(\mu_i)$

Link function: $\log(\mu_i) = \log(E_i) + \beta_0 + \beta_1 x_i + \phi_i + \nu_i$

Random spatial effects: $\phi_i \mid \phi_j, \ j \in \text{neighbours of } i \ \sim \ \text{Normal}(\bar{\phi}_i, \tau^2 / m_i)$, where

$\bar{\phi}_i$ = mean of $\phi_j$ from neighbouring counties

$\bar{\phi}_i = \frac{1}{m_i} \sum_{j \sim i} \phi_j$, where $m_i$ = number of neighbours of county $i$

Independent random effects: $\nu_i \sim \text{Normal}(0, \sigma^2)$. Hence $\phi_i$ has a Normal distribution with conditional mean given by the average of the neighbouring $\phi_j$'s and conditional variance inversely proportional to the number of neighbours $m_i$.

More formally:

$$\phi_i \mid \{\phi_j : j \text{ a neighbour of } i\} \sim \text{Normal}\!\left(\frac{\sum_j w_{ij} \phi_j}{\sum_j w_{ij}}, \ \frac{\tau^2}{\sum_j w_{ij}}\right)$$

where $w_{ij} = 1$ (equivalently, $j \sim i$) if areas $i$ and $j$ are neighbours and $w_{ij} = 0$ otherwise.
In the initial lines of code above, we computed the neighbours of each of our regions using the poly2nb function; we now aim to pass this information into a model that includes a spatially correlated random effect. We can plot the assumed dependency structure:
plot(scot)
plot(nb, coordinates(scot), add = TRUE, col = "red")
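It can also be useful to summarise the neighbourhood structure numerically; a small sketch using functions from spdep:

# Summary of the neighbour list (number of regions, links, etc.)
summary(nb)

# Number of neighbours for each region
card(nb)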
We then converted the neighbourhood object, nb, into a binary incidence matrix, W, using the function nb2mat. We will pass W to our MCMC algorithm:
sreg <- S.CARbym(Cases ~ AFF + offset(log(Expected)), family = "poisson", data = scot@data, W = W, burnin = 1000, n.sample = 10000)
Note that in this output the random effects are stored in sreg$samples$psi, which represent the combined random effects $\psi_i = \phi_i + \nu_i$ for each region; samples for $\tau^2$ are stored in sreg$samples$tau2; samples for $\sigma^2$ are stored in sreg$samples$sigma2; and samples for beta are stored in sreg$samples$beta.
Complete the following task.
Check to see if the MCMC output from the above is acceptable, i.e. that the chain appears to have converged and is mixing well. If it is not acceptable, re-run using appropriate settings to achieve convergence.
Having now fitted this second model, it is of interest to compare the models ireg and sreg. To do this, we can use the Deviance Information Criterion (DIC), see Spiegelhalter et al. (2002). The DIC is defined as

$$\text{DIC} = \bar{D} + p_D,$$

where $\bar{D} = \mathrm{E}_{\theta \mid y}\left[D(\theta)\right]$ is the mean posterior deviance, i.e. for independent and identically distributed data, $y = (y_1, \ldots, y_n)$, and parameters, $\theta$,

$$D(\theta) = -2 \sum_{i=1}^{n} \log p(y_i \mid \theta),$$

and $p_D$ is the “effective number of parameters”, given by

$$p_D = \bar{D} - D(\bar{\theta}),$$

where $\bar{\theta} = \mathrm{E}[\theta \mid y]$ and $D(\bar{\theta}) = -2 \sum_{i=1}^{n} \log p(y_i \mid \bar{\theta})$. The model with the smaller DIC is the best.
The DIC for the two models is displayed by typing the name of the object
ireg
#################
#### Model fitted
#################

Likelihood model - Poisson (log link function)
Random effects model - Leroux CAR
Regression equation - scot$Cases ~ scot$AFF + offset(log(scot$Expected))
Number of missing observations - 0

############
#### Results
############

Posterior quantities and DIC

             Median    2.5%   97.5% n.sample % accept n.effective
(Intercept) -0.5247 -0.8753 -0.2178     1000     35.8       312.4
scot$AFF     0.0722  0.0394  0.1086     1000     35.8       336.6
tau2         0.3466  0.1901  0.6602     1000    100.0       615.4
rho          0.0000  0.0000  0.0000       NA       NA          NA

            Geweke.diag
(Intercept)         0.3
scot$AFF           -0.6
tau2               -1.0
rho                  NA

DIC = 320.9806       p.d = 44.33652       LMPL = -123.48
sreg
#################
#### Model fitted
#################

Likelihood model - Poisson (log link function)
Random effects model - BYM CAR
Regression equation - Cases ~ AFF + offset(log(Expected))
Number of missing observations - 0

############
#### Results
############

Posterior quantities and DIC

             Median    2.5%  97.5% n.sample % accept n.effective
(Intercept) -0.2071 -0.4281 0.0432     9000     42.5       245.6
AFF          0.0364  0.0090 0.0585     9000     42.5       173.7
tau2         0.3636  0.1683 0.7838     9000    100.0       381.0
sigma2       0.0080  0.0022 0.0514     9000    100.0       102.5

            Geweke.diag
(Intercept)         0.7
AFF                 0.0
tau2               -1.9
sigma2              0.0

DIC = 297.6073       p.d = 28.45902       LMPL = -124.2
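The same fit criteria can also be extracted programmatically; a minimal sketch, assuming both fitted objects contain a modelfit component of the same form (as in recent versions of CARBayes):

# Compare DIC, p.d and related criteria side by side
rbind(ireg = ireg$modelfit, sreg = sreg$modelfit)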
So the model with spatially correlated random effects, sreg, has the smaller DIC and is therefore the preferred model.