Previously we looked at kernel smoothing methods for estimating and comparing intensity functions. An alternative method is generalised additive modelling. A generalized additive model is a semi-parametric model with the form
where is a conventional parametric component of a linear predictor (in a case-control setting, this part can be the covariates we want to adjust for), and a non-parametric smoothing function of other covariates . is a link function (as for the traditional GLM). Model parameters are estimated using a penalised maximum likelihood approach. This works by adding a term added into the likelihood function to penalise roughness in the function .
We will use the mgcv package to do such analysis. Load the package into R using library(mgcv). For details of the available funcitons in the mgcv package, see
For further theoretical understanding of the GAM see Wood (2006); this book also gives more information on the mgcv package.
The data we use here come from a pioneering epidemiological study by Dr John Snow (1813-1858), who famously removed the handle from an unwholesome water pump to stop a cholera epidemic in London in 1854.
Download the files cholera_deaths.txt, cholera_pumps.txt and cholera_explain.txt from Moodle and read the data into R.
library(spatstat) deaths <- as.matrix(read.table("cholera_deaths.txt")) pumps <- as.matrix(read.table("cholera_pumps.txt")) win <- owin(xrange = c(8, 18), yrange = range(6, 17)) # observation window
Write some code to display deaths and pumps on a map.
Now we create an artificial homogeneous control group (no control group was available in the days of Dr John Snow). We assume the population is homogeneously living in the defined area, and we take 2 controls per case. We also create a flag for cases and controls and attach a fake age for each individual.
set.seed(1) controls <- rpoispp(lambda = 2 * nrow(deaths)/area.owin(win), win = win) cholera <- rbind(deaths, cbind(controls$x, controls$y)) cholera <- cbind(cholera, c(rep(1, nrow(deaths)), rep(0, controls$n)), sample(1:60, nrow(cholera), replace = TRUE)) colnames(cholera) <- c("xcoord", "ycoord", "flags", "age") cholera <- as.data.frame(cholera) # turn it into a data.frame object head(cholera)
xcoord ycoord flags age 1 13.588010 11.095600 1 51 2 9.878124 12.559180 1 8 3 14.653980 10.180440 1 55 4 15.220570 9.993003 1 13 5 13.162650 12.963190 1 22 6 13.806170 8.889046 1 2
We will adjust for age in the GAM and let any residual effect be modelled by a smooth 2-D function, here we also load the mgcViz package to help visualise the results of our analyses.
library(mgcv) library(mgcViz) fit <- gam(flags ~ age + s(xcoord, ycoord), data = cholera, family = binomial(link = logit))
Use the following to summarise and plot the result:
summary(fit) vz <- getViz(fit) plot(sm(vz, 1)) + l_fitRaster() + l_fitContour() + l_points(shape = "+", size = 3) + geom_point(data = data.frame(x = pumps[, 1], y = pumps[, 2], z = 1), aes(x, y), size = 5, shape = "X", colour = "blue")
Why do we use the logit link?
The plot shows . What do you notice?
Check the significance for each covariate added. Do you need to include these in the model?
We now add distance from Broad Street pump (the 7th pump) as explanatory variable in our model (retaining only the significant covariates).
broadstreet <- pumps[7, ] cholera$distance <- apply(cbind(cholera$x, cholera$y), 1, function(x) { return(sqrt(sum((x - broadstreet)^2))) })
Fit the corresponding GAM
fit2 <- gam(flags ~ distance + s(xcoord, ycoord), data = cholera, family = binomial(link = logit)) summary(fit2) vz2 <- getViz(fit2)
The default plot function for objects of class gam, plot.gam, produces results that can be difficult to read. But we can hack the code to try to produce plots with more sensibly sized text:
source("plotDOTgam.R") source("plotMGCVSMOOTH.R")
don’t worry about the detail of this code, but use ?plot.gam to find out more about this function.
Using the plotDOTgam function, plot the fitted residual surfaces from the two fits (with and without the distance from Broad Street pump) side-by-side. Note that in these plots, the level curves of the smooth spatial effect, are black lines; a confidence region (at 1 standard error) for each of these curves is bounded by a red contour below and a green contour above, the height of the surface is given as a number somewhere on the contour. What are your conclusions?