In this section we analyse date from a study of a type of adult chronic leukaemia - Chronic Granulocytic Leukaemia (CGL) - occurring between January 1982 and December 1998 for a population of individuals whose address at diagnosis lies within the north-west region (Lancashire and Greater Manchester).
Load the required files:
library(spatstat) cases <- read.csv("leukaemia_2011_cases.txt", header = TRUE) controls <- read.csv("leukaemia_2011_controls.txt", header = TRUE) poly <- read.table("NW_poly.txt", header = TRUE)
Re-format some of the objects for use shorty:
win <- owin(poly = list(x = rev(poly$x), y = rev(poly$y))) leukdata <- rbind(cases, controls) leukdata$sex <- as.factor(leukdata$sex) levels(leukdata$sex) <- c("Female", "Male") leukdata$label <- as.factor(leukdata$label) levels(leukdata$label) <- c("Control", "Case")
The order of the points was reversed in creating the object win above because spatstat requires points to be given in an anti-clockwise order; if the call to rev is omitted, a warning message is issued.
The research question is to identify whether there is spatial variation in CGL risk, and whether this variation is associated with any of three potential individual-level covariates:
Sex (female, male)
Age (in years)
Deprivation
Deprivation is measured by the so-called ‘Townsend Index of Deprivation’, a standardised measure calculated from several socio-economic variables that quantifies the deprivation level at each residential address.
Draw the study region and the locations of cases and controls:
leukppp <- ppp(x = leukdata$x, y = leukdata$y, marks = leukdata$label, window = win) plot(leukppp, chars = c(1, 3), cols = c("red", "black"))
In this study, cases are adults diagnosed with CGL, and unmatched controls have been selected to be representative of the adult population resident in the north-west region.
Perform an exploratory data analysis. This is recommended at the start of any analysis in order to inform the decision about the choice of methods used. Use graphs and tables to conduct a brief exploratory analysis of the covariates and the relationships between them in this study. Do there appear to be differences in covariate values between cases and controls?
This exploratory analysis should include:
Histograms, or density plots of deprivation by case/control status plotted on top of each other for comparison.
Histograms, or density plots of age by case/control status plotted on top of each other for comparison.
A table of sex against case control status, compute the odds ratio and a confidence interval.
HINT: you might find the ggplot2 package useful here, you can look at examples online and the helpfiles (e.g. type “overlay histograms ggplot2” into Google. The overlaying of histograms or density plots can be done in several ways, e.g. using qplot or ggplot).
Now fit a logistic regression model to model the effects of the covariates on case/control status:
logreg <- glm(label ~ age + sex + deprivation, family = binomial(link = "logit"), data = leukdata) summary(logreg)
Find confidence intervals for the parameter estimates:
exp(confint(logreg))
How do you interpret these estimates?
Now fit a GAM to allow for possible residual spatial variation:
leugam <- gam(label ~ age + sex + deprivation + s(x, y), family = binomial(link = "logit"), data = leukdata) summary(leugam)
Plot the results:
par(pty = "s") plotDOTgam(leugam, se = F, xlim = c(-0.2, 1), ylim = c(-0.1, 1.1)) plot(win, add = TRUE, col = rgb(0, 0, 1, alpha = 0.3))
A bandwidth of leugam gives the estimated smoothing parameter, which equals 3.0672186 in this case. What happens if a different bandwidth (e.g. ) is selected?
leugam2 <- gam(label ~ age + sex + deprivation + s(x, y), sp = 0.01, family = binomial(link = "logit"), data = leukdata) plotDOTgam(leugam2, se = F, xlim = c(-0.2, 1), ylim = c(-0.1, 1.1)) plot(win, add = TRUE, col = rgb(0, 0, 1, alpha = 0.3))
The mgcv package also contains the function vis.gam, which can be used for mapping the estimated probability of an individual at each location being a case; add the case and control locations, if it helps:
plot(win) vis.gam(leugam, view = c("x", "y"), n.grid = 100, plot.type = "contour", add = TRUE, type = "response", color = "terrain", too.far = 0.08, vfont = c("sans serif", "bold")) plot(win, add = TRUE) points(controls, pch = ".") points(cases, pch = ".")
These mapped probabilities seem very high as estimates of disease risk: why is this?