CHIC465/565 Project 2018/2019

Project Details

The data for this assignment come from the AEGISS project (Ascertainment and Enhancement of Gastrointestinal Infection Surveillance and Statistics), which ran in Hampshire over a number of years; see Diggle et al. (2005) for a spatiotemporal analysis of the data. The full data consist of the geographical locations and dates of reports of non-specific gastrointestinal infection collected through the NHS Direct telephone clinical advice service (now defunct). In this project, you will perform a purely spatial analysis (i.e., ignoring time) of the data collected in Southampton.

Several socioeconomic variables are available measured on small areal divisions known as Lower Super Output Areas (LSOAs). Over the whole of Hampshire, LSOAs have a median population of around 1499 (lower quartile 1400, upper quartile 1580) but they vary quite substantially in area. Details of the available variables are below.

The main aim of this project is to write a report minimally addressing the following tasks:

1. Provide a brief 800-word summary of evidence in the literature for the relationship between non-specific gastrointestinal infection and socioeconomic risk factors.

2. Provide a thorough analysis of these data (including exploratory analyses); illustrate your findings with relevant tables and figures.

3. Look for and summarise evidence of spatial variation in risk compared with a control group simulated according to population density in the area. See the section below on fitting K-functions to these data, and see labs 2 and 3 for ideas on how to conduct this part of the analysis.

4. Use models of discrete spatial variation and generalised additive models to investigate the effects of the socioeconomic covariates on the risk of disease. In particular, one research question of interest is whether it is better to use the Index of Multiple Deprivation (IMD) or a subset of its domains to explain the spatial variation in risk (details of these are below). For the Bayesian models, you will need to provide plots illustrating that you have achieved satisfactory convergence and mixing. Overlay operations will be useful here for counting the number of cases in each LSOA. See labs 3 and 4 for ideas on how to conduct this part of the analysis, and see below for a hint on performing overlay operations.

5. Compare and contrast the results of your analyses.

Please do not replicate sections of the notes in your report. I am more interested in your ability to describe the setting; to justify, describe and conduct an appropriate analysis; and to interpret the results of that analysis. Make sure you define all terms in any statistical models you present. Bearing in mind the geographic/demographic setting of the present project, you may find Rose et al. (2016) to be a useful reference as a starting point for your literature review.

You are strongly encouraged to conduct further analyses of your own devising, based on (or extending) the material you have covered in the labs and lectures. For example, one thing not covered in the labs, but which may be interesting to explore, is a Poisson GAM (for which overlay operations can again be used to extract individual-level covariates based on the area-level covariates).

There is no formal requirement on the structure of your report. If you prefer not to split your report into strict “Introduction, Methods, Results, Conclusions” sections, then that is fine. One important quality I will be looking for is that the report is logically organised and that the text conveys a smooth flow of concepts throughout. You may find it helpful to provide an initial section in which you introduce the task, and a concluding section in which you summarise your overall findings.

Where you use mathematical notation to describe the models you are fitting, you must define all terms. The notation you use should be consistent throughout the report.

Report Requirements

Your report must be written in LaTeX and it must be appropriately referenced, preferably using BibTeX. You must compile and submit your document as a .pdf file.

You must include a copy of the most important parts of your R code as an appendix of maximum length 2 pages.

The report must not exceed 10 pages of A4 including references, but excluding the appendix with R code. The document should have margins of 1 inch on each side of the paper. Note that to get the margins you can put \usepackage[margin=1in]{geometry} in your preamble.

Reports going over the 10 pages specified above will be penalised by 10%. Any work beyond the 10 page limit will not be marked.

The deadline for this assignment is Tuesday 24th March at 9:30am for MSc students; for MSci / Data Science students, it is Monday 6th April at 10:30am.

Marking criteria

Requirements for a Pass

You should show understanding of the application area and the purpose of the analysis. Exploratory data analysis should be performed and the adequacy of any underlying assumptions addressed. Appropriate statistical methods should be selected and correctly applied, with adequate description of non-elementary methods and correct referencing of sources. You should demonstrate competence in the application of statistical methods that go beyond the scope of undergraduate level statistics courses. The report should include a clear statement of conclusions appropriate to the original aims of the analysis. More detailed grading guidelines are given below.

Grading guidelines

0-19 Fail Little or no hint of any knowledge being demonstrated and/or effort being applied.

20-39 Fail Some evidence of minimal effort being applied. There may be a little logic or structure to the document and some glimpses of understanding; however, the analysis may be obviously incorrect or contain major omissions.

40-49 Fail There is a discernible structure and logic to the document, at least some evidence of understanding and appropriate use of techniques, and no blatantly obvious major mistakes or major omissions. However, no MSc level techniques have been used (or, if used, there is little evidence of understanding) and/or the research question has not been answered sufficiently well.

50-59 Pass This should be a coherent and structured account where a research question is approached using appropriate statistical techniques including one or more at MSc level. This account should be largely correct and accompanied by relevant plots, tables and a bibliography which are then used within the text to illustrate the argument that addresses the research question. The techniques must have been applied and interpreted correctly and there should be some demonstration that the concepts behind the techniques have been understood.

60-69 Good pass As for a Pass but a) with some use of techniques or understanding of issues beyond those covered in the MSc modules studied by the student; and b) demonstrating a good conceptual or mathematical understanding. The account should flow logically and be largely complete (e.g. including exploratory analysis, leading to the formulation of appropriate model(s), and a diagnostic stage checking the model assumptions). The conclusion should demonstrate that the statistical analysis has addressed and answered the research question. This should be done in a competent, convincing and well reasoned manner.

70-79 Distinction As for a Good Pass. In addition, the student should demonstrate mastery, both mathematically and conceptually, of either a) an appropriate substantial body of methodology beyond the MSc modules studied, or b) an appropriate methodology for the problem from the MSc modules, but in addition demonstrating deep understanding of the relationship between the statistical methods used and the subject matter and discussing why their approach is better than alternatives that could have been chosen. A distinction would normally demonstrate a high level of insight, understanding and clarity, including consistent use of notation, and the research question should be answered logically and completely.

80-100 Outstanding distinction As for a Distinction. However, the writing in the report must be of publishable quality (i.e. clear, unambiguous and with good style) and the work itself should be publishable, with little modification other than making it more concise, in a reputable statistical/applied statistical journal.

Obtaining and Loading the Data

The data for this project are available on the Moodle page for this course. The required file is SHAMP_AEGISS_DATA.RData and you can load it into R using

library(sp)
library(spatstat)
load("SHAMP_AEGISS_DATA.RData")

The file contains the following objects:

objects()
[1] "controls" "shamp"    "win"      "x"        "y"

The details of these are as follows:

x: The x-coordinates of the 1000 cases.
y: The y-coordinates of the 1000 cases.
win: The observation window, i.e., Southampton, for use with spatstat.
shamp: A SpatialPolygonsDataFrame object containing the covariate information measured at the Lower Super Output Area (LSOA) level.
controls: The x-y coordinates of the 1000 controls.
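As an initial exploratory step (optional, and assuming that win can be coerced to a spatstat window and that controls is a two-column matrix of coordinates), the case and control locations can be plotted on the Southampton window:

# Exploratory plot of case and control locations on the Southampton window
plot(as.owin(win), main = "AEGISS cases (red) and controls (blue)")
points(x, y, pch = 16, cex = 0.4, col = "red")
points(controls[, 1], controls[, 2], pch = 16, cex = 0.4, col = "blue")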

The object shamp contains the following variables:

LSOA04CD: LSOA code.
LSOA04NM: LSOA name.
pop: Total population in each LSOA.
males: Number of males.
females: Number of females.
propmale: Male proportion.
IMD: Index of Multiple Deprivation (IMD); see UK Government (2014).
Income: Income domain of the IMD.
Employment: Employment domain of the IMD.
Health: Health domain of the IMD.
Education: Education domain of the IMD.
Barriers: Barriers domain of the IMD.
Crime: Crime domain of the IMD.
Environment: Environment domain of the IMD.
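For exploratory purposes, any of these variables can be mapped over the LSOAs with spplot from the sp package; for example, to map the IMD (swap in any other column name of shamp to map a different covariate):

# Choropleth of the Index of Multiple Deprivation over the Southampton LSOAs
spplot(shamp, "IMD", main = "Index of Multiple Deprivation by LSOA")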

Estimating K-functions

Computation of the edge-corrected K-function for an observed point pattern may be computationally prohibitive due to (i) the number of points and (ii) the complexity of the observation window. The polygonal observation window, win, has a large number of edges, so before you can proceed with a K-function analysis of the cases and controls, it might be necessary to simplify it (but try with the full data on your PC first). If your computer is too slow you can try:

simpwin <- simplify.owin(win, dmin = DIST)

for some suitable DIST (a value of around 200 may work well), and then

cases <- ppp(x = x, y = y, window = simpwin)

Note that the order of these operations is important here. Some of the points may now fall outside the simplified observation window; this is to be expected, but check that not too many points are affected.
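As a sketch of how this part of the analysis might begin (assuming controls is a two-column matrix of x-y coordinates, and with the border edge correction chosen purely as an illustration), the K-functions of the cases and controls can be estimated and overlaid:

# 'cases' was constructed above; build the analogous control pattern
# (assumes 'controls' is a two-column matrix of x-y coordinates)
ctrls <- ppp(x = controls[, 1], y = controls[, 2], window = simpwin)

# Edge-corrected K-function estimates for each pattern
Kcases <- Kest(cases, correction = "border")
Kctrls <- Kest(ctrls, correction = "border")

# Overlay the estimates; discrepancies between the curves suggest
# spatial variation in risk relative to the population-based controls
plot(Kcases, border ~ r, main = "K-functions: cases and controls")
plot(Kctrls, border ~ r, add = TRUE, col = "red")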

HINTS

The SpatialPolygonsDataFrame containing the polygonal regions and covariate data, shamp, has a projection string attached to it; in this case it is the Ordnance Survey GB projection (for those of you who have been trekking in this country, this is the grid reference used on OS maps). Because the Earth is roughly spherical, in order to produce maps of countries (or regions within countries) we need some way of projecting the sphere onto a plane. The projection string tells R how to do this. The spatial objects you worked with in lab 4 did not have a projection string, but the data for the project do, so you will have to tell R that the x and y values use the same projection. This can be achieved as follows:

casesidx <- over(SpatialPoints(cbind(x, y), proj4string = CRS(proj4string(shamp))),
    geometry(shamp))

For each case, this gives you the index number of the polygon within which the case is contained. There are 146 LSOAs in Southampton, so the object casesidx will be a vector of length 1000, whose entries are integers between 1 and 146 (or NA for any case that does not fall inside an LSOA).

You can do

shamp$count <- sapply(1:146, function(i) {
    sum(casesidx == i, na.rm = TRUE)  # na.rm guards against cases not matched to an LSOA
})

to count the number of cases falling in each region and store them in the SpatialPolygonsDataFrame.
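As a quick check, the total of the per-LSOA counts should equal the number of cases that over() matched to an LSOA (ideally all 1000):

sum(shamp$count)
sum(!is.na(casesidx))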

You can use

crds <- coordinates(shamp)

to extract the coordinates of the centroids of the polygons. Using this information, you should be able to set up a Poisson generalised additive model for the count data, in a similar way to how you did this in workshop 3.

You can pull the covariate data out of the SpatialPolygonsDataFrame using

dat <- as.data.frame(shamp)

Don’t forget to use log population as an offset in your Poisson models.
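To illustrate, here is a minimal sketch of such a model using the mgcv package; the choice of covariate (IMD), the spatial smooth of the centroid coordinates, and the basis dimension k = 50 are illustrative assumptions rather than recommendations:

library(mgcv)

# Assemble an LSOA-level data frame: covariates, centroid coordinates,
# the case counts computed above, and population (for the offset)
dat <- as.data.frame(shamp)
crds <- coordinates(shamp)
dat$cx <- crds[, 1]
dat$cy <- crds[, 2]

# Poisson GAM: an IMD effect plus a spatial smooth of the centroids,
# with log population as an offset
fit <- gam(count ~ IMD + s(cx, cy, k = 50) + offset(log(pop)),
           family = poisson, data = dat)
summary(fit)

Any such fit should of course be accompanied by model checking, for example residual plots and gam.check for the smooth terms.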

Acknowledgements

The IMD data in this project are subject to the Open Government License: http://www.nationalarchives.gov.uk/doc/open-government-licence/.

Contains National Statistics data ©Crown copyright and database right 2001, 2004, 2007

Contains Ordnance Survey data ©Crown copyright and database right 2001, 2004, 2007

The data used in this project are from the AEGISS project (Diggle et al., 2005). AEGISS was supported by a grant from the Food Standards Agency, U.K., and from the National Health Service Executive Research and Knowledge Management Directorate.