3 Case-Control Methods

3.3 Spatial variation in risk

Our working model for spatial variation in risk is that:

  • cases form a Poisson process with intensity λ(x)

  • controls form a second, independent, Poisson process with intensity λ₀(x)

  • λ(x) = αλ₀(x)ρ(x), where

    • α is determined by the number of controls in the design

    • ρ(x) represents spatial variation in risk

It follows that, conditional on both case and control locations:

  • case/control labels are determined by a series of independent Bernoulli trials with success probabilities

    \[
    p(x) = \frac{\lambda(x)}{\lambda(x) + \lambda_0(x)} = \frac{\alpha\rho(x)}{1 + \alpha\rho(x)}
    \]
  • spatial variation in risk is estimable up to a constant of proportionality

We revisit this topic in Section 6.
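As a rough illustration of this labelling result (not from the original notes), the following R sketch simulates N locations on the unit square and assigns case/control labels by independent Bernoulli trials with probability p(x); the risk surface ρ and the constant α are hypothetical choices.

```r
# Minimal sketch (hypothetical rho and alpha): conditional on the locations of
# the superposed case/control process, labels are independent Bernoulli trials
# with success probability p(x) = alpha * rho(x) / (1 + alpha * rho(x)).
set.seed(1)
rho   <- function(x, y) exp(2 * x - 1)           # hypothetical risk surface
alpha <- 0.5                                     # hypothetical design constant
N     <- 500
loc   <- data.frame(x = runif(N), y = runif(N))  # locations of all N individuals
p     <- alpha * rho(loc$x, loc$y) / (1 + alpha * rho(loc$x, loc$y))
label <- rbinom(N, size = 1, prob = p)           # 1 = case, 0 = control
```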

Kelsall and Diggle (1998) consider three approaches to nonparametric estimation:

  1. Density ratio, see Section 3.3.1.

  2. Binary (logistic) regression, see Section 3.3.2.

  3. Generalized additive model, see Section 3.3.3.

We now look at each of these methods in detail.

3.3.1 Density ratio

This section could arguably be renamed “intensity ratios”, as the main idea is to look at the quotient of nonparametric estimates of the intensity of the process describing the cases and the process describing the controls. However, the theoretical framework for the methods in this section derives from nonparametric density estimation (i.e. the estimation of probability density functions), so we stick with the term “density ratios”.

This method first uses kernel smoothing to estimate λ(x) and λ₀(x) separately, then substitutes these estimates into the expression λ(x) = αλ₀(x)ρ(x) to estimate ρ(x).

The basic problem addressed by the kernel smoothing method is to estimate an unknown probability density function from an observed sample of data. This problem is straightforward if one is prepared to assume a parametric form for the density function, f(x) say, since then all that is required is to estimate the parameters of f(x), for example by maximum likelihood.

It often occurs, especially in estimating unknown intensity functions, that it is not practical to assume a particular parametric form. In that case, non-parametric or semi-parametric methods are required.

Consider a sample x_1, …, x_n from an unknown univariate density function f(x). A simple but rather unsatisfactory method is to treat the underlying distribution as discrete, placing mass 1/n on each observed point:

\[
\hat f(x) = \begin{cases} 1/n & x \in \{x_1, \ldots, x_n\} \\ 0 & \text{otherwise.} \end{cases}
\]

A natural improvement is based on histogram methods. Note that the bandwidth (i.e. width of each bar) has an impact on the appearance of the estimate of f(x).

Kernel smoothing improves further on these simple estimates by treating the underlying distribution as having a continuous, smooth density function. Specifically, we estimate f(x) by

\[
\hat f(x; h) = \frac{1}{nh} \sum_{i=1}^{n} W\{(x - x_i)/h\}.
\]

Here, W(t) is a symmetric (even) function, non-increasing in |t|, known as the kernel, and h>0 is a scalar which determines the amount of smoothing. Each data-point xi contributes to the estimate of f(x) at each possible value of x, with the largest contributions coming from data-points close to x.

Possible choices for W(·) include the standard Gaussian density function, the uniform density function on a specified interval, and the Epanechnikov kernel:

\[
W(t) = \begin{cases} \tfrac{3}{4}(1 - t^2) & |t| \le 1 \\ 0 & \text{otherwise.} \end{cases}
\]
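As an aside, the estimator \(\hat f(x;h)\) is straightforward to implement directly; the following R sketch uses a Gaussian kernel for W (the function name kde and the simulated data are illustrative, not from the text).

```r
# Kernel density estimate f-hat(x; h) = (nh)^{-1} sum_i W{(x - x_i)/h},
# here with W taken to be the standard Gaussian density.
kde <- function(x, data, h) {
  sapply(x, function(x0) sum(dnorm((x0 - data) / h)) / (length(data) * h))
}

set.seed(1)
xs   <- rnorm(25)                              # sample of size n = 25
grid <- seq(-4, 4, length.out = 200)
plot(grid, kde(grid, xs, h = 0.5), type = "l",
     xlab = "x", ylab = "density")             # cf. density(xs, bw = 0.5)
```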
Figure 3.4: Three histograms (shaded grey) created using the same data, with different bandwidths (left to right: 0.025, 0.1, 0.25), and the corresponding kernel-smoothed estimates of the density function (smooth solid black line). Smaller bandwidths lead to more ‘unstable’ estimates of the density; larger bandwidths lead to smoother estimates.

Another way of looking at the effect of changing h appears in the graphs in Figure 3.5, for which the data are a random sample of size n=25 from the standard Gaussian distribution.

Figure 3.5: Illustrating the construction of a non-parametric estimate of the density function for two choices of the bandwidth parameter (0.2 on the left and 0.5 on the right). Each solid curve shows the kernel centred on an individual observation. The dashed curve is the kernel estimate, formed by summing the individual kernel contributions at each value of x. Where a kernel centred on an individual location does not overlap with any other such kernel, the dashed curve lies on top of the solid curve. The choice of the bandwidth h is clearly crucial.

One method for bandwidth determination is least-squares cross-validation; see Wand and Jones (1994). We would like to choose h to minimise the integrated mean square error

\[
R(\hat f, f) = \mathbb{E}\left[ \int \{\hat f(x; h) - f(x)\}^2 \, dx \right].
\]

We cannot do this directly, as f(x) is unknown, but instead we can estimate the part of this function that depends on h,

\[
J(h) = \int \{\hat f(x; h)\}^2 \, dx - 2 \int \hat f(x; h) f(x) \, dx,
\]

by

\[
\hat J(h) = \int \{\hat f(x; h)\}^2 \, dx - \frac{2}{n} \sum_{i=1}^{n} \hat f_{(-i)}(x_i; h),
\]

where

\[
\hat f_{(-i)}(x; h) = \frac{1}{(n-1)h} \sum_{j \ne i} W\{(x - x_j)/h\}
\]

is the so-called ‘jackknife’ (leave-one-out) estimate of f(x), obtained using all except the ith data-point. We then choose h to minimise \(\hat J(h)\). Stone’s theorem (Theorem 3.1 below) guarantees, under certain conditions, that in large samples the bandwidth \(\hat h\) obtained by cross-validation converges to the value that minimises J(h). Further technical details are provided by Scott and Terrell (1987).
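For illustration, the criterion \(\hat J(h)\) can be written out directly in R as below; this is a minimal sketch with a Gaussian kernel, and the search interval is an arbitrary choice. Base R’s bw.ucv() implements essentially this least-squares (‘unbiased’) cross-validation criterion.

```r
# Least-squares cross-validation:
# J-hat(h) = int f-hat^2 dx - (2/n) * sum_i f-hat_{(-i)}(x_i).
cv_score <- function(h, data) {
  n    <- length(data)
  fhat <- function(x) sapply(x, function(x0) sum(dnorm((x0 - data) / h)) / (n * h))
  int_f2 <- integrate(function(x) fhat(x)^2, -Inf, Inf)$value
  loo  <- sapply(seq_len(n), function(i)     # leave-one-out estimates at x_i
    sum(dnorm((data[i] - data[-i]) / h)) / ((n - 1) * h))
  int_f2 - 2 * mean(loo)
}

set.seed(1)
xs <- rnorm(25)
h_star <- optimize(cv_score, interval = c(0.05, 2), data = xs)$minimum
# Compare with the built-in criterion: bw.ucv(xs)
```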

Theorem 3.1.

Stone’s Theorem. (NOT EXAMINABLE) Suppose f is bounded. Let \(\hat f(\cdot\,; h)\) denote the kernel estimator with bandwidth h, and let h* denote the bandwidth chosen by cross-validation. Then

\[
\frac{\int \{f(x) - \hat f(x; h^*)\}^2 \, dx}{\inf_h \int \{f(x) - \hat f(x; h)\}^2 \, dx} \longrightarrow 1
\]

almost surely.

In a spatial setting, the method is similar. The x_i are now case or control locations (treated separately), and W(·) is a circularly symmetric bivariate density function. The same value of h is usually used to estimate both λ(x) and λ₀(x).

Since λ(x) = αλ₀(x)ρ(x), we have

\[
\hat\rho(x) = \frac{\hat\lambda(x)}{\hat\alpha\,\hat\lambda_0(x)}.
\]

An estimate of the probability that a given individual at location x is a case is

\[
\hat p(x) = \frac{\hat\lambda(x)}{\hat\lambda(x) + \hat\lambda_0(x)}.
\]
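A minimal two-dimensional sketch of the density-ratio estimate, using kde2d() from the MASS package with a Gaussian kernel, might look as follows; the locations, window, and bandwidth are all hypothetical. Because each kde2d surface integrates to one, \(\hat\lambda(x) = n\hat f_1(x)\) and \(\hat\lambda_0(x) = m\hat f_0(x)\); taking \(\hat\alpha = n/m\), the factors n and m cancel and \(\hat\rho\) reduces to the ratio of the two density estimates.

```r
# Density-ratio sketch in 2-D (hypothetical data and bandwidth).
library(MASS)                                   # for kde2d()

set.seed(1)
case_x <- runif(100); case_y <- runif(100)      # hypothetical case locations
ctrl_x <- runif(200); ctrl_y <- runif(200)      # hypothetical control locations

h    <- 0.2                                     # common bandwidth for both
lims <- c(0, 1, 0, 1)                           # common evaluation window
f1 <- kde2d(case_x, case_y, h = h, n = 64, lims = lims)  # density of cases
f0 <- kde2d(ctrl_x, ctrl_y, h = h, n = 64, lims = lims)  # density of controls

# With alpha-hat = n/m, the n and m in lambda-hat = n * f1 and
# lambda0-hat = m * f0 cancel, leaving a ratio of density estimates.
rho_hat <- f1$z / f0$z
image(f1$x, f1$y, log(rho_hat), xlab = "x", ylab = "y")  # log risk surface
```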

3.3.2 Binary (logistic) regression

  • For n cases and m = N − n controls, let Y_i = 1 if the individual at location x_i is a case and Y_i = 0 if a control.

  • Define weights

    \[
    w_i(x) = \frac{W\{(x - x_i)/h\}}{\sum_{j=1}^{N} W\{(x - x_j)/h\}}
    \]
  • The kernel estimator is

    \[
    \hat p(x) = \sum_{i=1}^{N} w_i(x) Y_i,
    \]

    and so

    \[
    \hat\rho(x) = \frac{\hat p(x)}{\hat\alpha\{1 - \hat p(x)\}},
    \]

    as illustrated in the sketch after this list.
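The sketch referred to above: a direct R implementation of \(\hat p(x)\) in one dimension with a Gaussian kernel (the data, bandwidth, and the estimate \(\hat\alpha = n/m\) are illustrative assumptions, not from the text).

```r
# Kernel binary regression: p-hat(x) = sum_i w_i(x) * Y_i, Gaussian kernel.
p_hat <- function(x, locs, y, h) {
  sapply(x, function(x0) {
    w <- dnorm((x0 - locs) / h)   # unnormalised weights W{(x - x_i)/h}
    sum(w * y) / sum(w)           # dividing by sum(w) normalises the weights
  })
}

set.seed(1)
locs <- runif(100)                            # N = 100 locations (hypothetical)
y    <- rbinom(100, 1, plogis(2 * locs - 1))  # 0/1 control/case labels
grid <- seq(0, 1, length.out = 101)

ph        <- p_hat(grid, locs, y, h = 0.1)
alpha_hat <- sum(y) / sum(1 - y)              # crude alpha-hat = n / m
rho_hat   <- ph / (alpha_hat * (1 - ph))      # rho-hat from the bulleted formula
```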

As in the kernel smoothing method, the amount of smoothing is determined primarily by the bandwidth, h. Kelsall and Diggle (1998) recommend cross-validation to choose h, similar to the cross-validation method described above. For binary regression, with no explanatory variables, this is defined as follows:

  • for each h, let \(\hat p_{(-i)}(x_i)\) be the estimate of p(x_i) using all data except Y_i;

  • choose h to maximise the cross-validated likelihood or, equivalently, minimise CV(h) (minus the cross-validated log-likelihood), defined as

    \[
    CV(h) = -\sum_{i=1}^{n} \log \hat p_{(-i)}(x_i) - \sum_{i=n+1}^{N} \log\{1 - \hat p_{(-i)}(x_i)\},
    \]

    where cases are labelled i = 1, …, n and controls i = n + 1, …, N; a sketch follows this list.
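Continuing the previous sketch, a brute-force version of CV(h) might look as follows; the small constant eps guarding against log 0 is our addition, not part of the method as described.

```r
# CV(h): minus the leave-one-out log-likelihood of the binary labels.
cv_binary <- function(h, locs, y) {
  loo <- sapply(seq_along(y), function(i) {
    w <- dnorm((locs[i] - locs[-i]) / h)   # weights from all points except i
    sum(w * y[-i]) / sum(w)                # p-hat_{(-i)}(x_i)
  })
  eps <- 1e-10                             # guard against log(0)
  -sum(y * log(loo + eps) + (1 - y) * log(1 - loo + eps))
}

# Reusing locs and y from the previous sketch:
h_star <- optimize(cv_binary, interval = c(0.02, 0.5), locs = locs, y = y)$minimum
```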

Jarner et al. (2002) develop the modifications needed to analyse matched case-control data. Although a matched design is generally not recommended when spatial variation in risk is of scientific interest, if one has been used then the method of analysis must respect it.

If explanatory variables are to be included, the generalized additive model (Section 3.3.3) can be used. It applies a similar idea, incorporating a cross-validated likelihood step to estimate residual spatial risk after the effects of covariates on p(x) have been estimated.

Example 3.3.

Lung and stomach cancers in Walsall (from Kelsall and Diggle (1998)).

Figure 3.6: Cross-validation criterion CV(h) for the pancreas (dashed line), stomach (dotted line) and lung (solid line) cancer data. The cross-validated likelihood has well-defined minima for the stomach and lung cancers, but not for pancreatic cancer.
  • Well-defined minimum of CV(h) for lung and stomach cancer data (Figure 3.6).

  • No well-defined minimum for (rare) pancreatic cancer implies no strong evidence for spatial variation in risk for pancreatic cancer (also Figure 3.6).

Figure 3.7: The shaded area is the log₂ risk across Birmingham (legend at the top left of each plot). The left plot shows the log₂ risk for lung cancer and the right plot shows the log₂ risk for stomach cancer. The solid and dashed lines identify boundaries of regions where risk is significantly higher or lower, respectively, than average.
  • There are some minor differences between the patterns of variation for the two diseases, but they are broadly similar.

3.3.3 Generalized additive model

Generalized additive models (GAMs) are an extension of generalized linear models (GLMs) that allows a more flexible functional relationship between covariates and the outcome of interest. In GLMs this relationship is typically assumed to be linear or, if not, must be specified in advance. In much the same way as the kernel smoothing methods described above, GAMs allow the form of this relationship to be estimated as part of the model-fitting procedure.

The model assumption is

\[
\log\left(\frac{p(x)}{1 - p(x)}\right) = u(x)^{\top}\beta + g(x),
\]

where u(x) is a vector of known risk factors and g(x) is a function that models smooth residual spatial variation.

In a spatial setting, x is a geographical location, but terms analogous to g(x) could also be included to investigate the form of the effects of measured covariates, in both spatial and non-spatial regression analyses.

We have assumed a logit link, log{p(x)/[1 − p(x)]}, which is canonical for the Bernoulli distribution. As for GLMs, other link functions could also be specified, and g(x) could be replaced by \(\sum_{i=1}^{p} g_i(x_i)\) for greater generality.

We have \(\hat\alpha\,\hat\rho(x) = \exp\{u(x)^{\top}\hat\beta + \hat g(x)\}\), and the log-likelihood is of the form

\[
\sum_{i=1}^{n} \log p(x_i) + \sum_{i=n+1}^{N} \log\{1 - p(x_i)\}.
\]

Details of the fitting algorithm, which includes a kernel smoothing step for \(\hat g(x)\) within iteratively weighted least squares, are available in Hastie and Tibshirani (1990). The method is implemented in the R package mgcv.
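A hedged sketch of fitting such a model with mgcv is given below; the data are simulated and the variable names hypothetical. Note that mgcv represents the smooth term by penalized regression splines rather than the kernel-smoothing backfitting described by Hastie and Tibshirani (1990), so this is a variant of the approach above rather than an exact implementation.

```r
# Sketch: logistic GAM with one linear risk factor and a 2-D spatial smooth.
library(mgcv)

set.seed(1)
dat <- data.frame(x1 = runif(200), x2 = runif(200), u1 = rnorm(200))
dat$y <- rbinom(200, 1, plogis(0.5 * dat$u1 + dat$x1 - 0.5))  # simulated labels

# s(x1, x2) plays the role of g(x); u1 enters linearly with coefficient beta.
fit <- gam(y ~ u1 + s(x1, x2), family = binomial, data = dat)

summary(fit)           # estimated beta for u1 and significance of the smooth
plot(fit, scheme = 2)  # image of the fitted spatial surface g-hat(x)
```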