
4.2 Ecological bias

When regression models are fitted to area-level data, the effect of covariates on the fitted means $\mu_{i}$ may or may not be the same as the corresponding effects on individual risk. Differences between covariate effects at individual and at area levels lead to what is usually called ecological bias.

4.2.1 Ecological bias as model mis-specification

• Suppose that we wish to describe the relationship between an outcome variable $Y$ and an exposure $x$;

• imagine that we collect data $(x_{ij},Y_{ij})$, where $i$ denotes groups (areas), and $j$ denotes individuals within groups.

• The figure below shows the relationship between $x$ and $Y$ using synthetic data from individuals in five groups:

Figure: Potential relationships between $x$ and $Y$ that might arise depending on the choice of model. In each plot, the red dots are the data points, $(x_{ij},Y_{ij})$. Broadly speaking, these show a strong negative trend, but within this negative trend there are five groups of individuals, and within each group the trend between $x$ and $Y$ is slightly positive. In the top plot, only the individual-level data are shown. In the bottom left plot, in addition to the individual-level data, the large blue dots are the group-level averages $(\bar{x}_{i},\bar{Y}_{i})$, where $\bar{x}_{i}$ is the mean of the $x$s in the $i$th group and $\bar{Y}_{i}$ is the mean of the $Y$s in the $i$th group. In the bottom right plot, the group-level averages are shown as hollow blue circles and two sets of regression lines have been added. The dashed blue line corresponds to a regression on the group-level averages (model 1 in the text), whereas the five solid red lines are from a regression model that respects both the group-level and individual-level structures in the data (model 3 in the text). Model 1 would lead one to conclude that the relationship between $Y$ and $x$ is strongly negative, whereas model 3 captures the positive trend between $Y$ and $x$ within each group.

Consider three possible models for the above synthetic data:

1. Common individual-level regressions

$Y_{ij}=\alpha+\beta x_{ij}+Z_{ij},\quad Z_{ij}\sim{\rm iid}\ {\rm N}(0,\sigma^{2})$

$\bar{Y}_{i}=\alpha+\beta\bar{x}_{i}+\bar{Z}_{i},\quad \bar{Z}_{i}\sim{\rm iid}\ {\rm N}(0,\sigma^{2}/n_{i})$

This model is not correct.
2. Separate individual-level regressions

$Y_{ij}=\alpha_{i}+\beta x_{ij}+Z_{ij},\quad Z_{ij}\sim{\rm iid}\ {\rm N}(0,\sigma^{2})$

$\bar{Y}_{i}=\alpha_{i}+\beta\bar{x}_{i}+\bar{Z}_{i},\quad \bar{Z}_{i}\sim{\rm iid}\ {\rm N}(0,\sigma^{2}/n_{i})$

This model is correct, but not identifiable from group-level data (we need individual-level data for each group to estimate the group’s slope).
3. An additional group-level covariate

$Y_{ij}=\alpha_{i}+\beta x_{ij}+Z_{ij},\quad Z_{ij}\sim{\rm iid}\ {\rm N}(0,\sigma^{2})$

$\alpha_{i}=\alpha+\gamma u_{i}+Z_{i}^{*},\quad Z_{i}^{*}\sim{\rm iid}\ {\rm N}(0,\tau^{2})$
  (a) if $u_{i}\neq\bar{x}_{i}$, then

  $\bar{Y}_{i}=\alpha+\beta\bar{x}_{i}+\gamma u_{i}+(Z_{i}^{*}+\bar{Z}_{i})$

  This model is correct, and is identifiable if $u_{i}$ is known and $|{\rm Corr}(u_{i},\bar{x}_{i})|<1$.

  (b) if $u_{i}=\bar{x}_{i}$, then

  $\bar{Y}_{i}=\alpha+(\beta+\gamma)\bar{x}_{i}+(Z_{i}^{*}+\bar{Z}_{i})$

  This model is implicitly assumed in an ecological regression of $\bar{Y}_{i}$ on $\bar{x}_{i}$; hence the estimand in ecological regression is $\beta+\gamma$, not $\beta$.
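
The distinction between $\beta$ and $\beta+\gamma$ in model 3(b) can be checked by simulation. The following is a minimal sketch rather than part of the original analysis: the parameter values, the use of 50 groups (more than the five in the figure, so that the estimates are stable) and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) settings: within-group slope beta > 0,
# group-level effect gamma < 0, so that beta + gamma < 0.
n_groups, n_per_group = 50, 40
alpha, beta, gamma = 10.0, 0.5, -2.0
sigma, tau = 0.5, 0.3

# Individual exposures x_ij scattered around group-specific centres.
centres = rng.uniform(0.0, 10.0, size=n_groups)
x = centres[:, None] + rng.normal(0.0, 1.0, size=(n_groups, n_per_group))
x_bar = x.mean(axis=1)

# Model 3 with u_i equal to the group mean of x (case (b)).
alpha_i = alpha + gamma * x_bar + rng.normal(0.0, tau, size=n_groups)
y = alpha_i[:, None] + beta * x + rng.normal(0.0, sigma, size=x.shape)
y_bar = y.mean(axis=1)

# Ecological regression: least squares on the group-level averages.
eco_slope, eco_intercept = np.polyfit(x_bar, y_bar, 1)

# Within-group slope: least squares after centring each group,
# which removes the group intercepts alpha_i.
xc = x - x_bar[:, None]
yc = y - y_bar[:, None]
within_slope = (xc * yc).sum() / (xc ** 2).sum()

print(f"ecological slope  : {eco_slope:.3f} (beta + gamma = {beta + gamma})")
print(f"within-group slope: {within_slope:.3f} (beta = {beta})")
```

Under these assumed settings the ecological slope should lie close to $\beta+\gamma$ and the within-group slope close to $\beta$, so the two analyses disagree in sign even though both are fitted to the same data.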

Example 4.1.

A widely cited early example of the effects of ecological bias, from outside environmental epidemiology, relates to a study carried out by the French sociologist Émile Durkheim. Durkheim collected data on suicide rates in Prussian provinces in the 1890s, and compared these to the proportion of the population that was Protestant. The graph below shows the relationship for four typical provinces, and indicates an apparently strong positive relationship between the two variables.

The ecological bias here becomes apparent when examining suicide rates within individual religious groups (Protestants, Catholics, Jews, etc.). The suicide rate among non-Protestants was highest in the provinces that contained the most Protestants, and it is this that appears to explain the ecological effect shown in the figure. For more details, see Morgenstern (1995).

Figure 4.2: Plot of proportion Protestant against suicide rate in Prussian provinces in the 1890s. The dots are the actual data and the solid line is a regression line on these ecological-level variables.

Ecological bias is closely related to the phenomenon known as Simpson’s paradox, or Yule’s paradox. Wakefield (2004) provides a full discussion of ecological inference and related topics.

4.2.2 Ecological bias for spatial count data

Figure 4.3: A repeat of Figure 4.3, showing a realisation of a homogeneous Poisson process (black dots) superimposed onto a tessellation of the unit square (solid lines).

Ecological bias can also occur with spatial count data, in particular where data originate at the individual, or indeed sub-regional, level but we only observe information at the aggregated level.

Let $A_{i}$ be small spatial regions and let $Y_{i}$ be the counts in $A_{i}$. Let $\{z(x):x\in\bigcup_{i}A_{i}\}$ be a spatially varying risk factor.

1. Individual-level model

Cases form an inhomogeneous Poisson process with intensity

$\lambda(x)=\lambda_{0}(x)\exp\{\beta z(x)\}$

2. Area-level model

$Y_{i}\sim{\rm Poisson}(\mu_{i})$

$\mu_{i}=\int_{x\in A_{i}}\lambda_{0}(x)\exp\{\beta z(x)\}\,\mathrm{d}x$

It is common practice to assume a model of the form

$\mu_{i}=\bar{\lambda}_{0i}\exp(\beta\bar{z}_{i})$

where $\bar{\lambda}_{0i}$ and $\bar{z}_{i}$ are the averages over the small area $A_{i}$ of $\lambda_{0}(x)$ and $z(x)$ , respectively. Note that this is strictly incorrect, except under rather special conditions.
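
The size of the discrepancy can be seen in a small numerical check. This is a sketch under assumptions that are not in the text above: a single one-dimensional region $A=[0,1]$ (so that integrals over $A$ and averages over $A$ coincide), a constant baseline $\lambda_{0}(x)=100$, a risk factor $z(x)=x$ and $\beta=2$.

```python
import numpy as np

# Hypothetical one-dimensional "area" A = [0, 1]; its size is 1, so
# averages over A coincide with integrals over A.
n_cells = 10_000
x = (np.arange(n_cells) + 0.5) / n_cells   # midpoints of a fine grid on A

beta = 2.0                                 # assumed effect size
lambda0 = np.full(n_cells, 100.0)          # assumed constant baseline intensity
z = x                                      # assumed risk factor z(x) = x

# Area-level mean implied by the individual-level model:
# the integral of lambda0(x) * exp(beta * z(x)) over A (midpoint rule).
mu_exact = np.mean(lambda0 * np.exp(beta * z))

# Plug-in approximation that uses only the area averages of lambda0 and z.
mu_plugin = lambda0.mean() * np.exp(beta * z.mean())

print(f"integral of the intensity: {mu_exact:.2f}")
print(f"plug-in with averages    : {mu_plugin:.2f}")
```

In this example the integral is $50(e^{2}-1)\approx 319.5$, whereas the plug-in value is $100e\approx 271.8$. With a constant baseline this is Jensen's inequality: the average of $\exp\{\beta z(x)\}$ over $A$ is at least $\exp(\beta\bar{z})$, with equality for $\beta\neq 0$ only when $z(x)$ is constant over $A$.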

For problems of this kind, the suggested strategy is to:

• specify the model at the individual level

• derive the resulting joint probability distribution for area-level data

• check that parameters of interest are identifiable from area-level data

• make the required inferences

See, for example, Prentice and Sheppard (1995) and Sheppard and Prentice (1995). Li et al. (2012) and Taylor et al. (2013) discuss a data-augmentation scheme (under modelling assumptions) to enable an individual-level model to be fitted to aggregated data under the slightly more complex scenario that $\lambda(x)$ is stochastic.
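
The four steps of this strategy can be illustrated in the simplest deterministic case. The sketch below is not the method of any of the papers cited above. It assumes, hypothetically, a one-dimensional study region $[0,1]$ split into 20 equal areas, a known baseline $\lambda_{0}(x)$, and a risk factor $z(x)$ that is known on a fine grid and varies within each area. Under the individual-level Poisson process model the area counts are independent Poisson variables with means $\mu_{i}(\beta)$ given by the integral above, and $\beta$ is estimated by Fisher scoring on the resulting likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: [0, 1] split into equal areas, each covered by a fine grid.
n_areas, cells_per_area = 20, 200
n_cells = n_areas * cells_per_area
h = 1.0 / n_cells                          # width of one fine-grid cell
x = (np.arange(n_cells) + 0.5) * h         # cell midpoints

lambda0 = np.full(n_cells, 2000.0)         # baseline intensity, assumed known
# Assumed risk factor: a smooth trend plus variation within each area.
z = np.sin(2 * np.pi * x) + 0.5 * np.cos(40 * np.pi * x)
beta_true = 1.0

def area_means(beta):
    """Return mu_i(beta) and d mu_i / d beta by midpoint-rule integration."""
    f = lambda0 * np.exp(beta * z) * h
    mu = f.reshape(n_areas, cells_per_area).sum(axis=1)
    dmu = (f * z).reshape(n_areas, cells_per_area).sum(axis=1)
    return mu, dmu

# Steps 1-2: individual-level model => independent Poisson area counts.
mu_true, _ = area_means(beta_true)
y = rng.poisson(mu_true)

# Steps 3-4: Fisher scoring for beta using only the area-level counts.
beta_hat = 0.0
for _ in range(50):
    mu, dmu = area_means(beta_hat)
    score = np.sum((y / mu - 1.0) * dmu)   # derivative of the log-likelihood
    info = np.sum(dmu ** 2 / mu)           # expected information; > 0 means beta is identifiable
    step = score / info
    beta_hat += step
    if abs(step) < 1e-10:
        break

std_err = 1.0 / np.sqrt(info)
print(f"beta_hat = {beta_hat:.3f} (standard error approx. {std_err:.3f})")
```

The design choice here is that the area means are always computed from the individual-level intensity, never from $\bar{z}_{i}$, so the estimand remains the individual-level $\beta$. In practice $\lambda_{0}(x)$ and $z(x)$ are rarely known this precisely, which is what the more elaborate methods cited above address.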
