
1.3 Motivating example

The choice of model depends heavily on what the data describes. Here we introduce a few motivating examples that we shall investigate throughout the course.

1.3.1 Birthweight of babies

The first data set contains the birthweight and gestational age (time from conception to birth) of 24 babies born in hospital. The data is held in R as birthweight. A scatter plot of this data set is presented below.

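# Read the data and make its columns (age, weight) available by name
birthweight = read.table('birthweight.dat')
attach(birthweight)
# Scatter plot of weight against gestational age, with the fitted line overlaid
plot(age, weight, pch=16)
abline(a = -1485.0, b = 115.5)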

Figure 1.1: Birthweight data with fitted linear regression function.

These observations suggest a linear trend of birthweight with increasing gestational age, together with some random scatter around the overall trend. A simple linear model is a clear candidate for modelling this data. The fitted least-squares line is shown on the plot.

Let $Y_{i}$ denote the birthweight and $x_{i}$ the gestational age of baby $i$, for $i=1,2,\ldots,n$. The linear model is:

$\displaystyle Y_{i}=\mu_{i}+\epsilon_{i}\quad\mathrm{with}\quad\mu_{i}=a+bx_{i},\quad\mathrm{and}\quad\epsilon_{i}\sim N(0,\sigma^{2}).$

This can equivalently be written as

$\displaystyle Y_{i}\sim N(\mu_{i},\sigma^{2})\quad\mathrm{with}\quad\mu_{i}=a+bx_{i}.$

Using the lm command in R we can easily find the fitted mean line as $\hat{\mu}=-1485.0+115.5x$ .
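As a rough illustration of how the fitted line is used, the sketch below evaluates $\hat{\mu}$ at a few gestational ages. The helper name mu_hat and the ages chosen are hypothetical and are not taken from the data set; only the two coefficients come from the text.

```r
# Fitted mean line quoted in the text: mu_hat = -1485.0 + 115.5 * x
# (mu_hat is an illustrative helper name, not part of the course code)
mu_hat <- function(age) -1485.0 + 115.5 * age

# Illustrative ages only; these are not values from birthweight.dat
ages <- c(36, 38, 40)
fitted_weights <- mu_hat(ages)
print(data.frame(age = ages, fitted_weight = fitted_weights))

# Fitted means one unit of age apart differ by exactly the slope b
slope_check <- mu_hat(39) - mu_hat(38)
```

The last line is only a numerical check that the slope is the change in fitted mean per unit increase in $x$.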

Exercise 1.3

1. Interpret this finding in context.
2. Estimate a suitable value of $\sigma$.

1.3.2 AIDS deaths

From 1983 to 1986, data was collected on the number of deaths from AIDS in Australia in consecutive three-month periods. The data is held in R as aids and shown below.

Figure 1.2: AIDS data with linear (dashed) and exponential (solid) fitted curves.

The dashed line on the plot is the least-squares regression line estimated from the data. At first sight, this linear model does not appear to fit too badly, but the fitted values are negative for periods 1 and 2, which is impossible for a number of deaths. Also, the observations are counts, not continuous, and the variance seems to increase with the mean.

A reasonable model for this data might be that the number of deaths $Y_{i}$ at each time $t_{i}$ is Poisson( $\mu_{i}$ ), with the means $\mu_{i}$ increasing in time. To make sure that these means do not go negative, we can model $\mu_{i}=\exp\{a+bt_{i}\}$ which is always positive. This model can be written as:

$\displaystyle Y_{i}\sim\mathrm{Pois}(\mu_{i})\quad\mathrm{with}\quad\log(\mu_{i})=a+bt_{i}.$

Fitting this model to the data by maximum likelihood, we obtain estimates $\hat{a}=0.340$ and $\hat{b}=0.257$ . The corresponding fitted curve, $\hat{\mu}=\exp\{\hat{a}+\hat{b}t\}$ , is shown in the plot as the solid line, which provides a better description of the data.

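# Read the data and make its columns (time, number) available by name
aids = read.table('aids.dat')
attach(aids)
plot(time, number)
# Fitted Poisson mean curve; lines() adds it to the existing plot
mu = exp(0.340 + 0.257*time)
lines(time, mu, col="blue")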
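To see what the fitted curve implies numerically, the following sketch evaluates $\hat{\mu}=\exp\{\hat{a}+\hat{b}t\}$ at the first few periods and compares consecutive fitted means. It assumes periods are numbered from 1, as the reference to "periods 1 and 2" suggests. The variable names are illustrative.

```r
# Maximum likelihood estimates quoted in the text
a_hat <- 0.340
b_hat <- 0.257

# First few periods, assuming periods are numbered 1, 2, 3, ...
periods <- 1:4
mu_hat <- exp(a_hat + b_hat * periods)   # fitted Poisson means, always positive

# Ratio of each fitted mean to the previous one
ratios <- mu_hat[-1] / mu_hat[-length(mu_hat)]
growth_factor <- exp(b_hat)              # the constant value of that ratio

print(round(mu_hat, 2))
print(round(ratios, 3))
```

Unlike the linear fit, every fitted mean is positive, and under the log link each extra period multiplies the fitted mean by the same factor $\exp\{\hat{b}\}$.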

Exercise 1.4
Give an interpretation of the model

$\displaystyle Y_{i}\sim\mathrm{Pois}(\mu_{i})\quad\mathrm{with}\quad\log(\mu_{i})=0.340+0.257t_{i}.$

This model is not a causal model, but merely a description of how the AIDS epidemic is growing.

The Poisson model for the AIDS data can be defined in general terms as:

$\displaystyle Y_{i}\sim G(\mu_{i})\quad\mathrm{where}\quad\mathbb{E}[Y_{i}]=\mu_{i},$
$\displaystyle g(\mu_{i})=\eta_{i}\quad\mathrm{and}\quad\eta_{i}=\beta^{\prime}\mathbf{x}_{i}$

where we have to identify:

• $i$: index, here the three-month period.
• $Y_{i}$: response, here the number of AIDS deaths.
• $\mathbf{x}_{i}$: covariates, here the time period, so that $\mathbf{x}_{i}=(1,t_{i})^{\prime}$ .
• $\beta$: coefficients, here $\beta=(a,b)^{\prime}$ .
• $G$: distribution, here Poisson.
• $g$: link function, here $\log$ .
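The components listed above map onto R quite directly, because a family object bundles the distribution $G$ with the link $g$. The sketch below is a minimal illustration using the estimates quoted earlier. The period $t_{i}=5$ and the variable names are assumptions made for the example.

```r
# In R, the distribution G and the link g are bundled in a "family" object
fam <- poisson(link = "log")

# Coefficients beta = (a, b), using the estimates quoted in the text
beta <- c(0.340, 0.257)

# Covariate vector x_i = (1, t_i); t_i = 5 is an illustrative period
x_i <- c(1, 5)

eta_i <- sum(beta * x_i)        # linear predictor eta_i = beta' x_i
mu_i <- fam$linkinv(eta_i)      # mean mu_i = g^{-1}(eta_i) = exp(eta_i)
eta_back <- fam$linkfun(mu_i)   # applying g = log recovers eta_i
```

When fitting with glm, the same family object would be passed as the family argument, so choosing $G$ and $g$ amounts to choosing that argument.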

1.3.3 Clinical trial data

In a phase I clinical trial to find the effective dose of a new drug, patients were randomly assigned to receive different doses of the drug. The table below shows the number $z_{i}$ of patients responding positively to the drug for each dose $x_{i}$ .

Dose ( $x_{i}$ )	1.69	1.72	1.76	1.78	1.81	1.84	1.86	1.88
# Patients ( $m_{i}$ )	59	60	62	56	63	59	62	60
# +ve responses ( $z_{i}$ )	6	13	18	28	52	53	61	60

This data is held in R as clintrial. Below is a plot of the observed proportion of positive responses, $y_{i}=z_{i}/m_{i}$ , at each dose.

Figure 1.3: Clinical trial data with linear (dashed) and logit (solid) fitted curves.

The least-squares line of best fit is shown as a dashed line. It goes outside the interval $[0,1]$ in which proportions must lie. We therefore need to take account of the non-normal distribution of the data.

For each dose, a binomial model is appropriate, for example:

$\displaystyle Z_{i}\sim\mathrm{Binomial}(m_{i},\mu_{i})$

where $\mu_{i}$ is the probability that a patient will respond at dose $x_{i}$ . Here, the expectation is $\mathbb{E}[Z_{i}]=m_{i}\mu_{i}$ . To match the previous examples, we consider the transformation $Y_{i}=Z_{i}/m_{i}$ , the proportion of positive responses at each dose level. The expectation of the proportion is therefore $\mathbb{E}[Y_{i}]=\mathbb{E}[Z_{i}]/m_{i}=\mu_{i}$ . Taking the proportion as our response variable, $Y_{i}$ follows a binomial-proportion distribution:

$\displaystyle Y_{i}\sim\mathrm{Binoprop}(m_{i},\mu_{i}).$
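The claim that the proportion has mean $\mu_{i}$ can be checked numerically. The sketch below builds the distribution of $Y=Z/m$ from the binomial probabilities and computes its mean and variance. The value $\mu=0.3$ is an arbitrary illustrative probability, not an estimate, and the variance formula $\mu(1-\mu)/m$ is the standard binomial result rather than something stated in the notes.

```r
# Binomial-proportion sketch: Y = Z / m with Z ~ Binomial(m, mu)
m <- 59      # number of patients at the first dose in the table
mu <- 0.3    # illustrative response probability (an assumption, not an estimate)

z_values <- 0:m
y_values <- z_values / m                         # possible values of Y
probs <- dbinom(z_values, size = m, prob = mu)   # P(Y = z/m) = P(Z = z)

mean_y <- sum(y_values * probs)                  # E[Y], equals mu
var_y <- sum((y_values - mean_y)^2 * probs)      # Var(Y), equals mu(1 - mu)/m
```

Dividing by $m_{i}$ leaves the mean on the probability scale, while the variance shrinks as the number of patients grows.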

Now we need to model $\mu_{i}$ as a function of $x_{i}$ . A linear relationship such as $\mu_{i}=a+bx_{i}$ can produce values outside $[0,1]$ , which conflicts with the interpretation of $\mu_{i}$ as a probability. We must therefore constrain the mean to the required interval. One mapping that does this is the logistic function:

$\displaystyle h(x)=\frac{\exp\{x\}}{1+\exp\{x\}}$

A proposed model for the clinical trial data could then be:

$\displaystyle Y_{i}\sim\mathrm{Binoprop}(m_{i},\mu_{i}),\quad\mathrm{with}\quad\mathrm{logit}(\mu_{i})=a+bx_{i}$

The logit function is the inverse function to the logistic function.
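The inverse relationship can be checked numerically without writing the logit down. The sketch below defines $h$ as in the text and compares it with R's built-in plogis and qlogis functions. The grid of values for the linear predictor is chosen purely for illustration.

```r
# Logistic function h as defined in the text
h <- function(x) exp(x) / (1 + exp(x))

eta <- c(-5, -1, 0, 1, 5)    # illustrative linear-predictor values
mu <- h(eta)                 # every value lies strictly between 0 and 1

# R's built-in plogis() is the same logistic function ...
same_as_builtin <- isTRUE(all.equal(mu, plogis(eta)))

# ... and qlogis() is its inverse, i.e. the logit
eta_back <- qlogis(mu)
```

Here eta_back reproduces eta, which is what it means for the logit to undo the logistic function.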

Exercise 1.5
Evaluate the logit function.

The solid line in the plot above presents the curve of best fit with maximum likelihood estimates $\hat{a}=-60.1$ and $\hat{b}=33.9$ .

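# Read the data and make its columns (dose, propn) available by name
clintrial = read.table('clintrial.dat')
attach(clintrial)
plot(dose, propn)
# Linear predictor and fitted probabilities from the logit model
eta = -60.1 + 33.9 * dose
mu = exp(eta)/(1+exp(eta))
lines(dose, mu, col="blue")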
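The estimates above are quoted rather than computed. One way such a fit might be obtained is sketched below, with the table values typed in directly so that the example does not depend on clintrial.dat. The variable names are illustrative, and the result has not been checked against the course's own data file, so the estimates it prints may differ slightly from the quoted values.

```r
# Data copied from the table in the text
dose <- c(1.69, 1.72, 1.76, 1.78, 1.81, 1.84, 1.86, 1.88)
m <- c(59, 60, 62, 56, 63, 59, 62, 60)   # patients per dose
z <- c(6, 13, 18, 28, 52, 53, 61, 60)    # positive responses per dose

# Binomial GLM with logit link; the response is (successes, failures)
fit <- glm(cbind(z, m - z) ~ dose, family = binomial(link = "logit"))

coefs <- coef(fit)      # estimates of a (intercept) and b (dose)
print(coefs)

# Fitted response probabilities at the observed doses
mu_hat <- fitted(fit)
```

Supplying the counts as a two-column matrix of successes and failures lets glm account for the different numbers of patients $m_{i}$ at each dose.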

Exercise 1.6
Give a contextual interpretation to this model, using the plot.

Exercise 1.7
The Binoprop model follows the general structure:

$\displaystyle Y_{i}\sim G(\mu_{i})\quad\mathrm{where}\quad\mathbb{E}[Y_{i}]=\mu_{i},$
$\displaystyle g(\mu_{i})=\eta_{i}\quad\mathrm{and}\quad\eta_{i}=\beta^{\prime}\mathbf{x}_{i},$

where each component is:

1.3.4 Bacteria data

Twenty-six dishes of bacterial cultures were given different amounts, $x_{i}$ , of a drug. For each dish, it was recorded whether the culture responded to the drug, $Y_{i}=1$ , or not, $Y_{i}=0$ . This data is held in R as bacteria. Below is a plot of the bacteria data at each dose level.

Figure 1.4: Bacteria data with linear (dashed) and logit (solid) fitted curves.

It is particularly apparent in this case that a normal linear model is not appropriate. The line of best fit again extends beyond the set of valid probability values. The response is binary, not continuous, as it can only take one of two values. In no sense can we say that the residuals from the fit are approximately normally distributed about the fitted line!

A more sensible model might assume that at each dose, $x_{i}$ , there is a probability $\mu_{i}$ that a bacterial culture dish will respond so that $Y_{i}\sim\mathrm{Bernoulli}(\mu_{i})$ . This is a special case of the binomial model discussed in the previous section. It makes sense to model $\mu_{i}$ in the same way:

$\displaystyle Y_{i}\sim\mathrm{Bernoulli}(\mu_{i})\quad\mathrm{with}$
$\displaystyle\mathrm{logit}(\mu_{i})=a+bx_{i}.$

The curve of best fit using the maximum likelihood estimates $\hat{a}=-4.111$ and $\hat{b}=3.581$ is represented in the above plot with a solid line.

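# Read the data and make its columns (dose, response) available by name
bacteria = read.table('bacteria.dat')
attach(bacteria)
plot(dose, response)
# Linear predictor and fitted probabilities from the logit model
eta = -4.111 + 3.581 * dose
mu = exp(eta)/(1+exp(eta))
lines(dose, mu, col="blue")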
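As a small illustration of the fitted Bernoulli model, the sketch below finds the dose at which the fitted response probability equals 0.5 and writes out the two Bernoulli probabilities at one dose. The dose value 1 and the variable names are chosen for illustration and are not taken from bacteria.dat.

```r
# Maximum likelihood estimates quoted in the text
a_hat <- -4.111
b_hat <- 3.581

# Fitted response probability as a function of dose (logistic of a + b x)
mu_hat <- function(dose) plogis(a_hat + b_hat * dose)

# The fitted probability is 0.5 where the linear predictor a + b x is zero
dose_50 <- -a_hat / b_hat

# Bernoulli(mu) is Binomial(1, mu): probabilities of Y = 0 and Y = 1
p <- mu_hat(1)    # dose 1 is an illustrative value, not taken from the data
bernoulli_probs <- dbinom(c(0, 1), size = 1, prob = p)
```

Using dbinom with size 1 makes the link to the previous section explicit: the Bernoulli model is the binomial model with $m_{i}=1$ for every dish.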