1 Introduction

1.3 Motivating example

The choice of model highly depends on what the data is describing. Here we examine a few motivating examples that we shall be investigating throughout the course.

1.3.1 Birthweight of babies

The first data set contains the birthweight and gestational age (time from conception to birth) of 24 babies born in hospital. The data is held in R as birthweight. A scatter plot of this data set is presented below.

birthweight = read.table(’birthweight.dat’)
attach(birthweight)
plot(age, weight, pch=16)
abline(a = -1485.0, b = 115.5)
Figure 1.1: Link, Caption: Birthweight data with fitted linear regression function.

These observations suggest a linear trend of birthweight with increasing gestational age, together with some random scatter around the overall trend. A simple linear model is a clear candidate for modelling this data. The fitted least-squares line is show on the plot.

Let Yi denote the birthweight, for i=1,2,,n, and xi denotes the gestational age for baby i. The linear model is:

Yi=μi+ϵiwithμi=a+bxi,andϵiN(0,σ2)

This can be equivalently be represented as,

YiN(μi,σ2)withμi=a+bxi.

Using the lm command in R we can easily find the fitted mean line as μ^=-1485.0+115.5x.

 
Exercise 1.3

  1. 1.

    Interpret this finding in context.

  2. 2.

    Estimate a suitable value of σ.

 

1.3.2 AIDS deaths

Between 1983 to 1986, data was collected on the number of deaths from AIDS in Australia in consecutive three-month periods. The data is held in R as aids and shown below.

Figure 1.2: Link, Caption: AIDS data with linear (dashed) and exponential (solid) fitted curves.

The dotted line on the plot is the least-squares regression line estimated from the data. At first sight, this linear model does not appear to fit too badly, but we notice that the fitted values are negative for periods 1 and 2. This is a bad aspect of the model since we know that negative values are not possible. Also, the observations are counts, not continuous, and the variance seems to increase with the mean.

A reasonable model for this data might be that the number of deaths Yi at each time ti is Poisson(μi), with the means μi increasing in time. To make sure that these means do not go negative, we can model μi=exp{a+bti} which is always positive. This model can be written as:

YiPois(μi)withlog(μi)=a+bti.

Fitting this model to the data by maximum likelihood, we obtain estimates a^=0.340 and b^=0.257. The corresponding fitted curve, μ^=exp{a^+b^t} is shown in the plot as the solid line, which provide a better description of the data.

aids = read.table(’aids.dat’)
attach(aids)
plot(time, number)
mu = exp(0.340 + 0.257*time)
line(time, mu, col="blue")

 
Exercise 1.4
Give an interpretation of the model

YiPois(μi)withlog(μi)=0.340+0.257ti.

 

This model is not a causal model, but merely a description of how the AIDS epidemic is growing.

The Poisson model for the AIDS data can be defined in general terms as:

YiG(μi)where𝔼[Yi]=μi,
g(μi)=ηiandηi=β𝐱i

where we have to identify:

  • i index, here 3 month period.

  • Yi response, here number of AIDS deaths.

  • 𝐱i covariates, here time period.

  • β coefficients, here β=(a,b).

  • G distribution, here Poisson.

  • g link function, here log.

1.3.3 Clinical trial data

In a phase I clinical trial to find the effective dose of a new drug where patients were randomly assigned to receive different doses of the drug. The table below shows the number zi of patients responding positively to the drug for each dose xi.

Dose (xi) 1.69 1.72 1.76 1.78 1.81 1.84 1.86 1.88
# Patients (mi) 59 60 62 56 63 59 62 60
# +ve responses (zi) 6 13 18 28 52 53 61 60

This data is held in R as clintrial. Below is a plot of the observed proportion of positive responses, yi=zi/mi, at each dose.

Figure 1.3: Link, Caption: Clinical trial data with linear (dashed) and logit (solid) fitted curves.

The fitted least-squares line of best fit is shown as a dashed-line. This goes outside the defined (0,1) interval for proportions. We therefore need to take account of the non-normal distribution of the data.

For each dose, a binomial model is appropriate, for example:

ZiBinomial(mi,μi)

where μi is the probability that a patient will respond at does xi. Here, the expectation is 𝔼[Zi]=miμi. To match with the previous examples, we consider the the transformation Yi=Zi/mi to define the proportion of positive responses for each dose level. The expectation of the proportion is therefore 𝔼(Yi)=𝔼(Zi)/mi=μi. Taking the proportion as our response variable, it follows that Yi is a binomial-proportion model:

YiBinoprop(mi,μi).

Now we need to model μi as a function of xi. Assuming a linear relationship such as μi=a+bxi may conflict with the interpretation of μ as a probability. We must then constrain the mean to the required interval. One such mapping is the logistic function:

h(x)=exp{x}1+exp{x}

A proposed model for the clinical trial data could then be:

YiBionprop(mi,μi),withlogit(μi)=a+bxi

The logit function is the inverse function to the logistic function.

 
Exercise 1.5
Evaluate the logit function.

 

The solid line in the plot above presents the curve of best fit with maximum likelihood estimates a^=-60.1 and b^=33.9.

clintrial = read.table(’clintrial.dat’)
attach(clintrial)
plot(dose, propn)
eta = -60.1 + 33.9 * dose
mu = exp(eta)/(1+exp(eta))
lines(dose, mu, col="blue")

 
Exercise 1.6
Give a contextual interpretation to this model, using the plot.

 

 
Exercise 1.7
The Binoprop model follows the general structure:

Yi G(μi)where𝔼(Yi)=μi,
g(μi) =ηiandηi=βxi.

where each component is:

 

1.3.4 Bacteria data

Twenty-six dishes of bacterial cultures were given different amounts, xi, of a drug. For each, it was recorded whether they responded to the drug, Yi=1, or not, Yi=0. This data is held in R as bacteria. Below is a plot of the bacteria data at each dose level.

Figure 1.4: Link, Caption: Bacteria data with linear (dashed) and logit (solid) fitted curves.

It is particularly apparent in this case that a normal linear model is not appropriate. The line of best fit again extends beyond the set of valid probability values. The response is categorical not continuous, as it can only take one of two values. In no sense can we say that the residuals from the fit are approximately normal distributed about the fitted line!

A more sensible model might assume that at each dose, xi, there is a probability μi that a bacterial culture dish will respond so that YiBernoulli(μi). This is a special case of the binomial model discussed in the previous section. It makes sense to model μi in the same way:

Yi Bernoulli(μi)with
logit(μi) =a+bxi.

The curve of best fit using the maximum likelihood estimates a^=-4.111 and b^=3.581 is represented in the above plot with a solid line.

bacteria = read.table(’bacteria.dat’)
attach(bacteria)
plot(dose, response)
eta = -4.111 + 3.581 * dose
mu = exp(eta)/(1+exp(eta))
lines(dose, mu, col="blue")