The choice of model highly depends on what the data is describing. Here we examine a few motivating examples that we shall be investigating throughout the course.
The first data set contains the birthweight and gestational age (time from conception to birth) of 24 babies born in hospital. The data is held in R as birthweight
. A scatter plot of this data set is presented below.
birthweight = read.table(’birthweight.dat’) attach(birthweight) plot(age, weight, pch=16) abline(a = -1485.0, b = 115.5)
These observations suggest a linear trend of birthweight with increasing gestational age, together with some random scatter around the overall trend. A simple linear model is a clear candidate for modelling this data. The fitted least-squares line is show on the plot.
Let denote the birthweight, for , and denotes the gestational age for baby . The linear model is:
This can be equivalently be represented as,
Using the lm command in R we can easily find the fitted mean line as .
Exercise 1.3
Interpret this finding in context.
Estimate a suitable value of .
Between 1983 to 1986, data was collected on the number of deaths from AIDS in Australia in consecutive three-month periods. The data is held in R as aids
and shown below.
The dotted line on the plot is the least-squares regression line estimated from the data. At first sight, this linear model does not appear to fit too badly, but we notice that the fitted values are negative for periods 1 and 2. This is a bad aspect of the model since we know that negative values are not possible. Also, the observations are counts, not continuous, and the variance seems to increase with the mean.
A reasonable model for this data might be that the number of deaths at each time is Poisson(), with the means increasing in time. To make sure that these means do not go negative, we can model which is always positive. This model can be written as:
Fitting this model to the data by maximum likelihood, we obtain estimates and . The corresponding fitted curve, is shown in the plot as the solid line, which provide a better description of the data.
aids = read.table(’aids.dat’) attach(aids) plot(time, number) mu = exp(0.340 + 0.257*time) line(time, mu, col="blue")
Exercise 1.4
Give an interpretation of the model
This model is not a causal model, but merely a description of how the AIDS epidemic is growing.
The Poisson model for the AIDS data can be defined in general terms as:
where we have to identify:
index, here 3 month period.
response, here number of AIDS deaths.
covariates, here time period.
coefficients, here .
distribution, here Poisson.
link function, here .
In a phase I clinical trial to find the effective dose of a new drug where patients were randomly assigned to receive different doses of the drug. The table below shows the number of patients responding positively to the drug for each dose .
Dose () | 1.69 | 1.72 | 1.76 | 1.78 | 1.81 | 1.84 | 1.86 | 1.88 |
---|---|---|---|---|---|---|---|---|
# Patients () | 59 | 60 | 62 | 56 | 63 | 59 | 62 | 60 |
# +ve responses () | 6 | 13 | 18 | 28 | 52 | 53 | 61 | 60 |
This data is held in R as clintrial
. Below is a plot of the observed proportion of positive responses, , at each dose.
The fitted least-squares line of best fit is shown as a dashed-line. This goes outside the defined interval for proportions. We therefore need to take account of the non-normal distribution of the data.
For each dose, a binomial model is appropriate, for example:
where is the probability that a patient will respond at does . Here, the expectation is . To match with the previous examples, we consider the the transformation to define the proportion of positive responses for each dose level. The expectation of the proportion is therefore . Taking the proportion as our response variable, it follows that is a binomial-proportion model:
Now we need to model as a function of . Assuming a linear relationship such as may conflict with the interpretation of as a probability. We must then constrain the mean to the required interval. One such mapping is the logistic function:
A proposed model for the clinical trial data could then be:
The logit function is the inverse function to the logistic function.
Exercise 1.5
Evaluate the logit function.
The solid line in the plot above presents the curve of best fit with maximum likelihood estimates and .
clintrial = read.table(’clintrial.dat’) attach(clintrial) plot(dose, propn) eta = -60.1 + 33.9 * dose mu = exp(eta)/(1+exp(eta)) lines(dose, mu, col="blue")
Exercise 1.6
Give a contextual interpretation to this model, using the plot.
Exercise 1.7
The Binoprop model follows the general structure:
where each component is:
Twenty-six dishes of bacterial cultures were given different amounts, , of a drug. For each, it was recorded whether they responded to the drug, , or not, . This data is held in R as bacteria
. Below is a plot of the bacteria data at each dose level.
It is particularly apparent in this case that a normal linear model is not appropriate. The line of best fit again extends beyond the set of valid probability values. The response is categorical not continuous, as it can only take one of two values. In no sense can we say that the residuals from the fit are approximately normal distributed about the fitted line!
A more sensible model might assume that at each dose, , there is a probability that a bacterial culture dish will respond so that . This is a special case of the binomial model discussed in the previous section. It makes sense to model in the same way:
The curve of best fit using the maximum likelihood estimates and is represented in the above plot with a solid line.
bacteria = read.table(’bacteria.dat’) attach(bacteria) plot(dose, response) eta = -4.111 + 3.581 * dose mu = exp(eta)/(1+exp(eta)) lines(dose, mu, col="blue")