Exercises

In the following exercises, we shall be investigating the examples that were discussed in Chapter 1 of the lecture notes. All of the data sets are available from MOODLE. Download and save the files in a convenient location, remembering to change the working directory in R if necessary.

  1.

    Birthweight. The data set birthweight contains the weight in grams of 24 babies (12 male and 12 female) and their gestational age in weeks at birth.

    birthweight <- read.table("birthweight.dat")
    

    For this example, we considered a simple linear regression model of birthweight (Y_i) in terms of gestational age (x_i) of the form:

    Y_i ∼ N(μ_i, σ²),   μ_i = η_i,   η_i = β_0 + β_1 x_i,   for i = 1, …, 24,

    where σ² is assumed known. Recall from MATH235 that we are able to fit this model using the lm command as follows:

    Model0 <- lm(weight ~ age, data = birthweight)
    

    When using the glm command we must specify the distribution of the responses. In this case, birthweight follows a Gaussian distribution and so estimates of the coefficients (β_0, β_1) are equivalently evaluated by:

    Model1 <- glm(weight ~ age, data = birthweight, family = gaussian)
    

    Verify that the estimates you obtain from lm and glm are the same.

    Further information about the parameter estimates and quality of the model fit can be evaluated using the summary command:

    summary(Model1)
    
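    The equivalence of lm and glm with family = gaussian can also be checked without the MOODLE data file. The snippet below is a sketch on simulated data: the ages and weights are illustrative stand-ins, not the real birthweight measurements.

```r
# Simulated stand-in for the birthweight data (the real file is on MOODLE);
# the values of 'age' and 'weight' below are illustrative only.
set.seed(1)
age <- round(rnorm(24, mean = 38, sd = 2))              # gestational age (weeks)
weight <- 2500 + 120 * (age - 38) + rnorm(24, sd = 200) # birthweight (grams)
sim <- data.frame(age = age, weight = weight)

fit_lm  <- lm(weight ~ age, data = sim)
fit_glm <- glm(weight ~ age, data = sim, family = gaussian)

# The two fitting routines solve the same least-squares problem,
# so the coefficient estimates agree.
all.equal(coef(fit_lm), coef(fit_glm))
```

    The same comparison with the real birthweight data (Model0 versus Model1) should give an identical result.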
  2.

    AIDS. The data set aids contains the number of AIDS cases recorded in 3-month intervals from 1983 to 1986.

    aids <- read.table("aids.dat")
    

    The number of AIDS cases is assumed to be Poisson distributed. To describe how the number of cases (Y_i) changes with respect to the time periods (t_i), the following Poisson regression model is proposed:

    Y_i ∼ Pois(μ_i),   log μ_i = η_i,   η_i = β_0 + β_1 t_i,   for i = 1, …, 14.

    We can fit this model using the glm command where we specify the distribution family as poisson:

    Model2 <- glm(number ~ time, data = aids, family=poisson)
    summary(Model2)
    

    From the summary of the parameter estimates we see that the intercept parameter is not significantly different from zero, as its p-value of 0.176 is greater than the 5% significance level.

    Examine the following commands, which propose alternative linear predictors for the AIDS data.

    Model2a <- glm(number ~ 1, data = aids, family=poisson)
    Model2b <- glm(number ~ 1 + time, data = aids, family=poisson)
    Model2c <- glm(number ~ -1 + time, data = aids, family=poisson)
    

    The linear predictor for Model2a only contains the intercept parameter. This defines the null model with no covariates.

    For Model2b the term 1 explicitly specifies that an intercept term should be included in the linear predictor. By default, R includes an intercept term whether or not the 1 is specified; therefore the formula y ~ x is the same as y ~ 1 + x. To verify that this is the case, compare the estimates from Model2b with those from Model2.

    In the third example, the -1 definition overrides the default action defined above by specifying a linear predictor that has no intercept term. For this model, when all explanatory values are zero, the linear predictor is zero (η_i = 0) and so the expected number of counts is one (μ_i = exp{η_i} = 1). Since the intercept coefficient in Model2 is not significant, it can be argued that Model2c is a better description of the AIDS data on the grounds of parsimony.
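    The behaviour of the three formulas can be checked directly. The sketch below uses simulated Poisson counts as a stand-in for the AIDS data (the intercept and slope values are illustrative, not estimates from the real data).

```r
# Simulated stand-in for the AIDS counts (the real file is on MOODLE);
# the parameter values 0.3 and 0.2 are illustrative only.
set.seed(2)
time <- 1:14
number <- rpois(14, lambda = exp(0.3 + 0.2 * time))
sim <- data.frame(time = time, number = number)

m_default  <- glm(number ~ time,      data = sim, family = poisson)
m_explicit <- glm(number ~ 1 + time,  data = sim, family = poisson)
m_noint    <- glm(number ~ -1 + time, data = sim, family = poisson)

all.equal(coef(m_default), coef(m_explicit))  # y ~ x is y ~ 1 + x
names(coef(m_noint))                          # "time" only: no intercept
```

    The same comparison on the real data, Model2 against Model2b and Model2c, illustrates the default intercept and its removal with -1.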

  3.

    Clinical Trial. The data set clintrial contains the data from a clinical trial with 481 participants who were assigned to one of eight different dose levels (x_i) of a particular drug. The number of participants at each dose level who reacted positively to the drug (Y_i) was recorded.

    clintrial <- read.table("clintrial.dat")
    

    In this case, we assume that the proportion of participants who reacted positively follows a binomial-proportion distribution and so propose the following logistic regression model:

    Y_i ∼ Binoprop(m_i, μ_i),   logit(μ_i) = η_i,   η_i = β_0 + β_1 x_i,   for i = 1, …, 8,

    where m_i is the number of participants who received dose level x_i.

    To fit a logistic regression model to this data in R, we once again use the glm command and specify the family as binomial. However, before the ~ symbol in the formula we must define a two-column matrix where the first column gives the number of events that we are interested in and the second column gives the number of non-events. This is executed as follows for the clinical trial data:

    Model3 <- glm(cbind(respond, number-respond) ~ dose, data = clintrial,
        family=binomial)
    summary(Model3)
    
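    The cbind(events, non-events) construction can be tried out on simulated grouped binomial data. In the sketch below the dose levels, group sizes and response probabilities are illustrative stand-ins for the clintrial file.

```r
# Simulated stand-in for the clinical trial data: 8 dose levels with
# illustrative group sizes and response probabilities.
set.seed(3)
dose <- 1:8
number <- rep(60, 8)                 # participants per dose level
p <- plogis(-2 + 0.5 * dose)         # assumed true response probabilities
respond <- rbinom(8, size = number, prob = p)
sim <- data.frame(dose = dose, number = number, respond = respond)

# First column of cbind(): events; second column: non-events.
fit <- glm(cbind(respond, number - respond) ~ dose, data = sim,
           family = binomial)
coef(fit)   # estimates of (beta_0, beta_1)
```

    The fit returns one intercept and one slope on the logit scale, exactly as Model3 does for the real data.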

    We may now ask the question: Is this a good model for the data? From the summary we see that the residual deviance is 13.633 on 6 degrees of freedom. The deviance of a GLM is asymptotically chi-squared distributed, so the critical value at the 5% significance level with 6 degrees of freedom is:

    qchisq(1 - 0.05, df = 6) ##12.59
    

    Since the residual deviance is greater than the critical value, there is sufficient evidence to reject the hypothesis that the proposed logistic regression model adequately describes the data. In other words, there is more variability in the data than the model is able to describe. It is possible to address this by adding or changing the explanatory variables to define a linear predictor that better describes the variability in the data.
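    Equivalently, the deviance can be converted to a p-value by evaluating the upper tail of the chi-squared distribution at the observed deviance of 13.633 on 6 degrees of freedom:

```r
# Goodness-of-fit p-value: P(chi-squared_6 > 13.633)
pval <- pchisq(13.633, df = 6, lower.tail = FALSE)
pval  # about 0.034, below the 5% level, so the model is rejected
```

    Comparing the deviance to the critical value and comparing this p-value to 0.05 are the same test.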

  4.

    IRWLS & Bacteria. The glm command calculates the parameter estimates by applying the iterative re-weighted least squares (IRWLS) algorithm. The example code in the lecture notes executes the IRWLS algorithm for a Poisson regression model and applies it to the AIDS data set.

    Copy the function below, which provides the outline of the IRWLS algorithm. Replace the text ??? with the appropriate commands in order to evaluate the MLEs for the binary/logistic regression model defined in the previous part with m_i = 1.

    irwls_logistic <- function(param0, y, X, I){
      #param0 -- Initial parameter vector
      #y -- Binary Realisations
      #X -- Design Matrix
      #I -- Number of iterations to perform
      beta <- param0
      for(i in 1:I){
        eta <- as.vector(X %*% beta) #Linear Predictor
        mu <- ???            #Mean
        var <- ???           #Variance
        linkderiv <- ???     #Derivative of Link Function
        z <- as.matrix(eta + (y - mu) * linkderiv) #Adjusted Responses
        W <- solve(var * linkderiv^2) #Weights Matrix
        betanew <- solve( t(X) %*% W %*% X) %*% t(X) %*% W %*% z
        cat("Params(", i, "):", betanew, "\n")
        beta <- betanew
      }
    }
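    If you want to check your answer, one possible completion is sketched below, assuming the canonical logit link g(μ) = log(μ/(1 − μ)), so that μ = exp(η)/(1 + exp(η)), Var(Y_i) = μ_i(1 − μ_i) and g′(μ) = 1/(μ(1 − μ)). A final line returning beta is also added so that the estimates can be used after the loop.

```r
# One possible completion of the IRWLS outline for logistic regression,
# assuming the canonical logit link.
irwls_logistic <- function(param0, y, X, I){
  #param0 -- Initial parameter vector
  #y -- Binary Realisations
  #X -- Design Matrix
  #I -- Number of iterations to perform
  beta <- param0
  for(i in 1:I){
    eta <- as.vector(X %*% beta)        #Linear Predictor
    mu <- exp(eta) / (1 + exp(eta))     #Mean: inverse logit of eta
    var <- diag(mu * (1 - mu))          #Variance: Var(Y_i) = mu_i(1 - mu_i)
    linkderiv <- 1 / (mu * (1 - mu))    #Derivative of Link Function
    z <- as.matrix(eta + (y - mu) * linkderiv) #Adjusted Responses
    W <- solve(var * linkderiv^2)       #Weights Matrix
    betanew <- solve( t(X) %*% W %*% X) %*% t(X) %*% W %*% z
    cat("Params(", i, "):", betanew, "\n")
    beta <- betanew
  }
  beta  # return the final estimates
}
```

    With the logit link the weights matrix simplifies to diag(μ_i(1 − μ_i)), which is why the algorithm above reproduces the glm estimates.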
    

    The bacteria data set arises from an experiment involving 26 dishes of bacterial cultures. Each culture is given a particular drug at various dosages (x_i). After some time it was recorded whether the bacterial culture reacted to the drug (Y_i = 1) or not (Y_i = 0).

    bacteria <- read.table("bacteria.dat")
    

    This can be described by the logistic regression model defined in the previous part, where the number of cases at each dosage is m_i = 1.

    Apply your IRWLS function to calculate the MLEs for the bacteria example:

    Responses <- bacteria$response
    DesignMatrix <- cbind(1, bacteria$dose)
    irwls_logistic(param0 = c(0,0), y = Responses, X = DesignMatrix, I = 10)
    

    Try various starting values to check that the algorithm converges to the same estimates.

    Compare the estimates you have obtained to the estimates derived from the glm command:

    Model4 <- glm(response ~ dose, data=bacteria, family=binomial)
    

    Note: When the response variable is binary, there is no need to specify the number of non-events as in the previous logistic regression case.
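    This equivalence between the binary and grouped formulations can be demonstrated on simulated data. The sketch below uses 26 illustrative binary responses as a stand-in for the bacteria file; with m_i = 1, cbind(response, 1 - response) and the plain binary formula fit the same model.

```r
# Simulated stand-in for the bacteria data: 26 binary responses
# (dosages and coefficients below are illustrative only).
set.seed(5)
dose <- runif(26, 0, 10)
response <- rbinom(26, size = 1, prob = plogis(-2 + 0.5 * dose))
sim <- data.frame(dose = dose, response = response)

# With a binary response no cbind() is needed; both calls fit the
# same logistic regression since each group has m_i = 1.
fit_binary  <- glm(response ~ dose, data = sim, family = binomial)
fit_grouped <- glm(cbind(response, 1 - response) ~ dose, data = sim,
                   family = binomial)
all.equal(coef(fit_binary), coef(fit_grouped))
```

    Model4 therefore fits the same model as the grouped formulation of the previous exercise, specialised to groups of size one.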