Consider the prisoner height data from the lecture.
prisoners <- read.table("prisoners.dat")
The data set contains:
height Recorded height of prisoner in inches.
gender Gender of prisoner.
censor Categorical variable stating whether the prisoner’s height was ‘O’bserved, ‘L’eft censored (shorter than 60 inches) or ‘R’ight censored (taller than 72 inches).
First, assume that no censoring has occurred and fit a simple linear regression model with the gender explanatory variable.
LinReg1 <- glm(height ~ 1 + gender, family = gaussian, data = prisoners) summary(LinReg1) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 64.1298 0.1050 610.91 <2e-16 *** genderM 5.7552 0.1263 45.57 <2e-16 ***
From the estimates, we see that the average height of female prisoners is 64.1 inches and the difference between male and female prisoners is estimated to be 5.8 inches. However, is this a good model? How are tall and short prisoners being treated in this model? No this is not a good model as the heights for tall/short prisoners are taken at the censor point instead of at their true heights. The model is not accounting for the uncertainty arising due to censoring.
An alternative approach is to only examine the data that was observed. Create a second data set that contains prisoners whose heights were not censored.
prisoners2 <- prisoners[prisoners$censor == "O", ]
Fit a simple linear regression model to this data set.
LinReg2 <- glm(height ~ 1 + gender, family = gaussian, data = prisoners2) summary(LinReg2) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 64.19773 0.09968 644.07 <2e-16 *** genderM 5.21103 0.12365 42.14 <2e-16 ***
In this case, we obtain a substantial different estimate for the average height difference between male and female prisoners. As all measurements are observed in this example, does this make the above model better? No, here we are being selective as to what data because it is easy to analyse! We are assuming that if a height measurement is censored then it contains no information and so does not contribute to the likelihood. This is not true as we know that a tall prisoner is at least 72 inches in height and a short prisoner is at most 60 inches in height.
The correct approach in this case is to apply a censored linear regression model. Load the package censReg, you may first need to install the package using install.packages("censReg").
library(censReg)
Fit the following censored regression model with left and right censoring point of 60 and 72 inches respectively.
LinReg_cens <- censReg(height ~ gender, left=60, right=72, data=prisoners) summary(LinReg_cens) Observations: Total Left-censored Uncensored Right-censored 1000 5 868 127 Coefficients: Estimate Std. error t value Pr(> t) (Intercept) 64.11728 0.11780 544.30 <2e-16 *** genderM 5.97337 0.14274 41.85 <2e-16 *** logSigma 0.72694 0.02476 29.36 <2e-16 ***
Here, we again obtain a different MLE for the difference in average heights between male and female prisoners. This time the censored data is being correctly accounted for.