often we are interested in comparing the survival of two (or more) groups, for example two treatment groups, males versus females, smokers versus non-smokers, etc.
the usual two-group methods (e.g. t-tests to compare group means) are not valid because of censoring: for a censored subject we only know that the survival time exceeds the censoring time
separate Kaplan-Meier plots with confidence intervals can be used to compare groups informally
example: lung cancer data, examining survival outcome by gender
males: red (solid) curve; females: blue (dashed) curve
[Figure: Kaplan-Meier survival curves with confidence intervals, by gender, lung cancer data]
comments: survival appears, on average, to be longer in females, but there is some overlap at the upper limit of the confidence interval
question: potential confounders?
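A plot of this kind could be produced roughly as follows. This is a minimal sketch, assuming the `lung` data set shipped with the R `survival` package (where `sex` is coded 1 = male, 2 = female and `time` is in days); the object name `km_sex`, the axis labels and the legend position are illustrative choices, not taken from the notes.

```r
library(survival)

# Kaplan-Meier estimate of the survival function, one curve per sex
# (in lung: sex 1 = male, 2 = female; status 1 = censored, 2 = dead)
km_sex <- survfit(Surv(time, status) ~ sex, data = lung)

# males in red (solid), females in blue (dashed), with confidence intervals
plot(km_sex, conf.int = TRUE, col = c("red", "blue"), lty = c(1, 2),
     xlab = "Time (days)", ylab = "Estimated survival probability")
legend("topright", legend = c("males", "females"),
       col = c("red", "blue"), lty = c(1, 2))

# group sizes, number of events and median survival per group
print(km_sex)
```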
a formal comparison can be made using the log-rank test
the null hypothesis is that the survival distributions are equal for the sub-groups (i.e. no difference in survival)
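let $t_1 < t_2 < \dots < t_J$ denote the observed event times and $d_j$ the number of events at time $t_j$
further, let $n_j$ denote the number at risk at time $t_j$ (e.g. alive at time $t_j$) of which $n_{1j}$ are in group 1 and $n_{2j}$ are in group 2
if no difference: the expected number of events in each group is: $e_{1j} = d_j \, n_{1j} / n_j$, $e_{2j} = d_j \, n_{2j} / n_j$
we actually observe: $d_{1j}$, $d_{2j}$ (with $d_{1j} + d_{2j} = d_j$)
summing over the failure times for the two groups gives $O_1 = \sum_j d_{1j}$, $E_1 = \sum_j e_{1j}$ and $O_2 = \sum_j d_{2j}$, $E_2 = \sum_j e_{2j}$
the log-rank test statistic: $X^2 = (O_1 - E_1)^2 / V$, where $V = \sum_j \frac{n_{1j} \, n_{2j} \, d_j \, (n_j - d_j)}{n_j^2 \, (n_j - 1)}$ is the variance of $O_1 - E_1$ under the null hypothesis; under the null, $X^2$ is approximately $\chi^2$ distributed on 1 degree of freedom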
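To make the steps concrete, the observed and expected totals can be accumulated by hand, one event time at a time. This is a sketch under stated assumptions: it uses the `lung` data from the R `survival` package (status 2 = death, 1 = censored; sex 1 = male), and it assumes the variance-based form of the statistic, with the hypergeometric variance `V` of the observed-minus-expected total in group 1. The variable names (`O1`, `E1`, `V`, `tj`, ...) are illustrative.

```r
library(survival)

d <- lung[, c("time", "status", "sex")]
event <- d$status == 2                      # TRUE for deaths, FALSE for censored

event_times <- sort(unique(d$time[event]))  # distinct observed event times

O1 <- 0; E1 <- 0; V <- 0
for (tj in event_times) {
  at_risk <- d$time >= tj                   # still under observation at tj
  n  <- sum(at_risk)                        # number at risk, both groups
  n1 <- sum(at_risk & d$sex == 1)           # number at risk in group 1
  dj <- sum(event & d$time == tj)           # events at tj, both groups
  d1 <- sum(event & d$time == tj & d$sex == 1)  # events at tj in group 1

  O1 <- O1 + d1                             # observed events, group 1
  E1 <- E1 + dj * n1 / n                    # expected events under the null
  if (n > 1) {                              # variance term undefined when n = 1
    V <- V + n1 * (n - n1) * dj * (n - dj) / (n^2 * (n - 1))
  }
}

chisq <- (O1 - E1)^2 / V
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)
print(c(O1 = O1, E1 = E1, chisq = chisq, p = p_value))
```

Only group 1 needs to be tracked in the two-group case, because the observed and expected totals for group 2 are determined by the overall number of events.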
the function survdiff() (in the survival package) conducts the log-rank test in R
the command and output are shown below
> library(survival)
> survdiff(Surv(time, status) ~ sex, data = lung)
Call:
survdiff(formula = Surv(time, status) ~ sex, data = lung)

        N Observed Expected (O-E)^2/E (O-E)^2/V
sex=1 138      112     91.6      4.55      10.3
sex=2  90       53     73.4      5.68      10.3

 Chisq= 10.3  on 1 degrees of freedom, p= 0.00131
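the log-rank test result indicates a significant difference in survival between male (sex=1) and female (sex=2) lung cancer patients: males had more deaths than expected under the null (112 observed versus 91.6 expected) and females fewer (53 versus 73.4)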
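The test statistic and p-value can also be read from the fitted object rather than from the printed output. A short sketch, assuming the `lung` data from the `survival` package; the p-value is recomputed from the chi-squared statistic with `pchisq()`, and the names `lr` and `p_value` are illustrative.

```r
library(survival)

lr <- survdiff(Surv(time, status) ~ sex, data = lung)

lr$obs     # observed number of events in each group
lr$exp     # expected number of events in each group under the null
lr$chisq   # log-rank chi-squared statistic

# degrees of freedom = number of groups - 1
p_value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
print(p_value)
```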
comments?
Kaplan-Meier curves can be obtained for more than two sub-groups and survival compared informally
it may be preferable not to add the confidence intervals since the plots can become confusing
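the log-rank test can also be used to compare more than two groups (with k groups the statistic is referred to a chi-squared distribution on k - 1 degrees of freedom)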
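As an illustration of the multi-group case, the sketch below compares survival across levels of the ECOG performance score (`ph.ecog`) in the `lung` data. The choice of `ph.ecog` as the grouping variable is an assumption made for the example and is not taken from the notes; rows with a missing score are dropped by the default missing-value handling.

```r
library(survival)

# Kaplan-Meier curves by ECOG performance score, without confidence intervals
km_ecog <- survfit(Surv(time, status) ~ ph.ecog, data = lung)
k <- length(km_ecog$strata)               # number of groups with data

plot(km_ecog, conf.int = FALSE, col = seq_len(k), lty = seq_len(k),
     xlab = "Time (days)", ylab = "Estimated survival probability")
legend("topright", legend = names(km_ecog$strata),
       col = seq_len(k), lty = seq_len(k))

# log-rank test across all groups (k - 1 degrees of freedom)
lr_ecog <- survdiff(Surv(time, status) ~ ph.ecog, data = lung)
print(lr_ecog)
```

A small group can have little influence on the test and makes its curve very unstable, so it is worth checking the group sizes in the printed output before interpreting the result.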
more generally, a model can be fit to the data and potential for confounding accommodated
the Cox proportional hazards model is commonly used to flexibly model covariate effects on the hazard function
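A minimal sketch of such a model, assuming the `lung` data from the `survival` package and using age as an illustrative potential confounder (the choice of covariates is an assumption for the example, not a recommendation from the notes):

```r
library(survival)

# hazard modelled as a function of sex (1 = male, 2 = female) and age (years)
cox_fit <- coxph(Surv(time, status) ~ sex + age, data = lung)

summary(cox_fit)               # coefficients, standard errors, tests

# hazard ratios: exp(coefficient) per unit increase in each covariate
hr <- exp(coef(cox_fit))
print(hr)
```

A hazard ratio below 1 for `sex` would indicate a lower hazard for females than for males of the same age.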