Chapter 3 Introduction to Survival Analysis

•

survival analysis: “analysis of data in the form of times from some well-defined time origin to occurrence of some event or endpoint” (Collett, 2003)
•

medical studies: the time origin may be the time of entry into a clinical trial, time of diagnosis of disease, time of commencement of treatment, time of surgery, date of birth etc.
•

if the endpoint is death then the times to event are literally survival times, for example, time to death following kidney transplantation
•

more generally, time-to-event data take various forms, for example: time to onset of heart disease; time to failure of a prosthesis; time to treatment failure; time to relief of pain; time to recurrence of symptoms etc.

Survival Analysis

•
the events may be:
- –
  
  positive, such as discharge from hospital or conception;
- –
  
  adverse, such as death or recurrence of disease
- –
  
  neutral, such as cessation of breast feeding
•

regardless of the nature of the event, the convention is to refer to this type of data as survival data and to the analysis as survival analysis
•

the time-to-event is a random variable, $T$, often referred to as a lifetime random variable

Example, lung cancer

•

survival times of patients with advanced lung cancer
•

228 patients $\rightarrow$ 165 deaths observed

[Figure: lung cancer example]

Special features of survival data

•
time-to-event data are not amenable to standard methods of analysis since the event times are:
- –
  
  positive-continuous
- –
  
  typically skewed
- –
  
  subject to censoring
•
censoring occurs when the event of interest (end-point) is not observed:
- –
  
  right censoring: the event time exceeds the last follow-up time
- –
  
  left censoring: the event is known to have occurred before the time of observation, but the exact event time is unknown
- –
  
  interval censoring: the event time falls in some specified interval
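
One way to think about the three types is that each observation gives an interval known to contain the true event time. The Python sketch below illustrates this. The helper `event_time_bounds` and its labels are hypothetical names chosen for illustration and are not part of the module.

```python
import math

def event_time_bounds(kind, a, b=None):
    """Return (lower, upper) bounds known to contain the true event time T.

    kind is one of the hypothetical labels 'exact', 'right', 'left', 'interval'.
    """
    if kind == "exact":      # event observed at time a
        return (a, a)
    if kind == "right":      # event not yet seen at last follow-up a, so T > a
        return (a, math.inf)
    if kind == "left":       # event had already occurred when observed at a, so T < a
        return (0.0, a)
    if kind == "interval":   # event occurred between times a and b
        return (a, b)
    raise ValueError(f"unknown censoring type: {kind}")

print(event_time_bounds("right", 5.0))
print(event_time_bounds("left", 3.0))
print(event_time_bounds("interval", 2.0, 3.0))
```

Seen this way, right censoring is the special case in which the upper bound is infinite.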

Right censoring

•

left/interval censoring occurs less frequently than right censoring. In this module we will consider methods for right censored data
•

right censored observations: we do not know when, or if, the patient will experience the event, only that the event has not occurred at the end of the observation period (last follow-up)
•
right censoring can be due to:
- –
  
  the period of observation ending prior to the event occurring (e.g. a five-year study period)
- –
  
  loss to follow-up (e.g. moved away, did not return for scheduled follow-up)
- –
  
  a competing event which precludes further follow-up (e.g. a death occurs before a hip prosthesis fails)
•

note also that the event may not be inevitable (e.g. time to pregnancy)
•

censoring cannot be ignored: censored observations carry important information about survival
•

for example, when comparing two treatments, a more effective treatment will result in increased survival and hence more censoring at the end of follow-up

Patient time and study time

•

patients are typically not recruited at the same time but are accrued sequentially over a period of time and are then followed up to a fixed date $\rightarrow$ the period of observation thus varies between patients
•

assumptions: a patient's prognosis does not depend upon the time of entry to the study (less of a problem in randomised trials)
•

patients lost to follow-up have the same prognosis as those remaining in the study (i.e. random censoring)
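
The conversion from study time (calendar dates) to patient time (time since entry) can be sketched as follows. This is a minimal illustration only: the dates, the study end date and the helper `patient_time` are all made up, and every patient is assumed to be followed to the same fixed end date.

```python
from datetime import date

STUDY_END = date(2020, 12, 31)  # hypothetical fixed end of follow-up

# (entry date, event date or None if no event was recorded); made-up values
patients = [
    (date(2016, 1, 15), date(2018, 6, 1)),
    (date(2017, 3, 10), None),
    (date(2019, 9, 5), date(2020, 2, 20)),
    (date(2018, 11, 30), date(2021, 5, 1)),  # event falls after the study end
]

def patient_time(entry, event, study_end=STUDY_END):
    """Return (days from entry to event or censoring, event indicator)."""
    if event is not None and event <= study_end:
        return (event - entry).days, 1   # event observed during follow-up
    return (study_end - entry).days, 0   # right censored at the study end

for entry, event in patients:
    print(entry, patient_time(entry, event))
```

Patients who enter later have a shorter maximum follow-up, which is why the period of observation varies between patients.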

Lung cancer data example

id  inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
1     3  306      2  74   1       1       90       100     1175      NA
2     3  455      2  68   1       0       90        90     1225      15
3     3 1010      1  56   1       0       90        90       NA      15
4     5  210      2  57   1       1       90        60     1150      11
5     1  883      2  60   1       0      100        90       NA       0
6    12 1022      1  74   1       1       50        80      513       0
7     7  310      2  68   2       2       70        60      384      10
8    11  361      2  71   2       2       60        80      538       1
9     1  218      2  53   1       1       70        80      825      16
10    7  166      2  61   1       2       70        70      271      34
.     .  .        .  .    .       .       .         .        .        .
228  22  177      1  58   2       1       80        90     1060       0

Lung cancer data example description


Format:

    inst:       Institution code

    time:       Survival time in days

    status:     censoring status 1=censored, 2=dead

    age:        Age in years

    sex:        Male=1 Female=2

    ph.ecog:    ECOG performance score (0=good 5=dead)

    ph.karno:   Karnofsky performance score (bad=0-good=100) rated by physician

    pat.karno:  Karnofsky performance score as rated by patient

    meal.cal:   Calories consumed at meals

    wt.loss:    Weight loss in last six months
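
Note that `status` is coded 1 = censored, 2 = dead, whereas the indicator $\delta_{i}$ used in these notes is 0 if censored and 1 if failure. The Python sketch below recodes the first ten rows of the listing above. Only the values printed in the listing are used, and the variable names are illustrative.

```python
# First ten rows of the listing above: (time in days, status)
rows = [(306, 2), (455, 2), (1010, 1), (210, 2), (883, 2),
        (1022, 1), (310, 2), (361, 2), (218, 2), (166, 2)]

# Recode status (1 = censored, 2 = dead) to an indicator (0 = censored, 1 = dead)
recoded = [(time, status - 1) for time, status in rows]

n_deaths = sum(delta for _, delta in recoded)
n_censored = len(recoded) - n_deaths
print(n_deaths, n_censored)  # 8 2
```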

Aims of survival analysis

•

model the survival times for a single group
•

compare survival distributions for two or more groups
•

assess the effects of covariates on survival
•

make predictions
•

usually the event times will be continuous measurements, but they are typically recorded in rounded form
•

thus, although the data are strictly continuous, our methods must allow for potential ties in the data caused by rounding

Notation

•

let $T$ denote the lifetime random variable
•

$t_{i}\;\;(i=1,2,\ldots,n)$ : observed event times
•

$c_{i}\;\;(i=1,2,\ldots,n)$ : censoring times
•

$\delta_{i}$ : censoring/failure indicator - 0 if censored, 1 if failure
•

$x_{i}$ : $p$-vector of covariates for individual $i$

•

$n_{i}\;(i=1,2,\ldots,n)$ : size of the risk set, i.e. the number at risk just before $t_{i}$
•

$d_{i}$ : the number of events (e.g. deaths) at time $t_{i}$
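
As a small illustration of $n_{i}$ and $d_{i}$, the Python sketch below tabulates them for a hypothetical toy sample. The data and the helper `risk_table` are made up. The table is indexed by the distinct event times, and a unit censored at time $t$ is assumed still to be at risk at $t$.

```python
from collections import Counter

# Hypothetical toy sample: (observed time, delta) with delta = 1 for an event
sample = [(2, 1), (3, 0), (3, 1), (5, 1), (5, 1), (7, 0), (8, 1)]

def risk_table(sample):
    """Return [(t, n, d)] for each distinct event time t, where n is the
    number at risk just before t and d is the number of events at t."""
    deaths = Counter(t for t, delta in sample if delta == 1)
    table = []
    for t in sorted(deaths):
        # at risk just before t: everyone whose observed time is at least t
        n_at_risk = sum(1 for y, _ in sample if y >= t)
        table.append((t, n_at_risk, deaths[t]))
    return table

for row in risk_table(sample):
    print(row)
```

The two events tied at time 5 give $d_{i}=2$, which is how ties caused by rounding enter the notation.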

Data frame for survival data

unit	time	cens	$X_{1}$	$X_{2}$	$\cdots$
1	$y_{1}$	$\delta_{1}$	$x_{11}$	$x_{12}$	$\cdots$
2	$y_{2}$	$\delta_{2}$	$x_{21}$	$x_{22}$
3	$y_{3}$	$\delta_{3}$	$x_{31}$	$x_{32}$
$\vdots$	$\vdots$	$\vdots$	$\vdots$	$\vdots$	$\ddots$

where

$Y_{i}=\min(T_{i},C_{i})$

$\delta_{i}=I(T_{i}\leq C_{i})$

•

the censoring time $C_{i}$ is the time at which unit $i$ leaves the study, with realised value $c_{i}$
•

censoring times may be fixed (e.g. the end of a 5-year study period) or random (e.g. loss to follow-up)
•

we do not observe both $T_{i}$ and $C_{i}$
•

we record $t_{i}$ if $T_{i}\leq C_{i}$ or else we record $c_{i}$ if $T_{i}>C_{i}$
•

hence we record $Y_{i}=\min(T_{i},C_{i})$ and a censoring indicator $\delta_{i}=I(T_{i}\leq C_{i})$
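
The construction of the recorded pair $(Y_{i},\delta_{i})$ can be sketched in Python as follows. The latent times `T` and `C` are hypothetical values chosen for illustration. In a real study only the smaller of each pair would be seen.

```python
# Hypothetical latent event times T and censoring times C (in years)
T = [2.5, 7.1, 4.0, 5.0, 9.3]
C = [5.0, 5.0, 3.2, 5.0, 5.0]

Y = [min(t, c) for t, c in zip(T, C)]         # recorded time
delta = [int(t <= c) for t, c in zip(T, C)]   # 1 = event observed, 0 = censored

for y, d in zip(Y, delta):
    print(y, d)
```

The fourth unit has $T_{i}=C_{i}$ and is counted as an event, in line with the $\leq$ in the definition of $\delta_{i}$.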

Basic Functions I

•

in summarising survival data there are two functions of central interest: the survivor function and the hazard function
•

the survival time of individual $i$ is a realisation of a non-negative random variable $T$
•

let $F(t)$ denote the distribution function of $T$ with corresponding probability density function $f(t)$ then

$F(t)=P(T\leq{}t)=\int_{0}^{t}f(s)ds$
•

by definition the pdf is $f(t)=\frac{dF(t)}{dt}$
•

the probability that an individual survives to time $t$ is given by the survivor function

$S(t)=P(T>t)=1-F(t)=\int_{t}^{\infty}f(s)ds$
•

note that $S(t)$ is a monotone non-increasing function with $S(0)=1$ that tends to zero as $t$ approaches infinity
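
These properties can be checked numerically for a simple case. The Python sketch below assumes an exponential lifetime with a hypothetical rate of 0.5. This distribution is used only as an illustration and is not prescribed by the notes. $F(t)$ is computed by integrating $f$ numerically and compared with the closed form $S(t)=e^{-0.5t}$.

```python
import math

RATE = 0.5  # hypothetical rate parameter of an exponential lifetime

def f(t):
    """Density f(t) = RATE * exp(-RATE * t) for t >= 0."""
    return RATE * math.exp(-RATE * t)

def F(t, steps=10_000):
    """Distribution function by trapezoidal integration of f over [0, t]."""
    if t == 0:
        return 0.0
    h = t / steps
    total = 0.5 * (f(0.0) + f(t)) + sum(f(k * h) for k in range(1, steps))
    return total * h

def S(t):
    """Survivor function S(t) = 1 - F(t)."""
    return 1.0 - F(t)

# Numerical S(t) next to the closed form exp(-RATE * t)
for t in (0.0, 1.0, 2.0, 4.0):
    print(t, round(S(t), 4), round(math.exp(-RATE * t), 4))
```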

Basic Functions II

•

conversely we can express the pdf as:

$f(t)=\displaystyle\lim_{\Delta t\rightarrow 0}\frac{P(t\leq T<t+\Delta t)}{\Delta t}=\frac{dF(t)}{dt}=-\frac{dS(t)}{dt}$
•

the hazard function specifies the instantaneous rate of failure at $T=t$ given survival to time $t$ and is defined:

$h(t)=\displaystyle\lim_{\Delta t\rightarrow 0}\frac{P(t\leq T<t+\Delta t\mid T\geq t)}{\Delta t}=\frac{f(t)}{S(t)}$
•

the hazard is a rate not a probability. It can assume values in $[0,\infty)$
•

the quantity $h(t)\Delta{}t$ approximates the probability that an individual who has survived to time $t$ will experience the event in the interval $[t,t+\Delta{}t)$
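
The quality of this approximation can be examined numerically. The Python sketch below assumes a hypothetical lifetime with $S(t)=\exp(-t^{2})$, for which $h(t)=2t$. This choice is purely illustrative. It compares the exact conditional probability of failure in a short interval with $h(t)\Delta t$.

```python
import math

def S(t):
    """Survivor function of a hypothetical lifetime with S(t) = exp(-t**2)."""
    return math.exp(-t ** 2)

def hazard(t):
    """Hazard h(t) = f(t) / S(t) = 2t for this choice of S."""
    return 2.0 * t

def conditional_failure_prob(t, dt):
    """P(t <= T < t + dt | T >= t) = (S(t) - S(t + dt)) / S(t)."""
    return (S(t) - S(t + dt)) / S(t)

t = 1.0
for dt in (0.5, 0.1, 0.01, 0.001):
    # exact conditional probability next to the approximation h(t) * dt
    print(dt, conditional_failure_prob(t, dt), hazard(t) * dt)
```

The two columns agree more closely as $\Delta t$ shrinks. Here $h(1)=2$, which exceeds 1 and so cannot be a probability.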

Basic Functions III

•

the cumulative or integrated hazard function is by definition

$H(t)=\int_{0}^{t}h(u)du$
•
other relationships follow:
- –
  
  $S(t)=1-\int_{0}^{t}f(u)du$
- –
  
  $f(t)=h(t)\times S(t)$
- –
  
  $h(t)=-\frac{d(\log(S(t)))}{dt}$
- –
  
  $S(t)=\exp\{-H(t)\}$
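
The last two relationships follow from the definitions in a few steps, sketched below. The worked example assumes an exponential lifetime with constant rate $\lambda>0$, a distribution chosen here only for illustration.

```latex
\begin{align*}
h(t) &= \frac{f(t)}{S(t)} = \frac{-\,dS(t)/dt}{S(t)} = -\frac{d}{dt}\log S(t), \\
H(t) &= \int_{0}^{t} h(u)\,du = -\bigl[\log S(u)\bigr]_{0}^{t} = -\log S(t)
        \quad (\text{since } S(0)=1), \\
S(t) &= \exp\{-H(t)\}.
\end{align*}
% Worked example: exponential lifetime with an assumed constant rate \lambda > 0
\begin{align*}
f(t) = \lambda e^{-\lambda t}, \quad
S(t) = e^{-\lambda t}, \quad
h(t) = \frac{f(t)}{S(t)} = \lambda, \quad
H(t) = \lambda t .
\end{align*}
```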

Relationships between basic functions

•

the density, distribution, survivor and hazard functions are different mechanisms for describing the distribution of survival times
•

these functions capture the essential features of lifetime variables
•

specifying one function completely determines the others
•

consequently one may interchange between them but a model may be better specified in terms of one rather than another

Examples of survivor functions

[Figure: examples of survivor functions]

Examples of hazard functions

[Figure: examples of hazard functions]

Hazard functions

•

the hazard function describes how the instantaneous risk of failure changes with time
•

the hazard informs us of failure rates, for example, of patients of a certain age
•
there are many general shapes for the hazard function. Generic types are: increasing, decreasing, constant and bathtub:
- –
  
  an increasing hazard function is indicative of natural ageing (or wearing out)
- –
  
  a decreasing hazard function is less likely clinically but may fit, for example, risk following organ transplantation
- –
  
  a bathtub-shaped hazard fits, for example, population risk of death from birth: infant deaths give rise to increased events early on, the process then stabilises prior to increasing with age
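
The four generic shapes can be illustrated with a short Python sketch. It assumes a Weibull form for the hazard, which is not introduced in these notes and is used only as a convenient example. The bathtub curve is built, again purely for illustration, by adding a decreasing and an increasing hazard.

```python
def weibull_hazard(t, shape, scale=1.0):
    """Weibull hazard h(t) = (shape / scale) * (t / scale) ** (shape - 1), t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t):
    """Sum of a decreasing and an increasing Weibull hazard (illustrative only)."""
    return weibull_hazard(t, shape=0.5) + weibull_hazard(t, shape=3.0)

times = [0.1, 0.5, 1.0, 2.0]
for label, h in [
    ("decreasing (shape 0.5)", lambda t: weibull_hazard(t, 0.5)),
    ("constant   (shape 1.0)", lambda t: weibull_hazard(t, 1.0)),
    ("increasing (shape 2.0)", lambda t: weibull_hazard(t, 2.0)),
    ("bathtub    (sum)", bathtub_hazard),
]:
    # hazard evaluated at each time point, rounded for display
    print(label, [round(h(t), 3) for t in times])
```

A shape parameter below 1 gives a decreasing hazard, exactly 1 a constant hazard, and above 1 an increasing hazard.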
