Continuous endpoints (e.g. diastolic blood pressure, height, weight, serum cholesterol level) are typically summarised in terms of group means or medians
Clinically the outcome of interest is often the difference, $\delta$, between the groups as opposed to the actual values
We are interested in the estimated size of the difference, $\hat\delta$, but also in the degree of precision of the estimate as measured by its standard error, $SE(\hat\delta)$, or by constructing a confidence interval
A p-value quantifying the play of chance under the null may be interesting but it is the size of the difference which is of interest in terms of clinical relevance
A study may continue if $\hat\delta$ is sufficiently large even if it is not significantly different from zero
Statistical significance does not imply a clinically relevant difference!
Larger samples give increased precision
Student's t-test: a direct testing procedure
Let $\bar{x}_1$ denote the treatment group mean and $\bar{x}_2$ the control group mean. We have:
Research Hypothesis: $H_0: \mu_1 = \mu_2$
versus $H_1: \mu_1 \neq \mu_2$
Estimate: $\hat\delta = \bar{x}_1 - \bar{x}_2$
Test Statistic computed under $H_0$: $t = \frac{\bar{x}_1 - \bar{x}_2}{SE(\bar{x}_1 - \bar{x}_2)}$
Assuming a common underlying variance, $\sigma^2$, the standard error is estimated by pooling the data: $SE(\bar{x}_1 - \bar{x}_2) = s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
with $s^2 = \frac{\sum_{i=1}^{n_1}(x_{1i}-\bar{x}_1)^2 + \sum_{i=1}^{n_2}(x_{2i}-\bar{x}_2)^2}{n_1+n_2-2}$
where $n = n_1 + n_2$ is the total number of patients recruited
Inference: The random variable $t$ is compared to the $t$-distribution with $n_1+n_2-2$ degrees of freedom and inference is based upon whether $|t| > t_{1-\alpha/2,\,n_1+n_2-2}$
The, perhaps, more familiar form of the pooled variance estimate is given below.
Let $s_1$ and $s_2$ represent the sample standard deviations for the two groups and $n_1$ and $n_2$ represent the group sizes; compute the pooled variance estimate:
$s^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$
and as previously the standard error is then given by:
$SE(\bar{x}_1 - \bar{x}_2) = s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
Assumptions: what assumptions underpin this t-test?
An approximate test assuming non-constant variances (Welch's test) can be performed similarly with $SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
and an adjustment to the degrees of freedom.
Non-parametric alternative? Mann-Whitney test
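The pooled two-sample procedure above can be sketched in a few lines of Python using only the standard library (the function name `pooled_t` is ours, not from the notes):

```python
from math import sqrt
from statistics import mean, stdev

def pooled_t(x, y):
    """Two independent samples t statistic with pooled variance.

    Returns (t, degrees of freedom); compare t with the t-distribution
    on n1 + n2 - 2 df.
    """
    n1, n2 = len(x), len(y)
    # pooled variance: s^2 = ((n1-1)s1^2 + (n2-1)s2^2) / (n1 + n2 - 2)
    s2 = ((n1 - 1) * stdev(x) ** 2 + (n2 - 1) * stdev(y) ** 2) / (n1 + n2 - 2)
    se = sqrt(s2) * sqrt(1 / n1 + 1 / n2)      # SE of the difference in means
    return (mean(x) - mean(y)) / se, n1 + n2 - 2

# illustrative data (not from the notes)
t, df = pooled_t([1, 2, 3, 4], [2, 4, 6, 8])
```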
Hypothesis tests yield p-values but do not allow for direct assessment/quantification of effect sizes; an estimate of the difference and a corresponding confidence interval is preferable
Let $\delta = \mu_1 - \mu_2$ denote the true difference in treatment group means:
estimate: $\hat\delta = \bar{x}_1 - \bar{x}_2$, the observed difference in group means
corresponding 95% confidence interval: $\hat\delta \pm t_{0.975,\,n_1+n_2-2} \times SE(\hat\delta)$
with $SE(\hat\delta) = s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
and $s^2$ the pooled variance estimate as previously
inference: Does the confidence interval for the true difference, $\delta$, span zero?
often a 95% interval ($\alpha = 0.05$) is specified
interval interpretation: Under repeated sampling we would expect 95% of such constructed intervals to contain the true parameter value
intervals are more informative: one can assess the plausible range for the effect size
statistical significance versus clinical relevance: a significant p-value does not imply clinical relevance
good reporting practice: CONSORT statement recommends reporting of both estimates and confidence intervals
For paired designs (for example, a patient's left eye and right eye, twins etc) a one sample t-test is performed based upon the observed within-pair differences, $d_i$. The procedure:
1. Compute the mean of the within-pair differences: $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$
2. Compute the standard deviation of the differences, $s_d$.
3. Calculate the standard error of the mean of the differences, $SE(\bar{d}) = s_d/\sqrt{n}$.
4. Compute the test statistic $t = \bar{d}/SE(\bar{d})$ under $H_0: \delta = 0$
versus $H_1: \delta \neq 0$
5. Compare $t$ to a $t$-distribution on $n - 1$ degrees of freedom.
6. Preferable approach: compute a confidence interval for the true difference: $\bar{d} \pm t_{0.975,\,n-1} \times SE(\bar{d})$
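The paired procedure can be sketched as follows in Python (stdlib only; the function name `paired_t` and the data are illustrative, not from the notes):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """One-sample t-test on within-pair differences.

    Returns (t, degrees of freedom n - 1).
    """
    d = [a - b for a, b in zip(after, before)]  # within-pair differences d_i
    n = len(d)
    dbar = mean(d)                  # step 1: mean of the differences
    sd = stdev(d)                   # step 2: sd of the differences
    se = sd / sqrt(n)               # step 3: SE of the mean difference
    return dbar / se, n - 1         # step 4: test statistic and its df

# illustrative paired data (e.g. left/right eye measurements)
t, df = paired_t([10, 12, 14], [12, 15, 16])
```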
Responses may be dichotomous (binary), for example, cancer free at five years/cancer recurred within five years, died/survived, diseased/not diseased.
Clinically interest then lies in comparing the treatment group proportions, $p_1$ and $p_2$, say.
Consider the following general two-by-two tabular representation of observed results:

Outcome \ Treatment Group | 1 | 2 | Total |
---|---|---|---|
Yes | a | b | a+b |
No | c | d | c+d |
Total | a+c | b+d | n |

Let $p_1$ and $p_2$ denote the respective risks (success probabilities) for the treatment and control groups
the risk difference, $p_1 - p_2$, is estimated by: $\hat{p}_1 - \hat{p}_2 = \frac{a}{a+c} - \frac{b}{b+d}$
the relative risk, $p_1/p_2$, is estimated by: $\widehat{RR} = \frac{a/(a+c)}{b/(b+d)}$
The disease odds, $p/(1-p)$, provides the ratio of success to failure. The odds ratio comparing the groups is estimated by: $\widehat{OR} = \frac{a/c}{b/d} = \frac{ad}{bc}$
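The three measures of association can be computed directly from the cell counts; a minimal Python sketch (the function name `two_by_two` and the example counts are ours):

```python
def two_by_two(a, b, c, d):
    """Risk difference, relative risk and odds ratio for a 2x2 table:
    columns = treatment groups 1 and 2, rows = outcome Yes (a, b) / No (c, d).
    """
    p1, p2 = a / (a + c), b / (b + d)           # group risks
    return {
        "risk difference": p1 - p2,
        "relative risk": p1 / p2,
        "odds ratio": (a * d) / (b * c),        # (a/c) / (b/d)
    }

# illustrative counts: 10/100 events in group 1, 5/100 in group 2
measures = two_by_two(10, 5, 90, 95)
```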
Study conducted to investigate the association between hay fever and eczema in 11-year-old children. Findings presented in tabular form:

Eczema \ Hay fever | Yes | No | Total |
---|---|---|---|
Yes | 141 | 420 | 561 |
No | 928 | 13 525 | 14 453 |
Total | 1069 | 13 945 | 15 522 |
Event: Eczema=“Yes”
What happens if we consider the event to be 'No Eczema'?
What happens if we consider the table the other way round and think of ’Hayfever’ as the event for the two eczema groups?
Intervals are constructed based upon the Normal approximation CI: estimate $\pm\ 1.96 \times SE$
risk difference: $SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$
with $\hat{p}_1 = a/(a+c)$, $\hat{p}_2 = b/(b+d)$, $n_1 = a+c$ and $n_2 = b+d$
relative risk (on the log scale): $SE(\log\widehat{RR}) = \sqrt{\frac{1}{a} - \frac{1}{a+c} + \frac{1}{b} - \frac{1}{b+d}}$
odds ratio (on the log scale): $SE(\log\widehat{OR}) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$
Aim: to compare the odds of eczema amongst hay fever sufferers to that of non-sufferers
Compute the odds ratio: $\widehat{OR} = \frac{141 \times 13525}{420 \times 928} = 4.89$
Compute the standard error of the natural logarithm of the odds ratio, then compute a confidence interval for the logarithm of the odds ratio and then back-transform: $SE(\log\widehat{OR}) = \sqrt{\frac{1}{141} + \frac{1}{420} + \frac{1}{928} + \frac{1}{13525}} = 0.103$
95% CI for $\log OR$: $\log(4.89) \pm 1.96 \times 0.103$; antilog: $(4.00, 5.99)$ (asymmetric on the odds ratio scale)
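The back-transformation can be reproduced in Python (stdlib only; variable names are ours):

```python
from math import exp, log, sqrt

# Counts from the hay fever / eczema table (event: eczema = "Yes")
a, b, c, d = 141, 420, 928, 13525

or_hat = (a * d) / (b * c)                    # odds ratio estimate
se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log odds ratio
# 95% CI on the log scale, then back-transform (antilog)
lo = exp(log(or_hat) - 1.96 * se_log)
hi = exp(log(or_hat) + 1.96 * se_log)
# OR is about 4.89 with 95% CI about (4.00, 5.99):
# note the interval is asymmetric about the estimate on the OR scale
```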
Further examples on exercise sheet
More details on the utility of the ratio of odds will feature later in the course when we consider observational studies
further reading: Bland JM, Altman DG (2000) BMJ 320: 1468.
So far in the module we have considered two study designs:
a parallel group design: different groups of patients are studied concurrently (in parallel). Patients receive a single therapy (or combination of therapies); the estimate of treatment effect is based upon so-called 'between-subject' comparisons. We used the two independent samples t-test for inference.
a paired design: patients receive both treatments, for example, applied to matching parts of the anatomy (e.g. limbs, eyes, skin etc); the estimate of treatment effect is based upon a 'within-subject' comparison. We used a paired t-test (a one sample t-test on the within-subject differences). We noted asymmetry can be problematic!
An alternative design building upon the idea that a participant acts as their own control is the:
crossover design: patients receive a sequence of treatments, the order determined by randomisation; the estimate of treatment effect is based upon 'within-subject' comparisons.
Definition (Senn, 1993)
“A cross-over trial is one in which subjects are given sequences of treatments with the object of studying differences between individual treatments (or sub-sequences of treatments).”
Randomisation: the order of the treatments is assigned at random
The times when treatments are administered are called treatment periods, or simply periods
Simple example (2 periods, 2 treatments)
Sequence | Period 1 | Period 2 |
---|---|---|
Group 1 | A | B |
Group 2 | B | A |
Advantages of the cross-over design are:
within-subject comparisons: patients act as their own control, eliminating between-patient variation
sample size is smaller: the same number of observations is obtained with fewer patients
precision increased: the same degree of precision in estimation can be achieved with fewer observations
Further reading (Senn 1993, Sec. 1.3)
Disadvantages / issues relating to the use of a cross-over design are:
inconvenience to patients: several treatments, longer total time under observation (sometimes advantage!)
drop outs: patients may withdraw
they are only suitable for certain indications
period by treatment interaction: the treatment effect is not constant over time
carry-over effect: “Carry-over is the persistence […] of a treatment applied in one period in a subsequent period of treatment.”
analysis is more complex: measurements come in pairs and there may be systematic differences between periods
Further reading Senn 1993, Sec. 1.4
Wash-out period:
“A wash-out period is a period in a trial during which the effect of a treatment given previously is believed to disappear. If no treatment is given during the wash-out period then the wash-out is passive. If a treatment is given during the wash-out period then the wash-out is active.”
(Senn, 1993)
When are cross-over trials useful?
chronic diseases which are relatively stable (e.g. asthma, rheumatism, migraine, moderate hypertension, epilepsy)
single-dose trials of bio-equivalence (PK/PD) rather than long-term trials
drugs with rapid, reversible effects rather than ones with persistent effects
Various types of cross-over designs exist but we shall focus upon the so-called $2 \times 2$ design.
two treatment, two period cross-over
two sequences: 1) AB and 2) BA
also called AB/BA design (more specific)
in the following a normally distributed endpoint is considered
Motivating example, Asthma trial
objective: comparing the effects of formoterol (experimental) and salbutamol (standard)
patients: 13 children (aged 7 to 14 years) with moderate to severe asthma
single-dose trial: 200 µg salbutamol, 12 µg formoterol; both bronchodilators
primary endpoint
peak expiratory flow (PEF, [l/min]): a measure of lung function
several measurements during the first 12 hours after drug intake
measurements after 8 hours considered here
drop-outs:
NOTE patient 8 dropped out after first period
not mentioned by Graff-Lonnevig V, Browaldh L (1990)!
See also Senn 1993, Sec. 3.1
design
randomised (randomisation procedure?): order of treatments assigned at random to form sequence groups
double-blind: double-dummy technique
two treatment, two period cross-over (AB/BA design)
wash-out period of at least one day
Seq. | Period 1 | Wash-Out | Period 2 |
---|---|---|---|
F/S | formoterol | no treatment | salbutamol |
S/F | salbutamol | no treatment | formoterol |
If no period effect then one can proceed as per the paired design considered previously using a paired t-test
method
calculate the treatment differences (response on formoterol minus response on salbutamol), $d_i$, for each subject
calculate the mean of the differences, $\bar{d}$, and $SE(\bar{d}) = s_d/\sqrt{n}$
perform a one-sample t-test for the differences (i.e. a paired t-test)
construct a confidence interval for the true difference
assumptions underlying the use of the paired test
normally distributed differences
unbiased: $E(d_i) = E(\bar{d}) =$ true treatment effect, $\tau$, say
mean of the differences: $\bar{d}$
standard deviation of the differences: $s_d$
degrees of freedom (df): $n - 1$
test statistic: $t = \bar{d}/SE(\bar{d})$, with $SE(\bar{d}) = s_d/\sqrt{n}$
confidence interval for the true difference: $\bar{d} \pm t_{0.975,\,n-1} \times SE(\bar{d})$
p-value: from the $t$-distribution on $n - 1$ df
Conclusion/comments?
“factors that might cause the differences not to be distributed at random about the true treatment effect”
period effect (e.g. hay fever: pollen count differs; learning effects etc.)
period by treatment interaction
carry-over
patient by treatment interaction: cannot be investigated in AB/BA design
patient by period interaction
(Senn, 1993)
Let $\mu$ denote the expectation for treatment B
$\tau$ denote the treatment effect (treatment A $-$ treatment B)
$\pi$ denote the period effect (period 2 $-$ period 1)
We can express the expected values for the AB/BA design in the cells of a table:

Sequence | Period 1 | Period 2 |
---|---|---|
AB | $\mu + \tau$ | $\mu + \pi$ |
BA | $\mu$ | $\mu + \tau + \pi$ |
How can we yield an unbiased estimate of the treatment effect in the presence of the period effect?
Recall: $\hat\theta$ is an unbiased estimator of $\theta$ if: $E(\hat\theta) = \theta$
1. Consider the expectation of the mean of the period differences (period 1 $-$ period 2): $E(\bar{d}_1) = \tau - \pi$ and $E(\bar{d}_2) = -\tau - \pi$
for each sequence group (1: (AB), 2: (BA)).
2. subtracting the expected period differences: $E(\bar{d}_1) - E(\bar{d}_2) = 2\tau$
3. and dividing by 2 to yield $\tau$
Hence the estimator
$\hat\tau = \frac{\bar{d}_1 - \bar{d}_2}{2}$
is unbiased for $\tau$
How can we yield an unbiased estimate of the period effect?
1. Consider again the expectation of the mean of the period differences for each sequence group: $E(\bar{d}_1) = \tau - \pi$, $E(\bar{d}_2) = -\tau - \pi$
2. summing the expectations for the two groups gives $E(\bar{d}_1) + E(\bar{d}_2) = -2\pi$
3. and dividing by $-2$ yields $\pi$
Hence the estimator
$\hat\pi = \frac{\bar{d}_1 + \bar{d}_2}{-2}$
is unbiased for $\pi$
Method:
calculate the period differences, $d_i$, (period 1 $-$ period 2) for each individual
calculate the means, $\bar{d}_1$ and $\bar{d}_2$, and standard deviations, $s_1$ and $s_2$, for the two sequence groups
estimate the treatment effect: $\hat\tau = (\bar{d}_1 - \bar{d}_2)/2$
compute the test statistic under $H_0: \tau = 0$: $t = \hat\tau / SE(\hat\tau)$
with $SE(\hat\tau) = \frac{s}{2}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
and $s^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$
construct a confidence interval for $\tau$: $\hat\tau \pm t_{0.975,\,n_1+n_2-2} \times SE(\hat\tau)$
The table below gives the means and standard deviations of the period differences for the two sequence groups

Sequence | n | mean | sd |
---|---|---|---|
for/sal | 7 | 30.7 | 33.0 |
sal/for | 6 | -62.5 | 44.7 |
test statistic: $\hat\tau = (30.7 - (-62.5))/2 = 46.6$, $SE(\hat\tau) = 10.78$, so $t = 46.6/10.78 = 4.32$ on 11 df
95% confidence interval for the treatment effect: $46.6 \pm 2.201 \times 10.78 = (22.9, 70.3)$ l/min
p-value: $p = 0.001$
Comments/conclusion?
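The period-adjusted analysis can be reproduced numerically from the sequence-group summary statistics (n, mean and sd of the period differences); a Python sketch with variable names of our choosing:

```python
from math import sqrt

# Summary statistics of the period differences (period 1 - period 2)
n1, d1, s1 = 7, 30.7, 33.0      # for/sal sequence group
n2, d2, s2 = 6, -62.5, 44.7     # sal/for sequence group

tau_hat = (d1 - d2) / 2                        # treatment effect estimate
# pooled variance of the period differences
s2_pool = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
se = 0.5 * sqrt(s2_pool) * sqrt(1 / n1 + 1 / n2)   # SE(tau_hat)
t = tau_hat / se       # compare with the t-distribution on n1 + n2 - 2 df
```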
How do the period-adjusted results compare with the simple analysis results?
Add the sequence group means (as opposed to subtracting them) and then divide by -2
Note the form of the standard error is the same for the treatment and period effect: why?
Exercise: estimate the period effect and construct a 95% confidence interval for the period effect.
Sequence | Period 1 | Period 2 |
---|---|---|
AB | $\mu + \tau$ | $\mu + \pi + \lambda_A$ |
BA | $\mu$ | $\mu + \tau + \pi + \lambda_B$ |
Let $\lambda_A$ and $\lambda_B$ denote the expected carry-over effects (with $\mu$, $\tau$ and $\pi$ defined as previously)
How can you use the cell means model to yield an estimate of the carry-over effects?
only the difference between $\lambda_A$ and $\lambda_B$ is identifiable
The estimate is based upon the difference between the expected sequence group totals
testing for carry-over?
the estimate is based upon 'between-patient' variation, hence the test has low power
the carry-over effect is confounded with the period-by-treatment interaction in the $2 \times 2$ design
the two-stage procedure (pre-testing for carry-over) yields a biased estimator of the treatment effect
do not test for carry-over!
conclusion (Senn 1993, p 69)
“No help regarding this problem is to be expected from the data. The solution lies entirely in design.”
further reading: Senn (1993), Senn (1997)
References:
Senn S (1993) Cross-over trials in clinical research. Wiley, Chichester.
Senn S (1997) Statistical issues in drug development. Wiley, Chichester.
Jones B, Kenward MG (1990) Design and analysis of cross-over trials. Chapman & Hall, London.
Senn S et al. An incomplete blocks cross-over in asthma. In: Vollmar J, Hothorn LA (eds). Cross-over clinical trials. Gustav Fischer Verlag, Stuttgart.
Sample size by definition: “number of subjects in a clinical trial”
Why adequate sample sizes?
ethics
budget constraints
time constraints
The trial should be sufficiently large to provide a reliable answer to the research question
Usually based upon the primary endpoint. Usually an efficacy measure as opposed to safety / tolerability endpoint
Guidelines:
ICH E9 - Statistical Principles for Clinical Trials (Section 3.5: Sample Size)
Classical hypothesis testing uses p-values to determine which of two competing hypotheses to draw from the available data: $H_0$ versus $H_1$, say.
p-value: the probability, $p$, of obtaining a test result at least as extreme as that observed, assuming that the null hypothesis is true
The so-called size of the test is given by the value $\alpha$, which is typically chosen as $\alpha = 0.05$.
If $p \le \alpha$ we reject $H_0$ and conclude the data are inconsistent with the null
Errors in testing: methods are based upon experimental data and hence carry some risk of drawing a false conclusion
Decision \ Truth | $H_0$ true | $H_1$ true |
---|---|---|
Fail to reject $H_0$ | No Error | Type II Error |
Reject $H_0$ | Type I Error | No Error |
Type I error rate: $\alpha = P(\text{reject } H_0 \mid H_0 \text{ true})$
Type II error rate: $\beta = P(\text{fail to reject } H_0 \mid H_1 \text{ true})$
Critical value: $t_{crit}$
Power
Definition of power: “probability of concluding that the alternative hypothesis is true given that it is, in fact, true” (Senn, 1997)
Power depends upon:
statistical test being used
the size of that test
the nature and variability of the observations made
the alternative hypothesis (e.g. the size of the difference, $\delta$)
Note that a priori we do not know the size of the difference between treatments; usually the alternative hypothesis is based upon a clinically relevant difference, $\delta$, say
$\delta$ is a difference we would like to detect with reasonable power
consider an (approximately) normally distributed test statistic, $T$
$T \sim N(0, 1)$ under $H_0$ and $T \sim N(\theta, 1)$ under $H_1$
set power equal to target value: $P(\text{reject } H_0 \mid H_1) = 1 - \beta$
note $\theta$ depends upon $n$; solve the power equation for $n$
Let’s now consider a specific test
Assume data: $x_1, \ldots, x_n$, iid $N(\mu, \sigma^2)$
with known variance $\sigma^2$
hypotheses: $H_0: \mu = 0$ vs. $H_1: \mu > 0$
test statistic: $Z = \sqrt{n}\,\bar{x}/\sigma$ with $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
critical value: $z_{1-\alpha}$
desired power: if $\mu = \delta$ (smallest clinically relevant difference) the power should be $1 - \beta$
set the probability for rejection of the null hypothesis to $1 - \beta$: $P(Z > z_{1-\alpha} \mid \mu = \delta) = 1 - \beta$
Under $\mu = \delta$: $Z \sim N(\sqrt{n}\,\delta/\sigma,\, 1)$
non-centrality parameter: $\theta = \sqrt{n}\,\delta/\sigma$
sample size: $n = \frac{\sigma^2 (z_{1-\alpha} + z_{1-\beta})^2}{\delta^2}$
For a two-sided test we substitute $z_{1-\alpha/2}$ for $z_{1-\alpha}$ in the above (more on this later)
Assume data: $x_{11}, \ldots, x_{1n_1} \sim N(\mu_1, \sigma^2)$ and $x_{21}, \ldots, x_{2n_2} \sim N(\mu_2, \sigma^2)$, iid.
Denote the true difference in treatment effects $\delta = \mu_1 - \mu_2$.
hypotheses: $H_0: \delta = 0$ vs. $H_1: \delta > 0$
test statistic: $Z = \frac{\bar{x}_1 - \bar{x}_2}{\sigma\sqrt{1/n_1 + 1/n_2}}$
variance $\sigma^2$ known
Under $H_1$: $Z \sim N(\theta, 1)$
Non-centrality parameter?
Exercises this week: consider the form of the non-centrality parameter and derive the sample size formula for the two group Gauss test.
Note the required sample size formula (per group, with $n_1 = n_2 = n$) is: $n = \frac{2\sigma^2 (z_{1-\alpha} + z_{1-\beta})^2}{\delta^2}$
For the two-sided test we substitute $z_{1-\alpha/2}$ for $z_{1-\alpha}$
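The per-group formula can be evaluated with a few lines of Python (the helper name `n_per_group` is ours; `statistics.NormalDist` supplies the normal quantiles):

```python
from math import ceil
from statistics import NormalDist

z = NormalDist().inv_cdf   # standard normal quantile function z_p

def n_per_group(delta, sigma, alpha, power):
    """One-sided two-group Gauss test: n per group, rounded up.

    n = 2 * sigma^2 * (z_{1-alpha} + z_{1-beta})^2 / delta^2.
    For a two-sided test pass alpha / 2 instead of alpha.
    """
    return ceil(2 * (sigma * (z(1 - alpha) + z(power)) / delta) ** 2)

# e.g. one-sided alpha = 0.025, power 90%, delta = sigma
n = n_per_group(1.0, 1.0, 0.025, 0.9)
```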
Assume data and hypotheses as for the two-sample Gauss test, but with unknown variance $\sigma^2$
test statistic: $T = \frac{\bar{x}_1 - \bar{x}_2}{s\sqrt{1/n_1 + 1/n_2}}$
with $s^2$ the pooled variance estimate
non-centrality parameter: $\theta = \frac{\delta}{\sigma\sqrt{1/n_1 + 1/n_2}}$
approximate sample size per group is: $n \approx \frac{2\sigma^2 (z_{1-\alpha} + z_{1-\beta})^2}{\delta^2}$
exact sample size is based upon the non-central t-distribution: $T \sim t_{n_1+n_2-2}(\theta)$ with $\theta$ as above
the power equation cannot be solved for $n$ explicitly
we will use RStudio to compute exact sample size
“Epidemiology is the study of the distribution and determinants of disease in human populations.”
aim: “to inform health professionals and the public at large in order for improvements in general health status to be made” (Woodward, 1999)
Studies of distribution are largely descriptive:
examples include distributions by: geography, time, age, gender, social class, ethnicity and occupation
information is obtained regarding disease frequency in populations/sub-populations
descriptive studies can be used to generate research hypotheses and to guide resource allocation etc.
Disease determinants are the factors that precipitate disease (aetiological/ causal agents)
examples: biological (cholesterol/blood pressure), environmental (atmospheric pollutants), social/behavioural (smoking and diet)
(potential) aetiological agents are referred to as risk factors
studies of determinants of disease: analytic epidemiology using individual level disease and exposure data
The epidemiological domain includes both observation and experiment, however, experimentation is usually limited for ethical reasons
Following huge increases in the number of lung cancer deaths, research on smoking and lung cancer was conducted in various epidemiological studies, causing huge debate regarding the interpretation of the study results: the studies found exposure-disease associations, but does smoking cause lung cancer?
Sir R. A. Fisher raised the issue of association versus causation that clouds interpretation of observational studies
Fisher proposed that the association could be explained by a confounder: a genotype predisposed to both smoking and lung cancer
In response Cornfield argued that the existence of such a confounding factor seemed implausible because of the magnitude of the measure of association (relative risk between 10 and 20 for smokers versus non-smokers)
This led to pioneering work by Sir Richard Doll and Sir Austin Bradford Hill shortly after the second world war
when: initiated in October 1951 by Doll & Hill
who: wrote to members of the medical profession in the UK
study group: more than 40,000 doctors replied (out of almost 60,000)
exposure evaluation: questionnaire about smoking habits (including current smoker, ex-smoker, never smoker)
outcome measure: number of subsequent deaths, cause of death (Registrars-General UK)
reprinted: BMJ 328:1529-1533 (26 Jun 2004), Doll et al. 328 (7455): 1519
Epidemiology involves the collection, analysis and interpretation of data from human populations.
Population
target population: population we wish to draw inferences for (e.g. all males in Britain)
study population: population from which data are collected (e.g. British Doctors Study)
generalisability: can we use the study population results to draw accurate conclusions about the target?
Choice of study sample
generalisability of results (trade-off with availability, cost etc)
optimal scheme: random sample of target population
doctors study: opportunistic sample readily identifiable and likely to be cooperative
Epidemiological investigations use data from a variety of sources
Sources: routinely collected data (e.g. vital registrations: birth/death/cancer/infectious disease registers, census data, hospital databases etc.) or data purposely collected by the investigators (retrospectively or prospectively) by surveys, recruitment and follow-up
Routinely collected data:
vital for monitoring public health (e.g. cancer incidence), health planning (e.g. how to accommodate increasing life-expectancy)
may be of limited quality: subject to regional variation (e.g. variations in classification, coding etc) and often do not contain the required individual level information
vital statistics are gathered by the government: information is available on births, still-births, abortions, deaths, area populations, mortality, migration etc.
can be used to draw high-level inferences regarding possible associations between routinely available attributes (area, gender, age and social class) and the rate of incidence of, or death from, a particular disease.
Diseases are presently classified by ICD-10: the international classification of disease.
Routinely collected data are useful but have inherent limitations that we should be mindful of:
Coverage: morbidity is inherently difficult to define and hence coverage cannot be complete
only hospital patients are covered for many illnesses
practitioners vary in their reporting of notifiable infectious diseases
sickness certification relates mainly to patients who need a certificate for their employers
cancer registers may miss cases who never present to hospital
diagnostic/operative data are more difficult to capture than administrative data and are hence often omitted
Accuracy: diagnosis of cause of death and illness can be incorrect
Availability: confidentiality safeguards may limit data availability. However, provided the research can be justified, Research Ethics Committee approval can be obtained