5 Analysis of Variance

5.1 Multiple t-tests

Carrying out multiple two-sample t-tests seems the obvious way to compare means across a number of groups. However, there are two reasons why this may not be such a sensible idea:

  1. There are a lot of tests;

  2. Individual test errors compound across tests; see the discussion below.

Large number of tests. How many tests are required to carry out all pairwise mean comparisons for m groups?

Each test involves two groups, so we require the number of ways of selecting two items out of m. This is exactly the mathematical definition of a combination,

$$\binom{m}{2} = \frac{m!}{2!\,(m-2)!} = \frac{m!}{2\,(m-2)!},$$

since 2!=2.

Consider the case when you have just three groups. This would require three tests. If you had seven groups, this would become 21 tests. If you had ten groups then it would be 45 tests. However, it is not so difficult to write some code to automate these tests. The larger issue relates to the overall possibility of making an error in one, or more, of the tests.
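The counts quoted above can be checked with a few lines of Python. This is only a minimal sketch: the function name `n_pairwise_tests` and the group labels are made up for illustration and are not part of the notes.

```python
from itertools import combinations
from math import comb


def n_pairwise_tests(m: int) -> int:
    """Number of two-sample tests needed to compare every pair among m groups."""
    return comb(m, 2)


# Counts for the three cases discussed in the text.
for m in (3, 7, 10):
    print(m, "groups:", n_pairwise_tests(m), "tests")

# The pairs themselves can be enumerated, e.g. to loop over them in an
# automated series of tests. The labels are hypothetical.
groups = ["A", "B", "C"]
pairs = list(combinations(groups, 2))
print(pairs)
```

Looping over `pairs` is one way to automate the tests, but it does nothing about the error-rate problem discussed next.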

How can we make an error when carrying out a hypothesis test? Suppose that we are testing

$$H_0: \mu = \mu_0$$

vs.

$$H_1: \mu \neq \mu_0$$

at the 5% level. We reject the null hypothesis if the absolute value of the test statistic

$$t = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$

lies above the 97.5% quantile of the $t_{n-1}$-distribution. Consequently, there is a 5% probability of rejecting H0 even when it is true.
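As a rough illustration of this decision rule, the sketch below computes the test statistic from a sample and compares its absolute value with a critical value. The function names and the data are invented for the example. The critical value is passed in by hand (here 2.776, the tabulated 97.5% quantile of the t-distribution with 4 degrees of freedom) so that only the Python standard library is needed.

```python
from math import sqrt
from statistics import mean, stdev


def t_statistic(sample, mu0):
    """One-sample t statistic: (sample mean - mu0) / (S / sqrt(n))."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n))


def reject_h0(sample, mu0, critical_value):
    """Two-sided test: reject H0 when |t| exceeds the critical value."""
    return abs(t_statistic(sample, mu0)) > critical_value


# Made-up data, n = 5, so the reference distribution is t with 4 d.f.
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
t_crit = 2.776  # tabulated 97.5% quantile of t_4 (assumed from tables)

print(t_statistic(sample, 2.0))
print(reject_h0(sample, 2.0, t_crit))
```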

Definition.

A Type I error occurs when the null hypothesis H0 is rejected when it is in fact true. The probability of a Type I error is equal to the significance level α of the test.

There is a second type of error, which is less interesting for our purposes. This kind of error occurs if H0 is accepted when H1 is in fact true.

Definition.

A Type II error occurs when the null hypothesis H0 is accepted when it is in fact not true.

The probability of a Type II error depends on the true value of the population parameter and can therefore only be calculated for speculated values of this parameter. For example, we might ask ‘If the difference between the true mean and μ0 were d for some d>0, what would be the probability of a Type II error when testing H0:μ=μ0 against H1:μ>μ0?’ The probability of a Type II error is linked to the power of the test.

Definition.

The power of the test is the probability of correctly accepting the alternative hypothesis H1, given that it is true. In other words, power is 1-Pr[Type II error].
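To make the link between power and the Type II error concrete, here is a minimal sketch for the one-sided question posed above. It swaps the t-test for a simpler z-test by assuming the population standard deviation σ is known, which is a simplification and not the setting of these notes. The function names and the parameters `d`, `sigma` and `n` are chosen for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(d, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu = mu0 against H1: mu > mu0
    when the true mean is mu0 + d (ASSUMES sigma is known)."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha)  # reject H0 when the z statistic exceeds this
    # Under the alternative, the z statistic is shifted by d * sqrt(n) / sigma.
    return 1 - z.cdf(crit - d * sqrt(n) / sigma)


def type_ii_error(d, sigma, n, alpha=0.05):
    """Probability of failing to reject H0 when the true mean is mu0 + d."""
    return 1 - power_one_sided_z(d, sigma, n, alpha)


# Speculated differences d, with hypothetical sigma = 1 and n = 25.
for d in (0.1, 0.3, 0.5):
    print(d, round(power_one_sided_z(d, 1.0, 25), 3))
```

Under these assumptions, power grows with the speculated difference d and falls back to α as d approaches 0.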

There will be more investigation of Type I and Type II errors in the coursework and homework sheets. However, for now, we focus on an extension to the Type I error in the context of multiple testing. In particular, we also define the family-wise error rate:

Definition.

The family-wise error rate is the probability that H0 is incorrectly rejected at least once across the whole series of tests.

When comparing the pairwise means of multiple groups, we would hope that the family-wise error rate would be equal to the probability of a Type I error for a single test. Unfortunately, this is not the case. The probability that we incorrectly reject H0 at least once across k independent tests is one minus the probability that we incorrectly reject H0 in none of these tests. Since the probability that we incorrectly reject H0 on any given test is α, independence gives

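$$
\begin{aligned}
\text{FWER} &= 1 - \Pr[\text{none of the } k \text{ tests incorrectly rejects } H_0] \\
&= 1 - \Pr[\text{an individual test does not incorrectly reject } H_0]^k \\
&= 1 - (1-\alpha)^k.
\end{aligned}
$$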

If two independent tests are performed at the 5% significance level, what is the FWER? How does this change if 10 tests are performed?

For two tests, the FWER is 0.0975. For 10 tests it is 0.401. That is, the probability of incorrectly rejecting H0 in at least one of the ten tests is about 40%. This is considerably higher than the 5% probability for a single test.
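These two values can be reproduced directly from the formula FWER = 1-(1-α)^k. The short sketch below assumes, as the derivation does, that the k tests are independent. The function name `fwer` is chosen for illustration.

```python
def fwer(alpha: float, k: int) -> float:
    """Family-wise error rate for k independent tests, each at level alpha."""
    return 1 - (1 - alpha) ** k


print(round(fwer(0.05, 2), 4))   # 0.0975
print(round(fwer(0.05, 10), 3))  # 0.401
```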

Whilst much research has been, and is being, carried out into the issues surrounding multiple testing, in particular in health research and genomics, we shall now concentrate on a different approach to the comparison of the means of three or more groups.