Carrying out multiple two-sample $t$-tests seems the obvious way to compare means across a number of groups. However, there are two reasons why this may not be such a sensible idea:
There are a lot of tests;
Individual test errors compound across tests; see the discussion below.
Large number of tests. How many tests are required to carry out all pairwise mean comparisons for $k$ groups?
Each test involves two groups, so we require the number of ways of selecting two items out of $k$. This is exactly the mathematical definition of a combination,
$$\binom{k}{2} = \frac{k!}{2!\,(k-2)!} = \frac{k(k-1)}{2},$$
since the order in which the two groups are chosen does not matter.
Consider the case when you have just three groups. This would require three tests. If you had seven groups, this would become 21 tests. If you had ten groups then it would be 45 tests. However, it is not so difficult to write some code to automate these tests. The larger issue relates to the overall possibility of making an error in one, or more, of the tests.
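As a rough illustration of such automation, the sketch below counts the pairwise comparisons with math.comb and runs every pairwise two-sample $t$-test using scipy.stats.ttest_ind. The group names and simulated data are hypothetical, chosen purely for demonstration.

```python
from itertools import combinations
from math import comb

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data: k groups of observations stored in a dictionary.
groups = {name: rng.normal(loc=10, scale=2, size=30)
          for name in ["A", "B", "C", "D", "E", "F", "G"]}

k = len(groups)
print(f"{k} groups -> {comb(k, 2)} pairwise tests")  # 7 groups -> 21 tests

# Run every pairwise two-sample t-test.
for (name1, x1), (name2, x2) in combinations(groups.items(), 2):
    t_stat, p_value = stats.ttest_ind(x1, x2)
    print(f"{name1} vs {name2}: t = {t_stat:.2f}, p = {p_value:.3f}")
```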
How can we make an error when carrying out a hypothesis test? Suppose that we are testing
$$H_0: \mu_1 = \mu_2 \quad \text{vs.} \quad H_1: \mu_1 \neq \mu_2$$
at the 5% level. We reject the null hypothesis if the absolute value of the test statistic lies above the 97.5% quantile of the relevant $t$-distribution. Consequently there is a 5% probability of rejecting $H_0$ even when it is true.
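The rejection rule can be illustrated with a minimal sketch, assuming the equal-variance (pooled) form of the two-sample $t$ statistic; the data below are simulated purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x1 = rng.normal(loc=10, scale=2, size=25)   # hypothetical sample, group 1
x2 = rng.normal(loc=10, scale=2, size=25)   # hypothetical sample, group 2

n1, n2 = len(x1), len(x2)

# Pooled two-sample t statistic (equal-variance form).
sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
t_stat = (x1.mean() - x2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

# Reject H0 if |t| exceeds the 97.5% quantile of the t-distribution
# with n1 + n2 - 2 degrees of freedom.
critical = stats.t.ppf(0.975, df=n1 + n2 - 2)
print(f"|t| = {abs(t_stat):.2f}, critical value = {critical:.2f}")
print("Reject H0" if abs(t_stat) > critical else "Do not reject H0")
```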
A Type I error occurs when the null hypothesis is rejected when it is in fact true. The probability of a Type I error is equal to the significance level of the test.
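To see that the Type I error rate matches the significance level, one can simulate many datasets for which $H_0$ is true and record how often the test rejects. The sample sizes and number of replications below are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_reps, alpha = 10_000, 0.05

# Both samples come from the same distribution, so H0 is true.
rejections = 0
for _ in range(n_reps):
    x1 = rng.normal(loc=10, scale=2, size=25)
    x2 = rng.normal(loc=10, scale=2, size=25)
    _, p_value = stats.ttest_ind(x1, x2)
    rejections += p_value < alpha

print(f"Estimated Type I error rate: {rejections / n_reps:.3f}")  # close to 0.05
```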
There is a second type of error, which is less interesting for our purposes. This kind of error occurs if $H_0$ is accepted when $H_1$ is in fact true.
A Type II error occurs when the null hypothesis is accepted when it is in fact not true.
The probability of a Type II error depends on the true value of the population parameter and can therefore only be calculated for speculated values of this parameter. For example, we might say ‘If the difference between the true mean and the hypothesised mean were $\delta$ for some $\delta \neq 0$, what would be the probability of a Type II error when testing $H_0$ against $H_1$?’ The probability of a Type II error is linked to the power of the test.
The power of the test is the probability of correctly accepting the alternative hypothesis $H_1$, given that it is true. In other words, power is $1 - P(\text{Type II error})$.
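Continuing the simulation idea above, power can be estimated by generating data under a specific alternative (here a hypothetical mean difference of $\delta = 1$) and recording how often the test correctly rejects $H_0$; all settings are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_reps, alpha, delta = 10_000, 0.05, 1.0  # delta is a hypothetical true difference

rejections = 0
for _ in range(n_reps):
    x1 = rng.normal(loc=10, scale=2, size=25)
    x2 = rng.normal(loc=10 + delta, scale=2, size=25)  # H1 is true here
    _, p_value = stats.ttest_ind(x1, x2)
    rejections += p_value < alpha

power = rejections / n_reps
print(f"Estimated power: {power:.3f}")
print(f"Estimated Type II error probability: {1 - power:.3f}")
```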
There will be more investigation of Type I and Type II errors in the coursework and homework sheets. However, for now, we focus on an extension to the Type I error in the context of multiple testing. In particular, we also define the family-wise error rate:
The family-wise error rate is the probability that $H_0$ is incorrectly rejected at least once across the whole series of tests.
When comparing the pairwise means of multiple groups, we would hope that the family-wise error rate would be equal to the probability of a Type I error for a single test. Unfortunately this is not the case. The probability that we incorrectly reject $H_0$ at least once across $m$ independent tests is the same as one minus the probability that we incorrectly reject $H_0$ in none of these tests. By definition, the probability that we incorrectly reject on any given test is $\alpha$, so
$$\text{FWER} = P(\text{at least one Type I error}) = 1 - P(\text{no Type I errors}) = 1 - (1 - \alpha)^m.$$
If two independent tests are performed at the 5% significance level, what is the FWER? How does this change if 10 tests are performed?
For two tests, the FWER is $1 - (0.95)^2 = 0.0975$. For 10 tests it is $1 - (0.95)^{10} \approx 0.401$. That is, the probability of incorrectly rejecting $H_0$ in at least one of the ten tests is roughly 40%. This is considerably higher than the 5% probability for a single test.
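These figures can be checked directly from the formula above; the short snippet below is just an illustrative calculation.

```python
def fwer(alpha: float, m: int) -> float:
    """Family-wise error rate for m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m

for m in (1, 2, 10):
    print(f"m = {m:2d}: FWER = {fwer(0.05, m):.4f}")
# m =  1: FWER = 0.0500
# m =  2: FWER = 0.0975
# m = 10: FWER = 0.4013
```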
Whilst much research has been, and is being, carried out into the issues surrounding multiple testing, particularly in health research and genomics, we shall now concentrate on a different approach to the comparison of the means of three or more groups.