Hypothesis tests are not flawless. Just think of the court system: innocent people are sometimes wrongly convicted and the guilty sometimes walk free. Similarly, we can make a wrong decision in statistical hypothesis tests. However, the difference is that we have the tools necessary to quantify how often we make such errors.
There are two competing hypotheses: the null and the alternative. In a hypothesis test, we make a statement about which one might be true, but we might choose incorrectly. There are four possible scenarios in a hypothesis test, which are summarized in Table 2.7.
| Truth \ Test conclusion | do not reject H_0 | reject H_0 in favour of H_A |
|---|---|---|
| H_0 true | okay | Type 1 Error |
| H_A true | Type 2 Error | okay |
A Type 1 Error is rejecting the null hypothesis (H_0) when H_0 is actually true. A Type 2 Error is failing to reject the null hypothesis when the alternative (H_A) is actually true.
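The four scenarios in Table 2.7 can be written out as a small function. This is a minimal sketch of the mapping, not from the text; the function name `classify_outcome` is ours:

```python
def classify_outcome(h0_is_true: bool, reject_h0: bool) -> str:
    """Map the truth of H_0 and the test decision to one of the
    four scenarios in Table 2.7."""
    if h0_is_true and reject_h0:
        return "Type 1 Error"   # rejected H_0 although it was true
    if not h0_is_true and not reject_h0:
        return "Type 2 Error"   # failed to reject H_0 although H_A was true
    return "okay"               # the decision matches the truth

print(classify_outcome(h0_is_true=True, reject_h0=True))    # Type 1 Error
print(classify_outcome(h0_is_true=False, reject_h0=False))  # Type 2 Error
```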
In a court, the defendant is either innocent (H_0) or guilty (H_A). What does a Type 1 Error represent in this context? What does a Type 2 Error represent? Table 2.7 may be useful.
Answer. If the court makes a Type 1 Error, this means the defendant is innocent (H_0 true) but wrongly convicted. A Type 2 Error means the court failed to reject H_0 (i.e. failed to convict the person) when she was in fact guilty (H_A true).
How could we reduce the Type 1 Error rate in courts? What influence would this have on the Type 2 Error rate?
Answer. To lower the Type 1 Error rate, we might raise our standard for conviction from ‘‘beyond a reasonable doubt’’ to ‘‘beyond a conceivable doubt’’ so fewer people would be wrongly convicted. However, this would also make it more difficult to convict the people who are actually guilty, so we would make more Type 2 Errors.
How could we reduce the Type 2 Error rate in courts? What influence would this have on the Type 1 Error rate?
Answer. To lower the Type 2 Error rate, we want to convict more guilty people. We could lower the standards for conviction from ‘‘beyond a reasonable doubt’’ to ‘‘beyond a little doubt’’. Lowering the bar for guilt will also result in more wrongful convictions, raising the Type 1 Error rate.
Exercises 2.9.5-2.9.7 provide an important lesson: if we reduce how often we make one type of error, we generally make more of the other type.
Hypothesis testing is built around rejecting or failing to reject the null hypothesis. That is, we do not reject H_0 unless we have strong evidence. But what precisely does strong evidence mean? As a general rule of thumb, for those cases where the null hypothesis is actually true, we do not want to incorrectly reject H_0 more than 5% of the time. This corresponds to a significance level of 0.05. We often write the significance level using the Greek letter α (alpha): α = 0.05. We discuss the appropriateness of different significance levels in Section 2.9.6.
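A quick way to see what "incorrectly reject 5% of the time" means is a hypothetical simulation (not from the text): draw many samples from a population where H_0 is actually true, test each at α = 0.05, and count the wrongful rejections.

```python
import random
import statistics

# Simulate a population where H_0 (mu = 0) is true, with known sigma = 1.
# Each trial runs a two-sided z-test at alpha = 0.05; any rejection here
# is, by construction, a Type 1 Error.
random.seed(1)
CRITICAL_Z = 1.96                          # two-sided cutoff for alpha = 0.05
n, trials = 50, 10_000
rejections = 0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]
    se = 1 / n ** 0.5                      # standard error with sigma = 1
    z = statistics.mean(sample) / se
    if abs(z) > CRITICAL_Z:
        rejections += 1                    # a Type 1 Error

print(rejections / trials)                 # close to 0.05
```

The observed Type 1 Error rate hovers around the chosen significance level, which is exactly what α is meant to control.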
Significance Level
The significance level is defined as the probability of rejecting the null hypothesis when H_0 is actually true, i.e. the probability of making a Type 1 Error. We denote it by α.
If we use a 95% confidence interval to test a hypothesis where the null hypothesis is true, we will make an error whenever the point estimate is at least 1.96 standard errors away from the population parameter. This happens about 5% of the time (2.5% in each tail). Similarly, using a 99% confidence interval to evaluate a hypothesis is equivalent to a significance level of α = 0.01.
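The cutoffs 1.96 and 2.58 are just quantiles of the normal distribution, so the correspondence between confidence level and α can be checked directly. A sketch using the standard library:

```python
from statistics import NormalDist

# A 95% interval leaves 2.5% in each tail; a 99% interval leaves 0.5%.
# The matching cutoffs are the normal quantiles at 0.975 and 0.995.
z_95 = NormalDist().inv_cdf(0.975)   # about 1.96  -> alpha = 0.05
z_99 = NormalDist().inv_cdf(0.995)   # about 2.576 -> alpha = 0.01

print(round(z_95, 2), round(z_99, 3))
```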
A confidence interval is, in one sense, simplistic in the world of hypothesis tests. Consider the following two scenarios:
The null value (the parameter value under the null hypothesis) is in the 95% confidence interval but just barely, so we would not reject H_0. However, we might like to somehow say, quantitatively, that it was a close decision.
The null value is very far outside of the interval, so we reject H_0. However, we want to communicate that, not only did we reject the null hypothesis, but it wasn’t even close. Such a case is depicted in Figure LABEL:whyWeWantPValue.
In Section 2.9.4, we introduce a tool called the p-value that will be helpful in these cases. The p-value method also extends to hypothesis tests where confidence intervals cannot be easily constructed or applied.