2.9.4 Formal testing using p-values

The p-value is a way of quantifying the strength of the evidence against the null hypothesis and in favour of the alternative. Formally the p-value is a conditional probability.

p-value. The p-value is the probability of observing data at least as favourable to the alternative hypothesis as our current data set, if the null hypothesis is true. We typically use a summary statistic of the data, in this chapter the sample mean, to help compute the p-value and evaluate the hypotheses.
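To make the conditioning explicit, one way to write the definition is sketched below for the case where larger sample means favour the alternative. The symbols $\bar{X}$ (the sample mean viewed as a random quantity) and $\bar{x}_{obs}$ (the value actually observed) are notation assumed here for illustration.

$$\text{p-value}=\mathbb{P}\left(\bar{X}\geq\bar{x}_{obs}\mid H_{0}\text{ is true}\right)$$

When smaller sample means favour the alternative, the inequality is reversed.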

Example 2.9.8

A poll by researchers at University College London found that university students average about 7.45 hours of sleep per night. Researchers at Lancaster University are interested in showing that students at their university sleep longer than 7.45 hours on average, and they would like to demonstrate this using a sample of students. What would be an appropriate sceptical position for this research?

Answer. A sceptic would have no reason to believe that sleep patterns at this university are different from the sleep patterns at another university. We can set up the null hypothesis for this test as a sceptical perspective: the students at this university average 7.45 hours of sleep per night. The alternative hypothesis takes a new form reflecting the interests of the research: the students average more than 7.45 hours of sleep. We can write these hypotheses as

$H_{0}$: $\mu=7.45$.

$H_{A}$: $\mu>7.45$.

Using $\mu>7.45$ as the alternative is an example of a one-sided hypothesis test. In this investigation, there is no apparent interest in learning whether the mean is less than 7.45 hours. (This is entirely based on the interests of the researchers. Had they been interested only in the opposite case, showing that their students were actually averaging fewer than 7.45 hours of sleep but not in showing more than 7.45 hours, then our setup would have set the alternative as $\mu<7.45$.) Earlier we encountered a two-sided hypothesis where we looked for any clear difference, greater than or less than the null value.

Always use a two-sided test unless it was made clear prior to data collection that the test should be one-sided. Switching a two-sided test to a one-sided test after observing the data is dangerous because it can inflate the Type 1 Error rate.
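The inflation can be illustrated with a small simulation. The sketch below assumes the test statistic is a standard normal Z score when the null hypothesis is true. The variable names are chosen here for illustration and do not come from the course data sets.

```r
# Simulate Z scores for many studies in which the null hypothesis is true.
set.seed(1)
n_sim <- 100000
z <- rnorm(n_sim)
alpha <- 0.05

# Two-sided test fixed in advance: reject when |Z| exceeds the two-sided cutoff.
two_sided_rate <- mean(abs(z) > qnorm(1 - alpha / 2))

# One-sided test whose direction is chosen after seeing the data: the
# one-sided cutoff is effectively applied to whichever tail Z landed in.
post_hoc_rate <- mean(abs(z) > qnorm(1 - alpha))

c(two_sided = two_sided_rate, post_hoc_one_sided = post_hoc_rate)
```

Under these assumptions the first rate should be close to 0.05 and the second close to 0.10, that is, roughly double the nominal Type 1 Error rate.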

TIP: One-sided and two-sided tests. If the researchers are only interested in showing an increase or a decrease, but not both, use a one-sided test. If the researchers would be interested in any difference from the null value, whether an increase or a decrease, then the test should be two-sided.

TIP: Always write the null hypothesis as an equality. We will find it most useful if we always list the null hypothesis as an equality (e.g. $\mu=7.45$) while the alternative always uses an inequality (e.g. $\mu\neq 7.45$, $\mu>7.45$, or $\mu<7.45$).

The researchers at Lancaster University took a simple random sample of $n=110$ students on campus. They found that these students averaged 7.87 hours of sleep and the standard deviation of the amount of sleep for the students was 1.75 hours. A histogram of the sample can be drawn with the commands below.

R> data(sleep)
R> hist(sleep)

Before we can use a normal model for the sample mean or compute the standard error of the sample mean, we must verify conditions. (1) Because this is a simple random sample from less than 10% of the student body, the observations are independent. (2) The sample size in the sleep study is sufficiently large since it is greater than 30. (3) The histogram shows moderate skew and a couple of outliers. This skew and the outliers (which are not too extreme) are acceptable for a sample size of $n=110$. With these conditions verified, the normal model can be safely applied to $\bar{x}$ and the estimated standard error will be very accurate.

Example 2.9.9

What is the standard deviation associated with $\bar{x}$ ? That is, estimate the standard error of $\bar{x}$ .

Answer. The standard error can be estimated from the sample standard deviation and the sample size: $SE_{\bar{x}}=\frac{s_{x}}{\sqrt{n}}=\frac{1.75}{\sqrt{110}}=0.17$.

The hypothesis test will be evaluated using a significance level of $\alpha=0.05$. We want to consider the data under the scenario that the null hypothesis is true. In this case, the sample mean is from a distribution that is nearly normal and has mean 7.45 and standard deviation of about 0.17. Such a distribution is shown in the accompanying figure.

The shaded tail in the accompanying figure represents the chance of observing such a large mean, conditional on the null hypothesis being true. That is, the shaded tail represents the p-value. We shade all means larger than our sample mean, $\bar{x}=7.87$, because they are more favourable to the alternative hypothesis than the observed mean.

We compute the p-value by finding the tail area of this normal distribution, which we learned to do in Section 2.3. First compute the Z score of the sample mean, $\bar{x}=7.87$ :

$$Z=\frac{\bar{x}-\text{null value}}{SE_{\bar{x}}}=\frac{7.87-7.45}{0.17}=2.47$$

Using pnorm(2.47), the lower unshaded area is found to be $\mathbb{P}(Z<2.47)=0.993$ . Thus the shaded area is $1-0.993=0.007$ . If the null hypothesis is true, the probability of observing such a large sample mean for a sample of 110 students is only 0.007. That is, if the null hypothesis is true, we would not often see such a large mean.
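The same calculation can be sketched in R directly from the summary statistics. The variable names below are chosen here for illustration. Because the code does not round the standard error to 0.17, its Z score (about 2.52) and p-value (about 0.006) differ slightly from the hand calculation above.

```r
# Summary statistics reported for the sleep study.
x_bar <- 7.87   # sample mean (hours)
s     <- 1.75   # sample standard deviation (hours)
n     <- 110    # sample size
mu_0  <- 7.45   # null value

# Standard error of the sample mean and the Z score under the null hypothesis.
se <- s / sqrt(n)
z  <- (x_bar - mu_0) / se

# Upper tail area, since the alternative is mu > 7.45.
p_value <- pnorm(z, lower.tail = FALSE)

# The hand calculation with the rounded Z score, for comparison.
p_value_rounded <- 1 - pnorm(2.47)

c(se = se, z = z, p_value = p_value, p_value_rounded = p_value_rounded)
```

Either way the p-value is well below $\alpha=0.05$, so the conclusion of the test does not depend on the rounding.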

We evaluate the hypotheses by comparing the p-value to the significance level. Because the p-value is less than the significance level (p-value $=0.007<0.05=\alpha$ ), we reject the null hypothesis. What we observed is so unusual with respect to the null hypothesis that it casts serious doubt on $H_{0}$ and provides strong evidence favouring $H_{A}$ .

p-value as a tool in hypothesis testing. The p-value quantifies how strongly the data contradict $H_{0}$: the smaller the p-value, the stronger the evidence against $H_{0}$. A small p-value (usually $<0.05$) corresponds to sufficient evidence to reject $H_{0}$. Note that this does not by itself establish that $H_{A}$ is true.

TIP: It is useful to first draw a picture to find the p-value. Draw the distribution of $\bar{x}$ as though $H_{0}$ were true (i.e. $\mu$ equals the null value), and shade the region (or regions) of sample means that are at least as favourable to the alternative hypothesis as the observed mean. These shaded regions represent the p-value.
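Such a picture can be sketched in R with base graphics. The example below uses the sleep study values (null value 7.45, standard error about 0.17, observed mean 7.87). The variable names and plotting choices are illustrative assumptions, not the code used for the course figures.

```r
mu_0  <- 7.45   # null value
se    <- 0.17   # standard error of the sample mean
x_bar <- 7.87   # observed sample mean

# Grid of possible sample means, four standard errors either side of mu_0.
x <- seq(mu_0 - 4 * se, mu_0 + 4 * se, length.out = 400)
dens <- dnorm(x, mean = mu_0, sd = se)

# Distribution of the sample mean if the null hypothesis is true.
plot(x, dens, type = "l", xlab = "Sample mean (hours)", ylab = "Density")

# Shade the upper tail: sample means at least as large as the observed one.
tail_x <- x[x >= x_bar]
polygon(c(x_bar, tail_x, max(x)),
        c(0, dnorm(tail_x, mean = mu_0, sd = se), 0),
        col = "grey")
abline(v = x_bar, lty = 2)
```

For a lower-tailed alternative the same approach applies with `x <= x_bar` selecting the shaded region.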

The ideas below review the process of evaluating hypothesis tests with p-values:

• The null hypothesis represents a sceptic’s position or a position of no difference. We reject this position only if the evidence strongly favours $H_{A}$.
• A small p-value means that if the null hypothesis is true, there is a low probability of seeing a point estimate at least as extreme as the one we saw. We interpret this as strong evidence in favour of the alternative.
• We reject the null hypothesis if the p-value is smaller than the significance level, $\alpha$, which is usually 0.05. Otherwise, we fail to reject $H_{0}$.
• We should always state the conclusion of the hypothesis test in plain language so non-statisticians can also understand the results.

The p-value is constructed in such a way that we can directly compare it to the significance level ($\alpha$) to determine whether or not to reject $H_{0}$. This method ensures that the Type 1 Error rate does not exceed the significance level.
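The review steps can be collected into a small function. This is a minimal sketch that assumes the normal model is appropriate. The function `one_sided_z_test` and its argument names are hypothetical, written here for illustration rather than taken from any R package.

```r
# Hypothetical helper: one-sided test for a mean from summary statistics,
# using the normal model for the sample mean.
one_sided_z_test <- function(x_bar, mu_0, s, n,
                             alternative = c("greater", "less"),
                             alpha = 0.05) {
  alternative <- match.arg(alternative)
  se <- s / sqrt(n)               # standard error of the sample mean
  z  <- (x_bar - mu_0) / se       # Z score under the null hypothesis
  p_value <- if (alternative == "greater") {
    pnorm(z, lower.tail = FALSE)  # upper tail for H_A: mu > mu_0
  } else {
    pnorm(z)                      # lower tail for H_A: mu < mu_0
  }
  list(se = se, z = z, p_value = p_value, reject = p_value < alpha)
}

# Sleep study: H_0: mu = 7.45 against H_A: mu > 7.45 at alpha = 0.05.
result <- one_sided_z_test(7.87, 7.45, 1.75, 110, alternative = "greater")
result$reject
```

The `reject` element only reports the comparison of the p-value with $\alpha$. The conclusion should still be stated in plain language.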

Example 2.9.10

Ebay might be interested in showing that buyers on its site tend to pay less than they would for the corresponding new item on Amazon. We’ll research this topic for one particular product: a video game called Mario Kart for the Nintendo Wii. During early October 2009, Amazon sold this game for $46.99. Set up an appropriate (one-sided!) hypothesis test to check the claim that Ebay buyers pay less during auctions at this same time.

Answer. The sceptic would say the average is the same on Ebay, and we are interested in showing the average price is lower.

$H_{0}$: The average auction price on Ebay is equal to (or more than) the price on Amazon. We write only the equality in the statistical notation: $\mu_{ebay}=46.99$.

$H_{A}$: The average price on Ebay is less than the price on Amazon, $\mu_{ebay}<46.99$.

Example 2.9.11

During early October 2009, 52 Ebay auctions were recorded for Mario Kart. (These data were collected by OpenIntro staff.) The total prices for the auctions are presented in the histogram produced below, and we may like to apply the normal model to the sample mean. Check the three conditions required for applying the normal model: (1) independence, (2) at least 30 observations, and (3) the data are not strongly skewed.

R> data(marioKart)
R> hist(marioKart[,7][marioKart[,11]==1], breaks=10)

Answer. (1) The independence condition is unclear. We will make the assumption that the observations are independent, which we should report with any final results.
(2) The sample size is sufficiently large: $n=52\geq 30$ .
(3) The data distribution is not strongly skewed; it is approximately symmetric.

Example 2.9.12

The average sale price of the 52 Ebay auctions for Wii Mario Kart was $44.17 with a standard deviation of $4.15. Does this provide sufficient evidence to reject the null hypothesis in Example 2.9.10? Use a significance level of $\alpha=0.01$ .

Answer.

The hypotheses were set up and the conditions were checked in Examples 2.9.10 and 2.9.11. The next step is to find the standard error of the sample mean and produce a sketch to help find the p-value.

$$SE_{\bar{x}}=\frac{s}{\sqrt{n}}=\frac{4.15}{\sqrt{52}}=0.5755$$

Because the alternative hypothesis says we are looking for a smaller mean, we shade the lower tail. We find this shaded area by using the Z score and pnorm: $Z=\frac{44.17-46.99}{0.5755}=-4.90$ , which has area pnorm(-4.90)= $\mathbb{P}(Z<-4.90)=0.0000004791833$ . The area is so small we cannot really see it on the picture. This lower tail area corresponds to the p-value.

Because the p-value is so small – specifically, smaller than $\alpha=0.01$ – this provides sufficiently strong evidence to reject the null hypothesis in favour of the alternative. The data provide statistically significant evidence that the average price on Ebay is lower than Amazon’s asking price.
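The Mario Kart calculation can be reproduced from the summary statistics alone. The sketch below uses variable names chosen here for illustration and takes the lower tail because the alternative is $\mu_{ebay}<46.99$.

```r
# Summary statistics for the 52 Ebay auctions.
x_bar <- 44.17   # sample mean price
s     <- 4.15    # sample standard deviation
n     <- 52      # number of auctions
mu_0  <- 46.99   # null value: Amazon's price
alpha <- 0.01    # significance level

se <- s / sqrt(n)           # standard error of the sample mean
z  <- (x_bar - mu_0) / se   # Z score under the null hypothesis
p_value <- pnorm(z)         # lower tail area

c(se = se, z = z, p_value = p_value)
p_value < alpha             # TRUE means we reject the null hypothesis
```

The comparison in the last line agrees with the conclusion above: the p-value is far below 0.01.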