5.2 One-way ANOVA

Suppose that we have measured a response variable on a sample from our population. Now also suppose that the population can be split into three or more groups according to a second variable. These groups might be contrived e.g. from an experiment run under a number of different conditions, or natural e.g. comparing family income across a number of regions. Before we attempt to answer a question of this type, it is useful to view the data graphically. A sensible way to compare a response variable across groups is to use a boxplot. Boxplots consist of a box (representing the upper and lower quartiles of the sample), with a midline (sample median) and two tails (sample minimum and maximum).

The aim of a one-way ANOVA is to compare the means of the groups. Let $\mu_{j}$ represent the population mean for group $j$ . If there are $m$ groups, then we test

H_{0}:\mu_{1}=\mu_{2}=\cdots=\mu_{m}

vs.

H_{1}:\text{at least two of the means }\mu_{1},\ldots,\mu_{m}\text{ differ}.

Example 5.2.1 Starlings

The masses (in grams) of 10 starlings from each of four different roost situations were recorded. Does the mean mass differ between the roosting groups?

> load("starlingsAOV.Rdata")

> boxplot(Mass~Roost,starlings)

The resulting plot can be found in Figure 5.1. There does appear to be a clear difference in the distribution of the masses across the four groups.

Fig. 5.1: Boxplots of starling masses for four different roosting sites.
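By default, R's `boxplot` draws each tail only as far as the most extreme point within 1.5 interquartile ranges of the box, and shows anything beyond that as a separate point. The sketch below sets `range = 0` so that the tails reach the sample minimum and maximum, as in the description of a boxplot given earlier. The data are simulated purely for illustration: the group labels, means, standard deviation and seed are assumptions, not the starling data.

```r
# Simulated data for illustration only: 4 hypothetical groups of 10 values.
set.seed(1)
g <- factor(rep(c("A", "B", "C", "D"), each = 10))
y <- rnorm(40, mean = rep(c(84, 79, 79, 75), each = 10), sd = 3.5)

# range = 0 makes the whiskers extend to the data extremes;
# plot = FALSE returns the summary statistics without drawing anything.
b <- boxplot(y ~ g, range = 0, plot = FALSE)

# Rows of b$stats: lower whisker, lower hinge, median, upper hinge, upper whisker.
# (R uses hinges, which are close to, but not always equal to, the quartiles.)
whisker_low  <- b$stats[1, ]
midline      <- b$stats[3, ]
whisker_high <- b$stats[5, ]
print(b$stats)
```

Dropping `plot = FALSE` draws the boxplot itself.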

Note that we cannot test for ordering in the group means e.g. $\mu_{1}>\mu_{2}>\ldots>\mu_{m}$ , nor can we test whether the mean of group 1 alone differs from the means of all the other groups. The basis of the test is to compare two sums of squares, each scaled by its degrees of freedom. If the null hypothesis is true, both scaled sums provide an estimate of the population variance. If the null hypothesis is false, then only one of them is an estimate of the population variance. Therefore the ratio of the two scaled sums should be close to 1 only if the null hypothesis holds.

In more detail, let $Y_{ji}$ represent the response variable for the $i$ -th individual in group $j$ . Suppose that there are $m$ groups, indexed by $j=1,\ldots,m$ , and that each group contains $n$ observations, indexed by $i=1,\ldots,n$ . In practice, we can deal with groups which have different numbers of observations, but it makes the presentation slightly more messy.

The basic assumption of any ANOVA is that the random variables $Y_{ji}$ are independent with

\displaystyle Y_{ji}\sim\operatorname{Normal}(\mu_{j},\sigma^{2}),

(5.1)

so that each group can have a distinct mean, but the variance is the same across groups.

Let $\bar{Y}$ denote the overall mean and $\bar{Y}_{j}$ denote the mean of the $j$ -th group, and consider the overall sum of squares ( $SS_{T}$ )

SS_{T}=\sum_{j=1}^{m}\sum_{i=1}^{n}(Y_{ji}-\bar{Y})^{2}.

If we consider the summand, then this can be expanded as

	$\displaystyle(Y_{ji}-\bar{Y})^{2}$	$\displaystyle=(Y_{ji}-\bar{Y}_{j}+\bar{Y}_{j}-\bar{Y})^{2}$
		$\displaystyle=(Y_{ji}-\bar{Y}_{j})^{2}+2(Y_{ji}-\bar{Y}_{j})(\bar{Y}_{j}-\bar{Y})+(\bar{Y}_{j}-\bar{Y})^{2}.$

Thus the total sum of squares can be written as

	$\displaystyle SS_{T}$	$\displaystyle=\sum_{j=1}^{m}\sum_{i=1}^{n}(Y_{ji}-\bar{Y})^{2}$
		$\displaystyle=\sum_{j=1}^{m}\sum_{i=1}^{n}\left[(Y_{ji}-\bar{Y}_{j})^{2}+2(Y_{ji}-\bar{Y}_{j})(\bar{Y}_{j}-\bar{Y})+(\bar{Y}_{j}-\bar{Y})^{2}\right]$
		$\displaystyle=\sum_{j=1}^{m}\sum_{i=1}^{n}(Y_{ji}-\bar{Y}_{j})^{2}+2\sum_{j=1}^{m}\left[(\bar{Y}_{j}-\bar{Y})\sum_{i=1}^{n}(Y_{ji}-\bar{Y}_{j})\right]+n\sum_{j=1}^{m}(\bar{Y}_{j}-\bar{Y})^{2}.$

Now

	$\displaystyle\sum_{i=1}^{n}(Y_{ji}-\bar{Y}_{j})$	$\displaystyle=\sum_{i=1}^{n}Y_{ji}-n\bar{Y}_{j}$
		$\displaystyle=n\bar{Y}_{j}-n\bar{Y}_{j}$

by the definition of $\bar{Y}_{j}$ , so $\sum_{i=1}^{n}(Y_{ji}-\bar{Y}_{j})=0$ .

And so the total sum of squares can be split into

SS_{T}=\sum_{j=1}^{m}\sum_{i=1}^{n}(Y_{ji}-\bar{Y}_{j})^{2}+n\sum_{j=1}^{m}(\bar{Y}_{j}-\bar{Y})^{2}.

The two terms on the right are referred to respectively as the within group ( $SS_{W}$ ) and the between group ( $SS_{B}$ ) sums of squares. When calculating these terms, it is usual to compute $SS_{T}$ and $SS_{B}$ directly from the data, and then to calculate $SS_{W}$ as

SS_{W}=SS_{T}-SS_{B}.
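The decomposition can be checked numerically. The following is a minimal sketch on simulated data (the group means, standard deviation, seed and variable names are illustrative assumptions); it computes all three sums of squares directly and confirms both that the cross term vanishes and that $SS_{T}=SS_{W}+SS_{B}$ up to rounding error.

```r
# Simulated data for illustration only: m groups of n observations each.
set.seed(42)
m <- 4
n <- 10
group <- factor(rep(seq_len(m), each = n))
y <- rnorm(m * n, mean = rep(c(10, 12, 9, 11), each = n), sd = 2)

overall_mean    <- mean(y)
group_means     <- tapply(y, group, mean)  # one mean per group
group_mean_each <- ave(y, group)           # group mean repeated for each observation

# Total, between-group and within-group sums of squares, each computed directly.
ss_t <- sum((y - overall_mean)^2)
ss_b <- n * sum((group_means - overall_mean)^2)
ss_w <- sum((y - group_mean_each)^2)

# Cross term: deviations from the group mean sum to zero within every group.
cross <- tapply(y - group_mean_each, group, sum)

print(c(SS_T = ss_t, SS_B = ss_b, SS_W = ss_w, SS_B_plus_SS_W = ss_b + ss_w))
```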

From the sums of squares, we calculate the mean sums of squares,

MS_{B}=\frac{SS_{B}}{m-1}

and

MS_{W}=\frac{SS_{W}}{m(n-1)}.

Under the null hypothesis, both of these quantities can be used to estimate the residual variance $\sigma^{2}$ . Therefore to carry out the test, we calculate the ratio of these estimators

F=\frac{MS_{B}}{MS_{W}}.

If the null hypothesis is true, this ratio will be close to 1. The question is, how far away from 1 does the ratio need to be in order for us to conclude that there is evidence against the null hypothesis? To answer this, we require the sampling distribution of the ratio, under the assumption that $H_{0}$ is true. We can then obtain the critical region, which will contain all values of the ratio which are sufficiently unusual under $H_{0}$ to allow us to reject $H_{0}$ .

Under assumption (5.1), $SS_{W}/\sigma^{2}$ is a sum of squares of Normal random variables and has a $\chi^{2}$ distribution with $m(n-1)$ degrees of freedom. If $H_{0}$ is also true, then $SS_{B}/\sigma^{2}$ has a $\chi^{2}$ distribution with $m-1$ degrees of freedom, independently of $SS_{W}$ (see results from Math230). In each case, the degrees of freedom of the $\chi^{2}$ distribution is the denominator of the corresponding mean sum of squares, which is the value required to give an unbiased estimator of $\sigma^{2}$ . Since the ratio of two independent $\chi^{2}$ random variables, each divided by its degrees of freedom, is a random variable with an $F$ -distribution, the required sampling distribution is

F\sim F_{m-1,m(n-1)}.

By comparing the test statistic $F$ to this sampling distribution, we can decide whether or not to reject the null hypothesis, usually based on either a critical region or a $p$ -value.
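The sampling distribution can be checked by simulation. The sketch below is an illustration under stated assumptions: the common mean, standard deviation, number of simulations, seed and the helper name `f_ratio` are all made up. It generates many data sets for which $H_{0}$ is true, computes the $F$ -ratio for each, and records how often the ratio exceeds the 5% critical value of $F_{m-1,m(n-1)}$ . If the theory holds, this proportion should be close to 0.05.

```r
# Simulation under H0 (illustration only): all m groups share the same mean.
set.seed(2024)
m <- 4
n <- 10
n_sim <- 5000
group <- factor(rep(seq_len(m), each = n))

# F-ratio for one data set, following the formulas for MS_B and MS_W.
f_ratio <- function(y, group, m, n) {
  group_means <- tapply(y, group, mean)
  ss_b <- n * sum((group_means - mean(y))^2)
  ss_w <- sum((y - ave(y, group))^2)
  (ss_b / (m - 1)) / (ss_w / (m * (n - 1)))
}

f_sim <- replicate(n_sim, f_ratio(rnorm(m * n, mean = 80, sd = 3.5), group, m, n))

crit <- qf(0.95, m - 1, m * (n - 1))   # 5% critical value of F_{m-1, m(n-1)}
rejection_rate <- mean(f_sim > crit)   # proportion of simulated ratios in the critical region
print(c(critical_value = crit, rejection_rate = rejection_rate))
```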

Example 5.2.2 Starlings again

Recall the starling masses that we saw in Example 5.2.1. Ten starlings were sampled from four different roosts. The data can be found in the file starlings.Rdata. Carry out a one-way ANOVA to test whether the mean weight of starlings varies between roosts. You should state clearly your hypotheses and conclusions.

The hypotheses are

H_{0}:\mu_{1}=\mu_{2}=\mu_{3}=\mu_{4}

vs.

H_{1}:\text{at least two of the means }\mu_{1},\mu_{2},\mu_{3},\mu_{4}\text{ differ}.

Next we need to calculate the three sums of squares. First we need the overall and group means. The overall mean is

\frac{1}{40}\sum_{j=1}^{4}\sum_{i=1}^{10}y_{ji}=\frac{1}{40}\times 3170=79.25

and the group means are 83.6, 79.4, 78.6 and 75.4. The sums of squares are

\displaystyle SS_{T}=\sum_{j=1}^{4}\sum_{i=1}^{10}(y_{ji}-79.25)^{2}=797.5

	$\displaystyle SS_{B}$	$\displaystyle=10\times\left[(83.6-79.25)^{2}+(79.4-79.25)^{2}+(78.6-79.25)^{2}+(75.4-79.25)^{2}\right]$
		$\displaystyle=10\times 34.19$
		$\displaystyle=341.9$

\displaystyle SS_{W}=SS_{T}-SS_{B}=797.5-341.9=455.6

Next calculate the mean sums of squares

$MS_{B}=\frac{SS_{B}}{m-1}=\frac{341.9}{3}=113.97$ ,
$MS_{W}=\frac{SS_{W}}{m(n-1)}=\frac{455.6}{4\times 9}=12.66$ .

Finally we calculate the test statistic

F=\frac{MS_{B}}{MS_{W}}=\frac{113.97}{12.66}=9.005.

The degrees of freedom for the sampling distribution are given by the denominators in the between and within mean sums of squares: in this case, 3 and 36. So the critical region at the 5% level of significance consists of all values above the critical value given by

> qf(0.95,3,36)

This gives us a critical value of 2.87 (see Figure 5.2). Since $9.005>2.87$ we would reject $H_{0}$ and conclude that there is evidence of a difference between the mean masses at the four different roosts.

Alternatively, we could calculate the $p$ -value,

> 1-pf(9.005,3,36)

This gives a $p$ -value of 0.000139, which is clearly less than 0.05, so again we would reject the null hypothesis.
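The arithmetic in this example can be reproduced from the quoted summary statistics alone. The sketch below is a check of the hand calculation rather than an analysis of the raw data: it takes the group means, the overall total and $SS_{T}=797.5$ as given in the example, and the variable names are illustrative.

```r
# Summary statistics quoted in the worked example.
m <- 4                                    # number of roosts
n <- 10                                   # starlings per roost
group_means  <- c(83.6, 79.4, 78.6, 75.4)
overall_mean <- 3170 / 40                 # 79.25
ss_t <- 797.5                             # total sum of squares, as given

# Between and within sums of squares, then the mean sums of squares.
ss_b <- n * sum((group_means - overall_mean)^2)
ss_w <- ss_t - ss_b
ms_b <- ss_b / (m - 1)
ms_w <- ss_w / (m * (n - 1))
f_stat <- ms_b / ms_w

# Critical value and p-value from the F distribution on (3, 36) degrees of freedom.
crit  <- qf(0.95, m - 1, m * (n - 1))
p_val <- pf(f_stat, m - 1, m * (n - 1), lower.tail = FALSE)  # equivalent to 1 - pf(...)

print(round(c(SS_B = ss_b, SS_W = ss_w, MS_B = ms_b, MS_W = ms_w, F = f_stat), 3))
print(c(critical_value = crit, p_value = p_val))
```

Using `lower.tail = FALSE` avoids subtracting a number very close to 1, which can matter for very small $p$ -values.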

Fig. 5.2: The density of the $F_{3,36}$ sampling distribution for the $F$ -ratio in the starling ANOVA example. The critical region for the test at the 5% level is marked in blue.

Finally, note an alternative way to write the ANOVA assumptions is that the $Y_{ji}$ are independent with

\displaystyle Y_{ji}\sim\operatorname{Normal}(a+b_{j},\sigma^{2}),

$i=1,\ldots,n$ , $j=1,\ldots,m$ .

Here $a$ is the ‘base’ mean level that is common to all groups and $b_{j}$ is the effect on the mean of being in the $j$ -th group. ANOVA gives us a way to test whether or not the group means $a+b_{j}$ are all the same, that is, whether all the $b_{j}$ are equal. In the coming sections on linear regression modelling, we will see how we can estimate the size of these effects.
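The split of each group mean into $a+b_{j}$ is not unique unless a constraint is imposed. As a rough illustration only, the sketch below assumes the convention $b_{1}=0$ , so that $a$ is the mean of the first group; this convention is an assumption made here, not something stated above. It applies the split to the sample group means from Example 5.2.2.

```r
# Sample group means from Example 5.2.2.
group_means <- c(83.6, 79.4, 78.6, 75.4)

# Assumed convention: b_1 = 0, so the base level a is the first group's mean.
a <- group_means[1]
b <- group_means - a        # effect of each group relative to group 1

print(data.frame(group = 1:4, b = b, a_plus_b = a + b))
```

Under this convention the null hypothesis of equal means corresponds to $b_{2}=b_{3}=b_{4}=0$ .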

5.2.1 ANOVA in R