One-Way ANOVA in Biological Research

Mathematical Biology IA
University of Cambridge
M. Castle
MATHEMATICAL BIOLOGY LENT: ANOVA AND LINEAR MODELS

Lecture 1: One-Way Analysis of Variance (ANOVA)
Aims
1) To introduce the One-Way ANOVA, a test to compare the means of multiple groups.
2) To introduce the concept of partitioning of variance for statistical inference.
Objectives: After the lecture, students should be able to:
1) Compute Sums of Squares, degrees of freedom, Mean Squares and F statistics for a
One-Way ANOVA.
2) Make inferences on the means of multiple groups using an ANOVA table.
In the last lecture, you saw how you can compare the means of two samples with a t-test.
However, we often want to deal with more complex problems that involve several groups. To
tackle these sorts of problems, we need to develop a more general framework, called Analysis
of Variance (ANOVA). In this lecture, we will concentrate on the simplest form of ANOVA (a
One-Way ANOVA, comparing the means of several groups), and we will then expand this
framework to deal with more complex problems. As in previous lectures, we will confine
ourselves to normally distributed populations, and we will assume that these populations
have equal variances.
About notation
First of all, we need to look at the notation required to understand a One-Way ANOVA.
Subscripts are used to define the origin of each data point. Each observation is denoted as $y_{gi}$,
where g represents the group (sample) it comes from, and i identifies the individual response
within the group. So, $y_{23}$ is the third observation in the second group. If we consider the small
dataset of meerkat weights in Table 1, this would be the third observation from Kuruman
River, which is 597 g. A dataset has k groups (meaning that g = 1, ..., k), and each group has $n_g$
observations. In the meerkat example, we have two locations (k = 2), $n_1$ is the sample size for
Deception Valley ($n_1 = 6$) and $n_2$ is the sample size for Kuruman River ($n_2 = 6$). The overall
sample size is N ($N = n_1 + n_2 = 12$).
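Table 1. Weights in grams of a number of meerkats caught in a field study in Deception Valley, Botswana, and Kuruman River, South Africa.

| Location | $y_{g1}$ | $y_{g2}$ | $y_{g3}$ | $y_{g4}$ | $y_{g5}$ | $y_{g6}$ |
| --- | --- | --- | --- | --- | --- | --- |
| Deception Valley, Botswana (g=1) | 514 | 519 | 568 | 571 | 553 | 531 |
| Kuruman River, South Africa (g=2) | 624 | 542 | 597 | 597 | 577 | 678 |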
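The subscript notation maps naturally onto a nested data structure. The short Python sketch below is only an illustration (the analyses in these notes are run in R, and the names `weights` and `y` are assumptions made here); it stores the Table 1 data by group and recovers $y_{23}$, k, $n_g$ and N.

```python
# Meerkat weights (g) from Table 1, keyed by group number g.
weights = {
    1: [514, 519, 568, 571, 553, 531],  # Deception Valley, Botswana
    2: [624, 542, 597, 597, 577, 678],  # Kuruman River, South Africa
}


def y(g, i):
    """Return observation y_gi, using the 1-based subscripts of the notes."""
    return weights[g][i - 1]


k = len(weights)                                  # number of groups
n = {g: len(obs) for g, obs in weights.items()}   # n_g, sample size per group
N = sum(n.values())                               # overall sample size

print(y(2, 3), k, n[1], n[2], N)  # prints: 597 2 6 6 12
```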
The Analysis of Variance framework

The analysis of variance framework is based on the idea that the variance of the response can
be partitioned into components that correspond to the source of variation (namely, one or
more components due to changes in the values of the independent variable(s) and a
component due to random error). In a One-Way ANOVA, we have only one independent
variable (e.g. location), which is a discrete factor (i.e. a categorical variable).
To understand how this is computed, we need to think about each observation as a deviation
from the overall mean (Fig. 1),
$$y_{gi} = \mu + \alpha_g + \varepsilon_{gi}$$
where
$\mu$ = the overall mean
$\alpha_g$ = the group effect
$\varepsilon_{gi}$ = the random error component
Fig. 1: Diagram showing the partitioning of an individual weight $y_{gi}$.
In the meerkat example, this implies that there is a mean weight for this species ($\mu$), which is
then affected by the location (with $\alpha_g > 0$ if meerkats are bigger than average at that location,
and $\alpha_g < 0$ if the site is not a good one and they are smaller). However, even within a site, not all
individuals will have the same weight, and the individual deviation from the site mean is given
by $\varepsilon_{gi}$ (where $\varepsilon_{gi}$ is normally distributed with a mean of 0). As the purpose of a One-Way ANOVA
is to determine whether the means of several groups differ significantly, the null hypothesis
can be phrased as H0: $\mu_1 = \mu_2 = \dots = \mu_k$, where $\mu_g = \mu + \alpha_g$ is the mean of group g. The alternative hypothesis H1 states that $\mu_1, \dots, \mu_k$ are not all equal.
So, how can we estimate the variance components? If you think back to your first lecture, the
numerator of the formula for the variance is the sum of squared deviations (SS) from the mean.
SS are a good way to summarise the level of variability around a mean, and we can use a
similar approach here (Fig. 2). A further property of SS is that they are additive. So,
SSTot = SSG + SSE
where
SSTot = Total Sum of Squares
SSG = Group Sum of Squares (also known as the Treatment SS)
SSE = Error Sum of Squares
First, let us estimate the Total Sum of Squares. This is the total deviation of the dataset from
the overall mean (Fig. 2). Even though we don't know the true overall mean, we can estimate
it by pooling all the observations from all the groups and then taking their mean, $\bar{y}$. So,
$$\mathrm{SS}_{\mathrm{Tot}} = \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y})^2$$
In the meerkat dataset, the overall mean is 572.6. The sum of squared deviations from the
overall mean (SSTot) is
$$\mathrm{SS}_{\mathrm{Tot}} = (514 - 572.6)^2 + (519 - 572.6)^2 + (568 - 572.6)^2 + \dots + (678 - 572.6)^2 = 24342.9$$
[All figures are rounded to 1 decimal place (DP). It is convention to give summary statistics to
an accuracy of 1 DP more than the accuracy of the original data, which in the meerkat
example were to zero DP].
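The Total Sum of Squares can be checked numerically. The following is a minimal Python sketch (variable names are illustrative assumptions) that pools the two samples, takes their mean, and sums the squared deviations, without rounding until the end.

```python
# Meerkat weights (g) from Table 1.
deception = [514, 519, 568, 571, 553, 531]
kuruman = [624, 542, 597, 597, 577, 678]

# Pool all observations and estimate the overall mean.
pooled = deception + kuruman
grand_mean = sum(pooled) / len(pooled)

# Total Sum of Squares: squared deviation of every observation
# from the overall mean, summed over the whole dataset.
ss_tot = sum((obs - grand_mean) ** 2 for obs in pooled)

print(f"overall mean = {grand_mean:.1f}, SSTot = {ss_tot:.1f}")
```

Rounding only at the final step avoids the small discrepancies that appear when the rounded mean 572.6 is used in the intermediate arithmetic.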
Fig. 2: ANOVA Sums of Squares. Three panels (SSTot, SSG and SSE), each plotting Weight (g) against Observations for Deception Valley and Kuruman River.
We now need to consider the Group Sum of Squares (SSG). What we want to know here is how
much variability in the dataset comes from the fact that the group means are different from
the overall mean. Again we don't know exactly what $\alpha_g$ is, but we can take the mean for each
group, $\bar{y}_g$ (which is our best estimate of $\mu + \alpha_g$). Now we can estimate the amount of
deviation due to the group effect as
$$\mathrm{SS}_G = \sum_{g=1}^{k} \sum_{i=1}^{n_g} (\bar{y}_g - \bar{y})^2 = \sum_{g=1}^{k} n_g (\bar{y}_g - \bar{y})^2$$
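In the meerkat example, the mean for Deception Valley, $\bar{y}_1$, is 542.7, and the mean for
Kuruman River, $\bar{y}_2$, is 602.5. Since the sample size is 6 for both groups ($n_1 = n_2 = 6$), we have
$$\mathrm{SS}_G = 6 \times (542.7 - 572.6)^2 + 6 \times (602.5 - 572.6)^2 = 10740.1$$
(The result is computed from the unrounded means; the means are shown to 1 DP for readability.)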
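As a numerical check on the Group Sum of Squares, here is a small Python sketch (names are assumptions chosen for illustration). It weights each squared deviation of a group mean from the overall mean by the group's sample size.

```python
# Meerkat weights (g) from Table 1, one list per location.
groups = {
    "Deception Valley": [514, 519, 568, 571, 553, 531],
    "Kuruman River": [624, 542, 597, 597, 577, 678],
}

pooled = [obs for sample in groups.values() for obs in sample]
grand_mean = sum(pooled) / len(pooled)

# Group Sum of Squares: n_g * (group mean - overall mean)^2, summed over groups.
ss_g = 0.0
for name, sample in groups.items():
    group_mean = sum(sample) / len(sample)
    ss_g += len(sample) * (group_mean - grand_mean) ** 2

print(f"SSG = {ss_g:.1f}")
```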

Finally, we need to estimate the amount of variation in the data due to the random error (i.e.
those random deviations due to individual effects). Since our best estimate of $\mu + \alpha_g$ is $\bar{y}_g$,
we can write:
$$\mathrm{SS}_E = \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y}_g)^2$$
For the meerkat example,
$$\mathrm{SS}_E = (514 - 542.7)^2 + \dots + (531 - 542.7)^2 + (624 - 602.5)^2 + \dots + (678 - 602.5)^2 = 13602.8$$
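The Error Sum of Squares is built up group by group. The Python sketch below (an illustration with assumed names) keeps the contribution of each location separate before adding them, which also shows how much each group contributes to the within-group variation.

```python
# Meerkat weights (g) from Table 1, one list per location.
groups = {
    "Deception Valley": [514, 519, 568, 571, 553, 531],
    "Kuruman River": [624, 542, 597, 597, 577, 678],
}

# Within-group sum of squares: deviations of each observation
# from the mean of its own group.
within = {}
for name, sample in groups.items():
    group_mean = sum(sample) / len(sample)
    within[name] = sum((obs - group_mean) ** 2 for obs in sample)

ss_e = sum(within.values())

for name, value in within.items():
    print(f"{name}: {value:.1f}")
print(f"SSE = {ss_e:.1f}")
```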
Now that we have created estimates of the components of variance, we can use them to test
our null hypothesis. As usual, we need to say how confident we are that our results do not
deviate from the H0 simply as a result of random chance. The obvious approach here is to say
that, if SSG is a rather large proportion of SSTot, then the null hypothesis that all group means are
equal is unlikely to be true. We can rephrase this as saying that, if the amount of variance
accounted for by the group effect is large in comparison to the amount of variance due to error
(keep in mind that SSTot = SSG + SSE), then the group effect is likely to be real (i.e. significant).
At this stage, you might be tempted to simply compare SSG with SSE. However, note that we
obtained the different sums of squares using very different amounts of information. The SSG
referred to a small number of estimates (the group means) compared to the overall mean,
whereas the SSE is the sum of a large number of individual deviations (we compared all data
points to their respective group means). So, to compare the variance components, we first
need to standardise them according to the number of parameters involved (i.e. the amount
of information).
We need a standardisation parameter that allows us to compare SS. This parameter
is called the degrees of freedom (usually denoted as df), and it is related to the
number of parameters we had to estimate in order to compute a set of squared deviations. In
general, the degrees of freedom for a given set of deviations is equal to the number of
parameters/observations minus the number of reference parameter values we used to
compute the deviations. So, the df for SSTot (dfTot) is equal to the sample size minus 1 (as we
only have to estimate the overall mean). The dfG is equal to the number of groups minus 1 (as
the group means are compared to a single overall mean). Finally, dfE is equal to the sample
size minus the number of groups (as we had to estimate a mean for each group). In the meerkat
example, dfTot = 12 - 1 = 11, dfG = 2 - 1 = 1 and dfE = 12 - 2 = 10.
| Source | SS | df | MS | F |
| --- | --- | --- | --- | --- |
| Group | SSG | dfG = k - 1 | MSG = SSG / dfG | MSG / MSE |
| Error | SSE | dfE = N - k | MSE = SSE / dfE | |
| Total | SSTot | dfTot = N - 1 | | |
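The entries of the ANOVA table can be assembled in a few lines. The function below is a hedged Python sketch, not the routine R uses internally; `one_way_anova` and the dictionary keys are names assumed here for illustration. It accepts one list of observations per group, so it is not restricted to two groups or to equal sample sizes.

```python
def one_way_anova(groups):
    """Return the entries of a One-Way ANOVA table for a list of samples."""
    k = len(groups)                          # number of groups
    N = sum(len(sample) for sample in groups)  # overall sample size
    grand_mean = sum(sum(sample) for sample in groups) / N
    means = [sum(sample) / len(sample) for sample in groups]

    # Sums of squares for the group and error components.
    ss_g = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(groups, means))
    ss_e = sum((obs - m) ** 2 for s, m in zip(groups, means) for obs in s)

    # Degrees of freedom, mean squares and the F ratio.
    df_g, df_e = k - 1, N - k
    ms_g, ms_e = ss_g / df_g, ss_e / df_e
    return {
        "SSG": ss_g, "SSE": ss_e, "SSTot": ss_g + ss_e,
        "dfG": df_g, "dfE": df_e, "dfTot": N - 1,
        "MSG": ms_g, "MSE": ms_e, "F": ms_g / ms_e,
    }


# Meerkat weights (g) from Table 1.
table = one_way_anova([
    [514, 519, 568, 571, 553, 531],   # Deception Valley
    [624, 542, 597, 597, 577, 678],   # Kuruman River
])
print(f"dfG = {table['dfG']}, dfE = {table['dfE']}, F = {table['F']:.2f}")
```

The sketch returns SSTot as SSG + SSE, relying on the additivity of sums of squares; it does not compute a p value, which requires the F distribution.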
We want to concentrate on the group and error components, to be able to test our null
hypothesis using the idea outlined above (i.e. asking whether the variance component due to the
group is large relative to the component due to error). We can estimate the mean square deviation (MS, our
standardised estimator of variation) for each component by dividing each SS by its
appropriate df. Our confidence in the H0 depends on how much variation in the dataset can
be attributed to the group effect vs. the amount due to random error. So, to get a handle on
this, we can estimate the ratio between the group MS (MSG) and the error MS (MSE). The ratio
of two variances has a well-known behaviour, described by the F distribution. The F
distribution is somewhat different from the distributions you have encountered so far, as it is
defined by two degrees of freedom, one for the variance in the numerator and one for the
variance in the denominator.
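One way to get a feel for the F distribution is to simulate it. The Python sketch below is an illustration under stated assumptions: it repeatedly draws two groups of six observations from the same standard normal population (so H0 is true by construction), computes MSG / MSE each time, and counts how often the ratio exceeds 4.96, the tabulated 5% critical value for 1 and 10 degrees of freedom. The function name, the seed and the number of simulations are arbitrary choices made here.

```python
import random


def f_statistic(groups):
    """Return MSG / MSE for a list of samples (one list per group)."""
    k = len(groups)
    N = sum(len(sample) for sample in groups)
    grand_mean = sum(sum(sample) for sample in groups) / N
    means = [sum(sample) / len(sample) for sample in groups]
    ss_g = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(groups, means))
    ss_e = sum((obs - m) ** 2 for s, m in zip(groups, means) for obs in s)
    return (ss_g / (k - 1)) / (ss_e / (N - k))


random.seed(1)            # fixed seed so the run is repeatable
k, n, n_sim = 2, 6, 20000
critical = 4.96           # tabulated 5% critical value for F with 1 and 10 df

exceed = 0
for _ in range(n_sim):
    # Every group comes from the same normal population, so H0 holds.
    sample = [[random.gauss(0, 1) for _ in range(n)] for _ in range(k)]
    if f_statistic(sample) > critical:
        exceed += 1

proportion = exceed / n_sim
print(f"Proportion of simulated F above {critical}: {proportion:.3f}")
```

If the simulation behaves as the theory predicts, the proportion should be close to 0.05: when H0 is true, an F ratio this large arises by chance only about one time in twenty.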
So, if we write out the ANOVA table for meerkats

| Source | SS | df | MS | F |
| --- | --- | --- | --- | --- |
| Group | 10740.1 | 1 | 10740.1 | 7.90 |
| Error | 13602.8 | 10 | 1360.3 | |
| Total | 24342.9 | 11 | | |
From your statistical tables, the critical value for $F_{1,10}$ at $\alpha$ = 0.05 is 4.96. Since 7.90, our
estimated F, is larger than the critical value, we conclude that the result is significant, and we
reject the null hypothesis that all populations are equal in mean weight. The exact p value is 0.018,
as we can see from the R output:
Analysis of Variance Table

Response: Weight
          Df  Sum Sq Mean Sq F value  Pr(>F)
Location   1 10740.1 10740.1  7.8955 0.01848 *
Residuals 10 13602.8  1360.3
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since we are comparing only two groups, we can now repeat the analysis with a t-test
(assuming equal variances), and confirm that the result does not change:
        Two Sample t-test

data:  Weight by Location
t = -2.8099, df = 10, p-value = 0.01848
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -107.27897  -12.38769
sample estimates:
mean in group Deception Valley    mean in group Kuruman River
                      542.6667                       602.5000
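The agreement between the two tests is not a coincidence: with two groups, the pooled variance of the t-test is the same quantity as MSE, and the square of t equals F. The Python sketch below (names assumed for illustration) computes the equal-variance t statistic by hand from the Table 1 data and squares it.

```python
import math

# Meerkat weights (g) from Table 1.
deception = [514, 519, 568, 571, 553, 531]
kuruman = [624, 542, 597, 597, 577, 678]


def mean(values):
    return sum(values) / len(values)


n1, n2 = len(deception), len(kuruman)
m1, m2 = mean(deception), mean(kuruman)

# Pooled variance: within-group sums of squares over N - 2 degrees of freedom.
ss1 = sum((obs - m1) ** 2 for obs in deception)
ss2 = sum((obs - m2) ** 2 for obs in kuruman)
pooled_var = (ss1 + ss2) / (n1 + n2 - 2)   # identical to MSE in the ANOVA table

# Equal-variance two-sample t statistic.
t = (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

print(f"t = {t:.4f}, t squared = {t ** 2:.4f}")
```

The squared t statistic reproduces the F value in the ANOVA table, and both tests use 10 degrees of freedom for the error term, which is why the p values coincide.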
How do we report this result? In project write-ups (and later in papers that you might author),
you will be interested in the biology, not the statistics. So, describe your results, and use the
statistics to back them up:
Meerkat weights in Kuruman River were significantly different from those in Deception Valley
($F_{1,10}$ = 7.90, p < 0.05).
Important: Note that you should ALWAYS provide the test statistic (in this case F), the number of
degrees of freedom for parametric tests (for F, we have two values, 1 and 10) or the sample
sizes for certain non-parametric tests, and an indication of the p value (p < 0.05, or even better,
the exact p value to 3 DP, p = 0.018).
A worked example using three groups

The ANOVA framework is very powerful and flexible. Let us see how we can use a One-Way
ANOVA to compare the means of three groups.
We have obtained another six meerkat weights, this time coming from Addo Elephant Park. If
we add these new data to the previous dataset, we obtain
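Table 2. Weights in grams of a number of meerkats caught in a field study in Deception Valley, Botswana, Kuruman River, South Africa, and Addo Elephant Park, South Africa.

| Location | $y_{g1}$ | $y_{g2}$ | $y_{g3}$ | $y_{g4}$ | $y_{g5}$ | $y_{g6}$ |
| --- | --- | --- | --- | --- | --- | --- |
| Deception Valley, Botswana (g=1) | 514 | 519 | 568 | 571 | 553 | 531 |
| Kuruman River, South Africa (g=2) | 624 | 542 | 597 | 597 | 577 | 678 |
| Addo Elephant Park, South Africa (g=3) | 591 | 641 | 677 | 653 | 673 | 595 |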
We want to test whether there is a difference in average weight among the populations.
First, we plot the data:
So, H0: $\mu_1 = \mu_2 = \mu_3$ (i.e. there is no difference). H1 is that not all of $\mu_1, \dots, \mu_k$ are equal.
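Compute the overall mean:
$$\bar{y} = \frac{\sum_{g=1}^{k} \sum_{i=1}^{n_g} y_{gi}}{\sum_{g=1}^{k} n_g} = \frac{514 + 519 + 568 + \dots + 673 + 595}{6 + 6 + 6} = 594.5$$
Mean for each group:
$$\bar{y}_1 = \frac{\sum_{i=1}^{n_1} y_{1i}}{n_1} = \frac{514 + 519 + 568 + 571 + 553 + 531}{6} = 542.7$$
$$\bar{y}_2 = \frac{\sum_{i=1}^{n_2} y_{2i}}{n_2} = \frac{624 + 542 + 597 + 597 + 577 + 678}{6} = 602.5$$
$$\bar{y}_3 = \frac{\sum_{i=1}^{n_3} y_{3i}}{n_3} = \frac{591 + 641 + 677 + 653 + 673 + 595}{6} = 638.3$$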
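Sum of Squares:
$$\mathrm{SS}_{\mathrm{Tot}} = \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y})^2 = (514 - 594.5)^2 + (519 - 594.5)^2 + (568 - 594.5)^2 + \dots + (595 - 594.5)^2 = 48672.5$$
$$\mathrm{SS}_G = \sum_{g=1}^{k} n_g (\bar{y}_g - \bar{y})^2 = 6 \times (542.7 - 594.5)^2 + 6 \times (602.5 - 594.5)^2 + 6 \times (638.3 - 594.5)^2 = 28032.3$$
$$\mathrm{SS}_E = \mathrm{SS}_{\mathrm{Tot}} - \mathrm{SS}_G = 48672.5 - 28032.3 = 20640.2$$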
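The worked example obtains SSE by subtraction rather than by summing the within-group deviations. The Python sketch below (illustrative names, standard library only) computes SSE both ways for the three-group data to show that the shortcut gives the same answer.

```python
# Meerkat weights (g) from Table 2.
groups = [
    [514, 519, 568, 571, 553, 531],   # Deception Valley
    [624, 542, 597, 597, 577, 678],   # Kuruman River
    [591, 641, 677, 653, 673, 595],   # Addo Elephant Park
]

pooled = [obs for sample in groups for obs in sample]
grand_mean = sum(pooled) / len(pooled)
means = [sum(sample) / len(sample) for sample in groups]

ss_tot = sum((obs - grand_mean) ** 2 for obs in pooled)
ss_g = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(groups, means))

# SSE computed directly from the within-group deviations...
ss_e_direct = sum((obs - m) ** 2 for s, m in zip(groups, means) for obs in s)
# ...and by the subtraction shortcut used in the worked example.
ss_e_shortcut = ss_tot - ss_g

print(f"SSTot = {ss_tot:.1f}, SSG = {ss_g:.1f}")
print(f"SSE direct = {ss_e_direct:.1f}, SSE by subtraction = {ss_e_shortcut:.1f}")
```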
Degrees of freedom:
dfTot = N - 1 = (6 + 6 + 6) - 1 = 18 - 1 = 17
dfG = k - 1 = 3 - 1 = 2
dfE = dfTot - dfG = 17 - 2 = 15
Mean Squares:
MSG = SSG / dfG = 28032.3 / 2 = 14016.2
MSE = SSE / dfE = 20640.2 / 15 = 1376.0
F statistic:
$F_{2,15}$ = MSG / MSE = 14016.2 / 1376.0 = 10.19
Populate the ANOVA table:

| Source | SS | df | MS | F |
| --- | --- | --- | --- | --- |
| Group | 28032.3 | 2 | 14016.2 | 10.19 |
| Error | 20640.2 | 15 | 1376.0 | |
| Total | 48672.5 | 17 | | |
Critical value for $F_{2,15}$ at $\alpha$ = 0.05 is 3.68. Since 10.19 > 3.68, we reject H0 and accept H1.

Using R, we can confirm that our calculations are correct:
Analysis of Variance Table

Response: Weight
          Df Sum Sq Mean Sq F value   Pr(>F)
Location   2  28032   14016  10.186 0.001606 **
Residuals 15  20640    1376
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Report: Meerkats from the three locations differed significantly in weight
($F_{2,15}$ = 10.2, p = 0.002), with the population in Deception Valley being the lightest and the one in
Addo Elephant Park being the heaviest.
Additivity of Sums of Squares

This section is not examinable and is for reference only.
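We can show algebraically that the total sum of squares will always be equal to the group sum
of squares plus the error sum of squares, i.e. we can show that
$$\mathrm{SS}_{\mathrm{Tot}} = \mathrm{SS}_G + \mathrm{SS}_E$$
always.
Consider $\mathrm{SS}_{\mathrm{Tot}}$:
$$
\begin{aligned}
\mathrm{SS}_{\mathrm{Tot}} &= \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y})^2 \\
&= \sum_{g=1}^{k} \sum_{i=1}^{n_g} \left[ (\bar{y}_g - \bar{y}) + (y_{gi} - \bar{y}_g) \right]^2 \\
&= \sum_{g=1}^{k} \sum_{i=1}^{n_g} \left[ (\bar{y}_g - \bar{y})^2 + (y_{gi} - \bar{y}_g)^2 + 2 (\bar{y}_g - \bar{y})(y_{gi} - \bar{y}_g) \right] \\
&= \sum_{g=1}^{k} \sum_{i=1}^{n_g} (\bar{y}_g - \bar{y})^2 + \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y}_g)^2 + 2 \sum_{g=1}^{k} \sum_{i=1}^{n_g} (\bar{y}_g - \bar{y})(y_{gi} - \bar{y}_g) \\
&= \mathrm{SS}_G + \mathrm{SS}_E + 2 \sum_{g=1}^{k} \sum_{i=1}^{n_g} (\bar{y}_g - \bar{y})(y_{gi} - \bar{y}_g)
\end{aligned}
$$
So all we need to do is show that the third term is actually identically zero. Since $(\bar{y}_g - \bar{y})$ does not depend on i, and $\sum_{i=1}^{n_g} y_{gi} = n_g \bar{y}_g$ by the definition of the group mean,
$$
\begin{aligned}
2 \sum_{g=1}^{k} \sum_{i=1}^{n_g} (\bar{y}_g - \bar{y})(y_{gi} - \bar{y}_g)
&= 2 \sum_{g=1}^{k} \left[ (\bar{y}_g - \bar{y}) \sum_{i=1}^{n_g} (y_{gi} - \bar{y}_g) \right] \\
&= 2 \sum_{g=1}^{k} \left[ (\bar{y}_g - \bar{y}) \left( \sum_{i=1}^{n_g} y_{gi} - n_g \bar{y}_g \right) \right] \\
&= 2 \sum_{g=1}^{k} \left[ (\bar{y}_g - \bar{y}) (n_g \bar{y}_g - n_g \bar{y}_g) \right] \\
&= 0
\end{aligned}
$$
Therefore
$$\mathrm{SS}_{\mathrm{Tot}} = \mathrm{SS}_G + \mathrm{SS}_E$$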
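The vanishing cross term can also be checked numerically. The Python sketch below (illustrative names; the three-group meerkat data from Table 2 are used as the example) evaluates the third term directly and confirms that it is zero up to floating-point rounding.

```python
# Meerkat weights (g) from Table 2.
groups = [
    [514, 519, 568, 571, 553, 531],   # Deception Valley
    [624, 542, 597, 597, 577, 678],   # Kuruman River
    [591, 641, 677, 653, 673, 595],   # Addo Elephant Park
]

pooled = [obs for sample in groups for obs in sample]
grand_mean = sum(pooled) / len(pooled)

cross_term = 0.0
for sample in groups:
    group_mean = sum(sample) / len(sample)
    # Deviations from a group's own mean always sum to zero,
    # which is why every group's contribution vanishes.
    within_sum = sum(obs - group_mean for obs in sample)
    cross_term += 2 * (group_mean - grand_mean) * within_sum

print(f"cross term = {cross_term:.2e}")
```

Any tiny non-zero value printed here is rounding error from floating-point arithmetic, not a failure of the identity.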
Lecture 1: One-Way Analysis of Variance (ANOVA) Questions

1) We measure the feeding rate (no. of items per 5-minute focal observation) of
oystercatchers at three sites (exposed, partial, and sheltered).

| Site | Feeding rate |
| --- | --- |
| exposed | 14.2, 16.5, 9.3, 15.1, 13.4 |
| partial | 18.4, 13.0, 17.4, 20.4, 16.5 |
| sheltered | 24.1, 22.2, 25.3, 25.1, 21.5 |

Is there any evidence that the feeding rate differs among locations?
Provide one or two sentences that you could use in a paper to summarise your
analysis.
2) Juvenile lobsters in aquaculture were grown on three different diets (fresh
mussels, semi-dry pellets and dry flakes). After nine weeks, their wet weight in
grams was:

| Diet | Wet weight (g) |
| --- | --- |
| mussels | 151.6, 132.1, 104.2, 153.5, 132.0, 119.0, 161.9 |
| pellets | 117.7, 110.8, 128.6, 110.1, 175.2 |
| flakes | 101.8, 102.9, 90.4, 132.8, 129.3, 129.4 |

Is there any evidence that the diet affects the growth rate of lobsters?
Provide one or two sentences that you could use in a paper to summarise your
analysis.
3) We recorded the biomass (g) of three species of bacteria (A, B, and C) grown
in flasks with a glucose broth. After a day, their mass was:

| Species | Biomass (g) |
| --- | --- |
| A | 59.7, 52.2, 55.4, 59.4, 52.7 |
| B | 50.0, 45.6, 50.1, 40.1, 49.3 |
| C | 48.5, 61.5, 55.2, 45.2, 51.5 |

Do the bacteria species differ in their ability to grow under the conditions of the
experiment?
Provide one or two sentences that you could use in a paper to summarise your
analysis.