University of Cambridge
M. Castle
About notation
First of all, we need to look at the notation required to understand a One-Way ANOVA.
Subscripts are used to define the origin of each data point. Each observation is denoted as $y_{gi}$,
where $g$ represents the group (sample) it comes from, and $i$ defines the individual response
within the group. So, $y_{23}$ is the third observation in the second group. If we consider the small
dataset of meerkat weights in Table 1, this would be the third observation from Kuruman
River, which is 597 g. A dataset has $k$ groups (meaning that $g = 1, \dots, k$), and each group has $n_g$
observations. In the meerkat example, we have two locations ($k = 2$), $n_1$ is the sample size for
Deception Valley ($n_1 = 6$) and $n_2$ is the sample size for Kuruman River ($n_2 = 6$). The overall
sample size is $N$ ($N = n_1 + n_2 = 12$).
Table 1: Meerkat weights (g) at two locations

Deception Valley ($y_{11}$ to $y_{16}$): 519, 568, 571, 553, 531
Kuruman River ($y_{21}$ to $y_{26}$): 542, 597, 597, 577, 678
Mathematical Biology IA
more components due to changes in the values of the independent variable(s) and a
component due to random error). In a One-Way ANOVA, we have only one independent
variable (e.g. location), which is a discrete factor (i.e. a categorical variable).
To understand how this is computed, we need to think about each observation as a deviation
from the overall mean (Fig. 1),

\[ y_{gi} = \mu + \alpha_g + \varepsilon_{gi} \]

where
$\mu$ = the overall mean
$\alpha_g$ = the group effect
$\varepsilon_{gi}$ = the random error component

Fig. 1: Diagram showing the partitioning of an individual weight

In the meerkat example, this implies that there is a mean weight for this species ($\mu$), which is
then affected by the location (with $\alpha_g > 0$ if meerkats are bigger than average at that location,
and $\alpha_g < 0$ if the site is not a good one and they are smaller). However, even within a site, not all
individuals will have the same weight, and the individual deviation from the site mean is given
by $\varepsilon_{gi}$ (where $\varepsilon_{gi}$ is normally distributed with a mean of 0). As the purpose of a One-Way ANOVA
is to determine whether the means of several groups differ significantly, the null hypothesis
can be phrased as $H_0$: $\mu_1 = \mu_2 = \dots = \mu_k$, where $\mu_g = \mu + \alpha_g$ is the mean of group $g$. $H_1$ states that $\mu_1, \dots, \mu_k$ are not all equal.
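As a rough illustration of this partitioning, the sketch below splits the third Kuruman River observation (597 g) into an overall mean, a group effect and a random error. It uses the rounded estimates quoted in this handout (overall mean 572.6, Kuruman River mean 602.5); the variable names are illustrative only.

```python
# Values quoted in the handout, rounded to 1 DP; names are illustrative.
y_23 = 597.0          # third observation from Kuruman River (g)
overall_mean = 572.6  # estimate of the overall mean
group_mean = 602.5    # sample mean for Kuruman River

# Estimated group effect: how far the group mean sits from the overall mean.
group_effect = group_mean - overall_mean
# Estimated random error: how far the individual sits from its group mean.
residual = y_23 - group_mean

# The three components add back up to the original observation.
reconstructed = overall_mean + group_effect + residual
print(f"group effect = {group_effect:+.1f} g, residual = {residual:+.1f} g")
print(f"reconstructed observation = {reconstructed:.1f} g")
```

With these rounded inputs the group effect is about +29.9 g and the residual about -5.5 g: this meerkat is lighter than its site average even though the site as a whole is above the overall mean.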
So, how can we estimate the variance components? If you think back to your first lecture, the
numerator of the formula for the variance is the sum of squared deviations (SS) from the mean.
SS are a good way to summarise the level of variability around a mean, and we can use a
similar approach here (Fig. 2). A further property of SS is that they are additive. So,
SSTot = SSG + SSE
where
SSTot = Total Sum of Squares
SSG = Group Sum of Squares (also known as the Treatment SS)
SSE = Error Sum of Squares
First, let us estimate the Total Sum of Squares. This is the total deviation of the dataset from
the overall mean (Fig. 2). Even though we don't know the true overall mean, we can estimate
it by pooling all the observations from all the groups and then taking their mean, $\bar{y}$. So,

\[ SS_{Tot} = \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y})^2 \]
In the meerkat dataset, the overall mean is 572.6. The sum of squared deviations from the
overall mean (SSTot) is 24342.9.
[All figures are rounded to 1 decimal place (DP). It is conventional to give summary statistics to
an accuracy of 1 DP more than the accuracy of the original data, which in the meerkat
example were to zero DP].
[Fig. 2: three panels, labelled SSTot, SSG and SSE, each plotting Weight (g) against Observations for Deception Valley and Kuruman River]
We now need to consider the Group Sum of Squares (SSG). What we want to know here is how
much variability in the dataset comes from the fact that the group means are different from
the overall mean. Again, we don't know exactly what $\mu_g$ is, but we can take the mean for each
group, $\bar{y}_g$ (which is our best estimate of $\mu_g$). Now we can estimate the amount of
deviation due to the group effect as

\[ SS_G = \sum_{g=1}^{k} \sum_{i=1}^{n_g} (\bar{y}_g - \bar{y})^2 = \sum_{g=1}^{k} n_g (\bar{y}_g - \bar{y})^2 \]

In the meerkat example, the mean for Deception Valley, $\bar{y}_1$, is 542.7, and the mean for
Kuruman River, $\bar{y}_2$, is 602.5. Since the sample size is 6 for both groups ($n_1 = n_2 = 6$), we have

\[ SS_G = 6\,(542.7 - 572.6)^2 + 6\,(602.5 - 572.6)^2 = 10740.1 \]

where the result is calculated from the unrounded means.

The Error Sum of Squares (SSE) is the deviation of each observation from the mean of its own group:

\[ SS_E = \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y}_g)^2 \]
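The group sum of squares can be sketched from the summary values alone. The snippet below assumes the rounded means quoted in this handout (542.7, 602.5 and 572.6), so its SSG comes out a little below the 10740.1 in the ANOVA table, which is based on unrounded means. It then uses the additive property of SS to recover SSE from the tabulated SSTot and SSG. The dictionary and variable names are illustrative.

```python
# Group means and sizes as quoted in the handout (means rounded to 1 DP).
group_means = {"Deception Valley": 542.7, "Kuruman River": 602.5}
group_sizes = {"Deception Valley": 6, "Kuruman River": 6}
overall_mean = 572.6

# SSG: squared deviation of each group mean from the overall mean,
# weighted by the number of observations in that group.
ss_group = sum(
    group_sizes[name] * (mean - overall_mean) ** 2
    for name, mean in group_means.items()
)
print(f"SSG from rounded means: {ss_group:.1f}")

# Additivity (SSTot = SSG + SSE), using the values from the ANOVA table.
ss_total_table = 24342.9
ss_group_table = 10740.1
ss_error = ss_total_table - ss_group_table
print(f"SSE by subtraction: {ss_error:.1f}")
```

The gap between the two SSG figures is purely a rounding effect, which is one reason to carry unrounded means through a calculation and round only the final result.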
Source   SS      df              MS                 F
Group    SSG     dfG = k - 1     MSG = SSG / dfG    MSG / MSE
Error    SSE     dfE = N - k     MSE = SSE / dfE
Total    SSTot   dfTot = N - 1
We want to concentrate on the group and error components, to be able to test the null
hypothesis stated above, i.e. to ask whether the variance component due to the group is large
relative to the component due to error. We can estimate the mean square deviation (MS, our
standardised estimator of variation) for each component by dividing each SS by its
appropriate df. Our confidence in $H_0$ depends on how much variation in the dataset can
be attributed to the group effect vs. the amount due to random error. So, to get a handle on
this, we can estimate the ratio between the group MS (MSG) and the error MS (MSE). The ratio
of two variances has a well-known behaviour, described by the F distribution. The F
distribution is somewhat different from the distributions you have encountered so far, as it is
defined by two degrees of freedom, one for the variance in the numerator and one for the
variance in the denominator.
So, if we write out the ANOVA table for meerkats:

Source   SS        df   MS        F
Group    10740.1    1   10740.1   7.90
Error    13602.8   10    1360.3
Total    24342.9   11
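The MS and F entries of the table follow mechanically from the sums of squares and the degrees of freedom. A minimal sketch is given below; the helper name `f_ratio` and its arguments are hypothetical and are not part of any library.

```python
def f_ratio(ss_group, ss_error, k, n_total):
    """Return (MSG, MSE, F) for a one-way ANOVA.

    k is the number of groups and n_total the overall sample size N.
    """
    df_group = k - 1          # dfG = k - 1
    df_error = n_total - k    # dfE = N - k
    ms_group = ss_group / df_group
    ms_error = ss_error / df_error
    return ms_group, ms_error, ms_group / ms_error


# Meerkat example: two locations, twelve animals, SS values from the table.
ms_g, ms_e, f_value = f_ratio(10740.1, 13602.8, k=2, n_total=12)
print(f"MSG = {ms_g:.1f}, MSE = {ms_e:.1f}, F = {f_value:.2f}")
```

Because there is only one degree of freedom for the group, MSG equals SSG here; with more groups the division by dfG changes the numerator of the F ratio.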
From your statistical tables, the critical value for $F_{1,10}$ at $\alpha$ = 0.05 is 4.96. Since 7.90, our
estimated F, is larger than the critical value, we conclude that the result is significant, and we
reject the null hypothesis that all populations are equal in mean weight. The exact p value is 0.018,
as we can see from the R output:
Analysis of Variance Table

Response: Weight
          Df  Sum Sq Mean Sq F value  Pr(>F)
Location   1 10740.1 10740.1  7.8955 0.01848 *
Residuals 10 13602.8  1360.3
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since we are comparing only two groups, we can now repeat the analysis with a t-test
(assuming equal variances), and confirm that the result does not change:
        Two Sample t-test

data:  Weight by Location
t = -2.8099, df = 10, p-value = 0.01848
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -107.27897  -12.38769
sample estimates:
mean in group Deception Valley    mean in group Kuruman River
                      542.6667                       602.5000
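The agreement between the two tests is no coincidence: with two groups and equal variances assumed, the square of the t statistic equals the F statistic. The sketch below checks this with the values printed in the two outputs, and then rebuilds t from the rounded group means and the error mean square, which is the pooled variance. Variable names are illustrative, and the rebuilt t differs slightly in the third decimal place because its inputs are rounded.

```python
import math

t_value = -2.8099   # from the t-test output
f_value = 7.8955    # from the ANOVA table

# With two groups, t squared should reproduce F (up to rounding).
print(f"t^2 = {t_value ** 2:.4f}, F = {f_value}")

# Rebuild t from summary statistics quoted in the handout (rounded).
mean_deception, mean_kuruman = 542.7, 602.5
n_deception, n_kuruman = 6, 6
ms_error = 1360.3   # error mean square = pooled variance estimate

standard_error = math.sqrt(ms_error * (1 / n_deception + 1 / n_kuruman))
t_from_summary = (mean_deception - mean_kuruman) / standard_error
print(f"t from summary statistics = {t_from_summary:.2f}")
```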
How do we report this result? In project write-ups (and later in papers that you might author),
you will be interested in the biology, not the statistics. So, describe your results, and use the
statistics to back them up:
Meerkat weights in Kuruman River were significantly different from those in Deception Valley
($F_{1,10}$ = 7.90, p < 0.05).
Important: Note that you should ALWAYS provide the test statistic (in this case F), the number of
degrees of freedom for parametric tests (for F, we have two values, 1 and 10) or the sample
sizes for certain non-parametric tests, and an indication of the p value (p < 0.05, or even better,
the exact p value to 3 DP, p = 0.018).
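If you report many tests, a small helper keeps the format consistent. The function below is a hypothetical sketch (the name `report_f` is not from any library) that prints the F statistic to 2 DP, both degrees of freedom, and the exact p value to 3 DP, as recommended above.

```python
def report_f(f_value, df_group, df_error, p_value):
    """Format an F test result as 'F<dfG>,<dfE> = <F>, p = <p>'."""
    return f"F{df_group},{df_error} = {f_value:.2f}, p = {p_value:.3f}"


# Meerkat example, using the values from the R output.
summary = report_f(7.8955, 1, 10, 0.01848)
print(f"Meerkat weights differed significantly between locations ({summary}).")
```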
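Meerkat weights (g) in three populations:

Group 1, Deception Valley ($y_{11}$ to $y_{16}$): 519, 568, 571, 553, 531
Group 2, Kuruman River ($y_{21}$ to $y_{26}$): 542, 597, 597, 577, 678
Group 3 ($y_{31}$ to $y_{36}$): 641, 677, 653, 673, 595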
We want to test whether there is a difference in average weight among the populations.
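So, $H_0$: $\mu_1 = \mu_2 = \mu_3$ (i.e. there is no difference). $H_1$ is that not all $\mu_1, \dots, \mu_k$ are equal.

\[ \bar{y} = \frac{\sum_{g=1}^{k} \sum_{i=1}^{n_g} y_{gi}}{\sum_{g=1}^{k} n_g} \]

\[ \bar{y}_1 = \frac{\sum_{i=1}^{n_1} y_{1i}}{n_1}, \qquad \bar{y}_2 = \frac{\sum_{i=1}^{n_2} y_{2i}}{n_2}, \qquad \bar{y}_3 = \frac{\sum_{i=1}^{n_3} y_{3i}}{n_3} \]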
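Sum of Squares:

\[ SS_G = \sum_{g=1}^{k} n_g (\bar{y}_g - \bar{y})^2 = 6\,(542.7 - 594.5)^2 + 6\,(602.5 - 594.5)^2 + 6\,(\bar{y}_3 - 594.5)^2 = 28032.3 \]

dfTot = N - 1 = (6 + 6 + 6) - 1 = 18 - 1 = 17
dfG = k - 1 = 3 - 1 = 2
dfE = dfTot - dfG = 17 - 2 = 15

Mean Squares:
MSG = SSG / dfG = 28032.3 / 2 = 14016.2
MSE = SSE / dfE = 20640.2 / 15 = 1376.0

F statistics:
F = MSG / MSE = 14016.2 / 1376.0 = 10.19

Source   SS        df   MS        F
Group    28032.3    2   14016.2   10.19
Error    20640.2   15    1376.0
Total    48672.5   17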
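\[
\begin{aligned}
SS_{Tot} &= \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y})^2 \\
&= \sum_{g=1}^{k} \sum_{i=1}^{n_g} \left[ (\bar{y}_g - \bar{y}) + (y_{gi} - \bar{y}_g) \right]^2 \\
&= \sum_{g=1}^{k} \sum_{i=1}^{n_g} \left[ (\bar{y}_g - \bar{y})^2 + (y_{gi} - \bar{y}_g)^2 + 2 (\bar{y}_g - \bar{y})(y_{gi} - \bar{y}_g) \right] \\
&= \sum_{g=1}^{k} \sum_{i=1}^{n_g} (\bar{y}_g - \bar{y})^2 + \sum_{g=1}^{k} \sum_{i=1}^{n_g} (y_{gi} - \bar{y}_g)^2 + 2 \sum_{g=1}^{k} \sum_{i=1}^{n_g} (\bar{y}_g - \bar{y})(y_{gi} - \bar{y}_g) \\
&= SS_G + SS_E + 2 \sum_{g=1}^{k} \sum_{i=1}^{n_g} (\bar{y}_g - \bar{y})(y_{gi} - \bar{y}_g)
\end{aligned}
\]

So all we need to do is show that the third term is actually identically zero:

\[
\begin{aligned}
2 \sum_{g=1}^{k} \sum_{i=1}^{n_g} (\bar{y}_g - \bar{y})(y_{gi} - \bar{y}_g)
&= 2 \sum_{g=1}^{k} \left[ (\bar{y}_g - \bar{y}) \sum_{i=1}^{n_g} (y_{gi} - \bar{y}_g) \right] \\
&= 2 \sum_{g=1}^{k} \left[ (\bar{y}_g - \bar{y}) \left( \sum_{i=1}^{n_g} y_{gi} - n_g \bar{y}_g \right) \right] \\
&= 2 \sum_{g=1}^{k} \left[ (\bar{y}_g - \bar{y}) (n_g \bar{y}_g - n_g \bar{y}_g) \right] \\
&= 0
\end{aligned}
\]

Therefore

\[ SS_{Tot} = SS_G + SS_E \]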
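The identity can also be checked numerically. The sketch below uses a small made-up dataset (hypothetical values, not taken from this handout) with unequal group sizes, computes the three sums of squares directly from their definitions, and confirms that the cross term vanishes.

```python
import math

# Hypothetical toy data: three groups of unequal size (not from the handout).
groups = [[4.0, 6.0, 5.0], [8.0, 9.0], [1.0, 2.0, 3.0, 2.0]]

all_values = [y for group in groups for y in group]
grand_mean = sum(all_values) / len(all_values)
group_means = [sum(group) / len(group) for group in groups]

# Total SS: every observation against the overall mean.
ss_total = sum((y - grand_mean) ** 2 for y in all_values)
# Group SS: each group mean against the overall mean, weighted by n_g.
ss_group = sum(
    len(group) * (mean - grand_mean) ** 2
    for group, mean in zip(groups, group_means)
)
# Error SS: every observation against its own group mean.
ss_error = sum(
    (y - mean) ** 2
    for group, mean in zip(groups, group_means)
    for y in group
)
# Cross term from the expansion; it should be zero up to floating-point error.
cross_term = 2 * sum(
    (mean - grand_mean) * (y - mean)
    for group, mean in zip(groups, group_means)
    for y in group
)

print(f"SSTot = {ss_total:.4f}, SSG + SSE = {ss_group + ss_error:.4f}")
print(f"cross term = {cross_term:.2e}")
assert math.isclose(ss_total, ss_group + ss_error)
```

The check works for any group sizes because, within each group, the deviations from the group mean always sum to zero.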
partial: 18.4, 13.0, 17.4, 20.4, 16.5
sheltered: 24.1, 22.2, 25.3, 25.1, 21.5
Is there any evidence that the feeding rate differs among locations?
Provide one or two sentences that you could use in a paper to summarise your
analysis.
2) Juvenile lobsters in aquaculture were grown on three different diets (fresh
mussels, semi-dry pellets and dry flakes). After nine weeks, their wet weight in
grams was:
mussels: 151.6, 132.1, 104.2, 153.5, 132.0, 119.0, 161.9
pellets: 117.7, 110.8, 128.6, 110.1, 175.2
flakes: 101.8, 102.9, 90.4, 132.8, 129.3, 129.4
Is there any evidence that the diet affects the growth rate of lobsters?
Provide one or two sentences that you could use in a paper to summarise your
analysis.
3) We recorded the biomass (g) of three species of bacteria (A, B, and C) grown
in flasks with a glucose broth. After a day, their mass was:
A: 59.7, 52.2, 55.4, 59.4, 52.7
B: 50.0, 45.6, 50.1, 40.1, 49.3
C: 48.5, 61.5, 55.2, 45.2, 51.5
Do the bacteria species differ in their ability to grow under the conditions of the
experiment?
Provide one or two sentences that you could use in a paper to summarise your
analysis.