
STAT 511 ANOVA and Regression

Slide 1: Comparing Several Means: ANOVA

Blue Lake snap beans were grown in 12 open-top chambers, which were subject to 4 treatments, 3 chambers each, with O3 and SO2 present or absent. The total yield was measured for each chamber.

              Sulfur Dioxide
Ozone       Absent    Present
Absent      1.52      1.49
            1.85      1.55
            1.39      1.21
Present     1.15      0.65
            1.30      0.76
            1.57      0.69

To compare the means of several, say $I$, groups (populations), one often uses an analysis of variance model, or ANOVA.

For the $I$ populations, we use $\mu_1, \mu_2, \dots, \mu_I$ and $\sigma_1, \sigma_2, \dots, \sigma_I$ to denote their respective means and standard deviations. Similarly, the sample mean, sample standard deviation, and sample size of the $i$th population are denoted by $\bar{x}_i$, $s_i$, and $J_i$.

Of most interest are the comparisons between the $\mu_i$'s.

Slide 2: Group Means and Grand Mean

For $J_i$'s large, by the CLT,
$$\bar{X}_i \sim N(\mu_i, \sigma_i^2 / J_i),$$
and the $s_i^2$ are reliable estimates of the $\sigma_i^2$. For $J_i$'s small, one assumes normality and $\sigma_1^2 = \cdots = \sigma_I^2 = \sigma^2$.

The individual sample means are
$$\bar{x}_i = \frac{1}{J_i} \sum_{j=1}^{J_i} x_{ij},$$
where $x_{ij}$ is the $j$th observation in the $i$th group. The grand mean is
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{I} \sum_{j=1}^{J_i} x_{ij},$$
where $n = \sum_{i=1}^{I} J_i$ is the total number of observations in the $I$ groups.

For the bean growth data,

trt   $J_i$   $\sum_j x_{ij}$   $\bar{x}_i$
1     3       4.76              1.5867
2     3       4.25              1.4167
3     3       4.02              1.3400
4     3       2.10              0.7000

The grand total of the $n = 12$ observations is $\sum_i \sum_j x_{ij} = 15.13$, so the grand mean is $\bar{x} = 15.13 / 12 = 1.2608$.

The $J_i$'s here are all equal, so $\bar{x}$ is the mean of the $\bar{x}_i$'s. This would not be the case for unequal $J_i$'s.
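The group means and grand mean above can be reproduced with a minimal Python sketch. The dictionary name `beans` and the treatment numbering (1 to 4, matched to the table by the treatment totals) are assumptions made for illustration.

```python
# Bean yields by treatment, transcribed from the Slide 1 table.
beans = {
    1: [1.52, 1.85, 1.39],  # O3 absent,  SO2 absent
    2: [1.49, 1.55, 1.21],  # O3 absent,  SO2 present
    3: [1.15, 1.30, 1.57],  # O3 present, SO2 absent
    4: [0.65, 0.76, 0.69],  # O3 present, SO2 present
}

# Group means: x-bar_i = (1 / J_i) * sum_j x_ij
group_means = {trt: sum(x) / len(x) for trt, x in beans.items()}

# Grand mean: total of all observations divided by n
n = sum(len(x) for x in beans.values())
grand_mean = sum(sum(x) for x in beans.values()) / n

print(group_means)
print(n, round(grand_mean, 4))
```

Because all $J_i$ are equal here, averaging the four group means gives the same grand mean.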

C. Gu Spring 2016

Slide 3: Variation Within Groups

Under the assumption
$$\sigma_1^2 = \cdots = \sigma_I^2 = \sigma^2,$$
one would like to estimate the common variance $\sigma^2$ using all available information. Such information is contained in the sum of squared errors,
$$\mathrm{SSE} = \sum_{i=1}^{I} \sum_{j=1}^{J_i} (x_{ij} - \bar{x}_i)^2 = \sum_{i=1}^{I} (J_i - 1) s_i^2.$$
The pooled variance estimate is given by
$$s_p^2 = \mathrm{MSE} = \frac{\mathrm{SSE}}{n - I},$$
where $n - I = \sum_{i=1}^{I} (J_i - 1)$.

For the bean growth data,

trt   $\sum_j (x_{ij} - \bar{x}_i)^2$   $s_i^2$
1     .112467                           .056233
2     .065867                           .032933
3     .090600                           .045300
4     .006200                           .003100

SSE is
$$\sum_i \sum_j (x_{ij} - \bar{x}_i)^2 = .275133,$$
and MSE is
$$s_p^2 = \frac{.275133}{12 - 4} = .034392.$$

For $J_i$'s all equal, $s_p^2 = \sum_i s_i^2 / I$. In general, $s_p^2$ is a weighted mean of the $s_i^2$ with weights $(J_i - 1)$.
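As a check on the within-group calculation, here is a small Python sketch that recomputes SSE and the pooled variance from the raw bean yields. The list layout and the helper `mean` are illustrative assumptions.

```python
# Bean yields, one inner list per treatment (from the Slide 1 table).
beans = [
    [1.52, 1.85, 1.39],
    [1.49, 1.55, 1.21],
    [1.15, 1.30, 1.57],
    [0.65, 0.76, 0.69],
]

def mean(xs):
    return sum(xs) / len(xs)

# Within-group sums of squares, one per treatment.
within_ss = [sum((x - mean(g)) ** 2 for x in g) for g in beans]
# Sample variances s_i^2 = within_ss_i / (J_i - 1).
s2 = [ss / (len(g) - 1) for ss, g in zip(within_ss, beans)]

n = sum(len(g) for g in beans)
I = len(beans)
sse = sum(within_ss)
mse = sse / (n - I)  # pooled variance estimate s_p^2

print([round(v, 6) for v in within_ss])
print(round(sse, 6), round(mse, 6))
```

With equal group sizes, `mse` coincides with the plain average of the four `s2` values.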

Slide 4: Variation Between Groups

To measure the variability between groups, one calculates the sum of squares for treatments,
$$\mathrm{SSTr} = \sum_{i=1}^{I} \sum_{j=1}^{J_i} (\bar{x}_i - \bar{x})^2 = \sum_{i=1}^{I} J_i (\bar{x}_i - \bar{x})^2.$$

It can be shown that
$$\sum_i \sum_j (x_{ij} - \bar{x})^2 = \sum_i \sum_j (x_{ij} - \bar{x}_i)^2 + \sum_i \sum_j (\bar{x}_i - \bar{x})^2,$$
where $\mathrm{SST} = \sum_i \sum_j (x_{ij} - \bar{x})^2$.

For $I = 2$, it can be shown that
$$\mathrm{SSTr} = \frac{(\bar{x}_1 - \bar{x}_2)^2}{\frac{1}{J_1} + \frac{1}{J_2}}.$$

For the bean growth data, SSTr is given by
$$\sum_i 3 (\bar{x}_i - \bar{x})^2 = 1.353758,$$
and SST is given by
$$\sum_i \sum_j (x_{ij} - \bar{x})^2 = 1.628892.$$
It is easy to verify that
$$\mathrm{SST} = \mathrm{SSTr} + \mathrm{SSE}.$$

If one ignores the grouping, then the sample variance of the $n$ observations is
$$s^2 = \frac{1}{n - 1} \mathrm{SST}.$$
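The decomposition SST = SSTr + SSE can be verified numerically. The following Python sketch does so for the bean data; variable names are illustrative assumptions.

```python
# Bean yields, one inner list per treatment (from the Slide 1 table).
beans = [
    [1.52, 1.85, 1.39],
    [1.49, 1.55, 1.21],
    [1.15, 1.30, 1.57],
    [0.65, 0.76, 0.69],
]

n = sum(len(g) for g in beans)
grand_mean = sum(sum(g) for g in beans) / n
group_means = [sum(g) / len(g) for g in beans]

# Between-group, within-group, and total sums of squares.
sstr = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(beans, group_means))
sse = sum((x - m) ** 2 for g, m in zip(beans, group_means) for x in g)
sst = sum((x - grand_mean) ** 2 for g in beans for x in g)

print(round(sstr, 6), round(sse, 6), round(sst, 6))
print(abs(sst - (sstr + sse)) < 1e-12)  # the identity holds up to rounding
```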


Slide 5: ANOVA Table, F-Test

Associated with SSE and SST are degrees of freedom $n - I$ and $n - 1$. Similarly, SSTr has df $I - 1$. Note that
$$n - 1 = (n - I) + (I - 1).$$
Dividing SS by the corresponding df, one gets a mean square (MS).

An ANOVA table summarizes all the information.

Src     SS     df        MS
Trt     SSTr   $I - 1$   $\mathrm{SSTr} / (I - 1)$
Error   SSE    $n - I$   $\mathrm{SSE} / (n - I)$
Total   SST    $n - 1$

MSE is an unbiased estimate of $\sigma^2$. For $\mu_i$'s all equal, MSTr is also an unbiased estimate of $\sigma^2$. When the $\mu_i$'s are not all equal, MSTr tends to be larger.

To test the hypotheses
$$H_0: \mu_1 = \cdots = \mu_I \quad \text{vs.} \quad H_a: \text{otherwise},$$
calculate
$$f = \frac{\mathrm{MSTr}}{\mathrm{MSE}},$$
and reject $H_0$ when $f > F_{\alpha, \nu_1, \nu_2}$, where $\nu_1 = I - 1$ and $\nu_2 = n - I$.

Slide 6: F-Distribution

Let $Y_i \sim N(0, 1)$, $i = 1, \dots, m$, and $Z_j \sim N(0, 1)$, $j = 1, \dots, n$, be independent. The distribution of
$$\frac{\sum_{i=1}^{m} Y_i^2 / m}{\sum_{j=1}^{n} Z_j^2 / n}$$
is called an F-distribution with numerator degrees of freedom $\nu_n = m$ and denominator degrees of freedom $\nu_d = n$.

[Figure: densities of F(3,8) and F(8,3), plotted over 0 to 6.]

For the bean growth data, the ANOVA table is given by

Src     SS       df   MS
Trt     1.3538   3    .4513
Error   0.2751   8    .0344
Total   1.6289   11

It is easy to calculate
$$f = \frac{.4513}{.0344} = 13.12,$$
which is larger than $F_{.05,3,8} = 4.07$, so we reject $H_0$ at the 5% significance level.

To obtain $F_{.05,3,8}$ in R, use qf(.95,3,8).
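The F-test for the bean data can be traced step by step in a short Python sketch. The critical value is copied from the slide rather than computed (the standard library has no F quantile function), and the variable names are assumptions.

```python
# ANOVA table entries for the bean data, copied from the slide.
ss_trt, df_trt = 1.3538, 3
ss_err, df_err = 0.2751, 8

ms_trt = ss_trt / df_trt  # MSTr
ms_err = ss_err / df_err  # MSE
f = ms_trt / ms_err

F_CRIT = 4.07  # F_{.05,3,8} as quoted on the slide (in R: qf(.95, 3, 8))
reject = f > F_CRIT

print(round(ms_trt, 4), round(ms_err, 4), round(f, 2), reject)
```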


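Slide 7: F- and t-tests, Computing Formulas

For $I = 2$, one has
$$f = \frac{\mathrm{MSTr}}{\mathrm{MSE}} = \frac{(\bar{x}_1 - \bar{x}_2)^2}{s_p^2 \left( \frac{1}{J_1} + \frac{1}{J_2} \right)}.$$
Reject $H_0$ when $f > F_{\alpha, 1, n-2}$.

Compare this with the t-test for $H_0: \mu_1 = \mu_2$ versus $H_a: \mu_1 \neq \mu_2$,
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{J_1} + \frac{1}{J_2}}},$$
with a rejection region $|t| > t_{\alpha/2, n-2}$. We notice that $f = t^2$.

Actually, one also has $F_{\alpha, 1, \nu} = t^2_{\alpha/2, \nu}$, so the F-test is equivalent to the t-test we learned earlier.

Since SST = SSTr + SSE, one only needs to calculate two of the three terms.
$$\mathrm{SST} = \sum_i \sum_j (x_{ij} - \bar{x})^2 = \sum_i \sum_j x_{ij}^2 - \frac{\left( \sum_i \sum_j x_{ij} \right)^2}{n},$$
$$\mathrm{SSTr} = \sum_i \sum_j (\bar{x}_i - \bar{x})^2 = \sum_i \frac{\left( \sum_j x_{ij} \right)^2}{J_i} - \frac{\left( \sum_i \sum_j x_{ij} \right)^2}{n},$$
$$\mathrm{SSE} = \sum_i \sum_j (x_{ij} - \bar{x}_i)^2 = \sum_i \sum_j x_{ij}^2 - \sum_i \frac{\left( \sum_j x_{ij} \right)^2}{J_i}.$$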

Slide 8: Computing ANOVA: Example

Consider the following data.

                    Sample
                    1     2     3
                    12    8     6
                    10    5     2
                          3     4
                          4
$J_i$               2     4     3
$\sum_j x_{ij}$     22    20    12
$\sum_j x_{ij}^2$   244   114   56
$\bar{x}_i$         11    5     4

$n = 9$, $\sum_i \sum_j x_{ij} = 54$, $\bar{x} = 6$.

Using the computing formulas,
$$\mathrm{SSE} = 244 + 114 + 56 - \left( \frac{22^2}{2} + \frac{20^2}{4} + \frac{12^2}{3} \right) = 24,$$
$$\mathrm{SSTr} = \left( \frac{22^2}{2} + \frac{20^2}{4} + \frac{12^2}{3} \right) - \frac{54^2}{9} = 66.$$

Since $f = \frac{66/2}{24/6} = 8.25$ and $F_{.05,2,6} = 5.14$, we reject
$$H_0: \mu_1 = \mu_2 = \mu_3$$
at the 5% significance level.
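The computing formulas can be applied mechanically. Below is a minimal Python sketch for the three samples in this example; names such as `group_term` and `correction` are illustrative assumptions.

```python
# The three samples from the example table.
samples = [[12, 10], [8, 5, 3, 4], [6, 2, 4]]

n = sum(len(s) for s in samples)
I = len(samples)
total = sum(sum(s) for s in samples)             # sum_i sum_j x_ij
sum_sq = sum(x * x for s in samples for x in s)  # sum_i sum_j x_ij^2

# sum_i (sum_j x_ij)^2 / J_i and the correction term (grand total)^2 / n
group_term = sum(sum(s) ** 2 / len(s) for s in samples)
correction = total ** 2 / n

sst = sum_sq - correction
sstr = group_term - correction
sse = sum_sq - group_term

f = (sstr / (I - 1)) / (sse / (n - I))
print(sse, sstr, sst, f)
```

Only two of the three sums of squares need to be computed directly; the third follows from SST = SSTr + SSE.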


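Slide 9: Parameter Estimation and Testing

The inferences concerning means are derived from the fact that
$$\bar{X}_i \sim N\left( \mu_i, \frac{\sigma^2}{J_i} \right).$$
A $(1 - \alpha)100\%$ CI for $\mu_i$ is
$$\bar{x}_i \pm t_{\alpha/2, \nu} \sqrt{\frac{s_p^2}{J_i}},$$
where $\nu = n - I$.

A $(1 - \alpha)100\%$ CI for $\mu_1 - \mu_2$ is
$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2, \nu} \sqrt{s_p^2 \left( \frac{1}{J_1} + \frac{1}{J_2} \right)}.$$
Tests for hypotheses concerning these parameters can be similarly constructed.

For the bean growth data, $\bar{x}_1 = 1.5867$, $\bar{x}_2 = 1.4167$, $s_p^2 = .0344 = .1855^2$, $J_1 = J_2 = 3$, $\nu = 8$.

A 95% CI for $\mu_1$ is
$$1.5867 \pm 2.306 \sqrt{\frac{.0344}{3}},$$
or (1.340, 1.834), where $t_{.025,8} = 2.306$.

A 95% CI for $\mu_1 - \mu_2$ is
$$.17 \pm 2.306 (.1855) \sqrt{\frac{2}{3}},$$
or (−.179, .519). Since this interval contains 0, one would not reject $H_0: \mu_1 = \mu_2$ at the 5% level.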
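The two intervals for the bean data can be recomputed with a small Python sketch. The critical value 2.306 is taken from the slide rather than computed, and the helper `interval` is a hypothetical name introduced for illustration.

```python
import math

xbar1, xbar2 = 4.76 / 3, 4.25 / 3  # treatment means 1.5867 and 1.4167
sp2 = 0.275133 / 8                 # pooled variance SSE / (n - I)
J1 = J2 = 3
T_CRIT = 2.306                     # t_{.025,8} as quoted on the slide

def interval(center, se, crit):
    """Return (lower, upper) for center +/- crit * se."""
    half = crit * se
    return center - half, center + half

ci_mu1 = interval(xbar1, math.sqrt(sp2 / J1), T_CRIT)
ci_diff = interval(xbar1 - xbar2, math.sqrt(sp2 * (1 / J1 + 1 / J2)), T_CRIT)

print([round(v, 3) for v in ci_mu1])
print([round(v, 3) for v in ci_diff])
```

The interval for the difference straddles zero, which matches the conclusion that the two means are not distinguished at the 5% level.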

Slide 10: Estimating and Testing Contrasts

A linear combination of means,
$$\theta = c_1 \mu_1 + \cdots + c_I \mu_I,$$
is to be estimated by
$$\hat{\theta} = c_1 \bar{x}_1 + \cdots + c_I \bar{x}_I,$$
with a standard error
$$s_{\hat{\theta}} = s_p \sqrt{\frac{c_1^2}{J_1} + \cdots + \frac{c_I^2}{J_I}}.$$
When $c_1, \dots, c_I$ add to zero, $\sum_i c_i = 0$, such a $\theta$ is called a contrast. For example, $\mu_1 - \mu_2$ is a contrast.

In applications, contrasts are often of the most interest.

For the bean growth data, a contrast of interest is
$$\theta = (\mu_1 - \mu_2) - (\mu_3 - \mu_4).$$
$\theta = 0$ implies no interaction between O3 and SO2.

The estimate is given by
$$\hat{\theta} = \bar{x}_1 - \bar{x}_2 - \bar{x}_3 + \bar{x}_4 = -.47,$$
with a standard error
$$s_{\hat{\theta}} = .1855 \sqrt{4/3} = .2142.$$

A 95% CI for $\theta$ is
$$-.47 \pm 2.306 (.2142),$$
or (−.964, .024). Since this interval contains 0, one would not reject $\theta = 0$ at the 5% level.
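The contrast estimate and its interval can be sketched in Python as follows. The coefficient vector `c` encodes the interaction contrast for the bean data; the critical value is the slide's, and all names are illustrative assumptions.

```python
import math

means = [4.76 / 3, 4.25 / 3, 4.02 / 3, 2.10 / 3]  # treatment means
J = [3, 3, 3, 3]                                  # group sizes
c = [1, -1, -1, 1]                                # (mu1 - mu2) - (mu3 - mu4)
sp = math.sqrt(0.275133 / 8)                      # pooled standard deviation
T_CRIT = 2.306                                    # t_{.025,8} from the slide

# Point estimate and standard error of the linear combination.
estimate = sum(ci * m for ci, m in zip(c, means))
se = sp * math.sqrt(sum(ci ** 2 / Ji for ci, Ji in zip(c, J)))

lower = estimate - T_CRIT * se
upper = estimate + T_CRIT * se
print(round(estimate, 2), round(se, 4), round(lower, 3), round(upper, 3))
```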


Slide 11: Relations Between Variables

Functional relations: $y = f(x)$ deterministic, such as (i) $A = \pi r^2$ for the area $A$ and radius $r$ of a circle; or (ii) $y = \frac{5}{9}(x - 32)$ for thermometer readings $x$ °F and $y$ °C.

Statistical relations: Variables tend to vary together, but there is no deterministic coupling. Among examples are (i) ages of married couples; and (ii) lengths and weights of snakes.

[Figures: area versus radius; weight (gm) versus length (cm).]

Slide 12: Simple Linear Regression

When studying the heights of father-son pairs, Galton found, in the late 19th century, that for fathers taller than average, the average height of their sons is between their height and the average. Ditto for fathers shorter than average.

[Figure: scatter plot.]

A simple linear regression is of the form
$$Y = \beta_0 + \beta_1 x + \epsilon,$$
where
$Y$: response or dependent variable
$x$: predictor or independent variable
$\epsilon$: noise or random error

$Y$ varies randomly given $x$. The distribution of $Y$ varies systematically with $x$ through the regression function
$$\mu_{Y \cdot x} = \beta_0 + \beta_1 x.$$

The model has a systematic part, $\beta_0 + \beta_1 x$, and a random part, $\epsilon$.

A causal structure is usually implied.


Slide 13: Model Assumptions in SLR

Data come in as pairs $(x_i, y_i)$, and the model is written as
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i.$$
It is usually assumed that $\epsilon_i \sim N(0, \sigma^2)$.

Consider
$$Y = 12 + 8x + \epsilon,$$
where $\epsilon \sim N(0, 9)$. Since
$$Y \mid x = 1 \sim N(20, 9),$$
one has
$$P(Y < 17 \mid x = 1) = P\left( Z < \frac{17 - 20}{3} \right) = .1587.$$

In practice, one observes pairs $(x_i, y_i)$, and estimates the model parameters $\beta_0$, $\beta_1$, and $\sigma^2$.

$\mu_{Y \cdot x} = \beta_0 + \beta_1 x$ is a strong assumption.

The normality assumption can sometimes be weakened to $\mu_{\epsilon_i} = 0$ and $\sigma^2_{\epsilon_i} = \sigma^2$.
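The probability in the worked example can be checked with Python's standard library, as in the sketch below. `NormalDist` takes a standard deviation, so the variance 9 enters as `sigma=3`; variable names are assumptions.

```python
from statistics import NormalDist

beta0, beta1 = 12, 8
sigma = 3  # standard deviation; the error variance is 9
x = 1

# Conditional distribution of Y given x: N(beta0 + beta1 * x, sigma^2)
y_given_x = NormalDist(mu=beta0 + beta1 * x, sigma=sigma)
p = y_given_x.cdf(17)  # P(Y < 17 | x = 1)

# Same probability via standardization, P(Z < (17 - 20) / 3)
p_z = NormalDist().cdf((17 - y_given_x.mean) / sigma)

print(round(p, 4), round(p_z, 4))
```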

Slide 14: Example: Length and Weight of Snakes

Nine adult females of the snake Vipera berus were caught and measured. The lengths and weights are listed below.

Length (cm)   Weight (gm)
60            136
69            198
66            194
64            140
54            93
67            172
59            116
65            174
63            145

[Figure: weight (gm) versus length (cm).]


Slide 15: Least Squares Estimates of $\beta_0$, $\beta_1$

Minimizing w.r.t. $\beta_0$, $\beta_1$
$$Q = \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_i))^2,$$
one obtains the least squares (LS) estimates of $(\beta_0, \beta_1)$,
$$b_1 = \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}},$$
$$b_0 = \hat{\beta}_0 = \bar{y} - b_1 \bar{x},$$
where
$$S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y}),$$
$$S_{xx} = \sum_i (x_i - \bar{x})^2.$$

For the lengths and weights of female snakes, the LS estimate of the regression function is $\hat{Y} = -301 + 7.19X$.

[Figure: weight (gm) versus length (cm), with the lines Y = −301 + 7.19X (Q = 1093.7) and Y = −227 + 6X (Q = 1347).]
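The least squares criterion can be explored with a short Python sketch on the snake data. It computes the LS estimates from $S_{xy}$ and $S_{xx}$ and compares $Q$ for the two lines in the figure; the function name `Q` and the list names are illustrative assumptions.

```python
# Snake data from Slide 14: lengths (cm) and weights (gm).
lengths = [60, 69, 66, 64, 54, 67, 59, 65, 63]
weights = [136, 198, 194, 140, 93, 172, 116, 174, 145]

n = len(lengths)
xbar = sum(lengths) / n
ybar = sum(weights) / n
sxx = sum((x - xbar) ** 2 for x in lengths)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(lengths, weights))

b1 = sxy / sxx           # slope estimate
b0 = ybar - b1 * xbar    # intercept estimate

def Q(intercept, slope):
    """Sum of squared deviations of the weights from a candidate line."""
    return sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(lengths, weights))

q_ls = Q(b0, b1)     # criterion at the LS estimates
q_alt = Q(-227, 6)   # criterion for the alternative line in the figure
print(round(b0, 1), round(b1, 2), round(q_ls, 1), q_alt)
```

Any line other than the LS line, such as Y = −227 + 6X, gives a larger value of $Q$.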

Slide 16: Fitted Values and Residuals

The mean response $\mu_{Y \cdot x}$ at $x$ is (unbiasedly) estimated by the fitted regression function
$$\hat{\mu}_{Y \cdot x} = \hat{Y} = b_0 + b_1 x.$$
At the data points, one has the fitted values (y-hat)
$$\hat{y}_i = b_0 + b_1 x_i,$$
and the residuals
$$e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i).$$
The fitted values and residuals satisfy
$$\sum_{i=1}^{n} \hat{y}_i = \sum_{i=1}^{n} y_i,$$
$$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} x_i e_i = 0.$$

The lengths and weights of female snakes.

$x$    $y$    $\hat{y}$   $e$
60     136    130.4       5.6
69     198    195.2       2.8
66     194    173.6       20.4
64     140    159.2       -19.2
54     93     87.3        5.7
67     172    180.8       -8.8
59     116    123.2       -7.2
65     174    166.4       7.6
63     145    152.0       -7.0
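The two residual identities can be confirmed numerically. The Python sketch below recomputes the fitted values and residuals for the snake data using unrounded estimates; list names are assumptions.

```python
# Snake data from Slide 14.
xs = [60, 69, 66, 64, 54, 67, 59, 65, 63]
ys = [136, 198, 194, 140, 93, 172, 116, 174, 145]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

fitted = [b0 + b1 * x for x in xs]            # y-hat_i
resid = [y - f for y, f in zip(ys, fitted)]   # e_i

for x, y, f, e in zip(xs, ys, fitted, resid):
    print(x, y, round(f, 1), round(e, 1))

# Both sums are zero up to floating-point rounding.
print(sum(resid), sum(x * e for x, e in zip(xs, resid)))
```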


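Slide 17: Estimation of $\sigma^2$

Consider a model
$$Y_i = \mu + \epsilon_i,$$
where $\mu_{\epsilon_i} = 0$ and $\sigma^2_{\epsilon_i} = \sigma^2$. The estimate
$$\hat{y}_i = \hat{\mu} = \bar{y}$$
actually minimizes
$$Q = \sum_{i=1}^{n} (y_i - \mu)^2.$$
An unbiased estimate of $\sigma^2$ is
$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 1} = \frac{\sum_{i=1}^{n} e_i^2}{n - 1},$$
where $\hat{y}_i$ contains one parameter.

To estimate $\sigma^2$ in simple linear regression, calculate the residual sum of squares
$$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2,$$
and use
$$s^2 = \frac{\mathrm{SSE}}{n - 2} = \frac{\sum_i (y_i - \hat{y}_i)^2}{n - 2}.$$
Unbiasedness: $\mu_{s^2} = \sigma^2$.

To calculate $s^2$, use
$$\mathrm{SSE} = S_{yy} - \frac{S_{xy}^2}{S_{xx}},$$
where
$$S_{yy} = \sum_i (y_i - \bar{y})^2.$$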

Slide 18: Details of Calculation

We use the lengths and weights of snakes to illustrate. Note that
$$S_{xy} = \sum x_i y_i - \frac{\sum x_i \sum y_i}{n}, \qquad S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}.$$

First summarize the data.
$$\sum x_i = 567, \quad \sum x_i^2 = 35893,$$
$$\sum y_i = 1368, \quad \sum y_i^2 = 217926,$$
$$\sum x_i y_i = 87421.$$

Then calculate
$$\bar{x} = \frac{567}{9} = 63, \quad \bar{y} = \frac{1368}{9} = 152,$$
$$S_{xx} = 35893 - \frac{567^2}{9} = 172,$$
$$S_{yy} = 217926 - \frac{1368^2}{9} = 9990,$$
$$S_{xy} = 87421 - \frac{567(1368)}{9} = 1237.$$

Now we have
$$b_1 = \frac{1237}{172} = 7.19,$$
$$b_0 = 152 - 7.19(63) = -301.$$

SSE is given by
$$9990 - \frac{1237^2}{172} = 1093.7,$$
so $\sigma^2$ is estimated by
$$s^2 = \frac{1093.7}{9 - 2} = 156.24.$$
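The shortcut calculation can be followed line by line in a Python sketch. It starts from the five raw sums and derives $S_{xx}$, $S_{yy}$, $S_{xy}$, the estimates, SSE, and $s^2$; variable names are illustrative assumptions.

```python
# Snake data from Slide 14.
xs = [60, 69, 66, 64, 54, 67, 59, 65, 63]
ys = [136, 198, 194, 140, 93, 172, 116, 174, 145]
n = len(xs)

# Step 1: summarize the data with five sums.
sum_x, sum_y = sum(xs), sum(ys)
sum_xx = sum(x * x for x in xs)
sum_yy = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# Step 2: corrected sums of squares and cross products.
sxx = sum_xx - sum_x ** 2 / n
syy = sum_yy - sum_y ** 2 / n
sxy = sum_xy - sum_x * sum_y / n

# Step 3: estimates, residual sum of squares, and variance estimate.
b1 = sxy / sxx
b0 = sum_y / n - b1 * sum_x / n
sse = syy - sxy ** 2 / sxx
s2 = sse / (n - 2)

print(sxx, syy, sxy)
print(round(b1, 2), round(b0), round(sse, 1), round(s2, 2))
```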


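Slide 19: Inferences Concerning $\beta_1$

Assume $\epsilon_i \sim N(0, \sigma^2)$. Then
$$b_1 \sim N(\beta_1, \sigma^2_{b_1}),$$
where $\sigma^2_{b_1} = \sigma^2 / S_{xx}$ is to be estimated by
$$s^2_{b_1} = \frac{s^2}{S_{xx}}.$$
The inferences are based on
$$\frac{b_1 - \beta_1}{s_{b_1}} \sim t_{n-2}.$$
For example, a $(1 - \alpha)100\%$ CI for $\beta_1$ is given by
$$b_1 \pm t_{\alpha/2, n-2} \, s_{b_1}.$$

Lengths and weights of snakes. We have $b_1 = 7.19$ and
$$s_{b_1} = \sqrt{\frac{156.24}{172}} = .953.$$
A 95% CI for $\beta_1$ is given by
$$7.19 \pm 2.365 (.953),$$
where $t_{.025,7} = 2.365$.

To test the hypotheses
$$H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \neq 0,$$
we calculate
$$t = \frac{7.19 - 0}{.953} = 7.545,$$
and reject $H_0$ even at the 1% level, as $|t| > 3.499 = t_{.005,7}$.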
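The interval and test for the slope can be reproduced with the following Python sketch, starting from the summary quantities $S_{xx}$, $S_{xy}$, $S_{yy}$ for the snake data. The t critical values are copied from the slide, not computed, and the variable names are assumptions.

```python
import math

# Summary quantities for the snake data.
sxx, sxy, syy, n = 172, 1237, 9990, 9

b1 = sxy / sxx
s2 = (syy - sxy ** 2 / sxx) / (n - 2)  # SSE / (n - 2)
se_b1 = math.sqrt(s2 / sxx)            # estimated standard error of b1

T_025, T_005 = 2.365, 3.499            # t_{.025,7} and t_{.005,7} from the slide

ci = (b1 - T_025 * se_b1, b1 + T_025 * se_b1)  # 95% CI for beta_1
t = (b1 - 0) / se_b1                           # test statistic for H0: beta_1 = 0
reject_1pct = abs(t) > T_005

print(round(se_b1, 3), [round(v, 2) for v in ci], round(t, 2), reject_1pct)
```

The printed statistic can differ from the slide's 7.545 in the third decimal because the slide divides the rounded values 7.19 and .953.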

Slide 20: Analysis of Variance

Decompose the deviation of $y_i$ from $\bar{y}$,
$$y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i),$$
where $(\hat{y}_i - \bar{y})$ is systematic and $(y_i - \hat{y}_i)$ is random. It can be shown that
$$\sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2,$$
that is, SST = SSR + SSE, with degrees of freedom $(n - 1) = 1 + (n - 2)$.

The ANOVA table summarizes related information.

Source   SS    df        MS                                 f
Model    SSR   1         $\mathrm{MSR} = \mathrm{SSR}/1$    $\mathrm{MSR}/\mathrm{MSE}$
Resid    SSE   $n - 2$   $s^2 = \mathrm{SSE}/(n - 2)$
Total    SST   $n - 1$

The lengths and weights of female snakes.

Source   SS       df   MS       F
Model    8896.3   1    8896.3   56.94
Resid    1093.7   7    156.24
Total    9990.0   8


Slide 21: F-Test for $\beta_1 = 0$

It can be shown that
$$\mu_{\mathrm{MSR}} = \sigma^2 + \beta_1^2 S_{xx},$$
$$\mu_{\mathrm{MSE}} = \sigma^2.$$
When $\beta_1 = 0$, one has
$$f = \frac{\mathrm{MSR}}{\mathrm{MSE}} \sim F_{1, n-2}.$$
These lead to the F-test for $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$, which rejects $H_0$ when $f > F_{\alpha, 1, n-2}$.

The F- and t-tests are equivalent:
$$\frac{\mathrm{MSR}}{\mathrm{MSE}} = f = t^2 = \left( \frac{b_1}{s_{b_1}} \right)^2,$$
$$F_{\alpha, 1, n-2} = t^2_{\alpha/2, n-2}.$$

The lengths and weights of female snakes. Since
$$f = \frac{8896.3}{156.24} = 56.94 > F_{.01,1,7} = 12.246,$$
we reject $H_0: \beta_1 = 0$ at the 1% level.

This is equivalent to the t-test on Slide 19. Note that
$$f = 56.94 = 7.55^2 = t^2,$$
$$F_{.01,1,7} = 12.25 = 3.5^2 = t^2_{.005,7}.$$
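The equivalence $f = t^2$ can be checked numerically. The Python sketch below uses the snake summary quantities and the identity SSR = SST − SSE = $S_{xy}^2 / S_{xx}$, which follows from the formulas above; the critical value is the slide's, and names are assumptions.

```python
import math

# Summary quantities for the snake data.
sxx, sxy, syy, n = 172, 1237, 9990, 9

ssr = sxy ** 2 / sxx   # model sum of squares (SST - SSE)
sse = syy - ssr        # residual sum of squares
msr = ssr / 1
mse = sse / (n - 2)
f = msr / mse

# t statistic for H0: beta_1 = 0, with s_b1 = sqrt(MSE / Sxx)
b1 = sxy / sxx
t = b1 / math.sqrt(mse / sxx)

F_CRIT = 12.246  # F_{.01,1,7} as quoted on the slide
print(round(ssr, 1), round(mse, 2), round(f, 2), round(t ** 2, 2), f > F_CRIT)
```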

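Slide 22: Inferences Concerning $\beta_0$

Assume $\epsilon_i \sim N(0, \sigma^2)$. Then
$$b_0 \sim N(\beta_0, \sigma^2_{b_0}),$$
where
$$\sigma^2_{b_0} = \sigma^2 \left\{ \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right\}$$
is to be estimated by
$$s^2_{b_0} = s^2 \left\{ \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right\}.$$
The inferences are based on
$$\frac{b_0 - \beta_0}{s_{b_0}} \sim t_{n-2}.$$
For $|\bar{x}|$ large, $\beta_0$ is hard to estimate, or to interpret.

For the lengths and weights of snakes, $\beta_0$ has no meaning.

Consider $Y = 15 + 5X + \epsilon$, where $\epsilon \sim N(0, 4)$. Given $x_i = 8(.1)10$, simulate $Y_i$ and estimate the regression function.

[Figure: scatter plot of the simulated data, with the x-axis running from 0 to 10.]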
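The simulation described above can be sketched in Python as follows. The seed is arbitrary and only makes the run repeatable; the error standard deviation is 2 because the variance is 4. All variable names are illustrative assumptions.

```python
import math
import random

random.seed(511)  # arbitrary seed, chosen only for repeatability

# Design points x_i = 8, 8.1, ..., 10 and simulated responses.
x = [8 + 0.1 * k for k in range(21)]
y = [15 + 5 * xi + random.gauss(0, 2) for xi in x]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = sxy / sxx
b0 = ybar - b1 * xbar
s2 = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

se_b0 = math.sqrt(s2 * (1 / n + xbar ** 2 / sxx))
se_b1 = math.sqrt(s2 / sxx)
print(round(b0, 2), round(se_b0, 2), round(b1, 2), round(se_b1, 2))
```

Since the $x_i$ lie far from 0 relative to their spread, the standard error of $b_0$ is roughly nine times that of $b_1$ in this design, whatever the simulated errors turn out to be.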


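Slide 23: Inferences Concerning $\mu_{Y \cdot x} = \beta_0 + \beta_1 x$

Assume $\epsilon_i \sim N(0, \sigma^2)$. Then
$$\hat{Y} \sim N(\beta_0 + \beta_1 x, \sigma^2_{\hat{Y}}),$$
where $\hat{Y} = b_0 + b_1 x$, and
$$\sigma^2_{\hat{Y}} = \sigma^2 \left\{ \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}} \right\}$$
is to be estimated by
$$s^2_{\hat{Y}} = s^2 \left\{ \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}} \right\}.$$
The inferences are based on
$$\frac{\hat{Y} - (\beta_0 + \beta_1 x)}{s_{\hat{Y}}} \sim t_{n-2}.$$
For $|x - \bar{x}|$ large, $\beta_0 + \beta_1 x$ is hard to estimate.

The lengths and weights of female snakes. We are to estimate the average weight of snakes of length 60 cm.
$$\hat{Y} = -301 + 7.19(60) = 130.4,$$
$$s^2_{\hat{Y}} = 156.24 \left\{ \frac{1}{9} + \frac{(60 - 63)^2}{172} \right\} = 25.535 = 5.053^2,$$
so a 95% CI for $\beta_0 + \beta_1 \cdot 60$ is
$$130.4 \pm 2.365 (5.053),$$
or (118.45, 142.35).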
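A minimal Python sketch of the interval for the mean response at length 60 cm is given below. It uses unrounded estimates, so the endpoints differ from the slide's in the second decimal; the critical value is the slide's $t_{.025,7}$, and variable names are assumptions.

```python
import math

# Summary quantities for the snake data.
n, xbar, ybar = 9, 63, 152
sxx, sxy, syy = 172, 1237, 9990

b1 = sxy / sxx
b0 = ybar - b1 * xbar
s2 = (syy - sxy ** 2 / sxx) / (n - 2)
T_CRIT = 2.365  # t_{.025,7} as quoted on the slides

x0 = 60
y_hat = b0 + b1 * x0
# Estimated variance of the fitted mean at x0.
s2_yhat = s2 * (1 / n + (x0 - xbar) ** 2 / sxx)
se_mean = math.sqrt(s2_yhat)

ci = (y_hat - T_CRIT * se_mean, y_hat + T_CRIT * se_mean)
print(round(y_hat, 1), round(s2_yhat, 3), [round(v, 2) for v in ci])
```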

Slide 24: Prediction of New Observation

To predict a new response at $x$,
$$Y = \beta_0 + \beta_1 x + \epsilon,$$
one has to allow for the variability of $\epsilon$.

With $\beta_0$, $\beta_1$, and $\sigma^2$ known, the prediction interval
$$(\beta_0 + \beta_1 x) \pm z_{\alpha/2} \, \sigma$$
covers $Y$ with probability $1 - \alpha$.

With $\beta_0 + \beta_1 x$ estimated by $\hat{Y} = b_0 + b_1 x$, we use
$$\hat{Y} \pm t_{\alpha/2, n-2} \sqrt{s^2 + s^2_{\hat{Y}}},$$
where the variances of $\hat{Y}$ and $\epsilon$ are estimated by $s^2_{\hat{Y}}$ and $s^2$.

The lengths and weights of female snakes. We are to predict the weight of a snake of length 60 cm.
$$\hat{Y} = 130.4, \quad s^2 = 156.24, \quad s^2_{\hat{Y}} = 25.535,$$
so a 95% PI for $Y$ at $X = 60$ is
$$130.4 \pm 2.365 \sqrt{156.24 + 25.535},$$
or (98.51, 162.29). This is wider than the CI for $\beta_0 + \beta_1 \cdot 60$.
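The prediction interval can be computed alongside the confidence interval for the mean to see how much wider it is. This Python sketch uses unrounded estimates and the slide's critical value; variable names are assumptions.

```python
import math

# Summary quantities for the snake data.
n, xbar, ybar = 9, 63, 152
sxx, sxy, syy = 172, 1237, 9990

b1 = sxy / sxx
b0 = ybar - b1 * xbar
s2 = (syy - sxy ** 2 / sxx) / (n - 2)
T_CRIT = 2.365  # t_{.025,7} as quoted on the slides

x0 = 60
y_hat = b0 + b1 * x0
s2_yhat = s2 * (1 / n + (x0 - xbar) ** 2 / sxx)

# CI for the mean response uses s2_yhat alone;
# the PI for a new observation adds the error variance s2.
half_ci = T_CRIT * math.sqrt(s2_yhat)
half_pi = T_CRIT * math.sqrt(s2 + s2_yhat)
pi = (y_hat - half_pi, y_hat + half_pi)

print([round(v, 2) for v in pi], round(half_pi / half_ci, 2))
```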


Slide 25: $R^2$, Correlation

The coefficient of determination, or $R^2$,
$$R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}},$$
measures the proportion of variation explained by the model.

The coefficient of correlation,
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}},$$
measures the linear association between $X$ and $Y$.

$0 \le R^2 \le 1$. $-1 \le r \le 1$. $R^2 = r^2$.

Lengths and weights of snakes.
$$R^2 = \frac{8896.3}{9990} = .891,$$
$$r = \frac{1237}{\sqrt{172(9990)}} = .944.$$

[Figure: scatter plots.]
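Both quantities, and the identity $R^2 = r^2$, can be checked from the snake summary values with a short Python sketch; variable names are assumptions.

```python
import math

# Summary quantities for the snake data.
sxx, sxy, syy = 172, 1237, 9990

sst = syy
ssr = sxy ** 2 / sxx  # SST - SSE
sse = sst - ssr

r_squared = ssr / sst
r = sxy / math.sqrt(sxx * syy)

print(round(r_squared, 3), round(r, 3))
print(abs(r_squared - (1 - sse / sst)) < 1e-12, abs(r_squared - r ** 2) < 1e-12)
```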
