
SIMPLE LINEAR REGRESSION Analysis of Variance

In simple linear regression a test of H0 : β1 = 0 is provided by the analysis of variance table

Source of     Degrees of   Sums of          Mean       F           p
Variation     Freedom      Squares          Squares
Regression    1            β̂1 SXY           RegMS      RegMS/EMS   p
Error         n − 2        SYY − β̂1 SXY     EMS
Total         n − 1        SYY

where the total variability is partitioned into two components, the regression sum of
squares and error sum of squares as follows:

\[
\begin{aligned}
\sum (y_i - \bar{y})^2
  &= \sum [(\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)]^2 \\
  &= \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2
     + 2\sum (\hat{y}_i - \bar{y})(y_i - \hat{y}_i) \\
  &= \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2
     + 2\sum \hat{y}_i (y_i - \hat{y}_i) - 2\bar{y}\sum (y_i - \hat{y}_i) \\
  &= \sum (\hat{\beta}_0 + \hat{\beta}_1 x_i - \bar{y})^2 + \sum \hat{\varepsilon}_i^2 \\
  &= \sum (\bar{y} - \hat{\beta}_1 \bar{x} + \hat{\beta}_1 x_i - \bar{y})^2 + \sum \hat{\varepsilon}_i^2 \\
  &= \sum \hat{\beta}_1^2 (x_i - \bar{x})^2 + \sum \hat{\varepsilon}_i^2 \\
  &= \hat{\beta}_1^2 \sum (x_i - \bar{x})^2 + \sum \hat{\varepsilon}_i^2 \\
  &= \left( \frac{S_{XY}}{S_{XX}} \right)^2 S_{XX} + \sum \hat{\varepsilon}_i^2 \\
  &= \hat{\beta}_1 S_{XY} + \sum \hat{\varepsilon}_i^2
\end{aligned}
\]

Here the two cross–product terms vanish because the least squares normal equations force the residuals, and the residuals weighted by the xi, to sum to zero, so that ∑(yi − ŷi) = 0 and ∑ ŷi (yi − ŷi) = 0.
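As an illustration of how this partition is used, the quantities in the analysis of variance table can be computed directly from SXX, SXY and SYY. The sketch below (Python with NumPy and SciPy; it is not part of the original notes and the function name is only illustrative) carries out the F test of H0 : β1 = 0.

import numpy as np
from scipy import stats

def simple_regression_anova(x, y):
    # ANOVA quantities for simple linear regression, following the
    # partition SYY = beta1_hat * SXY + sum of squared residuals.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)

    sxx = np.sum((x - x.mean()) ** 2)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    syy = np.sum((y - y.mean()) ** 2)

    beta1_hat = sxy / sxx                  # slope estimate
    ss_reg = beta1_hat * sxy               # regression sum of squares, 1 df
    ss_err = syy - ss_reg                  # error sum of squares, n - 2 df

    reg_ms = ss_reg / 1
    ems = ss_err / (n - 2)
    f_stat = reg_ms / ems                  # F statistic for H0: beta1 = 0
    p_value = stats.f.sf(f_stat, 1, n - 2)
    return ss_reg, ss_err, syy, f_stat, p_value

Applied to the car cost data in the example further on, this should reproduce the Regression, Residual Error and Total lines of the Minitab analysis of variance table.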


The calculation of the coefficients in a regression line is based on the assumption that the observations follow an assumed model. All of the information about how the regression line fails to explain the variability in the response variable is contained in the residuals, and the residual sum of squares can in some cases be used to determine if the model being used is correct. If the model is correct, the residual sum of squares is a measure of the random variation contained in the sample. If the model is not correct, the residual sum of squares will contain not only the random variation but also variability due to the lack of fit of the model.

If a prior estimate of the error variance, σ², is available from previous experiments, we can compare this prior estimate with the error mean square and conclude that there is a lack of fit if the error mean square is significantly greater than this prior estimate of σ².

If no prior estimate of σ² is available, a test of fit can still be made if repeat measurements of the response variable are available at one or more of the values of the predictor variable. These repeat measurements can be used to obtain an estimate of σ², which is referred to as an estimate of the pure error. This estimate is obtained by partitioning the error sum of squares into two components, one being the pure error term for the model and the other a term corresponding to a lack of fit of the model.

The partitioning of the error sum of squares as

\[
\sum_i \sum_j (y_{ij} - \hat{y}_{ij})^2 = \sum_i \sum_j \left[ (y_{ij} - \bar{y}_{i.}) + (\bar{y}_{i.} - \hat{y}_{ij}) \right]^2
\]

produces an analysis of variance table in which it is possible to test both H0 : β1 = 0 and whether there is any lack of fit in the model.
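Expanding the square on the right hand side (a standard expansion, included here for completeness rather than quoted from the notes) gives the two components and their degrees of freedom, where ni is the number of replicate responses at the ith distinct value of the predictor, m is the number of distinct predictor values, and ȳi. is the mean of the replicates at that value:

\[
\sum_i \sum_j (y_{ij} - \hat{y}_{ij})^2
  = \underbrace{\sum_i \sum_j (y_{ij} - \bar{y}_{i.})^2}_{\text{pure error}, \; n - m \text{ df}}
  + \underbrace{\sum_i n_i (\bar{y}_{i.} - \hat{y}_{i})^2}_{\text{lack of fit}, \; m - 2 \text{ df}}
\]

The cross–product term vanishes because ŷij is the same for every replicate in a group and the deviations yij − ȳi. sum to zero within each group. In the example that follows these degrees of freedom are 13 for pure error and 2 for lack of fit.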

SIMPLE LINEAR REGRESSION Testing Lack of Fit Example

Example

The repair and maintenance costs of a car (y) increase with the age of the car (x).
The following results are collected by a company with a large fleet of cars.

Age of car (years) Annual costs ($)

1 580
2 599
1 510
1 537
2 624
4 780
2 611
2 571
3 630
3 656
1 540
3 690
1 490
1 515
2 588
3 684
1 546

The values of the predictor variable which have more than one observation of the
response variable enable the calculation of an estimate of the pure error and the
partition of the error sum of squares into components for pure error and lack of fit.

In Minitab, requesting the Pure error Lack of Fit Test from the Options menu
gives the following output.


Regression Analysis: Cost versus Age

The regression equation is


Cost = 454 + 73.6 Age

Predictor      Coef  SE Coef      T      P
Constant     454.32    14.82  30.66  0.000
Age          73.563    6.874  10.70  0.000

S = 26.5699 R-Sq = 88.4% R-Sq(adj) = 87.6%

Analysis of Variance

Source            DF     SS     MS       F      P
Regression         1  80854  80854  114.53  0.000
Residual Error    15  10589    706
  Lack of Fit      2   1503    752    1.08  0.370
  Pure Error      13   9086    699
Total             16  91444

1 rows with no replicates

From this analysis of variance table we can test the null hypothesis

H0 : E(Y ) = β0 + β1 x,

i.e. that the linear model fits the observed data. In this case we do not reject the hypothesis that the linear model is appropriate (p = 0.370).
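The pure error and lack of fit lines of this table can be checked directly from the definitions given earlier. The sketch below (Python with NumPy and SciPy; it is not part of the original Minitab analysis and the variable names are only illustrative) groups the costs by the distinct ages and partitions the residual sum of squares.

import numpy as np
from scipy import stats

age = np.array([1, 2, 1, 1, 2, 4, 2, 2, 3, 3, 1, 3, 1, 1, 2, 3, 1], dtype=float)
cost = np.array([580, 599, 510, 537, 624, 780, 611, 571, 630, 656,
                 540, 690, 490, 515, 588, 684, 546], dtype=float)
n = len(cost)

# Residuals from the least squares line
sxx = np.sum((age - age.mean()) ** 2)
sxy = np.sum((age - age.mean()) * (cost - cost.mean()))
b1 = sxy / sxx
b0 = cost.mean() - b1 * age.mean()
resid = cost - (b0 + b1 * age)
ss_err = np.sum(resid ** 2)              # residual error SS on n - 2 df

# Pure error: variation of the replicate responses about their group means
levels = np.unique(age)
ss_pure = sum(np.sum((cost[age == a] - cost[age == a].mean()) ** 2) for a in levels)
df_pure = n - len(levels)                # 17 - 4 = 13

# Lack of fit: the remainder of the residual error
ss_lof = ss_err - ss_pure
df_lof = len(levels) - 2                 # 4 - 2 = 2

f_lof = (ss_lof / df_lof) / (ss_pure / df_pure)
p_lof = stats.f.sf(f_lof, df_lof, df_pure)
# These should agree with the Minitab output: pure error 9086 on 13 df,
# lack of fit 1503 on 2 df, F = 1.08, p = 0.370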

SIMPLE LINEAR REGRESSION Matrix Representation

If the simple linear regression model is represented in matrix notation as

Y = Xβ + ε with ε ∼ N(0, σ²I)

minimizing
ε′ ε = (Y − Xβ)′ (Y − Xβ)

gives

β̂ = (X ′ X)−1 X ′ Y

and

Ŷ = X β̂
= X(X ′ X)−1 X ′ Y

where the term X(X ′ X)−1 X ′ is denoted by H and is called the hat matrix.

The residuals are given by

ε̂ = Y − Ŷ
= Y − X(X ′ X)−1 X ′ Y
= (I − X(X ′ X)−1 X ′ )Y
= (I − H)Y

and E(ε̂) = 0, V(ε̂) = σ²(I − H) and V(ε̂i) = σ²(1 − hii).
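These matrix formulas translate directly into code. The sketch below (Python with NumPy, not part of the original notes) builds the design matrix from a column of ones and the predictor, and returns the estimates together with the hat matrix, the residuals and their variances.

import numpy as np

def hat_matrix_fit(x, y):
    # Least squares via beta_hat = (X'X)^{-1} X'Y, H = X (X'X)^{-1} X',
    # residuals = (I - H) Y and V(resid_i) = sigma^2 (1 - h_ii).
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)

    X = np.column_stack([np.ones(n), x])       # design matrix [1, x]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y               # (beta0_hat, beta1_hat)
    H = X @ XtX_inv @ X.T                      # hat matrix
    resid = (np.eye(n) - H) @ y                # residuals

    p = X.shape[1]                             # number of parameters
    sigma2_hat = np.sum(resid ** 2) / (n - p)  # estimate of sigma^2
    resid_var = sigma2_hat * (1 - np.diag(H))  # estimated V(resid_i)
    return beta_hat, H, resid, resid_var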

SIMPLE LINEAR REGRESSION Outliers – Predictor Variable

The leverage hii at a point xi is a measure of the distance between the values of the
predictor for the ith case and the mean of all of the predictor values.

It is calculated as

\[
h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{XX}}
\]

and is judged to be large if

\[
h_{ii} > \frac{2p}{n}
\]
where p is the number of parameters in the model and n is the number of observations.

The maximum value hii can take is 1 and it is said to be large if hii > 0.5 and
moderate if 0.2 ≤ hii ≤ 0.5.
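For simple linear regression the leverages can be computed directly from this formula and compared with the 2p/n cut–off. A minimal sketch (Python with NumPy; the function name is only illustrative):

import numpy as np

def high_leverage_points(x, p=2):
    # Leverages h_ii = 1/n + (x_i - xbar)^2 / SXX for simple linear
    # regression (p = 2 parameters), flagged when h_ii > 2p/n.
    x = np.asarray(x, dtype=float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    h = 1.0 / n + (x - x.mean()) ** 2 / sxx
    return h, np.where(h > 2 * p / n)[0]       # leverages and flagged cases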

SIMPLE LINEAR REGRESSION Outliers – Response Variable

The variance of the residual at a given point depends on the leverage and is given by

V(ε̂i) = σ²(1 − hii)

and as a result the variance will be small whenever the leverage is large. Improved
diagnostics can be obtained by scaling the residuals. The scaled residuals are called
Studentized residuals and can be obtained by using two different estimates of
overall variance.

Internally Studentized residuals are defined by

\[
r_i = \frac{\hat{\varepsilon}_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}
\]

where the estimate of the overall variance σ² is obtained using all of the sample values.

Externally Studentized residuals or deleted residuals are defined by

\[
t_i = \frac{\hat{\varepsilon}_i}{\hat{\sigma}_{(i)}\sqrt{1 - h_{ii}}}
\]

where the estimate of the overall variance σ̂²(i) is calculated with the ith case not being included.

Values of the response variable are identified as outliers if their Studentized residuals
are large in absolute value.

An internally Studentized residual is considered large if it is greater than 2 in absolute value.

An externally Studentized residual is considered large if it is greater in absolute value than the α/(2n) critical point of a t–distribution with n − p − 1 degrees of freedom, where n is the number of observations and p is the number of parameters in the model.
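Both kinds of Studentized residual can be obtained from the residuals and leverages. The sketch below (Python with NumPy and SciPy, not part of the original notes) uses the standard leave–one–out identity for σ̂²(i), which avoids refitting the model for each case, and flags cases by the two rules above.

import numpy as np
from scipy import stats

def studentized_residuals(resid, h, p=2, alpha=0.05):
    # Internally (r_i) and externally (t_i) Studentized residuals.
    resid = np.asarray(resid, dtype=float)
    h = np.asarray(h, dtype=float)
    n = len(resid)

    sigma2 = np.sum(resid ** 2) / (n - p)               # uses all cases
    r = resid / np.sqrt(sigma2 * (1 - h))               # internal

    # Leave-one-out variance estimate without refitting (standard identity)
    sigma2_i = ((n - p) * sigma2 - resid ** 2 / (1 - h)) / (n - p - 1)
    t = resid / np.sqrt(sigma2_i * (1 - h))             # external (deleted)

    internal_flags = np.where(np.abs(r) > 2)[0]
    crit = stats.t.ppf(1 - alpha / (2 * n), n - p - 1)  # alpha/(2n) critical point
    external_flags = np.where(np.abs(t) > crit)[0]
    return r, t, internal_flags, external_flags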

SIMPLE LINEAR REGRESSION Outliers – Mean Shift Outlier Test

An explicit formulation for identifying outliers is the mean shift outlier model. If the ith point is considered to be an outlier, it is assumed that the model has the form

yj = β0 + β1 xj + εj    for j ≠ i

and

yi = β0 + β1 xi + δ + εi.

The outlier has an expected value different from the value predicted by the model by
an amount δ and a test of the ith point being an outlier is a test of δ = 0.

To test for δ = 0, define a new predictor variable U such that uj = 0 for j ≠ i and
ui = 1 and fit the regression of Y on X and U as

Y = β0 + β1 X + δU + ε.

The coefficient of U is an estimate of the mean shift δ and its corresponding t-statistic
provides a test of δ = 0 against the two–sided alternative.
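The test is an ordinary least squares fit with the extra indicator column U, so the t statistic for δ comes from the usual standard error formula. A sketch (Python with NumPy and SciPy; the function name and interface are only illustrative):

import numpy as np
from scipy import stats

def mean_shift_outlier_test(x, y, i):
    # Fit Y = beta0 + beta1*X + delta*U + eps, where u_j = 1 only for case i,
    # and test H0: delta = 0 against the two-sided alternative.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)

    u = np.zeros(n)
    u[i] = 1.0
    X = np.column_stack([np.ones(n), x, u])
    XtX_inv = np.linalg.inv(X.T @ X)
    coef = XtX_inv @ X.T @ y                   # (beta0_hat, beta1_hat, delta_hat)

    resid = y - X @ coef
    sigma2 = np.sum(resid ** 2) / (n - X.shape[1])
    se_delta = np.sqrt(sigma2 * XtX_inv[2, 2])

    t_stat = coef[2] / se_delta
    p_val = 2 * stats.t.sf(abs(t_stat), n - X.shape[1])
    return coef[2], t_stat, p_val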

SIMPLE LINEAR REGRESSION Influential Observations

An observation is considered influential if its exclusion results in large changes to the estimates of the regression parameters.

Cook’s Distance is defined by

\[
D_i = \frac{1}{p}\left(\frac{h_{ii}}{1 - h_{ii}}\right) r_i^2
\]

and is a measure which compares the values of the regression coefficients obtained using all of the sample values with the regression coefficients obtained when the ith case is not included.

Points with a Cook’s distance greater than 1 are considered to have substantial
influence.

The DFFITS measure is defined by

\[
\mathrm{DFFITS}_i = t_i \sqrt{\frac{h_{ii}}{1 - h_{ii}}}
\]

and is a measure which compares the fitted value for the ith case when all of the sample values are used to fit the regression with the fitted value for the ith case when the ith case is not included in fitting the regression.

A point is considered influential if its DFFITS value is greater in absolute value than 2√(p/n).
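Both influence measures are simple functions of the leverages and the Studentized residuals defined earlier. A sketch (Python with NumPy; illustrative names) applying the two cut–offs:

import numpy as np

def influence_measures(r, t, h, p=2):
    # Cook's distance D_i = (1/p)(h_ii/(1-h_ii)) r_i^2 and
    # DFFITS_i = t_i * sqrt(h_ii/(1-h_ii)), with the usual cut-offs.
    r = np.asarray(r, dtype=float)
    t = np.asarray(t, dtype=float)
    h = np.asarray(h, dtype=float)
    n = len(r)

    cooks_d = (1.0 / p) * (h / (1 - h)) * r ** 2
    dffits = t * np.sqrt(h / (1 - h))

    influential_d = np.where(cooks_d > 1)[0]                          # D_i > 1
    influential_f = np.where(np.abs(dffits) > 2 * np.sqrt(p / n))[0]  # |DFFITS_i| > 2 sqrt(p/n)
    return cooks_d, dffits, influential_d, influential_f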

SIMPLE LINEAR REGRESSION Residual Correlation

In time series, the value at time (t + 1) is often predicted by the value at time t, that
is, the value at (t + 1) is correlated with the value at time t and the residuals are also
correlated. A residual plot shows positive and negative runs of the residual values.
Tests for correlation amongst the residuals can be performed using the Durbin–
Watson statistic which is given by
\[
d = \frac{\sum_{i=2}^{n} (\hat{\varepsilon}_i - \hat{\varepsilon}_{i-1})^2}{\sum_{i=1}^{n} \hat{\varepsilon}_i^2} .
\]

The range of the statistic, d, is such that 0 ≤ d ≤ 4, with the residuals being uncorrelated if d ≃ 2, positively correlated if d < 2 (with very strong positive correlation if d = 0), and negatively correlated if d > 2 (with very strong negative correlation if d = 4).

To test
H0 : error terms are not autocorrelated
against
H1 : error terms are positively autocorrelated

if d < dL,α we reject H0
if d > dU,α we do not reject H0 and
if dL,α ≤ d ≤ dU,α the test is inconclusive

where dL,α and dU,α are appropriate critical points for the Durbin–Watson statistic.

To test
H0 : error terms are not autocorrelated
against
H1 : error terms are negatively autocorrelated

if 4 − d < dL,α we reject H0
if 4 − d > dU,α we do not reject H0 and
if dL,α ≤ 4 − d ≤ dU,α the test is inconclusive.

Corresponding results exist for the two–sided alternative with appropriate α/2 critical
points.
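The statistic itself is straightforward to compute from the residuals; the comparison with dL,α and dU,α still requires tables of the Durbin–Watson critical values. A minimal sketch (Python with NumPy):

import numpy as np

def durbin_watson(resid):
    # d = sum_{i=2..n} (e_i - e_{i-1})^2 / sum_{i=1..n} e_i^2.
    # Values near 2 suggest uncorrelated errors, values near 0 strong
    # positive autocorrelation and values near 4 strong negative autocorrelation.
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)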
