Source        DF       SS
Regression    1        \(\hat{\beta}_1 S_{XY}\)
Error         n − 2    \(\sum \hat{\varepsilon}_i^2\)
Total         n − 1    \(S_{YY}\)
where the total variability is partitioned into two components, the regression sum of
squares and error sum of squares as follows:
\[
\begin{aligned}
\sum (y_i - \bar{y})^2 &= \sum [(\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)]^2 \\
&= \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2 + 2 \sum (\hat{y}_i - \bar{y})(y_i - \hat{y}_i) \\
&= \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2 + 2 \sum \hat{y}_i (y_i - \hat{y}_i) - 2 \bar{y} \sum (y_i - \hat{y}_i) \\
&= \sum (\hat{\beta}_0 + \hat{\beta}_1 x_i - \bar{y})^2 + \sum \hat{\varepsilon}_i^2 \\
&= \sum (\bar{y} - \hat{\beta}_1 \bar{x} + \hat{\beta}_1 x_i - \bar{y})^2 + \sum \hat{\varepsilon}_i^2 \\
&= \sum \hat{\beta}_1^2 (x_i - \bar{x})^2 + \sum \hat{\varepsilon}_i^2 \\
&= \hat{\beta}_1^2 \sum (x_i - \bar{x})^2 + \sum \hat{\varepsilon}_i^2 \\
&= \left( \frac{S_{XY}}{S_{XX}} \right)^2 S_{XX} + \sum \hat{\varepsilon}_i^2 \\
&= \hat{\beta}_1 S_{XY} + \sum \hat{\varepsilon}_i^2
\end{aligned}
\]

The cross terms vanish because the least squares normal equations give \(\sum \hat{\varepsilon}_i = \sum (y_i - \hat{y}_i) = 0\) and \(\sum x_i \hat{\varepsilon}_i = 0\), so that \(\sum \hat{y}_i (y_i - \hat{y}_i) = \hat{\beta}_0 \sum \hat{\varepsilon}_i + \hat{\beta}_1 \sum x_i \hat{\varepsilon}_i = 0\).
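As a quick numerical check of this identity, here is a minimal numpy sketch on made-up data (the variable names are mine); the partition holds exactly for any least squares fit:

```python
import numpy as np

# Toy data; the identity SYY = beta1*SXY + SSE holds for any x, y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

sxx = np.sum((x - x.mean())**2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = sxy / sxx                      # least squares slope
b0 = y.mean() - b1 * x.mean()       # least squares intercept
resid = y - (b0 + b1 * x)

syy = np.sum((y - y.mean())**2)     # total sum of squares
print(np.isclose(syy, b1 * sxy + np.sum(resid**2)))  # True
```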
SIMPLE LINEAR REGRESSION Testing Lack of Fit Example
Example: The repair and maintenance costs of a car (y) increase with the age of the car (x). The following results were collected by a company with a large fleet of cars.
Age x   Cost y
1       580
2       599
1       510
1       537
2       624
4       780
2       611
2       571
3       630
3       656
1       540
3       690
1       490
1       515
2       588
3       684
1       546
Values of the predictor variable with more than one observation of the response variable allow an estimate of the pure error to be calculated, so the error sum of squares can be partitioned into components for pure error and lack of fit. In Minitab, requesting the pure error lack of fit test from the Options menu gives the following output.
Analysis of Variance

Source            DF     SS     MS       F      P
Regression         1  80854  80854  114.53  0.000
Residual Error    15  10589    706
  Lack of Fit      2   1503    752    1.08  0.370
  Pure Error      13   9086    699
Total             16  91444
From this analysis of variance table we can test the null hypothesis

\[
H_0 : E(Y) = \beta_0 + \beta_1 x,
\]

i.e. that the linear model fits the observed data. The lack of fit F statistic is \(752/699 \approx 1.08\) on \((2, 13)\) degrees of freedom, giving \(p = 0.37\), so we do not reject \(H_0\) and conclude that the linear model is adequate.
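As a check on this output, here is a minimal numpy sketch that reproduces the sums of squares and the lack of fit F statistic from the data above (the variable names are mine; scipy is used only for the p-value):

```python
import numpy as np
from scipy.stats import f as f_dist

# Age (x) and repair cost (y) for the 17 cars in the example.
x = np.array([1, 2, 1, 1, 2, 4, 2, 2, 3, 3, 1, 3, 1, 1, 2, 3, 1], dtype=float)
y = np.array([580, 599, 510, 537, 624, 780, 611, 571, 630, 656,
              540, 690, 490, 515, 588, 684, 546], dtype=float)
n = len(x)

sxx = np.sum((x - x.mean())**2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = sxy / sxx
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

ss_reg = b1 * sxy                    # ~80854, 1 df
ss_err = np.sum(resid**2)            # ~10589, n - 2 = 15 df

# Pure error: variation of y about the mean of its own x group.
levels = np.unique(x)
ss_pe = sum(np.sum((y[x == v] - y[x == v].mean())**2) for v in levels)
df_pe = n - len(levels)              # 17 - 4 = 13
ss_lof = ss_err - ss_pe              # ~1503, lack of fit
df_lof = len(levels) - 2             # 4 - 2 = 2

f_stat = (ss_lof / df_lof) / (ss_pe / df_pe)   # ~1.08
p_val = f_dist.sf(f_stat, df_lof, df_pe)       # ~0.37
print(ss_reg, ss_err, ss_lof, ss_pe, f_stat, p_val)
```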
SIMPLE LINEAR REGRESSION Matrix Representation
\[
Y = X\beta + \varepsilon \quad \text{with} \quad \varepsilon \sim N(0, \sigma^2 I).
\]

Minimizing

\[
\varepsilon'\varepsilon = (Y - X\beta)'(Y - X\beta)
\]

gives

\[
\hat{\beta} = (X'X)^{-1}X'Y
\]

and

\[
\hat{Y} = X\hat{\beta} = X(X'X)^{-1}X'Y,
\]

where the term \(X(X'X)^{-1}X'\) is denoted by \(H\) and is called the hat matrix. The residuals are then

\[
\hat{\varepsilon} = Y - \hat{Y} = Y - X(X'X)^{-1}X'Y = (I - X(X'X)^{-1}X')Y = (I - H)Y.
\]
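A minimal numpy sketch of these formulas on made-up data (variable names are mine) may make the matrix algebra concrete:

```python
import numpy as np

# Design matrix with a column of ones for the intercept.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
X = np.column_stack([np.ones_like(x), x])

# beta-hat = (X'X)^{-1} X'Y, computed via solve for numerical stability.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
y_hat = H @ y                              # fitted values: H "puts the hat on" Y
resid = (np.eye(len(y)) - H) @ y           # residuals: (I - H) Y

print(beta_hat)
print(np.allclose(y_hat, X @ beta_hat), np.allclose(resid, y - y_hat))  # True True
```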
SIMPLE LINEAR REGRESSION Outliers – Predictor Variable
The leverage \(h_{ii}\) at a point \(x_i\) is a measure of the distance between the value of the predictor for the \(i\)th case and the mean of all of the predictor values. It is calculated as

\[
h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{XX}}
\]

and is judged to be large if

\[
h_{ii} > \frac{2p}{n},
\]

where \(p\) is the number of parameters in the model and \(n\) is the number of observations. The maximum value \(h_{ii}\) can take is 1, and it is said to be large if \(h_{ii} > 0.5\) and moderate if \(0.2 \le h_{ii} \le 0.5\).
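A short sketch of the leverage rule on made-up data with one isolated predictor value (the thresholds are those given above; the data are mine):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])   # the last x is far from the rest
n, p = len(x), 2                            # p = 2 parameters: intercept and slope

sxx = np.sum((x - x.mean())**2)
h = 1.0 / n + (x - x.mean())**2 / sxx       # leverage of each case
print(h.round(2))                           # [0.38 0.28 0.22 0.2  0.92]
print(h > 2 * p / n)                        # only the isolated point exceeds 2p/n
```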
SIMPLE LINEAR REGRESSION Outliers – Response Variable
The variance of the residual at a given point depends on the leverage and is given by

\[
V(\hat{\varepsilon}_i) = \sigma^2 (1 - h_{ii}),
\]

so the residual variance is small whenever the leverage is large. Improved diagnostics can be obtained by scaling the residuals. The scaled residuals are called Studentized residuals and can be obtained using two different estimates of the overall variance:

\[
r_i = \frac{\hat{\varepsilon}_i}{\hat{\sigma}\sqrt{1 - h_{ii}}},
\]

where the estimate of the overall variance \(\sigma^2\) is obtained using all of the sample values, and

\[
t_i = \frac{\hat{\varepsilon}_i}{\hat{\sigma}_{(i)}\sqrt{1 - h_{ii}}},
\]

where the estimate of the overall variance \(\sigma^2_{(i)}\) is calculated with the \(i\)th case not included.
Values of the response variable are identified as outliers if their Studentized residuals
are large in absolute value.
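A sketch of both forms on made-up data with one aberrant response; to avoid refitting \(n\) times, the leave-one-out variance is computed with the standard identity \((n - p - 1)\hat{\sigma}^2_{(i)} = (n - p)\hat{\sigma}^2 - \hat{\varepsilon}_i^2/(1 - h_{ii})\), which is an assumption of this sketch rather than something stated in the notes:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9, 18.0])   # the last response is aberrant
n, p = len(x), 2

sxx = np.sum((x - x.mean())**2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
h = 1.0 / n + (x - x.mean())**2 / sxx

s2 = np.sum(resid**2) / (n - p)                             # sigma^2 from all cases
r = resid / np.sqrt(s2 * (1 - h))                           # internally Studentized
s2_loo = ((n - p) * s2 - resid**2 / (1 - h)) / (n - p - 1)  # case i left out
t = resid / np.sqrt(s2_loo * (1 - h))                       # externally Studentized
print(r.round(2))   # the outlier is only moderately extreme in r ...
print(t.round(2))   # ... but stands out clearly in t
```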
SIMPLE LINEAR REGRESSION Outliers – Mean Shift Outlier Test
An explicit formulation for identifying outliers is the mean shift outlier model. If the \(i\)th point is considered to be an outlier, it is assumed that the model has the form

\[
y_j = \beta_0 + \beta_1 x_j + \varepsilon_j \quad \text{for } j \ne i
\]

and

\[
y_i = \beta_0 + \beta_1 x_i + \delta + \varepsilon_i.
\]

The outlier has an expected value that differs from the value predicted by the model by an amount \(\delta\), and a test of the \(i\)th point being an outlier is a test of \(\delta = 0\).

To test for \(\delta = 0\), define a new predictor variable \(U\) such that \(u_j = 0\) for \(j \ne i\) and \(u_i = 1\), and fit the regression of \(Y\) on \(X\) and \(U\) as

\[
Y = \beta_0 + \beta_1 X + \delta U + \varepsilon.
\]

The coefficient of \(U\) is an estimate of the mean shift \(\delta\), and its corresponding t-statistic provides a test of \(\delta = 0\) against the two-sided alternative.
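A sketch of this test, using statsmodels for the fit (the data, the choice of suspect case, and the choice of library are mine):

```python
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9, 18.0])
i = 5                                 # index of the suspected outlier

u = np.zeros_like(x)
u[i] = 1.0                            # mean shift dummy: 1 for case i, 0 otherwise
X = sm.add_constant(np.column_stack([x, u]))

fit = sm.OLS(y, X).fit()
# Coefficient of U estimates delta; its t-statistic tests delta = 0.
print(fit.params[2], fit.tvalues[2], fit.pvalues[2])
```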
SIMPLE LINEAR REGRESSION Influential Observations
Cook's distance is defined as

\[
D_i = \frac{1}{p} \left( \frac{h_{ii}}{1 - h_{ii}} \right) r_i^2
\]

and is a measure which compares the values of the regression coefficients obtained using all of the sample values with the regression coefficients obtained with the \(i\)th sample value not included. Points with a Cook's distance greater than 1 are considered to have substantial influence.

The related DFFITS statistic is defined as

\[
\mathrm{DFFITS}_i = t_i \sqrt{\frac{h_{ii}}{1 - h_{ii}}}
\]

and is a measure which compares the fitted value for the \(i\)th case when all of the sample values are used to fit the regression with the fitted value for the \(i\)th case when the \(i\)th case is not included in fitting the regression. A point is considered influential if \(|\mathrm{DFFITS}_i|\) is greater than \(2\sqrt{p/n}\).
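A sketch computing both measures from the leverage and the Studentized residuals (made-up data; the leave-one-out variance identity from the previous sketch is reused):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 30.0])   # last case: high leverage, off the line
n, p = len(x), 2

sxx = np.sum((x - x.mean())**2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
h = 1.0 / n + (x - x.mean())**2 / sxx

s2 = np.sum(resid**2) / (n - p)
r = resid / np.sqrt(s2 * (1 - h))                           # internal
s2_loo = ((n - p) * s2 - resid**2 / (1 - h)) / (n - p - 1)
t = resid / np.sqrt(s2_loo * (1 - h))                       # external

cooks_d = (h / (1 - h)) * r**2 / p
dffits = t * np.sqrt(h / (1 - h))
print(cooks_d > 1)                          # Cook's distance rule
print(np.abs(dffits) > 2 * np.sqrt(p / n))  # DFFITS rule
```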
SIMPLE LINEAR REGRESSION Residual Correlation
In time series, the value at time t + 1 is often predicted by the value at time t; that is, the value at time t + 1 is correlated with the value at time t, and the residuals are then also correlated. A residual plot shows runs of positive and negative residual values.
Tests for correlation amongst the residuals can be performed using the Durbin–Watson statistic, which is given by

\[
d = \frac{\sum_{i=2}^{n} (\hat{\varepsilon}_i - \hat{\varepsilon}_{i-1})^2}{\sum_{i=1}^{n} \hat{\varepsilon}_i^2}.
\]
The statistic satisfies \(0 \le d \le 4\). The residuals are uncorrelated if \(d \simeq 2\); positively correlated if \(d < 2\), with \(d = 0\) indicating very strong positive correlation; and negatively correlated if \(d > 2\), with \(d = 4\) indicating very strong negative correlation.
To test

\(H_0\) : error terms are not autocorrelated

against

\(H_1\) : error terms are positively autocorrelated,

reject \(H_0\) if \(d < d_{L,\alpha}\), do not reject \(H_0\) if \(d > d_{U,\alpha}\), and regard the test as inconclusive otherwise, where \(d_{L,\alpha}\) and \(d_{U,\alpha}\) are appropriate critical points for the Durbin–Watson statistic.
To test

\(H_0\) : error terms are not autocorrelated

against

\(H_1\) : error terms are negatively autocorrelated,

apply the same rule to \(4 - d\): reject \(H_0\) if \(4 - d < d_{L,\alpha}\), and do not reject \(H_0\) if \(4 - d > d_{U,\alpha}\). Corresponding results exist for the two-sided alternative with appropriate \(\alpha/2\) critical points.
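A sketch of the statistic itself on invented residuals with an obvious positive run:

```python
import numpy as np

# Residuals showing a run of positives followed by a run of negatives.
e = np.array([0.5, 0.8, 1.1, 0.9, 0.4, -0.3, -0.7, -1.0, -0.6, -0.2])

d = np.sum(np.diff(e)**2) / np.sum(e**2)   # Durbin-Watson statistic
print(d)   # ~0.3, well below 2: evidence of positive autocorrelation
```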