STAT 525 FALL 2011

Chapter 3
Diagnostics and Remedial Measures
Professor Dabao Zhang
Diagnostics
Procedures to determine appropriateness of the model and
check assumptions used in the standard inference
If there are violations, inference and model may not be reasonable,
thereby resulting in faulty conclusions
Always check before any inference!!!!!!!!
Procedures involve both graphical methods and formal statistical tests
3-1
Diagnostics for X
Scatterplot of Y vs X common diagnostic
Fit smooth curve I=SM## (e.g., I=SM70 in slide 1-5)
Is linear trend reasonable?
Any unusual/influential (X, Y) observations?
Can also look at distribution of X alone
Skewed distribution
Unusual or outlying values?
Recall the model does not assume X is Normal
Does X have pattern over time (order collected)?
If Y depends on X, looking at Y alone may be deceiving (i.e.,
mixture of normal dists)
3-2
PROC UNIVARIATE in SAS
Provides numerous graphical and numerical summaries
Mean, median
Variance, std dev, range, IQR
Skewness, kurtosis
Tests for normality
Histograms
Box plots
QQ plots
Stem-and-leaf plots
3-3
Example: Grade Point Average
options nocenter; /* output layout: left-aligned, not centered */
goptions colors=(none); /* graphics display: black/white */
data a1;
  infile 'U:\.www\datasets525\CH01PR19.txt'; /* path must be quoted */
  input grade_point test_score;
run;
/* Line printer plots: stem-and-leaf, horizontal bar chart,
   box plot, normal probability plot */
/* Graphics display: histogram, probplot, qqplot */
proc univariate data=a1 plot;
  var test_score;
  qqplot test_score / normal (L=1 mu=est sigma=est);
  histogram test_score / kernel(L=2) normal;
run; quit;
3-4
The UNIVARIATE Procedure
Variable: test_score

                          Moments
N                         120    Sum Weights              120
Mean                   24.725    Sum Observations        2967
Std Deviation      4.47206549    Variance          19.9993697
Skewness           -0.1363553    Kurtosis          -0.5596968
Uncorrected SS          75739    Corrected SS        2379.925
Coeff Variation    18.0872214    Std Error Mean    0.40824186

              Basic Statistical Measures
    Location                    Variability
Mean      24.72500     Std Deviation            4.47207
Median    25.00000     Variance                19.99937
Mode      24.00000     Range                   21.00000
                       Interquartile Range      7.00000
...
3-5
[Figure: QQ plot (upper) and histogram (lower) of test_score]
3-6
Diagnostics for Residuals
If model is appropriate, residuals should reflect assumptions
on error terms
$\varepsilon_i \sim \text{i.i.d. } N(0, \sigma^2)$
Recall properties of residuals
$\sum e_i = 0$, so the mean is zero
$\sum (e_i - \bar{e})^2 = SSE$, so the variance estimate is $MSE = SSE/(n-2)$
$e_i$'s not independent (derived from same fitted regression line)
When sample size large, the dependency can basically be ignored
3-7
Questions addressed by diagnostics
Is the relationship linear?
Does the variance depend on X?
Are there outliers?
Are error terms not independent?
Are the errors normal?
Can other predictors be helpful?
3-8
Residual Plots
Plot e vs X can assess most questions
Get same info from plot of e vs $\hat{Y}$ because X and $\hat{Y}$ linearly
related
Other plots include e vs time/order, a histogram or QQplot
of e, and e vs other predictor variables
See pages 102-113 for examples
Plots are usually enough for identifying gross violations of
assumptions (since inferences are quite robust)
3-9
Example: Toluca Company
data a1;
  infile 'U:\.www\datasets525\CH01TA01.txt'; /* path must be quoted */
  input lotsize workhrs;
  seq = _n_; /* observation order, used as a time index */
run;
proc reg data=a1;
  model workhrs=lotsize;
  output out=a2 r=resid; /* save residuals in data set a2 */
run;
proc gplot data=a2;
  plot resid*lotsize;
  plot resid*seq;
run;
/* Line type: L=1 for solid line; L=2 for dashed line */
proc univariate data=a2 plot normal;
  var resid;
  histogram resid / normal kernel(L=2);
  qqplot resid / normal (L=1 mu=est sigma=est);
run;
3-10
3-11
[Figure: QQ plot (upper) and histogram (lower) of the residuals]
3-12
Tests for Normality
Test based on the correlation between the residuals and their
expected values under normality proposed on page 115
Requires table of critical values
SAS provides four normality tests
proc univariate normal;
var resid;
Shapiro-Wilk most commonly used
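The correlation test mentioned above can be sketched in SAS. This is a minimal sketch, assuming the residual data set a2 and the variable resid from the Toluca example; the names a2n and nscore are made up for illustration. PROC RANK with NORMAL=BLOM computes normal scores from the ranks, which differ from the expected values under normality only by the scale factor sqrt(MSE), so the correlation is the same.

```sas
/* Normal scores for the residuals (Blom's formula) */
proc rank data=a2 normal=blom out=a2n;   /* a2n: hypothetical output name */
  var resid;
  ranks nscore;                          /* nscore: hypothetical variable name */
run;
/* Correlation between the residuals and their normal scores */
proc corr data=a2n;
  var resid nscore;
run;
```

A correlation close to 1 is consistent with normality; the value still has to be compared with the tabled critical value.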
3-13
Example: Plasma Level (p. 132)
The UNIVARIATE Procedure
Variable: resid (Residual)

                    Tests for Normality
Test                  --Statistic---      -----p Value------
Shapiro-Wilk          W       0.839026    Pr < W       0.0011
Kolmogorov-Smirnov    D       0.167483    Pr > D       0.0703
Cramer-von Mises      W-Sq    0.137723    Pr > W-Sq    0.0335
Anderson-Darling      A-Sq     0.95431    Pr > A-Sq    0.0145
3-14
Other Formal Tests
Durbin-Watson test for correlated errors (assuming AR(1)
for errors as in Chapter 12)
Modified Levene / Brown-Forsythe test for constant variance
(Chapter 18)
Breusch-Pagan test for constant variance
Plots vs Tests
Plots are more likely to suggest a remedy. Also, test results are very
dependent on n. With a large enough sample size, we can reject most
null hypotheses even if the deviation is slight
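Two of these tests can be requested in SAS along the following lines. This is a sketch under stated assumptions, not the course's own code: it reuses data sets a1 and a2 from the Toluca example, the data set name a2g is made up, and the cutoff of 70 for splitting lot sizes into two groups is only an illustrative choice. The Durbin-Watson statistic is meaningful only if the observations are in time order.

```sas
/* Durbin-Watson statistic; DWPROB adds p-values */
proc reg data=a1;
  model workhrs=lotsize / dw dwprob;
run; quit;

/* Brown-Forsythe: compare residual spread between two lot-size groups */
data a2g;                      /* a2g: hypothetical data set name */
  set a2;
  group = (lotsize > 70);      /* assumed cutoff, for illustration only */
run;
proc glm data=a2g;
  class group;
  model resid=group;
  means group / hovtest=bf;    /* Brown-Forsythe homogeneity-of-variance test */
run; quit;
```

With only two groups, the Brown-Forsythe F statistic is the square of the two-sample t statistic on the absolute deviations from the group medians.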
3-15
Lack of Fit Test
More formal approach to fitting a smooth curve through the
observations
Requires repeat observations of Y at one or more levels of X
Assumes $Y \mid X \overset{ind}{\sim} N(\mu(X), \sigma^2)$
$H_0: \mu(X) = \beta_0 + \beta_1 X$
$H_a: \mu(X) \neq \beta_0 + \beta_1 X$
Will use full/reduced model framework
3-16
Notation
Define X levels as $X_1, X_2, \ldots, X_c$
There are $n_j$ replicates at level $X_j$ ($\sum_j n_j = n$)
$Y_{ij}$ is the $i$th replicate at $X_j$
Full Model: $Y_{ij} = \mu_j + \varepsilon_{ij}$
No assumption on association: $E(Y_{ij}) = \mu_j$
There are c parameters
$\hat{\mu}_j = \bar{Y}_{\cdot j}$ and $s^2 = \sum_j \sum_i (Y_{ij} - \hat{\mu}_j)^2 / (n - c)$
Reduced Model: $Y_{ij} = \beta_0 + \beta_1 X_j + \varepsilon_{ij}$
Linear association
There are 2 parameters
$s^2 = \sum_j \sum_i (Y_{ij} - \hat{Y}_j)^2 / (n - 2)$
3-17
$SSE(F) = \sum_j \sum_i (Y_{ij} - \hat{\mu}_j)^2$
$SSE(R) = \sum_j \sum_i (Y_{ij} - \hat{Y}_j)^2$
$F^* = \dfrac{(SSE(R) - SSE(F)) / ((n - 2) - (n - c))}{SSE(F) / (n - c)}$
Is variation about the regression line substantially bigger than
variation at specific level of X?
Approximate test can be done by grouping similar X values
together
3-18
Example: Plasma Level (p. 132)
/* Analysis of Variance - Reduced Model */
proc reg;
model lplasma=age;
run;
                          Sum of       Mean
Source             DF    Squares     Square    F Value    Pr > F
Model               1    0.52308    0.52308     134.03    <.0001
Error              23    0.08976    0.00390
Corrected Total    24    0.61284
------------------------------------------------
/* Analysis of Variance - Full Model */
proc glm;
class age;
model lplasma=age;
run;
                          Sum of       Mean
Source             DF    Squares     Square    F Value    Pr > F
Model               4    0.53854    0.13463      36.24    <.0001
Error              20    0.07430    0.00372
Corrected Total    24    0.61284
------------------------------------------------
$F^* = \dfrac{(.08976 - .07430)/(23 - 20)}{.00372} = 1.387$
P-value = 0.2757
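The same test can be obtained more directly. The sketch below assumes the plasma data set a1 with variables age and lplasma (read in on slide 3-24); the variable names inside the DATA step are made up for illustration. The LACKFIT option of PROC REG splits the error sum of squares into lack-of-fit and pure-error parts, and the DATA step repeats the hand computation from the two ANOVA tables above.

```sas
/* Lack-of-fit test directly from PROC REG */
proc reg data=a1;
  model lplasma=age / lackfit;
run; quit;

/* Hand computation of F* and its p-value from the two ANOVA tables */
data _null_;
  sse_r = 0.08976;  df_r = 23;   /* reduced (linear) model */
  sse_f = 0.07430;  df_f = 20;   /* full (cell means) model */
  fstar  = ((sse_r - sse_f)/(df_r - df_f)) / (sse_f/df_f);
  pvalue = 1 - probf(fstar, df_r - df_f, df_f);
  put fstar= pvalue=;            /* written to the SAS log */
run;
```

The values written to the log should agree with the slide up to rounding of the sums of squares.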
3-19
Remedies
Nonlinear relationship
Transform X or add additional predictors
Nonlinear regression
Nonconstant variance
Transform Y
Weighted least squares
Nonnormal errors
Transform Y
Generalized Linear model
Nonindependence
Allow correlated errors
Work with first differences
3-20
Nonlinear Relationships
Can model many nonlinear relationships with linear models,
some with several explanatory variables
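$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i$
$Y_i = \beta_0 + \beta_1 \log(X_i) + \varepsilon_i$
Can sometimes transform nonlinear model into a linear model
$Y_i = \beta_0 \exp(\beta_1 X_i)\, \varepsilon_i$
$\Rightarrow \log(Y_i) = \log(\beta_0) + \beta_1 X_i + \log(\varepsilon_i)$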
Have altered our assumptions about error
Can perform nonlinear regression (PROC NLIN)
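Both ideas above can be tried with ordinary least squares. The sketch below is illustrative only: it assumes the plasma data set a1 with variables age, plasma and lplasma (slide 3-24), and the names a1q and age2 are made up. The first model adds a squared term; the second is the linearized form of the exponential model.

```sas
/* Add a squared predictor for the polynomial model */
data a1q;                    /* a1q: hypothetical data set name */
  set a1;
  age2 = age*age;            /* age2: hypothetical variable name */
run;
proc reg data=a1q;
  quad:   model plasma  = age age2;   /* Y = b0 + b1*X + b2*X^2 + error */
  loglin: model lplasma = age;        /* log(Y) = log(b0) + b1*X + error */
run; quit;
```

Residual plots for each fit show whether the curvature has been removed.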
3-21
Nonconstant Variance
Will discuss weighted analysis in Chapter 11
Nonconstant variance often associated with a skewed error
term distribution
A transformation of Y often remedies both violations
Will focus on Box-Cox transformations
$Y' = Y^{\lambda}$
3-22
Box-Cox Transformation
Special cases:
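$\lambda = 1$ no transformation
$\lambda = .5$ square root
$\lambda = 0$ natural log (by definition)
Can estimate $\lambda$ using ML
$f_i = \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\dfrac{1}{2\sigma^2} \left( Y_i^{\lambda} - \beta_0 - \beta_1 X_i \right)^2 \right\}$
$\hat{\lambda}_{ML}$ minimizes SSE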
Can also do a numerical search
PROC TRANSREG will do this in SAS
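SSE values are comparable across different $\lambda$ only after the transformed response is rescaled. One common standardization, given here as a supplement and not taken from the slide, is:

```latex
W_i =
\begin{cases}
  K_1 \left( Y_i^{\lambda} - 1 \right), & \lambda \neq 0 \\
  K_2 \ln Y_i, & \lambda = 0
\end{cases}
\qquad
K_2 = \left( \prod_{i=1}^{n} Y_i \right)^{1/n},
\qquad
K_1 = \frac{1}{\lambda K_2^{\lambda - 1}}
```

Regressing $W_i$ on $X_i$ for each candidate $\lambda$ and choosing the value with the smallest SSE gives the ML estimate.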
3-23
Example: Plasma Level (p. 132)
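data a1;
  infile 'd:\nobackup\tmp\CH03TA08.txt'; /* path must be quoted */
  input age plasma lplasma;
run;
/* symbol1: smoothed curve (red); symbol2: regression line (black) */
symbol1 v=circle i=sm50 c=red; symbol2 v=circle i=rl c=black;
proc gplot;
  plot plasma*age=1 plasma*age=2/overlay;
run;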
3-24
proc transreg data=a1;
model boxcox(plasma)=identity(age);
run;
The TRANSREG Procedure

Lambda      R-Square    Log Like
 -1.50          0.83     -8.1127
 -1.25          0.85     -6.3056
 -1.00          0.86     -4.8523 *
 -0.75          0.86     -3.8891 *
 -0.50          0.87     -3.5523 <
 -0.25          0.86     -3.9399 *
  0.00 +        0.85     -5.0754 *
  0.25          0.84     -6.8988
  0.50          0.82     -9.2925
  0.75          0.79    -12.1209
  1.00          0.75    -15.2625

< - Best Lambda
* - Confidence Interval
+ - Convenient Lambda
$R^2$ instead of SSE is given
$\lambda = 0$ (log transform) is the most convenient value
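The search grid can be narrowed around the best value. This is a sketch assuming the same data set a1; the grid limits and step size are arbitrary choices for illustration.

```sas
/* Finer grid of lambda values around the best value found above */
proc transreg data=a1;
  model boxcox(plasma / lambda=-1.5 to 0.5 by 0.1)=identity(age);
run;
```

A finer grid gives a more precise best lambda, but a convenient value inside the confidence interval is usually preferred because it is easier to interpret.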
3-25
proc gplot;
plot lplasma*age=1 lplasma*age=2/overlay;
run; quit;
3-26
Chapter Review
Diagnostics
Graphical methods
Statistical tests
Remedies
Nonlinearity
Nonconstant variance
3-27
