Outline
Statistics 191: Introduction to Applied Statistics Jonathan Taylor Department of Statistics Stanford University
Diagnostics for simple regression
Goodness of fit of regression: analysis of variance.
F-statistics.
Residuals.
Diagnostic plots.
Goodness of fit
Sums of squares
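$$
SSE = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^n (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2
$$
$$
SSR = \sum_{i=1}^n (\bar{Y} - \hat{Y}_i)^2 = \sum_{i=1}^n (\bar{Y} - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2
$$
$$
SST = \sum_{i=1}^n (Y_i - \bar{Y})^2 = SSE + SSR
$$
$$
R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}
$$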
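The sums of squares can be computed directly from their definitions. The following is a minimal sketch in Python on hypothetical toy data (the values of `x` and `y` are made up for illustration); it fits the least-squares line by hand and then evaluates SSE, SSR, SST and R².

```python
# Hypothetical toy data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares estimates of the slope and intercept.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta1 = sxy / sxx
beta0 = y_bar - beta1 * x_bar

# Fitted values from the estimated line.
fitted = [beta0 + beta1 * xi for xi in x]

# Sums of squares, following the definitions term by term.
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ssr = sum((y_bar - fi) ** 2 for fi in fitted)
sst = sum((yi - y_bar) ** 2 for yi in y)
r_squared = ssr / sst

print("SSE =", sse)
print("SSR =", ssr)
print("SST =", sst)
print("R^2 =", r_squared)
```

Up to rounding error, `sst` equals `sse + ssr`, so `ssr / sst` and `1 - sse / sst` give the same R².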
F-statistics
What is an F-statistic?
An F-statistic is a ratio of sample variances (mean squares): it has a numerator, N, and a denominator, D, that are independent.
Let
$$
N \sim \frac{\chi^2_{df_{num}}}{df_{num}}, \qquad D \sim \frac{\chi^2_{df_{den}}}{df_{den}}
$$
and define
$$
F = \frac{N}{D}.
$$
We say F has an F distribution with parameters $df_{num}, df_{den}$ and write $F \sim F_{df_{num}, df_{den}}$.
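The ratio
$$
F = \frac{SSR/1}{SSE/(n-2)} = \frac{MSR}{MSE}
$$
can be thought of as a ratio of variances. In fact, under $H_0: \beta_1 = 0$, $F \sim F_{1,n-2}$ because
$$
SSR = \|\hat{Y} - \bar{Y} \mathbf{1}\|^2, \qquad SSE = \|Y - \hat{Y}\|^2
$$
and these two vectors are orthogonal.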
F and t statistics
In other words, the square of a t-statistic is an F-statistic. Because it is never negative, an F-statistic has no direction ($\pm$) associated with it.
In fact (see R code),
$$
F = \frac{MSR}{MSE} = \frac{\hat{\beta}_1^2}{SE(\hat{\beta}_1)^2}.
$$
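The identity between the squared t-statistic and the F-statistic can be checked numerically. The following is a minimal sketch in Python on hypothetical toy data. It assumes the usual standard error formula for the slope, SE(β̂1) = sqrt(MSE / Σ(Xi − X̄)²), which is not stated on the slide.

```python
import math

# Hypothetical toy data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.8, 4.1, 5.9, 8.3, 9.7, 12.4]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares fit of the simple linear regression model.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta1 = sxy / sxx
beta0 = y_bar - beta1 * x_bar
fitted = [beta0 + beta1 * xi for xi in x]

# Mean squares: SSR has 1 degree of freedom, SSE has n - 2.
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ssr = sum((y_bar - fi) ** 2 for fi in fitted)
msr = ssr / 1
mse = sse / (n - 2)
f_stat = msr / mse

# t-statistic for the slope (assumed standard error formula, see lead-in).
se_beta1 = math.sqrt(mse / sxx)
t_stat = beta1 / se_beta1

print("F   =", f_stat)
print("t^2 =", t_stat ** 2)
```

The two printed values agree up to rounding error, while the sign of `t_stat` carries the direction that F discards.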
Interpretation of an F-statistic
In regression, the numerator is usually a difference in goodness of fit of two (nested) models.
The denominator is $\hat{\sigma}^2$, an estimate of $\sigma^2$.
Our example today: the bigger model is the simple linear regression model, the smaller is the model with constant mean (one sample model).
If the F is large, it says that the bigger model explains a lot more variability in Y (relative to $\sigma^2$) than the smaller one.
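The F-statistic has the form
$$
F = \frac{(SSE(RM) - SSE(FM)) / (df_{RM} - df_{FM})}{SSE(FM) / df_{FM}},
$$
where RM denotes the reduced (smaller) model and FM the full (bigger) model.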
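For today's example, this general form reduces to MSR/MSE. A short derivation, assuming the smaller model is the constant-mean model (with n − 1 residual degrees of freedom) and the bigger model is the simple linear regression model (with n − 2):

```latex
\begin{align*}
SSE(\text{smaller}) &= \sum_{i=1}^n (Y_i - \bar{Y})^2 = SST, & df_{\text{smaller}} &= n - 1, \\
SSE(\text{bigger}) &= \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = SSE, & df_{\text{bigger}} &= n - 2, \\
F &= \frac{(SST - SSE)/1}{SSE/(n-2)} = \frac{SSR/1}{SSE/(n-2)} = \frac{MSR}{MSE}.
\end{align*}
```

The middle equality uses SST = SSE + SSR.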
Diagnostics
What can go wrong?
Regression function can be wrong: maybe regression function should be quadratic (see R code).
Model for the errors may be incorrect:
  errors may not be normally distributed.
  errors may not be independent.
  errors may not have the same variance.
Detecting problems is more art than science, i.e. we cannot test for all possible problems in a regression model.
Basic idea of diagnostic measures: if model is correct then residuals $e_i = Y_i - \hat{Y}_i$, $1 \leq i \leq n$, should look like a sample of (not quite independent) $N(0, \sigma^2)$ random variables.
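One way to see why the residuals are "not quite independent": least-squares residuals always satisfy two linear constraints, namely that they sum to zero and that they are orthogonal to X. The following is a minimal sketch in Python on hypothetical toy data.

```python
# Hypothetical toy data, invented for illustration only.
x = [0.5, 1.5, 2.0, 3.5, 4.0, 5.5, 6.0]
y = [1.2, 2.9, 3.1, 6.0, 6.4, 9.1, 9.3]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares fit.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta1 = sxy / sxx
beta0 = y_bar - beta1 * x_bar

# Residuals e_i = Y_i - fitted value.
residuals = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]

# The two constraints imposed by the fit.
sum_e = sum(residuals)
sum_xe = sum(xi * ei for xi, ei in zip(x, residuals))

print("sum of residuals     =", sum_e)
print("sum of x * residuals =", sum_xe)
```

Both sums are zero up to rounding error, so knowing any n − 2 residuals determines the remaining two.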
Diagnostic plots
Problems in the regression function
True regression function may have higher-order non-linear terms, polynomial or otherwise.
Can sometimes be detected by looking at a plot of X vs. residuals e.
If there is any visible trend in this plot, consider adding more terms to the model to capture this trend (this makes the model a multiple regression model).
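To illustrate the kind of trend to look for, the following minimal Python sketch fits a straight line to hypothetical, noiseless data generated from a quadratic function (the coefficients 1, 2 and 0.5 are arbitrary choices). The residuals are printed against X instead of plotted.

```python
# Hypothetical data: a noiseless quadratic, chosen only for illustration.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [1.0 + 2.0 * xi + 0.5 * xi ** 2 for xi in x]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Fit the (misspecified) simple linear regression model.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta1 = sxy / sxx
beta0 = y_bar - beta1 * x_bar

# Residuals from the straight-line fit.
residuals = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]

# A text version of the X vs. residual plot.
for xi, ei in zip(x, residuals):
    print(f"x = {xi:5.1f}   e = {ei:6.2f}")
```

The residuals are positive at both ends of the X range and negative in the middle: a U-shaped trend that suggests a missing quadratic term.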
Quadratic model
Possible problems & diagnostic checks
Errors may not be normally distributed or may not have the same variance: qqnorm can help with this.
Variance may not be constant. Can also be checked in a plot of X vs. e: a fan shape or other trend indicates non-constant variance.
Outliers: points where the model really does not fit! Possibly mistakes in data transcription, lab errors, who knows? Should be recognized and (hopefully) explained.
Non-normality
qqnorm
If $e_i$, $1 \leq i \leq n$, were really a sample of $N(0, \sigma^2)$ then their sample quantiles should be close to the quantiles of the $N(0, \sigma^2)$ distribution.
Plot: $e_{(i)}$ vs. $E(\varepsilon_{(i)})$, $1 \leq i \leq n$, where $e_{(i)}$ is the i-th smallest residual (order statistic) and $E(\varepsilon_{(i)})$ is the expected value of the i-th smallest of n independent $\varepsilon_i \sim N(0, \sigma^2)$.
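The coordinates of such a plot can be sketched without any plotting library. The following minimal Python example uses hypothetical residuals and approximates the expected order statistics by standard normal quantiles at plotting positions (i − a)/(n + 1 − 2a), with a = 3/8 for n ≤ 10 and a = 1/2 otherwise. This is assumed here to mirror the convention of R's `ppoints`. Standard normal quantiles are used in place of N(0, σ²) ones, which only rescales the horizontal axis.

```python
from statistics import NormalDist

# Hypothetical residuals, invented for illustration only.
residuals = [0.3, -1.2, 0.8, -0.1, 1.5, -0.6, 0.1, -0.8]
n = len(residuals)

# Order statistics e_(1) <= ... <= e_(n).
ordered = sorted(residuals)

# Plotting positions (assumed convention, see lead-in).
a = 3 / 8 if n <= 10 else 1 / 2
probs = [(i - a) / (n + 1 - 2 * a) for i in range(1, n + 1)]

# Approximate expected order statistics of a standard normal sample.
theoretical = [NormalDist().inv_cdf(p) for p in probs]

# Each pair is one point of the QQ plot.
for q, e in zip(theoretical, ordered):
    print(f"theoretical = {q:7.3f}   residual = {e:6.2f}")
```

If the residuals are roughly normal, the printed pairs fall close to a straight line; curvature or stray points at the ends suggest non-normality or outliers.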
QQplot of residuals