
Statistics 191: Introduction to Applied Statistics


Simple Linear Regression: Diagnostics

Jonathan Taylor, Department of Statistics, Stanford University

January 28, 2009


Outline

Diagnostics for simple regression:
Goodness of fit of regression: analysis of variance.
F-statistics.
Residuals.
Diagnostic plots.


Geometry of Least Squares




Goodness of fit

Sums of squares

$$\text{SSE} = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^n (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$$

$$\text{SSR} = \sum_{i=1}^n (\bar{Y} - \hat{Y}_i)^2 = \sum_{i=1}^n (\bar{Y} - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$$

$$\text{SST} = \sum_{i=1}^n (Y_i - \bar{Y})^2 = \text{SSE} + \text{SSR}$$

$$R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}} = \text{Cor}(X, Y)^2.$$

Basic idea: if $R^2$ is large, a lot of the variability in Y is explained by X.
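The course's R code for these computations is not reproduced here; a minimal sketch of the same quantities in R, using simulated placeholder data, might look like this:

```r
# Simulate a simple linear relationship (placeholder data)
set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)

fit <- lm(y ~ x)

SSE <- sum((y - fitted(fit))^2)        # error sum of squares
SSR <- sum((fitted(fit) - mean(y))^2)  # regression sum of squares
SST <- sum((y - mean(y))^2)            # total sum of squares

# SST = SSE + SSR, and R^2 = SSR/SST = 1 - SSE/SST = Cor(X, Y)^2
c(SST, SSE + SSR)
c(SSR / SST, 1 - SSE / SST, cor(x, y)^2, summary(fit)$r.squared)
```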



Total sum of squares



R code


Error sum of squares



R code


Regression sum of squares



R code


F-statistics

What is an F-statistic?

An F-statistic is a ratio of sample variances (mean squares): it has a numerator, N, and a denominator, D, that are independent. Let

$$N \sim \frac{\chi^2_{\text{df}_{\text{num}}}}{\text{df}_{\text{num}}}, \qquad D \sim \frac{\chi^2_{\text{df}_{\text{den}}}}{\text{df}_{\text{den}}}$$

and define

$$F = \frac{N}{D}.$$

We say F has an F distribution with parameters $\text{df}_{\text{num}}, \text{df}_{\text{den}}$ and write $F \sim F_{\text{df}_{\text{num}},\, \text{df}_{\text{den}}}$.
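A small simulation sketch (not from the course materials; the sample size and degrees of freedom are arbitrary) illustrating that such a ratio follows the F distribution:

```r
# Ratio of independent chi-squared variables, each divided by its df
set.seed(2)
df_num <- 1
df_den <- 48
N <- rchisq(10000, df = df_num) / df_num
D <- rchisq(10000, df = df_den) / df_den
F_sim <- N / D

# Compare the simulated upper 5% point with the F quantile
quantile(F_sim, 0.95)
qf(0.95, df_num, df_den)
```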


F-statistic in simple linear regression



Goodness of fit F-statistic

The ratio

$$F = \frac{\text{SSR}/1}{\text{SSE}/(n-2)} = \frac{\text{MSR}}{\text{MSE}}$$

can be thought of as a ratio of variances. In fact, under $H_0 : \beta_1 = 0$, $F \sim F_{1,\, n-2}$ because

$$\text{SSR} = \|\hat{Y} - \bar{Y} \cdot 1\|^2, \qquad \text{SSE} = \|Y - \hat{Y}\|^2$$

and from our picture, these vectors are orthogonal.
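A sketch of computing this F-statistic by hand and checking it against R's anova() output, reusing the simulated x and y from the earlier sketch (an assumption, since the slide's own R code is not shown):

```r
fit <- lm(y ~ x)
n <- length(y)

MSR <- sum((fitted(fit) - mean(y))^2) / 1   # SSR / 1
MSE <- sum(resid(fit)^2) / (n - 2)          # SSE / (n - 2)
F_stat <- MSR / MSE

# anova() reports the same F-statistic; its p-value is the upper tail of F(1, n - 2)
F_stat
anova(fit)
pf(F_stat, 1, n - 2, lower.tail = FALSE)
```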



F and t statistics

Relation between F and t

If $T \sim t_\nu$, then

$$T^2 \sim \frac{N(0,1)^2}{\chi^2_\nu / \nu} \sim \frac{\chi^2_1 / 1}{\chi^2_\nu / \nu}.$$

In other words, the square of a t-statistic is an F-statistic. Because it is always positive, an F-statistic has no direction ($\pm$) associated with it. In fact (see R code),

$$F = \frac{\text{MSR}}{\text{MSE}} = \frac{\hat{\beta}_1^2}{SE(\hat{\beta}_1)^2}.$$
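A sketch checking this identity numerically on the same placeholder fit (again an assumption; the slide's own R code is not shown):

```r
fit <- lm(y ~ x)

# t-statistic for the slope from the coefficient table
t_slope <- summary(fit)$coefficients["x", "t value"]

# F-statistic from the analysis of variance table
F_stat <- anova(fit)[["F value"]][1]

# The square of the slope t-statistic equals the F-statistic
c(t_slope^2, F_stat)
```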


F-statistics in regression models



Interpretation of an F-statistic

In regression, the numerator is usually a difference in goodness of fit of two (nested) models. The denominator is $\hat{\sigma}^2$, an estimate of $\sigma^2$.

Our example today: the bigger model is the simple linear regression model, the smaller is the model with constant mean (one-sample model).

If the F is large, it says that the bigger model explains a lot more variability in Y (relative to $\sigma^2$) than the smaller one.


F-test in simple linear regression



Example in more detail

Full (bigger) model:

$$FM : \quad Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

Reduced (smaller) model:

$$RM : \quad Y_i = \beta_0 + \varepsilon_i$$

The F-statistic has the form

$$F = \frac{(\text{SSE}(RM) - \text{SSE}(FM))/(df_{RM} - df_{FM})}{\text{SSE}(FM)/df_{FM}}.$$

Reject $H_0$: RM is correct, if $F > F_{1-\alpha,\, 1,\, n-2}$.
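A sketch of the same nested-model comparison using R's anova() on two lm() fits (placeholder data again, not the course's own code):

```r
full    <- lm(y ~ x)   # FM: Y_i = beta_0 + beta_1 X_i + eps_i
reduced <- lm(y ~ 1)   # RM: Y_i = beta_0 + eps_i

# F = ((SSE(RM) - SSE(FM)) / (df_RM - df_FM)) / (SSE(FM) / df_FM)
anova(reduced, full)

# Reject RM at level alpha if F exceeds this F quantile
n <- length(y)
alpha <- 0.05
qf(1 - alpha, 1, n - 2)
```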



Diagnostics

What are the assumptions?

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

Errors $\varepsilon_i$ are assumed independent $N(0, \sigma^2)$.


Diagnostics

What can go wrong?

The regression function can be wrong: maybe the regression function should be quadratic (see R code).

The model for the errors may be incorrect:
they may not be normally distributed,
they may not be independent,
they may not have the same variance.

Detecting problems is more art than science, i.e. we cannot test for all possible problems in a regression model.

Basic idea of diagnostic measures: if the model is correct, then the residuals $e_i = Y_i - \hat{Y}_i$, $1 \leq i \leq n$, should look like a sample of (not quite independent) $N(0, \sigma^2)$ random variables.


A bad simple regression model



R code


Diagnostic plots

Problems in the regression function

The true regression function may have higher-order non-linear terms, polynomial or otherwise. This can sometimes be detected by looking at a plot of X vs. the residuals $e$, as sketched below. If there is any visible trend in this plot, one may consider adding more terms to the model to capture this trend (this makes the model a multiple regression model).
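A sketch of such a residual plot, using hypothetical data with a quadratic trend so the pattern is visible (these variable names are not from the course's R code):

```r
# Data where the true regression function is quadratic
set.seed(3)
x <- runif(100, -2, 2)
y <- 1 + x + 2 * x^2 + rnorm(100)

fit <- lm(y ~ x)

# Residuals vs. X: a visible U-shape suggests adding a quadratic term
plot(x, resid(fit), xlab = "X", ylab = "Residuals")
abline(h = 0, lty = 2)
```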


Plot of residuals vs. X



R code


Quadratic model

R code
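The slide's own code is not reproduced here; a minimal sketch of fitting a quadratic term in R, continuing the hypothetical data above, might be:

```r
# Add a quadratic term; I(x^2) tells lm() to square x rather than
# interpret ^ as a formula operator
fit2 <- lm(y ~ x + I(x^2))
summary(fit2)

# Residuals vs. X should now show no systematic trend
plot(x, resid(fit2), xlab = "X", ylab = "Residuals")
abline(h = 0, lty = 2)
```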


Plot of residuals vs. X (quadratic)



R code


Problems with the errors



Possible problems & diagnostic checks

Errors may not be normally distributed or may not have the same variance; qqnorm can help with this.

Variance may not be constant. This can also be addressed in a plot of X vs. $e$: a fan shape or other trend indicates non-constant variance.

Outliers: points where the model really does not fit! Possibly mistakes in data transcription, lab errors, who knows? Should be recognized and (hopefully) explained.


Non-normality

qqnorm

If $e_i$, $1 \leq i \leq n$, were really a sample of $N(0, \sigma^2)$, then their sample quantiles should be close to the quantiles of the $N(0, \sigma^2)$ distribution.

Plot: $e_{(i)}$ vs. $E(\varepsilon_{(i)})$, $1 \leq i \leq n$, where $e_{(i)}$ is the $i$-th smallest residual (order statistic) and $E(\varepsilon_{(i)})$ is the expected value for independent $\varepsilon_i$'s $\sim N(0, \sigma^2)$.
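A sketch of the normal quantile-quantile plot of residuals in R (placeholder fit, not the course's own code):

```r
fit <- lm(y ~ x)

# Sample quantiles of the residuals vs. theoretical normal quantiles;
# points close to the reference line are consistent with normal errors
qqnorm(resid(fit))
qqline(resid(fit))
```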


QQplot of residuals

R code


QQplot of residuals (quadratic)



R code


Outlier and nonconstant variance



R code


Outlier and nonconstant variance



R code
