
Statistics 191: Introduction to Applied Statistics


Simple Linear Regression: Diagnostics

Jonathan Taylor, Department of Statistics, Stanford University

January 28, 2009


Outline

Diagnostics for simple regression:
Goodness of fit of regression: analysis of variance.
F-statistics.
Residuals.
Diagnostic plots.


Geometry of Least Squares




Goodness of fit

Sums of squares

$$\text{SSE} = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^n (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$$

$$\text{SSR} = \sum_{i=1}^n (\bar{Y} - \hat{Y}_i)^2 = \sum_{i=1}^n (\bar{Y} - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$$

$$\text{SST} = \sum_{i=1}^n (Y_i - \bar{Y})^2 = \text{SSE} + \text{SSR}$$

$$R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}} = \text{Cor}(X, Y)^2.$$

Basic idea: if $R^2$ is large, a lot of the variability in Y is explained by X.
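The course's R code for these computations is not reproduced here; a minimal sketch of the same quantities in R, using simulated placeholder data, might look like this:

```r
# Simulate a simple linear relationship (placeholder data)
set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)

fit <- lm(y ~ x)

SSE <- sum((y - fitted(fit))^2)        # error sum of squares
SSR <- sum((fitted(fit) - mean(y))^2)  # regression sum of squares
SST <- sum((y - mean(y))^2)            # total sum of squares

# SST = SSE + SSR, and R^2 = SSR/SST = 1 - SSE/SST = Cor(X, Y)^2
c(SST, SSE + SSR)
c(SSR / SST, 1 - SSE / SST, cor(x, y)^2, summary(fit)$r.squared)
```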



Total sum of squares



R code


Error sum of squares



R code


Regression sum of squares



R code


F-statistics

What is an F-statistic?

An F-statistic is a ratio of sample variances (mean squares): it has a numerator, N, and a denominator, D, that are independent. Let

$$N \sim \frac{\chi^2_{\text{df}_{\text{num}}}}{\text{df}_{\text{num}}}, \qquad D \sim \frac{\chi^2_{\text{df}_{\text{den}}}}{\text{df}_{\text{den}}}$$

and define

$$F = \frac{N}{D}.$$

We say F has an F distribution with parameters $\text{df}_{\text{num}}, \text{df}_{\text{den}}$ and write $F \sim F_{\text{df}_{\text{num}},\, \text{df}_{\text{den}}}$.
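A small simulation sketch (not from the course materials; the sample size and degrees of freedom are arbitrary) illustrating that such a ratio follows the F distribution:

```r
# Ratio of independent chi-squared variables, each divided by its df
set.seed(2)
df_num <- 1
df_den <- 48
N <- rchisq(10000, df = df_num) / df_num
D <- rchisq(10000, df = df_den) / df_den
F_sim <- N / D

# Compare the simulated upper 5% point with the F quantile
quantile(F_sim, 0.95)
qf(0.95, df_num, df_den)
```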


F-statistic in simple linear regression



Goodness of fit F-statistic

The ratio

$$F = \frac{\text{SSR}/1}{\text{SSE}/(n-2)} = \frac{\text{MSR}}{\text{MSE}}$$

can be thought of as a ratio of variances. In fact, under $H_0 : \beta_1 = 0$, $F \sim F_{1,\, n-2}$ because

$$\text{SSR} = \|\hat{Y} - \bar{Y} \cdot 1\|^2, \qquad \text{SSE} = \|Y - \hat{Y}\|^2$$

and from our picture, these vectors are orthogonal.
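A sketch of computing this F-statistic by hand and checking it against R's anova() output, reusing the simulated x and y from the earlier sketch (an assumption, since the slide's own R code is not shown):

```r
fit <- lm(y ~ x)
n <- length(y)

MSR <- sum((fitted(fit) - mean(y))^2) / 1   # SSR / 1
MSE <- sum(resid(fit)^2) / (n - 2)          # SSE / (n - 2)
F_stat <- MSR / MSE

# anova() reports the same F-statistic; its p-value is the upper tail of F(1, n - 2)
F_stat
anova(fit)
pf(F_stat, 1, n - 2, lower.tail = FALSE)
```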



F and t statistics

Relation between F and t

If $T \sim t_\nu$, then

$$T^2 \sim \frac{N(0,1)^2}{\chi^2_\nu / \nu} \sim \frac{\chi^2_1 / 1}{\chi^2_\nu / \nu}.$$

In other words, the square of a t-statistic is an F-statistic. Because it is always positive, an F-statistic has no direction ($\pm$) associated with it. In fact (see R code),

$$F = \frac{\text{MSR}}{\text{MSE}} = \frac{\hat{\beta}_1^2}{SE(\hat{\beta}_1)^2}.$$
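A sketch checking this identity numerically on the same placeholder fit (again an assumption; the slide's own R code is not shown):

```r
fit <- lm(y ~ x)

# t-statistic for the slope from the coefficient table
t_slope <- summary(fit)$coefficients["x", "t value"]

# F-statistic from the analysis of variance table
F_stat <- anova(fit)[["F value"]][1]

# The square of the slope t-statistic equals the F-statistic
c(t_slope^2, F_stat)
```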


F-statistics in regression models



Interpretation of an F-statistic

In regression, the numerator is usually a difference in goodness of fit of two (nested) models. The denominator is $\hat{\sigma}^2$, an estimate of $\sigma^2$.

Our example today: the bigger model is the simple linear regression model, the smaller is the model with constant mean (one-sample model).

If the F is large, it says that the bigger model explains a lot more variability in Y (relative to $\sigma^2$) than the smaller one.


F-test in simple linear regression



Example in more detail

Full (bigger) model:

$$FM : \quad Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

Reduced (smaller) model:

$$RM : \quad Y_i = \beta_0 + \varepsilon_i$$

The F-statistic has the form

$$F = \frac{(\text{SSE}(RM) - \text{SSE}(FM))/(df_{RM} - df_{FM})}{\text{SSE}(FM)/df_{FM}}.$$

Reject $H_0$: RM is correct, if $F > F_{1-\alpha,\, 1,\, n-2}$.
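A sketch of the same nested-model comparison using R's anova() on two lm() fits (placeholder data again, not the course's own code):

```r
full    <- lm(y ~ x)   # FM: Y_i = beta_0 + beta_1 X_i + eps_i
reduced <- lm(y ~ 1)   # RM: Y_i = beta_0 + eps_i

# F = ((SSE(RM) - SSE(FM)) / (df_RM - df_FM)) / (SSE(FM) / df_FM)
anova(reduced, full)

# Reject RM at level alpha if F exceeds this F quantile
n <- length(y)
alpha <- 0.05
qf(1 - alpha, 1, n - 2)
```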



Diagnostics

What are the assumptions?

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

Errors $\varepsilon_i$ are assumed independent $N(0, \sigma^2)$.


Diagnostics

What can go wrong?

The regression function can be wrong: maybe the regression function should be quadratic (see R code).

The model for the errors may be incorrect:
they may not be normally distributed,
they may not be independent,
they may not have the same variance.

Detecting problems is more art than science, i.e. we cannot test for all possible problems in a regression model.

Basic idea of diagnostic measures: if the model is correct, then the residuals $e_i = Y_i - \hat{Y}_i$, $1 \leq i \leq n$, should look like a sample of (not quite independent) $N(0, \sigma^2)$ random variables.


A bad simple regression model



R code


Diagnostic plots

Problems in the regression function

The true regression function may have higher-order non-linear terms, polynomial or otherwise. This can sometimes be detected by looking at a plot of X vs. the residuals $e$, as sketched below. If there is any visible trend in this plot, one may consider adding more terms to the model to capture this trend (this makes the model a multiple regression model).
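A sketch of such a residual plot, using hypothetical data with a quadratic trend so the pattern is visible (these variable names are not from the course's R code):

```r
# Data where the true regression function is quadratic
set.seed(3)
x <- runif(100, -2, 2)
y <- 1 + x + 2 * x^2 + rnorm(100)

fit <- lm(y ~ x)

# Residuals vs. X: a visible U-shape suggests adding a quadratic term
plot(x, resid(fit), xlab = "X", ylab = "Residuals")
abline(h = 0, lty = 2)
```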


Plot of residuals vs. X



R code


Quadratic model

R code
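The slide's own code is not reproduced here; a minimal sketch of fitting a quadratic term in R, continuing the hypothetical data above, might be:

```r
# Add a quadratic term; I(x^2) tells lm() to square x rather than
# interpret ^ as a formula operator
fit2 <- lm(y ~ x + I(x^2))
summary(fit2)

# Residuals vs. X should now show no systematic trend
plot(x, resid(fit2), xlab = "X", ylab = "Residuals")
abline(h = 0, lty = 2)
```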


Plot of residuals vs. X (quadratic)



R code


Problems with the errors



Possible problems & diagnostic checks

Errors may not be normally distributed or may not have the same variance; qqnorm can help with this.

Variance may not be constant. This can also be addressed in a plot of X vs. $e$: a fan shape or other trend indicates non-constant variance.

Outliers: points where the model really does not fit! Possibly mistakes in data transcription, lab errors, who knows? Should be recognized and (hopefully) explained.


Non-normality

qqnorm

If $e_i$, $1 \leq i \leq n$, were really a sample of $N(0, \sigma^2)$, then their sample quantiles should be close to the quantiles of the $N(0, \sigma^2)$ distribution.

Plot: $e_{(i)}$ vs. $E(\varepsilon_{(i)})$, $1 \leq i \leq n$, where $e_{(i)}$ is the $i$-th smallest residual (order statistic) and $E(\varepsilon_{(i)})$ is the expected value for independent $\varepsilon_i$'s $\sim N(0, \sigma^2)$.
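A sketch of the normal quantile-quantile plot of residuals in R (placeholder fit, not the course's own code):

```r
fit <- lm(y ~ x)

# Sample quantiles of the residuals vs. theoretical normal quantiles;
# points close to the reference line are consistent with normal errors
qqnorm(resid(fit))
qqline(resid(fit))
```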


QQplot of residuals

R code


QQplot of residuals (quadratic)



R code


Outlier and nonconstant variance



R code


Outlier and nonconstant variance



R code
