Introduction to linear regression analysis

- with applications in SAS

COMP-STAT GROUP

The aim of this presentation is to explain the important steps
involved in a linear regression setup.

We will follow the logical flow of the process:

Identification, estimation and prediction

Introduction

The study of dependence

Example: does changing the class size affect the success of students?
Explaining the dependent variable mathematically, based on a set of
independent variables

Regression Models

The Model

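Y is the dependent variable
Xs are the independent variables
$\varepsilon$ is the error term
Observe that the model is linear in the coefficients.
What does linearity mean? The coefficients enter the model only as multipliers of the regressors; the regressors themselves may be transformed.
Simple linear regression: a model with only one predictor
Estimation: least squares and/or maximum likelihood estimator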
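
Written out, the model described on this slide usually takes the following form. The symbols $\beta_j$ for the coefficients, $\varepsilon$ for the error, $p$ for the number of regressors and $n$ for the number of cases are assumed notation, not taken from the deck. The second line states the least squares criterion.

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon

% least squares chooses the coefficients that minimise the residual sum of squares
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
```

For instance, $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon$ is still linear in the coefficients, whereas $Y = \beta_0 e^{\beta_1 X} + \varepsilon$ is not.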
Assumptions

Main assumptions:
Linearity
Normality
Homoscedasticity
Independence (of explanatory variables, of error terms)

Number of cases
Data accuracy
Missing data
Outliers

What do they mean?


Assumptions (contd.)
Number of cases
The ratio of cases to independent variables should ideally be 20:1 (minimum 5:1)

Accuracy of data
Check that you have entered valid data points
Missing data
Their treatment is necessary
Outliers

Objectives of analysis

Estimation
Hypothesis testing
Confidence intervals
Prediction of new observations
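
A minimal sketch of how these four objectives map onto PROC REG options. The dataset name `test` and the variable names `y`, `x1`-`x4` are assumptions borrowed from the later SAS slides, and the 0.05 level is only illustrative.

```sas
/* Hypothetical dataset and variable names */
proc reg data=test;
   model y = x1 x2 x3 x4
         / clb          /* confidence limits for the coefficient estimates */
           clm          /* confidence limits for the mean response */
           cli          /* prediction limits for an individual observation */
           alpha=0.05;  /* significance level for all of the limits above */
run;
quit;
```

The default output already contains the coefficient estimates, a t test for each coefficient and the overall F test, so estimation and hypothesis testing need no extra options.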

Let us take a real-life problem and then proceed further.

An example
We have data on jet engine thrust as the response variable, with primary
speed of rotation, secondary speed of rotation, fuel flow rate,
pressure, exhaust temperature and ambient temperature at the time of
test as regressor variables.
The objective is to fit a linear regression model, check whether the
model satisfies all underlying assumptions, and assess whether it can
predict future observations correctly.

Variable selection

Important algorithms:
Forward selection
Backward elimination
Stepwise regression (preferred)

Always start with your domain knowledge. It will guide you through
the selection of variables from a set of candidate variables.

Don't rely too heavily on variable selection algorithms, since they are
entirely computer-driven and ignore subject-matter knowledge.
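
A minimal sketch of stepwise selection in PROC REG for the jet engine example. The dataset name `test`, the variable names `y` and `x1`-`x6`, and the 0.15 entry and stay levels are assumptions chosen for illustration, not values from the deck.

```sas
/* Stepwise selection: hypothetical dataset and variable names */
proc reg data=test;
   model y = x1 x2 x3 x4 x5 x6
         / selection=stepwise   /* FORWARD or BACKWARD selects the other two algorithms */
           slentry=0.15         /* significance level required to enter the model */
           slstay=0.15;         /* significance level required to stay in the model */
run;
quit;
```

The selected model is best treated as a candidate to be checked against domain knowledge, not as a final answer.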

Categorical independent variables

How to incorporate qualitative variables in the analysis

Concept of dummy variables
We include k-1 dummies for k categories
One category is set as the base category
They act like usual variables in the linear regression setup
Suppose we have three categories of TV: A, B and C. Then we will
include 2 dummies.
Let the dummies be X and Y; they take values as follows:

Category   X   Y
A          0   0
B          1   0
C          0   1
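
PROC REG has no CLASS statement, so the dummies are typically created in a DATA step first. A minimal sketch of the coding in the table, assuming a hypothetical dataset `tv` with a character variable `brand` taking the values 'A', 'B' and 'C':

```sas
/* Hypothetical input: dataset TV with character variable BRAND ('A', 'B', 'C') */
data tv_dummies;
   set tv;
   X = (brand = 'B');   /* 1 for category B, 0 otherwise */
   Y = (brand = 'C');   /* 1 for category C, 0 otherwise */
run;                    /* category A is the base: X = 0 and Y = 0 */
```

A comparison in parentheses evaluates to 1 or 0 in SAS, which is why no IF/ELSE logic is needed here.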

Post estimation concerns

We have seen the model outputs and analyzed them.

Once the model is estimated, the next step is to check
whether it satisfies all the assumptions stated.

If all the assumptions are satisfied, we are good; otherwise
corrections and modifications must be made to make the
model ready for use.

Regression Diagnostics

Residuals
$e_i = y_i - \hat{y}_i$
The smaller the residuals, the better the model fits.

Types - Standardized residuals (Std.R)
      - Studentized residuals (Stdnt.R)
      - PRESS residuals
      - R-student residuals

|Std.R| > 3 indicates a potential outlier.

It is better to look at Stdnt.R, since it accounts for the leverage of each point.

PRESS (prediction error sum of squares) residuals

Also called deleted residuals
Estimate the model with that observation deleted, then calculate the
predicted value for that observation. The residual so obtained is the PRESS
residual
A high value indicates a high-influence point
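
The four residual types listed above are commonly defined as follows. This is the standard textbook formulation, assumed here because the deck gives no formulas: $h_{ii}$ is the $i$-th diagonal element of the hat matrix $H = X(X'X)^{-1}X'$, $MS_{Res}$ is the residual mean square, and $S_{(i)}^2$ is the residual variance estimated with observation $i$ deleted.

```latex
d_i = \frac{e_i}{\sqrt{MS_{Res}}}                         % standardized residual
r_i = \frac{e_i}{\sqrt{MS_{Res}\,(1 - h_{ii})}}           % studentized residual
e_{(i)} = y_i - \hat{y}_{(i)} = \frac{e_i}{1 - h_{ii}}    % PRESS (deleted) residual
t_i = \frac{e_i}{\sqrt{S_{(i)}^2\,(1 - h_{ii})}}          % R-student residual
```

The identity $e_{(i)} = e_i / (1 - h_{ii})$ means the PRESS residuals can be obtained from a single fit, without refitting the model n times.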

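SAS code
proc reg data=test;
model y=x1 x2 x3 x4;
/* each OUTPUT keyword needs a variable name; the names below are illustrative */
output out=dataset student=stud_r rstudent=rstud_r press=press_r;
run;
quit;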
Residual plots

Normal probability plots

Plot of normal quantiles against residual quantiles

A straight line supports the normality assumption of the residuals.

Highly sensitive to non-normality near the two tails

Can be helpful in outlier detection

Statistical tests

Kolmogorov-Smirnov test

Anderson-Darling test

Shapiro-Wilk test

SAS code
proc univariate data=residuals normal; /*normal option for normality tests*/
var r;
qqplot r/normal(mu= est sigma=est);
/*est is for estimating mean & variance from data itself*/
run;
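
The PROC UNIVARIATE step above reads a dataset `residuals` containing a variable `r`, which the deck does not show being created. One way to produce it, assuming the dataset `test` and the variables `y`, `x1`-`x4` from the earlier slide:

```sas
/* Hypothetical names: TEST, y, x1-x4 follow the earlier slides */
proc reg data=test;
   model y = x1 x2 x3 x4;
   output out=residuals r=r p=p;   /* r = ordinary residuals, p = predicted values */
run;
quit;
```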
Residual Plots (contd.)

Homogeneity of error variance

Used to check the homoscedasticity assumption of the error variance

If the assumption holds, the plot of residuals against predicted values should show a random pattern

The plot also reveals any unusually large residuals, which of course are potential outliers

If the plot is not random, you may need to apply some transformation to the regressors or the response

White test

Tests the null hypothesis that the variance of the residuals is homogeneous

Use the SPEC option in the MODEL statement

Remedy

Resort to generalized least squares estimators

SAS Code

proc reg data=dataset;
model y=x1 x2 x3/spec;
plot r.*p.; /*plot residual vs. predicted values*/
run;
quit;
Outlier Treatment

An outlier is an extreme observation

Residuals considerably larger in absolute value than the others, say 3 or 4 standard
deviations from the mean, indicate potential y-space outliers

Outliers are data points that are not typical of the rest of the data

Residual plots and normal probability plots are helpful in identifying outliers

Studentized or R-student residuals can also be used

An outlier should be removed from the data before estimating the model only if it is a bad value

There should be strong non-statistical evidence that the outlier is a bad value before
it is discarded

Outliers are sometimes exactly what the analysis is looking for (you want points of high yield or, say, low cost)

Diagnostics for Leverage and influence

Leverage
o An observation with an extreme value on a
predictor variable is called a point with high
leverage
o Leverage is a measure of how far an independent
variable deviates from its mean
o These leverage points can have an effect on the
estimates of the regression coefficients
o Rule of thumb: leverage > (2p+2)/n, where p is the number of predictors and n the number of cases

Influential Observations
o An observation is said to be influential if removing
the observation substantially changes the estimates
of the coefficients
o Influence can be thought of as the product of
leverage and outlyingness
o Not all leverage points are going to be influential on
the regression coefficients
o It is desirable to consider both the location of the point
in the x-space and the response variable in
measuring the influence

o Measures (conventional cutoffs):
Cook's D (> 1), |DFFITS| (> $2\sqrt{p/n}$), |DFBETAS| (> $2/\sqrt{n}$)

SAS Code
In the OUTPUT statement of PROC REG, use
COOKD=name1
DFFITS=name2
H=name3 /* H is for leverage */
(you can also use the INFLUENCE option in the
MODEL statement for a detailed analysis)
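
Put together, the options above might be used as in the sketch below. The dataset `test`, the variables `y`, `x1`-`x4`, the output names `infl`, `cd`, `dff`, `lev` and the dataset `flagged` are all hypothetical. The flags use the (2p+2)/n leverage cutoff, Cook's D > 1, and the conventional DFFITS cutoff of 2*sqrt(p/n).

```sas
/* Hypothetical names: TEST, y, x1-x4; p = 4 predictors */
proc reg data=test;
   model y = x1 x2 x3 x4 / influence;          /* prints detailed influence statistics */
   output out=infl cookd=cd dffits=dff h=lev;  /* saves Cook's D, DFFITS and leverage */
run;
quit;

/* Number of cases, stored in macro variable N (assumes no missing values) */
proc sql noprint;
   select count(*) into :n from infl;
quit;

/* Flag observations that exceed the rule-of-thumb cutoffs */
data flagged;
   set infl;
   p = 4;
   high_leverage = (lev > (2*p + 2) / &n);
   high_cookd    = (cd > 1);
   high_dffits   = (abs(dff) > 2*sqrt(p / &n));
run;
```

A flagged point is a candidate for closer inspection, not automatically a point to delete.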
Multicollinearity

When explanatory variables are not independent (near-perfect
linear relationship among them)
Reasons
Faulty data collection method
Constraints on the model or in the population
Model specification
An over-defined model

Effect:
Unstable coefficient estimates
Inflated standard errors of coefficient estimates

Tools to detect
Examine the correlation matrix of the independent variables
Variance inflation factor (VIF > 10); tolerance is 1/VIF
Condition indices (> 1000)
Variance decomposition proportions
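
For reference, the detection measures above are usually defined as follows. The notation is assumed, not from the deck: $R_j^2$ is the coefficient of determination from regressing $X_j$ on the remaining regressors, and $\lambda_j$ are the eigenvalues of the scaled $X'X$ matrix.

```latex
VIF_j = \frac{1}{1 - R_j^2}, \qquad TOL_j = \frac{1}{VIF_j} = 1 - R_j^2

\kappa_j = \frac{\lambda_{\max}}{\lambda_j}   % condition index as an eigenvalue ratio
```

A VIF above 10 therefore corresponds to $R_j^2 > 0.9$. The threshold of 1000 for condition indices presumably refers to the eigenvalue ratio $\kappa_j$; the COLLIN output in SAS reports the square root of this ratio, so the printed values sit on a different scale.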

Remedies

Collecting additional data


Model respecification
Redefining the regressors
Variable elimination

SAS code
proc reg data=test;
model y=x1 x2/VIF TOL COLLINOINT;
/*COLLINOINT gives a detailed collinearity analysis with the intercept
adjusted out. The COLLIN option gives the same analysis with the
intercept included*/
run;
quit;

Linearity

Scatter plot or matrix plot

Plots variables against each other
A linear relationship can be confirmed by observing a straight-line trend

SAS Code
Proc sgscatter data=test;
Matrix x1 x2 x3 x4 / group=name;
Run;

Independence of error terms


We assume that error terms are independent of each
other
Dependence can arise when observations are collected over time:
the problem of autocorrelation

Durbin-Watson test (statistic close to 2 when error terms are uncorrelated)
Use the DW option in the MODEL statement of PROC REG to calculate the Durbin-Watson statistic

Dependence can also arise from grouping: students of the same school tend to be
more alike than students of other schools
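
The Durbin-Watson statistic mentioned above is usually written as follows, where $e_t$ is the residual at time $t$ and $r_1$ is the lag-1 autocorrelation of the residuals (notation assumed, not from the deck):

```latex
d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2} \approx 2\,(1 - r_1)
```

The approximation explains the "close to 2" rule: $d$ lies between 0 and 4, values near 0 suggest positive autocorrelation, and values near 4 suggest negative autocorrelation.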
