
Regression Analysis

Dr Sunil D Lakdawala
Sunil_lakdawala@hotmail.com

Regression Line

Equation of Line Y = a + b*X


a: Intercept, b: Slope
Draw the line passing through (1,5) and (2,7)
Find out a and b
Predict the value of Y for X = 5
Is the relation direct?
Similarly, draw the line passing through (0,6) and (1,3)
Find a and b. Is the relation direct?
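These two warm-up exercises can be checked with a short sketch (plain Python; the point pairs come from the slide, everything else is illustrative):

```python
def line_through(p1, p2):
    """Return (a, b) for the line Y = a + b*X passing through two points."""
    (x1, y1), (x2, y2) = p1, p2
    b = (y2 - y1) / (x2 - x1)   # slope
    a = y1 - b * x1             # intercept
    return a, b

# Line through (1,5) and (2,7): b = 2, a = 3, so Y = 3 + 2*X
a, b = line_through((1, 5), (2, 7))
y_at_5 = a + b * 5              # prediction at X = 5; direct relation since b > 0

# Line through (0,6) and (1,3): b = -3, a = 6; inverse (not direct) relation
a2, b2 = line_through((0, 6), (1, 3))
```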

9/30/16

Simple Regression

It is bivariate linear regression - that is, the process of constructing a mathematical model or function that can be used to predict or determine one variable from another variable.
The variable to be predicted is called the dependent variable and is denoted by Y.
The predictor is called the independent variable or explanatory variable, and is denoted by X.


Regression Line (Cont)

Drawing the regression line for a scatter chart
See # of Passengers vs Cost
Find out the error, absolute error and squared error
Can we take the error? The absolute error? The squared error?
What are their characteristics?
Draw the line such that the squared error is least
The equation of the simple linear regression line is given by
Yi = a + b*Xi + εi
Minimize Σεi**2 by finding the best fit for a, b
Find out the values of a and b that minimize the squared error
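A minimal least-squares sketch using the b and a formulas from the next slide; the data here are made up for illustration (the slide's passenger/cost table is not reproduced):

```python
def fit_line(xs, ys):
    """Least-squares fit of Y = a + b*X (minimizes the sum of squared errors)."""
    n = len(xs)
    xm = sum(xs) / n
    ym = sum(ys) / n
    # b = (SUM XiYi - n*Xmean*Ymean) / (SUM Xi^2 - n*Xmean^2)
    b = (sum(x * y for x, y in zip(xs, ys)) - n * xm * ym) / \
        (sum(x * x for x in xs) - n * xm * xm)
    a = ym - b * xm              # a = Ymean - b*Xmean
    return a, b

# Illustrative data lying exactly on Y = 1 + 2*X
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```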


Problem

Let us consider the data displayed in the table.
The values in the first column denote the number of passengers for 12 five-hundred-mile commercial airline flights using Boeing 737s during the same season of the year.
We use these data to develop a regression model to predict cost from the number of passengers.


Regression Line (Cont)

b = (ΣXi*Yi - n*Xmean*Ymean) / (ΣXi**2 - n*Xmean**2)

a = Ymean - b*Xmean

Se (Standard Error) = sqrt(Σ(Yi - Yp)**2 / (n - 2))

Assumption: errors are normally distributed

What is the interpretation of the standard error? In comparison with the mean value of Y? In terms of percentage?
Look at the example of Cost vs Passengers
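The standard error can be sketched directly from the formula above; the data and the coefficients a, b below are hypothetical, not the Cost vs Passengers table:

```python
import math

def standard_error(xs, ys, a, b):
    """Se = sqrt(SUM (Yi - Yp)^2 / (n - 2)), with Yp = a + b*Xi."""
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))

# Hypothetical data: residuals around the line Y = X are +1, -1, +1, -1
se = standard_error([1, 2, 3, 4], [2, 1, 4, 3], 0.0, 1.0)  # SSE = 4, n-2 = 2
```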


Regression Line (Cont)

Interpretation of Standard Error (Cont)


What is the range of cost for 80 passengers with 95% confidence and with 90% confidence?
For n > 30, use the Z distribution
For n < 30, normality cannot be assumed; the t distribution must be used
What will be the range for the above problem?
The range is the same for every Y - is that true?
Assumption: one is predicting within the range of the data
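A rough sketch of the interval around a predicted value, using the Z distribution as the slide suggests for n > 30; the predicted cost and Se are illustrative numbers, and for n < 30 a t critical value should replace z:

```python
from statistics import NormalDist

def prediction_interval(y_pred, se, confidence=0.95):
    """Approximate interval Yp +/- z*Se using the normal (Z) distribution.
    Valid for n > 30; for small n, a t critical value should replace z."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # e.g. 1.96 for 95%
    return y_pred - z * se, y_pred + z * se

# Illustrative: predicted cost 100 (in some units), Se = 5
lo, hi = prediction_interval(100.0, 5.0, 0.95)
```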


Correlation Analysis

Degree to which one variable is linearly related to another
Coefficient of Determination:
r² = 1 - Σ(Yi - Yp)**2 / Σ(Yi - Ymean)**2 (between 0 and 1)
= 1 - Ratio
Ratio: variation between actual and predicted values w.r.t. variation of Yi from the mean (the unexplained part)
= Variation of Y around the regression line / Variation of Y around its own mean
r² = 1 - the ratio of the above two
r² = 0.78 means 78% of the variation of Y from Ymean is explained by the regression; 22% is not explained
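The coefficient of determination can be computed directly from this definition; the Y values and predictions below are illustrative:

```python
def r_squared(ys, preds):
    """r^2 = 1 - SUM(Yi - Yp)^2 / SUM(Yi - Ymean)^2."""
    ym = sum(ys) / len(ys)
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))   # unexplained variation
    sst = sum((y - ym) ** 2 for y in ys)                 # variation around the mean
    return 1 - sse / sst

# Illustrative: predictions track the data closely, so r^2 is near 1
r2 = r_squared([1, 2, 3, 4], [1.1, 1.9, 3.1, 3.9])
```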


INTERPRETATION of r2

If the first term (the unexplained sum of squares) is zero, the two variables are exactly linearly related, and the value will be 1
If the first term equals the second term, there is no relation, and the value is 0
Take the example of table 12-6 (DO), table 12-13, figure 12-13 and figure 12-14 to find r² (HENKE)
Calculate the values using Excel


Coefficient of Correlation

r = sqrt(r²)
See fig 12.16
If r = 0.6, how good is the regression? How much variation in Y is explained by the regression? (r² = 0.36, i.e. only 36%)


Inferences about population parameters

Instead of a point value of b, we want to find out a range for b with 90% confidence level
Find t for the given degrees of freedom and the given confidence level
Find sb (the standard error of b)

sb = Se / sqrt(ΣXi**2 - n*Xmean**2)

The range is b - t*sb to b + t*sb
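A sketch of the interval for b; the slope, Se, X values and t critical value are all illustrative (t would come from a t table for n - 2 degrees of freedom):

```python
import math

def slope_interval(b, se, xs, t_crit):
    """Range b +/- t*sb, where sb = Se / sqrt(SUM Xi^2 - n*Xmean^2)."""
    n = len(xs)
    xm = sum(xs) / n
    sb = se / math.sqrt(sum(x * x for x in xs) - n * xm * xm)
    return b - t_crit * sb, b + t_crit * sb

# Illustrative numbers: b = 2, Se = 1, five X values, t = 2.353 (from a t table)
lo, hi = slope_interval(2.0, 1.0, [1, 2, 3, 4, 5], 2.353)
```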


The Equation

The equation of the simple linear regression line is given by
Yi = β0 + β1*Xi + εi
Minimize Σεi**2 by finding the best fit for β0, β1

Table 5-2 Pulp Price Regression (Makridakis)
Table 5-4 PVC Regression (Makridakis)
Residual / Error
Outliers: observations with large residuals (Table 5-4)
Influential Observations: observations that have a great influence on the fitted equation. Usually they are extreme observations. (See the King-Kong problem, Fig 5-8, Makridakis)


The Equation (Cont)

Causation vs Correlation: X (weekly # of deaths by drowning) and Y (consumption of coke) might be highly related, but there may not be a causal relationship - both might be increasing in summer!
Lurking Variable: an explanatory variable not included in the regression that is highly related to both X and Y (e.g. season in the above case)
Confounding Variables: New car sales may depend upon both Price as well as Advertisement Expenditure. The last two are called confounding variables


The Equation (cont)


The regression analysis is performed under the following assumptions:

Y = β0 + β1*X + ε
1. Residual: (Y - ^Y) should be near zero. (Y - ^Y) is the residual, denoted by ε
2. Plot X vs (Y - ^Y): the residuals should be random (i.e. normally distributed) and should not have any trend (unlike Y = X**2)
3. Standard Error Se = sqrt(SUM Squared Error / (N - K - 1)); K = 1 for simple regression; SUM Squared Error = Σ(Y - ^Y)**2
Se should be acceptable (Se / Ymean gives a good idea of the error)
68% of residuals should be within Se
95% of residuals should be within 2*Se
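The 68%/95% residual check can be sketched as below; the residuals and Se are made-up values:

```python
def residual_coverage(residuals, se):
    """Fraction of residuals within 1*Se and within 2*Se.
    For well-behaved (normal) errors, expect roughly 68% and 95%."""
    n = len(residuals)
    within1 = sum(abs(r) <= se for r in residuals) / n
    within2 = sum(abs(r) <= 2 * se for r in residuals) / n
    return within1, within2

# Illustrative residuals with Se assumed to be 1.0
w1, w2 = residual_coverage([0.5, -0.3, 1.2, -0.8, 2.5, -0.1, 0.9, -1.5, 0.2, -0.6], 1.0)
```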


The Equation (Cont)


4. Correlation (r) is a measure of the linear association between the variables. Even if the variables have a strong nonlinear relationship, r might be very small (see Fig 5-7, Makridakis)
5. For small n, r is notoriously unstable. For n = 30 or more, it starts becoming stable
6. r can change drastically due to extreme values. (See the King-Kong problem, Fig 5-8, Makridakis, where just one extreme point changes r from 0.527 to 0.940.) What should we do?
7. Coefficient of Determination: R**2 should be high, towards 1. Interpretation of R² and r


The Equation (Cont)


8. P value should be smaller than 0.05 (i.e. 95% confidence) for rejecting the null hypothesis (β0 = 0 / β1 = 0)
F = t**2 = MS(Regression) / MS(Residual) (t value for β1)
Significance F = p value for β1
9. Adjusted R**2 = 1 - (Sum Squared Error / (N - K - 1)) / (Σ(Y - Ymean)**2 / (N - 1))
10. Should make common sense, i.e. when X changes by 1, Y changes by Slope. A +ve or -ve change should make common sense
11. Only prediction within the range from which the model was made is valid
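Adjusted R² can be computed in the algebraically equivalent form 1 - (1 - R²)(N - 1)/(N - K - 1); a small sketch applying it to the airline example's r² = 0.899 (n = 12 flights, one predictor):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1).
    Penalizes adding predictors; always <= R^2 for r2 < 1."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

adj = adjusted_r2(0.899, 12, 1)  # airline example: n = 12, one predictor
```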


The Equation (Cont)


12. Please see equation 5.19 for the error interval on a predicted value
13. Please see the equations for the coefficients and their error intervals on page 216
14. Residuals vs the explanatory variable should not have any pattern (no trend, no seasonality, etc.)
15. Residuals should have mean zero and should be normally distributed


Data and Analysis


Summary Output


Residuals

A residual is the difference between the actual Y value and the Y value predicted by the regression model, for each value of the dependent variable.


Residuals

The total of the squared residuals is called the sum of squares of error (SSE).
The standard error of the estimate is a standard deviation of the error of the regression model.


Coefficient of Determination

A widely used measure of fit for regression models is the coefficient of determination.
The coefficient of determination is the proportion of variability of the dependent variable (Y) accounted for or explained by the independent variable (X).
It is denoted by r².
It lies between 0 and 1.


r2 in Airlines Cost

r² = 0.899
This means that about 89.9% of the variability of the cost of flying a Boeing 737 airplane on a commercial flight is accounted for or predicted by the number of passengers.
This also means that about 10.1% of the variation in airline flight cost, Y, is unaccounted for by X or unexplained by the regression model.


Correlation

Correlation is a measure of association. It measures the strength of relatedness of two variables.
For example, we may be interested in determining the correlation between the prices of two stocks in the same industry.
How strong are these correlations?
The Pearson product-moment correlation coefficient is given by
r = Σ(Xi - Xmean)*(Yi - Ymean) / sqrt(Σ(Xi - Xmean)**2 * Σ(Yi - Ymean)**2)


Correlation
1. The measure is applicable only
if both variables being analyzed
have at least an interval level of
data.
2. r is a measure of the linear
correlation of two variables.
3. r = +1 denotes a perfect positive
relationship between two sets of
variables.
4. r = -1 denotes a perfect
negative correlation, which
indicates an inverse relationship
between two variables.
5. r = 0 means that there is no linear
relationship between the two
variables.
6. The coefficient of determination
= (correlation coefficient)**2 = r²

Factors to be taken care of

The plot of residuals vs X should be healthy (carry out the example Y = SQR(X))
Do not try to predict Y outside the range of the values used for building the model (try predicting for Y = SQR(X))


Multiple Regression Model

1. The general equation which describes the multiple regression model is given by
Yi = β0 + β1*X1 + β2*X2 + ... + βk*Xk + εi
Minimize Σεi**2 by finding the best βi
2. Assumptions made in the model are:
Residual: (Y - ^Y) should be near zero. (Y - ^Y) is the residual, denoted by ε
Plot Xi vs (Y - ^Y) for each Xi: it should be random and should not have any trend (unlike Y = X**2)
Standard Error Se = sqrt(SUM Squared Error / (N - K - 1)); K = # of independent variables; SUM Squared Error = Σ(Y - ^Y)**2
Se should be acceptable (Se / Ymean gives a good idea of the error)
68% of residuals should be within Se
95% of residuals should be within 2*Se


Multiple Regression Model (Cont)

3. Coefficient of Determination: R**2 should be high, towards 1
4. P value should be smaller than 0.05 (i.e. 95% confidence) for rejecting the null hypothesis (βi = 0)
F = t**2 = MS(Regression) / MS(Residual) (t value overall)
Significance F = p value overall. For rejecting the null hypothesis (β1 = 0 and β2 = 0 ...), this should be small
5. Should make common sense, i.e. when X changes by 1, Y changes by Slope. A +ve or -ve change should make common sense
6. Adjusted R**2 should not be very different from R**2. By adding more variables, one can always make R**2 large, but Adjusted R**2 might be much smaller than R**2


Problem

See Fig 6-1 Bankdata Regression
A real estate study was conducted in a small city to determine what variables, if any, are related to the market price of a home.
Several variables were explored, including the number of bedrooms, the number of bathrooms, the age of the house, the number of square feet of living space, the total number of square feet of space, and how many garages the house had.
Suppose that the business analyst wants to develop a regression model to predict the market price of a home by two variables: "total number of square feet in the house" and "age of the house".
The data are given in the table.


The Fitted Model

Y = 57.351 + 0.0177*X1 - 0.6663*X2

Interpretation:
The Y-intercept is equal to 57.351. In this example, the Y-intercept does not have any practical significance.
The coefficient of X1 (total number of square feet in the house) is 0.0177. This means that a 1-unit increase in square footage would result in a predicted increase of (0.0177)($1000) = $17.70 in the price of the home if the age were held constant.
The coefficient of X2 (age) is -0.6663. The negative sign on the coefficient denotes an inverse relationship between the age of a house and the price of the house: the older the house, the lower the price. In this case, if the total number of square feet in the house is kept constant, a 1-unit increase in the age of the house (1 year) will result in (-0.6663)($1000) = -$666.30, a predicted drop in the price.
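The fitted equation can be used directly for prediction; the coefficients are from the slide, while the square footage and age below are illustrative inputs (price is in $1000s):

```python
def predicted_price(sqft, age):
    """Fitted model from the slide: price ($1000s) = 57.351 + 0.0177*sqft - 0.6663*age."""
    return 57.351 + 0.0177 * sqft - 0.6663 * age

# Illustrative house: 2000 sq ft, 10 years old
p = predicted_price(2000, 10)   # 57.351 + 35.40 - 6.663 = 86.088, i.e. ~$86,088
```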


Testing the Model

r2 = 0.715

Testing the overall model

Significance tests of the regression coefficients


Analysis of Residuals


Multicollinearity

Multicollinearity refers to two or more independent variables of a multiple regression model being highly correlated. This causes problems in the interpretation of results.
In particular, these problems are:
It is difficult, if not impossible, to interpret the estimates of the regression coefficients.
Inordinately small t values for the regression coefficients may result.
The standard deviations of the regression coefficients are overestimated.
The algebraic sign of estimated regression coefficients may be the opposite of what would be expected for a particular predictor variable.

Search Procedures

All possible regressions
Take all possible combinations of K variables (2**K - 1 models). Choose the best model
Forward selection
Start with one variable: try out all variables one at a time and choose the best one
Then add a 2nd variable, and so on
Choose the best model
Backward elimination:
Start with all variables
Eliminate the one with the smallest t
Keep on repeating
Stepwise regression
Same as forward selection, but each time also check that every variable already included is still significant (acceptable p value)
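Forward selection can be sketched as a greedy loop; the scoring function here is a toy stand-in for fitting a model on a subset and reading off, say, adjusted R²:

```python
def forward_selection(candidates, score):
    """Greedy forward selection: repeatedly add the variable that most improves
    the score; stop when no addition helps. `score(subset)` is supplied by the
    caller, e.g. adjusted R^2 of the model fitted on that subset."""
    selected, best = [], score([])
    while True:
        gains = [(score(selected + [v]), v) for v in candidates if v not in selected]
        if not gains:
            break
        s, v = max(gains)
        if s <= best:          # no candidate improves the score: stop
            break
        selected.append(v)
        best = s
    return selected

# Toy score: pretend variables "a" and "c" are useful, "b" is pure noise
useful = {"a": 0.4, "b": 0.0, "c": 0.3}
sel = forward_selection(["a", "b", "c"], lambda subset: sum(useful[v] for v in subset))
```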


Factors to be taken care of

The value of R² could be inflated. Consider adjusted R²
A better model does not imply cause and effect between the independent variables and the dependent variable (some other factor might be causing both)
The value of a regression coefficient may not directly tell about its importance, because of:
Different units
Multicollinearity
Multicollinearity can create problems. To address the same:
Use search procedures
Use r between pairs of variables. For a large value, do not take both
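Checking pairwise r between candidate predictors is a quick multicollinearity screen; the data below are illustrative:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two variables."""
    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    cov = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - xm) ** 2 for x in xs))
    sy = math.sqrt(sum((y - ym) ** 2 for y in ys))
    return cov / (sx * sy)

# x2 is almost exactly a multiple of x1 -> near-perfect collinearity:
# include only one of the two in the model
x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 3.9, 6.0, 8.1, 9.9]
r = pearson_r(x1, x2)
```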


Non Linear Models (5/4 Makridakis)

Nonlinearity in parameters - more complex (one may be able to use a transformation to convert it into linear form in certain cases)
Nonlinearity in variables
Local Regression (see 5/4/3 of Makridakis)


Non-Linear Model
Y = β0 + β1*X1 + β2*X2**2 + ε
Choose Y1 = X1; Y2 = X2**2
Y = β0 + β1*X1 + β2*X1*X2 + ε
Choose Y1 = X1; Y2 = X1*X2
Y = β0*β1**X
Log(Y) = Log(β0) + X*Log(β1); now it is in linear form
Similarly Y = β0*X**β1 and Y = 1 / (β0 + β1*X1 + β2*X2) can be converted into linear regression
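The Y = β0·β1**X case can be sketched end-to-end: take logs, fit a straight line, transform back. The data are generated from assumed values β0 = 2, β1 = 1.5 (not from the text):

```python
import math

# Y = b0 * b1**X  ->  log(Y) = log(b0) + X*log(b1): fit a line to (X, log Y)
xs = [0, 1, 2, 3, 4]
ys = [2 * 1.5 ** x for x in xs]          # generated from assumed b0 = 2, b1 = 1.5
logs = [math.log(y) for y in ys]

# Ordinary least squares on the transformed data
n = len(xs)
xm, lm = sum(xs) / n, sum(logs) / n
slope = sum((x - xm) * (l - lm) for x, l in zip(xs, logs)) / \
        sum((x - xm) ** 2 for x in xs)
intercept = lm - slope * xm

# Transform back: b0 = exp(intercept), b1 = exp(slope)
b0, b1 = math.exp(intercept), math.exp(slope)
```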

Indicator (Dummy) Variable


Regression requires numeric values
For Gender, define Gender = 1 for Male and Gender = 0 for Female
For Region (N/S/W/E), do not map to 0, 0.33, 0.66 and 1.0 (Why? ... Unordered)
Define three variables, X1, X2 and X3:
X1 = 1 for North Region, 0 otherwise
X2 = 1 for South Region, 0 otherwise
X3 = 1 for West Region, 0 otherwise
X1 = X2 = X3 = 0 represents East Region
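The region encoding above can be sketched as:

```python
def region_dummies(region):
    """Map Region N/S/W/E to three 0/1 indicators (X1, X2, X3).
    East is the baseline, represented by (0, 0, 0)."""
    return (int(region == "N"), int(region == "S"), int(region == "W"))

rows = [region_dummies(r) for r in ["N", "S", "W", "E"]]
```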


Others (Pg 270 Makridakis)

Trading day variation
Introduce seven variables: T1 = # of Mondays in the month, T2 = # of Tuesdays in the month, ...
Holiday Effect
V = 1 if Diwali falls in this month (or part of Diwali)

Interventions (Pg 271 Makridakis)

Seat belt legislation was introduced
Due to that, car accidents went down
Introduce a dummy variable I = 0 (before seat belt legislation) and I = 1 (after seat belt legislation)
More complex models can be introduced if the effect is spread over some time
See figure 8-15 for the intervention variable


Effect of Advertising Expenditure on Sales (Pg 271 Makridakis)

Monthly sales is the output variable and monthly advertisement expense is one of the input variables
The effect of advertisement expense lasts (say) for 3 months
One can model it as follows:
Yt = b0 + b1*X1,t + b2*X1,t-1 + b3*X1,t-2 + ...


Miscellaneous
Variance-Covariance Matrix
Vector X = (X1, X2, X3, ...)
Let μ(i) be the arithmetic average of X(i)
σ(i,j) = Σk (X(i,k) - μ(i))*(X(j,k) - μ(j)) / N
σ(i,i) is the variance of X(i) (the diagonal of the matrix)
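A direct sketch of this definition; the two columns below are illustrative (X2 = 2·X1, so the off-diagonal covariance is large):

```python
def covariance_matrix(data):
    """data[i] is the list of observations of X(i).
    sigma(i,j) = SUM_k (X(i,k) - mu(i)) * (X(j,k) - mu(j)) / N."""
    n = len(data[0])
    mus = [sum(col) / n for col in data]          # mu(i) for each variable
    return [[sum((data[i][k] - mus[i]) * (data[j][k] - mus[j]) for k in range(n)) / n
             for j in range(len(data))]
            for i in range(len(data))]

# Illustrative: two variables with X2 = 2*X1
m = covariance_matrix([[1, 2, 3], [2, 4, 6]])
```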

