
Regression Analysis

Dr Sunil D Lakdawala
Sunil_lakdawala@hotmail.com

Regression Line

Equation of Line Y = a + b*X


a: Intercept, b: Slope
Draw the line passing through (1,5) and (2,7)
Find out a and b
Predict the value of Y for X = 5
Is the relation direct?
Similarly, draw the line passing through (0,6) and (1,3)
Find a and b. Is the relation direct?
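These two warm-up exercises can be checked with a short sketch (plain Python; the point pairs come from the slide, everything else is illustrative):

```python
def line_through(p1, p2):
    """Return (a, b) for the line Y = a + b*X passing through two points."""
    (x1, y1), (x2, y2) = p1, p2
    b = (y2 - y1) / (x2 - x1)   # slope
    a = y1 - b * x1             # intercept
    return a, b

# Line through (1,5) and (2,7): b = 2, a = 3, so Y = 3 + 2*X
a, b = line_through((1, 5), (2, 7))
y_at_5 = a + b * 5              # prediction at X = 5; direct relation since b > 0

# Line through (0,6) and (1,3): b = -3, a = 6; inverse (not direct) relation
a2, b2 = line_through((0, 6), (1, 3))
```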

9/30/16

Simple Regression

It is bivariate linear regression - that is, the process of constructing a mathematical model or function that can be used to predict or determine one variable from another variable.
The variable to be predicted is called the dependent variable and is denoted by Y.
The predictor is called the independent variable or explanatory variable, and is denoted by X.


Regression Line (Cont)

Drawing the regression line for a scatter chart
See # of Passengers vs Cost
Find out the error, absolute error and squared error
Can we take the error? The absolute error? The squared error?
What are their characteristics?
Draw the line such that the squared error is least
The equation of the simple linear regression line is given by
Yi = a + b*Xi + εi
Minimize Σεi**2 by finding the best fit for a, b
Find out the values of a and b that minimize the squared error
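A minimal least-squares sketch using the b and a formulas from the next slide; the data here are made up for illustration (the slide's passenger/cost table is not reproduced):

```python
def fit_line(xs, ys):
    """Least-squares fit of Y = a + b*X (minimizes the sum of squared errors)."""
    n = len(xs)
    xm = sum(xs) / n
    ym = sum(ys) / n
    # b = (SUM XiYi - n*Xmean*Ymean) / (SUM Xi^2 - n*Xmean^2)
    b = (sum(x * y for x, y in zip(xs, ys)) - n * xm * ym) / \
        (sum(x * x for x in xs) - n * xm * xm)
    a = ym - b * xm              # a = Ymean - b*Xmean
    return a, b

# Illustrative data lying exactly on Y = 1 + 2*X
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```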


Problem

Let us consider the data displayed in the table.
The values in the first column denote the number of passengers for 12 five-hundred-mile commercial airline flights using Boeing 737s during the same season of the year.
We use these data to develop a regression model to predict cost from the number of passengers.


Regression Line (Cont)

b = (ΣXi*Yi - n*Xmean*Ymean) / (ΣXi**2 - n*Xmean**2)

a = Ymean - b*Xmean

Se (Standard Error) = sqrt(Σ(Yi - Yp)**2 / (n - 2))

Assumption: errors are normally distributed

What is the interpretation of the standard error? In comparison with the mean value of Y? In terms of percentage?
Look at the example of Cost vs Passengers
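The standard error can be sketched directly from the formula above; the data and the coefficients a, b below are hypothetical, not the Cost vs Passengers table:

```python
import math

def standard_error(xs, ys, a, b):
    """Se = sqrt(SUM (Yi - Yp)^2 / (n - 2)), with Yp = a + b*Xi."""
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))

# Hypothetical data: residuals around the line Y = X are +1, -1, +1, -1
se = standard_error([1, 2, 3, 4], [2, 1, 4, 3], 0.0, 1.0)  # SSE = 4, n-2 = 2
```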


Regression Line (Cont)

Interpretation of Standard Error (Cont)


What is the range of cost for 80 passengers with 95% confidence and with 90% confidence?
For n > 30, use the Z distribution
For n < 30, normality cannot be assumed; the t distribution must be used
What will be the range for the above problem?
The range is the same for every Y - is that true?
Assumption: one is predicting within the range of the data
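A rough sketch of the interval around a predicted value, using the Z distribution as the slide suggests for n > 30; the predicted cost and Se are illustrative numbers, and for n < 30 a t critical value should replace z:

```python
from statistics import NormalDist

def prediction_interval(y_pred, se, confidence=0.95):
    """Approximate interval Yp +/- z*Se using the normal (Z) distribution.
    Valid for n > 30; for small n, a t critical value should replace z."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # e.g. 1.96 for 95%
    return y_pred - z * se, y_pred + z * se

# Illustrative: predicted cost 100 (in some units), Se = 5
lo, hi = prediction_interval(100.0, 5.0, 0.95)
```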


Correlation Analysis

Degree to which one variable is linearly related to another
Coefficient of Determination:
r² = 1 - Σ(Yi - Yp)**2 / Σ(Yi - Ymean)**2 (between 0 and 1)
= 1 - Ratio
Ratio: variation between actual and predicted values w.r.t. variation of Yi from the mean (the unexplained part)
= Variation of Y around the regression line / Variation of Y around its own mean
r² = 1 - the ratio of the above two
r² = 0.78 means 78% of the variation of Y from Ymean is explained by the regression; 22% is not explained
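The coefficient of determination can be computed directly from this definition; the Y values and predictions below are illustrative:

```python
def r_squared(ys, preds):
    """r^2 = 1 - SUM(Yi - Yp)^2 / SUM(Yi - Ymean)^2."""
    ym = sum(ys) / len(ys)
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))   # unexplained variation
    sst = sum((y - ym) ** 2 for y in ys)                 # variation around the mean
    return 1 - sse / sst

# Illustrative: predictions track the data closely, so r^2 is near 1
r2 = r_squared([1, 2, 3, 4], [1.1, 1.9, 3.1, 3.9])
```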


INTERPRETATION of r2

If the first term (the unexplained sum of squares) is zero, the two variables are exactly linearly related, and the value will be 1
If the first term equals the second term, there is no relation, and the value is 0
Take the example of table 12-6 (DO), table 12-13, figure 12-13 and figure 12-14 to find r² (HENKE)
Calculate the values using Excel


Coefficient of Correlation

r = sqrt(r²)
See fig 12.16
If r = 0.6, how good is the regression? How much variation in Y is explained by the regression? (r² = 0.36, i.e. only 36%)


Inferences about population parameters

Instead of a point value of b, we want to find out a range for b with 90% confidence level
Find t for the given degrees of freedom and the given confidence level
Find sb (the standard error of b)

sb = Se / sqrt(ΣXi**2 - n*Xmean**2)

The range is b - t*sb to b + t*sb
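A sketch of the interval for b; the slope, Se, X values and t critical value are all illustrative (t would come from a t table for n - 2 degrees of freedom):

```python
import math

def slope_interval(b, se, xs, t_crit):
    """Range b +/- t*sb, where sb = Se / sqrt(SUM Xi^2 - n*Xmean^2)."""
    n = len(xs)
    xm = sum(xs) / n
    sb = se / math.sqrt(sum(x * x for x in xs) - n * xm * xm)
    return b - t_crit * sb, b + t_crit * sb

# Illustrative numbers: b = 2, Se = 1, five X values, t = 2.353 (from a t table)
lo, hi = slope_interval(2.0, 1.0, [1, 2, 3, 4, 5], 2.353)
```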


The Equation

The equation of the simple linear regression line is given by
Yi = β0 + β1*Xi + εi
Minimize Σεi**2 by finding the best fit for β0, β1

Table 5-2 Pulp Price Regression (Makridakis)
Table 5-4 PVC Regression (Makridakis)
Residual / Error
Outliers: observations with large residuals (Table 5-4)
Influential Observations: observations that have a great influence on the fitted equation. Usually they are extreme observations. (See the King-Kong problem, Fig 5-8, Makridakis)


The Equation (Cont)

Causation vs Correlation: X (weekly # of deaths by drowning) and Y (consumption of coke) might be highly related, but there may not be a causal relationship - both might be increasing in summer!
Lurking Variable: an explanatory variable not included in the regression that is highly related to both X and Y (e.g. season in the above case)
Confounding Variables: New car sales may depend upon both Price as well as Advertisement Expenditure. The last two are called confounding variables


The Equation (cont)


The regression analysis is performed under the following assumptions:

Y = β0 + β1*X + ε
1. Residual: (Y - ^Y) should be near zero. (Y - ^Y) is the residual, denoted by ε
2. Plot X vs (Y - ^Y): the residuals should be random (i.e. normally distributed) and should not have any trend (unlike Y = X**2)
3. Standard Error Se = sqrt(SUM Squared Error / (N - K - 1)); K = 1 for simple regression; SUM Squared Error = Σ(Y - ^Y)**2
Se should be acceptable (Se / Ymean gives a good idea of the error)
68% of residuals should be within Se
95% of residuals should be within 2*Se
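The 68%/95% residual check can be sketched as below; the residuals and Se are made-up values:

```python
def residual_coverage(residuals, se):
    """Fraction of residuals within 1*Se and within 2*Se.
    For well-behaved (normal) errors, expect roughly 68% and 95%."""
    n = len(residuals)
    within1 = sum(abs(r) <= se for r in residuals) / n
    within2 = sum(abs(r) <= 2 * se for r in residuals) / n
    return within1, within2

# Illustrative residuals with Se assumed to be 1.0
w1, w2 = residual_coverage([0.5, -0.3, 1.2, -0.8, 2.5, -0.1, 0.9, -1.5, 0.2, -0.6], 1.0)
```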


The Equation (Cont)


4. Correlation (r) is a measure of the linear association between the variables. Even if the variables have a strong nonlinear relationship, r might be very small (see Fig 5-7, Makridakis)
5. For small n, r is notoriously unstable. For n = 30 or more, it starts becoming stable
6. r can change drastically due to extreme values. (See the King-Kong problem, Fig 5-8, Makridakis, where just one extreme point changes r from 0.527 to 0.940.) What should we do?
7. Coefficient of Determination: R**2 should be high, towards 1. Interpretation of R² and r


The Equation (Cont)


8. P value should be smaller than 0.05 (i.e. 95% confidence) for rejecting the null hypothesis (β0 = 0 / β1 = 0)
F = t**2 = MS(Regression) / MS(Residual) (t value for β1)
Significance F = p value for β1
9. Adjusted R**2 = 1 - (Sum Squared Error / (N - K - 1)) / (Σ(Y - Ymean)**2 / (N - 1))
10. Should make common sense, i.e. when X changes by 1, Y changes by Slope. A +ve or -ve change should make common sense
11. Only prediction within the range from which the model was made is valid
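Adjusted R² can be computed in the algebraically equivalent form 1 - (1 - R²)(N - 1)/(N - K - 1); a small sketch applying it to the airline example's r² = 0.899 (n = 12 flights, one predictor):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1).
    Penalizes adding predictors; always <= R^2 for r2 < 1."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

adj = adjusted_r2(0.899, 12, 1)  # airline example: n = 12, one predictor
```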


The Equation (Cont)


12. Please see equation 5.19 for the error interval on a predicted value
13. Please see the equations for the coefficients and their error intervals on page 216
14. Residuals vs the explanatory variable should not have any pattern (no trend, no seasonality, etc.)
15. Residuals should have mean zero and should be normally distributed


Data and Analysis


Summary Output


Residuals

A residual is the difference between the actual Y value and the Y value predicted by the regression model, for each value of the dependent variable.


Residuals

The total of the squared residuals is called the sum of squares of error (SSE).
The standard error of the estimate is a standard deviation of the error of the regression model.


Coefficient of Determination

A widely used measure of fit for regression models is the coefficient of determination.
The coefficient of determination is the proportion of variability of the dependent variable (Y) accounted for or explained by the independent variable (X).
It is denoted by r².
It lies between 0 and 1.


r2 in Airlines Cost

r² = 0.899
This means that about 89.9% of the variability of the cost of flying a Boeing 737 airplane on a commercial flight is accounted for or predicted by the number of passengers.
This also means that about 10.1% of the variation in airline flight cost, Y, is unaccounted for by X or unexplained by the regression model.


Correlation

Correlation is a measure of association. It measures the strength of relatedness of two variables.
For example, we may be interested in determining the correlation between the prices of two stocks in the same industry.
How strong are these correlations?
The Pearson product-moment correlation coefficient is given by
r = Σ(Xi - Xmean)*(Yi - Ymean) / sqrt(Σ(Xi - Xmean)**2 * Σ(Yi - Ymean)**2)


Correlation
1. The measure is applicable only
if both variables being analyzed
have at least an interval level of
data.
2. r is a measure of the linear
correlation of two variables.
3. r = +1 denotes a perfect positive
relationship between two sets of
variables.
4. r = -1 denotes a perfect
negative correlation, which
indicates an inverse relationship
between two variables.
5. r = 0 means that there is no linear
relationship between the two
variables.
6. The coefficient of determination
= (correlation coefficient)**2 = r²

Factors to be taken care of

The plot of residuals vs X should be healthy (carry out the example Y = SQR(X))
Do not try to predict Y outside the range of the values used for building the model (try predicting for Y = SQR(X))


Multiple Regression Model

1. The general equation which describes the multiple regression model is given by
Yi = β0 + β1*X1 + β2*X2 + ... + βk*Xk + εi
Minimize Σεi**2 by finding the best βi
2. Assumptions made in the model are:
Residual: (Y - ^Y) should be near zero. (Y - ^Y) is the residual, denoted by ε
Plot Xi vs (Y - ^Y) for each Xi: it should be random and should not have any trend (unlike Y = X**2)
Standard Error Se = sqrt(SUM Squared Error / (N - K - 1)); K = # of independent variables; SUM Squared Error = Σ(Y - ^Y)**2
Se should be acceptable (Se / Ymean gives a good idea of the error)
68% of residuals should be within Se
95% of residuals should be within 2*Se


Multiple Regression Model (Cont)

3. Coefficient of Determination: R**2 should be high, towards 1
4. P value should be smaller than 0.05 (i.e. 95% confidence) for rejecting the null hypothesis (βi = 0)
F = t**2 = MS(Regression) / MS(Residual) (t value overall)
Significance F = p value overall. For rejecting the null hypothesis (β1 = 0 and β2 = 0 ...), this should be small
5. Should make common sense, i.e. when X changes by 1, Y changes by Slope. A +ve or -ve change should make common sense
6. Adjusted R**2 should not be very different from R**2. By adding more variables, one can always make R**2 large, but Adjusted R**2 might be much smaller than R**2


Problem

See Fig 6-1 Bankdata Regression
A real estate study was conducted in a small city to determine what variables, if any, are related to the market price of a home.
Several variables were explored, including the number of bedrooms, the number of bathrooms, the age of the house, the number of square feet of living space, the total number of square feet of space, and how many garages the house had.
Suppose that the business analyst wants to develop a regression model to predict the market price of a home by two variables: "total number of square feet in the house" and "age of the house".
The data are given in the table.


The Fitted Model

Y = 57.351 + 0.0177*X1 - 0.6663*X2

Interpretation:
The Y-intercept is equal to 57.351. In this example, the Y-intercept does not have any practical significance.
The coefficient of X1 (total number of square feet in the house) is 0.0177. This means that a 1-unit increase in square footage would result in a predicted increase of (0.0177)($1000) = $17.70 in the price of the home if the age were held constant.
The coefficient of X2 (age) is -0.6663. The negative sign on the coefficient denotes an inverse relationship between the age of a house and the price of the house: the older the house, the lower the price. In this case, if the total number of square feet in the house is kept constant, a 1-unit increase in the age of the house (1 year) will result in (-0.6663)($1000) = -$666.30, a predicted drop in the price.
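The fitted equation can be used directly for prediction; the coefficients are from the slide, while the square footage and age below are illustrative inputs (price is in $1000s):

```python
def predicted_price(sqft, age):
    """Fitted model from the slide: price ($1000s) = 57.351 + 0.0177*sqft - 0.6663*age."""
    return 57.351 + 0.0177 * sqft - 0.6663 * age

# Illustrative house: 2000 sq ft, 10 years old
p = predicted_price(2000, 10)   # 57.351 + 35.40 - 6.663 = 86.088, i.e. ~$86,088
```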


Testing the Model

r2 = 0.715

Testing the overall model

Significance tests of the regression coefficients


Analysis of Residuals


Multicollinearity

Multicollinearity refers to two or more independent variables of a multiple regression model being highly correlated. This causes problems in the interpretation of results.
In particular, these problems are:
It is difficult, if not impossible, to interpret the estimates of the regression coefficients.
Inordinately small t values for the regression coefficients may result.
The standard deviations of the regression coefficients are overestimated.
The algebraic sign of estimated regression coefficients may be the opposite of what would be expected for a particular predictor variable.

Search Procedures

All possible regressions
Take all possible combinations of K variables (2**K - 1 models). Choose the best model
Forward selection
Start with one variable: try out all variables one at a time and choose the best one
Then add a 2nd variable, and so on
Choose the best model
Backward elimination:
Start with all variables
Eliminate the one with the smallest t
Keep on repeating
Stepwise regression
Same as forward selection, but each time also check that every variable already included is still significant (acceptable p value)
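Forward selection can be sketched as a greedy loop; the scoring function here is a toy stand-in for fitting a model on a subset and reading off, say, adjusted R²:

```python
def forward_selection(candidates, score):
    """Greedy forward selection: repeatedly add the variable that most improves
    the score; stop when no addition helps. `score(subset)` is supplied by the
    caller, e.g. adjusted R^2 of the model fitted on that subset."""
    selected, best = [], score([])
    while True:
        gains = [(score(selected + [v]), v) for v in candidates if v not in selected]
        if not gains:
            break
        s, v = max(gains)
        if s <= best:          # no candidate improves the score: stop
            break
        selected.append(v)
        best = s
    return selected

# Toy score: pretend variables "a" and "c" are useful, "b" is pure noise
useful = {"a": 0.4, "b": 0.0, "c": 0.3}
sel = forward_selection(["a", "b", "c"], lambda subset: sum(useful[v] for v in subset))
```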


Factors to be taken care of

The value of R² could be inflated. Consider adjusted R²
A better model does not imply cause and effect between the independent variables and the dependent variable (some other factor might be causing both)
The value of a regression coefficient may not directly tell about its importance, because of:
Different units
Multicollinearity
Multicollinearity can create problems. To address the same:
Use search procedures
Use r between pairs of variables. For a large value, do not take both
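Checking pairwise r between candidate predictors is a quick multicollinearity screen; the data below are illustrative:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two variables."""
    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    cov = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - xm) ** 2 for x in xs))
    sy = math.sqrt(sum((y - ym) ** 2 for y in ys))
    return cov / (sx * sy)

# x2 is almost exactly a multiple of x1 -> near-perfect collinearity:
# include only one of the two in the model
x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 3.9, 6.0, 8.1, 9.9]
r = pearson_r(x1, x2)
```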


Non Linear Models (5/4 Makridakis)

Nonlinearity in parameters - more complex (one may be able to use a transformation to convert it into linear form in certain cases)
Nonlinearity in variables
Local Regression (see 5/4/3 of Makridakis)


Non-Linear Model
Y = β0 + β1*X1 + β2*X2**2 + ε
Choose Y1 = X1; Y2 = X2**2
Y = β0 + β1*X1 + β2*X1*X2 + ε
Choose Y1 = X1; Y2 = X1*X2
Y = β0*β1**X
Log(Y) = Log(β0) + X*Log(β1); now it is in linear form
Similarly Y = β0*X**β1 and Y = 1 / (β0 + β1*X1 + β2*X2) can be converted into linear regression
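The Y = β0·β1**X case can be sketched end-to-end: take logs, fit a straight line, transform back. The data are generated from assumed values β0 = 2, β1 = 1.5 (not from the text):

```python
import math

# Y = b0 * b1**X  ->  log(Y) = log(b0) + X*log(b1): fit a line to (X, log Y)
xs = [0, 1, 2, 3, 4]
ys = [2 * 1.5 ** x for x in xs]          # generated from assumed b0 = 2, b1 = 1.5
logs = [math.log(y) for y in ys]

# Ordinary least squares on the transformed data
n = len(xs)
xm, lm = sum(xs) / n, sum(logs) / n
slope = sum((x - xm) * (l - lm) for x, l in zip(xs, logs)) / \
        sum((x - xm) ** 2 for x in xs)
intercept = lm - slope * xm

# Transform back: b0 = exp(intercept), b1 = exp(slope)
b0, b1 = math.exp(intercept), math.exp(slope)
```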

Indicator (Dummy) Variable


Regression requires numeric values
For Gender, define Gender = 1 for Male and Gender = 0 for Female
For Region (N/S/W/E), do not map to 0, 0.33, 0.66 and 1.0 (Why? ... Unordered)
Define three variables, X1, X2 and X3:
X1 = 1 for North Region, 0 otherwise
X2 = 1 for South Region, 0 otherwise
X3 = 1 for West Region, 0 otherwise
X1 = X2 = X3 = 0 represents East Region
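The region encoding above can be sketched as:

```python
def region_dummies(region):
    """Map Region N/S/W/E to three 0/1 indicators (X1, X2, X3).
    East is the baseline, represented by (0, 0, 0)."""
    return (int(region == "N"), int(region == "S"), int(region == "W"))

rows = [region_dummies(r) for r in ["N", "S", "W", "E"]]
```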


Others (Pg 270 Makridakis)

Trading day variation
Introduce seven variables: T1 = # of Mondays in the month, T2 = # of Tuesdays in the month, ...
Holiday Effect
V = 1 if Diwali falls in this month (or part of Diwali)

Interventions (Pg 271 Makridakis)

Seat belt legislation was introduced
Due to that, car accidents went down
Introduce a dummy variable I = 0 (before seat belt legislation) and I = 1 (after seat belt legislation)
More complex models can be introduced if the effect is spread over some time
See figure 8-15 for the intervention variable


Effect of Advertising Expenditure on Sales (Pg 271 Makridakis)

Monthly sales is the output variable and monthly advertisement expense is one of the input variables
The effect of advertisement expense lasts (say) for 3 months
One can model it as follows:
Yt = b0 + b1*X1,t + b2*X1,t-1 + b3*X1,t-2 + ...


Miscellaneous
Variance-Covariance Matrix
Vector X = (X1, X2, X3, ...)
Let μ(i) be the arithmetic average of X(i)
σ(i,j) = Σk (X(i,k) - μ(i))*(X(j,k) - μ(j)) / N
σ(i,i) is the variance of X(i) (the diagonal of the matrix)
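A direct sketch of this definition; the two columns below are illustrative (X2 = 2·X1, so the off-diagonal covariance is large):

```python
def covariance_matrix(data):
    """data[i] is the list of observations of X(i).
    sigma(i,j) = SUM_k (X(i,k) - mu(i)) * (X(j,k) - mu(j)) / N."""
    n = len(data[0])
    mus = [sum(col) / n for col in data]          # mu(i) for each variable
    return [[sum((data[i][k] - mus[i]) * (data[j][k] - mus[j]) for k in range(n)) / n
             for j in range(len(data))]
            for i in range(len(data))]

# Illustrative: two variables with X2 = 2*X1
m = covariance_matrix([[1, 2, 3], [2, 4, 6]])
```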

