Multiple Linear Regression

By: Shruthi Reddy,Gadampalli

005927160
Training vs Validation Data Set

The training dataset is used to train (build) the model. To test the accuracy of
the values the model estimates from the training data, we set aside a part of the
original data, called the validation dataset.
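The partitioning step can be sketched as follows. This is a minimal illustration, assuming a simple random split; the 60/40 ratio, the seed, and the function name `split_train_validation` are assumptions for the example and are not stated on the slides.

```python
import random

def split_train_validation(rows, train_fraction=0.6, seed=12345):
    """Randomly partition rows into a training set and a validation set."""
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the partition is reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical usage: 506 record ids stand in for the rows of the data set.
records = list(range(506))
train, valid = split_train_validation(records)
print(len(train), len(valid))
```

The model is then fitted on `train` only, and `valid` is used solely to score its predictions.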
>> Fit a multiple linear regression model to the median house price (MEDV) as a function of CRIM, CHAS, and RM
Input variables: CRIM, CHAS, RM
Output variable: MEDV
Training Data Scoring Summary Report

Total sum of squared errors: 11759.04
RMS Error: 6.219409
Average Error: -4.10783E-15
Validation Data Scoring Summary Report

Total sum of squared errors: 7371.003
RMS Error: 6.040705
Average Error: 0.039038718
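The three figures in each scoring summary can be computed from the residuals as sketched below. This is an illustration only: the error is assumed to be actual minus predicted, the function name `scoring_summary` is hypothetical, and the numbers are toy values rather than Boston housing data.

```python
import math

def scoring_summary(actual, predicted):
    """Return (total sum of squared errors, RMS error, average error)."""
    # Assumed sign convention: error = actual - predicted.
    residuals = [a - p for a, p in zip(actual, predicted)]
    sse = sum(r * r for r in residuals)
    rms = math.sqrt(sse / len(residuals))
    avg = sum(residuals) / len(residuals)
    return sse, rms, avg

# Toy values for illustration only.
actual = [24.0, 22.0, 35.0, 33.0]
predicted = [25.0, 21.0, 33.0, 35.0]
sse, rms, avg = scoring_summary(actual, predicted)
print(sse, rms, avg)
```

An average error near zero only says that over- and under-predictions cancel out; the RMS error is the figure that reflects the typical size of a miss.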
Regression Model

Input Variable  Coefficient  Std. Error  t-Statistic  P-Value  CI Lower  CI Upper  RSS Reduction
Intercept       -28.3135     3.2925      -8.5993      0        -34.7929  -21.8341  153409.9
CRIM            -0.285       0.044       -6.4755      0        -0.3716   -0.1984   3128.863
CHAS            3.6893       1.4977      2.4634       0.0143   0.742     6.6365    773.5671
RM              8.2114       0.5195      15.805       0        7.189     9.2338    9791.29
>> Write the equation for predicting the median house price from
the predictors in the model.
>> What median house price is predicted for a tract in the Boston area that
does not bound the Charles River, has a crime rate of 0.1, and where the
average number of rooms per house is 6? What is the prediction error?
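Equation:
MEDV = -28.3135 + (-0.285*CRIM) + (3.6893*CHAS) + (8.2114*RM)
Prediction for CRIM = 0.1, CHAS = 0 (the tract does not bound the Charles River), RM = 6:
MEDV = -28.3135 + (-0.285*0.1) + (3.6893*0) + (8.2114*6)
     = -28.3135 - 0.0285 + 0 + 49.2684
     = 20.9264
MEDV is recorded in thousands of dollars, so the predicted median house price is about $20,926.40.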
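The same calculation can be checked with a few lines of code. This is a small sketch using the coefficients reported in the regression model; the function name `predict_medv` is an illustrative choice.

```python
# Coefficients as reported for the fitted regression model.
COEFFICIENTS = {"Intercept": -28.3135, "CRIM": -0.285, "CHAS": 3.6893, "RM": 8.2114}

def predict_medv(crim, chas, rm):
    """Predicted median house price (in thousands of dollars)."""
    c = COEFFICIENTS
    return c["Intercept"] + c["CRIM"] * crim + c["CHAS"] * chas + c["RM"] * rm

# Tract that does not bound the Charles River, crime rate 0.1, 6 rooms on average.
medv = predict_medv(crim=0.1, chas=0, rm=6)
print(round(medv, 4))
```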
>>Correlation table
Which predictors are likely to be measuring the same thing among the 14 predictors? Discuss the
relationships among INDUS, NOX, and TAX.
Correlation values:
INDUS & NOX    0.763
AGE & NOX      0.731
TAX & RAD      0.910
TAX & INDUS    0.7208
NOX & DIS     -0.769
DIS & AGE     -0.747
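All of these pairs have an absolute correlation above 0.7, so the variables in each pair are largely measuring the same thing. INDUS is strongly positively correlated with both NOX (0.763) and TAX (0.7208). After considering the highest and lowest correlations, we can eliminate INDUS, AGE and TAX.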
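The screening of correlated predictors can be sketched as below. The correlations are the ones listed above; the 0.7 cutoff and the idea of counting how often each variable appears in a strong pair are assumptions made for illustration, not a rule stated on the slides.

```python
from collections import Counter

# Pairwise correlations as listed above.
CORRELATIONS = {
    ("INDUS", "NOX"): 0.763,
    ("AGE", "NOX"): 0.731,
    ("TAX", "RAD"): 0.910,
    ("TAX", "INDUS"): 0.7208,
    ("NOX", "DIS"): -0.769,
    ("DIS", "AGE"): -0.747,
}
THRESHOLD = 0.7  # assumed cutoff for a "strong" correlation

# Keep pairs whose absolute correlation reaches the cutoff, strongest first.
strong_pairs = sorted(
    (pair for pair, r in CORRELATIONS.items() if abs(r) >= THRESHOLD),
    key=lambda pair: abs(CORRELATIONS[pair]),
    reverse=True,
)

# Variables that show up in many strong pairs are candidates for removal.
appearances = Counter(name for pair in strong_pairs for name in pair)

for a, b in strong_pairs:
    print(f"{a:6s} {b:6s} {CORRELATIONS[(a, b)]:+.4f}")
print(appearances.most_common())
```

Which variable of a correlated pair to drop remains a judgment call; the count only highlights where the redundancy is concentrated.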
Model 1
Total sum of squared errors: 4616.353
RMS Error: 4.780506
Average Error: -0.300841374
Model 2
Total sum of squared errors: 4686.579
RMS Error: 4.81673
Average Error: -0.22067477
[Figure: Lift chart (validation dataset). Cumulative MEDV when sorted using predicted values vs. cumulative MEDV using average; x-axis: # Cases, y-axis: Cumulative]
[Figure: Decile-wise lift chart (validation dataset). x-axis: Deciles, y-axis: Decile mean / Global mean]
Model 3
Total sum of squared errors: 4828.086
RMS Error: 4.888908
Average Error: -0.18015984
[Figure: Lift chart (validation dataset). Cumulative MEDV when sorted using predicted values vs. cumulative MEDV using average; x-axis: # Cases, y-axis: Cumulative]
[Figure: Decile-wise lift chart (validation dataset). x-axis: Deciles, y-axis: Decile mean / Global mean]
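The quantity plotted in a decile-wise lift chart (decile mean / global mean) can be sketched as follows. This is an illustrative implementation under the assumption that cases are ranked by predicted value in descending order and cut into ten equal groups; the function name `decile_lift` and the toy data are hypothetical.

```python
def decile_lift(actual, predicted, n_bins=10):
    """Mean of actual values per decile divided by the global mean.

    Cases are ranked by predicted value, highest first.
    Assumes there are at least n_bins cases so no group is empty.
    """
    order = sorted(range(len(actual)), key=lambda i: predicted[i], reverse=True)
    ranked = [actual[i] for i in order]
    global_mean = sum(ranked) / len(ranked)
    lifts = []
    for b in range(n_bins):
        start = b * len(ranked) // n_bins
        stop = (b + 1) * len(ranked) // n_bins
        chunk = ranked[start:stop]
        lifts.append((sum(chunk) / len(chunk)) / global_mean)
    return lifts

# Toy data: 20 cases where the predictions rank the cases perfectly.
actual = [float(v) for v in range(1, 21)]
predicted = list(actual)
lifts = decile_lift(actual, predicted)
print([round(x, 3) for x in lifts])
```

A useful model shows a first-decile ratio well above 1 and a steady decline across deciles; a model no better than the average would give ratios close to 1 everywhere.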
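Summary
Comparing the total sum of squared errors, RMS error and average error along
with the lift charts, we conclude that Model 1, with CRIM, ZN, CHAS, NOX, RM,
DIS, RAD, PTRATIO, B and LSTAT, is the best model for predicting Boston housing
prices. It has the lowest total sum of squared errors (4616.353) and RMS error
(4.780506) of the three models, although its average error (-0.300841374) is the
largest in magnitude.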
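The comparison can be made explicit with a short script. The figures are the ones reported for Models 1 to 3; ranking by RMS error is the criterion assumed here, and the dictionary layout is an illustrative choice.

```python
# Scoring figures as reported for the three Boston housing models.
MODELS = {
    "Model 1": {"sse": 4616.353, "rms": 4.780506, "avg_error": -0.300841374},
    "Model 2": {"sse": 4686.579, "rms": 4.81673, "avg_error": -0.22067477},
    "Model 3": {"sse": 4828.086, "rms": 4.888908, "avg_error": -0.18015984},
}

# Lowest RMS error: the model whose predictions are closest on average.
best_by_rms = min(MODELS, key=lambda m: MODELS[m]["rms"])

# Smallest absolute average error: the model with the least systematic bias.
smallest_bias = min(MODELS, key=lambda m: abs(MODELS[m]["avg_error"]))

print("Lowest RMS error:", best_by_rms)
print("Smallest |average error|:", smallest_bias)
```

The two criteria point to different models here, which is why the RMS error and the lift charts, rather than the average error alone, carry the conclusion.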
Correlation table for airfare
Distance is the best predictor for fare
Pivot table with the average fare
Converting categorical variables (e.g., SW) into dummy variables
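A dummy variable simply recodes a two-level category as 0 or 1 so that it can enter the regression. The sketch below assumes the SW column is coded as "Yes"/"No"; that coding, the sample values, and the helper name `to_dummy` are assumptions for illustration.

```python
def to_dummy(values, positive="Yes"):
    """Map a two-level categorical column to 1 (positive level) or 0 (otherwise)."""
    return [1 if v == positive else 0 for v in values]

# Hypothetical sample of the SW column (does Southwest serve the route?).
sw = ["Yes", "No", "No", "Yes"]
sw_dummy = to_dummy(sw)
print(sw_dummy)
```

A category with k levels needs k - 1 dummy columns; the omitted level acts as the baseline against which the coefficients are read.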
Stepwise Regression
The model with the highest adjusted R² is taken as the estimated best model.
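Adjusted R² penalizes R² for the number of predictors, which is why it is used to compare models of different sizes. A minimal sketch of the standard formula follows; the sample values for R², the number of observations and the number of predictors are hypothetical.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical example: the same R^2 reached with 12 and with 13 predictors.
smaller = adjusted_r2(0.78, n_obs=638, n_predictors=12)
larger = adjusted_r2(0.78, n_obs=638, n_predictors=13)
print(round(smaller, 4), round(larger, 4))
```

With the same R², the model that uses fewer predictors gets the higher adjusted R², so an extra variable has to improve the fit enough to justify itself.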
Model 1:
Total sum of squared errors: 787222.7657
RMS Error: 35.12679152
Average Error: 6.62863E-12
Model 2:
Total sum of squared errors: 793812.5792
RMS Error: 35.27350767
Average Error: -2.3941E-12
Model 3:
Total sum of squared errors: 807460.9178
RMS Error: 35.57545114
Average Error: -3.2331E-12
Exhaustive Search
We have 3 models based on the adjusted R² value.
From these 3 models we analyze the lift charts and the RMS error values and select the best-fitting model.
I decided to go with Model 2 after evaluating the RMS error in the analysis questions that follow.
Model 1:
Total sum of squared errors: 787222.8
RMS Error: 35.12679
Average Error: -5.5E-12
[Figure: Lift chart (training dataset). Cumulative FARE when sorted using predicted values vs. cumulative FARE using average; x-axis: # Cases, y-axis: Cumulative]
[Figure: Decile-wise lift chart (training dataset). x-axis: Deciles, y-axis: Decile mean / Global mean]
Model 2:
Total sum of squared errors: 785120.4
RMS Error: 35.07986
Average Error: 7.32E-09
[Figure: Lift chart (training dataset). Cumulative FARE when sorted using predicted values vs. cumulative FARE using average; x-axis: # Cases, y-axis: Cumulative]
[Figure: Decile-wise lift chart (training dataset). x-axis: Deciles, y-axis: Decile mean / Global mean]
Model 3:
Total sum of squared errors: 785001.1
RMS Error: 35.07719
Average Error: 2.4E-09
[Figure: Lift chart (training dataset). Cumulative FARE when sorted using predicted values vs. cumulative FARE using average; x-axis: # Cases, y-axis: Cumulative]
[Figure: Decile-wise lift chart (training dataset). x-axis: Deciles, y-axis: Decile mean / Global mean]
Average fare on route by the given characteristics
Using the formula
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_k x_k$
we calculate the average fare on the route for the values given in the question:
Y = 17.9917 + (NEW*2.4219) + (HI*0.0083) + (S_Income*0.0012) + (E_Income*0.0014) + (S_POP*0) + (E_POP*0) + (Distance*0.0758) + (PAX*-0.0009) + (Vacation*-35.7079) + (SW_ORD*-41.074) + (Slot_ORD*-16.3576) + (Gate_ORD*-20.6157)
Y = 17.9917 + (3*2.4219) + (4442.141*0.0083) + (28,760*0.0012) + (27,664*0.0014) + (4,557,004*0) + (3,195,503*0) + (1976*0.0758) + (12782*-0.0009) + (0*-35.7079) + (0*-41.074) + (1*-16.3576) + (1*-20.6157)
= 222.14
If Southwest decides to cover this route:

We set the SW dummy variable to 1 and keep all other values unchanged:
Y = 17.9917 + (3*2.4219) + (4442.141*0.0083) + (28,760*0.0012) + (27,664*0.0014) + (4,557,004*0) + (3,195,503*0) + (1976*0.0758) + (12782*-0.0009) + (0*-35.7079) + (1*-41.074) + (1*-16.3576) + (1*-20.6157)
The average fare if Southwest decides to cover the route would then be 181.067.
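Because only the SW dummy changes (from 0 to 1), the new prediction is the earlier one shifted by the SW coefficient. The sketch below uses the two figures reported above; the variable names are illustrative.

```python
SW_COEFFICIENT = -41.074   # coefficient on the SW dummy in the fitted model
fare_without_sw = 222.14   # reported prediction with SW = 0

# Every other predictor keeps its value, so the change in the prediction
# equals the coefficient times the change in the dummy (1 - 0).
fare_with_sw = fare_without_sw + SW_COEFFICIENT * (1 - 0)
print(round(fare_with_sw, 3))
```

This gives about 181.066, which agrees with the reported 181.067 up to rounding of the displayed figures.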
Factors not available for predicting the average fare from a new airport are:
Slot
Gate
SW
Exhaustive Search
The model with an R² value of 0.7139 is considered the best-fit model from the exhaustive search.
The average fare predicted with the given characteristics for this model is 195.157.
With Model 3, the predicted average fare was 222.14, a difference of 26.983 from this model, so the model considering all the factors is the best fit.
