
Multiple Linear Regression

By: Shruthi Reddy,Gadampalli


005927160

Training vs Validation Data Set

The training dataset is used to build (fit) the model. To test the accuracy of
the values the model estimates from the training data, we set aside a part of
the original data, called the validation dataset.
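
The partitioning step can be sketched in a few lines. This is a minimal illustration, not the tool's own procedure: the 60/40 split fraction, the random seed, and the function name `partition` are assumptions, and plain row indices stand in for the 506 Boston housing records.

```python
import random

def partition(records, train_fraction=0.6, seed=1):
    """Randomly split records into (training, validation) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Row indices 0..505 stand in for the housing tracts (assumed split: 60% / 40%).
train, valid = partition(range(506))
print(len(train), len(valid))
```

The model is fitted on `train` only; `valid` is held back and used solely to score the fitted model.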

>> Fit a multiple linear regression model to the median house price (MEDV) as a function of CRIM, CHAS, and RM

Inputs: CRIM, CHAS, RM
Output: MEDV

Training Data Scoring Summary Report

Total sum of squared errors: 11759.04
RMS Error: 6.219409
Average Error: -4.10783E-15

Validation Data Scoring Summary Report

Total sum of squared errors: 7371.003
RMS Error: 6.040705
Average Error: 0.039038718
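
The three figures in a scoring summary all come from the residuals (actual minus predicted). The sketch below shows one way to compute them; the function name `scoring_summary` and the four sample values are hypothetical and are not taken from the housing data.

```python
import math

def scoring_summary(actual, predicted):
    """Return (total sum of squared errors, RMS error, average error)."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    sse = sum(r * r for r in residuals)
    rms = math.sqrt(sse / len(residuals))
    avg = sum(residuals) / len(residuals)
    return sse, rms, avg

# Hypothetical actual and predicted values, for illustration only.
actual = [24.0, 21.5, 34.5, 33.0]
predicted = [25.0, 20.5, 32.5, 35.0]
sse, rms, avg = scoring_summary(actual, predicted)
print(sse, round(rms, 4), avg)

# Consistency check on the validation report: SSE / RMS^2 recovers the case count.
n_validation = round(7371.003 / 6.040705 ** 2)
print(n_validation)
```

Because RMS error is the square root of SSE divided by the number of cases, the reported validation figures imply about 202 validation cases.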

Regression Model

Input Variable   Coefficient   Std. Error   t-Statistic   P-Value   CI Lower    CI Upper    RSS Reduction
Intercept        -28.3135      3.2925       -8.5993       0         -34.7929    -21.8341    153409.9
CRIM             -0.285        0.044        -6.4755       0         -0.3716     -0.1984     3128.863
CHAS             3.6893        1.4977       2.4634        0.0143    0.742       6.6365      773.5671
RM               8.2114        0.5195       15.805        0         7.189       9.2338      9791.29
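
The columns of the regression table are related: each t-statistic is the coefficient divided by its standard error, and each confidence interval is the coefficient plus or minus a critical value times the standard error. The sketch below recomputes both from the reported coefficients. The critical value 1.968 is an assumption (roughly the two-sided 95% t value for about 300 degrees of freedom), and small differences from the table are expected because the inputs are rounded.

```python
# Coefficient, standard error, and reported t-statistic, copied from the regression table.
rows = {
    "Intercept": (-28.3135, 3.2925, -8.5993),
    "CRIM": (-0.285, 0.044, -6.4755),
    "CHAS": (3.6893, 1.4977, 2.4634),
    "RM": (8.2114, 0.5195, 15.805),
}

# Assumption: two-sided 95% t critical value for roughly 300 degrees of freedom.
T_CRIT = 1.968

results = {}
for name, (coef, se, t_reported) in rows.items():
    t_stat = coef / se
    ci = (coef - T_CRIT * se, coef + T_CRIT * se)
    results[name] = (t_stat, ci)
    print(f"{name:9s} t = {t_stat:8.4f} (reported {t_reported})  "
          f"CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```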

>> Write the equation for predicting the median house price from
the predictors in the model.

MEDV = -28.3135 + (-0.285 * CRIM) + (3.6893 * CHAS) + (8.2114 * RM)

>> What median house price is predicted for a tract in the Boston area that
does not bound the Charles River, has a crime rate of 0.1, and where the
average number of rooms per house is 6? What is the prediction error?

MEDV = -28.3135 + (-0.285 * 0.1) + (3.6893 * 0) + (8.2114 * 6)
     = -28.3135 - 0.0285 + 0 + 49.2684
     = 20.9264

MEDV is recorded in thousands of dollars, so the predicted median house price is $20,926.40.
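
The same prediction can be checked with a short script. This is a sketch using the coefficients from the fitted model; the function name `predict_medv` is an illustrative choice.

```python
# Coefficients of the fitted model (MEDV is in thousands of dollars).
COEF = {"Intercept": -28.3135, "CRIM": -0.285, "CHAS": 3.6893, "RM": 8.2114}

def predict_medv(crim, chas, rm):
    """Predicted median house value for one tract, in $1000s."""
    return (COEF["Intercept"]
            + COEF["CRIM"] * crim
            + COEF["CHAS"] * chas
            + COEF["RM"] * rm)

# Tract that does not bound the Charles River (CHAS = 0), crime rate 0.1, 6 rooms.
medv = predict_medv(crim=0.1, chas=0, rm=6)
print(f"{medv:.4f}")  # 20.9264
```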

>> Correlation table
Which predictors are likely to be measuring the same thing among the 14 predictors? Discuss the
relationships among INDUS, NOX, and TAX.

Correlation values:

INDUS & NOX    0.763
AGE & NOX      0.731
TAX & RAD      0.910
TAX & INDUS    0.7208
NOX & DIS     -0.769
DIS & AGE     -0.747

INDUS is strongly positively correlated with both NOX (0.763) and TAX (0.7208), so these
predictors are likely to be measuring much the same thing. After considering the strongest
positive and negative correlations, we can eliminate INDUS, AGE, and TAX.
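
One simple way to screen the listed pairs is to rank them by absolute correlation and keep those above a cutoff. The sketch below does this for the six values quoted above. The cutoff of 0.75 is a hypothetical choice for illustration, not a rule used in the analysis.

```python
# Correlation values as listed in the correlation table.
pairs = {
    ("INDUS", "NOX"): 0.763,
    ("AGE", "NOX"): 0.731,
    ("TAX", "RAD"): 0.910,
    ("TAX", "INDUS"): 0.7208,
    ("NOX", "DIS"): -0.769,
    ("DIS", "AGE"): -0.747,
}

THRESHOLD = 0.75  # hypothetical cutoff for "strongly correlated"

# Keep pairs whose absolute correlation meets the cutoff, strongest first.
strong = sorted(
    ((a, b, r) for (a, b), r in pairs.items() if abs(r) >= THRESHOLD),
    key=lambda item: abs(item[2]),
    reverse=True,
)
for a, b, r in strong:
    print(f"{a} & {b}: {r}")
```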

Model 1
Total sum of squared errors: 4616.353
RMS Error: 4.780506
Average Error: -0.300841374

Model 2
Total sum of squared errors: 4686.579
RMS Error: 4.81673
Average Error: -0.22067477

[Figure: Lift chart (validation dataset). Cumulative MEDV when sorted using predicted values vs. cumulative MEDV using average, plotted against # cases.]
[Figure: Decile-wise lift chart (validation dataset). Decile mean / global mean, by decile.]

Model 3
Total sum of squared errors: 4828.086
RMS Error: 4.888908
Average Error: -0.18015984

[Figure: Lift chart (validation dataset). Cumulative MEDV when sorted using predicted values vs. cumulative MEDV using average, plotted against # cases.]
[Figure: Decile-wise lift chart (validation dataset). Decile mean / global mean, by decile.]

Summary

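Comparing the total sum of squared errors, RMS error, and average error, along
with the lift charts, we conclude that Model 1, with CRIM, ZN, CHAS, NOX, RM,
DIS, RAD, PTRATIO, B, and LSTAT, is the best model for predicting Boston housing
prices. It has the lowest total sum of squared errors (4616.353) and the lowest
RMS error (4.780506) of the three models.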

Correlation table for airfare

Distance is the best predictor for fare

Pivot table with the average fare

Converting categorical variables (e.g., SW) into dummy variables
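
Converting a two-level categorical variable into a dummy variable means replacing its labels with 1 and 0. A minimal sketch follows. The "Yes"/"No" labels, the VACATION column, and the two sample rows are assumptions made for illustration.

```python
def to_dummy(value, positive="Yes"):
    """Map a two-level categorical label to 1 (positive level) or 0."""
    return 1 if value == positive else 0

# Hypothetical routes; labels assumed to be "Yes" / "No".
routes = [
    {"SW": "Yes", "VACATION": "No"},
    {"SW": "No", "VACATION": "Yes"},
]

encoded = [{name: to_dummy(label) for name, label in row.items()} for row in routes]
print(encoded)  # [{'SW': 1, 'VACATION': 0}, {'SW': 0, 'VACATION': 1}]
```

A variable with two levels needs only one dummy column; the level coded 0 acts as the reference category.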

Stepwise Regression
The model with the highest adjusted R² value is taken as the estimated best model.
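
Adjusted R² penalizes R² for the number of predictors, which is why it is used to compare models of different sizes. The sketch below uses the standard formula, with n the number of records and p the number of predictors. The input values are hypothetical and do not come from the airfare models.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fitted on n records."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical example: R² = 0.75 with 101 records and 5 predictors.
value = adjusted_r2(r2=0.75, n=101, p=5)
print(round(value, 4))  # 0.7368
```

Adding a predictor always raises or keeps R², but it raises adjusted R² only if the improvement outweighs the extra parameter.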

Model 1:
Total sum of squared errors: 787222.7657
RMS Error: 35.12679152
Average Error: 6.62863E-12

Model 2:
Total sum of squared errors: 793812.5792
RMS Error: 35.27350767
Average Error: -2.3941E-12

Model 3:
Total sum of squared errors: 807460.9178
RMS Error: 35.57545114
Average Error: -3.2331E-12

Exhaustive Search
We have 3 models based on the adjusted R² value. From these 3 models, we need to analyze the lift charts and the RMS error values and select the best-fitting model.

I decided to go with Model 2 after evaluating the RMS error for the analysis questions that follow.

Model 1:
[Figure: Lift chart (training dataset). Cumulative FARE when sorted using predicted values vs. cumulative FARE using average, plotted against # cases.]
[Figure: Decile-wise lift chart (training dataset). Decile mean / global mean, by decile.]

Total sum of squared errors: 787222.8
RMS Error: 35.12679
Average Error: -5.5E-12

Model 2:
[Figure: Lift chart (training dataset). Cumulative FARE when sorted using predicted values vs. cumulative FARE using average, plotted against # cases.]
[Figure: Decile-wise lift chart (training dataset). Decile mean / global mean, by decile.]

Total sum of squared errors: 785120.4
RMS Error: 35.07986
Average Error: 7.32E-09

Model 3:
[Figure: Lift chart (training dataset). Cumulative FARE when sorted using predicted values vs. cumulative FARE using average, plotted against # cases.]
[Figure: Decile-wise lift chart (training dataset). Decile mean / global mean, by decile.]

Total sum of squared errors: 785001.1
RMS Error: 35.07719
Average Error: 2.4E-09

Average fare on route by the given characteristics
Using the formula
y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3 + ... + β_k x_k
we calculate the average fare on the route for the values given in the question:
Y = 17.9917 + (NEW*2.4219) + (HI*0.0083) + (S_Income*0.0012) + (E_Income*0.0014) + (S_POP*0) + (E_POP*0) + (Distance*0.0758) + (PAX*-0.0009) + (Vacation*-35.7079) + (SW_ORD*-41.074) + (Slot_ORD*-16.3576) + (Gate_ORD*-20.6157)
Y = 17.9917 + (3*2.4219) + (4442.141*0.0083) + (28,760*0.0012) + (27,664*0.0014) + (4,557,004*0) + (3,195,503*0) + (1976*0.0758) + (12782*-0.0009) + (0*-35.7079) + (0*-41.074) + (1*-16.3576) + (1*-20.6157)
= 222.14

If Southwest decides to cover this route:

If Southwest decides to cover the route, we set the value of the SW dummy variable to 1:
Y = 17.9917 + (3*2.4219) + (4442.141*0.0083) + (28,760*0.0012) + (27,664*0.0014) + (4,557,004*0) + (3,195,503*0) + (1976*0.0758) + (12782*-0.0009) + (0*-35.7079) + (1*-41.074) + (1*-16.3576) + (1*-20.6157)
The average fare if Southwest decides to cover the route would then be
181.067.
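
Because the model is linear, switching a dummy variable from 0 to 1 shifts the prediction by exactly that variable's coefficient, so the Southwest fare can be obtained from the earlier prediction without redoing the full sum. A minimal sketch, with `shift_for_dummy` as an illustrative helper name:

```python
def shift_for_dummy(prediction, coefficient, old=0, new=1):
    """Update a linear-model prediction when one input changes from old to new."""
    return prediction + coefficient * (new - old)

SW_COEF = -41.074          # coefficient of the SW dummy in the fitted model
fare_without_sw = 222.14   # predicted average fare with SW = 0

fare_with_sw = shift_for_dummy(fare_without_sw, SW_COEF)
print(f"{fare_with_sw:.2f}")  # 181.07
```

The result agrees with the reported 181.067 up to rounding of the inputs: Southwest's presence lowers the predicted average fare by about 41.07.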

Factors not available for predicting the average fare from a new airport are:
Slot
Gate
SW

Exhaustive Search
The model with an R² value of 0.7139 is considered to be the best-fit model.
The average fare predicted with the given characteristics for this model is 195.157.
With Model 3, the average fare was 222.14, a difference of 26.983 between the two
models, so the model considering all the factors is the best fit.
