
Multiple Regression

Amar Saxena
Amar.Saxena@IFMR.ac.in
+91.993.002.2910
Why Multiple Regression
• Our real world is multivariable.
• Multivariable analysis is a tool to determine the relative
contribution of each factor.

• Is health dependent only on smoking or drinking?

o Diet, exercise, genetics, age, job and sleeping habits also play an
important role in deciding one's health.
o We often want to describe the effect of smoking over and above
these other variables.

• We extend the simple linear regression model: any number of
independent variables is now allowed.
• We wish to build a model that fits the data better than the
simple linear regression model.



Visualizing Multiple Regression

ŷ = a + b1X1 + b2X2 + b3X3



Greater complexity vs. understanding real-life scenarios



Considerations & Assumptions
• The units observed should be a random sample from a well defined
population.
• All variables should be measured on an interval, continuous scale.
• All the variables should be normal distributed.
• Linear relationships between dependent and independent variables.
• Independent variables should be, at best, moderately correlated
o There must be no perfect (or near-perfect) correlations among them
o This situation is called multicollinearity.
• There must be no interactions between independent variables

Error Term
• The errors are normally distributed
• Errors have a constant variance
• The model errors are independent
Errors (residuals) from the regression model: εi = (Yi – Ŷi)
What should be the price of a flat?
Example:
Suppose you are considering purchasing a flat and want to determine the
key determinants of its price.
• You believe that the price of the flat depends on –
o x1 = Number of bathrooms in the flat
o x2 = Number of bedrooms in the flat
o x3 = Size of the flat (sq ft)
o y = Price quoted by builders

• Assume a linear relationship.
• You randomly choose a sample of 781 flats and obtain the data.

Model
y = a + b1x1 + b2x2 + b3x3 + ε

Least Squares Estimation

Use the Analysis ToolPak in Excel to solve it.
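If you prefer code to a spreadsheet, here is a minimal sketch of the same least-squares fit in Python with statsmodels; the file name flats.csv and the column names (bathrooms, bedrooms, size_sqft, price) are hypothetical placeholders for your own data.

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical file and column names; substitute your own dataset.
    df = pd.read_csv("flats.csv")                                    # the 781 sampled flats
    X = sm.add_constant(df[["bathrooms", "bedrooms", "size_sqft"]])  # adds the intercept a
    y = df["price"]

    model = sm.OLS(y, X).fit()                                       # least-squares estimation
    print(model.summary())                                           # coefficients, R², F, t, p-values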
Summary Output

Step 1
ŷ = 475531.87 − 1005440.73 x1 + 459811.58 x2 + 3096.01 x3


But, how good is my Regression Line?
• No line is perfect.
o There is always some error in the estimation.
o There is always some part of the response (Y) that can't be explained
by the predictors (x).
• So, the total variance in Y is divided into two parts:
o Variance that can be explained by x, using the regression
o Variance that can't be explained by x

Model Assessment
• The model is assessed using three measures:
o The standard error of estimate
o The coefficient of determination
o The F-test of the analysis of variance
• The standard error of estimate is used in the calculations for
the other measures.
Standard Error of Estimate
• The standard deviation of the error is estimated by the
Standard Error of Estimate:

sε = √( SSE / (n − k − 1) )
(k + 1 coefficients were estimated)
• The magnitude of sε is judged by comparing it to the mean of Y

Mean y = ₹ 38,33,291.10
sε = ₹ 25,24,622.34
It seems that sε is not particularly small
relative to the mean of Y – about 66%.
• Question:
Can we conclude the model does not fit
the data well? Not necessarily.
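As a quick check, a minimal sketch of this computation, continuing the hypothetical statsmodels fit above:

    resid = model.resid                      # εi = Yi − Ŷi
    sse = (resid ** 2).sum()
    n, k = len(y), 3
    s_e = (sse / (n - k - 1)) ** 0.5         # standard error of estimate
    print(s_e / y.mean())                    # ≈ 0.66 here: about 66% of the mean price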
Coefficient of Determination (R²)
• We know that
R² = 1 − SSE / SST, where SST = Σ(Yi − Ȳ)²

R² = 0.479

47.9% of the variation in Flat Price is explained by the 3 independent
variables. So, 52.1% is unexplained.

• Introducing Adjusted R² to assess the goodness of fit of the equation
(see the sketch below).
o Quoted most often when explaining the accuracy of the regression equation.
o More conservative than R²: always less than R².
o Adjusted R² increases only when a newly added input variable makes the
equation more accurate (improves the regression equation's ability to predict
the output).
o R² always goes up when a new variable is added, whether or not the
new input variable improves the regression equation's accuracy.

Rule of thumb for R²: 0–25% poor, 25–50% fair, 50–75% good, 75–100% very good.
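A minimal sketch of adjusted R², using the standard formula R²adj = 1 − (1 − R²)(n − 1)/(n − k − 1) and continuing the hypothetical example:

    sst = ((y - y.mean()) ** 2).sum()
    r2 = 1 - sse / sst                             # 0.479 here
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # ≈ 0.477: slightly below R²
    print(r2, r2_adj)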
Testing the Validity of the Model
• Is the relationship between the dependent and independent
variables completely random?
• To answer this question, we test the hypothesis:
H0: β1 = β2 = … = βk = 0 (The relationship is completely random)
H1: at least one βi ≠ 0 (at least one independent variable affects y)

• We use the F-Test for testing this hypothesis.


F = MSR / MSE
where MSR = SSR / k and MSE = SSE / (n − k − 1)

Degrees of freedom: Regression = k; Error = n − k − 1; Total = n − 1
Conclusion: There is sufficient evidence to reject the
null hypothesis in favor of the alternative hypothesis.
This linear regression model is valid.
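A minimal sketch of this F-test by hand under the same hypothetical setup (statsmodels also reports it directly as model.fvalue):

    from scipy import stats

    ssr = ((model.fittedvalues - y.mean()) ** 2).sum()
    msr, mse = ssr / k, sse / (n - k - 1)
    F = msr / mse                                # ≈ 238 with R² = 0.479, n = 781, k = 3
    p = stats.f.sf(F, k, n - k - 1)              # p < 0.0001: reject H0
    print(F, p)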



Interpreting the Coefficients
• Intercept a = 4,75,531.87
o This is the value of y when all the predictors take the value zero.
o Since the data ranges of the independent variables do not cover the
value zero, do NOT interpret the intercept.

• Bathrooms b1 = - 10,05,440.73
o For each additional bathroom, price of flat reduces by Rs 10 lakhs
o Does this make conventional sense?

• Bedrooms b2 = 4,59,811.58
o For each additional bedroom, flat price increases by Rs 4.6 lakhs

• Size of Flat b3 = 3,096.01


o As size increases, price of flat goes up – at the rate of Rs 3,096 per sq ft.
Testing Individual Coefficients
• The hypothesis for each bi is –
H0: βi = 0
H1: βi ≠ 0

Test statistic: t = (bi − βi) / s(bi), d.f. = n − k − 1
(see the sketch below)

• Excel output (p-values annotated on the screenshot):
o Intercept – ignore
o Bathrooms – very strong
o Bedrooms – reasonably strong
o Size of flat – strongest
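A minimal sketch of one such t-test by hand, continuing the hypothetical fit (model.summary() already reports t and p for every coefficient):

    b = model.params["size_sqft"]
    se_b = model.bse["size_sqft"]
    t_stat = (b - 0) / se_b                          # test H0: βi = 0
    p_val = 2 * stats.t.sf(abs(t_stat), n - k - 1)   # two-sided p-value
    print(t_stat, p_val)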



p-Value
• The p-value gives us an idea about the significance of each variable.

• That p-value essentially compares the
o fit of the regression "with everything except the variable" vs
the fit of the regression "with everything including the variable".
o If the variable is insignificant, there will be no decrease or, at best,
minimal change in adjusted R² when we remove it (see the sketch below).

• Note: It is possible for all the variables in a regression together to
produce a great overall fit, and yet have none of them be
individually significant.
o There can be a great overall fit (high R²), yet none of the individual
variables is significant.
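A minimal sketch of this with/without comparison under the same hypothetical data: refit the model without one variable and compare adjusted R².

    reduced = sm.OLS(y, sm.add_constant(df[["bathrooms", "bedrooms"]])).fit()
    print(model.rsquared_adj, reduced.rsquared_adj)  # a large drop => size_sqft matters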



Reporting regression results:
"The prices quoted by builders were analyzed by multiple
regression, using number of bathrooms, number of bedrooms
and area as regressors.
The regression was a reasonable fit (R²adj = 47%), and the overall
relationship was significant (F = 237.97, p < 0.0001).
With other variables held constant, the price quoted by the
builder was negatively related to the number of bathrooms,
decreasing by Rs 10 lakh for every bathroom. The price of the flat,
on the other hand, increases by Rs 4.6 lakh per bedroom and by
Rs 3,096 per sq ft.
All 3 variables were significant at the p = 0.01 level. Size of the flat
came out as the most significant variable."
Regression Diagnostics
• The conditions required for the model assessment to apply
must be checked (see the sketch below).
o Is the error variable normally distributed? Draw a histogram of the residuals.
o Is the error variance constant? Plot the residuals versus the
predicted values of Y.
o Are the errors independent? Plot the residuals versus, if applicable,
the time period.
o Can we identify outliers?
o Is multicollinearity (correlation between the Xi's) a problem?
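A minimal sketch of the first two checks, continuing the hypothetical fit:

    import matplotlib.pyplot as plt

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.hist(model.resid, bins=30)                  # normality: roughly bell-shaped?
    ax2.scatter(model.fittedvalues, model.resid)    # constant variance: no funnel shape?
    plt.show()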
Multicollinearity
• Multicollinearity (or intercorrelation) exists when at least
some of the predictor variables are correlated among
themselves.

• This is one of the biggest issues in multiple regression. To resolve it –


o Drop the troublesome RHS variables
o Principal components estimator
o Transform the data
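One common way to detect it (not listed on the slide, so treat this as an illustrative add-on) is the variance inflation factor, available in statsmodels:

    from statsmodels.stats.outliers_influence import variance_inflation_factor

    for i, col in enumerate(X.columns):
        if col != "const":                                          # skip the intercept column
            print(col, variance_inflation_factor(X.values, i))      # VIF > 10 is a red flag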
Stepwise Regression
• Choose a subset of the independent variables which "best"
explains the dependent variable.

1) Forward Selection
• Start by choosing the independent variable which explains the
most variation in the dependent variable.
• Choose a second variable which explains the most residual
variation, and then recalculate regression coefficients.
• Continue until no variables "significantly" explain residual
variation.
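A minimal sketch of forward selection under the same hypothetical data, using p-values as the entry criterion (real procedures may instead use F-tests, AIC, or other rules):

    candidates, selected = ["bathrooms", "bedrooms", "size_sqft"], []
    while candidates:
        # p-value of each candidate when added to the current model
        pvals = {}
        for c in candidates:
            m = sm.OLS(y, sm.add_constant(df[selected + [c]])).fit()
            pvals[c] = m.pvalues[c]
        best = min(pvals, key=pvals.get)
        if pvals[best] > 0.05:        # nothing "significantly" explains residual variation
            break
        selected.append(best)
        candidates.remove(best)
    print(selected)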
Stepwise Regression
2) Backward Selection
• Start with all the variables in the model, and drop the
least "significant", one at a time, until you are left with
only "significant" variables.

3) Mixture of the two


• Perform a forward selection, but drop variables that are
no longer "significant" after the introduction of
new variables.
Other Regressions
• Hierarchical Regression (see the sketch below)
o The researcher determines the order of entry of the variables.
o F-tests are used to compute the significance of each added
variable (or set of variables) to the explanation reflected in R².
o An alternative to comparing betas for purposes of assessing the
importance of the independent variables.

• Categorical Regression
o Used when there is a combination of nominal, ordinal, and
interval-level independent variables.
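A minimal sketch of such an added-block F-test under the same hypothetical data: compare a size-only model against the full model that adds bedrooms and bathrooms.

    m1 = sm.OLS(y, sm.add_constant(df[["size_sqft"]])).fit()
    m2 = model                                                 # full 3-variable model from earlier
    f_add = ((m1.ssr - m2.ssr) / 2) / (m2.ssr / m2.df_resid)   # 2 variables added
    p_add = stats.f.sf(f_add, 2, m2.df_resid)
    print(f_add, p_add)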
