
Multiple Regression

Amar Saxena
Amar.Saxena@IFMR.ac.in
+91.993.002.2910
Why Multiple Regression
• Our real world is multivariable.
• Multivariable analysis is a tool to determine the relative
contribution of each factor.

• Is health dependent only on smoking or drinking?

o Diet, exercise, genetics, age, job and sleeping habits also play an
important role in deciding one's health.
o We often want to describe the effect of smoking over and above
these other variables.

• We extend the simple linear regression model: any number of
independent variables is now allowed.
• We wish to build a model that fits the data better than the
simple linear regression model.



Visualizing Multiple Regression

ŷ = a + b1X1 + b2X2 + b3X3



Greater complexity vs. understanding real-life scenarios



Considerations & Assumptions
• The units observed should be a random sample from a well defined
population.
• All variables should be measured on an interval, continuous scale.
• All the variables should be normal distributed.
• Linear relationships between dependent and independent variables.
• Independent variables should be, at best, moderately correlated
o There must be no perfect (or near-perfect) correlations among them
o This situation is called multicollinearity.
• There must be no interactions between independent variables

Error Term
• The errors are normally distributed
• Errors have a constant variance
• The model errors are independent
Errors (residuals) from the regression model: εi = (Yi – Ŷi)
What should be the price of a flat?
Example:
Suppose you are considering purchasing a flat and want to determine the
key determinants of its price.
• You believe that the price of the flat depends on –
o x1 = Number of bathrooms in the flat
o x2 = Number of bedrooms in the flat
o x3 = Size of the flat (sq ft)
o y = Price quoted by builders

• Assume a linear relationship.
• You randomly choose a sample of 781 flats and obtain the data.

Model
y = a + b1x1 + b2x2 + b3x3 + ε

Least Squares Estimation

Use the Analysis ToolPak in Excel to solve it.
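If you prefer code to a spreadsheet, here is a minimal sketch of the same least-squares fit in Python with statsmodels; the file name flats.csv and the column names (bathrooms, bedrooms, size_sqft, price) are hypothetical placeholders for your own data.

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical file and column names; substitute your own dataset.
    df = pd.read_csv("flats.csv")                                    # the 781 sampled flats
    X = sm.add_constant(df[["bathrooms", "bedrooms", "size_sqft"]])  # adds the intercept a
    y = df["price"]

    model = sm.OLS(y, X).fit()                                       # least-squares estimation
    print(model.summary())                                           # coefficients, R², F, t, p-values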
Summary Output

Step 1
ŷ = 475531.87 − 1005440.73 x1 + 459811.58 x2 + 3096.01 x3


But, how good is my Regression Line?
• No line is perfect.
o There is always some error in the estimation.
o There is always some part of the response (Y) that can't be explained
by the predictors (x).
• So, the total variance in Y is divided into two parts:
o Variance that can be explained by x, using the regression
o Variance that can't be explained by x

Model Assessment
• The model is assessed using three measures:
o The standard error of estimate
o The coefficient of determination
o The F-test of the analysis of variance
• The standard error of estimate is used in the calculations for
the other measures.
Standard Error of Estimate
• The standard deviation of the error is estimated by the
Standard Error of Estimate:

sε = √( SSE / (n − k − 1) )
(k + 1 coefficients were estimated)
• The magnitude of sε is judged by comparing it to the mean of Y

Mean y = ₹ 38,33,291.10
sε = ₹ 25,24,622.34
It seems that sε is not particularly small
relative to the mean of Y – about 66%.
• Question:
Can we conclude the model does not fit
the data well? Not necessarily.
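As a quick check, a minimal sketch of this computation, continuing the hypothetical statsmodels fit above:

    resid = model.resid                      # εi = Yi − Ŷi
    sse = (resid ** 2).sum()
    n, k = len(y), 3
    s_e = (sse / (n - k - 1)) ** 0.5         # standard error of estimate
    print(s_e / y.mean())                    # ≈ 0.66 here: about 66% of the mean price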
Coefficient of Determination (R²)
• We know that
R² = 1 − SSE / SST, where SST = Σ(Yi − Ȳ)²

R² = 0.479

47.9% of the variation in Flat Price is explained by the 3 independent
variables. So, 52.1% is unexplained.

• Introducing Adjusted R² to assess the goodness of fit of the equation
(see the sketch below).
o Quoted most often when explaining the accuracy of the regression equation.
o More conservative than R²: always less than R².
o Adjusted R² increases only when a newly added input variable makes the
equation more accurate (improves the regression equation's ability to predict
the output).
o R² always goes up when a new variable is added, whether or not the
new input variable improves the regression equation's accuracy.

Rule of thumb for R²: 0–25% poor, 25–50% fair, 50–75% good, 75–100% very good.
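A minimal sketch of adjusted R², using the standard formula R²adj = 1 − (1 − R²)(n − 1)/(n − k − 1) and continuing the hypothetical example:

    sst = ((y - y.mean()) ** 2).sum()
    r2 = 1 - sse / sst                             # 0.479 here
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # ≈ 0.477: slightly below R²
    print(r2, r2_adj)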
Testing the Validity of the Model
• Is the relationship between the dependent and independent
variables completely random?
• To answer this question, we test the hypothesis:
H0: β1 = β2 = … = βk = 0 (The relationship is completely random)
H1: at least one βi ≠ 0 (at least one independent variable affects y)

• We use the F-Test for testing this hypothesis.


F = MSR / MSE
where MSR = SSR / k and MSE = SSE / (n − k − 1)

Degrees of freedom: Regression = k; Error = n − k − 1; Total = n − 1
Conclusion: There is sufficient evidence to reject the
null hypothesis in favor of the alternative hypothesis.
This linear regression model is valid.
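A minimal sketch of this F-test by hand under the same hypothetical setup (statsmodels also reports it directly as model.fvalue):

    from scipy import stats

    ssr = ((model.fittedvalues - y.mean()) ** 2).sum()
    msr, mse = ssr / k, sse / (n - k - 1)
    F = msr / mse                                # ≈ 238 with R² = 0.479, n = 781, k = 3
    p = stats.f.sf(F, k, n - k - 1)              # p < 0.0001: reject H0
    print(F, p)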



Interpreting the Coefficients
• Intercept a = 4,75,531.87
o This is the value of y when all the predictors take the value zero.
o Since the data ranges of the independent variables do not cover the
value zero, do NOT interpret the intercept.

• Bathrooms b1 = - 10,05,440.73
o For each additional bathroom, price of flat reduces by Rs 10 lakhs
o Does this make conventional sense?

• Bedrooms b2 = 4,59,811.58
o For each additional bedroom, flat price increases by Rs 4.6 lakhs

• Size of Flat b3 = 3,096.01


o As size increases, price of flat goes up – at the rate of Rs 3,096 per sq ft.
Testing Individual Coefficients
• The hypothesis for each bi is –
H0: βi = 0
H1: βi ≠ 0

Test statistic: t = (bi − βi) / s(bi), d.f. = n − k − 1
(see the sketch below)

• Excel output (p-values annotated on the screenshot):
o Intercept – ignore
o Bathrooms – very strong
o Bedrooms – reasonably strong
o Size of flat – strongest
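A minimal sketch of one such t-test by hand, continuing the hypothetical fit (model.summary() already reports t and p for every coefficient):

    b = model.params["size_sqft"]
    se_b = model.bse["size_sqft"]
    t_stat = (b - 0) / se_b                          # test H0: βi = 0
    p_val = 2 * stats.t.sf(abs(t_stat), n - k - 1)   # two-sided p-value
    print(t_stat, p_val)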



p-Value
• The p-value gives us an idea about the significance of each variable.

• That p-value essentially compares the
o fit of the regression "with everything except the variable" vs
the fit of the regression "with everything including the variable".
o If the variable is insignificant, there will be no decrease or, at best,
minimal change in adjusted R² when we remove it (see the sketch below).

• Note: It is possible for all the variables in a regression together to
produce a great overall fit, and yet have none of them be
individually significant.
o There can be a great overall fit (high R²), yet none of the individual
variables is significant.
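A minimal sketch of this with/without comparison under the same hypothetical data: refit the model without one variable and compare adjusted R².

    reduced = sm.OLS(y, sm.add_constant(df[["bathrooms", "bedrooms"]])).fit()
    print(model.rsquared_adj, reduced.rsquared_adj)  # a large drop => size_sqft matters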



Reporting regression results:
"The prices quoted by builders were analyzed by multiple
regression, using number of bathrooms, number of bedrooms
and area as regressors.
The regression was a reasonable fit (R²adj = 47%), and the overall
relationship was significant (F = 237.97, p < 0.0001).
With other variables held constant, the price quoted by the
builder was negatively related to the number of bathrooms,
decreasing by Rs 10 lakh for every bathroom. The price of the flat,
on the other hand, increases by Rs 4.6 lakh per bedroom and by
Rs 3,096 per sq ft.
All 3 variables were significant at the p = 0.01 level. Size of the flat
came out as the most significant variable."
Regression Diagnostics
• The conditions required for the model assessment to apply
must be checked (see the sketch below).
o Is the error variable normally distributed? Draw a histogram of the residuals.
o Is the error variance constant? Plot the residuals versus the
predicted values of Y.
o Are the errors independent? Plot the residuals versus, if applicable,
the time period.
o Can we identify outliers?
o Is multicollinearity (correlation between the Xi's) a problem?
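A minimal sketch of the first two checks, continuing the hypothetical fit:

    import matplotlib.pyplot as plt

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.hist(model.resid, bins=30)                  # normality: roughly bell-shaped?
    ax2.scatter(model.fittedvalues, model.resid)    # constant variance: no funnel shape?
    plt.show()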
Multicollinearity
• Multicollinearity (or intercorrelation) exists when at least
some of the predictor variables are correlated among
themselves.

• This is one of the biggest issues in multiple regression. To resolve it –


o Drop the troublesome RHS variables
o Principal components estimator
o Transform the data
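One common way to detect it (not listed on the slide, so treat this as an illustrative add-on) is the variance inflation factor, available in statsmodels:

    from statsmodels.stats.outliers_influence import variance_inflation_factor

    for i, col in enumerate(X.columns):
        if col != "const":                                          # skip the intercept column
            print(col, variance_inflation_factor(X.values, i))      # VIF > 10 is a red flag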
Stepwise Regression
• Choose a subset of the independent variables which "best"
explains the dependent variable.

1) Forward Selection
• Start by choosing the independent variable which explains the
most variation in the dependent variable.
• Choose a second variable which explains the most residual
variation, and then recalculate regression coefficients.
• Continue until no variables "significantly" explain residual
variation.
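A minimal sketch of forward selection under the same hypothetical data, using p-values as the entry criterion (real procedures may instead use F-tests, AIC, or other rules):

    candidates, selected = ["bathrooms", "bedrooms", "size_sqft"], []
    while candidates:
        # p-value of each candidate when added to the current model
        pvals = {}
        for c in candidates:
            m = sm.OLS(y, sm.add_constant(df[selected + [c]])).fit()
            pvals[c] = m.pvalues[c]
        best = min(pvals, key=pvals.get)
        if pvals[best] > 0.05:        # nothing "significantly" explains residual variation
            break
        selected.append(best)
        candidates.remove(best)
    print(selected)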
Stepwise Regression
2) Backward Selection
• Start with all the variables in the model, and drop the
least "significant", one at a time, until you are left with
only "significant" variables.

3) Mixture of the two


• Perform a forward selection, but drop variables that are
no longer "significant" after the introduction of
new variables.
Other Regressions
• Hierarchical Regression (see the sketch below)
o The researcher determines the order of entry of the variables.
o F-tests are used to compute the significance of each added
variable (or set of variables) to the explanation reflected in R².
o An alternative to comparing betas for purposes of assessing the
importance of the independent variables.

• Categorical Regression
o Used when there is a combination of nominal, ordinal, and
interval-level independent variables.
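A minimal sketch of such an added-block F-test under the same hypothetical data: compare a size-only model against the full model that adds bedrooms and bathrooms.

    m1 = sm.OLS(y, sm.add_constant(df[["size_sqft"]])).fit()
    m2 = model                                                 # full 3-variable model from earlier
    f_add = ((m1.ssr - m2.ssr) / 2) / (m2.ssr / m2.df_resid)   # 2 variables added
    p_add = stats.f.sf(f_add, 2, m2.df_resid)
    print(f_add, p_add)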
