
MULTIPLE REGRESSION AND ISSUES

IN REGRESSION ANALYSIS
MULTIPLE REGRESSION
With multiple regression, we can analyze the association between more
than one independent variable and our dependent variable.

Returning to our analysis of the determinants of loan rates, we also believe that
the number of lines of credit the client currently employs is related to the loan
rate charged. Accordingly, we model a multiple linear regression of the
relationship:
- General form: $Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_k X_{ki} + \varepsilon_i$
- Specific form: $\text{Loan rate}_i = b_0 + b_1\,\text{DTI}_i + b_2\,\text{Open lines}_i + \varepsilon_i$

where DTI is the debt-to-income ratio and Open lines is the number of existing
lines of credit the client already possesses.

2
MULTIPLE REGRESSION
Focus On: Calculations

Coefficient estimates output

              Coefficient   Standard Error   t-Stat   p-Value
Intercept          0.0066           0.0352   0.1879    0.8563
DTI                0.7576           0.1845   4.1068    0.0045
Open lines         0.0059           0.0052   1.1429    0.2906

The coefficient estimates are both positive, indicating that increases in the DTI
and in open lines are associated with increases in the loan rate. But only the DTI
coefficient is significant at the 95% level or better, as indicated by its t-statistic
of 4.1068. A 1 percentage point increase in the debt-to-income ratio is associated
with a 75.76 bp increase in the loan rate, holding the number of open lines constant.
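
This table is standard OLS output. As a minimal sketch of how such a table could
be produced, using simulated data in place of the slides' 10-borrower sample (all
names and values below are illustrative assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated stand-in for the 10-borrower sample; names are illustrative.
rng = np.random.default_rng(0)
loans = pd.DataFrame({
    "dti": rng.uniform(0.05, 0.40, size=10),      # debt-to-income ratio
    "open_lines": rng.integers(1, 8, size=10),    # number of open credit lines
})
loans["rate"] = (0.01 + 0.75 * loans["dti"] + 0.005 * loans["open_lines"]
                 + rng.normal(0, 0.02, size=10))  # simulated loan rate

X = sm.add_constant(loans[["dti", "open_lines"]])  # add the intercept column
fit = sm.OLS(loans["rate"], X).fit()               # ordinary least squares
print(fit.summary())  # coefficients, standard errors, t-stats, p-values
```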

3
MULTIPLE REGRESSION
Focus On: Hypothesis Testing

We can test the hypothesis that the true population slope coefficient for the
association between open lines and loan rate is zero.

1. Formulate hypothesis: $H_0\!: b_2 = 0$ versus $H_a\!: b_2 \neq 0$ (a two-tailed test).

2. Identify the appropriate test statistic: $t = \dfrac{\hat{b}_2 - b_2}{s_{\hat{b}_2}}$.

3. Specify the significance level: $\alpha = 0.05$, leading to a critical value of 2.4469.

4. Collect data and calculate the test statistic: $t = \hat{b}_2 / s_{\hat{b}_2} = 1.1429$ from the regression output.

5. Make the statistical decision: fail to reject the null because 1.1429 < 2.4469.

4
MULTIPLE REGRESSION
Focus On: The p-Value Approach

              Coefficient   Standard Error   t-Stat   p-Value
Intercept          0.0066           0.0352   0.1879    0.8563
DTI                0.7576           0.1845   4.1068    0.0045
Open lines         0.0059           0.0052   1.1429    0.2906
p-Values appear alongside the coefficient estimates on the regression output.
For the coefficient estimates, we would reject a null hypothesis of a zero
parameter value for b0 only at significance levels α above 0.8563, for b1 at
levels above 0.0045, and for b2 at levels above 0.2906. Conventionally
accepted levels are 0.10, 0.05, and 0.01, which leads us to reject the null
hypothesis of a zero parameter value only for b1 and conclude that only b1 is
statistically significantly different from zero at generally accepted levels.

5
MULTIPLE REGRESSION ASSUMPTIONS
Multiple linear regression rests on the same underlying assumptions as linear
regression with a single independent variable, plus some additional ones.

1. The relationship between the dependent variable, Y, and the independent
variables (X1, X2, . . ., Xk) is linear.
2. The independent variables (X1, X2, . . ., Xk) are not random. Also, no exact
linear relation exists between two or more of the independent variables.
3. The expected value of the error term, conditioned on the independent
variables, is zero: $E(\varepsilon \mid X_1, X_2, \dots, X_k) = 0$.
4. The variance of the error term is the same for all observations.
5. The error term is uncorrelated across observations: $E(\varepsilon_i \varepsilon_j) = 0$ for all $j \neq i$.
6. The error term is normally distributed.

6
MULTIPLE REGRESSION: PREDICTED VALUES
Focus On: Calculations

Returning to our multiple linear regression, what loan rate would we expect
for a borrower with an 18% DTI and 3 open lines of credit?
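
Using the coefficient estimates from the earlier output, the fitted value
follows directly:

$$\widehat{\text{Loan rate}} = 0.0066 + 0.7576(0.18) + 0.0059(3) = 0.0066 + 0.1364 + 0.0177 \approx 0.1607,$$

that is, an expected loan rate of about 16.07%.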

7
UNCERTAINTY IN LINEAR REGRESSION
There are two sources of uncertainty in linear regression models:
1. Uncertainty associated with the random error term.
- The random error term itself contains uncertainty, which can be estimated
from the standard error of the estimate for the regression equation.
2. Uncertainty associated with the parameter estimates.
- The estimated parameters also contain uncertainty because they are only
estimates of the true underlying population parameters.
- For a single independent variable, as covered in the prior chapter,
estimates of this uncertainty can be obtained.
- For multiple independent variables, the matrix algebra necessary to
obtain such estimates is beyond the scope of this text.
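
As a worked example of the first source, the standard error of the estimate
can be recovered from the ANOVA output on the next slide:

$$s_e = \sqrt{\frac{\text{SSE}}{n-(k+1)}} = \sqrt{\frac{0.0044}{7}} \approx 0.0250,$$

which matches the standard error of 0.0250 reported with the regression statistics.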

8
MULTIPLE REGRESSION: ANOVA
Focus On: Regression Output

              df       SS      MSS        F   Significance F
Regression     2   0.0120   0.0060   9.6104           0.0098
Residual       7   0.0044   0.0006
Total          9   0.0164
The analysis of variance section of the output provides the F-test for the
hypothesis that the slope coefficients are all jointly zero. The high value of
this F-statistic leads us to reject that null, concluding that at least one
slope coefficient is nonzero.
Combined with the coefficient estimates, this model suggests that the loan rate
is fairly well described by the level of the client's debt-to-income ratio, but
that the number of outstanding open lines does not make a strong contribution
to that understanding.

9
F-TEST
Focus On: Calculations

The F-test for a multiple regression determines whether the slope coefficients,
taken together as a group, are all zero. The test statistic is

$$F = \frac{\text{RSS}/k}{\text{SSE}/[n-(k+1)]} = \frac{\text{MSR}}{\text{MSE}}.$$

From our regression output, this is

$$F = \frac{0.0120/2}{0.0044/7} \approx 9.61$$

(9.6104 as reported, computed from unrounded sums of squares), which is greater
than the critical value of F(0.05, 2, 8) = 4.4590, leading us to reject the null
hypothesis that all slope coefficients are jointly equal to zero.

10
R2 AND ADJUSTED R2
Focus On: Regression Output

Regression specification output from our example regression provides the following:

Regression Statistics
Multiple R         0.8562
R2                 0.7330
Adjusted R2        0.6568
Standard Error     0.0250
Observations           10

- Multiple R is the correlation coefficient measuring the degree of association
between the independent variables and the dependent variable.
- R2 is our familiar coefficient of determination: the independent variables
explain 73.3% of the variation in the dependent variable.
- Adjusted R2 is a more appropriate measure of explanatory power when there are
multiple independent variables; here it is 65.68%.
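
The adjusted figure can be verified from the reported R2, the sample size of 10,
and the two independent variables:

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-(k+1)} = 1 - (1 - 0.7330)\,\frac{9}{7} \approx 0.6568.$$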

11
INDICATOR VARIABLES
Often called dummy variables, indicator variables are used to capture
qualitative aspects of the hypothesized relationship.

Consider that a reliance on short-term sources of financing is also generally
believed to be associated with riskier borrowers. The indicator variable STR,
for short-term reliance, is coded as 1 when a borrower's existing borrowing
consists predominantly of lines of credit and 0 otherwise. The hypothesized
relationship is now

$$\text{Loan rate}_i = b_0 + b_1\,\text{DTI}_i + b_2\,\text{Open lines}_i + b_3\,\text{STR}_i + \varepsilon_i.$$
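
A minimal sketch of this coding step (the column name and the 50% cutoff are
illustrative assumptions, not from the slides):

```python
import pandas as pd

# Hypothetical borrower data: 'credit_line_share' is the fraction of existing
# borrowing drawn from lines of credit (an illustrative column name).
loans = pd.DataFrame({"credit_line_share": [0.9, 0.2, 0.7, 0.1]})

# STR = 1 when borrowing is predominantly lines of credit, 0 otherwise.
loans["STR"] = (loans["credit_line_share"] > 0.5).astype(int)
print(loans)
```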

12
INDICATOR VARIABLES
Focus On: Regression Output

Regression Statistics
Multiple R         0.8562
R2                 0.7330
Adjusted R2        0.6568

              df       SS      MSS        F   Significance F
Regression     3   0.0136   0.0045   9.7037           0.0102
Residual       6   0.0028   0.0005
Total          9   0.0164

              Coefficient   Standard Error   t-Stat   p-Value
Intercept          0.0138           0.0324   0.4252    0.6855
DTI                0.6117           0.1781   3.4340    0.0139
Open lines         0.0265           0.0121   2.1958    0.0705
STR                0.0681           0.0371   1.8367    0.1159
13
VIOLATIONS: HETEROSKEDASTICITY
The variance of the errors differs across observations (Assumption 4).

There are two types of heteroskedasticity:
- Unconditional heteroskedasticity, which presents no problems for statistical
inference, and
- Conditional heteroskedasticity, wherein the error variance is correlated with
the values of the independent variables.
  - Parameter estimates are still consistent.
  - The F-test and t-tests are unreliable.

14
VIOLATIONS: SERIAL CORRELATION
There is correlation between the error terms (Assumption 5).

The focus in this chapter is the case in which there is serial correlation but
no lagged values of the dependent variable appear as independent variable(s).
- Parameter estimates are consistent, but the standard errors are incorrect.
- The F-test and t-tests are likely inflated with positive serial correlation,
the most common case with financial variables.
- Parameter estimates remain consistent only as long as there are no lagged
values of the dependent variable among the independent variables.
- If there are lagged values as independent variables:
  - Coefficient estimates are inconsistent.
  - This is the statistical arena of time series (Chapter 10).

15
TESTING AND CORRECTING FOR VIOLATIONS
There are well-established tests for serial correlation and heteroskedasticity,
as well as ways to correct for their impact (see the sketch below).

Testing
- Heteroskedasticity: use the Breusch-Pagan test.
- Serial correlation: use the Durbin-Watson test.

Correcting
- Heteroskedasticity: use robust (White) standard errors or generalized
least squares.
- Serial correlation: use the Hansen correction.
  - This also corrects for heteroskedasticity.
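
A minimal sketch of these tests and corrections using statsmodels, on simulated
data (all names and numbers are illustrative; the slides' Hansen correction is
represented here by the closely related HAC/Newey-West robust covariance):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

# Simulated data standing in for a loan-rate regression; values illustrative.
rng = np.random.default_rng(0)
X = sm.add_constant(rng.uniform(0.05, 0.40, size=(100, 2)))
y = X @ np.array([0.01, 0.75, 0.05]) + rng.normal(0, 0.02, size=100)
fit = sm.OLS(y, X).fit()

# Breusch-Pagan test for conditional heteroskedasticity.
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(fit.resid, fit.model.exog)
print(f"Breusch-Pagan LM p-value: {lm_pval:.4f}")

# Durbin-Watson test for serial correlation (values near 2 suggest none).
print(f"Durbin-Watson statistic: {durbin_watson(fit.resid):.4f}")

# Corrections: White standard errors for heteroskedasticity, and HAC
# (Newey-West) standard errors, which are robust to both problems.
fit_white = sm.OLS(y, X).fit(cov_type="HC1")
fit_hac = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 2})
print(fit_white.bse)  # heteroskedasticity-robust standard errors
print(fit_hac.bse)    # heteroskedasticity- and autocorrelation-robust
```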

16
TESTING FOR SERIAL CORRELATION
Focus On: Calculating the DurbinWatson Statistic

You have recently estimated a regression model with 100 observations and two
independent variables. Using the estimated errors, you have determined that the
correlation between the error term and its first lagged value is 0.16. Do the
observations exhibit positive serial correlation?
- The test statistic is $DW \approx 2(1 - r) = 2(1 - 0.16) = 1.68$.
- The critical values from Appendix E are $d_l = 1.63$ and $d_u = 1.72$.
- Because 1.68 lies between $d_l = 1.63$ and $d_u = 1.72$, the test is
inconclusive for positive serial correlation.

[Figure: Durbin-Watson decision regions, with a rejection zone for positive
serial correlation below dl = 1.63, an inconclusive region between dl = 1.63
and du = 1.72, and a rejection zone for negative serial correlation at the
upper end of the scale.]

17
VIOLATIONS: MULTICOLLINEARITY
Multicollinearity occurs when two or more independent variables, or combinations
of independent variables, are highly (but not perfectly) correlated with each
other (Assumption 2).
It is common with financial data.
- Estimates are still consistent but imprecise and unreliable.
- One indicator that you may have a collinearity problem is the presence of a
significant F-test but no (or few) significant t-tests.
There is no easy correction for this violation; you may have to drop variables.
- The economic story behind the model is critical here.

18
SUMMARIZING VIOLATIONS AND SOLUTIONS

Problem              Effect                       Solution
Heteroskedasticity   Incorrect standard errors    Use robust standard errors
Serial correlation   Incorrect standard errors*   Use robust standard errors
Multicollinearity    High R2 and low t-stats      No theory-based solution

*Coefficient estimates are also inconsistent when lagged values of the
dependent variable appear as independent variables.
19
MODEL SPECIFICATION
Models should
- Be grounded in financial or economic reasoning.
- Use variables whose functional form is appropriate to their nature.
- Have specifications that are parsimonious.
- Comply with the regression assumptions.
- Be tested out-of-sample before being applied to decisions.

20
MODEL MISSPECIFICATIONS
A model is misspecified when it violates the assumptions underlying linear
regression, its functional form is incorrect, or it contains time-series
specification problems.
Generally, model misspecification can result in invalid statistical inference when
we are using linear regression.
Misspecification has a number of possible sources:
1. Misspecified functional form, which can arise from several possible problems:
- Omitted variable bias.
- Incorrectly represented variables.
- Pooling of data that should not be pooled.
2. Error term correlation with independent variables can arise from:
- Lagged values of the dependent variable as independent variables.
- Measurement error in the independent variables.
- Independent variables that are functions of the dependent variable.
21
AVOIDING MISSPECIFICATION

If independent or dependent variables are nonlinear, use an appropriate
transformation to make them linear.
- For example, use common-size statements or log-based transformations.
Avoid independent variables that are mathematical transformations of
dependent variables.
Don't include spurious independent variables (no data mining).
Perform diagnostic tests for violations of the linear regression assumptions.
- If violations are found, use appropriate corrections.
Validate model estimations out-of-sample when possible.
Ensure that data come from a single underlying population.
- The data collection process should be grounded in good sampling practice.

22
QUALITATIVE DEPENDENT VARIABLES
The dependent variable of interest may be a categorical variable
representing the state of the subject we are analyzing.
Dependent variables that take on ordinal or nominal values are better
estimated using models developed for qualitative analysis.
- This approach is the dependent variable analog to indicator (dummy)
variables as independent variables.
Three broad categories
1. Probit: Based on the normal distribution, it estimates the probability of
the dependent variable outcome.
2. Logit: Based on the logistic distribution, it also estimates the probability
of the dependent variable outcome.
3. Discriminant analysis: It estimates a linear function that can then be used
to assign an observation to one of the underlying categories.
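
A minimal sketch of the first two approaches with statsmodels, using simulated
default-style data (all names and values are illustrative assumptions, not from
the slides):

```python
import numpy as np
import statsmodels.api as sm

# Simulated binary outcome: 1 = borrower defaults, 0 = does not (illustrative).
rng = np.random.default_rng(0)
dti = rng.uniform(0.05, 0.60, size=200)
X = sm.add_constant(dti)
prob_default = 1 / (1 + np.exp(-(-3.0 + 8.0 * dti)))  # true logistic relation
y = rng.binomial(1, prob_default)

# Probit: links the outcome probability to the normal CDF.
probit_fit = sm.Probit(y, X).fit(disp=False)
# Logit: links the outcome probability to the logistic CDF.
logit_fit = sm.Logit(y, X).fit(disp=False)

# Predicted default probability for a borrower with a 40% DTI.
x_new = np.array([1.0, 0.40])
print(probit_fit.predict(x_new), logit_fit.predict(x_new))
```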

23
ECONOMIC MEANING AND
MULTIPLE REGRESSION
Regression Statistics
Multiple R         0.8562
R2                 0.7330
Adjusted R2        0.6568

              df       SS      MSS        F   Significance F
Regression     3   0.0136   0.0045   9.7037           0.0102
Residual       6   0.0028   0.0005
Total          9   0.0164

              Coefficient   Standard Error   t-Stat   p-Value
Intercept          0.0138           0.0324   0.4252    0.6855
DTI                0.6117           0.1781   3.4340    0.0139
Open lines         0.0265           0.0121   2.1958    0.0705
STR                0.0681           0.0371   1.8367    0.1159
24
SUMMARY
We are often interested in the relationship between more than two financial
variables, and multiple linear regression allows us to model such relationships
and subject our beliefs about them to rigorous testing.
Financial data often exhibit characteristics that violate the underlying
assumptions necessary for linear regression and its associated hypothesis tests
to be meaningful.
The main violations are
- Serial correlation.
- Conditional heteroskedasticity.
- Multicollinearity.
We can test for each of these conditions and correct our estimations and
hypothesis tests to account for their effects.

25
