
Multiple Regression

Multiple Regression Analysis (3 + 3 + 3)

Introduction
• The multiple regression model allows us to evaluate the impact of multiple
independent variables on a dependent variable

• The slope coefficients measure how much the dependent variable Y changes
when the independent variable Xj changes by one unit, holding the other
independent variables constant
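A minimal NumPy sketch of fitting such a model by ordinary least squares, using made-up data (the variable names and true coefficients are illustrative, not from the notes):

```python
import numpy as np

# Hypothetical data: Y driven by two independent variables X1 and X2.
rng = np.random.default_rng(0)
n = 50
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 1.0 + 2.0 * X1 - 0.5 * X2 + rng.normal(scale=0.1, size=n)

# Design matrix with an intercept column; fit via least squares.
X = np.column_stack([np.ones(n), X1, X2])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)

# b[1] estimates how much Y changes for a one-unit change in X1,
# holding X2 constant (and similarly for b[2]).
print(b)
```

The printed coefficients should land close to the true values (1.0, 2.0, -0.5) used to generate the data.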

Limitations
• Same as Simple Linear Regression

Assumptions

Adjusted coefficient of determination


• 𝑅 2 will automatically increase as more independent variables are added to the
mix and reduce the amount of unexplained variation
• Adjusted 𝑅 2 will not automatically increase when more are introduced as it is
adjusted for the degrees of freedom

• If the inclusion of another independent variable results in only a nominal increase
in RSS and 𝑅 2, adjusted 𝑅 2 will decrease
• Adjusted 𝑅 2 is always less than 𝑅 2 because k is always greater than 0. In fact, it
is possible for Adjusted 𝑅 2 to be negative, which is not possible for 𝑅 2
• To compare adjusted 𝑅 2 across models, the dependent variable must be defined in the same
manner and the sample size must be the same
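The two measures above can be sketched directly from their definitions; the toy numbers below are illustrative only:

```python
import numpy as np

def r2_and_adjusted(y, y_hat, k):
    """R^2 and adjusted R^2 for a model with k independent variables."""
    n = len(y)
    ss_res = np.sum((y - y_hat) ** 2)        # unexplained variation
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, adj_r2

# Toy fitted values: with k > 0, adjusted R^2 sits below R^2
# (and, unlike R^2, it can even turn negative for a poor fit).
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.9])
r2, adj_r2 = r2_and_adjusted(y, y_hat, k=2)
print(r2, adj_r2)
```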

Confidence Intervals for Individual regression coefficients


• A confidence interval is constructed with a certain degree of confidence (1 − α)
• The critical t-value is a two-tailed value computed based on the significance
level (α) and n − (k + 1) degrees of freedom
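A short sketch of building such an interval, using SciPy for the critical t-value; the coefficient estimate, standard error, and sample sizes are hypothetical:

```python
from scipy import stats

# Hypothetical regression output: estimated slope, its standard error,
# n observations, and k independent variables.
b_hat, se_b = 0.78, 0.25
n, k = 60, 3
alpha = 0.05

# Two-tailed critical t with n - (k + 1) degrees of freedom.
t_crit = stats.t.ppf(1 - alpha / 2, df=n - (k + 1))
ci = (b_hat - t_crit * se_b, b_hat + t_crit * se_b)
print(t_crit, ci)
```

The interval is centred on the point estimate; since it excludes zero only if |b_hat| exceeds t_crit × se_b, the same ingredients drive the t-test in the next section.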
Hypothesis Testing for individual regression coefficients
• Normally, the null hypothesis is the position we want to reject, and the
alternative hypothesis is the position we want to validate
• Same as Simple Linear Regression

Analysis of Variance (ANOVA)

Hypothesis Testing for all regression coefficients in the model (F – Test)


• Same as Simple Linear Regression

Predicting the dependent variable

Violation of assumptions – Heteroskedasticity

Description
• Variance of the error term is non-constant
• Unconditional: not related to the independent variables – no major problems
• Conditional: related to the independent variables – a major problem

Effects
• For the t-test, standard errors of the regression coefficients are underestimated,
hence t-stats are overstated, so coefficients might appear significant
when they are not (Type I error)
• For F-test, the MSE becomes a biased estimator of the true population
variance
• Does not affect the consistency of the estimators of the regression parameters
and the estimates of regression coefficients

Detection
• Scatter diagrams: plot residual against each independent variable and against
time
• BP test (regress the squared residuals against the independent variables)
o Null hypothesis: no conditional heteroskedasticity vs alternative
hypothesis: conditional heteroskedasticity exists
o Test statistic is n𝑅 2 from the second regression, compared against a
one-tailed chi-square critical value
o If the test statistic exceeds the critical value, the null hypothesis is
rejected (heteroskedasticity exists)
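The BP test can be sketched on simulated data where the error variance deliberately depends on the independent variable; the data and seed are made up for illustration:

```python
import numpy as np
from scipy import stats

# Simulate residuals whose variance grows with x: conditional
# heteroskedasticity, which the BP test should detect.
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
resid = rng.normal(size=n) * np.exp(0.5 * x)  # error variance depends on x

# Second regression: squared residuals on the independent variable.
X = np.column_stack([np.ones(n), x])
u2 = resid ** 2
b, *_ = np.linalg.lstsq(X, u2, rcond=None)
fitted = X @ b
r2 = 1 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)

# Test statistic n*R^2 is one-tailed chi-square with k = 1 df here.
bp_stat = n * r2
p_value = stats.chi2.sf(bp_stat, df=1)
print(bp_stat, p_value)  # small p-value: reject H0, heteroskedasticity exists
```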


Correction
• Compute robust standard errors (aka White-corrected standard errors) used to
recalculate the t-statistics
• Use generalized least squares
Violation of assumptions – Serial Correlation (Autocorrelation)

Description
• It occurs when regression errors are correlated across observations. It typically
arises in time-series regressions
• Positive serial correlation occurs when a positive error for one observation
increases the chance of a positive error for another. Vice versa for negative

Effects (lagged value of dependent variable is not an independent variable)


• For the t-test, positive serial correlation causes underestimated standard errors of
regression coefficients, hence overstated t-stats, so coefficients might appear
falsely significant (Type I error). Vice versa for negative (Type II error)
• For the F-test, positive serial correlation causes the MSE to underestimate the
population error variance, hence an overstated F-stat, so the model might appear
falsely significant (Type I error). Vice versa for negative (Type II error)
• Does not affect the estimates of regression coefficients and the consistency of
the estimators of the regression parameters

Detection
• Null hypothesis vs alternate hypothesis
o Ho: No serial correlation
o Ha: Serial correlation exists
• DW test statistics
o If regression residuals are positively serially correlated, DW-stat will be
less than 2 (0 when serial correlation = +1)
o If regression residuals are negatively serially correlated, DW-stat will be
greater than 2 (4 when serial correlation = -1)
o If regression residuals are not serially correlated, DW-stat = 2
o Critical DW-stat is not known with certainty, only upper & lower values
• Decision rule
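The DW statistic itself is simple to compute from the residuals; a minimal sketch with simulated AR(1) residuals (the autocorrelation coefficient 0.8 is made up for illustration):

```python
import numpy as np

def durbin_watson(resid):
    """DW statistic: sum of squared first differences over sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(2)
e = rng.normal(size=500)          # uncorrelated residuals

pos = np.empty_like(e)            # positively autocorrelated residuals
pos[0] = e[0]
for t in range(1, len(e)):
    pos[t] = 0.8 * pos[t - 1] + e[t]   # AR(1) with rho = 0.8

print(durbin_watson(pos))  # well below 2 (roughly 2 * (1 - rho))
print(durbin_watson(e))    # near 2
```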
Correction
• Adjust the coefficient standard errors, e.g. using the Hansen method (which
also corrects for conditional heteroskedasticity). After correction, regression
coefficients/ DW-stat remains the same but robust standard errors are larger
• Modify regression equation to eliminate the serial correlation
Violation of assumptions – Multicollinearity

Description
• Two or more independent variables are mutually correlated, making the
interpretation of the regression output problematic.

Effects
• Does not affect the consistency of the estimators of the regression parameters
but makes the estimates of the regression coefficients imprecise and unreliable
• Overestimated SEE and coefficient standard error, hence underestimated t-
stats and the null is rejected less frequently leading to Type II errors

Detection
• When there are only two independent variables, one indicator is the high
correlation coefficient between them (rule of thumb: > 0.7)
• When dealing with more than two independent variables, low pair correlations
could still lead to multicollinearity
• Conflicting t- and F-tests: significant F-statistic combined with insignificant
individual t-statistic, but exception still exists
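The two-variable indicator above can be checked with a pairwise correlation; the data below deliberately make x2 a near-copy of x1 (illustrative values only):

```python
import numpy as np

# x2 is nearly a linear function of x1, so the pair is multicollinear.
rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.2 * rng.normal(size=200)

r = np.corrcoef(x1, x2)[0, 1]
print(r)
if abs(r) > 0.7:  # rule of thumb from the notes
    print("pairwise correlation flags potential multicollinearity")
```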

Correction
• Exclude one or more of the independent variables from the regression model
• Use stepwise regression to remove variables from the regression model
Model Specification Issues

Principles
• The model should be based on economic reasoning. This reduces the risk of
finding relationships by simply mining the data.
• The functional form for the variables should be appropriate given the nature of
the variables. Transforming the data may be necessary.
• The model should be parsimonious, which means accomplishing a lot with a
little. Each variable in the regression should be important.
• Examine the model for violations of regression assumptions before accepting
the results.
• Test the model with out of sample data. This means use data outside the
dataset that was used to create the model.

Misspecified Functional Forms


• One or more important independent variables could be omitted. This could
result in biased and inconsistent regression coefficients.
• Variables may need to be transformed, such as by taking the natural logarithm
of the variable. This is often needed to account for nonlinear relationships.
• Financial statements could be transformed by using common size statements
when gathering data from multiple companies.
• The regression model could combine data from different samples that should
not be combined.
Time-series Misspecifications
• Including lagged dependent variables as independent variables and these
lagged dependent variables are serially correlated with error terms. This
results in independent variables serially correlating with error terms
• Including a function of a dependent variable as an independent variable, i.e.
using variables measured at the end of the period to predict a value within the
same period. Only current and past information, not future information, should be used
• Independent variables that are measured with error. For example, expected
inflation could be used in the regression when only actual inflation can be
measured.
• The most frequent source of misspecification in linear regressions is
nonstationarity. That means the variable properties such as mean, and
variance are not stable through time.
Qualitative Variable

Dummy variable as Dependent Variable
• A qualitative dependent variable Y (1 = bankrupt or 0 = not) is modelled
based on various independent variables such as ROI, leverage ratios, etc
• The probit model estimates the probability that Y = 1 using the normal
distribution given a value of the independent variable X.
• The logit model estimates the probability that Y = 1 using the logistic
distribution given a value of the independent variable X
• With discriminant analysis, a linear function can be used to generate an overall
score based on which the observations can be classified qualitatively
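The probit and logit probabilities can be sketched side by side; the coefficients and the value of X below are hypothetical:

```python
import math
from scipy import stats

# Hypothetical fitted coefficients and an observed value of X.
b0, b1, x = -1.5, 0.8, 2.0
z = b0 + b1 * x

p_logit = 1.0 / (1.0 + math.exp(-z))   # logistic distribution
p_probit = stats.norm.cdf(z)           # standard normal distribution
print(p_logit, p_probit)
```

Both models map the linear score z into a probability between 0 and 1; they differ only in the distribution used for the mapping, so the two estimates are usually close.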

Dummy variable as Independent Variable


Steps in assessing a multiple regression model
