
Quantitative Methods:

Sample Covariance: cov(X,Y) = Σ(Xi - mean(X))(Yi - mean(Y)) / (n - 1)

Sample correlation coefficient: r = cov(X,Y) / (sX × sY)

(bounded by -1 ≤ r ≤ +1)

Limitations: outliers; spurious correlation (appearance of a causal linear relationship when there is
actually none); non-linear relationships (e.g., a parabola)
Is the correlation coefficient r significant? Hypothesis testing:
Ho: ρ = 0; Ha: ρ ≠ 0; t = r × √(n - 2) / √(1 - r²), df = n - 2
Reject Ho if t computed < -t critical, or t computed > +t critical
-t critical value < not significant < +t critical value
Reject < do not reject < reject -> (if Ho holds, r is not significant)
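A minimal Python sketch of this test (numpy/scipy assumed available; x and y are made-up numbers, not curriculum data):

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations (made-up data for illustration)
x = np.array([1.2, 2.3, 3.1, 4.8, 5.5, 6.9, 7.4, 8.8])
y = np.array([2.0, 2.9, 3.6, 5.1, 5.0, 7.2, 7.9, 9.1])
n = len(x)

cov_xy = np.cov(x, y, ddof=1)[0, 1]            # sample covariance
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))   # sample correlation

# Ho: rho = 0 vs Ha: rho != 0, df = n - 2
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
t_crit = stats.t.ppf(0.975, df=n - 2)          # 5% significance, two-tailed
print(r, t_stat, t_crit, abs(t_stat) > t_crit) # True -> reject Ho, r is significant
```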
Linear Regression Assumptions: - A linear relationship exists between the dependent and independent
variables; - The independent variables are uncorrelated with the residuals; - The expected value of the residual = 0;
- The variance of the residual is constant: E(εi²) = σε² for all i.
*Residuals are independently distributed and normally distributed.

Yi = b0 + b1Xi + εi

OR

^Yi = ^b0 + ^b1 Xi

Values with ^ hats are predicted; no error term

b0 = intercept; b1 = regression slope; xi = ith observation (indep.); yi = ith observation (dependent); εi = error term
The regression line is the line whose estimates ^b0 and ^b1 minimize the sum of squared differences
between the actual Yi and the predicted ^Yi (the SSE). The regression line minimizes SSE!
Simple Linear Regression = OLS (Ordinary Least Squares)

Slope: ^b1 = cov(X,Y) / var(X)
Intercept: ^b0 = mean(Y) - ^b1 × mean(X)

* slope is like Beta

Residual = difference between actual Y and predicted ^Y

Standard Error of Estimate (SEE): SEE = √(SSE / (n - 2)) = √MSE


SEE is low relative to total variability if relationship between indep & dependent variables is strong
SEE is high relative to total variability if relationship between indep & dependent variables is weak
Coefficient of determination R² = % of total variation in the dependent variable explained by the independent
variable.
*For simple linear regression it's easy -> just square the correlation coefficient, i.e., R² = r²
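A short sketch tying the formulas together (slope = cov/var, intercept, SEE, R²); the data are hypothetical:

```python
import numpy as np

# Hypothetical paired sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])
n = len(x)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope = cov / var
b0 = y.mean() - b1 * x.mean()                        # intercept
y_hat = b0 + b1 * x
resid = y - y_hat                                    # residual = actual - predicted

sse = np.sum(resid**2)
sst = np.sum((y - y.mean())**2)
see = np.sqrt(sse / (n - 2))   # standard error of estimate
r2 = 1 - sse / sst             # = RSS / SST; equals r^2 for simple regression
print(b0, b1, see, r2)
```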

Regression Coefficient Confidence Interval: Is a slope coefficient statistically different from zero?
Ho: b1 = 0; Ha: b1 ≠ 0; CI: ^b1 ± tc × sb1

*the b1 being tested (the hypothesized value) has no hat ^

^b1 = estimated slope coefficient; tc = critical t-value (from the table); sb1 = standard error of the slope coefficient

* tc has df = n - 2

*If zero is not in the range, the slope coefficient is statistically different from zero!
A t-test can be performed to test the hypothesis that the true slope coefficient b1 is equal to some
hypothesized value: t = (^b1 - b1) / sb1, df = n - 2.
-t critical value (reject) < do not reject < +t critical value (reject)
Rejection of the null means that the slope coefficient IS different from the hypothesized
value of b1.

To test whether the indep. variable explains variation in the dependent variable (i.e., is statistically significant), the hypothesis is
whether the true slope is zero (b1 = 0); Ho: b1 = 0; Ha: b1 ≠ 0.
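Sketch of the slope t-test, using the usual standard-error formula sb1 = SEE / √Σ(xi - mean(x))²; data are made up:

```python
import numpy as np
from scipy import stats

# Hypothetical paired sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])
n = len(x)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
see = np.sqrt(np.sum(resid**2) / (n - 2))

s_b1 = see / np.sqrt(np.sum((x - x.mean())**2))   # std. error of the slope
t_stat = (b1 - 0) / s_b1                          # Ho: b1 = 0
t_crit = stats.t.ppf(0.975, df=n - 2)
print(t_stat, t_crit, abs(t_stat) > t_crit)       # reject Ho if |t| > t critical
```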

^Y = ^b0 + ^b1 Xp = predicted value of Y
Confidence Intervals for Predicted Values: ^Y ± tc × sf
df = n - 2;
sf = std. error of the forecast

Ignore this one -> highly unlikely ->:
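For reference only (per the note above), a sketch of the prediction interval, assuming the standard forecast-variance formula sf² = SEE² × [1 + 1/n + (Xp - mean(X))² / ((n - 1) × var(X))]; data are hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical paired sample and a forecast point
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])
n = len(x)
x_p = 7.0                                   # value of X we want a prediction for

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
see2 = np.sum(resid**2) / (n - 2)           # SEE squared

y_hat = b0 + b1 * x_p                       # predicted value of Y
# variance of the forecast error (assumed standard formula, see lead-in)
sf2 = see2 * (1 + 1/n + (x_p - x.mean())**2 / ((n - 1) * np.var(x, ddof=1)))
t_crit = stats.t.ppf(0.975, df=n - 2)
ci = (y_hat - t_crit * np.sqrt(sf2), y_hat + t_crit * np.sqrt(sf2))
print(y_hat, ci)
```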

ANOVA:
Total Sum of Squares (SST) = measures the total variation in the dependent variable

Regression Sum of Squares (RSS) = measures the variation in the dependent variable that is explained by the
independent variable

Sum of Squared Errors (SSE) = measures the unexplained variation in the dependent variable
Also: SST = RSS + SSE

Regression: MSR (mean regression sum of squares) = RSS / k = RSS / 1 = RSS   (df = 1)

Error: MSE (mean squared error) = SSE / (n - 2)   (df = n - 2)

Total: SST   (df = n - 1)

R² = % of total variation in the dependent variable explained by the independent variable
(also called the coefficient of determination)
= [Total variation (SST) - unexplained (SSE)] / Total variation (SST)
= explained (RSS) / Total (SST)

SEE = std. dev. of the regression error
= √MSE = √(SSE / (n - 2))

F-Statistic: F = MSR / MSE. Assesses how well a set of independent variables, as a group, explains the variation in the
dependent variable.

In multiple regression, it tests whether at least 1 independent variable explains a significant portion of the
variation in the dependent variable.
For 1 indep. variable, it is the same thing as the t-test!
df numerator = k = 1; df denominator = n - k - 1 = n - 2
Reject Ho if F > F critical
Limitations: the linear relationship can change over time (parameter instability); even if the model is accurate,
other people's awareness of the relationship can lead them to act on the same information.
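A sketch that builds the ANOVA pieces (SST, RSS, SSE, MSR, MSE) and the F-test for a simple regression; data are made up:

```python
import numpy as np
from scipy import stats

# Hypothetical paired sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.9])
n, k = len(x), 1

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean())**2)      # total variation
rss = np.sum((y_hat - y.mean())**2)  # explained variation
sse = np.sum((y - y_hat)**2)         # unexplained variation

msr = rss / k                        # df = k = 1
mse = sse / (n - k - 1)              # df = n - 2 for simple regression
f_stat = msr / mse
f_crit = stats.f.ppf(0.95, dfn=k, dfd=n - k - 1)   # one-tailed test
print(f_stat, f_crit, f_stat > f_crit)             # reject Ho if F > F critical
```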

Multiple Regression
T-test: assess the significance of individual regression parameters
F-test: assess the effectiveness of the model as a whole in explaining the dependent variable
Simple linear regression (univariate): e.g., stock return against beta
Multiple: e.g., stock return against beta, firm size, industry, etc.
General multiple linear regression model: Yi = b0 + b1X1i + b2X2i + ... + bkXki + εi
n = # of observations; k = # of indep. variables
Estimates the intercept & slope coefficients such that the sum of squared errors is minimized.

^ hats are estimates (everything with hats)

b0: the intercept term is the value of the dependent variable when the independent variables are all equal to zero.
bk: each slope coefficient is the estimated change in the dependent variable for a one-unit change in
THAT independent variable, holding all other independent variables constant! (partial slope coefficients)

Hypothesis testing: t-test -> significance of individual coefficients: t = (^bj - bj) / sbj

df: n - k - 1

Test statistical significance? Null: coefficient = 0; Ho: bj = 0; Ha: bj ≠ 0


*p-value = the smallest level of significance for which the null can be rejected.
If the p-value is less than the significance level, the null is rejected; if the p-value is greater than the significance level, the null cannot be rejected.

Confidence intervals for a regression coefficient (same as simple linear): ^bj ± tc × sbj

Or: estimated regression coefficient +/- (critical t-value)(coefficient standard error)
df: n - k - 1
Predicting the dependent variable -> substitute in the numbers:

^Yi = ^b0 + ^b1 X1i + ^b2 X2i + ... + ^bk Xki

Assumptions: a linear relationship exists between the dependent and indep. variables; the indep. variables are not
random; the expected value of the error term is zero; the variance of the error term is constant; the error term is normally
distributed.

F-Statistic = assesses how well the set of indep. variables, as a group, explains the variation in the dependent variable;
tests whether at least one indep. variable explains a portion of the variation.
F = MSR / MSE = (RSS / k) / (SSE / (n - k - 1))
Ho: b1 = b2 = b3 = b4 = 0;

Ha: at least one bj ≠ 0

* always a one-tailed test, despite the "=" in Ho and "≠" in Ha

df numerator = k;

df denominator = n - k - 1

Decision rule: reject Ho if F (test stat) > Fc (critical value)

The coefficient of determination (R²) can be used to test the overall effectiveness of the entire set of
independent variables in explaining the dependent variable (e.g., an R² of 0.63 indicates that the model, as a whole,
explains 63% of the variation in the dependent variable).
R² = % of total variation in the dependent variable explained by the independent variables
= [Total variation (SST) - unexplained (SSE)] / Total variation (SST)

= (SST - SSE) / SST = explained (RSS) / Total (SST)


* May need to adjust R² to compensate for the upward bias when many additional variables are
added. Adjusted R² (Ra²) takes into consideration the fact that adding independent variables affects R²: more IVs
may not significantly affect the DV but will still increase R².

We want a higher Ra²! * Ra² must be lower than R²!

Adjusted Ra² = 1 - [(n - 1) / (n - k - 1)] × (1 - R²)

R² = RSS / SST;

F = MSR / MSE with k and n - k - 1 df

SEE = √MSE
p-value decision rules; t-stat decision rules; and F-stat decision rules!
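A hedged sketch of a multiple regression run with statsmodels, showing where the t-stats, p-values, F-stat, R², and adjusted R² come from; the data are simulated, not from any reading:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: one dependent and two independent variables
rng = np.random.default_rng(0)
x1 = rng.normal(size=60)
x2 = rng.normal(size=60)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(scale=0.8, size=60)

X = sm.add_constant(np.column_stack([x1, x2]))   # adds the intercept column
model = sm.OLS(y, X).fit()

print(model.params)       # b0, b1, b2 estimates
print(model.tvalues)      # t-stats for each coefficient (df = n - k - 1)
print(model.pvalues)      # reject Ho: bj = 0 if p-value < significance level
print(model.fvalue)       # F-stat for Ho: b1 = b2 = 0
print(model.rsquared, model.rsquared_adj)   # R^2 and adjusted R^2
print(model.conf_int(alpha=0.05))           # coefficient confidence intervals
```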

Dummy Variables: used when an independent variable is binary in nature (on/off). The # of dummy variables is 1 less
than the # of classes (i.e., n - 1). The omitted class is represented by the intercept.
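A small sketch of the dummy-variable setup (simulated returns; "quarter" is a hypothetical 4-class variable, so 3 dummies are created and the omitted class sits in the intercept):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical quarterly returns with a "quarter" class variable (4 classes)
df = pd.DataFrame({
    "ret": np.random.default_rng(1).normal(size=20),
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 5,
})

# 4 classes -> 3 dummies; the omitted class (Q1) is captured by the intercept
dummies = pd.get_dummies(df["quarter"], drop_first=True).astype(float)
X = sm.add_constant(dummies)
model = sm.OLS(df["ret"], X).fit()
print(model.params)   # const = mean of Q1; each dummy = difference vs Q1
```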

Heteroskedasticity; Serial Correlation (autocorrelation); Multicollinearity


Heteroskedasticity -> occurs when the variance of the residuals is not the same across all observations in the
sample (i.e., some observations are more spread out than the rest of the sample).
Unconditional -> not related to the level of the indep. variables. Does not increase/decrease with changes in the
value of the independent variables (usually causes no problems for the regression).
Conditional -> related to the level of (conditional on) the indep. variables. E.g., exists if the variance of the residual
term increases as the value of the independent variable increases. Creates significant problems for
statistical inference.

Effect of Heteroskedasticity on Regression Analysis:
- Standard errors are usually unreliable estimates.
- Coefficient estimates (^bj) aren't affected.
- Std. error too small -> t-stats too large -> Ho rejected too often
- Std. error too large -> t-stats too small -> Ho not rejected often enough
- F-test unreliable

Detecting Heteroskedasticity
-> scatter plots; Breusch-Pagan chi-squared test
Scatter plots: the variation in the regression residuals increases as the indep. variables increase
Breusch-Pagan chi-squared test: BP = n × R²resid, with k df
* R²resid = the R² from a second regression of the squared residuals from the first regression on the indep.
variables (the 2nd regression, not the original regression); n = # obs; k = # indep. variables; * one-tailed
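Sketch of the BP test done by hand (simulated data with conditional heteroskedasticity built in); statsmodels also has het_breuschpagan, but the manual version mirrors the n × R² formula above:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Simulated data whose residual spread grows with x (conditional heteroskedasticity)
rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=100)
y = 2 + 0.5 * x + rng.normal(scale=0.3 * x)   # residual variance rises with x

X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid

# Second regression: squared residuals on the independent variable(s)
aux = sm.OLS(resid**2, X).fit()
n, k = len(x), 1
bp_stat = n * aux.rsquared                    # BP = n * R^2 of the residual regression
chi2_crit = stats.chi2.ppf(0.95, df=k)        # one-tailed chi-squared, k df
print(bp_stat, chi2_crit, bp_stat > chi2_crit)
```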
Correcting Heteroskedasticity
Robust standard errors (White-corrected std. errors OR heteroskedasticity-consistent std. errors)
-> recalculate the t-stats using the original regression coefficients
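A quick sketch of the correction: same coefficients, White-corrected standard errors (cov_type="HC0" in statsmodels; data reuse the simulated example above):

```python
import numpy as np
import statsmodels.api as sm

# Refit the heteroskedastic example with White / robust standard errors
rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=100)
y = 2 + 0.5 * x + rng.normal(scale=0.3 * x)
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()                   # ordinary standard errors
robust = sm.OLS(y, X).fit(cov_type="HC0")  # White-corrected standard errors
print(ols.params, robust.params)           # coefficient estimates are identical
print(ols.bse, robust.bse)                 # only the standard errors / t-stats change
```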
Serial Correlation (Autocorrelation):
Residual terms are correlated with one another (relatively common with time-series data).
Positive SC -> exists when a positive regression error in one time period increases the probability of observing a
positive regression error in the next time period.
Negative SC -> exists when a negative regression error in one time period increases the probability of observing a
negative regression error in the next time period.
Effect: typically results in coefficient std. errors that are too small, causing too many Type I errors -> rejecting the null
when it is actually true.
Detecting: residual plots or the Durbin-Watson statistic
Durbin-Watson statistic: another method to detect the presence of serial correlation
DW = Σ(^εt - ^εt-1)² / Σ^εt², where ^εt = residual for period t
0 to dL: reject Ho; dL to dU: inconclusive; above dU: do not reject Ho
Ho = no positive serial correlation

If the sample size is very large, DW ≈ 2(1 - r), where r = the correlation coefficient between residuals from one
period and those from the previous period.
Approximation: DW ≈ 2: error terms are NOT serially correlated
DW < 2: error terms positively serially correlated
DW > 2: error terms negatively serially correlated
E.g., Ho: the regression has no positive serial correlation.
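Sketch of the DW check on a regression whose errors are simulated with positive serial correlation:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Simulated regression with AR(1) errors (positive serial correlation)
rng = np.random.default_rng(3)
n = 120
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.7 * e[t - 1] + rng.normal(scale=0.5)   # positively correlated errors
y = 1.0 + 0.4 * x + e

resid = sm.OLS(y, sm.add_constant(x)).fit().resid
dw = durbin_watson(resid)   # DW ~ 2(1 - r); well below 2 here -> positive SC
print(dw)
```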
Correcting for it: adjust the coefficient std. errors using the Hansen method (really? -> unbelievable).

Multicollinearity:
Two or more of the independent variables (or linear combinations of indep. variables) in a multiple regression
are highly correlated with each other.
Effect: results in a greater probability of incorrectly concluding that a variable is not statistically significant (Type
II error).
Detection -> when the t-tests indicate that no individual coefficient is statistically different from zero, while the
F-test is statistically significant and R² is high.
Correcting for it -> omit one of the independent variables.
Misspecification: the selection of independent variables to be included in the regression & their transformations;
1 - omitting a variable
2 - failing to transform a variable (log/ln)
3 - incorrectly pooling data
4 - using a lagged dependent variable as an independent variable
5 - forecasting the past
6 - measuring indep. variables with error

Qualitative Variables: we can use binary dummy variables; Probit and Logit models; discriminant models

Time Series Analysis: a set of observations for a variable over successive periods of time.
Linear trend model = a pattern that can be graphed with a straight line:
Yt = b0 + b1(t) + εt
Ordinary least squares regression is used to estimate the coefficients of the trend line; ^Yt = ^b0 + ^b1(t) -> hats, no error term!
Log-linear trend models (exponential growth), continuous compounding:
positive b1: convex; negative b1: concave
Yt = e^(b0 + b1(t)); b1 = constant rate of growth; e = base of the natural log
Y -> dependent variable; time is the independent variable (Y is an exponential function of time); take the ln of both sides for the log-linear form: ln(Yt) = ln(e^(b0 + b1(t))) -> ln(Yt) = b0 + b1(t)
Which model to use?
Linear = data points seem to be equally distributed above and below the regression line (constant AMOUNT)
Log-linear = curved: constant RATE, not amount (financial data) (serial correlation -> might work)
Time series: DW should be ≈ 2.0;

DW ≠ 2.0 -> residual terms are correlated
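A sketch of fitting a log-linear trend to a simulated exponential-growth series (ln of Y regressed on t):

```python
import numpy as np
import statsmodels.api as sm

# Simulated exponential-growth series: fit ln(y_t) = b0 + b1*t
rng = np.random.default_rng(10)
t = np.arange(1, 81)
y = np.exp(0.05 + 0.03 * t) * np.exp(rng.normal(scale=0.05, size=80))

model = sm.OLS(np.log(y), sm.add_constant(t)).fit()
print(model.params)   # b0 and b1; b1 ~ 0.03 = constant growth rate per period
```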

Autoregressive (AR) Model -> the dependent variable is regressed against one or more lagged values of itself:
xt = b0 + b1xt-1 + εt
Covariance stationarity: constant & finite expected value of the time series over time; constant and finite
variance (volatility around its mean); constant and finite covariance between values at any given lag
*mean-reverting level
Conditions -> constant & finite: expected value; variance; covariance
Correctly specified = no serial correlation [no correlation of the errors]
Forecasting: calculate the one-step-ahead forecast before the two-step.

Test Model Fit: if the AR model is correctly specified, the residuals will not exhibit serial correlation (SC = error terms positively/negatively
correlated).
1 - estimate the AR model using linear regression
2 - calculate the autocorrelations of the model's residuals
3 - test whether the autocorrelations are significantly different from zero. *If the model is correct, none of the
autocorrelations will be significant (i.e., all the t-tests will fail to reject zero). t-stat = estimated autocorrelation / std. error; std. error
= 1/√T; T = # obs (just like n)
t = autocorrelation / (1/√T);

then check t-calc vs. the table and accept or reject
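Sketch of the residual-autocorrelation check on a simulated AR(1) series (t-stat = autocorrelation / (1/√T)):

```python
import numpy as np
import statsmodels.api as sm

# Simulated AR(1) series, then check whether the AR(1) residuals are serially correlated
rng = np.random.default_rng(4)
T = 200
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.2 + 0.6 * x[t - 1] + rng.normal(scale=0.5)

# Step 1: estimate the AR(1) model by regressing x_t on x_{t-1}
X = sm.add_constant(x[:-1])
resid = sm.OLS(x[1:], X).fit().resid
n = len(resid)

# Steps 2-3: residual autocorrelations and their t-stats (std. error = 1/sqrt(T))
for lag in (1, 2, 3, 4):
    r = np.corrcoef(resid[lag:], resid[:-lag])[0, 1]
    t_stat = r / (1 / np.sqrt(n))
    print(lag, round(r, 3), round(t_stat, 2))   # |t| < ~2 -> not significant at 5%
```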

Mean reversion -> the tendency to move toward the mean (decline when the current value is above the mean;
increase when below).
When AT the mean-reverting level -> the model predicts that the next value will be the same as the current value (e.g., ^xt = xt-1)
AR(1): xt = b0 + b1·xt-1;

set xt = xt-1 and solve for xt to find the mean-reverting level:

xt = b0 / (1 - b1)

if xt > b0 / (1 - b1), the model predicts the next value will be below xt (a decline toward the mean)


* All covariance stationary time series have a mean-reverting level;
absolute value of the lag coefficient < 1 = finite mean-reverting level
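Tiny worked example of the mean-reverting level (b0 and b1 are hypothetical AR(1) estimates):

```python
# Mean-reverting level of an AR(1): x* = b0 / (1 - b1)
b0, b1 = 0.2, 0.6                  # hypothetical estimates
mr_level = b0 / (1 - b1)           # = 0.5; finite because |b1| < 1
print(mr_level)

x_t = 0.8                          # current value above the mean-reverting level
x_next = b0 + b1 * x_t             # = 0.68 < 0.8 -> model predicts a decline toward 0.5
print(x_next)
```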
In-sample forecasts: made within the range of data (i.e., the time period) used to estimate the model (the sample or test
period)
Out-of-sample forecasts: are made outside of the sample period.
RMSE: Root Mean Squared Error -> lower is better

[Test between AR(1) & AR(2) in sample]

Instability of coefficients of time-series models: history doesn't necessarily repeat itself. Many
contributing factors could change over time.
Compare the accuracy of AR models in forecasting values.
Random walk: the predicted value for one period equals the previous period's value plus a random error term (e.g.,
xt = xt-1 + εt)
With a drift: intercept ≠ 0 (e.g., xt = b0 + b1xt-1 + εt); b0 = constant drift;
With or without a drift: b1 = 1
Neither of these exhibits covariance stationarity: the mean-reverting level b0/(1 - b1) is impossible if b1 = 1, since b0/0 is undefined; the series exhibits a
unit root.
Must have a finite mean-reverting level to be covariance stationary.

If the coefficient on the lag variable is 1.0, the series is not covariance stationary; it is said to have a unit root.
Testing for non-stationarity: 1 - AR model autocorrelations; 2 - Dickey-Fuller. 1: a stationary process will have
residual autocorrelations insignificantly different from zero at all lags.
Dickey-Fuller: xt = b0 + b1xt-1 + ε -> subtract xt-1 from both sides: xt - xt-1 = b0 + (b1 - 1)xt-1 + ε
Test whether (b1 - 1) = 0 with a modified t-test. If (b1 - 1) is not significantly different from zero, then b1 = 1 and the series has a unit root!
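Sketch of the (augmented) Dickey-Fuller test from statsmodels on a simulated random walk vs. a simulated stationary AR(1):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Simulated series: a random walk (unit root) vs. a stationary AR(1)
rng = np.random.default_rng(5)
T = 300
e = rng.normal(size=T)
random_walk = np.cumsum(e)            # x_t = x_{t-1} + e_t
ar1 = np.zeros(T)
for t in range(1, T):
    ar1[t] = 0.2 + 0.5 * ar1[t - 1] + e[t]

# (Augmented) Dickey-Fuller: Ho = the series has a unit root
for name, series in [("random walk", random_walk), ("AR(1)", ar1)]:
    stat, pval = adfuller(series)[:2]
    print(name, round(stat, 2), round(pval, 3))   # expect to reject Ho only for the AR(1)
```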

First Differencing
If we believe a time series is a random walk (i.e., has a unit root), we can transform the data into a covariance
stationary time series (first differencing) -> subtract the value of the dependent variable in the previous period
from the current value to define a new dependent variable (i.e., yt = xt - xt-1; yt = εt).
yt = b0 + b1yt-1 + εt;

b0 = b1 = 0; mean-reverting level = 0/(1 - 0) = 0 -> covariance stationary
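Sketch of first differencing a simulated random walk; the differenced series is just the error term:

```python
import numpy as np

# First differencing a random walk recovers a covariance-stationary series
rng = np.random.default_rng(6)
e = rng.normal(size=300)
x = np.cumsum(e)               # random walk: x_t = x_{t-1} + e_t (unit root)
y = np.diff(x)                 # y_t = x_t - x_{t-1}
print(np.allclose(y, e[1:]))   # True: the differenced series equals the error term
```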

Seasonality:
To detect seasonality -> test each period to see if the lagged autocorrelations are different from zero (t-test) (n - k); check the significance of each period.
Correcting for seasonality: occupancy in any quarter = previous quarter + same quarter in the previous year.
An additional lag of the dependent variable (same period, previous year) is added to the original model as
another independent variable.
e.g.
ln(xt) = b0 + b1·ln(xt-1) + b2·ln(xt-4) + εt
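Sketch of the seasonal AR fit (lag 1 plus seasonal lag 4) on a simulated quarterly log series:

```python
import numpy as np
import statsmodels.api as sm

# Simulated quarterly log series with a seasonal (lag-4) component
rng = np.random.default_rng(7)
T = 120
lnx = np.zeros(T)
for t in range(4, T):
    lnx[t] = 0.1 + 0.3 * lnx[t - 1] + 0.4 * lnx[t - 4] + rng.normal(scale=0.2)

# ln(x_t) = b0 + b1*ln(x_{t-1}) + b2*ln(x_{t-4}) + e_t
y = lnx[4:]
X = sm.add_constant(np.column_stack([lnx[3:-1], lnx[:-4]]))
model = sm.OLS(y, X).fit()
print(model.params)    # b0, b1 (lag 1), b2 (seasonal lag 4)
print(model.tvalues)   # the lag-4 coefficient should test significant
```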

ARCH Models -> Autoregressive Conditional Heteroskedasticity


In a single time series, ARCH exists if the variance of the residuals in one period is dependent on the
variance of the residuals in a previous period; if so, the std. errors of the regression coefficients & the hypothesis tests are
invalid.
An ARCH(1) model tests for it: the squared residuals from the estimated time series, ^εt², are regressed on the first
lag of the squared residuals, ε²t-1: ^εt² = a0 + a1·ε²t-1 + ut.

If a1 is statistically different from zero, the time series is ARCH(1). If ARCH errors are present, GLS is used to develop the predictive model.


GLS -> generalized least squares;

Predicting the Variance of a Time Series:


If the series has ARCH errors, the ARCH model can be used to predict the variance of the residuals in future
periods: ^σ²t+1 = ^a0 + ^a1·^εt²
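Sketch of the ARCH(1) check and the one-period-ahead variance forecast on simulated ARCH errors:

```python
import numpy as np
import statsmodels.api as sm

# Simulate residuals whose variance depends on the previous squared residual
rng = np.random.default_rng(8)
T = 500
e = np.zeros(T)
for t in range(1, T):
    sigma2_t = 0.2 + 0.6 * e[t - 1] ** 2          # conditional variance
    e[t] = rng.normal(scale=np.sqrt(sigma2_t))

# ARCH(1) test: regress e_t^2 on e_{t-1}^2
X = sm.add_constant(e[:-1] ** 2)
arch = sm.OLS(e[1:] ** 2, X).fit()
print(arch.params, arch.tvalues[1])   # a1 significantly != 0 -> ARCH(1) errors

# One-period-ahead variance forecast: sigma^2_{T+1} = a0 + a1 * e_T^2
a0, a1 = arch.params
print(a0 + a1 * e[-1] ** 2)
```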

Two time series (two different variables) -> test both for unit roots, using separate DF tests, with 5 possible results:

1 - Both series are covariance stationary (can use linear regression; coefficients should be reliable)
2 - Only the dependent-variable time series is covariance stationary (not reliable)
3 - Only the indep-variable time series is covariance stationary (not reliable)
4 - Neither time series is covariance stationary & the two are NOT cointegrated (not reliable)
5 - Neither time series is covariance stationary & the two ARE cointegrated (reliable)

Cointegration: means that two time series are economically linked or follow the same trend & that the
relationship is not expected to change.
Regress one variable on the other: yt = b0 + b1xt + εt;

time series y and x

-> the residuals are tested for a unit root using the Dickey-Fuller test. If we reject the null of a unit root, we say the error terms are covariance
stationary & the two series are cointegrated. Use Engle-Granger critical t-values, not the standard ones.
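Sketch of the Engle-Granger cointegration test via statsmodels coint() on two simulated series that share a common random-walk trend:

```python
import numpy as np
from statsmodels.tsa.stattools import coint

# Two simulated series sharing a common stochastic trend -> cointegrated
rng = np.random.default_rng(9)
trend = np.cumsum(rng.normal(size=400))        # shared random-walk trend (unit root)
x = trend + rng.normal(scale=0.5, size=400)
y = 2.0 + 1.5 * trend + rng.normal(scale=0.5, size=400)

# Engle-Granger two-step: regress y on x, then unit-root test on the residuals
t_stat, pvalue, crit = coint(y, x)
print(round(t_stat, 2), round(pvalue, 3))      # small p-value -> reject "no cointegration"
```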

Which Time Series Model to Use?

1 - Determine the goal
2 - Examine the characteristics (heteroskedasticity, seasonality, etc.)
3 - Use a trend model (linear, log-linear, etc.)
4 - Test for serial correlation (Durbin-Watson); if SC is present -> step 5
5 - AR model
6 - First differencing
7 - ARCH
8 - If two good models are left over, calculate the out-of-sample RMSE to determine which one is better!
