Sample Covariance: Sxy = sum[(Xi - Xbar)(Yi - Ybar)] / (n - 1)
Limitations: outliers; Spurious Correlation (appearance of a causal linear relationship when there is
actually none); Non-linear relationships (e.g., a parabola)
Determine if correlation coefficient r is significant? Hypothesis Testing:
Ho: rho = 0; Ha: rho ≠ 0
t = r * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2
Reject Ho if t computed < -t critical, or t computed > +t critical
-t critical value < Not Significant < +t critical value
reject | do not reject | reject (failing to reject Ho = correlation not significant)
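The significance test above can be sketched in a few lines of Python; the r = 0.5 and n = 27 below are made-up illustration numbers, and the critical value itself would still come from a t-table:

```python
import math

def corr_t_stat(r, n):
    """t-statistic for Ho: rho = 0 vs Ha: rho != 0, with df = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Illustrative numbers: r = 0.5 estimated from n = 27 observations
t = corr_t_stat(0.5, 27)   # df = 25; compare |t| to the two-tailed t critical value
```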
Linear Regression Assumptions: - Linear relationship exists between dependent and independent
variables; - Independent variable is uncorrelated with the residuals; - residual expected value = 0;
- Variance of the residual is constant (homoskedasticity): E(Ei^2) = sigma^2 for all i;
*Residuals = independently distributed; normally distributed.
Yi = b0 + b1Xi + Ei
OR
^Yi = ^b0 + ^b1 Xi
b0 = intercept; b1 = regression slope; Xi = ith observation (indep); Yi = ith obs (dependent); Ei = error term
The regression line is the line whose estimates ^b0 and ^b1 are such that the sum of squared differences
between actual Yi and predicted ^Yi is minimized (SSE). Regression line minimizes SSE!
Simple Linear Regression = OLS (Ordinary Least Squares)
Slope: ^b1 = covariance(X, Y) / variance(X)
Intercept: ^b0 = Ybar - ^b1 * Xbar
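A minimal pure-Python sketch of those two estimators; the data below are made up and generated from y = 1 + 2x, so OLS recovers the coefficients exactly:

```python
def ols_simple(x, y):
    """Simple OLS: slope = Cov(x, y) / Var(x); intercept = ybar - slope * xbar."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    var = sum((xi - mx) ** 2 for xi in x) / (n - 1)
    b1 = cov / var
    b0 = my - b1 * mx
    return b0, b1

# Data from y = 1 + 2x, so the fit is exact: b0 = 1, b1 = 2
b0, b1 = ols_simple([1, 2, 3, 4], [3, 5, 7, 9])
```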
Regression Coefficient Confidence Interval: Is a slope coefficient statistically different from zero?
Ho: b1 = 0; Ha: b1 ≠ 0
CI: ^b1 +/- (tc * std. error of ^b1), with df = n - 2
*If Zero is not in the range, it is statistically different from Zero (slope coefficient that is)!
A t-test can be performed to test the hypothesis that the true slope coefficient b1 is equal to some
hypothesized value: t = (^b1 - b1) / std. error of ^b1, with df = n - 2.
-t critical value (reject) < fail to reject < +t critical value (reject)
Rejection of the null means that the slope coefficient IS different from the hypothesized
value of b1.
To test whether the indep variable explains variation in the dependent (i.e., is statistically significant), the hypothesis is
whether the true slope is zero (b1 = 0); Ho: b1 = 0; Ha: b1 ≠ 0
^Y = predicted value of Y: ^Y = ^b0 + ^b1 X
Confidence Intervals for Predicted Values: ^Y +/- (tc * sf)
df = n - 2
sf = std. error of forecast
ANOVA:
Total Sum of Squares (SST) = Measures total variation in dependent variable; df = n - 1
Regression Sum of Squares (RSS) = Measures variation in dependent variable that is explained by
independent variable; df = 1; MSR = RSS / 1
Sum of Squared Errors (SSE) = Measures unexplained variation in dependent variable; df = n - 2;
MSE = SSE / (n - 2)
Also: SST = RSS + SSE
SEE = sqrt(MSE) = sqrt(SSE / (n - 2))
F-Statistic: Assesses how well a set of independent variables, as a group, explains the variation in the
dependent variable.
In Multiple Regression, tests whether at least 1 independent variable explains a significant portion of the
variation in the dependent variable.
F = MSR / MSE; for 1 indep variable, same thing as the t-test (F = t^2)!
Df numerator = k = 1; Df denominator = n - k - 1 = n - 2
Reject H0 if F > F critical (one-tailed test)
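The ANOVA decomposition and F-statistic can be sketched as follows; the y values and fitted values below are made-up numbers, and k = 1 corresponds to simple regression:

```python
def anova_f(y, y_hat, k):
    """F = MSR / MSE, using the decomposition SST = RSS + SSE."""
    n = len(y)
    ybar = sum(y) / n
    sst = sum((yi - ybar) ** 2 for yi in y)                # total variation
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation
    rss = sst - sse                                        # explained variation
    msr = rss / k                                          # df numerator = k
    mse = sse / (n - k - 1)                                # df denominator = n - k - 1
    return msr / mse

# Made-up actuals vs fitted values from a near-perfect simple regression
f_stat = anova_f([3, 5, 7, 9], [3.1, 4.9, 7.1, 8.9], k=1)
# reject H0 if f_stat exceeds the one-tailed F critical value
```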
Limitations: the linear relationship can change over time (parameter instability); even if the model is accurate,
other people's awareness of the relationship means they can act on the same information.
Multiple Regression
T-test: assess significance of individual regression parameters
F-test: assess effectiveness of model as a whole in explaining dependent variable
Simple Linear Regression (Univariate): eg. Stock return against Beta
Multiple: eg. Stock return against beta, firm size, industry, etc.
General Multiple linear regression model:
Yi = b0 + b1X1i + b2X2i + ... + bkXki + Ei
n = # of observations; k = # of indep. variables
Estimates intercept & slope coefficients such that the sum of squared errors is minimized.
b0: intercept term is the value of the dependent variable when the independent variables are all equal to 0.
bk: each slope coefficient is the estimated change in the dependent variable for a one-unit change in
THAT independent variable, holding all other independent variables constant! (Partial slope coefficients)
df: n - k - 1
^Yi = ^b0 + ^b1X1i + ... + ^bkXki
Assumptions: Linear relationship between dependent and indep variables; Indep variables are not
random; Expected value of error term is Zero; variance of error term is constant; error is normally
distributed;
F-Statistic = assesses how well the set of indep. variables as a group explains variation in the dependent variable;
tests whether at least one indep. variable explains a significant portion of the variation.
Ho: b1 = b2 = ... = bk = 0; Ha: at least one bj ≠ 0
Df numerator = k;
Df denominator = n - k - 1
Coefficient of Determination (R2) can be used to test overall effectiveness of the entire set of
independent variables in explaining dependent variable. (eg. R2 of .63 indicates that model, as a whole,
explains 63% of variation of dependent variable.)
R2 = % of total variation in dependent variable explained by the independent variables
= [Total variation (SST) - unexplained variation (SSE)] / SST
R2 = RSS / SST
Adjusted R2: R2a = 1 - [(n - 1) / (n - k - 1)] * (1 - R2)
SEE = sqrt(MSE)
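These three fit measures come straight from the ANOVA quantities; the SST/SSE/n/k values below are made up, chosen to reproduce the R2 = 0.63 example:

```python
def fit_stats(sst, sse, n, k):
    """R^2, adjusted R^2, and SEE from the ANOVA table quantities."""
    r2 = (sst - sse) / sst                        # = RSS / SST
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    see = (sse / (n - k - 1)) ** 0.5              # SEE = sqrt(MSE)
    return r2, adj_r2, see

# Made-up values: the model explains 63% of the variation in Y
r2, adj_r2, see = fit_stats(sst=100.0, sse=37.0, n=30, k=3)
```

Note that adjusted R2 is always at or below R2 and penalizes adding regressors that contribute little.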
p-value decision rules; t-stat decision rules and F-stat decision Rules!
Dummy Variables: if an independent variable is binary in nature (on/off), use dummy variables; the # of dummy variables is 1 less
than the # of classes (n - 1). The omitted class is represented by the intercept.
Detecting Heteroskedasticity
-> scatter plots; Breusch-Pagan chi-squared test
Scatter plots: variation in the regression residuals increases as the indep variable increases
Breusch-Pagan chi-squared test: BP = n * R2residuals, with k dfs
* R2resid = R2 from a second regression of the squared residuals from the first regression on the indep
variables. (2nd regression, not the original regression); n = # obs; k = # indep variables; * one-tailed
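A one-regressor sketch of the test, doing the auxiliary regression by hand; the residuals below are made-up numbers whose squares are exactly linear in x, so R2resid = 1 and BP = n:

```python
def breusch_pagan(x, resid):
    """BP = n * R^2 from regressing the SQUARED residuals on x (one regressor).
    Compare to a one-tailed chi-squared critical value with k = 1 df."""
    n = len(x)
    u2 = [e * e for e in resid]                      # squared residuals
    mx, mu = sum(x) / n, sum(u2) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxu = sum((xi - mx) * (ui - mu) for xi, ui in zip(x, u2))
    b1 = sxu / sxx                                   # auxiliary-regression slope
    b0 = mu - b1 * mx
    sst = sum((ui - mu) ** 2 for ui in u2)
    sse = sum((ui - (b0 + b1 * xi)) ** 2 for ui, xi in zip(u2, x))
    return n * (sst - sse) / sst                     # n * R^2 of the 2nd regression

# Made-up residuals whose squares grow linearly with x -> strong heteroskedasticity
bp = breusch_pagan([1, 2, 3, 4], [1.0, 2 ** 0.5, 3 ** 0.5, 2.0])
```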
Correcting Heteroskedasticity
Robust Standard Errors: (White-corrected std. errors OR heteroskedasticity-consistent std. errors)
-> recalculate t-stats using original regression coefficients
Serial Correlation (Auto Correlation):
Residual terms are correlated with one another (relatively common with time series data)
+ive SC -> exists when a +ive regression error in one time period increases the probability of observing a
+ive regression error in the next time period.
-ive SC -> exists when a +ive regression error in one time period increases the probability of observing a
-ive regression error in the next time period.
Effect: typically results in coefficient std. errors that are too small; causes too many Type I errors -> rejecting the null
when it is actually true.
Detecting: Residual Plots or Durbin Watson statistic
Durbin-Watson Statistic: another method to detect the presence of serial correlation
DW = sum[(E^t - E^t-1)^2] / sum[(E^t)^2], where E^t = residual for period t
H0 = no positive serial correlation
Decision: 0 <= DW < dL -> reject H0; dL <= DW <= dU -> inconclusive; DW > dU -> do not reject H0
If sample size is very large, DW ≈ 2(1 - r), where r = correlation coefficient between residuals from one
period and those from the previous period.
Approximation: DW ≈ 2: error terms NOT serially correlated
DW < 2: error terms positively serially correlated
DW > 2: error terms negatively serially correlated
E.g., Ho: Regression has no +ive serial correlation.
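The statistic itself is simple to compute; the residual sequences below are made-up extremes (perfectly alternating residuals push DW above 2, identical residuals drive DW to 0):

```python
def durbin_watson(resid):
    """DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2.
    DW near 2 -> no serial correlation; < 2 positive SC; > 2 negative SC."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den

dw_neg = durbin_watson([1, -1, 1, -1])   # alternating signs -> DW > 2 (negative SC)
dw_pos = durbin_watson([1, 1, 1, 1])     # persistent errors -> DW = 0 (positive SC)
```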
Correcting for it: adjust coefficient std. errors using the Hansen method (which also corrects for heteroskedasticity).
Multicollinearity:
Two or more of independent variables or linear combinations of indep variables in a multiple regression
are highly correlated w each other.
Effect: Results in greater probability of incorrectly concluding that a variable is not stat. significant (type
II error).
Detection -> when t-tests indicate that no individual coefficient is statistically different from zero, while the
F-test is statistically significant and R2 is high.
Correcting for it -> Omit one of the Independent Variables;
Misspecification: selection of independent variables to be included in the regression & their transformations;
1- omitting a variable
2- variable should be transformed (log/ln)
3- incorrectly pooling data
4- using a lagged dependent variable as an independent variable
5- forecasting the past
6- measuring indep variables with error
Qualitative Variables: we can use binary dummy variables; Probit and Logit models; discriminant models
Time Series Analysis: set of observations for a variable over successive periods of time.
Linear trend model = pattern that can be graphed with a straight line.
Yt = b0 + b1(t) + Et
Ordinary Least Squares: regression used to estimate the coefficients in the trend line; predicted values get hats and no error term: ^Yt = ^b0 + ^b1(t)
Log-linear trend models (exponential growth): continuous compounding
+ive b1: convex; -ive b1: concave
Yt = e^(b0 + b1(t)); b1 = constant rate of growth; e = base of natural log
Y = dependent variable; t = independent variable (Y is an exponential function of time); take ln of both sides to get the
log-linear form: ln(Yt) = ln(e^(b0 + b1(t))) -> ln(Yt) = b0 + b1(t)
Which model to use?
Linear = data points seem to be equally distributed above and below the regression line (constant AMOUNT)
Log-Linear = curved: constant RATE, not amount (financial data); if serial correlation is present, an AR model might work
Time series: DW should be ≈ 2.0
Autoregressive Model -> dependent variable is regressed against one or more lagged values of itself
Xt = b0 + b1Xt-1 + Et
Covariance Stationarity: constant & finite expected value of time series over time; constant and finite
variance (volatility around its mean) constant and finite covariance between values at any given lag
*mean reverting level
Conditions -> constant & finite: expected value; variance; covariance
Correctly Specified = No Serial Correlation [no correlation of errors]
Forecasting: calculate a one-step-ahead forecast before a two-step-ahead forecast (chain rule of forecasting).
Test Model Fit: (AR) if correctly specified, the residuals will not exhibit serial correlation. (SC = error terms +ively/-ively
correlated)
1- estimate the AR model using linear regression
2- calculate the autocorrelations of the model's residuals
3- test whether the autocorrelations are significantly different from zero. *If the model is correct, none of the
autocorrelations will be significant (i.e., all t-tests insignificant). t-stat = estimated autocorrelation / std. error; std. error
= 1/sqrt(T); T = # obs (just like n)
t-stat = autocorrelation / (1/sqrt(T))
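The residual-autocorrelation t-test above can be sketched as follows; the rho = 0.25 and T = 100 below are made-up illustration numbers:

```python
import math

def autocorr_t_stat(rho_k, T):
    """t-stat for H0: lag-k residual autocorrelation = 0.
    The standard error of a sample autocorrelation is approximately 1/sqrt(T)."""
    return rho_k / (1.0 / math.sqrt(T))   # equals rho_k * sqrt(T)

# Illustrative: rho = 0.25 estimated from T = 100 residuals
t = autocorr_t_stat(0.25, 100)   # compare against the t critical value
```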
Mean reversion -> tendency to move towards the mean (decline when the current value is above the mean;
increase when below)
When AT the mean-reverting level -> the model predicts that the next value will be the same as the current value (e.g., ^xt+1 = xt)
(AR1) xt = b0 + b1(xt-1) + Et
Mean-reverting level: xt = b0 / (1 - b1)
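Both ideas in code; the coefficients b0 = 1.0 and b1 = 0.6 below are made up:

```python
def ar1_forecast(b0, b1, x_t):
    """One-step-ahead AR(1) forecast: x_hat_{t+1} = b0 + b1 * x_t."""
    return b0 + b1 * x_t

def mean_reverting_level(b0, b1):
    """x* = b0 / (1 - b1); finite only when b1 != 1 (no unit root)."""
    return b0 / (1 - b1)

# Made-up coefficients: b0 = 1.0, b1 = 0.6 -> mean-reverting level of 2.5
xstar = mean_reverting_level(1.0, 0.6)
# Starting exactly at x*, the model forecasts no change
nxt = ar1_forecast(1.0, 0.6, xstar)
```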
Instability of coefficients of time series models: history doesn't necessarily repeat itself; many
contributing factors could change over time.
Compare the accuracy of AR models' forecasted values (e.g., out-of-sample RMSE).
Random Walk: predicted value in one period equals the previous period's value plus a random error term (e.g.,
Xt = Xt-1 + Et)
With a drift: intercept ≠ 0 (e.g., Xt = b0 + b1Xt-1 + Et); b0 = constant drift
With or without a drift: b1 = 1
Neither of these exhibits covariance stationarity: the mean-reverting level b0/(1 - b1) is undefined when b1 = 1 (b0/0);
the series exhibits a unit root.
Must have a finite mean-reverting level to be covariance stationary.
If the coefficient on the lag variable is 1.0, the series is not covariance stationary; it is said to have a unit root.
Testing for non-stationarity: 1- AR model autocorrelations; 2- Dickey-Fuller test. 1: a stationary process will have
residual autocorrelations insignificantly different from zero at all lags.
Dickey-Fuller: Xt = b0 + b1Xt-1 + Et -> subtract Xt-1 from both sides: Xt - Xt-1 = b0 + (b1 - 1)Xt-1 + Et
Test whether (b1 - 1) = 0 with a modified t-test. If (b1 - 1) is not significantly different from zero, then b1 = 1 and the series has a unit root!
First Differencing
If we believe a time series is a random walk (i.e., has a unit root), we can transform the data into a covariance
stationary time series (first differencing) -> subtract the value of the dependent variable in the previous period
from the current value to define a new dependent variable (i.e., Yt = Xt - Xt-1, so Yt = Et).
Modeling the differenced series Yt = b0 + b1Yt-1 + Et gives b0 = b1 = 0, which is covariance stationary.
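First differencing as code; the input series below is a made-up deterministic "walk":

```python
def first_difference(x):
    """y_t = x_t - x_{t-1}: transforms a unit-root (random walk) series
    into one with a finite mean-reverting level."""
    return [x[t] - x[t - 1] for t in range(1, len(x))]

# Made-up series of levels; the differenced series is one observation shorter
diffs = first_difference([1, 4, 9, 16, 25])
```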
Seasonality:
To determine seasonality -> test each period to see if the lagged autocorrelations of the residuals are significantly different from zero (t-test); df = n - k; check the significance of each lag.
Correcting for Seasonality: Occupancy in any quarter = previous quarter + same quarter in previous year.
An additional lag of the dependent variable (same period previous year) is added to original model as
another independent variable.
e.g. ln(xt) = b0 + b1*ln(xt-1) + b2*ln(xt-4) + Et
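Forecasting from that log-linear seasonal model can be sketched as follows; the coefficients and lagged values below are made-up placeholders:

```python
import math

def seasonal_ar_forecast(b0, b1, b2, x_lag1, x_lag4):
    """ln(x_t) = b0 + b1*ln(x_{t-1}) + b2*ln(x_{t-4}); returns the x_t forecast."""
    ln_xt = b0 + b1 * math.log(x_lag1) + b2 * math.log(x_lag4)
    return math.exp(ln_xt)

# Made-up check: with b1 = 1 and b0 = b2 = 0, the forecast is just last quarter's value
x_hat = seasonal_ar_forecast(0.0, 1.0, 0.0, 5.0, 9.0)
```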
Two time series: two different variables -> test for unit roots. Separate DF tests with 5 possible results:
1- Both series are covariance stationary (can use linear regression; coefficients should be reliable)
2- Only the dependent variable's time series is covariance stationary (not reliable)
3- Only the indep variable's time series is covariance stationary (not reliable)
4- Neither time series is covariance stationary & the two are NOT cointegrated (not reliable)
5- Neither time series is covariance stationary & the two ARE cointegrated (reliable)
Co-Integration: means two time series are economically linked or follow the same trend & that
relationship is not expected to change.
Regress one variable on the other: Yt = b0 + b1Xt + Et;
-> the residuals are tested for a unit root using the Dickey-Fuller test. If we reject the null of a unit root, we say the error terms are covariance
stationary & the series are cointegrated. Use Engle-Granger critical t-values, not the standard ones.
Steps to select a time series model:
1- Determine the goal
2- Examine the data's characteristics (heteroskedasticity, seasonality, etc.)
3- Use a trend model (linear; log-linear; etc.)
4- Test for serial correlation (Durbin-Watson); no SC -> use the trend model; SC present -> step 5
5- AR model
6- First differencing (if unit root)
7- ARCH
If two good models left over, calculate out of sample RMSE to determine which one is better!
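The out-of-sample comparison in that last step can be sketched as follows; the actuals and forecasts below are made-up hold-out values:

```python
import math

def rmse(actual, forecast):
    """Root mean squared error over an out-of-sample (hold-out) window."""
    n = len(actual)
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / n)

# Made-up hold-out actuals and two competing models' forecasts
model_a = rmse([10, 12, 11], [10.5, 11.5, 11.0])
model_b = rmse([10, 12, 11], [9.0, 13.0, 12.0])
better = "A" if model_a < model_b else "B"   # lower out-of-sample RMSE wins
```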