
Model Evaluation & R2

Evaluating Our Models


 Our hypothesis tests tend to focus on
individual coefficients – t-tests
 In addition, we may want to make an
overall evaluation of our model
 Is the model a “good fit” to the data?
 Does the model “explain” the dependent variable?
“Explaining” Variance
 What do we mean when we say a model is “good” or “explains” the dependent variable?
 Explanation exists in our theory, not
in any data we might observe.
 Thus “explained” variance cannot be
measured with a statistic (or with
data).
“Explaining” Variance
 Measures of association (like slope
coefficients) can demonstrate
relationships in our data.
 That is, they can show that variables
“fit” together well.
 If models provide a good fit, this can
be the basis of empirical support for
a theoretical explanation of the data.
What’s a “Good Model?”
 What do we mean when we say we have a
“good model” of our data?

 Changes in x have a large impact on changes in ŷ (that is, β̂ is large)
 Remaining residuals are small (that is, û2 is small)

σ̂u² = Σ ûi² / (n − k − 1)
Building a Composite Measure
 Developed an overall measure of
these concepts that is insensitive to
the scale measuring x’s & y
 Measure is based on fundamental
goal of the OLS estimator – minimize
squared errors.
 Take the ratio of the “Explained” Sum of Squares (SSE) to the Total Sum of Squares (SST)
The Coefficient of Determination
 This ratio is known as the “coefficient of
determination” or R2

R² = SSE/SST = 1 − SSR/SST

 R2 is just the squared correlation between the observed “y” and the fitted “yhat”
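As a quick numerical sketch (in Python rather than Stata, with made-up data), the two forms of the formula agree for an OLS fit with an intercept:

```python
# Sketch with made-up data: fit a bivariate OLS line by hand and verify
# that R-squared = SSE/SST = 1 - SSR/SST.
from statistics import mean

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

xbar, ybar = mean(x), mean(y)
beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
       / sum((xi - xbar) ** 2 for xi in x)
alpha = ybar - beta * xbar
yhat = [alpha + beta * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)                # total sum of squares
sse = sum((yh - ybar) ** 2 for yh in yhat)             # "explained" sum of squares
ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # residual sum of squares

r2 = sse / sst
print(round(r2, 4), round(1 - ssr / sst, 4))           # both forms give the same value
```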
Comparing Across Samples
 We would like a measure of our
model of y that can be compared
across different samples
 If y=Xβ+u, y is a function of 3 things:
 x’s we draw in our sample
 u’s we draw in our sample
 β parameters that are true across
samples of the population
Comparing Across Samples
 Our β-hat can be generalized across samples because it is not dependent on the û2 or σx2 in a sample.

β̂ = (X'X)⁻¹X'y = (X'X)⁻¹X'(Xβ + u)
β̂ = (X'X)⁻¹X'Xβ + (X'X)⁻¹X'u
β̂ = β + (X'X)⁻¹X'u

 If the covariance of X and u = 0, the σu2 and σx2 are irrelevant

Comparing Across Samples
 Unfortunately, R2 does not have this
property.
R² = SSE/SST = SSE/(SSR + SSE)

 SSR is the sum of the squared residuals; thus R2 is clearly a function of the û2 in the sample
Comparing Across Samples
 In addition, because the û2 is in the denominator but not the numerator, R2 also becomes a function of the σx2
 Holding the û2 constant, an increase in the σx2 will increase the numerator relatively more than the denominator:

100/(100 + 50) = .667
(100 + 20)/(100 + 20 + 50) = .706

 Thus R2 will go up!
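The slide's own numbers can be checked directly (a Python sketch; 100 and 50 are the hypothetical explained and residual sums of squares):

```python
# Hold the residual sum of squares fixed at 50 and let the explained sum of
# squares grow from 100 to 120 (as it would if the variance of x grew):
# R-squared rises even though the fit to the data is no better.
ssr = 50.0
r2s = [sse / (sse + ssr) for sse in (100.0, 120.0)]
print([round(v, 3) for v in r2s])
```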


What IS R2 Really?
The R2 Stew
 R2 can also be rewritten:

R² = (β̂'X'Xβ̂) / (β̂'X'Xβ̂ + û'û) = SSE / (SSE + SSR)

 Which can be thought of in the bivariate context as:

R² = (β̂j² · σ²xj) / (β̂j² · σ²xj + σ̂u²)

 Which is: (causal_strength)² · variance / [(causal_strength)² · variance + good_fit]
The R2 Stew
 Thus R2 conflates our two aspects of a “good model,” combines them with û2, and places them on a dimensionless scale

 The resulting value is nearly


uninterpretable

 Basically, R2 measures the shape of the


cloud of observations around our
regression line.
Adjusted R-Squared
 Imposes a penalty for adding new
variables to the model

R  1  [( SSR / n  k  1) /( SST / n  1)]


2

 1  [ /( SST / n  1)
2
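A small Python sketch of the penalty (hypothetical sums of squares): with SSR and SST fixed, a larger number of regressors k lowers the adjusted value.

```python
# Adjusted R-squared for hypothetical fit statistics: same SSR and SST,
# but more regressors (larger k) means a bigger penalty.
ssr, sst, n = 50.0, 150.0, 100
adj = {k: 1 - (ssr / (n - k - 1)) / (sst / (n - 1)) for k in (2, 10)}
print({k: round(v, 4) for k, v in adj.items()})
```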

Dimensions of A Good Model
 Fundamental problem with R2 is that
it attempts to place incommensurate
values on a common scale
 Causal strength and close fit to the

data are simply different concepts


 Causal strength is best measured

with
β-hat.
 What is a more direct measure of fit?
Better Measures of Fit:
Standard Error of Regression
 Standard Error of the Regression
 A.K.A. The standard error of the estimate

ˆ  uˆ  uˆ
se(  )  
SST n

 i i
( x
i 1
 ˆ
x ) 2

 This is the average size of the residual


in units of the dependent variable
 Standard deviation in y after the effect of
x is taken out
 Square root of the variance of uhat
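A minimal Python sketch of the calculation from a set of hypothetical residuals (k = 1 regressor assumed):

```python
# Standard error of the regression from hypothetical residuals:
# sqrt( sum(uhat^2) / (n - k - 1) ).
from math import sqrt

resid = [0.5, -1.2, 0.8, -0.3, 1.1, -0.9]    # hypothetical residuals, in units of y
n, k = len(resid), 1
ser = sqrt(sum(e ** 2 for e in resid) / (n - k - 1))
print(round(ser, 3))                          # average residual size, in units of y
```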
Better Measures of Fit:
Standard Error of Regression
 se is a better measure of fit than R2
because:
 It isolates the distance between the regression
line and the data
 It is independent of β-hat and the variance of X
 It is calculated on a meaningful scale

 If u~N(0,σ2), se allows us to make


inferences about how likely errors of
different sizes will be
Better Measures of Fit:
Mean Squared Error
 Limitation of Std. Error of Regression is
that it must assume that E(u|X)=0
 Some useful estimators are biased, but
consistent (logit & probit)
 Mean Squared Error does not rely on this
assumption
 MSE is simply the average of the squared
deviations
uˆ ' uˆ
RootMSE 
n  k 1
Better Measures of Fit:
Mean Squared Error
 MSE has similar advantages to
Standard Error of the Regression
 Isolates the size of the residuals from
βhat
 Does not depend on βhat or the variance
of X
 Calculated in units of Y
 Stata helpfully reports both MSE and the
Root MSE
 Final test for overall fit is the F-test
Better Measures of Fit:
The F-Test
 Tests the null hypothesis that all coefficients (β-hats) = 0
 Define:
 r = restricted model
 ur = unrestricted model
 q = dfr − dfur

F = [(SSRr − SSRur) / q] / [SSRur / (n − k − 1)]

 This can be expressed as a function of the R2:

F = [(R²ur − R²r) / q] / [(1 − R²ur) / (n − k − 1)] = [(R²ur − R²r) / q] / [(1 − R²ur) / dfur]
5% Critical Value and the F3,60 Distribution
[Figure: F distribution with 3 numerator and 60 denominator degrees of freedom; area .95 lies below the 5% critical value of 2.76 and area .05 above it]
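The F-statistic formula above can be sketched numerically (Python, with hypothetical sums of squared residuals chosen so the denominator df matches the F(3,60) figure):

```python
# Restricted-vs-unrestricted F-test with hypothetical sums of squared residuals.
ssr_r, ssr_ur = 120.0, 100.0   # hypothetical restricted / unrestricted SSR
n, k, q = 64, 3, 3             # k regressors in the unrestricted model, q restrictions
F = ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k - 1))
print(round(F, 2))             # exceeds the 5% critical value of 2.76 -> reject H0
```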
Better Measures of Fit
 To compare different specifications with the same dependent variable, use the Root Mean Squared Error
 The F-test is also helpful for assessing whether a group of variables together accurately predict the model
 To see whether adding additional variables improves the fit of the model, perform an F-test. Not packaged in STATA; you will have to perform it by hand.

Remember!!
1. The F-statistic is always nonnegative.
2. The null hypothesis (H0) of the F-test is that all of the additional βs in the unrestricted model are equal to 0. If any one of the βs is significantly different from 0, we can reject the H0. In the STATA output, the null is that all βs included in the full model are equal to 0.
3. Conceptually, the F-statistic and chi-squared statistic are analogous. After normalization, the chi-squared is the limiting distribution of the F as the denominator df goes to infinity.
R2, Correlation Coefficients, and
Standardized Betas
 Central problem with R2 is its conflation of the û2 with the causal impact (β-hat)
 Inclusion of û2 in the calculation made R2 dependent on σx2

 Correlation coefficients and


standardized regression coefficients
(betas) have exact same problems
Correlations and Standardized
Betas
 The formula for a correlation coefficient is:

ρ = Cov(x, y) / (sd(x) · sd(y)) = σx,y / (σx σy)

Here σy is in the denominator, which depends upon σu, which is sample specific

 A standardized “beta” is simply a partial


correlation coefficient (controlling for the
other x’s)
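In the bivariate case the connection to R2 is direct and easy to check (Python, made-up data): the squared correlation equals the R2 from regressing y on x.

```python
# With made-up bivariate data: the squared correlation between x and y
# equals the R-squared from regressing y on x.
from statistics import mean, pstdev

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
xbar, ybar = mean(x), mean(y)
cov = mean((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
rho = cov / (pstdev(x) * pstdev(y))
print(round(rho ** 2, 4))      # same value R-squared would report
```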
R2, Correlations and Betas

 R2, correlations and standardized


betas are measures of association
that are independent of units of
variables
 Goal is to develop stats that can be
compared regardless of
measurement
 This undertaking is misguided
 Makes comparisons less reliable and
less meaningful
The Solution to R2:
Don’t Rely on It
 R2 tells us little we want to know –
pay it no heed
 We must make substantive
evaluations of our models:
 Size of Coefficients (hypothesis tests)
 Size of Substantive Effects
 Ability to forecast out-of-sample
A Few Uses for R2 and Correlations
 Calculating levels of multicollinearity
– Auxiliary R2
 Simultaneous equations models –
building instrument in 2SLS
R2 Depends on the Variance of
x – An Example
 I created a dataset where X1 varies
from –20 to 20 where σx = 10
 I created an error term u~N(0,20)
 I created y1=3+2X1+u
Analyzing the Full Range of x

. reg y1 x1

Source | SS df MS Number of obs = 100


---------+------------------------------ F( 1, 98) = 68.03
Model | 30717.8966 1 30717.8966 Prob > F = 0.0000
Residual | 44248.6995 98 451.517342 R-squared = 0.4098
---------+------------------------------ Adj R-squared = 0.4037
Total | 74966.5961 99 757.238344 Root MSE = 21.249

------------------------------------------------------------------------------
y1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
x1 | 1.92611 .2335192 8.248 0.000 1.462699 2.389522
_cons | 4.27945 2.130969 2.008 0.047 .0506116 8.508289
------------------------------------------------------------------------------
Restricting x to –10<x<10
. reg y1 x1 if x1>-10 & x1<10

Source | SS df MS Number of obs = 70


---------+------------------------------ F( 1, 68) = 18.81
Model | 8038.90856 1 8038.90856 Prob > F = 0.0000
Residual | 29063.7331 68 427.407839 R-squared = 0.2167
---------+------------------------------ Adj R-squared = 0.2051
Total | 37102.6416 69 537.719444 Root MSE = 20.674

------------------------------------------------------------------------------
y1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
x1 | 1.971043 .4544842 4.337 0.000 1.064134 2.877952
_cons | 5.884096 2.517394 2.337 0.022 .8607142 10.90748
------------------------------------------------------------------------------
Correlation Coefficients Show
the Same Changes
. corr y1 x1
(obs=100)
| y1 x1
---------+------------------
y1 | 1.0000
x1 | 0.6401 1.0000

. corr y1 x1 if x1>-10 & x1<10


(obs=70)
| y1 x1
---------+------------------
y1 | 1.0000
x1 | 0.4655 1.0000
Standards of Model Evaluation
 Statistical Significance of Coefficients
 t-tests
 Substantive Size of Effects
 Generate predictions from the model
 Size of Residuals
 Compare to substantive effects
 Forecasting Out-of-Sample
Statistical Significance of
Coefficients
 First thing we want to do with a model is see if the coefficients relevant for our theory are discernible from zero
 T-tests give us confidence that
relationships we observe are
generalizable
 Don’t get too focused on .05 as a
“magical” threshold for significance
Statistical Significance of
Coefficients
 If a coefficient is statistically
significant:
 Report this as support for your theory
 Move on to describe substantive effects
 If a coefficient is not statistically
significant:
 Investigate why this is so
 Several possible reasons
Reasons for Statistically
Insignificant Coefficients
 Problems with the data
 Measurement, collinearity,
heteroskedasticity, selection effects,
and autocorrelation (in time series
models)
 Problems with the specification of the
theory
 Omitted variable, functional form,
endogeneity
 The theory could just be wrong.
Substantive Significance of the
Coefficients
 Reporting statistical significance is
never sufficient support for a model
 We should always make judgments
about the substantive importance of
the effects
 This is done by creating predicted
values from the model
 We change one (or more) variables in the model while holding others constant. This is easily done using Clarify.
Substantive Significance of the
Coefficients
 By creating these artificial
predictions, we are constructing
counter-factual scenarios
 Change in the predicted value of y
gives us the substantive effect of x.
 In OLS, these effects can be read
directly from the coefficients
 For other models (probit/logit) the
computer generates predicted values
Substantive Significance of the
Coefficients
 Once we have generated the
predicted effects of significant
variables, we can make substantive
comparisons
 Compare predicted effects to overall
variation in y.
 Compare predicted effects of x1 to
predicted effects of x2…xn
Size of Residuals
 Present some judgment about the
size of the errors in the model
 Standard Error of the Regression
 Root Mean Squared Error
 Compare size of errors to overall
variation in y.
 Compare size of errors to substantive
effects of variables
Out-of-Sample Forecasts
 Forecasting out-of-sample guards
against ad hoc data-fitting. Forces
you to model only systematic
elements.
 Presidential Dummies in analysis of approval.
We have no theoretical reason to add these.
 Allows us to compare non-nested
models, even different functional
forms.
 Shows coefficients account for
relationships outside the data used
to generate them.
 That is our ULTIMATE goal!!
Generating Out-of-Sample
Forecasts
 Process for generating Out-of-Sample
forecasts is simple
 When testing your theory, set aside part of
your data (the test set) for forecasting
 Random sample of cases
 Estimate the βhats on the remaining data
(the training set)
 Use βhats from training set to generate
forecast outcomes (yhat) for the test set
cases
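The steps above can be sketched end to end (simulated data, Python rather than Stata):

```python
# Sketch of the procedure with simulated data: estimate beta-hat on a
# training set, forecast the held-out test set, and score the forecasts
# with their mean squared error.
from statistics import mean
import random

random.seed(1)
x = [random.uniform(-10, 10) for _ in range(100)]
y = [3 + 2 * xi + random.gauss(0, 4) for xi in x]   # true model: y = 3 + 2x + u

train_x, train_y = x[:70], y[:70]    # training set
test_x, test_y = x[70:], y[70:]      # held-out test set

xbar, ybar = mean(train_x), mean(train_y)
beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(train_x, train_y)) \
       / sum((xi - xbar) ** 2 for xi in train_x)
alpha = ybar - beta * xbar

forecasts = [alpha + beta * xi for xi in test_x]
mse = mean((yi - fi) ** 2 for yi, fi in zip(test_y, forecasts))
print(round(mse, 2))                 # in squared units of y
```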
What is a Good Forecast?
OLS
 For OLS we can simply calculate the
Mean Squared Error for the forecast
values
 The smaller the MSE, the better the
forecast.
 Since MSE is in units of the
dependent variable, we can make
substantive judgments about
whether the forecast errors are big or
small.
