
Econometrics using R

Rajat Tayal
Fourth Quantitative Finance Workshop
December 21-24, 2012
Indian Institute of Technology, Kanpur

23 December 2012


Outline of the presentation


Linear regression
  - Simple linear regression
  - Multiple linear regression
  - Partially linear models
  - Factors, interactions, and weights
  - Linear regression with time series data
  - Linear regression with panel data
  - Systems of linear equations

Regression diagnostics
  - Leverage and standardized residuals
  - Deletion diagnostics
  - The function influence.measures()
  - Testing for heteroskedasticity
  - Testing for functional form
  - Testing for autocorrelation
  - Robust standard errors and tests

Part I
Linear regression


Introduction
The linear regression model, typically estimated by ordinary least squares (OLS), is the workhorse of applied econometrics. The model is

    y_i = x_i^T β + ε_i,   i = 1, 2, . . . , n,                (1)

or, in matrix form,

    y = X β + ε.                                               (2)

For cross-sections, the standard assumptions are exogeneity and homoskedasticity:

    E(ε | X) = 0,                                              (3)
    Var(ε | X) = σ² I.                                         (4)

For time series, exogeneity is often too restrictive; a weaker assumption is that the regressors are predetermined:

    E(ε_j | x_i) = 0,   i ≤ j.                                 (5)


Introduction

The OLS estimator of β is

    β̂ = (X^T X)^{-1} X^T y.                                   (6)

The corresponding fitted values are ŷ = X β̂, the residuals are ε̂ = y − ŷ, and the residual sum of squares is ε̂^T ε̂.

In R, models are typically fitted by calling a model-fitting function, in this case lm(), with a formula object describing the model and a data.frame object containing the variables used in the formula:

fm <- lm(formula, data, ...)
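Equation (6) can be checked numerically for any fitted model fm; a minimal sketch using base R matrix arithmetic (an illustration added here, not part of the original slides):

> X <- model.matrix(fm)                   # regressor matrix X
> y <- model.response(model.frame(fm))    # response vector y
> solve(crossprod(X), crossprod(X, y))    # (X^T X)^{-1} X^T y, matches coef(fm)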


The first example


Data from Stock & Watson (2007) on subscriptions to economics journals
at US libraries for the year 2000:
> install.packages("AER", dependencies = TRUE)
> library("AER")
> data("Journals"); names(Journals)
> journals <- Journals[, c("subs", "price")]
> journals$citeprice <- Journals$price/Journals$citations
> summary(journals)
      subs             price           citeprice
 Min.   :   2.0   Min.   :  20.0   Min.   : 0.005223
 1st Qu.:  52.0   1st Qu.: 134.5   1st Qu.: 0.464495
 Median : 122.5   Median : 282.0   Median : 1.320513
 Mean   : 196.9   Mean   : 417.7   Mean   : 2.548455
 3rd Qu.: 268.2   3rd Qu.: 540.8   3rd Qu.: 3.440171
 Max.   :1098.0   Max.   :2120.0   Max.   :24.459459

The first example

In view of the wide range of the variables, combined with a considerable amount of skewness, it is useful to take logarithms.

The goal is to estimate the effect of the price per citation on the number of library subscriptions. To explore this issue quantitatively, we fit the linear regression model

    log(subs)_i = β₁ + β₂ log(citeprice)_i + ε_i.              (7)

The first example

Here, the formula of interest is log(subs) ~ log(citeprice). This can be used
both for plotting and for model fitting:
> plot(log(subs) ~ log(citeprice), data = journals)
> jour_lm <- lm(log(subs) ~ log(citeprice), data = journals)
> abline(jour_lm)
abline() extracts the coefficients of the fitted model and adds the
corresponding regression line to the plot.


The first example

Figure: Scatterplot of log(subs) against log(citeprice) for the journals data, with the regression line added by abline()

The first example

The function lm() returns a fitted-model object, here stored as jour_lm.
It is an object of class "lm".

> class(jour_lm)
[1] "lm"
> names(jour_lm)
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"
 [7] "qr"            "df.residual"   "xlevels"       "call"          "terms"         "model"


The first example


> summary(jour_lm)

Call:
lm(formula = log(subs) ~ log(citeprice), data = journals)

Residuals:
     Min       1Q   Median       3Q      Max
-2.72478 -0.53609  0.03721  0.46619  1.84808

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     4.76621    0.05591   85.25   <2e-16 ***
log(citeprice) -0.53305    0.03561  -14.97   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7497 on 178 degrees of freedom
Multiple R-squared: 0.5573,     Adjusted R-squared: 0.5548
F-statistic: 224 on 1 and 178 DF,  p-value: < 2.2e-16

Generic functions for fitted (linear) model objects


Function                         Description
print()                          simple printed display
summary()                        standard regression output
coef() (or coefficients())       extracting the regression coefficients
residuals() (or resid())         extracting residuals
fitted() (or fitted.values())    extracting fitted values
anova()                          comparison of nested models
predict()                        predictions for new data
plot()                           diagnostic plots
confint()                        confidence intervals for the regression coefficients
deviance()                       residual sum of squares
vcov()                           (estimated) variance-covariance matrix
logLik()                         log-likelihood (assuming normally distributed errors)
AIC()                            information criteria including AIC, BIC/SBC (assuming normally distributed errors)
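Several of these generics are not illustrated elsewhere in these slides; a minimal sketch applying a few of them to jour_lm (output omitted):

> vcov(jour_lm)      # estimated variance-covariance matrix of the coefficients
> deviance(jour_lm)  # residual sum of squares
> logLik(jour_lm)    # log-likelihood assuming normally distributed errors
> AIC(jour_lm)       # Akaike information criterion
> AIC(jour_lm, k = log(nrow(journals)))  # BIC/SBC via the penalty argument k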

The first example


It is instructive to take a brief look at what the summary() method
returns for a fitted lm object:

> jour_slm <- summary(jour_lm)


> class(jour_slm)
[1] "summary.lm"
> names(jour_slm)
[1] "call" "terms" "residuals" "coefficients" "aliased" "sigm
[7] "df" "r.squared" "adj.r.squared" "fstatistic" "cov.unscal
> jour_slm$coefficients
Estimate Std. Error
t value
Pr(>|t|)
(Intercept)
4.7662121 0.05590908 85.24934 2.953913e-146
log(citeprice) -0.5330535 0.03561320 -14.96786 2.563943e-33


Analysis of variance
> anova(jour_lm)
Analysis of Variance Table

Response: log(subs)
                Df Sum Sq Mean Sq F value    Pr(>F)
log(citeprice)   1 125.93 125.934  224.04 < 2.2e-16 ***
Residuals      178 100.06   0.562
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table breaks the sum of squares about the mean (for the
dependent variable, here log(subs)) into two parts: a part that is
accounted for by a linear function of log(citeprice) and a part attributed to
residual variation.
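
The table ties directly to the R² reported by summary(jour_lm): the model sum of squares divided by the total sum of squares gives

    R² = 125.93 / (125.93 + 100.06) ≈ 0.5573,

matching the Multiple R-squared shown earlier.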

Point and Interval estimates

To extract the estimated regression coefficients β̂, the function coef() can be used:

> coef(jour_lm)
   (Intercept) log(citeprice)
     4.7662121     -0.5330535

> confint(jour_lm, level = 0.95)
                    2.5 %     97.5 %
(Intercept)     4.6558822  4.8765420
log(citeprice) -0.6033319 -0.4627751


Prediction

Two types of predictions:
 1. the prediction of points on the regression line, and
 2. the prediction of a new data value.

The standard errors of predictions for new data take into account both the uncertainty in the regression line and the variation of the individual points about the line.

Thus, the prediction interval for prediction of new data is larger than that for prediction of points on the line. The function predict() provides both types of standard errors.


Prediction

> predict(jour_lm, newdata = data.frame(citeprice = 2.11),
+   interval = "confidence")
       fit      lwr     upr
1 4.368188 4.247485 4.48889
> predict(jour_lm, newdata = data.frame(citeprice = 2.11),
+   interval = "prediction")
       fit      lwr      upr
1 4.368188 2.883746 5.852629

The point estimates (fit) are identical, but the intervals differ.
The prediction intervals can also be used for computing and visualizing confidence bands.


Prediction

> lciteprice <- seq(from = -6, to = 4, by = 0.25)
> jour_pred <- predict(jour_lm, interval = "prediction",
+   newdata = data.frame(citeprice = exp(lciteprice)))
> plot(log(subs) ~ log(citeprice), data = journals)
> lines(jour_pred[, 1] ~ lciteprice, col = 1)
> lines(jour_pred[, 2] ~ lciteprice, col = 1, lty = 2)
> lines(jour_pred[, 3] ~ lciteprice, col = 1, lty = 2)


Prediction

Figure: Scatterplot with prediction intervals for the journals data



Plotting lm objects

The plot() method for class "lm" provides six types of diagnostic plots, four of which are shown by default.

We set the graphical parameter mfrow to c(2, 2) using the par() function, creating a 2 x 2 matrix of plotting areas so that all four plots are shown simultaneously:

> par(mfrow = c(2, 2))
> plot(jour_lm)
> par(mfrow = c(1, 1))

The first provides a graph of residuals versus fitted values, the second is a QQ plot for normality, and plots three and four are a scale-location plot and a plot of standardized residuals against leverages, respectively.
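
The two diagnostics not shown by default (Cook's distances, and Cook's distance against leverage) can be requested explicitly via the which argument; a minimal sketch:

> par(mfrow = c(3, 2))
> plot(jour_lm, which = 1:6)  # all six plots; 4 and 6 are the Cook's distance plots
> par(mfrow = c(1, 1))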


Plotting lm objects

Figure: Diagnostic plots for the journals data



Testing a linear hypothesis

The standard regression output as provided by summary() only indicates individual significance of each regressor and joint significance of all regressors, in the form of t and F statistics, respectively. Often it is necessary to test more general hypotheses. This is possible using the function linear.hypothesis() from the car package.

Suppose we want to test the hypothesis that the elasticity of the number of library subscriptions with respect to the price per citation equals -0.5:

    H₀: β₂ = −0.5                                              (8)

Testing a linear hypothesis

> linear.hypothesis(jour_lm, "log(citeprice) = -0.5")
Linear hypothesis test

Hypothesis:
log(citeprice) = - 0.5

Model 1: restricted model
Model 2: log(subs) ~ log(citeprice)

  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    179 100.54
2    178 100.06  1   0.48421 0.8614 0.3546


Multiple linear regression


In economics, most regression analyses comprise more than a single regressor. Often there are regressors of a special type, usually referred to as dummy variables in econometrics, which are used for coding categorical variables.

> data("CPS1988")
> summary(CPS1988)
      wage            education       experience     ethnicity      smsa
 Min.   :   50.05   Min.   : 0.00   Min.   :-4.0   cauc:25923   no : 7223
 1st Qu.:  308.64   1st Qu.:12.00   1st Qu.: 8.0   afam: 2232   yes:20932
 Median :  522.32   Median :12.00   Median :16.0
 Mean   :  603.73   Mean   :13.07   Mean   :18.2
 3rd Qu.:  783.48   3rd Qu.:15.00   3rd Qu.:27.0
 Max.   :18777.20   Max.   :18.00   Max.   :63.0
       region       parttime
 northeast:6441   no :25631
 midwest  :6863   yes: 2524
 south    :8760
 west     :6091

The model of interest is

    log(wage) = β₁ + β₂ experience + β₃ experience² + β₄ education + β₅ ethnicity + ε.    (9)

Multiple linear regression


> cps_lm <- lm(log(wage) ~ experience + I(experience^2) +
+   education + ethnicity, data = CPS1988)
> summary(cps_lm)

Call:
lm(formula = log(wage) ~ experience + I(experience^2) + education +
    ethnicity, data = CPS1988)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9428 -0.3162  0.0580  0.3756  4.3830

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      4.321e+00  1.917e-02  225.38   <2e-16 ***
experience       7.747e-02  8.800e-04   88.03   <2e-16 ***
I(experience^2) -1.316e-03  1.899e-05  -69.31   <2e-16 ***
education        8.567e-02  1.272e-03   67.34   <2e-16 ***
ethnicityafam   -2.434e-01  1.292e-02  -18.84   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5839 on 28150 degrees of freedom
Multiple R-squared: 0.3347,     Adjusted R-squared: 0.3346
F-statistic: 3541 on 4 and 28150 DF,  p-value: < 2.2e-16

Comparison of models
With more than a single explanatory variable, it is interesting to test for the relevance of subsets of regressors. For any two nested models, this can be done using the function anova(). E.g., to test for the relevance of the variable ethnicity, we explicitly fit the model without ethnicity and then compare both models:

> cps_noeth <- lm(log(wage) ~ experience + I(experience^2) +
+   education, data = CPS1988)
> anova(cps_noeth, cps_lm)
Analysis of Variance Table

Model 1: log(wage) ~ experience + I(experience^2) + education
Model 2: log(wage) ~ experience + I(experience^2) + education + ethnicity
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1  28151 9719.6
2  28150 9598.6  1    121.02 354.91 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This reveals that the effect of ethnicity is significant at any reasonable level.


Comparison of models

The same comparison can be carried out with waldtest() from the lmtest package, where the restricted model can be specified symbolically (or refitted via update()):

> cps_noeth <- update(cps_lm, formula = . ~ . - ethnicity)
> waldtest(cps_lm, . ~ . - ethnicity)
Wald test

Model 1: log(wage) ~ experience + I(experience^2) + education + ethnicity
Model 2: log(wage) ~ experience + I(experience^2) + education
  Res.Df Df      F    Pr(>F)
1  28150
2  28151 -1 354.91 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
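
waldtest() also accepts a vcov argument, so the same comparison can be carried out with robust standard errors; a minimal sketch, assuming the sandwich package is available:

> library("sandwich")
> waldtest(cps_lm, cps_noeth, vcov = vcovHC)  # Wald test with heteroskedasticity-consistent covariance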


Part II
Linear regression with panel data


Introduction

There has been considerable interest in panel data econometrics over the last two decades.

The package plm (Croissant and Millo 2008) contains the relevant fitting functions and methods for such specifications in R.

Two types of panel data models:
 1. Static linear models
 2. Dynamic linear models


Introduction
For illustrating the basic fixed- and random-effects methods, we use the well-known Grunfeld data (Grunfeld 1958), comprising 20 annual observations (1935-1954) on the three variables real gross investment (invest), real value of the firm (value), and real value of the capital stock (capital) for 11 large US firms. Here, we restrict attention to three firms:

> data("Grunfeld", package = "AER")
> library("plm")
> gr <- subset(Grunfeld, firm %in% c("General Electric",
+   "General Motors", "IBM"))
> pgr <- plm.data(gr, index = c("firm", "year"))

The last command tells R that the individuals are called "firm", whereas the time identifier is called "year". Pooled OLS can then be run using:

> gr_pool <- plm(invest ~ value + capital, data = pgr,
+   model = "pooling")

One-way panel regression


    invest_it = β₁ value_it + β₂ capital_it + α_i + ν_it,     (10)

where i = 1, . . . , n, t = 1, . . . , T, and the α_i denote the individual-specific effects. A fixed-effects version is estimated by running OLS on a within-transformed model:

> gr_fe <- plm(invest ~ value + capital, data = pgr, model = "within")
> summary(gr_fe)
Oneway (individual) effect Within Model

Call:
plm(formula = invest ~ value + capital, data = pgr, model = "within")

Balanced Panel: n=3, T=20, N=60

Residuals :
   Min. 1st Qu.  Median 3rd Qu.    Max.
-167.00  -26.10    2.09   26.80  202.00

Coefficients :
        Estimate Std. Error t-value  Pr(>|t|)
value   0.104914   0.016331  6.4242 3.296e-08 ***
capital 0.345298   0.024392 14.1564 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    1888900
Residual Sum of Squares: 243980
R-Squared      : 0.87084
Adj. R-Squared : 0.79827
F-statistic: 185.407 on 2 and 55 DF, p-value: < 2.22e-16

One-way panel regression


A two-way model could have been estimated upon setting effect = "twoways".

If the fixed effects need to be inspected, a fixef() method and an associated summary() method are available (see the sketch at the end of this slide).

To check whether the fixed effects are really needed, we compare the fixed-effects and the pooled OLS fits by means of pFtest():

> gr_fe <- plm(invest ~ value + capital, data = pgr, model = "within")
> pFtest(gr_fe, gr_pool)
F test for individual effects

data: invest ~ value + capital
F = 56.8247, df1 = 2, df2 = 55, p-value = 4.148e-14
alternative hypothesis: significant effects
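
For the first two points, a minimal sketch (reusing pgr from above; gr_fe2 is just an illustrative name):

> gr_fe2 <- plm(invest ~ value + capital, data = pgr,
+   effect = "twoways", model = "within")   # two-way fixed effects
> summary(fixef(gr_fe))                     # individual effects of the one-way model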


One-way panel regression

It is also possible to fit a random-effects version of (10) using the same fitting function upon setting model = "random" and selecting a method for estimating the variance components.

Four methods are available: Swamy-Arora, Amemiya, Wallace-Hussain, and Nerlove.

> gr_re <- plm(invest ~ value + capital, data = pgr,
+   model = "random", random.method = "walhus")
> summary(gr_re)


One-way panel regression


Oneway (individual) effect Random Effect Model
   (Wallace-Hussain's transformation)

Call:
plm(formula = invest ~ value + capital, data = pgr, model = "random",
    random.method = "walhus")

Balanced Panel: n=3, T=20, N=60

Effects:
                  var std.dev share
idiosyncratic 4389.31   66.25 0.352
individual    8079.74   89.89 0.648
theta: 0.8374

Residuals :
   Min. 1st Qu.  Median 3rd Qu.    Max.
-187.00  -32.90    6.96   31.40  210.00

Coefficients :
               Estimate  Std. Error t-value  Pr(>|t|)
(Intercept) -109.976572   61.701384 -1.7824   0.08001 .
value          0.104280    0.014996  6.9539 3.797e-09 ***
capital        0.344784    0.024520 14.0613 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    1988300
Residual Sum of Squares: 257520
R-Squared      : 0.87048
Adj. R-Squared : 0.82696
F-statistic: 191.545 on 2 and 57 DF, p-value: < 2.22e-16

One-way panel regression


A comparison of the regression coefficients shows that fixed- and random-effects methods yield rather similar results for these data.

To check whether the random effects are really needed, a Lagrange multiplier test is available in plmtest(), defaulting to the test proposed by Honda (1985).

> plmtest(gr_pool)
Lagrange Multiplier Test - (Honda)

data: invest ~ value + capital
normal = 15.4704, p-value < 2.2e-16
alternative hypothesis: significant effects


One-way panel regression


Random-effects methods are more efficient than the fixed-effects estimator under more restrictive assumptions, namely exogeneity of the individual effects. It is therefore important to test for endogeneity, and the standard approach employs a Hausman test. The relevant function phtest() requires two panel regression objects, in our case yielding:

> phtest(gr_re, gr_fe)
Hausman Test

data: invest ~ value + capital
chisq = 0.0404, df = 2, p-value = 0.98
alternative hypothesis: one model is inconsistent

In line with the rather similar estimates presented above, endogeneity does not appear to be a problem here.

Dynamic linear models

To conclude this section, we present a more advanced example, the dynamic panel data model

    y_it = Σ_{j=1}^{p} ρ_j y_{i,t-j} + x_it^T β + u_it,   u_it = α_i + λ_t + ν_it.    (11)

This is estimated by the method of Arellano and Bond (1991), viz. a generalized method of moments (GMM) estimator utilizing lagged endogenous regressors after a first-differences transformation.


Dynamic linear models

> data("EmplUK", package = "plm")


> form <- log(emp) ~ log(wage) + log(capital) + log(output)
# The function providing the Arellano-Bond estimator
is pgmm().
>
+
+
+

empl_ab <- pgmm(dynformula(form, list(2, 1, 0, 1)),


data = EmplUK, index = c("firm", "year"),
effect = "twoways", model = "twosteps",
gmm.inst = ~ log(emp), lag.gmm = list(c(2, 99)))


Dynamic linear models


> summary(empl_ab)
Twoways effects Two steps model

Call:
pgmm(formula = dynformula(form, list(2, 1, 0, 1)), data = EmplUK,
    effect = "twoways", model = "twosteps", index = c("firm",
    "year"), ... = list(gmm.inst = ~log(emp), lag.gmm = list(c(2,
    99))))

Unbalanced Panel: n=140, T=7-9, N=1031

Number of Observations Used: 611

Residuals
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
-0.6191000 -0.0255700  0.0000000 -0.0001339  0.0332000  0.6410000

Coefficients
                         Estimate Std. Error  z-value  Pr(>|z|)
lag(log(emp), c(1, 2))1  0.474151   0.085303   5.5584 2.722e-08 ***
lag(log(emp), c(1, 2))2 -0.052967   0.027284  -1.9413 0.0522200 .
log(wage)               -0.513205   0.049345 -10.4003 < 2.2e-16 ***
lag(log(wage), 1)        0.224640   0.080063   2.8058 0.0050192 **
log(capital)             0.292723   0.039463   7.4177 1.191e-13 ***
log(output)              0.609775   0.108524   5.6188 1.923e-08 ***
lag(log(output), 1)     -0.446373   0.124815  -3.5763 0.0003485 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Dynamic linear models

Sargan Test: chisq(25) = 30.11247 (p.value=0.22011)
Autocorrelation test (1): normal = -2.427829 (p.value=0.0075948)
Autocorrelation test (2): normal = -0.3325401 (p.value=0.36974)
Wald test for coefficients: chisq(7) = 371.9877 (p.value=< 2.22e-16)
Wald test for time dummies: chisq(6) = 26.9045 (p.value=0.0001509)

The results suggest that autoregressive dynamics are important for these data.


Part III
Regression diagnostics


Review

> data("Journals")
> journals <- Journals[, c("subs", "price")]
> journals$citeprice <- Journals$price/Journals$citations
> journals$age <- 2000 - Journals$foundingyear
> jour_lm <- lm(log(subs) ~ log(citeprice), data = journals)


Testing for heteroskedasticity

For cross-section regressions, the assumption Var(ε_i | x_i) = σ² is typically in doubt. A popular test for checking this assumption is the Breusch-Pagan test (Breusch and Pagan 1979).

For our model fitted to the journals data, stored in jour_lm, the diagnostic plots shown earlier suggest that the variance decreases with the fitted values or, equivalently, increases with the price per citation. Hence, the regressor log(citeprice) used in the main model should also be employed in the auxiliary regression.

Under H₀, the test statistic of the Breusch-Pagan test approximately follows a χ²(q) distribution, where q is the number of regressors in the auxiliary regression (excluding the constant term).
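
In fact, the studentized statistic is simply n·R² from the auxiliary regression of the squared residuals on the chosen regressors; a minimal sketch by hand, which should reproduce the bptest() result below:

> aux <- lm(I(residuals(jour_lm)^2) ~ log(citeprice), data = journals)
> bp <- nrow(journals) * summary(aux)$r.squared   # studentized BP statistic
> pchisq(bp, df = 1, lower.tail = FALSE)          # chi-squared(1) p-value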


Testing for heteroskedasticity


The function bptest() implements several flavors of the Breusch-Pagan test. By default, it computes the studentized statistic for the auxiliary regression utilizing the original regressors X.

> bptest(jour_lm)
        studentized Breusch-Pagan test

data:  jour_lm
BP = 9.803, df = 1, p-value = 0.001742

Alternatively, the White test also picks up heteroskedasticity; it uses the original regressors as well as their squares and interactions in the auxiliary regression, which can be passed as a second formula to bptest():

> bptest(jour_lm, ~ log(citeprice) + I(log(citeprice)^2),
+   data = journals)
        studentized Breusch-Pagan test

data:  jour_lm
BP = 10.912, df = 2, p-value = 0.004271

Testing the functional form


The assumption E(ε | X) = 0 is crucial for consistency of the least-squares estimator. A typical source of violation of this assumption is a misspecification of the functional form, e.g., by omitting relevant variables. One strategy for testing the functional form is to construct auxiliary variables and assess their significance using a simple F test. This is what Ramsey's RESET does.

The function resettest() defaults to using second and third powers of the fitted values as auxiliary variables.

> resettest(jour_lm)
        RESET test

data:  jour_lm
RESET = 1.4409, df1 = 2, df2 = 176, p-value = 0.2395
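
resettest() can alternatively use powers of the original regressors or of the first principal component as auxiliary variables; a minimal sketch of the regressor-based variant:

> resettest(jour_lm, power = 2:3, type = "regressor")  # RESET with squares and cubes of log(citeprice)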


Testing the functional form

The rainbow test (Utts 1982) takes a different approach to testing the functional form. It fits a model to a subsample (typically the middle 50%) and compares it with the model fitted to the full sample using an F test. Here, the observations are ordered by the age of the journal:

> raintest(jour_lm, order.by = ~ age, data = journals)
        Rainbow test

data:  jour_lm
Rain = 1.774, df1 = 90, df2 = 88, p-value = 0.003741

The null hypothesis is clearly rejected, signaling that the relationship between the number of subscriptions and the price per citation also depends on the age of the journal.


Testing for autocorrelation


Let us consider a distributed lag model for the US consumption function:

> library("dynlm")
> data("USMacroG")
> consump1 <- dynlm(consumption ~ dpi + L(dpi), data = USMacroG)

A classical procedure for assessing autocorrelation in regression relationships is the Durbin-Watson test (Durbin and Watson 1950).

> dwtest(consump1)
        Durbin-Watson test

data:  consump1
DW = 0.0866, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0

Further tests for autocorrelation are the Box-Pierce test and the Ljung-Box test, both implemented in the function Box.test() in base R.

> Box.test(residuals(consump1), type = "Ljung-Box")
        Box-Ljung test

data:  residuals(consump1)

That's all for today

Thank You....

