
Lecture Notes 2015-2016

Advanced Econometrics I
Chapters 7 and 8

Francisco Blasques

These lecture notes contain the material covered in the master course
Advanced Econometrics I. Further study material can be found in the
lecture slides and the many references cited throughout the text.

Contents

7 Model Selection and Pseudo-True Parameters
  7.1 Least squares and the weighted $L_2$-norm
  7.2 Maximum likelihood and the Kullback-Leibler divergence
  7.3 Model selection
  7.4 Inference under Model Misspecification
  7.5 Exercises

8 Simulation-Based Econometric Analysis
  8.1 Probabilistic Analysis and Value-at-Risk
  8.2 Forecasting
  8.3 Impulse Response Functions
7 Model Selection and Pseudo-True Parameters

In this chapter, we will try to answer probably the most essential question of all: what exactly are we estimating?
Until now, we have always defined the parameter of interest $\theta_0$ simply as being the identifiably unique maximizer of the limit criterion function $Q_\infty$. In Chapter 5 we have shown that, under appropriate conditions, the extremum estimator $\hat{\theta}_T$ converges to the point $\theta_0$, which is the unique maximizer of $Q_\infty$ over the parameter space. In Chapter 6, we have further shown that the distribution of the extremum estimator $\hat{\theta}_T$ is asymptotically Gaussian and centered at $\theta_0$.
The definition of $\theta_0$ as the unique maximizer of $Q_\infty$ might seem rather weak and uninteresting, but it is not. On the contrary, it is worth repeating: $\theta_0$ is the unique maximizer of $Q_\infty$. In most cases, being the unique maximizer of $Q_\infty$ is a very meaningful statement in itself. For example, in maximum likelihood estimation, this means that $\theta_0$ is really the most likely parameter value given an infinite amount of information (i.e. an infinitely large sample). In least squares estimation, $\theta_0$ is the parameter value that gives the best fit as judged by the sum of squared residuals from an infinite number of observations. In method of moments estimation, $\theta_0$ is the parameter value that best matches the moments of the data as judged by an infinite sample. This is, in itself, something important!
Definition 1 (Pseudo-true parameter) Given an extremum estimator
$$\hat{\theta}_T \in \arg\max_{\theta \in \Theta} Q_T(x^T, \theta),$$
the pseudo-true parameter $\theta_0$ is the unique element of $\Theta$ that maximizes the limit criterion $Q_\infty$ over $\Theta$,
$$\theta_0 \in \arg\max_{\theta \in \Theta} Q_\infty(\theta).$$
As we shall see below, there is an even better characterization of these estimators: $\theta_0$ is the parameter that makes the model as similar as possible to the data generating process. When the model is correctly specified, that point is precisely the true parameter! In other words, when the model is well specified, the pseudo-true parameter corresponds to the true parameter, because the true parameter $\theta_0$ is indeed the unique maximizer of $Q_\infty$.
In introductory econometrics courses, the model is always assumed to be well specified, and hence the parameter $\theta_0$ is always seen as being the true parameter (defined in Chapter 4). This assumption is however not needed and is often only made for simplicity.

7.1 Least squares and the weighted $L_2$-norm

Suppose that the data is generated by an NLAR model
$$x_t = \phi_0(x_{t-1}) + \epsilon_t, \quad t \in \mathbb{Z},$$
with some unknown function $\phi_0$. Suppose further that we estimate a parametric NLAR model
$$x_t = \phi(x_{t-1}, \theta) + \epsilon_t, \quad t \in \mathbb{Z},$$
that may, or may not, be well specified. If the model is well specified, then there exists some $\theta_0 \in \Theta$ such that
$$\phi(x_t, \theta_0) = \phi_0(x_t) \quad \forall \, x_t.$$
If the model is mis-specified, then there exists no such $\theta_0$.
Consider now the least squares estimator
$$\hat{\theta}_T \in \arg\min_{\theta \in \Theta} \frac{1}{T} \sum_{t=2}^{T} \big(x_t - \phi(x_{t-1}, \theta)\big)^2.$$

As we already know, under appropriate regularity conditions, we can show through the application of a law of large numbers that the least-squares criterion function converges to the limit criterion¹
$$E\big(x_t - \phi(x_{t-1}, \theta)\big)^2 \;\propto\; \int \big(\phi_0(x) - \phi(x, \theta)\big)^2 \, dP_0(x)$$
and that $\theta_0$ is the unique minimizer of this limit quantity
$$\theta_0 = \arg\min_{\theta \in \Theta} \int \big(\phi_0(x) - \phi(x, \theta)\big)^2 \, dP_0(x).$$
As it turns out, the quantity that $\theta_0$ is minimizing above is simply a transformation of the $L_2$-norm that measures the distance between the true unknown function $\phi_0(x)$ and the modeled regression function $\phi(x, \theta)$.
If the model is well specified, then this distance is minimized precisely at the true parameter $\theta_0$. In particular, $\theta_0$ is the only value for which the distance is exactly zero,
$$\int \big(\phi_0(x) - \phi(x, \theta_0)\big)^2 \, dP_0(x) = \int \big(\phi(x, \theta_0) - \phi(x, \theta_0)\big)^2 \, dP_0(x) = 0.$$
If the model is mis-specified, then $\theta_0$ is, by definition, the unique element of $\Theta$ that minimizes the $L_2$-norm distance between the true function $\phi_0$ and the model function $\phi(\cdot, \theta)$. The uniqueness of $\theta_0$ can easily be obtained in several settings.²
Figure 1 shows examples of linear AR(1) approximations to nonlinear data generating processes. The figure on the left shows the best approximation that a mis-specified linear AR(1) model can provide to a logistic AR data generating process with Gaussian innovations. The figure in the middle shows the best $L_2$ approximation that a linear AR(1) model can provide to a Logistic SESTAR data generating process. Finally, the figure on the right shows the best $L_2$ approximation that a linear AR(1) model can provide to an Exponential SESTAR data generating process.

¹ The symbol $\propto$ should be read as "proportional to". For example, $f(z) \propto g(z)$ means that $f(z)$ is proportional to $g(z)$; i.e. there exists some constant $c$ such that $f(z) = g(z) + c$ for all $z$. Two functions that are proportional have the same arg max.
² For example, if $\phi_0$ is continuous and $\phi(\cdot, \theta)$ is a polynomial function, then there exists a unique $\theta_0$ that minimizes this distance. Many other results of this type exist.
Figure 1: Left: best linear approximation to the logistic function in $L_2$ norm. Center: best linear approximation to the logistic SESTAR function in $L_2$ norm. Right: best linear approximation to the exponential SESTAR function in $L_2$ norm. (Each panel plots $x_t$ against $x_{t-1}$, showing the true function and its best $L_2$ approximation.)
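To make the idea of a best $L_2$ approximation concrete, the sketch below simulates a long sample from a purely illustrative logistic autoregression and fits a misspecified linear AR(1) by least squares; for a very large sample the OLS estimate is close to the pseudo-true parameter of the linear model. The DGP, its parameter values, and the sample size are hypothetical choices, not the ones used to produce Figure 1.

T = 1e6;                                        % very large sample, so OLS approximates the pseudo-true value
x = zeros(T,1);
eps = 0.25*randn(T,1);                          % Gaussian innovations (hypothetical variance)
for t = 2:T
    x(t) = 2/(1+exp(-2*x(t-1))) - 1 + eps(t);   % illustrative nonlinear "true" regression function phi_0
end
Z = [ones(T-1,1), x(1:end-1)];                  % regressors of the misspecified linear AR(1) model
theta_hat = Z\x(2:end);                         % OLS: approximate pseudo-true intercept and slope

The fitted line is (approximately) the best $L_2$ approximation to the logistic regression function, weighted by the stationary distribution $P_0$ of the data, exactly as in Figure 1.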

7.2 Maximum likelihood and the Kullback-Leibler divergence

Consider the maximum likelihood estimator
$$\hat{\theta}_T \in \arg\max_{\theta \in \Theta} \frac{1}{T} \sum_{t=2}^{T} \log f(x_t, x_{t-1}, \theta)$$
where $\log f(x_t, x_{t-1}, \theta)$ denotes the log conditional density of $x_t$ given $x_{t-1}$. Under appropriate regularity conditions, we know that the limit criterion is then given by
$$L_\infty(\theta) = E \log f(x_t, x_{t-1}, \theta).$$
As a result, $\theta_0$ is, by definition, the unique maximizer of $L_\infty$,
$$\theta_0 \in \arg\max_{\theta \in \Theta} E \log f(x_t, x_{t-1}, \theta).$$
Interestingly enough, this means that $\theta_0$ is the unique minimizer of the following quantity,
$$\theta_0 \in \arg\min_{\theta \in \Theta} \; E \log f_0(x_t, x_{t-1}) - E \log f(x_t, x_{t-1}, \theta),$$
where $f_0(x_t, x_{t-1})$ is the true unknown conditional density of $x_t$ given $x_{t-1}$. This quantity is quite important: it is the Kullback-Leibler (KL) distance between the conditional density $f(x_t, x_{t-1}, \theta)$ implied by the model and the true conditional density $f_0(x_t, x_{t-1})$. This is crucial, because it shows that the point $\theta_0$ that maximizes the limit likelihood function $Q_\infty$ is also the point that provides the best approximation to the true conditional density $f_0(x_t, x_{t-1})$ as judged by the Kullback-Leibler distance. In other words, $\theta_0$ satisfies
$$\theta_0 = \arg\min_{\theta \in \Theta} KL\big(f_0(x_t, x_{t-1}) \,,\, f(x_t, x_{t-1}, \theta)\big)$$
where $KL\big(f_0(x_t, x_{t-1}) \,,\, f(x_t, x_{t-1}, \theta)\big)$ is the KL distance between $f_0(x_t, x_{t-1})$ and $f(x_t, x_{t-1}, \theta)$.

Definition 2 (Kullback-Leibler Distance) Let $z_t$ be a random variable and $f_1$ and $f_2$ be two probability density functions. Then the Kullback-Leibler distance between $f_1$ and $f_2$ is given by
$$KL(f_1, f_2) = E \log f_1(z_t) - E \log f_2(z_t),$$
where the expectations are taken with respect to the distribution of $z_t$ (in our setting, the true density $f_1$).
When the model is well specified, then $\theta_0$ corresponds to the true parameter, since the KL distance
$$KL\big(f_0(x_t, x_{t-1}) \,,\, f(x_t, x_{t-1}, \theta)\big)$$
is precisely minimized at the point $\theta_0$ where $f(x_t, x_{t-1}, \theta_0) = f_0(x_t, x_{t-1})$. In particular, at $\theta_0$ we get
$$KL\big(f(x_t, x_{t-1}, \theta_0) \,,\, f(x_t, x_{t-1}, \theta_0)\big) = 0.$$
When the model is mis-specified, $\theta_0$ is the parameter value that provides the best approximation to the data generating process in Kullback-Leibler distance.
Note that when the model is correctly specified, then $\theta_0$ is unique as long as each parameter $\theta \in \Theta$ defines a unique conditional density for the data. Indeed, this ensures that only $\theta_0$ can set $f(x_t, x_{t-1}, \theta_0)$ equal to $f_0(x_t, x_{t-1})$ and hence set the KL distance to zero. However, if the model is mis-specified, then it is more difficult to ascertain if $\theta_0$ is unique. Several conditions exist to ensure the uniqueness of $\theta_0$, but these are quite complicated and hence lie outside the scope of this course.
Figure 2 provides some intuition. On the left, we show an example where there exists a unique density function (blue dot) in the model (collection of densities represented by the gray area) that provides a best approximation to the true density (red dot) of the data generating process (DGP). On the right, we show an example where there exist two density functions (blue dots) in the model (gray area) that provide a best approximation to the true density function (red dot). In the first case $\theta_0$ is unique. In the second case it is not.

Figure 2: Left: best approximation to the true density of the DGP is unique. Right: best approximation to the true density of the DGP is not unique.
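As a concrete illustration of a pseudo-true parameter in KL distance, the sketch below approximates the KL divergence between a hypothetical Laplace DGP and a Gaussian model over a grid of variance values, and picks the minimizer. The DGP, the model, and the grid are all illustrative choices, not examples from the notes.

N = 1e6; b = 1;                                  % Laplace scale (hypothetical DGP)
u = rand(N,1) - 0.5;
z = -b*sign(u).*log(1 - 2*abs(u));               % draws from the Laplace(0,b) density
logf0 = -log(2*b) - abs(z)/b;                    % true log density evaluated at the draws
s2_grid = 0.5:0.01:4;                            % grid of Gaussian variances (the model parameter)
KLdist = zeros(size(s2_grid));
for i = 1:length(s2_grid)
    logf = -0.5*log(2*pi*s2_grid(i)) - z.^2/(2*s2_grid(i));   % Gaussian log density
    KLdist(i) = mean(logf0) - mean(logf);        % Monte Carlo estimate of the KL distance
end
[~, imin] = min(KLdist);
s2_star = s2_grid(imin);                         % pseudo-true variance (close to 2*b^2 here)

Since the Gaussian family has the wrong shape, the KL distance never reaches zero; the pseudo-true variance simply matches the second moment of the DGP.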

7.3 Model selection

Until now we have worked with a wide range of linear and nonlinear models with
different dynamic properties and capable of describing different features of the data.
How do we know which model is best for describing a certain time-series? The answer to this question depends crucially on the purpose of the model. There exists no globally satisfactory answer!
If a given model is designed to forecast, then it should be judged on its ability to
produce accurate forecasts. If a model is designed to describe well certain moments,
then it should be judged on its ability to approximate those moments well. If it is
designed to explain a certain dynamic behavior of a time-series, then the model should
be considered good if it is indeed capable of delivering the desired result. Having said
this, it is important to note that a model that approximates well the true distribution
of the data is also a model that, in general, will (i) produce accurate forecasts; (ii)
describe well the moments of the data; (iii) approximate well the dynamic features of
the data; etc. As such, all of the above objectives are, in one way or another, related
to approximating the true probability measure P0 . Below, we review a fundamental
and very general theory of model selection that builds on the theory of extremum
estimation covered in Chapters 5 and 6. This general method of model selection
attempts to find the model that best approximates the data by means of optimizing
a penalized estimation criterion function.
Looking back, it should be clear that Section 7.2 already suggested a method of model selection. Namely, if the parameters of two competing models are estimated by maximum likelihood, then it is reasonable to select the model that achieves the highest log likelihood value. Indeed, by construction, this model provides the best approximation to the DGP in KL distance. If the parameters of two competing models are estimated by the least-squares method, then we should select the model that achieves the lowest sum of squared residuals, since this model provides the best approximation to the DGP in $L_2$ distance. There is however one detail that must be kept in mind: this argument is only valid asymptotically!


The ML estimator only minimizes the KL distance asymptotically when the sample size T goes to infinity. Similarly, the least squares estimator only minimizes the
L2 norm asymptotically. In finite samples, since 0 and Q are unknown, we must
compare models using the value of the sample criterion function QT evaluated at the
T . In other words, we use QT (xT ,
T ) as an approximation to Q ( 0 )
point estimate
However, we must account for the phenomenon that the sample criterion function can
be spuriously improved by increasing the number of parameters in the model. This is
true, even if the added parameters do not describe better the DGP. For example, in
ML estimation we can spuriously achieve a higher likelihood value by increasing the
number of parameters in the model. Similarly, in least squared estimation, we can
spuriously reduce the sum of squared residuals by increasing the number of parameters in the model. We must thus be careful when comparing models with a different
number of parameters.
In order to correct for this phenomenon, we must penalize for the number of
T )
parameters in the model. The idea is to take the value of the criterion QT (xT ,
and subtract a positive quantity hT (k) where hT is some function of the number of
parameters k in the model (that may also depend on sample size). In general, we
should thus use the quantity
T ) hT (k)
QT (xT ,
to compare models with different parameters.
Example: (Adjusted R²) As you may recall, the coefficient of determination $R^2$ is precisely a measure that reflects the ability of a given model to minimize the sum of squared residuals. In particular, the $R^2$ is a standardized version of $Q_T(x^T, \hat{\theta}_T)$ that ensures that $R^2$ takes values between 0 and 1,
$$R^2 \propto Q_T(x^T, \hat{\theta}_T).$$
In models estimated by least squares, the so-called adjusted $R^2$, given by
$$\bar{R}^2 = R^2 - h(k),$$
adjusts the value of $R^2$ in order to account for the fact that the $R^2$ can always be improved by increasing the number of parameters in the model. Unlike the $R^2$, the adjusted $R^2$ can only be improved if the added parameter improves the $R^2$ by more than would be expected by chance.
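As a concrete (and common) choice of the penalty $h(k)$ — not necessarily the exact one intended in these notes — the textbook adjusted $R^2$ can be written as follows, evaluated here at purely hypothetical values:

R2 = 0.63; T = 100; k = 3;                      % hypothetical R^2, sample size and number of parameters
R2_adj = 1 - (1 - R2)*(T - 1)/(T - k - 1);      % standard adjusted R^2, i.e. h(k) = (1-R2)*k/(T-k-1)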
Example: (Akaike's Information Criterion (AIC)) In models with $k$ parameters estimated by maximum likelihood, the so-called Akaike Information Criterion (AIC), given by
$$AIC = 2k - 2 L_T(x^T, \hat{\theta}_T),$$
adds a penalty to the negative of the log likelihood value $L_T(x^T, \hat{\theta}_T)$ in order to account for the fact that the likelihood can always be improved by increasing the number of parameters in the model. Note that the best model is the one with the smallest AIC!
The AIC, introduced by H. Akaike in 1973 and 1974, gives rise to a truly general
model selection technique. Unfortunately, the theoretical foundations of the AIC are
poorly understood by the majority of practitioners. Misguided by simulation results
that do not reflect the theoretical context of each model selection technique, many
practitioners unfortunately abandon the AIC in favor of other criteria that are valid
under much more restrictive settings. Unlike a host of other criteria, the AIC can
be used to compare non-nested, non-congruent, misspecified models in very general
settings. In order to achieve great asymptotic generality, the AIC penalty should
however be allowed to grow with sample size. Following Sin and White (1996), the
following modified AIC can be used to consistently select models in large samples,
under very general conditions.
Definition 3 (Modified Information Criterion) Given two models (Model 1 and Model 2), with parameters $\theta_1 \in \Theta_1 \subseteq \mathbb{R}^p$ and $\theta_2 \in \Theta_2 \subseteq \mathbb{R}^q$, $p \geq q$, and with log likelihoods $L^1_T(x^T, \hat{\theta}^1_T)$ (Model 1) and $L^2_T(x^T, \hat{\theta}^2_T)$ (Model 2), the modified AIC is given by
$$MAIC = L^1_T(x^T, \hat{\theta}^1_T) - L^2_T(x^T, \hat{\theta}^2_T) - c\,(p - q)\,\big(T \log(\log(T))\big)^{\frac{1}{2}}$$
where $c$ is a strictly positive scalar.
Naturally, a positive MAIC constitutes evidence in favor of Model 1. A negative MAIC constitutes evidence in favor of Model 2. It is worth noting that the penalty of the MAIC rises with sample size at a rate faster than $T^{\frac{1}{2}}$, but slower than $T$. It is also important to highlight that the MAIC is asymptotically consistent; i.e. the MAIC selects the best model (in KL divergence) with probability converging to one as $T \to \infty$. The only additional complication of the MAIC compared to the AIC is that it depends on the unspecified constant $c > 0$. Values of $c \approx 0.1$ may be acceptable as a rule-of-thumb. In particular, for $c \approx 0.1$, we obtain a penalty $c\,(p - q)\big(T \log(\log(T))\big)^{\frac{1}{2}} \approx 2(p - q)$ for $T \approx 250$. As a result, we obtain the AIC selection rule precisely for sample sizes where the AIC performs reasonably well. In general however, the selection of $c > 0$ should be guided, in any given setting, by further theoretical or simulation based results.
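A minimal sketch of the MAIC comparison, using the formula above with hypothetical log-likelihood values, model dimensions, and the rule-of-thumb constant:

L1 = -1278.7; p = 5;                             % Model 1: log likelihood and number of parameters (hypothetical)
L2 = -1283.1; q = 4;                             % Model 2 (p >= q)
T = 250; c = 0.1;                                % sample size and penalty constant (rule of thumb)
MAIC = L1 - L2 - c*(p - q)*sqrt(T*log(log(T)));  % positive values favour Model 1
if MAIC > 0
    disp('Evidence in favour of Model 1')
else
    disp('Evidence in favour of Model 2')
end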

7.4 Inference under Model Misspecification

This final section of this chapter is devoted to the estimation of the asymptotic variance of extremum estimators. In particular, we end with a warning: econometric software programs typically calculate the asymptotic variance of estimators assuming that the model is correctly specified!
According to the classical asymptotic normality theorem in Chapter 6, we know that, under appropriate regularity conditions, extremum estimators satisfy
$$\sqrt{T}\,\big(\hat{\theta}_T - \theta_0\big) \;\xrightarrow{d}\; N\big(0 \,,\, A^{-1} B A^{-1\top}\big) \quad \text{as } T \to \infty.$$
We also noted in Chapter 6 that this implies that the estimator $\hat{\theta}_T$ has an approximate distribution given by
$$\hat{\theta}_T \overset{approx}{\sim} N\big(\theta_0 \,,\, A^{-1} B A^{-1\top}/T\big).$$
This approximate distribution can be used to conduct inference. However, in order for this result to be useful in practice, we must estimate the unknown $A$ and $B$.
Estimates of $A$ and $B$ are generally easy to obtain. Theorem 3 tells us that $B$ is the asymptotic variance of the standardized derivative of the criterion function
$$\sqrt{T}\,\frac{1}{T}\sum_{t=2}^{T} \nabla q(x_t, x_{t-1}, \theta_0) \;\xrightarrow{d}\; N(0, B) \quad \text{as } T \to \infty.$$
The central limit theorem for SE martingale difference sequences in Chapter 4 tells us that if $\{\nabla q(x_t, x_{t-1}, \theta_0)\}$ is a martingale difference sequence, then $B$ is simply the variance of $\nabla q(x_t, x_{t-1}, \theta_0)$. Recall that being a martingale difference sequence means essentially that $\{\nabla q(x_t, x_{t-1}, \theta_0)\}$ is white noise, i.e. that it is uncorrelated with mean zero and some finite variance. As such, for any given null hypothesis $H_0: \theta_0 = \theta^*$, we can estimate
$$B = \text{Var}\big(\nabla q(x_t, x_{t-1}, \theta^*)\big) = E\,\nabla q(x_t, x_{t-1}, \theta^*)^2$$
by its sample counterpart
$$\hat{B}_T = \frac{1}{T}\sum_{t=2}^{T} \nabla q(x_t, x_{t-1}, \theta^*)^2.$$

Luckily, when the model is correctly specified, then $\{\nabla q(x_t, x_{t-1}, \theta_0)\}$ is always uncorrelated under the null hypothesis, and hence, estimation offers no problems. Furthermore, under correct specification, it holds true that $A^{-1} B A^{-1\top} = B^{-1}$, and hence, we can estimate the asymptotic distribution of the estimator using
$$\hat{\theta}_T \sim N\big(\theta^* \,,\, \hat{B}_T^{-1}/T\big) \quad \text{under } H_0: \theta_0 = \theta^*.$$
Since $A = E\,\nabla^2 q(x_t, x_{t-1}, \theta_0)$, then for any given null hypothesis $H_0: \theta_0 = \theta^*$, we can also estimate the asymptotic variance using the alternative estimator
$$\hat{A}_T^{-1} = \bigg(\frac{1}{T}\sum_{t=2}^{T} \nabla^2 q(x_t, x_{t-1}, \theta^*)\bigg)^{-1}.$$

Unfortunately, if the model is mis-specified, then $\{\nabla q(x_t, x_{t-1}, \theta_0)\}$ is generally correlated and the equality $A^{-1} B A^{-1\top} = B^{-1}$ does not hold. As a result, we must use a robust variance estimator $\hat{B}_T$ for $B$ that takes into account the autocorrelation in $\{\nabla q(x_t, x_{t-1}, \theta_0)\}$. Furthermore, we must separately estimate $A$ using $\hat{A}_T$ above. This yields an estimate of the asymptotic distribution of $\hat{\theta}_T$ given by
$$N\big(\theta^* \,,\, \hat{A}_T^{-1} \hat{B}_T \hat{A}_T^{-1\top}/T\big).$$

Remark: Software packages typically give estimates of the asymptotic variance that are based on the assumption of correct specification! They do not use robust variance estimators for $B$, and they assume the equality $A^{-1} B A^{-1\top} = B^{-1}$. As a serious econometrician, you surely recognize the great limitations of this approach!
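To fix ideas, the sketch below computes both the naive and the autocorrelation-robust (sandwich) variance estimates for a least-squares linear AR(1) fitted to data generated by a different process, so that the score contributions are autocorrelated. The DGP, the least-squares criterion used to form the scores, and the Newey-West bandwidth rule are all illustrative choices, not the exact estimators of the notes.

T = 5000;
u = randn(T,1); x = zeros(T,1);
for t = 2:T
    x(t) = 0.3*x(t-1) + u(t) + 0.5*u(t-1);          % ARMA(1,1) data: the AR(1) model is misspecified
end
y = x(2:end); Z = [ones(T-1,1), x(1:end-1)];        % regressand and regressors of the AR(1) model
theta = Z\y;                                        % least-squares estimate
e = y - Z*theta;                                    % residuals
n = T - 1;
S = 2*repmat(e,1,2).*Z;                             % score contributions of q_t = -(x_t - a - b*x_{t-1})^2
A_hat = -(2/n)*(Z'*Z);                              % average Hessian of q_t
B_hat = (S'*S)/n;                                   % naive variance of the scores (no autocorrelation)
L = floor(4*(n/100)^(2/9));                         % Newey-West bandwidth (a common rule of thumb)
B_rob = B_hat;
for j = 1:L
    G = (S(1+j:end,:)'*S(1:end-j,:))/n;             % j-th autocovariance of the scores
    B_rob = B_rob + (1 - j/(L+1))*(G + G');         % Bartlett weights
end
V_naive  = (A_hat\B_hat)/A_hat' / n;                % sandwich A^-1 B A^-T / T with naive B
V_robust = (A_hat\B_rob)/A_hat' / n;                % sandwich with autocorrelation-robust B
se = [sqrt(diag(V_naive)), sqrt(diag(V_robust))]    % compare the two sets of standard errors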

7.5 Exercises

1. Consider two models, A and B. The parameters of model A are estimated by the least-squares method, and the parameters of model B are estimated by maximum likelihood. Let $\theta_0^{LS}$ denote the least-squares pseudo-true parameter for model A, and $\theta_0^{ML}$ be the maximum likelihood pseudo-true parameter for model B. Comment on the following statements:
(a) The pseudo-true parameters $\theta_0^{LS}$ and $\theta_0^{ML}$ are the same, i.e. $\theta_0^{LS} = \theta_0^{ML}$.
(b) If models A and B are well specified, then $\theta_0^{LS} = \theta_0^{ML}$.
(c) If model A nests model B and model B is well specified, then $\theta_0^{LS} = \theta_0^{ML}$.
(d) If model A nests model B and model A is mis-specified, then $\theta_0^{LS} = \theta_0^{ML}$.

2. Let $\{x_t\}_{t \in \mathbb{Z}}$ be a strictly stationary and ergodic time-series satisfying $E|x_t|^4 < \infty$, given by
$$x_t = \phi_0(x_{t-1}) + \epsilon_t, \quad t \in \mathbb{Z}.$$
Suppose that you estimate the parameters $\theta_0$ of the following model
$$x_t = g(x_{t-1}, \theta)\,x_{t-1} + \epsilon_t, \quad t \in \mathbb{Z},$$
using the following estimator
$$\hat{\theta}_T \in \arg\max_{\theta \in \Theta} \; -\frac{1}{T}\sum_{t=2}^{T} \big(x_t - g(x_{t-1}, \theta)\,x_{t-1}\big)^4.$$
Suppose that there exists a unique parameter vector $\theta_0$ that maximizes the limit criterion function. Show that $\theta_0$ minimizes an $L_p$-norm distance between $\phi_0(x_{t-1})$ and $g(x_{t-1}, \theta)\,x_{t-1}$.

3. The following table shows the estimation results for a sequence of nested models estimated by the least squares method. Model A nests model B, model B nests model C, and model C nests model D. Find 2 mistakes in this table.

   Model   nr of parameters   R2     Adjusted R2
   A       7                  0.94   0.88
   B       5                  0.77   0.79
   C       3                  0.63   0.54
   D       2                  0.65   0.41

4. The following table shows the ML estimation results for four alternative (non-nested) models. Which model would you select?

   Model   nr of parameters   Log likelihood   AIC
   A       4                  -1285.3          2578.6
   B       4                  -1283.1          2574.2
   C       5                  -1278.7          2567.6
   D       9                  -1279.4          2573.8

5. Answer again the question above with the additional information that the sample size is T = 100. What if T = 250? T = 500? T = 10000? As $T \to \infty$?

8 Simulation-Based Econometric Analysis

In general, simulations are needed to analyze nonlinear dynamic econometric models. From the probabilistic implications of the model, to forecasts or impulse response functions, the need for simulations is pervasive. This occurs because the conditional distribution of $\{x_t\}_{t \in \mathbb{Z}}$ is often analytically intractable. The same is true of conditional expectations and conditional variances. In this chapter we review the use of Monte Carlo simulations for performing econometric analysis in the nonlinear world. These techniques are all simple to understand and implement by anyone with a very basic understanding of computational programming. In any case, snippets of MATLAB code with step-by-step explanations are made available to those with little or no computational background.

8.1 Probabilistic Analysis and Value-at-Risk

The probabilistic analysis of linear dynamic models is often simple and analytically tractable. Consider for example the linear AR(1) model
$$x_t = \alpha + \beta x_{t-1} + \epsilon_t, \quad t \in \mathbb{Z}, \qquad \{\epsilon_t\} \sim NID(0, \sigma^2).$$
Given a sample of observed data $x^T := (x_1, ..., x_T)$, the conditional probability of observing $x_{T+1}$ larger than some constant $c$, given $x_T$, is easy to calculate since the error term is additive and Gaussian. For any given parameter vector $\theta := (\alpha, \beta, \sigma^2)$, obtaining $P_\theta(x_{T+1} > c \mid x_T)$ is easy since we know that
$$x_{T+1} \mid x_T \sim N(\alpha + \beta x_T \,,\, \sigma^2).$$
Similarly, it is easy to calculate the unconditional probability $P_\theta(x_t > c)$ because the marginal distribution of $x_t$ is given by
$$x_t \sim N\big(\alpha/(1-\beta) \,,\, \sigma^2/(1-\beta^2)\big).$$
Typically, this kind of probabilistic analysis proceeds under the assumption that the model is correctly specified and that the parameter estimates correspond to the actual true parameter. Fortunately, these assumptions are not needed. Indeed, these assumptions are typically employed just to simplify the description of the results to a public that is not specialized in econometrics. In order to recognize the model and parameter uncertainty that are inherent to econometric analysis, we just have to clarify that the results are conditional on the adopted model and the estimated parameters. In other words, all we must do is recognize that $P_{\hat{\theta}_T}(x_{T+1} > c \mid x_T)$ is an approximation to the true unknown $P_0(x_{T+1} > c \mid x_T)$.
Consider, for example, the following parameter estimates obtained from a sample of quarterly growth rates of the gross domestic product (GDP) in The Netherlands, spanning from the 3rd quarter of 1987 to the first quarter of 2014,
$$x_t = 0.59 + 0.42\,x_{t-1} + \epsilon_t, \quad t \in \mathbb{Z}, \qquad \{\epsilon_t\} \sim NID(0, 0.752).$$

Given that the last observed value of the quarterly GDP growth rate was $x_T = -0.37\%$, in the first quarter of 2014, what is the probability that the growth rate becomes positive in the next quarter? Well, conditional on the postulated model and parameter estimates, the probability that the economy leaves the recession in the second quarter of 2014 is given by $P_{\hat{\theta}_T}(x_{T+1} > 0 \mid x_T) \approx 0.69$. Indeed, conditional on the model and estimated parameters, we have
$$x_{T+1} \mid x_T \sim N(0.435 \,,\, 0.752),$$
hence the probability of observing a positive growth rate at time $T+1$ is actually quite reasonable! The unconditional probability of positive growth is easily obtained as being $P_{\hat{\theta}_T}(x_t > 0) \approx 0.87$, since $x_t \sim N(1.01, 0.913)$ for every $t$.
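Conditional on the estimated model, these probabilities follow directly from the Gaussian distribution function. A minimal sketch using the estimates above (normcdf belongs to MATLAB's Statistics Toolbox):

alpha = 0.59; beta = 0.42; sig2 = 0.752;                 % estimated parameters from the AR(1) above
x_T = -0.37;                                             % last observed growth rate (2014Q1)
p_cond = 1 - normcdf(0, alpha + beta*x_T, sqrt(sig2));   % P(x_{T+1} > 0 | x_T)
mu = alpha/(1 - beta); v = sig2/(1 - beta^2);            % unconditional mean and variance
p_unc = 1 - normcdf(0, mu, sqrt(v));                     % unconditional probability of positive growth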
In nonlinear dynamic models, it may sometimes be difficult to derive these probabilities analytically. Consider for example the NLAR model
$$x_t = f(x_{t-1}, \epsilon_t, \theta), \quad t \in \mathbb{Z}, \qquad \{\epsilon_t\} \sim NID(0, \sigma^2). \qquad (1)$$
Despite the fact that the innovations are assumed to be iid Gaussian, the distribution of $x_{T+1}$ given $x_T$ may be difficult to ascertain due to the nonlinear function $f$. Luckily, approximate distributions can be easily obtained through Monte Carlo simulations. In particular, the probability $P_{\hat{\theta}_T}(x_{T+1} > c \mid x_T)$ can be approximated by drawing $N$ innovations $\{\tilde{\epsilon}^i_{T+1}\}_{i=1}^N$, obtaining $N$ simulated values $\{\tilde{x}^i_{T+1}\}_{i=1}^N$ using (1), all conditional on the same observed $x_T$, and finally calculating
$$P_{\hat{\theta}_T}(x_{T+1} > c \mid x_T) \;\approx\; \frac{1}{N}\sum_{i=1}^{N} I(\tilde{x}^i_{T+1} > c)$$
where $\tilde{x}^i_{T+1}$ denotes the $i$th simulated value of $x_{T+1}$ and $I(\tilde{x}^i_{T+1} > c)$ denotes the indicator function that sets $I(\tilde{x}^i_{T+1} > c) = 1$ if $\tilde{x}^i_{T+1} > c$ and $I(\tilde{x}^i_{T+1} > c) = 0$ otherwise.
The snippet of MATLAB code below provides such an approximation for the nonlinear autoregressive model
$$x_{t+1} = \tanh(0.9\,x_t + \epsilon_{t+1}), \quad t \in \mathbb{Z}, \qquad \{\epsilon_t\} \sim NID(0, 0.01).$$
In this specific example, we calculate the probability $P_{\hat{\theta}_T}(x_{T+1} > 0.4 \mid x_T)$, with the number of simulations set to $N = 10000$, and where the last observed sample value is $x_T = 0.2$.

N = 10000;                           % set number of draws
x_T = 0.2;                           % set value of x_T
eps_T1 = 0.1*randn(1,N);             % generate N values of eps_{T+1}
x_T1 = tanh(0.9*x_T + eps_T1);       % calculate N values of x_{T+1}
P04 = (1/N)*sum(x_T1 > 0.4);         % calculate P(x_{T+1} > 0.4 | x_T)

A similar reasoning can be applied to obtain h-step-ahead probabilities. The snippet of MATLAB code below calculates the probability $P_{\hat{\theta}_T}(x_{T+3} > 0.4 \mid x_T)$ for the same model, by iterating the nonlinear dynamic model forward.

h = 3;                               % set steps ahead
N = 10000;                           % set number of draws
x_T = 0.2;                           % set value of x_T
eps = 0.1*randn(h,N);                % generate h x N innovations
x(1,:) = x_T*ones(1,N);              % initialise at x_T
for t = 1:h
    x(t+1,:) = tanh(0.9*x(t,:) + eps(t,:));      % calculate N values of x_{T+t} recursively
end
P04 = (1/N)*sum(x(h+1,:) > 0.4);     % calculate P(x_{T+3} > 0.4 | x_T)

In certain cases, the nonlinear nature of the dynamic equation does not complicate the probabilistic analysis of the model. For example, in nonlinear dynamic models with additive innovations,
$$x_t = f(x_{t-1}, \theta) + \epsilon_t,$$
it is easy to calculate conditional probabilities of the type $P_{\hat{\theta}_T}(x_{T+1} > c \mid x_T)$ as long as the distribution of the innovations is known. For example, if the innovations in the model above are iid Gaussian $N(0, \sigma^2)$, then it follows immediately that $x_{T+1} \mid x_T \sim N(f(x_T, \theta), \sigma^2)$. Note however that for multiple-steps-ahead probabilistic statements, Monte Carlo simulations are again required even for nonlinear models with additive innovations.
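A minimal sketch of this closed-form one-step calculation, for a hypothetical choice of $f$ and parameter values (note that, unlike the tanh model above, here the noise enters additively outside the nonlinearity):

f = @(x) tanh(0.9*x);                % hypothetical regression function f(x, theta)
sig = 0.1; x_T = 0.2; c = 0.1;       % hypothetical noise s.d., last observation and threshold
p = 1 - normcdf(c, f(x_T), sig);     % P(x_{T+1} > c | x_T) in closed form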
A similar reasoning applies to time-varying parameter models. Consider, for example, the observation-driven local-level model,
$$x_t = \mu_t + \epsilon_t, \qquad \mu_{t+1} = \omega + \alpha(x_t - \mu_t) + \beta\mu_t.$$
Conditional on the model at hand, probabilistic statements about $x_{T+1}$ given $x^T$ can easily be made, since $\mu_{T+1}$ is given when we condition on the observed sample $x^T$. Suppose again that the innovations are iid Gaussian $N(0, \sigma^2)$. Then $x_{T+1} \mid x^T \sim N(\mu_{T+1}, \sigma^2)$. Again, Monte Carlo simulations may be required for multiple-steps-ahead probabilistic statements, especially when the updating equation for the time-varying parameter is nonlinear.
In finance, the Value-at-Risk (VaR) is a popular risk measure that is often derived from the probabilistic analysis of volatility models. Specifically, for a given portfolio and a pre-specified probability $\alpha$, the daily $\alpha$-VaR is the minimum amount the investor stands to lose with probability $\alpha$ over a period of one day. For example, if a portfolio has a daily 10%-VaR of 1 million euros, then there is a 10% probability that the value of the portfolio will fall by more than 1 million euros in one day.
The VaR is often also stated in terms of percentage loss. For example, if a portfolio has a daily 5%-VaR of 17%, then there is a 5% probability that the value of the portfolio will fall by more than 17% of its value in one day. Mathematically, given a portfolio value $p_t$ at time $t$, and a random return $x_t = (p_t - p_{t-1})/p_{t-1}$ on the portfolio, the 5%-VaR in percentage loss is defined as the value $c$ that satisfies
$$P(x_t \leq -c) = 0.05.$$
Clearly, the VaR expressed in percentage loss can immediately be turned into the VaR in monetary loss by multiplying the percentage loss $c$ by the value of the portfolio at that time.
Please take a moment to notice that the true VaR is not known exactly, since the true distribution of the sequence $\{x_t\}_{t \in \mathbb{Z}}$ is unknown. Indeed, any statements involving the probabilistic distribution of $\{x_t\}_{t \in \mathbb{Z}}$ are statements about the unknown. Probabilities about $\{x_t\}_{t \in \mathbb{Z}}$ can only be estimated, and those estimates typically depend on the model adopted by the researcher and parameter estimates obtained from the data.
Typically, the VaR is interpreted as if the model were correctly specified and the parameter estimates corresponded to the true parameter. This is a practice that simplifies the presentation of the estimated VaR for a public that is not specialized in econometrics. Luckily however, we do not have to assume correct specification or correct parameters. Model uncertainty and parameter uncertainty can be acknowledged! We just have to recognize that our VaR estimates are effectively conditional on the model and the estimated parameters. In other words, we just have to recognize that the estimated VaR obtained from setting $P_{\hat{\theta}_T}(x_t \leq -c) = 0.05$ is an approximation to the true VaR that sets $P_0(x_t \leq -c) = 0.05$. Below we give a useful definition of estimated VaR.

Definition 4 (Estimated Value-at-Risk - VaR) Let $\{p_t\}_{t \in \mathbb{Z}}$ be a random sequence of portfolio values, and $x_t = (p_t - p_{t-1})/p_{t-1}$ denote the return sequence (i.e. the percentage changes) on the portfolio. Given a model $P := \{P_\theta, \theta \in \Theta\}$, and a parameter estimate $\hat{\theta}_T$, the estimated $\alpha$-VaR in percentage loss at time $t \in \mathbb{Z}$ is defined as the percentage value $c$ that satisfies $P_{\hat{\theta}_T}(x_t \leq -c) = \alpha$. Furthermore, the estimated $\alpha$-VaR in monetary loss at time $t \in \mathbb{Z}$ is given by $c \cdot p_t$.
Conditional volatility models like the GARCH can easily provide us with an estimate of the time-varying VaR of a stock at any time, conditional on past information. Indeed, since stock returns (typically in percentage changes or log differences) are modeled to satisfy
$$x_t = \sigma_t \epsilon_t,$$
the distribution of $x_{t+1}$ conditional on $x^t$ is easily tractable for any $t$. For example, if we suppose again that the innovations are iid Gaussian $N(0,1)$, then $x_{T+1} \mid x^T \sim N(0, \sigma^2_{T+1})$. As a result, the VaR is obtained immediately through application of the Gaussian quantile function $Q$ (the inverse of the distribution function $F$),
$$Q_{\hat{\theta}_T}(\alpha) = \inf\{x \in \mathbb{R} : \alpha \leq F_{\hat{\theta}_T}(x)\}.$$
Obviously, the same reasoning applies if $\epsilon_t$ has other distributions! Once again, note that Monte Carlo simulations may be necessary for calculating a multiple-step-ahead VaR.
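As a minimal sketch, the snippet below computes a one-step-ahead 5% VaR for a Gaussian GARCH(1,1)-type volatility recursion; the recursion and all parameter values are hypothetical stand-ins for whatever estimated update $\sigma^2_{t+1} = \phi(\sigma^2_t, x_t, \hat{\theta}_T)$ one is working with (norminv belongs to the Statistics Toolbox).

omega = 0.05; a = 0.10; b = 0.85;                % hypothetical GARCH(1,1) parameter estimates
x_T = -1.2; sig2_T = 1.4;                        % last observed return and filtered variance
sig2_T1 = omega + a*x_T^2 + b*sig2_T;            % one-step-ahead conditional variance
VaR5 = -sqrt(sig2_T1)*norminv(0.05);             % 5% VaR in percentage loss: P(x_{T+1} <= -VaR5 | x^T) = 0.05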

8.2 Forecasting

Prediction of future values of a random variable (i.e. forecasting) is a central problem


in time-series analysis. There are various sources of forecasting uncertainty that affect
the potential accuracy of any given forecast. Four fundamental sources of uncertainty
in an estimated probability model are:
1. Measurement uncertainty;
2. Model uncertainty;
3. Parameter uncertainty;
4. Innovation uncertainty.
Measurement uncertainty arises typically from problems in data collection or from
inherent difficulties in observing the data of interest. In certain applications this
kind of uncertainty can be safely ignored. When this is not possible, then measurement uncertainty can be directly modeled (e.g. by adding observation noise), thus
effectively turning the problem of measurement uncertainty into that of model uncertainty (through the way the noise is modeled) and estimation uncertainty (through
the effect of the noise on parameter estimation).

Model uncertainty is virtually always ignored when forecasting and producing confidence bounds. This is usually done by imposing axioms of correct specification.
Typically, researchers ignore model uncertainty simply because model uncertainty is
too difficult to integrate in the design of forecast bounds. In certain cases however,
model uncertainty is less problematic. Sometimes it can even be safely disregarded!
This occurs when the statistical model is sufficiently general to contain the data
generating process; see e.g. Grenander (1992) and Chen (2007) for a review of the
sieve estimation of semi-nonparametric models with an infinite dimensional parameter space with unbounded complexity (called infinite entropy). In any case, even in simple parametric models, it is important to note that we do not have to assume correct specification in order to set aside the complicated issue of model uncertainty. Instead, we just have to recognize that our forecasts are conditional on the model at hand!
Parameter uncertainty is also rarely incorporated in producing forecasts and their respective confidence bounds. This occurs because parameter uncertainty is also difficult to incorporate in forecasting, at least analytically. There exist simulation-based methods that allow us to incorporate parameter uncertainty. These, however, lie outside the scope of this text.
At the end of the day, innovation uncertainty is typically the only ingredient that is taken into account when producing point forecasts and deriving confidence bounds.
In essence, the researcher recognizes that future innovations are unknown and goes
about producing forecasts with confidence bounds that are effectively conditional on
the model and the estimated parameters. Luckily, taking innovation uncertainty into
account is often enough to produce reasonable confidence bounds. The reason for this
is that the distribution of out-of-sample forecast errors is typically approximated by
the distribution of in-sample residuals. Hence, poor models and poor parameter estimates that lead to large residuals, also lead to large out-of-sample forecast bounds. In
some sense, these bounds already incorporate some model uncertainty and parameter
uncertainty.
With these considerations in mind, we can formulate the following useful definition
of point forecast.
Definition 5 (Point forecast) Given a model $P := \{P_\theta, \theta \in \Theta\}$, a sample of data $x^T := (x_1, ..., x_T)$, and a parameter estimate $\hat{\theta}_T$, the point forecast for $x_{T+h}$ is the conditional expectation $\hat{x}_{T+h} = E_{\hat{\theta}_T}(x_{T+h} \mid x^T)$.³

³ $E_{\hat{\theta}_T}$ denotes the conditional expectation taken w.r.t. the estimated measure $P_{\hat{\theta}_T}$.

When forecasting continuous random variables, it should be immediately clear that the probability of any point forecast $\hat{x}_{T+h}$ being correct is exactly equal to zero. In other words, $\hat{x}_{T+h}$ satisfies $\hat{x}_{T+h} \neq x_{T+h}$ with probability one. In some sense, point forecasts are meaningless if they are given without confidence bounds. Indeed, given a point forecast of $\hat{x}_{T+h} = 3.75$, would you expect the realization of $x_{T+h}$ to be close to 3.75? Maybe between 3.5 and 4? Or between 0 and 15? Between -10 and 1500? To answer any of these questions we need confidence bounds for our forecasts.
In general, in the world of probability models, forecasts are useful if they make probabilistic statements about $x_{T+h}$. Producing confidence bounds for forecasts is essential. As highlighted above, given the presence of model and parameter uncertainty, any probabilistic statements about $x_{T+h}$ can only be made conditional on the model and the estimated parameters.

Definition 6 (Forecasted distribution) Given a model $P := \{P_\theta, \theta \in \Theta\}$, a sample of data $x^T := (x_1, ..., x_T)$, and a parameter estimate $\hat{\theta}_T$, the forecasted distribution of $x_{T+h}$ is the conditional distribution $P_{\hat{\theta}_T}(x_{T+h} \mid x^T)$.
Linear Models
As you may recall from your introductory econometrics courses, in the world of linear dynamic models, producing point forecasts with confidence bounds is typically a simple exercise. Consider the following Gaussian linear AR(1) model,
$$x_t = \phi x_{t-1} + \epsilon_t, \qquad \{\epsilon_t\} \sim NID(0, \sigma^2), \qquad |\phi| < 1.$$
Before using this model to forecast, recall that a battery of tests can be employed to ensure that the model provides a reasonable description of the data. From unit-root tests for analyzing the stationarity of the residuals, to autocorrelation and heteroskedasticity tests designed to ensure that the residuals are approximately white noise, Jarque-Bera statistics that test the normality assumption, or information criteria and RESET tests that question the linear specification of the model, the testing possibilities are almost endless! As an econometrician, you should always let the data speak. If a simple model does the trick, then use it. If it does not, then keep searching!
In any case, regardless of the quality of the model, you can always ask questions conditional on the model; i.e. taking its assumptions to be true. Indeed, conditional on the AR(1) model above, the one-step-ahead forecast is naturally given by
$$\hat{x}_{T+1} = E(x_{T+1} \mid x_1, \ldots, x_T) = E(\phi x_T + \epsilon_{T+1} \mid x_1, \ldots, x_T) = \phi x_T$$
because $\epsilon_{T+1}$ is independent of past observations, and hence $E(\epsilon_{T+1} \mid x_1, \ldots, x_T) = E(\epsilon_{T+1}) = 0$. The forecast error is thus given by
$$e_{T+1} = x_{T+1} - \hat{x}_{T+1} = x_{T+1} - \phi x_T = \epsilon_{T+1},$$
with $E(e_{T+1}) = E(\epsilon_{T+1}) = 0$ and $\mathrm{Var}(e_{T+1}) = \mathrm{Var}(\epsilon_{T+1}) = \sigma^2$. Similarly, the two-step-ahead forecast is given by
$$\hat{x}_{T+2} = E(x_{T+2} \mid x_1, \ldots, x_T) = E(\phi x_{T+1} + \epsilon_{T+2} \mid x_1, \ldots, x_T) = \phi\, E(x_{T+1} \mid x_1, \ldots, x_T) + E(\epsilon_{T+2} \mid x_1, \ldots, x_T) = \phi \hat{x}_{T+1} = \phi^2 x_T.$$

The two-step-ahead forecast error is
$$e_{T+2} = x_{T+2} - \hat{x}_{T+2} = x_{T+2} - \phi^2 x_T = x_{T+2} - \phi x_{T+1} + \phi x_{T+1} - \phi^2 x_T = \epsilon_{T+2} + \phi\,\epsilon_{T+1},$$
with mean and variance given by $E(e_{T+2}) = 0$ and $\mathrm{Var}(e_{T+2}) = \sigma^2(1 + \phi^2)$. Not surprisingly, the h-step-ahead forecast is given by
$$\hat{x}_{T+h} = E(x_{T+h} \mid x_1, \ldots, x_T) = E(\phi^h x_T + \phi^{h-1}\epsilon_{T+1} + \ldots + \epsilon_{T+h} \mid x_1, \ldots, x_T) = \phi^h x_T$$
and the h-step forecast error is
$$e_{T+h} = x_{T+h} - \hat{x}_{T+h} = \phi^{h-1}\epsilon_{T+1} + \ldots + \epsilon_{T+h}$$
with $E(e_{T+h}) = 0$ and $\mathrm{Var}(e_{T+h}) = \sigma^2\big(1 + \phi^2 + \ldots + \phi^{2(h-1)}\big)$. Conditional on the model, if we let the forecasting horizon diverge to infinity, $h \to \infty$, we naturally obtain the unconditional variance of the AR(1) as the forecast error variance, $\mathrm{Var}(e_{T+h}) \to \sigma^2/(1 - \phi^2)$. This is natural, as in the long run we know that the process returns to its mean; hence, our best forecast must be $\lim_{h \to \infty} \hat{x}_{T+h} = 0$, and the forecast error variance becomes precisely the unconditional variance of $x_t$. Having an expression for the forecast error means that confidence bounds are also easy to obtain conditional on this model that postulates iid Gaussian innovations. Indeed, we just have to recognize that, since $\{\epsilon_t\} \sim NID(0, \sigma^2)$, the model predicts a distribution for $e_{T+h}$ given by
$$e_{T+h} \sim N\big(0, \sigma^2(1 + \phi^2 + \ldots + \phi^{2(h-1)})\big).$$
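A minimal sketch of these formulas, with hypothetical parameter values and 95% Gaussian bounds:

phi = 0.8; sig = 1; x_T = 2; H = 10;             % hypothetical AR(1) coefficient, s.d., last observation, horizon
h = (1:H)';
x_hat = phi.^h * x_T;                            % h-step-ahead point forecasts phi^h * x_T
v_e = sig^2 * cumsum(phi.^(2*(h-1)));            % forecast error variances sigma^2*(1+phi^2+...+phi^(2(h-1)))
upper = x_hat + 1.96*sqrt(v_e);                  % 95% upper bound
lower = x_hat - 1.96*sqrt(v_e);                  % 95% lower bound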
Nonlinear Models
The state of affairs is very different in the world of nonlinear dynamic models. In
general, it is not possible to produce analytically tractable h-step-ahead forecasts.
This occurs, not surprisingly, because it is difficult to derive the expectation of xT +h
conditional on the sample xT := (x1 , ..., xT ). Similarly, it is often too hard to derive
the distribution of forecast errors and produce confidence bounds for the forecasts. In
this section we review briefly the use of simulations for forecasting nonlinear dynamic
models.
Conditional on the model and estimated parameters, Monte Carlo simulations can be used to obtain an approximate distribution of $x_{T+h}$ given $x^T$. This is done, of course, by drawing $\epsilon_t$'s from their estimated distribution. Consider again the estimated nonlinear autoregressive model
$$x_{t+1} = \tanh(0.9\,x_t + \epsilon_{t+1}), \quad t \in \mathbb{Z}, \qquad \{\epsilon_t\} \sim NID(0, 0.01).$$
The snippet of MATLAB code below produces point forecasts of $x_{T+h}$ for $h = 1, ..., 10$, with 90% confidence bounds. The number of simulations is set to $N = 10000$, and the last observed sample value is $x_T = 0.2$.
h = 10;                              % set steps ahead
N = 10000;                           % set number of draws
x_T = 0.2;                           % set value of x_T
eps = 0.1*randn(h,N);                % generate h x N innovations
x(1,:) = x_T*ones(1,N);              % initialise at x_T
for t = 1:h
    x(t+1,:) = tanh(0.9*x(t,:) + eps(t,:));      % simulate future values recursively
end
x_hat(1) = x_T;
upper_bound(1) = x_T;
lower_bound(1) = x_T;
for t = 1:h
    x_hat(t+1) = mean(x(t+1,:));                 % point forecast: conditional mean
    upper_bound(t+1) = prctile(x(t+1,:),95);     % upper bound: 95th percentile
    lower_bound(t+1) = prctile(x(t+1,:),5);      % lower bound: 5th percentile
end
plot(x_hat,'k')                      % plot point forecast: black line
hold on
plot(upper_bound,'r')                % plot upper bound forecast: red line
plot(lower_bound,'r')                % plot lower bound: red line

Figure 3: Point forecast and 90% confidence bounds produced by the code snippet above.

In nonlinear time-varying parameter models, point forecasts and confidence bounds can be obtained in the same way for both the data and the time-varying parameter. In particular, given a model and a parameter estimate, it is possible to draw innovations from the estimated distribution and iterate the parameter update forward.
Consider, for example, the fat-tailed nonlinear local-level model,
$$x_t = \mu_t + \epsilon_t, \qquad \{\epsilon_t\} \sim NIT(\lambda), \qquad \mu_t = \phi(\mu_{t-1}, x_{t-1}, \theta).$$
Given a sample of data $x^T$ and parameter estimates, we can certainly draw multiple innovations $\{\tilde{\epsilon}^i_{T+j}\}_{i=1,j=1}^{N,h}$ from a Student's t distribution with estimated degrees of freedom $\hat{\lambda}_T$, and produce multiple simulated values $\{\tilde{\mu}^i_{T+j}\}_{i=1,j=1}^{N,h}$ conditional on $x^T$. With these simulated values we can naturally calculate Monte Carlo approximations of the conditional mean and confidence bounds for $\mu_{T+h}$ implied by the model under $\hat{\theta}_T$. Similarly, we also obtain the conditional mean and confidence bounds for $x_{T+h}$.
This reasoning applies naturally to other time-varying parameter models. Consider the fat-tailed nonlinear volatility model
$$x_t = \sigma_t \epsilon_t, \qquad \{\epsilon_t\} \sim NIT(\lambda), \qquad \sigma^2_t = \phi(\sigma^2_{t-1}, x_{t-1}, \theta).$$
Again, by drawing multiple innovations $\{\tilde{\epsilon}^i_{T+j}\}_{i=1,j=1}^{N,h}$ from a Student's t distribution with estimated degrees of freedom $\hat{\lambda}_T$, we can produce multiple simulated conditional volatility values $\{\tilde{\sigma}^i_{T+j}\}_{i=1,j=1}^{N,h}$. With these simulated values we can naturally calculate Monte Carlo approximations of the conditional mean and forecast confidence bounds.
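As a minimal sketch of this simulation-based forecasting of a time-varying parameter, the snippet below iterates a hypothetical GARCH-type variance update forward with Student's t innovations; the update, its parameters, and the degrees of freedom are illustrative stand-ins for the estimated quantities (trnd and prctile belong to the Statistics Toolbox).

omega = 0.05; a = 0.10; b = 0.85; lambda = 5;    % hypothetical volatility update parameters and d.o.f.
x_T = -1.2; sig2_T = 1.4;                        % last observed return and filtered variance
h = 10; N = 10000;                               % horizon and number of draws
sig2 = zeros(h,N); x = zeros(h,N);
sig2(1,:) = omega + a*x_T^2 + b*sig2_T;          % sigma^2_{T+1} is known given the sample x^T
for j = 1:h
    x(j,:) = sqrt(sig2(j,:)).*trnd(lambda,1,N);  % draw x_{T+j} with Student's t innovations
    if j < h
        sig2(j+1,:) = omega + a*x(j,:).^2 + b*sig2(j,:);   % iterate the variance update forward
    end
end
sig2_hat = mean(sig2,2);                         % point forecasts of sigma^2_{T+j}, j = 1,...,h
sig2_up = prctile(sig2',95);                     % 90% confidence bounds for the variance path
sig2_lo = prctile(sig2',5);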

8.3 Impulse Response Functions

Impulse response functions are instruments that allow us to study the dynamic behavior of time-series in response to a random unanticipated shock. In practice, they allow us to analyze different "what if" scenarios. In macroeconometrics, we could ask: how many months does it take for aggregate consumption to recover from a negative 3% shock? In financial econometrics, one may be interested in knowing how volatility reacts to a negative return of -10%. In economics, unanticipated shocks can come from a number of sources: from foreign demand shocks, to oil price shocks, natural catastrophes, exchange rate shocks, etc. In a policy analysis context, one may be interested in studying the effect of unannounced government expenditure shocks, tax changes, money supply shocks, interest rate shifts, etc. Of course, it is important to keep in mind that the parameters that describe the dynamics of the process may be affected by policy changes. Indeed, the famous Lucas critique applies not only to reduced-form statistical models, but also to the so-called structural models.⁴
Remark 1 (Lucas critique) Parameters estimated from historical data reflect, among other things, the policies of the past. As a result, these parameters are not appropriate to describe the dynamic properties of a time-series after a policy change. Different institutional policies may give rise to different dynamics, and hence, different parameters. This must be recognized when performing policy analysis.
As before, we proceed carefully by recognizing that whatever analysis we make is conditional on the adopted model and estimated parameters. Our focus on the conditionality on the model and the estimated parameters may seem unnecessarily repetitive
to you. This could not be further from the truth! Understanding and recognizing
the limitations of the tools at our disposal is crucial for a competent and professional
econometric analysis of the data.
The Impulse Response Function (IRF) is essentially the expected path of $\{x_t\}$ after a shock of a certain size $\epsilon$ at time $t = s$. Indeed, you may recall from your introductory econometrics courses the following definition of an IRF.
Definition 7 (Impulse Response Function) Given a model $P := \{P_\theta, \theta \in \Theta\}$ and a parameter estimate $\hat{\theta}_T$, the Impulse Response Function (IRF) with origin $\bar{x}$, generated by a shock (or impulse) $\epsilon$ at time $t = s$, is a sequence of points $\{\tilde{x}_t\}$ satisfying:
$$\tilde{x}_t = \bar{x} \;\; \forall \, t < s, \qquad \tilde{x}_t = \bar{x} + \epsilon \;\; \text{at } t = s, \qquad \tilde{x}_t = E_{\hat{\theta}_T}(x_t \mid \tilde{x}_s, \tilde{x}_{s-1}, ...) \;\; \forall \, t > s.$$
Regardless of the model being linear or nonlinear, the IRF describes the expected path of $\{x_t\}$ following a shock of magnitude $\epsilon$ at time $t = s$, starting from a fixed level $\bar{x}$. The figure below plots an IRF with origin $\bar{x}$ that coincides with the unconditional mean of the process.
In introductory econometrics courses you have derived the IRFs of linear dynamic models. As you may remember, these IRFs are often easy to derive by hand. Consider, for example, the linear AR(1) model,
$$x_t = \phi_1 x_{t-1} + \epsilon_t.$$
⁴ The latter class of models is typically affected because those models are called structural, but they are not truly structural.

Figure 4: Impulse Response Function. The path stays at the origin $\bar{x}$ before the impulse and jumps to $\bar{x} + \epsilon$ at the time of the impulse.


The IRF with origin $\bar{x}$ generated by an impulse of magnitude $\epsilon$ at time $t = s$ is given by
$$\tilde{x}_{s-2} = \bar{x}, \qquad \tilde{x}_{s-1} = \bar{x}, \qquad \tilde{x}_s = \bar{x} + \epsilon,$$
$$\tilde{x}_{s+1} = E(x_{s+1} \mid \tilde{x}_s, \tilde{x}_{s-1}, ...) = \phi_1 \tilde{x}_s + 0 = \phi_1(\bar{x} + \epsilon),$$
$$\tilde{x}_{s+2} = E(x_{s+2} \mid \tilde{x}_{s+1}, \tilde{x}_s, ...) = \phi_1 \tilde{x}_{s+1} + 0 = \phi_1^2(\bar{x} + \epsilon),$$
$$\tilde{x}_{s+3} = E(x_{s+3} \mid \tilde{x}_{s+2}, \tilde{x}_{s+1}, ...) = \phi_1 \tilde{x}_{s+2} + 0 = \phi_1^3(\bar{x} + \epsilon), \qquad \ldots$$
In the world of nonlinear dynamic models we typically require Monte Carlo simulations to produce impulse response functions. Simulations are needed because the conditional expectation is analytically intractable. Clearly, the simulation methods discussed in the previous section can be used to calculate approximate impulse response functions. Consider again the estimated nonlinear autoregressive model
$$x_{t+1} = \tanh(0.9\,x_t + \epsilon_{t+1}), \quad t \in \mathbb{Z}, \qquad \{\epsilon_t\} \sim NID(0, 0.01).$$
The snippet of MATLAB code below produces the IRF generated by a shock of size $\epsilon = -1$ at time $s = 3$, for $h = 7$ periods after the shock, together with 90% confidence bounds. The number of simulations is set to $N = 10000$, and the origin is set to $\bar{x} = 0.2$.
s = 3;                               % set shock time s
x0 = 0.2;                            % set origin value x_bar
e = -1;                              % set shock size epsilon
h = 7;                               % set steps ahead for IRF
N = 10000;                           % set number of draws
eps = 0.1*randn(s+h,N);              % generate innovation values
eps(1:s-1,:) = 0;                    % set innovations to zero for t<s
eps(s,:) = e;                        % set innovations to e at t=s
x(1:s-1,:) = x0*ones(s-1,N);         % set time-series to x_bar for t<s
x(s,:) = (x0 + e)*ones(1,N);         % set time-series to x_bar+e at t=s
for t = s+1:s+h
    x(t,:) = tanh(0.9*x(t-1,:) + eps(t,:));      % simulate values recursively
end
for t = 1:s+h
    x_tilde(t) = mean(x(t,:));                   % obtain IRF recursively
    upper_bound(t) = prctile(x(t,:),95);         % upper bound: 95th percentile
    lower_bound(t) = prctile(x(t,:),5);          % lower bound: 5th percentile
end
plot(upper_bound,'r')                % plot upper IRF bound: red line
hold on
plot(lower_bound,'r')                % plot lower IRF bound: red line
plot(x_tilde,'k')                    % plot IRF: black line

Figure 5: IRF with origin at $\bar{x} = 0.2$ and shock size $\epsilon = -1$ at time $t = s$.


IRFs can be obtained in the same manner for time-varying parameter driven
models. In particular, the parameter update equation can be used to produce IRFs
for both the time-varying parameter and the data. Confidence bounds like the ones
produced above are also available for those models.
With this we conclude the econometric analysis of nonlinear dynamic models. As
you may have noticed, calculating VaR, producing forecasts, or generating IRFs is
not at all complicated. Indeed, simulations make everything easy. In the end of
the day, complications arise only in the interpretation of the results. What does the
25

IRF describe if depends on the selected model and estimated parameter? What does
the IRF really mean if the model is misspecified? If two models produce different
IRFs, which one is best? In the future, I hope you use all your knowledge of about
parameter estimation, model specification, and model comparison, to give careful and
well founded answers to these questions.

It was a pleasure knowing you all!


Hope you enjoyed learning some Advanced Econometrics!

Good Luck!
