Advanced Econometrics I
Chapter 5
Francisco Blasques
These lecture notes contain the material covered in the master course
Advanced Econometrics I. Further study material can be found in the
lecture slides and the many references cited throughout the text.
Contents
5 Asymptotic Theory for M and Z Estimators
5.1 M and Z estimators: Definition and Examples
5.2 Existence and Measurability
5.3 Consistency
5.3.1 The general consistency theorem
5.3.2 Uniform convergence
5.3.3 Stochastic Equicontinuity
5.3.4 Identifiable uniqueness
5.3.5 Notes for Time-varying Parameter Models
5.4 Exercises
Econometrics and statistics are essentially devoted to the art of learning from the data. The fundamental question we have in mind is always the following: where did the data come from? Or, in other words: what are the properties of the data generating process (DGP)? As we have seen in Chapter 4, when we deal with parametric models, the distribution of the data is determined by the values of unknown parameters. As a result, we can re-state these questions as: what are the values of the unknown parameters? In this chapter we will focus precisely on the estimation of parameters.
The effort to learn about unknown properties of the DGP dates back at least to the great Italian mathematician Gerolamo Cardano, who stated the first law of large numbers, without proof, in the 1500s. Essentially, he noted that the true probability of success in any given gamble could be approximated by calculating the average success over an increasingly larger number of trials. Stating this law of large numbers was, in some sense, the first step towards understanding that we can learn about true unknown quantities through repeated observation of the same phenomenon!¹
The method of least squares, published by Legendre in 1806 but first discovered by Gauss in 1795, constituted another important step in the art of learning from the data. The least squares method was soon applied to problems in physics, astronomy, engineering and economics, and brought immense fame and glory to both Gauss and Legendre. Gauss derived not only the formula for the OLS estimator itself, but also wrote down the exact conditions under which the estimator is unbiased and normally distributed! Unfortunately, few realized the importance of these theoretical results that characterized the properties of the OLS estimator. It was only when Andrey Markov re-published Gauss's work in 1901 that the results became famous. Today these results
¹ As mentioned in Chapter 2, the first proof of the law of large numbers came only through the hand of Jacob Bernoulli's (1713) Ars Conjectandi. A simpler proof was found by Pafnuty Chebyshev in 1874, using an unproved inequality that his student Markov finally proved in 1884.
are collectively known as the Gauss-Markov Theorem. Surely you have heard about it in your introductory econometrics courses!
The 18th and 19th centuries also witnessed the first developments in the method of maximum likelihood. These developments came by the hand of the great mathematicians Lagrange, Bernoulli, Laplace and Gauss. Unlike the method of least squares, however, the method of maximum likelihood was not immediately popular. Indeed, it was only with the work of Fisher in the early 20th century that the method of maximum likelihood became the most popular estimator of all.
Throughout the 20th century, a much more general theory of estimation was developed that includes the least-squares estimator, the maximum-likelihood estimator, the method-of-moments estimator, and many other estimators as special cases: the so-called extremum estimation theory. This theory was developed, among others, by Doob (1934), Cramér (1946), Wald (1949), Le Cam (1949), Jennrich (1969) and Malinvaud (1970).
In this chapter we will define and analyze the properties of extremum estimators.
With this general theory, we will be able to establish the properties of many estimators
in very general settings!
5.1 M and Z estimators: Definition and Examples
When the random sample is realized and we observe a vector of points $x_T(e) \in \mathbb{R}^T$, for some event $e \in E$, then $Q_T(x_T(e), \cdot)$ is just a real-valued function,
$$Q_T(x_T(e), \cdot) : \Theta \to \mathbb{R},$$
that we can attempt to maximize! For every realization $e \in E$, we get a new function $Q_T(x_T(e), \cdot) : \Theta \to \mathbb{R}$ to maximize, and we obtain a new maximizer that we call a parameter estimate!
Definition 1 (M-estimator) An extremum estimator is called an M-estimator when the criterion function takes the form of a sum,
$$Q_T(x_T, \theta) = \frac{1}{T} \sum_{t=2}^{T} q(x_t, x_{t-1}, \theta) \quad \forall\, T \in \mathbb{N}.$$
Examples of M-estimators include the famous maximum likelihood (ML) estimator, the least squares (LS) estimator, and the generalized method of moments (GMM)
estimator.
Example: (Maximum likelihood estimator) The criterion function of the ML estimator is the log likelihood function $L_T$. The ML estimator is thus an extremum estimator where $Q_T(x_T, \theta) = L_T(x_T, \theta)$,
$$\hat{\theta}_T \in \arg\max_{\theta \in \Theta} L_T(x_T, \theta), \quad \text{i.e.} \quad \hat{\theta}_T \in \arg\max_{\theta \in \Theta} \frac{1}{T} \sum_{t=2}^{T} \ell(x_t | x_{t-1}, \theta).$$
Note that dividing the log likelihood by $T$ is perfectly legitimate since the arg max is still the same.
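The ML estimator above can be sketched numerically. The following is a minimal illustration, under assumptions that are not from the notes: a Gaussian AR(1) data generating process with true coefficient 0.5, unit innovation variance, and a grid search over a compact parameter space.

```python
# Illustrative sketch: the ML estimator as an extremum estimator.
# Assumed setup (not from the notes): Gaussian AR(1) DGP with phi0 = 0.5
# and a grid search over the compact space [-0.99, 0.99].
import math
import random

random.seed(1)
phi0, T = 0.5, 2000
x = [0.0]
for _ in range(T - 1):
    x.append(phi0 * x[-1] + random.gauss(0.0, 1.0))

def avg_loglik(phi):
    # (1/T) * sum of log N(phi * x_{t-1}, 1) densities of x_t given x_{t-1}
    s = sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x[t] - phi * x[t - 1]) ** 2
            for t in range(1, T))
    return s / T

grid = [i / 100 for i in range(-99, 100)]   # compact parameter space
phi_hat = max(grid, key=avg_loglik)         # the arg max is the ML estimate
print(round(phi_hat, 2))
```

With a reasonably large sample the grid maximizer lands close to the true coefficient, and the divide-by-$T$ normalization is exactly the one discussed above: it leaves the arg max unchanged.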
Example: (Least-squares estimator) The criterion function of the LS estimator is the sum of squared residuals function. The LS estimator of the parameters of an NLAR model
$$x_t = \phi(x_{t-1}, \theta) + \epsilon_t$$
takes the form of an M-estimator where
$$Q_T(x_T, \theta) = -\frac{1}{T} \sum_{t=2}^{T} \big(x_t - \phi(x_{t-1}, \theta)\big)^2.$$
We write the criterion as the negative sum of squared residuals (note the minus sign) since maximizing the negative sum of squared residuals is the same as minimizing the sum of squared residuals. We can also divide the criterion by $T$ because it does not change the arg max set. Hence, for $u_t(\theta) := x_t - \phi(x_{t-1}, \theta)$ we obtain
$$\hat{\theta}_T \in \arg\min_{\theta \in \Theta} \frac{1}{T} \sum_{t=2}^{T} u_t(\theta)^2 = \arg\max_{\theta \in \Theta} -\frac{1}{T} \sum_{t=2}^{T} u_t(\theta)^2.$$
Example: (Generalized method of moments estimator) The criterion function of the GMM estimator is the weighted quadratic difference of moments,
$$\hat{\theta}_T \in \arg\min_{\theta \in \Theta} \Big( \frac{1}{T} \sum_{t=2}^{T} g(x_t, x_{t-1}, \theta) \Big)' W \Big( \frac{1}{T} \sum_{t=2}^{T} g(x_t, x_{t-1}, \theta) \Big),$$
where $g(x_t, x_{t-1}, \theta)$ is the random vector that should satisfy the theoretical moment condition $\mathrm{E}\, g(x_t, x_{t-1}, \theta_0) = 0$. The GMM estimator can naturally be written in M-estimator form.
Extremum estimators with criterion functions $Q_T : \mathbb{R}^T \times \Theta \to \mathbb{R}$ that are differentiable and strictly concave on the parameter space $\Theta$ can also be written as estimators that set to zero the criterion derivative,
$$\nabla Q_T(x_T, \theta) = \frac{\partial Q_T(x_T, \theta)}{\partial \theta}.$$
The strict concavity of the criterion function $Q_T$ over the parameter space $\Theta$ is used to ensure that the point where $\nabla Q_T(x_T, \theta) = 0$ is really the global maximum of the function $Q_T$. Otherwise, the point of zero derivative could correspond to a local maximum, or minimum, or some other stationary point.
In any case, when the criterion function $Q_T$ is not strictly concave, then we can still write the extremum estimator as a random element $\hat{\theta}_T$ that sets the derivative $\nabla Q_T(x_T, \theta)$ to zero as long as we apply a measurable selection technique that selects some zeros and not others. We will not discuss here what constitutes a measurable selection. It suffices to keep in mind that we could devise a rule to select the right zeros, i.e. those that correspond to global maxima. For example, if $Q_T : \mathbb{R}^T \times \Theta \to \mathbb{R}$ is twice differentiable on $\Theta$, then we can use the second derivative to figure out which zeros of $\nabla Q_T(x_T, \theta)$ correspond to maximum points.
We can thus always define $\hat{\theta}_T$ as
$$\hat{\theta}_T \in \big\{ \theta \in \Theta : \nabla Q_T(x_T, \theta) = 0 \big\} \quad (1)$$
or alternatively as
$$\nabla Q_T(x_T, \hat{\theta}_T) = 0. \quad (2)$$
Clearly, the argument above also applies to the case of M-estimators that have differentiable criterion functions. Re-writing an M-estimator in this way gives rise to the so-called Z-estimator.
Definition 2 (Z-estimator) An estimator $\hat{\theta}_T : E \to \Theta$ is called a Z-estimator if it satisfies
$$\nabla Q_T(x_T, \hat{\theta}_T) = \frac{1}{T} \sum_{t=2}^{T} \nabla q(x_t, x_{t-1}, \hat{\theta}_T) = 0 \quad \forall\, T \in \mathbb{N}.$$
Example: (Maximum likelihood estimator) ML estimators can be immediately written as Z-estimators that set the so-called score function (the derivative of the log likelihood) to zero,
$$\nabla L_T(x_T, \hat{\theta}_T) = \frac{1}{T} \sum_{t=2}^{T} \nabla \ell(x_t | x_{t-1}, \hat{\theta}_T) = 0.$$
Example: (Least squares estimator) The least-squares estimator for the NLAR model considered above can be re-written as a Z-estimator that sets the derivative of the least-squares criterion to zero,
$$\frac{1}{T} \sum_{t=2}^{T} 2\big(x_t - \phi(x_{t-1}, \hat{\theta}_T)\big)\, \nabla\phi(x_{t-1}, \hat{\theta}_T) = 0.$$
Example: (Method-of-moments estimator) An estimator that is typically written in Z-estimator form is the method of moments (MM) estimator. The MM estimator is a special case of the GMM estimator obtained when the number of moment conditions is the same as the number of parameters to be estimated. The MM estimator takes the form
$$\frac{1}{T} \sum_{t=2}^{T} g(x_t, x_{t-1}, \hat{\theta}_T) = 0,$$
where $g(x_t, x_{t-1}, \theta)$ is the random vector that should satisfy the theoretical moment condition $\mathrm{E}\, g(x_t, x_{t-1}, \theta_0) = 0$.
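A Z-estimator can be sketched the same way: instead of maximizing a criterion, we locate a zero of its derivative. The sketch below uses the same illustrative Gaussian AR(1) assumptions as before (not from the notes) and finds the root of the averaged score by bisection.

```python
# Illustrative sketch: the AR(1) ML estimator written as a Z-estimator,
# i.e. as the root of the averaged score
#   (1/T) * sum_t (x_t - phi * x_{t-1}) * x_{t-1} = 0.
# Assumed setup: Gaussian AR(1) DGP with phi0 = 0.5, bisection on [-0.99, 0.99].
import random

random.seed(1)
phi0, T = 0.5, 5000
x = [0.0]
for _ in range(T - 1):
    x.append(phi0 * x[-1] + random.gauss(0.0, 1.0))

def score(phi):
    return sum((x[t] - phi * x[t - 1]) * x[t - 1] for t in range(1, T)) / T

lo, hi = -0.99, 0.99         # the score is linear and decreasing in phi
for _ in range(60):          # bisection shrinks the bracket around the zero
    mid = 0.5 * (lo + hi)
    if score(mid) > 0:
        lo = mid
    else:
        hi = mid
phi_hat = 0.5 * (lo + hi)
print(round(phi_hat, 2))
```

Because the score here is linear in $\phi$, the bisection root coincides with the closed-form solution $\sum_t x_t x_{t-1} / \sum_t x_{t-1}^2$, illustrating that the M- and Z-estimator formulations pick out the same point.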
Just as we can re-define an M-estimator as a Z-estimator by taking the derivative of the criterion function, we can also re-define a Z-estimator as an M-estimator by integrating the criterion function. As a result, we can focus on analyzing only the properties of M-estimators; the results can then be easily extended to Z-estimators. Below we will focus exclusively on the properties of M-estimators.
5.2 Existence and Measurability
In your introductory econometrics courses you learned that estimators $\hat{\theta}_T$ are random variables that take values in the parameter space $\Theta$. In particular, they take a different value in $\Theta$ for every new sample of observed data $x_T$. Since estimators are random variables, we can study their stochastic properties, like the bias, variance, convergence in probability (consistency), convergence in distribution (asymptotic normality), etc.
In introductory econometrics, we dealt with estimators that were analytically tractable. As a result, it was easy to show that the estimators considered there were indeed random variables. For example, the OLS estimator in the linear regression considered in Chapter 2 takes the form
$$\hat{\theta}_T = \frac{\sum_{t=1}^{T} y_t x_t}{\sum_{t=1}^{T} x_t^2}.$$
Since $\hat{\theta}_T$ is a continuous function of the random variables $y_1, \ldots, y_T$ and $x_1, \ldots, x_T$, it follows immediately that $\hat{\theta}_T$ is a random variable under the Borel $\sigma$-algebra. However, we must now answer the following question: how can we be sure that the estimator is a random variable when it is analytically intractable?
Theorem 1 below gives us an answer. This theorem is important because, if we are going to talk about the convergence and distributional properties of estimators (like consistency and asymptotic normality), then we must first show that the estimator exists and that it is a random variable.
The problem of existence is related to the fact that some functions do not have a maximum. If the criterion function $Q_T$ cannot be maximized, then our extremum estimator
$$\hat{\theta}_T \in \arg\max_{\theta \in \Theta} Q_T(x_T, \theta)$$
does not exist, because the arg max set is empty (i.e. there is no $\theta$ that maximizes $Q_T$). For example, what is the value of $\theta$ that maximizes the function $\exp(\theta)$? There is none! The larger the value of $\theta$, the larger the value of the function! What is the value of $\theta > 0$ that maximizes the function $1/\theta$? There is none! The closer $\theta$ is to zero, the larger the value of the function is.³
Luckily, the Bolzano-Weierstrass theorem tells us sufficient conditions for a function to have a maximum. In particular, it tells us that every continuous function has a maximum on a compact set.⁴
Theorem 1 (Bolzano-Weierstrass) Let $f : \mathbb{R}^n \to \mathbb{R}$ be a continuous function on a compact set $R \subseteq \mathbb{R}^n$. Then $f$ has a maximum in $R$; i.e., there exists a point $r \in R$ such that $f(r) \geq f(r')\ \forall\, r' \in R$.
³ Note that $\theta = 0$ is not an acceptable answer because the function $1/\theta$ is not defined at $\theta = 0$. In particular, note that the limit from the right $\lim_{\theta \downarrow 0} 1/\theta = \infty$ is well defined, but $1/0$ is not. It is simply not true that $1/0 = \infty$.
⁴ Recall that a subset of $\mathbb{R}^n$ is said to be compact if it is bounded and closed.
Consider again the example $\exp(\theta)$. Since the exponential function is continuous, we know that it has a maximum on any compact subset $[a, b]$ of $\mathbb{R}$. This is quite obvious! We only have a problem if we try to maximize $\exp(\theta)$ over a set like $[a, b)$, which is not compact (because it is not closed at $b$), or a set like the entire real line $\mathbb{R}$ (which is not compact because it is unbounded). Consider the example $1/\theta$ again. This function is discontinuous at $\theta = 0$. But as long as we consider a compact subset of $\mathbb{R}$ that does not contain $0$, the function has a maximum. For example, $1/\theta$ has a maximum at $\theta = 1$ if we restrict attention to the compact interval $[1, 2]$.
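The two failure modes discussed above are easy to see numerically. A small sketch, where the grid sizes are illustrative assumptions: on the compact set $[0, 2]$ a grid search finds the maximum of $\exp(\theta)$ at the endpoint, while over ever larger subsets of $\mathbb{R}$ the grid maximum simply keeps growing, reflecting the fact that no maximum exists.

```python
# Illustrative sketch: exp attains its maximum on a compact set, but has
# no maximum over the whole real line.
import math

grid = [2 * i / 10000 for i in range(10001)]   # fine grid on the compact set [0, 2]
theta_star = max(grid, key=math.exp)
print(theta_star)                              # the endpoint 2.0

# Widening the domain: the grid maximum keeps growing, so no maximum exists on R.
unbounded_maxima = [max(math.exp(t) for t in range(-b, b + 1)) for b in (10, 20, 30)]
print(unbounded_maxima[0] < unbounded_maxima[1] < unbounded_maxima[2])   # True
```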
Theorem 2 (Existence and Measurability) Let $\Theta$ be a compact subset of $\mathbb{R}^n$, for some $n \in \mathbb{N}$, and $Q_T : \mathbb{R}^T \times \Theta \to \mathbb{R}$ be such that:
(i) $Q_T(x_T, \cdot) : \Theta \to \mathbb{R}$ is continuous on $\Theta$ for every $x_T \in \mathbb{R}^T$.
(ii) $Q_T(\cdot, \theta) : \mathbb{R}^T \to \mathbb{R}$ is continuous on $\mathbb{R}^T$ for every $\theta \in \Theta$.
Then, there exists a measurable map $\hat{\theta}_T : E \to \Theta$ satisfying
$$\hat{\theta}_T \in \arg\max_{\theta \in \Theta} Q_T(x_T, \theta).$$
The theorem above ensures two important things. First, that the estimator exists. Second, that the estimator is a random variable (i.e. that it is a measurable map). As you can see, condition (i) in Theorem 2 and the assumption that $\Theta$ is compact are essentially designed to ensure that the criterion function $Q_T$ has a maximizer $\hat{\theta}_T$ over the parameter space $\Theta$. Condition (ii) assumes that $Q_T$ is continuous in the random variables $x_1, \ldots, x_T$ and is used to ensure that the criterion function $Q_T$ is measurable (and hence random) with respect to the Borel $\sigma$-algebra.
In the case of an M-estimator, the criterion function takes the form
$$Q_T(x_T, \theta) = \frac{1}{T} \sum_{t=2}^{T} q(x_t, x_{t-1}, \theta)$$
and hence we can re-write the conditions of Theorem 2 in terms of conditions on the function $q$.
Theorem 3 (Existence and Measurability) Let $\Theta$ be a compact subset of $\mathbb{R}^n$, $n \in \mathbb{N}$, and $Q_T : \mathbb{R}^T \times \Theta \to \mathbb{R}$ be such that:
(i) $Q_T(x_T, \theta) = \frac{1}{T} \sum_{t=2}^{T} q(x_t, x_{t-1}, \theta)$
(ii) $q(x_t, x_{t-1}, \cdot) : \Theta \to \mathbb{R}$ is continuous on $\Theta$ for every $(x_t, x_{t-1}) \in \mathbb{R}^2$.
(iii) $q(\cdot, \cdot, \theta) : \mathbb{R}^2 \to \mathbb{R}$ is continuous on $\mathbb{R}^2$ for every $\theta \in \Theta$.
Then, there exists a measurable map $\hat{\theta}_T : E \to \Theta$ satisfying
$$\hat{\theta}_T \in \arg\max_{\theta \in \Theta} \sum_{t=2}^{T} q(x_t, x_{t-1}, \theta).$$
Example: (Maximum likelihood estimator of NLAR model) The criterion function of the ML estimator $Q_T(x_T, \theta) = L_T(x_T, \theta)$ satisfies the conditions of Theorem 2 if the joint density is continuous and $\Theta$ is compact. In the case of the NLAR model
$$x_t = \phi(x_{t-1}, \theta) + \epsilon_t \quad \forall\, t \in \mathbb{Z},$$
where $\phi$ is some nonlinear function and $\{\epsilon_t\}$ are iid innovations with probability density function $f$, we have
$$x_t | x_{t-1} \sim f\big(x_t - \phi(x_{t-1}, \theta),\, \theta\big) \quad \text{and hence} \quad L_T(x_T, \theta) = \frac{1}{T} \sum_{t=2}^{T} \log f\big(x_t - \phi(x_{t-1}, \theta),\, \theta\big).$$
As a result, the criterion function satisfies the conditions of Theorem 3 if both $f$ and $\phi$ are continuous functions (in both arguments), and $\Theta$ is compact.
Example: (Maximum likelihood estimator of Gaussian RCAR model) In the case of the Gaussian RCAR(1) model
$$x_t = \alpha_{t-1} x_{t-1} + \epsilon_t \quad \forall\, t \in \mathbb{Z}, \quad \text{with} \quad \alpha_{t-1} \sim N(\alpha, \sigma_\alpha^2) \quad \text{and} \quad \epsilon_t \sim N(0, \sigma_\epsilon^2) \ \forall\, t \in \mathbb{Z},$$
we have $x_t | x_{t-1} \sim N\big(\alpha x_{t-1},\ \sigma_\alpha^2 x_{t-1}^2 + \sigma_\epsilon^2\big)$, and hence
$$L_T(x_T, \theta) = \frac{1}{T} \sum_{t=2}^{T} \Big( -\frac{1}{2} \log 2\pi\big(\sigma_\alpha^2 x_{t-1}^2 + \sigma_\epsilon^2\big) - \frac{(x_t - \alpha x_{t-1})^2}{2\big(\sigma_\alpha^2 x_{t-1}^2 + \sigma_\epsilon^2\big)} \Big)$$
with $\theta = (\alpha, \sigma_\alpha^2, \sigma_\epsilon^2)$.
5.3 Consistency
Consistency is probably the most fundamental property that any estimator can satisfy. Essentially, an estimator $\hat{\theta}_T$ is said to be consistent for a parameter $\theta_0$ if, as the sample size $T$ grows to infinity, $\hat{\theta}_T$ takes values ever closer to $\theta_0$.
  "Letting sample size increase without bound should not be ridiculed as merely a fanciful exercise. Rather asymptotics uncover the most fundamental properties of a procedure and give us a very powerful and general evaluation tool."
  Casella and Berger (2001)
An estimator that is not consistent is, in some sense, a deeply flawed estimator! Indeed, an estimator that is not consistent is an estimator that will not approximate the parameter of interest, even when an infinite amount of information is available!
Below we define what we mean by a consistent estimator and a strongly consistent estimator. This is a good time to take a look at Appendix A and remind yourself of the concepts of convergence in probability (denoted by the symbol $\overset{p}{\to}$) and almost sure convergence (denoted by the symbol $\overset{a.s.}{\to}$).
Definition 3 (Consistent estimator) An estimator $\hat{\theta}_T$ is said to be consistent for a parameter $\theta_0$ if and only if $\hat{\theta}_T \overset{p}{\to} \theta_0$ as $T \to \infty$.
Definition 4 (Strongly consistent estimator) An estimator $\hat{\theta}_T$ is said to be strongly consistent for a parameter $\theta_0$ if and only if $\hat{\theta}_T \overset{a.s.}{\to} \theta_0$ as $T \to \infty$.
⁵ $I(x_{t-1} \geq 0)$ denotes the indicator function that takes value 1 if $x_{t-1} \geq 0$ and the value 0 otherwise.
In both definitions of consistency, the crucial thing to keep in mind is that, if the estimator $\hat{\theta}_T$ is consistent for $\theta_0$, then it takes values increasingly closer to $\theta_0$ as the sample size $T$ increases. In other words, the distribution of the estimator $\hat{\theta}_T$ is increasingly concentrated around the parameter $\theta_0$.
In fact, you may recall from your introductory econometrics course that the following conditions⁶
$$\lim_{T \to \infty} \mathrm{Var}(\hat{\theta}_T) = 0 \quad \text{and} \quad \lim_{T \to \infty} \mathrm{Bias}(\hat{\theta}_T) = 0$$
are sufficient for the consistency of the estimator $\hat{\theta}_T$ to $\theta_0$.
5.3.1 The general consistency theorem
The following two theorems establish the conditions under which an extremum estimator $\hat{\theta}_T$ is consistent for a parameter $\theta_0$. Both theorems rely on the identifiable uniqueness of the parameter $\theta_0$ and the uniform convergence of the criterion function $Q_T$ to some limit deterministic function $Q_\infty$.
The concept of uniform convergence is stronger than the concept of pointwise convergence. We say that the criterion function $Q_T$ converges pointwise over $\Theta$ to a limit function $Q_\infty$ if it holds true that $Q_T(x_T, \theta) \overset{p}{\to} Q_\infty(\theta)$ for every $\theta \in \Theta$. Pointwise convergence is thus about convergence at every possible parameter value $\theta \in \Theta$. As it turns out, this is very different from the uniform convergence required in the definition below. In Section 5.3.2 we will learn how to ensure that a function converges uniformly.
Definition 5 (Uniform convergence) The criterion function $Q_T(x_T, \cdot) : \Theta \to \mathbb{R}$ is said to converge uniformly in probability to a limit deterministic function $Q_\infty : \Theta \to \mathbb{R}$ if and only if
$$\sup_{\theta \in \Theta} \big| Q_T(x_T, \theta) - Q_\infty(\theta) \big| \overset{p}{\to} 0 \quad \text{as} \quad T \to \infty.$$
When the criterion function converges uniformly almost surely, then we refer to it as the strong uniform convergence of the criterion.
Definition 6 (Strong uniform convergence) The criterion $Q_T(x_T, \cdot) : \Theta \to \mathbb{R}$ is said to converge uniformly almost surely to a limit deterministic function $Q_\infty : \Theta \to \mathbb{R}$ if and only if
$$\sup_{\theta \in \Theta} \big| Q_T(x_T, \theta) - Q_\infty(\theta) \big| \overset{a.s.}{\to} 0 \quad \text{as} \quad T \to \infty.$$
The concept of identifiable uniqueness of $\theta_0$ is one that states not only that $\theta_0$ is the unique maximizer of a function $Q_\infty$, i.e. that $Q_\infty(\theta_0) > Q_\infty(\theta)\ \forall\, \theta \neq \theta_0$, but also that this maximizer is well separated. Let $S(\theta_0, \delta)$ denote the set of points contained in a ball of radius $\delta$ centered at $\theta_0$, and $S^c(\theta_0, \delta)$ denote its complement in $\Theta$,
$$S(\theta_0, \delta) := \big\{\theta \in \Theta : \|\theta - \theta_0\| < \delta\big\} \quad \text{and} \quad S^c(\theta_0, \delta) := \big\{\theta \in \Theta : \theta \notin S(\theta_0, \delta)\big\}.$$
In the equation above, $\|\cdot\|$ denotes some distance or norm defined on the parameter space $\Theta \subseteq \mathbb{R}^n$. If you are not familiar with distances and norms, then this might be a good time to read Appendix B. In any case, the most important thing to keep in mind is that $\|\theta - \theta_0\|$ measures the distance between $\theta$ and $\theta_0$.
⁶ Recall that the bias of an estimator $\hat{\theta}_T$ is defined as $\mathrm{Bias}(\hat{\theta}_T) := \mathrm{E}\,\hat{\theta}_T - \theta_0$.
Definition 7 (Identifiable uniqueness) A parameter $\theta_0 \in \Theta$ is said to be an identifiably unique maximizer of the limit criterion function $Q_\infty : \Theta \to \mathbb{R}$ if
$$\sup_{\theta \in S^c(\theta_0, \delta)} Q_\infty(\theta) < Q_\infty(\theta_0) \quad \text{for every} \ \delta > 0.$$
The following theorem takes the conditions of Theorem 2, the identifiable uniqueness of $\theta_0$ and the uniform convergence of the criterion function $Q_T$ to obtain the consistency of the extremum estimator $\hat{\theta}_T$. This remarkable theorem is the result of decades of work. It is a masterpiece of 20th century statistics and econometrics!
Theorem 4 (Consistency) Let $\hat{\theta}_T$ be an estimator satisfying the conditions of Theorem 2 and suppose that:
(i) The criterion function $Q_T$ converges in probability uniformly over $\Theta$ to the limit deterministic function $Q_\infty$ as $T \to \infty$,
$$\sup_{\theta \in \Theta} \big| Q_T(x_T, \theta) - Q_\infty(\theta) \big| \overset{p}{\to} 0 \quad \text{as} \quad T \to \infty.$$
(ii) The parameter $\theta_0$ is the identifiably unique maximizer of the limit criterion function $Q_\infty$,
$$\sup_{\theta \in S^c(\theta_0, \delta)} Q_\infty(\theta) < Q_\infty(\theta_0) \quad \text{for every} \ \delta > 0.$$
Then the estimator $\hat{\theta}_T$ is consistent for $\theta_0$ since $\hat{\theta}_T \overset{p}{\to} \theta_0$ as $T \to \infty$.
The strong consistency is obtained when the criterion function converges almost surely to its limit.
Theorem 5 (Strong consistency) Let $\hat{\theta}_T$ be an estimator satisfying the conditions of Theorem 2 and suppose that:
(i) The criterion function $Q_T$ converges uniformly almost surely over $\Theta$ to the limit deterministic function $Q_\infty$ as $T \to \infty$,
$$\sup_{\theta \in \Theta} \big| Q_T(x_T, \theta) - Q_\infty(\theta) \big| \overset{a.s.}{\to} 0 \quad \text{as} \quad T \to \infty.$$
(ii) The parameter $\theta_0$ is the identifiably unique maximizer of the limit criterion function $Q_\infty$,
$$\sup_{\theta \in S^c(\theta_0, \delta)} Q_\infty(\theta) < Q_\infty(\theta_0) \quad \text{for every} \ \delta > 0.$$
Then the estimator $\hat{\theta}_T$ is strongly consistent for $\theta_0$ since $\hat{\theta}_T \overset{a.s.}{\to} \theta_0$ as $T \to \infty$.
5.3.2 Uniform convergence
As explained above, the concept of uniform convergence is stronger than the concept of pointwise convergence, where $Q_T(x_T, \theta) \overset{p}{\to} Q_\infty(\theta)$ for every $\theta \in \Theta$. In particular, it is easy to show that uniform convergence implies pointwise convergence, but pointwise convergence does not imply uniform convergence.
Example: Consider a sequence of functions $\{G_T\}_{T \in \mathbb{N}}$ defined on the interval $[0, 1]$, where each $G_T$ is given by $G_T(x) = x^T$. This sequence converges pointwise (but not uniformly) to the limit function $G_\infty$ given by
$$G_\infty(x) = \begin{cases} 0 & \text{for } x \in [0, 1) \\ 1 & \text{for } x = 1 \end{cases}$$
In particular, for every $x \in [0, 1]$, $\{G_T\}$ converges to $G_\infty$,
$$|G_T(x) - G_\infty(x)| \to 0 \quad \forall\, x \in [0, 1] \quad \text{(pointwise convergence)},$$
but $\{G_T\}$ does not converge uniformly on $[0, 1]$ to $G_\infty$,
$$\sup_{x \in [0, 1]} |G_T(x) - G_\infty(x)| \nrightarrow 0 \quad \text{(no uniform convergence)}.$$
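The example above can be checked numerically. A minimal sketch, where the grid resolution is an illustrative assumption: at any fixed interior point the gap $|G_T(x) - G_\infty(x)|$ vanishes, but the supremum over the grid does not, so the convergence is pointwise and not uniform.

```python
# Illustrative sketch: G_T(x) = x**T converges pointwise but not
# uniformly on [0, 1].
def G(T, x):
    return x ** T

def G_inf(x):
    return 1.0 if x == 1.0 else 0.0

xs = [i / 1000 for i in range(1001)]          # grid on [0, 1]
for T in (10, 100, 1000):
    sup_gap = max(abs(G(T, x) - G_inf(x)) for x in xs)
    mid_gap = abs(G(T, 0.5) - G_inf(0.5))     # gap at the fixed point x = 0.5
    print(T, round(sup_gap, 3), mid_gap)
```

The supremum is dominated by grid points just below 1, where $x^T$ stays bounded away from the limit value 0 no matter how large $T$ is.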
The following theorem explains that we can obtain uniform convergence from pointwise convergence as long as the sequence of functions is stochastically equicontinuous. Stochastic equicontinuity is obtained when the sequence is composed of differentiable functions with derivative that is bounded in expectation.
Theorem 6 (Stochastic equicontinuity and uniform convergence) Let $(E, \mathcal{F}, P)$ be a probability space and $\{G_T\}_{T \in \mathbb{N}}$ be a sequence of random functions $G_T : E \times \Theta \to \mathbb{R}$ that are differentiable on the convex compact set $\Theta$. Suppose that:
(i) The sequence converges pointwise in probability to a limit function $G_\infty$,
$$G_T(\theta) \overset{p}{\to} G_\infty(\theta) \quad \text{as} \quad T \to \infty \quad \text{for every} \ \theta \in \Theta.$$
(ii) The sequence is stochastically equicontinuous; in particular, it suffices that the derivative of $G_T$ is bounded in expectation, uniformly over $\Theta$.
Then $\sup_{\theta \in \Theta} |G_T(\theta) - G_\infty(\theta)| \overset{p}{\to} 0$ as $T \to \infty$.
Theorem 7 (Stochastic equicontinuity and strong uniform convergence) is the almost sure counterpart: if the pointwise convergence in (i) holds almost surely, then, under the same conditions, $\sup_{\theta \in \Theta} |G_T(\theta) - G_\infty(\theta)| \overset{a.s.}{\to} 0$ as $T \to \infty$.
In the case of an M-estimator, it is quite easy to verify that the criterion function converges uniformly. In particular, we apply laws of large numbers to obtain the pointwise convergence and the differentiability of the criterion to obtain the uniform convergence.
Theorem 8 (Uniform convergence for M-estimators) Let the criterion function in Theorem 6 take the form
$$G_T(\theta) = \frac{1}{T} \sum_{t=2}^{T} q(x_t, x_{t-1}, \theta).$$
If each term satisfies the pointwise law of large numbers and the bounded-derivative condition of Theorem 6, then $G_T$ converges uniformly over $\Theta$ as $T \to \infty$.
Example: Consider the ML criterion of the NLAR model with
$$\ell(x_t, x_{t-1}, \theta) := \log f\big(x_t - \phi(x_{t-1}, \theta),\, \theta\big) \quad \forall\, t \in \mathbb{Z}.$$
We already know that if the random sequence of data $\{x_t\}_{t \in \mathbb{Z}}$ is strictly stationary and ergodic, then the random sequence $\{\ell(x_t, x_{t-1}, \theta)\}_{t \in \mathbb{Z}}$ is also strictly stationary and ergodic for every $\theta \in \Theta$, by Krengel's theorem. Hence, if the first moment of the log likelihood sequence is bounded for every $\theta$,
$$\mathrm{E}\big|\ell(x_t, x_{t-1}, \theta)\big| < \infty \quad \text{for every} \ \theta \in \Theta,$$
then by the law of large numbers for SE sequences we have
$$\frac{1}{T} \sum_{t=2}^{T} \ell(x_t, x_{t-1}, \theta) \overset{a.s.}{\to} \mathrm{E}\,\ell(x_t, x_{t-1}, \theta) \quad \text{as} \quad T \to \infty \quad \text{for every} \ \theta \in \Theta,$$
and then, by Theorems 7 and 8, we conclude that the log likelihood converges uniformly (and strongly) to its limit,
$$\sup_{\theta \in \Theta} \Big| \frac{1}{T} \sum_{t=2}^{T} \ell(x_t, x_{t-1}, \theta) - \mathrm{E}\,\ell(x_t, x_{t-1}, \theta) \Big| \overset{a.s.}{\to} 0 \quad \text{as} \quad T \to \infty.$$
If instead the pointwise convergence holds only in probability, then we can still conclude, by Theorems 6 and 8, that the log likelihood converges uniformly to its limit in probability,
$$\sup_{\theta \in \Theta} \Big| \frac{1}{T} \sum_{t=2}^{T} \ell(x_t, x_{t-1}, \theta) - \mathrm{E}\,\ell(x_t, x_{t-1}, \theta) \Big| \overset{p}{\to} 0 \quad \text{as} \quad T \to \infty.$$
Since $\frac{1}{2}\log 2\pi$ is just a constant, we can ignore it and define the ML estimator of the unknown parameter $\hat{\alpha}_T$ as
$$\hat{\alpha}_T = \arg\max_{\alpha \in [0,2]} \frac{1}{T} \sum_{t=2}^{T} q(x_t, x_{t-1}, \alpha) \quad \text{with} \quad q(x_t, x_{t-1}, \alpha) = -(x_t - \alpha x_{t-1})^2.$$
Note that we have defined the parameter space as the compact interval $[0, 2]$. Since the data $\{x_t\}_{t=1}^T$ is strictly stationary and ergodic, the random sequence $\{(x_t - \alpha x_{t-1})^2\}_{t=1}^T$ is also strictly stationary and ergodic for every $\alpha \in [0, 2]$, by Krengel's theorem. Furthermore, this sequence has bounded first moment for every $\alpha \in [0, 2]$ because⁹
$$\mathrm{E}(x_t - \alpha x_{t-1})^2 = \mathrm{E}\big[x_t^2 + \alpha^2 x_{t-1}^2 - 2\alpha x_t x_{t-1}\big] \leq \mathrm{E}\big[|x_t| + |\alpha| |x_{t-1}|\big]^2 \leq \mathrm{E} x_t^2 + |\alpha|^2 \mathrm{E} x_{t-1}^2 + 2|\alpha|\, \mathrm{E}|x_t x_{t-1}| < \infty$$
(by sub-additivity of the absolute value and linearity of the expectation) for every $\alpha \in [0, 2]$. Hence, by the law of large numbers for SE sequences we have the pointwise convergence of the criterion function
$$\frac{1}{T} \sum_{t=2}^{T} -(x_t - \alpha x_{t-1})^2 \overset{a.s.}{\to} -\mathrm{E}(x_t - \alpha x_{t-1})^2 \quad \text{as} \quad T \to \infty \quad \text{for every} \ \alpha \in [0, 2].$$
Finally, the derivative of $q$ is bounded in expectation uniformly over the parameter space,
$$\mathrm{E} \sup_{\alpha \in [0,2]} \Big| \frac{\partial q(x_t, x_{t-1}, \alpha)}{\partial \alpha} \Big| = \mathrm{E} \sup_{\alpha \in [0,2]} \big| 2(x_t - \alpha x_{t-1}) x_{t-1} \big| \leq 2\,\mathrm{E}\Big[ |x_t x_{t-1}| + \sup_{\alpha \in [0,2]} |\alpha|\, |x_{t-1}^2| \Big] \leq 2\,\mathrm{E}|x_t x_{t-1}| + 4\,\mathrm{E}|x_{t-1}|^2 < \infty.$$
⁹ Here we use the sub-additivity of the absolute value $|a + b| \leq |a| + |b|$; the stationarity of $x_t$ to conclude that $\mathrm{E}|x_t|^2 < \infty$ holds for all $t$; and the inequality $\mathrm{E}|XY| \leq \mathrm{E}|X|^2 + \mathrm{E}|Y|^2$, which holds for any two random variables $X$ and $Y$, to conclude that $\mathrm{E}|x_t|^2 < \infty\ \forall\, t$ implies $\mathrm{E}|x_t x_{t-1}| < \infty$.
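The uniform convergence in this example can be checked by simulation. The following sketch adds the illustrative assumption (not made in the notes) that the data really follow a Gaussian AR(1) with coefficient 0.5, so that the limit criterion has the closed form used below.

```python
# Illustrative sketch: sup over the compact space [0, 2] of the gap
# between the sample criterion -(1/T) * sum (x_t - a*x_{t-1})**2 and its
# deterministic limit, under an assumed Gaussian AR(1) DGP with phi0 = 0.5.
import random

random.seed(7)
phi0, T = 0.5, 10000
x = [0.0]
for _ in range(T - 1):
    x.append(phi0 * x[-1] + random.gauss(0.0, 1.0))

# sample moments entering the criterion
m_yy = sum(x[t] ** 2 for t in range(1, T)) / T
m_xy = sum(x[t] * x[t - 1] for t in range(1, T)) / T
m_xx = sum(x[t - 1] ** 2 for t in range(1, T)) / T

def Q_T(a):
    return -(m_yy - 2 * a * m_xy + a * a * m_xx)

def Q_inf(a):
    ex2 = 1.0 / (1.0 - phi0 ** 2)    # E x_t^2; also E x_t x_{t-1} = phi0 * E x_t^2
    return -(ex2 - 2 * a * phi0 * ex2 + a * a * ex2)

grid = [i / 100 for i in range(201)]              # compact space [0, 2]
sup_gap = max(abs(Q_T(a) - Q_inf(a)) for a in grid)
a_hat = max(grid, key=Q_T)                        # criterion maximizer
print(round(sup_gap, 2), a_hat)
```

For large $T$ the supremum gap is small and the criterion maximizer sits next to the true coefficient, which is the uniform-convergence-plus-identification route to consistency taken by Theorems 4 and 5.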
Example: Consider the LS estimator of the NLAR model
$$x_t = \alpha + \beta \tanh(x_{t-1}) + \epsilon_t$$
with $\theta = (\alpha, \beta)$ and parameter space $\Theta = [-5, 5] \times [-2, 2]$. The criterion function of the LS estimator takes the form
$$Q_T(x_T, \theta) = \frac{1}{T} \sum_{t=2}^{T} q(x_t, x_{t-1}, \theta) \quad \text{with} \quad q(x_t, x_{t-1}, \theta) = -\big(x_t - \alpha - \beta \tanh(x_{t-1})\big)^2.$$
Since the data $\{x_t\}_{t=1}^T$ is strictly stationary and ergodic, the random sequence
$$\big\{ q(x_t, x_{t-1}, \theta) \big\}_{t=1}^T = \Big\{ -\big(x_t - \alpha - \beta \tanh(x_{t-1})\big)^2 \Big\}_{t=1}^T$$
is also strictly stationary and ergodic for every $\theta \in \Theta$, by Krengel's theorem. Furthermore, this sequence has bounded first moment for every $\theta \in \Theta$ because the tanh function is uniformly bounded, and hence there exists some constant $C$ such that¹⁰
$$|\alpha + \beta \tanh(x_{t-1})| \leq C,$$
and this implies that
$$\mathrm{E}\big(x_t - \alpha - \beta \tanh(x_{t-1})\big)^2 \leq \mathrm{E} x_t^2 + C^2 + 2C\, \mathrm{E}|x_t| < \infty.$$
Hence, by the law of large numbers for SE sequences we have
$$\frac{1}{T} \sum_{t=2}^{T} -\big(x_t - \alpha - \beta \tanh(x_{t-1})\big)^2 \overset{a.s.}{\to} -\mathrm{E}\big(x_t - \alpha - \beta \tanh(x_{t-1})\big)^2$$
as $T \to \infty$ for every $\theta \in \Theta$. Finally, we note that $\Theta$ is compact and $q(x_t, x_{t-1}, \theta)$ has derivative
$$\frac{\partial q(x_t, x_{t-1}, \theta)}{\partial \theta} = \Big[\, 2 u_t(\theta), \ \ 2 u_t(\theta) \tanh(x_{t-1}) \,\Big], \quad \text{where} \quad u_t(\theta) := x_t - \alpha - \beta \tanh(x_{t-1}).$$
As a result, since $|\tanh(x)|$ is uniformly bounded by 1, we have $\mathrm{E}|\tanh(x_{t-1})| \leq 1$ and it follows that the derivative of $q(x_t, x_{t-1}, \theta)$ is bounded in expectation; for example, the first term satisfies¹¹
$$\mathrm{E} \sup_{\theta \in \Theta} \big| 2 u_t(\theta) \big| \leq \mathrm{E}\Big[ 2|x_t| + 2 \sup_{\alpha \in [-5,5]} |\alpha| + 2 \sup_{\beta \in [-2,2]} |\beta|\, |\tanh(x_{t-1})| \Big] \leq \mathrm{E}\big[ 2|x_t| + 2 \cdot 5 + 2 \cdot 2 \big] < \infty,$$
where we use the sub-additivity of the supremum and the fact that $\sup_{w \in [-a,a]} |w| = a$.
¹⁰ Here we use the fact that the tanh is uniformly bounded, and hence $\mathrm{E}|\tanh(z_t)|^k < \infty$ holds for any random variable $z_t$ and any power $k > 0$.
5.3.3 Stochastic Equicontinuity
¹¹ Here we use the fact that a vector $z_t$ has bounded expectation if each element of the vector has bounded expectation.
The theorems above show how uniform convergence can be obtained from pointwise convergence on a compact parameter space. Above we made use of the fact that a criterion function of the form
$$Q_T(x_T, \theta) = \frac{1}{T} \sum_{t=1}^{T} q(x_t, x_{t-1}, \theta) \quad (3)$$
is stochastically equicontinuous when the derivative of $q$ is bounded in expectation. In some cases this condition is easy to verify. In some other cases, however, the derivations can be quite unpleasant. For example, when $\theta$ is a high-dimensional vector, then we have to derive a large number of partial derivatives.
Fortunately, the bounded derivative condition in (3) can be easily obtained when $q(x_t, x_{t-1}, \theta)$ is a well behaved continuously differentiable function. Below, we let $\partial q(x_t, x_{t-1}, \theta)/\partial \theta^i$ denote the derivative of $q(x_t, x_{t-1}, \theta)$ with respect to the $i$th element of the vector $\theta$.
Definition 8 (Well behaved continuously differentiable function) A continuously differentiable function $q(x_t, x_{t-1}, \theta)$ is said to be well behaved of order $n > 0$ in $\Theta$ if
$$\mathrm{E} \sup_{\theta \in \Theta} \Big| \frac{\partial q(x_t, x_{t-1}, \theta)}{\partial \theta^i} \Big|^n < \infty \quad \text{for every } i.$$
Theorem 9 reveals a very simple way of obtaining stochastic equicontinuity when the criterion function $Q_T(x_T, \theta)$ is an average of terms that take the form $q(x_t, x_{t-1}, \theta)$. However, in time-varying parameter models, the criterion function is slightly different. In particular, $Q_T(x_T, \theta)$ takes the form
$$Q_T(x_T, \theta) = \frac{1}{T} \sum_{t=1}^{T} q\big(x_t, \tilde{f}_t(\theta, \hat{f}_1), \theta\big),$$
where $\tilde{f}_t(\theta, \hat{f}_1)$ is a filtered time-varying parameter initialized at $\hat{f}_1$, and Theorem 10 requires a bound of the form
$$\mathrm{E} \sup_{\theta \in \Theta} \Big| \frac{\partial q\big(x_t, \tilde{f}_t(\theta, \hat{f}_1), \theta\big)}{\partial \theta^i} \Big|^m < \infty \quad \text{for some } m.$$
Note that the moment bound of Theorem 10 is more restrictive than the moment bound of Theorem 9. In particular, Theorem 9 only required the first moment of $q(x_t, x_{t-1}, \theta)$ to be bounded,
$$\mathrm{E}\big|q(x_t, x_{t-1}, \theta)\big| < \infty \quad \text{for every } \theta \in \Theta,$$
whereas Theorem 10 requires not only that the update equation for $\tilde{f}_t$ is well behaved and satisfies certain moment bounds, but also that the second moment of $q(x_t, \tilde{f}_t(\theta, \hat{f}_1), \theta)$ be bounded,
$$\mathrm{E}\big|q(x_t, \tilde{f}_t(\theta, \hat{f}_1), \theta)\big|^2 < \infty \quad \text{for some } \theta \in \Theta.$$
5.3.4 Identifiable uniqueness
We end this chapter by noting that the identifiable uniqueness condition used in Theorems 4 and 5 can be easily obtained when the parameter space $\Theta$ is compact and the limit criterion is continuous.
Theorem 11 (Identifiable uniqueness) Let $\Theta$ be a compact subset of $\mathbb{R}^n$ and $\theta_0$ be the unique maximizer of a continuous criterion $Q_\infty$ on $\Theta$,
$$Q_\infty(\theta) < Q_\infty(\theta_0) \quad \forall\, \theta \in \Theta,\ \theta \neq \theta_0.$$
Then $\theta_0$ is an identifiably unique maximizer of $Q_\infty$ as it satisfies
$$\sup_{\theta \in S^c(\theta_0, \delta)} Q_\infty(\theta) < Q_\infty(\theta_0) \quad \text{for every} \ \delta > 0.$$
Theorem 11 tells us that when $\Theta$ is compact and $Q_\infty$ is continuous, then its maximizer $\theta_0$ is automatically well separated, or identifiably unique. Figure 1 illustrates the importance of the compactness of $\Theta$ and the continuity of $Q_\infty$. Indeed, in both cases considered in Figure 1, the parameter $\theta_0$ is the unique maximizer of $Q_\infty$, but it fails to be identifiably unique because either $\Theta$ fails to be compact or $Q_\infty$ fails to be continuous.
Figure 1: Left: parameter $\theta_0$ is the unique maximizer of $Q_\infty$, but it is not identifiably unique because $Q_\infty$ is not continuous. Right: parameter $\theta_0$ is the unique maximizer of $Q_\infty$, but it is not identifiably unique because $\Theta$ is not compact.
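A quick numeric check of Theorem 11, for an illustrative continuous limit criterion $Q_\infty(\theta) = -(\theta - \theta_0)^2$ on the compact set $[0, 2]$ (both choices are assumptions made for the sketch):

```python
# Illustrative sketch: on a compact set, the unique maximizer of a
# continuous limit criterion is identifiably unique: for every delta,
# the supremum of Q_inf outside the delta-ball around theta_0 stays
# strictly below Q_inf(theta_0).
theta_0 = 1.0

def Q_inf(theta):
    return -(theta - theta_0) ** 2

grid = [i / 1000 for i in range(2001)]   # grid on the compact set [0, 2]
gaps = []
for delta in (0.5, 0.1, 0.01):
    outside = [t for t in grid if abs(t - theta_0) >= delta]
    gaps.append(Q_inf(theta_0) - max(Q_inf(t) for t in outside))
print(all(g > 0 for g in gaps))          # True: well separated for every delta
```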
5.3.5 Notes for Time-varying Parameter Models
In Chapter 4 we noted that the log likelihood function in the local-level model depends on the filtered parameter $\{\tilde{\mu}_t(\theta, \hat{\mu}_1)\}_{t \in \mathbb{N}}$ initialized at some value $\hat{\mu}_1 \in \mathbb{R}$,
$$L_T(x_T, \theta) = \frac{1}{T} \sum_{t=2}^{T} \Big( -\frac{1}{2} \log 2\pi\sigma_\epsilon^2 - \frac{\big(x_t - \tilde{\mu}_t(\theta, \hat{\mu}_1)\big)^2}{2\sigma_\epsilon^2} \Big).$$
Similarly, we noted that the log likelihood function in the GARCH model depends on the filtered volatility $\{\tilde{\sigma}_t^2(\theta, \hat{\sigma}_1^2)\}_{t \in \mathbb{N}}$ initialized at some value $\hat{\sigma}_1^2 > 0$,
$$L_T(x_T, \theta) = \frac{1}{T} \sum_{t=2}^{T} \Big( -\frac{1}{2} \log 2\pi - \frac{1}{2} \log \tilde{\sigma}_t^2(\theta, \hat{\sigma}_1^2) - \frac{x_t^2}{2 \tilde{\sigma}_t^2(\theta, \hat{\sigma}_1^2)} \Big).$$
Since the filtered parameters initialized at time $t = 1$ can never be SE (they can only converge to a limit SE process), we now must answer the following question: how can we apply a law of large numbers to the criterion function if it is not SE? Luckily, the answer is simple: the initialization does not matter, as long as the filter converges to a limit SE sequence. The following theorem explains why we can safely ignore the initialization when establishing the convergence of the log likelihood function.
Note that in the theorem above, the convergence with the limit SE process $\{\tilde{f}_t(\theta)\}_{t \in \mathbb{Z}}$ in (4) is easy to obtain by applying a law of large numbers! This theorem tells us that if we can apply the law of large numbers with the limit SE sequence $\{\tilde{f}_t(\theta)\}_{t \in \mathbb{Z}}$ in (4), then we immediately get the law of large numbers with the filtered parameter $\{\tilde{f}_t(\theta, \hat{f}_1)\}_{t \in \mathbb{N}}$ in (5). So, in essence, we can simply ignore the initialization problem, as long as we know that the filtered parameter $\{\tilde{f}_t(\theta, \hat{f}_1)\}_{t \in \mathbb{N}}$ converges to an SE limit $\{\tilde{f}_t(\theta)\}_{t \in \mathbb{Z}}$.
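The irrelevance of the initialization is easy to see for the GARCH filter. A minimal sketch, in which the parameter values and the data are illustrative assumptions (with $\beta < 1$ so the recursion is contracting): two filtered volatility paths started at very different values $\hat{\sigma}_1^2$ merge geometrically fast.

```python
# Illustrative sketch: two GARCH volatility filters applied to the same
# data but started at different initializations converge to each other,
# because the recursion contracts past values at rate beta < 1.
import random

random.seed(3)
omega, alpha, beta = 0.1, 0.1, 0.8   # assumed parameter values
T = 300
x = [random.gauss(0.0, 1.0) for _ in range(T)]

def filter_var(init):
    s = [init]
    for t in range(1, T):
        s.append(omega + alpha * x[t - 1] ** 2 + beta * s[-1])
    return s

s_a, s_b = filter_var(0.5), filter_var(10.0)
gap_start = abs(s_a[1] - s_b[1])
gap_end = abs(s_a[-1] - s_b[-1])
print(gap_end < 1e-10 < gap_start)   # True: the initial gap has died out
```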
5.4 Exercises
1. Which of the following extremum estimators exist? Which are random variables?
(a) The maximum likelihood estimator $\hat{\theta}_T$ defined as $\hat{\theta}_T \in \arg\max_{\theta \in \Theta} L_T(x_T, \theta)$.
(b) The estimator $\hat{\theta}_T \in \arg\min_{\theta \in [-5,10]} \sum_{t=2}^{T} |u_t(\theta)|$.
(c) The estimator $\hat{\theta}_T \in \arg\min_{\theta} \sum_{t=1}^{T} u_t(\theta)^2$.
(d) The estimator $\hat{\theta}_T \in \arg\min_{\theta} \sum_{t=1}^{T} u_t(\theta)^4$.
2. Write the least squares criterion function for the parameters of the following models:
(a) Fat-tailed sigmoid AR(1): $x_t = \alpha + \beta \cos(x_{t-1}) + \epsilon_t$, $\{\epsilon_t\}_{t \in \mathbb{Z}} \sim TID(5)$.
(b) Sigmoid AR(1): $x_t = g(x_{t-1}; \theta)\, x_{t-1} + \epsilon_t$, $g(x_{t-1}; \theta) := \dfrac{\beta}{1 + \exp(\gamma x_{t-1})}$, $\forall\, t \in \mathbb{Z}$.
(c) Sigmoid AR(1): $x_t = g(x_{t-1}; \theta)\, x_{t-1} + \epsilon_t$, $g(x_{t-1}; \theta) := \alpha + \dfrac{\beta}{1 + \exp\big(\gamma + \delta x_{t-1}^2\big)}$, $\forall\, t \in \mathbb{Z}$.
3. Write the log likelihood function for the parameters of the following models:
(a) Fat-tailed sigmoid AR(1): $x_t = \alpha + \beta \cos(x_{t-1}) + \epsilon_t$, $\{\epsilon_t\}_{t \in \mathbb{Z}} \sim TID(\nu)$.
(b) Sigmoid AR(1): $x_t = g(x_{t-1}; \theta)\, x_{t-1} + \epsilon_t$, $g(x_{t-1}; \theta) := \dfrac{\beta}{1 + \exp(\gamma x_{t-1})}$, $\{\epsilon_t\}_{t \in \mathbb{Z}} \sim NID(0, \sigma^2)$, $\forall\, t \in \mathbb{Z}$.
(c) Sigmoid AR(1): $x_t = g(x_{t-1}; \theta)\, x_{t-1} + \epsilon_t$, $g(x_{t-1}; \theta) := \alpha + \dfrac{\beta}{1 + \exp\big(\gamma + \delta x_{t-1}^2\big)}$, $\{\epsilon_t\}_{t \in \mathbb{Z}} \sim NID(0, \sigma^2)$, $\forall\, t \in \mathbb{Z}$.
(d) Local level: $x_t = \mu_t + \epsilon_t$, $\{\epsilon_t\} \sim NID(0, \sigma^2)$, $\mu_t = \omega + \alpha (x_{t-1} - \mu_{t-1}) + \beta \mu_{t-1}$.
(e) GARCH: $x_t = \sigma_t \epsilon_t$, $\{\epsilon_t\}_{t \in \mathbb{Z}} \sim NID(0, 1)$, $\sigma_t^2 = \omega + \alpha x_{t-1}^2 + \beta \sigma_{t-1}^2$.
(f) $x_t = \sigma_t \epsilon_t$, $\sigma_t^2 = \omega + \alpha \tanh(x_{t-1}^2) + \beta \sigma_{t-1}^2$.
(g) NGARCH: $x_t = \sigma_t \epsilon_t$, $\{\epsilon_t\} \sim NID(0, 1)$, $\sigma_t^2 = \omega + \alpha (x_{t-1} - \delta \sigma_{t-1})^2 + \beta \sigma_{t-1}^2$.
(h) QGARCH: $x_t = \sigma_t \epsilon_t$, $\{\epsilon_t\} \sim NID(0, 1)$, $\sigma_t^2 = \omega + \alpha x_{t-1}^2 + \delta x_{t-1} + \beta \sigma_{t-1}^2$.
4. Which of the following criterion functions $Q_T$ converge uniformly over the parameter space to some deterministic limit criterion $Q_\infty$?
(a) The criterion $Q_T$ is given by $Q_T(\theta) = \frac{1}{T} \sum_{t=1}^{T} \theta x_t$,
(b) the criterion $Q_T$ is given by $Q_T(\theta) = \frac{1}{T} \sum_{t=2}^{T} (x_t - \theta x_{t-1})^2$,
(c) the criterion $Q_T$ is given by $Q_T(\theta) = \frac{1}{T} \sum_{t=2}^{T} (x_t - \theta x_{t-1})^4$,
where, in each case, the parameter space is given by $\Theta = [a, b]$, for some $(a, b) \in \mathbb{R}^2$, and $\{x_t\}$ is an SE sequence satisfying $\mathrm{E}|x_t|^4 < \infty$.
(d) The criterion $Q_T$ is given by $Q_T(\theta) = \frac{1}{T} \sum_{t=1}^{T} \big( \theta \log(x_t) + \exp(\theta) \big)$, with the same parameter space and data assumptions.
5. Let the sample of data $\{x_t\}_{t=1}^T$ be a subset of an SE time series $\{x_t\}_{t \in \mathbb{Z}}$ satisfying appropriate moment conditions. Give sufficient conditions for the consistency (and strong consistency) of the least squares estimator $\hat{\theta}_T$ to a vector $\theta_0$ in the following models:
(a) Fat-tailed sigmoid AR(1): $x_t = \alpha + \beta \cos(x_{t-1}) + \epsilon_t$, $\{\epsilon_t\}_{t \in \mathbb{Z}} \sim TID(7)$.
(b) Sigmoid AR(1): $x_t = g(x_{t-1}; \theta)\, x_{t-1} + \epsilon_t$, $g(x_{t-1}; \theta) := \dfrac{\beta}{1 + \exp(\gamma x_{t-1})}$, $\forall\, t \in \mathbb{Z}$.
(c) Sigmoid AR(1): $x_t = g(x_{t-1}; \theta)\, x_{t-1} + \epsilon_t$, $g(x_{t-1}; \theta) := \alpha + \dfrac{\beta}{1 + \exp\big(\gamma + \delta x_{t-1}^2\big)}$, $\forall\, t \in \mathbb{Z}$.
6. Solve again Exercise 5 assuming that the model is well specified instead of assuming properties for the data. In particular, for each of the models in Exercise 5, suppose that $\{x_t\}_{t=1}^T$ is a subset of a time series $\{x_t\}_{t \in \mathbb{Z}} = \{x_t(\theta_0)\}_{t \in \mathbb{Z}}$ generated by the model under $\theta_0$, and then give sufficient conditions for the consistency (and strong consistency) of the least squares estimator $\hat{\theta}_T$ to a vector $\theta_0$.
7. Let the sample of data $\{x_t\}_{t=1}^T$ be a subset of an SE time series $\{x_t\}_{t \in \mathbb{Z}}$ satisfying $\mathrm{E}|x_t|^8 < \infty$. Give sufficient conditions for the consistency (and strong consistency) of the maximum likelihood estimator $\hat{\theta}_T$ to a vector $\theta_0$ in the following models:
(a) Fat-tailed sigmoid AR(1): $x_t = \alpha + \beta \cos(x_{t-1}) + \epsilon_t$, $\{\epsilon_t\}_{t \in \mathbb{Z}} \sim TID(\nu)$.
(b) Sigmoid AR(1): $x_t = g(x_{t-1}; \theta)\, x_{t-1} + \epsilon_t$, $g(x_{t-1}; \theta) := \dfrac{\beta}{1 + \exp(\gamma x_{t-1})}$, $\{\epsilon_t\}_{t \in \mathbb{Z}} \sim NID(0, \sigma^2)$, $\forall\, t \in \mathbb{Z}$.
(c) Sigmoid AR(1): $x_t = g(x_{t-1}; \theta)\, x_{t-1} + \epsilon_t$, $g(x_{t-1}; \theta) := \alpha + \dfrac{\beta}{1 + \exp\big(\gamma + \delta x_{t-1}^2\big)}$, $\{\epsilon_t\}_{t \in \mathbb{Z}} \sim NID(0, \sigma^2)$, $\forall\, t \in \mathbb{Z}$.
(d) Local level: $x_t = \mu_t + \epsilon_t$, $\{\epsilon_t\} \sim NID(0, \sigma^2)$, $\mu_t = \omega + \alpha (x_{t-1} - \mu_{t-1}) + \beta \mu_{t-1}$.
(e) GARCH: $x_t = \sigma_t \epsilon_t$, $\{\epsilon_t\}_{t \in \mathbb{Z}} \sim NID(0, 1)$, $\sigma_t^2 = \omega + \alpha x_{t-1}^2 + \beta \sigma_{t-1}^2$.
(f) $x_t = \sigma_t \epsilon_t$, $\{\epsilon_t\} \sim TID(7)$, $\sigma_t^2 = \omega + \alpha \tanh(x_{t-1}^2) + \beta \sigma_{t-1}^2$.
(g) NGARCH: $x_t = \sigma_t \epsilon_t$, $\{\epsilon_t\} \sim NID(0, 1)$, $\sigma_t^2 = \omega + \alpha (x_{t-1} - \delta \sigma_{t-1})^2 + \beta \sigma_{t-1}^2$.
(h) QGARCH: $x_t = \sigma_t \epsilon_t$, $\{\epsilon_t\} \sim NID(0, 1)$, $\sigma_t^2 = \omega + \alpha x_{t-1}^2 + \delta x_{t-1} + \beta \sigma_{t-1}^2$.
8. Solve again Exercise 7 assuming that the model is well specified instead of assuming properties for the data. In particular, for each of the models in Exercise 7, suppose that $\{x_t\}_{t=1}^T$ is a subset of a time series $\{x_t\}_{t \in \mathbb{Z}} = \{x_t(\theta_0)\}_{t \in \mathbb{Z}}$ generated by the model under $\theta_0$, and then give sufficient conditions for the consistency (and strong consistency) of the maximum likelihood estimator $\hat{\theta}_T$ to a vector $\theta_0$.