Advanced Econometrics I
Chapter 5
Francisco Blasques
These lecture notes contain the material covered in the master course
Advanced Econometrics I. Further study material can be found in the
lecture slides and the many references cited throughout the text.
Contents
5 Asymptotic Theory for M and Z Estimators
5.1 M and Z estimators: Definition and Examples
5.2 Existence and Measurability
5.3 Consistency
5.3.1 The general consistency theorem
5.3.2 Uniform convergence
5.3.3 Stochastic Equicontinuity
5.3.4 Identifiable uniqueness
5.3.5 Notes for Time-varying Parameter Models
5.4 Exercises
Econometrics and statistics are essentially devoted to the art of learning from the data. The fundamental question we have in mind is always the following: where did the data come from? Or, in other words: what are the properties of the data generating process (DGP)? As we have seen in Chapter 4, when we deal with parametric models, the distribution of the data is determined by the values of unknown parameters. As a result, we can re-state these questions as: what are the values of the unknown parameters? In this chapter we will focus precisely on the estimation of parameters.
The effort to learn about unknown properties of the DGP dates back at least to the great Italian mathematician Gerolamo Cardano, who stated the first law of large numbers, without proof, in the 1500s. Essentially, he noted that the true probability of success in any given gamble could be approximated by calculating the average success over an increasingly larger number of trials. Stating this law of large numbers was, in some sense, the first step towards understanding that we can learn about true unknown quantities through repeated observation of the same phenomenon!¹
The method of least squares, published by Legendre in 1806 but first discovered by Gauss in 1795, constituted another important step in the art of learning from the data. The least squares method was soon applied to problems in physics, astronomy, engineering and economics, and brought immense fame and glory to both Gauss and Legendre. Gauss derived not only the formula for the OLS estimator itself, but also wrote down the exact conditions under which the estimator is unbiased and normally distributed! Unfortunately, few realized the importance of these theoretical results that characterized the properties of the OLS estimator. It was only when Andrey Markov re-published Gauss's work in 1901 that the results became famous. Today these results
¹ As mentioned in Chapter 2, the first proof of the law of large numbers came only through the hand of Jacob Bernoulli's (1713) Ars Conjectandi. A simpler proof was found by Pafnuty Chebyshev in 1874, using an unproved inequality that his student Markov finally proved in 1884.
are collectively known as the Gauss-Markov Theorem. Surely you have heard about it in your introductory econometrics courses!
The 18th and 19th centuries also witnessed the first developments in the method of maximum likelihood. These developments came by the hand of the great mathematicians Lagrange, Bernoulli, Laplace and Gauss. Unlike the method of least squares, however, the method of maximum likelihood was not immediately popular. Indeed, it was only with the work of Fisher in the early 20th century that the method of maximum likelihood became the most popular estimator of all.
Throughout the 20th century, a much more general theory of estimation was developed that includes the least-squares estimator, the maximum-likelihood estimator, the method-of-moments estimator, and many other estimators as special cases: the so-called extremum estimation theory. This theory was developed, among others, by Doob (1934), Cramér (1946), Wald (1949), Le Cam (1949), Jennrich (1969) and Malinvaud (1970).
In this chapter we will define and analyze the properties of extremum estimators.
With this general theory, we will be able to establish the properties of many estimators
in very general settings!
5.1 M and Z estimators: Definition and Examples
When the random sample is realized and we observe a vector of points $x_T(e) \in \mathbb{R}^T$, for some event $e \in E$, then $Q_T(x_T(e), \cdot)$ is just a real-valued function,
$$Q_T(x_T(e), \cdot) : \Theta \to \mathbb{R},$$
that we can attempt to maximize! For every realization $e \in E$, we get a new function $Q_T(x_T(e), \cdot) : \Theta \to \mathbb{R}$ to maximize, and we obtain a new maximizer that we call a parameter estimate!
Definition 1 (M-estimator) An extremum estimator is called an M-estimator when the criterion function takes the form of a sum,
$$Q_T(x_T, \theta) = \frac{1}{T} \sum_{t=2}^{T} q(x_t, x_{t-1}, \theta) \quad \forall\, T \in \mathbb{N}.$$
Examples of M-estimators include the famous maximum likelihood (ML) estimator, the least squares (LS) estimator, and the generalized method of moments (GMM)
estimator.
Example: (Maximum likelihood estimator) The criterion function of the ML estimator is the log likelihood function $L_T$. The ML estimator is thus an extremum estimator where $Q_T(x_T, \theta) = L_T(x_T, \theta)$,
$$\hat{\theta}_T \in \arg\max_{\theta \in \Theta} L_T(x_T, \theta), \quad \text{i.e.} \quad \hat{\theta}_T \in \arg\max_{\theta \in \Theta} \frac{1}{T} \sum_{t=2}^{T} \ell(x_t | x_{t-1}, \theta).$$
Note that dividing the log likelihood by $T$ is perfectly legitimate since the arg max is still the same.
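The ML estimator above can be sketched numerically. The following is a minimal illustration, under assumptions that are not from the notes: a Gaussian AR(1) data generating process with true coefficient 0.5, unit innovation variance, and a grid search over a compact parameter space.

```python
# Illustrative sketch: the ML estimator as an extremum estimator.
# Assumed setup (not from the notes): Gaussian AR(1) DGP with phi0 = 0.5
# and a grid search over the compact space [-0.99, 0.99].
import math
import random

random.seed(1)
phi0, T = 0.5, 2000
x = [0.0]
for _ in range(T - 1):
    x.append(phi0 * x[-1] + random.gauss(0.0, 1.0))

def avg_loglik(phi):
    # (1/T) * sum of log N(phi * x_{t-1}, 1) densities of x_t given x_{t-1}
    s = sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x[t] - phi * x[t - 1]) ** 2
            for t in range(1, T))
    return s / T

grid = [i / 100 for i in range(-99, 100)]   # compact parameter space
phi_hat = max(grid, key=avg_loglik)         # the arg max is the ML estimate
print(round(phi_hat, 2))
```

With a reasonably large sample the grid maximizer lands close to the true coefficient, and the divide-by-$T$ normalization is exactly the one discussed above: it leaves the arg max unchanged.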
Example: (Least-squares estimator) The criterion function of the LS estimator is the sum of squared residuals function. The LS estimator of the parameters of an NLAR model
$$x_t = \phi(x_{t-1}, \theta) + \epsilon_t$$
takes the form of an M-estimator where
$$Q_T(x_T, \theta) = -\frac{1}{T} \sum_{t=2}^{T} \big(x_t - \phi(x_{t-1}, \theta)\big)^2.$$
We write the criterion as the negative sum of squared residuals (note the minus sign) since maximizing the negative sum of squared residuals is the same as minimizing the sum of squared residuals. We can also divide the criterion by $T$ because it does not change the arg max set. Hence, for $u_t(\theta) := x_t - \phi(x_{t-1}, \theta)$ we obtain
$$\hat{\theta}_T \in \arg\min_{\theta \in \Theta} \frac{1}{T} \sum_{t=2}^{T} u_t(\theta)^2 = \arg\max_{\theta \in \Theta} -\frac{1}{T} \sum_{t=2}^{T} u_t(\theta)^2.$$
Example: (Generalized method of moments estimator) The criterion function of the GMM estimator is the weighted quadratic difference of moments,
$$\hat{\theta}_T \in \arg\min_{\theta \in \Theta} \Big( \frac{1}{T} \sum_{t=2}^{T} g(x_t, x_{t-1}, \theta) \Big)' W \Big( \frac{1}{T} \sum_{t=2}^{T} g(x_t, x_{t-1}, \theta) \Big),$$
where $g(x_t, x_{t-1}, \theta)$ is the random vector that should satisfy the theoretical moment condition $\mathrm{E}\, g(x_t, x_{t-1}, \theta_0) = 0$. The GMM estimator can naturally be written in M-estimator form.
Extremum estimators with criterion functions $Q_T : \mathbb{R}^T \times \Theta \to \mathbb{R}$ that are differentiable and strictly concave on the parameter space $\Theta$ can also be written as estimators that set to zero the criterion derivative,
$$\nabla Q_T(x_T, \theta) = \frac{\partial Q_T(x_T, \theta)}{\partial \theta}.$$
The strict concavity of the criterion function $Q_T$ over the parameter space $\Theta$ is used to ensure that the point where $\nabla Q_T(x_T, \theta) = 0$ is really the global maximum of the function $Q_T$. Otherwise, the point of zero derivative could correspond to a local maximum, or minimum, or some other stationary point.
In any case, when the criterion function $Q_T$ is not strictly concave, then we can still write the extremum estimator as a random element $\hat{\theta}_T$ that sets the derivative $\nabla Q_T(x_T, \theta)$ to zero as long as we apply a measurable selection technique that selects some zeros and not others. We will not discuss here what constitutes a measurable selection. It suffices to keep in mind that we could devise a rule to select the right zeros, i.e. those that correspond to global maxima. For example, if $Q_T : \mathbb{R}^T \times \Theta \to \mathbb{R}$ is twice differentiable on $\Theta$, then we can use the second derivative to figure out which zeros of $\nabla Q_T(x_T, \theta)$ correspond to maximum points.
We can thus always define $\hat{\theta}_T$ as
$$\hat{\theta}_T \in \big\{ \theta \in \Theta : \nabla Q_T(x_T, \theta) = 0 \big\} \quad (1)$$
or alternatively as
$$\nabla Q_T(x_T, \hat{\theta}_T) = 0. \quad (2)$$
Clearly, the argument above also applies to the case of M-estimators that have differentiable criterion functions. Re-writing an M-estimator in this way gives rise to the so-called Z-estimator.
Definition 2 (Z-estimator) An estimator $\hat{\theta}_T : E \to \Theta$ is called a Z-estimator if it satisfies
$$\nabla Q_T(x_T, \hat{\theta}_T) = \frac{1}{T} \sum_{t=2}^{T} \nabla q(x_t, x_{t-1}, \hat{\theta}_T) = 0 \quad \forall\, T \in \mathbb{N}.$$
Example: (Maximum likelihood estimator) ML estimators can be immediately written as Z-estimators that set the so-called score function (the derivative of the log likelihood) to zero,
$$\nabla L_T(x_T, \hat{\theta}_T) = \frac{1}{T} \sum_{t=2}^{T} \nabla \ell(x_t | x_{t-1}, \hat{\theta}_T) = 0.$$
Example: (Least squares estimator) The least-squares estimator for the NLAR model considered above can be re-written as a Z-estimator that sets the derivative of the least-squares criterion to zero,
$$\frac{1}{T} \sum_{t=2}^{T} 2\big(x_t - \phi(x_{t-1}, \hat{\theta}_T)\big)\, \nabla\phi(x_{t-1}, \hat{\theta}_T) = 0.$$
Example: (Method-of-moments estimator) An estimator that is typically written in Z-estimator form is the method of moments (MM) estimator. The MM estimator is a special case of the GMM estimator obtained when the number of moment conditions is the same as the number of parameters to be estimated. The MM estimator takes the form
$$\frac{1}{T} \sum_{t=2}^{T} g(x_t, x_{t-1}, \hat{\theta}_T) = 0,$$
where $g(x_t, x_{t-1}, \theta)$ is the random vector that should satisfy the theoretical moment condition $\mathrm{E}\, g(x_t, x_{t-1}, \theta_0) = 0$.
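A Z-estimator can be sketched the same way: instead of maximizing a criterion, we locate a zero of its derivative. The sketch below uses the same illustrative Gaussian AR(1) assumptions as before (not from the notes) and finds the root of the averaged score by bisection.

```python
# Illustrative sketch: the AR(1) ML estimator written as a Z-estimator,
# i.e. as the root of the averaged score
#   (1/T) * sum_t (x_t - phi * x_{t-1}) * x_{t-1} = 0.
# Assumed setup: Gaussian AR(1) DGP with phi0 = 0.5, bisection on [-0.99, 0.99].
import random

random.seed(1)
phi0, T = 0.5, 5000
x = [0.0]
for _ in range(T - 1):
    x.append(phi0 * x[-1] + random.gauss(0.0, 1.0))

def score(phi):
    return sum((x[t] - phi * x[t - 1]) * x[t - 1] for t in range(1, T)) / T

lo, hi = -0.99, 0.99         # the score is linear and decreasing in phi
for _ in range(60):          # bisection shrinks the bracket around the zero
    mid = 0.5 * (lo + hi)
    if score(mid) > 0:
        lo = mid
    else:
        hi = mid
phi_hat = 0.5 * (lo + hi)
print(round(phi_hat, 2))
```

Because the score here is linear in $\phi$, the bisection root coincides with the closed-form solution $\sum_t x_t x_{t-1} / \sum_t x_{t-1}^2$, illustrating that the M- and Z-estimator formulations pick out the same point.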
Just as we can re-define an M-estimator as a Z-estimator by taking the derivative of the criterion function, we can also re-define a Z-estimator as an M-estimator by integrating the criterion function. As a result, we can focus on analyzing only the properties of M-estimators; the results can then be easily extended to Z-estimators. Below we will focus exclusively on the properties of M-estimators.
5.2 Existence and Measurability
In your introductory econometrics courses you learned that estimators $\hat{\theta}_T$ are random variables that take values in the parameter space $\Theta$. In particular, they take a different value in $\Theta$ for every new sample of observed data $x_T$. Since estimators are random variables, we can study their stochastic properties, like the bias, variance, convergence in probability (consistency), convergence in distribution (asymptotic normality), etc.
In introductory econometrics, we dealt with estimators that were analytically tractable. As a result, it was easy to show that the estimators considered there were indeed random variables. For example, the OLS estimator in the linear regression considered in Chapter 2 takes the form
$$\hat{\theta}_T = \frac{\sum_{t=1}^{T} y_t x_t}{\sum_{t=1}^{T} x_t^2}.$$
Since $\hat{\theta}_T$ is a continuous function of the random variables $y_1, \ldots, y_T$ and $x_1, \ldots, x_T$, it follows immediately that $\hat{\theta}_T$ is a random variable under the Borel $\sigma$-algebra. However, we must now answer the following question: how can we be sure that the estimator is a random variable when it is analytically intractable?
Theorem 1 below gives us an answer. This theorem is important because, if we are going to talk about the convergence and distributional properties of estimators (like consistency and asymptotic normality), then we must first show that the estimator exists and that it is a random variable.
The problem of existence is related to the fact that some functions do not have a maximum. If the criterion function $Q_T$ cannot be maximized, then our extremum estimator
$$\hat{\theta}_T \in \arg\max_{\theta \in \Theta} Q_T(x_T, \theta)$$
does not exist, because the arg max set is empty (i.e. there is no $\theta$ that maximizes $Q_T$). For example, what is the value of $\theta$ that maximizes the function $\exp(\theta)$? There is none! The larger the value of $\theta$, the larger the value of the function! What is the value of $\theta > 0$ that maximizes the function $1/\theta$? There is none! The closer $\theta$ is to zero, the larger the value of the function is.³
Luckily, the Bolzano-Weierstrass theorem tells us sufficient conditions for a function to have a maximum. In particular, it tells us that every continuous function has a maximum on a compact set.⁴
Theorem 1 (Bolzano-Weierstrass) Let $f : \mathbb{R}^n \to \mathbb{R}$ be a continuous function on a compact set $R \subseteq \mathbb{R}^n$. Then $f$ has a maximum in $R$; i.e., there exists a point $r \in R$ such that $f(r) \geq f(r')\ \forall\, r' \in R$.
³ Note that $\theta = 0$ is not an acceptable answer because the function $1/\theta$ is not defined at $\theta = 0$. In particular, note that the limit from the right $\lim_{\theta \downarrow 0} 1/\theta = \infty$ is well defined, but $1/0$ is not. It is simply not true that $1/0 = \infty$.
⁴ Recall that a subset of $\mathbb{R}^n$ is said to be compact if it is bounded and closed.
Consider again the example $\exp(\theta)$. Since the exponential function is continuous, we know that it has a maximum on any compact subset $[a, b]$ of $\mathbb{R}$. This is quite obvious! We only have a problem if we try to maximize $\exp(\theta)$ over a set like $[a, b)$, which is not compact (because it is not closed at $b$), or a set like the entire real line $\mathbb{R}$ (which is not compact because it is unbounded). Consider the example $1/\theta$ again. This function is discontinuous at $\theta = 0$. But as long as we consider a compact subset of $\mathbb{R}$ that does not contain $0$, the function has a maximum. For example, $1/\theta$ has a maximum at $\theta = 1$ if we restrict attention to the compact interval $[1, 2]$.
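The two failure modes discussed above are easy to see numerically. A small sketch, where the grid sizes are illustrative assumptions: on the compact set $[0, 2]$ a grid search finds the maximum of $\exp(\theta)$ at the endpoint, while over ever larger subsets of $\mathbb{R}$ the grid maximum simply keeps growing, reflecting the fact that no maximum exists.

```python
# Illustrative sketch: exp attains its maximum on a compact set, but has
# no maximum over the whole real line.
import math

grid = [2 * i / 10000 for i in range(10001)]   # fine grid on the compact set [0, 2]
theta_star = max(grid, key=math.exp)
print(theta_star)                              # the endpoint 2.0

# Widening the domain: the grid maximum keeps growing, so no maximum exists on R.
unbounded_maxima = [max(math.exp(t) for t in range(-b, b + 1)) for b in (10, 20, 30)]
print(unbounded_maxima[0] < unbounded_maxima[1] < unbounded_maxima[2])   # True
```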
Theorem 2 (Existence and Measurability) Let $\Theta$ be a compact subset of $\mathbb{R}^n$, for some $n \in \mathbb{N}$, and $Q_T : \mathbb{R}^T \times \Theta \to \mathbb{R}$ be such that:
(i) $Q_T(x_T, \cdot) : \Theta \to \mathbb{R}$ is continuous on $\Theta$ for every $x_T \in \mathbb{R}^T$.
(ii) $Q_T(\cdot, \theta) : \mathbb{R}^T \to \mathbb{R}$ is continuous on $\mathbb{R}^T$ for every $\theta \in \Theta$.
Then, there exists a measurable map $\hat{\theta}_T : E \to \Theta$ satisfying
$$\hat{\theta}_T \in \arg\max_{\theta \in \Theta} Q_T(x_T, \theta).$$
The theorem above ensures two important things. First, that the estimator exists. Second, that the estimator is a random variable (i.e. that it is a measurable map). As you can see, condition (i) in Theorem 2 and the assumption that $\Theta$ is compact are essentially designed to ensure that the criterion function $Q_T$ has a maximizer $\hat{\theta}_T$ over the parameter space $\Theta$. Condition (ii) assumes that $Q_T$ is continuous in the random variables $x_1, \ldots, x_T$ and is used to ensure that the criterion function $Q_T$ is measurable (and hence random) with respect to the Borel $\sigma$-algebra.
In the case of an M-estimator, the criterion function takes the form
$$Q_T(x_T, \theta) = \frac{1}{T} \sum_{t=2}^{T} q(x_t, x_{t-1}, \theta)$$
and hence we can re-write the conditions of Theorem 2 in terms of conditions on the function $q$.
Theorem 3 (Existence and Measurability) Let $\Theta$ be a compact subset of $\mathbb{R}^n$, $n \in \mathbb{N}$, and $Q_T : \mathbb{R}^T \times \Theta \to \mathbb{R}$ be such that:
(i) $Q_T(x_T, \theta) = \frac{1}{T} \sum_{t=2}^{T} q(x_t, x_{t-1}, \theta)$
(ii) $q(x_t, x_{t-1}, \cdot) : \Theta \to \mathbb{R}$ is continuous on $\Theta$ for every $(x_t, x_{t-1}) \in \mathbb{R}^2$.
(iii) $q(\cdot, \cdot, \theta) : \mathbb{R}^2 \to \mathbb{R}$ is continuous on $\mathbb{R}^2$ for every $\theta \in \Theta$.
Then, there exists a measurable map $\hat{\theta}_T : E \to \Theta$ satisfying
$$\hat{\theta}_T \in \arg\max_{\theta \in \Theta} \sum_{t=2}^{T} q(x_t, x_{t-1}, \theta).$$
Example: (Maximum likelihood estimator of NLAR model) The criterion function of the ML estimator $Q_T(x_T, \theta) = L_T(x_T, \theta)$ satisfies the conditions of Theorem 2 if the joint density is continuous and $\Theta$ is compact. In the case of the NLAR model
$$x_t = \phi(x_{t-1}, \theta) + \epsilon_t \quad \forall\, t \in \mathbb{Z},$$
where $\phi$ is some nonlinear function and $\{\epsilon_t\}$ are iid innovations with probability density function $f$, we have
$$x_t | x_{t-1} \sim f\big(x_t - \phi(x_{t-1}, \theta),\, \theta\big) \quad \text{and hence} \quad L_T(x_T, \theta) = \frac{1}{T} \sum_{t=2}^{T} \log f\big(x_t - \phi(x_{t-1}, \theta),\, \theta\big).$$
As a result, the criterion function satisfies the conditions of Theorem 3 if both $f$ and $\phi$ are continuous functions (in both arguments), and $\Theta$ is compact.
Example: (Maximum likelihood estimator of Gaussian RCAR model) In the case of the Gaussian RCAR(1) model
$$x_t = \alpha_{t-1} x_{t-1} + \epsilon_t \quad \forall\, t \in \mathbb{Z}, \quad \text{with} \quad \alpha_{t-1} \sim N(\alpha, \sigma_\alpha^2) \quad \text{and} \quad \epsilon_t \sim N(0, \sigma_\epsilon^2) \ \forall\, t \in \mathbb{Z},$$
we have $x_t | x_{t-1} \sim N\big(\alpha x_{t-1},\ \sigma_\alpha^2 x_{t-1}^2 + \sigma_\epsilon^2\big)$, and hence
$$L_T(x_T, \theta) = \frac{1}{T} \sum_{t=2}^{T} \Big( -\frac{1}{2} \log 2\pi\big(\sigma_\alpha^2 x_{t-1}^2 + \sigma_\epsilon^2\big) - \frac{(x_t - \alpha x_{t-1})^2}{2\big(\sigma_\alpha^2 x_{t-1}^2 + \sigma_\epsilon^2\big)} \Big)$$
with $\theta = (\alpha, \sigma_\alpha^2, \sigma_\epsilon^2)$.
5.3 Consistency
Consistency is probably the most fundamental property that any estimator can satisfy. Essentially, an estimator $\hat{\theta}_T$ is said to be consistent for a parameter $\theta_0$ if, as the sample size $T$ grows to infinity, $\hat{\theta}_T$ takes values ever closer to $\theta_0$.
  "Letting sample size increase without bound should not be ridiculed as merely a fanciful exercise. Rather asymptotics uncover the most fundamental properties of a procedure and give us a very powerful and general evaluation tool."
  Casella and Berger (2001)
An estimator that is not consistent is, in some sense, a deeply flawed estimator! Indeed, an estimator that is not consistent is an estimator that will not approximate the parameter of interest, even when an infinite amount of information is available!
Below we define what we mean by a consistent estimator and a strongly consistent estimator. This is a good time to take a look at Appendix A and remind yourself of the concepts of convergence in probability (denoted by the symbol $\overset{p}{\to}$) and almost sure convergence (denoted by the symbol $\overset{a.s.}{\to}$).
Definition 3 (Consistent estimator) An estimator $\hat{\theta}_T$ is said to be consistent for a parameter $\theta_0$ if and only if $\hat{\theta}_T \overset{p}{\to} \theta_0$ as $T \to \infty$.
Definition 4 (Strongly consistent estimator) An estimator $\hat{\theta}_T$ is said to be strongly consistent for a parameter $\theta_0$ if and only if $\hat{\theta}_T \overset{a.s.}{\to} \theta_0$ as $T \to \infty$.
⁵ $I(x_{t-1} \geq 0)$ denotes the indicator function that takes value 1 if $x_{t-1} \geq 0$ and the value 0 otherwise.
In both definitions of consistency, the crucial thing to keep in mind is that, if the estimator $\hat{\theta}_T$ is consistent for $\theta_0$, then it takes values increasingly closer to $\theta_0$ as the sample size $T$ increases. In other words, the distribution of the estimator $\hat{\theta}_T$ is increasingly concentrated around the parameter $\theta_0$.
In fact, you may recall from your introductory econometrics course that the following conditions⁶
$$\lim_{T \to \infty} \mathrm{Var}(\hat{\theta}_T) = 0 \quad \text{and} \quad \lim_{T \to \infty} \mathrm{Bias}(\hat{\theta}_T) = 0$$
are sufficient for the consistency of the estimator $\hat{\theta}_T$ to $\theta_0$.
5.3.1 The general consistency theorem
The following two theorems establish the conditions under which an extremum estimator $\hat{\theta}_T$ is consistent for a parameter $\theta_0$. Both theorems rely on the identifiable uniqueness of the parameter $\theta_0$ and the uniform convergence of the criterion function $Q_T$ to some limit deterministic function $Q_\infty$.
The concept of uniform convergence is stronger than the concept of pointwise convergence. We say that the criterion function $Q_T$ converges pointwise over $\Theta$ to a limit function $Q_\infty$ if it holds true that $Q_T(x_T, \theta) \overset{p}{\to} Q_\infty(\theta)$ for every $\theta \in \Theta$. Pointwise convergence is thus about convergence at every possible parameter value $\theta \in \Theta$. As it turns out, this is very different from the uniform convergence required in the definition below. In Section 5.3.2 we will learn how to ensure that a function converges uniformly.
Definition 5 (Uniform convergence) The criterion function $Q_T(x_T, \cdot) : \Theta \to \mathbb{R}$ is said to converge uniformly in probability to a limit deterministic function $Q_\infty : \Theta \to \mathbb{R}$ if and only if
$$\sup_{\theta \in \Theta} \big| Q_T(x_T, \theta) - Q_\infty(\theta) \big| \overset{p}{\to} 0 \quad \text{as} \quad T \to \infty.$$
When the criterion function converges uniformly almost surely, then we refer to it as the strong uniform convergence of the criterion.
Definition 6 (Strong uniform convergence) The criterion $Q_T(x_T, \cdot) : \Theta \to \mathbb{R}$ is said to converge uniformly almost surely to a limit deterministic function $Q_\infty : \Theta \to \mathbb{R}$ if and only if
$$\sup_{\theta \in \Theta} \big| Q_T(x_T, \theta) - Q_\infty(\theta) \big| \overset{a.s.}{\to} 0 \quad \text{as} \quad T \to \infty.$$
The concept of identifiable uniqueness of $\theta_0$ is one that states not only that $\theta_0$ is the unique maximizer of a function $Q_\infty$, i.e. that $Q_\infty(\theta_0) > Q_\infty(\theta)\ \forall\, \theta \neq \theta_0$, but also that this maximizer is well separated. Let $S(\theta_0, \delta)$ denote the set of points contained in a ball of radius $\delta$ centered at $\theta_0$, and $S^c(\theta_0, \delta)$ denote its complement in $\Theta$,
$$S(\theta_0, \delta) := \big\{\theta \in \Theta : \|\theta - \theta_0\| < \delta\big\} \quad \text{and} \quad S^c(\theta_0, \delta) := \big\{\theta \in \Theta : \theta \notin S(\theta_0, \delta)\big\}.$$
In the equation above, $\|\cdot\|$ denotes some distance or norm defined on the parameter space $\Theta \subseteq \mathbb{R}^n$. If you are not familiar with distances and norms, then this might be a good time to read Appendix B. In any case, the most important thing to keep in mind is that $\|\theta - \theta_0\|$ measures the distance between $\theta$ and $\theta_0$.
⁶ Recall that the bias of an estimator $\hat{\theta}_T$ is defined as $\mathrm{Bias}(\hat{\theta}_T) := \mathrm{E}\,\hat{\theta}_T - \theta_0$.
Definition 7 (Identifiable uniqueness) A parameter $\theta_0 \in \Theta$ is said to be an identifiably unique maximizer of the limit criterion function $Q_\infty : \Theta \to \mathbb{R}$ if
$$\sup_{\theta \in S^c(\theta_0, \delta)} Q_\infty(\theta) < Q_\infty(\theta_0) \quad \text{for every} \ \delta > 0.$$
The following theorem takes the conditions of Theorem 2, the identifiable uniqueness of $\theta_0$ and the uniform convergence of the criterion function $Q_T$ to obtain the consistency of the extremum estimator $\hat{\theta}_T$. This remarkable theorem is the result of decades of work. It is a masterpiece of 20th century statistics and econometrics!
Theorem 4 (Consistency) Let $\hat{\theta}_T$ be an estimator satisfying the conditions of Theorem 2 and suppose that:
(i) The criterion function $Q_T$ converges in probability uniformly over $\Theta$ to the limit deterministic function $Q_\infty$ as $T \to \infty$,
$$\sup_{\theta \in \Theta} \big| Q_T(x_T, \theta) - Q_\infty(\theta) \big| \overset{p}{\to} 0 \quad \text{as} \quad T \to \infty.$$
(ii) The parameter $\theta_0$ is the identifiably unique maximizer of the limit criterion function $Q_\infty$,
$$\sup_{\theta \in S^c(\theta_0, \delta)} Q_\infty(\theta) < Q_\infty(\theta_0) \quad \text{for every} \ \delta > 0.$$
Then the estimator $\hat{\theta}_T$ is consistent for $\theta_0$ since $\hat{\theta}_T \overset{p}{\to} \theta_0$ as $T \to \infty$.
The strong consistency is obtained when the criterion function converges almost surely to its limit.
Theorem 5 (Strong consistency) Let $\hat{\theta}_T$ be an estimator satisfying the conditions of Theorem 2 and suppose that:
(i) The criterion function $Q_T$ converges uniformly almost surely over $\Theta$ to the limit deterministic function $Q_\infty$ as $T \to \infty$,
$$\sup_{\theta \in \Theta} \big| Q_T(x_T, \theta) - Q_\infty(\theta) \big| \overset{a.s.}{\to} 0 \quad \text{as} \quad T \to \infty.$$
(ii) The parameter $\theta_0$ is the identifiably unique maximizer of the limit criterion function $Q_\infty$,
$$\sup_{\theta \in S^c(\theta_0, \delta)} Q_\infty(\theta) < Q_\infty(\theta_0) \quad \text{for every} \ \delta > 0.$$
Then the estimator $\hat{\theta}_T$ is strongly consistent for $\theta_0$ since $\hat{\theta}_T \overset{a.s.}{\to} \theta_0$ as $T \to \infty$.
5.3.2 Uniform convergence
As explained above, the concept of uniform convergence is stronger than the concept of pointwise convergence, where $Q_T(x_T, \theta) \overset{p}{\to} Q_\infty(\theta)$ for every $\theta \in \Theta$. In particular, it is easy to show that uniform convergence implies pointwise convergence, but pointwise convergence does not imply uniform convergence.
Example: Consider a sequence of functions $\{G_T\}_{T \in \mathbb{N}}$ defined on the interval $[0, 1]$, where each $G_T$ is given by $G_T(x) = x^T$. This sequence converges pointwise (but not uniformly) to the limit function $G_\infty$ given by
$$G_\infty(x) = \begin{cases} 0 & \text{for } x \in [0, 1) \\ 1 & \text{for } x = 1 \end{cases}$$
In particular, for every $x \in [0, 1]$, $\{G_T\}$ converges to $G_\infty$,
$$|G_T(x) - G_\infty(x)| \to 0 \quad \forall\, x \in [0, 1] \quad \text{(pointwise convergence)},$$
but $\{G_T\}$ does not converge uniformly on $[0, 1]$ to $G_\infty$,
$$\sup_{x \in [0, 1]} |G_T(x) - G_\infty(x)| \nrightarrow 0 \quad \text{(no uniform convergence)}.$$
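The example above can be checked numerically. A minimal sketch, where the grid resolution is an illustrative assumption: at any fixed interior point the gap $|G_T(x) - G_\infty(x)|$ vanishes, but the supremum over the grid does not, so the convergence is pointwise and not uniform.

```python
# Illustrative sketch: G_T(x) = x**T converges pointwise but not
# uniformly on [0, 1].
def G(T, x):
    return x ** T

def G_inf(x):
    return 1.0 if x == 1.0 else 0.0

xs = [i / 1000 for i in range(1001)]          # grid on [0, 1]
for T in (10, 100, 1000):
    sup_gap = max(abs(G(T, x) - G_inf(x)) for x in xs)
    mid_gap = abs(G(T, 0.5) - G_inf(0.5))     # gap at the fixed point x = 0.5
    print(T, round(sup_gap, 3), mid_gap)
```

The supremum is dominated by grid points just below 1, where $x^T$ stays bounded away from the limit value 0 no matter how large $T$ is.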
The following theorem explains that we can obtain uniform convergence from pointwise convergence as long as the sequence of functions is stochastically equicontinuous. Stochastic equicontinuity is obtained when the sequence is composed of differentiable functions with derivative that is bounded in expectation.
Theorem 6 (Stochastic equicontinuity and uniform convergence) Let $(E, \mathcal{F}, P)$ be a probability space and $\{G_T\}_{T \in \mathbb{N}}$ be a sequence of random functions $G_T : E \times \Theta \to \mathbb{R}$ that are differentiable on the convex compact set $\Theta$. Suppose that:
(i) The sequence converges pointwise in probability to a limit function $G_\infty$,
$$G_T(\theta) \overset{p}{\to} G_\infty(\theta) \quad \text{as} \quad T \to \infty \quad \text{for every} \ \theta \in \Theta.$$
(ii) The sequence is stochastically equicontinuous; in particular, it suffices that the derivative of $G_T$ is bounded in expectation, uniformly over $\Theta$.
Then $\sup_{\theta \in \Theta} |G_T(\theta) - G_\infty(\theta)| \overset{p}{\to} 0$ as $T \to \infty$.
Theorem 7 (Stochastic equicontinuity and strong uniform convergence) is the almost sure counterpart: if the pointwise convergence in (i) holds almost surely, then, under the same conditions, $\sup_{\theta \in \Theta} |G_T(\theta) - G_\infty(\theta)| \overset{a.s.}{\to} 0$ as $T \to \infty$.
In the case of an M-estimator, it is quite easy to verify that the criterion function converges uniformly. In particular, we apply laws of large numbers to obtain the pointwise convergence and the differentiability of the criterion to obtain the uniform convergence.
Theorem 8 (Uniform convergence for M-estimators) Let the criterion function in Theorem 6 take the form
$$G_T(\theta) = \frac{1}{T} \sum_{t=2}^{T} q(x_t, x_{t-1}, \theta).$$
If each term satisfies the pointwise law of large numbers and the bounded-derivative condition of Theorem 6, then $G_T$ converges uniformly over $\Theta$ as $T \to \infty$.
Example: Consider the ML criterion of the NLAR model with
$$\ell(x_t, x_{t-1}, \theta) := \log f\big(x_t - \phi(x_{t-1}, \theta),\, \theta\big) \quad \forall\, t \in \mathbb{Z}.$$
We already know that if the random sequence of data $\{x_t\}_{t \in \mathbb{Z}}$ is strictly stationary and ergodic, then the random sequence $\{\ell(x_t, x_{t-1}, \theta)\}_{t \in \mathbb{Z}}$ is also strictly stationary and ergodic for every $\theta \in \Theta$, by Krengel's theorem. Hence, if the first moment of the log likelihood sequence is bounded for every $\theta$,
$$\mathrm{E}\big|\ell(x_t, x_{t-1}, \theta)\big| < \infty \quad \text{for every} \ \theta \in \Theta,$$
then by the law of large numbers for SE sequences we have
$$\frac{1}{T} \sum_{t=2}^{T} \ell(x_t, x_{t-1}, \theta) \overset{a.s.}{\to} \mathrm{E}\,\ell(x_t, x_{t-1}, \theta) \quad \text{as} \quad T \to \infty \quad \text{for every} \ \theta \in \Theta,$$
and then, by Theorems 7 and 8, we conclude that the log likelihood converges uniformly (and strongly) to its limit,
$$\sup_{\theta \in \Theta} \Big| \frac{1}{T} \sum_{t=2}^{T} \ell(x_t, x_{t-1}, \theta) - \mathrm{E}\,\ell(x_t, x_{t-1}, \theta) \Big| \overset{a.s.}{\to} 0 \quad \text{as} \quad T \to \infty.$$
If instead the pointwise convergence holds only in probability, then we can still conclude, by Theorems 6 and 8, that the log likelihood converges uniformly to its limit in probability,
$$\sup_{\theta \in \Theta} \Big| \frac{1}{T} \sum_{t=2}^{T} \ell(x_t, x_{t-1}, \theta) - \mathrm{E}\,\ell(x_t, x_{t-1}, \theta) \Big| \overset{p}{\to} 0 \quad \text{as} \quad T \to \infty.$$
Since $\frac{1}{2}\log 2\pi$ is just a constant, we can ignore it and define the ML estimator of the unknown parameter $\hat{\alpha}_T$ as
$$\hat{\alpha}_T = \arg\max_{\alpha \in [0,2]} \frac{1}{T} \sum_{t=2}^{T} q(x_t, x_{t-1}, \alpha) \quad \text{with} \quad q(x_t, x_{t-1}, \alpha) = -(x_t - \alpha x_{t-1})^2.$$
Note that we have defined the parameter space as the compact interval $[0, 2]$. Since the data $\{x_t\}_{t=1}^T$ is strictly stationary and ergodic, the random sequence $\{(x_t - \alpha x_{t-1})^2\}_{t=1}^T$ is also strictly stationary and ergodic for every $\alpha \in [0, 2]$, by Krengel's theorem. Furthermore, this sequence has bounded first moment for every $\alpha \in [0, 2]$ because⁹
$$\mathrm{E}(x_t - \alpha x_{t-1})^2 = \mathrm{E}\big[x_t^2 + \alpha^2 x_{t-1}^2 - 2\alpha x_t x_{t-1}\big] \leq \mathrm{E}\big[|x_t| + |\alpha| |x_{t-1}|\big]^2 \leq \mathrm{E} x_t^2 + |\alpha|^2 \mathrm{E} x_{t-1}^2 + 2|\alpha|\, \mathrm{E}|x_t x_{t-1}| < \infty$$
(by sub-additivity of the absolute value and linearity of the expectation) for every $\alpha \in [0, 2]$. Hence, by the law of large numbers for SE sequences we have the pointwise convergence of the criterion function
$$\frac{1}{T} \sum_{t=2}^{T} -(x_t - \alpha x_{t-1})^2 \overset{a.s.}{\to} -\mathrm{E}(x_t - \alpha x_{t-1})^2 \quad \text{as} \quad T \to \infty \quad \text{for every} \ \alpha \in [0, 2].$$
Finally, the derivative of $q$ is bounded in expectation uniformly over the parameter space,
$$\mathrm{E} \sup_{\alpha \in [0,2]} \Big| \frac{\partial q(x_t, x_{t-1}, \alpha)}{\partial \alpha} \Big| = \mathrm{E} \sup_{\alpha \in [0,2]} \big| 2(x_t - \alpha x_{t-1}) x_{t-1} \big| \leq 2\,\mathrm{E}\Big[ |x_t x_{t-1}| + \sup_{\alpha \in [0,2]} |\alpha|\, |x_{t-1}^2| \Big] \leq 2\,\mathrm{E}|x_t x_{t-1}| + 4\,\mathrm{E}|x_{t-1}|^2 < \infty.$$
⁹ Here we use the sub-additivity of the absolute value $|a + b| \leq |a| + |b|$; the stationarity of $x_t$ to conclude that $\mathrm{E}|x_t|^2 < \infty$ holds for all $t$; and the inequality $\mathrm{E}|XY| \leq \mathrm{E}|X|^2 + \mathrm{E}|Y|^2$, which holds for any two random variables $X$ and $Y$, to conclude that $\mathrm{E}|x_t|^2 < \infty\ \forall\, t$ implies $\mathrm{E}|x_t x_{t-1}| < \infty$.
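The uniform convergence in this example can be checked by simulation. The following sketch adds the illustrative assumption (not made in the notes) that the data really follow a Gaussian AR(1) with coefficient 0.5, so that the limit criterion has the closed form used below.

```python
# Illustrative sketch: sup over the compact space [0, 2] of the gap
# between the sample criterion -(1/T) * sum (x_t - a*x_{t-1})**2 and its
# deterministic limit, under an assumed Gaussian AR(1) DGP with phi0 = 0.5.
import random

random.seed(7)
phi0, T = 0.5, 10000
x = [0.0]
for _ in range(T - 1):
    x.append(phi0 * x[-1] + random.gauss(0.0, 1.0))

# sample moments entering the criterion
m_yy = sum(x[t] ** 2 for t in range(1, T)) / T
m_xy = sum(x[t] * x[t - 1] for t in range(1, T)) / T
m_xx = sum(x[t - 1] ** 2 for t in range(1, T)) / T

def Q_T(a):
    return -(m_yy - 2 * a * m_xy + a * a * m_xx)

def Q_inf(a):
    ex2 = 1.0 / (1.0 - phi0 ** 2)    # E x_t^2; also E x_t x_{t-1} = phi0 * E x_t^2
    return -(ex2 - 2 * a * phi0 * ex2 + a * a * ex2)

grid = [i / 100 for i in range(201)]              # compact space [0, 2]
sup_gap = max(abs(Q_T(a) - Q_inf(a)) for a in grid)
a_hat = max(grid, key=Q_T)                        # criterion maximizer
print(round(sup_gap, 2), a_hat)
```

For large $T$ the supremum gap is small and the criterion maximizer sits next to the true coefficient, which is the uniform-convergence-plus-identification route to consistency taken by Theorems 4 and 5.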
Example: Consider the LS estimator of the NLAR model
$$x_t = \alpha + \beta \tanh(x_{t-1}) + \epsilon_t$$
with $\theta = (\alpha, \beta)$ and parameter space $\Theta = [-5, 5] \times [-2, 2]$. The criterion function of the LS estimator takes the form
$$Q_T(x_T, \theta) = \frac{1}{T} \sum_{t=2}^{T} q(x_t, x_{t-1}, \theta) \quad \text{with} \quad q(x_t, x_{t-1}, \theta) = -\big(x_t - \alpha - \beta \tanh(x_{t-1})\big)^2.$$
Since the data $\{x_t\}_{t=1}^T$ is strictly stationary and ergodic, the random sequence
$$\big\{ q(x_t, x_{t-1}, \theta) \big\}_{t=1}^T = \Big\{ -\big(x_t - \alpha - \beta \tanh(x_{t-1})\big)^2 \Big\}_{t=1}^T$$
is also strictly stationary and ergodic for every $\theta \in \Theta$, by Krengel's theorem. Furthermore, this sequence has bounded first moment for every $\theta \in \Theta$ because the tanh function is uniformly bounded, and hence there exists some constant $C$ such that¹⁰
$$|\alpha + \beta \tanh(x_{t-1})| \leq C,$$
and this implies that
$$\mathrm{E}\big(x_t - \alpha - \beta \tanh(x_{t-1})\big)^2 \leq \mathrm{E} x_t^2 + C^2 + 2C\, \mathrm{E}|x_t| < \infty.$$
Hence, by the law of large numbers for SE sequences we have
$$\frac{1}{T} \sum_{t=2}^{T} -\big(x_t - \alpha - \beta \tanh(x_{t-1})\big)^2 \overset{a.s.}{\to} -\mathrm{E}\big(x_t - \alpha - \beta \tanh(x_{t-1})\big)^2$$
as $T \to \infty$ for every $\theta \in \Theta$. Finally, we note that $\Theta$ is compact and $q(x_t, x_{t-1}, \theta)$ has derivative
$$\frac{\partial q(x_t, x_{t-1}, \theta)}{\partial \theta} = \Big[\, 2 u_t(\theta), \ \ 2 u_t(\theta) \tanh(x_{t-1}) \,\Big], \quad \text{where} \quad u_t(\theta) := x_t - \alpha - \beta \tanh(x_{t-1}).$$
As a result, since $|\tanh(x)|$ is uniformly bounded by 1, we have $\mathrm{E}|\tanh(x_{t-1})| \leq 1$ and it follows that the derivative of $q(x_t, x_{t-1}, \theta)$ is bounded in expectation; for example, the first term satisfies¹¹
$$\mathrm{E} \sup_{\theta \in \Theta} \big| 2 u_t(\theta) \big| \leq \mathrm{E}\Big[ 2|x_t| + 2 \sup_{\alpha \in [-5,5]} |\alpha| + 2 \sup_{\beta \in [-2,2]} |\beta|\, |\tanh(x_{t-1})| \Big] \leq \mathrm{E}\big[ 2|x_t| + 2 \cdot 5 + 2 \cdot 2 \big] < \infty,$$
where we use the sub-additivity of the supremum and the fact that $\sup_{w \in [-a,a]} |w| = a$.
¹⁰ Here we use the fact that the tanh is uniformly bounded, and hence $\mathrm{E}|\tanh(z_t)|^k < \infty$ holds for any random variable $z_t$ and any power $k > 0$.
5.3.3 Stochastic Equicontinuity
¹¹ Here we use the fact that a vector $z_t$ has bounded expectation if each element of the vector has bounded expectation.
The theorems above show how uniform convergence can be obtained from pointwise convergence on a compact parameter space. Above we made use of the fact that a criterion function of the form
$$Q_T(x_T, \theta) = \frac{1}{T} \sum_{t=1}^{T} q(x_t, x_{t-1}, \theta) \quad (3)$$
is stochastically equicontinuous when the derivative of $q$ is bounded in expectation. In some cases this condition is easy to verify. In some other cases, however, the derivations can be quite unpleasant. For example, when $\theta$ is a high-dimensional vector, then we have to derive a large number of partial derivatives.
Fortunately, the bounded derivative condition in (3) can be easily obtained when $q(x_t, x_{t-1}, \theta)$ is a well behaved continuously differentiable function. Below, we let $\partial q(x_t, x_{t-1}, \theta)/\partial \theta^i$ denote the derivative of $q(x_t, x_{t-1}, \theta)$ with respect to the $i$th element of the vector $\theta$.
Definition 8 (Well behaved continuously differentiable function) A continuously differentiable function $q(x_t, x_{t-1}, \theta)$ is said to be well behaved of order $n > 0$ in $\Theta$ if
$$\mathrm{E} \sup_{\theta \in \Theta} \Big| \frac{\partial q(x_t, x_{t-1}, \theta)}{\partial \theta^i} \Big|^n < \infty \quad \text{for every } i.$$
Theorem 9 reveals a very simple way of obtaining stochastic equicontinuity when the criterion function $Q_T(x_T, \theta)$ is an average of terms that take the form $q(x_t, x_{t-1}, \theta)$. However, in time-varying parameter models, the criterion function is slightly different. In particular, $Q_T(x_T, \theta)$ takes the form
$$Q_T(x_T, \theta) = \frac{1}{T} \sum_{t=1}^{T} q\big(x_t, \tilde{f}_t(\theta, \hat{f}_1), \theta\big),$$
where $\tilde{f}_t(\theta, \hat{f}_1)$ is a filtered time-varying parameter initialized at $\hat{f}_1$, and Theorem 10 requires a bound of the form
$$\mathrm{E} \sup_{\theta \in \Theta} \Big| \frac{\partial q\big(x_t, \tilde{f}_t(\theta, \hat{f}_1), \theta\big)}{\partial \theta^i} \Big|^m < \infty \quad \text{for some } m.$$
Note that the moment bound of Theorem 10 is more restrictive than the moment bound of Theorem 9. In particular, Theorem 9 only required the first moment of $q(x_t, x_{t-1}, \theta)$ to be bounded,
$$\mathrm{E}\big|q(x_t, x_{t-1}, \theta)\big| < \infty \quad \text{for every } \theta \in \Theta,$$
whereas Theorem 10 requires not only that the update equation for $\tilde{f}_t$ is well behaved and satisfies certain moment bounds, but also that the second moment of $q(x_t, \tilde{f}_t(\theta, \hat{f}_1), \theta)$ be bounded,
$$\mathrm{E}\big|q(x_t, \tilde{f}_t(\theta, \hat{f}_1), \theta)\big|^2 < \infty \quad \text{for some } \theta \in \Theta.$$
5.3.4 Identifiable uniqueness
We end this chapter by noting that the identifiable uniqueness condition used in Theorems 4 and 5 can be easily obtained when the parameter space $\Theta$ is compact and the limit criterion is continuous.
Theorem 11 (Identifiable uniqueness) Let $\Theta$ be a compact subset of $\mathbb{R}^n$ and $\theta_0$ be the unique maximizer of a continuous criterion $Q_\infty$ on $\Theta$,
$$Q_\infty(\theta) < Q_\infty(\theta_0) \quad \forall\, \theta \in \Theta,\ \theta \neq \theta_0.$$
Then $\theta_0$ is an identifiably unique maximizer of $Q_\infty$ as it satisfies
$$\sup_{\theta \in S^c(\theta_0, \delta)} Q_\infty(\theta) < Q_\infty(\theta_0) \quad \text{for every} \ \delta > 0.$$
Theorem 11 tells us that when $\Theta$ is compact and $Q_\infty$ is continuous, then its maximizer $\theta_0$ is automatically well separated, or identifiably unique. Figure 1 illustrates the importance of the compactness of $\Theta$ and the continuity of $Q_\infty$. Indeed, in both cases considered in Figure 1, the parameter $\theta_0$ is the unique maximizer of $Q_\infty$, but it fails to be identifiably unique because either $\Theta$ fails to be compact or $Q_\infty$ fails to be continuous.
Figure 1: Left: parameter $\theta_0$ is the unique maximizer of $Q_\infty$, but it is not identifiably unique because $Q_\infty$ is not continuous. Right: parameter $\theta_0$ is the unique maximizer of $Q_\infty$, but it is not identifiably unique because $\Theta$ is not compact.
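A quick numeric check of Theorem 11, for an illustrative continuous limit criterion $Q_\infty(\theta) = -(\theta - \theta_0)^2$ on the compact set $[0, 2]$ (both choices are assumptions made for the sketch):

```python
# Illustrative sketch: on a compact set, the unique maximizer of a
# continuous limit criterion is identifiably unique: for every delta,
# the supremum of Q_inf outside the delta-ball around theta_0 stays
# strictly below Q_inf(theta_0).
theta_0 = 1.0

def Q_inf(theta):
    return -(theta - theta_0) ** 2

grid = [i / 1000 for i in range(2001)]   # grid on the compact set [0, 2]
gaps = []
for delta in (0.5, 0.1, 0.01):
    outside = [t for t in grid if abs(t - theta_0) >= delta]
    gaps.append(Q_inf(theta_0) - max(Q_inf(t) for t in outside))
print(all(g > 0 for g in gaps))          # True: well separated for every delta
```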
5.3.5 Notes for Time-varying Parameter Models
In Chapter 4 we noted that the log likelihood function in the local-level model depends on the filtered parameter $\{\tilde{\mu}_t(\theta, \hat{\mu}_1)\}_{t \in \mathbb{N}}$ initialized at some value $\hat{\mu}_1 \in \mathbb{R}$,
$$L_T(x_T, \theta) = \frac{1}{T} \sum_{t=2}^{T} \Big( -\frac{1}{2} \log 2\pi\sigma_\epsilon^2 - \frac{\big(x_t - \tilde{\mu}_t(\theta, \hat{\mu}_1)\big)^2}{2\sigma_\epsilon^2} \Big).$$
Similarly, we noted that the log likelihood function in the GARCH model depends on the filtered volatility $\{\tilde{\sigma}_t^2(\theta, \hat{\sigma}_1^2)\}_{t \in \mathbb{N}}$ initialized at some value $\hat{\sigma}_1^2 > 0$,
$$L_T(x_T, \theta) = \frac{1}{T} \sum_{t=2}^{T} \Big( -\frac{1}{2} \log 2\pi - \frac{1}{2} \log \tilde{\sigma}_t^2(\theta, \hat{\sigma}_1^2) - \frac{x_t^2}{2 \tilde{\sigma}_t^2(\theta, \hat{\sigma}_1^2)} \Big).$$
Since the filtered parameters initialized at time $t = 1$ can never be SE (they can only converge to a limit SE process), we now must answer the following question: how can we apply a law of large numbers to the criterion function if it is not SE? Luckily, the answer is simple: the initialization does not matter, as long as the filter converges to a limit SE sequence. The following theorem explains why we can safely ignore the initialization when establishing the convergence of the log likelihood function.
Note that in the theorem above, the convergence with the limit SE process $\{\tilde{f}_t(\theta)\}_{t \in \mathbb{Z}}$ in (4) is easy to obtain by applying a law of large numbers! This theorem tells us that if we can apply the law of large numbers with the limit SE sequence $\{\tilde{f}_t(\theta)\}_{t \in \mathbb{Z}}$ in (4), then we immediately get the law of large numbers with the filtered parameter $\{\tilde{f}_t(\theta, \hat{f}_1)\}_{t \in \mathbb{N}}$ in (5). So, in essence, we can simply ignore the initialization problem, as long as we know that the filtered parameter $\{\tilde{f}_t(\theta, \hat{f}_1)\}_{t \in \mathbb{N}}$ converges to an SE limit $\{\tilde{f}_t(\theta)\}_{t \in \mathbb{Z}}$.
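The irrelevance of the initialization is easy to see for the GARCH filter. A minimal sketch, in which the parameter values and the data are illustrative assumptions (with $\beta < 1$ so the recursion is contracting): two filtered volatility paths started at very different values $\hat{\sigma}_1^2$ merge geometrically fast.

```python
# Illustrative sketch: two GARCH volatility filters applied to the same
# data but started at different initializations converge to each other,
# because the recursion contracts past values at rate beta < 1.
import random

random.seed(3)
omega, alpha, beta = 0.1, 0.1, 0.8   # assumed parameter values
T = 300
x = [random.gauss(0.0, 1.0) for _ in range(T)]

def filter_var(init):
    s = [init]
    for t in range(1, T):
        s.append(omega + alpha * x[t - 1] ** 2 + beta * s[-1])
    return s

s_a, s_b = filter_var(0.5), filter_var(10.0)
gap_start = abs(s_a[1] - s_b[1])
gap_end = abs(s_a[-1] - s_b[-1])
print(gap_end < 1e-10 < gap_start)   # True: the initial gap has died out
```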
5.4 Exercises
1. Which of the following extremum estimators exist? Which are random variables?
(a) The maximum likelihood estimator $\hat{\theta}_T$ defined as $\hat{\theta}_T \in \arg\max_{\theta \in \Theta} L_T(x_T, \theta)$.
(b) The estimator $\hat{\theta}_T \in \arg\min_{\theta \in [-5,10]} \sum_{t=2}^{T} |u_t(\theta)|$.
(c) The estimator $\hat{\theta}_T \in \arg\min_{\theta} \sum_{t=1}^{T} u_t(\theta)^2$.
(d) The estimator $\hat{\theta}_T \in \arg\min_{\theta} \sum_{t=1}^{T} u_t(\theta)^4$.
2. Write the least squares criterion function for the parameters of the following models:
(a) Fat-tailed sigmoid AR(1): $x_t = \alpha + \beta \cos(x_{t-1}) + \epsilon_t$, $\{\epsilon_t\}_{t \in \mathbb{Z}} \sim TID(5)$.
(b) Sigmoid AR(1): $x_t = g(x_{t-1}; \theta)\, x_{t-1} + \epsilon_t$, $g(x_{t-1}; \theta) := \dfrac{\beta}{1 + \exp(\gamma x_{t-1})}$, $\forall\, t \in \mathbb{Z}$.
(c) Sigmoid AR(1): $x_t = g(x_{t-1}; \theta)\, x_{t-1} + \epsilon_t$, $g(x_{t-1}; \theta) := \alpha + \dfrac{\beta}{1 + \exp\big(\gamma + \delta x_{t-1}^2\big)}$, $\forall\, t \in \mathbb{Z}$.
3. Write the log likelihood function for the parameters of the following models:
(a) Fat-tailed sigmoid AR(1): $x_t = \alpha + \beta \cos(x_{t-1}) + \epsilon_t$, $\{\epsilon_t\}_{t \in \mathbb{Z}} \sim TID(\nu)$.
(b) Sigmoid AR(1): $x_t = g(x_{t-1}; \theta)\, x_{t-1} + \epsilon_t$, $g(x_{t-1}; \theta) := \dfrac{\beta}{1 + \exp(\gamma x_{t-1})}$, $\{\epsilon_t\}_{t \in \mathbb{Z}} \sim NID(0, \sigma^2)$, $\forall\, t \in \mathbb{Z}$.
(c) Sigmoid AR(1): $x_t = g(x_{t-1}; \theta)\, x_{t-1} + \epsilon_t$, $g(x_{t-1}; \theta) := \alpha + \dfrac{\beta}{1 + \exp\big(\gamma + \delta x_{t-1}^2\big)}$, $\{\epsilon_t\}_{t \in \mathbb{Z}} \sim NID(0, \sigma^2)$, $\forall\, t \in \mathbb{Z}$.
(d) Local level: $x_t = \mu_t + \epsilon_t$, $\{\epsilon_t\} \sim NID(0, \sigma^2)$, $\mu_t = \omega + \alpha (x_{t-1} - \mu_{t-1}) + \beta \mu_{t-1}$.
(e) GARCH: $x_t = \sigma_t \epsilon_t$, $\{\epsilon_t\}_{t \in \mathbb{Z}} \sim NID(0, 1)$, $\sigma_t^2 = \omega + \alpha x_{t-1}^2 + \beta \sigma_{t-1}^2$.
(f) $x_t = \sigma_t \epsilon_t$, $\sigma_t^2 = \omega + \alpha \tanh(x_{t-1}^2) + \beta \sigma_{t-1}^2$.
(g) NGARCH: $x_t = \sigma_t \epsilon_t$, $\{\epsilon_t\} \sim NID(0, 1)$, $\sigma_t^2 = \omega + \alpha (x_{t-1} - \delta \sigma_{t-1})^2 + \beta \sigma_{t-1}^2$.
(h) QGARCH: $x_t = \sigma_t \epsilon_t$, $\{\epsilon_t\} \sim NID(0, 1)$, $\sigma_t^2 = \omega + \alpha x_{t-1}^2 + \delta x_{t-1} + \beta \sigma_{t-1}^2$.
4. Which of the following criterion functions $Q_T$ converge uniformly over the parameter space to some deterministic limit criterion $Q_\infty$?
(a) The criterion $Q_T$ is given by $Q_T(\theta) = \frac{1}{T} \sum_{t=1}^{T} \theta x_t$,
(b) the criterion $Q_T$ is given by $Q_T(\theta) = \frac{1}{T} \sum_{t=2}^{T} (x_t - \theta x_{t-1})^2$,
(c) the criterion $Q_T$ is given by $Q_T(\theta) = \frac{1}{T} \sum_{t=2}^{T} (x_t - \theta x_{t-1})^4$,
where, in each case, the parameter space is given by $\Theta = [a, b]$, for some $(a, b) \in \mathbb{R}^2$, and $\{x_t\}$ is an SE sequence satisfying $\mathrm{E}|x_t|^4 < \infty$.
(d) The criterion $Q_T$ is given by $Q_T(\theta) = \frac{1}{T} \sum_{t=1}^{T} \big( \theta \log(x_t) + \exp(\theta) \big)$, with the same parameter space and data assumptions.
5. Let the sample of data $\{x_t\}_{t=1}^T$ be a subset of an SE time series $\{x_t\}_{t \in \mathbb{Z}}$ satisfying appropriate moment conditions. Give sufficient conditions for the consistency (and strong consistency) of the least squares estimator $\hat{\theta}_T$ to a vector $\theta_0$ in the following models:
(a) Fat-tailed sigmoid AR(1): $x_t = \alpha + \beta \cos(x_{t-1}) + \epsilon_t$, $\{\epsilon_t\}_{t \in \mathbb{Z}} \sim TID(7)$.
(b) Sigmoid AR(1): $x_t = g(x_{t-1}; \theta)\, x_{t-1} + \epsilon_t$, $g(x_{t-1}; \theta) := \dfrac{\beta}{1 + \exp(\gamma x_{t-1})}$, $\forall\, t \in \mathbb{Z}$.
(c) Sigmoid AR(1): $x_t = g(x_{t-1}; \theta)\, x_{t-1} + \epsilon_t$, $g(x_{t-1}; \theta) := \alpha + \dfrac{\beta}{1 + \exp\big(\gamma + \delta x_{t-1}^2\big)}$, $\forall\, t \in \mathbb{Z}$.
6. Solve again Exercise 5 assuming that the model is well specified instead of assuming properties for the data. In particular, for each of the models in Exercise 5, suppose that $\{x_t\}_{t=1}^T$ is a subset of a time series $\{x_t\}_{t \in \mathbb{Z}} = \{x_t(\theta_0)\}_{t \in \mathbb{Z}}$ generated by the model under $\theta_0$, and then give sufficient conditions for the consistency (and strong consistency) of the least squares estimator $\hat{\theta}_T$ to a vector $\theta_0$.
7. Let the sample of data $\{x_t\}_{t=1}^T$ be a subset of an SE time series $\{x_t\}_{t \in \mathbb{Z}}$ satisfying $\mathrm{E}|x_t|^8 < \infty$. Give sufficient conditions for the consistency (and strong consistency) of the maximum likelihood estimator $\hat{\theta}_T$ to a vector $\theta_0$ in the following models:
(a) Fat-tailed sigmoid AR(1): $x_t = \alpha + \beta \cos(x_{t-1}) + \epsilon_t$, $\{\epsilon_t\}_{t \in \mathbb{Z}} \sim TID(\nu)$.
(b) Sigmoid AR(1): $x_t = g(x_{t-1}; \theta)\, x_{t-1} + \epsilon_t$, $g(x_{t-1}; \theta) := \dfrac{\beta}{1 + \exp(\gamma x_{t-1})}$, $\{\epsilon_t\}_{t \in \mathbb{Z}} \sim NID(0, \sigma^2)$, $\forall\, t \in \mathbb{Z}$.
(c) Sigmoid AR(1): $x_t = g(x_{t-1}; \theta)\, x_{t-1} + \epsilon_t$, $g(x_{t-1}; \theta) := \alpha + \dfrac{\beta}{1 + \exp\big(\gamma + \delta x_{t-1}^2\big)}$, $\{\epsilon_t\}_{t \in \mathbb{Z}} \sim NID(0, \sigma^2)$, $\forall\, t \in \mathbb{Z}$.
(d) Local level: $x_t = \mu_t + \epsilon_t$, $\{\epsilon_t\} \sim NID(0, \sigma^2)$, $\mu_t = \omega + \alpha (x_{t-1} - \mu_{t-1}) + \beta \mu_{t-1}$.
(e) GARCH: $x_t = \sigma_t \epsilon_t$, $\{\epsilon_t\}_{t \in \mathbb{Z}} \sim NID(0, 1)$, $\sigma_t^2 = \omega + \alpha x_{t-1}^2 + \beta \sigma_{t-1}^2$.
(f) $x_t = \sigma_t \epsilon_t$, $\{\epsilon_t\} \sim TID(7)$, $\sigma_t^2 = \omega + \alpha \tanh(x_{t-1}^2) + \beta \sigma_{t-1}^2$.
(g) NGARCH: $x_t = \sigma_t \epsilon_t$, $\{\epsilon_t\} \sim NID(0, 1)$, $\sigma_t^2 = \omega + \alpha (x_{t-1} - \delta \sigma_{t-1})^2 + \beta \sigma_{t-1}^2$.
(h) QGARCH: $x_t = \sigma_t \epsilon_t$, $\{\epsilon_t\} \sim NID(0, 1)$, $\sigma_t^2 = \omega + \alpha x_{t-1}^2 + \delta x_{t-1} + \beta \sigma_{t-1}^2$.
8. Solve again Exercise 7 assuming that the model is well specified instead of assuming properties for the data. In particular, for each of the models in Exercise 7, suppose that $\{x_t\}_{t=1}^T$ is a subset of a time series $\{x_t\}_{t \in \mathbb{Z}} = \{x_t(\theta_0)\}_{t \in \mathbb{Z}}$ generated by the model under $\theta_0$, and then give sufficient conditions for the consistency (and strong consistency) of the maximum likelihood estimator $\hat{\theta}_T$ to a vector $\theta_0$.