
Advanced Time Series Econometrics

(L12621)
Patrick Marsh
School of Economics
Room B42 SCGB
e-mail: patrick.marsh@nottingham.ac.uk
January 15, 2015

Course Content
Part 1. Non-Stationary Time Series
These lecture notes are designed to be self-contained. Additional reading, particularly of peer-reviewed published research articles, will be detailed in the notes, with links to the articles on Moodle. Much of the material is covered by standard advanced econometrics textbooks, the most useful of which is:
Hamilton J.D. Time Series Analysis (Princeton, 1994).
Sporadic references to specific chapters will appear in the notes below.
There are additional references to several key articles contained within the notes. These articles are available on the Moodle website for this module.

1 Non-Stationary Time Series [Hamilton Chapters 15 and 16]

Recall from Econometrics II last year the definition of a stationary time series:
(S1): $E(y_t) = \mu$, $-\infty < \mu < \infty$, for all $t$.
(S2): $Var(y_t) = \sigma^2 < \infty$ for all $t$.
(S3): $Cov(y_t, y_{t-k}) = \gamma(k)$ (the autocovariance function) for all $t$.

These imply that the mean, variance and autocovariances are all finite and constant for all $t$. A non-stationary time series is one which violates one, or more, of these three conditions.

Example 1: Linear Trending Series


Consider the series
$$y_t = \alpha + \beta t + \varepsilon_t, \qquad t = 1, \ldots, T,$$
where $T$ denotes the sample size, $\varepsilon_t \sim IID(0, \sigma^2)$ and $\beta \neq 0$. Observe that
$$E(y_t) = \alpha + \beta t + E[\varepsilon_t] = \alpha + \beta t.$$
Consequently since E (yt ) is not constant for all t, condition (S1) is not satisfied and
the process is non-stationary.
In this course we will focus on two particular types of violation of stationarity which routinely occur in both economic and financial data. The first is the linear-trend-in-mean series, as in Example 1.
The second is what is known as an integrated or difference-stationary process. Linear trends and integration can be hard to distinguish and, indeed, often occur together. First we will define exactly what we mean by an integrated or difference-stationary process.

1.1 Preliminaries

Definition: A time series $\{y_t\}$ is said to be integrated of order $d$, denoted $I(d)$, if it must be differenced $d$ times to make the resulting series both stationary and invertible.
Notation: The $d$th difference of the series is given by
$$\Delta^d y_t = (1 - L)^d y_t,$$
where $L$ is the lag operator, defined such that $L^k y_t = y_{t-k}$, for $k = 0, 1, 2, \ldots$
Most often (but not always) $d = 1$ (i.e. first differences), so we have $\Delta y_t = y_t - y_{t-1}$.
ARMA(p, q) Models
Recall from Econometrics II the ARMA(p, q) model:
$$y_t = \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q},$$
or equally
$$\phi(L)\, y_t = \theta(L)\, \varepsilon_t,$$
where $\phi(L) = 1 - \phi_1 L - \cdots - \phi_p L^p$ and $\theta(L) = 1 + \theta_1 L + \cdots + \theta_q L^q$.
If the roots of $\phi(z) = 0$ all lie strictly outside the unit circle $|z| = 1$ then $\{y_t\}$ is stationary.
If the roots of $\theta(z) = 0$ all lie strictly outside the unit circle $|z| = 1$ then $\{y_t\}$ is invertible.
The ARIMA class of models is based on the observation that many economic time series appear stationary after differencing.

1.2 ARIMA(p, d, q) Models
An ARIMA(p, d, q) process is one which takes the form
$$\Delta^d y_t = \phi_1 \Delta^d y_{t-1} + \cdots + \phi_p \Delta^d y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q},$$
or,
$$\phi(L)(1 - L)^d y_t = \theta(L)\,\varepsilon_t,$$
where $\phi(z)$ and $\theta(z)$ satisfy the stationarity and invertibility conditions above, respectively. Thus the ARIMA(p, d, q) is just a standard ARMA(p, q) applied to the differenced series $(1 - L)^d y_t$.
Example 2
Consider the process
$$y_t = y_{t-1} + \varepsilon_t, \qquad t = 1, \ldots, T, \qquad (1)$$
with $\varepsilon_t \sim IID(0, \sigma^2)$ and starting value $y_0$ a fixed (finite) constant. We need to check the stationarity conditions. By repeated back substitution we have
$$y_t = y_0 + \sum_{i=1}^{t} \varepsilon_i,$$
so that $E(y_t) = y_0 + E\left[\sum_{i=1}^{t} \varepsilon_i\right] = y_0$. This is both finite and constant for all $t$. Hence condition (S1) is satisfied. However,
$$Var(y_t) = Var\left(y_0 + \sum_{i=1}^{t} \varepsilon_i\right) = 0 + Var\left(\sum_{i=1}^{t} \varepsilon_i\right) = t\sigma^2,$$
which violates condition (S2) since the variance depends on $t$.
Thus the process in (1) is not stationary. However, if we apply differences with $d = 1$ we obtain
$$\Delta y_t = y_t - y_{t-1} = \varepsilon_t,$$
then $E(\Delta y_t) = E[\varepsilon_t] = 0$, $Var(\Delta y_t) = Var(\varepsilon_t) = \sigma^2$ and $Cov[\Delta y_t, \Delta y_{t-k}] = Cov[\varepsilon_t, \varepsilon_{t-k}] = 0$. So conditions (S1), (S2) and (S3) are all met and so $\Delta y_t$ is a stationary process.
The process in (1) is called a random walk. It is the simplest possible example of an $I(1)$ series because $\Delta y_t$ is an IID process; that is, $y_t \sim ARIMA(0, 1, 0)$.
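The growing variance is easy to see by simulation. The following is a minimal sketch (numpy only; the sample size, seed and number of replications are arbitrary choices, not from the notes) which estimates $Var(y_t)$ across many replications of the random walk in (1) and confirms it grows linearly in $t$, while the variance of the differenced series stays constant.

```python
import numpy as np

rng = np.random.default_rng(0)
R, T, sigma = 10_000, 200, 1.0   # replications, sample size, innovation s.d.

# Simulate R independent random walks y_t = y_{t-1} + eps_t with y_0 = 0.
eps = rng.normal(0.0, sigma, size=(R, T))
y = np.cumsum(eps, axis=1)

# Variance across replications at selected dates: approximately t * sigma^2.
for t in (10, 50, 100, 200):
    print(f"t = {t:3d}: Var(y_t) = {y[:, t - 1].var():.1f} (theory {t * sigma**2:.1f})")

# The first difference is just eps_t, so its variance is constant at sigma^2.
print("Var of differenced series:", np.diff(y, axis=1)[:, -1].var())
```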
Example 3
Consider the process
$$\phi(L)\, y_t = \varepsilon_t, \qquad \varepsilon_t \sim IID(0, \sigma^2),$$
with $\phi(L) = (1 - L)(1 - 0.5L)$. Note that immediately we can say $y_t$ is non-stationary - think of what the roots of $\phi(z)$ must be. In this case the relevant ARIMA model is
$$(1 - 0.5L)\,\Delta y_t = \varepsilon_t,$$
or we could instead write
$$\phi^*(L)\,\Delta y_t = \theta(L)\,\varepsilon_t,$$
with $\phi^*(L) = (1 - 0.5L)$ and $\theta(L) = 1$.
Example 4: What is the order of integration of the series
$$y_t = \varepsilon_t - \varepsilon_{t-1}, \qquad \varepsilon_t \sim IID(0, \sigma^2)?$$
In terms of the first two conditions:
(S1) $E(y_t) = E(\varepsilon_t) - E(\varepsilon_{t-1}) = 0$,
(S2) $V(y_t) = V(\varepsilon_t) + V(\varepsilon_{t-1}) - 2Cov[\varepsilon_t, \varepsilon_{t-1}] = 2\sigma^2$,
while for (S3) with $k = 1$ we have
$$Cov[y_t, y_{t-1}] = E[y_t y_{t-1}] = E[(\varepsilon_t - \varepsilon_{t-1})(\varepsilon_{t-1} - \varepsilon_{t-2})] = -E[\varepsilon_{t-1}^2] = -\sigma^2,$$
while for $k > 1$ we have $Cov[y_t, y_{t-k}] = E[y_t y_{t-k}] = E[(\varepsilon_t - \varepsilon_{t-1})(\varepsilon_{t-k} - \varepsilon_{t-k-1})] = 0$. Consequently the series is stationary. However it is NOT $I(0)$ because it is not invertible. Writing the process in ARMA form we have
$$y_t = \theta(L)\,\varepsilon_t = (1 - L)\,\varepsilon_t,$$
which is a moving average unit root process. Moreover no amount of differencing will yield a process which is BOTH stationary and invertible.
In fact this type of process often occurs when the difference operator has been applied too many times - i.e. the process has been over-differenced. We could say that since $y_t = \Delta\varepsilon_t$ and $\varepsilon_t$ is stationary and invertible, then $y_t$ is $I(-1)$. We can say that all $I(d)$, with $d < 0$, processes are non-invertible.
Example 5
Consider
$$y_t = \alpha + \beta t + \varepsilon_t, \qquad \varepsilon_t \sim IID(0, \sigma^2).$$
Many (old) statistics textbooks advise dealing with the non-stationarity in this model (note $E(y_t) = \alpha + \beta t$) by taking first differences, yielding
$$\Delta y_t = \beta + \Delta\varepsilon_t.$$
However, doing so yields a series which is stationary but non-invertible - similar to the previous example. Note also that it is difficult to do any inference on $\beta$ if we difference this way.
What should we do instead? One option is to de-trend by running an OLS regression of $y_t$ on the constant and linear trend (this is the simplest regression model with $x_t = t$) and then take the residuals
$$\hat{\varepsilon}_t = y_t - \hat{\alpha} - \hat{\beta} t,$$
where $\hat{\alpha}$ and $\hat{\beta}$ are the OLS estimators from that regression. This approach works fine if the process is $I(0)$, but there are problems - which we shall explore later - if it is $I(1)$.
One solution to the problems involving non-stationary time series is to transform them to stationary series - typically through differencing. Since the transformed series are then stationary, standard results for modelling such series (and performing inference on them) hold. As we shall see, this is not the case for non-stationary series. Moreover, as we have seen, there are problems associated with this approach. A short sketch of the de-trending option follows.
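As an illustration of the de-trending option in Example 5, the following minimal sketch (numpy only; the parameter values and variable names are illustrative assumptions) regresses a simulated trend-stationary series on a constant and linear trend and extracts the residuals $\hat{\varepsilon}_t$.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200
alpha, beta = 2.0, 0.5                       # illustrative trend parameters
t = np.arange(1, T + 1)
y = alpha + beta * t + rng.normal(size=T)    # trend-stationary series

# OLS of y on a constant and linear trend.
X = np.column_stack([np.ones(T), t])
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b_hat                        # de-trended series: eps_hat

print("alpha_hat, beta_hat:", b_hat)
print("residual mean (should be ~0):", resid.mean())
```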
Example 6
Suppose that $\log(GDP_t)$ is non-stationary but $\Delta\log(GDP_t)$ is stationary. It is relatively common to work with the latter, i.e. growth rates of GDP. However, as we saw in Example 5, we do lose information on the level of GDP. This may be acute if we want to measure the long-run relationships between GDP and other variables. As we'll see later, this is where CO-INTEGRATION (i.e. what Sir Clive Granger won his Nobel Prize for) becomes an extremely important tool.
Specifically, differences of variables only measure short-run effects, because these are changes, not the level or long-run position, of the variable. By working in differences we lose information on the long-run effects.
1.2.1 Drifts and Trends
So far we have only considered zero-mean ARIMA processes. In practice series rarely have means which are identically zero. An example of an ARIMA process with a non-zero mean is given by the model
$$y_t = \alpha + \beta t + v_t, \quad t = 1, \ldots, T, \qquad v_t = \phi_1 v_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim IID(0, \sigma^2). \qquad (2)$$
If $|\phi_1| < 1$ (the usual condition for stationarity of an AR(1)) we refer to this model as a trend-stationary model. This is because the stochastic part of the model, $v_t$, satisfies the stationarity conditions, so that simply subtracting the unconditional mean $\alpha + \beta t$ from $y_t$ would yield a stationary (and invertible) series.
If $\phi_1 = 1$, so that $v_t = v_{t-1} + \varepsilon_t$ is $I(1)$ - and hence so is $y_t$ - then differencing gives
$$\Delta y_t = \alpha + \beta t + v_t - (\alpha + \beta(t - 1) + v_{t-1}) = \beta + \varepsilon_t.$$
This model is referred to as a random walk with drift, where the drift term is given by $\beta$.
Regardless of whether $|\phi_1| < 1$ or $\phi_1 = 1$, the process has a linear trend in the mean since, whatever the value of $\phi_1$, $E(y_t) = \alpha + \beta t$.

DEFINITION: An ARMA(p, q) model, $\phi(L)\, y_t = \theta(L)\, \varepsilon_t$, is said to contain $d$ unit roots if $d$ of the solutions to $\phi(z) = 0$ lie exactly on the unit circle, $|z| = 1$. Note that any roots with modulus 1 are also valid, e.g. $-1$, $i$, $-i$ etc.
Equivalently we can say that an $I(d)$ series has - or admits - $d$ unit roots in its autoregressive polynomial.
Example 7: Consider the process
$$y_t = \phi_1 y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim IID(0, \sigma^2).$$
If $\phi_1 = 1$ then $\phi(z)$ has a root equal to 1 (a unit root) and the process is $I(1)$ (in fact it is the random walk process), while if $|\phi_1| < 1$ the root is stable and the process is $I(0)$.
Example 8: Consider the ARMA(2, 0) process
$$y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \varepsilon_t, \qquad \varepsilon_t \sim IID(0, \sigma^2).$$
If we rewrite this equation as
$$y_t - y_{t-1} = (\phi_1 - 1) y_{t-1} + \phi_2 y_{t-2} + \varepsilon_t = (\phi_1 - 1) y_{t-1} - (\phi_1 - 1) y_{t-2} + (\phi_1 + \phi_2 - 1) y_{t-2} + \varepsilon_t,$$
then we can write
$$\Delta y_t = (\phi_1 - 1)\, \Delta y_{t-1} + (\phi_1 + \phi_2 - 1)\, y_{t-2} + \varepsilon_t.$$
If $\phi_1 + \phi_2 = 1$ then this model can be written as
$$\Delta y_t = (\phi_1 - 1)\, \Delta y_{t-1} + \varepsilon_t.$$
So that if also $0 < \phi_1 < 2$ then this is an ARIMA(1, 1, 0) model and the process has one ($d = 1$) unit root and one stable root. It is therefore $I(1)$.
If, however, $\phi_1 = 2$ and $\phi_2 = -1$, then we obtain $\Delta y_t = \Delta y_{t-1} + \varepsilon_t$, or
$$\Delta y_t - \Delta y_{t-1} = \varepsilon_t, \qquad \Delta(\Delta y_t) = \varepsilon_t, \qquad \Delta^2 y_t = \varepsilon_t,$$
so that the process has two unit roots and is $I(2)$. In terms of a lag polynomial we have, in this case,
$$\phi(L) = 1 - \phi_1 L - \phi_2 L^2 = 1 - 2L + L^2 = (1 - L)^2 = \Delta^2.$$
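The order of integration of a candidate AR polynomial can be checked numerically by inspecting the roots of $\phi(z) = 0$. A minimal sketch using numpy.roots, with the Example 8 coefficients $\phi_1 = 2$, $\phi_2 = -1$ (the code itself is an illustration, not part of the notes):

```python
import numpy as np

# phi(z) = 1 - 2z + z^2 = (1 - z)^2, i.e. phi1 = 2, phi2 = -1.
# numpy.roots takes coefficients from the highest power down;
# z^2 - 2z + 1 has exactly the same roots as 1 - 2z + z^2.
coeffs = [1.0, -2.0, 1.0]
roots = np.roots(coeffs)
print("roots:", roots)              # both equal to 1: two unit roots, so I(2)

# A root strictly outside the unit circle (|z| > 1) is stable;
# a root on the circle (|z| = 1) is a unit root.
print("moduli:", np.abs(roots))
```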

1.3 Properties of Unit Root Series
Consider, once again, the AR(1) process:
$$y_t = \phi_1 y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim IID(0, \sigma^2),$$
with initial condition $y_0$. By repeated back substitution we obtain
$$y_t = \varepsilon_t + \phi_1 \varepsilon_{t-1} + \phi_1^2 \varepsilon_{t-2} + \cdots + \phi_1^t y_0 = \sum_{i=0}^{t-1} \phi_1^i \varepsilon_{t-i} + \phi_1^t y_0.$$
For the random walk case, where $\phi_1 = 1$:
(i) $V(y_t) = t\sigma^2 \to \infty$ as $t \to \infty$.
(ii) The initial condition, $y_0$, matters and does not vanish as $t \to \infty$. Contrast this with the stationary case $|\phi_1| < 1$, in which $\phi_1^t y_0 \to 0$ as $t \to \infty$. The former is an example of the long memory property of integrated series (while $I(0)$ series are termed short memory). Similarly at $t = 50$, say, the weight of the shock from $t = 49$ (i.e. $\varepsilon_{49}$) is the same as it will be when the time series reaches $t = 50000000$.
Note that if $|\phi_1| > 1$ (termed an explosive root, and the series an explosive AR(1)) then these problems are exacerbated.

1.4 Comparisons with Trend-Stationary Models
Consider the following two models, both observed for $t = 1, 2, \ldots, T$:
(a): $y_t = \alpha + \beta t + u_t$
(b): $y_t = \beta + y_{t-1} + u_t$,
where in each case $u_t$ denotes an $I(0)$ process. Both of these models have been used extensively to model real macroeconomic time series data. Note that they both capture the general pattern of macroeconomic behaviour - i.e. series combining trends and random fluctuations. E.g. GDP trends upwards (we hope) but over time there are fluctuations about the trend. (See Figures 1 and 2 on Moodle.)
TREND-STATIONARY MODEL (a): says that GDP is growing along the trend line. There are random fluctuations about this trend (the $u_t$'s) but their effects are only short term (short memory). So after, for example, an earthquake, war or other significant event, $GDP_t$ has a tendency to return to the trend line. Hence shocks to the economy have only a transient impact in this model.
UNIT ROOT MODEL (b): says that GDP grows on average by $\beta$ each year but the effects of unexpected shocks (the $u_t$'s) are persistent (long memory). So after, for example, an earthquake, war or other significant event, $GDP_t$ remains below trend and starts growing again from this new lower level. Shocks to the economy have a permanent impact.
Unit root models are typically associated with real business cycle models, while trend-stationary models are associated with Keynesian theories.

1.5 The Spurious Regression Problem
Suppose that $\{y_t\}$ and $\{x_t\}$ are both unit root $I(1)$ series, both observed for $t = 1, 2, \ldots, T$, i.e.
$$y_t = y_{t-1} + v_t \quad \& \quad x_t = x_{t-1} + w_t,$$
where $v_t$ and $w_t$ are $I(0)$ series.
Consider running the OLS regression of $y$ on $x$,
$$y_t = \alpha + \beta x_t + e_t,$$
then:
- The OLS estimator for $\beta$ has a non-degenerate limiting distribution. This is in contrast to the case where both $x$ and $y$ are stationary, where $\hat{\beta} \to_p \beta$. When both $x$ and $y$ are $I(1)$ processes the fitted relationship is just an outcome of some random variable and not related to the actual relationship between the variables.
- The t-statistic for testing hypotheses on $\beta$ does not have a t-distribution - even asymptotically, as the sample size tends to infinity. Worse - it diverges as the sample size increases.
- The usual $R^2$ measure has a non-degenerate limiting distribution - it does not converge to the true correlation between $x$ and $y$.
Now consider the case where $v_t$ and $w_t$ are independent (and hence so are $x_t$ and $y_t$), so that in the regression $\beta = 0$.
- If we run a regression of $y$ on $x$ then we will likely get a value of $\hat{\beta}$ very different from 0, even in very large sample sizes.
- $R^2$ will not converge to zero - the true correlation between $x$ and $y$. It will suggest there is a good fit of the regression even though there shouldn't be.
- Tests of $H_0: \beta = 0$ will tend to reject far too often if we use critical values from the t-distribution, and moreover in large samples we will reject $H_0$ no matter what (finite) critical values we use.
These findings tend towards the same conclusion: even if $y_t$ and $x_t$ are independent it is very possible that standard estimators and tests will mislead you into thinking there is a (long-run) relationship between the variables $y$ and $x$. This is called the spurious regression problem. Two papers exploring this problem in considerable detail are Granger and Newbold (1974) and Phillips (1986).
The spurious regression problem is the principal reason why it is vital to pre-test data for the presence of a unit root, which is the focus of the remainder of this half of the module. A short simulation illustrating the problem follows.
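A minimal sketch of the phenomenon (numpy only; the sample size and seed are arbitrary assumptions): it regresses one simulated random walk on another, independent, one and reports the slope, t-statistic and $R^2$. Rerunning with different seeds typically gives large t-statistics and non-trivial $R^2$ despite the true slope being zero.

```python
import numpy as np

rng = np.random.default_rng(42)
T = 1_000

# Two independent random walks: the true relationship is beta = 0.
y = np.cumsum(rng.normal(size=T))
x = np.cumsum(rng.normal(size=T))

# OLS of y on a constant and x.
X = np.column_stack([np.ones(T), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
s2 = e @ e / (T - 2)                                  # error variance estimate
se_beta = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
r2 = 1 - (e @ e) / ((y - y.mean()) @ (y - y.mean()))

print(f"beta_hat = {b[1]:.3f}, t-stat = {b[1] / se_beta:.1f}, R^2 = {r2:.2f}")
```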

2 Unit Root Testing
As we have just seen, models with a unit root have very different properties from stationary models. Therefore, it is very important to test whether a given time series may have a unit root before proceeding to model its relationship with other variables. The first specific tests developed to this end are the Dickey-Fuller unit root tests. We will explore these tests and, in particular, detail their large sample properties and show that they have non-standard (i.e. not Normal, t, F or Chi-Square) limiting distributions.

2.1 Dickey-Fuller Unit Root Tests
Assume that we have $T + 1$ observations on a time series $y_t$ generated by the following,
$$y_t = \rho y_{t-1} + u_t, \qquad t = 1, \ldots, T, \qquad (3)$$
where $u_t \sim IID(0, \sigma^2)$ and we assume the initial condition $y_0$ is a random variable with finite variance. We can rewrite (3) as
$$\Delta y_t = \pi y_{t-1} + u_t, \qquad t = 1, \ldots, T, \qquad (4)$$
where $\Delta y_t = y_t - y_{t-1}$ and $\pi = \rho - 1$.
Dickey and Fuller (DF) (1979 and 1981) consider tests of the following:
$$H_0: \rho = 1 \ (\text{i.e. } \pi = 0) \;\Leftrightarrow\; y_t \sim I(1)$$
$$\text{vs } H_1: |\rho| < 1 \ (\text{i.e. } -2 < \pi < 0) \;\Leftrightarrow\; y_t \sim I(0).$$
Under the null hypothesis $y_t$ is an integrated (unit root) process, $y_t = y_0 + \sum_{i=1}^{t} u_i$. Under the alternative it is stationary, i.e. $I(0)$.
DF suggest estimating (4) by OLS and then propose two possible statistics for testing $H_0$ against $H_1$. In terms of the OLS estimator
$$\hat{\rho} = \frac{\sum_{t=1}^{T} y_t y_{t-1}}{\sum_{t=1}^{T} y_{t-1}^2},$$
these are the normalized bias statistic
$$T(\hat{\rho} - 1)$$
and the one-sided t-statistic
$$t_{DF} = \frac{\hat{\rho} - 1}{se(\hat{\rho})}, \qquad se(\hat{\rho}) = \left(\frac{\hat{\sigma}^2}{\sum_{t=1}^{T} y_{t-1}^2}\right)^{1/2},$$
where the variance is estimated by
$$\hat{\sigma}^2 = \frac{\sum_{t=1}^{T} (y_t - \hat{\rho}\, y_{t-1})^2}{T - 1}.$$

As we will soon establish, the limiting null distribution of neither of these statistics is standard normal ($N(0,1)$). This is a crucial property of both $T(\hat{\rho} - 1)$ and $t_{DF}$. Comparing outcomes of these statistics with critical values obtained from standard normal tables will NOT deliver tests with the anticipated size (i.e. probability of incorrect rejection of a true $H_0$). In fact the large sample critical values for the tests are:

                       1%      2.5%    5%      10%
$T(\hat{\rho}-1)$     -13.8   -10.5   -8.10   -5.70
$t_{DF}$              -2.58   -2.23   -1.95   -1.62
$N(0,1)$              -2.33   -1.96   -1.65   -1.28

Note that, for example, using a standard normal critical value at 5%, i.e. $-1.65$, for the $t_{DF}$ statistic would imply a test which actually has a size of near 10%.
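Both statistics are simple to compute in practice; the following is a minimal sketch under the assumptions of (3) (numpy only; the simulated random walk, function name and sample size are illustrative assumptions).

```python
import numpy as np

def dickey_fuller(y):
    """Normalized bias T*(rho_hat - 1) and t-statistic for y_t = rho*y_{t-1} + u_t."""
    y_lag, y_cur = y[:-1], y[1:]
    T = len(y_cur)
    rho_hat = (y_cur @ y_lag) / (y_lag @ y_lag)
    sigma2_hat = ((y_cur - rho_hat * y_lag) ** 2).sum() / (T - 1)
    se_rho = np.sqrt(sigma2_hat / (y_lag @ y_lag))
    return T * (rho_hat - 1), (rho_hat - 1) / se_rho

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=501))        # a random walk, so H0 is true
nb, t_df = dickey_fuller(y)
print(f"T(rho_hat - 1) = {nb:.2f}, t_DF = {t_df:.2f}")
# Compare with the 5% critical values -8.10 and -1.95 above.
```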
We shall now turn our attention to formally establishing the limiting distribution of the $T(\hat{\rho} - 1)$ and $t_{DF}$ statistics under the unit root null hypothesis. Unfortunately standard methods of obtaining these do not apply in this problem, and so we introduce a new tool called the functional central limit theorem [FCLT], which is the cornerstone of distribution theory in the non-stationary case.

2.2 Unit Root Asymptotic Distribution Theory
2.2.1 Introduction
Consider the AR(1) process
$$y_t = \rho y_{t-1} + u_t, \qquad u_t \sim N(0, \sigma^2),$$
and $y_0 = 0$. The OLS estimator is
$$\hat{\rho} = \frac{\sum_{t=1}^{T} y_t y_{t-1}}{\sum_{t=1}^{T} y_{t-1}^2} = \frac{\sum_{t=1}^{T} (\rho y_{t-1} + u_t)\, y_{t-1}}{\sum_{t=1}^{T} y_{t-1}^2} = \rho + \frac{\sum_{t=1}^{T} y_{t-1} u_t}{\sum_{t=1}^{T} y_{t-1}^2},$$
and if $|\rho| < 1$ then (e.g. Hamilton, Ch. 8) the standard limit theorems apply, so that
$$T^{1/2}\left(\hat{\rho} - \rho\right) \to_d N\left(0, 1 - \rho^2\right).$$
One can immediately see a problem if $\rho = 1$! The scaling required to get a limit distribution is different. To see this consider
$$T(\hat{\rho} - 1) = \frac{T^{-1} \sum_{t=1}^{T} y_{t-1} u_t}{T^{-2} \sum_{t=1}^{T} y_{t-1}^2}. \qquad (5)$$
Consider the numerator in (5): when $\rho = 1$ then $y_t = u_t + u_{t-1} + \cdots + u_1$ (since $y_0 = 0$), so that
$$y_t \sim N(0, \sigma^2 t).$$

Also, when $\rho = 1$, then
$$y_t^2 = (y_{t-1} + u_t)^2 = y_{t-1}^2 + 2 y_{t-1} u_t + u_t^2,$$
or equally
$$y_{t-1} u_t = \frac{1}{2}\left(y_t^2 - y_{t-1}^2 - u_t^2\right). \qquad (6)$$
If we sum (6) from 1 to $T$, we get
$$\sum_{t=1}^{T} y_{t-1} u_t = \frac{1}{2}\left(y_T^2 - y_0^2\right) - \frac{1}{2} \sum_{t=1}^{T} u_t^2.$$
Using the fact that $y_0 = 0$ and dividing by both $T$ and $\sigma^2$, we get
$$\frac{1}{\sigma^2 T} \sum_{t=1}^{T} y_{t-1} u_t = \frac{1}{2}\left(\frac{y_T}{\sigma T^{1/2}}\right)^2 - \frac{1}{2\sigma^2} \cdot \frac{1}{T} \sum_{t=1}^{T} u_t^2.$$
But since $y_T \sim N(0, \sigma^2 T)$, then $y_T/(\sigma T^{1/2}) \sim N(0, 1)$, so that $y_T^2/(\sigma^2 T) \sim \chi^2_1$, and by the law of large numbers $\frac{1}{T} \sum_{t=1}^{T} u_t^2 \to_p E[u_t^2] = \sigma^2$. Putting these results together we find
$$\frac{1}{\sigma^2 T} \sum_{t=1}^{T} y_{t-1} u_t \to_d \frac{1}{2}\left(\chi^2_1 - 1\right).$$
In the denominator of (5) note that $y_{t-1} \sim N(0, \sigma^2(t-1))$, so that $E[y_{t-1}^2] = \sigma^2(t-1)$, and
$$E\left[\sum_{t=1}^{T} y_{t-1}^2\right] = \sum_{t=1}^{T} E\left[y_{t-1}^2\right] = \sigma^2 \sum_{t=1}^{T} (t-1) = \sigma^2\, \frac{T(T-1)}{2}.$$
Although we don't yet have the tools to derive the asymptotic distribution of the denominator, it should be pretty clear that such a distribution can only be obtained if the scaling is $T^{-2}$. The required tools begin with the definition of:
2.2.2 Brownian Motion
Consider the random walk process $y_t = y_{t-1} + u_t$, where $u_t \sim IIDN(0, 1)$ and $y_0 = 0$. Thus (and as above) $y_t = \sum_{j=1}^{t} u_j \sim N(0, t)$. Consider also, for $s > t$,
$$y_s - y_t = u_{t+1} + u_{t+2} + \cdots + u_s \sim N(0, s - t),$$
and moreover $y_t - y_s$ is independent of $y_r - y_q$ if $t > s > r > q$.
Consider now $y_t - y_{t-1} = u_t \sim N(0, 1)$, but think of $u_t$ as the sum of two independent variables, say
$$u_t = e_{1t} + e_{2t},$$
where both $e_{1t}$ and $e_{2t}$ are $N(0, 1/2)$ variables. We could then associate $e_{1t}$ with the change between $y_{t-1}$ and some mid-point $y_{t-1/2}$, say, so that
$$y_{t-1/2} - y_{t-1} = e_{1t}, \qquad y_t - y_{t-1/2} = e_{2t},$$
but still
$$y_t - y_{t-1} = e_{1t} + e_{2t}.$$
In fact we could go further and consider $N - 1$ interim points, so that
$$y_t - y_{t-1} = e_{1t} + e_{2t} + \cdots + e_{Nt}, \qquad e_{it} \sim IIDN(0, 1/N), \qquad i = 1, \ldots, N.$$
Further, we could consider what happens if we allow $N \to \infty$. Doing so defines the continuous time process known as standard Brownian motion, which is defined as:
DEFINITION: A standard Brownian motion $W(\cdot)$ is a continuous time stochastic process associating each date $t \in [0, 1]$ with the scalar random variable $W(t)$, such that:
(a) $W(0) = 0$;
(b) for dates $0 \le t_1 < t_2 < \cdots < t_k \le 1$ the changes $[W(t_2) - W(t_1)], \ldots, [W(t_k) - W(t_{k-1})]$ are independent normal, with $[W(s) - W(t)] \sim N(0, s - t)$;
(c) for any given realization, $W(t)$ is continuous in $t$ with probability 1.
Note we have defined times as being between 0 and 1, rather than 0 and $\infty$, for convenience for what follows.
2.2.3 The Functional Central Limit Theorem (FCLT)
The simplest version of the Central Limit Theorem (CLT) has that if $u_t \sim IID(0, \sigma^2)$ and $\bar{u} = T^{-1} \sum_{t=1}^{T} u_t$, then $T^{1/2} \bar{u} \to_d N(0, \sigma^2)$ as $T \to \infty$.
Consider now just the first half of a sample (we discard the rest) and the (half) sample mean
$$\bar{u}_{T/2} = \frac{1}{\lfloor T/2 \rfloor} \sum_{t=1}^{\lfloor T/2 \rfloor} u_t,$$
where $\lfloor T/2 \rfloor$ denotes the largest integer smaller than or equal to $T/2$ (this is called the integer part of $T/2$). Notice that also, as $T \to \infty$,
$$\lfloor T/2 \rfloor^{1/2}\, \bar{u}_{T/2} \to_d N(0, \sigma^2),$$
and notice that this (half) sample mean is independent of the (half) sample mean constructed from the rest of the data.
We can generalize to taking the $r$th fraction of a sample, where $r \in [0, 1]$, by defining
$$X_T(r) = \frac{1}{T} \sum_{t=1}^{\lfloor Tr \rfloor} u_t.$$
Note the denominator in $X_T(r)$ is $T$, not $\lfloor Tr \rfloor$. Now as $r$ moves between 0 and 1, $X_T(r)$ is a step function with
$$X_T(r) = \begin{cases} 0 & \text{when } 0 \le r < 1/T \\ u_1/T & \text{when } 1/T \le r < 2/T \\ (u_1 + u_2)/T & \text{when } 2/T \le r < 3/T \\ \vdots & \\ \sum_{t=1}^{T} u_t / T & \text{when } r = 1. \end{cases}$$
Then,
$$T^{1/2} X_T(r) = \frac{1}{T^{1/2}} \sum_{t=1}^{\lfloor Tr \rfloor} u_t = \left(\frac{\lfloor Tr \rfloor}{T}\right)^{1/2} \frac{1}{\lfloor Tr \rfloor^{1/2}} \sum_{t=1}^{\lfloor Tr \rfloor} u_t,$$
but as $T \to \infty$, $\lfloor Tr \rfloor^{-1/2} \sum_{t=1}^{\lfloor Tr \rfloor} u_t \to_d N(0, \sigma^2)$ by the CLT, while $(\lfloor Tr \rfloor / T)^{1/2} \to r^{1/2}$, so that
$$T^{1/2} X_T(r) \to_d N(0, \sigma^2 r), \qquad \text{or} \qquad \frac{T^{1/2}}{\sigma} X_T(r) \to_d N(0, r).$$
Similarly, and for $r_2 > r_1$,
$$\frac{T^{1/2}}{\sigma}\left(X_T(r_2) - X_T(r_1)\right) \to_d N(0, r_2 - r_1),$$
and this is independent of $T^{1/2} X_T(r)/\sigma$ provided $r_1 > r$.
The Functional Central Limit Theorem: The sequence of stochastic functions $\{T^{1/2} X_T(\cdot)/\sigma\}$ has an asymptotic probability law described by standard Brownian motion $W(\cdot)$; that is,
$$\frac{T^{1/2}}{\sigma} X_T(\cdot) \Rightarrow W(\cdot). \qquad (7)$$
The result in (7) is the FCLT. Although we've here assumed that the $u_t$ are IID, in fact it holds under far weaker conditions.
Notice that $X_T(1)$ is the sample mean, i.e. $X_T(1) = T^{-1} \sum_{t=1}^{T} u_t$. Consequently the standard CLT is obtained as a special case of the FCLT, i.e.
$$\frac{T^{1/2}}{\sigma} X_T(1) = \frac{1}{\sigma T^{1/2}} \sum_{t=1}^{T} u_t \to_d W(1) \sim N(0, 1).$$
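The FCLT is easy to check numerically; a minimal sketch (numpy only, plotting omitted; the grid sizes and seed are arbitrary assumptions) builds the scaled step function $T^{1/2} X_T(r)/\sigma$ and compares its variance at a few fractions $r$ with the $N(0, r)$ limit across replications.

```python
import numpy as np

rng = np.random.default_rng(3)
R, T, sigma = 20_000, 500, 2.0

u = rng.normal(0.0, sigma, size=(R, T))
partial = np.cumsum(u, axis=1)            # partial sums S_t = u_1 + ... + u_t

# Scaled step function at fraction r:
# T^{1/2} X_T(r) / sigma = S_{floor(Tr)} / (sigma * sqrt(T)).
for r in (0.25, 0.5, 1.0):
    vals = partial[:, int(np.floor(T * r)) - 1] / (sigma * np.sqrt(T))
    print(f"r = {r}: sample var = {vals.var():.3f} (limit N(0, r) has var {r})")
```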

2.2.4 The Continuous Mapping Theorem (CMT)
Let $S(\cdot)$ be a continuous time stochastic process with $S(r)$ being the value it takes at some date $r \in [0, 1]$. Note that $S(r)$ is a continuous function of $r$ (with probability 1). Consider a sequence of stochastic functions $\{S_T(r)\}$ such that $S_T(\cdot) \Rightarrow S(\cdot)$; then if $g(\cdot)$ is a continuous functional, the CMT states that
$$g(S_T(\cdot)) \to_d g(S(\cdot)). \qquad (8)$$
In this context the most commonly used functionals are (stochastic) integrals, e.g. $\int_0^1 S(r)\,dr$, or simpler functions such as $[S(r)]^2$. The CMT also applies for continuous functionals mapping a continuous bounded function on $[0, 1]$ to another. We can use exactly this, since we have
$$\frac{T^{1/2}}{\sigma} X_T(r) \to_d W(r), \qquad \text{i.e.} \qquad T^{1/2} X_T(r) \to_d \sigma W(r) \sim N(0, \sigma^2 r).$$
Consider also the function $S_T(r) = \left[T^{1/2} X_T(r)\right]^2$. Since above we had $T^{1/2} X_T(r) \to_d \sigma W(r)$, it follows from the CMT that
$$S_T(r) \to_d \sigma^2 W(r)^2.$$
2.2.5 Applications to Unit Root Processes
Consider again the random walk process
$$y_t = y_{t-1} + u_t, \qquad u_t \sim IID(0, \sigma^2), \qquad y_0 = 0.$$
Then $y_t = \sum_{j=1}^{t} u_j$, and so we can define the following stochastic function $X_T(r)$:
$$X_T(r) = \begin{cases} 0 & \text{when } 0 \le r < 1/T \\ y_1/T & \text{when } 1/T \le r < 2/T \\ y_2/T & \text{when } 2/T \le r < 3/T \\ \vdots & \\ y_T/T & \text{when } r = 1. \end{cases}$$
We can plot this (see the final figure on Moodle) as a function of $r$. Doing so yields rectangles of width $1/T$ and height $y_{t-1}/T$, and thus area $y_{t-1}/T^2$. The integral of (the area under) $X_T(r)$ over $r \in [0, 1]$ is therefore given by
$$\int_0^1 X_T(r)\,dr = \frac{y_1}{T^2} + \frac{y_2}{T^2} + \cdots + \frac{y_{T-1}}{T^2} = T^{-2} \sum_{t=1}^{T} y_{t-1},$$
so that
$$\int_0^1 T^{1/2} X_T(r)\,dr = T^{-3/2} \sum_{t=1}^{T} y_{t-1}.$$
Thus, since we know from the FCLT that $T^{1/2} X_T(r)/\sigma \Rightarrow W(r)$, then using the CMT we have
$$\int_0^1 \frac{T^{1/2}}{\sigma} X_T(r)\,dr \to_d \int_0^1 W(r)\,dr,$$

so that in fact we have shown that
$$T^{-3/2} \sum_{t=1}^{T} y_{t-1} \to_d \sigma \int_0^1 W(r)\,dr.$$
It can be shown that $\int_0^1 W(r)\,dr \sim N(0, 1/3)$.
Notice that if $\{y_t\}$ is a random walk then the sample mean $\bar{y} = T^{-1} \sum_{t=1}^{T} y_t$ diverges, whereas instead $T^{-3/2} \sum_{t=1}^{T} y_{t-1}$ (i.e. $T^{-1/2}$ times the sample mean of the lagged series) converges to a normal limiting variable. Contrast this with the usual Central Limit Theorem type results for either stationary or independent data, where it is $T^{1/2} \bar{y}$ that converges to a normal limiting variable.
Consider next the sum of squares of a random walk. Let
$$S_T(r) = T\left[X_T(r)\right]^2,$$
so that
$$S_T(r) = \begin{cases} 0 & \text{when } 0 \le r < 1/T \\ y_1^2/T & \text{when } 1/T \le r < 2/T \\ y_2^2/T & \text{when } 2/T \le r < 3/T \\ \vdots & \\ y_T^2/T & \text{when } r = 1, \end{cases}$$
similar to above. Then
$$\int_0^1 S_T(r)\,dr = \frac{y_1^2}{T^2} + \frac{y_2^2}{T^2} + \cdots + \frac{y_{T-1}^2}{T^2} = T^{-2} \sum_{t=1}^{T} y_{t-1}^2,$$
and then (using both the FCLT and CMT) $S_T(\cdot) \to_d \sigma^2 [W(\cdot)]^2$, and so
$$T^{-2} \sum_{t=1}^{T} y_{t-1}^2 \to_d \sigma^2 \int_0^1 W(r)^2\,dr.$$
Ultimately the point here is to collect results useful in working out the limiting distribution of the Dickey-Fuller tests. For that also recall that
$$\frac{1}{T} \sum_{t=1}^{T} y_{t-1} u_t = \frac{1}{2}\left(\frac{1}{T} y_T^2 - \frac{1}{T} \sum_{t=1}^{T} u_t^2\right) = \frac{1}{2} S_T(1) - \frac{1}{2T} \sum_{t=1}^{T} u_t^2,$$
given the definition of $S_T(r)$. By the usual law of large numbers
$$\frac{1}{T} \sum_{t=1}^{T} u_t^2 \to_p \sigma^2,$$
and since $S_T(1) \to_d \sigma^2 [W(1)]^2$, then
$$\frac{1}{T} \sum_{t=1}^{T} y_{t-1} u_t \to_d \frac{\sigma^2}{2}\left(W(1)^2 - 1\right),$$
which is the same as we saw before, noting that $W(1)^2 \sim \chi^2_1$.
At this stage it is worth collating all of the results so far obtained. If $y_t = y_{t-1} + u_t$ with $y_0 = 0$ and $u_t \sim IID(0, \sigma^2)$, then:
$$\text{a)}: \quad T^{-1/2} \sum_{t=1}^{T} u_t \to_d \sigma W(1) \sim N(0, \sigma^2) \qquad (9)$$
$$\text{b)}: \quad T^{-1} \sum_{t=1}^{T} y_{t-1} u_t \to_d \frac{\sigma^2}{2}\left(W(1)^2 - 1\right) \sim \frac{\sigma^2}{2}\left(\chi^2_1 - 1\right) \qquad (10)$$
$$\text{c)}: \quad T^{-3/2} \sum_{t=1}^{T} y_{t-1} \to_d \sigma \int_0^1 W(r)\,dr \sim N\left(0, \frac{\sigma^2}{3}\right) \qquad (11)$$
$$\text{d)}: \quad T^{-2} \sum_{t=1}^{T} y_{t-1}^2 \to_d \sigma^2 \int_0^1 W(r)^2\,dr. \qquad (12)$$
Note that it is the SAME standard Brownian motion process, $W(r)$, throughout.
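All four convergence results are easy to check by simulation. A minimal sketch (numpy only; sizes and seed are arbitrary assumptions) computes the scaled moments a) to d) across replications and prints their sample means and variances, which can be compared with the moments of the stated limits (e.g. a) has variance $\sigma^2$, c) has variance $\sigma^2/3$, and d) has mean $\sigma^2/2$ since $E\int_0^1 W(r)^2\,dr = 1/2$).

```python
import numpy as np

rng = np.random.default_rng(7)
R, T, sigma = 20_000, 1_000, 1.0

u = rng.normal(0.0, sigma, size=(R, T))
# Lagged levels y_{t-1}, t = 1..T, with y_0 = 0.
y_lag = np.hstack([np.zeros((R, 1)), np.cumsum(u, axis=1)[:, :-1]])

a = u.sum(axis=1) / T**0.5               # -> N(0, sigma^2)
b = (y_lag * u).sum(axis=1) / T          # -> (sigma^2/2)(chi2_1 - 1), mean 0
c = y_lag.sum(axis=1) / T**1.5           # -> N(0, sigma^2/3)
d = (y_lag**2).sum(axis=1) / T**2        # -> sigma^2 * int W^2, mean sigma^2/2

for name, s in zip("abcd", (a, b, c, d)):
    print(f"{name}) mean = {s.mean():+.3f}, var = {s.var():.3f}")
```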

2.3 Asymptotic Distributions of Unit Root Test Statistics
Recall that $\hat{\rho} = \sum_{t=1}^{T} y_t y_{t-1} / \sum_{t=1}^{T} y_{t-1}^2$, or equivalently
$$T(\hat{\rho} - 1) = \frac{T^{-1} \sum_{t=1}^{T} y_{t-1} u_t}{T^{-2} \sum_{t=1}^{T} y_{t-1}^2}, \qquad (13)$$
with $u_t \sim IID(0, \sigma^2)$. Then using the results above we have shown that
$$T^{-1} \sum_{t=1}^{T} y_{t-1} u_t \to_d \frac{\sigma^2}{2}\left(W(1)^2 - 1\right) \quad \& \quad T^{-2} \sum_{t=1}^{T} y_{t-1}^2 \to_d \sigma^2 \int_0^1 W(r)^2\,dr. \qquad (14)$$
Since the ratio in (13) is a continuous function of its numerator and denominator (which is positive with probability 1), we can state that under $H_0: \rho = 1$ the OLS estimator satisfies
$$T(\hat{\rho} - 1) \to_d \frac{\frac{1}{2}\left(W(1)^2 - 1\right)}{\int_0^1 W(r)^2\,dr}. \qquad (15)$$
Sometimes we see the numerator in (15) written instead as $\int_0^1 W(r)\,dW(r)$ - this is an example of a stochastic integral - and so
$$T(\hat{\rho} - 1) \to_d \frac{\int_0^1 W(r)\,dW(r)}{\int_0^1 W(r)^2\,dr}.$$
Critical values have been tabulated for the distribution of the RHS of (15), which we shall denote as
$$\Lambda = \frac{\int_0^1 W(r)\,dW(r)}{\int_0^1 W(r)^2\,dr}.$$
Table 1:
$\Pr[\Lambda < -13.8] = 0.01$    ($N(0,1)$ quantile: $-2.33$)
$\Pr[\Lambda < -10.5] = 0.025$   ($N(0,1)$ quantile: $-1.96$)
$\Pr[\Lambda < -8.1] = 0.05$     ($N(0,1)$ quantile: $-1.645$)
$\Pr[\Lambda < -5.7] = 0.10$     ($N(0,1)$ quantile: $-1.282$)
$\Pr[\Lambda < 0.93] = 0.90$     ($N(0,1)$ quantile: $1.282$)
$\Pr[\Lambda < 1.28] = 0.95$     ($N(0,1)$ quantile: $1.645$)
$\Pr[\Lambda < 2.03] = 0.99$     ($N(0,1)$ quantile: $2.33$)
Clearly the distribution of $\Lambda$ is not standard normal; instead it is a non-standard distribution called the Dickey-Fuller distribution.
It also follows from (15) that $\hat{\rho}$ is a super-consistent estimator of $\rho$ when $\rho = 1$, in that it converges to the true value at rate $T$, rather than the more usual $T^{1/2}$.
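The Dickey-Fuller distribution in Table 1 can be approximated by simulating $T(\hat{\rho} - 1)$ under the null; a minimal sketch (numpy only; grid sizes and seed are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(11)
R, T = 50_000, 500

u = rng.normal(size=(R, T))
y_lag = np.hstack([np.zeros((R, 1)), np.cumsum(u, axis=1)[:, :-1]])  # y_{t-1}
y = y_lag + u                                                        # y_t under rho = 1

rho_hat = (y * y_lag).sum(axis=1) / (y_lag**2).sum(axis=1)
stat = T * (rho_hat - 1)

# Empirical left-tail quantiles: compare with Table 1 (-13.8, -10.5, -8.1, -5.7).
print(np.quantile(stat, [0.01, 0.025, 0.05, 0.10]))
```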
The other popular unit root test statistic for $H_0: \rho = 1$ is the OLS t-ratio:
$$t_{DF} = \frac{\hat{\rho} - 1}{\left(\hat{\sigma}^2 / \sum_{t=1}^{T} y_{t-1}^2\right)^{1/2}}, \qquad (16)$$
where $\hat{\sigma}^2 = T^{-1} \sum_{t=1}^{T} (y_t - \hat{\rho}\, y_{t-1})^2$. Again this will have a non-standard asymptotic distribution. To obtain it we rewrite (16) as
$$t_{DF} = T(\hat{\rho} - 1)\, \frac{\left(T^{-2} \sum_{t=1}^{T} y_{t-1}^2\right)^{1/2}}{\hat{\sigma}} = \frac{T^{-1} \sum_{t=1}^{T} y_{t-1} u_t}{\hat{\sigma} \left(T^{-2} \sum_{t=1}^{T} y_{t-1}^2\right)^{1/2}} \to_d \frac{\frac{\sigma^2}{2}\left(W(1)^2 - 1\right)}{\sigma^2 \left(\int_0^1 W(r)^2\,dr\right)^{1/2}} = \frac{\int_0^1 W(r)\,dW(r)}{\left(\int_0^1 W(r)^2\,dr\right)^{1/2}},$$
since $\hat{\sigma}^2 \to_p \sigma^2$.
Critical values have also been tabulated for this Dickey-Fuller distribution. Writing
$$\tau = \frac{\int_0^1 W(r)\,dW(r)}{\left(\int_0^1 W(r)^2\,dr\right)^{1/2}},$$
Table 2:
$\Pr[\tau < -2.58] = 0.01$    ($N(0,1)$ quantile: $-2.33$)
$\Pr[\tau < -2.23] = 0.025$   ($N(0,1)$ quantile: $-1.96$)
$\Pr[\tau < -1.95] = 0.05$    ($N(0,1)$ quantile: $-1.645$)
$\Pr[\tau < -1.62] = 0.10$    ($N(0,1)$ quantile: $-1.282$)
$\Pr[\tau < 0.89] = 0.90$     ($N(0,1)$ quantile: $1.282$)
$\Pr[\tau < 1.28] = 0.95$     ($N(0,1)$ quantile: $1.645$)
$\Pr[\tau < 2.00] = 0.99$     ($N(0,1)$ quantile: $2.33$)

Example 9: The following AR(1) model was fitted by OLS for $t = 1947Q2$ to $1989Q1$ ($T = 168$) for data on the US nominal 3-month T-bill rate:
$$\hat{i}_t = \underset{(0.010592)}{0.99694}\; i_{t-1},$$
where the figure in parentheses is the estimated standard error. We then find
$$T(\hat{\rho} - 1) = 168 \times (0.99694 - 1) = -0.51,$$
$$t_{DF} = (0.99694 - 1)/0.010592 = -0.29,$$
which are well above any of the (left tail) critical values in Tables 1 or 2, so we cannot reject $H_0: \rho = 1$ in favour of $H_1: \rho < 1$. We also cannot reject against $H_1: \rho > 1$.
2.3.1 Consistency under $H_1: |\rho| < 1$
Under the alternative the model is a stationary AR(1),
$$y_t = \rho y_{t-1} + u_t,$$
where $|\rho| < 1$. For this model we know that $\hat{\rho}$ is a consistent estimator of $\rho$, so that $\hat{\rho} \to_p \rho$. Under $H_1$, $\hat{\rho} - 1 \to_p \rho - 1 < 0$, since $|\rho| < 1$. Consequently $T(\hat{\rho} - 1)$ will diverge to $-\infty$, and so no matter which critical value $cv$ we choose, $\Pr\left[T(\hat{\rho} - 1) < cv\right] \to 1$ as $T \to \infty$. I.e. the test will reject with probability 1 when $H_1$ is true. Consistency of the t-statistic $t_{DF}$ follows in the same way as we showed when we looked at the power of the t-test in Introductory Econometrics (L11221).
2.3.2 The Initial Value/Condition
So far we have assumed $y_0 = 0$. Here we'll weaken this slightly and instead consider that either:
(a) $y_0 = c$, a constant, or
(b) $y_0$ has a specified distribution with finite variance, e.g. $y_0 \sim N(0, \sigma^2)$.
Notice that (b) includes (a) as a special case, and in turn the case $y_0 = 0$. We assume that $y_0$ is independent of $\{u_t\}_{t \ge 1}$. All other previous assumptions are maintained.
Consider
$$T(\hat{\rho} - 1) = \frac{T^{-1} \sum_{t=1}^{T} y_{t-1} u_t}{T^{-2} \sum_{t=1}^{T} y_{t-1}^2} = \frac{T^{-1}\left(y_0 u_1 + y_1 u_2 + \cdots + y_{T-1} u_T\right)}{T^{-2}\left(y_0^2 + y_1^2 + \cdots + y_{T-1}^2\right)}. \qquad (17)$$
The denominator of (17) satisfies
$$T^{-2} \sum_{t=1}^{T} y_{t-1}^2 = T^{-2} \sum_{t=1}^{T} \left(\sum_{j=1}^{t-1} u_j + y_0\right)^2 = T^{-2} \sum_{t=1}^{T} \left(S_{t-1}^2 + 2 S_{t-1} y_0 + y_0^2\right),$$
where $S_{t-1} = \sum_{j=1}^{t-1} u_j$. Consequently, using $S_T(r)$ and $X_T(r)$ as previously defined,
$$T^{-2} \sum_{t=1}^{T} y_{t-1}^2 = \int_0^1 S_T(r)\,dr + 2 y_0\, T^{-1/2} \int_0^1 T^{1/2} X_T(r)\,dr + \frac{y_0^2}{T},$$
and so, since $\int_0^1 T^{1/2} X_T(r)\,dr \to_d \sigma \int_0^1 W(r)\,dr$, the second and third terms vanish, and we have
$$T^{-2} \sum_{t=1}^{T} y_{t-1}^2 \to_d \sigma^2 \int_0^1 W(r)^2\,dr,$$
as we did before.
Similarly, the numerator of (17) is
$$T^{-1} \sum_{t=1}^{T} y_{t-1} u_t = T^{-1} \sum_{t=1}^{T} (S_{t-1} + y_0)\, u_t = T^{-1} \sum_{t=1}^{T} S_{t-1} u_t + y_0\, T^{-1/2}\left(T^{-1/2} \sum_{t=1}^{T} u_t\right).$$
A standard CLT shows that $T^{-1/2} \sum_{t=1}^{T} u_t \to_d N(0, \sigma^2)$, and hence the second term above vanishes as $T \to \infty$, meaning that, also as before,
$$T^{-1} \sum_{t=1}^{T} y_{t-1} u_t \to_d \frac{\sigma^2}{2}\left(W(1)^2 - 1\right) = \sigma^2 \int_0^1 W(r)\,dW(r).$$
The initial value (under assumption (a) or (b)) has no effect on the asymptotic distribution of $T(\hat{\rho} - 1)$, nor therefore on that of $t_{DF}$.

2.4 Augmented Dickey-Fuller Tests
So far we have assumed that $u_t \sim IID(0, \sigma^2)$, which of course is likely to be an unrealistic assumption in practice. Instead suppose that $\{y_t\}$ is generated by
$$y_t = \rho y_{t-1} + u_t, \qquad u_t = \sum_{i=1}^{p} \lambda_i u_{t-i} + e_t + \sum_{j=1}^{q} \theta_j e_{t-j},$$
with $e_t \sim IID(0, \sigma_e^2)$ and $E(e_t^2 - \sigma_e^2)^2 = \mu_4 < \infty$.
Thus $\{u_t\}$ is itself an ARMA(p, q), which we will assume to be both stationary and invertible. Invertibility implies that we can write $\{u_t\}$ as an AR($\infty$), and therefore $\{y_t\}$ can be written as
$$y_t = \rho y_{t-1} + \sum_{i=1}^{\infty} d_i u_{t-i} + e_t,$$
where the AR coefficients $\{d_i\}_{i=1}^{\infty}$ are functions of the original ARMA coefficients. Notice that the true order of the AR for $y_t$ is infinite if $q > 0$.

The unit root hypothesis remains $H_0: \rho = 1$, but when $H_0$ is true then $y_t$ is an ARIMA(p, 1, q) process, and so $u_t = y_t - y_{t-1} = \Delta y_t$. When $p$ and $q$ are unknown we approximate using
$$y_t = \rho y_{t-1} + \sum_{i=1}^{k} d_i \Delta y_{t-i} + e_t, \qquad (18)$$
known as the Augmented Dickey-Fuller regression. In (18) we allow $k$ to grow with the sample size, for example letting $k \to \infty$ as $T \to \infty$ but $k/T^{1/3} \to 0$.
OLS applied to (18) yields consistent estimators (at rate $T^{1/2}$) for the $\{d_i\}_{i=1}^{k}$, and the t-statistic for testing $H_0$ has the same asymptotic distribution as in the simpler case above, i.e.
$$t_{ADF} \to_d \frac{\int_0^1 W(r)\,dW(r)}{\left(\int_0^1 W(r)^2\,dr\right)^{1/2}}.$$
The distribution of $T(\hat{\rho} - 1)$ is not the same, however. In fact it can be shown that, under $H_0$,
$$\frac{T(\hat{\rho} - 1)}{1 - \sum_{i=1}^{k} \hat{d}_i} \to_d \frac{\int_0^1 W(r)\,dW(r)}{\int_0^1 W(r)^2\,dr}.$$
Formal derivations of these results are found in, for example, Hamilton (Section 17.1).


We call the tests tADF and T 1 the Augmented Dickey Fuller (ADF) tests,
in the sense that its the OLS regression of yt1 on yt augmented by lagged values of
{yti }ki=1 as k , subject to k/T 1/3 0. In practice T is not infinite and so we
need to choose a value of k for our regression. Typically we use;
(a) Information criteria, such as the Akaike or Bayesian Information Criteria (AIC,
BIC) seen before in Econometrics modules.




(b) Deterministic Rules, such as k = 4 (T /100)1/4 or k = 12 (T /100)1/4 - see
Schwert (1989).
(c) Data based Lag Selection, which involves a step-wise procedure in which we
initially choose a (large) value of k = kmax , for example one of those above, and then
use regression t-tests to test H0 : dkmax = 0. If we dont reject we decrease the number
of lags in the ADF regression and then test H0 : dkmax 1 = 0, if we continue to fail
to reject we keep on reducing the number of lags by one until we do reject. This
procedure is described fully in Ng and Perron (1995).
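A minimal sketch of the step-down procedure in (c) (numpy only; the function names are illustrative, and the 1.96 threshold is an assumed two-sided 5% normal critical value for the lag t-tests, not a value from the notes):

```python
import numpy as np

def adf_tstat(y, k):
    """OLS of y_t on y_{t-1} and k lagged differences; returns (coeffs, t-stats).
    The first t-stat tests rho = 1; the remaining ones test d_i = 0."""
    dy = np.diff(y)
    cols = [y[k:-1]] + [dy[k - i - 1: len(dy) - i - 1] for i in range(k)]
    X = np.column_stack(cols)
    yy = y[k + 1:]
    b = np.linalg.solve(X.T @ X, X.T @ yy)
    e = yy - X @ b
    s2 = e @ e / (len(yy) - X.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return b, np.r_[b[0] - 1, b[1:]] / se

def step_down(y, k_max, crit=1.96):
    """Step-down lag selection: reduce k until the longest lag is significant."""
    for k in range(k_max, 0, -1):
        _, t = adf_tstat(y, k)
        if abs(t[-1]) > crit:       # last lag significant: stop here
            return k
    return 0

rng = np.random.default_rng(5)
y = np.cumsum(rng.normal(size=300))          # a random walk, so H0 is true
k_max = int(np.floor(12 * (len(y) / 100) ** 0.25))
k = step_down(y, k_max)
print("selected k =", k, "; t_ADF =", adf_tstat(y, k)[1][0])
```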
Notice that running (18) implies losing $k + 1$ observations - since $\Delta y_{t-k}$ is only defined once $t$ reaches $k + 1$ - or, another way to think of it, is that we have an extra $k$ nuisance parameters ($\rho$ is the interest parameter) to estimate. One consequence of this is that the asymptotic critical values obtained from the Dickey-Fuller distributions may often not be accurate in finite samples.


3 Invariant Tests of a Unit Root
When hypothesis testing in the presence of nuisance parameters (i.e. parameters that are not specified by the null hypothesis - e.g. the $d_i$'s in the ADF regression) we need to ensure that our test statistics have (at least asymptotic) distributions which do not depend on these nuisance parameters at all.
Feasible tests whose distributions do not depend on nuisance parameters are said to be similar or invariant. As far as we are concerned these two terms mean the same thing.
For example, since the Dickey-Fuller distributions don't depend on $y_0$ under the assumptions (a) and (b) above, we can say the tests are asymptotically invariant with respect to $y_0$. They are not exactly invariant, though.
If we want to make our tests exactly invariant with respect to $y_0$ then this can be achieved by including an intercept term in the test regression, i.e.
$$y_t = \alpha + \rho y_{t-1} + u_t, \qquad u_t \sim IID(0, \sigma^2), \qquad (19)$$
so that we regress $y_t$ on a constant and $y_{t-1}$.


The limiting null distributions of the resulting test statistics are different, however, from what we derived previously. Specifically,
$$T(\hat{\rho} - 1) \to_d \frac{\frac{1}{2}\left(W(1)^2 - 1\right) - W(1) \int_0^1 W(r)\,dr}{\int_0^1 W(r)^2\,dr - \left(\int_0^1 W(r)\,dr\right)^2} = \frac{\int_0^1 \bar{W}(r)\,dW(r)}{\int_0^1 \bar{W}(r)^2\,dr},$$
where $\bar{W}(r) = W(r) - \int_0^1 W(s)\,ds$ is de-meaned Brownian motion. (Note that $\hat{\rho}$ above isn't the same estimator as in the case with no intercept.)
We derive the test statistics in the following way. First regress $y_t$ on the constant. The (OLS) estimator for $\alpha$ is $\bar{y} = T^{-1} \sum_{t=1}^{T} y_t$. Consequently we define the residuals of this regression by
$$\tilde{u}_t = y_t - \bar{y} = y_t - T^{-1} \sum_{s=1}^{T} y_s.$$
Under $H_0: \rho = 1$ and $y_t = y_{t-1} + u_t$, we have
$$\tilde{u}_t = \sum_{j=1}^{t} u_j + y_0 - T^{-1} \sum_{s=1}^{T} \left(\sum_{j=1}^{s} u_j + y_0\right) = \sum_{j=1}^{t} u_j - T^{-1} \sum_{s=1}^{T} \sum_{j=1}^{s} u_j,$$
which clearly does not involve $y_0$ at all. If we also divide by $T^{1/2}$, then (taking $t = \lfloor Tr \rfloor$)
$$T^{-1/2}\, \tilde{u}_{\lfloor Tr \rfloor} = T^{-1/2} \sum_{j=1}^{\lfloor Tr \rfloor} u_j - T^{-3/2} \sum_{s=1}^{T} \sum_{j=1}^{s} u_j \to_d \sigma\left(W(r) - \int_0^1 W(s)\,ds\right) = \sigma \bar{W}(r).$$

We can pursue this line further. Now suppose that the data itself is generated by
$$(1 - \rho L)(y_t - \mu - \beta t) = \varepsilon_t, \qquad (20)$$
then it turns out that the asymptotic distributions of tests generated from the regression (19) will depend upon the value of $\beta$.
We can rewrite (20) in the following way,
$$y_t = \rho y_{t-1} + \mu(1 - \rho) + \beta t - \beta\rho(t - 1) + \varepsilon_t = \rho y_{t-1} + \alpha^* + \beta^* t + \varepsilon_t,$$
where $\alpha^* = \mu(1 - \rho) + \beta\rho$ and $\beta^* = \beta(1 - \rho)$. Consequently when $\rho = 1$, then $\alpha^* = \beta$ and $\beta^* = 0$, so the model is a random walk with drift. If we ignore the presence of $\beta$ (which is effectively what (19) does) then the resulting tests are useless. It can be shown that such tests have zero asymptotic power and are therefore not consistent tests.
We can obtain consistent and also invariant tests (with respect to all of the nuisance parameters) simply by including a constant and a linear trend in the test regression,
$$y_t = \rho y_{t-1} + \alpha + \beta t + u_t,$$
i.e. we regress $y_t$ on a constant, linear trend and $y_{t-1}$. As above, this is achieved by obtaining residuals from a regression of $y_t$ on the constant and trend. The resulting unit root test statistics can be shown to have the following null asymptotic distributions:
$$T(\hat{\rho} - 1) \to_d \frac{\int_0^1 \hat{W}(r)\,dW(r)}{\int_0^1 \hat{W}(r)^2\,dr}, \qquad t_{DF} \to_d \frac{\int_0^1 \hat{W}(r)\,dW(r)}{\left(\int_0^1 \hat{W}(r)^2\,dr\right)^{1/2}},$$
where
$$\hat{W}(r) = \bar{W}(r) - 12\left(r - \frac{1}{2}\right) \int_0^1 \left(s - \frac{1}{2}\right) W(s)\,ds$$
is de-meaned and de-trended Brownian motion.
We began with the simplest case of no constant and no trend, then considered introducing a constant, and finally had both a constant and trend. Notice that the asymptotic distributions of the resulting unit root tests all have essentially the same form - the only difference being whether or not we are de-meaning (including a constant) and de-trending (also including a trend) the Brownian motion.


The effect of this on the critical values can be seen in the following table:
Table 3
                                            1%      2.5%    5%      10%
No Constant, No Trend   $t_{DF}$            -2.58   -2.23   -1.95   -1.62
No Constant, No Trend   $T(\hat{\rho}-1)$   -13.8   -10.5   -8.10   -5.70
Constant, No Trend      $t_{DF}$            -3.43   -3.12   -2.86   -2.57
Constant, No Trend      $T(\hat{\rho}-1)$   -20.7   -16.9   -14.1   -11.3
Constant, Trend         $t_{DF}$            -3.96   -3.66   -3.41   -3.12
Constant, Trend         $T(\hat{\rho}-1)$   -29.5   -25.1   -21.8   -18.3
Notice that the effect of de-meaning, and then also de-trending, is to shift the critical values to the left.
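A sketch of how the three specifications can be computed in practice (numpy only; following the text, the de-meaned and de-trended variants replace $y_t$ by residuals from a preliminary OLS regression before running the Dickey-Fuller regression; the function and option names are illustrative assumptions):

```python
import numpy as np

def df_stats(y, det="none"):
    """Dickey-Fuller normalized bias and t-statistic, with deterministic
    terms removed first: det in {"none", "constant", "trend"}."""
    T = len(y) - 1
    if det != "none":
        t = np.arange(len(y))
        X = (np.ones((len(y), 1)) if det == "constant"
             else np.column_stack([np.ones(len(y)), t]))
        y = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]   # residuals
    y0, y1 = y[:-1], y[1:]
    rho = (y1 @ y0) / (y0 @ y0)
    s2 = ((y1 - rho * y0) ** 2).sum() / (T - 1)
    return T * (rho - 1), (rho - 1) / np.sqrt(s2 / (y0 @ y0))

rng = np.random.default_rng(9)
y = np.cumsum(rng.normal(size=250)) + 5.0    # random walk with a non-zero start
for det in ("none", "constant", "trend"):
    nb, t = df_stats(y, det)
    print(f"{det:8s}: T(rho-1) = {nb:6.2f}, t = {t:5.2f}")
# Each statistic must be compared with its own row of critical values in Table 3.
```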
Many (numerical) studies have been made of the finite sample size and power properties of these unit root tests - see for example the papers by Schwert (1989) or Ng and Perron (1995). One striking finding is how much less power there is when we include a trend. It is not immediately apparent from the limiting distributions described above why the effect on power should be so dramatic.
Above, tests which are invariant to a constant and trend were constructed from residuals obtained from OLS estimation of the data on the constant and trend. Recall from Econometrics I and Advanced Econometric Theory the Normal linear regression model
$$y = XB + u,$$
where $X$ is an $n \times k$ matrix of explanatory variables and $B$ is a $k \times 1$ vector of parameters. The OLS estimator is $\hat{B} = (X'X)^{-1}X'y$ and the residuals are
$$\hat{u} = y - X\hat{B} = y - X(X'X)^{-1}X'y = My,$$
where $M = I - X(X'X)^{-1}X'$.
Now suppose that the data are generated by (20), which we can rewrite as
$$y_t = \mu + \beta t + u_t, \qquad u_t = \rho u_{t-1} + \varepsilon_t, \qquad (21)$$
with $\varepsilon_t \sim N(0, \sigma^2)$, so that in terms of the linear regression model
$$X = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ \vdots & \vdots \\ 1 & T \end{pmatrix}, \qquad B = \begin{pmatrix} \mu \\ \beta \end{pmatrix};$$
then we would construct the DF unit root tests from the residuals, e.g. $\hat{\rho} = \sum_{t=2}^{T} \hat{u}_t \hat{u}_{t-1} / \sum_{t=2}^{T} \hat{u}_{t-1}^2$.

It actually turns out that every test which is invariant with respect to both $\mu$ (and so $y_0$) and $\beta$ can be constructed from the elements of the $(T - k)$-dimensional vector
$$v = \frac{C'\hat{u}}{\sqrt{\hat{u}'\hat{u}}},$$
where we have decomposed $M = CC'$ and $C$ is a $T \times (T - k)$ matrix also satisfying $C'C = I_{T-k}$.
Then, according to Marsh (2007), $v$ will have a density function which depends only upon the parameter $\rho$. Call this density $f_v(\rho)$. Recall from Advanced Econometric Theory the Cramer-Rao Lower Bound, which states (in the current notation)
$$Var\left(\hat{\rho}(v)\right) \ge I_v(\rho)^{-1},$$
where $\hat{\rho}(v)$ is any unbiased estimator of $\rho$ and $I_v(\rho) = E\left[\left(\frac{d \ln f_v(\rho)}{d\rho}\right)^2\right]$ is the Fisher information. In fact the CRLB is a special case of a more fundamental bound, which states that if $z(v)$ is ANY statistic with mean $E[z(v)] = \psi(\rho)$ then
$$Var[z(v)] \ge \left(\frac{d\psi(\rho)}{d\rho}\right)^2 I_v(\rho)^{-1}.$$
That is, Fisher information represents a fundamental measure of precision for any statistic at all, whether estimator or, as is of interest here, test. Note that the unit root hypothesis is $H_0: \rho = 1$. Marsh (2007) proves that for the model (21)
$$I_v(1) = 0.$$
That is, there is NO information in any statistic which is invariant to a linear trend at the very parameter value (i.e. at $H_0$) that we are interested in, and the variance of ANY such statistic will therefore be unbounded.
There are many different trends that we could use rather than just the linear one, such as $t^2$, $\log(t)$, $\sqrt{t}$ or $e^t$. As far as I'm aware it is only the linear trend which does this. This also illustrates why it is so difficult to tell the difference between linear trends and stochastic ones, as described at the beginning of these notes.
Obviously this highlights the need to use a linear trend only when one is strictly necessary, and procedures have been developed to try and ensure this is the case - see for instance Harvey, Leybourne and Taylor (2009), who propose methods to do exactly that.

4 Spurious Regression
It is vitally important that we detect whether time series have unit roots (i.e. stochastic trends) because of the possibility of obtaining spurious results when we regress one on another. Here we focus on the case of a regression involving two independent $I(1)$ variables. Later on in this module it will be shown that some linear combination of $I(1)$ variables may yield an $I(0)$ variable - this is the case of co-integration, and is what Sir Clive Granger won his Nobel prize for.
First, though, consider two $I(1)$ variables $\{y_t\}_{t=0}^{\infty}$ and $\{x_t\}_{t=0}^{\infty}$ generated as
$$y_t = y_{t-1} + u_t, \qquad u_t \sim IID(0, \sigma_u^2), \qquad y_0 = 0,$$
$$x_t = x_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim IID(0, \sigma_\varepsilon^2), \qquad x_0 = 0,$$
with $E[u_t \varepsilon_s] = 0$ for all $s$ and $t$.

Now consider the regression model
$$y_t = \alpha + \beta x_t + e_t. \qquad (22)$$
Since $E[u_t \varepsilon_s] = 0$ for all $s$ and $t$ implies $\{y_t\}_{t=0}^{\infty}$ and $\{x_t\}_{t=0}^{\infty}$ are independent, the slope of this regression should be zero, i.e. $\beta = 0$. But is it true that OLS estimators and tests find this? E.g. does $\hat{\beta} \to_p 0$?
Define $W_u(r)$ and $W_\varepsilon(r)$ as the independent Brownian motions obtained from cumulating and scaling the $\{u_t\}$ and $\{\varepsilon_t\}$, exactly as we did previously. Also let $\bar{x}$ and $\bar{y}$ be the sample means of the $\{x_t\}$ and $\{y_t\}$ series. The OLS estimator is then

$$\hat{\beta} = \frac{\sum_{t=1}^{T} y_t (x_t - \bar{x})}{\sum_{t=1}^{T} (x_t - \bar{x})^2} = \frac{T^{-2} \sum_{t=1}^{T} y_t x_t - \left(T^{-1/2}\bar{y}\right)\left(T^{-1/2}\bar{x}\right)}{T^{-2} \sum_{t=1}^{T} x_t^2 - \left(T^{-1/2}\bar{x}\right)^2}. \qquad (23)$$

We can use all of our previous results to find:
$$T^{-1/2} \bar{y} \to_d \sigma_u \int_0^1 W_u(r)\,dr, \qquad (24)$$
$$T^{-1/2} \bar{x} \to_d \sigma_\varepsilon \int_0^1 W_\varepsilon(r)\,dr, \qquad (25)$$
$$T^{-2} \sum_{t=1}^{T} x_t^2 \to_d \sigma_\varepsilon^2 \int_0^1 W_\varepsilon(r)^2\,dr, \qquad (26)$$
while for $T^{-2} \sum_{t=1}^{T} y_t x_t$ we can easily generalize to find
$$T^{-2} \sum_{t=1}^{T} y_t x_t \to_d \sigma_u \sigma_\varepsilon \int_0^1 W_u(r)\, W_\varepsilon(r)\,dr. \qquad (27)$$

If we then apply the limits in (24) to (27) to (23), then via the CMT we have
$$\hat{\beta} \to_d \frac{\sigma_u}{\sigma_\varepsilon} \cdot \frac{\int_0^1 W_u(r)\, W_\varepsilon(r)\,dr - \left(\int_0^1 W_u(r)\,dr\right)\left(\int_0^1 W_\varepsilon(r)\,dr\right)}{\int_0^1 W_\varepsilon(r)^2\,dr - \left(\int_0^1 W_\varepsilon(r)\,dr\right)^2} =: \lambda, \qquad (28)$$
say. In addition,
$$T^{-1/2} \hat{\alpha} = T^{-1/2} \bar{y} - \hat{\beta}\, T^{-1/2} \bar{x}, \qquad (29)$$
and so we immediately find
$$T^{-1/2} \hat{\alpha} \to_d \sigma_u \int_0^1 W_u(r)\,dr - \lambda\, \sigma_\varepsilon \int_0^1 W_\varepsilon(r)\,dr. \qquad (30)$$
To summarize these results: (28) demonstrates that $\hat{\beta}$ converges to a well-defined random variable in the limit, i.e. it does not converge to 0. That is, $\hat{\beta}$ is not a consistent estimator of $\beta$ in this context. In addition, a standard t-test of $H_0: \beta = 0$ can be shown to diverge as $T \to \infty$. That is, as the sample size becomes infinite we will reject $H_0: \beta = 0$, using such a test, with probability 1. These findings lead, inevitably, to spurious inference about the existence of a relationship between $y_t$ and $x_t$.
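The non-degeneracy in (28) is easy to see by simulation. A minimal sketch (numpy only; sizes and seed are arbitrary assumptions) shows that the spread of $\hat{\beta}$ across replications does not shrink as $T$ grows, unlike in a standard regression setting where it would fall at rate $T^{-1/2}$.

```python
import numpy as np

rng = np.random.default_rng(13)
R = 5_000

for T in (100, 400, 1_600):
    u = rng.normal(size=(R, T))
    e = rng.normal(size=(R, T))
    y, x = np.cumsum(u, axis=1), np.cumsum(e, axis=1)   # independent random walks
    xc = x - x.mean(axis=1, keepdims=True)               # demeaned regressor
    beta = (y * xc).sum(axis=1) / (xc**2).sum(axis=1)    # OLS slope per replication
    print(f"T = {T:5d}: std of beta_hat across replications = {beta.std():.3f}")
```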

4.1 Possible cures for spurious regressions
(1) Include lags of $x_t$ and $y_t$ in (22) - see for example Hamilton (1994, p. 561).
(2) Difference the data before estimation, i.e. run the regression
$$\Delta y_t = \alpha + \beta \Delta x_t + u_t.$$
This yields $T^{1/2}$-consistent estimators for $\alpha$ and $\beta$, and we can apply standard asymptotic theory, i.e. the limit distributions are Normal. However, we still do need to check whether $x_t$ and $y_t$ are $I(1)$ - differencing is not a good idea if they are not. We also lose one of the benefits of dealing with non-stationary data, which is that we have much faster rates of convergence than with stationary data.
(3) We can estimate (22) by Generalized Least Squares, assuming the errors have first-order autocorrelation. The resulting estimators $\hat{\alpha}_{GLS}$ and $\hat{\beta}_{GLS}$ are asymptotically equal in distribution to the estimators obtained from suggestion (2), provided that both series are $I(1)$ and not co-integrated.

