
Statistics 730

Applied Time Series Analysis

Fall 2011

Professor Peter Bloomfield


email: Peter Bloomfield@ncsu.edu

http://www.stat.ncsu.edu/people/bloomfield/courses/st730/

Characteristics of Time Series

A time series is a collection of observations made at different times on a given system.

For example:
Earnings per share of Johnson and Johnson stock (quarterly);
Global temperature anomalies from 1856-1997 (annual);
Investment returns on the New York Stock Exchange (daily).

Digression: Retrieving the Data Using R


jj = scan("http://www.stat.pitt.edu/stoffer/tsa2/data/jj.dat");
jj = ts(jj, frequency = 4, start = c(1960, 1));
plot(jj);
globtemp = scan("http://www.stat.pitt.edu/stoffer/tsa2/data/globtemp.dat");
globtemp = ts(globtemp, start = 1856);
plot(globtemp);
nyse = scan("http://www.stat.pitt.edu/stoffer/tsa2/data/nyse.dat");
nyse = ts(nyse);
plot(nyse);

Correlation

Time series data are almost always correlated with each other, i.e., autocorrelated.

We may want to exploit that correlation, or merely to cope with it.

Exploiting Correlation: Forecasting


Suppose Y_t is the t-th observation, and we observe Y_0, Y_1, ..., Y_{n-1}. What can we say about Y_n?
If we know the correlation structure, or more precisely the joint distribution, of Y_0, Y_1, ..., Y_{n-1}, Y_n, then we calculate the conditional distribution of Y_n | Y_0, Y_1, ..., Y_{n-1}.
The conditional mean is the best forecast of Y_n, and the conditional standard deviation is the root-mean-square forecast error. If the conditional distribution is normal, we can use them to make probability statements about Y_n.

Coping with Correlation: Regression


Suppose instead that Y_t is related to a covariate x_t, and we are interested in the regression of Y_t on x_t.
Because the Y's are correlated, we should not use Ordinary Least Squares to fit the regression.
If we knew the correlation structure, we would use Generalized Least Squares.
Usually we don't know it, so we must estimate it, typically using a parsimonious parametric model.
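As a concrete sketch (not from the notes; it assumes the nlme package and an illustrative simulated data frame), generalized least squares with an AR(1) error model might look like this:

# Sketch: regression with AR(1)-correlated errors via generalized least
# squares.  The data frame d is hypothetical; nlme::gls and corAR1 are
# standard R functions.
library(nlme)
d <- data.frame(x = 1:100,
                y = 0.5 * (1:100) + arima.sim(list(ar = 0.7), 100))
fit.ols <- lm(y ~ x, data = d)                    # ignores the correlation
fit.gls <- gls(y ~ x, data = d,
               correlation = corAR1(form = ~ 1))  # models it
summary(fit.gls)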

Time Domain and Frequency Domain

Methods that focus on how a time series evolves from one time to the next are called time domain methods.
Some graphs (e.g. residuals of global temperatures from a
quadratic trend) suggest the possibility of waves in the data:
l = lm(globtemp ~ time(globtemp) + I(time(globtemp)^2));
plot(globtemp - fitted(l));

Since a wave is described in terms of its period, or alternatively its frequency, methods that measure the waves in a time series are called frequency domain methods.

Statistical Models

The primary objective of time series analysis is to develop mathematical models that provide plausible descriptions for sample data. . .

We model a time series as a collection of random variables: x_1, x_2, x_3, ..., or more generally {x_t, t ∈ T}.

Often the phenomenon being observed evolves in continuous time, but our observations are always discrete samples.

If the sampling times t_1, t_2, ... are equally spaced, their separation Δt = t_n − t_{n−1} is the sampling interval and 1/Δt is the sampling rate (samples per unit time).

Choice of sampling rate affects all aspects of data collection, analysis, and interpretation.

Example: White Noise

Uncorrelated random variables w_t with mean 0 and variance σ_w², written w_t ~ wn(0, σ_w²).
Why white noise?
By analogy with white light: in the frequency domain, all
frequencies are present with the same strength.

If in addition the w's are independent and identically distributed, we write w_t ~ iid(0, σ_w²).

Iid White Noise


# t-distributed with 3 degrees of freedom:


w = ts(rt(500, df = 3));
plot(w);


If in addition the w's are normally distributed, we write w_t ~ iid N(0, σ_w²).

Iid Normal White Noise


w = ts(rnorm(500));
plot(w);


Example: Moving Average

Many observed series are smoother than white noise.

Possible model:
v_t = (w_{t−1} + w_t + w_{t+1}) / 3

Moving Average

w = ts(rnorm(500));
v = filter(w, sides = 2, rep(1, 3) / 3);
plot(v);


Averaging attenuates the faster oscillations, leaving the slower oscillations more apparent.

More generally, a weighted average of 2, 3, or more noise terms.

Example: Autoregression

Recursive model:
x_t = x_{t−1} − 0.9 x_{t−2} + w_t,    t = 1, 2, ..., 500

Like a regression equation, but the RHS contains past (lagged) LHS variables, hence autoregression.

Shows many different types of behavior for different choices of coefficients.


Autoregression

w = ts(rnorm(500));
v = filter(w, filter = c(1, -0.9), method = "recursive");
plot(v);


Example: Random Walk

One model for trend; recursive definition:
x_t = δ + x_{t−1} + w_t

Explicitly:
x_t = δt + Σ_{j=1}^{t} w_j

δ is the drift (per unit time).

Random Walk


# drift delta = 0.2 per sample:


x = ts(cumsum(rnorm(500) + 0.2));
plot(x);


The white noise we build it from could be non-normal.


Non-Normal Random Walk


# t-distributed increments, 1 degree of freedom, no drift:


x = ts(cumsum(rt(500, df = 1)));
plot(x);


Example: Signal in Noise


Sine-wave signal:
x_t = 2 cos(2πt/50 + 0.6π) + w_t,    t = 1, 2, ..., 500

More generally, the wave term could be
A cos(2πωt + φ),
where:
A is amplitude;
ω is frequency (in cycles per unit time);
φ is phase (in this case, 0.6π radians).

Cosine wave signal plus noise


w = ts(rnorm(500));
x = 2 * cos(2 * pi * time(w) / 50 + 0.6 * pi) + w;
plot(x);


Means

Recall: we model a time series as a collection of random variables: x_1, x_2, x_3, ..., or more generally {x_t, t ∈ T}.

The mean function is
μ_{x,t} = E(x_t) = ∫ x f_t(x) dx
where the expectation is for the given t, across all the possible values of x_t. Here f_t(·) is the pdf of x_t.

Example: Moving Average

w_t is white noise, with E(w_t) = 0 for all t.

The moving average is
v_t = (w_{t−1} + w_t + w_{t+1}) / 3,
so
μ_{v,t} = E(v_t) = [E(w_{t−1}) + E(w_t) + E(w_{t+1})] / 3 = 0.
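A quick simulation check (a sketch, not part of the original notes): averaging many independent replications of the moving-average series should give an estimated mean function near zero at every t.

# Sketch: estimate the mean function of the 3-point moving average by
# averaging across many simulated realizations.
set.seed(1)
nrep <- 1000; n <- 100
vmat <- replicate(nrep, {
  w <- rnorm(n + 2)
  filter(w, rep(1, 3) / 3, sides = 2)[2:(n + 1)]
})
mu.hat <- rowMeans(vmat)   # estimated mean function, one value per t
range(mu.hat)              # should be close to 0 everywhere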

Moving Average Model with Mean Function


Example: Random Walk with Drift

The random walk with drift is
x_t = δt + Σ_{j=1}^{t} w_j,
so
μ_{x,t} = E(x_t) = δt + Σ_{j=1}^{t} E(w_j) = δt,
a straight line with slope δ.


Random Walk Model with Mean Function


Example: Signal Plus Noise

The signal plus noise model is
x_t = 2 cos(2πt/50 + 0.6π) + w_t,
so
μ_{x,t} = E(x_t) = 2 cos(2πt/50 + 0.6π) + E(w_t) = 2 cos(2πt/50 + 0.6π),
the (cosine wave) signal.


Signal-Plus-Noise Model with Mean Function


Covariances
The autocovariance function is, for all s and t,
γ_x(s, t) = E[(x_s − μ_{x,s})(x_t − μ_{x,t})]

Symmetry: γ_x(s, t) = γ_x(t, s).


Smoothness:
if a series is smooth, nearby values will be very similar,
hence the autocovariance will be large;
conversely, for a choppy series, even nearby values may
be nearly uncorrelated.

Example: White Noise

If w_t is white noise wn(0, σ_w²), then
γ_w(s, t) = E(w_s w_t) = σ_w² if s = t, and 0 if s ≠ t:
definitely choppy!

Autocovariances of White Noise


Example: Moving Average

The moving average is
v_t = (w_{t−1} + w_t + w_{t+1}) / 3
and E(v_t) = 0, so
γ_v(s, t) = E(v_s v_t)
= (1/9) E[(w_{s−1} + w_s + w_{s+1})(w_{t−1} + w_t + w_{t+1})]
= (3/9) σ_w²  if s = t,
  (2/9) σ_w²  if |s − t| = 1,
  (1/9) σ_w²  if |s − t| = 2,
  0           otherwise.
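A quick check (a sketch, not from the notes): the 3-point average is an MA(2) with coefficients (1, 1), so its autocorrelations should be 1, 2/3, 1/3, 0, ..., matching the covariances above.

# Sketch: theoretical ACF of the 3-point moving average, plus a
# long-simulation comparison.
ARMAacf(ma = c(1, 1), lag.max = 4)
w <- rnorm(1e5)
v <- filter(w, rep(1, 3) / 3, sides = 2)
acf(na.omit(v), lag.max = 4, plot = FALSE)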

Autocovariances of Moving Average


Example: Random Walk


The random walk with zero drift is
x_t = Σ_{j=1}^{t} w_j
and E(x_t) = 0, so
γ_x(s, t) = E(x_s x_t) = E[(Σ_{j=1}^{s} w_j)(Σ_{k=1}^{t} w_k)] = min{s, t} σ_w².

Autocovariances of Random Walk


Notes:
For the first two models, γ_x(s, t) depends on s and t only through |s − t|, but for the random walk γ_x(s, t) depends on s and t separately.
For the first two models, the variance γ_x(t, t) is constant, but for the random walk γ_x(t, t) = t σ_w² increases indefinitely as t increases.

Correlations

The autocorrelation function (ACF) is
ρ(s, t) = γ(s, t) / √(γ(s, s) γ(t, t))

Measures the linear predictability of x_t given only x_s.

Like any correlation, −1 ≤ ρ(s, t) ≤ 1.

Across Series

For a pair of time series x_t and y_t, the cross-covariance function is
γ_{x,y}(s, t) = E[(x_s − μ_{x,s})(y_t − μ_{y,t})]

The cross-correlation function (CCF) is
ρ_{x,y}(s, t) = γ_{x,y}(s, t) / √(γ_x(s, s) γ_y(t, t))

Stationary Time Series


Basic idea: the statistical properties of the observations do
not change over time.
Two specific forms: strong (or strict) stationarity and weak
stationarity.
A time series x_t is strongly stationary if the joint distribution of every collection of values {x_{t_1}, x_{t_2}, ..., x_{t_k}} is the same as that of the time-shifted values {x_{t_1+h}, x_{t_2+h}, ..., x_{t_k+h}}, for every dimension k and shift h.
Strong stationarity is hard to verify.

If {x_t} is strongly stationary, then for instance:

k = 1: the distribution of x_t is the same as that of x_{t+h}, for any h;
in particular, if we take h = −t, the distribution of x_t is the same as that of x_0;
that is, every x_t has the same distribution;

k = 2: the joint (bivariate) distribution of (x_s, x_t) is the same as that of (x_{s+h}, x_{t+h}), for any h;
in particular, if we take h = −t, the joint distribution of (x_s, x_t) is the same as that of (x_{s−t}, x_0);
that is, the joint distribution of (x_s, x_t) depends on s and t only through s − t;

and so on...

A time series x_t is weakly stationary if:
the mean function μ_t is constant; that is, every x_t has the same mean;
the autocovariance function γ(s, t) depends on s and t only through their difference |s − t|.
Weak stationarity depends only on the first and second moment functions, so it is also called second-order stationarity.
Strongly stationary (plus finite variance) implies weakly stationary.
Weakly stationary does not imply strongly stationary (unless some other property implies it, like normality of all joint distributions).

Simplifications

If x_t is weakly stationary, cov(x_{t+h}, x_t) depends on h but not on t, so we write the autocovariances as
γ(h) = cov(x_{t+h}, x_t)

Similarly corr(x_{t+h}, x_t) depends only on h, and can be written
ρ(h) = γ(t + h, t) / √(γ(t + h, t + h) γ(t, t)) = γ(h) / γ(0).

Examples

White noise is weakly stationary.

A moving average is weakly stationary.

A random walk is not weakly stationary.


Estimating Means and Covariances

In other statistical applications, means, variances, and covariances are estimated by averaging across samples.

In time series, we often have only one realization.

Stationarity allows us to estimate moments anyway.

Mean
If x_t is stationary, μ_t = E(x_t) = μ, so we can estimate μ by the sample mean
x̄ = (1/n) Σ_{t=1}^{n} x_t.
We could also use a weighted mean
Σ_{t=1}^{n} w_t x_t,  where  Σ_{t=1}^{n} w_t = 1.

Both are unbiased; usually some weighted mean has smaller variance than x̄, but not much smaller.

Autocovariance

Similarly, if x_t is stationary, γ(t + h, t) = cov(x_{t+h}, x_t) = γ(h), so we can estimate γ(h) by
γ̂(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(x_t − x̄)
for h = 0, 1, ..., n − 1, with γ̂(−h) = γ̂(h).

We estimate the autocorrelation function (ACF) by
ρ̂(h) = γ̂(h) / γ̂(0).
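A small sketch (not from the notes) comparing the formula with R's acf(), which uses the same divisor n:

# Sketch: the sample autocovariance at lag h, computed directly, matches
# acf(..., type = "covariance"), which also divides by n.
set.seed(2)
x <- arima.sim(list(ar = 0.6), n = 200)
h <- 3
xbar <- mean(x)
gamma.hat <- sum((x[(1 + h):200] - xbar) * (x[1:(200 - h)] - xbar)) / 200
c(direct = gamma.hat,
  acf = acf(x, type = "covariance", plot = FALSE)$acf[h + 1])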

Sampling Properties

x̄ is unbiased for μ.

γ̂(h) is not unbiased for γ(h), but
(1/(n − h)) Σ_{t=1}^{n−h} (x_{t+h} − μ)(x_t − μ)
would be. Note:
(n − h) denominator instead of n;
centering at μ instead of x̄.

Non-negative Definiteness
The covariance matrix of (x_1, x_2, ..., x_k)' is
Γ_k = [ γ(0)    γ(1)    ...  γ(k−1)
        γ(1)    γ(0)    ...  γ(k−2)
        ...     ...     ...  ...
        γ(k−1)  γ(k−2)  ...  γ(0)   ]
and, as a covariance matrix, is non-negative definite:
a' Γ_k a = var(a_1 x_1 + a_2 x_2 + ... + a_k x_k) ≥ 0
for any vector of constants a = (a_1, a_2, ..., a_k)'.
With the above definition of γ̂(h), Γ̂_k is also non-negative definite; that would not be true if we divided by (n − h).

Another Sampling Property

If x_t is white noise and n is large and some mild conditions hold, ρ̂(h) is approximately normal with zero mean and standard deviation
σ_{ρ̂(h)} = 1/√n.

So we can look for autocorrelations outside ±2/√n as evidence of autocorrelation.

R Examples
White noise:
acf(ts(rnorm(100)));

Southern Oscillation Index and fish recruitment:


soi = scan("http://www.stat.pitt.edu/stoffer/tsa2/data/soi.dat");
soi = ts(soi, start = 1950, frequency = 12);
recruit = scan("http://www.stat.pitt.edu/stoffer/tsa2/data/recruit.dat");
recruit = ts(recruit, start = 1950, frequency = 12);
acf(soi, 50);
acf(recruit, 50);
ccf(soi, recruit, 50);
# Negative lags indicate SOI leads recruitment.

Interpreting the Cross-Correlation

help(ccf) states: "The lag k value returned by ccf(x, y) estimates the correlation between x[t+k] and y[t]."

So the graph shows negative correlation between SOI(t − 5 to 9 months) and recruit(t).

That is, current recruitment is (negatively) correlated with SOI from 5-9 months ago.

SAS Example
Southern Oscillation Index and fish recruitment:
options pagesize = 80;
data soi;
infile 'soi.dat';
input soi;
run;
data recruit;
infile 'recruit.dat';
input recruit;
run;

data both;
time + 1;
merge soi recruit;
run;
proc gplot data = both;
symbol i = join;
plot (soi recruit) * time;
run;
proc arima data = both;
title 'SOI and recruitment';
identify var = soi nlag = 50;
identify var = recruit crosscorr = soi nlag = 50;
/* Positive lags indicate SOI leads recruitment. */
run;

SAS program and output.



Seasonality in the SOI


The ACF of the SOI suggests that x_t has a correlation of around 0.4 with x_{t+12}, x_{t+24}, and so on.
This correlation is caused by the fact that those values all fall in the same month of the year, and different months have different means.
That is, this series has a non-constant mean function μ_t.
Since it is non-stationary in the mean, the sample ACF does not estimate the population ACF, and the graph has no meaning.

We can estimate μ_t and subtract it, to give a series with zero mean.

The simplest way is to subtract the mean for a given month of the year from all data for that month.
In R (in SAS, use corresponding proc glm):
soiSA = residuals(lm(soi ~ factor(cycle(soi))));
# transfer the time series structure of soi to soiSA:
soiSA = ts(soiSA, start = start(soi), frequency = frequency(soi));
acf(soiSA, lag = 50);


The ACF graph now shows correlation dropping progressively from around 0.5 at a one-month lag to zero at one year.

The CCF of soiSA and recruit shows correspondingly simpler structure.

Frequency-domain methods will show that the recruitment series also has some seasonality, but with much weaker effects.

Replacing recruit with a corresponding recruitSA makes negligible changes to the ACF and CCF.

Frequency-domain methods will also show that the seasonal effects in SOI consist largely of an annual sine wave.

Instead of estimating 12 separate monthly means, we can fit, and remove, a three-parameter model
μ_t = β_0 + β_1 cos(2πt/12) + β_2 sin(2πt/12).
In R:
soiCS = residuals(lm(soi ~ cos(2 * pi * time(soi)) +
sin(2 * pi * time(soi))));
soiCS = ts(soiCS, start = start(soi), frequency = frequency(soi));
acf(soiCS, lag = 50);

Vector-Valued Series: Notation

Studies of time series data often involve p > 1 series.

E.g. Southern Oscillation Index and recruitment in a fish population (p = 2).

Treated as a p × 1 column vector:
x_t = (x_{t,1}, x_{t,2}, ..., x_{t,p})'

Mean Vector

Assume jointly weakly stationary.

Mean vector:
μ = E(x_t) = (E(x_{t,1}), E(x_{t,2}), ..., E(x_{t,p}))' = (μ_1, μ_2, ..., μ_p)'

Autocovariance Matrix

Autocovariance matrix contains individual autocovariances on the diagonal and cross-covariances off the diagonal:
Γ(h) = E[(x_{t+h} − μ)(x_t − μ)']
     = [ γ_{1,1}(h)  γ_{1,2}(h)  ...  γ_{1,p}(h)
         γ_{2,1}(h)  γ_{2,2}(h)  ...  γ_{2,p}(h)
         ...         ...         ...  ...
         γ_{p,1}(h)  γ_{p,2}(h)  ...  γ_{p,p}(h) ]

Sample mean and autocovariances

sample mean:
x̄ = (1/n) Σ_{t=1}^{n} x_t
sample autocovariance:
Γ̂(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(x_t − x̄)'
for h ≥ 0, and Γ̂(−h) = Γ̂(h)'.
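A small sketch (not from the notes): acf() on a two-column ts object returns exactly these matrix-valued sample autocovariances.

# Sketch: lag-0 sample autocovariance matrix for a bivariate series.
set.seed(3)
x <- ts(cbind(a = arima.sim(list(ar = 0.7), 200),
              b = arima.sim(list(ma = 0.5), 200)))
G <- acf(x, type = "covariance", plot = FALSE)
G$acf[1, , ]                          # Gamma-hat(0): variances and cross-covariances
var(x) * (nrow(x) - 1) / nrow(x)      # the same matrix, rescaled from var()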

Multidimensional Series (Spatial Statistics)

Some studies involve data indexed by more than one variable.

E.g. soil surface temperatures in a field

Notation: xs is the observed value at location s (s for spatial).

Soil temperatures

[Perspective plot: soil surface temperature against row and column position]

Autocovariance and Variogram


Stationary: E(x_s) and cov(x_{s+h}, x_s) do not depend on s.
For a stationary process, the autocovariance function is
γ(h) = cov(x_{s+h}, x_s) = E[(x_{s+h} − μ)(x_s − μ)]

Intrinsic: E(x_{s+h} − x_s) and var(x_{s+h} − x_s) do not depend on s.
For an intrinsic process, the (semi-)variogram is
V_x(h) = (1/2) var(x_{s+h} − x_s)

A stationary process is intrinsic (see Problem 1.26), but an intrinsic process is not necessarily stationary.

In one dimension, the random walk is intrinsic but not stationary.

When stationary, V_x(h) = γ(0) − γ(h).

Isotropic: an intrinsic process is isotropic if the variogram is a function only of |h|, the Euclidean distance between s + h and s.
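A sketch (not from the notes): the empirical semivariogram of a one-dimensional random walk should be roughly linear in h, since var(x_{s+h} − x_s) = h σ_w².

# Sketch: empirical semivariogram of a 1-D random walk; theory gives
# V(h) = h * sigma_w^2 / 2, a straight line through the origin.
set.seed(4)
x <- cumsum(rnorm(2000))
V <- sapply(1:50, function(h) 0.5 * mean(diff(x, lag = h)^2))
plot(1:50, V, xlab = "h", ylab = "semivariogram")
abline(0, 0.5, lty = 2)   # theoretical line, slope sigma_w^2 / 2 = 0.5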

Time Series Regression

A regression model relates a response x_t to inputs z_{t,1}, z_{t,2}, ..., z_{t,q}:
x_t = β_1 z_{t,1} + β_2 z_{t,2} + ... + β_q z_{t,q} + error.

Time domain modeling: the inputs often include lagged values of the same series, x_{t−1}, x_{t−2}, ..., x_{t−p}.

Frequency domain modeling: the inputs include sine and cosine functions.

Fitting a Trend


> g1900 = window(globtemp, start = 1900)


> plot(g1900)


possible model:
x_t = β_1 + β_2 t + w_t,
where the error (noise) is white noise (unlikely!).
fit using ordinary least squares (OLS):
> lmg1900 = lm(g1900 ~ time(g1900)); summary(lmg1900)
Call:
lm(formula = g1900 ~ time(g1900))
Residuals:
     Min       1Q   Median       3Q      Max
-0.30352 -0.09671  0.01132  0.08289  0.33519

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.219e+01  9.032e-01  -13.49   <2e-16 ***
time(g1900)  6.209e-03  4.635e-04   13.40   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1298 on 96 degrees of freedom
Multiple R-Squared: 0.6515,     Adjusted R-squared: 0.6479
F-statistic: 179.5 on 1 and 96 DF,  p-value: < 2.2e-16


> plot(g1900)
> abline(reg = lmg1900)


Using PROC ARIMA


Program
data globtemp;
infile 'globtemp.dat';
n + 1;
input globtemp;
year = 1855 + n;
run;
proc arima data = globtemp;
where year >= 1900;
identify var = globtemp crosscorr = year;
/* The ESTIMATE statement fits a model to the     */
/* variable in the most recent IDENTIFY statement */
estimate input = year;
run;

and output.

Regression Review
the regression model:
x_t = β_1 z_{t,1} + β_2 z_{t,2} + ... + β_q z_{t,q} + w_t = β'z_t + w_t.
fit by minimizing the residual sum of squares
RSS(β) = Σ_{t=1}^{n} (x_t − β'z_t)²
find the minimum by solving the normal equations
(Σ_{t=1}^{n} z_t z_t') β̂ = Σ_{t=1}^{n} z_t x_t.

Matrix Formulation

factor matrix Z_{n×q} = (z_1, z_2, ..., z_n)', response vector x_{n×1} = (x_1, x_2, ..., x_n)'

normal equations (Z'Z) β̂ = Z'x, with solution β̂ = (Z'Z)⁻¹ Z'x

minimized RSS:
RSS = (x − Z β̂)'(x − Z β̂) = x'x − x'Z(Z'Z)⁻¹Z'x
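A quick sketch (not in the notes) checking the matrix solution against lm():

# Sketch: solve the normal equations directly and compare with lm().
set.seed(5)
n <- 50
Z <- cbind(1, 1:n)                      # intercept and linear trend
x <- drop(Z %*% c(2, 0.1)) + rnorm(n)
beta.hat <- solve(t(Z) %*% Z, t(Z) %*% x)
cbind(direct = beta.hat, lm = coef(lm(x ~ Z - 1)))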

Distributions

If the (white noise) errors are normally distributed (w_t ~ iid N(0, σ_w²)), then β̂ is multivariate normal, and the usual t- and F-statistics have the corresponding distributions.

If the errors are not normally distributed, but still iid, the same is approximately true.

If the errors are not white noise, none of that is true.

Choosing a Regression Model

We want a model that fits well without using too many parameters.

Two estimates of the noise variance:
unbiased: s_w² = RSS/(n − q);
maximum likelihood: σ̂² = RSS/n.

We want small σ̂² but also small q.

Information Criteria (smaller is better)

Akaike's Information Criterion (with k variables in the model):
AIC = ln σ̂_k² + (n + 2k)/n

bias-corrected Akaike's Information Criterion:
AICc = ln σ̂_k² + (n + k)/(n − k − 2)

Schwarz's (Bayesian) Information Criterion:
SIC = ln σ̂_k² + k ln(n)/n

Notes
More commonly (e.g. in SAS output and in R's AIC function), these are all multiplied by n.
AIC, AICc, and SIC (also known as SBC and BIC) can be generalized to other problems where likelihood methods are used.
If n is large and the true k is small, minimizing BIC picks k well, but minimizing AIC tends to over-estimate it.
If the true k is large (or infinite), minimizing AIC picks a value that gives good predictions by trading off bias vs. variance.
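A short sketch (an illustration, not from the notes) comparing the per-observation AIC above with R's n-scaled AIC for two simulated trend fits; because the constants cancel, differences between models agree (here k counts regressors excluding the intercept, but for differences the convention does not matter).

# Sketch: per-observation AIC differences match R's AIC differences / n.
set.seed(6)
n <- 100
x <- arima.sim(list(ar = 0.5), n) + 0.02 * (1:n)
fit1 <- lm(x ~ I(1:n))           # k = 1
fit2 <- lm(x ~ poly(1:n, 3))     # k = 3
aic.slide <- function(fit) {
  k <- length(coef(fit)) - 1
  log(sum(resid(fit)^2) / n) + (n + 2 * k) / n
}
c(slide = aic.slide(fit1) - aic.slide(fit2),
  R = (AIC(fit1) - AIC(fit2)) / n)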

Exploratory Data Analysis (or Searching for Stationarity)

When an observed time series appears stationary, we can calculate its sample autocorrelations, and use them to decide on a model.

Many time series do not appear stationary; e.g., Johnson and Johnson earnings, global temperature.

Often we can find a way to relate one series to a different series, for which stationarity is more plausible.

Trends and Detrending

Some series can be modeled as
x_t = μ_t + y_t,
where y_t is stationary.

If μ_t has a parametric form, we can estimate it and subtract it. That is, we use the residuals from a fitted trend.

The form of trend might be linear, or a higher degree polynomial, or some other function suggested by theory.

Example: 20th Century Global Temperature


lmg1900 = lm(g1900 ~ time(g1900));


plot(ts(residuals(lmg1900), start = 1900));


Differencing

Some series still appear nonstationary after detrending.

E.g. the trend μ_t is a random walk with drift:
μ_t = δt + Σ_{j=1}^{t} w_j

Here E(x_t) = δt, but
x_t − E(x_t) = Σ_{j=1}^{t} w_j + y_t
has a variance that grows with time.

But now the first differences
∇x_t = x_t − x_{t−1} = δ + w_t + y_t − y_{t−1}
are stationary.

Define the backshift operator B by B x_t = x_{t−1}.
Then ∇x_t = (1 − B) x_t.

Also second differences
∇²x_t = (1 − B)² x_t = x_t − 2x_{t−1} + x_{t−2},
etc. Easy for any positive integer d; possible for fractional d.

Example: 20th Century Global Temperature


plot(diff(g1900));


Both detrending and differencing give apparently stationary results.

acf(diff(g1900));


Differencing has removed almost all auto-correlation.

acf(residuals(lmg1900))


Removing the trend without differencing leaves more autocorrelation.



Transformation (Re-expression)

Some series need to be re-expressed.

Most commonly logarithms, sometimes square roots (especially with counted data).

Often re-expression improves stationarity, and other desirable


features such as symmetry of distribution.

E.g. Glacial varve thicknesses, Johnson and Johnson earnings.



Periodic Signals

If a series is plausibly modeled as a cosine wave plus noise, we can fit
x_t = A cos(2πωt + φ) + w_t = (A cos φ) cos(2πωt) − (A sin φ) sin(2πωt) + w_t
by least squares.

If ω is known (e.g., ω = 1/12 for an annual cycle in monthly data), this is a linear regression:
x_t = β_1 cos(2πωt) + β_2 sin(2πωt)

If ω is of the form j/n for integer j (n = series length), then
β̂_1 = (2/n) Σ_{t=1}^{n} x_t cos(2πtj/n),
β̂_2 = (2/n) Σ_{t=1}^{n} x_t sin(2πtj/n).
For other ω, use standard linear least squares regression.
If ω is unknown, either:
try all ω's of the form j/n, plotting β̂_1(j/n)² + β̂_2(j/n)² against j/n (the periodogram);
or use non-linear least squares for other ω.
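A sketch (not from the notes): at a Fourier frequency the cosine and sine regressors are orthogonal, so the closed-form coefficients above agree with an lm() fit.

# Sketch: closed-form harmonic regression coefficients vs lm() at omega = j/n.
set.seed(7)
n <- 144; j <- 12; t <- 1:n
x <- 2 * cos(2 * pi * t * j / n + 0.6 * pi) + rnorm(n)
b1 <- 2 / n * sum(x * cos(2 * pi * t * j / n))
b2 <- 2 / n * sum(x * sin(2 * pi * t * j / n))
fit <- lm(x ~ cos(2 * pi * t * j / n) + sin(2 * pi * t * j / n))
rbind(closed.form = c(b1, b2), lm = coef(fit)[2:3])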

# detrend global temperature using a quadratic fit


gtres = residuals(lm(globtemp ~ time(globtemp) + I(time(globtemp)^2)));
gtres = ts(gtres, start = start(globtemp));
par(mfcol = c(2, 1));
plot(gtres);
# use spectrum() to plot the periodogram of detrended global temperature
spectrum(gtres, log = "no");


Smoothing a Time Series


Smoothing a time series makes long-term behavior (low frequencies) more apparent. E.g. global temperature, Johnson
and Johnson earnings.
Many types of smoother:
moving averages;
kernel smoothers;
lowess, supsmu, etc.;
smoothing splines.

# Trailing yearly average J&J earnings


plot(jj)
lines(filter(jj, rep(1, 4)/4, sides = 1), col = "red")
title("Trailing 4-quarter averages")
# smooth global temperatures over a 30 year window
# (note half weight on end values)
plot(globtemp)
lines(filter(globtemp, c(.5, rep(1, 29), .5)/30), col = "red")
title("Centered 30 year averages")


Smoothing a Scatter Plot

Smoothing a scatter plot can also reveal behavior.

E.g. daily NYSE returns plotted against previous day.


# scatter plot of NYSE return against previous day,


# with lowess smooth
plot(nyse[-length(nyse)], nyse[-1], xlim = c(-0.02, 0.02),
ylim = c(-0.02, 0.02))
lines(lowess(nyse[-length(nyse)], nyse[-1], f = 1/5), col = "red")
title("NYSE daily return against previous day")


Time Domain Models

Box & Jenkins popularized an approach to time series analysis based on
Auto-Regressive
Integrated
Moving Average
(ARIMA) models.

Autoregressive Models

Autoregressive model of order p (AR(p)):
x_t = φ_1 x_{t−1} + φ_2 x_{t−2} + ... + φ_p x_{t−p} + w_t,
where:
x_t is stationary with mean 0;
φ_1, φ_2, ..., φ_p are constants with φ_p ≠ 0;
w_t is uncorrelated with x_{t−j}, j = 1, 2, ...

To model a series with non-zero mean μ:
(x_t − μ) = φ_1(x_{t−1} − μ) + φ_2(x_{t−2} − μ) + ... + φ_p(x_{t−p} − μ) + w_t,
or
x_t = α + φ_1 x_{t−1} + φ_2 x_{t−2} + ... + φ_p x_{t−p} + w_t,
where
α = μ(1 − φ_1 − φ_2 − ... − φ_p).

Note that the intercept α is not μ.

Note also that
w_t = x_t − (φ_1 x_{t−1} + φ_2 x_{t−2} + ... + φ_p x_{t−p})
and is therefore also stationary.

Furthermore, for k > 0,
w_{t−k} = x_{t−k} − (φ_1 x_{t−k−1} + φ_2 x_{t−k−2} + ... + φ_p x_{t−k−p})
and w_t is uncorrelated with all terms on the right hand side.
So w_t is uncorrelated with w_{t−k}.
That is, {w_t} is white noise.

The Autoregressive Operator

Use the backshift operator:
x_t = φ_1 B x_t + φ_2 B² x_t + ... + φ_p B^p x_t + w_t,
or
(1 − φ_1 B − φ_2 B² − ... − φ_p B^p) x_t = w_t.

The autoregressive operator is
φ(B) = 1 − φ_1 B − φ_2 B² − ... − φ_p B^p.

In operator form, the model equation is φ(B) x_t = w_t.

Example: AR(1)

For the first-order model:
x_t = φ x_{t−1} + w_t.

Also
x_{t−1} = φ x_{t−2} + w_{t−1},
so
x_t = φ(φ x_{t−2} + w_{t−1}) + w_t = φ² x_{t−2} + φ w_{t−1} + w_t.

Now use
x_{t−2} = φ x_{t−3} + w_{t−2},
so
x_t = φ²(φ x_{t−3} + w_{t−2}) + φ w_{t−1} + w_t = φ³ x_{t−3} + φ² w_{t−2} + φ w_{t−1} + w_t.

Continuing:
x_t = φ^k x_{t−k} + Σ_{j=0}^{k−1} φ^j w_{t−j}.

We have shown:
x_t = φ^k x_{t−k} + Σ_{j=0}^{k−1} φ^j w_{t−j}.

Since x_t is stationary, if |φ| < 1 then φ^k x_{t−k} → 0 as k → ∞, so
x_t = Σ_{j=0}^{∞} φ^j w_{t−j},
an infinite moving average, or linear process.
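A small check (a sketch): ARMAtoMA() returns the weights of this infinite moving average, which for an AR(1) are φ^j.

# Sketch: the infinite-MA weights of an AR(1) are phi^j.
phi <- 0.9
rbind(ARMAtoMA(ar = phi, lag.max = 6), phi^(1:6))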

Moments

Mean: E(x_t) = 0.

Autocovariances: for h ≥ 0,
γ(h) = cov(x_{t+h}, x_t) = E[(Σ_j φ^j w_{t+h−j})(Σ_k φ^k w_{t−k})] = σ_w² φ^h / (1 − φ²).

Autocorrelations: for h ≥ 0,
ρ(h) = γ(h)/γ(0) = φ^h.
Note that
ρ(h) = φ ρ(h − 1),   h = 1, 2, ...

Compare with the original equation
x_t = φ x_{t−1} + w_t.

Simulations

plot(arima.sim(model = list(ar = .9), 100))


Causality

What if |φ| > 1? Rewrite
x_t = φ x_{t−1} + w_t
as
x_t = φ⁻¹ x_{t+1} − φ⁻¹ w_{t+1}.
Now
x_t = −Σ_{j=1}^{∞} φ^{−j} w_{t+j},
a sum of future noise terms. This process is said to be not causal. If |φ| < 1 the process is causal.

The Autoregressive Operator Again

Compare the original equation:
x_t = φ x_{t−1} + w_t  ⟺  (1 − φB) x_t = w_t  ⟺  x_t = (1 − φB)⁻¹ w_t
with the (infinite) moving average representation:
x_t = Σ_{j=0}^{∞} φ^j w_{t−j} = (Σ_{j=0}^{∞} φ^j B^j) w_t.

So
(1 − φB)⁻¹ = Σ_{j=0}^{∞} φ^j B^j.

Compare with
(1 − φz)⁻¹ = 1/(1 − φz) = Σ_{j=0}^{∞} (φz)^j = Σ_{j=0}^{∞} φ^j z^j,
valid for |z| < 1 (because |φ| < 1).

We can manipulate expressions in B as if it were a complex number z with |z| < 1.

Stationary versus Transient


E.g. AR(1):
Stationary version, when |φ| < 1:
x_t = Σ_{j=0}^{∞} φ^j w_{t−j}

But suppose we want to simulate, using
x_t = φ x_{t−1} + w_t,   t = 1, 2, ...

What about x_0?

One possibility: let x_0 = 0.

Then x_1 = w_1, x_2 = w_2 + φ w_1, and generally
x_t = Σ_{j=0}^{t−1} φ^j w_{t−j}.

This means that
var(x_t) = σ_w² (1 + φ² + φ⁴ + ... + φ^{2(t−1)}).

var(x_t) depends on t: this version is not stationary.

But, if |φ| < 1, then for large t,
var(x_t) ≈ σ_w² (1 + φ² + φ⁴ + ...) = σ_w² / (1 − φ²).

Also, under the same conditions (more work!),
cov(x_{t+h}, x_t) ≈ σ_w² φ^{|h|} / (1 − φ²).

This version is called asymptotically stationary.

The non-stationarity is only for small t, and is called transient. Simulations use a burn-in or spin-up period: discard the first few simulated values.

But note: in the stationary version, x_0 ~ N(0, σ_w²/(1 − φ²)).

If we simulate x_0 from this distribution, and for t > 0 use
x_t = φ x_{t−1} + w_t,   t = 1, 2, ...,
then the result is exactly stationary.

That is, we can use a simulation with no spin-up.

This is harder for AR(p) when p > 1, so most simulators use a spin-up period.
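A sketch (not from the notes) contrasting the two simulation schemes for an AR(1) with φ = 0.9: starting at x_0 = 0 versus drawing x_0 from the stationary distribution.

# Sketch: AR(1) simulation with a transient start vs an exactly
# stationary start (x0 drawn from N(0, sigma_w^2 / (1 - phi^2))).
set.seed(8)
phi <- 0.9; n <- 200
w <- rnorm(n)
x.zero <- filter(w, phi, method = "recursive", init = 0)
x.stat <- filter(w, phi, method = "recursive",
                 init = rnorm(1, sd = 1 / sqrt(1 - phi^2)))
ts.plot(cbind(x.zero, x.stat), col = 1:2)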

Moving Average Model

Moving average model of order q (MA(q)):
x_t = w_t + θ_1 w_{t−1} + θ_2 w_{t−2} + ... + θ_q w_{t−q}
where:
θ_1, θ_2, ..., θ_q are constants with θ_q ≠ 0;
w_t is Gaussian white noise wn(0, σ_w²).

Note that w_t is uncorrelated with x_{t−j}, j = 1, 2, ....

In operator form:
x_t = θ(B) w_t,
where the moving average operator θ(B) is
θ(B) = 1 + θ_1 B + θ_2 B² + ... + θ_q B^q.

Compare with the autoregressive model φ(B) x_t = w_t.

The moving average process is stationary for any values of θ_1, θ_2, ..., θ_q.

Moments

Mean: E(x_t) = 0.

Autocovariances:
γ(h) = cov(x_{t+h}, x_t) = E[(Σ_j θ_j w_{t+h−j})(Σ_k θ_k w_{t−k})] = σ_w² Σ_k θ_k θ_{k+h}
= 0 if h > q.

The MA(q) model is characterized by
γ(q) = σ_w² θ_q ≠ 0 and γ(h) = 0 for h > q.

The contrast between the ACF of
a moving average model, which is zero except for a finite number of lags h, and
an autoregressive model, which goes to zero geometrically,
makes the sample ACF an important tool in deciding what model to fit.
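A quick sketch: the theoretical ACF of an MA(2) cuts off after lag 2, while an AR(1) ACF decays geometrically.

# Sketch: MA ACF cuts off; AR ACF tails off.
round(ARMAacf(ma = c(0.7, 0.4), lag.max = 6), 3)   # zero beyond lag 2
round(ARMAacf(ar = 0.7, lag.max = 6), 3)           # geometric decay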

Inversion
Example: MA(1)
x_t = w_t + θ w_{t−1} = (1 + θB) w_t,
so if |θ| < 1,
w_t = (1 + θB)⁻¹ x_t = π(B) x_t,
where
π(B) = Σ_{j=0}^{∞} (−θ)^j B^j.

So x_t satisfies an infinite autoregression:
x_t = −Σ_{j=1}^{∞} (−θ)^j x_{t−j} + w_t.

Autoregressive Moving Average Models

Combine! ARMA(p, q):
x_t = φ_1 x_{t−1} + φ_2 x_{t−2} + ... + φ_p x_{t−p}
      + w_t + θ_1 w_{t−1} + θ_2 w_{t−2} + ... + θ_q w_{t−q}.

In operator form:
φ(B) x_t = θ(B) w_t.

Issues in ARMA Models

Parameter redundancy: if φ(z) and θ(z) have any common factors, they can be canceled out, so the model is the same as one with lower orders. We assume no redundancy.

Causality: if φ(z) ≠ 0 for |z| ≤ 1, x_t can be written in terms of present and past w's. We assume causality.

Invertibility: if θ(z) ≠ 0 for |z| ≤ 1, w_t can be written in terms of present and past x's, and x_t can be written as an infinite autoregression. We assume invertibility.

Using proc arima

Example: fit an MA(1) model to the differences of the log


varve thicknesses.

options linesize = 80;
ods html file = '../varve1.html';
data varve;
infile '../data/varve.dat';
input varve;
lv = log(varve);
dlv = dif(lv);
run;

proc arima data = varve;
title 'Fit an MA(1) model to differences of log varve';
identify var = dlv;
estimate q = 1;
run;

proc arima output

Using some proc arima options

Example: fit an IMA(1, 1) model to the log varve thicknesses.

options linesize = 80;
ods html file = 'varve2.html';
data varve;
infile 'varve.dat';
input varve;
lv = log(varve);
run;

proc arima data = varve;
title 'Fit an IMA(1, 1) model to log varve, using ML';
title2 'Use minic option to identify a good model';
identify var = lv(1) minic;
estimate q = 1 method = ml;
estimate q = 2 method = ml;
estimate p = 1 q = 1 method = ml;
run;

proc arima output

Notes on the proc arima output

For the MA(1) model, the Autocorrelation Check of Residuals rejects the null hypothesis that the residuals are white noise.
If the series really had MA(1) structure, the residuals would be white noise.
So the MA(1) model is not a good fit for this series.

For both the MA(2) and the ARMA(1, 1) models, the Chi-Square statistics are not significant, so these models both seem satisfactory. ARMA(1, 1) has the better AIC and SBC.

Using R

Fit a given model and test the residuals as white noise:


varve.ma1 = arima(diff(log(varve)),
order = c(p = 0, d = 0, q = 1));
varve.ma1;
Box.test(residuals(varve.ma1), lag = 6,
type = "Ljung", fitdf = 1);

Note: the fitdf argument indicates that these are residuals


from a fit with a single parameter.

As in proc arima, differencing can be carried out within arima():


varve.ima1 = arima(log(varve), order = c(0, 1, 1));
varve.ima1;
Box.test(residuals(varve.ima1), 6, "Ljung", 1);

But note that you cannot include the intercept, so the results
are not identical.
Rerun the original analysis with no intercept:
arima(diff(log(varve)), order = c(0, 0, 1),
include.mean = FALSE);

Make a table of AICs:


AICtable = matrix(NA, 5, 5);
dimnames(AICtable) =
list(paste("p =", 0:4), paste("q =", 0:4));
for (p in 0:4) {
for (q in 0:4) {
varve.arma = arima(diff(log(varve)), order = c(p, 0, q));
AICtable[p+1, q+1] = AIC(varve.arma);
}
}
AICtable;
Note: proc arima's MINIC option tabulates (an approximation to) BIC, not AIC.

Make a table of BICs:


BICtable = matrix(NA, 5, 5);
dimnames(BICtable) =
list(paste("p =", 0:4), paste("q =", 0:4));
for (p in 0:4) {
for (q in 0:4) {
varve.arma = arima(diff(log(varve)), order = c(p, 0, q));
BICtable[p+1, q+1] =
AIC(varve.arma, k = log(length(varve) - 1));
}
}
BICtable;
Both tables suggest ARMA(1, 1).

ARMA Autocorrelation Functions


For a moving average process, MA(q):
x_t = w_t + θ_1 w_{t−1} + θ_2 w_{t−2} + ... + θ_q w_{t−q}.
So (with θ_0 = 1)
γ(h) = cov(x_{t+h}, x_t)
     = E[(Σ_{j=0}^{q} θ_j w_{t+h−j})(Σ_{k=0}^{q} θ_k w_{t−k})]
     = σ_w² Σ_{j=0}^{q−h} θ_j θ_{j+h}  for 0 ≤ h ≤ q, and 0 for h > q.

So the ACF is
ρ(h) = (Σ_{j=0}^{q−h} θ_j θ_{j+h}) / (Σ_{j=0}^{q} θ_j²)  for 0 ≤ h ≤ q, and 0 for h > q.

Notes:
In these expressions, θ_0 = 1 for convenience.
ρ(q) ≠ 0 but ρ(h) = 0 for h > q. This characterizes MA(q).

For an autoregressive process, AR(p):
x_t = φ_1 x_{t−1} + φ_2 x_{t−2} + ... + φ_p x_{t−p} + w_t.
So
γ(h) = cov(x_{t+h}, x_t)
     = E[(Σ_{j=1}^{p} φ_j x_{t+h−j} + w_{t+h}) x_t]
     = Σ_{j=1}^{p} φ_j γ(h − j) + cov(w_{t+h}, x_t).

Because x_t is causal, x_t is w_t plus a linear combination of w_{t−1}, w_{t−2}, ....

So
cov(w_{t+h}, x_t) = σ_w² if h = 0, and 0 if h > 0.

Hence
γ(h) = Σ_{j=1}^{p} φ_j γ(h − j),   h > 0,
and
γ(0) = Σ_{j=1}^{p} φ_j γ(j) + σ_w².

If we know the parameters φ_1, φ_2, ..., φ_p and σ_w², these equations for h = 0 and h = 1, 2, ..., p form p + 1 linear equations in the p + 1 unknowns γ(0), γ(1), ..., γ(p).

The other autocovariances can then be found recursively from the equation for h > p.

Alternatively, if we know (or have estimated) γ(0), γ(1), ..., γ(p), they form p + 1 linear equations in the p + 1 parameters φ_1, φ_2, ..., φ_p and σ_w².

These are the Yule-Walker equations.
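A short sketch (not from the notes): ar() with method "yule-walker" solves exactly these equations using the sample autocovariances.

# Sketch: Yule-Walker estimation of an AR(2) from simulated data.
set.seed(9)
x <- arima.sim(list(ar = c(1.5, -0.75)), n = 500)
fit <- ar(x, order.max = 2, aic = FALSE, method = "yule-walker")
fit$ar        # estimates of phi_1, phi_2
fit$var.pred  # estimate of sigma_w^2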

For the ARMA(p, q) model with p > 0 and q > 0:
x_t = φ_1 x_{t−1} + ... + φ_p x_{t−p} + w_t + θ_1 w_{t−1} + ... + θ_q w_{t−q},
a generalized set of Yule-Walker equations must be used.

The moving average models ARMA(0, q) = MA(q) are the only ones with a closed form expression for γ(h).

For AR(p) and ARMA(p, q) with p > 0, the recursive equation means that for h > max(p, q + 1), γ(h) is a sum of geometrically decaying terms, possibly damped oscillations.

The recursive equation is
γ(h) = Σ_{j=1}^{p} φ_j γ(h − j),   h > q.

What kinds of sequences satisfy an equation like this?
Try γ(h) = z^{−h} for some constant z.
The equation becomes
0 = z^{−h} − Σ_{j=1}^{p} φ_j z^{−(h−j)} = z^{−h}(1 − Σ_{j=1}^{p} φ_j z^j) = z^{−h} φ(z).

So if φ(z) = 0, then γ(h) = z^{−h} satisfies the equation.

Since φ(z) is a polynomial of degree p, there are p solutions, say z_1, z_2, ..., z_p.
So a more general solution is
γ(h) = Σ_{l=1}^{p} c_l z_l^{−h},
for any constants c_1, c_2, ..., c_p.

If z_1, z_2, ..., z_p are distinct, this is the most general solution; if some roots are repeated, the general form is a little more complicated.

If all z_1, z_2, ..., z_p are real, this is a sum of geometrically decaying terms.
If any root is complex, its complex conjugate must also be a root, and these two terms may be combined into geometrically decaying sine-cosine terms.
The constants c_1, c_2, ..., c_p are determined by initial conditions; in the ARMA case, these are the Yule-Walker equations.
Note that the various rates of decay are the zeros of φ(z), the autoregressive operator, and do not depend on θ(z), the moving average operator.
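A sketch (base R only) relating the ACF decay to the roots of φ(z), using the AR(2) simulated later in these notes:

# Sketch: the AR(2) x_t = 1.5 x_{t-1} - 0.95 x_{t-2} + w_t has complex roots
# of phi(z) = 1 - 1.5 z + 0.95 z^2, giving a damped-oscillation ACF.
z <- polyroot(c(1, -1.5, 0.95))
Mod(z)                        # both moduli > 1, so the model is causal
2 * pi / abs(Arg(z[1]))       # oscillation period in samples (about 9)
plot(0:50, ARMAacf(ar = c(1.5, -0.95), lag.max = 50), type = "h",
     xlab = "h", ylab = "ACF")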

Example: ARMA(1, 1)
x_t = φ x_{t−1} + θ w_{t−1} + w_t.

The recursion is
ρ(h) = φ ρ(h − 1),   h = 2, 3, ...

So ρ(h) = c φ^h for h = 1, 2, ..., but c ≠ 1.

Graphically, the ACF decays geometrically, but with a different value at h = 0.


ARMAacf(ar = 0.9, ma = 0.5, 24)



The Partial Autocorrelation Function

An MA(q) can be identified from its ACF: non-zero to lag q, and zero afterwards.

We need a similar tool for AR(p).

The partial autocorrelation function (PACF) fills that role.

Recall: for multivariate random variables X, Y, Z, the partial correlations of X and Y given Z are the correlations of:
the residuals of X from its regression on Z; and
the residuals of Y from its regression on Z.
Here 'regression' means conditional expectation, or best linear prediction, based on population distributions, not a sample calculation.
In a time series, the partial autocorrelations are defined as
φ_{h,h} = partial correlation of x_{t+h} and x_t given x_{t+h−1}, x_{t+h−2}, ..., x_{t+1}.

For an autoregressive process, AR(p):
x_t = φ_1 x_{t−1} + φ_2 x_{t−2} + ... + φ_p x_{t−p} + w_t.

If h > p, the regression of x_{t+h} on x_{t+h−1}, x_{t+h−2}, ..., x_{t+1} is
φ_1 x_{t+h−1} + φ_2 x_{t+h−2} + ... + φ_p x_{t+h−p}.

So the residual is just w_{t+h}, which is uncorrelated with x_{t+h−1}, x_{t+h−2}, ..., x_{t+1} and x_t.

So the partial autocorrelation is zero for h > p:
φ_{h,h} = 0,   h > p.

We can also show that φ_{p,p} = φ_p, which is non-zero by assumption.

So φ_{p,p} ≠ 0 but φ_{h,h} = 0 for h > p. This characterizes AR(p).
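A quick sketch: the theoretical PACF of an AR(2) is zero beyond lag 2, while an MA(1) PACF tails off.

# Sketch: PACF cuts off for AR, tails off for MA.
round(ARMAacf(ar = c(1.5, -0.75), lag.max = 6, pacf = TRUE), 3)
round(ARMAacf(ma = 0.7, lag.max = 6, pacf = TRUE), 3)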

The Inverse Autocorrelation Function


SASs proc arima also shows the Inverse Autocorrelation Function (IACF).
The IACF of the ARMA(p, q) model
(B)xt = (B)wt
is defined to be the ACF of the inverse (or dual) process
(inverse)

(B)xt

= (B)wt.

The IACF has the same property as the PACF: AR(p) is


characterized by an IACF that is nonzero at lag p but zero
for larger lags.
16

Summary: Identification of ARMA processes


AR(p) is characterized by a PACF or IACF that is:
nonzero at lag p;
zero for lags larger than p.
MA(q) is characterized by an ACF that is:
nonzero at lag q;
zero for lags larger than q.
For anything else, try ARMA(p, q) with p > 0 and q > 0.
17

For p > 0 and q > 0:

        AR(p)                 MA(q)                 ARMA(p, q)
ACF     Tails off             Cuts off after lag q  Tails off
PACF    Cuts off after lag p  Tails off             Tails off
IACF    Cuts off after lag p  Tails off             Tails off

Note: these characteristics are used to guide the initial choice of a model; estimation and model-checking will often lead to a different model.

Other ARMA Identification Techniques

SAS's proc arima offers the MINIC option on the identify statement, which produces a table of SBC criteria for various values of p and q.

The identify statement has two other options: ESACF and SCAN.

Both produce tables in which the pattern of zero and non-zero values characterizes p and q.

See Section 3.4.10 in Brocklebank and Dickey.

options linesize = 80;
ods html file = 'varve3.html';
data varve;
infile '../data/varve.dat';
input varve;
lv = log(varve);
run;
proc arima data = varve;
title 'Use identify options to identify a good model';
identify var = lv(1) minic esacf scan;
estimate q = 1 method = ml;
estimate q = 2 method = ml;
estimate p = 1 q = 1 method = ml;
run;

proc arima output

Forecasting
General problem: predict x_{n+m} given x_n, x_{n−1}, ..., x_1.
General solution: the (conditional) distribution of x_{n+m} given x_n, x_{n−1}, ..., x_1.
In particular, the conditional mean is the best predictor (i.e. minimum mean squared error).
Special case: if {x_t} is Gaussian, the conditional distribution is also Gaussian, with a conditional mean that is a linear function of x_n, x_{n−1}, ..., x_1 and a conditional variance that does not depend on x_n, x_{n−1}, ..., x_1.

Linear Forecasting

What if x_t is not Gaussian?
Use the best linear predictor x^n_{n+m}.
Not the best possible predictor, but computable.

One-step Prediction
The hard way: suppose
x^n_{n+1} = φ_{n,1} x_n + φ_{n,2} x_{n−1} + ... + φ_{n,n} x_1.
Choose φ_{n,1}, φ_{n,2}, ..., φ_{n,n} to minimize the mean squared prediction error
E[(x_{n+1} − x^n_{n+1})²].

Differentiate and equate to zero: n linear equations in the n unknowns.
Solve recursively (in n) using the Durbin-Levinson algorithm.
Incidentally, the PACF is φ_{n,n}.

One-step Prediction for an ARMA Model

The easy way: suppose we can write
x_{n+1} = (some linear combination of x_n, x_{n−1}, ..., x_1)
        + (something uncorrelated with x_n, x_{n−1}, ..., x_1).
Then the first part is the best linear predictor, and the second part is the prediction error.

E.g. AR(p), p ≤ n:
x_{n+1} = φ_1 x_n + φ_2 x_{n−1} + ... + φ_p x_{n+1−p}   [first part]
        + w_{n+1}                                        [second part]

General ARMA case

Now
x_{n+1} = φ_1 x_n + φ_2 x_{n−1} + ... + φ_p x_{n+1−p}
        + θ_1 w_n + θ_2 w_{n−1} + ... + θ_q w_{n+1−q}
        + w_{n+1}.

The first part on the right hand side is a linear combination of x_n, x_{n−1}, ..., x_1.

The last part, w_{n+1}, is uncorrelated with x_n, x_{n−1}, ..., x_1.

The middle part? If the model is invertible, w_t is a linear combination of x_t, x_{t−1}, ..., so if n is large, we can truncate the sum at x_1, and w_n, w_{n−1}, ..., w_{n+1−q} are all (approximately) linear combinations of x_n, x_{n−1}, ..., x_1.

So the middle part is also approximately a linear combination of x_n, x_{n−1}, ..., x_1, whence
x^n_{n+1} = φ_1 x_n + φ_2 x_{n−1} + ... + φ_p x_{n+1−p} + θ_1 w_n + θ_2 w_{n−1} + ... + θ_q w_{n+1−q},
and w_{n+1} is the prediction error, x_{n+1} − x^n_{n+1}.
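A sketch (simulated data, not from the notes) of one- and multi-step prediction with a fitted ARMA model in R:

# Sketch: fit an ARMA(1,1) and forecast 10 steps ahead; the standard
# errors grow as the horizon increases.
set.seed(10)
x <- arima.sim(list(ar = 0.8, ma = 0.4), n = 300)
fit <- arima(x, order = c(1, 0, 1))
p <- predict(fit, n.ahead = 10)
cbind(forecast = p$pred, se = p$se)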

Multi-step Prediction

The easy way: build on one-step prediction. E.g. two-step:
x_{n+2} = φ_1 x_{n+1} + φ_2 x_n + ... + φ_p x_{n+2−p}
        + θ_1 w_{n+1} + θ_2 w_n + ... + θ_q w_{n+2−q}
        + w_{n+2}.
Replace x_{n+1} by x^n_{n+1} + w_{n+1}:
x_{n+2} = φ_1 x^n_{n+1} + φ_2 x_n + ... + φ_p x_{n+2−p}
        + θ_2 w_n + ... + θ_q w_{n+2−q}
        + w_{n+2} + (φ_1 + θ_1) w_{n+1}.

The first two parts are again (approximately) linear combinations of x_n, x_{n−1}, ..., x_1, and the last is uncorrelated with x_n, x_{n−1}, ..., x_1. So
x^n_{n+2} = φ_1 x^n_{n+1} + φ_2 x_n + ... + φ_p x_{n+2−p}
          + θ_2 w_n + ... + θ_q w_{n+2−q},
and the prediction error is
x_{n+2} − x^n_{n+2} = w_{n+2} + (φ_1 + θ_1) w_{n+1}.
Note that the mean squared prediction error is
σ_w² (1 + (φ_1 + θ_1)²) ≥ σ_w².

Mean squared prediction error increases as we predict further into the future.

Forecasting with proc arima

E.g. the fishery recruitment data.

proc arima program and output.

Note that predictions approach the series mean, and standard errors approach the series standard deviation.

The autocorrelation test for residuals is borderline, largely because of residual autocorrelations at lags 12, 24, ....

Spectrum analysis shows that these are caused by seasonal means, which can be removed: proc arima program and output.

Comments on Choice of ARMA model


Keep it simple! Use small p and q.
Some systems have autoregressive-like structure.
E.g. first order dynamics:
dx(t)
= x(t)
dt
or in stochastic form,
dx(t) = x(t)dt + dW (t)
where W (t) is a Wiener process, the continuous time limit of
the random walk.
1

Discrete time approximation:
Δx(t) = x(t + Δt) − x(t) = −α x(t) Δt + ΔW(t)
or
x(t + Δt) = x(t) − α x(t) Δt + ΔW(t)
          = (1 − αΔt) x(t) + ΔW(t),
an AR(1) (causal if α > 0 and Δt is small).

Similarly a second order system leads to AR(2).

Since many real-world systems can be approximated by first or second order dynamics, this suggests using p = 1 or 2, and q = 0.

Some systems have more dimensions. E.g. first order vector autoregression, VAR_p(1):
x_t (p×1) = Φ (p×p) x_{t−1} (p×1) + w_t (p×1).
Here each component time series is typically ARMA(p, p − 1).

This suggests using q < p, especially q = p − 1.

Added noise: if y_t is ARMA(p, q) with q < p, but we observe x_t = y_t + w_t where w_t is white noise, uncorrelated with y_t, then x_t is ARMA(p, p).

This suggests using q = p.

Summary: you'll often find that you can use small p and q ≤ p, perhaps q = 0 or q = p − 1 or q = p, depending on the background of the series.

Estimation

Current methods are likelihood-based:
f_{1,2,...,n}(x_1, x_2, ..., x_n) = f_1(x_1) f_{2|1}(x_2|x_1) ... f_{n|n−1,...,1}(x_n|x_{n−1}, x_{n−2}, ..., x_1).
If x_t is AR(p) and n > p, then
f_{n|n−1,...,1}(x_n|x_{n−1}, x_{n−2}, ..., x_1) = f_{n|n−1,...,n−p}(x_n|x_{n−1}, x_{n−2}, ..., x_{n−p}).

Assume x_t is Gaussian. E.g. AR(1):
f_{t|t−1}(x_t|x_{t−1}) is N[μ(1 − φ) + φ x_{t−1}, σ_w²] for t > 1,
and
f_1(x_1) is N[μ, σ_w²/(1 − φ²)].

So the likelihood, still for AR(1), is
L(μ, φ, σ_w²) = (2πσ_w²)^{−n/2} (1 − φ²)^{1/2} exp(−S(μ, φ)/(2σ_w²)),
where
S(μ, φ) = (1 − φ²)(x_1 − μ)² + Σ_{t=2}^{n} [(x_t − μ) − φ(x_{t−1} − μ)]².

Methods in proc arima

method = ml: maximize the likelihood.

method = uls: minimize the unconditional sum of squares S(μ, φ).

method = cls: minimize the conditional sum of squares S_c(μ, φ):
S_c(μ, φ) = S(μ, φ) − (1 − φ²)(x_1 − μ)² = Σ_{t=2}^{n} [(x_t − μ) − φ(x_{t−1} − μ)]².

This is essentially least squares regression of x_t on x_{t−1}.

AR(p), p > 1, can be handled similarly.

ARMA(p, q) with q > 0 is more complicated; state space


methods can be used to calculate the exact likelihood.

proc arima implements the same three methods in all cases.

All three methods give estimators with the same large-sample


normal distribution; all are asymptotically optimal.

Brute Force

Above methods fail (or need serious modification) if any data are missing.

Can always fall back to brute force:
(x_1, x_2, ..., x_n)' ~ N_n(μ1, Γ),
where the n × n matrix Γ has (i, j) entry γ(i − j):
Γ = [ γ(0)    γ(1)    γ(2)    ...  γ(n−1)
      γ(1)    γ(0)    γ(1)    ...  γ(n−2)
      γ(2)    γ(1)    γ(0)    ...  γ(n−3)
      ...     ...     ...     ...  ...
      γ(n−1)  γ(n−2)  γ(n−3)  ...  γ(0)   ]

Write γ(h) = σ_w² g(h), and use e.g. R's ARMAacf(...) to compute g(h).

The likelihood is
(det(2πΓ))^{−1/2} exp(−(x − μ1)'Γ⁻¹(x − μ1)/2)
= (det(2πσ_w²G))^{−1/2} exp(−(x − μ1)'G⁻¹(x − μ1)/(2σ_w²)).

Can maximize analytically with respect to μ and σ_w², then numerically with respect to φ and θ.

Missing data? Just leave out the corresponding rows and columns of Γ.
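A brute-force sketch (an illustration under these assumptions, not the method proc arima uses): build G from ARMAacf and evaluate the exact Gaussian log-likelihood of an AR(1) at given parameter values.

# Sketch: exact Gaussian log-likelihood of an AR(1) from the n x n
# autocovariance matrix (small n only).
loglik.ar1 <- function(x, mu, phi, s2w) {
  n <- length(x)
  g <- ARMAacf(ar = phi, lag.max = n - 1) / (1 - phi^2)  # gamma(h) / s2w
  G <- s2w * toeplitz(g)
  r <- x - mu
  -0.5 * (n * log(2 * pi) + as.numeric(determinant(G)$modulus) +
          drop(t(r) %*% solve(G, r)))
}
set.seed(11)
x <- arima.sim(list(ar = 0.7), n = 100) + 5
loglik.ar1(x, mu = 5, phi = 0.7, s2w = 1)
# for comparison, arima(x, order = c(1, 0, 0)) maximizes this over the parameters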

The Integrated ARMA model: ARIMA(p, d, q)


Some series are nonstationary, but their differences are stationary; e.g. the random walk.
Recall: the first differences of x_t are
x_t − x_{t−1} = (1 − B) x_t = ∇x_t.
The second differences are
∇x_t − ∇x_{t−1} = (1 − B) ∇x_t = ∇²x_t.
If ∇^d x_t is ARMA(p, q), we say that x_t is ARIMA(p, d, q).

Under-differencing

Suppose that x_t is ARIMA(p, d, q), but we analyze y_t = ∇^{d'} x_t for some d' < d.

In this case, y_t satisfies
∇^{d−d'} φ(B) y_t = φ*(B) y_t = θ(B) w_t
where φ*(z) = (1 − z)^{d−d'} φ(z) has d − d' roots at z = 1.

This looks like an ARMA(p + d − d', q) model, but it is not causal.

Over-differencing

Suppose that x_t is ARIMA(p, d, q), but we analyze y_t = ∇^{d'} x_t for some d' > d.

In this case, y_t satisfies
φ(B) y_t = ∇^{d'−d} θ(B) w_t = θ*(B) w_t
where θ*(z) = (1 − z)^{d'−d} θ(z) has d' − d roots at z = 1.

This looks like an ARMA(p, q + d' − d) model, but it is not invertible.

Simplest model with d > 0: ARIMA(0, 1, 1)

Many nonstationary series are found to be fitted quite well as ARIMA(0, 1, 1).
This model is connected with the exponentially weighted moving average (EWMA) method of forecasting.
If the model is written x_t − x_{t−1} = w_t − λ w_{t−1}, the one-step forecast is
x̃_{n+1} = (1 − λ) Σ_{j=0}^{∞} λ^j x_{n−j},
the exponentially weighted moving average.

We can calculate the forecast recursively:
x_{n+1} = x_n − λ w_n + w_{n+1}.

We can find w_n from x_n, x_{n−1}, ..., so the one-step forecast is the first part:
x̃_{n+1} = x_n − λ w_n.

But w_n is the previous forecast error, x_n − x̃_n, so
x̃_{n+1} = x_n − λ(x_n − x̃_n) = (1 − λ) x_n + λ x̃_n.
In words, the new forecast is a weighted average of the current forecast and the current value.
Also
x̃_{n+1} = x̃_n + (1 − λ)(x_n − x̃_n),
so the new forecast is the current forecast plus a correction based on the current forecast error.
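A sketch (illustration only) checking the EWMA connection: the in-sample one-step forecasts from an ARIMA(0,1,1) fit follow the recursion above, with λ equal to minus the fitted MA coefficient in R's sign convention (x_t − x_{t−1} = w_t + θ w_{t−1}, so λ = −θ).

# Sketch: ARIMA(0,1,1) one-step forecasts obey the EWMA recursion
#   xhat[t] = (1 - lambda) * x[t-1] + lambda * xhat[t-1]
# (approximately, once the Kalman filter has settled down).
set.seed(12)
x <- cumsum(arima.sim(list(ma = -0.6), n = 300))   # an ARIMA(0,1,1) series
fit <- arima(x, order = c(0, 1, 1))
lambda <- -coef(fit)["ma1"]
xhat <- x - residuals(fit)                         # in-sample one-step forecasts
idx <- 21:length(x)
max(abs(xhat[idx] - ((1 - lambda) * x[idx - 1] + lambda * xhat[idx - 1])))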

Strategy for Building ARIMA Models

1. First choose d:
the ACF of an integrated series tends to die away slowly, so difference until it dies away quickly;
the IACF of a non-invertible series tends to die away slowly, which indicates over-differencing.
You may want to try more than one value of d.

2. Next choose p and q, e.g. using MINIC.

3. Next estimate the model.

4. Finally check the model diagnostics:
significance of the highest order coefficients, φ̂_p (if p > 0) and θ̂_q (if q > 0);
non-significance in the autocorrelation check of residuals;
low value of AIC or SBC.

5. Repeat from step 2 until satisfactory.

Note: You may not find a completely satisfactory model, especially for a long data series.

Unit Root Tests


Choice of d can be formulated as a hypothesis test.
E.g. in the AR(1) model xt = xt1 + wt, set:
H0 : = 1, xt is ARIMA(0, 1, 0) (nonstationary, d = 1);
HA : || < 1, xt is ARIMA(1, 0, 0) (stationary, d = 0).
Test using proc arimas stationarity keyword on the identify
statement.
E.g. the global temperature data: proc arima program and
output.
9

The statistics on the 'Lags 0' rows in the panel 'Augmented Dickey-Fuller Unit Root Tests' refer to the three models:
Zero Mean:   x_t = φ x_{t−1} + w_t;
Single Mean: x_t − μ = φ(x_{t−1} − μ) + w_t;
Trend:       x_t − μ − δt = φ(x_{t−1} − μ − δ(t − 1)) + w_t.

Note that under H0, these models reduce to
x_t = x_{t−1} + w_t,
x_t = x_{t−1} + w_t,
x_t = x_{t−1} + δ + w_t,
the first two being random walks with no drift, the latter being a random walk with drift.
The statistics on the 'Lags 1' rows refer to corresponding AR(2) models, which reduce to integrated AR(1) models under the null hypothesis.
The Tau tests are generally preferred to the Rho tests.
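In R, a comparable test is available (a sketch assuming the tseries package is installed), applied here to a simulated random walk and a stationary AR(1):

# Sketch: augmented Dickey-Fuller tests; the random walk should fail to
# reject the unit-root null, the stationary AR(1) should reject it.
library(tseries)
set.seed(13)
rw <- cumsum(rnorm(200))
ar1 <- arima.sim(list(ar = 0.5), n = 200)
adf.test(rw)
adf.test(ar1)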

E.g. Case-Shiller housing data: proc arima program and output.


Seasonal ARIMA Models

Many time series collected on a monthly or quarterly basis have seasonal behavior.

Similarly hourly data and daily behavior.

E.g. Johnson & Johnson quarterly earnings; discussion typically focuses on comparison with:
the previous quarter;
the same quarter, previous year.

That is, we compare x_t with x_{t−1} and x_{t−4}.

More generally, we compare x_t with x_{t−1} and x_{t−s}, where
s = 4 for quarterly data,
s = 12 for monthly data,
s = 24 for daily effects in hourly data,
s = 168 for weekly effects in hourly data,
etc.
This suggests modeling x_t in terms of x_{t−1} and x_{t−s}.

Pure Seasonal ARMA


The pure seasonal ARMA model has the form
x_t = Φ_1 x_{t−s} + Φ_2 x_{t−2s} + ... + Φ_P x_{t−Ps}
      + w_t + Θ_1 w_{t−s} + Θ_2 w_{t−2s} + ... + Θ_Q w_{t−Qs}.
Notation: ARMA(P, Q)_s.
In operator form:
Φ_P(B^s) x_t = Θ_Q(B^s) w_t.
Φ_P(B^s) and Θ_Q(B^s) are the seasonal autoregressive and moving average operators.

Multiplicative Seasonal ARMA

The ACF of a pure seasonal ARMA is nonzero only at lags s, 2s, ...; most seasonal time series have other nonzero values.

For such series, w_t^(s) = Θ_Q(B^s)⁻¹ Φ_P(B^s) x_t is not white noise for any choice of P and Q.

But suppose that for some P and Q, w_t^(s) is ARMA(p, q):
φ_p(B) w_t^(s) = θ_q(B) w_t,
where {w_t} is white noise.

Then x_t satisfies
Φ_P(B^s) φ_p(B) x_t = Θ_Q(B^s) θ_q(B) w_t.

This is the Multiplicative Seasonal ARMA model ARMA(p, q) × (P, Q)_s.

The non-seasonal parts φ_p and θ_q control short-term correlations (up to half a season, lag s/2), while the seasonal parts Φ_P and Θ_Q control the decay of the correlations over multiple seasons.

Example: Johnson & Johnson earnings; R analysis


par(mfrow = c(2, 1))
plot(log(jj))
jjl = lm(log(jj) ~ time(jj) + factor(cycle(jj)))
summary(aov(jjl))
jjf = ts(fitted(jjl), start = start(jj),
frequency = frequency(jj))
lines(jjf, col = 2, lty = 2)
jjr = ts(residuals(jjl), start = start(jj),
frequency = frequency(jj))
plot(jjr)
acf(jjr)
pacf(jjr)

PACF is simpler than ACF:
the ACF spikes at lags 4, 8, perhaps 12; of these, the PACF spikes only at lag 4;
apart from lags 4, 8, ..., the PACF drops off faster.
The (P)ACF indicates neither a simple ARMA nor a simple ARMA_4.
The PACF suggests ARMA(2, 0) × (1, 0)_4:
jja = arima(jjr, order = c(2, 0, 0),
seasonal = list(order = c(1, 0, 0), period = 4))
print(jja)
tsdiag(jja)

Note: the original fit of the straight line and seasonal dummies was by OLS;
possibly inefficient;
invalid inferences (standard errors, etc.).
Solution: refit as part of the time series model.
x = model.matrix( ~ time(jj) + factor(cycle(jj)))
jja = arima(log(jj), order = c(2, 0, 0),
seasonal = list(order = c(1, 0, 0), period = 4),
xreg = x, include.mean = FALSE)
print(jja)
tsdiag(jja)

Notes:
The time series being fitted is the original unadjusted log(jj).
The regressors are specified as the matrix argument xreg.
arima does not check for linear dependence, so we must either
omit one dummy variable from xreg or use include.mean =
FALSE in arima.
Regression parameter estimates are similar to OLS, but standard errors are roughly doubled.
Using SAS: proc arima program and output.

Multiplicative Seasonal ARIMA


The seasonal difference operator is ∇_s = 1 − B^s.
Some series show slow decay of the ACF only at lags s, 2s, ..., which suggests seasonal differencing.
But note: seasonal means also give slow decay of the ACF at those lags.
The Multiplicative Seasonal ARIMA model ARIMA(p, d, q) × (P, D, Q)_s is
Φ_P(B^s) φ_p(B) ∇_s^D ∇^d x_t = Θ_Q(B^s) θ_q(B) w_t.

The Frequency Domain

Time domain methods:
regress present on past;
capture dynamics in terms of velocity (first order), acceleration (second order), inertia, etc.

Frequency domain methods:
regress present on periodic sines and cosines;
capture dynamics in terms of resonant frequencies, etc.

E.g. AR(2):
plot(ts(arima.sim(list(order = c(2,0,0), ar = c(1.5,-.95)), n = 144)))

Strong periodicity, around 16 peaks: a period of around 9 samples.

Fitting an AR model doesn't obviously describe this periodicity:
x_t = 1.50 x_{t−1} − 0.95 x_{t−2} + w_t.

Cyclical Behavior
Simplest case is the periodic process
x_t = A cos(2πωt + φ)
    = U_1 cos(2πωt) + U_2 sin(2πωt),
where:
A is amplitude;
ω is frequency, in cycles per sample;
φ is phase, in radians;
and U_1 = A cos(φ), U_2 = −A sin(φ).

Folding Frequency; Aliasing


If ω = 0, x_t = A cos(φ), constant.
If ω = 1? At t = 0, 1, 2, ..., the same thing!
ω = 0 is an alias of ω = 1.
All frequencies higher than ω = 1/2 have an alias in 0 ≤ ω ≤ 1/2:
cos[2π(k − ω)t + φ] = cos(2πωt − φ),   t = 0, 1, 2, ...

ω = 1/2 is the folding frequency.

For example, ω = 0.8:

omega = 0.8;
phi = pi / 6;
plot(function(x) cos(2 * pi * omega * x + phi),
     from = 0, to = 10);
plot(function(x) cos(2 * pi * (1 - omega) * x - phi),
     from = 0, to = 10, add = TRUE, col = "red");
abline(v = 0:10, lty = 2, col = "blue");

Note:
ω = 0.8 = 0.5 + 0.3, and 1 − ω = 0.2 = 0.5 − 0.3;
1 − ω is folded around 0.5.

Stationarity

If
x_t = A cos(2πωt + φ) = U_1 cos(2πωt) + U_2 sin(2πωt)
and φ is random, uniformly distributed on [0, 2π), then:
E(x_t) = 0,
E(x_{t+h} x_t) = (1/2) A² cos(2πωh).

So x_t is weakly stationary.

Also
E(U_1) = E(U_2) = 0,
E(U_1²) = E(U_2²) = (1/2) A²,
and
E(U_1 U_2) = 0.
Alternatively, if the U's have these properties, x_t is stationary with the same mean and autocovariances:
E(x_t) = 0,
E(x_{t+h} x_t) = (1/2) A² cos(2πωh).

More generally, if
x_t = Σ_{k=1}^{q} [U_{k,1} cos(2πω_k t) + U_{k,2} sin(2πω_k t)],
where:
the U's are uncorrelated with zero mean;
var(U_{k,1}) = var(U_{k,2}) = σ_k²;
then x_t is stationary with zero mean and autocovariances
γ(h) = Σ_{k=1}^{q} σ_k² cos(2πω_k h).

Harmonic Analysis
Any time series sample x_1, x_2, ..., x_n can be written
x_t = a_0 + Σ_{j=1}^{(n−1)/2} [a_j cos(2πjt/n) + b_j sin(2πjt/n)]
if n is odd; if n is even, an extra term is needed.

The periodogram is
P(j/n) = a_j² + b_j².

The R function spectrum can calculate and plot the periodogram.

R examples:
par(mfcol = c(2, 1));
# one frequency:
x = cos(2*pi*(0.123)*(1:144))
plot.ts(x); spectrum(x, log = "no")
# and a second frequency:
x = x + 2 * cos(2*pi*(0.234)*(1:144))
plot.ts(x); spectrum(x, log = "no")
# and added noise:
x = x + rnorm(144)
plot.ts(x); spectrum(x, log = "no")
# the AR(2) series:
x = ts(arima.sim(list(order = c(2,0,0), ar = c(1.5,-.95)), n = 144))
plot(x); spectrum(x, log = "no")

Using SAS: proc spectra program and output.



tr st

rr ss rqs r str
t s

t t t r t s s sttr
t srs

str st s t rrs r

srt rr rsr

t t srt rr trsr s



t t
t

rqs r t rr r t
rqs

tr rr trsr t s rs trsr


t
t t

Prr

rr s

s rr s rr s P

trs s trs r

tt s r s rt s
r

r t rr s


t str st t

rst t str st s t rt
t t rr

t s r t str st ss t s
t t rr t

rs t rr s s sttr
t str st

t t rr s t rs r
s t r stt

tts t tr

r sttr t srs t t trs


s rs str r str strt
t r

s st ts t s str st
t

r rs ts s s

rtt s t s

Prrts t str st

t s stt

t t str t t s
t t s

r t t t t t

s s

r s rsts t r tr s
s tt t str st t q rss
t t

t tt ts s t s rt
t tr str t srs ts t
tt t rq

t rs r rss rrs t
rs
rs
t
t

ts str st s
rs

s t rs rss

t
rr
ttsrsstr

r
s

t t t t

t ts t
sqr t t


t t

s r r

t rs rt ss
t rs t

s r st ts t t r

tt t r t t rs

ss

sst ts t t r r s

ss tt t r

t t s t
r

t r

ss

rtr tr stts

rt

r rt t

Pr
s rt s sttr

t t t rs r t s

t P

rq s

s r t rr rqs r s ttr
s tr r rq t sr
t

s st t st r s
tr

s tr s sr rr s st
t rt t s qtt rt
t

r Prr

s t t rr
rqs

t s rt s r

t t r Prr


r
s strs st tr tr


r str t s s
r s
ts
r

Pts ts

ts r rt
tstrs t sqr

ts r r
rt tt s
r rt t
tstrs t r

t ts r ts r rrq
ts t t t t r t tr str
ts trs s strt
t r r

t t Prr

s t r
t t s r stt t s
qtt
t s t t s t tr ss rt

trt t r

t t t Prr


t r t s rs ts r
trr ts
r
t

r ts rs
r
s strs st tr tr


t ts sttt s s t rt
rtrr r
r str t s s
r s
ts
r

str

rs t ts t s t

t ts sttt s s t rt s
str rs
ts r

r str

rr tt

s st

tr st t r
tr s t ss st s t
sstt s t rr t t rq

t srs s sttr s tr t
r t r str st t

r ss ss ss t ss st srs
s rsss strs
s tss rq rqs strt strts

r
s strs st tr tr
tstrs t sqr
tt r

[Periodogram plot: s(f) against frequency in cycles per year; period in years shown on the top axis]

tstrs t r
tt r

[Periodogram plot: s(f) against frequency in cycles per year; period in years shown on the top axis]

Tapering
The periodogram works well with data containing only Fourier
frequencies:
w = rnorm(128, sd = 0.01);
x5 = cos(2*pi*(5/128)*(1:128)) + w;
x6 = cos(2*pi*(6/128)*(1:128)) + w;
par(mfcol = c(3, 1), mar = c(2, 2, 1, 1));
spectrum(x5, taper = 0, ylim = c(1e-7, 1e2));
spectrum(x6, taper = 0, ylim = c(1e-7, 1e2));

It doesn't work so well with other frequencies:

x5h = cos(2*pi*(5.5/128)*(1:128)) + w;
spectrum(x5h, taper = 0, ylim = c(1e-7, 1e2));

One solution is to taper the data:


spectrum(x5h, taper = 0.5, ylim = c(1e-7, 1e2))

This works by multiplying the data by a data window :


par(mfcol = c(3, 1), mar = c(2, 2, 1, 1));
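# Note (added comment): tapr() appears to be a helper supplied with the
# course materials; stats::spec.taper(x, p) applies the same kind of
# split-cosine-bell taper to a proportion p at each end of the series.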
plot(tapr(rep(1, 128), 0.25));
plot(x5h);
plot(tapr(x5h, 0.25));

The data window modifies a fraction of the data at each end of the series, to make the data more nearly continuous when it is wrapped.

Tapering makes the main peak wider, but much reduces the side lobes.

To see the side lobes, make the periodogram graphs on a finer grid of frequencies:
par(mfcol = c(2, 1), mar = c(2, 2, 1, 1));
spectrum(x5h, taper = 0.0, ylim = c(1e-7, 1e2), pad = 896)
spectrum(x5h, taper = 0.5, ylim = c(1e-7, 1e2), pad = 896)

The default in R's spectrum (or spec.pgram, which does the work) is to taper 10% at each end of the data.


Lagged regression
The fisheries recruitment series (yt) and the Southern Oscillation Index (xt) are cross-correlated with lags of several
months.
Perhaps we can model them as

y_t = \sum_{r=-\infty}^{\infty} \beta_r x_{t-r} + v_t,

where vt is uncorrelated with xt-r at all lags r. That is,
the coherence between vt and xt is zero at all frequencies;
in words: vt and xt are incoherent.
1

In terms of filters:

z_t = \sum_{r=-\infty}^{\infty} \beta_r x_{t-r}

is the output of a filter whose input is xt, and yt is zt plus
noise that is incoherent with the input.

If the frequency response function of the filter is

B(\omega) = \sum_{r=-\infty}^{\infty} \beta_r e^{-2\pi i \omega r},

the spectrum of zt is

f_{zz}(\omega) = |B(\omega)|^2 f_{xx}(\omega).
2

Also the cross spectrum is

f_{zx}(\omega) = B(\omega) f_{xx}(\omega).

Now yt = zt + vt, and vt is incoherent with xt, and therefore
also with zt.

So the spectrum of yt is

f_{yy}(\omega) = f_{zz}(\omega) + f_{vv}(\omega) = |B(\omega)|^2 f_{xx}(\omega) + f_{vv}(\omega),

and the cross spectrum of yt and xt is

f_{yx}(\omega) = f_{zx}(\omega) = B(\omega) f_{xx}(\omega).
3

So B(\omega) must satisfy

B(\omega) = \frac{f_{yx}(\omega)}{f_{xx}(\omega)}.

Can we find a filter with frequency response function B(\omega)?

Typically, yes. If xt and yt are such that

\int_{-1/2}^{1/2} |B(\omega)| \, d\omega = \int_{-1/2}^{1/2} \frac{|f_{yx}(\omega)|}{f_{xx}(\omega)} \, d\omega < \infty,

the coefficients are

\beta_r = \int_{-1/2}^{1/2} e^{2\pi i \omega r} B(\omega) \, d\omega \approx \frac{1}{n} \sum_{k=0}^{n-1} e^{2\pi i \omega_k r} B(\omega_k).

SOI and recruitment

We need B(\omega_k), k = 0, 1, ..., n-1, but both R's spectrum()
and SAS's proc spectra omit \omega_k for k = 0 and k > n/2.
In R, we can use fft() directly:
dy = fft(rec - mean(rec)) / sqrt(length(rec))   # DFT of the output series
dx = fft(soi - mean(soi)) / sqrt(length(soi))   # DFT of the input series
# filter.complex is apparently a course-supplied helper: it smooths a complex
# series with filter(), here with a circular 15-term moving average
fyx = filter.complex(dy * Conj(dx), rep(1, 15), sides = 2, circular = TRUE)
fxx = filter.complex(dx * Conj(dx), rep(1, 15), sides = 2, circular = TRUE)
B = fyx/fxx
beta = Re(fft(B, inv = TRUE)) / length(B)
plot(-15:15, c(beta[-1], beta)[length(B) + -15:15], type = "h")
abline(h = 0, lty = 3)

Using the seasonally adjusted series gives a very similar result:


dy = fft(recSA) / sqrt(length(recSA))
dx = fft(soiSA) / sqrt(length(soiSA))
fyx = filter.complex(dy * Conj(dx), rep(1, 15), sides = 2, circular = TRUE)
fxx = filter.complex(dx * Conj(dx), rep(1, 15), sides = 2, circular = TRUE)
B = fyx/fxx
betaSA = Re(fft(B, inv = TRUE)) / length(B)
plot(-15:15, c(betaSA[-1], betaSA)[length(B) + -15:15], type = "h")
abline(h = 0, lty = 3)

In this case, the response is clearly recruitment, and the input
is SOI.
We could reverse the roles: SOI versus recruitment.
fxy = Conj(fyx)
fyy = filter.complex(dy * Conj(dy), rep(1, 15), sides = 2, circular = TRUE)
B = fxy/fyy
beta = Re(fft(B, inv = TRUE)) / length(B)
plot(-15:15, c(beta[-1], beta)[length(B) + -15:15], type = "h")
abline(h = 0, lty = 3)

In the first version, \beta_r \approx 0 for r < 0, so the filter is physically
realizable. In other cases, this method may give unrealizable
filters; we can fit the best realizable filter using time domain
methods.
8

Interpreting Coherence
Recall that

f_{yy}(\omega) = |B(\omega)|^2 f_{xx}(\omega) + f_{vv}(\omega)
             = \left| \frac{f_{yx}(\omega)}{f_{xx}(\omega)} \right|^2 f_{xx}(\omega) + f_{vv}(\omega)
             = \rho_{yx}^2(\omega) f_{yy}(\omega) + f_{vv}(\omega).

So

f_{vv}(\omega) = \left[ 1 - \rho_{yx}^2(\omega) \right] f_{yy}(\omega).

The squared coherence is the proportion of the spectrum of
yt that is explained by the lagged regression on xt.
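This decomposition can be estimated directly in R; a minimal sketch (not the
notes' own code), assuming the soi and rec series are available, e.g. from
the astsa package, and using an arbitrary amount of smoothing:

library(astsa)
sr = spec.pgram(cbind(soi, rec), spans = c(15, 15), taper = 0, plot = FALSE)
plot(sr$freq, sr$coh, type = "l",
     xlab = "frequency", ylab = "squared coherence")   # estimate of rho^2_yx(omega)
fyy = sr$spec[, 2]                                     # spectrum of the output (rec)
plot(sr$freq, (1 - sr$coh) * fyy, type = "l",
     xlab = "frequency", ylab = "noise spectrum")      # estimate of f_vv = (1 - rho^2) f_yy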
9

Forecasting

The forecasting problem is also a type of lagged regression:


of xt on its own lags;
and on only the past.

We have seen that the solution is

\hat{x}_t = \sum_{r=1}^{\infty} \phi_r x_{t-r},

where the \phi's must satisfy

cov(x_t - \hat{x}_t, x_{t-r}) = 0

for r = 1, 2, ...
10

That is, w_t = x_t - \hat{x}_t is uncorrelated with all past x's, and
hence with all past w's, and hence is white noise.

So the filter

w_t = x_t - \sum_{r=1}^{\infty} \phi_r x_{t-r} = \sum_{r=0}^{\infty} \pi_r x_{t-r}

(with \pi_0 = 1 and \pi_r = -\phi_r for r \ge 1) turns xt into white noise wt.

So the spectrum f_{xx}(\omega) satisfies

\sigma_w^2 = f_{xx}(\omega) \left| \sum_{r=0}^{\infty} \pi_r e^{-2\pi i \omega r} \right|^2
           = f_{xx}(\omega) \left( \sum_{r=0}^{\infty} \pi_r e^{-2\pi i \omega r} \right) \left( \sum_{r=0}^{\infty} \pi_r e^{2\pi i \omega r} \right).
11

So, taking logarithms:

\log[f_{xx}(\omega)] = \log \sigma_w^2 - \log \sum_{r=0}^{\infty} \pi_r e^{-2\pi i \omega r} - \log \sum_{r=0}^{\infty} \pi_r e^{2\pi i \omega r}.

Now, provided \log[f_{xx}(\omega)] is integrable:

\int_{-1/2}^{1/2} \left| \log[f_{xx}(\omega)] \right| \, d\omega < \infty,

we can write

\log[f_{xx}(\omega)] = l_0 + 2 \sum_{r=1}^{\infty} l_r \cos(2\pi\omega r)
                     = l_0 + \sum_{r=1}^{\infty} l_r e^{-2\pi i \omega r} + \sum_{r=1}^{\infty} l_r e^{2\pi i \omega r}.
12

Some standard complex variable theory implies that we can
match terms:

\log \sigma_w^2 = l_0,

-\log \sum_{r=0}^{\infty} \pi_r e^{-2\pi i \omega r} = \sum_{r=1}^{\infty} l_r e^{-2\pi i \omega r},

-\log \sum_{r=0}^{\infty} \pi_r e^{2\pi i \omega r} = \sum_{r=1}^{\infty} l_r e^{2\pi i \omega r}.

13

That is,

\sigma_w^2 = \exp(l_0) = \exp \left\{ \int_{-1/2}^{1/2} \log[f_{xx}(\omega)] \, d\omega \right\},

and

\sum_{r=0}^{\infty} \pi_r e^{-2\pi i \omega r} = \exp \left\{ - \sum_{r=1}^{\infty} l_r e^{-2\pi i \omega r} \right\},

whence for r = 1, 2, ...

\phi_r = -\pi_r = -\int_{-1/2}^{1/2} \exp \left\{ - \sum_{s=1}^{\infty} l_s e^{-2\pi i \omega s} \right\} e^{2\pi i \omega r} \, d\omega.

This is the essence of Kolmogorov's (1941) solution to the
forecasting problem.
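The innovations-variance formula is easy to check numerically; a minimal
sketch (not from the notes), using the known AR(1) spectrum with phi = 0.6
and sigma_w^2 = 1:

omega = (0:4095) / 4096                          # frequency grid on [0, 1)
f = 1 / Mod(1 - 0.6 * exp(-2i * pi * omega))^2   # AR(1) spectral density, sigma_w^2 = 1
exp(mean(log(f)))                                # approximates exp(integral of log f); about 1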
14

Long Memory Time Series

A time series has short memory if

\sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty.

So a time series for which

\sum_{h=-\infty}^{\infty} |\gamma(h)| = \infty

is said to have long memory.

Why do we care?
Write the mean of x1, x2, . . . , xn as

\bar{x}_n = \frac{x_1 + x_2 + \dots + x_n}{n}.

Then

var(\bar{x}_n) = \frac{1}{n} \sum_{h=-(n-1)}^{n-1} \left( 1 - \frac{|h|}{n} \right) \gamma(h)
              = \frac{1}{n} \sum_{h=-\infty}^{\infty} \left( 1 - \frac{|h|}{n} \right)_+ \gamma(h),

where (a)_+ = max(a, 0) is a if a \ge 0 and 0 if a < 0.
2

If \sum_h |\gamma(h)| < \infty, then

\sum_{h=-\infty}^{\infty} \left( 1 - \frac{|h|}{n} \right)_+ \gamma(h) \to \sum_{h=-\infty}^{\infty} \gamma(h)

as n \to \infty.

So

n \, var(\bar{x}_n) \to \sum_{h=-\infty}^{\infty} \gamma(h),

or

var(\bar{x}_n) = \frac{1}{n} \sum_{h=-\infty}^{\infty} \gamma(h) + o\!\left( \frac{1}{n} \right).
3

That is, for a short memory time series, var(\bar{x}_n) goes to zero
as the sample size increases at the usual rate, \sigma^2/n, but with
a different multiplier.

Note that

\sum_{h=-\infty}^{\infty} \gamma(h) = f(0),

the spectral density f(\omega) evaluated at \omega = 0.

So we can also write

var(\bar{x}_n) = \frac{f(0)}{n} + o\!\left( \frac{1}{n} \right):

\sigma^2 is replaced by f(0).
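This limit is easy to illustrate by simulation; a rough sketch (not from the
notes) for a Gaussian AR(1) with phi = 0.6 and sigma_w^2 = 1, where
f(0) = 1/(1 - 0.6)^2 = 6.25:

set.seed(1)
phi = 0.6; n = 200; nrep = 2000
xbar = replicate(nrep, mean(arima.sim(list(ar = phi), n = n)))
n * var(xbar)      # close to f(0) for large n
1 / (1 - phi)^2    # the limit f(0) = 6.25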
4

But if \sum_h |\gamma(h)| = \infty, this doesn't work.

In practice, many series show var(\bar{x}_n) decaying more slowly.
Plot log[var(\bar{x}_n)] against log(n), and look for a slope of -1.
vartime = function(x, nmax = round(length(x) / 10)) {
  v = rep(NA, nmax);
  for (n in 1:nmax) {
    y = filter(x, rep(1/n, n), sides = 1);
    v[n] = var(y, na.rm = TRUE);
  }
  plot(log(1:nmax), log(v));
  lmv = lm(log(v) ~ log(1:nmax));
  abline(lmv);
  title(paste(deparse(substitute(x)), "; nmax = ", nmax));
  print(summary(lmv));
}
vartime(log(varve))
vartime(globtemp)
vartime(residuals(lm(globtemp ~ time(globtemp))))
5

Fractional Integration
How can we model such series?
Fractionally integrated white noise:

(1 - B)^d x_t = w_t,    0 < d < 0.5.

ACF is

\rho(h) = \frac{\Gamma(h + d)\,\Gamma(1 - d)}{\Gamma(h - d + 1)\,\Gamma(d)} \sim h^{2d - 1}.

So for 0 < d < 0.5,

\sum_{h=-\infty}^{\infty} |\rho(h)| = \infty.
6

Notes:

var(\bar{x}_n) decays like n^{2d - 1}, so

d = \frac{1 + \text{slope of variance-time graph}}{2}

gives a rough empirical estimate of d.

The spectral density is

f(\omega) = \frac{\sigma_w^2}{\left[ 4 \sin^2(\pi\omega) \right]^d},

so for d > 0, f(\omega) \to \infty as \omega \to 0.
7

Also f(\omega) \propto |\omega|^{-2d} as \omega \to 0, so a graph of log[f(\omega)] against
log(|\omega|) gives another estimate of d.

If d \ge 0.5, f(\omega) is not integrable, so the series is not stationary.
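A crude version of that spectral estimate of d can be sketched in R; this is
an illustrative log-periodogram regression, not the notes' own code, and it
assumes the varve series is available, e.g. from the astsa package:

library(astsa)
x = log(varve)
sp = spec.pgram(x, taper = 0, detrend = TRUE, plot = FALSE)
m = 1:floor(sqrt(length(x)))                 # keep only the lowest frequencies
fit = lm(log(sp$spec[m]) ~ log(sp$freq[m]))
-coef(fit)[2] / 2                            # slope is about -2d, so this estimates d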

ARFIMA Model

In some long-memory series, autocorrelations at small lags


do not match those of fractionally integrated noise.

We can add ARMA components to allow for such differences;


the ARIMA(p, d, q) model with fractional d, or ARFIMA.

Use the R function fracdiff():


library(fracdiff)
summary(fracdiff(log(varve)))
summary(fracdiff(log(varve), nar = 1, nma = 1))
summary(fracdiff(residuals(lm(globtemp ~ time(globtemp)))))
9

Trend Estimation with ARFIMA errors

The R function fracdiff() does not allow explanatory variables, but we can use it to calculate a profile likelihood function.
E.g. global temperature versus cumulative CO2 emissions:
source("http://www.stat.ncsu.edu/people/bloomfield/courses/st730/co2w.R");
plot(cbind(globtemp, co2w));
slopes = seq(from = 0, to = 1.5, length = 151);
ll2 = rep(NA, length(slopes));
for (i in 1:length(slopes))
ll2[i] = -2 * fracdiff(globtemp - slopes[i] * co2w)$log.likelihood;
plot(slopes, ll2, type = "l");
abline(h = min(ll2) + qchisq(.95, 1));
10

The point estimate is


slopeEst = slopes[which.min(ll2)];
abline(v = slopeEst, col = "red"); # [1] 0.68

and the 95% confidence interval is roughly:


slopeCI = range(slopes[ll2 <= min(ll2) + qchisq(.95, 1)]);
abline(v = slopeCI, col = "red", lty = 2); # [1] 0.41 1.03

The CO2 series was scaled by its change from 1900 to 2000,
so we estimate the 20th century warming as 0.68°C, with a
confidence interval of (0.41°C, 1.03°C) (note the asymmetry:
0.68 (−0.27, +0.35)°C).
Compare with IPCC: 1906–2005 warming is 0.74°C ± 0.18°C.
11

Conditional Heteroscedasticity (CH)

So far, our models are for the conditional mean.

For instance, the Gaussian AR(1) model

y_t = \phi y_{t-1} + w_t

may be written:
Conditionally on y_{t-1}, y_{t-2}, ...,

y_t \sim N\!\left( \phi y_{t-1}, \, \sigma_w^2 \right).

The conditional mean depends on the past, the conditional
variance does not.
1

Three key features:


The conditional distribution is normal;
The conditional mean is a linear function of yt1, yt2, . . . ;
The conditional variance is constant: conditional homoscedasticity.

All three features could be changed.

Non-normal noise: typically longer tails; for fitting, provided
the variance is finite, changes the likelihood function, but
not much else.

Nonlinear mean function: Modeling a nonlinear mean is quite
difficult; for instance, ensuring stationarity is restrictive.
Threshold models are perhaps most feasible.

Non-constant variance. Two approaches:


ARCH (AutoRegressive CH), GARCH (Generalized ARCH),
...
Stochastic volatility.
3

ARCH Models

Simplest is ARCH(1):

y_t = \sigma_t \epsilon_t
\sigma_t^2 = \alpha_0 + \alpha_1 y_{t-1}^2,

where \epsilon_t is Gaussian white noise with variance 1.

Alternatively:
Conditionally on y_{t-1}, y_{t-2}, ...,

y_t \sim N\!\left( 0, \, \alpha_0 + \alpha_1 y_{t-1}^2 \right).

If |y_{t-1}| happens to be large, \sigma_t is increased, so |y_t| also tends
to be large.

Conversely, if |y_{t-1}| happens to be small, \sigma_t is decreased, so
|y_t| also tends to be small.

So: volatility clusters and long tails.


n = 1000; alpha1 = 0.9; alpha0 = 1 - alpha1;
y = epsilon = ts(rnorm(n));
par(mfcol = c(2, 1));
plot(epsilon);
for (i in 2:n) y[i] = epsilon[i] * sqrt(alpha0 + alpha1 * y[i - 1]^2);
plot(y);
5

ARCH as AR

The ARCH(1) model for yt implies:

y_t^2 = \sigma_t^2 \epsilon_t^2
      = \sigma_t^2 + \sigma_t^2 \left( \epsilon_t^2 - 1 \right)
      = \alpha_0 + \alpha_1 y_{t-1}^2 + \sigma_t^2 \left( \epsilon_t^2 - 1 \right),

or

y_t^2 = \alpha_0 + \alpha_1 y_{t-1}^2 + v_t,

where

v_t = \sigma_t^2 \left( \epsilon_t^2 - 1 \right).
6

Note that

E(v_t \mid y_{t-1}, y_{t-2}, \dots) = 0,

and hence that for h > 0,

E(v_t v_{t-h}) = E\left[ E\left( v_t v_{t-h} \mid y_{t-1}, y_{t-2}, \dots \right) \right]
             = E\left[ v_{t-h} \, E\left( v_t \mid y_{t-1}, y_{t-2}, \dots \right) \right]
             = 0,

so vt is (highly non-normal) white noise, and y_t^2 is AR(1).

For positivity and stationarity, \alpha_0 > 0 and 0 \le \alpha_1 < 1, and
unconditionally,

E(y_t^2) = var(y_t) = \frac{\alpha_0}{1 - \alpha_1}.
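The AR(1) behaviour of y_t^2 is easy to see in the ARCH(1) series simulated
in the earlier code block (a quick illustration, not part of the notes):

acf(y^2)    # sample ACF of y^2 decays roughly geometrically, as for an AR(1)
pacf(y^2)   # and the sample PACF is small beyond lag 1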
7

Extensions and Generalizations


Extend to ARCH(m):

y_t = \sigma_t \epsilon_t
\sigma_t^2 = \alpha_0 + \alpha_1 y_{t-1}^2 + \alpha_2 y_{t-2}^2 + \dots + \alpha_m y_{t-m}^2.

Now y_t^2 is AR(m), with the usual restrictions on the \alpha's.

Generalize to GARCH(m, r):

y_t = \sigma_t \epsilon_t
\sigma_t^2 = \alpha_0 + \sum_{j=1}^{m} \alpha_j y_{t-j}^2 + \sum_{j=1}^{r} \beta_j \sigma_{t-j}^2.

Now y_t^2 is ARMA[m, max(m, r)], with corresponding restrictions
on the \alpha's and \beta's.
8

Simplest GARCH model: GARCH(1, 1)

The GARCH(1, 1) model is widely used:

\sigma_t^2 = \alpha_0 + \alpha_1 y_{t-1}^2 + \beta_1 \sigma_{t-1}^2,

with

\alpha_1 + \beta_1 < 1

for stationarity.

The unconditional variance is now

E(y_t^2) = var(y_t) = \frac{\alpha_0}{1 - \alpha_1 - \beta_1}.
9

n = 1000; alpha1 = 0.5; beta1 = 0.4; alpha0 = 1 - alpha1 - beta1;
y = epsilon = ts(rnorm(n));
par(mfcol = c(2, 1));
plot(epsilon);
sigmatsq = 1;
for (i in 2:n) {
  sigmatsq = alpha0 + alpha1 * y[i - 1]^2 + beta1 * sigmatsq;
  y[i] = epsilon[i] * sqrt(sigmatsq);
}
plot(y);

Volatility clusters are more sustained.

10

In SAS, use proc autoreg and the garch option on the model
statement.
In R, explore and describe volatility:
nyse = ts(scan("nyse.dat"));
par(mfcol = c(2, 1));
plot(nyse);
plot(abs(nyse));
lines(lowess(time(nyse), abs(nyse), f = .005), col = "red");
par(mfcol = c(2, 2));
acf(nyse);
acf(abs(nyse));
acf(nyse^2);

11

In R, fit GARCH (default is 1,1):


library(tseries);
nyse.g = garch(nyse);
summary(nyse.g);
plot(nyse.g);
par(mfcol = c(1, 1));
plot(nyse);
matlines(predict(nyse.g), col = "red", lty = 1);

12

GARCH with a unit root: IGARCH

A special case: \alpha_1 + \beta_1 = 1: GARCH(1, 1) becomes IGARCH(1, 1), with

y_t = \sigma_t \epsilon_t
\sigma_t^2 = \alpha_0 + (1 - \beta_1) y_{t-1}^2 + \beta_1 \sigma_{t-1}^2.

Solving recursively with \alpha_0 = 0:

\sigma_t^2 = (1 - \beta_1) \sum_{j=1}^{\infty} \beta_1^{j-1} y_{t-j}^2,

an exponentially weighted moving average of y_t^2.
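The recursion is just an exponentially weighted moving average of squared
values; a small illustration (not from the notes), reusing the GARCH(1, 1)
series y simulated above:

lambda = 0.9    # an illustrative smoothing weight, playing the role of beta_1
# sigma_t^2 = (1 - lambda) y_{t-1}^2 + lambda sigma_{t-1}^2, started at zero
ewma = filter((1 - lambda) * c(0, y[-length(y)]^2), lambda, method = "recursive")
plot(ts(sqrt(ewma)))    # conditional s.d. as an EWMA of past squared values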


13

Tail Length

All xARCH models give yt with fat tails:

y_t = \sigma_t \epsilon_t where \epsilon_t \sim N(0, 1),

f_y(y) = \int \frac{1}{\sigma} f_\epsilon\!\left( \frac{y}{\sigma} \right) dF_\sigma(\sigma).

f_y(\cdot) is a mixture of Gaussian densities with the same
mean and different variances.

In practice, residuals in xARCH models may not be normal,
but are usually closer to normal than the original data.
14

R Update, Fall 2011

Shumway and Stoffer's code for Example 5.3 does not work
with the R garch function.

The fGarch package provides another method, garchFit, which
allows simultaneous fitting of ARMA and GARCH models.

15

gnp96 = read.table("http://www.stat.pitt.edu/stoffer/tsa2/data/gnp96.dat");
gnpr = ts(diff(log(gnp96[, 2])), frequency = 4, start = c(1947, 1));
library(fGarch);
gnpr.mod = garchFit(gnpr ~ arma(1, 0) + garch(1, 0), data.frame(gnpr = gnpr));
summary(gnpr.mod);
Title:
GARCH Modelling
Call:
garchFit(formula = gnpr ~ arma(1, 0) + garch(1, 0),
data = data.frame(gnpr = gnpr))
Mean and Variance Equation:
data ~ arma(1, 0) + garch(1, 0)
[data = data.frame(gnpr = gnpr)]
Conditional Distribution:
norm
16

Coefficient(s):
        mu         ar1       omega      alpha1
0.00527795  0.36656255  0.00007331  0.19447134

Std. Errors:
based on Hessian

Error Analysis:
        Estimate  Std. Error  t value  Pr(>|t|)
mu     5.278e-03   8.996e-04    5.867  4.44e-09 ***
ar1    3.666e-01   7.514e-02    4.878  1.07e-06 ***
omega  7.331e-05   9.011e-06    8.135  4.44e-16 ***
alpha1 1.945e-01   9.554e-02    2.035    0.0418 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log Likelihood:
722.2849    normalized:  3.253536

17

Standardised Residuals Tests:
                               Statistic      p-Value
Jarque-Bera Test    R    Chi^2   9.118036   0.01047234
Shapiro-Wilk Test   R    W       0.9842405  0.01433578
Ljung-Box Test      R    Q(10)   9.874326   0.4515875
Ljung-Box Test      R    Q(15)  17.55855    0.2865844
Ljung-Box Test      R    Q(20)  23.41363    0.2689437
Ljung-Box Test      R^2  Q(10)  19.2821     0.03682245
Ljung-Box Test      R^2  Q(15)  33.23648    0.004352734
Ljung-Box Test      R^2  Q(20)  37.74259    0.009518987
LM Arch Test        R    TR^2   25.41625    0.01296901

Information Criterion Statistics:
      AIC       BIC       SIC      HQIC
-6.471035 -6.409726 -6.471669 -6.446282

18

garchFit also provides many diagnostic plots:


plot(gnpr.mod);

19

Threshold Models

A simple form of nonlinear model, basically a switching AR(p):

x_t = \alpha^{(j)} + \phi_1^{(j)} x_{t-1} + \dots + \phi_p^{(j)} x_{t-p} + \sigma^{(j)} w_t

if \mathbf{x}_{t-1} \in R_j, where \mathbf{x}_{t-1} = (x_{t-1}, \dots, x_{t-p})', R_1, R_2, \dots, R_r is
a partition of \mathbb{R}^p, and wt is white noise with variance 1.

That is, the AR(p) parameters in the equation for xt change,
depending on the values of the previous p observations
x_{t-1}, \dots, x_{t-p}.
1

Assuming equal variances, can estimate using regression.


E.g. for monthly pneumonia and influenza deaths:
flu = ts(scan("flu.dat"));
dflu = diff(flu);
a = dflu;
for (l in 1:6)
a = cbind(a, lag(dflu, -l));
a = cbind(a, lag(dflu, -1) > 0.05);
a = data.frame(a);
names(a) = c("x", paste("x", 1:6, sep = ""), "delta");
summary(lm(x ~ delta + x1*delta + x2*delta + x3*delta + x4*delta +
x5*delta + x6*delta, data = a));
flu.l = lm(x ~ -1 + delta + x1*delta + x2*delta + x3*delta + x4*delta,
data = a);
summary(flu.l);
flu.r = residuals(flu.l);
delta = a$delta[4 + 1:length(flu.r)];
lapply(split(flu.r, delta), sd);
acf(flu.r);
2

This is inefficient, and standard errors are invalid, if variances
are unequal; here F = 1.93, df = (17, 110), P = .022.
We can also fit the model using two separate regressions:
flu.lF = lm(x ~ x1 + x2 + x3 + x4, data = a, subset = (delta == 0));
summary(flu.lF);
flu.lT = lm(x ~ x1 + x2 + x3 + x4, data = a, subset = (delta == 1));
summary(flu.lT);

Setting up a residual series:


flu.r01 = flu.r;
flu.r01[!delta] = residuals(flu.lF) / sd(residuals(flu.lF));
flu.r01[delta] = residuals(flu.lT) / sd(residuals(flu.lT));
acf(flu.r01);
3

Regression with Autocorrelated Errors

Regression model

y_t = z_t' \beta + x_t,

where the error series xt has covariance matrix \Gamma.

Generalized least squares (GLS) estimates for known \Gamma:

\hat\beta = \left( Z' \Gamma^{-1} Z \right)^{-1} Z' \Gamma^{-1} y.

For unknown \Gamma, plug in an estimate.


4

If xt is a stationary time series, can fit using OLS, then either:

get estimated autocovariances \hat\gamma(h) from the residuals,
and plug in \hat\Gamma;
or use the Cochrane-Orcutt method.

More generally, can use mixed model methods (SAS proc mixed).

Cochrane and Orcutt suggested:

fit using OLS to get an initial estimate of \beta;
fit AR(p) to the OLS residuals (wt is white noise):

\phi(B) x_t = w_t;

transform the regression to

\phi(B) y_t = \phi(B) z_t' \beta + \phi(B) x_t = \phi(B) z_t' \beta + w_t

or

u_t = v_t' \beta + w_t.

Residuals are now white, so fit using OLS (a rough sketch follows).
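A minimal R sketch of the two-step version, assuming an AR(1) error and
reusing the globtemp and co2w series loaded earlier (illustrative only, not
the notes' own code):

ols = lm(globtemp ~ co2w)                                  # step 1: OLS fit
phi = ar(residuals(ols), order.max = 1, aic = FALSE)$ar    # step 2: AR(1) for the residuals
n = length(globtemp)
yt = globtemp[-1] - phi * globtemp[-n]                     # phi(B) y_t
zt = co2w[-1] - phi * co2w[-n]                             # phi(B) z_t
summary(lm(yt ~ zt))                                       # step 3: OLS on the transformed data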
6

SAS proc arima offers a better solution:


Mortality and air pollution (Example 5.6): program and
output.
Global temperature and cumulative CO2 emissions: program and output.
In R, temperature (slightly different) and CO2:
arima(globtemp, order = c(1, 0, 0), xreg = co2w);
arima(globtemp, order = c(4, 0, 0), xreg = co2w);
arima(globtemp, order = c(0, 0, 4), xreg = co2w);

Lagged Regression again: Transfer Functions

To forecast an output series yt given its own past and the
present and past of an input series xt, we might use

y_t = \sum_{j=0}^{\infty} \alpha_j x_{t-j} + \eta_t = \alpha(B) x_t + \eta_t,

where the noise \eta_t is uncorrelated with the inputs.

This generalizes regression with correlated errors by including lags, and specializes the frequency domain lagged regression by excluding future inputs.

Preliminary estimation of \alpha_0, \alpha_1, \dots often suggests a parsimonious model

\alpha(B) = B^d \, \frac{\delta(B)}{\omega(B)},

where:
d is the pure delay: \alpha_0 = \alpha_1 = \dots = \alpha_{d-1} = 0 and \alpha_d \ne 0;
\delta(B) and \omega(B) are low-order polynomials: \omega(B) is needed
if the \alpha's decay exponentially, and \delta(B) is needed if the
first few nonzero \alpha's do not follow the decay.

Preliminary estimates come from the frequency domain method, or a
similar time domain method.
2

Time Domain Preliminary Estimates

If the input series xt were white noise, the cross covariance

\gamma_{y,x}(h) = E(y_{t+h} x_t)
              = E\left[ \left( \sum_{j=0}^{\infty} \alpha_j x_{t+h-j} + \eta_{t+h} \right) x_t \right]
              = \alpha_h \, var(x_t),

so \hat\gamma_{y,x}(h) / \widehat{var}(x_t) provides an estimate of \alpha_h.

Usually, xt is not white noise, but if it is a stationary time
series, we know how to make it white: fit an ARMA model.
3

Prewhitening

Suppose that xt is ARMA:

\phi(B) x_t = \theta(B) w_t,

where wt is white noise.

Apply the prewhitening filter \phi(B)\theta(B)^{-1} to the lagged regression equation:

\tilde{y}_t = \sum_{j=0}^{\infty} \alpha_j w_{t-j} + \tilde\eta_t,

where \tilde{y}_t = [\phi(B)\theta(B)^{-1}] y_t and \tilde\eta_t = [\phi(B)\theta(B)^{-1}] \eta_t.
4

Now the cross correlation of \tilde{y}_t and w_t provides an estimate of \alpha_h (up to a scale factor).


You can use SAS's proc arima to do this:
first identify and estimate a model for xt;
then identify yt with xt as a crosscorr variable.
At the second step, SAS uses the prewhitening filter from
the first step to filter both xt and yt before calculating cross
correlations.
Note: SAS announces that both series have been prewhitened,
but the filter is designed to prewhiten only xt; yt is filtered,
but typically not prewhitened.
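The same calculation can be sketched in R (an illustration under simple
assumptions, not the notes' code): prewhiten the input with a fitted AR(1),
apply the same filter to the output, and look at the cross-correlations.
This assumes soi and rec are available, e.g. from the astsa package.

library(astsa)
fit = arima(soi, order = c(1, 0, 0))                    # crude AR(1) prewhitening model
phi = coef(fit)["ar1"]
soi.pw = residuals(fit)                                  # approximately white input
rec.f = filter(rec - mean(rec), c(1, -phi), sides = 1)   # the same filter applied to the output
ccf(rec.f, soi.pw, lag.max = 24, na.action = na.omit)    # spikes suggest the lags of alpha_h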
5

Finally estimate the model for yt, specifying the input series,
in the form:
input = (d$(L1,1, L1,2, . . . ) . . . (Lk,1, . . . )
/(Lk+1,1, . . . ) . . . (. . . )variable)

E.g. for Southern Oscillation and the fisheries recruitment


series: program and output.

E.g. for global temperature and an estimated historical forcing series: program and output.

Interpreting a Transfer Function

For the global temperature case, we have

y_t = 0.087917 \left( x_t + 0.79513\, x_{t-1} + 0.79513^2\, x_{t-2} + \dots \right) + \eta_t.

So the effect of an impulse in the forcing xt, say a dip due
to a volcanic eruption, is felt in the current year and several
subsequent years, with a mean delay of 1/(1 - 0.79513) \approx 4.9
years.

Also, the effect of a sustained change of +4.4 W/m^2 would be

0.087917 \times 4.4 \times \left( 1 + 0.79513 + 0.79513^2 + \dots \right)
= 0.087917 \times 4.4 / (1 - 0.79513)
\approx 1.9°C.

This is the expected forcing for a doubling of CO2 over preindustrial
levels, and the temperature response is called the climate sensitivity.
The IPCC states:

Analysis of models together with constraints from observations suggest
that the equilibrium climate sensitivity is likely to be in the range 2°C
to 4.5°C, with a best estimate value of about 3°C. It is very unlikely to
be less than 1.5°C.
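A two-line arithmetic check of the numbers quoted above (illustrative only):

theta = 0.087917; lambda = 0.79513
1 / (1 - lambda)              # the mean delay quoted in the notes, about 4.9 years
theta * 4.4 / (1 - lambda)    # response to a sustained +4.4 W/m^2, about 1.9 C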
8

Our estimate is at the low end of that range, but quantifying
its uncertainty is difficult using proc arima.

The profile likelihood for climate sensitivity, constructed using a grid
search in R (with p = 4), gives an estimated value of 1.85°C and 95%
confidence limits of 1.44°C to 2.27°C.

[Figure: -2 log-likelihood contours for climate sensitivity (y-axis, 1.5 to 2.5)
and decay factor (x-axis, 0.4 to 0.9).]
10

[Figure: -2 log-likelihood profile for climate sensitivity (4.4 * theta).]
11

[Figure: -2 log-likelihood profile for decay factor (lambda).]
12

ARMAX Models
Vector (multivariate) regression:

output vector  y_t = (y_{t,1}, y_{t,2}, \dots, y_{t,k})',

input vector   z_t = (z_{t,1}, z_{t,2}, \dots, z_{t,r})'.
1

Regression equation:

y_{t,i} = \beta_{i,1} z_{t,1} + \beta_{i,2} z_{t,2} + \dots + \beta_{i,r} z_{t,r} + w_{t,i},

or in vector form

y_t = B z_t + w_t.

Here {wt} is multivariate white noise:

E(w_t) = 0,    cov(w_{t+h}, w_t) = \Sigma_w if h = 0, and 0 if h \ne 0.

Given observations for t = 1, 2, . . . , n, the least squares estimator
of B, also the maximum likelihood estimator when {wt} is Gaussian white
noise, is

\hat{B} = Y'Z \left( Z'Z \right)^{-1},

where Y has rows y_1', y_2', \dots, y_n' and Z has rows z_1', z_2', \dots, z_n'.

ML estimate of \Sigma_w (replace n with n - r for unbiased):

\hat\Sigma_w = \frac{1}{n} \sum_{t=1}^{n} \left( y_t - \hat{B} z_t \right) \left( y_t - \hat{B} z_t \right)'.
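A small simulated check of the formula for \hat{B} (illustrative, with made-up
dimensions and parameters):

set.seed(2)
n = 100; k = 2; r = 3
Z = matrix(rnorm(n * r), n, r)                            # rows are z_t'
B = matrix(c(1, 0, -1, 2, 0.5, 0), k, r)                  # true coefficient matrix
Y = Z %*% t(B) + matrix(rnorm(n * k, sd = 0.1), n, k)     # rows are y_t'
Bhat = t(Y) %*% Z %*% solve(t(Z) %*% Z)                   # Y'Z (Z'Z)^{-1}
round(Bhat, 2)                                            # close to B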
3

Information criteria:

Akaike:

AIC = \ln \left| \hat\Sigma_w \right| + \frac{2}{n} \left( kr + \frac{k(k+1)}{2} \right);

Schwarz:

SIC = \ln \left| \hat\Sigma_w \right| + \frac{\ln n}{n} \left( kr + \frac{k(k+1)}{2} \right);

Bias-corrected AIC (incorrect in Shumway & Stoffer):

AICc = \ln \left| \hat\Sigma_w \right| + \frac{2}{n - k - r - 1} \left( kr + \frac{k(k+1)}{2} \right).

Vector Autoregression
E.g., VAR(1):

x_t = \alpha + \Phi x_{t-1} + w_t.

Here \Phi is a k \times k coefficient matrix, and {wt} is Gaussian
multivariate white noise.

This resembles the vector regression equation, with:

y_t = x_t,    B = (\alpha, \Phi),    z_t = \begin{pmatrix} 1 \\ x_{t-1} \end{pmatrix}.
5

Observe x0, x1, . . . , xn, and condition on x0.

Maximum conditional likelihood estimators of B and \Sigma_w are
the same as for ordinary vector regression.

VAR(p) is similar, but we must condition on the first p observations.

Full likelihood = conditional likelihood \times likelihood derived
from the marginal distribution of the first p observations, and is
difficult to use.

Example: 1-year, 5-year, and 10-year weekly interest rates


Data from http://research.stlouisfed.org/fred2/series/WGS1YR/,
etc.
a = read.csv("WGS1YR.csv");
WGS1YR = ts(a[,2]);
a = read.csv("WGS5YR.csv");
WGS5YR = ts(a[,2]);
a = read.csv("WGS10YR.csv");
WGS10YR = ts(a[,2]);
a = cbind(WGS1YR, WGS5YR, WGS10YR);
plot(a);
plot(diff(a));

Use the dse package to fit VAR(1) and VAR(2) models to the
differences:
library(dse);
b = TSdata(output = diff(a));
b1 = estVARXls(b, max.lag = 1);
cat("VAR(1)\n print method:\n");
print(b1);
cat("\n summary method:\n");
print(summary(b1));
b2 = estVARXls(b, max.lag = 2);
cat("\nVAR(2)\n print method:\n");
print(b2);
cat("\n summary method:\n");
print(summary(b2));

VAR(1)
print method:
neg. log likelihood= -7188.785
A(L) =
  1-1.014698L1      0+0.05794167L1    0-0.04292339L1
  0-0.02482398L1    1-0.9224325L1     0-0.05304638L1
  0-0.0144053L1     0+0.03872528L1    1-1.024605L1
B(L) =
  1  0  0
  0  1  0
  0  0  1

summary method:
neg. log likelihood = -7188.785
sample length = 2448
         WGS1YR  y.WGS5YR   WGS10YR
RMSE  0.2005654 0.1713752 0.1563661
ARMA: model estimated by estVARXls
inputs :
outputs: WGS1YR y.WGS5YR WGS10YR
9

input dimension = 0
output dimension = 3
order A = 1
order B = 0
order C =
9 actual parameters
6 non-zero constants
trend not estimated.
VAR(2)
print method:
neg. log likelihood= -7414.944

A(L) =
  1-1.329215L1+0.3221239L2     0-0.07336772L1+0.05027099L2   0+0.0002002881L1-0.01317073L2
  0+0.1030711L1-0.05850615L2   1-1.117284L1+0.1974304L2      0-0.02287398L1+0.06233586L2
  0-0.1539836L1+0.1172694L     0-0.1148573L1+0.0577710       1-1.252808L1+0.226
B(L) =
  1  0  0
  0  1  0
  0  0  1

summary method:
neg. log likelihood = -7414.944

sample length = 2448

         WGS1YR  y.WGS5YR   WGS10YR
RMSE  0.1910442 0.1666275 0.1534016
ARMA: model estimated by estVARXls
inputs :
outputs: WGS1YR y.WGS5YR WGS10YR
input dimension = 0
output dimension = 3
order A = 2
order B = 0
order C =
18 actual parameters
6 non-zero constants
trend not estimated.

AIC is smaller (more negative) for VAR(2), but SIC is smaller
for VAR(1).

For VAR(1),

\hat\Phi_1 =
  0.3288773    0.1534516     0.136938
  0.08581201   0.004959931   0.08875425
  0.06575108   0.04152504    0.2406055

Largest off-diagonal elements are (1,3) and (2,3), suggesting
that changes in the 10-year rate are followed, one week later,
by changes in the same direction in the 1-year and 5-year
rates.
10
