Chapter 2
1. (a) The use of vertical rather than horizontal distances relates to the idea
that the explanatory variable, x, is fixed in repeated samples, so what the
model tries to do is to fit the most appropriate value of y using the model for a
given value of x. Taking horizontal distances would have suggested that we
had fixed the value of y and tried to find the appropriate values of x.
(b) When we calculate the deviations of the points, $y_t$, from the fitted
values, $\hat{y}_t$, some points will lie above the line ($y_t > \hat{y}_t$) and some will lie below
the line ($y_t < \hat{y}_t$). When we calculate the residuals ($\hat{u}_t = y_t - \hat{y}_t$), those
corresponding to points above the line will be positive and those below the line
negative, so adding them would mean that they would largely cancel out. In
fact, we could fit an infinite number of lines with a zero average residual. By
squaring the residuals before summing them, we ensure that they all
contribute to the measure of loss and that they do not cancel. It is then
possible to define unique (ordinary least squares) estimates of the intercept
and slope.
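The point made in (b) can be illustrated with a minimal sketch, assuming a simple bivariate regression. The function name `ols_fit` and the four data points are hypothetical and are not taken from the question; the formulae are the standard closed-form OLS estimates.

```python
def ols_fit(x, y):
    """Return (alpha_hat, beta_hat) that minimise the sum of squared vertical residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Hypothetical illustrative data
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

alpha_hat, beta_hat = ols_fit(x, y)
residuals = [yi - alpha_hat - beta_hat * xi for xi, yi in zip(x, y)]

sum_of_residuals = sum(residuals)                  # positive and negative residuals cancel
sum_of_squares = sum(e ** 2 for e in residuals)    # squared residuals do not cancel
```

The plain sum of the residuals is zero (up to rounding) while the sum of squares is strictly positive, which is why the squared version can serve as a measure of loss.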
(c) Taking the absolute values of the residuals and minimising their sum
would certainly also get around the problem of positive and negative residuals
cancelling. However, the absolute value function is much harder to work with
than a square. Squared terms are easy to differentiate, so it is simple to find
analytical formulae for the mean and the variance.
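$$y_t = \alpha + \beta x_t + u_t \quad \text{(PRF)}$$
$$\hat{y}_t = \hat{\alpha} + \hat{\beta} x_t \quad \text{(SRF)}$$
Notice that there is no error or residual term in the equation for the SRF: all
this equation states is that given a particular value of x, multiplying it by
$\hat{\beta}$ and adding $\hat{\alpha}$ will give the model fitted or expected value for y, denoted $\hat{y}$. It
is also possible to write
$$y_t = \hat{\alpha} + \hat{\beta} x_t + \hat{u}_t$$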
This equation splits the observed value of y into two components: the fitted
value from the model, and a residual term. The SRF is used to infer likely
values of the PRF. That is, the estimates $\hat{\alpha}$ and $\hat{\beta}$ are constructed for the
sample data.
We need to make the first four assumptions in order to prove that the ordinary
least squares estimators of $\alpha$ and $\beta$ are best, that is to prove that they have
minimum variance among the class of linear unbiased estimators. The
theorem that proves that OLS estimators are BLUE (provided the assumptions
are fulfilled) is known as the Gauss-Markov theorem. If these assumptions are
violated (which is dealt with in Chapter 4), then it may be that OLS estimators
are no longer unbiased or efficient. That is, they may be inaccurate or
subject to fluctuations between samples.
We needed to make the fifth assumption, that the disturbances are normally
distributed, in order to make statistical inferences about the population
parameters from the sample data, i.e. to test hypotheses about the coefficients.
Making this assumption implies that test statistics will follow a t-distribution
(provided that the other assumptions also hold).
(2.57) Yes, we can use OLS since the model is the usual linear model we have been
dealing with.
(2.58) Yes. The model can be linearised by taking logarithms of both sides and
by rearranging. Although this is a very specific case, it has sound theoretical
foundations (e.g. the Cobb-Douglas production function in economics), and it
is the case that many relationships can be approximately linearised by taking
logs of the variables. The effect of taking logs is to reduce the effect of extreme
values on the regression function, and it may be possible to turn multiplicative
models into additive ones which we can easily estimate.
(2.59) Yes. We can estimate this model using OLS, but we would not be able to
obtain the values of both $\beta$ and $\gamma$ separately; we would obtain only the value of these two
coefficients multiplied together.
(2.60) Yes, we can use OLS, since this model is linear in the logarithms. For
those who have done some economics, models of this kind which are linear in
the logarithms have the interesting property that the coefficients can
be interpreted as elasticities.
(2.61) Yes, in fact we can still use OLS since the model is linear in the parameters. If
we make a substitution, say $q_t = x_t z_t$, then we can run the regression
$$y_t = \alpha + \beta q_t + u_t$$
as usual.
So, in fact, we can estimate a fairly wide range of model types using these
simple tools.
6. The null hypothesis is that the true (but unknown) value of beta is equal to
one, against a one sided alternative that it is greater than one:
$$H_0: \beta = 1$$
$$H_1: \beta > 1$$
$$\text{test stat} = \frac{\hat{\beta} - \beta^*}{SE(\hat{\beta})} = \frac{1.147 - 1}{0.0548} = 2.682$$
We want to compare this with a value from the t-table with T-2 degrees of
freedom, where T is the sample size, and here T-2 =60. We want a value with
5% all in one tail since we are doing a 1-sided test. The critical t-value from the
t-table is 1.671:
[Figure: t-distribution with the 5% rejection region in the upper tail, beyond +1.671]
The value of the test statistic is in the rejection region and hence we can reject
the null hypothesis. We have statistically significant evidence that this security
has a beta greater than one, i.e. it is significantly more risky than the market as
a whole.
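The arithmetic of this one-sided test can be checked with a short sketch. The function name `t_statistic` is hypothetical; the estimate, standard error and critical value are those quoted in the answer above.

```python
def t_statistic(beta_hat, beta_null, se):
    """t-ratio for testing H0: beta = beta_null."""
    return (beta_hat - beta_null) / se


test_stat = t_statistic(beta_hat=1.147, beta_null=1.0, se=0.0548)

# 5% one-sided critical value with 60 degrees of freedom, as read from the t-table in the text
t_crit = 1.671

# One-sided (upper tail) test: reject only if the statistic exceeds the critical value
reject_null = test_stat > t_crit
```

The same function covers the two-sided case: compare the absolute value of the statistic with the two-sided critical value instead.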
7. We want to use a two-sided test to test the null hypothesis that shares in
Chris Mining are completely unrelated to movements in the market as a
whole. In other words, the value of beta in the regression model would be zero
so that whatever happens to the value of the market proxy, Chris Mining
would be completely unaffected by it.
$$H_0: \beta = 0$$
$$H_1: \beta \neq 0$$
The test statistic has the same format as before, and is given by:
$$\text{test stat} = \frac{\hat{\beta} - \beta^*}{SE(\hat{\beta})} = \frac{0.214 - 0}{0.186} = 1.150$$
We want to find a value from the t-tables for a variable with 38-2=36 degrees
of freedom, and we want to look up the value that puts 2.5% of the distribution
in each tail since we are doing a two-sided test and we want to have a 5% size
of test overall:
[Figure: t-distribution with 2.5% rejection regions in each tail, beyond -2.03 and +2.03]
Since the test statistic is not within the rejection region, we do not reject the
null hypothesis. We therefore conclude that we have no statistically significant
evidence that Chris Mining has any systematic risk. In other words, we have
no evidence that changes in the company's value are driven by movements in
the market.
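$$(\hat{\beta} - SE(\hat{\beta}) \times t_{crit},\ \hat{\beta} + SE(\hat{\beta}) \times t_{crit})$$
[Figure: t-distribution with 0.5% rejection regions in each tail, beyond -2.72 and +2.72]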
The second point to note is that we can test an infinite number of hypotheses
about beta once we have formed the interval. For example, we would not reject
the null hypothesis contained in the last question (i.e. that beta = 0), since that
value of beta lies within the 95% and 99% confidence intervals. Would we
reject or not reject a null hypothesis that the true value of beta was 0.6? At the
5% level, we should have enough evidence against the null hypothesis to reject
it, since 0.6 is not contained within the 95% confidence interval. But at the 1%
level, we would no longer have sufficient evidence to reject the null hypothesis,
since 0.6 is now contained within the interval. Therefore we should always if
possible conduct some sort of sensitivity analysis to see if our conclusions are
altered by (sensible) changes in the level of significance used.
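A minimal sketch of the interval calculations, assuming the estimate (0.214) and standard error (0.186) from the Chris Mining question and the critical values 2.03 (5%) and 2.72 (1%) quoted for 36 degrees of freedom. The helper names are hypothetical.

```python
def confidence_interval(beta_hat, se, t_crit):
    """Return (lower, upper) = beta_hat -/+ se * t_crit."""
    half_width = se * t_crit
    return beta_hat - half_width, beta_hat + half_width


def contains(interval, value):
    """True if value lies inside the interval, i.e. H0: beta = value is not rejected."""
    lower, upper = interval
    return lower <= value <= upper


ci_95 = confidence_interval(0.214, 0.186, 2.03)   # roughly (-0.164, 0.592)
ci_99 = confidence_interval(0.214, 0.186, 2.72)   # roughly (-0.292, 0.720)

reject_zero_at_5pct = not contains(ci_95, 0.0)
reject_0_6_at_5pct = not contains(ci_95, 0.6)
reject_0_6_at_1pct = not contains(ci_99, 0.6)
```

The value 0.6 lies just outside the 95% interval but inside the 99% interval, which is the sensitivity to the significance level described above.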
9. We test hypotheses about the actual coefficients, not the estimated values.
We want to make inferences about the likely values of the population
parameters (i.e. to test hypotheses about them). We do not need to test
hypotheses about the estimated values since we know exactly what our
estimates are because we calculated them!
2. (a) $H_0: \beta_3 = 2$
We could use an F- or a t- test for this one since it is a single hypothesis
involving only one coefficient. We would probably in practice use a t-test since
it is computationally simpler and we only have to estimate one regression.
There is one restriction.
(b) $H_0: \beta_3 + \beta_4 = 1$
Since this involves more than one coefficient, we should use an F-test. There is
one restriction.
(c) $H_0: \beta_3 + \beta_4 = 1$ and $\beta_5 = 1$
Since we are testing more than one hypothesis simultaneously, we would use
an F-test. There are 2 restrictions.
(e) $H_0: \beta_2 \beta_3 = 1$
Although there is only one restriction, it is a multiplicative restriction. We
therefore cannot use a t-test or an F-test to test it. In fact we cannot test it at
all using the methodology that has been examined in this chapter.
$H_1: \beta_2 \neq 0$ or $\beta_3 \neq 0$ or $\beta_4 \neq 0$ or $\beta_5 \neq 0$
Note the form of the alternative hypothesis: "or" indicates that only one of the
components of the null hypothesis would have to be rejected for us to reject
the null hypothesis as a whole.
4. The restricted residual sum of squares will always be at least as big as the
unrestricted residual sum of squares i.e.
$$RRSS \geq URSS$$
To see this, think about what we were doing when we determined what the
regression parameters should be: we chose the values that minimised the
residual sum of squares. We said that OLS would provide the best parameter
values given the actual sample data. Now when we impose some restrictions
on the model, so that they cannot all be freely determined, then the model
should not fit as well as it did before. Hence the residual sum of squares must
be higher once we have imposed the restrictions; otherwise, the parameter
values that OLS chose originally without the restrictions could not be the best.
In the extreme case (very unlikely in practice), the two sets of residual sum of
squares could be identical if the restrictions were already present in the data,
so that imposing them on the model would yield no penalty in terms of loss of
fit.
and rearranging
$$p_t = \beta_1 + \beta_2 x_{2t} + \beta_3 q_t + u_t,$$
The t-ratios are given in the final row above, and are in italics. They are
calculated by dividing the coefficient estimate by its standard error. The
relevant value from the t-tables is for a 2-sided test with 5% rejection overall.
T-k = 195; tcrit = 1.97. The null hypothesis is rejected at the 5% level if the
absolute value of the test statistic is greater than the critical value. We would
conclude based on this evidence that only firm size and market to book value
have a significant effect on stock returns.
If a stock's beta increases from 1 to 1.2, then we would expect the return on the
stock to FALL by (1.2-1)*0.084 = 0.0168 = 1.68%
This is not the sign we would have expected on beta, since beta would be
expected to be positively related to return, since investors would require
higher returns as compensation for bearing higher market risk.
7. We would thus consider deleting the price/earnings and beta variables from
the regression since these are not significant in the regression - i.e. they are
not helping much to explain variations in y. We would not delete the constant
term from the regression even though it is insignificant since there are good
statistical reasons for its inclusion.
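$$y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + \beta_4 y_{t-1} + u_t$$
$$\Delta y_t = \gamma_1 + \gamma_2 x_{2t} + \gamma_3 x_{3t} + \gamma_4 y_{t-1} + v_t.$$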
Note that we have not changed anything substantial between these models in
the sense that the second model is just a re-parameterisation (rearrangement)
of the first, where we have subtracted yt-1 from both sides of the equation.
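$$\hat{u}_t = y_t - \hat{y}_t = y_t - \hat{\beta}_1 - \hat{\beta}_2 x_{2t} - \hat{\beta}_3 x_{3t} - \hat{\beta}_4 y_{t-1}$$
Now for the second model, the dependent variable is now the change in y:
$$\hat{v}_t = \Delta y_t - \Delta \hat{y}_t = \Delta y_t - \hat{\gamma}_1 - \hat{\gamma}_2 x_{2t} - \hat{\gamma}_3 x_{3t} - \hat{\gamma}_4 y_{t-1}$$
where $\hat{y}$ is the fitted value in each case (note that we do not need at this stage
to assume they are the same). Rearranging this second model would give
$$\hat{v}_t = y_t - y_{t-1} - \hat{\gamma}_1 - \hat{\gamma}_2 x_{2t} - \hat{\gamma}_3 x_{3t} - \hat{\gamma}_4 y_{t-1}$$
$$\quad = y_t - \hat{\gamma}_1 - \hat{\gamma}_2 x_{2t} - \hat{\gamma}_3 x_{3t} - (\hat{\gamma}_4 + 1) y_{t-1}$$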
If we compare this formulation with the one we calculated for the first model,
we can see that the residuals are exactly the same for the two models, with
$\hat{\gamma}_4 = \hat{\beta}_4 - 1$ and $\hat{\gamma}_i = \hat{\beta}_i$ (i = 1, 2, 3). Hence if the residuals are the same, the
residual sum of squares must also be the same. In fact the two models are
really identical, since one is just a rearrangement of the other.
$$R^2 = 1 - \frac{RSS}{\sum_i (y_i - \bar{y})^2}$$
for the first model and
$$R^2 = 1 - \frac{RSS}{\sum_i (\Delta y_i - \overline{\Delta y})^2}$$
in the second case. Therefore since the total sum of
squares (the denominator) has changed, the value of $R^2$ must have also
changed as a consequence of changing the dependent variable.
(c) By the same logic, since the value of the adjusted R2 is just an algebraic
modification of R2 itself, the value of the adjusted R2 must also change.
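The argument that the RSS is unchanged while $R^2$ changes can be checked numerically. The sketch below is a simplified illustration under stated assumptions: it drops the x regressors and keeps only a constant and the lagged dependent variable, and the simulated series (coefficients 0.5 and 0.8, seed 0) is hypothetical.

```python
import random


def ols_simple(x, y):
    """Bivariate OLS of y on a constant and x; returns (intercept, slope, rss, r2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sum((a - x_bar) ** 2 for a in x)
    intercept = y_bar - slope * x_bar
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    tss = sum((b - y_bar) ** 2 for b in y)
    return intercept, slope, rss, 1 - rss / tss


# Simulate a hypothetical autoregressive series
random.seed(0)
series = [0.0]
for _ in range(200):
    series.append(0.5 + 0.8 * series[-1] + random.gauss(0, 1))

lagged = series[:-1]                                  # y_{t-1}
level = series[1:]                                    # y_t
change = [a - b for a, b in zip(level, lagged)]       # y_t - y_{t-1}

int_level, slope_level, rss_level, r2_level = ols_simple(lagged, level)
int_change, slope_change, rss_change, r2_change = ols_simple(lagged, change)
```

The slope on the lagged term in the differenced regression equals the levels slope minus one, the two RSS values coincide, and the two $R^2$ values differ because the total sum of squares in the denominator differs.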
$$y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + u_t$$
$$y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + \beta_4 x_{4t} + v_t$$
(a) The value of R2 will almost always be higher for the second model since it
has another variable added to the regression. The value of R2 would only be
identical for the two models in the very, very unlikely event that the estimated
(b) The value of the adjusted R2 could fall as we add another variable. The
reason for this is that the adjusted version of R2 has a correction for the loss of
degrees of freedom associated with adding another regressor into a regression.
This implies a penalty term, so that the value of the adjusted R2 will only rise if
the increase in this penalty is more than outweighed by the rise in the value of
R2.
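The trade-off in (b) can be sketched numerically, assuming the usual definition of the adjusted $R^2$, namely $\bar{R}^2 = 1 - \frac{T-1}{T-k}(1 - R^2)$, where T is the sample size and k the number of parameters. The sample size and $R^2$ values below are hypothetical.

```python
def adjusted_r2(r2, T, k):
    """Adjusted R-squared with T observations and k estimated parameters (including the constant)."""
    return 1 - (T - 1) / (T - k) * (1 - r2)


# Hypothetical example: adding a fourth parameter raises R2 only slightly
adj_small_model = adjusted_r2(r2=0.500, T=30, k=3)
adj_large_model = adjusted_r2(r2=0.505, T=30, k=4)

adjusted_r2_fell = adj_large_model < adj_small_model
```

Here $R^2$ rises from 0.500 to 0.505, but the loss of a degree of freedom outweighs that gain, so the adjusted measure falls.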
Chapter 4
1. In the same way as we make assumptions about the true value of beta and
not the estimated values, we make assumptions about the true unobservable
disturbance terms rather than their estimated counterparts, the residuals.
We know the exact value of the residuals, since they are defined by
$\hat{u}_t = y_t - \hat{y}_t$. So we do not need to make any assumptions about the residuals
since we already know their value. We make assumptions about the
unobservable error terms since it is always the true value of the population
disturbances that we are really interested in, although we never actually know
what these are.
The problem appears to be that the regression parameters are all individually
insignificant (i.e. not significantly different from zero), although the value of
R2 and its adjusted version are both very high, so that the regression taken as a
whole seems to indicate a good fit. This looks like a classic example of what we
term near multicollinearity. This is where the individual regressors are very
closely related, so that it becomes difficult to disentangle the effect of each
individual variable upon the dependent variable.
The solution to near multicollinearity that is usually suggested is that since the
problem is really one of insufficient information in the sample to determine
each of the coefficients, then one should go out and get more data. In other
words, we should switch to a higher frequency of data for analysis (e.g. weekly
instead of monthly, monthly instead of quarterly etc.). An alternative is also to
get more data by using a longer sample period (i.e. one going further back in
time), or to combine the two independent variables in a ratio (e.g. x2t / x3t ).
Other, more ad hoc methods for dealing with the possible existence of near
multicollinearity were discussed in Chapter 4:
- Ignore it: if the model is otherwise adequate, i.e. statistically and in terms
of each coefficient being of a plausible magnitude and having an
appropriate sign. Sometimes, the existence of multicollinearity does not
reduce the t-ratios on variables that would have been significant without
the multicollinearity sufficiently to make them insignificant. It is worth
stating that the presence of near multicollinearity does not affect the BLUE
properties of the OLS estimator i.e. it will still be consistent, unbiased
and efficient since the presence of near multicollinearity does not violate
any of the CLRM assumptions 1-4. However, in the presence of near
multicollinearity, it will be hard to obtain small standard errors. This will
not matter if the aim of the model-building exercise is to produce forecasts
from the estimated model, since the forecasts will be unaffected by the
presence of near multicollinearity so long as this relationship between the
explanatory variables continues to hold over the forecasted sample.
(b) The coefficient estimates would still be the correct ones (assuming
that the other assumptions required to demonstrate OLS optimality are
satisfied), but the problem would be that the standard errors could be
wrong. Hence if we were trying to test hypotheses about the true parameter
values, we could end up drawing the wrong conclusions. In fact, for all of
the variables except the constant, the standard errors would typically be too
small, so that we would end up rejecting the null hypothesis too many
times.
- Transforming the data into logs, which has the effect of reducing the effect of
large errors relative to small ones.
5. (a) This is where there is a relationship between the ith and jth residuals.
Recall that one of the assumptions of the CLRM was that such a relationship
did not exist. We want our residuals to be random, and if there is evidence of
autocorrelation in the residuals, then it implies that we could predict the sign
of the next residual and get the right answer more than half the time on
average!
(b) The Durbin Watson test is a test for first order autocorrelation. The test is
calculated as follows. You would run whatever regression you were interested
in, and obtain the residuals. Then calculate the statistic
$$DW = \frac{\sum_{t=2}^{T} (\hat{u}_t - \hat{u}_{t-1})^2}{\sum_{t=2}^{T} \hat{u}_t^2}$$
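A minimal sketch of the Durbin-Watson calculation, following the formula above with both sums running from t = 2 to T. The function name and the two residual series are hypothetical and chosen only to show the extreme cases.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic with numerator and denominator both summed over t = 2, ..., T."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    denominator = sum(residuals[t] ** 2 for t in range(1, len(residuals)))
    return numerator / denominator


# Residuals that alternate in sign: perfect negative first order autocorrelation
dw_negative = durbin_watson([1.0, -1.0, 1.0, -1.0])

# Residuals that never change: perfect positive first order autocorrelation
dw_positive = durbin_watson([1.0, 1.0, 1.0, 1.0])
```

The statistic is 4 in the first case and 0 in the second; a value near 2 would indicate little evidence of first order autocorrelation.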
You would then need to look up the two critical values from the Durbin
Watson tables, and these would depend on how many variables and how many
The problem with a model entirely in first differences is that once we calculate
the long run solution, all the first difference terms drop out (as in the long run
we assume that the values of all variables have converged on their own long
run values so that $y_t = y_{t-1}$ etc.). Thus when we try to calculate the long run
solution to this model, we cannot do it because there isn't a long run solution
to this model!
The answer is yes, there is no reason why we cannot use Durbin Watson in this
case. You may have said no here because there are lagged values of the
regressors (the x variables) in the regression. In fact this would be
wrong since there are no lags of the DEPENDENT (y) variable and hence DW
can still be used.
The major steps involved in calculating the long run solution are to
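$$0 = \beta_1 + \beta_4 y + \beta_5 x_2 + \beta_6 x_3 + \beta_7 x_3$$
We now want to rearrange this to have all the terms in $x_3$ together and so that
y is the subject of the formula:
$$\beta_4 y = -\beta_1 - \beta_5 x_2 - \beta_6 x_3 - \beta_7 x_3$$
$$\beta_4 y = -\beta_1 - \beta_5 x_2 - (\beta_6 + \beta_7) x_3$$
$$y = -\frac{\beta_1}{\beta_4} - \frac{\beta_5}{\beta_4} x_2 - \frac{(\beta_6 + \beta_7)}{\beta_4} x_3$$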
If this still fails, then we really have to admit that the relationship between the
dependent variable and the independent variables was probably not linear
after all so that we have to either estimate a non-linear model for the data
(which is beyond the scope of this course) or we have to go back to the drawing
board and run a different regression containing different variables.
8. (a) It is important to note that we did not need to assume normality in order
to derive the sample estimates of $\alpha$ and $\beta$ or in calculating their standard
errors. We needed the normality assumption at the later stage when we come
to test hypotheses about the regression coefficients, either singly or jointly, so
that the test statistics we calculate would indeed have the distribution (t or F)
that we said they would.
(b) One solution would be to use a technique for estimation and inference
which did not require normality. But these techniques are often highly
complex and also their properties are not so well understood, so we do not
know with such certainty how well the methods will perform in different
circumstances.
Once we spot a few extreme residuals, we should look at the dates when these
outliers occurred. If we have a good theoretical reason for doing so, we can
add in separate dummy variables for big outliers caused by, for example, wars,
changes of government, stock market crashes, changes in market
microstructure (e.g. the big bang of 1986). The effect of the dummy variable
is exactly the same as if we had removed the observation from the sample
altogether and estimated the regression on the remainder. If we only remove
observations in this way, then we make sure that we do not lose any useful
pieces of information represented by sample points.
(b) 1981M1-1995M12
rt = 0.0215 + 1.491 rmt RSS=0.189 T=180
1981M1-1987M10
rt = 0.0163 + 1.308 rmt RSS=0.079 T=82
1987M11-1995M12
rt = 0.0360 + 1.613 rmt RSS=0.082 T=98
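(c) If we define the coefficient estimates for the first and second halves of the
sample as $\alpha_1$ and $\beta_1$, and $\alpha_2$ and $\beta_2$ respectively, then the null and alternative
hypotheses are
$$H_0: \alpha_1 = \alpha_2 \text{ and } \beta_1 = \beta_2$$
$$H_1: \alpha_1 \neq \alpha_2 \text{ or } \beta_1 \neq \beta_2$$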
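As a rough sketch, the Chow statistic for these hypotheses can be computed from the three residual sums of squares reported in part (b). The code assumes the standard form of the test, $\frac{RSS - (RSS_1 + RSS_2)}{RSS_1 + RSS_2} \times \frac{T - 2k}{k}$, with k = 2 parameters per regression; the function name is hypothetical.

```python
def chow_statistic(rss_whole, rss_1, rss_2, T, k):
    """Chow test statistic: restricted RSS from the whole sample, unrestricted RSS from the two sub-samples."""
    rss_unrestricted = rss_1 + rss_2
    return (rss_whole - rss_unrestricted) / rss_unrestricted * (T - 2 * k) / k


# RSS values and sample size as reported in part (b)
chow = chow_statistic(rss_whole=0.189, rss_1=0.079, rss_2=0.082, T=180, k=2)
```

The result would be compared with a critical value from an F-distribution with (k, T - 2k) = (2, 176) degrees of freedom.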
First, the forward predictive failure test - i.e. we are trying to see if the model
for 1981M1-1994M12 can predict 1995M1-1995M12.
The test statistic is given by
$$\frac{RSS - RSS_1}{RSS_1} \times \frac{T_1 - k}{T_2} = \frac{0.189 - 0.148}{0.148} \times \frac{168 - 2}{12} = 3.832$$
where $T_1$ is the number of observations in the first period (i.e. the period that
we actually estimate the model over), and $T_2$ is the number of observations we
are trying to predict. The test statistic follows an F-distribution with ($T_2$, $T_1 - k$)
degrees of freedom. The 5% critical value of F(12, 166) is 1.81, so we reject the null
hypothesis that the model can predict the observations for 1995. We would
conclude that our model is no use for predicting this period, and from a
practical point of view, we would have to consider whether this failure is a
result of atypical behaviour of the series out-of-sample (i.e. during 1995), or
whether it results from a genuine deficiency in the model.
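The forward predictive failure calculation can be reproduced with a short sketch. The function name is hypothetical; the inputs (RSS = 0.189, RSS1 = 0.148, T1 = 168, T2 = 12, k = 2) and the critical value 1.81 are those used in the answer above.

```python
def predictive_failure_statistic(rss_whole, rss_1, t1, t2, k):
    """Predictive failure test: t1 observations used for estimation, t2 observations predicted."""
    return (rss_whole - rss_1) / rss_1 * (t1 - k) / t2


stat = predictive_failure_statistic(rss_whole=0.189, rss_1=0.148, t1=168, t2=12, k=2)

# 5% critical value for F(12, 166), as quoted in the text
f_crit = 1.81
reject_null = stat > f_crit
```

For the backward version of the test, the same function applies with `t1` taken as the estimation sample and `t2` as the hold-out sample, whichever comes first in time.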
Now we need to be a little careful in our interpretation of what exactly are the
first and second sample periods. It would be possible to define T1 as always
being the first sample period. But I think it easier to say that T1 is always the
sample over which we estimate the model (even though it now comes after the
hold-out-sample). Thus T2 is still the sample that we are trying to predict, even
though it comes first. You can use either notation, but you need to be clear and
consistent. If you wanted to choose the other way to the one I suggest, then
Either way, we conclude that there is little evidence against the null
hypothesis. Thus our model is able to adequately back-cast the first 12
observations of the sample.
12. An outlier dummy variable will take the value one for one observation in
the sample and zero for all others. The Chow test involves splitting the sample
into two parts. If we then try to run the regression on both the sub-parts but
the model contains such an outlier dummy, then the observations on that
dummy will be zero everywhere for one of the regressions. For that sub-
sample, the outlier dummy would show perfect multicollinearity with the
intercept and therefore the model could not be estimated.
Chapter 5
1. Autoregressive models specify the current value of a series yt as a function of
its previous p values and the current value of an error term, ut, while moving
average models specify the current value of a series yt as a function of the
current and previous q values of an error term, ut. AR and MA models have
different characteristics in terms of the length of their memories, which has
implications for the time it takes shocks to yt to die away, and for the shapes of
their autocorrelation and partial autocorrelation functions.
2. ARMA models are of particular use for financial series due to their
flexibility. They are fairly simple to estimate, can often produce reasonable
forecasts, and most importantly, they require no knowledge of any structural
variables that might be required for more traditional econometric analysis.
When the data are available at high frequencies, we can still use ARMA models
while exogenous explanatory variables (e.g. macroeconomic variables,
accounting ratios) may be unobservable at any more than monthly intervals at
best.
(a) The first two models are roughly speaking AR(1) models, while the last is
an MA(1). Strictly, since the first model is a random walk, it should be called
an ARIMA(0,1,0) model, but it could still be viewed as a special case of an
autoregressive model.
(b) We know that the theoretical acf of an MA(q) process will be zero after q
lags, so the acf of the MA(1) will be zero at all lags after one. For an
autoregressive process, the acf dies away gradually. It will die away fairly
quickly for case (2), with each successive autocorrelation coefficient taking on
a value equal to half that of the previous lag. For the first case, however, the
acf will never die away, and in theory will always take on a value of one,
whatever the lag.
Turning now to the pacf, the pacf for the first two models would have a large
positive spike at lag 1, and no statistically significant pacfs at other lags.
Again, the unit root process of (1) would have a pacf the same as that of a
stationary AR process. The pacf for (3), the MA(1), will decline geometrically.
(c) Clearly the first equation (the random walk) is more likely to represent
stock prices in practice. The discounted dividend model of share prices states
that the current value of a share will be simply the discounted sum of all
expected future dividends. If we assume that investors form their expectations
about dividend payments rationally, then the current share price should
embody all information that is known about the future of dividend payments,
and hence today's price should only differ from yesterday's by the amount of
unexpected news which influences dividend payments.
Thus stock prices should follow a random walk. Note that we could apply a
similar rational expectations and random walk model to many other kinds of
financial series.
If the stock market really followed the process described by equations (2) or
(3), then we could potentially make useful forecasts of the series using our
model. In the latter case of the MA(1), we could only make one-step ahead
forecasts since the memory of the model is only that length. In the case of
equation (2), we could potentially make a lot of money by forming multiple
step ahead forecasts and trading on the basis of these.
Hence after a period, it is likely that other investors would spot this potential
opportunity and hence the model would no longer be a useful description of
the data.
(d) See the book for the algebra. This part of the question is really an extension
of the others. Analysing the simplest case first, the MA(1), the memory of
the process will only be one period, and therefore a given shock or
innovation, ut, will only persist in the series (i.e. be reflected in yt) for one
For the case of the AR(1) given in equation (2), a given shock, ut, will persist
indefinitely and will therefore influence the properties of yt for ever, but its
effect upon yt will diminish exponentially as time goes on.
In the first case, the series yt could be written as an infinite sum of past
shocks, and therefore the effect of a given shock will persist indefinitely, and
its effect will not diminish over time.
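The three persistence patterns can be sketched by tracing the effect of a single unit shock through each model. This is an illustration under stated assumptions: the AR coefficient of 0.5 is inferred from the statement in (b) that each autocorrelation is half the previous one, the MA coefficient of 0.8 is purely hypothetical, and the function name is hypothetical too.

```python
def impulse_response(ar_coef, ma_coef, horizons):
    """Effect on y_{t+s}, s = 0, 1, ..., of a unit shock u_t in
    y_t = ar_coef * y_{t-1} + u_t + ma_coef * u_{t-1}."""
    responses = []
    for s in range(horizons):
        if s == 0:
            effect = 1.0                      # the shock enters one-for-one
        elif s == 1:
            effect = ar_coef + ma_coef        # carried by the AR term and the lagged shock
        else:
            effect = ar_coef * responses[-1]  # thereafter only the AR term propagates it
        responses.append(effect)
    return responses


random_walk = impulse_response(ar_coef=1.0, ma_coef=0.0, horizons=5)  # never dies away
ar1 = impulse_response(ar_coef=0.5, ma_coef=0.0, horizons=4)          # decays geometrically
ma1 = impulse_response(ar_coef=0.0, ma_coef=0.8, horizons=4)          # gone after one period
```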
4. (a) Box and Jenkins were the first to consider ARMA modelling in this
logical and coherent fashion. Their methodology consists of 3 steps:
Identification - determining the appropriate order of the model using
graphical procedures (e.g. plots of autocorrelation functions).
Estimation - of the parameters of the model of size given in the first stage. This
can be done using least squares or maximum likelihood, depending on the
model.
Diagnostic checking - this step is to ensure that the model actually estimated is
adequate. B & J suggest two methods for achieving this:
If the model appears to be adequate, then it can be used for policy analysis and
for constructing forecasts. If it is not adequate, then we must go back to stage 1
and start again!
(b) The main problem with the B & J methodology is the inexactness of the
identification stage. Autocorrelation functions and partial autocorrelations for
actual data are very difficult to interpret accurately, rendering the whole
procedure often little more than educated guesswork. A further problem
concerns the diagnostic checking stage, which will only indicate when the
proposed model is too small and would not inform on when the model
proposed is too large.
$$AIC = \ln(\hat{\sigma}^2) + \frac{2k}{T}$$
$$SBIC = \ln(\hat{\sigma}^2) + \frac{k}{T} \ln(T)$$
The information criteria trade off an increase in the number of parameters and
therefore an increase in the penalty term against a fall in the RSS, implying a
closer fit of the model to the data.
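A minimal sketch of the two criteria as defined above, where `sigma2` stands for the residual variance estimate, `k` for the number of parameters and `T` for the sample size. The function names and the numerical inputs are hypothetical.

```python
import math


def aic(sigma2, k, T):
    """Akaike information criterion: ln(sigma2) + 2k/T."""
    return math.log(sigma2) + 2 * k / T


def sbic(sigma2, k, T):
    """Schwarz Bayesian information criterion: ln(sigma2) + (k/T) ln(T)."""
    return math.log(sigma2) + k * math.log(T) / T


# Hypothetical inputs: residual variance 1.0, two parameters, 100 observations
aic_value = aic(1.0, 2, 100)
sbic_value = sbic(1.0, 2, 100)

# SBIC penalises extra parameters more heavily than AIC whenever ln(T) > 2
sbic_penalty_larger = (sbic_value - math.log(1.0)) > (aic_value - math.log(1.0))
```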
5. The best way to check for stationarity is to express the model as a lag
polynomial in yt.
$$y_t = 0.803 y_{t-1} + 0.682 y_{t-2} + u_t$$
Rewrite this as
$$y_t (1 - 0.803L - 0.682L^2) = u_t$$
We want to find the roots of the lag polynomial $(1 - 0.803L - 0.682L^2) = 0$ and
determine whether they are greater than one in absolute value. It is easier (in
my opinion) to rewrite this formula (by multiplying through by -1/0.682,
using z for the characteristic equation and rearranging) as
z2 + 1.177 z - 1.466 = 0
Using the standard formula for obtaining the roots of a quadratic equation,
$$z = \frac{-1.177 \pm \sqrt{1.177^2 - 4 \times 1 \times (-1.466)}}{2} = 0.758 \text{ or } -1.934$$
Since ALL the roots must be greater than one in absolute value for the model to be
stationary, and one root (0.758) lies inside the unit circle, we conclude that the
estimated model is not stationary in this case.
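The root calculation can be verified with a short sketch that solves the characteristic equation $z^2 + 1.177z - 1.466 = 0$ from the answer above. The function name is hypothetical, and `cmath` is used only so that the same code would also handle complex roots.

```python
import cmath


def quadratic_roots(a, b, c):
    """Both roots of a*z**2 + b*z + c = 0 (returned as complex numbers)."""
    root_disc = cmath.sqrt(b * b - 4 * a * c)
    return (-b + root_disc) / (2 * a), (-b - root_disc) / (2 * a)


roots = quadratic_roots(1.0, 1.177, -1.466)

# Stationarity requires every root to lie outside the unit circle
is_stationary = all(abs(z) > 1 for z in roots)
```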
6. Using the formulae above, we end up with the following values for each
criterion and for each model order (with an asterisk denoting the smallest
value of the information criterion in each case).
The result is pretty clear: both SBIC and AIC say that the appropriate model is
an ARMA(3,2).
7. We could still perform the Ljung-Box test on the residuals of the estimated
models to see if there was any linear dependence left unaccounted for by our
postulated models.
Another test of the model's adequacy that we could use is to leave out some of
the observations at the identification and estimation stage, and attempt to
construct out of sample forecasts for these. For example, if we have 2000
observations, we may use only 1800 of them to identify and estimate the
models, and leave the remaining 200 for construction of forecasts. We would
then prefer the model that gave the most accurate forecasts.
8. This is not true in general. Yes, we do want to form a model which fits the
data as well as possible. But in most financial series, there is a substantial
amount of noise. This can be interpreted as a number of random events that
are unlikely to be repeated in any forecastable way. We want to fit a model to
the data which will be able to generalise. In other words, we want a model
which fits to features of the data which will be replicated in future; we do not
want to fit to sample-specific noise.
This is why we need the concept of parsimony - fitting the smallest possible
model to the data. Otherwise we may get a great fit to the data in sample, but
any use of the model for forecasts could yield terrible results.
This clearly looks like the data are consistent with a first order moving average
process since all but the first acfs are not significant (the significant lag 4 acf is
a typical wrinkle that one might expect with real data and should probably be
ignored), and the pacf has a slowly declining structure.
$$Q^* = T(T+2) \sum_{k=1}^{m} \frac{\hat{\tau}_k^2}{T-k} \sim \chi^2_m$$
In this case, T=100, and m=3. The null hypothesis is $H_0: \tau_1 = 0$ and $\tau_2 = 0$ and
$\tau_3 = 0$. The test statistic is calculated as
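A minimal sketch of the Ljung-Box calculation with T = 100 and m = 3. The autocorrelation coefficients used here are hypothetical placeholders, not the values from the question, and the function name is hypothetical; 7.81 is the standard 5% critical value of a chi-squared distribution with 3 degrees of freedom.

```python
def ljung_box(acf_values, T):
    """Ljung-Box Q* = T(T+2) * sum over k of tau_k^2 / (T - k), for lags k = 1, ..., m."""
    return T * (T + 2) * sum(tau ** 2 / (T - k) for k, tau in enumerate(acf_values, start=1))


# Hypothetical sample autocorrelations at lags 1, 2 and 3
hypothetical_acf = [0.2, -0.1, 0.05]
q_star = ljung_box(hypothetical_acf, T=100)

chi2_crit_5pct_3df = 7.81
reject_joint_null = q_star > chi2_crit_5pct_3df
```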
i.e. $E_{t-1}(y_t \mid y_{t-2}, y_{t-3}, \ldots)$
$$\begin{aligned}
E(y_{t+2}) &= a_0 + a_1 E(y_{t+1}) \\
&= a_0 + a_1 (a_0 + a_1 E(y_t)) \\
&= a_0 + a_0 a_1 + a_1^2 E(y_t) \\
&= a_0 + a_0 a_1 + a_1^2 (a_0 + a_1 y_{t-1}) \\
&= a_0 + a_0 a_1 + a_1^2 a_0 + a_1^3 y_{t-1}
\end{aligned}$$
etc.
$$f_{t-1,1} = a_0 + a_1 y_{t-1}$$
$$f_{t-1,2} = a_0 + a_1 f_{t-1,1}$$
$$f_{t-1,3} = a_0 + a_1 f_{t-1,2}$$
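The recursion for the AR(1) forecasts can be sketched as follows. The parameter values ($a_0 = 0.1$, $a_1 = 0.5$) and the last observed value (2.0) are hypothetical, as is the function name.

```python
def ar1_forecasts(a0, a1, y_last, steps):
    """Forecasts for 1, 2, ..., steps periods ahead from y_t = a0 + a1 * y_{t-1} + u_t,
    where y_last is the most recent observed value."""
    forecasts = []
    previous = y_last
    for _ in range(steps):
        previous = a0 + a1 * previous   # each forecast feeds into the next one
        forecasts.append(previous)
    return forecasts


a0, a1, y_last = 0.1, 0.5, 2.0
f1, f2, f3 = ar1_forecasts(a0, a1, y_last, steps=3)

# The 3-step forecast written out in full, as in the repeated substitution
f3_closed_form = a0 + a0 * a1 + a0 * a1 ** 2 + a1 ** 3 * y_last
```

The recursive and fully substituted versions of the 3-step ahead forecast agree, which is the point of the repeated substitution.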
So $f_{t-1,1} = b_1 u_{t-1}$
But
$$E(y_{t+1} \mid y_{t-1}, y_{t-2}, \ldots) = E(u_{t+1} + b_1 u_t) = 0$$
Suppose that we know t-1, t-2,... and we are trying to forecast yt.
Our forecast for t is given by
(b) Given the forecasts and the actual value, it is very easy to calculate the
MSE by plugging the numbers in to the relevant formula, which in this case is
$$MSE = \frac{1}{N} \sum_{n=1}^{N} (x_{t-1+n} - f_{t-1,n})^2$$
$$MSE = \frac{1}{3}(3.489 + 0.116 + 0.536) = 1.380$$
Notice also that 84% of the total MSE is coming from the error in the first
forecast. Thus error measures can be driven by one or two times when the
model fits very badly. For example, if the forecast period includes a stock
market crash, this can lead the mean squared error to be 100 times bigger than
it would have been if the crash observations were not included. This point
needs to be considered whenever forecasting models are evaluated. An idea of
whether this is a problem in a given situation can be gained by plotting the
forecast errors over time.
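The figures quoted in (b) can be checked with a short sketch that starts from the three squared forecast errors given in the calculation above (the underlying forecasts and actual values are not repeated here).

```python
# Squared forecast errors for the three forecasts, as given in the MSE calculation
squared_errors = [3.489, 0.116, 0.536]

mse = sum(squared_errors) / len(squared_errors)

# Share of the total squared error contributed by the first forecast
share_of_first = squared_errors[0] / sum(squared_errors)
```

The first forecast accounts for roughly 84% of the total, which is why a single badly predicted observation can dominate the measure.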
(c) This question is much simpler to answer than it looks! In fact, the inclusion
of the smoothing coefficient is a red herring - i.e. a piece of misleading and
useless information. The correct approach is to say that if we believe that the
exponential smoothing model is appropriate, then all useful information will
have already been used in the calculation of the current smoothed value
(which will of course have used the smoothing coefficient in its calculation).
Thus the three forecasts are all 0.0305.
(d) The solution is to work out the mean squared error for the exponential
smoothing model. The calculation is
$$\text{MSE} = \frac{1}{3}\left[(0.0305 - (-0.032))^2 + (0.0305 - 0.961)^2 + (0.0305 - 0.203)^2\right]$$
$$= \frac{1}{3}(0.0039 + 0.8658 + 0.0298) = 0.2998$$
Therefore, we conclude that since the mean squared error is smaller for the
exponential smoothing model than the Box Jenkins model, the former
produces the more accurate forecasts. We should, however, bear in mind that
the question of accuracy was determined using only 3 forecasts, which would
be insufficient in a real application.
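The comparison can be reproduced with a short sketch. The actual values used here (-0.032, 0.961, 0.203) are inferred from the squared errors in the calculation above, so treat them as an assumption; the Box-Jenkins MSE of 1.380 is the figure from part (b).

```python
# Flat exponential smoothing forecast for all three horizons.
smoothed_forecast = 0.0305
# Actual values inferred from the squared errors in the text (assumption).
actuals = [-0.032, 0.961, 0.203]

mse_smoothing = sum((smoothed_forecast - a) ** 2 for a in actuals) / len(actuals)
mse_box_jenkins = 1.380  # from part (b)

smoothing_is_better = mse_smoothing < mse_box_jenkins
print(round(mse_smoothing, 4), smoothing_is_better)
```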
11. (a) The shapes of the acf and pacf are perhaps best summarised in a table:
(b) The important point here is to focus on the MA part of the model and to
ignore the AR dynamics. The characteristic equation would be
(1+0.42z) = 0
The root of this equation is -1/0.42 = -2.38, which lies outside the unit circle,
and therefore the MA part of the model is invertible.
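A one-line numerical check of the invertibility condition, assuming only the MA coefficient 0.42 given above:

```python
theta = 0.42          # MA(1) coefficient from the characteristic equation
root = -1.0 / theta   # solves 1 + theta * z = 0

# Invertible if the root lies outside the unit circle.
is_invertible = abs(root) > 1
print(round(root, 2), is_invertible)
```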
(c) Since no values for the series y or the lagged residuals are given, the
answers should be stated in terms of y and of u. Assuming that information is
available up to and including time t, the 1-step ahead forecast would be for
time t+1, the 2-step ahead for time t+2 and so on. A useful first step would be
to write the model out for y at times t+1, t+2, t+3, t+4:
The 1-step ahead forecast would simply be the conditional expectation of y for
time t+1 made at time t. Denoting the 1-step ahead forecast made at time t as
ft,1, the 2-step ahead forecast made at time t as ft,2 and so on:
since Et[ut+1]=0 and Et[ut+2]=0. Thus, beyond 1-step ahead, the MA(1) part of
the model disappears from the forecast and only the autoregressive part
remains. Although we do not know yt+1, its expected value is the 1-step ahead
forecast that was made at the first stage, ft,1.
(e) Moving average and ARMA models cannot be estimated using OLS; they
are usually estimated by maximum likelihood. Autoregressive models can be
estimated using OLS or maximum likelihood. Pure autoregressive models
contain only lagged values of observed quantities on the RHS, and therefore,
the lags of the dependent variable can be used just like any other regressors.
However, in the context of MA and mixed models, the lagged values of the
error term that occur on the RHS are not known a priori. Hence, these
quantities are replaced by the residuals, which are not available until after the
model has been estimated. But equally, these residuals are required in order to
be able to estimate the model parameters. Maximum likelihood essentially
works around this by calculating the values of the coefficients and the
residuals at the same time. Maximum likelihood involves selecting the most
likely values of the parameters given the actual data sample, and given an
assumed statistical distribution for the errors. This technique will be discussed
in greater detail in the section on volatility modelling in Chapter 8.
12. (a) Some of the stylised differences between the typical characteristics of
macroeconomic and financial data were presented in Chapter 1. In particular,
one important difference is the frequency with which financial asset return
time series and other quantities in finance can be recorded. This is of
particular relevance for the models discussed in Chapter 5, since it is usually a
requirement that all of the time series used in estimating a given
model must be of the same frequency. Thus, if, for example, we wanted to
build a model for forecasting hourly changes in exchange rates, it would be
difficult to set up a structural model containing macroeconomic explanatory
variables since the macroeconomic variables are likely to be measured on a
quarterly or at best monthly basis. This gives a motivation for using pure time-
series approaches (e.g. ARMA models), rather than structural formulations
with separate explanatory variables.
$$Q = T\sum_{k=1}^{m}\hat{\tau}_k^2 \quad\text{and}\quad Q^* = T(T+2)\sum_{k=1}^{m}\frac{\hat{\tau}_k^2}{T-k}$$
The test statistics will both follow a $\chi^2$ distribution with 5 degrees of freedom
(the number of autocorrelation coefficients being used in the test). The critical
values are 11.07 and 15.09 at 5% and 1% respectively. Clearly, the null
hypothesis that the first 5 autocorrelation coefficients are jointly zero is
resoundingly rejected.
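A minimal sketch of the two statistics follows. The autocorrelation coefficients and sample size used here are hypothetical, since the table from the question is not reproduced in this text; only the 1% critical value of 15.09 comes from the answer above.

```python
def box_pierce(acf, T):
    """Q = T * sum of squared autocorrelations."""
    return T * sum(r**2 for r in acf)


def ljung_box(acf, T):
    """Q* = T(T+2) * sum of squared autocorrelations, each divided by (T - k)."""
    return T * (T + 2) * sum(r**2 / (T - k) for k, r in enumerate(acf, start=1))


# Hypothetical inputs for illustration only.
acf = [0.5, 0.1, -0.05, 0.08, 0.2]
T = 100

q = box_pierce(acf, T)
q_star = ljung_box(acf, T)
critical_1pct = 15.09  # chi-squared(5) critical value at 1%
print(round(q, 2), round(q_star, 2), q > critical_1pct, q_star > critical_1pct)
```

The small-sample correction always makes Q* larger than Q, which is why the two can disagree in short samples.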
(c) Setting aside the lag 5 autocorrelation coefficient, the pattern in the table is
for the autocorrelation coefficient to only be significant at lag 1 and then to fall
rapidly to values close to zero, while the partial autocorrelation coefficients
appear to fall much more slowly as the lag length increases. These
characteristics would lead us to think that an appropriate model for this series
is an MA(1). Of course, the autocorrelation coefficient at lag 5 is an anomaly
that does not fit in with the pattern of the rest of the coefficients. But such a
result would be typical of a real data series (as opposed to a simulated data
series that would have a much cleaner structure). This serves to illustrate that
when econometrics is used for the analysis of real data, the data generating
process was almost certainly not any of the models in the ARMA family. So all
we are trying to do is to find a model that best describes the features of the
data to hand. As one econometrician put it, all models are wrong, but some are
useful!
(d) Forecasts from this ARMA model would be produced in the usual way.
Using the same notation as above, and letting fz,1 denote the forecast for time
z+1 made for x at time z, etc:
$$f_{z,1} = 0.38 - 0.10\,u_z = 0.38 - 0.10 \times 0.02 = 0.378$$
$$f_{z,2} = f_{z,3} = 0.38$$
Note that the MA(1) model only has a memory of one period, so all forecasts
further than one step ahead will be equal to the intercept.
Model B: AR(2)
(e) The methods are overfitting and residual diagnostics. Overfitting involves
selecting a deliberately larger model than the proposed one, and examining
the statistical significances of the additional parameters. If the additional
parameters are statistically insignificant, then the originally postulated model
is deemed acceptable. The larger model would usually involve the addition of
one extra MA term and one extra AR term. Thus it would be sensible to try an
ARMA(1,2) in the context of Model A, and an ARMA(3,1) in the context of
Model B. Residual diagnostics would involve examining the acf and pacf of the
residuals from the estimated model. If the residuals showed any action, that
is, if any of the acf or pacf coefficients showed statistical significance, this
would suggest that the original model was inadequate. Residual diagnostics
in the Box-Jenkins sense of the term involve only examining the acf and pacf,
rather than the array of diagnostics considered in Chapter 4.
It is worth noting that these two model evaluation procedures would only
indicate a model that was too small. If the model were too large, i.e. it had
superfluous terms, these procedures would deem the model adequate.
(f) There are obviously several forecast accuracy measures that could be
employed, including MSE, MAE, and the percentage of correct sign
predictions. Assuming that MSE is used, the MSE for each model is
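$$\text{MSE(Model A)} = \frac{1}{4}\left[(0.378 - 0.62)^2 + (0.38 - 0.19)^2 + (0.38 - (-0.32))^2 + (0.38 - 0.72)^2\right] = 0.175$$
$$\text{MSE(Model B)} = \frac{1}{4}\left[(0.681 - 0.62)^2 + (0.718 - 0.19)^2 + (0.690 - (-0.32))^2 + (0.683 - 0.72)^2\right] = 0.326$$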
Therefore, since the mean squared error for Model A is smaller, it would be
concluded that the moving average model is the more accurate of the two in
this case.
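The two MSE figures can be checked with the sketch below. The third actual value is taken to be -0.32; that sign is an inference, being the only choice that reproduces the reported totals of 0.175 and 0.326.

```python
def mse(forecasts, actuals):
    """Mean squared error of a list of forecasts against the realised values."""
    return sum((f - a) ** 2 for f, a in zip(forecasts, actuals)) / len(actuals)


actuals = [0.62, 0.19, -0.32, 0.72]          # third sign inferred (assumption)
model_a = [0.378, 0.38, 0.38, 0.38]          # MA(1) forecasts
model_b = [0.681, 0.718, 0.690, 0.683]       # AR(2) forecasts

mse_a = mse(model_a, actuals)
mse_b = mse(model_b, actuals)
print(round(mse_a, 3), round(mse_b, 3))
```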
The easiest place to start (I think) is to take equation (1), and substitute in for
y3t, to get
Working out the products that arise when removing the brackets,
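$$\begin{aligned}
y_{1t}(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1) ={}& \alpha_0+\alpha_1\beta_0+\alpha_1\beta_1\gamma_0+\alpha_1\beta_1\gamma_2X_{2t}+\alpha_1\beta_1\gamma_3X_{3t}+\alpha_1\beta_1u_{3t}+\alpha_1\beta_2X_{1t} \\
&+\alpha_1\beta_3X_{3t}+\alpha_1u_{2t}+\alpha_2\gamma_0+\alpha_2\gamma_2X_{2t}+\alpha_2\gamma_3X_{3t}+\alpha_2u_{3t}+\alpha_3X_{1t}+\alpha_4X_{2t}+u_{1t}
\end{aligned}$$
$$\begin{aligned}
y_{3t}(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1) ={}& \gamma_0(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1)+\gamma_1y_{1t}(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1) \\
&+\gamma_2X_{2t}(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1)+\gamma_3X_{3t}(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1)+u_{3t}(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1)
\end{aligned} \qquad (7)$$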
Replacing y1t (1 2 1 11 1 ) in (7) with the RHS of (6),
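$$\begin{aligned}
y_{3t}(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1) ={}& \gamma_0(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1) \\
&+\gamma_1\big[\alpha_0+\alpha_1\beta_0+\alpha_1\beta_1\gamma_0+\alpha_2\gamma_0+X_{1t}(\alpha_1\beta_2+\alpha_3)+X_{2t}(\alpha_1\beta_1\gamma_2+\alpha_2\gamma_2+\alpha_4) \\
&\qquad+X_{3t}(\alpha_1\beta_1\gamma_3+\alpha_1\beta_3+\alpha_2\gamma_3)+u_{3t}(\alpha_1\beta_1+\alpha_2)+\alpha_1u_{2t}+u_{1t}\big] \\
&+\gamma_2X_{2t}(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1)+\gamma_3X_{3t}(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1)+u_{3t}(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1)
\end{aligned} \qquad (8)$$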
Expanding the brackets in equation (8) and cancelling the relevant terms
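$$\begin{aligned}
y_{3t}(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1) ={}& \gamma_0+\gamma_1\alpha_0+\gamma_1\alpha_1\beta_0+X_{1t}(\gamma_1\alpha_1\beta_2+\gamma_1\alpha_3)+X_{2t}(\gamma_2+\gamma_1\alpha_4) \\
&+X_{3t}(\gamma_1\alpha_1\beta_3+\gamma_3)+u_{3t}+\gamma_1\alpha_1u_{2t}+\gamma_1u_{1t}
\end{aligned} \qquad (9)$$
$$\begin{aligned}
y_{2t}(1-\alpha_1\beta_1\gamma_1-\gamma_1\alpha_2) ={}& \beta_0(1-\alpha_1\beta_1\gamma_1-\gamma_1\alpha_2)+\beta_1y_{3t}(1-\alpha_1\beta_1\gamma_1-\gamma_1\alpha_2) \\
&+\beta_2X_{1t}(1-\alpha_1\beta_1\gamma_1-\gamma_1\alpha_2)+\beta_3X_{3t}(1-\alpha_1\beta_1\gamma_1-\gamma_1\alpha_2)+u_{2t}(1-\alpha_1\beta_1\gamma_1-\gamma_1\alpha_2)
\end{aligned} \qquad (10)$$
$$\begin{aligned}
y_{2t}(1-\alpha_1\beta_1\gamma_1-\gamma_1\alpha_2) ={}& \beta_0(1-\alpha_1\beta_1\gamma_1-\gamma_1\alpha_2) \\
&+\beta_1\big[\gamma_0+\gamma_1\alpha_0+\gamma_1\alpha_1\beta_0+X_{1t}(\gamma_1\alpha_1\beta_2+\gamma_1\alpha_3)+X_{2t}(\gamma_2+\gamma_1\alpha_4)+X_{3t}(\gamma_3+\gamma_1\alpha_1\beta_3) \\
&\qquad+u_{3t}+\gamma_1\alpha_1u_{2t}+\gamma_1u_{1t}\big] \\
&+\beta_2X_{1t}(1-\alpha_1\beta_1\gamma_1-\gamma_1\alpha_2)+\beta_3X_{3t}(1-\alpha_1\beta_1\gamma_1-\gamma_1\alpha_2)+u_{2t}(1-\alpha_1\beta_1\gamma_1-\gamma_1\alpha_2)
\end{aligned} \qquad (11)$$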
Expanding the brackets in (11) and cancelling the relevant terms
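$$\begin{aligned}
y_{2t}(1-\alpha_1\beta_1\gamma_1-\gamma_1\alpha_2) ={}& \beta_0-\beta_0\alpha_2\gamma_1+\beta_1\gamma_0+\beta_1\gamma_1\alpha_0+X_{1t}(\beta_1\gamma_1\alpha_3+\beta_2-\beta_2\alpha_2\gamma_1)+X_{2t}(\beta_1\gamma_2+\beta_1\gamma_1\alpha_4) \\
&+X_{3t}(\beta_1\gamma_3+\beta_3-\beta_3\alpha_2\gamma_1)+\beta_1u_{3t}+u_{2t}(1-\alpha_2\gamma_1)+\beta_1\gamma_1u_{1t}
\end{aligned} \qquad (12)$$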
From (6),
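$$\begin{aligned}
y_{1t} ={}& \frac{\alpha_0+\alpha_1\beta_0+\alpha_1\beta_1\gamma_0+\alpha_2\gamma_0}{(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1)}+\frac{(\alpha_1\beta_2+\alpha_3)}{(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1)}X_{1t}+\frac{(\alpha_1\beta_1\gamma_2+\alpha_2\gamma_2+\alpha_4)}{(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1)}X_{2t} \\
&+\frac{(\alpha_1\beta_1\gamma_3+\alpha_1\beta_3+\alpha_2\gamma_3)}{(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1)}X_{3t}+\frac{u_{3t}(\alpha_1\beta_1+\alpha_2)+\alpha_1u_{2t}+u_{1t}}{(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1)}
\end{aligned} \qquad (13)$$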
From (12),
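$$\begin{aligned}
y_{2t} ={}& \frac{\beta_0-\beta_0\alpha_2\gamma_1+\beta_1\gamma_0+\beta_1\gamma_1\alpha_0}{(1-\alpha_1\beta_1\gamma_1-\gamma_1\alpha_2)}+\frac{(\beta_1\gamma_1\alpha_3+\beta_2-\beta_2\alpha_2\gamma_1)}{(1-\alpha_1\beta_1\gamma_1-\gamma_1\alpha_2)}X_{1t}+\frac{(\beta_1\gamma_2+\beta_1\gamma_1\alpha_4)}{(1-\alpha_1\beta_1\gamma_1-\gamma_1\alpha_2)}X_{2t} \\
&+\frac{(\beta_1\gamma_3+\beta_3-\beta_3\alpha_2\gamma_1)}{(1-\alpha_1\beta_1\gamma_1-\gamma_1\alpha_2)}X_{3t}+\frac{\beta_1u_{3t}+u_{2t}(1-\alpha_2\gamma_1)+\beta_1\gamma_1u_{1t}}{(1-\alpha_1\beta_1\gamma_1-\gamma_1\alpha_2)}
\end{aligned} \qquad (14)$$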
From (9),
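$$\begin{aligned}
y_{3t} ={}& \frac{\gamma_0+\gamma_1\alpha_0+\gamma_1\alpha_1\beta_0}{(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1)}+\frac{(\gamma_1\alpha_1\beta_2+\gamma_1\alpha_3)}{(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1)}X_{1t}+\frac{(\gamma_2+\gamma_1\alpha_4)}{(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1)}X_{2t} \\
&+\frac{(\gamma_1\alpha_1\beta_3+\gamma_3)}{(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1)}X_{3t}+\frac{u_{3t}+\gamma_1\alpha_1u_{2t}+\gamma_1u_{1t}}{(1-\alpha_2\gamma_1-\alpha_1\beta_1\gamma_1)}
\end{aligned} \qquad (15)$$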
Notice that all of the reduced form equations (13)-(15) in this case depend on
all of the exogenous variables, which is not always the case, and that the
equations contain only exogenous variables on the RHS, which must be the
case for these to be reduced forms.
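One way to guard against algebra slips in a derivation like this is a numerical check: plug arbitrary numbers into the reduced forms and confirm that the resulting values of y1t, y2t and y3t satisfy the structural equations. The sketch below assumes the structural system implied by the derivation, with alpha, beta and gamma as the coefficients of equations (1), (2) and (3) respectively; all numerical values are hypothetical.

```python
# Hypothetical parameter values, chosen only for illustration.
a0, a1, a2, a3, a4 = 0.5, 0.3, -0.2, 0.7, 0.1   # alpha: equation (1)
b0, b1, b2, b3 = 1.0, 0.4, -0.6, 0.25           # beta: equation (2)
g0, g1, g2, g3 = -0.3, 0.5, 0.8, -0.1           # gamma: equation (3)
X1, X2, X3 = 1.2, -0.7, 2.0                     # exogenous variables
u1, u2, u3 = 0.05, -0.02, 0.01                  # disturbances

D = 1 - a2 * g1 - a1 * b1 * g1                  # common denominator

# Reduced forms (13), (14) and (15).
y1 = (a0 + a1 * b0 + a1 * b1 * g0 + a2 * g0
      + (a1 * b2 + a3) * X1
      + (a1 * b1 * g2 + a2 * g2 + a4) * X2
      + (a1 * b1 * g3 + a1 * b3 + a2 * g3) * X3
      + (a1 * b1 + a2) * u3 + a1 * u2 + u1) / D
y2 = (b0 - b0 * a2 * g1 + b1 * g0 + b1 * g1 * a0
      + (b1 * g1 * a3 + b2 - b2 * a2 * g1) * X1
      + (b1 * g2 + b1 * g1 * a4) * X2
      + (b1 * g3 + b3 - b3 * a2 * g1) * X3
      + b1 * u3 + (1 - a2 * g1) * u2 + b1 * g1 * u1) / D
y3 = (g0 + g1 * a0 + g1 * a1 * b0
      + (g1 * a1 * b2 + g1 * a3) * X1
      + (g2 + g1 * a4) * X2
      + (g1 * a1 * b3 + g3) * X3
      + u3 + g1 * a1 * u2 + g1 * u1) / D

# Residuals of the assumed structural equations; all should be (numerically) zero.
res1 = y1 - (a0 + a1 * y2 + a2 * y3 + a3 * X1 + a4 * X2 + u1)
res2 = y2 - (b0 + b1 * y3 + b2 * X1 + b3 * X3 + u2)
res3 = y3 - (g0 + g1 * y1 + g2 * X2 + g3 * X3 + u3)
print(res1, res2, res3)
```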
The order condition can be expressed in a number of ways, one of which is the
following. Let G denote the number of structural equations (equal to the
number of endogenous variables). An equation is just identified if G-1
variables are absent. If more than G-1 are absent, then the equation is over-
identified, while if fewer are absent, then it is not identified.
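The counting rule can be written as a small helper. The function name and the example counts are hypothetical; note that the order condition is only a necessary condition for identification, so this is a first screen rather than a full check.

```python
def order_condition(num_equations, num_absent):
    """Classify an equation by the order condition.

    num_equations: G, the number of structural equations.
    num_absent: number of variables in the system missing from this equation.
    """
    required = num_equations - 1
    if num_absent == required:
        return "just identified"
    if num_absent > required:
        return "over-identified"
    return "not identified"


# Hypothetical counts for a three-equation system.
print(order_condition(3, 2), order_condition(3, 3), order_condition(3, 1))
```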
If we take the view that consistency and unbiasedness are more important than
efficiency (which is the view that I think most econometricians would take),
this implies that treating an endogenous variable as exogenous represents the
more severe mis-specification. So if in doubt, include an equation for it!
(Although, of course, we can test for exogeneity using a Hausman-type test).
A correct answer would be to describe either two stage least squares (2SLS) or
instrumental variables (IV). Either would be acceptable, although IV requires
the user to determine an appropriate set of instruments and hence 2SLS is
simpler in practice. 2SLS involves estimating the reduced form equations, and
obtaining the fitted values in the first stage. In the second stage, the structural
2. (a) A glance at equations (6.97) and (6.98) reveals that the dependent
variable in (6.97) appears as an explanatory variable in (6.98) and that the
dependent variable in (6.98) appears as an explanatory variable in (6.97). The
result is that it would be possible to show that the explanatory variable y2t in
(6.97) will be correlated with the error term in that equation, u1t, and that the
explanatory variable y1t in (6.98) will be correlated with the error term in that
equation, u2t. Thus, there is causality from y1t to y2t and from y2t to y1t, so that
this is a simultaneous equations system. If OLS were applied separately to
each of equations (6.97) and (6.98), the result would be biased and
inconsistent parameter estimates. That is, even with an infinitely large
number of observations, OLS could not be relied upon to deliver the
appropriate parameter estimates.
(b) If the variable y1t had not appeared on the RHS of equation (6.98), this
would no longer be a simultaneous system, but would instead be an example
of a triangular system (see question 3). Thus it would be valid to apply OLS
separately to each of the equations (6.97) and (6.98).
(d) Since equation (6.97) is not identified, no method could be used to obtain
estimates of the parameters of this equation, while either ILS or 2SLS could be
used to obtain estimates of the parameters of (6.98), since it is just identified.
ILS operates by obtaining and estimating the reduced form equations and
then obtaining the structural parameters of (6.98) by algebraic back-
substitution. 2SLS involves again obtaining and estimating the reduced form
equations, and then estimating the structural equations but replacing the
endogenous variables on the RHS of (6.97) and (6.98) with their reduced form
fitted values.
Comparing between ILS and 2SLS, the former method only requires one set of
estimations rather than two, but this is about its only advantage, and
conducting a second stage OLS estimation is usually a computationally trivial
exercise. The primary disadvantage of ILS is that it is only applicable to just
identified equations, whereas many sets of equations that we may wish to
estimate are over-identified. Second, obtaining the structural form coefficients
via algebraic substitution can be a very tedious exercise in the context of large
systems (as the solution to question 1, part (a) shows!).
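$$y_{1t} = \alpha_0 + \alpha_1 y_{2t} + \alpha_2 X_{1t} + \alpha_3 X_{2t} + \alpha_4 \hat{y}_{2t} + u_{1t}$$
$$y_{2t} = \beta_0 + \beta_1 y_{1t} + \beta_2 X_{1t} + \beta_3 \hat{y}_{1t} + u_{2t}$$
Separate tests of the significance of the $\hat{y}_{1t}$ and $\hat{y}_{2t}$ terms would then be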
performed. If it were concluded that they were both significant, this would
imply that additional explanatory power can be obtained by treating the
variables as endogenous.
4. (a) p=2 and k=3 implies that there are two variables in the system, and that
both equations have three lags of the two variables. The VAR can be written in
long-hand form as:
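$$y_{1t} = \alpha_{10} + \alpha_{111}y_{1t-1} + \alpha_{211}y_{2t-1} + \alpha_{112}y_{1t-2} + \alpha_{212}y_{2t-2} + \alpha_{113}y_{1t-3} + \alpha_{213}y_{2t-3} + u_{1t}$$
$$y_{2t} = \alpha_{20} + \alpha_{121}y_{1t-1} + \alpha_{221}y_{2t-1} + \alpha_{122}y_{1t-2} + \alpha_{222}y_{2t-2} + \alpha_{123}y_{1t-3} + \alpha_{223}y_{2t-3} + u_{2t}$$
where $\alpha_0 = \begin{pmatrix}\alpha_{10}\\ \alpha_{20}\end{pmatrix}$, $y_t = \begin{pmatrix}y_{1t}\\ y_{2t}\end{pmatrix}$, $u_t = \begin{pmatrix}u_{1t}\\ u_{2t}\end{pmatrix}$, and the coefficients on the lags of $y_t$
are defined as follows: $\alpha_{ijk}$ refers to the kth lag of the ith variable in the jth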
equation. This seems like a natural notation to use, although of course any
sensible alternative would also be correct.
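To make the indexing convention concrete, here is a minimal sketch that evaluates one period of a bivariate VAR with three lags. The coefficient dictionary is keyed by (i, j, k) = (variable, equation, lag); all numerical values and the function name `var_one_step` are hypothetical.

```python
def var_one_step(intercepts, coefs, history, shocks=(0.0, 0.0)):
    """Compute y_t for every equation of a VAR.

    intercepts[j-1]  : intercept of equation j
    coefs[(i, j, k)] : coefficient on lag k of variable i in equation j
    history[k-1]     : tuple of variable values at time t-k
    """
    g = len(intercepts)
    y_t = []
    for j in range(1, g + 1):
        value = intercepts[j - 1] + shocks[j - 1]
        for k in range(1, len(history) + 1):
            for i in range(1, g + 1):
                value += coefs[(i, j, k)] * history[k - 1][i - 1]
        y_t.append(value)
    return y_t


# Hypothetical coefficients: every lag coefficient set to 0.1.
coefs = {(i, j, k): 0.1 for i in (1, 2) for j in (1, 2) for k in (1, 2, 3)}
intercepts = (0.5, -0.2)
history = [(1.0, 2.0), (0.5, 1.0), (0.0, -1.0)]  # values at t-1, t-2, t-3

y_t = var_one_step(intercepts, coefs, history)
n_params = len(coefs) + len(intercepts)
print(y_t, n_params)
```

Counting the entries shows how quickly the parameter count grows: two equations, each with an intercept and six lag coefficients, give 14 parameters in total.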
(b) This is basically a "what are the advantages of VARs compared with
structural models?" type question, to which a simple and effective response
would be to list and explain the points made in the book.
The most important point is that structural models require the researcher to
specify some variables as being exogenous (if all variables were endogenous,
then none of the equations would be identified, and therefore estimation of
the structural equations would be impossible). This can be viewed as a
Another possible reason why VARs are popular in the academic literature is
that standard form VARs can be estimated using OLS since all of the lags on
the RHS are counted as pre-determined variables.
Further, a glance at the academic literature which has sought to compare the
forecasting accuracies of structural models with VARs, reveals that VARs seem
to be rather better at forecasting (perhaps because the identifying restrictions
are not valid). Thus, from a purely pragmatic point of view, researchers may
prefer VARs if the purpose of the modelling exercise is to produce precise
point forecasts.
(c) VARs have, of course, also been subject to criticisms. The most important
of these criticisms is that VARs are atheoretical. In other words, they use very
little information from economic or financial theory to guide the model
specification process. The result is that the models often have little or no
theoretical interpretation, so that they are of limited use for testing and
evaluating theories.
Second, VARs can often contain a lot of parameters. The resulting loss in
degrees of freedom if the VAR is unrestricted and contains a lot of lags, could
lead to a loss of efficiency and the inclusion of lots of irrelevant or marginally
relevant terms. Third, it is not clear how the VAR lag lengths should be
chosen. Different methods are available (see part (d) of this question), but
they could lead to widely differing answers.
Finally, the very tools that have been proposed to help to obtain useful
information from VARs, i.e. impulse responses and variance decompositions,
are themselves difficult to interpret! See Runkle (1987).
(d) The two methods that we have examined are model restrictions and
information criteria. Details on how these work are contained in Sections
6.12.4 and 6.12.5. But briefly, the model restrictions approach involves
starting with the larger of the two models and testing whether it can be
restricted down to the smaller one using the likelihood ratio test based on the
determinants of the variance-covariance matrices of residuals in each case.
The alternative approach would be to examine the value of various
information criteria and to select the model that minimises the criteria. Since
there are only two models to compare, either technique could be used. The
restriction approach assumes normality for the VAR error terms, while use of
Chapter 7
1. (a) Many series in finance and economics in their levels (or log-levels) forms
are non-stationary and exhibit stochastic trends. They have a tendency not to
revert to a mean level, but they wander for prolonged periods in one
direction or the other. Examples would be most kinds of asset or goods prices,
GDP, unemployment, money supply, etc. Such variables can usually be made
stationary by transforming them into their differences or by constructing
percentage changes of them.
Most importantly therefore, we are not able to perform any hypothesis tests in
models which inappropriately use non-stationary data since the test statistics
will no longer follow the distributions which we assumed they would (e.g. a t
or F), so any inferences we make are likely to be invalid.
(c) A weakly stationary process was defined in Chapter 5, and has the
following characteristics:
1. $E(y_t) = \mu$
2. $E(y_t - \mu)(y_t - \mu) = \sigma^2 < \infty$
3. $E(y_{t_1} - \mu)(y_{t_2} - \mu) = \gamma_{t_2 - t_1} \quad \forall\ t_1, t_2$
That is, a stationary process has a constant mean, a constant variance, and a
constant covariance structure. A strictly stationary process could be defined by
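an equation such as
$$F_{y_{t_1}, y_{t_2}, \ldots, y_{t_T}}(y_1, \ldots, y_T) = F_{y_{t_1+k}, y_{t_2+k}, \ldots, y_{t_T+k}}(y_1, \ldots, y_T)$$
for any $t_1, t_2, \ldots, t_T \in Z$, any $k \in Z$ and T = 1, 2, ..., and where F denotes the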
joint distribution function of the set of random variables. It should be evident
from the definitions of weak and strict stationarity that the latter is a stronger
definition and is a special case of the former. In the former case, only the first
two moments of the distribution have to be constant (i.e. the mean and
variances (and covariances)), whilst in the latter case, all moments of the
distribution (i.e. the whole of the probability distribution) have to be constant.
$y_t = \alpha + \beta t + u_t$
where t = 1, 2, ..., is the trend and $u_t$ is a zero mean white noise disturbance
term. This is called deterministic non-stationarity because the source of the
non-stationarity is a deterministic straight line process.
A variable containing a stochastic trend will also not cross its mean value
frequently and will wander a long way from its mean value. A stochastically
non-stationary process could be a unit root or explosive autoregressive process
such as
$y_t = \phi y_{t-1} + u_t$
where $\phi \geq 1$.
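The two kinds of non-stationarity can be simulated side by side. This is a sketch under stated assumptions: the parameter values, seed and function names are hypothetical, and the shocks are drawn as standard normal white noise.

```python
import random


def deterministic_trend(alpha, beta, shocks):
    """y_t = alpha + beta * t + u_t, with t = 1, 2, ..."""
    return [alpha + beta * t + u for t, u in enumerate(shocks, start=1)]


def autoregressive(phi, shocks, y0=0.0):
    """y_t = phi * y_{t-1} + u_t; phi = 1 gives a random walk, phi > 1 is explosive."""
    path, prev = [], y0
    for u in shocks:
        prev = phi * prev + u
        path.append(prev)
    return path


random.seed(0)
shocks = [random.gauss(0.0, 1.0) for _ in range(200)]

trend_series = deterministic_trend(alpha=1.0, beta=0.05, shocks=shocks)
random_walk = autoregressive(phi=1.0, shocks=shocks)

# In the trend model a shock affects only its own period;
# in the random walk every past shock stays in the level permanently.
print(trend_series[-1], random_walk[-1])
```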
H0 : $y_t \sim I(1)$
H1 : $y_t \sim I(0)$
H0 : $\psi = 0$
H1 : $\psi < 0$
(b) The test statistic is given by $\hat{\psi} / SE(\hat{\psi})$, which equals -0.02 / 0.31 = -0.06.
Since this is not more negative than the appropriate critical value, we do not
reject the null hypothesis.
(c) We therefore conclude that there is at least one unit root in the series
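(there could be 1, 2, 3 or more). What we would do now is to regress $\Delta^2 y_t$ on
$\Delta y_{t-1}$ and test if there is a further unit root. The null and alternative hypotheses
would now be
H0 : $\Delta y_t \sim I(1)$, i.e. $y_t \sim I(2)$
H1 : $\Delta y_t \sim I(0)$, i.e. $y_t \sim I(1)$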
If we rejected the null hypothesis, we would therefore conclude that the first
differences are stationary, and hence the original series was I(1). If we did not
reject at this stage, we would conclude that yt must be at least I(2), and we
would have to test again until we rejected.
3. Using the same regression as above, but on a different set of data, the
researcher now obtains the estimate $\hat{\psi} = -0.52$ with standard error = 0.16.
(a) The test statistic is calculated as above. The value of the test statistic =
-0.52 / 0.16 = -3.25. We therefore reject the null hypothesis since the test
statistic is smaller (more negative) than the critical value.
(b) We conclude that the series is stationary since we reject the unit root null
hypothesis. We need do no further tests since we have already rejected.
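The decision rule used in questions 2 and 3 can be sketched as follows. The critical value of -2.86 is an assumption: it is the usual asymptotic 5% Dickey-Fuller value for a regression with a constant, and the value given in the question may differ with the sample size and the deterministic terms included.

```python
def dickey_fuller_decision(psi_hat, std_error, critical_value):
    """Return the DF test statistic and whether the unit root null is rejected."""
    statistic = psi_hat / std_error
    reject_unit_root = statistic < critical_value  # must be MORE negative
    return statistic, reject_unit_root


critical_value = -2.86  # assumed 5% critical value, see the note above

stat_q2, reject_q2 = dickey_fuller_decision(-0.02, 0.31, critical_value)
stat_q3, reject_q3 = dickey_fuller_decision(-0.52, 0.16, critical_value)
print(round(stat_q2, 2), reject_q2, round(stat_q3, 2), reject_q3)
```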
4. (a) If two or more series are cointegrated, in intuitive terms this implies that
they have a long run equilibrium relationship that they may deviate from in
the short run, but which will always be returned to in the long run. In the
context of spot and futures prices, the fact that these are essentially prices of
the same asset but with different delivery and payment dates, means that
financial theory would suggest that they should be cointegrated. If they were
not cointegrated, this would imply that the series did not contain a common
stochastic trend and that they could therefore wander apart without bound
even in the long run. If the spot and futures prices for a given asset did
separate from one another, market forces would work to bring them back to
follow their long run relationship given by the cost of carry formula.
(b) There are many other examples that one could draw from financial or
economic theory of situations where cointegration would be expected to be
present and where its absence could imply a permanent disequilibrium. It is
usually the presence of market forces and investors continually looking for
arbitrage opportunities that would lead us to expect cointegration to exist.
Good illustrations include equity prices and dividends, or price levels in a set
of countries and the exchange rates between them. The latter is embodied in
the purchasing power parity (PPP) theory, which suggests that a
representative basket of goods and services should, when converted into a
common currency, cost the same wherever in the world it is purchased. In the
context of PPP, one may expect cointegration since again, its absence would
imply that relative prices and the exchange rate could wander apart without
bound in the long run. This would imply that the general price of goods and
services in one country could get permanently out of line with those, when
converted into a common currency, of other countries. This would not be
expected to happen since people would spot a profitable opportunity to buy
the goods in one country where they were cheaper and to sell them in the
country where they were more expensive until the prices were forced back into
line. There is some evidence against PPP, however, and one explanation is that
transactions costs including transportation costs, currency conversion costs,
differential tax rates and restrictions on imports, stop full adjustment from
taking place. Services are also much less portable than goods and everybody
knows that everything costs twice as much in the UK as anywhere else in the
world.
5. (a) The Johansen test is computed in the following way. Suppose we have p
variables that we think might be cointegrated. First, ensure that all the
variables are of the same order of non-stationarity, and in fact are I(1), since it is
very unlikely that variables will be of a higher order of integration. Stack the
variables that are to be tested for cointegration into a p-dimensional vector,
called, say, $y_t$. Then construct a $p \times 1$ vector of first differences, $\Delta y_t$, and form
and estimate the following VAR
$$\Delta y_t = \Pi y_{t-k} + \Gamma_1 \Delta y_{t-1} + \Gamma_2 \Delta y_{t-2} + \cdots + \Gamma_{k-1} \Delta y_{t-(k-1)} + u_t$$
Then test the rank of the matrix $\Pi$. If $\Pi$ is of zero rank (i.e. all the eigenvalues
are not significantly different from zero), there is no cointegration, otherwise,
the rank will give the number of cointegrating vectors. (You could also go into
a bit more detail on how the eigenvalues are used to obtain the rank.)
(b) Repeating the table given in the question, but adding the null and
alternative hypotheses in each case, and letting r denote the number of
cointegrating vectors:
Considering each row in the table in turn, and looking at the first one first, the
test statistic is greater than the critical value, so we reject the null hypothesis
that there are no cointegrating vectors. The same is true of the second row
(that is, we reject the null hypothesis of one cointegrating vector in favour of
the alternative that there are two). Looking now at the third row, we cannot
reject (at the 5% level) the null hypothesis that there are two cointegrating
vectors, and this is our conclusion. There are two independent linear
combinations of the variables that will be stationary.
The test statistic for testing the validity of these restrictions is given by
where
If the restrictions are supported by the data, the eigenvalues will not change
much when the restrictions are imposed and so the test statistic will be small.
(d) There are many applications that could be considered, and tests for PPP,
for cointegration between international bond markets, and tests of the
expectations hypothesis were presented in Sections 7.9, 7.10, and 7.11
respectively. These are not repeated here.
Thus the trace test starts by examining all eigenvalues together to test H0: r =
0, and if this is not rejected, this is the end and the conclusion would be that
there is no cointegration. If this hypothesis is rejected, the largest
eigenvalue would be dropped and a joint test conducted using all of the
eigenvalues except the largest to test H0: r = 1. If this hypothesis is not
rejected, the conclusion would be that there is one cointegrating vector, while
if this is rejected, the second largest eigenvalue would be dropped and the test
statistic recomputed using the remaining g-2 eigenvalues and so on. The
testing sequence would stop when the null hypothesis is not rejected.
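The testing sequence can be sketched in code. The trace statistic for the null of r cointegrating vectors is computed here as minus T times the sum of log(1 - eigenvalue) over all eigenvalues except the r largest. The eigenvalues, sample size and critical values below are hypothetical illustrations, not figures from the question.

```python
import math


def trace_statistics(eigenvalues, T):
    """Trace statistic for H0: r = 0, 1, ..., g-1, dropping the r largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return [-T * sum(math.log(1.0 - lam) for lam in ordered[r:])
            for r in range(len(ordered))]


def select_rank(eigenvalues, T, critical_values):
    """Stop at the first null hypothesis that is not rejected."""
    stats = trace_statistics(eigenvalues, T)
    for r, (stat, cv) in enumerate(zip(stats, critical_values)):
        if stat <= cv:
            return r
    return len(eigenvalues)  # every null rejected: full rank


# Hypothetical inputs for a three-variable system.
critical_values = [29.68, 15.41, 3.76]
rank_a = select_rank([0.5, 0.3, 0.02], T=100, critical_values=critical_values)
rank_b = select_rank([0.05, 0.02, 0.01], T=100, critical_values=critical_values)
print(rank_a, rank_b)
```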
6. (a) The operation of the Johansen test has been described in the book, and
also in question 5, part (a) above. If the rank of the $\Pi$ matrix is zero, this
implies that there is no cointegration or no common stochastic trends between
the series. A finding that the rank of $\Pi$ is one or two would imply that there
were one or two linearly independent cointegrating vectors or combinations of
the series that would be stationary respectively. A finding that the rank of $\Pi$ is
3 would imply that the matrix is of full rank. Since the maximum number of
cointegrating vectors is g-1, where g is the number of variables in the system,
this does not imply that there are 3 cointegrating vectors. In fact, the implication
of a rank of 3 would be that the original series were stationary, and provided
that unit root tests had been conducted on each series, this would have
effectively been ruled out.
(b) The first test of H0: r = 0 is conducted using the first row of the table.
Clearly, the test statistic is greater than the critical value so the null hypothesis
is rejected. Considering the second row, the same is true, so that the null of r =
1 is also rejected. Considering now H0: r = 2, the test statistic is smaller than
the critical value so that the null is not rejected. So we conclude that there are
2 cointegrating vectors, or in other words 2 linearly independent combinations
of the non-stationary variables that are stationary.
- Frequency: Stock market prices are measured every time there is a trade or
somebody posts a new quote, so often the frequency of the data is very high
Of these, we can allow for the non-stationarity within the linear (ARIMA)
framework, and we can use whatever frequency of data we like to form the
models, but we cannot hope to capture the other features using a linear model
with Gaussian disturbances.
(b) GARCH models are designed to capture the volatility clustering effects in
the returns (GARCH(1,1) can model the dependence in the squared returns, or
squared residuals), and they can also capture some of the unconditional
leptokurtosis, so that even if the residuals of a linear model of the form given
by the first part of the equation in part (e), the $u_t$s, are leptokurtic, the
standardised residuals from the GARCH estimation are likely to be less
leptokurtic. Standard GARCH models cannot, however, account for leverage
effects.
- How do we decide on q?
GARCH(1,1) goes some way to get around these. The GARCH(1,1) model has
only three parameters in the conditional variance equation, compared to q+1
for the ARCH(q) model, so it is more parsimonious. Since there are less
(d) There are a number that you could choose from, and the relevant ones that
were discussed in Chapter 8 include EGARCH, GJR and GARCH-M.
The first two of these are designed to capture leverage effects. These are
asymmetries in the response of volatility to positive or negative returns. The
standard GARCH model cannot capture these, since we are squaring the
lagged error term, and we are therefore losing its sign.
The conditional variance equations for the EGARCH and GJR models are
respectively
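$$\log(\sigma_t^2) = \omega + \beta\log(\sigma_{t-1}^2) + \gamma\frac{u_{t-1}}{\sqrt{\sigma_{t-1}^2}} + \alpha\left[\frac{|u_{t-1}|}{\sqrt{\sigma_{t-1}^2}} - \sqrt{\frac{2}{\pi}}\right]$$
and
$$\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \beta\sigma_{t-1}^2 + \gamma u_{t-1}^2 I_{t-1}$$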
The EGARCH model also has the added benefit that the model is expressed in
terms of the log of ht, so that even if the parameters are negative, the
conditional variance will always be positive. We do not therefore have to
artificially impose non-negativity constraints.
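A minimal sketch of one step of each variance recursion is given below. The parameter values are hypothetical, and the indicator in the GJR model is taken to equal one when the lagged shock is negative, which is the usual convention but is not stated in the text above.

```python
import math


def egarch_variance(omega, beta, gamma, alpha, u_prev, var_prev):
    """One step of the EGARCH recursion; the exponential keeps the variance positive."""
    z = u_prev / math.sqrt(var_prev)
    log_var = (omega + beta * math.log(var_prev) + gamma * z
               + alpha * (abs(z) - math.sqrt(2.0 / math.pi)))
    return math.exp(log_var)


def gjr_variance(alpha0, alpha1, beta, gamma, u_prev, var_prev):
    """One step of the GJR recursion; the indicator switches on for negative shocks."""
    indicator = 1.0 if u_prev < 0 else 0.0
    return alpha0 + alpha1 * u_prev**2 + beta * var_prev + gamma * u_prev**2 * indicator


# Hypothetical values. Even with all-negative EGARCH parameters the variance is positive.
egarch_neg_params = egarch_variance(-0.1, -0.5, -0.2, -0.3, u_prev=-1.5, var_prev=0.8)

# Leverage effect in GJR: a negative shock raises variance more than a positive one.
gjr_after_fall = gjr_variance(0.01, 0.05, 0.90, 0.08, u_prev=-1.0, var_prev=1.0)
gjr_after_rise = gjr_variance(0.01, 0.05, 0.90, 0.08, u_prev=1.0, var_prev=1.0)
print(egarch_neg_params, gjr_after_fall, gjr_after_rise)
```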
so that the model allows the lagged value of the conditional variance to affect
the return. In other words, our best current estimate of the total risk of the
asset influences the return, so that we expect a positive coefficient for $\delta$. Note
that some authors use $\sigma_t$ (i.e. a contemporaneous term).
(e) Since $y_t$ are returns, we would expect their mean value (which will be
given by $\mu$) to be positive and small. We are not told the frequency of the data,
(f) Since the model was estimated using maximum likelihood, it does not seem
natural to test this restriction using the F-test via comparisons of residual
sums of squares (and a t-test cannot be used since it is a test involving more
than one coefficient). Thus we should use one of the approaches to hypothesis
testing based on the principles of maximum likelihood (Wald, Lagrange
Multiplier, Likelihood Ratio). The easiest one to use would be the likelihood
ratio test, which would be computed as follows:
$$LR = -2(L_r - L_u) \sim \chi^2(m)$$
where $L_r$ and $L_u$ are the values of the LLF for the restricted and
unrestricted models respectively, and m denotes the number of
restrictions, which in this case is one.
4. If the value of the test statistic is greater than the critical value, reject
the null hypothesis that the restrictions are valid.
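The test can be sketched as a small function. The log-likelihood values below are hypothetical; 3.84 is the 5% critical value of a chi-squared distribution with one degree of freedom, matching the single restriction here.

```python
def likelihood_ratio_test(llf_restricted, llf_unrestricted, critical_value):
    """LR = -2 (L_r - L_u); reject the restrictions if LR exceeds the critical value."""
    statistic = -2.0 * (llf_restricted - llf_unrestricted)
    return statistic, statistic > critical_value


# Hypothetical log-likelihood function values.
lr_stat, reject = likelihood_ratio_test(llf_restricted=64.54,
                                        llf_unrestricted=66.85,
                                        critical_value=3.84)
print(round(lr_stat, 2), reject)
```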
Let $h^f_{1,T}$ be the one step ahead forecast for h made at time T. This is easy to
calculate since, at time T, we know the values of all the terms on the RHS.
Given $h^f_{1,T}$, how do we calculate $h^f_{2,T}$, that is the 2-step ahead forecast for h
made at time T?
where $E_T(u_{T+1}^2)$ is the expectation, made at time T, of $u_{T+1}^2$, which is the
squared disturbance term. The model assumes that the series $u_t$ has zero
mean, so we can now write
$h_t = E[(u_t)^2]$
Turning this argument around, and applying it to the problem that we have,
$E_T[(u_{T+1})^2] = h_{T+1}$
but we do not know $h_{T+1}$, so we replace it with $h^f_{1,T}$, so that (4) becomes
And so on. This is the method we could use to forecast the conditional
variance of yt. If yt were, say, daily returns on the FTSE, we could use these
volatility forecasts as an input in the Black Scholes equation to help determine
the appropriate price of FTSE index options.
(h) An s-step ahead forecast for the conditional variance could be written
For the new value of $\beta$, the persistence of shocks to the conditional variance,
given by $(\alpha_1 + \beta)$, is 0.1251 + 0.98 = 1.1051, which is bigger than 1. It is obvious
from equation (x) that any value for $(\alpha_1 + \beta)$ bigger than one will lead the
forecasts to explode. The forecasts will keep on increasing and will tend to
infinity as the forecast horizon increases (i.e. as s increases). This is obviously
an undesirable property of a forecasting model! This is called non-stationarity
in variance.
For (1+)<1, the forecasts will converge on the unconditional variance as the
forecast horizon increases. For (1+) = 1, known as integrated GARCH or
IGARCH, there is a unit root in the conditional variance, and the forecasts will
stay constant as the forecast horizon increases.
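The explosive and convergent cases can be illustrated with a short sketch of the forecast recursion, assuming a GARCH(1,1) conditional variance so that each forecast equals the intercept plus the persistence times the previous forecast. The persistence figures 0.1251 and 0.98 come from the answer above; the intercept, the starting forecast and the alternative value 0.80 are hypothetical choices made only for the example.

```python
def garch_variance_forecasts(alpha0, alpha1, beta, h1, steps):
    """Return [h_1, ..., h_steps] using h_s = alpha0 + (alpha1 + beta) * h_{s-1}."""
    forecasts = [h1]
    for _ in range(steps - 1):
        forecasts.append(alpha0 + (alpha1 + beta) * forecasts[-1])
    return forecasts

def unconditional_variance(alpha0, alpha1, beta):
    """Long-run variance; only defined when alpha1 + beta < 1."""
    return alpha0 / (1.0 - alpha1 - beta)

ALPHA0 = 0.01    # hypothetical intercept
ALPHA1 = 0.1251  # coefficient on the lagged squared error (from the text)
H1 = 0.05        # hypothetical one-step-ahead forecast

# Persistence 0.1251 + 0.98 = 1.1051 > 1: forecasts grow without bound.
explosive = garch_variance_forecasts(ALPHA0, ALPHA1, 0.98, H1, steps=50)

# Hypothetical persistence 0.1251 + 0.80 = 0.9251 < 1: forecasts settle down.
stationary = garch_variance_forecasts(ALPHA0, ALPHA1, 0.80, H1, steps=200)
long_run = unconditional_variance(ALPHA0, ALPHA1, 0.80)

print(f"explosive, 50 steps ahead:   {explosive[-1]:.4f}")
print(f"stationary, 200 steps ahead: {stationary[-1]:.6f} (long-run {long_run:.6f})")
```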
2. (a) Maximum likelihood works by finding the most likely values of the
parameters given the actual data. More specifically, a log-likelihood function is
formed, usually based upon a normality assumption for the disturbance terms,
and the values of the parameters that maximise it are sought. Maximum
likelihood estimation can be employed to find parameter values for both linear
and non-linear models.
(b) The three hypothesis testing procedures available within the maximum
likelihood approach are Lagrange multiplier (LM), likelihood ratio (LR) and
Wald tests. The differences between them are described in Figure 8.4, and are
not defined again here. The Lagrange multiplier test involves estimation only
under the null hypothesis, the likelihood ratio test involves estimation under
both the null and the alternative hypothesis, while the Wald test involves
estimation only under the alternative. Given this, it should be evident that the
LM test will in many cases be the simplest to compute since the restrictions
implied by the null hypothesis will usually lead to some terms cancelling out to
give a simplified model relative to the unrestricted model.
(c) OLS will give identical parameter estimates for all of the intercept and
slope parameters, but will give a slightly different parameter estimate for the
variance of the disturbances. These are shown in the Appendix to Chapter 8.
The difference in the OLS and maximum likelihood estimators for the variance
of the disturbances can be seen by comparing the divisors of equations (8A.25)
and (8A.26).
(c) There are of course a large number of competing methods for measuring
and forecasting volatility, and it is worth stating at the outset that no research
has suggested that one method is universally superior to all others, so that
each method has its merits and may work well in certain circumstances.
Historical measures of volatility are just simple average measures, for
example the standard deviation of daily returns over a 3-year period. As such,
they are the simplest to calculate, but suffer from a number of shortcomings.
First, since the observations are unweighted, historical volatility can be slow to
respond to changing market circumstances, and would not take advantage of
short-term persistence in volatility that could lead to more accurate short-
term forecasts. Second, if there is an extreme event (e.g. a market crash), this
will lead the measured volatility to be high for a number of observations equal
to the measurement sample length. For example, suppose that volatility is
being measured using a 1-year (250-day) sample of returns, which is being
rolled forward one observation at a time to produce a series of 1-step ahead
volatility forecasts. If a market crash occurs on day t, this will increase the
measured level of volatility by the same amount right until day t+250 (i.e. it
will not decay away) and then it will disappear completely from the sample so
that measured volatility will fall abruptly. Exponential weighting of
The coefficients expected would be very small for the two conditional mean
coefficients, since they are average daily returns, and they could be
positive or negative, although a positive average return is probably more
likely. Similarly, the intercept terms in the conditional variance equations
would also be expected to be small and positive, since this is daily data. The
coefficients on the lagged squared error and lagged conditional variance in the
conditional variance equations must lie between zero and one, and more
specifically, the following might be expected: roughly 0.1-0.3 for the
coefficients on the lagged squared errors and roughly 0.5-0.8 for those on the
lagged conditional variances, with the two coefficients in each equation
summing to less than one. The coefficient values for the conditional
covariance equation are more difficult to predict: a sum of less than one is
still required for the model to be useful for forecasting covariances. The
parameters in this equation could be negative; given that the returns
for two stock markets are likely to be positively correlated, the parameters
would probably be positive, but the model would still be a valid one if
they were not.
(b) One of two procedures could be used. Either the daily returns data would
be transformed into weekly returns data by adding up the returns over all of
the trading days in each week, or the model would be estimated using the daily
data. Daily forecasts would then be produced up to 10 days (2 trading weeks)
ahead.
In both cases, the models would be estimated, and forecasts made of the
conditional variance and conditional covariance. If daily data were used to
estimate the model, the conditional covariance forecasts for
the 5 trading days in a week would be added together to form a covariance
forecast for that week, and similarly for the variance. If the returns had been
aggregated to the weekly frequency, the forecasts used would simply be 1-step
ahead.
Finally, the conditional covariance forecast for the week would be divided by
the product of the square roots of the conditional variance forecasts to obtain
a correlation forecast.
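The aggregation and division steps described above can be sketched as follows. The daily forecast numbers are hypothetical and the function name is invented for the illustration; in practice the inputs would be the 1- to 5-step ahead forecasts from the estimated multivariate GARCH model.

```python
import math

def weekly_correlation_forecast(cov_daily, var1_daily, var2_daily):
    """Sum daily forecasts to weekly ones, then convert covariance to correlation."""
    cov_week = sum(cov_daily)
    var1_week = sum(var1_daily)
    var2_week = sum(var2_daily)
    return cov_week / (math.sqrt(var1_week) * math.sqrt(var2_week))

# Hypothetical daily forecasts for the 5 trading days of the coming week.
cov_forecasts = [0.00004] * 5
var1_forecasts = [0.00010] * 5
var2_forecasts = [0.00016] * 5

corr = weekly_correlation_forecast(cov_forecasts, var1_forecasts, var2_forecasts)
print(f"weekly correlation forecast: {corr:.4f}")
```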
Finally, implied correlations may at first blush appear to be the best method
for calculating correlation forecasts accurately, for they rely on information
obtained from the market itself. After all, who should know better about
future correlations in the markets than the people who work in those markets?
However, market-based measures of volatility and correlation are sometimes
surprisingly inaccurate, and are also sometimes difficult to obtain. Most
fundamentally, correlation forecasts will only be available where there is an
option traded whose payoffs depend on the prices of two underlying assets.
For all other situations, a market-based correlation forecast will simply not be
available.
[Figure: Value of Conditional Variance (vertical axis, 0 to 0.16) plotted against Value of Lagged Shock (horizontal axis, -1 to 1)]
This graph is a bit of an odd one, in the sense that the conditional variance is
always lower for the EGARCH model. This may suggest estimation error in
one of the models. There is some evidence for asymmetries in the case of the
EGARCH model since the value of the conditional variance is 0.1 for a shock of
1 and 0.12 for a shock of -1.
(b) This is a tricky one. The leverage effect is used to rationalise a finding of
asymmetries in equity returns, but such an argument cannot be applied to
foreign exchange returns, since the concept of a Debt/Equity ratio has no
meaning in that context.
On the other hand, there is equally no reason to suppose that there are no
asymmetries in the case of fx data. The data used here were daily USD_GBP
returns for 1974-1994. It might be the case, for example, that news
relating to one country has a different impact from equally good or bad news
relating to another. To offer one illustration, it might be the case that the bad
news for the currently weak euro has a bigger impact on volatility than news
about the currently strong dollar. This would lead to asymmetries in the news
impact curve. Finally, it is also worth noting that the asymmetry term in the
EGARCH model, 1, is not statistically significant in this case.
Chapter 9
1. (a) This was a rather silly question since the answer is largely given away by
the question in part (b)! Nonetheless, although there are several methods that
could be used to determine whether there is evidence of daily seasonality in
stock returns, a simple method would be to obtain a sample of daily stock
returns and regress them on 5 day-of-the-week dummy variables. The
coefficient estimates would then be interpreted as the average return on each
(b) The problem is one of perfect multicollinearity between the five daily
dummy variables and the constant term, known as the dummy variable trap.
The sum of the five daily dummy variables will be one in every time period,
and this will be identical to the column of ones used for the constant. The
result is that the implicit assumption of the columns of the matrix of
explanatory variables being linearly independent of one another has been
violated, and hence there is not enough separate information in the sample to
be able to calculate the values of all of the coefficients. The (X'X) matrix
will be singular and therefore its inverse will not exist. The solution is
simple: either use all 5 daily dummy variables but no intercept term, or drop
one of the dummy variables and still include the intercept. These two methods
of dealing with the problem are equivalent with identical RSS, and only the
interpretation of the coefficient estimates will change.
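A small sketch may help to make the equivalence of the two remedies concrete. It relies on the standard result that OLS on a full set of day dummies reproduces the day-by-day sample means, so no regression routine is needed; the returns are hypothetical numbers and Friday is assumed to be the omitted day in the second specification.

```python
from collections import defaultdict

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]

# Hypothetical daily returns (in percent) for two trading weeks.
returns = [-0.30, 0.10, 0.05, 0.00, 0.20,
           -0.10, 0.30, -0.05, 0.10, 0.00]
days = DAYS * 2

# Exactly one dummy equals 1 on each day, so the five dummies always sum to
# one: the same as the column of ones used for the intercept.
dummies = [[1 if d == day else 0 for day in DAYS] for d in days]
rows_sum_to_one = all(sum(row) == 1 for row in dummies)

# Specification A: five dummies, no intercept. Each coefficient is a day mean.
by_day = defaultdict(list)
for d, r in zip(days, returns):
    by_day[d].append(r)
coef_a = {d: sum(v) / len(v) for d, v in by_day.items()}

# Specification B: intercept plus Mon..Thu dummies. The intercept is the
# Friday mean and each dummy coefficient is that day's mean minus Friday's.
intercept_b = coef_a["Fri"]
coef_b = {d: coef_a[d] - intercept_b for d in DAYS[:-1]}

fitted_a = [coef_a[d] for d in days]
fitted_b = [intercept_b + coef_b.get(d, 0.0) for d in days]
rss_a = sum((r - f) ** 2 for r, f in zip(returns, fitted_a))
rss_b = sum((r - f) ** 2 for r, f in zip(returns, fitted_b))

print(f"dummies sum to one in every period: {rows_sum_to_one}")
print(f"RSS (5 dummies, no intercept): {rss_a:.4f}")
print(f"RSS (intercept + 4 dummies):   {rss_b:.4f}")
```

The fitted values are the same in both specifications, which is why the RSS is identical and only the interpretation of the coefficients differs.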
(c) The first step is to calculate the t-ratios. These are 0.232, -2.691, 0.673,
-0.039, and 0.141 for the intercept, D1, D2, D3 and D4 respectively. The
interpretation of the intercept coefficient is the value of the return when all of
the variables (including the daily dummies) are zero, which in this case is the
average Friday return. The coefficients on the daily dummies can be
interpreted as the average deviation of that day's return from the average
Friday return. Only one of these dummy variables is
significant, the dummy for Monday, and we would thus conclude that the
average return on Monday was significantly lower than the average Friday
return, but there is no statistically significant evidence of any other
daily seasonalities given these results.
$y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + u_t$.
The dummy variables are defined identically as in the intercept dummy case,
and again one less dummy is needed than the total number of days in the
(e) The financial year ends at approximately the end of March in the UK. So
one way to test the hypothesis that stock returns are different at the end of the
tax year compared with other times of the year would be to obtain a long
sample of monthly returns and to regress the returns on a dummy variable
taking the value one in March and zero otherwise. If, everything else equal,
investors were selling to realise losses in March, we would expect the
coefficient on this dummy to be negative and statistically significant due to
excess selling pressure. Thus average March returns would be significantly
lower than average returns over the whole year.
2. (a) A switching model is simply one where the behaviour of the series is
permitted to change from one type to another under the model. For example,
any regression containing seasonal dummy variables would be a simple kind
of switching model, since the behaviour of the series will be different at
different times.
Threshold autoregressive (TAR) models are those where the variable under
study is assumed to follow one autoregressive process in a given regime and
other autoregressive processes in other regimes. Movements from one regime
to another occur when a variable (not necessarily the variable under study)
rises above or falls below a particular value. Markov switching models, in their
simplest forms, assume that a variable can be drawn from one of several
regimes, each regime having its own mean and variance. The key distinction
between the two classes of models is that TAR models assume that the
threshold variable governing the regime is known, and under the model, once
this threshold is set, the variable is in one of the regimes alone. The Markov
switching model, on the other hand, assumes that the state-determining or
forcing variable is unobserved. The variable under study is thus never
completely in one regime or another, but rather is in each regime with some
probability at each point in time.
The decision on which of the two model classes is more appropriate for a
particular application would be made on the grounds of whether the state-
determining variable was observable or not, and what type of dynamics were
of interest in the model. For example, if the financial theory does not suggest
the forcing variable, then the Markov switching approach may be preferable.
On the other hand, if theory suggests an obvious choice of switching variable,
or if it is of interest to use an AR-type model, then the TAR would be more
appropriate.
There have been very few comparisons of the two approaches that I know of;
authors seem to just adopt a particular approach and use it without discussing
the alternatives available.
(b) (i) The Markov property applies if a process is path independent, that
is, it is only the current value of a series or the current set of probabilities that
determines where the process will be during the next time period, and none of
the values of the series or probabilities during previous time periods. Thus, a
(c) A SETAR model is a TAR model where the state-determining variable is the
dependent variable used in the regression. The use of SETAR models rather
than a more general TAR removes one item to decide on (the
state-determining variable). But there are many others: the number of regimes, the
number of lags in each regime, the value(s) of the threshold(s), and the lag
with which the variable will switch. The major difficulty with SETAR (and
indeed all TAR) models is that it is impossible to easily and validly estimate all
of these quantities at the same time. They depend on each other, and also, the
threshold causes a discontinuity in the function that would be maximised (if
ML is used) or minimised (if NLS is used).
Therefore, the easiest way to estimate such models is usually to use as much
knowledge as possible from financial theory and to assume values for other
parameters and then to estimate as little as possible. For example, it may be
the case that the number of regimes, the delay value and the threshold values
can be assumed from theory. This would leave only the number of lags in each
regime together with the coefficients to be estimated. This could be validly
done using information criteria to determine the lag lengths for each regime
and ML or NLS to estimate the coefficients.
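As an illustration of estimating "as little as possible", the sketch below takes the number of regimes (two), the delay (one period) and the threshold (zero) as given, and estimates only the coefficients by ordinary least squares within each regime. Everything in it is an assumption made for the example: the data are simulated, the lag length is fixed at one rather than chosen by information criteria, and the function names are invented.

```python
import random

def simulate_setar(n, threshold, regime_low, regime_high, seed=0):
    """Simulate a two-regime SETAR(1) with delay 1 and N(0, 1) disturbances."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n):
        const, slope = regime_low if y[-1] <= threshold else regime_high
        y.append(const + slope * y[-1] + rng.gauss(0.0, 1.0))
    return y

def ols_ar1(pairs):
    """OLS of y_t on a constant and y_{t-1}, given (y_{t-1}, y_t) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def estimate_setar(y, threshold):
    """Split the sample on the (assumed known) threshold and fit each regime."""
    pairs = list(zip(y[:-1], y[1:]))
    low = [p for p in pairs if p[0] <= threshold]
    high = [p for p in pairs if p[0] > threshold]
    return ols_ar1(low), ols_ar1(high)

# Hypothetical true parameters: (intercept, AR coefficient) in each regime.
TRUE_LOW, TRUE_HIGH = (0.2, 0.6), (-0.2, 0.3)
series = simulate_setar(5000, 0.0, TRUE_LOW, TRUE_HIGH)
est_low, est_high = estimate_setar(series, 0.0)

print("lower regime (intercept, slope):", est_low)
print("upper regime (intercept, slope):", est_high)
```

Because the threshold and delay are imposed rather than estimated, the discontinuity in the objective function does not arise, and each regime reduces to an ordinary linear regression.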
where $\ln(fx_{F/G,t})$ is the log of the exchange rate, expressed in French francs per
German mark, and $\ln(p_{F,t})$ and $\ln(p_{G,t})$ are the logs of the French and German
consumer price series respectively. We could define
$\ln(fx_{F/G,t}) - \ln(p_{F,t}) + \ln(p_{G,t})$ as the deviation from PPP (see Chapter 7). This
could be generalised to allow for a different relationship between the three
variables according to whether the deviation from PPP is larger than some
upper threshold value r, or smaller (more negative) than a lower threshold s:
We could then consider the different values of the parameter estimates in each
of the 3 regimes (although we could not validly conduct hypothesis tests in the
usual way since it is very likely that the variables in the model are non-
stationary!). The values of the thresholds (r and s) could be imposed on the
basis of some assumed size of transactions costs, or they could be estimated.
(f) The problem is essentially that the threshold no longer exists under the null
hypothesis that the SETAR model collapses to a linear model with the same
lag lengths as were in each part of the SETAR. This fact means that the usual
basis of asymptotic theory for testing hypotheses is not applicable, so that the
test statistics would not follow the distributions that we would have assumed
of them. There are procedures available for testing hypotheses such as this in
the context of TAR models, but these are quite complex; see Hansen (1996),
for example.
(g) It is tempting to think that more complex models are bound to produce
more accurate forecasts than simpler models since the former should be able
to capture more of the relevant features of the data. However, this is certainly
not the case, for complex models may have a tendency to fit to sample-specific
features of the data that are not replicated during future (out of sample)
$y_t = \mu + \phi y_{t-1} + u_t, \quad u_t \sim N(0, \sigma_t^2)$
$\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \beta \sigma_{t-1}^2 + \gamma_1 D_{1t} + \gamma_2 D_{2t} + \gamma_3 D_{3t} + \gamma_4 D_{4t}$
where D1, .., D4 are Monday, .., Thursday dummy variables. A more
sophisticated model could also allow the coefficients on the lagged squared
error or lagged conditional variance terms to vary across the days of the
week. If Monday volatility dynamics are different from other days of the week,
we would expect to see $\gamma_1$ significant.
For (ii), some sort of threshold model is required, and the question suggests
that the threshold variable is observable (and is the value of the previous day's
volatility). Thus, an appropriate model would be a GARCH model with a
threshold in the conditional variance that switches, e.g.
$y_t = \mu + \phi y_{t-1} + u_t, \quad u_t \sim N(0, \sigma_t^2)$
$\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \beta \sigma_{t-1}^2 + \gamma_1 I_t$
where $I_t = 1$ if $\sigma_{t-1}^2 > 0.1$, and zero otherwise. Note that the question does not
specify what volatility is, so it is assumed in this answer that it is equated
with conditional variance. Again, this dummy variable would only allow the
intercept in the conditional variance (i.e. the unconditional variance) to vary
according to the previous day's volatility. A similar dummy variable could be
applied to the lagged squared error or lagged conditional variance terms to
allow them to vary with the size of the previous day's volatility.
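The way the indicator switches the intercept can be traced with a short sketch of the conditional variance recursion. The parameter values, shocks and starting variance are hypothetical, and the symbol names (alpha0, alpha1, beta, gamma1) are assumptions adopted for the illustration; only the threshold of 0.1 comes from the answer above.

```python
def threshold_garch_variances(shocks, h0, alpha0, alpha1, beta, gamma1,
                              threshold=0.1):
    """Return [h_1, ..., h_n] where
        h_t = alpha0 + alpha1 * u_{t-1}^2 + beta * h_{t-1} + gamma1 * I_t
    and I_t = 1 if h_{t-1} > threshold, else 0.
    shocks[i] is u_i for i = 0..n-1 and h0 is the starting variance.
    """
    variances = []
    h_prev = h0
    for u_prev in shocks:
        indicator = 1.0 if h_prev > threshold else 0.0
        h_t = alpha0 + alpha1 * u_prev ** 2 + beta * h_prev + gamma1 * indicator
        variances.append(h_t)
        h_prev = h_t
    return variances

# Hypothetical inputs: a single large shock at time 1 pushes the variance
# above 0.1, which switches the indicator on in the following period.
path = threshold_garch_variances(
    shocks=[0.0, 1.0, 0.0], h0=0.05,
    alpha0=0.01, alpha1=0.1, beta=0.8, gamma1=0.05,
)
print([round(h, 4) for h in path])
```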