Answers Review Questions Econometrics PDF

Solutions to the end of Chapter Exercises
Chapter 2
1. (a) The use of vertical rather than horizontal distances relates to the idea
that the explanatory variable, x, is fixed in repeated samples, so what the
model tries to do is to fit the most appropriate value of y using the model for a
given value of x. Taking horizontal distances would have suggested that we
had fixed the value of y and tried to find the appropriate values of x.
(b) When we calculate the deviations of the points, $y_t$, from the fitted
values, $\hat{y}_t$, some points will lie above the line ($y_t > \hat{y}_t$) and some will lie below
the line ($y_t < \hat{y}_t$). When we calculate the residuals ($\hat{u}_t = y_t - \hat{y}_t$), those
corresponding to points above the line will be positive and those below the line
negative, so adding them would mean that they would largely cancel out. In
fact, we could fit an infinite number of lines with a zero average residual. By
squaring the residuals before summing them, we ensure that they all
contribute to the measure of loss and that they do not cancel. It is then
possible to define unique (ordinary least squares) estimates of the intercept
and slope.
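The point about cancelling residuals can be checked numerically. The sketch below uses made-up data (an assumption, not taken from the book) and shows that every line passing through the sample means has residuals summing to zero, whatever its slope, while the sum of squared residuals is smallest at the OLS slope. The helper names are illustrative only.

```python
# Made-up data for illustration only (not taken from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def mean(values):
    return sum(values) / len(values)

def residuals_through_means(slope):
    """Residuals from the line with this slope that passes through (x-bar, y-bar)."""
    intercept = mean(y) - slope * mean(x)
    return [yi - intercept - slope * xi for xi, yi in zip(x, y)]

def rss(slope):
    """Sum of squared residuals for the same line."""
    return sum(r * r for r in residuals_through_means(slope))

# OLS slope from the usual closed-form expression.
x_bar, y_bar = mean(x), mean(y)
ols_slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))

for slope in (0.0, 1.0, ols_slope, 5.0):
    total = sum(residuals_through_means(slope))
    print(f"slope={slope:.3f}  sum of residuals={total:+.1e}  RSS={rss(slope):.3f}")
```

Each line printed has a residual sum of (numerically) zero, but only the OLS slope gives the minimum RSS, which is why the squared criterion pins down a unique line.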
(c) Taking the absolute values of the residuals and minimising their sum
would certainly also get around the problem of positive and negative residuals
cancelling. However, the absolute value function is much harder to work with
than a square. Squared terms are easy to differentiate, so it is simple to find
analytical formulae for the mean and the variance.
2. The population regression function (PRF) is a description of the model that
is thought to be generating the actual data and it represents the true
relationship between the variables. The population regression function is also
known as the data generating process (DGP). The PRF embodies the true
values of $\alpha$ and $\beta$, and for the bivariate model, could be expressed as
$$y_t = \alpha + \beta x_t + u_t$$
Note that there is a disturbance term in this equation. In some textbooks, a
distinction is drawn between the PRF (the underlying true relationship
between y and x) and the DGP (the process describing the way that the actual
observations on y come about).
1/59 Introductory Econometrics for Finance Chris Brooks 2008

The sample regression function, SRF, is the relationship that has been
estimated using the sample observations, and is often written as
$$\hat{y}_t = \hat{\alpha} + \hat{\beta} x_t$$
Notice that there is no error or residual term in the equation for the SRF: all
this equation states is that given a particular value of x, multiplying it by
$\hat{\beta}$ and adding $\hat{\alpha}$ will give the model fitted or expected value for y, denoted $\hat{y}$. It
is also possible to write
$$y_t = \hat{\alpha} + \hat{\beta} x_t + \hat{u}_t$$
This equation splits the observed value of y into two components: the fitted
value from the model, and a residual term. The SRF is used to infer likely
values of the PRF. That is, the estimates $\hat{\alpha}$ and $\hat{\beta}$ are constructed for the
sample data.
3. An estimator is simply a formula that is used to calculate the estimates, i.e.
the parameters that describe the relationship between two or more
explanatory variables. There are an infinite number of possible estimators;
OLS is one choice that many people would consider a good one. We can say
that the OLS estimator is best i.e. that it has the lowest variance among the
class of linear unbiased estimators. So it is optimal in the sense that no other
linear, unbiased estimator would have a smaller sampling variance. We could
define an estimator with a lower sampling variance than the OLS estimator,
but it would either be non-linear or biased or both! So there is a trade-off
between bias and variance in the choice of the estimator.
4. A list of the assumptions of the classical linear regression model's
disturbance terms is given in Box 2.3 on p.44 of the book.
We need to make the first four assumptions in order to prove that the ordinary
least squares estimators of $\alpha$ and $\beta$ are "best", that is to prove that they have
minimum variance among the class of linear unbiased estimators. The
theorem that proves that OLS estimators are BLUE (provided the assumptions
are fulfilled) is known as the Gauss-Markov theorem. If these assumptions are
violated (which is dealt with in Chapter 4), then it may be that OLS estimators
are no longer unbiased or efficient. That is, they may be inaccurate or
subject to fluctuations between samples.
We needed to make the fifth assumption, that the disturbances are normally
distributed, in order to make statistical inferences about the population
parameters from the sample data, i.e. to test hypotheses about the coefficients.
Making this assumption implies that test statistics will follow a t-distribution
(provided that the other assumptions also hold).

5. If the models are linear in the parameters, we can use OLS.
(2.57) Yes, can use OLS since the model is the usual linear model we have been
dealing with.
(2.58) Yes. The model can be linearised by taking logarithms of both sides and
by rearranging. Although this is a very specific case, it has sound theoretical
foundations (e.g. the Cobb-Douglas production function in economics), and it
is the case that many relationships can be approximately linearised by taking
logs of the variables. The effect of taking logs is to reduce the effect of extreme
values on the regression function, and it may be possible to turn multiplicative
models into additive ones which we can easily estimate.
(2.59) Yes. We can estimate this model using OLS, but we would not be able to
obtain the values of both $\beta$ and $\gamma$ separately; we would obtain only the value
of these two coefficients multiplied together.
(2.60) Yes, we can use OLS, since this model is linear in the logarithms. For
those who have done some economics, models of this kind which are linear in
the logarithms have the interesting property that the slope coefficients can
be interpreted as elasticities.
(2.61) Yes, in fact we can still use OLS since it is linear in the parameters. If
we make a substitution, say $q_t = x_t z_t$, then we can run the regression
$$y_t = \alpha + \beta q_t + u_t$$
as usual.
So, in fact, we can estimate a fairly wide range of model types using these
simple tools.
6. The null hypothesis is that the true (but unknown) value of beta is equal to
one, against a one-sided alternative that it is greater than one:
$H_0: \beta = 1$
$H_1: \beta > 1$
The test statistic is given by
$$\text{test stat} = \frac{\hat{\beta} - \beta^*}{SE(\hat{\beta})} = \frac{1.147 - 1}{0.0548} = 2.682$$
We want to compare this with a value from the t-table with T-2 degrees of
freedom, where T is the sample size, and here T-2 =60. We want a value with
5% all in one tail since we are doing a 1-sided test. The critical t-value from the
t-table is 1.671:

[Figure: t-distribution density f(x) with a 5% rejection region in the upper tail, beyond +1.671]
The value of the test statistic is in the rejection region and hence we can reject
the null hypothesis. We have statistically significant evidence that this security
has a beta greater than one, i.e. it is significantly more risky than the market as
a whole.
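The arithmetic of this one-sided test can be reproduced in a few lines. This is a minimal sketch using the figures quoted above; the function name `t_ratio` is illustrative, and the critical value is simply the tabulated 1.671 rather than something computed here.

```python
def t_ratio(estimate, hypothesised, std_error):
    """t-type test statistic: (estimate - value under the null) / standard error."""
    return (estimate - hypothesised) / std_error

test_stat = t_ratio(1.147, 1.0, 0.0548)
critical_value = 1.671  # 5% one-sided value for 60 degrees of freedom, from the t-table

# One-sided (upper-tail) test: reject only if the statistic exceeds the critical value.
reject_null = test_stat > critical_value
print(f"test statistic = {test_stat:.3f}, reject H0: {reject_null}")
```

The same function covers the two-sided test in the next question by comparing the absolute value of the statistic with the 2.5% critical value instead.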
7. We want to use a two-sided test to test the null hypothesis that shares in
Chris Mining are completely unrelated to movements in the market as a
whole. In other words, the value of beta in the regression model would be zero
so that whatever happens to the value of the market proxy, Chris Mining
would be completely unaffected by it.
The null and alternative hypotheses are therefore:
$H_0: \beta = 0$
$H_1: \beta \neq 0$
The test statistic has the same format as before, and is given by:
$$\text{test stat} = \frac{\hat{\beta} - \beta^*}{SE(\hat{\beta})} = \frac{0.214 - 0}{0.186} = 1.150$$
We want to find a value from the t-tables for a variable with 38-2=36 degrees
of freedom, and we want to look up the value that puts 2.5% of the distribution
in each tail since we are doing a two-sided test and we want to have a 5% size
of test overall:

[Figure: t-distribution with a 2.5% rejection region in each tail, beyond -2.03 and +2.03]
The critical t-value is therefore 2.03.
Since the test statistic is not within the rejection region, we do not reject the
null hypothesis. We therefore conclude that we have no statistically significant
evidence that Chris Mining has any systematic risk. In other words, we have
no evidence that changes in the company's value are driven by movements in
the market.
8. A confidence interval for beta is given by the formula:
$$(\hat{\beta} - SE(\hat{\beta}) \cdot t_{crit},\ \hat{\beta} + SE(\hat{\beta}) \cdot t_{crit})$$
Confidence intervals are almost invariably 2-sided, unless we are told
otherwise (which we are not here), so we want to look up the values which put
2.5% in the upper tail and 0.5% in the upper tail for the 95% and 99%
confidence intervals respectively. The 0.5% critical values are given as follows
for a t-distribution with T-2=38-2=36 degrees of freedom:

[Figure: t-distribution with a 0.5% rejection region in each tail, beyond -2.72 and +2.72]
The confidence interval in each case is thus given by (0.214 ± 0.186*2.03) for a
95% confidence interval, which solves to (-0.164, 0.592) and
(0.214 ± 0.186*2.72) for a 99% confidence interval, which solves to
(-0.292, 0.720).
There are a couple of points worth noting.
First, one intuitive interpretation of an X% confidence interval is that we are
X% sure that the true value of the population parameter lies within the
interval. So we are 95% sure that the true value of beta lies within the interval
(-0.164, 0.592) and we are 99% sure that the true population value of beta lies
within (-0.292, 0.720). Thus in order to be more sure that we have the true
value of beta contained in the interval, i.e. as we move from 95% to 99%
confidence, the interval must become wider.
The second point to note is that we can test an infinite number of hypotheses
about beta once we have formed the interval. For example, we would not reject
the null hypothesis contained in the last question (i.e. that beta = 0), since that
value of beta lies within the 95% and 99% confidence intervals. Would we
reject or not reject a null hypothesis that the true value of beta was 0.6? At the
5% level, we should have enough evidence against the null hypothesis to reject
it, since 0.6 is not contained within the 95% confidence interval. But at the 1%
level, we would no longer have sufficient evidence to reject the null hypothesis,
since 0.6 is now contained within the interval. Therefore we should always if
possible conduct some sort of sensitivity analysis to see if our conclusions are
altered by (sensible) changes in the level of significance used.
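As a check on the interval arithmetic and on the "test by looking inside the interval" idea, here is a small sketch using the estimate, standard error and critical values quoted in this answer. The function names are illustrative, not from the book.

```python
def confidence_interval(estimate, std_error, t_crit):
    """Two-sided interval: estimate plus or minus std_error * t_crit."""
    half_width = std_error * t_crit
    return estimate - half_width, estimate + half_width

def rejects(interval, hypothesised_value):
    """A null value is rejected when it lies outside the interval."""
    lower, upper = interval
    return not (lower <= hypothesised_value <= upper)

ci_95 = confidence_interval(0.214, 0.186, 2.03)  # 2.5% in each tail, 36 d.f.
ci_99 = confidence_interval(0.214, 0.186, 2.72)  # 0.5% in each tail, 36 d.f.

print("95% interval:", tuple(round(v, 3) for v in ci_95))
print("99% interval:", tuple(round(v, 3) for v in ci_99))
for value in (0.0, 0.6):
    print(f"beta = {value}: reject at 5%? {rejects(ci_95, value)}, "
          f"reject at 1%? {rejects(ci_99, value)}")
```

The hypothesised value 0.6 is rejected at the 5% level but not at the 1% level, which is the sensitivity to the significance level that the answer warns about.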
9. We test hypotheses about the actual coefficients, not the estimated values.
We want to make inferences about the likely values of the population
parameters (i.e. to test hypotheses about them). We do not need to test
hypotheses about the estimated values since we know exactly what our
estimates are because we calculated them!

Chapter 3
1. It can be proved that a t-distribution is just a special case of the more
general F-distribution. The square of a t-distribution with T-k degrees of
freedom will be identical to an F-distribution with (1,T-k) degrees of freedom.
But remember that if we use a 5% size of test, we will look up a 5% value for
the F-distribution because the test is 2-sided even though we only look in one
tail of the distribution. We look up a 2.5% value for the t-distribution since the
test is 2-tailed.
Examples at the 5% level from tables

T-k     F critical value    t critical value
20      4.35                2.09
40      4.08                2.02
60      4.00                2.00
120     3.92                1.98
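The relationship between the two columns can be checked directly: squaring each tabulated t critical value should reproduce the F(1, T-k) critical value, up to the rounding in the tables. A short sketch using only the figures above:

```python
# (T-k, F critical value, t critical value) as tabulated above, 5% level.
table = [
    (20, 4.35, 2.09),
    (40, 4.08, 2.02),
    (60, 4.00, 2.00),
    (120, 3.92, 1.98),
]

for dof, f_crit, t_crit in table:
    # Small gaps arise only because the tabulated values are rounded to 2 decimals.
    print(f"T-k={dof:3d}  t^2={t_crit ** 2:.2f}  F={f_crit:.2f}")
```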
2. (a) $H_0: \beta_3 = 2$
We could use an F- or a t- test for this one since it is a single hypothesis
involving only one coefficient. We would probably in practice use a t-test since
it is computationally simpler and we only have to estimate one regression.
There is one restriction.
(b) $H_0: \beta_3 + \beta_4 = 1$
Since this involves more than one coefficient, we should use an F-test. There is
one restriction.
(c) $H_0: \beta_3 + \beta_4 = 1$ and $\beta_5 = 1$
Since we are testing more than one hypothesis simultaneously, we would use
an F-test. There are 2 restrictions.
(d) $H_0: \beta_2 = 0$ and $\beta_3 = 0$ and $\beta_4 = 0$ and $\beta_5 = 0$
As for (c), we are testing multiple hypotheses so we cannot use a t-test. We
have 4 restrictions.
(e) $H_0: \beta_2 \beta_3 = 1$
Although there is only one restriction, it is a multiplicative restriction. We
therefore cannot use a t-test or an F-test to test it. In fact we cannot test it at
all using the methodology that has been examined in this chapter.
3. THE regression F-statistic would be given by the test statistic associated
with hypothesis (d) above. We are always interested in testing this hypothesis
since it tests whether all of the coefficients in the regression (except the
constant) are jointly insignificant. If they are then we have a completely
useless regression, where none of the variables that we have said influence y
actually do. So we would need to go back to the drawing board!

The alternative hypothesis is:
$H_1: \beta_2 \neq 0$ or $\beta_3 \neq 0$ or $\beta_4 \neq 0$ or $\beta_5 \neq 0$
Note the form of the alternative hypothesis: "or" indicates that only one of the
components of the null hypothesis would have to be rejected for us to reject
the null hypothesis as a whole.
4. The restricted residual sum of squares will always be at least as big as the
unrestricted residual sum of squares i.e.
$$RRSS \geq URSS$$
To see this, think about what we were doing when we determined what the
regression parameters should be: we chose the values that minimised the
residual sum of squares. We said that OLS would provide the best parameter
values given the actual sample data. Now when we impose some restrictions
on the model, so that they cannot all be freely determined, then the model
should not fit as well as it did before. Hence the residual sum of squares must
be higher once we have imposed the restrictions; otherwise, the parameter
values that OLS chose originally without the restrictions could not be the best.
In the extreme case (very unlikely in practice), the two sets of residual sum of
squares could be identical if the restrictions were already present in the data,
so that imposing them on the model would yield no penalty in terms of loss of
fit.
5. The null hypothesis is: $H_0: \beta_3 + \beta_4 = 1$ and $\beta_5 = 1$
The first step is to impose this on the regression model:
$y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + \beta_4 x_{4t} + \beta_5 x_{5t} + u_t$ subject to $\beta_3 + \beta_4 = 1$ and $\beta_5 = 1$.
We can rewrite the first part of the restriction as $\beta_4 = 1 - \beta_3$.
Then rewrite the regression with the restriction imposed
$y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + (1 - \beta_3) x_{4t} + x_{5t} + u_t$
which can be re-written
$y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + x_{4t} - \beta_3 x_{4t} + x_{5t} + u_t$
and rearranging
$(y_t - x_{4t} - x_{5t}) = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} - \beta_3 x_{4t} + u_t$
$(y_t - x_{4t} - x_{5t}) = \beta_1 + \beta_2 x_{2t} + \beta_3 (x_{3t} - x_{4t}) + u_t$

Now create two new variables, call them $p_t$ and $q_t$:
$p_t = (y_t - x_{4t} - x_{5t})$
$q_t = (x_{3t} - x_{4t})$
We can then run the linear regression:
$p_t = \beta_1 + \beta_2 x_{2t} + \beta_3 q_t + u_t$,
which constitutes the restricted regression model.
The test statistic is calculated as
$$\frac{RRSS - URSS}{URSS} \times \frac{T - k}{m}$$
In this case, m=2, T=96, k=5 so the test statistic = 5.704. Compare this to the
5% critical value from an F-distribution with (2,91) degrees of freedom, which
is approximately 3.10. Since 5.704 > 3.10, we reject the null hypothesis that
the restrictions are valid. We cannot impose these restrictions on the data
without a substantial increase in the residual sum of squares.
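The F-statistic formula above is easy to wrap in a function. Note that the answer does not restate the two residual sums of squares, so the values RRSS = 102.87 and URSS = 91.41 used below are an assumption: they are illustrative inputs chosen because they reproduce the quoted statistic of 5.704 with m=2, T=96, k=5.

```python
def restriction_f_stat(rrss, urss, T, k, m):
    """F-statistic for m linear restrictions: ((RRSS - URSS) / URSS) * (T - k) / m."""
    return (rrss - urss) / urss * (T - k) / m

# ASSUMED inputs (not given in this answer); chosen to match the quoted 5.704.
RRSS, URSS = 102.87, 91.41

f_stat = restriction_f_stat(RRSS, URSS, T=96, k=5, m=2)
f_crit = 3.10  # approximate 5% critical value for F(2, 91), as quoted above

print(f"F = {f_stat:.3f}, reject restrictions: {f_stat > f_crit}")
```

Because RRSS can never be below URSS, the statistic is never negative, and large values signal that the restrictions cost a lot in terms of fit.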
6. $r_i = 0.080 + 0.801 S_i + 0.321 MB_i + 0.164 PE_i - 0.084 BETA_i$

                 Constant   S         MB        PE        BETA
Coefficient      0.080      0.801     0.321     0.164     -0.084
Standard error   (0.064)    (0.147)   (0.136)   (0.420)   (0.120)
t-ratio          1.25       5.45      2.36      0.390     -0.700

The t-ratios are given in the final row above. They are
calculated by dividing the coefficient estimate by its standard error. The
relevant value from the t-tables is for a 2-sided test with 5% rejection overall.
T-k = 195; tcrit = 1.97. The null hypothesis is rejected at the 5% level if the
absolute value of the test statistic is greater than the critical value. We would
conclude based on this evidence that only firm size and market to book value
have a significant effect on stock returns.
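The t-ratios and the significance decision can be reproduced mechanically from the reported coefficients and standard errors. A minimal sketch (the dictionary layout and variable names are illustrative):

```python
# (coefficient estimate, standard error) pairs as reported in the regression above.
estimates = {
    "constant": (0.080, 0.064),
    "S": (0.801, 0.147),
    "MB": (0.321, 0.136),
    "PE": (0.164, 0.420),
    "BETA": (-0.084, 0.120),
}
T_CRIT = 1.97  # two-sided 5% critical value quoted in the text for T-k = 195

# t-ratio = coefficient / standard error (null hypothesis: coefficient is zero).
t_ratios = {name: coef / se for name, (coef, se) in estimates.items()}
significant = [name for name, t in t_ratios.items() if abs(t) > T_CRIT]

for name, t in t_ratios.items():
    print(f"{name:8s} t-ratio = {t:6.2f}")
print("Significant at 5%:", significant)
```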
If a stock's beta increases from 1 to 1.2, then we would expect the return on the
stock to FALL by (1.2-1)*0.084 = 0.0168 = 1.68%.
This is not the sign we would have expected on beta, since beta would be
expected to be positively related to return, since investors would require
higher returns as compensation for bearing higher market risk.
We would thus consider deleting the price/earnings and beta variables from
the regression since these are not significant in the regression - i.e. they are
not helping much to explain variations in y. We would not delete the constant
term from the regression even though it is insignificant since there are good
statistical reasons for its inclusion.
7. The two models are
$$y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + \beta_4 y_{t-1} + u_t$$
$$\Delta y_t = \gamma_1 + \gamma_2 x_{2t} + \gamma_3 x_{3t} + \gamma_4 y_{t-1} + v_t$$
Note that we have not changed anything substantial between these models in
the sense that the second model is just a re-parameterisation (rearrangement)
of the first, where we have subtracted yt-1 from both sides of the equation.

(a) Remember that the residual sum of squares is the sum of each of the
squared residuals. So let's consider what the residuals will be in each
case. For the first model in the level of y
$$\hat{u}_t = y_t - \hat{y}_t = y_t - \hat{\beta}_1 - \hat{\beta}_2 x_{2t} - \hat{\beta}_3 x_{3t} - \hat{\beta}_4 y_{t-1}$$
Now for the second model, the dependent variable is now the change in y:
$$\hat{v}_t = \Delta y_t - \Delta\hat{y}_t = \Delta y_t - \hat{\gamma}_1 - \hat{\gamma}_2 x_{2t} - \hat{\gamma}_3 x_{3t} - \hat{\gamma}_4 y_{t-1}$$
where $\hat{y}$ is the fitted value in each case (note that we do not need at this stage
to assume they are the same). Rearranging this second model would give
$$\hat{v}_t = y_t - y_{t-1} - \hat{\gamma}_1 - \hat{\gamma}_2 x_{2t} - \hat{\gamma}_3 x_{3t} - \hat{\gamma}_4 y_{t-1}$$
$$\hat{v}_t = y_t - \hat{\gamma}_1 - \hat{\gamma}_2 x_{2t} - \hat{\gamma}_3 x_{3t} - (\hat{\gamma}_4 + 1) y_{t-1}$$
If we compare this formulation with the one we calculated for the first model,
we can see that the residuals are exactly the same for the two models, with
$\hat{\beta}_4 = \hat{\gamma}_4 + 1$ and $\hat{\beta}_i = \hat{\gamma}_i$ (i = 1, 2, 3). Hence if the residuals are the same, the
residual sum of squares must also be the same. In fact the two models are
really identical, since one is just a rearrangement of the other.
(b) As for $R^2$, recall how we calculate $R^2$:
$$R^2 = 1 - \frac{RSS}{\sum (y_i - \bar{y})^2}$$
for the first model and
$$R^2 = 1 - \frac{RSS}{\sum (\Delta y_i - \overline{\Delta y})^2}$$
in the second case. Therefore since the total sum of
squares (the denominator) has changed, then the value of $R^2$ must have also
changed as a consequence of changing the dependent variable.
(c) By the same logic, since the value of the adjusted R2 is just an algebraic
modification of R2 itself, the value of the adjusted R2 must also change.
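The three results in parts (a) to (c) can be seen in a small numerical experiment. The sketch below is a simplification and rests on assumptions: the series is made up, and the x variables are dropped so that the only regressor is the lagged dependent variable, which keeps the OLS algebra to the bivariate formulas. The function `ols` is an illustrative helper.

```python
def ols(x, y):
    """Bivariate OLS of y on a constant and x; returns (intercept, slope, RSS, R2)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return intercept, slope, rss, 1 - rss / tss

# Made-up series for illustration only.
series = [1.0, 1.4, 1.1, 1.9, 2.3, 2.0, 2.8, 3.1, 2.7, 3.6]
lagged = series[:-1]                               # y_{t-1}
level = series[1:]                                 # y_t
change = [b - a for a, b in zip(lagged, level)]    # y_t - y_{t-1}

a_lvl, b_lvl, rss_lvl, r2_lvl = ols(lagged, level)
a_chg, b_chg, rss_chg, r2_chg = ols(lagged, change)

print(f"levels : slope={b_lvl:.4f}  RSS={rss_lvl:.4f}  R2={r2_lvl:.4f}")
print(f"changes: slope={b_chg:.4f}  RSS={rss_chg:.4f}  R2={r2_chg:.4f}")
```

The slope in the levels regression equals the slope in the changes regression plus one, the RSS values coincide, and the two values of R2 differ because the total sum of squares is computed from a different dependent variable.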
8. A researcher estimates the following two econometric models
$$y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + u_t$$
$$y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + \beta_4 x_{4t} + v_t$$
(a) The value of $R^2$ will almost always be higher for the second model since it
has another variable added to the regression. The value of $R^2$ would only be
identical for the two models in the very, very unlikely event that the estimated
coefficient on the $x_{4t}$ variable was exactly zero. Otherwise, the $R^2$ must be
higher for the second model than the first.
(b) The value of the adjusted R2 could fall as we add another variable. The
reason for this is that the adjusted version of R2 has a correction for the loss of
degrees of freedom associated with adding another regressor into a regression.
This implies a penalty term, so that the value of the adjusted R2 will only rise if
the increase in this penalty is more than outweighed by the rise in the value of
R2.
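A quick numerical illustration of the penalty: the sketch below uses the standard definition of adjusted R2, 1 - [(T-1)/(T-k)](1 - R2), with made-up values (T = 30, and R2 rising only from 0.500 to 0.505 when a fourth parameter is added). These numbers are assumptions chosen to show a case where the adjusted measure falls.

```python
def adjusted_r2(r2, T, k):
    """Adjusted R-squared for T observations and k estimated parameters."""
    return 1 - (T - 1) / (T - k) * (1 - r2)

# Made-up example: adding a regressor raises R2 only slightly.
before = adjusted_r2(0.500, T=30, k=3)
after = adjusted_r2(0.505, T=30, k=4)

print(f"adjusted R2 before: {before:.4f}")
print(f"adjusted R2 after : {after:.4f}")
print("adjusted R2 fell:", after < before)
```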
11. $R^2$ may be defined in various ways, but the most common is
$$R^2 = \frac{ESS}{TSS}$$
Since both ESS and TSS will have units of the square of the dependent
variable, the units will cancel out and hence R2 will be unit-free!
Chapter 4
1. In the same way as we make assumptions about the true value of beta and
not the estimated values, we make assumptions about the true unobservable
disturbance terms rather than their estimated counterparts, the residuals.
We know the exact value of the residuals, since they are defined by
$\hat{u}_t = y_t - \hat{y}_t$. So we do not need to make any assumptions about the residuals
since we already know their value. We make assumptions about the
unobservable error terms since it is always the true value of the population
disturbances that we are really interested in, although we never actually know
what these are.
2. We would like to see no pattern in the residual plot! If there is a pattern in
the residual plot, this is an indication that there is still some "action" or
variability left in $y_t$ that has not been explained by our model. This indicates
that potentially it may be possible to form a better model, perhaps using
additional or completely different explanatory variables, or by using lags of
either the dependent or of one or more of the explanatory variables. Recall
the two plots shown on pages 157 and 159, where the residuals followed a
cyclical pattern and an alternating pattern respectively; these patterns are
used as indications that the residuals are positively and negatively
autocorrelated.
Another problem if there is a pattern in the residuals is that, if it does
indicate the presence of autocorrelation, then this may suggest that our
standard error estimates for the coefficients could be wrong and hence any
inferences we make about the coefficients could be misleading.

3. The t-ratios for the coefficients in this model are given in the third row after
the standard errors. They are calculated by dividing the individual coefficients
by their standard errors.
$\hat{y}_t = 0.638 + 0.402 x_{2t} - 0.891 x_{3t}$,   $R^2 = 0.96$,   $\bar{R}^2 = 0.89$

                  Constant   x2        x3
Standard errors   (0.436)    (0.291)   (0.763)
t-ratios          1.46       1.38      -1.17
The problem appears to be that the regression parameters are all individually
insignificant (i.e. not significantly different from zero), although the value of
R2 and its adjusted version are both very high, so that the regression taken as a
whole seems to indicate a good fit. This looks like a classic example of what we
term near multicollinearity. This is where the individual regressors are very
closely related, so that it becomes difficult to disentangle the effect of each
individual variable upon the dependent variable.
The solution to near multicollinearity that is usually suggested is that since the
problem is really one of insufficient information in the sample to determine
each of the coefficients, then one should go out and get more data. In other
words, we should switch to a higher frequency of data for analysis (e.g. weekly
instead of monthly, monthly instead of quarterly etc.). An alternative is also to
get more data by using a longer sample period (i.e. one going further back in
time), or to combine the two independent variables in a ratio (e.g. x2t / x3t ).
Other, more ad hoc methods for dealing with the possible existence of near
multicollinearity were discussed in Chapter 4:
- Ignore it: if the model is otherwise adequate, i.e. statistically and in terms
of each coefficient being of a plausible magnitude and having an
appropriate sign. Sometimes, the existence of multicollinearity does not
reduce the t-ratios on variables that would have been significant without
the multicollinearity sufficiently to make them insignificant. It is worth
stating that the presence of near multicollinearity does not affect the BLUE
properties of the OLS estimator i.e. it will still be consistent, unbiased
and efficient since the presence of near multicollinearity does not violate
any of the CLRM assumptions 1-4. However, in the presence of near
multicollinearity, it will be hard to obtain small standard errors. This will
not matter if the aim of the model-building exercise is to produce forecasts
from the estimated model, since the forecasts will be unaffected by the
presence of near multicollinearity so long as this relationship between the
explanatory variables continues to hold over the forecasted sample.
- Drop one of the collinear variables - so that the problem disappears.
However, this may be unacceptable to the researcher if there were strong a
priori theoretical reasons for including both variables in the model. Also, if
the removed variable was relevant in the data generating process for y, an
omitted variable bias would result.

- Transform the highly correlated variables into a ratio and include only the
ratio and not the individual variables in the regression. Again, this may be
unacceptable if financial theory suggests that changes in the dependent
variable should occur following changes in the individual explanatory
variables, and not a ratio of them.
4. (a) The assumption of homoscedasticity is that the variance of the errors is
constant and finite over time. Technically, we write $\mathrm{Var}(u_t) = \sigma_u^2$.
(b) The coefficient estimates would still be the correct ones (assuming
that the other assumptions required to demonstrate OLS optimality are
satisfied), but the problem would be that the standard errors could be
wrong. Hence if we were trying to test hypotheses about the true parameter
values, we could end up drawing the wrong conclusions. In fact, for all of
the variables except the constant, the standard errors would typically be too
small, so that we would end up rejecting the null hypothesis too many
times.
(c) There are a number of ways to proceed in practice, including

- Using heteroscedasticity robust standard errors which correct for the
problem by enlarging the standard errors relative to what they would have
been for the situation where the error variance is positively related to one of
the explanatory variables.
- Transforming the data into logs, which has the effect of reducing the effect of
large errors relative to small ones.
5. (a) This is where there is a relationship between the ith and jth residuals.
Recall that one of the assumptions of the CLRM was that such a relationship
did not exist. We want our residuals to be random, and if there is evidence of
autocorrelation in the residuals, then it implies that we could predict the sign
of the next residual and get the right answer more than half the time on
average!
(b) The Durbin Watson test is a test for first order autocorrelation. The test is
calculated as follows. You would run whatever regression you were interested
in, and obtain the residuals. Then calculate the statistic
$$DW = \frac{\sum_{t=2}^{T} (\hat{u}_t - \hat{u}_{t-1})^2}{\sum_{t=2}^{T} \hat{u}_t^2}$$
You would then need to look up the two critical values from the Durbin
Watson tables, and these would depend on how many observations and how many
regressors (excluding the constant this time) you had in the model.
The rejection / non-rejection rule would be given by selecting the appropriate
region from the following diagram:
[Diagram not reproduced: Durbin Watson rejection, inconclusive and non-rejection regions]
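To make the statistic concrete, here is a minimal sketch that computes DW from a list of residuals, with both sums running from t=2 as in the formula above. The two residual series are made up (an assumption, not from any regression in the book): one changes sign rarely, the other alternates every period.

```python
def durbin_watson(residuals):
    """DW statistic with numerator and denominator both summed from t=2 to T."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(u ** 2 for u in residuals[1:])
    return numerator / denominator

# Made-up residual series for illustration only.
smooth = [0.5, 0.6, 0.4, 0.5, -0.4, -0.5, -0.6, -0.5]   # long runs of one sign
alternating = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5]         # sign flips every period

dw_smooth = durbin_watson(smooth)
dw_alternating = durbin_watson(alternating)

print(f"DW (smooth residuals)      = {dw_smooth:.3f}")
print(f"DW (alternating residuals) = {dw_alternating:.3f}")
```

Values well below 2 point towards positive autocorrelation and values close to 4 towards negative autocorrelation; the formal decision still requires the tabulated lower and upper critical values.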
(c) We have 60 observations, and the number of regressors excluding the
constant term is 3. The appropriate lower and upper limits are 1.48 and 1.69
respectively, so the Durbin Watson statistic is lower than the lower limit. It is thus
clear that we reject the null hypothesis of no autocorrelation. So it looks like
the residuals are positively autocorrelated.
(d) $\Delta y_t = \beta_1 + \beta_2 \Delta x_{2t} + \beta_3 \Delta x_{3t} + \beta_4 \Delta x_{4t} + u_t$
The problem with a model entirely in first differences is that once we calculate
the long run solution, all the first difference terms drop out (as in the long run
we assume that the values of all variables have converged on their own long
run values so that $y_t = y_{t-1}$ etc.). Thus when we try to calculate the long run
solution to this model, we cannot do it because there isn't a long run solution
to this model!
(e) $\Delta y_t = \beta_1 + \beta_2 \Delta x_{2t} + \beta_3 \Delta x_{3t} + \beta_4 \Delta x_{4t} + \beta_5 x_{2t-1} + \beta_6 x_{3t-1} + \beta_7 x_{4t-1} + v_t$
The answer is yes, there is no reason why we cannot use Durbin Watson in this
case. You may have said no here because there are lagged values of the
regressors (the x variables) in the regression. In fact this would be
wrong since there are no lags of the DEPENDENT (y) variable and hence DW
can still be used.
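6. $\Delta y_t = \beta_1 + \beta_2 \Delta x_{2t} + \beta_3 \Delta x_{3t} + \beta_4 y_{t-1} + \beta_5 x_{2t-1} + \beta_6 x_{3t-1} + \beta_7 x_{3t-4} + u_t$
The major steps involved in calculating the long run solution are to
- set the disturbance term equal to its expected value of zero
- drop the time subscripts
- remove all difference terms altogether since these will all be zero by the
definition of the long run in this context.
Following these steps, we obtain
$$0 = \beta_1 + \beta_4 y + \beta_5 x_2 + \beta_6 x_3 + \beta_7 x_3$$
We now want to rearrange this to have all the terms in $x_3$ together and so that
y is the subject of the formula:
$$-\beta_4 y = \beta_1 + \beta_5 x_2 + \beta_6 x_3 + \beta_7 x_3$$
$$-\beta_4 y = \beta_1 + \beta_5 x_2 + (\beta_6 + \beta_7) x_3$$
$$y = -\frac{\beta_1}{\beta_4} - \frac{\beta_5}{\beta_4} x_2 - \frac{(\beta_6 + \beta_7)}{\beta_4} x_3$$
The last equation above is the long run solution.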
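The long run solution can be turned into a small function that maps the short-run coefficients into the long-run intercept and slopes. The coefficient values below are hypothetical (no estimates are given in this answer); they are chosen only so the arithmetic is easy to follow.

```python
def long_run_solution(b1, b4, b5, b6, b7):
    """Long-run intercept and slopes implied by 0 = b1 + b4*y + b5*x2 + (b6 + b7)*x3."""
    if b4 == 0:
        raise ValueError("No long run solution: the coefficient on y(t-1) is zero.")
    return {
        "constant": -b1 / b4,
        "x2": -b5 / b4,
        "x3": -(b6 + b7) / b4,
    }

# HYPOTHETICAL coefficient values, for illustration only.
b1, b4, b5, b6, b7 = 0.5, -0.25, 0.1, 0.05, 0.15
solution = long_run_solution(b1, b4, b5, b6, b7)
print(solution)

# Check: the implied y satisfies the static equation for arbitrary x2 and x3.
x2, x3 = 2.0, 3.0
y = solution["constant"] + solution["x2"] * x2 + solution["x3"] * x3
print("static equation residual:", b1 + b4 * y + b5 * x2 + (b6 + b7) * x3)
```

The guard on `b4` mirrors the point made in question 5(d): without a levels term in y there is nothing to solve for, so no long run solution exists.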
7. Ramsey's RESET test is a test of whether the functional form of the
regression is appropriate. In other words, we test whether the relationship
between the dependent variable and the independent variables really should
be linear or whether a non-linear form would be more appropriate. The test
works by adding powers of the fitted values from the regression into a second
regression. If the appropriate model was a linear one, then the powers of the
fitted values would not be significant in this second regression.
If we fail Ramsey's RESET test, then the easiest solution is probably to
transform all of the variables into logarithms. This has the effect of turning a
multiplicative model into an additive one.
If this still fails, then we really have to admit that the relationship between the
dependent variable and the independent variables was probably not linear
after all so that we have to either estimate a non-linear model for the data
(which is beyond the scope of this course) or we have to go back to the drawing
board and run a different regression containing different variables.
8. (a) It is important to note that we did not need to assume normality in order
to derive the sample estimates of $\alpha$ and $\beta$ or in calculating their standard
errors. We needed the normality assumption at the later stage when we came
to test hypotheses about the regression coefficients, either singly or jointly, so
that the test statistics we calculate would indeed have the distribution (t or F)
that we said they would.
(b) One solution would be to use a technique for estimation and inference
which did not require normality. But these techniques are often highly
complex and also their properties are not so well understood, so we do not
know with such certainty how well the methods will perform in different
circumstances.

One pragmatic approach to failing the normality test is to plot the estimated
residuals of the model, and look for one or more very extreme outliers. These
would be residuals that are much bigger (either very big and positive, or very
big and negative) than the rest. It is, fortunately for us, often the case that one
or two very extreme outliers will cause a violation of the normality
assumption. The reason that one or two extreme outliers can cause a violation
of the normality assumption is that they would lead the (absolute value of the)
skewness and / or kurtosis estimates to be very large.
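The effect of a single outlier on the skewness and kurtosis estimates can be illustrated as follows. The residual values are made up, and the moment-based definitions used (third and fourth central moments scaled by powers of the variance) are the usual sample versions; both are assumptions for the purpose of this sketch.

```python
def skewness_and_kurtosis(values):
    """Sample skewness and kurtosis from central moments (kurtosis of a normal is 3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

# Made-up residuals: a symmetric set, then the same set plus one extreme value.
clean = [-2.0, -1.0, -1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 2.0]
with_outlier = clean + [10.0]

skew_clean, kurt_clean = skewness_and_kurtosis(clean)
skew_out, kurt_out = skewness_and_kurtosis(with_outlier)

print(f"without outlier: skewness={skew_clean:.2f}, kurtosis={kurt_clean:.2f}")
print(f"with outlier   : skewness={skew_out:.2f}, kurtosis={kurt_out:.2f}")
```

One extreme observation is enough to push both measures far from the values expected under normality, which is why inspecting the residuals for outliers is a sensible first step.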
Once we spot a few extreme residuals, we should look at the dates when these
outliers occurred. If we have a good theoretical reason for doing so, we can
add in separate dummy variables for big outliers caused by, for example, wars,
changes of government, stock market crashes, changes in market
microstructure (e.g. the "Big Bang" of 1986). The effect of the dummy variable
is exactly the same as if we had removed the observation from the sample
altogether and estimated the regression on the remainder. If we only remove
observations in this way, then we make sure that we do not lose any useful
pieces of information represented by sample points.
9. (a) Parameter structural stability refers to whether the coefficient estimates
for a regression equation are stable over time. If the regression is not
structurally stable, it implies that the coefficient estimates would be different
for some sub-samples of the data compared to others. This is clearly not what
we want to find since when we estimate a regression, we are implicitly
assuming that the regression parameters are constant over the entire sample
period under consideration.
(b) 1981M1-1995M12
rt = 0.0215 + 1.491 rmt RSS=0.189 T=180
1981M1-1987M10
rt = 0.0163 + 1.308 rmt RSS=0.079 T=82
1987M11-1995M12
rt = 0.0360 + 1.613 rmt RSS=0.082 T=98
(c) If we define the coefficient estimates for the first and second halves of the
sample as $\alpha_1$ and $\beta_1$, and $\alpha_2$ and $\beta_2$ respectively, then the null and alternative
hypotheses are
$H_0: \alpha_1 = \alpha_2$ and $\beta_1 = \beta_2$
and $H_1: \alpha_1 \neq \alpha_2$ or $\beta_1 \neq \beta_2$
(d) The test statistic is calculated as
$$\text{Test stat.} = \frac{RSS - (RSS_1 + RSS_2)}{RSS_1 + RSS_2} \times \frac{T - 2k}{k} = \frac{0.189 - (0.079 + 0.082)}{0.079 + 0.082} \times \frac{180 - 4}{2} = 15.304$$
This follows an F distribution with (k, T-2k) degrees of freedom. The critical
value of F(2,176) is 3.05 at the 5% level. Clearly we reject the null hypothesis
that the coefficients are equal in the two sub-periods.
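The Chow test arithmetic can be verified with the residual sums of squares reported in part (b). This is a minimal sketch; the function name `chow_stat` is illustrative, and the critical value is the tabulated one quoted above.

```python
def chow_stat(rss_whole, rss_1, rss_2, T, k):
    """Chow test: ((RSS - (RSS1 + RSS2)) / (RSS1 + RSS2)) * (T - 2k) / k."""
    rss_split = rss_1 + rss_2
    return (rss_whole - rss_split) / rss_split * (T - 2 * k) / k

# Whole sample 1981M1-1995M12 and the two sub-samples, as reported in part (b).
stat = chow_stat(rss_whole=0.189, rss_1=0.079, rss_2=0.082, T=180, k=2)
f_crit = 3.05  # 5% critical value for F(2, 176), as quoted above

print(f"Chow statistic = {stat:.3f}, reject parameter stability: {stat > f_crit}")
```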
10. The data we have are

1981M1-1995M12
rt = 0.0215 + 1.491 Rmt RSS=0.189 T=180
1981M1-1994M12
rt = 0.0212 + 1.478 Rmt RSS=0.148 T=168
1982M1-1995M12
rt = 0.0217 + 1.523 Rmt RSS=0.182 T=168
First, the forward predictive failure test - i.e. we are trying to see if the model
for 1981M1-1994M12 can predict 1995M1-1995M12.
The test statistic is given by
$$\frac{RSS - RSS_1}{RSS_1} \times \frac{T_1 - k}{T_2} = \frac{0.189 - 0.148}{0.148} \times \frac{168 - 2}{12} = 3.832$$
where T1 is the number of observations in the first period (i.e. the period that
we actually estimate the model over), and T2 is the number of observations we
are trying to predict. The test statistic follows an F-distribution with (T2, T1-k)
degrees of freedom. F(12, 166) = 1.81 at the 5% level. So we reject the null
hypothesis that the model can predict the observations for 1995. We would
conclude that our model is no use for predicting this period, and from a
practical point of view, we would have to consider whether this failure is a
result of atypical behaviour of the series out-of-sample (i.e. during 1995), or
whether it results from a genuine deficiency in the model.
The backward predictive failure test is a little more difficult to understand,
although no more difficult to implement. The test statistic is given by
$$\frac{RSS - RSS_1}{RSS_1} \times \frac{T_1 - k}{T_2} = \frac{0.189 - 0.182}{0.182} \times \frac{168 - 2}{12} = 0.532$$
Now we need to be a little careful in our interpretation of what exactly are the
first and second sample periods. It would be possible to define T1 as always
being the first sample period. But I think it easier to say that T1 is always the
sample over which we estimate the model (even though it now comes after the
hold-out-sample). Thus T2 is still the sample that we are trying to predict, even
though it comes first. You can use either notation, but you need to be clear and
consistent. If you wanted to choose the other way to the one I suggest, then

you would need to change the subscript 1 everywhere in the formula above so
that it was 2, and change every 2 so that it was a 1.
Either way, we conclude that there is little evidence against the null
hypothesis. Thus our model is able to adequately back-cast the first 12
observations of the sample.
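Both predictive failure statistics use the same formula with a different estimation-period RSS, which a short Python sketch makes explicit. The function name and argument names are illustrative assumptions; the numbers are those quoted in the answer.

```python
def predictive_failure_f(rss_full, rss_est, n_est, n_pred, n_params):
    """Predictive failure statistic; rss_est and n_est refer to the estimation sample."""
    return (rss_full - rss_est) / rss_est * (n_est - n_params) / n_pred


# Forward test: estimate on 1981M1-1994M12, predict 1995M1-1995M12.
forward = predictive_failure_f(0.189, 0.148, n_est=168, n_pred=12, n_params=2)
# Backward test: estimate on 1982M1-1995M12, "predict" 1981M1-1981M12.
backward = predictive_failure_f(0.189, 0.182, n_est=168, n_pred=12, n_params=2)

critical_5pct = 1.81  # F(12,166) value quoted in the answer
print(round(forward, 3), forward > critical_5pct)    # 3.832 True
print(round(backward, 3), backward > critical_5pct)  # 0.532 False
```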
11. By definition, variables having associated parameters that are not

significantly different from zero are not, from a statistical perspective, helping
to explain variations in the dependent variable about its mean value. One
could therefore argue that empirically, they serve no purpose in the fitted
regression model. But leaving such variables in the model will use up valuable
degrees of freedom, implying that the standard errors on all of the other
parameters in the regression model, will be unnecessarily higher as a result. If
the number of degrees of freedom is relatively small, then saving a couple by
deleting two variables with insignificant parameters could be useful. On the
other hand, if the number of degrees of freedom is already very large, the
impact of these additional irrelevant variables on the others is likely to be
inconsequential.
12. An outlier dummy variable will take the value one for one observation in
the sample and zero for all others. The Chow test involves splitting the sample
into two parts. If we then try to run the regression on both the sub-parts but
the model contains such an outlier dummy, then the observations on that
dummy will be zero everywhere for one of the regressions. For that sub-
sample, the outlier dummy would show perfect multicollinearity with the
intercept and therefore the model could not be estimated.
Chapter 5
1. Autoregressive models specify the current value of a series yt as a function of
its previous p values and the current value of an error term, ut, while moving
average models specify the current value of a series yt as a function of the
current and previous q values of an error term, ut. AR and MA models have
different characteristics in terms of the length of their memories, which has
implications for the time it takes shocks to yt to die away, and for the shapes of
their autocorrelation and partial autocorrelation functions.
2. ARMA models are of particular use for financial series due to their
flexibility. They are fairly simple to estimate, can often produce reasonable
forecasts, and most importantly, they require no knowledge of any structural
variables that might be required for more traditional econometric analysis.
When the data are available at high frequencies, we can still use ARMA models
while exogenous explanatory variables (e.g. macroeconomic variables,
accounting ratios) may be unobservable at any more than monthly intervals at
best.

3. yt = yt-1 + ut (1)
yt = 0.5 yt-1 + ut (2)
yt = 0.8 ut-1 + ut (3)
(a) The first two models are roughly speaking AR(1) models, while the last is
an MA(1). Strictly, since the first model is a random walk, it should be called
an ARIMA(0,1,0) model, but it could still be viewed as a special case of an
autoregressive model.
(b) We know that the theoretical acf of an MA(q) process will be zero after q
lags, so the acf of the MA(1) will be zero at all lags after one. For an
autoregressive process, the acf dies away gradually. It will die away fairly
quickly for case (2), with each successive autocorrelation coefficient taking on
a value equal to half that of the previous lag. For the first case, however, the
acf will never die away, and in theory will always take on a value of one,
whatever the lag.
Turning now to the pacf, the pacf for the first two models would have a large
positive spike at lag 1, and no statistically significant pacfs at other lags.
Again, the unit root process of (1) would have a pacf the same as that of a
stationary AR process. The pacf for (3), the MA(1), will decline geometrically.
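The acf shapes described in (b) follow from two standard results: for an AR(1) with coefficient phi the lag-k autocorrelation is phi^k, and for an MA(1) with coefficient theta it is theta/(1 + theta^2) at lag 1 and zero afterwards. The sketch below simply evaluates these formulae; the function names are illustrative, and treating the random walk as the limiting case phi = 1 mirrors the "in theory" statement above rather than a formal result for a non-stationary series.

```python
def acf_ar1(phi, max_lag):
    """Theoretical autocorrelations of an AR(1): phi**k at lag k."""
    return [phi ** k for k in range(1, max_lag + 1)]


def acf_ma1(theta, max_lag):
    """Theoretical autocorrelations of an MA(1): non-zero at lag 1 only."""
    first = theta / (1 + theta ** 2)
    return [first] + [0.0] * (max_lag - 1)


random_walk = acf_ar1(1.0, 4)  # limiting case for equation (1): never dies away
ar_half = acf_ar1(0.5, 4)      # equation (2): halves at every lag
ma_point8 = acf_ma1(0.8, 4)    # equation (3): cuts off after lag 1

print(random_walk)
print(ar_half)
print([round(r, 4) for r in ma_point8])
```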
(c) Clearly the first equation (the random walk) is more likely to represent
stock prices in practice. The discounted dividend model of share prices states
that the current value of a share will be simply the discounted sum of all
expected future dividends. If we assume that investors form their expectations
about dividend payments rationally, then the current share price should
embody all information that is known about the future of dividend payments,
and hence today's price should only differ from yesterday's by the amount of
unexpected news which influences dividend payments.
Thus stock prices should follow a random walk. Note that we could apply a
similar rational expectations and random walk model to many other kinds of
financial series.
If the stock market really followed the process described by equations (2) or
(3), then we could potentially make useful forecasts of the series using our
model. In the latter case of the MA(1), we could only make one-step ahead
forecasts since the memory of the model is only that length. In the case of
equation (2), we could potentially make a lot of money by forming multiple
step ahead forecasts and trading on the basis of these.
Hence after a period, it is likely that other investors would spot this potential
opportunity and hence the model would no longer be a useful description of
the data.
(d) See the book for the algebra. This part of the question is really an extension
of the others. Analysing the simplest case first, the MA(1), the memory of
the process will only be one period, and therefore a given shock or
innovation, ut, will only persist in the series (i.e. be reflected in yt) for one

period. After that, the effect of a given shock would have completely worked
through.
For the case of the AR(1) given in equation (2), a given shock, ut, will persist
indefinitely and will therefore influence the properties of yt for ever, but its
effect upon yt will diminish exponentially as time goes on.
In the first case, the series yt could be written as an infinite sum of past
shocks, and therefore the effect of a given shock will persist indefinitely, and
its effect will not diminish over time.
4. (a) Box and Jenkins were the first to consider ARMA modelling in this
logical and coherent fashion. Their methodology consists of 3 steps:
Identification - determining the appropriate order of the model using
graphical procedures (e.g. plots of autocorrelation functions).
Estimation - of the parameters of the model of size given in the first stage. This
can be done using least squares or maximum likelihood, depending on the
model.
Diagnostic checking - this step is to ensure that the model actually estimated is
adequate. B & J suggest two methods for achieving this:
- Overfitting, which involves deliberately fitting a model larger than

that suggested in step 1 and testing the hypothesis that all the additional
coefficients can jointly be set to zero.
- Residual diagnostics. If the model estimated is a good description of

the data, there should be no further linear dependence in the residuals of the
estimated model. Therefore, we could calculate the residuals from the
estimated model, and use the Ljung-Box test on them, or calculate their acf. If
either of these reveal evidence of additional structure, then we assume that the
estimated model is not an adequate description of the data.
If the model appears to be adequate, then it can be used for policy analysis and
for constructing forecasts. If it is not adequate, then we must go back to stage 1
and start again!
(b) The main problem with the B & J methodology is the inexactness of the
identification stage. Autocorrelation functions and partial autocorrelations for
actual data are very difficult to interpret accurately, rendering the whole
procedure often little more than educated guesswork. A further problem
concerns the diagnostic checking stage, which will only indicate when the
proposed model is too small and would not inform on when the model
proposed is too large.
(c) We could use Akaike's or Schwarz's Bayesian information criteria. Our
objective would then be to fit the model order that minimises these.
We can calculate the value of Akaike's (AIC) and Schwarz's Bayesian (SBIC)
information criteria using the following respective formulae
$AIC = \ln(\hat\sigma^2) + 2k/T$
$SBIC = \ln(\hat\sigma^2) + k \ln(T)/T$
where $\hat\sigma^2$ is the residual variance, k is the number of estimated parameters
and T is the sample size.
The information criteria trade off an increase in the number of parameters and
therefore an increase in the penalty term against a fall in the RSS, implying a
closer fit of the model to the data.
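As a rough illustration of the trade-off, the two formulae can be coded directly. The residual variances, parameter counts and sample size below are hypothetical values chosen only to show the mechanics; they do not come from the question.

```python
import math


def aic(sigma2, k, n_obs):
    """Akaike's information criterion: ln(sigma^2) + 2k/T."""
    return math.log(sigma2) + 2 * k / n_obs


def sbic(sigma2, k, n_obs):
    """Schwarz's Bayesian information criterion: ln(sigma^2) + k*ln(T)/T."""
    return math.log(sigma2) + k * math.log(n_obs) / n_obs


# Hypothetical comparison: the larger model fits slightly better (smaller
# residual variance) but pays a bigger penalty for its extra parameters.
small = {"sigma2": 0.50, "k": 2}
large = {"sigma2": 0.49, "k": 5}
n_obs = 200

for name, m in (("small", small), ("large", large)):
    print(name, round(aic(m["sigma2"], m["k"], n_obs), 4),
          round(sbic(m["sigma2"], m["k"], n_obs), 4))
```

Because ln(T) exceeds 2 once T is larger than about 7, SBIC penalises each extra parameter more heavily than AIC for any realistic sample size.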
5. The best way to check for stationarity is to express the model as a lag
polynomial in yt.
$$y_t = 0.803 y_{t-1} + 0.682 y_{t-2} + u_t$$
Rewrite this as
$$y_t (1 - 0.803L - 0.682L^2) = u_t$$
We want to find the roots of the lag polynomial $(1 - 0.803L - 0.682L^2) = 0$ and
determine whether they are greater than one in absolute value. It is easier (in
my opinion) to rewrite this formula (by multiplying through by -1/0.682,
using z for the characteristic equation and rearranging) as
$$z^2 + 1.177z - 1.466 = 0$$
Using the standard formula for obtaining the roots of a quadratic equation,
$$z = \frac{-1.177 \pm \sqrt{1.177^2 - 4 \times 1 \times (-1.466)}}{2} = 0.758 \text{ or } -1.934$$
Since ALL the roots must be greater than one in absolute value for the model
to be stationary, and one root (0.758) lies inside the unit circle, we conclude
that the estimated model is not stationary in this case.
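The root calculation can be checked numerically. The sketch below solves the lag polynomial 1 - phi1*z - phi2*z^2 = 0 with the quadratic formula; the function name is illustrative, and the code assumes real roots and a non-zero second coefficient, which holds for this model.

```python
import math


def ar2_lag_polynomial_roots(phi1, phi2):
    """Roots of 1 - phi1*z - phi2*z**2 = 0 (assumes real roots and phi2 != 0)."""
    # Equivalent quadratic: phi2*z**2 + phi1*z - 1 = 0.
    discriminant = phi1 ** 2 + 4 * phi2
    if discriminant < 0:
        raise ValueError("complex roots are not handled in this sketch")
    root = math.sqrt(discriminant)
    return ((-phi1 + root) / (2 * phi2), (-phi1 - root) / (2 * phi2))


roots = ar2_lag_polynomial_roots(0.803, 0.682)
is_stationary = all(abs(z) > 1 for z in roots)

print([round(z, 3) for z in roots])
print(is_stationary)  # False: one root lies inside the unit circle
```

The second root printed here differs from the hand calculation in the third decimal place only because the answer rounds the coefficients to 1.177 and 1.466 before solving.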
6. Using the formulae above, we end up with the following values for each
criterion and for each model order (with an asterisk denoting the smallest
value of the information criterion in each case).
ARMA(p,q) model order   log($\hat\sigma^2$)   AIC      SBIC
(0,0)                   0.932      0.942    0.944
(1,0)                   0.864      0.884    0.887
(0,1)                   0.902      0.922    0.925
(1,1)                   0.836      0.866    0.870
(2,1)                   0.801      0.841    0.847
(1,2)                   0.821      0.861    0.867
(2,2)                   0.789      0.839    0.846
(3,2)                   0.773      0.833*   0.842*
(2,3)                   0.782      0.842    0.851
(3,3)                   0.764      0.834    0.844
The result is pretty clear: both SBIC and AIC say that the appropriate model is
an ARMA(3,2).
7. We could still perform the Ljung-Box test on the residuals of the estimated
models to see if there was any linear dependence left unaccounted for by our
postulated models.
Another test of the model's adequacy that we could use is to leave out some of
the observations at the identification and estimation stage, and attempt to
construct out of sample forecasts for these. For example, if we have 2000
observations, we may use only 1800 of them to identify and estimate the
models, and leave the remaining 200 for construction of forecasts. We would
then prefer the model that gave the most accurate forecasts.
8. This is not true in general. Yes, we do want to form a model which fits the
data as well as possible. But in most financial series, there is a substantial
amount of noise. This can be interpreted as a number of random events that
are unlikely to be repeated in any forecastable way. We want to fit a model to
the data which will be able to generalise. In other words, we want a model
which fits to features of the data which will be replicated in future; we do not
want to fit to sample-specific noise.
This is why we need the concept of parsimony - fitting the smallest possible
model to the data. Otherwise we may get a great fit to the data in sample, but
any use of the model for forecasts could yield terrible results.
Another important point is that the larger the number of estimated

parameters (i.e. the more variables we have), then the smaller will be the
number of degrees of freedom, and this will imply that coefficient standard
errors will be larger than they would otherwise have been. This could lead to a
loss of power in hypothesis tests, and variables that would otherwise have
been significant are now insignificant.
9. (a) We class an autocorrelation coefficient or partial autocorrelation
coefficient as significant if it lies outside $\pm 1.96 \times \frac{1}{\sqrt{T}} = \pm 0.196$. Under this rule, the
sample autocorrelation functions (sacfs) at lag 1 and 4 are significant, and the
spacfs at lag 1, 2, 3, 4 and 5 are all significant.
This clearly looks like the data are consistent with a first order moving average
process since all but the first acfs are not significant (the significant lag 4 acf is
a typical wrinkle that one might expect with real data and should probably be
ignored), and the pacf has a slowly declining structure.

(b) The formula for the Ljung-Box Q* test is given by
$$Q^* = T(T+2) \sum_{k=1}^{m} \frac{\hat\tau_k^2}{T-k} \sim \chi^2_m$$
using the standard notation.
In this case, T=100, and m=3. The null hypothesis is $H_0: \tau_1 = 0$ and $\tau_2 = 0$ and
$\tau_3 = 0$. The test statistic is calculated as
$$Q^* = 100 \times 102 \times \left[\frac{0.420^2}{100-1} + \frac{0.104^2}{100-2} + \frac{0.032^2}{100-3}\right] = 19.41.$$
The 5% and 1% critical values for a $\chi^2$ distribution with 3 degrees of freedom
are 7.81 and 11.3 respectively. Clearly, then, we would reject the null
hypothesis that the first three autocorrelation coefficients are jointly zero.
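The Q* calculation is easy to reproduce. The sketch below takes the three autocorrelation coefficients as printed; since each coefficient is squared, any signs lost in the printed figures would not change the result. The function name is illustrative.

```python
def ljung_box_q(acf_values, n_obs):
    """Ljung-Box Q* statistic for the first len(acf_values) autocorrelations."""
    weighted = sum(r ** 2 / (n_obs - k) for k, r in enumerate(acf_values, start=1))
    return n_obs * (n_obs + 2) * weighted


q_star = ljung_box_q([0.420, 0.104, 0.032], n_obs=100)
print(round(q_star, 2))  # 19.41
print(q_star > 11.3)     # True: exceeds even the 1% critical value
```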
10. (a) To solve this, we need the concept of a conditional expectation,
i.e. $E_{t-1}(y_t \mid y_{t-2}, y_{t-3}, \ldots)$
For example, in the context of an AR(1) model such as $y_t = a_0 + a_1 y_{t-1} + u_t$:
If we are now at time t-1, and dropping the t-1 subscript on the expectations
operator
$$E(y_t) = a_0 + a_1 y_{t-1}$$
$$E(y_{t+1}) = a_0 + a_1 E(y_t)$$
$$= a_0 + a_1 (a_0 + a_1 y_{t-1})$$
$$= a_0 + a_0 a_1 + a_1^2 y_{t-1}$$
$$E(y_{t+2}) = a_0 + a_1 E(y_{t+1})$$
$$= a_0 + a_1 (a_0 + a_1 E(y_t))$$
$$= a_0 + a_0 a_1 + a_1^2 E(y_t)$$
$$= a_0 + a_0 a_1 + a_1^2 (a_0 + a_1 y_{t-1})$$
$$= a_0 + a_0 a_1 + a_1^2 a_0 + a_1^3 y_{t-1}$$
etc.
$$f_{t-1,1} = a_0 + a_1 y_{t-1}$$
$$f_{t-1,2} = a_0 + a_1 f_{t-1,1}$$
$$f_{t-1,3} = a_0 + a_1 f_{t-1,2}$$

To forecast an MA model, consider, e.g.
$$y_t = u_t + b_1 u_{t-1}$$
$$E(y_t \mid y_{t-1}, y_{t-2}, \ldots) = E(u_t + b_1 u_{t-1}) = b_1 u_{t-1}$$
So $f_{t-1,1} = b_1 u_{t-1}$
But
$$E(y_{t+1} \mid y_{t-1}, y_{t-2}, \ldots) = E(u_{t+1} + b_1 u_t) = 0$$
Going back to the example above,
$$y_t = 0.036 + 0.69 y_{t-1} + 0.42 u_{t-1} + u_t$$
Suppose that we know t-1, t-2,... and we are trying to forecast yt.
Our forecast for t is given by
$$E(y_t \mid y_{t-1}, y_{t-2}, \ldots) = f_{t-1,1} = 0.036 + 0.69 y_{t-1} + 0.42 u_{t-1} + u_t$$
$$= 0.036 + 0.69 \times 3.4 + 0.42 \times (-1.3)$$
$$= 1.836$$
$$f_{t-1,2} = E(y_{t+1} \mid y_{t-1}, y_{t-2}, \ldots) = 0.036 + 0.69 y_t + 0.42 u_t + u_{t+1}$$
But we do not know yt or ut at time t-1.
Replace yt with our forecast of yt which is ft-1,1.
$$f_{t-1,2} = 0.036 + 0.69 f_{t-1,1}$$
$$= 0.036 + 0.69 \times 1.836$$
$$= 1.302$$
$$f_{t-1,3} = 0.036 + 0.69 f_{t-1,2}$$
$$= 0.036 + 0.69 \times 1.302$$
$$= 0.935$$
etc.
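The recursion used above can be sketched in Python: the first forecast uses the last observed value and the last residual, and every later forecast feeds the previous forecast back through the autoregressive part only. The function and argument names are illustrative; the inputs (3.4 and -1.3) are the values used in the worked calculation.

```python
def arma11_forecasts(const, ar, ma, last_y, last_u, steps):
    """Multi-step forecasts for y_t = const + ar*y_{t-1} + ma*u_{t-1} + u_t."""
    # One step ahead: the lagged shock is known, the future shock has mean zero.
    forecasts = [const + ar * last_y + ma * last_u]
    # Further ahead: the MA term drops out and y is replaced by its forecast.
    for _ in range(steps - 1):
        forecasts.append(const + ar * forecasts[-1])
    return forecasts


f = arma11_forecasts(0.036, 0.69, 0.42, last_y=3.4, last_u=-1.3, steps=3)
print([round(x, 4) for x in f])
```

Carrying full precision through the recursion gives a second forecast of about 1.3028, so the 1.302 in the hand calculation reflects truncation rather than a different method.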
(b) Given the forecasts and the actual value, it is very easy to calculate the
MSE by plugging the numbers in to the relevant formula, which in this case is
$$MSE = \frac{1}{N} \sum_{n=1}^{N} (x_{t-1+n} - f_{t-1,n})^2$$
if we are making N forecasts which are numbered 1,2,3.
Then the MSE is given by
$$MSE = \frac{1}{3}\left[(1.836 - (-0.032))^2 + (1.302 - 0.961)^2 + (0.935 - 0.203)^2\right] = \frac{1}{3}(3.489 + 0.116 + 0.536) = 1.380$$
Notice also that 84% of the total MSE is coming from the error in the first
forecast. Thus error measures can be driven by one or two times when the
model fits very badly. For example, if the forecast period includes a stock
market crash, this can lead the mean squared error to be 100 times bigger than
it would have been if the crash observations were not included. This point
needs to be considered whenever forecasting models are evaluated. An idea of
whether this is a problem in a given situation can be gained by plotting the
forecast errors over time.
(c) This question is much simpler to answer than it looks! In fact, the inclusion
of the smoothing coefficient is a red herring - i.e. a piece of misleading and
useless information. The correct approach is to say that if we believe that the
exponential smoothing model is appropriate, then all useful information will
have already been used in the calculation of the current smoothed value
(which will of course have used the smoothing coefficient in its calculation).
Thus the three forecasts are all 0.0305.
(d) The solution is to work out the mean squared error for the exponential
smoothing model. The calculation is
$$MSE = \frac{1}{3}\left[(0.0305 - (-0.032))^2 + (0.0305 - 0.961)^2 + (0.0305 - 0.203)^2\right] = \frac{1}{3}(0.0039 + 0.8658 + 0.0298) = 0.2998$$
Therefore, we conclude that since the mean squared error is smaller for the
exponential smoothing model than the Box Jenkins model, the former
produces the more accurate forecasts. We should, however, bear in mind that
the question of accuracy was determined using only 3 forecasts, which would
be insufficient in a real application.
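The two MSE figures can be compared side by side with a short sketch. The actual values used here (-0.032, 0.961, 0.203) are the ones implied by the squared errors printed in parts (b) and (d); the function name is illustrative.

```python
def mse(forecasts, actuals):
    """Mean squared forecast error."""
    return sum((f - a) ** 2 for f, a in zip(forecasts, actuals)) / len(actuals)


actuals = [-0.032, 0.961, 0.203]
box_jenkins = mse([1.836, 1.302, 0.935], actuals)
smoothing = mse([0.0305, 0.0305, 0.0305], actuals)

print(round(box_jenkins, 3))  # 1.381 (the text truncates each squared error first, giving 1.380)
print(round(smoothing, 4))    # 0.2998

# Share of the Box-Jenkins MSE contributed by the first forecast error.
first_share = (1.836 - actuals[0]) ** 2 / (3 * box_jenkins)
print(round(first_share, 2))  # 0.84
```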
11. (a) The shapes of the acf and pacf are perhaps best summarised in a table:

Process       acf                                 pacf
White noise   No significant coefficients         No significant coefficients
AR(2)         Geometrically declining or          First 2 pacf coefficients significant,
              damped sinusoid acf                 all others insignificant
MA(1)         First acf coefficient significant,  Geometrically declining or
              all others insignificant            damped sinusoid pacf
ARMA(2,1)     Geometrically declining or          Geometrically declining or
              damped sinusoid acf                 damped sinusoid pacf

A couple of further points are worth noting. First, it is not possible to tell what
the signs of the coefficients for the acf or pacf would be for the last three
processes, since that would depend on the signs of the coefficients of the
processes. Second, for mixed processes, the AR part dominates from the point
of view of acf calculation, while the MA part dominates for pacf calculation.
(b) The important point here is to focus on the MA part of the model and to
ignore the AR dynamics. The characteristic equation would be
(1+0.42z) = 0
The root of this equation is -1/0.42 = -2.38, which lies outside the unit circle,
and therefore the MA part of the model is invertible.
(c) Since no values for the series y or the lagged residuals are given, the
answers should be stated in terms of y and of u. Assuming that information is
available up to and including time t, the 1-step ahead forecast would be for
time t+1, the 2-step ahead for time t+2 and so on. A useful first step would be
to write the model out for y at times t+1, t+2, t+3, t+4:
$$y_{t+1} = 0.036 + 0.69 y_t + 0.42 u_t + u_{t+1}$$
$$y_{t+2} = 0.036 + 0.69 y_{t+1} + 0.42 u_{t+1} + u_{t+2}$$
$$y_{t+3} = 0.036 + 0.69 y_{t+2} + 0.42 u_{t+2} + u_{t+3}$$
$$y_{t+4} = 0.036 + 0.69 y_{t+3} + 0.42 u_{t+3} + u_{t+4}$$
The 1-step ahead forecast would simply be the conditional expectation of y for
time t+1 made at time t. Denoting the 1-step ahead forecast made at time t as
ft,1, the 2-step ahead forecast made at time t as ft,2 and so on:
$$E(y_{t+1} \mid y_t, y_{t-1}, \ldots) = f_{t,1} = E_t[y_{t+1}] = E_t[0.036 + 0.69 y_t + 0.42 u_t + u_{t+1}] = 0.036 + 0.69 y_t + 0.42 u_t$$
since Et[ut+1]=0. The 2-step ahead forecast would be given by
$$E(y_{t+2} \mid y_t, y_{t-1}, \ldots) = f_{t,2} = E_t[y_{t+2}] = E_t[0.036 + 0.69 y_{t+1} + 0.42 u_{t+1} + u_{t+2}] = 0.036 + 0.69 f_{t,1}$$
since Et[ut+1]=0 and Et[ut+2]=0. Thus, beyond 1-step ahead, the MA(1) part of
the model disappears from the forecast and only the autoregressive part
remains. Although we do not know yt+1, its expected value is the 1-step ahead
forecast that was made at the first stage, ft,1.
The 3-step ahead forecast would be given by
$$E(y_{t+3} \mid y_t, y_{t-1}, \ldots) = f_{t,3} = E_t[y_{t+3}] = E_t[0.036 + 0.69 y_{t+2} + 0.42 u_{t+2} + u_{t+3}] = 0.036 + 0.69 f_{t,2}$$
and the 4-step ahead by
$$E(y_{t+4} \mid y_t, y_{t-1}, \ldots) = f_{t,4} = E_t[y_{t+4}] = E_t[0.036 + 0.69 y_{t+3} + 0.42 u_{t+3} + u_{t+4}] = 0.036 + 0.69 f_{t,3}$$
(d) A number of methods for aggregating the forecast errors to produce a

single forecast evaluation measure were suggested in the paper by Makridakis
and Hibon (1995) and some discussion is presented in the book. Any of the
methods suggested there could be discussed. A good answer would present an
expression for the evaluation measures, with any notation introduced being
carefully defined, together with a discussion of why the measure takes the
form that it does and what the advantages and disadvantages of its use are
compared with other methods.
(e) Moving average and ARMA models cannot be estimated using OLS; they
are usually estimated by maximum likelihood. Autoregressive models can be
estimated using OLS or maximum likelihood. Pure autoregressive models
contain only lagged values of observed quantities on the RHS, and therefore,
the lags of the dependent variable can be used just like any other regressors.
However, in the context of MA and mixed models, the lagged values of the
error term that occur on the RHS are not known a priori. Hence, these
quantities are replaced by the residuals, which are not available until after the
model has been estimated. But equally, these residuals are required in order to
be able to estimate the model parameters. Maximum likelihood essentially
works around this by calculating the values of the coefficients and the
residuals at the same time. Maximum likelihood involves selecting the most
likely values of the parameters given the actual data sample, and given an
assumed statistical distribution for the errors. This technique will be discussed
in greater detail in the section on volatility modelling in Chapter 8.
12. (a) Some of the stylised differences between the typical characteristics of
macroeconomic and financial data were presented in Chapter 1. In particular,
one important difference is the frequency with which financial asset return
time series and other quantities in finance can be recorded. This is of
particular relevance for the models discussed in Chapter 5, since it is usually a
requirement that all of the time-series data series used in estimating a given
model must be of the same frequency. Thus, if, for example, we wanted to
build a model for forecasting hourly changes in exchange rates, it would be
difficult to set up a structural model containing macroeconomic explanatory
variables since the macroeconomic variables are likely to be measured on a
quarterly or at best monthly basis. This gives a motivation for using pure time-
series approaches (e.g. ARMA models), rather than structural formulations
with separate explanatory variables.
It is also often of particular interest to produce forecasts of financial variables

in real time. Producing forecasts from pure time-series models is usually
simply an exercise in iterating with conditional expectations. But producing
forecasts from structural models is considerably more difficult, and would
usually require the production of forecasts for the structural variables as well.
(b) A simple rule of thumb for determining whether autocorrelation
coefficients and partial autocorrelation coefficients are statistically significant
is to classify them as significant at the 5% level if they lie outside of
$\pm 1.96 \times \frac{1}{\sqrt{T}}$, where T is the sample size. In this case, T = 500, so a particular
coefficient would be deemed significant if it is larger than 0.088 or smaller
than -0.088. On this basis, the autocorrelation coefficients at lags 1 and 5 and
the partial autocorrelation coefficients at lags 1, 2, and 3 would be classed as
significant. The formulae for the Box-Pierce and the Ljung-Box test statistics
are respectively
$$Q = T \sum_{k=1}^{m} \hat\tau_k^2 \quad \text{and} \quad Q^* = T(T+2) \sum_{k=1}^{m} \frac{\hat\tau_k^2}{T-k}.$$
In this instance, the statistics would be calculated respectively as
$$Q = 500 \times [0.307^2 + (-0.013)^2 + 0.086^2 + 0.031^2 + (-0.197)^2] = 70.79$$
and
$$Q^* = 500 \times 502 \times \left[\frac{0.307^2}{500-1} + \frac{(-0.013)^2}{500-2} + \frac{0.086^2}{500-3} + \frac{0.031^2}{500-4} + \frac{(-0.197)^2}{500-5}\right] = 71.39$$
The test statistics will both follow a $\chi^2$ distribution with 5 degrees of freedom
(the number of autocorrelation coefficients being used in the test). The critical
values are 11.07 and 15.09 at 5% and 1% respectively. Clearly, the null
hypothesis that the first 5 autocorrelation coefficients are jointly zero is
resoundingly rejected.
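The significance band and both portmanteau statistics can be computed together. In this sketch the negative signs on the lag 2 and lag 5 coefficients are read from the bracketing in the printed calculation and should be treated as an assumption; they affect neither statistic, because the coefficients are squared, nor the significance check, which uses absolute values.

```python
import math

acf = [0.307, -0.013, 0.086, 0.031, -0.197]  # lags 1 to 5
n_obs = 500

# 5% rule-of-thumb band for an individual coefficient.
band = 1.96 / math.sqrt(n_obs)
significant_lags = [k for k, r in enumerate(acf, start=1) if abs(r) > band]

# Box-Pierce Q and Ljung-Box Q* over the first five lags.
box_pierce = n_obs * sum(r ** 2 for r in acf)
ljung_box = n_obs * (n_obs + 2) * sum(
    r ** 2 / (n_obs - k) for k, r in enumerate(acf, start=1)
)

print(round(band, 3), significant_lags)            # 0.088 [1, 5]
print(round(box_pierce, 2), round(ljung_box, 2))   # 70.79 71.39
```

The lag 3 coefficient (0.086) falls just inside the band, which is why only lags 1 and 5 are flagged.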
(c) Setting aside the lag 5 autocorrelation coefficient, the pattern in the table is
for the autocorrelation coefficient to only be significant at lag 1 and then to fall
rapidly to values close to zero, while the partial autocorrelation coefficients
appear to fall much more slowly as the lag length increases. These
characteristics would lead us to think that an appropriate model for this series
is an MA(1). Of course, the autocorrelation coefficient at lag 5 is an anomaly
that does not fit in with the pattern of the rest of the coefficients. But such a
result would be typical of a real data series (as opposed to a simulated data
series that would have a much cleaner structure). This serves to illustrate that
when econometrics is used for the analysis of real data, the data generating
process was almost certainly not any of the models in the ARMA family. So all
we are trying to do is to find a model that best describes the features of the
data to hand. As one econometrician put it, all models are wrong, but some are
useful!
(d) Forecasts from this ARMA model would be produced in the usual way.
Using the same notation as above, and letting fz,1 denote the forecast for time
z+1 made for x at time z, etc:

Model A: MA(1)
$$f_{z,1} = 0.38 + 0.10 u_z = 0.38 + 0.10 \times (-0.02) = 0.378$$
$$f_{z,2} = f_{z,3} = f_{z,4} = 0.38$$
Note that the MA(1) model only has a memory of one period, so all forecasts
further than one step ahead will be equal to the intercept.
Model B: AR(2)
$$x_t = 0.63 + 0.17 x_{t-1} - 0.09 x_{t-2}$$
$$f_{z,1} = 0.63 + 0.17 \times 0.31 - 0.09 \times 0.02 = 0.681$$
$$f_{z,2} = 0.63 + 0.17 \times 0.681 - 0.09 \times 0.31 = 0.718$$
$$f_{z,3} = 0.63 + 0.17 \times 0.718 - 0.09 \times 0.681 = 0.690$$
$$f_{z,4} = 0.63 + 0.17 \times 0.690 - 0.09 \times 0.718 = 0.683$$
(e) The methods are overfitting and residual diagnostics. Overfitting involves
selecting a deliberately larger model than the proposed one, and examining
the statistical significances of the additional parameters. If the additional
parameters are statistically insignificant, then the originally postulated model
is deemed acceptable. The larger model would usually involve the addition of
one extra MA term and one extra AR term. Thus it would be sensible to try an
ARMA(1,2) in the context of Model A, and an ARMA(3,1) in the context of
Model B. Residual diagnostics would involve examining the acf and pacf of the
residuals from the estimated model. If the residuals showed any action, that
is, if any of the acf or pacf coefficients showed statistical significance, this
would suggest that the original model was inadequate. Residual diagnostics
in the Box-Jenkins sense of the term involved only examining the acf and pacf,
rather than the array of diagnostics considered in Chapter 4.
It is worth noting that these two model evaluation procedures would only
indicate a model that was too small. If the model were too large, i.e. it had
superfluous terms, these procedures would deem the model adequate.
(f) There are obviously several forecast accuracy measures that could be
employed, including MSE, MAE, and the percentage of correct sign
predictions. Assuming that MSE is used, the MSE for each model is
$$MSE(\text{Model A}) = \frac{1}{4}\left[(0.378 - 0.62)^2 + (0.38 - 0.19)^2 + (0.38 - (-0.32))^2 + (0.38 - 0.72)^2\right] = 0.175$$
$$MSE(\text{Model B}) = \frac{1}{4}\left[(0.681 - 0.62)^2 + (0.718 - 0.19)^2 + (0.690 - (-0.32))^2 + (0.683 - 0.72)^2\right] = 0.326$$
Therefore, since the mean squared error for Model A is smaller, it would be
concluded that the moving average model is the more accurate of the two in
this case.
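Parts (d) and (f) can be reproduced end to end with a short sketch. Several inputs are inferred from the printed arithmetic rather than stated outright in this answer, so treat them as assumptions: the last shock is taken as -0.02, the second AR coefficient as -0.09, and the third actual value as -0.32, since those are the values that reproduce 0.378, 0.681 and the two MSE figures. Function names are illustrative.

```python
def ma1_forecasts(const, ma, last_u, steps):
    """MA(1): only the 1-step forecast uses the last shock; the rest equal the intercept."""
    return [const + ma * last_u] + [const] * (steps - 1)


def ar2_forecasts(const, ar1, ar2, last_x, prev_x, steps):
    """AR(2): iterate forward, replacing unknown future values with their forecasts."""
    history = [prev_x, last_x]
    for _ in range(steps):
        history.append(const + ar1 * history[-1] + ar2 * history[-2])
    return history[2:]


def mse(forecasts, actuals):
    return sum((f - a) ** 2 for f, a in zip(forecasts, actuals)) / len(actuals)


actuals = [0.62, 0.19, -0.32, 0.72]  # third value inferred, see lead-in
model_a = ma1_forecasts(0.38, 0.10, last_u=-0.02, steps=4)
model_b = ar2_forecasts(0.63, 0.17, -0.09, last_x=0.31, prev_x=0.02, steps=4)

print([round(f, 3) for f in model_a], round(mse(model_a, actuals), 3))
print([round(f, 3) for f in model_b], round(mse(model_b, actuals), 3))
```

The AR(2) forecasts are iterated at full precision here, so the third forecast prints as 0.691 rather than the 0.690 obtained by hand from rounded inputs; the MSE ranking is unaffected.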

Chapter 6
1. (a) This is simple to accomplish in theory, but difficult in practice as a
result of the algebra. The original equations are (renumbering them (1),
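(2) and (3) for simplicity)
$$y_{1t} = \alpha_0 + \alpha_1 y_{2t} + \alpha_2 y_{3t} + \alpha_3 X_{1t} + \alpha_4 X_{2t} + u_{1t} \qquad (1)$$
$$y_{2t} = \beta_0 + \beta_1 y_{3t} + \beta_2 X_{1t} + \beta_3 X_{3t} + u_{2t} \qquad (2)$$
$$y_{3t} = \gamma_0 + \gamma_1 y_{1t} + \gamma_2 X_{2t} + \gamma_3 X_{3t} + u_{3t} \qquad (3)$$
The easiest place to start (I think) is to take equation (1), and substitute in for
y3t, to get
$$y_{1t} = \alpha_0 + \alpha_1 y_{2t} + \alpha_2(\gamma_0 + \gamma_1 y_{1t} + \gamma_2 X_{2t} + \gamma_3 X_{3t} + u_{3t}) + \alpha_3 X_{1t} + \alpha_4 X_{2t} + u_{1t}$$
Working out the products that arise when removing the brackets,
$$y_{1t} = \alpha_0 + \alpha_1 y_{2t} + \alpha_2\gamma_0 + \alpha_2\gamma_1 y_{1t} + \alpha_2\gamma_2 X_{2t} + \alpha_2\gamma_3 X_{3t} + \alpha_2 u_{3t} + \alpha_3 X_{1t} + \alpha_4 X_{2t} + u_{1t}$$
Gathering terms in y1t on the LHS:
$$y_{1t} - \alpha_2\gamma_1 y_{1t} = \alpha_0 + \alpha_1 y_{2t} + \alpha_2\gamma_0 + \alpha_2\gamma_2 X_{2t} + \alpha_2\gamma_3 X_{3t} + \alpha_2 u_{3t} + \alpha_3 X_{1t} + \alpha_4 X_{2t} + u_{1t}$$
$$y_{1t}(1 - \alpha_2\gamma_1) = \alpha_0 + \alpha_1 y_{2t} + \alpha_2\gamma_0 + \alpha_2\gamma_2 X_{2t} + \alpha_2\gamma_3 X_{3t} + \alpha_2 u_{3t} + \alpha_3 X_{1t} + \alpha_4 X_{2t} + u_{1t} \qquad (4)$$
Now substitute into (2) for y3t from (3).
$$y_{2t} = \beta_0 + \beta_1(\gamma_0 + \gamma_1 y_{1t} + \gamma_2 X_{2t} + \gamma_3 X_{3t} + u_{3t}) + \beta_2 X_{1t} + \beta_3 X_{3t} + u_{2t}$$
Removing the brackets
$$y_{2t} = \beta_0 + \beta_1\gamma_0 + \beta_1\gamma_1 y_{1t} + \beta_1\gamma_2 X_{2t} + \beta_1\gamma_3 X_{3t} + \beta_1 u_{3t} + \beta_2 X_{1t} + \beta_3 X_{3t} + u_{2t} \qquad (5)$$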
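Substituting into (4) for y2t from (5),
$$y_{1t}(1 - \alpha_2\gamma_1) = \alpha_0 + \alpha_1(\beta_0 + \beta_1\gamma_0 + \beta_1\gamma_1 y_{1t} + \beta_1\gamma_2 X_{2t} + \beta_1\gamma_3 X_{3t} + \beta_1 u_{3t} + \beta_2 X_{1t} + \beta_3 X_{3t} + u_{2t}) + \alpha_2\gamma_0 + \alpha_2\gamma_2 X_{2t} + \alpha_2\gamma_3 X_{3t} + \alpha_2 u_{3t} + \alpha_3 X_{1t} + \alpha_4 X_{2t} + u_{1t}$$
Taking the y1t terms to the LHS:
$$y_{1t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) = \alpha_0 + \alpha_1\beta_0 + \alpha_1\beta_1\gamma_0 + \alpha_1\beta_1\gamma_2 X_{2t} + \alpha_1\beta_1\gamma_3 X_{3t} + \alpha_1\beta_1 u_{3t} + \alpha_1\beta_2 X_{1t} + \alpha_1\beta_3 X_{3t} + \alpha_1 u_{2t} + \alpha_2\gamma_0 + \alpha_2\gamma_2 X_{2t} + \alpha_2\gamma_3 X_{3t} + \alpha_2 u_{3t} + \alpha_3 X_{1t} + \alpha_4 X_{2t} + u_{1t}$$
Gathering like-terms in the other variables together:
$$y_{1t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) = \alpha_0 + \alpha_1\beta_0 + \alpha_1\beta_1\gamma_0 + \alpha_2\gamma_0 + X_{1t}(\alpha_1\beta_2 + \alpha_3) + X_{2t}(\alpha_1\beta_1\gamma_2 + \alpha_2\gamma_2 + \alpha_4) + X_{3t}(\alpha_1\beta_1\gamma_3 + \alpha_1\beta_3 + \alpha_2\gamma_3) + u_{3t}(\alpha_1\beta_1 + \alpha_2) + \alpha_1 u_{2t} + u_{1t} \qquad (6)$$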
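Multiplying all through equation (3) by $(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1)$:
$$y_{3t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) = \gamma_0(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) + \gamma_1 y_{1t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) + \gamma_2 X_{2t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) + \gamma_3 X_{3t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) + u_{3t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) \qquad (7)$$
Replacing $y_{1t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1)$ in (7) with the RHS of (6),
$$y_{3t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) = \gamma_0(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) + \gamma_1\left[\alpha_0 + \alpha_1\beta_0 + \alpha_1\beta_1\gamma_0 + \alpha_2\gamma_0 + X_{1t}(\alpha_1\beta_2 + \alpha_3) + X_{2t}(\alpha_1\beta_1\gamma_2 + \alpha_2\gamma_2 + \alpha_4) + X_{3t}(\alpha_1\beta_1\gamma_3 + \alpha_1\beta_3 + \alpha_2\gamma_3) + u_{3t}(\alpha_1\beta_1 + \alpha_2) + \alpha_1 u_{2t} + u_{1t}\right] + \gamma_2 X_{2t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) + \gamma_3 X_{3t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) + u_{3t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) \qquad (8)$$
Expanding the brackets in equation (8) and cancelling the relevant terms
$$y_{3t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) = \gamma_0 + \gamma_1\alpha_0 + \gamma_1\alpha_1\beta_0 + X_{1t}(\alpha_1\beta_2\gamma_1 + \gamma_1\alpha_3) + X_{2t}(\gamma_2 + \gamma_1\alpha_4) + X_{3t}(\gamma_1\alpha_1\beta_3 + \gamma_3) + u_{3t} + \gamma_1\alpha_1 u_{2t} + \gamma_1 u_{1t} \qquad (9)$$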
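Multiplying all through equation (2) by $(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1)$:
$$y_{2t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) = \beta_0(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) + \beta_1 y_{3t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) + \beta_2 X_{1t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) + \beta_3 X_{3t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) + u_{2t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) \qquad (10)$$
Replacing $y_{3t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1)$ in (10) with the RHS of (9),
$$y_{2t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) = \beta_0(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) + \beta_1\left[\gamma_0 + \gamma_1\alpha_0 + \gamma_1\alpha_1\beta_0 + X_{1t}(\alpha_1\beta_2\gamma_1 + \gamma_1\alpha_3) + X_{2t}(\gamma_2 + \gamma_1\alpha_4) + X_{3t}(\gamma_3 + \gamma_1\alpha_1\beta_3) + u_{3t} + \gamma_1\alpha_1 u_{2t} + \gamma_1 u_{1t}\right] + \beta_2 X_{1t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) + \beta_3 X_{3t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) + u_{2t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) \qquad (11)$$
Expanding the brackets in (11) and cancelling the relevant terms
$$y_{2t}(1 - \alpha_2\gamma_1 - \alpha_1\beta_1\gamma_1) = \beta_0 - \beta_0\alpha_2\gamma_1 + \beta_1\gamma_0 + \beta_1\gamma_1\alpha_0 + X_{1t}(\beta_1\gamma_1\alpha_3 + \beta_2 - \beta_2\alpha_2\gamma_1) + X_{2t}(\beta_1\gamma_2 + \beta_1\gamma_1\alpha_4) + X_{3t}(\beta_1\gamma_3 + \beta_3 - \beta_3\alpha_2\gamma_1) + \beta_1 u_{3t} + u_{2t}(1 - \alpha_2\gamma_1) + \beta_1\gamma_1 u_{1t} \qquad (12)$$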

Although it might not look like it (!), equations (6), (12), and (9) respectively
will give the reduced form equations corresponding to (1), (2), and (3), by
doing the necessary division to make y1t, y2t, or y3t the subject of the formula.
From (6),
0 1 0 1 1 0 2 0 (1 2 3 ) ( 2 2 4 )
y1t X 1t 1 1 2 X 2t
(1 2 1 1 1 1 ) (1 2 1 1 1 1 ) (1 2 1 1 1 1 )
(1 1 3 1 3 2 3 ) u ( 2 ) 1u2 t u1t
X 3t 3t 1 1
(1 2 1 1 1 1 ) (1 2 1 1 1 1 )
(13)
From (12),
0 02 1 1 01 10 ( 1 1 3 2 22 1 ) ( 1 2 1 14 )
y2 t X1t X
(1 1 11 12 ) (1 1 11 12 ) (1 1 11 12 ) 2 t
( 1 3 3 32 1 ) u u (1 2 1 ) 1 1u1t
X 3t 1 3t 2 t
(1 1 11 12 ) (1 1 11 12 )
(14)
From (9),
0 10 11 0 (1 2 1 1 3 ) ( 2 1 4 )
y 3t X 1t X
(1 2 1 11 1 ) (1 2 1 11 1 ) (1 2 1 11 1 ) 2 t
(15)
( 11 3 3 ) u 11u2 t 1u1t
X 3t 3t
(1 2 1 11 1 ) (1 2 1 11 1 )
Notice that all of the reduced form equations (13)-(15) in this case depend on
all of the exogenous variables, which is not always the case, and that the
equations contain only exogenous variables on the RHS, which must be the
case for these to be reduced forms.
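Algebra of this length is easy to get wrong, so a numerical check is worthwhile. The sketch below transcribes the three reduced forms, with a, b and g standing in for the alpha, beta and gamma coefficients, and confirms that the resulting values of y1, y2 and y3 satisfy the structural equations (1)-(3). All coefficient, exogenous-variable and disturbance values are arbitrary hypothetical numbers chosen only for the check.

```python
def reduced_form(a, b, g, X, u):
    """Evaluate the reduced forms for y1, y2, y3 (a, b, g = alpha, beta, gamma)."""
    a0, a1, a2, a3, a4 = a
    b0, b1, b2, b3 = b
    g0, g1, g2, g3 = g
    X1, X2, X3 = X
    u1, u2, u3 = u
    d = 1 - a2 * g1 - a1 * b1 * g1  # common denominator
    y1 = (a0 + a1 * b0 + a1 * b1 * g0 + a2 * g0 + X1 * (a1 * b2 + a3)
          + X2 * (a1 * b1 * g2 + a2 * g2 + a4)
          + X3 * (a1 * b1 * g3 + a1 * b3 + a2 * g3)
          + u3 * (a1 * b1 + a2) + a1 * u2 + u1) / d
    y2 = (b0 - b0 * a2 * g1 + b1 * g0 + b1 * g1 * a0
          + X1 * (b1 * g1 * a3 + b2 - b2 * a2 * g1)
          + X2 * (b1 * g2 + b1 * g1 * a4)
          + X3 * (b1 * g3 + b3 - b3 * a2 * g1)
          + b1 * u3 + u2 * (1 - a2 * g1) + b1 * g1 * u1) / d
    y3 = (g0 + g1 * a0 + g1 * a1 * b0 + X1 * (a1 * b2 * g1 + g1 * a3)
          + X2 * (g2 + g1 * a4) + X3 * (g1 * a1 * b3 + g3)
          + u3 + g1 * a1 * u2 + g1 * u1) / d
    return y1, y2, y3


# Hypothetical values, used only to test the algebra.
a = (0.5, 0.3, -0.2, 1.0, 0.7)
b = (0.1, 0.4, -0.6, 0.9)
g = (-0.3, 0.25, 0.8, -0.5)
X = (1.5, -2.0, 0.75)
u = (0.05, -0.1, 0.2)

y1, y2, y3 = reduced_form(a, b, g, X, u)

# Residuals of the structural equations; each should be zero up to rounding.
res1 = y1 - (a[0] + a[1] * y2 + a[2] * y3 + a[3] * X[0] + a[4] * X[1] + u[0])
res2 = y2 - (b[0] + b[1] * y3 + b[2] * X[0] + b[3] * X[2] + u[1])
res3 = y3 - (g[0] + g[1] * y1 + g[2] * X[1] + g[3] * X[2] + u[2])
print(max(abs(res1), abs(res2), abs(res3)) < 1e-12)
```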
(b) The term "identification" refers to whether or not it is in fact possible to
obtain the structural form coefficients (the $\alpha$s, $\beta$s and $\gamma$s in equations (1)-(3))
from the reduced form coefficients (the $\pi$s) by substitution. An equation can
be over-identified, just-identified, or under-identified, and the equations in a
system can have differing orders of identification. If an equation is under-
identified (or not identified), then we cannot obtain the structural form
coefficients from the reduced forms using any technique. If it is just identified,
we can obtain unique structural form estimates by back-substitution, while if
it is over-identified, we cannot obtain unique structural form estimates by
substituting from the reduced forms.
There are two rules for determining the degree of identification of an
equation: the rank condition, and the order condition. The rank condition is a
necessary and sufficient condition for identification, so if the rule is satisfied,
it guarantees that the equation is indeed identified. The rule centres around a
restriction on the rank of a sub-matrix containing the reduced form
coefficients, and is rather complex and not particularly illuminating, and was
therefore not covered in this course.
The order condition can be expressed in a number of ways, one of which is the
following. Let G denote the number of structural equations (equal to the
number of endogenous variables). An equation is just identified if G-1
variables are absent. If more than G-1 are absent, then the equation is over-
identified, while if fewer are absent, then it is not identified.
Applying this rule to equations (1)-(3), G=3, so for an equation to be just
identified, we require 2 variables to be absent. The variables in the system are
y1, y2, y3, X1, X2, X3. Is this the case?
Equation (1): X3t only is missing, so the equation is not identified.
Equation (2): y1t and X2t are missing, so the equation is just identified.
Equation (3): y2t and X1t are missing, so the equation is just identified.
However, the order condition is only a necessary (and not a sufficient)
condition for identification, so there will exist cases where a given equation
satisfies the order condition, but we still cannot obtain the structural form
coefficients. Fortunately, for small systems this is rarely the case. Also, in
practice, most systems are designed to contain equations that are over-
identified.
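The counting rule above can be sketched in a few lines of code. This is only an illustration: the function name `order_condition` and the dictionary layout are hypothetical, and the variable lists are inferred from the absences stated in the answer.

```python
def order_condition(equations, all_variables):
    """Classify each equation by the order condition.

    equations: dict mapping an equation label to the set of variables that
               appear in it (dependent variable included).
    all_variables: every endogenous and exogenous variable in the system.
    """
    g = len(equations)  # G = number of structural equations
    result = {}
    for label, present in equations.items():
        absent = len(set(all_variables) - set(present))
        if absent == g - 1:
            result[label] = "just identified"
        elif absent > g - 1:
            result[label] = "over-identified"
        else:
            result[label] = "not identified"
    return result


variables = ["y1", "y2", "y3", "X1", "X2", "X3"]
system = {
    "(1)": {"y1", "y2", "y3", "X1", "X2"},  # only X3 absent
    "(2)": {"y2", "y3", "X1", "X3"},        # y1 and X2 absent
    "(3)": {"y3", "y1", "X2", "X3"},        # y2 and X1 absent
}
classification = order_condition(system, variables)
print(classification)
```

Because the order condition is only necessary, a result of "just identified" or "over-identified" here would still need to be confirmed by the rank condition.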
(c) It was stated in Chapter 4 that omitting a relevant variable from a
regression equation would lead to an omitted variable bias (in fact an
inconsistency as well), while including an irrelevant variable would lead to
unbiased but inefficient coefficient estimates. There is a direct analogy with
the simultaneous equations case. Treating a variable as exogenous when it really
should be endogenous because there is some feedback, will result in biased
and inconsistent parameter estimates. On the other hand, treating a variable
as endogenous when it really should be exogenous (that is, having an equation
for the variable and then substituting the fitted value from the reduced form if
2SLS is used, rather than just using the actual value of the variable) would
result in unbiased but inefficient coefficient estimates.
If we take the view that consistency and unbiasedness are more important than
efficiency (which is the view that I think most econometricians would take),
this implies that treating an endogenous variable as exogenous represents the
more severe mis-specification. So if in doubt, include an equation for it!
(Although, of course, we can test for exogeneity using a Hausman-type test).
(d) A tempting response to the question might be to describe indirect least
squares (ILS), that is estimating the reduced form equations by OLS and then
substituting back to get the structural forms; however, this response would be
WRONG, since the question tells us that the system is over-identified.
A correct answer would be to describe either two stage least squares (2SLS) or
instrumental variables (IV). Either would be acceptable, although IV requires
the user to determine an appropriate set of instruments and hence 2SLS is
simpler in practice. 2SLS involves estimating the reduced form equations, and
obtaining the fitted values in the first stage. In the second stage, the structural
form equations are estimated, but replacing the endogenous variables on the
RHS with their stage one fitted values. Application of this technique will yield
unique and consistent structural form coefficients.
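The two stages can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the two-equation system, its parameter values and all variable names are hypothetical (they are not the system in the question), chosen so that the first equation excludes two exogenous variables and is therefore over-identified.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x1, x2, x3, u1, u2 = rng.standard_normal((5, n))

# Hypothetical structural system:
#   y1 = a0 + a1*y2 + a2*x1 + u1              (excludes x2 and x3)
#   y2 = b0 + b1*y1 + b2*x2 + b3*x3 + u2
a0, a1, a2 = 1.0, 0.5, 1.0
b0, b1, b2, b3 = 2.0, 0.4, 1.0, 1.0

# Generate the data from the reduced form for y1, then y2 from its equation.
y1 = (a0 + a1 * b0 + a2 * x1 + a1 * b2 * x2 + a1 * b3 * x3 + u1 + a1 * u2) / (1 - a1 * b1)
y2 = b0 + b1 * y1 + b2 * x2 + b3 * x3 + u2


def ols(X, y):
    """Least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


const = np.ones(n)

# Stage 1: regress the endogenous RHS variable on all exogenous variables.
Z = np.column_stack([const, x1, x2, x3])
y2_hat = Z @ ols(Z, y2)

# Stage 2: estimate the structural equation with y2 replaced by its fitted value.
beta_2sls = ols(np.column_stack([const, y2_hat, x1]), y1)

# For comparison: OLS applied directly to the structural equation.
beta_ols = ols(np.column_stack([const, y2, x1]), y1)

print("true a1:", a1, "2SLS:", beta_2sls[1], "OLS:", beta_ols[1])
```

In this design the OLS estimate of the coefficient on y2 should sit noticeably above its true value, while the 2SLS estimate should be close to it. Note that standard errors taken naively from the second-stage regression are not the correct 2SLS standard errors.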
2. (a) A glance at equations (6.97) and (6.98) reveals that the dependent
variable in (6.97) appears as an explanatory variable in (6.98) and that the
dependent variable in (6.98) appears as an explanatory variable in (6.97). The
result is that it would be possible to show that the explanatory variable y2t in
(6.97) will be correlated with the error term in that equation, u1t, and that the
explanatory variable y1t in (6.98) will be correlated with the error term in that
equation, u2t. Thus, there is causality from y1t to y2t and from y2t to y1t, so that
this is a simultaneous equations system. If OLS were applied separately to
each of equations (6.97) and (6.98), the result would be biased and
inconsistent parameter estimates. That is, even with an infinitely large
number of observations, OLS could not be relied upon to deliver the
appropriate parameter estimates.
(b) If the variable y1t had not appeared on the RHS of equation (6.98), this
would no longer be a simultaneous system, but would instead be an example
of a triangular system (see question 3). Thus it would be valid to apply OLS
separately to each of the equations (6.97) and (6.98).
(c) The order condition for determining whether an equation from a
simultaneous system is identified was described in question 1, part (b). There
are 2 equations in the system of (6.97) and (6.98), so that only 1 variable
would have to be missing from an equation to make it just identified. If no
variables are absent, the equation would not be identified, while if more than
one were missing, the equation would be over-identified. Considering
equation (6.97), no variables are missing so that this equation is not identified,
while equation (6.98) excludes only variable X2t, so that it is just identified.
(d) Since equation (6.97) is not identified, no method could be used to obtain
estimates of the parameters of this equation, while either ILS or 2SLS could be
used to obtain estimates of the parameters of (6.98), since it is just identified.
ILS operates by obtaining and estimating the reduced form equations and
then obtaining the structural parameters of (6.98) by algebraic back-
substitution. 2SLS involves again obtaining and estimating the reduced form
equations, and then estimating the structural equations but replacing the
endogenous variables on the RHS of (6.97) and (6.98) with their reduced form
fitted values.
Comparing between ILS and 2SLS, the former method only requires one set of
estimations rather than two, but this is about its only advantage, and
conducting a second stage OLS estimation is usually a computationally trivial
exercise. The primary disadvantage of ILS is that it is only applicable to just
identified equations, whereas many sets of equations that we may wish to
estimate are over-identified. Second, obtaining the structural form coefficients
via algebraic substitution can be a very tedious exercise in the context of large
systems (as the solution to question 1, part (a) shows!).

(e) The Hausman procedure works by first obtaining and estimating the
reduced form equations, and then estimating the structural form equations
separately using OLS, but also adding the fitted values from the reduced form
estimations as additional explanatory variables in the equations where those
variables appear as endogenous RHS variables. Thus, if the reduced form
fitted values corresponding to equations (6.97) and (6.98) are given by $\hat{y}_{1t}$ and
$\hat{y}_{2t}$ respectively, the Hausman test equations would be
$$y_{1t} = \alpha_0 + \alpha_1 y_{2t} + \alpha_2 X_{1t} + \alpha_3 X_{2t} + \alpha_4 \hat{y}_{2t} + u'_{1t}$$
$$y_{2t} = \beta_0 + \beta_1 y_{1t} + \beta_2 X_{1t} + \beta_3 \hat{y}_{1t} + u'_{2t}$$
Separate tests of the significance of the $\hat{y}_{1t}$ and $\hat{y}_{2t}$ terms would then be
performed. If it were concluded that they were both significant, this would
imply that additional explanatory power can be obtained by treating the
variables as endogenous.
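The mechanics of the procedure can be sketched on simulated data. The parameter values, variable names and the helper `ols_with_t` below are assumptions for illustration only. The sketch runs the test regression for the second equation alone, whose fitted regressor depends on the excluded variable X2.

```python
import numpy as np


def ols_with_t(X, y):
    """OLS coefficients and their conventional t-ratios."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, beta / se


rng = np.random.default_rng(0)
n = 5000
x1, x2, u1, u2 = rng.standard_normal((4, n))

# Hypothetical parameter values for a system shaped like (6.97)-(6.98):
#   y1 = a0 + a1*y2 + a2*x1 + a3*x2 + u1
#   y2 = b0 + b1*y1 + b2*x1 + u2
a0, a1, a2, a3 = 1.0, 0.5, 1.0, 1.0
b0, b1, b2 = 2.0, 0.4, 1.0
y1 = (a0 + a1 * b0 + (a2 + a1 * b2) * x1 + a3 * x2 + u1 + a1 * u2) / (1 - a1 * b1)
y2 = b0 + b1 * y1 + b2 * x1 + u2

const = np.ones(n)

# Reduced form for y1: regress on all exogenous variables, keep fitted values.
Z = np.column_stack([const, x1, x2])
y1_hat = Z @ ols_with_t(Z, y1)[0]

# Test regression: the y2 equation plus the fitted value of y1.
beta, t_ratios = ols_with_t(np.column_stack([const, y1, x1, y1_hat]), y2)
t_fitted = t_ratios[3]
print("t-ratio on fitted y1:", t_fitted)
```

A large absolute t-ratio on the fitted value indicates that y1 should be treated as endogenous in the y2 equation, which is true by construction in this simulation.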
3. An example of a triangular system was given in Section 6.7. Consider a
scenario where there are only two endogenous variables. The key distinction
between this and a fully simultaneous system is that in the case of a triangular
system, causality runs only in one direction, whereas for a simultaneous
equation, it would run in both directions. Thus, to give an example, for the
system to be triangular, y1 could appear in the equation for y2 and not vice
versa. For the simultaneous system, y1 would appear in the equation for y2,
and y2 would appear in the equation for y1.
4. (a) p=2 and k=3 implies that there are two variables in the system, and that
both equations have three lags of the two variables. The VAR can be written in
long-hand form as:
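$$y_{1t} = \alpha_{10} + \alpha_{111} y_{1t-1} + \alpha_{211} y_{2t-1} + \alpha_{112} y_{1t-2} + \alpha_{212} y_{2t-2} + \alpha_{113} y_{1t-3} + \alpha_{213} y_{2t-3} + u_{1t}$$
$$y_{2t} = \alpha_{20} + \alpha_{121} y_{1t-1} + \alpha_{221} y_{2t-1} + \alpha_{122} y_{1t-2} + \alpha_{222} y_{2t-2} + \alpha_{123} y_{1t-3} + \alpha_{223} y_{2t-3} + u_{2t}$$
where $\alpha_0 = \begin{pmatrix} \alpha_{10} \\ \alpha_{20} \end{pmatrix}$, $y_t = \begin{pmatrix} y_{1t} \\ y_{2t} \end{pmatrix}$, $u_t = \begin{pmatrix} u_{1t} \\ u_{2t} \end{pmatrix}$, and the coefficients on the lags of $y_t$
are defined as follows: $\alpha_{ijk}$ refers to the kth lag of the ith variable in the jth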
equation. This seems like a natural notation to use, although of course any
sensible alternative would also be correct.
(b) This is basically a "what are the advantages of VARs compared with
structural models?" type question, to which a simple and effective response
would be to list and explain the points made in the book.
The most important point is that structural models require the researcher to
specify some variables as being exogenous (if all variables were endogenous,
then none of the equations would be identified, and therefore estimation of
the structural equations would be impossible). This can be viewed as a
restriction (a restriction that the exogenous variables do not have any
simultaneous equations feedback), often called an identifying restriction.
Determining what are the identifying restrictions is supposed to be based on
economic or financial theory, but Sims, who first proposed the VAR
methodology, argued that such restrictions were incredible. He thought that
they were too loosely based on theory, and were often specified by researchers
on the basis of giving the restrictions that the models required to make the
equations identified. Under a VAR, all the variables have equations, and so in
a sense, every variable is endogenous, which takes the ability to cheat (either
deliberately or inadvertently) or to mis-specify the model in this way, out of
the hands of the researcher.
Another possible reason why VARs are popular in the academic literature is
that standard form VARs can be estimated using OLS since all of the lags on
the RHS are counted as pre-determined variables.
Further, a glance at the academic literature which has sought to compare the
forecasting accuracies of structural models with VARs, reveals that VARs seem
to be rather better at forecasting (perhaps because the identifying restrictions
are not valid). Thus, from a purely pragmatic point of view, researchers may
prefer VARs if the purpose of the modelling exercise is to produce precise
point forecasts.
(c) VARs have, of course, also been subject to criticisms. The most important
of these criticisms is that VARs are atheoretical. In other words, they use very
little information from economic or financial theory to guide the model
specification process. The result is that the models often have little or no
theoretical interpretation, so that they are of limited use for testing and
evaluating theories.
Second, VARs can often contain a lot of parameters. The resulting loss in
degrees of freedom if the VAR is unrestricted and contains a lot of lags, could
lead to a loss of efficiency and the inclusion of lots of irrelevant or marginally
relevant terms. Third, it is not clear how the VAR lag lengths should be
chosen. Different methods are available (see part (d) of this question), but
they could lead to widely differing answers.
Finally, the very tools that have been proposed to help to obtain useful
information from VARs, i.e. impulse responses and variance decompositions,
are themselves difficult to interpret! See Runkle (1987).
(d) The two methods that we have examined are model restrictions and
information criteria. Details on how these work are contained in Sections
6.12.4 and 6.12.5. But briefly, the model restrictions approach involves
starting with the larger of the two models and testing whether it can be
restricted down to the smaller one using the likelihood ratio test based on the
determinants of the variance-covariance matrices of residuals in each case.
The alternative approach would be to examine the value of various
information criteria and to select the model that minimises the criteria. Since
there are only two models to compare, either technique could be used. The
restriction approach assumes normality for the VAR error terms, while use of
the information criteria does not. On the other hand, the information criteria
can lead to quite different answers depending on which criterion is used and
the severity of its penalty term. A completely different approach would be to
put the VARs in the situation that they were intended for (e.g. forecasting,
making trading profits, determining a hedge ratio etc.), and see which one
does best in practice!
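Both approaches can be sketched with plain least squares. The example below is illustrative only: it simulates a bivariate VAR(1) as stand-in data, compares 1 lag against 3 lags, and uses the likelihood ratio statistic T[ln|Σr| - ln|Σu|] together with a multivariate AIC of the form ln|Σ| + 2k'/T, where k' is the total number of regressors in all equations. The function names and the simulated coefficients are assumptions.

```python
import numpy as np


def fit_var(y, k, max_lag):
    """OLS fit of a VAR(k) using observations max_lag..N-1 (a common sample).

    Returns the residual covariance matrix (divisor T) and T.
    """
    n_obs = len(y)
    T = n_obs - max_lag
    lags = [y[max_lag - j:n_obs - j] for j in range(1, k + 1)]
    X = np.column_stack([np.ones(T)] + lags)
    Y = y[max_lag:]
    coef = np.linalg.lstsq(X, Y, rcond=None)[0]
    resid = Y - X @ coef
    return resid.T @ resid / T, T


def log_det(sigma):
    """Natural log of the determinant of a covariance matrix."""
    return np.linalg.slogdet(sigma)[1]


# Stand-in data: a simulated bivariate VAR(1) with hypothetical coefficients.
rng = np.random.default_rng(7)
A = np.array([[0.5, 0.1], [0.2, 0.3]])
y = np.zeros((500, 2))
for t in range(1, 500):
    y[t] = A @ y[t - 1] + rng.standard_normal(2)

p, k_small, k_large = 2, 1, 3
sigma_r, T = fit_var(y, k_small, k_large)  # restricted model
sigma_u, _ = fit_var(y, k_large, k_large)  # unrestricted model

# Likelihood ratio statistic; chi-squared with df equal to the number of
# coefficients set to zero by the restriction.
lr = T * (log_det(sigma_r) - log_det(sigma_u))
df = p * p * (k_large - k_small)


def maic(sigma, k):
    """Multivariate AIC: ln|Sigma| + 2k'/T with k' = p(1 + pk)."""
    return log_det(sigma) + 2 * p * (1 + p * k) / T


print("LR:", lr, "df:", df)
print("MAIC k=1:", maic(sigma_r, k_small), "MAIC k=3:", maic(sigma_u, k_large))
```

The LR statistic would be compared with a chi-squared critical value with `df` degrees of freedom, while the information criterion approach simply picks the lag length with the smaller criterion value.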
Chapter 7
1. (a) Many series in finance and economics in their levels (or log-levels) forms
are non-stationary and exhibit stochastic trends. They have a tendency not to
revert to a mean level, but they wander for prolonged periods in one
direction or the other. Examples would be most kinds of asset or goods prices,
GDP, unemployment, money supply, etc. Such variables can usually be made
stationary by transforming them into their differences or by constructing
percentage changes of them.
(b) Non-stationarity can be an important determinant of the properties of a
series. Also, if two series are non-stationary, we may experience the problem
of spurious regression. This occurs when we regress one non-stationary
variable on a completely unrelated non-stationary variable, yet obtain a
reasonably high value of $R^2$, apparently indicating that the model fits well.
Most importantly therefore, we are not able to perform any hypothesis tests in
models which inappropriately use non-stationary data since the test statistics
will no longer follow the distributions which we assumed they would (e.g. a t
or F), so any inferences we make are likely to be invalid.
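The spurious regression problem is easy to reproduce by simulation. The following sketch (sample size, number of replications and names are arbitrary choices) regresses one random walk on an independent random walk many times and counts how often a conventional t-test at the 5% level "finds" a relationship, compared with the same exercise on stationary white noise.

```python
import numpy as np


def slope_t_ratio(x, y):
    """t-ratio on the slope in an OLS regression of y on a constant and x."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1] / se


rng = np.random.default_rng(42)
T, reps = 200, 500
walk_rejections = 0
noise_rejections = 0
for _ in range(reps):
    e1, e2 = rng.standard_normal((2, T))
    # Two independent random walks (non-stationary, unrelated by construction).
    walk_rejections += int(abs(slope_t_ratio(np.cumsum(e1), np.cumsum(e2))) > 1.96)
    # Two independent white noise series (stationary).
    noise_rejections += int(abs(slope_t_ratio(e1, e2)) > 1.96)

walk_rate = walk_rejections / reps
noise_rate = noise_rejections / reps
print("rejection rate, random walks:", walk_rate)
print("rejection rate, white noise: ", noise_rate)
```

With stationary data the rejection rate should be near the nominal 5%, whereas with the random walks it is far higher, which is exactly why the usual t and F distributions cannot be used.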
(c) A weakly stationary process was defined in Chapter 5, and has the
following characteristics:
1. $E(y_t) = \mu$
2. $E(y_t - \mu)(y_t - \mu) = \sigma^2 < \infty$
3. $E(y_{t_1} - \mu)(y_{t_2} - \mu) = \gamma_{t_2 - t_1} \quad \forall \; t_1, t_2$
That is, a stationary process has a constant mean, a constant variance, and a
constant covariance structure. A strictly stationary process could be defined by
an equation such as
$$F_{x_{t_1}, x_{t_2}, \ldots, x_{t_T}}(x_1, \ldots, x_T) = F_{x_{t_1+k}, x_{t_2+k}, \ldots, x_{t_T+k}}(x_1, \ldots, x_T)$$
for any $t_1, t_2, \ldots, t_T \in \mathbb{Z}$, any $k \in \mathbb{Z}$ and $T = 1, 2, \ldots$, and where $F$ denotes the
joint distribution function of the set of random variables. It should be evident
from the definitions of weak and strict stationarity that the latter is a stronger
definition and is a special case of the former. In the former case, only the first
two moments of the distribution have to be constant (i.e. the mean,
variances and covariances), whilst in the latter case, all moments of the
distribution (i.e. the whole of the probability distribution) have to be constant.

Both weakly stationary and strictly stationary processes will cross their mean
value frequently and will not wander a long way from that mean value.
An example of a deterministic trend process was given in Figure 7.5. Such a
process will have random variations about a linear (usually upward) trend. An
expression for a deterministic trend process $y_t$ could be
$$y_t = \alpha + \beta t + u_t$$
where t = 1, 2, ..., is the trend and $u_t$ is a zero mean white noise disturbance
term. This is called deterministic non-stationarity because the source of the
non-stationarity is a deterministic straight line process.
A variable containing a stochastic trend will also not cross its mean value
frequently and will wander a long way from its mean value. A stochastically
non-stationary process could be a unit root or explosive autoregressive process
such as
$$y_t = \phi y_{t-1} + u_t$$
where $\phi \geq 1$.
2. (a) The null hypothesis is of a unit root against a one sided stationary
alternative, i.e. we have
H0: $y_t \sim I(1)$
H1: $y_t \sim I(0)$
which is also equivalent to
H0: $\psi = 0$
H1: $\psi < 0$
(b) The test statistic is given by $\hat{\psi} / SE(\hat{\psi})$, which equals -0.02 / 0.31 = -0.06.
Since this is not more negative than the appropriate critical value, we do not
reject the null hypothesis.
(c) We therefore conclude that there is at least one unit root in the series
(there could be 1, 2, 3 or more). What we would do now is to regress $\Delta^2 y_t$ on
$\Delta y_{t-1}$ and test if there is a further unit root. The null and alternative hypotheses
would now be
H0: $\Delta y_t \sim I(1)$, i.e. $y_t \sim I(2)$
H1: $\Delta y_t \sim I(0)$, i.e. $y_t \sim I(1)$
If we rejected the null hypothesis, we would therefore conclude that the first
differences are stationary, and hence the original series was I(1). If we did not
reject at this stage, we would conclude that yt must be at least I(2), and we
would have to test again until we rejected.

(d) We cannot compare the test statistic with that from a t-distribution since
we have non-stationarity under the null hypothesis and hence the test statistic
will no longer follow a t-distribution.
3. Using the same regression as above, but on a different set of data, the
researcher now obtains the estimate $\hat{\psi}$ = -0.52 with standard error = 0.16.
(a) The test statistic is calculated as above. The value of the test statistic =
-0.52 / 0.16 = -3.25. We therefore reject the null hypothesis since the test
statistic is smaller (more negative) than the critical value.
(b) We conclude that the series is stationary since we reject the unit root null
hypothesis. We need do no further tests since we have already rejected.
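The arithmetic of the two Dickey-Fuller decisions (this question and question 2) can be checked with a short sketch. The critical value used below, -2.86, is an assumption: it is the standard large-sample 5% Dickey-Fuller value for a regression with a constant but no trend, and the value supplied with the question may differ slightly. The function name is hypothetical.

```python
def df_test(psi_hat, se, critical_value):
    """Return the DF test statistic and whether the unit root null is rejected.

    The null is rejected only if the statistic is more negative than the
    critical value (a one sided test).
    """
    stat = psi_hat / se
    return stat, stat < critical_value


CV_5PCT = -2.86  # assumed critical value, see the note above

stat_q2, reject_q2 = df_test(-0.02, 0.31, CV_5PCT)
stat_q3, reject_q3 = df_test(-0.52, 0.16, CV_5PCT)
print(round(stat_q2, 2), reject_q2)
print(round(stat_q3, 2), reject_q3)
```

The first statistic is nowhere near the critical value, so the unit root null is not rejected. The second is more negative than it, so the null is rejected.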
(c) The researcher is correct. One possible source of non-whiteness is when
the errors are autocorrelated. This will occur if there is autocorrelation in the
original dependent variable in the regression (yt). In practice, we can easily
get around this by augmenting the test with lags of the dependent variable to
soak up the autocorrelation. The appropriate number of lags can be
determined using the information criteria.
4. (a) If two or more series are cointegrated, in intuitive terms this implies that
they have a long run equilibrium relationship that they may deviate from in
the short run, but which will always be returned to in the long run. In the
context of spot and futures prices, the fact that these are essentially prices of
the same asset but with different delivery and payment dates, means that
financial theory would suggest that they should be cointegrated. If they were
not cointegrated, this would imply that the series did not contain a common
stochastic trend and that they could therefore wander apart without bound
even in the long run. If the spot and futures prices for a given asset did
separate from one another, market forces would work to bring them back to
follow their long run relationship given by the cost of carry formula.
The Engle-Granger approach to cointegration involves first ensuring that the
variables are individually unit root processes (note that the test is often
conducted on the logs of the spot and of the futures prices rather than on the
price series themselves). Then a regression would be conducted of one of the
series on the other (i.e. regressing spot on futures prices or futures on spot
prices) and the residuals from that regression collected.
These residuals would then be subjected to a Dickey-Fuller or augmented
Dickey-Fuller test. If the null hypothesis of a unit root in the DF test
regression residuals is not rejected, it would be concluded that a stationary
combination of the non-stationary variables has not been found and thus that
there is no cointegration. On the other hand, if the null is rejected, it would be
concluded that a stationary combination of the non-stationary variables has
been found and thus that the variables are cointegrated.
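The residual-based test can be sketched on simulated data. Everything below is illustrative: the "spot" and "futures" series are artificial, constructed to be cointegrated, and the 5% critical value of -3.34 is an assumed large-sample Engle-Granger value for two variables with a constant in the cointegrating regression (ordinary Dickey-Fuller critical values are not appropriate for estimated residuals).

```python
import numpy as np


def df_stat(series):
    """t-ratio on psi in the regression diff(series)_t = psi * series_{t-1} + v_t."""
    dy, lag = np.diff(series), series[:-1]
    psi = (lag @ dy) / (lag @ lag)
    resid = dy - psi * lag
    se = np.sqrt(resid @ resid / (len(dy) - 1) / (lag @ lag))
    return psi / se


rng = np.random.default_rng(1)
T = 500

# Artificial futures price: a random walk. Artificial spot price: futures plus
# a stationary AR(1) deviation, so the two series share one stochastic trend.
futures = np.cumsum(rng.standard_normal(T))
shocks = rng.standard_normal(T)
deviation = np.zeros(T)
for t in range(1, T):
    deviation[t] = 0.5 * deviation[t - 1] + shocks[t]
spot = 0.5 + futures + deviation

# Step 1: cointegrating regression of spot on a constant and futures.
X = np.column_stack([np.ones(T), futures])
resid = spot - X @ np.linalg.lstsq(X, spot, rcond=None)[0]

# Step 2: Dickey-Fuller regression on the residuals.
stat = df_stat(resid)
EG_CV_5PCT = -3.34  # assumed critical value, see the note above
cointegrated = stat < EG_CV_5PCT
print("DF statistic on residuals:", stat, "cointegrated:", cointegrated)
```

In practice the DF regression would usually be augmented with lags of the differenced residuals to allow for autocorrelation.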
Forming an error correction model (ECM) following the Engle-Granger
approach is a 2-stage process. The first stage is (assuming that the original
series are non-stationary) to determine whether the variables are cointegrated.

If they are not, obviously there would be no sense in forming an ECM, and the
appropriate response would be to form a model in first differences only. If the
variables are cointegrated, the second stage of the process involves forming
the error correction model which, in the context of spot and futures prices,
could be of the form given in equation (7.57) on page 345.
(b) There are many other examples that one could draw from financial or
economic theory of situations where cointegration would be expected to be
present and where its absence could imply a permanent disequilibrium. It is
usually the presence of market forces and investors continually looking for
arbitrage opportunities that would lead us to expect cointegration to exist.
Good illustrations include equity prices and dividends, or price levels in a set
of countries and the exchange rates between them. The latter is embodied in
the purchasing power parity (PPP) theory, which suggests that a
representative basket of goods and services should, when converted into a
common currency, cost the same wherever in the world it is purchased. In the
context of PPP, one may expect cointegration since again, its absence would
imply that relative prices and the exchange rate could wander apart without
bound in the long run. This would imply that the general price of goods and
services in one country could get permanently out of line with those, when
converted into a common currency, of other countries. This would not be
expected to happen since people would spot a profitable opportunity to buy
the goods in one country where they were cheaper and to sell them in the
country where they were more expensive until the prices were forced back into
line. There is some evidence against PPP, however, and one explanation is that
transactions costs including transportation costs, currency conversion costs,
differential tax rates and restrictions on imports, stop full adjustment from
taking place. Services are also much less portable than goods and everybody
knows that everything costs twice as much in the UK as anywhere else in the
world.
5. (a) The Johansen test is computed in the following way. Suppose we have p
variables that we think might be cointegrated. First, ensure that all the
variables are of the same order of non-stationarity, and in fact are I(1), since it is
very unlikely that variables will be of a higher order of integration. Stack the
variables that are to be tested for cointegration into a p-dimensional vector,
called, say, $y_t$. Then construct a $p \times 1$ vector of first differences, $\Delta y_t$, and form
and estimate the following VAR
$$\Delta y_t = \Pi y_{t-k} + \Gamma_1 \Delta y_{t-1} + \Gamma_2 \Delta y_{t-2} + \ldots + \Gamma_{k-1} \Delta y_{t-(k-1)} + u_t$$
Then test the rank of the matrix $\Pi$. If $\Pi$ is of zero rank (i.e. all the eigenvalues
are not significantly different from zero), there is no cointegration, otherwise,
the rank will give the number of cointegrating vectors. (You could also go into
a bit more detail on how the eigenvalues are used to obtain the rank.)
(b) Repeating the table given in the question, but adding the null and
alternative hypotheses in each case, and letting r denote the number of
cointegrating vectors:

Null          Alternative    λmax      95% critical
hypothesis    hypothesis               value
r = 0         r = 1          38.962    33.178
r = 1         r = 2          29.148    27.169
r = 2         r = 3          16.304    20.278
r = 3         r = 4           8.861    14.036
r = 4         r = 5           1.994     3.962
Considering each row in the table in turn, and looking at the first one first, the
test statistic is greater than the critical value, so we reject the null hypothesis
that there are no cointegrating vectors. The same is true of the second row
(that is, we reject the null hypothesis of one cointegrating vector in favour of
the alternative that there are two). Looking now at the third row, we cannot
reject (at the 5% level) the null hypothesis that there are two cointegrating
vectors, and this is our conclusion. There are two independent linear
combinations of the variables that will be stationary.
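The sequential reading of the table can be written out as a short loop. The function name is hypothetical. The numbers are those in the table above.

```python
# (null value of r, test statistic, 95% critical value), in testing order
rows = [
    (0, 38.962, 33.178),
    (1, 29.148, 27.169),
    (2, 16.304, 20.278),
    (3, 8.861, 14.036),
    (4, 1.994, 3.962),
]


def select_rank(rows):
    """Return the first null value of r that is not rejected.

    If every null is rejected, the number of rows (full rank) is returned.
    """
    for r, stat, critical in rows:
        if stat <= critical:  # cannot reject H0: r cointegrating vectors
            return r
    return len(rows)


chosen_r = select_rank(rows)
print(chosen_r)
```

The loop stops at the third row, where the statistic falls below the critical value, giving two cointegrating vectors.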
(c) Johansen's method allows the testing of hypotheses by considering them
effectively as restrictions on the cointegrating vector. The first thing to note is
that all linear combinations of the cointegrating vectors are also cointegrating
vectors. Therefore, if there are many cointegrating vectors in the unrestricted
case and if the restrictions are relatively simple, it may be possible to satisfy
the restrictions without causing the eigenvalues of the estimated coefficient
matrix to change at all. However, as the restrictions become more complex,
renormalisation will no longer be sufficient to satisfy them, so that imposing
them will cause the eigenvalues of the restricted coefficient matrix to be
different to those of the unrestricted coefficient matrix. If the restriction(s)
implied by the hypothesis is (are) nearly already present in the data, then the
eigenvectors will not change significantly when the restriction is imposed. If,
on the other hand, the restriction on the data is severe, then the eigenvalues
will change significantly compared with the case when no restrictions were
imposed.
The test statistic for testing the validity of these restrictions is given by
$$-T \sum_{i=r+1}^{p} \left[ \ln(1 - \lambda_i) - \ln(1 - \lambda_i^*) \right] \sim \chi^2(p-r)$$
where
$\lambda_i^*$ are the characteristic roots (eigenvalues) of the restricted model
$\lambda_i$ are the characteristic roots (eigenvalues) of the unrestricted model
r is the number of non-zero characteristic roots (eigenvalues) in the
unrestricted model
p is the number of variables in the system.
If the restrictions are supported by the data, the eigenvalues will not change
much when the restrictions are imposed and so the test statistic will be small.
(d) There are many applications that could be considered, and tests for PPP,
for cointegration between international bond markets, and tests of the
expectations hypothesis were presented in Sections 7.9, 7.10, and 7.11
respectively. These are not repeated here.
(e) Both Johansen statistics can be thought of as being based on an
examination of the eigenvalues of the long run coefficient or $\Pi$ matrix. In both
cases, the g eigenvalues (for a system containing g variables) are placed in
descending order: $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_g$. The maximal eigenvalue (i.e. the $\lambda_{max}$)
statistic is based on an examination of each eigenvalue separately, while the
trace statistic is based on a joint examination of the g-r smallest eigenvalues. If
the test statistic is greater than the critical value from Johansen's tables, reject
the null hypothesis that there are r cointegrating vectors in favour of the
alternative that there are r+1 (for $\lambda_{max}$) or more than r (for $\lambda_{trace}$). The testing
is conducted in a sequence and under the null, r = 0, 1, ..., g-1 so that the
hypotheses for $\lambda_{trace}$ and $\lambda_{max}$ are as follows
Null hypothesis     Trace alternative    Max alternative
for both tests
H0: r = 0           H1: 0 < r ≤ g        H1: r = 1
H0: r = 1           H1: 1 < r ≤ g        H1: r = 2
H0: r = 2           H1: 2 < r ≤ g        H1: r = 3
...                 ...                  ...
H0: r = g-1         H1: r = g            H1: r = g
Thus the trace test starts by examining all eigenvalues together to test H0: r =
0, and if this is not rejected, this is the end and the conclusion would be that
there is no cointegration. If this hypothesis is rejected, the largest
eigenvalue would be dropped and a joint test conducted using all of the
eigenvalues except the largest to test H0: r = 1. If this hypothesis is not
rejected, the conclusion would be that there is one cointegrating vector, while
if this is rejected, the second largest eigenvalue would be dropped and the test
statistic recomputed using the remaining g-2 eigenvalues and so on. The
testing sequence would stop when the null hypothesis is not rejected.

The maximal eigenvalue test follows exactly the same testing sequence with
the same null hypothesis as for the trace test, but the $\lambda_{max}$ test only considers
one eigenvalue at a time. The null hypothesis that r = 0 is tested using the
largest eigenvalue. If this null is rejected, the null that r = 1 is examined using
the second largest eigenvalue and so on.
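How the two statistics use the eigenvalues can be made concrete with the standard formulae, trace(r) = -T Σ ln(1 - λi) summed over i = r+1, ..., g, and max(r, r+1) = -T ln(1 - λ(r+1)). The eigenvalues and sample size below are hypothetical numbers chosen only for illustration.

```python
import math


def johansen_stats(eigenvalues, T):
    """Return a list of (r, trace statistic, max statistic) for r = 0, ..., g-1.

    Eigenvalues are sorted so that the largest comes first.
    """
    lam = sorted(eigenvalues, reverse=True)
    results = []
    for r in range(len(lam)):
        # Trace: joint test on the g - r smallest eigenvalues.
        trace = -T * sum(math.log(1 - value) for value in lam[r:])
        # Max: test on the (r+1)th largest eigenvalue only.
        lmax = -T * math.log(1 - lam[r])
        results.append((r, trace, lmax))
    return results


stats = johansen_stats([0.05, 0.30, 0.12], T=100)  # hypothetical eigenvalues
for r, trace, lmax in stats:
    print(f"H0: r = {r}  trace = {trace:.3f}  max = {lmax:.3f}")
```

For the final null, r = g-1, only one eigenvalue remains, so the two statistics coincide, and each trace statistic is the sum of the max statistics from that row downwards.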
6. (a) The operation of the Johansen test has been described in the book, and
also in question 5, part (a) above. If the rank of the $\Pi$ matrix is zero, this
implies that there is no cointegration or no common stochastic trends between
the series. A finding that the rank of $\Pi$ is one or two would imply that there
were one or two linearly independent cointegrating vectors or combinations of
the series that would be stationary respectively. A finding that the rank of $\Pi$ is
3 would imply that the matrix is of full rank. Since the maximum number of
cointegrating vectors is g-1, where g is the number of variables in the system,
this does not imply that there are 3 cointegrating vectors. In fact, the implication
of a rank of 3 would be that the original series were stationary, and provided
that unit root tests had been conducted on each series, this would have
effectively been ruled out.
(b) The first test of H0: r = 0 is conducted using the first row of the table.
Clearly, the test statistic is greater than the critical value so the null hypothesis
is rejected. Considering the second row, the same is true, so that the null of r =
1 is also rejected. Considering now H0: r = 2, the test statistic is smaller than
the critical value so that the null is not rejected. So we conclude that there are
2 cointegrating vectors, or in other words 2 linearly independent combinations
of the non-stationary variables that are stationary.
7. The fundamental difference between the Engle-Granger and the Johansen
approaches is that the former is a single-equation methodology whereas
Johansen is a systems technique involving the estimation of more than one
equation. The two approaches have been described in detail in Chapter 7 and
in the answers to the questions above, and will therefore not be covered again.
The main (arguably only) advantage of the Engle-Granger approach is its
simplicity and its intuitive interpretability. However, it has a number of
disadvantages that have been described in detail in Chapter 7, including its
inability to detect more than one cointegrating relationship and the
impossibility of validly testing hypotheses about the cointegrating vector.

Chapter 8
1. (a) A number of stylised features of financial data have been suggested at
the start of Chapter 8 and in other places throughout the book:
- Frequency: Stock market prices are measured every time there is a trade or
somebody posts a new quote, so often the frequency of the data is very high.
- Non-stationarity: Financial data (asset prices) are covariance
non-stationary; but if we assume that we are talking about returns from here on,
then we can validly consider them to be stationary.
- Linear independence: They typically have little evidence of linear
(autoregressive) dependence, especially at low frequency.
- Non-normality: They are not normally distributed; they are fat-tailed.
- Volatility pooling and asymmetries in volatility: The returns exhibit
volatility clustering and leverage effects.
Of these, we can allow for the non-stationarity within the linear (ARIMA)
framework, and we can use whatever frequency of data we like to form the
models, but we cannot hope to capture the other features using a linear model
with Gaussian disturbances.
(b) GARCH models are designed to capture the volatility clustering effects in
the returns (GARCH(1,1) can model the dependence in the squared returns, or
squared residuals), and they can also capture some of the unconditional
leptokurtosis, so that even if the residuals of a linear model of the form given
by the first part of the equation in part (e), the $u_t$s, are leptokurtic, the
standardised residuals from the GARCH estimation are likely to be less
leptokurtic. Standard GARCH models cannot, however, account for leverage
effects.
(c) This is essentially a "which disadvantages of ARCH are overcome by
GARCH" question. The disadvantages of ARCH(q) are:
- How do we decide on q?
- The required value of q might be very large
- Non-negativity constraints might be violated.
When we estimate an ARCH model, we require $\alpha_i > 0 \;\forall\; i = 1, 2, \ldots, q$ (since
variance cannot be negative).
GARCH(1,1) goes some way to get around these. The GARCH(1,1) model has
only three parameters in the conditional variance equation, compared to q+1
for the ARCH(q) model, so it is more parsimonious. Since there are fewer
parameters than in a typical qth order ARCH model, it is less likely that one or
more of these 3 estimated parameters would be negative than that one or more
of the q+1 ARCH parameters would be. Also, the GARCH(1,1) model can usually
still capture all of the significant dependence in the squared returns since it
is possible to write the GARCH(1,1) model as an ARCH($\infty$), so lags of the
squared residuals back into the infinite past help to explain the current value
of the conditional variance, $h_t$.
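The ARCH($\infty$) representation can be seen by repeated substitution. The following is a sketch only, assuming $0 \le \beta < 1$ and writing the GARCH(1,1) conditional variance as $h_t = \alpha_0 + \alpha_1 u_{t-1}^2 + \beta h_{t-1}$ (the symbol names are an assumption, chosen to match the forecasting equations in part (g)):

```latex
\begin{aligned}
h_t &= \alpha_0 + \alpha_1 u_{t-1}^2 + \beta h_{t-1} \\
    &= \alpha_0 + \alpha_1 u_{t-1}^2 + \beta\left(\alpha_0 + \alpha_1 u_{t-2}^2 + \beta h_{t-2}\right) \\
    &= \alpha_0 (1 + \beta) + \alpha_1 \left(u_{t-1}^2 + \beta u_{t-2}^2\right) + \beta^2 h_{t-2} \\
    &\;\;\vdots \\
    &= \frac{\alpha_0}{1 - \beta} + \alpha_1 \sum_{j=1}^{\infty} \beta^{\,j-1} u_{t-j}^2 .
\end{aligned}
```

The weights on past squared residuals decline geometrically, which is why three parameters can stand in for a long ARCH(q) lag structure.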
(d) There are a number that you could choose from; the relevant ones that
were discussed in Chapter 8 include EGARCH, GJR and GARCH-M.
The first two of these are designed to capture leverage effects. These are
asymmetries in the response of volatility to positive or negative returns. The
standard GARCH model cannot capture these, since we are squaring the
lagged error term, and we are therefore losing its sign.
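The conditional variance equations for the EGARCH and GJR models are
respectively
$$\log(\sigma_t^2) = \omega + \beta \log(\sigma_{t-1}^2) + \gamma \frac{u_{t-1}}{\sqrt{\sigma_{t-1}^2}} + \alpha \left[ \frac{|u_{t-1}|}{\sqrt{\sigma_{t-1}^2}} - \sqrt{\frac{2}{\pi}} \right]$$
and
$$\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \beta \sigma_{t-1}^2 + \gamma u_{t-1}^2 I_{t-1}$$
where $I_{t-1} = 1$ if $u_{t-1} < 0$
and $I_{t-1} = 0$ otherwise.
For a leverage effect, we would see $\gamma > 0$ in the GJR model and $\gamma < 0$ in
the EGARCH model as written here, so that in both cases a negative shock
raises the conditional variance by more than a positive shock of the same size.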
The EGARCH model also has the added benefit that the model is expressed in
terms of the log of ht, so that even if the parameters are negative, the
conditional variance will always be positive. We do not therefore have to
artificially impose non-negativity constraints.
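As an illustration of the asymmetry, the sketch below evaluates a GJR conditional variance after a negative and a positive shock of equal size. It assumes the usual GJR form in which an indicator adds an extra term only when the lagged shock is negative; the function name and all parameter values are hypothetical and chosen purely for illustration.

```python
def gjr_variance(u_lag, h_lag, alpha0, alpha1, beta, gamma):
    """One-period GJR conditional variance (assumed standard form).

    The indicator equals 1 only when the lagged shock is negative, so
    gamma > 0 makes bad news raise volatility by more than good news.
    """
    indicator = 1.0 if u_lag < 0 else 0.0
    return alpha0 + alpha1 * u_lag ** 2 + beta * h_lag + gamma * u_lag ** 2 * indicator


# Hypothetical parameter values, not estimates from any data set.
params = dict(alpha0=0.01, alpha1=0.05, beta=0.85, gamma=0.08)

h_after_bad_news = gjr_variance(u_lag=-1.0, h_lag=0.1, **params)
h_after_good_news = gjr_variance(u_lag=1.0, h_lag=0.1, **params)

# The gap between the two equals gamma * u_lag**2.
asymmetry = h_after_bad_news - h_after_good_news
```

With $\gamma = 0$ the two values coincide and the model collapses to a standard GARCH(1,1), which is why the standard model cannot capture leverage effects.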
One form of the GARCH-M model can be written
$$y_t = \mu + \text{other terms} + \delta \sigma_{t-1} + u_t, \quad u_t \sim N(0, \sigma_t^2)$$
$$\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \beta \sigma_{t-1}^2$$
so that the model allows the lagged value of the conditional standard deviation
to affect the return. In other words, our best current estimate of the total risk
of the asset influences the return, so that we expect a positive coefficient for
$\delta$. Note that some authors use $\sigma_t$ (i.e. a contemporaneous term).
(e) Since $y_t$ are returns, we would expect their mean value (which will be
given by $\mu$) to be positive and small. We are not told the frequency of the data,
but suppose that we had a year of daily returns data, then $\mu$ would be the
average daily percentage return over the year, which might be, say, 0.05
(percent). We would expect the value of $\alpha_0$ again to be small, say 0.0001, or
something of that order. The unconditional variance of the disturbances would
be given by $\alpha_0 / (1 - (\alpha_1 + \alpha_2))$. Typical values for $\alpha_1$ and $\alpha_2$ are 0.15 and 0.8
respectively. The important thing is that all three alphas must be positive, and
the sum of $\alpha_1$ and $\alpha_2$ would be expected to be less than, but close to, unity,
with $\alpha_2 > \alpha_1$.
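As a quick arithmetic check of the unconditional variance formula, the sketch below plugs in the illustrative magnitudes quoted above (an intercept of 0.0001 and slope coefficients of 0.8 and 0.15). The function name is hypothetical.

```python
def unconditional_variance(intercept, slope_coefficients):
    """Unconditional variance of a GARCH process: intercept / (1 - sum of slopes).

    Only defined when the slope coefficients sum to less than one.
    """
    persistence = sum(slope_coefficients)
    if persistence >= 1.0:
        raise ValueError("not stationary in variance: slopes sum to 1 or more")
    return intercept / (1.0 - persistence)


# Illustrative magnitudes from the text; only the sum of the slopes matters here.
var_u = unconditional_variance(0.0001, [0.8, 0.15])
```

With a persistence of 0.95 the unconditional variance is twenty times the intercept, which shows how a small intercept can still be consistent with a non-trivial long-run variance.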
(f) Since the model was estimated using maximum likelihood, it does not seem
natural to test this restriction using the F-test via comparisons of residual
sums of squares (and a t-test cannot be used since it is a test involving more
than one coefficient). Thus we should use one of the approaches to hypothesis
testing based on the principles of maximum likelihood (Wald, Lagrange
Multiplier, Likelihood Ratio). The easiest one to use would be the likelihood
ratio test, which would be computed as follows:
1. Estimate the unrestricted model and obtain the maximised value of

the log-likelihood function.
2. Impose the restriction by rearranging the model, and estimate the

restricted model, again obtaining the value of the likelihood at the
new optimum. Note that this value of the LLF will be likely to be
lower than the unconstrained maximum.
3. Then form the likelihood ratio test statistic given by
$$LR = -2(L_r - L_u) \sim \chi^2(m)$$
where $L_r$ and $L_u$ are the values of the LLF for the restricted and
unrestricted models respectively, and m denotes the number of
restrictions, which in this case is one.
4. If the value of the test statistic is greater than the critical value, reject
the null hypothesis that the restrictions are valid.
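The four steps above can be sketched in a few lines. The log-likelihood values below are hypothetical, since none are given in the question, and the p-value uses the fact that a chi-squared variable with one degree of freedom is the square of a standard normal, so the upper-tail probability can be written with the complementary error function.

```python
import math


def likelihood_ratio_test(llf_restricted, llf_unrestricted):
    """LR statistic and p-value for a single restriction (m = 1)."""
    lr = -2.0 * (llf_restricted - llf_unrestricted)
    # For one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(max(lr, 0.0) / 2.0))
    return lr, p_value


# Hypothetical maximised log-likelihood values for the two models.
lr_stat, p_val = likelihood_ratio_test(llf_restricted=-1052.3, llf_unrestricted=-1049.1)

# 3.841 is the 5% critical value of the chi-squared(1) distribution.
reject_at_5pct = lr_stat > 3.841
```

For more than one restriction the same statistic applies, but the p-value would need the chi-squared distribution with m degrees of freedom rather than this one-degree-of-freedom shortcut.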
(g) In fact, it is possible to produce volatility (conditional variance) forecasts

in exactly the same way as forecasts are generated from an ARMA model by
iterating through the equations with the conditional expectations operator.
We know all information including that available up to time T. The answer to
this question will use the convention from the GARCH modelling literature to
denote the conditional variance by $h_t$ rather than $\sigma_t^2$. What we want to
generate are forecasts of $h_{T+1} \mid \Omega_T$, $h_{T+2} \mid \Omega_T$, ..., $h_{T+s} \mid \Omega_T$, where $\Omega_T$ denotes all
information available up to and including observation T. Adding 1 then 2 then
3 to each of the time subscripts, we have the conditional variance equations
for times T+1, T+2, and T+3:
$$h_{T+1} = \alpha_0 + \alpha_1 u_T^2 + \beta h_T \qquad (1)$$
$$h_{T+2} = \alpha_0 + \alpha_1 u_{T+1}^2 + \beta h_{T+1} \qquad (2)$$
$$h_{T+3} = \alpha_0 + \alpha_1 u_{T+2}^2 + \beta h_{T+2} \qquad (3)$$
Let $h_{1,T}^f$ be the one step ahead forecast for h made at time T. This is easy to
calculate since, at time T, we know the values of all the terms on the RHS.
Given $h_{1,T}^f$, how do we calculate $h_{2,T}^f$, that is the 2-step ahead forecast for h
made at time T?
From (2), we can write
$$h_{2,T}^f = \alpha_0 + \alpha_1 E_T(u_{T+1}^2) + \beta h_{1,T}^f \qquad (4)$$
where $E_T(u_{T+1}^2)$ is the expectation, made at time T, of $u_{T+1}^2$, which is the
squared disturbance term. The model assumes that the series $u_t$ has zero
mean, so we can now write
$$\mathrm{Var}(u_t) = E[(u_t - E(u_t))^2] = E[(u_t)^2].$$
The conditional variance of $u_t$ is $h_t$, so
$$h_t \mid \Omega_t = E[(u_t)^2]$$
Turning this argument around, and applying it to the problem that we have,
$$E_T[(u_{T+1})^2] = h_{T+1}$$
but we do not know $h_{T+1}$, so we replace it with $h_{1,T}^f$, so that (4) becomes
$$h_{2,T}^f = \alpha_0 + \alpha_1 h_{1,T}^f + \beta h_{1,T}^f = \alpha_0 + (\alpha_1 + \beta) h_{1,T}^f$$
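What about the 3-step ahead forecast?
By similar arguments,
$$h_{3,T}^f = E_T(\alpha_0 + \alpha_1 u_{T+2}^2 + \beta h_{T+2}) = \alpha_0 + (\alpha_1 + \beta) h_{2,T}^f = \alpha_0 + (\alpha_1 + \beta)[\alpha_0 + (\alpha_1 + \beta) h_{1,T}^f]$$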
And so on. This is the method we could use to forecast the conditional
variance of yt. If yt were, say, daily returns on the FTSE, we could use these
volatility forecasts as an input in the Black-Scholes equation to help determine
the appropriate price of FTSE index options.
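The iteration described in this answer can be sketched as a short loop: the one-step forecast uses the known time-T quantities, and every later step replaces the unknown squared disturbance by the previous variance forecast. The function name and the parameter values are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
def garch_variance_forecasts(alpha0, alpha1, beta, u_T, h_T, steps):
    """Multi-step GARCH(1,1) conditional variance forecasts made at time T."""
    # One-step ahead: every term on the right-hand side is known at time T.
    h_f = alpha0 + alpha1 * u_T ** 2 + beta * h_T
    forecasts = [h_f]
    # Two steps and beyond: E_T(u^2) is replaced by the previous forecast,
    # so each step is alpha0 + (alpha1 + beta) * previous forecast.
    for _ in range(steps - 1):
        h_f = alpha0 + (alpha1 + beta) * h_f
        forecasts.append(h_f)
    return forecasts


# Hypothetical inputs: last residual 0.5, last conditional variance 0.3.
forecasts = garch_variance_forecasts(
    alpha0=0.02, alpha1=0.1, beta=0.8, u_T=0.5, h_T=0.3, steps=3
)
```

With these inputs the forecasts fall step by step towards the unconditional variance of 0.02 / (1 - 0.9) = 0.2, because the starting conditional variance lies above it.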
(h) An s-step ahead forecast for the conditional variance could be written
$$h_{s,T}^f = \alpha_0 \sum_{i=1}^{s-1} (\alpha_1 + \beta)^{i-1} + (\alpha_1 + \beta)^{s-1} h_{1,T}^f \qquad (x)$$
For the new value of $\beta$, the persistence of shocks to the conditional variance,
given by $(\alpha_1 + \beta)$, is 0.1251 + 0.98 = 1.1051, which is bigger than 1. It is obvious
from equation (x) that any value for $(\alpha_1 + \beta)$ bigger than one will lead the
forecasts to explode. The forecasts will keep on increasing and will tend to
infinity as the forecast horizon increases (i.e. as s increases). This is obviously
an undesirable property of a forecasting model! This is called non-stationarity
in variance.
For $(\alpha_1 + \beta) < 1$, the forecasts will converge on the unconditional variance as the
forecast horizon increases. For $(\alpha_1 + \beta) = 1$, known as integrated GARCH or
IGARCH, there is a unit root in the conditional variance, and the forecasts will
stay constant as the forecast horizon increases.
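To see the difference between a persistence below one and the value of 1.1051 discussed above, the closed-form s-step forecast can be evaluated at a few horizons. This is a sketch only: the intercept and the one-step forecast are hypothetical values, and the function name is an assumption.

```python
def s_step_forecast(alpha0, persistence, h1, s):
    """Closed-form s-step ahead variance forecast.

    alpha0 * sum_{i=1}^{s-1} persistence**(i-1) + persistence**(s-1) * h1,
    where persistence stands for (alpha1 + beta) and h1 is the 1-step forecast.
    """
    geometric_sum = sum(persistence ** (i - 1) for i in range(1, s))
    return alpha0 * geometric_sum + persistence ** (s - 1) * h1


alpha0, h1 = 0.01, 0.5  # hypothetical values

# Persistence below one: forecasts settle at alpha0 / (1 - persistence) = 0.2.
stationary = [s_step_forecast(alpha0, 0.95, h1, s) for s in (1, 10, 100, 1000)]

# Persistence of 0.1251 + 0.98 = 1.1051: forecasts grow without bound.
explosive = [s_step_forecast(alpha0, 1.1051, h1, s) for s in (1, 10, 100)]
```

The stationary sequence falls from 0.5 towards 0.2, while the explosive one increases at every horizon.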
2. (a) Maximum likelihood works by finding the most likely values of the
parameters given the actual data. More specifically, a log-likelihood function is
formed, usually based upon a normality assumption for the disturbance terms,
and the values of the parameters that maximise it are sought. Maximum
likelihood estimation can be employed to find parameter values for both linear
and non-linear models.
(b) The three hypothesis testing procedures available within the maximum
likelihood approach are Lagrange multiplier (LM), likelihood ratio (LR) and
Wald tests. The differences between them are described in Figure 8.4, and are
not defined again here. The Lagrange multiplier test involves estimation only
under the null hypothesis, the likelihood ratio test involves estimation under
both the null and the alternative hypothesis, while the Wald test involves
estimation only under the alternative. Given this, it should be evident that the
LM test will in many cases be the simplest to compute since the restrictions
implied by the null hypothesis will usually lead to some terms cancelling out to
give a simplified model relative to the unrestricted model.
(c) OLS will give identical parameter estimates for all of the intercept and
slope parameters, but will give a slightly different parameter estimate for the
variance of the disturbances. These are shown in the Appendix to Chapter 8.
The difference in the OLS and maximum likelihood estimators for the variance
of the disturbances can be seen by comparing the divisors of equations (8A.25)
and (8A.26).
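The difference in divisors can be illustrated with a small bivariate regression. The sketch assumes the standard results that the OLS variance estimator divides the residual sum of squares by T - k while the maximum likelihood estimator divides by T; the data are made up for illustration.

```python
def fit_line(x, y):
    """Bivariate least squares: returns intercept, slope and residual sum of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, rss


# Made-up sample of five observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope, rss = fit_line(x, y)
T, k = len(x), 2  # k counts the intercept and the slope

sigma2_ols = rss / (T - k)  # unbiased OLS estimator
sigma2_ml = rss / T         # maximum likelihood estimator (biased downwards)
```

The intercept and slope are the same under both approaches; only the variance estimate differs, and the gap shrinks as T grows relative to k.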
3. (a) The unconditional variance of a random variable could be thought of,

abusing the terminology somewhat, as the variance without reference to a
time index, or rather the variance of the data taken as a whole, without
conditioning on a particular information set. The conditional variance, on the
other hand, is the variance of a random variable at a particular point in time,
conditional upon a particular information set. The variance of $u_t$, $\sigma_t^2$,
conditional upon its previous values, may be written
$\sigma_t^2 = \mathrm{Var}(u_t \mid u_{t-1}, u_{t-2}, \ldots) = E[(u_t - E(u_t))^2 \mid u_{t-1}, u_{t-2}, \ldots]$,
while the unconditional variance would simply be $\mathrm{Var}(u_t) = \sigma^2$.
Forecasts from models such as GARCH would be conditional forecasts,

produced for a particular point in time, while historical volatility is an
unconditional measure that would generate unconditional forecasts. For
producing 1-step ahead forecasts, it is likely that a conditional model making
use of recent relevant information will provide more accurate forecasts
(although whether it would in any particular application is an empirical
question). As the forecast horizon increases, however, a GARCH model that is
stationary in variance will yield forecasts that converge upon the long-term
average (historical) volatility. By the time we reach 20-steps ahead, the
GARCH forecast is likely to be very close to the unconditional variance so that
there is little gain likely from using GARCH models for forecasts with very
long horizons. Approaches such as EWMA, whose forecasts do not converge on
an unconditional average as the prediction horizon increases, are likely to
produce inferior forecasts as the horizon increases for series that show a
long-term mean reverting pattern in volatility. This arises because if the volatility
estimate is above its historical average at the end of the in-sample estimation
period, EWMA would predict that it would continue at this level while in
reality it is likely to fall back towards its long-term mean eventually.
(b) Equation (8.110) is an equation showing that the variance of the

disturbances is not fixed over time, but rather varies systematically according
to a GARCH process. This is therefore an example of heteroscedasticity. Thus,
the consequences if it were present but ignored would be those described in
Chapter 4. In summary, the coefficient estimates would still be consistent and
unbiased but not efficient. There is therefore the possibility that the standard
error estimates calculated using the usual formulae would be incorrect leading
to inappropriate inferences.
(c) There are of course a large number of competing methods for measuring
and forecasting volatility, and it is worth stating at the outset that no research
has suggested that one method is universally superior to all others, so that
each method has its merits and may work well in certain circumstances.
Historical measures of volatility are just simple average measures, for
example the standard deviation of daily returns over a 3-year period. As such,
they are the simplest to calculate, but suffer from a number of shortcomings.
First, since the observations are unweighted, historical volatility can be slow to
respond to changing market circumstances, and would not take advantage of
short-term persistence in volatility that could lead to more accurate short-
term forecasts. Second, if there is an extreme event (e.g. a market crash), this
will lead the measured volatility to be high for a number of observations equal
to the measurement sample length. For example, suppose that volatility is
being measured using a 1-year (250-day) sample of returns, which is being
rolled forward one observation at a time to produce a series of 1-step ahead
volatility forecasts. If a market crash occurs on day t, this will increase the
measured level of volatility by the same amount right until day t+250 (i.e. it
will not decay away) and then it will disappear completely from the sample so
that measured volatility will fall abruptly. Exponential weighting of

observations as the EWMA model does, where the weight attached to each
observation in the calculation of volatility declines exponentially as the
observations go further back in time, will resolve both of these issues.
However, if forecasts are produced from an EWMA model, these forecasts will
not converge upon the long-term mean volatility estimate as the prediction
horizon increases, and this may be undesirable (see part (a) of this question).
There is also the issue of how the parameter $\lambda$ is calculated (see equation (8.5)
on page 443, although, of course, it can be estimated using maximum
likelihood). GARCH models overcome this problem with the forecasts as well,
since a GARCH model that is stationary in variance will have forecasts that
converge upon the long-term average as the horizon increases (see part (a) of
this question). GARCH models will also overcome the two problems with
unweighted averages described above. However, GARCH models are far more
difficult to estimate than the other two models, and sometimes, when
estimation goes wrong, the resulting parameter estimates can be nonsensical,
leading to nonsensical forecasts as well. Thus it is important to apply a reality
check to estimated GARCH models to ensure that the coefficient estimates
are intuitively plausible. Finally, implied volatility estimates are those derived
from the prices of traded options. The market-implied volatility forecasts are
obtained by backing out the volatility from the price of an option using an
option pricing formula together with an iterative search procedure. Financial
market practitioners would probably argue that implied forecasts of the future
volatility of the underlying asset are likely to be more accurate than those
estimated from statistical models because the people who work in financial
markets know more about what is likely to happen to those instruments in the
future than econometricians do. Also, an inaccurate volatility forecast
implied from an option price may imply an inaccurate option price and
therefore the possibility of arbitrage opportunities. However, the empirical
evidence on the accuracy of implied versus statistical forecasting models is
mixed, and some research suggests that implied volatility systematically over-
estimates the true volatility of the underlying asset returns. This may arise
from the use of an incorrect option pricing formula to obtain the implied
volatility; for example, the Black-Scholes model assumes that the volatility of
the underlying asset is fixed (non-stochastic), and also that the returns to the
underlying asset are normally distributed. Both of these assumptions are at
best tenuous. A further reason for the apparent failure of the implied model
may be a manifestation of the peso problem. This occurs when market
practitioners include in the information set that they use to price options the
possibility of a very extreme return that has a low probability of occurrence,
but has important ramifications for the price of the option due to its sheer size.
If this event does not occur in the sample period over which the implied and
actual volatilities are compared, the implied model will appear inaccurate. Yet
this does not mean that the practitioners' forecasts were wrong, but rather
simply that the low-probability, high-impact event did not happen during that
sample period. It is also worth stating that only one implied volatility can be
calculated from each option price for the average volatility of the underlying
asset over the remaining lifetime of the option.
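The idea of backing out volatility from an option price with an iterative search can be sketched as follows. This assumes a European call priced by the Black-Scholes formula and uses bisection as the search procedure; both are illustrative choices, and the "market" price below is generated from the formula itself rather than taken from real data.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def bs_call_price(spot, strike, rate, maturity, vol):
    """Black-Scholes price of a European call on a non-dividend-paying asset."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * maturity) / (
        vol * math.sqrt(maturity)
    )
    d2 = d1 - vol * math.sqrt(maturity)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


def implied_vol(price, spot, strike, rate, maturity, lo=1e-6, hi=5.0, tol=1e-8):
    """Bisection search: the call price is increasing in volatility."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call_price(spot, strike, rate, maturity, mid) > price:
            hi = mid  # model price too high, so volatility must be lower
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Hypothetical quote: generate a price at 20% volatility, then recover it.
market_price = bs_call_price(spot=100.0, strike=100.0, rate=0.05, maturity=0.5, vol=0.2)
recovered_vol = implied_vol(market_price, spot=100.0, strike=100.0, rate=0.05, maturity=0.5)
```

Each option price yields a single number, which is why only one implied volatility, an average over the option's remaining life, is available per quote.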

4. (a) A possible diagonal VECH model would be
$$\begin{pmatrix} y_{1t} \\ y_{2t} \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} u_{1t} \\ u_{2t} \end{pmatrix}, \quad u_t = \begin{pmatrix} u_{1t} \\ u_{2t} \end{pmatrix} \sim N(0, H_t), \quad H_t = \begin{pmatrix} h_{11t} & h_{12t} \\ h_{12t} & h_{22t} \end{pmatrix}$$
$$h_{11t} = \omega_{11} + \alpha_{11} u_{1t-1}^2 + \beta_{11} h_{11t-1}$$
$$h_{12t} = \omega_{12} + \alpha_{12} u_{1t-1} u_{2t-1} + \beta_{12} h_{12t-1}$$
$$h_{22t} = \omega_{22} + \alpha_{22} u_{2t-1}^2 + \beta_{22} h_{22t-1}$$
The coefficients expected would be very small for the conditional mean
coefficients, $\mu_1$ and $\mu_2$, since they are average daily returns, and they could be
positive or negative, although a positive average return is probably more
likely. Similarly, the intercept terms in the conditional variance equations
would also be expected to be small and positive since this is daily data. The
coefficients on the lagged squared error and lagged conditional variance in the
conditional variance equations must lie between zero and one, and more
specifically, the following might be expected: $\alpha_{11}$ and $\alpha_{22}$ of around 0.1-0.3; $\beta_{11}$ and $\beta_{22}$
of around 0.5-0.8, with $\alpha_{11} + \beta_{11} < 1$ and $\alpha_{22} + \beta_{22} < 1$. The coefficient values for the
conditional covariance equation are more difficult to predict: $\alpha_{12} + \beta_{12} < 1$ is
still required for the model to be useful for forecasting covariances. The
parameters in this equation could be negative, although given that the returns
for two stock markets are likely to be positively correlated, the parameters
would probably be positive, although the model would still be a valid one if
they were not.
(b) One of two procedures could be used. Either the daily returns data would
be transformed into weekly returns data by adding up the returns over all of
the trading days in each week, or the model would be estimated using the daily
data. Daily forecasts would then be produced up to 10 days (2 trading weeks)
ahead.
In both cases, the models would be estimated, and forecasts made of the
conditional variance and conditional covariance. If daily data were used to
estimate the model, the conditional covariance forecasts for
the 5 trading days in a week would be added together to form a covariance
forecast for that week, and similarly for the variance. If the returns had been
aggregated to the weekly frequency, the forecasts used would simply be 1-step
ahead.
Finally, the conditional covariance forecast for the week would be divided by
the product of the square roots of the conditional variance forecasts to obtain
a correlation forecast.
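For the daily-data route, the aggregation and the final division can be sketched as below. The daily forecast values are invented for illustration, and the function name is hypothetical.

```python
import math


def weekly_correlation_forecast(cov_forecasts, var1_forecasts, var2_forecasts):
    """Aggregate daily forecasts over the week, then form the correlation."""
    cov_week = sum(cov_forecasts)
    var1_week = sum(var1_forecasts)
    var2_week = sum(var2_forecasts)
    # Correlation = covariance / (product of the two standard deviations).
    return cov_week / (math.sqrt(var1_week) * math.sqrt(var2_week))


# Invented 1- to 5-step ahead daily forecasts from a bivariate model.
cov_daily = [0.30, 0.31, 0.32, 0.33, 0.34]
var1_daily = [0.90, 0.92, 0.94, 0.96, 0.98]
var2_daily = [0.50, 0.51, 0.52, 0.53, 0.54]

rho_week = weekly_correlation_forecast(cov_daily, var1_daily, var2_daily)
```

If weekly data had been used instead, the three lists would each contain a single 1-step ahead forecast and the same function would apply unchanged.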
(c) There are various approaches available, including computing simple

historical correlations, exponentially weighted measures, and implied
correlations derived from the prices of traded options.

(d) The simple historical approach is obviously the simplest to calculate, but
has two main drawbacks. First, it does not weight information: so any
observations within the sample will be given equal weight, while those outside
the sample will automatically be given a weight of zero. Second, any extreme
observations in the sample will have an equal effect until they abruptly drop
out of the measurement period. For example, suppose that one year of daily
data is used to estimate volatility. If the sample is rolled through one day at a
time, an observation corresponding to a market crash will appear in the next
250 samples, with equal effect, but will then disappear altogether.
Exponentially weighted moving average models of covariance and variance

(which can be used to construct correlation measures) more plausibly give
additional weight to more recent observations, with the weight given to each
observation declining exponentially as they go further back into the past.
These models have the undesirable property that the forecasts for different
numbers of steps ahead will be the same. Hence the forecasts will not tend to
the unconditional mean as those from a suitable GARCH model would.
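The contrast between equal weighting and exponential weighting can be illustrated with a stylised return series containing a single crash. The sketch below works with variances for simplicity (covariances follow the same pattern with cross products in place of squares), assumes zero-mean returns, and uses a decay factor of 0.94 purely as an illustrative assumption.

```python
def rolling_variance(returns, window):
    """Equally weighted variance over the last `window` returns (zero mean assumed)."""
    sample = returns[-window:]
    return sum(r ** 2 for r in sample) / len(sample)


def ewma_variance(returns, lam=0.94):
    """EWMA variance: weight (1 - lam) * lam**j on the j-th most recent squared return.

    `returns` is ordered oldest to newest. In a finite sample the weights sum
    to slightly less than one.
    """
    return sum(
        (1.0 - lam) * lam ** j * r ** 2 for j, r in enumerate(reversed(returns))
    )


# Stylised series: returns of 1.0 every day, with a crash of -10.0 on day 300.
returns = [1.0] * 600
returns[300] = -10.0

# Equal weights: same measured variance while the crash is in the 250-day
# window, then an abrupt drop on the day it leaves.
hist_just_after = rolling_variance(returns[:301], 250)
hist_last_day_in = rolling_variance(returns[:550], 250)
hist_dropped_out = rolling_variance(returns[:551], 250)

# Exponential weights: the influence of the crash decays gradually.
ewma_just_after = ewma_variance(returns[:301])
ewma_50_days_on = ewma_variance(returns[:350])
ewma_250_days_on = ewma_variance(returns[:550])
```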
Finally, implied correlations may at first blush appear to be the best method
for calculating correlation forecasts accurately, for they rely on information
obtained from the market itself. After all, who should know better about
future correlations in the markets than the people who work in those markets?
However, market-based measures of volatility and correlation are sometimes
surprisingly inaccurate, and are also sometimes difficult to obtain. Most
fundamentally, correlation forecasts will only be available where there is an
option traded whose payoffs depend on the prices of two underlying assets.
For all other situations, a market-based correlation forecast will simply not be
available.
Finally, multivariate GARCH models will give more weight to recent
observations in computing the forecasts, but may be difficult and
computationally intensive to estimate.
5. A news impact curve shows the effect of shocks of different magnitudes on
the next period's volatility. These curves can be used to examine visually
whether there are any asymmetry effects in volatility for a particular set of
data. For the data given in this question, the way I would approach it is to put
values of the lagged error into column A ranging from -1 to +1 in increments
of 0.01. Then simply enter the formulae for the GARCH and EGARCH models
into columns B and C that refer to those values of the lagged error put in
column A. The graph obtained would be

[Figure: news impact curves for the GARCH and EGARCH models. Vertical axis: value of conditional variance (0 to 0.2); horizontal axis: value of lagged shock (-1 to 1).]
This graph is a bit of an odd one, in the sense that the conditional variance is
always lower for the EGARCH model. This may suggest estimation error in
one of the models. There is some evidence for asymmetries in the case of the
EGARCH model since the value of the conditional variance is 0.1 for a shock of
1 and 0.12 for a shock of -1.
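The spreadsheet procedure described above translates directly into code. The sketch below is not a reproduction of the figure: the estimated coefficients from the question are not shown in this answer, so every parameter value is hypothetical, the lagged conditional variance is held fixed at an assumed level, and the EGARCH equation is assumed to contain a signed and an absolute standardised shock term.

```python
import math


def garch_nic(shock, alpha0, alpha1, beta, h_lag):
    """Next-period GARCH(1,1) variance as a function of the lagged shock."""
    return alpha0 + alpha1 * shock ** 2 + beta * h_lag


def egarch_nic(shock, omega, beta, gamma, alpha, h_lag):
    """Next-period EGARCH variance (assumed form) as a function of the lagged shock."""
    z = shock / math.sqrt(h_lag)  # standardised shock
    log_h = (
        omega
        + beta * math.log(h_lag)
        + gamma * z
        + alpha * (abs(z) - math.sqrt(2.0 / math.pi))
    )
    return math.exp(log_h)


# Lagged shocks from -1 to +1 in increments of 0.01 (the "column A" values).
shocks = [round(-1.0 + 0.01 * i, 2) for i in range(201)]

# Hypothetical parameters; lagged variance fixed at 0.1 for both curves.
garch_curve = [garch_nic(u, 0.01, 0.1, 0.8, 0.1) for u in shocks]
egarch_curve = [egarch_nic(u, -0.5, 0.8, -0.05, 0.2, 0.1) for u in shocks]
```

Plotting the two lists against `shocks` gives the news impact curves. The GARCH curve is symmetric about zero by construction, while the EGARCH curve is tilted whenever the coefficient on the signed term is non-zero.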
(b) This is a tricky one. The leverage effect is used to rationalise a finding of
asymmetries in equity returns, but such an argument cannot be applied to
foreign exchange returns, since the concept of a Debt/Equity ratio has no
meaning in that context.
On the other hand, there is equally no reason to suppose that there are no
asymmetries in the case of fx data. The data used here were daily USD_GBP
returns for 1974-1994. It might be the case, for example, that news
relating to one country has a differential impact to equally good and bad news
relating to another. To offer one illustration, it might be the case that the bad
news for the currently weak euro has a bigger impact on volatility than news
about the currently strong dollar. This would lead to asymmetries in the news
impact curve. Finally, it is also worth noting that the asymmetry term in the
EGARCH model, 1, is not statistically significant in this case.
Chapter 9
1. (a) This was a rather silly question since the answer is largely given away by
the question in part (b)! Nonetheless, although there are several methods that
could be used to determine whether there is evidence of daily seasonality in
stock returns, a simple method would be to obtain a sample of daily stock
returns and regress them on 5 day-of-the-week dummy variables. The
coefficient estimates would then be interpreted as the average return on each

day of the week, and if some of these were statistically significant but with
differing signs, this could be taken as evidence of daily seasonalities.
(b) The problem is one of perfect multicollinearity between the five daily
dummy variables and the constant term, known as the dummy variable trap.
The sum of the five daily dummy variables will be one in every time period,
and this will be identical to the column of ones used for the constant. The
result is that the implicit assumption of the columns of the matrix of
explanatory variables being independent of one another has been violated, and
hence there is not enough separate information in the sample to be able to
calculate the values of all of the coefficients. The (X'X) matrix will be singular
and therefore its inverse will not exist. The solution is simple: either use all 5
daily dummy variables but no intercept term, or drop one of the dummy
variables and still include the intercept. These two methods of dealing with the
problem are equivalent with identical RSS, and only the interpretation of the
coefficient estimates will change.
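The equivalence of the two solutions can be checked on a toy sample. The sketch relies on the standard result that a regression on a full set of mutually exclusive dummies reproduces the group means, so no matrix algebra is needed; the return values are invented, and Friday is taken as the omitted day.

```python
def day_means(returns, days):
    """Average return for each day of the week."""
    sums, counts = {}, {}
    for r, d in zip(returns, days):
        sums[d] = sums.get(d, 0.0) + r
        counts[d] = counts.get(d, 0) + 1
    return {d: sums[d] / counts[d] for d in sums}


# Invented two-week sample of daily returns.
days = ["Mon", "Tue", "Wed", "Thu", "Fri"] * 2
returns = [-0.4, 0.1, 0.0, 0.05, 0.2, -0.2, 0.3, 0.2, 0.15, 0.4]

means = day_means(returns, days)

# Specification A: five dummies, no intercept. Coefficients are the day means.
coef_a = dict(means)

# Specification B: intercept plus four dummies, Friday omitted.
# The intercept is the Friday mean; each dummy measures the gap to Friday.
intercept_b = means["Fri"]
coef_b = {d: means[d] - intercept_b for d in means if d != "Fri"}

fitted_a = [coef_a[d] for d in days]
fitted_b = [intercept_b + coef_b.get(d, 0.0) for d in days]
```

The fitted values, and therefore the residuals and the RSS, are identical under the two specifications; only the interpretation of the coefficients changes.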
(c) The first step is to calculate the t-ratios. These are 0.232, -2.691, 0.673,
-0.039, and 0.141 for the intercept, D1, D2, D3 and D4 respectively. The
interpretation of the intercept coefficient is the value of the return when all of
the variables (including the daily dummies) are zero, which in this case is the
average Friday return. The coefficients on the daily dummies can be
interpreted as the average deviation of that day's return from the average
Friday return, since Friday is the omitted category. Only one of these dummy
variables is significant, the dummy for Monday, and we would thus conclude
that the average return on Monday was significantly lower than the average
Friday return, but there is no statistically significant evidence of any other
daily seasonalities given these results.
(d) Intercept dummy variables work by changing the regression intercept

estimate if a certain set of conditions hold, while slope dummies work by
changing the slope(s). For example, suppose that the regression model under
study for a sample of daily returns is
$$y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + u_t.$$
A model containing these variables but also including intercept dummy
variables would be
$$y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + \beta_4 D1_t + \beta_5 D2_t + \beta_6 D3_t + \beta_7 D4_t + u_t.$$
As before, if we include the intercept in the regression, we only want 4 dummy
variables, and any 4 of the 5 daily dummies (defined as above) could be
included. A model containing the explanatory variables and slope dummy
variables would be
$$y_t = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + \beta_4 D1_t x_{2t} + \beta_5 D2_t x_{2t} + \beta_6 D3_t x_{2t} + \beta_7 D4_t x_{2t} + \beta_8 D1_t x_{3t} + \beta_9 D2_t x_{3t} + \beta_{10} D3_t x_{3t} + \beta_{11} D4_t x_{3t} + u_t.$$
The dummy variables are defined identically as in the intercept dummy case,
and again one less dummy is needed than the total number of days in the

week. The dummies are now multiplied by explanatory variables so that each
of the slopes on x2 and x3 are permitted to vary from one day to the next.
(e) The financial year ends at approximately the end of March in the UK. So
one way to test the hypothesis that stock returns are different at the end of the
tax year compared with other times of the year would be to obtain a long
sample of monthly returns and to regress the returns on a dummy variable
taking the value one in March and zero otherwise. If, everything else equal,
investors were selling to realise losses in March, we would expect the
coefficient on this dummy to be negative and statistically significant due to
excess selling pressure. Thus average March returns would be significantly
lower than average returns over the whole year.
2. (a) A switching model is simply one where the behaviour of the series is
permitted to change from one type to another under the model. For example,
any regression containing seasonal dummy variables would be a simple kind
of switching model, since the behaviour of the series will be different at
different times.
Threshold autoregressive (TAR) models are those where the variable under
study is assumed to follow one autoregressive process in a given regime and
other autoregressive processes in other regimes. Movements from one regime
to another occur when a variable (not necessarily the variable under study)
rises above or falls below a particular value. Markov switching models, in their
simplest forms, assume that a variable can be drawn from one of several
regimes, each regime having its own mean and variance. The key distinction
between the two classes of models is that TAR models assume that the
threshold variable governing the regime is known, and under the model, once
this threshold is set, the variable is in one of the regimes alone. The Markov
switching model, on the other hand, assumes that the state-determining or
forcing variable is unobserved. The variable under study is thus never
completely in one regime or another, but rather is in each regime with some
probability at each point in time.
The decision on which of the two model classes is more appropriate for a
particular application would be made on the grounds of whether the state-
determining variable was observable or not, and what type of dynamics were
of interest in the model. For example, if the financial theory does not suggest
the forcing variable, then the Markov switching approach may be preferable.
On the other hand, if theory suggests an obvious choice of switching variable,
or if it is of interest to use an AR-type model, then the TAR would be more
appropriate.
There have been very few comparisons of the two approaches that I know of;
authors seem to just adopt a particular approach and use it without discussing
the alternatives available.
(b) (i) The Markov property applies if a process is path independent, that
is, it is only the current value of a series or the current set of probabilities that
determine where the process will be during the next time period, and none of
the values of the series or probabilities during previous time periods. Thus, a

series that followed an AR(1) model would possess the Markov property. An
algebraic expression for the Markov property is given in equation (9.10) on
page 464. The implication of a process having the Markov property is that its
development can be described using only a vector of current probabilities and
a single transition matrix.
(ii) A transition matrix, in the context of Markov switching models, is a matrix

that maps a set of current probabilities to a set of future probabilities. Thus it
will describe the probabilities of the process being in a particular state in the
next period, conditioned upon it being in a given state during this period.
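The mapping from current to future probabilities can be made concrete with a two-state example. The transition probabilities below are hypothetical, and the convention that rows index the current state and columns the next state is an assumption of this sketch.

```python
def step(probabilities, transition):
    """Map current state probabilities to next-period state probabilities.

    transition[i][j] is the probability of moving to state j next period
    given that the process is in state i this period (each row sums to one).
    """
    n = len(probabilities)
    return [
        sum(probabilities[i] * transition[i][j] for i in range(n)) for j in range(n)
    ]


# Hypothetical two-regime transition matrix.
P = [
    [0.9, 0.1],  # from state 1: stay with 0.9, switch with 0.1
    [0.2, 0.8],  # from state 2: switch with 0.2, stay with 0.8
]

pi_now = [1.0, 0.0]          # known to be in state 1 today
pi_next = step(pi_now, P)    # probabilities one period ahead
pi_two = step(pi_next, P)    # probabilities two periods ahead
```

Only the current probability vector and the single matrix P are needed at each step, which is the practical content of the Markov property described in part (i).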
(c) A SETAR model is a TAR model where the state-determining variable is the
dependent variable used in the regression. The use of SETAR models rather
than a more general TAR removes one item to decide on (the state-
determining variable). But there are many others: the number of regimes, the
number of lags in each regime, the value(s) of the threshold(s), and the lag
with which the variable will switch. The major difficulty with SETAR (and
indeed all TAR) models is that it is impossible to easily and validly estimate all
of these quantities at the same time. They depend on each other, and also, the
threshold causes a discontinuity in the function that would be maximised (if
ML is used) or minimised (if NLS is used).
Therefore, usually the easiest way to estimate such models is to use as much
knowledge as possible from financial theory and to assume values for other
parameters and then to estimate as little as possible. For example, it may be
the case that the number of regimes, the delay value and the threshold values
can be assumed from theory. This would leave only the number of lags in each
regime together with the coefficients to be estimated. This could be validly
done using information criteria to determine the lag lengths for each regime
and ML or NLS to estimate the coefficients.
(d) Standard information criteria of the form described in Chapter 5 could be

employed to determine the appropriate length of the lags in each regime.
There would be one value of the criterion for each model order, and the model
order that minimised the value of the criterion would be the one selected.
The problem with this approach is that, if the series under study resides in one
of the regimes for a considerably shorter time than it resides in the others, a
very short lag length will typically be selected for that regime. The reason is
that the reduction in the overall residual sum of squares is unlikely to be big if
it covers only a small number of observations. The upshot is that the use of
standard information criteria applied globally to the whole model would
typically be to lead to long lag lengths for all regimes that the series spends a
high proportion of time in and short lag lengths for regimes that it did not
enter very often. A solution is to define an information criterion that does not
penalise the whole model for additional parameters in one state, i.e. a criterion
that is a function of the separately calculated residual sums of squares for both
of the regimes and of the number of lags and of the number of observations in
each of those regimes. An algebraic example of such a criterion was given in
equation (9.26) on page 476.
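The approach in parts (c) and (d) can be sketched as follows, assuming a
two-regime SETAR in which the threshold and delay are imposed and only the lag
lengths and coefficients are estimated. The criterion used here is a regime-wise
AIC of the form T1 ln(s1) + T2 ln(s2) + 2(p1 + 1) + 2(p2 + 1), where Ti, si and
pi are the number of observations, the residual variance and the lag length in
regime i. This form is an assumption for illustration and may differ in detail
from equation (9.26). The function names and the simulated series are
hypothetical.

```python
import numpy as np

def regime_fit(y, lags, mask, max_lag):
    """OLS fit of an AR(lags) model on the observations selected by mask.

    Returns (number of observations, residual variance) for that regime."""
    t_idx = np.arange(max_lag, len(y))[mask]
    X = np.column_stack([np.ones(len(t_idx))] +
                        [y[t_idx - j] for j in range(1, lags + 1)])
    coef, *_ = np.linalg.lstsq(X, y[t_idx], rcond=None)
    resid = y[t_idx] - X @ coef
    return len(t_idx), resid @ resid / len(t_idx)

def select_setar_lags(y, threshold, delay, max_lag):
    """Choose (p1, p2) by minimising the regime-wise AIC.

    The threshold and delay are taken as given; delay must not exceed max_lag."""
    t = np.arange(max_lag, len(y))
    lower = y[t - delay] < threshold  # regime 1: lagged value below threshold
    best = None
    for p1 in range(1, max_lag + 1):
        T1, s1 = regime_fit(y, p1, lower, max_lag)
        for p2 in range(1, max_lag + 1):
            T2, s2 = regime_fit(y, p2, ~lower, max_lag)
            aic = (T1 * np.log(s1) + T2 * np.log(s2)
                   + 2 * (p1 + 1) + 2 * (p2 + 1))
            if best is None or aic < best[0]:
                best = (aic, p1, p2)
    return best

# Simulated series with different AR(1) dynamics either side of zero.
rng = np.random.default_rng(0)
y = np.zeros(500)
for t in range(1, 500):
    phi = 0.8 if y[t - 1] < 0.0 else -0.3
    y[t] = phi * y[t - 1] + rng.standard_normal()

aic, p1, p2 = select_setar_lags(y, threshold=0.0, delay=1, max_lag=3)
print(aic, p1, p2)
```

Because each regime contributes its own residual variance and its own penalty,
adding a lag in a rarely visited regime is judged against the fit in that regime
alone rather than against the whole sample.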

(e) If transactions costs are non-negligible, PPP may fail to hold: deviations
from PPP may appear to represent profitable trading opportunities, since the
law of one price is violated, but in practice they are unprofitable once these
costs are taken into account. A threshold model may therefore be useful, since
it would allow PPP not to hold when the deviations from PPP are small enough
that, given transactions costs, they could persist indefinitely,
while the PPP relationship would be restored if the deviations from it became
sufficiently large to warrant cross-border trading, which would restore
equilibrium. In the linear case with no thresholds, the PPP relationship can
be estimated for the current example using France and Germany (before the
advent of the euro) by:
$$\ln(fx_{F/G,t}) = \beta_0 + \beta_1 \ln(p_{F,t}) + \beta_2 \ln(p_{G,t}) + u_t$$
where $\ln(fx_{F/G,t})$ is the log of the exchange rate, expressed in French francs per
German mark, and $\ln(p_{F,t})$ and $\ln(p_{G,t})$ are the logs of the French and German
consumer price series respectively. We could define
$\ln(fx_{F/G,t}) - \ln(p_{F,t}) + \ln(p_{G,t})$ as the deviation from PPP (see Chapter 7). This
could be generalised to allow for a different relationship between the three
variables according to whether the deviation from PPP is larger than some
upper threshold value r, or smaller (more negative) than a lower threshold s:
$$
\ln(fx_{F/G,t}) =
\begin{cases}
\beta_0^{(1)} + \beta_1^{(1)} \ln(p_{F,t}) + \beta_2^{(1)} \ln(p_{G,t}) + u_{1t} & \text{if } \ln(fx_{F/G,t-1}) - \ln(p_{F,t-1}) + \ln(p_{G,t-1}) > r \\
\beta_0^{(2)} + \beta_1^{(2)} \ln(p_{F,t}) + \beta_2^{(2)} \ln(p_{G,t}) + u_{2t} & \text{if } s \le \ln(fx_{F/G,t-1}) - \ln(p_{F,t-1}) + \ln(p_{G,t-1}) \le r \\
\beta_0^{(3)} + \beta_1^{(3)} \ln(p_{F,t}) + \beta_2^{(3)} \ln(p_{G,t}) + u_{3t} & \text{if } \ln(fx_{F/G,t-1}) - \ln(p_{F,t-1}) + \ln(p_{G,t-1}) < s
\end{cases}
$$
where the superscript on each coefficient indexes the regime.
We could then consider the different values of the parameter estimates in each
of the 3 regimes (although we could not validly conduct hypothesis tests in the
usual way since it is very likely that the variables in the model are non-
stationary!). The values of the thresholds (r and s) could be imposed on the
basis of some assumed size of transactions costs, or they could be estimated.
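A minimal sketch of the regime assignment, assuming the thresholds r and s are
imposed rather than estimated. The function name and the numerical values are
hypothetical; the deviation is computed as log exchange rate minus log French
prices plus log German prices.

```python
import numpy as np

def ppp_regimes(log_fx, log_p_f, log_p_g, r, s):
    """Label each period t >= 1 by the size of the lagged PPP deviation.

    Returns an integer array: 1 (above r), 2 (between s and r), 3 (below s)."""
    dev = log_fx - log_p_f + log_p_g   # deviation from PPP
    lagged = dev[:-1]                  # deviation at t-1 decides regime at t
    return np.where(lagged > r, 1, np.where(lagged < s, 3, 2))

# Illustrative data only: prices held flat so the deviation equals log_fx.
log_fx = np.array([0.10, -0.10, 0.00, 0.02])
log_p_f = np.zeros(4)
log_p_g = np.zeros(4)

regimes = ppp_regimes(log_fx, log_p_f, log_p_g, r=0.05, s=-0.05)
print(regimes)
```

With the labels in hand, a separate regression could be run on the observations
falling in each regime, subject to the caveat about non-stationarity noted above.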
(f) The problem is essentially that the threshold is not identified under the
null hypothesis, under which the SETAR model collapses to a linear model with
the same lag lengths as were in each part of the SETAR. This means that the
usual asymptotic theory for testing hypotheses is not applicable, so that the
test statistics would not follow the distributions that we would have assumed
of them. There are procedures available for testing hypotheses such as this in
the context of TAR models, but these are quite complex; see Hansen (1996),
for example.
(g) It is tempting to think that more complex models are bound to produce
more accurate forecasts than simpler models, since the former should be able
to capture more of the relevant features of the data. However, this is certainly
not the case: complex models may have a tendency to fit sample-specific
features of the data that are not replicated during future (out-of-sample)
periods, and therefore lead to less accurate forecasts. This issue was discussed
in Chapter 5. The use of SETAR models for producing out-of-sample
predictions brings with it an additional problem, namely the possibility that the
regime in which the variable will reside during the forecast period will
be incorrectly predicted. If the SETAR model fits the data well, it is likely that
the behaviour will be quite different between the two regimes, and therefore
that the forecasts from each regime would also be different. This being the
case, forecasting the regime wrongly could be a major source of forecast
inaccuracy for the series, and in practice it is often very difficult to forecast the
regime that the series will be in with any reasonable accuracy. Thus, any
improvement in forecast accuracy from accurate prediction of the variable,
conditional upon a correct forecast of the regime that it will be in, is likely to
be more than outweighed by the cost of incorrectly forecasting the regime.
Overall, therefore, regime switching models have produced surprisingly poor
forecasts, even when they appear to fit the data very well; see Dacco and
Satchell (1999).
3. Both of these questions concern volatility dynamics rather than dynamics in
the conditional mean. Therefore, in both cases the appropriate answer would
be to use a model for volatility dynamics that also allows for varying
behaviour over time. For (i), a plausible model would be a GARCH with some
daily dummy variables included in the conditional variance equation, e.g.
$$y_t = \mu + \phi y_{t-1} + u_t, \quad u_t \sim N(0, \sigma_t^2)$$
$$\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \beta \sigma_{t-1}^2 + \gamma_1 D_{1t} + \gamma_2 D_{2t} + \gamma_3 D_{3t} + \gamma_4 D_{4t}$$
where $D_{1t}, \ldots, D_{4t}$ are Monday, ..., Thursday dummy variables. A more
sophisticated model could also allow the coefficients on the lagged squared
error or lagged conditional variance terms to vary across the days of the
week. If Monday volatility dynamics are different from other days of the week,
we would expect to see $\gamma_1$ significant.
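To make the role of the dummies concrete, the sketch below computes the
conditional variance path implied by the equation above for given parameter
values. It is not an estimation routine. The function name, the choice of
starting value, and the Greek-letter parameter names are assumptions made for
illustration.

```python
import numpy as np

def garch_variance_with_dummies(u, dummies, alpha0, alpha1, beta, gamma):
    """Conditional variance path for a GARCH(1,1) with additive day dummies.

    u:        residuals u_t, shape (T,)
    dummies:  0/1 matrix, shape (T, 4), columns = Monday to Thursday
    gamma:    dummy coefficients, shape (4,)"""
    T = len(u)
    sigma2 = np.empty(T)
    sigma2[0] = np.var(u)  # assumption: start at the sample variance of u
    for t in range(1, T):
        sigma2[t] = (alpha0 + alpha1 * u[t - 1] ** 2 + beta * sigma2[t - 1]
                     + dummies[t] @ gamma)
    return sigma2

# Build the dummies from a day-of-week index (0 = Monday, ..., 4 = Friday).
T = 10
day = np.arange(T) % 5
dummies = (day[:, None] == np.arange(4)).astype(float)

rng = np.random.default_rng(1)
u = rng.standard_normal(T)
sigma2 = garch_variance_with_dummies(
    u, dummies, alpha0=0.1, alpha1=0.1, beta=0.8,
    gamma=np.array([0.2, 0.0, 0.0, 0.0]))
print(sigma2)
```

Friday is the omitted category, so each gamma measures the shift in the variance
intercept on that day relative to Friday.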
For (ii), some sort of threshold model is required, and the question suggests
that the threshold variable is observable (and is the value of the previous day's
volatility). Thus, an appropriate model would be a GARCH model with a
threshold in the conditional variance that switches, e.g.
$$y_t = \mu + \phi y_{t-1} + u_t, \quad u_t \sim N(0, \sigma_t^2)$$
$$\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \beta \sigma_{t-1}^2 + \gamma_1 I_t$$
where $I_t = 1$ if $\sigma_{t-1}^2 > 0.1$, and zero otherwise. Note that the question does not
specify how volatility is defined, so it is assumed in this answer that it is
equated with conditional variance. Again, this dummy variable would only allow
the intercept in the conditional variance equation (and hence the level around
which the variance fluctuates) to vary according to the previous day's
volatility. A similar dummy variable could be
applied to the lagged squared error or lagged conditional variance terms to
allow them to vary with the size of the previous day's volatility.
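The threshold version differs from the day-of-week case in that the indicator
depends on the previous period's conditional variance, so it has to be built
inside the recursion rather than in advance. The sketch below shows this for
given parameter values. The function name, the default cutoff of 0.1 and the
optional starting value are illustrative assumptions.

```python
import numpy as np

def threshold_garch_variance(u, alpha0, alpha1, beta, gamma1,
                             cutoff=0.1, sigma2_init=None):
    """Conditional variance path with an intercept shift when the previous
    period's conditional variance exceeds the cutoff."""
    T = len(u)
    sigma2 = np.empty(T)
    # Assumption: start at the sample variance unless a value is supplied.
    sigma2[0] = np.var(u) if sigma2_init is None else sigma2_init
    for t in range(1, T):
        indicator = 1.0 if sigma2[t - 1] > cutoff else 0.0
        sigma2[t] = (alpha0 + alpha1 * u[t - 1] ** 2 + beta * sigma2[t - 1]
                     + gamma1 * indicator)
    return sigma2

# Illustrative call with zero shocks, so only the recursion itself is at work.
path = threshold_garch_variance(np.zeros(3), alpha0=0.01, alpha1=0.1,
                                beta=0.5, gamma1=0.05, sigma2_init=0.2)
print(path)
```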


Introductory Econometrics for Finance, Chris Brooks, 2008
