Associate Editors
Christopher F. Baum, Boston College
Nathaniel Beck, New York University
Rino Bellocco, Karolinska Institutet, Sweden, and University of Milano-Bicocca, Italy
Maarten L. Buis, WZB, Germany
A. Colin Cameron, University of California–Davis
Mario A. Cleves, University of Arkansas for Medical Sciences
William D. Dupont, Vanderbilt University
Philip Ender, University of California–Los Angeles
David Epstein, Columbia University
Allan Gregory, Queen's University
James Hardin, University of South Carolina
Ben Jann, University of Bern, Switzerland
Stephen Jenkins, London School of Economics and Political Science
Ulrich Kohler, University of Potsdam, Germany
Frauke Kreuter, Univ. of Maryland–College Park
Peter A. Lachenbruch, Oregon State University
Jens Lauritsen, Odense University Hospital
Stanley Lemeshow, Ohio State University
J. Scott Long, Indiana University
Roger Newson, Imperial College, London
Austin Nichols, Urban Institute, Washington DC
Marcello Pagano, Harvard School of Public Health
Sophia Rabe-Hesketh, Univ. of California–Berkeley
J. Patrick Royston, MRC Clinical Trials Unit, London
Philip Ryan, University of Adelaide
Mark E. Schaffer, Heriot-Watt Univ., Edinburgh
Jeroen Weesie, Utrecht University
Ian White, MRC Biostatistics Unit, Cambridge
Nicholas J. G. Winter, University of Virginia
Jeffrey Wooldridge, Michigan State University
The Stata Journal publishes reviewed papers together with shorter notes or comments, regular columns, book
reviews, and other material of interest to Stata users. Examples of the types of papers include 1) expository
papers that link the use of Stata commands or programs to associated principles, such as those that will serve
as tutorials for users first encountering a new field of statistics or a major new technique; 2) papers that go
beyond the Stata manual in explaining key features or uses of Stata that are of interest to intermediate
or advanced users of Stata; 3) papers that discuss new commands or Stata programs of interest either to
a wide spectrum of users (e.g., in data management or graphics) or to some large segment of Stata users
(e.g., in survey statistics, survival analysis, panel analysis, or limited dependent variable modeling); 4) papers
analyzing the statistical properties of new or existing estimators and tests in Stata; 5) papers that could
be of interest or usefulness to researchers, especially in fields that are of practical importance but are not
often included in texts or other journals, such as the use of Stata in managing datasets, especially large
datasets, with advice from hard-won experience; and 6) papers of interest to those who teach, including Stata
with topics such as extended examples of techniques and interpretation of results, simulations of statistical
concepts, and overviews of subject areas.
The Stata Journal is indexed and abstracted by CompuMath Citation Index, Current Contents/Social and Behav-
ioral Sciences, RePEc: Research Papers in Economics, Science Citation Index Expanded (also known as SciSearch),
Scopus, and Social Sciences Citation Index.
For more information on the Stata Journal, including information for authors, see the webpage
http://www.stata-journal.com
Subscriptions are available from StataCorp, 4905 Lakeway Drive, College Station, Texas 77845, telephone
979-696-4600 or 800-STATA-PC, fax 979-696-4601, or online at
http://www.stata.com/bookstore/sj.html
Subscription rates listed below include both a printed and an electronic copy unless otherwise mentioned.
http://www.stata.com/bookstore/sjj.html
Individual articles three or more years old may be accessed online without charge. More recent articles may
be ordered online.
http://www.stata-journal.com/archives.html
The Stata Journal is published quarterly by the Stata Press, College Station, Texas, USA.
Address changes should be sent to the Stata Journal, StataCorp, 4905 Lakeway Drive, College Station, TX
77845, USA, or emailed to sj@stata.com.
Copyright © 2014 by StataCorp LP
Copyright Statement: The Stata Journal and the contents of the supporting files (programs, datasets, and
help files) are copyright © by StataCorp LP. The contents of the supporting files (programs, datasets, and
help files) may be copied or reproduced by any means whatsoever, in whole or in part, as long as any copy
or reproduction includes attribution to both (1) the author and (2) the Stata Journal.
The articles appearing in the Stata Journal may be copied or reproduced as printed copies, in whole or in part,
as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal.
Written permission must be obtained from StataCorp if you wish to make electronic copies of the insertions.
This precludes placing electronic copies of the Stata Journal, in whole or in part, on publicly accessible websites,
fileservers, or other locations where the copy may be accessed by anyone other than the subscriber.
Users of any of the software, ideas, data, or other materials published in the Stata Journal or the supporting
files understand that such use is made without warranty of any kind, by either the Stata Journal, the author,
or StataCorp. In particular, there is no warranty of fitness of purpose or merchantability, nor for special,
incidental, or consequential damages such as loss of profits. The purpose of the Stata Journal is to promote
free communication among Stata users.
The Stata Journal, electronic version (ISSN 1536-8734) is a publication of Stata Press. Stata, Stata
Press, Mata, and NetCourse are registered trademarks of StataCorp LP.
Volume 14 Number 3 2014
Abstract. In this article, I present ivtreatreg, a command for fitting four different binary treatment models with and without heterogeneous average treatment effects under selection-on-unobservables (that is, treatment endogeneity). Depending on the model specified by the user, ivtreatreg provides consistent estimation of average treatment effects by using instrumental-variables estimators and a generalized two-step Heckman selection model. The added value of this new command is that it allows for generalization of the regression approach typically used in standard program evaluation by assuming heterogeneous response to treatment. It also serves as a sort of toolbox for conducting joint comparisons of different treatment methods, thus readily permitting checks on the robustness of results.
Keywords: st0346, ivtreatreg, microeconometrics, treatment models, instrumental variables, unobservable selection, treatment endogeneity, heterogeneous treatment response
1 Introduction
It is increasingly recognized as good practice to perform ex-post evaluation of economic and social programs through counterfactual evidence-based statistical analysis. Such analysis is particularly important at the policy-making level. The statistical approach is usually applied to measuring the causal effects of an intervention on the part of an external authority, such as a local or national government, on a set of subjects targeted by a given program, such as individuals and companies. Similar analysis is also becoming popular in reassessing causal relations among factors identified under modern microeconometric theory from a counterfactual perspective but not necessarily regarding policy implications.
Several official Stata commands and new user-written commands have been applied to enlarge the set of available statistical tools for conducting these counterfactual analyses. Table 1 contains a list of commands for estimating binary treatment effects. However, the most recent release of Stata, version 13, provides a new far-reaching suite called teffects, which can be used to estimate treatment effects from observational data.
The teffects command can be used to estimate potential-outcome means and average treatment effects (ATEs). As shown in table 2, the teffects suite covers a large set of methods, such as regression adjustment; inverse-probability weighting; doubly robust methods, including inverse-probability-weighted regression adjustment and augmented inverse-probability weighting; and matching on the propensity score or covariates (with nearest neighbors). Other subcommands can be used for postestimation purposes and for testing reliability of results; for example, overlap allows for plotting the estimated densities of the probability of getting each treatment level.
Table 2. Stata 13 teffects subcommands for estimating treatment effects from observational data
Subcommand Description
aipw Augmented inverse-probability weighting
ipw Inverse-probability weighting
ipwra Inverse-probability-weighted regression adjustment
nnmatch Nearest-neighbor matching
overlap Overlap plots
psmatch Propensity-score matching
ra Regression adjustment
When applying teffects, the outcome models can be continuous, binary, count, or
nonnegative. Binary outcomes can be modeled using logit, probit, or heteroskedastic
probit regression, and count and nonnegative outcomes can be modeled using Poisson
regression. The treatment model can be binary or multinomial. Binary treatments
can be modeled using logit, probit, or heteroskedastic probit regression. For multino-
mial treatments, one can use pairwise comparisons and then exploit binary treatment
approaches.1
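As a brief illustration (not drawn from this article's data), a minimal teffects session on Stata's example dataset cattaneo2 might look like the following sketch; the variable names belong to that example dataset, and the specification is arbitrary.
. webuse cattaneo2
. teffects ipw (bweight) (mbsmoke mmarried mage prenatal1, probit)
. teffects overlap
Here teffects ipw estimates the ATE of smoking on birthweight by inverse-probability weighting, and teffects overlap plots the estimated propensity-score densities as a postestimation check.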
While the teffects command deals mainly with estimation methods suitable under selection-on-observables, Stata 13 presents two further commands to deal with endogenous binary treatment (occurring in the case of selection-on-unobservables): etregress and etpoisson. etregress estimates the ATE and the other parameters of a linear regression model augmented with an endogenous binary treatment variable. Basically, etregress is an improvement on Stata's treatreg command, whose estimation is based on the Heckman (1978) selection model. Because such a model is fully parametric, estimation can be performed either by full maximum likelihood or, less parametrically, by a two-step consistent estimator. Similarly, etpoisson estimates an endogenous binary treatment model when the outcome is a count variable by using a Poisson regression. Both the ATE and the ATE on the treated (ATET) can be estimated by etpoisson.
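For comparison, minimal calls to these two official commands might look as follows; y, x1, x2, w, and z are hypothetical variable names, and the specifications are only sketches.
. etregress y x1 x2, treat(w = x1 x2 z) twostep
. etpoisson y x1 x2, treat(w = x1 x2 z)
The twostep option requests the two-step consistent estimator of etregress; etpoisson is estimated by maximum likelihood only.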
Although Stata 13 offers the above commands for dealing with endogenous treatment, the commands suffer from two important limitations. First, they assume joint normality of errors, meaning that they are not robust to violation of this hypothesis. Second, they do not allow, at least by default, for calculation of causal effects under observable heterogeneity, meaning that they assume causal effects to be the same in the subpopulations of treated and untreated units. This second limitation might be partially
overcome by introducing interactions between the binary treatment and the covariates in the outcome equation, but this requires further user programming to recover all the parameters of interest.
1. For multinomial treatment, readers can refer to the user-written command poparms, which estimates multivalued treatment effects under conditional independence by using efficient semiparametric estimation of multivalued treatment effects. See Cattaneo (2010) and Cattaneo, Drukker, and Holland (2013) for tutorials.
The gsem command, also new in Stata 13, can estimate the causal parameters of models with selection-on-unobservables, implemented as unobserved components, and heterogeneous effects, implemented as random coefficients. However, gsem uses full-information maximum likelihood (ML), thus assuming a fully specified parametric model, which in some contexts could present questionable reliability.
The ivtreatreg command I present in this article implements a series of methods for treatment-effects estimation under treatment endogeneity that use only conditional-moment restrictions. These methods are more robust than those implemented by etregress or gsem. ML estimators would naturally be more efficient under correct specification, and this means that a trade-off may arise between robustness and efficiency. On the one hand, assuming some parametric distributional form for the error terms allows one to use ML estimation, reaching the Cramér–Rao lower variance bound. On the other hand, when these distributional assumptions are questionable, ML may be less reliable than less efficient (but consistent) estimation procedures, and the latter become more robust. Thus it seems useful to adopt distribution-free methods for dealing with treatment endogeneity, which the ivtreatreg command makes possible.
ivtreatreg fits four binary treatment models with and without idiosyncratic or heterogeneous ATEs.2 Depending on the model specified by the user, ivtreatreg provides consistent estimation of ATEs under the hypothesis of selection-on-unobservables by using IV and a generalized Heckman-style selection model.
Conditional on a prespecified subset of exogenous variables, x, thought of as driving the heterogeneous response to treatment, ivtreatreg calculates the ATE, the ATET, and the ATE on the nontreated (ATENT) for each called model, as well as the estimates of these parameters conditional on the observable factors x.
Specifically, the four models fit by ivtreatreg are direct-2sls (IV regression fit by direct two-stage least squares), probit-ols (IV two-step regression fit by probit and OLS), probit-2sls (IV regression fit by probit and two-stage least squares), and heckit (Heckman two-step selection model).
Extensive discussion of the conditions under which the previous methods provide consistent estimation of ATE, ATET, and ATENT can be found in Wooldridge (2010). ivtreatreg provides value by allowing for generalization of the regression approach typically employed in standard program evaluation by assuming heterogeneous response to treatment and treatment endogeneity. It is also a sort of toolbox for conducting joint comparisons of different treatment methods, thus readily permitting the researcher to run checks on the robustness of results.
In sections 2 and 3 of this article, I briefly present the statistical framework and estimation methods implemented by ivtreatreg. In section 4, I present the syntax with a description of the help file, and in section 5, I conduct a Monte Carlo experiment to test the reliability of ivtreatreg. In section 6, I demonstrate the command applied to real data from a study of the relationship between education and fertility. I conclude with section 7, where I provide a brief summary and affirm the value of ivtreatreg. In the appendix, I derive the formulas for the selection model.
2. To my knowledge, no previous Stata command has addressed this objective.
2 Statistical framework3
Our hypothetical evaluation objective is to estimate the effect of binary treatment w (taking value 1 for treated and 0 for untreated units) on scalar outcome y.4 We suppose that the assignment to treatment is not random but instead due to some form of the units' self-selection or external selection. For each unit, (y1, y0) denotes the two potential outcomes,5 where the outcome is y1 when the individual is treated and y0 when the individual is not treated. We then collect an independent and identically distributed sample of observations (yi, wi, xi) with i = 1, . . . , N, where x is a row vector of covariates hypothesized as driving the observable nonrandom assignment to treatment (confounders).
Here we are interested in estimating the ATE, defined as
ATE = E(y1 − y0)
If we rely on observational data alone, we cannot identify the ATE because, for the same individual and at the same time, we can observe just one of the two quantities needed to calculate the ATE (Holland 1986). By restricting the analysis to the group of treated units, we can also define a second causal parameter, the ATET, as
ATET = E(y1 − y0 | w = 1)
Similarly, the ATENT, meaning the ATE calculated within the subsample of untreated units, is
ATENT = E(y1 − y0 | w = 0)
3. This section draws on the substantial literature on econometrics of program evaluation, such as
Rubin (1974), Angrist (1991), Angrist, Imbens, and Rubin (1996), Heckman, LaLonde, and Smith
(1999), Wooldridge (2010), and Cattaneo (2010). For a recent survey, see also Imbens and
Wooldridge (2009).
4. Notation follows Wooldridge (2010).
5. For simplicity, I avoid writing the subscript form of the unit i when referring to population param-
eters.
These parameters are linked by the relation ATE = ATET × p(w = 1) + ATENT × p(w = 0), where p(w = 1) is the probability of being treated and p(w = 0) is the probability of being untreated. Where x is known, we can also define the previous parameters conditional on x as follows:
ATE(x) = E(y1 − y0 | x)
ATET(x) = E(y1 − y0 | w = 1, x)
ATENT(x) = E(y1 − y0 | w = 0, x)
These quantities are functions of x, which means that they can be seen as individual-specific ATEs because each individual owns a specific value of x. Furthermore, by the law of iterated expectations, we have
ATE = Ex{ATE(x)}
ATET = Ex{ATET(x)}
ATENT = Ex{ATENT(x)}
The analyst needs to recover consistent (and, when possible, efficient) estimators of the previous parameters from observational data. Before going on, note that throughout this article we assume that the stable unit treatment value assumption (Rubin 1978) holds. This assumption states that the treatment received by one unit does not affect other units' outcomes (Cox 1958). We thus restrict the analysis to a no-interference setting. Indeed, when the stable unit treatment value assumption does not hold, treatment externality effects between units may occur and pose severe problems in identifying effects.6
3 Estimation methods
The new command ivtreatreg implements four models to consistently estimate the previous parameters, and three of these are IV estimators. These methods are direct-2sls (IV regression estimated by direct two-stage least squares), probit-ols (IV two-step regression estimated by probit and OLS), probit-2sls (IV regression estimated by probit and two-stage least squares), and heckit (Heckman two-step selection model). Each of these can be estimated by assuming either homogeneous or heterogeneous response to treatment (for a total of eight models). Before presenting how ivtreatreg works, I briefly set out the formulas, conditions, and procedures of each model (see Wooldridge [2010, chap. 21]). We start by assuming that
y1 = μ1 + xβ1 + e1 (1)
y0 = μ0 + xβ0 + e0 (2)
y = y0 + w(y1 − y0) (3)
6. Treatment-effects estimation under interference between units is a challenging field of study. Sobel (2006), Rosenbaum (2007), and Hudgens and Halloran (2008) offer important contributions on correct inference within such a setting.
Equations (1) and (2) represent the potential-outcome equations, assumed to be linear in parameters, while the vector x can also contain nonlinear functions of the various covariates. Equation (3) is the so-called potential-outcome model and expresses the observational rule of the model, because y is the observed outcome. We do not need to explicitly specify an equation for w (that is, a selection equation) in this model; however, we could specify one. We could assume, for instance, that a linear probability model for the propensity to be selected into treatment is
w = η0 + xη1 + a (4)
where a is an error component. As long as a is uncorrelated with (e1, e0), then (4) is redundant and not needed to identify the causal parameters. However, we must know w to identify the causal parameters, as we will discuss later. By substituting (1) and (2) into (3), we get
y = μ0 + (μ1 − μ0)w + xβ0 + w(xβ1 − xβ0) + e0 + w(e1 − e0) (5)
where β0 ≠ β1 implies observable heterogeneity and e1 ≠ e0 implies unobservable heterogeneity.
Next, we define ν = e0 + w(e1 − e0). We can distinguish two cases: 1) e1 = e0 and 2) e1 ≠ e0, which can in turn be split into the following subcases:
Case 1.1. e1 = e0 = e, β0 = β1 = β, E(e | x, w) = 0: unobservable homogeneity, homogeneous reaction function of y0 and y1 to x, treatment exogeneity.
In this case, we can show that
E(y | w, x) = μ0 + w ATE + xβ
ATE = ATE(x) = ATET = ATET(x) = ATENT = ATENT(x) = μ1 − μ0
Thus no heterogeneous ATE (over x) exists. Furthermore, OLS consistently estimates ATE.
Case 1.2. e1 = e0 = e, β0 ≠ β1, E(e | x, w) = 0: unobservable homogeneity, heterogeneous reaction function of y0 and y1 to x, treatment exogeneity. Defining δ = β1 − β0 and letting x̄ = E(x), in this case we have
ATE = (μ1 − μ0) + x̄δ
ATE(x) = ATE + (x − x̄)δ
ATET = ATE + E{(x − x̄)δ | w = 1}
ATET(x) = ATE + {(x − x̄)δ | w = 1}
ATENT = ATE + E{(x − x̄)δ | w = 0}
ATENT(x) = ATE + {(x − x̄)δ | w = 0}
The corresponding sample estimators are obtained by replacing α (the coefficient of w) and δ (the coefficients of the interactions) with their OLS estimates α̂_OLS and δ̂_OLS:
ATE = α̂_OLS
ATE(x) = α̂_OLS + (x − x̄)δ̂_OLS
ATET = α̂_OLS + {1/(Σ_i wi)} Σ_i wi(xi − x̄)δ̂_OLS
ATET(x) = {α̂_OLS + (x − x̄)δ̂_OLS}_(w=1)
ATENT = α̂_OLS + {1/(Σ_i (1 − wi))} Σ_i (1 − wi)(xi − x̄)δ̂_OLS
ATENT(x) = {α̂_OLS + (x − x̄)δ̂_OLS}_(w=0)
where it is clear that, under treatment exogeneity, these parameters can be consistently estimated by plugging in the parameters from an OLS of (5); a minimal sketch of this plug-in regression follows.
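As a hedged illustration with two hypothetical confounders x1 and x2, a treatment w, and an outcome y, the plug-in OLS could be run as follows; the coefficient on 1.w then estimates the ATE, and ATET and ATENT follow by averaging the interaction terms over the treated and untreated units.
. * demean the confounders so that the coefficient on w equals the ATE
. quietly summarize x1
. generate double x1_dm = x1 - r(mean)
. quietly summarize x2
. generate double x2_dm = x2 - r(mean)
. regress y i.w c.x1 c.x2 i.w#c.x1_dm i.w#c.x2_dm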
But what happens when treatment exogeneity fails and w becomes endogenous? We
then have three subcases.
Case 2.1. e1 = e0 = e, β0 = β1 = β, E(e | x, w) ≠ 0: unobservable homogeneity, homogeneous reaction function of y0 and y1 to x, treatment endogeneity.
In this case, we can show that
E(y | w, x) = μ0 + w ATE + xβ0
ATE = ATET = ATENT
The corresponding sample estimators are obtained by replacing the OLS estimates with their IV counterparts α̂_IV and δ̂_IV:
ATE = α̂_IV
ATE(x) = α̂_IV + (x − x̄)δ̂_IV
ATET = α̂_IV + {1/(Σ_i wi)} Σ_i wi(xi − x̄)δ̂_IV
ATET(x) = {α̂_IV + (x − x̄)δ̂_IV}_(w=1)
ATENT = α̂_IV + {1/(Σ_i (1 − wi))} Σ_i (1 − wi)(xi − x̄)δ̂_IV
ATENT(x) = {α̂_IV + (x − x̄)δ̂_IV}_(w=0)
To apply IV and get consistent estimation, this case requires a further orthogonality condition,
E{w(e1 − e0) | x, z} = E{w(e1 − e0)} (7)
Given this condition, estimation may proceed as in Case 2.2.
Next, I present the methods implemented by ivtreatreg by referring to the case of heterogeneous reaction.
2. Plug these estimated parameters into the sample formulas and recover all the causal effects.
However, ivtreatreg does not fit such a model, because it can be more robustly obtained by using the regression-adjustment estimator implemented in the teffects command of Stata 13 (with the subcommand ra). This command handles many functional forms other than the linear one, and an estimate of ATENT can also be obtained using the margins command after running the regression in step 1. For this reason, ivtreatreg concentrates on the endogenous treatment-effect case, for which it adds new tools.
direct-2sls
By using direct-2sls, the analyst does not consider the binary nature of w. This method follows the typical IV steps:
1. Run an OLS regression of w on {1, x, z} and obtain the fitted values w_fv,i.
2. Run a second OLS of y on {x, w_fv,i, w_fv,i(x − x̄)}. The coefficient of w_fv,i is a consistent estimate of the ATE.
3. Plug these estimated parameters into the sample formulas, recover all the other causal effects, and obtain standard errors for ATET and ATENT via bootstrap.
probit-ols
In this case, the analyst exploits the binary nature of w by fitting a probit regression in the first step. Operationally, probit-ols follows these three steps:
probit-2sls
1. Fit a probit of w on {1, x, z} and obtain the predicted probabilities p̂_w,i.
2. Run an OLS of w on (1, x, p̂_w), thus getting the fitted values w2_fv,i.
3. Run an OLS of y on {x, w2_fv,i, w2_fv,i(x − x̄)}.
The coefficient of w2_fv,i is a more efficient estimator of the ATE than that of direct-2sls. Furthermore, to achieve consistency, this procedure does not require that the process generating w be correctly specified; thus, it is more robust than probit-ols.
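For the homogeneous-effects case, the probit-2sls logic can be sketched with official commands; y, w, x1, x2, and the instrument z are hypothetical variable names. Using the probit fitted probability as the excluded instrument reproduces the two OLS stages described above.
. probit w x1 x2 z
. predict double pw, pr
. ivregress 2sls y x1 x2 (w = pw)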
3.3 heckit
ivtreatreg considers a generalized heckit model to consistently estimate the previous parameters without using an IV. The price is that of relying on a trivariate normality assumption between the error terms of the potential outcomes and the error term of the treatment. However, this model has the advantage of fitting Case 2.3 without invoking (7). The reference model is again the system of (1)–(4), where we also assume that (e0, e1, a) are trivariate normal. Such a model, as implemented by ivtreatreg, generalizes the two-step option of the official Stata command treatreg.
By default, the treatreg command assumes neither observable heterogeneity (because it holds that β0 = β1) nor unobservable heterogeneity (because it holds that e1 = e0). When these two assumptions are removed, the model leads to the following
baseline regression function, which can be consistently estimated by OLS (see Wooldridge [2010, 949]):
E(y | x, z, w) = μ0 + αw + xβ0 + w(x − x̄)δ + ρ1 w φ(q)/Φ(q) + ρ0 (1 − w) φ(q)/{1 − Φ(q)}
where α is the ATE, ρ1 and ρ0 are the correlations between the two potential outcomes' errors and the treatment's error, and φ(·) and Φ(·) are the standard normal density and cumulative distribution function, respectively. To estimate the previous regression, ivtreatreg performs the following two-step procedure:
1. Fit a probit of wi on (1, xi, zi), obtain q̂i, and compute the correction terms wi φ(q̂i)/Φ(q̂i) and (1 − wi) φ(q̂i)/{1 − Φ(q̂i)}.
2. Run an OLS of yi on {1, wi, xi, wi(xi − x̄), wi φ(q̂i)/Φ(q̂i), (1 − wi) φ(q̂i)/{1 − Φ(q̂i)}}.
In this model, the ATE and ATE(x) retain the forms
ATE = α
ATE(x) = α + (x − x̄)δ
although ATET(x), ATET, ATENT(x), and ATENT assume different forms compared with the previous models, specifically7
ATET(x) = {α + (x − x̄)δ + (ρ0 + ρ1) λ1(q)}_(w=1)
ATET = α + {1/(Σ_i wi)} Σ_i wi(xi − x̄)δ + (ρ1 + ρ0) {1/(Σ_i wi)} Σ_i wi λ1i(q)
and
ATENT(x) = {α + (x − x̄)δ + (ρ1 + ρ0) λ0(q)}_(w=0)
ATENT = α + {1/(Σ_i (1 − wi))} Σ_i (1 − wi)(xi − x̄)δ + (ρ0 + ρ1) {1/(Σ_i (1 − wi))} Σ_i (1 − wi) λ0i(q)
where λ1(q) = φ(q)/Φ(q) and λ0(q) = φ(q)/{1 − Φ(q)}.
4.1 Syntax
ivtreatreg outcome treatment [varlist] [if] [in] [weight], model(modeltype) [hetero(varlist_h) iv(varlist_iv) conf(#) graphic vce(vcetype) beta const(noconstant) head(noheader)]
where outcome specifies the target variable that is the object of the evaluation, treatment specifies the binary treatment variable (that is, 1 = treated or 0 = untreated), and varlist defines the list of exogenous variables that are considered as observable confounders.
fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight.
4.2 Options
model(modeltype) specifies the treatment model to be fit, where modeltype must be one of the following four models (described in sections 3.3 and 3.4 above): direct-2sls, probit-2sls, probit-ols, or heckit. model() is required.
modeltype Description
direct-2sls IV regression fit by direct two-stage least squares
probit-2sls IV regression fit by probit and two-stage least squares
probit-ols IV two-step regression fit by probit and OLS
heckit Heckman two-step selection model
hetero(varlist_h) specifies the list of variables over which to calculate the idiosyncratic ATE(x), ATET(x), and ATENT(x), where x = varlist_h. When this option is not specified, the command fits the specified model without heterogeneous ATE. varlist_h should be the same set as, or a subset of, the variables specified in varlist.
iv(varlist_iv) specifies the variables to be used as instruments. This option is required with model(direct-2sls); it is optional with the other modeltypes.
conf(#) sets the confidence level to the specified number. The default is conf(95).
4.3 Remarks
The ivtreatreg command also creates several variables that can be used to further examine the data:
z_varname_h are the IVs used in a model's regression when hetero(varlist_h) and iv(varlist_iv) are specified. z_varname_h are created only for IV models.
where
x1 = ln(h1)
x2 = ln(h2)
z = ln(h3)
h1 = χ²(1) + c
h2 = χ²(1) + c
h3 = χ²(1) + c
c = χ²(1)
and
(a, e0, e1) ~ N(0, Σ)
where Σ has diagonal elements σ²_a, σ²_e0, and σ²_e1 and off-diagonal elements σ_a,e0 = ρ_a,e0 σ_a σ_e0, σ_a,e1 = ρ_a,e1 σ_a σ_e1, and σ_e0,e1 = ρ_e0,e1 σ_e0 σ_e1, with
σ²_a = 1, σ²_e0 = 3, σ²_e1 = 6.5
ρ_a,e0 = 0.5, ρ_a,e1 = 0.3, ρ_e0,e1 = 0
By assuming that the correlation between a and e0 (ρ_a,e0) and the correlation between a and e1 (ρ_a,e1) are different from 0, w, the selection binary indicator, is endogenous. We indicate the instrument with z, which is directly correlated with w but directly uncorrelated with y1 and y0. Given these assumptions, the DGP is completed by the observational rule yi = y0i + wi(y1i − y0i), generating the observed outcome y.
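A rough Stata sketch of one draw from such a DGP follows. The error covariance matrix uses the values above (covariances 0.5·√3 ≈ 0.866 and 0.3·√6.5 ≈ 0.765), but the selection- and outcome-equation coefficients are placeholders, because the article's exact DGP equations are not reproduced in this extract.
. clear
. set obs 2000
. matrix C = (1, .866, .765 \ .866, 3, 0 \ .765, 0, 6.5)
. drawnorm a e0 e1, cov(C)
. generate double c  = rchi2(1)
. generate double x1 = ln(rchi2(1) + c)
. generate double x2 = ln(rchi2(1) + c)
. generate double z  = ln(rchi2(1) + c)
. generate byte   w  = (0.2*x1 + 0.2*x2 + 0.8*z + a >= 0)   // placeholder selection equation
. generate double y0 = 1 + 0.5*x1 + 0.5*x2 + e0             // placeholder outcome equations
. generate double y1 = 1.3 + 0.6*x1 + 0.5*x2 + e1
. generate double y  = y0 + w*(y1 - y0)
. ivtreatreg y w x1 x2, hetero(x1 x2) iv(z) model(probit-2sls)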
The DGP is simulated 2,000 times using a sample size of 2,000. For each simulation, we get a different data matrix (x1, x2, y, w, z), to which we apply the four models implemented by ivtreatreg. Table 3 and figure 1 set out the simulation results.
We see that the true value of ATE is 0.224. As expected, all the IV procedures consistently estimate the true ATE, with a slight bias of around 5% only for direct-2sls. Figure 1 confirms these findings by jointly plotting the distributions of the ATEs obtained by each single method over the 2,000 DGP simulations. All methods give similar results, though direct-2sls has a slightly different shape with fatter tails. This suggests that we should examine the estimation precision. Under our DGP assumptions, we expect model heckit to be the most efficient method, followed by model probit-ols and model probit-2sls, with model direct-2sls performing the worst. In fact, our DGP follows exactly the same assumptions on which the model heckit is based, including the joint trivariate normality of a, e0, and e1.
Figure 1. Kernel density of the simulated ATE estimates for the four models
Table 3 confirms the following theoretical predictions: the lowest standard deviation is achieved by model heckit (0.248) and the highest by model direct-2sls (0.316), with the other methods lying in the middle with no appreciable differences. Observe that the standard-error means (mean SE in column 4) show that the values of the standard deviations of the estimators in column 3 are estimated precisely (the values are much the same). This means that the asymptotic distribution of the ATE estimators approximates the finite-sample distribution well.
Table 3 also shows simulation results for test size. The size of a test is the probability of rejecting a hypothesis H0 when H0 is true. In our DGP, we set the size level at 0.05 for a two-sided test of H0: ATE = 0.224 against the alternative H1: ATE ≠ 0.224. The results, under the heading Rejection rate (column 5), represent the proportion of simulations that lead to rejection of H0. These values should be interpreted as the simulation estimate of the true test size (which we assumed to be 0.05). As expected, the rejection rates are all lower than the usual 5% significance level.
In conclusion, these results seem to confirm both our expected theoretical results and the computational reliability of the ivtreatreg command.
This specification adopts the covariate frsthalf as the IV; frsthalf takes value 1 if the woman was born in the first six months of the year and 0 otherwise. This variable is partially correlated with educ7, but it should not have any direct relationship with the number of children in the family.
The simple difference-in-means estimator (the mean number of children among the treated, who are the more educated women, minus the mean among the untreated, the less educated women) is −1.77 with a t-value of 28.46 in absolute value. This means that women with more education have about two children fewer than women with less education, without ceteris paribus conditions. By adding confounding factors to the regression specification, we get the OLS estimate of ATE as −0.394 with a t-value of 7.94 in absolute value, still in the absence of heterogeneous treatment. This is still significant, but the magnitude, as expected, dropped considerably compared with the difference-in-means estimate, thus showing that confounders are relevant. When we consider OLS estimation with heterogeneity, we get an ATE equal to −0.37, which is still significant at 1%.9
When we consider IV estimation, results change dramatically. As we did in our working example of how to use ivtreatreg, we estimate the previous specification for probit-2sls with heterogeneous treatment response. The main output is reported below, where results from both the probit first step and the IV regression of the second step are set out. Results of the probit show that frsthalf is partially correlated with educ7, so it can be reliably used as an instrument for this variable. Step 2 shows that the ATE (again, the coefficient of educ7) is no longer significant and that it changes sign, becoming positive and equal to 0.30.
. use fertil2.dta
. ivtreatreg children educ7 age agesq evermarr urban electric tv,
> hetero(age agesq evermarr urban) iv(frsthalf) model(probit-2sls) graphic
(output omitted )
Probit regression Number of obs = 4358
LR chi2(7) = 1130.84
Prob > chi2 = 0.0000
Log likelihood = -2428.384 Pseudo R2 = 0.1889
(output omitted )
(output omitted )
This result is in line with the IV estimation obtained by Wooldridge (2010). Nevertheless, having assumed heterogeneous response to treatment, we can now also calculate the ATET and ATENT and inspect the cross-unit distribution of these effects. First, ivtreatreg returns these parameters as scalars (along with the treated and untreated sample sizes).
. ereturn list
scalars:
(output omitted )
e(ate) = .3004007409051661
e(atet) = .898290019586237
e(atent) = -.4468834318294228
e(N_tot) = 4358
e(N_treat) = 2421
e(N_untreat) = 1937
(output omitted )
To get the standard errors for testing the significance of ATET and ATENT, we can implement a bootstrap procedure along the following lines (the number of replications here is only illustrative):
. bootstrap atet=e(atet) atent=e(atent), reps(100): ivtreatreg children educ7 age agesq
> evermarr urban electric tv, hetero(age agesq evermarr urban) iv(frsthalf)
> model(probit-2sls)
The results show that neither ATET nor ATENT is significant; the two estimates differ considerably from each other but are not far from that of ATE. Furthermore, a simple check confirms that ATE = ATET × p(w = 1) + ATENT × p(w = 0), as the calculation below shows. Finally, we analyze the distributions of ATE(x), ATET(x), and ATENT(x); figure 2 shows the result.
Figure 2. Kernel densities of ATE(x), ATET(x), and ATENT(x)
Finally, figure 3 shows the plot of the ATE(x), ATET(x), and ATENT(x) distributions for each method. These distributions largely follow a similar pattern, although direct-2sls and heckit show some appreciable differences. heckit, in particular, shows a very different pattern, with a strong demarcation between the plots for treated and untreated units. Consequently, it appears not to be a reliable estimation procedure here, an observation that deserves further inspection.
Figure 3. Distribution of ATE(x), ATET(x), and ATENT(x) for the four models fit by ivtreatreg
7 Conclusion
In this article, I presented a new user-written Stata command, ivtreatreg, for fitting four different binary treatment models with and without idiosyncratic or heterogeneous ATEs. Depending on the model specified, ivtreatreg consistently estimates ATEs under the hypothesis of selection-on-unobservables by exploiting IV estimators and a generalized two-step Heckman selection model.
After presenting the statistical framework, I provided evidence on the reliability of ivtreatreg by using a Monte Carlo experiment. To familiarize the reader with the command, I also applied it to a real dataset. Results from both the Monte Carlo experiment and the real dataset encourage one to use the command when the empirical and theoretical setting suggests that treatment endogeneity and heterogeneous response to treatment are present. In such cases, performing more than one method may be a useful robustness check. The ivtreatreg command makes such checks possible and easy to perform, as sketched below.
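As a sketch of such a robustness check on the fertility data used in section 6, one could simply refit the same specification under two different models and compare the estimated ATEs (the specification repeats the one used above):
. ivtreatreg children educ7 age agesq evermarr urban electric tv,
> hetero(age agesq evermarr urban) iv(frsthalf) model(probit-2sls)
. ivtreatreg children educ7 age agesq evermarr urban electric tv,
> hetero(age agesq evermarr urban) iv(frsthalf) model(heckit)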
8 References
Abadie, A., D. Drukker, J. L. Herr, and G. W. Imbens. 2004. Implementing matching estimators for average treatment effects in Stata. Stata Journal 4: 290–311.
Angrist, J. D. 1991. Instrumental variables estimation of average treatment effects in econometrics and epidemiology. NBER Technical Working Paper No. 0115. http://www.nber.org/papers/t0115.
Angrist, J. D., G. W. Imbens, and D. B. Rubin. 1996. Identification of causal effects using instrumental variables. Journal of the American Statistical Association 91: 444–455.
Austin, N. A. 2007. rd: Stata module for regression discontinuity estimation. Statistical Software Components S456888, Department of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s456888.html.
Becker, S. O., and A. Ichino. 2002. Estimation of average treatment effects based on propensity scores. Stata Journal 2: 358–377.
Cattaneo, M. D. 2010. Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics 155: 138–154.
Cattaneo, M. D., D. M. Drukker, and A. D. Holland. 2013. Estimation of multivalued treatment effects under conditional independence. Stata Journal 13: 407–450.
Cerulli, G. 2014. treatrew: A user-written command for estimating average treatment effects by reweighting on the propensity score. Stata Journal 14: 541–561.
Cox, D. R. 1958. Planning of Experiments. New York: Wiley.
Heckman, J. J. 1978. Dummy endogenous variables in a simultaneous equation system. Econometrica 46: 931–959.
Heckman, J. J., R. J. LaLonde, and J. A. Smith. 1999. The economics and econometrics of active labor market programs. In Handbook of Labor Economics, ed. O. Ashenfelter and D. Card, vol. 3A, 1865–2097. Amsterdam: Elsevier.
Holland, P. W. 1986. Statistics and causal inference. Journal of the American Statistical Association 81: 945–960.
Hudgens, M. G., and M. E. Halloran. 2008. Toward causal inference with interference. Journal of the American Statistical Association 103: 832–842.
Imbens, G. W., and J. M. Wooldridge. 2009. Recent developments in the econometrics of program evaluation. Journal of Economic Literature 47: 5–86.
Leuven, E., and B. Sianesi. 2003. psmatch2: Stata module to perform full Mahalanobis and propensity score matching, common support graphing, and covariate imbalance testing. Statistical Software Components S432001, Department of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s432001.html.
Rubin, D. B. 1978. Bayesian inference for causal effects: The role of randomization. Annals of Statistics 6: 34–58.
Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press.
———. 2013. Introductory Econometrics: A Modern Approach. 5th ed. Mason, OH: South-Western.
Appendix
Derivation of ATET(x), ATET, ATENT(x), and ATENT in the heckit model
Proof.
The heckit model with observable and unobservable heterogeneity relies on these assumptions:
1. y = μ0 + αw + xβ0 + w(x − x̄)δ + u
2. E(e1 | x, z) = E(e0 | x, z) = 0
3. w = 1(γ0 + γ1x + γ2z + a ≥ 0) = 1(q ≥ 0)
4. E(a | x, z) = 0
5. (a, e0, e1) is trivariate normal
6. a ~ N(0, 1), that is, σ_a = 1
7. u = e0 + w(e1 − e0)
where
λ1(q) = φ(q)/Φ(q)
As for ATET, applying a similar procedure, it is immediate to get
ATENT(x) = {α + (x − x̄)δ + (ρ1 + ρ0) λ0(q)}_(w=0)
ATENT = α + {1/(Σ_i (1 − wi))} Σ_i (1 − wi)(xi − x̄)δ + (ρ1 + ρ0) {1/(Σ_i (1 − wi))} Σ_i (1 − wi) λ0i(q)
where
λ0(q) = φ(q)/{1 − Φ(q)}
ATET(x) = {α + (x − x̄)δ + (ρ1 + ρ0) λ1(q)}_(w=1)
ATENT(x) = {α + (x − x̄)δ + (ρ1 + ρ0) λ0(q)}_(w=0)
It follows that, by the law of iterated expectations, E(·) = p(w = 1)E(· | q ≥ 0) + p(w = 0)E(· | q < 0) = 0, because E(e1 − e0) = 0, proving that
ATE(x) = α + (x − x̄)δ
and finally
ATE = Ex{ATE(x)} = α
The Stata Journal (2014)
14, Number 3, pp. 481–498
1 Introduction
Markov regime-switching models are frequently used in economic analysis and are prevalent in fields such as finance, industrial organization, and business cycle theory. Unfortunately, conducting proper inference with these models can be exceptionally challenging. In particular, testing for the possible presence of multiple regimes requires the use of a nonstandard test statistic and critical values that may differ across model specifications.
Cho and White (2007) demonstrate that because of the unusually complicated nature of the null space, the appropriate measure for a test of multiple regimes in the Markov regime-switching framework is a quasi-likelihood-ratio (QLR) statistic. They provide an asymptotic null distribution for this test statistic from which critical values should be drawn. Because this distribution is a function of a Gaussian process, the critical values are difficult to obtain from a simple closed-form distribution. Moreover, the elements of the Gaussian process underlying the asymptotic null distribution are dependent upon one another. Thus the critical values depend on the covariance of the Gaussian process and, because of the complex nature of this covariance structure, are best calculated using numerical approximation. In this article, we summarize the steps necessary for such an approximation and introduce the new command rscv, which can be used to produce the desired regime-switching critical values for a QLR test of only one regime.
We focus on a simple linear model with Gaussian errors, but the QLR test and the rscv command are generalizable to a much broader class of models. This methodology can be applied to models with multiple covariates and non-Gaussian errors.
2 Null hypothesis
Specifying a Markov regime-switching model requires a test to confirm the presence of multiple regimes. The first step is to test the null hypothesis of one regime against the alternative hypothesis of Markov switching between two regimes. If this null hypothesis can be rejected, then one can proceed to estimate Markov regime-switching models with two or more regimes. The key to conducting valid inference is then a test of the null hypothesis of one regime that yields an asymptotic size equal to or less than the nominal test size.
To understand how to conduct valid inference for the null hypothesis of only one regime, consider a basic regime-switching model,
yt = μ0 + δst + ut (1)
where ut ~ i.i.d. N(0, σ²). The unobserved state variable st ∈ {0, 1} indicates the regime: in state 0, yt has mean μ0, while in state 1, yt has mean μ1 = μ0 + δ. The sequence (st), t = 1, . . . , n, is generated by a first-order Markov process with P(st = 1 | st−1 = 0) = p0 and P(st = 0 | st−1 = 1) = p1.
The key is to understand the parameter space that corresponds to the null hypothesis. Under the null hypothesis, there is one regime with mean μ*. Hence, the null parameter space must capture all the possible regions that correspond to one regime. The first region corresponds to the assumption that μ0 = μ1 = μ*, together with the assumption that each of the two regimes is observed with positive probability: p0 > 0 and p1 > 0. The nonstandard feature of the null space is that it includes two additional regions, each of which also corresponds to one regime with mean μ*. The second region corresponds to the assumption that only regime 0 occurs with positive probability, p0 = 0, and that μ0 = μ*. In this second region, the mean of regime 1, μ1, is not identified, so this region of the null hypothesis does not impose any value on μ1 − μ0. The third region is a mirror image of the second region, where now the assumption is that regime 1 occurs with probability 1: p1 = 0 and μ1 = μ*. The three regions are depicted in figure 1. The vertical distance measures the value of p0 and of p1, and the horizontal distance measures the value of μ1 − μ0. Thus the vertical line at μ1 − μ0 = 0 captures the region of the null parameter space that corresponds to the assumption that μ0 = μ1 = μ* together with p0, p1 ∈ (0, 1). The lower horizontal line captures the region of the null parameter space where p0 = 0 and μ1 − μ0 is unrestricted. Similarly, the upper horizontal line captures the region of the null parameter space where p1 = 0 and μ1 − μ0 is unrestricted.
Figure 1. The null parameter space: the vertical line μ1 − μ0 = 0 and the horizontal lines p0 = 0 and p1 = 0
The additional curves that correspond to the values p0 = 0 and p1 = 0 help prevent one from misclassifying a small group of extremal values as a second regime. In figure 1, we depict the null space together with local neighborhoods for two points in this space. These two neighborhoods illustrate the different roles of the three curves in the null space. Points in the circular neighborhood of the point on μ1 − μ0 = 0 correspond to processes with two regimes that have only slightly separated means. Points in the semicircular neighborhood around the point on p1 = 0 correspond to processes in which there are two regimes with widely separated means, one of which occurs infrequently. Because a researcher is often concerned that rejection of the null hypothesis of one regime is due to a small group of outliers rather than multiple regimes, including these boundary values reduces this type of false rejection. Consequently, a valid test of the null hypothesis of one regime must account for the entire null region and include all three curves.
The asymptotic null distribution of QLRn is (Cho and White 2007, theorem 6(b), 1692)
QLRn → max[ {max(0, G̃)}², sup_θ {G⁻(θ)}² ] (2)
where G(θ) is a Gaussian process indexed by the standardized separation between the regime means, G⁻(θ) := min{0, G(θ)}, and G̃ is a standard Gaussian random variable correlated with G(θ). (For a more complete description of (2), see Bostwick and Steigerwald [2012].)
The critical value for a test based on the statistic QLRn thus corresponds to a quantile of the largest value over {max(0, G̃)}² and sup_θ {G⁻(θ)}². To determine this quantity, one must account for the covariance among the elements of G(θ) as well as their covariance with G̃. The structure of this covariance, which is described in detail in
Bostwick and Steigerwald (2012), is
E{G(θ)G(θ̃)} = (e^(θθ̃) − 1 − θθ̃ − (θθ̃)²/2) / [ (e^(θ²) − 1 − θ² − θ⁴/2)^(1/2) (e^(θ̃²) − 1 − θ̃² − θ̃⁴/2)^(1/2) ] (3)
4.2 Description
rscv simulates the asymptotic null distribution of QLRn and returns the corresponding critical value. If no options are specified, rscv returns the critical value for a size-5% QLR test with a regime separation of 1 standard deviation, calculated over 100,000 replications.
4.3 Options
ll(#) specifies a lower bound on the interval H containing the number of standard deviations, θ, separating the regime means, where θ ∈ H. The default is ll(-1), meaning that the mean of regime 1 is no more than 1 standard deviation below the mean of regime 2.
ul(#) specifies an upper bound on the interval H containing the number of standard deviations separating the regime means. The default is ul(1), meaning that the mean of regime 1 is no more than 1 standard deviation above the mean of regime 2.
r(#) specifies the number of simulation replications to be used in calculating the critical values. The default is r(100000), meaning that the simulation will be run 100,000 times.
q(#) specifies the quantile for which a critical value should be calculated. The default is q(0.95), which corresponds to a nominal test size of 5%.
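For instance, the default critical value and a more conservative variant could be requested as follows (output omitted; the second call merely illustrates the options):
. rscv
. rscv, ll(-3) ul(3) q(0.99)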
rscv approximates the Gaussian process G(θ) by the truncated series
G^A(θ) := (e^(θ²) − 1 − θ² − θ⁴/2)^(−1/2) Σ_{k=3}^{K} {θ^k/√(k!)} Z_k,   Z_k ~ i.i.d. N(0, 1)
where K determines the accuracy of the Taylor-series approximation. Note that the covariance of this simulated process, E{G^A(θ)G^A(θ̃)}, is identical to the covariance structure of G(θ) in (3).
We must also account for the covariance between G̃ and G(θ). Cho and White (2007) establish that this covariance corresponds to the term in the Taylor-series expansion for k = 4. Thus we set G̃ = Z_4 so that Cov{G̃, G(θ)} = Cov{G̃, G^A(θ)}.
Therefore, the critical value that corresponds to (2) for a test size of 5% is the 0.95 quantile of the simulated value
max[ {max(0, Z_4)}², max_{θ∈H} {min(0, G^A(θ))}² ] (4)
The rscv command executes the numerical simulation of (4) by first generating the series (Z_k), k = 0, . . . , K, as i.i.d. N(0, 1) draws. For each value θ in a discrete subset of H, it then constructs G^A(θ) = (e^(θ²) − 1 − θ² − θ⁴/2)^(−1/2) Σ_{k=3}^{K} {θ^k/√(k!)} Z_k. The command then obtains the value mi = max[{max(0, Z_4)}², max_{θ∈H} {min(0, G^A(θ))}²], corresponding to (2), for each replication (indexed by i). Let (m_[i]), i = 1, . . . , r, be the vector of ordered values of mi calculated across replications. The command rscv returns the critical value for a test with size q from m_[(1−q)r].
For each replication, rscv calculates G^A(θ) at a fine grid of values over the interval H. Doing so requires three quantities: the interval H (which must encompass the true value of θ), the grid of values over H (given by the grid mesh), and the number of desired terms in the Taylor-series approximation, K. The user specifies the interval H using the ll() and ul() options. If μ0 is thought to lie within 3 standard deviations of μ1, the interval is H = [−3.0, 3.0]. Because the process is calculated at only a finite number of values, the accuracy of the calculated maximum increases as the grid mesh shrinks. Thus the command rscv implements a grid mesh of 0.01, as recommended in Cho and White (2007, 1693). For the interval H = [−3.0, 3.0] and a grid mesh of 0.01, the process is calculated at the points (−3.00, −2.99, . . . , 3.00).
Given the grid mesh of 0.01 and the user-specified interval H, we must determine the appropriate value of K. To do so, we consider the approximation error ε_{K,θ} = (e^(θ²) − 1 − θ² − θ⁴/2)^(−1/2) Σ_{k=K+1}^{∞} {θ^k/√(k!)} Z_k. We want to ensure that as K increases, the variance of ε_{K,θ} decreases toward zero. Carter and Steigerwald (2013) show that for large K, var(ε_{K,θ}) shrinks at a rate governed by e^(2K log θ̄ − K log K), where θ̄ = max_{θ∈H} |θ|. Therefore, the command rscv implements a value of K such that, for the user-specified interval H, (max_{θ∈H} |θ|)²/K ≤ 1/2; for H = [−3.0, 3.0], for example, this rule requires K ≥ 18.
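Putting these pieces together, one replication of (4) can be sketched in Stata's Mata as follows. This is a rough illustration under the reconstruction above, not the rscv implementation; K, the grid, and H are arbitrary choices, and θ = 0 is excluded because the normalizing factor vanishes there.
mata:
    K = 25                                        // truncation of the series
    z = rnormal(K+1, 1, 0, 1)                     // z[k+1] plays the role of Z_k
    theta = rangen(-1, -.01, 100) \ rangen(.01, 1, 100)   // grid over H = [-1, 1], skipping 0
    GA = J(rows(theta), 1, .)
    for (j = 1; j <= rows(theta); j++) {
        t = theta[j]
        s = 0
        for (k = 3; k <= K; k++) s = s + t^k/sqrt(factorial(k))*z[k+1]
        GA[j] = s/sqrt(exp(t^2) - 1 - t^2 - t^4/2)
    }
    m = max((max((0, z[5]))^2, max(rowmin((J(rows(theta), 1, 0), GA)):^2)))
    m
end
Repeating this r times and taking the appropriate empirical quantile of the resulting m values gives the critical value.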
The rscv command also allows the user to specify the number of simulation replications and the desired quantile. For large values of H and the default number of replications (r = 100000), the rscv command could require more memory than a 32-bit operating system can provide. In this case, the user may need to specify a smaller number of replications to calculate the critical values for the desired interval H. Critical values derived using fewer simulation replications may be stable to only one significant digit. Table 1 depicts the results of rscv for a size-5% test over varying values of ll(), ul(), and r().
5 Example
We demonstrate how to test for the presence of multiple regimes through an example from the economics literature. Unlike the simple model that we have considered until now, (1), the model in this example includes several added complexities that are commonly used in regime-switching applications. We describe how to construct the QLR test statistic for this more general model, how to use existing Stata commands to obtain the value of the test statistic, and, finally, how to use the new command, rscv, to obtain an appropriate critical value.
Our example is derived from Bloom, Canning, and Sevilla (2003), who test whether the large differences in income levels across countries are better explained by differences in intrinsic geography or by a regime-switching model where the regimes correspond to distinct equilibria. To this end, the authors use cross-sectional data to analyze the distribution of per capita income levels for countries with similar exogenous characteristics and test for the presence of multiple regimes.
Bloom, Canning, and Sevilla (2003) propose a model of switching between two possible equilibria. Regime 1 occurs with probability p(x) and corresponds to countries that are in a poverty-trap equilibrium,
y = α1 + β1x + ε1, Var(ε1) = σ1² (5)
while regime 2 occurs with probability 1 − p(x) and corresponds to
y = α2 + β2x + ε2, Var(ε2) = σ2² (6)
In both regimes, y is the log gross domestic product per capita, and x is the absolute latitude, which functions as a catchall for a variety of exogenous geographic characteristics.
This model differs from a Markov regime-switching model in that the authors are looking at different regimes in a cross-section rather than over time. Thus the probability of being in either regime is stationary, and the unobserved regime indicator is an i.i.d. random variable. This modification corresponds exactly to that made by Cho and White (2007) to create the quasi-log-likelihood, so in this example, the log-likelihood ratio and the QLR are one and the same.
Note that this model is more general than the basic regime-switching model presented in section 2. Bloom, Canning, and Sevilla (2003) have allowed for three generalizations: covariates with coefficients that vary across regimes, error variances that are regime specific, and regime probabilities that depend on the included covariates. However, as Carter and Steigerwald (2013) discuss, the asymptotic null distribution (2) is derived under the following assumptions: that the difference between regimes be only in the intercept μj, that the variance of the error terms be constant across regimes, and that the regime probabilities not depend on the exogenous characteristic x. Thus, to form the test statistic, we must fit the following two-regime model: regime 1 occurs with probability p and corresponds to
y = μ1 + βx + ε (5′)
while regime 2 occurs with probability 1 − p and corresponds to
y = μ2 + βx + ε (6′)
where Var(ε) = σ².
Simplifying the model like this does not diminish the validity of the QLR as a one-regime test for the model in (5) and (6). Under the null hypothesis of one regime, there is necessarily only one error variance, only one coefficient for each covariate, and a regime probability equal to one. Thus, under the null hypothesis, the QLR test will necessarily have the correct size even if the data are accurately modeled by a more complex system. Once the null hypothesis is rejected using this restricted model, the researcher can then fit a model with regime-specific variances and coefficients, if desired.1
For the restricted model in (5′) and (6′), the quasi-log-likelihood is
Ln(p, σ², β, μ1, μ2) = (1/n) Σ_{t=1}^{n} lt(p, σ², β, μ1, μ2)
where lt(p, σ², β, μ1, μ2) := log{p f(yt | xt; σ², β, μ1) + (1 − p) f(yt | xt; σ², β, μ2)}, and f(yt | xt; σ², β, μj) is the conditional density for j = 1, 2. It is common to assume, as Bloom, Canning, and Sevilla (2003) do, that ε is a normal random variable2 so that f(yt | xt; σ², β, μj) = {1/√(2πσ²)} e^(−(yt − μj − βxt)²/(2σ²)). Let (p̂, σ̂², β̂, μ̂1, μ̂2) be the values that maximize Ln, and let (1, σ̃², β̃, μ̃1, μ̃2) be the values that make Ln as large as possible under the null hypothesis of one regime. The QLR statistic is then
QLRn = 2n{Ln(p̂, σ̂², β̂, μ̂1, μ̂2) − Ln(1, σ̃², β̃, μ̃1, μ̃2)}
To estimate QLRn, we use the same Penn World Table and CIA World Factbook data as in Bloom, Canning, and Sevilla (2003).3 First, we must determine the parameter values that maximize the quasi-log-likelihood under the null hypothesis, (1, σ̃², β̃, μ̃1, μ̃2), and evaluate the quasi-log-likelihood at those values. To obtain these parameter values, we estimate a linear regression of y on x, which corresponds to maximizing
Ln(1, σ², β, μ1, μ2) = (1/n) Σ_{t=1}^{n} log[{1/√(2πσ²)} e^(−(yt − μ1 − βxt)²/(2σ²))]
While this can be achieved with a simple ordinary least-squares command, we also need the value of the log-likelihood, so we detail how to use Stata commands to obtain both the parameter estimates and this value.
1. With a more complex data-generating process, these restrictions could lead to an increased probability of failing to reject a false null hypothesis and, hence, a decrease in the power of the QLR test.
2. Bloom, Canning, and Sevilla (2003) assume normally distributed errors, but the QLR test allows for any error distribution within the exponential family.
3. Latitude data for countries appearing in the 1985 Penn World Tables and missing from the CIA World Factbook come from https://www.google.com/.
To find (1, σ̃², β̃, μ̃1, μ̃2), we use the following code, which relies on the Stata command ml.
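The published code itself is not reproduced in this extract; a minimal ml setup consistent with the output below might look like the following sketch. The evaluator name is hypothetical, while the variable names lgdp and latitude are taken from the code shown later in this section.
program define onereg_ll
    // lf evaluator: single-regime normal density with constant mean shift in latitude
    args lnf mu beta sigma
    quietly replace `lnf' = ln(normalden($ML_y1, `mu' + `beta'*latitude, `sigma'))
end
ml model lf onereg_ll (mu: lgdp = ) (beta: ) (sigma: )
ml maximize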
                    Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
mu
  _cons          6.927805    1.420095    4.88   0.000      4.144469    9.711141
beta
  _cons          .0408554     .049703    0.82   0.411     -.0565607    .1382714
sigma
  _cons          .8019654    .5670752    1.41   0.157     -.3094815    1.913412
. matrix gammasingle=e(b)
. generate llf1regime=ln(((2*_pi*gammasingle[1,3]^2)^(-1/2))*
> exp((-1/(2*gammasingle[1,3]^2))*
> (lgdp-gammasingle[1,1]-gammasingle[1,2]*latitude)^2))
. quietly summarize llf1regime
. quietly replace llf1regime=r(sum)
. display "Final estimated quasi-log-likelihood for one regime: " llf1regime
Final estimated quasi-log-likelihood for one regime: -182.1338
Estimation is more involved under the alternative hypothesis, because the quasi-log-likelihood involves the log of the sum of two terms,
Ln(p, σ², β, μ1, μ2) = (1/n) Σ_{t=1}^{n} log{p f(yt | xt; σ², β, μ1) + (1 − p) f(yt | xt; σ², β, μ2)}
We therefore proceed iteratively:
1. Choose starting guesses for the parameter values p(0), σ²(0), β(0), μ1(0), μ2(0).
2. For each observation, calculate τt = P(st = 1 | yt, xt) such that
τt = p(0) f(yt | xt; σ²(0), β(0), μ1(0)) / {p(0) f(yt | xt; σ²(0), β(0), μ1(0)) + (1 − p(0)) f(yt | xt; σ²(0), β(0), μ2(0))}
3. Use Stata's ml command to find the parameter values p(1), σ²(1), β(1), μ1(1), μ2(1) that maximize the complete log-likelihood
LCn(p, σ², β, μ1, μ2) = (1/n) Σ_{t=1}^{n} {τt log f(yt | xt; σ², β, μ1) + (1 − τt) log f(yt | xt; σ², β, μ2) + τt log p + (1 − τt) log(1 − p)}
5. If all 3 convergence criteria are less than some tolerance level (we use 1/n), then quit and use p(1), σ²(1), β(1), μ1(1), μ2(1) as the final parameter estimates. Otherwise, repeat steps 2–5 with p(1), σ²(1), β(1), μ1(1), μ2(1) as the new starting guesses.
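A minimal sketch of the E-step in step 2, assuming the current parameter values are held in scalars p0, mu1_0, mu2_0, beta0, and sigma0 (hypothetical names), is:
. generate double f1  = normalden(lgdp, mu1_0 + beta0*latitude, sigma0)
. generate double f2  = normalden(lgdp, mu2_0 + beta0*latitude, sigma0)
. generate double tau = p0*f1/(p0*f1 + (1 - p0)*f2)
The variable tau then enters the complete log-likelihood maximized in step 3.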
Running this procedure on the data yields the following two-regime estimates:
                    Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
mu1
  _cons          6.532847    1.148891    5.69   0.000      4.281062    8.784632
mu2
  _cons          7.813265     1.45266    5.38   0.000      4.966102    10.66043
beta
  _cons          .0451607    .0374139    1.21   0.227     -.0281691    .1184905
sigma
  _cons          .5986278    .4232938    1.41   0.157     -.2310128    1.428268
p
  _cons          .7708245    .4203024    1.83   0.067      -.052953    1.594602
Thus we have n Ln(p̂, σ̂², β̂, μ̂1, μ̂2) = −179.9662. Then, to calculate the test statistic, QLRn, we type
. generate QLR=2*(llf2reg-llf1reg)
. display "Quasi-likelihood-ratio test statistic of one regime: " QLR
Quasi-likelihood-ratio test statistic of one regime: 4.3352051
These estimates and the resulting QLR test statistic are summarized in table 2. For the
complete Stata code used to create table 2, see the appendix.
Finally, we use the rscv command to calculate the critical value for the QLR test of size 5%. We allow for the possibility that the two regimes are widely separated and set H = [−5.0, 5.0]. The command and output are shown below.
. rscv, ll(-5) ul(5) r(100000) q(0.95)
7.051934397
Given that this critical value of 7.05 exceeds the QLR statistic of 4.3, we cannot reject
the null hypothesis of one regime.
This result is consistent with the findings of Bloom, Canning, and Sevilla (2003), although they use a different method to obtain the necessary critical values. They
report a likelihood ratio and the corresponding critical values for a restricted version of their model where the regime probabilities are fixed (p does not depend on x). Using this restricted model, the authors do not reject the null hypothesis of one regime. At the time that Bloom, Canning, and Sevilla (2003) was published, researchers had yet to successfully derive the asymptotic null distribution for a likelihood-ratio test of regime switching. Therefore, the authors use Monte Carlo methods to generate their critical values using random data generated from the estimated relationship given by the model in (5) and (6). The primary disadvantage of this approach is that the derived critical values are then dependent upon the authors' assumptions concerning the underlying data-generating process.
Bloom, Canning, and Sevilla (2003) go on to report a likelihood-ratio test of a single
regime model against the unrestricted model with latitude-dependent regime probabili-
ties. With the unrestricted model, the authors can use the likelihood ratio and simulated
critical values to reject the null hypothesis in favor of the alternative of two regimes.
Because the null distribution derived by Cho and White (2007) applies to only the QLR constructed using the two-regime model given in (5′) and (6′), we cannot use the QLR test and, hence, the rscv command to obtain the critical values necessary to evaluate this unrestricted test statistic.
6 Discussion
We provide a methodology and a new command, rscv, to construct critical values for
a test of regime switching for a simple linear model with Gaussian errors. Despite
the complexity of the underlying methodology, rscv is relatively simple to execute and
merely requires the researcher to provide a range for the standardized distance between
regime means. In section 5, we demonstrate how these methods can be generalized
to a very broad class of models, and we discuss the restrictions necessary to properly
estimate the QLR statistic and use the rscv critical values.
7 References
Bloom, D. E., D. Canning, and J. Sevilla. 2003. Geography and poverty traps. Journal of Economic Growth 8: 355–378.
Bostwick, V. K., and D. G. Steigerwald. 2012. Obtaining critical values for test of Markov regime switching. Economics Working Paper Series qt3685g3qr, University of California, Santa Barbara. http://ideas.repec.org/p/cdl/ucsbec/qt3685g3qr.html.
Carter, A. V., and D. G. Steigerwald. 2012. Testing for regime switching: A comment. Econometrica 80: 1809–1812.
Cho, J. S., and H. White. 2007. Testing for regime switching. Econometrica 75: 1671–1720.
Appendix
The following Stata code was used to create table 2. The code fits the model in section 5
under the alternative hypothesis of two regimes using the EM algorithm and then under
the null hypothesis of one regime using the Stata ml command. Finally, the QLR test
statistic is calculated.
* Estimating QLR test statistic for Bloom, Canning, and Sevilla (2003)
/***************************************************/
* First, estimate parameters and log likelihood for the case of two regimes:
* lgdp = theta0 + delta*latitude + u~N(0,sigma2) with probability (1-lambda)
* lgdp = theta1 + delta*latitude + u~N(0,sigma2) with probability lambda
/***************************************************/
* Start with initial guess for theta0, theta1, delta, sigma2, and lambda:
regress lgdp latitude
matrix beta=e(b)
svmat double beta, names(matcol)
scalar dhat=betalatitude
generate intercept=lgdp-dhat*latitude
summarize intercept
scalar t0hat=r(mean)-r(Var)
scalar t1hat=r(mean)+r(Var)
scalar shat=sqrt(r(Var))
scalar lhat=0.5
matrix gammahat=(t1hat, t0hat, dhat, shat, lhat)
display "Original guess for parameter values: "
matrix list gammahat
/***************************************************/
* Start loop that continues until parameter estimates have converged
generate error1=10
generate error2=10
generate error3=10
generate tol=1/_N
generate count=0
generate count1=1
generate count2=1
generate count3=1
generate f1=0
generate f0=0
generate fboth=0
generate etahat=0
generate llfhat=0
generate llfnew=0
generate fdelta=0
generate fnew=0
generate lnllfnew=0
generate lnllfdelta=0
generate nd1=0
generate nd2=0
generate nd3=0
generate nd4=0
generate nd5=0
/***************************************************/
* Now use etahat to create and maximize log-likelihood function
/***************************************************/
* Check whether the parameter estimates have converged
mata: st_matrix("temp", max(abs(st_matrix("gammanew")-st_matrix("gammahat"))))
quietly replace error1=temp[1,1]
/***************************************************/
* Keep track of when each convergence criterion is met
quietly replace count1=count1+1 if error1>tol
quietly replace count2=count2+1 if error2>tol
quietly replace count3=count3+1 if error3>tol
* End of loop
}
/***************************************************/
* Calculate final log likelihood for two regimes
quietly replace f1=((2*_pi*gammanew[1,4]^2)^(-1/2))* ///
exp((-1/(2*gammanew[1,4]^2))*(lgdp-gammanew[1,1]-gammanew[1,3]*latitude)^2)
quietly replace f0=((2*_pi*gammanew[1,4]^2)^(-1/2))* ///
exp((-1/(2*gammanew[1,4]^2))*(lgdp-gammanew[1,2]-gammanew[1,3]*latitude)^2)
generate f2reg=gammanew[1,5]*f1+(1-gammanew[1,5])*f0
generate llf2reg=ln(f2reg)
quietly summarize llf2reg
quietly replace llf2reg=r(sum)
* Output final parameter estimates
display "Final estimated parameter values for two regimes: "
matrix list gammanew
display "Final estimated log likelihood for two regimes: " llf2reg
display "Total number of loop iterations: " count
display "Parameter values converged after " count1 " iterations"
display "Log likelihood value converged after " count2 " iterations"
display "Gradient of Log likelihood converged after " count3 " iterations"
/***************************************************/
* Second, estimate parameters and log likelihood for the case of only one regime:
/***************************************************/
* Finally, calculate QLR test statistic:
generate QLR=2*(llf2reg-llf1reg)
display "Quasi-likelihood-ratio test statistic of one regime: " QLR
The Stata Journal (2014) 14, Number 3, pp. 499–510
Abstract. The analysis of multinomial data often includes the following question of interest: Is a particular category the most populous (that is, does it have the largest probability)? Berry (2001, Journal of Statistical Planning and Inference 99: 175–182) developed a likelihood-ratio test for assessing the evidence for the existence of a unique most probable category. Nettleton (2009, Journal of the American Statistical Association 104: 1052–1059) developed a likelihood-ratio test for testing whether a particular category was most probable, showed that the test was an example of an intersection-union test, and proposed other intersection-union tests for testing whether a particular category was most probable. He extended his likelihood-ratio test to the existence of a unique most probable category and showed that his test was equivalent to the test developed by Berry (2001, Journal of Statistical Planning and Inference 99: 175–182). Nettleton (2009, Journal of the American Statistical Association 104: 1052–1059) showed that the likelihood ratio for identifying a unique most probable cell could be viewed as a union-intersection test. The purpose of this article is to survey different methods and present a command, cellsupremacy, for the analysis of multinomial data as it pertains to identifying the significantly most probable category; the article also presents a command for sample-size calculations and power analyses, power cellsupremacy, that is useful for planning multinomial data studies.
Keywords: st0348, cellsupremacy, cellsupremacyi, power cellsupremacy, most probable category, multinomial data, cell supremacy, cell inferiority
1 Introduction
If Y1, Y2, . . . , Yk are independent Poisson-distributed random variables with means λ1, λ2, . . . , λk, then (Y1, Y2, . . . , Yk), conditional on their sum, is multinomial(N, p1, p2, . . . , pk), where pi = λi / Σ_{j=1}^k λj represents the probability of the ith category. Multinomial
data are common in biological, marketing, and opinion research scenarios. In a recent
study, Price et al. (2011) used data from the 2008 National Health Interview Survey
to examine whether 18- to 26-year-old women who are most likely to benefit from
catch-up vaccination are aware of the human papillomavirus (HPV) vaccine and have
received initial and subsequent doses in the 3-dose series. The study found that the
most common reasons for lack of interest in the HPV vaccine were belief that it was not
needed (35.9%), not knowing enough about it (17.1%), concerns about safety (12.7%),
and not being sexually active (10.3%). These 4 responses were among the 11 possible
response categories to the survey question. Is the belief among respondents that the HPV
vaccine was not needed the unique most probable reason for lack of interest in the HPV
vaccine? Response to questionnaire-based infertility studies varies, and Morris et al. (2013) noted that different modes of contact can affect response. Results of their study
indicated that 59% of the women surveyed preferred a mailed questionnaire, 37% chose
an online questionnaire, and only 3% selected a telephone interview as their mode of
contact. Is a mailed questionnaire the most preferred mode of contact? Are these
results significant? The purpose of this article is to survey different methods and to present a command for the analysis of multinomial data as it pertains to identifying the significantly most probable category; the article also presents a command for sample-size
calculations and power analyses that is useful for planning multinomial data studies.
2 Methods
Nettleton (2009) posed the test for the supremacy of a multinomial cell probability as an
intersection-union test (IUT). Suppose X = (X1 , . . . , Xk ) has a multinomial distribution
with n trials and the cell probabilities p1 , . . . , pk . The parameter p = (p1 , . . . , pk ) lies
in the set P of vectors of order k, whose components are positive and sum to one.
The tested null hypothesis states that a particular cell of interest is not more probable
than all others. Suppose the kth cell is the cell of interest; then the hypothesis can be
formulated as

   H0: ⋃_{i=1}^{k−1} {pk ≤ pi}   versus   H1: ⋂_{i=1}^{k−1} {pk > pi}

which Nettleton (2009) noted can be stated as

   H0: pk ≤ max(p1, . . . , pk−1)   versus   H1: pk > max(p1, . . . , pk−1)
Nettleton (2009) offered three possible asymptotic IUT statistics: the score test, the Wald test, and the likelihood-ratio test. Suppose x = (x1, . . . , xk) is a realization of X = (X1, . . . , Xk); then p̂i = xi/n, so that p̂ = (p̂1, . . . , p̂k) is the maximum likelihood estimate of p = (p1, . . . , pk). Each asymptotic IUT statistic is zero unless xk is greater than max(x1, . . . , xk−1). Nettleton (2009) also suggested a test based on the conditional distribution of Xk, given the sum of xk and m, where m = max(x1, . . . , xk−1).
p-value for the test is given by Pr{χ²(1) ≥ TS}/2, where χ²(1) denotes a χ² random variable with 1 degree of freedom.
   TW = n(p̂k − p̂M)² / {p̂k + p̂M − (p̂k − p̂M)²}   if p̂k > p̂M = max(p̂1, . . . , p̂k−1)
   TW = 0   otherwise
H0 is rejected if and only if TW ≥ χ²(1),1−2α. The approximate p-value for the test is given by Pr{χ²(1) ≥ TW}/2.
H0 is rejected if and only if TLR ≥ χ²(1),1−2α. The approximate p-value for the test is given by Pr{χ²(1) ≥ TLR}/2.
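As a small illustration, the Wald IUT statistic and its approximate p-value can be computed directly from raw counts; the sketch below uses the counts 45, 28, and 27 from the example in section 3.4 and assumes the first cell (45) is the cell of interest. cellsupremacyi reports these tests automatically, so the code is purely illustrative.

scalar ntot = 45 + 28 + 27
scalar pk   = 45/ntot                          // cell of interest
scalar pM   = max(28, 27)/ntot                 // largest competing cell
scalar TW   = cond(pk > pM, ntot*(pk - pM)^2/(pk + pM - (pk - pM)^2), 0)
scalar pval = chi2tail(1, TW)/2                // approximate p-value
display "T_W = " TW "   approximate p-value = " pval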
   p-value = Σ_{x = xk}^{m + xk} C(m + xk, x) 2^{−(m + xk)}

where C(n, x) denotes a binomial coefficient.
The simulation studies by Nettleton (2009) showed that the conditional IUT based on
the binomial distribution yielded a true p-value typically less than the nominal value.
Farcomeni (2012) suggested that the exact test (that is, conditional binomial) may be conservative and that the exact significance level may be smaller than the desired nominal level. Farcomeni (2012) suggested using the typical continuity correction for the binomial; namely, he recommended the mid-p value as the p-value of the test.
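A corresponding sketch of the conditional binomial (exact) p-value and the mid-p correction, again with the illustrative counts xk = 45 (cell of interest) and m = 28 (largest competing cell), might look as follows.

scalar xk = 45
scalar m  = 28
scalar pexact = binomialtail(m + xk, xk, 0.5)              // Pr(X >= xk | X + M = m + xk)
scalar pmid   = pexact - 0.5*binomialp(m + xk, xk, 0.5)    // Farcomeni's mid-p value
display "exact p-value = " pexact "   mid-p value = " pmid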
One could formulate the test for cell inferiority (that is, a particular cell is least probable) as H0: pk ≥ min(p1, . . . , pk−1) versus H1: pk < min(p1, . . . , pk−1).
Farcomeni (2012) suggests using the exact test for inferiority where the sum goes from 0 to xk. That is, the p-value for the conditional IUT for inferiority would be

   p-value = Σ_{x = 0}^{xk} C(m + xk, x) 2^{−(m + xk)}
Alam and Thompson (1972) discussed the challenges of testing whether a particular
cell is least probable from a design point of view. Nettleton (2009) showed that the
likelihood-ratio test statistic could be used to test for the existence of a unique most
probable cell. That is, rather than test whether a particular cell chosen a priori is
the most probable, one could test whether the largest observed cell was uniquely most
probable. The likelihood-ratio test statistic matches the test statistic developed by Berry (2001) and rejects H0 if and only if TLR ≥ χ²(1),1−2α. The approximate p-value for the test is given by Pr{χ²(1) ≥ TLR}, where χ²(1) denotes a χ² random variable with 1 degree of freedom. That is, the p-value is twice the p-value for the test in which a particular cell chosen a priori is most probable.
2.7 Power
We consider the case of a random variable X ∼ multinomial(n, p1, . . . , pk). Without loss of generality, we will assume that pk is the maximum among the k cells. Let
   H0: pk = pM   versus   H1: pk > pM

where p0 = (pk + pM)/2 (Guenther 1977). For example, consider the random variable X ∼ multinomial(n = 50, p1 = 0, p2 = 0, p3 = 0.3, p4 = 0.3, p5 = 0.4) at the α = 0.05 significance level. The null hypothesis is rejected if TS ≥ 2.70554. Solely based on p4 and p5, the noncentrality parameter for testing the 5th cell selected a priori as the most probable cell is

   100 (0.4 − 0.35)²/0.35 ≈ 0.71429

and the approximate power is 0.21833.
We have a trinomial, and there is strong competition for the maximum among the first k − 1 cells. Because the cells of a multinomial are not independent, one would expect the distribution of the first k − 1 cells to affect the power to detect the kth cell as the most probable. The simulated power for this scenario was 0.087. Thus the approximation of power must consider the impact of the distribution of the first k − 1 cells. The correlation between two cells of a multinomial is

   ρ_{a,b} = −sqrt[ pa pb / {(1 − pa)(1 − pb)} ]
The power to detect the 5th cell as the most probable is the power that p̂5 > p̂4 and p̂5 > p̂3. Consider approximating the power by

   power ≈ Pr{TS ≥ χ²(1),1−2α | pk, pM} × [ Pr{TS ≥ χ²(1),1−2α | pk, pN} ]^(1+ρ_{M,N})

where pM and pN represent the maximum and the second largest of the cell probabilities of the first k − 1 cells, respectively, and ρ_{M,N} represents the correlation between cells M and N. For our example, the approximate power is 0.0915.
Applying this form of the approximation to the original example with p1 through p3 equal to 0.1 and p4 equal to 0.3 yields an approximate power of

   power ≈ Pr{TS ≥ χ²(1),1−2α | p5 = 0.4, p4 = 0.3} × [ Pr{TS ≥ χ²(1),1−2α | p5 = 0.4, p3 = 0.1} ]^(1+ρ_{4,3})
         = (0.21833)(0.91232)^(1−0.21822)
         ≈ 0.20322
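The computation above can be reproduced in Stata with noncentral chi-squared probabilities. The sketch below assumes n = 50 (consistent with the power cellsupremacy call in section 3.4) and a noncentrality parameter of the form 2n(pk − p0)²/p0 with p0 = (pk + pM)/2, which reproduces the factors 0.21833 and 0.91232; power cellsupremacy performs this calculation internally, so the code is only illustrative.

scalar nn   = 50
scalar crit = invchi2(1, 0.90)                          // chi2(1) critical value for alpha = 0.05
scalar lamM = 2*nn*(0.4 - 0.35)^2/0.35                  // competition from p4 = 0.3
scalar lamN = 2*nn*(0.4 - 0.25)^2/0.25                  // competition from p3 = 0.1
scalar rho  = -sqrt(0.3*0.1/((1 - 0.3)*(1 - 0.1)))      // correlation between cells 4 and 3
scalar power = (1 - nchi2(1, lamM, crit))*(1 - nchi2(1, lamN, crit))^(1 + rho)
display %6.5f power                                     // approximately 0.20322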
Table 1 provides simulations of size 100,000 for several scenarios to investigate the adequacy of our proposed approximation. For each scenario, p6 is the cell of interest, ρ5,4 represents the correlation between the 5th and 4th cell, Sim. is the simulated power, and Approx. is our power approximation.
2.8 Conclusions
Nettleton (2009) suggested that the asymptotic procedures are preferred for moderate to
large sample sizes based on simulations, but the IUT based on conditional tests is a useful
option when a small sample size casts doubt on the validity of the asymptotic procedures.
Our power simulations tend to also suggest that the power approximation works best
for moderate to large sample sizes. Scenarios 29–32 present a slightly more complex problem with three cells vying for the top spot among the first k − 1 cells. For these scenarios, our power approximation yields slightly liberal results because the approximate power is consistently larger than the simulated power. Under this scenario, the power to detect the 6th cell as the most probable is the power that p̂6 > p̂5, p̂6 > p̂4, and p̂6 > p̂3. Thus one could improve the approximation by considering the added competition for supremacy among the first k − 1 cells. That is, for n = 200, the approximate power is
   power ≈ Pr{TS ≥ χ²(1),1−2α | p6 = 0.4, p5 = 0.2}
           × [ Pr{TS ≥ χ²(1),1−2α | p6 = 0.4, p4 = 0.2} ]^(1+ρ_{5,4})
           × [ Pr{TS ≥ χ²(1),1−2α | p6 = 0.4, p3 = 0.2} ]^(1+2ρ_{5,4})
         = (0.97761)(0.97761)^(1−0.25)(0.97761)^(1−0.50)
         ≈ 0.95032
which compares favorably with the simulated power. However, we believe that for most real-world problems, considering the impact of the top two cell probabilities among the first k − 1 cells is sufficient.
cellsupremacyi, counts(numlist)
power cellsupremacy, freq(numlist) n(#) simulate dots reps(#)
alpha(#)
dots shows the replication dots when using the simulate option.
reps(#) specifies the number of simulations used to calculate the power. The default is reps(10000).
alpha(#) specifies the alpha that is used for calculating the power. The default is alpha(0.05).
3.4 Examples
Suppose we are studying breast cancer and we find that the distribution of subtypes is a trinomial distribution with HER2+, HR+, and TNBC. In our data, we find that patients
with leptomeningeal disease were more likely to be HER2+ (45%). We are interested in
knowing whether this particular category is the most populous (that is, does it have
the largest probability of occurring?). The following example will generate a sample
dataset and illustrate the use of the new command to answer this question.
. set obs 100
obs was 0, now 100
. generate subtype = "HER2+" in 1/45
(55 missing values generated)
. replace subtype = "HR+" in 46/73
(28 real changes made)
. replace subtype = "TNBC" in 74/100
(27 real changes made)
. tab subtype
subtype Freq. Percent Cum.
The p-values for all tests are less than 0.05, which indicates that HER2+ is the most probable. The test for the existence of a most probable cell is also significant. On the other hand, if we were interested in cell inferiority (least probable), we would not reject our hypothesis because our p-values are approximately 0.50. Below is another example with a slightly different distribution than before.
. clear
. set obs 100
obs was 0, now 100
. generate subtype = "HER2+" in 1/45
(55 missing values generated)
. replace subtype = "HR+" in 46/85
(40 real changes made)
. replace subtype = "TNBC" in 86/100
(15 real changes made)
. tab subtype
subtype Freq. Percent Cum.
Because HER2+ and HR+ have similar frequencies, we cannot conclude that HER2+ is
the most probable. In this case, we can conclude that TNBC is the least probable cell. The
above examples can both be implemented by entering the raw counts cellsupremacyi
45 28 27 or cellsupremacyi 45 40 15, respectively.
To illustrate how to use the power cellsupremacy command to calculate the power
of the test, we consider the examples in section 2.7 for testing cell superiority for the
random variables

   X ∼ multinomial(n = 50, p1 = 0, p2 = 0, p3 = 0.3, p4 = 0.3, p5 = 0.4)

and

   Y ∼ multinomial(n = 50, p1 = 0.1, p2 = 0.1, p3 = 0.1, p4 = 0.3, p5 = 0.4)
. clear
. set seed 339487731
. power_cellsupremacy, simulate freq(0 0 0.3 0.3 0.4) n(50)
Simulations (10000)
N Simulated Power Approximate Power
50 0.0898 0.0915
. power_cellsupremacy, simulate freq(0.1 0.1 0.1 0.3 0.4) n(50)
Simulations (10000)
N Simulated Power Approximate Power
50 0.2121 0.2032
4 Acknowledgment
This research is supported in part by the National Institutes of Health through M. D. Anderson's Cancer Center Support Grant CA016672.
5 References
Alam, K., and J. R. Thompson. 1972. On selecting the least probable multinomial event. Annals of Mathematical Statistics 43: 1981–1990.
Guenther, W. C. 1977. Power and sample size for approximate chi-square tests. American Statistician 31: 83–85.
Nettleton, D. 2009. Testing for the supremacy of a multinomial cell probability. Journal of the American Statistical Association 104: 1052–1059.
Price, R. A., J. A. Tiro, M. Saraiya, H. Meissner, and N. Breen. 2011. Use of human papillomavirus vaccines among young adult women in the United States: An analysis of the 2008 National Health Interview Survey. Cancer 117: 5560–5568.
1 Introduction
Competition and antitrust authorities have long been concerned with the possible anticompetitive effects of mergers. This is in particular the case for horizontal mergers, which are mergers between firms selling substitute products. The traditional concern has been that such mergers raise market power, which may hurt consumers and reduce total welfare (the sum of producer and consumer surplus). At the same time, however, it has been recognized that mergers may also result in cost savings or other efficiencies. While such cost savings may often be insufficient to reduce prices and benefit consumers, it has been shown that even small cost savings can be sufficient to raise total welfare (see Williamson [1968] and Farrell and Shapiro [1990]).1 Despite the possible total welfare gains, most competition authorities in practice take a consumer surplus standard when evaluating proposed mergers.
Merger simulation is increasingly used as a tool to evaluate the effects of horizontal mergers. Consistent with policy practice, the focus is often on the price and consumer surplus effects, but various applications also evaluate the effects on total welfare.2 Merger simulation aims to predict the merger effects in the following three steps.
1. According to Williamson's (1968) analysis, the deadweight loss from the output reduction after the merger is a second-order effect that is easily compensated by the cost savings from the merger. However, Posner (1975) argues that there is another source of inefficiency from mergers because firms must spend wasteful resources to make a merger and maintain market power. In this alternative view, it may be more natural to use consumer surplus as a standard to evaluate mergers and to ignore the transfer from consumers to firms.
2. Early contributions to the merger simulation literature are Werden and Froeb (1994), Nevo
(2000), Epstein and Rubinfeld (2002), and Ivaldi and Verboven (2005). For a recent survey, see
Budzinski and Ruhmer (2010).
The first step specifies and estimates a demand system, usually one with differentiated products. The second step makes an assumption about the firms' equilibrium behavior, typically multiproduct Bertrand–Nash, to compute the products' current profit margins and their implied marginal costs. The third step usually assumes that marginal costs are constant and computes the postmerger price equilibrium, accounting for increased market power, cost efficiencies, and perhaps remedies (such as divestiture). This enables one to compute the merger's effect on prices, consumer surplus, producer surplus, and total welfare. Stata is often used to estimate the demand system (the first step) but not to implement a complete merger simulation (including the second and third steps). In this article, we show how to implement merger simulation in Stata as a postestimation command, that is, after estimating the parameters of a demand system for differentiated products. We also illustrate how to perform merger simulation when the demand parameters are not estimated but rather calibrated to be consistent with outside industry information on price elasticities and profit margins. We allow for a variety of extensions, including the role of (marginal) cost savings, remedies (divestiture), and conduct different from Bertrand–Nash behavior.
We consider an oligopoly model with multiproduct price-setting firms that may partially collude and have constant marginal cost. Following Berry (1994), we specify the demand system as an aggregate nested logit model, which can be estimated with market-level data using linear regression methods (as opposed to the individual-level nested logit model). We consider both a unit demand specification, as in Berry (1994) and Verboven (1996), and a constant expenditures specification, as in Björnerstedt and Verboven (2013). The model requires a dataset on products sold in one market, or in a panel of markets, with information on the products' prices, their quantities sold, firm and nest identifiers, and possibly other product characteristics.
In section 2, we discuss the merger simulation model, including the nested logit
demand system. In section 3, we introduce the commands required to carry out the
merger simulation. Section 4 provides examples and section 5 concludes.
where the conduct parameter, which lies in (0, 1), allows for the possibility that firms partially coordinate. If it equals 0, firms behave noncooperatively as multiproduct firms. If it equals 1,
   q(p) + {Ω(p)}(p − c) = 0
This can be inverted to write price as the sum of marginal cost and a markup, where the markup term (inversely) depends on the price elasticities and on the product-ownership matrix:

   p = c − {Ω(p)}⁻¹ q(p)                                        (2)

For single-product firms with no collusion (a conduct parameter of 0), the markup term is price divided by the own-price elasticity of demand. With multiproduct firms and partial collusion, the cross-price elasticities also matter, and this increases the markup term (if products are substitutes).
Equation (2) serves two purposes. First, it can be rewritten to uncover the premerger marginal cost vector c based on the premerger prices and estimated price elasticities of demand; that is,

   c^pre = p^pre + {Ω^pre(p^pre)}⁻¹ q(p^pre)
Second, (2) can be used to predict the postmerger equilibrium. The merger involves two possible changes: a change in the product-ownership matrix from its premerger to its postmerger configuration and, if there are efficiencies, a change in the marginal cost vector from c^pre to c^post. To simulate the new price equilibrium, one may use fixed-point iteration on (2), possibly with a dampening parameter in the markup term, or another algorithm such as the Newton method (see, for example, Judd [1998, 633]).
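To illustrate what such a dampened fixed-point iteration looks like, the following Mata sketch solves a markup equation of the form (2) for a toy single-product logit duopoly; the parameter values (alpha, delta, c) and the dampening factor are made up for the example and are not taken from the article or from mergersim itself.

mata:
// toy logit duopoly: two single-product firms, no collusion
alpha = -0.5                                // price parameter
delta = (2 \ 2)                             // mean valuations
c     = (1 \ 1)                             // marginal costs
damp  = 0.5                                 // dampening parameter
p     = c                                   // starting prices
for (it = 1; it <= 500; it++) {
    e = exp(delta :+ alpha :* p)
    s = e :/ (1 + sum(e))                   // logit market shares
    markup = -1 :/ (alpha :* (1 :- s))      // single-product logit markup term
    pnew = damp :* (c + markup) + (1 - damp) :* p
    if (max(abs(pnew - p)) < 1e-10) break
    p = pnew
}
p                                           // equilibrium prices after convergence
end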
data (see Train [2009] for an overview), Berry (1994) and Berry, Levinsohn, and Pakes (1995) show how to estimate the models with aggregate data. The dataset consists of J × 1 vectors of the products' quantities q, prices p, and a J × K matrix of product characteristics x, including indicator variables for the products' subgroup and group and their firm affiliation. The dataset is for either one market or a panel of markets, for example, different years or different regions and countries. The panel is not necessarily balanced, because new products may be introduced over time, or old products may be eliminated, and not all products may be for sale in all regions.
In addition to each product j's quantity sold qj, its price pj, and the vector of product characteristics xj, it is necessary to observe (or estimate) the potential market size for the differentiated products. In the common unit demand specification of the nested logit, consumers have inelastic conditional demands: they buy either a single unit of their most preferred product j = 1, . . . , J or the outside good j = 0. The potential market size is then the potential number of consumers I, for example, an assumed fraction of the observed population L in the market. An alternative is the constant expenditures specification, where consumers have unit elastic conditional demand: they buy a constant expenditure of their preferred product or the outside good. Here the potential market size is the potential total budget B, for example, an assumed fraction of total gross domestic product Y in the market.
As shown by Berry (1994) and the extensions by Verboven (1996) and Björnerstedt and Verboven (2013), the aggregate two-level nested logit model gives rise to the following linear estimating equation for a cross section of products j = 1, . . . , J:

   ln(s_j/s_0) = x_j β + α p_j + σ1 ln(s_j|hg) + σ2 ln(s_h|g) + ξ_j                    (3)

where β captures the mean valuations for the observed product characteristics, α < 0 is a price parameter, and σ1 and σ2 are two nesting parameters, which measure the consumers' preference correlation for products in the same subgroup and group. The model reduces to a one-level nested logit model with only subgroups as nests if σ2 = 0, to a one-level nested logit model with only groups as nests if σ1 = σ2, and to a simple logit model without nests if σ1 = σ2 = 0. The mean gross valuation for product j is defined as δ_j ≡ x_j β + ξ_j = ln(s_j/s_0) − α p_j − σ1 ln(s_j|hg) − σ2 ln(s_h|g), so it can be computed from the product's market share, price, and the parameters α, σ1, and σ2.
In sum, the aggregate nested logit model is essentially a linear regression of the products' market shares on price, product characteristics, and (sub)group shares. In the unit demand specification, price enters linearly and market shares are in volumes; in the constant expenditures specification, price enters logarithmically and market shares are in values. In both cases, the unobserved product characteristics term, ξ_j, may be correlated with price and market shares, so instrumental variables should be used. Cost shifters would qualify as instruments, but these are typically not available at the product level. Berry, Levinsohn, and Pakes (1995) suggest using sums of the other products' characteristics (over the firm and the entire market). For the nested logit model, Verboven (1996) adds sums of the other product characteristics by subgroup and group.
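As a sketch of such instruments, the lines below construct sums of one characteristic (horsepower) over the other products in the market, over the firm's other products, and over the other products in the same group, using the variable names of the cars1.dta example in section 4; they are illustrative only and are not generated by mergersim.

* market defined as a country-year combination
egen tot_hp_mkt  = total(horsepower), by(country year)
egen tot_hp_firm = total(horsepower), by(country year firm)
egen tot_hp_seg  = total(horsepower), by(country year segment)
generate z_mkt  = tot_hp_mkt  - horsepower      // sum over all other products in the market
generate z_firm = tot_hp_firm - horsepower      // sum over the firm's other products
generate z_seg  = tot_hp_seg  - horsepower      // sum over other products in the same group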
3.1 Syntax
mergersim init [if] [in], marketsize(varname)
   {quantity(varname) | price(varname) | revenue(varname)}
   [nests(varlist) unitdemand cesdemand alpha(#) sigmas(# [#]) name(string)]
mergersim market [if] [in], firm(varname) [conduct(#) name(string)]
mergersim simulate [if] [in], firm(varname)
   {buyer(#) seller(#) | newfirm(varname)}
   [conduct(#) name(string) buyereff(#) sellereff(#) efficiencies(varname)
   newcosts(varname) newconduct(#) method(fixedpoint | newton) maxit(#)
   dampen(#) keepvars detail]
mergersim mre [if] [in], {buyer(#) seller(#) | newfirm(varname)} [name(string)]
3.2 Options
Demand and market specification
The demand and market specification are set in mergersim init and mergersim market (and in mergersim simulate if mergersim market is not explicitly invoked by the user).
marketsize(varname) specifies the potential size of the market (total number of potential buyers in the unit demand specification, total potential budget in the constant expenditures specification). marketsize() is required with mergersim init.
Any two of price(), quantity(), or revenue() are required.
quantity(varname) specifies the quantity variable.
price(varname) specifies the price variable.
revenue(varname) specifies the revenue variable.
nests(varlist) specifies one or two nesting variables. The outer nest is specified first. If only one variable is specified, a one-level nested logit model applies. If the option is not specified, a simple logit model applies.
unitdemand specifies the unit demand specification (default).
cesdemand specifies the constant expenditure specification rather than the default unit demand specification.
alpha(#) specifies a value for the alpha parameter rather than using an estimate. Note that this option has no effect if mergersim market has been run.
sigmas(# [#]) specifies a value for the sigma parameters rather than using an estimate. In the two-level nested logit, the first sigma corresponds to the log share of the product in the subgroup, and the second corresponds to the log share of the subgroup in the group.
name(string) specifies a name for the simulation. Variables created will have the specified name followed by an underscore character rather than the default M_. This option can be used with all the mergersim subcommands.
firm(varname) specifies the integer variable indexing the firm owning the product. firm() is required with mergersim market and mergersim simulate.
conduct(#) measures the fraction of the competitors' profits that firms account for when setting their own prices. It gives the degree of joint profit maximization between firms before the merger in percentage terms (number between 0 and 1).
Merger specification
Computation
The computation options can be set in mergersim simulate, where the postmerger Nash equilibrium is computed.
keepvars specifies that all generated variables should be kept after simulation, calculation of elasticities, or minimal required efficiencies.
detail shows market shares in mergersim simulate. These market shares are relative to total sales (excluding the outside good). Market shares are in terms of volumes for the unit demand specification and in terms of value for the constant expenditure specification. Changes in consumer and producer surplus and in the Herfindahl–Hirschman index are also displayed.
3.3 Description
mergersim performs a merger simulation with the subcommands init, market, and simulate. mergersim init must be invoked first to initialize the settings. mergersim market calculates the price elasticities and marginal costs. mergersim simulate performs a merger simulation, automatically invoking mergersim market if the command has not been called by the user. In addition to displaying results, mergersim creates various variables at each step. By default, the names of these variables begin with M_.
First, mergersim init initializes the settings for the merger simulation. It is required before estimation and before a first merger simulation. It defines the upper and lower nests; the specification (unit demand or constant expenditures demand); the price, quantity, and revenue variables (two out of three); the potential market size variable; and the firm identifier (numerical variable). It also generates the variables necessary to estimate the demand parameters (alpha and sigmas) using a linear (nested) logit regression, similar to Berry (1994) and the extensions of Björnerstedt and Verboven (2013). The names of the market share and price variables to use in the regression will depend on the demand specification and are shown in the display output of mergersim init. Alternatively, the demand parameters can be calibrated with the alpha() and sigmas() options rather than being estimated.
Second, mergersim market computes the premerger conditions (the gross valuations δ_j and marginal costs c_j of each product j) under assumptions regarding the degree of coordination. The computations are based on the last estimates of α, σ1, and σ2 unless they are overruled by values specified by the user in the alpha() and
sigmas() options. mergersim market is required after mergersim init and before the first mergersim simulate. It is not necessary to specify mergersim market before additional mergersim simulates (unless one wants to specify new premerger values of δ_j and c_j).
Third, mergersim simulate computes the postmerger prices and quantities under assumptions regarding the identity of the merged firms, their cost efficiencies, and the degree of collusion (the same as before the merger). It is possible to repeat the command multiple times after estimation.
In addition to these three main subcommands, several other subcommands can provide useful information. For example, mergersim mre computes the minimum required efficiencies per product for the price not to increase after the merger. It can be invoked after mergersim init.
4 Examples
4.1 Preparing the data
To demonstrate mergersim, we use the dataset on the European car market, collected by Goldberg and Verboven (2001) and maintained on their webpages.3 We take a reduced version of that dataset with fewer variables and a slightly more aggregate firm definition; the dataset is called cars1.dta. Each observation comprises a car model, year, and country. The total number of observations is 11,483: there are 30 years (1970–1999) and 5 countries (Belgium, France, Germany, Italy, and the United Kingdom), which implies an average of 77 car models per year and country. The car market is divided into five upper nests (groups) according to the segments: subcompact, compact, intermediate, standard, and luxury. Each segment is further subdivided into lower nests (subgroups) according to the origin: domestic or foreign (for example, Fiat is domestic in Italy and foreign in the other countries). Sales are new car registrations (qu). Price is measured in 1,000 Euro (in 1999 purchasing power). The product characteristics are horsepower (in kilowatts), fuel efficiency (in liter/100 kilometers), width (in centimeters), and height (in centimeters). The commands below are provided in a script called example.do.
3. See http://www.econ.kuleuven.be/public/ndbad83/frank/cars.htm.
. use cars1
. summarize year country co segment domestic firm qu price horsepower fuel
> width height pop ngdp
Variable Obs Mean Std. Dev. Min Max
A first key preparatory task is to define the two dimensions of the panel and to time set the data (unless there is only one cross-section). The first dimension is the product, that is, the car model (for example, Volkswagen [VW] Golf). The second dimension is the market, which can be defined as the country and year (for example, France in 1995).
Note that the panel is unbalanced because most models are not available throughout the entire period or in all countries.
A second key preparatory task is to define the potential market size. For the car market, it is sensible to adopt a unit demand specification. We specify the potential market size as total population divided by 4, a crude proxy for the number of households. In practice, the potential market size in a given year may be lower because cars are durable and consumers who just purchased a car may not consider buying a new one immediately.
. generate MSIZE=pop/4
The first step initializes the settings for the merger simulation using the command mergersim init. The next example specifies a two-level nested logit model where the groups are the segments and the subgroups are domestic or foreign within the segments. This requires the option nests(segment domestic). The specification is the default unit demand specification. The price, quantity, market size, and firm variables are also specified.
mergersim init creates market share and price variables labeled with an M_ prefix (the default prefix). The variable M_ls is the dependent variable ln(s_j/s_0), M_lsjh is the log of the subgroup share ln(s_j|hg), and M_lshg is the log of the group share ln(s_h|g).
We can estimate the nested logit model with a linear regression estimator using instrumental variables to account for the endogeneity of the price and market share variables. As a simplification to illustrate the approach, we consider a fixed-effects regression without instruments.
. xtreg M_ls price M_lsjh M_lshg horsepower fuel width height domestic year
> country2-country5, fe
Fixed-effects (within) regression Number of obs = 11483
Group variable: co Number of groups = 351
R-sq: within = 0.8948 Obs per group: min = 1
between = 0.7576 avg = 32.7
overall = 0.8427 max = 146
F(13,11119) = 7271.50
corr(u_i, Xb) = -0.0147 Prob > F = 0.0000
sigma_u .52455749
sigma_e .36374004
rho .6752947 (fraction of variance due to u_i)
F test that all u_i=0: F(350, 11119) = 22.69 Prob > F = 0.0000
The parameters that will influence the merger simulations are the price parameter α = −0.0468 and the nesting parameters σ1 = 0.905 and σ2 = 0.568 (the coefficients of, respectively, M_lsjh and M_lshg). These estimates satisfy the following restrictions from economic theory: α < 0 and 1 > σ1 ≥ σ2 ≥ 0. However, it is important to stress that the fixed-effects estimator is inconsistent because price and the subgroup and group market share variables are endogenous. As discussed in Berry (1994), an instrumental-variable estimator is required (for example, using ivreg or xtivreg with appropriate instruments). We therefore use only the results from the fixed-effects estimator for illustration.
The second step in the merger simulation calculates the premerger market conditions (the products' gross valuations and their marginal costs and the price elasticities of demand) using the command mergersim market. In the example below, these calculations are done for only the five countries in 1998. Because no values for α, σ1, and σ2 are specified, mergersim market uses the parameters in the last available Stata estimation, that is, the ones from a fixed-effects regression.
Demand estimate
xtreg M_ls price M_lsjh M_lshg horsepower fuel width height domestic year
> country2-country5, fe
Dependent variable: M_ls
Parameters
alpha = -0.047
sigma1 = 0.905
sigma2 = 0.568
Observations: 449
These results imply fairly high own-price elasticities for the products in 1998, −7.488 on average. The cross-price elasticities are higher for products within the same subgroup (0.766) than for products of a different subgroup (0.068) and especially for products of a different group (0.001). The Lerner index or percentage markup over marginal cost varies from 9.9% to 37.2%, with a tendency of higher percentage markups for firms with lower-priced models (a feature of most unit demand logit models).
The third step performs the actual merger simulation using the mergersim simulate command. The example below considers a merger where General Motors (GM) (firm = 15) sells its operations to VW (firm = 26). Note that the merger simulations would be the same if VW sold its operations to GM. We first carry out the merger simulations for Germany in 1998, where it can be considered a domestic merger (because GM sells the Opel brands, which are produced in Germany). It is assumed that there are no marginal cost savings to the seller or the buyer and that there is no partial coordination (neither before nor after the merger).
. mergersim simulate if year == 1998 & country == 3, seller(15) buyer(26)
> detail
Merger Simulation
Simulation method: Newton
Buyer Seller Periods/markets: 1
Firm 26 15 Number of iterations: 6
Marginal cost savings Max price change in last it: 4.5e-06
Prices
Unweighted averages by firm
The results show prices before and after the merger (in 1,000 Euro) and the percentage price change averaged by firm. This information is provided by default, even without the detail option at the end. The merger simulations predict that GM will on average raise its prices by 7.6%, while VW will on average raise its prices by 3.6%. The rivals respond with only very small price increases (with the exception of Ford).4
Because the new price vector is saved, one can use Stata's graphics to plot these results. Consider the following commands:
. generate perc_price_ch=M_price_ch*100
(11386 missing values generated)
. graph bar (mean) perc_price_ch if country==3&year==1998,
> over(firm, sort(perc_price_ch) descending label(angle(vertical)))
> ytitle(Percentage) title(Average percentage price increase per firm)
(Figure: bar chart of the average percentage price increase per firm, sorted in descending order from GM and VW down to Kia.)
4. Note that one can also specify the detail option to display the market shares before and after the merger and the percentage point difference. If one is interested in seeing more detailed results, one can use additional options under mergersim results. One can also use standard Stata commands, such as table, based on the variables M_price (premerger price) and M_price2 (postmerger price).
Without the detail option after the mergersim simulate command, the output reports only the price information. The detail option produces additional results on the following variables (premerger, postmerger, and changes): market shares by firm, the Herfindahl index, C4 and C8 ratios (market share of the 4 and 8 largest firms), and consumer and producer surplus.5
Market shares by quantity
Unweighted averages by firm (premerger, postmerger, and change)
(output omitted)
For example, the Herfindahl index increases from 1,501 to 1,972. Consumer surplus
(in Germany) drops by 1.8 billion Euro or 586 Euro per car (because 3.1 million cars
were sold in Germany in 1998). This is partly compensated by an increase in producer
surplus of 1.3 billion Euro.
5. In logit and nested logit models, consumer surplus (up to a constant) is given by the well-known
log(sum) expression divided by the marginal utility of income. Caution is warranted in the constant
expenditure specification because marginal utility is not constant. See Train (2009).
Efficiencies
First, one may account for the possibility that the buying or the selling firm benefits from a marginal cost saving, which may be passed on to consumer prices. The cost saving is expressed as a percentage of current marginal cost. In the command below, the options sellereff(0.2) and buyereff(0.2) mean that the seller and the buyer each have a marginal cost saving of 20% on all of their products.
. mergersim simulate if year == 1998 & country == 3, seller(15) buyer(26)
> sellereff(0.20) buyereff(0.20) method(fixedpoint) maxit(40) dampen(0.5)
Merger Simulation
Simulation method: Dampened Fixed point
Buyer Seller Periods/markets: 1
Firm 26 15 Number of iterations: 19
Marginal cost savings .2 .2 Max price change in last it: .
Prices
Unweighted averages by firm
There is now a predicted price decrease in Germany of 2.2% for GM and 7.5% for VW. This implies that the 20% cost savings are sufficiently passed to consumers. To obtain convergence, we used a fixed-point iteration with a dampening factor of 0.5 because the default Newton method did not converge. sellereff() and buyereff()
assume the same percentage cost saving for all products of the seller and buyer. A product-specific alternative is the efficiencies(varname) option.
The generated variable M_mre refers to the minimum required efficiency per product owned by the merging firms and is set to a missing value for the products of the nonmerging firms. According to the results, the minimum required efficiencies for the 19 products of the merging firms are on average 12.3% (unweighted) and 22.1% (weighted by sales).
Divestiture as a remedy
Second, one may account for divestiture as a remedy to mitigate the price effects of a merger. Under such a remedy, the competition authority accepts the merger on the condition that the firms sell some of their products or brands. To simulate the effects of a merger with divestiture, one can replace the options buyer(#) and seller(#) with the option newfirm(varname), which specifies a variable for the new ownership structure after the merger. To illustrate, we consider a merger between Renault (firm = 18) and PSA (firm = 16), where PSA sells the brands Peugeot and Citroën. This merger would substantially raise average prices in France: 59.8% for the Renault products and 63.1% for the PSA products (ignoring entry and substitution to other countries). To mitigate the anticompetitive effects, the competition authority may request that PSA sell one of its brands, Citroën (brand = 4), to Fiat (firm = 4). The commands below show how to simulate the effects of such a merger with divestiture after creating the appropriate variable firm_rem for the new ownership structure.6
6. Note that this example starts with mergersim init and moves to mergersim simulate without performing a regression to obtain the price and nesting parameters. In this case, mergersim continues to use the most recent results.
. generate firm_rem=firm
. replace firm_rem=16 if firm==18 // original merger
(890 real changes made)
. replace firm_rem=4 if brand==4 // divestiture
(583 real changes made)
. quietly mergersim init, nests(segment domestic) unit price(price)
> quantity(qu) marketsize(MSIZE) firm(firm)
. quietly mergersim simulate if year == 1998 & country == 2, seller(16)
> buyer(18)
. mergersim simulate if year == 1998 & country == 2, newfirm(firm_rem)
Merger Simulation
Simulation method: Newton
Variable name Periods/markets: 1
Ownership from: firm_rem Number of iterations: 7
Marginal cost savings Max price change in last it: 9.7e-08
Prices
Unweighted averages by firm
The results show that the merger with divestiture raises the average price only by
16.2% for Renault and by 8.9% for the Peugeot brand, whereas the price of Fiat (now
including the Citroën brand) increases by 0.6%. The option newfirm(varname) can
also be used for other applications, for example, to assess the impact of two consecutive
mergers.
Conduct
Third, one may account for the possibility that firms partially coordinate, that is, take into account a fraction of the competitors' profits when setting prices. Assume, for example, that firms maintain the same degree of coordination before and after the merger: one can set the conduct parameter such that the markups are in line with outside estimates. Performing mergersim market before mergersim simulate enables one to verify whether the conduct parameter results in premerger markups in line with outside estimates. This is shown in the following example (which returns to the earlier merger between GM and VW in Germany).
Demand estimate
xtreg M_ls price M_lsjh M_lshg horsepower fuel width height domestic year
> country2-country5, fe
Dependent variable: M_ls
Parameters
alpha = -0.047
sigma1 = 0.905
sigma2 = 0.568
Observations: 97
The results show that if firms coordinate by taking into account 50% of the competitors' profits, then the Lerner index becomes almost twice as high as when there is no coordination. The predicted price effects after the merger can now be computed.
Merger Simulation
Simulation method: Newton
Buyer Seller Periods/markets: 1
Firm 26 15 Number of iterations: 6
Marginal cost savings Max price change in last it: 2.1e-07
Pre Post
Conduct: .5 .5
Prices
Unweighted averages by firm
Under partial coordination, the merger simulation predicts larger price increases. On one hand, there is a larger predicted price increase for the merging firms: this feature does not hold generally, because the merging firms already partially coordinate before the merger. On the other hand, there is also a larger predicted price increase for the outsider firms: this feature may hold more generally because it reflects that outsiders have more cooperative responses to price changes by the merging firms.
The merger simulation results depend on the values of three parameters: α, σ1, and σ2 (and on the price and quantity data per product). A practitioner may not want to rely too heavily on the econometric estimates of these parameters and may want to verify whether the elasticities and markups are consistent with external industry information. Here a practitioner would not estimate but calibrate the parameters such that they result in price elasticities and markups that are equal to external estimates. Such calibration is possible by specifying the options alpha() and sigmas() to mergersim market. The selected values overrule the values in memory, for example, the ones from a previous estimation. In the lines below, we specify α = −0.035 (closer to 0 as compared with the econometric estimate of α = −0.047), and we keep σ1 and σ2 at the previous values. Hence, we calibrate such that demand would be less elastic. The results from this calibration indeed imply lower price elasticities (on average −5.5):
Demand calibration
Parameters
alpha = -0.035
sigma1 = 0.910
sigma2 = 0.570
Observations: 97
The next lines show what this calibration implies for merger simulation.
Merger Simulation
Simulation method: Newton
Buyer Seller Periods/markets: 1
Firm 26 15 Number of iterations: 6
Marginal cost savings Max price change in last it: 5.9e-06
Prices
Unweighted averages by firm
These results show that the predicted price increase is larger when demand is less elastic.
One can also use the calibration options alpha() and sigmas() to implement a parametric bootstrap for constructing confidence intervals of the computed merger effects. The following lines perform three steps. First, we take 100 draws for α, σ1, and σ2, assuming the parameters are normally distributed. Second, we perform 100 merger simulations, one for each draw. Third, we save the results for the average price increase of the buying firm and the selling firm, and we compute summary statistics.
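The article's code for these steps is not reproduced in this excerpt; the following loop is a hedged sketch of the idea. The standard deviations used for the draws and the way the results are collected are illustrative placeholders, not the authors' values.

set seed 339487731
forvalues i = 1/100 {
    * draw alpha, sigma1, sigma2 around the fixed-effects estimates (sds illustrative)
    scalar a  = -0.0468 + 0.002*rnormal()
    scalar s1 =  0.905  + 0.010*rnormal()
    scalar s2 =  0.568  + 0.010*rnormal()
    quietly mergersim market if year==1998 & country==3, firm(firm) ///
        alpha(`=scalar(a)') sigmas(`=scalar(s1)' `=scalar(s2)')
    quietly mergersim simulate if year==1998 & country==3, seller(15) buyer(26)
    * collect the average M_price_ch for the buyer and the seller here (for example, with postfile)
}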
Earlier, we obtained point estimates for the percentage price increase of 7.6% for GM and 3.6% for VW (for the base scenario). The 95% confidence intervals for these price increases are [6.7, 8.4]% for GM and [3.1, 4.0]% for VW.
. generate MSIZE1=ngdpe/5
This assumes the potential expenditures on cars in a country and year are 20% of
total gross domestic product.
Next we calibrate (rather than estimate) the parameters to α = −0.5, σ1 = 0.9, and σ2 = 0.6.
We can verify the premerger elasticities and markups at these calibrated parameters:
Demand calibration
Parameters
alpha = -0.500
sigma1 = 0.900
sigma2 = 0.600
Observations: 97
The premerger elasticities and markups are roughly comparable with the ones of the estimated unit demand model (with less variation between firms). However, as shown below, the merger simulation results in a larger predicted price increase: +10.1% for GM and +4.4% for VW. This follows from the different functional form: the constant expenditures specification has the property of quasi-constant price elasticity, whereas the unit demand specification has the property that consumers become more price sensitive as firms raise prices. For this same reason, efficiencies in the form of marginal cost savings would also be passed more to consumers under this specification.
Merger Simulation
Simulation method: Newton
Buyer Seller Periods/markets: 1
Firm 26 15 Number of iterations: 7
Marginal cost savings Max price change in last it: 4.7e-09
Prices
Unweighted averages by firm
(output omitted )
Because the detail option was added, mergersim simulate reports additional results. Consumer surplus now drops by 2.2 billion Euro (versus 1.8 billion Euro in the unit demand specification), and producer surplus increases by 1.1 billion Euro (versus 1.3 billion Euro before).
5 Conclusions
This overview has shown how to apply two specifications of the two-level nested logit demand system to merger simulation. We show that merger simulation can be applied as a postestimation command based on estimated parameter values, or it can be implemented without estimation but with calibrated parameters. The merger simulation results yield intuitive predictions given the assumed demand parameters.7 The set of merger simulation commands can be used to simulate the effects of horizontal mergers in a standard setting (differentiated products, multiproduct Bertrand price setting). One can also incorporate various extensions, including efficiencies in the form of cost savings, remedies through partial divestiture, and alternative behavioral assumptions (partial collusive behavior).
Other applications and extensions could be considered. For example, for the car
market, it could be interesting to generalize the demand model to allow consumers to
substitute between countries by introducing an upper nest for the choice of country
instead of assuming such substitution is not possible. These additional substitution
possibilities would limit the market power effects of mergers. Other demand models may also be considered, such as a random coefficients logit model or the almost ideal demand system.
6 References
Berry, S., J. Levinsohn, and A. Pakes. 1995. Automobile prices in market equilibrium. Econometrica 63: 841–890.
Björnerstedt, J., and F. Verboven. 2013. Does merger simulation work? Evidence from the Swedish analgesics market. http://www.econ.kuleuven.be/public/ndbad83/Frank/Papers/Bjornerstedt%20&%20Verboven,%202013.pdf.
Budzinski, O., and I. Ruhmer. 2010. Merger simulation in competition policy: A survey. Journal of Competition Law & Economics 6: 277–319.
Farrell, J., and C. Shapiro. 1990. Horizontal mergers: An equilibrium analysis. American Economic Review 80: 107–126.
Froeb, L. M., and G. J. Werden. 1998. A robust test for consumer welfare enhancing mergers among sellers of a homogeneous product. Economics Letters 58: 367–369.
7. We stress, however, that the estimated parameters were based on an inconsistent fixed-effects
estimator. In practice, one should use instrumental variables to estimate the parameters consistently.
Goldberg, P. K., and F. Verboven. 2001. The evolution of price dispersion in the
European car market. Review of Economic Studies 68: 811–848.
Ivaldi, M., and F. Verboven. 2005. Quantifying the effects from horizontal mergers in
European competition policy. International Journal of Industrial Organization 23:
669–691.
Nevo, A. 2000. Mergers with differentiated products: The case of the ready-to-eat cereal
industry. RAND Journal of Economics 31: 395–421.
Posner, R. A. 1975. The social costs of monopoly and regulation. Journal of Political
Economy 83: 807–828.
Röller, L.-H., J. Stennek, and F. Verboven. 2001. Efficiency gains from mergers.
European Economy 5: 31–128.
Train, K. E. 2009. Discrete Choice Methods with Simulation. 2nd ed. Cambridge:
Cambridge University Press.
Werden, G. J., and L. Froeb. 1994. The effects of mergers in differentiated products
industries: Logit demand and merger policy. Journal of Law, Economics, and
Organization 10: 407–426.
1 Introduction
treatrew is a user-written command for estimating average treatment effects (ATEs) by
reweighting (REW) on the propensity score. Depending on the specified model (probit
or logit), treatrew provides consistent estimation of ATEs under the hypothesis of
selection on observables. Conditional on a prespecified set of observable exogenous
variables x (thought of as those driving the nonrandom assignment to treatment),
treatrew estimates the average treatment effect (ATE), the average treatment effect
on the treated (ATET), and the average treatment effect on the nontreated (ATENT); it
also estimates these parameters conditional on the observable factors x (that is, ATE(x),
ATET(x), and ATENT(x)).
2. Build weights as 1/p̂i for treated observations and 1/(1 − p̂i) for untreated observations.
3. Calculate ATEs by comparing the weighted means of the two groups (for instance,
with a weighted least-squares [WLS] regression).
i. y1 = g1(x) + ε1, with E(ε1) = 0
ii. y0 = g0(x) + ε0, with E(ε0) = 0
iii. y = w·y1 + (1 − w)·y0
iv. Conditional mean independence (CMI) holds; therefore, E(y1|w, x) = E(y1|x) and
E(y0|w, x) = E(y0|x)
v. x is exogenous
y1 and y0 are the subject's outcomes when treated and untreated, respectively; g1(x)
and g0(x) are the subject's reaction functions to the confounder x when the subject is
treated and untreated, respectively; w is the binary treatment indicator taking value 1
for treated and 0 for untreated subjects; ε0 and ε1 are two error terms with unconditional
zero mean; and x is a set of observable and exogenous confounding variables assumed
to drive the nonrandom assignment into treatment. In short, the CMI assumption states
that it is sufficient to control only for x to restore random assignment conditions. When
assumptions i–v hold,
\[ \text{ATE} = E\left[\frac{\{w-p(x)\}\,y}{p(x)\{1-p(x)\}}\right] \tag{1} \]
\[ \text{ATET} = E\left[\frac{\{w-p(x)\}\,y}{p(w=1)\{1-p(x)\}}\right] \tag{2} \]
\[ \text{ATENT} = E\left[\frac{\{w-p(x)\}\,y}{p(w=0)\,p(x)}\right] \tag{3} \]
Estimation follows in two steps: i) estimate the propensity score p(xi), thus obtaining
p̂(xi); and ii) substitute p̂(xi) into the previous formulas to get the parameters. Consistency is
guaranteed because these estimators are M-estimators.
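As an illustration of this two-step logic, the sample analogs of (1)–(3) can be computed directly with a few lines of Stata code (a minimal sketch, in which y, w, and the confounder list in $xvars are placeholder names rather than treatrew internals):
. probit w $xvars
. predict double ps, pr
. summarize w, meanonly
. scalar pw1 = r(mean)
. generate double k_ate   = (w - ps)*y/(ps*(1 - ps))
. generate double k_atet  = (w - ps)*y/(scalar(pw1)*(1 - ps))
. generate double k_atent = (w - ps)*y/((1 - scalar(pw1))*ps)
. summarize k_ate k_atet k_atent
The means of k_ate, k_atet, and k_atent are the sample analogs of (1), (2), and (3), with p(w = 1) and p(w = 0) replaced by the sample shares of treated and untreated units.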
But how do we get standard errors for the previous estimators? We can exploit some
results when the first step is a maximum likelihood (ML) estimation and the second step
is an M-estimation. In our case, the first step is an ML based on logit (or probit), and the
second step is a standard M-estimator. For such cases, Wooldridge (2007; 2010, 922–924)
proposed a straightforward procedure to get analytical standard errors provided that the
propensity score is correctly specified. In what follows, we demonstrate Wooldridge's
(2007; 2010, 922–924) procedure and formulas for obtaining these standard errors.
and call them êi (i = 1, . . . , N). The asymptotic standard error for ATE is equal to
\[ \left\{\frac{1}{N}\sum_{i=1}^{N}\hat{e}_i^2\right\}^{1/2}\Big/\sqrt{N} \tag{4} \]
and we can use it to test the significance of ATE. Of course, d̂i will have a different
expression according to the probability model adopted. Here we consider the logit and
probit cases.
Case 1: Logit
Suppose that the correct probability follows a logistic distribution. This means that
\[ p(x_i,\beta) = \frac{\exp(x_i\beta)}{1+\exp(x_i\beta)} = \Lambda(x_i\beta) \tag{5} \]
Thus, by simple algebra, we see that
\[ \hat{d}_i = x_i\,(w_i - \hat{p}_i) \]
Case 2: Probit
Suppose that the right probability follows a normal distribution. This means that
\[ p(x_i,\beta) = \Phi(x_i\beta) \]
Thus, by simple algebra, we see that
\[ \hat{d}_i = \frac{\phi(x_i\hat{\beta})\,x_i\,\{w_i - \Phi(x_i\hat{\beta})\}}{\Phi(x_i\hat{\beta})\{1-\Phi(x_i\hat{\beta})\}} \]
where Φ(·) and φ(·) are the normal cumulative distribution and density functions,
respectively. One can also add functions of x to the estimation of the previous formulas. This
reduces standard errors if these functions are partially correlated with k̂i.
Finally, observe that the previous procedure produces standard errors that are lower
than those produced by ignoring the first step (that is, the propensity-score estimation
via ML). Indeed, the naive standard error
\[ \left\{\frac{1}{N}\sum_{i=1}^{N}\bigl(\hat{k}_i - \widehat{\text{ATE}}\bigr)^2\right\}^{1/2}\Big/\sqrt{N} \]
is higher than the one produced by the previous procedure.
The standard errors presented in this section are correct when the actual data-
generating process follows the probit or the logit probability rules. If not, then a mea-
surement error is present, and the estimations might be inconsistent. Authors such as
Hirano, Imbens, and Ridder (2003) and Li, Racine, and Wooldridge (2009) have sug-
gested more flexible nonparametric estimation of the standard errors. Under correct
specification, a straightforward alternative is to use bootstrapping, where the binary
response estimation and the averaging are included in each bootstrap iteration.
command syntax. The user has to declare: a) the outcome variable, that is, the variable
over which the treatment is expected to have an impact (outcome); b) the binary
treatment variable (treatment); c) a set of confounding variables (varlist); and, finally,
d) a series of options. Two options are important: the option model(modeltype) sets
the type of model, probit or logit, that has to be used in estimating the propensity
score; the option graphic and the related option range(a b) produce a chart where the
distributions of ATE(x), ATET(x), and ATENT(x) are jointly plotted within the interval
[a, b].
As an e-class command, treatrew provides an ereturn list of objects (such as
scalars and matrices) that can be used in subsequent calculations. In particular, the values of
ATE, ATET, and ATENT are returned in the scalars e(ate), e(atet), and e(atent), and
they can be used to get bootstrapped standard errors. By default, treatrew provides
analytical standard errors.
4.1 Syntax
treatrew outcome treatment varlist [if] [in] [weight] [, model(modeltype)
graphic range(a b) conf(#) vce(robust)]
outcome is the target variable for measuring the impact of the treatment.
treatment is the binary treatment variable taking 1 for treated and 0 for untreated
subjects.
varlist is the set of pretreatment (or observable confounding) variables.
fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight.
4.2 Description
treatrew estimates ATEs by REW on the propensity score. Depending on the specified
model, treatrew provides consistent estimation of ATEs under the hypothesis of selection
on observables. Conditional on a prespecified set of observable exogenous variables
x (thought of as those driving the nonrandom assignment to treatment), treatrew
estimates the ATE, the ATET, the ATENT, and these parameters conditional on the
observable factors x (that is, ATE(x), ATET(x), and ATENT(x)). Parameters' standard
errors are provided either analytically (following Wooldridge [2010, 920–930]) or via
bootstrapping. treatrew assumes that the propensity-score specification is correct.
treatrew creates several variables:
4.3 Options
model(modeltype) specifies the model for estimating the propensity score, where modeltype
must be either probit or logit. model() is required.
graphic allows for a graphical representation of the density distributions of ATE(x),
ATET(x), and ATENT(x) within their whole support.
4.5 Examples
To show a practical application of treatrew, we use an instructional dataset called
fertil2.dta, which is included in Wooldridge (2013) and collects cross-sectional data
on 4,361 women of childbearing age in Botswana. This dataset is freely downloadable
at http://fmwww.bc.edu/ec-p/data/wooldridge/fertil2.dta. It contains 28 variables on
women and family characteristics.
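The data can be loaded directly from that address (a sketch; it assumes an active Internet connection):
. use "http://fmwww.bc.edu/ec-p/data/wooldridge/fertil2.dta", clear
. describe, short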
Using fertil2.dta, we are interested in evaluating the impact of the variable educ7
(taking value 1 if a woman has at least seven years of education and
0 otherwise) on the number of family children (children). Several conditioning (or
confounding) observable factors are included in the dataset, such as the age of the
woman (age), whether the family owns a television (tv), whether the woman lives
in a city (urban), and so forth. To inquire into the relation between education and
fertility according to Wooldridge's (2010, ex. 21.3, 940) specification, we estimate the
ATE, ATET, and ATENT (as well as ATE(x), ATET(x), and ATENT(x)) by REW using
treatrew. We also compare REW results with other popular program evaluation
methods: i) the difference in means (DIM), taken as benchmark; ii) the OLS regression-based
random-coefficient model with heterogeneous reaction to confounders, estimated
through the user-written command ivtreatreg, provided by Cerulli (2011); and iii) a
one-to-one nearest-neighbor matching, computed by the command psmatch2, provided
by Leuven and Sianesi (2003). Because matching estimators can be seen as specific
REW procedures (Busso, DiNardo, and McCrary 2008), comparing REW with matching
is worthwhile. By taking just the case of ATET, we can prove that
\[ \widehat{\text{ATET}}_{\text{Matching}} = \frac{1}{N_1}\sum_{i\in(w=1)}\Bigl\{y_i - \sum_{j\in C(i)} h(i,j)\,y_j\Bigr\} \]
\[ = \frac{1}{N_1}\sum_{i=1}^{N} w_i\,y_i \;-\; \frac{1}{N_1}\sum_{j=1}^{N}(1-w_j)\,y_j \sum_{i=1}^{N} w_i\,h(i,j) \]
\[ = \frac{1}{N_1}\sum_{i=1}^{N} w_i\,y_i \;-\; \frac{1}{N_0}\sum_{j=1}^{N}(1-w_j)\,y_j\,\omega(j) \;=\; \widehat{\text{ATET}}_{\text{Reweighting}} \]
where ω(j) = (N0/N1) Σ_{i=1}^{N} wi h(i, j) are the REW factors, C(i) is the untreated subjects' neighborhood
for the treated subject i, and h(i, j) are matching weights that, once opportunely
specified, produce different types of matching methods. Results from all of these
estimators are reported in table 1.
Table 1. Comparison of ATE, ATET, and ATENT estimates among DIM, CF-OLS, REW, and MATCH

          (1)       (2)       (3)         (4)         (5)           (6)           (7)
          DIM       CF-OLS    REW         REW         REW           REW           MATCH(a)
                              (probit)    (logit)     (probit)      (logit)
                              analytical  analytical  bootstrapped  bootstrapped
                              std. err.   std. err.   std. err.     std. err.
ATE       1.77***   0.374***  0.43***     0.415***    0.434***      0.415***      0.316***
          0.062     0.051     0.068       0.068       0.070         0.071         0.080
          28.46     7.35      6.34        6.09        6.15          5.87          3.93
ATET                0.255***  0.355**     0.345***    0.355***      0.345***      0.131
                    0.048     0.15        0.104       0.0657        0.054         0.249
                    5.37      2.37        3.33        5.50          6.45          0.52
ATENT               0.523***  0.532***    0.503**     0.532***      0.503***      0.549***
                    0.075     0.19        0.257       0.115         0.119         0.135
                    7.00      2.81        1.96        4.61          4.21          4.07

Note: each cell reports the coefficient, standard error, and t statistic (b/se/t). DIM = difference in means;
CF-OLS = control-function OLS; REW = reweighting on the propensity score; MATCH = one-to-one
nearest-neighbor matching. (a) Standard errors for ATE and ATENT are computed by bootstrapping.
*** significant at 1%, ** at 5%, * at 10%.
For CF-OLS, standard errors for ATET and ATENT are obtained via bootstrap.
Results set out in columns 3–6 refer to the REW estimator. In columns 3 and 4,
standard errors are computed analytically, whereas in columns 5 and 6, they are
computed via bootstrap for the probit and logit models, respectively. These results can be
retrieved by typing sequentially
. treatrew children educ7 age agesq evermarr urban electric tv, model(probit)
. treatrew children educ7 age agesq evermarr urban electric tv, model(logit)
. bootstrap e(ate) e(atet) e(atent), reps(200):
> treatrew children educ7 age agesq evermarr urban electric tv, model(probit)
. bootstrap e(ate) e(atet) e(atent), reps(200):
> treatrew children educ7 age agesq evermarr urban electric tv, model(logit)
where the option common restricts the sample to subjects with common support. To
test the balancing property for such a matching estimation, we provide a DIM on the
propensity score before and after matching treated and untreated subjects, using the
psmatch2 postestimation command pstest.
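The matching itself is performed with psmatch2 (the call is not reproduced above); a sketch of the matching and balancing-test commands follows, in which the psmatch2 options shown are assumptions based on the one-to-one nearest-neighbor matching with common support described in the text:
. psmatch2 educ7 age agesq evermarr urban electric tv, outcome(children) neighbor(1) common
. pstest age agesq evermarr urban electric tv, both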
(output omitted )
This test suggests that with regard to the propensity score, the matching procedure
implemented by psmatch2 is balanced, so we can trust matching results (the propensity
score was unbalanced before matching, and it becomes balanced after matching).
Unlike DIM, results from CF-OLS and REW are fairly comparable in terms of both
coefficients' size and significance: the values of ATE, ATET, and ATENT obtained using
REW on the propensity score are a little higher than those obtained using CF-OLS. This
means that the linearity of the potential-outcome equations assumed by CF-OLS is
an acceptable approximation. According to the value of ATET, as obtained by REW and
visible in column 3 of table 1, an educated woman in Botswana would have been, ceteris
paribus, significantly more fertile had she been less educated. We can conclude that
education has a negative impact on fertility, leading a woman to have around 0.5 fewer
children. If confounding variables were not considered, as happens when using DIM, this
negative effect would appear dramatically higher, around 1.77 children: the difference
between 1.77 and 0.5 (around 1.3) is an estimate of the bias induced by the presence
of selection on observables.
Columns 3 and 4 show REW results using Wooldridge's (2010) analytical standard
errors in the case of probit and logit, respectively. As partly expected, these results
are similar. But the REW results when standard errors are obtained via bootstrap
(columns 5 and 6) are more interesting. Here statistical significance is confirmed when
compared with results derived from analytical formulas. However, bootstrapping seems
to increase significance for both ATET and ATENT, while the standard error for ATE is
in line with the analytical one.
Some differences in results emerge when applying the one-to-one nearest-neighbor
matching (column 7) on this dataset. In this case, ATET becomes insignificant, with a
magnitude that is around one-third lower than that obtained by REW. As said above, the
standard errors of ATE and ATENT are here obtained via bootstrap because psmatch2
does not provide analytical solutions for these two parameters. Nevertheless, as proved
by Abadie and Imbens (2008), bootstrap performance is generally poor in the case of
matching, so these results have to be taken with some caution.
Finally, figure 1 sets out the estimated kernel densities for the distributions of ATE(x),
ATET(x), and ATENT(x) when treatrew is used with options graphic and range(-30
30). It is evident that the distribution of ATET(x) is a bit more concentrated around
its mean (equal to ATET) than the distribution of ATENT(x) is; this indicates that more
educated women respond more homogeneously to a higher level of education. On the
contrary, less educated women react more heterogeneously to a potential higher level of
education.
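The figure corresponds to a treatrew call of the following form (a sketch assembled from the options documented earlier; the call itself is not reproduced above):
. treatrew children educ7 age agesq evermarr urban electric tv, model(logit) graphic range(-30 30)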
Figure 1. Kernel density distributions of ATE(x), ATET(x), and ATENT(x) over x (logit model)
. use fertil2
. teffects ipw (children) (educ7 $xvars, probit), ate
Iteration 0: EE criterion = 6.624e-21
Iteration 1: EE criterion = 4.722e-32
Treatment-effects estimation Number of obs = 4358
Estimator : inverse-probability weights
Outcome model : weighted mean
Treatment model: probit
Robust
children Coef. Std. Err. z P>|z| [95% Conf. Interval]
ATE
educ7
(1 vs 0) -.1531253 .0755592 -2.03 0.043 -.3012187 -.0050319
POmean
educ7
0 2.208163 .0689856 32.01 0.000 2.072954 2.343372
In this estimation, we see that the value of ATE is −0.153 with a standard error of
0.075, which results in a moderately significant effect of educ7 on children.
This value of ATE can also be obtained using a simple WLS regression of y on w and
a constant, with weights hi designed in this way:
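The weight construction and the regression call are not reproduced above; a minimal sketch is the following, where $xvars is assumed to hold the covariates used in the teffects call and the point estimate then coincides with the normalized (Hájek) IPW estimate because regress implicitly renormalizes probability weights within each group:
. probit educ7 $xvars
. predict double ps, pr
. generate double h = educ7/ps + (1 - educ7)/(1 - ps)
. regress children educ7 [pweight=h]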
Robust
children Coef. Std. Err. t P>|t| [95% Conf. Interval]
This table shows that the results of the commands calculating IPW and WLS for ATE
are identical. A difference, however, appears in the estimated standard errors, which
are quite divergent: 0.075 for IPW against 0.108 for WLS. Moreover, observe that ATE as
calculated by WLS becomes nonsignificant.
Why are these standard errors different? The answer resides in the different approaches
used for estimating the variance of ATE (and, possibly, ATET): the WLS regression uses the
usual OLS variance–covariance matrix adjusted for the presence of a matrix of weights,
let's say Ω; however, WLS does not consider the presence of a generated regressor,
namely, the weights computed through the propensity scores estimated in the first step.
On the contrary, IPW accounts for the variability introduced by the generated weights
by exploiting a generalized method of moments approach for estimating the correct
variance–covariance matrix (see StataCorp [2013, 68–88]). In this sense, IPW is a more
robust approach than a standard WLS regression.
As implemented in Stata, both WLS and IPW by default use normalized weights,
that is, weights that add up to one. treatrew, on the contrary, uses nonnormalized
weights, which is why the ATE values obtained from treatrew (see the previous section)
are numerically different from those obtained from WLS and IPW. As proved by Busso,
DiNardo, and McCrary (2008, 7), a general formula for estimating ATE by REW is
\[ \widehat{\text{ATE}} = \frac{1}{N}\sum_{i=1}^{N} w_i\,y_i\,h_{i1} \;-\; \frac{1}{N}\sum_{i=1}^{N}(1-w_i)\,y_i\,h_{i0} \tag{8} \]
where
\[ h_{i1} = 1/\hat{p}(x_i), \qquad h_{i0} = 1/\{1-\hat{p}(x_i)\} \]
Such weights do not sum up to one. In this case, analytical standard errors cannot be
retrieved by a weighted regression, and the method suggested by Wooldridge (2010)
and implemented through treatrew for getting correct analytical standard errors for
ATE, ATET, and ATENT is thus needed, because a generated regressor from the first-step
estimation is used in the second step.
The normalized weights used in WLS and IPW are instead
\[ h_{i1} = \frac{1/\hat{p}(x_i)}{\dfrac{1}{N_1}\displaystyle\sum_{i=1}^{N} w_i/\hat{p}(x_i)}, \qquad
   h_{i0} = \frac{1/\{1-\hat{p}(x_i)\}}{\dfrac{1}{N_0}\displaystyle\sum_{i=1}^{N}(1-w_i)/\{1-\hat{p}(x_i)\}} \]
Appendix B shows that if the formula of ATE implemented in treatrew were written using
normalized (rather than nonnormalized) weights, then treatrew's ATE
estimate would become numerically equivalent to the value of ATE obtained by the
commands used to calculate WLS and IPW.
Thus we can assert that both teffects ipw and treatrew lead to correct analytical
standard errors, because both take into account that the propensity score is a generated
regressor from a first-step (probit or logit) regression. The different values of ATE and
ATET obtained in the two approaches are due only to the different weighting schemes
(normalized versus nonnormalized).
In short, treatrew is useful when considering nonnormalized weights, that is, when a
pure IPW scheme is used. Moreover, compared with teffects ipw, treatrew provides
an estimate of ATENT, though it does not by default provide an estimate of the
mean potential outcomes.
6 Conclusion
This article provides a command, treatrew, for estimating ATEs by REW on the propensity
score as proposed by Rosenbaum and Rubin (1983). Although REW is a popular and
long-standing statistical technique to deal with the bias induced by drawing inference
from a nonrandom sample, an implementation in Stata with parameters'
analytic standard errors (as proposed by Wooldridge [2010, 920–930]) and a nonnormalized
weighting scheme was still missing. This article and the accompanying ado-file fill
this gap by providing an easy-to-use implementation of the REW method, which can be
used as a valuable tool for estimating causal effects under selection on observables.
7 References
Abadie, A., and G. W. Imbens. 2008. On the failure of the bootstrap for matching
estimators. Econometrica 76: 1537–1557.
Brunell, T. L., and J. DiNardo. 2004. A propensity score reweighting approach to estimating
the partisan effects of full turnout in American presidential elections. Political
Analysis 12: 28–45.
Busso, M., J. DiNardo, and J. McCrary. 2008. Finite sample properties of semiparametric
estimators of average treatment effects.
http://elsa.berkeley.edu/users/cle/laborlunch/mccrary.pdf.
Cerulli, G. 2011. ivtreatreg: A new Stata routine for estimating binary treatment
models with heterogeneous response to treatment under observable and unobservable
selection. 8th Italian Stata Users Group meeting proceedings.
http://www.stata.com/meeting/italy11/abstracts/italy11_cerulli.pdf.
Hirano, K., G. W. Imbens, and G. Ridder. 2003. Efficient estimation of average treatment
effects using the estimated propensity score. Econometrica 71: 1161–1189.
Leuven, E., and B. Sianesi. 2003. psmatch2: Stata module to perform full Mahalanobis
and propensity score matching, common support graphing, and covariate imbalance
testing. Statistical Software Components S432001, Department of Economics, Boston
College. http://ideas.repec.org/c/boc/bocode/s432001.html.
Li, Q., J. S. Racine, and J. M. Wooldridge. 2009. Efficient estimation of average treatment
effects with mixed categorical and continuous data. Journal of Business and
Economic Statistics 27: 206–223.
Lunceford, J. K., and M. Davidian. 2004. Stratification and weighting via the propensity
score in estimation of causal treatment effects: A comparative study. Statistics in
Medicine 23: 2937–2960.
Morgan, S. L., and D. J. Harding. 2006. Matching estimators of causal effects: Prospects
and pitfalls in theory and practice. Sociological Methods and Research 35: 3–60.
Nichols, A. 2007. Causal inference with observational data. Stata Journal 7: 507–541.
Robins, J. M., M. A. Hernán, and B. Brumback. 2000. Marginal structural models and
causal inference in epidemiology. Epidemiology 11: 550–560.
Rosenbaum, P. R., and D. B. Rubin. 1983. The central role of the propensity score in
observational studies for causal effects. Biometrika 70: 41–55.
Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed.
Cambridge, MA: MIT Press.
Appendix A
This appendix provides the mathematical steps to get the REW formulas for ATEs as
reported in (1)–(3). Observe first that wy = w{wy1 + (1 − w)y0} = w²y1 + wy0 − w²y0 =
wy1 because w² = w. Therefore,
\[ E\left\{\frac{wy}{p(x)}\,\Big|\,x\right\}
 = E\left\{\frac{wy_1}{p(x)}\,\Big|\,x\right\}
 \overset{\text{LIE}}{=} E\left[E\left\{\frac{wy_1}{p(x)}\,\Big|\,x,w\right\}\Big|\,x\right]
 = E\left\{\frac{w\,E(y_1|x,w)}{p(x)}\,\Big|\,x\right\} \]
\[ \overset{\text{CMI}}{=} E\left\{\frac{w\,E(y_1|x)}{p(x)}\,\Big|\,x\right\}
 = E\left\{\frac{w\,g_1(x)}{p(x)}\,\Big|\,x\right\}
 = g_1(x)\,E\left\{\frac{w}{p(x)}\,\Big|\,x\right\}
 = \frac{g_1(x)}{p(x)}\,E(w|x) = \frac{g_1(x)}{p(x)}\,p(x) = g_1(x) \tag{9} \]
p(x) p(x)
provided that 0 < p(x) < 1. To get ATE, one needs to take the expectation of ATE(x)
over x,
\[ \text{ATE} = E_x\{\text{ATE}(x)\}
 = E_x\left[E\left\{\frac{\{w-p(x)\}\,y}{p(x)\{1-p(x)\}}\,\Big|\,x\right\}\right]
 = E\left[\frac{\{w-p(x)\}\,y}{p(x)\{1-p(x)\}}\right] \]
For ATET, observe that
\[ \frac{\{w-p(x)\}\,y}{1-p(x)} = \frac{\{w-p(x)\}\,y_0}{1-p(x)} + w(y_1-y_0) \tag{11} \]
Consider now the quantity {w − p(x)}y0 on the right-hand side of (11). We see that
its conditional expectation given x is zero under CMI; that is,
\[ E\left[\frac{\{w-p(x)\}\,y}{1-p(x)}\right] = E\{w(y_1-y_0)\} \]
Denoting the left-hand side quantity by h,
\[ E(h) = E\{w(y_1-y_0)\}
 = p(w=1)\,E\{w(y_1-y_0)\,|\,w=1\} + p(w=0)\,E\{w(y_1-y_0)\,|\,w=0\} \]
\[ = p(w=1)\,E\{(y_1-y_0)\,|\,w=1\} = p(w=1)\cdot\text{ATET} \]
proving that
\[ \text{ATET} = E\left[\frac{\{w-p(x)\}\,y}{p(w=1)\{1-p(x)\}}\right] \]
Appendix B
In this appendix, we show that if one considers the formula of ATE as implemented
in treatrew but uses normalized rather than nonnormalized weights, then treatrew's
ATE estimate becomes numerically equivalent to the ATE obtained by commands used
to calculate WLS and IPW. To this purpose, we first calculate the ATE estimator by
means of the general formula in (8), adopting the normalized IPW weights:
\[ \widehat{\text{ATE}} = \frac{1}{N}\sum_{i=1}^{N} w_i\,y_i\,h_{i1} \;-\; \frac{1}{N}\sum_{i=1}^{N}(1-w_i)\,y_i\,h_{i0} \]
As an intermediate step, we show that the normalized weights sum up to one for the weights
of both the treated and the untreated subjects.
Second, we compute the estimate of ATE by multiplying the two summands for
the treated and untreated units in (8) by the outcome y (equal in this example to the
variable children):
which is numerically equivalent to the value of the ATE obtained via WLS and IPW.
The Stata Journal (2014) 14, Number 3, pp. 562–579
Abstract. We present motivation and new commands for modeling count data.
While our focus is to present new commands for estimating count data, we also
discuss generalized binomial regression and present the zero-inflated versions of
each model.
Keywords: st0351, gbin, zigbin, nbregf, nbregw, zinbregf, zinbregw, binomial, Waring,
count data, overdispersion, underdispersion
1 Introduction
We introduce programs for regression models of count data. Poisson regression analysis
is widely used to model such response variables, but the Poisson model assumes
equidispersion (equality of the mean and variance). In practice, equidispersion is rarely
reflected in data. In most situations, the variance exceeds the mean. This occurrence
of extra-Poisson variation is known as overdispersion (see, for example, Dean [1992]).
In situations where the variance is smaller than the mean, data are characterized as
being underdispersed. Modeling underdispersed count data with inappropriate models
can lead to overestimated standard errors and misleading inference. While there are
various approaches for modeling overdispersed count data, such as the negative binomial
distributions and other mixtures of Poisson (Yang et al. 2007; Hilbe 2014), there are
few models for underdispersed count data. Harris, Yang, and Hardin (2012) introduced
a generalized Poisson regression command to handle underdispersed count data.
As stated earlier, count data can be analyzed using regression models based on the
Poisson distribution. However, in this article, we will discuss other discrete regression
models that can be used, such as the generalized negative binomial distribution, which
was described by Jain and Consul (1971) and later by Consul and Gupta (1980). The
distribution was also investigated by Famoye (1995), who illustrated a use for analyzing
grouped binomial data.
2 The models
2.1 Generalized negative binomial: Famoye
As implemented in the accompanying software, the NBREGF model assumes that μ is
a scalar unknown parameter. Thus the probability mass function (PMF), mean, and
variance are given by
\[ P(Y=y) = \frac{\mu}{\mu+\Phi y}\binom{\mu+\Phi y}{y}\,\theta^{y}(1-\theta)^{\mu+\Phi y-y} \tag{1} \]
where 0 < θ < 1 and 1 ≤ Φ < 1/θ for μ > 0 and nonnegative outcomes yi (0, 1, 2, . . .).
\[ E(Y) = \mu\theta\,(1-\theta\Phi)^{-1} \]
\[ V(Y) = \mu\theta(1-\theta)\,(1-\theta\Phi)^{-3} \]
The main differences from the GBIN model are that the parameter μ is an unknown
parameter in (1) but a known parameter in (2), and that Φ ≥ 1. In the limit
Φ → 1, the variance approaches that of the negative binomial distribution. Thus the Φ
parameter generalizes the negative binomial distribution in the NBREGF model to have
greater variance than is allowed in a negative binomial regression model. To construct a
regression model, we implemented the log link log(μ) = xβ to make results comparable
to Poisson and negative binomial models.
\[ E(Y) = n\,\frac{\pi}{1+\phi\pi}\left(1-\frac{\phi\pi}{1+\phi\pi}\right)^{-1} = n\pi \]
\[ V(Y) = n\,\frac{\pi}{1+\phi\pi}\left(1-\frac{\pi}{1+\phi\pi}\right)(1+\phi\pi)^{3} = n\pi(1-\pi+\phi\pi)(1+\phi\pi) \]
Parameterizing g(π) = xβ, where g(·) is a suitable link function, assuming that π plays
the role of the probability of success, we obtain results that coincide with a grouped-data
binomial model. The variance is equal to the binomial variance if φ = 0, and it is
equal to the negative binomial variance if φ = 1. Thus the φ > 0 parameter generalizes the
binomial distribution in the GBIN regression model.
i. Y | x, λx, ν ∼ Poisson(λx ν)
ii. λx | ν ∼ Gamma(ax, ν)
iii. ν ∼ Beta(ρ, k)
where k, ρ, ax > 0, ax = μ(ρ − 1)/k, and (a)w is the Pochhammer notation for Γ(a + w)/Γ(a)
if a > 0. The expected value and variance of the distribution are
\[ E(Y) = \frac{a_x k}{\rho-1} = \mu \]
\[ V(Y) = \mu + \frac{k+1}{\rho-2}\,\mu + \frac{k+1}{k(\rho-2)}\,\mu^{2} \tag{3} \]
where ax, k > 0 and ρ > 2 (to ensure nonnegative variance). To construct a regression
model, we implemented the log link log(μ) = xβ to make results comparable to Poisson
and negative binomial models. A unique characteristic of this model occurs when the
data are from a different underlying distribution. For instance, when the data are
from a Poisson distribution with V(Y) = μ, this indicates that (k + 1)/(ρ − 2) → 0 and
(k + 1)/{k(ρ − 2)} → 0, which occurs as k, ρ → ∞. Also, if the data have an underlying NB-2
(negative binomial-2) distribution with V(Y) = μ + αμ² (where α is the dispersion
parameter), this indicates that (k + 1)/(ρ − 2) → 0 and (k + 1)/{k(ρ − 2)} → α,
which occurs as ρ → ∞ with k(ρ − 2) → 1/α.
where p is the probability that the binary process results in a zero outcome, 0 ≤ p < 1,
and f(y) is the count probability function. Zero-inflated models are proposed for the NBREGF,
GBIN, and NBREGW distributions.
3 Syntax
The accompanying software includes the command files as well as supporting files for
prediction and help. In the following syntax diagrams, unspecified options include the
usual collection of maximization and display options available to all estimation commands.
In addition, all zero-inflated commands include the ilink(linkname) option to
specify the link function for the inflation model. The generalized binomial model for
grouped binomial data also includes the link(linkname) option for linking the probability
of success to the linear predictor. Supported linknames include logit, probit,
loglog, and cloglog.
The syntax for specifying a generalized binomial regression model for grouped data
is given by
gbin depvar indepvars [if] [in] [weight] [, options]
The syntax for fitting a generalized negative binomial regression model where the
distribution is assumed to follow Famoye's description is given by
nbregf depvar indepvars [if] [in] [weight] [, options]
The syntax for fitting a generalized negative binomial regression model where the
distribution is derived from the Waring distribution is given by
nbregw depvar indepvars [if] [in] [weight] [, options]
The syntax for specifying a zero-inflated count model where the count distribution
follows that described by Famoye is given by
zinbregf depvar indepvars [if] [in] [weight] [,
inflate(varlist [, offset(varname)] | _cons) vuong options]
The syntax for specifying a zero-inflated count model where the count distribution
follows the Waring distribution is given by
zinbregw depvar indepvars [if] [in] [weight] [,
inflate(varlist [, offset(varname)] | _cons) vuong options]
A Vuong test (see Vuong [1989]) evaluates whether the regression model with zero
inflation or the regression model without zero inflation is closer to the true model. A
random variable is defined as the vector ω = log LZ − log LS, where LZ is the likelihood of
the zero-inflated model evaluated at its maximum likelihood estimates, and LS is the
likelihood of the standard (nonzero-inflated) model evaluated at its maximum likelihood
estimates. The vector of differences over the N observations is then used to define the
statistic
\[ V = \frac{\sqrt{N}\,\overline{\omega}}{\sqrt{\sum_{i=1}^{N}(\omega_i-\overline{\omega})^{2}/(N-1)}} \]
which, asymptotically, is characterized by a standard normal distribution. A significant
positive statistic indicates preference for the zero-inflated model, and a significant
negative statistic indicates preference for the model without zero inflation. Nonsignificant
Vuong statistics indicate no preference for either model. Results of this test are
included in a footnote to the estimation output when the user includes the vuong
option in any of the zero-inflated commands. Vuong statistics with corrections based
on the Akaike information criterion (AIC) and the Bayesian information criterion (BIC)
are also displayed in the output (see Desmarais and Harden [2013] for details). They
are displayed for each of the zero-inflated models discussed in this article.
4 Example
We shall use the popular German health data for the year 1984 as example data. The
goal of our model is to understand the number of visits made to a physician during 1984.
Our predictor of interest is whether the patient is highly educated based on achieving
a graduate degree, for example, an MA or MS, an MBA, a PhD, or a professional degree.
Confounding predictors are age (from 25–64) and income in German Marks, divided by
10. We first model the data using Poisson regression. The glm command is used to
determine the Pearson dispersion, or dispersion statistic, which is not available using
the poisson command.
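The Poisson fit itself is not reproduced here; a minimal sketch of the kind of call described (the covariate list edlevel4 age hh is carried over from the models shown below and is an assumption):
. glm docvis edlevel4 age hh, family(poisson) eform nolog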
OIM
docvis IRR Std. Err. z P>|z| [95% Conf. Interval]
. estat ic
Akaike's information criterion and Bayesian information criterion
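The negative binomial fit being compared next was omitted from the listing; a minimal sketch of how such an NB-2 model could be obtained (command choice and covariate list are assumptions):
. nbreg docvis edlevel4 age hh, irr nolog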
. estat ic
Akaike's information criterion and Bayesian information criterion
The AIC and BIC statistics are substantially lower here than they are for the Poisson
model, indicating a much better fit than the Poisson model.
. display 1/exp(_b[edlevel4])
1.3763358
Patients without a graduate education are 38% more likely to see a physician than
are patients with a graduate education. We can likewise affirm that patients without
a graduate education saw a physician 38% more often in 1984 than patients with a
graduate education.
The negative binomial model did not adjust for all the correlation, or dispersion, in
the data.
This is perhaps due to the excessive number of times a patient in the data never
saw a physician in 1984. A tabulation of docvis shows that nearly 42% of the 3,874
patients in the data did not visit a physician. This value is far greater than the one
accounted for by the Poisson and negative binomial distributional assumptions.
. count if docvis==0
1611
. display "Zeros account for " %4.2f (r(N)*100/3874) "% of the outcomes"
Zeros account for 41.58% of the outcomes
Given the excess zero counts in docvis, it may be wise to employ a zero-inflated
regression model on the data. At the least, we can determine which predictors tend to
prevent patients from going to the doctor.
. zinb docvis edlevel4 age hh, nolog inflate(edlevel4 age hh) irr
Zero-inflated negative binomial regression Number of obs = 3874
Nonzero obs = 2263
Zero obs = 1611
Inflation model = logit LR chi2(3) = 98.50
Log likelihood = -8330.799 Prob > chi2 = 0.0000
docvis
edlevel4 .9176719 .1289238 -0.61 0.541 .6967903 1.208573
age 1.020511 .0025432 8.15 0.000 1.015538 1.025508
hh .4506524 .0720932 -4.98 0.000 .3293598 .6166132
_cons 1.768336 .2419851 4.17 0.000 1.352333 2.31231
inflate
edlevel4 1.174194 .3519899 3.34 0.001 .4843067 1.864082
age -.0521002 .0115586 -4.51 0.000 -.0747547 -.0294458
hh .2071444 .570265 0.36 0.716 -.9105545 1.324843
_cons -.037041 .4438804 -0.08 0.933 -.9070305 .8329486
. estat ic
Akaike's information criterion and Bayesian information criterion
The AIC statistic is 20 points lower in the zero-inflated model, but the BIC statistic is 5
points higher. However, the variables edlevel4 and age appear to affect the zero counts,
with younger graduate patients more likely to not see a physician at all during the year.
Given the zero-inflated model, patients without a graduate education see the physician
9% more often than patients with a graduate education.
. display 1/exp(_b[edlevel4])
1.0897141
Because the excess zero counts did not appear to account for the extra correlation in the
data, there may be other factors. We employ a generalized Waring negative binomial model
to further identify the source of extra dispersion.
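The Waring fit itself is not reproduced; a sketch of the call, using the nbregw syntax given in section 3 (the covariate list is an assumption):
. nbregw docvis edlevel4 age hh, nolog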
. estat ic
Akaike's information criterion and Bayesian information criterion
The AIC and BIC statistics are substantially lower here than for either the negative
binomial or the zero-inflated version. For the calculated ρ̂ and k̂, V(Y) = μ + 0.624μ +
2.994μ², where μ is the mean. Here we see that the term (k + 1)/{k(ρ − 2)} =
2.994, from (3), is close to the dispersion parameter α = 2.319 obtained when using the NB-2
regression model above. More information on the background of this model can
be found in Hilbe (2011).
To address the excess zeros in the outcome, we also fit a zero-inflated Waring model.
. zinbregw docvis edlevel4 age hh, nolog inflate(edlevel4 age hh) eform vuong
Zero-inflated gen neg binomial-W regression Number of obs = 3874
Regression link: Nonzero obs = 2263
Inflation link : logit Zero obs = 1611
Wald chi2(3) = 66.10
Log likelihood = -8262.174 Prob > chi2 = 0.0000
docvis
edlevel4 .9414482 .1406355 -0.40 0.686 .7024933 1.261684
age 1.017108 .0024842 6.95 0.000 1.012251 1.021989
hh .4841428 .0964645 -3.64 0.000 .3276222 .7154409
_cons 2.457403 .3313549 6.67 0.000 1.886691 3.200751
inflate
edlevel4 .613575 .2222675 2.76 0.006 .1779387 1.049211
age -.026716 .0048778 -5.48 0.000 -.0362763 -.0171558
hh -.0137845 .3544822 -0.04 0.969 -.7085569 .6809879
_cons .1834942 .245023 0.75 0.454 -.2967421 .6637305
Vuong test of zinbregw vs. gen neg binomial(W): z = 0.55 Pr>z = 0.2897
Bias-corrected (AIC) Vuong test: z = 0.13 Pr>z = 0.4482
Bias-corrected (BIC) Vuong test: z = -1.20 Pr>z = 0.8845
. estat ic
Akaike's information criterion and Bayesian information criterion
Note that introducing the zero-inflation component into the regression model results
in losing significance of the education level in the model of the mean outcomes. However,
that variable does play a significant role (along with age) in determining whether a
person has zero visits to the doctor.
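The Famoye generalized negative binomial fit discussed next was omitted from the listing; a sketch of the call, using the nbregf syntax given in section 3 (the eform option and the covariate list are assumptions):
. nbregf docvis edlevel4 age hh, nolog eform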
. estat ic
Akaike's information criterion and Bayesian information criterion
Note that the risk ratios are nearly identical to those of the NB-2 negative binomial model.
The AIC and BIC statistics are lower than for NB-2, but only by about 12 and 5 points,
respectively. Because of the excessive zero counts, we also fit a zero-inflated model.
. zinbregf docvis edlevel4 age hh, nolog inflate(edlevel4 age hh) eform vuong
Zero-inflated gen neg binomial-F regression Number of obs = 3874
Regression link: Nonzero obs = 2263
Inflation link : logit Zero obs = 1611
LR chi2(3) = 176.08
Log likelihood = -8292.015 Prob > chi2 = 0.0000
docvis
edlevel4 .9125286 .1191361 -0.70 0.483 .7065079 1.178626
age 1.017058 .0024233 7.10 0.000 1.012319 1.021818
hh .4915087 .0753322 -4.63 0.000 .3639736 .6637315
_cons .0010836 .2112138 -0.04 0.972 1.3e-169 8.9e+162
inflate
edlevel4 .7118035 .2073926 3.43 0.001 .3053213 1.118286
age -.0380198 .0054111 -7.03 0.000 -.0486254 -.0274142
hh .2529651 .3447803 0.73 0.463 -.422792 .9287221
_cons .368429 .2425669 1.52 0.129 -.1069933 .8438514
Vuong test of zinbregf vs. gen neg binomial(F): z = 6.23 Pr>z = 0.0000
Bias-corrected (AIC) Vuong test: z = 5.68 Pr>z = 0.0000
Bias-corrected (BIC) Vuong test: z = 3.99 Pr>z = 0.0000
. estat ic
Akaike's information criterion and Bayesian information criterion
The AIC and BIC statistics are substantially lower than for the nonzero-inflated
parameterization, and they are also lower than for the Waring regression model. Here we find
that younger patients without a graduate education see physicians more frequently than
patients with a graduate education (as we discovered before) and that the important
statistics are the two ancillary (dispersion) parameters.
commands, we generate data following a complementary log-log link function for the
generalized binomial outcome and a log-log link for the zero-inflation component.
Once we have defined the components of the outcome and the necessary covariates,
we generate the outcome. The zero-inflated version of the outcome is the product of
the binomial outcome and the zero-inflation (binary) component.
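The data-generation commands themselves are not reproduced; the following is a minimal sketch of the kind of code described above, in which the sample size, the covariate names (x1, z1, z2), the binomial denominator, and all coefficient values are illustrative assumptions rather than the authors' exact choices:
. clear
. set seed 12345
. set obs 5000
. generate double x1 = rnormal()
. generate double z1 = rnormal()
. generate double z2 = rnormal()
. generate n = 20
. * complementary log-log link for the binomial success probability
. generate double p = 1 - exp(-exp(-2 + 0.45*x1))
. generate y = rbinomial(n, p)
. * log-log link for the probability of NOT being an excess zero
. generate double pnz = exp(-exp(-(0.2 + 0.25*z1 + 0.4*z2)))
. generate byte notzero = rbinomial(1, pnz)
. * zero-inflated outcome: product of the binomial outcome and the binary component
. generate yo = y*notzero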
Before fitting the zero-inflated model for the zero-inflated outcome, we first illustrate
how well a zero-inflated model might fit the nonzero-inflated outcome. In this case, we
should expect the binomial regression components to estimate the means well, and we
should expect the covariates of the zero-inflation component to be nonsignificant.
y
x1 .447438 .0681432 6.57 0.000 .3138797 .5809963
_cons -1.958826 .0540354 -36.25 0.000 -2.064733 -1.852918
inflate
z1 .4499741 .4248806 1.06 0.290 -.3827765 1.282725
z2 2.068714 60.05847 0.03 0.973 -115.6437 119.7812
_cons -3.264426 60.05983 -0.05 0.957 -120.9795 114.4507
Note that the Vuong statistic was nonsignificant in this example. Though it fails to
provide compelling evidence for one model over the other, we would prefer the nonzero-inflated
model because of the lack of significant covariates in the inflation equation. When we
fit a zero-inflated model for the outcome that was specifically generated to include zero
inflation, we see a much better fit.
yo
x1 .4628085 .086265 5.36 0.000 .2937322 .6318848
_cons -1.969505 .0873894 -22.54 0.000 -2.140785 -1.798225
inflate
z1 .2292778 .1270487 1.80 0.071 -.019733 .4782886
z2 .3955768 .1296781 3.05 0.002 .1414125 .6497411
_cons -.4796692 .1724896 -2.78 0.005 -.8177426 -.1415958
Here the Vuong test indicates a clear preference for the zero-inflated model, and we
note that the estimated coefficients are close to the values we specified in synthesizing
these data.
6 References
Cameron, A. C., and P. K. Trivedi. 2013. Regression Analysis of Count Data. 2nd ed.
Cambridge: Cambridge University Press.
Consul, P. C., and H. C. Gupta. 1980. The generalized negative binomial distribution
and its characterization by zero regression. SIAM Journal on Applied Mathematics
39: 231–237.
Dean, C. B. 1992. Testing for overdispersion in Poisson and binomial regression models.
Journal of the American Statistical Association 87: 451–457.
Desmarais, B. A., and J. J. Harden. 2013. Testing for zero inflation in count models:
Bias correction for the Vuong test. Stata Journal 13: 810–835.
Famoye, F. 1995. Generalized binomial regression model. Biometrical Journal 37: 581–594.
Hardin, J. W., and J. M. Hilbe. 2012. Generalized Linear Models and Extensions. 3rd
ed. College Station, TX: Stata Press.
Harris, T., Z. Yang, and J. W. Hardin. 2012. Modeling underdispersed count data with
generalized Poisson regression. Stata Journal 12: 736–747.
Jain, G. C., and P. C. Consul. 1971. A generalized negative binomial distribution. SIAM
Journal on Applied Mathematics 21: 501–513.
Tang, W., H. He, and X. M. Tu. 2012. Applied Categorical and Count Data Analysis.
Boca Raton, FL: Chapman & Hall/CRC.
Vuong, Q. H. 1989. Likelihood ratio tests for model selection and non-nested hypotheses.
Econometrica 57: 307–333.
Wang, X.-F., Z. Jiang, J. J. Daly, and G. H. Yue. 2012. A generalized regression model
for region of interest analysis of fMRI data. Neuroimage 59: 502–510.
Winkelmann, R. 2008. Econometric Analysis of Count Data. 5th ed. Berlin: Springer.
Yang, Z., J. W. Hardin, C. L. Addy, and Q. H. Vuong. 2007. Testing approaches for
overdispersion in Poisson regression versus the generalized Poisson model. Biometrical
Journal 49: 565–584.
Alfonso Flores-Lagunes
Department of Economics
State University of New York, Binghamton
Binghamton, NY
aflores@binghamton.edu
Alessandra Mattei
Department of Statistics, Informatics, Applications Giuseppe Parenti
University of Florence
Florence, Italy
mattei@disia.unifi.it
1 Introduction
The evaluation process in economics, sociology, law, and many other fields generally
relies on applying nonexperimental techniques to estimate average treatment effects.
Propensity-score methods (Rosenbaum and Rubin 1983) are attractive empirical tools
to balance the distribution of covariates between treatment groups and compare the
groups in terms of observed covariates. Under the unconfoundedness assumption, which
requires that potential outcomes are independent of the treatment conditional on the
observed covariates, propensity-score methods allow one to eliminate (or at least reduce)
the potential bias in treatment-effects estimates in observational studies. Most
applications aim to evaluate causal effects of a binary treatment. There is extensive
literature on identifying and estimating causal effects of binary treatments (for example,
Imbens and Wooldridge [2009]; Stuart [2010]; Angrist, Imbens, and Rubin [1996]),
and many statistical software packages have built-in or add-on functions for implementing
methods to estimate causal effects of programs or policies. For example,
Becker and Ichino (2002) developed a set of programs (pscore.ado) for estimating average
treatment effects on the treated using propensity-score matching by focusing on
four matching estimators: nearest-neighbor, radius, kernel, and stratification matching.
More recently, building on the work of Becker and Ichino (2002), Dorn (2012)
proposed a routine that helps improve covariate balance, and so the specification of the
propensity-score model, using data-driven approaches.
In many empirical studies, treatments may take on many values, implying that
participants in the study may receive different treatment levels. In such cases, one
may want to assess the heterogeneity of treatment effects arising from variation in the
amount of treatment exposure, that is, estimate a dose–response function (DRF). Over
the past years, propensity-score methods have been generalized and applied to multivalued
treatments (for example, Imbens [2000]; Lechner [2001]) and, more recently, to continuous
treatments and arbitrary treatment regimes (for example, Hirano and Imbens
[2004]; Imai and van Dyk [2004]; Flores et al. [2012]; Bia and Mattei [2012]; Kluve et al.
[2012]).
In this article, we build on work by Hirano and Imbens (2004), who introduced the
concept of the generalized propensity score (GPS) and used it to estimate the entire DRF
of a continuous treatment. Hirano and Imbens (2004) used a parametric partial-mean
approach to estimate the DRF. Here we focus on semiparametric techniques. Specifically,
we present a set of programs that allows users to i) estimate the GPS under alternative
parametric assumptions using generalized linear models;1 ii) impose the common support
condition as defined in Flores et al. (2012) and assess the balance of covariates after
adjusting for the estimated GPS; and iii) estimate the DRF using the estimated GPS by
applying either the nonparametric inverse-weighting (IW) kernel estimator developed in
Flores et al. (2012) or a new set of semiparametric estimators based on penalized spline
techniques.
1. Guardabascio and Ventura (2014) proposed the routine gpscore2.ado to estimate the GPS using
generalized linear models.
We use a dataset collected by Imbens, Rubin, and Sacerdote (2001) to illustrate these
programs and to evaluate the effect of the prize amount on subsequent labor earnings
of winners of the Megabucks lottery in Massachusetts in the mid-1980s. We implement
our programs to semiparametrically estimate the average potential postwinning labor
earnings for each lottery prize amount. The prize is obviously assigned at random,
but unit and item nonresponse lead to a self-selected sample where the prize amount
received is no longer independent of background characteristics.
This article is organized as follows: Section 2 describes the methodological approach
we refer to in the analysis. Section 3 introduces the GPS model and the semiparametric
estimators of the DRF. Sections 4.1 and 4.2 show, respectively, the syntax and the options
of the drf command. Section 5 illustrates the methods and the program using data
from Imbens, Rubin, and Sacerdote (2001). Section 6 concludes.
2 Estimation strategy
We estimate a continuous DRF that relates each value of the dose (for example, lottery
prize amount) to the outcome variable (for example, postwinning labor earnings) within
the potential-outcome approach to causal inference (Rubin 1974, 1978). Formally, con-
sider a set of N individuals, and denote each of them by subscript i: i = 1, . . . , N .
Under the stable unit treatment value assumption (Rubin 1980, 1990), for each unit
i, there is a set of potential outcomes {Yi(t)}t∈T, where T is a subset of the real line,
T ⊆ R. We are interested in estimating the average DRF, μ(t) = E{Yi(t)}.
For each individual i, we observe a vector of pretreatment covariates, Xi , the received
treatment level, Ti , and the corresponding value of the outcome for this treatment level,
Yi = Yi (Ti ).
The central assumption of our approach is that the assignment to treatment levels is
weakly unconfounded given the set of observed variables, that is, Yi(t) ⊥ Ti | Xi for all t ∈
T (Hirano and Imbens 2004). This assumption is described as weak unconfoundedness
because it requires only conditional independence for each potential outcome Yi(t) rather
than joint independence of all potential outcomes.
Under weak unconfoundedness, we can apply the GPS techniques for continuous
treatments introduced by Hirano and Imbens (2004). Let r(t, x) = fT|X(t|x) be the
conditional density of the treatment given the covariates. The GPS is defined as Ri =
r(Ti, Xi). The GPS is a balancing score (Rosenbaum and Rubin 1983; Hirano and Imbens
2004); that is, within strata with the same value of r(t, x), the probability that
T = t does not depend on the value of X. The weak unconfoundedness assumption,
combined with the balancing score property, implies that assignment to treatment is
weakly unconfounded given the GPS. Formally,
for every t ∈ T (theorem 1.2.2 in Hirano and Imbens [2004]). Thus any bias associated
with differences in the distribution of covariates across groups with different treatment
levels can be removed using the GPS. Formally, Hirano and Imbens (2004) showed that
3 Inference
We use two-step semiparametric estimators of the DRF. The first step is to parametrically
model and estimate the GPS, Ri = r(Ti, Xi), and to assess the common support
condition and the balance of the covariates. The second step is to estimate the average
DRF, μ(t), using either the nonparametric IW kernel estimator proposed by Flores et al.
(2012) or a semiparametric spline-based estimator. Here we describe these two steps,
implemented in the routine drf.
2. betafit (version 1.0.0 at the time of this writing) is available from the Statistical Software Com-
ponents archive (or findit betafit) and must be installed separately from drf.
compares the support of R̂i^k for those units with Qi = qk with that of units with Qi ≠ qk
and is given by the subsample
\[ CS_k = \left\{ i : \hat{R}_i^k \in \left[ \max\Bigl\{\min_{j:Q_j=q_k}\hat{R}_j^k,\ \min_{j:Q_j\neq q_k}\hat{R}_j^k\Bigr\},\ \min\Bigl\{\max_{j:Q_j=q_k}\hat{R}_j^k,\ \max_{j:Q_j\neq q_k}\hat{R}_j^k\Bigr\} \right] \right\} \]
Finally, the sample is restricted to units that are comparable across all the K intervals
simultaneously by keeping only individuals who are in the common
support region for all k intervals. Therefore, the common-support subsample is given
by CS = ∩_{k=1}^{K} CS_k.
As in applications of standard propensity-score methods, in GPS applications, it is
crucial to evaluate how well the estimated GPS balances the covariates. Several methods
can be applied to evaluate the balancing properties of the GPS. The drf command
implements two approaches: an approach based on blocking on the GPS and an approach
that uses a likelihood-ratio (LR) test. The blocking on the GPS approach was proposed
by Hirano and Imbens (2004), and it is implemented in the drf routine using two-
sided t tests or Bayes factors (see also Bia and Mattei [2008]). The second approach
was proposed by Flores et al. (2012), who suggested using an LR test to compare an
unrestricted model for Ti that includes all covariates and the GPS (up to a cubic term)
with a restricted model that sets the coefficients of all covariates equal to zero. If the GPS
sufficiently balances the covariates, then the covariates should have little explanatory
power conditional on the GPS.3
3. An alternative approach, which is not implemented in our program, was proposed by Kluve et al.
(2012). It consists of regressing each covariate on the treatment variable and comparing the significance
of the coefficients for specifications with and without conditioning on the GPS.
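As an illustration of this model-comparison idea, the two models can also be compared by hand outside of drf (a sketch with placeholder names: t for the treatment, gpshat for the estimated GPS, and x1, x2, x3 for the covariates; drf's internal implementation may differ):
. regress t x1 x2 x3 c.gpshat c.gpshat#c.gpshat c.gpshat#c.gpshat#c.gpshat
. estimates store unrestricted
. regress t c.gpshat c.gpshat#c.gpshat c.gpshat#c.gpshat#c.gpshat
. estimates store restricted
. lrtest unrestricted restricted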
The simplest bivariate penalized spline smoothing relies on additive spline bases,
which can be formally defined in our setting as
\[ E\bigl(Y_i \mid T_i, \hat{R}_i\bigr) = a_0 + a_t T_i + a_r \hat{R}_i
 + \sum_{k=1}^{K^t} u_k^t\,(T_i-\kappa_k^t)_+ + \sum_{k=1}^{K^r} u_k^r\,(\hat{R}_i-\kappa_k^r)_+ \tag{1} \]
where for any number z, z_+ is equal to z if z is positive and is equal to 0 otherwise, and
κ_1^t < · · · < κ_{K^t}^t and κ_1^r < · · · < κ_{K^r}^r are K^t and K^r distinct knots in the support of T
and the estimated GPS, R̂_i, respectively.
The additive models have many attractive features, one being their simplicity. However,
an additive model may not provide a satisfactory fit, so more complex models
including interaction terms are required. To this end, we consider tensor product
bases, which are obtained by forming all pairwise products of the basis functions
1, T_i, (T_i − κ_1^t)_+, . . . , (T_i − κ_{K^t}^t)_+ and 1, R̂_i, (R̂_i − κ_1^r)_+, . . . , (R̂_i − κ_{K^r}^r)_+. Formally,
\[ E\bigl(Y_i \mid T_i, \hat{R}_i\bigr) = a_0 + a_t T_i + a_r \hat{R}_i + a_{tr} T_i \hat{R}_i
 + \sum_{k=1}^{K^t} u_k^t (T_i-\kappa_k^t)_+ + \sum_{k=1}^{K^r} u_k^r (\hat{R}_i-\kappa_k^r)_+
 + \sum_{k=1}^{K^t} v_k^t\,\hat{R}_i (T_i-\kappa_k^t)_+ \]
\[ \qquad + \sum_{k=1}^{K^r} v_k^r\,T_i (\hat{R}_i-\kappa_k^r)_+
 + \sum_{k=1}^{K^t}\sum_{k'=1}^{K^r} v_{kk'}^{tr}\,(T_i-\kappa_k^t)_+ (\hat{R}_i-\kappa_{k'}^r)_+ \tag{2} \]
Estimation problems may arise when the tensor product approach is applied, espe-
cially if the sample size is relatively small. When these problems arise, the drf program
alerts users and suggests they adopt an additive model instead.
As an alternative to tensor product splines, we propose to use the so-called radial
basis functions, which are basis functions of the form C{‖(t, r)' − (κ_k^t, κ_k^r)'‖} for some
univariate function C. Here we consider the following function
\[ C\bigl(\|(t,r)'-(\kappa_k^t,\kappa_k^r)'\|\bigr)
 = \|(t,r)'-(\kappa_k^t,\kappa_k^r)'\|^{2}\,\log\,\|(t,r)'-(\kappa_k^t,\kappa_k^r)'\| \tag{3} \]
Given the estimated parameters of the regression functions (1), (2), or (3), the
average potential outcome at treatment level t is estimated by averaging the estimated
regression function over R̂_i^t.
Flores et al. (2012) proposed to estimate the DRF using a nonparametric IW estimator
based on kernel methods. In this approach, the estimated scores are used to weight
observations to adjust for covariate differences. Let K(u) be a kernel function with the
usual properties, and let h be a bandwidth satisfying h → 0 and Nh → ∞ as N → ∞.
The IW approach is implemented using a local linear regression of Y on T with the weighted
kernel function K̃_{h,X}(T_i − t) = K_h(T_i − t)/R̂_i^t, where K_h(z) = h^{-1}K(z/h). Formally,
the IW kernel estimator of the average DRF is defined as
\[ \hat{\mu}(t) = \frac{D_0(t)S_2(t) - D_1(t)S_1(t)}{S_0(t)S_2(t) - S_1^2(t)} \]
where S_j(t) = Σ_{i=1}^{N} K̃_{h,X}(T_i − t)(T_i − t)^j and D_j(t) = Σ_{i=1}^{N} K̃_{h,X}(T_i − t)(T_i − t)^j Y_i,
j = 0, 1, 2.
We implement the IW estimator using a normal kernel. By default, the global band-
width is selected using the procedure proposed by Fan and Gijbels (1996), which esti-
mates the unknown terms in the optimal global bandwidth by using a global polynomial
of order p + 3, where p is the order of the local polynomial fitted. However, users can
also choose an alternative global bandwidth.
Note that the argument varlist represents the observed pretreatment variables, which
are used to estimate the GPS. Note that spacefill must be installed (Bia and Van Kerm
2014).4
4.2 Options
Required
Global options
gps stores the estimated generalized propensity score in the gpscore variable that is
added to the dataset.6
family(familyname) specifies the distribution used to estimate the GPS. The available
distributional families are Gaussian (normal) (family(gaussian)), inverse Gaussian
(family(igaussian)), Gamma (family(gamma)), and Beta (family(beta)). The
default is family(gaussian). The Gaussian, inverse Gaussian, and Gamma distributional
families are fit using glm, and the beta distribution is fit using betafit.
The following four options are for the glm command, so they can be specified only
when the Gaussian, inverse Gaussian, or Gamma distribution is assumed for the treatment
variable.
link(linkname) specifies the link function for the Gaussian, inverse Gaussian, and
Gamma distributional families. The available links are link(identity), link(log),
and link(pow), and the default is the canonical link for the family() specified (see
help for glm for further details).
5. The subroutines mtpspline and radialpspline are called, respectively, when estimators with penalized
splines (type = mtspline) and radial penalized splines (type = radialpspline) are used.
6. This option must not be specified when running the bootstrap.
vce(vcetype) specifies the type of standard error reported for the GPS estimation when
the Gaussian, inverse Gaussian, or Gamma distribution is assumed for the treatment
variable. vcetype may be oim, robust, cluster clustvar, eim, opg, bootstrap,
jackknife, hac kernel, or jackknife1 (see help glm for further details).
nolog(#) is a flag (# = 0, 1) that suppresses the iterations of the algorithm toward
eventual convergence when running the glm command. The default is nolog(0).
search searches for good starting values for the parameters of the generalized linear
model used to estimate the generalized propensity score (see help glm for further
details).
Overlap options
test_varlist(varlist) specifies that the balancing property must be assessed for each
variable in varlist. The default test_varlist() consists of all the variables used to
estimate the GPS.
test(type) allows users to specify whether the balancing property is to be assessed
using a blocking on the GPS approach employing either standard two-sided t tests
(test(t_test)) or Bayes factors (test(Bayes_factor)), or using a model-comparison
approach with an LR test (test(L_like)).
The blocking on the GPS approach using standard two-sided t tests provides the
values of the test statistics before and after adjusting for the GPS for each pretreat-
ment variable included in test_varlist() and for each prefixed treatment interval
specified in cutpoints(). Specifically, let p be the number of control variables
in test_varlist(), and let H be the number of treatment intervals specified in
cutpoints(). Then the program calculates and shows p × H values of the test
statistic before and after adjusting for the GPS, where the adjustment is done by
dividing the values of the GPS evaluated at the representative point index() into
the number of intervals specified in nq_gps(). (See Hirano and Imbens [2004] for
further details.)
The model-comparison approach uses an LR test to compare an unrestricted model
for Ti, including all the covariates and the GPS (up to a cubic term), with a re-
stricted model that sets the coefficients of all covariates to zero. By default, both
the blocking on the GPS approach and the model-comparison approach are applied.
flag(#) allows the user to specify that drf estimates the GPS without performing the
balancing test. The default is flag(1), which means that the balancing property is
assessed.
DRF options
tpoints(vector) indicates that the DRF is evaluated at each level of the treatment in
vector. By default, the drf program creates a vector with jth element equal to
the jth observed treatment value. This option cannot be used with npoints() or
npercentiles() (see below).
npoints(#) indicates that the DRF is evaluated at each level of the treatment be-
longing to a set of evenly spaced values t0 , t1 , . . . , t# that cover the range of the
observed treatment. This option cannot be used with tpoints() (see above) or
npercentiles() (see below).
npercentiles(#) indicates that the DRF is evaluated at each level of the treatment
corresponding to the percentiles tq0, tq1, ..., tq# of the treatment's empirical distri-
bution. This option cannot be used with tpoints() or npoints() (see above).
det displays more detailed output on the DRF estimation. When det is not specified,
the program displays only the chosen DRF estimator: method(radialpspline),
method(mtpspline), or method(iwkernel).
delta(#) specifies that drf also estimates the treatment-effect function μ̂(t + #) − μ̂(t).
The default is delta(0), which means that drf estimates only the DRF, μ̂(t).
knots(numlist) specifies the list of knots for the treatment and the GPS variable. This
option cannot be used with the nknots() option (see above).
standardized implies that the spacefill algorithm standardizes the treatment vari-
able and the GPS variables before selecting the knots. The knots are chosen using
the standardized variables.
degree1(#) specifies the power of the treatment variable included in the penalized
spline model. The default is degree1(1).
degree2(#) specifies the power of the GPS included in the penalized spline model. The
default is degree2(1).
nknots1(#) specifies the number (#) of knots for the treatment variable. The location
of the K_kth knot is defined as the {(k + 1)/(# + 2)}th sample quantile of the unique
T_i for k = 1, ..., #. The default is nknots1(max(5, min(n/4, 35))), where n is
the number of unique Ti (Ruppert, Wand, and Carroll 2003). This option cannot
be used with the knots1(numlist) option (see below).
nknots2(#) specifies the number (#) of knots for the GPS. The location of the K_kth
knot is defined as the {(k + 1)/(# + 2)}th sample quantile of the unique R_i for k =
1, ..., #. The default is nknots2(max(5, min(n/4, 35))), where n is the number
of unique Ri (Ruppert, Wand, and Carroll 2003). This option cannot be used with
the knots2() option (see below).
knots1(numlist) specifies the list of knots for the treatment variable. This option
cannot be used with the nknots1() option (see above).
knots2(numlist) specifies the list of knots for the GPS. This option cannot be used with
the nknots2() option (see above).
additive allows users to implement penalized splines using the additive model without
including the product terms.
Mutual options for the tensor-product and radial penalized spline estimators
Mutual options for the tensor-product and radial penalized spline estimators involve
either the mtpspline subroutine or the radialpspline subroutine, depending on which
estimator is used.
estopts(string) specifies all the possible options allowed when running the xtmixed
models to fit penalized spline models (see help xtmixed for further details).
. use lotterydataset.dta
. * we delete the extreme values (1 and 99 percentile)
. drop if year6==.
(35 observations deleted)
. summarize prize, de
Treatment variable = Prize amount
Percentiles Smallest
1% 5.3558 1.139
5% 10.05 5
10% 11.246 5.3558 Obs 202
25% 17.034 6.844 Sum of Wgt. 202
50% 32.1835 Mean 57.36918
Largest Std. Dev. 64.84194
75% 71.642 270.1
90% 137.27 305.09 Variance 4204.477
95% 171.73 323.32 Skewness 2.821964
99% 305.09 484.79 Kurtosis 14.18278
OIM
prize Coef. Std. Err. z P>|z| [95% Conf. Interval]
*****************************************************************
31 observations are dropped after imposing common support
*****************************************************************
drf_gpscore
Percentiles Smallest
1% .0000774 .0000308
5% .00118 .0000774
10% .0033023 .0003464 Obs 160
25% .0077024 .0004499 Sum of Wgt. 160
50% .0092675 Mean .0082089
Largest Std. Dev. .002953
75% .0103387 .0107928
90% .0107204 .010793 Variance 8.72e-06
95% .0107831 .0107953 Skewness -1.419599
99% .0107953 .0107956 Kurtosis 3.908883
********************************************
End of the algorithm to estimate the gpscore
********************************************
**********************************************************
Log-Likelihood test for Unrestricted and Restricted Model
**********************************************************
****************************************************
Unrestricted Model
link(E[T]) = GPSCORE + GPSCORE^2 + GPSCORE^3 + X
****************************************************
Generalized linear models No. of obs = 160
Optimization : ML Residual df = 144
Scale parameter = 383.389
Deviance = 55208.02303 (1/df) Deviance = 383.389
Pearson = 55208.02303 (1/df) Pearson = 383.389
Variance function: V(u) = 1 [Gaussian]
Link function : g(u) = ln(u) [Log]
AIC = 8.881567
Log likelihood = -694.5253454 BIC = 54477.2
OIM
prize Coef. Std. Err. z P>|z| [95% Conf. Interval]
********************************************************
Restricted Model: Pretreatment variables are excluded
link(E[T]) = GPSCORE + GPSCORE^2 + GPSCORE^3
********************************************************
Generalized linear models No. of obs = 160
Optimization : ML Residual df = 156
Scale parameter = 386.9127
Deviance = 60358.37384 (1/df) Deviance = 386.9127
Pearson = 60358.37384 (1/df) Pearson = 386.9127
Variance function: V(u) = 1 [Gaussian]
Link function : g(u) = ln(u) [Log]
AIC = 8.820758
Log likelihood = -701.6606578 BIC = 59566.65
OIM
prize Coef. Std. Err. z P>|z| [95% Conf. Interval]
**********************************************************
Restricted Model: GPS terms are excluded (link(E[T]) = X)
**********************************************************
Generalized linear models No. of obs = 160
Optimization : ML Residual df = 147
Scale parameter = 1311.924
Deviance = 192852.8661 (1/df) Deviance = 1311.924
Pearson = 192852.8661 (1/df) Pearson = 1311.924
Variance function: V(u) = 1 [Gaussian]
Link function : g(u) = ln(u) [Log]
AIC = 10.09489
Log likelihood = -794.5908861 BIC = 192106.8
OIM
prize Coef. Std. Err. z P>|z| [95% Conf. Interval]
********************************************************************
Likelihood-ratio tests:
Comparison between the unrestricted model and the restricted models
********************************************************************
LR_TEST[3,4]
Lrtest T-Statistics p-value Restrictions
Unrestricted -694.52535 . . .
Covariates X -701.66066 14.270625 .2837616 12
GPS terms -794.59089 200.13108 3.952e-43 3
Number of observations = 160
***********************************************************
End of the assesment of the balancing property of the GPS
***********************************************************
Then we estimate the DRF and the treatment-effect function, which represents the
marginal propensity to earn out of the yearly prize money, using both penalized spline
techniques and the IW kernel estimator. Following Hirano and Imbens (2004), we ob-
tain the estimates of these functions at 10 different prize-amount values between $10,000
and $100,000, considering increments of $1,000 for the estimation of the treatment-
effect function. Note that we scaled the prize amount by dividing it by $1,000. To avoid
redundancies, we show details on the output from running drf for only the radial penal-
ized spline estimator (method(radialpspline)). Note that the det option is specified,
so details on estimating the DRF are shown.
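For reference, a sketch of the point-estimation call that would produce this output, inferred from the bootstrap call reported later (the vector tp of evaluation points and its construction here are assumptions; it holds the 10 scaled prize amounts):

. matrix tp = (10 \ 20 \ 30 \ 40 \ 50 \ 60 \ 70 \ 80 \ 90 \ 100)
. drf agew ownhs owncoll male tixbot workthen yearm1 yearm2 yearm3 yearm4
>     yearm5 yearm6, outcome(year6) treatment(prize) test(L_like) tpoints(tp)
>     numoverlap(3) method(radialpspline) family(gaussian) link(log) nolog(1)
>     search nknots(10) det delta(1)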
****************
DRF estimation
****************
Radial penalized spline estimator
Run 1 .. (Cpq = 383.37)
Run 2 .. (Cpq = 427.99)
Run 3 ... (Cpq = 388.19)
Run 4 .. (Cpq = 365.61)
Run 5 ... (Cpq = 389.08)
Performing EM optimization:
Performing gradient-based optimization:
Iteration 0: log restricted-likelihood = -509.60164
Iteration 1: log restricted-likelihood = -509.58312
Iteration 2: log restricted-likelihood = -509.58286
Iteration 3: log restricted-likelihood = -509.58286
_all: Identity
sd(__00002U..__000033)(1) .0285723 .0584111 .0005198 1.570645
LR test vs. linear regression: chibar2(01) = 0.06 Prob >= chibar2 = 0.4072
(1) __00002U __00002V __00002W __00002X __00002Y __00002Z __000030 __000031
__000032 __000033
. matrix list e(b)
e(b)[1,20]
c1 c2 c3 c4 c5 c6
y1 15.131775 12.106819 9.3763398 7.2519104 6.0217689 5.5866336
c7 c8 c9 c10 c11 c12
y1 5.7080575 5.9898157 6.0769106 5.7288158 -.3081758 -.2900365
c13 c14 c15 c16 c17 c18
y1 -.23826795 -.15935109 -.05448761 -.00673878 .02770708 .02217719
c19 c20
y1 -.01213146 -.06489899
. matrix C = e(b)
. drop gpscore
. set seed 2322
. bootstrap _b, reps(50): drf agew ownhs owncoll male tixbot workthen yearm1
> yearm2 yearm3 yearm4 yearm5 yearm6, outcome(year6) treatment(prize)
> test(L_like) tpoints(tp) numoverlap(3) method(radialpspline) family(gaussian)
> link(log) nolog(1) search nknots(10) det delta(1)
(running drf on estimation sample)
Bootstrap replications (50)
1 2 3 4 5
.................................................. 50
Bootstrap results Number of obs = 191
Replications = 50
Figures 1 and 2 show the estimates of the DRF and the treatment-effect function by
using the semiparametric techniques implemented in the drf routine and a paramet-
ric approach. The parametric estimates are derived using the doseresponse routine
(Bia and Mattei 2008), which follows the parametric approach originally proposed by
Hirano and Imbens (2004).7 As can be seen in figures 1 and 2, the two penalized spline
estimators and the IW kernel estimator lead to similar results: the DRFs have a U shape
(which is more tenuous in the case of the radial spline method), and the treatment-effect
functions have irregular shapes, increasing over most of the treatment range and decreas-
ing for high treatment levels. The parametric approach shows quite a different picture.
The DRF goes down sharply for low prize amounts and follows an inverse J shape for
prize amounts greater than $20,000. The treatment-effect function reaches a maximum
around $30,000, and then it slowly decreases.
7. The code to derive the graphs is shown here for only the radial penalized spline estimator.
[Figure 1. Estimated dose–response functions; each panel plots the dose–response function against the treatment level (0–100).]
[Figure 2. Estimated treatment-effect functions; each panel plots the derivative against the treatment level (0–100).]
Figures 3 and 4 show the DRFs and the treatment-effect functions estimated using
the semiparametric and parametric techniques, now accompanied by pointwise 95% con-
fidence bands. The confidence bands are based on a normal approximation using boot-
strap standard errors, which are computed by calling the drf program (or the doseresponse
program) in the bootstrap command.8
8. The radial spline-based models may produce slightly different estimates in different runs and when
using the bootstrap command. This happens because within those models, an optimal set of
design points is chosen via random selection of the knot values using the spacefill algorithm (see
Bia and Van Kerm [2014] for further details). Some selected sets of knots may raise convergence
issues depending on the data. Thus we recommend that users set a seed before running the drf
code to make the results replicable.
[Figure 3. Estimated dose–response functions with pointwise 95% confidence bands; each panel plots the dose–response function against the treatment level (0–100).]
[Figure 4. Estimated treatment-effect functions with pointwise 95% confidence bands; each panel plots the derivative against the treatment level (0–100).]
The example allows us to highlight two important points. First, figures 3 and 4
show that differences in the point estimates and their precision among the three semi-
parametric estimators are more pronounced for low and high treatment levels. This is
because our data are sparse for lower and higher values of the treatment.9 Because of
the nonparametric methods we use, estimation becomes noisier and the parameters are
estimated less precisely in regions of the data with few observations, which is reflected
in the wider confidence intervals. This is particularly evident for the radial spline ap-
proach, which seems to be more sensitive to the sample size than the IW and penalized
spline estimators are. Second, it is clear from figures 3 and 4 that the parametric
estimator produces much tighter confidence bands relative to the semiparametric esti-
mators. This is due to the additional structure imposed by the parametric estimator,
which allows extrapolation from regions where data are abundant to regions where data
are scarce. However, if the assumptions behind the parametric structure are incorrect,
the results, including their precision, are likely misleading.
9. In particular, there are very few observations for prizes lower than $15,000 and greater than $40,000.
6 Conclusion
We develop a program where we implement semiparametric estimators of the DRF based
on the GPS, assuming that assignment to the treatment is weakly unconfounded given
pretreatment variables. We propose three semiparametric estimators: the IW kernel
estimator developed in Flores et al. (2012) and two estimators using penalized spline
methods for bivariate smoothing. We use data from a survey of Massachusetts lottery
winners to illustrate the proposed methods and program. We find that the semipara-
metric estimators provide estimates of the DRF and the treatment-effect function that
are substantially different from those obtained when using the parametric approach orig-
inally proposed in Hirano and Imbens (2004). All the semiparametric estimators agree
on a U-shaped DRF, which contrasts with the estimated inverse J shape uncovered by
the parametric estimator. Although we cannot draw a firm conclusion about the relative
performance of the estimators based on one dataset, we argue that a misspecification
of the conditional expectation of the outcome given treatment and GPS could result
in inappropriate removal of self-selection bias and in misleading estimates of the DRF.
Therefore, it is advisable to also use semiparametric estimators that account for compli-
cated structures that are difficult to model parametrically. Conversely, semiparametric
estimators can be sensitive to the sample size and might not perform well in regions
with few observations.
7 Acknowledgments
This research is part of the Estimation of direct and indirect causal effects using semi-
parametric and nonparametric methods project supported by the Luxembourg Fonds
National de la Recherche, which is cofunded under the Marie Curie Actions of the
European Commission (FP7-COFUND).
8 References
Angrist, J. D., G. W. Imbens, and D. B. Rubin. 1996. Identification of causal effects
using instrumental variables. Journal of the American Statistical Association 91:
444–455.
Becker, S. O., and A. Ichino. 2002. Estimation of average treatment effects based on
propensity scores. Stata Journal 2: 358–377.
Bia, M., and A. Mattei. 2008. A Stata package for the estimation of the dose–response
function through adjustment for the generalized propensity score. Stata Journal 8:
354–373.
Bia, M., and A. Mattei. 2012. Assessing the effect of the amount of financial aids to Piedmont
firms using the generalized propensity score. Statistical Methods & Applications 21:
485–516.
Bia, M., and P. Van Kerm. 2014. Space-filling location selection. Stata Journal 14:
605–622.
Buis, M. L., N. J. Cox, and S. P. Jenkins. 2003. betafit: Stata module to fit a two-
parameter beta distribution. Statistical Software Components S435303, Department
of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s435303.html.
Dorn, S. 2012. pscore2: Stata module to enforce balancing score property in each
covariate dimension. UK Stata Users Group meeting.
http://econpapers.repec.org/paper/bocusug12/11.htm.
Fan, J., and I. Gijbels. 1996. Local Polynomial Modelling and Its Applications. New
York: Chapman & Hall/CRC.
Flores, C. A., A. Flores-Lagunes, A. Gonzalez, and T. C. Neumann. 2012. Estimating
the effects of length of exposure to instruction in a training program: The case of Job
Corps. Review of Economics and Statistics 94: 153–171.
Guardabascio, B., and M. Ventura. 2014. Estimating the dose–response function
through a generalized linear model approach. Stata Journal 14: 141–158.
Hirano, K., and G. W. Imbens. 2004. The propensity score with continuous treat-
ments. In Applied Bayesian Modeling and Causal Inference from Incomplete-Data
Perspectives, ed. A. Gelman and X.-L. Meng, 73–84. Chichester, UK: Wiley.
Imai, K., and D. A. van Dyk. 2004. Causal inference with general treatment regimes:
Generalizing the propensity score. Journal of the American Statistical Association
99: 854–866.
Imbens, G. W. 2000. The role of the propensity score in estimating dose–response
functions. Biometrika 87: 706–710.
Imbens, G. W., D. B. Rubin, and B. I. Sacerdote. 2001. Estimating the effect of unearned
income on labor earnings, savings, and consumption: Evidence from a survey of
lottery players. American Economic Review 91: 778–794.
Imbens, G. W., and J. M. Wooldridge. 2009. Recent developments in the econometrics
of program evaluation. Journal of Economic Literature 47: 5–86.
Jann, B. 2005. moremata: Stata module (Mata) to provide various functions. Sta-
tistical Software Components S455001, Department of Economics, Boston College.
http://ideas.repec.org/c/boc/bocode/s455001.html.
Kluve, J., H. Schneider, A. Uhlendorff, and Z. Zhao. 2012. Evaluating continuous
training programmes by using the generalized propensity score. Journal of the Royal
Statistical Society, Series A 175: 587–617.
Lechner, M. 2001. Identification and estimation of causal effects of multiple treatments
under the conditional independence assumption. In Econometric Evaluation of Labour
Market Policies, ed. M. Lechner and F. Pfeiffer, 43–58. Heidelberg: Physica-Verlag.
Newey, W. K. 1994. Kernel estimation of partial means and a general variance estimator.
Econometric Theory 10: 233–253.
Rosenbaum, P. R., and D. B. Rubin. 1983. The central role of the propensity score in
observational studies for causal effects. Biometrika 70: 41–55.
Rubin, D. B. 1978. Bayesian inference for causal effects: The role of randomization. Annals
of Statistics 6: 34–58.
Ruppert, D., M. P. Wand, and R. J. Carroll. 2003. Semiparametric Regression. New
York: Cambridge University Press.
Stuart, E. A. 2010. Matching methods for causal inference: A review and a look forward.
Statistical Science 25: 1–21.
Space-filling location selection
M. Bia and P. Van Kerm

1 Introduction
Spatial statistics often address geographical sampling from a set of locations for net-
works construction (Cox, Cox, and Ensor 1997), for example, for installing air quality
monitoring (Nychka and Saltzman 1998) or for evaluating exposure to environmental
chemicals (Kim et al. 2010). The issue involves evaluating a discrete list of potential
locations and determining a small, optimal subset of places (a design) at which
to position, say, measurement instruments or sensors. One strategy to address such a
problem, the geometric approach, aims to find a design that minimizes the aggregate
distance between the locations and the sensors.
As discussed in Ruppert, Wand, and Carroll (2003) and Gelfand, Banerjee, and Fin-
ley (2012), location selection is also relevant in estimation of statistical models such as
multivariate nonparametric or semiparametric regression models. By analogy, instead
of locating measurement instruments, one seeks to identify a small number of loca-
tions from a large dataset at which to estimate a statistical model to reduce com-
putational cost. For example, kernel density estimates or locally weighted regression
models (Cleveland 1979; Fan and Gijbels 1996) are typically calculated on a grid of
points spanning the data range rather than over all the input data points (and in-
terpolation is used where needed). The location of knots in spline regression models is
somewhat related; a small number of knots are selected instead of knots being placed
at many (or all) potential distinct data points. Determining such a grid is relatively
easy in one-dimensional models; for example, it is customary to locate knots at selected
percentiles of the data. Choosing an appropriate multidimensional grid while preserv-
ing computational tractability is more complicated because merely taking combinations
of unidimensional grids quickly inflates the number of evaluation points. In this con-
text, Ruppert, Wand, and Carroll (2003) recommend applying a geometric space-filling
design to identify grid points or knot locations.
with p < 0. dp (x, Dn ) measures how well the design Dn covers the location x. When
p → −∞, dp(x, Dn) tends to the shortest Euclidean distance between x and a point
in Dn (Johnson, Moore, and Ylvisaker 1990). dp (x, Dn ) is zero if x is at a location in
Dn .
1. An R implementation of Royle and Nychka's (1998) algorithm is available in Furrer, Nychka, and
Sain (2013).
over all possible designs Dn from C. The optimal design minimizes the q power mean of
the coverages of all locations outside of the design (the candidate points). Increasing
q gives greater importance to the distance of the design to poorly covered locations.
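As a sketch of the coverage and design criteria referred to here as (1) and (2), following Royle and Nychka (1998) (the exact typesetting is an assumption; C denotes the candidate set and D_n a design of size n):

d_p(x, D_n) = \Bigl\{ \sum_{d \in D_n} \lVert x - d \rVert^{p} \Bigr\}^{1/p}, \quad p < 0 \qquad (1)

C_{p,q}(D_n) = \Bigl\{ \frac{1}{\#(C \setminus D_n)} \sum_{x \in C \setminus D_n} d_p(x, D_n)^{q} \Bigr\}^{1/q} \qquad (2)

(Whether the sum in (2) is averaged over the number of candidate points is immaterial for the minimization.)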
Figure 1 can help readers visualize the criterion. From a set of 38 European cities, we
selected a potential design of five locations: Madrid, Brussels, Berlin, Riga, and Sofia.
The coverage of, say, London by this design is given by plugging the Euclidean distances
from London to the five selected cities into (1). With a large negative p, this coverage
will be determined by the distance to the closest city, namely, Brussels. Repeating such
calculations for all 33 cities from outside the design and aggregating the coverages using
(2) gives the overall geometric distance of European cities to the design composed
of Madrid, Brussels, Berlin, Riga, and Sofia. The optimal design is the combination of
any five cities that minimizes this criterion. The design composed of Madrid, Brussels,
Berlin, Riga, and Sofia is in fact the optimal design for p = −5 and q = 1.
3.1 Syntax
spacefill varlist [if] [in] [weight] [, ndesign(#) design0(varlist)
    fixed(varname) exclude(varname) p(#) q(#) nnfrac(#) nnpoints(#)
    nruns(#) standardize standardize2 standardize3 sphericize ranks
    generate(newvar) genmarker(newvar) noverbose]
aweights, fweights, and iweights are allowed; see [U] 11.1.6 weight.
varlist and the if or in qualifier identify the data from which the optimal subset is
selected.
3.2 Options
ndesign(#) specifies n, the size of the design. The default is ndesign(4).
design0(varlist) identifies a set of initial designs identified by observations with nonzero
varlist. If multiple variables are passed, one optimization is performed for each initial
design, and the selected design is the one with the best coverage.
fixed(varname) identifies observations that are included in all designs when varname
is nonzero.
exclude(varname) identifies observations excluded from all designs when varname is
nonzero.
p(#) specifies a scalar value for the distance parameter for calculating the distance of
each location to the design; for example, p = −1 gives the harmonic mean distance, and
p = −∞ gives the minimum distance. The default is p(-5), as recommended in
Royle and Nychka (1998).
q(#) specifies a scalar value for the parameter q. The default is q(1) (the arithmetic
mean).
nnfrac(#) specifies the fraction of data to consider as nearest neighbors in the point-
swapping iterations. Limiting checks to nearest neighbors improves speed but does
not guarantee convergence to the best design; therefore, setting nruns(#) is recom-
mended. The default is nnfrac(0.50).
nnpoints(#) specifies the number of nearest neighbors considered in the point-swapping
iterations. Limiting checks to nearest neighbors improves speed. nnfrac(#) and
nnpoints(#) are mutually exclusive.
nruns(#) sets the number of independent runs performed on alternative random initial
designs. The selected design is the one with best coverage across the runs. The
default is nruns(5).
standardize standardizes all variables in varlist to zero mean and unit standard devi-
ation (SD) before calculating distances between observations.
standardize2 standardizes all variables in varlist to zero mean and unit SD before calculating
distances between observations, with an estimator of the SD as 0.7413 times the
interquartile range.
standardize3 standardizes all variables in varlist to zero median and unit SD before calcu-
lating distances between observations, with an estimator of the SD as 0.7413 times
the interquartile range.
sphericize transforms all variables in varlist into zero mean, unit SD, and zero covariance
using a Cholesky decomposition of the variance–covariance matrix before calculating
distances between observations.
ranks transforms all variables in varlist into their (fractional) ranks and uses distances
between these observation ranks in each dimension to evaluate distances between
observations.
generate(newvar) specifies the names for new variables containing the locations of the
best design points. If one variable is specied, it is used as a stubname; otherwise,
the number of new variable names must match the number of variables in varlist.
genmarker(newvar) specifies the name of a new binary variable equal to one for obser-
vations selected in the best design and zero otherwise.
noverbose suppresses output display.
Options standardize2, standardize3, and ranks require installation of the user-
written package moremata, which is available on the Statistical Software Components
archive (Jann 2005).
4 Examples
We provide two illustrations of the application of spacefill. The first example uses
ozone2.txt, which is available in the R fields package (Furrer, Nychka, and Sain 2013),
and provides examples of standard site selection. The second example uses survey data
from the Panel Socio-Economique Liewen zu Lëtzebuerg/European Union Statistics on
Income and Living Conditions (PSELL3/EU-SILC) and illustrates the use of spacefill
for nonparametric regression analysis with multidimensional, nonspatial data.
We start by selecting an optimal design of size 10 from the 147 locations, using the
default values p = −5 and q = 1, candidate swaps limited to the nearest half of the
locations, and 5 runs with random starting designs.
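A sketch of the corresponding call (the marker name best10 is a hypothetical choice; all other settings are the defaults described above):

. spacefill lon lat, ndesign(10) genmarker(best10)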
Notice that the first run leads to a somewhat higher aggregate distance to the design
points (Cpq=100.34) than the other runs. This stresses the importance of multiple
starting designs. Figure 2 shows the selected locations in the best design (achieved at
run 3, where Cpq=94.19).
Figure 2. Scatterplot and histogram of longitude and latitude for all 147 locations (gray
histograms and gray hollow circles) and 10 best design points (thick histograms and
solid dots) with p = −5 and q = 1 (default)
Users can improve speed by restricting potential swaps to a smaller number of nearest
neighbors. Limiting the search to 25 nearest neighbors (against 69, the default half of the
locations, in the first example), our second example below runs in 4 seconds against
11 seconds for our initial example, without much loss in the coverage of the resulting
design (Cpq=96.59). On the other hand, running spacefill with all candidates
as potential swaps takes over 30 seconds for an optimal design with Cpq=91.96.
. spacefill lon lat, ndesign(10) nnpoints(25) genmarker(set1)
Run 1 ..... (Cpq = 117.02)
Run 2 .... (Cpq = 109.93)
Run 3 .. (Cpq = 110.99)
Run 4 .. (Cpq = 101.05)
Run 5 ..... (Cpq = 96.59)
. spacefill lon lat, ndesign(10) nnfrac(1)
Run 1 ... (Cpq = 91.96)
Run 2 .... (Cpq = 91.96)
Run 3 .. (Cpq = 91.96)
Run 4 ... (Cpq = 92.32)
Run 5 ... (Cpq = 91.96)
We now illustrate the use of the genmarker(), fixed(), and exclude() options. In
the previous call, genmarker(set1) generated a dummy variable equal to 1 for the 10
points selected into the best design and 0 otherwise. We now specify exclude(set1)
to derive a new design with 10 different locations and then use fixed(set2) to force
this new design into a design of size 15.
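A sketch of the two calls just described (the marker name set3 for the final design of size 15 is an assumption):

. spacefill lon lat, ndesign(10) exclude(set1) genmarker(set2)
. spacefill lon lat, ndesign(15) fixed(set2) genmarker(set3)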
(listing of flagged observations; columns: set1, set2, and the marker for the combined design of size 15)
4. 1 0 0
10. 0 1 1
25. 1 0 0
40. 1 0 0
48. 0 1 1
55. 1 0 0
58. 0 1 1
60. 1 0 0
61. 0 1 1
63. 0 0 1
67. 0 0 1
74. 1 0 0
77. 0 0 1
80. 0 1 1
82. 0 1 1
89. 0 0 1
91. 0 1 1
97. 1 0 0
107. 0 1 1
109. 1 0 0
121. 0 1 1
125. 0 0 1
135. 0 1 1
140. 1 0 0
143. 1 0 0
The key parameters q and p of the coverage criterion can also be flexibly specified.
Figure 3 illustrates three designs selected with the default parameters p = −5 and q = 1 (dots),
with p = −1 and q = 1 (squares), and with p = −1 and q = 5 (crosses). With p = −5,
the distance of a location to the design is mainly determined by the distance to the
closest point of the design; p = −1 accounts for the distance to all points in the design,
leading to more central location selections. Setting q = 5 penalizes large distances
between design and nondesign points, leading to location selections more spread out
toward external points. Note our use of user-specified random starting designs with
option design0() to ensure that the comparison is made on common initial values.
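For example (a sketch; the variable start0 holding the user-specified initial design and the marker names are hypothetical):

. spacefill lon lat, ndesign(10) design0(start0) p(-1) q(1) genmarker(set_p1q1)
. spacefill lon lat, ndesign(10) design0(start0) p(-1) q(5) genmarker(set_p1q5)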
Figure 3. Scatterplot of longitude and latitude for all 147 locations (gray hollow circles)
and best design points with default p = −5 and q = 1 (dots), with p = −1 and q = 1
(squares), and with p = −1 and q = 5 (crosses)
. clear
. set obs 16
obs was 0, now 16
. range lon -95 -80 16
. range lat 36 46 11
(5 missing values generated)
. fillin lon lat
. gen byte sample = 0
. save gridlatlon.dta , replace
file gridlatlon.dta saved
. clear
. insheet using ozone2.txt
(3 vars, 147 obs)
. keep lat lon
. gen byte sample = 1
. append using gridlatlon
. spacefill lon lat [iw=sample], exclude(sample) ndesign(25) nnpoints(100)
> genmarker(subgrid1)
147 points excluded from designs (sample>0)
Run 1 .. (Cpq = 63.93)
Run 2 .... (Cpq = 63.92)
Run 3 .... (Cpq = 63.71)
Run 4 ... (Cpq = 63.07)
Run 5 ... (Cpq = 63.02)
Figure 4. Actual 147 locations (hollow gray circles), 176 candidate grid points (lattice;
crosses), and 25 optimally selected grid points (solid dots)
rithm is indeed applicable to broad data configurations. Second, the difference in the
histograms for the sample and for the design points is a reminder that selecting a space-
filling design is distinct from drawing a representative subset of the data. The points
that best cover the data in a geometric sense need not reflect their frequency
distribution: few design points may contribute to covering many data points in areas of
high concentration, while design points spread out in areas of low data concentration
will contribute to covering a smaller number of data points.
. summarize height weight wage
Variable Obs Mean Std. Dev. Min Max
Figure 5. Scatterplot and histogram of height and weight for all data (gray histograms
and hollow markers) and best design points (thick histograms and markers) for the
standardized values of height, weight, and wage
Figure 6. Scatterplot and histogram of height and wage for all data (gray histograms
and hollow markers) and best design points (thick histograms and markers) for the
standardized values of height, weight, and wage
We now use these data to run a locally weighted polynomial regression of wage
on height and weight. Our objective is to assess nonparametrically the relationship
between wage and body size. For the sake of illustration, we want to estimate expected
wage nonparametrically at multiple grid points from a lattice where each point is a
pair of height–weight values. One reason for this is that fitting the model at all height–
weight pairs in our data would be computationally expensive (and inefficient if there are
nearly identical height–weight pairs in the data). We seek a cheaper alternative with
fewer evaluation points. (This is similar to using lpoly with the at() option instead
of lowess in the unidimensional setting.) Also, we use evaluation points on a lattice
instead of at sample values because we are considering fitting the model for different
subsamples, and we want to have model estimates on a common grid of evaluation points
for all subsamples. (If need be, bivariate interpolation will be used to recover estimates
at sample values; see [G-2] graph twoway contourline for the interpolation formula.)
This setting is relatively standard in nonparametric regression analysis, especially when
dealing with large samples or computationally heavy estimators (for example, cross-
validation-based bandwidth selection).
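In the unidimensional case, the idea can be illustrated with lpoly and its at() option (a sketch; the grid variable hgrid and the result variable ewage_hat are hypothetical):

. range hgrid 150 192 50
. lpoly wage height, at(hgrid) generate(ewage_hat) nograph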
We start with a 20 × 20 rectangular lattice covering heights from 150 to 192 centime-
ters and weights from 43 to 127 kilograms. While this lattice spans the values observed
in our sample, it also includes many empirically irrelevant heightweight pairs. Estima-
tion on the full grid is therefore unnecessary, and we use spacefill as described above
to select a subset of points on the lattice that covers our data.
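Mirroring the grid construction used in the ozone example above, this selection might be set up as follows (a sketch; the dataset name wagedata.dta and the marker name subgrid2 are assumptions):

. clear
. set obs 20
. range height 150 192 20
. range weight 43 127 20
. fillin height weight
. gen byte sample = 0
. save gridhw.dta, replace
. use wagedata.dta, clear
. keep height weight wage
. gen byte sample = 1
. append using gridhw
. spacefill height weight [iw=sample], exclude(sample) ndesign(50)
>     genmarker(subgrid2)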
Figure 7 shows the resulting estimates based on a space-filling design of size 50, as well
as estimates based on a random subset of 100 lattice points, on 100 Halton draws from
the lattice, on the full lattice, and on all sample points. Brightness of the contours
corresponds to local regression estimates of expected wage, from black (for monthly
wage below EUR 1000) to white (for monthly wage above EUR 5000). In each panel,
local regression was effectively calculated only at the marked grid points (and so it was
conducted faster on the space-filling design), while the overall coloring of the map was
based on the thin-plate-spline interpolation built into twoway contour.
Figure 7. Contour plot of expected wage of 500 Luxembourg women by height and
weight from monthly wage less than EUR 1000 (black) to more than EUR 5000 (white).
Calculations based on local regression estimation. White lines identify body-mass in-
dices of 18.5, 25, and 30, which delineate underweight, overweight, and obesity, respec-
tively.
The contour plots display variations in areas of low data density (top left and bot-
tom right), reflecting both the imprecision and variability of the local linear regression
estimates in these zones and the variations introduced by the interpolation of values
away from the bulk of the data. In areas of higher data density (for height below 180
centimeters and weight below 100 kilograms), estimates on the 50-point space-filling
subset differ little from those of the full sample or from the full lattice.4
4. Note, incidentally, how taller women tend to be paid higher wages in these data in all three body-
mass index categories.
Acknowledgments
This research is part of the project Estimation of direct and indirect causal effects using
semi-parametric and non-parametric methods, which is supported by the Luxembourg
Fonds National de la Recherche, cofunded under the Marie Curie Actions of the
European Commission (FP7-COFUND). Philippe Van Kerm acknowledges funding for
the project Information and Wage Inequality, which is supported by the Luxembourg
Fonds National de la Recherche (contract C10/LM/785657).
5 References
Cleveland, W. S. 1979. Robust locally weighted regression and smoothing scatterplots.
Journal of the American Statistical Association 74: 829–836.
Cox, D. D., L. H. Cox, and K. B. Ensor. 1997. Spatial sampling and the environment:
Some issues and directions. Environmental and Ecological Statistics 4: 219–233.
Fan, J., and I. Gijbels. 1996. Local Polynomial Modelling and Its Applications. New
York: Chapman & Hall/CRC.
Furrer, R., D. Nychka, and S. Sain. 2013. fields: Tools for spatial data. R package
version 6.7.6. http://CRAN.R-project.org/package=fields.
Gelfand, A. E., S. Banerjee, and A. O. Finley. 2012. Spatial design for knot selection
in knot-based dimension reduction models. In Spatio-Temporal Design: Advances in
Efficient Data Acquisition, ed. J. Mateu and W. G. Müller, 142–169. Chichester, UK:
Wiley.
Jann, B. 2005. moremata: Stata module (Mata) to provide various functions. Sta-
tistical Software Components S455001, Department of Economics, Boston College.
http://ideas.repec.org/c/boc/bocode/s455001.html.
Johnson, M. E., L. M. Moore, and D. Ylvisaker. 1990. Minimax and maximin distance
designs. Journal of Statistical Planning and Inference 26: 131–148.
Kim, J.-I., A. B. Lawson, S. McDermott, and C. M. Aelion. 2010. Bayesian spatial
modeling of disease risk in relation to multivariate environmental risk fields. Statistics
in Medicine 29: 142–157.
Nychka, D., and N. Saltzman. 1998. Design of air-quality monitoring networks. In Case
Studies in Environmental Statistics (Lecture Notes in Statistics 132), ed. D. Nychka,
W. Piegorsch, and L. Cox, 51–76. New York: Springer.
Royle, J. A., and D. Nychka. 1998. An algorithm for the construction of spatial coverage
designs with implementation in SPLUS. Computers and Geosciences 24: 479–488.
Ruppert, D., M. P. Wand, and R. J. Carroll. 2003. Semiparametric Regression. New
York: Cambridge University Press.
Adaptive MCMC in Mata
M. J. Baker

1 Introduction
Markov chain Monte Carlo (MCMC) methods are a popular and widely used means
of drawing from probability distributions that are not easily inverted, that have dif-
ficult normalizing constants, or for which a closed form cannot be found. While of-
ten considered a collection of methods with primary usefulness in Bayesian analysis
and estimation, MCMC methods can be applied to a variety of estimation problems.
Chernozhukov and Hong (2003), for example, show that MCMC methods can be applied
to many problems of traditional statistical inference and used to fit a wide class of
models; essentially, any statistical model with a pseudoquadratic objective function.
This class of models encompasses many common econometric models that have tra-
ditionally been fit by maximum likelihood or the generalized method of moments. This
article describes some Mata functions for drawing from distributions by using different
types of adaptive MCMC algorithms. The Mata implementation of the algorithms is
intended to allow straightforward application to estimation problems.
While it is well known that MCMC methods are useful for drawing from difficult
densities, one might ask: why use MCMC methods in estimation? Sometimes, maximiz-
ing an objective function may be difficult or slow, perhaps because of discontinuities or
nonconcave regions of the objective function, a large parameter space, or difficulty in
programming analytic gradients or Hessians. When bootstrapping of standard errors
is required, estimation problems are exacerbated because of the need to refit a model
many times. MCMC methods may provide a more feasible means of estimation in these
cases: estimation based on sampling directly from the joint parameter distribution does
not require optimization and still provides the desired result of estimation, a descrip-
tion of the joint distribution of parameters. MCMC methods are a popular means of
implementing Bayesian estimators because they allow one to avoid hard-to-calculate
normalizing constants that often appear in posterior distributions. Unlike extrema-
based estimation, Bayesian estimators do not rely on asymptotic results and thus are
useful in small-sample estimation problems or when the asymptotic distribution of pa-
rameters is difficult to characterize.
In this article, I describe a Mata function, amcmc(), that implements adaptive or non-
adaptive MCMC algorithms. I also describe a suite of routines, amcmc_*(), that allows
implementation via a series of structured functions, as one might use Mata functions
such as moptimize( ) (see [M-5] moptimize( )) or deriv( ) (see [M-5] deriv( )). The
algorithms implemented by the Mata routines more or less follow Andrieu and Thoms
(2008), who present an accessible overview of the theory and practice of adaptive MCMC.
In section 2, I provide an intuitive overview of adaptive MCMC algorithms, while
in section 3, I describe how the algorithms are implemented in Mata by amcmc() or
by creating a structured object via the suite of functions amcmc_*(). In section 4, I
describe four applications. I show how the routines might be used in a straightforward
parameter estimation problem, and I describe how the methods can be applied to a more
difficult problem: censored quantile regression. In this discussion, I also introduce
the mcmccqreg command. I then show how the routines can be used to sample from a
distribution that is hard to invert and lacks a normalizing constant. In a final example
in section 4, I apply the methods to Bayesian estimation of a mixed logit model following
Train (2009) and introduce the bayesmixedlogit command. In section 5, I sketch a
basic Mata implementation of an adaptive MCMC algorithm, which I hope will give users
a template for developing adaptive MCMC algorithms in more specialized applications.
In section 6, I conclude and offer some sources for additional reading.
Table 1. An MH algorithm. The proposal distribution is denoted by q(Y, X), while the
target distribution is π(X). α(X, Y) denotes the draw acceptance probability.

Basic MH algorithm
1: Initialize start value X = X0 and draws T.
2: Set t = 0 and repeat steps 3–6 while t ≤ T:
3: Draw a candidate Yt from q(Yt, Xt).
4: Compute α(Yt, Xt) = min[{π(Yt) q(Yt, Xt)}/{π(Xt) q(Xt, Yt)}, 1].
5: Set Xt+1 = Yt with prob. α(Yt, Xt), and Xt+1 = Xt otherwise.
6: Increment t.
Output: The sequence (Xt), t = 1, ..., T.
The MH algorithm sketched in table 1 has the property that candidate draws Yt
that increase the value of the target distribution, π(X), are always accepted, whereas
candidate draws that produce lower values of the target distribution are accepted only
with probability α. Under general conditions, the draws X1, X2, ..., XT converge to
draws from the target distribution, π(X); see Chib and Greenberg (1995) for proofs.
One can see the convenience the algorithm provides in drawing from densities of the form
π(X) = g(X)/K, where K is some perhaps difficult-to-calculate normalizing constant.
Computation of K is unnecessary because it cancels out of the ratio π(Yt)/π(Xt). The
proposal distribution, q(Y, X), is where the Markov chain part of Markov chain
Monte Carlo comes in. It is what distinguishes MCMC algorithms from more general
acceptance-rejection Monte Carlo sampling: candidate draws depend upon previous
draws through this function.
MCMC algorithms are simple and flexible, and they are therefore applicable to a wide
variety of problems. However, they can be challenging to implement, mainly because it
can be hard to find an appropriate proposal distribution, q(Y, X). If q(Y, X) is chosen
poorly, coverage of the target distribution, π(X), may be poor. This is where adaptive
MCMC methods are used, because they help tune the proposal distribution. As an
adaptive MCMC algorithm proceeds, information about acceptance rates of previous
draws is collected and embodied in a set of tuning parameters, λ. Slow convergence
or nonconvergence of an algorithm like that in table 1 is often caused by acceptance of
too few or too many candidate draws: if the algorithm accepts too few candidate draws,
candidates are too far away from regions of the support of the distribution where π(X)
is large; if too many candidates are accepted, candidates occupy an area of the support
of the distribution clustered closely around a large value of π(X). Accordingly, if the
acceptance rate is too low, the tuning mechanism contracts the search range; if the
acceptance rate is too high, it expands the search range. As a practical matter, one
augments the proposal distribution with the tuning parameters so that the proposal
distribution is something like q(Y, X, λ). A description of such an algorithm
appears in table 2.
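One common form of this tuning step, following Andrieu and Thoms (2008), written here as a sketch with λ the proposal scaling parameter, α* the target acceptance rate, and γ_t the weighting sequence discussed below:

\log \lambda_{t+1} = \log \lambda_t + \gamma_t \,\bigl\{ \alpha(Y_t, X_t) - \alpha^{*} \bigr\}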
The algorithm in table 2 also relies on a simplification of the basic MCMC algorithm
presented in table 1, which results when a symmetric proposal distribution is used so that
q(Y, X, λ) = q(X, Y, λ). With a symmetric proposal distribution (the (multivariate)
normal distribution being a prominent example), the proposal distribution drops out
of the calculation of the acceptance probability in step 4 of the algorithm; this results
in the simplified acceptance probability α(Y, Xt) = min[{π(Y)}/{π(Xt)}, 1]. All the
Mata routines discussed in this article use a multivariate normal density for the proposal
distribution.
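As a minimal Mata sketch of this symmetric-proposal update (not the amcmc() implementation itself; the log target lnf() used here, a standard normal, and the function name mh_step() are hypothetical):

mata:
// log of an (unnormalized) target density; here a standard normal of the
// dimension of x
real scalar lnf(real rowvector x)
{
    return(-0.5*x*x')
}

// one random-walk MH update: propose Y ~ N(X, lam*V) and accept with
// probability min{ pi(Y)/pi(X), 1 }
real rowvector mh_step(real rowvector X, real scalar lam, real matrix V)
{
    real rowvector Y
    Y = X + sqrt(lam)*rnormal(1, cols(X), 0, 1)*cholesky(V)'
    if (ln(runiform(1, 1)) < lnf(Y) - lnf(X)) return(Y)
    return(X)
}
end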
These conditions are satisfied by the weighting parameter used in the adaptive al-
gorithm in table 3 so long as δ ∈ (0, 1): the reason is that under these circumstances,
the sum over t of the weights 1/(1 + t)^δ diverges, but a sufficiently large value of ε
that forces the series Σ_t {1/(1 + t)^δ}^{1+ε} to converge can always be found.
A last detail to address is how to initialize the value of the scaling parameter at the
start of the algorithm. According to Andrieu and Thoms (2008, 359), theory suggests
that a good place to start with the scaling parameter is 2.38²/d, where d is the
dimension of the target distribution. The Mata routines presented below all use this
value as a starting point, with one exception.
There are many variations on the basic theme of the algorithm presented in table 3.
One possibility is one-at-a-time, sequential sampling of values from the distribution,
which produces a Metropolis-within-Gibbs type sampler. Another possibility is to
work halfway between the global sampling algorithm of table 3 and the sequential
sampling, creating what might be labeled a block adaptive MCMC sampler.3 In my
experience, Metropolis-within-Gibbs samplers or block samplers are often useful in situ-
ations in which variables are scaled very differently or in situations where the researcher
might not have good intuition about starting values.
Related to determining how to execute the algorithm is the issue of how to choose
T , the length of the run. One would like to choose T large enough so that the conver-
gence criteria mentioned above are satisfied and enough draws are produced for reliable
statistical inference. How does one know that the algorithm has achieved these goals?
This is a surprisingly complex question that really does not have a good answer. While
one can often detect problems with the algorithm, there is no way to guarantee that
the algorithm has converged. Gelman and Shirley (2011) describe different techniques
for assessing performance and convergence of the run, but they also emphasize the
complementary roles of visual inspection of results, understanding the application, and
understanding the subject matter. These issues are discussed at greater length in the
conclusion.
3. I follow the convention of referring to a sequential sampler as a Metropolis-within-Gibbs sampler,
even though many find this terminology misleading; see Geyer (2011, 28–29). What I call a block
sampler, some might call a block-Gibbs sampler.
The first Mata implementation of the algorithms described in section 2 is through the
Mata function amcmc(),4 which uses different types of adaptive MCMC samplers based
upon user-provided information. In addition to describing details of sampling (spec-
ification of draws, weighting parameters, and acceptance rates), the user can specify
whether sampling is to proceed all at once (globally), in blocks, or sequentially. The
user can also set up amcmc() to work with a stand-alone distribution or with an
objective function previously set up to work with moptimize() or optimize(). The
syntax is as follows:
Description
If the dimension of the target probability distribution (or the parameter vector) is char-
acterized as a 1 × c row vector, amcmc() returns a matrix of draws from the distribution
organized in c columns and r = draws − burn rows, so each row of the returned matrix
can be considered a draw from the target distribution lnf. Additional information about
the draws is collected in three arguments overwritten by amcmc(): arate, vals, and lam,
which contain actual acceptance rates, the log value of the target distribution at each
draw, and the proposal scaling parameters λ. If a Metropolis-within-Gibbs sampler or
a block sampler is used, lam, as well as arate, is returned as a row vector equal in length
to the dimension of the distribution or the number of blocks.
Information about how to draw from the target distribution and how the distribution
has been programmed is passed to the command as a sequence of strings in the (string)
row vector alginfo. This row vector can contain information about whether sampling is
to be sequential (mwg), in blocks (block), or global (global). If the user is interested in
applying amcmc() to a model statement constructed with moptimize() or optimize(),
information on this and the type of evaluator function used with the model should also
be contained in alginfo. Target distribution information can be standalone, moptimize,
or optimize. Information on evaluator type can also be of any sort (that is, d0, v0,
etc.).5 A final option that can be passed along as part of alginfo is the key fast, which
will execute the adaptive MCMC algorithm more quickly but less exactly. I give some
examples of what alginfo might look like in the remarks about syntax.
The second argument of amcmc(), lnf, is a pointer to the target distribution, which
must be written in log form. xinit and Vinit are conformable initial values for the
routine and an initial variance–covariance matrix for the proposal distribution. The
scalars draws and burn tell the routine how many draws to make from the distribution
and how many of these draws are to be discarded as an initial burn-in period. delta
is a string scalar that describes how adaptation is to occur, while aopt is the desired
acceptance rate; see section 2.1.
The real matrix blocks contains information on how amcmc() should proceed if the
user wishes to draw from the function in blocks. If the user does not wish to draw in
blocks, the user simply passes a missing value for this argument. If the user provides an
argument here, but does not specify block as part of alginfo, sampling will not occur
in blocks.
If the user is drawing from a function constructed with a prespecied model com-
mand written to work with either moptimize() or optimize(), this model statement is
passed to amcmc() via the optional M argument. As described below, this argument can
also have other uses; for example, it can pass up to 10 additional explanatory variables
to amcmc().
The final option is noisy, and if the user specifies noisy="noisy", amcmc() will
produce feedback on drawing as the algorithm executes. A dot is produced every time
the evaluation function lnf is called (not every time a draw is completed, because the
latter is taken by amcmc() to mean a complete run through the routine). Thus, if a
block sampler or a Metropolis-within-Gibbs style sampler is used, a draw is deemed to
have occurred when all the blocks or variables have been drawn once. The value of the
target distribution is reported every 50 evaluations.
Remarks
It is helpful to have a few examples of how information about the draws to be conducted
can be passed to the amcmc() function through the first argument, alginfo. This is
described in table 4.
5. The routine will not work with evaluators of the lnf type.
The user can select any item from each of the rows of table 4 and pass it to amcmc()
as part of alginfo. For example, if the user is trying to draw from a function that was
written as a type d2 evaluator to work with moptimize and the user wished to use a
global sampler, he or she might specify
alginfo="moptimize","d2","global"
Order does not matter, so the user could also specify
alginfo="d2","moptimize","global"
If the user had a stand-alone function and wished to do Metropolis-within-Gibbs
style sampling from this function, he or she would specify
alginfo="standalone","mwg"
or even just alginfo="mwg" because if no model statement is submitted, amcmc() will
assume that the function is stand alone. The final option that the user might specify
is the "fast" option, which tacks on the string fast to alginfo. This option is helpful
when the user wishes to sample globally or in blocks but has a problem with large
dimension. Because the global and block samplers use Cholesky decomposition of the
proposal covariance matrix, large problems may be time consuming. The "fast" option
circumvents the potential slowdown by working with just the diagonal elements of the
proposal covariance matrix, so one can avoid Cholesky decomposition. One should,
however, be cautious in using this option and should probably apply it only when the
user can be reasonably certain that distribution variables are independent.6
The row vector xinit contains an initial value for the draws, while Vinit is an initial
variance–covariance matrix that may be a conformable identity matrix. If, however,
Vinit is a row vector, amcmc() will interpret this as the diagonal of a variance matrix
with zero off-diagonal entries.
While the user-specified scalar delta controls how rapidly adaptation vanishes, the
user may also specify delta equal to missing (delta = .). amcmc() will then assume that
the user does not want any adaptation to occur but instead wishes to draw from the
invariant proposal distribution with mean xinit and covariance matrix Vinit. In this
case, the user must supply values of lambda to describe to the algorithm how to scale
draws from the proposal distribution. Constructing the code this way allows users to
run the adaptive algorithm for a while, and once it has converged, it allows users to
switch to an algorithm using an invariant proposal distribution. If a global sampler is
used, only one value of lambda is required; otherwise, lambda must be conformable with
the sampler. So, if the option mwg is used, the dimension of lambda must match the
dimension of the target distribution; if the option block is used, lambda must contain
as many entries as the number of blocks.
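As a sketch of this no-adaptation mode, mirroring the calling sequence used in the examples later in the article (where the evaluator lregeval(), the earlier draws b_start, and the model statement M are defined), with purely illustrative scale values in lambda:

    alginfo = "moptimize", "d0", "mwg"
    lambda  = J(1, 5, 2.38^2)       // one (illustrative) scale value per parameter
    b_fix   = amcmc(alginfo, &lregeval(), mean(b_start), variance(b_start),
                    10000, 0, ., ., arate=., vals=., lambda, ., M)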
Whether one wishes to do Metropolis-within-Gibbs sampling, block sampling, or
global sampling, the routine requires the same set of input information (although the
6. I included this option hoping that users might try it and see for what problems, if any, it does and
does not work well.
overwritten values lam and arate differ slightly) with one exception. When one samples
in block form, amcmc() requires a matrix to be provided in block, in which the number
of rows is equal to the number of sampling groups, and the values to be drawn together
have 1s in the appropriate positions and 0s elsewhere. So, for example, if one wished to
draw from a five-dimensional distribution and wished to draw values for the first three
arguments together, and then arguments four and ve together, one would set up a
matrix B as follows:
1 1 1 0 0
B=
0 0 0 1 1
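In Mata, this matrix can be built with the column- and row-join operators, for example:

    B = (1, 1, 1, 0, 0) \ (0, 0, 0, 1, 1)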
One might suspect that this would result in the same sort of algorithm obtained by
specifying alginfo="mwg", but this is not the case. After each draw, the block algorithm
updates the entire mean proposal vector and covariance matrix, so information on each
draw is used to prepare for the next.7 While not the intended use of the block-sampling
algorithm, if one leaves a column of all 0s in the matrix B, the corresponding value of
the parameter will never be drawn. This is a quick, albeit not particularly efficient, way
of constraining parameters at particular values during the drawing process.
The argument M of amcmc() can contain a previously assembled model statement, or
it can be used to pass additional arguments of a function to the routine.8 For example,
if the user has written a function to be sampled from that has three arguments, such
as lnf(x,Y,Z), the user would specify the standalone option in the variable alginfo,
assemble the additional arguments into a pointer, and then pass this information to
amcmc(). In this instance, M might be constructed in Mata as follows:
M=J(2,1,NULL)
M[1,1]=&Y
M[2,1]=&Z
M can then be passed to amcmc(), which will use Y and Z (in order) to evaluate
lnf(x,Y,Z). As shown in the examples, this usage of pointers can be handy when
amcmc() is used as part of a larger algorithm: one can continually change Y and Z
without actually having to explicitly declare that Y and Z have changed as the algorithm
executes.
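As a minimal sketch of this setup (the target density, its dimension, and the data Y and Z below are purely illustrative; the calling sequence follows the examples shown later in the article):

    real scalar lnf(real rowvector x, real colvector Y, real matrix Z)
    {
        return(-(Y-Z*x')'(Y-Z*x')/2)     // an illustrative Gaussian-type log kernel
    }

    Y = rnormal(50, 1, 0, 1)
    Z = rnormal(50, 3, 0, 1)
    M = J(2, 1, NULL)
    M[1,1] = &Y
    M[2,1] = &Z
    draws = amcmc(("standalone","global"), &lnf(), J(1,3,0), I(3), 2000, 200,
                  2/3, .234, arate=., vals=., lambda=., ., M)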
7. Using amcmc() in this way is akin to what Andrieu and Thoms (2008, 360) describe as an adaptive
MCMC algorithm with componentwise adaptive scaling.
8. But not both; we assume that any arguments have already been built into the model statement if
a previously constructed model is used.
Another alternative that has advantages in certain situations, particularly when one
wishes to do adaptive MCMC as one step in a larger sampling problem, is to set up an
adaptive MCMC sampling problem by using the set of functions amcmc_*(). The user
first opens a problem using the amcmc_init() function and then fills in the details of
the drawing procedure. The user can use the following functions to set up an adaptive
MCMC problem, with the arguments corresponding to those described in section 3.1:
A = amcmc_init()
amcmc_lnf(A, pointer (real scalar function) scalar f)
amcmc_args(A, pointer matrix Z)
amcmc_xinit(A, real rowvector xinit)
amcmc_Vinit(A, real matrix Vinit)
amcmc_aopt(A, real scalar aopt)
amcmc_blocks(A, real matrix blocks)
amcmc_model(A, transmorphic M)
amcmc_noisy(A, string scalar noisy)
amcmc_alginfo(A, string rowvector alginfo)
amcmc_damper(A, real scalar delta)
amcmc_lambda(A, real rowvector lambda)
amcmc_draws(A, real scalar draws)
amcmc_burn(A, real scalar burn)
Once a problem has been specified, a run can be initiated via the function

amcmc_draw(A)

Results from the run can then be retrieved with the functions amcmc_results_*(A),
where * can be any of the following: vals, arate, passes, totaldraws, acceptances,
propmean, propvar, or report. Additionally, users can recover their initial specifications
by using * = draws, aopt, alginfo, noisy, blocks, damper, xinit, Vinit, or lambda.
An additional function, amcmc_results_lastdraw(), produces the value of only the last
draw. Two other functions that are useful when one is executing an adaptive MCMC
draw as part of a larger algorithm are amcmc_append() and amcmc_reeval().
The function amcmc_append() allows the user to indicate that results should be overwritten
by specifying append="overwrite". In this case, the results of only the most recent
draws are kept. This can be useful when doing an analysis where nuisance parameters of
a model are being drawn, and storing all the previous draws would tax memory and
slow the algorithm's operation. The function amcmc_reeval() allows
the user to indicate whether the target distribution should be reevaluated at the last
draw before a proposed value is tried by specifying reeval="reeval". When the draw
is part of a larger algorithm, some of the arguments of the target distribution might
change as the larger algorithm proceeds. In these cases, the target distribution needs
to be reevaluated at the new argument values and the last previous draw for the routine to function
correctly. If the user sets reeval to anything else, it is assumed that nothing has changed
and that the value of the target distribution has not changed between draws.
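For example, within a larger sampler one might set (assuming a problem handle A already created with amcmc_init()):

    amcmc_append(A, "overwrite")   // keep only the draws from the most recent call to amcmc_draw()
    amcmc_reeval(A, "reeval")      // reevaluate the target at the last draw before each new proposal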
Remarks
Some of the information accessible with amcmc_results_*() provides hints as to why
a user might prefer to use a problem statement to attack an adaptive MCMC problem
instead of the Mata function amcmc(). Using a problem statement is particularly useful
because one can easily stop, restart, and append a run within Mata's structure envi-
ronment. In this way, a user can perform adaptive MCMC as part of a larger algorithm;
the structure makes it easy to retain information about past adaptation and runs as the
algorithm proceeds and also makes it easy to modify arguments of the algorithm. In
the model statement syntax, information about the number of times a given problem
has been initiated is retrievable via the function amcmc_results_passes(A), while the
acceptance history of an entire run is accessible via amcmc_results_acceptances(A).
Given the initialization of an adaptive MCMC problem A, one can run amcmc draw()
sequentially and results will be appended to previous results. Accordingly, the burn
period is active only the first time the function is executed. Thereafter, it is assumed
that the user wishes to retain all drawn values. As mentioned above, the user can
choose whether to retain all the information about previous draws with the function
amcmc_append(). When a user specifies append="overwrite" to save the draws of only
the last run, the routine still includes all information about adaptation contained in the
entire drawing history.
When a user initializes an adaptive MCMC problem via amcmc_init(), some defaults
are set unless overwritten by the user. The number of draws is set to 1, the burn period
is set to 0, the target distribution is assumed to be stand alone, the acceptance rate is
set to 0.234, and results are appended to previous results if multiple passes are made.
It is also assumed that the function does not need to be reevaluated at the last value
before drawing a new proposal.
Further description can be found in the help files, accessible by typing help mata
amcmc() or help mf amcmc at Stata's command prompt.
4 Examples
4.1 Parameter estimation
For my first example, I apply adaptive MCMC to a simple estimation problem. Suppose
that I have already programmed a likelihood function to use with moptimize() in Mata,
but I wish to try another means of estimating parameters, perhaps because I have
found that maximization of the likelihood function is taking too long or presents other
difficulties or because I am worried about small-sample properties of the estimators.
I decide to try to fit the model by drawing directly from the conditional distribution
of parameters. The ideas derive from Bayes's rule and the usual principles of Bayesian
estimation, but they can be applied to virtually any maximum likelihood problem.9 Via
Bayes's rule, the distribution of parameters conditional on the data can be written as
p(θ|X) = p(X|θ)p(θ) / p(X) = p(X|θ)p(θ) / ∫ p(X|θ)p(θ) dθ    (1)

If one has no prior information about parameter values, one can take p(θ), the prior dis-
tribution of the parameters, to be (improper) uniform over the support of the parameters.
As this renders p(θ) constant, one then obtains the posterior parameter distribution as

p(θ|X) ∝ p(X|θ)    (2)

So, according to (2), one might interpret a likelihood function as the distribution of
parameters conditional on data up to a constant of proportionality. The conditional
mean of parameter values is then

E(θ|X) = ∫ θ p(θ|X) dθ    (3)

One can estimate E(θ|X) by simulating the right-hand side of (3) via S draws from the
conditional distribution p(θ|X),

E(θ|X) ≈ (1/S) Σ_{s=1}^{S} θ^(s)
These simulations can also be used to characterize higher-order moments of the param-
eter distribution. I shall follow the nomenclature adopted by Chernozhukov and Hong
(2003) and refer to obtained estimators as Laplace-type estimators (LTEs) or quasi-
Bayesian estimators (QBEs).
Returning to the example, I will posit a simple linear model with log-likelihood
function

ln L ∝ −(y − Xβ)'(y − Xβ)/(2σ²) − (n/2) ln σ²
9. They can also be applied to a wider variety of problems; see Chernozhukov and Hong (2003).
For comparison, in the following code, I take this simple model and fit it to some data by
using a type d0 evaluator and Mata's moptimize() function. One subtlety of the code is
that the variance is coded in exponentiated form. This is done so that when amcmc() is
applied to the problem, the objective function is consistent with the multivariate normal
proposal distribution, which requires that parameters have support (−∞, ∞).10 The
following code develops the model statement and fits the model via maximum likelihood:
. sysuse auto
(1978 Automobile Data)
. mata:
mata (type end to exit)
: function lregeval(M,todo,b,crit,s,H)
> {
> real colvector p1, p2
> real colvector y1
> p1=moptimize_util_xb(M,b,1)
> p2=moptimize_util_xb(M,b,2)
> y1=moptimize_util_depvar(M,1)
> crit=-(y1:-p1)'(y1:-p1)/(2*exp(p2))-
> rows(y1)/2*p2
> }
note: argument todo unused
note: argument s unused
note: argument H unused
: M=moptimize_init()
: moptimize_init_evaluator(M,&lregeval())
: moptimize_init_evaluatortype(M,"d0")
: moptimize_init_depvar(M,1,"mpg")
: moptimize_init_eq_indepvars(M,1,"price weight displacement")
: moptimize_init_eq_indepvars(M,2,"")
: moptimize(M)
initial: f(p) = -18004
alternative: f(p) = -10466.142
rescale: f(p) = -298.60453
rescale eq: f(p) = -189.39334
Iteration 0: f(p) = -189.39334 (not concave)
Iteration 1: f(p) = -172.06827 (not concave)
Iteration 2: f(p) = -162.08563 (not concave)
Iteration 3: f(p) = -156.61996 (not concave)
Iteration 4: f(p) = -143.55991
Iteration 5: f(p) = -129.10949
Iteration 6: f(p) = -127.05705
Iteration 7: f(p) = -127.05447
Iteration 8: f(p) = -127.05447
10. A less efficient way to deal with parameters with restricted supports is to program the distribution
so that it returns a missing value whenever a draw lands outside the appropriate range.
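As a sketch of the approach described in footnote 10, the following stand-alone log density (an illustrative unit exponential restricted to x > 0) simply returns a missing value whenever a proposal lands outside the support, which amcmc() then rejects:

    real scalar lnexp(real rowvector x)
    {
        if (x[1] <= 0) return(.)     // outside the support: report missing so the draw is rejected
        return(-x[1])                // log density of a unit exponential, up to a constant
    }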
: moptimize_result_display(M)
Number of obs = 74
                    Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
eq1
price -.0000966 .0001591 -0.61 0.544 -.0004085 .0002153
weight -.0063909 .0011759 -5.43 0.000 -.0086956 -.0040862
displacement .0054824 .0096492 0.57 0.570 -.0134296 .0243945
_cons 40.10848 1.974222 20.32 0.000 36.23907 43.97788
eq2
_cons 2.433905 .164399 14.80 0.000 2.111688 2.756121
: end
I now estimate model parameters via simulation by treating the likelihood function
like the parameters' conditional distribution. I start with a Metropolis-within-Gibbs
sequential sampler to obtain 10,000 draws for each parameter value, discarding the first
20 draws as a burn-in period. I start with this sampler because it is usually a relatively
safe choice when there is little information on starting points, which I am pretending are
unavailable. I set the initial values used by the sampler to 0 and use an identity matrix
as an initial covariance matrix for proposals. I choose a value of delta = 2/3, which
allows a fairly conservative amount of adaptation to occur and a desired acceptance rate
of 0.4.11
. set seed 8675309
. mata:
mata (type end to exit)
: alginfo="moptimize","d0","mwg"
: b_mwg=amcmc(alginfo,&lregeval(),J(1,5,0),I(5),10000,50,2/3,.4,
> arate=.,vals=.,lambda=.,.,M)
: st_matrix("b_mwg",mean(b_mwg))
: st_matrix("V_mwg",variance(b_mwg))
: end
11. Regarding what might seem a relatively short burn-in period, I set this period to be short enough
to show the convergence behavior of the algorithm.
. ereturn display
                    Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
eq1
price -.0001322 .0001714 -0.77 0.440 -.0004681 .0002036
weight -.0057418 .0018016 -3.19 0.001 -.009273 -.0022107
displacement .00218 .0125846 0.17 0.862 -.0224854 .0268454
_cons 39.00328 3.095009 12.60 0.000 32.93717 45.06939
eq2
_cons 2.518081 .2071915 12.15 0.000 2.111993 2.924169
Although the algorithm was not allowed a very long burn-in time, the simulation-based
parameter estimates are close to those obtained by maximum likelihood.12 How fre-
quently were draws of each parameter accepted, and how close is the algorithm working
around the maximum value of the function? This information is returned as the over-
written arguments arate and vals.
. mata:
mata (type end to exit)
: arate
1
1 .3806030151
2 .3807035176
3 .3870351759
4 .4020100503
5 .3951758794
: max(vals),mean(vals)
1 2
1 -127.1097198 -130.2193494
: end
The sampler finds and operates close to the maximum value of the log likelihood (which
was −127.05), and the acceptance rates of the draws are very close to the desired
acceptance rate of 0.4. To understand what the distribution of the parameters looks
like, I pass the information about parameter draws to Stata and form visual pictures
of results. The code below accomplishes this and creates two panels of graphs: one
that shows the distribution of parameters (figure 1) and one that shows how parameter
draws and the value of the function evolved as the algorithm moved (figure 2).
12. One possible issue here is whether it is appropriate to summarize the results in usual Stata format
like this. One can assume that this is acceptable here because the parameters are collectively
normally distributed. Whether this is true in more general problems requires careful thought.
. preserve
. clear
. local varnames price weight displacement constant std_dev
. getmata (`varnames')=b_mwg
. getmata vals=vals
. generate t=_n
. local graphs
. local tgraphs
. foreach var of local varnames {
2. quietly {
3. histogram `var', saving(`var', replace) nodraw
4. twoway line `var' t, saving(t`var', replace) nodraw
5. }
6. local graphs "`graphs' `var'.gph"
7. local tgraphs "`tgraphs' t`var'.gph"
8. }
. histogram vals, saving(vals,replace) nodraw
(bin=39, start=-183.40158, width=1.4433811)
(file vals.gph saved)
. twoway line vals t, saving(vals_t,replace) nodraw
(file vals_t.gph saved)
. graph combine `graphs' vals.gph
. graph export vals_mwg.eps, replace
(file vals_mwg.eps written in EPS format)
. graph combine `tgraphs' vals_t.gph
. graph export valst_mwg.eps, replace
(file valst_mwg.eps written in EPS format)
. restore
Figure 1 is composed of histograms for each parameter, with the last panel being the
histogram of the log likelihood. Parameters seem to be approximately normally dis-
tributed (with a few blips), excepting the first few draws, and they are also centered
around parameter values obtained via maximum likelihood.
[Figure 1. Histograms of the draws for price, weight, displacement, constant, and std_dev, and of the log-likelihood values (vals)]
Figure 2 shows how the drawn values for parameters and the value of the objective
function evolved as the algorithm proceeded.
[Figure 2. Time-series plots of the draws for price, weight, displacement, constant, std_dev, and vals against the draw number t]
From figure 2, one can see that after a few iterations, the algorithm settles down
to drawing from an appropriate range. The draws are also autocorrelated, and this
autocorrelation is a general property of any MCMC algorithm, adaptive or not. Thus,
when one applies MCMC algorithms in practice, it is sometimes beneficial to thin out
the draws by keeping, say, only every 5th or 10th draw or to jumble draws.
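For example, keeping only every 10th draw from the run above takes one line in Mata (the thinning interval is an arbitrary choice):

    b_thin = b_mwg[range(1, rows(b_mwg), 10), .]    // retain draws 1, 11, 21, ...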
To illustrate the use of a global sampler and some of the problems one might en-
counter in an MCMC-based analysis, I now apply a global sampler to the problem so that
all parameter values are drawn simultaneously. The following code shows the results of
a run of 12,000 draws with a burn-in period of 2,000:
                    Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
eq1
price -.0004614 .0019104 -0.24 0.809 -.0042057 .0032829
weight .013056 .0232029 0.56 0.574 -.0324209 .0585328
displacement -.1798405 .3163187 -0.57 0.570 -.7998138 .4401328
_cons 15.16227 20.84814 0.73 0.467 -25.69933 56.02387
eq2
_cons 4.017751 1.880026 2.14 0.033 .3329679 7.702533
One can see from these results that the algorithm has not quickly found an appropriate
range of values for parameter values. Figures 3 and 4 indicate why: the algorithm
spends considerable time stuck away from the maximal function value.
Figure 3. Distribution of parameters after a global MCMC run that is slow to converge
[Figure 4. Time-series plots of the draws for price, weight, displacement, constant, std_dev, and vals from the slow-to-converge global run]
The problem observed in figures 3 and 4 is that the algorithm was not allowed to burn
in for a long enough time for the global MCMC algorithm to work correctly. While
the parameter values eventually settled down closer to their true values, it took the
algorithm upward of 6,000 draws to find the right range. In fact, it looks as though the
algorithm settled into a stable range for draws 2,000–6,000 or so but then once again
experienced a jump to the correct stable range, a phenomenon known as
pseudoconvergence (Geyer 2011). This behavior is also responsible for the multimodal appearance
of the histograms in figure 3.
While my intent is to illustrate how the Mata function amcmc() works, my example
also illustrates what can happen when one fails to specify appropriate adjustment pa-
rameters and does not allow an adaptive MCMC algorithm to run long enough in a given
estimation problem. One may unknowingly get bad results, as would be the case if
the global algorithm had been allowed to run for only 5,000 iterations. This sometimes
happens if poor starting values are mixed with parameters that have very different mag-
nitudes, for example, the constant in the initial model relative to the other parameters.
From inspecting figure 4, one can see that the constant did not find its correct range
until just after 6,000 draws, and this is likely what caused the problem.
This discussion motivates using amcmc() in steps, where a slower but relatively
robust sampler (a Metropolis-within-Gibbs sampler, in this case) is used to orient pa-
rameters close to their correct range before a global sampler is used, as shown in the
following code:
. mata:
mata (type end to exit)
: alginfo="mwg","d0","moptimize"
: b_start=amcmc(alginfo,&lregeval(),J(1,5,0),I(5),5*1000,5*100,2/3,.4,
> arate=.,vals=.,lambda=.,.,M)
: alginfo="global","d0","moptimize"
: b_glo2=amcmc(alginfo,&lregeval(),mean(b_start),
> variance(b_start),11000,1000,2/3,.4,
> arate=.,vals=.,lambda=.,.,M)
: st_matrix("b_glo2",mean(b_glo2))
: st_matrix("V_glo2",variance(b_glo2))
: end
                    Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
eq1
price -.0001059 .0001584 -0.67 0.504 -.0004164 .0002046
weight -.0063727 .0012014 -5.30 0.000 -.0087275 -.0040179
displacement .0056462 .0099215 0.57 0.569 -.0137997 .025092
_cons 40.10216 1.912111 20.97 0.000 36.35449 43.84982
eq2
_cons 2.480892 .1665249 14.90 0.000 2.15451 2.807275
Thus one can then draw parameters that are scaled differently either alone or in blocks
until the algorithm finds its footing and then proceed with a global algorithm. I have
motivated the use of a global drawing method because of its clear speed advantages, but
another more subtle reason to use it that might not be obvious when visually inspecting
the graphs is that global draws often exhibit less serial correlation across draws.13 The
conclusion provides sources with additional tips for setting up, analyzing, and presenting
the results of an MCMC run.
Yet another alternative is to once again begin with a Metropolis-within-Gibbs sam-
pler to characterize the distribution of the parameters and, once this is done sufficiently
well, to run the algorithm without adaptation so that one is using an invariant proposal
distribution and a regular MCMC algorithm. After an initial run with the "mwg" option,
I submit the mean and variance of results to the global sampler with no adaptation
parameter, passing a value of missing (.) for delta. Because I am not passing any
information to amcmc() on how to do adaptation in this case, I am required to submit
a value for lambda, so I choose λ = 2.38²/n.14 Finally, I also submit a missing value
for aopt. Because no adaptation occurs, aopt is not used by the algorithm.
. mata:
mata (type end to exit)
: alginfo="mwg","d0","moptimize"
: b_start=amcmc(alginfo,&lregeval(),J(1,5,0),I(5),5*1000,5*100,2/3,.4,
> arate=.,vals=.,lambda=.,.,M)
: alginfo="global","d0","moptimize"
: b_glo3=amcmc(alginfo,&lregeval(),mean(b_start),
> variance(b_start),10000,0,.,.,
> arate=.,vals=.,(2.38^2/5),.,M)
: arate
.2253
: mean(b_glo3)
1
1 -.0000916295
2 -.0064095109
3 .0054916501
4 40.14276799
5 2.497166774
: end
Apparently, the proposal distribution was successfully tuned in the initial run with the
Metropolis-within-Gibbs sampler. The mean values of the parameters obtained from
the global draw are close to their maximum-likelihood values, and the acceptance rate
is in the healthy range.
A global sampler for this problem can also be set up by using the structured amcmc_*() functions:
. mata:
mata (type end to exit)
: A=amcmc_init()
: amcmc_alginfo(A,("global","d0","moptimize"))
: amcmc_lnf(A,&lregeval())
: amcmc_xinit(A,J(1,5,0))
: amcmc_Vinit(A,I(5))
: amcmc_model(A,M)
: amcmc_draws(A,4000)
: amcmc_damper(A,2/3)
: amcmc_draw(A)
: end
I can now access results using the previously described amcmc_results_*(A) set of
functions.
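For instance, using the result functions listed earlier:

    vals_g  = amcmc_results_vals(A)       // target values at the retained draws
    arate_g = amcmc_results_arate(A)      // acceptance rate(s)
    last    = amcmc_results_lastdraw(A)   // the most recent draw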
where c_i in (4) denotes a (left) censoring point that might be specific to the ith ob-
servation, and ρ_τ(u) = {τ − 1(u < 0)}u, with τ ∈ (0, 1) the quantile of interest. Esti-
mation using derivative-based maximization methods is problematic because the objec-
tive function (4) has flat regions and discontinuities. While one might do well with a
nonderivative-based optimization method such as Nelder–Mead, one is then confronted
with the problem of characterizing the parameters' distribution and getting standard
errors. For these reasons, one might opt for an LTE or a QBE estimator.
To apply amcmc() to the problem, I first program the objective function as follows:15
. mata:
mata (type end to exit)
: void cqregeval(M,todo,b,crit,g,H) {
> real colvector u,Xb,y,C
> real scalar tau
>
> Xb =moptimize_util_xb(M,b,1)
> y =moptimize_util_depvar(M,1)
> tau =moptimize_util_userinfo(M,1)
> C =moptimize_util_userinfo(M,2)
> u =(y:-rowmax((C,Xb)))
> crit =-colsum(u:*(tau:-(u:<0)))
> }
note: argument todo unused
note: argument g unused
note: argument H unused
: end
The following code sets up a model statement for use with the function moptimize( )
(see [M-5] moptimize( )). One can follow the Mata code with moptimize(M) to verify
that this model and variations on the basic theme, obtained by dropping or adding
additional variables, encounter difficulties.
15. One might code the objective function without summing over observations. I sum over observations
so that the objective is compatible with Nelder–Mead in Stata, which requires a type d0 evaluator.
Setting up the problem like this allows the use of amcmc(), where I implement the
strategy of using a Metropolis-within-Gibbs-type algorithm followed by a global sampler.
. mata:
mata (type end to exit)
: alginfo="mwg","d0","moptimize"
: b_start=amcmc(alginfo,&cqregeval(),J(1,4,0),I(4),5000,1000,2/3,.4,
> arate=.,vals=.,lambda=.,.,M)
: alginfo="global","d0","moptimize"
: b_end=amcmc(alginfo,&cqregeval(),mean(b_start),
> variance(b_start),20000,10000,1,.234,arate=.,vals=.,lambda=.,.,M)
: end
Because this application might be of more general interest, I developed the command
mcmccqreg, which is a wrapper for the LTE and QBE estimation of censored quantile
regression. The previous code can be executed with a single mcmccqreg command.
One can see from the way the command is issued how information about the sampler,
the drawing process, and the censoring point (which has a default of 0 for all observations)
can be controlled using the mcmccqreg command. The command produces estimates
that are summary statistics of the sampling run. mcmccqreg allows one to save results,
and the results of the run are saved in the file lsub draws with the objective function
value after each draw. The user can then easily analyze the draws using Stata's graphing
and statistical analysis tools. While the workings of the command derive more or less
directly from the description of amcmc(), more information about the command and
some additional examples can be found in mcmccqreg's help file.
As written, p does not integrate to one and seems hard to invert. While Metropolis-
within-Gibbs or global sampling works fine with this example, to illustrate the block
sampler, I will draw from the distribution in blocks, where values for the first two
arguments are drawn together, followed by a draw of the third. Thus the block matrix
to be passed to amcmc() is
    B = [1 1 0]
        [0 0 1]
The code that programs the function and draws from the distribution is as follows:
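The listing itself is not reproduced here; the sketch below shows the calling pattern under the assumption of a purely illustrative three-dimensional log density ln_fun(), with the draw and burn-in counts taken from the text, an illustrative desired acceptance rate of 0.234, and a missing value passed in the final (model/arguments) slot because the function is assumed to be stand alone with no extra arguments:

    real scalar ln_fun(real rowvector x)
    {
        return(-0.5*(x[1]^2 + x[2]^2) - abs(x[3] - 100))   // illustrative target only
    }

    B  = (1, 1, 0) \ (0, 0, 1)
    xs = amcmc(("standalone","block"), &ln_fun(), J(1,3,0), I(3), 4000, 200,
               2/3, .234, arate=., vals=., lambda=., B, .)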
The example is set up to draw 4,000 values with a burn-in period of 200. Graphs of the
simulation results are shown in figures 5 and 6.
[Figure 5. Histograms of the draws for x_1, x_2, and x_3 and of the function values (vals)]
[Figure 6. Time-series plots of the draws for x_1, x_2, x_3, and vals]
The graphs give a visual of the marginal distributions for the variables, while the time-
series diagram verifies that our simulation run is getting good coverage and rapid con-
vergence to the target distribution.
A different way to draw from this distribution would be to set up an adaptive MCMC
problem via a structured set of Mata functions.
. mata:
mata (type end to exit)
: A=amcmc_init()
: amcmc_lnf(A,&ln_fun())
: amcmc_alginfo(A,("standalone","block"))
: amcmc_draws(A,4000)
: amcmc_burn(A,200)
: amcmc_damper(A,2/3)
: amcmc_xinit(A,J(1,3,0))
: amcmc_Vinit(A,I(3))
: amcmc_blocks(A,B)
: amcmc_draw(A)
: end
16. The data are downloadable from Train's website at http://eml.berkeley.edu/train/ and can also
be found at http://fmwww.bc.edu/repec/bocode/t/traindata.dta.
where in (5), ε_njt is an independent identically distributed extreme value, and β_n are
individual-specific parameters. Variation in these parameters across the population is
captured by assuming the parameters are normally distributed with mean b and covariance
matrix W. I denote a person's choice at t as y_nt ∈ J. Then the probability of observing
person n's sequence of choices is

L(y_n|β_n) = ∏_t [ exp(β_n'x_{n,y_nt,t}) / Σ_{j=1}^{J} exp(β_n'x_{njt}) ]    (6)
Given the distribution of β, I can write the above conditional on the distribution of
parameters, φ(β|b, W), and integrate over the distribution of parameter values to get

L(y_n|b, W) = ∫ L(y_n|β) φ(β|b, W) dβ

In a Bayesian approach, a prior h(b, W) is assumed, and the joint posterior likelihood
of the parameters is formed using

H(b, W|Y, X) ∝ ∏_n L(y_n|b, W) h(b, W)    (7)
Following the outline given in Train (2009, 301–302), we see that drawing from the
posterior proceeds in three steps. First, b is drawn conditional on the β_n and W; then W
is drawn conditional on b and the β_n; and finally, the values of β_n are drawn conditional on
b and W. The first two steps are straightforward, assuming that the prior distribution
of b is normal with extremely large variance and that the prior for W is an inverted
Wishart with K degrees of freedom and an identity scale matrix. In this case, the
conditional distribution of b is N(β̄, W/N), where β̄ is the mean of the β_n's. The
conditional distribution of W is an inverted Wishart with K + N degrees of freedom
and scale matrix (KI + N S̄)/(K + N), where S̄ = N^{-1} Σ_n (β_n − b)(β_n − b)' is the
sample variance of the β_n's about b.
The distribution of β_n given choices, data, and (b, W) has no simple form, but from
(8), we see that the distribution of a particular person's parameters obeys

p(β_n|y_n, b, W) ∝ L(y_n|β_n) φ(β_n|b, W)    (9)

where the term L(y_n|β_n) in (9) is given by (6). This is a natural place to apply MCMC
methods, and it is here where I can use the amcmc_*() suite of functions.
I now return to the example. traindata.dta contains information on the energy
contract choices of 100 people, where each person faces up to 12 different choice oc-
casions. Suppliers' contracts are differentiated by price, the type of contract offered,
the location of the supplier relative to the individual, how well known the supplier is, and the season in which the
offer was made.
As a point of comparison, I fit the model in Train (2009, 305) using mixlogit (after
download and installation).
. clear all
. set more off
. use http://fmwww.bc.edu/repec/bocode/t/traindata.dta
. set seed 90210
. mixlogit y, rand(price contract local wknown tod seasonal) group(gid) id(pid)
Iteration 0: log likelihood = -1253.1345 (not concave)
Iteration 1: log likelihood = -1163.1407 (not concave)
Iteration 2: log likelihood = -1142.7635
Iteration 3: log likelihood = -1123.6896
Iteration 4: log likelihood = -1122.6326
Iteration 5: log likelihood = -1122.6226
Iteration 6: log likelihood = -1122.6226
Mixed logit model Number of obs = 4780
LR chi2(6) = 467.53
Log likelihood = -1122.6226 Prob > chi2 = 0.0000
               y        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
Mean
price -.8908633 .0616638 -14.45 0.000 -1.011722 -.7700045
contract -.22285 .0390333 -5.71 0.000 -.2993539 -.1463462
local 1.958347 .1827835 10.71 0.000 1.600098 2.316596
wknown 1.560163 .1507413 10.35 0.000 1.264715 1.85561
tod -8.291551 .4995409 -16.60 0.000 -9.270633 -7.312469
seasonal -9.108944 .5581876 -16.32 0.000 -10.20297 -8.014916
SD
price .1541266 .0200631 7.68 0.000 .1148036 .1934495
contract .3839507 .0432156 8.88 0.000 .2992497 .4686516
local 1.457113 .1572685 9.27 0.000 1.148873 1.765354
wknown -.8979788 .1429141 -6.28 0.000 -1.178085 -.6178722
tod 1.313033 .1648894 7.96 0.000 .9898559 1.63621
seasonal 1.324614 .1881265 7.04 0.000 .9558927 1.693335
To implement the Bayesian estimator, I proceed in the steps outlined by Train (2009,
301–302). First, I develop a Mata function that produces a single draw from the condi-
tional distribution of b.
. mata:
mata (type end to exit)
: real matrix drawb_betaW(beta,W) {
> return(mean(beta)+rnormal(1,cols(beta),0,1)*cholesky(W)')
> }
: end
Next I use the instructions described in Train (2009, 299) to draw from the conditional
distribution of W. The Mata function is
. mata
mata (type end to exit)
: real matrix drawW_bbeta(beta,b)
> {
> v=rnormal(cols(b)+rows(beta),cols(b),0,1)
> S1=variance(beta)
> S=invsym((cols(b)*I(cols(b))+rows(beta)*S1)/(cols(b)+rows(beta)))
> L=cholesky(S)
> R=(L*v')*(L*v')'/(cols(b)+rows(beta))
> return(invsym(R))
> }
: end
I now have two of the three steps of the drawing scheme in place. The last task is more
nuanced and involves using structured amcmc problems in conjunction with the flexible
ways in which one can manipulate structures in Mata. The key is to think of drawing
each set of individual-level parameters β_n as a separate adaptive MCMC problem. It is
helpful to first get all the data into Mata, get familiar with its structure, and then work
from there.
. mata:
mata (type end to exit)
: st_view(y=.,.,"y")
: st_view(X=.,.,"price contract local wknown tod seasonal")
: st_view(pid=.,.,"pid")
: st_view(gid=.,.,"gid")
: end
The matrix (really, a column vector) y is a sequence of dummy variables marking the
choices of individual n in each choice occasion, while the matrix X collects explanatory
variables for each potential choice. pid and gid are identifiers for individuals and choice
occasions, respectively. I now write a Mata function that computes the log probability
for a particular vector of parameters for a given person, conditional on that person's
information.
. mata:
mata (type end to exit)
: real scalar lnbetan_bW(betaj,b,W,yj,Xj)
> {
> Uj=rowsum(Xj:*betaj)
> Uj=colshape(Uj,4)
> lnpj=rowsum(Uj:*colshape(yj,4)):-
> ln(rowsum(exp(Uj)))
> var=-1/2*(betaj:-b)*invsym(W)*(betaj:-b)'-
> 1/2*ln(det(W))-cols(betaj)/2*ln(2*pi())
> llj=var+sum(lnpj)
> return(llj)
> }
: end
The function takes in five arguments, the first of which is a parameter vector for the
person (that is, the values to be drawn). The second and third arguments characterize
the mean and covariance matrix of the parameters across the population.18 The fourth
and fifth arguments contain information about an individual's choices and explanatory
variables.
The first line of code multiplies parameters by explanatory variables to form utility
terms, which are then shaped into a matrix with four columns. Individuals have four
options available on each choice occasion. After reshaping, the utilities for the potential
choices on each occasion occupy a row, with the alternatives in separate columns. lnpj
then contains the log probabilities of the choices actually made, the log of utility less
the logged sum of exponentiated utilities. Finally, var computes the log distribution
of parameters about the conditional mean, and llj sums the two components. The
result is the log likelihood of individual n's parameter values, given choices, data, and
the parameters governing the distribution of individual-level parameters.
I now set up a structured problem for each individual in the dataset. I begin by
setting up a single adaptive MCMC problem and then replicate this problem using J( )
(see [M-5] J( )) to match the number of individual-level parameter sets, the same as
the number of individual-level identifiers in the data (pid), characterized via Mata's
panelsetup( ) (see [M-5] panelsetup( )) function.
18. This function is not as fast as it could be, and it is also specific to the dataset. One way to speed the
algorithm is to compute the Cholesky decomposition of W once before individual-level parameters
are drawn. The wrapper bayesmixedlogit exploits this and a few other improvements.
. mata
mata (type end to exit)
: m=panelsetup(pid,1)
: Ap=amcmc_init()
: amcmc_damper(Ap,1)
: amcmc_alginfo(Ap,("standalone","global"))
: amcmc_append(Ap,"overwrite")
: amcmc_lnf(Ap,&lnbetan_bW())
: amcmc_draws(Ap,1)
: amcmc_append(Ap,"overwrite")
: amcmc_reeval(Ap,"reeval")
: A=J(rows(m),1,Ap)
: end
I also apply the amcmc option "overwrite", which means that the results from only
the last round of drawing will be saved. Specifying the "reeval" option means that
each individual's likelihood will be reevaluated at the new parameter values and the old
values of coefficients before drawing.
I now duplicate the problem by forming a matrix of adaptive MCMC problems,
one for each individual, and then use a loop to fill in individual-level choices and
explanatory variables as arguments. In the end, the matrix A is a collection of 100
separate adaptive MCMC problems. Before this, some initial values for b and W are set,
and some initial values for individual-level parameters are drawn. I set up the pointer
matrix Args to hold this information along with the individual-level information.
. mata
mata (type end to exit)
: Args=J(rows(m),4,NULL)
: b=J(1,6,0)
: W=I(6)*6
: beta=b:+sqrt(diagonal(W))':*rnormal(rows(m),cols(b),0,1)
: for (i=1;i<=rows(m);i++) {
> Args[i,1]=&b
> Args[i,2]=&W
> Args[i,3]=&panelsubmatrix(y,i,m)
> Args[i,4]=&panelsubmatrix(X,i,m)
> amcmc_args(A[i],Args[i,])
> amcmc_xinit(A[i],b)
> amcmc_Vinit(A[i],W)
> }
: end
After creating some placeholders for the draws (bvals and Wvals), we can execute the
drawing algorithm as follows:
. mata
mata (type end to exit)
: its=20000
: burn=10000
: bvals=J(0,cols(beta),.)
: Wvals=J(0,cols(rowshape(W,1)),.)
: for (i=1;i<=its;i++) {
> b=drawb_betaW(beta,W/rows(m))
> W=drawW_bbeta(beta,b)
> bvals=bvals\b
> Wvals=Wvals\rowshape(W,1)
> beta_old=beta
> for (j=1;j<=rows(A);j++) {
> amcmc_draw(A[j])
> beta[j,]=amcmc_results_lastdraw(A[j])
> }
> }
: end
The algorithm consists of an outer loop and an inner loop, within which individual-level
parameters are drawn sequentially. The current value of the beta vector, which holds
individual-level parameters in rows, is overwritten with the last draw produced by using
the amcmc_results_lastdraw() function.
A subtlety of the code also indicates a reason why it is useful to pass additional
function arguments as pointers: each time a new value of b and W is drawn, a user
does not need to reiterate to each sampling problem that b and W have changed, be-
cause pointers point to positions that hold objects and not to the values of the objects
themselves. Thus, every time a new value of b or W is drawn, the arguments of all 100
problems are automatically changed. By specifying that the target distribution for each
level problem is to be reevaluated, the user tells the routine to recalculate lnbetan bW
at the last drawn value when comparing a new draw to the previous one.
Because the technique might be of greater interest, I have developed a command,
bayesmixedlogit, that implements the algorithm. For example, the algorithm described by
the previous code could be executed with the following command, which also summarizes
results in a way conformable with usual Stata output:
                    Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
Random
price -1.168711 .1245738 -9.38 0.000 -1.4129 -.9245209
contract -.3433208 .0682585 -5.03 0.000 -.4771212 -.2095204
local 2.637242 .3436764 7.67 0.000 1.963567 3.310917
wknown 2.138963 .2596608 8.24 0.000 1.629976 2.647951
tod -11.16374 1.049769 -10.63 0.000 -13.2215 -9.105982
seasonal -11.19243 1.030291 -10.86 0.000 -13.212 -9.172849
Cov_Random
var_price .8499292 .2332495 3.64 0.000 .3927132 1.307145
cov_priceco~t .1128769 .0803203 1.41 0.160 -.044567 .2703208
cov_pricelo~l 1.583028 .4519537 3.50 0.000 .6971079 2.468948
cov_pricewk~n .8898662 .3096053 2.87 0.004 .2829775 1.496755
cov_pricetod 6.106009 1.909356 3.20 0.001 2.363286 9.848731
cov_pricese~l 6.044055 1.892895 3.19 0.001 2.333601 9.75451
var_contract .3450904 .0670202 5.15 0.000 .2137174 .4764634
cov_contrac~l .4714882 .2131141 2.21 0.027 .0537416 .8892347
cov_contrac~n .3624791 .1560516 2.32 0.020 .0565865 .6683717
cov_contrac~d .7592097 .6576296 1.15 0.248 -.5298765 2.048296
cov_contrac~l .9147682 .65939 1.39 0.165 -.3777688 2.207305
var_local 7.000292 1.883972 3.72 0.000 3.307328 10.69326
cov_localwk~n 4.022065 1.248119 3.22 0.001 1.575501 6.468629
cov_localtod 12.84674 3.787742 3.39 0.001 5.422006 20.27148
cov_localse~l 13.40598 3.727253 3.60 0.000 6.099812 20.71214
var_wknown 3.364285 1.012474 3.32 0.001 1.379632 5.348938
cov_wknowntod 6.513209 2.60766 2.50 0.013 1.401671 11.62475
cov_wknowns~l 7.109282 2.563623 2.77 0.006 2.084064 12.1345
var_tod 57.62449 16.97876 3.39 0.001 24.3427 90.90628
cov_todseas~l 53.93841 16.35184 3.30 0.001 21.88551 85.99131
var_seasonal 55.05572 16.54599 3.33 0.001 22.62226 87.48918
The results are similar but not identical to those obtained using mixlogit. Additional
information and examples for bayesmixedlogit can be found in the help file, and some
examples of estimating a mixed logit model using Bayesian methods are provided in
the help file for amcmc(), accessible via the commands help mf amcmc or help mata
amcmc().
5 Description
In this section, I sketch a Mata implementation of what I have been referring to as
a global adaptive MCMC algorithm. The sketched routine omits a few details, mainly
about parsing options, but it is relatively true to form in describing how the algorithms
discussed in the article are actually implemented in Mata and might be used as a
template for developing more specialized algorithms. It assumes that the user wishes to
draw from a stand-alone function without additional arguments. The code is as follows:
. mata:
mata (type end to exit)
: real matrix amcmc_global(f,xinit,Vinit,draws,burn,damper,
> aopt,arate,val,lam)
> {
> real scalar nb,old,pro,i,alpha
> real rowvector xold,xpro,mu
> real matrix Accept,accept,xs,V,Vsq,Vold
>
> nb=cols(xinit) /* Initialization */
> xold=xinit
> lam=2.38^2/nb
> old=(*f)(xold)
> val=old
>
> Accept=0
> xs=xold
> mu=xold
> V=Vinit
> Vold=I(cols(xold))
>
> for (i=1;i<=draws;i++) {
> accept=0
> Vsq=cholesky(V) /* Prep V for drawing */
> if (hasmissing(Vsq)) {
> Vsq=cholesky(Vold)
> V=Vold
> }
>
> xpro=xold+lam*rnormal(1,nb,0,1)*Vsq /* Draw, value calc. */
>
>
> pro=(*f)(xpro)
>
> if (pro==. ) alpha=0 /* calc. of accept. prob */
>
> else if (pro>old) alpha=1
> else alpha=exp(pro-old)
>
> if (runiform(1,1)<alpha) {
> old=pro
> xold=xpro
> accept=1
> }
>
> lam=lam*exp(1/(i+1)^damper*(alpha-aopt)) /*update*/
> xs=xs\xold
> val=val\old
> Accept=Accept\accept
> mu=mu+1/(i+1)^damper*(xold-mu)
> Vold=V
> V=V+1/(i+1)^damper*((xold-mu)'(xold-mu)-V)
> _makesymmetric(V)
> }
>
> val =val[burn+1::draws,]
> arate=mean(Accept[burn+1::draws,])
> return(xs[burn+1::draws,])
> }
: end
The function starts by setting up a variable (nb) to hold the dimension of the distribu-
tion, and xold, which functions as xt in the algorithms discussed in table 3, is set to
the user-supplied initial value. The initial value of λ (called lam) is set as discussed by
Andrieu and Thoms (2008, 359).
Next the log value of the distribution (f) at xold is calculated and called old. The
next few steps proceed as one would expect. However, I find it useful to have a default
covariance matrix waiting (Vold in the code) in case the Cholesky decomposition en-
counters problems. For example, this could happen if the initial variance–covariance
matrix is not positive definite or if there is insufficient variation in the draws, which
sometimes happens in the early stages of a run. Once a usable covariance matrix has
been obtained, xpro (which functions as Yt in the algorithms in tables 1, 2, and 3) is
formed using a conformable vector of standard normal random variates, and the function
is evaluated at xpro.
The acceptance probability alpha is then calculated in a numerically stable way in an
if-else if-else block. If the target function returns a missing value when evaluated,
alpha is set to 0 so that the draw will not be retained. If the proposal produces a higher
value of the target function, alpha is set to one. Otherwise, it is set as described by
the algorithms.19 Finally, a uniform random variable is drawn that determines whether
the draw is to be accepted. Once this is known, all values are updated according to
the scheme described in table 3. Once the for loop concludes, the algorithm overwrites
the acceptance rate, arate, and the function value, val, and returns the results of the
draw.
6 Conclusions
I have given a brief overview of adaptive MCMC methods and how they can be imple-
mented using the Mata routine amcmc() and a suite of functions amcmc_*(). While I
have given some ideas about how one might use and display obtained results, my primary
purpose is to present and describe an implementation of adaptive MCMC algorithms.
19. The Mata function exp() does not evaluate to missing for very small values as it does for very large
values.
I have not discussed how one should set up the parameters of the draw, such as
the number of draws to take, whether to use a global sampler, or how aggressively to
tune the proposal distribution. I have also not discussed what users should do once
they have obtained draws from an adaptive MCMC algorithm. The functions leave these
decisions in the hands of users. Creating, describing, and analyzing results obtained via
MCMC is fortunately the subject of extensive literature. Broadly speaking, literature
on MCMC is built around the related issues of assessing convergence of a run and of
assessing the mixing and intensity of a run. A further issue is how one should deal
with autocorrelation between draws. Whatever means are used to analyze results, it
is fortunate that Stata provides a ready-made battery of tools to summarize, modify,
and graph results. However, while it is often easy to spot problems in an MCMC run, it
is impossible to know whether the run has actually provided draws from the intended
distribution.
On the subject of convergence, there is not any universally accepted criterion, but
researchers propose many guidelines. Gelman and Rubin (1992) present several useful
ideas. A general discussion appears in Geyer (2011), and some practical advice appears
in Gelman and Shirley (2011), who advocate discarding the first half of a run as a burn-
in period and performing multiple runs in parallel from different starting points and
comparing results. To be sure that one is actually sampling from the right region of the
density, one can use heated distributions in preliminary runs. Effectively, these heated
distributions raise the likelihood function to some fractional power,20 which flattens the
distribution and allows for more rapid and broader exploration of the parameter space.
One can also compare the results of multiple runs and compare the variance within
runs and between runs. A useful technique is to investigate the autocorrelation function
of results and then thin the results, retaining only a fraction of the draws so that most
of the autocorrelation is removed from the retained draws. One can use time-series tools to test for
autocorrelation among draws. A possibility discussed by Gelman and Shirley (2011) is
to jumble the results of the simulation. While it might seem obvious, it is worthwhile
to note that solutions to these problems are interdependent. A draw that exhibits a
lot of autocorrelation may require more thinning and a longer run to obtain a suitable
number of draws. A good place to start with these and other aspects of analyzing results
is Brooks et al. (2011).
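For example, after moving the draws from section 4.1 into Stata with getmata, the built-in time-series tools give a quick check (the variable names follow the earlier example, and the lag length is arbitrary):

    preserve
    clear
    getmata (price weight displacement constant std_dev) = b_mwg
    generate t = _n
    tsset t
    corrgram price, lags(50)     // autocorrelations of the draws for price
    ac price, lags(50)           // the corresponding autocorrelation plot
    restore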
As may have been clear from the examples presented in section 4, another option
is to run the algorithm for some suitable amount of time and then restart the run
without adaptation by using previous results as starting values so that one is drawing
from an invariant proposal distribution. A simple yet useful starting point in judging
convergence is seeing whether the algorithm produces results with graphs that look like
those in figure 2 but not those in figure 4. A graph that does not contain jumps or
flat spots and looks more or less like white noise is a preliminary indication that the
algorithm is working well. However, pseudo-convergence can still be very difficult to
detect. In addition to containing much practical advice, Geyer (2011) also advises that
one should at least do an overnight run, adding only half in jest that one should start
a run when the article is submitted and keep running until the referees' reports arrive.
"This cannot delay the article, and may detect pseudo-convergence" (Geyer 2011, 18).
7 References
Andrieu, C., and J. Thoms. 2008. A tutorial on adaptive MCMC. Statistics and Com-
puting 18: 343–373.
Brooks, S., A. Gelman, G. L. Jones, and X.-L. Meng, eds. 2011. Handbook of Markov
Chain Monte Carlo. Boca Raton, FL: Chapman & Hall/CRC.
Gelman, A., and D. B. Rubin. 1992. Inference from iterative simulation using multiple
sequences. Statistical Science 7: 457–472.
Gelman, A., and K. Shirley. 2011. Inference from simulations and monitoring conver-
gence. In Handbook of Markov Chain Monte Carlo, ed. S. Brooks, A. Gelman, G. L.
Jones, and X.-L. Meng, 163–174. Boca Raton, FL: Chapman & Hall/CRC.
Geyer, C. J. 2011. Introduction to Markov Chain Monte Carlo. In Handbook of Markov
Chain Monte Carlo, ed. S. Brooks, A. Gelman, G. L. Jones, and X.-L. Meng, 3–48.
Boca Raton, FL: Chapman & Hall/CRC.
Hole, A. R. 2007. Fitting mixed logit models by using maximum simulated likelihood.
Stata Journal 7: 388–401.
Powell, J. L. 1984. Least absolute deviations estimation for the censored regression
model. Journal of Econometrics 25: 303–325.
Train, K. E. 2009. Discrete Choice Methods with Simulation. 2nd ed. Cambridge:
Cambridge University Press.
Abstract. This command meets the need of a researcher who holds multiple data
files in comma-separated value format differing by a period variable (for example,
year or quarter) or by a cross-sectional variable (for example, country or firm) and
must combine them into one Stata-format file.
Keywords: dm0076, csvconvert, comma-separated value file, .csv
1 Introduction
In applied research, it is common to come across several data files containing the same
set of variables that need to be combined into one file. For instance, in a cross-country
survey, a researcher may collect information country by country and thus create several
data files, one for each country. Or within the same cross-section (or even within the
same country), the researcher may sample each year independently and generate various
data files that differ by year.
A practical issue in this type of situation is determining how to read all of those
files together in Stata, especially if they are numerous. The standard approach would
be to import each data file sequentially into Stata by using a combination of import
delimited and append. This approach, however, requires a user to type several com-
mand lines proportional to the number of files to be included; thus it is reasonably
doable if the number of data files is limited.
Suppose the directory C:\data\world bank contains three comma-separated value
(.csv) files: wb2007.csv, wb2008.csv, and wb2009.csv.1 After setting the appropri-
ate working directory, a user implements the aforementioned procedure by typing the
following command lines:
1. csvconvert is designed to handle many .csv files; however, for simplicity, all the examples below
consider a limited set of .csv files.
Alternatively, and more compactly, the same result can be obtained with a loop.
. foreach file in wb2007 wb2008 wb2009 {
2. import delimited using `file'.csv, clear
3. save `file'
4. }
. foreach file in wb2007.dta wb2008.dta {
2. append using `file'
3. }
Another way is to work with the disk operating system (DOS) to gather all the .csv files
into one .csv file and then to read the assembled single .csv file into memory using
import delimited.
Under the DOS framework, the lines below assemble wb2007.csv, wb2008.csv, and
wb2009.csv into a newly created .csv file named input.csv.
cd "C:\data\world bank"
copy wb2007.csv wb2008.csv wb2009.csv input.csv
To assemble all .csv files stored in the directory C:\data\world bank into a new file
named input.csv, type
cd "C:\data\world bank"
copy *.csv input.csv
A similar approach that bypasses the DOS framework can be implemented. However,
if the number of .csv files is large, the process may not be as straightforward. For
simplicity, let us still consider just three .csv files. Once the appropriate working
directory is set, the command lines to type are as follows:
. copy wb2008.csv wb2007.csv, append
. copy wb2009.csv wb2007.csv, append
. import delimited using wb2007.csv
The first two command lines append wb2008.csv and wb2009.csv to wb2007.csv.
The third command reads the .csv file into Stata.
Note, however, that if the first line of both wb2008.csv and wb2009.csv contains
the variable names, these are also appended.2 Thus, because of the presence of extra
lines with names, all the variables are read as strings. To correct this inaccuracy, one
should first remove the lines with the variable names and then use destring to set the
numerical format.
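As a sketch of that cleanup after reading the assembled file (the variable names countryname and populationtotal are illustrative assumptions about the contents of the World Bank files):

    drop if countryname == "countryname"    // remove the appended lines that contain variable names
    destring populationtotal, replace       // set the numerical format back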
2. Unfortunately, the option varnames(nonames), applicable with import delimited, is unavailable
with copy.
Alternatively, we could prevent this fault by manually preparing the .csv files (that
is, by removing the lines with the variable names in the .csv files to be appended). The
whole process can be time consuming, especially if the number of .csv files is large. The
csvconvert command simplifies and automates the procedure of gathering multiple
.csv files into one .dta file, as illustrated in the next section.
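Pieced together from the options documented below, the syntax has this general form (a sketch; consult the help file for the authoritative statement):

    csvconvert input_directory, replace [input_file(filenames) output_dir(output_directory) output_file(filename)]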
where input_directory is the path of the directory in which the .csv files are stored. Do
not use any quotes at the endpoints of the directory path, even if the directory name
contains spaces (see example 1 below).
2.2 Options
replace specifies that the existing output file (if it already exists) be overwritten.
replace is required.
input_file(filenames) specifies a subset of the .csv files to be converted. The filenames
must be separated by a space and include the .csv extension (see example 2 below).
If this option is not specified, csvconvert considers all the .csv files stored in the
input directory.
output_dir(output_directory) specifies the directory in which the .dta output file is
saved. If this option is not specified, the file is saved in the same directory where
the .csv files are stored.
output_file(filename) specifies the name of the .dta output file. The default is
output_file(output.dta).
3 Examples
3.1 Example 1: Basic
The simplest way to run csvconvert is to type the command and the directory path
where the .csv files are stored followed by the mandatory option replace. In the same
directory, Stata will create output.dta, which collects all the .csv files of that directory
in Stata format.
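For instance, with the directory used throughout this article, the basic call would look like this (a sketch based on the syntax above):

    csvconvert C:\data\world bank, replace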
During conversion, csvconvert sequentially reports the name of the .csv file being
converted, the number of variables, and the number of observations. If something in the
process appears odd, extra messages are displayed to alert the researcher and demand
further inspection. For instance, suppose that one .csv file contains a symbol or a
letter in one cell of a numerical variable; if ignored, this inaccuracy may undermine the
whole process. For this reason, csvconvert adds a note to help the researcher detect
the fault. In example 6, wb2008_symbol.csv contains N/A in one cell of the variable
populationtotal.
. note
_dta:
1. File included on 18 Jan 2014 10:11 : "wb2007.csv"
2. File included on 18 Jan 2014 10:11 : "wb2008.csv"
3. File included on 18 Jan 2014 10:11 : "wb2009.csv"
By reading the log, you can see that in the conversion of wb2008_symbol.csv, the
variable populationtotal changed its format from numerical to string. Therefore,
wb2008_symbol.csv is the file that needs to be inspected. Once the anomalous obser-
vation is detected and manually corrected (for example, by emptying the anomalous
cell via Excel and saving the corrected file as wb2008_symbol2.csv), you can relaunch
csvconvert and check that it now runs smoothly.
The warning message shows that there are three duplicate observations. Of course,
you can look carefully at the Results window and find that wb2008.csv was entered
twice. However, if you are handling a large set of .csv files, checking each line of the
screen would be very time consuming.
Tabulating the variable _csvfile conditional on _duplicates being equal to one
quickly detects that the duplicate observations come from wb2008.csv.
. tabulate _csvfile if _duplicates==1

    csv file from which
  observation originates        Freq.     Percent        Cum.

                   Total            6      100.00
5 Acknowledgments
I am grateful to Editor Joseph Newton for his assistance during revision and to Violeta
Carrion, Emanuele Forlani, Edna Solomon, and one anonymous referee for very helpful
comments.
Rusty Tchernis
Georgia State University
Atlanta, GA
Institute for the Study of Labor
Bonn, Germany
National Bureau of Economic Research
Cambridge, MA
rtchernis@gsu.edu
1 Introduction
The causal effect of binary treatment on outcomes is a central component of empirical
research in economics and many other disciplines. When individuals self-select into
treatment and when prospective randomization of the treatment and control groups is
not feasible, researchers must adopt alternative empirical methods intended to control
for the inherent self-selection. If individuals self-select on the basis of observed variables
(selection on observed variables), a variety of appropriate methodologies are available
to estimate the causal effects of the treatment. If instead individuals self-select on the
basis of unobserved variables (selection on unobserved variables), estimating treatment
effects is more difficult.
When one is confronted with selection on unobserved variables, the most common
empirical approach is to rely on an instrumental variable (IV); however, if credible
instruments are unavailable, a few approaches now exist that attempt to estimate the effects of
the treatment without an exclusion restriction. This article introduces a new Stata com-
mand, bmte, that implements two recent estimators proposed in Millimet and Tchernis
(2013) and designed to estimate treatment effects when selection on unobserved vari-
ables exists and appropriate exclusion restrictions are unavailable:
i. The minimum-biased (MB) estimator: This estimator searches for the observations
with minimized bias in the treatment-effects estimate of interest. This is accomplished
by trimming the estimation sample to include only observations with a
propensity score within a certain interval as specified by the user. When the
conditional independence assumption (CIA) holds (that is, independence between
treatment assignment and potential outcomes, conditional on observed variables),
the MB estimator is unbiased. Otherwise, the MB estimator tends to minimize
the bias among estimators that rely on the CIA. Furthermore, the MB estima-
tor changes the parameter being estimated because of the restricted estimation
sample.
ii. The bias-corrected (BC) estimator: This estimator relies on the two-step estimator
of Heckman's bivariate normal (BVN) selection model to estimate the bias among
estimators that inappropriately apply the CIA (Heckman 1976, 1979). However,
unlike the BVN estimator, the BC estimator does not require specification of the
functional form for the outcome of interest in the final step. Moreover, unlike the
MB estimator, the BC estimator does not change the parameter being estimated.
By implementing these estimators alongside preexisting estimators, the bmte command provides a picture of the
average causal effects of the treatment across a variety of assumptions and when valid
exclusion restrictions are unavailable.
These parameters may also vary with a vector of covariates, X, in which case the
parameters have an analogous representation conditional on a particular value of X.1
For nonrandom treatment assignment, selection into treatment may follow one of two
general paths: 1) selection on observed variables, also referred to as unconfoundedness
or the CIA (Rubin 1974; Heckman and Robb 1985); and 2) selection on unobserved
variables. Under the CIA, selection into treatment is random conditional on covariates,
X, and the average effect of the treatment can be obtained by comparing outcomes
of individuals in the two treatment states with identical values of the covariates. This
approach often uses propensity-score methods to reduce the dimensionality problem
arising when X is a high-dimensional vector (Rosenbaum and Rubin 1983), with the
propensity score denoted by P (Xi ) = Pr(Ti = 1|Xi ).
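For illustration only (the variable names treat, x1, and x2 are hypothetical and not taken from the article), a propensity score of this form can be obtained in Stata by fitting a probit model and predicting the probability of treatment:

. probit treat x1 x2
. predict double pscore, pr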
If the CIA fails to hold, then the estimated treatment effects relying on the CIA are
biased. Following Heckman and Navarro-Lozano (2004) and Black and Smith (2004),
we denote the potential outcomes as Y(0) = g₀(X) + ε₀ and Y(1) = g₁(X) + ε₁, where
g₀(X) and g₁(X) are the deterministic portions of the outcome variable in the control
and treatment groups, respectively, and where (ε₀, ε₁) are the corresponding error terms.
We also denote the latent treatment variable by T* = h(X) − u, where h(X) represents
the deterministic portion of T*, and u denotes the error term. The observed treatment,
T, is therefore equal to 1 if T* > 0 and 0 otherwise. Finally, we denote by ε the
difference in the residuals of the potential outcomes, ε = ε₀ − ε₁.
1. More formally, the coefficient measures the treatment effect, adjusting for a simultaneous linear
change in the covariates, X, rather than being conditional on a specific value of X. We thank an
anonymous referee for highlighting this point.
Assuming ε and u are jointly normally distributed, the bias can be derived as
\[
B_{ATE}\{P(X)\} = \Bigl[\rho_{0u}\sigma_{0} + \{1 - P(X)\}\rho_{\epsilon u}\sigma_{\epsilon}\Bigr]\,
\frac{\phi\{h(X)\}}{\Phi\{h(X)\}\bigl[1 - \Phi\{h(X)\}\bigr]} \tag{1}
\]
where P̂(X_i) is an estimate of the propensity score obtained using a probit model.
Under the CIA, the IPW estimator in (2) provides an unbiased estimate of the ATE.
When this assumption fails, the bias for the ATE follows the closed functional form in
(1), with similar expressions for the ATT and ATU. The MB estimator aims to minimize
the bias by estimating (2) using only observations with a propensity score close to
the bias-minimizing propensity score, denoted by P*. Using P* effectively limits the
observations included in the estimation of the IPW treatment effects to minimize the
inherent bias when the CIA fails. We denote by Ω the set of observations ultimately
included in the estimation. In general, however, P* and Ω are unknown. Therefore, the
MB estimator estimates P* and Ω to minimize the bias in (1) by using Heckman's BVN
selection model, the details of which are provided in Millimet and Tchernis (2013).
The MB estimator of the ATE is formally given by
\[
\hat{\tau}_{MB,ATE}(P^*) =
\frac{\sum_{i\in\Omega} \dfrac{Y_i T_i}{\hat{P}(X_i)}}{\sum_{i\in\Omega} \dfrac{T_i}{\hat{P}(X_i)}}
\;-\;
\frac{\sum_{i\in\Omega} \dfrac{Y_i (1-T_i)}{1-\hat{P}(X_i)}}{\sum_{i\in\Omega} \dfrac{1-T_i}{1-\hat{P}(X_i)}} \tag{3}
\]
where Ω = {i | P̂(X_i) ∈ C(P*)}, and C(P*) denotes a neighborhood around P*. Following
Millimet and Tchernis (2013), the MB estimator defines C(P*) as C(P*) =
{P(X_i) | P̂(X_i) ∈ (P̲, P̄)}, where P̲ = max(0.02, P* − α), P̄ = min(0.98, P* + α),
and α > 0 is the smallest value such that at least θ percent of both the treatment and
control groups are contained in Ω. Specific values of θ are specified within the bmte
command, with smaller values reducing the bias at the expense of higher variance. The
MB estimator trims observations with propensity scores above and below specific values,
regardless of the value of θ. These threshold values can be specified within the bmte
command options. Obtaining Ω does not require the use of Heckman's BVN selection
model when the focus is on the ATT or ATU, because P* is known to be one-half in these
cases (Black and Smith 2004).
If the user is sensitive to potential deviations from the normality assumptions underlying
Heckman's BVN model, the MB estimator and other estimators can be extended
appropriately (Millimet and Tchernis 2013). Such adjustments are included as part
of the bmte command, denoted by the Edgeworth-expansion versions of the relevant
estimators.
Subtracting the estimated bias from the MB estimate yields the MB-BC estimator,
\[
\hat{\tau}_{MB\text{-}BC,ATE}(P^*) = \hat{\tau}_{MB,ATE}(P^*) - \hat{B}_{ATE}(P^*) \tag{4}
\]
where the corresponding estimators for the ATT and ATU follow. With heterogeneous
treatment effects, the MB-BC estimator changes the parameter being estimated. To
identify the correct parameter of interest, the bmte command first estimates the MB-BC
estimator in (4) conditional on the propensity score, P(X), and then estimates the
(unconditional) ATE by taking the expectation of this over the distribution of X in the
population (or subpopulation of the treated). The resulting BC estimator is given by
\[
\hat{\tau}_{BC,ATE} = \hat{\tau}_{IPW,ATE} - \frac{1}{N}\sum_{i} \hat{B}_{ATE}\{\hat{P}(X_i)\} \tag{5}
\]
where again the corresponding estimators for the ATT and ATU follow.
where φ(·)/Φ(·) is the inverse Mills ratio, and the remaining error term is independent
and identically distributed with constant variance and zero conditional mean. With this
approach, the estimated ATE is given by
\[
\hat{\tau}_{BVN,ATE} = \overline{X}\,(\hat{\beta}_1 - \hat{\beta}_0) \tag{7}
\]
2.4 CF approach
Heckman's BVN selection model is a special case of the CF approach. The idea is to devise
a function that, once included in the outcome equation, removes the correlation between
treatment assignment and the error term, as outlined nicely in Heckman, LaLonde,
and Smith (1999) and Navarro (2008). Specifically, consider the outcome equation
where S is the order of the polynomial. The following equation is then estimable via
OLS:
2. Depending on one's dataset and specific application, it may not be meaningful to evaluate all
covariates at their means. Therefore, when interpreting the treatment-effects estimates, the user
should check that the data support the use of X̄. We are grateful to an anonymous referee for
clarifying this important point.
As is clear from (8), the intercepts of the potential-outcome equations and the constant
terms of the control functions are not separately identified; however, because the
selection problem disappears in the tails of the propensity score, it follows that the CF
becomes zero and that the intercepts from the potential-outcome equations are identified
using observations in the extreme ends of the support of P(X). After one estimates the
intercept terms, the ATE and ATT are given by
\[
\hat{\tau}_{CF,ATE} = (\hat{\alpha}_1 - \hat{\alpha}_0) + \overline{X}(\hat{\beta}_1 - \hat{\beta}_0) \tag{9}
\]
\[
\hat{\tau}_{CF,ATT} = (\hat{\alpha}_1 - \hat{\alpha}_0) + \overline{X}_1(\hat{\beta}_1 - \hat{\beta}_0)
+ \widehat{E}(\hat{\epsilon}_1 - \hat{\epsilon}_0 \mid T_i = 1) \tag{10}
\]
where X̄ and X̄₁ are the sample means of X overall and among the treated, the conditional
means Ê(ε̂₀ | T_i = 1) and Ê(ε̂₁ | T_i = 1) are constructed from the estimated coefficients of
the order-S control-function polynomials evaluated at the mean propensity scores, P̄(X) is the
overall mean propensity score, and P̄(X)_t, t = 0, 1, is the mean propensity score in group t.
Assuming S(X) = exp(Xδ), the parameters of (11) are estimable by maximum likelihood,
with the log-likelihood function given by³
\[
\ln L = \sum_i \left[\, T_i \ln \Phi\!\left\{\frac{X_i\beta}{\exp(X_i\delta)}\right\}
+ (1 - T_i) \ln\!\left[1 - \Phi\!\left\{\frac{X_i\beta}{\exp(X_i\delta)}\right\}\right] \right] \tag{12}
\]
3. Our functional form assumption, S(X) = exp(Xδ), is a simplification made to compare the KV
estimator and the other estimators available with the bmte command. For more details on the KV
estimator and alternative functional forms for S(X), see Klein and Vella (2009).
where the element of δ corresponding to the intercept is normalized to zero for identification.
The maximum likelihood estimates are then used to obtain the predicted
probability of treatment, P̂(X), which may be used as an instrument for T in (6),
excluding the selection correction terms.
3.2 Specification
The bmte command requires the user to specify an outcome variable, depvar, at least
one independent variable, and a treatment assignment variable, group(). Additional
independent variables are optional. The command also uses the Stata commands hetprob
and ivreg2 (Baum, Schaffer, and Stillman 2003, 2004, 2005). The remaining options
of the bmte command are detailed below.
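A hedged sketch of the basic call (the variable names y, x1, x2, and treat are placeholders, not taken from the article):

. bmte y x1 x2, group(treat)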
3.3 Options
group(varname) specifies the treatment assignment variable. group() is required.
ee indicates that the Edgeworth-expansion versions of the MB, BVN, and BC estimators
be included in addition to the original versions of each respective estimator. The
Edgeworth expansion is robust to deviations from normality in Heckman's BVN
selection model.
hetero allows for heterogeneous treatment effects, with ATE, ATT, and ATU estimates
presented at the mean level of each independent variable.
theta(#) denotes the minimum percentage such that both the treatment and control
groups have propensity scores in the interval (P̲, P̄) from (3). Multiple values of
theta() are allowed (for example, theta(5 25), for 5% and 25%). Each value will
form a different estimated treatment effect using the MB and MB-BC estimators.
psvars(indepvars) denotes the list of regressors used in the estimation of the propensity
score. If unspecified, the list of regressors is assumed to be the same as the original
covariate list.
kv(indepvars) denotes the list of independent variables used to model the variance in
the hetprob command. Like the psvars() option, the list of kv() regressors is
assumed to be the same as the original covariate list if not explicitly specified.
cf(#) specifies the order of the polynomial used in the CF estimator. The default is
cf(3).
pmin(#) and pmax(#) specify the minimum and maximum propensity scores, respectively,
included in the MB estimator. Observations with propensity scores outside
this range will be automatically excluded from the MB estimates. The defaults are
pmin(0.02) and pmax(0.98).
psate(#)–psatuee(#) specify the fixed propensity-score values (specific to each treatment
effect of interest) to be used as the bias-minimizing propensity scores in lieu
of estimating the values within the program itself.
saving(filename) indicates where to save the output.
replace indicates that the output in saving() should replace any preexisting file in
the same location.
bs and reps(#) specify that 95% confidence intervals be calculated by bootstrap using
the percentile method and the number of replications in reps(#). The default is
reps(100).
fixp is an option for the bootstrap command that, when specified, estimates the bias-minimizing
propensity score, P*, and applies this estimate across all bootstrap
replications rather than reestimating it at each replication.
4 Example
Following Millimet and Tchernis (2013), we provide an application of the bmte com-
mand to the study of the U.S. school breakfast program (SBP). Specifically, we seek
causal estimates of the ATEs of SBP on child health. The data are from the Early
Childhood Longitudinal Study, Kindergarten Class of 1998–1999, and are available for
download from the Journal of Applied Econometrics Data Archive.⁴ We provide estimates
of the effect of SBP on growth rate in body mass index from first grade to the
spring of third grade.
4. http://qed.econ.queensu.ca/jae/datasets/millimet001/.
We first define global variable lists XVARS and HVARS and limit our analysis to third-grade
students only. XVARS are the covariates used in the OLS estimation as well as
in the calculation of the propensity score. HVARS are the covariates used in the KV
estimator (that is, the variables that enter into the heteroskedasticity portion of the
hetprob command).
We then estimate the effect of SBP participation in the first grade (break1) on body
mass index growth (clbmi) by using the bmte command. In our application, we specify θ
of 5% and 25%, and we estimate bootstrap confidence intervals using 250 replications.
We also specify the ee option, asking that the results include the Edgeworth-expansion
versions of the relevant estimators. The command and the resulting Stata output are as follows:
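A minimal sketch of the call, consistent with the description above (the contents of the global covariate lists are omitted here, and the use of kv() for HVARS is an assumption):

. global XVARS "..."
. global HVARS "..."
. bmte clbmi $XVARS, group(break1) kv($HVARS) theta(5 25) ee bs reps(250)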
(output omitted; among other results, the table reports MB, MB-EE, MB-BC, and MB-BC-EE estimates for each treatment effect of interest)
Here we focus on the general structure and theme of the output. For a thorough
discussion and interpretation of the results, see Millimet and Tchernis (2013). As indi-
cated by the section headings, the output presents results for the ATE, ATT, and ATU
using basic OLS and IPW treatment-effects estimates as well as each of the MB (3), MB-BC (4),
BC (5), BVN (7), CF [(9) and (10)], and KV [(11), (12), and (6)] estimators.
Below each estimate is the respective 95% confidence interval.
As discussed in Millimet and Tchernis (2013), separate MB and MB-BC estimates
are presented for each value of θ specified in the bmte command (in this case, 5% and
25%). The results for the CF estimator also include a joint test of significance of all
covariates in the OLS step of the CF estimator (8). Similarly, the KV results include
a test for weak instruments (the Cragg–Donald Wald F statistic and p-value) as well
as a likelihood-ratio test for heteroskedasticity based on the results of hetprob. Also
included in the bmte output is the estimated bias-minimizing propensity score.
5 Remarks
Despite advances in the program evaluation literature, treatment-effects estimators remain
severely limited when the CIA fails and when valid exclusion restrictions are unavailable.
Following the methodology presented in Millimet and Tchernis (2013), we
propose and describe a new Stata command (bmte) that provides a range of treatment-effects
estimates intended to estimate the average effects of the treatment when the CIA
fails and appropriate exclusion restrictions are unavailable.
Importantly, the bmte command provides results that are useful across a range of
alternative assumptions. For example, if the CIA holds, the IPW estimator provided
by the bmte command yields an unbiased estimate of the causal eects of treatment.
The MB estimator then oers a robustness check, given its comparable performance
when the model is correctly specied or overspecied and its improved performance if
the model is underspecied. If, however, the CIA does not hold, the bmte command
provides results that are appropriate under strong functional form assumptions, either
with homoskedastic (BVN or CF) or heteroskedastic (KV) errors, or under less restrictive
functional form assumptions (BC). As illustrated in our example application to the U.S.
SBP, the breadth of estimators implemented with the bmte command provides a broad
picture of the average causal effects of the treatment across a variety of assumptions.
6 References
Baum, C. F., M. E. Schaffer, and S. Stillman. 2003. Instrumental variables and GMM:
Estimation and testing. Stata Journal 3: 1–31.
Black, D. A., and J. Smith. 2004. How robust is the evidence on the effects of college
quality? Evidence from matching. Journal of Econometrics 121: 99–124.
Heckman, J., and R. Robb, Jr. 1985. Alternative methods for evaluating the impact of
interventions: An overview. Journal of Econometrics 30: 239–267.
Heckman, J. J., R. J. LaLonde, and J. A. Smith. 1999. The economics and econometrics
of active labor market programs. In Handbook of Labor Economics, ed. O. Ashenfelter
and D. Card, vol. 3A, 1865–2097. Amsterdam: Elsevier.
Hirano, K., and G. W. Imbens. 2001. Estimation of causal effects using propensity score
weighting: An application to data on right heart catheterization. Health Services and
Outcomes Research Methodology 2: 259–278.
Klein, R., and F. Vella. 2009. A semiparametric model for binary response and continuous
outcomes under index heteroscedasticity. Journal of Applied Econometrics 24:
735–762.
Navarro, S. 2008. Control function. In The New Palgrave Dictionary of Economics, ed.
S. N. Durlauf and L. E. Blume, 2nd ed. London: Palgrave Macmillan.
Rosenbaum, P. R., and D. B. Rubin. 1983. The central role of the propensity score in
observational studies for causal effects. Biometrika 70: 41–55.
the Department of Health Care Policy at Harvard Medical School. He received his PhD in
economics from Brown University.
Daniel Millimet is a professor of economics at Southern Methodist University and a research
fellow at the Institute for the Study of Labor. His primary areas of research are applied
microeconometrics, labor economics, and environmental economics. His research has been
funded by various organizations, including the United States Department of Agriculture. He
received his PhD in economics from Brown University.
The Stata Journal (2014)
14, Number 3, pp. 684–692
1 Introduction
In recent years, it has become increasingly popular to use panel time-series datasets for
econometric analysis. These panel datasets are reasonably large in both cross-sectional
(N ) and time (T ) dimensions, as compared with the more conventional panels with very
large N yet small T. Theoretical research into the asymptotics of panel time series has
revealed two crucial differences from the typical panel: the need for slope coefficients
to be heterogeneous (for example, see Phillips and Moon [2000] and Im, Pesaran, and
Shin [2003]) and the concern of nonstationarity. Both differences suggest that the usual
fixed-effects or random-effects estimators are not appropriate for this application.
The long time dimension in panel time series allows one to use regular time-series
analytical tools, such as unit root and cointegration testing, to determine the order of
integration and the long-run relationship between variables. Researchers have proposed
a variety of tests and estimators that (in varying ways) extend time-series tools for panels
while importantly allowing for heterogeneity in the cross-sectional units (as opposed to
simply pooling the data). Users have already implemented several of these tests and
estimators into Stata (for example, see Blackburne and Frank [2007] and Eberhardt
[2012]).
This article and the associated program, xtpedroni, introduce two tools that were
developed in Pedroni (1999, 2001, 2004) for use in Stata. The first tool is seven test
statistics for the null of no cointegration in nonstationary heterogeneous panels with
one or more regressors. The second tool is a between-dimension (that is, group-mean)
panel-dynamic ordinary least-squares (PDOLS) estimator. Both tools can include time
dummies (by time demeaning the data) to capture common time effects among members
of the panel. Nevertheless, they cannot account for more sophisticated forms of cross-sectional
dependence.
In this article, I will discuss the theoretical foundations of both tools. I will also
introduce the usage and capabilities of xtpedroni, and apply the program to replicate
the results in Pedroni (2001).
When time dummies are included, the data are time demeaned by subtracting the
cross-sectional average of each variable in each period,
\[
\bar{y}_t = \frac{1}{N}\sum_{i=1}^{N} y_{i,t}
\]
All the test statistics are residual-based tests, with residuals collected from the
following regressions:
several available options). A linear time trend δ_i t can be inserted into the regression at
the user's discretion.
Next, several series and parameters are calculated from the regressions above.
\[
\hat{s}_i^2 = \frac{1}{T}\sum_{t=1}^{T}\hat{\mu}_{i,t}^2, \qquad
\hat{\lambda}_i = \frac{1}{T}\sum_{s=1}^{k_i}\left(1-\frac{s}{k_i+1}\right)\sum_{t=s+1}^{T}\hat{\mu}_{i,t}\hat{\mu}_{i,t-s}, \qquad
\hat{\sigma}_i^2 = \hat{s}_i^2 + 2\hat{\lambda}_i
\]
\[
\hat{L}_{11i}^2 = \frac{1}{T}\sum_{t=1}^{T}\hat{\eta}_{i,t}^2
+ \frac{2}{T}\sum_{s=1}^{k_i}\left(1-\frac{s}{k_i+1}\right)\sum_{t=s+1}^{T}\hat{\eta}_{i,t}\hat{\eta}_{i,t-s}
\]
\[
\hat{s}_i^{*2} = \frac{1}{T}\sum_{t=1}^{T}\hat{\mu}_{i,t}^{*2}, \qquad
\tilde{s}_{N,T}^{*2} = \frac{1}{N}\sum_{i=1}^{N}\hat{s}_i^{*2}, \qquad
\tilde{\sigma}_{N,T}^2 = \frac{1}{N}\sum_{i=1}^{N}\hat{L}_{11i}^{-2}\,\hat{\sigma}_i^2
\]
where the μ̂ and η̂ series are residuals from the regressions above, and k_i is the kernel
truncation lag for cross-section i.
The seven statistics can then be constructed from the following equations. (See Pedroni
[1999] for a complete discussion on how these statistics are constructed.)
\[
\text{panel } v:\quad T^2 N^{3/2}\left(\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{L}_{11i}^{-2}\,\hat{e}_{i,t-1}^2\right)^{-1}
\]
\[
\text{panel } \rho:\quad T\sqrt{N}\left(\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{L}_{11i}^{-2}\,\hat{e}_{i,t-1}^2\right)^{-1}
\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{L}_{11i}^{-2}\left(\hat{e}_{i,t-1}\Delta\hat{e}_{i,t}-\hat{\lambda}_i\right)
\]
\[
\text{panel } t:\quad \left(\tilde{\sigma}_{N,T}^2\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{L}_{11i}^{-2}\,\hat{e}_{i,t-1}^2\right)^{-1/2}
\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{L}_{11i}^{-2}\left(\hat{e}_{i,t-1}\Delta\hat{e}_{i,t}-\hat{\lambda}_i\right)
\]
\[
\text{panel ADF}:\quad \left(\tilde{s}_{N,T}^{*2}\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{L}_{11i}^{-2}\,\hat{e}_{i,t-1}^{*2}\right)^{-1/2}
\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{L}_{11i}^{-2}\,\hat{e}_{i,t-1}^{*}\Delta\hat{e}_{i,t}^{*}
\]
\[
\text{group } \rho:\quad T N^{-1/2}\sum_{i=1}^{N}\left(\sum_{t=1}^{T}\hat{e}_{i,t-1}^2\right)^{-1}
\sum_{t=1}^{T}\left(\hat{e}_{i,t-1}\Delta\hat{e}_{i,t}-\hat{\lambda}_i\right)
\]
\[
\text{group } t:\quad N^{-1/2}\sum_{i=1}^{N}\left(\hat{\sigma}_i^2\sum_{t=1}^{T}\hat{e}_{i,t-1}^2\right)^{-1/2}
\sum_{t=1}^{T}\left(\hat{e}_{i,t-1}\Delta\hat{e}_{i,t}-\hat{\lambda}_i\right)
\]
\[
\text{group ADF}:\quad N^{-1/2}\sum_{i=1}^{N}\left(\sum_{t=1}^{T}\hat{s}_i^{*2}\,\hat{e}_{i,t-1}^{*2}\right)^{-1/2}
\sum_{t=1}^{T}\hat{e}_{i,t-1}^{*}\Delta\hat{e}_{i,t}^{*}
\]
The test statistics are then adjusted so that they are distributed as N (0, 1) under the
null. The adjustments performed on the statistics vary depending on the number of
regressors, whether time trends were included, and the type of test statistic.
Under the alternative of cointegration, the panel v statistic diverges to positive
infinity while the other test statistics diverge to negative infinity. Baltagi (2013, 296) provides
a formal interpretation of a rejection of the null: "Rejection of the null hypothesis means
that enough of the individual cross-sections have statistics far away from the means
predicted by theory were they to be generated under the null."
The relative power of each test statistic is not entirely clear, and there may be con-
tradictory results between the statistics. Pedroni (2004) reported that the group and
panel ADF statistics have the best power properties when T < 100, with the panel v
and group ρ statistics performing comparatively worse. Furthermore, the ADF statis-
tics perform better if the errors follow an autoregressive process (see Harris and Sollis
[2003]).
3 Pedroni's PDOLS
Consider the following model:
\[
y_{i,t} = \alpha_i + \beta_i x_{i,t} + \epsilon_{i,t}
\]
The DOLS regression augments this equation with leads and lags of Δx_{i,t}. The t statistic
for the null hypothesis β_i = β_0 in each cross-section, and its group-mean (between-dimension)
average, are
\[
t_{\hat{\beta}_i} = (\hat{\beta}_i - \beta_0)\left\{\hat{\sigma}_i^{-2}\sum_{t=1}^{T}(x_{i,t}-\bar{x}_i)^2\right\}^{1/2},
\qquad
t_{\hat{\beta}_{GM}} = \frac{1}{\sqrt{N}}\sum_{i=1}^{N} t_{\hat{\beta}_i}
\]
Here z_{i,t} is the 2(p + 1) × 1 vector of regressors (this includes the lags and leads of the
differenced explanatory variable), and σ_i² is the long-run variance of the residuals ε_{i,t}.
σ_i² is computed in the program through the Newey and West (1987) heteroskedasticity-
and autocorrelation-consistent method with a Bartlett kernel. By default, the maximum
lag for the Bartlett kernel is selected automatically for each cross-section in the
panel according to 4(T/100)^{2/9} (see Newey and West [1994]), but it can also be set
manually by the user.
In comparison, Kao and Chiang (1997) and Mark and Sul (2003) compute the panel
statistics along the within-dimension, with the t statistics designed to test H₀: β_i = β₀
against H_A: β_i = β_A ≠ β₀. Pedroni's PDOLS estimator is averaged along the
between-dimension (that is, the group mean). Accordingly, the panel test statistics test
H₀: β_i = β₀ against H_A: β_i ≠ β₀. In the alternative hypothesis, the slope coefficients are
not constrained to equal a common value β_A. Pedroni (2001) argues that this is an important
advantage for between-dimension panel time-series estimators, particularly when one
expects slope heterogeneity.
4.2 Options
Options that affect the cointegration test and the PDOLS estimation
notdum suppresses time demeaning of the variables (that is, the common time dummies).
Time demeaning is turned on by default. This option may be appropriate to use
when averaging over the N dimension may destroy the cointegrating relationship or
when there are comparability concerns between panel units in the data.
nopdols suppresses PDOLS estimation (that is, reports only the cointegration test re-
sults).
notest suppresses the cointegration tests (that is, reports only PDOLS estimation).
extraobs includes the available observations from the missing years in the time means
used for time demeaning if there is an unbalanced panel with observations missing for
some of the variables (at the start or end of the sample) for certain individuals. This
was the behavior of Pedroni's original PDOLS program but not of the cointegration
test program. It is off by default.
b(#) defines the null hypothesis beta as #. The default is b(0).
mlags(#) specifies the number of lags to be used in the Bartlett kernel for the Newey–West
long-run variance. If mlags() is not specified, then the number of lags is
determined automatically for each individual following Newey and West (1994).
adflags(#) specifies the maximum number of lags to be considered in the lag selection
process for the ADF regressions. If adflags() is not specified, then it is determined
automatically.
lags(#) specifies the number of lags and leads to be included in the DOLS regression.
The default is lags(2).
full reports the DOLS regression for each individual in the panel.
average(string) determines the methodology used to combine individual coefficient estimates
into the panel estimate. string can be simple (default), sqrt, or precision.
simple takes a simple average and is the behavior of the original Pedroni program.
sqrt weighs each estimate according to the square root of the precision matrix,
which is the same procedure used for averaging the t statistics. precision weighs
each individual's coefficient estimates by its precision.
. use pedronidata
. xtset country time
panel variable: country (strongly balanced)
time variable: time, 1973m6 to 1993m11
delta: 1 month
. xtpedroni logexrate logratio, notest lags(5) mlags(5) b(1) notdum
Pedroni's PDOLS (Group mean average):
No. of Panel units: 20 Lags and leads: 5
Number of obs: 4700 Avg obs. per unit: 235
Data has not been time-demeaned.
We computed the results without time dummies (by specifying the notdum option),
and then with time dummies. We specified the option notest to suppress the results
of the cointegration test, which are not yet relevant. The option b(1) instructed the
program to compute all t statistics against the null hypothesis that the slope coefficient
is equal to 1, which is appropriate for economic interpretation when testing the weak
long-run PPP hypothesis. In accordance with Pedroni's original use of the group-mean
PDOLS estimator to calculate these results, we set the number of lags and leads in the
DOLS regression to 5 by specifying lags(5), and we set the number of lags used in the
Bartlett kernel for the Newey–West long-run variance of the residuals to 5 by specifying
mlags(5).
We can now replicate the individual DOLS results for each country in the panel as
follows:
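The exact command line is not reproduced in this extract; a sketch consistent with the options described in the next paragraph (the inclusion of notest is an assumption) is:

. xtpedroni logexrate logratio, notest full lags(4) mlags(4) notdum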
The output was compressed into a formatted table for brevity. We specified several
options to obtain the exact results. The option full displays the results of estimation
for each individual panel unit. Emulating Pedroni's original use of the program for this
empirical application, we set the number of lags and leads in the DOLS regression to 4 by
specifying lags(4) and the number of lags used in the Bartlett kernel for the Newey–West
long-run variance of the residuals to 4 by specifying mlags(4). No common time
dummies were used for the individual country results (notdum option).
Pedroni (2004) applied the seven panel cointegration test statistics to the PPP hy-
pothesis. We repeat this procedure as follows:
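The call, reconstructed from the description below (variable names as in the earlier example), would be along the lines of:

. xtpedroni logexrate logratio, nopdols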
                       Panel        Group
        v              4.735            .
        rho           -2.027       -2.814
        t             -1.434       -2.185
        adf           -.9087       -1.737
The results will be inconsistent with those found in Pedroni (2004), because those results
relied on a larger sample period than did the Pedroni (2001) dataset we are currently
using. The only option we specified here is nopdols, which suppresses the PDOLS
estimation results.
Overall, the results indicate a cointegrating relationship between the log of the ex-
change rate and the log of the aggregate Consumer Price Index ratio. Statistical in-
ference is straightforward because all the test statistics are distributed N (0,1). All the
tests, except the panel t and ADF statistics, are significant at least at the 10% level.
Furthermore, the PDOLS results support the weak long-run PPP hypothesis. Most of
the coefficients are close to 1, but many are notably higher or lower. For a complete
discussion of the results, see Pedroni (2001).
6 Acknowledgments
This program is indebted to the work of many individuals, including Peter Pedroni,
Tom Doan, Tony Bryant, Roselyne Joyeux, and an anonymous reviewer.
7 References
Baltagi, B. H. 2013. Econometric Analysis of Panel Data. 5th ed. New York: Wiley.
Blackburne, E. F., III, and M. W. Frank. 2007. Estimation of nonstationary heterogeneous
panels. Stata Journal 7: 197–208.
Eberhardt, M. 2012. Estimating panel time-series models with heterogeneous slopes.
Stata Journal 12: 61–71.
Harris, R., and R. Sollis. 2003. Applied Time Series Modelling and Forecasting. New
York: Wiley.
Im, K. S., M. H. Pesaran, and Y. Shin. 2003. Testing for unit roots in heterogeneous
panels. Journal of Econometrics 115: 53–74.
Kao, C., and M.-H. Chiang. 1997. On the estimation and inference of a cointegrated
regression in panel data. Syracuse University Manuscript.
Mark, N. C., and D. Sul. 2003. Cointegration vector estimation by panel DOLS and
long-run money demand. Oxford Bulletin of Economics and Statistics 65: 665–680.
Pedroni, P. 1999. Critical values for cointegration tests in heterogeneous panels with
multiple regressors. Oxford Bulletin of Economics and Statistics 61: 653–670.
1 Introduction
Dropbox makes scholarly collaboration much easier because it allows scholars to share
files across different computers. At the same time, sharing do-files in Dropbox presents
its own complications. Because users may install Dropbox in different locations and
because users have different usernames, often on different computers, directory paths to
Dropbox folders may not work in do-files. This is especially likely when multiple Dropbox
users collaborate. Here I present some tips on how to overcome these difficulties.
2.1 Syncing
One issue with using Dropbox to share files is that Dropbox automatically syncs files
as they are saved. Stata do-files can get ahead of the Dropbox synchronization if, for
instance, a user saves files and then appends these files soon after in a loop. It may
also happen if a user saves a file and then uses it. This problem can be solved with
a sleep command at the end of the loop. Telling Stata to wait for five seconds or so
before continuing the loop will usually solve the problem.
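A minimal sketch of this pattern (the file name is hypothetical; sleep takes its argument in milliseconds, so 5000 pauses for five seconds while Dropbox finishes syncing):

. save "results.dta", replace
. sleep 5000
. use "results.dta", clear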
3 Solutions
There are several different ways to ensure that everyone can easily share and use Stata
do-files in Dropbox without errors. I discuss the advantages and drawbacks of the
different ways below.
3.1 Edit file
One solution, at least for Windows users, is to open do-files using the edit option. The
user does not have to specify a pathname, because Stata will automatically change the
directory to the one where the do-file is located. From there, relative paths can be used
to negotiate around the shared directory. The biggest drawback to this method is that
it is limited to Windows users. It also does not fit with how a lot of people use Stata,
because each time a user wants to open a do-file in a different directory, the user has to
open a new instance of Stata or change the directory within Stata.

1. I use /users/username to refer to a user's home directory because most users use Windows or
Macs. Unix users should read it as ~.
3.2 Capture
Other users may prefer to use the capture command to change the directory. Here each
user puts a change directory (cd) command to his or her Dropbox folder preceded by
the capture command, which prevents Stata from returning an error and aborting the
do-file if the specified directory does not exist. As the number of users increases, or if
users have different usernames for their home and office computers, keeping track of all
the different directories becomes difficult.
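A minimal sketch of this approach at the top of a shared do-file (the user names and paths are hypothetical; on each machine, only one of the cd attempts succeeds and the failed ones are silently ignored):

. capture cd "C:/Users/alice/Dropbox/project"
. capture cd "/Users/bob/Dropbox/project"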
3.3 c(username)
Stata stores the user's name in a c-class value called c(username). If all users have Dropbox
in the same place, the macro can be used to specify the Dropbox directory. As noted
above, one of the common places users store Dropbox is in /users/username/Dropbox/.
The username is stored by Stata as c(username), which can be inserted as a local in the
change directory command: cd /users/`c(username)'/Dropbox. This will work as
long as all users have Dropbox installed in the same directory. However, some users may
install Dropbox in /users/username/My Dropbox/ or in /users/username/Documents/Dropbox/.
If this is the case, then the c(username) approach will not work. Moreover, as noted
above, this will work with Windows and Mac computers but not with Unix computers.
If all collaborators use Unix or Macs, they could use ~/Dropbox to go to the root
Dropbox directory.
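As a sketch (this assumes the default install location described above), the macro expands to each user's own home directory:

. display "/users/`c(username)'/Dropbox"
. cd "/users/`c(username)'/Dropbox"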
3.4 dropbox.ado
A final solution is to use an ado-file I created, dropbox.ado, which looks for the Dropbox
directory in the most common places that users install Dropbox. It starts in
the most commonly used location (/users/`c(username)'/Dropbox for Windows and
~/Dropbox for Mac and Unix computers) and then searches within the Documents directory
and then the root directory to find Dropbox. The command returns the local
Dropbox directory as r(db), and unless the nocd option is specified, it changes the
directory to a user's root Dropbox directory. From there, the relative paths of all users
within Dropbox will be the same. The command also uses the username macro to look
for the Dropbox directory.
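A sketch of the intended usage, relying only on the behavior described above (the shared subdirectory name follows the example in the conclusion):

. dropbox
. display "`r(db)'"
. cd "Shared Folder"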
This command is limited because it may not provide the correct Dropbox directory
if a user has more than one instance of Dropbox installed. It will not work if a Windows
user has Dropbox installed on a drive other than the c: drive. Also the command will
work only if all shared users have the command on their computers.
4 Conclusion
Using multiple computers and sharing files in the Cloud is increasingly common. In this
article, I presented some tips on how to best handle do-files shared with the popular
Dropbox program. Here I conclude with a couple of general tips about navigating
directories when sharing do-files.
First, avoid using the backslash when setting paths; instead, use a forward slash.
The backslash is used only by Windows machines; it is also used as an escape character
by Stata, which often causes confusion when users include locals in their pathnames.
For example, c:\users\`c(username)'\Dropbox will not work in Stata because Stata
will ignore the backslash between users and `c(username)'. Both Unix and Macs use
the forward slash in directories, and Windows recognizes the forward slash, so it is a
costless change. It will also ensure conformability across operating systems. Similarly,
Windows users should avoid references to the c:\ drive as often as possible. Sometimes,
this is unavoidable, especially with network drives or with partitioned drives. However,
if all work is done on the c:\ drive, Windows will recognize cd / as referring to the c:\
drive, which brings Windows syntax in line with Unix and Mac syntax.
Second, users should become familiar with the commands to move around directories
without specifying full path names. Users can move up one directory using cd ..
or up two directories using cd ../... From the current directory, users can move
down a directory by specifying only the new directory name. For example, to go from
/users/username/Dropbox/ to /users/username/Dropbox/Shared Folder/, one can
type cd "Shared Folder".
1 Introduction
For instructors of measurement and evaluation and individuals seeking methodological
guidance, it is difficult to find a book that both covers key analytic concepts and provides
clear direction on how to perform the associated analyses in a given statistical software
package. The fourth edition of An Introduction to Stata for Health Researchers, by
Svend Juul and Morten Frydenberg, fills this need. It does an excellent job of covering a
wide range of measurement and evaluation topics while providing a gentle introduction
to Stata for those unfamiliar with the software. In fact, though the title suggests
the book is for health researchers, it is readily generalizable to many disciplines that
implement the same methods.
Many improvements have been made to the book since John Carlin's review of
the inaugural edition in 2006 (Carlin 2006), including a reorganization of chapters to
more closely mirror the typical flow of a research project, an increase in the number of
practice exercises, and a more focused treatment of statistical issues. Additionally, this
fourth edition has been updated for Stata 13. On the whole, Juul and Frydenberg have
prepared a very accessible book for readers with varied levels of proficiency in statistics
or Stata, or both.
2 Overview
Section I includes four chapters (called the basics) that introduce the reader to Stata.
These chapters cover such issues as installing the program, getting help, understanding
file types, and using command syntax. While a novice could go directly to the Stata
user's manual (in particular, Getting Started with Stata and the Stata User's Guide),
this book offers a more user-friendly introduction. Combined, these 35 pages are more
than sufficient to get a Stata novice up and running.
Section II includes six chapters dealing with issues pertaining to data management,
such as variable types (numeric, dates and strings) and their manipulation and storage
(chapter 5); importing and exporting data (chapter 6); applying labels (chapter 7);
generating and replacing values and performing basic calculations (chapter 8); and
changing data structure, such as appending, merging, reshaping, and collapsing data
(chapter 9). Chapter 10 provides excellent advice on creating documentation (via do-files
and logs, etc.) to ensure reproducibility of data management and analytic steps.
While creating documentation is seemingly intuitive, not all researchers consistently
follow these steps.
Section III includes five chapters focusing on the types of data analyses most widely
used in health-related research.
Chapter 11 starts with basic descriptive analytics and then continues on to analy-
ses using epidemiologic tables for binary variables (including the addition of stratied
variables). This naturally progresses to analyses of continuous variables, and the chap-
ter demonstrates some visual displays of the data (histograms, QQ plots, and kernel
density plots) and methods of tabulation. The chapter then ventures into more formal
basic statistical analyses, such as t tests, one-way analysis of variance, and nonparamet-
ric techniques (ranksum).
Chapter 12 presents ordinary least-squares and logistic regression, with a fair amount
of exposition on the use of lincom for postestimation.
Chapter 13 describes time-to-event analyses, starting with simple curves and tables,
and then moves into progressively more complex Cox regression models (without and
with time-varying covariates). Next it introduces Poisson models to examine more
complex models for rates. Finally, it includes a brief discussion on indirect and direct
standardization.
Chapter 14 is titled "Measurement and diagnosis," and it describes graphical plots
and statistical tests for assessing measurement variation at one time point, and then
again over multiple measurements, for dependent samples. This transitions into methods
used for assessing accuracy of diagnostic tests (that is, sensitivity, specificity, area under
the curve, etc.).
Chapter 15, "Miscellaneous," includes topics such as random sampling, sample-
size calculations (including a nice example using simulation to estimate power for a
noninferiority study), error trapping, and log files.
Section IV includes one comprehensive chapter on graphs (44 pages). The chapter
begins by plotting a basic graph and describing the various elements, and it progresses
with increasing sophistication. It ends with some important tips on saving the code in
do-files so that graphs can be reproduced or enhanced later.
The final section, section V, is composed of a single chapter titled "Advanced topics"
and discusses storing and using results after estimation and defining macros and
scalars. It then discusses looping through data using foreach, forvalues, and if/then
statements. The chapter ends with a brief overview of creating user-written commands.
3 Comments
The book is well organized, following the logical step-by-step approach that investigators
apply to their research: data acquisition and management, analysis, and presentation
of results. The many brief examples are useful and generalizable, and the footnotes
are helpful additions. When a topic is briefly touched upon, the authors refer the
reader to the relevant help resource in Stata for more details. They also provide helpful
recommendations for resolving issues that may have multiple solutions.
Another strength of the book is that it contains many important but often overlooked
details (even for advanced Stata users), such as why a value may appear differently
when formatted as float versus double (pages 45–46) and how this precision may impact
comparisons. Other examples include the use of numlabel to display both the value
and the value label of a variable (page 67), the use of egen cut() to easily recode
continuous variables into categories (page 75), and setting showbaselevels to display
a line for the reference level in regression output (page 153). Of arguably greatest value
is the fact that the authors continually emphasize the importance of developing good
habits in documenting the work process (using do-files and logs) so that all output
can be replicated, errors can be tracked down, and time-consuming procedures can be
performed repeatedly and efficiently.
There is very little that I would change about this book, and my suggestions all relate
to what the authors could consider for future editions. First, the authors use lincom
and testparm extensively in the chapters on regression and time-to-event analyses.
Readers would benefit from seeing examples using margins (followed by marginsplot).
margins is an extremely flexible command that allows the user to perform various
analyses after running regression models, mostly with little additional specification.
The authors currently provide only a footnote (page 150) pointing interested readers
to the excellent book written by Michael N. Mitchell (2012). Second, some mention
of parametric regression models for survival analysis would be valuable (using streg),
because readers in certain disciplines may prefer these models over Cox regression models
(using stcox).
Finally, while Stata 13 introduced a new set of commands to estimate treatment
effects using propensity score-based matching and weighting techniques, the only mention
of such approaches is in appendix A, where the authors briefly describe the Stata
Treatment-Effects Reference Manual by saying this: "Despite its title, it does not correspond
to the methods of analysis that are mainstream in health research." This
statement left me somewhat perplexed, given that graduate programs in public health
in the United States have a required course in program evaluation that likely covers
these methods in at least some detail. Furthermore, there is a growing body of
health research literature where using these methods has become commonplace (see,
for example, Austin [2007; 2008]). Readers would benefit from an introduction to these
techniques, perhaps as a final chapter in which some of the datasets analyzed in previous
chapters using regression are reanalyzed using one of these approaches and the
results compared. The Stata Treatment-Effects Reference Manual offers an excellent
introduction to the methods implemented in Stata, and Stuart (2010) provides a more
comprehensive discussion of treatment-eects estimation using an array of approaches.
In summary, I strongly recommend this book both for students in introductory
measurement and evaluation courses and for more seasoned health researchers who
would like to avoid a steep learning curve when trying to conduct analyses in Stata.
4 References
Austin, P. C. 2007. Propensity-score matching in the cardiovascular surgery literature
from 2004 to 2006: A systematic review and suggestions for improvement. Journal of
Thoracic and Cardiovascular Surgery 134: 1128–1135.
Juul, S., and M. Frydenberg. 2014. An Introduction to Stata for Health Researchers.
4th ed. College Station, TX: Stata Press.
Mitchell, M. N. 2012. Interpreting and Visualizing Regression Models Using Stata.
College Station, TX: Stata Press.
Stuart, E. A. 2010. Matching methods for causal inference: A review and a look forward.
Statistical Science 25: 1–21.
Software Updates
References
Gu, Y., A. R. Hole, and S. Knox. 2013. Fitting the generalized multinomial logit model
in Stata. Stata Journal 13: 382–397.
Lokshin, M., and Z. Sajaia. 2004. Maximum likelihood estimation of endogenous switching
regression models. Stata Journal 4: 282–289.
Lokshin, M., and Z. Sajaia. 2005a. Software update: st0071_1: Maximum likelihood estimation of endogenous
switching regression models. Stata Journal 5: 139.
Lokshin, M., and Z. Sajaia. 2005b. Software update: st0071_2: Maximum likelihood estimation of endogenous
switching regression models. Stata Journal 5: 471.