Associate Editors
Christopher F. Baum, Boston College
Nathaniel Beck, New York University
Rino Bellocco, Karolinska Institutet, Sweden, and University of Milano-Bicocca, Italy
Maarten L. Buis, WZB, Germany
A. Colin Cameron, University of California–Davis
Mario A. Cleves, University of Arkansas for Medical Sciences
William D. Dupont, Vanderbilt University
Philip Ender, University of California–Los Angeles
David Epstein, Columbia University
Allan Gregory, Queen's University
James Hardin, University of South Carolina
Ben Jann, University of Bern, Switzerland
Stephen Jenkins, London School of Economics and Political Science
Ulrich Kohler, University of Potsdam, Germany
Frauke Kreuter, Univ. of Maryland–College Park
Peter A. Lachenbruch, Oregon State University
Jens Lauritsen, Odense University Hospital
Stanley Lemeshow, Ohio State University
J. Scott Long, Indiana University
Roger Newson, Imperial College, London
Austin Nichols, Urban Institute, Washington DC
Marcello Pagano, Harvard School of Public Health
Sophia Rabe-Hesketh, Univ. of California–Berkeley
J. Patrick Royston, MRC Clinical Trials Unit, London
Philip Ryan, University of Adelaide
Mark E. Schaffer, Heriot-Watt Univ., Edinburgh
Jeroen Weesie, Utrecht University
Ian White, MRC Biostatistics Unit, Cambridge
Nicholas J. G. Winter, University of Virginia
Jeffrey Wooldridge, Michigan State University
The Stata Journal publishes reviewed papers together with shorter notes or comments, regular columns, book
reviews, and other material of interest to Stata users. Examples of the types of papers include 1) expository
papers that link the use of Stata commands or programs to associated principles, such as those that will serve
as tutorials for users first encountering a new field of statistics or a major new technique; 2) papers that go
beyond the Stata manual in explaining key features or uses of Stata that are of interest to intermediate
or advanced users of Stata; 3) papers that discuss new commands or Stata programs of interest either to
a wide spectrum of users (e.g., in data management or graphics) or to some large segment of Stata users
(e.g., in survey statistics, survival analysis, panel analysis, or limited dependent variable modeling); 4) papers
analyzing the statistical properties of new or existing estimators and tests in Stata; 5) papers that could
be of interest or usefulness to researchers, especially in fields that are of practical importance but are not
often included in texts or other journals, such as the use of Stata in managing datasets, especially large
datasets, with advice from hard-won experience; and 6) papers of interest to those who teach, including Stata
with topics such as extended examples of techniques and interpretation of results, simulations of statistical
concepts, and overviews of subject areas.
The Stata Journal is indexed and abstracted by CompuMath Citation Index, Current Contents/Social and Behav-
ioral Sciences, RePEc: Research Papers in Economics, Science Citation Index Expanded (also known as SciSearch),
Scopus, and Social Sciences Citation Index.
For more information on the Stata Journal, including information for authors, see the webpage
http://www.stata-journal.com
Subscriptions are available from StataCorp, 4905 Lakeway Drive, College Station, Texas 77845, telephone
979-696-4600 or 800-STATA-PC, fax 979-696-4601, or online at
http://www.stata.com/bookstore/sj.html
Subscription rates listed below include both a printed and an electronic copy unless otherwise mentioned.
http://www.stata.com/bookstore/sjj.html
Individual articles three or more years old may be accessed online without charge. More recent articles may
be ordered online.
http://www.stata-journal.com/archives.html
The Stata Journal is published quarterly by the Stata Press, College Station, Texas, USA.
Address changes should be sent to the Stata Journal, StataCorp, 4905 Lakeway Drive, College Station, TX
77845, USA, or emailed to sj@stata.com.
Copyright © 2014 by StataCorp LP
Copyright Statement: The Stata Journal and the contents of the supporting files (programs, datasets, and
help files) are copyright © by StataCorp LP. The contents of the supporting files (programs, datasets, and
help files) may be copied or reproduced by any means whatsoever, in whole or in part, as long as any copy
or reproduction includes attribution to both (1) the author and (2) the Stata Journal.
The articles appearing in the Stata Journal may be copied or reproduced as printed copies, in whole or in part,
as long as any copy or reproduction includes attribution to both (1) the author and (2) the Stata Journal.
Written permission must be obtained from StataCorp if you wish to make electronic copies of the insertions.
This precludes placing electronic copies of the Stata Journal, in whole or in part, on publicly accessible websites,
fileservers, or other locations where the copy may be accessed by anyone other than the subscriber.
Users of any of the software, ideas, data, or other materials published in the Stata Journal or the supporting
files understand that such use is made without warranty of any kind, by either the Stata Journal, the author,
or StataCorp. In particular, there is no warranty of fitness of purpose or merchantability, nor for special,
incidental, or consequential damages such as loss of profits. The purpose of the Stata Journal is to promote
free communication among Stata users.
The Stata Journal, electronic version (ISSN 1536-8734) is a publication of Stata Press. Stata, Stata
Press, Mata, and NetCourse are registered trademarks of StataCorp LP.
Volume 14 Number 3 2014
Abstract. In this article, I present ivtreatreg, a command for fitting four different binary treatment models with and without heterogeneous average treatment effects under selection-on-unobservables (that is, treatment endogeneity). Depending on the model specified by the user, ivtreatreg provides consistent estimation of average treatment effects by using instrumental-variables estimators and a generalized two-step Heckman selection model. The added value of this new command is that it allows for generalization of the regression approach typically used in standard program evaluation by assuming heterogeneous response to treatment. It also serves as a sort of toolbox for conducting joint comparisons of different treatment methods, thus readily permitting checks on the robustness of results.
Keywords: st0346, ivtreatreg, microeconometrics, treatment models, instrumental variables, unobservable selection, treatment endogeneity, heterogeneous treatment response
1 Introduction
It is increasingly recognized as good practice to perform ex-post evaluation of economic and social programs through counterfactual evidence-based statistical analysis. Such analysis is particularly important at the policy-making level. The statistical approach is usually applied to measuring the causal effects of an intervention on the part of an external authority, such as a local or national government, on a set of subjects targeted by a given program, such as individuals and companies. Similar analysis is also becoming popular in reassessing causal relations among factors identified under modern microeconometric theory from a counterfactual perspective but not necessarily regarding policy implications.
Several official Stata commands and new user-written commands have been applied to enlarge the set of available statistical tools for conducting these counterfactual analyses. Table 1 contains a list of commands for estimating binary treatment effects. However, the most recent release of Stata, version 13, provides a new far-reaching suite called teffects, which can be used to estimate treatment effects from observational data.
The teffects command can be used to estimate potential-outcome means and average treatment effects (ATEs). As shown in table 2, the teffects suite covers a large set of methods, such as regression adjustment; inverse-probability weighting; doubly robust methods, including inverse-probability-weighted regression adjustment and augmented inverse-probability weighting; and matching on the propensity score or covariates (with nearest neighbors). Other subcommands can be used for postestimation purposes and for testing reliability of results; for example, overlap allows for plotting the estimated densities of the probability of getting each treatment level.
Table 2. Stata 13 teffects subcommands for estimating treatment effects from observational data
Subcommand Description
aipw Augmented inverse-probability weighting
ipw Inverse-probability weighting
ipwra Inverse-probability-weighted regression adjustment
nnmatch Nearest-neighbor matching
overlap Overlap plots
psmatch Propensity-score matching
ra Regression adjustment
When applying teffects, the outcome models can be continuous, binary, count, or
nonnegative. Binary outcomes can be modeled using logit, probit, or heteroskedastic
probit regression, and count and nonnegative outcomes can be modeled using Poisson
regression. The treatment model can be binary or multinomial. Binary treatments
can be modeled using logit, probit, or heteroskedastic probit regression. For multino-
mial treatments, one can use pairwise comparisons and then exploit binary treatment
approaches.1
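As a brief illustration (not drawn from this article's data), a minimal teffects session on Stata's example dataset cattaneo2 might look like the following sketch; the variable names belong to that example dataset, and the specification is arbitrary.
. webuse cattaneo2
. teffects ipw (bweight) (mbsmoke mmarried mage prenatal1, probit)
. teffects overlap
Here teffects ipw estimates the ATE of smoking on birthweight by inverse-probability weighting, and teffects overlap plots the estimated propensity-score densities as a postestimation check.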
While the teffects command deals mainly with estimation methods suitable under selection-on-observables, Stata 13 presents two further commands to deal with endogenous binary treatment (occurring in the case of selection-on-unobservables): etregress and etpoisson. etregress estimates the ATE and the other parameters of a linear regression model augmented with an endogenous binary treatment variable. Basically, etregress is an improvement on Stata's treatreg command, whose estimation is based on the Heckman (1978) selection model. Because such a model is fully parametric, estimation can be performed either by full maximum likelihood or, less parametrically, by a two-step consistent estimator. Similarly, etpoisson estimates an endogenous binary treatment model when the outcome is a count variable by using a Poisson regression. Both the ATE and the ATE on the treated (ATET) can be estimated by etpoisson.
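For comparison, minimal calls to these two official commands might look as follows; y, x1, x2, w, and z are hypothetical variable names, and the specifications are only sketches.
. etregress y x1 x2, treat(w = x1 x2 z) twostep
. etpoisson y x1 x2, treat(w = x1 x2 z)
The twostep option requests the two-step consistent estimator of etregress; etpoisson is estimated by maximum likelihood only.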
Although Stata 13 offers the above commands for dealing with endogenous treatment, the commands suffer from two important limitations. First, they assume joint normality of errors, meaning that they are not robust to violation of this hypothesis. Second, they do not allow, at least by default, for calculation of causal effects under observable heterogeneity, meaning that they assume causal effects to be the same in the subpopulations of treated and untreated units. This second limitation might be partially
overcome by introducing interactions between the binary treatment and the covariates in the outcome equation, but this requires further user programming to recover all the parameters of interest.
1. For multinomial treatment, readers can refer to the user-written command poparms, which estimates multivalued treatment effects under conditional independence by using efficient semiparametric estimation of multivalued treatment effects. See Cattaneo (2010) and Cattaneo, Drukker, and Holland (2013) for tutorials.
The gsem command, also new in Stata 13, can estimate the causal parameters of models with selection-on-unobservables, implemented as unobserved components, and heterogeneous effects, implemented as random coefficients. However, gsem uses full-information maximum likelihood (ML), thus assuming a fully specified parametric model, which in some contexts could present questionable reliability.
The ivtreatreg command I present in this article implements a series of methods for treatment-effects estimation under treatment endogeneity that use only conditional-moment restrictions. These methods are more robust than those implemented by etregress or gsem. ML estimators would naturally be more efficient under correct specification, and this means that a trade-off may arise between robustness and efficiency. On the one hand, assuming some parametric distributional form for the error terms allows one to use ML estimation, reaching the Cramér–Rao lower variance bound. On the other hand, when these distributional assumptions are questionable, ML may be less reliable than less efficient (but consistent) estimation procedures, and the latter become more robust. Thus it seems useful to adopt distribution-free methods for dealing with treatment endogeneity, which the ivtreatreg command makes possible.
ivtreatreg fits four binary treatment models with and without idiosyncratic or heterogeneous ATEs.2 Depending on the model specified by the user, ivtreatreg provides consistent estimation of ATEs under the hypothesis of selection-on-unobservables by using IV and a generalized Heckman-style selection model.
Conditional on a prespecified subset of exogenous variables, x, thought of as driving the heterogeneous response to treatment, ivtreatreg calculates the ATE, the ATET, and the ATE on the nontreated (ATENT) for each called model, as well as the estimates of these parameters conditional on the observable factors x.
Specifically, the four models fit by ivtreatreg are direct-2sls (IV regression fit by direct two-stage least squares), probit-ols (IV two-step regression fit by probit and OLS), probit-2sls (IV regression fit by probit and two-stage least squares), and heckit (Heckman two-step selection model).
Extensive discussion of the conditions under which the previous methods provide consistent estimation of ATE, ATET, and ATENT can be found in Wooldridge (2010). ivtreatreg provides value by allowing for generalization of the regression approach typically employed in standard program evaluation by assuming heterogeneous response to treatment and treatment endogeneity. It is also a sort of toolbox for conducting joint comparisons of different treatment methods, thus readily permitting the researcher to run checks on the robustness of results.
In sections 2 and 3 of this article, I briefly present the statistical framework and estimation methods implemented by ivtreatreg. In section 4, I present the syntax with a description of the help file, and in section 5, I conduct a Monte Carlo experiment to test the reliability of ivtreatreg. In section 6, I demonstrate the command applied to real data from a study of the relationship between education and fertility. I conclude with section 7, where I provide a brief summary and affirm the value of ivtreatreg. In the appendix, I derive the formulas for the selection model.
2. To my knowledge, no previous Stata command has addressed this objective.
2 Statistical framework3
Our hypothetical evaluation objective is to estimate the effect of binary treatment w (taking value 1 for treated and 0 for untreated units) on scalar outcome y.4 We suppose that the assignment to treatment is not random but instead due to some form of the units' self-selection or external selection. For each unit, (y1, y0) denotes the two potential outcomes,5 where the outcome is y1 when the individual is treated and y0 when the individual is not treated. We then collect an independent and identically distributed sample of observations (yi, wi, xi) with i = 1, . . . , N, where x is a row vector of covariates hypothesized as driving the observable nonrandom assignment to treatment (confounders).
Here we are interested in estimating the ATE, defined as
ATE = E(y1 − y0)
If we rely on observational data alone, we cannot identify the ATE because, for the same individual and at the same time, we can observe just one of the two quantities needed to calculate the ATE (Holland 1986). By restricting the analysis to the group of treated units, we can also define a second causal parameter, the ATET, as
ATET = E(y1 − y0 | w = 1)
Similarly, the ATENT, meaning the ATE calculated within the subsample of untreated units, is
ATENT = E(y1 − y0 | w = 0)
3. This section draws on the substantial literature on econometrics of program evaluation, such as
Rubin (1974), Angrist (1991), Angrist, Imbens, and Rubin (1996), Heckman, LaLonde, and Smith
(1999), Wooldridge (2010), and Cattaneo (2010). For a recent survey, see also Imbens and
Wooldridge (2009).
4. Notation follows Wooldridge (2010).
5. For simplicity, I avoid writing the subscript form of the unit i when referring to population param-
eters.
These parameters are linked by the relation ATE = ATET × p(w = 1) + ATENT × p(w = 0), where p(w = 1) is the probability of being treated and p(w = 0) is the probability of being untreated. Where x is known, we can also define the previous parameters conditional on x as follows:
ATE(x) = E(y1 − y0 | x)
ATET(x) = E(y1 − y0 | w = 1, x)
ATENT(x) = E(y1 − y0 | w = 0, x)
These quantities are functions of x, which means that they can be seen as individual-specific ATEs because each individual owns a specific value of x. Furthermore, by the law of iterated expectations, we have
ATE = Ex{ATE(x)}
ATET = Ex{ATET(x)}
ATENT = Ex{ATENT(x)}
The analyst needs to recover consistent (and, when possible, efficient) estimators of the previous parameters from observational data. Before going on, note that throughout this article we assume that the stable unit treatment value assumption (Rubin 1978) holds. This assumption states that the treatment received by one unit does not affect other units' outcomes (Cox 1958). We thus restrict the analysis to a no-interference setting. Indeed, when the stable unit treatment value assumption does not hold, treatment externality effects between units may occur and pose severe problems in identifying effects.6
3 Estimation methods
The new command ivtreatreg implements four models to consistently estimate the previous parameters, and three of these are IV estimators. These methods are direct-2sls (IV regression estimated by direct two-stage least squares), probit-ols (IV two-step regression estimated by probit and OLS), probit-2sls (IV regression estimated by probit and two-stage least squares), and heckit (Heckman two-step selection model). Each of these can be estimated by assuming either homogeneous or heterogeneous response to treatment (for a total of eight models). Before presenting how ivtreatreg works, I briefly set out the formulas, conditions, and procedures of each model (see Wooldridge [2010, chap. 21]). We start by assuming that
y1 = μ1 + xβ1 + e1 (1)
y0 = μ0 + xβ0 + e0 (2)
y = y0 + w(y1 − y0) (3)
6. Treatment-effects estimation under interference between units is a challenging field of study. Sobel (2006), Rosenbaum (2007), and Hudgens and Halloran (2008) offer important contributions on correct inference within such a setting.
Equations (1) and (2) represent the potential-outcome equations, assumed to be linear in parameters, while the vector x can also contain nonlinear functions of the various covariates. Equation (3) is the so-called potential-outcome model and expresses the observational rule of the model, because y is the observed outcome. We do not need to explicitly specify an equation for w (that is, a selection equation) in this model; however, we could specify one. We could assume, for instance, that a linear probability model for the propensity to be selected into treatment is
w = η0 + xη1 + a (4)
where a is an error component. As long as a is uncorrelated with (e1, e0), then (4) is redundant and not needed to identify the causal parameters. However, we must know w to identify the causal parameters, as we will discuss later. By substituting (1) and (2) into (3), we get
y = μ0 + (μ1 − μ0)w + xβ0 + w(xβ1 − xβ0) + e0 + w(e1 − e0) (5)
where β0 ≠ β1 implies observable heterogeneity and e1 ≠ e0 implies unobservable heterogeneity.
Next, we define ν = e0 + w(e1 − e0). We can distinguish two cases: 1) e1 = e0 and 2) e1 ≠ e0, which can in turn be split into the following subcases:
Case 1.1. e1 = e0 = e, β0 = β1 = β, E(e | x, w) = 0: unobservable homogeneity, homogeneous reaction function of y0 and y1 to x, treatment exogeneity.
In this case, we can show that
E(y | w, x) = μ0 + w ATE + xβ
ATE = ATE(x) = ATET = ATET(x) = ATENT = ATENT(x) = μ1 − μ0
Thus no heterogeneous ATE (over x) exists. Furthermore, OLS consistently estimates ATE.
Case 1.2. e1 = e0 = e, β0 ≠ β1, E(e | x, w) = 0: unobservable homogeneity, heterogeneous reaction function of y0 and y1 to x, treatment exogeneity. Defining δ = β1 − β0 and letting x̄ = E(x), in this case we have
ATE = (μ1 − μ0) + x̄δ
ATE(x) = ATE + (x − x̄)δ
ATET = ATE + E{(x − x̄)δ | w = 1}
ATET(x) = ATE + {(x − x̄)δ | w = 1}
ATENT = ATE + E{(x − x̄)δ | w = 0}
ATENT(x) = ATE + {(x − x̄)δ | w = 0}
The corresponding sample estimators are obtained by replacing α (the coefficient of w) and δ (the coefficients of the interactions) with their OLS estimates α̂_OLS and δ̂_OLS:
ATE = α̂_OLS
ATE(x) = α̂_OLS + (x − x̄)δ̂_OLS
ATET = α̂_OLS + {1/(Σ_i wi)} Σ_i wi(xi − x̄)δ̂_OLS
ATET(x) = {α̂_OLS + (x − x̄)δ̂_OLS}_(w=1)
ATENT = α̂_OLS + {1/(Σ_i (1 − wi))} Σ_i (1 − wi)(xi − x̄)δ̂_OLS
ATENT(x) = {α̂_OLS + (x − x̄)δ̂_OLS}_(w=0)
where it is clear that, under treatment exogeneity, these parameters can be consistently estimated by plugging in the parameters from an OLS of (5); a minimal sketch of this plug-in regression follows.
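As a hedged illustration with two hypothetical confounders x1 and x2, a treatment w, and an outcome y, the plug-in OLS could be run as follows; the coefficient on 1.w then estimates the ATE, and ATET and ATENT follow by averaging the interaction terms over the treated and untreated units.
. * demean the confounders so that the coefficient on w equals the ATE
. quietly summarize x1
. generate double x1_dm = x1 - r(mean)
. quietly summarize x2
. generate double x2_dm = x2 - r(mean)
. regress y i.w c.x1 c.x2 i.w#c.x1_dm i.w#c.x2_dm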
But what happens when treatment exogeneity fails and w becomes endogenous? We
then have three subcases.
Case 2.1. e1 = e0 = e, β0 = β1 = β, E(e | x, w) ≠ 0: unobservable homogeneity, homogeneous reaction function of y0 and y1 to x, treatment endogeneity.
In this case, we can show that
E(y | w, x) = μ0 + w ATE + xβ0
ATE = ATET = ATENT
The corresponding sample estimators are obtained by replacing the OLS estimates with their IV counterparts α̂_IV and δ̂_IV:
ATE = α̂_IV
ATE(x) = α̂_IV + (x − x̄)δ̂_IV
ATET = α̂_IV + {1/(Σ_i wi)} Σ_i wi(xi − x̄)δ̂_IV
ATET(x) = {α̂_IV + (x − x̄)δ̂_IV}_(w=1)
ATENT = α̂_IV + {1/(Σ_i (1 − wi))} Σ_i (1 − wi)(xi − x̄)δ̂_IV
ATENT(x) = {α̂_IV + (x − x̄)δ̂_IV}_(w=0)
To apply IV and get consistent estimation, this case requires a further orthogonality condition,
E{w(e1 − e0) | x, z} = E{w(e1 − e0)} (7)
Given this condition, estimation may proceed as in Case 2.2.
Next, I present the methods implemented by ivtreatreg by referring to the case of heterogeneous reaction.
2. Plug these estimated parameters into the sample formulas and recover all the causal effects.
However, ivtreatreg does not fit such a model, because it can be more robustly obtained by using the regression-adjustment estimator implemented in the teffects command of Stata 13 (with the subcommand ra). This command handles many functional forms other than the linear one, and an estimate of ATENT can also be obtained using the margins command after running the regression in step 1. For this reason, ivtreatreg concentrates on the endogenous treatment-effect case, for which it adds new tools.
direct-2sls
By using direct-2sls, the analyst does not consider the binary nature of w. This method follows the typical IV steps:
1. Run an OLS regression of w on {1, x, z} and obtain the fitted values w_fv,i.
2. Run a second OLS of y on {x, w_fv,i, w_fv,i(x − x̄)}. The coefficient of w_fv,i is a consistent estimate of the ATE.
3. Plug these estimated parameters into the sample formulas, recover all the other causal effects, and obtain standard errors for ATET and ATENT via bootstrap.
probit-ols
In this case, the analyst exploits the binary nature of w by fitting a probit regression in the first step. Operationally, probit-ols follows these three steps:
probit-2sls
1. Fit a probit of w on {1, x, z} and obtain the predicted probabilities p̂_w,i.
2. Run an OLS of w on (1, x, p̂_w), thus getting the fitted values w2_fv,i.
3. Run an OLS of y on {x, w2_fv,i, w2_fv,i(x − x̄)}.
The coefficient of w2_fv,i is a more efficient estimator of the ATE than that of direct-2sls. Furthermore, to achieve consistency, this procedure does not require that the process generating w be correctly specified; thus, it is more robust than probit-ols.
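For the homogeneous-effects case, the probit-2sls logic can be sketched with official commands; y, w, x1, x2, and the instrument z are hypothetical variable names. Using the probit fitted probability as the excluded instrument reproduces the two OLS stages described above.
. probit w x1 x2 z
. predict double pw, pr
. ivregress 2sls y x1 x2 (w = pw)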
3.3 heckit
ivtreatreg considers a generalized heckit model to consistently estimate the previous parameters without using an IV. The price is that of relying on a trivariate normality assumption between the error terms of the potential outcomes and the error term of the treatment. However, this model has the advantage of fitting Case 2.3 without invoking (7). The reference model is again the system of (1)–(4), where we also assume that (e0, e1, a) are trivariate normal. Such a model, as implemented by ivtreatreg, generalizes the two-step option of the official Stata command treatreg.
By default, the treatreg command assumes neither observable heterogeneity (because it holds that β0 = β1) nor unobservable heterogeneity (because it holds that e1 = e0). When these two assumptions are removed, the model leads to the following
baseline regression function, which can be consistently estimated by OLS (see Wooldridge [2010, 949]):
E(y | x, z, w) = μ0 + αw + xβ0 + w(x − x̄)δ + ρ1 w φ(q)/Φ(q) + ρ0 (1 − w) φ(q)/{1 − Φ(q)}
where α is the ATE, ρ1 and ρ0 are the correlations between the two potential outcomes' errors and the treatment's error, and φ(·) and Φ(·) are the standard normal density and cumulative distribution function, respectively. To estimate the previous regression, ivtreatreg performs the following two-step procedure:
1. Fit a probit of wi on (1, xi, zi), obtain q̂i, and compute the correction terms wi φ(q̂i)/Φ(q̂i) and (1 − wi) φ(q̂i)/{1 − Φ(q̂i)}.
2. Run an OLS of yi on {1, wi, xi, wi(xi − x̄), wi φ(q̂i)/Φ(q̂i), (1 − wi) φ(q̂i)/{1 − Φ(q̂i)}}.
In this model, the ATE and ATE(x) retain the forms
ATE = α
ATE(x) = α + (x − x̄)δ
although ATET(x), ATET, ATENT(x), and ATENT assume different forms compared with the previous models, specifically7
ATET(x) = {α + (x − x̄)δ + (ρ0 + ρ1) λ1(q)}_(w=1)
ATET = α + {1/(Σ_i wi)} Σ_i wi(xi − x̄)δ + (ρ1 + ρ0) {1/(Σ_i wi)} Σ_i wi λ1i(q)
and
ATENT(x) = {α + (x − x̄)δ + (ρ1 + ρ0) λ0(q)}_(w=0)
ATENT = α + {1/(Σ_i (1 − wi))} Σ_i (1 − wi)(xi − x̄)δ + (ρ0 + ρ1) {1/(Σ_i (1 − wi))} Σ_i (1 − wi) λ0i(q)
where λ1(q) = φ(q)/Φ(q) and λ0(q) = φ(q)/{1 − Φ(q)}.
4.1 Syntax
ivtreatreg outcome treatment [varlist] [if] [in] [weight], model(modeltype) [hetero(varlist_h) iv(varlist_iv) conf(#) graphic vce(vcetype) beta const(noconstant) head(noheader)]
where outcome specifies the target variable that is the object of the evaluation, treatment specifies the binary treatment variable (that is, 1 = treated or 0 = untreated), and varlist defines the list of exogenous variables that are considered as observable confounders.
fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight.
4.2 Options
model(modeltype) specifies the treatment model to be fit, where modeltype must be one of the following four models (described in sections 3.3 and 3.4 above): direct-2sls, probit-2sls, probit-ols, or heckit. model() is required.
modeltype Description
direct-2sls IV regression fit by direct two-stage least squares
probit-2sls IV regression fit by probit and two-stage least squares
probit-ols IV two-step regression fit by probit and OLS
heckit Heckman two-step selection model
hetero(varlist_h) specifies the list of variables over which to calculate the idiosyncratic ATE(x), ATET(x), and ATENT(x), where x = varlist_h. When this option is not specified, the command fits the specified model without heterogeneous ATE. varlist_h should be the same set as, or a subset of, the variables specified in varlist.
iv(varlist_iv) specifies the variables to be used as instruments. This option is required with model(direct-2sls); it is optional with the other modeltypes.
conf(#) sets the confidence level to the specified number. The default is conf(95).
4.3 Remarks
The ivtreatreg command also creates several variables that can be used to further examine the data:
z_varname_h are the IVs used in a model's regression when hetero(varlist_h) and iv(varlist_iv) are specified. z_varname_h are created only for IV models.
where
x1 = ln(h1)
x2 = ln(h2)
z = ln(h3)
h1 = χ²(1) + c
h2 = χ²(1) + c
h3 = χ²(1) + c
c = χ²(1)
and
(a, e0, e1) ~ N(0, Σ)
where Σ has diagonal elements σ²_a, σ²_e0, and σ²_e1 and off-diagonal elements σ_a,e0 = ρ_a,e0 σ_a σ_e0, σ_a,e1 = ρ_a,e1 σ_a σ_e1, and σ_e0,e1 = ρ_e0,e1 σ_e0 σ_e1, with
σ²_a = 1, σ²_e0 = 3, σ²_e1 = 6.5
ρ_a,e0 = 0.5, ρ_a,e1 = 0.3, ρ_e0,e1 = 0
By assuming that the correlation between a and e0 (ρ_a,e0) and the correlation between a and e1 (ρ_a,e1) are different from 0, w, the selection binary indicator, is endogenous. We indicate the instrument with z, which is directly correlated with w but directly uncorrelated with y1 and y0. Given these assumptions, the DGP is completed by the observational rule yi = y0i + wi(y1i − y0i), generating the observed outcome y.
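A rough Stata sketch of one draw from such a DGP follows. The error covariance matrix uses the values above (covariances 0.5·√3 ≈ 0.866 and 0.3·√6.5 ≈ 0.765), but the selection- and outcome-equation coefficients are placeholders, because the article's exact DGP equations are not reproduced in this extract.
. clear
. set obs 2000
. matrix C = (1, .866, .765 \ .866, 3, 0 \ .765, 0, 6.5)
. drawnorm a e0 e1, cov(C)
. generate double c  = rchi2(1)
. generate double x1 = ln(rchi2(1) + c)
. generate double x2 = ln(rchi2(1) + c)
. generate double z  = ln(rchi2(1) + c)
. generate byte   w  = (0.2*x1 + 0.2*x2 + 0.8*z + a >= 0)   // placeholder selection equation
. generate double y0 = 1 + 0.5*x1 + 0.5*x2 + e0             // placeholder outcome equations
. generate double y1 = 1.3 + 0.6*x1 + 0.5*x2 + e1
. generate double y  = y0 + w*(y1 - y0)
. ivtreatreg y w x1 x2, hetero(x1 x2) iv(z) model(probit-2sls)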
The DGP is simulated 2,000 times using a sample size of 2,000. For each simulation, we get a different data matrix (x1, x2, y, w, z), to which we apply the four models implemented by ivtreatreg. Table 3 and figure 1 set out the simulation results.
We see that the true value of ATE is 0.224. As expected, all the IV procedures consistently estimate the true ATE, with a slight bias of around 5% only for direct-2sls. Figure 1 confirms these findings by jointly plotting the distributions of the ATEs obtained by each single method over the 2,000 DGP simulations. All methods give similar results, though direct-2sls has a slightly different shape with fatter tails. This suggests that we should examine the estimation precision. Under our DGP assumptions, we expect model heckit to be the most efficient method, followed by model probit-ols and model probit-2sls, with model direct-2sls performing the worst. In fact, our DGP follows exactly the same assumptions on which the model heckit is based, including the joint trivariate normality of a, e0, and e1.
Figure 1. Kernel density of the simulated ATE estimates for the four models
Table 3 confirms the following theoretical predictions: the lowest standard deviation is achieved by model heckit (0.248) and the highest by model direct-2sls (0.316), with the other methods lying in the middle with no appreciable differences. Observe that the standard-error means (mean SE in column 4) show that the values of the standard deviations of the estimators in column 3 are estimated precisely (the values are much the same). This means that the asymptotic distribution of the ATE estimators approximates the finite-sample distribution well.
Table 3 also shows simulation results for test size. The size of a test is the probability of rejecting a hypothesis H0 when H0 is true. In our DGP, we set the size level at 0.05 for a two-sided test of H0: ATE = 0.224 against the alternative H1: ATE ≠ 0.224. The results, under the heading Rejection rate (column 5), represent the proportion of simulations that lead to rejection of H0. These values should be interpreted as the simulation estimate of the true test size (which we assumed to be 0.05). As expected, the rejection rates are all lower than the usual 5% significance level.
In conclusion, these results seem to confirm both our expected theoretical results and the computational reliability of the ivtreatreg command.
This specification adopts the covariate frsthalf as the IV; frsthalf takes value 1 if the woman was born in the first six months of the year and 0 otherwise. This variable is partially correlated with educ7, but it should not have any direct relationship with the number of children in the family.
The simple difference-in-means estimator (the mean number of children among the treated, who are the more educated women, minus the mean among the untreated, the less educated women) is −1.77 with a t-value of 28.46 in absolute value. This means that women with more education have about two children fewer than women with less education, without ceteris paribus conditions. By adding confounding factors to the regression specification, we get the OLS estimate of ATE as −0.394 with a t-value of 7.94 in absolute value, still in the absence of heterogeneous treatment. This is still significant, but the magnitude, as expected, dropped considerably compared with the difference-in-means estimate, thus showing that confounders are relevant. When we consider OLS estimation with heterogeneity, we get an ATE equal to −0.37, which is still significant at 1%.9
When we consider IV estimation, results change dramatically. As we did in our working example of how to use ivtreatreg, we estimate the previous specification for probit-2sls with heterogeneous treatment response. The main output is reported below, where results from both the probit first step and the IV regression of the second step are set out. Results of the probit show that frsthalf is partially correlated with educ7, so it can be reliably used as an instrument for this variable. Step 2 shows that the ATE (again, the coefficient of educ7) is no longer significant and that it changes sign, becoming positive and equal to 0.30.
. use fertil2.dta
. ivtreatreg children educ7 age agesq evermarr urban electric tv,
> hetero(age agesq evermarr urban) iv(frsthalf) model(probit-2sls) graphic
(output omitted )
Probit regression Number of obs = 4358
LR chi2(7) = 1130.84
Prob > chi2 = 0.0000
Log likelihood = -2428.384 Pseudo R2 = 0.1889
(output omitted )
(output omitted )
This result is in line with the IV estimation obtained by Wooldridge (2010). Nevertheless, having assumed heterogeneous response to treatment, we can now also calculate the ATET and ATENT and inspect the cross-unit distribution of these effects. First, ivtreatreg returns these parameters as scalars (along with the treated and untreated sample sizes).
. ereturn list
scalars:
(output omitted )
e(ate) = .3004007409051661
e(atet) = .898290019586237
e(atent) = -.4468834318294228
e(N_tot) = 4358
e(N_treat) = 2421
e(N_untreat) = 1937
(output omitted )
To get the standard errors for testing the significance of ATET and ATENT, we can implement a bootstrap procedure along the following lines (the number of replications here is only illustrative):
. bootstrap atet=e(atet) atent=e(atent), reps(100): ivtreatreg children educ7 age agesq
> evermarr urban electric tv, hetero(age agesq evermarr urban) iv(frsthalf)
> model(probit-2sls)
The results show that neither ATET nor ATENT is significant; the two estimates differ considerably from each other but are not far from that of ATE. Furthermore, a simple check confirms that ATE = ATET × p(w = 1) + ATENT × p(w = 0), as the calculation below shows. Finally, we analyze the distributions of ATE(x), ATET(x), and ATENT(x); figure 2 shows the result.
Figure 2. Kernel densities of ATE(x), ATET(x), and ATENT(x)
Finally, figure 3 shows the plot of the ATE(x), ATET(x), and ATENT(x) distributions for each method. These distributions largely follow a similar pattern, although direct-2sls and heckit show some appreciable differences. heckit, in particular, shows a very different pattern, with a strong demarcation between the plots for treated and untreated units. Consequently, it appears not to be a reliable estimation procedure here, an observation that deserves further inspection.
Figure 3. Distribution of ATE(x), ATET(x), and ATENT(x) for the four models fit by ivtreatreg
7 Conclusion
In this article, I presented a new user-written Stata command, ivtreatreg, for fitting four different binary treatment models with and without idiosyncratic or heterogeneous ATEs. Depending on the model specified, ivtreatreg consistently estimates ATEs under the hypothesis of selection-on-unobservables by exploiting IV estimators and a generalized two-step Heckman selection model.
After presenting the statistical framework, I provided evidence on the reliability of ivtreatreg by using a Monte Carlo experiment. To familiarize the reader with the command, I also applied it to a real dataset. Results from both the Monte Carlo experiment and the real dataset encourage one to use the command when the empirical and theoretical setting suggests that treatment endogeneity and heterogeneous response to treatment are present. In such cases, performing more than one method may be a useful robustness check. The ivtreatreg command makes such checks possible and easy to perform, as sketched below.
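As a sketch of such a robustness check on the fertility data used in section 6, one could simply refit the same specification under two different models and compare the estimated ATEs (the specification repeats the one used above):
. ivtreatreg children educ7 age agesq evermarr urban electric tv,
> hetero(age agesq evermarr urban) iv(frsthalf) model(probit-2sls)
. ivtreatreg children educ7 age agesq evermarr urban electric tv,
> hetero(age agesq evermarr urban) iv(frsthalf) model(heckit)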
8 References
Abadie, A., D. Drukker, J. L. Herr, and G. W. Imbens. 2004. Implementing matching estimators for average treatment effects in Stata. Stata Journal 4: 290–311.
Angrist, J. D. 1991. Instrumental variables estimation of average treatment effects in econometrics and epidemiology. NBER Technical Working Paper No. 0115. http://www.nber.org/papers/t0115.
Angrist, J. D., G. W. Imbens, and D. B. Rubin. 1996. Identification of causal effects using instrumental variables. Journal of the American Statistical Association 91: 444–455.
Austin, N. A. 2007. rd: Stata module for regression discontinuity estimation. Statistical Software Components S456888, Department of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s456888.html.
Becker, S. O., and A. Ichino. 2002. Estimation of average treatment effects based on propensity scores. Stata Journal 2: 358–377.
Cattaneo, M. D. 2010. Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics 155: 138–154.
Cattaneo, M. D., D. M. Drukker, and A. D. Holland. 2013. Estimation of multivalued treatment effects under conditional independence. Stata Journal 13: 407–450.
Cerulli, G. 2014. treatrew: A user-written command for estimating average treatment effects by reweighting on the propensity score. Stata Journal 14: 541–561.
Cox, D. R. 1958. Planning of Experiments. New York: Wiley.
Heckman, J. J. 1978. Dummy endogenous variables in a simultaneous equation system. Econometrica 46: 931–959.
Heckman, J. J., R. J. LaLonde, and J. A. Smith. 1999. The economics and econometrics of active labor market programs. In Handbook of Labor Economics, ed. O. Ashenfelter and D. Card, vol. 3A, 1865–2097. Amsterdam: Elsevier.
Holland, P. W. 1986. Statistics and causal inference. Journal of the American Statistical Association 81: 945–960.
Hudgens, M. G., and M. E. Halloran. 2008. Toward causal inference with interference. Journal of the American Statistical Association 103: 832–842.
Imbens, G. W., and J. M. Wooldridge. 2009. Recent developments in the econometrics of program evaluation. Journal of Economic Literature 47: 5–86.
Leuven, E., and B. Sianesi. 2003. psmatch2: Stata module to perform full Mahalanobis and propensity score matching, common support graphing, and covariate imbalance testing. Statistical Software Components S432001, Department of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s432001.html.
Rubin, D. B. 1978. Bayesian inference for causal effects: The role of randomization. Annals of Statistics 6: 34–58.
Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press.
———. 2013. Introductory Econometrics: A Modern Approach. 5th ed. Mason, OH: South-Western.
Appendix
Derivation of ATET(x), ATET, ATENT(x), and ATENT in the heckit model
Proof.
The heckit model with observable and unobservable heterogeneity relies on these assumptions:
1. y = μ0 + αw + xβ0 + w(x − x̄)δ + u
2. E(e1 | x, z) = E(e0 | x, z) = 0
3. w = 1(γ0 + γ1x + γ2z + a ≥ 0) = 1(q ≥ 0)
4. E(a | x, z) = 0
5. (a, e0, e1) is trivariate normal
6. a ~ N(0, 1), that is, σ_a = 1
7. u = e0 + w(e1 − e0)
where
λ1(q) = φ(q)/Φ(q)
As for ATET, applying a similar procedure, it is immediate to get
ATENT(x) = {α + (x − x̄)δ + (ρ1 + ρ0) λ0(q)}_(w=0)
ATENT = α + {1/(Σ_i (1 − wi))} Σ_i (1 − wi)(xi − x̄)δ + (ρ1 + ρ0) {1/(Σ_i (1 − wi))} Σ_i (1 − wi) λ0i(q)
where
λ0(q) = φ(q)/{1 − Φ(q)}
ATET(x) = {α + (x − x̄)δ + (ρ1 + ρ0) λ1(q)}_(w=1)
ATENT(x) = {α + (x − x̄)δ + (ρ1 + ρ0) λ0(q)}_(w=0)
It follows that, by the law of iterated expectations, E(·) = p(w = 1)E(· | q ≥ 0) + p(w = 0)E(· | q < 0) = 0, because E(e1 − e0) = 0, proving that
ATE(x) = α + (x − x̄)δ
and finally
ATE = Ex{ATE(x)} = α
The Stata Journal (2014)
14, Number 3, pp. 481–498
1 Introduction
Markov regime-switching models are frequently used in economic analysis and are prevalent in fields such as finance, industrial organization, and business cycle theory. Unfortunately, conducting proper inference with these models can be exceptionally challenging. In particular, testing for the possible presence of multiple regimes requires the use of a nonstandard test statistic and critical values that may differ across model specifications.
Cho and White (2007) demonstrate that because of the unusually complicated nature of the null space, the appropriate measure for a test of multiple regimes in the Markov regime-switching framework is a quasi-likelihood-ratio (QLR) statistic. They provide an asymptotic null distribution for this test statistic from which critical values should be drawn. Because this distribution is a function of a Gaussian process, the critical values are difficult to obtain from a simple closed-form distribution. Moreover, the elements of the Gaussian process underlying the asymptotic null distribution are dependent upon one another. Thus the critical values depend on the covariance of the Gaussian process and, because of the complex nature of this covariance structure, are best calculated using numerical approximation. In this article, we summarize the steps necessary for such an approximation and introduce the new command rscv, which can be used to produce the desired regime-switching critical values for a QLR test of only one regime.
We focus on a simple linear model with Gaussian errors, but the QLR test and the rscv command are generalizable to a much broader class of models. This methodology can be applied to models with multiple covariates and non-Gaussian errors.
2 Null hypothesis
Specifying a Markov regime-switching model requires a test to confirm the presence of multiple regimes. The first step is to test the null hypothesis of one regime against the alternative hypothesis of Markov switching between two regimes. If this null hypothesis can be rejected, then one can proceed to estimate Markov regime-switching models with two or more regimes. The key to conducting valid inference is then a test of the null hypothesis of one regime that yields an asymptotic size equal to or less than the nominal test size.
To understand how to conduct valid inference for the null hypothesis of only one regime, consider a basic regime-switching model,
yt = μ0 + δst + ut (1)
where ut ~ i.i.d. N(0, σ²). The unobserved state variable st ∈ {0, 1} indicates the regime: in state 0, yt has mean μ0, while in state 1, yt has mean μ1 = μ0 + δ. The sequence (st), t = 1, . . . , n, is generated by a first-order Markov process with P(st = 1 | st−1 = 0) = p0 and P(st = 0 | st−1 = 1) = p1.
The key is to understand the parameter space that corresponds to the null hypothesis. Under the null hypothesis, there is one regime with mean μ*. Hence, the null parameter space must capture all the possible regions that correspond to one regime. The first region corresponds to the assumption that μ0 = μ1 = μ*, together with the assumption that each of the two regimes is observed with positive probability: p0 > 0 and p1 > 0. The nonstandard feature of the null space is that it includes two additional regions, each of which also corresponds to one regime with mean μ*. The second region corresponds to the assumption that only regime 0 occurs with positive probability, p0 = 0, and that μ0 = μ*. In this second region, the mean of regime 1, μ1, is not identified, so this region of the null hypothesis does not impose any value on μ1 − μ0. The third region is a mirror image of the second region, where now the assumption is that regime 1 occurs with probability 1: p1 = 0 and μ1 = μ*. The three regions are depicted in figure 1. The vertical distance measures the value of p0 and of p1, and the horizontal distance measures the value of μ1 − μ0. Thus the vertical line at μ1 − μ0 = 0 captures the region of the null parameter space that corresponds to the assumption that μ0 = μ1 = μ* together with p0, p1 ∈ (0, 1). The lower horizontal line captures the region of the null parameter space where p0 = 0 and μ1 − μ0 is unrestricted. Similarly, the upper horizontal line captures the region of the null parameter space where p1 = 0 and μ1 − μ0 is unrestricted.
Figure 1. The null parameter space: the vertical line μ1 − μ0 = 0 and the horizontal lines p0 = 0 and p1 = 0
The additional curves that correspond to the values p0 = 0 and p1 = 0 help prevent one from misclassifying a small group of extremal values as a second regime. In figure 1, we depict the null space together with local neighborhoods for two points in this space. These two neighborhoods illustrate the different roles of the three curves in the null space. Points in the circular neighborhood of the point on μ1 − μ0 = 0 correspond to processes with two regimes that have only slightly separated means. Points in the semicircular neighborhood around the point on p1 = 0 correspond to processes in which there are two regimes with widely separated means, one of which occurs infrequently. Because a researcher is often concerned that rejection of the null hypothesis of one regime is due to a small group of outliers rather than multiple regimes, including these boundary values reduces this type of false rejection. Consequently, a valid test of the null hypothesis of one regime must account for the entire null region and include all three curves.
The asymptotic null distribution of QLRn is (Cho and White 2007, theorem 6(b), 1692)
QLRn → max[ {max(0, G̃)}², sup_θ {G⁻(θ)}² ] (2)
where G(θ) is a Gaussian process indexed by the standardized separation between the regime means, G⁻(θ) := min{0, G(θ)}, and G̃ is a standard Gaussian random variable correlated with G(θ). (For a more complete description of (2), see Bostwick and Steigerwald [2012].)
The critical value for a test based on the statistic QLRn thus corresponds to a quantile of the largest value over {max(0, G̃)}² and sup_θ {G⁻(θ)}². To determine this quantity, one must account for the covariance among the elements of G(θ) as well as their covariance with G̃. The structure of this covariance, which is described in detail in
Bostwick and Steigerwald (2012), is
E{G(θ)G(θ̃)} = (e^(θθ̃) − 1 − θθ̃ − (θθ̃)²/2) / [ (e^(θ²) − 1 − θ² − θ⁴/2)^(1/2) (e^(θ̃²) − 1 − θ̃² − θ̃⁴/2)^(1/2) ] (3)
4.2 Description
rscv simulates the asymptotic null distribution of QLRn and returns the corresponding critical value. If no options are specified, rscv returns the critical value for a size-5% QLR test with a regime separation of 1 standard deviation, calculated over 100,000 replications.
4.3 Options
ll(#) specifies a lower bound on the interval H containing the number of standard deviations, θ, separating the regime means, where θ ∈ H. The default is ll(-1), meaning that the mean of regime 1 is no more than 1 standard deviation below the mean of regime 2.
ul(#) specifies an upper bound on the interval H containing the number of standard deviations separating the regime means. The default is ul(1), meaning that the mean of regime 1 is no more than 1 standard deviation above the mean of regime 2.
r(#) specifies the number of simulation replications to be used in calculating the critical values. The default is r(100000), meaning that the simulation will be run 100,000 times.
q(#) specifies the quantile for which a critical value should be calculated. The default is q(0.95), which corresponds to a nominal test size of 5%.
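For instance, the default critical value and a more conservative variant could be requested as follows (output omitted; the second call merely illustrates the options):
. rscv
. rscv, ll(-3) ul(3) q(0.99)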
rscv approximates the Gaussian process G(θ) by the truncated series
G^A(θ) := (e^(θ²) − 1 − θ² − θ⁴/2)^(−1/2) Σ_{k=3}^{K} {θ^k/√(k!)} Z_k,   Z_k ~ i.i.d. N(0, 1)
where K determines the accuracy of the Taylor-series approximation. Note that the covariance of this simulated process, E{G^A(θ)G^A(θ̃)}, is identical to the covariance structure of G(θ) in (3).
We must also account for the covariance between G̃ and G(θ). Cho and White (2007) establish that this covariance corresponds to the term in the Taylor-series expansion for k = 4. Thus we set G̃ = Z_4 so that Cov{G̃, G(θ)} = Cov{G̃, G^A(θ)}.
Therefore, the critical value that corresponds to (2) for a test size of 5% is the 0.95 quantile of the simulated value
max[ {max(0, Z_4)}², max_{θ∈H} {min(0, G^A(θ))}² ] (4)
The rscv command executes the numerical simulation of (4) by first generating the series (Z_k), k = 0, . . . , K, as i.i.d. N(0, 1) draws. For each value θ in a discrete subset of H, it then constructs G^A(θ) = (e^(θ²) − 1 − θ² − θ⁴/2)^(−1/2) Σ_{k=3}^{K} {θ^k/√(k!)} Z_k. The command then obtains the value mi = max[{max(0, Z_4)}², max_{θ∈H} {min(0, G^A(θ))}²], corresponding to (2), for each replication (indexed by i). Let (m_[i]), i = 1, . . . , r, be the vector of ordered values of mi calculated across replications. The command rscv returns the critical value for a test with size q from m_[(1−q)r].
For each replication, rscv calculates G^A(θ) at a fine grid of values over the interval H. Doing so requires three quantities: the interval H (which must encompass the true value of θ), the grid of values over H (given by the grid mesh), and the number of desired terms in the Taylor-series approximation, K. The user specifies the interval H using the ll() and ul() options. If μ0 is thought to lie within 3 standard deviations of μ1, the interval is H = [−3.0, 3.0]. Because the process is calculated at only a finite number of values, the accuracy of the calculated maximum increases as the grid mesh shrinks. Thus the command rscv implements a grid mesh of 0.01, as recommended in Cho and White (2007, 1693). For the interval H = [−3.0, 3.0] and a grid mesh of 0.01, the process is calculated at the points (−3.00, −2.99, . . . , 3.00).
Given the grid mesh of 0.01 and the user-specified interval H, we must determine the appropriate value of K. To do so, we consider the approximation error ε_{K,θ} = (e^(θ²) − 1 − θ² − θ⁴/2)^(−1/2) Σ_{k=K+1}^{∞} {θ^k/√(k!)} Z_k. We want to ensure that as K increases, the variance of ε_{K,θ} decreases toward zero. Carter and Steigerwald (2013) show that for large K, var(ε_{K,θ}) shrinks at a rate governed by e^(2K log θ̄ − K log K), where θ̄ = max_{θ∈H} |θ|. Therefore, the command rscv implements a value of K such that, for the user-specified interval H, (max_{θ∈H} |θ|)²/K ≤ 1/2; for H = [−3.0, 3.0], for example, this rule requires K ≥ 18.
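Putting these pieces together, one replication of (4) can be sketched in Stata's Mata as follows. This is a rough illustration under the reconstruction above, not the rscv implementation; K, the grid, and H are arbitrary choices, and θ = 0 is excluded because the normalizing factor vanishes there.
mata:
    K = 25                                        // truncation of the series
    z = rnormal(K+1, 1, 0, 1)                     // z[k+1] plays the role of Z_k
    theta = rangen(-1, -.01, 100) \ rangen(.01, 1, 100)   // grid over H = [-1, 1], skipping 0
    GA = J(rows(theta), 1, .)
    for (j = 1; j <= rows(theta); j++) {
        t = theta[j]
        s = 0
        for (k = 3; k <= K; k++) s = s + t^k/sqrt(factorial(k))*z[k+1]
        GA[j] = s/sqrt(exp(t^2) - 1 - t^2 - t^4/2)
    }
    m = max((max((0, z[5]))^2, max(rowmin((J(rows(theta), 1, 0), GA)):^2)))
    m
end
Repeating this r times and taking the appropriate empirical quantile of the resulting m values gives the critical value.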
The rscv command also allows the user to specify the number of simulation replications and the desired quantile. For large values of H and the default number of replications (r = 100000), the rscv command could require more memory than a 32-bit operating system can provide. In this case, the user may need to specify a smaller number of replications to calculate the critical values for the desired interval H. Critical values derived using fewer simulation replications may be stable to only one significant digit. Table 1 depicts the results of rscv for a size-5% test over varying values of ll(), ul(), and r().
5 Example
We demonstrate how to test for the presence of multiple regimes through an example from the economics literature. Unlike the simple model that we have considered until now, (1), the model in this example includes several added complexities that are commonly used in regime-switching applications. We describe how to construct the QLR test statistic for this more general model, how to use existing Stata commands to obtain the value of the test statistic, and, finally, how to use the new command, rscv, to obtain an appropriate critical value.
Our example is derived from Bloom, Canning, and Sevilla (2003), who test whether the large differences in income levels across countries are better explained by differences in intrinsic geography or by a regime-switching model where the regimes correspond to distinct equilibria. To this end, the authors use cross-sectional data to analyze the distribution of per capita income levels for countries with similar exogenous characteristics and test for the presence of multiple regimes.
Bloom, Canning, and Sevilla (2003) propose a model of switching between two possible equilibria. Regime 1 occurs with probability p(x) and corresponds to countries that are in a poverty-trap equilibrium,
y = α1 + β1x + ε1, Var(ε1) = σ1² (5)
while regime 2 occurs with probability 1 − p(x) and corresponds to
y = α2 + β2x + ε2, Var(ε2) = σ2² (6)
In both regimes, y is the log gross domestic product per capita, and x is the absolute latitude, which functions as a catchall for a variety of exogenous geographic characteristics.
This model differs from a Markov regime-switching model in that the authors are looking at different regimes in a cross-section rather than over time. Thus the probability of being in either regime is stationary, and the unobserved regime indicator is an i.i.d. random variable. This modification corresponds exactly to that made by Cho and White (2007) to create the quasi-log-likelihood, so in this example, the log-likelihood ratio and the QLR are one and the same.
Note that this model is more general than the basic regime-switching model presented in section 2. Bloom, Canning, and Sevilla (2003) have allowed for three generalizations: covariates with coefficients that vary across regimes, error variances that are regime specific, and regime probabilities that depend on the included covariates. However, as Carter and Steigerwald (2013) discuss, the asymptotic null distribution (2) is derived under the following assumptions: that the difference between regimes be only in the intercept μj, that the variance of the error terms be constant across regimes, and that the regime probabilities not depend on the exogenous characteristic x. Thus, to form the test statistic, we must fit the following two-regime model: regime 1 occurs with probability p and corresponds to
y = μ1 + βx + ε (5′)
while regime 2 occurs with probability 1 − p and corresponds to
y = μ2 + βx + ε (6′)
where Var(ε) = σ².
Simplifying the model like this does not diminish the validity of the QLR as a one-regime test for the model in (5) and (6). Under the null hypothesis of one regime, there is necessarily only one error variance, only one coefficient for each covariate, and a regime probability equal to one. Thus, under the null hypothesis, the QLR test will necessarily have the correct size even if the data are accurately modeled by a more complex system. Once the null hypothesis is rejected using this restricted model, the researcher can then fit a model with regime-specific variances and coefficients, if desired.1
For the restricted model in (5′) and (6′), the quasi-log-likelihood is
Ln(p, σ², β, μ1, μ2) = (1/n) Σ_{t=1}^{n} lt(p, σ², β, μ1, μ2)
where lt(p, σ², β, μ1, μ2) := log{p f(yt | xt; σ², β, μ1) + (1 − p) f(yt | xt; σ², β, μ2)}, and f(yt | xt; σ², β, μj) is the conditional density for j = 1, 2. It is common to assume, as Bloom, Canning, and Sevilla (2003) do, that ε is a normal random variable2 so that f(yt | xt; σ², β, μj) = {1/√(2πσ²)} e^(−(yt − μj − βxt)²/(2σ²)). Let (p̂, σ̂², β̂, μ̂1, μ̂2) be the values that maximize Ln, and let (1, σ̃², β̃, μ̃1, μ̃2) be the values that make Ln as large as possible under the null hypothesis of one regime. The QLR statistic is then
QLRn = 2n{Ln(p̂, σ̂², β̂, μ̂1, μ̂2) − Ln(1, σ̃², β̃, μ̃1, μ̃2)}
To estimate QLRn, we use the same Penn World Table and CIA World Factbook data as in Bloom, Canning, and Sevilla (2003).3 First, we must determine the parameter values that maximize the quasi-log-likelihood under the null hypothesis, (1, σ̃², β̃, μ̃1, μ̃2), and evaluate the quasi-log-likelihood at those values. To obtain these parameter values, we estimate a linear regression of y on x, which corresponds to maximizing
Ln(1, σ², β, μ1, μ2) = (1/n) Σ_{t=1}^{n} log[{1/√(2πσ²)} e^(−(yt − μ1 − βxt)²/(2σ²))]
While this can be achieved with a simple ordinary least-squares command, we also need the value of the log-likelihood, so we detail how to use Stata commands to obtain both the parameter estimates and this value.
1. With a more complex data-generating process, these restrictions could lead to an increased probability of failing to reject a false null hypothesis and, hence, a decrease in the power of the QLR test.
2. Bloom, Canning, and Sevilla (2003) assume normally distributed errors, but the QLR test allows for any error distribution within the exponential family.
3. Latitude data for countries appearing in the 1985 Penn World Tables and missing from the CIA World Factbook come from https://www.google.com/.
To find (1, σ̃², β̃, μ̃1, μ̃2), we use the following code, which relies on the Stata command ml.
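The published code itself is not reproduced in this extract; a minimal ml setup consistent with the output below might look like the following sketch. The evaluator name is hypothetical, while the variable names lgdp and latitude are taken from the code shown later in this section.
program define onereg_ll
    // lf evaluator: single-regime normal density with constant mean shift in latitude
    args lnf mu beta sigma
    quietly replace `lnf' = ln(normalden($ML_y1, `mu' + `beta'*latitude, `sigma'))
end
ml model lf onereg_ll (mu: lgdp = ) (beta: ) (sigma: )
ml maximize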
                    Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
mu
  _cons          6.927805    1.420095    4.88   0.000      4.144469    9.711141
beta
  _cons          .0408554     .049703    0.82   0.411     -.0565607    .1382714
sigma
  _cons          .8019654    .5670752    1.41   0.157     -.3094815    1.913412
. matrix gammasingle=e(b)
. generate llf1regime=ln(((2*_pi*gammasingle[1,3]^2)^(-1/2))*
> exp((-1/(2*gammasingle[1,3]^2))*
> (lgdp-gammasingle[1,1]-gammasingle[1,2]*latitude)^2))
. quietly summarize llf1regime
. quietly replace llf1regime=r(sum)
. display "Final estimated quasi-log-likelihood for one regime: " llf1regime
Final estimated quasi-log-likelihood for one regime: -182.1338
Estimation is more involved under the alternative hypothesis, because the quasi-log-likelihood involves the log of the sum of two terms,
Ln(p, σ², β, μ1, μ2) = (1/n) Σ_{t=1}^{n} log{p f(yt | xt; σ², β, μ1) + (1 − p) f(yt | xt; σ², β, μ2)}
We therefore proceed iteratively:
1. Choose starting guesses for the parameter values p(0), σ²(0), β(0), μ1(0), μ2(0).
2. For each observation, calculate τt = P(st = 1 | yt, xt) such that
τt = p(0) f(yt | xt; σ²(0), β(0), μ1(0)) / {p(0) f(yt | xt; σ²(0), β(0), μ1(0)) + (1 − p(0)) f(yt | xt; σ²(0), β(0), μ2(0))}
3. Use Stata's ml command to find the parameter values p(1), σ²(1), β(1), μ1(1), μ2(1) that maximize the complete log-likelihood
LCn(p, σ², β, μ1, μ2) = (1/n) Σ_{t=1}^{n} {τt log f(yt | xt; σ², β, μ1) + (1 − τt) log f(yt | xt; σ², β, μ2) + τt log p + (1 − τt) log(1 − p)}
5. If all 3 convergence criteria are less than some tolerance level (we use 1/n), then quit and use p(1), σ²(1), β(1), μ1(1), μ2(1) as the final parameter estimates. Otherwise, repeat steps 2–5 with p(1), σ²(1), β(1), μ1(1), μ2(1) as the new starting guesses.
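A minimal sketch of the E-step in step 2, assuming the current parameter values are held in scalars p0, mu1_0, mu2_0, beta0, and sigma0 (hypothetical names), is:
. generate double f1  = normalden(lgdp, mu1_0 + beta0*latitude, sigma0)
. generate double f2  = normalden(lgdp, mu2_0 + beta0*latitude, sigma0)
. generate double tau = p0*f1/(p0*f1 + (1 - p0)*f2)
The variable tau then enters the complete log-likelihood maximized in step 3.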
Running this procedure on the data yields the following two-regime estimates:
                    Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
mu1
  _cons          6.532847    1.148891    5.69   0.000      4.281062    8.784632
mu2
  _cons          7.813265     1.45266    5.38   0.000      4.966102    10.66043
beta
  _cons          .0451607    .0374139    1.21   0.227     -.0281691    .1184905
sigma
  _cons          .5986278    .4232938    1.41   0.157     -.2310128    1.428268
p
  _cons          .7708245    .4203024    1.83   0.067      -.052953    1.594602
Thus we have n Ln(p̂, σ̂², β̂, μ̂1, μ̂2) = −179.9662. Then, to calculate the test statistic, QLRn, we type
. generate QLR=2*(llf2reg-llf1reg)
. display "Quasi-likelihood-ratio test statistic of one regime: " QLR
Quasi-likelihood-ratio test statistic of one regime: 4.3352051
These estimates and the resulting QLR test statistic are summarized in table 2. For the
complete Stata code used to create table 2, see the appendix.
Finally, we use the rscv command to calculate the critical value for the QLR test of size 5%. We allow for the possibility that the two regimes are widely separated and set H = [−5.0, 5.0]. The command and output are shown below.
. rscv, ll(-5) ul(5) r(100000) q(0.95)
7.051934397
Given that this critical value of 7.05 exceeds the QLR statistic of 4.3, we cannot reject
the null hypothesis of one regime.
This result is consistent with the findings of Bloom, Canning, and Sevilla (2003), although they use a different method to obtain the necessary critical values. They
report a likelihood ratio and the corresponding critical values for a restricted version of their model where the regime probabilities are fixed (p does not depend on x). Using this restricted model, the authors do not reject the null hypothesis of one regime. At the time that Bloom, Canning, and Sevilla (2003) was published, researchers had yet to successfully derive the asymptotic null distribution for a likelihood-ratio test of regime switching. Therefore, the authors use Monte Carlo methods to generate their critical values using random data generated from the estimated relationship given by the model in (5) and (6). The primary disadvantage of this approach is that the derived critical values are then dependent upon the authors' assumptions concerning the underlying data-generating process.
Bloom, Canning, and Sevilla (2003) go on to report a likelihood-ratio test of a single
regime model against the unrestricted model with latitude-dependent regime probabili-
ties. With the unrestricted model, the authors can use the likelihood ratio and simulated
critical values to reject the null hypothesis in favor of the alternative of two regimes.
Because the null distribution derived by Cho and White (2007) applies to only the QLR constructed using the two-regime model given in (5′) and (6′), we cannot use the QLR test and, hence, the rscv command to obtain the critical values necessary to evaluate this unrestricted test statistic.
6 Discussion
We provide a methodology and a new command, rscv, to construct critical values for
a test of regime switching for a simple linear model with Gaussian errors. Despite
the complexity of the underlying methodology, rscv is relatively simple to execute and
merely requires the researcher to provide a range for the standardized distance between
regime means. In section 5, we demonstrate how these methods can be generalized
to a very broad class of models, and we discuss the restrictions necessary to properly
estimate the QLR statistic and use the rscv critical values.
7 References
Bloom, D. E., D. Canning, and J. Sevilla. 2003. Geography and poverty traps. Journal of Economic Growth 8: 355–378.
Bostwick, V. K., and D. G. Steigerwald. 2012. Obtaining critical values for test of Markov regime switching. Economics Working Paper Series qt3685g3qr, University of California, Santa Barbara. http://ideas.repec.org/p/cdl/ucsbec/qt3685g3qr.html.
Carter, A. V., and D. G. Steigerwald. 2012. Testing for regime switching: A comment. Econometrica 80: 1809–1812.
Cho, J. S., and H. White. 2007. Testing for regime switching. Econometrica 75: 1671–1720.
Appendix
The following Stata code was used to create table 2. The code fits the model in section 5
under the alternative hypothesis of two regimes using the EM algorithm and then under
the null hypothesis of one regime using the Stata ml command. Finally, the QLR test
statistic is calculated.
* Estimating QLR test statistic for Bloom, Canning, and Sevilla (2003)
/***************************************************/
* First, estimate parameters and log likelihood for the case of two regimes:
* lgdp = theta0 + delta*latitude + u~N(0,sigma2) with probability (1-lambda)
* lgdp = theta1 + delta*latitude + u~N(0,sigma2) with probability lambda
/***************************************************/
* Start with initial guess for theta0, theta1, delta, sigma2, and lambda:
regress lgdp latitude
matrix beta=e(b)
svmat double beta, names(matcol)
scalar dhat=betalatitude
generate intercept=lgdp-dhat*latitude
summarize intercept
scalar t0hat=r(mean)-r(Var)
scalar t1hat=r(mean)+r(Var)
scalar shat=sqrt(r(Var))
scalar lhat=0.5
matrix gammahat=(t1hat, t0hat, dhat, shat, lhat)
display "Original guess for parameter values: "
matrix list gammahat
/***************************************************/
* Start loop that continues until parameter estimates have converged
generate error1=10
generate error2=10
generate error3=10
generate tol=1/_N
generate count=0
generate count1=1
generate count2=1
generate count3=1
generate f1=0
generate f0=0
generate fboth=0
generate etahat=0
generate llfhat=0
generate llfnew=0
generate fdelta=0
generate fnew=0
generate lnllfnew=0
generate lnllfdelta=0
generate nd1=0
generate nd2=0
generate nd3=0
generate nd4=0
generate nd5=0
/***************************************************/
* Now use etahat to create and maximize log-likelihood function
/***************************************************/
* Check whether the parameter estimates have converged
mata: st_matrix("temp", max(abs(st_matrix("gammanew")-st_matrix("gammahat"))))
quietly replace error1=temp[1,1]
/***************************************************/
* Keep track of when each convergence criterion is met
quietly replace count1=count1+1 if error1>tol
quietly replace count2=count2+1 if error2>tol
quietly replace count3=count3+1 if error3>tol
* End of loop
}
/***************************************************/
* Calculate final log likelihood for two regimes
quietly replace f1=((2*_pi*gammanew[1,4]^2)^(-1/2))* ///
exp((-1/(2*gammanew[1,4]^2))*(lgdp-gammanew[1,1]-gammanew[1,3]*latitude)^2)
quietly replace f0=((2*_pi*gammanew[1,4]^2)^(-1/2))* ///
exp((-1/(2*gammanew[1,4]^2))*(lgdp-gammanew[1,2]-gammanew[1,3]*latitude)^2)
generate f2reg=gammanew[1,5]*f1+(1-gammanew[1,5])*f0
generate llf2reg=ln(f2reg)
quietly summarize llf2reg
quietly replace llf2reg=r(sum)
* Output final parameter estimates
display "Final estimated parameter values for two regimes: "
matrix list gammanew
display "Final estimated log likelihood for two regimes: " llf2reg
display "Total number of loop iterations: " count
display "Parameter values converged after " count1 " iterations"
display "Log likelihood value converged after " count2 " iterations"
display "Gradient of Log likelihood converged after " count3 " iterations"
/***************************************************/
* Second, estimate parameters and log likelihood for the case of only one regime:
/***************************************************/
* Finally, calculate QLR test statistic:
generate QLR=2*(llf2reg-llf1reg)
display "Quasi-likelihood-ratio test statistic of one regime: " QLR
The Stata Journal (2014) 14, Number 3, pp. 499–510
Abstract. The analysis of multinomial data often includes the following question of interest: Is a particular category the most populous (that is, does it have the largest probability)? Berry (2001, Journal of Statistical Planning and Inference 99: 175–182) developed a likelihood-ratio test for assessing the evidence for the existence of a unique most probable category. Nettleton (2009, Journal of the American Statistical Association 104: 1052–1059) developed a likelihood-ratio test for testing whether a particular category was most probable, showed that the test was an example of an intersection-union test, and proposed other intersection-union tests for testing whether a particular category was most probable. He extended his likelihood-ratio test to the existence of a unique most probable category and showed that his test was equivalent to the test developed by Berry (2001, Journal of Statistical Planning and Inference 99: 175–182). Nettleton (2009, Journal of the American Statistical Association 104: 1052–1059) showed that the likelihood ratio for identifying a unique most probable cell could be viewed as a union-intersection test. The purpose of this article is to survey different methods and present a command, cellsupremacy, for the analysis of multinomial data as it pertains to identifying the significantly most probable category; the article also presents a command for sample-size calculations and power analyses, power cellsupremacy, that is useful for planning multinomial data studies.
Keywords: st0348, cellsupremacy, cellsupremacyi, power cellsupremacy, most probable category, multinomial data, cell supremacy, cell inferiority
1 Introduction
If Y1, Y2, . . . , Yk are independent Poisson-distributed random variables with means λ1, λ2, . . . , λk, then (Y1, Y2, . . . , Yk), conditional on their sum, is multinomial(N, p1, p2, . . . , pk), where pi = λi / Σ_{j=1}^k λj represents the probability of the ith category. Multinomial
data are common in biological, marketing, and opinion research scenarios. In a recent
study, Price et al. (2011) used data from the 2008 National Health Interview Survey
to examine whether 18- to 26-year-old women who are most likely to benefit from
catch-up vaccination are aware of the human papillomavirus (HPV) vaccine and have
received initial and subsequent doses in the 3-dose series. The study found that the
most common reasons for lack of interest in the HPV vaccine were belief that it was not
needed (35.9%), not knowing enough about it (17.1%), concerns about safety (12.7%),
and not being sexually active (10.3%). These 4 responses were among the 11 possible
response categories to the survey question. Is the belief among respondents that the HPV
vaccine was not needed the unique most probable reason for lack of interest in the HPV
vaccine? Response to questionnaire-based infertility studies varies, and Morris et al. (2013) noted that different modes of contact can affect response. Results of their study
indicated that 59% of the women surveyed preferred a mailed questionnaire, 37% chose
an online questionnaire, and only 3% selected a telephone interview as their mode of
contact. Is a mailed questionnaire the most preferred mode of contact? Are these
results significant? The purpose of this article is to survey different methods and to present a command for the analysis of multinomial data as it pertains to identifying the significantly most probable category; the article also presents a command for sample-size
calculations and power analyses that is useful for planning multinomial data studies.
2 Methods
Nettleton (2009) posed the test for the supremacy of a multinomial cell probability as an
intersection-union test (IUT). Suppose X = (X1 , . . . , Xk ) has a multinomial distribution
with n trials and the cell probabilities p1 , . . . , pk . The parameter p = (p1 , . . . , pk ) lies
in the set P of vectors of order k, whose components are positive and sum to one.
The tested null hypothesis states that a particular cell of interest is not more probable
than all others. Suppose the kth cell is the cell of interest; then the hypothesis can be
formulated as

   H0: ⋃_{i=1}^{k−1} {pk ≤ pi}   versus   H1: ⋂_{i=1}^{k−1} {pk > pi}

which Nettleton (2009) noted can be stated as

   H0: pk ≤ max(p1, . . . , pk−1)   versus   H1: pk > max(p1, . . . , pk−1)
Nettleton (2009) offered three possible asymptotic IUT statistics: the score test, the Wald test, and the likelihood-ratio test. Suppose x = (x1, . . . , xk) is a realization of X = (X1, . . . , Xk); then p̂i = xi/n, so that p̂ = (p̂1, . . . , p̂k) is the maximum likelihood estimate of p = (p1, . . . , pk). Each asymptotic IUT statistic is zero unless xk is greater than max(x1, . . . , xk−1). Nettleton (2009) also suggested a test based on the conditional distribution of Xk, given the sum of xk and m, where m = max(x1, . . . , xk−1).
p-value for the test is given by Pr{χ²(1) ≥ TS}/2, where χ²(1) denotes a χ² random variable with 1 degree of freedom.
   TW = n(p̂k − p̂M)² / {p̂k + p̂M − (p̂k − p̂M)²}   if p̂k > p̂M = max(p̂1, . . . , p̂k−1)
   TW = 0   otherwise
H0 is rejected if and only if TW ≥ χ²(1),1−2α. The approximate p-value for the test is given by Pr{χ²(1) ≥ TW}/2.
H0 is rejected if and only if TLR ≥ χ²(1),1−2α. The approximate p-value for the test is given by Pr{χ²(1) ≥ TLR}/2.
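As a small illustration, the Wald IUT statistic and its approximate p-value can be computed directly from raw counts; the sketch below uses the counts 45, 28, and 27 from the example in section 3.4 and assumes the first cell (45) is the cell of interest. cellsupremacyi reports these tests automatically, so the code is purely illustrative.

scalar ntot = 45 + 28 + 27
scalar pk   = 45/ntot                          // cell of interest
scalar pM   = max(28, 27)/ntot                 // largest competing cell
scalar TW   = cond(pk > pM, ntot*(pk - pM)^2/(pk + pM - (pk - pM)^2), 0)
scalar pval = chi2tail(1, TW)/2                // approximate p-value
display "T_W = " TW "   approximate p-value = " pval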
   p-value = Σ_{x = xk}^{m + xk} C(m + xk, x) 2^{−(m + xk)}

where C(n, x) denotes a binomial coefficient.
The simulation studies by Nettleton (2009) showed that the conditional IUT based on
the binomial distribution yielded a true p-value typically less than the nominal value.
Farcomeni (2012) suggested that the exact test (that is, conditional binomial) may be conservative and that the exact significance level may be smaller than the desired nominal level. Farcomeni (2012) suggested using the typical continuity correction for the binomial; namely, he recommended the mid-p value as the p-value of the test.
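A corresponding sketch of the conditional binomial (exact) p-value and the mid-p correction, again with the illustrative counts xk = 45 (cell of interest) and m = 28 (largest competing cell), might look as follows.

scalar xk = 45
scalar m  = 28
scalar pexact = binomialtail(m + xk, xk, 0.5)              // Pr(X >= xk | X + M = m + xk)
scalar pmid   = pexact - 0.5*binomialp(m + xk, xk, 0.5)    // Farcomeni's mid-p value
display "exact p-value = " pexact "   mid-p value = " pmid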
One could formulate the test for cell inferiority (that is, a particular cell is least probable) as H0: pk ≥ min(p1, . . . , pk−1) versus H1: pk < min(p1, . . . , pk−1).
Farcomeni (2012) suggests using the exact test for inferiority where the sum goes from 0 to xk. That is, the p-value for the conditional IUT for inferiority would be

   p-value = Σ_{x = 0}^{xk} C(m + xk, x) 2^{−(m + xk)}
Alam and Thompson (1972) discussed the challenges of testing whether a particular
cell is least probable from a design point of view. Nettleton (2009) showed that the
likelihood-ratio test statistic could be used to test for the existence of a unique most
probable cell. That is, rather than test whether a particular cell chosen a priori is
the most probable, one could test whether the largest observed cell was uniquely most
probable. The likelihood-ratio test statistic matches the test statistic developed by Berry (2001) and rejects H0 if and only if TLR ≥ χ²(1),1−2α. The approximate p-value for the test is given by Pr{χ²(1) ≥ TLR}, where χ²(1) denotes a χ² random variable with 1 degree of freedom. That is, the p-value is twice the p-value for the test in which a particular cell chosen a priori is most probable.
2.7 Power
We consider the case of a random variable X ∼ multinomial(n, p1, . . . , pk). Without loss of generality, we will assume that pk is the maximum among the k cells. Let
   H0: pk = pM   versus   H1: pk > pM

where p0 = (pk + pM)/2 (Guenther 1977). For example, consider the random variable X ∼ multinomial(n = 50, p1 = 0, p2 = 0, p3 = 0.3, p4 = 0.3, p5 = 0.4) at the α = 0.05 significance level. The null hypothesis is rejected if TS ≥ 2.70554. Solely based on p4 and p5, the noncentrality parameter for testing the 5th cell selected a priori as the most probable cell is

   100 (0.4 − 0.35)²/0.35 ≈ 0.71429

and the approximate power is 0.21833.
We have a trinomial, and there is strong competition for the maximum among the first k − 1 cells. Because the cells of a multinomial are not independent, one would expect the distribution of the first k − 1 cells to affect the power to detect the kth cell as the most probable. The simulated power for this scenario was 0.087. Thus the approximation of power must consider the impact of the distribution of the first k − 1 cells. The correlation between two cells of a multinomial is

   ρ_{a,b} = −sqrt[ pa pb / {(1 − pa)(1 − pb)} ]
The power to detect the 5th cell as the most probable is the power that p̂5 > p̂4 and p̂5 > p̂3. Consider approximating the power by

   power ≈ Pr{TS ≥ χ²(1),1−2α | pk, pM} × [ Pr{TS ≥ χ²(1),1−2α | pk, pN} ]^(1+ρ_{M,N})

where pM and pN represent the maximum and the second largest of the cell probabilities of the first k − 1 cells, respectively, and ρ_{M,N} represents the correlation between cells M and N. For our example, the approximate power is 0.0915.
Applying this form of the approximation to the original example with p1 through p3 equal to 0.1 and p4 equal to 0.3 yields an approximate power of

   power ≈ Pr{TS ≥ χ²(1),1−2α | p5 = 0.4, p4 = 0.3} × [ Pr{TS ≥ χ²(1),1−2α | p5 = 0.4, p3 = 0.1} ]^(1+ρ_{4,3})
         = (0.21833)(0.91232)^(1−0.21822)
         ≈ 0.20322
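The computation above can be reproduced in Stata with noncentral chi-squared probabilities. The sketch below assumes n = 50 (consistent with the power cellsupremacy call in section 3.4) and a noncentrality parameter of the form 2n(pk − p0)²/p0 with p0 = (pk + pM)/2, which reproduces the factors 0.21833 and 0.91232; power cellsupremacy performs this calculation internally, so the code is only illustrative.

scalar nn   = 50
scalar crit = invchi2(1, 0.90)                          // chi2(1) critical value for alpha = 0.05
scalar lamM = 2*nn*(0.4 - 0.35)^2/0.35                  // competition from p4 = 0.3
scalar lamN = 2*nn*(0.4 - 0.25)^2/0.25                  // competition from p3 = 0.1
scalar rho  = -sqrt(0.3*0.1/((1 - 0.3)*(1 - 0.1)))      // correlation between cells 4 and 3
scalar power = (1 - nchi2(1, lamM, crit))*(1 - nchi2(1, lamN, crit))^(1 + rho)
display %6.5f power                                     // approximately 0.20322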
Table 1 provides simulations of size 100,000 for several scenarios to investigate the adequacy of our proposed approximation. For each scenario, p6 is the cell of interest, ρ5,4 represents the correlation between the 5th and 4th cell, Sim. is the simulated power, and Approx. is our power approximation.
2.8 Conclusions
Nettleton (2009) suggested that the asymptotic procedures are preferred for moderate to
large sample sizes based on simulations, but the IUT based on conditional tests is a useful
option when a small sample size casts doubt on the validity of the asymptotic procedures.
Our power simulations tend to also suggest that the power approximation works best
for moderate to large sample sizes. Scenarios 29–32 present a slightly more complex problem with three cells vying for the top spot among the first k − 1 cells. For these scenarios, our power approximation yields slightly liberal results because the approximate power is consistently larger than the simulated power. Under this scenario, the power to detect the 6th cell as the most probable is the power that p̂6 > p̂5, p̂6 > p̂4, and p̂6 > p̂3. Thus one could improve the approximation by considering the added competition for supremacy among the first k − 1 cells. That is, for n = 200, the approximate power is
   power ≈ Pr{TS ≥ χ²(1),1−2α | p6 = 0.4, p5 = 0.2}
           × [ Pr{TS ≥ χ²(1),1−2α | p6 = 0.4, p4 = 0.2} ]^(1+ρ_{5,4})
           × [ Pr{TS ≥ χ²(1),1−2α | p6 = 0.4, p3 = 0.2} ]^(1+2ρ_{5,4})
         = (0.97761)(0.97761)^(1−0.25)(0.97761)^(1−0.50)
         ≈ 0.95032
which compares favorably with the simulated power. However, we believe that for most real-world problems, considering the impact of the top two cell probabilities among the first k − 1 cells is sufficient.
cellsupremacyi, counts(numlist)
power cellsupremacy, freq(numlist) n(#) simulate dots reps(#)
alpha(#)
dots shows the replication dots when using the simulate option.
reps(#) specifies the number of simulations used to calculate the power. The default is reps(10000).
alpha(#) specifies the alpha that is used for calculating the power. The default is alpha(0.05).
3.4 Examples
Suppose we are studying breast cancer and we find that the distribution of subtypes is a trinomial distribution with HER2+, HR+, and TNBC. In our data, we find that patients
with leptomeningeal disease were more likely to be HER2+ (45%). We are interested in
knowing whether this particular category is the most populous (that is, does it have
the largest probability of occurring?). The following example will generate a sample
dataset and illustrate the use of the new command to answer this question.
. set obs 100
obs was 0, now 100
. generate subtype = "HER2+" in 1/45
(55 missing values generated)
. replace subtype = "HR+" in 46/73
(28 real changes made)
. replace subtype = "TNBC" in 74/100
(27 real changes made)
. tab subtype
subtype Freq. Percent Cum.
The p-values for all tests are less than 0.05, which indicates that HER2+ is the most probable. The test for the existence of a most probable cell is also significant. On the other hand, if we were interested in cell inferiority (least probable), we would not reject our hypothesis because our p-values are approximately 0.50. Below is another example with a slightly different distribution than before.
. clear
. set obs 100
obs was 0, now 100
. generate subtype = "HER2+" in 1/45
(55 missing values generated)
. replace subtype = "HR+" in 46/85
(40 real changes made)
. replace subtype = "TNBC" in 86/100
(15 real changes made)
. tab subtype
subtype Freq. Percent Cum.
Because HER2+ and HR+ have similar frequencies, we cannot conclude that HER2+ is
the most probable. In this case, we can conclude that TNBC is the least probable cell. The
above examples can both be implemented by entering the raw counts cellsupremacyi
45 28 27 or cellsupremacyi 45 40 15, respectively.
To illustrate how to use the power cellsupremacy command to calculate the power
of the test, we consider the examples in section 2.7 for testing cell superiority for the
random variables

   X ∼ multinomial(n = 50, p1 = 0, p2 = 0, p3 = 0.3, p4 = 0.3, p5 = 0.4)

and

   Y ∼ multinomial(n = 50, p1 = 0.1, p2 = 0.1, p3 = 0.1, p4 = 0.3, p5 = 0.4)
. clear
. set seed 339487731
. power_cellsupremacy, simulate freq(0 0 0.3 0.3 0.4) n(50)
Simulations (10000)
N Simulated Power Approximate Power
50 0.0898 0.0915
. power_cellsupremacy, simulate freq(0.1 0.1 0.1 0.3 0.4) n(50)
Simulations (10000)
N Simulated Power Approximate Power
50 0.2121 0.2032
4 Acknowledgment
This research is supported in part by the National Institutes of Health through M. D. Anderson's Cancer Center Support Grant CA016672.
5 References
Alam, K., and J. R. Thompson. 1972. On selecting the least probable multinomial event. Annals of Mathematical Statistics 43: 1981–1990.
Guenther, W. C. 1977. Power and sample size for approximate chi-square tests. American Statistician 31: 83–85.
Nettleton, D. 2009. Testing for the supremacy of a multinomial cell probability. Journal of the American Statistical Association 104: 1052–1059.
Price, R. A., J. A. Tiro, M. Saraiya, H. Meissner, and N. Breen. 2011. Use of human papillomavirus vaccines among young adult women in the United States: An analysis of the 2008 National Health Interview Survey. Cancer 117: 5560–5568.
1 Introduction
Competition and antitrust authorities have long been concerned with the possible anticompetitive effects of mergers. This is in particular the case for horizontal mergers, which are mergers between firms selling substitute products. The traditional concern has been that such mergers raise market power, which may hurt consumers and reduce total welfare (the sum of producer and consumer surplus). At the same time, however, it has been recognized that mergers may also result in cost savings or other efficiencies. While such cost savings may often be insufficient to reduce prices and benefit consumers, it has been shown that even small cost savings can be sufficient to raise total welfare (see Williamson [1968] and Farrell and Shapiro [1990]).1 Despite the possible total welfare gains, most competition authorities in practice take a consumer surplus standard when evaluating proposed mergers.
Merger simulation is increasingly used as a tool to evaluate the effects of horizontal mergers. Consistent with policy practice, the focus is often on the price and consumer surplus effects, but various applications also evaluate the effects on total welfare.2 Merger simulation aims to predict the merger effects in the following three steps.
1. According to Williamson's (1968) analysis, the deadweight loss from the output reduction after the merger is a second-order effect that is easily compensated by the cost savings from the merger. However, Posner (1975) argues that there is another source of inefficiency from mergers because firms must spend wasteful resources to make a merger and maintain market power. In this alternative view, it may be more natural to use consumer surplus as a standard to evaluate mergers and to ignore the transfer from consumers to firms.
2. Early contributions to the merger simulation literature are Werden and Froeb (1994), Nevo
(2000), Epstein and Rubinfeld (2002), and Ivaldi and Verboven (2005). For a recent survey, see
Budzinski and Ruhmer (2010).
The first step specifies and estimates a demand system, usually one with differentiated products. The second step makes an assumption about the firms' equilibrium behavior, typically multiproduct Bertrand–Nash, to compute the products' current profit margins and their implied marginal costs. The third step usually assumes that marginal costs are constant and computes the postmerger price equilibrium, accounting for increased market power, cost efficiencies, and perhaps remedies (such as divestiture). This enables one to compute the merger's effect on prices, consumer surplus, producer surplus, and total welfare. Stata is often used to estimate the demand system (the first step) but not to implement a complete merger simulation (including the second and third steps). In this article, we show how to implement merger simulation in Stata as a postestimation command, that is, after estimating the parameters of a demand system for differentiated products. We also illustrate how to perform merger simulation when the demand parameters are not estimated but rather calibrated to be consistent with outside industry information on price elasticities and profit margins. We allow for a variety of extensions, including the role of (marginal) cost savings, remedies (divestiture), and conduct different from Bertrand–Nash behavior.
We consider an oligopoly model with multiproduct price-setting firms that may partially collude and have constant marginal cost. Following Berry (1994), we specify the demand system as an aggregate nested logit model, which can be estimated with market-level data using linear regression methods (as opposed to the individual-level nested logit model). We consider both a unit demand specification, as in Berry (1994) and Verboven (1996), and a constant expenditures specification, as in Björnerstedt and Verboven (2013). The model requires a dataset on products sold in one market, or in a panel of markets, with information on the products' prices, their quantities sold, firm and nest identifiers, and possibly other product characteristics.
In section 2, we discuss the merger simulation model, including the nested logit
demand system. In section 3, we introduce the commands required to carry out the
merger simulation. Section 4 provides examples and section 5 concludes.
where the conduct parameter, which lies in (0, 1), allows for the possibility that firms partially coordinate. If it equals 0, firms behave noncooperatively as multiproduct firms. If it equals 1,
   q(p) + {Ω(p)}(p − c) = 0
This can be inverted to write price as the sum of marginal cost and a markup, where the markup term (inversely) depends on the price elasticities and on the product-ownership matrix:

   p = c − {Ω(p)}⁻¹ q(p)                                        (2)

For single-product firms with no collusion (a conduct parameter of 0), the markup term is price divided by the own-price elasticity of demand. With multiproduct firms and partial collusion, the cross-price elasticities also matter, and this increases the markup term (if products are substitutes).
Equation (2) serves two purposes. First, it can be rewritten to uncover the premerger marginal cost vector c based on the premerger prices and estimated price elasticities of demand; that is,

   c^pre = p^pre + {Ω^pre(p^pre)}⁻¹ q(p^pre)
Second, (2) can be used to predict the postmerger equilibrium. The merger involves two possible changes: a change in the product-ownership matrix from its premerger to its postmerger configuration and, if there are efficiencies, a change in the marginal cost vector from c^pre to c^post. To simulate the new price equilibrium, one may use fixed-point iteration on (2), possibly with a dampening parameter in the markup term, or another algorithm such as the Newton method (see, for example, Judd [1998, 633]).
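To illustrate what such a dampened fixed-point iteration looks like, the following Mata sketch solves a markup equation of the form (2) for a toy single-product logit duopoly; the parameter values (alpha, delta, c) and the dampening factor are made up for the example and are not taken from the article or from mergersim itself.

mata:
// toy logit duopoly: two single-product firms, no collusion
alpha = -0.5                                // price parameter
delta = (2 \ 2)                             // mean valuations
c     = (1 \ 1)                             // marginal costs
damp  = 0.5                                 // dampening parameter
p     = c                                   // starting prices
for (it = 1; it <= 500; it++) {
    e = exp(delta :+ alpha :* p)
    s = e :/ (1 + sum(e))                   // logit market shares
    markup = -1 :/ (alpha :* (1 :- s))      // single-product logit markup term
    pnew = damp :* (c + markup) + (1 - damp) :* p
    if (max(abs(pnew - p)) < 1e-10) break
    p = pnew
}
p                                           // equilibrium prices after convergence
end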
data (see Train [2009] for an overview), Berry (1994) and Berry, Levinsohn, and Pakes (1995) show how to estimate the models with aggregate data. The dataset consists of J × 1 vectors of the products' quantities q, prices p, and a J × K matrix of product characteristics x, including indicator variables for the products' subgroup and group and their firm affiliation. The dataset is for either one market or a panel of markets, for example, different years or different regions and countries. The panel is not necessarily balanced, because new products may be introduced over time, or old products may be eliminated, and not all products may be for sale in all regions.
In addition to each product j's quantity sold qj, its price pj, and the vector of product characteristics xj, it is necessary to observe (or estimate) the potential market size for the differentiated products. In the common unit demand specification of the nested logit, consumers have inelastic conditional demands: they buy either a single unit of their most preferred product j = 1, . . . , J or the outside good j = 0. The potential market size is then the potential number of consumers I, for example, an assumed fraction of the observed population L in the market. An alternative is the constant expenditures specification, where consumers have unit elastic conditional demand: they buy a constant expenditure of their preferred product or the outside good. Here the potential market size is the potential total budget B, for example, an assumed fraction of total gross domestic product Y in the market.
As shown by Berry (1994) and the extensions by Verboven (1996) and Björnerstedt and Verboven (2013), the aggregate two-level nested logit model gives rise to the following linear estimating equation for a cross section of products j = 1, . . . , J:

   ln(s_j/s_0) = x_j β + α p_j + σ1 ln(s_j|hg) + σ2 ln(s_h|g) + ξ_j                    (3)

where β captures the mean valuations for the observed product characteristics, α < 0 is a price parameter, and σ1 and σ2 are two nesting parameters, which measure the consumers' preference correlation for products in the same subgroup and group. The model reduces to a one-level nested logit model with only subgroups as nests if σ2 = 0, to a one-level nested logit model with only groups as nests if σ1 = σ2, and to a simple logit model without nests if σ1 = σ2 = 0. The mean gross valuation for product j is defined as δ_j ≡ x_j β + ξ_j = ln(s_j/s_0) − α p_j − σ1 ln(s_j|hg) − σ2 ln(s_h|g), so it can be computed from the product's market share, price, and the parameters α, σ1, and σ2.
In sum, the aggregate nested logit model is essentially a linear regression of the products' market shares on price, product characteristics, and (sub)group shares. In the unit demand specification, price enters linearly and market shares are in volumes; in the constant expenditures specification, price enters logarithmically and market shares are in values. In both cases, the unobserved product characteristics term, ξ_j, may be correlated with price and market shares, so instrumental variables should be used. Cost shifters would qualify as instruments, but these are typically not available at the product level. Berry, Levinsohn, and Pakes (1995) suggest using sums of the other products' characteristics (over the firm and the entire market). For the nested logit model, Verboven (1996) adds sums of the other product characteristics by subgroup and group.
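As a sketch of such instruments, the lines below construct sums of one characteristic (horsepower) over the other products in the market, over the firm's other products, and over the other products in the same group, using the variable names of the cars1.dta example in section 4; they are illustrative only and are not generated by mergersim.

* market defined as a country-year combination
egen tot_hp_mkt  = total(horsepower), by(country year)
egen tot_hp_firm = total(horsepower), by(country year firm)
egen tot_hp_seg  = total(horsepower), by(country year segment)
generate z_mkt  = tot_hp_mkt  - horsepower      // sum over all other products in the market
generate z_firm = tot_hp_firm - horsepower      // sum over the firm's other products
generate z_seg  = tot_hp_seg  - horsepower      // sum over other products in the same group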
3.1 Syntax
mergersim init [if] [in], marketsize(varname)
   {quantity(varname) | price(varname) | revenue(varname)}
   [nests(varlist) unitdemand cesdemand alpha(#) sigmas(# [#]) name(string)]
mergersim market [if] [in], firm(varname) [conduct(#) name(string)]
mergersim simulate [if] [in], firm(varname)
   {buyer(#) seller(#) | newfirm(varname)}
   [conduct(#) name(string) buyereff(#) sellereff(#) efficiencies(varname)
   newcosts(varname) newconduct(#) method(fixedpoint | newton) maxit(#)
   dampen(#) keepvars detail]
mergersim mre [if] [in], {buyer(#) seller(#) | newfirm(varname)} [name(string)]
3.2 Options
Demand and market specification
The demand and market specification are set in mergersim init and mergersim market (and in mergersim simulate if mergersim market is not explicitly invoked by the user).
marketsize(varname) specifies the potential size of the market (total number of potential buyers in the unit demand specification, total potential budget in the constant expenditures specification). marketsize() is required with mergersim init.
Any two of price(), quantity(), or revenue() are required.
quantity(varname) specifies the quantity variable.
price(varname) specifies the price variable.
revenue(varname) specifies the revenue variable.
nests(varlist) specifies one or two nesting variables. The outer nest is specified first. If only one variable is specified, a one-level nested logit model applies. If the option is not specified, a simple logit model applies.
unitdemand specifies the unit demand specification (default).
cesdemand specifies the constant expenditure specification rather than the default unit demand specification.
alpha(#) specifies a value for the alpha parameter rather than using an estimate. Note that this option has no effect if mergersim market has been run.
sigmas(# [#]) specifies a value for the sigma parameters rather than using an estimate. In the two-level nested logit, the first sigma corresponds to the log share of the product in the subgroup, and the second corresponds to the log share of the subgroup in the group.
name(string) specifies a name for the simulation. Variables created will have the specified name followed by an underscore character rather than the default M_. This option can be used with all the mergersim subcommands.
firm(varname) specifies the integer variable indexing the firm owning the product. firm() is required with mergersim market and mergersim simulate.
conduct(#) measures the fraction of the competitors' profits that firms account for when setting their own prices. It gives the degree of joint profit maximization between firms before the merger in percentage terms (number between 0 and 1).
Merger specification
Computation
The computation options can be set in mergersim simulate, where the postmerger Nash equilibrium is computed.
keepvars specifies that all generated variables should be kept after simulation, calculation of elasticities, or minimal required efficiencies.
detail shows market shares in mergersim simulate. These market shares are relative to total sales (excluding the outside good). Market shares are in terms of volumes for the unit demand specification and in terms of value for the constant expenditure specification. Changes in consumer and producer surplus and in the Herfindahl–Hirschman index are also displayed.
3.3 Description
mergersim performs a merger simulation with the subcommands init, market, and simulate. mergersim init must be invoked first to initialize the settings. mergersim market calculates the price elasticities and marginal costs. mergersim simulate performs a merger simulation, automatically invoking mergersim market if the command has not been called by the user. In addition to displaying results, mergersim creates various variables at each step. By default, the names of these variables begin with M_.
First, mergersim init initializes the settings for the merger simulation. It is required before estimation and before a first merger simulation. It defines the upper and lower nests; the specification (unit demand or constant expenditures demand); the price, quantity, and revenue variables (two out of three); the potential market size variable; and the firm identifier (numerical variable). It also generates the variables necessary to estimate the demand parameters (alpha and sigmas) using a linear (nested) logit regression, similar to Berry (1994) and the extensions of Björnerstedt and Verboven (2013). The names of the market share and price variables to use in the regression will depend on the demand specification and are shown in the display output of mergersim init. Alternatively, the demand parameters can be calibrated with the alpha() and sigmas() options rather than being estimated.
Second, mergersim market computes the premerger conditions (the gross valuations δ_j and marginal costs c_j of each product j) under assumptions regarding the degree of coordination. The computations are based on the last estimates of α, σ1, and σ2 unless they are overruled by values specified by the user in the alpha() and
sigmas() options. mergersim market is required after mergersim init and before the first mergersim simulate. It is not necessary to specify mergersim market before additional mergersim simulates (unless one wants to specify new premerger values of δ_j and c_j).
Third, mergersim simulate computes the postmerger prices and quantities under assumptions regarding the identity of the merged firms, their cost efficiencies, and the degree of collusion (the same as before the merger). It is possible to repeat the command multiple times after estimation.
In addition to these three main subcommands, several other subcommands can provide useful information. For example, mergersim mre computes the minimum required efficiencies per product for the price not to increase after the merger. It can be invoked after mergersim init.
4 Examples
4.1 Preparing the data
To demonstrate mergersim, we use the dataset on the European car market, collected by Goldberg and Verboven (2001) and maintained on their webpages.3 We take a reduced version of that dataset with fewer variables and a slightly more aggregate firm definition; the dataset is called cars1.dta. Each observation comprises a car model, year, and country. The total number of observations is 11,483: there are 30 years (1970–1999) and 5 countries (Belgium, France, Germany, Italy, and the United Kingdom), which implies an average of 77 car models per year and country. The car market is divided into five upper nests (groups) according to the segments: subcompact, compact, intermediate, standard, and luxury. Each segment is further subdivided into lower nests (subgroups) according to the origin: domestic or foreign (for example, Fiat is domestic in Italy and foreign in the other countries). Sales are new car registrations (qu). Price is measured in 1,000 Euro (in 1999 purchasing power). The product characteristics are horsepower (in kilowatts), fuel efficiency (in liter/100 kilometers), width (in centimeters), and height (in centimeters). The commands below are provided in a script called example.do.
3. See http://www.econ.kuleuven.be/public/ndbad83/frank/cars.htm.
. use cars1
. summarize year country co segment domestic firm qu price horsepower fuel
> width height pop ngdp
Variable Obs Mean Std. Dev. Min Max
A first key preparatory task is to define the two dimensions of the panel and to time set the data (unless there is only one cross-section). The first dimension is the product, that is, the car model (for example, Volkswagen [VW] Golf). The second dimension is the market, which can be defined as the country and year (for example, France in 1995).
Note that the panel is unbalanced because most models are not available throughout the entire period or in all countries.
A second key preparatory task is to define the potential market size. For the car market, it is sensible to adopt a unit demand specification. We specify the potential market size as total population divided by 4, a crude proxy for the number of households. In practice, the potential market size in a given year may be lower because cars are durable and consumers who just purchased a car may not consider buying a new one immediately.
. generate MSIZE=pop/4
The first step initializes the settings for the merger simulation using the command mergersim init. The next example specifies a two-level nested logit model where the groups are the segments and the subgroups are domestic or foreign within the segments. This requires the option nests(segment domestic). The specification is the default unit demand specification. The price, quantity, market size, and firm variables are also specified.
mergersim init creates market share and price variables labeled with an M_ prefix (the default prefix). The variable M_ls is the dependent variable ln(s_j/s_0), M_lsjh is the log of the subgroup share ln(s_j|hg), and M_lshg is the log of the group share ln(s_h|g).
We can estimate the nested logit model with a linear regression estimator using instrumental variables to account for the endogeneity of the price and market share variables. As a simplification to illustrate the approach, we consider a fixed-effects regression without instruments.
. xtreg M_ls price M_lsjh M_lshg horsepower fuel width height domestic year
> country2-country5, fe
Fixed-effects (within) regression Number of obs = 11483
Group variable: co Number of groups = 351
R-sq: within = 0.8948 Obs per group: min = 1
between = 0.7576 avg = 32.7
overall = 0.8427 max = 146
F(13,11119) = 7271.50
corr(u_i, Xb) = -0.0147 Prob > F = 0.0000
sigma_u .52455749
sigma_e .36374004
rho .6752947 (fraction of variance due to u_i)
F test that all u_i=0: F(350, 11119) = 22.69 Prob > F = 0.0000
The parameters that will influence the merger simulations are the price parameter α = −0.0468 and the nesting parameters σ1 = 0.905 and σ2 = 0.568 (the coefficients of, respectively, M_lsjh and M_lshg). These estimates satisfy the following restrictions from economic theory: α < 0 and 1 > σ1 ≥ σ2 ≥ 0. However, it is important to stress that the fixed-effects estimator is inconsistent because price and the subgroup and group market share variables are endogenous. As discussed in Berry (1994), an instrumental-variable estimator is required (for example, using ivreg or xtivreg with appropriate instruments). We therefore use only the results from the fixed-effects estimator for illustration.
The second step in the merger simulation calculates the premerger market conditions (the products' gross valuations and their marginal costs and the price elasticities of demand) using the command mergersim market. In the example below, these calculations are done for only the five countries in 1998. Because no values for α, σ1, and σ2 are specified, mergersim market uses the parameters in the last available Stata estimation, that is, the ones from a fixed-effects regression.
Demand estimate
xtreg M_ls price M_lsjh M_lshg horsepower fuel width height domestic year
> country2-country5, fe
Dependent variable: M_ls
Parameters
alpha = -0.047
sigma1 = 0.905
sigma2 = 0.568
Observations: 449
These results imply fairly high own-price elasticities for the products in 1998, −7.488 on average. The cross-price elasticities are higher for products within the same subgroup (0.766) than for products of a different subgroup (0.068) and especially for products of a different group (0.001). The Lerner index or percentage markup over marginal cost varies from 9.9% to 37.2%, with a tendency of higher percentage markups for firms with lower-priced models (a feature of most unit demand logit models).
The third step performs the actual merger simulation using the mergersim simulate command. The example below considers a merger where General Motors (GM) (firm = 15) sells its operations to VW (firm = 26). Note that the merger simulations would be the same if VW sold its operations to GM. We first carry out the merger simulations for Germany in 1998, where it can be considered a domestic merger (because GM sells the Opel brands, which are produced in Germany). It is assumed that there are no marginal cost savings to the seller or the buyer and that there is no partial coordination (neither before nor after the merger).
. mergersim simulate if year == 1998 & country == 3, seller(15) buyer(26)
> detail
Merger Simulation
Simulation method: Newton
Buyer Seller Periods/markets: 1
Firm 26 15 Number of iterations: 6
Marginal cost savings Max price change in last it: 4.5e-06
Prices
Unweighted averages by firm
The results show prices before and after the merger (in 1,000 Euro) and the percentage price change averaged by firm. This information is provided by default, even without the detail option at the end. The merger simulations predict that GM will on average raise its prices by 7.6%, while VW will on average raise its prices by 3.6%. The rivals respond with only very small price increases (with the exception of Ford).4
Because the new price vector is saved, one can use Stata's graphics to plot these results. Consider the following commands:
. generate perc_price_ch=M_price_ch*100
(11386 missing values generated)
. graph bar (mean) perc_price_ch if country==3&year==1998,
> over(firm, sort(perc_price_ch) descending label(angle(vertical)))
> ytitle(Percentage) title(Average percentage price increase per firm)
(Figure: bar chart of the average percentage price increase per firm, sorted in descending order from GM and VW down to Kia.)
4. Note that one can also specify the detail option to display the market shares before and after the merger and the percentage point difference. If one is interested in seeing more detailed results, one can use additional options under mergersim results. One can also use standard Stata commands, such as table, based on the variables M_price (premerger price) and M_price2 (postmerger price).
Without the detail option after the mergersim simulate command, the output reports only the price information. The detail option produces additional results on the following variables (premerger, postmerger, and changes): market shares by firm, the Herfindahl index, C4 and C8 ratios (market share of the 4 and 8 largest firms), and consumer and producer surplus.5
Market shares by quantity
Unweighted averages by firm (premerger, postmerger, and change)
(output omitted)
For example, the Herfindahl index increases from 1,501 to 1,972. Consumer surplus
(in Germany) drops by 1.8 billion Euro or 586 Euro per car (because 3.1 million cars
were sold in Germany in 1998). This is partly compensated by an increase in producer
surplus of 1.3 billion Euro.
5. In logit and nested logit models, consumer surplus (up to a constant) is given by the well-known
log(sum) expression divided by the marginal utility of income. Caution is warranted in the constant
expenditure specification because marginal utility is not constant. See Train (2009).
Efficiencies
First, one may account for the possibility that the buying or the selling firm benefits from a marginal cost saving, which may be passed on to consumer prices. The cost saving is expressed as a percentage of current marginal cost. In the command below, the options sellereff(0.2) and buyereff(0.2) mean that the seller and the buyer each have a marginal cost saving of 20% on all of their products.
. mergersim simulate if year == 1998 & country == 3, seller(15) buyer(26)
> sellereff(0.20) buyereff(0.20) method(fixedpoint) maxit(40) dampen(0.5)
Merger Simulation
Simulation method: Dampened Fixed point
Buyer Seller Periods/markets: 1
Firm 26 15 Number of iterations: 19
Marginal cost savings .2 .2 Max price change in last it: .
Prices
Unweighted averages by firm
There is now a predicted price decrease in Germany of 2.2% for GM and 7.5% for VW. This implies that the 20% cost savings are sufficiently passed to consumers. To obtain convergence, we used a fixed-point iteration with a dampening factor of 0.5 because the default Newton method did not converge. sellereff() and buyereff()
assume the same percentage cost saving for all products of the seller and buyer. A product-specific alternative is the efficiencies(varname) option.
The generated variable M_mre refers to the minimum required efficiency per product owned by the merging firms and is set to a missing value for the products of the nonmerging firms. According to the results, the minimum required efficiencies for the 19 products of the merging firms are on average 12.3% (unweighted) and 22.1% (weighted by sales).
Divestiture as a remedy
Second, one may account for divestiture as a remedy to mitigate the price effects of a merger. Under such a remedy, the competition authority accepts the merger on the condition that the firms sell some of their products or brands. To simulate the effects of a merger with divestiture, one can replace the options buyer(#) and seller(#) with the option newfirm(varname), which specifies a variable for the new ownership structure after the merger. To illustrate, we consider a merger between Renault (firm = 18) and PSA (firm = 16), where PSA sells the brands Peugeot and Citroën. This merger would substantially raise average prices in France: 59.8% for the Renault products and 63.1% for the PSA products (ignoring entry and substitution to other countries). To mitigate the anticompetitive effects, the competition authority may request that PSA sell one of its brands, Citroën (brand = 4), to Fiat (firm = 4). The commands below show how to simulate the effects of such a merger with divestiture after creating the appropriate variable firm_rem for the new ownership structure.6
6. Note that this example starts with mergersim init and moves to mergersim simulate without performing a regression to obtain the price and nesting parameters. In this case, mergersim continues to use the most recent results.
. generate firm_rem=firm
. replace firm_rem=16 if firm==18 // original merger
(890 real changes made)
. replace firm_rem=4 if brand==4 // divestiture
(583 real changes made)
. quietly mergersim init, nests(segment domestic) unit price(price)
> quantity(qu) marketsize(MSIZE) firm(firm)
. quietly mergersim simulate if year == 1998 & country == 2, seller(16)
> buyer(18)
. mergersim simulate if year == 1998 & country == 2, newfirm(firm_rem)
Merger Simulation
Simulation method: Newton
Variable name Periods/markets: 1
Ownership from: firm_rem Number of iterations: 7
Marginal cost savings Max price change in last it: 9.7e-08
Prices
Unweighted averages by firm
The results show that the merger with divestiture raises the average price only by
16.2% for Renault and by 8.9% for the Peugeot brand, whereas the price of Fiat (now
including the Citroën brand) increases by 0.6%. The option newfirm(varname) can
also be used for other applications, for example, to assess the impact of two consecutive
mergers.
Conduct
Third, one may account for the possibility that firms partially coordinate, that is, take into account a fraction of the competitors' profits when setting prices. Assume, for example, that firms maintain the same degree of coordination before and after the merger: one can set the conduct parameter such that the markups are in line with outside estimates. Performing mergersim market before mergersim simulate enables one to verify whether the conduct parameter results in premerger markups in line with outside estimates. This is shown in the following example (which returns to the earlier merger between GM and VW in Germany).
Demand estimate
xtreg M_ls price M_lsjh M_lshg horsepower fuel width height domestic year
> country2-country5, fe
Dependent variable: M_ls
Parameters
alpha = -0.047
sigma1 = 0.905
sigma2 = 0.568
Observations: 97
The results show that if firms coordinate by taking into account 50% of the competitors' profits, then the Lerner index becomes almost twice as high as when there is no coordination. The predicted price effects after the merger can now be computed.
Merger Simulation
Simulation method: Newton
Buyer Seller Periods/markets: 1
Firm 26 15 Number of iterations: 6
Marginal cost savings Max price change in last it: 2.1e-07
Pre Post
Conduct: .5 .5
Prices
Unweighted averages by firm
Under partial coordination, the merger simulation predicts larger price increases. On one hand, there is a larger predicted price increase for the merging firms: this feature does not hold generally, because the merging firms already partially coordinate before the merger. On the other hand, there is also a larger predicted price increase for the outsider firms: this feature may hold more generally because it reflects that outsiders have more cooperative responses to price changes by the merging firms.
The merger simulation results depend on the values of three parameters: α, σ1, and σ2 (and on the price and quantity data per product). A practitioner may not want to rely too heavily on the econometric estimates of these parameters and may want to verify whether the elasticities and markups are consistent with external industry information. Here a practitioner would not estimate but calibrate the parameters such that they result in price elasticities and markups that are equal to external estimates. Such calibration is possible by specifying the options alpha() and sigmas() to mergersim market. The selected values overrule the values in memory, for example, the ones from a previous estimation. In the lines below, we specify α = −0.035 (closer to 0 as compared with the econometric estimate of α = −0.047), and we keep σ1 and σ2 at the previous values. Hence, we calibrate such that demand would be less elastic. The results from this calibration indeed imply lower price elasticities (on average −5.5):
Demand calibration
Parameters
alpha = -0.035
sigma1 = 0.910
sigma2 = 0.570
Observations: 97
The next lines show what this calibration implies for merger simulation.
Merger Simulation
Simulation method: Newton
Buyer Seller Periods/markets: 1
Firm 26 15 Number of iterations: 6
Marginal cost savings Max price change in last it: 5.9e-06
Prices
Unweighted averages by firm
These results show that the predicted price increase is larger when demand is less elastic.
One can also use the calibration options alpha() and sigmas() to implement a parametric bootstrap for constructing confidence intervals of the computed merger effects. The following lines perform three steps. First, we take 100 draws for α, σ1, and σ2, assuming the parameters are normally distributed. Second, we perform 100 merger simulations, one for each draw. Third, we save the results for the average price increase of the buying firm and the selling firm, and we compute summary statistics.
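The article's code for these steps is not reproduced in this excerpt; the following loop is a hedged sketch of the idea. The standard deviations used for the draws and the way the results are collected are illustrative placeholders, not the authors' values.

set seed 339487731
forvalues i = 1/100 {
    * draw alpha, sigma1, sigma2 around the fixed-effects estimates (sds illustrative)
    scalar a  = -0.0468 + 0.002*rnormal()
    scalar s1 =  0.905  + 0.010*rnormal()
    scalar s2 =  0.568  + 0.010*rnormal()
    quietly mergersim market if year==1998 & country==3, firm(firm) ///
        alpha(`=scalar(a)') sigmas(`=scalar(s1)' `=scalar(s2)')
    quietly mergersim simulate if year==1998 & country==3, seller(15) buyer(26)
    * collect the average M_price_ch for the buyer and the seller here (for example, with postfile)
}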
Earlier, we obtained point estimates for the percentage price increase of 7.6% for GM and 3.6% for VW (for the base scenario). The 95% confidence intervals for these price increases are [6.7, 8.4]% for GM and [3.1, 4.0]% for VW.
. generate MSIZE1=ngdpe/5
This assumes the potential expenditures on cars in a country and year are 20% of
total gross domestic product.
Next we calibrate (rather than estimate) the parameters to α = −0.5, σ1 = 0.9, and σ2 = 0.6.
We can verify the premerger elasticities and markups at these calibrated parameters:
Demand calibration
Parameters
alpha = -0.500
sigma1 = 0.900
sigma2 = 0.600
Observations: 97
The premerger elasticities and markups are roughly comparable with the ones of the estimated unit demand model (with less variation between firms). However, as shown below, the merger simulation results in a larger predicted price increase: +10.1% for GM and +4.4% for VW. This follows from the different functional form: the constant expenditures specification has the property of quasi-constant price elasticity, whereas the unit demand specification has the property that consumers become more price sensitive as firms raise prices. For this same reason, efficiencies in the form of marginal cost savings would also be passed more to consumers under this specification.
Merger Simulation
Simulation method: Newton
Buyer Seller Periods/markets: 1
Firm 26 15 Number of iterations: 7
Marginal cost savings Max price change in last it: 4.7e-09
Prices
Unweighted averages by firm
(output omitted )
Because the detail option was added, mergersim simulate reports additional results. Consumer surplus now drops by 2.2 billion Euro (versus 1.8 billion Euro in the unit demand specification), and producer surplus increases by 1.1 billion Euro (versus 1.3 billion Euro before).
5 Conclusions
This overview has shown how to apply two specifications of the two-level nested logit demand system to merger simulation. We show that merger simulation can be applied as a postestimation command based on estimated parameter values, or it can be implemented without estimation but with calibrated parameters. The merger simulation results yield intuitive predictions given the assumed demand parameters.7 The set of merger simulation commands can be used to simulate the effects of horizontal mergers in a standard setting (differentiated products, multiproduct Bertrand price setting). One can also incorporate various extensions, including efficiencies in the form of cost savings, remedies through partial divestiture, and alternative behavioral assumptions (partial collusive behavior).
Other applications and extensions could be considered. For example, for the car
market, it could be interesting to generalize the demand model to allow consumers to
substitute between countries by introducing an upper nest for the choice of country
instead of assuming such substitution is not possible. These additional substitution
possibilities would limit the market power effects of mergers. Other demand models may also be considered, such as a random coefficients logit model or the almost ideal demand system.
6 References
Berry, S., J. Levinsohn, and A. Pakes. 1995. Automobile prices in market equilibrium. Econometrica 63: 841–890.
Björnerstedt, J., and F. Verboven. 2013. Does merger simulation work? Evidence from the Swedish analgesics market. http://www.econ.kuleuven.be/public/ndbad83/Frank/Papers/Bjornerstedt%20&%20Verboven,%202013.pdf.
Budzinski, O., and I. Ruhmer. 2010. Merger simulation in competition policy: A survey. Journal of Competition Law & Economics 6: 277–319.
Farrell, J., and C. Shapiro. 1990. Horizontal mergers: An equilibrium analysis. American Economic Review 80: 107–126.
Froeb, L. M., and G. J. Werden. 1998. A robust test for consumer welfare enhancing mergers among sellers of a homogeneous product. Economics Letters 58: 367–369.
7. We stress, however, that the estimated parameters were based on an inconsistent fixed-effects
estimator. In practice, one should use instrumental variables to estimate the parameters consistently.
Goldberg, P. K., and F. Verboven. 2001. The evolution of price dispersion in the
European car market. Review of Economic Studies 68: 811–848.
Ivaldi, M., and F. Verboven. 2005. Quantifying the effects from horizontal mergers in
European competition policy. International Journal of Industrial Organization 23:
669–691.
Nevo, A. 2000. Mergers with differentiated products: The case of the ready-to-eat cereal
industry. RAND Journal of Economics 31: 395–421.
Posner, R. A. 1975. The social costs of monopoly and regulation. Journal of Political
Economy 83: 807–828.
Röller, L.-H., J. Stennek, and F. Verboven. 2001. Efficiency gains from mergers.
European Economy 5: 31–128.
Train, K. E. 2009. Discrete Choice Methods with Simulation. 2nd ed. Cambridge:
Cambridge University Press.
Werden, G. J., and L. Froeb. 1994. The effects of mergers in differentiated products
industries: Logit demand and merger policy. Journal of Law, Economics, and
Organization 10: 407–426.
1 Introduction
treatrew is a user-written command for estimating average treatment effects (ATEs) by
reweighting (REW) on the propensity score. Depending on the specified model (probit
or logit), treatrew provides consistent estimation of ATEs under the hypothesis of
selection on observables. Conditional on a prespecified set of observable exogenous
variables x (thought of as those driving the nonrandom assignment to treatment),
treatrew estimates the average treatment effect (ATE), the average treatment effect
on the treated (ATET), and the average treatment effect on the nontreated (ATENT); it
also estimates these parameters conditional on the observable factors x (that is, ATE(x),
ATET(x), and ATENT(x)).
2. Build weights as 1/p̂i for treated observations and 1/(1 − p̂i) for untreated observations.
3. Calculate ATEs by comparing the weighted means of the two groups (for instance,
with a weighted least-squares [WLS] regression).
i. y1 = g1(x) + ε1, with E(ε1) = 0
ii. y0 = g0(x) + ε0, with E(ε0) = 0
iii. y = w·y1 + (1 − w)·y0
iv. Conditional mean independence (CMI) holds; therefore, E(y1|w, x) = E(y1|x) and
E(y0|w, x) = E(y0|x)
v. x is exogenous
y1 and y0 are the subject's outcomes when treated and untreated, respectively; g1(x)
and g0(x) are the subject's reaction functions to the confounder x when the subject is
treated and untreated, respectively; w is the binary treatment indicator taking value 1
for treated and 0 for untreated subjects; ε0 and ε1 are two error terms with unconditional
zero mean; and x is a set of observable and exogenous confounding variables assumed
to drive the nonrandom assignment into treatment. In short, the CMI assumption states
that it is sufficient to control only for x to restore random assignment conditions. When
assumptions i–v hold,
\[ \text{ATE} = E\left[\frac{\{w-p(x)\}\,y}{p(x)\{1-p(x)\}}\right] \tag{1} \]
\[ \text{ATET} = E\left[\frac{\{w-p(x)\}\,y}{p(w=1)\{1-p(x)\}}\right] \tag{2} \]
\[ \text{ATENT} = E\left[\frac{\{w-p(x)\}\,y}{p(w=0)\,p(x)}\right] \tag{3} \]
Estimation follows in two steps: i) estimate the propensity score p(xi), thus obtaining
p̂(xi); and ii) substitute p̂(xi) into the previous formulas to get the parameters. Consistency is
guaranteed because these estimators are M-estimators.
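As an illustration of this two-step logic, the sample analogs of (1)–(3) can be computed directly with a few lines of Stata code (a minimal sketch, in which y, w, and the confounder list in $xvars are placeholder names rather than treatrew internals):
. probit w $xvars
. predict double ps, pr
. summarize w, meanonly
. scalar pw1 = r(mean)
. generate double k_ate   = (w - ps)*y/(ps*(1 - ps))
. generate double k_atet  = (w - ps)*y/(scalar(pw1)*(1 - ps))
. generate double k_atent = (w - ps)*y/((1 - scalar(pw1))*ps)
. summarize k_ate k_atet k_atent
The means of k_ate, k_atet, and k_atent are the sample analogs of (1), (2), and (3), with p(w = 1) and p(w = 0) replaced by the sample shares of treated and untreated units.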
But how do we get standard errors for the previous estimators? We can exploit some
results when the first step is a maximum likelihood (ML) estimation and the second step
is an M-estimation. In our case, the first step is an ML based on logit (or probit), and the
second step is a standard M-estimator. For such cases, Wooldridge (2007; 2010, 922–924)
proposed a straightforward procedure to get analytical standard errors provided that the
propensity score is correctly specified. In what follows, we demonstrate Wooldridge's
(2007; 2010, 922–924) procedure and formulas for obtaining these standard errors.
and call them êi (i = 1, . . . , N). The asymptotic standard error for ATE is equal to
\[ \left\{\frac{1}{N}\sum_{i=1}^{N}\hat{e}_i^2\right\}^{1/2}\Big/\sqrt{N} \tag{4} \]
and we can use it to test the significance of ATE. Of course, d̂i will have a different
expression according to the probability model adopted. Here we consider the logit and
probit cases.
Case 1: Logit
Suppose that the correct probability follows a logistic distribution. This means that
\[ p(x_i,\beta) = \frac{\exp(x_i\beta)}{1+\exp(x_i\beta)} = \Lambda(x_i\beta) \tag{5} \]
Thus, by simple algebra, we see that
\[ \hat{d}_i = x_i\,(w_i - \hat{p}_i) \]
Case 2: Probit
Suppose that the right probability follows a normal distribution. This means that
\[ p(x_i,\beta) = \Phi(x_i\beta) \]
Thus, by simple algebra, we see that
\[ \hat{d}_i = \frac{\phi(x_i\hat{\beta})\,x_i\,\{w_i - \Phi(x_i\hat{\beta})\}}{\Phi(x_i\hat{\beta})\{1-\Phi(x_i\hat{\beta})\}} \]
where Φ(·) and φ(·) are the normal cumulative distribution and density functions,
respectively. One can also add functions of x to the estimation of the previous formulas. This
reduces standard errors if these functions are partially correlated with k̂i.
Finally, observe that the previous procedure produces standard errors that are lower
than those produced by ignoring the first step (that is, the propensity-score estimation
via ML). Indeed, the naive standard error
\[ \left\{\frac{1}{N}\sum_{i=1}^{N}\bigl(\hat{k}_i - \widehat{\text{ATE}}\bigr)^2\right\}^{1/2}\Big/\sqrt{N} \]
is higher than the one produced by the previous procedure.
The standard errors presented in this section are correct when the actual data-
generating process follows the probit or the logit probability rules. If not, then a mea-
surement error is present, and the estimations might be inconsistent. Authors such as
Hirano, Imbens, and Ridder (2003) and Li, Racine, and Wooldridge (2009) have sug-
gested more flexible nonparametric estimation of the standard errors. Under correct
specification, a straightforward alternative is to use bootstrapping, where the binary
response estimation and the averaging are included in each bootstrap iteration.
command syntax. The user has to declare: a) the outcome variable, that is, the variable
over which the treatment is expected to have an impact (outcome); b) the binary
treatment variable (treatment); c) a set of confounding variables (varlist); and, finally,
d) a series of options. Two options are important: the option model(modeltype) sets
the type of model, probit or logit, that has to be used in estimating the propensity
score; the option graphic and the related option range(a b) produce a chart where the
distributions of ATE(x), ATET(x), and ATENT(x) are jointly plotted within the interval
[a, b].
As an e-class command, treatrew provides an ereturn list of objects (such as
scalars and matrices) that can be used in subsequent calculations. In particular, the values of
ATE, ATET, and ATENT are returned in the scalars e(ate), e(atet), and e(atent), and
they can be used to get bootstrapped standard errors. By default, treatrew provides
analytical standard errors.
4.1 Syntax
treatrew outcome treatment varlist [if] [in] [weight] [, model(modeltype)
graphic range(a b) conf(#) vce(robust)]
outcome is the target variable for measuring the impact of the treatment.
treatment is the binary treatment variable taking 1 for treated and 0 for untreated
subjects.
varlist is the set of pretreatment (or observable confounding) variables.
fweights, iweights, and pweights are allowed; see [U] 11.1.6 weight.
4.2 Description
treatrew estimates ATEs by REW on the propensity score. Depending on the specified
model, treatrew provides consistent estimation of ATEs under the hypothesis of selection
on observables. Conditional on a prespecified set of observable exogenous variables
x (thought of as those driving the nonrandom assignment to treatment), treatrew
estimates the ATE, the ATET, the ATENT, and these parameters conditional on the
observable factors x (that is, ATE(x), ATET(x), and ATENT(x)). Parameters' standard
errors are provided either analytically (following Wooldridge [2010, 920–930]) or via
bootstrapping. treatrew assumes that the propensity-score specification is correct.
treatrew creates several variables:
4.3 Options
model(modeltype) specifies the model for estimating the propensity score, where modeltype
must be either probit or logit. model() is required.
graphic allows for a graphical representation of the density distributions of ATE(x),
ATET(x), and ATENT(x) within their whole support.
4.5 Examples
To show a practical application of treatrew, we use an instructional dataset called
fertil2.dta, which is included in Wooldridge (2013) and collects cross-sectional data
on 4,361 women of childbearing age in Botswana. This dataset is freely downloadable
at http://fmwww.bc.edu/ec-p/data/wooldridge/fertil2.dta. It contains 28 variables on
women and family characteristics.
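The data can be loaded directly from that address (a sketch; it assumes an active Internet connection):
. use "http://fmwww.bc.edu/ec-p/data/wooldridge/fertil2.dta", clear
. describe, short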
Using fertil2.dta, we are interested in evaluating the impact of the variable educ7
(taking value 1 if a woman has at least seven years of education and
0 otherwise) on the number of family children (children). Several conditioning (or
confounding) observable factors are included in the dataset, such as the age of the
woman (age), whether the family owns a television (tv), whether the woman lives
in a city (urban), and so forth. To inquire into the relation between education and
fertility according to Wooldridge's (2010, ex. 21.3, 940) specification, we estimate the
ATE, ATET, and ATENT (as well as ATE(x), ATET(x), and ATENT(x)) by REW using
treatrew. We also compare REW results with other popular program evaluation
methods: i) the difference in means (DIM), taken as benchmark; ii) the OLS regression-based
random-coefficient model with heterogeneous reaction to confounders, estimated
through the user-written command ivtreatreg, provided by Cerulli (2011); and iii) a
one-to-one nearest-neighbor matching, computed by the command psmatch2, provided
by Leuven and Sianesi (2003). Because matching estimators can be seen as specific
REW procedures (Busso, DiNardo, and McCrary 2008), comparing REW with matching
is worthwhile. By taking just the case of ATET, we can prove that
\[ \widehat{\text{ATET}}_{\text{Matching}} = \frac{1}{N_1}\sum_{i\in(w=1)}\Bigl\{y_i - \sum_{j\in C(i)} h(i,j)\,y_j\Bigr\} \]
\[ = \frac{1}{N_1}\sum_{i=1}^{N} w_i\,y_i \;-\; \frac{1}{N_1}\sum_{j=1}^{N}(1-w_j)\,y_j \sum_{i=1}^{N} w_i\,h(i,j) \]
\[ = \frac{1}{N_1}\sum_{i=1}^{N} w_i\,y_i \;-\; \frac{1}{N_0}\sum_{j=1}^{N}(1-w_j)\,y_j\,\omega(j) \;=\; \widehat{\text{ATET}}_{\text{Reweighting}} \]
where ω(j) = (N0/N1) Σ_{i=1}^{N} wi h(i, j) are the REW factors, C(i) is the untreated subjects' neighborhood
for the treated subject i, and h(i, j) are matching weights that, once opportunely
specified, produce different types of matching methods. Results from all of these
estimators are reported in table 1.
Table 1. Comparison of ATE, ATET, and ATENT estimates among DIM, CF-OLS, REW, and MATCH

          (1)       (2)       (3)         (4)         (5)           (6)           (7)
          DIM       CF-OLS    REW         REW         REW           REW           MATCH(a)
                              (probit)    (logit)     (probit)      (logit)
                              analytical  analytical  bootstrapped  bootstrapped
                              std. err.   std. err.   std. err.     std. err.
ATE       1.77***   0.374***  0.43***     0.415***    0.434***      0.415***      0.316***
          0.062     0.051     0.068       0.068       0.070         0.071         0.080
          28.46     7.35      6.34        6.09        6.15          5.87          3.93
ATET                0.255***  0.355**     0.345***    0.355***      0.345***      0.131
                    0.048     0.15        0.104       0.0657        0.054         0.249
                    5.37      2.37        3.33        5.50          6.45          0.52
ATENT               0.523***  0.532***    0.503**     0.532***      0.503***      0.549***
                    0.075     0.19        0.257       0.115         0.119         0.135
                    7.00      2.81        1.96        4.61          4.21          4.07

Note: each cell reports the coefficient, standard error, and t statistic (b/se/t). DIM = difference in means;
CF-OLS = control-function OLS; REW = reweighting on the propensity score; MATCH = one-to-one
nearest-neighbor matching. (a) Standard errors for ATE and ATENT are computed by bootstrapping.
*** significant at 1%, ** at 5%, * at 10%.
For CF-OLS, standard errors for ATET and ATENT are obtained via bootstrap.
Results set out in columns 3–6 refer to the REW estimator. In columns 3 and 4,
standard errors are computed analytically, whereas in columns 5 and 6, they are
computed via bootstrap for the probit and logit models, respectively. These results can be
retrieved by typing sequentially
. treatrew children educ7 age agesq evermarr urban electric tv, model(probit)
. treatrew children educ7 age agesq evermarr urban electric tv, model(logit)
. bootstrap e(ate) e(atet) e(atent), reps(200):
> treatrew children educ7 age agesq evermarr urban electric tv, model(probit)
. bootstrap e(ate) e(atet) e(atent), reps(200):
> treatrew children educ7 age agesq evermarr urban electric tv, model(logit)
where the option common restricts the sample to subjects with common support. To
test the balancing property for such a matching estimation, we provide a DIM on the
propensity score before and after matching treated and untreated subjects, using the
psmatch2 postestimation command pstest.
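The matching itself is performed with psmatch2 (the call is not reproduced above); a sketch of the matching and balancing-test commands follows, in which the psmatch2 options shown are assumptions based on the one-to-one nearest-neighbor matching with common support described in the text:
. psmatch2 educ7 age agesq evermarr urban electric tv, outcome(children) neighbor(1) common
. pstest age agesq evermarr urban electric tv, both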
(output omitted )
This test suggests that with regard to the propensity score, the matching procedure
implemented by psmatch2 is balanced, so we can trust matching results (the propensity
score was unbalanced before matching, and it becomes balanced after matching).
Unlike DIM, results from CF-OLS and REW are fairly comparable in terms of both
coefficients' size and significance: the values of ATE, ATET, and ATENT obtained using
REW on the propensity score are a little higher than those obtained using CF-OLS. This
means that the linearity of the potential-outcome equations assumed by CF-OLS is
an acceptable approximation. According to the value of ATET, as obtained by REW and
visible in column 3 of table 1, an educated woman in Botswana would have been, ceteris
paribus, significantly more fertile had she been less educated. We can conclude that
education has a negative impact on fertility, leading a woman to have around 0.5 fewer
children. If confounding variables were not considered, as happens when using DIM, this
negative effect would appear dramatically higher, around 1.77 children: the difference
between 1.77 and 0.5 (around 1.3) is an estimate of the bias induced by the presence
of selection on observables.
Columns 3 and 4 show REW results using Wooldridge's (2010) analytical standard
errors in the case of probit and logit, respectively. As partly expected, these results
are similar. But the REW results when standard errors are obtained via bootstrap
(columns 5 and 6) are more interesting. Here statistical significance is confirmed when
compared with results derived from analytical formulas. However, bootstrapping seems
to increase significance for both ATET and ATENT, while the standard error for ATE is
in line with the analytical one.
Some differences in results emerge when applying the one-to-one nearest-neighbor
matching (column 7) on this dataset. In this case, ATET becomes insignificant, with a
magnitude that is around one-third lower than that obtained by REW. As said above, the
standard errors of ATE and ATENT are here obtained via bootstrap because psmatch2
does not provide analytical solutions for these two parameters. Nevertheless, as proved
by Abadie and Imbens (2008), bootstrap performance is generally poor in the case of
matching, so these results have to be taken with some caution.
Finally, figure 1 sets out the estimated kernel densities for the distributions of ATE(x),
ATET(x), and ATENT(x) when treatrew is used with options graphic and range(-30
30). It is evident that the distribution of ATET(x) is a bit more concentrated around
its mean (equal to ATET) than the distribution of ATENT(x) is; this indicates that more
educated women respond more homogeneously to a higher level of education. On the
contrary, less educated women react more heterogeneously to a potential higher level of
education.
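The figure corresponds to a treatrew call of the following form (a sketch assembled from the options documented earlier; the call itself is not reproduced above):
. treatrew children educ7 age agesq evermarr urban electric tv, model(logit) graphic range(-30 30)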
Figure 1. Kernel density distributions of ATE(x), ATET(x), and ATENT(x) over x (logit model)
. use fertil2
. teffects ipw (children) (educ7 $xvars, probit), ate
Iteration 0: EE criterion = 6.624e-21
Iteration 1: EE criterion = 4.722e-32
Treatment-effects estimation Number of obs = 4358
Estimator : inverse-probability weights
Outcome model : weighted mean
Treatment model: probit
Robust
children Coef. Std. Err. z P>|z| [95% Conf. Interval]
ATE
educ7
(1 vs 0) -.1531253 .0755592 -2.03 0.043 -.3012187 -.0050319
POmean
educ7
0 2.208163 .0689856 32.01 0.000 2.072954 2.343372
In this estimation, we see that the value of ATE is −0.153 with a standard error of
0.075, which results in a moderately significant effect of educ7 on children.
This value of ATE can also be obtained using a simple WLS regression of y on w and
a constant, with weights hi designed in this way:
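The weight construction and the regression call are not reproduced above; a minimal sketch is the following, where $xvars is assumed to hold the covariates used in the teffects call and the point estimate then coincides with the normalized (Hájek) IPW estimate because regress implicitly renormalizes probability weights within each group:
. probit educ7 $xvars
. predict double ps, pr
. generate double h = educ7/ps + (1 - educ7)/(1 - ps)
. regress children educ7 [pweight=h]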
Robust
children Coef. Std. Err. t P>|t| [95% Conf. Interval]
This table shows that the results of the commands calculating IPW and WLS for ATE
are identical. A difference, however, appears in the estimated standard errors, which
are quite divergent: 0.075 for IPW against 0.108 for WLS. Moreover, observe that ATE as
calculated by WLS becomes nonsignificant.
Why are these standard errors different? The answer resides in the different approaches
used for estimating the variance of ATE (and, possibly, ATET): the WLS regression uses the
usual OLS variance–covariance matrix adjusted for the presence of a matrix of weights,
let's say Ω; however, WLS does not consider the presence of a generated regressor,
namely, the weights computed through the propensity scores estimated in the first step.
On the contrary, IPW accounts for the variability introduced by the generated weights
by exploiting a generalized method of moments approach for estimating the correct
variance–covariance matrix (see StataCorp [2013, 68–88]). In this sense, IPW is a more
robust approach than a standard WLS regression.
As implemented in Stata, both WLS and IPW by default use normalized weights,
that is, weights that add up to one. treatrew, on the contrary, uses nonnormalized
weights, which is why the ATE values obtained from treatrew (see the previous section)
are numerically different from those obtained from WLS and IPW. As proved by Busso,
DiNardo, and McCrary (2008, 7), a general formula for estimating ATE by REW is
\[ \widehat{\text{ATE}} = \frac{1}{N}\sum_{i=1}^{N} w_i\,y_i\,h_{i1} \;-\; \frac{1}{N}\sum_{i=1}^{N}(1-w_i)\,y_i\,h_{i0} \tag{8} \]
where
\[ h_{i1} = 1/\hat{p}(x_i), \qquad h_{i0} = 1/\{1-\hat{p}(x_i)\} \]
Such weights do not sum up to one. In this case, analytical standard errors cannot be
retrieved by a weighted regression, and the method suggested by Wooldridge (2010)
and implemented through treatrew for getting correct analytical standard errors for
ATE, ATET, and ATENT is thus needed, because a generated regressor from the first-step
estimation is used in the second step.
The normalized weights used in WLS and IPW are instead
\[ h_{i1} = \frac{1/\hat{p}(x_i)}{\dfrac{1}{N_1}\displaystyle\sum_{i=1}^{N} w_i/\hat{p}(x_i)}, \qquad
   h_{i0} = \frac{1/\{1-\hat{p}(x_i)\}}{\dfrac{1}{N_0}\displaystyle\sum_{i=1}^{N}(1-w_i)/\{1-\hat{p}(x_i)\}} \]
Appendix B shows that if the formula of ATE implemented in treatrew were written using
normalized (rather than nonnormalized) weights, then treatrew's ATE
estimate would become numerically equivalent to the value of ATE obtained by the
commands used to calculate WLS and IPW.
Thus we can assert that both teffects ipw and treatrew lead to correct analytical
standard errors, because both take into account that the propensity score is a generated
regressor from a first-step (probit or logit) regression. The different values of ATE and
ATET obtained in the two approaches are due only to the different weighting schemes
(normalized versus nonnormalized).
In short, treatrew is useful when considering nonnormalized weights, that is, when a
pure IPW scheme is used. Moreover, compared with teffects ipw, treatrew provides
an estimate of ATENT, though it does not by default provide an estimate of the
mean potential outcomes.
6 Conclusion
This article provides a command, treatrew, for estimating ATEs by REW on the propensity
score as proposed by Rosenbaum and Rubin (1983). Although REW is a popular and
long-standing statistical technique to deal with the bias induced by drawing inference
from a nonrandom sample, an implementation in Stata with parameters'
analytic standard errors (as proposed by Wooldridge [2010, 920–930]) and a nonnormalized
weighting scheme was still missing. This article and the accompanying ado-file fill
this gap by providing an easy-to-use implementation of the REW method, which can be
used as a valuable tool for estimating causal effects under selection on observables.
7 References
Abadie, A., and G. W. Imbens. 2008. On the failure of the bootstrap for matching
estimators. Econometrica 76: 1537–1557.
Brunell, T. L., and J. DiNardo. 2004. A propensity score reweighting approach to estimating
the partisan effects of full turnout in American presidential elections. Political
Analysis 12: 28–45.
Busso, M., J. DiNardo, and J. McCrary. 2008. Finite sample properties of semiparametric
estimators of average treatment effects.
http://elsa.berkeley.edu/users/cle/laborlunch/mccrary.pdf.
Cerulli, G. 2011. ivtreatreg: A new Stata routine for estimating binary treatment
models with heterogeneous response to treatment under observable and unobservable
selection. 8th Italian Stata Users Group meeting proceedings.
http://www.stata.com/meeting/italy11/abstracts/italy11_cerulli.pdf.
Hirano, K., G. W. Imbens, and G. Ridder. 2003. Efficient estimation of average treatment
effects using the estimated propensity score. Econometrica 71: 1161–1189.
Leuven, E., and B. Sianesi. 2003. psmatch2: Stata module to perform full Mahalanobis
and propensity score matching, common support graphing, and covariate imbalance
testing. Statistical Software Components S432001, Department of Economics, Boston
College. http://ideas.repec.org/c/boc/bocode/s432001.html.
Li, Q., J. S. Racine, and J. M. Wooldridge. 2009. Efficient estimation of average treatment
effects with mixed categorical and continuous data. Journal of Business and
Economic Statistics 27: 206–223.
Lunceford, J. K., and M. Davidian. 2004. Stratification and weighting via the propensity
score in estimation of causal treatment effects: A comparative study. Statistics in
Medicine 23: 2937–2960.
Morgan, S. L., and D. J. Harding. 2006. Matching estimators of causal effects: Prospects
and pitfalls in theory and practice. Sociological Methods and Research 35: 3–60.
Nichols, A. 2007. Causal inference with observational data. Stata Journal 7: 507–541.
Robins, J. M., M. A. Hernán, and B. Brumback. 2000. Marginal structural models and
causal inference in epidemiology. Epidemiology 11: 550–560.
Rosenbaum, P. R., and D. B. Rubin. 1983. The central role of the propensity score in
observational studies for causal effects. Biometrika 70: 41–55.
Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed.
Cambridge, MA: MIT Press.
Appendix A
This appendix provides the mathematical steps to get the REW formulas for ATEs as
reported in (1)–(3). Observe first that wy = w{wy1 + (1 − w)y0} = w²y1 + wy0 − w²y0 =
wy1 because w² = w. Therefore,
\[ E\left\{\frac{wy}{p(x)}\,\Big|\,x\right\}
 = E\left\{\frac{wy_1}{p(x)}\,\Big|\,x\right\}
 \overset{\text{LIE}}{=} E\left[E\left\{\frac{wy_1}{p(x)}\,\Big|\,x,w\right\}\Big|\,x\right]
 = E\left\{\frac{w\,E(y_1|x,w)}{p(x)}\,\Big|\,x\right\} \]
\[ \overset{\text{CMI}}{=} E\left\{\frac{w\,E(y_1|x)}{p(x)}\,\Big|\,x\right\}
 = E\left\{\frac{w\,g_1(x)}{p(x)}\,\Big|\,x\right\}
 = g_1(x)\,E\left\{\frac{w}{p(x)}\,\Big|\,x\right\}
 = \frac{g_1(x)}{p(x)}\,E(w|x) = \frac{g_1(x)}{p(x)}\,p(x) = g_1(x) \tag{9} \]
p(x) p(x)
provided that 0 < p(x) < 1. To get ATE, one needs to take the expectation of ATE(x)
over x,
\[ \text{ATE} = E_x\{\text{ATE}(x)\}
 = E_x\left[E\left\{\frac{\{w-p(x)\}\,y}{p(x)\{1-p(x)\}}\,\Big|\,x\right\}\right]
 = E\left[\frac{\{w-p(x)\}\,y}{p(x)\{1-p(x)\}}\right] \]
For ATET, observe that
\[ \frac{\{w-p(x)\}\,y}{1-p(x)} = \frac{\{w-p(x)\}\,y_0}{1-p(x)} + w(y_1-y_0) \tag{11} \]
Consider now the quantity {w − p(x)}y0 on the right-hand side of (11). We see that
its conditional expectation given x is zero under CMI; that is,
\[ E\left[\frac{\{w-p(x)\}\,y}{1-p(x)}\right] = E\{w(y_1-y_0)\} \]
Denoting the left-hand side quantity by h,
\[ E(h) = E\{w(y_1-y_0)\}
 = p(w=1)\,E\{w(y_1-y_0)\,|\,w=1\} + p(w=0)\,E\{w(y_1-y_0)\,|\,w=0\} \]
\[ = p(w=1)\,E\{(y_1-y_0)\,|\,w=1\} = p(w=1)\cdot\text{ATET} \]
proving that
\[ \text{ATET} = E\left[\frac{\{w-p(x)\}\,y}{p(w=1)\{1-p(x)\}}\right] \]
Appendix B
In this appendix, we show that if one considers the formula of ATE as implemented
in treatrew but uses normalized rather than nonnormalized weights, then treatrew's
ATE estimate becomes numerically equivalent to the ATE obtained by commands used
to calculate WLS and IPW. To this purpose, we first calculate the ATE estimator by
means of the general formula in (8), adopting the normalized IPW weights:
\[ \widehat{\text{ATE}} = \frac{1}{N}\sum_{i=1}^{N} w_i\,y_i\,h_{i1} \;-\; \frac{1}{N}\sum_{i=1}^{N}(1-w_i)\,y_i\,h_{i0} \]
As an intermediate step, we show that the normalized weights sum up to one for the weights
of both the treated and the untreated subjects.
Second, we compute the estimate of ATE by multiplying the two summands for
the treated and untreated units in (8) by the outcome y (equal in this example to the
variable children):
which is numerically equivalent to the value of the ATE obtained via WLS and IPW.
The Stata Journal (2014) 14, Number 3, pp. 562–579
Abstract. We present motivation and new commands for modeling count data.
While our focus is to present new commands for estimating count data, we also
discuss generalized binomial regression and present the zero-inflated versions of
each model.
Keywords: st0351, gbin, zigbin, nbregf, nbregw, zinbregf, zinbregw, binomial, Waring,
count data, overdispersion, underdispersion
1 Introduction
We introduce programs for regression models of count data. Poisson regression analysis
is widely used to model such response variables, but the Poisson model assumes
equidispersion (equality of the mean and variance). In practice, equidispersion is rarely
reflected in data. In most situations, the variance exceeds the mean. This occurrence
of extra-Poisson variation is known as overdispersion (see, for example, Dean [1992]).
In situations where the variance is smaller than the mean, data are characterized as
being underdispersed. Modeling underdispersed count data with inappropriate models
can lead to overestimated standard errors and misleading inference. While there are
various approaches for modeling overdispersed count data, such as the negative binomial
distributions and other mixtures of Poisson (Yang et al. 2007; Hilbe 2014), there are
few models for underdispersed count data. Harris, Yang, and Hardin (2012) introduced
a generalized Poisson regression command to handle underdispersed count data.
As stated earlier, count data can be analyzed using regression models based on the
Poisson distribution. However, in this article, we will discuss other discrete regression
models that can be used, such as the generalized negative binomial distribution, which
was described by Jain and Consul (1971) and later by Consul and Gupta (1980). The
distribution was also investigated by Famoye (1995), who illustrated a use for analyzing
grouped binomial data.
2 The models
2.1 Generalized negative binomial: Famoye
As implemented in the accompanying software, the NBREGF model assumes that μ is
a scalar unknown parameter. Thus the probability mass function (PMF), mean, and
variance are given by
\[ P(Y=y) = \frac{\mu}{\mu+\Phi y}\binom{\mu+\Phi y}{y}\,\theta^{y}(1-\theta)^{\mu+\Phi y-y} \tag{1} \]
where 0 < θ < 1 and 1 ≤ Φ < 1/θ for μ > 0 and nonnegative outcomes yi (0, 1, 2, . . .).
\[ E(Y) = \mu\theta\,(1-\theta\Phi)^{-1} \]
\[ V(Y) = \mu\theta(1-\theta)\,(1-\theta\Phi)^{-3} \]
The main differences from the GBIN model are that the parameter μ is an unknown
parameter in (1) but a known parameter in (2), and that Φ ≥ 1. In the limit
Φ → 1, the variance approaches that of the negative binomial distribution. Thus the Φ
parameter generalizes the negative binomial distribution in the NBREGF model to have
greater variance than is allowed in a negative binomial regression model. To construct a
regression model, we implemented the log link log(μ) = xβ to make results comparable
to Poisson and negative binomial models.
\[ E(Y) = n\,\frac{\pi}{1+\phi\pi}\left(1-\frac{\phi\pi}{1+\phi\pi}\right)^{-1} = n\pi \]
\[ V(Y) = n\,\frac{\pi}{1+\phi\pi}\left(1-\frac{\pi}{1+\phi\pi}\right)(1+\phi\pi)^{3} = n\pi(1-\pi+\phi\pi)(1+\phi\pi) \]
Parameterizing g(π) = xβ, where g(·) is a suitable link function, assuming that π plays
the role of the probability of success, we obtain results that coincide with a grouped-data
binomial model. The variance is equal to the binomial variance if φ = 0, and it is
equal to the negative binomial variance if φ = 1. Thus the φ > 0 parameter generalizes the
binomial distribution in the GBIN regression model.
i. Y | x, λx, ν ∼ Poisson(λx ν)
ii. λx | ν ∼ Gamma(ax, ν)
iii. ν ∼ Beta(ρ, k)
where k, ρ, ax > 0, ax = μ(ρ − 1)/k, and (a)w is the Pochhammer notation for Γ(a + w)/Γ(a)
if a > 0. The expected value and variance of the distribution are
\[ E(Y) = \frac{a_x k}{\rho-1} = \mu \]
\[ V(Y) = \mu + \frac{k+1}{\rho-2}\,\mu + \frac{k+1}{k(\rho-2)}\,\mu^{2} \tag{3} \]
where ax, k > 0 and ρ > 2 (to ensure nonnegative variance). To construct a regression
model, we implemented the log link log(μ) = xβ to make results comparable to Poisson
and negative binomial models. A unique characteristic of this model occurs when the
data are from a different underlying distribution. For instance, when the data are
from a Poisson distribution with V(Y) = μ, this indicates that (k + 1)/(ρ − 2) → 0 and
(k + 1)/{k(ρ − 2)} → 0, which occurs as k, ρ → ∞. Also, if the data have an underlying NB-2
(negative binomial-2) distribution with V(Y) = μ + αμ² (where α is the dispersion
parameter), this indicates that (k + 1)/(ρ − 2) → 0 and (k + 1)/{k(ρ − 2)} → α,
which occurs as ρ → ∞ with k(ρ − 2) → 1/α.
where p is the probability that the binary process results in a zero outcome, 0 ≤ p < 1,
and f(y) is the count probability function. Zero-inflated models are proposed for the NBREGF,
GBIN, and NBREGW distributions.
3 Syntax
The accompanying software includes the command files as well as supporting files for
prediction and help. In the following syntax diagrams, unspecified options include the
usual collection of maximization and display options available to all estimation commands.
In addition, all zero-inflated commands include the ilink(linkname) option to
specify the link function for the inflation model. The generalized binomial model for
grouped binomial data also includes the link(linkname) option for linking the probability
of success to the linear predictor. Supported linknames include logit, probit,
loglog, and cloglog.
The syntax for specifying a generalized binomial regression model for grouped data
is given by
gbin depvar indepvars [if] [in] [weight] [, options]
The syntax for fitting a generalized negative binomial regression model where the
distribution is assumed to follow Famoye's description is given by
nbregf depvar indepvars [if] [in] [weight] [, options]
The syntax for fitting a generalized negative binomial regression model where the
distribution is derived from the Waring distribution is given by
nbregw depvar indepvars [if] [in] [weight] [, options]
The syntax for specifying a zero-inflated count model where the count distribution
follows that described by Famoye is given by
zinbregf depvar indepvars [if] [in] [weight] [,
inflate(varlist [, offset(varname)] | _cons) vuong options]
The syntax for specifying a zero-inflated count model where the count distribution
follows the Waring distribution is given by
zinbregw depvar indepvars [if] [in] [weight] [,
inflate(varlist [, offset(varname)] | _cons) vuong options]
A Vuong test (see Vuong [1989]) evaluates whether the regression model with zero
inflation or the regression model without zero inflation is closer to the true model. A
random variable is defined as the vector ω = log LZ − log LS, where LZ is the likelihood of
the zero-inflated model evaluated at its maximum likelihood estimates, and LS is the
likelihood of the standard (nonzero-inflated) model evaluated at its maximum likelihood
estimates. The vector of differences over the N observations is then used to define the
statistic
\[ V = \frac{\sqrt{N}\,\overline{\omega}}{\sqrt{\sum_{i=1}^{N}(\omega_i-\overline{\omega})^{2}/(N-1)}} \]
which, asymptotically, is characterized by a standard normal distribution. A significant
positive statistic indicates preference for the zero-inflated model, and a significant
negative statistic indicates preference for the model without zero inflation. Nonsignificant
Vuong statistics indicate no preference for either model. Results of this test are
included in a footnote to the estimation output when the user includes the vuong
option in any of the zero-inflated commands. Vuong statistics with corrections based
on the Akaike information criterion (AIC) and the Bayesian information criterion (BIC)
are also displayed in the output (see Desmarais and Harden [2013] for details). They
are displayed for each of the zero-inflated models discussed in this article.
4 Example
We shall use the popular German health data for the year 1984 as example data. The
goal of our model is to understand the number of visits made to a physician during 1984.
Our predictor of interest is whether the patient is highly educated based on achieving
a graduate degree, for example, an MA or MS, an MBA, a PhD, or a professional degree.
Confounding predictors are age (from 25–64) and income in German Marks, divided by
10. We first model the data using Poisson regression. The glm command is used to
determine the Pearson dispersion, or dispersion statistic, which is not available using
the poisson command.
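The Poisson fit itself is not reproduced here; a minimal sketch of the kind of call described (the covariate list edlevel4 age hh is carried over from the models shown below and is an assumption):
. glm docvis edlevel4 age hh, family(poisson) eform nolog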
OIM
docvis IRR Std. Err. z P>|z| [95% Conf. Interval]
. estat ic
Akaike's information criterion and Bayesian information criterion
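The negative binomial fit being compared next was omitted from the listing; a minimal sketch of how such an NB-2 model could be obtained (command choice and covariate list are assumptions):
. nbreg docvis edlevel4 age hh, irr nolog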
. estat ic
Akaike's information criterion and Bayesian information criterion
The AIC and BIC statistics are substantially lower here than they are for the Poisson
model, indicating a much better fit than the Poisson model.
. display 1/exp(_b[edlevel4])
1.3763358
Patients without a graduate education are 38% more likely to see a physician than
are patients with a graduate education. We can likewise affirm that patients without
a graduate education saw a physician 38% more often in 1984 than patients with a
graduate education.
The negative binomial model did not adjust for all the correlation, or dispersion, in
the data.
This is perhaps due to the excessive number of times a patient in the data never
saw a physician in 1984. A tabulation of docvis shows that nearly 42% of the 3,874
patients in the data did not visit a physician. This value is far greater than the one
accounted for by the Poisson and negative binomial distributional assumptions.
. count if docvis==0
1611
. display "Zeros account for " %4.2f (r(N)*100/3874) "% of the outcomes"
Zeros account for 41.58% of the outcomes
Given the excess zero counts in docvis, it may be wise to employ a zero-inflated
regression model on the data. At the least, we can determine which predictors tend to
prevent patients from going to the doctor.
. zinb docvis edlevel4 age hh, nolog inflate(edlevel4 age hh) irr
Zero-inflated negative binomial regression Number of obs = 3874
Nonzero obs = 2263
Zero obs = 1611
Inflation model = logit LR chi2(3) = 98.50
Log likelihood = -8330.799 Prob > chi2 = 0.0000
docvis
edlevel4 .9176719 .1289238 -0.61 0.541 .6967903 1.208573
age 1.020511 .0025432 8.15 0.000 1.015538 1.025508
hh .4506524 .0720932 -4.98 0.000 .3293598 .6166132
_cons 1.768336 .2419851 4.17 0.000 1.352333 2.31231
inflate
edlevel4 1.174194 .3519899 3.34 0.001 .4843067 1.864082
age -.0521002 .0115586 -4.51 0.000 -.0747547 -.0294458
hh .2071444 .570265 0.36 0.716 -.9105545 1.324843
_cons -.037041 .4438804 -0.08 0.933 -.9070305 .8329486
. estat ic
Akaike's information criterion and Bayesian information criterion
The AIC statistic is 20 points lower in the zero-inflated model, but the BIC statistic is 5
points higher. However, the variables edlevel4 and age appear to affect the zero counts,
with younger graduate patients more likely to not see a physician at all during the year.
Given the zero-inflated model, patients without a graduate education see the physician
9% more often than patients with a graduate education.
. display 1/exp(_b[edlevel4])
1.0897141
Because the excess zero counts did not appear to account for the extra correlation in the
data, there may be other factors. We employ a generalized Waring negative binomial model
to further identify the source of extra dispersion.
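The Waring fit itself is not reproduced; a sketch of the call, using the nbregw syntax given in section 3 (the covariate list is an assumption):
. nbregw docvis edlevel4 age hh, nolog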
. estat ic
Akaike's information criterion and Bayesian information criterion
The AIC and BIC statistics are substantially lower here than for either the negative
binomial or the zero-inflated version. For the calculated ρ̂ and k̂, V(Y) = μ + 0.624μ +
2.994μ², where μ is the mean. Here we see that the term (k + 1)/{k(ρ − 2)} =
2.994, from (3), is close to the dispersion parameter α = 2.319 obtained when using the NB-2
regression model above. More information on the background of this model can
be found in Hilbe (2011).
To address the excess zeros in the outcome, we also fit a zero-inflated Waring model.
. zinbregw docvis edlevel4 age hh, nolog inflate(edlevel4 age hh) eform vuong
Zero-inflated gen neg binomial-W regression Number of obs = 3874
Regression link: Nonzero obs = 2263
Inflation link : logit Zero obs = 1611
Wald chi2(3) = 66.10
Log likelihood = -8262.174 Prob > chi2 = 0.0000
docvis
edlevel4 .9414482 .1406355 -0.40 0.686 .7024933 1.261684
age 1.017108 .0024842 6.95 0.000 1.012251 1.021989
hh .4841428 .0964645 -3.64 0.000 .3276222 .7154409
_cons 2.457403 .3313549 6.67 0.000 1.886691 3.200751
inflate
edlevel4 .613575 .2222675 2.76 0.006 .1779387 1.049211
age -.026716 .0048778 -5.48 0.000 -.0362763 -.0171558
hh -.0137845 .3544822 -0.04 0.969 -.7085569 .6809879
_cons .1834942 .245023 0.75 0.454 -.2967421 .6637305
Vuong test of zinbregw vs. gen neg binomial(W): z = 0.55 Pr>z = 0.2897
Bias-corrected (AIC) Vuong test: z = 0.13 Pr>z = 0.4482
Bias-corrected (BIC) Vuong test: z = -1.20 Pr>z = 0.8845
. estat ic
Akaike's information criterion and Bayesian information criterion
Note that introducing the zero-inflation component into the regression model results
in losing significance of the education level in the model of the mean outcomes. However,
that variable does play a significant role (along with age) in determining whether a
person has zero visits to the doctor.
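The Famoye generalized negative binomial fit discussed next was omitted from the listing; a sketch of the call, using the nbregf syntax given in section 3 (the eform option and the covariate list are assumptions):
. nbregf docvis edlevel4 age hh, nolog eform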
. estat ic
Akaike's information criterion and Bayesian information criterion
Note that the risk ratios are nearly identical to those of the NB-2 negative binomial model.
The AIC and BIC statistics are lower than for NB-2, but only by about 12 and 5 points,
respectively. Because of the excessive zero counts, we also fit a zero-inflated model.
. zinbregf docvis edlevel4 age hh, nolog inflate(edlevel4 age hh) eform vuong
Zero-inflated gen neg binomial-F regression Number of obs = 3874
Regression link: Nonzero obs = 2263
Inflation link : logit Zero obs = 1611
LR chi2(3) = 176.08
Log likelihood = -8292.015 Prob > chi2 = 0.0000
docvis
edlevel4 .9125286 .1191361 -0.70 0.483 .7065079 1.178626
age 1.017058 .0024233 7.10 0.000 1.012319 1.021818
hh .4915087 .0753322 -4.63 0.000 .3639736 .6637315
_cons .0010836 .2112138 -0.04 0.972 1.3e-169 8.9e+162
inflate
edlevel4 .7118035 .2073926 3.43 0.001 .3053213 1.118286
age -.0380198 .0054111 -7.03 0.000 -.0486254 -.0274142
hh .2529651 .3447803 0.73 0.463 -.422792 .9287221
_cons .368429 .2425669 1.52 0.129 -.1069933 .8438514
Vuong test of zinbregf vs. gen neg binomial(F): z = 6.23 Pr>z = 0.0000
Bias-corrected (AIC) Vuong test: z = 5.68 Pr>z = 0.0000
Bias-corrected (BIC) Vuong test: z = 3.99 Pr>z = 0.0000
. estat ic
Akaike's information criterion and Bayesian information criterion
The AIC and BIC statistics are substantially lower than for the nonzero-inflated
parameterization, and they are also lower than for the Waring regression model. Here we find
that younger patients without a graduate education see physicians more frequently than
patients with a graduate education (as we discovered before) and that the important
statistics are the two ancillary (dispersion) parameters.
commands, we generate data following a complementary log-log link function for the
generalized binomial outcome and a log-log link for the zero-inflation component.
Once we have defined the components of the outcome and the necessary covariates,
we generate the outcome. The zero-inflated version of the outcome is the product of
the binomial outcome and the zero-inflation (binary) component.
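The data-generation commands themselves are not reproduced; the following is a minimal sketch of the kind of code described above, in which the sample size, the covariate names (x1, z1, z2), the binomial denominator, and all coefficient values are illustrative assumptions rather than the authors' exact choices:
. clear
. set seed 12345
. set obs 5000
. generate double x1 = rnormal()
. generate double z1 = rnormal()
. generate double z2 = rnormal()
. generate n = 20
. * complementary log-log link for the binomial success probability
. generate double p = 1 - exp(-exp(-2 + 0.45*x1))
. generate y = rbinomial(n, p)
. * log-log link for the probability of NOT being an excess zero
. generate double pnz = exp(-exp(-(0.2 + 0.25*z1 + 0.4*z2)))
. generate byte notzero = rbinomial(1, pnz)
. * zero-inflated outcome: product of the binomial outcome and the binary component
. generate yo = y*notzero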
Before fitting the zero-inflated model for the zero-inflated outcome, we first illustrate
how well a zero-inflated model might fit the nonzero-inflated outcome. In this case, we
should expect the binomial regression components to estimate the means well, and we
should expect the covariates of the zero-inflation component to be nonsignificant.
y
x1 .447438 .0681432 6.57 0.000 .3138797 .5809963
_cons -1.958826 .0540354 -36.25 0.000 -2.064733 -1.852918
inflate
z1 .4499741 .4248806 1.06 0.290 -.3827765 1.282725
z2 2.068714 60.05847 0.03 0.973 -115.6437 119.7812
_cons -3.264426 60.05983 -0.05 0.957 -120.9795 114.4507
Note that the Vuong statistic was nonsignificant in this example. Though it fails to
provide compelling evidence for one model over the other, we would prefer the nonzero-inflated
model because of the lack of significant covariates in the inflation equation. When we
fit a zero-inflated model for the outcome that was specifically generated to include zero
inflation, we see a much better fit.
yo
x1 .4628085 .086265 5.36 0.000 .2937322 .6318848
_cons -1.969505 .0873894 -22.54 0.000 -2.140785 -1.798225
inflate
z1 .2292778 .1270487 1.80 0.071 -.019733 .4782886
z2 .3955768 .1296781 3.05 0.002 .1414125 .6497411
_cons -.4796692 .1724896 -2.78 0.005 -.8177426 -.1415958
Here the Vuong test indicates a clear preference for the zero-inflated model, and we
note that the estimated coefficients are close to the values we specified in synthesizing
these data.
6 References
Cameron, A. C., and P. K. Trivedi. 2013. Regression Analysis of Count Data. 2nd ed.
Cambridge: Cambridge University Press.
Consul, P. C., and H. C. Gupta. 1980. The generalized negative binomial distribution
and its characterization by zero regression. SIAM Journal on Applied Mathematics
39: 231–237.
Dean, C. B. 1992. Testing for overdispersion in Poisson and binomial regression models.
Journal of the American Statistical Association 87: 451–457.
Desmarais, B. A., and J. J. Harden. 2013. Testing for zero inflation in count models:
Bias correction for the Vuong test. Stata Journal 13: 810–835.
Famoye, F. 1995. Generalized binomial regression model. Biometrical Journal 37: 581–594.
Hardin, J. W., and J. M. Hilbe. 2012. Generalized Linear Models and Extensions. 3rd
ed. College Station, TX: Stata Press.
Harris, T., Z. Yang, and J. W. Hardin. 2012. Modeling underdispersed count data with
generalized Poisson regression. Stata Journal 12: 736–747.
Jain, G. C., and P. C. Consul. 1971. A generalized negative binomial distribution. SIAM
Journal on Applied Mathematics 21: 501–513.
Tang, W., H. He, and X. M. Tu. 2012. Applied Categorical and Count Data Analysis.
Boca Raton, FL: Chapman & Hall/CRC.
Vuong, Q. H. 1989. Likelihood ratio tests for model selection and non-nested hypotheses.
Econometrica 57: 307–333.
Wang, X.-F., Z. Jiang, J. J. Daly, and G. H. Yue. 2012. A generalized regression model
for region of interest analysis of fMRI data. Neuroimage 59: 502–510.
Winkelmann, R. 2008. Econometric Analysis of Count Data. 5th ed. Berlin: Springer.
Yang, Z., J. W. Hardin, C. L. Addy, and Q. H. Vuong. 2007. Testing approaches for
overdispersion in Poisson regression versus the generalized Poisson model. Biometrical
Journal 49: 565–584.
Alfonso Flores-Lagunes
Department of Economics
State University of New York, Binghamton
Binghamton, NY
aflores@binghamton.edu
Alessandra Mattei
Department of Statistics, Informatics, Applications Giuseppe Parenti
University of Florence
Florence, Italy
mattei@disia.unifi.it
1 Introduction
The evaluation process in economics, sociology, law, and many other fields generally
relies on applying nonexperimental techniques to estimate average treatment effects.
Propensity-score methods (Rosenbaum and Rubin 1983) are attractive empirical tools
to balance the distribution of covariates between treatment groups and compare the
groups in terms of observed covariates. Under the unconfoundedness assumption, which
requires that potential outcomes are independent of the treatment conditional on the
observed covariates, propensity-score methods allow one to eliminate (or at least reduce)
the potential bias in treatment-effects estimates in observational studies. Most
applications aim to evaluate causal effects of a binary treatment. There is extensive
literature on identifying and estimating causal effects of binary treatments (for example,
Imbens and Wooldridge [2009]; Stuart [2010]; Angrist, Imbens, and Rubin [1996]),
and many statistical software packages have built-in or add-on functions for implementing
methods to estimate causal effects of programs or policies. For example,
Becker and Ichino (2002) developed a set of programs (pscore.ado) for estimating average
treatment effects on the treated using propensity-score matching by focusing on
four matching estimators: nearest-neighbor, radius, kernel, and stratification matching.
More recently, building on the work of Becker and Ichino (2002), Dorn (2012)
proposed a routine that helps improve covariate balance, and so the specification of the
propensity-score model, using data-driven approaches.
In many empirical studies, treatments may take on many values, implying that
participants in the study may receive different treatment levels. In such cases, one
may want to assess the heterogeneity of treatment effects arising from variation in the
amount of treatment exposure, that is, estimate a dose–response function (DRF). Over
the past years, propensity-score methods have been generalized and applied to multivalued
treatments (for example, Imbens [2000]; Lechner [2001]) and, more recently, to continuous
treatments and arbitrary treatment regimes (for example, Hirano and Imbens
[2004]; Imai and van Dyk [2004]; Flores et al. [2012]; Bia and Mattei [2012]; Kluve et al.
[2012]).
In this article, we build on work by Hirano and Imbens (2004), who introduced the
concept of the generalized propensity score (GPS) and used it to estimate the entire DRF
of a continuous treatment. Hirano and Imbens (2004) used a parametric partial-mean
approach to estimate the DRF. Here we focus on semiparametric techniques. Specifically,
we present a set of programs that allows users to i) estimate the GPS under alternative
parametric assumptions using generalized linear models;1 ii) impose the common support
condition as defined in Flores et al. (2012) and assess the balance of covariates after
adjusting for the estimated GPS; and iii) estimate the DRF using the estimated GPS by
applying either the nonparametric inverse-weighting (IW) kernel estimator developed in
Flores et al. (2012) or a new set of semiparametric estimators based on penalized spline
techniques.
1. Guardabascio and Ventura (2014) proposed the routine gpscore2.ado to estimate the GPS using
generalized linear models.
We use a dataset collected by Imbens, Rubin, and Sacerdote (2001) to illustrate these
programs and to evaluate the effect of the prize amount on subsequent labor earnings
of winners of the Megabucks lottery in Massachusetts in the mid-1980s. We implement
our programs to semiparametrically estimate the average potential postwinning labor
earnings for each lottery prize amount. The prize is obviously assigned at random,
but unit and item nonresponse lead to a self-selected sample where the prize amount
received is no longer independent of background characteristics.
This article is organized as follows: Section 2 describes the methodological approach
we refer to in the analysis. Section 3 introduces the GPS model and the semiparametric
estimators of the DRF. Sections 4.1 and 4.2 show, respectively, the syntax and the options
of the drf command. Section 5 illustrates the methods and the program using data
from Imbens, Rubin, and Sacerdote (2001). Section 6 concludes.
2 Estimation strategy
We estimate a continuous DRF that relates each value of the dose (for example, lottery
prize amount) to the outcome variable (for example, postwinning labor earnings) within
the potential-outcome approach to causal inference (Rubin 1974, 1978). Formally, con-
sider a set of N individuals, and denote each of them by subscript i: i = 1, . . . , N .
Under the stable unit treatment value assumption (Rubin 1980, 1990), for each unit
i, there is a set of potential outcomes {Yi(t)}t∈T, where T is a subset of the real line,
T ⊆ R. We are interested in estimating the average DRF, μ(t) = E{Yi(t)}.
For each individual i, we observe a vector of pretreatment covariates, Xi , the received
treatment level, Ti , and the corresponding value of the outcome for this treatment level,
Yi = Yi (Ti ).
The central assumption of our approach is that the assignment to treatment levels is
weakly unconfounded given the set of observed variables, that is, Yi(t) ⊥ Ti | Xi for all t ∈
T (Hirano and Imbens 2004). This assumption is described as weak unconfoundedness
because it requires only conditional independence for each potential outcome Yi(t) rather
than joint independence of all potential outcomes.
Under weak unconfoundedness, we can apply the GPS techniques for continuous
treatments introduced by Hirano and Imbens (2004). Let r(t, x) = fT|X(t|x) be the
conditional density of the treatment given the covariates. The GPS is defined as Ri =
r(Ti, Xi). The GPS is a balancing score (Rosenbaum and Rubin 1983; Hirano and Imbens
2004); that is, within strata with the same value of r(t, x), the probability that
T = t does not depend on the value of X. The weak unconfoundedness assumption,
combined with the balancing score property, implies that assignment to treatment is
weakly unconfounded given the GPS. Formally,
for every t ∈ T (theorem 1.2.2 in Hirano and Imbens [2004]). Thus any bias associated
with differences in the distribution of covariates across groups with different treatment
levels can be removed using the GPS. Formally, Hirano and Imbens (2004) showed that
3 Inference
We use two-step semiparametric estimators of the DRF. The first step is to parametrically
model and estimate the GPS, Ri = r(Ti, Xi), and to assess the common support
condition and the balance of the covariates. The second step is to estimate the average
DRF, μ(t), using either the nonparametric IW kernel estimator proposed by Flores et al.
(2012) or a semiparametric spline-based estimator. Here we describe these two steps,
implemented in the routine drf.
2. betafit (version 1.0.0 at the time of this writing) is available from the Statistical Software Com-
ponents archive (or findit betafit) and must be installed separately from drf.
compares the support of R̂i^k for those units with Qi = qk with that of units with Qi ≠ qk
and is given by the subsample
\[ CS_k = \left\{ i : \hat{R}_i^k \in \left[ \max\Bigl\{\min_{j:Q_j=q_k}\hat{R}_j^k,\ \min_{j:Q_j\neq q_k}\hat{R}_j^k\Bigr\},\ \min\Bigl\{\max_{j:Q_j=q_k}\hat{R}_j^k,\ \max_{j:Q_j\neq q_k}\hat{R}_j^k\Bigr\} \right] \right\} \]
Finally, the sample is restricted to units that are comparable across all the K intervals
simultaneously by keeping only individuals who are in the common
support region for all k intervals. Therefore, the common-support subsample is given
by CS = ∩_{k=1}^{K} CS_k.
As in applications of standard propensity-score methods, in GPS applications, it is
crucial to evaluate how well the estimated GPS balances the covariates. Several methods
can be applied to evaluate the balancing properties of the GPS. The drf command
implements two approaches: an approach based on blocking on the GPS and an approach
that uses a likelihood-ratio (LR) test. The blocking on the GPS approach was proposed
by Hirano and Imbens (2004), and it is implemented in the drf routine using two-
sided t tests or Bayes factors (see also Bia and Mattei [2008]). The second approach
was proposed by Flores et al. (2012), who suggested using an LR test to compare an
unrestricted model for Ti that includes all covariates and the GPS (up to a cubic term)
with a restricted model that sets the coefficients of all covariates equal to zero. If the GPS
sufficiently balances the covariates, then the covariates should have little explanatory
power conditional on the GPS.3
3. An alternative approach, which is not implemented in our program, was proposed by Kluve et al.
(2012). It consists of regressing each covariate on the treatment variable and comparing the significance
of the coefficients for specifications with and without conditioning on the GPS.
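As an illustration of this model-comparison idea, the two models can also be compared by hand outside of drf (a sketch with placeholder names: t for the treatment, gpshat for the estimated GPS, and x1, x2, x3 for the covariates; drf's internal implementation may differ):
. regress t x1 x2 x3 c.gpshat c.gpshat#c.gpshat c.gpshat#c.gpshat#c.gpshat
. estimates store unrestricted
. regress t c.gpshat c.gpshat#c.gpshat c.gpshat#c.gpshat#c.gpshat
. estimates store restricted
. lrtest unrestricted restricted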
The simplest bivariate penalized spline smoothing relies on additive spline bases,
which can be formally defined in our setting as
\[ E\bigl(Y_i \mid T_i, \hat{R}_i\bigr) = a_0 + a_t T_i + a_r \hat{R}_i
 + \sum_{k=1}^{K^t} u_k^t\,(T_i-\kappa_k^t)_+ + \sum_{k=1}^{K^r} u_k^r\,(\hat{R}_i-\kappa_k^r)_+ \tag{1} \]
where for any number z, z_+ is equal to z if z is positive and is equal to 0 otherwise, and
κ_1^t < · · · < κ_{K^t}^t and κ_1^r < · · · < κ_{K^r}^r are K^t and K^r distinct knots in the support of T
and the estimated GPS, R̂_i, respectively.
The additive models have many attractive features, one being their simplicity. However,
an additive model may not provide a satisfactory fit, so more complex models
including interaction terms are required. To this end, we consider tensor product
bases, which are obtained by forming all pairwise products of the basis functions
1, T_i, (T_i − κ_1^t)_+, . . . , (T_i − κ_{K^t}^t)_+ and 1, R̂_i, (R̂_i − κ_1^r)_+, . . . , (R̂_i − κ_{K^r}^r)_+. Formally,
\[ E\bigl(Y_i \mid T_i, \hat{R}_i\bigr) = a_0 + a_t T_i + a_r \hat{R}_i + a_{tr} T_i \hat{R}_i
 + \sum_{k=1}^{K^t} u_k^t (T_i-\kappa_k^t)_+ + \sum_{k=1}^{K^r} u_k^r (\hat{R}_i-\kappa_k^r)_+
 + \sum_{k=1}^{K^t} v_k^t\,\hat{R}_i (T_i-\kappa_k^t)_+ \]
\[ \qquad + \sum_{k=1}^{K^r} v_k^r\,T_i (\hat{R}_i-\kappa_k^r)_+
 + \sum_{k=1}^{K^t}\sum_{k'=1}^{K^r} v_{kk'}^{tr}\,(T_i-\kappa_k^t)_+ (\hat{R}_i-\kappa_{k'}^r)_+ \tag{2} \]
Estimation problems may arise when the tensor product approach is applied, espe-
cially if the sample size is relatively small. When these problems arise, the drf program
alerts users and suggests they adopt an additive model instead.
As an alternative to tensor product splines, we propose to use the so-called radial
basis functions, which are basis functions of the form C{‖(t, r)' − (κ_k^t, κ_k^r)'‖} for some
univariate function C. Here we consider the following function
\[ C\bigl(\|(t,r)'-(\kappa_k^t,\kappa_k^r)'\|\bigr)
 = \|(t,r)'-(\kappa_k^t,\kappa_k^r)'\|^{2}\,\log\,\|(t,r)'-(\kappa_k^t,\kappa_k^r)'\| \tag{3} \]
Given the estimated parameters of the regression functions (1), (2), or (3), the
average potential outcome at treatment level t is estimated by averaging the estimated
regression function over R̂_i^t.
Flores et al. (2012) proposed to estimate the DRF using a nonparametric IW estimator
based on kernel methods. In this approach, the estimated scores are used to weight
observations to adjust for covariate differences. Let K(u) be a kernel function with the
usual properties, and let h be a bandwidth satisfying h → 0 and Nh → ∞ as N → ∞.
The IW approach is implemented using a local linear regression of Y on T with the weighted
kernel function K̃_{h,X}(T_i − t) = K_h(T_i − t)/R̂_i^t, where K_h(z) = h^{-1}K(z/h). Formally,
the IW kernel estimator of the average DRF is defined as
\[ \hat{\mu}(t) = \frac{D_0(t)S_2(t) - D_1(t)S_1(t)}{S_0(t)S_2(t) - S_1^2(t)} \]
where S_j(t) = Σ_{i=1}^{N} K̃_{h,X}(T_i − t)(T_i − t)^j and D_j(t) = Σ_{i=1}^{N} K̃_{h,X}(T_i − t)(T_i − t)^j Y_i,
j = 0, 1, 2.
We implement the IW estimator using a normal kernel. By default, the global band-
width is selected using the procedure proposed by Fan and Gijbels (1996), which esti-
mates the unknown terms in the optimal global bandwidth by using a global polynomial
of order p + 3, where p is the order of the local polynomial fitted. However, users can
also choose an alternative global bandwidth.
Note that the argument varlist represents the observed pretreatment variables, which
are used to estimate the GPS. Note that spacefill must be installed (Bia and Van Kerm
2014).4
4.2 Options
Required
Global options
gps stores the estimated generalized propensity score in the gpscore variable that is
added to the dataset.6
family(familyname) specifies the distribution used to estimate the GPS. The available
distributional families are Gaussian (normal) (family(gaussian)), inverse Gaussian
(family(igaussian)), Gamma (family(gamma)), and Beta (family(beta)). The
default is family(gaussian). The Gaussian, inverse Gaussian, and Gamma distributional
families are fit using glm, and the beta distribution is fit using betafit.
The following four options are for the glm command, so they can be specified only
when the Gaussian, inverse Gaussian, or Gamma distribution is assumed for the treatment
variable.
link(linkname) specifies the link function for the Gaussian, inverse Gaussian, and
Gamma distributional families. The available links are link(identity), link(log),
and link(pow), and the default is the canonical link for the family() specified (see
help for glm for further details).
5. The subroutines mtpspline and radialpspline are called, respectively, when estimators with penalized
splines (type = mtspline) and radial penalized splines (type = radialpspline) are used.
6. This option must not be specified when running the bootstrap.
vce(vcetype) specifies the type of standard error reported for the GPS estimation when
the Gaussian, inverse Gaussian, or Gamma distribution is assumed for the treatment
variable. vcetype may be oim, robust, cluster clustvar, eim, opg, bootstrap,
jackknife, hac kernel, or jackknife1 (see help glm for further details).
nolog(#) is a flag (# = 0, 1) that suppresses the iterations of the algorithm toward
eventual convergence when running the glm command. The default is nolog(0).
search searches for good starting values for the parameters of the generalized linear
model used to estimate the generalized propensity score (see help glm for further
details).
Overlap options
test_varlist(varlist) specifies that the balancing property must be assessed for each
variable in varlist. The default test_varlist() consists of all the variables used to
estimate the GPS.
test(type) allows users to specify whether the balancing property is to be assessed
using a blocking on the GPS approach employing either standard two-sided t tests
(test(t_test)) or Bayes factors (test(Bayes_factor)), or using a model-comparison
approach with an LR test (test(L_like)).
The blocking on the GPS approach using standard two-sided t tests provides the
values of the test statistics before and after adjusting for the GPS for each pretreat-
ment variable included in test_varlist() and for each prefixed treatment interval
specified in cutpoints(). Specifically, let p be the number of control variables
in test_varlist(), and let H be the number of treatment intervals specified in
cutpoints(). Then the program calculates and shows p × H values of the test
statistic before and after adjusting for the GPS, where the adjustment is done by
dividing the values of the GPS evaluated at the representative point index() into
the number of intervals specified in nq_gps(). (See Hirano and Imbens [2004] for
further details.)
The model-comparison approach uses an LR test to compare an unrestricted model
for Ti, including all the covariates and the GPS (up to a cubic term), with a re-
stricted model that sets the coefficients of all covariates to zero. By default, both
the blocking on the GPS approach and the model-comparison approach are applied.
flag(#) allows the user to specify that drf estimates the GPS without performing the
balancing test. The default is flag(1), which means that the balancing property is
assessed.
DRF options
tpoints(vector) indicates that the DRF is evaluated at each level of the treatment in
vector. By default, the drf program creates a vector with jth element equal to
the jth observed treatment value. This option cannot be used with npoints() or
npercentiles() (see below).
npoints(#) indicates that the DRF is evaluated at each level of the treatment be-
longing to a set of evenly spaced values t0 , t1 , . . . , t# that cover the range of the
observed treatment. This option cannot be used with tpoints() (see above) or
npercentiles() (see below).
npercentiles(#) indicates that the DRF is evaluated at each level of the treatment
corresponding to the percentiles tq0, tq1, ..., tq# of the treatment's empirical distri-
bution. This option cannot be used with tpoints() or npoints() (see above).
det displays more detailed output on the DRF estimation. When det is not specified,
the program displays only the chosen DRF estimator: method(radialpspline),
method(mtpspline), or method(iwkernel).
delta(#) specifies that drf also estimates the treatment-effect function μ̂(t + #) − μ̂(t).
The default is delta(0), which means that drf estimates only the DRF, μ̂(t).
knots(numlist) specifies the list of knots for the treatment and the GPS variable. This
option cannot be used with the nknots() option (see above).
standardized implies that the spacefill algorithm standardizes the treatment vari-
able and the GPS variables before selecting the knots. The knots are chosen using
the standardized variables.
degree1(#) specifies the power of the treatment variable included in the penalized
spline model. The default is degree1(1).
degree2(#) specifies the power of the GPS included in the penalized spline model. The
default is degree2(1).
nknots1(#) specifies the number (#) of knots for the treatment variable. The location
of the K_kth knot is defined as the {(k + 1)/(# + 2)}th sample quantile of the unique
T_i for k = 1, ..., #. The default is nknots1(max(5, min(n/4, 35))), where n is
the number of unique Ti (Ruppert, Wand, and Carroll 2003). This option cannot
be used with the knots1(numlist) option (see below).
nknots2(#) specifies the number (#) of knots for the GPS. The location of the K_kth
knot is defined as the {(k + 1)/(# + 2)}th sample quantile of the unique R_i for k =
1, ..., #. The default is nknots2(max(5, min(n/4, 35))), where n is the number
of unique Ri (Ruppert, Wand, and Carroll 2003). This option cannot be used with
the knots2() option (see below).
knots1(numlist) specifies the list of knots for the treatment variable. This option
cannot be used with the nknots1() option (see above).
knots2(numlist) specifies the list of knots for the GPS. This option cannot be used with
the nknots2() option (see above).
additive allows users to implement penalized splines using the additive model without
including the product terms.
Mutual options for the tensor-product and radial penalized spline estimators
Mutual options for the tensor-product and radial penalized spline estimators involve
either the mtpspline subroutine or the radialpspline subroutine, depending on which
estimator is used.
estopts(string) specifies all the possible options allowed when running the xtmixed
models to fit penalized spline models (see help xtmixed for further details).
. use lotterydataset.dta
. * we delete the extreme values (1 and 99 percentile)
. drop if year6==.
(35 observations deleted)
. summarize prize, de
Treatment variable = Prize amount
Percentiles Smallest
1% 5.3558 1.139
5% 10.05 5
10% 11.246 5.3558 Obs 202
25% 17.034 6.844 Sum of Wgt. 202
50% 32.1835 Mean 57.36918
Largest Std. Dev. 64.84194
75% 71.642 270.1
90% 137.27 305.09 Variance 4204.477
95% 171.73 323.32 Skewness 2.821964
99% 305.09 484.79 Kurtosis 14.18278
OIM
prize Coef. Std. Err. z P>|z| [95% Conf. Interval]
*****************************************************************
31 observations are dropped after imposing common support
*****************************************************************
drf_gpscore
Percentiles Smallest
1% .0000774 .0000308
5% .00118 .0000774
10% .0033023 .0003464 Obs 160
25% .0077024 .0004499 Sum of Wgt. 160
50% .0092675 Mean .0082089
Largest Std. Dev. .002953
75% .0103387 .0107928
90% .0107204 .010793 Variance 8.72e-06
95% .0107831 .0107953 Skewness -1.419599
99% .0107953 .0107956 Kurtosis 3.908883
********************************************
End of the algorithm to estimate the gpscore
********************************************
**********************************************************
Log-Likelihood test for Unrestricted and Restricted Model
**********************************************************
****************************************************
Unrestricted Model
link(E[T]) = GPSCORE + GPSCORE^2 + GPSCORE^3 + X
****************************************************
Generalized linear models No. of obs = 160
Optimization : ML Residual df = 144
Scale parameter = 383.389
Deviance = 55208.02303 (1/df) Deviance = 383.389
Pearson = 55208.02303 (1/df) Pearson = 383.389
Variance function: V(u) = 1 [Gaussian]
Link function : g(u) = ln(u) [Log]
AIC = 8.881567
Log likelihood = -694.5253454 BIC = 54477.2
OIM
prize Coef. Std. Err. z P>|z| [95% Conf. Interval]
********************************************************
Restricted Model: Pretreatment variables are excluded
link(E[T]) = GPSCORE + GPSCORE^2 + GPSCORE^3
********************************************************
Generalized linear models No. of obs = 160
Optimization : ML Residual df = 156
Scale parameter = 386.9127
Deviance = 60358.37384 (1/df) Deviance = 386.9127
Pearson = 60358.37384 (1/df) Pearson = 386.9127
Variance function: V(u) = 1 [Gaussian]
Link function : g(u) = ln(u) [Log]
AIC = 8.820758
Log likelihood = -701.6606578 BIC = 59566.65
OIM
prize Coef. Std. Err. z P>|z| [95% Conf. Interval]
**********************************************************
Restricted Model: GPS terms are excluded (link(E[T]) = X)
**********************************************************
Generalized linear models No. of obs = 160
Optimization : ML Residual df = 147
Scale parameter = 1311.924
Deviance = 192852.8661 (1/df) Deviance = 1311.924
Pearson = 192852.8661 (1/df) Pearson = 1311.924
Variance function: V(u) = 1 [Gaussian]
Link function : g(u) = ln(u) [Log]
AIC = 10.09489
Log likelihood = -794.5908861 BIC = 192106.8
OIM
prize Coef. Std. Err. z P>|z| [95% Conf. Interval]
********************************************************************
Likelihood-ratio tests:
Comparison between the unrestricted model and the restricted models
********************************************************************
LR_TEST[3,4]
Lrtest T-Statistics p-value Restrictions
Unrestricted -694.52535 . . .
Covariates X -701.66066 14.270625 .2837616 12
GPS terms -794.59089 200.13108 3.952e-43 3
Number of observations = 160
***********************************************************
End of the assesment of the balancing property of the GPS
***********************************************************
Then we estimate the DRF and the treatment-effect function, which represents the
marginal propensity to earn out of the yearly prize money, using both penalized spline
techniques and the IW kernel estimator. Following Hirano and Imbens (2004), we ob-
tain the estimates of these functions at 10 different prize-amount values between $10,000
and $100,000, considering increments of $1,000 for the estimation of the treatment-
effect function. Note that we scaled the prize amount by dividing it by $1,000. To avoid
redundancies, we show details on the output from running drf for only the radial penal-
ized spline estimator (method(radialpspline)). Note that the det option is specified,
so details on estimating the DRF are shown.
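For reference, a sketch of the point-estimation call that would produce this output, inferred from the bootstrap call reported later (the vector tp of evaluation points and its construction here are assumptions; it holds the 10 scaled prize amounts):

. matrix tp = (10 \ 20 \ 30 \ 40 \ 50 \ 60 \ 70 \ 80 \ 90 \ 100)
. drf agew ownhs owncoll male tixbot workthen yearm1 yearm2 yearm3 yearm4
>     yearm5 yearm6, outcome(year6) treatment(prize) test(L_like) tpoints(tp)
>     numoverlap(3) method(radialpspline) family(gaussian) link(log) nolog(1)
>     search nknots(10) det delta(1)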
****************
DRF estimation
****************
Radial penalized spline estimator
Run 1 .. (Cpq = 383.37)
Run 2 .. (Cpq = 427.99)
Run 3 ... (Cpq = 388.19)
Run 4 .. (Cpq = 365.61)
Run 5 ... (Cpq = 389.08)
Performing EM optimization:
Performing gradient-based optimization:
Iteration 0: log restricted-likelihood = -509.60164
Iteration 1: log restricted-likelihood = -509.58312
Iteration 2: log restricted-likelihood = -509.58286
Iteration 3: log restricted-likelihood = -509.58286
_all: Identity
sd(__00002U..__000033)(1) .0285723 .0584111 .0005198 1.570645
LR test vs. linear regression: chibar2(01) = 0.06 Prob >= chibar2 = 0.4072
(1) __00002U __00002V __00002W __00002X __00002Y __00002Z __000030 __000031
__000032 __000033
. matrix list e(b)
e(b)[1,20]
c1 c2 c3 c4 c5 c6
y1 15.131775 12.106819 9.3763398 7.2519104 6.0217689 5.5866336
c7 c8 c9 c10 c11 c12
y1 5.7080575 5.9898157 6.0769106 5.7288158 -.3081758 -.2900365
c13 c14 c15 c16 c17 c18
y1 -.23826795 -.15935109 -.05448761 -.00673878 .02770708 .02217719
c19 c20
y1 -.01213146 -.06489899
. matrix C = e(b)
. drop gpscore
. set seed 2322
. bootstrap _b, reps(50): drf agew ownhs owncoll male tixbot workthen yearm1
> yearm2 yearm3 yearm4 yearm5 yearm6, outcome(year6) treatment(prize)
> test(L_like) tpoints(tp) numoverlap(3) method(radialpspline) family(gaussian)
> link(log) nolog(1) search nknots(10) det delta(1)
(running drf on estimation sample)
Bootstrap replications (50)
1 2 3 4 5
.................................................. 50
Bootstrap results Number of obs = 191
Replications = 50
Figures 1 and 2 show the estimates of the DRF and the treatment-effect function by
using the semiparametric techniques implemented in the drf routine and a paramet-
ric approach. The parametric estimates are derived using the doseresponse routine
(Bia and Mattei 2008), which follows the parametric approach originally proposed by
Hirano and Imbens (2004).7 As can be seen in figures 1 and 2, the two penalized spline
estimators and the IW kernel estimator lead to similar results: the DRFs have a U shape
(which is more tenuous in the case of the radial spline method), and the treatment-effect
functions have irregular shapes, increasing over most of the treatment range and decreas-
ing for high treatment levels. The parametric approach shows quite a different picture.
The DRF goes down sharply for low prize amounts and follows an inverse J shape for
prize amounts greater than $20,000. The treatment-effect function reaches a maximum
around $30,000, and then it slowly decreases.
7. The code to derive the graphs is shown here for only the radial penalized spline estimator.
[Figure 1. Estimated dose–response functions; each panel plots the dose–response function against the treatment level (0–100).]
[Figure 2. Estimated treatment-effect functions; each panel plots the derivative against the treatment level (0–100).]
Figures 3 and 4 show the DRFs and the treatment-effect functions estimated using
the semiparametric and parametric techniques, now accompanied by pointwise 95% con-
fidence bands. The confidence bands are based on a normal approximation using boot-
strap standard errors, which are computed by calling the drf program (or the doseresponse
program) in the bootstrap command.8
8. The radial spline-based models may produce slightly different estimates in different runs and when
using the bootstrap command. This happens because within those models, an optimal set of
design points is chosen via random selection of the knot values using the spacefill algorithm (see
Bia and Van Kerm [2014] for further details). Some selected sets of knots may raise convergence
issues depending on the data. Thus we recommend that users set a seed before running the drf
code to make the results replicable.
[Figure 3. Estimated dose–response functions with pointwise 95% confidence bands; each panel plots the dose–response function against the treatment level (0–100).]
[Figure 4. Estimated treatment-effect functions with pointwise 95% confidence bands; each panel plots the derivative against the treatment level (0–100).]
The example allows us to highlight two important points. First, figures 3 and 4
show that differences in the point estimates and their precision among the three semi-
parametric estimators are more pronounced for low and high treatment levels. This is
because our data are sparse for lower and higher values of the treatment.9 Because of
the nonparametric methods we use, estimation becomes noisier and the parameters are
estimated less precisely in regions of the data with few observations, which is reflected
in the wider confidence intervals. This is particularly evident for the radial spline ap-
proach, which seems to be more sensitive to the sample size than the IW and penalized
spline estimators are. Second, it is clear from figures 3 and 4 that the parametric
estimator produces much tighter confidence bands relative to the semiparametric esti-
mators. This is due to the additional structure imposed by the parametric estimator,
which allows extrapolation from regions where data are abundant to regions where data
are scarce. However, if the assumptions behind the parametric structure are incorrect,
the results, including their precision, are likely misleading.
9. In particular, there are very few observations for prizes lower than $15,000 and greater than $40,000.
6 Conclusion
We develop a program where we implement semiparametric estimators of the DRF based
on the GPS, assuming that assignment to the treatment is weakly unconfounded given
pretreatment variables. We propose three semiparametric estimators: the IW kernel
estimator developed in Flores et al. (2012) and two estimators using penalized spline
methods for bivariate smoothing. We use data from a survey of Massachusetts lottery
winners to illustrate the proposed methods and program. We find that the semipara-
metric estimators provide estimates of the DRF and the treatment-effect function that
are substantially different from those obtained when using the parametric approach orig-
inally proposed in Hirano and Imbens (2004). All the semiparametric estimators agree
on a U-shaped DRF, which contrasts with the estimated inverse J shape uncovered by
the parametric estimator. Although we cannot draw a firm conclusion about the relative
performance of the estimators based on one dataset, we argue that a misspecification
of the conditional expectation of the outcome given treatment and GPS could result
in inappropriate removal of self-selection bias and in misleading estimates of the DRF.
Therefore, it is advisable to also use semiparametric estimators that account for compli-
cated structures that are difficult to model parametrically. Conversely, semiparametric
estimators can be sensitive to the sample size and might not perform well in regions
with few observations.
7 Acknowledgments
This research is part of the Estimation of direct and indirect causal effects using semi-
parametric and nonparametric methods project supported by the Luxembourg Fonds
National de la Recherche, which is cofunded under the Marie Curie Actions of the
European Commission (FP7-COFUND).
8 References
Angrist, J. D., G. W. Imbens, and D. B. Rubin. 1996. Identification of causal effects
using instrumental variables. Journal of the American Statistical Association 91:
444–455.
Becker, S. O., and A. Ichino. 2002. Estimation of average treatment effects based on
propensity scores. Stata Journal 2: 358–377.
Bia, M., and A. Mattei. 2008. A Stata package for the estimation of the dose–response
function through adjustment for the generalized propensity score. Stata Journal 8:
354–373.
Bia, M., and A. Mattei. 2012. Assessing the effect of the amount of financial aids to Piedmont
firms using the generalized propensity score. Statistical Methods & Applications 21:
485–516.
Bia, M., and P. Van Kerm. 2014. Space-filling location selection. Stata Journal 14:
605–622.
Buis, M. L., N. J. Cox, and S. P. Jenkins. 2003. betafit: Stata module to fit a two-
parameter beta distribution. Statistical Software Components S435303, Department
of Economics, Boston College. http://ideas.repec.org/c/boc/bocode/s435303.html.
Dorn, S. 2012. pscore2: Stata module to enforce balancing score property in each
covariate dimension. UK Stata Users Group meeting.
http://econpapers.repec.org/paper/bocusug12/11.htm.
Fan, J., and I. Gijbels. 1996. Local Polynomial Modelling and Its Applications. New
York: Chapman & Hall/CRC.
Flores, C. A., A. Flores-Lagunes, A. Gonzalez, and T. C. Neumann. 2012. Estimating
the effects of length of exposure to instruction in a training program: The case of Job
Corps. Review of Economics and Statistics 94: 153–171.
Guardabascio, B., and M. Ventura. 2014. Estimating the dose–response function
through a generalized linear model approach. Stata Journal 14: 141–158.
Hirano, K., and G. W. Imbens. 2004. The propensity score with continuous treat-
ments. In Applied Bayesian Modeling and Causal Inference from Incomplete-Data
Perspectives, ed. A. Gelman and X.-L. Meng, 73–84. Chichester, UK: Wiley.
Imai, K., and D. A. van Dyk. 2004. Causal inference with general treatment regimes:
Generalizing the propensity score. Journal of the American Statistical Association
99: 854–866.
Imbens, G. W. 2000. The role of the propensity score in estimating dose–response
functions. Biometrika 87: 706–710.
Imbens, G. W., D. B. Rubin, and B. I. Sacerdote. 2001. Estimating the effect of unearned
income on labor earnings, savings, and consumption: Evidence from a survey of
lottery players. American Economic Review 91: 778–794.
Imbens, G. W., and J. M. Wooldridge. 2009. Recent developments in the econometrics
of program evaluation. Journal of Economic Literature 47: 5–86.
Jann, B. 2005. moremata: Stata module (Mata) to provide various functions. Sta-
tistical Software Components S455001, Department of Economics, Boston College.
http://ideas.repec.org/c/boc/bocode/s455001.html.
Kluve, J., H. Schneider, A. Uhlendorff, and Z. Zhao. 2012. Evaluating continuous
training programmes by using the generalized propensity score. Journal of the Royal
Statistical Society, Series A 175: 587–617.
Lechner, M. 2001. Identification and estimation of causal effects of multiple treatments
under the conditional independence assumption. In Econometric Evaluation of Labour
Market Policies, ed. M. Lechner and F. Pfeiffer, 43–58. Heidelberg: Physica-Verlag.
Newey, W. K. 1994. Kernel estimation of partial means and a general variance estimator.
Econometric Theory 10: 233–253.
Rosenbaum, P. R., and D. B. Rubin. 1983. The central role of the propensity score in
observational studies for causal effects. Biometrika 70: 41–55.
Rubin, D. B. 1978. Bayesian inference for causal effects: The role of randomization. Annals
of Statistics 6: 34–58.
Ruppert, D., M. P. Wand, and R. J. Carroll. 2003. Semiparametric Regression. New
York: Cambridge University Press.
Stuart, E. A. 2010. Matching methods for causal inference: A review and a look forward.
Statistical Science 25: 1–21.
Space-filling location selection
M. Bia and P. Van Kerm

1 Introduction
Spatial statistics often address geographical sampling from a set of locations for net-
works construction (Cox, Cox, and Ensor 1997), for example, for installing air quality
monitoring (Nychka and Saltzman 1998) or for evaluating exposure to environmental
chemicals (Kim et al. 2010). The issue involves evaluating a discrete list of potential
locations and determining a small, optimal subset of places (a design) at which
to position, say, measurement instruments or sensors. One strategy to address such a
problem, the geometric approach, aims to find a design that minimizes the aggregate
distance between the locations and the sensors.
As discussed in Ruppert, Wand, and Carroll (2003) and Gelfand, Banerjee, and Fin-
ley (2012), location selection is also relevant in estimation of statistical models such as
multivariate nonparametric or semiparametric regression models. By analogy, instead
of locating measurement instruments, one seeks to identify a small number of loca-
tions from a large dataset at which to estimate a statistical model to reduce com-
putational cost. For example, kernel density estimates or locally weighted regression
models (Cleveland 1979; Fan and Gijbels 1996) are typically calculated on a grid of
points spanning the data range rather than over all the input data points (and in-
terpolation is used where needed). The location of knots in spline regression models is
somewhat related; a small number of knots are selected instead of knots being placed
at many (or all) potential distinct data points. Determining such a grid is relatively
easy in one-dimensional models; for example, it is customary to locate knots at selected
percentiles of the data. Choosing an appropriate multidimensional grid while preserv-
ing computational tractability is more complicated because merely taking combinations
of unidimensional grids quickly inflates the number of evaluation points. In this con-
text, Ruppert, Wand, and Carroll (2003) recommend applying a geometric space-filling
design to identify grid points or knot locations.
with p < 0. dp (x, Dn ) measures how well the design Dn covers the location x. When
p → −∞, dp(x, Dn) tends to the shortest Euclidean distance between x and a point
in Dn (Johnson, Moore, and Ylvisaker 1990). dp (x, Dn ) is zero if x is at a location in
Dn .
1. An R implementation of Royle and Nychka's (1998) algorithm is available in Furrer, Nychka, and
Sain (2013).
over all possible designs Dn from C. The optimal design minimizes the q power mean of
the coverages of all locations outside of the design (the candidate points). Increasing
q gives greater importance to the distance of the design to poorly covered locations.
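As a sketch of the coverage and design criteria referred to here as (1) and (2), following Royle and Nychka (1998) (the exact typesetting is an assumption; C denotes the candidate set and D_n a design of size n):

d_p(x, D_n) = \Bigl\{ \sum_{d \in D_n} \lVert x - d \rVert^{p} \Bigr\}^{1/p}, \quad p < 0 \qquad (1)

C_{p,q}(D_n) = \Bigl\{ \frac{1}{\#(C \setminus D_n)} \sum_{x \in C \setminus D_n} d_p(x, D_n)^{q} \Bigr\}^{1/q} \qquad (2)

(Whether the sum in (2) is averaged over the number of candidate points is immaterial for the minimization.)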
Figure 1 can help readers visualize the criterion. From a set of 38 European cities, we
selected a potential design of five locations: Madrid, Brussels, Berlin, Riga, and Sofia.
The coverage of, say, London by this design is given by plugging the Euclidean distances
from London to the five selected cities into (1). With a large negative p, this coverage
will be determined by the distance to the closest city, namely, Brussels. Repeating such
calculations for all 33 cities from outside the design and aggregating the coverages using
(2) gives the overall geometric distance of European cities to the design composed
of Madrid, Brussels, Berlin, Riga, and Sofia. The optimal design is the combination of
any five cities that minimizes this criterion. The design composed of Madrid, Brussels,
Berlin, Riga, and Sofia is in fact the optimal design for p = −5 and q = 1.
3.1 Syntax
spacefill varlist [if] [in] [weight] [, ndesign(#) design0(varlist)
    fixed(varname) exclude(varname) p(#) q(#) nnfrac(#) nnpoints(#)
    nruns(#) standardize standardize2 standardize3 sphericize ranks
    generate(newvar) genmarker(newvar) noverbose]
aweights, fweights, and iweights are allowed; see [U] 11.1.6 weight.
varlist and the if or in qualifier identify the data from which the optimal subset is
selected.
3.2 Options
ndesign(#) specifies n, the size of the design. The default is ndesign(4).
design0(varlist) identifies a set of initial designs identified by observations with nonzero
varlist. If multiple variables are passed, one optimization is performed for each initial
design, and the selected design is the one with the best coverage.
fixed(varname) identifies observations that are included in all designs when varname
is nonzero.
exclude(varname) identifies observations excluded from all designs when varname is
nonzero.
p(#) specifies a scalar value for the distance parameter for calculating the distance of
each location to the design; for example, p = −1 gives the harmonic mean distance, and
p = −∞ gives the minimum distance. The default is p(-5), as recommended in
Royle and Nychka (1998).
q(#) specifies a scalar value for the parameter q. The default is q(1) (the arithmetic
mean).
nnfrac(#) specifies the fraction of data to consider as nearest neighbors in the point-
swapping iterations. Limiting checks to nearest neighbors improves speed but does
not guarantee convergence to the best design; therefore, setting nruns(#) is recom-
mended. The default is nnfrac(0.50).
nnpoints(#) specifies the number of nearest neighbors considered in the point-swapping
iterations. Limiting checks to nearest neighbors improves speed. nnfrac(#) and
nnpoints(#) are mutually exclusive.
nruns(#) sets the number of independent runs performed on alternative random initial
designs. The selected design is the one with best coverage across the runs. The
default is nruns(5).
standardize standardizes all variables in varlist to zero mean and unit standard devi-
ation (SD) before calculating distances between observations.
standardize2 standardizes all variables in varlist to zero mean and unit SD before calculating
distances between observations, with an estimator of the SD as 0.7413 times the
interquartile range.
standardize3 standardizes all variables in varlist to zero median and unit SD before calcu-
lating distances between observations, with an estimator of the SD as 0.7413 times
the interquartile range.
sphericize transforms all variables in varlist into zero mean, unit SD, and zero covariance
using a Cholesky decomposition of the variance–covariance matrix before calculating
distances between observations.
ranks transforms all variables in varlist into their (fractional) ranks and uses distances
between these observation ranks in each dimension to evaluate distances between
observations.
generate(newvar) specifies the names for new variables containing the locations of the
best design points. If one variable is specied, it is used as a stubname; otherwise,
the number of new variable names must match the number of variables in varlist.
genmarker(newvar) specifies the name of a new binary variable equal to one for obser-
vations selected in the best design and zero otherwise.
noverbose suppresses output display.
Options standardize2, standardize3, and ranks require installation of the user-
written package moremata, which is available on the Statistical Software Components
archive (Jann 2005).
4 Examples
We provide two illustrations of the application of spacefill. The first example uses
ozone2.txt, which is available in the R fields package (Furrer, Nychka, and Sain 2013),
and provides examples of standard site selection. The second example uses survey data
from the Panel Socio-Economique Liewen zu Lëtzebuerg/European Union Statistics on
Income and Living Conditions (PSELL3/EU-SILC) and illustrates the use of spacefill
for nonparametric regression analysis with multidimensional, nonspatial data.
We start by selecting an optimal design of size 10 from the 147 locations, using the
default values p = −5 and q = 1, candidate swaps limited to the nearest half of the
locations, and 5 runs with random starting designs.
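A sketch of the corresponding call (the marker name best10 is a hypothetical choice; all other settings are the defaults described above):

. spacefill lon lat, ndesign(10) genmarker(best10)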
Notice that the first run leads to a somewhat higher aggregate distance to the design
points (Cpq=100.34) than the other runs. This stresses the importance of multiple
starting designs. Figure 2 shows the selected locations in the best design (achieved at
run 3, where Cpq=94.19).
Figure 2. Scatterplot and histogram of longitude and latitude for all 147 locations (gray
histograms and gray hollow circles) and 10 best design points (thick histograms and
solid dots) with p = −5 and q = 1 (default)
Users can improve speed by restricting potential swaps to a smaller number of nearest
neighbors. Limiting the search to 25 nearest neighbors (against 69, the default half of the
locations, in the first example), our second example below runs in 4 seconds against
11 seconds for our initial example, without much loss in the coverage of the resulting
design (Cpq=96.59). On the other hand, running spacefill with all candidates
as potential swaps takes over 30 seconds for an optimal design with Cpq=91.96.
. spacefill lon lat, ndesign(10) nnpoints(25) genmarker(set1)
Run 1 ..... (Cpq = 117.02)
Run 2 .... (Cpq = 109.93)
Run 3 .. (Cpq = 110.99)
Run 4 .. (Cpq = 101.05)
Run 5 ..... (Cpq = 96.59)
. spacefill lon lat, ndesign(10) nnfrac(1)
Run 1 ... (Cpq = 91.96)
Run 2 .... (Cpq = 91.96)
Run 3 .. (Cpq = 91.96)
Run 4 ... (Cpq = 92.32)
Run 5 ... (Cpq = 91.96)
We now illustrate the use of the genmarker(), fixed(), and exclude() options. In
the previous call, genmarker(set1) generated a dummy variable equal to 1 for the 10
points selected into the best design and 0 otherwise. We now specify exclude(set1)
to derive a new design with 10 different locations and then use fixed(set2) to force
this new design into a design of size 15.
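A sketch of the two calls just described (the marker name set3 for the final design of size 15 is an assumption):

. spacefill lon lat, ndesign(10) exclude(set1) genmarker(set2)
. spacefill lon lat, ndesign(15) fixed(set2) genmarker(set3)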
(listing of flagged observations; columns: set1, set2, and the marker for the combined design of size 15)
4. 1 0 0
10. 0 1 1
25. 1 0 0
40. 1 0 0
48. 0 1 1
55. 1 0 0
58. 0 1 1
60. 1 0 0
61. 0 1 1
63. 0 0 1
67. 0 0 1
74. 1 0 0
77. 0 0 1
80. 0 1 1
82. 0 1 1
89. 0 0 1
91. 0 1 1
97. 1 0 0
107. 0 1 1
109. 1 0 0
121. 0 1 1
125. 0 0 1
135. 0 1 1
140. 1 0 0
143. 1 0 0
The key parameters q and p of the coverage criterion can also be flexibly specified.
Figure 3 illustrates three designs selected with the default parameters p = −5 and q = 1 (dots),
with p = −1 and q = 1 (squares), and with p = −1 and q = 5 (crosses). With p = −5,
the distance of a location to the design is mainly determined by the distance to the
closest point of the design; p = −1 accounts for the distance to all points in the design,
leading to more central location selections. Setting q = 5 penalizes large distances
between design and nondesign points, leading to location selections more spread out
toward external points. Note our use of user-specified random starting designs with
option design0() to ensure that the comparison is made on common initial values.
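For example (a sketch; the variable start0 holding the user-specified initial design and the marker names are hypothetical):

. spacefill lon lat, ndesign(10) design0(start0) p(-1) q(1) genmarker(set_p1q1)
. spacefill lon lat, ndesign(10) design0(start0) p(-1) q(5) genmarker(set_p1q5)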
Figure 3. Scatterplot of longitude and latitude for all 147 locations (gray hollow circles)
and best design points with default p = −5 and q = 1 (dots), with p = −1 and q = 1
(squares), and with p = −1 and q = 5 (crosses)
. clear
. set obs 16
obs was 0, now 16
. range lon -95 -80 16
. range lat 36 46 11
(5 missing values generated)
. fillin lon lat
. gen byte sample = 0
. save gridlatlon.dta , replace
file gridlatlon.dta saved
. clear
. insheet using ozone2.txt
(3 vars, 147 obs)
. keep lat lon
. gen byte sample = 1
. append using gridlatlon
. spacefill lon lat [iw=sample], exclude(sample) ndesign(25) nnpoints(100)
> genmarker(subgrid1)
147 points excluded from designs (sample>0)
Run 1 .. (Cpq = 63.93)
Run 2 .... (Cpq = 63.92)
Run 3 .... (Cpq = 63.71)
Run 4 ... (Cpq = 63.07)
Run 5 ... (Cpq = 63.02)
Figure 4. Actual 147 locations (hollow gray circles), 176 candidate grid points (lattice;
crosses), and 25 optimally selected grid points (solid dots)
rithm is indeed applicable to broad data configurations. Second, the difference in the
histograms for the sample and for the design points is a reminder that selecting a space-
filling design is distinct from drawing a representative subset of the data. The points
that best cover the data in a geometric sense need not reflect their frequency
distribution: few design points may contribute to covering many data points in areas of
high concentration, while design points spread out in areas of low data concentration
will contribute to covering a smaller number of data points.
. summarize height weight wage
Variable Obs Mean Std. Dev. Min Max
Figure 5. Scatterplot and histogram of height and weight for all data (gray histograms
and hollow markers) and best design points (thick histograms and markers) for the
standardized values of height, weight, and wage
Figure 6. Scatterplot and histogram of height and wage for all data (gray histograms
and hollow markers) and best design points (thick histograms and markers) for the
standardized values of height, weight, and wage
We now use these data to run a locally weighted polynomial regression of wage
on height and weight. Our objective is to assess nonparametrically the relationship
between wage and body size. For the sake of illustration, we want to estimate expected
wage nonparametrically at multiple grid points from a lattice where each point is a
pair of height–weight values. One reason for this is that fitting the model at all height–
weight pairs in our data would be computationally expensive (and inefficient if there are
nearly identical height–weight pairs in the data). We seek a cheaper alternative with
fewer evaluation points. (This is similar to using lpoly with the at() option instead
of lowess in the unidimensional setting.) Also, we use evaluation points on a lattice
instead of at sample values because we are considering fitting the model for different
subsamples, and we want to have model estimates on a common grid of evaluation points
for all subsamples. (If need be, bivariate interpolation will be used to recover estimates
at sample values; see [G-2] graph twoway contourline for the interpolation formula.)
This setting is relatively standard in nonparametric regression analysis, especially when
dealing with large samples or computationally heavy estimators (for example, cross-
validation-based bandwidth selection).
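In the unidimensional case, the idea can be illustrated with lpoly and its at() option (a sketch; the grid variable hgrid and the result variable ewage_hat are hypothetical):

. range hgrid 150 192 50
. lpoly wage height, at(hgrid) generate(ewage_hat) nograph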
We start with a 20 × 20 rectangular lattice covering heights from 150 to 192 centime-
ters and weights from 43 to 127 kilograms. While this lattice spans the values observed
in our sample, it also includes many empirically irrelevant heightweight pairs. Estima-
tion on the full grid is therefore unnecessary, and we use spacefill as described above
to select a subset of points on the lattice that covers our data.
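Mirroring the grid construction used in the ozone example above, this selection might be set up as follows (a sketch; the dataset name wagedata.dta and the marker name subgrid2 are assumptions):

. clear
. set obs 20
. range height 150 192 20
. range weight 43 127 20
. fillin height weight
. gen byte sample = 0
. save gridhw.dta, replace
. use wagedata.dta, clear
. keep height weight wage
. gen byte sample = 1
. append using gridhw
. spacefill height weight [iw=sample], exclude(sample) ndesign(50)
>     genmarker(subgrid2)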
Figure 7 shows the resulting estimates based on a space-filling design of size 50, as well
as estimates based on a random subset of 100 lattice points, on 100 Halton draws from
the lattice, on the full lattice, and on all sample points. Brightness of the contours
corresponds to local regression estimates of expected wage, from black (for monthly
wage below EUR 1000) to white (for monthly wage above EUR 5000). In each panel,
local regression was effectively calculated only at the marked grid points (and so it was
conducted faster on the space-filling design), while the overall coloring of the map was
based on the thin-plate-spline interpolation built into twoway contour.
Figure 7. Contour plot of expected wage of 500 Luxembourg women by height and
weight from monthly wage less than EUR 1000 (black) to more than EUR 5000 (white).
Calculations based on local regression estimation. White lines identify body-mass in-
dices of 18.5, 25, and 30, which delineate underweight, overweight, and obesity, respec-
tively.
The contour plots display variations in areas of low data density (top left and bot-
tom right), reflecting both the imprecision and variability of the local linear regression
estimates in these zones and the variations introduced by the interpolation of values
away from the bulk of the data. In areas of higher data density (for height below 180
centimeters and weight below 100 kilograms), estimates on the 50-point space-filling
subset differ little from those of the full sample or from the full lattice.4
4. Note, incidentally, how taller women tend to be paid higher wages in these data in all three body-
mass index categories.
Acknowledgments
This research is part of the project Estimation of direct and indirect causal effects using
semi-parametric and non-parametric methods, which is supported by the Luxembourg
Fonds National de la Recherche, cofunded under the Marie Curie Actions of the
European Commission (FP7-COFUND). Philippe Van Kerm acknowledges funding for
the project Information and Wage Inequality, which is supported by the Luxembourg
Fonds National de la Recherche (contract C10/LM/785657).
5 References
Cleveland, W. S. 1979. Robust locally weighted regression and smoothing scatterplots.
Journal of the American Statistical Association 74: 829–836.
Cox, D. D., L. H. Cox, and K. B. Ensor. 1997. Spatial sampling and the environment:
Some issues and directions. Environmental and Ecological Statistics 4: 219–233.
Fan, J., and I. Gijbels. 1996. Local Polynomial Modelling and Its Applications. New
York: Chapman & Hall/CRC.
Furrer, R., D. Nychka, and S. Sain. 2013. fields: Tools for spatial data. R package
version 6.7.6. http://CRAN.R-project.org/package=fields.
Gelfand, A. E., S. Banerjee, and A. O. Finley. 2012. Spatial design for knot selection
in knot-based dimension reduction models. In Spatio-Temporal Design: Advances in
Efficient Data Acquisition, ed. J. Mateu and W. G. Müller, 142–169. Chichester, UK:
Wiley.
Jann, B. 2005. moremata: Stata module (Mata) to provide various functions. Sta-
tistical Software Components S455001, Department of Economics, Boston College.
http://ideas.repec.org/c/boc/bocode/s455001.html.
Johnson, M. E., L. M. Moore, and D. Ylvisaker. 1990. Minimax and maximin distance
designs. Journal of Statistical Planning and Inference 26: 131–148.
Kim, J.-I., A. B. Lawson, S. McDermott, and C. M. Aelion. 2010. Bayesian spatial
modeling of disease risk in relation to multivariate environmental risk fields. Statistics
in Medicine 29: 142–157.
Nychka, D., and N. Saltzman. 1998. Design of air-quality monitoring networks. In Case
Studies in Environmental Statistics (Lecture Notes in Statistics 132), ed. D. Nychka,
W. Piegorsch, and L. Cox, 51–76. New York: Springer.
Royle, J. A., and D. Nychka. 1998. An algorithm for the construction of spatial coverage
designs with implementation in SPLUS. Computers and Geosciences 24: 479–488.
Ruppert, D., M. P. Wand, and R. J. Carroll. 2003. Semiparametric Regression. New
York: Cambridge University Press.
Adaptive MCMC in Mata
M. J. Baker

1 Introduction
Markov chain Monte Carlo (MCMC) methods are a popular and widely used means
of drawing from probability distributions that are not easily inverted, that have dif-
ficult normalizing constants, or for which a closed form cannot be found. While of-
ten considered a collection of methods with primary usefulness in Bayesian analysis
and estimation, MCMC methods can be applied to a variety of estimation problems.
Chernozhukov and Hong (2003), for example, show that MCMC methods can be applied
to many problems of traditional statistical inference and used to fit a wide class of
models; essentially, any statistical model with a pseudoquadratic objective function.
This class of models encompasses many common econometric models that have tra-
ditionally been fit by maximum likelihood or the generalized method of moments. This
article describes some Mata functions for drawing from distributions by using different
types of adaptive MCMC algorithms. The Mata implementation of the algorithms is
intended to allow straightforward application to estimation problems.
While it is well known that MCMC methods are useful for drawing from difficult
densities, one might ask: why use MCMC methods in estimation? Sometimes, maximiz-
ing an objective function may be difficult or slow, perhaps because of discontinuities or
nonconcave regions of the objective function, a large parameter space, or difficulty in
programming analytic gradients or Hessians. When bootstrapping of standard errors
is required, estimation problems are exacerbated because of the need to refit a model
many times. MCMC methods may provide a more feasible means of estimation in these
cases: estimation based on sampling directly from the joint parameter distribution does
not require optimization and still provides the desired result of estimation, a descrip-
tion of the joint distribution of parameters. MCMC methods are a popular means of
implementing Bayesian estimators because they allow one to avoid hard-to-calculate
normalizing constants that often appear in posterior distributions. Unlike extrema-
based estimation, Bayesian estimators do not rely on asymptotic results and thus are
useful in small-sample estimation problems or when the asymptotic distribution of pa-
rameters is difficult to characterize.
In this article, I describe a Mata function, amcmc(), that implements adaptive or non-
adaptive MCMC algorithms. I also describe a suite of routines, amcmc_*(), that allows
implementation via a series of structured functions, as one might use Mata functions
such as moptimize( ) (see [M-5] moptimize( )) or deriv( ) (see [M-5] deriv( )). The
algorithms implemented by the Mata routines more or less follow Andrieu and Thoms
(2008), who present an accessible overview of the theory and practice of adaptive MCMC.
In section 2, I provide an intuitive overview of adaptive MCMC algorithms, while
in section 3, I describe how the algorithms are implemented in Mata by amcmc() or
by creating a structured object via the suite of functions amcmc_*(). In section 4, I
describe four applications. I show how the routines might be used in a straightforward
parameter estimation problem, and I describe how the methods can be applied to a more
difficult problem: censored quantile regression. In this discussion, I also introduce
the mcmccqreg command. I then show how the routines can be used to sample from a
distribution that is hard to invert and lacks a normalizing constant. In a final example
in section 4, I apply the methods to Bayesian estimation of a mixed logit model following
Train (2009) and introduce the bayesmixedlogit command. In section 5, I sketch a
basic Mata implementation of an adaptive MCMC algorithm, which I hope will give users
a template for developing adaptive MCMC algorithms in more specialized applications.
In section 6, I conclude and offer some sources for additional reading.
Table 1. An MH algorithm. The proposal distribution is denoted by q(Y, X), while the
target distribution is π(X). α(X, Y) denotes the draw acceptance probability.

Basic MH algorithm
1: Initialize start value X = X0 and draws T.
2: Set t = 0 and repeat steps 3–6 while t ≤ T:
3: Draw a candidate Yt from q(Yt, Xt).
4: Compute α(Yt, Xt) = min[{π(Yt) q(Yt, Xt)}/{π(Xt) q(Xt, Yt)}, 1].
5: Set Xt+1 = Yt with prob. α(Yt, Xt), and Xt+1 = Xt otherwise.
6: Increment t.
Output: The sequence (Xt), t = 1, ..., T.
The MH algorithm sketched in table 1 has the property that candidate draws Yt
that increase the value of the target distribution, π(X), are always accepted, whereas
candidate draws that produce lower values of the target distribution are accepted only
with probability α. Under general conditions, the draws X1, X2, ..., XT converge to
draws from the target distribution, π(X); see Chib and Greenberg (1995) for proofs.
One can see the convenience the algorithm provides in drawing from densities of the form
π(X) = g(X)/K, where K is some perhaps difficult-to-calculate normalizing constant.
Computation of K is unnecessary because it cancels out of the ratio π(Yt)/π(Xt). The
proposal distribution, q(Y, X), is where the Markov chain part of Markov chain
Monte Carlo comes in. It is what distinguishes MCMC algorithms from more general
acceptance-rejection Monte Carlo sampling: candidate draws depend upon previous
draws through this function.
MCMC algorithms are simple and flexible, and they are therefore applicable to a wide
variety of problems. However, they can be challenging to implement, mainly because it
can be hard to find an appropriate proposal distribution, q(Y, X). If q(Y, X) is chosen
poorly, coverage of the target distribution, π(X), may be poor. This is where adaptive
MCMC methods are used, because they help tune the proposal distribution. As an
adaptive MCMC algorithm proceeds, information about acceptance rates of previous
draws is collected and embodied in a set of tuning parameters, λ. Slow convergence
or nonconvergence of an algorithm like that in table 1 is often caused by acceptance of
too few or too many candidate draws: if the algorithm accepts too few candidate draws,
candidates are too far away from regions of the support of the distribution where π(X)
is large; if too many candidates are accepted, candidates occupy an area of the support
of the distribution clustered closely around a large value of π(X). Accordingly, if the
acceptance rate is too low, the tuning mechanism contracts the search range; if the
acceptance rate is too high, it expands the search range. As a practical matter, one
augments the proposal distribution with the tuning parameters so that the proposal
distribution is something like q(Y, X, λ). A description of such an algorithm
appears in table 2.
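One common form of this tuning step, following Andrieu and Thoms (2008), written here as a sketch with λ the proposal scaling parameter, α* the target acceptance rate, and γ_t the weighting sequence discussed below:

\log \lambda_{t+1} = \log \lambda_t + \gamma_t \,\bigl\{ \alpha(Y_t, X_t) - \alpha^{*} \bigr\}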
The algorithm in table 2 also relies on a simplification of the basic MCMC algorithm
presented in table 1, which results when a symmetric proposal distribution is used so that
q(Y, X, λ) = q(X, Y, λ). With a symmetric proposal distribution (the (multivariate)
normal distribution being a prominent example), the proposal distribution drops out
of the calculation of the acceptance probability in step 4 of the algorithm; this results
in the simplified acceptance probability α(Y, Xt) = min[{π(Y)}/{π(Xt)}, 1]. All the
Mata routines discussed in this article use a multivariate normal density for the proposal
distribution.
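As a minimal Mata sketch of this symmetric-proposal update (not the amcmc() implementation itself; the log target lnf() used here, a standard normal, and the function name mh_step() are hypothetical):

mata:
// log of an (unnormalized) target density; here a standard normal of the
// dimension of x
real scalar lnf(real rowvector x)
{
    return(-0.5*x*x')
}

// one random-walk MH update: propose Y ~ N(X, lam*V) and accept with
// probability min{ pi(Y)/pi(X), 1 }
real rowvector mh_step(real rowvector X, real scalar lam, real matrix V)
{
    real rowvector Y
    Y = X + sqrt(lam)*rnormal(1, cols(X), 0, 1)*cholesky(V)'
    if (ln(runiform(1, 1)) < lnf(Y) - lnf(X)) return(Y)
    return(X)
}
end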
These conditions are satisfied by the weighting parameter used in the adaptive al-
gorithm in table 3 so long as δ ∈ (0, 1): the reason is that under these circumstances,
the sum over t of the weights 1/(1 + t)^δ diverges, but a sufficiently large value of ε
that forces the series Σ_t {1/(1 + t)^δ}^{1+ε} to converge can always be found.
A last detail to address is how to initialize the value of the scaling parameter at the
start of the algorithm. According to Andrieu and Thoms (2008, 359), theory suggests
that a good place to start with the scaling parameter is 2.38²/d, where d is the
dimension of the target distribution. The Mata routines presented below all use this
value as a starting point, with one exception.
There are many variations on the basic theme of the algorithm presented in table 3.
One possibility is one-at-a-time, sequential sampling of values from the distribution,
which produces a Metropolis-within-Gibbs type sampler. Another possibility is to
work halfway between the global sampling algorithm of table 3 and the sequential
sampling, creating what might be labeled a block adaptive MCMC sampler.3 In my
experience, Metropolis-within-Gibbs samplers or block samplers are often useful in situ-
ations in which variables are scaled very differently or in situations where the researcher
might not have good intuition about starting values.
Related to determining how to execute the algorithm is the issue of how to choose
T , the length of the run. One would like to choose T large enough so that the conver-
gence criteria mentioned above are satisfied and enough draws are produced for reliable
statistical inference. How does one know that the algorithm has achieved these goals?
This is a surprisingly complex question that really does not have a good answer. While
one can often detect problems with the algorithm, there is no way to guarantee that
the algorithm has converged. Gelman and Shirley (2011) describe different techniques
for assessing performance and convergence of the run, but they also emphasize the
complementary roles of visual inspection of results, understanding the application, and
understanding the subject matter. These issues are discussed at greater length in the
conclusion.
3. I follow the convention of referring to a sequential sampler as a Metropolis-within-Gibbs sampler,
even though many find this terminology misleading; see Geyer (2011, 28–29). What I call a block
sampler, some might call a block-Gibbs sampler.
The first Mata implementation of the algorithms described in section 2 is through the
Mata function amcmc(),4 which uses different types of adaptive MCMC samplers based
upon user-provided information. In addition to describing details of sampling (spec-
ification of draws, weighting parameters, and acceptance rates), the user can specify
whether sampling is to proceed all at once (globally), in blocks, or sequentially. The
user can also set up amcmc() to work with a stand-alone distribution or with an
objective function previously set up to work with moptimize() or optimize(). The
syntax is as follows:
Description
If the dimension of the target probability distribution (or the parameter vector) is char-
acterized as a 1 × c row vector, amcmc() returns a matrix of draws from the distribution
organized in c columns and r = draws − burn rows, so each row of the returned matrix
can be considered a draw from the target distribution lnf. Additional information about
the draws is collected in three arguments overwritten by amcmc(): arate, vals, and lam,
which contain actual acceptance rates, the log value of the target distribution at each
draw, and the proposal scaling parameters λ. If a Metropolis-within-Gibbs sampler or
a block sampler is used, lam, as well as arate, is returned as a row vector equal in length
to the dimension of the distribution or the number of blocks.
Information about how to draw from the target distribution and how the distribution
has been programmed is passed to the command as a sequence of strings in the (string)
row vector alginfo. This row vector can contain information about whether sampling is
to be sequential (mwg), in blocks (block), or global (global). If the user is interested in
applying amcmc() to a model statement constructed with moptimize() or optimize(),
information on this and the type of evaluator function used with the model should also
be contained in alginfo. Target distribution information can be standalone, moptimize,
or optimize. Information on evaluator type can also be of any sort (that is, d0, v0,
etc.).5 A final option that can be passed along as part of alginfo is the key fast, which
will execute the adaptive MCMC algorithm more quickly but less exactly. I give some
examples of what alginfo might look like in the remarks about syntax.
The second argument of amcmc(), lnf, is a pointer to the target distribution, which
must be written in log form. xinit and Vinit are conformable initial values for the
routine and an initial variance–covariance matrix for the proposal distribution. The
scalars draws and burn tell the routine how many draws to make from the distribution
and how many of these draws are to be discarded as an initial burn-in period. delta
is a string scalar that describes how adaptation is to occur, while aopt is the desired
acceptance rate; see section 2.1.
The real matrix blocks contains information on how amcmc() should proceed if the
user wishes to draw from the function in blocks. If the user does not wish to draw in
blocks, the user simply passes a missing value for this argument. If the user provides an
argument here, but does not specify block as part of alginfo, sampling will not occur
in blocks.
If the user is drawing from a function constructed with a prespecied model com-
mand written to work with either moptimize() or optimize(), this model statement is
passed to amcmc() via the optional M argument. As described below, this argument can
also have other uses; for example, it can pass up to 10 additional explanatory variables
to amcmc().
The final option is noisy, and if the user specifies noisy="noisy", amcmc() will
produce feedback on drawing as the algorithm executes. A dot is produced every time
the evaluation function lnf is called (not every time a draw is completed, because the
latter is taken by amcmc() to mean a complete run through the routine). Thus, if a
block sampler or a Metropolis-within-Gibbs style sampler is used, a draw is deemed to
have occurred when all the blocks or variables have been drawn once. The value of the
target distribution is reported every 50 evaluations.
Remarks
It is helpful to have a few examples of how information about the draws to be conducted
can be passed to the amcmc() function through the first argument, alginfo. This is
described in table 4.
5. The routine will not work with evaluators of the lnf type.
The user can select any item from each of the rows of table 4 and pass it to amcmc()
as part of alginfo. For example, if the user is trying to draw from a function that was
written as a type d2 evaluator to work with moptimize and the user wished to use a
global sampler, he or she might specify
alginfo="moptimize","d2","global"
Order does not matter, so the user could also specify
alginfo="d2","moptimize","global"
If the user had a stand-alone function and wished to do Metropolis-within-Gibbs
style sampling from this function, he or she would specify
alginfo="standalone","mwg"
or even just alginfo="mwg" because if no model statement is submitted, amcmc() will
assume that the function is stand alone. The final option that the user might specify
is the "fast" option, which tacks on the string fast to alginfo. This option is helpful
when the user wishes to sample globally or in blocks but has a problem with large
dimension. Because the global and block samplers use Cholesky decomposition of the
proposal covariance matrix, large problems may be time consuming. The "fast" option
circumvents the potential slowdown by working with just the diagonal elements of the
proposal covariance matrix, so one can avoid Cholesky decomposition. One should,
however, be cautious in using this option and should probably apply it only when the
user can be reasonably certain that distribution variables are independent.6
The row vector xinit contains an initial value for the draws, while Vinit is an initial
variance–covariance matrix that may be a conformable identity matrix. If, however,
Vinit is a row vector, amcmc() will interpret this as the diagonal of a variance matrix
with zero off-diagonal entries.
While the user-specified scalar delta controls how rapidly adaptation vanishes, the
user may also specify delta equal to missing (delta = .). amcmc() will then assume that
the user does not want any adaptation to occur but instead wishes to draw from the
invariant proposal distribution with mean xinit and covariance matrix Vinit. In this
case, the user must supply values of lambda to describe to the algorithm how to scale
draws from the proposal distribution. Constructing the code this way allows users to
run the adaptive algorithm for a while, and once it has converged, it allows users to
switch to an algorithm using an invariant proposal distribution. If a global sampler is
used, only one value of lambda is required; otherwise, lambda must be conformable with
the sampler. So, if the option mwg is used, the dimension of lambda must match the
dimension of the target distribution; if the option block is used, lambda must contain
as many entries as the number of blocks.
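As a sketch of this no-adaptation mode, mirroring the calling sequence used in the examples later in the article (where the evaluator lregeval(), the earlier draws b_start, and the model statement M are defined), with purely illustrative scale values in lambda:

    alginfo = "moptimize", "d0", "mwg"
    lambda  = J(1, 5, 2.38^2)       // one (illustrative) scale value per parameter
    b_fix   = amcmc(alginfo, &lregeval(), mean(b_start), variance(b_start),
                    10000, 0, ., ., arate=., vals=., lambda, ., M)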
Whether one wishes to do Metropolis-within-Gibbs sampling, block sampling, or
global sampling, the routine requires the same set of input information (although the
6. I included this option hoping that users might try it and see for what problems, if any, it does and
does not work well.
overwritten values lam and arate differ slightly) with one exception. When one samples
in block form, amcmc() requires a matrix to be provided in block, in which the number
of rows is equal to the number of sampling groups, and the values to be drawn together
have 1s in the appropriate positions and 0s elsewhere. So, for example, if one wished to
draw from a five-dimensional distribution and wished to draw values for the first three
arguments together, and then arguments four and ve together, one would set up a
matrix B as follows:
1 1 1 0 0
B=
0 0 0 1 1
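In Mata, this matrix can be built with the column- and row-join operators, for example:

    B = (1, 1, 1, 0, 0) \ (0, 0, 0, 1, 1)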
One might suspect that this would result in the same sort of algorithm obtained by
specifying alginfo="mwg", but this is not the case. After each draw, the block algorithm
updates the entire mean proposal vector and covariance matrix, so information on each
draw is used to prepare for the next.7 While not the intended use of the block-sampling
algorithm, if one leaves a column of all 0s in the matrix B, the corresponding value of
the parameter will never be drawn. This is a quick, albeit not particularly efficient, way
of constraining parameters at particular values during the drawing process.
The argument M of amcmc() can contain a previously assembled model statement, or
it can be used to pass additional arguments of a function to the routine.8 For example,
if the user has written a function to be sampled from that has three arguments, such
as lnf(x,Y,Z), the user would specify the standalone option in the variable alginfo,
assemble the additional arguments into a pointer, and then pass this information to
amcmc(). In this instance, M might be constructed in Mata as follows:
M=J(2,1,NULL)
M[1,1]=&Y
M[2,1]=&Z
M can then be passed to amcmc(), which will use Y and Z (in order) to evaluate
lnf(x,Y,Z). As shown in the examples, this usage of pointers can be handy when
amcmc() is used as part of a larger algorithm: one can continually change Y and Z
without actually having to explicitly declare that Y and Z have changed as the algorithm
executes.
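As a minimal sketch of this setup (the target density, its dimension, and the data Y and Z below are purely illustrative; the calling sequence follows the examples shown later in the article):

    real scalar lnf(real rowvector x, real colvector Y, real matrix Z)
    {
        return(-(Y-Z*x')'(Y-Z*x')/2)     // an illustrative Gaussian-type log kernel
    }

    Y = rnormal(50, 1, 0, 1)
    Z = rnormal(50, 3, 0, 1)
    M = J(2, 1, NULL)
    M[1,1] = &Y
    M[2,1] = &Z
    draws = amcmc(("standalone","global"), &lnf(), J(1,3,0), I(3), 2000, 200,
                  2/3, .234, arate=., vals=., lambda=., ., M)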
7. Using amcmc() in this way is akin to what Andrieu and Thoms (2008, 360) describe as an adaptive
MCMC algorithm with componentwise adaptive scaling.
8. But not both; we assume that any arguments have already been built into the model statement if
a previously constructed model is used.
Another alternative that has advantages in certain situations, particularly when one
wishes to do adaptive MCMC as one step in a larger sampling problem, is to set up an
adaptive MCMC sampling problem by using the set of functions amcmc_*(). The user
first opens a problem using the amcmc_init() function and then fills in the details of
the drawing procedure. The user can use the following functions to set up an adaptive
MCMC problem, with the arguments corresponding to those described in section 3.1:
A = amcmc_init()
amcmc_lnf(A, pointer (real scalar function) scalar f)
amcmc_args(A, pointer matrix Z)
amcmc_xinit(A, real rowvector xinit)
amcmc_Vinit(A, real matrix Vinit)
amcmc_aopt(A, real scalar aopt)
amcmc_blocks(A, real matrix blocks)
amcmc_model(A, transmorphic M)
amcmc_noisy(A, string scalar noisy)
amcmc_alginfo(A, string rowvector alginfo)
amcmc_damper(A, real scalar delta)
amcmc_lambda(A, real rowvector lambda)
amcmc_draws(A, real scalar draws)
amcmc_burn(A, real scalar burn)
Once a problem has been specified, a run can be initiated via the function

amcmc_draw(A)

Results from the run can then be retrieved with the functions amcmc_results_*(A),
where * can be any of the following: vals, arate, passes, totaldraws, acceptances,
propmean, propvar, or report. Additionally, users can recover their initial specifications
by using * = draws, aopt, alginfo, noisy, blocks, damper, xinit, Vinit, or lambda.
An additional function, amcmc_results_lastdraw(), produces the value of only the last
draw. Two other functions that are useful when one is executing an adaptive MCMC
draw as part of a larger algorithm are amcmc_append() and amcmc_reeval().
The function amcmc_append() allows the user to indicate that results should be overwritten
by specifying append="overwrite". In this case, the results of only the most recent
draws are kept. This can be useful when doing an analysis where nuisance parameters of
a model are being drawn, and storing all the previous draws would tax memory and
slow the algorithm's operation. The function amcmc_reeval() allows
the user to indicate whether the target distribution should be reevaluated at the last
draw before a proposed value is tried by specifying reeval="reeval". When the draw
is part of a larger algorithm, some of the arguments of the target distribution might
change as the larger algorithm proceeds. In these cases, the target distribution needs
to be reevaluated at the new argument values and the last previous draw for the routine to function
correctly. If the user sets reeval to anything else, it is assumed that nothing has changed
and that the value of the target distribution has not changed between draws.
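For example, within a larger sampler one might set (assuming a problem handle A already created with amcmc_init()):

    amcmc_append(A, "overwrite")   // keep only the draws from the most recent call to amcmc_draw()
    amcmc_reeval(A, "reeval")      // reevaluate the target at the last draw before each new proposal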
Remarks
Some of the information accessible with amcmc_results_*() provides hints as to why
a user might prefer to use a problem statement to attack an adaptive MCMC problem
instead of the Mata function amcmc(). Using a problem statement is particularly useful
because one can easily stop, restart, and append a run within Mata's structure envi-
ronment. In this way, a user can perform adaptive MCMC as part of a larger algorithm;
the structure makes it easy to retain information about past adaptation and runs as the
algorithm proceeds and also makes it easy to modify arguments of the algorithm. In
the model statement syntax, information about the number of times a given problem
has been initiated is retrievable via the function amcmc_results_passes(A), while the
acceptance history of an entire run is accessible via amcmc_results_acceptances(A).
Given the initialization of an adaptive MCMC problem A, one can run amcmc draw()
sequentially and results will be appended to previous results. Accordingly, the burn
period is active only the first time the function is executed. Thereafter, it is assumed
that the user wishes to retain all drawn values. As mentioned above, the user can
choose whether to retain all the information about previous draws with the function
amcmc_append(). When a user specifies append="overwrite" to save the draws of only
the last run, the routine still includes all information about adaptation contained in the
entire drawing history.
When a user initializes an adaptive MCMC problem via amcmc_init(), some defaults
are set unless overwritten by the user. The number of draws is set to 1, the burn period
is set to 0, the target distribution is assumed to be stand alone, the acceptance rate is
set to 0.234, and results are appended to previous results if multiple passes are made.
It is also assumed that the function does not need to be reevaluated at the last value
before drawing a new proposal.
Further description can be found in the help files, accessible by typing help mata
amcmc() or help mf amcmc at Stata's command prompt.
4 Examples
4.1 Parameter estimation
For my first example, I apply adaptive MCMC to a simple estimation problem. Suppose
that I have already programmed a likelihood function to use with moptimize() in Mata,
but I wish to try another means of estimating parameters, perhaps because I have
found that maximization of the likelihood function is taking too long or presents other
difficulties or because I am worried about small-sample properties of the estimators.
I decide to try to fit the model by drawing directly from the conditional distribution
of parameters. The ideas derive from Bayes's rule and the usual principles of Bayesian
estimation, but they can be applied to virtually any maximum likelihood problem.9 Via
Bayes's rule, the distribution of parameters conditional on the data can be written as
p(θ|X) = p(X|θ)p(θ) / p(X) = p(X|θ)p(θ) / ∫ p(X|θ)p(θ) dθ    (1)

If one has no prior information about parameter values, one can take p(θ), the prior dis-
tribution of the parameters, to be (improper) uniform over the support of the parameters.
As this renders p(θ) constant, one then obtains the posterior parameter distribution as

p(θ|X) ∝ p(X|θ)    (2)

So, according to (2), one might interpret a likelihood function as the distribution of
parameters conditional on data up to a constant of proportionality. The conditional
mean of parameter values is then

E(θ|X) = ∫ θ p(θ|X) dθ    (3)

One can estimate E(θ|X) by simulating the right-hand side of (3) via S draws from the
conditional distribution p(θ|X),

E(θ|X) ≈ (1/S) Σ_{s=1}^{S} θ^(s)
These simulations can also be used to characterize higher-order moments of the param-
eter distribution. I shall follow the nomenclature adopted by Chernozhukov and Hong
(2003) and refer to obtained estimators as Laplace-type estimators (LTEs) or quasi-
Bayesian estimators (QBEs).
Returning to the example, I will posit a simple linear model with log-likelihood
function

ln L ∝ −(y − Xβ)'(y − Xβ)/(2σ²) − (n/2) ln σ²
9. They can also be applied to a wider variety of problems; see Chernozhukov and Hong (2003).
For comparison, in the following code, I take this simple model and fit it to some data by
using a type d0 evaluator and Mata's moptimize() function. One subtlety of the code is
that the variance is coded in exponentiated form. This is done so that when amcmc() is
applied to the problem, the objective function is consistent with the multivariate normal
proposal distribution, which requires that parameters have support (−∞, ∞).10 The
following code develops the model statement and fits the model via maximum likelihood:
. sysuse auto
(1978 Automobile Data)
. mata:
mata (type end to exit)
: function lregeval(M,todo,b,crit,s,H)
> {
> real colvector p1, p2
> real colvector y1
> p1=moptimize_util_xb(M,b,1)
> p2=moptimize_util_xb(M,b,2)
> y1=moptimize_util_depvar(M,1)
> crit=-(y1:-p1)'(y1:-p1)/(2*exp(p2))-
> rows(y1)/2*p2
> }
note: argument todo unused
note: argument s unused
note: argument H unused
: M=moptimize_init()
: moptimize_init_evaluator(M,&lregeval())
: moptimize_init_evaluatortype(M,"d0")
: moptimize_init_depvar(M,1,"mpg")
: moptimize_init_eq_indepvars(M,1,"price weight displacement")
: moptimize_init_eq_indepvars(M,2,"")
: moptimize(M)
initial: f(p) = -18004
alternative: f(p) = -10466.142
rescale: f(p) = -298.60453
rescale eq: f(p) = -189.39334
Iteration 0: f(p) = -189.39334 (not concave)
Iteration 1: f(p) = -172.06827 (not concave)
Iteration 2: f(p) = -162.08563 (not concave)
Iteration 3: f(p) = -156.61996 (not concave)
Iteration 4: f(p) = -143.55991
Iteration 5: f(p) = -129.10949
Iteration 6: f(p) = -127.05705
Iteration 7: f(p) = -127.05447
Iteration 8: f(p) = -127.05447
10. A less efficient way to deal with parameters with restricted supports is to program the distribution
so that it returns a missing value whenever a draw lands outside the appropriate range.
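As a sketch of the approach described in footnote 10, the following stand-alone log density (an illustrative unit exponential restricted to x > 0) simply returns a missing value whenever a proposal lands outside the support, which amcmc() then rejects:

    real scalar lnexp(real rowvector x)
    {
        if (x[1] <= 0) return(.)     // outside the support: report missing so the draw is rejected
        return(-x[1])                // log density of a unit exponential, up to a constant
    }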
: moptimize_result_display(M)
Number of obs = 74
                    Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
eq1
price -.0000966 .0001591 -0.61 0.544 -.0004085 .0002153
weight -.0063909 .0011759 -5.43 0.000 -.0086956 -.0040862
displacement .0054824 .0096492 0.57 0.570 -.0134296 .0243945
_cons 40.10848 1.974222 20.32 0.000 36.23907 43.97788
eq2
_cons 2.433905 .164399 14.80 0.000 2.111688 2.756121
: end
I now estimate model parameters via simulation by treating the likelihood function
like the parameters' conditional distribution. I start with a Metropolis-within-Gibbs
sequential sampler to obtain 10,000 draws for each parameter value, discarding the first
20 draws as a burn-in period. I start with this sampler because it is usually a relatively
safe choice when there is little information on starting points, which I am pretending are
unavailable. I set the initial values used by the sampler to 0 and use an identity matrix
as an initial covariance matrix for proposals. I choose a value of delta = 2/3, which
allows a fairly conservative amount of adaptation to occur and a desired acceptance rate
of 0.4.11
. set seed 8675309
. mata:
mata (type end to exit)
: alginfo="moptimize","d0","mwg"
: b_mwg=amcmc(alginfo,&lregeval(),J(1,5,0),I(5),10000,50,2/3,.4,
> arate=.,vals=.,lambda=.,.,M)
: st_matrix("b_mwg",mean(b_mwg))
: st_matrix("V_mwg",variance(b_mwg))
: end
11. Regarding what might seem a relatively short burn-in period, I set this period to be short enough
to show the convergence behavior of the algorithm.
. ereturn display
                    Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
eq1
price -.0001322 .0001714 -0.77 0.440 -.0004681 .0002036
weight -.0057418 .0018016 -3.19 0.001 -.009273 -.0022107
displacement .00218 .0125846 0.17 0.862 -.0224854 .0268454
_cons 39.00328 3.095009 12.60 0.000 32.93717 45.06939
eq2
_cons 2.518081 .2071915 12.15 0.000 2.111993 2.924169
Although the algorithm was not allowed a very long burn-in time, the simulation-based
parameter estimates are close to those obtained by maximum likelihood.12 How fre-
quently were draws of each parameter accepted, and how close is the algorithm working
around the maximum value of the function? This information is returned as the over-
written arguments arate and vals.
. mata:
mata (type end to exit)
: arate
1
1 .3806030151
2 .3807035176
3 .3870351759
4 .4020100503
5 .3951758794
: max(vals),mean(vals)
1 2
1 -127.1097198 -130.2193494
: end
The sampler finds and operates close to the maximum value of the log likelihood (which
was −127.05), and the acceptance rates of the draws are very close to the desired
acceptance rate of 0.4. To understand what the distribution of the parameters looks
like, I pass the information about parameter draws to Stata and form visual pictures
of results. The code below accomplishes this and creates two panels of graphs: one
that shows the distribution of parameters (figure 1) and one that shows how parameter
draws and the value of the function evolved as the algorithm moved (figure 2).
12. One possible issue here is whether it is appropriate to summarize the results in usual Stata format
like this. One can assume that this is acceptable here because the parameters are collectively
normally distributed. Whether this is true in more general problems requires careful thought.
. preserve
. clear
. local varnames price weight displacement constant std_dev
. getmata (`varnames')=b_mwg
. getmata vals=vals
. generate t=_n
. local graphs
. local tgraphs
. foreach var of local varnames {
2. quietly {
3. histogram `var', saving(`var', replace) nodraw
4. twoway line `var' t, saving(t`var', replace) nodraw
5. }
6. local graphs "`graphs' `var'.gph"
7. local tgraphs "`tgraphs' t`var'.gph"
8. }
. histogram vals, saving(vals,replace) nodraw
(bin=39, start=-183.40158, width=1.4433811)
(file vals.gph saved)
. twoway line vals t, saving(vals_t,replace) nodraw
(file vals_t.gph saved)
. graph combine `graphs' vals.gph
. graph export vals_mwg.eps, replace
(file vals_mwg.eps written in EPS format)
. graph combine `tgraphs' vals_t.gph
. graph export valst_mwg.eps, replace
(file valst_mwg.eps written in EPS format)
. restore
Figure 1 is composed of histograms for each parameter, with the last panel being the
histogram of the log likelihood. Parameters seem to be approximately normally dis-
tributed (with a few blips), excepting the first few draws, and they are also centered
around parameter values obtained via maximum likelihood.
[Figure 1. Histograms of the draws for price, weight, displacement, constant, and std_dev, and of the log-likelihood values (vals)]
Figure 2 shows how the drawn values for parameters and the value of the objective
function evolved as the algorithm proceeded.
[Figure 2. Time-series plots of the draws for price, weight, displacement, constant, std_dev, and vals against the draw number t]
From figure 2, one can see that after a few iterations, the algorithm settles down
to drawing from an appropriate range. The draws are also autocorrelated, and this
autocorrelation is a general property of any MCMC algorithm, adaptive or not. Thus,
when one applies MCMC algorithms in practice, it is sometimes beneficial to thin out
the draws by keeping, say, only every 5th or 10th draw or to jumble draws.
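For example, keeping only every 10th draw from the run above takes one line in Mata (the thinning interval is an arbitrary choice):

    b_thin = b_mwg[range(1, rows(b_mwg), 10), .]    // retain draws 1, 11, 21, ...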
To illustrate the use of a global sampler and some of the problems one might en-
counter in an MCMC-based analysis, I now apply a global sampler to the problem so that
all parameter values are drawn simultaneously. The following code shows the results of
a run of 12,000 draws with a burn-in period of 2,000:
                    Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
eq1
price -.0004614 .0019104 -0.24 0.809 -.0042057 .0032829
weight .013056 .0232029 0.56 0.574 -.0324209 .0585328
displacement -.1798405 .3163187 -0.57 0.570 -.7998138 .4401328
_cons 15.16227 20.84814 0.73 0.467 -25.69933 56.02387
eq2
_cons 4.017751 1.880026 2.14 0.033 .3329679 7.702533
One can see from these results that the algorithm has not quickly found an appropriate
range of values for parameter values. Figures 3 and 4 indicate why: the algorithm
spends considerable time stuck away from the maximal function value.
Figure 3. Distribution of parameters after a global MCMC run that is slow to converge
[Figure 4. Time-series plots of the draws for price, weight, displacement, constant, std_dev, and vals from the slow-to-converge global run]
The problem observed in figures 3 and 4 is that the algorithm was not allowed to burn
in for a long enough time for the global MCMC algorithm to work correctly. While
the parameter values eventually settled down closer to their true values, it took the
algorithm upward of 6,000 draws to find the right range. In fact, it looks as though the
algorithm settled into a stable range for draws 2,000–6,000 or so but then once again
experienced a jump to the correct stable range, a phenomenon known as
pseudoconvergence (Geyer 2011). This behavior is also responsible for the multimodal appearance
of the histograms in figure 3.
While my intent is to illustrate how the Mata function amcmc() works, my example
also illustrates what can happen when one fails to specify appropriate adjustment pa-
rameters and does not allow an adaptive MCMC algorithm to run long enough in a given
estimation problem. One may unknowingly get bad results, as would be the case if
the global algorithm had been allowed to run for only 5,000 iterations. This sometimes
happens if poor starting values are mixed with parameters that have very different mag-
nitudes, for example, the constant in the initial model relative to the other parameters.
From inspecting figure 4, one can see that the constant did not find its correct range
until just after 6,000 draws, and this is likely what caused the problem.
This discussion motivates using amcmc() in steps, where a slower but relatively
robust sampler (a Metropolis-within-Gibbs sampler, in this case) is used to orient pa-
rameters close to their correct range before a global sampler is used, as shown in the
following code:
. mata:
mata (type end to exit)
: alginfo="mwg","d0","moptimize"
: b_start=amcmc(alginfo,&lregeval(),J(1,5,0),I(5),5*1000,5*100,2/3,.4,
> arate=.,vals=.,lambda=.,.,M)
: alginfo="global","d0","moptimize"
: b_glo2=amcmc(alginfo,&lregeval(),mean(b_start),
> variance(b_start),11000,1000,2/3,.4,
> arate=.,vals=.,lambda=.,.,M)
: st_matrix("b_glo2",mean(b_glo2))
: st_matrix("V_glo2",variance(b_glo2))
: end
                    Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
eq1
price -.0001059 .0001584 -0.67 0.504 -.0004164 .0002046
weight -.0063727 .0012014 -5.30 0.000 -.0087275 -.0040179
displacement .0056462 .0099215 0.57 0.569 -.0137997 .025092
_cons 40.10216 1.912111 20.97 0.000 36.35449 43.84982
eq2
_cons 2.480892 .1665249 14.90 0.000 2.15451 2.807275
Thus one can then draw parameters that are scaled differently either alone or in blocks
until the algorithm finds its footing and then proceed with a global algorithm. I have
motivated the use of a global drawing method because of its clear speed advantages, but
another more subtle reason to use it that might not be obvious when visually inspecting
the graphs is that global draws often exhibit less serial correlation across draws.13 The
conclusion provides sources with additional tips for setting up, analyzing, and presenting
the results of an MCMC run.
Yet another alternative is to once again begin with a Metropolis-within-Gibbs sam-
pler to characterize the distribution of the parameters and, once this is done sufficiently
well, to run the algorithm without adaptation so that one is using an invariant proposal
distribution and a regular MCMC algorithm. After an initial run with the "mwg" option,
I submit the mean and variance of results to the global sampler with no adaptation
parameter, passing a value of missing (.) for delta. Because I am not passing any
information to amcmc() on how to do adaptation in this case, I am required to submit
a value for lambda, so I choose λ = 2.38²/n.14 Finally, I also submit a missing value
for aopt. Because no adaptation occurs, aopt is not used by the algorithm.
. mata:
mata (type end to exit)
: alginfo="mwg","d0","moptimize"
: b_start=amcmc(alginfo,&lregeval(),J(1,5,0),I(5),5*1000,5*100,2/3,.4,
> arate=.,vals=.,lambda=.,.,M)
: alginfo="global","d0","moptimize"
: b_glo3=amcmc(alginfo,&lregeval(),mean(b_start),
> variance(b_start),10000,0,.,.,
> arate=.,vals=.,(2.38^2/5),.,M)
: arate
.2253
: mean(b_glo3)
1
1 -.0000916295
2 -.0064095109
3 .0054916501
4 40.14276799
5 2.497166774
: end
Apparently, the proposal distribution was successfully tuned in the initial run with the
Metropolis-within-Gibbs sampler. The mean values of the parameters obtained from
the global draw are close to their maximum-likelihood values, and the acceptance rate
is in the healthy range.
A global sampler for this problem can also be set up by using the structured amcmc_*() functions:
. mata:
mata (type end to exit)
: A=amcmc_init()
: amcmc_alginfo(A,("global","d0","moptimize"))
: amcmc_lnf(A,&lregeval())
: amcmc_xinit(A,J(1,5,0))
: amcmc_Vinit(A,I(5))
: amcmc_model(A,M)
: amcmc_draws(A,4000)
: amcmc_damper(A,2/3)
: amcmc_draw(A)
: end
I can now access results using the previously described amcmc_results_*(A) set of
functions.
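For instance, using the result functions listed earlier:

    vals_g  = amcmc_results_vals(A)       // target values at the retained draws
    arate_g = amcmc_results_arate(A)      // acceptance rate(s)
    last    = amcmc_results_lastdraw(A)   // the most recent draw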
where c_i in (4) denotes a (left) censoring point that might be specific to the ith ob-
servation, and ρ_τ(u) = {τ − 1(u < 0)}u, with τ ∈ (0, 1) the quantile of interest. Esti-
mation using derivative-based maximization methods is problematic because the objec-
tive function (4) has flat regions and discontinuities. While one might do well with a
nonderivative-based optimization method such as Nelder–Mead, one is then confronted
with the problem of characterizing the parameters' distribution and getting standard
errors. For these reasons, one might opt for an LTE or a QBE estimator.
To apply amcmc() to the problem, I first program the objective function as follows:15
. mata:
mata (type end to exit)
: void cqregeval(M,todo,b,crit,g,H) {
> real colvector u,Xb,y,C
> real scalar tau
>
> Xb =moptimize_util_xb(M,b,1)
> y =moptimize_util_depvar(M,1)
> tau =moptimize_util_userinfo(M,1)
> C =moptimize_util_userinfo(M,2)
> u =(y:-rowmax((C,Xb)))
> crit =-colsum(u:*(tau:-(u:<0)))
> }
note: argument todo unused
note: argument g unused
note: argument H unused
: end
The following code sets up a model statement for use with the function moptimize( )
(see [M-5] moptimize( )). One can follow the Mata code with moptimize(M) to verify
that this model and variations on the basic theme, obtained by dropping or adding
additional variables, encounter difficulties.
15. One might code the objective function without summing over observations. I sum over observations
so that the objective is compatible with Nelder–Mead in Stata, which requires a type d0 evaluator.
Setting up the problem like this allows the use of amcmc(), where I implement the
strategy of using a Metropolis-within-Gibbs-type algorithm followed by a global sampler.
. mata:
mata (type end to exit)
: alginfo="mwg","d0","moptimize"
: b_start=amcmc(alginfo,&cqregeval(),J(1,4,0),I(4),5000,1000,2/3,.4,
> arate=.,vals=.,lambda=.,.,M)
: alginfo="global","d0","moptimize"
: b_end=amcmc(alginfo,&cqregeval(),mean(b_start),
> variance(b_start),20000,10000,1,.234,arate=.,vals=.,lambda=.,.,M)
: end
Because this application might be of more general interest, I developed the command
mcmccqreg, which is a wrapper for the LTE and QBE estimation of censored quantile
regression. The previous code can be executed with a single mcmccqreg command.
One can see from the way the command is issued how information about the sampler,
the drawing process, and the censoring point (which has a default of 0 for all observations)
can be controlled using the mcmccqreg command. The command produces estimates
that are summary statistics of the sampling run. mcmccqreg allows one to save results,
and the results of the run are saved in the file lsub draws with the objective function
value after each draw. The user can then easily analyze the draws using Stata's graphing
and statistical analysis tools. While the workings of the command derive more or less
directly from the description of amcmc(), more information about the command and
some additional examples can be found in mcmccqreg's help file.
As written, p does not integrate to one and seems hard to invert. While Metropolis-
within-Gibbs or global sampling works fine with this example, to illustrate the block
sampler, I will draw from the distribution in blocks, where values for the first two
arguments are drawn together, followed by a draw of the third. Thus the block matrix
to be passed to amcmc() is
    B = [1 1 0]
        [0 0 1]
The code that programs the function and draws from the distribution is as follows:
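The listing itself is not reproduced here; the sketch below shows the calling pattern under the assumption of a purely illustrative three-dimensional log density ln_fun(), with the draw and burn-in counts taken from the text, an illustrative desired acceptance rate of 0.234, and a missing value passed in the final (model/arguments) slot because the function is assumed to be stand alone with no extra arguments:

    real scalar ln_fun(real rowvector x)
    {
        return(-0.5*(x[1]^2 + x[2]^2) - abs(x[3] - 100))   // illustrative target only
    }

    B  = (1, 1, 0) \ (0, 0, 1)
    xs = amcmc(("standalone","block"), &ln_fun(), J(1,3,0), I(3), 4000, 200,
               2/3, .234, arate=., vals=., lambda=., B, .)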
The example is set up to draw 4,000 values with a burn-in period of 200. Graphs of the
simulation results are shown in figures 5 and 6.
[Figure 5. Histograms of the draws for x_1, x_2, and x_3 and of the function values (vals)]
[Figure 6. Time-series plots of the draws for x_1, x_2, x_3, and vals]
The graphs give a visual of the marginal distributions for the variables, while the time-
series diagram verifies that our simulation run is getting good coverage and rapid con-
vergence to the target distribution.
A different way to draw from this distribution would be to set up an adaptive MCMC
problem via a structured set of Mata functions.
. mata:
mata (type end to exit)
: A=amcmc_init()
: amcmc_lnf(A,&ln_fun())
: amcmc_alginfo(A,("standalone","block"))
: amcmc_draws(A,4000)
: amcmc_burn(A,200)
: amcmc_damper(A,2/3)
: amcmc_xinit(A,J(1,3,0))
: amcmc_Vinit(A,I(3))
: amcmc_blocks(A,B)
: amcmc_draw(A)
: end
16. The data are downloadable from Train's website at http://eml.berkeley.edu/train/ and can also
be found at http://fmwww.bc.edu/repec/bocode/t/traindata.dta.
where in (5), ε_njt is an independent identically distributed extreme value, and β_n are
individual-specific parameters. Variation in these parameters across the population is
captured by assuming the parameters are normally distributed with mean b and covariance
matrix W. I denote a person's choice at t as y_nt ∈ J. Then the probability of observing
person n's sequence of choices is

L(y_n|β_n) = ∏_t [ exp(β_n'x_{n,y_nt,t}) / Σ_{j=1}^{J} exp(β_n'x_{njt}) ]    (6)
Given the distribution of β, I can write the above conditional on the distribution of
parameters, φ(β|b, W), and integrate over the distribution of parameter values to get

L(y_n|b, W) = ∫ L(y_n|β) φ(β|b, W) dβ

In a Bayesian approach, a prior h(b, W) is assumed, and the joint posterior likelihood
of the parameters is formed using

H(b, W|Y, X) ∝ ∏_n L(y_n|b, W) h(b, W)    (7)
Following the outline given in Train (2009, 301–302), we see that drawing from the
posterior proceeds in three steps. First, b is drawn conditional on the β_n and W; then W
is drawn conditional on b and the β_n; and finally, the values of β_n are drawn conditional on
b and W. The first two steps are straightforward, assuming that the prior distribution
of b is normal with extremely large variance and that the prior for W is an inverted
Wishart with K degrees of freedom and an identity scale matrix. In this case, the
conditional distribution of b is N(β̄, W/N), where β̄ is the mean of the β_n's. The
conditional distribution of W is an inverted Wishart with K + N degrees of freedom
and scale matrix (KI + N S̄)/(K + N), where S̄ = N^{-1} Σ_n (β_n − b)(β_n − b)' is the
sample variance of the β_n's about b.
The distribution of β_n given choices, data, and (b, W) has no simple form, but from
(8), we see that the distribution of a particular person's parameters obeys

p(β_n|y_n, b, W) ∝ L(y_n|β_n) φ(β_n|b, W)    (9)

where the term L(y_n|β_n) in (9) is given by (6). This is a natural place to apply MCMC
methods, and it is here where I can use the amcmc_*() suite of functions.
I now return to the example. traindata.dta contains information on the energy
contract choices of 100 people, where each person faces up to 12 different choice oc-
casions. Suppliers' contracts are differentiated by price, the type of contract offered,
the location of the supplier relative to the individual, how well known the supplier is, and the season in which the
offer was made.
As a point of comparison, I fit the model in Train (2009, 305) using mixlogit (after
download and installation).
. clear all
. set more off
. use http://fmwww.bc.edu/repec/bocode/t/traindata.dta
. set seed 90210
. mixlogit y, rand(price contract local wknown tod seasonal) group(gid) id(pid)
Iteration 0: log likelihood = -1253.1345 (not concave)
Iteration 1: log likelihood = -1163.1407 (not concave)
Iteration 2: log likelihood = -1142.7635
Iteration 3: log likelihood = -1123.6896
Iteration 4: log likelihood = -1122.6326
Iteration 5: log likelihood = -1122.6226
Iteration 6: log likelihood = -1122.6226
Mixed logit model Number of obs = 4780
LR chi2(6) = 467.53
Log likelihood = -1122.6226 Prob > chi2 = 0.0000
               y        Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
Mean
price -.8908633 .0616638 -14.45 0.000 -1.011722 -.7700045
contract -.22285 .0390333 -5.71 0.000 -.2993539 -.1463462
local 1.958347 .1827835 10.71 0.000 1.600098 2.316596
wknown 1.560163 .1507413 10.35 0.000 1.264715 1.85561
tod -8.291551 .4995409 -16.60 0.000 -9.270633 -7.312469
seasonal -9.108944 .5581876 -16.32 0.000 -10.20297 -8.014916
SD
price .1541266 .0200631 7.68 0.000 .1148036 .1934495
contract .3839507 .0432156 8.88 0.000 .2992497 .4686516
local 1.457113 .1572685 9.27 0.000 1.148873 1.765354
wknown -.8979788 .1429141 -6.28 0.000 -1.178085 -.6178722
tod 1.313033 .1648894 7.96 0.000 .9898559 1.63621
seasonal 1.324614 .1881265 7.04 0.000 .9558927 1.693335
To implement the Bayesian estimator, I proceed in the steps outlined by Train (2009,
301–302). First, I develop a Mata function that produces a single draw from the condi-
tional distribution of b.
. mata:
mata (type end to exit)
: real matrix drawb_betaW(beta,W) {
> return(mean(beta)+rnormal(1,cols(beta),0,1)*cholesky(W)')
> }
: end
Next I use the instructions described in Train (2009, 299) to draw from the conditional
distribution of W. The Mata function is
. mata
mata (type end to exit)
: real matrix drawW_bbeta(beta,b)
> {
> v=rnormal(cols(b)+rows(beta),cols(b),0,1)
> S1=variance(beta)
> S=invsym((cols(b)*I(cols(b))+rows(beta)*S1)/(cols(b)+rows(beta)))
> L=cholesky(S)
> R=(L*v')*(L*v')'/(cols(b)+rows(beta))
> return(invsym(R))
> }
: end
I now have two of the three steps of the drawing scheme in place. The last task is more
nuanced and involves using structured amcmc problems in conjunction with the flexible
ways in which one can manipulate structures in Mata. The key is to think of drawing
each set of individual-level parameters β_n as a separate adaptive MCMC problem. It is
helpful to first get all the data into Mata, get familiar with its structure, and then work
from there.
. mata:
mata (type end to exit)
: st_view(y=.,.,"y")
: st_view(X=.,.,"price contract local wknown tod seasonal")
: st_view(pid=.,.,"pid")
: st_view(gid=.,.,"gid")
: end
The matrix (really, a column vector) y is a sequence of dummy variables marking the
choices of individual n in each choice occasion, while the matrix X collects explanatory
variables for each potential choice. pid and gid are identifiers for individuals and choice
occasions, respectively. I now write a Mata function that computes the log probability
for a particular vector of parameters for a given person, conditional on that person's
information.
. mata:
mata (type end to exit)
: real scalar lnbetan_bW(betaj,b,W,yj,Xj)
> {
> Uj=rowsum(Xj:*betaj)
> Uj=colshape(Uj,4)
> lnpj=rowsum(Uj:*colshape(yj,4)):-
> ln(rowsum(exp(Uj)))
> var=-1/2*(betaj:-b)*invsym(W)*(betaj:-b)'-
> 1/2*ln(det(W))-cols(betaj)/2*ln(2*pi())
> llj=var+sum(lnpj)
> return(llj)
> }
: end
The function takes in five arguments, the first of which is a parameter vector for the
person (that is, the values to be drawn). The second and third arguments characterize
the mean and covariance matrix of the parameters across the population.18 The fourth
and fifth arguments contain information about an individual's choices and explanatory
variables.
The first line of code multiplies parameters by explanatory variables to form utility
terms, which are then shaped into a matrix with four columns. Individuals have four
options available on each choice occasion. After reshaping, the utilities for the potential
choices on each occasion occupy a row, with the alternatives in separate columns. lnpj
then contains the log probabilities of the choices actually made, the log of utility less
the logged sum of exponentiated utilities. Finally, var computes the log distribution
of parameters about the conditional mean, and llj sums the two components. The
result is the log likelihood of individual n's parameter values, given choices, data, and
the parameters governing the distribution of individual-level parameters.
I now set up a structured problem for each individual in the dataset. I begin by
setting up a single adaptive MCMC problem and then replicate this problem using J( )
(see [M-5] J( )) to match the number of individual-level parameter sets, the same as
the number of individual-level identifiers in the data (pid), characterized via Mata's
panelsetup( ) (see [M-5] panelsetup( )) function.
18. This function is not as fast as it could be, and it is also specific to the dataset. One way to speed the
algorithm is to compute the Cholesky decomposition of W once before individual-level parameters
are drawn. The wrapper bayesmixedlogit exploits this and a few other improvements.
. mata
mata (type end to exit)
: m=panelsetup(pid,1)
: Ap=amcmc_init()
: amcmc_damper(Ap,1)
: amcmc_alginfo(Ap,("standalone","global"))
: amcmc_append(Ap,"overwrite")
: amcmc_lnf(Ap,&lnbetan_bW())
: amcmc_draws(Ap,1)
: amcmc_append(Ap,"overwrite")
: amcmc_reeval(Ap,"reeval")
: A=J(rows(m),1,Ap)
: end
I also apply the amcmc option "overwrite", which means that the results from only
the last round of drawing will be saved. Specifying the "reeval" option means that
each individual's likelihood will be reevaluated at the new parameter values and the old
values of coefficients before drawing.
I now duplicate the problem by forming a matrix of adaptive MCMC problems,
one for each individual, and then use a loop to fill in individual-level choices and
explanatory variables as arguments. In the end, the matrix A is a collection of 100
separate adaptive MCMC problems. Before this, some initial values for b and W are set,
and some initial values for individual-level parameters are drawn. I set up the pointer
matrix Args to hold this information along with the individual-level information.
. mata
mata (type end to exit)
: Args=J(rows(m),4,NULL)
: b=J(1,6,0)
: W=I(6)*6
: beta=b:+sqrt(diagonal(W))':*rnormal(rows(m),cols(b),0,1)
: for (i=1;i<=rows(m);i++) {
> Args[i,1]=&b
> Args[i,2]=&W
> Args[i,3]=&panelsubmatrix(y,i,m)
> Args[i,4]=&panelsubmatrix(X,i,m)
> amcmc_args(A[i],Args[i,])
> amcmc_xinit(A[i],b)
> amcmc_Vinit(A[i],W)
> }
: end
After creating some placeholders for the draws (bvals and Wvals), we can execute the
drawing algorithm as follows:
. mata
mata (type end to exit)
: its=20000
: burn=10000
: bvals=J(0,cols(beta),.)
: Wvals=J(0,cols(rowshape(W,1)),.)
: for (i=1;i<=its;i++) {
> b=drawb_betaW(beta,W/rows(m))
> W=drawW_bbeta(beta,b)
> bvals=bvals\b
> Wvals=Wvals\rowshape(W,1)
> beta_old=beta
> for (j=1;j<=rows(A);j++) {
> amcmc_draw(A[j])
> beta[j,]=amcmc_results_lastdraw(A[j])
> }
> }
: end
The algorithm consists of an outer loop and an inner loop, within which individual-level
parameters are drawn sequentially. The current value of the beta vector, which holds
individual-level parameters in rows, is overwritten with the last draw produced by using
the amcmc_results_lastdraw() function.
A subtlety of the code also indicates a reason why it is useful to pass additional
function arguments as pointers: each time a new value of b and W is drawn, a user
does not need to reiterate to each sampling problem that b and W have changed, be-
cause pointers point to positions that hold objects and not to the values of the objects
themselves. Thus, every time a new value of b or W is drawn, the arguments of all 100
problems are automatically changed. By specifying that the target distribution for each
level problem is to be reevaluated, the user tells the routine to recalculate lnbetan bW
at the last drawn value when comparing a new draw to the previous one.
Because the technique might be of greater interest, I have developed a command,
bayesmixedlogit, that implements the algorithm. For example, the algorithm described by
the previous code could be executed with the following command, which also summarizes
results in a way conformable with usual Stata output:
                    Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
Random
price -1.168711 .1245738 -9.38 0.000 -1.4129 -.9245209
contract -.3433208 .0682585 -5.03 0.000 -.4771212 -.2095204
local 2.637242 .3436764 7.67 0.000 1.963567 3.310917
wknown 2.138963 .2596608 8.24 0.000 1.629976 2.647951
tod -11.16374 1.049769 -10.63 0.000 -13.2215 -9.105982
seasonal -11.19243 1.030291 -10.86 0.000 -13.212 -9.172849
Cov_Random
var_price .8499292 .2332495 3.64 0.000 .3927132 1.307145
cov_priceco~t .1128769 .0803203 1.41 0.160 -.044567 .2703208
cov_pricelo~l 1.583028 .4519537 3.50 0.000 .6971079 2.468948
cov_pricewk~n .8898662 .3096053 2.87 0.004 .2829775 1.496755
cov_pricetod 6.106009 1.909356 3.20 0.001 2.363286 9.848731
cov_pricese~l 6.044055 1.892895 3.19 0.001 2.333601 9.75451
var_contract .3450904 .0670202 5.15 0.000 .2137174 .4764634
cov_contrac~l .4714882 .2131141 2.21 0.027 .0537416 .8892347
cov_contrac~n .3624791 .1560516 2.32 0.020 .0565865 .6683717
cov_contrac~d .7592097 .6576296 1.15 0.248 -.5298765 2.048296
cov_contrac~l .9147682 .65939 1.39 0.165 -.3777688 2.207305
var_local 7.000292 1.883972 3.72 0.000 3.307328 10.69326
cov_localwk~n 4.022065 1.248119 3.22 0.001 1.575501 6.468629
cov_localtod 12.84674 3.787742 3.39 0.001 5.422006 20.27148
cov_localse~l 13.40598 3.727253 3.60 0.000 6.099812 20.71214
var_wknown 3.364285 1.012474 3.32 0.001 1.379632 5.348938
cov_wknowntod 6.513209 2.60766 2.50 0.013 1.401671 11.62475
cov_wknowns~l 7.109282 2.563623 2.77 0.006 2.084064 12.1345
var_tod 57.62449 16.97876 3.39 0.001 24.3427 90.90628
cov_todseas~l 53.93841 16.35184 3.30 0.001 21.88551 85.99131
var_seasonal 55.05572 16.54599 3.33 0.001 22.62226 87.48918
The results are similar but not identical to those obtained using mixlogit. Additional
information and examples for bayesmixedlogit can be found in the help file, and some
examples of estimating a mixed logit model using Bayesian methods are provided in
the help file for amcmc(), accessible via the commands help mf amcmc or help mata
amcmc().
5 Description
In this section, I sketch a Mata implementation of what I have been referring to as
a global adaptive MCMC algorithm. The sketched routine omits a few details, mainly
about parsing options, but it is relatively true to form in describing how the algorithms
discussed in the article are actually implemented in Mata and might be used as a
template for developing more specialized algorithms. It assumes that the user wishes to
draw from a stand-alone function without additional arguments. The code is as follows:
. mata:
mata (type end to exit)
: real matrix amcmc_global(f,xinit,Vinit,draws,burn,damper,
> aopt,arate,val,lam)
> {
> real scalar nb,old,pro,i,alpha
> real rowvector xold,xpro,mu
> real matrix Accept,accept,xs,V,Vsq,Vold
>
> nb=cols(xinit) /* Initialization */
> xold=xinit
> lam=2.38^2/nb
> old=(*f)(xold)
> val=old
>
> Accept=0
> xs=xold
> mu=xold
> V=Vinit
> Vold=I(cols(xold))
>
> for (i=1;i<=draws;i++) {
> accept=0
> Vsq=cholesky(V) /* Prep V for drawing */
> if (hasmissing(Vsq)) {
> Vsq=cholesky(Vold)
> V=Vold
> }
>
> xpro=xold+lam*rnormal(1,nb,0,1)*Vsq /* Draw, value calc. */
>
>
> pro=(*f)(xpro)
>
> if (pro==. ) alpha=0 /* calc. of accept. prob */
>
> else if (pro>old) alpha=1
> else alpha=exp(pro-old)
>
> if (runiform(1,1)<alpha) {
> old=pro
> xold=xpro
> accept=1
> }
>
> lam=lam*exp(1/(i+1)^damper*(alpha-aopt)) /*update*/
> xs=xs\xold
> val=val\old
> Accept=Accept\accept
> mu=mu+1/(i+1)^damper*(xold-mu)
> Vold=V
> V=V+1/(i+1)^damper*((xold-mu)'(xold-mu)-V)
> _makesymmetric(V)
> }
>
> val =val[burn+1::draws,]
> arate=mean(Accept[burn+1::draws,])
> return(xs[burn+1::draws,])
> }
: end
The function starts by setting up a variable (nb) to hold the dimension of the distribu-
tion, and xold, which functions as xt in the algorithms discussed in table 3, is set to
the user-supplied initial value. The initial value of λ (called lam) is set as discussed by
Andrieu and Thoms (2008, 359).
Next the log value of the distribution (f) at xold is calculated and called old. The
next few steps proceed as one would expect. However, I find it useful to have a default
covariance matrix waiting (Vold in the code) in case the Cholesky decomposition en-
counters problems. For example, this could happen if the initial variance–covariance
matrix is not positive definite or if there is insufficient variation in the draws, which
sometimes happens in the early stages of a run. Once a usable covariance matrix has
been obtained, xpro (which functions as Yt in the algorithms in tables 1, 2, and 3) is
formed using a conformable vector of standard normal random variates, and the function
is evaluated at xpro.
The acceptance probability alpha is then calculated in a numerically stable way in an
if-else if-else block. If the target function returns a missing value when evaluated,
alpha is set to 0 so that the draw will not be retained. If the proposal produces a higher
value of the target function, alpha is set to one. Otherwise, it is set as described by
the algorithms.19 Finally, a uniform random variable is drawn that determines whether
the draw is to be accepted. Once this is known, all values are updated according to
the scheme described in table 3. Once the for loop concludes, the algorithm overwrites
the acceptance rate, arate, and the function value, val, and returns the results of the
draw.
6 Conclusions
I have given a brief overview of adaptive MCMC methods and how they can be imple-
mented using the Mata routine amcmc() and a suite of functions amcmc_*(). While I
have given some ideas about how one might use and display obtained results, my primary
purpose is to present and describe an implementation of adaptive MCMC algorithms.
19. The Mata function exp() does not evaluate to missing for very small values as it does for very large
values.
I have not discussed how one should set up the parameters of the draw, such as
the number of draws to take, whether to use a global sampler, or how aggressively to
tune the proposal distribution. I have also not discussed what users should do once
they have obtained draws from an adaptive MCMC algorithm. The functions leave these
decisions in the hands of users. Creating, describing, and analyzing results obtained via
MCMC is fortunately the subject of extensive literature. Broadly speaking, literature
on MCMC is built around the related issues of assessing convergence of a run and of
assessing the mixing and intensity of a run. A further issue is how one should deal
with autocorrelation between draws. Whatever means are used to analyze results, it
is fortunate that Stata provides a ready-made battery of tools to summarize, modify,
and graph results. However, while it is often easy to spot problems in an MCMC run, it
is impossible to know whether the run has actually provided draws from the intended
distribution.
On the subject of convergence, there is not any universally accepted criterion, but
researchers propose many guidelines. Gelman and Rubin (1992) present several useful
ideas. A general discussion appears in Geyer (2011), and some practical advice appears
in Gelman and Shirley (2011), who advocate discarding the first half of a run as a burn-
in period and performing multiple runs in parallel from different starting points and
comparing results. To be sure that one is actually sampling from the right region of the
density, one can use heated distributions in preliminary runs. Effectively, these heated
distributions raise the likelihood function to some fractional power,20 which flattens the
distribution and allows for more rapid and broader exploration of the parameter space.
One can also compare the results of multiple runs and compare the variance within
runs and between runs. A useful technique is to investigate the autocorrelation function
of results and then thin the results, retaining only a fraction of the draws so that most
of the autocorrelation is removed from the retained draws. One can use time-series tools to test for
autocorrelation among draws. A possibility discussed by Gelman and Shirley (2011) is
to jumble the results of the simulation. While it might seem obvious, it is worthwhile
to note that solutions to these problems are interdependent. A draw that exhibits a
lot of autocorrelation may require more thinning and a longer run to obtain a suitable
number of draws. A good place to start with these and other aspects of analyzing results
is Brooks et al. (2011).
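For example, after moving the draws from section 4.1 into Stata with getmata, the built-in time-series tools give a quick check (the variable names follow the earlier example, and the lag length is arbitrary):

    preserve
    clear
    getmata (price weight displacement constant std_dev) = b_mwg
    generate t = _n
    tsset t
    corrgram price, lags(50)     // autocorrelations of the draws for price
    ac price, lags(50)           // the corresponding autocorrelation plot
    restore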
As may have been clear from the examples presented in section 4, another option
is to run the algorithm for some suitable amount of time and then restart the run
without adaptation by using previous results as starting values so that one is drawing
from an invariant proposal distribution. A simple yet useful starting point in judging
convergence is seeing whether the algorithm produces results with graphs that look like
those in figure 2 but not those in figure 4. A graph that does not contain jumps or
flat spots and looks more or less like white noise is a preliminary indication that the
algorithm is working well. However, pseudo-convergence can still be very difficult to
detect. In addition to containing much practical advice, Geyer (2011) also advises that
one should at least do an overnight run, adding only half in jest that one should start
a run when the article is submitted and keep running until the referees' reports arrive.
"This cannot delay the article, and may detect pseudo-convergence" (Geyer 2011, 18).
7 References
Andrieu, C., and J. Thoms. 2008. A tutorial on adaptive MCMC. Statistics and Com-
puting 18: 343–373.
Brooks, S., A. Gelman, G. L. Jones, and X.-L. Meng, eds. 2011. Handbook of Markov
Chain Monte Carlo. Boca Raton, FL: Chapman & Hall/CRC.
Gelman, A., and D. B. Rubin. 1992. Inference from iterative simulation using multiple
sequences. Statistical Science 7: 457–472.
Gelman, A., and K. Shirley. 2011. Inference from simulations and monitoring conver-
gence. In Handbook of Markov Chain Monte Carlo, ed. S. Brooks, A. Gelman, G. L.
Jones, and X.-L. Meng, 163–174. Boca Raton, FL: Chapman & Hall/CRC.
Geyer, C. J. 2011. Introduction to Markov Chain Monte Carlo. In Handbook of Markov
Chain Monte Carlo, ed. S. Brooks, A. Gelman, G. L. Jones, and X.-L. Meng, 3–48.
Boca Raton, FL: Chapman & Hall/CRC.
Hole, A. R. 2007. Fitting mixed logit models by using maximum simulated likelihood.
Stata Journal 7: 388–401.
Powell, J. L. 1984. Least absolute deviations estimation for the censored regression
model. Journal of Econometrics 25: 303–325.
Train, K. E. 2009. Discrete Choice Methods with Simulation. 2nd ed. Cambridge:
Cambridge University Press.
Abstract. This command meets the need of a researcher who holds multiple data
files in comma-separated value format differing by a period variable (for example,
year or quarter) or by a cross-sectional variable (for example, country or firm) and
must combine them into one Stata-format file.
Keywords: dm0076, csvconvert, comma-separated value file, .csv
1 Introduction
In applied research, it is common to come across several data files containing the same
set of variables that need to be combined into one file. For instance, in a cross-country
survey, a researcher may collect information country by country and thus create several
data files, one for each country. Or within the same cross-section (or even within the
same country), the researcher may sample each year independently and generate various
data files that differ by year.
A practical issue in this type of situation is determining how to read all of those
files together in Stata, especially if they are numerous. The standard approach would
be to import each data file sequentially into Stata by using a combination of import
delimited and append. This approach, however, requires a user to type several com-
mand lines proportional to the number of files to be included; thus it is reasonably
doable if the number of data files is limited.
Suppose the directory C:\data\world bank contains three comma-separated value
(.csv) files: wb2007.csv, wb2008.csv, and wb2009.csv.1 After setting the appropri-
ate working directory, a user implements the aforementioned procedure by typing the
following command lines:
1. csvconvert is designed to handle many .csv files; however, for simplicity, all the examples below
consider a limited set of .csv files.
Alternatively, and more compactly, the same result can be obtained with a loop.
. foreach file in wb2007 wb2008 wb2009 {
2. import delimited using `file'.csv, clear
3. save `file'
4. }
. foreach file in wb2007.dta wb2008.dta {
2. append using `file'
3. }
Another way is to work with the disk operating system (DOS) to gather all the .csv files
into one .csv file and then to read the assembled single .csv file into memory using
import delimited.
Under the DOS framework, the lines below assemble wb2007.csv, wb2008.csv, and
wb2009.csv into a newly created .csv file named input.csv.
cd "C:\data\world bank"
copy wb2007.csv wb2008.csv wb2009.csv input.csv
To assemble all .csv files stored in the directory C:\data\world bank into a new file
named input.csv, type
cd "C:\data\world bank"
copy *.csv input.csv
A similar approach that bypasses the DOS framework can be implemented. However,
if the number of .csv files is large, the process may not be as straightforward. For
simplicity, let us still consider just three .csv files. Once the appropriate working
directory is set, the command lines to type are as follows:
. copy wb2008.csv wb2007.csv, append
. copy wb2009.csv wb2007.csv, append
. import delimited using wb2007.csv
The first two command lines append wb2008.csv and wb2009.csv to wb2007.csv.
The third command reads the .csv file into Stata.
Note, however, that if the first line of both wb2008.csv and wb2009.csv contains
the variable names, these are also appended.2 Thus, because of the presence of extra
lines with names, all the variables are read as strings. To correct this inaccuracy, one
should first remove the lines with the variable names and then use destring to set the
numerical format.
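As a sketch of that cleanup after reading the assembled file (the variable names countryname and populationtotal are illustrative assumptions about the contents of the World Bank files):

    drop if countryname == "countryname"    // remove the appended lines that contain variable names
    destring populationtotal, replace       // set the numerical format back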
2. Unfortunately, the option varnames(nonames), applicable with import delimited, is unavailable
with copy.
Alternatively, we could prevent this fault by manually preparing the .csv files (that
is, by removing the lines with the variable names in the .csv files to be appended). The
whole process can be time consuming, especially if the number of .csv files is large. The
csvconvert command simplifies and automates the procedure of gathering multiple
.csv files into one .dta file, as illustrated in the next section.
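Pieced together from the options documented below, the syntax has this general form (a sketch; consult the help file for the authoritative statement):

    csvconvert input_directory, replace [input_file(filenames) output_dir(output_directory) output_file(filename)]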
where input_directory is the path of the directory in which the .csv files are stored. Do
not use any quotes at the endpoints of the directory path, even if the directory name
contains spaces (see example 1 below).
2.2 Options
replace specifies that the existing output file (if it already exists) be overwritten.
replace is required.
input_file(filenames) specifies a subset of the .csv files to be converted. The filenames
must be separated by a space and include the .csv extension (see example 2 below).
If this option is not specified, csvconvert considers all the .csv files stored in the
input directory.
output_dir(output_directory) specifies the directory in which the .dta output file is
saved. If this option is not specified, the file is saved in the same directory where
the .csv files are stored.
output_file(filename) specifies the name of the .dta output file. The default is
output_file(output.dta).
3 Examples
3.1 Example 1: Basic
The simplest way to run csvconvert is to type the command and the directory path
where the .csv files are stored followed by the mandatory option replace. In the same
directory, Stata will create output.dta, which collects all the .csv files of that directory
in Stata format.
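For instance, with the directory used throughout this article, the basic call would look like this (a sketch based on the syntax above):

    csvconvert C:\data\world bank, replace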
During conversion, csvconvert sequentially reports the name of the .csv file being
converted, the number of variables, and the number of observations. If something in the
process appears odd, extra messages are displayed to alert the researcher and demand
further inspection. For instance, suppose that one .csv file contains a symbol or a
letter in one cell of a numerical variable; if ignored, this inaccuracy may undermine the
whole process. For this reason, csvconvert adds a note to help the researcher detect
the fault. In example 6, wb2008_symbol.csv contains N/A in one cell of the variable
populationtotal.
. note
_dta:
1. File included on 18 Jan 2014 10:11 : "wb2007.csv"
2. File included on 18 Jan 2014 10:11 : "wb2008.csv"
3. File included on 18 Jan 2014 10:11 : "wb2009.csv"
By reading the log, you can see that in the conversion of wb2008_symbol.csv, the
variable populationtotal changed its format from numerical to string. Therefore,
wb2008_symbol.csv is the file that needs to be inspected. Once the anomalous obser-
vation is detected and manually corrected (for example, by emptying the anomalous
cell via Excel and saving the corrected file as wb2008_symbol2.csv), you can relaunch
csvconvert and check that it now runs smoothly.
The warning message shows that there are three duplicate observations. Of course,
you can look carefully at the Results window and find that wb2008.csv was entered
twice. However, if you are handling a large set of .csv files, checking each line of the
screen would be very time consuming.
Tabulating the variable _csvfile conditional on _duplicates being equal to one
quickly detects that the duplicate observations come from wb2008.csv.
. tabulate _csvfile if _duplicates==1

    csv file from which
  observation originates        Freq.     Percent        Cum.

                   Total            6      100.00
5 Acknowledgments
I am grateful to Editor Joseph Newton for his assistance during revision and to Violeta
Carrion, Emanuele Forlani, Edna Solomon, and one anonymous referee for very helpful
comments.
Rusty Tchernis
Georgia State University
Atlanta, GA
Institute for the Study of Labor
Bonn, Germany
National Bureau of Economic Research
Cambridge, MA
rtchernis@gsu.edu
1 Introduction
The causal effect of binary treatment on outcomes is a central component of empirical
research in economics and many other disciplines. When individuals self-select into
treatment and when prospective randomization of the treatment and control groups is
not feasible, researchers must adopt alternative empirical methods intended to control
for the inherent self-selection. If individuals self-select on the basis of observed variables
(selection on observed variables), a variety of appropriate methodologies are available
to estimate the causal effects of the treatment. If instead individuals self-select on the
basis of unobserved variables (selection on unobserved variables), estimating treatment
effects is more difficult.
When one is confronted with selection on unobserved variables, the most common
empirical approach is to rely on an instrumental variable (IV); however, if credible
instruments are unavailable, a few approaches now exist that attempt to estimate the effects of
the treatment without an exclusion restriction. This article introduces a new Stata com-
mand, bmte, that implements two recent estimators proposed in Millimet and Tchernis
(2013) and designed to estimate treatment effects when selection on unobserved vari-
ables exists and appropriate exclusion restrictions are unavailable:
i. The minimum-biased (MB) estimator: This estimator searches for the observations
with minimized bias in the treatment-effects estimate of interest. This is accomplished
by trimming the estimation sample to include only observations with a
propensity score within a certain interval as specified by the user. When the
conditional independence assumption (CIA) holds (that is, independence between
treatment assignment and potential outcomes, conditional on observed variables),
the MB estimator is unbiased. Otherwise, the MB estimator tends to minimize
the bias among estimators that rely on the CIA. Furthermore, the MB estima-
tor changes the parameter being estimated because of the restricted estimation
sample.
ii. The bias-corrected (BC) estimator: This estimator relies on the two-step estimator
of Heckman's bivariate normal (BVN) selection model to estimate the bias among
estimators that inappropriately apply the CIA (Heckman 1976, 1979). However,
unlike the BVN estimator, the BC estimator does not require specification of the
functional form for the outcome of interest in the final step. Moreover, unlike the
MB estimator, the BC estimator does not change the parameter being estimated.
By implementing these estimators alongside preexisting estimators, the bmte command provides a picture of the
average causal effects of the treatment across a variety of assumptions and when valid
exclusion restrictions are unavailable.
These parameters may also vary with a vector of covariates, X, in which case the
parameters have an analogous representation conditional on a particular value of X.1
For nonrandom treatment assignment, selection into treatment may follow one of two
general paths: 1) selection on observed variables, also referred to as unconfoundedness
or the CIA (Rubin 1974; Heckman and Robb 1985); and 2) selection on unobserved
variables. Under the CIA, selection into treatment is random conditional on covariates,
X, and the average effect of the treatment can be obtained by comparing outcomes
of individuals in the two treatment states with identical values of the covariates. This
approach often uses propensity-score methods to reduce the dimensionality problem
arising when X is a high-dimensional vector (Rosenbaum and Rubin 1983), with the
propensity score denoted by P (Xi ) = Pr(Ti = 1|Xi ).
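For illustration only (the variable names treat, x1, and x2 are hypothetical and not taken from the article), a propensity score of this form can be obtained in Stata by fitting a probit model and predicting the probability of treatment:

. probit treat x1 x2
. predict double pscore, pr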
If the CIA fails to hold, then the estimated treatment effects relying on the CIA are
biased. Following Heckman and Navarro-Lozano (2004) and Black and Smith (2004),
we denote the potential outcomes as Y(0) = g₀(X) + ε₀ and Y(1) = g₁(X) + ε₁, where
g₀(X) and g₁(X) are the deterministic portions of the outcome variable in the control
and treatment groups, respectively, and where (ε₀, ε₁) are the corresponding error terms.
We also denote the latent treatment variable by T* = h(X) − u, where h(X) represents
the deterministic portion of T*, and u denotes the error term. The observed treatment,
T, is therefore equal to 1 if T* > 0 and 0 otherwise. Finally, we denote by ε the
difference in the residuals of the potential outcomes, ε = ε₀ − ε₁.
1. More formally, the coefficient measures the treatment effect, adjusting for a simultaneous linear
change in the covariates, X, rather than being conditional on a specific value of X. We thank an
anonymous referee for highlighting this point.
Assuming ε and u are jointly normally distributed, the bias can be derived as
\[
B_{ATE}\{P(X)\} = \Bigl[\rho_{0u}\sigma_{0} + \{1 - P(X)\}\rho_{\epsilon u}\sigma_{\epsilon}\Bigr]\,
\frac{\phi\{h(X)\}}{\Phi\{h(X)\}\bigl[1 - \Phi\{h(X)\}\bigr]} \tag{1}
\]
where P̂(X_i) is an estimate of the propensity score obtained using a probit model.
Under the CIA, the IPW estimator in (2) provides an unbiased estimate of the ATE.
When this assumption fails, the bias for the ATE follows the closed functional form in
(1), with similar expressions for the ATT and ATU. The MB estimator aims to minimize
the bias by estimating (2) using only observations with a propensity score close to
the bias-minimizing propensity score, denoted by P*. Using P* effectively limits the
observations included in the estimation of the IPW treatment effects to minimize the
inherent bias when the CIA fails. We denote by Ω the set of observations ultimately
included in the estimation. In general, however, P* and Ω are unknown. Therefore, the
MB estimator estimates P* and Ω to minimize the bias in (1) by using Heckman's BVN
selection model, the details of which are provided in Millimet and Tchernis (2013).
The MB estimator of the ATE is formally given by
\[
\hat{\tau}_{MB,ATE}(P^*) =
\frac{\sum_{i\in\Omega} \dfrac{Y_i T_i}{\hat{P}(X_i)}}{\sum_{i\in\Omega} \dfrac{T_i}{\hat{P}(X_i)}}
\;-\;
\frac{\sum_{i\in\Omega} \dfrac{Y_i (1-T_i)}{1-\hat{P}(X_i)}}{\sum_{i\in\Omega} \dfrac{1-T_i}{1-\hat{P}(X_i)}} \tag{3}
\]
where Ω = {i | P̂(X_i) ∈ C(P*)}, and C(P*) denotes a neighborhood around P*. Following
Millimet and Tchernis (2013), the MB estimator defines C(P*) as C(P*) =
{P(X_i) | P̂(X_i) ∈ (P̲, P̄)}, where P̲ = max(0.02, P* − α), P̄ = min(0.98, P* + α),
and α > 0 is the smallest value such that at least θ percent of both the treatment and
control groups are contained in Ω. Specific values of θ are specified within the bmte
command, with smaller values reducing the bias at the expense of higher variance. The
MB estimator trims observations with propensity scores above and below specific values,
regardless of the value of θ. These threshold values can be specified within the bmte
command options. Obtaining Ω does not require the use of Heckman's BVN selection
model when the focus is on the ATT or ATU, because P* is known to be one-half in these
cases (Black and Smith 2004).
If the user is sensitive to potential deviations from the normality assumptions underlying
Heckman's BVN model, the MB estimator and other estimators can be extended
appropriately (Millimet and Tchernis 2013). Such adjustments are included as part
of the bmte command, denoted by the Edgeworth-expansion versions of the relevant
estimators.
Subtracting the estimated bias from the MB estimate yields the MB-BC estimator,
\[
\hat{\tau}_{MB\text{-}BC,ATE}(P^*) = \hat{\tau}_{MB,ATE}(P^*) - \hat{B}_{ATE}(P^*) \tag{4}
\]
where the corresponding estimators for the ATT and ATU follow. With heterogeneous
treatment effects, the MB-BC estimator changes the parameter being estimated. To
identify the correct parameter of interest, the bmte command first estimates the MB-BC
estimator in (4) conditional on the propensity score, P(X), and then estimates the
(unconditional) ATE by taking the expectation of this over the distribution of X in the
population (or subpopulation of the treated). The resulting BC estimator is given by
\[
\hat{\tau}_{BC,ATE} = \hat{\tau}_{IPW,ATE} - \frac{1}{N}\sum_{i} \hat{B}_{ATE}\{\hat{P}(X_i)\} \tag{5}
\]
where again the corresponding estimators for the ATT and ATU follow.
where φ(·)/Φ(·) is the inverse Mills ratio, and the remaining error term is independent
and identically distributed with constant variance and zero conditional mean. With this
approach, the estimated ATE is given by
\[
\hat{\tau}_{BVN,ATE} = \overline{X}\,(\hat{\beta}_1 - \hat{\beta}_0) \tag{7}
\]
2.4 CF approach
Heckman's BVN selection model is a special case of the CF approach. The idea is to devise
a function that, once included in the outcome equation, removes the correlation between
treatment assignment and the error term, as outlined nicely in Heckman, LaLonde,
and Smith (1999) and Navarro (2008). Specifically, consider the outcome equation
where S is the order of the polynomial. The following equation is then estimable via
OLS:
2. Depending on one's dataset and specific application, it may not be meaningful to evaluate all
covariates at their means. Therefore, when interpreting the treatment-effects estimates, the user
should check that the data support the use of X̄. We are grateful to an anonymous referee for
clarifying this important point.
As is clear from (8), the intercepts of the potential-outcome equations and the constant
terms of the control functions are not separately identified; however, because the
selection problem disappears in the tails of the propensity score, it follows that the CF
becomes zero and that the intercepts from the potential-outcome equations are identified
using observations in the extreme ends of the support of P(X). After one estimates the
intercept terms, the ATE and ATT are given by
\[
\hat{\tau}_{CF,ATE} = (\hat{\alpha}_1 - \hat{\alpha}_0) + \overline{X}(\hat{\beta}_1 - \hat{\beta}_0) \tag{9}
\]
\[
\hat{\tau}_{CF,ATT} = (\hat{\alpha}_1 - \hat{\alpha}_0) + \overline{X}_1(\hat{\beta}_1 - \hat{\beta}_0)
+ \widehat{E}(\hat{\epsilon}_1 - \hat{\epsilon}_0 \mid T_i = 1) \tag{10}
\]
where X̄ and X̄₁ are the sample means of X overall and among the treated, the conditional
means Ê(ε̂₀ | T_i = 1) and Ê(ε̂₁ | T_i = 1) are constructed from the estimated coefficients of
the order-S control-function polynomials evaluated at the mean propensity scores, P̄(X) is the
overall mean propensity score, and P̄(X)_t, t = 0, 1, is the mean propensity score in group t.
Assuming S(X) = exp(Xδ), the parameters of (11) are estimable by maximum likelihood,
with the log-likelihood function given by³
\[
\ln L = \sum_i \left[\, T_i \ln \Phi\!\left\{\frac{X_i\beta}{\exp(X_i\delta)}\right\}
+ (1 - T_i) \ln\!\left[1 - \Phi\!\left\{\frac{X_i\beta}{\exp(X_i\delta)}\right\}\right] \right] \tag{12}
\]
3. Our functional form assumption, S(X) = exp(Xδ), is a simplification made to compare the KV
estimator and the other estimators available with the bmte command. For more details on the KV
estimator and alternative functional forms for S(X), see Klein and Vella (2009).
where the element of δ corresponding to the intercept is normalized to zero for identification.
The maximum likelihood estimates are then used to obtain the predicted
probability of treatment, P̂(X), which may be used as an instrument for T in (6),
excluding the selection correction terms.
3.2 Specification
The bmte command requires the user to specify an outcome variable, depvar, at least
one independent variable, and a treatment assignment variable, group(). Additional
independent variables are optional. The command also uses the Stata commands hetprob
and ivreg2 (Baum, Schaffer, and Stillman 2003, 2004, 2005). The remaining options
of the bmte command are detailed below.
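A hedged sketch of the basic call (the variable names y, x1, x2, and treat are placeholders, not taken from the article):

. bmte y x1 x2, group(treat)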
3.3 Options
group(varname) specifies the treatment assignment variable. group() is required.
ee indicates that the Edgeworth-expansion versions of the MB, BVN, and BC estimators
be included in addition to the original versions of each respective estimator. The
Edgeworth expansion is robust to deviations from normality in Heckman's BVN
selection model.
hetero allows for heterogeneous treatment effects, with ATE, ATT, and ATU estimates
presented at the mean level of each independent variable.
theta(#) denotes the minimum percentage such that both the treatment and control
groups have propensity scores in the interval (P̲, P̄) from (3). Multiple values of
theta() are allowed (for example, theta(5 25), for 5% and 25%). Each value will
form a different estimated treatment effect using the MB and MB-BC estimators.
psvars(indepvars) denotes the list of regressors used in the estimation of the propensity
score. If unspecified, the list of regressors is assumed to be the same as the original
covariate list.
kv(indepvars) denotes the list of independent variables used to model the variance in
the hetprob command. Like the psvars() option, the list of kv() regressors is
assumed to be the same as the original covariate list if not explicitly specified.
cf(#) specifies the order of the polynomial used in the CF estimator. The default is
cf(3).
pmin(#) and pmax(#) specify the minimum and maximum propensity scores, respectively,
included in the MB estimator. Observations with propensity scores outside
this range will be automatically excluded from the MB estimates. The defaults are
pmin(0.02) and pmax(0.98).
psate(#)–psatuee(#) specify the fixed propensity-score values (specific to each treatment
effect of interest) to be used as the bias-minimizing propensity scores in lieu
of estimating the values within the program itself.
saving(filename) indicates where to save the output.
replace indicates that the output in saving() should replace any preexisting file in
the same location.
bs and reps(#) specify that 95% confidence intervals be calculated by bootstrap using
the percentile method and the number of replications in reps(#). The default is
reps(100).
fixp is an option for the bootstrap command that, when specified, estimates the bias-minimizing
propensity score, P*, and applies this estimate across all bootstrap
replications rather than reestimating it at each replication.
4 Example
Following Millimet and Tchernis (2013), we provide an application of the bmte com-
mand to the study of the U.S. school breakfast program (SBP). Specifically, we seek
causal estimates of the ATEs of SBP on child health. The data are from the Early
Childhood Longitudinal Study, Kindergarten Class of 1998–1999, and are available for
download from the Journal of Applied Econometrics Data Archive.⁴ We provide estimates
of the effect of SBP on growth rate in body mass index from first grade to the
spring of third grade.
4. http://qed.econ.queensu.ca/jae/datasets/millimet001/.
We first define global variable lists XVARS and HVARS and limit our analysis to third-grade
students only. XVARS are the covariates used in the OLS estimation as well as
in the calculation of the propensity score. HVARS are the covariates used in the KV
estimator (that is, the variables that enter into the heteroskedasticity portion of the
hetprob command).
We then estimate the effect of SBP participation in the first grade (break1) on body
mass index growth (clbmi) by using the bmte command. In our application, we specify θ
of 5% and 25%, and we estimate bootstrap confidence intervals using 250 replications.
We also specify the ee option, asking that the results include the Edgeworth-expansion
versions of the relevant estimators. The command and the resulting Stata output are as follows:
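A minimal sketch of the call, consistent with the description above (the contents of the global covariate lists are omitted here, and the use of kv() for HVARS is an assumption):

. global XVARS "..."
. global HVARS "..."
. bmte clbmi $XVARS, group(break1) kv($HVARS) theta(5 25) ee bs reps(250)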
(output omitted; among other results, the table reports MB, MB-EE, MB-BC, and MB-BC-EE estimates for each treatment effect of interest)
Here we focus on the general structure and theme of the output. For a thorough
discussion and interpretation of the results, see Millimet and Tchernis (2013). As indi-
cated by the section headings, the output presents results for the ATE, ATT, and ATU
using basic OLS and IPW treatment-effects estimates as well as each of the MB (3), MB-BC (4),
BC (5), BVN (7), CF [(9) and (10)], and KV [(11), (12), and (6)] estimators.
Below each estimate is the respective 95% confidence interval.
As discussed in Millimet and Tchernis (2013), separate MB and MB-BC estimates
are presented for each value of θ specified in the bmte command (in this case, 5% and
25%). The results for the CF estimator also include a joint test of significance of all
covariates in the OLS step of the CF estimator (8). Similarly, the KV results include
a test for weak instruments (the Cragg–Donald Wald F statistic and p-value) as well
as a likelihood-ratio test for heteroskedasticity based on the results of hetprob. Also
included in the bmte output is the estimated bias-minimizing propensity score.
5 Remarks
Despite advances in the program evaluation literature, treatment-effects estimators remain
severely limited when the CIA fails and when valid exclusion restrictions are unavailable.
Following the methodology presented in Millimet and Tchernis (2013), we
propose and describe a new Stata command (bmte) that provides a range of treatment-effects
estimates intended to estimate the average effects of the treatment when the CIA
fails and appropriate exclusion restrictions are unavailable.
Importantly, the bmte command provides results that are useful across a range of
alternative assumptions. For example, if the CIA holds, the IPW estimator provided
by the bmte command yields an unbiased estimate of the causal eects of treatment.
The MB estimator then oers a robustness check, given its comparable performance
when the model is correctly specied or overspecied and its improved performance if
the model is underspecied. If, however, the CIA does not hold, the bmte command
provides results that are appropriate under strong functional form assumptions, either
with homoskedastic (BVN or CF) or heteroskedastic (KV) errors, or under less restrictive
functional form assumptions (BC). As illustrated in our example application to the U.S.
SBP, the breadth of estimators implemented with the bmte command provides a broad
picture of the average causal effects of the treatment across a variety of assumptions.
6 References
Baum, C. F., M. E. Schaffer, and S. Stillman. 2003. Instrumental variables and GMM:
Estimation and testing. Stata Journal 3: 1–31.
Black, D. A., and J. Smith. 2004. How robust is the evidence on the effects of college
quality? Evidence from matching. Journal of Econometrics 121: 99–124.
Heckman, J., and R. Robb, Jr. 1985. Alternative methods for evaluating the impact of
interventions: An overview. Journal of Econometrics 30: 239–267.
Heckman, J. J., R. J. LaLonde, and J. A. Smith. 1999. The economics and econometrics
of active labor market programs. In Handbook of Labor Economics, ed. O. Ashenfelter
and D. Card, vol. 3A, 1865–2097. Amsterdam: Elsevier.
Hirano, K., and G. W. Imbens. 2001. Estimation of causal effects using propensity score
weighting: An application to data on right heart catheterization. Health Services and
Outcomes Research Methodology 2: 259–278.
Klein, R., and F. Vella. 2009. A semiparametric model for binary response and continuous
outcomes under index heteroscedasticity. Journal of Applied Econometrics 24:
735–762.
Navarro, S. 2008. Control function. In The New Palgrave Dictionary of Economics, ed.
S. N. Durlauf and L. E. Blume, 2nd ed. London: Palgrave Macmillan.
Rosenbaum, P. R., and D. B. Rubin. 1983. The central role of the propensity score in
observational studies for causal effects. Biometrika 70: 41–55.
the Department of Health Care Policy at Harvard Medical School. He received his PhD in
economics from Brown University.
Daniel Millimet is a professor of economics at Southern Methodist University and a research
fellow at the Institute for the Study of Labor. His primary areas of research are applied
microeconometrics, labor economics, and environmental economics. His research has been
funded by various organizations, including the United States Department of Agriculture. He
received his PhD in economics from Brown University.
The Stata Journal (2014)
14, Number 3, pp. 684–692
1 Introduction
In recent years, it has become increasingly popular to use panel time-series datasets for
econometric analysis. These panel datasets are reasonably large in both cross-sectional
(N ) and time (T ) dimensions, as compared with the more conventional panels with very
large N yet small T. Theoretical research into the asymptotics of panel time series has
revealed two crucial differences from the typical panel: the need for slope coefficients
to be heterogeneous (for example, see Phillips and Moon [2000] and Im, Pesaran, and
Shin [2003]) and the concern of nonstationarity. Both differences suggest that the usual
fixed-effects or random-effects estimators are not appropriate for this application.
The long time dimension in panel time series allows one to use regular time-series
analytical tools, such as unit root and cointegration testing, to determine the order of
integration and the long-run relationship between variables. Researchers have proposed
a variety of tests and estimators that (in varying ways) extend time-series tools for panels
while importantly allowing for heterogeneity in the cross-sectional units (as opposed to
simply pooling the data). Users have already implemented several of these tests and
estimators into Stata (for example, see Blackburne and Frank [2007] and Eberhardt
[2012]).
This article and the associated program, xtpedroni, introduce two tools that were
developed in Pedroni (1999, 2001, 2004) for use in Stata. The first tool is seven test
statistics for the null of no cointegration in nonstationary heterogeneous panels with
one or more regressors. The second tool is a between-dimension (that is, group-mean)
panel-dynamic ordinary least-squares (PDOLS) estimator. Both tools can include time
dummies (by time demeaning the data) to capture common time effects among members
of the panel. Nevertheless, they cannot account for more sophisticated forms of cross-sectional
dependence.
In this article, I will discuss the theoretical foundations of both tools. I will also
introduce the usage and capabilities of xtpedroni, and apply the program to replicate
the results in Pedroni (2001).
When time dummies are included, the data are time demeaned by subtracting the
cross-sectional average of each variable in each period,
\[
\bar{y}_t = \frac{1}{N}\sum_{i=1}^{N} y_{i,t}
\]
All the test statistics are residual-based tests, with residuals collected from the
following regressions:
several available options). A linear time trend δ_i t can be inserted into the regression at
the user's discretion.
Next, several series and parameters are calculated from the regressions above.
\[
\hat{s}_i^2 = \frac{1}{T}\sum_{t=1}^{T}\hat{\mu}_{i,t}^2, \qquad
\hat{\lambda}_i = \frac{1}{T}\sum_{s=1}^{k_i}\left(1-\frac{s}{k_i+1}\right)\sum_{t=s+1}^{T}\hat{\mu}_{i,t}\hat{\mu}_{i,t-s}, \qquad
\hat{\sigma}_i^2 = \hat{s}_i^2 + 2\hat{\lambda}_i
\]
\[
\hat{L}_{11i}^2 = \frac{1}{T}\sum_{t=1}^{T}\hat{\eta}_{i,t}^2
+ \frac{2}{T}\sum_{s=1}^{k_i}\left(1-\frac{s}{k_i+1}\right)\sum_{t=s+1}^{T}\hat{\eta}_{i,t}\hat{\eta}_{i,t-s}
\]
\[
\hat{s}_i^{*2} = \frac{1}{T}\sum_{t=1}^{T}\hat{\mu}_{i,t}^{*2}, \qquad
\tilde{s}_{N,T}^{*2} = \frac{1}{N}\sum_{i=1}^{N}\hat{s}_i^{*2}, \qquad
\tilde{\sigma}_{N,T}^2 = \frac{1}{N}\sum_{i=1}^{N}\hat{L}_{11i}^{-2}\,\hat{\sigma}_i^2
\]
where the μ̂ and η̂ series are residuals from the regressions above, and k_i is the kernel
truncation lag for cross-section i.
The seven statistics can then be constructed from the following equations. (See Pedroni
[1999] for a complete discussion on how these statistics are constructed.)
\[
\text{panel } v:\quad T^2 N^{3/2}\left(\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{L}_{11i}^{-2}\,\hat{e}_{i,t-1}^2\right)^{-1}
\]
\[
\text{panel } \rho:\quad T\sqrt{N}\left(\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{L}_{11i}^{-2}\,\hat{e}_{i,t-1}^2\right)^{-1}
\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{L}_{11i}^{-2}\left(\hat{e}_{i,t-1}\Delta\hat{e}_{i,t}-\hat{\lambda}_i\right)
\]
\[
\text{panel } t:\quad \left(\tilde{\sigma}_{N,T}^2\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{L}_{11i}^{-2}\,\hat{e}_{i,t-1}^2\right)^{-1/2}
\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{L}_{11i}^{-2}\left(\hat{e}_{i,t-1}\Delta\hat{e}_{i,t}-\hat{\lambda}_i\right)
\]
\[
\text{panel ADF}:\quad \left(\tilde{s}_{N,T}^{*2}\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{L}_{11i}^{-2}\,\hat{e}_{i,t-1}^{*2}\right)^{-1/2}
\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{L}_{11i}^{-2}\,\hat{e}_{i,t-1}^{*}\Delta\hat{e}_{i,t}^{*}
\]
\[
\text{group } \rho:\quad T N^{-1/2}\sum_{i=1}^{N}\left(\sum_{t=1}^{T}\hat{e}_{i,t-1}^2\right)^{-1}
\sum_{t=1}^{T}\left(\hat{e}_{i,t-1}\Delta\hat{e}_{i,t}-\hat{\lambda}_i\right)
\]
\[
\text{group } t:\quad N^{-1/2}\sum_{i=1}^{N}\left(\hat{\sigma}_i^2\sum_{t=1}^{T}\hat{e}_{i,t-1}^2\right)^{-1/2}
\sum_{t=1}^{T}\left(\hat{e}_{i,t-1}\Delta\hat{e}_{i,t}-\hat{\lambda}_i\right)
\]
\[
\text{group ADF}:\quad N^{-1/2}\sum_{i=1}^{N}\left(\sum_{t=1}^{T}\hat{s}_i^{*2}\,\hat{e}_{i,t-1}^{*2}\right)^{-1/2}
\sum_{t=1}^{T}\hat{e}_{i,t-1}^{*}\Delta\hat{e}_{i,t}^{*}
\]
The test statistics are then adjusted so that they are distributed as N (0, 1) under the
null. The adjustments performed on the statistics vary depending on the number of
regressors, whether time trends were included, and the type of test statistic.
Under the alternative of cointegration, the panel v statistic diverges to positive
infinity while the other test statistics diverge to negative infinity. Baltagi (2013, 296) provides
a formal interpretation of a rejection of the null: "Rejection of the null hypothesis means
that enough of the individual cross-sections have statistics far away from the means
predicted by theory were they to be generated under the null."
The relative power of each test statistic is not entirely clear, and there may be con-
tradictory results between the statistics. Pedroni (2004) reported that the group and
panel ADF statistics have the best power properties when T < 100, with the panel v
and group ρ statistics performing comparatively worse. Furthermore, the ADF statis-
tics perform better if the errors follow an autoregressive process (see Harris and Sollis
[2003]).
3 Pedroni's PDOLS
Consider the following model:
\[
y_{i,t} = \alpha_i + \beta_i x_{i,t} + \epsilon_{i,t}
\]
The DOLS regression augments this equation with leads and lags of Δx_{i,t}. The t statistic
for the null hypothesis β_i = β_0 in each cross-section, and its group-mean (between-dimension)
average, are
\[
t_{\hat{\beta}_i} = (\hat{\beta}_i - \beta_0)\left\{\hat{\sigma}_i^{-2}\sum_{t=1}^{T}(x_{i,t}-\bar{x}_i)^2\right\}^{1/2},
\qquad
t_{\hat{\beta}_{GM}} = \frac{1}{\sqrt{N}}\sum_{i=1}^{N} t_{\hat{\beta}_i}
\]
Here z_{i,t} is the 2(p + 1) × 1 vector of regressors (this includes the lags and leads of the
differenced explanatory variable), and σ_i² is the long-run variance of the residuals ε_{i,t}.
σ_i² is computed in the program through the Newey and West (1987) heteroskedasticity-
and autocorrelation-consistent method with a Bartlett kernel. By default, the maximum
lag for the Bartlett kernel is selected automatically for each cross-section in the
panel according to 4(T/100)^{2/9} (see Newey and West [1994]), but it can also be set
manually by the user.
In comparison, Kao and Chiang (1997) and Mark and Sul (2003) compute the panel
statistics along the within-dimension, with the t statistics designed to test H₀: β_i = β₀
against H_A: β_i = β_A ≠ β₀. Pedroni's PDOLS estimator is averaged along the
between-dimension (that is, the group mean). Accordingly, the panel test statistics test
H₀: β_i = β₀ against H_A: β_i ≠ β₀. In the alternative hypothesis, the slope coefficients are
not constrained to equal a common value β_A. Pedroni (2001) argues that this is an important
advantage for between-dimension panel time-series estimators, particularly when one
expects slope heterogeneity.
4.2 Options
Options that affect the cointegration test and the PDOLS estimation
notdum suppresses time demeaning of the variables (that is, the common time dummies).
Time demeaning is turned on by default. This option may be appropriate to use
when averaging over the N dimension may destroy the cointegrating relationship or
when there are comparability concerns between panel units in the data.
nopdols suppresses PDOLS estimation (that is, reports only the cointegration test re-
sults).
notest suppresses the cointegration tests (that is, reports only PDOLS estimation).
extraobs includes the available observations from the missing years in the time means
used for time demeaning if there is an unbalanced panel with observations missing for
some of the variables (at the start or end of the sample) for certain individuals. This
was the behavior of Pedroni's original PDOLS program but not of the cointegration
test program. It is off by default.
b(#) defines the null hypothesis beta as #. The default is b(0).
mlags(#) specifies the number of lags to be used in the Bartlett kernel for the Newey–West
long-run variance. If mlags() is not specified, then the number of lags is
determined automatically for each individual following Newey and West (1994).
adflags(#) specifies the maximum number of lags to be considered in the lag selection
process for the ADF regressions. If adflags() is not specified, then it is determined
automatically.
lags(#) specifies the number of lags and leads to be included in the DOLS regression.
The default is lags(2).
full reports the DOLS regression for each individual in the panel.
average(string) determines the methodology used to combine individual coefficient estimates
into the panel estimate. string can be simple (default), sqrt, or precision.
simple takes a simple average and is the behavior of the original Pedroni program.
sqrt weighs each estimate according to the square root of the precision matrix,
which is the same procedure used for averaging the t statistics. precision weighs
each individual's coefficient estimates by its precision.
. use pedronidata
. xtset country time
panel variable: country (strongly balanced)
time variable: time, 1973m6 to 1993m11
delta: 1 month
. xtpedroni logexrate logratio, notest lags(5) mlags(5) b(1) notdum
Pedroni's PDOLS (Group mean average):
No. of Panel units: 20 Lags and leads: 5
Number of obs: 4700 Avg obs. per unit: 235
Data has not been time-demeaned.
We computed the results without time dummies (by specifying the notdum option),
and then with time dummies. We specified the option notest to suppress the results
of the cointegration test, which are not yet relevant. The option b(1) instructed the
program to compute all t statistics against the null hypothesis that the slope coefficient
is equal to 1, which is appropriate for economic interpretation when testing the weak
long-run PPP hypothesis. In accordance with Pedroni's original use of the group-mean
PDOLS estimator to calculate these results, we set the number of lags and leads in the
DOLS regression to 5 by specifying lags(5), and we set the number of lags used in the
Bartlett kernel for the Newey–West long-run variance of the residuals to 5 by specifying
mlags(5).
We can now replicate the individual DOLS results for each country in the panel as
follows:
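The exact command line is not reproduced in this extract; a sketch consistent with the options described in the next paragraph (the inclusion of notest is an assumption) is:

. xtpedroni logexrate logratio, notest full lags(4) mlags(4) notdum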
The output was compressed into a formatted table for brevity. We specified several
options to obtain the exact results. The option full displays the results of estimation
for each individual panel unit. Emulating Pedroni's original use of the program for this
empirical application, we set the number of lags and leads in the DOLS regression to 4 by
specifying lags(4) and the number of lags used in the Bartlett kernel for the Newey–West
long-run variance of the residuals to 4 by specifying mlags(4). No common time
dummies were used for the individual country results (notdum option).
Pedroni (2004) applied the seven panel cointegration test statistics to the PPP hy-
pothesis. We repeat this procedure as follows:
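The call, reconstructed from the description below (variable names as in the earlier example), would be along the lines of:

. xtpedroni logexrate logratio, nopdols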
                       Panel        Group
        v              4.735            .
        rho           -2.027       -2.814
        t             -1.434       -2.185
        adf           -.9087       -1.737
The results will be inconsistent with those found in Pedroni (2004), because those results
relied on a larger sample period than did the Pedroni (2001) dataset we are currently
using. The only option we specified here is nopdols, which suppresses the PDOLS
estimation results.
Overall, the results indicate a cointegrating relationship between the log of the ex-
change rate and the log of the aggregate Consumer Price Index ratio. Statistical in-
ference is straightforward because all the test statistics are distributed N (0,1). All the
tests, except the panel t and ADF statistics, are significant at least at the 10% level.
Furthermore, the PDOLS results support the weak long-run PPP hypothesis. Most of
the coefficients are close to 1, but many are notably higher or lower. For a complete
discussion of the results, see Pedroni (2001).
6 Acknowledgments
This program is indebted to the work of many individuals, including Peter Pedroni,
Tom Doan, Tony Bryant, Roselyne Joyeux, and an anonymous reviewer.
7 References
Baltagi, B. H. 2013. Econometric Analysis of Panel Data. 5th ed. New York: Wiley.
Blackburne, E. F., III, and M. W. Frank. 2007. Estimation of nonstationary heterogeneous
panels. Stata Journal 7: 197–208.
Eberhardt, M. 2012. Estimating panel time-series models with heterogeneous slopes.
Stata Journal 12: 61–71.
Harris, R., and R. Sollis. 2003. Applied Time Series Modelling and Forecasting. New
York: Wiley.
Im, K. S., M. H. Pesaran, and Y. Shin. 2003. Testing for unit roots in heterogeneous
panels. Journal of Econometrics 115: 53–74.
Kao, C., and M.-H. Chiang. 1997. On the estimation and inference of a cointegrated
regression in panel data. Syracuse University Manuscript.
Mark, N. C., and D. Sul. 2003. Cointegration vector estimation by panel DOLS and
long-run money demand. Oxford Bulletin of Economics and Statistics 65: 665–680.
Pedroni, P. 1999. Critical values for cointegration tests in heterogeneous panels with
multiple regressors. Oxford Bulletin of Economics and Statistics 61: 653–670.
1 Introduction
Dropbox makes scholarly collaboration much easier because it allows scholars to share
files across different computers. At the same time, sharing do-files in Dropbox presents
its own complications. Because users may install Dropbox in different locations and
because users have different usernames, often on different computers, directory paths to
Dropbox folders may not work in do-files. This is especially likely when multiple Dropbox
users collaborate. Here I present some tips on how to overcome these difficulties.
2.1 Syncing
One issue with using Dropbox to share files is that Dropbox automatically syncs files
as they are saved. Stata do-files can get ahead of the Dropbox synchronization if, for
instance, a user saves files and then appends these files soon after in a loop. It may
also happen if a user saves a file and then uses it. This problem can be solved with
a sleep command at the end of the loop. Telling Stata to wait for five seconds or so
before continuing the loop will usually solve the problem.
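A minimal sketch of this pattern (the file name is hypothetical; sleep takes its argument in milliseconds, so 5000 pauses for five seconds while Dropbox finishes syncing):

. save "results.dta", replace
. sleep 5000
. use "results.dta", clear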
3 Solutions
There are several different ways to ensure that everyone can easily share and use Stata
do-files in Dropbox without errors. I discuss the advantages and drawbacks of the
different ways below.
3.1 Edit file
One solution, at least for Windows users, is to open do-files using the edit option. The
user does not have to specify a pathname, because Stata will automatically change the
directory to the one where the do-file is located. From there, relative paths can be used
to negotiate around the shared directory. The biggest drawback to this method is that
it is limited to Windows users. It also does not fit with how a lot of people use Stata,
because each time a user wants to open a do-file in a different directory, the user has to
open a new instance of Stata or change the directory within Stata.

1. I use /users/username to refer to a user's home directory because most users use Windows or
Macs. Unix users should read it as ~.
3.2 Capture
Other users may prefer to use the capture command to change the directory. Here each
user puts a change directory (cd) command to his or her Dropbox folder preceded by
the capture command, which prevents Stata from returning an error and aborting the
do-file if the specified directory does not exist. As the number of users increases, or if
users have different usernames for their home and office computers, keeping track of all
the different directories becomes difficult.
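A minimal sketch of this approach at the top of a shared do-file (the user names and paths are hypothetical; on each machine, only one of the cd attempts succeeds and the failed ones are silently ignored):

. capture cd "C:/Users/alice/Dropbox/project"
. capture cd "/Users/bob/Dropbox/project"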
3.3 c(username)
Stata stores the user's name in a c-class value called c(username). If all users have Dropbox
in the same place, the macro can be used to specify the Dropbox directory. As noted
above, one of the common places users store Dropbox is in /users/username/Dropbox/.
The username is stored by Stata as c(username), which can be inserted as a local in the
change directory command: cd /users/`c(username)'/Dropbox. This will work as
long as all users have Dropbox installed in the same directory. However, some users may
install Dropbox in /users/username/My Dropbox/ or in /users/username/Documents/Dropbox/.
If this is the case, then the c(username) approach will not work. Moreover, as noted
above, this will work with Windows and Mac computers but not with Unix computers.
If all collaborators use Unix or Macs, they could use ~/Dropbox to go to the root
Dropbox directory.
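As a sketch (this assumes the default install location described above), the macro expands to each user's own home directory:

. display "/users/`c(username)'/Dropbox"
. cd "/users/`c(username)'/Dropbox"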
3.4 dropbox.ado
A final solution is to use an ado-file I created, dropbox.ado, which looks for the Dropbox
directory in the most common places that users install Dropbox. It starts in
the most commonly used location (/users/`c(username)'/Dropbox for Windows and
~/Dropbox for Mac and Unix computers) and then searches within the Documents directory
and then the root directory to find Dropbox. The command returns the local
Dropbox directory as r(db), and unless the nocd option is specified, it changes the
directory to a user's root Dropbox directory. From there, the relative paths of all users
within Dropbox will be the same. The command also uses the username macro to look
for the Dropbox directory.
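A sketch of the intended usage, relying only on the behavior described above (the shared subdirectory name follows the example in the conclusion):

. dropbox
. display "`r(db)'"
. cd "Shared Folder"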
This command is limited because it may not provide the correct Dropbox directory
if a user has more than one instance of Dropbox installed. It will not work if a Windows
user has Dropbox installed on a drive other than the c: drive. Also the command will
work only if all shared users have the command on their computers.
4 Conclusion
Using multiple computers and sharing files in the Cloud is increasingly common. In this
article, I presented some tips on how to best handle do-files shared with the popular
Dropbox program. Here I conclude with a couple of general tips about navigating
directories when sharing do-files.
First, avoid using the backslash when setting paths; instead, use a forward slash.
The backslash is used only by Windows machines; it is also used as an escape character
by Stata, which often causes confusion when users include locals in their pathnames.
For example, c:\users\`c(username)'\Dropbox will not work in Stata because Stata
will ignore the backslash between users and `c(username)'. Both Unix and Macs use
the forward slash in directories, and Windows recognizes the forward slash, so it is a
costless change. It will also ensure conformability across operating systems. Similarly,
Windows users should avoid references to the c:\ drive as often as possible. Sometimes,
this is unavoidable, especially with network drives or with partitioned drives. However,
if all work is done on the c:\ drive, Windows will recognize cd / as referring to the c:\
drive, which brings Windows syntax in line with Unix and Mac syntax.
Second, users should become familiar with the commands to move around directories
without specifying full path names. Users can move up one directory using cd ..
or up two directories using cd ../... From the current directory, users can move
down a directory by specifying only the new directory name. For example, to go from
/users/username/Dropbox/ to /users/username/Dropbox/Shared Folder/, one can
type cd "Shared Folder".
1 Introduction
For instructors of measurement and evaluation and individuals seeking methodological
guidance, it is difficult to find a book that both covers key analytic concepts and provides
clear direction on how to perform the associated analyses in a given statistical software
package. The fourth edition of An Introduction to Stata for Health Researchers, by
Svend Juul and Morten Frydenberg, fills this need. It does an excellent job of covering a
wide range of measurement and evaluation topics while providing a gentle introduction
to Stata for those unfamiliar with the software. In fact, though the title suggests
the book is for health researchers, it is readily generalizable to many disciplines that
implement the same methods.
Many improvements have been made to the book since John Carlin's review of
the inaugural edition in 2006 (Carlin 2006), including a reorganization of chapters to
more closely mirror the typical flow of a research project, an increase in the number of
practice exercises, and a more focused treatment of statistical issues. Additionally, this
fourth edition has been updated for Stata 13. On the whole, Juul and Frydenberg have
prepared a very accessible book for readers with varied levels of proficiency in statistics
or Stata, or both.
2 Overview
Section I includes four chapters (called the basics) that introduce the reader to Stata.
These chapters cover such issues as installing the program, getting help, understanding
file types, and using command syntax. While a novice could go directly to the Stata
user's manual (in particular, Getting Started with Stata and the Stata User's Guide),
this book offers a more user-friendly introduction. Combined, these 35 pages are more
than sufficient to get a Stata novice up and running.
Section II includes six chapters dealing with issues pertaining to data management,
such as variable types (numeric, dates and strings) and their manipulation and storage
(chapter 5); importing and exporting data (chapter 6); applying labels (chapter 7);
generating and replacing values and performing basic calculations (chapter 8); and
changing data structure, such as appending, merging, reshaping, and collapsing data
(chapter 9). Chapter 10 provides excellent advice on creating documentation (via do-files
and logs, etc.) to ensure reproducibility of data management and analytic steps.
While creating documentation is seemingly intuitive, not all researchers consistently
follow these steps.
Section III includes five chapters focusing on the types of data analyses most widely
used in health-related research.
Chapter 11 starts with basic descriptive analytics and then continues on to analy-
ses using epidemiologic tables for binary variables (including the addition of stratied
variables). This naturally progresses to analyses of continuous variables, and the chap-
ter demonstrates some visual displays of the data (histograms, QQ plots, and kernel
density plots) and methods of tabulation. The chapter then ventures into more formal
basic statistical analyses, such as t tests, one-way analysis of variance, and nonparamet-
ric techniques (ranksum).
Chapter 12 presents ordinary least-squares and logistic regression, with a fair amount
of exposition on the use of lincom for postestimation.
Chapter 13 describes time-to-event analyses, starting with simple curves and tables,
and then moves into progressively more complex Cox regression models (without and
with time-varying covariates). Next it introduces Poisson models to examine more
complex models for rates. Finally, it includes a brief discussion on indirect and direct
standardization.
Chapter 14 is titled "Measurement and diagnosis," and it describes graphical plots
and statistical tests for assessing measurement variation at one time point, and then
again over multiple measurements, for dependent samples. This transitions into methods
used for assessing accuracy of diagnostic tests (that is, sensitivity, specificity, area under
the curve, etc.).
Chapter 15, "Miscellaneous," includes topics such as random sampling, sample-
size calculations (including a nice example using simulation to estimate power for a
noninferiority study), error trapping, and log files.
Section IV includes one comprehensive chapter on graphs (44 pages). The chapter
begins by plotting a basic graph and describing the various elements, and it progresses
with increasing sophistication. It ends with some important tips on saving the code in
do-files so that graphs can be reproduced or enhanced later.
The final section, section V, is composed of a single chapter titled "Advanced topics"
and discusses storing and using results after estimation and defining macros and
scalars. It then discusses looping through data using foreach, forvalues, and if/then
statements. The chapter ends with a brief overview of creating user-written commands.
3 Comments
The book is well organized, following the logical step-by-step approach that investigators
apply to their research: data acquisition and management, analysis, and presentation
of results. The many brief examples are useful and generalizable, and the footnotes
are helpful additions. When a topic is briefly touched upon, the authors refer the
reader to the relevant help resource in Stata for more details. They also provide helpful
recommendations for resolving issues that may have multiple solutions.
Another strength of the book is that it contains many important but often overlooked
details (even for advanced Stata users), such as why a value may appear differently
when formatted as float versus double (pages 45–46) and how this precision may impact
comparisons. Other examples include the use of numlabel to display both the value
and the value label of a variable (page 67), the use of egen cut() to easily recode
continuous variables into categories (page 75), and setting showbaselevels to display
a line for the reference level in regression output (page 153). Of arguably greatest value
is the fact that the authors continually emphasize the importance of developing good
habits in documenting the work process (using do-files and logs) so that all output
can be replicated, errors can be tracked down, and time-consuming procedures can be
performed repeatedly and efficiently.
There is very little that I would change about this book, and my suggestions all relate
to what the authors could consider for future editions. First, the authors use lincom
and testparm extensively in the chapters on regression and time-to-event analyses.
Readers would benefit from seeing examples using margins (followed by marginsplot).
margins is an extremely flexible command that allows the user to perform various
analyses after running regression models, mostly with little additional specification.
The authors currently provide only a footnote (page 150) pointing interested readers
to the excellent book written by Michael N. Mitchell (2012). Second, some mention
of parametric regression models for survival analysis would be valuable (using streg),
because readers in certain disciplines may prefer these models over Cox regression models
(using stcox).
Finally, while Stata 13 introduced a new set of commands to estimate treatment
effects using propensity score-based matching and weighting techniques, the only mention
of such approaches is in appendix A, where the authors briefly describe the Stata
Treatment-Effects Reference Manual by saying this: "Despite its title, it does not correspond
to the methods of analysis that are mainstream in health research." This
statement left me somewhat perplexed, given that graduate programs in public health
in the United States have a required course in program evaluation that likely covers
these methods in at least some detail. Furthermore, there is a growing body of
health research literature where using these methods has become commonplace (see,
for example, Austin [2007; 2008]). Readers would benefit from an introduction to these
techniques, perhaps as a final chapter in which some of the datasets analyzed in previous
chapters using regression are reanalyzed using one of these approaches and the
results compared. The Stata Treatment-Effects Reference Manual offers an excellent
introduction to the methods implemented in Stata, and Stuart (2010) provides a more
comprehensive discussion of treatment-eects estimation using an array of approaches.
In summary, I strongly recommend this book both for students in introductory
measurement and evaluation courses and for more seasoned health researchers who
would like to avoid a steep learning curve when trying to conduct analyses in Stata.
4 References
Austin, P. C. 2007. Propensity-score matching in the cardiovascular surgery literature
from 2004 to 2006: A systematic review and suggestions for improvement. Journal of
Thoracic and Cardiovascular Surgery 134: 1128–1135.
Juul, S., and M. Frydenberg. 2014. An Introduction to Stata for Health Researchers.
4th ed. College Station, TX: Stata Press.
Mitchell, M. N. 2012. Interpreting and Visualizing Regression Models Using Stata.
College Station, TX: Stata Press.
Stuart, E. A. 2010. Matching methods for causal inference: A review and a look forward.
Statistical Science 25: 1–21.
Software Updates
References
Gu, Y., A. R. Hole, and S. Knox. 2013. Fitting the generalized multinomial logit model
in Stata. Stata Journal 13: 382–397.
Lokshin, M., and Z. Sajaia. 2004. Maximum likelihood estimation of endogenous switching
regression models. Stata Journal 4: 282–289.
Lokshin, M., and Z. Sajaia. 2005a. Software update: st0071_1: Maximum likelihood estimation of endogenous
switching regression models. Stata Journal 5: 139.
Lokshin, M., and Z. Sajaia. 2005b. Software update: st0071_2: Maximum likelihood estimation of endogenous
switching regression models. Stata Journal 5: 471.