
Biometrika (2011), 98, 1, pp. 231–236
doi: 10.1093/biomet/asq071
Advance Access publication 6 February 2011
© 2011 Biometrika Trust. Printed in Great Britain

A novel reversible jump algorithm for generalized linear models


BY M. PAPATHOMAS
Department of Epidemiology and Biostatistics, School of Public Health, Faculty of Medicine, Imperial College London, Norfolk Place, London W2 1PG, U.K.
m.papathomas@imperial.ac.uk

P. DELLAPORTAS AND V. G. S. VASDEKIS
Department of Statistics, Athens University of Economics and Business, Patission 76, 10434 Athens, Greece
petros@aueb.gr vasdekis@aueb.gr

SUMMARY

We propose a novel methodology to construct proposal densities in reversible jump algorithms that obtain samples from parameter subspaces of competing generalized linear models with differing dimensions. The derived proposal densities are not restricted to moves between nested models and are applicable even to models that share no common parameters. We illustrate our methodology on competing logistic regression and log-linear graphical models, demonstrating how our suggested proposal densities, together with the resulting freedom to propose moves between any models, improve the performance of the reversible jump algorithm.
Some key words: Bayesian inference; Graphical model; Linear regression; Logistic regression; Log-linear model.

1. INTRODUCTION

The reversible jump algorithm (Green, 1995) extends the standard Metropolis–Hastings algorithm to variable dimension spaces. Assume that a data vector $y$ is generated by model $i \in M$, where $M$ is a set of competing models. Each model specifies a likelihood $f(y \mid \theta_i, i)$ subject to an unknown parameter vector $\theta_i \in \Theta_i$ of size $p_i$, where $\Theta_i \subseteq \mathbb{R}^{p_i}$ is the parameter space for model $i$. We denote by $f(i)$ and $f(\theta_i \mid i)$ the prior densities for model $i$ and vector $\theta_i$, respectively. The reversible jump algorithm proceeds as follows. Let $(i, \theta_i)$ be the current state of the Markov chain. We propose a new model $j$ with probability $\pi(i, j)$ and generate $u$ from a proposal density $q(u \mid \theta_i, i, j, y)$. Assume that a differentiable function $g_{i,j}$ with a differentiable inverse $g_{i,j}^{-1} = g_{j,i}$ exists, and that $(\theta_j, u') = g_{i,j}(\theta_i, u)$ with $p_j + \dim(u') = p_i + \dim(u)$. Then, the proposed move from model $i$ to model $j$ is accepted with probability $\alpha_{i,j} = \min(1, A)$, where

$$A = \frac{f(y \mid \theta_j, j)\, f(\theta_j \mid j)\, f(j)\, \pi(j, i)\, q(u' \mid \theta_j, j, i, y)}{f(y \mid \theta_i, i)\, f(\theta_i \mid i)\, f(i)\, \pi(i, j)\, q(u \mid \theta_i, i, j, y)} \left| \frac{\partial(\theta_j, u')}{\partial(\theta_i, u)} \right|. \tag{1}$$
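To illustrate these mechanics concretely (this sketch is ours and is not the proposal construction developed in this paper), consider a toy pair of nested normal linear models, where $g_{1,2}$ simply appends the generated $u$ as a new slope and $g_{2,1}$ drops it, so the Jacobian in (1) equals 1; the data and the flat priors and $N(0,1)$ proposal are toy choices made purely to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data (not from the paper): straight-line regression.
n = 50
x = rng.normal(size=n)
y = 0.5 + 1.2 * x + rng.normal(size=n)

def log_post(model, theta):
    # Unnormalized log posterior: flat priors, unit-variance normal errors.
    fit = theta[0] if model == 1 else theta[0] + theta[1] * x
    return -0.5 * np.sum((y - fit) ** 2)

def rj_step(model, theta):
    """One reversible jump step between model 1 (intercept only) and
    model 2 (intercept + slope). The dimension-matching map appends or
    drops the slope coordinate, so the Jacobian term is 1."""
    log_q = lambda u: -0.5 * u ** 2 - 0.5 * np.log(2 * np.pi)  # q = N(0, 1)
    if model == 1:                                  # propose 1 -> 2
        u = rng.normal()
        log_A = log_post(2, np.append(theta, u)) - log_post(1, theta) - log_q(u)
        if np.log(rng.uniform()) < log_A:
            return 2, np.append(theta, u)
        return 1, theta
    u = theta[1]                                    # propose 2 -> 1; u' is the dropped slope
    log_A = log_post(1, theta[:1]) + log_q(u) - log_post(2, theta)
    if np.log(rng.uniform()) < log_A:
        return 1, theta[:1]
    return 2, theta

model, theta = 1, np.array([0.0])
for _ in range(2000):
    model, theta = rj_step(model, theta)
print(model, theta)
```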

Currently available methods to choose $q$ and $g$ are described by Brooks et al. (2003) and the accompanying discussion; see also Richardson & Green (1997), Vermaak et al. (2004), Sisson (2005), Ehlers & Brooks (2008) and Fan et al. (2009). The majority of these methods refer to local moves in $M$, that is, moves between models that share many common parameters. A global approach is given by Green (2003), who develops a method for constructing proposal distributions that is similar in spirit to the Metropolis sampler of Roberts (2003), but the requirement of a pilot run reduces the appeal of the method when the number of models is large. An extension to this approach is given by Hastie in an unpublished 2004 University of Bristol PhD thesis.

A study most relevant to our work is an unpublished 2000 report by P. J. Green. He considers the construction of proposal densities when the reversible jump algorithm samples from competing general linear models based on data of sample size $n$ with independent homoscedastic normal errors and known variance. The parameters $\theta_i$ represent the regression parameters in model $i$, and the efficient construction of a reversible jump algorithm boils down to deriving a suitable $q(u \mid \theta_i, i, j, y)$ when the algorithm is in state $(i, \theta_i)$. He proposes setting $q$ as a normal density $N(\mu, \Sigma)$ and constructs appropriate values for $\mu$ and $\Sigma$ by exploiting the design matrices $X_i$ and $X_j$ of the current and proposed models, respectively, based on the observation that an efficient proposal density should perturb $\theta_i$ orthogonally away from the hyperspace defined by $X_i$ before it is projected onto the $X_j$ hyperspace. This leads to the value $\mu = (X_j^T X_j)^{-1} X_j^T y + (X_j^T X_j)^{-1} X_j^T (X_i \theta_i - P_i y)$, where $P_i$ is an appropriate projection matrix, whereas for $\Sigma$ Green suggests $\Sigma = (X_j^T X_j)^{-1} X_j^T (I_n - P_i) X_j (X_j^T X_j)^{-1}$.
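A minimal numpy sketch of this construction may help fix ideas; it assumes unit error variance and takes $P_i = X_i(X_i^T X_i)^{-1} X_i^T$, the orthogonal projection onto the column space of $X_i$. The function name is ours, not Green's.

```python
import numpy as np

def green_proposal(Xi, Xj, theta_i, y):
    """Mean and covariance of Green's (2000, unpublished) N(mu, Sigma)
    proposal for homoscedastic normal linear models (sketch, unit variance)."""
    Pi = Xi @ np.linalg.solve(Xi.T @ Xi, Xi.T)   # projection onto col(Xi)
    XtX_inv = np.linalg.inv(Xj.T @ Xj)
    # Displacement of the current state from its own least squares fit:
    resid = Xi @ theta_i - Pi @ y
    mu = XtX_inv @ Xj.T @ (y + resid)
    Sigma = XtX_inv @ Xj.T @ (np.eye(len(y)) - Pi) @ Xj @ XtX_inv
    return mu, Sigma
```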

2. THE PROPOSED APPROACH

We consider an $n$-dimensional vector $y$ of normal observations and competing linear models $N(\eta_i, V_i)$ $(i \in M)$, where $\eta_i = X_i \theta_i$, $X_i$ is the design matrix of model $i$ and $\theta_i$ is of dimension $p_i$. The variances $V_i$ are considered known and are not the subject of inference. We assume noninformative parameter priors in the sense that they are constant in the important region of the likelihood function. Suppose that the reversible jump algorithm has a current state $(i, \theta_i)$ and that a move is proposed to $(j, \theta_j)$. Then, our key idea is that the proposal density $q(u \mid \theta_i, i, j, y)$ should satisfy the relationship

$$f(y \mid \theta_i, i) = \kappa_{ij}\, E_u\{f(y \mid \theta_j(u), j)\}, \tag{2}$$

where the constant $\kappa_{ij}$ is taken to be $f(y \mid \hat\theta_i, i)/f(y \mid \hat\theta_j, j)$, where $\hat\theta_i$ and $\hat\theta_j$ are the maximum likelihood estimates of the model parameters. Equation (2) represents a criterion that appropriately exploits not only the design matrices $X_i$ and $X_j$ but also expected beliefs about the data under the proposal density $q(u \mid \theta_i, i, j, y)$. We tackle (2) by assuming that $q(u \mid \theta_i, i, j, y)$ is a normal density $N(\mu, \Sigma)$. The following theorem provides a solution to (2).

THEOREM 1. Under the model determination set-up defined above, one solution for the mean $\mu$ of the proposal distribution $N(\mu, \Sigma)$ is

$$\mu = (X_j^T V_j^{-1} X_j)^{-1} X_j^T V_j^{-1}\{y + B^{-1} V_i^{-1/2}(X_i \theta_i - P_i y)\}, \tag{3}$$

where $B = (V_j + X_j \Sigma X_j^T)^{-1/2}$ and $P_i = X_i (X_i^T V_i^{-1} X_i)^{-1} X_i^T V_i^{-1}$ is the projection matrix onto the space generated by the columns of $X_i$, weighted by $V_i^{-1}$.

The proof of Theorem 1 is given in the Appendix. Equation (3) states that the mean of the proposal density is the maximum likelihood estimate of the new model plus a correction term that depends upon the difference between the fitted values under the maximum likelihood estimate for model $i$, $P_i y$, and the fitted values under the currently accepted $\theta_i$. Intuitively, the difference $X_i \theta_i - P_i y$ measures the distance of the current value $\theta_i$ from the mode of its posterior density, so that the proposed value of $\theta_j$ lies, in expectation, in a relatively equally high posterior region in model $j$. To complete the construction of the proposal density, we need to choose $\Sigma$ such that $B$ is invertible. Assuming that the last $t \geq 0$ parameters in $\theta_j$ are common to both models and setting $Q_{ij} = X_i^T V_i^{-1/2} V_j^{-1/2} X_j$, one choice for $\Sigma$ is

$$\Sigma = Q_{jj}^{-1}(Q_{jj} - Q_{ji} Q_{ii}^{-1} Q_{ij}) Q_{jj}^{-1} + c I_{p_j}, \tag{4}$$

where the scalar $c \geq 0$ is a tuning parameter that determines the variability for the $t$ common parameters. If $c > 0$, then $\Sigma$ is positive definite with rank $p_j$, and $\dim(u) = p_j$, $\dim(u') = p_i$. Then, the Jacobian in (1) simplifies to $|\Sigma^{1/2} \Sigma'^{-1/2}|$, with $\Sigma'$ denoting the covariance matrix of the proposal density when a move is attempted from model $j$ to model $i$. Setting $c = 0$ yields deterministic, but not identical, proposed moves for the common parameters between models, and $\dim(u) = p_j - t$, $\dim(u') = p_i - t$.
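The following self-contained numpy sketch assembles the proposal $N(\mu, \Sigma)$ from (3) and (4); it assumes the $V$'s are symmetric positive definite, and the helper and function names are ours, not the paper's. A proposed $\theta_j$ would then be drawn from $N(\mu, \Sigma)$.

```python
import numpy as np

def inv_sqrt(S):
    # Inverse symmetric square root of a symmetric positive definite matrix.
    w, U = np.linalg.eigh(S)
    return U @ np.diag(w ** -0.5) @ U.T

def proposal_mean_cov(Xi, Xj, Vi, Vj, theta_i, y, c=1e-5):
    """Mean and covariance of the N(mu, Sigma) proposal of (3)-(4) (sketch)."""
    Vi_inv, Vj_inv = np.linalg.inv(Vi), np.linalg.inv(Vj)
    Vi_is, Vj_is = inv_sqrt(Vi), inv_sqrt(Vj)
    # Weighted projection onto the column space of Xi:
    Pi = Xi @ np.linalg.solve(Xi.T @ Vi_inv @ Xi, Xi.T @ Vi_inv)
    # Q_ab = Xa' Va^{-1/2} Vb^{-1/2} Xb
    Qii = Xi.T @ Vi_inv @ Xi
    Qjj = Xj.T @ Vj_inv @ Xj
    Qij = Xi.T @ Vi_is @ Vj_is @ Xj
    Qji = Xj.T @ Vj_is @ Vi_is @ Xi
    Qjj_inv = np.linalg.inv(Qjj)
    # Equation (4): Sigma = Qjj^{-1} (Qjj - Qji Qii^{-1} Qij) Qjj^{-1} + c I
    Sigma = Qjj_inv @ (Qjj - Qji @ np.linalg.solve(Qii, Qij)) @ Qjj_inv
    Sigma += c * np.eye(Xj.shape[1])
    # Equation (3), with B = (Vj + Xj Sigma Xj')^{-1/2}:
    B = inv_sqrt(Vj + Xj @ Sigma @ Xj.T)
    shift = np.linalg.inv(B) @ Vi_is @ (Xi @ theta_i - Pi @ y)
    mu = np.linalg.solve(Qjj, Xj.T @ Vj_inv @ (y + shift))
    return mu, Sigma

# Usage: theta_j = np.random.default_rng(0).multivariate_normal(mu, Sigma)
```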

Table 1. Mixing performance of samplers (standard errors in parentheses)

                                     Log-linear models                       Logistic models
                               Acceptance    Iterations to highest           Acceptance
                               rate (%)      posterior probability model     rate (%)
Suggested proposals (c = 10^{-5})   5·1          447 (32)                       8·7
Suggested proposals (c = 0)         5·1          483 (34)                       8·5
Green                               5·1          493 (39)                       0·5
Dellaportas & Forster               2·4         8454 (1248)                     2·3

3. IMPLEMENTATION OF PROPOSALS IN GENERALIZED LINEAR MODELS

3·1. Data transformations

The construction of proposal densities in a reversible jump algorithm that jumps between generalized linear models with different linear predictors is achieved by applying a data transformation to the responses so that they approximate normality; see Clyde (1999). This transformation is used only to derive $q(u \mid \theta_i, i, j, y)$ and does not affect the form of the likelihood functions in (1), which are calculated using the original data and models. The resulting proposal densities are then based on (3) and (4) with $y$ replaced by the transformed response vector. Therefore, they are only approximate solutions of (2) but still turn out to provide satisfactory mixing.

3·2. Graphical log-linear model determination

Let $w_k$ be a Poisson random variable and $\eta_{ik} = (X_i \theta_i)_k$ be the linear predictor so that, for model $i$, $E(w_k \mid M = i) = \exp(\eta_{ik})$ $(k = 1, \ldots, n)$. Using the standard normal approximation to the Poisson distribution and the delta method, we readily obtain that $y_k = 2(\sqrt{w_k} - \sqrt{\bar w})/\sqrt{\bar w} + \log \bar w$ is approximately distributed as $N(\eta_{ik}, 1/\bar w)$, where $\bar w$ denotes the sample mean.

Edwards & Havránek (1985) presented a $2^6$ contingency table in which 1841 men were cross-classified by six risk factors for heart disease. Dellaportas & Forster (1999) assumed that main effects are always present and constructed a reversible jump algorithm to compare the 32 768 possible graphical log-linear models. Their algorithm allowed moves that involve the addition or removal of an edge in the graph (Jones et al., 2005), whilst proposal densities were obtained through a pilot run on the saturated model. We adopt the same prior specification as in Dellaportas & Forster (1999) and construct a reversible jump algorithm that allows for the addition, removal or replacement of an edge in the graph. Compared with the algorithm of Dellaportas & Forster (1999), which allowed only addition and removal moves, our algorithm can propose moves between nonnested models. Results were derived from $3 \times 10^6$ iterations, after $8 \times 10^5$ burn-in iterations were discarded. The chain mixing was similar for $c = 0$ and any value of $c$ within $(10^{-9}, 10^{-3})$. Values $0 < c < 10^{-9}$ caused numerical instability. The acceptance rate with our proposal densities was almost twice as high as that observed with the Dellaportas & Forster (1999) algorithm, and similar to that obtained with the proposals derived in the report by Green; see Table 1. We also ran the sampler 200 times, starting all chains from the model that contains only main effects, and recorded the number of iterations before the highest posterior probability model was first visited; see Table 1. It is clear that allowing nonnested moves between models significantly increases the mobility of the chain. To evaluate the quality of our proposals, leaving aside the fact that they allow for nonnested moves, we revisited the Dellaportas & Forster (1999) algorithm but replaced their independent, pilot run-based proposal densities with those derived from (3) and (4). Under this set-up, and allowing only moves between nested models, the acceptance rate turned out to be 7·1%, an improvement by a factor of 3 when compared with the 2·4% of Dellaportas & Forster (1999). A similar increase was observed with the proposals derived by Green in the unpublished report.
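Referring back to the transformation at the start of this subsection, a short sketch (our code, not the authors') that maps Poisson counts to approximately normal responses; the output would replace $y$ in (3) and (4), with $V_i = V_j = (1/\bar w) I_n$.

```python
import numpy as np

def poisson_to_normal(w):
    """Delta-method transformation of Poisson counts to approximate
    normality on the log-linear predictor scale (sketch of section 3.2)."""
    wbar = w.mean()
    y = 2 * (np.sqrt(w) - np.sqrt(wbar)) / np.sqrt(wbar) + np.log(wbar)
    return y, 1.0 / wbar   # common approximate variance of each y_k
```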


3·3. Competing logistic regression models for binomial data

Let $z_k$ $(k = 1, \ldots, n)$ be the number of successes in a series of $n$ binomial experiments with corresponding $n_k$ trials and probabilities of success $p_k$. Define $w_k = z_k/n_k$ and let $\eta_{ik} = (X_i \theta_i)_k$ be the linear predictor so that, for model $i \in M$, $E(w_k \mid M = i) = \exp(\eta_{ik})/\{1 + \exp(\eta_{ik})\}$. Following Clyde (1999), we can readily obtain that $y_k = 2\{\bar w(1 - \bar w)\}^{-1/2}\{\arcsin(\sqrt{w_k}) - \arcsin(\sqrt{\bar w})\} + \log\{\bar w/(1 - \bar w)\}$ is approximately distributed as $N[\eta_{ik}, \{n_k \bar w(1 - \bar w)\}^{-1}]$, with $\bar w$ denoting the sample mean.

We consider a dataset analysed by Fowlkes et al. (1988) in which the response is the number of subjects satisfied with their employment and there are four explanatory factors. We assessed the 64 models that contain main effects and all possible combinations of two-way interactions by adopting a reversible jump algorithm with equal prior model probabilities and unit information parameter priors as in Ntzoufras et al. (2003). We allowed the addition, removal or replacement of an interaction term, and compared our algorithm with the Dellaportas & Forster (1999) approach in which only the addition or removal of an interaction is allowed. Results were obtained from $3 \times 10^5$ iterations, after $5 \times 10^4$ burn-in iterations were discarded; see Table 1. Results were very robust for $c < 10^{-4}$, with an acceptance rate almost four times higher than that of Dellaportas & Forster (1999). The comparatively small acceptance rate obtained with the proposals in the 2000 report by Green, where the variance matrix is not weighted by $V_i$, illustrates that when dealing with binomial data it is important to allow for the variance of each $y_k$ to be weighted by $n_k$.
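Under the same conventions, a sketch of the binomial counterpart; note that the returned variances vary with $n_k$, so the weighting matrix $V = \mathrm{diag}[\{n_k \bar w(1 - \bar w)\}^{-1}]$ is no longer a multiple of the identity, which is exactly the weighting whose absence hurts the Green proposals above.

```python
import numpy as np

def binomial_to_normal(z, trials):
    """Arcsine-based transformation of binomial proportions to approximate
    normality on the logit scale (sketch of section 3.3)."""
    w = z / trials
    wbar = w.mean()
    y = (2 * (np.arcsin(np.sqrt(w)) - np.arcsin(np.sqrt(wbar)))
         / np.sqrt(wbar * (1 - wbar)) + np.log(wbar / (1 - wbar)))
    var = 1.0 / (trials * wbar * (1 - wbar))   # one variance per observation
    return y, var
```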
ACKNOWLEDGEMENT

The work was co-funded by the European Social Fund and National Resources, Ministry of Education, Pythagoras II, and by the Methodology Research Programme of the Medical Research Council. We would like to thank Professor Peter Green for making the unpublished draft known to us and for other useful comments, and the editor and referees for their very helpful suggestions.

APPENDIX

Proof of Theorem 1. We first prove that, for vectors $z$, $\mu$ and $y$ and matrices $X$, $\Sigma$ and $V$ with appropriate dimensions so that the quadratic forms below are well defined,

$$(z - \mu)^T \Sigma^{-1}(z - \mu) + (y - Xz)^T V^{-1}(y - Xz) = (z - m)^T A (z - m) + K, \tag{A1}$$

where

$$A = \Sigma^{-1} + X^T V^{-1} X, \qquad m = A^{-1}(\Sigma^{-1}\mu + X^T V^{-1} y),$$
$$K = \{\mu - (X^T V^{-1} X)^{-1} X^T V^{-1} y\}^T \{\Sigma + (X^T V^{-1} X)^{-1}\}^{-1}\{\mu - (X^T V^{-1} X)^{-1} X^T V^{-1} y\} + y^T V^{-1}(I_n - P_X) y,$$
$$P_X = X (X^T V^{-1} X)^{-1} X^T V^{-1}.$$

First note that (A1) holds after some algebra with

$$K = -(\Sigma^{-1}\mu + X^T V^{-1} y)^T(\Sigma^{-1} + X^T V^{-1} X)^{-1}(\Sigma^{-1}\mu + X^T V^{-1} y) + \mu^T \Sigma^{-1}\mu + y^T V^{-1} y.$$

The required expression for $K$ is derived by completing the square so that

$$(\Sigma^{-1}\mu + X^T V^{-1} y)^T(\Sigma^{-1} + X^T V^{-1} X)^{-1}(\Sigma^{-1}\mu + X^T V^{-1} y) = -(\mu - m_1)^T A_1 (\mu - m_1) + \mu^T \Sigma^{-1}\mu + y^T V^{-1} X (X^T V^{-1} X)^{-1} X^T V^{-1} y,$$

where

$$A_1 = \{\Sigma + (X^T V^{-1} X)^{-1}\}^{-1}, \qquad m_1 = A_1^{-1}\Sigma^{-1}(\Sigma^{-1} + X^T V^{-1} X)^{-1} X^T V^{-1} y = (X^T V^{-1} X)^{-1} X^T V^{-1} y.$$

Using (A1), we obtain

$$E_u\{f(y \mid u, j)\} = \int |2\pi\Sigma|^{-1/2}|2\pi V_j|^{-1/2}\exp\{-\tfrac12(u - \mu)^T\Sigma^{-1}(u - \mu) - \tfrac12(X_j u - y)^T V_j^{-1}(X_j u - y)\}\,du$$
$$= \int |2\pi\Sigma|^{-1/2}|2\pi V_j|^{-1/2}\exp[-\tfrac12\{(u - m)^T A (u - m) + y^T V_j^{-1}(I_n - P_j) y + (\mu - [X_j^T V_j^{-1} X_j]^{-1} X_j^T V_j^{-1} y)^T(\Sigma + [X_j^T V_j^{-1} X_j]^{-1})^{-1}(\mu - [X_j^T V_j^{-1} X_j]^{-1} X_j^T V_j^{-1} y)\}]\,du,$$

where

$$P_j = X_j (X_j^T V_j^{-1} X_j)^{-1} X_j^T V_j^{-1}, \qquad A = \Sigma^{-1} + X_j^T V_j^{-1} X_j, \qquad m = A^{-1}(\Sigma^{-1}\mu + X_j^T V_j^{-1} y).$$

Only the first part of this sum depends on $u$ and the integral becomes

$$|2\pi\Sigma|^{-1/2}|2\pi V_j|^{-1/2}|2\pi(\Sigma^{-1} + X_j^T V_j^{-1} X_j)^{-1}|^{1/2}\exp[-\tfrac12\{\mu - (X_j^T V_j^{-1} X_j)^{-1} X_j^T V_j^{-1} y\}^T\{\Sigma + (X_j^T V_j^{-1} X_j)^{-1}\}^{-1}\{\mu - (X_j^T V_j^{-1} X_j)^{-1} X_j^T V_j^{-1} y\}]\exp\{-\tfrac12 y^T V_j^{-1}(I_n - P_j) y\}.$$

The product of the first and third determinants can be simplified using (1.5) of Harville (1997, p. 417) to $|V_j^{-1}(V_j + X_j \Sigma X_j^T)|^{-1/2}$. By applying (2.2) of Harville (1997, p. 424) twice, we obtain

$$\{\Sigma + (X_j^T V_j^{-1} X_j)^{-1}\}^{-1} = X_j^T (V_j + X_j \Sigma X_j^T)^{-1} X_j.$$

Consequently,

$$E_u\{f(y \mid u, j)\} = |2\pi V_j|^{-1/2}|V_j^{-1}(V_j + X_j \Sigma X_j^T)|^{-1/2}\exp\{-\tfrac12(X_j\mu - P_j y)^T(V_j + X_j \Sigma X_j^T)^{-1}(X_j\mu - P_j y)\}\exp\{-\tfrac12 y^T V_j^{-1}(I_n - P_j) y\}.$$

For the maximum likelihood estimate $\hat\theta_i = (X_i^T V_i^{-1} X_i)^{-1} X_i^T V_i^{-1} y$, the likelihood becomes

$$f(y \mid \hat\theta_i, i) = |2\pi V_i|^{-1/2}\exp\{-\tfrac12 y^T V_i^{-1}(I_n - P_i) y\},$$

and the ratio $f(y \mid \theta_i, i)/f(y \mid \hat\theta_i, i)$ reduces to

$$\exp\{-\tfrac12(X_i\theta_i - y)^T V_i^{-1} P_i (X_i\theta_i - y)\} = \exp\{-\tfrac12(X_i\theta_i - P_i y)^T V_i^{-1}(X_i\theta_i - P_i y)\}.$$

Therefore, condition (2) becomes

$$\exp\{-\tfrac12(X_i\theta_i - P_i y)^T V_i^{-1}(X_i\theta_i - P_i y)\} = \Delta^{-1/2}\exp\{-\tfrac12(X_j\mu - P_j y)^T(V_j + X_j \Sigma X_j^T)^{-1}(X_j\mu - P_j y)\},$$

where $\Delta = |V_j^{-1}(V_j + X_j \Sigma X_j^T)| > 1$ (Rao & Toutenburg, 1995, p. 299). By taking logarithms, we obtain

$$(X_i\theta_i - P_i y)^T V_i^{-1}(X_i\theta_i - P_i y) = \log\Delta + (X_j\mu - P_j y)^T B^2 (X_j\mu - P_j y),$$

where $B = (V_j + X_j \Sigma X_j^T)^{-1/2}$. Setting $\xi = (\log\Delta)^{1/2}(\tilde\xi^T B^2 \tilde\xi)^{-1/2}\tilde\xi$, where $\tilde\xi = (I_n - P_j)v$ and $v$ is any $n$-dimensional vector, the previous equation can be written as

$$(X_i\theta_i - P_i y)^T V_i^{-1}(X_i\theta_i - P_i y) = (X_j\mu - P_j y + \xi)^T B^2 (X_j\mu - P_j y + \xi),$$

since $\xi^T B^2 \xi = \log\Delta$ by construction and the cross term vanishes because $(I_n - P_j)^T V_j^{-1} X_j = 0$. A value of $\mu$ that satisfies the above equality is obtained by solving the equation $B(X_j\mu - P_j y + \xi) = V_i^{-1/2}(X_i\theta_i - P_i y)$, which gives

$$\mu = (X_j^T V_j^{-1} X_j)^{-1} X_j^T V_j^{-1} y + (X_j^T V_j^{-1} X_j)^{-1} X_j^T V_j^{-1} B^{-1} V_i^{-1/2}(X_i\theta_i - P_i y)$$
$$= (X_j^T V_j^{-1} X_j)^{-1} X_j^T V_j^{-1}\{y + B^{-1} V_i^{-1/2}(X_i\theta_i - P_i y)\},$$

where the term in $\xi$ drops out because $X_j^T V_j^{-1}(I_n - P_j) = 0$.
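As a quick numerical sanity check on the completing-the-square identity (A1) (this check is ours and is not part of the paper), the following short script verifies the identity for randomly generated matrices and vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_spd(d):
    # Random symmetric positive definite matrix.
    M = rng.normal(size=(d, d))
    return M @ M.T + d * np.eye(d)

n, p = 6, 3
X = rng.normal(size=(n, p))
Sig, V = rand_spd(p), rand_spd(n)
z, mu, y = rng.normal(size=p), rng.normal(size=p), rng.normal(size=n)

Si, Vi = np.linalg.inv(Sig), np.linalg.inv(V)
W = X.T @ Vi @ X                                  # X' V^-1 X
A = Si + W
m = np.linalg.solve(A, Si @ mu + X.T @ Vi @ y)
P = X @ np.linalg.solve(W, X.T @ Vi)              # projection P_X
d = mu - np.linalg.solve(W, X.T @ Vi @ y)
K = d @ np.linalg.solve(Sig + np.linalg.inv(W), d) + y @ Vi @ (np.eye(n) - P) @ y

lhs = (z - mu) @ Si @ (z - mu) + (y - X @ z) @ Vi @ (y - X @ z)
rhs = (z - m) @ A @ (z - m) + K
assert np.isclose(lhs, rhs), (lhs, rhs)
print("identity (A1) holds:", lhs, "==", rhs)
```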

REFERENCES
BROOKS, S. P., GIUDICI, P. & ROBERTS, G. O. (2003). Efficient construction of Markov chain Monte Carlo proposal distributions (with Discussion). J. R. Statist. Soc. B 65, 3–55.
CLYDE, M. A. (1999). Bayesian model averaging and model search strategies. In Bayesian Statistics 6, Ed. J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, pp. 157–85. New York: Oxford University Press.
DELLAPORTAS, P. & FORSTER, J. J. (1999). Markov chain Monte Carlo model determination for hierarchical and graphical log-linear models. Biometrika 86, 615–33.
EDWARDS, D. & HAVRÁNEK, T. (1985). A fast procedure for model search in multi-dimensional contingency tables. Biometrika 72, 339–51.
EHLERS, R. S. & BROOKS, S. P. (2008). Adaptive proposal construction for reversible jump MCMC. Scand. J. Statist. 35, 677–90.
FAN, Y., PETERS, G. W. & SISSON, S. A. (2009). Automating and evaluating reversible jump MCMC proposal distributions. Statist. Comp. 19, 409–21.
FOWLKES, E. B., FREENY, A. E. & LANDWEHR, J. M. (1988). Evaluating logistic models for large contingency tables. J. Am. Statist. Assoc. 83, 611–22.
GREEN, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–32.
GREEN, P. J. (2003). Trans-dimensional Markov chain Monte Carlo. In Highly Structured Stochastic Systems, Ed. P. J. Green, N. L. Hjort and S. Richardson, pp. 179–98. New York: Oxford University Press.
HARVILLE, D. A. (1997). Matrix Algebra from a Statistician's Perspective. Berlin: Springer.
JONES, B., CARVALHO, C., DOBRA, A., HANS, C., CARTER, C. & WEST, M. (2005). Experiments in stochastic computation for high-dimensional graphical models. Statist. Sci. 20, 388–400.
NTZOUFRAS, I., DELLAPORTAS, P. & FORSTER, J. J. (2003). Bayesian variable and link determination for generalized linear models. J. Statist. Plan. Infer. 111, 165–80.
RAO, C. R. & TOUTENBURG, H. (1995). Linear Models. New York: Springer.
RICHARDSON, S. & GREEN, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with Discussion). J. R. Statist. Soc. B 59, 731–58.
ROBERTS, G. O. (2003). Linking theory and practice of MCMC. In Highly Structured Stochastic Systems, Ed. P. J. Green, N. L. Hjort and S. Richardson, pp. 145–66. New York: Oxford University Press.
SISSON, S. A. (2005). Trans-dimensional Markov chains: a decade of progress and future perspectives. J. Am. Statist. Assoc. 100, 1077–89.
VERMAAK, J., ANDRIEU, C., DOUCET, A. & GODSILL, S. J. (2004). Reversible jump Markov chain Monte Carlo strategies for Bayesian model selection in autoregressive processes. J. Time Ser. Anal. 25, 785–809.

[Received June 2009. Revised September 2010]
