Generalized Linear Models, Estimating Functions and Multivariate Extensions

Kung-Yee Liang
Department of Biostatistics
School of Hygiene & Public Health
Johns Hopkins University
July 12-14, 1999
PREFACE
Regression analysis has been developed over many years and remains one of the most
commonly used statistical tools for helping scientists address their scientific inquiries.
In 1972, thanks to the seminal work of Nelder and Wedderburn, many useful regression
models were unified under the framework of generalized linear models (GLMs). These
include as special cases ANCOVA, multiple regression for continuous responses, logistic
regression for binary responses and log-linear models for contingency tables. Nelder and
Wedderburn asserted that each model under consideration can be cast through two
components. One is the systematic component, which relates the mean response to the
covariates by specifying a link function. The other is the random component, which is
needed to account for uncertainty in the response variable. Through this common
framework, issues commonly faced in regression, such as estimation procedures,
regression diagnostics, measurement errors in covariates and goodness-of-fit, can be
addressed in a unified fashion. While there is no doubt that this approach will remain in
use for many years to come, it faces, among others, two challenges that have drawn a
good deal of attention among statistical researchers in the past few decades. One is to
deal with the situation where scientists may not have sufficient knowledge about the
subject matter for their statistical colleagues to characterize the random mechanism by
which the data are generated. The work by Wedderburn in 1974, known as the
quasi-likelihood method, provides a means to cope with this complication. It relaxes the
random component assumption required by GLMs; in its place, a much weaker
assumption on the variance of the response variable is imposed. Several questions remain
to be answered regarding the use of this alternative method. First, can the work by
Wedderburn be extended to accommodate more general and realistic variance
specifications? Second, does the quasi-score method of Wedderburn possess any desirable
statistical optimality properties? Third, is there any evidence that this method performs
well in practice?
The other challenge GLMs face is how to deal with the situation where some of the
observations are not statistically independent of each other. Implicitly, GLMs operate on
the assumption that all observations are statistically independent of each other. In
biomedical studies, however, one frequently finds that such an assumption may be
invalid. In the past decade or two, the statistical community has also witnessed
tremendous development in statistical methods for analyzing correlated data. A natural
question to raise is: what are the pros and cons of these new methods in terms of their
suitability for addressing scientific objectives from studies involving correlated data?
In this lecture note, we intend to address the two challenges and the questions posed
above through the vehicle of estimating functions, a concept formally developed in 1960
in the independent work of Durbin and Godambe. The lecture note is organized as
follows. In Chapter one, we first characterize five motivating examples which involve
correlated data. We argue that, for all examples considered, the scientific objectives can
be cast through regression modeling. Also discussed are (1) the statistical consequences
when the correlation possessed by the data is ignored in the analysis and (2) quantities
that are useful for measuring the degree of association for correlated data.
Chapter two provides a chronological overview of some recent developments in
regression techniques, with a focus on generalized linear models, quasi-likelihood methods
and estimating function approaches. This overview is preceded by a discussion of issues
concerning likelihood-based inference. These include the impact of nuisance parameters
and cautions in using the popular Wald's procedure.
Chapters three and four may be viewed as a preview of statistical modeling for
correlated data. In Chapter three, a brief review of historical developments in analyzing
proportional data leads to the following question: is the binomial assumption made
therein a reasonable one? The well-known litter effect in teratological experiments
suggests that one of the key assumptions of the binomial, namely that all the binary
responses from within the same litter are statistically independent of each other, is
invalid. We discuss in Chapter three the drawbacks of some attempts in the statistical
literature of the 1970s and 1980s to relax the two key assumptions of the binomial for
proportional data. The teratological experiment data also serve to illustrate the
usefulness of the quasi-likelihood method in dealing with (1) uncertainty about random
mechanisms and (2) the impact of nuisance parameters.
Chapter four consists of two parts. In part one, we discuss three different statistical
models that are useful for polytomous responses, in which the discrete response variable
contains three or more categories. Circumstances under which these models may or may
not be appropriate are highlighted and contrasted through two examples. This part also
serves to extend GLMs to incorporate either ordinal or nested polytomous response
variables. In part two, we focus on interpretations of parameters induced by log-linear
models, which are popular for analyzing data from contingency tables. There has been
renewed interest lately in transferring this approach for contingency tables to correlated
data analysis. Through detailed discussion of the interpretation of log-linear parameters,
we argue that such an application to correlated data should proceed with caution.
In Chapter five, we contrast three different statistical modelling approaches for
correlated data: marginal models, random effects models and observation driven models,
with a focus on interpretations of the induced parameters. These contrasts, and the
circumstances under which each model may be most appropriate, are illustrated through
two examples introduced in Chapter one: the Baltimore Eye Survey Study and the
Schizophrenia Trial Study.
Chapter six deals with statistical inference for correlated data when any one of the
three statistical models mentioned above is employed. These methods may be viewed as
multivariate extensions of GLMs and quasi-likelihood, which were primarily designed for
analyzing univariate responses. Again, these newly developed methods are illustrated
through the two examples mentioned above.
This lecture note was made possible by the kind invitation of Dr. C. Z. Wei, the
former Director of the Institute of Statistical Science, Academia Sinica, Taiwan, R.O.C.,
to take part in the lecture series in statistics sponsored by the institute in 1999. The
coordinating efforts of Dr. C. H. Chen, the executive secretary of the lecture series
committee, and of the committee members are gratefully acknowledged. The typing of
this lecture note and of the lectures delivered in July 1999 at Academia Sinica was
diligently carried out by Ms. Patty Hubbard, Department of Biostatistics, Johns Hopkins
University, and by Ms. Umy Chen, Institute of Statistical Science, Academia Sinica.
Their hard work and patience throughout are beyond what can be described in words.
I thank them from the bottom of my heart.
Baltimore K.Y.L.
March 2000
Contents
1 Correlated Data
1.1 Motivating Examples
1.2 Characteristics of Examples
1.3 Impact of Ignoring Within-Cluster Dependence
1.3.1 Incorrect Variance
1.3.2 Efficiency Loss
1.4 Measures of Within-Cluster Dependence
2 Regression Analysis: An Overview
2.1 Some Issues Concerning Likelihood-Based Inference
2.1.1 Impact of Nuisance Parameters
2.1.2 Cautions on Wald's Procedure
2.2 Generalized Linear Models
2.2.1 First Two Moments of GLMs
2.2.2 Score Function for $\beta$
2.2.3 MLE Versus Weighted Least Squares
2.2.4 Deviance
2.3 Quasi-Likelihood Method
2.4 Estimating Function Approach
2.4.1 Limitations: Impact of Nuisance Parameters
2.4.2 Orthogonality Properties of Conditional Score Functions
2.4.3 (Local) Optimality of Quasi-Score Functions
3 Analysis of Binary Data
3.1 Historical Developments
3.1.1 Empirical Logit versus Logistic Regression
3.1.2 Is the Binomial Assumption Reasonable?
3.2 Teratological Experiments
3.2.1 Some Cautionary Notes on Beta-Binomial Distributions
3.2.2 An Alternative Inferential Procedure
3.3 Further Relaxation of Binomial Assumptions
3.3.1 Rosner's Approach
3.3.2 Additive Model
3.3.3 Multiplicative Model
4 Analysis of Polytomous and Count Data
4.1 Types of Polytomous Data
4.2 Regression Models for Polytomous Responses
4.2.1 Polytomous Logistic Regression Models
4.2.2 Proportional Odds Models
4.2.3 Continuation Ratio Models
4.3 Log-Linear Models for Contingency Tables
4.3.1 Interpretations of Log-Linear Model Parameters
4.3.2 An Alternative Representation of Log-Linear Models
5 Statistical Modellings for Correlated Data
5.1 Marginal Models
5.2 Random Effects Model
5.3 Observation Driven Models
5.4 Contrasts of Three Modelling Approaches
6 Statistical Inference for Correlated Data
6.1 Marginal Models
6.1.1 Quasi-Likelihood Approach
6.1.2 Likelihood Approach
6.2 Random Effects Models
6.2.1 Conditional Likelihood Approach
6.3 Observation Driven Models
6.3.2 Quasi-Likelihood Approach
6.4 Pre-Post Trial on Schizophrenia Revisited
List of Tables
1.1 Some features of five examples in Section 1.1.
1.2 Ranges of correlation coefficients for a pair of binary responses with corresponding
means $\mu_j$ and $\mu_k$, respectively.
3.1 Proportions of children surviving 2 years free of disease following diagnosis and
treatment for neuroblastoma by age and stage of disease at diagnosis (Breslow &
McCann, 1971).
3.2 Proportions of fetuses surviving the 21 day lactation period among those who were
alive at 4 days by treatment group (Weil, 1970).
3.3 The averaged MLE ( s.e.) for for dierent assumed values of from
1,000 simulations (Williams, 1988). . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Estimates and standard errors of
1
= log{
T
/(1
T
)}log{
C
/(1
C
)}
of Weils data presented in Table 3.2. Upper entry: common ; lower entry:
heterogeneous (Liang and Hanfelt, 1994). . . . . . . . . . . . . . . . . . 38
3.5 Simulation results for
1
= log{
T
/(1
T
)} from 1,000 replications mim-
icking the data structure of Weils. The true
1
is 1.129 and
T
= 0.317
and
C
= 0.021 (Liang and Hanfelt, 1994). . . . . . . . . . . . . . . . . . 39
4.1 Regression estimates (± s.e.) based on both proportional odds models and
polytomous logistic regression models for the disturbed dream example (Maxwell, 1961).
4.2 Regression estimates (± s.e.) based on proportional odds models and continuation
ratio models for the tonsil size example (Holmes and Williams, 1954).
6.1 Regression estimates and standard errors (in parentheses) for the visual impairment
response from the Baltimore Eye Survey Study.
6.2 Regression estimates and standard errors (in parentheses) from three modelling
approaches: Baltimore Eye Survey Study.
6.3 Regression estimates and standard errors (in parentheses) from three modelling
approaches: a pre-post trial on schizophrenia.
List of Figures
1.1 The plot of the logarithm of $V_1/V_2$ versus $\rho$ for selected cluster sizes $n$
(2, 5, 10) and $\delta$ (0, 0.2, 0.5, 0.8, 1.0). Here, $V_1$ is the variance of the least
squares estimate $\hat\beta_1$ when the within-cluster dependence is ignored and $V_2$ is
the correct variance. We have assumed $E(Y_{ij}) = \beta_0 + \beta_1 x_{ij}$ and
$\rho = \mathrm{corr}(Y_{ij}, Y_{ik})$, $j < k = 1, \ldots, n$; $\delta$ is the ratio of the
between-cluster variance to the total variance among the $x_{ij}$'s.
1.2 The plot of $V_3/V_2$ versus $\rho$ for selected $n$ (2, 5, 10) and $\delta$
(0, 0.2, 0.5, 0.8, 1.0). Here $V_2$, $\rho$, $\delta$ and the assumed model are the same as
described in Figure 1.1; $V_3$ is the variance of the best unbiased estimate of $\beta_1$.
2.1 Beta-binomial log-likelihoods for the exposed group based on the data reported by
Weil (1970).
6.1 Predicted risk of VI by race and age among those with 9 years of education in the
Baltimore Eye Survey Study (Tielsch et al., 1990): solid line, whites; dashed line, blacks.
6.2 Mean responses in PANSS by dropout cohort.
Chapter 1
Correlated Data
1.1 Motivating Examples
Regression analysis is commonly used in biomedical research. A typical assumption
behind this tool is that all observations are statistically independent, or at least uncor-
related with each other. The following research problems in which we have been directly
or indirectly involved suggest that there are many situations for which this assumption
might not be met.
Example 1.1 (Baltimore Eye Survey Study). Over 5,000 individuals aged 40 and
above received a visual examination as part of a population-based prevalence study of
ocular disorders (Tielsch et al. 1990). The main objective of the summary is to identify
demographic variables such as race and age that are associated with vision loss. The
complication of the study is that data are available on both eyes which are unlikely to
be independent because many causes of eye impairment are binocular.
Example 1.2 (Family Study on Chronic Obstructive Pulmonary Disease (COPD)). Six
hundred and thirteen members of 158 COPD cases seen at the Johns Hopkins Hospital
were examined and given spirometry tests (Cohen, et al., 1977). The primary objective
is to determine the percent of total variance attributable to unobserved genetic factors
shared among siblings and their parents. It is typical in family studies that the response
variables, FEV1 in this case, are measured for related individuals, which in turn are
crucial to address scientific questions of interest such as the one stated above.
Example 1.3 (Sister Chromatid Exchange Study). A total of 14 hepatocellular
carcinoma, 14 nasopharyngeal carcinoma and 16 cervical cancer patients and their
age-sex-matched controls were studied to compare the frequency of sister chromatid exchange
(SCE) in their peripheral lymphocytes. The main hypothesis, which was tested in the
study conducted by Professor C.J. Chen, College of Public Health, National Taiwan Uni-
versity, is that cancer patients may have a higher frequency of SCE in lymphocytes than
matched controls. The outcome variable is the number of SCE per cell where 20 cells
from each subject were cultured.
Example 1.4 (Pre-Post Trial of Schizophrenia). Five hundred and twenty-three patients
diagnosed with schizophrenia were randomly allocated among six treatments: placebo,
haloperidol 20 mg and risperidone at dose levels of 2 mg, 6 mg, 10 mg and 16 mg. The
primary response variable is the total score obtained on the Positive and Negative
Syndrome Scale (PANSS), a measure of the severity of schizophrenia. This score was
measured for each patient at baseline and at weeks 1, 2, 4, 6 and 8 after randomization.
The main objective is to determine whether risperidone improves the rate of decline in
PANSS over the two-month period.
Example 1.5 (Diabetic Retinopathy Study). Over 1,700 patients with diabetic
retinopathy and visual acuity of 20/100 or better in both eyes participated in the study
(Murphy and Patz, 1978). One eye of each patient was randomly selected for laser
photocoagulation and the other was observed without treatment. The main objective is
to study the effectiveness of laser photocoagulation in delaying the onset of blindness,
and the response variable for each eye is the time from randomization to the occurrence
of visual acuity less than 5/200.
1.2 Characteristics of Examples
Although different in their scientific objectives, the examples introduced above share
some important characteristics that are critical to the statistical development presented
in Chapters 5 and 6. First, data in each of these examples are organized in clusters. For
example, in the Baltimore Eye Survey Study, each cluster comprises observations of two
eyes from an individual. In the family study (Example 1.2), a cluster is formed by a
family with observations from related individuals. Secondly, response variables from the
same cluster are likely to be correlated with each other. In longitudinal studies, where
repeated observations from each subject are measured over time, this is known as
tracking. In family studies, observations from related individuals such as siblings are
correlated due to shared genes and/or environments. This explanation applies to the
fellow eyes in Example 1.1 as well. Thirdly, the scientific objectives of these examples can
be formulated statistically in terms of regression analyses, just as one would for
independent responses. For example, one could use conventional logistic regression to
relate the risk of visual impairment for each eye in the Baltimore Eye Survey Study to
demographic variables such as age and race. In addition, one can assess, through
regression as well, how the degree of within-cluster dependence between fellow eyes
depends on variables such as race. The issue of measuring within-cluster dependence will
be addressed in Section 1.4, whereas more formal statistical modelling for correlated data
with a focus on regression will be the central theme of Chapter 5.
The examples introduced above, on the other hand, differ from one another in some
important ways. First, the type of response varies from binary (Example 1.1), count
(Example 1.3) and continuous (Examples 1.2 and 1.4) to survival subject to censoring
(Example 1.5). Secondly, they differ in the structure of within-cluster dependence. For
example, if the sampling unit for the Baltimore Eye Survey Study were the household,
with each member of a sampled household offered an eye examination, a more
complicated dependence structure would result since one would expect responses from
family members to be correlated with each other. Furthermore, the degree of association
between fellow eyes is likely to be stronger than that between eyes from two related
subjects. Designs of this kind are special cases of the so-called nested design. Examples
include intervention studies in which children (patients) are nested within households
(clinicians), which in turn are nested within villages (clinics). Special care is required to
model within-cluster dependence for data collected from nested designs or from family
studies, where the degree of association for siblings is likely to be different from that of
parents (Qaqish and Liang, 1992).
Finally, the examples differ from one another with respect to their focus. In
longitudinal studies (Example 1.4), the regression of response variables on explanatory
variables such as time (in weeks) from randomization, treatment status and their
interactions is most important, whereas the within-cluster dependence may be viewed as
a nuisance. On the other hand, it is typical in family studies (Example 1.2) that the
primary focus is on the role genetic factors play in disease, as captured by the percent of
total variation in response variables attributable to genetic components. In this
situation, the within-cluster association represents the primary focus, even though
regression adjustment for the explanatory variables is crucial to separate the
environmental impact from the genetic one. Table 1.1 summarizes both the common
features and the differences among the five examples.
Table 1.1: Some features of five examples in Section 1.1.

                                              Scientific Focus
Example         Cluster      Response      Mean       Dependence
Eye survey      individual   binary        primary    secondary
COPD            family       continuous    nuisance   primary
SCE             individual   count         primary    nuisance
Schizophrenic   individual   continuous    primary    nuisance
Retinopathy     individual   survival      primary    secondary
1.3 Impact of Ignoring Within-Cluster Dependence
A natural question to raise is what happens if one ignores the within-cluster
dependence and uses the conventional regression method assuming independence among
observations. In the Baltimore Eye Survey example, this amounts to treating 10,398
observations from 5,199 individuals as if they were collected from 10,398 individuals with
one observation (which could be from the right or left eye) per individual. From a
statistical viewpoint, there are at least two consequences of ignoring within-cluster
dependence: incorrect assessment of the precision of regression coefficient estimates and
inefficient estimation of regression coefficients. We now address these two issues in
greater detail through a simple model for correlated data.
Let $Y_{ij}$ be the $j$th observation from the $i$th cluster, $j = 1, \ldots, n$,
$i = 1, \ldots, m$. Here $n$ is the common cluster size and $m$ the number of clusters.
We assume for each $i$,
$$E(Y_{ij}) = \beta_0 + \beta_1 x_{ij}, \quad \mathrm{Var}(Y_{ij}) = \sigma^2, \quad
\mathrm{Cov}(Y_{ij}, Y_{ik}) = \rho\sigma^2, \quad j < k = 1, \ldots, n. \tag{1.1}$$
This model assumes that the expected value of $Y$ is a simple linear function of a
covariate, $x$, and that the correlation, as measured by the correlation coefficient,
between each pair of responses from the same cluster has the same value, $\rho$, say.
The focus here is on the inference for $\beta_1$ when ignoring the fact that responses
from the same cluster are correlated with each other. This amounts to the use of the
ordinary least squares estimator $\hat\beta_1$ to estimate $\beta_1$, where
$$\hat\beta_1 = \sum_{i=1}^{m}\sum_{j=1}^{n} (y_{ij} - \bar y)(x_{ij} - \bar x) \Big/
\sum_{i=1}^{m}\sum_{j=1}^{n} (x_{ij} - \bar x)^2,$$
with $\bar y$ and $\bar x$ being the overall sample means of the $y_{ij}$'s and
$x_{ij}$'s.
1.3.1 Incorrect Variance
Ignoring the within-cluster association leads to the use of
$$V_1 = \sigma^2 \Big/ \sum_{i,j} (x_{ij} - \bar x)^2 = \sigma^2 / V_T,$$
where $\bar x = \sum_{i,j} x_{ij} / (nm)$, as the variance estimate of $\hat\beta_1$. The
correct variance, $V_2$, of $\hat\beta_1$ has the form
$$V_2 = V_1 \{1 + \rho(n\delta - 1)\},$$
where $\delta = n \sum_{i=1}^{m} (\bar x_i - \bar x)^2 / V_T$, with
$\bar x_i = \sum_j x_{ij}/n$ the mean of the $x_{ij}$'s in the $i$th cluster, is the fraction
of the total variation in the $x$'s explained by the between-cluster variation in the
$\bar x_i$'s. Two important cases of $\delta$ deserve special attention. One is when
$\delta = 0$, i.e. $\bar x_1 = \cdots = \bar x_m$. This occurs commonly in longitudinal
studies in which every subject's response is measured at the same set of times. In this
case, $\beta_1$ is estimated mainly by using the within-cluster changes in $Y$. The other
extreme case is when $\delta = 1$, i.e. $x_{i1} = x_{i2} = \cdots = x_{in}$ for all
$i \le m$. This is typical in studies such as the Baltimore Eye Survey Study where all the
covariates are cluster-specific.
Figure 1.1: The plot of the logarithm of $V_1/V_2$ versus $\rho$ for selected cluster
sizes $n$ (2, 5, 10) and $\delta$ (0, 0.2, 0.5, 0.8, 1.0). Here, $V_1$ is the variance of the
least squares estimate $\hat\beta_1$ when the within-cluster dependence is ignored and
$V_2$ is the correct variance. We have assumed $E(Y_{ij}) = \beta_0 + \beta_1 x_{ij}$ and
$\rho = \mathrm{corr}(Y_{ij}, Y_{ik})$, $j < k = 1, \ldots, n$; $\delta$ is the ratio of the
between-cluster variance to the total variance among the $x_{ij}$'s.
Figure 1.1 shows plots of $\log(V_1/V_2) = -\log\{1 + \rho(n\delta - 1)\}$ against
$\rho$ for some selected $n$'s and $\delta$'s. The message is rather clear. For
within-cluster comparisons, i.e. $\delta = 0$, the confidence interval based on $V_1$ is
wider than it should be and the discrepancy between $V_1$ and $V_2$ increases with
$\rho$. On the other hand, for between-cluster comparisons, i.e. $\delta = 1$, the
confidence interval based on $V_1$ is too narrow and the discrepancy also increases with
$\rho$. In either case, invalid scientific conclusions, which could be either false positive
or false negative, may be drawn if $V_1$ is used as the variance estimate of
$\hat\beta_1$.
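As an informal numerical check of the variance formulas in this subsection, the following is a minimal Python sketch. It assumes the least squares slope is centred at the overall mean of the covariate, writes $\rho$ for the within-cluster correlation and $\delta$ for the between-cluster fraction of the variation in $x$, and uses a hypothetical function name (`slope_variances`) and toy designs that are not taken from any of the examples. It evaluates the correct variance by direct summation of the covariances under model (1.1) and compares it with the closed form $V_1\{1 + \rho(n\delta - 1)\}$.

```python
def slope_variances(x, rho, sigma2=1.0):
    """x is a list of m clusters, each a list of n covariate values.

    Returns (V1, V2 by direct summation, V2 by the closed form, delta).
    """
    m, n = len(x), len(x[0])
    grand_mean = sum(sum(row) for row in x) / (m * n)
    a = [[x_ij - grand_mean for x_ij in row] for row in x]
    v_t = sum(a_ij ** 2 for row in a for a_ij in row)   # total variation V_T
    v1 = sigma2 / v_t                                   # variance assuming independence

    # The slope equals sum_ij a_ij * Y_ij / V_T. Clusters are independent, so
    # its variance adds up the within-cluster quadratic forms a_i' Cov(Y_i) a_i.
    quad = 0.0
    for row in a:
        for j in range(n):
            for k in range(n):
                cov_jk = sigma2 if j == k else rho * sigma2
                quad += row[j] * cov_jk * row[k]
    v2_direct = quad / v_t ** 2

    between = sum((sum(row) / n - grand_mean) ** 2 for row in x)
    delta = n * between / v_t
    v2_formula = v1 * (1.0 + rho * (n * delta - 1.0))
    return v1, v2_direct, v2_formula, delta


# Toy design 1: the same observation times in every cluster (delta = 0).
same_times = [[0.0, 1.0, 2.0, 4.0] for _ in range(6)]
# Toy design 2: a covariate that is constant within each cluster (delta = 1).
cluster_level = [[float(i)] * 4 for i in range(6)]

for design in (same_times, cluster_level):
    v1, v2_direct, v2_formula, delta = slope_variances(design, rho=0.5)
    print(f"delta={delta:.2f}  V2/V1 direct={v2_direct / v1:.3f}  "
          f"formula={v2_formula / v1:.3f}")
```

In the first design the ratio $V_2/V_1$ should equal $1 - \rho$, and in the second $1 + (n - 1)\rho$, which are the two extreme cases discussed above.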
1.3.2 Efficiency Loss
While the ordinary least squares estimate $\hat\beta_1$ remains unbiased as an
estimator of $\beta_1$, the well-known Gauss-Markov theorem suggests that the
uncertainty in estimating $\beta_1$ may be reduced by using the weighted least squares
estimator, $\tilde\beta_1$ say, which properly accounts for the within-cluster
association. Under the correlation structure specified in (1.1), $\tilde\beta_1$ has
variance of the form
$$V_3 = V_1 (1 - \rho)\{1 + (n - 1)\rho\} / \{1 - \rho + n\rho(1 - \delta)\}.$$
Figure 1.2 shows plots of $V_3/V_2$ against $\rho$ for selected $n$'s and $\delta$'s.
Interestingly, no efficiency loss occurs when $\delta$ is equal to 0 or 1, irrespective of
$\rho$ and $n$. On the other hand, a great deal of efficiency loss from using
$\hat\beta_1$ may result when $\delta$ approaches 0.5. This phenomenon is more
apparent with increased $\rho$ but less so for $n$. The main message of Figure 1.2 is
that ignoring correlation could lead to a substantial loss of statistical power, especially
when both within-cluster and between-cluster information is being used to estimate
$\beta_1$, and that it is important to examine and incorporate the within-cluster
association structure so as to improve the efficiency of the inference.

Figure 1.2: The plot of $V_3/V_2$ versus $\rho$ for selected $n$ (2, 5, 10) and $\delta$
(0, 0.2, 0.5, 0.8, 1.0). Here $V_2$, $\rho$, $\delta$ and the assumed model are the same
as described in Figure 1.1; $V_3$ is the variance of the best unbiased estimate of
$\beta_1$.
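The efficiency comparison can be explored numerically with the following minimal Python sketch. It assumes the exchangeable correlation structure of model (1.1), with $\rho$ the within-cluster correlation and $\delta$ the between-cluster fraction of the variation in $x$, and relies on the standard identity that the inverse of $(1-\rho)I + \rho J$ is proportional to $I - cJ$ with $c = \rho/\{1 + (n-1)\rho\}$. The function names and the toy design `mixed` are hypothetical. The sketch computes the variance of the weighted least squares slope from the inverse of $\sum_i X_i' V^{-1} X_i$ and compares it with the closed form $V_1(1-\rho)\{1+(n-1)\rho\}/\{1-\rho+n\rho(1-\delta)\}$.

```python
def design_summary(x):
    """Return (V_T, delta) for a list of m clusters of n covariate values."""
    m, n = len(x), len(x[0])
    grand_mean = sum(sum(row) for row in x) / (m * n)
    v_t = sum((x_ij - grand_mean) ** 2 for row in x for x_ij in row)
    between = sum((sum(row) / n - grand_mean) ** 2 for row in x)
    return v_t, n * between / v_t


def gls_slope_variance(x, rho, sigma2=1.0):
    """Variance of the weighted least squares slope under model (1.1).

    Accumulates sum_i X_i' V^{-1} X_i with X_i = [1, x_i] and returns the
    slope entry of its inverse. Requires -1/(n-1) < rho < 1.
    """
    n = len(x[0])
    c = rho / (1.0 + (n - 1) * rho)      # V^{-1} is proportional to I - c*J
    w = 1.0 / (sigma2 * (1.0 - rho))
    i00 = i01 = i11 = 0.0
    for row in x:
        s = sum(row)
        i00 += w * (n - c * n * n)                          # 1' V^{-1} 1
        i01 += w * (s - c * n * s)                          # 1' V^{-1} x_i
        i11 += w * (sum(v * v for v in row) - c * s * s)    # x_i' V^{-1} x_i
    return i00 / (i00 * i11 - i01 * i01)


def v3_closed_form(v_t, n, rho, delta, sigma2=1.0):
    """Closed-form variance of the weighted least squares slope."""
    v1 = sigma2 / v_t
    return v1 * (1.0 - rho) * (1.0 + (n - 1) * rho) / (1.0 - rho + n * rho * (1.0 - delta))


def efficiency(n, rho, delta):
    """Ratio V3 / V2; the common factor V1 cancels."""
    v2_over_v1 = 1.0 + rho * (n * delta - 1.0)
    v3_over_v1 = (1.0 - rho) * (1.0 + (n - 1) * rho) / (1.0 - rho + n * rho * (1.0 - delta))
    return v3_over_v1 / v2_over_v1


# Hypothetical design with both within- and between-cluster variation in x.
mixed = [[0.0, 1.0, 3.0], [2.0, 2.5, 7.0], [1.0, 4.0, 4.5], [6.0, 0.5, 2.0]]
v_t, delta = design_summary(mixed)
print(gls_slope_variance(mixed, rho=0.3), v3_closed_form(v_t, 3, 0.3, delta))

for d in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(f"delta={d:.1f}  V3/V2={efficiency(5, 0.5, d):.3f}")
```

Under these formulas the ratio is symmetric in $\delta$ about 0.5, so the values printed for $\delta = 0.2$ and $\delta = 0.8$ coincide.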
1.4 Measures of Within-Cluster Dependence
Adequately describing the pattern of within-cluster dependence is important in several
regards: (i) it may represent the main scientific objective, as is typically the case in
family studies (Example 1.2), and (ii) it may help to characterize more precisely the
relationship between the means and covariates, as asserted in Section 1.3.2. In this
section, we briefly discuss how one measures the within-cluster association and its
patterns.
For continuous responses, the most commonly used measure of dependence between
a pair of responses from the same cluster is the correlation coefficient, defined as
$$\rho(Y_j, Y_k) = \frac{\mathrm{Cov}(Y_j, Y_k)}{\{\mathrm{Var}(Y_j)\mathrm{Var}(Y_k)\}^{1/2}},
\quad j < k = 1, \ldots, n.$$
This quantity takes values between $-1$ and 1 inclusive. Very little dependence
between $Y_j$ and $Y_k$ may be claimed if $\rho$ is close to zero, and a strong
association is evident if $\rho$ is close to either 1 or $-1$. Furthermore, a positive
association between $Y_j$ and $Y_k$, i.e. $\rho > 0$, means that $Y_j$ tends to be larger
than expected if $Y_k$ is, and vice versa. Finally, this measure of association has the
desirable properties of being dimensionless and symmetric. The former means that
$\rho$ for $Y_j$ and $Y_k$ is the same as that for $cY_j$ and $cY_k$, where $c$ is a
non-zero constant. Thus, the same magnitude of $\rho$ results whether one uses, for
example, meters or feet as the unit for height or length. The latter matches the intuition
that the association between $Y_j$ and $Y_k$ is the same concept as that between $Y_k$
and $Y_j$. For a cluster of size greater than or equal to 3, this raises the question of how
one distinguishes between, for example, $\rho_{1,2}$ and $\rho_{1,3}$, the correlation
coefficients between $Y_1$ and $Y_2$ and between $Y_1$ and $Y_3$, respectively. The
answer to this question depends on the nature of the clustering and sometimes on how
the scientific objective is formulated.
For longitudinal studies, it is a common belief that the correlation coefficient between
two observations adjacent in time is likely to be larger than that between two
observations far apart in time. This is a part of the notion known as tracking in
longitudinal studies. To capture this phenomenon, one may model
$$\rho(Y_j, Y_k) = \rho^{|t_j - t_k|},$$
where $t_j$ and $t_k$ are the times at which $Y_j$ and $Y_k$ are observed, respectively.
This pattern is known as the AR-1 (autoregressive of order one) model. For more
detailed discussion on describing and empirically examining patterns of within-subject
association for longitudinal data, we refer readers to the work by Diggle (1988), Diggle,
Liang and Zeger (1994) and the references therein.
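The AR-1 pattern can be made concrete with a short Python sketch. The function name `ar1_correlation` and the value 0.8 for the base correlation are illustrative assumptions; the observation times mirror the baseline and follow-up weeks of Example 1.4. The sketch assumes a base correlation between 0 and 1, so that the power is well defined for any time lag.

```python
def ar1_correlation(times, rho):
    """Correlation matrix with entries rho ** |t_j - t_k| (assumes 0 <= rho <= 1)."""
    return [[rho ** abs(t_j - t_k) for t_k in times] for t_j in times]


weeks = [0, 1, 2, 4, 6, 8]                 # baseline and weeks 1, 2, 4, 6, 8
corr = ar1_correlation(weeks, rho=0.8)     # 0.8 is an arbitrary illustrative value

for row in corr:
    print("  ".join(f"{r:.3f}" for r in row))
```

Each row of the printed matrix decays away from the diagonal, so observations one week apart are more strongly correlated than observations eight weeks apart, which is the tracking behaviour described above.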
For family studies of first-degree relatives, i.e. parents, siblings and offspring, the
pattern of within-family association is likely to be different from that in longitudinal
studies. For siblings, it would be sensible to assume that the degree of association, as
measured by $\rho$, is the same for each pair of siblings. However, it would be equally
sensible to allow this correlation coefficient ($\rho_{SS}$) to be different from that of
parents, $\rho_{PP}$. Indeed, any sensible statistical procedure should be flexible enough
to allow one to test the equality of $\rho_{SS}$ and $\rho_{PP}$. Assuming that the same
environment is shared by all relatives in the same family, a larger $\rho_{SS}$ than
$\rho_{PP}$ would suggest that, for the continuous responses considered, known as
quantitative traits in the field of genetic epidemiology, genetic factors may play a
non-trivial role. For more detailed discussion on how one may model and estimate
patterns of within-family association, see, for example, Beaty et al. (1997).
The correlation coefficient is less useful as a measure of association for binary
responses. This is because the range of $\rho$ is narrowed considerably by the constraint
that
$$\max(0, \mu_j + \mu_k - 1) < \Pr(Y_j = Y_k = 1) < \min(\mu_j, \mu_k),$$
where $\mu_j = \Pr(Y_j = 1)$ and $\mu_k = \Pr(Y_k = 1)$. Furthermore, the degree of
constraint depends on the values of the $\mu$'s. Table 1.2 shows ranges of $\rho$ as a
function of selected $\mu_j$ and $\mu_k$. For example, if the true $\mu_j$ ($\mu_k$) is
0.1 (0.3), then the corresponding $\rho$ between $Y_j$ and $Y_k$ must be greater than
$-0.22$ but less than 0.51, much narrower than the ideal range of $(-1, 1)$. This
constraint on the range of $\rho$ makes its use much less desirable for binary responses
than for continuous responses, especially when characterizing within-cluster association
represents the main objective.

Table 1.2: Ranges of correlation coefficients for a pair of binary responses
with corresponding means $\mu_j$ and $\mu_k$, respectively.

                            $\mu_k$
$\mu_j$     0.1            0.3            0.5
0.1         (-.11, 1.0)    (-.22, .51)    (-.33, .33)
0.2         (-.17, .67)    (-.33, .76)    (-.50, .50)
0.5         (-.33, .33)    (-.65, .65)    (-1.0, 1.0)

As an alternative, one may consider the use of the odds ratio (OR)
$$\mathrm{OR}(Y_j, Y_k) = \frac{\Pr(Y_j = Y_k = 1)\,\Pr(Y_j = Y_k = 0)}
{\Pr(Y_j = 1, Y_k = 0)\,\Pr(Y_j = 0, Y_k = 1)},$$
as a measure of association between a pair of responses, $Y_j$ and $Y_k$. For this quantity, no
constraint is induced other than the fact that it must be positive. A positive association
results if OR is greater than one and a negative association may be claimed if OR is
less than one. Furthermore, the odds ratio is symmetric as well and is familiar to public
health researchers for its ease in interpretation. Just as in continuous response situations,
one can model the odds ratio, or more preferably its log version, through regression. For
a more detailed discussion on the use of odds ratios for within-cluster associations, see
Heagerty and Zeger (1998) for longitudinal data and Liang and Beaty (1991) for family
data. Finally, different measures of association between discrete variables have been
suggested in the literature; see Goodman and Kruskal (1979). We choose OR as the
primary measure of within-cluster dependence for discrete data, mainly because it is
easy to interpret and is familiar to biomedical researchers.
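The range restriction on the correlation coefficient discussed above can be checked numerically. The following is a minimal sketch, not part of the original notes: the function names `correlation_range` and `odds_ratio` are ours, and the bounds on Pr(Y_j = Y_k = 1) are taken directly from the constraint stated in the text.

```python
import math

def correlation_range(mu_j, mu_k):
    """Attainable range of corr(Y_j, Y_k) for binary responses with the given means."""
    lo_p11 = max(0.0, mu_j + mu_k - 1.0)   # smallest feasible Pr(Y_j = Y_k = 1)
    hi_p11 = min(mu_j, mu_k)               # largest feasible Pr(Y_j = Y_k = 1)
    sd = math.sqrt(mu_j * (1 - mu_j) * mu_k * (1 - mu_k))
    return (lo_p11 - mu_j * mu_k) / sd, (hi_p11 - mu_j * mu_k) / sd

def odds_ratio(p11, mu_j, mu_k):
    """Odds ratio implied by the joint success probability p11 and the two means."""
    p10 = mu_j - p11                       # Pr(Y_j = 1, Y_k = 0)
    p01 = mu_k - p11                       # Pr(Y_j = 0, Y_k = 1)
    p00 = 1.0 - mu_j - mu_k + p11          # Pr(Y_j = Y_k = 0)
    return (p11 * p00) / (p10 * p01)

lo, hi = correlation_range(0.1, 0.3)
print(round(lo, 2), round(hi, 2))          # -0.22 0.51
```

For means 0.1 and 0.3 this reproduces the corresponding entry of Table 1.2, whereas the odds ratio can take any positive value for the same pair of means; under independence (p11 = 0.03) it equals one.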
Chapter 2
Regression Analysis: An
Overview
As mentioned earlier, regression analysis has been and remains one of the most commonly
used statistical tools in biomedical research. In clinical research, this technique is useful
for assessing the degree of treatment effects when randomized clinical trials are conducted.
It can also be used by clinicians to identify prognostic factors of disease for prediction
purposes. In public health research, regression analysis is instrumental in identifying
risk factors of disease for prevention purposes. It is also useful in observational studies to
adjust for confounding variables that were not accounted for by matching at the design stage.
In this chapter, we provide a chronological overview of some recent developments in regression
techniques, with a focus on generalized linear models, quasi-likelihood
methods and estimating function approaches. Since these methods are, by and large,
likelihood-based, we first review some key issues in likelihood-based inference that are
particularly relevant to subsequent developments.
2.1 Some Issues Concerning Likelihood-Based Inference
Let $f(\cdot; \theta, \phi)$ be the probability (density) function for a random variable, $Y$, of size $m$
indexed by two sets of parameters, $\theta$ of dimension $p$ and $\phi$ of dimension $q$. Here $\theta$
represents the parameters of interest, reflecting the scientific question, and $\phi$ the so-called
nuisance parameters. Nuisance parameters, by definition, are of little intrinsic interest to
investigators, yet are necessary to fully specify the random mechanism, $f(\cdot; \theta, \phi)$, by which
the data, $Y = y$, are generated. A typical example of a nuisance parameter is the exposure
probability among controls in case-control studies, in which the interest is in the association
between the disease under study and the hypothesized exposure, characterized by
the odds ratio, $\theta$. Once the data are observed, the only quantities in $f$ that are unknown
to investigators are $\theta$ and $\phi$. The likelihood function is formally defined
as a function of $\theta$ and $\phi$ that is proportional to $f(y; \theta, \phi)$, i.e.
$$L(\theta, \phi) \equiv L(\theta, \phi \mid Y = y) \propto f(y; \theta, \phi). \tag{2.1}$$
2.1.1 Impact of Nuisance Parameters
When the value of the nuisance parameters $\phi$ is known, one can contrast two distinct values
of $\theta$, $\theta_1$ and $\theta_2$, by computing the likelihood ratio
$$LR(\theta_1, \theta_2) = \frac{L(\theta_1, \phi)}{L(\theta_2, \phi)}. \tag{2.2}$$
A likelihood ratio greater than one suggests that there is evidence in favor
of $\theta_1$ over $\theta_2$, as conveyed by the data through the likelihood function (e.g.
Royall, 1997). Furthermore, the larger the likelihood ratio, the stronger
the evidence. However, when $\phi$ is unknown to investigators, one faces the additional
complication of having to specify $\phi$ when computing $LR(\theta_1, \theta_2)$. The following
example from a teratological experiment (Weil, 1970) illustrates that in the absence of
knowledge of $\phi$, the presence of nuisance parameters may have a profound impact on
likelihood inference for the parameters of interest. Figure 2.1 shows plots of $\log L(\theta, \phi)$
against $\theta$ for selected $\phi$ values. Here the likelihood function is proportional to a product
over 16 observations following a beta-binomial distribution indexed by $\theta$, the probability of
surviving a 21-day lactation period for a fetus, and $\phi$, the nuisance parameter characterizing
the so-called litter effect, an effect commonly observed in teratological experiments.
More detailed treatment of the analysis of teratological experiments in general and of this
example in particular will be given in Chapter 3. Choosing $\theta_1 = 0.8$ and $\theta_2 = 0.6$ as two
plausible values for $\theta$, we note that different values of $\phi$ lead to different conclusions as to
which is favored by the data. For example, $L(0.8, 0.8)/L(0.6, 0.8) = \exp(-2.89)$
strongly suggests that $\theta = 0.6$ is favored over $\theta = 0.8$ when $\phi = 0.8$ is used, whereas
$L(0.8, 0.1)/L(0.6, 0.1) = \exp(5.82)$ strongly supports the opposite conclusion.
Figure 2.1: Beta-binomial log-likelihoods for the exposed group based on
the data reported by Weil (1970).
Nuisance parameters can have a non-trivial impact on estimating $\theta$ as well. The
well-known Neyman-Scott problem (Neyman and Scott, 1948) demonstrates
that with many nuisance parameters at hand, the conventional maximum likelihood
approach for estimating $\theta$ may not even be consistent. Consider $\{Y_{ij}, j = 1, \ldots, n\}$,
independent observations from the $i$th of $m$ independent clusters, following a normal
distribution with mean $\phi_i$ and a common variance $\theta$, $i = 1, \ldots, m$. The likelihood function
for $\theta$ and $\phi = (\phi_1, \ldots, \phi_m)$ is proportional to
$$L(\theta, \phi_1, \ldots, \phi_m) \propto \prod_{i=1}^{m} \theta^{-n/2} \exp\Big\{-\sum_{j=1}^{n} (y_{ij} - \phi_i)^2 / (2\theta)\Big\}.$$
It is easy to verify that the maximum likelihood estimator (MLE) of $\theta$, obtained by
jointly maximizing the likelihood function with respect to $\theta$ and $\phi$, has the form
$$\hat\theta = \sum_{i=1}^{m} \sum_{j=1}^{n} (y_{ij} - \bar y_i)^2 / (nm),$$
where $\bar y_i$ is the sample mean of the $i$th cluster. Since each within-cluster sum of squares has
expectation $(n - 1)\theta$, and the cluster size, $n$, is fixed by design, $\hat\theta$
converges as $m \to \infty$ to $\theta(n - 1)/n$ instead of $\theta$. Such an undesirable phenomenon
for $\hat\theta$ is not specific to the example given above. Indeed, it is well known that when the
dimension of the nuisance parameters increases with the sample size, $m$ in this case, the MLE
of $\theta$ is in general inconsistent (e.g. Andersen, 1970). An important special case of this
kind is the estimation of the common odds ratio when the one-to-one matched case-control
design is adopted (Breslow and Day, 1980). Indeed, it has been shown (Breslow,
1981) that $\hat\theta$ converges, as the number of matched pairs increases, to $\theta^2$ instead of $\theta$.
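The inconsistency in the Neyman-Scott problem can be seen in a small simulation. This is an illustrative sketch only: the function name `neyman_scott_mle`, the uniform distribution used to draw the cluster means, the seed and the chosen values of $n$ and $\theta$ are all our assumptions, not part of the original example.

```python
import random

def neyman_scott_mle(m, n, theta, seed=0):
    """Joint MLE of the common variance theta from m normal clusters of size n."""
    rng = random.Random(seed)
    total_ss = 0.0
    for _ in range(m):
        phi_i = rng.uniform(-10.0, 10.0)                 # arbitrary cluster mean (nuisance)
        y = [rng.gauss(phi_i, theta ** 0.5) for _ in range(n)]
        ybar = sum(y) / n                                # MLE of phi_i
        total_ss += sum((v - ybar) ** 2 for v in y)
    return total_ss / (n * m)

theta, n = 4.0, 2
for m in (100, 1000, 20000):
    print(m, round(neyman_scott_mle(m, n, theta), 3), "limit:", theta * (n - 1) / n)
```

With $n = 2$ the estimates settle near $\theta/2$ rather than $\theta$ as $m$ grows, in line with the limit $\theta(n - 1)/n$.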
Many likelihood-based inferential procedures have been developed to eliminate, or at
least to reduce, the impact of nuisance parameters; see for example the work of Kalbfleisch
and Sprott (1970), Andersen (1970), Cox (1972), Lindsay (1982), Cox and Reid (1987),
Reid (1995) and references therein. These methods rely upon the availability of a
likelihood function that is fully and correctly specified. In §2.4 we present an alternative
approach, the estimating function approach, which also serves to eliminate (or reduce) the
impact of nuisance parameters without the need to fully specify the likelihood function.
2.1.2 Cautions on Wald's Procedure
The frequentist approach builds on the repeated sampling, or long run,
interpretation of inference (e.g. Cox and Hinkley, 1974). A central theme of
this approach is testing the null hypothesis that $\theta = \theta_0$, a pre-specified value, and
computing confidence intervals for $\theta$. Except when $f(\cdot; \theta, \phi)$ belongs to
an exponential family (Lehmann, 1959), one needs to appeal to large sample
approximations for the purpose, for example, of deriving confidence intervals. To this
end, three large sample inferential procedures have been developed and are employed on a
daily basis: likelihood-ratio-based, score-statistic-based and Wald-based. We now briefly
describe these three procedures from the viewpoint of testing the null hypothesis
$H_0: \theta = \theta_0$. Confidence intervals can be obtained from the corresponding hypothesis
tests in a straightforward manner, and this will mostly be omitted. As the name
suggests, the likelihood-ratio-based method relies upon the ratio of the likelihood
function evaluated at $\theta_0$ and at $\hat\theta$ for testing $H_0$, i.e.
$$T_1 = -2 \log \frac{L(\theta_0, \hat\phi(\theta_0))}{L(\hat\theta, \hat\phi)},$$
where $(\hat\theta, \hat\phi)$ is the MLE of $(\theta, \phi)$ and $\hat\phi(\theta_0)$ is the MLE of $\phi$ under $H_0$. The
score-statistic-based procedure relies upon one of the key properties of the likelihood function, namely,
$$E(S(\theta, \phi); \theta, \phi) = 0 \quad \text{for all } \theta, \phi, \tag{2.3}$$
where
$$S(\theta, \phi) = \partial \log L(\theta, \phi)/\partial(\theta, \phi)$$
is known as the score function for $(\theta, \phi)$. Based on (2.3), one would reject $H_0$, on
intuitive grounds, if $S(\theta_0, \phi)$ is large. To ensure that this assessment of $S(\theta_0, \phi)$ being
large is not affected by the units of the $Y_i$'s, one needs to standardize $S$ by considering
$$T_2 = S^t(\theta_0, \hat\phi(\theta_0))\, I^{-1}(\theta_0, \hat\phi(\theta_0))\, S(\theta_0, \hat\phi(\theta_0)),$$
where
$$I(\theta, \phi) = E\big(-\partial^2 \log L(\theta, \phi)/\partial(\theta, \phi)^2\big) = \mathrm{Cov}(S(\theta, \phi))$$
is known as the Fisher information matrix for $(\theta, \phi)$. Finally, in contrast to the LR
procedure, in which $\theta_0$ and $\hat\theta$ are compared indirectly through the likelihood function,
Wald's method uses
$$T_3 = (\hat\theta - \theta_0)^t \big(I_{\theta\theta} - I_{\theta\phi} I_{\phi\phi}^{-1} I_{\theta\phi}^t\big)\big|_{(\hat\theta, \hat\phi)} (\hat\theta - \theta_0)$$
to test $H_0$. Here we have re-expressed $I(\theta, \phi)$ as
$$I(\theta, \phi) = \begin{pmatrix} I_{\theta\theta} & I_{\theta\phi} \\ I_{\theta\phi}^t & I_{\phi\phi} \end{pmatrix},$$
where, for example,
$$I_{\theta\theta} = E\big(-\partial^2 \log L(\theta, \phi)/\partial\theta^2\big)$$
and
$$I_{\theta\phi} = E\big(-\partial^2 \log L(\theta, \phi)/\partial\theta\,\partial\phi\big).$$
Detailed treatments of these three large sample procedures can be found in Rao
(1973, §6e). In particular, it was shown that under some regularity conditions, all
three test statistics converge, as $m \to \infty$, to a $\chi^2_p$ distribution under $H_0$ and, furthermore, are
asymptotically equivalent to each other, i.e. the pairwise differences among $T_1$, $T_2$ and
$T_3$ converge to zero in probability. Despite the similarity among these
three procedures when the sample is large, the Wald procedure possesses some undesirable
properties that are not shared by the other two procedures. First, this procedure is not
invariant in the following sense. Let $\omega = \omega(\theta)$ be a one-to-one function of $\theta$ so that the null
hypothesis $\theta = \theta_0$ is equivalent to $\omega = \omega(\theta_0) = \omega_0$. Unlike $T_1$ and $T_2$, Wald's test
statistic differs numerically depending on the choice of parameter used to formulate
the null hypothesis. The following hypothetical example on testing the hypothesis of no
association from a $2 \times 2$ table illustrates the shortcoming. With 8 of 18 cases and 2 of 17
controls exposed to a pre-specified risk factor, one has the MLE for $\theta$, the odds ratio,
equal to 6.0 with an estimated standard error (s.e.) of 5.34. This leads to a Wald test
statistic value of $(6 - 1)^2/(5.34)^2 = 0.88$ with p-value > 0.1. On the other hand, if one
used $\omega = \log\theta$ as the basis for inference, the corresponding Wald test statistic is equal
to $(\log 6 - 0)^2/(0.89)^2 = 4.05$, statistically significant at the 0.05 level.
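The lack of invariance can be reproduced from the four cell counts. The sketch below is ours and rests on two assumptions not stated in the text: the standard error of the log odds ratio is taken to be the usual large-sample formula $\sqrt{1/a + 1/b + 1/c + 1/d}$, and the standard error of the odds ratio itself is obtained from it by the delta method. These choices reproduce the quoted standard errors of 0.89 and about 5.34.

```python
import math

# 2 x 2 table from the text.
a, b = 8, 10    # cases: exposed, unexposed
c, d = 2, 15    # controls: exposed, unexposed

odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # assumed large-sample s.e. of log(OR)
se_or = odds_ratio * se_log_or                         # delta-method s.e. of OR

wald_or = (odds_ratio - 1.0) ** 2 / se_or ** 2             # H0 stated as OR = 1
wald_log_or = math.log(odds_ratio) ** 2 / se_log_or ** 2   # H0 stated as log(OR) = 0

def chi2_1_pvalue(x):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

print(round(wald_or, 2), round(chi2_1_pvalue(wald_or), 2))
print(round(wald_log_or, 2), round(chi2_1_pvalue(wald_log_or), 3))
```

The same data and the same null hypothesis give a Wald statistic below one on the odds ratio scale and about four on the log scale; small differences from the figures in the text come from rounding the standard errors.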
The other drawback of the Wald procedure is that in certain circumstances $T_3$ may be
powerless when the true value, $\theta^*$ say, is far from $\theta_0$, i.e. $|\theta^* - \theta_0|$ is large in
magnitude. This counter-intuitive phenomenon may be explained at least heuristically as
follows. For simplicity, we assume $p = 1$ and that there are no nuisance parameters, i.e. $q = 0$.
It is true in general that the asymptotic variance of $\hat\theta$, $I^{-1}(\theta)$, depends on $\theta$. While the
numerator of $T_3$, $(\hat\theta - \theta_0)^2$, increases as $|\theta^* - \theta_0|$ increases since $\hat\theta \to \theta^*$, so does the
denominator of $T_3$, $I^{-1}(\hat\theta)$. In circumstances where $I^{-1}(\hat\theta)$ increases at a faster rate
than $(\hat\theta - \theta_0)^2$ does, $T_3$ becomes arbitrarily small in spite of the fact that the true $\theta$, $\theta^*$, is
very different from $\theta_0$. This undesirable property of $T_3$ was first observed and explained
for logistic regression models by Hauck and Donner (1977). It remains an open question
how one may characterize the circumstances leading to this peculiar behavior of
the Wald procedure.
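A one-sample binomial setting, which is our own toy illustration rather than the example analyzed by Hauck and Donner, already shows the effect. With $y$ successes out of $n$, take $\theta$ to be the log odds and test $\theta_0 = 0$; the function name `wald_and_lr` and the chosen counts are hypothetical.

```python
import math

def wald_and_lr(y, n, p0=0.5):
    """Wald (log-odds scale) and likelihood ratio statistics for H0: p = p0, 0 < y < n."""
    p_hat = y / n
    theta_hat = math.log(p_hat / (1 - p_hat))          # MLE of the log odds
    theta_0 = math.log(p0 / (1 - p0))
    info = n * p_hat * (1 - p_hat)                     # Fisher information at theta_hat
    wald = (theta_hat - theta_0) ** 2 * info
    lr = 2 * (y * math.log(p_hat / p0) + (n - y) * math.log((1 - p_hat) / (1 - p0)))
    return wald, lr

for y in (75, 90, 99):
    wald, lr = wald_and_lr(y, 100)
    print(y, round(wald, 1), round(lr, 1))
```

Moving from 90 to 99 successes out of 100 takes the estimate further from the null value, and the likelihood ratio statistic grows accordingly, yet the Wald statistic falls because the information at $\hat\theta$ shrinks faster than $(\hat\theta - \theta_0)^2$ grows.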
For illustration, consider the following time-to-relapse data from a clinical trial
reported in Lee (1992). For five breast cancer patients receiving CMF ($x = 1$) after a radical
mastectomy, the times to relapse were recorded as 23, 16+, 18+, 20+, 24+ weeks, where
the + sign represents censoring. The five control patients ($x = 0$) all experienced
relapse, at 15, 18, 19, 19 and 20 weeks. A proportional hazards model with $x$, the treatment
status, as the sole covariate was fitted to the above data, resulting in $\hat\theta = -16.66$ and
s.e.($\hat\theta$) = 1513. This leads to a Wald test statistic value of 0.0001 with p-value = 0.991.
The other two test statistics, the likelihood ratio and score test statistics, amount to 8.76
and 6.90 with p-values 0.003 and 0.009, respectively, matching well the impression
conveyed by the data that the new treatment (CMF) prolongs time to relapse for breast
cancer patients receiving radical mastectomy.
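The three statistics can be recomputed from the ten observations. The sketch below is ours and makes several assumptions the text does not state: tied relapse times are handled by the Efron approximation, a patient censored at week $t$ is counted as still at risk at week $t$, and the fitted coefficient is taken to be negative (consistent with CMF patients relapsing later). Under these conventions the likelihood ratio and score statistics agree with the reported 8.76 and 6.90.

```python
import math

# (time in weeks, relapse indicator, treatment x); indicator 0 means censored.
data = [(23, 1, 1), (16, 0, 1), (18, 0, 1), (20, 0, 1), (24, 0, 1),
        (15, 1, 0), (18, 1, 0), (19, 1, 0), (19, 1, 0), (20, 1, 0)]

def partial_likelihood(theta):
    """Log partial likelihood, score and information at theta (Efron handling of ties)."""
    loglik = score = info = 0.0
    for t in sorted({time for time, event, _ in data if event == 1}):
        at_risk = [(x, math.exp(theta * x)) for time, _, x in data if time >= t]
        failed = [(x, math.exp(theta * x)) for time, event, x in data
                  if time == t and event == 1]
        r0 = sum(w for _, w in at_risk)
        r1 = sum(x * w for x, w in at_risk)
        f0 = sum(w for _, w in failed)
        f1 = sum(x * w for x, w in failed)
        k = len(failed)
        for l in range(k):
            s0 = r0 - l / k * f0               # Efron-adjusted risk-set total
            s1 = r1 - l / k * f1
            mean_x = s1 / s0                   # weighted mean of x over the risk set
            loglik -= math.log(s0)
            score -= mean_x
            info += mean_x * (1.0 - mean_x)    # x is 0/1, so its variance is mean * (1 - mean)
        loglik += theta * sum(x for x, _ in failed)
        score += sum(x for x, _ in failed)
    return loglik, score, info

theta_hat, se_hat = -16.66, 1513.0             # magnitudes from the text; sign assumed
l0, u0, i0 = partial_likelihood(0.0)
l1, _, _ = partial_likelihood(theta_hat)

lr_stat = 2.0 * (l1 - l0)
score_stat = u0 ** 2 / i0
wald_stat = (theta_hat / se_hat) ** 2
print(round(lr_stat, 2), round(score_stat, 2), round(wald_stat, 4))
```

The partial likelihood keeps increasing as $\theta$ decreases, which is why the reported estimate is large in magnitude with an enormous standard error, and why the Wald statistic is essentially zero while the other two are not.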
2.2 Generalized Linear Models
Multiple linear regression analysis and logistic regression analysis are employed daily
in biomedical research for the reasons given at the outset of this chapter. In 1972, these
methods, among others, were unified under the framework of generalized linear models
(GLMs) (Nelder and Wedderburn, 1972). We now briefly review this seminal work along with
several key properties of GLMs.
For a sample of size $m$, let $Y_i$ be the univariate response and $x_i$, a $p \times 1$ vector, be
the covariates thought to be related to $Y_i$, $i = 1, \ldots, m$. To specify a GLM, the following
two components are sufficient (Nelder and Wedderburn, 1972):
(A) (Random component). $Y_i$ is generated from an exponential dispersion family
(Tweedie, 1947; Jørgensen, 1987) of the form
$$f(Y_i = y_i; \theta_i, \phi) = \exp\Big\{\frac{\theta_i y_i - b(\theta_i)}{a(\phi)} + c(y_i; \phi)\Big\}, \tag{2.4}$$
(B) (Systematic component). The expectation of $Y_i$, denoted as $\mu_i$, is related to $x_i$
through the link function $h$, i.e.
$$\mu_i = E(Y_i \mid x_i) = h^{-1}(x_i^t \beta). \tag{2.5}$$
Some special cases of GLMs are as follows.
Example 2.1 (Multiple linear regression models). This corresponds to a special case
with $h(\mu) \equiv \mu$, the identity link function, $a(\phi) \equiv \phi$ and $Y_i$ being normally distributed
with mean $\mu_i = \theta_i$ and variance $\phi$.
Example 2.2 (Logistic regression models). For a binary response with $Y = 1$ or 0, this
popular model is a special case of GLMs with $h(\mu) = \log\{\mu/(1 - \mu)\}$, the logit link,
$a(\phi) \equiv 1$, and $Y_i$ following a binomial distribution of size 1 with probability of success
equal to $\mu_i$.
Example 2.3 (Poisson regression models). For count responses, which are common in
cohort analyses where the response is the number of events such as deaths or diagnoses of a disease,
this corresponds to the use of the log link, i.e. $h(\mu) = \log\mu$, $a(\phi) \equiv 1$ and a Poisson
distribution with mean $\mu_i$ as the basis for inference. Note that the index $i$ here may
represent a group of individuals sharing similar characteristics rather than a single individual.
Example 2.4 (Log-linear models). For data that can be expressed in contingency
tables, log-linear models are commonly used (e.g. Bishop, Fienberg and Holland, 1975).
Here, as in Example 2.3, the index $i$ represents a group of individuals sharing similar
characteristics which are categorical in nature. This model can also be cast as a GLM with
$h(\mu) = \log\mu$ and a Poisson distribution for $Y_i$, which is the number of individuals with $x_i$
as the covariate values. A major difference between this model and the Poisson regression
model lies in the objective. For log-linear models, the primary interest is in
examining the pattern of associations among the categorical variables represented by $x$. For
Poisson regression, it is of interest to identify risk factors characterized by $x$ that may be associated
with the risk of a particular event.
We now present several key properties of GLMs that are particularly relevant to the
subsequent development.
2.2.1 First Two Moments of GLMs
The exponential dispersion family displayed in (2.4) does not explicitly reveal how the
moments of $Y_i$, in particular the first two moments, are related to $\theta_i$ and $\phi$. Simple
algebra shows that
$$\mu_i = b'(\theta_i), \qquad \mathrm{Var}(Y_i) = a(\phi)\, b''(\theta_i). \tag{2.6}$$
The expression for the variance of $Y_i$ explains the rationale behind the phrase "dispersion
parameter" for $\phi$. Finally, given the relationship between $\theta_i$ and $\beta$ through (2.5) and
(2.6) above, it is clear that $f(Y; \theta, \phi)$ can also be expressed as $f(Y; \beta, \phi)$.
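The moment relations in (2.6) can be checked numerically for the families of Examples 2.1 to 2.3. In this sketch the cumulant functions $b(\theta)$ are the standard ones for each family (stated here as our assumption, since the notes do not list them), and the derivatives are approximated by central differences; the helper name `first_two_derivatives` is ours.

```python
import math

# Cumulant functions b(theta); a(phi) = phi for the normal and 1 for the other two.
families = {
    "normal":    lambda t: 0.5 * t * t,               # mu = theta, b'' = 1
    "bernoulli": lambda t: math.log1p(math.exp(t)),   # theta = logit(mu)
    "poisson":   lambda t: math.exp(t),               # theta = log(mu)
}

def first_two_derivatives(b, theta, eps=1e-4):
    """Central-difference approximations to b'(theta) and b''(theta)."""
    d1 = (b(theta + eps) - b(theta - eps)) / (2 * eps)
    d2 = (b(theta + eps) - 2 * b(theta) + b(theta - eps)) / eps ** 2
    return d1, d2

theta = 0.3
for name, b in families.items():
    mu, var_fn = first_two_derivatives(b, theta)
    print(name, round(mu, 4), round(var_fn, 4))
```

For the Bernoulli case the two derivatives return $\mu$ and $\mu(1 - \mu)$, and for the Poisson case both equal $\mu$, which are the familiar mean-variance relationships.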
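2.2.2 Score Function for $\beta$
Assuming that the $Y_i$'s are independent and represent a random sample from the targeted
population, the likelihood function for $\beta$ and $\phi$ is simply proportional to
$$L(\beta, \phi) \propto \prod_{i=1}^{m} f(y_i; \beta, \phi) = \exp\Big[\sum_{i=1}^{m} \Big\{\frac{\theta_i(\beta)\, y_i - b(\theta_i(\beta))}{a(\phi)} + c(y_i; \phi)\Big\}\Big].$$
Consequently, the score function for $\beta$ can be derived as
$$S_\beta(\beta, \phi) = \partial \log L(\beta, \phi)/\partial\beta = \sum_{i=1}^{m} \Big(\frac{\partial\mu_i}{\partial\beta}\Big)^t \mathrm{Var}^{-1}(Y_i; \beta, \phi)\,(y_i - \mu_i(\beta)). \tag{2.7}$$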
It is important to note that the score function for $\beta$ depends only on the first two
moments of the $Y_i$'s despite the full specification of $f$ through (A) in (2.4). This forms
the primary motivation behind the use of the quasi-likelihood advocated by Wedderburn
(1974), which is the focus of the next section. Furthermore, given that $a(\phi)$ appears as a
proportionality factor in $\mathrm{Var}(Y_i)$, no knowledge of $\phi$ is needed to derive the MLE of $\beta$, $\hat\beta$, by
solving $S_\beta(\beta, \phi) = 0$, even though the asymptotic variance of $\hat\beta$, $\Sigma(\beta, \phi)$, does depend
on $\phi$, where
$$\Sigma^{-1}(\beta, \phi) = \lim_{m \to \infty} \sum_{i=1}^{m} \Big(\frac{\partial\mu_i}{\partial\beta}\Big)^t \mathrm{Var}^{-1}(Y_i; \beta, \phi) \Big(\frac{\partial\mu_i}{\partial\beta}\Big) \Big/ m. \tag{2.8}$$
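As a concrete instance of solving the score equations, consider a Poisson model with log link and a single covariate. The sketch below uses Fisher scoring, which is one common way to solve $S_\beta = 0$ and is our choice rather than a method prescribed in the text; the function name `fit_poisson_log_link` and the data are hypothetical.

```python
import math

def fit_poisson_log_link(x, y, n_iter=25):
    """Solve the score equations (2.7) for log(mu_i) = b0 + b1 * x_i by Fisher scoring."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0        # start from the intercept-only fit
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # With the log link and Var(Y_i) = mu_i, (dmu_i/dbeta)^t Var^{-1}(Y_i) = (1, x_i)^t.
        s0 = sum(yi - mi for yi, mi in zip(y, mu))
        s1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Fisher information: sum_i mu_i (1, x_i)^t (1, x_i).
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        b0 += (i11 * s0 - i01 * s1) / det          # beta <- beta + I^{-1} S
        b1 += (i00 * s1 - i01 * s0) / det
    return b0, b1

x = [0, 1, 2, 3, 4]
y = [1, 3, 4, 9, 14]                               # hypothetical counts, for illustration only
b0, b1 = fit_poisson_log_link(x, y)
mu = [math.exp(b0 + b1 * xi) for xi in x]
score = (sum(yi - mi for yi, mi in zip(y, mu)),
         sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu)))
print(round(b0, 3), round(b1, 3), score)           # both score components are close to zero
```

Note that $a(\phi)$ never enters the iteration, in line with the remark that $\hat\beta$ can be found without knowledge of $\phi$.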
2.2.3 MLE Versus Weighted Least Squares
An alternative estimating procedure for $\beta$ is the well-known weighted least squares
approach. It amounts to minimizing the following objective function for $\beta$, namely,
$$Q(\beta, \phi) = \sum_{i=1}^{m} \frac{(y_i - \mu_i(\beta))^2}{\mathrm{Var}(Y_i; \beta, \phi)}, \tag{2.9}$$
the weighted sum of squared differences between the observed ($Y_i$'s) and the expected ($\mu_i$'s). Again,
the dispersion parameter $\phi$ plays no role in the estimation of $\beta$ through minimizing $Q$
in (2.9). While intuitive, this approach, contrary to common belief, may lead to
inconsistent estimation of $\beta$ if the variance of $Y_i$ depends on the mean and hence on $\beta$.
To see this, note that minimizing $Q$ in (2.9) is equivalent to solving $\partial Q/\partial\beta = 0$, where
$$\frac{\partial Q(\beta, \phi)}{\partial\beta} = \sum_{i=1}^{m} \Big\{-2 \Big(\frac{\partial\mu_i(\beta)}{\partial\beta}\Big)^t \mathrm{Var}^{-1}(Y_i; \beta, \phi)\,(y_i - \mu_i(\beta)) + \Big(\frac{\partial}{\partial\beta} \mathrm{Var}^{-1}(Y_i; \beta, \phi)\Big)(y_i - \mu_i(\beta))^2\Big\}. \tag{2.10}$$
Note that the first term in (2.10) is proportional to $S_\beta(\beta, \phi)$, the score function of $\beta$, whereas
the second term has in general non-zero expectation regardless of the sample size $m$.
The resulting solution, $\tilde\beta$ say, would be inconsistent as an estimator of $\beta$ unless $\mathrm{Var}(Y_i)$
is independent of $\beta$, in which case $\partial Q(\beta, \phi)/\partial\beta \propto S_\beta(\beta, \phi)$ and hence $\tilde\beta = \hat\beta$.
This undesirable property of the weighted least squares approach can be alleviated
through the following simple modification. Assuming the existence of a consistent
estimator, $\bar\beta$ say, for $\beta$, one remedy is to minimize instead
$$Q^*(\beta, \phi) = \sum_{i=1}^{m} \frac{(y_i - \mu_i(\beta))^2}{\mathrm{Var}(Y_i; \bar\beta, \phi)}. \tag{2.11}$$
It can be shown that the minimizer of $Q^*(\beta, \phi)$, $\hat\beta^*$ say, is now consistent for any consistent
estimator $\bar\beta$ of $\beta$. One such choice is the minimizer of $\sum_i (y_i - \mu_i(\beta))^2$. Furthermore,
it can be seen that iterating the minimization process in (2.11), i.e. minimizing
$$Q_{t+1}(\beta, \phi) = \sum_{i=1}^{m} \frac{(y_i - \mu_i(\beta))^2}{\mathrm{Var}(Y_i; \hat\beta_t, \phi)}, \quad t = 0, 1, \ldots,$$
where $\hat\beta_0 = \bar\beta$ and $\hat\beta_{t+1}$ denotes the minimizer of $Q_{t+1}$, leads to $\hat\beta$, the MLE. Indeed, one
can show that $\hat\beta^*$ and $\hat\beta$ are asymptotically
equivalent to each other in that $\hat\beta^* - \hat\beta = o_p(m^{-1/2})$.
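The difference between (2.9) and (2.11) is easiest to see in an intercept-only model. Suppose, as a toy illustration of ours, that the $y_i$ are counts with common mean $\mu$ and $\mathrm{Var}(Y_i) = \mu$. Minimizing (2.9) then gives $\sqrt{\sum_i y_i^2/m}$, which estimates $\sqrt{\mu^2 + \mu}$ rather than $\mu$, whereas holding the variance fixed as in (2.11) gives the sample mean. The function names and data below are hypothetical.

```python
def naive_wls_objective(mu, y):
    """Q in (2.9) when every observation has mean mu and variance mu."""
    return sum((v - mu) ** 2 / mu for v in y)

def minimize_on_interval(fn, lo, hi, n_iter=200):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(n_iter):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if fn(a) < fn(b):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)

def fixed_weight_wls(y, mu_prelim):
    """Minimizer of Q* in (2.11): the variance is evaluated at mu_prelim, not at mu."""
    weight = 1.0 / mu_prelim                   # identical for all observations in this model
    return sum(weight * v for v in y) / (weight * len(y))

y = [1, 2, 3, 4, 5]
mu_prelim = sum(y) / len(y)                    # unweighted least squares estimate
mu_naive = minimize_on_interval(lambda mu: naive_wls_objective(mu, y), 0.1, 20.0)
mu_fixed = fixed_weight_wls(y, mu_prelim)
print(round(mu_naive, 4), round(mu_fixed, 4))  # 3.3166 3.0
```

In this simple case the common weight cancels, so the fixed-weight minimizer equals the sample mean, which is also the Poisson MLE; the naive minimizer is pulled upward by the second term in (2.10).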
2.2.4 Deviance
An important aspect of model fitting is model checking, which examines the pattern of
departures from the fitted model. There are at least two different types of departure:
the systematic departure, which describes the departure of the model from the data
as a whole, and the isolated departure, which identifies individuals whose data are not
in accordance with the fitted model (McCullagh and Nelder, 1989, Chapter 12). This
subsection focuses only on the former.
For the $i$th subject with $x_i$ as the covariate value, there are two ways to estimate
$\mu_i$, the expected value of $Y_i$. One is to estimate $\mu_i$ by $\mu_i(\hat\beta)$ based on the fitted model. The
other is simply to estimate $\mu_i$ by $Y_i = y_i$, an unbiased estimator. The first approach,
which uses the whole data, is legitimate only if the fitted model adequately describes
the data. On the other hand, the second approach, while always legitimate, does not
utilize the sample data other than $y_i$ itself. This contrast between the two approaches
provides a means to answer the question of how well the model fits the data, by
comparing $\mu_i(\hat\beta)$ with $y_i$ through, for example, the likelihood function, $L(\mu_i, \phi)$. Indeed,
the quantity known as the deviance is proportional to the likelihood ratio test statistic comparing
the null hypothesis that the fitted model is adequate versus the saturated alternative,
i.e. that the $\mu_i$'s are different and unrelated to each other. More specifically, the deviance for
a fitted model is formally defined as
$$D(\hat\beta; Y) = -2 \sum_{i=1}^{m} \{\log L(\mu_i(\hat\beta); \phi) - \log L(y_i; \phi)\}\, a(\phi).$$
Thus, a small value of $D$ would indicate that the fitted model describes the data as
a whole rather well, whereas a large value would suggest otherwise. A few cautionary
notes are in order on the use of the deviance, or the Pearson $\chi^2$ statistic ($Q(\hat\beta, \phi)$), as a measure of
goodness-of-fit. First, except for the special case of normal responses, the distribution of
$D$, even asymptotically, is unknown. Second, as the hypothetical example in McCullagh
and Nelder (1989, p. 123) suggests, any global measure of goodness-of-fit such as the
deviance may be of limited use for choosing among competing models for the purpose of
prediction when the $x$ considered falls outside the range of the observed $x_i$'s,
a situation known as extrapolation.
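For Poisson responses, where $a(\phi) = 1$ and $\log L(\mu; y) = y \log\mu - \mu - \log y!$, the definition above reduces to $D = 2\sum_i \{y_i \log(y_i/\mu_i) - (y_i - \mu_i)\}$, with the convention that $y \log y = 0$ when $y = 0$. A minimal sketch follows; the function name `poisson_deviance` and the numbers are ours.

```python
import math

def poisson_deviance(y, mu):
    """Deviance 2 * sum{ y log(y / mu) - (y - mu) } for Poisson responses."""
    total = 0.0
    for yi, mi in zip(y, mu):
        if yi > 0:
            total += yi * math.log(yi / mi)    # the term vanishes when y_i = 0
        total -= yi - mi
    return 2.0 * total

y = [1, 3, 4, 9, 14]                           # hypothetical counts
print(poisson_deviance(y, y))                  # saturated fit: 0.0
print(round(poisson_deviance(y, [6.2] * 5), 3))  # intercept-only fit at the sample mean
```

The deviance is zero for the saturated fit and grows as the fitted means move away from the observations.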
2.3 Quasi-Likelihood Method
To adopt the generalized linear model approach, one needs to specify the random
component for the response variables. Very often, however, investigators
are uncertain about the complete random mechanism by which the data are generated.
This could be because the underlying biologic theory is not yet fully understood
and/or no substantial (empirical) experience with similar data from previous studies
is available. On the other hand, the scientific objective can often be adequately
characterized through regression, i.e. the systematic component may be well described. With
this in mind, Wedderburn (1974) proposed as an alternative the quasi-likelihood method,
which replaces the random component (A) in GLMs by
(A') (Variance specification). The variance of $Y_i$ has the form
$$\mathrm{Var}(Y_i) = a(\phi) V(\mu_i),$$
where $V$ is a known function relating the variance to the mean, $\mu_i$.
In the absence of a likelihood function for statistical inference,
Wedderburn (1974) proposed the use of the quasi-score function to estimate $\beta$, i.e. by
solving
$$S_\beta(\beta, \phi) = \sum_{i=1}^{m} \Big(\frac{\partial\mu_i}{\partial\beta}\Big)^t \mathrm{Var}^{-1}(Y_i; \beta, \phi)\,(y_i - \mu_i(\beta)) = 0. \tag{2.12}$$
This quasi-score function has the same expression as (2.7) and would be the true score
function for $\beta$ if the $Y_i$'s were indeed generated from an exponential dispersion family.
McCullagh (1983) studied the asymptotic behavior of $\hat\beta$, the solution of $S_\beta = 0$, and
found that under some regularity conditions, $\hat\beta$ is consistent and asymptotically normally
distributed with the covariance matrix given by (2.8). Firth (1987), on the other
hand, examined the efficiency of $\hat\beta$ relative to the MLE of $\beta$ when the true probability
mechanism for the $Y_i$'s is not from an exponential dispersion family. Several different
types of departure from the exponential dispersion family were considered and, as
expected, high efficiency of $\hat\beta$ is maintained when the departure of $f$ from the exponential
dispersion family is only modest (Firth, 1987). A remaining question to be addressed
is whether $\hat\beta$ or the quasi-score function, $S_\beta$, possesses any optimality property. This issue
will be the main focus of the next section. Indeed, we will consider a more general
situation than (A'), namely
(A'') (Variance specification). The variance of $Y_i$ has the form
$$\mathrm{Var}(Y_i) = V_i(\mu_i; \phi), \tag{2.13}$$
where $V_i$ is a known function of the mean, $\mu_i$, and $\phi$, the dispersion parameter.
This specification, of course, includes (A') as a special case with $V_i(\mu_i; \phi) = a(\phi) V(\mu_i)$.
2.4 Estimating Function Approach
Returning to the general setup in §2.1, we assume that the data, $Y = y$, are generated
from a probability (density) function $f(y; \theta, \phi)$ indexed by parameters of interest $\theta$ and
nuisance parameters $\phi$. In the regression setting, $\theta$ would be the regression coefficients
whereas $\phi$ would be the dispersion parameters. An estimating function for $\theta$ is simply a
function of the data $y$ and the parameters of interest $\theta$, i.e. $g(y; \theta)$. An estimating function
is called unbiased if
$$E(g(Y; \theta); \theta, \phi) = 0 \tag{2.14}$$
for all $\theta$ and $\phi$ values. An important example of unbiased estimating functions from
the perspective of health research is the quasi-score function, $S_\beta(\beta, \phi)$ in (2.12), where
$a(\phi)$ factors out of $S_\beta(\beta, \phi)$. While the formal theoretical development of estimating
functions was not advanced until the 1960s, scientists grappled with how to design
estimating functions combining observations and unknown quantities of interest as early as the
mid-18th century; see the interesting book on the history of statistics before 1900 by
Stigler (1986). Meanwhile, the method of moments, advocated by Karl Pearson, may
be viewed as a precursor of the modern estimating function approach. Here, selected
empirical moments such as the sample mean are equated to their expectations, which
depend on the parameters of scientific interest.
The concept of unbiased estimating functions and the corresponding optimality theory
were formally developed in 1960 through two independent works by Durbin
(1960) and Godambe (1960). We now briefly recast the essence of a series of more
general works by Godambe, before addressing the question concerning the optimality
of the quasi-score function. Consider the class of unbiased estimating functions
$G = \{g(y; \theta) : E(g(Y; \theta); \theta, \phi) = 0 \text{ for all } \theta, \phi\}$. A member of $G$, $g^*$ say, is optimal within this class
(Godambe, 1960) if it minimizes
$$W(\theta, \phi) = E^{-1}(\partial g(Y; \theta)/\partial\theta)\, \mathrm{Cov}(g(Y; \theta))\, E^{-1}(\partial g(Y; \theta)/\partial\theta)^t. \tag{2.15}$$
When $\phi$ is known, i.e. there are no nuisance parameters, Godambe (1960) showed that the
score function for $\theta$ is optimal within $G$. In the more practical situation where
$\phi$ is present and unknown, Godambe (1976) showed that
$$g^*(y; \theta) = \partial \log f(y \mid t; \theta)/\partial\theta \tag{2.16}$$
is optimal within $G$, where $t$ is a complete, sufficient statistic for $\phi$ for fixed $\theta$.
2.4.1 Limitations: Impact of Nuisance Parameters
The modern theory of optimal estimating functions discussed thus far is subject to the
following two major limitations. First, the optimality considered is ascribed to the estimating
function, yet scientists and other practitioners are more concerned about estimators. As
Crowder (1989) put it: "This is like admiring the pram rather than the baby." This
concern may be reconciled as follows. It can be shown that under some regularity
conditions, $\hat\theta^*$, the solution of $g^*(y; \theta) = 0$, has minimum asymptotic variance among solutions of
$\{g(y; \theta) = 0 : g \in G\}$. Thus this limitation may not
be too serious an issue if one is willing to appeal to a large sample argument.
Second, nuisance parameters can compromise the optimality theory. Specifically, the
optimal estimating function given in (2.16) relies upon the existence of a complete
sufficient statistic $t$ for $\phi$ that does not depend on $\theta$. This is the case when $Y$ is generated
from an exponential family with $\theta$ and $\phi$ being the natural (or canonical) parameters,
but more generally $t = t(\theta)$. In this more general situation, $\partial \log f(y \mid t(\theta); \theta, \phi)/\partial\theta$
depends on $\phi$ as well and hence is only locally optimal (Lindsay, 1982). More seriously,
the sample space of $Y$ given $t(\theta)$ may depend on $\theta$ and consequently $f(y \mid t(\theta); \theta, \phi)$ may
not be differentiable with respect to $\theta$ (Kalbfleisch and Sprott, 1970). It is worth noting
that the quasi-score function in (2.12) also suffers from this limitation if the variance of
$Y_i$ has the more general form in (2.13), in which case the nuisance parameter $\phi$ may not
factor out of (2.12).
2.4.2 Orthogonality Properties of Conditional Score Functions
The previous subsection made it clear that with the exception of exponential families,
nuisance parameters can compromise the optimality theory. While many likelihood methods
are potentially available leading to unbiased estimating functions that are independent
of nuisance parameters, there is no general guidance as to how one should proceed. In
addition, unlike the conditional score function in (2.16), there is no known theory to
support the optimality of the derived estimating functions. For this reason, we redirect our focus
to finding locally optimal estimating functions in which the impact of nuisance
parameters may be minimized, if not eliminated. Avoiding the complication associated
with the computation of $f(y \mid t_\theta; \theta, \phi)$, where $t_\theta$ is complete and sufficient for $\phi$ for fixed
$\theta$, Lindsay (1982) extended the concept of conditional score functions to this more general
situation, i.e. $t = t_\theta$, by defining
$$g_C(\theta, \phi) = \partial \log L(\theta, \phi)/\partial\theta - E(\partial \log L(\theta, \phi)/\partial\theta \mid t_\theta), \tag{2.17}$$
which reduces to $\partial \log f(y \mid t; \theta)/\partial\theta$ when $t_\theta = t$. Due to the dependence of $g_C$ on $\phi$, the
conditional score function defined above is only locally optimal within $G$ at the true
$\phi$ value (Lindsay, 1982). However, $g_C(\theta, \phi)$, by definition, is orthogonal to the space
spanned by the sufficient statistic $t_\theta$ for $\phi$. Thus the impact of nuisance parameters
on $g_C(\theta, \phi)$ is expected, at least intuitively, to be small. The following orthogonality
properties enjoyed by $g_C(\theta, \phi)$ represent attempts to quantify the above speculation.
These properties, (a) to (d), are ordered so that the definition of $g_C(\theta, \phi)$ implies (a), which implies (b),
which implies (c), which implies (d):
(a) $E(g_C(\theta, \hat\phi); \theta, \phi) = 0$ for all $\theta$, $\phi$ and any $\hat\phi$ which is a function of $t_\theta$;
(b) $E(g_C(\theta, \phi^*); \theta, \phi) = 0$ for all $\theta$, $\phi$ and $\phi^*$;
(c) $E(\partial g_C(\theta, \phi^*)/\partial\phi^*; \theta, \phi) = 0$ for all $\theta$, $\phi$ and $\phi^*$;
(d) $\mathrm{Cov}(g_C(\theta, \phi), \partial \log L(\theta, \phi)/\partial\phi; \theta, \phi) = 0$ for all $\theta$ and $\phi$.
Interpretation of these four orthogonality properties can be found in Liang and Zeger
(1995). Furthermore, a rich class of probability (density) functions in which $t_\theta$ is
indeed complete and sufficient for $\phi$ is given by Liang and Tsou (1992). In summary,
we have devoted this subsection to the desirable properties possessed by
the conditional score function proposed by Lindsay (1982). These properties, namely
local optimality and orthogonality to nuisance parameters, form the criteria for the next
subsection.
Finally, several researchers, including Cox and Reid (1987), Liang (1987), Ferguson,
Reid and Cox (1991) and Waterman and Lindsay (1996), have investigated strategies for
approximating the conditional score function in the more general situation where a
complete sufficient statistic for $\phi$ for fixed $\theta$ may not exist or is difficult to find.
2.4.3 (Local) Optimality of Quasi-Score Functions
We now return to the question posed in §2.3, namely the potential optimality of the
quasi-likelihood method, which will be addressed in more general settings. In the previous
two subsections, we have assumed implicitly that the likelihood function, indexed by
$\theta$ and $\phi$, is readily available for inference. As mentioned in §2.3, there are
situations where either the investigators are uncertain about the random mechanism or the
probability distribution is too complicated to write down explicitly. Examples include data
from spatial processes encountered in plant ecology (Besag, 1974) and from DNA sequencing.
Nevertheless, it is common that the parameters of interest $\theta$ are well defined,
reflecting the scientific interest, even though they may not completely specify the
probability distribution for the data $y$. We assume that the data $y = (y_1, \ldots, y_m)$
are decomposed into $m$ strata and that the $y_i$'s, $i = 1, \ldots, m$, are uncorrelated
with each other. The sizes of the $y_i$'s are not required to be the same. Furthermore, we
assume that $\theta$ is common in meaning to all $m$ strata and that there exists an
unbiased estimating function, $g_i(y_i; \theta)$, for the $i$th stratum, $i = 1, \ldots, m$.
A simple and trivial example of this setup is that the $y_i$'s are independent and
identically distributed with mean $\theta$ and variance $\phi$, and a simple choice of $g_i$
would be $y_i - \theta$. A natural question is how to combine
$\{g_i(y_i; \theta), i = 1, \ldots, m\}$ into a single unbiased estimating function for
inference that is optimal in some sense.
To address this question we consider a more restricted class (than $G$) of unbiased
estimating functions, namely,
\[ G^* = \Big\{ g(y; \theta) = \sum_{i=1}^m a_i(\theta)\, g_i(y_i; \theta) \Big\}, \]
which consists of linear combinations of the $g_i$'s. It can be shown (e.g. Liang and
Zeger, 1995) that the optimal unbiased estimating function within $G^*$, $g^*$ say, is
\[ g^* = \sum_{i=1}^m E^t(\partial g_i/\partial\theta)\, \mathrm{Cov}^{-1}(g_i)\, g_i, \tag{2.18} \]
which minimizes $W(\theta, \phi)$ defined in (2.15). Furthermore, the solution of $g^* = 0$,
$\hat\theta^*$ say, has the smallest asymptotic variance among
$\{\hat\theta_g : g(y; \hat\theta_g) = 0,\ g \in G^*\}$. This setup includes several
important examples:
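Example 2.5 (Quasi-score function). Under (A') and (B) specified in §2.3, one has
\[ h(\mu_i) = x_i^t \beta, \qquad \mathrm{Var}(Y_i) = a(\phi)\, V(\mu_i). \]
By choosing $g_i(y_i; \beta) = y_i - \mu_i(\beta)$, one has as the optimal unbiased
estimating function among $G^* = \{\sum_{i=1}^m a_i(\beta)(y_i - \mu_i(\beta))\}$
\[ g^* = \sum_{i=1}^m E^t(\partial g_i/\partial\beta)\, \mathrm{Var}^{-1}(g_i)\, g_i
       = \sum_{i=1}^m \Big(\frac{\partial \mu_i}{\partial \beta}\Big)^t \mathrm{Var}^{-1}(Y_i)\,(y_i - \mu_i(\beta)), \]
the quasi-score function for $\beta$ (the overall sign is immaterial, since $g^*$ and
$-g^*$ have the same root).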
Example 2.6 (Correlated data). For $y_i = (y_{i1}, \ldots, y_{in_i})$, the observations
from the $i$th cluster, $i = 1, \ldots, m$, where $n_i$ is the cluster size, the objective
of interest is frequently captured through the modeling of $\mu_{ij} = E(Y_{ij})$, i.e.
$h(\mu_{ij}) = x_{ij}^t \beta$ (Liang and Zeger, 1986). In this case, one such choice of
$g_i$ would be $y_i - \mu_i(\beta)$, where
$\mu_i = (\mu_{i1}(\beta), \ldots, \mu_{in_i}(\beta))^t$, and the optimal $g^*$ among
$G^* = \{g = \sum_{i=1}^m a_i(\beta)\, g_i(y_i; \beta)\}$ is
\[ g^* = \sum_{i=1}^m \Big(\frac{\partial \mu_i}{\partial \beta}\Big)^t \mathrm{Cov}^{-1}(Y_i)\,(y_i - \mu_i(\beta)), \]
known as the generalized estimating equation (GEE) (Liang and Zeger, 1986).
Example 2.7 ($2 \times 2$ tables). In case-control studies, subjects are stratified into
$m$ strata by confounding variables. The primary interest is on the degree of association
between the targeted disease and a dichotomous risk factor of interest as measured by
the odds ratio, $\theta$. In this case,
$y_i = (y_{i1}, n_{i1} - y_{i1}, y_{i2}, n_{i2} - y_{i2})$, where $y_{i1}$ ($y_{i2}$) is the
number of exposed among $n_{i1}$ cases ($n_{i2}$ controls) in the $i$th stratum,
$i = 1, \ldots, m$. It can be verified easily that
\[ g_i(y_i; \theta) = y_{i1}(n_{i2} - y_{i2}) - \theta\, y_{i2}(n_{i1} - y_{i1}) \]
is an unbiased estimating function for $\theta$ from the $i$th stratum. While
\[ g^* = \sum_{i=1}^m E^t(\partial g_i/\partial\theta)\, \mathrm{Var}^{-1}(g_i)\, g_i = \sum_{i=1}^m a_i(\theta)\, g_i \]
is complicated computationally, $a_i(\theta)$ reduces to $1/(n_{i1} + n_{i2})$ when
$\theta$ is evaluated at one (McCullagh and Nelder, 1989), and this leads to the well known
Mantel-Haenszel estimator of $\theta$ (Mantel and Haenszel, 1959).
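As a minimal sketch of Example 2.7, the code below combines the stratum-specific estimating functions with the weights $1/(n_{i1} + n_{i2})$ and solves the resulting linear equation in the odds ratio, which gives the Mantel-Haenszel form. The function names and the two-stratum data set are hypothetical and chosen only for illustration.

```python
def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel odds ratio from strata given as (y1, n1, y2, n2).

    y1 of the n1 cases and y2 of the n2 controls are exposed in each stratum.
    """
    numerator = 0.0
    denominator = 0.0
    for y1, n1, y2, n2 in strata:
        weight = 1.0 / (n1 + n2)  # a_i evaluated at an odds ratio of one
        numerator += weight * y1 * (n2 - y2)
        denominator += weight * y2 * (n1 - y1)
    return numerator / denominator


def combined_estimating_function(theta, strata):
    """sum_i a_i * g_i(y_i; theta) with a_i = 1 / (n_i1 + n_i2)."""
    return sum(
        (y1 * (n2 - y2) - theta * y2 * (n1 - y1)) / (n1 + n2)
        for y1, n1, y2, n2 in strata
    )


# Hypothetical data: two strata, invented for illustration only.
example_strata = [(10, 20, 5, 20), (8, 30, 4, 30)]
theta_hat = mantel_haenszel_odds_ratio(example_strata)
print(round(theta_hat, 3))
```

Because each $g_i$ is linear in the odds ratio, the root of the combined function is available in closed form, so no iteration is needed.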
Implicitly, the optimality theory developed thus far assumes $g^*$ is functionally
independent of the nuisance parameters $\phi$. This is indeed the case for Example 2.7 if
the expectations derived in computing $g^*$ are conditional on $t_i = y_{i1} + y_{i2}$, the
total number of exposed in the $i$th stratum, $i = 1, \ldots, m$. In general, however, the
distribution of $g_i$, which is functionally independent of $\phi$, does depend on $\phi$.
Thus, $g^* = g^*(y; \theta, \phi)$ and is locally optimal among $G^*$ at the true $\phi$
value, as is the conditional score function (2.17) when the probability mechanism is
available for inference. One important exception is the quasi-score function under (A'),
i.e. $a(\phi)$ appears as a proportional factor in $\mathrm{Var}(Y)$. In this case,
$a(\phi)$ factors out and $g^* = g^*(y; \theta)$ is globally optimal among $G^*$.
Even though $g^*$ may depend on $\phi$, we now argue that the impact of $\phi$ on $g^*$ and
on the corresponding solution of $g^* = 0$ is small. This is because $g^*$ shares the
orthogonality properties (b) - (d) enjoyed by the conditional score function, i.e.
(b) $E(g^*(y; \theta, \phi'); \theta, \phi) = 0$ for all $\theta$, $\phi$ and $\phi'$,
(c) $E(\partial g^*(y; \theta, \phi')/\partial\phi'; \theta, \phi) = 0$ for all $\theta$, $\phi$ and $\phi'$,
(d) $\mathrm{Cov}(g^*(y; \theta, \phi), \partial \log L(\theta, \phi)/\partial\phi; \theta, \phi) = 0$ for all $\theta$ and $\phi$.
One important implication of these properties, at least when the sample size is large, is
that $\hat\theta$, the solution of $g^*(y; \theta, \hat\phi) = 0$, where $\hat\phi$ is an
arbitrary consistent estimator of $\phi$, has the same variance as that when $\phi$ is
known. That is, the uncertainty in estimating $\theta$, as measured by the variance, is
unaltered whether $\phi$ is known or not, another indication
that
Chapter 3
Analysis of Binary Data
Binary (dichotomous) response is commonly observed in biomedical research. Examples
include death versus alive, ill versus healthy, injured or not in a car accident, depressed or
not in psychiatric research, etc. Sometimes there is a time element involved in defining the
response. For example, one may be interested in, as shown below in Table 3.1, whether
a child diagnosed with neuroblastoma survives through a two-year period. Another
example, shown in Table 3.2, has the response for each fetus as to whether it survives
the 21-day lactation period or not. These two examples share a similar data structure, which
is the focus of this chapter, namely,
\[ (y_i/n_i, x_i), \quad i = 1, \ldots, m, \]
where $y_i$ is the number of affected out of the $n_i$ subjects in the $i$th stratum and
$x_i$, a $p \times 1$ vector, denotes covariates that are common to all $n_i$ subjects. For
the neuroblastoma example, $m$ is equal to 15 with $p = 2$ covariates, namely, age and stage
at diagnosis. For the teratological experiment example, $m$ would be 32 with one primary
covariate, i.e. treatment status: $x = 1$ if treated and 0 if control. Questions of interest
for data that can be summarized in this kind of structure include (i) can variation in
$y_i/n_i$ be explained by $x_i$? and (ii) can one predict the affected status using $x_i$?
The first question is important in that it provides a means to identify risk factors that
may be useful for prevention purposes. The second question is useful for identifying
prognostic factors which may have clinical and policy implications.

Table 3.1: Proportions of children surviving 2 years free of disease following
diagnosis and treatment for neuroblastoma by age and stage of disease at
diagnosis (Breslow & McCann, 1971).

Age at diagnosis          Stage at diagnosis
(in months)        I       II      III     IV      IVS
0-11               11/12   15/16   2/4     5/18    18/19
12-23              3/4     3/7     5/8     0/25    1/3
24+                4/5     4/12    3/15    3/93    2/5

Table 3.2: Proportions of fetuses surviving the 21-day lactation period
among those who were alive at 4 days, by treatment group (Weil, 1970).

Group      Responses
Treated    12/12, 11/11, 10/10, 9/9, 10/11, 9/10, 9/10, 8/9,
           8/9, 4/5, 7/9, 4/7, 5/10, 3/6, 3/10, 0/7
Control    13/13, 12/12, 9/9, 9/9, 8/8, 8/8, 12/13, 11/12,
           9/10, 9/10, 8/9, 11/13, 4/5, 5/7, 7/10, 7/10

In this chapter, we discuss some analytical issues that are relevant to this type of
data structure, with special attention paid to the issue of statistical dependence within a
stratum, which is common in data from teratological experiments.
3.1 Historical Developments
Given the much longer history of multiple linear regression for continuous responses,
it was natural that some earlier developments for analyzing binary data appealed to
the conventional linear regression method. This forms the basis of the first two
approaches presented below.
Approach 1 (Arcsine analysis). This approach amounts to creating a new response
variable through an arcsine transform of the original response variable, that is,
$y_i^* = \sin^{-1}(\sqrt{y_i/n_i})$, and imposing the following model
\[ y_i^* = x_i^t \beta + \epsilon_i, \qquad \epsilon_i \overset{\text{ind.}}{\sim} N(0, 1/(4n_i)). \]
This transformation was motivated by the fact that the variance of $y_i^*$, at least
asymptotically, is independent of the means, $x_i^t \beta$, a desirable property to have in
the conventional linear regression approach; see §2.2.3.
Approach 2 (Empirical logit analysis). This approach is similar to Approach 1 except
that a different transformation is applied to $y_i/n_i$, namely,
$y_i^* = \log(y_i/(n_i - y_i))$. This is a special case of a comprehensive treatment for
analyzing categorical data developed in the 1960s by Grizzle, Starmer and Koch (1969).
Approach 3 (Modern analysis). This approach, which is a special case of GLMs, differs
from the first two approaches in one important aspect. In this modern (for the lack of
a better term) approach, the notion of transformation is applied to the expectation of
$y_i/n_i$, $\pi_i$, rather than to the data itself. In other words, for a given link
function, $h$ say, this approach specifies
\[ h(\pi_i) = x_i^t \beta \]
and furthermore, $y_i$ follows a binomial distribution. Some common choices of link
functions include the probit link for bioassay analysis (e.g., Finney, 1964) and the logit
link for case-control studies (Breslow and Day, 1980).
It is interesting to contrast the modern approach with the first two approaches. For
simplicity, we restrict our attention to the log odds transformation (Approach 2) so that
the comparison is made between the empirical logit approach and the logistic regression
approach.
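The two data transformations used by Approaches 1 and 2 can be sketched as follows. The function names are hypothetical, and the optional `correction` argument (adding a constant such as 0.5 to both counts) is a common convention assumed here for illustration; it is not part of the approaches as described above.

```python
import math


def arcsine_transform(y, n):
    """Approach 1: arcsine square-root transform of the proportion y/n."""
    return math.asin(math.sqrt(y / n))


def arcsine_variance(n):
    """Approximate variance 1/(4n) of the arcsine-transformed response."""
    return 1.0 / (4.0 * n)


def empirical_logit(y, n, correction=0.0):
    """Approach 2: log(y / (n - y)), optionally with a continuity correction.

    With correction=0 the transform is undefined when y = 0 or y = n.
    """
    return math.log((y + correction) / (n - y + correction))


# Two cells of Table 3.1: 5/18 is transformed without difficulty, whereas
# 0/25 needs the (assumed) correction for the empirical logit to be finite.
print(arcsine_transform(5, 18), empirical_logit(5, 18))
print(empirical_logit(0, 25, correction=0.5))
```

Strata with $y_i = 0$ or $y_i = n_i$, which occur in both Table 3.1 and Table 3.2, have no finite uncorrected empirical logit, which is one practical reason the transformation needs care with small strata.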
3.1.1 Empirical Logit versus Logistic Regression
We now argue that the empirical logit approach has the following drawbacks
that are not shared by the logistic regression method. First, the regression coefficients
of this approach are difficult to interpret: change in averaged $Y^* = \log\{Y/(n - Y)\}$ per
unit change in $x$. The transformation $Y^*$ is made for convenience, and its average does
not necessarily possess physical meaning of interest. This is to be contrasted with the
interpretation of the logistic regression coefficients: change in log odds of risk per unit
change in $x$. These quantities are easy to understand, with sound physical meaning for
biomedical researchers. Second, the normal distribution assumption for the $Y_i^*$'s relies
heavily on the $n_i$'s being large. As can be seen in either example, the $n_i$'s vary from
stratum to stratum and are not (cannot be, in teratological experiments) fixed or controlled
by design. This concern, on the other hand, is not shared by the logistic regression
method. Indeed, the latter approach can be applied even if $n_i \equiv 1$ for all $i$, so
long as $m$ is sufficiently large. Third, with the exception of the intercept, all the
regression coefficients in logistic regression models are estimable under retrospective
sampling, a commonly adopted design in epidemiologic studies. The same cannot
be said for the empirical logit approach. Finally, parameters in logistic regression models
are the canonical parameters in an exponential family, which means the intercepts can
be eliminated through conditioning, which is critical for matched case-control designs
to be implemented (Breslow and Day, 1980).
3.1.2 Is the Binomial Assumption Reasonable?
Each of the three approaches presented earlier assumes that $Y_i$ follows a binomial
distribution with size $n_i$ and probability $\pi_i$, $i = 1, \ldots, m$. A natural question
to ask is whether the binomial assumption is adequate. To answer this question, we note that
one natural way to characterize a binomial variate, $Y$ say, is that $Y$ be expressed as a
sum of $n$ binary observations, i.e., $Y = Y_1 + \ldots + Y_n$ with $Y_j = 1$ or 0 for all
$j = 1, \ldots, n$, such that (1) all the $Y_j$'s are statistically independent of each other
and (2) $\Pr(Y_j = 1) = \pi$, the common probability for all $j = 1, \ldots, n$.
For the neuroblastoma example, it seems sensible to assume that the number of children
with the same age and stage of diagnosis who survived for a two-year period is well
approximated by a binomial distribution. After all, these children are unrelated to each
other and, meanwhile, comparable to each other with regard to major prognostic factors
for neuroblastoma. On the other hand, it is well known in the teratological experiment
literature (e.g. Haseman and Kupper, 1979) that fetuses from the same
litter (stratum in our notation) tend to respond more alike than fetuses from different
litters, even if they all received the same treatment. This is known as the "litter effect",
which may be explained by the fact that fetuses from the same litter share common genes
and environment, which in turn leads to the likeness of responses from fetuses of the
same litter. This argument strongly suggests that the statistical independence assumption
underlying the binomial distribution may be invalid. Indeed, for the sixteen proportions
from the treated group in Table 3.2, only 25% (= 0.328/1.295) of the total variation in the
$y/n$'s is explained by the variation induced by the binomial assumption (Liang and Hanfelt,
1994).
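The 25% figure can be approximately reproduced from the treated group of Table 3.2. The sketch below assumes, since the calculation is not spelled out here, that the total variation is the sum of squared deviations of the litter proportions about the pooled proportion, and that the binomial part is the sum of $\hat p(1 - \hat p)/n_i$ with $\hat p$ the pooled proportion. Under these assumptions the two quantities come out at roughly 1.295 and 0.327.

```python
# Treated group of Table 3.2: (survivors, litter size) for each of the 16 litters.
treated = [(12, 12), (11, 11), (10, 10), (9, 9), (10, 11), (9, 10), (9, 10), (8, 9),
           (8, 9), (4, 5), (7, 9), (4, 7), (5, 10), (3, 6), (3, 10), (0, 7)]

# Pooled survival proportion across all treated litters.
pooled = sum(y for y, _ in treated) / sum(n for _, n in treated)

# Observed variation of the litter proportions about the pooled proportion.
total_variation = sum((y / n - pooled) ** 2 for y, n in treated)

# Variation expected if each litter were binomial with the pooled proportion.
binomial_variation = sum(pooled * (1 - pooled) / n for _, n in treated)

share_explained = binomial_variation / total_variation
print(round(total_variation, 3), round(binomial_variation, 3), round(share_explained, 2))
```

The same numbers result if deaths rather than survivors are counted, because both quantities are symmetric in $y_i$ and $n_i - y_i$.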
In the next section, we discuss the pros and cons of approaches that have been developed to
help investigators analyze data derived from teratological experiments.
3.2 Teratological Experiments
Teratological (toxicological) experiments are designed to test whether a substance is
indeed a carcinogen or toxin; if so, it is of interest to establish the dose-response
relationship in order to provide guidance on the tolerance level regarding the safety
of human populations. Typically, pregnant animals (e.g. rats) are randomized to receive
the substance at different dose levels. The adverse effect of the substance on the fetuses,
such as a structural developmental effect, is then assessed to address the question stated
above.
As mentioned in §3.1.2, the so-called "litter effect" is commonly observed and,
consequently, the binomial assumption on $Y$, the number of affected fetuses from a litter,
is invalid. One of the earliest approaches to address this issue of litter effect, also known
as extra-binomial variation or, more generally, over-dispersion, is to assume that $Y$
follows a beta-binomial distribution (Skellam, 1948) instead. This distribution may be
motivated and derived as follows. Consider
\[ y_i/n_i, \quad i = 1, \ldots, m^*, \]
the responses (proportions of fetuses that are affected) from litters receiving the same
dosage. For the example presented in Table 3.2, $m^*$ is equal to 16 for both treated and
control groups. To account for variation in responses among litters that are comparable
(regarding treatment levels) to each other, this approach assumes
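(1) given $\pi_i$, $Y_i \mid \pi_i \sim \mathrm{Binomial}(n_i, \pi_i)$, $i = 1, \ldots, m^*$,
(2) $\pi_i \sim \mathrm{Beta}(\alpha, \beta)$, $\alpha, \beta > 0$.
Consequently, the beta-binomial distribution for each $Y_i$ is arrived at as
\[
\begin{aligned}
\Pr(Y_i = y_i; \alpha, \beta)
 &= \int \Pr(Y_i = y_i \mid \pi_i)\, f(\pi_i; \alpha, \beta)\, d\pi_i \\
 &= \int_0^1 \binom{n_i}{y_i} \pi_i^{y_i} (1 - \pi_i)^{n_i - y_i}\,
    \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\,
    \pi_i^{\alpha - 1} (1 - \pi_i)^{\beta - 1}\, d\pi_i \\
 &= \binom{n_i}{y_i}
    \frac{\Gamma(\alpha + \beta)\,\Gamma(y_i + \alpha)\,\Gamma(n_i - y_i + \beta)}
         {\Gamma(\alpha)\,\Gamma(\beta)\,\Gamma(\alpha + \beta + n_i)}.
\end{aligned}
\tag{3.1}
\]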
The key step of this derivation is assumption (2), which acknowledges that the underlying
risk, $\pi_i$, varies from litter to litter by imposing a distribution, beta in this case,
on the $\pi_i$'s. Intuitively, the litter effect is pronounced if the variance of $\pi_i$ is
large. One difficulty associated with the representation in (3.1) is the lack of clear
interpretability for both $\alpha$ and $\beta$. However, noting that the first two moments
of a beta distribution have the form
\[ E(\pi_i) = \alpha/(\alpha + \beta) \equiv \pi, \]
\[ \mathrm{Var}(\pi_i) = \frac{1}{\alpha + \beta + 1}\, \pi(1 - \pi) \equiv \phi\, \pi(1 - \pi), \]
we now argue that the new parameters, $\pi$ and $\phi$, are more easily interpreted. This is
because the first two moments of the $Y_i$'s can now be expressed as
\[ E(Y_i/n_i) = E(E(Y_i/n_i \mid \pi_i)) = E(\pi_i) = \pi, \]
\[ \mathrm{Var}(Y_i/n_i) = \frac{\pi(1 - \pi)}{n_i}\,\big(1 + (n_i - 1)\phi\big). \]
Thus $\pi$ is simply the overall, or averaged, risk for those randomized to a particular
dose group. On the other hand, $\phi$, the proportional factor in $\mathrm{Var}(\pi_i)$, is
introduced to account for variation that is not solely explained by the binomial variation.
Furthermore, note from §3.1.2 that $Y_i$ can be expressed as
$Y_i = Y_{i1} + \ldots + Y_{in_i}$ and that these $n_i$ binary responses are not
statistically independent of each other. If one assumes the correlations for each pair
of these $Y_{ij}$'s are the same, then it can be shown easily that this common correlation
coefficient is exactly equal to $\phi$. Thus, the notion of "variation between litters" is
equivalent to "within litter likeness", as captured by the same quantity $\phi$.
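As a numerical check of (3.1) and of the moment formulas above, the sketch below evaluates the beta-binomial probabilities on the log scale and compares the mean and variance of $Y/n$, obtained by enumeration, with $\pi$ and $\pi(1 - \pi)(1 + (n - 1)\phi)/n$. The function names and the parameter values used in the demonstration are hypothetical.

```python
from math import comb, exp, lgamma, log


def beta_binomial_pmf(y, n, alpha, beta):
    """Pr(Y = y) from (3.1), computed with log-gamma functions for stability."""
    log_p = (log(comb(n, y))
             + lgamma(alpha + beta) + lgamma(y + alpha) + lgamma(n - y + beta)
             - lgamma(alpha) - lgamma(beta) - lgamma(alpha + beta + n))
    return exp(log_p)


def proportion_moments(n, alpha, beta):
    """Mean and variance of Y/n obtained by summing over the support of Y."""
    probs = [beta_binomial_pmf(y, n, alpha, beta) for y in range(n + 1)]
    mean = sum(p * y / n for y, p in enumerate(probs))
    var = sum(p * (y / n - mean) ** 2 for y, p in enumerate(probs))
    return mean, var


# Hypothetical values: alpha = 2, beta = 6 give pi = 0.25 and phi = 1/9.
n, alpha, beta = 10, 2.0, 6.0
pi = alpha / (alpha + beta)
phi = 1.0 / (alpha + beta + 1.0)
mean, var = proportion_moments(n, alpha, beta)
print(mean, pi)
print(var, pi * (1 - pi) / n * (1 + (n - 1) * phi))
```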
Finally, to test or to establish a dose-response relationship between $\pi$ and $x$, the
dosage level, one may consider
\[ h(\pi) = \beta_0 + x^t \beta_1, \]
in line with the systematic component specified under GLMs.
3.2.1 Some Cautionary Notes on Beta-Binomial Distributions
While intuitive in its derivation, this approach should be used with caution from the
frequentist viewpoint. First, as shown below, the parameter of interest $\pi$ and the
nuisance parameter $\phi$ are highly intertwined with each other. Consequently, the presence
of $\phi$ is likely to have non-trivial impacts on inference for $\pi$; see §2.1 for general
discussion on impacts of nuisance parameters. To see this, Table 3.3 below, which is
reported in Williams (1988), shows through simulations that misspecifying the $\phi$ value
could lead to severe bias in estimating $\pi$. For example, if the true $\pi$ and $\phi$
values are 0.125 and 0.2, respectively, and a sample of 20 beta-binomial observations is
generated, the averaged MLE of $\pi$ over 1,000 replications is equal to 0.167 if a $\phi$
value of 0.4 is assumed when deriving $\hat\pi$. This phenomenon is also evident when the
beta-binomial distributions are fitted to the Weil data in Table 3.2. Table 3.4 shows that
the MLE of $\beta_1 = \log(\pi_T/(1 - \pi_T)) - \log(\pi_C/(1 - \pi_C))$ dropped to 0.665
when $\phi_T$ and $\phi_C$ were forced to be the same. This is to be contrasted with the MLE
of $\beta_1$ when $\phi_T$ and $\phi_C$ are allowed to be different: 1.129. It is
interesting to note that the standard error estimate of $\hat\beta_1$ when assuming a
binomial distribution for the $Y$'s, i.e. ignoring the litter effect, is too small: 0.330
compared to 0.464 when a beta-binomial distribution is assumed instead.

Table 3.3: The averaged MLE ($\pm$ s.e.) of $\pi$ for different assumed values of
$\phi$ from 1,000 simulations (Williams, 1988).

                           True parameter values
Assumed $\phi$ value   $\pi = 0.125$, $\phi = 0.2$   $\pi = 0.25$, $\phi = 0.333$
0.00                   0.1263 $\pm$ 0.0012           0.2530 $\pm$ 0.0020
0.05                   0.1118 $\pm$ 0.0011           0.2310 $\pm$ 0.0020
0.10                   0.1134 $\pm$ 0.0011           0.2266 $\pm$ 0.0020
0.20                   0.1271 $\pm$ 0.0011           0.2336 $\pm$ 0.0019
0.30                   0.1457 $\pm$ 0.0012           0.2491 $\pm$ 0.0018
0.40                   0.1670 $\pm$ 0.0013           0.2685 $\pm$ 0.0018
0.50                   0.1906 $\pm$ 0.0014           0.2903 $\pm$ 0.0017

Table 3.4: Estimates and standard errors of
$\beta_1 = \log\{\pi_T/(1 - \pi_T)\} - \log\{\pi_C/(1 - \pi_C)\}$ for Weil's data presented
in Table 3.2. Upper entry: common $\phi$; lower entry: heterogeneous $\phi$ (Liang and
Hanfelt, 1994).

Method                                                $\hat\beta_1$   s.e.    $\hat\phi_T$   $\hat\phi_C$
Binomial                                              0.961           0.330   0.0            0.0
Beta-binomial                                         0.665           0.460   0.19           0.19
                                                      1.129           0.464   0.32           0.02
Quasi-likelihood
  $\mathrm{Var}(Y) = n\pi(1 - \pi)(1 + (n - 1)\phi)$  1.022           0.471   0.21           0.21
                                                      1.070           0.470   0.46           0.05
  $\mathrm{Var}(Y) = \phi\, n\pi(1 - \pi)$            0.961           0.475   2.78           2.78
                                                      0.961           0.475   4.74           1.46

Second, the asymptotic normal approximation for the MLE of the $\beta$'s may not have
taken hold for practical sample sizes when fitting the beta-binomial distribution to the
data at hand. Table 3.5 shows results from a simulation in which data mimicking the
structure of Table 3.2 and following beta-binomial distributions, with the true $\pi$'s and
$\phi$'s being the MLEs from the heterogeneous beta-binomial row of Table 3.4, were
generated 1,000 times. Whether or not $\phi_T$ and $\phi_C$ are allowed to be different in
the analysis, the MLE approach leads to undesirable 95% coverage probability that is far
from being symmetric, as would have been the case if the normal approximation to the
distribution of the MLEs had taken hold.
Finally, for a litter of size $n$ in which $Y = Y_1 + \ldots + Y_n$ and $Y_{-j}$ is defined
as $(Y_1, \ldots, Y_{j-1}, Y_{j+1}, \ldots, Y_n)$, one has
\[ \frac{\Pr(Y_j = 1 \mid Y_{-j})}{\Pr(Y_j = 0 \mid Y_{-j})}
   = \frac{\alpha + \sum_{\ell \ne j} y_\ell}{\beta + n - \sum_{\ell \ne j} y_\ell - 1} \tag{3.2} \]
if $Y$ is assumed to follow a beta-binomial distribution indexed by $\alpha$ and $\beta$. In
other words, the odds for $Y_j$ to be affected, i.e. $Y_j = 1$, depends on $Y_{-j}$, the
responses from the rest of the litter, only through $\sum_{\ell \ne j} y_\ell$, the total
number of affected. While appealing mathematically, the expression in (3.2) raises a concern
regarding the biologic plausibility of this distribution as the litter size varies. It seems
more intuitive that the effect of $Y_{-j}$, if any, on $Y_j$ would be through its
proportion, i.e. $\sum_{\ell \ne j} y_\ell/(n - 1)$, rather than through
$\sum_{\ell \ne j} y_\ell$.
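The conditional odds in (3.2) can be checked numerically. The sketch below uses the fact that, under the beta-binomial model, any particular 0/1 sequence of length $n$ with $s$ ones has probability $B(\alpha + s, \beta + n - s)/B(\alpha, \beta)$, where $B$ is the beta function; the conditional odds is then the ratio of two such probabilities. Function names and parameter values are hypothetical.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def sequence_probability(num_ones, n, alpha, beta):
    """Probability of one specific 0/1 sequence of length n with num_ones ones."""
    return exp(log_beta(alpha + num_ones, beta + n - num_ones) - log_beta(alpha, beta))


def conditional_odds(others_sum, n, alpha, beta):
    """Odds of Y_j = 1 given that the other n - 1 responses sum to others_sum."""
    return (sequence_probability(others_sum + 1, n, alpha, beta)
            / sequence_probability(others_sum, n, alpha, beta))


def odds_from_formula(others_sum, n, alpha, beta):
    """Right-hand side of (3.2)."""
    return (alpha + others_sum) / (beta + n - others_sum - 1)


# Hypothetical parameter values. Two affected littermates give different odds
# in a litter of 4 and in a litter of 12, although the count is the same.
alpha, beta = 0.8, 2.4
print(conditional_odds(2, 4, alpha, beta), conditional_odds(2, 12, alpha, beta))
```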
3.2.2 An Alternative Inferential Procedure
The beta-binomial distribution is an excellent example in which (1) the probability
distribution is indexed by parameters of interest and nuisance parameters, (2) the nuisance
parameters have profound impacts on inference for the parameters of interest and (3) yet
the conditional score function approach in §2.4 is not applicable (the sufficient statistic
$t$ equals $y$ in this case). Since the main interest of teratological experiments can be
cast through regression modelling of the first moment of the $Y$'s, one alternative is to
consider the quasi-likelihood approach detailed in §2.3. Specifically, one may consider
(A') (Random component). The variance of $Y_i$ has the form of
\[ \mathrm{Var}(Y_i) = n_i \pi_i(1 - \pi_i)\big(1 + (n_i - 1)\phi\big), \quad \text{or} \quad
   \phi\, n_i \pi_i(1 - \pi_i). \]
(B) (Systematic component). The mean of $Y_i/n_i$ is modeled as
\[ \mathrm{logit}\, \pi_i = \beta_0 + \beta_1 x_i, \quad i = 1, \ldots, m. \]
Note that the first variance expression corresponds to that induced by the beta-binomial
distribution whereas the second one is induced from the exponential-dispersion family. As
discussed in §2.4.3, while the quasi-score function does depend on $\phi$, the impact of
$\phi$ is minimal, at least from the theoretical viewpoint. The results below provide some
empirical evidence for this assertion. Table 3.4 shows for the Weil data that the estimates
and the corresponding s.e.'s from the quasi-likelihood approach are very similar to each
other regardless of the choice of variance function and of whether $\phi_T$ and $\phi_C$
are assumed to be the same or not. This observation is confirmed by the simulation study
presented in Table 3.5, which shows that this approach, unlike the beta-binomial likelihood
approach, leads to excellent coverage probabilities, small bias and excellent agreement
between the sample variance and the averaged estimated variance of $\hat\beta_1$. More
extensive simulations which mimic the data structures that are common in practice can be
found in Liang and Hanfelt (1994).

Table 3.5: Simulation results for
$\beta_1 = \log\{\pi_T/(1 - \pi_T)\} - \log\{\pi_C/(1 - \pi_C)\}$ from 1,000 replications
mimicking the data structure of Weil's. The true $\beta_1$ is 1.129 and $\phi_T = 0.317$ and
$\phi_C = 0.021$ (Liang and Hanfelt, 1994).

                                                              Sample   Asymptotic   95% coverage prob.
Method                                               Bias     s.d.     s.e.         lower   upper
Beta-binomial
  $\phi_T = \phi_C$                                  0.362    0.448    0.473        10.8    0.1
  $\phi_T \ne \phi_C$                                0.091    0.517    0.414        14.2    3.2
Quasi-likelihood
 1. $\mathrm{Var}(Y) = n\pi(1 - \pi)(1 + (n - 1)\phi)$
  $\phi_T = \phi_C$                                  -0.013   0.473    0.456        2.8     2.9
  $\phi_T \ne \phi_C$                                -0.013   0.473    0.474        2.8     2.7
 2. $\mathrm{Var}(Y) = \phi\, n\pi(1 - \pi)$
  $\phi_T = \phi_C$                                  -0.014   0.475    0.457        3.1     2.5
  $\phi_T \ne \phi_C$                                -0.014   0.475    0.547        3.1     2.5
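For the two-group comparison in the Weil data, the quasi-score equation under the first variance function has a closed-form root once $\phi$ is given: within a group, the estimated risk is a weighted ratio with weights $1/(1 + (n_i - 1)\phi)$. The sketch below plugs in the $\phi$ values reported in Table 3.4 rather than estimating them, which is a simplifying assumption, and treats the response as the number of deaths (litter size minus survivors). With $\phi = 0$ it gives the binomial estimate of about 0.961, and with the tabulated $\phi$ values it gives estimates close to 1.022 and 1.070.

```python
import math

# Table 3.2 as (survivors, litter size); the affected count is n - survivors.
TREATED = [(12, 12), (11, 11), (10, 10), (9, 9), (10, 11), (9, 10), (9, 10), (8, 9),
           (8, 9), (4, 5), (7, 9), (4, 7), (5, 10), (3, 6), (3, 10), (0, 7)]
CONTROL = [(13, 13), (12, 12), (9, 9), (9, 9), (8, 8), (8, 8), (12, 13), (11, 12),
           (9, 10), (9, 10), (8, 9), (11, 13), (4, 5), (5, 7), (7, 10), (7, 10)]


def quasi_score_risk(litters, phi):
    """Root in pi of sum_i n_i (y_i - n_i pi) / Var(Y_i) = 0, where
    Var(Y_i) = n_i pi (1 - pi) (1 + (n_i - 1) phi) and y_i counts the affected."""
    weights = [1.0 / (1.0 + (n - 1) * phi) for _, n in litters]
    affected = sum(w * (n - s) for w, (s, n) in zip(weights, litters))
    at_risk = sum(w * n for w, (_, n) in zip(weights, litters))
    return affected / at_risk


def logit(p):
    return math.log(p / (1.0 - p))


def beta1_hat(phi_treated, phi_control):
    """Difference in log odds of being affected, treated minus control."""
    return (logit(quasi_score_risk(TREATED, phi_treated))
            - logit(quasi_score_risk(CONTROL, phi_control)))


print(round(beta1_hat(0.0, 0.0), 3))    # binomial weights
print(round(beta1_hat(0.21, 0.21), 3))  # common phi taken from Table 3.4
print(round(beta1_hat(0.46, 0.05), 3))  # heterogeneous phi taken from Table 3.4
```

The standard errors in Table 3.4 are not reproduced here, since they depend on how $\phi$ and the variance of $\hat\beta_1$ were estimated.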
In summary, we have reviewed in this section two different approaches for analyzing data
from teratological experiments, which must confront non-trivial litter effects. Both
theoretical arguments and empirical results support the use of the quasi-likelihood method
as a sensible alternative to the conventional beta-binomial approach.
3.3 Further Relaxation of Binomial Assumptions
Recall that in §3.1.2 we discussed two components that constitute the binomial variate
$Y = Y_1 + \ldots + Y_n$, namely, the statistical independence and identical distribution of
the $Y_j$'s, $j = 1, \ldots, n$. In §3.2 we addressed ways to relax the first assumption of
statistical independence. In this section, we review some recent attempts to further
relax the second assumption of identical distributions. This work may be motivated by
family studies, where responses from $n$ related individuals are not only correlated with
each other due to shared genes and environment, but also have different risks of disease due
to different risk factor profiles consisting of, for example, age, sex, etc. Drawbacks
of these approaches are highlighted so as to provide further motivation for the more
modern developments for modelling correlated data to be presented in Chapter 5.
3.3.1 Rosner's Approach
Let $x_j$ be the $p \times 1$ vector of covariates for the $j$th component of a cluster of
size $n$. Building on the expression in (3.2), Rosner (1984) proposes to consider for
$j = 1, \ldots, n$
\[ \log \frac{\Pr(Y_j = 1 \mid Y_{-j}, x_1, \ldots, x_n)}{\Pr(Y_j = 0 \mid Y_{-j}, x_1, \ldots, x_n)}
   = \log\Big(\alpha + \sum_{\ell \ne j} y_\ell\Big)
   - \log\Big(\beta + n - \sum_{\ell \ne j} y_\ell - 1\Big) + \gamma^t x_j. \tag{3.3} \]
It models explicitly the dependence of the log odds for $Y_j$ to be one on $Y_{-j}$ and
$x_1, \ldots, x_n$ through $\sum_{\ell \ne j} y_\ell$ and $x_j$ for each $j = 1, \ldots, n$.
Using the Hammersley-Clifford Theorem (Besag, 1974), the specification in (3.3) does lead to
a legitimate joint distribution for $Y = (Y_1, \ldots, Y_n)$ given $(x_1, \ldots, x_n)$,
namely,
\[ \frac{\Pr(Y_1 = y_1, \ldots, Y_n = y_n \mid x_1, \ldots, x_n)}{\Pr(Y_1 = \ldots = Y_n = 0 \mid x_1, \ldots, x_n)}
   = \frac{a_s\, b_{n-s}}{b_n} \exp\Big[\sum_{j=1}^n y_j \gamma^t x_j\Big], \]
where
\[ a_s = \prod_{\ell=0}^{s-1} (\alpha + \ell), \quad s \ge 1, \qquad
   b_s = \prod_{\ell=0}^{s-1} (\beta + \ell), \quad s \ge 1, \qquad
   a_0 = b_0 = 1, \qquad s = \sum_{j=1}^n y_j. \]
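The link between the conditional specification (3.3) and the joint distribution can be verified numerically. The sketch below checks that a joint ratio of the form $a_s b_{n-s}/b_n \cdot \exp(\sum_j y_j \gamma x_j)$ reproduces the conditional log odds in (3.3) for every configuration of a small cluster. A scalar covariate is used for brevity, and the function names, the symbol `gamma` for the covariate coefficient, and all numerical values are hypothetical.

```python
import math
from itertools import product


def rising_product(base, s):
    """prod_{l=0}^{s-1} (base + l); equals 1 when s = 0."""
    result = 1.0
    for l in range(s):
        result *= base + l
    return result


def joint_ratio(y, x, alpha, beta, gamma):
    """Pr(Y = y | x) / Pr(Y = 0 | x) with a scalar covariate per component."""
    n, s = len(y), sum(y)
    structural = (rising_product(alpha, s) * rising_product(beta, n - s)
                  / rising_product(beta, n))
    return structural * math.exp(gamma * sum(yj * xj for yj, xj in zip(y, x)))


def log_odds_from_joint(y, j, x, alpha, beta, gamma):
    """Conditional log odds of Y_j = 1 implied by the joint ratio."""
    y_one = list(y)
    y_one[j] = 1
    y_zero = list(y)
    y_zero[j] = 0
    return math.log(joint_ratio(y_one, x, alpha, beta, gamma)
                    / joint_ratio(y_zero, x, alpha, beta, gamma))


def log_odds_from_formula(y, j, x, alpha, beta, gamma):
    """Right-hand side of (3.3)."""
    others = sum(y) - y[j]
    n = len(y)
    return (math.log(alpha + others) - math.log(beta + n - others - 1)
            + gamma * x[j])


# Hypothetical cluster of size 4 with one scalar covariate per member.
x = [0.0, 1.0, -0.5, 2.0]
alpha, beta, gamma = 0.7, 1.5, 0.3
max_gap = max(
    abs(log_odds_from_joint(y, j, x, alpha, beta, gamma)
        - log_odds_from_formula(y, j, x, alpha, beta, gamma))
    for y in product([0, 1], repeat=len(x))
    for j in range(len(x))
)
print(max_gap)
```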
As mentioned in §3.2.1, the effect of $Y_{-j}$ on $Y_j$, as modeled in (3.3), depends on
$\sum_{\ell \ne j} Y_\ell$, which is a function of the cluster size $n$. More seriously, the
effect of $x_j$ on $Y_j$, as measured by $\gamma$, is likely to be diminished, as $x_j$ is
competing directly against $Y_{-j}$ in explaining the variation in $Y_j$. Furthermore, the
random effect motivation conveyed in the original beta-binomial derivation (i.e.
$\pi \sim \mathrm{Beta}(\alpha, \beta)$) seems to disappear when one tries to incorporate
covariate effects for each component $j$.
3.3.2 Additive Model
Approaching the same issue from a totally different angle, Bahadur (1961) proposed the
following joint distribution for $(Y_1, \ldots, Y_n)$ given $(x_1, \ldots, x_n)$:
\[ \Pr(Y_1 = y_1, \ldots, Y_n = y_n \mid x_1, \ldots, x_n)
   = \prod_{j=1}^n \Pr(Y_j = y_j \mid x_j)
     \Big[ 1 + \sum_{j < k} \rho_{jk} z_j z_k + \sum_{j < k < \ell} \rho_{jk\ell} z_j z_k z_\ell
           + \ldots + \rho_{1 \ldots n} \prod_{j=1}^n z_j \Big], \tag{3.4} \]
where
\[ z_j = \frac{y_j - \pi_j}{\sqrt{\pi_j(1 - \pi_j)}}, \qquad
   \pi_j = \Pr(Y_j = 1 \mid x_j), \quad j = 1, \ldots, n. \]
This is termed the "additive" model in that the ratio of
\[ \frac{\Pr(Y_j = y_j, j = 1, \ldots, n \mid x_1, \ldots, x_n)}{\prod_{j=1}^n \Pr(Y_j = y_j \mid x_j)} \]
is additive in the products of the $y_j$'s. As a special case, Kupper and Haseman (1978)
consider
\[ \Pr(Y_j = y_j, j = 1, \ldots, n \mid x_1, \ldots, x_n)
   = \prod_{j=1}^n \Pr(Y_j = y_j \mid x_j) \Big( 1 + \rho \sum_{j < k} z_j z_k \Big), \tag{3.5} \]
where $\rho$ is the common correlation coefficient for each pair of the $y_j$'s and
$\pi_j \equiv \pi$ for all $j = 1, \ldots, n$. A major drawback of this model, as
acknowledged by Kupper and Haseman (1978), is that there is no guarantee that the quantities
on the right hand side of (3.5) are non-negative, a necessary condition for probability
functions. Furthermore, as pointed out in §2.4, the range of $\rho$ is constrained by the
$\pi_j$'s and, as in the beta-binomial distribution, the $\pi_j$'s and $\rho$ are likely to
be highly intertwined with each other. These concerns have limited the usage of this
approach.
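The non-negativity problem of (3.5) is easy to exhibit numerically. The sketch below evaluates the right hand side of (3.5) for every 0/1 configuration of a cluster with a common marginal probability and a common pairwise correlation; the values sum to one, but some of them are negative for the chosen parameters. The function name and the parameter values are hypothetical.

```python
import math
from itertools import product


def kupper_haseman_value(y, pi, rho):
    """Right-hand side of (3.5) with common marginal probability pi."""
    independent = math.prod(pi if yj == 1 else 1.0 - pi for yj in y)
    z = [(yj - pi) / math.sqrt(pi * (1.0 - pi)) for yj in y]
    pair_sum = sum(z[j] * z[k]
                   for j in range(len(y))
                   for k in range(j + 1, len(y)))
    return independent * (1.0 + rho * pair_sum)


# Hypothetical values: cluster size 5, pi = 0.1 and rho = 0.5.
n, pi, rho = 5, 0.1, 0.5
values = {y: kupper_haseman_value(y, pi, rho) for y in product([0, 1], repeat=n)}
print(sum(values.values()))       # the values still add up to one
print(min(values.values()))       # but the smallest one is negative
print(values[(1, 0, 0, 0, 0)])    # exactly one affected member
```

Configurations with exactly one affected member are the problematic ones here, because the single positive $z_j$ is multiplied by several negative ones.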
3.3.3 Multiplicative Model
Similar in spirit to the Rosner model, this third approach considers for $j = 1, \ldots, n$
\[ \log \frac{\Pr(Y_j = 1 \mid Y_{-j}, x_1, \ldots, x_n)}{\Pr(Y_j = 0 \mid Y_{-j}, x_1, \ldots, x_n)}
   = \alpha + \theta \sum_{\ell \ne j} y_\ell + \gamma^t x_j. \tag{3.6} \]
Again, utilizing the Hammersley-Clifford theorem, this leads to (Altham, 1978; Connolly
and Liang, 1988)
\[ \frac{\Pr(Y_j = y_j, j = 1, \ldots, n \mid x_1, \ldots, x_n)}{\Pr(Y_j = 0, j = 1, \ldots, n \mid x_1, \ldots, x_n)}
   = \exp\Big[ \alpha \sum_{j=1}^n y_j
     + \frac{\theta}{2} \sum_{j=1}^n y_j \Big( \sum_{j=1}^n y_j - 1 \Big)
     + \gamma^t \sum_{j=1}^n y_j x_j \Big]. \]
Due to the similar representations in (3.3) and (3.6), this model is subject to the same
drawbacks as the Rosner model discussed earlier.
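The relation between the joint form of the multiplicative model and its conditional log odds can be checked in the same way as for the Rosner model. In the sketch below the coefficient of $s(s - 1)$, with $s = \sum_j y_j$, is written as `kappa` (a hypothetical name); the implied conditional log odds then has slope $2 \times$ `kappa` in the number of other affected members, so a slope of $\theta$ in (3.6) corresponds to `kappa` $= \theta/2$. A scalar covariate and hypothetical parameter values are used.

```python
import math
from itertools import product


def joint_log_ratio(y, x, alpha, kappa, gamma):
    """log of Pr(Y = y | x) / Pr(Y = 0 | x) for the multiplicative model."""
    s = sum(y)
    return (alpha * s + kappa * s * (s - 1)
            + gamma * sum(yj * xj for yj, xj in zip(y, x)))


def conditional_log_odds(y, j, x, alpha, kappa, gamma):
    """log odds of Y_j = 1 given the other responses, implied by the joint form."""
    y_one = list(y)
    y_one[j] = 1
    y_zero = list(y)
    y_zero[j] = 0
    return (joint_log_ratio(y_one, x, alpha, kappa, gamma)
            - joint_log_ratio(y_zero, x, alpha, kappa, gamma))


# Hypothetical cluster of size 4 with one scalar covariate per member.
x = [0.5, -1.0, 0.0, 2.0]
alpha, kappa, gamma = -1.2, 0.4, 0.3
max_gap = max(
    abs(conditional_log_odds(y, j, x, alpha, kappa, gamma)
        - (alpha + 2 * kappa * (sum(y) - y[j]) + gamma * x[j]))
    for y in product([0, 1], repeat=len(x))
    for j in range(len(x))
)
print(max_gap)
```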
In summary, we have reviewed in this section three different approaches to extending the
binomial distribution to allow statistical dependence and different risks among the
$Y_j$'s from the same cluster. Drawbacks of these approaches have been highlighted. In this
regard, the treatment in this chapter may be viewed as a preview of the more modern
developments for modelling correlated data to be presented in Chapter 5.
Chapter 4
Analysis of Polytomous and
Count Data
In biomedical research, categorical response with three or more possible categories, known
as polytomous response, is also common in practice. Examples of this kind include
degree of injury (mild, moderate and severe), different cell types of lung cancer (e.g.,
small, squamous, large, etc.) and income level (low, medium and high). Just as with
binary response, researchers are interested in identifying factors that are related to the
polytomous response for prevention, intervention and prediction purposes. The first
part of this chapter reviews a variety of statistical models that have been proposed in
the literature to address the above issue. We discuss through several real examples the
pros and cons of each approach, with emphasis on the interpretation of the regression
coefficients induced by the models.
In social science research, data are frequently expressed in contingency tables since
the variables observed are mostly categorical. The scientific interest is often in how these
variables are related to each other. For that purpose, log-linear models (e.g. Bishop,
Fienberg and Holland, 1974) have been developed since the 1950s to address the issue
of patterns of association among variables collected from the same subjects. Attempts
have also been made in the past decade or two to utilize this approach to help analyze
the types of correlated data introduced in Chapter 1. The second part of this chapter
is devoted to critically reviewing the pros and especially the cons of log-linear models,
with emphasis on the interpretation of the coefficients implied by the models. One useful
reference for this topic is Liang, Zeger and Qaqish (1992) and the discussion therein.
4.1 Types of Polytomous Data
Examining the examples of polytomous responses listed earlier, one notes that they can be
classified into at least three different types in terms of scale, which require different
attention for modelling. The first type is on a nominal scale. Here categories are regarded
as exchangeable, such as cell types of lung cancer, blood type or hair color (red, black,
brown, etc.). The second type is on an ordinal scale, where categories are ordered like the
ordinal numbers (first, second, third, etc.). Examples include degree of injury, perception
of food quality and, for that matter, hair color (light, medium, dark, etc.). It is important
to point out for ordinal responses that, while ordered, the notion of distance or spacing
between categories is arbitrary and, in some instances, irrelevant. Furthermore, in
many instances, the choice and definition of categories are either arbitrary or subjective.
An important implication of this observation is that any sensible analytical procedure should
have the desired property that the same conclusions be drawn if new categories are formed by
collapsing adjacent categories of the original scale. As an illustrative example, consider
the following data from a $3 \times 3$ table relating the height (short, medium or tall) of
the husbands to that of the wives. The simple chi-squared test gives rise to a value of 2.9,
which provides little evidence of association between husbands' and wives' heights. The
same conclusion is drawn if one either collapses the "short" and "medium" categories
together ($\chi^2_1 = 1.45$) or collapses instead the "medium" and "tall" categories together
($\chi^2_1 = 0.90$).

                    Wife
               T     M     S
Husband   T    18    28    14
          M    20    51    28
          S    12    25     9

                       Wife
                   T     M & S
Husband   T        18     42
          M & S    32    113

                       Wife
                   T & M    S
Husband   T & M    117     42
          S         37      9

The third type of polytomous response is on the interval scale, in which scores or numerical
labels are frequently attached. Examples of this kind include salary, income, age, etc., in
which the scores may be the midpoints of the intervals defined according to the original
scales. In this situation, it is common that differences between scores from different
categories be interpreted as a measure of separation of the categories.
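The chi-squared values quoted for the husband-wife height table can be checked with a short calculation of the Pearson statistic, before and after collapsing adjacent categories. The helper names below are hypothetical; rows and columns are ordered tall, medium, short.

```python
def pearson_chi_squared(table):
    """Pearson chi-squared statistic for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


def collapse(table, groups):
    """Merge rows and columns according to the same grouping of category indices."""
    return [[sum(table[i][j] for i in row_group for j in col_group)
             for col_group in groups]
            for row_group in groups]


# Rows: husband (T, M, S); columns: wife (T, M, S).
heights = [[18, 28, 14],
           [20, 51, 28],
           [12, 25, 9]]

full = pearson_chi_squared(heights)
medium_short = pearson_chi_squared(collapse(heights, [[0], [1, 2]]))
tall_medium = pearson_chi_squared(collapse(heights, [[0, 1], [2]]))
print(round(full, 2), round(medium_short, 2), round(tall_medium, 2))
```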
In the next section, we discuss the pros and cons of three commonly used regression
models for polytomous responses on either nominal or ordinal scales. Throughout, we will
use the following two examples for illustration.
Example 4.1 (Disturbed dream). The data presented below are taken from Maxwell
(1961) on the association between degree of suffering from disturbed dreams and age
among 223 boys aged 5 to 15. It is clear that the response variable, severity of disturbed
dreams, is ordinal while the explanatory variable, age, is on an interval scale.

               Degree of suffering from disturbed dreams
Age (years)    Not severe    2     3     Very severe
5-7                 7        4     3          7
8-9                10       15    11         13
10-11              23        9    11          7
12-13              28        9    12         10
14-15              32        5     4          3

Example 4.2 (Tonsil size). The data below are from Holmes and Williams (1954), who
classify 1398 children aged 1-15 years according to their relative tonsil size and whether
or not they were carriers of streptococcus pyogenes (SP). The response variable here,
tonsil size, is also ordinal and the interest is in assessing the nature and direction of the
possible effect of SP on the enlargement of tonsil size.

                          Tonsil size
SP             Not enlarged   Slightly enlarged   Greatly enlarged
Carrier             19               29                  24
Non-carrier        497              560                 269
4.2 Regression Models for Polytomous Responses
Consider a response variable, $Y$, which has $C + 1$ possible categories, i.e.
$Y = 0, 1, 2, \ldots$ or $C$, with $C$ being a positive integer. The interest is in the
relationship between $Y$ and $x$, a $p \times 1$ vector of covariates.
4.2.1 Polytomous Logistic Regression Models
This model represents a straightforward extension of the logistic regression model for
binary response by contrasting the odds for $Y$ to be $j = 1, \ldots, C$ instead of 0, i.e.
$$\log \frac{\Pr(Y = j \mid x)}{\Pr(Y = 0 \mid x)} = \alpha_j + \beta_j^t x, \quad j = 1, \ldots, C. \tag{4.1}$$
It can easily be seen that the distribution function for $Y$ given $x$ is fully specified by the
$C$ logistic regression models above, namely,
$$\Pr(Y = j \mid x) = \frac{e^{\alpha_j + \beta_j^t x}}{1 + \sum_{\ell=1}^{C} e^{\alpha_\ell + \beta_\ell^t x}}, \quad j = 0, 1, \ldots, C,$$
with $\alpha_0 = 0$ and $\beta_0 = 0$, a vector of $p$ zeros. Here $\beta_j$ characterizes the effect of $x$ on
the log odds for being in the $j$th category instead of category zero, $j = 1, \ldots, C$. It is
interesting to point out that for a pair of categories, $0 < j < \ell$, the log odds for $Y$ to be
$\ell$ instead of $j$ is fully determined by (4.1) as
$$\log \frac{\Pr(Y = \ell \mid x)}{\Pr(Y = j \mid x)} = (\alpha_\ell - \alpha_j) + (\beta_\ell - \beta_j)^t x, \tag{4.2}$$
which is linear in $x$ as well. The simple equality $\beta_j = \beta_\ell - (\beta_\ell - \beta_j)$, a result of (4.1) and
(4.2), explains why subjects whose $Y$ values are different from 0 and $j$ are informative
about $\beta_j$. This model is particularly useful for nominal responses where categories are
considered exchangeable. Finally, assuming that $Y_i$ follows a multinomial distribution,
likelihood inference for the $\alpha_j$'s and $\beta_j$'s can be carried out and is available in SAS (PROC
CATMOD).
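To make the link between (4.1) and (4.2) concrete, the following minimal sketch evaluates the category probabilities for one covariate value and checks that the log odds of category $\ell$ versus $j$ is the difference of the corresponding linear predictors. All parameter values are hypothetical and chosen only for illustration; they are not estimates from any data set in these notes.

```python
import math


def polytomous_probs(alphas, betas, x):
    """Category probabilities Pr(Y = j | x), j = 0..C, under model (4.1).

    alphas[j-1] and betas[j-1] hold alpha_j and beta_j for j = 1..C;
    category 0 is the reference, with alpha_0 = 0 and beta_0 = 0.
    """
    etas = [0.0] + [a + sum(b_k * x_k for b_k, x_k in zip(b, x))
                    for a, b in zip(alphas, betas)]
    denom = sum(math.exp(e) for e in etas)
    return [math.exp(e) / denom for e in etas]


# Hypothetical values: C = 3 non-reference categories, p = 1 covariate.
alphas = [0.5, -0.2, 1.0]
betas = [[-0.3], [0.1], [-0.6]]
x = [2.0]

probs = polytomous_probs(alphas, betas, x)

# Equation (4.2): log odds of category l versus category j, with 0 < j < l.
j, l = 1, 3
lhs = math.log(probs[l] / probs[j])
rhs = (alphas[l - 1] - alphas[j - 1]) + (betas[l - 1][0] - betas[j - 1][0]) * x[0]
print(probs, lhs, rhs)
```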
4.2.2 Proportional Odds Models
For ordinal responses, McCullagh (1980) proposes the use of proportional odds models,
which can also be expressed in $C$ logistic regression models:
$$\log \frac{\gamma_j(x)}{1 - \gamma_j(x)} = \alpha_j + \beta^t x, \quad j = 1, \ldots, C, \tag{4.3}$$
where $\gamma_j(x) = \Pr(Y \ge j \mid x)$. This model focuses on the modelling of $\gamma_j(x)$, the cumulative
probability of $Y$ being in the $j$th category or higher given $x$, which is meaningful only
if $Y$ is ordinal. It assumes that the effect of $x$ on those $C$ log odds is the same for all
$j = 1, \ldots, C$, an assumption which can be checked empirically. An important feature
of this model is that there is no need to assign scores to the $C + 1$ categories of the
response variable. Instead, the ordinal feature of $Y$ is incorporated (or acknowledged) by
(1) modelling through $\gamma_j$ and (2) assuming that the effect of $x$ on $\log\{\gamma_j(x)/(1 - \gamma_j(x))\}$
is the same, or at least is in the same direction. Furthermore, the interpretation of $\beta$
is well preserved: the change in log odds for suffering more rather than less (Example 4.1) per
unit change in $x$. Finally, this model has been implemented in SAS as well (PROC
LOGISTIC).
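The proportional odds assumption can be illustrated numerically. The sketch below, using hypothetical intercepts and slope, converts the cumulative logits in (4.3) into category probabilities and confirms that a one-unit increase in $x$ multiplies every cumulative odds $\Pr(Y \ge j)/\Pr(Y < j)$ by the same factor $e^{\beta}$.

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def proportional_odds_probs(alphas, beta, x):
    """Pr(Y = j | x), j = 0..C, from the cumulative logits in (4.3).

    gamma_j(x) = Pr(Y >= j | x) = expit(alpha_j + beta * x) for j = 1..C,
    so the alphas must be decreasing in j for the probabilities to be valid.
    """
    gammas = [1.0] + [expit(a + beta * x) for a in alphas] + [0.0]
    return [gammas[j] - gammas[j + 1] for j in range(len(alphas) + 1)]


def cumulative_odds(probs, j):
    """Odds of Y >= j versus Y < j."""
    upper = sum(probs[j:])
    return upper / (1.0 - upper)


# Hypothetical parameter values (C = 3, one covariate).
alphas = [1.0, 0.0, -1.2]
beta = -0.25

probs_at_10 = proportional_odds_probs(alphas, beta, 10.0)
probs_at_11 = proportional_odds_probs(alphas, beta, 11.0)

# The ratio of cumulative odds is the same for every cut point j.
ratios = [cumulative_odds(probs_at_11, j) / cumulative_odds(probs_at_10, j)
          for j in (1, 2, 3)]
print(ratios, math.exp(beta))
```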
Returning to the disturbed dream example, Table 4.1 shows the MLEs of the $\alpha_j$'s
and $\beta_j$'s when both the polytomous logistic regression model and the proportional odds
model are entertained with a single continuous $x$: age in years (midpoint of each age
interval). Results from fitting the proportional odds model suggest a strong negative
relationship between suffering from disturbed dreams and age among 5-15 year old boys,
in that the odds of suffering from disturbed dreams decrease by a factor of 0.8 ($= e^{-0.229}$)
per year. Results from fitting the polytomous logistic regression model, which ignores the
ordinal feature of the response variable, lead to a qualitatively similar conclusion. We
note, however, that the standard error estimate of $\hat\beta$ (0.052) is, as expected, much smaller
than that of the three $\hat\beta_j$'s (0.078, 0.078 and 0.081, respectively). This example serves
to illustrate that proper modelling through acknowledging the ordinal feature of the
response variable could lead to a substantial gain in efficiency for regression coefficients
of interest.
Table 4.1: Regression estimates (± s.e.) based on both proportional odds
models and polytomous logistic regression models for the disturbed dream
example (Maxwell, 1961).

                              Models
Variable       Proportional odds     Polytomous logistic
$\alpha_1$       6.402 ± 1.403          6.271 ± 2.108
$\alpha_2$       5.586 ± 1.387          4.784 ± 2.115
$\alpha_3$       4.571 ± 1.377          7.792 ± 2.169
$\beta_1$                              -0.263 ± 0.078
$\beta_2$                              -0.208 ± 0.078
$\beta_3$                              -0.323 ± 0.081
$\beta$         -0.229 ± 0.052
4.2.3 Continuation Ratio Models
There are situations where the response variable is ordered in a hierarchical (or nested)
fashion. Examples include
Example 4.3 (Teratological experiment). Chen et al. (1991) present data from a
teratological experiment in which each litter was randomized to one of three levels of
exposure (low, medium and high) to hydroxyurea. Here fetuses from a litter could be
dead or alive at birth, and if alive, they could be malformed or normal after a period of
time; see Diagram 1 below. The relationship between the proportions of fetuses
that are dead and the dosage level allows investigators to examine the embryo-toxic effect
of the targeted substance, whereas the proportions of live fetuses that are malformed
can be used to address the structural developmental effect of the substance. The
response variable has three categories with a natural order of dead, alive but malformed,
and alive and normal, a decreasing order of severity. It is hierarchical (or nested)
in that the issue concerning the risk of being malformed is only meaningful for those
fetuses that are alive at birth.
Example 4.4 (Insemination experiment). McCullagh and Nelder (1989, p.62) describe
an experiment in which milch cows receive treatments to induce pregnancy in such a
way that each cow continues in the experiment until pregnant, as shown in Diagram 2 below.
Here the response variable represents a series of 1/0 responses indicating whether the cow is
pregnant or not. However, this response variable is hierarchical (or nested) in that the
$j$th response would be recorded only if the cow was not pregnant in the previous $j - 1$
experiments.
Example 4.5 (Tonsil enlargement). Returning to the tonsil size example (Example
4.2), one might argue that the response variable can be expressed in a way that is
similar to that in Example 4.3. Here each subject may be first classified as enlarged
or not regarding the tonsil size; those experiencing enlargement can then be
further classified as either greatly enlarged or slightly enlarged, as shown in Diagram 3 below.
Here the investigators are in a position to address two related but different questions, the first
being whether carrying SP is associated with tonsil enlargement, and the second being
whether carrying SP is associated with serious enlargement given that tonsil enlargement has
taken place.
Diagram 1 (Teratological experiment)
            fetus
           /     \
        dead     alive
                /     \
        malformed     normal
Diagram 2 (Insemination experiment)
Experiment          Milch cow
                    /       \
    1         pregnant      not
                           /    \
    2               pregnant    not
                               /    \
    3                   pregnant    not
Diagram 3 (Tonsil enlargement)
               tonsil
              /      \
    not enlarged     enlarged
                    /        \
      slightly enlarged    greatly enlarged
For ordinal responses of this kind, it is clear that proportional odds models may not be
appropriate. Instead we consider the continuation ratio model by Mason and Fienberg
(1985)
$$\mathrm{logit}\,(\pi_j(x)/\gamma_j(x)) = \alpha_j + \beta_j^t x, \quad j = 0, 1, \ldots, C - 1, \tag{4.4}$$
where $\pi_j(x) = \Pr(Y = j \mid x)$ and $\gamma_j(x) = \Pr(Y \ge j \mid x)$. Like the other two models,
this model is represented by $C$ logistic regression models. Here $\pi_j(x)/\gamma_j(x)$ can be
re-expressed as $\Pr(Y = j \mid Y \ge j, x)$, the conditional probability for $Y$ to be $j$ given
that $Y$ is either in the $j$th category or higher. It is rather clear that this conditional
probability is more meaningful than $\pi_j(x)$ itself in reflecting the scientific question of
interest. Note that in (4.4) we have allowed the effect of $x$ on $\pi_j(x)/\gamma_j(x)$ to be different
for different $j$. This would be a sensible approach for Examples 4.3 and 4.5 as, for
instance, there is no scientific reason to expect the embryo-toxic effect to be the same
as the structural developmental effect. On the other hand, it may be equally sensible
to allow the $\beta_j$'s to be the same in Example 4.4, which characterizes the effect of $x$ on
the probability of being pregnant while allowing the probability itself (i.e. the $\alpha_j$'s) to vary
across experiments. Nevertheless, this assumption of a common $\beta$ can always be examined
statistically. Returning to the tonsil size example, Table 4.2 gives the MLEs and their
s.e. when both the proportional odds model and the continuation ratio model were fitted
to the data shown earlier. Here the sole covariate is the SP carrier status ($x = 1$ for being
a carrier and 0 if not). Both models lead to conclusions of strong evidence of association
between SP carrier status and tonsil enlargement, even though the regression coefficients from
the two models have different interpretations. Results from the proportional odds model
suggest that the odds of having more serious tonsil size enlargement is 83% ($= e^{0.603} - 1$)
higher for the carriers than the non-carriers. On the other hand, results from the
continuation ratio model indicate that the odds of having tonsil enlargement is 67%
($= e^{0.514} - 1$) higher for the carriers than the non-carriers, whereas the odds of having
tonsil size greatly enlarged among those with tonsil enlargement is 72% ($= e^{0.543} - 1$)
higher for the carriers than the non-carriers.
Table 4.2: Regression estimates (± s.e.) based on proportional odds models
and continuation ratio models for the tonsil size example (Holmes and
Williams, 1954).

                              Model
Variable       Proportional odds     Continuation ratio
$\alpha_1$       0.509 ± 0.056          0.512 ± 0.051
$\alpha_2$       1.363 ± 0.067          1.245 ± 0.093
$\beta_1$                               0.514 ± 0.273
$\beta_2$                               0.543 ± 0.286
$\beta$          0.603 ± 0.227
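The two continuation ratio slopes can be recovered directly from the tonsil size counts. With a single binary covariate and a separate intercept and slope at each stage, each stage is a saturated 2 x 2 comparison, so the fitted log odds ratio coincides with the empirical one. The sketch below (helper name illustrative) computes both on the scale used in the text, namely odds of enlargement and odds of great versus slight enlargement; the values are approximately 0.514 and 0.544, in line with Table 4.2.

```python
import math

# Tonsil size counts (not, slightly, greatly enlarged) from Example 4.2.
carrier = (19, 29, 24)
non_carrier = (497, 560, 269)


def odds(numerator, denominator):
    return numerator / denominator


# Stage 1: enlarged (slightly or greatly) versus not enlarged.
or_enlarged = (odds(carrier[1] + carrier[2], carrier[0])
               / odds(non_carrier[1] + non_carrier[2], non_carrier[0]))

# Stage 2: greatly versus slightly enlarged, among those with enlargement.
or_greatly = (odds(carrier[2], carrier[1])
              / odds(non_carrier[2], non_carrier[1]))

log_or_enlarged = math.log(or_enlarged)
log_or_greatly = math.log(or_greatly)
print(round(log_or_enlarged, 3), round(log_or_greatly, 3))
print(round(or_enlarged - 1, 2), round(or_greatly - 1, 2))  # relative increase in odds
```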
4.3 Log-Linear Models for Contingency Tables
Log-linear models represent a popular approach to analyze data that can be summarized
in contingency tables. One important feature of this approach is that there is no need
to specify response variables. This is especially useful in social science research with
the aim to examine the relationship (or the association) among variables considered.
More recently, this log-linear model approach has been advocated by some researchers to
deal with correlated data. Examples include cross-over trials where patients receive, by
design, different treatments over several periods (Jones and Kenward, 1987) and genetic
studies on detection of maternal effects of disease genes (Weinberg, Wilcox and Lie,
1998). In this section, we briefly review the work on log-linear models with the emphasis
on interpretations of specified parameters. It serves to point out the potential for drawing
incorrect conclusions when using log-linear models to analyze contingency table data or
discrete correlated data.
4.3.1 Interpretations of Log-Linear Model Parameters
We start with a simple $I \times J$ contingency table in which $Y_{ij}$ represents the number of
individuals who are in the $i$th ($j$th) category of variable 1 (variable 2), $i = 1, \ldots, I$,
$j = 1, \ldots, J$. Let $P_{ij}$ be the corresponding probability for an individual to be in the $(i, j)$ cell
of the $I \times J$ table. With $m$ being the sample size, one has immediately
$\mu_{ij} = E(Y_{ij}) = m P_{ij}$. It is intuitive that two categorical variables of $I$ and $J$ categories,
respectively, are not associated with each other (or statistically independent) if
$$P_{ij} = P_{i+} P_{+j}, \tag{4.5}$$
where $P_{i+}$ and $P_{+j}$ are the marginal probabilities, e.g. $P_{i+}$ is the probability that an
individual falls into the $i$th category of the first variable. The expression in (4.5) leads to
$$\log \mu_{ij} = \log m + \log P_{i+} + \log P_{+j} = \mu + \alpha_i + \beta_j \tag{4.6}$$
with $\alpha_1 = 0 = \beta_1$ to avoid over-parameterization. Thus, one can test the assumption
of statistical independence between the two variables by computing the deviance which
contrasts this no association (or null) model with the saturated model where all the
$I \times J$ $P_{ij}$'s are unspecified. When this null model is rejected, a natural question, from
a modelling viewpoint, is how one may modify (4.6) to acknowledge (or to characterize)
the dependence (or association) between two variables. One obvious approach is to add
to (4.6) additional parameters $(\alpha\beta)_{ij}$, i.e.
$$\log \mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} \tag{4.7}$$
with $\alpha_1 = \beta_1 = 0 = (\alpha\beta)_{1j} = (\alpha\beta)_{i1}$, for all $i = 1, \ldots, I$, $j = 1, \ldots, J$. The following
representation clearly qualifies the $(I - 1) \times (J - 1)$ $(\alpha\beta)_{ij}$'s as association parameters,
namely, for $i = 2, \ldots, I$ and $j = 2, \ldots, J$,
$$(\alpha\beta)_{ij} = \log \frac{P_{ij} P_{11}}{P_{i1} P_{1j}}$$
is the log odds ratio contrasting the $(i, j)$ cell versus the $(1, 1)$ cell; see the $2 \times 2$ table
below.

                     Variable 2
                     1          j
Variable 1    1    $P_{11}$    $P_{1j}$
              i    $P_{i1}$    $P_{ij}$
An important question is how to incorporate, under the framework of log-linear models,
the additional information that (some of) the variables are ordinal. As an illustration,
the table below gives the 12 empirical log odds ratios from the disturbed dream example.
It shows the pattern that the estimated log odds ratios decrease in age for each level of
suffering and decrease in degree of suffering from disturbed dreams for each age group.
This suggests that one approach is to assign scores, $s_i$, $i = 1, \ldots, I$, and $t_j$, $j = 1, \ldots, J$, to
both variables and model $(\alpha\beta)_{ij}$, the log odds ratios, accordingly, e.g.
$$(\alpha\beta)_{ij} = \theta s_i t_j. \tag{4.8}$$

                     Degree of suffering from disturbed dreams
Age (years)    Not severe       2         3      Very severe
5-7                0            0         0          0
8-9                0          0.42      0.41       0.11
10-11              0         -0.16      0.05      -0.52
12-13              0         -0.25      0         -0.45
14-15              0         -0.56     -0.54      -1.03
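The empirical log odds ratios in the table above can be recomputed from the counts of Example 4.1, contrasting each cell with the (5-7, not severe) cell. The sketch below does so; on recomputation, the tabulated entries coincide with base-10 logarithms, and natural logarithms are larger by the factor $\ln 10 \approx 2.303$, which leaves the decreasing pattern unchanged. The function name is illustrative.

```python
import math

# Disturbed dream counts (Example 4.1): rows are age groups, columns are
# severity levels from "not severe" to "very severe".
counts = [
    [7, 4, 3, 7],      # 5-7
    [10, 15, 11, 13],  # 8-9
    [23, 9, 11, 7],    # 10-11
    [28, 9, 12, 10],   # 12-13
    [32, 5, 4, 3],     # 14-15
]


def log_odds_ratios(table, log=math.log):
    """Empirical log odds ratio of each (i, j) cell against the (1, 1) cell."""
    ref = table[0][0]
    return [[log(row[j] * ref / (row[0] * table[0][j]))
             for j in range(len(row))] for row in table]


natural = log_odds_ratios(counts)                 # natural logarithms
base10 = log_odds_ratios(counts, log=math.log10)  # matches the printed table

for row in base10:
    print([round(v, 2) for v in row])
```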
Returning to the disturbed dream example, we have fitted the above model, i.e. (4.8),
to the $5 \times 4$ contingency table with $s_i$ = mid-age of the $i$th age group, $i = 1, \ldots, 5$, and
$t_j = j$, $j = 1, \ldots, 4$. The MLE for $\theta$ is $-0.205$ with s.e. 0.050, which suggests that
the odds of suffering from disturbed dreams at a level one unit higher decrease by a factor of
0.81 ($= e^{-0.205}$) per year for boys between 5 and 15 years of age. It is interesting to
contrast this approach with the proportional odds model approach in Section 4.2.2; see Table
4.1. Both approaches lead to the same conclusion that there is a convincing negative
association between age and degree of suffering from disturbed dreams. One important
distinction between these two modelling approaches lies in the assumptions being made
for the variables and the interpretability of the primary parameter of interest, $\beta$ in (4.3)
and $\theta$ in (4.8). While both assign scores to the age variable, which is on the interval
scale, the log-linear model approach makes the further assumption of assigning scores to
the other variable, which is considered as a response variable in the proportional odds
model approach. One argument against this additional assumption is that it is rather
arbitrary. Specifically, by assuming $t_j = j$, $j = 1, \ldots, 4$, one is making the assumption
that, for example, the difference in degree of suffering from disturbed dreams between the
third and first categories is twice the difference between the second and first categories.
This arbitrariness makes the coefficient $\theta$ in (4.8) difficult to interpret. The coefficient $\beta$
induced by the proportional odds model, on the other hand, does not rely on the assigning
of scores for the response variable and is easy to interpret. We note that the deviances
for both models, each with 11 degrees of freedom, are very comparable to each other (12.42
for the proportional odds model and 14.08 for the log-linear model). This example also
serves to illustrate that goodness-of-fit measures such as the deviance are secondary (or
less relevant) in choosing between models compared to the interpretability of primary
parameters induced by the models.
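The interpretation of the score coefficient can be made explicit with a short derivation. Writing $\theta$ for the coefficient in (4.8) (the symbol is a labelling choice here), the main effects in (4.7) cancel in any odds ratio formed from two rows $i, i'$ and two columns $j, j'$, leaving only the score term:

```latex
\begin{align*}
\log \frac{\mu_{ij}\,\mu_{i'j'}}{\mu_{ij'}\,\mu_{i'j}}
  &= (\alpha\beta)_{ij} + (\alpha\beta)_{i'j'} - (\alpha\beta)_{ij'} - (\alpha\beta)_{i'j} \\
  &= \theta \left( s_i t_j + s_{i'} t_{j'} - s_i t_{j'} - s_{i'} t_j \right) \\
  &= \theta \,(s_i - s_{i'})(t_j - t_{j'}).
\end{align*}
% With ages one year apart (s_i - s_{i'} = 1) and adjacent severity
% categories (t_j - t_{j'} = 1), the odds ratio is e^{\theta};
% at \hat\theta = -0.205 this gives e^{-0.205} \approx 0.81.
```

The factor 0.81 per year therefore refers to adjacent severity categories under the equally spaced scores $t_j = j$, which is exactly where the arbitrariness discussed above enters.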
Moving on to more practical situations of $I \times J \times K$ contingency tables formed
by three categorical variables of $I$, $J$ and $K$ levels, respectively, let $Y_{ijk}$ ($\mu_{ijk}$) be the
number (expected number) of individuals who are in the $i$th, $j$th and $k$th category of
variables 1, 2 and 3, respectively, $i = 1, \ldots, I$, $j = 1, \ldots, J$, and $k = 1, \ldots, K$. Again,
the primary interest is in how these three variables are associated with each other. For
three-dimensional or higher-dimensional contingency tables, more log-linear models are
available in addition to the obvious null model
$$\log \mu_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k, \quad \alpha_1 = \beta_1 = \gamma_1 = 0,$$
and the saturated model
$$\log \mu_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk}.$$
Some interesting intermediate models include
(A) $\log \mu_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij}$,
(B) $\log \mu_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik}$,
(C) $\log \mu_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk}$.
To better understand the interpretation of the parameters specified above, in particular
$(\alpha\beta)_{ij}$, we consider the following log odds ratio
$$\log \frac{P_{ijk} P_{11k}}{P_{i1k} P_{1jk}} \equiv \log OR_{ij|k},$$
which is the log odds ratio relating variables 1 and 2 among (conditional on) those
individuals who are in the $k$th category of variable 3. Depending on the model chosen,
$\log OR_{ij|k}$ has different expressions and hence different interpretations. For the null
model, which implies no association among the three variables, $\log OR_{ij|k}$ is equal to zero as
expected. For model (C), this log odds ratio reduces to $(\alpha\beta)_{ij}$, which is the constant log
odds ratio common to all $K$ categories of variable 3. On the other hand, if the saturated
model is entertained, $\log OR_{ij|k}$ amounts to the log odds ratio only for those who are in the
$k$th category of variable 3.
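The contrast between model (C) and the saturated model can be checked numerically. The sketch below builds expected counts for a 2 x 2 x 2 table from hypothetical parameter values, assuming corner constraints (every term with an index equal to 1 is zero), and evaluates $\log OR_{22|k}$ for $k = 1, 2$. Under model (C) the value equals $(\alpha\beta)_{22}$ at both levels of variable 3; adding a three-way term makes it depend on $k$, equal to $(\alpha\beta)_{22} + (\alpha\beta\gamma)_{222}$ at $k = 2$ under these constraints.

```python
import math

# Hypothetical parameter values; corner constraints are assumed throughout.
MU0 = 3.0
ALPHA2, BETA2, GAMMA2 = 0.4, -0.3, 0.2
AB22, AG22, BG22 = 0.8, -0.5, 0.6


def expected_count(i, j, k, abg222=0.0):
    """mu_ijk for i, j, k in {1, 2}; abg222 is the three-way term."""
    a, b, c = i - 1, j - 1, k - 1  # indicators of being at level 2
    log_mu = (MU0 + ALPHA2 * a + BETA2 * b + GAMMA2 * c
              + AB22 * a * b + AG22 * a * c + BG22 * b * c
              + abg222 * a * b * c)
    return math.exp(log_mu)


def conditional_log_or(k, abg222=0.0):
    """log OR_{22|k}: variables 1 and 2 within level k of variable 3."""
    def m(i, j):
        return expected_count(i, j, k, abg222)
    return math.log(m(2, 2) * m(1, 1) / (m(2, 1) * m(1, 2)))


model_c = [conditional_log_or(k) for k in (1, 2)]
saturated = [conditional_log_or(k, abg222=0.7) for k in (1, 2)]
print(model_c, saturated)
```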
In summary, log-linear models have the following two important features that are
worth noting. First, interpretations of lower order terms (e.g. $(\alpha\beta)_{ij}$) depend on the
presence or absence of higher order terms (e.g. $(\alpha\beta\gamma)_{ijk}$). More importantly, the two-way
association parameters (e.g. $(\alpha\beta)_{ij}$) have the conditional log odds ratio interpretation,
i.e. the log OR relating two variables adjusting for other variables. One important implication
of the second feature stated above is the following: when more than one variable is
considered as a response variable, is it sensible to adjust for other response variables if
a primary question of interest is the association between a particular response and an
independent variable? We will return to this in the next chapter.
4.3.2 An Alternative Representation of Log-Linear Models
As mentioned at the outset of Section 4.3, there have been some attempts recently to adopt
log-linear models as a means to model correlated data. At first glance, this may not
seem obvious as log-linear models are applied to aggregated data whereas the interest
for the latter is to model the joint distribution of correlated data from a cluster. The
main purpose of this subsection is to show briefly how to bridge the gap between these
two seemingly unrelated tasks. The other purpose is, through this new representation,
to reinforce some potential concerns regarding interpretations of regression coefficients
induced by log-linear models, which were mentioned in Section 4.3.1.
Recall that what one observes for each subject in the context of contingency tables
can be expressed as
$$Y = (Y_1, \ldots, Y_j, \ldots, Y_n)^t,$$
where $n$ is the number of categorical variables under investigation and $Y_j$ is a categorical
variable of $C_j$ levels, $j = 1, \ldots, n$. In the disturbed dream example, $n = 2$, $C_1 = 5$
and $C_2 = 4$ if $Y_1$ and $Y_2$ represent age and degree of suffering from disturbed dreams. For
simplicity, we assume $n = 3$ and $C_j = 2$ for all $j = 1, 2, 3$ so that $Y_j = 1$ or 0 depending
on whether the subject falls in the second or the first category, respectively. Using the
same notation as in Section 4.3.1, we now consider a joint distribution for $Y = (Y_1, Y_2, Y_3)^t$
satisfying
$$\frac{\Pr(Y_j = y_j,\ j = 1, 2, 3)}{\Pr(Y_j = 0,\ j = 1, 2, 3)} = \exp\{\alpha_2 y_1 + \beta_2 y_2 + \gamma_2 y_3 + (\alpha\beta)_{22} y_1 y_2 + (\alpha\gamma)_{22} y_1 y_3 + (\beta\gamma)_{22} y_2 y_3 + (\alpha\beta\gamma)_{222} y_1 y_2 y_3\}. \tag{4.9}$$
It can be easily shown (e.g. Liang and Zeger, 1989) that the model in (4.9) is equivalent
to the saturated log-linear model for a $2 \times 2 \times 2$ contingency table where
$$Y_{ijk} = \#\text{ of subjects with } (Y_1 = i - 1,\ Y_2 = j - 1,\ Y_3 = k - 1), \quad i, j, k = 1, 2.$$
The representation in (4.9) provides an alternative way to interpret the parameters
specified under log-linear models by considering, for example,
$$\log \frac{\Pr(Y_1 = 1 \mid y_2, y_3)}{\Pr(Y_1 = 0 \mid y_2, y_3)} = \alpha_2 + (\alpha\beta)_{22} y_2 + (\alpha\gamma)_{22} y_3 + (\alpha\beta\gamma)_{222} y_2 y_3, \tag{4.10}$$
the log odds for the first categorical variable, $Y_1$, to be one as a function of $Y_2$ and $Y_3$, the
other two variables. Thus, $(\alpha\beta)_{22}$ may be interpreted as the effect, measured by log odds,
of $Y_2$ on $Y_1$ adjusting for $Y_3$. In other words, $(\alpha\beta)_{22}$ characterizes the association between
$Y_1$ and $Y_2$ conditional on the status of the third variable, $Y_3$ in this case. Similarly, one
can derive through (4.9)
$$\log \frac{\Pr(Y_2 = 1 \mid y_1, y_3)}{\Pr(Y_2 = 0 \mid y_1, y_3)} = \beta_2 + (\alpha\beta)_{22} y_1 + (\beta\gamma)_{22} y_3 + (\alpha\beta\gamma)_{222} y_1 y_3,$$
$$\log \frac{\Pr(Y_3 = 1 \mid y_1, y_2)}{\Pr(Y_3 = 0 \mid y_1, y_2)} = \gamma_2 + (\alpha\gamma)_{22} y_1 + (\beta\gamma)_{22} y_2 + (\alpha\beta\gamma)_{222} y_1 y_2.$$
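The conditional log odds in (4.10) follow from (4.9) by taking the ratio of two joint probabilities that differ only in $y_1$. The sketch below verifies this numerically for three binary variables: it builds the joint distribution implied by (4.9) from hypothetical parameter values (the variable names simply mirror the notation) and compares the conditional logit of $Y_1$ with the right-hand side of (4.10) for every $(y_2, y_3)$.

```python
import math
from itertools import product

# Hypothetical parameter values for (4.9).
a2, b2, g2 = -0.5, 0.3, 0.1
ab22, ag22, bg22, abg222 = 0.9, -0.4, 0.6, 0.25


def log_ratio(y1, y2, y3):
    """Exponent in (4.9): log of Pr(y1, y2, y3) / Pr(0, 0, 0)."""
    return (a2 * y1 + b2 * y2 + g2 * y3
            + ab22 * y1 * y2 + ag22 * y1 * y3 + bg22 * y2 * y3
            + abg222 * y1 * y2 * y3)


# Normalize over the eight cells to obtain the joint distribution.
cells = list(product((0, 1), repeat=3))
norm = sum(math.exp(log_ratio(*y)) for y in cells)
joint = {y: math.exp(log_ratio(*y)) / norm for y in cells}


def conditional_logit_y1(y2, y3):
    """Left-hand side of (4.10), computed from the joint distribution."""
    return math.log(joint[(1, y2, y3)] / joint[(0, y2, y3)])


checks = []
for y2, y3 in product((0, 1), repeat=2):
    rhs = a2 + ab22 * y2 + ag22 * y3 + abg222 * y2 * y3
    checks.append((conditional_logit_y1(y2, y3), rhs))
    print(y2, y3, checks[-1])
```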
The representations above serve to illuminate the concerns about using log-linear models to
model longitudinal data, an important special case of correlated data. In this setting, $Y_j$
represents for an individual the binary observation measured at the $j$th visit, $j = 1, 2, 3$.
An obvious question is whether it is sensible to model, as shown in (4.10), the dependence
of the first observation, $Y_1$, on observations measured subsequently, $Y_2$ and $Y_3$ in this
case. Another drawback of this modelling is that it is not reproducible (e.g. Liang, Zeger
and Qaqish, 1992) in that the joint distribution of any subset of $Y = (Y_1, Y_2, Y_3)$ does not
have the same representation as (4.9) for $Y$ itself. This is particularly troublesome for
correlated data of varying cluster sizes.
Chapter 5
Statistical Modellings for
Correlated Data
This chapter is devoted to discussion of statistical models for correlated data that
have been developed recently. Recall that in Chapter 1 we pointed out through examples that
scientific objectives for correlated data can often be characterized statistically through
regression to (1) assess the relationship between response variables and explanatory
variables and (2) examine and describe the degrees and patterns of within-cluster association.
We discuss in this chapter three different approaches: marginal models, random effects
models and observation driven models. These models are illustrated through two examples
presented in Chapter 1: the Baltimore Eye Survey (Example 1.1) and the pre-post
trial for schizophrenics (Example 1.4). These three models are contrasted with the focus
on their distinct interpretations of regression coefficients. To begin with, we recall that
correlated data can often be expressed as follows: let $Y_i = (Y_{i1}, \ldots, Y_{ij}, \ldots, Y_{in_i})^t$ be the
$n_i$ observations from the $i$th cluster, $i = 1, \ldots, m$, $m$ being the total number of clusters.
For the $j$th component, $j = 1, \ldots, n_i$, we assume the availability of $x_{ij}$, a $p \times 1$ vector
of covariates, that may be associated with $Y_{ij}$. For the Baltimore Eye Survey example,
$m = 5{,}199$, $n_i = 2$ for all subjects and $Y_{i1}$ ($Y_{i2}$) represents the visual impairment status
(1: impaired; 0: normal) of the left (right) eye of the $i$th subject. In this example
$x_{i1} = x_{i2} = x_i$, i.e. all the covariates are the same for both eyes, including age, race and
possibly interactions between the two demographic variables. The primary objective of
this study is to identify demographic variables that are associated with the risk of visual
impairment. The secondary objective is to examine the degree of fellow eye association
and its possible dependence on demographic variables as well. As for the pre-post trial,
$m = 170$ and the $n_i$'s range from 2 to 6 as some of the patients dropped out during
the course of follow-up. The primary response variable is PANSS, measuring the severity
of schizophrenia, whereas the main explanatory variables include the treatment status
($x_i = 1$ if receiving risperidone of 6 mg and 0 if receiving haloperidol of 20 mg), time (in
weeks) from the baseline, $t_j$ (0, 1, 2, 4, 6 and 8), and the interaction between $x_i$ and $t_j$.
The main objective of this clinical trial is to assess the efficacy of the new treatment for
schizophrenia, risperidone, as measured by the rate of decline in PANSS.
5.1 Marginal Models
This approach may be viewed as a multivariate analogue of the quasi-likelihood method
in Section 2.3:
(A) Systematic component: $h(\mu_{ij}) = x_{ij}^t \beta$, $\mu_{ij} = E(Y_{ij} \mid x_{ij})$.
(B) Variance specification: $\mathrm{Var}(Y_{ij}) = V(\mu_{ij}; \phi)$, where $\phi$ is the over-dispersion
parameter.
(C) Covariance specification: $\mathrm{Cov}(Y_{ij}, Y_{ik}) = C(\mu_{ij}, \mu_{ik}; \alpha)$, $j < k = 1, \ldots, n_i$, where $C$
is a known function indexed by association parameters $\alpha$.
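For the Baltimore Eye Survey example, one may consider
(A)
$$\mathrm{logit}\,\Pr(Y_{ij} = 1) = \beta_0 + \beta_1 \mathrm{Race}_i + \beta_2 (\mathrm{Age}_i - 60) + \beta_3 (\mathrm{Age}_i - 60)^2 + \beta_4 \mathrm{Race}_i (\mathrm{Age}_i - 60) + \beta_5 \mathrm{Race}_i (\mathrm{Age}_i - 60)^2, \tag{5.1}$$
(B) $\mathrm{Var}(Y_{ij}) = \mu_{ij}(1 - \mu_{ij})$,
(C) $\log OR(Y_{i1}, Y_{i2}) = \alpha_0 + \alpha_1 \mathrm{Race}_i + \alpha_2 \mathrm{Age}_i$.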
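For a pair of binary responses, the marginal means in (A) together with the odds ratio in (C) determine the whole joint distribution, and hence the covariance. The sketch below uses a standard identity for bivariate binary data (it is not taken from these notes): the joint probability that both eyes are impaired is the smaller root of a quadratic in the two marginal means and the odds ratio. The numerical inputs are hypothetical.

```python
import math


def joint_p11(mu1, mu2, psi):
    """Pr(Y1 = 1, Y2 = 1) given marginal means mu1, mu2 and odds ratio psi."""
    if abs(psi - 1.0) < 1e-12:
        return mu1 * mu2  # odds ratio of one means independence
    s = 1.0 + (mu1 + mu2) * (psi - 1.0)
    disc = s * s - 4.0 * psi * (psi - 1.0) * mu1 * mu2
    return (s - math.sqrt(disc)) / (2.0 * (psi - 1.0))


mu1 = mu2 = 0.05      # hypothetical marginal risk of impairment for each eye
psi = math.exp(2.0)   # hypothetical fellow-eye odds ratio, i.e. log OR = 2

p11 = joint_p11(mu1, mu2, psi)
p10, p01 = mu1 - p11, mu2 - p11
p00 = 1.0 - p11 - p10 - p01

cov = p11 - mu1 * mu2                 # Cov(Y_i1, Y_i2)
implied_or = p11 * p00 / (p10 * p01)  # should reproduce psi
print(p11, cov, implied_or)
```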
Several remarks are worth noting. First, one implicitly assumes in (A) of (5.1) that the
effects of age and race on the risk of visual impairment are the same for both eyes.
While one can test this assumption formally, expert opinion is that a common
etiology for both eyes is warranted. An interpretation of $e^{\beta_1}$, where $\beta_1$ is the coefficient for
the Race variable, is that it represents the odds of visual impairment for either eye of blacks
60 years old relative to the odds for whites of the same age. Finally, recall from Section 1.4 our
preference for using the odds ratio as a measure of association for categorical variables. The
specification (C) along with (A) and (B) leads to a full specification of $\mathrm{Cov}(Y_{i1}, Y_{i2})$. For
the pre-post trial example, one could, for example, consider
(A) $\mu_{ij} = \beta_0 + \beta_1 t_j + \beta_2 x_i + \beta_3 x_i t_j$,
(B) $\mathrm{Var}(Y_{ij}) = \phi$,
(C) $\mathrm{Cov}(Y_{ij}, Y_{ik}) = \phi\, e^{-\alpha |t_j - t_k|}$.
Here the AR-1 correlation structure for the repeated observations from the same patient is
imposed as shown in (C). More importantly, the interaction term $x_i t_j$ in (A) postulates
that the difference in averaged PANSS between the treated (risperidone) and control
(haloperidol) groups varies linearly in time as characterized by $\beta_3$. More specifically, $\beta_3$
represents the change in the averaged difference in PANSS between the two groups per unit (in
weeks) change as time progresses.
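To see what the covariance specification implies for the unequally spaced visit times of the trial, the following sketch builds the covariance matrix, taking the covariance in (C) to be of the exponential form $\phi \exp(-\alpha |t_j - t_k|)$. The values of $\phi$ and $\alpha$ are hypothetical. A characteristic feature of this structure is that the correlation over two weeks is the square of the correlation over one week.

```python
import math


def ar1_covariance(times, phi, alpha):
    """Covariance matrix with entries phi * exp(-alpha * |t_j - t_k|)."""
    return [[phi * math.exp(-alpha * abs(tj - tk)) for tk in times]
            for tj in times]


times = [0, 1, 2, 4, 6, 8]   # weeks from baseline, as in the pre-post trial
phi, alpha = 400.0, 0.3      # hypothetical variance and decay rate

cov = ar1_covariance(times, phi, alpha)

corr_one_week = cov[0][1] / phi    # visits at weeks 0 and 1
corr_two_weeks = cov[0][2] / phi   # visits at weeks 0 and 2
print(corr_one_week, corr_two_weeks)
```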
Finally, some general remarks on marginal models (Liang and Zeger, 1993). First,
the coefficients $\beta$ specified in (A) describe the effects of covariates on the marginal
expectation of the response variable. For the eye survey example, this marginal
expectation amounts to the averaged $Y$ for a given eye irrespective of the outcome status
of the fellow eye. For the pre-post trial example, this marginal expectation means the
averaged $Y$ at a given visit irrespective of the PANSS scores at other visits. Thus $\beta$
has the so-called population averaged interpretation by contrasting the averaged $Y$'s
from different subpopulations formed by different values of $x$ (Zeger, Liang and Albert,
1988). This approach, from this viewpoint, is particularly relevant for studies of public
health significance. Furthermore, the interpretation of $\beta$ is the same regardless of the
cluster size, $n_i$, and of the magnitude of within-cluster association. Finally, just as in
quasi-likelihood, the joint distribution of $Y_i = (Y_{i1}, \ldots, Y_{in_i})^t$ is not fully specified by (A), (B)
and (C) and this issue will be addressed in the next chapter.
5.2 Random Effects Model
Unlike the marginal model approach, which addresses issues of the regression of $Y$ on $x$
and within-cluster association through two separate sets of regression models, random
effects models and observation driven models intend to address these two issues
simultaneously through a single set of regression models. The essence of random effects models,
however, is the introduction of cluster-specific regression coefficients relating $Y$ to $x$ to
reflect natural heterogeneity among clusters caused by unmeasured factors. This notion
of heterogeneity among clusters, called random effects, translates into within-cluster
likeness or dependence. Specifically, given $b_i$, random effects specific to the $i$th cluster,
$i = 1, \ldots, m$,
(A*) $Y_{i1}, \ldots, Y_{ij}, \ldots, Y_{in_i}$ are statistically independent of each other.
(B*) Each $Y_{ij}$ follows a GLM with
$$h(E(Y_{ij} \mid b_i)) = x_{ij}^t \beta^* + z_{ij}^t b_i, \tag{5.2}$$
where $z_{ij}$, a $q \times 1$ vector, is generally a subset of $x_{ij}$.
(C*) The $b_i$'s are assumed to be independent and generated from a common distribution,
$F(\cdot\,; \delta)$, with mean zero and indexed by parameters $\delta$.
For the pre-post trial example, patients vary tremendously in baseline PANSS and in
change of PANSS over time. To acknowledge this variation statistically, one may consider
(B*) $Y_{ij} = (\beta_0^* + b_{i0}) + (\beta_1^* + b_{i1}) t_j + \beta_2^* x_i + \beta_3^* x_i t_j + \epsilon_{ij}$, where, given $b_i = (b_{i0}, b_{i1})^t$,
the $\epsilon_{ij}$ are i.i.d. $N(0, \phi)$, $j = 1, \ldots, n_i$,
(C*) $b_i$ follows a bivariate normal distribution with mean $(0, 0)^t$ and covariance matrix
$$\begin{pmatrix} \delta_{00} & \delta_{01} \\ \delta_{01} & \delta_{11} \end{pmatrix}.$$
In this model, $\beta_0^*$ and $\beta_0^* + \beta_2^*$ represent the averaged baseline PANSS for patients receiving
haloperidol and risperidone respectively, whereas $\beta_1^*$ and $\beta_1^* + \beta_3^*$ represent averaged rates
of change in PANSS for the two respective groups. Thus negative $\beta_1^*$ and $\beta_3^*$ would suggest
that receiving risperidone would lead to a greater decline in PANSS over time than
haloperidol does. Furthermore, $b_{i0}$ and $b_{i1}$, $i = 1, \ldots, m$, are introduced to account for
the variation across patients discussed earlier. The larger $\delta_{00}$ ($\delta_{11}$), the greater the variation
across subjects in PANSS scores (in rates of change in PANSS). Finally,
$$\mathrm{Var}(Y_{ij}) = \phi + \delta_{00} + 2\delta_{01} t_j + \delta_{11} t_j^2$$
and
$$\mathrm{Cov}(Y_{ij}, Y_{ik}) = \delta_{00} + \delta_{01}(t_j + t_k) + \delta_{11} t_j t_k, \quad j < k,$$
a result of the introduction of random effects, the $b_i$'s. For the eye survey example, one
may consider
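(B*) $Y_{ij} \mid b_i \sim \mathrm{Binomial}(1, \mu_{ij})$, where $\mu_{ij} = E(Y_{ij} \mid b_i)$ and
$$\mathrm{logit}\,\mu_{ij} = (\beta_0^* + b_i) + \beta_1^* \mathrm{Race}_i + \beta_2^* (\mathrm{Age}_i - 60) + \beta_3^* (\mathrm{Age}_i - 60)^2 + \beta_4^* \mathrm{Race}_i (\mathrm{Age}_i - 60) + \beta_5^* \mathrm{Race}_i (\mathrm{Age}_i - 60)^2,$$
(C*) the $b_i$ are i.i.d. $N(0, \delta_{00})$.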
This logistic regression model in (B*) is very similar to that under marginal models
with the exception of the additional individual-specific parameter, $b_i$, which is included
to account for variation among subjects in risks for visual impairment that are not
explained by the age and race variables. Indeed, with random intercepts, Zeger, Liang and
Albert (1988) have shown that there is a simple mathematical relationship between $\beta^*$
and $\beta$ in marginal models, namely,
$$\mathrm{logit}\, E(Y_{ij}) = \mathrm{logit}\, E(E(Y_{ij} \mid b_i)) \approx x_{ij}^t \beta^* / (c^2 \delta_{00} + 1)^{1/2},$$
where $c = 16\sqrt{3}/(15\pi)$. Thus, approximately,
$$\beta \approx (c^2 \delta_{00} + 1)^{-1/2} \beta^*. \tag{5.3}$$
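The quality of the approximation behind (5.3) can be examined numerically for a random intercept logistic model. The sketch below computes the marginal probability $E\{\mathrm{expit}(\eta + b)\}$ with $b \sim N(0, \text{variance})$ by a simple midpoint rule and compares it with the approximate value obtained by shrinking the linear predictor by $(c^2 \cdot \text{variance} + 1)^{-1/2}$. The linear predictor and the random intercept variance are hypothetical, and the integration settings are arbitrary choices for this illustration.

```python
import math

C_CONST = 16.0 * math.sqrt(3.0) / (15.0 * math.pi)


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def marginal_prob_exact(eta, variance, half_width=10.0, steps=4000):
    """E[expit(eta + b)], b ~ N(0, variance), by the midpoint rule."""
    sd = math.sqrt(variance)
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        z = -half_width + (i + 0.5) * h  # grid on the standard normal scale
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += expit(eta + sd * z) * density * h
    return total


def marginal_prob_approx(eta, variance):
    """Approximation underlying (5.3): shrink eta, then apply expit."""
    return expit(eta / math.sqrt(C_CONST ** 2 * variance + 1.0))


eta, variance = 1.0, 1.0  # hypothetical cluster-specific predictor and variance
exact = marginal_prob_exact(eta, variance)
approx = marginal_prob_approx(eta, variance)
attenuation = 1.0 / math.sqrt(C_CONST ** 2 * variance + 1.0)
print(exact, approx, attenuation)
```

The attenuation factor is always below one, so marginal coefficients are closer to zero than their random effects counterparts, and the gap widens as the random intercept variance grows.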
We will return to the comparison between these two models in Section 5.4. Some general
comments on random effects models follow. This approach for modelling correlated data is
particularly useful to characterize the change in $Y$ within clusters. For example, $\beta^* + b_i$
describes the effect of $x$ on $Y$ that is specific to the $i$th cluster (or subject). Furthermore,
the empirical Bayes method (Robbins, 1956) and subsequent developments (e.g. Laird
and Ware, 1982; Zeger and Karim, 1991; Breslow and Clayton, 1993) allow investigators
to estimate not only $\beta^*$ but also $b_i$. Thus this modelling is particularly relevant
from a clinical viewpoint. For example, longitudinal studies would allow clinicians to
communicate to concerned parents the drop in risk of their own children suffering
from infectious disease if the parents change their smoking behavior (e.g. from smoking to
non-smoking). Finally, given (A*), (B*) and (C*) specified above, the probability distribution
function for $Y_i$ as indexed by $\beta^*$, $\phi$ and $\delta$ is fully specified.
5.3 Observation Driven Models
As in random effects models, this approach addresses simultaneously both issues of
regression of $Y$ on $x$ and within-cluster association through a set of regression models.
One important feature of observation driven models that distinguishes them from the
random effects model is that they operate on the premise that correlation within a cluster
arises because one response is explicitly caused by others. A typical example of this kind
is the spread of a contagious disease within a family or a community. Define
$$Y_{i,-j} = (Y_{i1}, \ldots, Y_{i,j-1}, Y_{i,j+1}, \ldots, Y_{in_i})^t,$$
an $(n_i - 1) \times 1$ vector of the responses from the $i$th cluster excluding $Y_{ij}$. Observation
driven models may be characterized as follows:
(A**) For $H_{ij} = H(Y_{i,-j})$, a function of $Y_{i,-j}$, $Y_{ij}$ given $H_{ij}$ follows a GLM with
$$h(E(Y_{ij} \mid H_{ij})) = x_{ij}^t \beta^{**} + \sum_{\ell=1}^{q} f_\ell(H_{ij}; \alpha), \tag{5.4}$$
where $f_\ell$, $\ell = 1, \ldots, q$, are functions of $H_{ij}$ that are known up to $\alpha$.
The choice of $H_{ij}$ depends on the situation being dealt with. For longitudinal studies, it is
natural to consider $H_{ij} = (Y_{i1}, \ldots, Y_{i,j-1})$, the observations prior to the $j$th visit. On the
other hand, for family studies, in particular of siblings, it would be more sensible to
consider $H_{ij} = Y_{i,-j}$, the rest of the family members. If adopting this approach, one
may consider for the eye survey example
(A
) Y
ij
|Y
i,3j
Binomial(1,
ij
), j = 1, 2 with
logit Pr(Y
ij
= 1|Y
i,3j
, x
ij
)
=
0
+
1
Race
i
+
2
(Age
i
60) +
3
(Age
i
60)
2
+
4
Race
i
(Age
i
60) +
5
Race
i
(Age
i
60)
2
.
Note that with $n_i \equiv 2$, this model is the same as those proposed by Rosner (1984) and Connolly and Liang (1988) discussed in §3.3. It is interesting to note that $e^{\beta_1^{**}}$ here has the interpretation as the relative odds of visual impairment for one eye among blacks of 60 years of age versus that of whites of the same age, adjusting for the visual impairment status of the fellow eye. This is to be contrasted with $e^{\beta_1}$ under marginal models in which no adjustment for the fellow eye status is made. If the objective of the study is to identify risk factors, in particular demographic ones, for prevention purposes, one may question the relevance of this observation driven approach as the effects of race, age, etc. on risk of visual impairment for one eye may be diminished due to their direct competition against the visual status of the fellow eye in explaining $Y_{ij}$
; see (5.4). For
the pre-post trial example, one may consider
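(A**) $Y_{ij} \mid Y_{i1}, \ldots, Y_{i,j-1}$ is specified as
$$Y_{ij} = \beta_0^{**} + \beta_1^{**} t_j + \beta_2^{**} x_i + \beta_3^{**} x_i t_j + \alpha \left(Y_{i,j-1} - \beta_0^{**} - \beta_1^{**} t_{j-1} - \beta_2^{**} x_i - \beta_3^{**} x_i t_{j-1}\right) + \epsilon_{ij},$$
where $\epsilon_{ij} \overset{\text{i.i.d.}}{\sim} N(0, \phi)$.
This is known as an AR-1 process in which $Y_{ij}$ depends on the past history in $Y_i$ only through $Y_{i,j-1}$, the immediate past observation. We will discuss the relationship among $\beta$, $\beta^*$ and $\beta^{**}$ for this example in §5.4.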

Generally speaking, this approach represents a particular way to model the within-cluster dependence mechanism, e.g., past responses cause the present one in longitudinal studies. This direct modelling is particularly useful for prediction purposes. For example, one may be interested in predicting PANSS two months after the termination of the study if the patient receiving risperidone had a score of 140, which is very high, at that time. On the other hand, if the main purpose of the study is the regression of Y on x, one should be cautious about the interpretation of $\beta^{**}$, which may be non-trivial and potentially misleading, especially when the cluster sizes vary. This is especially so for discrete response variables; see the discussion in §3.2 and §4.3.
5.4 Contrasts of Three Modelling Approaches
It is clear that the three approaches presented in Sections 5.1-5.3 are motivated distinctly, considering how within-cluster association may be incorporated in conjunction with the conventional regression of Y on x. With the exception of the identity link function, which is commonly assumed for continuous responses, the three regression coefficients, $\beta$, $\beta^*$ and $\beta^{**}$, are likely to be different numerically and interpretation-wise. In addition, the within-cluster association patterns implied by the models are in general different for these approaches as well.
We now first examine more closely the special case of continuous responses with identity link. It can be shown mathematically that in this case
$$\beta = \beta^* = \beta^{**}.$$
To see this, note that random effects models specify the effects of $x_i$ on the conditional expectations of $Y_i$ given $b_i$, the random effects. However, noting that $E(b_i) = 0$ as specified in (C*) of §5.2, we have immediately that
$$E(Y_{ij}) = E(E(Y_{ij} \mid b_i)) = E(x_{ij}^t \beta^* + z_{ij}^t b_i) = x_{ij}^t \beta^*,$$
which, when compared with (A) in §5.1, leads to $\beta = \beta^*$. Similarly, if one adopts the observation driven model, one has
$$E(Y_{ij}) = E(E(Y_{ij} \mid H(Y_{i,j-1}))) = E\Big(x_{ij}^t \beta^{**} + \sum_{\ell=1}^{q} \alpha_\ell \,(y_{i,j-\ell} - x_{i,j-\ell}^t \beta^{**})\Big) = x_{ij}^t \beta^{**}.$$
Thus, for continuous responses with identity link, the three regression coefficients relating Y to x are identical numerically. The only major difference is in how the within-cluster association is modeled. For example, a random effects model with intercepts as the only random effects leads to $\text{Cov}(Y_{ij}, Y_{ik}) = D_{00}$, the variance of the random intercepts, which is a constant covariance for each pair of observations from the same cluster. On the other hand, if an AR-1 process is assumed under observation driven models, $\text{Cov}(Y_{ij}, Y_{ik})$ is proportional to $\alpha^{|t_j - t_k|}$, which decreases in $|t_j - t_k|$, the distance (in time) between two observations from the same cluster.
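The contrast between the two covariance patterns can be made concrete with a small numerical sketch. The function names, the visit times and the parameter values below are illustrative assumptions and are not taken from any of the data sets discussed in these notes; the AR-1 form assumes a stationary process with innovation scale absorbed into `sigma2`.

```python
import numpy as np

def exchangeable_cov(n, d00, phi):
    """Covariance of Y_i under a random-intercept model with identity link:
    d00 (variance of the random intercept) off the diagonal,
    d00 + phi on the diagonal."""
    return d00 * np.ones((n, n)) + phi * np.eye(n)

def ar1_cov(times, alpha, sigma2):
    """Stationary AR-1 covariance: sigma2 * alpha ** |t_j - t_k|."""
    t = np.asarray(times, dtype=float)
    lags = np.abs(t[:, None] - t[None, :])
    return sigma2 * alpha ** lags

# Hypothetical visit times and parameter values, for illustration only
times = [0, 1, 2, 4]
V_exch = exchangeable_cov(len(times), d00=2.0, phi=1.0)
V_ar1 = ar1_cov(times, alpha=0.5, sigma2=3.0)

print(V_exch)   # every off-diagonal entry is the same
print(V_ar1)    # off-diagonal entries decay with the time gap
```

Under the first structure the covariance between two visits does not depend on how far apart they are, whereas under the second it shrinks geometrically with the gap.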
The equality of $\beta$, $\beta^*$ and $\beta^{**}$ shown above is indeed an exception rather than the rule. To elaborate on this, we assume, for simplicity, that one is dealing with longitudinal studies of binary responses and logit link. In this situation, $e^{\beta}$ for a covariate x in marginal models compares prevalences in subgroups formed by observations with different x values (and hence measured at different times). For random effects models with the same covariate x, $e^{\beta^*}$ describes changes in personal risk due to change in x over time. On the other hand, $e^{\beta^{**}}$ in observation driven models characterizes changes in incidence for individuals with the same history in Y but who differ in x measured at the same time. It has been shown in this situation (Zeger, Liang and Albert, 1988) that $\beta$ is smaller than $\beta^*$ in magnitude. However, one should in general reach similar conclusions for testing the hypothesis of no association between Y and x (i.e. $\beta = 0$ or $\beta^* = 0$) since
$$\hat\beta / \text{s.e.}(\hat\beta) \approx \hat\beta^* / \text{s.e.}(\hat\beta^*).$$
On the other hand, by examining (A) in §5.1 and (A**) in §5.3 closely, we would speculate that $\beta^{**}$ would be smaller than $\beta$ in magnitude due to the fact that (1) both $x_{ij}$ and $H(Y_{i,-j})$ appear in the same regression model in explaining $Y_{ij}$ and (2) the $Y_{ij}$'s are in general highly correlated with each other.
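The attenuation of the marginal coefficient relative to the random effects coefficient under the logit link can be checked numerically. The sketch below assumes a random-intercept logistic model with a normal random effect and a single binary covariate; the parameter values are made up for illustration. The marginal probabilities are obtained by Gauss-Hermite quadrature, and the closed-form approximation with constant $c = 16\sqrt{3}/(15\pi)$ is the one commonly attributed to Zeger, Liang and Albert (1988).

```python
import numpy as np

def marginal_prob(eta, d, n_nodes=40):
    """E[expit(eta + b)] for b ~ N(0, d), by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    b = np.sqrt(2.0 * d) * nodes              # change of variable for N(0, d)
    p = 1.0 / (1.0 + np.exp(-(eta + b)))
    return float(np.sum(weights * p) / np.sqrt(np.pi))

def logit(p):
    return np.log(p / (1.0 - p))

# Illustrative (hypothetical) subject-specific parameters and variance
beta0_star, beta1_star, d = -2.0, 1.0, 4.0

p0 = marginal_prob(beta0_star, d)                  # x = 0
p1 = marginal_prob(beta0_star + beta1_star, d)     # x = 1
beta1_marginal = logit(p1) - logit(p0)             # marginal log odds ratio

# Closed-form approximation: beta ~ beta* / sqrt(1 + c^2 d)
c2 = (16.0 * np.sqrt(3.0) / (15.0 * np.pi)) ** 2
beta1_approx = beta1_star / np.sqrt(1.0 + c2 * d)

print(beta1_star, beta1_marginal, beta1_approx)
```

With a non-zero random effects variance the marginal log odds ratio is closer to zero than the subject-specific one, and the gap widens as `d` grows.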
Chapter 6
Statistical Inference for
Correlated Data
In Chapter 5 we introduced and contrasted three statistical modelling approaches for correlated data. In this chapter, we review approaches to making inference about the parameters specified by the models. These approaches include full likelihood methods, conditional likelihood methods and estimating function methods. Pros and cons of each approach are highlighted.
6.1 Marginal Models
Recall that for marginal models only the first two joint moments of the $Y_i$'s are required to be specified, i.e.
(A) Systematic component: $h(\mu_{ij}) = x_{ij}^t \beta$, for all $j = 1, \ldots, n_i$,
(B) Covariance/variance specification: $\text{Cov}(Y_{ij}, Y_{ik}) = C(\mu_{ij}, \mu_{ik}; \phi, \alpha)$, $j \le k = 1, \ldots, n_i$.
The question is, given the specification in (A) and (B), how to make inference on $\beta$ and $\alpha$ based on the data at hand. Just as in the univariate response case, we consider two different approaches.
6.1.1 Quasi-Likelihood Approach
If the regression of Y on x is the primary interest, which usually is the case in, for example, longitudinal studies, one may estimate $\beta$ by solving
$$S_1(\beta, \alpha) = \sum_{i=1}^{m} \left(\frac{\partial \mu_i}{\partial \beta}\right)^t \text{Cov}^{-1}(Y_i; \beta, \alpha)\,(y_i - \mu_i(\beta)) = 0. \tag{6.1}$$
The above equation is known as the generalized estimating equation (GEE) (Liang and Zeger, 1986), which reduces to the quasi-score equation defined in §2.3 when $n_i = 1$ for all $i = 1, \ldots, m$. Since $S_1$ in (6.1) depends on $\alpha$ as well, one may iterate until convergence between solving $S_1(\beta, \hat\alpha(\beta)) = 0$ for $\beta$ and updating $\hat\alpha(\beta)$ with the latest $\hat\beta$ solved above (Liang and Zeger, 1986). This approach, which we term GEE1, is robust in the sense that, just as in quasi-likelihood, (1) no distributional assumption on $Y_i = (Y_{i1}, \ldots, Y_{in_i})^t$ is needed and (2) consistent estimation of $\beta$ and correct precision assessments on $\hat\beta$ are available even if the covariance structure for $Y_i$, $C(\mu_{ij}, \mu_{ik}; \phi, \alpha)$, is incorrectly specified. For the latter part, see some key references including White (1982), Gourieroux et al. (1984), Royall (1986) and Liang and Zeger (1986). Furthermore, high efficiency relative to the MLE of $\beta$ is usually maintained even if $\text{Cov}(Y_i)$ is incorrectly specified (e.g. Liang and Zeger, 1986; Zhao, Prentice and Self, 1992). As expected, the degree of efficiency loss depends on the magnitude of within-cluster correlation (the higher the correlation, the greater the degree of efficiency loss) and on how well the specified working covariance for $Y_i$ in (6.1) approximates the true covariance structure. It is important to point out that incorrectly assuming no association within clusters may yield a high degree of efficiency loss, especially for covariates whose values vary within clusters; see §2.4 and Zhao, Prentice and Self (1992). More importantly, such an exercise may lead to severely biased estimation of $\beta$ if the data are missing in a non-trivial manner. This point will be further illuminated through the pre-post trial data analysis to be presented later.
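The alternation between solving the estimating equation for the regression coefficients and updating the association parameter can be sketched for the simplest case. The code below is only a minimal illustration, assuming an identity link, an exchangeable working correlation and simple moment estimators for the dispersion and the correlation; the function name and the simulated data are hypothetical, and packaged GEE software should be preferred in practice.

```python
import numpy as np

def gee1_identity_exchangeable(y_list, x_list, n_iter=20):
    """GEE1 sketch: identity link, working Cov(Y_i) = phi * R_i(alpha) with
    R_i(alpha) = (1 - alpha) I + alpha 1 1^t (exchangeable)."""
    p = x_list[0].shape[1]
    alpha = 0.0
    beta = np.zeros(p)
    for _ in range(n_iter):
        # Step 1: solve sum_i X_i^t R_i^{-1} (y_i - X_i beta) = 0 (phi cancels)
        A = np.zeros((p, p))
        b = np.zeros(p)
        for y, X in zip(y_list, x_list):
            n = len(y)
            R_inv = np.linalg.inv((1 - alpha) * np.eye(n) + alpha * np.ones((n, n)))
            A += X.T @ R_inv @ X
            b += X.T @ R_inv @ y
        beta = np.linalg.solve(A, b)
        # Step 2: moment updates of phi and alpha from the current residuals
        ss = cnt = num = den = 0.0
        for y, X in zip(y_list, x_list):
            r = y - X @ beta
            ss += r @ r
            cnt += len(r)
            num += (r.sum() ** 2 - r @ r) / 2.0   # sum over pairs j < k of r_j r_k
            den += len(r) * (len(r) - 1) / 2.0
        phi = ss / cnt
        alpha = (num / den) / phi
    # Robust (sandwich) covariance of beta_hat
    bread = np.zeros((p, p))
    meat = np.zeros((p, p))
    for y, X in zip(y_list, x_list):
        n = len(y)
        R_inv = np.linalg.inv((1 - alpha) * np.eye(n) + alpha * np.ones((n, n)))
        u = X.T @ R_inv @ (y - X @ beta)
        bread += X.T @ R_inv @ X
        meat += np.outer(u, u)
    bread_inv = np.linalg.inv(bread)
    return beta, alpha, bread_inv @ meat @ bread_inv

# Simulated clusters with a shared random intercept (true alpha = 0.5)
rng = np.random.default_rng(0)
m, n = 200, 4
x_list, y_list = [], []
for _ in range(m):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ np.array([1.0, 2.0]) + rng.normal() + rng.normal(size=n)
    x_list.append(X)
    y_list.append(y)

beta_hat, alpha_hat, robust_cov = gee1_identity_exchangeable(y_list, x_list)
print(beta_hat, alpha_hat, np.sqrt(np.diag(robust_cov)))
```

The sandwich form in the last step is what delivers valid standard errors when the working correlation is wrong; only the point estimate's efficiency depends on how well the working structure matches the truth.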
When the within-cluster association is the primary interest, which is typically the case for family studies, one may argue that this GEE1 approach may be inefficient as far as inference for $\alpha$ is concerned. This is because in GEE1, estimation of $\beta$ and $\alpha$ is treated as if they were independent of (or orthogonal to) each other. As an alternative, one may solve for $\delta = (\beta, \alpha)$ simultaneously through
$$S_2(\delta, \gamma) = \sum_{i=1}^{m} \left(\frac{\partial \mu_i^*}{\partial \delta}\right)^t \text{Cov}^{-1}(Z_i; \delta, \gamma)\,(z_i - \mu_i^*(\delta)) = 0, \tag{6.2}$$
where
$$Z_i = (Y_i^t, W_i^t)^t = (Y_i^t, Y_{i1}^2, \ldots, Y_{in_i}^2, Y_{i1} Y_{i2}, \ldots, Y_{i,n_i-1} Y_{in_i})^t,$$
$$\mu_i^* = E(Z_i; \delta) = (E^t(Y_i; \beta), E^t(W_i; \beta, \alpha))^t,$$
and $\gamma$ are parameters characterizing the third and fourth moments of the $Y_i$'s as required by $\text{Cov}(Z_i)$ (Liang, Zeger and Qaqish, 1992). Several remarks on this GEE2 approach are in order.
First, with a proper choice of $\hat\alpha(\beta)$, the GEE1 method may be viewed as a special case of GEE2 if $\text{Cov}(Y_i, W_i)$ in the working covariance matrix for $Z_i$, $\text{Cov}(Z_i)$, is set to zero. It can be seen in this situation that the first p components of $S_2(\delta, \gamma)$ reduce to $S_1(\beta, \alpha)$. Second, in some instances considered, Liang, Zeger and Qaqish (1992) found that the GEE2 method is more efficient for estimating $\beta$ and $\alpha$ than the GEE1 method, although the degree of efficiency gain is much less substantial for $\beta$ than for $\alpha$. We will return to this point through an illustration using the eye survey data as an example. This new approach, however, creates complications that are not associated with the use of GEE1. One complication is the need to estimate $\gamma$ as seen in (6.2). Given that $\text{Cov}(Z_i)$ serves as a working covariance matrix and its correct specification is not needed, one may compute $\text{Cov}(Z_i)$ as if $Y_i$ follows a multivariate normal distribution (Crowder, 1987; Prentice and Zhao, 1991). This resolves the issue of dependence of $S_2$ on $\gamma$ since multivariate normal distributions are fully specified by the first two moments. Another approach is to replace $\gamma$ by a guesstimate, i.e. a fixed value. Liang, Zeger and Qaqish (1992) found that for the cases considered, this approach incurs very little power loss. However, no general performance of these two approaches for dealing with $\gamma$ has been established and this issue is likely to be examined on a case by case basis.
Another complication is that the robustness property for $\beta$ enjoyed by GEE1 is no longer maintained. To see this, except when $\text{Cov}(Y_i, W_i)$ is set to zero, it can be shown that the first p components of $S_2(\delta, \gamma)$, $S_{2,\beta}(\delta, \gamma)$ say, are a linear combination of $y_i - \mu_i(\beta)$ and $W_i - E(W_i; \beta, \alpha)$. Thus, $S_{2,\beta}(\delta, \gamma)$ is an unbiased estimating function for $\beta$ only if the $E(W_i; \beta, \alpha)$ computed is correct, i.e. if $\text{Cov}(Y_i)$ is correctly specified. In other words, the consistency of $\hat\beta$, the solution of $S_2(\delta, \gamma) = 0$, depends on the correct specification of $\text{Cov}(Y_i)$; see Liang and Zeger (1995) for a detailed examination of this issue for a special case.
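To make the extended response vector used by GEE2 concrete, the following sketch builds $Z_i$ from a single cluster's responses by stacking the responses, their squares and all pairwise cross-products in the order given in the text. The function name is a hypothetical choice for illustration.

```python
import numpy as np
from itertools import combinations

def gee2_response_vector(y):
    """Stack (Y_i, Y_ij^2 for all j, Y_ij * Y_ik for all j < k)."""
    y = np.asarray(y, dtype=float)
    squares = y ** 2
    cross = np.array([y[j] * y[k] for j, k in combinations(range(len(y)), 2)])
    return np.concatenate([y, squares, cross])

z = gee2_response_vector([1.0, 2.0, 3.0])
print(z)          # [1. 2. 3. 1. 4. 9. 2. 3. 6.]
print(len(z))     # n + n + n(n-1)/2 = 9 for n = 3
```

For binary responses the squared terms duplicate the responses themselves, so in that case only the cross-products carry additional information about the association.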
6.1.2 Likelihood Approach
Recall that specifications (A) and (B) do not fully specify the joint distribution of the $Y_i$'s as only the first two moments of the $Y_i$'s are involved in (A) and (B). A natural question is: does there exist a probability function for $Y_i$ that is compatible with (A) and (B)? The work by Gourieroux et al. (1984) and Zhao et al. (1992) suggests that the answer is yes. Dropping the index i temporarily, consider the following partly exponential family
$$f(Y = y; \theta, \alpha) = \exp\{\theta^t y + c(y; \alpha) - b(\theta, \alpha)\}, \tag{6.3}$$
where $\theta$, called the canonical parameters, is determined by
$$\mu = E(Y; \beta) = \int y \exp\{\theta^t y + c(y; \alpha) - b(\theta, \alpha)\}\, dy$$
through the one-to-one correspondence
$$\begin{pmatrix} \theta \\ \alpha \end{pmatrix} \leftrightarrow \begin{pmatrix} \mu(\beta) \\ \alpha \end{pmatrix}.$$
Here $c(y; \alpha)$ is pre-specified and $\alpha$ is called the shape parameters. Some choices of $c(y; \alpha)$ include
$$c(y; \alpha) = -\tfrac{1}{2}\, y^t \Sigma^{-1}(\alpha)\, y,$$
which, in conjunction with (6.3), leads to the multivariate normal distribution, and for binary responses
$$c(y; \alpha) = \sum_{j<k} \alpha_{jk}\, y_j y_k,$$
known as the quadratic exponential family (e.g. Zhao and Prentice, 1990; Fitzmaurice and Laird, 1993). With this partly exponential family, it has been shown (Zhao et al., 1992; Fitzmaurice and Laird, 1993) that the score function for $\delta = (\beta, \alpha)$ has the form
$$S(\beta, \alpha) = \sum_{i=1}^{m} \begin{pmatrix} \left(\partial \mu_i(\beta)/\partial \beta\right)^t & 0 \\ 0 & I_{q \times q} \end{pmatrix} \begin{pmatrix} \text{Cov}(Y_i; \delta) & 0 \\ \text{Cov}(Y_i, W_i; \delta) & I_{q \times q} \end{pmatrix}^{-1} \begin{pmatrix} y_i - \mu_i(\beta) \\ w_i - E(W_i; \delta) \end{pmatrix},$$
where $I_{q \times q}$ is the $q \times q$ identity matrix and $w_i = \partial c(y_i; \alpha)/\partial \alpha$. Consequently, the first p components of $S(\beta, \alpha)$ are simply $S_1(\beta, \alpha)$ in (6.1). Thus $S_1(\beta, \alpha)$ from GEE1 is the score function for $\beta$ if $Y_i$ is from a partly exponential family, as long as the $\text{Cov}(Y_i)$ specified in $S_1$ is the true covariance matrix of $Y_i$. Furthermore, it can be shown (e.g., Zhao et al., 1992) that $\beta$ and $\alpha$ are orthogonal to each other, i.e. the covariance between the score functions for $\beta$ and $\alpha$ is zero. Consequently, the consistency of $\hat\beta$, the MLE of $\beta$, is maintained even if $c(y; \alpha)$ is misspecified, a property enjoyed by GEE1.
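For binary responses the quadratic exponential family can be evaluated directly by enumerating all response patterns, which makes the role of the normalizing term and of the shape parameters visible. The sketch below is an illustration under assumed (hypothetical) parameter values; the function name and the dictionary layout for the pairwise parameters are choices made here, not notation from the references.

```python
import numpy as np
from itertools import product, combinations

def quadratic_exponential_pmf(theta, alpha):
    """pmf proportional to exp{theta^t y + sum_{j<k} alpha_jk y_j y_k}
    over all binary vectors y; alpha is a dict keyed by pairs (j, k), j < k."""
    n = len(theta)
    support = list(product([0, 1], repeat=n))
    log_w = []
    for y in support:
        s = sum(theta[j] * y[j] for j in range(n))
        s += sum(alpha[(j, k)] * y[j] * y[k] for j, k in combinations(range(n), 2))
        log_w.append(s)
    w = np.exp(np.array(log_w))
    probs = w / w.sum()          # w.sum() plays the role of exp{b(theta, alpha)}
    return dict(zip(support, probs))

# Two binary responses; the pairwise parameter is then the log odds ratio
pmf = quadratic_exponential_pmf(theta=[-1.0, -0.5], alpha={(0, 1): 2.3})
odds_ratio = pmf[(1, 1)] * pmf[(0, 0)] / (pmf[(1, 0)] * pmf[(0, 1)])
print(sum(pmf.values()), np.log(odds_ratio))
```

With two responses the pairwise shape parameter coincides with the log odds ratio between them. For larger clusters it is a conditional log odds ratio given the remaining responses, which is one way to see why the family is not reproducible across cluster sizes.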
On the other hand, this likelihood approach based on the partly exponential family should be used with caution. First, the choice of $c(y; \alpha)$, which along with $\mu(\beta)$ would lead to the complete specification of $f(y; \theta, \alpha)$, requires more knowledge about the random mechanism than is needed for (A) and (B) only. In the absence of sufficient knowledge, either substantive or empirical, one should proceed with caution. Second, this family is in general not reproducible, i.e. the distribution function of any subset of $Y_i$ may not have the same form as $f(Y_i)$ in (6.3). This lack of reproducibility is especially troublesome if cluster sizes vary, which is typically the case for family and longitudinal studies.

Table 6.1: Regression estimates and standard errors (in parentheses) for the
visual impairment response from the Baltimore Eye Survey Study.

                    Left eye    Ordinary
                    only        logistic    GEE1        GEE2        Variance ratio
Variable            (I)         (II)        (III)       (IV)        (I)/(IV)
Intercept           -2.87       -2.82       -2.82       -2.82       1.66
                    (0.098)     (0.067)     (0.076)     (0.076)
Race                0.356       0.332       0.334       0.334       1.58
                    (0.132)     (0.089)     (0.105)     (0.105)
Age-60              0.050       0.049       0.049       0.049       1.65
                    (0.009)     (0.006)     (0.007)     (0.007)
(Age-60)^2          0.0007      0.0018      0.0018      0.0018      1.00
                    (0.0004)    (0.0003)    (0.0003)    (0.0003)
Race x (Age-60)     -0.003      0.001       0.0007      0.0007      1.60
                    (0.011)     (0.0008)    (0.0005)    (0.0005)
Race x (Age-60)^2   -0.0009     -0.001      -0.001      -0.001      1.43
                    (0.0006)    (0.0004)    (0.0005)    (0.0005)
log OR
  Intercept                                 2.30        2.30
                                            (0.264)     (0.177)
  Race                                      0.54        0.56
                                            (0.415)     (0.255)

We now return to the Baltimore eye survey example to illustrate some of the points made earlier. Table 6.1 presents results from four approaches: Approach I: logistic regression using the data from the left eyes only; Approach II: logistic regression using data from both eyes, treating them as if they were statistically independent of each other; Approaches III and IV: GEE1 and GEE2, respectively, both of which model $\log \text{OR}(Y_L, Y_R)$ as a function of race. First, the point estimates of the logistic regression coefficients from the four approaches are virtually identical, suggesting that the etiology of visual impairment is similar for both eyes. Furthermore, the association between age and risk of visual impairment is strong and positive and varies between blacks and whites; see Figure 6.1. Second, the ordinary logistic regression approach (Approach II), which ignores within-cluster association, has the tendency, as expected, to underestimate the s.e. of $\hat\beta$ compared to that of approaches III and IV, which are known to be correct, at least asymptotically. Third, it is obvious and intuitive that it is less efficient to use data from one eye only; see the last column of Table 6.1. However, the ratios of the variance of $\hat\beta$ between approaches I and IV, the latter using twice the sample size, are approximately equal to 1.6 instead of 2. This is known as the design effect in complex survey sampling and is due to the fact that the data from fellow eyes are highly correlated with each other. Indeed, results from Table 6.1 indicate that the estimated odds ratio relating two fellow eyes in terms of risk for visual impairment is approximately 9.97 ($= e^{2.30}$) for whites and 17.1 ($= e^{2.30+0.54}$) for blacks. The significant difference in odds ratios between blacks and whites is consistent with the long-standing clinical finding in the United States that blacks have a higher incidence of glaucoma and of diabetes mellitus, both of which tend to be bilateral diseases. Finally, there is a substantial efficiency gain in using GEE2 for inference on the pattern of within-cluster association, as the ratios of $\text{Var}(\hat\alpha)$ for GEE1 versus GEE2 are 2.22 ($= (0.264/0.177)^2$) for the intercept and 2.65 ($= (0.415/0.255)^2$) for the race variable.

Figure 6.1: Predicted risk of VI by race and age among those with 9 years of education in Baltimore Eye Survey Study (Tielsch et al., 1990): solid line, whites; dashed line, blacks.
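The figures quoted in this discussion follow directly from the entries of Table 6.1, and the short sketch below reproduces them. The last line is a back-of-the-envelope reading of the design effect that assumes two equally informative measurements with common correlation `rho`, for which doubling the data reduces the variance by a factor 2/(1 + rho); that simplification is an assumption made here for illustration and is not part of the original analysis.

```python
import math

# Odds ratios between fellow eyes, from the log OR rows of Table 6.1
or_whites = math.exp(2.30)
or_blacks = math.exp(2.30 + 0.54)

# Ratios of Var(alpha_hat), GEE1 versus GEE2
var_ratio_intercept = (0.264 / 0.177) ** 2
var_ratio_race = (0.415 / 0.255) ** 2

# Variance ratio (I)/(IV) for the intercept of the regression model
var_ratio_beta0 = (0.098 / 0.076) ** 2

# Correlation implied by a variance ratio of 1.6 under the 2 / (1 + rho) rule
implied_rho = 2.0 / 1.6 - 1.0

print(round(or_whites, 2), round(or_blacks, 1))
print(round(var_ratio_intercept, 2), round(var_ratio_race, 2))
print(round(var_ratio_beta0, 2), round(implied_rho, 2))
```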
6.2 Random Effects Models
As noted in §5.2, given
(A*) Conditional on $b_i$, the random effects, $Y_{i1}, \ldots, Y_{in_i}$ are statistically independent and for each j, $Y_{ij}$ follows a GLM with $h(E(Y_{ij} \mid b_i)) = x_{ij}^t \beta^* + z_{ij}^t b_i$,
(B) $b_i$, a $q \times 1$ vector, $\overset{\text{i.i.d.}}{\sim} F(\cdot\,; D)$ with mean zero,
the likelihood function for $\beta^*$, $\phi$ and $D$ is completely specified, namely,
$$L(\beta^*, \phi, D) \propto \prod_{i=1}^{m} \int \prod_{j=1}^{n_i} f(y_{ij} \mid b_i; \beta^*, \phi)\, f(b_i; D)\, db_i. \tag{6.4}$$
As an example, for binary responses with logit link and a sole random intercept, i.e. $z_{ij} \equiv 1$, following $N(0, D)$, the likelihood function has the form
$$L(\beta^*, D) \propto \prod_{i=1}^{m} \int \exp\Big\{ \sum_{j=1}^{n_i} y_{ij} x_{ij}^t \beta^* + \sum_{j=1}^{n_i} y_{ij} b_i - \sum_{j=1}^{n_i} \log\big(1 + e^{x_{ij}^t \beta^* + b_i}\big) \Big\} \frac{1}{\sqrt{2\pi D}}\, e^{-b_i^2/(2D)}\, db_i. \tag{6.5}$$
One can see from (6.5) that the likelihood function involves q-dimensional integrals (with q = 1 in (6.5)), one from each cluster, and in general numerical approximation is needed in order to compute (6.4). We now consider two different approaches whose pros and cons will be contrasted throughout.
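One common way to approximate the one-dimensional integral in the random-intercept logistic likelihood is Gauss-Hermite quadrature. The sketch below evaluates a single cluster's contribution in this way. The function names, the number of quadrature nodes and the example data are assumptions for illustration, and quadrature is named here as one possible numerical device rather than the method used in the analyses reported later.

```python
import numpy as np

def cluster_loglik(y, X, beta, d, n_nodes=30):
    """log of one cluster's contribution to the random-intercept logistic
    likelihood, with b_i ~ N(0, d) integrated out by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    b = np.sqrt(2.0 * d) * nodes                    # change of variable for N(0, d)
    eta = (X @ beta)[:, None] + b[None, :]          # n_i x n_nodes linear predictors
    loglik_given_b = (y[:, None] * eta - np.log1p(np.exp(eta))).sum(axis=0)
    integral = np.sum(weights * np.exp(loglik_given_b)) / np.sqrt(np.pi)
    return float(np.log(integral))

def marginal_loglik(y_list, x_list, beta, d):
    """Sum of cluster contributions, i.e. the log of the product over clusters."""
    return sum(cluster_loglik(y, X, beta, d) for y, X in zip(y_list, x_list))

# Hypothetical cluster of size two with an intercept and one covariate
y_ex = np.array([1.0, 0.0])
X_ex = np.array([[1.0, 0.0], [1.0, 1.0]])
beta_ex = np.array([-1.0, 0.5])
d_ex = 2.0
print(cluster_loglik(y_ex, X_ex, beta_ex, d_ex))
```

For a q-dimensional random effect the same idea requires a product grid of nodes, so the cost grows quickly with q, which is the numerical burden referred to in the text.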
6.2.1 Conditional Likelihood Approach
This approach only works when the link function is a canonical link, in which case the first part of the integrand in (6.4) has the simple form
$$\prod_{j=1}^{n_i} f(y_{ij} \mid b_i; \beta^*) = \exp\Big\{ \sum_{j=1}^{n_i} y_{ij} x_{ij}^t \beta^* + \sum_{j=1}^{n_i} y_{ij} z_{ij}^t b_i - \sum_{j=1}^{n_i} a(x_{ij}^t \beta^* + z_{ij}^t b_i) + \sum_{j=1}^{n_i} c(y_{ij}) \Big\}, \tag{6.6}$$
where $a(\cdot)$ is the cumulant function of the GLM and $c(\cdot)$ depends on the data only. This includes (6.5) as a special case. Some important examples of canonical links include, as mentioned in §2.2, the identity link for normal distributions, the log link for the Poisson distribution and the logit link for the binomial distribution. One important feature of (6.6) is that, treating the $b_i$ as fixed, $\sum_{j=1}^{n_i} y_{ij} x_{ij}$ and $\sum_{j=1}^{n_i} y_{ij} z_{ij}$ are jointly minimal sufficient for $\beta^*$ and $b_i$, the canonical parameters. Consequently, one can eliminate $b_i$ in (6.6) by conditioning on $\sum_{j=1}^{n_i} y_{ij} z_{ij}$, the minimal sufficient statistic for $b_i$ for fixed $\beta^*$, i.e.
$$L_c(\beta_w^*) \propto \prod_{i=1}^{m} f\Big( y_{i1}, \ldots, y_{in_i} \,\Big|\, \sum_{j=1}^{n_i} y_{ij} z_{ij};\, \beta_w^* \Big), \tag{6.7}$$
where $\beta^* = (\beta_w^*, \beta_z^*)$ and $x_{ij} = (w_{ij}, z_{ij})$, as $z_{ij}$ is assumed to be a subset of $x_{ij}$. Thus, conditioning (see §2.2) gives rise to a conditional likelihood which is independent of $b_i$ and $\beta_z^*$ and depends only on $\beta_w^*$. In other words, only a subset of $\beta^*$ is estimable. Thus, the usefulness of this conditional likelihood approach depends on whether covariates of interest have been declared as random effects or not. For example, if $z_{ij} \equiv 1$, i.e. the random intercept is the only random effect declared, then only those $\beta^*$ from covariates $w_{ij}$ which vary within clusters are estimable, since otherwise
$$\sum_{j=1}^{n_i} y_{ij} w_{ij} = w_i \sum_{j=1}^{n_i} y_{ij} \propto \sum_{j=1}^{n_i} y_{ij},$$
which is a function of the conditioning statistic. Thus it should be clear that this conditional likelihood approach would not work for the eye survey example since all the covariates are cluster-specific rather than eye-specific. As another example, consider longitudinal studies in which the primary objectives are often centered on the individual change of the response variables over time. If the time variable, i.e. the times at which the response variables are measured, is considered as a random effect, known as random slopes, then one cannot estimate the averaged rate of change coefficient should the conditional likelihood method be employed. This, along with the fact that canonical links are required, limits the extent to which this method may be applied. One desirable property of this method, i.e. using $L_c(\beta_w^*)$ as the basis for inference, is that no distributional assumption on the $b_i$'s is needed, which is comforting as the validity of such an assumption is difficult to verify in general.
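The way conditioning removes both the random intercept and any cluster-level covariate can be seen in a small sketch for the logit link. The code below evaluates one cluster's conditional log-likelihood given the cluster total by enumerating all response patterns with that total. The function name and the numerical values are hypothetical, and the enumeration is only practical for small clusters.

```python
import numpy as np
from itertools import combinations

def conditional_loglik_cluster(y, W, beta_w):
    """log f(y_i | sum_j y_ij; beta_w) for binary responses, logit link and a
    random intercept; W holds the covariates w_ij, one row per observation."""
    y = np.asarray(y)
    n, s = len(y), int(y.sum())
    score = W @ beta_w                           # w_ij^t beta_w for each j
    numer = float(score @ y)
    # All response patterns with the same total s as the observed one
    denom_terms = [sum(score[j] for j in idx) for idx in combinations(range(n), s)]
    return numer - float(np.log(np.sum(np.exp(denom_terms))))

y_ex = np.array([1, 0, 1])
W_ex = np.array([[0.2], [1.5], [-0.7]])          # varies within the cluster
base = conditional_loglik_cluster(y_ex, W_ex, np.array([0.8]))

# Adding a covariate that is constant within the cluster changes nothing,
# whatever coefficient is attached to it
W_plus = np.column_stack([W_ex, np.full(3, 4.0)])
with_constant = conditional_loglik_cluster(y_ex, W_plus, np.array([0.8, -3.0]))
print(base, with_constant)
```

Because the constant covariate contributes the same amount to the numerator and to every term of the denominator, its coefficient cannot be learned from the conditional likelihood, which is the estimability restriction described above.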
An obvious alternative is to use the full likelihood, $L(\beta^*, \phi, D)$, in (6.4) as the basis for inference on both the fixed effects, $\beta^*$, and the variance components, $D$ and $\phi$. For continuous responses with identity links and normal distributions on both $Y \mid b$ and $b$ itself, i.e.
(A*) $Y_{ij} \mid b_i \sim N(x_{ij}^t \beta^* + z_{ij}^t b_i, \phi)$,
(B*) $b_i \overset{\text{i.i.d.}}{\sim} \text{MVN}(0, D)$,
$Y_i$ follows a multivariate normal distribution with mean $x_i^t \beta^*$ and covariance matrix $z_i^t D z_i + \phi I_{n_i \times n_i}$, where
$$x_i = \begin{pmatrix} x_{i1}^t \\ \vdots \\ x_{in_i}^t \end{pmatrix}^t, \qquad z_i = \begin{pmatrix} z_{i1}^t \\ \vdots \\ z_{in_i}^t \end{pmatrix}^t.$$
In this case, the restricted maximum likelihood (REML) method (e.g. Patterson and Thompson, 1971; Laird and Ware, 1982) for $\beta^*$, $\phi$ and $D$ can be used. Furthermore, the empirical Bayes method (Robbins, 1956) can be utilized to help provide more desirable estimates of the $b_i$'s. This aspect of inference is useful especially for the purpose of identifying clusters that are distinct from others with regard to the behavior of the random effects. A prime example is in growth studies in which investigators are interested in identifying children whose growth rates, as captured by the $b_i$'s, are lower than the rest of the sample. We note that this REML method, along with the empirical Bayes procedure for estimating the $b_i$'s, has been implemented in SAS as a part of PROC MIXED.
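For the normal model with a random intercept only, the empirical Bayes estimate of the random effect has a simple closed form: the cluster's mean residual is shrunk towards zero by the factor $n_i D_{00} / (n_i D_{00} + \phi)$. The sketch below illustrates this under the assumption that the fixed effects and variance components are known (in practice they are replaced by estimates). The function name and the data for the single subject are hypothetical; the two variance values are taken from the random intercepts column of Table 6.3 purely as plausible magnitudes.

```python
import numpy as np

def eb_random_intercept(y, X, beta, d00, phi):
    """Empirical Bayes (posterior mean) estimate of a random intercept in a
    normal linear mixed model, given fixed effects and variance components."""
    resid = np.asarray(y, dtype=float) - np.asarray(X, dtype=float) @ beta
    n = len(resid)
    shrink = n * d00 / (n * d00 + phi)       # shrinkage factor in (0, 1)
    return shrink * resid.mean(), shrink

# Hypothetical subject with three visits at times 0, 2 and 4
y_ex = np.array([95.0, 90.0, 84.0])
X_ex = np.column_stack([np.ones(3), [0.0, 2.0, 4.0]])
beta_ex = np.array([91.0, -1.2])
b_hat, shrink = eb_random_intercept(y_ex, X_ex, beta_ex, d00=297.6, phi=153.7)
print(b_hat, shrink)
```

Subjects with few observations are shrunk more strongly towards the population mean, which is why these estimates are often preferred to raw subject-level averages when ranking or flagging clusters.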
In general, the likelihood function involves complicated q-dimensional integrations which are demanding numerically, especially when q is moderately large. Two methods have been developed recently to address this issue. One is to use the Gibbs sampler, in which the posterior distribution of $\beta^*$, $\phi$ and $D$ given the data is simulated through, for example, a Markov chain Monte Carlo (MCMC) algorithm (e.g. Zeger and Karim, 1991). The other approach, known as the penalized likelihood method, approximates $f(b_i \mid Y_i)$ by a normal density function, leading to an approximate likelihood function which is less computationally intensive (e.g., Stiratelli et al., 1984; Breslow and Clayton, 1993). This approach has been shown through simulations to perform well in general except for binary responses with small cluster sizes (Breslow and Clayton, 1993). This method has been implemented in SAS as well (PROC GLIMMIX) and has been refined more recently to correct the bias occurring in the situation described above (e.g. Breslow and Lin, 1995; Lin and Breslow, 1996).
6.3 Observation Driven Models
Recall that observation driven models tackle the issue of within-cluster association by directly modelling the dependence of $Y_{ij}$ on some functions of
$$Y_{i,-j} = \{Y_{i1}, \ldots, Y_{i,j-1}, Y_{i,j+1}, \ldots, Y_{in_i}\},$$
namely, $H_{ij} = H_{ij}(Y_{i,-j})$, through
(A**) For $j = 1, \ldots, n_i$, $Y_{ij} \mid H_{ij}$ follows a GLM with
$$h(E(Y_{ij} \mid H_{ij})) = x_{ij}^t \beta^{**} + \sum_{\ell=1}^{q} f_\ell(H_{ij}; \alpha). \tag{6.8}$$
Some common choices of $H_{ij}$ were discussed in §5.3.
Just as in §6.1.2, a natural question is: does there exist a probability function for $Y_i$ that is compatible with specification (A**)? While the answer no doubt depends on the choices of $H$ and $f_\ell$, it is comforting that the Hammersley-Clifford Theorem in Besag (1974) could serve as guidance for the question posed above. As an example, for binary responses with q = 1, a choice of $f_1(H_{ij}; \alpha) = \alpha \sum_{k \ne j} y_{ik}$ has led to the multiplicative model developed by Connolly and Liang (1988).
6.3.2 Quasi-Likelihood Approach
As an alternative, one can estimate $\delta = (\beta^{**}, \alpha)$ by solving
$$\sum_{i=1}^{m} \sum_{j=1}^{n_i} \left(\frac{\partial E(Y_{ij} \mid H_{ij}; \delta)}{\partial \delta}\right)^t \text{Var}^{-1}(Y_{ij} \mid H_{ij})\,\big(y_{ij} - E(Y_{ij} \mid H_{ij}; \delta)\big) = 0,$$
a strategy similar to GEE for marginal models. An advantage of this approach is that statistical packages developed for GEE of marginal models, such as SAS PROC GENMOD, can be utilized. On the theoretical end, this approach shares the same robustness property enjoyed by GEE for marginal models.
Returning to the Baltimore Eye Survey Study example, Table 6.2 presents results from three modelling approaches. The first column is based on GEE2 for marginal models as shown in Table 6.1. The second column is derived from observation driven models using the GEE1 method. Specifically, this model assumes for each j = 1, 2,
$$\text{logit} \Pr(Y_{ij} = 1 \mid Y_{i,3-j}, \text{Age}_i, \text{Race}_i) = \beta_0^{**} + \beta_1^{**} \text{Race}_i + \beta_2^{**} (\text{Age}_i - 60) + \beta_3^{**} (\text{Age}_i - 60)^2 + \beta_4^{**} \text{Race}_i (\text{Age}_i - 60) + \beta_5^{**} \text{Race}_i (\text{Age}_i - 60)^2 + \alpha_1 Y_{i,3-j} + \alpha_2 Y_{i,3-j} \text{Race}_i.$$
Finally, the third column corresponds to results from fitting a random effects model with the same covariates as the marginal model with the addition of random intercepts.

Table 6.2: Regression estimates and standard errors (in parentheses) from
three modelling approaches: Baltimore Eye Survey Study.

                    Marginal model   Observation      Random effects
Variable            GEE2             driven model     model
Intercept           -2.82            -3.16            -4.91
                    (0.076)          (0.073)          (0.134)
Race                0.334            0.099            0.487
                    (0.105)          (0.11)           (0.189)
Age-60              0.049            0.038            0.071
                    (0.007)          (0.006)          (0.011)
(Age-60)^2          0.0018           0.0012           0.0039
                    (0.0003)         (0.0003)         (0.0007)
Race x (Age-60)     0.0007           -0.0039          0.012
                    (0.0005)         (0.008)          (0.0016)
Race x (Age-60)^2   -0.001           -0.00075         -0.001
                    (0.0005)         (0.0004)         (0.001)
log OR
  Intercept         2.30             2.30
                    (0.264)          (0.264)
  Race              0.56             0.55
                    (0.25)           (0.25)
D_00                                                  8.63
D_00: variance of random intercepts.

Consistent with the discussion in §5.4, the estimated $\beta^{**}$'s from the observation driven model are in general smaller in magnitude than the $\beta$'s from the marginal model. In particular, the estimated coefficient for the Race variable is reduced from 0.334 in column one to 0.099 in column two and, as a result, is not significantly different from zero statistically if the observation driven model is employed. A simple and obvious explanation of this discrepancy is that the effects of those demographic variables are, to some extent, explained away by the competing variate, $Y_{i,3-j}$, the visual impairment status of the fellow eye; see $\alpha_1$ and $\alpha_2$ in column two. On the other hand, the estimated $\beta^*$'s from the random effects model are larger in magnitude than the $\beta$'s. However, the Wald test statistics for testing the significance of either $\beta$ or $\beta^*$ are similar in magnitude and lead to similar conclusions. For example, the Wald statistics for the Race variable are 3.18 (= 0.334/0.105) from column one and 2.58 (= 0.487/0.189) from column three.
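Fitting the observation driven model for the two eyes with standard GLM or GEE software amounts to stacking the eyes as rows and adding the fellow eye's response, and its interaction with race, as extra covariates. The sketch below builds such a design matrix. The function name, the column order and the three made-up subjects are assumptions for illustration and do not reproduce the actual survey data.

```python
import numpy as np

def fellow_eye_design(y_pairs, race, age):
    """Stack both eyes of each subject and append the fellow-eye response
    Y_{i,3-j} and its interaction with race as covariates."""
    y_pairs = np.asarray(y_pairs, dtype=float)       # shape (m, 2): (left, right)
    race = np.asarray(race, dtype=float)
    a = np.asarray(age, dtype=float) - 60.0          # age centered at 60
    rows, resp = [], []
    for j in (0, 1):
        fellow = y_pairs[:, 1 - j]                   # response of the other eye
        rows.append(np.column_stack([
            np.ones_like(a), race, a, a ** 2, race * a, race * a ** 2,
            fellow, fellow * race,
        ]))
        resp.append(y_pairs[:, j])
    return np.vstack(rows), np.concatenate(resp)

# Three hypothetical subjects
X_od, y_od = fellow_eye_design(
    y_pairs=[[1, 0], [0, 0], [1, 1]], race=[1, 0, 1], age=[60, 70, 50])
print(X_od.shape, y_od)
```

The resulting matrix can be passed to any logistic regression routine; robust standard errors that treat the two rows from a subject as one cluster are then needed for valid precision assessments.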
6.4 Pre- Post Trial on Schizophrenia Revisited
Recall that in this trial more than five hundred schizophrenic patients were randomized into six arms: placebo, risperidone of 2, 6, 10 and 16mg, and haloperidol of 20mg. For illustrative purposes, we focus on two particular groups: risperidone of 6mg (x = 1) and haloperidol of 20mg (x = 0). The main objective is to examine whether the new treatment, risperidone, leads to further and faster decline in PANSS, a measure of severity in schizophrenia, compared to the then standard treatment, haloperidol.

Table 6.3: Regression estimates and standard errors (in parentheses) from
three modelling approaches: a pre-post trial on schizophrenia.

              Marginal model            Random effects model       Observation
                                        Intercepts   Intercepts    driven model
Variable      Indep.     Exchangeable   only         and slopes    AR-1
Intercept     91.75      91.38          91.38        91.08         90.17
              (1.52)     (1.93)         (2.06)       (1.96)        (1.81)
Time (t_j)    -2.56      -1.45          -1.46        -1.16         -1.20
              (0.40)     (0.34)         (0.26)       (0.36)        (0.25)
Tx            -4.33      -4.11          -4.12        -3.92         -1.75
              (2.13)     (2.76)         (2.92)       (2.77)        (2.54)
Time x Tx     -0.12      -0.99          -0.98        -1.19         -1.39
              (0.55)     (0.47)         (0.35)       (0.48)        (0.34)
phi           432.2      447.3          153.7        122.6         426.8
alpha                    0.63                                      0.73
D_00                                    297.6        268.8
D_01                                                 2.1
D_11                                                 4.2

Table 6.3 presents results from three modelling approaches. For marginal models, we considered the use of independence and exchangeable (also known as compound symmetry) matrices as the working covariance for the $Y_i$'s. For random effects models, we considered (1) random intercepts only, which leads to the compound symmetry covariance matrix for Y, and (2) random intercepts and slopes. The latter accounts for variations in PANSS changes over time among patients. For observation driven models, we considered the following AR-1 model
$$Y_{ij} = x_{ij}^t \beta^{**} + \alpha\,(y_{i,j-1} - x_{i,j-1}^t \beta^{**}) + \epsilon_{ij}, \quad j = 1, \ldots, n_i,$$
where the $\epsilon_{ij} \overset{\text{i.i.d.}}{\sim} N(0, \phi)$. With the exception of the GEE approach using the independence matrix as the working matrix, conclusions from all three approaches are very similar qualitatively; see comments in §5.4. Specifically, the averaged rate of decline in PANSS for the haloperidol group is estimated as 1.16 per week, whereas the averaged rate of decline is estimated as 2.35 (= 1.16 + 1.19) per week for the risperidone group. Such a difference is statistically significant at the 0.05 level (1.192/0.482 = 2.47, p-value < 0.01). More importantly, this difference appears to be meaningful clinically as well. Considering two patients with the same PANSS scores before receiving treatments, this result suggests that taking the new treatment would lead to a 9.5 (= 1.19 × 8) unit greater decline in PANSS in the eighth week, a substantial improvement from the clinical point of view.
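The quantities quoted in this paragraph can be recomputed from the random intercepts and slopes column of Table 6.3. The sketch below does so and also converts the Wald ratio into tail probabilities, assuming a standard normal reference distribution (an assumption made here for illustration). Under that assumption the one-sided tail probability is below 0.01, while the two-sided value is slightly above 0.01 and below 0.05.

```python
import math

slope_haloperidol = -1.16      # Time coefficient (x = 0)
slope_difference = -1.19       # Time x Tx coefficient
slope_risperidone = slope_haloperidol + slope_difference

z = 1.192 / 0.482              # Wald ratio as quoted in the text
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2.0))
p_two_sided = math.erfc(z / math.sqrt(2.0))

# Additional decline for risperidone accumulated over eight weeks
extra_decline_week8 = 8 * abs(slope_difference)

print(round(slope_risperidone, 2), round(z, 2))
print(p_one_sided, p_two_sided, round(extra_decline_week8, 1))
```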
Figure 6.2: Mean responses in PANSS by dropout cohort.
Finally, ignoring within-cluster association would lead to very different and presumably misleading conclusions. Specifically, results from GEE assuming the repeated measures from the same patient were statistically independent of each other suggest that there is no difference in the averaged rate of decline in PANSS between the two groups (−0.117 ± 0.545). An explanation for such a striking discrepancy is that this latter approach is vulnerable to the severe and non-trivial missing data phenomenon which is exhibited in this example. For example, the risperidone group experienced a 36% dropout rate in the eight week follow-up period, whereas the haloperidol group fared no better with a 53% dropout rate. Furthermore, Figure 6.2 strongly suggests that the chance of dropping out at a given visit depends on the prior history of PANSS scores. For example, patients who dropped out at weeks 2, 4 and 6 experienced, on average, a sudden increase in PANSS prior to the scheduled visits. This pattern of missingness, known as missing at random (Little and Rubin, 1987), suggests that ignoring within-cluster association would, in general, lead to incorrect inference for questions of interest, the difference in rates of change in PANSS between groups in this case. For more detailed treatments on accounting for non-trivial missing data patterns, see Diggle (1998) and Scharfstein, Rotnitzky and Robins (1999) and references therein.
Bibliography
[1] Altham, P.M.E. (1978). Two generalizations of the binomial distribution. Applied
Statistics, 27, 161167.
[2] Andersen, E.B. (1970). Asymptotic properties of conditional maximum likelihood
estimators. Journal of the Royal Statistical Society, Series B, 32, 283301.
[3] Bahadur, R.R. (1961). A representation of the joint distribution of responses in n
dichotomous items. In Studies on Item Analysis and Prediction, Ed. H. Solomon,
pp.158168. Stanford Mathematical Studies in the Social Sciences VI, Stanford Uni-
versity Press, Stanford, California.
[4] Beaty, T.H., Skjaerven, R., Breazeale, D.R. and Liang, K.-Y. (1997). Analyzing
sibships correlations in birth weight using large sibships from Norway. Genetic Epi-
demiology, 14, 423433.
[5] Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems.
Journal of the Royal Statistical Society, Series B, 36, 192236.
[6] Bishop,Y.M.M., Fienberg, S.E. and Holland, P.W. (1975). Discrete Multivariate
Analysis: Theory and Practice. MIT Press, Cambridge, Massachussetts.
[7] Breslow, N.E. (1981). Odds ratio estimators when the data are sparse. Biometrika,
68, 73-84.
[8] Breslow, N.E. and Clayton, D.G. (1993). Approximate inference in generalized linear
mixed models. Journal of the American Statistical Association, 88, 9-25.
[9] Breslow, N.E. and Day, N.E. (1980). Statistical Methods in Cancer Research, Volume
I. IARC Scientific Publications No. 32. Lyon.
[10] Breslow, N.E. and Lin, X. (1995). Bias correction in generalized linear mixed models
with a single component of dispersion. Biometrika, 82, 81-91.
[11] Breslow, N.E. and McCann, B. (1971). Statistical estimation of prognosis of children
with neuroblastoma. Cancer Research, 31, 2098-2103.
[12] Chen, J.J., Kodell, R.L., Howe, R.B. and Gaylor, D.W. (1991). Analysis of trinomial
responses for reproductive and developmental toxicity experiments. Biometrics, 47,
1049-1058.
[13] Cohen, B.H., Bell, W.C., Jr., Brashears, S., Diamond, E.L., Kreiss, P. et al. (1977).
Risk factors in chronic obstructive pulmonary disease (COPD). American Journal
of Epidemiology, 105, 223-232.
[14] Connolly, M. and Liang, K.-Y. (1988). Conditional logistic regression models for
correlated binary data. Biometrika, 75, 501-506.
[15] Cox, D.R. (1972). Regression models and life tables (with discussion). Journal of
the Royal Statistical Society, Series B, 34, 187-220.
[16] Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics, Chapman and Hall,
London, England.
[17] Cox, D.R. and Reid, N. (1987). Parameter orthogonality and approximate conditional
inference (with discussion). Journal of the Royal Statistical Society, Series B,
49, 1-39.
[18] Crowder, M.J. (1987). On linear and quadratic estimating functions. Biometrika,
74, 591-597.
[19] Crowder, M.J. (1989). Comments on Godambe, V.P. and Thompson, M.E. An extension
of quasi-likelihood estimation. Journal of Statistical Planning and Inference,
22, 137-152.
[20] Diggle, P.J. (1988). An approach to the analysis of repeated measures. Biometrics,
44, 959-971.
[21] Diggle, P.J. (1998). Dealing with missing values in longitudinal studies. In: Recent
Advances in the Statistical Analysis of Medical Data, editors B.S. Everitt and G.
Dunn, 203-208. Arnold, London, England.
[22] Diggle, P.J., Liang, K.-Y. and Zeger, S.L. (1994). Analysis of Longitudinal Data.
Oxford University Press, Oxford, England.
[23] Durbin, J. (1960). Estimation of parameters in time-series regression models.
Biometrika, 47, 139-153.
[24] Ferguson, H., Reid, N. and Cox, D.R. (1991). Estimating equations from modified
profile likelihood, in Estimating Functions, Godambe, V.P., editor, Oxford
University Press, Oxford, England.
[25] Finney, D.J. (1964). Statistical Methods in Biological Assay, 2nd edition. Griffin,
London, England.
[26] Firth, D. (1987). On the efficiency of quasi-likelihood estimation. Biometrika, 74,
233-245.
[27] Fitzmaurice, G.M. and Laird, N.M. (1993). Regression models for discrete
longitudinal responses. Statistical Science, 8, 284-304.
[28] Godambe, V.P. (1960). An optimum property of regular maximum likelihood
estimation. Annals of Mathematical Statistics, 31, 1208-12.
[29] Godambe, V.P. (1976). Conditional likelihood and unconditional optimum estimating
equations. Biometrika, 63, 277-284.
[30] Goodman, L.A. and Kruskal, W.H. (1979). Measures of Association for
Cross-Classifications. Springer-Verlag, New York.
[31] Gourieroux, C., Monfort, A., and Trognon, A. (1984). Pseudo-maximum likelihood
methods: theory. Econometrica, 52, 681-700.
[32] Grizzle, J.E., Starmer, C.F. and Koch, G.G. (1969). Analysis of categorical data by
linear models. Biometrics, 25, 489-504.
[33] Haseman, J.K. and Kupper, L.L. (1979). Analysis of dichotomous response data from
certain toxicological experiments. Biometrics, 35, 281-293.
[34] Hauck, W.W. and Donner, A. (1977). Wald's test as applied to hypotheses in logit
analysis. Journal of the American Statistical Association, 72, 851-853.
[35] Heagerty, P. and Zeger, S.L. (1998). Lorelogram: a regression approach to exploring
dependence in longitudinal categorical responses. Journal of the American Statistical
Association, 93, 150-163.
[36] Holmes, M.C. and Williams, R.E.O. (1954). The distribution of carriers of
Streptococcus pyogenes among 2413 healthy children. Journal of Hygiene, Cambridge, 52, 165-179.
[37] Jones, B. and Kenward, M.G. (1987). Modelling binary data from a three-point
cross-over trial. Statistics in Medicine, 6, 555-64.
[38] Jørgensen, B. (1987). Exponential dispersion models (with discussion). Journal of
the Royal Statistical Society, Series B, 49, 127-162.
[39] Kalbfleisch, J.D. and Sprott, D.A. (1970). Application of likelihood methods to
models involving a large number of nuisance parameters (with discussion). Journal
of the Royal Statistical Society, Series B, 32, 175-208.
[40] Kupper, L.L. and Haseman, J.K. (1978). The use of a correlated binomial model for
the analysis of data from certain toxicological experiments. Biometrics, 34, 69-76.
[41] Laird, N.M. and Ware, J.H. (1982). Random-effects models for longitudinal data.
Biometrics, 38, 963-74.
[42] Lee, E. (1992). Statistical Methods for Survival Data Analysis, 2nd edition. John
Wiley and Sons, New York, New York.
[43] Lehmann, E.L. (1959). Testing Statistical Hypotheses. John Wiley and Sons, New
York, New York.
[44] Liang, K.-Y. (1987). Estimating functions and approximate conditional likelihood.
Biometrika, 74, 695-702.
[45] Liang, K.-Y. and Beaty, T.H. (1991). Measuring familial aggregation by using
odds-ratio regression models. Genetic Epidemiology, 8, 361-370.
[46] Liang, K.-Y. and Hanfelt, J. (1994). On the use of the quasilikelihood method in
the teratological experiments. Biometrics, 50, 872-880.
[47] Liang, K.-Y. and Tsou, D. (1992). Empirical Bayes and conditional inference with
many nuisance parameters. Biometrika, 79, 261-270.
[48] Liang, K.-Y. and Zeger, S.L. (1986). Longitudinal data analysis using generalized
linear models. Biometrika, 73, 13-22.
[49] Liang, K.-Y. and Zeger, S.L. (1989). A class of logistic regression models for
multivariate binary time series. Journal of the American Statistical Association, 84,
447-451.
[50] Liang, K.-Y. and Zeger, S.L. (1993). Regression analysis for correlated data. Annual
Review of Public Health, 14, 43-68.
[51] Liang, K.-Y. and Zeger, S.L. (1995). Inference based on estimating functions in the
presence of nuisance parameters (with discussion). Statistical Science, 10, 158-172.
[52] Liang, K.-Y., Zeger, S.L., and Qaqish, B. (1992). Multivariate regression analyses
for categorical data (with discussion). Journal of the Royal Statistical Society, Series
B, 54, 3-40.
[53] Lin, X. and Breslow, N.E. (1996). Bias correction in generalized linear mixed models
with multiple components of dispersion. Journal of the American Statistical
Association, 91, 1007-16.
[54] Lindsay, B. (1982). Conditional score functions: some optimality results. Biometrika,
69, 503-512.
[55] Little, R.J.A. and Rubin, D.B. (1987). Statistical Analysis with Missing Data. John
Wiley, New York.
[56] Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data
from retrospective studies of disease. Journal of the National Cancer Institute, 22,
719-748.
[57] Mason, W.B. and Fienberg, S.E. (eds) (1985). Cohort Analysis in Social Research:
Beyond the Identification Problem. Springer-Verlag, New York.
[58] Maxwell, A.E. (1961). Analysing Qualitative Data. Methuen, London.
[59] McCullagh, P. (1980). Regression models for ordinal data (with discussion). Journal
of the Royal Statistical Society, Series B, 42, 109-42.
[60] McCullagh, P. (1983). Quasi-likelihood functions. Annals of Statistics, 11, 59-67.
[61] McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models. Chapman and
Hall, New York.
[62] Nelder, J.A. and Wedderburn, R.W.M. (1972). Generalised linear models. Journal
of the Royal Statistical Society, Series A, 135, 370-384.
[63] Neyman, J. and Scott, E.L. (1948). Consistent estimates based on partially
consistent observations. Econometrica, 16, 1-32.
[64] Patterson, H.D. and Thompson, R. (1971). Recovery of inter-block information when
block sizes are unequal. Biometrika, 58, 545-54.
[65] Prentice, R.L. and Zhao, L.P. (1991). Estimating equations for parameters in means
and covariances of multivariate discrete and continuous responses. Biometrics, 47,
825-39.
[66] Qaqish, B. and Liang, K.-Y. (1992). Marginal models for correlated binary data
with multiple classes and multiple levels of nesting. Biometrics, 48, 939-950.
[67] Rao, C.R. (1973). Linear Statistical Inference and Its Applications, 2nd edition.
John Wiley, New York.
[68] Reid, N. (1995). The roles of conditioning on inference (with discussion). Statistical
Science, 10, 138-199.
[69] Robbins, H. (1956). An empirical Bayes approach to statistics. Proceedings of the
Third Berkeley Symposium, 1, 157-163.
[70] Rosner, B. (1989). Multivariate methods for clustered binary data with more than
one level of nesting. Journal of the American Statistical Association, 84, 373-380.
[71] Royall, R.M. (1986). Model robust inference using maximum likelihood estimators.
International Statistical Review, 54, 221-26.
[72] Royall, R.M. (1997). Statistical Evidence: a Likelihood Paradigm. Chapman and
Hall, London, England.
[73] Scharfstein, D.O., Rotnitzky, A. and Robins, J.M. (1999). Adjusting for
nonignorable drop-out using semiparametric nonresponse models (with discussion). Journal
of the American Statistical Association, 94, 1096-1146.
[74] Skellam, J.G. (1948). A probability distribution derived from the binomial
distribution by regarding the probability of success as variable between the sets of trials.
Journal of the Royal Statistical Society, Series B, 10, 257-61.
[75] Stigler, S.M. (1986). The History of Statistics. Belknap Press, Cambridge, Mass.
[76] Stiratelli, R., Laird, N. and Ware, J.H. (1984). Random effects models for serial
observations with binary responses. Biometrics, 40, 961-71.
[77] Tielsch, J.M., Sommer, A., Witt, K., Katz, J. and Royall, R.M. (1990). Blindness
and visual impairment in an American urban population: Baltimore eye survey.
Archives of Ophthalmology, 108, 286-290.
[78] Tweedie, M.C.K. (1947). Functions of a statistical variate with given means, with
special reference to Laplacian distributions. Proceedings of the Cambridge Phil.
Society, 49, 41-49.
[79] Waterman, R.P. and Lindsay, B.G. (1996). A simple and accurate method for
approximate conditional inference applied to exponential family models. Journal of the
Royal Statistical Society, Series B, 58, 177-188.
[80] Wedderburn, R.W.M. (1974). Quasi-likelihood functions, generalized linear models
and the Gauss-Newton method. Biometrika, 61, 439-47.
[81] Weil, C.S. (1970). Selection of the valid number of sampling units and a consideration
of their combination in toxicological studies involving reproduction, teratogenesis or
carcinogenesis. Food and Cosmetics Toxicology, 8, 177-182.
[82] Weinberg, C.R., Wilcox, A.J. and Lie, R.T. (1998). A log-linear approach to
case-parent-triad data: assessing effects of disease genes that act either directly or through
maternal effects and that may be subject to parental imprinting. American Journal
of Human Genetics, 62, 969-978.
[83] White, H. (1982). Maximum likelihood estimation of misspecified models.
Econometrica, 50, 1-25.
[84] Williams, D.A. (1988). Reader reaction: estimation bias using the beta-binomial
distribution in teratology. Biometrics, 44, 305-309.
[85] Zeger, S.L. and Karim, M.R. (1991). Generalized linear models with random effects:
a Gibbs sampling approach. Journal of the American Statistical Association, 86,
79-86.
[86] Zeger, S.L., Liang, K.-Y. and Albert, P. (1988). Models for longitudinal data: a
generalized estimating equation approach. Biometrics, 44, 1049-1060.
[87] Zhao, L.P. and Prentice, R.L. (1990). Correlated binary regression using a
generalized quadratic model. Biometrika, 77, 642-48.
[88] Zhao, L.P., Prentice, R.L. and Self, S.G. (1992). Multivariate mean parameter
estimation by using a partly exponential model. Journal of the Royal Statistical Society,
Series B, 54, 805-812.
Index
beta-binomial distribution, 22, 40, 41, 44
binary response, 2
biologic plausibility, 43
canonical link, 75
canonical parameter, 75
categorical response, 49
cluster, 12
cluster-specific, 76
conditional score function, 32
contingency table, 56
continuation ratio model, 55
continuous response, 2
correlated data, 34
correlation coefficient, 15
deviance, 29
empirical Bayes, 76
empirical logit, 38, 39
estimating function, 24
exponential dispersion family, 26
family study, 18
Fisher information matrix, 25
generalized estimating equation, 35
generalized linear model, 26
    log-linear model, 27
    logistic regression model, 27, 39
    multiple linear regression, 27
    Poisson regression model, 27
    polytomous response, 51
    random component, 26
    systematic component, 27
goodness-of-fit, 30
Hammersley-Clifford Theorem, 45
likelihood function, 22
likelihood ratio, 22
likelihood-based inference, 21
litter effect, 22
local optimality, 33
logistic regression model, 51
longitudinal study, 18
marginal model, 61, 62, 69
maximum likelihood estimator, 22
method of moments, 31
nested design, 13
Neyman-Scott problem, 22
nuisance parameter, 21
observation driven model, 61, 65, 77
odds ratio, 19
ordinal response, 50
orthogonality, 32, 33
partly exponential family, 71
polytomous logistic regression model, 53
polytomous response, 49
population averaged, 63
proportional odds model, 52
quadratic exponential family, 72
quasi-likelihood, 28, 30, 44, 69
quasi-score function, 30, 31, 33, 34
random effects model, 61, 63, 73
random mechanism, 30
regression analysis, 2
restricted maximum likelihood, 76
score function, 25
teratological experiment, 22, 40, 53, 54
unbiased estimating function, 31
variance specification, 30, 31, 44
weighted least squares, 28
within-cluster dependence, 12
