© Rollin Brant 2007
Contents
2 Multiple Regression
  2.1 Assumptions of the Multiple Regression Model
  2.2 Fitting the Multiple Linear Regression Model
  2.3 Applying Multiple Regression
    2.3.1 Initial variable selection
    2.3.2 Initial data examination
    2.3.3 Examining the Regression Coefficients
    2.3.4 Predictions
  2.4 Examining Assumptions
    2.4.1 Examining Linearity
    2.4.2 Assessing Independence
    2.4.3 Examining the pattern of dispersion
    2.4.4 Examining normality
    2.4.5 Applying a Variance Stabilizing Transformation
    2.4.6 Influence diagnostics
  2.5 Analysis of Variance/Covariance
    2.5.1 Allowing for varying slope terms
    2.5.2 Allowing for more than 2 categories
    2.5.3 Analysis of Variance: groups of variables
    2.5.4 Multi-collinearity
Chapter 2
Multiple Regression
The multiple linear regression model is an extension of a simple linear regression
model to incorporate two or more explanatory variables in a prediction equation for
a response variable. Multiple regression modeling is now a mainstay of statistical
analysis in most fields because of its power and flexibility. As you will quickly learn,
it requires very little effort (and sometimes even less thought) to estimate very complicated models with large numbers of variables. Practical experience has shown,
however, that such models may be very hard to interpret and may give very misleading
impressions. As a first example, we will consider a reasonably uncomplicated analysis with two predictor variables, beginning with an initial analysis based on simple
linear regressions.
Heparin is a drug used in the treatment and prevention of deep vein thrombosis.
The most commonly used form of the drug requires careful monitoring to prevent
under or over-anticoagulation, leading to possible treatment failure or bleeding, respectively. In a study concerning the efficacy of heparin therapy, levels of heparin
sulfate in the blood were monitored. Separate plots of heparin vs. body weight for
the 81 females and 66 males in the study are given below:
[Figure: scatterplots of heparin level against weight, shown separately for females and males.]
The plots indicate the potential utility of linear regression. Estimates for the
regression fits are described below:
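-> sex = female
------------------------------------------------------------------------------
       hsulf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0078113   .0023395    -3.34   0.001     -.012468   -.0031546
       _cons |   1.071086   .1603247     6.68   0.000     .7519672    1.390204
------------------------------------------------------------------------------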
We can examine these fitted relations graphically as seen in the following combined plot.
[Figure: heparin level against weight with fitted values; points are labelled f (female) and m (male).]
Both regressions suggest negative relationships between heparin levels and weight,
though the relationship seems weaker for males. Based on a quick large sample comparison of slopes, using the generic formula for the standard error of a difference of
two independent estimates:
\[ se(est_1 - est_2) = \sqrt{se(est_1)^2 + se(est_2)^2} \]
we note that the estimated difference between slopes is .0044 with a standard error
of .0029, which does not provide compelling evidence for a difference
($Z = .0044/.0029 \approx 1.5$).
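The arithmetic of this large sample comparison can be sketched in a few lines of Python. This is only an illustration: the function names are made up for the example, the inputs are the difference (.0044) and standard error (.0029) quoted in the text, and the two-sided P-value relies on the normal approximation.

```python
import math

def se_difference(se1, se2):
    """Standard error of the difference of two independent estimates."""
    return math.sqrt(se1 ** 2 + se2 ** 2)

def two_sided_p(z):
    """Two-sided P-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Values quoted in the text: difference between slopes and its standard error.
diff, se = 0.0044, 0.0029
z = diff / se
p = two_sided_p(z)
print(f"Z = {z:.2f}, two-sided P = {p:.2f}")
```

A Z of about 1.5 corresponds to a two-sided P-value well above .05, which is the sense in which the evidence for a difference is not compelling.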
This suggests the potential utility of combining the information for males and
females in a single more precise estimate. To do this we consider a comprehensive
model that incorporates weight and sex effects - i.e. a multiple regression model.
The common slope model can be derived from a simple enhancement of the SLR
model. Letting y represent heparin and x1 be weight, the two regression lines below
have separate intercepts but the same slope in accordance with the model.
\[ \mu_{y|x} = \alpha_f + \beta_1 x_1, \quad \text{for females} \]
and
\[ \mu_{y|x} = \alpha_m + \beta_1 x_1, \quad \text{for males} \]
Notice that these two equations encapsulate the prediction of heparin level based
on two predictor variables, one categorical (sex) and the other continuous (weight).
By defining an indicator variable, x2 , that takes on the value 0 for females and 1
for males this can be put into a single equation
\[ \mu_{y|x_1,x_2} = \alpha_f + \beta_1 x_1 + \beta_2 x_2 \]
Simple algebraic comparison yields that the parameter $\beta_2$ is really just the difference
in intercepts, $\alpha_m - \alpha_f$. By applying the same principle of least squares used to obtain
simple linear regression estimates, we can obtain multiple regression estimates, as
given in the following Stata output:
------------------------------------------------------------------------------
       hsulf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |  -.1162196   .0468531    -2.48   0.014    -.2088282    -.023611
      weight |  -.0057582   .0014804    -3.89   0.000    -.0086844   -.0028321
       _cons |   .9333015   .1032388     9.04   0.000     .7292423    1.137361
------------------------------------------------------------------------------
Note that the combined slope estimate is a compromise between the two slopes
in the separate fits and indicates an overall significant negative relationship. The
coefficient for the male term is negative (and significant), indicating that males have
lower heparin levels compared to females (after adjustment for weight).¹ A
plot of the fitted regressions for males and females makes this clear.
¹ An alternate form of the model could be obtained by defining $x_2$ to be 0 for males and 1 for
females. Defining things this way would imply that $\beta_2$ is just $\alpha_f - \alpha_m$, i.e. the negative of its
previous value. Though these models are defined differently,
in practice they are essentially the same because in the end they both give rise to identical fitted
values. In statistics (and science in general), two models that look different algebraically but that still give rise to the same set of
predictions are really the same model and are called re-parameterizations of each other. For this
reason, it is often best to consider models in terms of the predictions they generate, rather than in terms
of the particular symbols or algebraic equations used.
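The re-parameterization described in the footnote can be checked numerically. The sketch below is illustrative only: the function names are invented for the example, and the coefficients are those of the common slope fit reported above (indicator equal to 1 for males). The second function recodes the indicator as 1 for females and adjusts the constant and the sign of the sex coefficient accordingly.

```python
# Coefficients from the common-slope fit reported above (x2 = 1 for males).
A_F, B_WEIGHT, B_MALE = 0.9333015, -0.0057582, -0.1162196

def predict_male_coding(weight, is_male):
    """Fitted heparin level using the indicator male = 1, female = 0."""
    return A_F + B_WEIGHT * weight + B_MALE * (1 if is_male else 0)

def predict_female_coding(weight, is_male):
    """Same model re-parameterized with the indicator female = 1, male = 0."""
    intercept = A_F + B_MALE   # the male intercept becomes the constant
    b_female = -B_MALE         # the sign of the sex coefficient flips
    return intercept + B_WEIGHT * weight + b_female * (0 if is_male else 1)

# The two codings give the same fitted value for every weight and sex.
for w in (50, 80, 110):
    for is_male in (False, True):
        print(w, is_male,
              round(predict_male_coding(w, is_male), 4),
              round(predict_female_coding(w, is_male), 4))
```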
[Figure: heparin level against weight, with the fitted parallel regression lines for males and females; points are labelled f (female) and m (male).]
2.1 Assumptions of the Multiple Regression Model
Given a continuous response (dependent, outcome) variable $y$ and a set of $k$ numerical explanatory variables, $x_1, x_2, \ldots, x_k$, the multiple linear regression model is
characterized by four assumptions.
• The population mean of $y$ within strata defined by the $x$s follows a linear and
additive pattern, i.e.
\[ \mu_{y|x_1,x_2,\ldots,x_k} = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \]
This will be referred to as the regression equation.
Note: In the following we will use $x$ as shorthand for the set of $x$ variables,
so we can write $\mu_{y|x}$ for the left hand side in the above. I'll also use the term
x-strata to describe strata determined by a particular set of values $x$.
• The $y$ observations are assumed to be statistically independent.
• The standard deviation of $y$ within particular x-strata, $\sigma_{y|x}$, is constant over
all values of $x$. We'll simplify notation by calling this value $\sigma$.
• The distribution of $y$ within x-strata is normal.
2.2 Fitting the Multiple Linear Regression Model
Let us consider applying the MLR model to the heparin example of the previous section. By doing so we are implicitly accepting (at least tentatively) the assumptions
above. The assumed regression equation
\[ \mu_{y|x_1,x_2} = \alpha_f + \beta_1 x_1 + \beta_2 x_2 \]
means that the assumption of linearity is equivalent to assuming a parallel lines model,
which is most easily understood from a geometric or graphical perspective. In addition we are assuming that the dispersion of individual points about the relevant
line (one for men, one for women) is the same for men and women and follows a normal distribution. Even if we have doubts about these assumptions, it is reasonable
to fit first and check assumptions later.
Just as for a simple linear regression model, the principle of least squares provides a basis for estimating both the regression coefficients, $\alpha$, $\beta_1$ and $\beta_2$, and the
dispersion (or scale) parameter, $\sigma^2$.
If we label potential estimates as $a$, $b_1$ and $b_2$, the least squares estimates are
the values of $a$, $b_1$ and $b_2$ that minimize the residual sum of squares,
\[ SS_{resid} = \sum_{cases} \left(y - (a + b_1 x_1 + b_2 x_2)\right)^2 \]
The typical magnitude of the deviations from the fit (i.e. residual values)
$y_i - (a + b_1 x_1 + b_2 x_2)$ is given by the residual standard deviation
\[ s_{y|x_1,x_2} = \sqrt{\frac{SS_{resid}}{n-3}} \]
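To make the least squares recipe concrete, here is a minimal sketch in plain Python that solves the normal equations for a model with two predictors and then computes the residual sum of squares and the residual standard deviation with divisor n - 3. The data are hypothetical (they are not the heparin study data), and the function names are illustrative; in practice a statistical package does this work.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_two_predictors(x1, x2, y):
    """Least squares estimates (a, b1, b2), SS_resid and residual SD."""
    rows = [[1.0, u, v] for u, v in zip(x1, x2)]
    # Normal equations: (X'X) coef = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    a, b1, b2 = solve(xtx, xty)
    resid = [yi - (a + b1 * u + b2 * v) for u, v, yi in zip(x1, x2, y)]
    ss_resid = sum(e * e for e in resid)
    s = (ss_resid / (len(y) - 3)) ** 0.5  # divisor n - 3: three coefficients
    return (a, b1, b2), ss_resid, s

# Hypothetical data: weight and a 0/1 male indicator (not the study data).
weight = [50, 60, 70, 80, 55, 65, 75, 85]
male = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0.80, 0.72, 0.69, 0.60, 0.66, 0.61, 0.52, 0.49]
coef, ss_resid, s = fit_two_predictors(weight, male, y)
print(coef, ss_resid, s)
```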
In fitting the heparin data, we have applied a multiple linear regression model
with $k = 2$, $x_1$ = weight and $x_2$ = 1 if male, 0 if female. Pictorially, this particular
instance of the model can be thought of as fitting separate regression lines for males
and females, constraining the slopes to be equal, and assuming an equal degree of
dispersion about each line. The previous plot of the fitted regression lines makes
these assumptions clear. The slope of the parallel fitted lines is determined by $b_1$.
The value of $b_2$ corresponds to the difference in intercepts. Because the lines are
parallel, this value also describes the difference in predicted heparin level between
males and females in the same weight-stratum.²
² $x_2$ only takes on values 0 and 1. When some (or all) of the $x$ variables are constructed to
represent categories, the model is sometimes called the general linear model.
without xi
2.3 Applying Multiple Regression
2.3.1 Initial variable selection
Modern computational facilities now make it quite easy to fit a multiple regression
model. To use it in a meaningful way, however, a number of practical and statistical
issues must be addressed.
• The explanatory variables to be included in the regression model must be
selected.
• Potential deficiencies in the fit of the model must be identified and corrected
(if possible).
• Results of the model fit must be carefully interpreted and presented.
One cannot even begin, of course, without first choosing which predictor variables
to include in the model (we'll leave the remaining issues to later sections). In
general the rationale for choosing variables depends mainly on the objectives of
the study. Loosely speaking, objectives can be classified as descriptive, predictive and
comparative in character (with a large degree of overlap).
At the level considered so far, our analysis of the heparin data has been merely
descriptive. In descriptive analysis (which may be a tentative precursor to development of predictive or causal models) one is interested in identifying patterns of
relationship without worrying overmuch about the underlying mechanisms or extrapolation into the future. If patterns do emerge in the analysis, they may engender
more careful thought about their meaning and subsequently more focused analysis
or perhaps an additional step of data collection. The choice of variables is then
largely subjective, depending on the investigator's own thoughts about what constitutes an interesting pattern. Conventions specific to individual disciplines may
exist. For instance, in human epidemiology, age and sex are usually considered to be
important and will more or less automatically be considered for inclusion in models.
In predictive modeling the aim is to develop a formula or rule for making predictions. The term black-box prediction is used when there is no interest in explaining
or interpreting the individual roles of the variables in the prediction equation. For
example, if there is direct clinical utility in predicting heparin levels, one may not
care too much what particular variables are used, as long as they are commonly
measured. In general, though, predictive equations tend to be more reliable when
some physiological (or other theoretical) rationale can be applied. In the setting of
a study of the vocabulary of elementary school students, while it may well be that shoe
size has some predictive utility with regard to vocabulary size, much more powerful
predictive relationships can be developed by considering more direct determinants
such as grade level.
Studies which aim at comparative inferences typically (ideally?) focus on a small
number of primary factors (variables) and often relate to explicit (and sometimes implicit) underlying causal hypotheses. For comparisons to be relevant (especially in
attempting to support causal hypotheses from observational data) there may be
a number of nuisance or confounding factors that need to be taken into account.
In the above example, one might wish to investigate the fundamental role of sex
of the patient in determining heparin levels. Because patterns in weight differ between sexes, weight needs to be accounted for in the model; otherwise a distorted
picture of the relationship may result. Thus in causal modeling one attempts to be as
comprehensive as possible in including variables that could realistically be thought
to be competing causal determinants. Other criteria for making valid comparisons
may enter into variable selection in other settings. In the end, though, the focus for
interpretation will be on the role of the principal factors of interest.
2.3.2 Initial data examination
While we could enter the actual rates in x4 , it is convenient for purposes of interpretation to
follow the indicator approach. However, both approaches yield the same fundamental models in
the sense of model prediction.
[Figure: Initial Plots. Histograms (fraction) of age and of initial IV rate (Low, High); heparin level against age and by initial IV rate (Low, High).]
In our initial exploration of the relationships, we begin with bivariate graphs
relating $y$ to each of the $x$s in turn. For age and initial dose levels, a scatter diagram
and comparative boxplots are most relevant. Such plots may provide initial hints
about which variables may be important in the model, about non-linear effects for
some variables or about non-homogeneous dispersion in others. However, features
noted in separate bivariate plots and summaries may not translate directly to the
multiple variable model. Nonetheless, if strong non-linearities are evident in any of
the plots, there is the suggestion that some transformation of either the response or
the relevant explanatory variable may be useful if the effect is a dramatic improvement
in the apparent linearity in the plots (as judged by the IOT⁴ test).
The examination of these plots suggests it is reasonable to begin with a model
incorporating variables as is. Before doing so, we can get an initial indication
of the potential utility of adding age and initial dose by considering some simple
analyses, in this case, a simple linear regression of heparin level on age and a t-test
comparing heparin levels between the low and high dose groups.
⁴ Intra-Ocular Trauma
While these simple tests are not foolproof indicators of the utility of including
age and initial dose in a model with weight and sex, the above findings (P .05 for
initial dose, P = .004 for age) heighten our expectations for fitting the full model
on the next page.
. regress hsulf weight male age hepdos0

      Source |       SS       df       MS              Number of obs =     147
-------------+------------------------------           F(  4,   142) =   15.78
       Model |  3.55976945     4  .889942361           Prob > F      =  0.0000
    Residual |  8.00847268   142  .056397695           R-squared     =  0.3077
-------------+------------------------------           Adj R-squared =  0.2882
       Total |  11.5682421   146  .079234535           Root MSE      =  .23748

------------------------------------------------------------------------------
       hsulf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0058301   .0014201    -4.11   0.000    -.0086374   -.0030227
        male |  -.1098883   .0442424    -2.48   0.014    -.1973471   -.0224295
         age |   .0039068   .0012984     3.01   0.003     .0013401    .0064735
     hepdos0 |    .176446   .0422362     4.18   0.000     .0929529     .259939
       _cons |   .5979384   .1459204     4.10   0.000     .3094813    .8863955
------------------------------------------------------------------------------
We return now to consider the statistical side of things more carefully, deferring
until later lectures the consideration of goodness of fit issues. Issues of interpretation
will be considered by example as the opportunity arises.
Let's return to the previous output, which arises from fitting the combined regression model relating heparin levels to weight, sex, age and initial heparin dose.
As described in the last lecture, the primary components of this output are the
intercept and the coefficients for the weight, sex, age and dose variables, $a$ and
$b_1$, $b_2$, $b_3$ and $b_4$. These may be interpreted on an informal basis, as can the
residual standard deviation $s_{y|x}$, which gives us a notion of the closeness of the overall
fit. In order to make more precise interpretations, we must begin to integrate various
components of our analysis. In general, meaningful interpretations of any statistical
analysis require this sort of synthesis.
2.3.3 Examining the Regression Coefficients
\[ \sum (y - \bar{y})^2 = \sum (\hat{y} - \bar{y})^2 + \sum (y - \hat{y})^2 \]
To test the hypothesis that all the coefficients (i.e. the TRUE values corresponding to the $b_i$
estimates) are 0 we use the F-statistic,
\[ F = \frac{\sum (\hat{y} - \bar{y})^2 / k}{\sum (y - \hat{y})^2 / (n - (k + 1))} \]
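As a check on the formula, the following short Python sketch recomputes the F-statistic, R-squared and root mean squared error from the sums of squares reported for the four-variable heparin fit (Model SS 3.55976945, Residual SS 8.00847268, n = 147, k = 4). The variable names are illustrative.

```python
import math

# Sums of squares and sample size from the regression output for the
# four-variable heparin model.
ss_model, ss_resid = 3.55976945, 8.00847268
n, k = 147, 4

ms_model = ss_model / k                   # model mean square
ms_resid = ss_resid / (n - (k + 1))       # residual mean square
f_stat = ms_model / ms_resid
r_squared = ss_model / (ss_model + ss_resid)
root_mse = math.sqrt(ms_resid)            # residual standard deviation

print(f"F({k}, {n - (k + 1)}) = {f_stat:.2f}")
print(f"R-squared = {r_squared:.4f}, Root MSE = {root_mse:.5f}")
```

The results agree with the F, R-squared and Root MSE entries printed in the Stata output.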
2.3.4 Predictions
As mentioned above, we can form predictions which, just as in simple linear regression, can be interpreted either as x-stratum specific estimates of $\mu_{y|x}$, or as predictions of particular observations. In forming prediction intervals, we follow the same
framework as in simple linear regression to choose the appropriate standard error. The
standard error for the actual predicted value, appropriate for inference for the stratum population mean, $se(\hat{y})$, has a formula similar in form to $se(b_i)$. For prediction,
just as in simple linear regression we have
\[ se_{pred}(y) = \sqrt{s_{y|x}^2 + se(\hat{y})^2} \]
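A small Python sketch may help fix ideas. It forms a point prediction from the coefficients of the four-variable heparin fit and combines the residual standard deviation (Root MSE = .23748) with a standard error for the fitted value. Several things here are assumptions made for illustration: the function names, the coding of hepdos0 as 0 for the low initial rate, the example patient, and above all the value used for se(ŷ), which in practice is supplied by the software for each x-stratum.

```python
import math

# Estimates from the four-variable heparin fit.
COEF = {"_cons": 0.5979384, "weight": -0.0058301, "male": -0.1098883,
        "age": 0.0039068, "hepdos0": 0.176446}
ROOT_MSE = 0.23748  # residual standard deviation s_{y|x}

def predict(weight, male, age, hepdos0):
    """Point prediction from the fitted regression equation."""
    return (COEF["_cons"] + COEF["weight"] * weight + COEF["male"] * male
            + COEF["age"] * age + COEF["hepdos0"] * hepdos0)

def se_pred(se_fit, s=ROOT_MSE):
    """Standard error for predicting a single new observation."""
    return math.sqrt(s ** 2 + se_fit ** 2)

# Hypothetical patient: 70 kg female, aged 60, hepdos0 = 0 (assumed coding).
y_hat = predict(weight=70, male=0, age=60, hepdos0=0)
se_fit = 0.04  # hypothetical se(y-hat); the real value comes from the software
print(round(y_hat, 4), round(se_pred(se_fit), 4))
```

Because the residual standard deviation dominates, the standard error for predicting an individual observation is only slightly larger than $s_{y|x}$ itself, however precisely the stratum mean is estimated.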
When there are a number of variables, this can become complicated, as it is not
clear how to graph data to look at a single variable when there are a number of
important explanatory variables at work. One convenient approach is to look
at adjusted predictions, where only one or two variables are considered at their
actual values, while the effects of the remaining variables are adjusted out by
considering predictions for hypothetical cases where the adjustment variables are
set to the overall mean (or other relevant value).
For example, the following plot illustrates the age and sex effects, adjusted
for weight and dose.
[Figure: Adjusted predictions. Linear prediction (y_adj, yhat_f) against age.]
2.4 Examining Assumptions
Though we have previously discussed applying the results of fitting a multiple linear
regression model, logic dictates that we should assess the plausibility of the basic
assumptions before investing much effort in interpretation. Recall that the basic
assumptions of the multiple linear regression model are
• Linearity and Additivity
• Independence
• Homogeneous dispersion
• Normality
2.4.1 Examining Linearity
Let's consider once more the model for heparin levels that includes the initial IV
rate and age (partial regression output below).
------------------------------------------------------------------------------
       hsulf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0058301   .0014201    -4.11   0.000    -.0086374   -.0030227
        male |  -.1098883   .0442424    -2.48   0.014    -.1973471   -.0224295
         age |   .0039068   .0012984     3.01   0.003     .0013401    .0064735
     hepdos0 |    .176446   .0422362     4.18   0.000     .0929529     .259939
       _cons |   .5979384   .1459204     4.10   0.000     .3094813    .8863955
------------------------------------------------------------------------------
[Figure: Residual Plots. Residuals against weight, by sex (female, male), against age, and by initial IV rate (Low, High).]
Note that when particular xs are discrete, it usually makes more sense to use
boxplots.
When there is very pronounced non-linearity in some, but not all of the plots, a
transformation of the offending x-variable may be appropriate.
2.4.2 Assessing Independence
2.4.3 Examining the pattern of dispersion
The plots above may give indications of non-homogeneous dispersion. Another plot
that is often used is the plot of the residual values versus the fitted values. The plot
below illustrates this approach, though in this plot I have chosen to use a modified
form of the residual known as the standardized residual. This residual is defined
as the ordinary residual divided by an estimate of the standard deviation for the
residual. Standard deviations for residuals actually decrease according to how far
the x-values are from their means, so this puts the residuals on an equal footing in
terms of variance.
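In symbols, one common way of writing the standardized residual is sketched below. The notation is an assumption on my part rather than something defined in the text so far: $e_i$ denotes the ordinary residual for case $i$ and $h_i$ denotes that case's leverage (hat value), a quantity discussed in Section 2.4.6.

```latex
\[
  r_i = \frac{e_i}{s_{y|x}\sqrt{1 - h_i}},
  \qquad e_i = y_i - \hat{y}_i
\]
```

Since $h_i$ grows as a case's x-values move away from their means, the estimated standard deviation $s_{y|x}\sqrt{1 - h_i}$ of the residual shrinks, which is the effect described above.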
[Figure: standardized residuals against fitted values.]
A typical pattern is one of increasing dispersion as $y$ increases, indicating that variability increases with bigger values of $y$. If this is very pronounced, a logarithmic or square root transformation (variance stabilizing transformation) for $y$ may
help. However, if the previous checks for linearity all seem in order, transforming
$y$ may foul up the linearity assumption. If it is not desirable to transform $y$ and the
increasing dispersion is very pronounced, more advanced models which allow for
non-homogeneous dispersion can be applied.
2.4.4 Examining normality
The assumption of normality is the least critical assumption for most purposes, in
the sense that with large samples, the central limit theorem will provide enough
normality to allow the application of tests and confidence intervals. The one critical
area is in prediction intervals for new observations, which depend on the assumption
that the individual observations (both old and new) are normal. As in simple linear
regression, we can apply a normal quantile plot (a.k.a. QQ plot) to examine this
assumption, and more importantly to highlight deviant observations (outliers).
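The coordinates of a normal quantile plot are easy to compute by hand. The sketch below pairs each sorted residual with a theoretical standard normal quantile (the "Inverse Normal" axis). The residuals are hypothetical, the function name is illustrative, and the plotting positions (i - 0.5)/n are one common convention; statistical packages may use a slightly different one.

```python
from statistics import NormalDist

def normal_quantile_pairs(residuals):
    """Pair each sorted residual with its theoretical standard normal quantile."""
    n = len(residuals)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n avoid the undefined quantiles at 0 and 1.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(residuals)))

# Hypothetical standardized residuals, for illustration only.
resid = [-1.2, 0.3, 2.9, -0.4, 0.1, -0.8, 0.6, 1.1, -0.2, 0.0]
for q, r in normal_quantile_pairs(resid):
    print(f"{q:6.2f} {r:6.2f}")
```

If the residuals are roughly normal, the pairs fall close to a straight line; a point lying far from the line at either end, such as the largest residual in this made-up example, flags a potential outlier.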
[Figure: normal quantile plot of the standardized residuals against inverse normal quantiles.]
2.4.5 Applying a Variance Stabilizing Transformation
                                                       Number of obs =     147
                                                       F(  4,   142) =   10.22
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.2235
                                                       Adj R-squared =  0.2016
                                                       Root MSE      =  .26627

------------------------------------------------------------------------------
   log_hsulf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0050508   .0015923    -3.17   0.002    -.0081985   -.0019031
        male |   -.124347    .049606    -2.51   0.013    -.2224087   -.0262854
         age |   .0028794   .0014558     1.98   0.050     1.50e-06    .0057573
     hepdos0 |   .1470529   .0473566     3.11   0.002     .0534377     .240668
       _cons |  -.2562236   .1636107    -1.57   0.120     -.579651    .0672038
------------------------------------------------------------------------------
Examining residual plots (below) indicates that the fit seems much improved.
[Figure: Residual Plots for the transformed model. Residuals against weight, by sex (female, male), against age, and by initial IV rate (Low, High).]
The following plots indicate that the transformation has introduced negative
skewness in the distribution of residuals while improving the homogeneity of variance. The latter criterion is more important for the validity of our inferences, leading
us to prefer (on a statistical basis anyway) the transformed model. In practice, however, since the two analyses are in substantial agreement, one might prefer to report
the untransformed results.
[Figure: studentized residuals against fitted values, and normal quantile plot of the studentized residuals against inverse normal quantiles.]
2.4.6 Influence diagnostics
As in simple linear regression, cases whose y-values do not follow the general pattern
of association with the $x$s can sometimes have undue influence on the model results.
This is especially true if the x-values for the case are also outlying. In simple
linear regression we can spot the x-outliers easily in initial plots. In multiple linear
regression, the initial plots may not reveal all. One measure of the distance of a
specific case's x-values from the overall means is called the leverage (for algebraic
reasons these values are also called the hat values).
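For simple linear regression the leverage has a closed form, $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$, which makes the "distance from the mean" interpretation explicit. The sketch below computes it for a set of hypothetical weights; the function name is illustrative. In multiple regression the leverages are the diagonal entries of the hat matrix $X(X'X)^{-1}X'$ and are normally obtained from the software.

```python
def leverage_slr(x):
    """Hat values for simple linear regression: 1/n + (x_i - mean)^2 / Sxx."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((xi - mean) ** 2 for xi in x)
    return [1 / n + (xi - mean) ** 2 / sxx for xi in x]

# Hypothetical weights; the last case is far from the others.
weights = [55, 60, 62, 65, 68, 70, 72, 75, 78, 120]
h = leverage_slr(weights)
for w, hi in zip(weights, h):
    print(w, round(hi, 3))
```

The leverages always sum to the number of fitted coefficients (two here, intercept and slope), so a case with a leverage several times the average of 2/n deserves a closer look.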
[Figure: Leverage Caseplot. Cook's D and leverage plotted against patient number (patnum).]
2.5 Analysis of Variance/Covariance
In this section we will see more examples of the application of constructed variables,
such as group indicator variables. The use of constructed variables provides a way
to incorporate categorical predictor variables in a regression model in a number of
ways, as demonstrated below. In addition we will learn about more advanced model
testing techniques based on the analysis of variance.
2.5.1 Allowing for varying slope terms
In the example at the beginning of this chapter, we examined simple linear regressions for heparin levels against weight in males and females separately. We informally
compared slopes and, noting no strong statistical evidence for differing slopes, followed a combined approach.
The following example arises from the study into sleep apnea that gave rise to the
neck and weight data described in the first lecture. The main aim of the study was
to try to predict the results of overnight sleep testing, which yields a measure of sleep
disturbance called the respiratory distress index (RDI), which is a main diagnostic
factor for obstructive sleep apnea (OSA). Since age and gender are thought to be
important determining factors in the occurrence of OSA, an analysis was conducted,
separately by sex, relating the (log transformed) RDI to age, with results below:
---------------------------------Females--------------------------------------
  Source |       SS       df       MS                  Number of obs =      19
---------+------------------------------               F(  1,    17) =   19.85
   Model |  3.56438156     1  3.56438156               Prob > F      =  0.0003
Residual |  3.05199836    17  .179529316               R-squared     =  0.5387
---------+------------------------------               Adj R-squared =  0.5116
   Total |  6.61637993    18  .367576663               Root MSE      =  .42371

------------------------------------------------------------------------------
    lrdi |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     age |    .049469   .0111022      4.456   0.000       .0260454    .0728925
   _cons |  -1.266384   .4854723     -2.609   0.018      -2.290641   -.2421271
------------------------------------------------------------------------------
-----------------------------------Males--------------------------------------
  Source |       SS       df       MS                  Number of obs =      56
---------+------------------------------               F(  1,    54) =    7.44
   Model |  2.09990916     1  2.09990916               Prob > F      =  0.0086
Residual |  15.2482793    54  .282375542               R-squared     =  0.1210
---------+------------------------------               Adj R-squared =  0.1048
   Total |  17.3481884    55  .315421608               Root MSE      =  .53139

------------------------------------------------------------------------------
    lrdi |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     age |   .0163944   .0060119      2.727   0.009       .0043413    .0284474
   _cons |   .4734697   .3038537      1.558   0.125      -.1357204     1.08266
------------------------------------------------------------------------------
The plot above indicates quite a difference in fitted lines for males and females,
and the informal test for comparing slopes is significant. This would invalidate
using a common slope model such as the one considered in the heparin example.
However, a sex-specific slope model can be constructed using a constructed variable
to accommodate a different slope for women. Recall that in the heparin example, we
included a variable called male, which was 1 for males and 0 for females, and which allowed
for different intercepts.
In this example we'll define a variable called female which is 1 for females, 0 for
males. To allow for different slopes we can construct a variable femage which equals
0 for males and, for females, equals age in years. This is easily done by multiplying
the age variable by the female variable, as below:
. generate femage = female*age
. list age sex femage

        age   sex   femage
  1.     39     f       39
  2.     55     m        0
  3.     50     m        0
  4.     54     m        0
  5.     33     m        0
  6.     70     m        0
  7.     57     m        0
  8.     53     f       53
  9.     40     m        0
 10.     40     m        0
etc. ..................
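The same construction is easy to mirror outside Stata. The Python sketch below rebuilds the female indicator and the femage product for the ten cases in the listing; it is only an illustration, with the list names chosen to match the variable names in the listing.

```python
# First ten cases from the listing.
age = [39, 55, 50, 54, 33, 70, 57, 53, 40, 40]
sex = ["f", "m", "m", "m", "m", "m", "m", "f", "m", "m"]

female = [1 if s == "f" else 0 for s in sex]   # indicator: 1 = female, 0 = male
femage = [f * a for f, a in zip(female, age)]  # age for females, 0 for males
print(femage)
```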
                                                       Number of obs =      75
                                                       F(  3,    71) =   10.66
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.3105
                                                       Adj R-squared =  0.2814
                                                       Root MSE      =  .50769

------------------------------------------------------------------------------
    lrdi |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  female |  -1.739854   .6501125     -2.676   0.009      -3.036141   -.4435663
     age |   .0163944   .0057437      2.854   0.006       .0049417    .0278471
  femage |   .0330746   .0144898      2.283   0.025       .0041828    .0619663
   _cons |   .4734697   .2903025      1.631   0.107       -.105377    1.052316
------------------------------------------------------------------------------
By considering the male and female subjects separately, we can see that the
above model gives the same separate intercepts and slopes for men and women as
the previous model, with a pooled common standard deviation. In particular, the
femage coefficient represents the difference between slopes. Testing that the TRUE
coefficient is 0 allows us to test the hypothesis of equal slopes.
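The correspondence between the combined model and the two separate fits can be verified directly from the printed coefficients. The following Python sketch (variable names are illustrative) recovers the male and female lines from the combined model and, for comparison, repeats the informal large sample comparison of slopes using the standard errors from the separate fits.

```python
import math

# Combined model coefficients (reference group: males).
cons, b_female, b_age, b_femage = 0.4734697, -1.739854, 0.0163944, 0.0330746

male_line = (cons, b_age)                          # (intercept, slope)
female_line = (cons + b_female, b_age + b_femage)  # add the female terms

# Separate fits reported earlier: (intercept, slope, se of slope).
sep_female = (-1.266384, 0.049469, 0.0111022)
sep_male = (0.4734697, 0.0163944, 0.0060119)

# Informal large-sample comparison of slopes from the separate fits.
diff = sep_female[1] - sep_male[1]
se_diff = math.sqrt(sep_female[2] ** 2 + sep_male[2] ** 2)
print(male_line, female_line)
print(round(diff, 7), round(se_diff, 5), round(diff / se_diff, 2))
```

The difference in slopes equals the femage coefficient, while the informal Z statistic differs somewhat from the t-value printed for femage because the combined model uses a pooled estimate of the standard deviation rather than separate ones.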
2.5.2 Allowing for more than 2 categories
Let's return to look at more data arising in the study of sleep apnea. We'll consider
now an earlier study similar in design to the RDI study discussed in the first lecture.
In this earlier study, the severity of disease was measured by another approach giving
rise to values of the Apnea-Hypopnea Index (AHI). The variables measured in this
study are summarized below.
      ahi                stopbr         bmi            neck.circ
 Min.   :  0.000    Never :74    Min.   :18.72    Min.   : 77.9
 1st Qu.:  0.833    Some  :34    1st Qu.:26.22    1st Qu.:100.5
 Median :  8.257    Freqly:42    Median :29.38    Median :105.6
 Mean   : 22.080                 Mean   :30.48    Mean   :106.7
 3rd Qu.: 28.082                 3rd Qu.:33.36    3rd Qu.:114.4
 Max.   :124.909                 Max.   :68.59    Max.   :143.3

      age             snorehx       partgasp      gender
 Min.   :24.00    Never : 18    Never :55    female: 39
 1st Qu.:37.00    Some  : 24    Some  :44    male  :111
 Median :45.50    Freqly:108    Freqly:51
 Mean   :45.72
 3rd Qu.:53.75
 Max.   :74.00
Note that some of the variables above are categorical but with more than two categories. Suppose that we want to incorporate such variables in our model. In
particular, let's consider the variable partgasp, which indicates how often (if ever)
the subject's sleeping partner has observed the subject gasping for breath during sleep.
If we start simply by looking at $y$ in relation to each of the $x$ variables, we would
consider the following plots.
[Figure: two panels of boxplots by partgasp category (Never, Some, Freqly), with vertical scales running from 20 to 120 and from 0.0 to 2.0.]
We get the same F-value, because the use of indicators allows for different means
in the three groups. Testing whether any of these variables contribute to the predictive value of the model is the same as testing for equal means. In addition, note that
the variable z3 has been dropped from the model. This is because it only requires
two of the indicators, plus a constant term, to allow for three different means. In
the above model, the constant term is the estimated mean for the frequently gasping
group (z3), and the other coefficients are differences relative to our chosen reference
group.
We can choose the group we wish to make the reference group by leaving out the
corresponding indicator variable. In the analysis below, I have left out
the indicator for subjects whose partners reported Never. The coefficients change,
but when interpreted correctly we obtain the exact same inferences as in the first
output. These two models are equivalent; that is, they produce the same predictions
for this or for other similarly structured data sets.
Call: lm(formula = logahi ~ z2 + z3)
Residuals:
      Min       1Q   Median      3Q     Max
 -1.10554 -0.58645 -0.03734 0.51895 1.38003
Coefficients:
            Estimate  Std. Error  t value   Pr(>|t|)
(Intercept) 0.642597  0.0883260   7.275288   <0.001
z2          0.363377  0.1324890   2.742696    0.007
z3          0.462942  0.1273375   3.635550   <0.001
Residual standard error: 0.655 on 147 degrees of freedom
Multiple R-Squared: 0.09063
F-statistic: 7.325 on 2 and 147 degrees of freedom, the p-value is 0.0009277
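To see concretely why the choice of reference group does not matter, here is a minimal sketch in Python that takes the three coefficients printed above and re-expresses them with a different group left out. It assumes, as the text implies, that z2 is the indicator for Some and z3 the indicator for Freqly; the function name reparametrize is made up for this illustration and is not part of any package.

```python
# Coefficients copied from the output above (reference group: Never).
# Assumption: z2 = indicator for Some, z3 = indicator for Freqly.
intercept_never = 0.642597
coef_z2 = 0.363377
coef_z3 = 0.462942

# Fitted group means implied by this parametrization.
means = {
    "Never": intercept_never,
    "Some": intercept_never + coef_z2,
    "Freqly": intercept_never + coef_z3,
}


def reparametrize(group_means, reference):
    """Return (intercept, differences) when `reference` is the omitted group."""
    intercept = group_means[reference]
    diffs = {g: m - intercept for g, m in group_means.items() if g != reference}
    return intercept, diffs


# Hypothetical re-fit with Freqly as the reference group.
intercept_freqly, diffs_freqly = reparametrize(means, "Freqly")
print(round(intercept_freqly, 6))
print({g: round(d, 6) for g, d in diffs_freqly.items()})
```

The three fitted means are identical under either coding; only the way they are split into a constant and two differences changes. This is why the two sets of coefficients look different but lead to the same predictions.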
2.5.3
Analysis of Variance: groups of variables
In this section we will consider another version of analysis of variance, with applications
to examining the general linear model. As seen in previous examples, using
analysis of variance in regression allows us to test whether a number of coefficients
are 0, based on a single omnibus test, which is more convenient than separate tests
and which also eliminates some multiple testing concerns. In addition, we will see
how analysis of variance exposes some apparent paradoxes which arise from the
uncritical examination of the t-tests for individual coefficients. We will see this
in examples that I have chosen somewhat eclectically (and perhaps, artificially) to
illuminate statistical problems.
Let's consider a model that includes not only partgasp, but also age, sex (or gender),
and bmi, with the result below:
             Estimate     Std. Error   t value    Pr(>|t|)
(Intercept) -0.80603306  0.283480652  -2.843344    0.005
age          0.01098533  0.004171628   2.633343    0.009
female      -0.37904052  0.109837682  -3.450915   <0.001
bmi          0.03759992  0.007194607   5.226125   <0.001
z2           0.24480014  0.117572907   2.082113    0.039
z3           0.26815610  0.115860514   2.314474    0.022
To incorporate gender in the model, we have set up the indicator variable female,
which is 1 for females and 0 for males. Again, we could have chosen, according to
convenience, to use an indicator variable male = 1 - female, or we could have used
another choice of indicators for partgasp. The coefficients will change, but the fitted
values, and hence the residual variance $s^2_{resid}$, are always the same.
It is sensible to re-consider the predictive contribution of partgasp in the light
of the other variables added. We note that the t tests for z2 and z3 are both
significant. Since we could have used z1 and z2, or even z1 and z3, in the above, one
can ask: is there a proper way to test the overall significance of partgasp without
reference to the t test results, which depend on our choice of indicators? The
answer of course is yes (otherwise I never would have raised this issue!), and it lies
in comparing the residual standard deviations from the models with and without
partgasp. Here is the latter analysis.
Call: lm(formula = logahi ~ age + female + bmi)
Residuals:
      Min       1Q   Median      3Q     Max
 -1.40878 -0.41867  0.05892 0.48241 1.06714
Coefficients:
             Estimate     Std. Error   t value    Pr(>|t|)
(Intercept) -0.70693507  0.285214928  -2.478605    0.014
age          0.01142104  0.004194976   2.722553    0.007
female      -0.44763783  0.107933083  -4.147364   <0.001
bmi          [...]       [...]         [...]      <0.001
The two residual standard errors are derived from sums of squared residuals.
The sum of squared residuals in the first model is 46.757, based on 144 degrees of
freedom. The sum of squared residuals in the second model is 48.902, based on 146
degrees of freedom. Note that because variables in the second model are a subset
of the variables in the first model, the latter sum of squared residuals is necessarily
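larger (or, at worst, equal). (Why? Because the first model could reproduce the fit of the second by setting the
coefficients of z2 and z3 to 0, and the least squares estimates for the first model can only do as well or better!)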
The difference in the sum of squared residuals between model 1, which we'll call the
FULL model, and model 2, which will be referred to as the RESTRICTED model,
indicates how much the variable partgasp adds to predictive accuracy.
We can incorporate this difference in an F-test that measures
the statistical significance of this improvement. This F-test for additional variables
depends on the sum of squared residuals, SSresid, and the associated degrees of
freedom, dfresid, for the FULL and RESTRICTED models.
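$$F_{additional} = \frac{\left(SS_{resid}(\text{RESTRICTED MODEL}) - SS_{resid}(\text{FULL MODEL})\right) / \left(df_{resid}(\text{RESTRICTED MODEL}) - df_{resid}(\text{FULL MODEL})\right)}{SS_{resid}(\text{FULL MODEL}) / df_{resid}(\text{FULL MODEL})}$$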
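Model 1: logahi ~ age + female + bmi
Model 2: logahi ~ age + female + bmi + z2 + z3
  Resid. df  Resid. SS  Df  Sum of Sq       F   Pr(>F)
1       146     48.902
2       144     46.757   2      2.145  3.3023  0.03961 *
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1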
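The arithmetic behind this test can be checked by hand. The sketch below, in plain Python, plugs the sums of squared residuals and degrees of freedom quoted in the text (46.757 on 144 for the FULL model, 48.902 on 146 for the RESTRICTED model) into the F-test for additional variables. The function names are my own, chosen for this illustration. The p-value uses a closed form for the upper tail of the F distribution that holds only when there are exactly 2 numerator degrees of freedom, as there are here; for other cases one would need statistical software or tables.

```python
def f_additional(ss_restricted, df_restricted, ss_full, df_full):
    """F statistic for the variables in FULL that are absent from RESTRICTED."""
    extra_df = df_restricted - df_full
    numerator = (ss_restricted - ss_full) / extra_df
    denominator = ss_full / df_full  # residual variance of the FULL model
    return numerator / denominator, extra_df


def p_value_two_numerator_df(f_value, df_full):
    """Upper tail of F(2, df_full); valid only for 2 numerator df."""
    return (1.0 + 2.0 * f_value / df_full) ** (-df_full / 2.0)


# Values quoted in the text for the models with and without partgasp.
f_partgasp, extra_df = f_additional(48.902, 146, 46.757, 144)
p_partgasp = p_value_two_numerator_df(f_partgasp, 144)
print(round(f_partgasp, 2), round(p_partgasp, 3))
```

This gives an F of about 3.30 and a p-value of about 0.04, in line with the 3.3023 and 0.03961 in the table. The small discrepancies arise because the sums of squares quoted in the text are rounded to three decimals.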
This is the only valid test of the hypothesis that a group of two or more regression coefficients are all 0. While it may seem that this could also be done informally by
combining the results of separate t-tests, this is not so, due to possible effects of
collinearity, which we discuss next.
2.5.4
Multi-collinearity
To illustrate the utility of the test for additional variables and the non-intuitive
nature of individual significance tests, consider the following analysis. The data
were gathered in an attempt to understand how characteristics of 20 segments of
highway relate to the occurrence of accidents on them.
The data set contains the outcome variable rate, as well as the length of
the segment in miles (len), the speed limit in miles per hour (slim), the width
of the driving lanes in feet (lwid), the width of the shoulder in feet (shld) and the number of
interchanges (itg). We start with a fairly superficial approach and fit a model with
all of the explanatory variables, yielding the following fit:
. regress rate len slim lwid shld itg
      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  5,    14) =    3.76
       Model |  23.3491998     5  4.66983997           Prob > F      =  0.0228
    Residual |  17.3849747    14  1.24178391           R-squared     =  0.5732
-------------+------------------------------           Adj R-squared =  0.4208
       Total |  40.7341746    19  2.14390393           Root MSE      =  1.1144

------------------------------------------------------------------------------
        rate |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         len |  -.1113296   .0665881    -1.67   0.117    -.2541469    .0314878
        slim |  -.1374671   .0850988    -1.62   0.129    -.3199858    .0450515
        lwid |  -1.407115   1.222619    -1.15   0.269    -4.029373    1.215142
        shld |   .0367669   .1770846     0.21   0.839    -.3430419    .4165756
         itg |  -.2603523   .5988827    -0.43   0.670    -1.544828    1.024123
       _cons |   29.38981   14.60633     2.01   0.064    -1.937656    60.71728
------------------------------------------------------------------------------
                                                       Number of obs =      20
                                                       F(  2,    17) =    9.25
                                                       Prob > F      =  0.0019
                                                       R-squared     =  0.5212
                                                       Adj R-squared =  0.4648
                                                       Root MSE      =  1.0712

------------------------------------------------------------------------------
        rate |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         len |  -.1175567   .0536484    -2.19   0.043     -.230745   -.0043684
        slim |  -.1270314   .0486591    -2.61   0.018    -.2296932   -.0243696
       _cons |   12.08022   2.619819     4.61   0.000     6.552885    17.60756
------------------------------------------------------------------------------
Note that in the first model, none of the individual t-tests is significant at the 5% level (the constant term comes closest, with P=.064), but that the overall regression F-test is (P=.023). The
following matrix plot and correlation matrix hold clues to this puzzle.
[Figure: scatterplot matrix of the explanatory variables, with panels labelled Length of segment in miles, Speed Limit, Lane Width in feet, Shoulder width in feet, and Interchanges.]
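             |      len     slim     lwid     shld      itg
-------------+---------------------------------------------
         len |   1.0000
        slim |   0.3747   1.0000
        lwid |  -0.0625  -0.0756   1.0000
        shld |  -0.0795   0.6981  -0.2324   1.0000
         itg |   0.1810   0.3489  -0.1904   0.2862   1.0000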
We note that some of the explanatory variables are fairly highly correlated, especially
slim (speed limit) and shld (shoulder width), with a correlation of 0.70. When the two variables are
in the equation together it is difficult, if not impossible, to parcel out which variable
is making the significant contribution, so that the contribution of speed limit may
be masked by the presence of shld in the model. After considering a smaller model,
and the test below of the variables it leaves out, we can see that it is plausible (depending on our objectives in fitting a
model in the first place!) to adopt it.
. test lwid shld itg
 ( 1)  lwid = 0.0
 ( 2)  shld = 0.0
 ( 3)  itg = 0.0

       F(  3,    14) =    0.57
            Prob > F =    0.6444
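The F value reported by this test is the same F-test for additional variables used earlier for partgasp, and it can be recovered from the two regression outputs alone. Because both models share the same total sum of squares, SSresid is proportional to (1 - R-squared), so the test can be written in terms of the two R-squared values. The Python sketch below does this with the printed values 0.5732 (five explanatory variables, 14 residual degrees of freedom) and 0.5212 (len and slim only). The function name is an assumption of mine for this illustration.

```python
def f_from_r_squared(r2_full, r2_restricted, n_dropped, df_resid_full):
    """F-test for additional variables, written in terms of R-squared.

    Valid when both models are fitted to the same response and data, so
    that the total sum of squares cancels from numerator and denominator.
    """
    numerator = (r2_full - r2_restricted) / n_dropped
    denominator = (1.0 - r2_full) / df_resid_full
    return numerator / denominator


# R-squared values printed in the two Stata outputs; lwid, shld and itg
# are the 3 variables dropped from the full model.
f_dropped = f_from_r_squared(0.5732, 0.5212, 3, 14)
print(round(f_dropped, 2))
```

The result, about 0.57 on 3 and 14 degrees of freedom, agrees with the output of the test command: dropping lane width, shoulder width and interchanges costs very little in predictive accuracy.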