
Stat 431 Assignment 2 Winter 2017

Solution Key

Question 1 [6 marks]
This problem is adapted from Jewell, N.P. (2003). Statistics for Epidemiology. CRC Press.
The data below come from a study investigating the role of several reproductive risk factors for the development
of breast cancer. Parity (number of pregnancies carried to term) is coded as X1 = 1 if parity is 0-1, and
X1 = 0 if parity is ≥ 2. Similarly age is coded X2 = 1 if age is < 40 years, and X2 = 0 if age is ≥ 40
years. Cases refer to subjects with breast cancer and controls are those without breast cancer. For the parts
below, do not fit the logistic regression using statistical software, as all answers can be computed by simple
calculator operations. Be sure to show your work for all calculations.

Breast Cancer Status

Age          Parity       Cases   Controls
< 40 years   0-1 births      24         58
< 40 years   2+ births       96        160
≥ 40 years   0-1 births     127        172
≥ 40 years   2+ births      353        718

a) Suppose age is ignored and the logistic model


log( π / (1 − π) ) = β0 + β1 x1
is fit to the data. What is the estimate and interpretation of β1 ? Conduct a Wald-based hypothesis
test of H0 : β1 = 0.

The 2 × 2 table relevant for this problem is:

Breast Cancer Status

Parity       Cases   Controls
0-1 births     151        230
2+ births      449        878

OR.est = (151*878)/(449*230)
log(OR.est) # beta_1 is the log OR for breast cancer for parity 0-1 vs 2+

## [1] 0.2498242

In the logistic regression model given above, β1 is the log odds ratio for breast cancer comparing
women with a parity of 0-1 to women with a parity of 2 or more. The estimate of this log odds
ratio (see calculations above) is 0.25. To conduct a hypothesis test of H0 : β1 = 0 vs Ha : β1 ≠ 0
we need an estimate of the standard error of β̂1 .

var.log.OR = 1/151 + 1/230 + 1/449 + 1/878


var.log.OR

## [1] 0.01433647

sqrt(var.log.OR)

## [1] 0.119735

test.stat = (log(OR.est)-0)/sqrt(var.log.OR)
test.stat

## [1] 2.086477

2*(1-pnorm(abs(test.stat)))

## [1] 0.03693548

The Wald-based test statistic is

t∗ = (β̂1 − 0) / se(β̂1) = (0.2498242 − 0) / 0.119735 = 2.0864765

and the p-value from the test is 2P[U > |t∗|] = 0.0369355, where U ∼ N(0, 1). Therefore we reject
the null hypothesis that β1 = 0. (Note: This is equivalent to rejecting the null hypothesis that
the odds ratio is one.)
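
As an optional check (the question asks that the model not be fit with software, so this was
not required), fitting the logistic regression to the grouped 2 × 2 table in R reproduces the
hand calculations; the data frame and object names below are illustrative.

# Optional check: fit the logistic regression to the grouped 2x2 table
parity.data = data.frame(x1 = c(1, 0), cases = c(151, 449), controls = c(230, 878))
check.fit = glm(cbind(cases, controls) ~ x1, family = binomial(link = logit),
                data = parity.data)
summary(check.fit)$coefficients # beta1-hat = 0.2498, se = 0.1197, z = 2.086

Because this model is saturated (two parameters for two rows of data), the glm standard error
agrees exactly with the variance formula 1/151 + 1/230 + 1/449 + 1/878 used above.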

b) Suppose the logistic model


log( π / (1 − π) ) = β0 + β1 x1 + β2 x2 + β3 x1 x2

is fit to the data. What are the estimates and interpretations of the four β parameters?

# The beta estimates are log ORs from the relevant 2x2 tables and are given by:
beta.0 = log( 353/718 )
beta.1 = log( 127*718/(353*172))
beta.2 = log( 96*718/(353*160))
beta.3 = log( 24*160/(96*58)) - log( 127*718/(353*172))

• β̂0 = −0.71 is the log odds of having breast cancer for women with parity ≥ 2 and age ≥ 40.
• β̂1 = 0.4067 is the log OR of breast cancer for women with a parity of 0-1 versus ≥ 2 who are aged ≥ 40.
• β̂2 = 0.1992 is the log OR of breast cancer for women aged < 40 versus women aged ≥ 40 who have a
parity of ≥ 2.
• β̂3 = −0.7783 is the difference between the log ORs of breast cancer for parity 0-1 versus ≥ 2 for women
aged < 40 and for women aged ≥ 40; that is, the interaction effect of parity and age on the log odds scale.
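
As a similar optional check, fitting the saturated interaction model to the grouped data
reproduces all four estimates; the names below are illustrative.

# Optional check: fit the saturated interaction model to the grouped data
bc.data = data.frame(x1 = c(1, 0, 1, 0), x2 = c(1, 1, 0, 0),
                     cases = c(24, 96, 127, 353), controls = c(58, 160, 172, 718))
check.fit2 = glm(cbind(cases, controls) ~ x1*x2, family = binomial(link = logit),
                 data = bc.data)
coef(check.fit2) # matches beta.0, beta.1, beta.2, beta.3 above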

Question 2 [12 marks]
The data for this problem come from Hosmer Jr., D.W., Lemeshow, S., & Sturdivant, R.X. (2013). Applied
Logistic Regression: Third Edition. John Wiley & Sons. The authors state that:

“Myopia, more commonly referred to as nearsightedness, is an eye condition where an individual
has difficulty seeing things at a distance. . . . The risk factors for the development of myopia
have been debated for a long time and include genetic factors (e.g., family history of myopia)
and the amount and type of visual activity that a child performs (e.g., studying, reading, TV
watching, computer or video game playing, and sports/outdoor activity).”

There are 618 subjects who were not myopic when they entered the study. The response of interest is whether
or not the subject became myopic at any time during five years of follow-up. The following variables are
included in the dataset myopia431.txt (available on Learn).

Variable Name   Values/Labels    Description
MYOPIC          0=No, 1=Yes      Myopia within the first five years of follow-up
AGE             years            Age at first visit
SPORTHR         hours per week   Hours per week spent engaging in sports/outdoor activities
DIOPTERHR       hours per week   Composite measure of near-work (studying, reading, etc.) activities
MOMMY           0=No, 1=Yes      Was the subject’s mother myopic?
DADMY           0=No, 1=Yes      Was the subject’s father myopic?

(a) [4 marks] Input the data into R, fit the main effects logistic regression model, and print the summary
output from R. Find the fitted probability of becoming myopic for a 6 year old child, who engages in
10 hours of sports per week, 20 hours of near-work per week, and whose mother and father are both
myopic.

# Input the data


myopia = read.table("myopia431.txt",header=T)

# Fit the main effects logistic regression model, print the summary
model1 = glm(MYOPIC~AGE+SPORTHR+DIOPTERHR+MOMMY+DADMY, family=binomial(link=logit), data=myopia)
summary(model1)

##
## Call:
## glm(formula = MYOPIC ~ AGE + SPORTHR + DIOPTERHR + MOMMY + DADMY,
## family = binomial(link = logit), data = myopia)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0430 -0.5569 -0.4497 -0.3090 2.5540
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.41649 1.11763 -3.057 0.002236 **
## AGE 0.10424 0.17159 0.608 0.543516
## SPORTHR -0.04376 0.01782 -2.455 0.014087 *

## DIOPTERHR 0.01067 0.00761 1.402 0.160930
## MOMMY 0.83972 0.25931 3.238 0.001202 **
## DADMY 0.98679 0.26147 3.774 0.000161 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 480.08 on 617 degrees of freedom
## Residual deviance: 445.82 on 612 degrees of freedom
## AIC: 457.82
##
## Number of Fisher Scoring iterations: 5

# Find the requested fitted probability


eta.fitted = sum(model1$coeff*c(1,6,10,20,1,1))
pi.fitted = exp(eta.fitted)/(1+exp(eta.fitted))
pi.fitted

## [1] 0.2334854

The fitted probability of becoming myopic for a 6 year old child, who engages in 10 hours of
sports per week, 20 hours of near-work per week, and whose mother and father are both myopic
is found by first calculating the fitted linear predictor:

η̂ = β̂0 + β̂1 (6) + β̂2 (10) + β̂3 (20) + β̂4 + β̂5 = −1.1887341

and then transforming using the expit function to find the fitted probability:

π̂ = exp(η̂) / (1 + exp(η̂)) = 0.2334854
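
The same fitted probability can be obtained with predict(); the newdata column names must match
the variables used in the model (the data frame name below is illustrative).

# Equivalent calculation using predict()
new.subject = data.frame(AGE = 6, SPORTHR = 10, DIOPTERHR = 20, MOMMY = 1, DADMY = 1)
predict(model1, newdata = new.subject, type = "response") # 0.2334854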

(b) [1 mark] What is the interpretation of the regression coefficient for the SPORTHR variable?

Here β2 is the log odds ratio for developing myopia associated with a one hour increase in the
number of hours per week a child engages in sports, holding all other factors constant.
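
On the odds ratio scale, exp(β̂2) = exp(−0.04376) ≈ 0.957, so each additional hour of sport per
week is associated with an estimated 4.3% reduction in the odds of myopia. As an illustration
(not required by the question), this and a Wald-based 95% CI can be computed directly:

# OR for a one hour increase in SPORTHR, with a Wald-based 95% CI
exp(coef(model1)["SPORTHR"]) # approximately 0.957
exp(confint.default(model1, "SPORTHR")) # Wald interval on the OR scale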

(c) [4 marks] Find an estimate and 95% confidence interval for the odds ratio for developing myopia in a
child who has both a mother and father with myopia versus a child who has both a mother and father
without myopia, holding all other factors constant.

The odds ratio is exp(β4 + β5 ). This can be found using the table method:

AGE   SPORTHR   DIOPTERHR   MOMMY   DADMY   log(π/(1 − π))
x1    x2        x3          1       1       β0 + β1 x1 + β2 x2 + β3 x3 + β4 + β5
x1    x2        x3          0       0       β0 + β1 x1 + β2 x2 + β3 x3
                                    Difference: β4 + β5

The R code below calculates the estimated OR and 95% Confidence Interval. Details are given
below the code block.

v <- summary(model1)$cov.unscaled # save the variance-covariance matrix of the coefficients
cvec <- matrix(c(0,0,0,0,1,1), ncol=1) # contrast vector to pull out beta_4 + beta_5
myest <- t(cvec)%*%model1$coeff; myest # estimated log OR

## [,1]
## [1,] 1.826516

exp(myest) # estimated OR

## [,1]
## [1,] 6.212205

myvar <- t(cvec)%*%v%*%cvec; myvar # estimate the variance of beta_4 + beta_5

## [,1]
## [1,] 0.1440127

myest + c(-1,1)*qnorm(0.975)*sqrt(myvar) # 95% confidence interval on the log OR scale

## [1] 1.082729 2.570303

exp(myest + c(-1,1)*qnorm(0.975)*sqrt(myvar)) # 95% confidence interval for the OR

## [1] 2.952727 13.069780
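
As a quick optional check, the same variance can be obtained with vcov(); for a binomial glm the
dispersion is fixed at 1, so vcov(model1) equals summary(model1)$cov.unscaled.

# Optional check: the same variance via vcov()
V = vcov(model1)
V["MOMMY","MOMMY"] + V["DADMY","DADMY"] + 2*V["MOMMY","DADMY"] # 0.1440127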

So the estimated OR for developing myopia in a child who has both a mother and father with
myopia versus a child who has both a mother and father without myopia, holding all other factors
constant is:
exp(β̂4 + β̂5) = 6.212

To find the corresponding 95% CI we first need to find the standard error of β̂4 + β̂5:

Var(β̂4 + β̂5) = Var(β̂4) + Var(β̂5) + 2 Cov(β̂4, β̂5)

Using R, we find Var(β̂4 + β̂5) = 0.144013 so se(β̂4 + β̂5) = 0.37949. Finally the 95% CI is
calculated using:

exp( β̂4 + β̂5 ± 1.96 se(β̂4 + β̂5) ) = (2.953, 13.07)

(d) [3 marks] Use a Deviance/Likelihood Ratio test to conduct a test of the null hypothesis that a family
history of myopia is not associated with a child’s risk of developing myopia. You should use the main
effects model as the full model and define an appropriate reduced model. Be sure to carefully state the
null and alternative hypotheses in terms of the regression coefficients (be explicit about which model
you are referring to) and give the formula of the test statistic and its asymptotic distribution under the
null hypothesis. What is the conclusion of the test?

Referring to the main effects model above, we are interested in testing

H0 : β4 = β5 = 0 versus HA : β4 ≠ 0 or β5 ≠ 0

To obtain the deviance statistic we compare the full main effects model to a reduced model, fit
under H0, with explanatory variables AGE, SPORTHR, and DIOPTERHR only.

# Fit the reduced model without MOMMY and DADMY
model2 = glm(MYOPIC~AGE+SPORTHR+DIOPTERHR, family=binomial(link=logit), data=myopia)
summary(model2)

##
## Call:
## glm(formula = MYOPIC ~ AGE + SPORTHR + DIOPTERHR, family = binomial(link = logit),
## data = myopia)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8621 -0.5758 -0.5128 -0.4252 2.6256
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.908535 1.046149 -1.824 0.06810 .
## AGE 0.042048 0.169216 0.248 0.80376
## SPORTHR -0.046612 0.017699 -2.634 0.00845 **
## DIOPTERHR 0.009977 0.007472 1.335 0.18179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 480.08 on 617 degrees of freedom
## Residual deviance: 471.49 on 614 degrees of freedom
## AIC: 479.49
##
## Number of Fisher Scoring iterations: 5

# Test statistic and p-value for the Deviance test comparing model2 to model1
model2$deviance - model1$deviance

## [1] 25.66888

1-pchisq(model2$deviance - model1$deviance, 2)

## [1] 2.667307e-06

Using the difference in deviance statistics we have

∆D = D2 − D1 = 471.4906124 − 445.8217297 = 25.6688827

which has a χ²(6−4) = χ²(2) distribution under H0 . The p-value for this test is given by

P[χ²(2) > 25.6688827] = 2.6673066 × 10⁻⁶ < 0.001


Therefore we reject the null hypothesis that β4 = β5 = 0, i.e. that the reduced model is adequate
in comparison to the main effects model. In other words, we reject the null hypothesis that a
family history of myopia is not associated with a child’s risk of developing myopia.
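
The same deviance comparison can also be obtained in one step with anova() on the nested models;
this is simply an equivalent check of the calculation above.

# Optional check: the deviance test via anova()
anova(model2, model1, test = "Chisq")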

Question 3 [12 marks]
Text of the problem has been omitted

a) [4 marks] Run the simulation study described above. Be sure to submit well commented code.

# Simulations for the estimation of beta_1 in logistic and linear regression

mySim <- function(seed, n, beta.true, analysis, logistic=T)
{
# This function generates and analyses a single dataset of size n
# seed = replication index supplied by sapply (not used directly here)
# beta.true = true parameter values for (beta_0,...,beta_15)
# analysis = indices of which of (x_1,...,x_15) to include in the regression

p = length(beta.true)-1

# Generate the full n by 15 covariate matrix (each row is 1 individual)
X.full = matrix( rnorm(p*n, 0, 1), nrow = n, ncol = p)
eta = cbind( rep(1,n), X.full) %*% beta.true
X = X.full[,analysis] # Save the covariate matrix to use for analysis

# Logistic regression simulation or linear regression simulation?


if( logistic==T )
{
Y = rbinom(n,1, expit(eta))
fit = glm(Y~X, family=binomial(link=logit))
}
if( logistic==F )
{
Y = rnorm(n,eta,1)
fit = lm(Y~X)
}

# Save the beta_1 estimate and its standard error
# (for lm, cov.unscaled is (X'X)^(-1); since the true error variance is 1,
# this is the model-based standard error with sigma fixed at its true value)
c( fit$coeff[2], sqrt(summary(fit)$cov.unscaled[2,2]) )
}

SimSummary = function(sim.table, beta1.true)
{
# Calculate and save the bias, 95% CI coverage and average standard error
# sim.table = table with rows (beta1, se(beta1)) from the mySim() function
# beta1.true = true value of beta1

pbias = (mean(sim.table[1,]) - beta1.true)/beta1.true # Relative bias

ci.lower = sim.table[1,] - qnorm(0.975)*sim.table[2,] # CI lower bound
ci.upper = sim.table[1,] + qnorm(0.975)*sim.table[2,] # CI upper bound
ci.coverage = mean( (ci.lower <= beta1.true) & (ci.upper >= beta1.true)) # CI coverage
ase = mean(sim.table[2,]) # Average SE
c(round(100*pbias,2), 100*ci.coverage, round(ase,4)) # (% bias, % coverage, ASE)
}

# Simulation set-up starts here


set.seed(2017)
expit = function(x){exp(x)/(1+exp(x))} # useful function

Table 2: Logistic Regression Simulation Results

Variables                     n = 100                          n = 500
Included        % Bias   95% CI Coverage      ASE   % Bias   95% CI Coverage      ASE
x1 only         -22.23   85.2              0.2331   -24.39   53.6              0.1012
x1, ..., x5       8.77   94.8              0.2939     0.81   95.6              0.1219
x1, ..., x15     32.96   92.2              0.3655     4.49   94.2              0.1254

# True values of (beta_0,...,beta_15)
# x1 = variable of interest
# x2, ..., x5 = variables associated with outcome (important)
# x6, ..., x15 = variables not associated with the outcome (nuisance)
beta.true = c(0.5, 0.75, 0.25, 0.5, 0.75, 1.0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

# Analyses to consider (logical vectors of which x_j's to include in the regression)
analysis1 = c(TRUE,rep(FALSE,4),rep(FALSE,10)) # x1 only; omit important and nuisance variables
analysis2 = c(TRUE,rep(TRUE,4),rep(FALSE,10)) # Include important, omit nuisance
analysis3 = c(TRUE,rep(TRUE,4),rep(TRUE,10)) # Include important and nuisance

N = 500 # Number of replications

# Simulation Set #1: Logistic Regression, sample size n=100


n=100
sim.result = sapply(1:N, mySim, n=n, beta.true=beta.true, analysis=analysis1, logistic=T)
sim.summary1 = SimSummary(sim.result, beta1.true = 0.75)
sim.result = sapply(1:N, mySim, n=n, beta.true=beta.true, analysis=analysis2, logistic=T)
sim.summary2 = SimSummary(sim.result, beta1.true = 0.75)
sim.result = sapply(1:N, mySim, n=n, beta.true=beta.true, analysis=analysis3, logistic=T)
sim.summary3 = SimSummary(sim.result, beta1.true = 0.75)

# Simulation Set #2: Logistic Regression, sample size n=500


n=500
sim.result = sapply(1:N, mySim, n=n, beta.true=beta.true, analysis=analysis1, logistic=T)
sim.summary4 = SimSummary(sim.result, beta1.true = 0.75)
sim.result = sapply(1:N, mySim, n=n, beta.true=beta.true, analysis=analysis2, logistic=T)
sim.summary5 = SimSummary(sim.result, beta1.true = 0.75)
sim.result = sapply(1:N, mySim, n=n, beta.true=beta.true, analysis=analysis3, logistic=T)
sim.summary6 = SimSummary(sim.result, beta1.true = 0.75)

# Print the simulation summary results


cbind( rbind( sim.summary1, sim.summary2, sim.summary3),
rbind( sim.summary4, sim.summary5, sim.summary6))

## [,1] [,2] [,3] [,4] [,5] [,6]


## sim.summary1 -22.23 85.2 0.2331 -24.39 53.6 0.1012
## sim.summary2 8.77 94.8 0.2939 0.81 95.6 0.1219
## sim.summary3 32.96 92.2 0.3655 4.49 94.2 0.1254
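
For readability, row and column labels matching Table 2 can be attached to the summary matrix
before printing; the object name below is illustrative.

# Optional: label the summary matrix to match Table 2
logistic.results = cbind( rbind(sim.summary1, sim.summary2, sim.summary3),
                          rbind(sim.summary4, sim.summary5, sim.summary6))
dimnames(logistic.results) = list( c("x1 only", "x1,...,x5", "x1,...,x15"),
                                   rep( c("% Bias", "Coverage", "ASE"), 2))
logistic.results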

b) [4 marks] Provide a written discussion of the patterns you observe in the table you produce. What
happens when important explanatory variables are omitted from the model? What happens when unimportant
explanatory variables are included in the model?

• Note that analysis 2 (the second line of the results table) is the most correct analysis since the other
important explanatory variables (x2 , . . . , x5 ) are included in the model and the unimportant or nuisance
variables (x6 , . . . , x15 ) are not. In this scenario the bias in estimating β1 is quite low (especially at the
larger sample size of n = 500) and the confidence interval coverage is essentially at the nominal level of
95%.
• In analysis 1 (first line of the results table), when important covariates are omitted from the model,
there is bias in the estimation of β1 and the confidence interval coverage is low. This occurs regardless
of the sample size. It appears that the standard error is being somewhat underestimated in this setting.
• In analysis 3 (third line of the table) the inclusion of nuisance variables leads to bias in the estimation
of β1 at the small sample size of n = 100. This bias is substantially reduced at n = 500. At both sample
sizes the confidence interval coverage is just below the nominal level. It appears that, especially at
the smaller sample size, the standard error is being overestimated, probably due to all the extra noise
introduced into the analysis from the nuisance variables.

c) [4 marks] Repeat part (a) but this time generate Yi from a Normal distribution with mean xᵢᵀβ and
variance 1. Fit a linear regression model to your simulated data rather than a logistic regression model.
Provide a written discussion of the patterns you observe in the linear regression simulation study and
contrast them with the patterns you observed for the logistic regression study.

The R code below runs the linear regression simulations described in this problem.

# Simulation Set #3: Linear Regression, sample size n=100


n=100
sim.result = sapply(1:N, mySim, n=n, beta.true=beta.true, analysis=analysis1, logistic=F)
sim.summary7 = SimSummary(sim.result, beta1.true = 0.75)
sim.result = sapply(1:N, mySim, n=n, beta.true=beta.true, analysis=analysis2, logistic=F)
sim.summary8 = SimSummary(sim.result, beta1.true = 0.75)
sim.result = sapply(1:N, mySim, n=n, beta.true=beta.true, analysis=analysis3, logistic=F)
sim.summary9 = SimSummary(sim.result, beta1.true = 0.75)

# Simulation Set #4: Linear Regression, sample size n=500


n=500
sim.result = sapply(1:N, mySim, n=n, beta.true=beta.true, analysis=analysis1, logistic=F)
sim.summary10 = SimSummary(sim.result, beta1.true = 0.75)
sim.result = sapply(1:N, mySim, n=n, beta.true=beta.true, analysis=analysis2, logistic=F)
sim.summary11 = SimSummary(sim.result, beta1.true = 0.75)
sim.result = sapply(1:N, mySim, n=n, beta.true=beta.true, analysis=analysis3, logistic=F)
sim.summary12 = SimSummary(sim.result, beta1.true = 0.75)

# Print the simulation summary results


cbind( rbind( sim.summary7, sim.summary8, sim.summary9),
rbind( sim.summary10, sim.summary11, sim.summary12))

## [,1] [,2] [,3] [,4] [,5] [,6]


## sim.summary7 0.06 76.6 0.1018 -0.75 77.4 0.0448
## sim.summary8 -0.45 94.8 0.1031 -0.24 94.2 0.0451
## sim.summary9 0.30 95.0 0.1093 0.13 94.8 0.0454

Table 3: Linear Regression Simulation Results

Variables                     n = 100                          n = 500
Included        % Bias   95% CI Coverage      ASE   % Bias   95% CI Coverage      ASE
x1 only           0.06   76.6              0.1018    -0.75   77.4              0.0448
x1, ..., x5      -0.45   94.8              0.1031    -0.24   94.2              0.0451
x1, ..., x15      0.30   95.0              0.1093     0.13   94.8              0.0454

• The primary difference between the linear and logistic regression simulations is that all of the linear
regression analyses are essentially unbiased. It doesn’t matter whether the other important explanatory
variables are included or omitted (analysis 2 vs 1) or whether nuisance variables are included or omitted
(analysis 3 vs 2).
• The exclusion of important explanatory variables does affect the confidence interval coverage in analysis
1 (first row of the table), and a larger sample size does not really improve the coverage. The coverage of
the confidence intervals in analysis 2 (the correct analysis) and analysis 3 (nuisance variables included) is
essentially at the nominal level.
• The ASEs are very similar across the three analyses (note these should not be directly compared to
the ASEs from logistic regression), with perhaps a slight underestimate of the ASE in analysis 1 and
overestimate in analysis 3.

Note: The take-away message from this simulation study is that in linear regression we could
obtain an unbiased estimate of β1 without worrying too much about the other variables in the
model (this holds because the xij ’s were independent); the coverage suffered if we omitted
important variables, and the inclusion/exclusion of nuisance variables was not very important.
However, for logistic regression it is essential that we do a good job at model selection and include
all the important variables and exclude all the nuisance variables from the model. (The bias seen
in the logistic case even with independent covariates reflects the non-collapsibility of the odds
ratio: omitting important covariates attenuates the conditional log odds ratio toward zero.)
