
STATS 330: Lecture 6

Inference for the Multiple Regression Model

31.07.2014
Getting RStudio

http://www.rstudio.org/

http://r-project.org/
Inference for the regression model

Aim of today's lecture

I To discuss how we assess the significance of variables in the regression.

I Key concepts:

I Standard errors
I Confidence intervals for the coefficients
I Tests of significance
Variability of the regression coefficients

I Imagine that we keep the xs fixed, but resample the errors and refit the plane. How much would the plane (estimated coefficients) change?

I This gives us an idea of the variability (accuracy) of the estimated coefficients as estimates of the coefficients of the true regression plane.
[Figure: data points scattered about a fitted regression plane for Y against X1 and X2]
Variability of the regression coefficients

I Variability depends on

I The arrangement of the xs (the more correlation, the more change)
I The error variance (the more scatter about the true plane, the more the fitted plane changes)

I Measure variability by the standard error of the coefficients


Example: Cherries

Call:
lm(formula = volume ~ diameter + height)

Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
diameter 56.4979 3.1712 17.816 < 2e-16 ***
height 0.3393 0.1302 2.607 0.0145 *
---

Residual standard error: 3.882 on 28 degrees of freedom


Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
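The output above can be reproduced in R. The coefficients suggest the data are R's built-in trees data set with diameter measured in feet (Girth/12); the names cherry.df and cherry.lm are assumed here, not given in the slides.

```r
# Sketch, assuming the cherry data are R's built-in `trees` data set
# with diameter = Girth / 12 (feet); names are illustrative.
cherry.df <- with(trees, data.frame(volume = Volume,
                                    diameter = Girth / 12,
                                    height = Height))
cherry.lm <- lm(volume ~ diameter + height, data = cherry.df)
summary(cherry.lm)
```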
Confidence intervals

CI : estimated coefficient ± t × standard error.

t : 97.5% point of the t distribution with df degrees of freedom.

df : n - k - 1.

n : number of observations.

k : number of covariates (assuming we have a constant term).
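The recipe above can be applied by hand to the height coefficient in the cherry model, using the estimate and standard error from the summary output:

```r
# Sketch: 95% CI for the height coefficient in the cherry model,
# using the values printed in the summary output.
est <- 0.3393            # estimated coefficient for height
se  <- 0.1302            # its standard error
df  <- 28                # n - k - 1 = 31 - 2 - 1
tcrit <- qt(0.975, df)   # 97.5% point of t with 28 df (about 2.048)
c(lower = est - tcrit * se, upper = est + tcrit * se)
# roughly (0.073, 0.606), agreeing with confint() on the next slide
```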
Confidence intervals
Example: Cherries

Use stats function confint

> confint(cherry.lm)
2.5 % 97.5 %
(Intercept) -75.68226247 -40.2930554
diameter 50.00206788 62.9937842
height 0.07264863 0.6058538
Hypothesis test

I Often we ask: do we need a particular variable, given the others are in the model?

I Note that this is not the same as asking: is a particular variable related to the response?

I Can test the former by examining the ratio of the coefficient to its standard error.
Hypothesis test

I This ratio is the t-statistic t.

I The bigger |t|, the more we need the variable.

I Equivalently, the smaller the p-value, the more we need the variable.
Example: Cherries

Call:
lm(formula = volume ~ diameter + height)

Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
diameter 56.4979 3.1712 17.816 < 2e-16 ***
height 0.3393 0.1302 2.607 0.0145 *
---

Residual standard error: 3.882 on 28 degrees of freedom


Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Recall: p-value
[Figure: density of the t distribution with df = 28; the p-value (0.0145) is the two-tailed area beyond ±2.607]
Other hypotheses

I Overall significance of the regression: do none of the variables have a relationship with the response?

I Use the F statistic: the bigger F, the more evidence that at least one variable has a relationship.

I Equivalently, the smaller the p-value, the more evidence that at least one variable has a relationship.
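For the cherry model, the overall p-value reported by summary() can be checked directly from the F distribution:

```r
# Sketch: p-value for the overall F-test of the cherry model.
# F = 255 on 2 and 28 degrees of freedom, from the summary output.
pf(255, df1 = 2, df2 = 28, lower.tail = FALSE)
# extremely small, consistent with the reported p-value < 2.2e-16
```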
Example: Cherries

Call:
lm(formula = volume ~ diameter + height)

Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
diameter 56.4979 3.1712 17.816 < 2e-16 ***
height 0.3393 0.1302 2.607 0.0145 *
---

Residual standard error: 3.882 on 28 degrees of freedom


Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Testing if a subset is required

I Often we want to test if a subset of variables is unnecessary.

I Terminology

Full model: Model containing all variables.

Submodel: Model with a set of variables removed.

I The test is based on comparing the RSS of the submodel with the RSS of the full model. The full model RSS is always smaller (why?)
Testing if a subset is required

I If the full model RSS is not much smaller than the submodel
RSS, the submodel is adequate: we do not need the extra
variables.

I To do the test, we

I fit both models and get the RSS for both;
I calculate the test statistic.

I If the test statistic is large (equivalently, the p-value is small), the submodel is not adequate.
Testing if a subset is required

I The test statistic is

  F = (RSS_sub - RSS_full) / (s^2 × (df_sub - df_full))

I df_sub - df_full is the number of variables dropped.

I s^2 is the estimate of sigma^2 from the full model (the residual mean square).

I R has a function anova to do the calculation.


p-values

I If the submodel is correct, the test statistic has an F-distribution with df_sub - df_full and n - k - 1 degrees of freedom.

I We assess whether the value of F calculated from the sample is a plausible value from this distribution by means of a p-value.

I If the p-value is too small, we have evidence against the hypothesis that the submodel is OK.
p-values
[Figure: density of the F distribution with 2 and 16 degrees of freedom; the p-value is the area to the right of the observed F value]
Example: Free fatty acid data

I Use physical measures to model a biochemical parameter in overweight children.

I Variables are

FFA: Free fatty acid level in blood (response variable)

Age: months

Weight: pounds

Skinfold thickness: inches


Analysis

Call:
lm(formula = ffa ~ age + weight + skinfold, data = fatty.df)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.95777 1.40138 2.824 0.01222 *
age -0.01912 0.01275 -1.499 0.15323
weight -0.02007 0.00613 -3.274 0.00478 **
skinfold -0.07788 0.31377 -0.248 0.80714

This suggests
I age is not required if weight and skinfold are retained

I skinfold is not required if weight and age are retained

I Can we get away with just weight?


Analysis

> model.sub <- lm(ffa~weight,data=fatty.df)


> anova(model.sub,model.full)
Analysis of Variance Table

Model 1: ffa ~ weight


Model 2: ffa ~ age + weight + skinfold
Res.Df RSS Df Sum of Sq F Pr(>F)
1 18 0.91007
2 16 0.79113 2 0.11895 1.2028 0.3261
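The F statistic and p-value in the anova table can be reproduced by hand from the RSS values, using the formula from the earlier slide:

```r
# Sketch: reproducing the anova() F statistic from the RSS values.
rss.sub  <- 0.91007   # submodel ffa ~ weight, 18 residual df
rss.full <- 0.79113   # full model, 16 residual df
s2 <- rss.full / 16                    # residual mean square, full model
F  <- (rss.sub - rss.full) / (s2 * 2)  # 2 = number of variables dropped
c(F = F, p = pf(F, 2, 16, lower.tail = FALSE))
# F about 1.20, p about 0.33, matching the table
```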

I Small F and large p-value suggest weight alone is adequate.


I But the test should be interpreted with caution: could confounding be present?
Confounding?
I A non-causal relation due to a missing variable.

I The effect can be checked by comparing coefficients in the full model and the submodel (if available).
> summary(model.full)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.95777 1.40138 2.824 0.01222 *
age -0.01912 0.01275 -1.499 0.15323
weight -0.02007 0.00613 -3.274 0.00478 **
skinfold -0.07788 0.31377 -0.248 0.80714

> summary(model.sub)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.01651 0.37578 5.366 4.23e-05 ***
weight -0.02162 0.00608 -3.555 0.00226 **
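The comparison above can be done directly on the fitted objects (model names as used on the earlier slides):

```r
# Sketch: compare the weight coefficient across the two fits.
coef(model.full)["weight"]   # about -0.0201
coef(model.sub)["weight"]    # about -0.0216
# The two estimates are close, so there is little sign that dropping
# age and skinfold has distorted the weight coefficient.
```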
