
STATS 330: Lecture 6

Inference for the Multiple Regression Model

31.07.2014
Getting RStudio

http://www.rstudio.org/

http://r-project.org/
Inference for the regression model

Aim of today's lecture

I To discuss how we assess the significance of variables in the regression.

I Key concepts:

I Standard errors
I Confidence intervals for the coefficients
I Tests of significance
Variability of the regression coefficients

I Imagine that we keep the xs fixed, but resample the errors and refit the plane. How much would the plane (estimated coefficients) change?

I This gives us an idea of the variability (accuracy) of the estimated coefficients as estimates of the coefficients of the true regression plane.
[Figure: data points scattered about a fitted regression plane for Y against X1 and X2]
Variability of the regression coefficients

I Variability depends on

I The arrangement of the xs (the more correlation, the more change)
I The error variance (the more scatter about the true plane, the more the fitted plane changes)

I Measure variability by the standard error of the coefficients


Example: Cherries

Call:
lm(formula = volume ~ diameter + height)

Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
diameter 56.4979 3.1712 17.816 < 2e-16 ***
height 0.3393 0.1302 2.607 0.0145 *
---

Residual standard error: 3.882 on 28 degrees of freedom


Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
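The output above can be reproduced in R. The coefficients suggest the data are R's built-in trees data set with diameter measured in feet (Girth/12); the names cherry.df and cherry.lm are assumed here, not given in the slides.

```r
# Sketch, assuming the cherry data are R's built-in `trees` data set
# with diameter = Girth / 12 (feet); names are illustrative.
cherry.df <- with(trees, data.frame(volume = Volume,
                                    diameter = Girth / 12,
                                    height = Height))
cherry.lm <- lm(volume ~ diameter + height, data = cherry.df)
summary(cherry.lm)
```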
Confidence intervals

CI : estimated coefficient ± t × standard error.

t : 97.5% point of the t distribution with df degrees of freedom.

df : n - k - 1.

n : number of observations.

k : number of covariates (assuming we have a constant term).
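The recipe above can be applied by hand to the height coefficient in the cherry model, using the estimate and standard error from the summary output:

```r
# Sketch: 95% CI for the height coefficient in the cherry model,
# using the values printed in the summary output.
est <- 0.3393            # estimated coefficient for height
se  <- 0.1302            # its standard error
df  <- 28                # n - k - 1 = 31 - 2 - 1
tcrit <- qt(0.975, df)   # 97.5% point of t with 28 df (about 2.048)
c(lower = est - tcrit * se, upper = est + tcrit * se)
# roughly (0.073, 0.606), agreeing with confint() on the next slide
```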
Confidence intervals
Example: Cherries

Use stats function confint

> confint(cherry.lm)
2.5 % 97.5 %
(Intercept) -75.68226247 -40.2930554
diameter 50.00206788 62.9937842
height 0.07264863 0.6058538
Hypothesis test

I Often we ask: do we need a particular variable, given the others are in the model?

I Note that this is not the same as asking: is a particular variable related to the response?

I Can test the former by examining the ratio of the coefficient to its standard error.
Hypothesis test

I This ratio is the t-statistic t.

I The bigger |t|, the more we need the variable.

I Equivalently, the smaller the p-value, the more we need the variable.
Example: Cherries

Call:
lm(formula = volume ~ diameter + height)

Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
diameter 56.4979 3.1712 17.816 < 2e-16 ***
height 0.3393 0.1302 2.607 0.0145 *
---

Residual standard error: 3.882 on 28 degrees of freedom


Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Recall: p-value
[Figure: density of the t distribution with df = 28; the p-value (0.0145) is the two-tailed area beyond ±2.607]
Other hypotheses

I Overall significance of the regression: do none of the variables have a relationship with the response?

I Use the F statistic: the bigger F, the more evidence that at least one variable has a relationship.

I Equivalently, the smaller the p-value, the more evidence that at least one variable has a relationship.
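For the cherry model, the overall p-value reported by summary() can be checked directly from the F distribution:

```r
# Sketch: p-value for the overall F-test of the cherry model.
# F = 255 on 2 and 28 degrees of freedom, from the summary output.
pf(255, df1 = 2, df2 = 28, lower.tail = FALSE)
# extremely small, consistent with the reported p-value < 2.2e-16
```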
Example: Cherries

Call:
lm(formula = volume ~ diameter + height)

Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
diameter 56.4979 3.1712 17.816 < 2e-16 ***
height 0.3393 0.1302 2.607 0.0145 *
---

Residual standard error: 3.882 on 28 degrees of freedom


Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Testing if a subset is required

I Often we want to test if a subset of variables is unnecessary.

I Terminology

Full model: Model containing all variables.

Submodel: Model with a set of variables removed.

I The test is based on comparing the RSS of the submodel with the RSS of the full model. The full model RSS is always smaller (why?)
Testing if a subset is required

I If the full model RSS is not much smaller than the submodel
RSS, the submodel is adequate: we do not need the extra
variables.

I To do the test, we

I fit both models and get the RSS for both;
I calculate the test statistic.

I If the test statistic is large (equivalently, the p-value is small), the submodel is not adequate.
Testing if a subset is required

I The test statistic is

  F = (RSS_sub - RSS_full) / (s^2 × (df_sub - df_full))

I df_sub - df_full is the number of variables dropped.

I s^2 is the estimate of sigma^2 from the full model (the residual mean square).

I R has a function anova to do the calculation.


p-values

I If the submodel is correct, the test statistic has an F-distribution with df_sub - df_full and n - k - 1 degrees of freedom.

I We assess whether the value of F calculated from the sample is a plausible value from this distribution by means of a p-value.

I If the p-value is too small, we have evidence against the hypothesis that the submodel is OK.
p-values
[Figure: density of the F distribution with 2 and 16 degrees of freedom; the p-value is the area to the right of the observed F value]
Example: Free fatty acid data

I Use physical measures to model a biochemical parameter in overweight children.

I Variables are

FFA: Free fatty acid level in blood (response variable)

Age: months

Weight: pounds

Skinfold thickness: inches


Analysis

Call:
lm(formula = ffa ~ age + weight + skinfold, data = fatty.df)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.95777 1.40138 2.824 0.01222 *
age -0.01912 0.01275 -1.499 0.15323
weight -0.02007 0.00613 -3.274 0.00478 **
skinfold -0.07788 0.31377 -0.248 0.80714

This suggests
I age is not required if weight and skinfold are retained

I skinfold is not required if weight and age are retained

I Can we get away with just weight?


Analysis

> model.sub <- lm(ffa~weight,data=fatty.df)


> anova(model.sub,model.full)
Analysis of Variance Table

Model 1: ffa ~ weight


Model 2: ffa ~ age + weight + skinfold
Res.Df RSS Df Sum of Sq F Pr(>F)
1 18 0.91007
2 16 0.79113 2 0.11895 1.2028 0.3261
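The F statistic and p-value in the anova table can be reproduced by hand from the RSS values, using the formula from the earlier slide:

```r
# Sketch: reproducing the anova() F statistic from the RSS values.
rss.sub  <- 0.91007   # submodel ffa ~ weight, 18 residual df
rss.full <- 0.79113   # full model, 16 residual df
s2 <- rss.full / 16                    # residual mean square, full model
F  <- (rss.sub - rss.full) / (s2 * 2)  # 2 = number of variables dropped
c(F = F, p = pf(F, 2, 16, lower.tail = FALSE))
# F about 1.20, p about 0.33, matching the table
```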

I Small F and large p-value suggest weight alone is adequate.


I But the test should be interpreted with caution: could confounding be present?
Confounding?
I A non-causal relation due to a missing variable.

I The effect can be checked by comparing coefficients in the full model and the submodel (if available).
> summary(model.full)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.95777 1.40138 2.824 0.01222 *
age -0.01912 0.01275 -1.499 0.15323
weight -0.02007 0.00613 -3.274 0.00478 **
skinfold -0.07788 0.31377 -0.248 0.80714

> summary(model.sub)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.01651 0.37578 5.366 4.23e-05 ***
weight -0.02162 0.00608 -3.555 0.00226 **
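The comparison above can be done directly on the fitted objects (model names as used on the earlier slides):

```r
# Sketch: compare the weight coefficient across the two fits.
coef(model.full)["weight"]   # about -0.0201
coef(model.sub)["weight"]    # about -0.0216
# The two estimates are close, so there is little sign that dropping
# age and skinfold has distorted the weight coefficient.
```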
