
Multiple linear regression

Part 2

Beginning of next lecture: online course evaluation (bring a tablet,
laptop, phone?)
Model comparison by F-Test
Tests whether the addition (removal) of one or more terms
significantly increases (decreases) model fit

Works via a model comparison, just like we discussed for t-tests,
regression & ANOVA

The reduced model has to be a subset of the full model (i.e. they are
nested models)
E.g. all terms in the reduced model are also in the full model; the full
model has additional terms unique to it, but the reduced model has no
terms unique to it

In R, use anova(model1, model2). Order doesn't matter to the
interpretation.
Model comparison by F-Test
> anova(model.polynomial.reduced, full.model.2ndinteractions)
Analysis of Variance Table

Model 1: logherp ~ logarea + thtden + swamp + I(swamp^2)
Model 2: logherp ~ logarea + cpfor2 + thtden + swamp + I(swamp^2) + logarea:cpfor2 +
    logarea:thtden + logarea:swamp + cpfor2:thtden + cpfor2:swamp + thtden:swamp

  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     23 0.25999
2     16 0.18651  7  0.073486 0.9006 0.5294

> anova(full.model.2ndinteractions, model.polynomial.reduced)
Analysis of Variance Table

Model 1: logherp ~ logarea + cpfor2 + thtden + swamp + I(swamp^2) + logarea:cpfor2 +
    logarea:thtden + logarea:swamp + cpfor2:thtden + cpfor2:swamp + thtden:swamp
Model 2: logherp ~ logarea + thtden + swamp + I(swamp^2)

  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     16 0.18651
2     23 0.25999 -7 -0.073486 0.9006 0.5294
What if relationship between Y and
one or more Xs is nonlinear?
Option 1: transform the data.
Option 2: use non-linear regression.
Option 3: use polynomial regression.
(A brief R sketch of all three options follows below.)
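A minimal sketch of the three options, assuming a data frame dat with a
response y and a predictor x (hypothetical names, not from the slides);
the non-linear curve and its starting values are likewise only illustrative:

# Option 1: transform the data, then fit an ordinary linear model
m.trans <- lm(log10(y) ~ x, data = dat)

# Option 2: non-linear regression with an explicitly chosen curve
# (an illustrative exponential; starting values are guesses)
m.nls <- nls(y ~ a * exp(b * x), data = dat, start = list(a = 1, b = 0.1))

# Option 3: polynomial regression (a 2nd-order polynomial shown)
m.poly <- lm(y ~ x + I(x^2), data = dat)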
The polynomial regression model

In polynomial regression, the regression model includes terms of
increasingly higher powers of the independent variable:

Y_i = β_0 + β_1 X_i + β_2 X_i^2 + ... + β_k X_i^k + ε_i

[Figure: black fly biomass (mg DM/m) vs. current velocity (cm/s), with a
linear fit and a 2nd-order polynomial fit]
The polynomial regression model: procedure

Fit a simple linear model.

Fit a model with a quadratic term added; test for the increase in SSmodel.

Continue with higher orders (cubic, quartic, etc.) until there is no
further significant increase in SSmodel (see the R sketch below).

[Figure: black fly biomass (mg DM/m) vs. current velocity (cm/s), with the
linear and 2nd-order polynomial fits]
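The fitting calls behind this procedure are not shown on the slides; a
minimal sketch in R, assuming a data frame dat with a response y and a
predictor x (hypothetical names):

m1 <- lm(y ~ x, data = dat)                    # linear
m2 <- lm(y ~ x + I(x^2), data = dat)           # add quadratic term
m3 <- lm(y ~ x + I(x^2) + I(x^3), data = dat)  # add cubic term

anova(m1, m2)  # does the quadratic term significantly increase SSmodel?
anova(m2, m3)  # any further gain from the cubic term? stop when not significant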
Polynomial regression: caveats

The biological significance of the higher-order terms in a polynomial
regression (if any) is often not known.

By definition, polynomial terms are strongly correlated (i.e.
multicollinearity will be high). Hence standard errors will be large
(precision is low) and will increase with the order of the terms.

Extrapolation of polynomial models is usually nonsense.

[Figure: the curve Y = X1 - X1^2 plotted against X1]
Example

[Figure: scatterplot of y vs. x with a lowess curve]
Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) -0.3701     0.1562 -2.3699   0.0188
x            0.0065     0.0027  2.4114   0.0168

Residual standard error: 1.096 on 198 degrees of freedom
Multiple R-Squared: 0.02853
F-statistic: 5.815 on 1 and 198 degrees of freedom, the p-value is 0.01681

[Figure: partial residual plot (partial for x) vs. x]
Coefficients:
               Value Std. Error  t value Pr(>|t|)
(Intercept)   1.7718     0.1244  14.2486   0.0000
x            -0.1195     0.0057 -21.0307   0.0000
I(x^2)        0.0012     0.0001  22.8826   0.0000

Residual standard error: 0.5745 on 197 degrees of freedom
Multiple R-Squared: 0.7344
F-statistic: 272.4 on 2 and 197 degrees of freedom, the p-value is 0

[Figure: scatterplot of y vs. x with the fitted curve]
Coefficients:
               Value Std. Error t value Pr(>|t|)
(Intercept)   1.0295     0.1499  6.8655   0.0000
x            -0.0335     0.0128 -2.6150   0.0096
I(x^2)       -0.0009     0.0003 -2.9716   0.0033
I(x^3)        0.0000     0.0000  7.3217   0.0000

Residual standard error: 0.5104 on 196 degrees of freedom
Multiple R-Squared: 0.7915
F-statistic: 248 on 3 and 196 degrees of freedom, the p-value is 0

[Figure: scatterplot of y vs. x with the fitted curve]
Coefficients:
               Value Std. Error t value Pr(>|t|)
(Intercept)   1.0258     0.1923  5.3334   0.0000
x            -0.0328     0.0261 -1.2547   0.2111
I(x^2)       -0.0009     0.0010 -0.8652   0.3880
I(x^3)        0.0000     0.0000  0.9337   0.3516
I(x^4)        0.0000     0.0000 -0.0307   0.9755

Residual standard error: 0.5117 on 195 degrees of freedom
Multiple R-Squared: 0.7915
F-statistic: 185 on 4 and 195 degrees of freedom, the p-value is 0

[Figure: scatterplot of y vs. x with the fitted curve]
> anova(degre1, degre2)
Analysis of Variance Table

Response: y
      Terms Resid. Df      RSS    Test Df Sum of Sq  F Value Pr(F)
1         x       198 237.8462
2   x + x^2       197  65.0221 +I(x^2)  1  172.8242 523.6125     0

> anova(degre2, degre3)
Analysis of Variance Table

Response: y
            Terms Resid. Df      RSS    Test Df Sum of Sq  F Value         Pr(F)
1         x + x^2       197 65.02205
2   x + x^2 + x^3       196 51.05763 +I(x^3)  1  13.96443 53.60663 6.188605e-012

> anova(degre4, degre3)
Analysis of Variance Table

Response: y
                  Terms Resid. Df      RSS    Test Df     Sum of Sq      F Value     Pr(F)
1 x + x^2 + x^3 + x^4       195 51.05738
2       x + x^2 + x^3       196 51.05763 -I(x^4) -1 -0.0002468752 0.0009428737 0.9755352
Overfitting/Underfitting

[Figure: three fits to the same data, labelled "underfit", "pretty good",
and "overfit"]

More examples/discussion:
https://stats.stackexchange.com/questions/128616/whats-a-real-world-example-of-overfitting
Multiway ANOVA vs Multiple regression

Multiway ANOVA:
Test for significance of main terms and usually interactions
Often balanced or close to balanced, with no or very low multicollinearity
(a balanced design with no collinearity means Type I and Type III SS
(sequential vs. partial) are identical)
Rarely used for prediction (coefficients are of little interest; the model
does not need to be simplified)
Models include few terms in general (fewer than 10, including interactions)

Multiple regression:
Test for significance of main terms and sometimes interactions
Often some multicollinearity (especially for interactions), meaning
sequential vs. partial SS can be very different, so order matters (see the
R sketch below)
Often used for prediction (so coefficients matter; simple models are
usually better; overfitting is an issue)
Full models, with interactions, can include many terms
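A small sketch of the sequential-vs-partial point in R, using a
hypothetical data frame dat with correlated predictors a and b:

fit.ab <- lm(y ~ a + b, data = dat)
fit.ba <- lm(y ~ b + a, data = dat)

anova(fit.ab)              # sequential (Type I) SS: a fitted first, then b
anova(fit.ba)              # same model, different order: different SS when a and b are correlated
drop1(fit.ab, test = "F")  # partial (marginal) tests: each term assessed last, so order does not matter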
When, and why, do we pool in ANOVA?

Pooling: dropping terms and estimating sums of squares and df by combining
across levels of the dropped factor

When some terms are not significant
i.e. they do not contribute much to model fit and can be dropped

Why? To increase power and/or simplify interpretation
Increases df for F tests
Often only a minor impact; substantial only when there are many levels per
factor
How and why do you seek simple models in multiple regression?

How: by dropping terms from the full model (analogous to pooling)

Why? To increase power
Increases df for the F test
Small impact
Increases factor SS
Large impact when multicollinearity is an issue (which it often is)
Multiple regression: the general idea for inference

Evaluate the significance of a variable by fitting two models: one with
the term in (Model A, X1 in), the other with it removed (Model B, X1 out).

Test for the change in model fit (e.g. R²) associated with removal of the
term in question: retain X1 if the change is large, delete X1 if it is
small.

Unfortunately, the change in model fit may depend on what other variables
are in the model if there is multicollinearity!
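A minimal sketch of this comparison, borrowing the herptile variables used
later in the deck (assuming mydata as on those slides, and treating thtden
as the X1 being tested):

modelA <- lm(logherp ~ logarea + cpfor2 + thtden, data = mydata)  # X1 (thtden) in
modelB <- lm(logherp ~ logarea + cpfor2,          data = mydata)  # X1 (thtden) out
# (make sure both models are fit to the same rows if there are missing values)

summary(modelA)$r.squared - summary(modelB)$r.squared  # change in R^2 from removing thtden
anova(modelB, modelA)                                   # F-test of the same comparison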
Fitting multiple regression models

Goal: find the "best" model, given the available data.

Problem #1: what is "best"?
highest R²?
lowest RMS?
highest R² but containing only individually significant independent variables?
maximizes R² with the minimum number of independent variables?
Selection of independent variables (cont'd)

Problem #2: even if "best" is defined, by what method do we find it?

Possibilities:
compute all possible models (2^k - 1 of them) and choose the best one.
use some procedure for winnowing down the set of possible models.
Strategy I: computing all possible models

Compute all possible models and choose the best one.

cons:
time-consuming
leaves the definition of "best" to the researcher

pros:
if the "best" model is defined, you will find it!

[Diagram: the 2^3 - 1 = 7 possible models for three predictors:
{X1}, {X2}, {X3}, {X1, X2}, {X1, X3}, {X2, X3}, {X1, X2, X3}]
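One way to run the exhaustive search in R is the leaps package (not used
elsewhere in this deck); a sketch using the herptile variables from the
later example:

library(leaps)
all.subsets <- regsubsets(logherp ~ logarea + cpfor2 + thtden,
                          data = mydata, method = "exhaustive")
summary(all.subsets)$which   # which variables are in the best model of each size
summary(all.subsets)$adjr2   # adjusted R^2 of each of those models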
Strategy II: stepwise forward selection

Begin with the naïve (i.e. simplest) model (i.e. intercept only)

The next entry is the variable which most improves model fit
E.g. greatest increase in adjusted R², most significant F-test

Continue until no remaining variable improves model fit by the criterion
employed
Strategy III: stepwise backward selection

Start with a full model with all the variables.

Drop variables whose removal does not compromise model fit by whatever
criterion is used (e.g. adjusted R², a non-significant F-test, etc.), one
at a time, starting with the one with the smallest effect

Continue until only significant variables remain (i.e. those whose removal
would compromise model fit)

Note: once Xj is dropped, it stays out, even if it would explain a
significant amount of the remaining variability once other variables are
excluded.
AIC: Akaike Information Criterion

An index of quality of fit, penalized for model complexity

AIC = 2k - 2 ln(L)
k = number of parameters in the model
L = likelihood of the model

Calculated assuming some distribution for the residuals
Assuming a normal distribution for the residuals, with variance = the
residual variance:
For each residual, evaluate its likelihood (normal density) under N(0, RMS)
L = product of all these likelihoods
If the fit is good, L will be large
If the fit is bad, L will be very small
The lower (i.e. closer to -∞) the AIC, the better the model fit
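A sketch of this calculation done "by hand" for a fitted lm and checked
against R's built-in AIC() (model and data names are placeholders):

fit <- lm(y ~ x, data = dat)
res <- residuals(fit)
n   <- length(res)
sigma2 <- sum(res^2) / n                 # ML estimate of the residual variance
logL   <- sum(dnorm(res, mean = 0, sd = sqrt(sigma2), log = TRUE))  # log-likelihood
k      <- length(coef(fit)) + 1          # number of parameters (+ 1 for the variance)
2 * k - 2 * logL                         # AIC; should match the next line
AIC(fit)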
Example

log of herptile species richness (logherp) as a function of log wetland
area (logarea), percentage of land within 1 km covered in forest (cpfor2),
and density of hard-surface roads within 1 km (thtden)
Example (all variables)
Call:
lm(formula = logherp ~ logarea + cpfor2 + thtden, data = mydata)

Residuals:
Min 1Q Median 3Q Max
-0.30729 -0.13779 0.02627 0.11441 0.29582

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.284765 0.191420 1.488 0.149867
logarea 0.228490 0.057647 3.964 0.000578 ***
cpfor2 0.001095 0.001414 0.774 0.446516
thtden -0.035794 0.015726 -2.276 0.032055 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1619 on 24 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared: 0.5471,  Adjusted R-squared: 0.4904
F-statistic: 9.662 on 3 and 24 DF,  p-value: 0.0002291
Example: stepwise backward

AIC = Akaike Information Criterion; to be minimized.

Start: AIC=-98.27
logherp ~ logarea + cpfor2 + thtden

          Df Sum of Sq     RSS     AIC
- cpfor2   1   0.01571 0.64508 -99.576
<none>                 0.62937 -98.267
- thtden   1   0.13585 0.76522 -94.794
- logarea  1   0.41198 1.04135 -86.167

Step: AIC=-99.58
logherp ~ logarea + thtden

          Df Sum of Sq     RSS     AIC
<none>                 0.64508 -99.576
- thtden   1   0.25092 0.89600 -92.376
- logarea  1   0.40204 1.04712 -88.013
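The call that produces this trace is not shown on the slide; a minimal
sketch, assuming the full model of the earlier example and stepAIC() from
the MASS package (as used on the next slide):

library(MASS)
model.full <- lm(logherp ~ logarea + cpfor2 + thtden, data = mydata)
step.back  <- stepAIC(model.full, direction = "backward")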
Example: forward stepwise

> model.null <- lm(logherp ~ 1, data = mydata)
> step <- stepAIC(model.null, scope = ~. + logarea +
+     cpfor2 + thtden, direction = "forward")

Start: AIC=-82.09
logherp ~ 1

          Df Sum of Sq   RSS     AIC
+ logarea  1     0.494 0.896 -92.376
+ thtden   1     0.342 1.047 -88.013
+ cpfor2   1     0.129 1.260 -82.820
<none>                 1.390 -82.091

Step: AIC=-92.38
logherp ~ logarea

          Df Sum of Sq   RSS     AIC
+ thtden   1     0.251 0.645 -99.576
+ cpfor2   1     0.131 0.765 -94.794
<none>                 0.896 -92.376

Step: AIC=-99.58
logherp ~ logarea + thtden

          Df Sum of Sq   RSS     AIC
<none>                 0.645 -99.576
+ cpfor2   1     0.016 0.629 -98.267
Information Theoretic Approach

Fit all models and compute AIC for each one.

Select all "likely" models, i.e. those within 4 AIC units of the best one.

Compute average coefficients across these models, weighted by each model's
probability of being the best one.
> #####################################################################
> # Dredging and the information theoretical approach
> library(MuMIn)
> # fit all possible models
> dd <- dredge(full.model.2ndinteractions)
>
> # keep the models within 4 AICc units of the best, then average them
> top.models.1 <- get.models(dd, subset = delta < 4)
> model.avg(top.models.1)

(AICc = AIC corrected for small data sets, used when N < 40k, i.e. fewer
than 40 observations per parameter.)

Model summary:
            Deviance   AICc Delta Weight
2+3+4+5+7       0.23 -36.46  0.00   0.34
2+3+4+5         0.26 -35.91  0.55   0.26
1+2+3+4+5+7     0.22 -33.75  2.72   0.09
2+3+4+5+7+8     0.22 -33.67  2.79   0.08
1+2+3+4+5       0.25 -33.25  3.21   0.07
2+3+4+5+8       0.25 -33.02  3.44   0.06
2+3+4+5+6+7     0.22 -32.90  3.56   0.06
2+3+4+5+6       0.26 -32.50  3.97   0.05

Variables:
1 cpfor2          2 I(swamp^2)       3 logarea   4 swamp   5 thtden
6 logarea:swamp   7 logarea:thtden   8 swamp:thtden
Averaged model parameters:
               Coefficient       SE Adjusted SE  Lower CI  Upper CI
(Intercept)      -2.02e-01 2.50e-01    2.61e-01 -0.713000  0.310000
cpfor2           -1.27e-04 4.84e-04    5.02e-04 -0.001110  0.000856
I(swamp^2)       -2.68e-04 4.91e-05    5.16e-05 -0.000370 -0.000167
logarea           1.27e-01 1.20e-01    1.23e-01 -0.114000  0.369000
swamp             3.20e-02 6.13e-03    6.45e-03  0.019400  0.044700
thtden           -7.02e-02 5.35e-02    5.49e-02 -0.178000  0.037500
logarea:swamp     4.73e-05 5.53e-04    5.84e-04 -0.001100  0.001190
logarea:thtden    2.23e-02 2.52e-02    2.58e-02 -0.028300  0.072900
swamp:thtden     -3.49e-05 1.46e-04    1.52e-04 -0.000333  0.000263

Relative variable importance:
I(swamp^2) logarea swamp thtden logarea:thtden cpfor2 swamp:thtden logarea:swamp
      1.00    1.00  1.00   1.00           0.57   0.16         0.14          0.10
What is the best approach?

Stepwise is the most used
Be very critical of p-values (not corrected for multiple tests)
Beware of multicollinearity
The best model may not be found
Forward and backward solutions can differ

AIC and the Information Theoretic Approach are increasingly used
R does it easily
Explicit recognition that several models are equivalent
Averaged models proposed as a defence against multicollinearity and the
resulting ambiguities
What is the partial regression coefficient?

β_j measures the amount by which Y changes when X_j is increased by one
unit and all other independent variables are held constant.

[Figure: Y vs. X1 at several fixed values of X2 (X2 = -3, -1, 1, 3),
contrasting the simple regression slope with the partial regression slope]
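One way to see what "holding the other variables constant" means is the
residual-on-residual (added-variable) construction; a sketch with a
hypothetical data frame dat containing y, x1 and x2:

# partial coefficient of x1 from the full multiple regression
b.partial <- coef(lm(y ~ x1 + x2, data = dat))["x1"]

# the same number, obtained by first removing the effect of x2 from both y and x1
ry <- residuals(lm(y  ~ x2, data = dat))   # y adjusted for x2
rx <- residuals(lm(x1 ~ x2, data = dat))   # x1 adjusted for x2
coef(lm(ry ~ rx))["rx"]                    # equals b.partial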
https://stats.stackexchange.com/questions/78828/is-there-a-difference-between-controlling-for-and-ignoring-other-variables-i/78830#78830
How to check the normality of data in multiple regression?

I don't understand why independent variables are always considered fixed.
Why not random?

I was a bit confused on the difference between when two variables (i.e. X1
and X2) have an interaction versus when two variables are correlated.
Would it be possible to get an example of both?

Could we go through an example where we transform the data to their
z-scores to observe the relative strength? I understand the theory of this
but I'm not sure exactly how it's done.

Why do we have to bin continuous variables to visualize their
interactions? Can we use continuous variables directly?

Can you explain "binning" again or provide an example of when we would
"bin" a variable? Also, how would we recognize variations within bins, as
slide 2 mentioned?
I did not fully understand the definition of a parameter in the context of
multiple regression, and how these relate to the X variables.

Parameters and variables are NOT interchangeable!

Linear regression:

Y_i = β_0 + β_1 X_i + ε_i

Y_i is the dependent variable, X_i the independent variable, and ε_i the
unexplained variation; β_0 and β_1 are the parameters.
If you know, for two dependent variables, that one is determined by the
other, does it make sense to omit it in a multiple regression?

I still do not understand what Cook's distance is and how to interpret the
graphs of residuals vs leverage.

How is tolerance useful if we can gain the same information from VIF?

I don't think I really understand the difference between interactions and
multicollinearity (in terms of how to identify them).

Are we required to memorize the degrees of freedom formulae that would be
used to calculate an F ratio, or just know that they are different for
random or fixed effects?

In lecture 16, slide 43: I don't understand "sensitivity of parameter
estimates to small changes in data (multicollinearity)".

Table 37.3 (pg. 353) mentions that repeated measures (multiple observations
of the same subject at different times) are a violation of independence in
multiple regression. Would the appropriate test(s) be a paired t-test, or a
two-/multi-way ANOVA?
