Professional Documents
Culture Documents
Collinearity in regression
(the dreaded disease, and how to live with it)
Psychology 6140
15-30
watch out
>30
Trouble!
>100
DISASTER
2
What is collinearity?
Collinearity: Practicalities
Consequences:
( x1 x1 ) u ( x2 x2 )
3
Y value
fitted value
But: same
predicted values!
data points in x1,x2 plane
Measuring collinearity
Sampling variances: s2(b) = MSE (XX)-1
For 2 predictors:
s 2 (b1 )
MSE
1
u
2
2
(n 1)sx1 1 r12
1
More generally:
We can see this in that the
plane is not well supported.
So, a small change in the data
can make a large change in the
coefficients
s (bi )
r12 = .90
MSE
1
u
(n 1)sx2i 1 Ri2|others
Std. err. when R2=0
1
1 R | others
2
i
VIFi
VIFi
1
diag (Rxx
)i
Ri2
VIF
1.0
1.0
.2
.04
1.04
1.02
.4
.16
1.19
1.09
.6
.36
1.56
1.25
.8
.64
2.78
1.67
.9
.81
5.26
2.29
.95
.903
10.3
3.21
.99
.980
50.3
7.09
1.0
1.0
Collinearity diagnostics
VIF
Worry!
Worry!
CNi
15-30
watch out
>30
Trouble!
>100
DISASTER
X ' X
V diag(Oi ) V ' o
X ' X
1
V diag(
Oi
) V'
10
Collinearity diagnostics
%include data(cars2);
proc reg data=cars2;
model mpg = weight year engine horse accel cylinder;
run;
Standard output:
E.g., proc
Omax
Oi
Note: SAS (SPSS?) also has a less useful COLLIN that does not
adjust for the intercept.
Source
Analysis of Variance
Sum of
DF
Squares
Model
Error
Corrected Total
6
384
390
Root MSE
Variable
11
Intercept
Weight
Year
Engine
Horse
Accel
Cylinder
DF
1
1
1
1
1
1
1
19054
4523.41
23577
3.43216
Mean
Square
3175.66762
11.77970
R-Square
Parameter Estimates
Parameter
Standard
Estimate
Error
-14.63175
-0.00678
0.76205
0.00848
-0.00290
0.06121
-0.34602
4.88451
0.00067704
0.05292
0.00747
0.01411
0.10366
0.33313
F Value
Pr > F
269.59
<.0001
0.8081
t Value
Pr > |t|
-3.00
-10.02
14.40
1.13
-0.21
0.59
-1.04
0.0029
<.0001
<.0001
0.2572
0.8375
0.5552
0.2996
12
COLLINOINT Output:
Collinearity Diagnostics (intercept adjusted)
VIF Output:
Variable
Parameter Estimates
Parameter
Standard
Estimate
Error
t Value
DF
Intercept
Weight
Year
Engine
Horse
Accel
Cylinder
1
1
1
1
1
1
1
-14.63175
-0.00678
0.76205
0.00848
-0.00290
0.06121
-0.34602
4.88451
0.00067704
0.05292
0.00747
0.01411
0.10366
0.33313
-3.00
-10.02
14.40
1.13
-0.21
0.59
-1.04
Pr > |t|
Variance
Inflation
0.0029
<.0001
<.0001
0.2572
0.8375
0.5552
0.2996
0
10.85718
1.25307
20.23415
9.66219
2.70928
10.65789
Eigenvalue
Engine
Horse
Accel
4.25623
1.00000
0.0043
0.0097
0.0026
0.0052
0.0092
0.0046
0.83541
2.25716
0.0054
0.8562
0.0011
0.00004
0.0040
0.0030
0.68081
2.50034
0.0128
0.0536
0.0018
0.0024
0.4240
0.0052
0.13222
5.67358
0.0882
0.0058
0.0115
0.2917
0.0614
0.3172
0.05987
8.43157
0.7111
0.0688
0.00006
0.6602
0.4918
0.1110
0.03545
10.95701
0.1783
0.0059
0.983
0.0404
0.0096
0.5591
Look at large
CN rows
4 of 6 predictors have dangerously high VIFs!
How many near singularities?
which predictors involved in each?
Proportion of Variation
Number
Condition
Index
(others dont
matter)
Weight
Year
Cylinder
13
Weight
Year
Engine
Horse
Accel
Cylinder
#6
10.96
18
98
56
8.43
71
66
49
11
5.67
29
32
2.5
42
2.26
86
#5
#4
14
#3
#2
#1
16
17
VIF Output:
Variable
DF
Parameter
Estimate
Intercept
x1
x2
x3
x1x2
x1x1
1
1
1
1
1
1
390.53822
-0.77676
10.17351
-121.62608
-0.00805
0.00039831
Parameter Estimates
Standard
Error
t Value
211.52287
0.32448
0.94301
69.01749
0.00077209
0.00012528
18
Pr > |t|
Variance
Inflation
0.0946
0.0377
<.0001
0.1085
<.0001
0.0098
0
7682.37019
320.02156
53.52457
344.54471
6643.31989
1.85
-2.39
10.79
-1.76
-10.43
3.18
COLLINOINT Output:
high correlations
Divide by (adjust for): population, GNP, years in major leagues
per capita measures, etc.
Number
Eigenvalue
Condition
Index
x1
x2
x3
x1x2
x1x1
3.3204
1.000
0.0000103
0.0000867
0.0014
0.0001125
0.0000118
1.6176
1.433
0.0000061
0.0008279
0.0007648
0.0006342
0.0000071
0.0603
7.420
0.0002676
0.0001027
0.2085
0.0001889
0.0004914
0.0015
47.158
0.0003061
0.99890
0.0125
0.9990
0.0004257
0.0000696
218.335
0.9994
0.0000218
0.7767
0.00001123
0.9991
x1
x2
19
z1
z2
x1 x2
x1 x2
20
VIF Output:
Variable
Intercept
x1
x2
x3
x1x2
x1x1
DF
Parameter
Estimate
1
1
1
1
1
1
39.35299
0.08890
0.40706
-121.62608
-0.00805
0.00039831
Parameter Estimates
Standard
Error
t Value
2.16281
0.02431
0.05459
69.01749
0.00077209
0.00012528
18.20
3.66
7.46
-1.76
-10.43
3.18
COLLINOINT Output:
Collinearity Diagnostics (intercept adjusted)
Pr > |t|
Variance
Inflation
<.0001
0.0044
<.0001
0.1085
<.0001
0.0098
0
43.11271
1.07248
53.52457
1.09087
4.68010
Proportion of Variation
Number
Eigenvalue
Condition
Index
2.3699
1.0000
0.00330
1.0751
1.4847
0.00064
0.2678
0.000087
0.4871
0.017
0.8544
1.6654
0.0043
0.6113
0.0014
0.0949
0.032
0.6905
1.8526
0.0016
0.0974
0.0000015
0.3921
0.180
0.0100
15.3568
0.9902
0.000322
0.9954
0.0086
0.755
x1
x2
x3
x1x2
x1x1
0.0232
0.0031
0.0174
0.015
involving x1 and x3
21
22
Remedies
Statistical remedies
Transform X1 Xp to principal components, PC1 PCp
23
24
We should have known that runpulse and maxpulse would be highly correlated
%include data(fitnessd);
proc reg data=fitness;
model oxy = age weight runtime rstpulse runpulse maxpulse /
/ vif collinoint;
Parameter Estimates
Variable
Intercept
age
weight
runtime
rstpulse
runpulse
maxpulse
DF
Parameter
Estimate
Standard
Error
t Value
Pr > |t|
Variance
Inflation
1
1
1
1
1
1
1
102.93448
-0.22697
-0.07418
-2.62865
-0.02153
-0.36963
0.30322
12.40326
0.09984
0.05459
0.38456
0.06605
0.11985
0.13650
8.30
-2.27
-1.36
-6.84
-0.33
-3.08
2.22
<.0001
0.0322
0.1869
<.0001
0.7473
0.0051
0.0360
0
1.51284
1.15533
1.59087
1.41559
8.43727
8.74385
Parameter Estimates
Variable
Intercept
age
weight
runtime
rstpulse
pulse
pdiff
DF
Parameter
Estimate
Standard
Error
t Value
Pr > |t|
Variance
Inflation
1
1
1
1
1
1
1
102.93448
-0.22697
-0.07418
-2.62865
-0.02153
-0.03321
0.33642
12.40326
0.09984
0.05459
0.38456
0.06605
0.02780
0.12540
8.30
-2.27
-1.36
-6.84
-0.33
-1.19
2.68
<.0001
0.0322
0.1869
<.0001
0.7473
0.2439
0.0130
0
1.51284
1.15533
1.59087
1.41559
1.57086
1.26394
25
26
Statistical remedies
Ridge regression: purposely biased estimation
DF
Parameter
Estimate
Standard
Error
t Value
Pr > |t|
Variance
Inflation
1
1
1
1
1
1
47.37581
-1.41517
-3.32426
-1.15396
-1.25553
1.36099
0.45988
0.29133
0.40570
0.48604
0.54226
0.76992
103.02
-4.86
-8.19
-2.37
-2.32
1.77
<.0001
<.0001
<.0001
0.0256
0.0291
0.0893
0
1.00000
1.00000
1.00000
1.00000
1.00000
where Gk
[I k ( XT X )1 ]1
28
Plot of VIF values vs. k for raw variables, just to illustrate how ridge regression
decreases the effects of collinearity.
Even very small values of k are effective here.
29
30
31
32
Summary
Collinearity is a data problem
33