Linear regression

The linear regression model assumes

$$E[Y \mid X] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p \tag{1}$$

It can outperform non-linear methods when one has
- a small number of training examples,
- a low signal-to-noise ratio, or
- sparse data.
$$f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$

How to estimate $\beta$:
- Training data: $(x_1, y_1), \ldots, (x_n, y_n)$ with each $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$.
- Estimate parameters: choose the $\beta$ which minimizes the residual sum-of-squares

$$\mathrm{RSS}(\beta) = \sum_{i=1}^n \big(y_i - f(x_i)\big)^2 = \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2$$
[Figure 3.1 (ESL): Linear least squares fitting with $X \in \mathbb{R}^2$. We seek the linear function of $X$ that minimizes the sum of squared residuals from $Y$.]
Minimizing RSS($\beta$)

Re-write

$$\mathrm{RSS}(\beta) = \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 = (y - X\beta)^t (y - X\beta)$$

where $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^t$, $y = (y_1, \ldots, y_n)^t$ and

$$X = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1p} \\
1 & x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}$$
Minimizing RSS($\beta$)

Want to find the $\hat\beta$ which minimizes

$$\mathrm{RSS}(\beta) = (y - X\beta)^t (y - X\beta)$$

Differentiate RSS($\beta$) w.r.t. $\beta$ to obtain

$$\frac{\partial \mathrm{RSS}}{\partial \beta} = -2 X^t (y - X\beta)$$

Setting the derivative to zero gives the normal equations

$$X^t (y - X\hat\beta) = 0 \quad\Longrightarrow\quad \hat\beta = (X^t X)^{-1} X^t y$$
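To make the normal equations concrete, here is a minimal numpy sketch (simulated data; all names and numbers are illustrative, not from the slides). It computes $\hat\beta$ both by solving $X^tX\beta = X^ty$ directly and via `lstsq`, which is the numerically preferred route:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # prepend intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.6 * rng.normal(size=n)

# Textbook form: solve the normal equations X^t X beta = X^t y.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferred: least squares via QR/SVD; same answer when X^t X is invertible.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(beta_normal, beta_lstsq)
```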
Estimates using $\hat\beta$

Given an input $x_0$ this model predicts its output as

$$\hat y_0 = (1, x_0^t)\,\hat\beta$$

The fitted values at the training inputs are

$$\hat y = X\hat\beta = X(X^t X)^{-1} X^t y = H\, y$$

where $H = X(X^t X)^{-1} X^t$ is the hat matrix.
Geometric interpretation of the least squares estimate

Let $X$ be the input data matrix and $y$ the vector of outputs. The fitted values at the training inputs are

$$\hat y = X\hat\beta = X(X^T X)^{-1} X^T y \tag{3.7}$$

[Figure 3.2 (ESL): The $N$-dimensional geometry of least squares regression with two predictors. The outcome vector $y$ is orthogonally projected onto the hyperplane spanned by the input vectors $x_{.1}$ and $x_{.2}$. The projection $\hat y$ represents the vector of the least squares predictions, $\hat y_i = \hat f(x_i)$.]

The matrix $H = X(X^T X)^{-1} X^T$ appearing in equation (3.7) is sometimes called the hat matrix because it puts the hat on $y$. As Figure 3.2 shows, the hat matrix $H$ computes the orthogonal projection of $y$ onto the subspace spanned by the columns of $X$.
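The projection interpretation is easy to verify numerically. A short sketch (simulated data, illustrative names): the hat matrix is symmetric and idempotent, and the residual is orthogonal to every column of $X$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
y = rng.normal(size=40)

H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix H = X (X^t X)^{-1} X^t
y_hat = H @ y                             # fitted values: projection of y

assert np.allclose(H @ H, H)              # idempotent: projecting twice changes nothing
assert np.allclose(H, H.T)                # symmetric: an orthogonal, not oblique, projection
assert np.allclose(X.T @ (y - y_hat), 0)  # residual is orthogonal to every column of X
```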
An example

[Figure: a landmark-detection example. Input: a face image; output: the position $(x_{14}, y_{14})$ of landmark 14. The estimated landmark on a novel image is not even in the image: plain least squares fails miserably. What is the problem?]
Singular $X^t X$

The closed form requires inverting $X^t X$, which is singular whenever $p + 1 > n$, i.e. when there are more parameters than training examples.
What can we say about the distribution of $\hat\beta$?

Analysis of the distribution of $\hat\beta$ requires making some assumptions. These are
- the observations $y_i$ are uncorrelated, and
- the $y_i$ have constant variance $\sigma^2$.

An unbiased estimate of $\sigma^2$ is

$$\hat\sigma^2 = \frac{1}{n - p - 1} \sum_{i=1}^n (y_i - \hat y_i)^2$$
Further assume the model is correct, that is

$$Y = E(Y \mid X_1, X_2, \ldots, X_p) + \varepsilon = \beta_0 + \sum_{j=1}^p X_j \beta_j + \varepsilon$$

where $\varepsilon \sim N(0, \sigma^2)$. Then it is easy to show that (assuming non-random $x_i$)

$$\hat\beta \sim N\big(\beta, (X^t X)^{-1}\sigma^2\big)$$
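This sampling-distribution claim can be checked by simulation. A minimal sketch (the design is held fixed across trials; all numbers are illustrative): draw many training responses, refit, and compare the empirical covariance of $\hat\beta$ with $(X^tX)^{-1}\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 40, 0.6
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 1.0])

betas = []
for _ in range(5000):
    y = X @ beta + sigma * rng.normal(size=n)           # new noise, same X
    betas.append(np.linalg.lstsq(X, y, rcond=None)[0])
betas = np.array(betas)

predicted_cov = np.linalg.inv(X.T @ X) * sigma**2
print(np.cov(betas.T))   # empirical covariance of beta_hat over trials
print(predicted_cov)     # should match closely
```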
[Figure: sampling variability of the least squares line. $T$ is a training set $\{(x_i, y_i)\}_{i=1}^n$ drawn from the model with $\beta = (1, 1)^t$, $n = 40$, $\sigma = .6$; in this simulation the $x_i$'s differ across trials. Left panels: the fitted lines $\hat\beta_0 + \hat\beta_1 x$ for training sets $T_1, \ldots, T_{n_{trial}}$ scatter around the true line $\beta_0 + \beta_1 x$. Right panel: the distribution of $\hat\beta_1$ concentrates around the true value 1.]
Under this model we can test whether a particular coefficient is zero. Form the standardized coefficient (Z-score)

$$z_j = \frac{\hat\beta_j}{\hat\sigma\sqrt{v_{jj}}}$$

where $v_{jj}$ is the $j$th diagonal element of $(X^t X)^{-1}$; if $\beta_j = 0$ then $z_j$ has a t-distribution with $n - p - 1$ degrees of freedom.
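A short sketch of the test in numpy/scipy (simulated data; one true coefficient is zero by construction). It computes $\hat\sigma^2$, the Z-scores, and two-sided p-values from the t-distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + 0.6 * rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p - 1)      # unbiased estimate of sigma^2
v = np.diag(np.linalg.inv(X.T @ X))           # the v_jj
z = beta_hat / np.sqrt(sigma2_hat * v)        # Z-scores
p_values = 2 * stats.t.sf(np.abs(z), df=n - p - 1)
print(np.round(z, 2), np.round(p_values, 4))  # the j = 2 coefficient is insignificant
```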
Is $\beta_1$ zero?

[Figure: histogram of Z-scores for the $\hat\beta_1$'s across many simulated training sets, overlaid with the density $p(\text{Z-score} \mid \beta_1 = 0)$, a t-distribution. When the true $\beta_1$ is nonzero the observed Z-scores lie far out in the tail of this null distribution; when it is zero they match it.]
(Groups of coefficients can also be tested simultaneously, e.g. with an F-statistic.)
Gauss-Markov Theorem

A famous result in statistics. Consider estimating $\theta = a^t\beta$ with the least squares estimate

$$\hat\theta = a^t\hat\beta^{ls} = a^t(X^t X)^{-1} X^t y$$

- If $X$ is fixed this is a linear function, $c_0^t y$, of the response vector $y$.
- If we assume $E[y] = X\beta$ then $a^t\hat\beta^{ls}$ is unbiased.
- For any other linear unbiased estimator $c^t y$ of $\theta$:

$$\mathrm{Var}[a^t\hat\beta^{ls}] \le \mathrm{Var}[c^t y]$$

We have only stated the result for the estimation of one linear combination $a^t\beta$.
Consider the mean squared error of an estimator $\tilde\theta$ in estimating $\theta$:

$$\mathrm{MSE}(\tilde\theta) = E\big((\tilde\theta - \theta)^2\big) = \underbrace{\mathrm{Var}(\tilde\theta)}_{\text{variance}} + \underbrace{\big(E(\tilde\theta) - \theta\big)^2}_{\text{squared bias}}$$

The Gauss-Markov theorem says least squares has the smallest MSE among linear unbiased estimators. A biased estimator, however, may trade a little bias for a large reduction in variance.
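A minimal simulation of this trade-off (assumed setup: univariate regression through the origin; the shrinkage factor 0.8 is illustrative). Shrinking the unbiased least squares estimate introduces bias but reduces variance, and in a noisy, small-$n$ setting the MSE can go down:

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma, beta = 20, 1.0, 0.5
x = rng.normal(size=n)

def mse(c, trials=20000):
    """Monte Carlo MSE of the shrunken estimator c * beta_ls."""
    err = []
    for _ in range(trials):
        y = beta * x + sigma * rng.normal(size=n)
        beta_ls = (x @ y) / (x @ x)          # univariate least squares
        err.append((c * beta_ls - beta) ** 2)
    return np.mean(err)

print(mse(1.0))   # least squares (unbiased)
print(mse(0.8))   # shrunken: biased, but smaller MSE here
```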
Univariate regression: suppose we have a single input and no intercept,

$$Y = X\beta + \varepsilon$$

The least squares estimate is

$$\hat\beta = \frac{\langle x, y\rangle}{\langle x, x\rangle}$$

where $x = (x_1, x_2, \ldots, x_n)^t$ and $y = (y_1, y_2, \ldots, y_n)^t$. The residuals are given by

$$r = y - x\hat\beta$$

Now say $x_i \in \mathbb{R}^p$ and the columns $x_{.j}$ of $X$ are orthogonal. Then each coefficient is a univariate fit:

$$\hat\beta_j = \frac{\langle x_{.j}, y\rangle}{\langle x_{.j}, x_{.j}\rangle}$$
When the columns are not orthogonal we can apply the previous insight by orthogonalizing first. With an intercept column $x_{.0} = \mathbf{1}$ and one input $x_{.1}$: regress $x_{.1}$ on $x_{.0}$ to get $\hat\gamma = \langle x_{.0}, x_{.1}\rangle / \langle x_{.0}, x_{.0}\rangle$ and the residual $z = x_{.1} - \hat\gamma x_{.0}$; then regress $y$ on $z$ to get $\hat\beta_1 = \langle z, y\rangle / \langle z, z\rangle$. Then $\hat\beta_1 z = \hat\beta_1(x_{.1} - \hat\gamma x_{.0}) = X\beta$ where $\beta = (-\hat\gamma\hat\beta_1, \hat\beta_1)^t$.
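A sketch of this successive orthogonalization for intercept plus one input (simulated data). Regressing $x$ on the constant column just centres it, and the univariate fit on the residual reproduces the multiple-regression slope:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
x = rng.normal(loc=3.0, size=n)
y = 1.0 + 2.0 * x + 0.5 * rng.normal(size=n)

ones = np.ones(n)
gamma = (ones @ x) / (ones @ ones)   # regress x on 1: gamma is just the mean of x
z = x - gamma * ones                 # residual: the centred x
beta1 = (z @ y) / (z @ z)            # univariate fit of y on z
beta0 = y.mean() - beta1 * gamma     # recover the intercept

# Same answer as the multiple regression of y on (1, x).
ref = np.linalg.lstsq(np.column_stack([ones, x]), y, rcond=None)[0]
assert np.allclose([beta0, beta1], ref)
```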
Subset Selection

Interpretation: with many predictors we often want a smaller subset exhibiting the strongest effects; a model that retains only a subset of the predictors is easier to interpret.
Best-Subset Selection

Best subset regression finds, for each $k \in \{0, 1, 2, \ldots, p\}$, the subset of $k$ predictors $\{j_1, \ldots, j_k\}$ for which

$$\min_{\beta_0, \beta_{j_1}, \ldots, \beta_{j_k}} \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{l=1}^k \beta_{j_l} x_{i,j_l}\Big)^2$$

is smallest.

- There are $\binom{p}{k}$ different subsets to try for a given $k$.
- If $p \lesssim 40$ there exist computationally feasible algorithms (leaps and bounds).
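For small $p$, the exhaustive search is a few lines of numpy (a sketch, not the leaps-and-bounds algorithm; data and names are illustrative):

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, k):
    """Return (RSS, subset) for the size-k subset with smallest RSS (intercept always kept)."""
    n, p = X.shape
    best = (np.inf, None)
    for subset in combinations(range(p), k):
        Xs = np.column_stack([np.ones(n), X[:, subset]])
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ beta) ** 2)
        if rss < best[0]:
            best = (rss, subset)
    return best

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 8))
y = 2 * X[:, 0] - 1.5 * X[:, 3] + 0.5 * rng.normal(size=50)
print(best_subset(X, y, k=2))   # should pick columns (0, 3)
```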
Forward-Stepwise Selection

Instead of searching all possible subsets (infeasible for large $p$), greedily add one predictor at a time. Let $I$ be the set of indices not yet selected. At step $l$ choose

$$j_l = \arg\min_{j \in I} \; \min_{\beta_0, \beta_{j_1}, \ldots, \beta_{j_{l-1}}, \beta_j} \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{s=1}^{l-1} \beta_{j_s} x_{ij_s} - \beta_j x_{ij}\Big)^2$$

and update $I = I \setminus \{j_l\}$.

Forward-stepwise selection is a greedy approximation to best subset selection; see the sketch below.
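A minimal implementation of the greedy step (same simulated data style as above; names are illustrative):

```python
import numpy as np

def forward_stepwise(X, y, k):
    """Greedily select k predictors, refitting the full model at each step."""
    n, p = X.shape
    selected, remaining = [], set(range(p))
    for _ in range(k):
        def rss_with(j):
            Xs = np.column_stack([np.ones(n)] + [X[:, s] for s in selected + [j]])
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            return np.sum((y - Xs @ beta) ** 2)
        j_best = min(remaining, key=rss_with)   # predictor that most reduces RSS
        selected.append(j_best)
        remaining.discard(j_best)
    return selected

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 8))
y = 2 * X[:, 0] - 1.5 * X[:, 3] + 0.5 * rng.normal(size=50)
print(forward_stepwise(X, y, k=2))   # greedy choice, typically [0, 3] here
```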
Forward-Stagewise Regression

The steps of Forward-Stagewise Regression are:
- Set $\beta_0 = \frac{1}{n}\sum_{i=1}^n y_i$.
- Set $\beta_1 = \beta_2 = \cdots = \beta_p = 0$.
- At each iteration compute the residuals

  $$r_i = y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij},$$

  find the predictor $x_{.j}$ most correlated with $r$, and take a small step

  $$\beta_j \leftarrow \beta_j + \epsilon\,\mathrm{sign}(\langle x_{.j}, r\rangle)$$

- Stop iterating when the residuals are uncorrelated with all the predictors.

This slow, cautious fitting can be useful in high-dimensional problems; a sketch follows.
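A compact numpy sketch of the procedure (step size, iteration budget, and stopping tolerance are illustrative choices; with a finite step the residual correlations only shrink to roughly the step's scale):

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_iter=5000, tol=0.5):
    n, p = X.shape
    beta0 = y.mean()
    beta = np.zeros(p)
    for _ in range(n_iter):
        r = y - beta0 - X @ beta             # current residuals
        corr = X.T @ r                       # inner products <x_.j, r>
        j = np.argmax(np.abs(corr))          # predictor most correlated with r
        if np.abs(corr[j]) < tol:            # residuals ~ uncorrelated with all x_.j
            break
        beta[j] += eps * np.sign(corr[j])    # tiny update in the winning coordinate
    return beta0, beta

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 5))
y = 1.0 + 2 * X[:, 0] - X[:, 2] + 0.3 * rng.normal(size=100)
print(forward_stagewise(X, y))
```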
Shrinkage methods

Subset selection is a discrete process (variables are either in or out), so it often exhibits high variance. Shrinkage methods are more continuous and do not suffer as much from high variance.
Ridge Regression

Ridge regression shrinks the coefficients by penalizing their size:

$$\hat\beta^{ridge} = \arg\min_\beta \underbrace{\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2}_{\text{residual sum-of-squares}} + \underbrace{\lambda \sum_{j=1}^p \beta_j^2}_{\text{regularizer}}$$
An equivalent formulation is the constrained problem

$$\min_\beta \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p \beta_j^2 \le t$$

There is a one-to-one correspondence between the $\lambda$'s and $t$'s in the two formulations.
Note the intercept is not penalized. If we centre each input,

$$\tilde x_{ij} = x_{ij} - \bar x_j, \qquad \bar x_j = \frac{1}{n}\sum_{s=1}^n x_{sj},$$

the coefficients $\beta^c$ minimizing

$$\sum_{i=1}^n \Big(y_i - \beta_0^c - \sum_{j=1}^p \tilde x_{ij}\beta_j^c\Big)^2 + \lambda \sum_{j=1}^p (\beta_j^c)^2$$

are related to the coefficients found using the original data via

$$\beta_0^c = \frac{1}{n}\sum_{i=1}^n y_i = \bar y, \qquad \beta_j^c = \hat\beta_j^{ridge} \;\; \text{for } j = 1, \ldots, p,$$

$$\hat\beta_0^{ridge} = \bar y - \sum_{j=1}^p \bar x_j\, \hat\beta_j^{ridge}$$
= arg min
n
X
i=1
(yi
p
X
xij j )2 +
j=1
p
X
j=1
(y X)t (y X) + t
x11
x21
X= .
..
xn1
x12
x22
..
.
xn2
x1p
x2p
..
.
xnp
j2
data is centred.
= arg min
n
X
i=1
(yi
p
X
xij j )2 +
j=1
p
X
j=1
(y X)t (y X) + t
x11
x21
X= .
..
xn1
x12
x22
..
.
xn2
x1p
x2p
..
.
xnp
j2
data is centred.
= arg min
n
X
i=1
(yi
p
X
xij j )2 +
j=1
p
X
j=1
(y X)t (y X) + t
x11
x21
X= .
..
xn1
x12
x22
..
.
xn2
x1p
x2p
..
.
xnp
j2
Setting the derivative to zero gives the closed-form solution

$$\hat\beta^{ridge} = (X^t X + \lambda I_p)^{-1} X^t y$$

Note that the problem of inverting the potentially singular matrix $X^t X$ disappears: $X^t X + \lambda I_p$ is invertible for any $\lambda > 0$.
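A sketch of the closed form, including the centring bookkeeping from the previous slide (simulated data with $p > n$, exactly the regime where plain least squares breaks down):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression via the closed form on centred data."""
    Xc = X - X.mean(axis=0)                   # centre inputs
    yc = y - y.mean()                         # centre response
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y.mean() - X.mean(axis=0) @ beta  # recover the intercept
    return beta0, beta

rng = np.random.default_rng(9)
X = rng.normal(size=(30, 60))                 # p > n: X^t X is singular ...
y = X[:, 0] + 0.1 * rng.normal(size=30)
print(ridge(X, y, lam=1.0)[1][:3])            # ... but ridge is still well defined
```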
Ridge via the singular value decomposition: write

$$X = U D V^t$$

where
- $U$ is an $n \times p$ orthogonal matrix,
- $V$ is a $p \times p$ orthogonal matrix,
- $D$ is a $p \times p$ diagonal matrix with $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$.

Then

$$X\hat\beta^{ridge} = X(X^t X + \lambda I_p)^{-1} X^t y = U D (D^2 + \lambda I_p)^{-1} D\, U^t y = \sum_{j=1}^p u_j\, \frac{d_j^2}{d_j^2 + \lambda}\, u_j^t y$$

As $\lambda \ge 0$ we have $0 \le d_j^2/(d_j^2 + \lambda) \le 1$. Ridge regression computes the coordinates of $y$ with respect to the orthonormal basis $U$ and then shrinks them, shrinking the directions with small $d_j$ the most.
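The SVD identity is easy to verify numerically (a sketch; data and $\lambda$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(10)
X = rng.normal(size=(50, 5))
X -= X.mean(axis=0)                                 # centred inputs
y = rng.normal(size=50)
lam = 2.0

U, d, Vt = np.linalg.svd(X, full_matrices=False)    # thin SVD: X = U D V^t
fit_svd = U @ ((d**2 / (d**2 + lam)) * (U.T @ y))   # shrink each coordinate u_j^t y
fit_direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
assert np.allclose(fit_svd, fit_direct)
print(d**2 / (d**2 + lam))   # shrinkage factors: smallest for directions with small d_j
```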
Connection to principal components: the sample covariance matrix of the centred data is

$$S = \frac{1}{n} X^t X$$

so the eigen-decomposition of $X^t X$ (with eigenvectors $v_j$, the columns of $V$) gives the principal component directions of $X$. The first principal component $z^{(1)} = X v_1$ has the largest sample variance among normalized linear combinations of the columns of $X$:

$$\frac{1}{n}\sum_{i=1}^n \big(z_i^{(1)}\big)^2 = \frac{1}{n}\sum_{i=1}^n v_1^t x_i x_i^t v_1 = \frac{1}{n}\, v_1^t X^t X v_1 = \frac{d_1^2}{n}$$
[Figure 3.9 (ESL): Principal components of some input data points, plotted in the $(X_1, X_2)$ plane. The largest principal component is the direction that maximizes the variance of the projected data, and the smallest principal component minimizes that variance. Subsequent principal components $z^{(j)}$ have maximum variance subject to being orthogonal to the earlier ones.]

Ridge regression therefore shrinks the fit most along the low-variance principal component directions of $X$.
The effective degrees of freedom of the ridge fit are

$$df_{ridge}(\lambda) = \sum_{j=1}^p \frac{d_j^2}{d_j^2 + \lambda}$$
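A two-line sketch of this formula (random design; the $\lambda$ grid is illustrative), showing $df$ falling from $p$ at $\lambda = 0$ toward 0:

```python
import numpy as np

def df_ridge(d, lam):
    """Effective degrees of freedom from the singular values d of X."""
    return np.sum(d**2 / (d**2 + lam))

d = np.linalg.svd(np.random.default_rng(11).normal(size=(50, 8)), compute_uv=False)
for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, df_ridge(d, lam))   # equals p = 8 at lam = 0, decreasing as lam grows
```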
[Figure: the landmark example fit with ridge regression, showing the estimated landmark 14 on a novel image for $\lambda = 10^3$ (df $\approx$ 97) and $\lambda = 10^4$ (df $\approx$ 29).]
This is an example from the book. [Figure 3.8 (ESL): Profiles of ridge coefficients for the prostate cancer example (predictors lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45), as the tuning parameter $\lambda$ is varied. Coefficients are plotted versus $df(\lambda)$, the effective degrees of freedom. A vertical line is drawn at $df = 5.0$, the value chosen by cross-validation.] Notice how the weights associated with each predictor vary with $\lambda$.
The Lasso

The lasso solves the constrained problem

$$\min_\beta \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \le t$$

or equivalently, in Lagrangian form,

$$\min_\beta \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^p |\beta_j|$$

The $L_1$ penalty makes the solution non-linear in $y$ and there is no closed-form solution: it is a quadratic programming problem.

Making $t$ sufficiently small will cause some of the $\beta_j$'s to be exactly 0. Conversely, if $t$ exceeds $\sum_{j=1}^p |\hat\beta_j^{ls}|$ the lasso estimates coincide with the least squares estimates.
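Although there is no closed form, coordinate descent solves the lasso efficiently. A minimal sketch for centred data (it minimizes $\frac{1}{2}\lVert y - X\beta\rVert^2 + \lambda\lVert\beta\rVert_1$; note the $\frac{1}{2}$ convention, and all data and $\lambda$ are illustrative):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1/2)||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding x_.j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / (X[:, j] @ X[:, j])
    return beta

rng = np.random.default_rng(12)
X = rng.normal(size=(100, 10)); X -= X.mean(axis=0)
y = 3 * X[:, 0] - 2 * X[:, 5] + 0.5 * rng.normal(size=100); y -= y.mean()
print(np.round(lasso_cd(X, y, lam=20.0), 2))   # most coefficients exactly zero
```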
[Figure: the landmark example fit with the lasso, showing the estimated landmark 14 on a novel image for $\lambda = .04$ (df $\approx$ 93), $\lambda = .1$ (df $\approx$ 53), $\lambda = .5$ (df $\approx$ 7) and $\lambda = .7$ (df $\approx$ 2).]
When the columns of $X$ are orthonormal, all three procedures have explicit solutions in terms of the least squares estimates $\hat\beta_j^{ls}$:

Estimator              | Formula
Best subset (size $M$) | $\hat\beta_j^{ls} \cdot \mathrm{Ind}\big(|\hat\beta_j^{ls}| \ge |\hat\beta_{(M)}^{ls}|\big)$
Ridge                  | $\hat\beta_j^{ls} / (1 + \lambda)$
Lasso                  | $\mathrm{sign}(\hat\beta_j^{ls})\,\big(|\hat\beta_j^{ls}| - \lambda\big)_+$

where $\hat\beta_{(M)}^{ls}$ is the $M$th largest coefficient in absolute value. Best subset keeps or kills coefficients, ridge shrinks them proportionally, and the lasso soft-thresholds them.
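A sketch of the table on a synthetic orthonormal design (built with a QR factorization; $\lambda$ and the coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(13)
Q, _ = np.linalg.qr(rng.normal(size=(100, 6)))   # orthonormal design: Q^t Q = I
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0, 0.0])
y = Q @ beta_true + 0.1 * rng.normal(size=100)

beta_ls = Q.T @ y                                # least squares when Q^t Q = I
lam = 0.5
beta_ridge = beta_ls / (1 + lam)                 # uniform proportional shrinkage
beta_lasso = np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0)  # soft threshold

# Ridge matches its general closed form on this design:
assert np.allclose(beta_ridge, np.linalg.solve(Q.T @ Q + lam * np.eye(6), Q.T @ y))
print(np.round(beta_lasso, 2))                   # small coefficients set exactly to 0
```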
When $X$ does not have orthonormal columns we can compare the methods geometrically. The constraint regions are $|\beta_1| + |\beta_2| \le t$ for the lasso and $\beta_1^2 + \beta_2^2 \le t^2$ for ridge regression, respectively.

[Figure 3.11 (ESL): Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions $|\beta_1| + |\beta_2| \le t$ and $\beta_1^2 + \beta_2^2 \le t^2$, respectively, while the ellipses are the contours of the least squares error function centred at $\hat\beta$.]

Both methods choose the first point where the elliptical contours hit the constraint region. The lasso's diamond has corners on the axes; when the solution occurs at a corner, one of the coefficients is exactly zero.
Penalties of the form $|\beta_j|^q$

More generally, consider

$$\hat\beta = \arg\min_\beta \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^p |\beta_j|^q$$

[Figure: contours of constant value of $\sum_j |\beta_j|^q$ for $q = 4, 2, 1, 0.5, 0.1$.]

Thinking of $|\beta_j|^q$ as the log-prior density for $\beta_j$, these are also the equi-contours of the prior distribution of the parameters. The value $q = 0$ corresponds to variable subset selection, as the penalty simply counts the number of nonzero parameters; $q = 1$ corresponds to the lasso, while $q = 2$ to ridge regression. Notice that for $q \le 1$, the prior is not uniform in direction, but concentrates more mass in the coordinate directions. The prior corresponding to the $q = 1$ case is an independent double exponential (or Laplace) distribution for each input, with density $(1/2\tau)\exp(-|\beta|/\tau)$ and $\tau = 1/\lambda$. The case $q = 1$ (lasso) is the smallest $q$ such that the constraint region is convex; non-convex constraint regions make the optimization problem more difficult.

In this view, the lasso, ridge regression and best subset selection are Bayes estimates with different priors. Note, however, that they are derived as posterior modes, that is, maximizers of the posterior. It is more common to use the mean of the posterior as the Bayes estimate. Ridge regression is also the posterior mean, but the lasso and best subset selection are not.

Can we try other values of $q$?
- When $q \ge 1$ we still have a convex problem.
- When $0 \le q < 1$ we do not have a convex problem.
- When $q \le 1$ sparse solutions are encouraged.
- When $q > 1$ we cannot set coefficients exactly to zero: $|\beta_j|^q$ is differentiable at 0, and so does not share the ability of the lasso ($q = 1$) to produce exact zeros.

Although one might consider estimating $q$ from the data, our experience is that it is not worth the effort for the extra variance incurred. Values of $q \in (1, 2)$ suggest a compromise between the lasso and ridge regression.
Elastic-net penalty

A compromise between the ridge and lasso penalties is the elastic-net penalty:

$$\lambda \sum_{j=1}^p \big(\alpha\,\beta_j^2 + (1 - \alpha)\,|\beta_j|\big)$$

The elastic-net selects variables like the lasso, and shrinks together the coefficients of correlated predictors like ridge regression.
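A usage sketch with scikit-learn's `ElasticNet`. Note its parametrization differs from the slide's: `l1_ratio` weights the $L_1$ part (so it plays the role of $1 - \alpha$ above, with its own scaling conventions), and the data here is simulated to contain a strongly correlated pair of predictors:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(14)
X = rng.normal(size=(100, 20))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=100)   # two strongly correlated predictors
y = X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=100)

model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.round(model.coef_[:3], 2))   # the correlated pair tends to share similar coefficients
```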
Effective degrees of freedom in general

For any fitted vector $\hat y$ define

$$df(\hat y) = \frac{1}{\sigma^2} \sum_{i=1}^n \mathrm{Cov}(\hat y_i, y_i)$$

where $\mathrm{Cov}(\hat y_i, y_i)$ is the covariance between the fitted value $\hat y_i$ and its observation $y_i$. Intuitively, the harder we fit to the data, the larger this covariance and hence the larger $df(\hat y)$.
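For a linear smoother $\hat y = Hy$ this definition reduces to $\mathrm{trace}(H)$, which for least squares is $p + 1$. A simulation sketch checking that identity (fixed design, arbitrary fixed mean; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(15)
n, p, sigma = 30, 4, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
H = X @ np.linalg.solve(X.T @ X, X.T)      # least squares smoother matrix
mu = rng.normal(size=n)                    # arbitrary fixed mean vector

trials = 20000
Y = mu + sigma * rng.normal(size=(trials, n))   # many response draws
Y_hat = Y @ H.T                                 # fits for every draw
cov_sum = np.mean((Y - Y.mean(0)) * (Y_hat - Y_hat.mean(0)), axis=0).sum()
print(cov_sum / sigma**2, np.trace(H))     # both close to p + 1 = 5
```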