Kelvin Gu
Contents

1 Model selection
  1.1 Setup
  1.2 Motivation
  1.3 RSS-d.o.f. decomposition of the PE
  1.4 RSS-d.o.f. decomposition in action
  1.5 Bias-variance decomposition of the risk
  1.6 Bias-variance decomposition in action
  1.7 Two reasons why the methods above aren't ideal
  1.8 Bayesian Info Criterion
  1.9 Stein's unbiased risk estimate (SURE)
  1.10 SURE in action
2 Multiple hypothesis testing
  2.1 The setup
  2.2 Why do we need it?
  2.3 Controlling FDR using Benjamini-Hochberg (BH)
  2.4 Proof of BH
3 References
1 Model selection
1.1 Setup
Suppose we know X = x. We want to predict the value of Y.
- Define the prediction error to be $PE = (Y - f(X))^2$
- We want to choose some function $f$ that minimizes the objective $E[PE \mid X = x]$
- the optimal solution is $\mu(x) = E[Y \mid X = x]$
- As a proxy for minimizing $E\left[(Y - f(X))^2 \mid X = x\right]$, we'll minimize the risk: $R = E\left[(\mu(X) - f(X))^2\right]$
- note that
$$E\{E[PE \mid X = x]\} = E[PE] = E\left[(Y - \mu(X) + \mu(X) - f(X))^2\right] = E\left[(\mu(X) - f(X))^2\right] + E\left[(Y - \mu(X))^2\right] = R + \mathrm{Var}(Y)$$
(the cross term vanishes because $E[Y - \mu(X) \mid X] = 0$)
- so, the risk $R$ is a reasonable proxy to optimize
- $\mathrm{Var}(Y)$ is unavoidable
- For notational convenience, we'll call $\hat\mu = f(X)$ and $\mu = \mu(X)$
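A quick Monte Carlo check of this decomposition (a sketch; the toy $\mu(x) = 2x$, the deliberately mismatched $f(x) = 1.5x$, and the unit noise variance are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: Y = mu(X) + noise, with mu(x) = 2x and noise ~ N(0, 1).
# Compare E[PE] for a deliberately biased predictor f(x) = 1.5x
# against R + Var(noise), where R = E[(mu(X) - f(X))^2].
n = 500_000
X = rng.uniform(-1, 1, n)
Y = 2 * X + rng.normal(0, 1, n)

f = lambda x: 1.5 * x
pe = np.mean((Y - f(X)) ** 2)        # estimate of E[PE]
risk = np.mean((2 * X - f(X)) ** 2)  # estimate of R
noise_var = 1.0                      # the irreducible Var term

print(pe, risk + noise_var)          # the two should agree closely
```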
1.2 Motivation
Why can't we just use cross-validation for all tasks?

The problem:
- Suppose we're doing ordinary least squares with p = 30 predictors (inputs)
- we want to select the subset of the p predictors with smallest EPE
- for each subset of predictors, we fit the model and then test on some held-out test set.
- there are $\binom{p}{2} = 435$ models of size 2, and $\binom{p}{15} = 155117520$ models of size 15.
- even if most of the size-15 models are terrible, after 155117520 opportunities, you'll probably find one that fits the test data better than any of the size-2 models.
- This is second-order overfitting.
- let $M_{15}$ be the set of all size-15 models:
$$\underbrace{E\left[\min_{m \in M_{15}} PE(m)\right]}_{\text{cross validation thinks you get this}} \;\le\; \underbrace{\min_{m \in M_{15}} E\left[PE(m)\right]}_{\text{you actually get this}}$$
- even if you have the computation power to try all models, it's still a bad idea (without some modification)

How will we address this?
- find better ways to estimate PE, and add an additional penalty to account for the overfitting problem presented above
- it turns out that we need a penalty which depends not only on model size p, but also data size n

Other ways to address this:
- just avoid searching over high-dimensional model space in the first place (e.g. ridge regression and LASSO both offer just a single parameter to vary)
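The second-order overfitting effect is easy to reproduce numerically. Below is a sketch (all numbers made up): every candidate "model" predicts pure noise, so all of them are equally useless, yet the minimum test error over a large family looks spuriously better than over a small one:

```python
import numpy as np

rng = np.random.default_rng(0)

# All candidate "models" are equally useless: each predicts pure noise.
# Yet the minimum test error over a huge family looks better than the
# minimum over a small family -- second-order overfitting.
n_test = 50
y_test = rng.normal(0, 1, n_test)

def best_test_mse(n_models):
    preds = rng.normal(0, 1, (n_models, n_test))  # each row: one model
    return ((preds - y_test) ** 2).mean(axis=1).min()

small = best_test_mse(10)
big = best_test_mse(100_000)
print(small, big)  # big family's winner looks (spuriously) better
```

Every model's true expected error is the same, so the gap between `big` and `small` is purely an artifact of minimizing over more test-set evaluations.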
1.3 RSS-d.o.f. decomposition of the PE
We just saw that expected prediction error can be decomposed as $E[\mathrm{PE}] = R + \mathrm{Var}(Y)$. Here is another decomposition.

Let $(X, Y)$ be the training data, and let $(X, Y')$ be an independent copy of the response drawn at the same $X$. Writing $\hat\mu_{XY}$ for the fit trained on $(X, Y)$:

$$\underbrace{E\left[(\hat\mu_{XY} - Y')^2\right]}_{E[\mathrm{PE}]} = \underbrace{E\left[(\hat\mu_{XY} - Y)^2\right]}_{E[\mathrm{RSS}]} + \underbrace{2\,\mathrm{Cov}(\hat\mu_{XY}, Y)}_{\text{d.o.f.}}$$
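For a linear smoother $\hat\mu = Hy$, the covariance term is $\sigma^2\,\mathrm{tr}(H)$, so for OLS with $p$ predictors the decomposition predicts $E[\mathrm{PE}] - E[\mathrm{RSS}] = 2\sigma^2 p$. A Monte Carlo sketch (design, dimensions, and replication count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed design, OLS: muhat = H y with hat matrix H. For this linear
# smoother, sum_i Cov(muhat_i, y_i) = sigma^2 * trace(H) = sigma^2 * p,
# so the decomposition predicts E[PE] - E[RSS] = 2 * sigma^2 * p.
n, p, sigma, reps = 100, 5, 1.0, 20000
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix, trace = p
mean = X @ beta

Y = mean + sigma * rng.normal(size=(reps, n))      # training responses
Y_new = mean + sigma * rng.normal(size=(reps, n))  # fresh responses, same X
Muhat = Y @ H.T                                    # each row: H @ y

rss = ((Y - Muhat) ** 2).sum(axis=1).mean()        # estimates E[RSS]
pe = ((Y_new - Muhat) ** 2).sum(axis=1).mean()     # estimates E[PE]
dof_term = 2 * sigma**2 * np.trace(H)              # = 2 * sigma^2 * p
print(pe - rss, dof_term)                          # should agree
```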
Proof:

$$\underbrace{E\left[(\hat\mu_{XY} - Y)^2\right]}_{E[\mathrm{RSS}]} = E\left[(\hat\mu_{XY} - \mu + \mu - Y)^2\right] = \underbrace{E\left[(\hat\mu_{XY} - \mu)^2\right] + E\left[(\mu - Y)^2\right]}_{E[\mathrm{PE}]} - \underbrace{2\,E\left[(\hat\mu_{XY} - \mu)(Y - \mu)\right]}_{\text{d.o.f.}}$$

The first two terms sum to $E[\mathrm{PE}]$ because:

$$E[\mathrm{PE}] = E\left[(\hat\mu_{XY} - Y')^2\right] = E\left[(\hat\mu_{XY} - \mu + \mu - Y')^2\right] = E\left[(\hat\mu_{XY} - \mu)^2 + 2(\hat\mu_{XY} - \mu)(\mu - Y') + (\mu - Y')^2\right] = E\left[(\hat\mu_{XY} - \mu)^2\right] + E\left[(\mu - Y')^2\right]$$

and $E\left[(\mu - Y')^2\right] = E\left[(\mu - Y)^2\right]$, since $Y'$ and $Y$ have the same distribution. Rearranging gives $E[\mathrm{PE}] = E[\mathrm{RSS}] + 2\,\mathrm{Cov}(\hat\mu_{XY}, Y)$.

A key thing to note is that $E\left[(\hat\mu_{XY} - \mu)(\mu - Y')\right] = E\left[\hat\mu_{XY} - \mu\right] E\left[\mu - Y'\right] = 0$
- the expectation factorizes because $Y'$ is independent of the training data $(X, Y)$, and hence of $\hat\mu_{XY}$

The minimizer of $\|y - \hat\mu\|^2 + 2\sigma^2 \|\hat\mu\|_0$ is hard thresholding: $\hat\mu_i = y_i\, \mathbf{1}\{|y_i| > \sigma\sqrt{2}\}$. We can solve this problem even though it has an $L_0$ penalty. You would think this procedure should be good, right? Nope. It tends to set too many entries to $y_i$ (i.e. it keeps too many coordinates).
We will see that model selection using a constant penalty on the $L_0$ norm suffers from the same problem as $C_p$ and cross-validation.
1.7 Two reasons why the methods above aren't ideal
(Continuing the denoising problem)

1. Other estimators can achieve better prediction error
- suppose the real $\mu = 0$
- then our risk is:
$$E\|\hat\mu - \mu\|^2 = \sum_i E\left[y_i^2\, \mathbf{1}\{|y_i| > \sigma\sqrt{2}\}\right] \approx 0.57\, p\, \sigma^2$$
- consider the James-Stein estimator:
$$\hat\mu_{JS} = \left(1 - \frac{(p-2)\sigma^2}{\|Y\|^2}\right) Y$$
whereas hard thresholding has risk $\approx 0.57\, p\, \sigma^2$ at $\mu = 0$, James-Stein has risk $2\sigma^2$ there.
- The following bound on the risk of $\hat\mu_{JS}$ is offered without proof:
$$E\|\hat\mu_{JS} - \mu\|^2 \le 2\sigma^2 + \frac{(p-2)\sigma^2\, \|\mu\|^2}{(p-2)\sigma^2 + \|\mu\|^2}$$
- plugging in $\|\mu\|^2 = 0$, we get a bound of $2\sigma^2$. Doesn't even depend on p!
2. They aren't consistent (this is a $n \to \infty$ argument)
- suppose we get more iid observations $y_i \sim N(\mu, \sigma^2 I)$, $i = 1, \ldots, n$
- suppose that $\mu$ has sparsity $k^*$
- the criterion for the size-$k$ model (estimate the first $k$ coordinates by their sample means $\bar y_j$, set the rest to 0, giving fit $\hat y$) is
$$C_p(k) = \sum_{i=1}^n \|y_i - \hat y\|^2 + 2\sigma^2 \|\hat y\|_0$$
- Consistency would require that $P\left(C_p(k^*) < C_p(k)\right) \to 1$ for all $k \ne k^*$
- expand:
$$\sum_{i=1}^n \|y_i - \hat y\|^2 + 2\sigma^2 \|\hat y\|_0 = \sum_{i=1}^n \left[\sum_{j=1}^k \left(y_{ij} - \bar y_j\right)^2 + \sum_{j=k+1}^p y_{ij}^2\right] + 2\sigma^2 k$$
$$= \left[\sum_{j=1}^k \sum_{i=1}^n \left(y_{ij} - \bar y_j\right)^2 + \sum_{j=k+1}^p \sum_{i=1}^n y_{ij}^2\right] + 2\sigma^2 k$$
$$= \left[\sum_{j=1}^k \left(\sum_{i=1}^n y_{ij}^2 - n \bar y_j^2\right) + \sum_{j=k+1}^p \sum_{i=1}^n y_{ij}^2\right] + 2\sigma^2 k$$
$$= \left[\sum_{j=1}^p \sum_{i=1}^n y_{ij}^2 - n \sum_{j=1}^k \bar y_j^2\right] + 2\sigma^2 k$$
$$= -n \sum_{j=1}^k \bar y_j^2 + 2\sigma^2 k + \underbrace{\sum_{i=1}^n \|y_i\|^2}_{\text{constant for all } k}$$
- wlog, consider $k' > k$ (with $k \ge k^*$, so that coordinates $k+1, \ldots, k'$ are null):
$$C_p(k') - C_p(k) = 2\sigma^2 (k' - k) - \sum_{j=k+1}^{k'} n \bar y_j^2$$
- each $\bar y_j \sim N\left(\mu_j, \sigma^2/n\right)$, so for the null coordinates $n \bar y_j^2 \sim \sigma^2 \chi^2_1$ and
$$\sum_{j=k+1}^{k'} n \bar y_j^2 \sim \sigma^2 \chi^2_{k'-k}$$
- this expression doesn't depend on n anymore, so there's always positive probability that $C_p(k') < C_p(k)$: the too-large model can win no matter how much data we collect
- this proof seems slightly fishy, because for each model size k, I'm arbitrarily picking the first k entries to threshold
- but you can replace all the sums up to k with sums over any size-k subset
- now compute $C_p$ for all subsets, and you'll get the same result
- in contrast, the Bayesian information criterion works
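The risk comparison in point 1 can be checked by Monte Carlo. A sketch, assuming the hard-threshold rule $\hat\mu_i = y_i\,\mathbf{1}\{|y_i| > \sigma\sqrt{2}\}$ and the James-Stein estimator $\hat\mu_{JS} = (1 - (p-2)\sigma^2/\|Y\|^2)\,Y$, with $p = 50$ chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)

# Risk at mu = 0, estimated by Monte Carlo: hard thresholding at
# sigma*sqrt(2) has risk ~ 0.57 * p * sigma^2, while James-Stein
# shrinkage has risk 2 * sigma^2, independent of p.
p, sigma, reps = 50, 1.0, 20000
Y = sigma * rng.normal(size=(reps, p))                # mu = 0
thresh = np.where(np.abs(Y) > sigma * np.sqrt(2), Y, 0.0)
norms2 = (Y ** 2).sum(axis=1, keepdims=True)
js = (1 - (p - 2) * sigma**2 / norms2) * Y

thresh_risk = (thresh ** 2).sum(axis=1).mean()        # E||muhat - 0||^2
js_risk = (js ** 2).sum(axis=1).mean()
print(thresh_risk, js_risk)   # roughly 0.57 * p versus roughly 2
```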
1.8 Bayesian Info Criterion
- We have a collection of models to select from, $\{M_i\}$, indexed by $i$
- Model $M_i$ has parameter vector $\theta_i$ associated with it. Let $|\theta_i|$ denote the dimension of $\theta_i$.
- Pick the model with highest marginal probability:
$$P(y \mid M_i) = \int f(y \mid \theta_i)\, g_i(\theta_i)\, d\theta_i$$
$$\log P(y \mid M_i) \approx \log L_i(y) - \frac{|\theta_i|}{2} \log n$$
- derivation steps:
  - write the integrand as $\exp(\log(\cdot))$
  - Taylor expand around the MLE
  - recognize a Gaussian integral
  - take logs
  - deal with the Hessian term using the SLLN
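To see the contrast with $C_p$ numerically, here is a sketch of the sparse-mean experiment under both penalties. Assumptions: $y_i \sim N(\mu, \sigma^2 I)$ with true sparsity 1, and (from the $C_p$ expansion) best-subset selection keeps coordinate $j$ exactly when $n \bar y_j^2$ exceeds the per-parameter penalty, which is $2\sigma^2$ for $C_p$ and $\sigma^2 \log n$ for BIC on the RSS scale:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sparse-mean model: y_i ~ N(mu, sigma^2 I), i = 1..n, true sparsity
# k* = 1. Best-subset selection keeps coordinate j iff n * ybar_j^2
# exceeds the per-parameter penalty.
p, sigma = 10, 1.0
mu = np.zeros(p)
mu[0] = 5.0                                   # true model has k* = 1

def overfit_prob(n, penalty, reps=2000):
    hits = 0
    for _ in range(reps):
        ybar = (mu + sigma * rng.normal(size=(n, p))).mean(axis=0)
        k_hat = int((n * ybar**2 > penalty).sum())  # selected model size
        hits += k_hat > 1                           # bigger than k* = 1
    return hits / reps

cp = {n: overfit_prob(n, 2 * sigma**2) for n in (10, 1000)}
bic = {n: overfit_prob(n, sigma**2 * np.log(n)) for n in (10, 1000)}
print(cp)    # stays bounded away from 0 for every n
print(bic)   # decreases toward 0 as n grows
```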
1.9 Stein's unbiased risk estimate (SURE)

(just mentioning, not giving details here)
- suppose $X$ is a vector with mean $\mu$ and variance $\sigma^2 I$
- we estimate $\mu$ by $\hat\mu = X + g(X)$, where $g$ must be almost differentiable
- then
$$R = E\|\hat\mu - \mu\|^2 = n\sigma^2 + E\left[\|g(X)\|^2 + 2\sigma^2\, \mathrm{div}\, g(X)\right]$$
where
$$\mathrm{div}\, g(X) = \sum_i \frac{\partial}{\partial X_i} g_i(X) = \mathrm{Tr}\left(\frac{\partial g}{\partial X}\right) \quad \text{(trace of the Jacobian)}$$
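A Monte Carlo sanity check that SURE is unbiased for the risk, using the James-Stein choice $g(X) = -(p-2)\sigma^2 X/\|X\|^2$, whose divergence works out to $-(p-2)^2\sigma^2/\|X\|^2$ (the dimension and the fixed $\mu$ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# SURE for muhat = X + g(X), here with the James-Stein
# g(X) = -(p-2) * sigma^2 * X / ||X||^2, whose divergence is
# -(p-2)^2 * sigma^2 / ||X||^2. Average SURE should match average loss.
p, sigma, reps = 20, 1.0, 50000
mu = rng.normal(size=p)                      # arbitrary fixed mean

X = mu + sigma * rng.normal(size=(reps, p))
norms2 = (X ** 2).sum(axis=1, keepdims=True)
g = -(p - 2) * sigma**2 * X / norms2
div_g = -(p - 2) ** 2 * sigma**2 / norms2[:, 0]

loss = (((X + g) - mu) ** 2).sum(axis=1).mean()  # Monte Carlo risk
sure = (p * sigma**2 + (g ** 2).sum(axis=1) + 2 * sigma**2 * div_g).mean()
print(loss, sure)   # the two should agree
```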
1.10 SURE in action
minimizing SURE over a family of shrinkage estimators leads to the James-Stein estimator
2 Multiple hypothesis testing
2.1 The setup
- we have hypotheses $H_i$, $i = 1, \ldots, n$ to test
- we want to control the quality of our conclusions, using one of these metrics:
  - FWER: $P(\text{we make at least one false rejection})$
  - FDR: $E\left[\dfrac{\#\text{ of false rejections}}{\#\text{ of total rejections}}\right]$ (defined as 0 when there are no rejections)
2.2 Why do we need it?
Suppose you don't even care about making scientific conclusions. You just want to do good prediction. You can think of model selection as a way to induce sparsity. (Candès calls it "testimation".)

Back to the thresholding example.
2.3 Controlling FDR using Benjamini-Hochberg (BH)
if we have time
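The notes don't get to the details, but the BH step-up rule itself is short. A standard sketch (the $\alpha$ level and p-values below are made up): sort the p-values, find the largest $k$ with $p_{(k)} \le \alpha k / n$, and reject the hypotheses with the $k$ smallest p-values.

```python
import numpy as np

# Benjamini-Hochberg step-up procedure: sort the p-values, find the
# largest k with p_(k) <= alpha * k / n, reject those k hypotheses.
def benjamini_hochberg(pvals, alpha=0.05):
    pvals = np.asarray(pvals)
    n = len(pvals)
    order = np.argsort(pvals)
    sorted_p = pvals[order]
    below = np.nonzero(sorted_p <= alpha * np.arange(1, n + 1) / n)[0]
    reject = np.zeros(n, dtype=bool)
    if below.size:
        k = below[-1] + 1         # largest index passing the BH line
        reject[order[:k]] = True  # reject the k smallest p-values
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9]
print(benjamini_hochberg(pvals, alpha=0.05))
```

Note the step-up character: a p-value can be rejected even if it individually sits above the line, as long as some larger p-value passes it.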
2.4 Proof of BH
if we have time
3 References
http://nscs00.ucmerced.edu/~nkumar4/BhatKumarBIC.pdf
The STATS 300 sequence, with thanks to Profs. Candès, Siegmund and Romano!