
1 The Glivenko-Cantelli Theorem

Let $X_i$, $i = 1, \ldots, n$ be an i.i.d. sequence of random variables with distribution function $F$ on $\mathbb{R}$. The empirical distribution function is the function of $x$ defined by
$$F_n(x) = \frac{1}{n} \sum_{1 \le i \le n} I\{X_i \le x\}.$$

For a given $x \in \mathbb{R}$, we can apply the strong law of large numbers to the sequence $I\{X_i \le x\}$, $i = 1, \ldots, n$ to assert that
$$F_n(x) \to F(x) \quad \text{a.s.}$$
(in order to apply the strong law of large numbers we only need to show that $E[|I\{X_i \le x\}|] < \infty$, which in this case is trivial because $|I\{X_i \le x\}| \le 1$). In this sense, $F_n(x)$ is a reasonable estimate of $F(x)$ for a given $x \in \mathbb{R}$. But is $F_n(x)$ a reasonable estimate of $F(x)$ when both are viewed as functions of $x$?
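The pointwise convergence above is easy to check by simulation. The following sketch (assuming NumPy is available; the uniform distribution, for which $F(x) = x$ on $[0, 1]$, is our choice purely for illustration) evaluates $F_n$ at a fixed point for growing $n$:

```python
import numpy as np

rng = np.random.default_rng(0)

def ecdf(sample, x):
    """F_n(x) = (1/n) * #{i : X_i <= x}, the empirical distribution function at x."""
    return np.mean(sample <= x)

# For Uniform[0, 1] draws, F(x) = x, so F_n(0.3) should settle near 0.3 as n grows.
for n in [100, 10_000, 1_000_000]:
    sample = rng.uniform(size=n)
    print(n, ecdf(sample, 0.3))
```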
The Glivenko-Cantelli Theorem provides an answer to this question. It asserts the following:

Theorem 1.1 Let $X_i$, $i = 1, \ldots, n$ be an i.i.d. sequence of random variables with distribution function $F$ on $\mathbb{R}$. Then,
$$\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \to 0 \quad \text{a.s.} \tag{1}$$

This result is perhaps the oldest and best-known result in the very large field of empirical process theory, which is at the center of much of modern econometrics. The statistic $\sup_{x \in \mathbb{R}} |F_n(x) - F(x)|$ appearing in (1) is an example of a Kolmogorov-Smirnov statistic.
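Because $F_n$ is a step function that jumps only at the sample points, the supremum over all of $\mathbb{R}$ can be computed exactly from the order statistics. A Python sketch (assuming NumPy; the code checks $F_n$ just before and just after each jump):

```python
import numpy as np

def ks_statistic(sample, cdf):
    """Compute sup_x |F_n(x) - F(x)| exactly for a continuous cdf F.
    The supremum is attained at (or just to the left of) an order statistic."""
    xs = np.sort(sample)
    n = len(xs)
    fx = cdf(xs)
    over = np.arange(1, n + 1) / n - fx   # F_n(X_(i)) - F(X_(i))
    under = fx - np.arange(0, n) / n      # F(X_(i)) - F_n just left of X_(i)
    return float(max(over.max(), under.max()))

rng = np.random.default_rng(0)
for n in [100, 10_000, 1_000_000]:
    # Uniform[0, 1] draws, so F(x) = x; the statistic shrinks toward 0.
    print(n, ks_statistic(rng.uniform(size=n), lambda x: x))
```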
The proof of the result will require the following lemma.

Lemma 1.1 Let $F$ be a (nonrandom) distribution function on $\mathbb{R}$. For each $\epsilon > 0$ there exists a finite partition of the real line of the form $-\infty = t_0 < t_1 < \cdots < t_k = \infty$ such that for $0 \le j \le k - 1$,
$$F(t_{j+1}^-) - F(t_j) \le \epsilon.$$

Proof: Let $\epsilon > 0$ be given. Let $t_0 = -\infty$ and for $j \ge 0$ define
$$t_{j+1} = \sup\{z : F(z) \le F(t_j) + \epsilon\}.$$
Note that $F(t_{j+1}) \ge F(t_j) + \epsilon$. To see this, suppose that $F(t_{j+1}) < F(t_j) + \epsilon$. Then, by right continuity of $F$ there would exist $\delta > 0$ so that $F(t_{j+1} + \delta) < F(t_j) + \epsilon$, which would contradict the definition of $t_{j+1}$. Thus, between $t_j$ and $t_{j+1}$, $F$ jumps by at least $\epsilon$. Since this can happen at most a finite number of times, the partition is of the desired form, that is, $-\infty = t_0 < t_1 < \cdots < t_k = \infty$ with $k < \infty$. Moreover, $F(t_{j+1}^-) \le F(t_j) + \epsilon$. To see this, note that by definition of $t_{j+1}$ we have $F(t_{j+1} - \delta) \le F(t_j) + \epsilon$ for all $\delta > 0$. The desired result thus follows from the definition of $F(t_{j+1}^-)$.
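The greedy construction in the proof can be carried out explicitly when $F$ is continuous and strictly increasing, since then $\sup\{z : F(z) \le c\} = F^{-1}(c)$ and the $t_j$ are simply $\epsilon$-quantiles. A Python sketch (the logistic distribution is a hypothetical choice, used only because its inverse is available in closed form):

```python
import math

def build_partition(F_inv, eps):
    """Greedy partition from the lemma: t_{j+1} = sup{z : F(z) <= F(t_j) + eps}.
    For a continuous, strictly increasing F this is just F_inv(F(t_j) + eps)."""
    ts = [-math.inf]
    level = 0.0  # F(-inf) = 0
    while level + eps < 1.0:
        level += eps
        ts.append(F_inv(level))
    ts.append(math.inf)
    return ts

# Logistic distribution: F(x) = 1/(1 + e^(-x)), F_inv(p) = log(p/(1-p)).
F = lambda x: 1.0 / (1.0 + math.exp(-x))
F_inv = lambda p: math.log(p / (1.0 - p))
ts = build_partition(F_inv, 0.25)
print(ts)  # [-inf, F_inv(0.25), 0.0, F_inv(0.75), inf]
```

Since this $F$ is continuous, $F(t_{j+1}^-) = F(t_{j+1})$, and each gap $F(t_{j+1}) - F(t_j)$ is at most $\epsilon$ by construction.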

Proof of Theorem 1.1: It suffices to show that for any $\epsilon > 0$
$$\limsup_{n \to \infty} \sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \le \epsilon \quad \text{a.s.}$$
To this end, let $\epsilon > 0$ be given and consider a partition of the real line into finitely many pieces of the form $-\infty = t_0 < t_1 < \cdots < t_k = \infty$ such that for $0 \le j \le k - 1$,
$$F(t_{j+1}^-) - F(t_j) \le \frac{\epsilon}{2}.$$
The existence of such a partition is ensured by the previous lemma. For any $x \in \mathbb{R}$, there exists $j$ such that $t_j \le x < t_{j+1}$. For such $j$,
$$F_n(t_j) \le F_n(x) \le F_n(t_{j+1}^-)$$
$$F(t_j) \le F(x) \le F(t_{j+1}^-),$$
which implies that
$$F_n(t_j) - F(t_{j+1}^-) \le F_n(x) - F(x) \le F_n(t_{j+1}^-) - F(t_j).$$
Furthermore,
$$F_n(t_j) - F(t_j) + F(t_j) - F(t_{j+1}^-) \le F_n(x) - F(x) \le F_n(t_{j+1}^-) - F(t_{j+1}^-) + F(t_{j+1}^-) - F(t_j).$$
By construction of the partition, we have that
$$F_n(t_j) - F(t_j) - \frac{\epsilon}{2} \le F_n(x) - F(x) \le F_n(t_{j+1}^-) - F(t_{j+1}^-) + \frac{\epsilon}{2}.$$
Since these bounds do not depend on $x$, the desired result now follows by applying the Strong Law of Large Numbers to each of the finitely many points $t_j$ and left limits $t_{j+1}^-$.

2 The Sample Median


We now give a brief application of the Glivenko-Cantelli Theorem. Let $X_i$, $i = 1, \ldots, n$ be an i.i.d. sequence of random variables with distribution $F$. Suppose one is interested in the median of $F$. Concretely, we will define
$$\mathrm{Med}(F) = \inf\{x : F(x) \ge \tfrac{1}{2}\}.$$

A natural estimator of $\mathrm{Med}(F)$ is the sample analog, $\mathrm{Med}(F_n)$. Under what conditions is $\mathrm{Med}(F_n)$ a reasonable estimate of $\mathrm{Med}(F)$?
Let $m = \mathrm{Med}(F)$ and suppose that $F$ is well behaved at $m$ in the sense that $F(t) > \tfrac{1}{2}$ whenever $t > m$. Under this condition, we can show using the Glivenko-Cantelli Theorem that $\mathrm{Med}(F_n) \to \mathrm{Med}(F)$ a.s. We will now prove this result.
Suppose $F_n$ is a (nonrandom) sequence of distribution functions such that
$$\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \to 0.$$
Let $\epsilon > 0$ be given. We wish to show that there exists $N = N(\epsilon)$ such that for all $n > N$,
$$|\mathrm{Med}(F_n) - \mathrm{Med}(F)| < \epsilon.$$

Choose $\delta > 0$ so that
$$\delta < \tfrac{1}{2} - F(m - \epsilon)$$
$$\delta < F(m + \epsilon) - \tfrac{1}{2},$$
which in turn implies that
$$F(m - \epsilon) < \tfrac{1}{2} - \delta$$
$$F(m + \epsilon) > \tfrac{1}{2} + \delta.$$
(It might help to draw a picture to see why we should pick $\delta$ in this way.) Next choose $N$ so that for all $n > N$,
$$\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| < \delta.$$
Let $m_n = \mathrm{Med}(F_n)$. For such $n$, $m_n > m - \epsilon$, for if $m_n \le m - \epsilon$, then
$$F(m - \epsilon) > F_n(m - \epsilon) - \delta \ge \tfrac{1}{2} - \delta,$$
which contradicts the choice of $\delta$. We also have that $m_n < m + \epsilon$, for if $m_n \ge m + \epsilon$, then
$$F(m + \epsilon) < F_n(m + \epsilon) + \delta \le \tfrac{1}{2} + \delta,$$
which again contradicts the choice of $\delta$. Thus, for $n > N$, $|m_n - m| < \epsilon$, as desired.
By the Glivenko-Cantelli Theorem, it follows immediately that $\mathrm{Med}(F_n) \to \mathrm{Med}(F)$ a.s.
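This convergence is easy to illustrate numerically. A Python sketch (assuming NumPy; the exponential distribution is our hypothetical example, chosen because its distribution function is strictly increasing with $\mathrm{Med}(F) = \log 2$, so the condition $F(t) > \tfrac{1}{2}$ for $t > m$ holds):

```python
import math
import numpy as np

def sample_median(sample):
    """Med(F_n) = inf{x : F_n(x) >= 1/2}: the smallest order statistic X_(i)
    with i/n >= 1/2, i.e. i = ceil(n/2)."""
    xs = np.sort(sample)
    n = len(xs)
    return xs[math.ceil(n / 2) - 1]

rng = np.random.default_rng(0)
for n in [100, 10_000, 1_000_000]:
    # Exponential(1): Med(F) = log 2, approximately 0.693.
    print(n, sample_median(rng.exponential(size=n)))
```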

3 A Generalized Glivenko-Cantelli Theorem


Let $X_i$, $i = 1, \ldots, n$ be an i.i.d. sequence of random variables with distribution $P$. We no longer require that the random variables be real-valued. The empirical measure is the measure defined by
$$P_n = \frac{1}{n} \sum_{1 \le i \le n} \delta_{X_i},$$

where $\delta_x$ is the measure that puts mass one at $x$. In this language, the empirical distribution function is simply $P_n(-\infty, x]$, and the uniform convergence over $x \in \mathbb{R}$ asserted in (1) can instead be understood as uniform convergence of the empirical measure to the true measure over sets of the form $(-\infty, x]$ for $x \in \mathbb{R}$. Can this result be extended to other classes of sets?
It turns out that the convergence of the empirical measure to the true measure is uniform for large classes of sets provided that the sets do not pick out too many subsets of any set of $n$ points in the space.

Definition 3.1 We say that a class $\mathcal{V}$ of subsets of a space $S$ has polynomial discrimination (of degree $v$) if there exists a polynomial $\phi(\cdot)$ (of degree $v$) so that from every set of $n$ points in $S$ the class picks out at most $\phi(n)$ distinct subsets. Formally, if $S_0 \subseteq S$ consists of $n$ points, then there are at most $\phi(n)$ distinct sets of the form $V \cap S_0$ with $V \in \mathcal{V}$.

It is easy to see that half-intervals of the form $(-\infty, x]$ with $x \in \mathbb{R}$ have polynomial discrimination of degree 1. Unfortunately, it is difficult to come up with more classes of sets that have polynomial discrimination. For that purpose, the following observation is useful. A class of sets $\mathcal{V}$ is said to shatter a set of points $S_0$ if it can pick out every possible subset of $S_0$, i.e., if each subset of $S_0$ can be written as $V \cap S_0$ for some $V \in \mathcal{V}$. A lemma due to Sauer asserts that if a class of sets shatters no set of $v$ points, then it has polynomial discrimination of degree (no greater than) $v - 1$. For example, on the real line, the half-intervals shatter no set of two points, which is consistent with our earlier observation. More interestingly, the class of all closed discs in $\mathbb{R}^2$ shatters no set of four points. This follows from the following result, which allows us to generate many classes of sets that have polynomial discrimination.
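Both facts about half-intervals can be verified by brute force. The Python sketch below (helper names like `picked_subsets` are ours) represents each set $V$ by its indicator function and enumerates the distinct sets $V \cap S_0$:

```python
def picked_subsets(points, classifiers):
    """Distinct subsets V ∩ S_0 picked out from `points` by the class."""
    return {frozenset(p for p in points if c(p)) for c in classifiers}

def shatters(points, classifiers):
    return len(picked_subsets(points, classifiers)) == 2 ** len(points)

pts = [0.3, 1.1, 2.5, 4.0]
# Half-intervals (-inf, x]: thresholds around and between the points suffice,
# since any other threshold picks out one of the same subsets.
half_intervals = [lambda p, x=x: p <= x for x in [-1.0, 0.5, 1.5, 3.0, 5.0]]

print(len(picked_subsets(pts, half_intervals)))  # 5 = n + 1: a degree-1 polynomial
print(shatters([0.3, 1.1], half_intervals))      # False: {1.1} alone is never picked out
```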

Lemma 3.1 Let $\mathcal{G}$ be a vector space of real-valued functions on $S$ with dimension $v - 1$. The class of sets of the form $\{g \ge 0\}$ as $g$ varies over $\mathcal{G}$ shatters no set of $v$ points in $S$.

Proof: Choose $s_1, \ldots, s_v$ distinct points from $S$. We must show that there is some subset of these points that is not picked out by the sets of the form $\{g \ge 0\}$ as $g$ varies over $\mathcal{G}$. Define the map $L : \mathcal{G} \to \mathbb{R}^v$ by the rule
$$L(g) = (g(s_1), \ldots, g(s_v)).$$
$L$ is a linear map. Hence, $L(\mathcal{G})$ is a linear subspace of $\mathbb{R}^v$ of dimension no greater than $v - 1$. As a result, there exists a nonzero $\lambda \in \mathbb{R}^v$ that is orthogonal to $L(\mathcal{G})$, i.e.,
$$\sum_{1 \le i \le v} \lambda_i g(s_i) = 0 \quad \text{for all } g \in \mathcal{G}. $$
Equivalently,
$$\sum_{i : \lambda_i \ge 0} \lambda_i g(s_i) = -\sum_{i : \lambda_i < 0} \lambda_i g(s_i). \tag{2}$$
Assume without loss of generality that $\{i : \lambda_i < 0\} \ne \emptyset$ (otherwise replace $\lambda$ by $-\lambda$). Consider the set $\{s_i : \lambda_i \ge 0\}$. Suppose by way of contradiction that there exists a $g \in \mathcal{G}$ such that $\{g \ge 0\} \cap \{s_1, \ldots, s_v\} = \{s_i : \lambda_i \ge 0\}$. For such a $g$, (2) cannot hold: the left-hand side is nonnegative, while each $i$ with $\lambda_i < 0$ has $g(s_i) < 0$, so the right-hand side is strictly negative.

The following theorem generalizes the classical Glivenko-Cantelli The-


orem to all classes of sets with polynomial discrimination, including those
that can be generated using the previous lemma.

Theorem 3.1 Let $X_i$, $i = 1, \ldots, n$ be an i.i.d. sequence of random variables with distribution $P$ on $S$. For every (suitably measurable) class $\mathcal{V}$ of subsets of $S$ with polynomial discrimination,
$$\sup_{V \in \mathcal{V}} |P_n V - P V| \to 0 \quad \text{a.s.}$$

We will break the proof of this result up into several steps.

Lemma 3.2 Let $\{Z(t) : t \in T\}$ and $\{Z'(t) : t \in T\}$ be independent stochastic processes. Suppose there exist $\alpha > 0$ and $\beta > 0$ such that $P\{|Z'(t)| \le \alpha\} \ge \beta$ for every $t \in T$. Then,
$$P\{\sup_{t \in T} |Z(t)| > \epsilon\} \le \frac{1}{\beta} P\{\sup_{t \in T} |Z(t) - Z'(t)| > \epsilon - \alpha\}.$$

Proof: Let $\tau$ be a random variable such that
$$\sup_{t \in T} |Z(t)| > \epsilon \implies |Z(\tau)| > \epsilon.$$
Note that
$$P\{|Z(\tau)| > \epsilon, |Z'(\tau)| \le \alpha\} = P\{|Z'(\tau)| \le \alpha \mid |Z(\tau)| > \epsilon\} P\{|Z(\tau)| > \epsilon\} \ge \beta P\{|Z(\tau)| > \epsilon\},$$
where the inequality follows because
$$P\{|Z'(\tau)| \le \alpha \mid |Z(\tau)| > \epsilon\} = E[P\{|Z'(\tau)| \le \alpha \mid Z\} \mid |Z(\tau)| > \epsilon] \ge \beta,$$
since $\tau$ is determined by $Z$ and $Z'$ is independent of $Z$, so that $P\{|Z'(\tau)| \le \alpha \mid Z\} \ge \beta$. Furthermore,
$$P\{|Z(\tau)| > \epsilon, |Z'(\tau)| \le \alpha\} \le P\{|Z(\tau) - Z'(\tau)| > \epsilon - \alpha\} \le P\{\sup_{t \in T} |Z(t) - Z'(t)| > \epsilon - \alpha\}.$$
The desired result thus follows.

Lemma 3.3 Let $X_i$, $i = 1, \ldots, n$ and $X_i'$, $i = 1, \ldots, n$ be independent i.i.d. sequences of random variables with distribution $P$ on $S$. For every (suitably measurable) class $\mathcal{V}$ of subsets of $S$,
$$P\{\sup_{V \in \mathcal{V}} |P_n V - P V| > \epsilon\} \le 2 P\{\sup_{V \in \mathcal{V}} |P_n V - P_n' V| > \tfrac{\epsilon}{2}\}$$
whenever $n \ge 8/\epsilon^2$.

Proof: By Chebyshev's Inequality,
$$P\{|P_n' V - P V| > \tfrac{\epsilon}{2}\} \le \frac{4}{\epsilon^2 n}.$$
Hence,
$$P\{|P_n' V - P V| \le \tfrac{\epsilon}{2}\} \ge \frac{1}{2}$$
whenever $n \ge 8/\epsilon^2$. We may therefore apply the preceding lemma to $Z(V) = P_n V - P V$ and $Z'(V) = P_n' V - P V$ with $\alpha = \epsilon/2$ and $\beta = 1/2$ to reach the desired conclusion.
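The Chebyshev step can be checked exactly in a simple case: for a fixed set $V$ with $P V = p$, the count $n P_n' V$ is Binomial$(n, p)$. A Python sketch (the values of $n$, $p$, and $\epsilon$ are arbitrary illustrations satisfying $n \ge 8/\epsilon^2$):

```python
import math

def tail_prob(n, p, eps):
    """Exact P{|P_n' V - P V| > eps/2} when n * P_n' V ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if abs(k / n - p) > eps / 2)

n, p, eps = 200, 0.3, 0.5          # n = 200 >= 8 / eps^2 = 32
print(tail_prob(n, p, eps))        # exact tail probability, far below the bound
print(4 / (eps**2 * n))            # Chebyshev bound from the proof: 0.08
```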

Lemma 3.4 Let $X_i$, $i = 1, \ldots, n$ and $X_i'$, $i = 1, \ldots, n$ be independent i.i.d. sequences of random variables with distribution $P$ on $S$. Independently, let $\sigma_i$, $i = 1, \ldots, n$ be an i.i.d. sequence of random variables with distribution putting equal mass on $-1$ and $1$. Let
$$P_n^o = \frac{1}{n} \sum_{1 \le i \le n} \sigma_i \delta_{X_i}.$$
For every (suitably measurable) class $\mathcal{V}$ of subsets of $S$,
$$P\{\sup_{V \in \mathcal{V}} |P_n V - P_n' V| > \tfrac{\epsilon}{2}\} \le 2 P\{\sup_{V \in \mathcal{V}} |P_n^o V| > \tfrac{\epsilon}{4}\}.$$

Proof: Note that the distributions of the random variables $I\{X_i \in V\} - I\{X_i' \in V\}$ for $i = 1, \ldots, n$ and $V \in \mathcal{V}$ and $\sigma_i[I\{X_i \in V\} - I\{X_i' \in V\}]$ for $i = 1, \ldots, n$ and $V \in \mathcal{V}$ are the same. To see this, simply consider the distribution conditional on $\sigma_i$, $i = 1, \ldots, n$. Then,
$$P\{\sup_{V \in \mathcal{V}} |P_n V - P_n' V| > \tfrac{\epsilon}{2}\} = P\{\sup_{V \in \mathcal{V}} |\tfrac{1}{n} \sum_{1 \le i \le n} \sigma_i [I\{X_i \in V\} - I\{X_i' \in V\}]| > \tfrac{\epsilon}{2}\}$$
$$\le P\{\sup_{V \in \mathcal{V}} |\tfrac{1}{n} \sum_{1 \le i \le n} \sigma_i I\{X_i \in V\}| > \tfrac{\epsilon}{4}\} + P\{\sup_{V \in \mathcal{V}} |\tfrac{1}{n} \sum_{1 \le i \le n} \sigma_i I\{X_i' \in V\}| > \tfrac{\epsilon}{4}\},$$
from which the desired result follows.

We are now almost prepared to prove the desired result, but to do so we


will require the following inequality due to Hoeffding for tail probabilities of
sums of independent, bounded random variables.

Lemma 3.5 Let $Y_i$, $i = 1, \ldots, n$ be independent random variables with zero mean and satisfying $a_i \le Y_i \le b_i$. For each $\eta > 0$,
$$P\{|\sum_{1 \le i \le n} Y_i| \ge \eta\} \le 2 \exp\Big(-2\eta^2 \big/ \sum_{1 \le i \le n} (b_i - a_i)^2\Big).$$
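Hoeffding's bound is easy to compare against simulation in the simplest relevant case: Rademacher variables $Y_i = \sigma_i \in \{-1, 1\}$, so that $a_i = -1$, $b_i = 1$ and the bound becomes $2\exp(-\eta^2/2n)$. A Python sketch (assuming NumPy; the constants $n$, $\eta$, and the number of trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta, trials = 200, 30.0, 20_000

# Rademacher variables: zero mean, a_i = -1, b_i = 1, so sum (b_i - a_i)^2 = 4n
# and Hoeffding's bound reads 2 * exp(-2 * eta^2 / (4n)) = 2 * exp(-eta^2 / (2n)).
sums = rng.choice([-1, 1], size=(trials, n)).sum(axis=1)
empirical = np.mean(np.abs(sums) >= eta)
bound = 2 * np.exp(-eta**2 / (2 * n))
print(empirical, bound)  # the empirical tail frequency sits below the bound
```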

Proof of Theorem 3.1: Let $\mathcal{X}_n = \{X_1, \ldots, X_n\}$ and consider
$$P\{\sup_{V \in \mathcal{V}} |P_n^o V| > \tfrac{\epsilon}{4} \mid \mathcal{X}_n\}.$$
Since $\mathcal{V}$ has polynomial discrimination, we may replace $\mathcal{V}$ with $\mathcal{V}^*$, which is a collection of at most $\phi(n)$ sets from $\mathcal{V}$. Note that $\mathcal{V}^*$ may depend on $\mathcal{X}_n$. Thus,
$$P\{\sup_{V \in \mathcal{V}} |P_n^o V| > \tfrac{\epsilon}{4} \mid \mathcal{X}_n\} \le \phi(n) \max_{V \in \mathcal{V}^*} P\{|P_n^o V| > \tfrac{\epsilon}{4} \mid \mathcal{X}_n\}.$$
Apply Hoeffding's inequality to $Y_i = \sigma_i I\{X_i \in V\}$ to conclude that
$$P\{|P_n^o V| > \tfrac{\epsilon}{4} \mid \mathcal{X}_n\} \le 2 \exp(-n\epsilon^2/32).$$
Integrating out with respect to $\mathcal{X}_n$ and using the previous lemmas, we have that
$$P\{\sup_{V \in \mathcal{V}} |P_n V - P V| > \epsilon\} \le 8 \phi(n) \exp(-n\epsilon^2/32).$$
It follows that
$$\sum_{1 \le n < \infty} P\{\sup_{V \in \mathcal{V}} |P_n V - P V| > \epsilon\} < \infty,$$
so the desired result follows from the Borel-Cantelli lemma.
