
Learning with a Slowly Changing Distribution

Peter L. Bartlett

Department of Electrical Engineering
University of Queensland
Queensland 4072 AUSTRALIA
bartlett@s1.elec.uq.oz.au

Abstract
In this paper, we consider the problem of learning a subset of a domain from randomly chosen examples, when the probability distribution of the examples changes slowly but continually throughout the learning process. We give upper and lower bounds on the best achievable probability of misclassification after a given number of examples. If $d$ is the VC-dimension of the target function class, $t$ is the number of examples, and $\gamma$ is the amount by which the distribution is allowed to change (measured by the largest change in the probability of a subset of the domain), the upper bound decreases as $d/t$ initially, and settles to $O(d^{2/3}\gamma^{1/3})$ for large $t$. The general lower bound on the probability of misclassification again decreases as $d/t$ initially, but settles to $\Omega(d^{1/2}\gamma^{1/2})$ for large $t$. These bounds give necessary and sufficient conditions on $\gamma$, the rate of change of the distribution of examples, to ensure that some learning algorithm can produce an acceptably small probability of misclassification. We also consider the case of learning a near-optimal subset of the domain when the examples and their labels are generated by a joint probability distribution on the example and label spaces. We give an upper bound on $\gamma$ that ensures learning is possible from a finite number of examples.

1 INTRODUCTION
In this paper, we examine the problem of learning a subset of a domain from randomly chosen examples when the distribution of examples changes as learning proceeds. We are interested in how upper bounds on learning curves (graphs of error probability versus number of examples) vary with the amount by which the distribution is allowed to change. For many learning problems, we can expect the distribution of examples (and the target function) to change over time. Consider a learning system in a telecommunications network that aims to avoid network congestion by controlling the admission of calls. The distribution of inputs (and the optimal decision function) for such a system will change with time as the network usage changes, as the channel characteristics (and therefore error rates) change, and as parts of the network fail.

We consider two models of learning. The first is similar to Haussler, Littlestone and Warmuth's prediction model [HLW90]: the aim of learning is to minimize the probability, over all sequences of examples, of misclassifying the last example. The second is a more general model that allows noise and errors in the examples. In both cases, the distribution is allowed to change slowly throughout the learning process. The amount by which the distribution changes is measured by the largest change in the probability of a subset of the domain.

In [Kra88], Kramer presents a related model of learning, in which the distribution is allowed to drift. However, in Kramer's model, when the learning system is presented with an example it can choose either to see the classification of the example or to guess its classification (using a hypothesis from a particular class of hypotheses). The aim is for the algorithm to guess the label only if its hypothesis is accurate with high probability (taken over all sequences of random examples, as in Valiant's pac model [Val84]). Kramer is concerned with the minimum number of labelled examples that a successful algorithm of this type must store. In contrast, the results presented here give bounds on the misclassification probability for an optimal algorithm as a function of the number of examples and the amount of distribution drift.

Helmbold and Long [HL91] consider learning a slowly changing subset of the domain, when the distribution of examples is constant. This problem, and the problem of learning a fixed subset with a changing distribution, are two special cases of a model of learning in which the labelled examples are described by a slowly changing joint distribution on the input and output spaces. We examine this more general model in Section 5.

The paper is organized as follows. In Section 2, we present some notation, formally define the learning model we use, and compare some natural definitions of the distance between distributions. In Section 3, we give upper bounds on the probability of misclassification for two general-purpose algorithms: the one-inclusion graph prediction strategy (presented in [HLW90]), and a consistent hypothesis finder. Section 4 gives a general lower bound on the probability of misclassification. In Section 5, we apply the techniques used in Section 3 to the problem of learning an optimal classification function when the joint distribution on the input and output spaces varies slowly as learning proceeds. In Section 6, we summarize the results and mention some possible extensions.

2 DEFINITIONS AND NOTATION

If $D$ is a distribution on a set $X$ and $P(x)$ is a proposition about $x \in X$, then we denote the probability that $P(x)$ is true when $x$ is chosen according to $D$ by $\Pr_{x \in D}(P(x)) = D\{x \in X : P(x)\}$. Similarly, if $f$ is a real-valued function defined on $X$, then $E_{x \in D}(f(x))$ represents the expectation of $f(x)$ when $x$ is chosen according to $D$,
$$E_{x \in D}(f(x)) = \int_X f(x)\,dD(x).$$
We sometimes use $E_D(f) = E_{x \in D}(f(x))$ when the meaning is clear from the context. We assume throughout that every set is measurable ([HL91] gives a supporting argument, claiming that in practice the domain we consider is countable; [BEHW89] gives sufficient conditions for the assumption when the domain is $\mathbb{R}^n$).

If $x = (x_1, x_2, \ldots, x_t) \in X^t$ and $\sigma$ is a permutation on $\{1, 2, \ldots, t\}$, define $\sigma x = (x_{\sigma(1)}, x_{\sigma(2)}, \ldots, x_{\sigma(t)})$.

Given a set $X$ and a set $F$ of functions that map from $X$ to $\{0,1\}$, we say that $F$ shatters the finite subset $S \subseteq X$ if the functions in $F$ induce all possible dichotomies of $S$,
$$|\{\{x \in S : f(x) = 1\} : f \in F\}| = 2^{|S|}.$$
The Vapnik-Chervonenkis dimension (VC-dimension) of $F$ is the size of the largest shattered subset of $X$,
$$\mathrm{VCdim}(F) = \max\{m : \exists S \subseteq X,\ |S| = m \text{ and } F \text{ shatters } S\}$$
(see [VC71]).

The learning model described here is similar to the prediction model of learning described in [HLW90]. We have a domain $X$, a class $F$ of functions that map from $X$ to $\{0,1\}$ (the target class), and a target function $f$ in $F$, the function we are trying to learn. At each learning trial, an example $x$ is randomly chosen from $X$. The learning algorithm tries to predict the value of $f(x)$ (the label of $x$). The algorithm is then told the label, and the process is repeated. A sequence $x = (x_1, x_2, \ldots, x_t) \in X^t$ of examples is called a sample. A labelled sample is a sequence $((x_1, f(x_1)), \ldots, (x_t, f(x_t)))$ of labelled examples. For sample $x = (x_1, \ldots, x_t) \in X^t$ and target function $f$ in $F$, define the labelled sample of $f$ generated by $x$ as
$$\mathrm{sam}_t(x, f) = ((x_1, f(x_1)), \ldots, (x_t, f(x_t))).$$
Instead of assuming that each example is chosen independently from a single distribution on $X$, we assume that each example $x_i$ is drawn from a (possibly distinct) distribution $P_i$. The sequence of distributions $\langle P_i \rangle$ is intended to describe the change in the relative frequency of examples as learning proceeds. To quantify that change, we need some definition of the distance between two distributions.

2.1 COMPARING DISTRIBUTIONS

We assume that there is a $\sigma$-field $\mathcal{F}$ of subsets of $X$ on which the probability distributions $P_i$ are defined. We define the distance between distributions $P_1$ and $P_2$ as the largest change in the probability of a subset in $\mathcal{F}$.

Definition 1 The distance $d(P_1, P_2)$ between two distributions $P_1$ and $P_2$ is
$$d(P_1, P_2) = \sup_{E \in \mathcal{F}} |P_1(E) - P_2(E)|.$$
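For a finite domain, Definition 1 can be evaluated directly by enumerating all events. The following is a minimal sketch (Python; the function name is ours, not from the paper) that does so; it will agree with the closed forms derived below.

```python
from itertools import combinations

def d_bruteforce(p, q):
    """Definition 1 for discrete distributions on {0, ..., n-1}:
    the largest change in probability over all 2^n events E."""
    n = len(p)
    best = 0.0
    for r in range(n + 1):
        for event in combinations(range(n), r):
            diff = abs(sum(p[i] for i in event) - sum(q[i] for i in event))
            best = max(best, diff)
    return best

print(d_bruteforce([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))  # 0.1, attained at E = {0}
```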

We can also define this distance using a signed measure on the measurable space $(X, \mathcal{F})$, defined as $\nu = P_1 - P_2$. For this signed measure, choose a partition $\{A, B\}$ of $X$ for which $\nu$ is positive in $A$ and negative in $B$, and define two measures on $(X, \mathcal{F})$ (the upper and lower variations of $\nu$),
$$\nu^+(E) = \nu(E \cap A) \quad\text{and}\quad \nu^-(E) = -\nu(E \cap B)$$
for $E \in \mathcal{F}$. Clearly $\nu^+, \nu^- \ge 0$, and $\nu = \nu^+ - \nu^-$. Using this representation (the Jordan decomposition of $\nu$), the distance $d$ is given by the following result.

Proposition 2
$$d(P_1, P_2) = \nu^+(X).$$

Proof Since $\nu^+$ and $\nu^-$ are measures, $\sup_{E \in \mathcal{F}} \nu^+(E) = \nu^+(X)$ and $\sup_{E \in \mathcal{F}} \nu^-(E) = \nu^-(X)$. But $\nu^+(X) - \nu^-(X) = \nu(X) = 0$, so $\nu^+(X) = \nu^-(X)$. Thus
$$\sup_{E \in \mathcal{F}} |\nu^+(E) - \nu^-(E)| = \nu^+(X).$$

The measure $|\nu|$ defined by $|\nu| = \nu^+ + \nu^-$ is called the total variation of $\nu$.

To prove upper bounds on the mistake probability under a drifting distribution, we will use the following result. It bounds the difference between the expectations of a $[0,1]$-valued random variable under two distributions that are close in the distance $d$.

Lemma 3 Consider two probability distributions $P_1$ and $P_2$ on the measurable space $(X, \mathcal{F})$ that satisfy
$$d(P_1, P_2) \le \gamma, \tag{1}$$
where $0 < \gamma \le 1$. If $f$ is an $\mathcal{F}$-measurable function from $X$ to $[0,1]$, then
$$|E_{P_1}(f) - E_{P_2}(f)| \le \gamma. \tag{2}$$

Proof Define the signed measure $\nu = P_1 - P_2$ as above. By definition,
$$|E_{P_1}(f) - E_{P_2}(f)| = \left|\int_X f\,dP_1 - \int_X f\,dP_2\right| \tag{3}$$
$$= \left|\int_X f\,d\nu\right| \tag{4}$$
$$= \left|\int_X f\,d\nu^+ - \int_X f\,d\nu^-\right|. \tag{5}$$
Now, $0 \le \int_X f\,d\nu^+ \le \nu^+(X)$ for $0 \le f \le 1$ (see [Hal50], p. 124), and similarly for $\nu^-$, so
$$|E_{P_1}(f) - E_{P_2}(f)| \le \nu^+(X) = d(P_1, P_2) \le \gamma.$$

2.1.1 Other Distances

This section examines two other commonly used ways of defining the distance between distributions, and compares them with $d$. In the definitions in this section, let $(X, \mathcal{F}, P)$ and $(X, \mathcal{F}, Q)$ be probability spaces.

Definition 4 The total variation distance between $P$ and $Q$ is
$$d_V(P, Q) = |\nu|(X),$$
where $|\nu|$ is the total variation of the signed measure $\nu = P - Q$.

Suppose $P$ and $Q$ are discrete distributions with supports in the set $\{x_1, x_2, \ldots, x_n\} \subseteq X$, with $P(x_i) = p_i$ and $Q(x_i) = q_i$ for $i = 1, 2, \ldots, n$. Then the definition reduces to
$$d_V(P, Q) = \sum_{i=1}^n |p_i - q_i|.$$

Proposition 5 The total variation distance $d_V$ is related to $d$ by
$$d = d_V/2.$$

Proof Using the signed measure $\nu$ defined above, we have
$$d_V = |\nu|(X) = \nu^+(X) + \nu^-(X) = 2\nu^+(X) = 2d.$$

A natural definition of the distance between two distributions is the Kullback-Leibler divergence.

Definition 6 The Kullback-Leibler divergence of $P$ with respect to $Q$ is
$$d_{KL}(P, Q) = \int_X p(\omega) \log\frac{p(\omega)}{q(\omega)}\,d\mu(\omega),$$
where $\mu$ is a measure on $(X, \mathcal{F})$ such that $P$ and $Q$ are absolutely continuous with respect to $\mu$, and $p$ and $q$ are the Radon-Nikodym derivatives of $P$ and $Q$ with respect to $\mu$, $p = dP/d\mu$, $q = dQ/d\mu$. Notice that $d_{KL}(P, Q)$ is not a symmetric function of its arguments.

If $P$ and $Q$ are the discrete distributions defined above, this definition reduces to
$$d_{KL}(P, Q) = \sum_{i=1}^n p_i \log\frac{p_i}{q_i},$$
with the conventions $0 \log 0 = 0$ and $\log(0/0) = 1$. The quantity $d_{KL}(P, Q)$ is also known as the information of order 1 of $P$ with respect to $Q$. It can be interpreted as the amount of information obtained from observing an event $E$ for which $P(\cdot) = Q(\cdot \mid E)$ (see [Ren61]).

The following proposition shows that a bound on $d_{KL}$ is a stronger requirement than a bound on $d$.

Proposition 7 The Kullback-Leibler divergence $d_{KL}$ is related to $d$ by
$$d^2 \le d_{KL}/2.$$
Moreover, there are distributions $P$ and $Q$ for which $d(P, Q) \le \gamma$ but $d_{KL}(P, Q) = \infty$, for $0 < \gamma \le 1$.

Proof Kullback [Kul67] shows that $d_{KL} \ge d_V^2/2 + d_V^4/12$. Proposition 5 gives the desired inequality. To see that $d$ does not provide an upper bound on $d_{KL}$, consider the distributions $P$ and $Q$ and the set $\{x_1, x_2\} \subseteq X$, with $P(x_1) = 1 - \gamma$, $P(x_2) = \gamma$, $Q(x_1) = 1$, and $P$ and $Q$ zero elsewhere. Clearly $d(P, Q) = \gamma$, but $d_{KL}(P, Q) = \infty$.

This proposition implies that, if we use $d_{KL}$ instead of $d$ to measure the change in the distribution of examples, the upper bounds described in Sections 3 and 5 are still applicable.
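For discrete distributions these quantities are straightforward to compute. The sketch below (Python; helper names are ours) computes $d$ via the identity $d = d_V/2$ of Proposition 5, matching the brute-force `d_bruteforce` above, and checks the bound $d^2 \le d_{KL}/2$ of Proposition 7, including the infinite-divergence example from its proof.

```python
import math

def d_distance(p, q):
    # d(P, Q) = sup_E |P(E) - Q(E)| = (1/2) * sum_i |p_i - q_i| (Propositions 2 and 5):
    # the supremum is attained by the set A on which p_i > q_i.
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def d_kl(p, q):
    # Kullback-Leibler divergence of Definition 6, for discrete distributions;
    # infinite when P puts mass where Q does not.
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue            # 0 log 0 = 0
        if qi == 0.0:
            return math.inf     # P is not absolutely continuous w.r.t. Q
        total += pi * math.log(pi / qi)
    return total

p, q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]
assert d_distance(p, q) ** 2 <= d_kl(p, q) / 2     # Proposition 7

# The example from the proof of Proposition 7: d is small but d_KL is infinite.
gamma = 0.01
print(d_distance([1 - gamma, gamma], [1.0, 0.0]))  # gamma
print(d_kl([1 - gamma, gamma], [1.0, 0.0]))        # inf
```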

2.2 THE DEFINITION OF LEARNING

We restrict the amount by which the distribution can change between examples by bounding the distance $d$ between consecutive distributions.

Definition 8 (Admissible Distribution Sequence) Suppose $(X, \mathcal{F}, P_i)$ is a probability space, for $i = 1, 2, \ldots, t$, $t > 0$. The sequence $\langle P_i \rangle_{i=1}^t$ is in the class $\mathcal{D}_\gamma^t$ ($0 \le \gamma < 1$) of $\gamma$-admissible distribution sequences if $d(P_i, P_{i+1}) \le \gamma$ for $i = 1, \ldots, t-1$.

The learning algorithms we consider are prediction strategies (see [HLW90]).

Definition 9 (Prediction Strategy) Consider an input space $X$ and a class $F$ of functions from $X$ to $\{0,1\}$, and define the space of labelled examples $S = X \times \{0,1\}$ and the space of finite length labelled samples $S^* = \bigcup_{m \in \mathbb{N}} S^m$. A deterministic prediction strategy $Q$ for $F$ is a function from $S^* \times X$ to $\{0,1\}$. A randomized prediction strategy $(Q_r, Z, D)$ for $F$ consists of a function $Q_r$, a space $Z$, and a distribution $D$ on $Z$. The strategy chooses a point $z \in Z$ according to $D$, and passes $z$ to the function $Q_r$, which maps from $S^* \times X \times Z$ to $\{0,1\}$.

We define the mistake probability of a prediction strategy as follows.

Definition 10 (Mistake Probability) For sample $x = (x_1, \ldots, x_t) \in X^t$ ($t \ge 1$), $f \in F$, and deterministic prediction strategy $Q$, define the mistake of $Q$ on $x$ with respect to $f$ as
$$M_{Q,f}^t(x) = \begin{cases} 1 & \text{if } Q(\mathrm{sam}_{t-1}((x_1, \ldots, x_{t-1}), f), x_t) \ne f(x_t) \\ 0 & \text{otherwise.} \end{cases}$$
For a randomized prediction strategy $(Q_r, Z, D)$, define
$$M_{Q,f}^t(x) = D\{z \in Z : Q_r(\mathrm{sam}_{t-1}((x_1, \ldots, x_{t-1}), f), x_t, z) \ne f(x_t)\}.$$
For a distribution sequence $\langle P_i \rangle_{i=1}^t$ on $X$, define the mistake probability of $Q$ with respect to $f$ as $E_{x \in \langle P_i \rangle}(M_{Q,f}^t(x))$.

We want this probability to be small for all distributions and all target functions.

Definition 11 (($\epsilon$, $\gamma$)-Prediction) Consider a class $F$ of functions and a prediction strategy $Q$ for $F$. If $f$ is in $F$, $\gamma \ge 0$ and $t > 0$, let $\hat{M}_{Q,f,\gamma}(t)$ be the supremum over all $\gamma$-admissible distribution sequences $\langle P_i \rangle$ on $X$ of the mistake probability,
$$\hat{M}_{Q,f,\gamma}(t) = \sup_{\langle P_i \rangle \in \mathcal{D}_\gamma^t} E_{\langle P_i \rangle}\left(M_{Q,f}^t\right).$$
Define the mistake bound
$$\hat{M}_{Q,F,\gamma}(t) = \sup_{f \in F} \hat{M}_{Q,f,\gamma}(t).$$
We say that $Q$ can $(\epsilon, \gamma)$-predict $F$ if $\hat{M}_{Q,F,\gamma}(t) < \epsilon$ for some finite $t$.

3 UPPER BOUNDS

In this section, we give mistake bounds for function classes of finite VC-dimension. To obtain such bounds in the constant-distribution case, we can use the observation that permuting the examples in a sample does not affect the mistake probability, since the distribution on $X^t$ is a product distribution. This allows us to relate the mistake probability to an average of mistakes over a set of permutations (see [BEHW89, HLW90, Vap82]). The proof we use here is similar, but the distribution on $X^t$ is not a product distribution. We proceed by bounding how far the mistake probabilities are from expectations under some product distribution. We can then use the permutation device to bound this expectation. Notice that it does not matter which product distribution we use to bound the mistake probability.

Lemma 12 If $\langle P_i \rangle_{i=1}^k$ is a $\gamma$-admissible distribution sequence on $X$ (where $0 < \gamma \le 1$) and $f$ is a measurable function from $X^k$ to $[0,1]$ (with $k \ge 1$), then
$$E_{x \in \langle P_i \rangle_{i=1}^k}(f(x)) \le E_{x \in P_1^k}(f(x)) + \frac{k(k-1)\gamma}{2}, \tag{6}$$
and
$$E_{x \in \langle P_i \rangle_{i=1}^k}(f(x)) \le E_{x \in P_k^k}(f(x)) + \frac{k(k-1)\gamma}{2}. \tag{7}$$
Proof We are interested in the expectation
$$E_{\langle P_i \rangle_{i=1}^k}(f) = \int_X \cdots \int_X f(x_1, \ldots, x_k)\,dP_1(x_1)\cdots dP_k(x_k) = \int_{X^{k-2}} \left[\int_X\!\!\int_X f\,dP_1(x_1)\,dP_2(x_2)\right] dP_3(x_3)\cdots dP_k(x_k).$$
Fix $x_3, x_4, \ldots, x_k$ and consider the inner integral,
$$\int_X\!\!\int_X f\,dP_1(x_1)\,dP_2(x_2) = E_{x_2 \in P_2}\left(\int_X f\,dP_1(x_1)\right).$$
Call the random variable inside the parentheses $I(x_2)$. Notice that $0 \le I \le 1$, so Lemma 3 gives
$$E_{x_2 \in P_2}(I(x_2)) \le E_{x_2 \in P_1}(I(x_2)) + \gamma = \int_{X^2} f\,dP_1^2(x_1, x_2) + \gamma.$$
Therefore
$$E_{\langle P_i \rangle_{i=1}^k}(f) \le \int_{X^{k-2}} \int_{X^2} f\,dP_1^2(x_1, x_2)\,dP_3(x_3)\cdots dP_k(x_k) + \gamma.$$
Similarly,
$$E_{\langle P_i \rangle_{i=1}^k}(f) \le \int_{X^{k-3}} \int_{X^3} f\,dP_1^3(x_1, x_2, x_3)\,dP_4(x_4)\cdots dP_k(x_k) + \gamma + 2\gamma,$$
and, continuing in this way,
$$E_{\langle P_i \rangle_{i=1}^k}(f) \le \int_{X^k} f\,dP_1^k(x_1, \ldots, x_k) + \sum_{i=1}^{k-1} i\gamma = E_{P_1^k}(f) + \frac{k(k-1)\gamma}{2},$$
which is Inequality (6). The same argument with the labels of $P_1, \ldots, P_k$ reversed gives Inequality (7).

We can relate the mistake bound to a certain permutation mistake bound. We will use the set of permutations on $\{1, \ldots, t\}$ that swap $t$ with one of the elements of $\{t-(k-1), \ldots, t\}$ and leave the other elements unchanged. Call this class of permutations $\Gamma_{t,k}$. Formally,
$$\Gamma_{t,k} = \{\sigma_i : i = t-k+1, \ldots, t\},$$
where
$$\sigma_i(j) = \begin{cases} i & j = t \\ t & j = i \\ j & \text{otherwise.} \end{cases}$$
Now define the permutation mistake bound
$$\hat{M}_{Q,F}(t, k) = \sup\left\{\frac{1}{|\Gamma_{t,k}|}\sum_{\sigma \in \Gamma_{t,k}} M_{Q,f}^t(\sigma x) : f \in F,\ x \in X^t\right\}$$
for $t = 1, 2, \ldots$ and $k = 1, 2, \ldots, t$. We can relate this bound to the mistake bound as follows.

Theorem 13 Consider a prediction strategy $Q$ for function class $F$, with permutation mistake bound $\hat{M}_{Q,F}(t, k)$. For this prediction strategy,
$$\hat{M}_{Q,F,\gamma}(t) \le \hat{M}_{Q,F}(t, k) + \frac{k(k-1)\gamma}{2} \tag{8}$$
for $k = 1, 2, \ldots, t$ and $0 < \gamma \le 1$.

We will use the following lemma ([HLW90], Lemma 2.1) involving permutations of components of a random vector under a product distribution.

Lemma 14 Consider an input space $X$, a distribution $P$ on $X$, a real-valued function $\psi$ defined on $X^t$, and any set $\Gamma$ of permutations on $\{1, \ldots, t\}$. Then
$$E_{P^t}(\psi) = E_{x \in P^t}\left(\frac{1}{|\Gamma|}\sum_{\sigma \in \Gamma} \psi(\sigma x)\right).$$

Proof (of Theorem 13) For any function $f$ in $F$, Lemma 12 implies
$$E_{\langle P_i \rangle_{i=1}^t}\left(M_{Q,f}^t\right) = \int_{X^t} M_{Q,f}^t(x)\,dP_1(x_1)\cdots dP_t(x_t) \le \int_{X^{t-k}} \int_{X^k} M_{Q,f}^t(x)\,dP_{t-(k-1)}^k(x_{t-(k-1)}, \ldots, x_t)\,dP_1(x_1)\cdots dP_{t-k}(x_{t-k}) + \frac{k(k-1)\gamma}{2}.$$
Fix $x_1, x_2, \ldots, x_{t-k}$, and consider the inner integral,
$$I = \int_{X^k} M_{Q,f}^t(x)\,dP_{t-(k-1)}^k(x_{t-(k-1)}, \ldots, x_t) = \int_{X^k} \frac{1}{|\Gamma_{t,k}|}\sum_{\sigma \in \Gamma_{t,k}} M_{Q,f}^t(\sigma x)\,dP_{t-(k-1)}^k(x_{t-(k-1)}, \ldots, x_t)$$
by Lemma 14, so
$$I \le \int_{X^k} \hat{M}_{Q,F}(t, k)\,dP_{t-(k-1)}^k(x_{t-(k-1)}, \ldots, x_t) = \hat{M}_{Q,F}(t, k).$$
Therefore, for any function $f$ in $F$,
$$\hat{M}_{Q,f,\gamma}(t) = \sup_{\langle P_i \rangle} E_{\langle P_i \rangle_{i=1}^t}\left(M_{Q,f}^t\right) \le \hat{M}_{Q,F}(t, k) + \frac{k(k-1)\gamma}{2},$$
and so
$$\hat{M}_{Q,F,\gamma}(t) = \sup_{f \in F} \hat{M}_{Q,f,\gamma}(t) \le \hat{M}_{Q,F}(t, k) + \frac{k(k-1)\gamma}{2},$$
which is Inequality (8).
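Lemma 14 is just the statement that a product distribution is exchangeable over the permuted coordinates. A quick Monte Carlo check (Python; the test function is an arbitrary choice of ours) illustrates it for the class $\Gamma_{t,k}$:

```python
import random

random.seed(0)
t, k, n = 5, 3, 200000

def sigma(i, x):
    # swap coordinates i and t (1-indexed), as in the definition of Gamma_{t,k}
    y = list(x)
    y[i - 1], y[t - 1] = y[t - 1], y[i - 1]
    return y

def psi(x):
    # an arbitrary test function on X^t
    return float(x[-1] > 0.5) * x[0]

lhs = rhs = 0.0
for _ in range(n):
    x = [random.random() for _ in range(t)]   # x drawn from the product P^t
    lhs += psi(x)
    rhs += sum(psi(sigma(i, x)) for i in range(t - k + 1, t + 1)) / k
print(lhs / n, rhs / n)   # the two averages agree, as Lemma 14 predicts
```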

3.1 AN UPPER BOUND FOR THE ONE-INCLUSION GRAPH PREDICTION STRATEGY

In [HLW90], a general-purpose deterministic prediction strategy, the one-inclusion graph prediction strategy, is described. Call this strategy $Q_1$. Using the same argument as the proof of Theorem 2.2 in [HLW90], the one-inclusion graph strategy for a function class $F$ can make a total of no more than $2\,\mathrm{VCdim}(F)$ mistakes over the $k$ permutations in $\Gamma_{t,k}$, so
$$\hat{M}_{Q_1,F}(t, k) \le \frac{2\,\mathrm{VCdim}(F)}{k}.$$
This result and Theorem 13 give the mistake bound
$$\hat{M}_{Q_1,F,\gamma}(t) \le \frac{2\,\mathrm{VCdim}(F)}{k} + \frac{k(k-1)\gamma}{2} \tag{9}$$
for $k = 1, 2, \ldots, t$. By choosing the value of $k$ appropriately, we get the following bounds.

Theorem 15 For any function class $F$ with VC-dimension $d$, $1 \le d < \infty$, there is a prediction strategy $Q$ such that the mistake probability satisfies
$$\hat{M}_{Q,F,\gamma}(t) \le \begin{cases} \dfrac{2d}{t} + \left(\dfrac{d^2\gamma}{2}\right)^{1/3} & \text{if } t \le \left(\dfrac{2d}{\gamma}\right)^{1/3} \\[2ex] 4\left(d^2\gamma\right)^{1/3} & \text{if } t > \left(\dfrac{2d}{\gamma}\right)^{1/3}, \end{cases}$$
where $0 < \gamma \le 1$. If $t > 5d/(2\epsilon)$ and $\gamma < \epsilon^3/(64d^2)$, then $\hat{M}_{Q,F,\gamma}(t) < \epsilon$ for this strategy.

Proof We will show that the statement is true for the one-inclusion graph strategy, $Q_1$. We have
$$\hat{M}_{Q_1,F,\gamma}(t) \le \frac{2d}{k} + \frac{k(k-1)\gamma}{2}$$
for $k = 1, 2, \ldots, t$. The right-hand side of this inequality is less than the function $F(k) = 2d/k + k^2\gamma/2$. Using elementary calculus, it can be shown that there is a $k < (2d/\gamma)^{1/3}$ such that $F(k) < 4(d^2\gamma)^{1/3}$. This shows that
$$\hat{M}_{Q,F,\gamma}(t) < 4\left(d^2\gamma\right)^{1/3} \tag{10}$$
if $t > (2d/\gamma)^{1/3}$. If $t \le (2d/\gamma)^{1/3}$, $k = t$ provides the best bound,
$$\hat{M}_{Q,F,\gamma}(t) < \frac{2d}{t} + \left(\frac{d^2\gamma}{2}\right)^{1/3}. \tag{11}$$
To verify the sufficient conditions for $(\epsilon, \gamma)$-prediction, suppose $\gamma < \epsilon^3/(64d^2)$. Then
$$4(d^2\gamma)^{1/3} < \epsilon \tag{12}$$
and
$$(d^2\gamma/2)^{1/3} < \epsilon/128^{1/3} < \epsilon/5. \tag{13}$$
If, in addition, $t > 5d/(2\epsilon)$,
$$2d/t < 4\epsilon/5. \tag{14}$$
So, if the conditions on $t$ and $\gamma$ in the theorem are satisfied and $Q$ is $Q_1$, the one-inclusion graph strategy, then either $t > (2d/\gamma)^{1/3}$, in which case Inequalities (12) and (10) imply $\hat{M}_{Q,F,\gamma}(t) < \epsilon$, or $t \le (2d/\gamma)^{1/3}$, in which case Inequalities (13), (14), and (11) imply $\hat{M}_{Q,F,\gamma}(t) < \epsilon$, which is the desired result.
3.2 UPPER BOUNDS FOR CONSISTENT PREDICTION STRATEGIES

While the results in the previous section give general upper bounds on the mistake probability, the prediction strategy for which the bounds were derived (the one-inclusion graph strategy) may be inefficient, because its computational complexity can grow as much as exponentially with the VC-dimension of the function class $F$ [HLW90]. In this section, we consider consistent prediction strategies. These strategies make predictions using consistent hypotheses chosen from a particular hypothesis class $H$. A hypothesis $h$ is consistent with labelled sample $((x_1, y_1), \ldots, (x_t, y_t)) \in S^t$ if $h(x_i) = y_i$ for $i = 1, 2, \ldots, t$. If a function class $F$ is efficiently pac-learnable (that is, learnable in polynomial time), then there is an efficient randomized consistent hypothesis finder (and hence an efficient randomized consistent prediction strategy) for $F$ ([HKLW88], Theorem 4.1).

We use a bound on the probability that a consistent deterministic strategy makes a mistake on the last example.

Lemma 16 If $H$ is a set of functions from $X$ to $\{0,1\}$ with $\mathrm{VCdim}(H) = d \ge 1$, $P$ is any distribution on $X$, and $Q$ is a consistent prediction strategy that uses $H$, then for any $k > d + 1$,
$$E_{P^k}\left(M_{Q,f}^k(x_1, \ldots, x_k)\right) \le \frac{2(d+1)}{k-1}\log_2\frac{4e(k-1)}{d}.$$

This result appears in the proof of Theorem 4.1 in [HLW90]. It is based on Theorem A2.1 in [BEHW89] and Sauer's Lemma ([BEHW89], Proposition A2.1). Instead, we could use the corresponding exponential bound in Theorem 3.12 of [ABST90], since it has better constants; however, this would complicate the statement and proof of the following theorem.

Theorem 17 For any hypothesis class $H$ with $\mathrm{VCdim}(H) = d$ and $1 \le d < \infty$, any consistent prediction strategy $Q$ using $H$ has prediction error satisfying
$$\hat{M}_{Q,F,\gamma}(t) < \begin{cases} \dfrac{4(d+1)}{t}\log_2\dfrac{8e}{(d^2\gamma)^{1/3}} + 2\left(d^2\gamma\right)^{1/3} & \text{if } d+2 \le t < 2\left(\dfrac{d}{\gamma}\right)^{1/3} \\[2ex] 19\left(d^2\gamma\right)^{1/3}\log_2\dfrac{8e}{(d^2\gamma)^{1/3}} & \text{if } t \ge 2\left(\dfrac{d}{\gamma}\right)^{1/3}, \end{cases}$$
where $0 < \gamma < 4/(d+1)^2$.

Proof Obviously, if $Q$ uses a hypothesis that is consistent with all $t-1$ labelled examples, then that hypothesis is also consistent with the last $k$ examples, where $k \le t-1$. Using this fact, we will find bounds on the probability of a mistake by considering the last $k$ examples in the sample. If $\langle P_i \rangle_{i=1}^t$ is a $\gamma$-admissible distribution sequence, Lemma 12 implies that
$$E_{\langle P_i \rangle_{i=1}^t}\left(M_{Q,f}^t\right) = \int_{X^t} M_{Q,f}^t(x_1, \ldots, x_t)\,dP_1(x_1)\cdots dP_t(x_t) \le \int_{X^{t-k}} E_{P_t^k}\left(M_{Q,f}^t(x_1, \ldots, x_t)\right) dP_1(x_1)\cdots dP_{t-k}(x_{t-k}) + \frac{k(k-1)\gamma}{2}$$
for any $1 \le k \le t$, where the inner expectation is over the last $k$ examples. Now, for any $k$ satisfying $d+1 < k \le t$ and any $x_1, \ldots, x_{t-k}$, Lemma 16 implies that
$$E_{P_t^k}\left(M_{Q,f}^t(x_1, \ldots, x_t)\right) \le \frac{2(d+1)}{k-1}\log_2\frac{4e(k-1)}{d},$$
so
$$E_{x \in \langle P_i \rangle_{i=1}^t}\left(M_{Q,f}^t(x)\right) \le \frac{2(d+1)}{k-1}\log_2\frac{4e(k-1)}{d} + \frac{k(k-1)\gamma}{2}$$
for $k = d+2, \ldots, t$. It follows that
$$\hat{M}_{Q,F,\gamma}(t) \le \frac{2(d+1)}{k-1}\log_2\frac{4e(k-1)}{d} + \frac{k(k-1)\gamma}{2} < \frac{4(d+1)}{k}\log_2\frac{4ek}{d} + \frac{k^2\gamma}{2} \tag{15}$$
$$< \left(\frac{4(d+1)}{k} + \frac{k^2\gamma}{2}\right)\log_2\frac{4ek}{d},$$
where the last two inequalities hold because $k > d/(2e)$. As in the proof of the mistake bounds for $Q_1$, we choose $k$ to give the best bound. Using elementary calculus, it can be shown that
$$\hat{M}_{Q,F,\gamma}(t) < 19\left(d^2\gamma\right)^{1/3}\log_2\frac{8e}{(d^2\gamma)^{1/3}},$$
provided $t \ge 2(d/\gamma)^{1/3}$ and $(d+1)^2\gamma < 4$. This gives the second bound in the theorem. When $d+2 \le t < 2(d/\gamma)^{1/3}$, Inequality (15) with $k = t$ gives
$$\hat{M}_{Q,F,\gamma}(t) < \frac{4(d+1)}{t}\log_2\frac{8e}{(d^2\gamma)^{1/3}} + 2\left(d^2\gamma\right)^{1/3},$$
which is the first bound in the theorem. Notice that this bound has the same shape as the upper bound for the one-inclusion graph strategy (Theorem 15), but with an additional $\log\left(1/(d^2\gamma)\right)$ factor.
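For a simple class, a consistent prediction strategy is a few lines of code. A minimal sketch (Python; names and the simulation are ours, not from the paper) for thresholds on the line, a class of VC-dimension 1:

```python
import random

def threshold_consistent_predict(labelled_examples, x):
    # A consistent deterministic prediction strategy for the class of
    # thresholds on the line, F = { f_a : f_a(x) = 1 iff x >= a }.
    ones = [xi for xi, yi in labelled_examples if yi == 1]
    if not ones:
        return 0                 # predicting 0 everywhere is consistent so far
    a = min(ones)                # smallest threshold consistent with the 1-labels
    return 1 if x >= a else 0    # realizable data keeps every 0-label below a

# usage: predict each new example's label, then observe it (the prediction model)
random.seed(1)
target_a, history, mistakes = 0.5, [], 0
for trial in range(1000):
    lo = min(trial * 1e-4, 0.4)              # a slowly drifting example distribution
    x = random.uniform(lo, 1.0)
    y = 1 if x >= target_a else 0
    mistakes += threshold_consistent_predict(history, x) != y
    history.append((x, y))
print(mistakes)
```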

4 LOWER BOUNDS

To find a lower bound on the mistake probability for any prediction strategy, we construct a `nasty' admissible distribution sequence.

Theorem 18 For any function class $F$ with VC-dimension $d$ such that $3 \le d < \infty$, and for any prediction strategy $Q$,
$$\hat{M}_{Q,F,\gamma}(t) \ge \begin{cases} \dfrac{d-1}{2et} & \text{for all } t \\[1.5ex] \dfrac{\sqrt{(d-2)\gamma}}{4e} & \text{for } t \ge \sqrt{\dfrac{d-2}{\gamma}}. \end{cases}$$
No prediction strategy can $(\epsilon, \gamma)$-predict $F$ if $\gamma \ge 16e^2\epsilon^2/(d-2)$.

Proof The bound on $\hat{M}_{Q,F,\gamma}(t)$ for all $t$ follows from the general lower bound for constant-distribution prediction ([HLW90], Theorem 3.1), since a constant distribution is always admissible. The second part of the bound uses a similar proof. Consider the shattered set $X_0 = \{z, y_0, y_1, \ldots, y_k\}$ with $d = k+2$ elements. We use a distribution sequence $\langle P_i \rangle_{i=1}^t$ which has a support that drifts from the set $\{y_0, z\}$ to $\{y_0, y_1, \ldots, y_k\}$. The probability of $y_0$ remains constant throughout; the remainder of the probability shifts from $z$ to $\{y_1, \ldots, y_k\}$, starting at time $t-m$, where $m = \lceil\sqrt{k/\gamma}\rceil$. The distribution sequence is given by
$$P_j(z) = \begin{cases} \dfrac{k}{m} & j = 1, \ldots, t-m \\[1ex] \dfrac{(t-j)k}{m^2} & j = t-m+1, \ldots, t, \end{cases}$$
$$P_j(y_0) = 1 - \frac{k}{m},$$
$$P_j(y_i) = \begin{cases} 0 & j = 1, \ldots, t-m \\[1ex] \dfrac{j-(t-m)}{m^2} & j = t-m+1, \ldots, t. \end{cases}$$
This distribution sequence is $\gamma$-admissible, because the subset of $X$ which experiences the largest increase in probability is the set $X_1 = \{y_1, \ldots, y_k\}$, and
$$P_{j+1}(X_1) - P_j(X_1) = \begin{cases} 0 & j = 1, \ldots, t-m-1 \\[1ex] \dfrac{k}{m^2} & j = t-m, \ldots, t-1, \end{cases}$$
with $k/m^2 \le \gamma$.

Let $B$ be the set of samples of length $t$ in which the last example $x_t$ has not already appeared in $(x_1, \ldots, x_{t-1})$. The probability that a sample is in $B$ is
$$\Pr_{\langle P_i \rangle}(B) = \Pr_{\langle P_i \rangle}\left((x_1, \ldots, x_t) : x_t \ne x_j,\ j = 1, \ldots, t-1\right) \ge \Pr_{\langle P_i \rangle}\left(x_t \ne y_0 \text{ and } x_t \ne x_j,\ j = 1, \ldots, t-1\right) = (1 - P_t(y_0))\prod_{j=1}^{t-1}(1 - P_j(x_t)).$$
Now, if $x_t \ne y_0$,
$$P_j(x_t) = \begin{cases} 0 & j = 1, \ldots, t-m \\[1ex] \dfrac{j-(t-m)}{m^2} & j = t-m+1, \ldots, t. \end{cases}$$
So
$$\Pr_{\langle P_i \rangle}(B) \ge \frac{k}{m}\prod_{l=1}^{m-1}\left(1 - \frac{l}{m^2}\right) \ge \frac{k}{m}\prod_{l=1}^{m-1}\left(1 - \frac{m-1}{m^2}\right) \ge \frac{k}{m}\left(1 - \frac{1}{m}\right)^{m-1} > \frac{k}{em} \ge \frac{\sqrt{k\gamma}}{2e}.$$
Now, using the same argument as in the proof of Theorem 3.1 in [HLW90] (picking a set of $2^d$ functions that shatters $X_0$ and finding the expected error under the uniform distribution on this set of functions), we can show that there is a function $f$ in $F$ such that
$$E_{\langle P_i \rangle}\left(M_{Q,f}^t\right) \ge \frac{\Pr(B)}{2} > \frac{\sqrt{k\gamma}}{4e}.$$
Rearranging the bound for large $t$ shows that $\hat{M}_{Q,F,\gamma}(t) \le \epsilon$ implies that $\gamma \le 16e^2\epsilon^2/(d-2)$. Notice that this necessary condition on $\gamma$ for $(\epsilon, \gamma)$-prediction is a factor of $\epsilon/d$ from the sufficient condition given in Theorem 15.
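The drifting sequence in this proof is concrete enough to write down. The sketch below (Python; names are ours) builds it over the support $\{z, y_0, y_1, \ldots, y_k\}$ and checks that it is a valid, $\gamma$-admissible sequence.

```python
import math

def nasty_sequence(k, gamma, t):
    """The distribution sequence from the proof of Theorem 18, over the
    domain [z, y0, y1, ..., yk] (indices 0, 1, 2, ..., k+1)."""
    m = math.ceil(math.sqrt(k / gamma))
    seq = []
    for j in range(1, t + 1):
        p = [0.0] * (k + 2)
        p[1] = 1 - k / m                         # y0: constant probability
        if j <= t - m:
            p[0] = k / m                         # all drifting mass on z
        else:
            p[0] = (t - j) * k / m ** 2          # z loses mass...
            for i in range(2, k + 2):
                p[i] = (j - (t - m)) / m ** 2    # ...and each y_i gains it
        seq.append(p)
    return seq

def d(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

seq = nasty_sequence(k=5, gamma=0.001, t=500)
assert all(abs(sum(p) - 1.0) < 1e-9 for p in seq)                      # valid distributions
assert all(d(seq[j], seq[j + 1]) <= 0.001 for j in range(len(seq) - 1))  # gamma-admissible
```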

5 LEARNING CHANGING NOISY PROBLEMS

In the prediction model of learning (and the pac model), we assume that the relationship between examples and their labels is a deterministic function in a known function class. This is an optimistic assumption, since it forbids noise and errors, and it assumes a great deal of knowledge about the function. To dispense with these assumptions, Blumer et al. [BEHW89] proposed a learning model in which the relationship is described by a joint probability distribution on $X \times \{0,1\}$. In this section, we consider a learning model of this kind in which the joint distribution is allowed to change slowly but continually as learning proceeds. This is a more general problem than either learning with a slowly changing distribution of examples or learning with a slowly changing target function. We begin with some notation.
Definition 19 Let $S$ be the space of labelled examples, $S = X \times \{0,1\}$. If $\sigma = ((x_1, y_1), \ldots, (x_t, y_t)) \in S^t$ and $h$ is a function from $X$ to $\{0,1\}$, define the empirical error of $h$ as
$$\widehat{\mathrm{er}}_\sigma(h) = \frac{1}{t}\,|\{i \in \{1, \ldots, t\} : h(x_i) \ne y_i\}|.$$
Define the expected error of $h$ with respect to the distribution $D$ on $S$ as
$$\mathrm{er}_D(h) = D(\{(x, y) \in S : h(x) \ne y\}).$$
For the set $H$ of functions from $X$ to $\{0,1\}$, the distribution $D$ on $S$, and parameters $0 < \epsilon \le 1$ and $0 < \alpha < 1$, define the set $B_D = B_D(H, t, \alpha, \epsilon)$ of misleading labelled samples as
$$B_D = \{\sigma \in S^t : \exists h \in H,\ \widehat{\mathrm{er}}_\sigma(h) \le (1-\alpha)\epsilon \text{ and } \mathrm{er}_D(h) > \epsilon\}.$$

The following theorem gives conditions on $\gamma$ and $t$ that ensure that the empirical error for part of a labelled sample of length $t$ is an accurate indication of the expected error for the next example, when the labelled examples are generated according to a $\gamma$-admissible distribution sequence.

Theorem 20 Consider a hypothesis class $H$ of functions from $X$ to $\{0,1\}$ with $\mathrm{VCdim}(H) = d < \infty$, parameters $0 < \alpha, \epsilon < 1$, $0 < \gamma \le 1$, and $0 < \delta \le 1$, and a $\gamma$-admissible distribution sequence $\langle P_i \rangle_{i=1}^t$ on $S = X \times \{0,1\}$. If
$$\gamma \le \frac{\delta}{(k+1)^2}$$
and $t \ge k$, then
$$\Pr_{\langle P_i \rangle_{i=t-k+1}^{t}}\left(B_{P_{t+1}}(H, k, \alpha, \epsilon)\right) \le \delta,$$
where
$$k = \left\lceil\frac{1}{\alpha^2\epsilon(1-\sqrt{\epsilon})}\left(4\log\frac{8}{\delta} + 6d\log\frac{4}{\alpha^2\epsilon}\right)\right\rceil$$
and $B_{P_{t+1}}(H, k, \alpha, \epsilon) \subseteq S^k$ is the set of misleading labelled samples defined in Definition 19.

We will use the following results.

Lemma 21 Let $\langle P_i \rangle_{i=1}^t$ be a $\gamma$-admissible distribution sequence on $S = X \times \{0,1\}$, $0 < \gamma \le 1$. If $f$ is a $[0,1]$-valued, measurable function defined on $S^* = \bigcup_{m \in \mathbb{N}} S^m$, then
$$E_{P_i^n}(f) \le E_{P_{i+1}^n}(f) + n\gamma \tag{16}$$
for $n = 1, 2, \ldots$ For an event $A \subseteq S^n$,
$$P_i^n(A) \le P_{i+1}^n(A) + n\gamma \tag{17}$$
for $n = 1, 2, \ldots$

Proof We prove Inequality (16) by induction. By Lemma 3, $E_{P_i}(f) \le E_{P_{i+1}}(f) + \gamma$. If Inequality (16) is true for $n$, we have
$$E_{P_i^{n+1}}(f) = \int_S \int_{S^n} f\,dP_i^n(x_1, \ldots, x_n)\,dP_i(x_{n+1}) = \int_S E_{P_i^n}(f)\,dP_i(x_{n+1}) \le \int_S E_{P_{i+1}^n}(f)\,dP_i(x_{n+1}) + n\gamma.$$
The integrand is $[0,1]$-valued, so Lemma 3 gives
$$E_{P_i^{n+1}}(f) \le E_{P_{i+1}^{n+1}}(f) + (n+1)\gamma.$$
Inequality (17) follows immediately, using $D(A) = E_D(1_A)$ for any distribution $D$, where $1_A$ is the indicator function of $A$ ($1_A(x)$ is 1 when $x \in A$ and 0 otherwise).

The following lemma is due to Anthony and Shawe-Taylor ([AST90], Proposition 3.2). It improves on a similar result presented by Blumer et al. ([BEHW89], Theorem A3.1).

Lemma 22 Define $B_D$, $H$, $t$, $\alpha$, $\epsilon$, and $d$ as in Theorem 20. For any distribution $D$ on $S$,
$$D^t(B_D(H, t, \alpha, \epsilon)) \le \delta$$
if
$$t \ge \frac{1}{\alpha^2\epsilon(1-\sqrt{\epsilon})}\left(4\log\frac{4}{\delta} + 6d\log\frac{4}{\alpha^2\epsilon}\right).$$

Proof (of Theorem 20) Let the function $f$ of Lemma 12 be the indicator function of $B_{P_{j+1}}(H, j, \alpha, \epsilon)$ for $j \in \mathbb{N}$. Then
$$\Pr_{\langle P_i \rangle_{i=1}^{j}}\left(B_{P_{j+1}}(H, j, \alpha, \epsilon)\right) \le P_j^j\left(B_{P_{j+1}}(H, j, \alpha, \epsilon)\right) + \frac{j(j-1)\gamma}{2} \le P_{j+1}^j\left(B_{P_{j+1}}(H, j, \alpha, \epsilon)\right) + \frac{j(j+1)\gamma}{2},$$
where the second inequality follows from Lemma 21. Now,
$$P_{j+1}^j\left(B_{P_{j+1}}(H, j, \alpha, \epsilon)\right) \le \frac{\delta}{2}$$
if $j = k$ (by Lemma 22), and
$$\frac{k(k+1)\gamma}{2} \le \frac{\delta}{2}$$
if $\gamma \le \delta/(k+1)^2$. Since the labelled examples are independent, the probability of a labelled sample of length $t$ in which the last $k$ elements possess some property is the same as the probability that a labelled sample of length $k$ possesses the property.

This theorem suggests the following learning procedure: an algorithm considers the most recent $k(\alpha, \epsilon, \delta, d)$ labelled examples, and attempts to find a hypothesis $h \in H$ that minimizes disagreements with those examples. With probability $1-\delta$, the empirical error for that hypothesis will be an accurate estimate of its expected error. Notice that this algorithm does not need to know $\gamma$, the bound on the change in the distribution. Of course, the choice of the parameters $\alpha$, $\epsilon$ and $\delta$ imposes an upper bound on $\gamma$.
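A sketch of that procedure (Python; names, the finite hypothesis class, and the noisy drifting simulation are ours, standing in for disagreement minimization over a general $H$):

```python
import random

def window_erm(history, hypotheses, k):
    """The learning procedure suggested by Theorem 20: fit the most recent
    k labelled examples by minimizing empirical error over H."""
    window = history[-k:]
    def emp_error(h):
        return sum(h(x) != y for x, y in window) / len(window)
    return min(hypotheses, key=emp_error)

# usage: H = a small class of threshold functions; labels are noisy and the
# optimal threshold drifts slowly, so the joint distribution on S drifts.
random.seed(2)
H = [lambda x, a=a: int(x >= a) for a in [i / 20 for i in range(21)]]
history = []
for trial in range(2000):
    drift_a = 0.3 + 0.0001 * trial           # slowly moving optimal threshold
    x = random.random()
    y = int(x >= drift_a) if random.random() > 0.1 else random.randint(0, 1)  # 10% noise
    history.append((x, y))
h = window_erm(history, H, k=200)            # only the recent window is used
```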

6 CONCLUSIONS

We have presented two models of learning from random examples that allow the distribution of the examples to change slowly but continually as learning proceeds. The first model assumes that there is a target function that defines the label of each example. If $\gamma$ is the amount by which the distribution of examples is allowed to drift and $d$ is the VC-dimension of the target function class, we showed that an upper bound on the probability that a prediction strategy misclassifies the last example in a sequence of $t$ examples decreases as $d/t$ at first (as in the constant-distribution case), but that this probability can reach a steady-state value between $\Omega(d^{1/2}\gamma^{1/2})$ and $O(d^{2/3}\gamma^{1/3})$. Using these bounds, we gave necessary and sufficient conditions for $(\epsilon, \gamma)$-prediction to be possible ($\gamma = O(\epsilon^2/d)$ and $\gamma = O(\epsilon^3/d^2)$, respectively). Obviously, it would be desirable to remove the $\epsilon/d$ factor separating these bounds.

Section 5 investigated the problem of learning when the labelled examples are generated by a slowly changing joint distribution on $X \times \{0,1\}$. We gave an upper bound on $\gamma$ that ensures that the empirical error of a hypothesis is close to its expected error (provided there are enough training examples). Since the most recent examples contain the most relevant information (and the earliest examples might be misleading), it may be possible to improve on this result by using a weighting scheme (see [HL91]), in which a hypothesis that is consistent with most of the recent examples would be rated more highly than one that is consistent with earlier examples.

Acknowledgements

This research was supported by OTC Australia, by the Australian Telecommunications and Electronics Research Board, and through an Australian Postgraduate Research Award. I thank D. Lovell and R. Williamson for helpful comments, and a reviewer for suggesting alternative definitions of distance between distributions.

References

[ABST90] M. Anthony, N. Biggs, and J. Shawe-Taylor. Learnability and formal concept analysis. Technical Report CSD-TR-624, UCL, 1990.

[AST90] M. Anthony and J. Shawe-Taylor. A result of Vapnik with applications. Technical Report CSD-TR-628, UCL, 1990.

[BEHW89] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929-965, 1989.

[Hal50] P. R. Halmos. Measure Theory. Van Nostrand, 1950.

[HKLW88] D. Haussler, M. Kearns, N. Littlestone, and M. K. Warmuth. Equivalence of models for polynomial learnability. In Proceedings of the 1988 Workshop on Computational Learning Theory, pages 42-55. Morgan Kaufmann, San Mateo, CA, 1988.

[HL91] D. P. Helmbold and P. M. Long. Tracking drifting concepts using random examples. In Proceedings of the Fourth Annual Workshop on Computational Learning Theory, pages 13-23. Morgan Kaufmann, San Mateo, CA, 1991.

[HLW90] D. Haussler, N. Littlestone, and M. K. Warmuth. Predicting {0,1}-functions on randomly drawn points. Technical Report UCSC-CRL-90-54, Baskin Center for Computer Engineering and Information Sciences, University of California, Santa Cruz, 1990.

[Kra88] A. H. Kramer. Learning despite distribution drift. In Proceedings of the Connectionist Models Summer School, pages 201-210. Morgan Kaufmann, San Mateo, CA, 1988.

[Kul67] S. Kullback. A lower bound for discrimination information in terms of variation. IEEE Transactions on Information Theory, IT-13:126-127, 1967.

[Ren61] A. Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 547-561. University of California Press, 1961.

[Val84] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1143, 1984.

[Vap82] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.

[VC71] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, XVI(2):264-280, 1971.
