
Machine Learning, 18, 187-230 (1995). © 1995 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

On the Complexity of Function Learning

PETER AUER [email protected]
PHILIP M. LONG [email protected]
WOLFGANG MAASS [email protected]
GERHARD J. WOEGINGER [email protected]
Institute for Theoretical Computer Science, Technische Universität Graz, Klosterwiesgasse 32/2, A-8010 Graz, Austria

Editor: Sally A. Goldman

Abstract. The majority of results in computational learning theory are concerned with concept learning, i.e. with the special case of function learning for classes of functions with range {0, 1}. Much less is known about the theory of learning functions with a larger range such as N or R. In particular relatively few results exist about the general structure of common models for function learning, and there are only very few nontrivial function classes for which positive learning results have been exhibited in any of these models.

We introduce in this paper the notion of a binary branching adversary tree for function learning, which allows us to give a somewhat surprising equivalent characterization of the optimal learning cost for learning a class of real-valued functions (in terms of a max-min definition which does not involve any "learning" model).

Another general structural result of this paper relates the cost for learning a union of function classes to the learning costs for the individual function classes.

Furthermore, we exhibit an efficient learning algorithm for learning convex piecewise linear functions from R^d into R. Previously, the class of linear functions from R^d into R was the only class of functions with multi-dimensional domain that was known to be learnable within the rigorous framework of a formal model for on-line learning.

Finally we give a sufficient condition for an arbitrary class F of functions from R into R that allows us to learn the class of all functions that can be written as the pointwise maximum of k functions from F. This allows us to exhibit a number of further nontrivial classes of functions from R into R for which there exist efficient learning algorithms.

Keywords: computational learning theory, on-line learning, mistake-bounded learning, function learning

1. Introduction

We consider the complexity of function learning in the most common nonprobabilistic models of on-line learning. In contrast to the rather well-developed theory for the special case of {0, 1}-valued functions (i.e., concepts), relatively little is known about general properties of optimal mistake bounds (resp. loss bounds) for learning classes of functions with larger ranges (e.g., real-valued functions). Furthermore, nontrivial positive learning results have so far been achieved in these models only for very few function classes.

The main learning model that we consider is the common model for on-line function learning. For some fixed class F of possible target functions T from X to Y the learner proposes at each round s of a learning process a hypothesis function h_s: X → Y. The environment responds with a counterexample (x, T(x)) such that the learner's prediction h_s(x) for argument x differs from the true value T(x) of the target function T. The loss


of the learner at round s is measured by l(h_s(x), T(x)), for some suitable loss function l: Y × Y → R+ (e.g. l(h_s(x), T(x)) = (h_s(x) − T(x))^2 if Y ⊆ R). The goal of the learner is to minimize his total loss, i.e. the sum of his losses for all rounds s. Hence one is interested in efficient learning algorithms for F that produce a suitable hypothesis h_s for each round s of any learning process so that the total loss of the learner for the worst case choice of T ∈ F and the worst case choice of counterexamples is as small as possible. A detailed definition of the resulting "learning complexity" for an arbitrary function class F (denoted by LC-ARB(F)) is given at the beginning of Section 2 of this paper.

This learning model is equivalent to a slightly different learning model, where the environment provides arbitrary examples (or "trials") (x, T(x)) of the target function T: X → Y. These are not required to be "counterexamples" to the current hypothesis h_s of the learner (i.e. we may have l(h_s(x), T(x)) = 0). At round s the learner is given a point x_s and he makes a "prediction" h_s(x_s), after which he receives the correct value T(x_s). If h_s(x_s) ≠ T(x_s), one says that the learner makes a mistake at trial s (Littlestone, 1988). As before, the loss of the learner at round s is measured by l(h_s(x_s), T(x_s)), and his goal is to minimize the sum of his losses over all rounds. This variation of the learning model is a bit more plausible from the point of view of applications. It is a straightforward generalization of Littlestone's "mistake bounded" model for concept learning (Littlestone, 1988). However as in the case of concept learning (see (Littlestone, 1988)) one can also show very easily for function learning that this variation leads to the same definition of the learning complexity of a function class. Hence we prefer to work with the former version of the learning model, which is somewhat simpler to analyze.

Our model for function learning has already been widely studied, both theoretically and practically. For example the well-known back-propagation learning algorithm for function learning on neural nets is a learning algorithm for this type of learning model (however there one just wants to minimize the deviation of the hypothesis at the end of the training phase from the target function). In that special case the hypothesis h_s is the function that is computed by the neural net after s applications of the backwards propagation rule for adjusting the weights of the neural net (for s incorrectly processed training examples).

From the theoretical point of view this function learning model is a straightforward generalization of the common model for on-line concept learning in a worst case setting, since concepts may be viewed as functions with range {0, 1} (see (Barzdin & Freivald, 1972), (Angluin, 1988), (Littlestone, 1988), (Maass & Turan, 1992)). In ((Dawid, 1984), (Mycielski, 1988), (Vovk, 1990), (Littlestone, Long & Warmuth, 1991), (Faber & Mycielski, 1991), (Littlestone & Warmuth, 1991), (Kimber & Long, 1992), (Feder, Merhav & Gutman, 1992), (Vovk, 1992), (Cesa-Bianchi, Long & Warmuth, 1993)), the learning complexities of various concrete function classes have been investigated in this function learning model, and cumulative loss bounds in terms of certain properties of F have been proved for general-purpose (but not necessarily computationally efficient) algorithms ((Vovk, 1990), (Littlestone & Warmuth, 1991), (Feder, Merhav & Gutman, 1992), (Vovk, 1992)). Nevertheless the only interesting class of functions f: R^d → R for d ≥ 2 that has been shown to be efficiently learnable is the class of linear functions ((Mycielski, 1988), (Littlestone, Long & Warmuth, 1991), (Cesa-Bianchi, Long & Warmuth, 1993)). Coming up with an efficient algorithm for linear functions with a decent loss bound is trivial, but coming up with optimal or near optimal loss bounded algorithms can be difficult.

In some learning settings, arising for example in control problems, the value T(x) is not made available to the learner, and only weaker reinforcement is provided by the environment.


At the end of Section 2, we consider variations of the previously described function learning model which model two such situations which arise in practice. In the first variation we assume that the learner only receives the information that T(x) ≠ h_s(x). In the second variation (studied previously by Barland (1992)) we assume that the learner is in addition told whether T(x) > h_s(x) ("h_s(x) is too low"), or whether T(x) < h_s(x) ("h_s(x) is too high").

In Section 2, we introduce a notion of an adversary tree (or mistake tree) for function learning. This notion provides an independent "dual" definition of the learning complexity of a function class. It turns out to be sufficient to consider binary branching adversary trees, even for learning real-valued functions under a large class of smooth loss functions including the heavily studied quadratic loss and "absolute loss." We apply this result to prove optimal lower bounds, in terms of LC-ARB(F), both for probabilistic function learning algorithms in an oblivious environment and for function learning with a wide variety of queries. At the end of Section 2, we exhibit appropriate notions of an adversary tree for two other function learning models.

In Section 3, we derive general upper bounds on the number LC-ARB(∪_{i=1}^p F_i) of mistakes required for learning ∪_{i=1}^p F_i in terms of LC-ARB(F_1), ..., LC-ARB(F_p). These bounds are shown to be tight to within a constant factor. The results of Section 3 remain true, and are new, even for the special case of {0, 1}-valued functions (i.e., concepts).

In Section 4, we exhibit two algorithms for learning the maximum of k linear functions over R^d. The first has a mistake bound polynomial in k for fixed d, and the second has a mistake bound polynomial in d for fixed k. To the best of our knowledge, these provide the first positive learning results in the LC-ARB model for any nontrivial class of functions defined on R^2 other than linear functions.

In Section 5, we exhibit a more general algorithm for learning the maximum of k real-valued functions of a single real variable. It yields as special cases algorithms which have mistake bounds linear in k for learning the maximum of k polynomials (with a bounded number of terms), of k rational functions (again, with a bounded number of terms), and of k sigmoidal functions.

This paper provides full proofs and more detailed explanations of the results described in the previously published extended abstract (Auer, Long, Maass & Woeginger, 1993).

2. Adversary Trees for Function Learning

We introduce in Definition 1 our new notion of an adversary tree for function learning, which allows us to give in Theorem 1 a completely independent ("dual") definition of the learning complexity of a function class. A somewhat surprising feature of Definition 1 is that it suffices for Theorem 1 if we restrict our attention to binary branching trees, independently of the size of the range Y of the functions that are learned. This allows us to derive in Theorem 2 an optimal lower bound for randomized algorithms for function learning in an oblivious environment. As another consequence of Theorem 1 we derive in Theorem 3 an optimal lower bound for function learning with a large variety of queries.

Assume that X ("domain"), Y ("range"), and A ("action space") are some arbitrary sets(finite or infinite).1 Consider some arbitrary sets F c YX ("function class") and H c AX

("hypothesis space"). Let t. A x Y - R+ := {r e R: r > 0} be some arbitrary function("loss function"). Note that we consider a slightly more general setting than outlined in

Page 4: On the Complexity of Function Learning - Home - …...ON THE COMPLEXITY OF FUNCTION LEARNING 189 At th e end of Sectio n 2, we conside r variation s of the previousl y described functio

190 AUER ET AL.

Section 1, since we allow here for the possibility that the learner proposes for each x e Xsome "action" h(x) that lies in some suitable "action space" A. In this general frameworkthe loss function 1: A x Y -> R+ measures how bad the proposed action h(x) is, withrespect to the given reinforcement y e Y.

A learning process is a dialogue between the "learner" and the "environment". At the beginning of each learning process the environment fixes some T ∈ F ("target function"). At each round s of the learning process (s = 1, 2, ...) the learner proposes some hypothesis h_s ∈ H. The environment responds at the same round with a pair (x_s, T(x_s)) for some x_s ∈ X. The loss of the learner at round s is l(h_s(x_s), T(x_s)). If h_s(x_s) ≠ T(x_s) we also refer to (x_s, T(x_s)) as a counterexample (CE) to hypothesis h_s. The total loss of the learner for this learning process is ∑_{s=1}^t l(h_s(x_s), T(x_s)), where t ∈ N ∪ {∞} is the number of rounds in this learning process.

A learning algorithm B for F with hypothesis space H is a function which produces for any T ∈ F at each round s of such a learning process some hypothesis h_s ∈ H of the learner. We will consider in this paper only deterministic learning algorithms, for which h_1, ..., h_{s−1} are not actually needed as arguments for B in the definition of h_s (since they can be recomputed with the help of the preceding counterexamples). We define LC_l(B) as the supremum of the total losses of the learner for learning processes with learning algorithm B and loss function l. We set

LC_l(F, H) := inf{LC_l(B): B is a deterministic learning algorithm for F with hypothesis space H}

and LC-ARB_l(F) := LC_l(F, A^X). These definitions contain as special cases those that were considered in ((Mycielski, 1988), (Littlestone & Warmuth, 1991), (Littlestone, Long & Warmuth, 1991), (Faber & Mycielski, 1991), (Kimber & Long, 1992)) for functions, and in ((Angluin, 1988), (Littlestone, 1988), (Maass & Turan, 1992)) for concepts.

Starting in Application 2 of this section, we will consider only the discrete loss function l defined by l(a, y) = 0 if a = y, and l(a, y) = 1 if a ≠ y. From that point on, we will drop the subscript l in LC-ARB_l.

REMARK.
a) One can easily see that the size of LC_l(F, H) does not change if we only allow responses (x_s, T(x_s)) with l(h_s(x_s), T(x_s)) > 0 in a learning process. This will be assumed w.l.o.g. throughout this paper (except for Theorem 2, which concerns another model).
b) In contrast to the discrete case there does not always exist a learning algorithm B for F with LC_l(B) = LC-ARB_l(F) (i.e., the infimum of the LC_l(B) need not be attained). An easy example is constructed by setting X = A = Y = (0, 1], F = {f: X → Y: f is constant}, and l(a, y) = 0 if y ≥ a, else l(a, y) = a − y. Then for any learning algorithm B one has LC_l(B) ≥ h(1) ∈ (0, 1], where h is the first hypothesis of B.

We will show in the subsequent Theorem 1 that for learning real valued functions under any of the common loss functions one can assume without loss of generality that the "adversary" (i.e. the environment) is very nice to the learner. One can assume that at the beginning of each round s in a learning process (before the learner has issued his hypothesis h_s for this round) the adversary tells the learner which point x_s he is going to use for his counterexample (x_s, y) at this round. In addition the adversary also gives the learner a set {y_1, y_2} ⊆ Y with the guarantee that the second component y of the counterexample (x_s, y) will be an element of {y_1, y_2}. Finally the adversary also announces to the learner the rule by which he will choose y_1 or y_2 (depending on h_s(x_s)): he gives to the learner a set A_1 ⊆ A such that y = y_1 ⇔ h_s(x_s) ∈ A_1.

Our intuition tells us that it is essential for an "optimal" adversary that he chooses x_s after he has seen h_s (as a "weak spot of h_s"), and that he exploits the full size of the range Y to choose a reinforcement y that causes a largest possible loss l(h_s(x_s), y) to the learner. The subsequent Theorem 1 tells us that this intuition is wrong.

Whereas a general adversary strategy is a rather complex mathematical object (since in general each move of the adversary depends on the preceding hypotheses of the learner), any adversary strategy of the previously described simple type can easily be described by a binary branching tree according to the following definition.

DEFINITION 1. An adversary tree U for a function class F ⊆ Y^X and an action set A is a finite binary branching tree whose nodes and edges are labeled in the following way:

Each node v of U is labeled by a pair (x_v, F_v) with x_v ∈ X, F_v ⊆ F and F_v ≠ ∅. If v is the root of U we set F_v := F.

If v is not a leaf of U, then its two outgoing edges have labels of the form (A_1, y_1) and (A_2, y_2), where A_1, A_2 is a nontrivial partition of A (i.e. A_1 ∪ A_2 = A, A_1 ∩ A_2 = ∅, A_1 ≠ ∅ and A_2 ≠ ∅) and y_1, y_2 are elements of Y such that y_1 ≠ y_2 and {f ∈ F_v: f(x_v) = y_i} ≠ ∅ for i = 1, 2. The set {f ∈ F_v: f(x_v) = y_i} is then the second component of the label of the node at which the edge with label (A_i, y_i) ends (i = 1, 2).

For any loss function l: A × Y → R+, one defines the loss l(A, y) of an edge in U with label (A, y) by l(A, y) := inf{l(a, y): a ∈ A}. The total loss of a path in U is the sum of the losses of the edges in this path.

We set

INF_l(U) := inf{K ∈ R+: K is the total loss of a path in U that starts at the root and ends at a leaf of U}

and

ADV_l(F) := sup{INF_l(U): U is an adversary tree for F with action space A and loss function l}.

Obviously any adversary tree U for F encodes an adversary strategy which forces the learner to take a total loss of at least INF_l(U), no matter which hypotheses he chooses in the course of the learning process. The second part F_v of the label (x_v, F_v) of an internal node v of U specifies a set of functions which could still be chosen as target functions by the adversary. The labels (A_1, y_1) and (A_2, y_2) of the two edges that leave the node v encode the following rule for the adversary at the next step of a learning process for F: If h(x_v) ∈ A_i (where h is the next hypothesis of the learner) then the adversary responds with the pair (x_v, y_i) (which forces the learner to take a loss of at least l(A_i, y_i) at this step).

One can also easily read off from the definition of an adversary tree U that it only encodes "particularly nice" adversary strategies, since one may assume that the learner has full knowledge of U. In particular at each step of a learning process where the adversary follows the strategy encoded by U, the first component x_v and the rule by which the adversary chooses the second component y_i of his next counterexample (x_v, y_i) are already known to the learner before he chooses his hypothesis h for this step.

Figure 1. Example of an adversary tree U for X = {a, b, c}, A = Y = {1, 2, 3} and F = Y^X (where the first components of the labels of the leaves have been deleted since they are irrelevant). For the loss function l with l(a, y) = |a − y| one has INF_l(U) = 2.
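To make these definitions concrete, here is a small Python sketch (ours, not from the paper) that represents a binary branching adversary tree and evaluates INF_l(U) recursively: a leaf contributes 0, and an internal node contributes the minimum over its two outgoing edges of the edge loss l(A_i, y_i) plus the value of the subtree below that edge, which is exactly the smallest total loss along a root-leaf path. All class and field names are our own, and the toy tree is smaller than the one in Figure 1.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Edge:
    actions: frozenset        # the block A_i of the nontrivial partition of A
    y: object                 # the announced response y_i
    child: "Node" = None

@dataclass
class Node:
    x: object                                     # query point x_v
    edges: Optional[Tuple[Edge, Edge]] = None     # None marks a leaf

def edge_loss(loss, edge):
    # l(A, y) := inf{l(a, y): a in A}; A is finite here, so min suffices.
    return min(loss(a, edge.y) for a in edge.actions)

def inf_l(loss, node):
    """Smallest total loss along any root-leaf path of the tree U."""
    if node.edges is None:
        return 0.0
    return min(edge_loss(loss, e) + inf_l(loss, e.child) for e in node.edges)

# Toy tree over A = Y = {1, 2, 3} with the absolute loss l(a, y) = |a - y|.
absolute = lambda a, y: abs(a - y)
tree = Node(x="a", edges=(
    Edge(frozenset({1}), y=3, child=Node(x="b")),
    Edge(frozenset({2, 3}), y=1, child=Node(x="b")),
))
print(inf_l(absolute, tree))  # min(|1 - 3|, |2 - 1|) -> 1.0
```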

Intuitively one may tend to believe that LC-ARB_l(F) > ADV_l(F) for many function classes F, i.e. that the environment becomes significantly weaker if it limits its adversary strategies to those particularly simple ones that can be encoded by adversary trees. The following theorem asserts that this is not the case. The assumptions of this theorem contain a rather trivial condition (+) which is explained and motivated in the subsequent Remark.

THEOREM 1. Let X, Y, A, and F ⊆ Y^X be arbitrary nonempty sets (finite or infinite) and let l: A × Y → R+ be a function such that

(+) inf{l(a, f(x)): a ∈ A} = 0 for all f ∈ F and all x ∈ X,

and

(i) A = Y and l is the "discrete loss function" with l(a, y) = 0 if a = y and l(a, y) = 1 if a ≠ y, or

(ii) A and Y are subsets of R and the loss function l is of the form l(a, y) = L(|a − y|) for some arbitrary nondecreasing and continuous function L: R+ → R+ with L(0) = 0 (e.g. l(a, y) = |a − y|^p for some arbitrary parameter p > 0).

Then LC-ARB_l(F) = ADV_l(F).

REMARK. The assumptions of the preceding theorem contain the technical condition (+), which just says that the set A of possible outputs ("actions") of the learner is sufficiently large. More precisely, it demands that for any possible value f(x) of a target function f ∈ F there exist points a ∈ A which make the loss l(a, f(x)) of the learner for argument x arbitrarily small. This condition is hardly restrictive, since usually one even has A = Y (e.g. A = Y = [0, 1]). Furthermore it is obvious that LC-ARB_l(F) = ∞ if condition (+) is not satisfied. Thus (+) holds for all learning problems that are of interest in this context.

We also would like to point out that (+) is a necessary assumption for Theorem 1. This follows by considering an example where F consists of a single function f: R → R (hence ADV_l(F) = 0), and A ⊆ R is chosen such that inf{l(a, f(x_0)): a ∈ A} > 0 for some x_0 ∈ R (hence (+) is not satisfied, thus LC-ARB_l(F) = ∞).

PROOF OF THEOREM 1. We start by showing that LC-ARB_l(F) ≤ ADV_l(F) in the case (ii). Fix some arbitrary ε > 0. In order to show that LC-ARB_l(F) ≤ ADV_l(F) + ε we choose an arbitrary sequence (ε_j)_{j∈N} of reals ε_j > 0 such that ε = ∑_{j=1}^∞ ε_j. It suffices to construct a learning algorithm B so that for every s ∈ N the following property holds (which implies that LC_l(B) ≤ ADV_l(F) + ε):

(*) For every learning process with learning algorithm B and examples

(x_1, y_1), ..., (x_s, y_s)

the proposed actions a_1, ..., a_s of B (where a_j := h_j(x_j) for the hypothesis h_j of B at round j) satisfy

∑_{j=1}^s l(a_j, y_j) + ADV_l({f ∈ F: f(x_j) = y_j for j = 1, ..., s}) ≤ ADV_l(F) + ∑_{j=1}^s ε_j.

The property (*) is trivially satisfied for s = 0. Assume now that (*) holds for some arbitrary s ∈ N. In order to prove (*) for s + 1 we fix some learning process with learning algorithm B and set F' := {f ∈ F: f(x_j) = y_j for j = 1, ..., s}. The hypothesis h of B for the next round s + 1 of the learning process is defined as follows:

For every x ∈ X let h(x) be some a ∈ A such that

l(a, y) + ADV_l({f ∈ F': f(x) = y}) ≤ ADV_l(F') + ε_{s+1}

for all y ∈ Y for which there exists an f ∈ F' such that f(x) = y.

It is obvious that if the learning algorithm B chooses as hypothesis h_{s+1} at step s + 1 a hypothesis h with the preceding property, then (*) is also satisfied for s + 1. Hence it just remains to show that there exists a hypothesis h with the preceding property. This proof turns out to be rather nontrivial, involving subtle arguments from real valued analysis.

LEMMA 1. The previously defined function h: X → A is well-defined for all x ∈ X.

PROOF OF LEMMA 1. Assume for a contradiction that h(x_0) is not well-defined for some x_0 ∈ X, i.e.

(**) for every a ∈ A there is some y(a) ∈ Y with {f ∈ F': f(x_0) = y(a)} ≠ ∅ and l(a, y(a)) + ADV_l({f ∈ F': f(x_0) = y(a)}) > ADV_l(F') + ε_{s+1}.

Intuitively this assumption (**) means that for every proposed action a = h(x_0) of the learner for the fixed argument x_0 there is a legal countermove of the adversary (where he gives a value y(a) as the correct value f(x_0) of the target function f at argument x_0) which causes an "unusually large" loss to the learner. In this context the loss l(a, y(a)) is "unusually large" if the sum of l(a, y(a)) and of the remaining "adversary power" ADV_l({f ∈ F': f(x_0) = y(a)}) after this countermove exceeds the "adversary power" ADV_l(F') at the beginning of this move by more than the previously specified nonzero quantity ε_{s+1}.

Figure 2. Definition of A_1 and A_2 in Case 1.

The only chance to refute this assumption (**) is to show that then we could select two of these countermoves (y(a_1) and y(a_2) for two suitable actions a_1, a_2) and a partition of the action space A into subsets A_1 and A_2 such that the learner would still suffer an "unusually large" (although possibly by a fraction of ε_{s+1} smaller) loss if the adversary would for i = 1 and i = 2 always respond with the same countermove y(a_i) for any proposed action a = h(x_0) ∈ A_i. This would lead to a contradiction to the definition of ADV_l(F') (as the smallest total loss along a root-leaf path in a tree for which this smallest total loss is as large as possible), since we could then construct an adversary tree U for F' which has (x_0, F') as label of the root, (A_1, y(a_1)) and (A_2, y(a_2)) as labels of the edges from the root, and an almost optimal adversary tree U_i for the subclass {f ∈ F': f(x_0) = y(a_i)} attached below the edge with label (A_i, y(a_i)) (for i = 1, 2). This adversary tree U would then have on any root-leaf path a total loss that exceeds ADV_l(F') by ε_{s+1}/2, contradicting the definition of ADV_l(F').

In order to carry out this plan for refuting (**) one has to look at the concrete structure of any possible function a ↦ y(a) that satisfies the conditions of (**). For example if there exist some a_1, a_2 ∈ A such that y(a_1) < a_1 < a_2 < y(a_2) (this is Case 1 in the subsequent analysis), then one can easily read off from Fig. 2 that the assumed monotonicity of the function L with l(a, y) = L(|a − y|) implies that for A_1 := {a ∈ A: a ≥ a_1} and A_2 := A − A_1 the desired properties are met (since l(A_i, y(a_i)) ≥ l(a_i, y(a_i)) for i = 1, 2).

If there are no actions a_1, a_2 as above, then we may conclude that the set {a ∈ A: y(a) > a} is "closed to the left", i.e. ∀a, a′ ∈ A ((y(a) > a ∧ a′ < a) ⇒ y(a′) > a′) (this is Case 2 of the subsequent precise proof). In this case one either has ∀a ∈ A (y(a) > a) (this is Case 2.1), or A can be partitioned into a left interval {a ∈ A: y(a) > a} with supremum s_1 and a right interval {a ∈ A: y(a) < a} with infimum s_2 ≥ s_1 (this is Case 2.2). The idea for the definition of a_1, a_2, A_1, A_2 for these two subcases can be easily read off from Fig. 3 respectively Fig. 4. The precise construction is a bit more involved, since it also depends on the concrete structure of the loss function l. However the reader may skip the subsequent detailed proof without loss of understanding for the later results in this paper.

The precise refutation of assumption (**) proceeds as follows. We fix for each a ∈ A some y(a) ∈ Y as in (**). In each of the following cases we get a contradiction by building an adversary tree U for F' which satisfies INF_l(U) ≥ ADV_l(F') + ε_{s+1}/2 (hence U provides a contradiction to the definition of ADV_l(F')). The label of the root of U will always be (x_0, F').

Case 1. ∃a_1, a_2 ∈ A (y(a_1) < a_1 < a_2 < y(a_2)).

Then l(a, y(a_1)) ≥ l(a_1, y(a_1)) for all a ≥ a_1, and l(a, y(a_2)) ≥ l(a_2, y(a_2)) for all a ≤ a_2. We set A_1 := {a ∈ A: a ≥ a_1} and A_2 := A − A_1.

We choose (A_1, y(a_1)) and (A_2, y(a_2)) as labels of the edges that leave the root of the constructed adversary tree U. Below the edge with label (A_i, y(a_i)) we attach an adversary tree U_i for {f ∈ F': f(x_0) = y(a_i)} that satisfies INF_l(U_i) ≥ ADV_l({f ∈ F': f(x_0) = y(a_i)}) − ε_{s+1}/2, for i = 1, 2. By definition of A_i we have l(A_i, y(a_i)) ≥ l(a_i, y(a_i)) for i = 1, 2. Hence

INF_l(U) ≥ min_{i=1,2} (l(a_i, y(a_i)) + ADV_l({f ∈ F': f(x_0) = y(a_i)}) − ε_{s+1}/2) > ADV_l(F') + ε_{s+1}/2

by the definition of y(a_i) for i = 1, 2.

Case 2. We define

s_1 := sup{a ∈ A: y(a) > a} and s_2 := inf{a ∈ A: y(a) < a}

(as usual, we set sup ∅ := −∞ and inf ∅ := ∞). The assumption of this case implies that s_1 ≤ s_2.

Case 2.1. ∀a ∈ A (y(a) > a).
We first show that s_1 = ∞ in this subcase. Assume for a contradiction that s_1 < ∞. Set

z_0 := inf{z > 0: L(z) > 0}.

Then L(z_0) = 0. Since l(a, y(a)) > ε_{s+1} for all a ∈ A, there exists some δ > 0 such that y(a) > s_1 for all a ∈ A with |a − s_1| < δ. If |s_1 − y(a)| ≤ z_0 for all these a ∈ A, then this would imply that l(a, y(a)) → 0 for a → s_1. If |s_1 − y(a_1)| > z_0 for some a_1 ∈ A with y(a_1) > s_1, then L(|a − y(a_1)|) ≥ L(|s_1 − y(a_1)|) > 0 for all a ∈ A. This provides a contradiction to the assumption (+) of Theorem 1. Hence we may assume that s_1 = ∞.

Fix any a_1 ∈ A. Choose a_2 ∈ A such that a_2 > y(a_1) and l(a_2, y(a_1)) ≥ l(a_1, y(a_1)). Set A_1 := {a ∈ A: a ≥ a_2} and A_2 := {a ∈ A: a < a_2} (see Fig. 3). We then have l(A_1, y(a_1)) ≥ l(a_1, y(a_1)) and l(A_2, y(a_2)) ≥ l(a_2, y(a_2)). Construct U analogously as in Case 1.

Figure 3. Definition of A_1 and A_2 in Case 2.1.

Case 2.2. ∃a_1, a_2 ∈ A (y(a_1) > a_1 ∧ y(a_2) < a_2).
We then have −∞ < s_1 ≤ s_2 < ∞ by the definition of Case 2. Our strategy in this subcase is to choose a_1 ≤ s_1 so close to s_1 that y(a_1) > s_1 and that l(s_1, y(a_1)) differs from l(a_1, y(a_1)) by at most ε_{s+1}/4. Analogously we want to choose a_2 ≥ s_2 so close to s_2 that y(a_2) < s_2 and that l(s_2, y(a_2)) differs from l(a_2, y(a_2)) by at most ε_{s+1}/4 (see Fig. 4). It turns out that such a choice of a_1, a_2 is always possible except for subcase 2.2.2, where we trivially succeed with a slightly different choice of a_1, a_2.

Figure 4. Definition of A_1 and A_2 in Case 2.2.

Set

r_1 := sup{|a − y(a)|: a ∈ [s_1 − 1, s_1] ∩ A} and r_2 := sup{|a − y(a)|: a ∈ [s_2, s_2 + 1] ∩ A}.

Case 2.2.1. r_1 < ∞ and r_2 < ∞.
Set r := max(r_1, r_2). Then the continuous function L is uniformly continuous on the compact interval [0, r]. Hence there exists some ε′ > 0 (w.l.o.g. ε′ < 1) such that |L(z) − L(z′)| ≤ ε_{s+1}/4 for any z, z′ ∈ [0, r] with |z − z′| < ε′. Define as in Case 2.1 z_0 := inf{z > 0: L(z) > 0}.

Define A_1 := {a ∈ A: y(a) > a} and A_2 := {a ∈ A: y(a) < a}. We want to choose a_1 ∈ A_1 such that |a_1 − s_1| < ε′, L(|a_1 − s_1|) ≤ ε_{s+1}/4, and L(|s_1 − y(a_1)|) > 0. If |s_1 − y(a)| ≤ z_0 for all a ∈ A_1 with |a − s_1| < ε′ and L(|a − s_1|) ≤ ε_{s+1}/4, then L(|a − y(a)|) → 0 for a → s_1, a ∈ A_1 (since ||a − y(a)| − |s_1 − y(a)|| → 0 for a → s_1, a ∈ A_1). This is a contradiction to the fact that L(|a − y(a)|) > ε_{s+1} for all a ∈ A (by the definition of y(a)). Hence there exists some a_1 ∈ A_1 with |a_1 − s_1| < ε′, L(|a_1 − s_1|) ≤ ε_{s+1}/4, and |s_1 − y(a_1)| > z_0 (thus L(|s_1 − y(a_1)|) > 0). Analogously there exists some a_2 ∈ A_2 with |a_2 − s_2| < ε′, L(|a_2 − s_2|) ≤ ε_{s+1}/4, and L(|s_2 − y(a_2)|) > 0.

Since L(|a_i − y(a_i)|) > ε_{s+1} by the definition of y(a_i), and since L(|a_i − s_i|) ≤ ε_{s+1}/4 by the choice of a_i, we have y(a_1) > s_1 and y(a_2) < s_2. Furthermore, we can conclude that y(a_1) ≠ y(a_2), since otherwise s_1 < y(a_1) = y(a_2) < s_2. This would provide a contradiction to (+) because (s_1, s_2) ∩ A = ∅ by the definition of s_1, s_2, and L(|s_i − y(a_i)|) > 0 by the choice of a_i for i = 1, 2.

We have |a_i − y(a_i)| ≤ r since ε′ < 1. Hence l(A_i, y(a_i)) = inf{L(|a − y(a_i)|): a ∈ A_i} ≥ L(|s_i − y(a_i)|) ≥ L(|a_i − y(a_i)|) − ε_{s+1}/4 = l(a_i, y(a_i)) − ε_{s+1}/4, since |a_i − y(a_i)|, |s_i − y(a_i)| ∈ [0, r] and ||a_i − y(a_i)| − |s_i − y(a_i)|| ≤ |a_i − s_i| < ε′, i = 1, 2. Therefore we can construct an adversary tree U for F' with INF_l(U) ≥ ADV_l(F') + ε_{s+1}/2 by attaching below the edge with label (A_i, y(a_i)) an adversary tree U_i for {f ∈ F': f(x_0) = y(a_i)} with INF_l(U_i) ≥ ADV_l({f ∈ F': f(x_0) = y(a_i)}) − ε_{s+1}/4, i = 1, 2.


Case 2.2.2. r_1 = r_2 = ∞.
If sup{L(z): z ∈ R+} < ∞, then L is uniformly continuous on R+ (since L is nondecreasing). Hence we can argue as in Case 2.2.1, with R+ instead of [0, r].

We assume in the following that sup{L(z): z ∈ R+} = ∞. Define A_1 := {a ∈ A: y(a) > a} and A_2 := {a ∈ A: y(a) < a}. We can then conclude that sup{l(A_1, y(a)): a ∈ [s_1 − 1, s_1] ∩ A} = ∞ and sup{l(A_2, y(a)): a ∈ [s_2, s_2 + 1] ∩ A} = ∞. Hence it suffices for the construction of U to choose some a_1 ∈ [s_1 − 1, s_1] ∩ A such that l(A_1, y(a_1)) ≥ ADV_l(F') + ε_{s+1}, and some a_2 ∈ [s_2, s_2 + 1] ∩ A such that l(A_2, y(a_2)) ≥ ADV_l(F') + ε_{s+1}. The resulting adversary tree U for F' of depth 1 satisfies INF_l(U) ≥ ADV_l(F') + ε_{s+1}.

Case 2.2.3. |{i ∈ {1, 2}: r_i < ∞}| = 1.
In this case we use for the i with r_i < ∞ a construction of (A_i, y_i) and U_i as in Case 2.2.1, and for j ∈ {1, 2} − {i} a construction of (A_j, y_j) (and U_j) as in Case 2.2.2.

Case 2.3. ∀a ∈ A (y(a) < a).
This case is dual to Case 2.1, and can be handled analogously.

This completes the proof of Lemma 1, and the proof that LC-ARB_l(F) ≤ ADV_l(F) in case (ii).

The inequality LC-ARB_l(F) ≥ ADV_l(F) is shown in case (ii) as follows. Assume for a contradiction that LC-ARB_l(F) < ADV_l(F). Fix some adversary tree U for the considered F and A such that INF_l(U) ≥ LC-ARB_l(F) + ρ for some ρ > 0. Let B be an arbitrary learning algorithm for F with hypothesis space A^X. The adversary tree U provides in the obvious way responses to the hypotheses h of B, until a leaf of U has been reached. If one has arrived in U at a node v with label (x_v, F_v), (A_1, y_1) and (A_2, y_2) are the labels of the two outgoing edges, and h(x_v) ∈ A_i (i ∈ {1, 2}) for the current hypothesis h ∈ A^X of learning algorithm B, then the adversary gives (x_v, y_i) as his next response. Note that this response is a valid move of the adversary, since by definition of U there exists some f ∈ F_v with f(x_v) = y_i. The loss l(h(x_v), y_i) of the learner at this round satisfies l(h(x_v), y_i) ≥ l(A_i, y_i). The adversary then proceeds in U to the node at the end of the edge with label (A_i, y_i). In this way the learning algorithm B defines a path from the root to some leaf of U. By definition of INF_l(U) the total loss K of the learner during the resulting learning process satisfies K ≥ INF_l(U). This implies that K ≥ LC-ARB_l(F) + ρ. Since B was chosen arbitrarily, this inequality provides a contradiction to the definition of LC-ARB_l(F).

For the proof of Theorem 1 in case (i) we first note that the inequality LC-ARB_l(F) ≥ ADV_l(F) can be shown in the same way as for case (ii). However this proof is simpler since LC-ARB_l(F), ADV_l(F) ∈ N ∪ {∞} for the discrete loss function l. The inequality LC-ARB_l(F) ≤ ADV_l(F) is shown in case (i) as follows. The claim is obvious if ADV_l(F) = ∞. For all other function classes the claim is shown by induction on ADV_l(F).

We define a learning algorithm B for F, which uses as first hypothesis a function h: X → Y that assigns to each x ∈ X some y ∈ Y such that ADV_l({f ∈ F: f(x) = y}) is as large as possible. The subsequent Lemma 2 implies that ADV_l({f ∈ F: f(x_0) = y_0}) < ADV_l(F) for any pair (x_0, y_0) that the learner might receive as a counterexample to this hypothesis h (because either ADV_l({f ∈ F: f(x_0) = y}) < ADV_l(F) for all y ∈ Y, or h(x_0) is the only value y ∈ Y such that ADV_l({f ∈ F: f(x_0) = y}) = ADV_l(F)).

For subsequent rounds the learning algorithm B employs a learning algorithm B' for {f ∈ F: f(x_0) = y_0} with LC_l(B') ≤ ADV_l({f ∈ F: f(x_0) = y_0}). Such a B' exists by the induction hypothesis. Furthermore the preceding facts imply that LC_l(B) ≤ ADV_l({f ∈ F: f(x_0) = y_0}) + 1 ≤ ADV_l(F). □

LEMMA 2. Assume that A = Y and l is any loss function such that l(A', y) > 0 for any A' ⊆ Y, y ∈ Y with y ∉ A'. Then for all x ∈ X there exists at most one y ∈ Y such that ADV_l({f ∈ F: f(x) = y}) = ADV_l(F).

PROOF OF LEMMA 2. Assume for a contradiction that for some x_0 ∈ X there are y_1, y_2 ∈ Y with y_1 ≠ y_2 and ADV_l({f ∈ F: f(x_0) = y_i}) = ADV_l(F) for i = 1, 2.

Then one can build in the following way an adversary tree U for F with INF_l(U) > ADV_l(F): The root of U is labeled by (x_0, F). Partition Y into A_1, A_2 such that y_i ∉ A_i for i = 1, 2. Assign (A_1, y_1) and (A_2, y_2) as labels to the edges from the root of U. Attach below the edge with label (A_i, y_i) an adversary tree U_i for {f ∈ F: f(x_0) = y_i} with INF_l(U_i) = ADV_l({f ∈ F: f(x_0) = y_i}). □
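For the discrete loss, the max-min quantity ADV(F), and hence by Theorem 1 also LC-ARB(F), can be computed by brute force for small finite classes: an adversary node picks a point x and two realizable values y_1 ≠ y_2, both edges can be charged loss 1 by choosing the partition of A = Y with y_i ∉ A_i, and the adversary then survives in the worse of the two subclasses. The following Python sketch (ours, not from the paper) makes this recursion concrete; it is exponential and purely illustrative.

```python
from functools import lru_cache
from itertools import product

def adv(functions, domain):
    """ADV(F) under the discrete loss; by Theorem 1 this equals LC-ARB(F).
    Each f in `functions` is a tuple with f[x] its value at x in `domain`."""
    @lru_cache(maxsize=None)
    def go(F):
        best = 0
        for x in domain:
            values = sorted({f[x] for f in F})
            for i, y1 in enumerate(values):
                for y2 in values[i + 1:]:
                    F1 = frozenset(f for f in F if f[x] == y1)
                    F2 = frozenset(f for f in F if f[x] == y2)
                    # Both edges carry loss 1 (choose the partition with y_i
                    # outside A_i); the adversary survives in the worse subtree.
                    best = max(best, 1 + min(go(F1), go(F2)))
        return best
    return go(frozenset(functions))

# All 2^3 functions from {0,1,2} into {0,1}: the adversary forces 3 mistakes.
F = [tuple(v) for v in product((0, 1), repeat=3)]
print(adv(F, domain=range(3)))  # -> 3
```

The first hypothesis of the optimal learner from the preceding proof can be read off from the same recursion: at each point x it predicts a value y for which adv of the subclass {f ∈ F: f(x) = y} is as large as possible.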

Application 1: Randomization in an oblivious environment

We consider in Theorem 2 an alternative model for function learning, where the learning algorithm may use randomization, and the given sequence of pairs (x, T(x)) is fixed before learning takes place and is therefore oblivious to the actions (and the randomization) of the learner (see (Maass, 1991)). For this model we write RLC-OBL_l(F, H) in place of LC_l(F, H). It is easy to generalize the results from (Maass, 1991) to show that in the case where l is the discrete loss function, RLC-OBL_l(F, F) ≤ ln |F| for any finite function class F. The following limits the amount randomization can help the learner, even in an oblivious environment.

THEOREM 2. Assume that X, Y, A, F and l satisfy the assumption of Theorem 1. Then RLC-OBL_l(F, A^X) ≥ ½ LC-ARB_l(F). This lower bound is optimal.

PROOF OF THEOREM 2. The proof of the lower bound proceeds by showing that any adversary tree U for F gives rise to an oblivious sequence S of examples, for which the expected total loss of any randomized learning algorithm is at least ½ INF_l(U).

Let U be some adversary tree for F and let B be a probabilistic learning algorithm for F in the model RLC-OBL with arbitrary hypotheses. One uses U to construct an oblivious sequence S of adversary responses (x, y) such that the expected total loss of B for this sequence S is ≥ ½ · INF_l(U). The simultaneous construction of S and of an associated path in U starts at the root of U. Assume that the so far constructed path ends at an interior node v with label (x_v, F_v), and that the two outgoing edges from v have labels (A_i, y_i) for i = 1, 2. If we have p ≥ ½ for the probability p that h(x_v) ∈ A_1, conditioned on the assumption that B has processed the previously constructed sequence of adversary responses (where h ∈ A^X is the current hypothesis of the learner), then (x_v, y_1) is chosen to be the next pair in the sequence S. If p < ½ one chooses (x_v, y_2) as the next pair in S. One extends the definition of the path in U by the corresponding edge with label (A_i, y_i), i ∈ {1, 2}. The expected loss of B for this round of the learning process is ≥ ½ l(A_1, y_1), respectively ≥ ½ l(A_2, y_2). This implies the claimed lower bound for the expected total loss of B, since the latter is equal to the sum of the expected losses of B for the individual rounds of the learning process.

It was shown in (Maass, 1991) that there exist function classes F (in fact: classes of {0, 1}-valued functions) such that RLC-OBL_l(F) = ½ ADV_l(F). □

Application 2: The utility of generalized membership queries

From this point on, we will focus exclusively on the case in which l is the discrete loss function, and we will drop the subscript l from our notation (i.e., writing LC-ARB(F) for LC-ARB_l(F)).

We now turn to generalizations of membership queries for function learning. If one allows queries of the form "What is the value of T(x)?", and the range Y is infinite, one cannot expect to get a lower bound on the required number of queries in terms of a nontrivial function of LC-ARB(F) that holds for all F. The reason is that the domain X might contain some special point x_0 such that, for any f ∈ F, f(x_0) reveals the identity of f. The following result provides a lower bound for learning with a large variety of other queries. Special cases of such queries are for example: "In which of the intervals I_1, ..., I_q of R does T(x_0) lie?" (for some partition I_1, ..., I_q of the range of T), or "Does T have 1, 2, ..., q − 1, or more than q − 1 points x where the derivative T′(x) of T has value 0?"

The following result shows that the use of arbitrary queries with at most q different possible answers, where q ≤ 1 + LC-ARB(F), can at best reduce the learning complexity by a factor log_2(1 + LC-ARB(F)).

THEOREM 3. Assume X and Y are arbitrary sets and F is an arbitrary subset of Y^X (l is the discrete loss function, which is dropped from our notation). We assume that d := LC-ARB(F) < ∞.

Let B be an arbitrary learning algorithm for F that either proceeds at each round as usual (i.e. B proposes a hypothesis h ∈ Y^X), or asks a query of the form "To which of the subclasses P_1, ..., P_q of F does T belong?", for some arbitrary partition (P_1, ..., P_q) of F with q ≤ d + 1. We define the total loss of B in a learning process for F as the number of queries asked by B plus the sum of the losses of B at the other rounds (i.e. the number of incorrect predictions of B). Let LC(B) be the supremum of the total losses of B that occur in learning processes for F.

Then LC(B) ≥ d/log_2(d + 1). This lower bound is optimal.

PROOF OF THEOREM 3. The proof of the lower bound proceeds by constructing an adversary strategy for B with the help of an adversary tree U such that INF(U) = LC-ARB(F).

Fix some adversary tree U for F with INF(U) = d. Select some f_v ∈ F_v for each node v on level d, and let F* be the set of these 2^d functions f_v. We define with the help of U an adversary strategy for the learning algorithm B such that after i rounds there are ≥ 2^d/(d + 1)^i functions in F* that are consistent with all responses given during the first i rounds (see Theorem 6.8 in (Maass & Turan, 1992) for a similar argument).


In the case of a query with a partition (P_1, ..., P_q) of F at round i + 1 the adversary responds with that P_j which contains the largest number of functions in F* that are consistent with all preceding responses.

In the case of an equivalence query at round i + 1, where the learner proposes a hypothesis h: X → Y, the adversary fixes some path P_h from the root of U to some node v_h on level d such that for every node v ≠ v_h on the path P_h, the label (A, y) of the other edge from v (i.e. that edge which does not lie on P_h) satisfies h(x_v) ≠ y. Let s > 1 be the number of functions in F* that are consistent with all responses during the first i rounds.

We will subsequently show that for some node v on path P_h, the immediate subtree of v that does not contain v_h contains at least s/(d + 1) nodes w on level d such that f_w ∈ F* is consistent with all responses during the first i rounds. Let v be the first such node on P_h, and let (A, y) be the label of that edge from v that does not lie on P_h. Then the adversary gives the response (x_v, y) at round i + 1. By construction, this response keeps at least s/(d + 1) functions in F* "alive" (i.e. consistent with all responses during the first i + 1 rounds).

Assume for a contradiction that there does not exist any node v on P_h as above. We then have s/(d + 1) ≥ 1, i.e. s ≥ d + 1, since otherwise f_{v_h} would be the only function in F* that is consistent with the first i responses (contradicting our assumption that s > 1). Since P_h has length d, there are then, apart from f_{v_h}, less than d · s/(d + 1) functions in F* that are consistent with the first i responses. Hence the definition of s implies that s < 1 + d · s/(d + 1) ≤ s/(d + 1) + d · s/(d + 1) = s, a contradiction. Thus we have shown that the preceding adversary strategy is well-defined.

A straightforward induction on i shows that for every round i there are ≥ 2^d/(d + 1)^i functions in F* that are consistent with all responses which the preceding adversary strategy has given during the first i rounds.

The optimality of the lower bound of Theorem 3 is demonstrated by the following example. Consider any d ∈ N such that log_2(d + 1) ∈ N. Set X := {1, ..., d}, Y := {0, 1}, F := Y^X. Thus LC-ARB(F) = d. In order to construct a learning algorithm B with LC(B) = ⌈d/log_2(d + 1)⌉, we partition X into ⌈d/log_2(d + 1)⌉ subsets S_i of size at most log_2(d + 1).

At round i of any learning process, the algorithm asks: "To which of the subclasses P_1, ..., P_{d+1} of F does T belong?" for a partition P_1, ..., P_{d+1} of F that classifies each f ∈ F according to its restriction to S_i (in other words: at round i the learner asks to see the values of the target function for all arguments in S_i). □
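The following Python sketch (ours, not from the paper) implements this optimal query strategy for F = Y^X with Y = {0, 1}: the domain is cut into blocks of log_2(d + 1) points, and one partition query with at most d + 1 possible answers recovers the target's restriction to a block. The function names and the `answer_query` interface are hypothetical.

```python
import math

def partition_query_learner(d, answer_query):
    """Recovers T: {0,...,d-1} -> {0,1} with ceil(d / log2(d+1)) queries.
    `answer_query(block)` is assumed to return the tuple (T(x) for x in block),
    which indexes the subclass P_j of F containing T."""
    block_size = int(math.log2(d + 1))     # 2^block_size <= d + 1 possible answers
    target = {}
    for start in range(0, d, block_size):
        block = list(range(start, min(start + block_size, d)))
        values = answer_query(block)       # one query reveals T on the whole block
        target.update(zip(block, values))
    return target

T = [1, 0, 1, 1, 0, 0, 1]                  # d = 7, log2(d + 1) = 3: only 3 queries
learned = partition_query_learner(7, lambda block: tuple(T[x] for x in block))
print(learned == {x: T[x] for x in range(7)})  # True
```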

REMARK. The special cases of the Theorems in this section for concepts (i.e. Y := {0, 1}) are due to Littlestone (1988) (Theorem 1), Littlestone and Maass (see (Maass, 1991)) (Theorem 2), and Maass and Turan (1992) (Theorem 3). Using methods from (Auer & Long, 1994), one gets a suboptimal but more general bound than that in Theorem 3 for the case where queries of the form "To which of the subclasses P_1, ..., P_k of F does T belong?" are allowed for some fixed k ≥ 2.

2.1. Adversary Trees for Function Learning in Models with Weaker Reinforcement

We briefly consider here two natural variations of the model for function learning that was defined at the beginning of this section. It is curious that for each of these two variations there also exists, in addition to the usual min-max definition, an equivalent max-min definition of the resulting learning complexity in terms of adversary trees.


Let h_s ∈ H be the hypothesis of the learner at round s of a learning process. In the first variation of the learning model, we assume that instead of a pair (x_s, T(x_s)), the learner receives at round s from the environment just an argument x_s ∈ X such that h_s(x_s) ≠ T(x_s). However the learner does not receive the "correct value" T(x_s) for argument x_s. We write LC-ARB_weak(F) for the maximal number of steps that the best learning algorithm for F with arbitrary hypotheses has to use in this model.

For this learning model we consider a notion of a finite adversary tree U_weak where each node is labeled by some x ∈ X. Each interior node has fan-out |Y|, and its outgoing edges are labeled by all possible values y ∈ Y. We say that a function f ∈ F is qualified for an edge with label y from a node labeled x if f(x) ≠ y. We demand that for every leaf of U_weak there exists at least one function f ∈ F that is qualified for all edges on the path from the root to that leaf. We set

INF(U_weak) := min{m ∈ N: m is the length of a path from the root to a leaf in U_weak}

and

ADV_weak(F) := sup{INF(U_weak): U_weak is an adversary tree for F as defined above}.

THEOREM 4. Let X, Y be arbitrary sets (finite or infinite) and F ⊆ Y^X nonempty. Then LC-ARB_weak(F) = ADV_weak(F).

PROOF OF THEOREM 4. It is obvious that any adversary tree U_weak defines an adversary strategy that forces the learner to make at least INF(U_weak) mistakes. This implies that LC-ARB_weak(F) ≥ ADV_weak(F). In order to show the other inequality we set F_{x,y} := {f ∈ F: f(x) ≠ y}. We define a learning algorithm which always uses as next hypothesis a function h such that for any x ∈ X the value h(x) ∈ Y has the property that ADV_weak(F_{x,h(x)}) < ADV_weak(F). It will be shown in Lemma 3 that this learning algorithm is well-defined. A straightforward induction on ADV_weak(F) shows that for every nonempty function class F ⊆ Y^X this learning algorithm for F requires at most ADV_weak(F) learning steps.

LEMMA 3. For all x ∈ X there exists some y ∈ Y such that ADV_weak(F_{x,y}) < ADV_weak(F).

PROOF OF LEMMA 3. Assume that the claim does not hold for some x_0 ∈ X, i.e. ∀y ∈ Y (ADV_weak(F_{x_0,y}) ≥ ADV_weak(F)). Then one can build an adversary tree U_weak for F with x_0 as label of the root which satisfies INF(U_weak) > ADV_weak(F). This provides a contradiction to the definition of ADV_weak(F). □

The second variation of our function learning model, considered previously by Barland (1992), applies to arbitrary function classes F ⊆ Y^X for linearly ordered sets Y. In this model the learner receives at round s, together with an argument x_s such that h_s(x_s) ≠ T(x_s), also the information whether T(x_s) > h_s(x_s) ("h_s(x_s) is too low") or T(x_s) < h_s(x_s) ("h_s(x_s) is too high"). We write LC-ARB_sign(F) for the maximal number of steps that the best learning algorithm for F with arbitrary hypotheses has to use in this model.

We define for this model a notion of a finite adversary tree U_sign where again each node is labeled by some x ∈ X. If a node is not a leaf then it has two outgoing edges that are labeled by "> y" respectively "< y" for some y ∈ Y. We say that a function f is qualified for an edge with label "> y" ("< y") from some node with label x if f(x) > y (respectively f(x) < y). We demand that for every leaf of U_sign there exists at least one f ∈ F that qualifies for all edges on the path from the root to that leaf.

INF(U_sign) := min{m ∈ N: m is the length of a path from the root to a leaf in U_sign},

and

ADV_sign(F) := sup{INF(U_sign): U_sign is an adversary tree for F as defined above}.

THEOREM 5. Let X be an arbitrary set and let Y be an arbitrary linearly ordered set (finite or infinite). Then for any nonempty function class F ⊆ Y^X,

LC-ARB_sign(F) = ADV_sign(F).

PROOF OF THEOREM 5. Each adversary tree U_sign defines an adversary strategy in the obvious way. Simultaneously each learning process also defines a path from the root to a leaf in U_sign in the following way. Assume that one is currently at a node with label x, and that "> y" and "< y" are the labels of the two edges out of that node. If h(x) ≤ y, where h(x) is the learner's prediction of the value of the target function at argument x, then the adversary responds that "h(x) is too low" and moves from the node with label x along the edge with label "> y". Else the adversary responds that "h(x) is too high" and moves along the edge with label "< y". This observation implies that LC-ARB_sign(F) ≥ ADV_sign(F).

In order to show that LC-ARB_sign(F) ≤ ADV_sign(F) we first note that LC-ARB_sign(F) = ADV_sign(F) = ∞ in case that there exists some x_0 ∈ X with |{f(x_0): f ∈ F}| = ∞. Hence the claimed inequality is trivial for this case. In the following we consider the case where |{f(x): f ∈ F}| is finite for all x ∈ X. The claim is also trivial if ADV_sign(F) = 0.

We define a learning algorithm for F that needs at most ADV_sign(F) steps by recursion over ADV_sign(F). The algorithm predicts for each x ∈ X some value h(x) ∈ Y such that both ADV_sign({f ∈ F: f(x) > h(x)}) and ADV_sign({f ∈ F: f(x) < h(x)}) are less than ADV_sign(F). If the adversary responds that "h(x) is too low" ("h(x) is too high") then one generates the next hypothesis in the same manner, but applied to {f ∈ F: f(x) > h(x)} (respectively {f ∈ F: f(x) < h(x)}) instead of F. One can easily show by induction on ADV_sign(F) that this learning algorithm uses at most ADV_sign(F) steps. The following lemma shows that this algorithm is well-defined.


LEMMA 4. For all x ∈ X there exists some y ∈ Y such that both ADV_sign({f ∈ F: f(x) > y}) and ADV_sign({f ∈ F: f(x) < y}) are less than ADV_sign(F).

PROOF OF LEMMA 4. Assume that the claim does not hold for some x_0 ∈ X. Since |{f(x_0): f ∈ F}| is finite there exists some minimal y_0 ∈ Y such that ADV_sign({f ∈ F: f(x_0) > y_0}) < ADV_sign(F). Together with our assumption this implies that ADV_sign({f ∈ F: f(x_0) < y_0}) ≥ ADV_sign(F). Furthermore ADV_sign({f ∈ F: f(x_0) > y}) ≥ ADV_sign(F) for every y < y_0, by the minimal choice of y_0. Hence one can build an adversary tree U_sign for F with x_0 as label of its root which satisfies INF(U_sign) > ADV_sign(F). This provides a contradiction to the definition of ADV_sign(F). □
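For intuition, here is a sketch (ours, not from the paper) of the learning algorithm of Theorem 5 in its simplest instance: F is the class of constant functions from X into {0, ..., N − 1}. Predicting the median of the values that are still consistent roughly halves the candidate set after each "too low"/"too high" response, so about log_2 N rounds suffice; the `feedback` interface is a hypothetical stand-in for the environment.

```python
def learn_constant_sign(N, feedback, max_rounds=64):
    """`feedback(h)` is assumed to return 'correct', 'low' (target > h),
    or 'high' (target < h) for the predicted constant value h."""
    candidates = list(range(N))            # values consistent with all feedback so far
    for rounds in range(1, max_rounds + 1):
        h = candidates[len(candidates) // 2]    # median: both halves stay small
        response = feedback(h)
        if response == 'correct':
            return h, rounds
        if response == 'low':
            candidates = [v for v in candidates if v > h]
        else:
            candidates = [v for v in candidates if v < h]
    raise RuntimeError("inconsistent feedback")

target = 37
h, rounds = learn_constant_sign(
    64, lambda h: 'correct' if h == target else ('low' if h < target else 'high'))
print(h, rounds)  # 37 found; at most ceil(log2(64)) + 1 = 7 rounds in the worst case
```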

REMARK. It is obvious that for any F,

LC-ARB(F) ≤ LC-ARB_sign(F) ≤ LC-ARB_weak(F).

Very recently a new proof technique was introduced in (Auer & Long, 1994) which shows that for any F ⊆ Y^X, LC-ARB_weak(F) ≤ 1.39|Y|⌈1 + log_2 |Y|⌉ · LC-ARB(F).

3. Learning Unions of Function Classes

In this and the following sections we will only consider the case of function learning with the discrete loss function l. Hence we can delete the subscript l in our notation (as already done in the last part of the preceding section, starting at Application 2).

The quantity LC-ARB(/") measures the difficulty of learning given that the target func-tion is in T. If one has determined interesting classes T\,...,fp, then the assumptionthat the target function is in uf=1 ft is intuitively safer than assuming that the target is inone of the individual classes. If LC-ARB(Uf=I ft) is not much larger than the individ-ual LC-ARB(^/)'S> this provides evidence in support of using the optimal algorithm forLC-ARB(Uf=1J)-).

In this section, we approximately determine the best general bounds on

LC-ARB(∪_{i=1}^p F_i)

obtainable in terms of

LC-ARB(F_1), ..., LC-ARB(F_p).

To the best of our knowledge, these results are new even for concept learning, where Y = {0, 1}.

Define φ as follows. For any k_1, ..., k_p, let

φ(k_1, ..., k_p) = max{LC-ARB(∪_{i=1}^p F_i): F_1, ..., F_p are function classes with LC-ARB(F_i) = k_i for i = 1, ..., p}.

First, in the case p = 2, we can calculate φ exactly.

THEOREM 6. For all k_1, k_2 ≥ 0, φ(k_1, k_2) = k_1 + k_2 + 1.


PROOF. For the upper bound, consider the algorithm which simply uses the optimal algorithm for F₁ until it has received k₁ + 1 counterexamples, at which time it switches to the optimal algorithm for F₂.
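A hedged sketch of this switching strategy follows; A1 and A2 stand for optimal learners for F₁ and F₂ with a hypothetical interface (.hypothesis() and .counterexample(x, y)), and get_counterexample(h) is the environment, returning None when h is correct. All names are ours.

```python
# Sketch of the Theorem 6 upper-bound strategy: at most k1 + 1 + k2 errors.
def learn_union(A1, A2, k1, get_counterexample):
    current, budget = A1, k1 + 1           # A1 can be refuted k1 + 1 times
    while True:
        ce = get_counterexample(current.hypothesis())
        if ce is None:                     # hypothesis equals the target
            return
        current.counterexample(*ce)
        budget -= 1
        if budget == 0:                    # target cannot lie in F1,
            current, budget = A2, float('inf')   # so switch to A2 for good
```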

Now for the lower bound, choose k₁, k₂ ≥ 0. Let F be the set of all functions f from {1, …, k₁ + k₂ + 1} to {0, 1} such that |f⁻¹(0)| ≤ k₁. Let G be the set of all functions g from {1, …, k₁ + k₂ + 1} to {0, 1} such that |g⁻¹(1)| ≤ k₂. Note that every function from {1, …, k₁ + k₂ + 1} to {0, 1} must be in either F or G. In other words, F ∪ G is the set of all functions from {1, …, k₁ + k₂ + 1} to {0, 1} and therefore

LC-ARB(F ∪ G) = k₁ + k₂ + 1.   (3.1)

It is easy to see that the VC-dimensions of F and G are k₁ and k₂ respectively, proving that

LC-ARB(F) ≥ k₁   (3.2)

LC-ARB(G) ≥ k₂.   (3.3)

Next, consider the algorithm for F which repeatedly hypothesizes the function which evaluates to 1 on all previously unseen elements of the domain, therefore only receiving counterexamples which evaluate to 0. Since there are only k₁ such points, LC-ARB(F) ≤ k₁. Similarly, the algorithm for G which repeatedly hypothesizes the function which evaluates to 0 on all previously unseen points receives at most k₂ counterexamples. Combining this with (3.2) and (3.3), we can see that LC-ARB(F) = k₁ and LC-ARB(G) = k₂. Combining this with (3.1) yields φ(k₁, k₂) ≥ k₁ + k₂ + 1, completing the proof. □

The case in which the number p of classes in the union might be greater than 2 is more difficult. We obtain closed-form bounds for this case that match to within a constant factor, and bounds which are sometimes tighter but are not closed-form.

3.1. Closed-Form Bounds

To establish the closed-form bound, we have found it useful to generalize some of the Weighted Majority (Littlestone & Warmuth, 1991) results to functions with arbitrary ranges.

Suppose we have (LC-ARB) algorithms A₁, …, A_n for some class F of functions from X to Y. The weighted maximum algorithm (for A₁, …, A_n) studied in this section (we will hereafter refer to it simply as WM) works as follows.

It initializes a sequence w_{1,1}, …, w_{1,n} of weights. Its initial hypothesis h_{1,WM} is formed as follows. Suppose h_{1,1}, …, h_{1,n} are the initial hypotheses of A₁, …, A_n respectively. Then for any x ∈ X, h_{1,WM}(x) is the element y ∈ Y for which Σ_{i : h_{1,i}(x) = y} w_{1,i} is maximized.

After the t-th counterexample (x_t, y_t), the weights are updated as follows:

w_{t+1,i} = w_{t,i}/2 if h_{t,i}(x_t) ≠ y_t, and w_{t+1,i} = w_{t,i} otherwise.   (3.4)

Also, for all algorithms Aᵢ for which h_{t,i}(x_t) ≠ y_t, the counterexample (x_t, y_t) is fed to Aᵢ, and h_{t+1,i} is set to Aᵢ's next hypothesis. For all other algorithms, the hypothesis is unchanged. The master hypothesis h_{t+1,WM} is then calculated from h_{t+1,1}, …, h_{t+1,n}

as above.
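The following is a minimal sketch of WM, assuming subroutine learners with a hypothetical interface (.hypothesis() returning a callable, .counterexample(x, y)); the halving update follows the rule above, and all names are ours.

```python
# Sketch of the weighted maximum master algorithm WM.
from collections import defaultdict

class WeightedMaximum:
    def __init__(self, learners, weights=None):
        self.learners = learners
        self.w = list(weights) if weights else [1.0] * len(learners)
        self.h = [A.hypothesis() for A in learners]

    def predict(self, x):
        """h_WM(x): the value carrying the maximum total weight."""
        votes = defaultdict(float)
        for wi, hi in zip(self.w, self.h):
            votes[hi(x)] += wi
        return max(votes, key=votes.get)

    def counterexample(self, x, y):
        for i, A in enumerate(self.learners):
            if self.h[i](x) != y:          # subroutine i was also wrong:
                self.w[i] /= 2.0           # halve its weight and
                A.counterexample(x, y)     # feed it the counterexample
                self.h[i] = A.hypothesis()
```

With the initial weights set to 2^{LC-ARB(Fᵢ)}, this is exactly the configuration used in the proof of Theorem 7 below.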


The following notation will be useful for the rest of this subsection. For each positive integer t, let

W_t = Σᵢ₌₁ⁿ w_{t,i}

be the total weight before the t-th counterexample.

LEMMA 5. For any LC-ARB algorithms A₁, …, A_n, after m counterexamples,

m ≤ log₂(W₁/W_{m+1}) / log₂(4/3).

PROOF. Choose a positive integer t ≤ m. Let (x_t, y_t) be the t-th counterexample received by the WM algorithm. By the definition of h_{t,WM},

Σ_{i : h_{t,i}(x_t) = h_{t,WM}(x_t)} w_{t,i} ≥ Σ_{i : h_{t,i}(x_t) = y_t} w_{t,i},

and therefore, since h_{t,WM}(x_t) ≠ y_t and every algorithm Aᵢ with h_{t,i}(x_t) ≠ y_t has its weight halved, the total weight that is halved is at least W_t/2 (it equals W_t minus the weight on y_t, and is at least the weight on h_{t,WM}(x_t)). Therefore, we have

W_{t+1} ≤ (3/4) W_t

by (3.4). By induction, we have

W_{m+1} ≤ (3/4)^m W₁.

Solving for m yields the desired result. □

Many of the other Weighted Majority results (Littlestone & Warmuth, 1991) appear to generalize just as easily.

For each n, define the set SVAR_n of functions from {0, 1}ⁿ to {0, 1} to be all fᵢ, 1 ≤ i ≤ n, where fᵢ(x) = xᵢ for all x ∈ {0, 1}ⁿ.


LEMMA 6 (Littlestone, 1989). For all n, LC-ARB(SVAR_n) = ⌊log₂ n⌋.

Now we are ready to establish bounds on φ that match to within a constant factor.

THEOREM 7. For all p ≥ 1, k₁, …, k_p ≥ 0,

⌊log₂ Σᵢ₌₁^p 2^{kᵢ}⌋ ≤ φ(k₁, …, k_p) ≤ (1/log₂(4/3)) log₂ Σᵢ₌₁^p 2^{kᵢ}.

PROOF. We begin with the upper bound. Choose a sequence F₁, …, F_p of classes of functions defined on a common domain X.

Consider the following (LC-ARB) algorithm (call it A) for ∪ᵢ₌₁^p Fᵢ. It uses the optimal algorithms for F₁, …, F_p (call them A₁, …, A_p) as subroutines for the WM algorithm, with the initial weights set to 2^{LC-ARB(F₁)}, …, 2^{LC-ARB(F_p)}.

Suppose the target is in Fᵢ, and that A has received m counterexamples. Since Aᵢ can receive at most LC-ARB(Fᵢ) counterexamples, we have

W_{m+1} ≥ 2^{kᵢ} · 2^{−kᵢ} = 1.

Applying Lemma 5,

m ≤ (1/log₂(4/3)) log₂ Σⱼ₌₁^p 2^{kⱼ}.

Since F₁, …, F_p were chosen arbitrarily,

φ(k₁, …, k_p) ≤ (1/log₂(4/3)) log₂ Σᵢ₌₁^p 2^{kᵢ}.   (3.5)

Now for the lower bound. Choose k₁, …, k_p. Let X = {0, 1}^{Σᵢ₌₁^p 2^{kᵢ}}, Y = {0, 1}. For each u ≤ Σᵢ₌₁^p 2^{kᵢ}, let f_u : X → Y be defined by setting f_u(x) = x_u for all x ∈ X. For each i ≤ p, let

Gᵢ = {f_u : Σ_{j<i} 2^{kⱼ} < u ≤ Σ_{j≤i} 2^{kⱼ}}.

By trivial application of Lemma 6, for all i, LC-ARB(Gᵢ) = kᵢ, and

LC-ARB(∪ᵢ₌₁^p Gᵢ) = ⌊log₂ Σᵢ₌₁^p 2^{kᵢ}⌋.


Thus, φ(k₁, …, k_p) ≥ ⌊log₂ Σᵢ₌₁^p 2^{kᵢ}⌋. Combining this with (3.5) completes the proof. □

Working more on the bounds of Theorem 7, we obtain looser, but simpler, bounds which still match to within a constant factor and give a better idea of the flavor of the behavior of φ.

THEOREM 8. For all p ≥ 1, k₁, …, k_p ≥ 0,

PROOF. Choose p ≥ 1, k₁, …, k_p ≥ 0. Then, by Theorem 7,

establishing the upper bound. Also, again by Theorem 7,

completing the proof. □


We next turn to proving that the upper bounds of Theorems 7 and 8 cannot be improvedby reducing the leading constant.

For each u ∈ ℕ, let POWER_u be the set of all functions from {1, …, u} to {0, 1}, and for each v ≤ u and for each f ∈ POWER_u, let

B_{f,v} = {g ∈ POWER_u : |{x : g(x) ≠ f(x)}| ≤ v}.

The following lemma, implicit in the work of Maass and Turán (Maass & Turán, 1992, Proposition 6.3), will prove useful.

LEMMA 7 (Maass & Turán, 1992). Define ρ : ℕ → ℕ by

Then for each k ∈ ℕ, there exist f₁, …, f_{ρ(k)} ∈ POWER_{3k} such that POWER_{3k} = ∪ᵢ₌₁^{ρ(k)} B_{fᵢ,k}.

We apply this to show that the 1/log₂(4/3) of Theorems 7 and 8 is optimal.

THEOREM 9. If ρ : ℕ → ℕ is defined as in Lemma 7, then

PROOF. As observed in (Maass & Turán, 1992), for any f, LC-ARB(B_{f,k}) = k (predict with f on unseen points). Since LC-ARB(POWER_{3k}) = 3k,

φ(k, …, k) ≥ 3k  (with ρ(k) arguments, each equal to k).   (3.6)

Also,

Hence

Combining this with (3.6) yields the desired result. □


3.2. Other Bounds

Now we describe some bounds which are sometimes tighter, yet which are not closed-form. The following theorem describes the bounds, where C(u, ≤v) is defined to be Σ_{w=0}^{v} C(u, w), the sum of the binomial coefficients C(u, w) = (u choose w).

THEOREM 10. For all p ≥ 1, k₁, …, k_p ≥ 0,

These bounds are sometimes tighter than those of Theorem 7. For example, Theorem 7 implies that φ(5, 10, 20) ≤ 48, whereas Theorem 10 implies that φ(5, 10, 20) ≤ 31.

Our proof proceeds through a series of lemmas. For the upper bound, we use a technique that was developed for coding theory and extended to address the problem of searching using a comparator that occasionally lies ((Berlekamp, 1968), (Rivest, Meyer, Kleitman, Winklmann & Spencer, 1980), (Spencer, 1992)). Cesa-Bianchi, Freund, Helmbold & Warmuth (1994) showed that the proof of (Spencer, 1992) could be modified to determine the (in some cases only nearly) best bound on the number of mistakes for learning a class of {0, 1}-valued functions, given that an unknown algorithm from a pool {A₁, …, A_p} would make at most k mistakes. Due to the fact that the functions in our application take on potentially more than two values, the elegant argument from (Cesa-Bianchi, Freund, Helmbold & Warmuth, 1994) cannot easily be modified for our application. An inductive argument like that given below is apparently required.

The lower bound is obtained by generalizing Lemma 7 and the argument of Theorem 9.

For the remainder of this section, we adopt the convention that LC-ARB(∅) = −1, and that for all integers u ≥ 0 and w < 0, C(u, ≤w) = 0. Note that even with the above convention, the following standard relationships still hold.

LEMMA 8. For all integers u, v such that u ≥ 1,

C(u, ≤v) = C(u−1, ≤v) + C(u−1, ≤v−1)

and

C(u, ≤v) ≤ 2 C(u−1, ≤v).

We begin by establishing the upper bound. We will make use of a general-purpose algorithm described in the following definition.

DEFINITION 2. For each sequence F₁, …, F_p of function classes, define the algorithm B_{F₁,…,F_p} for learning ∪ᵢ₌₁^p Fᵢ as follows. Suppose for each i, kᵢ = LC-ARB(Fᵢ). Let

r = min{m ∈ ℕ : Σᵢ₌₁^p C(m, ≤kᵢ) ≤ 2^m}.


(One can interpret r as the mistake bound that B_{F₁,…,F_p} is shooting for.) Let B₁, …, B_p be optimal algorithms for F₁, …, F_p respectively. (Note that in this model, optimal algorithms exist.) Then B_{F₁,…,F_p}'s initial hypothesis h is constructed as follows. If h₁, …, h_p are the initial hypotheses of B₁, …, B_p respectively, for each x, choose h(x) arbitrarily so that

Σ_{i : hᵢ(x) = h(x)} C(r, ≤kᵢ)

is maximized. Notice that h can be viewed as taking a "weighted maximum" of the hypotheses h₁, …, h_p, with weights C(r, ≤k₁), …, C(r, ≤k_p). After getting a first counterexample (x, y), if for each i ≤ p, Gᵢ = {f ∈ Fᵢ : f(x) = y}, B_{F₁,…,F_p} recurses to B_{G₁,…,G_p}, learning ∪ᵢ₌₁^p Gᵢ.
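The quantities of Definition 2 are simple to compute; the sketch below shows them in Python. Note that the defining inequality for r used here is our reconstruction of the display above (the Berlekamp-style condition Σᵢ C(m, ≤kᵢ) ≤ 2^m), and all function names are ours.

```python
# Sketch of the weights and the target bound r from Definition 2.
from math import comb

def binom_le(u, v):
    """C(u, <=v), with the convention that it is 0 for v < 0."""
    return sum(comb(u, j) for j in range(v + 1)) if v >= 0 else 0

def target_mistake_bound(ks):
    """Smallest r with sum_i C(r, <=k_i) <= 2**r (assumed form of r)."""
    r = 0
    while sum(binom_le(r, k) for k in ks) > 2 ** r:
        r += 1
    return r

def weighted_max_prediction(hyps, ks, r, x):
    """h(x): a value maximizing the weight sum over agreeing subroutines."""
    votes = {}
    for h_i, k_i in zip(hyps, ks):
        y = h_i(x)
        votes[y] = votes.get(y, 0) + binom_le(r, k_i)
    return max(votes, key=votes.get)
```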

The following lemma shows that B_{F₁,…,F_p}'s hypothesis minimizes a technical quantity used later in our proofs.

LEMMA 9. Choose function classes F₁, …, F_p. Then if r, k₁, …, k_p, h₁, …, h_p and h are defined as in Definition 2, for all x ∈ X, if ŷ = h(x), then for all y ∈ Y,

PROOF. Choose x. Choose y. Then

which implies

Rearranging terms yields

Finally, adding the sum of the weights of the algorithms with hᵢ(x) ∉ {ŷ, y} to the second summation on either side of the inequality yields


completing the proof.

The following lemma is well known and easily verified.

LEMMA 10. Choose integers l, p ≥ 0 and k₁, …, k_p. Then

We establish the upper bound for Theorem 10 in the following lemma.

LEMMA 11. Choose classes F₁, …, F_p of functions with ∪ᵢ₌₁^p Fᵢ ≠ ∅. Then if for i ≤ p, kᵢ = LC-ARB(Fᵢ), the algorithm B_{F₁,…,F_p} makes at most r mistakes, where r is defined as in Definition 2.

PROOF. The proof is by induction on p + Σᵢ₌₁^p kᵢ. The induction hypothesis is that for each positive integer m, for any F₁, …, F_p, if k₁, …, k_p and r are defined as in Definition 2 and satisfy m = p + Σᵢ₌₁^p kᵢ, then B_{F₁,…,F_p} makes at most r mistakes.

Base case: If p + Σᵢ₌₁^p kᵢ = 1, then kⱼ ≥ 0 and Fⱼ ≠ ∅ for exactly one j, and kⱼ = 0 and |Fⱼ| = 1. Define r, h, h₁, …, h_p as in Definition 2. In this case r = 0, C(r, ≤kⱼ) = 1 and C(r, ≤kᵢ) = 0 for all i ≠ j. Thus h(x) = hⱼ(x) for all x, and B_{F₁,…,F_p} never makes a mistake, establishing the base case.

Induction step: Assume m > 1. Choose F₁, …, F_p such that if k₁, …, k_p are defined as in Definition 2, then m = p + Σᵢ₌₁^p kᵢ. Define r, h, h₁, …, h_p as in Definition 2. Note that r ≥ 1.

Choose a target function T in ∪ᵢ₌₁^p Fᵢ. Choose x ∈ X. Let y = T(x) and ŷ = h(x). Assume without loss of generality that ŷ ≠ y. For each i ≤ p, let Gᵢ = {f ∈ Fᵢ : f(x) = y}, and lᵢ = LC-ARB(Gᵢ).

(Lemma 8)


by Lemma 9. Breaking apart one of the summations, we get

Since, for all integers u ≥ 0 and all v, C(u, ≤v) ≥ C(u, ≤v−1) (recall that C(u, ≤v) is defined to be 0 for a negative second argument), (3.7) implies

Since B₁, …, B_p are optimal, lᵢ ≤ kᵢ if hᵢ(x) = y and lᵢ ≤ kᵢ − 1 if hᵢ(x) ≠ y, so

Since

we have

Applying (3.8) yields


which trivially implies

Thus

which, together with Lemma 10, implies

which in turn gives

However, by the induction hypothesis, the number of mistakes made by B_{G₁,…,G_p} is at most

and therefore the number of mistakes made by B_{F₁,…,F_p} is at most

Combining this with (3.9) completes the proof. □

The following more general variant of Lemma 7 will be useful in proving lower bounds on φ.

LEMMA 12 (Maass & Turán, 1992). Choose integers k₁, …, k_p ≥ 0. Then for each integer l such that

there exist f₁, …, f_p ∈ POWER_l such that POWER_l = ∪ᵢ₌₁^p B_{fᵢ,kᵢ}.   (3.11)


PROOF. Choose l satisfying (3.10). Let U be the uniform distribution on POWER_l. Choose f. Then for each i,

Thus, for our fixed f,

This implies that

We have

Dividing by 2^l and exponentiating yields

Dividing both sides by the right hand side and "pushing the exponential function through the sum," we get

Applying the well known fact that for all x, 1 + x ≤ eˣ yields

By (3.12), this implies that

and therefore there exist g₁, …, g_p such that for all f ∈ POWER_l, there is an i such that |{x : f(x) ≠ gᵢ(x)}| ≤ kᵢ, i.e. (3.11) holds. This completes the proof. □

We finish the proof of Theorem 10 by establishing the lower bound.


LEMMA 13. For all p ≥ 1, k₁, …, k_p ≥ 0,

PROOF. Fix p ≥ 1, k₁, …, k_p ≥ 0. Choose l such that

and let f₁, …, f_p ∈ POWER_l be such that POWER_l = ∪ᵢ₌₁^p B_{fᵢ,kᵢ}. The existence of such fᵢ's is guaranteed by Lemma 12. Since for each i, LC-ARB(B_{fᵢ,kᵢ}) = kᵢ and LC-ARB(POWER_l) = l, we have that

and since l was chosen subject to (3.13), we have

completing the proof. □

Suppose φ_weak is defined analogously to φ for the LC-ARB_weak model described at the end of Section 2. Formally,

φ_weak(k₁, …, k_p) = max{LC-ARB_weak(∪ᵢ₌₁^p Fᵢ) : F₁, …, F_p are function classes with LC-ARB_weak(Fᵢ) = kᵢ for i = 1, …, p}.

Using a more straightforward argument, we may determine φ_weak exactly.

THEOREM 11. For all p ≥ 1, k₁, …, k_p ≥ 0, φ_weak(k₁, …, k_p) = p − 1 + Σᵢ₌₁^p kᵢ.

PROOF. Choose p ≥ 1, k₁, …, k_p ≥ 0. We begin with the upper bound. Choose a set X and sets F₁, …, F_p for which LC-ARB_weak(Fᵢ) = kᵢ for each i.

Consider the following algorithm for learning ∪ᵢ₌₁^p Fᵢ. If p = 1, it just runs the optimal algorithm for F₁. If p > 1, it first runs the optimal algorithm for F₁ until either it correctly guesses the target or until it gets LC-ARB_weak(F₁) + 1 counterexamples. In the second case, it recurses to learning ∪ᵢ₌₂^p Fᵢ. By a trivial induction, this algorithm receives at most


p − 1 + Σᵢ₌₁^p kᵢ counterexamples, establishing the fact that

φ_weak(k₁, …, k_p) ≤ p − 1 + Σᵢ₌₁^p kᵢ.   (3.14)

Now for the lower bound. For each i ≤ p, define Fᵢ to be the set of all functions f from {1, …, p − 1 + Σⱼ₌₁^p kⱼ} to {2i − 1, 2i} for which |f⁻¹(2i)| ≤ kᵢ. Trivially, for each such i, LC-ARB_weak(Fᵢ) = kᵢ.

Choose an algorithm A for learning ∪ᵢ₌₁^p Fᵢ in the LC-ARB_weak model. Let h₁ be A's initial hypothesis, and for each 1 < t ≤ p − 1 + Σᵢ₌₁^p kᵢ, let h_t be A's hypothesis given that each previous hypothesis h_s had received s as a counterexample. Define g : {1, …, p − 1 + Σᵢ₌₁^p kᵢ} → {1, …, 2p} by g(t) = h_t(t). Let i₀ ≤ p be such that

|g⁻¹({2i₀ − 1, 2i₀})| ≤ kᵢ₀.

Note that such an i₀ must exist, since otherwise

p − 1 + Σᵢ₌₁^p kᵢ ≥ Σᵢ₌₁^p (kᵢ + 1) = p + Σᵢ₌₁^p kᵢ,

a contradiction. Define f : ℕ → {1, …, 2p} as follows:

f(t) = 2i₀ if g(t) = 2i₀ − 1, and f(t) = 2i₀ − 1 otherwise.

Since

|f⁻¹(2i₀)| ≤ |g⁻¹({2i₀ − 1, 2i₀})| ≤ kᵢ₀,

we have f ∈ Fᵢ₀, and therefore f ∈ ∪ᵢ₌₁^p Fᵢ. Furthermore, for each t ≤ p − 1 + Σᵢ₌₁^p kᵢ, f(t) ≠ h_t(t), and if f is the target, t is a valid counterexample for h_t. Since A was chosen arbitrarily,

φ_weak(k₁, …, k_p) ≥ p − 1 + Σᵢ₌₁^p kᵢ.

Combining this with (3.14) yields the desired result. □


Notice that the range of ∪ᵢ₌₁^p Fᵢ used in the lower bound argument above grows with p. Thus, for example, determining the best bounds on LC-ARB_weak(∪ᵢ₌₁^p Fᵢ) that can be obtained in terms of LC-ARB_weak(F₁), …, LC-ARB_weak(F_p) and the size of the range of ∪ᵢ₌₁^p Fᵢ remains an open problem.

4. Learning the Maximum of Linear Functions

A natural and commonly studied function class is the set of piecewise linear functions. Piecewise linear functions are simple enough to allow efficient algorithms (e.g. linear programming); they are easy to handle and easy to describe. On the other hand, they are sufficiently complex to approximate and model most functions occurring in practice. Unfortunately, the following example shows that no finite mistake bound can be obtained even for the very restricted case of continuous functions consisting of three linear pieces and with arbitrary functions as hypotheses.²

EXAMPLE 1. For 0 ≤ a < b ≤ 1, we define continuous piecewise linear functions g_{a,b} : ℝ → ℝ with

g_{a,b}(x) = 0 for x ≤ a,  g_{a,b}(x) = (x − a)/(b − a) for a < x < b,  g_{a,b}(x) = 1 for x ≥ b.

An infinite sequence of CEs can be constructed as follows: Let a₁ = 0, b₁ = 1. For all i ≥ 1 the i-th CE is given by ((aᵢ + bᵢ)/2, fᵢ), where fᵢ ∈ {0, 1} depending on the learner's hypothesis. If fᵢ = 0 then aᵢ₊₁ = (aᵢ + bᵢ)/2, bᵢ₊₁ = bᵢ; if fᵢ = 1 then aᵢ₊₁ = aᵢ, bᵢ₊₁ = (aᵢ + bᵢ)/2. Clearly this process does not terminate, and for any i the function g_{aᵢ,bᵢ} is consistent with all previous CEs.
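The adversary of Example 1 is essentially a binary search; the following sketch makes this explicit. Here learner_predicts(x) stands for the learner's current hypothesis value at x (a hypothetical callback of ours).

```python
# Sketch of the Example 1 adversary: it forces one counterexample per round.
def adversary(learner_predicts, rounds):
    a, b = 0.0, 1.0
    for _ in range(rounds):
        mid = (a + b) / 2
        # Answer with whichever label the learner did not predict at mid.
        f = 0 if learner_predicts(mid) != 0 else 1
        yield (mid, f)            # the i-th counterexample
        if f == 0:
            a = mid               # the target's ramp lies to the right
        else:
            b = mid               # the target's ramp lies to the left
```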

The function in Example 1 is non-convex. In this section we will show that under the restriction of convexity, the class of piecewise linear functions admits mistake-bounded algorithms, thus yielding one of the first positive results for functions of more than one real variable. We will present two algorithms for this learning problem. The mistake bound for the first algorithm grows polynomially in the number of pieces for a fixed number of variables, and the mistake bound for the second algorithm grows polynomially in the number of variables for a fixed number of pieces. These results trivially imply mistake-bounded algorithms for the class of arbitrary continuous functions consisting of two linear pieces, since all such functions are either convex or concave.

For a class F of functions, we denote by F_k^max the class {f : ∃f₁, …, f_k ∈ F ∀x (f(x) = max_{1≤i≤k} fᵢ(x))}. Let L(d) consist of the linear functions from ℝ^d into ℝ. (A linear function is defined as usual by a vector c ∈ ℝ^d, c ≠ 0, and a b ∈ ℝ. Points x ∈ ℝ^d are mapped to c^T x + b.)


4.1. The Maximum of Linear Functions of a Fixed Number of Variables

We first present the algorithm whose mistake bound is polynomial in the number k of pieces if the number d of variables is constant. Hence, let T ∈ L(d)_k^max be the unknown target function, with T(x) = max_{1≤i≤k} fᵢ(x) for certain fᵢ ∈ L(d). We begin by defining a technical quantity central in our analysis.

DEFINITION. For integers k ≥ 2, δ ≥ 1, we define ψ_k(δ) = (3k^δ − 2k^{δ−1} − 1)/(k − 1). (Note that ψ_k(δ) is an integer for all k, δ.) Moreover, we define ψ_k(0) = 1 and ψ₁(δ) = δ + 1.

DEFINITION. For P ⊆ ℝ^{d+1}, the dimension δ(P) of P is defined to be the dimension of the smallest dimensional affine subspace S(P) of ℝ^{d+1} with P ⊆ S(P). A set P ⊆ ℝ^{d+1} is called safe with respect to L(d)_k^max if |P| ≥ ψ_k(δ(P)).

Since the intersection of two affine subspaces yields another affine subspace, S(P) is well-defined. The reason for dealing with point sets in ℝ^{d+1} is easy to explain: every counterexample presented to our learning algorithm consists of a point x ∈ ℝ^d together with the real value T(x). Together, these d + 1 real numbers form a point in ℝ^{d+1}. Hence, the learning process may also be interpreted as receiving points in ℝ^{d+1} (counterexamples) and trying to cover these points by a small number of hyperplanes (hypotheses).

Our main problem in learning T(x) is to avoid getting large numbers of counterexamples in low-dimensional affine subspaces of ℝ^{d+1}. Suppose we already received as counterexamples three points a, b and c that lie in this order on a straight line. Clearly, this line must be part of some fⱼ. If we receive another counterexample on this line, we cannot deduce any new information. Thus, we should avoid receiving four counterexamples that lie on a common line. As a second example, consider 3k counterexamples lying in a common plane. Such a plane need not be part of any fⱼ: in the worst case, there are k groups each consisting of 3 points on a line such that the intersections of the fⱼ with the plane exactly yield these lines. However, if there are 3k + 1 counterexamples lying in a common plane and if no four of the counterexamples lie on a common line, it can be checked that this plane must be part of some fⱼ. Hence, any line containing three counterexamples and any plane containing 3k + 1 counterexamples should be incorporated in the construction of our hypotheses. Similarly, for higher dimensional affine subspaces there are 'critical' numbers of points that force the subspace to be a subspace of some fⱼ (in case we simultaneously avoid getting too many counterexamples in lower dimensional affine subspaces). It turns out that this critical number in a δ-dimensional affine subspace is exactly ψ_k(δ) as defined above.
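The two quantities involved (the affine dimension δ(P) and the critical number ψ_k(δ)) are easy to compute; the following small sketch, assuming numpy and using names of ours, illustrates the "safe" test from the definition above.

```python
# Sketch of the safe-set test |P| >= psi_k(delta(P)) for P in R^{d+1}.
import numpy as np

def psi(k, delta):
    """Critical number psi_k(delta) of points in a delta-dim affine subspace."""
    if delta == 0:
        return 1
    if k == 1:
        return delta + 1
    return (3 * k**delta - 2 * k**(delta - 1) - 1) // (k - 1)  # exact integer

def affine_dimension(P):
    """delta(P): rank of the differences to one base point of P."""
    P = np.asarray(P, dtype=float)
    if len(P) <= 1:
        return 0
    return int(np.linalg.matrix_rank(P[1:] - P[0]))

def is_safe(P, k):
    """P is safe w.r.t. L(d)_k^max iff it has at least psi_k(delta(P)) points."""
    return len(P) >= psi(k, affine_dimension(P))
```

For instance, psi(k, 1) = 3 and psi(k, 2) = 3k + 1, matching the line and plane examples in the text.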

Safe subsets are those subsets of the counterexamples that span affine subspaces containing a critical number of counterexamples; all safe subsets should be used in the construction of the next hypothesis. This is also the main idea of algorithm B₁ below. Algorithm B₁ maintains the following invariant: for any safe set P contained in the set of counterexamples received by the algorithm which is consistent with some function T in L(d)_k^max, S(P) is contained in the graph of one of the linear functions defining T.

The Algorithm B₁. Let P_{s−1} = {(x₁, T(x₁)), (x₂, T(x₂)), …, (x_{s−1}, T(x_{s−1}))} ⊆ ℝ^{d+1} be the set of counterexamples known at the beginning of Step s. Let

P_{s−1}^{(1)}, …, P_{s−1}^{(m)}


be an enumeration of the safe subsets of P_{s−1}. For x ∈ ℝ^d, an affine subspace S(P_{s−1}^{(i)}) of ℝ^{d+1} assigns a (d+1)-st coordinate to x iff the projection of the subspace into ℝ^d contains the point x. Let Coords_{s−1}(x) denote the set containing the (d+1)-st coordinates assigned to x by the S(P_{s−1}^{(i)}) for all i. In other words,

Then the hypothesis h_s of algorithm B₁ is defined by

LEMMA 14. For all s > 0 and for every safe subset P_{s−1}^{(i)}: (i) there exists some component fⱼ of T such that S(P_{s−1}^{(i)}) ⊆ fⱼ, and (ii) |P_{s−1}^{(i)}| = ψ_k(δ(P_{s−1}^{(i)})).

PROOF. Assume the statement does not hold, and consider the first Step s in which a contradiction occurs. We consider the smallest dimensional P = P_{s−1}^{(i)} that does not fulfill condition (i) or condition (ii). We distinguish three cases depending on the dimension of P and on whether condition (i) or (ii) is violated.

Suppose δ(P) ≥ 1 and P does not fulfill (i). Then we consider the intersections f₁ ∩ P, …, f_k ∩ P. Each of these intersections is at most (δ(P) − 1)-dimensional and hence contains at most ψ_k(δ(P) − 1) of the counterexamples. Consequently, P would contain at most kψ_k(δ(P) − 1) = ψ_k(δ(P)) − 1 counterexamples and could not be safe.

Next suppose δ(P) ≥ 1 and P fulfills (i) but not (ii). In the preceding Step (s − 1), all safe subsets fulfilled (i). Therefore, subspaces spanned by safe subsets agreed with the target function and could not receive any further counterexamples. Since |P| was equal to ψ_k(δ(P)) in Step (s − 1), it is also equal to ψ_k(δ(P)) in Step s.

Finally, for δ(P) = 0, the statement trivially holds. □

THEOREM 12. For d, k ≥ 1, LC-ARB(L(d)_k^max) ≤ kψ_k(d) ∈ O(k^d). Moreover, the hypotheses used in the learning algorithm consist of at most O(k^{d²+2d}) linear pieces, and the learning algorithm is computationally feasible.

PROOF. By Lemma 14, every fⱼ contains at most ψ_k(d) counterexamples. Hence, B₁ must terminate after at most kψ_k(d) counterexamples.

To get the bound on the combinatorial complexity of the hypotheses, we will use the notion of the VC-dimension of set systems: A set X is shattered by a set system S if all 2^{|X|} subsets of X can be obtained by intersecting X with some set in S. The VC-dimension of a set system S is defined to be the cardinality of the largest set shattered by S (in case no largest shattered set exists, the VC-dimension is infinite). Sauer (1972) proved that a set system on p points with VC-dimension v consists of at most O(p^v) sets.

We argue that the VC-dimension of the set system induced by all affine linear subspaces of ℝ^{d+1} is at most d + 2: A set P_VC of points shattered by this system can only contain extreme points (otherwise, some non-extreme point is contained in a simplex formed by several other points in P_VC; the set consisting of the corners of this simplex cannot be obtained by intersecting P_VC with an affine subspace without containing this non-extreme point at the same time). If we have d + 3 extreme points in (d + 1)-dimensional space, the


affine hull of any d + 2 of them is already all of S(P_VC). Then this spanning set cannot be represented by intersecting P_VC with an affine subspace without containing the other point at the same time.

Finally, we take the set of counterexamples as a ground set and consider the set system induced by the affine subspaces used in the construction of the hypothesis. There are at most O(k^d) counterexamples and the VC-dimension of the system is at most d + 2. Now by Sauer's result the claimed bound follows. □

For d = 1, we can strengthen Theorem 12 to the following optimal result.

THEOREM 13. For k ≥ 1, LC-ARB(L(1)_k^max) = 3k − 1.

PROOF. The upper bound is obvious for k = 1. For k ≥ 2, ψ_k(1) = 3 holds and the result in Theorem 12 yields LC-ARB(L(1)_k^max) ≤ 3k. We can do better by observing that after 3k − 1 counterexamples, these points determine k − 1 lines exactly (with each line containing three points) and leave two points for the last line. This yields the upper bound (with a computationally feasible learning algorithm).

For the lower bound, we will give an adversary argument. To this end, we define so-called bad point sets BPS_j, 1 ≤ j ≤ k, fulfilling the following two properties.

(i) BPS_j consists of 3j − 1 points in convex position. Let p₁, …, p_{3j−1} denote the points in BPS_j sorted by increasing x-coordinate.

(ii) There exists an index a ≥ 0 divisible by three such that BPS_j is divided into three parts L_j = {p₁, …, p_a}, M_j = {p_{a+1}, p_{a+2}} and R_j = {p_{a+3}, …, p_{3j−1}}. Scanning through L_j from left to right, every triple p_{3l−2}, p_{3l−1}, p_{3l} lies on a common line segment. Similarly, the points of R_j split from left to right into consecutive triples, each lying on a common line segment. Note that L_j or R_j might be empty.

We claim that a malicious adversary can confront the learner with bad sets of counterexamples BPS₁, …, BPS_k. This yields 3k − 1 counterexamples and proves the lower bound.

For k = 1, the claim trivially holds, and it remains to show how to reach a BPS_{j+1} from some BPS_j. We assume that neither L_j nor R_j is empty; otherwise analogous arguments work. Thus, we must construct three legal counterexamples q₁, q₂ and q₃, no matter how the learner chooses his hypotheses.

The first counterexample is given somewhere between p_{a+1} and p_{a+2}, below the line through p_{a+1} and p_{a+2} but above the lines through p_a, p_{a+1} and p_{a+2}, p_{a+3}. There is ample space to place the new point q₁ such that the convexity condition still holds.

The x-coordinate of the second counterexample q₂ is chosen very close to the x-coordinate of q₁. The y-coordinate of q₂ is determined such that q₂ either lies on the line through p_{a+1}, q₁ or on the line through q₁, p_{a+2}. The actual choice depends on the learner's hypothesis; he can assign to any x-coordinate at most one y-coordinate, and we present to him the other point as new counterexample. (If the x-coordinate of q₂ lies to the left of q₁ and to the right of the intersection of the lines through p_a, p_{a+1} and through q₁, p_{a+2}, the new point q₂ does not violate the convexity condition.)

Depending on the choice of q₂, either the triple p_{a+1}, q₂, q₁ is added to L_j or the triple q₂, q₁, p_{a+2} is added to R_j (since these points lie on a common line segment). In any case, we are left with a single point between L_j and R_j. Counterexample q₃ is put somewhere


close to this single point such that it invalidates the next hypothesis and does not violate the convexity condition.

Obviously, BPS_j together with q₁, q₂ and q₃ forms a new bad point set BPS_{j+1}. We keep on presenting counterexamples till we reach BPS_k. The corresponding hidden function consists of the maximum of all k − 1 lines going through the triples in L_k and R_k, together with the line through the two points in M_k. □

4.2. The Maximum of a Fixed Number of Linear Functions

Next, we describe an algorithm for learning the maximum of a fixed number of linear functions. Our algorithm will be obtained by reducing the problem of learning L(d)_k^max to the problem of learning unions of k halfspaces.

For all positive integers d, k, define HYPER_{d,k} to be the set of all f : ℝ^d → {0, 1} such that there exist a₁, …, a_k ∈ ℝ^d, b₁, …, b_k ∈ ℝ such that for all x ∈ ℝ^d,

f(x) = 1 if and only if aᵢ · x + bᵢ ≥ 0 for some i ≤ k.

The following lemma is a straightforward consequence of a result of (Long & Warmuth, 1993).

LEMMA 15 (Long & Warmuth, 1993). Choose positive integers d and k. There is a learning algorithm B for HYPER_{d,k} such that, for any learning process with target T, for each t, B's t-th hypothesis h_t has h_t(x) ≤ T(x) for all x ∈ ℝ^d, and which makes at most (d + 1)^k mistakes.

We apply this in the following.

THEOREM 14. For integers d, k ≥ 1, LC-ARB(L(d)_k^max) ≤ (d + 2)^k.

PROOF. Choose integers d, k ≥ 1. Consider the following computationally feasible algorithm B for learning L(d)_k^max using as a subroutine an algorithm B' for learning HYPER_{d+1,k} with the properties described in Lemma 15. B works by maintaining a simulated learning process for B'. In each round, B generates a hypothesis from the hypothesis of B', receives a counterexample, and then generates a counterexample for B' from its counterexample.

Suppose h'_t is the t-th hypothesis of B'. Then, if for x ∈ ℝ^d

{u ∈ ℝ^{d+1} : h'_t(u) = 1, (u₁, …, u_d) = x} ≠ ∅,

the t-th hypothesis h_t of B is defined by

h_t(x) = sup{y ∈ ℝ : h'_t((x₁, …, x_d, y)) = 1}.

If B receives (x_t, y_t) as a counterexample to h_t, then it passes ((x_{t,1}, …, x_{t,d}, y_t), 1) to B'. Suppose B is learning a target T ∈ L(d)_k^max defined, for a₁, …, a_k ∈ ℝ^d, b₁, …, b_k ∈ ℝ, by

T(x) = max_{1≤i≤k} (aᵢ · x + bᵢ).


Then define T' ∈ HYPER_{d+1,k} by

T'((x₁, …, x_d, y)) = 1 if and only if y ≤ T(x).

Notice that for each x ∈ ℝ^d,

T(x) = sup{y ∈ ℝ : T'((x₁, …, x_d, y)) = 1}.   (4.16)

We claim that for each t, the t-th counterexample ((x_{t,1}, …, x_{t,d}, y_t), 1) passed to B' satisfies h'_t((x_{t,1}, …, x_{t,d}, y_t)) = 0. If {u ∈ ℝ^{d+1} : h'_t(u) = 1, (u₁, …, u_d) = x_t} = ∅, this is trivial. Assume otherwise. Since, by Lemma 15,

h'_t(u) ≤ T'(u) for all u ∈ ℝ^{d+1},

we have

h_t(x_t) ≤ T(x_t).

By (4.16), h_t(x_t) ≠ y_t = T(x_t) then implies

h_t(x_t) < y_t.

Thus h'_t((x_{t,1}, …, x_{t,d}, y_t)) = 0, and ((x_{t,1}, …, x_{t,d}, y_t), 1) is a valid counterexample for B' learning T'. Since t was chosen without loss of generality, B' makes at least as many mistakes learning T' as B does learning T. Applying Lemma 15 then yields the desired result.

□
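A minimal sketch of this reduction follows, under assumptions of ours: Bprime stands for the HYPER_{d+1,k} subroutine of Lemma 15 with a hypothetical interface (.hyp(point) returning 0 or 1, .counterexample(point, label)), and the supremum is taken over a finite candidate set only to keep the sketch executable.

```python
# Sketch of the wrapper B from the proof of Theorem 14 (names are ours).
def wrapped_hypothesis(Bprime, x, y_candidates):
    """h_t(x) = sup { y : h'_t((x, y)) = 1 }, approximated over a grid."""
    ys = [y for y in y_candidates if Bprime.hyp((*x, y)) == 1]
    return max(ys) if ys else float('-inf')

def forward_counterexample(Bprime, x, true_y):
    """On a counterexample (x, T(x)) to h_t, pass ((x, T(x)), 1) to B'.

    The proof shows h'_t((x, T(x))) = 0, so this is a valid counterexample
    for B' as well; hence B makes no more mistakes than B' does."""
    Bprime.counterexample((*x, true_y), 1)
```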

5. Learning the Maximum of d-Defined Functions

In this section we investigate learning maxima of functions. This is a natural generalization of the preceding section and (among others) of learning ORs of boolean functions. We describe a condition on a class F of functions of a single variable such that the maximum of functions from this class is efficiently learnable. This condition is satisfied by a variety of natural function classes, as indicated below.

Let F ⊆ Y^X be some function class where X and Y are totally ordered by ≤. In addition to F_k^max we define the class F_k^Pmax of piecewise F-maxima. We call an F-maximum f = max_{1≤i≤k} f^i piecewise if there are disjoint intervals I¹, …, I^k with X = ∪_{1≤i≤k} I^i such that x ∈ I^i implies f(x) = f^i(x). Informally this means that each function f^i appears only once in the graph of max_{1≤i≤k} f^i.

Unfortunately it is not true that F_k^Pmax is easy to learn if F is easy to learn. Consider e.g. the class E of functions f_{a,b} : ℝ → ℝ with


for a ≤ b ∈ ℝ. It is quite easy to see that LC-ARB(E) = 2 but that LC-ARB(E_2^Pmax) = ∞, since functions like

for −2 < a < b < 2 are in E_2^Pmax (compare with Example 1).

Therefore we need a stronger restriction on the function class F than LC-ARB(F) < ∞ to bound LC-ARB(F_k^Pmax). A property which suffices is the notion of a d-defined function class, where a function class F is d-defined if any two functions f¹, f² ∈ F intersect in at most d points, i.e. |{x ∈ X : f¹(x) = f²(x)}| > d implies f¹ = f². Examples of d-defined classes are, among others, the class P_d of polynomials of degree at most d, the class of polynomials with at most ⌊(d − 1)/4⌋ terms ("sparse polynomials"), quotients of sparse polynomials, and functions of the form Σᵢ₌₁^k aᵢσ(bᵢx + cᵢ) for arbitrary aᵢ, bᵢ, cᵢ ∈ ℝ and bounded k ∈ ℕ (where σ(y) := 1/(1 + e^{−y}) is the widely studied sigmoid activation function in neural networks). The d-definedness of these classes (for suitable d ∈ ℕ) follows from the fact that any polynomial with m terms has at most 2m + 1 zeros (by (Uspensky, 1948, p. 121)). Functions Σᵢ₌₁^k aᵢσ(bᵢx + cᵢ) can be rewritten as "rational" functions Σᵢ₌₁^k aᵢ/(1 + Cᵢ y^{bᵢ}), where Cᵢ = e^{−cᵢ} and y = e^{−x}. Observe that bᵢ is not an integer, but the result of (Uspensky, 1948) can be generalized to this case. For the class of quotients of sparse polynomials a suitable d is 4kl + 1 when numerator and denominator have at most k and l terms, respectively, and for sums of k sigmoids a suitable d is 4k²2^{k−1} + 1.
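A tiny illustration of d-definedness for the class P_d: two polynomials of degree at most d agreeing on more than d points are identical, because their difference is a degree-≤d polynomial with more than d zeros. The sketch below (with names of ours) only assumes numpy.

```python
# d-definedness of P_d: d + 1 agreements force equality.
import numpy as np

def agree_count(p, q, xs):
    """Number of sample points at which polynomials p and q coincide."""
    return int(np.sum(np.isclose(np.polyval(p, xs), np.polyval(q, xs))))

p = [1.0, -2.0, 1.0]              # x^2 - 2x + 1
q = [1.0, -2.0, 1.0]              # the same polynomial
xs = np.linspace(-1.0, 1.0, 7)
# Here d = 2, so 3 agreements already pin the polynomial down uniquely.
assert agree_count(p, q, xs) >= 3
```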

Note that in contrast to Section 4 the function class F_k^Pmax considered here may also include non-convex functions, and in contrast to (Kimber & Long, 1992) the norm of the derivative of functions in this class need not be bounded.

THEOREM 15. If F is d-defined with d ≥ 1 then LC-ARB(F_k^Pmax) = O(dk).

REMARK. The above result is optimal up to constants insofar as for polynomials we can prove that LC-ARB((P_d)_k^Pmax) = Ω(dk). This can be seen by considering functions

where for all i, a_{i,d} = −1 and Σ_{j=0}^{d} a_{i,j} ≤ 1. The coefficients a_{i,j} must be learned independently for each i, since Σ_{j=0}^{d} a_{i,j}(x − 2i)^{2j} ≤ 0 only if |x − 2i| ≤ 1.

REMARK. A learning result for F_k^max is given in Section 5.3. The learning algorithms for both F_k^Pmax and F_k^max are computationally feasible; see the final Remark of Section 5.1.

At first glance the proof of Theorem 15 seems easy, because one could argue that the learner just has to get d + 1 examples for each piece of the target. But the problem is to determine which of the examples belong to which piece of the target function, i.e. how to partition the examples into subsets such that each subset belongs to one piece of the target.


Thus the main point of the proof of Theorem 15 is to construct an algorithm which partitions the counterexamples and constructs its hypotheses in such a way that any piece of the target does not receive more than O(d) counterexamples on its graph.

We achieve this goal by constructing an algorithm which only adds a new piece to its hypothesis if there is strong evidence that this piece is part of the target. The algorithm also maintains a set of cutting points to partition the set of counterexamples. These cutting points are also used to protect pieces which are very likely to be in the target against overlapping pieces which are less likely.

5.1. The Algorithm

Denote by E_s the set of x-coordinates of the counterexamples up to and including the s-th counterexample, and denote by y(x) the corresponding Y-value of the counterexample if x ∈ E_s. Since the algorithm always constructs hypotheses consistent with the previously seen counterexamples, we have |E_s| = s. Some of the x ∈ E_s are used to cut X into intervals. We denote the set of these "cutting points" by C_s ⊆ E_s. (The algorithm will maintain the set C_s dynamically.) Furthermore denote by x₁ < ⋯ < x_s the sorted elements of E_s. For notational convenience we set x₀ = −∞, x_{s+1} = +∞.

At first the algorithm looks for all functions from F consistent with at least 2d + 1 consecutive counterexamples (this is "strong evidence" that these functions build pieces of the target; a small sketch of this step is given at the end of this subsection). Let

where the second condition states that none of the 2d + 1 counterexamples are cutting points (cutting points are not considered as regular counterexamples and do not give "evidence"). If q ∈ Q_s we denote the appropriate function by f_s^q. For each q ∈ Q_s let D_s^q = (L_s^q, R_s^q) ⊆ X be the maximal interval around x_q which contains no cutting points and no counterexamples inconsistent with f_s^q,

such that the "range" D_s^q of a function/piece f_s^q is bounded by inconsistent counterexamples or cutting points and contains at least 2d + 1 counterexamples consistent with f_s^q. (As usual we define max ∅ = −∞, min ∅ = +∞.) Observe that each x ∈ X is an element of at most two distinct ranges: Assume x ∈ D¹_s ∩ D²_s ∩ D³_s with L¹ ≤ L² ≤ L³. If R³ ≤ R² then D³ ⊆ D², which by the d-definedness of F implies f² = f³ and therefore D² = D³. If R² ≤ R³ then D² ⊆ D¹ ∪ D³ and the number of counterexamples in one of D² ∩ D¹ or D² ∩ D³ is at least d + 1, which again implies f² = f¹, D² = D¹, or f² = f³, D² = D³, respectively.

The next hypothesis H_s proposed by the algorithm is defined as the maximum of all pieces f_s^q,

H_s(x) = max({f_s^q(x) : q ∈ Q_s, x ∈ D_s^q} ∪ {0}).


(Instead of 0 any element from Y may be used.) If (x*_s, y*_s) is a new counterexample with H_s(x*_s) ≠ y*_s, then E_s is updated as E_{s+1} = E_s ∪ {x*_s} and the cutting points are updated according to the following rules U1 and U2:

U1: (There is a function f_s^q consistent with the new counterexample, but this function has been overlapped by another function f_s^{q'}. Now the "correct" function is protected against the "wrong" function by introducing a new cutting point. Eventually the wrong function will not appear in the next hypothesis because its range becomes too small by the new cutting point.) Assume that

there exists some q ∈ Q_s with x*_s ∈ D_s^q and f_s^q(x*_s) = y*_s.   (5.17)

Then there must also be some q' ∈ Q_s with x*_s ∈ D_s^{q'} and H_s(x*_s) = f_s^{q'}(x*_s) > f_s^q(x*_s) = y*_s. Since f_s^q seems to be the correct function, the algorithm "protects" it against f_s^{q'} by defining a new cutting point. If L_s^{q'} < L_s^q then C_{s+1} := C_s ∪ {L_s^q}. If R_s^q < R_s^{q'} then C_{s+1} := C_s ∪ {R_s^q}. If C_{s+1} = C_s ∪ {L_s^q} this means that in the next hypothesis no function from the left can overlap f_s^q. From the definition of Q_s and the d-definedness of F one easily obtains that exactly one of the conditions L_s^{q'} < L_s^q, R_s^q < R_s^{q'} is satisfied, since otherwise f_s^q and f_s^{q'} would intersect in at least 2d + 1 points.

U2: Let F* be the set of all functions f ∈ F with f(x*_s) = y*_s for which there exists 0 ≤ r_f ≤ s − 2d such that x_{r_f} < x*_s < x_{r_f+2d+1}, f(x_{r_f}) ≠ y(x_{r_f}), f(x_{r_f+1}) = y(x_{r_f+1}), …, f(x_{r_f+2d}) = y(x_{r_f+2d}), and f(x_{r_f+2d+1}) ≠ y(x_{r_f+2d+1}). Then set C_{s+1} := C_s \ ∪_{f∈F*} {x_{r_f+1}, …, x_{r_f+2d}}. (We set for all f ∈ F that f(−∞) ≠ y(−∞) and f(+∞) ≠ y(+∞).) The second update U2 is a more technical update rule. It removes some cutting points if there is evidence that they are misleading, partitioning the counterexamples into too many subsets. Evidence is given by the set F* of candidates for new pieces in the next hypothesis, consistent with exactly 2d consecutive counterexamples in E_s and also consistent with the new counterexample (x*_s, y*_s). The update U2 enforces that the functions in F* appear in the next hypothesis. Observe that the set F* of U2 is empty if condition (5.17) is satisfied: With the notation of U1 and assuming L_s^{q'} < L_s^q, we have |(L_s^q, R_s^{q'}) ∩ E_s| ≤ d (otherwise f_s^q = f_s^{q'}). Thus |(L_s^q, x*_s] ∩ E_s| ≤ d + 1 and |[x*_s, R_s^{q'}) ∩ E_s| ≤ d + 1. Let f ∈ F* and D_f = {x_{r_f+1}, …, x_{r_f+2d}}. Then |D_f ∩ D_s^{q'}| ≥ d + 1 or |D_f ∩ D_s^q| ≥ d, which implies that f coincides with f_s^{q'} or f_s^q in at least d + 1 points (observe that f(x*_s) = y*_s = f_s^q(x*_s)). Thus f = f_s^{q'} or f = f_s^q, and therefore f is consistent with at least 2d + 1 consecutive counterexamples in E_s, which contradicts the definition of F*.

REMARK. Observe that the hypotheses of our algorithm consist of at most O(dk) pieces of functions from F. This follows from Theorem 15, which states that the algorithm receives at most O(dk) counterexamples, and from the fact that each x ∈ X is an element of at most two distinct ranges D_s^q.

Concerning computational complexity, let the amount of time for computing the unique function consistent with some d + 1 examples and calculating f(x) for f ∈ F, x ∈ X, be bounded by some constant CC_F. Then the computational complexity of our algorithm is bounded by some small polynomial in d, k, CC_F.
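The announced sketch of the piece-finding step (the construction of the candidate set Q_s) follows; fit stands for the interpolation routine whose cost was denoted CC_F above, E is the list of counterexamples (x, y) sorted by x, and cuts is the current set of cutting points. All names are ours.

```python
# Condensed sketch of the Q_s construction from Section 5.1.
def candidate_pieces(E, cuts, d, fit):
    """Functions consistent with 2d + 1 consecutive non-cutting-point CEs."""
    pieces = []
    for q in range(len(E) - 2 * d):
        window = E[q:q + 2 * d + 1]
        if any(x in cuts for x, _ in window):
            continue                    # cutting points give no "evidence"
        f = fit(window[:d + 1])         # d-definedness: d + 1 points pin f
        if all(f(x) == y for x, y in window):
            pieces.append((q, f))       # q plays the role of an index in Q_s
    return pieces
```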


5.2. Analysis

We will argue that the above algorithm can receive at most O(dk) counterexamples. Let T = max_{1≤i≤k} T^i be the target function and let I¹, …, I^k be the corresponding intervals. Furthermore, set ν_s^i = |I^i ∩ E_s|. To bound the number of counterexamples we will show that

This implies s ≤ (5k(2d + 1) + 1)/(1 − 1/(d + 1)) = O(dk) for d ≥ 1.

At first we prove that there can be only few cutting points, by showing that there are at least d counterexamples between two cutting points.

LEMMA 16. If x_{r₁}, x_{r₂} are counterexamples with r₁ ≠ r₂ and x_{r₁}, x_{r₂} ∈ C_s, then |r₁ − r₂| ≥ d + 1.

PROOF. The proof is by induction on s. Since new cutting points are only generated by U1, we assume, with the notation of U1 and w.l.o.g., that L_s^{q'} < L_s^q and that L_s^q is the new cutting point. Since |D_s^q ∩ D_s^{q'} ∩ E_s| ≤ d (otherwise f_s^q = f_s^{q'}), it follows that |(L_s^{q'}, L_s^q] ∩ E_s| ≥ d + 1. Since |(L_s^{q'}, R_s^{q'}) ∩ E_s| ≥ 2d + 1, the lemma follows from C_s ∩ (D_s^q ∪ D_s^{q'}) = ∅. □

Clearly Lemma 16 implies (5.19).

COROLLARY 1. U2 deletes at most 4 elements from C_s.

PROOF. Since for all f ∈ F*, x_{r_f} < x*_s < x_{r_f+2d+1}, all deleted points x_{r_f+1}, …, x_{r_f+2d} lie among the 2d counterexamples directly to the left or to the right of x*_s, which together with Lemma 16 implies the corollary. □

For proving (5.18) we start by showing that there are no cutting points in I^i once ν_s^i ≥ 2d + 1.

LEMMA 17. If x ∈ E_s ∩ I^i and ν_s^i ≥ 2d + 1 then x ∉ C_s.

PROOF. To prove the lemma we have to introduce some notation. Let D̄_s^i = (L̄_s^i, R̄_s^i) be the maximal interval where T^i is consistent with the counterexamples seen so far,


(x < I^i has the obvious interpretation). Furthermore let ν̄_s^i = |D̄_s^i ∩ E_s| be the number of counterexamples in this interval. Clearly I^i ⊆ D̄_{s+1}^i ⊆ D̄_s^i and ν_s^i ≤ ν̄_s^i. Thus it is sufficient to prove that ν̄_s^i ≥ 2d + 1 implies D̄_s^i ∩ C_s = ∅.

The proof is by induction on s. Since ν̄_s^i increases by at most 1 if s increases by 1, we first prove that ν̄_s^i = 2d and ν̄_{s+1}^i = 2d + 1 imply D̄_{s+1}^i ∩ C_{s+1} = ∅. If ν̄_s^i = 2d and ν̄_{s+1}^i = 2d + 1 then x*_s ∈ D̄_s^i and D̄_s^i = D̄_{s+1}^i. Furthermore T^i ∈ F*, as can be seen by checking the conditions of U2. Thus all cutting points are removed from D̄_{s+1}^i.

It remains to prove that if ν̄_s^i ≥ 2d + 1 and ν̄_{s+1}^i ≥ 2d + 1 then no cutting point is generated in D̄_{s+1}^i. Since new cutting points are only generated by U1, we assume w.l.o.g. that, with the notation of U1, L_s^{q'} < L_s^q < x*_s < R_s^{q'} < R_s^q and that L_s^q is the new cutting point. Now assume that L_s^q ∈ D̄_{s+1}^i. Since by the d-definedness of F we have |(L_s^q, R_s^{q'}) ∩ E_s| ≤ d (otherwise f_s^q = f_s^{q'}), we get |(L_s^{q'}, L_s^q] ∩ E_s| ≥ d + 1. Thus |(L_s^{q'}, L_s^q] ∩ D̄_s^i ∩ E_s| ≥ d + 1 or |(L_s^q, R_s^q) ∩ D̄_s^i ∩ E_s| ≥ d + 1, which implies T^i = f_s^{q'} or T^i = f_s^q. Since T^i = f_s^{q'} yields the contradiction T(x*_s) ≥ T^i(x*_s) = f_s^{q'}(x*_s) > f_s^q(x*_s) = y(x*_s), we have T^i = f_s^q. Now L_s^q ∈ D̄_{s+1}^i ⊆ D̄_s^i implies L_s^q ∉ C_s (by the induction hypothesis), and by the definition of L_s^q, y(L_s^q) ≠ f_s^q(L_s^q), which contradicts f_s^q(L_s^q) = T^i(L_s^q) = y(L_s^q) and proves the lemma. □

Now (5.18) follows by induction on s from the lemma below, since ν_0^i = 0 and C_0 = ∅.

LEMMA 18.

PROOF. If x*_s ∈ I^i and ν_s^i ≤ 2d + 1, the statement holds by Corollary 1. If ν_s^i > 2d + 1 then by Lemma 17 there exists q ∈ Q_s with f_s^q = T^i and x*_s ∈ D_s^q. Since T(x*_s) = T^i(x*_s) = f_s^q(x*_s), condition (5.17) is satisfied and a new cutting point is generated; thus |C_{s+1}| = |C_s| + 1, which gives the lemma. □

5.3. Learning F_k^max

THEOREM 16. If F is d-defined with d ≥ 1 then LC-ARB(F_k^max) = O(dk²).

Let again T = max_{1≤i≤k} T^i be the target and E_s the set of counterexamples. Furthermore let μ_s^i = |{x ∈ E_s : y(x) = T^i(x)}| be the number of counterexamples consistent with T^i. The following lemma will prove useful in constructing a computationally feasible learning algorithm for F_k^max. It states that any function consistent with at least (d + 1)k counterexamples appears in the target.

LEMMA 19. If f ∈ F and |{x ∈ E_s : f(x) = y(x)}| ≥ (d + 1)k, then there exists i ≤ k with f = T^i.

PROOF. Since there must be a T^i with |{x ∈ E_s : f(x) = y(x) = T^i(x)}| ≥ d + 1, the lemma follows from the d-definedness of F. □


Now the hypothesis of the learning algorithm is constructed as follows. Let

F_s = {f ∈ F : |{x ∈ E_s : f(x) = y(x)}| ≥ (d + 1)k}

and

H_s(x) = max{f(x) : f ∈ F_s}

(with max ∅ replaced by an arbitrary fixed element of Y).

Theorem 16 follows from the following lemma, which states that each T^i will receive at most (d + 1)k counterexamples.

LEMMA 20. For all s and all i ≤ k, μ_s^i ≤ (d + 1)k.

PROOF. Assume μ_s^i = (d + 1)k and μ_{s+1}^i > (d + 1)k. Then T^i ∈ F_s and y(x*_s) = T^i(x*_s). Thus y(x*_s) ≤ H_s(x*_s), and since (x*_s, y(x*_s)) is a counterexample to H_s, even y(x*_s) < H_s(x*_s). This yields the contradiction y(x*_s) < T(x*_s), since by Lemma 19 H_s(x) ≤ T(x) for all x ∈ X. □

6. Conclusion

We have shown in this paper a number of general structural properties of the complexity of learning functions with a larger range than {0, 1} in the most common non-stochastic models for on-line learning. One somewhat surprising structural symmetry that has emerged in Section 2 is the "dual" definition of the learning complexity for arbitrary function classes. We have exhibited in Theorems 2, 4, and 5 (with the help of adversary trees) purely combinatorial "max-min" definitions of the learning complexity of a function class, which turn out to be equivalent to the usual "min-max" definition of the learning complexity in terms of the performance of learning algorithms for the respective learning "game". On the side, these equivalences show that it suffices to consider in these worst-case models for on-line function learning only adversaries of a relatively simple nature.

In Section 3 we have investigated another general structural property of the complexity of function learning: the relationship between the learning complexity of a composite function class and the learning complexities of its parts. We have been able to give optimal upper and lower bounds, and we have exhibited a universal constant 1/log₂(4/3) that plays an important role in these relationships.

In Sections 4 and 5 we have substantially enlarged the pool of interesting function classes for which positive learning results can be achieved in a worst-case model for on-line function learning. Theorem 12 provides a computationally feasible learning algorithm for convex piecewise linear functions over ℝ^d, thereby breaking the barrier of exhibiting larger "learnable" classes of functions of several variables than the standard example of linear functions. In Section 5 we have investigated in a systematic manner the conditions for achieving a positive learning result for the class of functions that are pointwise maxima of arbitrary finite sets of functions from a certain class F. In particular we have exhibited a general structural property ("d-definedness") for function classes F, which provides a sufficient condition for proving a positive learning result for pointwise maxima of functions from F. As a special case we get the first positive learning results for the classes of pointwise maxima of polynomials of bounded degree, of sparse polynomials, and of linear combinations of a fixed number of sigmoidal functions over a single variable.


Acknowledgment

We thank Yoav Freund, Wolfgang Ring and Hans Ulrich Simon for valuable conversations. We also thank two anonymous referees for their helpful comments, including pointing out an error in an earlier version of this paper. Phil Long was supported by a Lise Meitner Fellowship from the Fonds zur Förderung der wissenschaftlichen Forschung (Austria), and AFOSR grant F49620-92-J-0515. His present address is: Computer Science Department, Duke University, Box 90129, Durham, NC 27708 USA. Gerhard Woeginger is supported by the Christian Doppler Laboratorium für Diskrete Optimierung. His present address is: Institut für Mathematik, TU Graz, Kopernikusgasse 24, A-8010 Graz, Austria.

Notes

1. Some lower bounds require that |Y| ≥ 2. The case |Y| ≤ 1 is trivial.

2. This is in contrast to the case of probabilistic models of learning, where efficient algorithms with good learning performance have been discovered for this function class (Kearns, Schapire & Sellie, 1992).

References

Auer, P., & Long, P.M. (1994). Simulating access to hidden information while learning. Proceedings of the 26th Annual ACM Symposium on the Theory of Computing (pp. 263-272).

Auer, P., Long, P.M., Maass, W., & Woeginger, G.J. (1993). On the complexity of function learning. Proceedings of the Sixth Annual ACM Conference on Computational Learning Theory (pp. 392-401).

Angluin, D. (1988). Queries and concept learning. Machine Learning, 2(4):319-342.

Barland, I. (1992). Some ideas on learning with directional feedback. Master's thesis, Computer Science Department, UC Santa Cruz.

Berlekamp, E.R. (1968). Block coding for the binary symmetric channel with noiseless, delayless feedback. In Error Correcting Codes (pp. 61-85), New York: Wiley.

Barzdin, J.M., & Frievald, R.V. (1972). On the prediction of general recursive functions. Soviet Math. Doklady, 13:1224-1228.

Cesa-Bianchi, N., Freund, Y., Helmbold, D.P., & Warmuth, M.K. (in press). On-line prediction and conversion strategies. In Proceedings of the First Euro-COLT Workshop, The Institute of Mathematics and its Applications, to appear.

Cesa-Bianchi, N., Long, P.M., & Warmuth, M.K. (1993). Worst-case quadratic loss bounds for a generalization of the Widrow-Hoff rule. In Proceedings of the 6th Annual Workshop on Computational Learning Theory (pp. 429-438).

Dawid, A. (1984). Statistical theory: The sequential approach. Journal of the Royal Statistical Society (Series A), pp. 278-292.

Faber, V., & Mycielski, J. (1991). Applications of learning theorems. Fundamenta Informaticae, 15(2):145-167.

Feder, M., Merhav, N., & Gutman, M. (1992). Universal prediction of individual sequences. IEEE Transactions on Information Theory, 38:1258-1270.

Kimber, D., & Long, P.M. (1992). The learning complexity of smooth functions of a single variable. In Proc. 5th Annu. Workshop on Comput. Learning Theory (pp. 153-159).

Kearns, M.J., Schapire, R.E., & Sellie, L.M. (1992). Toward efficient agnostic learning. In Proc. 5th Annu. Workshop on Comput. Learning Theory (pp. 341-352).

Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318.

Littlestone, N. (1989). Mistake Bounds and Logarithmic Linear-threshold Learning Algorithms. PhD thesis, Technical Report UCSC-CRL-89-11, University of California Santa Cruz.

Littlestone, N., Long, P.M., & Warmuth, M.K. (1991). On-line learning of linear functions. In Proc. of the 23rd Symposium on Theory of Computing (pp. 465-475).

Littlestone, N., & Warmuth, M.K. (1991). The weighted majority algorithm. Technical Report UCSC-CRL-91-28, UC Santa Cruz. A preliminary version appeared in the Proceedings of the 30th Annual IEEE Symposium on the Foundations of Computer Science.

Long, P.M., & Warmuth, M.K. (in press). Composite geometric concepts and polynomial predictability. Inform. Comput.

Maass, W. (1991). On-line learning with an oblivious environment and the power of randomization. Proc. 4th Annu. Workshop on Comput. Learning Theory (pp. 167-175).

Maass, W., & Turán, G. (1992). Lower bound methods and separation results for on-line learning models. Machine Learning, 9:107-145.

Mycielski, J. (1988). A learning algorithm for linear operators. Proceedings of the American Mathematical Society, 103(2):547-550.

Rivest, R.L., Meyer, A.R., Kleitman, D.J., Winklmann, K., & Spencer, J. (1980). Coping with errors in binary search procedures. Journal of Computer and System Sciences, 20:396-404.

Sauer, N. (1972). On the density of families of sets. J. Combinatorial Theory (A), 13:145-147.

Spencer, J. (1992). Ulam's searching game with a fixed number of lies. Theoretical Computer Science, 95(2):307-321.

Uspensky, J.V. (1948). Theory of Equations. McGraw-Hill.

Vovk, V. (1990). Aggregating strategies. In Proc. 3rd Annu. Workshop on Comput. Learning Theory (pp. 371-383).

Vovk, V. (1992). Universal forecasting algorithms. Inform. Comput., 96(2):245-277.

Received October 5, 1993
Accepted February 15, 1994
Final Manuscript June 13, 1994