
arXiv:1510.02847v2 [cs.LG] 16 Oct 2015

Active Learning from Weak and Strong Labelers

Chicheng Zhang∗1 and Kamalika Chaudhuri†1

1University of California, San Diego

October 19, 2015

Abstract

An active learner is given a hypothesis class, a large set of unlabeled examples and the ability to interactively query labels from an oracle on a subset of these examples; the goal of the learner is to learn a hypothesis in the class that fits the data well by making as few label queries as possible.

This work addresses active learning with labels obtained from strong and weak labelers, where in addition to the standard active learning setting, we have an extra weak labeler which may occasionally provide incorrect labels. An example is learning to classify medical images where either expensive labels may be obtained from a physician (oracle or strong labeler), or cheaper but occasionally incorrect labels may be obtained from a medical resident (weak labeler). Our goal is to learn a classifier with low error on data labeled by the oracle, while using the weak labeler to reduce the number of label queries made to this labeler. We provide an active learning algorithm for this setting, establish its statistical consistency, and analyze its label complexity to characterize when it can provide label savings over using the strong labeler alone.

1 Introduction

An active learner is given a hypothesis class, a large set of unlabeled examples and the ability to interactively make label queries to an oracle on a subset of these examples; the goal of the learner is to learn a hypothesis in the class that fits the data well by making as few oracle queries as possible.

As labeling examples is a tedious task for any one person, many applications of active learning involve synthesizing labels from multiple experts who may have slightly different labeling patterns. While a body of recent empirical work [28, 29, 30, 26, 27, 12] has developed methods for combining labels from multiple experts, little is known about the theory of actively learning with labels from multiple annotators. For example, what kind of assumptions are needed for methods that use labels from multiple sources to work, when these methods are statistically consistent, and when they can yield benefits over plain active learning are all open questions.

This work addresses these questions in the context of active learning from strong and weak labelers. Specifically, in addition to unlabeled data and the usual labeling oracle in standard active learning, we have an extra weak labeler. The labeling oracle is a gold standard – an expert on the problem domain – and it provides high quality but expensive labels. The weak labeler is cheap, but may provide incorrect labels

∗[email protected]  †[email protected]


on some inputs. An example is learning to classify medical images where either expensive labels may be obtained from a physician (oracle), or cheaper but occasionally incorrect labels may be obtained from a medical resident (weak labeler). Our goal is to learn a classifier in a hypothesis class whose error with respect to the data labeled by the oracle is low, while exploiting the weak labeler to reduce the number of queries made to this oracle. Observe that in our model the weak labeler can be incorrect anywhere, and does not necessarily provide uniformly noisy labels everywhere, as was assumed by some previous works [8, 24].

A plausible approach in this framework is to learn a difference classifier to predict where the weak labeler differs from the oracle, and then use a standard active learning algorithm which queries the weak labeler when this difference classifier predicts agreement. Our first key observation is that this approach is statistically inconsistent; false negative errors (that predict no difference when O and W differ) lead to biased annotation for the target classification task. We address this problem by learning instead a cost-sensitive difference classifier that ensures that false negative errors rarely occur. Our second key observation is that as existing active learning algorithms usually query labels in localized regions of space, it is sufficient to train the difference classifier restricted to this region and still maintain consistency. This process leads to significant label savings. Combining these two ideas, we get an algorithm that is provably statistically consistent and that works under the assumption that there is a good difference classifier with low false negative error.

We analyze the label complexity of our algorithm as measured by the number of label requests to the labeling oracle. In general we cannot expect any consistent algorithm to provide label savings under all circumstances, and indeed our worst case asymptotic label complexity is the same as that of active learning using the oracle alone. Our analysis characterizes when we can achieve label savings, and we show that this happens for example if the weak labeler agrees with the labeling oracle for some fraction of the examples close to the decision boundary. Moreover, when the target classification task is agnostic, the number of labels required to learn the difference classifier is of a lower order than the number of labels required for active learning; thus in realistic cases, learning the difference classifier adds only a small overhead to the total label requirement, and overall we get label savings over using the oracle alone.

Related Work. There has been a considerable amount of empirical work on active learning where multiple annotators can provide labels for the unlabeled examples. One line of work assumes a generative model for each annotator's labels. The learning algorithm learns the parameters of the individual labelers, and uses them to decide which labeler to query for each example. [29, 30, 13] consider separate logistic regression models for each annotator, while [20, 19] assume that each annotator's labels are corrupted with a different amount of random classification noise. A second line of work [12, 16] that includes Pro-Active Learning, assumes that each labeler is an expert over an unknown subset of categories, and uses data to measure the class-wise expertise in order to optimally place label queries. In general, it is not known under what conditions these algorithms are statistically consistent, particularly when the modeling assumptions do not strictly hold, and under what conditions they provide label savings over regular active learning.

[25], the first theoretical work to consider this problem, considers a model where the weak labeler is more likely to provide incorrect labels in heterogeneous regions of space where similar examples have different labels. Their formalization is orthogonal to ours – while theirs is more natural in a non-parametric setting, ours is more natural for fitting classifiers in a hypothesis class. In a NIPS 2014 Workshop paper, [21] have also considered learning from strong and weak labelers; unlike ours, their work is in the online selective sampling setting, and applies only to linear classifiers and robust regression. [11] study learning from multiple teachers in the online selective sampling setting in a model where different labelers have different regions of expertise.


Finally, there is a large body of theoretical work [1, 9, 10, 14, 31, 2, 4] on learning a binary classifier based on interactive label queries made to a single labeler. In the realizable case, [22, 9] show that a generalization of binary search provides an exponential improvement in label complexity over passive learning. The problem is more challenging, however, in the more realistic agnostic case, where such approaches lead to inconsistency. The two styles of algorithms for agnostic active learning are disagreement-based active learning (DBAL) [1, 10, 14, 4] and the more recent margin-based or confidence-based active learning [2, 31]. Our algorithm builds on recent work in DBAL [4, 15].

2 Preliminaries

The Model. We begin with a general framework for actively learning from weak and strong labelers. In the standard active learning setting, we are given unlabelled data drawn from a distribution U over an input space X, a label space Y = {−1, 1}, a hypothesis class H, and a labeling oracle O to which we can make interactive queries.

In our setting, we additionally have access to a weak labeling oracle W which we can query interactively. Querying W is significantly cheaper than querying O; however, querying W generates a label y_W drawn from a conditional distribution P_W(y_W|x) which is not the same as the conditional distribution P_O(y_O|x) of O.

Let D be the data distribution over labelled examples such that P_D(x, y) = P_U(x) P_O(y|x). Our goal is to learn a classifier h in the hypothesis class H such that with probability ≥ 1 − δ over the sample, we have P_D(h(x) ≠ y) ≤ min_{h′∈H} P_D(h′(x) ≠ y) + ε, while making as few (interactive) queries to O as possible.

Observe that in this model W may disagree with the oracle O anywhere in the input space; this is unlike previous frameworks [8, 24] where labels assigned by the weak labeler are corrupted by random classification noise with a higher variance than the labeling oracle. We believe this feature makes our model more realistic.

Second, unlike [25], mistakes made by the weak labeler do not have to be close to the decision boundary. This keeps the model general and simple, and allows greater flexibility to weak labelers. Our analysis shows that if W is largely incorrect close to the decision boundary, then our algorithm will automatically make more queries to O in its later stages.

Finally note that O is allowed to be non-realizable with respect to the target hypothesis class H.

Background on Active Learning Algorithms. The standard active learning setting is very similar to ours, the only difference being that we have access to the weak oracle W. There has been a long line of work on active learning [1, 7, 9, 14, 2, 10, 4, 31]. Our algorithms are based on a style called disagreement-based active learning (DBAL). The main idea is as follows. Based on the examples seen so far, the algorithm maintains a candidate set V_t of classifiers in H that is guaranteed with high probability to contain h∗, the classifier in H with the lowest error. Given a randomly drawn unlabeled example x_t, if all classifiers in V_t agree on its label, then this label is inferred; observe that with high probability, this inferred label is h∗(x_t). Otherwise, x_t is said to be in the disagreement region of V_t, and the algorithm queries O for its label. V_t is updated based on x_t and its label, and the algorithm continues.

Recent works in DBAL [10, 4] have observed that it is possible to determine if an x_t is in the disagreement region of V_t without explicitly maintaining V_t. Instead, a labelled dataset S_t is maintained; the labels of the examples in S_t are obtained by either querying the oracle or direct inference. To determine whether an x_t lies in the disagreement region of V_t, two constrained ERM procedures are performed; empirical risk is minimized over S_t while constraining the classifier to output the label of x_t as 1 and −1 respectively. If these two classifiers have similar training errors, then x_t lies in the disagreement region of V_t; otherwise the algorithm infers a label for x_t that agrees with the label assigned by h∗.

More Definitions and Notation. The error of a classifier h under a labelled data distribution Q is defined as err_Q(h) = P_{(x,y)∼Q}(h(x) ≠ y); we use the notation err(h, S) to denote its empirical error on a labelled dataset S. We use the notation h∗ to denote the classifier with the lowest error under D and ν to denote its error err_D(h∗), where D is the target labelled data distribution.

Our active learning algorithm implicitly maintains a (1 − δ)-confidence set for h∗ throughout the algorithm. Given a set S of labelled examples, a set of classifiers V(S) ⊆ H is said to be a (1 − δ)-confidence set for h∗ with respect to S if h∗ ∈ V with probability ≥ 1 − δ over S.

The disagreement between two classifiers h_1 and h_2 under an unlabelled data distribution U, denoted by ρ_U(h_1, h_2), is P_{x∼U}(h_1(x) ≠ h_2(x)). Observe that the disagreements under U form a pseudometric over H. We use B_U(h, r) to denote a ball of radius r centered around h in this metric. The disagreement region of a set V of classifiers, denoted by DIS(V), is the set of all examples x ∈ X such that there exist two classifiers h_1 and h_2 in V for which h_1(x) ≠ h_2(x).
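For concreteness, the sketch below estimates ρ_U(h_1, h_2) from unlabeled draws and tests membership in DIS(V) for a finite pool of classifiers. Representing classifiers as callables and V as a finite list are illustrative assumptions for this sketch only; they are not part of the paper's setting.

from typing import Callable, Iterable, List

Classifier = Callable[[object], int]  # maps an example x to a label in {-1, +1}

def empirical_disagreement(h1: Classifier, h2: Classifier, xs: Iterable[object]) -> float:
    """Monte Carlo estimate of rho_U(h1, h2) from unlabeled draws xs ~ U."""
    xs = list(xs)
    return sum(h1(x) != h2(x) for x in xs) / len(xs)

def in_disagreement_region(V: List[Classifier], x: object) -> bool:
    """x is in DIS(V) iff some pair of classifiers in V labels x differently."""
    labels = {h(x) for h in V}
    return len(labels) > 1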

3 Algorithm

Our main algorithm is a standard single-annotator DBAL algorithm with a major modification: when the DBAL algorithm makes a label query, we use an extra sub-routine to decide whether this query should be made to the oracle or the weak labeler, and make it accordingly. How do we make this decision? We try to predict if the weak labeler differs from the oracle on this example; if so, query the oracle, otherwise, query the weak labeler.

Key Idea 1: Cost Sensitive Difference Classifier. How do we predict if the weak labeler differs from the oracle? A plausible approach is to learn a difference classifier h^df in a hypothesis class H^df to determine if there is a difference. Our first key observation is that when the region where O and W differ cannot be perfectly modeled by H^df, the resulting active learning algorithm is statistically inconsistent. Any false negative errors (that is, incorrectly predicting no difference) made by the difference classifier lead to biased annotation for the target classification task, which in turn leads to inconsistency. We address this problem by instead learning a cost-sensitive difference classifier, and we assume that a classifier with low false negative error exists in H^df. While training, we constrain the false negative error of the difference classifier to be low, and minimize the number of predicted positives (or disagreements between W and O) subject to this constraint. This ensures that the annotated data used by the active learning algorithm has diminishing bias, thus ensuring consistency.
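As an illustration of this cost-sensitive training step (made precise in Equation (2) of Algorithm 2 below), the following sketch performs the constrained minimization over a finite difference hypothesis class: among classifiers whose empirical false negative count stays within a budget, it returns one predicting the fewest positives. The finite class, the callable representation, and the explicit false-negative budget argument are assumptions of the sketch, not the paper's implementation.

from typing import Callable, List, Optional, Tuple

Classifier = Callable[[object], int]  # maps x to -1 (predict agreement) or +1 (predict difference)

def train_difference_classifier(
    H_df: List[Classifier],
    data: List[Tuple[object, int, int]],   # triples (x, y_O, y_W) from the disagreement region
    fn_budget: int,                        # allowed number of false negatives, e.g. m * eps / (256 * p_hat)
) -> Optional[Classifier]:
    """Constrained ERM: minimize predicted positives subject to a hard false-negative cap."""
    best, best_positives = None, None
    for h in H_df:
        false_negatives = sum(1 for x, y_o, y_w in data if h(x) == -1 and y_o != y_w)
        if false_negatives > fn_budget:
            continue                       # violates the hard false-negative constraint
        positives = sum(1 for x, _, _ in data if h(x) == +1)
        if best is None or positives < best_positives:
            best, best_positives = h, positives
    return best  # None if no feasible classifier exists in the finite pool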

Key Idea 2: Localized Difference Classifier Training. Unfortunately, even with cost-sensitive training, directly learning a difference classifier accurately is expensive. If d′ is the VC-dimension of the difference hypothesis class H^df, to learn a target classifier to excess error ε, we need a difference classifier with false negative error O(ε), which, from standard generalization theory, requires O(d′/ε) labels [6, 23]! Our second key observation is that we can save on labels by training the difference classifier in a localized manner – because the DBAL algorithm that builds the target classifier only makes label queries in the disagreement region of the current confidence set for h∗. Therefore we train the difference classifier only on this region and still maintain consistency. Additionally this provides label savings because while training the target classifier to excess error ε, we need to train a difference classifier with only O(d′φ_k/ε) labels, where φ_k is the probability mass of this disagreement region. The localized training process leads to an additional technical challenge: as the confidence set for h∗ is updated, its disagreement region changes. We address this through an epoch-based DBAL algorithm, where the confidence set is updated and a fresh difference classifier is trained in each epoch.

Main Algorithm. Our main algorithm (Algorithm 1) combines these two key ideas, and like [4], implicitly maintains the (1 − δ)-confidence set for h∗ through a labeled dataset Ŝ_k. In epoch k, the target excess error is ε_k ≈ 1/2^k, and the goal of Algorithm 1 is to generate a labeled dataset Ŝ_k that implicitly represents a (1 − δ_k)-confidence set on h∗. Additionally, Ŝ_k has the property that the empirical risk minimizer over it has excess error ≤ ε_k.

A naive way to generate such an Ŝ_k is by drawing O(d/ε_k²) labeled examples, where d is the VC dimension of H. Our goal, however, is to generate Ŝ_k using a much smaller number of label queries, which is accomplished by Algorithm 3. This is done in two ways. First, like standard DBAL, we infer the label of any x that lies outside the disagreement region of the current confidence set for h∗. Algorithm 4 identifies whether an x lies in this region. Second, for any x in the disagreement region, we determine whether O and W agree on x using a difference classifier; if there is agreement, we query W, else we query O. The difference classifier used to determine agreement is retrained in the beginning of each epoch by Algorithm 2, which ensures that the annotation has low bias.

The algorithms use a constrained ERM procedure CONS-LEARN. Given a hypothesis class H, a labeled dataset S and a set of constraining examples C, CONS-LEARN_H(C, S) returns a classifier in H that minimizes the empirical error on S subject to h(x_i) = y_i for each (x_i, y_i) ∈ C.
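For concreteness, here is a minimal sketch of CONS-LEARN over a finite hypothesis class: it filters out classifiers violating the constraints in C and returns an empirical risk minimizer on S among the rest. Exhaustive search over a finite pool is an illustrative assumption; the paper treats CONS-LEARN as an abstract oracle.

from typing import Callable, List, Optional, Tuple

Classifier = Callable[[object], int]
LabeledSet = List[Tuple[object, int]]

def cons_learn(H: List[Classifier], C: LabeledSet, S: LabeledSet) -> Optional[Classifier]:
    """Return argmin_{h in H} err(h, S) subject to h(x_i) = y_i for all (x_i, y_i) in C."""
    feasible = [h for h in H if all(h(x) == y for x, y in C)]
    if not feasible:
        return None
    return min(feasible, key=lambda h: sum(h(x) != y for x, y in S))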

Identifying the Disagreement Region. Algorithm 4 identifies if an unlabeled example x lies in the disagreement region of the current (1 − δ)-confidence set for h∗; recall that this confidence set is implicitly maintained through Ŝ_k. The identification is based on two ERM queries. Let ĥ be the empirical risk minimizer on the current labeled dataset Ŝ_{k−1}, and ĥ′ be the empirical risk minimizer on Ŝ_{k−1} under the constraint that ĥ′(x) = −ĥ(x). If the training errors of ĥ and ĥ′ are very different, then all classifiers with training error close to that of ĥ assign the same label to x, and x lies outside the current disagreement region.

Training the Difference Classifier. Algorithm 2 trains a difference classifier on a random set of examples which lies in the disagreement region of the current confidence set for h∗. The training process is cost-sensitive, and is similar to [17, 18, 6, 23]. A hard bound is imposed on the false-negative error, which translates to a bound on the annotation bias for the target task. The number of positives (i.e., the number of examples where W and O differ) is minimized subject to this constraint; this amounts to (approximately) minimizing the fraction of queries made to O.

The number of labeled examples used in training is large enough to ensure false negative error O(ε_k/φ_k) over the disagreement region of the current confidence set; here φ_k is the probability mass of this disagreement region under U. This ensures that the overall annotation bias introduced by this procedure in the target task is at most O(ε_k). As φ_k is small and typically diminishes with k, this requires fewer labels than training the difference classifier globally, which would have required O(d′/ε_k) queries to O.


Algorithm 1 Active Learning Algorithm from Weak and Strong Labelers
1: Input: Unlabeled distribution U, target excess error ε, confidence δ, labeling oracle O, weak oracle W, hypothesis class H, hypothesis class for difference classifier H^df.
2: Output: Classifier ĥ in H.
3: Initialize: initial error ε_0 = 1, confidence δ_0 = δ/4. Total number of epochs k_0 = ⌈log(1/ε)⌉.
4: Initial number of examples n_0 = O((1/ε_0²)(d ln(1/ε_0²) + ln(1/δ_0))).
5: Draw a fresh sample and query O for its labels: Ŝ_0 = {(x_1, y_1), …, (x_{n_0}, y_{n_0})}. Let σ_0 = σ(n_0, δ_0).
6: for k = 1, 2, …, k_0 do
7:   Set target excess error ε_k = 2^{−k}, confidence δ_k = δ/(4(k+1)²).
8:   # Train Difference Classifier
9:   ĥ^df_k ← Call Algorithm 2 with inputs unlabeled distribution U, oracles W and O, target excess error ε_k, confidence δ_k/2, previously labeled dataset Ŝ_{k−1}.
10:  # Adaptive Active Learning using Difference Classifier
11:  σ_k, Ŝ_k ← Call Algorithm 3 with inputs unlabeled distribution U, oracles W and O, difference classifier ĥ^df_k, target excess error ε_k, confidence δ_k/2, previously labeled dataset Ŝ_{k−1}.
12: end for
13: return ĥ ← CONS-LEARN_H(∅, Ŝ_{k_0}).
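For readers who prefer code, the following skeleton shows how the pieces of Algorithm 1 fit together. The callbacks cons_learn, train_difference_classifier, and adaptive_active_learning are hypothetical stand-ins for CONS-LEARN, Algorithm 2 and Algorithm 3, S0 stands in for the initial O-labeled sample of line 5, and the base-2 logarithm reflects ε_k = 2^{−k}; this is a sketch under those assumptions, not the paper's implementation.

import math

def active_learn_weak_strong(H, S0, eps, delta,
                             cons_learn, train_difference_classifier, adaptive_active_learning):
    """Skeleton of the epoch structure of Algorithm 1."""
    k0 = math.ceil(math.log2(1.0 / eps))        # number of epochs, with eps_k = 2^{-k}
    S_hat = S0                                  # line 5: initial dataset, all labels from O
    for k in range(1, k0 + 1):
        eps_k = 2.0 ** (-k)
        delta_k = delta / (4.0 * (k + 1) ** 2)
        # Line 9: retrain the difference classifier for the current disagreement region.
        h_df = train_difference_classifier(eps_k, delta_k / 2, S_hat)
        # Line 11: build the next labeled dataset, querying W where h_df predicts agreement.
        sigma_k, S_hat = adaptive_active_learning(h_df, eps_k, delta_k / 2, S_hat)
    return cons_learn(H, [], S_hat)             # line 13: unconstrained ERM on the final dataset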

Algorithm 2 Training Algorithm for Difference Classifier

1: Input: Unlabeled distribution U, oracles W and O, target error ε, hypothesis class H^df, confidence δ, previous labeled dataset T̂.
2: Output: Difference classifier ĥ^df.
3: Let p̂ be an estimate of P_{x∼U}(in_disagr_region(T̂, 3ε/2, x) = 1), obtained by calling Algorithm 5 with failure probability δ/3.¹
4: Let U′ = ∅, i = 1, and

   m = (64·1024 p̂/ε)(d′ ln(512·1024 p̂/ε) + ln(72/δ))   (1)

5: repeat
6:   Draw an example x_i from U.
7:   if in_disagr_region(T̂, 3ε/2, x_i) = 1 then   # x_i is inside the disagreement region
8:     query both W and O for labels to get y_{i,W} and y_{i,O}.
9:     U′ = U′ ∪ {(x_i, y_{i,O}, y_{i,W})}; i = i + 1.
10:  end if
11: until |U′| = m
12: Learn a classifier ĥ^df ∈ H^df based on the following empirical risk minimizer:

    ĥ^df = argmin_{h^df ∈ H^df} ∑_{i=1}^m 1(h^df(x_i) = +1),  s.t.  ∑_{i=1}^m 1(h^df(x_i) = −1 ∧ y_{i,O} ≠ y_{i,W}) ≤ mε/(256 p̂)   (2)

13: return ĥ^df.

¹ Note that if in Algorithm 5, the upper confidence bound of P_{x∼U}(in_disagr_region(T̂, 3ε/2, x) = 1) is lower than ε/64, then we can halt Algorithm 2 and return an arbitrary ĥ^df in H^df. Using this ĥ^df will still guarantee the correctness of Algorithm 1.


Adaptive Active Learning using the Difference Classifier. Finally, Algorithm 3 is our main active learning procedure, which generates a labeled dataset Ŝ_k that is implicitly used to maintain a tighter (1 − δ)-confidence set for h∗. Specifically, Algorithm 3 generates a dataset Ŝ_k such that the set V_k defined as:

V_k = {h : err(h, Ŝ_k) − min_{h_k ∈ H} err(h_k, Ŝ_k) ≤ 3ε_k/4}

has the property that:

{h : err_D(h) − err_D(h∗) ≤ ε_k/2} ⊆ V_k ⊆ {h : err_D(h) − err_D(h∗) ≤ ε_k}

This is achieved by labeling, through inference or query, a large enough sample of unlabeled data drawn from U. Labels are obtained from three sources - direct inference (if x lies outside the disagreement region as identified by Algorithm 4), querying O (if the difference classifier predicts a difference), and querying W. How large should the sample be to reach the target excess error? If err_D(h∗) = ν, then achieving an excess error of ε requires O(dν/ε_k²) samples, where d is the VC dimension of the hypothesis class. As ν is unknown in advance, we use a doubling procedure in lines 4-14 to iteratively determine the sample size.

Algorithm 3 Adaptive Active Learning using Difference Classifier

1: Input: Unlabeled data distribution U, oracles W and O, difference classifier ĥ^df, target excess error ε, confidence δ, previous labeled dataset T̂.
2: Output: Parameter σ, labeled dataset Ŝ.
3: Let ĥ = CONS-LEARN_H(∅, T̂).
4: for t = 1, 2, …, do
5:   Let δ^t = δ/t(t+1). Define σ(2^t, δ^t) = (8/2^t)(2d ln(2e·2^t/d) + ln(24/δ^t)).
6:   Draw 2^t examples from U to form Ŝ^{t,U}.
7:   for each x ∈ Ŝ^{t,U} do:
8:     if in_disagr_region(T̂, 3ε/2, x) = 0 then   # x is inside the agreement region
9:       Add (x, ĥ(x)) to Ŝ^t.
10:    else   # x is inside the disagreement region
11:      If ĥ^df(x) = +1, query O for the label y of x, otherwise query W. Add (x, y) to Ŝ^t.
12:    end if
13:  end for
14:  Train ĥ^t ← CONS-LEARN_H(∅, Ŝ^t).
15:  if σ(2^t, δ^t) + √(σ(2^t, δ^t) err(ĥ^t, Ŝ^t)) ≤ ε/512 then
16:    t_0 ← t, break
17:  end if
18: end for
19: return σ ← σ(2^{t_0}, δ^{t_0}), Ŝ ← Ŝ^{t_0}.
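A sketch of the per-round labeling rule of Algorithm 3 (lines 7-13) follows. It assumes the oracles are given as callables and that in_disagr_region already has the previous dataset and the threshold 3ε/2 baked in; these are illustrative assumptions for the sketch.

from typing import Callable, Iterable, List, Tuple

def label_round(
    xs: Iterable[object],                       # the 2^t unlabeled draws of round t
    in_disagr_region: Callable[[object], bool], # Algorithm 4 with dataset T and threshold 3*eps/2 fixed
    h_hat: Callable[[object], int],             # ERM on the previous dataset (line 3)
    h_df: Callable[[object], int],              # difference classifier: +1 predicts that O and W differ
    query_O: Callable[[object], int],
    query_W: Callable[[object], int],
) -> List[Tuple[object, int]]:
    """One round of Algorithm 3: infer, query O, or query W for each unlabeled example."""
    S_t = []
    for x in xs:
        if not in_disagr_region(x):
            S_t.append((x, h_hat(x)))           # line 9: label inferred inside the agreement region
        elif h_df(x) == +1:
            S_t.append((x, query_O(x)))         # line 11: predicted disagreement, ask the oracle
        else:
            S_t.append((x, query_W(x)))         # line 11: predicted agreement, ask the weak labeler
    return S_t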

4 Performance Guarantees

We now examine the performance of our algorithm, which is measured by the number of label queries made to the oracle O. Additionally we require our algorithm to be statistically consistent, which means that the true error of the output classifier should converge to the true error of the best classifier in H on the data distribution D.


Algorithm 4 in_disagr_region(Ŝ, τ, x): Test if x is in the disagreement region of the current confidence set

1: Input: labeled dataset Ŝ, rejection threshold τ, unlabeled example x.
2: Output: 1 if x is in the disagreement region of the current confidence set, 0 otherwise.
3: Train ĥ ← CONS-LEARN_H(∅, Ŝ).
4: Train ĥ′_x ← CONS-LEARN_H({(x, −ĥ(x))}, Ŝ).
5: if err(ĥ′_x, Ŝ) − err(ĥ, Ŝ) > τ then   # x is in the agreement region
6:   return 0
7: else   # x is in the disagreement region
8:   return 1
9: end if
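Given a CONS-LEARN routine, Algorithm 4 translates almost directly into code; the sketch below reuses the finite-class cons_learn from the earlier sketch and only illustrates the two constrained ERM calls, with an extra guard for the case where no classifier in the finite pool flips x.

def in_disagr_region(H, S, tau, x, cons_learn):
    """Algorithm 4: return 1 iff x lies in the disagreement region of V(S, tau)."""
    err = lambda h: sum(h(xi) != yi for xi, yi in S) / len(S)
    h_hat = cons_learn(H, [], S)                      # line 3: unconstrained ERM on S
    h_prime = cons_learn(H, [(x, -h_hat(x))], S)      # line 4: ERM forced to flip the label of x
    if h_prime is None:                               # no classifier in H flips x at all
        return 0
    return 0 if err(h_prime) - err(h_hat) > tau else 1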

Since our framework is very general, we cannot expect any statistically consistent algorithm to achieve label savings over using O alone under all circumstances. For example, if labels provided by W are the complete opposite of O, no algorithm will achieve both consistency and label savings. We next provide an assumption under which Algorithm 1 works and yields label savings.

Assumption. The following assumption states that the difference hypothesis class contains a good cost-sensitive predictor of when O and W differ in the disagreement region of B_U(h∗, r); a predictor is good if it has low false-negative error and predicts a positive label with low frequency. If there is no such predictor, then we cannot expect an algorithm similar to ours to achieve label savings.

Assumption 1. Let 𝒟 be the joint distribution: P_𝒟(x, y_O, y_W) = P_U(x) P_W(y_W|x) P_O(y_O|x). For any r, η > 0, there exists an h^df_{η,r} ∈ H^df with the following properties:

P_𝒟(h^df_{η,r}(x) = −1, x ∈ DIS(B_U(h∗, r)), y_O ≠ y_W) ≤ η   (3)

P_𝒟(h^df_{η,r}(x) = 1, x ∈ DIS(B_U(h∗, r))) ≤ α(r, η)   (4)

Note that (3), which states there is an h^df ∈ H^df with low false-negative error, is minimally restrictive, and is trivially satisfied if H^df includes the constant classifier that always predicts 1. Theorem 1 shows that (3) is sufficient to ensure statistical consistency.

(4) in addition states that the number of positives predicted by the classifier h^df_{η,r} is upper bounded by α(r, η). Note α(r, η) ≤ P_U(DIS(B_U(h∗, r))) always; performance gain is obtained when α(r, η) is lower, which happens when the difference classifier predicts agreement on a significant portion of DIS(B_U(h∗, r)).

Consistency. Provided Assumption 1 holds, we next show that Algorithm 1 is statistically consistent. Establishing consistency is non-trivial for our algorithm as the output classifier is trained on labels from both O and W.

Theorem 1 (Consistency). Let h∗ be the classifier that minimizes the error with respect to D. If Assumption 1 holds, then with probability ≥ 1 − δ, the classifier ĥ output by Algorithm 1 satisfies: err_D(ĥ) ≤ err_D(h∗) + ε.

Label Complexity. The label complexity of standard DBAL is measured in terms of the disagreement coefficient. The disagreement coefficient θ(r) at scale r is defined as θ(r) = sup_{h∈H} sup_{r′≥r} P_U(DIS(B_U(h, r′)))/r′; intuitively, this measures the rate of shrinkage of the disagreement region with the radius of the ball B_U(h, r) for any h in H. It was shown by [10] that the label complexity of DBAL for target excess generalization error ε is Õ(dθ(2ν + ε)(1 + ν²/ε²)), where the Õ notation hides factors logarithmic in 1/ε and 1/δ. In contrast, the label complexity of our algorithm can be stated in Theorem 2. Here we use the Õ notation for convenience; we have the same dependence on log 1/ε and log 1/δ as the bounds for DBAL.

Theorem 2 (Label Complexity). Let d be the VC dimension of H and let d′ be the VC dimension of H^df. If Assumption 1 holds, and if the error of the best classifier in H on D is ν, then with probability ≥ 1 − δ, the following hold:

1. The number of label queries made by Algorithm 1 to the oracle O in epoch k is at most:

m_k = Õ( d(2ν + ε_{k−1})(α(2ν + ε_{k−1}, ε_{k−1}/1024) + ε_{k−1}) / ε_k² + d′ P(DIS(B_U(h∗, 2ν + ε_{k−1}))) / ε_k )   (5)

2. The total number of label queries made by Algorithm 1 to the oracle O is at most:

Õ( sup_{r≥ε} [α(2ν + r, r/1024) + r] / (2ν + r) · d(ν²/ε² + 1) + θ(2ν + ε) d′(ν/ε + 1) )   (6)

4.1 Discussion

The first terms in (5) and (6) represent the labels needed to learn the target classifier, and the second terms represent the overhead in learning the difference classifier.

In the realistic agnostic case (where ν > 0), as ε → 0, the second terms are lower order compared to the label complexity of DBAL. Thus even if d′ is somewhat larger than d, fitting the difference classifier does not incur an asymptotically high overhead in the more realistic agnostic case. In the realizable case, when d′ ≈ d, the second terms are of the same order as the first; therefore we should use a simpler difference hypothesis class H^df in this case. We believe that the lower order overhead term comes from the fact that there exists a classifier in H^df whose false negative error is very low.

Comparing Theorem 2 with the corresponding results for DBAL, we observe that instead of θ(2ν + ε), we have the term sup_{r≥ε} α(2ν + r, r/1024)/(2ν + r). Since sup_{r≥ε} α(2ν + r, r/1024)/(2ν + r) ≤ θ(2ν + ε), the worst case asymptotic label complexity is the same as that of standard DBAL. This label complexity may be considerably better however if sup_{r≥ε} α(2ν + r, r/1024)/(2ν + r) is less than the disagreement coefficient. As we expect, this will happen when the region of difference between W and O restricted to the disagreement regions is relatively small, and this region is well-modeled by the difference hypothesis class H^df.

An interesting case is when the weak labeler differs from O close to the decision boundary and agrees with O away from this boundary. In this case, any consistent algorithm should switch to querying O close to the decision boundary. Indeed in earlier epochs, α is low, and our algorithm obtains a good difference classifier and achieves label savings. In later epochs, α is high, the difference classifiers always predict a difference and the label complexity of the later epochs of our algorithm is the same order as DBAL. In practice, if we suspect that we are in this case, we can switch to plain active learning once ε_k is small enough.

Case Study: Linear Classification under Uniform Distribution. We provide a simple example where our algorithm provides a better asymptotic label complexity than DBAL. Let H be the class of homogeneous linear separators on the d-dimensional unit ball and let H^df = {h Δ h′ : h, h′ ∈ H}. Furthermore, let U be the uniform distribution over the unit ball.

Suppose that O is a deterministic labeler such that err_D(h∗) = ν > 0. Moreover, suppose that W is such that there exists a difference classifier h^df with false negative error 0 for which P_U(h^df(x) = 1) ≤ g.

Figure 1: Linear classification over the unit ball with d = 2. Left: Decision boundary of labeler O and h∗ = h_{w∗}. The region where O differs from h∗ is shaded, and has probability ν. Middle: Decision boundary of weak labeler W. Right: h^df, W and O. Note that {x : P(y_O ≠ y_W|x) > 0} ⊆ {x : h^df(x) = 1}.

Additionally, we assume that g = o(√(dν)); observe that this is not a strict assumption on H^df, as ν could be as much as a constant. Figure 1 shows an example in d = 2 that satisfies these assumptions. In this case, as ε → 0, Theorem 2 gives the following label complexity bound.

Corollary 1. With probability ≥ 1 − δ, the number of label queries made to oracle O by Algorithm 1 is

Õ( d max(g/ν, 1)(ν²/ε² + 1) + d^{3/2}(1 + ν/ε) )

where the Õ notation hides factors logarithmic in 1/ε and 1/δ.

As g = o(√(dν)), this improves over the label complexity of DBAL, which is Õ(d^{3/2}(1 + ν²/ε²)).

Learning with respect to Data labeled by both O and W. Finally, an interesting variant of our model is to measure error relative to data labeled by a mixture of O and W – say, (1 − β)O + βW for some 0 < β < 1. Similar measures have been considered in the domain adaptation literature [5].

We can also analyze this case using simple modifications to our algorithm and analysis. The results are presented in Corollary 2, which suggests that the number of label queries to O in this case is roughly 1 − β times the label complexity in Theorem 2.

Let O′ be the oracle which, on input x, queries O for its label w.p. 1 − β and queries W w.p. β. Let D′ be the distribution P_{D′}(x, y) = P_U(x) P_{O′}(y|x), h′ = argmin_{h∈H} err_{D′}(h) be the classifier in H that minimizes error over D′, and ν′ = err_{D′}(h′) be its true error. Moreover, suppose that Assumption 1 holds with respect to oracles O′ and W with α = α′(r, η) in (4). Then, the modified Algorithm 1 that simulates O′ by random queries to O has the following properties.

Corollary 2. With probability ≥ 1 − 2δ,

1. the classifier ĥ output by (the modified) Algorithm 1 satisfies: err_{D′}(ĥ) ≤ err_{D′}(h′) + ε.

2. the total number of label queries made by this algorithm to O is at most:

Õ( (1 − β)( sup_{r≥ε} [α′(2ν′ + r, r/1024) + r] / (2ν′ + r) · d(ν′²/ε² + 1) + θ(2ν′ + ε) d′(ν′/ε + 1) ) )

Conclusion. In this paper, we take a step towards a theoretical understanding of active learning from multiple annotators through a learning theoretic formalization for learning from weak and strong labelers. Our work shows that multiple annotators can be successfully combined to do active learning in a statistically consistent manner under a general setting with few assumptions; moreover, under reasonable conditions, this kind of learning can provide label savings over plain active learning.

An avenue for future work is to explore a more general setting where we have multiple labelers with expertise on different regions of the input space. Can we combine inputs from such labelers in a statistically consistent manner? Second, our algorithm is intended for a setting where W is biased, and performs sub-optimally when the label generated by W is a random corruption of the label provided by O. How can we account for both random noise and bias in active learning from weak and strong labelers?

Acknowledgements

We thank NSF under IIS 1162581 for research support and Jennifer Dy for introducing us to the problem of active learning from multiple labelers.

References

[1] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. J. Comput. Syst. Sci., 75(1):78–89, 2009.

[2] M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-concave distributions. In COLT, 2013.

[3] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Active learning with an ERM oracle, 2009.

[4] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In NIPS, 2010.

[5] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adaptation. In NIPS, pages 129–136, 2007.

[6] Nader H. Bshouty and Lynn Burroughs. Maximizing agreements with one-sided error with applications to heuristic learning. Machine Learning, 59(1-2):99–123, 2005.

[7] D. A. Cohn, L. E. Atlas, and R. E. Ladner. Improving generalization with active learning. Machine Learning, 15(2), 1994.

[8] K. Crammer, M. J. Kearns, and J. Wortman. Learning from data of variable quality. In NIPS, pages 219–226, 2005.

[9] S. Dasgupta. Coarse sample complexity bounds for active learning. In NIPS, 2005.

[10] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In NIPS, 2007.

[11] O. Dekel, C. Gentile, and K. Sridharan. Selective sampling and active learning from single and multiple teachers. JMLR, 13:2655–2697, 2012.

[12] P. Donmez and J. Carbonell. Proactive learning: Cost-sensitive active learning with multiple imperfect oracles. In CIKM, 2008.

[13] Meng Fang, Xingquan Zhu, Bin Li, Wei Ding, and Xindong Wu. Self-taught active learning from crowds. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 858–863. IEEE, 2012.

[14] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.

[15] D. Hsu. Algorithms for Active Learning. PhD thesis, UC San Diego, 2010.

[16] Panagiotis G. Ipeirotis, Foster Provost, Victor S. Sheng, and Jing Wang. Repeated labeling using multiple noisy labelers. Data Mining and Knowledge Discovery, 28(2):402–441, 2014.

[17] Adam Tauman Kalai, Varun Kanade, and Yishay Mansour. Reliable agnostic learning. J. Comput. Syst. Sci., 78(5):1481–1495, 2012.

[18] Varun Kanade and Justin Thaler. Distribution-independent reliable learning. In Proceedings of The 27th Conference on Learning Theory, COLT 2014, Barcelona, Spain, June 13-15, 2014, pages 3–24, 2014.

[19] C. H. Lin, Mausam, and D. S. Weld. Reactive learning: Actively trading off larger noisier training sets against smaller cleaner ones. In ICML Workshop on Crowdsourcing and Machine Learning and ICML Active Learning Workshop, 2015.

[20] Christopher H. Lin, Mausam, and Daniel S. Weld. To re(label), or not to re(label). In Proceedings of the Second AAAI Conference on Human Computation and Crowdsourcing, HCOMP 2014, November 2-4, 2014, Pittsburgh, Pennsylvania, USA, 2014.

[21] L. Malago, N. Cesa-Bianchi, and J. Renders. Online active learning with strong and weak annotators. In NIPS Workshop on Learning from the Wisdom of Crowds, 2014.

[22] R. D. Nowak. The geometry of generalized binary search. IEEE Transactions on Information Theory, 57(12):7893–7906, 2011.

[23] Hans Ulrich Simon. PAC-learning in the presence of one-sided classification noise. Ann. Math. Artif. Intell., 71(4):283–300, 2014.

[24] S. Song, K. Chaudhuri, and A. D. Sarwate. Learning from data with heterogeneous noise using SGD. In AISTATS, 2015.

[25] R. Urner, S. Ben-David, and O. Shamir. Learning from weak teachers. In AISTATS, pages 1252–1260, 2012.

[26] S. Vijayanarasimhan and K. Grauman. What's it going to cost you?: Predicting effort vs. informativeness for multi-label image annotations. In CVPR, pages 2262–2269, 2009.

[27] S. Vijayanarasimhan and K. Grauman. Cost-sensitive active visual category learning. IJCV, 91(1):24–44, 2011.

[28] P. Welinder, S. Branson, S. Belongie, and P. Perona. The multidimensional wisdom of crowds. In NIPS, pages 2424–2432, 2010.

[29] Y. Yan, R. Rosales, G. Fung, and J. G. Dy. Active learning from crowds. In ICML, pages 1161–1168, 2011.

[30] Y. Yan, R. Rosales, G. Fung, F. Farooq, B. Rao, and J. G. Dy. Active learning from multiple knowledge sources. In AISTATS, pages 1350–1357, 2012.

[31] C. Zhang and K. Chaudhuri. Beyond disagreement-based agnostic active learning. In NIPS, 2014.

A Notation

A.1 Basic Definitions and Notation

Here we do a brief recap of notation. We assume that we are given a target hypothesis class H of VC dimension d, and a difference hypothesis class H^df of VC dimension d′.

We are given access to an unlabeled distribution U and two labeling oracles O and W. Querying O (resp. W) with an unlabeled data point x_i generates a label y_{i,O} (resp. y_{i,W}) which is drawn from the distribution P_O(y|x_i) (resp. P_W(y|x_i)). In general these two distributions are different. We use the notation 𝒟 to denote the joint distribution over examples and labels from O and W:

P_𝒟(x, y_O, y_W) = P_U(x) P_O(y_O|x) P_W(y_W|x)

Our goal in this paper is to learn a classifier in H which has low error with respect to the data distribution D described as P_D(x, y) = P_U(x) P_O(y|x), and our goal is to use queries to W to reduce the number of queries to O. We use y_O to denote the labels returned by O, and y_W to denote the labels returned by W.

The error of a classifier h under a labeled data distribution Q is defined as err_Q(h) = P_{(x,y)∼Q}(h(x) ≠ y); we use the notation err(h, S) to denote its empirical error on a labeled data set S. We use the notation h∗ to denote the classifier with the lowest error under D. Define the excess error of h with respect to distribution D as err_D(h) − err_D(h∗). For a set Z, we occasionally abuse notation and use Z to also denote the uniform distribution over the elements of Z.

Confidence Sets and Disagreement Region. Our active learning algorithm will maintain a (1 − δ)-confidence set for h∗ throughout the algorithm. A set of classifiers V ⊆ H produced by a (possibly randomized) algorithm is said to be a (1 − δ)-confidence set for h∗ if h∗ ∈ V with probability ≥ 1 − δ; here the probability is over the randomness of the algorithm as well as the choice of all labeled and unlabeled examples drawn by it.

Given two classifiers h_1 and h_2, the disagreement between h_1 and h_2 under an unlabeled data distribution U, denoted by ρ_U(h_1, h_2), is P_{x∼U}(h_1(x) ≠ h_2(x)). Given an unlabeled dataset S, the empirical disagreement of h_1 and h_2 on S is denoted by ρ_S(h_1, h_2). Observe that the disagreements under U form a pseudometric over H. We use B_U(h, r) to denote a ball of radius r centered around h in this metric. The disagreement region of a set V of classifiers, denoted by DIS(V), is the set of all examples x ∈ X such that there exist two classifiers h_1 and h_2 in V for which h_1(x) ≠ h_2(x).

Disagreement Region. We denote the disagreement region of a disagreement ball of radius r centered around h∗ by

Δ(r) := DIS(B(h∗, r))   (7)

Concentration Inequalities. Suppose Z is a dataset consisting of n iid samples from a distribution D. We will use the following result, which is obtained from a standard application of the normalized VC inequality. With probability 1 − δ over the random draw of Z, for all h, h′ ∈ H,

|(err(h, Z) − err(h′, Z)) − (err_D(h) − err_D(h′))| ≤ min( √(σ(n, δ) ρ_Z(h, h′)) + σ(n, δ), √(σ(n, δ) ρ_D(h, h′)) + σ(n, δ) )   (8)

|err(h, Z) − err_D(h)| ≤ min( √(σ(n, δ) err(h, Z)) + σ(n, δ), √(σ(n, δ) err_D(h)) + σ(n, δ) )   (9)


where d is the VC dimension of H and the notation σ(n, δ) is defined as:

σ(n, δ) = (8/n)(2d ln(2en/d) + ln(24/δ))   (10)

Equation (8) loosely implies the following equation:

|(err(h, Z) − err(h′, Z)) − (err_D(h) − err_D(h′))| ≤ √(4σ(n, δ))   (11)

The following is a consequence of standard Chernoff bounds. Let X_1, …, X_n be iid Bernoulli random variables with mean p. If p̂ = ∑_i X_i / n, then with probability 1 − δ,

|p̂ − p| ≤ min( √(p γ(n, δ)) + γ(n, δ), √(p̂ γ(n, δ)) + γ(n, δ) )   (12)

where the notation γ(n, δ) is defined as:

γ(n, δ) = (4/n) ln(2/δ)   (13)

Equation (12) loosely implies the following equation:

|p̂ − p| ≤ √(4γ(n, δ))   (14)
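Since σ(n, δ) and γ(n, δ) are used repeatedly below, a direct transcription of Equations (10) and (13) may be helpful; the VC dimension d, which the paper leaves implicit in σ, is passed explicitly in this sketch.

import math

def sigma(n: int, delta: float, d: int) -> float:
    # Deviation bound of Equation (10): sigma(n, delta) = (8/n)(2 d ln(2 e n / d) + ln(24/delta)).
    return (8.0 / n) * (2 * d * math.log(2 * math.e * n / d) + math.log(24.0 / delta))

def gamma(n: int, delta: float) -> float:
    # Chernoff-style deviation bound of Equation (13): gamma(n, delta) = (4/n) ln(2/delta).
    return (4.0 / n) * math.log(2.0 / delta)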

Using the notation we just introduced, we can rephrase Assumption 1 as follows. For any r, η > 0, there exists an h^df_{η,r} ∈ H^df with the following properties:

P_𝒟(h^df_{η,r}(x) = −1, x ∈ Δ(r), y_O ≠ y_W) ≤ η

P_𝒟(h^df_{η,r}(x) = 1, x ∈ Δ(r)) ≤ α(r, η)

We end with a useful fact about σ(n, δ).

Fact 1. The minimum n such that σ(n, δ/(log n(log n + 1))) ≤ ε is at most

(64/ε)(d ln(512/ε) + ln(24/δ))

A.2 Adaptive Procedure for Estimating Probability Mass

For completeness, we describe in Algorithm 5 a standard doubling procedure for estimating the bias of a coin within a constant factor. This procedure is used by Algorithm 2 to estimate the probability mass of the disagreement region of the current confidence set based on unlabeled examples drawn from U.

Lemma 1. Suppose p > 0 and Algorithm 5 is run with failure probability δ. Then with probability 1 − δ, (1) the output p̂ is such that p̂ ≤ p ≤ 2p̂. (2) The total number of calls to O is at most O((1/p²) ln(1/δp)).

Proof. Consider the event

E = { for all i ∈ ℕ, |p̂_i − p| ≤ √(4 ln(2·2^i/δ) / 2^i) }


Algorithm 5 Adaptive Procedure for Estimating the Bias of a Coin
1: Input: failure probability δ, an oracle O which returns iid Bernoulli random variables with unknown bias p.
2: Output: p̂, an estimate of bias p such that p̂ ≤ p ≤ 2p̂ with probability ≥ 1 − δ.
3: for i = 1, 2, … do
4:   Call the oracle O 2^i times to get empirical frequency p̂_i.
5:   if √(4 ln(4·2^i/δ) / 2^i) ≤ p̂_i/3 then return p̂ = 2p̂_i/3
6:   end if
7: end for
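A minimal sketch of Algorithm 5, assuming the oracle is given as a zero-argument callable returning 0/1 samples; the max_rounds cap is an addition of the sketch to keep it terminating when p is extremely small.

import math
import random

def estimate_bias(delta, coin, max_rounds=40):
    """Doubling estimate of an unknown coin bias p (a sketch of Algorithm 5).

    coin() plays the role of the oracle returning iid Bernoulli(p) samples; with probability
    >= 1 - delta the returned p_hat satisfies p_hat <= p <= 2 * p_hat.
    """
    for i in range(1, max_rounds + 1):
        n = 2 ** i
        p_i = sum(coin() for _ in range(n)) / n          # empirical frequency on 2^i draws
        if math.sqrt(4 * math.log(4 * n / delta) / n) <= p_i / 3:
            return 2 * p_i / 3                           # stopping rule of line 5
    return None  # p too small to resolve within max_rounds (not part of Algorithm 5)

# usage: estimate_bias(0.05, lambda: 1 if random.random() < 0.3 else 0)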

By Equation (14) and a union bound, P(E) ≥ 1 − δ. On event E, we claim that if i is large enough that

4√(4 ln(4·2^i/δ) / 2^i) ≤ p   (15)

then the condition in line 5 will be met. Indeed, this implies

√(4 ln(4·2^i/δ) / 2^i) ≤ (p − √(4 ln(4·2^i/δ) / 2^i)) / 3 ≤ p̂_i/3

Define i_0 as the smallest number i such that Equation (15) is true. Then by algebra, 2^{i_0} = O((1/p²) ln(1/δp)). Hence the number of calls to oracle O is at most 1 + 2 + … + 2^{i_0} = O((1/p²) ln(1/δp)).

Consider the smallest i∗ such that the condition in line 5 is met. We have that

√(4 ln(4·2^{i∗}/δ) / 2^{i∗}) ≤ p̂_{i∗}/3

By the definition of E, |p − p̂_{i∗}| ≤ p̂_{i∗}/3, that is, 2p̂_{i∗}/3 ≤ p ≤ 4p̂_{i∗}/3, implying p̂ ≤ p ≤ 2p̂.

A.3 Notations on Datasets

Without loss of generality, assume the examples drawn throughout Algorithm 1 have distinct feature values x, since this happens with probability 1 under mild assumptions.

Algorithm 1 uses a mixture of three kinds of labeled data to learn a target classifier – labels obtained from querying O, labels inferred by the algorithm, and labels obtained from querying W. To analyze the effect of these three kinds of labeled data, we need to introduce some notation.

Recall that we define the joint distribution 𝒟 over examples and labels both from O and W as follows:

P_𝒟(x, y_O, y_W) = P_U(x) P_O(y_O|x) P_W(y_W|x)

where given an example x, the labels generated by O and W are conditionally independent.


A dataset Ŝ with empirical error minimizer ĥ and a rejection threshold τ define an implicit confidence set for h∗ as follows:

V(Ŝ, τ) = {h : err(h, Ŝ) − err(ĥ, Ŝ) ≤ τ}

At the beginning of epoch k, we have Ŝ_{k−1}. ĥ_{k−1} is defined as the empirical error minimizer of Ŝ_{k−1}. The disagreement region of the implicit confidence set at epoch k, R_{k−1}, is defined as R_{k−1} := DIS(V(Ŝ_{k−1}, 3ε_k/2)). Algorithm 4, in_disagr_region(Ŝ_{k−1}, 3ε_k/2, x), provides a test deciding if an unlabeled example x is inside R_{k−1} in epoch k. (See Lemma 6.)

Define A_k to be the distribution 𝒟 conditioned on the set {(x, y_O, y_W) : x ∈ R_{k−1}}. At epoch k, Algorithm 2 has inputs distribution U, oracles W and O, target false negative error ε = ε_k/128, hypothesis class H^df, confidence δ = δ_k/2, previous labeled dataset Ŝ_{k−1}, and outputs a difference classifier ĥ^df_k. By the setting of m in Equation (1), Algorithm 2 first computes p̂_k using unlabeled examples drawn from U, which is an estimator of P_𝒟(x ∈ R_{k−1}). Then it draws a subsample of size

m_{k,1} = (64·1024 p̂_k / ε_k)(d′ ln(512·1024 p̂_k / ε_k) + ln(144/δ_k))   (16)

iid from A_k. We call the resulting dataset A′_k.

At epoch k, Algorithm 3 performs adaptive subsampling to refine the implicit (1 − δ)-confidence set. For each round t, it subsamples U to get an unlabeled dataset S^{t,U}_k of size 2^t. Define the corresponding (hypothetical) dataset with labels queried from both W and O as 𝒮^t_k. S^t_k, the (hypothetical) dataset with labels queried from O, is defined as:

S^t_k = {(x, y_O) : (x, y_O, y_W) ∈ 𝒮^t_k}

In addition to obtaining labels from O, the algorithm obtains labels in two other ways. First, if an x ∈ X \ R_{k−1}, then its label is safely inferred and with high probability, this inferred label ĥ_{k−1}(x) is equal to h∗(x). Second, if an x lies in R_{k−1} but the difference classifier ĥ^df_k predicts agreement between O and W, then its label is obtained by querying W. The actual dataset Ŝ^t_k generated by Algorithm 3 is defined as:

Ŝ^t_k = {(x, ĥ_{k−1}(x)) : (x, y_O, y_W) ∈ 𝒮^t_k, x ∉ R_{k−1}} ∪ {(x, y_O) : (x, y_O, y_W) ∈ 𝒮^t_k, x ∈ R_{k−1}, ĥ^df_k(x) = +1} ∪ {(x, y_W) : (x, y_O, y_W) ∈ 𝒮^t_k, x ∈ R_{k−1}, ĥ^df_k(x) = −1}

We use D_k to denote the labeled data distribution as follows:

P_{D_k}(x, y) = P_U(x) P_{Q_k}(y|x)

P_{Q_k}(y|x) =
  I(ĥ_{k−1}(x) = y),   x ∉ R_{k−1}
  P_O(y|x),            x ∈ R_{k−1}, ĥ^df_k(x) = +1
  P_W(y|x),            x ∈ R_{k−1}, ĥ^df_k(x) = −1

Therefore, Ŝ^t_k can be seen as a sample of size 2^t drawn iid from D_k.

Observe that ĥ^t_k is obtained by training an ERM classifier over Ŝ^t_k, and δ^t_k = δ_k/(2t(t+1)).

Suppose Algorithm 3 stops at iteration t_0(k); then the final dataset returned is Ŝ_k = Ŝ^{t_0(k)}_k, with a total number of m_{k,2} label requests to O. We define 𝒮_k = 𝒮^{t_0(k)}_k, S_k = S^{t_0(k)}_k and σ_k = σ(2^{t_0(k)}, δ^{t_0(k)}_k).


For k = 0, we define the notation S_k differently. Ŝ_0 is the dataset drawn iid at random from D, with labels queried entirely from O. For notational convenience, define S_0 = Ŝ_0. σ_0 is defined as σ_0 = σ(n_0, δ_0), where σ(·, ·) is defined by Equation (10) and n_0 is defined as:

n_0 = (64·1024²)(2d ln(512·1024²) + ln(96/δ))

Recall that ĥ_k = argmin_{h∈H} err(h, Ŝ_k) is the empirical error minimizer with respect to the dataset Ŝ_k. Note that the empirical distance ρ_Z(·, ·) does not depend on the labels in dataset Z; therefore, ρ_{Ŝ_k}(h, h′) = ρ_{S_k}(h, h′). We will use them interchangeably throughout.

Table 1: Summary of Notations.

Notation   | Explanation                                                                                   | Samples drawn from
𝒟          | Joint distribution of (x, y_W, y_O)                                                            | -
D          | Joint distribution of (x, y_O)                                                                 | -
U          | Marginal distribution of x                                                                     | -
O          | Conditional distribution of y_O given x                                                        | -
W          | Conditional distribution of y_W given x                                                        | -
R_{k−1}    | Disagreement region at epoch k                                                                 | -
A_k        | Conditional distribution of (x, y_W, y_O) given x ∈ R_{k−1}                                    | -
A′_k       | Dataset used to train difference classifier at epoch k                                         | A_k
h^df_k     | Difference classifier h^df_{2ν+ε_{k−1}, ε_k/512}, where h^df_{η,r} is defined in Assumption 1   | -
ĥ^df_k     | Difference classifier returned by Algorithm 2 at epoch k                                       | -
S^{t,U}_k  | Unlabeled dataset drawn at iteration t of Algorithm 3 at epoch k ≥ 1                           | U
𝒮^t_k      | S^{t,U}_k augmented by labels from O and W                                                     | 𝒟
S^t_k      | {(x, y_O) : (x, y_O, y_W) ∈ 𝒮^t_k}                                                             | D
Ŝ^t_k      | Labeled dataset produced at iteration t of Algorithm 3 at epoch k ≥ 1                          | D_k
D_k        | Distribution of Ŝ^t_k for k ≥ 1 and any t. Has marginal U over X. The conditional distribution of y given x is I(ĥ_{k−1}(x) = y) if x ∉ R_{k−1}, W if x ∈ R_{k−1} and ĥ^df_k(x) = −1, and O otherwise | -
t_0(k)     | Number of iterations of Algorithm 3 at epoch k ≥ 1                                             | -
Ŝ_0        | Initial dataset drawn by Algorithm 1                                                           | D
Ŝ_k        | Dataset finally returned by Algorithm 3 at epoch k ≥ 1. Equal to Ŝ^{t_0(k)}_k                  | D_k
S_k        | Dataset obtained by replacing all labels in Ŝ_k by labels drawn from O. Equal to S^{t_0(k)}_k  | D
𝒮_k        | Equal to 𝒮^{t_0(k)}_k                                                                          | 𝒟
ĥ_k        | Empirical error minimizer on Ŝ_k                                                               | -


A.4 Events

Recall that δ_k = δ/(4(k+1)²) and ε_k = 2^{−k}. Define

h^df_k = h^df_{2ν+ε_{k−1}, ε_k/512}

where the notation h^df_{η,r} is introduced in Assumption 1.

We begin by defining some events that we will condition on later in the proof, and showing that these events occur with high probability.

Define event

E^1_k := { P_𝒟(x ∈ R_{k−1})/2 ≤ p̂_k ≤ P_𝒟(x ∈ R_{k−1}), and for all h^df ∈ H^df:

|P_{A′_k}(h^df(x) = −1, y_O ≠ y_W) − P_{A_k}(h^df(x) = −1, y_O ≠ y_W)|
  ≤ ε_k / (1024 P_𝒟(x ∈ R_{k−1})) + √( min(P_{A_k}(h^df(x) = −1, y_O ≠ y_W), P_{A′_k}(h^df(x) = −1, y_O ≠ y_W)) · ε_k / (1024 P_𝒟(x ∈ R_{k−1})) )

and |P_{A′_k}(h^df(x) = +1) − P_{A_k}(h^df(x) = +1)|
  ≤ √( min(P_{A_k}(h^df(x) = +1), P_{A′_k}(h^df(x) = +1)) · ε_k / (1024 P_𝒟(x ∈ R_{k−1})) ) + ε_k / (1024 P_𝒟(x ∈ R_{k−1})) }

Fact 2. P(E^1_k) ≥ 1 − δ_k/2.

Define event

E^2_k = { For all t ∈ ℕ, for all h, h′ ∈ H:

|(err(h, S^t_k) − err(h′, S^t_k)) − (err_D(h) − err_D(h′))| ≤ σ(2^t, δ^t_k) + √( σ(2^t, δ^t_k) ρ_{S^t_k}(h, h′) )

and err(h, Ŝ^t_k) − err_{D_k}(h) ≤ σ(2^t, δ^t_k) + √( σ(2^t, δ^t_k) err_{D_k}(h) )

and P_{𝒮^t_k}(ĥ^df_k(x) = −1, y_O ≠ y_W, x ∈ R_{k−1}) − P_𝒟(ĥ^df_k(x) = −1, y_O ≠ y_W, x ∈ R_{k−1})
  ≤ √( γ(2^t, δ^t_k) P_{𝒮^t_k}(ĥ^df_k(x) = −1, y_O ≠ y_W, x ∈ R_{k−1}) ) + γ(2^t, δ^t_k)

and P_{𝒮^t_k}(ĥ^df_k(x) = −1 ∧ x ∈ R_{k−1}) ≤ 2(P_𝒟(ĥ^df_k(x) = −1, x ∈ R_{k−1}) + γ(2^t, δ^t_k)) }

Fact 3. P(E^2_k) ≥ 1 − δ_k/2.

We will also use the following definitions of events in our proof. Define event F_0 as

F_0 = { for all h, h′ ∈ H, |(err(h, S_0) − err(h′, S_0)) − (err_D(h) − err_D(h′))| ≤ σ(n_0, δ_0) + √( σ(n_0, δ_0) ρ_{S_0}(h, h′) ) }

For k ∈ {1, 2, …, k_0}, event F_k is defined inductively as

F_k = F_{k−1} ∩ (E^1_k ∩ E^2_k)

Fact 4. For k ∈ {0, 1, …, k_0}, P(F_k) ≥ 1 − δ_0 − δ_1 − … − δ_k. Specifically, P(F_{k_0}) ≥ 1 − δ.

The proofs of Facts 2, 3 and 4 are provided in Appendix E.


B Proof Outline and Main Lemmas

The main idea of the proof is to maintain the following three invariants on the outputs of Algorithm 1 in each epoch. We prove that these invariants hold simultaneously for each epoch with high probability by induction over the epochs. Throughout, for k ≥ 1, the end of epoch k refers to the end of execution of line 13 of Algorithm 1 at iteration k. The end of epoch 0 refers to the end of execution of line 5 in Algorithm 1.

Invariant 1 states that if we replace the inferred labels and labels obtained from W in Ŝ_k by those obtained from O (thus getting the dataset S_k), then the excess errors of classifiers in H will not decrease by much.

Invariant 1 (Approximate Favorable Bias). Let h be any classifier in H, and h′ be another classifier in H with excess error on D no greater than ε_k. Then, at the end of epoch k, we have:

err(h, S_k) − err(h′, S_k) ≤ err(h, Ŝ_k) − err(h′, Ŝ_k) + ε_k/16

Invariant 2 establishes that in epoch k, Algorithm 3 selects enough examples so as to ensure concentration of the empirical errors of classifiers in H on S_k around their true errors.

Invariant 2 (Concentration). At the end of epoch k, Ŝ_k, S_k and σ_k are such that:

1. For any pair of classifiers h, h′ ∈ H, it holds that:

|(err(h, S_k) − err(h′, S_k)) − (err_D(h) − err_D(h′))| ≤ σ_k + √( σ_k ρ_{S_k}(h, h′) )   (17)

2. The dataset Ŝ_k has the following property:

σ_k + √( σ_k err(ĥ_k, Ŝ_k) ) ≤ ε_k/512   (18)

Finally, Invariant 3 ensures that the difference classifier produced in epoch k has low false negative error on the disagreement region of the (1 − δ)-confidence set at epoch k.

Invariant 3 (Difference Classifier). At epoch k, the difference classifier output by Algorithm 2 is such that

P_𝒟(ĥ^df_k(x) = −1, y_O ≠ y_W, x ∈ R_{k−1}) ≤ ε_k/64   (19)

P_𝒟(ĥ^df_k(x) = +1, x ∈ R_{k−1}) ≤ 6(α(2ν + ε_{k−1}, ε_k/512) + ε_k/1024)   (20)

We will show the following property about the three invariants. Its proof is deferred to Subsection B.4.

Lemma 2. There is a numerical constant c_0 > 0 such that the following holds. The collection of events {F_k}_{k=0}^{k_0} is such that for k ∈ {0, 1, …, k_0}:
(1) If k = 0, then on event F_k, at epoch k,
  (1.1) Invariants 1, 2 hold.
  (1.2) The number of label requests to O is at most m_0 ≤ c_0(d + ln(1/δ)).
(2) If k ≥ 1, then on event F_k, at epoch k,
  (2.1) Invariants 1, 2, 3 hold.
  (2.2) The number of label requests to O is at most

m_k ≤ c_0( (α(2ν + ε_{k−1}, ε_k/1024) + ε_k)(ν + ε_k) / ε_k² · d(ln²(1/ε_k) + ln²(1/δ_k)) + P_U(x ∈ Δ(2ν + ε_{k−1})) / ε_k · (d′ ln(1/ε_k) + ln(1/δ_k)) )


B.1 Active Label Inference and Identifying the Disagreement Region

We begin by proving some lemmas about Algorithm 4 which identifies if an example lies in the disagreementregion of the current confidence set. This is done by using a constrained ERM oracle CONS-LEARNH(Ā·, Ā·)using ideas similar to [10, 15, 3, 4].

Lemma 3. When given as input a datasetS, a thresholdĻ„ > 0, an unlabeled example x, Algorithm 4in disagr region returns 1 if and only if x lies inside DIS(V(S,Ļ„)).

Proof. (ā‡’) If Algorithm 4 returns 1, then we have found a classifierhā€²x such that (1)hx(x) =āˆ’h(x), and (2)err(hā€²x, S)āˆ’err(h, S)ā‰¤ Ļ„ , i.e. hā€²x āˆˆV(S,Ļ„). Therefore,x is in DIS(V(S,Ļ„)).(ā‡) If x is in DIS(V(S,Ļ„)), then there exists a classifierhāˆˆH such that (1)h(x) =āˆ’h(x) and (2) err(h, S)āˆ’err(h, S)ā‰¤ Ļ„ . Hence by definition ofhā€²x, err(hā€²x, S)āˆ’err(h, S)ā‰¤ Ļ„ . Thus, Algorithm 4 returns 1.

We now provide some lemmas about the behavior of Algorithm 4 when called at epoch $k$.

Lemma 4. Suppose Invariants 1 and 2 hold at the end of epoch $k-1$. If $h \in \mathcal{H}$ is such that $\mathrm{err}_D(h) \le \mathrm{err}_D(h^*) + \epsilon_{k-1}/2$, then
$$\mathrm{err}(h, \hat{S}_{k-1}) - \mathrm{err}(\hat{h}_{k-1}, \hat{S}_{k-1}) \le 3\epsilon_{k-1}/4$$

Proof. If $h \in \mathcal{H}$ has excess error at most $\epsilon_{k-1}/2$ with respect to $D$, then
$$\begin{aligned}
& \mathrm{err}(h, \hat{S}_{k-1}) - \mathrm{err}(\hat{h}_{k-1}, \hat{S}_{k-1}) \\
&\le \mathrm{err}(h, S_{k-1}) - \mathrm{err}(\hat{h}_{k-1}, S_{k-1}) + \epsilon_{k-1}/16 \\
&\le \mathrm{err}_D(h) - \mathrm{err}_D(\hat{h}_{k-1}) + \sigma_{k-1} + \sqrt{\sigma_{k-1}\,\rho_{S_{k-1}}(h, \hat{h}_{k-1})} + \epsilon_{k-1}/16 \\
&\le \epsilon_{k-1}/2 + \sigma_{k-1} + \sqrt{\sigma_{k-1}\,\rho_{S_{k-1}}(h, \hat{h}_{k-1})} + \epsilon_{k-1}/16 \\
&\le 9\epsilon_{k-1}/16 + \sigma_{k-1} + \sqrt{\sigma_{k-1}\,\mathrm{err}(h, \hat{S}_{k-1})} + \sqrt{\sigma_{k-1}\,\mathrm{err}(\hat{h}_{k-1}, \hat{S}_{k-1})} \\
&\le 9\epsilon_{k-1}/16 + \sigma_{k-1} + \sqrt{\sigma_{k-1}\,\mathrm{err}(h, \hat{S}_{k-1})} + \sqrt{\sigma_{k-1}(\mathrm{err}(\hat{h}_{k-1}, \hat{S}_{k-1}) + 9\epsilon_{k-1}/16)}
\end{aligned}$$
where the first inequality follows from Invariant 1, the second inequality from Equation (17) of Invariant 2, the third inequality from the assumption that $h$ has excess error at most $\epsilon_{k-1}/2$, the fourth inequality from the triangle inequality, and the fifth inequality from adding a nonnegative number in the last term. Continuing,
$$\begin{aligned}
& \mathrm{err}(h, \hat{S}_{k-1}) - \mathrm{err}(\hat{h}_{k-1}, \hat{S}_{k-1}) \\
&\le 9\epsilon_{k-1}/16 + 4\sigma_{k-1} + 2\sqrt{\sigma_{k-1}(\mathrm{err}(\hat{h}_{k-1}, \hat{S}_{k-1}) + 9\epsilon_{k-1}/16)} \\
&\le 9\epsilon_{k-1}/16 + 4\sigma_{k-1} + 2\sqrt{\sigma_{k-1}\,\mathrm{err}(\hat{h}_{k-1}, \hat{S}_{k-1})} + 2\sqrt{(\epsilon_{k-1}/512)\cdot 9\epsilon_{k-1}/16} \\
&\le 9\epsilon_{k-1}/16 + \epsilon_{k-1}/32 + 2\sqrt{(\epsilon_{k-1}/512)\cdot 9\epsilon_{k-1}/16} \\
&\le 3\epsilon_{k-1}/4
\end{aligned}$$
where the first inequality is by simple algebra (letting $D = \mathrm{err}(h, \hat{S}_{k-1})$, $E = \mathrm{err}(\hat{h}_{k-1}, \hat{S}_{k-1}) + 9\epsilon_{k-1}/16$, $F = \sigma_{k-1}$ in the implication $D \le E + F + \sqrt{DF} + \sqrt{EF} \Rightarrow D \le E + 4F + 2\sqrt{EF}$), the second inequality is from $\sqrt{A+B} \le \sqrt{A} + \sqrt{B}$ and $\sigma_{k-1} \le \epsilon_{k-1}/512$, which uses Equation (18) of Invariant 2, the third inequality is again by Equation (18) of Invariant 2, and the fourth inequality is by algebra.
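For completeness, here is a short verification of the algebraic implication used in the first step above (this derivation is ours and is not part of the original argument); all quantities are nonnegative. Write $a = \sqrt{E}$, $b = \sqrt{F}$, $u = \sqrt{D}$. The hypothesis $D \le E + F + \sqrt{DF} + \sqrt{EF}$ reads $u^2 - bu - (a^2 + ab + b^2) \le 0$, so solving the quadratic in $u$ gives
$$u \le \frac{b + \sqrt{4a^2 + 4ab + 5b^2}}{2}.$$
It therefore suffices to check that $b + \sqrt{4a^2 + 4ab + 5b^2} \le 2\sqrt{a^2 + 2ab + 4b^2}$, since the square of the right-hand side divided by 4 is exactly $E + 2\sqrt{EF} + 4F$. Squaring both sides, this reduces to
$$2b\sqrt{4a^2 + 4ab + 5b^2} \le 4ab + 10b^2,$$
which is trivial for $b = 0$ and otherwise equivalent to $\sqrt{4a^2 + 4ab + 5b^2} \le 2a + 5b$; squaring once more gives $4a^2 + 4ab + 5b^2 \le 4a^2 + 20ab + 25b^2$, which clearly holds.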


Lemma 5. Suppose Invariants 1 and 2 hold at the end of epoch $k-1$. Then
$$\mathrm{err}_D(\hat{h}_{k-1}) - \mathrm{err}_D(h^*) \le \epsilon_{k-1}/8$$

Proof. By Lemma 4, since $h^*$ has excess error 0 with respect to $D$,
$$\mathrm{err}(h^*, \hat{S}_{k-1}) - \mathrm{err}(\hat{h}_{k-1}, \hat{S}_{k-1}) \le 3\epsilon_{k-1}/4 \qquad (21)$$
Therefore,
$$\begin{aligned}
& \mathrm{err}_D(\hat{h}_{k-1}) - \mathrm{err}_D(h^*) \\
&\le \mathrm{err}(\hat{h}_{k-1}, S_{k-1}) - \mathrm{err}(h^*, S_{k-1}) + \sigma_{k-1} + \sqrt{\sigma_{k-1}\,\rho_{S_{k-1}}(\hat{h}_{k-1}, h^*)} \\
&\le \mathrm{err}(\hat{h}_{k-1}, \hat{S}_{k-1}) - \mathrm{err}(h^*, \hat{S}_{k-1}) + \sigma_{k-1} + \sqrt{\sigma_{k-1}\,\rho_{S_{k-1}}(\hat{h}_{k-1}, h^*)} + \epsilon_{k-1}/16 \\
&\le \epsilon_{k-1}/16 + \sigma_{k-1} + \sqrt{\sigma_{k-1}(\mathrm{err}(\hat{h}_{k-1}, \hat{S}_{k-1}) + \mathrm{err}(h^*, \hat{S}_{k-1}))} \\
&\le \epsilon_{k-1}/16 + \sigma_{k-1} + \sqrt{\sigma_{k-1}(2\,\mathrm{err}(\hat{h}_{k-1}, \hat{S}_{k-1}) + 3\epsilon_{k-1}/4)} \\
&\le \epsilon_{k-1}/16 + \sigma_{k-1} + \sqrt{2\sigma_{k-1}\,\mathrm{err}(\hat{h}_{k-1}, \hat{S}_{k-1})} + \sqrt{(\epsilon_{k-1}/512)\cdot 3\epsilon_{k-1}/4} \\
&\le \epsilon_{k-1}/8
\end{aligned}$$
where the first inequality is from Equation (17) of Invariant 2, the second inequality uses Invariant 1, the third inequality follows from the optimality of $\hat{h}_{k-1}$ on $\hat{S}_{k-1}$ and the triangle inequality, the fourth inequality uses Equation (21), the fifth inequality uses the fact that $\sqrt{A+B} \le \sqrt{A} + \sqrt{B}$ and $\sigma_{k-1} \le \epsilon_{k-1}/512$, which is from Equation (18) of Invariant 2, and the last inequality again utilizes Equation (18) of Invariant 2.

Lemma 6. Suppose Invariants 1, 2, and 3 hold in epoch $k-1$ conditioned on event $F_{k-1}$. Then conditioned on event $F_{k-1}$, the implicit confidence set $V_{k-1} = V(\hat{S}_{k-1}, 3\epsilon_k/2)$ is such that:
(1) If $h \in \mathcal{H}$ satisfies $\mathrm{err}_D(h) - \mathrm{err}_D(h^*) \le \epsilon_k$, then $h$ is in $V_{k-1}$.
(2) If $h \in \mathcal{H}$ is in $V_{k-1}$, then $\mathrm{err}_D(h) - \mathrm{err}_D(h^*) \le \epsilon_{k-1}$. Hence $V_{k-1} \subseteq B_U(h^*, 2\nu + \epsilon_{k-1})$.
(3) Algorithm 4 (in\_disagr\_region), when run on inputs dataset $\hat{S}_{k-1}$, threshold $3\epsilon_k/2$, and unlabeled example $x$, returns 1 if and only if $x$ is in $R_{k-1}$.

Proof. (1) Let $h$ be a classifier with $\mathrm{err}_D(h) - \mathrm{err}_D(h^*) \le \epsilon_k = \epsilon_{k-1}/2$. Then, by Lemma 4, one has $\mathrm{err}(h, \hat{S}_{k-1}) - \mathrm{err}(\hat{h}_{k-1}, \hat{S}_{k-1}) \le 3\epsilon_{k-1}/4 = 3\epsilon_k/2$. Hence, $h$ is in $V_{k-1}$.
(2) Fix any $h$ in $V_{k-1}$. By definition of $V_{k-1}$,
$$\mathrm{err}(h, \hat{S}_{k-1}) - \mathrm{err}(\hat{h}_{k-1}, \hat{S}_{k-1}) \le 3\epsilon_k/2 = 3\epsilon_{k-1}/4 \qquad (22)$$
Recall that from Lemma 5,
$$\mathrm{err}_D(\hat{h}_{k-1}) - \mathrm{err}_D(h^*) \le \epsilon_{k-1}/8$$
Thus for classifier $h$, applying Invariant 1 with $h' := \hat{h}_{k-1}$, we get
$$\mathrm{err}(h, S_{k-1}) - \mathrm{err}(\hat{h}_{k-1}, S_{k-1}) \le \mathrm{err}(h, \hat{S}_{k-1}) - \mathrm{err}(\hat{h}_{k-1}, \hat{S}_{k-1}) + \epsilon_{k-1}/32 \qquad (23)$$
Therefore,
$$\begin{aligned}
& \mathrm{err}_D(h) - \mathrm{err}_D(\hat{h}_{k-1}) \\
&\le \mathrm{err}(h, S_{k-1}) - \mathrm{err}(\hat{h}_{k-1}, S_{k-1}) + \sigma_{k-1} + \sqrt{\sigma_{k-1}\,\rho_{S_{k-1}}(h, \hat{h}_{k-1})} \\
&\le \mathrm{err}(h, S_{k-1}) - \mathrm{err}(\hat{h}_{k-1}, S_{k-1}) + \sigma_{k-1} + \sqrt{\sigma_{k-1}(\mathrm{err}(h, \hat{S}_{k-1}) + \mathrm{err}(\hat{h}_{k-1}, \hat{S}_{k-1}))} \\
&\le \mathrm{err}(h, \hat{S}_{k-1}) - \mathrm{err}(\hat{h}_{k-1}, \hat{S}_{k-1}) + \sigma_{k-1} + \sqrt{\sigma_{k-1}(\mathrm{err}(h, \hat{S}_{k-1}) + \mathrm{err}(\hat{h}_{k-1}, \hat{S}_{k-1}))} + \epsilon_{k-1}/16 \\
&\le 13\epsilon_{k-1}/16 + \sigma_{k-1} + \sqrt{\sigma_{k-1}(2\,\mathrm{err}(\hat{h}_{k-1}, \hat{S}_{k-1}) + 3\epsilon_{k-1}/4)} \\
&\le 13\epsilon_{k-1}/16 + \sigma_{k-1} + \sqrt{2\sigma_{k-1}\,\mathrm{err}(\hat{h}_{k-1}, \hat{S}_{k-1})} + \sqrt{(\epsilon_{k-1}/512)\cdot 3\epsilon_{k-1}/4} \\
&\le 7\epsilon_{k-1}/8
\end{aligned}$$
where the first inequality is from Equation (17) of Invariant 2; the second inequality uses the fact that $\rho_{S_{k-1}}(h, h') = \rho_{\hat{S}_{k-1}}(h, h') \le \mathrm{err}(h, \hat{S}_{k-1}) + \mathrm{err}(h', \hat{S}_{k-1})$ for $h, h' \in \mathcal{H}$; the third inequality uses Equation (23); the fourth inequality is from Equation (22); the fifth inequality is from the fact that $\sqrt{A+B} \le \sqrt{A} + \sqrt{B}$ and $\sigma_{k-1} \le \epsilon_{k-1}/512$, which is from Equation (18) of Invariant 2; and the last inequality again follows from Equation (18) of Invariant 2 and algebra.
In conjunction with the fact that $\mathrm{err}_D(\hat{h}_{k-1}) - \mathrm{err}_D(h^*) \le \epsilon_{k-1}/8$, this implies
$$\mathrm{err}_D(h) - \mathrm{err}_D(h^*) \le \epsilon_{k-1}$$
By the triangle inequality, $\rho(h, h^*) \le 2\nu + \epsilon_{k-1}$, hence $h \in B_U(h^*, 2\nu + \epsilon_{k-1})$. In summary, $V_{k-1} \subseteq B_U(h^*, 2\nu + \epsilon_{k-1})$.
(3) Follows directly from Lemma 3 and the fact that $R_{k-1} = \mathrm{DIS}(V_{k-1})$.

B.2 Training the Difference Classifier

Recall that $\Delta(r) = \mathrm{DIS}(B_U(h^*, r))$ is the disagreement region of the disagreement ball centered at $h^*$ with radius $r$.

Lemma 7 (Difference Classifier Invariant). There is a numerical constant $c_1 > 0$ such that the following holds. Suppose that Invariants 1 and 2 hold at the end of epoch $k-1$ conditioned on event $F_{k-1}$, and that Algorithm 2 has inputs unlabeled data distribution $U$, oracle $O$, $\epsilon = \epsilon_k/128$, hypothesis class $\mathcal{H}^{df}$, $\delta = \delta_k/2$, and previous labeled dataset $\hat{S}_{k-1}$. Then conditioned on event $F_k$:
(1) $\hat{h}^{df}_k$, the output of Algorithm 2, maintains Invariant 3.
(2) (Label Complexity: Part 1.) The number of label queries made to $O$ is at most
$$m_{k,1} \le c_1\left(\frac{P_U(x \in \Delta(2\nu + \epsilon_{k-1}))}{\epsilon_k}\left(d'\ln\frac{1}{\epsilon_k} + \ln\frac{1}{\delta_k}\right)\right)$$

Proof. (1) Recall that $F_k = F_{k-1} \cap E^1_k \cap E^2_k$, where $E^1_k$, $E^2_k$ are defined in Subsection A.4. Suppose event $F_k$ happens.


Proof of Equation (19). Recall that $\hat{h}^{df}_k$ is the optimal solution of optimization problem (2). By feasibility and the fact that on event $E^1_k$, $2\hat{p}_k \ge P_D(x \in R_{k-1})$, we have
$$P_{\mathcal{A}'_k}(\hat{h}^{df}_k(x) = -1, y_O \ne y_W) \le \frac{\epsilon_k}{256\hat{p}_k} \le \frac{\epsilon_k}{128\,P_D(x \in R_{k-1})}$$
By definition of event $E^1_k$, this implies
$$\begin{aligned}
& P_{\mathcal{A}_k}(\hat{h}^{df}_k(x) = -1, y_O \ne y_W) \\
&\le P_{\mathcal{A}'_k}(\hat{h}^{df}_k(x) = -1, y_O \ne y_W) + \sqrt{P_{\mathcal{A}'_k}(\hat{h}^{df}_k(x) = -1, y_O \ne y_W)\,\frac{\epsilon_k}{1024\,P_D(x \in R_{k-1})}} + \frac{\epsilon_k}{1024\,P_D(x \in R_{k-1})} \\
&\le \frac{\epsilon_k}{64\,P_D(x \in R_{k-1})}
\end{aligned}$$
indicating
$$P_D(\hat{h}^{df}_k(x) = -1,\ y_O \ne y_W,\ x \in R_{k-1}) \le \frac{\epsilon_k}{64}$$

Proof of Equation (20). By the definition of $h^{df}_k$ in Subsection A.4, $h^{df}_k$ is such that:
$$P_D(h^{df}_k(x) = +1,\ x \in \Delta(2\nu + \epsilon_{k-1})) \le \alpha(2\nu + \epsilon_{k-1}, \epsilon_k/512)$$
$$P_D(h^{df}_k(x) = -1,\ y_O \ne y_W,\ x \in \Delta(2\nu + \epsilon_{k-1})) \le \epsilon_k/512$$
By item (2) of Lemma 6, we have $R_{k-1} \subseteq \mathrm{DIS}(B_U(h^*, 2\nu + \epsilon_{k-1}))$, thus
$$P_D(h^{df}_k(x) = +1,\ x \in R_{k-1}) \le \alpha(2\nu + \epsilon_{k-1}, \epsilon_k/512) \qquad (24)$$
$$P_D(h^{df}_k(x) = -1,\ y_O \ne y_W,\ x \in R_{k-1}) \le \epsilon_k/512 \qquad (25)$$
Equation (25) implies that
$$P_{\mathcal{A}_k}(h^{df}_k(x) = -1, y_O \ne y_W) \le \frac{\epsilon_k}{512\,P_D(x \in R_{k-1})} \qquad (26)$$

Recall that $\mathcal{A}'_k$ is the dataset subsampled from $\mathcal{A}_k$ in line 3 of Algorithm 2. By definition of event $E^1_k$, we have that for $h^{df}_k$,
$$\begin{aligned}
& P_{\mathcal{A}'_k}(h^{df}_k(x) = -1, y_O \ne y_W) \\
&\le P_{\mathcal{A}_k}(h^{df}_k(x) = -1, y_O \ne y_W) + \sqrt{P_{\mathcal{A}_k}(h^{df}_k(x) = -1, y_O \ne y_W)\,\frac{\epsilon_k}{1024\,P_D(x \in R_{k-1})}} + \frac{\epsilon_k}{1024\,P_D(x \in R_{k-1})} \\
&\le \frac{\epsilon_k}{256\,P_D(x \in R_{k-1})} \le \frac{\epsilon_k}{256\hat{p}_k}
\end{aligned}$$
where the second inequality is from Equation (26), and the last inequality is from the fact that $\hat{p}_k \le P_D(x \in R_{k-1})$. Hence, $h^{df}_k$ is a feasible solution to optimization problem (2).

Thus,
$$\begin{aligned}
& P_{\mathcal{A}_k}(\hat{h}^{df}_k(x) = +1) \\
&\le P_{\mathcal{A}'_k}(\hat{h}^{df}_k(x) = +1) + \sqrt{P_{\mathcal{A}'_k}(\hat{h}^{df}_k(x) = +1)\,\frac{\epsilon_k}{1024\,P_D(x \in R_{k-1})}} + \frac{\epsilon_k}{1024\,P_D(x \in R_{k-1})} \\
&\le 2\left(P_{\mathcal{A}'_k}(\hat{h}^{df}_k(x) = +1) + \frac{\epsilon_k}{1024\,P_D(x \in R_{k-1})}\right) \\
&\le 2\left(P_{\mathcal{A}'_k}(h^{df}_k(x) = +1) + \frac{\epsilon_k}{1024\,P_D(x \in R_{k-1})}\right) \\
&\le 2\left(P_{\mathcal{A}_k}(h^{df}_k(x) = +1) + \sqrt{P_{\mathcal{A}_k}(h^{df}_k(x) = +1)\,\frac{\epsilon_k}{1024\,P_D(x \in R_{k-1})}} + \frac{\epsilon_k}{1024\,P_D(x \in R_{k-1})} + \frac{\epsilon_k}{1024\,P_D(x \in R_{k-1})}\right) \\
&\le 6\left(P_{\mathcal{A}_k}(h^{df}_k(x) = +1) + \frac{\epsilon_k}{1024\,P_D(x \in R_{k-1})}\right)
\end{aligned}$$
where the first inequality is by definition of event $E^1_k$, the second inequality is by algebra, the third inequality is by the optimality of $\hat{h}^{df}_k$ in (2), namely $P_{\mathcal{A}'_k}(\hat{h}^{df}_k(x) = +1) \le P_{\mathcal{A}'_k}(h^{df}_k(x) = +1)$, the fourth inequality is by definition of event $E^1_k$, and the fifth inequality is by algebra.

Therefore,
$$P_D(\hat{h}^{df}_k(x) = +1,\ x \in R_{k-1}) \le 6\,(P_D(h^{df}_k(x) = +1,\ x \in R_{k-1}) + \epsilon_k/1024) \le 6\,(\alpha(2\nu + \epsilon_{k-1}, \epsilon_k/512) + \epsilon_k/1024) \qquad (27)$$
where the second inequality follows from Equation (24). This establishes the correctness of Invariant 3.
(2) The number of label requests to $O$ follows from line 3 of Algorithm 2 (see Equation (16)). That is, we can choose $c_1$ large enough (independently of $k$), such that
$$m_{k,1} \le c_1\left(\frac{P_D(x \in R_{k-1})}{\epsilon_k}\left(d'\ln\frac{1}{\epsilon_k} + \ln\frac{1}{\delta_k}\right)\right) \le c_1\left(\frac{P_U(x \in \Delta(2\nu + \epsilon_{k-1}))}{\epsilon_k}\left(d'\ln\frac{1}{\epsilon_k} + \ln\frac{1}{\delta_k}\right)\right)$$
where in the second step we use the fact that on event $F_k$, by item (2) of Lemma 6, $R_{k-1} \subseteq \mathrm{DIS}(B_U(h^*, 2\nu + \epsilon_{k-1}))$, and thus $P_D(x \in R_{k-1}) \le P_D(x \in \Delta(2\nu + \epsilon_{k-1})) = P_U(x \in \Delta(2\nu + \epsilon_{k-1}))$.
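As a side note, the feasibility/optimality reasoning above can be made concrete with a small sketch of the kind of constrained empirical selection that problem (2) describes. The sketch below is our own illustration, under the assumption that (2) minimizes the empirical fraction of points predicted $+1$ over a finite class, subject to an empirical false-negative budget on the subsample; the names (candidates, sample, budget) are hypothetical and not the paper's notation.

def select_difference_classifier(candidates, sample, budget):
    # Constrained selection over a finite candidate set of difference classifiers:
    #   minimize   fraction of points predicted +1
    #   subject to fraction of points with (prediction == -1 and y_O != y_W) <= budget
    # `sample` is a list of (x, y_O, y_W) triples.
    n = len(sample)
    best, best_pos_rate = None, float("inf")
    for h_df in candidates:
        false_neg = sum(h_df(x) == -1 and y_o != y_w for x, y_o, y_w in sample) / n
        pos_rate = sum(h_df(x) == +1 for x, _, _ in sample) / n
        if false_neg <= budget and pos_rate < best_pos_rate:
            best, best_pos_rate = h_df, pos_rate
    return best  # None if no candidate meets the constraint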

B.3 Adaptive Subsampling

Lemma 8. There is a numerical constant $c_2 > 0$ such that the following holds. Suppose Invariants 1, 2, and 3 hold in epoch $k-1$ on event $F_{k-1}$, and Algorithm 3 receives inputs unlabeled distribution $U$, classifier $\hat{h}_{k-1}$, difference classifier $\hat{h}^{df} = \hat{h}^{df}_k$, target excess error $\epsilon = \epsilon_k$, confidence $\delta = \delta_k/2$, and previous labeled dataset $\hat{S}_{k-1}$. Then on event $F_k$:
(1) $\hat{S}_k$, the output of Algorithm 3, maintains Invariants 1 and 2.
(2) (Label Complexity: Part 2.) The number of label queries to $O$ in Algorithm 3 is at most
$$m_{k,2} \le c_2\left(\frac{(\nu + \epsilon_k)(\alpha(2\nu + \epsilon_{k-1}, \epsilon_k/512) + \epsilon_k)}{\epsilon_k^2}\cdot d\left(\ln^2\frac{1}{\epsilon_k} + \ln^2\frac{1}{\delta_k}\right)\right)$$

Proof. (1) Recall that $F_k = F_{k-1} \cap E^1_k \cap E^2_k$, where $E^1_k$, $E^2_k$ are defined in Subsection A.4. Suppose event $F_k$ happens.


Proof of Invariant 1. We consider a pair of classifiers $h, h' \in \mathcal{H}$, where $h$ is an arbitrary classifier in $\mathcal{H}$ and $h'$ has excess error at most $\epsilon_k$.
At iteration $t = t_0(k)$ of Algorithm 3, the breaking criterion in line 14 is met, i.e.
$$\sigma(2^{t_0(k)}, \delta^{t_0(k)}_k) + \sqrt{\sigma(2^{t_0(k)}, \delta^{t_0(k)}_k)\,\mathrm{err}(\hat{h}^{t_0(k)}_k, \hat{S}^{t_0(k)}_k)} \le \epsilon_k/512 \qquad (28)$$
First we expand the definitions of $\mathrm{err}(h, S_k)$ and $\mathrm{err}(h, \hat{S}_k)$ respectively:
$$\mathrm{err}(h, S_k) = P_{S_k}(\hat{h}^{df}_k(x) = +1, h(x) \ne y_O, x \in R_{k-1}) + P_{S_k}(\hat{h}^{df}_k(x) = -1, h(x) \ne y_O, x \in R_{k-1}) + P_{S_k}(h(x) \ne y_O, x \notin R_{k-1})$$
$$\mathrm{err}(h, \hat{S}_k) = P_{\hat{S}_k}(\hat{h}^{df}_k(x) = +1, h(x) \ne y_O, x \in R_{k-1}) + P_{\hat{S}_k}(\hat{h}^{df}_k(x) = -1, h(x) \ne y_W, x \in R_{k-1}) + P_{\hat{S}_k}(h(x) \ne h^*(x), x \notin R_{k-1})$$

where we use the fact that, by Lemma 6, for all examples $x \notin R_{k-1}$, $\hat{h}_{k-1}(x) = h^*(x)$.
We next show that $P_{S_k}(\hat{h}^{df}_k(x) = -1, h(x) \ne y_O, x \in R_{k-1})$ is close to $P_{\hat{S}_k}(\hat{h}^{df}_k(x) = -1, h(x) \ne y_W, x \in R_{k-1})$. From Lemma 7, we know that conditioned on event $F_k$,
$$P_D(\hat{h}^{df}_k(x) = -1, y_O \ne y_W, x \in R_{k-1}) \le \epsilon_k/64$$
In the meantime, from Equation (28), $\gamma(2^{t_0(k)}, \delta^{t_0(k)}_k) \le \sigma(2^{t_0(k)}, \delta^{t_0(k)}_k) \le \epsilon_k/512$. Recall that $\hat{S}_k = \hat{S}^{t_0(k)}_k$. Therefore, by definition of $E^2_k$,
$$\begin{aligned}
& P_{S_k}(\hat{h}^{df}_k(x) = -1, y_O \ne y_W, x \in R_{k-1}) \\
&\le P_D(\hat{h}^{df}_k(x) = -1, y_O \ne y_W, x \in R_{k-1}) + \sqrt{P_D(\hat{h}^{df}_k(x) = -1, y_O \ne y_W, x \in R_{k-1})\,\gamma(2^{t_0(k)}, \delta^{t_0(k)}_k)} + \gamma(2^{t_0(k)}, \delta^{t_0(k)}_k) \\
&\le P_D(\hat{h}^{df}_k(x) = -1, y_O \ne y_W, x \in R_{k-1}) + \sqrt{P_D(\hat{h}^{df}_k(x) = -1, y_O \ne y_W, x \in R_{k-1})\cdot\epsilon_k/512} + \epsilon_k/512 \\
&\le \epsilon_k/32
\end{aligned}$$

By the triangle inequality, for all classifiers $h_0 \in \mathcal{H}$,
$$|P_{S_k}(\hat{h}^{df}_k(x) = -1, h_0(x) \ne y_O, x \in R_{k-1}) - P_{\hat{S}_k}(\hat{h}^{df}_k(x) = -1, h_0(x) \ne y_W, x \in R_{k-1})| \le \epsilon_k/32 \qquad (29)$$
In particular, Equation (29) holds for $h$ and $h'$:
$$|P_{S_k}(\hat{h}^{df}_k(x) = -1, h(x) \ne y_O, x \in R_{k-1}) - P_{\hat{S}_k}(\hat{h}^{df}_k(x) = -1, h(x) \ne y_W, x \in R_{k-1})| \le \epsilon_k/32$$
$$|P_{S_k}(\hat{h}^{df}_k(x) = -1, h'(x) \ne y_O, x \in R_{k-1}) - P_{\hat{S}_k}(\hat{h}^{df}_k(x) = -1, h'(x) \ne y_W, x \in R_{k-1})| \le \epsilon_k/32$$
Combining, we get:
$$\big(P_{S_k}(\hat{h}^{df}_k(x) = -1, h(x) \ne y_O, x \in R_{k-1}) - P_{S_k}(\hat{h}^{df}_k(x) = -1, h'(x) \ne y_O, x \in R_{k-1})\big) - \big(P_{\hat{S}_k}(\hat{h}^{df}_k(x) = -1, h(x) \ne y_W, x \in R_{k-1}) - P_{\hat{S}_k}(\hat{h}^{df}_k(x) = -1, h'(x) \ne y_W, x \in R_{k-1})\big) \le \epsilon_k/16 \qquad (30)$$

We now show that the labels inferred in the region $\mathcal{X} \setminus R_{k-1}$ are "favorable" to the classifiers whose excess error is at most $\epsilon_k$. By the triangle inequality,
$$P_{S_k}(h(x) \ne y_O, x \notin R_{k-1}) - P_{S_k}(h^*(x) \ne y_O, x \notin R_{k-1}) \le P_{\hat{S}_k}(h(x) \ne h^*(x), x \notin R_{k-1})$$
By Lemma 6, since $h'$ has excess error at most $\epsilon_k$, $h'$ agrees with $h^*$ on all $x$ inside $\mathcal{X} \setminus R_{k-1}$ on event $F_{k-1}$; hence $P_{\hat{S}_k}(h'(x) \ne h^*(x), x \notin R_{k-1}) = 0$. This gives
$$P_{S_k}(h(x) \ne y_O, x \notin R_{k-1}) - P_{S_k}(h'(x) \ne y_O, x \notin R_{k-1}) \le P_{\hat{S}_k}(h(x) \ne h^*(x), x \notin R_{k-1}) - P_{\hat{S}_k}(h'(x) \ne h^*(x), x \notin R_{k-1}) \qquad (31)$$
Combining Equations (30) and (31), we conclude that
$$\mathrm{err}(h, S_k) - \mathrm{err}(h', S_k) \le \mathrm{err}(h, \hat{S}_k) - \mathrm{err}(h', \hat{S}_k) + \epsilon_k/16$$
This establishes the correctness of Invariant 1.

Proof of Invariant 2. Recall that by definition of $E^2_k$, the following concentration results hold for all $t \in \mathbb{N}$:
$$|(\mathrm{err}(h, S^t_k) - \mathrm{err}(h', S^t_k)) - (\mathrm{err}_D(h) - \mathrm{err}_D(h'))| \le \sigma(2^t, \delta^t_k) + \sqrt{\sigma(2^t, \delta^t_k)\,\rho_{S^t_k}(h, h')}$$
In particular, for iteration $t_0(k)$ we have
$$|(\mathrm{err}(h, S^{t_0(k)}_k) - \mathrm{err}(h', S^{t_0(k)}_k)) - (\mathrm{err}_D(h) - \mathrm{err}_D(h'))| \le \sigma(2^{t_0(k)}, \delta^{t_0(k)}_k) + \sqrt{\sigma(2^{t_0(k)}, \delta^{t_0(k)}_k)\,\rho_{S^{t_0(k)}_k}(h, h')}$$
Recall that $S_k = S^{t_0(k)}_k$, $\hat{h}_k = \hat{h}^{t_0(k)}_k$, and $\sigma_k = \sigma(2^{t_0(k)}, \delta^{t_0(k)}_k)$, hence the above is equivalent to
$$|(\mathrm{err}(h, S_k) - \mathrm{err}(h', S_k)) - (\mathrm{err}_D(h) - \mathrm{err}_D(h'))| \le \sigma_k + \sqrt{\sigma_k\,\rho_{S_k}(h, h')} \qquad (32)$$
Equation (32) establishes the correctness of Equation (17) of Invariant 2. Equation (18) of Invariant 2 follows from Equation (28).

(2) We define $\tilde{h}_k = \arg\min_{h \in \mathcal{H}} \mathrm{err}_{\tilde{D}_k}(h)$, and define $\tilde{\nu}_k$ to be $\mathrm{err}_{\tilde{D}_k}(\tilde{h}_k)$. To prove the bound on the number of label requests, we first claim that if $t$ is sufficiently large that
$$\sigma(2^t, \delta^t_k) + \sqrt{\sigma(2^t, \delta^t_k)\,\tilde{\nu}_k} \le \epsilon_k/1536 \qquad (33)$$
then the algorithm will satisfy the breaking criterion at line 14 of Algorithm 3, that is, for this value of $t$,
$$\sigma(2^t, \delta^t_k) + \sqrt{\sigma(2^t, \delta^t_k)\,\mathrm{err}(\hat{h}^t_k, \hat{S}^t_k)} \le \epsilon_k/512 \qquad (34)$$
Indeed, by definition of $E^2_k$, if event $F_k$ happens,
$$\mathrm{err}(\tilde{h}_k, \hat{S}^t_k) \le \mathrm{err}_{\tilde{D}_k}(\tilde{h}_k) + \sigma(2^t, \delta^t_k) + \sqrt{\sigma(2^t, \delta^t_k)\,\mathrm{err}_{\tilde{D}_k}(\tilde{h}_k)} = \tilde{\nu}_k + \sigma(2^t, \delta^t_k) + \sqrt{\sigma(2^t, \delta^t_k)\,\tilde{\nu}_k} \qquad (35)$$


Therefore,
$$\begin{aligned}
\sigma(2^t, \delta^t_k) + \sqrt{\sigma(2^t, \delta^t_k)\,\mathrm{err}(\hat{h}^t_k, \hat{S}^t_k)}
&\le \sigma(2^t, \delta^t_k) + \sqrt{\sigma(2^t, \delta^t_k)\,\mathrm{err}(\tilde{h}_k, \hat{S}^t_k)} \\
&\le \sigma(2^t, \delta^t_k) + \sqrt{\sigma(2^t, \delta^t_k)(2\tilde{\nu}_k + 2\sigma(2^t, \delta^t_k))} \\
&\le 3\sigma(2^t, \delta^t_k) + 2\sqrt{\sigma(2^t, \delta^t_k)\,\tilde{\nu}_k} \\
&\le \epsilon_k/512
\end{aligned}$$
where the first inequality is from the optimality of $\hat{h}^t_k$ on $\hat{S}^t_k$, the second inequality is from Equation (35), the third inequality is by algebra, and the last inequality follows from Equation (33). The claim follows.
Next, we solve for the minimum $t$ that satisfies (33), which is an upper bound on $t_0(k)$. Fact 1 implies that there is a numerical constant $c_3 > 0$ such that
$$2^{t_0(k)} \le c_3\,\frac{\tilde{\nu}_k + \epsilon_k}{\epsilon_k^2}\left(d\ln\frac{1}{\epsilon_k} + \ln\frac{1}{\delta_k}\right)$$
Thus, there is a numerical constant $c_4 > 0$ such that
$$t_0(k) \le c_4\left(\ln d + \ln\frac{1}{\epsilon_k} + \ln\ln\frac{1}{\delta_k}\right)$$

Hence, there is a numerical constant $c_5 > 0$ (that does not depend on $k$) such that the following holds. If event $F_k$ happens, then the number of label queries made by Algorithm 3 to $O$ can be bounded as follows:
$$\begin{aligned}
m_{k,2} &= \sum_{t=1}^{t_0(k)} |S^{t,U}_k \cap \{x : \hat{h}^{df}_k(x) = +1\} \cap R_{k-1}| \\
&= \sum_{t=1}^{t_0(k)} 2^t\, P_{\mathcal{S}^t_k}(\hat{h}^{df}_k(x) = +1, x \in R_{k-1}) \\
&\le \sum_{t=1}^{t_0(k)} 2^t\left(2 P_D(\hat{h}^{df}_k(x) = +1, x \in R_{k-1}) + 2\cdot 4\,\frac{\ln\frac{2}{\delta^t_k}}{2^t}\right) \\
&\le 4\cdot 2^{t_0(k)}\, P_D(\hat{h}^{df}_k(x) = +1, x \in R_{k-1}) + 8\, t_0(k)\ln\frac{2}{\delta^{t_0(k)}_k} \\
&\le c_5\left(\left(\frac{(\tilde{\nu}_k + \epsilon_k)\,P_D(\hat{h}^{df}_k(x) = +1, x \in R_{k-1})}{\epsilon_k^2} + 1\right)\cdot d\left(\ln^2\frac{1}{\epsilon_k} + \ln^2\frac{1}{\delta_k}\right)\right) \\
&\le c_5\left(\left(\frac{(\tilde{\nu}_k + \epsilon_k)\cdot 6(\alpha(2\nu + \epsilon_{k-1}, \epsilon_k/512) + \epsilon_k/1024)}{\epsilon_k^2} + 1\right)\cdot d\left(\ln^2\frac{1}{\epsilon_k} + \ln^2\frac{1}{\delta_k}\right)\right)
\end{aligned}$$
where the second equality is from the fact that $|S^{t,U}_k \cap \{x : \hat{h}^{df}_k(x) = +1\} \cap R_{k-1}| = |S^{t,U}_k|\cdot P_{\mathcal{S}^t_k}(\hat{h}^{df}_k(x) = +1, x \in R_{k-1})$, in conjunction with $|S^{t,U}_k| = 2^t$; the first inequality is by definition of $E^2_k$; the second and third inequalities are from the bound on $2^{t_0(k)}$ above together with the algebraic fact that $t_0(k)\ln\frac{1}{\delta^{t_0(k)}_k} \le c_5\, d(\ln^2\frac{1}{\epsilon_k} + \ln^2\frac{1}{\delta_k})$ for some constant $c_5 > 0$; and the fourth step is from Lemma 7, which states that Invariant 3 holds at epoch $k$.
What remains to be argued is an upper bound on $\tilde{\nu}_k$. Note that

$$\begin{aligned}
\tilde{\nu}_k &= \min_{h \in \mathcal{H}}\left[P_D(\hat{h}^{df}_k(x) = -1, h(x) \ne y_W, x \in R_{k-1}) + P_D(\hat{h}^{df}_k(x) = +1, h(x) \ne y_O, x \in R_{k-1}) + P_D(h(x) \ne h^*(x), x \notin R_{k-1})\right] \\
&\le P_D(\hat{h}^{df}_k(x) = -1, h^*(x) \ne y_W, x \in R_{k-1}) + P_D(\hat{h}^{df}_k(x) = +1, h^*(x) \ne y_O, x \in R_{k-1}) \\
&\le P_D(\hat{h}^{df}_k(x) = -1, h^*(x) \ne y_O, x \in R_{k-1}) + P_D(\hat{h}^{df}_k(x) = +1, h^*(x) \ne y_O, x \in R_{k-1}) + \epsilon_k/64 \\
&\le P_D(\hat{h}^{df}_k(x) = -1, h^*(x) \ne y_O, x \in R_{k-1}) + P_D(\hat{h}^{df}_k(x) = +1, h^*(x) \ne y_O, x \in R_{k-1}) + P_D(h^*(x) \ne y_O, x \notin R_{k-1}) + \epsilon_k/64 \\
&= \nu + \epsilon_k/64
\end{aligned}$$
where the first step is by definition of $\mathrm{err}_{\tilde{D}_k}(h)$, the second step is by the suboptimality of $h^*$ in the minimization, the third step uses the triangle inequality together with Equation (19) of Invariant 3, the fourth step is by adding the nonnegative term $P_D(h^*(x) \ne y_O, x \notin R_{k-1})$, and the fifth step is by definition of $\mathrm{err}_D(h^*)$. Therefore, we conclude that there is a numerical constant $c_2 > 0$ such that $m_{k,2}$, the number of label requests to $O$ in Algorithm 3, is at most
$$c_2\left(\frac{(\nu + \epsilon_k)(\alpha(2\nu + \epsilon_{k-1}, \epsilon_k/512) + \epsilon_k)}{\epsilon_k^2}\cdot d\left(\ln^2\frac{1}{\epsilon_k} + \ln^2\frac{1}{\delta_k}\right)\right)$$
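To summarize the structure that the two parts of this proof rely on, here is a minimal sketch (ours, with hypothetical helper callables) of the adaptive doubling loop of Algorithm 3: at iteration $t$ a fresh sample of size $2^t$ is drawn, labels are inferred outside $R_{k-1}$, the weak labeler is used where the difference classifier predicts agreement, the oracle is queried on the remaining points, and the loop stops once the breaking criterion of line 14 holds. The function sigma below is only a stand-in for the paper's $\sigma(\cdot, \cdot)$, not its exact constants.

import math

def sigma(n, delta, d):
    # Stand-in for a d/n-type concentration width; not the paper's exact constants.
    return (d * math.log(2 * n) + math.log(8.0 / delta)) / n

def adaptive_subsample(eps_k, delta_k, d, draw_unlabeled, in_R, h_df,
                       query_O, query_W, inferred_label, erm, err, t_max=30):
    # draw_unlabeled(n) -> list of x; in_R(x) -> bool; h_df(x) -> +1/-1;
    # query_O / query_W / inferred_label(x) -> label; erm(S) -> classifier;
    # err(h, S) -> empirical error.  All of these are hypothetical inputs.
    for t in range(1, t_max + 1):
        delta_t = delta_k / (t * (t + 1))            # per-iteration confidence split
        S_hat, oracle_queries = [], 0
        for x in draw_unlabeled(2 ** t):
            if not in_R(x):
                y = inferred_label(x)                # agreed-upon label outside R_{k-1}
            elif h_df(x) == +1:
                y = query_O(x); oracle_queries += 1  # predicted disagreement: trust O
            else:
                y = query_W(x)                       # predicted agreement: use W
            S_hat.append((x, y))
        h_t = erm(S_hat)
        s = sigma(2 ** t, delta_t, d)
        if s + math.sqrt(s * err(h_t, S_hat)) <= eps_k / 512:   # breaking criterion
            return S_hat, h_t, oracle_queries
    return S_hat, h_t, oracle_queries                # sample budget exhausted (sketch only)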

B.4 Putting It Together – Consistency and Label Complexity

Proof of Lemma 2. With foresight, pick $c_0 > 0$ to be a large enough constant. We prove the result by induction.

Base case. Consider $k = 0$. Recall that $F_0$ is defined as
$$F_0 = \left\{\text{for all } h, h' \in \mathcal{H},\ |(\mathrm{err}(h, S_0) - \mathrm{err}(h', S_0)) - (\mathrm{err}_D(h) - \mathrm{err}_D(h'))| \le \sigma(n_0, \delta_0) + \sqrt{\sigma(n_0, \delta_0)\,\rho_{S_0}(h, h')}\right\}$$
Note that by definition in Subsection A.3, $\hat{S}_0 = S_0$. Therefore Invariant 1 trivially holds. When $F_0$ happens, Equation (17) of Invariant 2 holds, and $n_0$ is such that $\sqrt{\sigma_0} \le \epsilon_0/1024$; thus
$$\sigma_0 + \sqrt{\sigma_0\,\mathrm{err}(\hat{h}_0, \hat{S}_0)} \le \epsilon_0/512$$
which establishes the validity of Equation (18) of Invariant 2. Meanwhile, the number of label requests to $O$ is
$$n_0 = 64\cdot 1024^2\left(d\ln(512\cdot 1024^2) + \ln\frac{96}{\delta}\right) \le c_0\left(d + \ln\frac{1}{\delta}\right)$$

Inductive case. Suppose the claim holds for all $k' < k$. The inductive hypothesis states that Invariants 1, 2, 3 hold in epoch $k-1$ on event $F_{k-1}$. By Lemma 7 and Lemma 8, Invariants 1, 2, 3 hold in epoch $k$ on event $F_k$. Suppose $F_k$ happens. By Lemma 7, there is a numerical constant $c_1 > 0$ such that the number of label queries in Algorithm 2 in line 12 is at most
$$m_{k,1} \le c_1\left(\frac{P_U(x \in \Delta(2\nu + \epsilon_{k-1}))}{\epsilon_k}\left(d'\ln\frac{1}{\epsilon_k} + \ln\frac{1}{\delta_k}\right)\right)$$

Meanwhile, by Lemma 8, there is a numerical constant $c_2 > 0$ such that the number of label queries in Algorithm 3 in line 14 is at most
$$m_{k,2} \le c_2\left(\frac{(\alpha(2\nu + \epsilon_{k-1}, \epsilon_k/512) + \epsilon_k)(\nu + \epsilon_k)}{\epsilon_k^2}\cdot d\left(\ln^2\frac{1}{\epsilon_k} + \ln^2\frac{1}{\delta_k}\right)\right)$$

Thus, the total number of label requests at epoch $k$ is at most
$$m_k = m_{k,1} + m_{k,2} \le c_0\left(\frac{(\alpha(2\nu + \epsilon_{k-1}, \epsilon_k/512) + \epsilon_k)(\nu + \epsilon_k)}{\epsilon_k^2}\, d\left(\ln^2\frac{1}{\epsilon_k} + \ln^2\frac{1}{\delta_k}\right) + \frac{P_U(x \in \Delta(2\nu + \epsilon_{k-1}))}{\epsilon_k}\left(d'\ln\frac{1}{\epsilon_k} + \ln\frac{1}{\delta_k}\right)\right)$$
This completes the induction.

Theorem 3 (Consistency). If $F_{k_0}$ happens, then the classifier $\hat{h}$ returned by Algorithm 1 is such that
$$\mathrm{err}_D(\hat{h}) - \mathrm{err}_D(h^*) \le \epsilon$$
Proof. By Lemma 2, Invariants 1, 2, 3 hold at epoch $k_0$. Thus by Lemma 5,
$$\mathrm{err}_D(\hat{h}) - \mathrm{err}_D(h^*) = \mathrm{err}_D(\hat{h}_{k_0}) - \mathrm{err}_D(h^*) \le \epsilon_{k_0}/8 \le \epsilon$$

Proof of Theorem 1. This is an immediate consequence of Theorem 3.

Theorem 4 (Label Complexity). If $F_{k_0}$ happens, then the number of label queries made by Algorithm 1 to $O$ is at most
$$\tilde{O}\left(\left(\sup_{r\ge\epsilon}\frac{\alpha(2\nu + r, r/1024) + r}{2\nu + r}\right) d\left(\frac{\nu^2}{\epsilon^2} + 1\right) + \left(\sup_{r\ge\epsilon}\frac{P_U(x \in \Delta(2\nu + r))}{2\nu + r}\right) d'\left(\frac{\nu}{\epsilon} + 1\right)\right)$$

Proof. Conditioned on event $F_{k_0}$, we bound the sum $\sum_{k=0}^{k_0} m_k$:
$$\begin{aligned}
\sum_{k=0}^{k_0} m_k
&\le c_0\left(d + \ln\frac{1}{\delta}\right) + c_0\left(\sum_{k=1}^{k_0}\frac{(\alpha(2\nu + \epsilon_{k-1}, \epsilon_k/512) + \epsilon_k)(\nu + \epsilon_k)}{\epsilon_k^2}\, d\left(\ln^2\frac{1}{\epsilon_k} + \ln^2\frac{1}{\delta_k}\right) + \frac{P_U(x \in \Delta(2\nu + \epsilon_{k-1}))}{\epsilon_k}\left(d'\ln\frac{1}{\epsilon_k} + \ln\frac{1}{\delta_k}\right)\right) \\
&\le c_0\left(d + \ln\frac{1}{\delta}\right) + c_0\left(\sum_{k=1}^{k_0}\frac{(\alpha(2\nu + \epsilon_{k-1}, \epsilon_k/512) + \epsilon_k)(\nu + \epsilon_k)}{\epsilon_k^2}\, d\left(3\ln^2\frac{1}{\epsilon} + 2\ln^2\frac{1}{\delta}\right) + \frac{P_U(x \in \Delta(2\nu + \epsilon_{k-1}))}{\epsilon_k}\left(2d'\ln\frac{1}{\epsilon} + \ln\frac{1}{\delta}\right)\right) \\
&\le \left(\sup_{r\ge\epsilon}\frac{\alpha(2\nu + r, r/1024) + r}{2\nu + r}\right) d\left(3\ln^2\frac{1}{\epsilon} + 2\ln^2\frac{1}{\delta}\right)\sum_{k=0}^{k_0}\frac{(\nu + \epsilon_k)^2}{\epsilon_k^2} + \left(\sup_{r\ge\epsilon}\frac{P_U(x \in \Delta(2\nu + r))}{2\nu + r}\right)\left(2d'\ln\frac{1}{\epsilon} + \ln\frac{1}{\delta}\right)\sum_{k=0}^{k_0}\frac{\nu + \epsilon_k}{\epsilon_k} \\
&\le \tilde{O}\left(\left(\sup_{r\ge\epsilon}\frac{\alpha(2\nu + r, r/1024) + r}{2\nu + r}\right) d\left(\frac{\nu^2}{\epsilon^2} + 1\right) + \left(\sup_{r\ge\epsilon}\frac{P_U(x \in \Delta(2\nu + r))}{2\nu + r}\right) d'\left(\frac{\nu}{\epsilon} + 1\right)\right)
\end{aligned}$$
where the first inequality is by Lemma 2, the second inequality is by noticing that for all $k \ge 1$, $\ln^2\frac{1}{\epsilon_k} + \ln^2\frac{1}{\delta_k} \le 3\ln^2\frac{1}{\epsilon} + 2\ln^2\frac{1}{\delta}$ and $d'\ln\frac{1}{\epsilon_k} + \ln\frac{1}{\delta_k} \le 2d'\ln\frac{1}{\epsilon} + \ln\frac{1}{\delta}$, and the rest of the derivation follows from standard algebra.
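To spell out the "standard algebra" in the last two steps (our own remark, using the relation $\epsilon_k = \epsilon_{k-1}/2$ noted in the proof of Lemma 6, and assuming, as in the epoch schedule, that $\epsilon_{k_0} = \Theta(\epsilon)$): the quantities $1/\epsilon_k$ grow geometrically in $k$, so the epoch sums are dominated by their last terms. Concretely,
$$\sum_{k=0}^{k_0}\frac{(\nu + \epsilon_k)^2}{\epsilon_k^2} \le \sum_{k=0}^{k_0}\left(\frac{2\nu^2}{\epsilon_k^2} + 2\right) \le \frac{8\nu^2}{3\epsilon_{k_0}^2} + 2(k_0 + 1) = O\left(\frac{\nu^2}{\epsilon^2} + \ln\frac{1}{\epsilon}\right),$$
$$\sum_{k=0}^{k_0}\frac{\nu + \epsilon_k}{\epsilon_k} \le \frac{2\nu}{\epsilon_{k_0}} + (k_0 + 1) = O\left(\frac{\nu}{\epsilon} + \ln\frac{1}{\epsilon}\right),$$
and the residual $k_0 = O(\ln\frac{1}{\epsilon})$ and logarithmic factors are absorbed into the $\tilde{O}(\cdot)$ notation.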

Proof of Theorem 2. Item 1 is an immediate consequence of Lemma 2, whereas item 2 is a consequence of Theorem 4.

C Case Study: Linear Classification under Uniform Distribution over the Unit Ball

We remind the reader of the setting of our example in Section 4. $\mathcal{H}$ is the class of homogeneous linear separators over the $d$-dimensional unit ball and $\mathcal{H}^{df}$ is defined to be $\{h\Delta h' : h, h' \in \mathcal{H}\}$; note that $d'$ is at most $5d$. Furthermore, $U$ is the uniform distribution over the unit ball. $O$ is a deterministic labeler such that $\mathrm{err}_D(h^*) = \nu > 0$, and $W$ is such that there exists a difference classifier $\bar{h}^{df}$ with false negative error 0 for which $\Pr_U(\bar{h}^{df}(x) = 1) \le g = o(\sqrt{d}\,\nu)$. We prove the label complexity bound provided by Corollary 1.

Proof of Corollary 1. We claim that under the assumptions of Corollary 1, $\alpha(2\nu + r, r/1024)$ is at most $g$. Indeed, by taking $h^{df} = \bar{h}^{df}$, observe that
$$P(\bar{h}^{df}(x) = -1, y_W \ne y_O, x \in \Delta(2\nu + r)) \le P(\bar{h}^{df}(x) = -1, y_W \ne y_O) = 0$$
$$P(\bar{h}^{df}(x) = +1, x \in \Delta(2\nu + r)) \le g$$
This shows that $\alpha(2\nu + r, 0) \le g$. Hence, $\alpha(2\nu + r, r/1024) \le \alpha(2\nu + r, 0) \le g$. Therefore,
$$\sup_{r \ge \epsilon}\frac{\alpha(2\nu + r, r/1024) + r}{2\nu + r} \le \sup_{r \ge \epsilon}\frac{g + r}{\nu + r} \le \max\left(\frac{g}{\nu}, 1\right)$$
Recall that the disagreement coefficient satisfies $\theta(2\nu + r) \le \sqrt{d}$ for all $r$, and $d' \le 5d$. Thus, by Theorem 2, the number of label queries to $O$ is at most
$$\tilde{O}\left(d\max\left(\frac{g}{\nu}, 1\right)\left(\frac{\nu^2}{\epsilon^2} + 1\right) + d^{3/2}\left(\frac{1 + \nu}{\epsilon}\right)\right)$$
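As a sanity check on the supremum step above (our own remark): for $r \ge 0$ the map $r \mapsto \frac{g + r}{\nu + r}$ is monotone, since
$$\frac{d}{dr}\left(\frac{g + r}{\nu + r}\right) = \frac{\nu - g}{(\nu + r)^2},$$
so it increases toward the limit $1$ when $g \le \nu$, and decreases from a value of at most $\frac{g}{\nu}$ when $g > \nu$. In either case, $\sup_{r \ge \epsilon}\frac{g + r}{\nu + r} \le \max\left(\frac{g}{\nu}, 1\right)$.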

D Performance Guarantees for Learning with Respect to Data Labeled by O and W

An interesting variant of our model is to consider learning from data labeled by a mixture of $O$ and $W$. Let $D_W$ be the distribution over labeled examples determined by $U$ and $W$; specifically, $P_{D_W}(x, y) = P_U(x)P_W(y|x)$. Let $D'$ be a mixture of $D$ and $D_W$, specifically $D' = (1 - \beta)D + \beta D_W$, for some parameter $\beta > 0$. Define $h'$ to be the best classifier with respect to $D'$, and denote by $\nu'$ the error of $h'$ with respect to $D'$.

Let $O'$ be the following mixture oracle. Given an example $x$, the label $y_{O'}$ is generated as follows: $O'$ flips a coin with bias $\beta$. If it comes up heads, it queries $W$ for the label of $x$ and returns the result; otherwise $O$ is queried and the result returned. It is immediate that the conditional probability induced by $O'$ is $P_{O'}(y|x) = (1 - \beta)P_O(y|x) + \beta P_W(y|x)$, and $D'(x, y) = P_{O'}(y|x)P_U(x)$.
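For concreteness, the mixture oracle $O'$ can be simulated as follows; this is a minimal sketch in which query_O, query_W, and the other function names are ours, and only the bias parameter $\beta$ comes from the construction above.

import random

def make_mixture_oracle(query_O, query_W, beta, seed=0):
    # Return a labeler O' that, on each call, answers with W's label with
    # probability beta and with O's label with probability 1 - beta.
    rng = random.Random(seed)
    def query_O_prime(x):
        return query_W(x) if rng.random() < beta else query_O(x)
    return query_O_prime

# Usage sketch: wrap existing labelers and hand O' to the learner in place of O.
# query_O_prime = make_mixture_oracle(query_O=expert_label, query_W=resident_label, beta=0.3)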


Assumption 2. For any $r, \eta > 0$, there exists an $h^{df}_{\eta,r} \in \mathcal{H}^{df}$ with the following properties:
$$P_D(h^{df}_{\eta,r}(x) = -1,\ x \in \Delta(r),\ y_{O'} \ne y_W) \le \eta$$
$$P_D(h^{df}_{\eta,r}(x) = 1,\ x \in \Delta(r)) \le \alpha'(r, \eta)$$

Recall that the disagreement coefficient at scale $r$ is $\theta(r) = \sup_{h \in \mathcal{H}}\sup_{r' \ge r}\frac{P_U(\mathrm{DIS}(B_U(h, r')))}{r'}$, which depends only on the unlabeled data distribution $U$ and not on $W$ or $O$.

We have the following corollary.

Corollary 3 (Learning with respect to the Mixture). Let $d$ be the VC dimension of $\mathcal{H}$ and let $d'$ be the VC dimension of $\mathcal{H}^{df}$. Suppose Assumption 2 holds and the error of the best classifier in $\mathcal{H}$ on $D'$ is at most $\nu'$. Suppose Algorithm 1 is run with inputs unlabeled distribution $U$, target excess error $\epsilon$, confidence $\delta$, labeling oracle $O'$, weak oracle $W$, hypothesis class $\mathcal{H}$, and hypothesis class for the difference classifier $\mathcal{H}^{df}$. Then with probability $\ge 1 - 2\delta$, the following hold:

1. The classifier $\hat{h}$ output by Algorithm 1 satisfies $\mathrm{err}_{D'}(\hat{h}) \le \mathrm{err}_{D'}(h') + \epsilon$.

2. The total number of label queries made by Algorithm 1 to the oracle $O$ is at most
$$\tilde{O}\left((1 - \beta)\left(\sup_{r \ge \epsilon}\frac{\alpha'(2\nu' + r, r/1024) + r}{2\nu' + r}\cdot d\left(\frac{\nu'^2}{\epsilon^2} + 1\right) + \theta(2\nu' + \epsilon)\, d'\left(\frac{\nu'}{\epsilon} + 1\right)\right)\right)$$

Proof Sketch. Consider running Algorithm 1 in the setting above. By Theorem 1 and Theorem 2, there is an event $F$ with $P(F) \ge 1 - \delta$ such that if event $F$ happens, then $\hat{h}$, the classifier learned by Algorithm 1, satisfies
$$\mathrm{err}_{D'}(\hat{h}) \le \mathrm{err}_{D'}(h') + \epsilon$$
By Theorem 2, the number of label requests to $O'$ is at most
$$m_{O'} = \tilde{O}\left(\sup_{r \ge \epsilon}\frac{\alpha'(2\nu' + r, r/1024) + r}{2\nu' + r}\cdot d\left(\frac{\nu'^2}{\epsilon^2} + 1\right) + \theta(2\nu' + \epsilon)\, d'\left(\frac{\nu'}{\epsilon} + 1\right)\right)$$
Since $O'$ is simulated by drawing a Bernoulli random variable $Z_i \sim \mathrm{Ber}(1 - \beta)$ on each call to $O'$ (if $Z_i = 1$ then $O(x)$ is returned, otherwise $W(x)$ is returned), define the event
$$H = \left\{\sum_{i=1}^{m_{O'}} Z_i \le 2\left((1 - \beta)m_{O'} + 4\ln\frac{2}{\delta}\right)\right\}$$
By the Chernoff bound, $P(H) \ge 1 - \delta$. Consider the event $J = F \cap H$; by the union bound, $P(J) \ge 1 - 2\delta$. Conditioned on event $J$, the number of label requests to $O$ is at most $\sum_{i=1}^{m_{O'}} Z_i$, which is at most
$$\tilde{O}\left((1 - \beta)\left(\sup_{r \ge \epsilon}\frac{\alpha'(2\nu' + r, r/1024) + r}{2\nu' + r}\cdot d\left(\frac{\nu'^2}{\epsilon^2} + 1\right) + \theta(2\nu' + \epsilon)\, d'\left(\frac{\nu'}{\epsilon} + 1\right)\right)\right)$$


E Remaining Proofs

Proof of Fact 2. (1) First, by Lemma 1, $P_D(x \in R_{k-1})/2 \le \hat{p}_k \le P_D(x \in R_{k-1})$ holds with probability $1 - \delta_k/6$. Second, for each classifier $h^{df} \in \mathcal{H}^{df}$, define the functions $f^1_{h^{df}}$ and $f^2_{h^{df}}$ associated with it. Formally,
$$f^1_{h^{df}}(x, y_O, y_W) = I(h^{df}(x) = -1, y_O \ne y_W)$$
$$f^2_{h^{df}}(x, y_O, y_W) = I(h^{df}(x) = +1)$$
Consider the function classes $\mathcal{F}^1 = \{f^1_{h^{df}} : h^{df} \in \mathcal{H}^{df}\}$ and $\mathcal{F}^2 = \{f^2_{h^{df}} : h^{df} \in \mathcal{H}^{df}\}$. Note that both $\mathcal{F}^1$ and $\mathcal{F}^2$ have VC dimension $d'$, which is the same as that of $\mathcal{H}^{df}$. We note that $\mathcal{A}'_k$ is a random sample of size $m_k$ drawn i.i.d. from $\mathcal{A}_k$. The fact follows from the normalized VC inequality applied to $\mathcal{F}^1$ and $\mathcal{F}^2$ and the choice of $m_k$ in Algorithm 2 called in epoch $k$, along with the union bound.

Proof of Fact 3. For fixed $t$, we note that $S^t_k$ is a random sample of size $2^t$ drawn i.i.d. from $D$. By Equation (8), for any fixed $t \in \mathbb{N}$,
$$P\Big(\text{for all } h, h' \in \mathcal{H},\ |(\mathrm{err}(h, S^t_k) - \mathrm{err}(h', S^t_k)) - (\mathrm{err}_D(h) - \mathrm{err}_D(h'))| \le \sigma(2^t, \delta^t_k) + \sqrt{\sigma(2^t, \delta^t_k)\,\rho_{S^t_k}(h, h')}\Big) \ge 1 - \delta^t_k/8 \qquad (36)$$
Meanwhile, for fixed $t \in \mathbb{N}$, note that $\hat{S}^t_k$ is a random sample of size $2^t$ drawn i.i.d. from $\tilde{D}_k$. By Equation (8),
$$P\Big(\text{for all } h \in \mathcal{H},\ \mathrm{err}(h, \hat{S}^t_k) - \mathrm{err}_{\tilde{D}_k}(h) \le \sigma(2^t, \delta^t_k) + \sqrt{\sigma(2^t, \delta^t_k)\,\mathrm{err}_{\tilde{D}_k}(h)}\Big) \ge 1 - \delta^t_k/8 \qquad (37)$$
Moreover, for fixed $t \in \mathbb{N}$, note that $\mathcal{S}^t_k$ is a random sample of size $2^t$ drawn i.i.d. from $\mathcal{D}$. By Equation (12),
$$P\Big(P_{\mathcal{S}^t_k}(\hat{h}^{df}_k(x) = -1, y_O \ne y_W, x \in R_{k-1}) \le P_D(\hat{h}^{df}_k(x) = -1, y_O \ne y_W, x \in R_{k-1}) + \sqrt{\gamma(2^t, \delta^t_k)\,P_D(\hat{h}^{df}_k(x) = -1, y_O \ne y_W, x \in R_{k-1})} + \gamma(2^t, \delta^t_k)\Big) \ge 1 - \delta^t_k/8 \qquad (38)$$
Finally, for fixed $t \in \mathbb{N}$, note that $\mathcal{S}^t_k$ is a random sample of size $2^t$ drawn i.i.d. from $\mathcal{D}$. By Equation (12),
$$P\Big(P_{\mathcal{S}^t_k}(\hat{h}^{df}_k(x) = +1, x \in R_{k-1}) \le P_D(\hat{h}^{df}_k(x) = +1, x \in R_{k-1}) + \sqrt{P_D(\hat{h}^{df}_k(x) = +1, x \in R_{k-1})\,\gamma(2^t, \delta^t_k)} + \gamma(2^t, \delta^t_k)\Big) \ge 1 - \delta^t_k/8 \qquad (39)$$
Note that by algebra,
$$P_D(\hat{h}^{df}_k(x) = +1, x \in R_{k-1}) + \sqrt{P_D(\hat{h}^{df}_k(x) = +1, x \in R_{k-1})\,\gamma(2^t, \delta^t_k)} + \gamma(2^t, \delta^t_k) \le 2\big(P_D(\hat{h}^{df}_k(x) = +1, x \in R_{k-1}) + \gamma(2^t, \delta^t_k)\big)$$
Therefore,
$$P\Big(P_{\mathcal{S}^t_k}(\hat{h}^{df}_k(x) = +1, x \in R_{k-1}) \le 2\big(P_D(\hat{h}^{df}_k(x) = +1, x \in R_{k-1}) + \gamma(2^t, \delta^t_k)\big)\Big) \ge 1 - \delta^t_k/8 \qquad (40)$$
The proof follows by applying the union bound over Equations (36), (37), (38) and (40) and over $t \in \mathbb{N}$.

We emphasize that $\mathcal{S}^t_k$ is chosen i.i.d. at random after $\hat{h}^{df}_k$ is determined; thus a uniform convergence argument over $\mathcal{H}^{df}$ is not necessary for Equations (38) and (40).

Proof of Fact 4. By induction on $k$.

Base Case. For $k = 0$, it follows directly from the normalized VC inequality that $P(F_0) \ge 1 - \delta_0$.

Inductive Case. Assume $P(F_{k-1}) \ge 1 - \delta_0 - \cdots - \delta_{k-1}$ holds. By the union bound,
$$P(F_k) \ge P(F_{k-1} \cap E^1_k \cap E^2_k) \ge P(F_{k-1}) - \delta_k/2 - \delta_k/2 \ge P(F_{k-1}) - \delta_k$$
Hence, $P(F_k) \ge 1 - \delta_0 - \cdots - \delta_k$. This finishes the induction. In particular, $P(F_{k_0}) \ge 1 - \delta_0 - \cdots - \delta_{k_0} \ge 1 - \delta$.
