arXiv:1510.02847v2 [cs.LG] 16 Oct 2015
Active Learning from Weak and Strong Labelers
Chicheng Zhang and Kamalika Chaudhuri
University of California, San Diego
October 19, 2015
Abstract
An active learner is given a hypothesis class, a large set of unlabeled examples and the ability to interactively query labels on a subset of these examples from an oracle; the goal of the learner is to learn a hypothesis in the class that fits the data well by making as few label queries as possible.

This work addresses active learning with labels obtained from strong and weak labelers, where in addition to the standard active learning setting, we have an extra weak labeler which may occasionally provide incorrect labels. An example is learning to classify medical images where either expensive labels may be obtained from a physician (oracle or strong labeler), or cheaper but occasionally incorrect labels may be obtained from a medical resident (weak labeler). Our goal is to learn a classifier with low error on data labeled by the oracle, while using the weak labeler to reduce the number of label queries made to the oracle. We provide an active learning algorithm for this setting, establish its statistical consistency, and analyze its label complexity to characterize when it can provide label savings over using the strong labeler alone.
1 Introduction
An active learner is given a hypothesis class, a large set of unlabeled examples and the ability to interactively make label queries to an oracle on a subset of these examples; the goal of the learner is to learn a hypothesis in the class that fits the data well by making as few oracle queries as possible.
As labeling examples is a tedious task for any one person, many applications of active learning involve synthesizing labels from multiple experts who may have slightly different labeling patterns. While a body of recent empirical work [28, 29, 30, 26, 27, 12] has developed methods for combining labels from multiple experts, little is known on the theory of actively learning with labels from multiple annotators. For example, what kind of assumptions are needed for methods that use labels from multiple sources to work, when these methods are statistically consistent, and when they can yield benefits over plain active learning are all open questions.
This work addresses these questions in the context of active learning from strong and weak labelers. Specifically, in addition to unlabeled data and the usual labeling oracle in standard active learning, we have an extra weak labeler. The labeling oracle is a gold standard, an expert on the problem domain, and it provides high quality but expensive labels. The weak labeler is cheap, but may provide incorrect labels
[email protected], [email protected]
on some inputs. An example is learning to classify medical images where either expensive labels may be obtained from a physician (oracle), or cheaper but occasionally incorrect labels may be obtained from a medical resident (weak labeler). Our goal is to learn a classifier in a hypothesis class whose error with respect to the data labeled by the oracle is low, while exploiting the weak labeler to reduce the number of queries made to this oracle. Observe that in our model the weak labeler can be incorrect anywhere, and does not necessarily provide uniformly noisy labels everywhere, as was assumed by some previous works [8, 24].
A plausible approach in this framework is to learn a difference classifier to predict where the weak labeler differs from the oracle, and then use a standard active learning algorithm which queries the weak labeler when this difference classifier predicts agreement. Our first key observation is that this approach is statistically inconsistent; false negative errors (that predict no difference when O and W differ) lead to biased annotation for the target classification task. We address this problem by learning instead a cost-sensitive difference classifier that ensures that false negative errors rarely occur. Our second key observation is that as existing active learning algorithms usually query labels in localized regions of space, it is sufficient to train the difference classifier restricted to this region and still maintain consistency. This process leads to significant label savings. Combining these two ideas, we get an algorithm that is provably statistically consistent and that works under the assumption that there is a good difference classifier with low false negative error.
We analyze the label complexity of our algorithm as measured by the number of label requests to the labeling oracle. In general we cannot expect any consistent algorithm to provide label savings under all circumstances, and indeed our worst case asymptotic label complexity is the same as that of active learning using the oracle alone. Our analysis characterizes when we can achieve label savings, and we show that this happens for example if the weak labeler agrees with the labeling oracle for some fraction of the examples close to the decision boundary. Moreover, when the target classification task is agnostic, the number of labels required to learn the difference classifier is of a lower order than the number of labels required for active learning; thus in realistic cases, learning the difference classifier adds only a small overhead to the total label requirement, and overall we get label savings over using the oracle alone.
Related Work. There has been a considerable amount of empirical work on active learning where multiple annotators can provide labels for the unlabeled examples. One line of work assumes a generative model for each annotator's labels. The learning algorithm learns the parameters of the individual labelers, and uses them to decide which labeler to query for each example. [29, 30, 13] consider separate logistic regression models for each annotator, while [20, 19] assume that each annotator's labels are corrupted with a different amount of random classification noise. A second line of work [12, 16], which includes Pro-Active Learning, assumes that each labeler is an expert over an unknown subset of categories, and uses data to measure the class-wise expertise in order to optimally place label queries. In general, it is not known under what conditions these algorithms are statistically consistent, particularly when the modeling assumptions do not strictly hold, and under what conditions they provide label savings over regular active learning.
[25], the first theoretical work to consider this problem, consider a model where the weak labeler is more likely to provide incorrect labels in heterogeneous regions of space where similar examples have different labels. Their formalization is orthogonal to ours: while theirs is more natural in a non-parametric setting, ours is more natural for fitting classifiers in a hypothesis class. In a NIPS 2014 Workshop paper, [21] have also considered learning from strong and weak labelers; unlike ours, their work is in the online selective sampling setting, and applies only to linear classifiers and robust regression. [11] study learning from multiple teachers in the online selective sampling setting in a model where different labelers have different regions of expertise.
Finally, there is a large body of theoretical work [1, 9, 10, 14, 31, 2, 4] on learning a binary classifier based on interactive label queries made to a single labeler. In the realizable case, [22, 9] show that a generalization of binary search provides an exponential improvement in label complexity over passive learning. The problem is more challenging, however, in the more realistic agnostic case, where such approaches lead to inconsistency. The two styles of algorithms for agnostic active learning are disagreement-based active learning (DBAL) [1, 10, 14, 4] and the more recent margin-based or confidence-based active learning [2, 31]. Our algorithm builds on recent work in DBAL [4, 15].
2 Preliminaries
The Model. We begin with a general framework for actively learning from weak and strong labelers. In the standard active learning setting, we are given unlabelled data drawn from a distribution U over an input space X, a label space Y = {−1, 1}, a hypothesis class H, and a labeling oracle O to which we can make interactive queries.

In our setting, we additionally have access to a weak labeling oracle W which we can query interactively. Querying W is significantly cheaper than querying O; however, querying W generates a label y_W drawn from a conditional distribution P_W(y_W|x) which is not the same as the conditional distribution P_O(y_O|x) of O.

Let D be the data distribution over labelled examples such that P_D(x, y) = P_U(x) P_O(y|x). Our goal is to learn a classifier ĥ in the hypothesis class H such that with probability ≥ 1 − δ over the sample, we have P_D(ĥ(x) ≠ y) ≤ min_{h'∈H} P_D(h'(x) ≠ y) + ε, while making as few (interactive) queries to O as possible.
Observe that in this model W may disagree with the oracle O anywhere in the input space; this is unlike previous frameworks [8, 24] where labels assigned by the weak labeler are corrupted by random classification noise with a higher variance than the labeling oracle. We believe this feature makes our model more realistic.

Second, unlike [25], mistakes made by the weak labeler do not have to be close to the decision boundary. This keeps the model general and simple, and allows greater flexibility to weak labelers. Our analysis shows that if W is largely incorrect close to the decision boundary, then our algorithm will automatically make more queries to O in its later stages.

Finally, note that O is allowed to be non-realizable with respect to the target hypothesis class H.
Background on Active Learning Algorithms. The standard active learning setting is very similar to ours, the only difference being that we have access to the weak oracle W. There has been a long line of work on active learning [1, 7, 9, 14, 2, 10, 4, 31]. Our algorithms are based on a style called disagreement-based active learning (DBAL). The main idea is as follows. Based on the examples seen so far, the algorithm maintains a candidate set V_t of classifiers in H that is guaranteed with high probability to contain h*, the classifier in H with the lowest error. Given a randomly drawn unlabeled example x_t, if all classifiers in V_t agree on its label, then this label is inferred; observe that with high probability, this inferred label is h*(x_t). Otherwise, x_t is said to be in the disagreement region of V_t, and the algorithm queries O for its label. V_t is updated based on x_t and its label, and the algorithm continues.

Recent works in DBAL [10, 4] have observed that it is possible to determine if an x_t is in the disagreement region of V_t without explicitly maintaining V_t. Instead, a labelled dataset S_t is maintained; the labels of the examples in S_t are obtained by either querying the oracle or direct inference. To determine whether an x_t lies in the disagreement region of V_t, two constrained ERM procedures are performed; empirical risk is minimized over S_t while constraining the classifier to output the label of x_t as 1 and −1 respectively. If these two classifiers have similar training errors, then x_t lies in the disagreement region of V_t; otherwise the algorithm infers a label for x_t that agrees with the label assigned by h*.
More Definitions and Notation. The error of a classifier h under a labelled data distribution Q is defined as err_Q(h) = P_{(x,y)∼Q}(h(x) ≠ y); we use the notation err(h, S) to denote its empirical error on a labelled dataset S. We use the notation h* to denote the classifier with the lowest error under D and ν to denote its error err_D(h*), where D is the target labelled data distribution.

Our active learning algorithm implicitly maintains a (1 − δ)-confidence set for h* throughout the algorithm. Given a set S of labelled examples, a set of classifiers V(S) ⊆ H is said to be a (1 − δ)-confidence set for h* with respect to S if h* ∈ V(S) with probability ≥ 1 − δ over S.

The disagreement between two classifiers h_1 and h_2 under an unlabelled data distribution U, denoted by ρ_U(h_1, h_2), is P_{x∼U}(h_1(x) ≠ h_2(x)). Observe that the disagreements under U form a pseudometric over H. We use B_U(h, r) to denote a ball of radius r centered around h in this metric. The disagreement region of a set V of classifiers, denoted by DIS(V), is the set of all examples x ∈ X such that there exist two classifiers h_1 and h_2 in V for which h_1(x) ≠ h_2(x).
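The disagreement ρ_U(h_1, h_2) is a population quantity; in any implementation it would be estimated from a finite unlabeled sample drawn from U. A minimal Monte Carlo sketch (function and variable names are ours, for illustration only):

```python
def empirical_disagreement(h1, h2, unlabeled_sample):
    """Estimate rho_U(h1, h2) = P_{x ~ U}(h1(x) != h2(x)) by the
    fraction of a finite unlabeled sample on which h1 and h2 differ."""
    if not unlabeled_sample:
        return 0.0
    disagreements = sum(1 for x in unlabeled_sample if h1(x) != h2(x))
    return disagreements / len(unlabeled_sample)
```

With enough samples this estimate concentrates around ρ_U(h_1, h_2); the same idea applied over pairs of classifiers in V gives an empirical version of DIS(V).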
3 Algorithm
Our main algorithm is a standard single-annotator DBAL algorithm with a major modification: when the DBAL algorithm makes a label query, we use an extra sub-routine to decide whether this query should be made to the oracle or the weak labeler, and make it accordingly. How do we make this decision? We try to predict if the weak labeler differs from the oracle on this example; if so, we query the oracle, otherwise, we query the weak labeler.
Key Idea 1: Cost Sensitive Difference Classifier. How do we predict if the weak labeler differs from the oracle? A plausible approach is to learn a difference classifier h^df in a hypothesis class H^df to determine if there is a difference. Our first key observation is that when the region where O and W differ cannot be perfectly modeled by H^df, the resulting active learning algorithm is statistically inconsistent. Any false negative errors (that is, incorrectly predicting no difference) made by the difference classifier lead to biased annotation for the target classification task, which in turn leads to inconsistency. We address this problem by instead learning a cost-sensitive difference classifier, and we assume that a classifier with low false negative error exists in H^df. While training, we constrain the false negative error of the difference classifier to be low, and minimize the number of predicted positives (or disagreements between W and O) subject to this constraint. This ensures that the annotated data used by the active learning algorithm has diminishing bias, thus ensuring consistency.
Key Idea 2: Localized Difference Classifier Training. Unfortunately, even with cost-sensitive training, directly learning a difference classifier accurately is expensive. If d' is the VC-dimension of the difference hypothesis class H^df, then to learn a target classifier to excess error ε, we need a difference classifier with false negative error O(ε), which, from standard generalization theory, requires O(d'/ε) labels [6, 23]! Our second key observation is that we can save on labels by training the difference classifier in a localized manner, because the DBAL algorithm that builds the target classifier only makes label queries in the disagreement region of the current confidence set for h*. Therefore we train the difference classifier only on this region and still maintain consistency. Additionally this provides label savings: while training the target classifier to excess error ε, we need to train a difference classifier with only O(d'φ_k/ε) labels, where φ_k is the probability mass of this disagreement region. The localized training process leads to an additional technical challenge: as the confidence set for h* is updated, its disagreement region changes. We address this through an epoch-based DBAL algorithm, where the confidence set is updated and a fresh difference classifier is trained in each epoch.
Main Algorithm. Our main algorithm (Algorithm 1) combines these two key ideas, and like [4], implicitly maintains the (1 − δ)-confidence set for h* through a labeled dataset S_k. In epoch k, the target excess error is ε_k ≈ 1/2^k, and the goal of Algorithm 1 is to generate a labeled dataset S_k that implicitly represents a (1 − δ_k)-confidence set on h*. Additionally, S_k has the property that the empirical risk minimizer over it has excess error ≤ ε_k.

A naive way to generate such an S_k is by drawing O(d/ε_k²) labeled examples, where d is the VC dimension of H. Our goal, however, is to generate S_k using a much smaller number of label queries, which is accomplished by Algorithm 3. This is done in two ways. First, like standard DBAL, we infer the label of any x that lies outside the disagreement region of the current confidence set for h*. Algorithm 4 identifies whether an x lies in this region. Second, for any x in the disagreement region, we determine whether O and W agree on x using a difference classifier; if there is agreement, we query W, else we query O. The difference classifier used to determine agreement is retrained in the beginning of each epoch by Algorithm 2, which ensures that the annotation has low bias.
The algorithms use a constrained ERM procedure CONS-LEARN. Given a hypothesis class H, a labeled dataset S and a set of constraining examples C, CONS-LEARN_H(C, S) returns a classifier in H that minimizes the empirical error on S subject to h(x_i) = y_i for each (x_i, y_i) ∈ C.
Identifying the Disagreement Region. Algorithm 4 identifies if an unlabeled example x lies in the disagreement region of the current (1 − δ)-confidence set for h*; recall that this confidence set is implicitly maintained through S_k. The identification is based on two ERM queries. Let ĥ be the empirical risk minimizer on the current labeled dataset S_{k−1}, and ĥ' be the empirical risk minimizer on S_{k−1} under the constraint that ĥ'(x) = −ĥ(x). If the training errors of ĥ and ĥ' are very different, then all classifiers with training error close to that of ĥ assign the same label to x, and x lies outside the current disagreement region.
Training the Difference Classifier. Algorithm 2 trains a difference classifier on a random set of examples which lie in the disagreement region of the current confidence set for h*. The training process is cost-sensitive, and is similar to [17, 18, 6, 23]. A hard bound is imposed on the false-negative error, which translates to a bound on the annotation bias for the target task. The number of positives (i.e., the number of examples where W and O differ) is minimized subject to this constraint; this amounts to (approximately) minimizing the fraction of queries made to O.

The number of labeled examples used in training is large enough to ensure false negative error O(ε_k/φ_k) over the disagreement region of the current confidence set; here φ_k is the probability mass of this disagreement region under U. This ensures that the overall annotation bias introduced by this procedure in the target task is at most O(ε_k). As φ_k is small and typically diminishes with k, this requires fewer labels than training the difference classifier globally, which would have required O(d'/ε_k) queries to O.
Algorithm 1 Active Learning Algorithm from Weak and Strong Labelers
1: Input: Unlabeled distribution U, target excess error ε, confidence δ, labeling oracle O, weak oracle W, hypothesis class H, hypothesis class for difference classifier H^df.
2: Output: Classifier ĥ in H.
3: Initialize: initial error ε_0 = 1, confidence δ_0 = δ/4. Total number of epochs k_0 = ⌈log(1/ε)⌉.
4: Initial number of examples n_0 = O((1/ε_0²)(d ln(1/ε_0²) + ln(1/δ_0))).
5: Draw a fresh sample and query O for its labels: S_0 = {(x_1, y_1), ..., (x_{n_0}, y_{n_0})}. Let σ_0 = σ(n_0, δ_0).
6: for k = 1, 2, ..., k_0 do
7:   Set target excess error ε_k = 2^{−k}, confidence δ_k = δ/4(k+1)².
8:   # Train Difference Classifier
9:   ĥ^df_k ← Call Algorithm 2 with inputs unlabeled distribution U, oracles W and O, target excess error ε_k, confidence δ_k/2, previously labeled dataset S_{k−1}.
10:  # Adaptive Active Learning using Difference Classifier
11:  σ_k, S_k ← Call Algorithm 3 with inputs unlabeled distribution U, oracles W and O, difference classifier ĥ^df_k, target excess error ε_k, confidence δ_k/2, previously labeled dataset S_{k−1}.
12: end for
13: return ĥ ← CONS-LEARN_H(∅, S_{k_0}).
Algorithm 2 Training Algorithm for Difference Classifier
1: Input: Unlabeled distribution U, oracles W and O, target error ε, hypothesis class H^df, confidence δ, previous labeled dataset T̂.
2: Output: Difference classifier ĥ^df.
3: Let p̂ be an estimate of P_{x∼U}(in_disagr_region(T̂, 3ε/2, x) = 1), obtained by calling Algorithm 5 with failure probability δ/3.¹
4: Let U' = ∅, i = 1, and

    m = (64 · 1024 p̂/ε)(d' ln(512 · 1024 p̂/ε) + ln(72/δ))    (1)

5: repeat
6:   Draw an example x_i from U.
7:   if in_disagr_region(T̂, 3ε/2, x_i) = 1 then  # x_i is inside the disagreement region
8:     Query both W and O for labels to get y_{i,W} and y_{i,O}.
9:     U' = U' ∪ {(x_i, y_{i,O}, y_{i,W})}
10:  end if
11:  i = i + 1
12: until |U'| = m
13: Learn a classifier ĥ^df ∈ H^df by the following constrained empirical risk minimization:

    ĥ^df = argmin_{h^df ∈ H^df} Σ_{i=1}^m 1(h^df(x_i) = +1),  s.t. Σ_{i=1}^m 1(h^df(x_i) = −1 ∧ y_{i,O} ≠ y_{i,W}) ≤ mε/(256 p̂)    (2)

14: return ĥ^df.
¹ Note that if in Algorithm 5 the upper confidence bound of P_{x∼U}(in_disagr_region(T̂, 3ε/2, x) = 1) is lower than ε/64, then we can halt Algorithm 2 and return an arbitrary ĥ^df in H^df. Using this ĥ^df will still guarantee the correctness of Algorithm 1.
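When H^df is a small finite class, the constrained ERM in step 13 can be carried out by enumeration: discard every candidate whose empirical false-negative count exceeds the budget, then pick the survivor predicting the fewest positives. A sketch under that finite-class assumption (the names and the integer false-negative budget are ours, not the paper's notation):

```python
def train_difference_classifier(samples, H_df, fn_budget):
    """Cost-sensitive ERM over a finite difference-hypothesis class H_df.

    samples:   list of (x, y_O, y_W) triples collected in the disagreement region.
    fn_budget: maximum allowed count of false negatives, i.e. points where the
               candidate predicts -1 (agreement) while y_O != y_W.
    Returns the feasible candidate predicting the fewest positives (cf. eq. (2)),
    or None if no candidate satisfies the constraint."""
    best, best_positives = None, float("inf")
    for h in H_df:
        false_negatives = sum(
            1 for x, y_o, y_w in samples if h(x) == -1 and y_o != y_w)
        if false_negatives > fn_budget:
            continue  # violates the hard false-negative constraint
        positives = sum(1 for x, y_o, y_w in samples if h(x) == +1)
        if positives < best_positives:
            best, best_positives = h, positives
    return best
```

Minimizing the predicted positives corresponds to minimizing the fraction of queries later routed to O, while the hard constraint caps the annotation bias.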
Adaptive Active Learning using the Difference Classifier. Finally, Algorithm 3 is our main active learning procedure, which generates a labeled dataset S_k that is implicitly used to maintain a tighter (1 − δ)-confidence set for h*. Specifically, Algorithm 3 generates an S_k such that the set V_k defined as:

    V_k = {h : err(h, S_k) − min_{h_k∈H} err(h_k, S_k) ≤ 3ε_k/4}

has the property that:

    {h : err_D(h) − err_D(h*) ≤ ε_k/2} ⊆ V_k ⊆ {h : err_D(h) − err_D(h*) ≤ ε_k}

This is achieved by labeling, through inference or query, a large enough sample of unlabeled data drawn from U. Labels are obtained from three sources: direct inference (if x lies outside the disagreement region as identified by Algorithm 4), querying O (if the difference classifier predicts a difference), and querying W. How large should the sample be to reach the target excess error? If err_D(h*) = ν, then achieving an excess error of ε_k requires O(dν/ε_k²) samples, where d is the VC dimension of the hypothesis class. As ν is unknown in advance, we use a doubling procedure in lines 4-14 to iteratively determine the sample size.
Algorithm 3 Adaptive Active Learning using Difference Classifier
1: Input: Unlabeled data distribution U, oracles W and O, difference classifier ĥ^df, target excess error ε, confidence δ, previous labeled dataset T̂.
2: Output: Parameter σ, labeled dataset Ŝ.
3: Let ĥ = CONS-LEARN_H(∅, T̂).
4: for t = 1, 2, ... do
5:   Let δ^t = δ/t(t+1). Define: σ(2^t, δ^t) = (8/2^t)(2d ln(2e·2^t/d) + ln(24/δ^t)).
6:   Draw 2^t examples from U to form S^{t,U}.
7:   for each x ∈ S^{t,U} do:
8:     if in_disagr_region(T̂, 3ε/2, x) = 0 then  # x is inside the agreement region
9:       Add (x, ĥ(x)) to Ŝ^t.
10:    else  # x is inside the disagreement region
11:      If ĥ^df(x) = +1, query O for the label y of x, otherwise query W. Add (x, y) to Ŝ^t.
12:    end if
13:  end for
14:  Train ĥ^t ← CONS-LEARN_H(∅, Ŝ^t).
15:  if σ(2^t, δ^t) + √(σ(2^t, δ^t) · err(ĥ^t, Ŝ^t)) ≤ ε/512 then
16:    t_0 ← t; break
17:  end if
18: end for
19: return σ ← σ(2^{t_0}, δ^{t_0}), Ŝ ← Ŝ^{t_0}.
4 Performance Guarantees
We now examine the performance of our algorithm, which is measured by the number of label queries made to the oracle O. Additionally, we require our algorithm to be statistically consistent, which means that the true error of the output classifier should converge to the true error of the best classifier in H on the data distribution D.
Algorithm 4 in_disagr_region(Ŝ, τ, x): Test if x is in the disagreement region of the current confidence set
1: Input: labeled dataset Ŝ, rejection threshold τ, unlabeled example x.
2: Output: 1 if x is in the disagreement region of the current confidence set, 0 otherwise.
3: Train ĥ ← CONS-LEARN_H(∅, Ŝ).
4: Train ĥ'_x ← CONS-LEARN_H({(x, −ĥ(x))}, Ŝ).
5: if err(ĥ'_x, Ŝ) − err(ĥ, Ŝ) > τ then  # x is in the agreement region
6:   return 0
7: else  # x is in the disagreement region
8:   return 1
9: end if
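Over a finite hypothesis class, the two constrained ERM calls of Algorithm 4 reduce to two minimizations over the class. A sketch under that finite-class assumption (function and variable names are ours):

```python
def in_disagr_region(S, tau, x, H):
    """Algorithm 4 over a finite hypothesis class H: x is declared outside
    the disagreement region iff forcing the opposite label at x raises the
    empirical error on S by more than the rejection threshold tau."""
    def err(h):
        return sum(1 for xi, yi in S if h(xi) != yi) / max(len(S), 1)

    h_hat = min(H, key=err)                        # unconstrained ERM
    flipped = [h for h in H if h(x) == -h_hat(x)]  # candidates for constrained ERM
    if not flipped:
        return 0  # no classifier in H flips x: clearly in the agreement region
    h_prime = min(flipped, key=err)                # ERM constrained at x
    return 0 if err(h_prime) - err(h_hat) > tau else 1
```

For general classes the two minimizations would be delegated to whatever (approximate) ERM oracle is available for H.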
Since our framework is very general, we cannot expect any statistically consistent algorithm to achieve label savings over using O alone under all circumstances. For example, if labels provided by W are the complete opposite of O, no algorithm will achieve both consistency and label savings. We next provide an assumption under which Algorithm 1 works and yields label savings.
Assumption. The following assumption states that the difference hypothesis class contains a good cost-sensitive predictor of when O and W differ in the disagreement region of B_U(h*, r); a predictor is good if it has low false-negative error and predicts a positive label with low frequency. If there is no such predictor, then we cannot expect an algorithm similar to ours to achieve label savings.
Assumption 1. Let D̃ be the joint distribution P_D̃(x, y_O, y_W) = P_U(x) P_W(y_W|x) P_O(y_O|x). For any r, η > 0, there exists an h^df_{η,r} ∈ H^df with the following properties:

    P_D̃(h^df_{η,r}(x) = −1, x ∈ DIS(B_U(h*, r)), y_O ≠ y_W) ≤ η    (3)

    P_D̃(h^df_{η,r}(x) = 1, x ∈ DIS(B_U(h*, r))) ≤ α(r, η)    (4)
Note that (3), which states there is an h^df ∈ H^df with low false-negative error, is minimally restrictive, and is trivially satisfied if H^df includes the constant classifier that always predicts 1. Theorem 1 shows that (3) is sufficient to ensure statistical consistency.

(4) in addition states that the number of positives predicted by the classifier h^df_{η,r} is upper bounded by α(r, η). Note that α(r, η) ≤ P_U(DIS(B_U(h*, r))) always; performance gain is obtained when α(r, η) is lower, which happens when the difference classifier predicts agreement on a significant portion of DIS(B_U(h*, r)).
Consistency. Provided Assumption 1 holds, we next show that Algorithm 1 is statistically consistent. Establishing consistency is non-trivial for our algorithm, as the output classifier is trained on labels from both O and W.

Theorem 1 (Consistency). Let h* be the classifier that minimizes the error with respect to D. If Assumption 1 holds, then with probability ≥ 1 − δ, the classifier ĥ output by Algorithm 1 satisfies: err_D(ĥ) ≤ err_D(h*) + ε.
Label Complexity. The label complexity of standard DBAL is measured in terms of the disagreement coefficient. The disagreement coefficient θ(r) at scale r is defined as θ(r) = sup_{h∈H} sup_{r'≥r} P_U(DIS(B_U(h, r')))/r'; intuitively, this measures the rate of shrinkage of the disagreement region with the radius of the ball B_U(h, r) for any h in H. It was shown by [10] that the label complexity of DBAL for target excess generalization error ε is Õ(d θ(2ν + ε)(1 + ν²/ε²)), where the Õ notation hides factors logarithmic in 1/ε and 1/δ. In contrast, the label complexity of our algorithm is stated in Theorem 2. Here we use the Õ notation for convenience; we have the same dependence on log(1/ε) and log(1/δ) as the bounds for DBAL.
Theorem 2 (Label Complexity). Let d be the VC dimension of H and let d' be the VC dimension of H^df. If Assumption 1 holds, and if the error of the best classifier in H on D is ν, then with probability ≥ 1 − δ, the following hold:

1. The number of label queries made by Algorithm 1 to the oracle O in epoch k is at most:

    m_k = Õ( d(2ν + ε_{k−1})(α(2ν + ε_{k−1}, ε_{k−1}/1024) + ε_{k−1})/ε_k² + d' P(DIS(B_U(h*, 2ν + ε_{k−1})))/ε_k )    (5)

2. The total number of label queries made by Algorithm 1 to the oracle O is at most:

    Õ( sup_{r≥ε} [(α(2ν + r, r/1024) + r)/(2ν + r)] · d(ν²/ε² + 1) + θ(2ν + ε) d'(ν/ε + 1) )    (6)
4.1 Discussion

The first terms in (5) and (6) represent the labels needed to learn the target classifier, and the second terms represent the overhead in learning the difference classifier.

In the realistic agnostic case (where ν > 0), as ε → 0, the second terms are lower order compared to the label complexity of DBAL. Thus even if d' is somewhat larger than d, fitting the difference classifier does not incur an asymptotically high overhead in the more realistic agnostic case. In the realizable case, when d' ≈ d, the second terms are of the same order as the first; therefore we should use a simpler difference hypothesis class H^df in this case. We believe that the lower order overhead term comes from the fact that there exists a classifier in H^df whose false negative error is very low.
Comparing Theorem 2 with the corresponding results for DBAL, we observe that instead of θ(2ν + ε), we have the term sup_{r≥ε} α(2ν + r, r/1024)/(2ν + r). Since sup_{r≥ε} α(2ν + r, r/1024)/(2ν + r) ≤ θ(2ν + ε), the worst case asymptotic label complexity is the same as that of standard DBAL. This label complexity may be considerably better, however, if sup_{r≥ε} α(2ν + r, r/1024)/(2ν + r) is less than the disagreement coefficient. As we expect, this will happen when the region of difference between W and O restricted to the disagreement regions is relatively small, and this region is well-modeled by the difference hypothesis class H^df.

An interesting case is when the weak labeler differs from O close to the decision boundary and agrees with O away from this boundary. In this case, any consistent algorithm should switch to querying O close to the decision boundary. Indeed, in earlier epochs, α is low, and our algorithm obtains a good difference classifier and achieves label savings. In later epochs, α is high, the difference classifiers always predict a difference, and the label complexity of the later epochs of our algorithm is of the same order as DBAL. In practice, if we suspect that we are in this case, we can switch to plain active learning once ε_k is small enough.
Case Study: Linear Classification under Uniform Distribution. We provide a simple example where our algorithm provides a better asymptotic label complexity than DBAL. Let H be the class of homogeneous linear separators on the d-dimensional unit ball and let H^df = {hΔh' : h, h' ∈ H}. Furthermore, let U be the uniform distribution over the unit ball.
Suppose that O is a deterministic labeler such that err_D(h*) = ν > 0. Moreover, suppose that W is such that there exists a difference classifier h^df with false negative error 0 for which P_U(h^df(x) = 1) ≤ g.

Figure 1: Linear classification over the unit ball with d = 2. Left: Decision boundary of labeler O and h* = h_{w*}; the region where O differs from h* is shaded, and has probability ν. Middle: Decision boundary of weak labeler W. Right: h^df, W and O; note that {x : P(y_O ≠ y_W|x) > 0} ⊆ {x : h^df(x) = 1}, and P({x : h^df(x) = 1}) = g = o(√d · ν).

Additionally, we assume that g = o(√d · ν); observe that this is not a strict assumption on H^df, as ν could be as much as a constant. Figure 1 shows an example in d = 2 that satisfies these assumptions. In this case, as ε → 0, Theorem 2 gives the following label complexity bound.
Corollary 1. With probability ≥ 1 − δ, the number of label queries made to oracle O by Algorithm 1 is

    Õ( d max(g/ν, 1)(ν²/ε² + 1) + d^{3/2}(1 + ν/ε) ),

where the Õ notation hides factors logarithmic in 1/ε and 1/δ.

As g = o(√d · ν), this improves over the label complexity of DBAL, which is Õ(d^{3/2}(1 + ν²/ε²)).
Learning with respect to Data labeled by both O and W. Finally, an interesting variant of our model is to measure error relative to data labeled by a mixture of O and W, say (1 − β)O + βW for some 0 < β < 1. Similar measures have been considered in the domain adaptation literature [5].

We can also analyze this case using simple modifications to our algorithm and analysis. The results are presented in Corollary 2, which suggests that the number of label queries to O in this case is roughly 1 − β times the label complexity in Theorem 2.

Let O' be the oracle which, on input x, queries O for its label with probability 1 − β and queries W with probability β. Let D' be the distribution P_{D'}(x, y) = P_U(x) P_{O'}(y|x), let h' = argmin_{h∈H} err_{D'}(h) be the classifier in H that minimizes error over D', and let ν' = err_{D'}(h') be its true error. Moreover, suppose that Assumption 1 holds with respect to oracles O' and W with α = α'(r, η) in (4). Then the modified Algorithm 1 that simulates O' by random queries to O and W has the following properties.
Corollary 2. With probability ≥ 1 − 2δ,

1. the classifier ĥ output by (the modified) Algorithm 1 satisfies: err_{D'}(ĥ) ≤ err_{D'}(h') + ε.

2. the total number of label queries made by this algorithm to O is at most:

    Õ( (1 − β)( sup_{r≥ε} [(α'(2ν' + r, r/1024) + r)/(2ν' + r)] · d(ν'²/ε² + 1) + θ(2ν' + ε) d'(ν'/ε + 1) ) )
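The modified algorithm of Corollary 2 only needs a simulated oracle O' that forwards each query to W with probability β and to O otherwise. A minimal sketch (names are ours):

```python
import random

def make_mixed_oracle(query_O, query_W, beta, rng=None):
    """Return a callable simulating O': on each query, answer with W's
    label with probability beta and with O's label with probability 1 - beta."""
    rng = rng or random.Random()

    def query_O_prime(x):
        return query_W(x) if rng.random() < beta else query_O(x)

    return query_O_prime
```

Running Algorithm 1 against query_O_prime, and counting only the queries forwarded to O, yields the (1 − β) scaling of the label complexity in Corollary 2.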
Conclusion. In this paper, we take a step towards a theoretical understanding of active learning from multiple annotators through a learning-theoretic formalization for learning from weak and strong labelers. Our work shows that multiple annotators can be successfully combined to do active learning in a statistically consistent manner under a general setting with few assumptions; moreover, under reasonable conditions, this kind of learning can provide label savings over plain active learning.
An avenue for future work is to explore a more general settingwhere we have multiple labelers withexpertise on different regions of the input space. Can we combine inputs from such labelers in a statisticallyconsistent manner? Second, our algorithm is intended for a setting whereW is biased, and performs sub-optimally when the label generated byW is a random corruption of the label provided byO. How can weaccount for both random noise and bias in active learning from weak and strong labelers?
Acknowledgements
We thank NSF under IIS 1162581 for research support and Jennifer Dy for introducing us to the problem of active learning from multiple labelers.
References

[1] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. J. Comput. Syst. Sci., 75(1):78-89, 2009.

[2] M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-concave distributions. In COLT, 2013.

[3] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Active learning with an ERM oracle, 2009.

[4] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In NIPS, 2010.

[5] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adaptation. In NIPS, pages 129-136, 2007.

[6] N. H. Bshouty and L. Burroughs. Maximizing agreements with one-sided error with applications to heuristic learning. Machine Learning, 59(1-2):99-123, 2005.

[7] D. A. Cohn, L. E. Atlas, and R. E. Ladner. Improving generalization with active learning. Machine Learning, 15(2), 1994.

[8] K. Crammer, M. J. Kearns, and J. Wortman. Learning from data of variable quality. In NIPS, pages 219-226, 2005.

[9] S. Dasgupta. Coarse sample complexity bounds for active learning. In NIPS, 2005.

[10] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In NIPS, 2007.

[11] O. Dekel, C. Gentile, and K. Sridharan. Selective sampling and active learning from single and multiple teachers. JMLR, 13:2655-2697, 2012.

[12] P. Donmez and J. Carbonell. Proactive learning: Cost-sensitive active learning with multiple imperfect oracles. In CIKM, 2008.

[13] M. Fang, X. Zhu, B. Li, W. Ding, and X. Wu. Self-taught active learning from crowds. In ICDM, pages 858-863. IEEE, 2012.

[14] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.

[15] D. Hsu. Algorithms for Active Learning. PhD thesis, UC San Diego, 2010.

[16] P. G. Ipeirotis, F. Provost, V. S. Sheng, and J. Wang. Repeated labeling using multiple noisy labelers. Data Mining and Knowledge Discovery, 28(2):402-441, 2014.

[17] A. T. Kalai, V. Kanade, and Y. Mansour. Reliable agnostic learning. J. Comput. Syst. Sci., 78(5):1481-1495, 2012.

[18] V. Kanade and J. Thaler. Distribution-independent reliable learning. In COLT, pages 3-24, 2014.

[19] C. H. Lin, Mausam, and D. S. Weld. Reactive learning: Actively trading off larger noisier training sets against smaller cleaner ones. In ICML Workshop on Crowdsourcing and Machine Learning and ICML Active Learning Workshop, 2015.

[20] C. H. Lin, Mausam, and D. S. Weld. To re(label), or not to re(label). In HCOMP, 2014.

[21] L. Malago, N. Cesa-Bianchi, and J. Renders. Online active learning with strong and weak annotators. In NIPS Workshop on Learning from the Wisdom of Crowds, 2014.

[22] R. D. Nowak. The geometry of generalized binary search. IEEE Transactions on Information Theory, 57(12):7893-7906, 2011.

[23] H. U. Simon. PAC-learning in the presence of one-sided classification noise. Ann. Math. Artif. Intell., 71(4):283-300, 2014.

[24] S. Song, K. Chaudhuri, and A. D. Sarwate. Learning from data with heterogeneous noise using SGD. In AISTATS, 2015.

[25] R. Urner, S. Ben-David, and O. Shamir. Learning from weak teachers. In AISTATS, pages 1252-1260, 2012.

[26] S. Vijayanarasimhan and K. Grauman. What's it going to cost you?: Predicting effort vs. informativeness for multi-label image annotations. In CVPR, pages 2262-2269, 2009.

[27] S. Vijayanarasimhan and K. Grauman. Cost-sensitive active visual category learning. IJCV, 91(1):24-44, 2011.

[28] P. Welinder, S. Branson, S. Belongie, and P. Perona. The multidimensional wisdom of crowds. In NIPS, pages 2424-2432, 2010.

[29] Y. Yan, R. Rosales, G. Fung, and J. G. Dy. Active learning from crowds. In ICML, pages 1161-1168, 2011.

[30] Y. Yan, R. Rosales, G. Fung, F. Farooq, B. Rao, and J. G. Dy. Active learning from multiple knowledge sources. In AISTATS, pages 1350-1357, 2012.

[31] C. Zhang and K. Chaudhuri. Beyond disagreement-based agnostic active learning. In NIPS, 2014.
A Notation
A.1 Basic Definitions and Notation
Here we briefly recap our notation. We assume that we are given a target hypothesis class H of VC dimension d, and a difference hypothesis class H^{df} of VC dimension d′.
We are given access to an unlabeled distribution U and two labeling oracles O and W. Querying O (resp. W) with an unlabeled data point x_i generates a label y_{i,O} (resp. y_{i,W}) drawn from the distribution P_O(y|x_i) (resp. P_W(y|x_i)). In general these two distributions are different. We use the notation D̃ to denote the joint distribution over examples and labels from O and W:

  P_{D̃}(x, y_O, y_W) = P_U(x)·P_O(y_O|x)·P_W(y_W|x)

Our goal in this paper is to learn a classifier in H which has low error with respect to the data distribution D defined by P_D(x, y) = P_U(x)·P_O(y|x), while using queries to W to reduce the number of queries made to O. We use y_O to denote the labels returned by O, and y_W to denote the labels returned by W.

The error of a classifier h under a labeled data distribution Q is defined as err_Q(h) = P_{(x,y)∼Q}(h(x) ≠ y); we use the notation err(h, S) to denote its empirical error on a labeled data set S. We use h* to denote the classifier with the lowest error under D. Define the excess error of h with respect to distribution D as err_D(h) − err_D(h*). For a set Z, we occasionally abuse notation and use Z to also denote the uniform distribution over the elements of Z.
Confidence Sets and Disagreement Region. Our active learning algorithm will maintain a (1−δ)-confidence set for h* throughout the algorithm. A set of classifiers V ⊆ H produced by a (possibly randomized) algorithm is said to be a (1−δ)-confidence set for h* if h* ∈ V with probability ≥ 1−δ; here the probability is over the randomness of the algorithm as well as the choice of all labeled and unlabeled examples drawn by it.

Given two classifiers h1 and h2, the disagreement between h1 and h2 under an unlabeled data distribution U, denoted by ρ_U(h1, h2), is P_{x∼U}(h1(x) ≠ h2(x)). Given an unlabeled dataset S, the empirical disagreement of h1 and h2 on S is denoted by ρ_S(h1, h2). Observe that the disagreements under U form a pseudometric over H. We use B_U(h, r) to denote a ball of radius r centered around h in this metric. The disagreement region of a set V of classifiers, denoted by DIS(V), is the set of all examples x ∈ X such that there exist two classifiers h1 and h2 in V for which h1(x) ≠ h2(x).
Disagreement Region. We denote the disagreement region of a disagreement ball of radius r centered around h* by

  ∆(r) := DIS(B_U(h*, r))   (7)
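To make these definitions concrete, both ρ_S and DIS(V) are directly computable for a finite set of classifiers and a finite sample. The sketch below uses a toy family of one-dimensional threshold classifiers, which is our own illustrative assumption:

```python
def disagreement(h1, h2, sample):
    # Empirical disagreement rho_S(h1, h2): fraction of sample where predictions differ.
    return sum(h1(x) != h2(x) for x in sample) / len(sample)

def disagreement_region(V, sample):
    # DIS(V): points on which some pair of classifiers in V disagrees.
    return [x for x in sample
            if any(h1(x) != h2(x) for h1 in V for h2 in V)]

# Toy example with 1-d threshold classifiers h_t(x) = sign(x - t).
def threshold(t):
    return lambda x: 1 if x >= t else -1

V = [threshold(0.3), threshold(0.5)]
xs = [i / 10 for i in range(10)]        # 0.0, 0.1, ..., 0.9
region = disagreement_region(V, xs)     # points in [0.3, 0.5)
rho = disagreement(V[0], V[1], xs)
```

Here DIS(V) is exactly the set of sample points lying between the two thresholds, and ρ_S is the empirical mass of that set.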
Concentration Inequalities. SupposeZ is a dataset consisting ofn iid samples from a distributionD. Wewill use the following result, which is obtained from a standard application of the normalized VC inequality.With probability 1āĪ“ over the random draw ofZ, for all h,hā² āH ,
|(err(h,Z)āerr(hā²,Z))ā (errD(h)āerrD(hā²))|
ā¤ min(ā
Ļ(n,Ī“ )ĻZ(h,hā²)+Ļ(n,Ī“ ),ā
Ļ(n,Ī“ )ĻD(h,hā²)+Ļ(n,Ī“ ))
(8)
|(err(h,Z)āerrD(h)|ā¤ min
(ā
Ļ(n,Ī“ )err(h,Z)+Ļ(n,Ī“ ),ā
Ļ(n,Ī“ )errD(h)+Ļ(n,Ī“ ))
(9)
14
whered is the VC dimension ofH and the notationĻ(n,Ī“ ) is defined as:
Ļ(n,Ī“ ) =8n(2d ln
2end
+ ln24Ī“) (10)
Equation (8) loosely implies the following equation:
|(err(h,Z)āerr(hā²,Z))ā (errD(h)āerrD(hā²))| ā¤
ā
4Ļ(n,Ī“ ) (11)
The following is a consequence of standard Chernoff bounds. Let X1, …, Xn be iid Bernoulli random variables with mean p. If p̂ = Σ_i X_i / n, then with probability 1−δ,

  |p̂ − p| ≤ min( √(p̂·γ(n,δ)) + γ(n,δ), √(p·γ(n,δ)) + γ(n,δ) )   (12)

where the notation γ(n,δ) is defined as:

  γ(n,δ) = (4/n)·ln(2/δ)   (13)

Equation (12) loosely implies the following equation:

  |p̂ − p| ≤ √(4γ(n,δ))   (14)
Using the notation we just introduced, we can rephrase Assumption 1 as follows. For any r, η > 0, there exists an h^{df}_{r,η} ∈ H^{df} with the following properties:

  P_{D̃}(h^{df}_{r,η}(x) = −1, x ∈ ∆(r), y_O ≠ y_W) ≤ η

  P_{D̃}(h^{df}_{r,η}(x) = +1, x ∈ ∆(r)) ≤ α(r, η)
We end with a useful fact about σ(n,δ).

Fact 1. The minimum n such that σ(n, δ/(log n·(log n + 1))) ≤ ε is at most

  (64/ε)·(d·ln(512/ε) + ln(24/δ))
A.2 Adaptive Procedure for Estimating Probability Mass
For completeness, we describe in Algorithm 5 a standard doubling procedure for estimating the bias of a coin within a constant factor. This procedure is used by Algorithm 2 to estimate the probability mass of the disagreement region of the current confidence set based on unlabeled examples drawn from U.
Lemma 1. Suppose p > 0 and Algorithm 5 is run with failure probability δ. Then with probability 1−δ: (1) the output p̂ is such that p̂ ≤ p ≤ 2p̂; (2) the total number of calls to O is at most O((1/p²)·ln(1/(δp))).
Proof. Consider the event

  E = { for all i ∈ N: |p̂_i − p| ≤ √( 4·ln(2·2^i/δ) / 2^i ) }
Algorithm 5 Adaptive Procedure for Estimating the Bias of a Coin
1: Input: failure probability δ, an oracle O which returns iid Bernoulli random variables with unknown bias p.
2: Output: p̂, an estimate of bias p such that p̂ ≤ p ≤ 2p̂ with probability ≥ 1−δ.
3: for i = 1, 2, … do
4:   Call the oracle O 2^i times to get empirical frequency p̂_i.
5:   if √( 4·ln(4·2^i/δ) / 2^i ) ≤ p̂_i/3 then return p̂ = 2p̂_i/3
6:   end if
7: end for
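A direct transcription of Algorithm 5, with the oracle O modeled as a Python callable returning coin flips; the simulated bias p = 0.4 and the failure probability are illustrative choices of ours:

```python
import math
import random

def estimate_bias(delta, flip):
    """Doubling procedure of Algorithm 5: double the sample size until the
    deviation term is at most a third of the empirical frequency, then
    return a constant-factor underestimate of the bias."""
    i = 1
    while True:
        n = 2 ** i
        p_i = sum(flip() for _ in range(n)) / n
        if math.sqrt(4 * math.log(4 * n / delta) / n) <= p_i / 3:
            return 2 * p_i / 3
        i += 1

rng = random.Random(1)
p = 0.4                                   # true (unknown) bias of the coin
p_hat = estimate_bias(0.05, lambda: rng.random() < p)
```

By Lemma 1, with probability ≥ 1−δ the returned estimate satisfies p̂ ≤ p ≤ 2p̂, after O((1/p²)·ln(1/(δp))) flips.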
By Equation (14) and a union bound, P(E) ≥ 1 − δ. On event E, we claim that if i is large enough that

  4·√( 4·ln(4·2^i/δ) / 2^i ) ≤ p   (15)

then the condition in line 5 will be met. Indeed, this implies

  √( 4·ln(4·2^i/δ) / 2^i ) ≤ ( p − √( 4·ln(4·2^i/δ) / 2^i ) ) / 3 ≤ p̂_i / 3

Define i0 as the smallest number i such that Equation (15) is true. Then by algebra, 2^{i0} = O( (1/p²)·ln(1/(δp)) ). Hence the number of calls to oracle O is at most 1 + 2 + … + 2^{i0} = O( (1/p²)·ln(1/(δp)) ).

Consider the smallest i* such that the condition in line 5 is met. We have that

  √( 4·ln(4·2^{i*}/δ) / 2^{i*} ) ≤ p̂_{i*} / 3

By the definition of E, |p − p̂_{i*}| ≤ p̂_{i*}/3, that is, 2p̂_{i*}/3 ≤ p ≤ 4p̂_{i*}/3, implying p̂ ≤ p ≤ 2p̂.
A.3 Notations on Datasets
Without loss of generality, assume the examples drawn throughout Algorithm 1 have distinct feature values x, since this happens with probability 1 under mild assumptions.

Algorithm 1 uses a mixture of three kinds of labeled data to learn a target classifier: labels obtained from querying O, labels inferred by the algorithm, and labels obtained from querying W. To analyze the effect of these three kinds of labeled data, we need to introduce some notation.

Recall that we define the joint distribution D̃ over examples and labels from both O and W as follows:

  P_{D̃}(x, y_O, y_W) = P_U(x)·P_O(y_O|x)·P_W(y_W|x)

where, given an example x, the labels generated by O and W are conditionally independent.
A dataset S with empirical error minimizer ĥ and a rejection threshold τ define an implicit confidence set for h* as follows:

  V(S, τ) = { h : err(h, S) − err(ĥ, S) ≤ τ }

At the beginning of epoch k, we have Ŝ_{k−1}. ĥ_{k−1} is defined as the empirical error minimizer on Ŝ_{k−1}. The disagreement region of the implicit confidence set at epoch k, R_{k−1}, is defined as R_{k−1} := DIS(V(Ŝ_{k−1}, 3εk/2)). Algorithm 4, in_disagr_region(Ŝ_{k−1}, 3εk/2, x), provides a test deciding whether an unlabeled example x is inside R_{k−1} at epoch k. (See Lemma 6.)
Define A_k to be the distribution D̃ conditioned on the set {(x, y_O, y_W) : x ∈ R_{k−1}}. At epoch k, Algorithm 2 has inputs distribution U, oracles W and O, target false negative error ε = εk/128, hypothesis class H^{df}, confidence δ = δk/2, and previous labeled dataset Ŝ_{k−1}, and it outputs a difference classifier ĥ^{df}_k. By the setting of m in Equation (1), Algorithm 2 first computes p̂_k using unlabeled examples drawn from U, which is an estimator of P_{D̃}(x ∈ R_{k−1}). Then it draws a subsample of size

  m_{k,1} = (64·1024·p̂_k / εk)·( d′·ln(512·1024·p̂_k / εk) + ln(144/δk) )   (16)

iid from A_k. We call the resulting dataset A′_k.
At epoch k, Algorithm 3 performs adaptive subsampling to refine the implicit (1−δ)-confidence set. At each round t, it subsamples U to get an unlabeled dataset S^{t,U}_k of size 2^t. Define the corresponding (hypothetical) dataset with labels queried from both W and O as S̃^t_k. The (hypothetical) dataset with labels queried from O alone, S̄^t_k, is defined as:

  S̄^t_k = { (x, y_O) : (x, y_O, y_W) ∈ S̃^t_k }
In addition to obtaining labels from O, the algorithm obtains labels in two other ways. First, if x ∈ X \ R_{k−1}, then its label is safely inferred, and with high probability this inferred label ĥ_{k−1}(x) is equal to h*(x). Second, if x lies in R_{k−1} but the difference classifier ĥ^{df}_k predicts agreement between O and W, then its label is obtained by querying W. The actual dataset Ŝ^t_k generated by Algorithm 3 is defined as:

  Ŝ^t_k = { (x, ĥ_{k−1}(x)) : (x, y_O, y_W) ∈ S̃^t_k, x ∉ R_{k−1} }
        ∪ { (x, y_O) : (x, y_O, y_W) ∈ S̃^t_k, x ∈ R_{k−1}, ĥ^{df}_k(x) = +1 }
        ∪ { (x, y_W) : (x, y_O, y_W) ∈ S̃^t_k, x ∈ R_{k−1}, ĥ^{df}_k(x) = −1 }
We use D_k to denote the corresponding labeled data distribution:

  P_{D_k}(x, y) = P_U(x)·P_{Q_k}(y|x)

  P_{Q_k}(y|x) = I(ĥ_{k−1}(x) = y)  if x ∉ R_{k−1}
               = P_O(y|x)            if x ∈ R_{k−1}, ĥ^{df}_k(x) = +1
               = P_W(y|x)            if x ∈ R_{k−1}, ĥ^{df}_k(x) = −1

Therefore, Ŝ^t_k can be seen as a sample of size 2^t drawn iid from D_k.
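The three-way labeling rule above can be sketched as a small routine; the membership test, classifiers, and oracles here are illustrative stand-ins for the quantities maintained by Algorithm 3:

```python
def label_example(x, in_disagr_region, h_prev, h_df, query_O, query_W):
    """Pick the label source for one example, mirroring the construction of
    the actual dataset: infer the label outside the disagreement region, use
    O where the difference classifier predicts O and W disagree, and use W
    where it predicts agreement."""
    if not in_disagr_region(x):
        return x, h_prev(x)          # inferred label h_{k-1}(x)
    if h_df(x) == +1:
        return x, query_O(x)         # predicted disagreement: pay for O
    return x, query_W(x)             # predicted agreement: cheap W label

# Toy check: region is [0, 1), difference classifier flags x > 0.5.
const = lambda v: (lambda x: v)
labeled = [label_example(x,
                         in_disagr_region=lambda x: 0 <= x < 1,
                         h_prev=const('inferred'),
                         h_df=lambda x: +1 if x > 0.5 else -1,
                         query_O=const('O'), query_W=const('W'))
           for x in [-0.5, 0.2, 0.8]]
```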
Observe that ĥ^t_k is obtained by training an ERM classifier over Ŝ^t_k, and δ^t_k = δk/(2t(t+1)).

Suppose Algorithm 3 stops at iteration t0(k). Then the final dataset returned is Ŝk = Ŝ^{t0(k)}_k, with a total number of m_{k,2} label requests to O. We define S̄k = S̄^{t0(k)}_k, S̃k = S̃^{t0(k)}_k, and σk = σ(2^{t0(k)}, δ^{t0(k)}_k).
For k = 0, we define the notation differently. Ŝ0 is the dataset drawn iid at random from D, with all labels queried from O. For notational convenience, define S̄0 = Ŝ0. σ0 is defined as σ0 = σ(n0, δ0), where σ(·,·) is defined by Equation (10) and n0 is defined as:

  n0 = (64·1024²)·(2d·ln(512·1024²) + ln(96/δ))

Recall that ĥk = argmin_{h∈H} err(h, Ŝk) is the empirical error minimizer with respect to the dataset Ŝk. Note that the empirical distance ρ_Z(·,·) does not depend on the labels in the dataset Z; therefore ρ_{Ŝk}(h, h′) = ρ_{S̄k}(h, h′), and we will use them interchangeably throughout.
Table 1: Summary of Notations.

Notation | Explanation | Samples drawn from
D̃ | Joint distribution of (x, y_O, y_W) | -
D | Joint distribution of (x, y_O) | -
U | Marginal distribution of x | -
O | Conditional distribution of y_O given x | -
W | Conditional distribution of y_W given x | -
R_{k−1} | Disagreement region at epoch k | -
A_k | Conditional distribution of (x, y_O, y_W) given x ∈ R_{k−1} | -
A′_k | Dataset used to train difference classifier at epoch k | A_k
h^{df}_k | Difference classifier h^{df}_{2ν+ε_{k−1}, εk/512}, where h^{df}_{r,η} is defined in Assumption 1 | -
ĥ^{df}_k | Difference classifier returned by Algorithm 2 at epoch k | -
S^{t,U}_k | Unlabeled dataset drawn at iteration t of Algorithm 3 at epoch k ≥ 1 | U
S̃^t_k | S^{t,U}_k augmented by labels from both O and W | D̃
S̄^t_k | { (x, y_O) : (x, y_O, y_W) ∈ S̃^t_k } | D
Ŝ^t_k | Labeled dataset produced at iteration t of Algorithm 3 at epoch k ≥ 1 | D_k
D_k | Distribution of Ŝ^t_k for k ≥ 1 and any t. Has marginal U over X; the conditional distribution of y given x is I(ĥ_{k−1}(x) = y) if x ∉ R_{k−1}, W if x ∈ R_{k−1} and ĥ^{df}_k(x) = −1, and O otherwise | -
t0(k) | Number of iterations of Algorithm 3 at epoch k ≥ 1 | -
Ŝ0 | Initial dataset drawn by Algorithm 1 | D
Ŝk | Dataset finally returned by Algorithm 3 at epoch k ≥ 1; equal to Ŝ^{t0(k)}_k | D_k
S̄k | Dataset obtained by replacing all labels in Ŝk by labels drawn from O; equal to S̄^{t0(k)}_k | D
S̃k | Equal to S̃^{t0(k)}_k | D̃
ĥk | Empirical error minimizer on Ŝk | -
A.4 Events
Recall that δk = δ/(4(k+1)²) and εk = 2^{−k}. Define

  h^{df}_k = h^{df}_{2ν+ε_{k−1}, εk/512}

where the notation h^{df}_{r,η} is introduced in Assumption 1.
We begin by defining some events that we will condition on later in the proof, and showing that these events occur with high probability.
Define event

E1_k := { P_{D̃}(x ∈ R_{k−1})/2 ≤ p̂_k ≤ P_{D̃}(x ∈ R_{k−1}), and for all h^{df} ∈ H^{df}:

  |P_{A′_k}(h^{df}(x) = −1, y_O ≠ y_W) − P_{A_k}(h^{df}(x) = −1, y_O ≠ y_W)|
  ≤ εk/(1024·P_{D̃}(x ∈ R_{k−1})) + √( min(P_{A_k}(h^{df}(x) = −1, y_O ≠ y_W), P_{A′_k}(h^{df}(x) = −1, y_O ≠ y_W))·εk/(1024·P_{D̃}(x ∈ R_{k−1})) ),

  and |P_{A′_k}(h^{df}(x) = +1) − P_{A_k}(h^{df}(x) = +1)|
  ≤ √( min(P_{A_k}(h^{df}(x) = +1), P_{A′_k}(h^{df}(x) = +1))·εk/(1024·P_{D̃}(x ∈ R_{k−1})) ) + εk/(1024·P_{D̃}(x ∈ R_{k−1})) }
Fact 2. P(E1_k) ≥ 1 − δk/2.
Define event

E2_k = { For all t ∈ N and all h, h′ ∈ H:

  |(err(h, S̄^t_k) − err(h′, S̄^t_k)) − (err_D(h) − err_D(h′))| ≤ σ(2^t, δ^t_k) + √( σ(2^t, δ^t_k)·ρ_{S̄^t_k}(h, h′) ),

  and err(h, Ŝ^t_k) − err_{D_k}(h) ≤ σ(2^t, δ^t_k) + √( σ(2^t, δ^t_k)·err_{D_k}(h) ),

  and P_{S̃^t_k}(ĥ^{df}_k(x) = −1, y_O ≠ y_W, x ∈ R_{k−1}) − P_{D̃}(ĥ^{df}_k(x) = −1, y_O ≠ y_W, x ∈ R_{k−1})
  ≤ √( γ(2^t, δ^t_k)·P_{S̃^t_k}(ĥ^{df}_k(x) = −1, y_O ≠ y_W, x ∈ R_{k−1}) ) + γ(2^t, δ^t_k),

  and P_{S̃^t_k}(ĥ^{df}_k(x) = −1, x ∈ R_{k−1}) ≤ 2·( P_{D̃}(ĥ^{df}_k(x) = −1, x ∈ R_{k−1}) + γ(2^t, δ^t_k) ) }
Fact 3. P(E2_k) ≥ 1 − δk/2.
We will also use the following definitions of events in our proof. Define event F0 as

  F0 = { for all h, h′ ∈ H: |(err(h, S̄0) − err(h′, S̄0)) − (err_D(h) − err_D(h′))| ≤ σ(n0, δ0) + √( σ(n0, δ0)·ρ_{S̄0}(h, h′) ) }

For k ∈ {1, 2, …, k0}, event Fk is defined inductively as

  Fk = F_{k−1} ∩ (E1_k ∩ E2_k)

Fact 4. For k ∈ {0, 1, …, k0}, P(Fk) ≥ 1 − δ0 − δ1 − … − δk. Specifically, P(F_{k0}) ≥ 1 − δ.
The proofs of Facts 2, 3 and 4 are provided in Appendix E.
B Proof Outline and Main Lemmas
The main idea of the proof is to maintain the following three invariants on the outputs of Algorithm 1 in each epoch. We prove that these invariants hold simultaneously for each epoch with high probability by induction over the epochs. Throughout, for k ≥ 1, the end of epoch k refers to the end of execution of line 13 of Algorithm 1 at iteration k. The end of epoch 0 refers to the end of execution of line 5 in Algorithm 1.
Invariant 1 states that if we replace the inferred labels and labels obtained from W in Ŝk by those obtained from O (thus getting the dataset S̄k), then the excess errors of classifiers in H will not decrease by much.

Invariant 1 (Approximate Favorable Bias). Let h be any classifier in H, and let h′ be another classifier in H with excess error on D no greater than εk. Then, at the end of epoch k, we have:

  err(h, S̄k) − err(h′, S̄k) ≤ err(h, Ŝk) − err(h′, Ŝk) + εk/16
Invariant 2 establishes that in epoch k, Algorithm 3 selects enough examples to ensure concentration of the empirical errors of classifiers in H on S̄k around their true errors.

Invariant 2 (Concentration). At the end of epoch k, Ŝk, S̄k, and σk are such that:

1. For any pair of classifiers h, h′ ∈ H, it holds that:

  |(err(h, S̄k) − err(h′, S̄k)) − (err_D(h) − err_D(h′))| ≤ σk + √( σk·ρ_{S̄k}(h, h′) )   (17)

2. The dataset Ŝk has the following property:

  σk + √( σk·err(ĥk, Ŝk) ) ≤ εk/512   (18)
Finally, Invariant 3 ensures that the difference classifier produced in epoch k has low false negative error on the disagreement region of the (1−δ)-confidence set at epoch k.

Invariant 3 (Difference Classifier). At epoch k, the difference classifier output by Algorithm 2 is such that

  P_{D̃}(ĥ^{df}_k(x) = −1, y_O ≠ y_W, x ∈ R_{k−1}) ≤ εk/64   (19)

  P_{D̃}(ĥ^{df}_k(x) = +1, x ∈ R_{k−1}) ≤ 6·(α(2ν+ε_{k−1}, εk/512) + εk/1024)   (20)
We will show the following property about the three invariants. Its proof is deferred to Subsection B.4.

Lemma 2. There is a numerical constant c0 > 0 such that the following holds. The collection of events {Fk}_{k=0}^{k0} is such that for k ∈ {0, 1, …, k0}:
(1) If k = 0, then on event Fk, at epoch k,
  (1.1) Invariants 1 and 2 hold.
  (1.2) The number of label requests to O is at most m0 ≤ c0·(d + ln(1/δ)).
(2) If k ≥ 1, then on event Fk, at epoch k,
  (2.1) Invariants 1, 2, and 3 hold.
  (2.2) The number of label requests to O is at most

  mk ≤ c0·( [(α(2ν+ε_{k−1}, εk/1024) + εk)·(ν + εk)/εk²]·d·(ln²(1/εk) + ln²(1/δk)) + [P_U(x ∈ ∆(2ν+ε_{k−1}))/εk]·(d′·ln(1/εk) + ln(1/δk)) )
B.1 Active Label Inference and Identifying the Disagreement Region
We begin by proving some lemmas about Algorithm 4, which identifies whether an example lies in the disagreement region of the current confidence set. This is done by using a constrained ERM oracle CONS-LEARN_H(·, ·), using ideas similar to [10, 15, 3, 4].
Lemma 3. When given as input a dataset S, a threshold τ > 0, and an unlabeled example x, Algorithm 4 in_disagr_region returns 1 if and only if x lies inside DIS(V(S, τ)).

Proof. (⇒) If Algorithm 4 returns 1, then we have found a classifier h′_x such that (1) h′_x(x) = −ĥ(x), and (2) err(h′_x, S) − err(ĥ, S) ≤ τ, i.e. h′_x ∈ V(S, τ). Therefore, x is in DIS(V(S, τ)).
(⇐) If x is in DIS(V(S, τ)), then there exists a classifier h ∈ H such that (1) h(x) = −ĥ(x) and (2) err(h, S) − err(ĥ, S) ≤ τ. Since h′_x minimizes empirical error among classifiers satisfying constraint (1), err(h′_x, S) − err(ĥ, S) ≤ τ. Thus, Algorithm 4 returns 1.
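When H is a finite pool, the constrained ERM oracle can be emulated by brute force and Lemma 3's test becomes a few lines of code. The finite pool of one-dimensional thresholds below is our own illustrative assumption, not the paper's oracle model:

```python
def err(h, S):
    # Empirical error of h on labeled set S = [(x, y), ...].
    return sum(h(x) != y for x, y in S) / len(S)

def in_disagr_region(S, H, tau, x):
    """Return 1 iff x lies in DIS(V(S, tau)), where
    V(S, tau) = {h in H : err(h, S) - err(h_hat, S) <= tau}."""
    h_hat = min(H, key=lambda h: err(h, S))           # unconstrained ERM
    flipped = [h for h in H if h(x) == -h_hat(x)]     # constraint h(x) = -h_hat(x)
    if not flipped:
        return 0
    h_x = min(flipped, key=lambda h: err(h, S))       # constrained ERM
    return 1 if err(h_x, S) - err(h_hat, S) <= tau else 0

# Toy pool of 1-d thresholds; labels come from threshold 0.35.
def threshold(t):
    return lambda x: 1 if x >= t else -1

H = [threshold(t / 10) for t in range(11)]
S = [(i / 20, 1 if i / 20 >= 0.35 else -1) for i in range(21)]
```

A point between two near-optimal thresholds (e.g. x = 0.32) is flagged as inside the disagreement region, while a point far from the decision boundary (e.g. x = 0.9) is not.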
We now provide some lemmas about the behavior of Algorithm 4 called at epoch k.
Lemma 4. Suppose Invariants 1 and 2 hold at the end of epoch k−1. If h ∈ H is such that err_D(h) ≤ err_D(h*) + ε_{k−1}/2, then

  err(h, Ŝ_{k−1}) − err(ĥ_{k−1}, Ŝ_{k−1}) ≤ 3ε_{k−1}/4
Proof. If h ∈ H has excess error at most ε_{k−1}/2 with respect to D, then

  err(h, Ŝ_{k−1}) − err(ĥ_{k−1}, Ŝ_{k−1})
  ≤ err(h, S̄_{k−1}) − err(ĥ_{k−1}, S̄_{k−1}) + ε_{k−1}/16
  ≤ err_D(h) − err_D(ĥ_{k−1}) + σ_{k−1} + √( σ_{k−1}·ρ_{S̄_{k−1}}(h, ĥ_{k−1}) ) + ε_{k−1}/16
  ≤ ε_{k−1}/2 + σ_{k−1} + √( σ_{k−1}·ρ_{S̄_{k−1}}(h, ĥ_{k−1}) ) + ε_{k−1}/16
  ≤ 9ε_{k−1}/16 + σ_{k−1} + √( σ_{k−1}·err(h, Ŝ_{k−1}) ) + √( σ_{k−1}·err(ĥ_{k−1}, Ŝ_{k−1}) )
  ≤ 9ε_{k−1}/16 + σ_{k−1} + √( σ_{k−1}·err(h, Ŝ_{k−1}) ) + √( σ_{k−1}·(err(ĥ_{k−1}, Ŝ_{k−1}) + 9ε_{k−1}/16) )

where the first inequality follows from Invariant 1, the second from Equation (17) of Invariant 2, the third from the assumption that h has excess error at most ε_{k−1}/2, the fourth from the triangle inequality (ρ_{S̄_{k−1}} = ρ_{Ŝ_{k−1}} ≤ err(h, Ŝ_{k−1}) + err(ĥ_{k−1}, Ŝ_{k−1})) together with √(A+B) ≤ √A + √B, and the fifth by adding a nonnegative number in the last term. Continuing,

  err(h, Ŝ_{k−1}) − err(ĥ_{k−1}, Ŝ_{k−1})
  ≤ 9ε_{k−1}/16 + 4σ_{k−1} + 2√( σ_{k−1}·(err(ĥ_{k−1}, Ŝ_{k−1}) + 9ε_{k−1}/16) )
  ≤ 9ε_{k−1}/16 + 4σ_{k−1} + 2√( σ_{k−1}·err(ĥ_{k−1}, Ŝ_{k−1}) ) + 2√( (ε_{k−1}/512)·(9ε_{k−1}/16) )
  ≤ 9ε_{k−1}/16 + ε_{k−1}/32 + 2√( (ε_{k−1}/512)·(9ε_{k−1}/16) )
  ≤ 3ε_{k−1}/4

where the first inequality is by simple algebra (letting D = err(h, Ŝ_{k−1}), E = err(ĥ_{k−1}, Ŝ_{k−1}) + 9ε_{k−1}/16, F = σ_{k−1} in: D ≤ E + F + √(DF) + √(EF) implies D ≤ E + 4F + 2√(EF)); the second is from √(A+B) ≤ √A + √B and σ_{k−1} ≤ ε_{k−1}/512, which uses Equation (18) of Invariant 2; the third is again by Equation (18) of Invariant 2; and the fourth is by algebra.
Lemma 5. Suppose Invariants 1 and 2 hold at the end of epoch k−1. Then

  err_D(ĥ_{k−1}) − err_D(h*) ≤ ε_{k−1}/8
Proof. By Lemma 4, since h* has excess error 0 with respect to D,

  err(h*, Ŝ_{k−1}) − err(ĥ_{k−1}, Ŝ_{k−1}) ≤ 3ε_{k−1}/4   (21)

Therefore,

  err_D(ĥ_{k−1}) − err_D(h*)
  ≤ err(ĥ_{k−1}, S̄_{k−1}) − err(h*, S̄_{k−1}) + σ_{k−1} + √( σ_{k−1}·ρ_{S̄_{k−1}}(ĥ_{k−1}, h*) )
  ≤ err(ĥ_{k−1}, Ŝ_{k−1}) − err(h*, Ŝ_{k−1}) + σ_{k−1} + √( σ_{k−1}·ρ_{S̄_{k−1}}(ĥ_{k−1}, h*) ) + ε_{k−1}/16
  ≤ ε_{k−1}/16 + σ_{k−1} + √( σ_{k−1}·(err(ĥ_{k−1}, Ŝ_{k−1}) + err(h*, Ŝ_{k−1})) )
  ≤ ε_{k−1}/16 + σ_{k−1} + √( σ_{k−1}·(2·err(ĥ_{k−1}, Ŝ_{k−1}) + 3ε_{k−1}/4) )
  ≤ ε_{k−1}/16 + σ_{k−1} + √( 2σ_{k−1}·err(ĥ_{k−1}, Ŝ_{k−1}) ) + √( (ε_{k−1}/512)·(3ε_{k−1}/4) )
  ≤ ε_{k−1}/8

where the first inequality is from Equation (17) of Invariant 2; the second uses Invariant 1 (with h* in the role of h′); the third follows from the optimality of ĥ_{k−1} on Ŝ_{k−1} and the triangle inequality; the fourth uses Equation (21); the fifth uses the fact that √(A+B) ≤ √A + √B and σ_{k−1} ≤ ε_{k−1}/512, which is from Equation (18) of Invariant 2; and the last inequality again uses Equation (18) of Invariant 2.
Lemma 6. Suppose Invariants 1, 2, and 3 hold in epoch k−1 conditioned on event F_{k−1}. Then conditioned on event F_{k−1}, the implicit confidence set V_{k−1} = V(Ŝ_{k−1}, 3εk/2) is such that:
(1) If h ∈ H satisfies err_D(h) − err_D(h*) ≤ εk, then h is in V_{k−1}.
(2) If h ∈ H is in V_{k−1}, then err_D(h) − err_D(h*) ≤ ε_{k−1}. Hence V_{k−1} ⊆ B_U(h*, 2ν+ε_{k−1}).
(3) Algorithm 4, in_disagr_region, when run on inputs dataset Ŝ_{k−1}, threshold 3εk/2, and unlabeled example x, returns 1 if and only if x is in R_{k−1}.
Proof. (1) Let h be a classifier with err_D(h) − err_D(h*) ≤ εk = ε_{k−1}/2. Then by Lemma 4, err(h, Ŝ_{k−1}) − err(ĥ_{k−1}, Ŝ_{k−1}) ≤ 3ε_{k−1}/4 = 3εk/2. Hence h is in V_{k−1}.
(2) Fix any h in V_{k−1}. By definition of V_{k−1},

  err(h, Ŝ_{k−1}) − err(ĥ_{k−1}, Ŝ_{k−1}) ≤ 3εk/2 = 3ε_{k−1}/4   (22)

Recall that from Lemma 5, err_D(ĥ_{k−1}) − err_D(h*) ≤ ε_{k−1}/8. Thus, for the classifier h, applying Invariant 1 with h′ := ĥ_{k−1}, we get

  err(h, S̄_{k−1}) − err(ĥ_{k−1}, S̄_{k−1}) ≤ err(h, Ŝ_{k−1}) − err(ĥ_{k−1}, Ŝ_{k−1}) + ε_{k−1}/16   (23)

Therefore,

  err_D(h) − err_D(ĥ_{k−1})
  ≤ err(h, S̄_{k−1}) − err(ĥ_{k−1}, S̄_{k−1}) + σ_{k−1} + √( σ_{k−1}·ρ_{S̄_{k−1}}(h, ĥ_{k−1}) )
  ≤ err(h, S̄_{k−1}) − err(ĥ_{k−1}, S̄_{k−1}) + σ_{k−1} + √( σ_{k−1}·(err(h, Ŝ_{k−1}) + err(ĥ_{k−1}, Ŝ_{k−1})) )
  ≤ err(h, Ŝ_{k−1}) − err(ĥ_{k−1}, Ŝ_{k−1}) + σ_{k−1} + √( σ_{k−1}·(err(h, Ŝ_{k−1}) + err(ĥ_{k−1}, Ŝ_{k−1})) ) + ε_{k−1}/16
  ≤ 13ε_{k−1}/16 + σ_{k−1} + √( σ_{k−1}·(2·err(ĥ_{k−1}, Ŝ_{k−1}) + 3ε_{k−1}/4) )
  ≤ 13ε_{k−1}/16 + σ_{k−1} + √( 2σ_{k−1}·err(ĥ_{k−1}, Ŝ_{k−1}) ) + √( (ε_{k−1}/512)·(3ε_{k−1}/4) )
  ≤ 7ε_{k−1}/8

where the first inequality is from Equation (17) of Invariant 2; the second uses the fact that ρ_{S̄_{k−1}}(h, h′) = ρ_{Ŝ_{k−1}}(h, h′) ≤ err(h, Ŝ_{k−1}) + err(h′, Ŝ_{k−1}) for h, h′ ∈ H; the third uses Equation (23); the fourth is from Equation (22); the fifth is from the fact that √(A+B) ≤ √A + √B and σ_{k−1} ≤ ε_{k−1}/512, which is from Equation (18) of Invariant 2; and the last inequality again follows from Equation (18) of Invariant 2 and algebra.
In conjunction with the fact that err_D(ĥ_{k−1}) − err_D(h*) ≤ ε_{k−1}/8, this implies

  err_D(h) − err_D(h*) ≤ ε_{k−1}

By the triangle inequality, ρ_U(h, h*) ≤ 2ν + ε_{k−1}, hence h ∈ B_U(h*, 2ν+ε_{k−1}). In summary, V_{k−1} ⊆ B_U(h*, 2ν+ε_{k−1}).
(3) Follows directly from Lemma 3 and the fact that R_{k−1} = DIS(V_{k−1}).
B.2 Training the Difference Classifier
Recall that ∆(r) = DIS(B_U(h*, r)) is the disagreement region of the disagreement ball of radius r centered around h*.
Lemma 7 (Difference Classifier Invariant). There is a numerical constant c1 > 0 such that the following holds. Suppose that Invariants 1 and 2 hold at the end of epoch k−1 conditioned on event F_{k−1}, and that Algorithm 2 has inputs unlabeled data distribution U, oracle O, ε = εk/128, hypothesis class H^{df}, δ = δk/2, and previous labeled dataset Ŝ_{k−1}. Then conditioned on event Fk:
(1) ĥ^{df}_k, the output of Algorithm 2, maintains Invariant 3.
(2) (Label Complexity: Part 1.) The number of label queries made to O is at most

  m_{k,1} ≤ c1·( P_U(x ∈ ∆(2ν+ε_{k−1}))/εk )·( d′·ln(1/εk) + ln(1/δk) )

Proof. (1) Recall that Fk = F_{k−1} ∩ E1_k ∩ E2_k, where E1_k, E2_k are defined in Subsection A.4. Suppose event Fk happens.
Proof of Equation (19). Recall that ĥ^{df}_k is the optimal solution of optimization problem (2). By feasibility, and the fact that on event E1_k we have 2p̂_k ≥ P_{D̃}(x ∈ R_{k−1}),

  P_{A′_k}(ĥ^{df}_k(x) = −1, y_O ≠ y_W) ≤ εk/(256·p̂_k) ≤ εk/(128·P_{D̃}(x ∈ R_{k−1}))

By definition of event E1_k, this implies

  P_{A_k}(ĥ^{df}_k(x) = −1, y_O ≠ y_W)
  ≤ P_{A′_k}(ĥ^{df}_k(x) = −1, y_O ≠ y_W) + √( P_{A′_k}(ĥ^{df}_k(x) = −1, y_O ≠ y_W)·εk/(1024·P_{D̃}(x ∈ R_{k−1})) ) + εk/(1024·P_{D̃}(x ∈ R_{k−1}))
  ≤ εk/(64·P_{D̃}(x ∈ R_{k−1}))

indicating

  P_{D̃}(ĥ^{df}_k(x) = −1, y_O ≠ y_W, x ∈ R_{k−1}) ≤ εk/64
Proof of Equation (20). By definition of h^{df}_k in Subsection A.4, h^{df}_k is such that:

  P_{D̃}(h^{df}_k(x) = +1, x ∈ ∆(2ν+ε_{k−1})) ≤ α(2ν+ε_{k−1}, εk/512)

  P_{D̃}(h^{df}_k(x) = −1, y_O ≠ y_W, x ∈ ∆(2ν+ε_{k−1})) ≤ εk/512

By item (2) of Lemma 6, we have R_{k−1} ⊆ DIS(B_U(h*, 2ν+ε_{k−1})), thus

  P_{D̃}(h^{df}_k(x) = +1, x ∈ R_{k−1}) ≤ α(2ν+ε_{k−1}, εk/512)   (24)

  P_{D̃}(h^{df}_k(x) = −1, y_O ≠ y_W, x ∈ R_{k−1}) ≤ εk/512   (25)

Equation (25) implies that

  P_{A_k}(h^{df}_k(x) = −1, y_O ≠ y_W) ≤ εk/(512·P_{D̃}(x ∈ R_{k−1}))   (26)
Recall that A′_k is the dataset subsampled from A_k in line 3 of Algorithm 2. By definition of event E1_k, we have for h^{df}_k:

  P_{A′_k}(h^{df}_k(x) = −1, y_O ≠ y_W)
  ≤ P_{A_k}(h^{df}_k(x) = −1, y_O ≠ y_W) + √( P_{A_k}(h^{df}_k(x) = −1, y_O ≠ y_W)·εk/(1024·P_{D̃}(x ∈ R_{k−1})) ) + εk/(1024·P_{D̃}(x ∈ R_{k−1}))
  ≤ εk/(256·P_{D̃}(x ∈ R_{k−1})) ≤ εk/(256·p̂_k)

where the second inequality is from Equation (26), and the last inequality is from the fact that p̂_k ≤ P_{D̃}(x ∈ R_{k−1}). Hence h^{df}_k is a feasible solution to the optimization problem (2). Thus,
  P_{A_k}(ĥ^{df}_k(x) = +1)
  ≤ P_{A′_k}(ĥ^{df}_k(x) = +1) + √( P_{A′_k}(ĥ^{df}_k(x) = +1)·εk/(1024·P_{D̃}(x ∈ R_{k−1})) ) + εk/(1024·P_{D̃}(x ∈ R_{k−1}))
  ≤ 2·( P_{A′_k}(ĥ^{df}_k(x) = +1) + εk/(1024·P_{D̃}(x ∈ R_{k−1})) )
  ≤ 2·( P_{A′_k}(h^{df}_k(x) = +1) + εk/(1024·P_{D̃}(x ∈ R_{k−1})) )
  ≤ 2·( P_{A_k}(h^{df}_k(x) = +1) + √( P_{A_k}(h^{df}_k(x) = +1)·εk/(1024·P_{D̃}(x ∈ R_{k−1})) ) + εk/(1024·P_{D̃}(x ∈ R_{k−1})) + εk/(1024·P_{D̃}(x ∈ R_{k−1})) )
  ≤ 6·( P_{A_k}(h^{df}_k(x) = +1) + εk/(1024·P_{D̃}(x ∈ R_{k−1})) )

where the first inequality is by definition of event E1_k, the second is by algebra, the third is by optimality of ĥ^{df}_k in (2) (so that P_{A′_k}(ĥ^{df}_k(x) = +1) ≤ P_{A′_k}(h^{df}_k(x) = +1)), the fourth is by definition of event E1_k, and the fifth is by algebra.

Therefore,

  P_{D̃}(ĥ^{df}_k(x) = +1, x ∈ R_{k−1}) ≤ 6·( P_{D̃}(h^{df}_k(x) = +1, x ∈ R_{k−1}) + εk/1024 ) ≤ 6·( α(2ν+ε_{k−1}, εk/512) + εk/1024 )   (27)

where the second inequality follows from Equation (24). This establishes the correctness of Invariant 3.

(2) The number of label requests to O follows from line 3 of Algorithm 2 (see Equation (16)). That is, we can choose c1 large enough (independently of k) such that

  m_{k,1} ≤ c1·( P_{D̃}(x ∈ R_{k−1})/εk )·( d′·ln(1/εk) + ln(1/δk) ) ≤ c1·( P_U(x ∈ ∆(2ν+ε_{k−1}))/εk )·( d′·ln(1/εk) + ln(1/δk) )

where in the second step we use the fact that on event Fk, by item (2) of Lemma 6, R_{k−1} ⊆ DIS(B_U(h*, 2ν+ε_{k−1})), thus P_{D̃}(x ∈ R_{k−1}) ≤ P_{D̃}(x ∈ ∆(2ν+ε_{k−1})) = P_U(x ∈ ∆(2ν+ε_{k−1})).
B.3 Adaptive Subsampling
Lemma 8. There is a numerical constant c2 > 0 such that the following holds. Suppose Invariants 1, 2, and 3 hold in epoch k−1 on event F_{k−1}, and Algorithm 3 receives inputs unlabeled distribution U, classifier ĥ_{k−1}, difference classifier ĥ^{df} = ĥ^{df}_k, target excess error ε = εk, confidence δ = δk/2, and previous labeled dataset Ŝ_{k−1}. Then on event Fk:
(1) Ŝk, the output of Algorithm 3, maintains Invariants 1 and 2.
(2) (Label Complexity: Part 2.) The number of label queries to O in Algorithm 3 is at most:

  m_{k,2} ≤ c2·( (ν + εk)·(α(2ν+ε_{k−1}, εk/512) + εk)/εk² )·d·( ln²(1/εk) + ln²(1/δk) )

Proof. (1) Recall that Fk = F_{k−1} ∩ E1_k ∩ E2_k, where E1_k, E2_k are defined in Subsection A.4. Suppose event Fk happens.
Proof of Invariant 1. We consider a pair of classifiers h, h′ ∈ H, where h is an arbitrary classifier in H and h′ has excess error at most εk.

At iteration t = t0(k) of Algorithm 3, the breaking criterion in line 14 is met, i.e.

  σ(2^{t0(k)}, δ^{t0(k)}_k) + √( σ(2^{t0(k)}, δ^{t0(k)}_k)·err(ĥ^{t0(k)}, Ŝ^{t0(k)}_k) ) ≤ εk/512   (28)

First we expand the definitions of err(h, S̄k) and err(h, Ŝk) respectively:

  err(h, S̄k) = P_{S̃k}(ĥ^{df}_k(x) = +1, h(x) ≠ y_O, x ∈ R_{k−1}) + P_{S̃k}(ĥ^{df}_k(x) = −1, h(x) ≠ y_O, x ∈ R_{k−1}) + P_{S̃k}(h(x) ≠ y_O, x ∉ R_{k−1})

  err(h, Ŝk) = P_{S̃k}(ĥ^{df}_k(x) = +1, h(x) ≠ y_O, x ∈ R_{k−1}) + P_{S̃k}(ĥ^{df}_k(x) = −1, h(x) ≠ y_W, x ∈ R_{k−1}) + P_{S̃k}(h(x) ≠ h*(x), x ∉ R_{k−1})

where we use the fact that, by Lemma 6, ĥ_{k−1}(x) = h*(x) for all examples x ∉ R_{k−1}.

We next show that P_{S̃k}(ĥ^{df}_k(x) = −1, h(x) ≠ y_O, x ∈ R_{k−1}) is close to P_{S̃k}(ĥ^{df}_k(x) = −1, h(x) ≠ y_W, x ∈ R_{k−1}). From Lemma 7, we know that conditioned on event Fk,

  P_{D̃}(ĥ^{df}_k(x) = −1, y_O ≠ y_W, x ∈ R_{k−1}) ≤ εk/64

In the meantime, from Equation (28), γ(2^{t0(k)}, δ^{t0(k)}_k) ≤ σ(2^{t0(k)}, δ^{t0(k)}_k) ≤ εk/512. Recall that S̃k = S̃^{t0(k)}_k. Therefore, by definition of E2_k,

  P_{S̃k}(ĥ^{df}_k(x) = −1, y_O ≠ y_W, x ∈ R_{k−1})
  ≤ P_{D̃}(ĥ^{df}_k(x) = −1, y_O ≠ y_W, x ∈ R_{k−1}) + √( P_{D̃}(ĥ^{df}_k(x) = −1, y_O ≠ y_W, x ∈ R_{k−1})·γ(2^{t0(k)}, δ^{t0(k)}_k) ) + γ(2^{t0(k)}, δ^{t0(k)}_k)
  ≤ P_{D̃}(ĥ^{df}_k(x) = −1, y_O ≠ y_W, x ∈ R_{k−1}) + √( P_{D̃}(ĥ^{df}_k(x) = −1, y_O ≠ y_W, x ∈ R_{k−1})·εk/512 ) + εk/512
  ≤ εk/32

By the triangle inequality, for every classifier h0 ∈ H,

  |P_{S̃k}(ĥ^{df}_k(x) = −1, h0(x) ≠ y_O, x ∈ R_{k−1}) − P_{S̃k}(ĥ^{df}_k(x) = −1, h0(x) ≠ y_W, x ∈ R_{k−1})| ≤ εk/32   (29)

Applying Equation (29) to both h and h′ and combining, we get:

  ( P_{S̃k}(ĥ^{df}_k(x) = −1, h(x) ≠ y_O, x ∈ R_{k−1}) − P_{S̃k}(ĥ^{df}_k(x) = −1, h′(x) ≠ y_O, x ∈ R_{k−1}) )
  − ( P_{S̃k}(ĥ^{df}_k(x) = −1, h(x) ≠ y_W, x ∈ R_{k−1}) − P_{S̃k}(ĥ^{df}_k(x) = −1, h′(x) ≠ y_W, x ∈ R_{k−1}) ) ≤ εk/16   (30)

We now show that the labels inferred in the region X \ R_{k−1} are "favorable" to the classifiers whose excess error is at most εk. By the triangle inequality,

  P_{S̃k}(h(x) ≠ y_O, x ∉ R_{k−1}) − P_{S̃k}(h*(x) ≠ y_O, x ∉ R_{k−1}) ≤ P_{S̃k}(h(x) ≠ h*(x), x ∉ R_{k−1})

By Lemma 6, since h′ has excess error at most εk, h′ agrees with h* on all x inside X \ R_{k−1} on event F_{k−1}; hence P_{S̃k}(h′(x) ≠ h*(x), x ∉ R_{k−1}) = 0. This gives

  P_{S̃k}(h(x) ≠ y_O, x ∉ R_{k−1}) − P_{S̃k}(h′(x) ≠ y_O, x ∉ R_{k−1})
  ≤ P_{S̃k}(h(x) ≠ h*(x), x ∉ R_{k−1}) − P_{S̃k}(h′(x) ≠ h*(x), x ∉ R_{k−1})   (31)

Combining Equations (30) and (31), we conclude that

  err(h, S̄k) − err(h′, S̄k) ≤ err(h, Ŝk) − err(h′, Ŝk) + εk/16

This establishes the correctness of Invariant 1.
Proof of Invariant 2. Recall that by definition of $E^2_k$ the following concentration result holds for all $t \in \mathbb{N}$:
\[
\big|(\mathrm{err}(h,S^t_k) - \mathrm{err}(h',S^t_k)) - (\mathrm{err}_D(h) - \mathrm{err}_D(h'))\big| \le \sigma(2^t, \delta^t_k) + \sqrt{\sigma(2^t, \delta^t_k)\,\rho_{S^t_k}(h,h')}
\]
In particular, for iteration $t_0(k)$ we have
\[
\big|(\mathrm{err}(h,S^{t_0(k)}_k) - \mathrm{err}(h',S^{t_0(k)}_k)) - (\mathrm{err}_D(h) - \mathrm{err}_D(h'))\big| \le \sigma(2^{t_0(k)}, \delta^{t_0(k)}_k) + \sqrt{\sigma(2^{t_0(k)}, \delta^{t_0(k)}_k)\,\rho_{S^{t_0(k)}_k}(h,h')}
\]
Recall that $S_k = S^{t_0(k)}_k$, $h_k = h^{t_0(k)}_k$, and $\sigma_k = \sigma(2^{t_0(k)}, \delta^{t_0(k)}_k)$; hence the above is equivalent to
\[
\big|(\mathrm{err}(h,S_k) - \mathrm{err}(h',S_k)) - (\mathrm{err}_D(h) - \mathrm{err}_D(h'))\big| \le \sigma_k + \sqrt{\sigma_k\,\rho_{S_k}(h,h')} \tag{32}
\]
Equation (32) establishes the correctness of Equation (17) of Invariant 2. Equation (18) of Invariant 2 follows from Equation (28).
(2) We define $\tilde{h}_k = \arg\min_{h\in\mathcal{H}} \mathrm{err}_{D_k}(h)$, and define $\nu_k$ to be $\mathrm{err}_{D_k}(\tilde{h}_k)$. To prove the bound on the number of label requests, we first claim that if $t$ is sufficiently large that
\[
\sigma(2^t, \delta^t_k) + \sqrt{\sigma(2^t, \delta^t_k)\,\nu_k} \le \epsilon_k/1536 \tag{33}
\]
then the algorithm will satisfy the breaking criterion at line 14 of Algorithm 3; that is, for this value of $t$,
\[
\sigma(2^t, \delta^t_k) + \sqrt{\sigma(2^t, \delta^t_k)\,\mathrm{err}(h^t_k, \hat{S}^t_k)} \le \epsilon_k/512 \tag{34}
\]
Indeed, by definition of $E^2_k$, if event $F_k$ happens,
\[
\mathrm{err}(\tilde{h}_k, \hat{S}^t_k) \le \mathrm{err}_{D_k}(\tilde{h}_k) + \sigma(2^t, \delta^t_k) + \sqrt{\sigma(2^t, \delta^t_k)\,\mathrm{err}_{D_k}(\tilde{h}_k)} = \nu_k + \sigma(2^t, \delta^t_k) + \sqrt{\sigma(2^t, \delta^t_k)\,\nu_k} \tag{35}
\]
Therefore,
\begin{align*}
\sigma(2^t, \delta^t_k) + \sqrt{\sigma(2^t, \delta^t_k)\,\mathrm{err}(h^t_k, \hat{S}^t_k)}
&\le \sigma(2^t, \delta^t_k) + \sqrt{\sigma(2^t, \delta^t_k)\,\mathrm{err}(\tilde{h}_k, \hat{S}^t_k)} \\
&\le \sigma(2^t, \delta^t_k) + \sqrt{\sigma(2^t, \delta^t_k)\,(2\nu_k + 2\sigma(2^t, \delta^t_k))} \\
&\le 3\sigma(2^t, \delta^t_k) + 2\sqrt{\sigma(2^t, \delta^t_k)\,\nu_k} \\
&\le \epsilon_k/512
\end{align*}
where the first inequality is from the optimality of $h^t_k$, the second inequality is from Equation (35), the third inequality is by algebra, and the last inequality follows from Equation (33). The claim follows.

Next, we solve for the minimum $t$ that satisfies (33), which is an upper bound on $t_0(k)$. Fact 1 implies that there is a numerical constant $c_3 > 0$ such that
\[
2^{t_0(k)} \le c_3\,\frac{\nu_k + \epsilon_k}{\epsilon_k^2}\Big(d\ln\frac{1}{\epsilon_k} + \ln\frac{1}{\delta_k}\Big)
\]
Thus, there is a numerical constant $c_4 > 0$ such that
\[
t_0(k) \le c_4\Big(\ln d + \ln\frac{1}{\epsilon_k} + \ln\ln\frac{1}{\delta_k}\Big)
\]
Hence, there is a numerical constant $c_5 > 0$ (that does not depend on $k$) such that the following holds. If event $F_k$ happens, then the number of label queries made by Algorithm 3 to $O$ can be bounded as follows:
\begin{align*}
m_{k,2} &= \sum_{t=1}^{t_0(k)} \big|S^{t,U}_k \cap \{x : h^{df}_k(x) = +1\} \cap R_{k-1}\big| \\
&= \sum_{t=1}^{t_0(k)} 2^t\,P_{\mathcal{S}^t_k}(h^{df}_k(x) = +1, x\in R_{k-1}) \\
&\le \sum_{t=1}^{t_0(k)} 2^t\Big(2P_{\mathcal{D}}(h^{df}_k(x) = +1, x\in R_{k-1}) + 2\cdot 4\,\frac{\ln(2/\delta^t_k)}{2^t}\Big) \\
&\le 4\cdot 2^{t_0(k)}\,P_{\mathcal{D}}(h^{df}_k(x) = +1, x\in R_{k-1}) + 8\,t_0(k)\ln\frac{2}{\delta^{t_0(k)}_k} \\
&\le c_5\Big(\frac{(\nu_k + \epsilon_k)\,P_{\mathcal{D}}(h^{df}_k(x) = +1, x\in R_{k-1})}{\epsilon_k^2} + 1\Big)\cdot d\Big(\ln^2\frac{1}{\epsilon_k} + \ln^2\frac{1}{\delta_k}\Big) \\
&\le c_5\Big(\frac{(\nu_k + \epsilon_k)\cdot 6(\alpha(2\nu + \epsilon_{k-1}, \epsilon_k/512) + \epsilon_k/1024)}{\epsilon_k^2} + 1\Big)\cdot d\Big(\ln^2\frac{1}{\epsilon_k} + \ln^2\frac{1}{\delta_k}\Big)
\end{align*}
where the second equality is from the fact that $|S^{t,U}_k \cap \{x : h^{df}_k(x) = +1\} \cap R_{k-1}| = |S^{t,U}_k|\cdot P_{\mathcal{S}^t_k}(h^{df}_k(x) = +1, x\in R_{k-1})$, in conjunction with $|S^{t,U}_k| = 2^t$; the first inequality is by definition of $E^2_k$; the second and third inequalities follow by algebra, using that $t_0(k)\ln\frac{1}{\delta^{t_0(k)}_k} \le c_5 d(\ln^2\frac{1}{\epsilon_k} + \ln^2\frac{1}{\delta_k})$ for some constant $c_5 > 0$, along with the choice of $c_2$; and the fourth step is from Lemma 7, which states that Invariant 3 holds at epoch $k$.

What remains to be argued is an upper bound on $\nu_k$. Note that
\begin{align*}
\nu_k &= \min_{h\in\mathcal{H}} \big[P_{\mathcal{D}}(h^{df}_k(x)=-1, h(x)\ne y_W, x\in R_{k-1}) + P_{\mathcal{D}}(h^{df}_k(x)=+1, h(x)\ne y_O, x\in R_{k-1}) + P_{\mathcal{D}}(h(x)\ne h^*(x), x\notin R_{k-1})\big] \\
&\le P_{\mathcal{D}}(h^{df}_k(x)=-1, h^*(x)\ne y_W, x\in R_{k-1}) + P_{\mathcal{D}}(h^{df}_k(x)=+1, h^*(x)\ne y_O, x\in R_{k-1}) \\
&\le P_{\mathcal{D}}(h^{df}_k(x)=-1, h^*(x)\ne y_O, x\in R_{k-1}) + P_{\mathcal{D}}(h^{df}_k(x)=+1, h^*(x)\ne y_O, x\in R_{k-1}) + \epsilon_k/64 \\
&\le P_{\mathcal{D}}(h^{df}_k(x)=-1, h^*(x)\ne y_O, x\in R_{k-1}) + P_{\mathcal{D}}(h^{df}_k(x)=+1, h^*(x)\ne y_O, x\in R_{k-1}) + P_{\mathcal{D}}(h^*(x)\ne y_O, x\notin R_{k-1}) + \epsilon_k/64 \\
&= \nu + \epsilon_k/64
\end{align*}
where the first step is by definition of $\mathrm{err}_{D_k}(h)$; the second step is by plugging $h^*$ into the minimum (which $h^*$ need not attain); the third step is by the triangle inequality together with the bound from Lemma 7 (conditioned on $F_k$, $P_{\mathcal{D}}(h^{df}_k(x)=-1, y_O\ne y_W, x\in R_{k-1}) \le \epsilon_k/64$); the fourth step is by adding the positive term $P_{\mathcal{D}}(h^*(x)\ne y_O, x\notin R_{k-1})$; and the fifth step is by definition of $\mathrm{err}_D(h)$. Therefore, we conclude that there is a numerical constant $c_2 > 0$ such that $m_{k,2}$, the number of label requests to $O$ in Algorithm 3, is at most
\[
c_2\,\frac{(\nu + \epsilon_k)(\alpha(2\nu + \epsilon_{k-1}, \epsilon_k/512) + \epsilon_k)}{\epsilon_k^2}\cdot d\Big(\ln^2\frac{1}{\epsilon_k} + \ln^2\frac{1}{\delta_k}\Big)
\]
B.4 Putting It Together: Consistency and Label Complexity
Proof of Lemma 2. With foresight, pick $c_0 > 0$ to be a large enough constant. We prove the result by induction.

Base case. Consider $k = 0$. Recall that $F_0$ is defined as
\[
F_0 = \Big\{\text{for all } h, h' \in \mathcal{H}:\ \big|(\mathrm{err}(h,S_0) - \mathrm{err}(h',S_0)) - (\mathrm{err}_D(h) - \mathrm{err}_D(h'))\big| \le \sigma(n_0, \delta_0) + \sqrt{\sigma(n_0, \delta_0)\,\rho_{S_0}(h,h')}\Big\}
\]
Note that by definition in Subsection A.3, $S_0 = \hat{S}_0$. Therefore Invariant 1 trivially holds. When $F_0$ happens, Equation (17) of Invariant 2 holds, and $n_0$ is such that $\sqrt{\sigma_0} \le \epsilon_0/1024$; thus
\[
\sigma_0 + \sqrt{\sigma_0\,\mathrm{err}(h_0, \hat{S}_0)} \le \epsilon_0/512
\]
which establishes the validity of Equation (18) of Invariant 2. Meanwhile, the number of label requests to $O$ is
\[
n_0 = 64\cdot 1024^2\Big(d\ln(512\cdot 1024^2) + \ln\frac{96}{\delta}\Big) \le c_0\Big(d + \ln\frac{1}{\delta}\Big)
\]
Inductive case. Suppose the claim holds for all $k' < k$. The inductive hypothesis states that Invariants 1, 2, 3 hold in epoch $k-1$ on event $F_{k-1}$. By Lemma 7 and Lemma 8, Invariants 1, 2, 3 hold in epoch $k$ on event $F_k$. Suppose $F_k$ happens. By Lemma 7, there is a numerical constant $c_1 > 0$ such that the number of label queries in line 12 of Algorithm 2 is at most
\[
m_{k,1} \le c_1\,\frac{P_U(x\in \Delta(2\nu + \epsilon_{k-1}))}{\epsilon_k}\Big(d'\ln\frac{1}{\epsilon_k} + \ln\frac{1}{\delta_k}\Big)
\]
Meanwhile, by Lemma 8, there is a numerical constant $c_2 > 0$ such that the number of label queries in line 14 of Algorithm 3 is at most
\[
m_{k,2} \le c_2\,\frac{(\alpha(2\nu + \epsilon_{k-1}, \epsilon_k/512) + \epsilon_k)(\nu + \epsilon_k)}{\epsilon_k^2}\cdot d\Big(\ln^2\frac{1}{\epsilon_k} + \ln^2\frac{1}{\delta_k}\Big)
\]
Thus, the total number of label requests at epoch $k$ is at most
\begin{align*}
m_k = m_{k,1} + m_{k,2} \le c_0\Big(&\frac{(\alpha(2\nu + \epsilon_{k-1}, \epsilon_k/512) + \epsilon_k)(\nu + \epsilon_k)}{\epsilon_k^2}\, d\Big(\ln^2\frac{1}{\epsilon_k} + \ln^2\frac{1}{\delta_k}\Big) \\
&+ \frac{P_U(x\in \Delta(2\nu + \epsilon_{k-1}))}{\epsilon_k}\Big(d'\ln\frac{1}{\epsilon_k} + \ln\frac{1}{\delta_k}\Big)\Big)
\end{align*}
This completes the induction.
Theorem 3 (Consistency). If $F_{k_0}$ happens, then the classifier $\hat{h}$ returned by Algorithm 1 is such that
\[
\mathrm{err}_D(\hat{h}) - \mathrm{err}_D(h^*) \le \epsilon
\]
Proof. By Lemma 2, Invariants 1, 2, 3 hold at epoch $k_0$. Thus by Lemma 5,
\[
\mathrm{err}_D(\hat{h}) - \mathrm{err}_D(h^*) = \mathrm{err}_D(h_{k_0}) - \mathrm{err}_D(h^*) \le \epsilon_{k_0}/8 \le \epsilon
\]
Proof of Theorem 1. This is an immediate consequence of Theorem 3.
Theorem 4 (Label Complexity). If $F_{k_0}$ happens, then the number of label queries made by Algorithm 1 to $O$ is at most
\[
\tilde{O}\Big(\Big(\sup_{r\ge\epsilon} \frac{\alpha(2\nu + r, r/1024)}{2\nu + r}\Big)\, d\,\Big(\frac{\nu^2}{\epsilon^2} + 1\Big) + \Big(\sup_{r\ge\epsilon} \frac{P_U(x\in \Delta(2\nu + r))}{2\nu + r}\Big)\, d'\,\Big(\frac{\nu}{\epsilon} + 1\Big)\Big)
\]
Proof. Conditioned on event $F_{k_0}$, we bound the sum $\sum_{k=0}^{k_0} m_k$:
\begin{align*}
\sum_{k=0}^{k_0} m_k
&\le c_0\Big(d + \ln\frac{1}{\delta}\Big) + c_0 \sum_{k=1}^{k_0}\Big(\frac{(\alpha(2\nu + \epsilon_{k-1}, \epsilon_k/512) + \epsilon_k)(\nu + \epsilon_k)}{\epsilon_k^2}\, d\Big(\ln^2\frac{1}{\epsilon_k} + \ln^2\frac{1}{\delta_k}\Big) + \frac{P_U(x\in \Delta(2\nu + \epsilon_{k-1}))}{\epsilon_k}\Big(d'\ln\frac{1}{\epsilon_k} + \ln\frac{1}{\delta_k}\Big)\Big) \\
&\le c_0\Big(d + \ln\frac{1}{\delta}\Big) + c_0 \sum_{k=1}^{k_0}\Big(\frac{(\alpha(2\nu + \epsilon_{k-1}, \epsilon_k/512) + \epsilon_k)(\nu + \epsilon_k)}{\epsilon_k^2}\, d\Big(3\ln^2\frac{1}{\epsilon} + 2\ln^2\frac{1}{\delta}\Big) + \frac{P_U(x\in \Delta(2\nu + \epsilon_{k-1}))}{\epsilon_k}\Big(2d'\ln\frac{1}{\epsilon} + \ln\frac{1}{\delta}\Big)\Big) \\
&\le \Big(\sup_{r\ge\epsilon} \frac{\alpha(2\nu + r, r/1024) + r}{2\nu + r}\Big)\, d\,\Big(3\ln^2\frac{1}{\epsilon} + 2\ln^2\frac{1}{\delta}\Big) \sum_{k=0}^{k_0} \frac{(\nu + \epsilon_k)^2}{\epsilon_k^2} + \Big(\sup_{r\ge\epsilon} \frac{P_U(x\in \Delta(2\nu + r))}{2\nu + r}\Big)\Big(2d'\ln\frac{1}{\epsilon} + \ln\frac{1}{\delta}\Big) \sum_{k=0}^{k_0} \frac{\nu + \epsilon_k}{\epsilon_k} \\
&\le \tilde{O}\Big(\Big(\sup_{r\ge\epsilon} \frac{\alpha(2\nu + r, r/1024) + r}{2\nu + r}\Big)\, d\,\Big(\frac{\nu^2}{\epsilon^2} + 1\Big) + \Big(\sup_{r\ge\epsilon} \frac{P_U(x\in \Delta(2\nu + r))}{2\nu + r}\Big)\, d'\,\Big(\frac{\nu}{\epsilon} + 1\Big)\Big)
\end{align*}
where the first inequality is by Lemma 2; the second inequality follows by noticing that for all $k \ge 1$, $\ln^2\frac{1}{\epsilon_k} + \ln^2\frac{1}{\delta_k} \le 3\ln^2\frac{1}{\epsilon} + 2\ln^2\frac{1}{\delta}$ and $d'\ln\frac{1}{\epsilon_k} + \ln\frac{1}{\delta_k} \le 2d'\ln\frac{1}{\epsilon} + \ln\frac{1}{\delta}$; and the remaining steps follow from standard algebra.
Proof of Theorem 2. Item 1 is an immediate consequence of Lemma 2, whereas Item 2 is a consequence of Theorem 4.
C Case Study: Linear Classification under the Uniform Distribution over the Unit Ball
We remind the reader of the setting of our example in Section 4. $\mathcal{H}$ is the class of homogeneous linear separators on the $d$-dimensional unit ball, and $\mathcal{H}^{df}$ is defined to be $\{h\Delta h' : h, h' \in \mathcal{H}\}$. Note that $d'$ is at most $5d$. Furthermore, $U$ is the uniform distribution over the unit ball. $O$ is a deterministic labeler such that $\mathrm{err}_D(h^*) = \nu > 0$, and $W$ is such that there exists a difference classifier $\bar{h}^{df}$ with false negative error $0$ for which $\Pr_U(\bar{h}^{df}(x) = +1) \le g = o(\sqrt{d}\,\nu)$. We prove the label complexity bound provided by Corollary 1.
Proof of Corollary 1. We claim that under the assumptions of Corollary 1, $\alpha(2\nu + r, r/1024)$ is at most $g$. Indeed, taking $h^{df} = \bar{h}^{df}$, observe that
\begin{align*}
P(\bar{h}^{df}(x) = -1, y_W \ne y_O, x\in \Delta(2\nu + r)) &\le P(\bar{h}^{df}(x) = -1, y_W \ne y_O) = 0 \\
P(\bar{h}^{df}(x) = +1, x\in \Delta(2\nu + r)) &\le g
\end{align*}
This shows that $\alpha(2\nu + r, 0) \le g$. Hence, $\alpha(2\nu + r, r/1024) \le \alpha(2\nu + r, 0) \le g$. Therefore,
\[
\sup_{r\ge\epsilon} \frac{\alpha(2\nu + r, r/1024) + r}{2\nu + r} \le \sup_{r\ge\epsilon} \frac{g + r}{\nu + r} \le \max\Big(\frac{g}{\nu}, 1\Big)
\]
Recall that the disagreement coefficient satisfies $\theta(2\nu + r) \le \sqrt{d}$ for all $r$, and $d' \le 5d$. Thus, by Theorem 2, the number of label queries to $O$ is at most
\[
\tilde{O}\Big(d\,\max\Big(\frac{g}{\nu}, 1\Big)\Big(\frac{\nu^2}{\epsilon^2} + 1\Big) + d^{3/2}\Big(1 + \frac{\nu}{\epsilon}\Big)\Big)
\]
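To get a feel for this bound, the sketch below (an illustrative Python snippet with made-up values of $d$, $g$, $\nu$, $\epsilon$; the function name is ours and nothing here comes from the paper) evaluates the two terms with the hidden constant and logarithmic factors dropped:

```python
def corollary1_bound(d, g, nu, eps):
    """Evaluate the two terms of the label-complexity bound
    d * max(g/nu, 1) * (nu^2/eps^2 + 1) + d^(3/2) * (1 + nu/eps),
    ignoring the hidden constant and log factors."""
    term_weak_labeler = d * max(g / nu, 1.0) * (nu ** 2 / eps ** 2 + 1.0)
    term_disagreement = d ** 1.5 * (1.0 + nu / eps)
    return term_weak_labeler, term_disagreement

# Hypothetical values; when g = o(sqrt(d) * nu), the first term improves on
# the d^(3/2) * nu^2 / eps^2 cost of plain disagreement-based active learning.
t1, t2 = corollary1_bound(d=100, g=0.02, nu=0.05, eps=0.01)
```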
D Performance Guarantees for Learning with Respect to Data Labeled by O and W
An interesting variant of our model is to consider learning from data labeled by a mixture of $O$ and $W$. Let $D_W$ be the distribution over labeled examples determined by $U$ and $W$; specifically, $P_{D_W}(x,y) = P_U(x)P_W(y|x)$. Let $D'$ be a mixture of $D$ and $D_W$; specifically, $D' = (1-\beta)D + \beta D_W$ for some parameter $\beta > 0$. Define $h'$ to be the best classifier with respect to $D'$, and denote by $\nu'$ the error of $h'$ with respect to $D'$.

Let $O'$ be the following mixture oracle. Given an example $x$, the label $y_{O'}$ is generated as follows: $O'$ flips a coin with bias $\beta$. If it comes up heads, it queries $W$ for the label of $x$ and returns the result; otherwise $O$ is queried and the result returned. It is immediate that the conditional probability induced by $O'$ is $P_{O'}(y|x) = (1-\beta)P_O(y|x) + \beta P_W(y|x)$, and $D'(x,y) = P_{O'}(y|x)P_U(x)$.
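The mixture oracle $O'$ is straightforward to simulate per query. Below is a minimal Python sketch of the coin-flip construction (the function names and the toy labelers are our own illustration, not from the paper):

```python
import random

def make_mixture_oracle(label_O, label_W, beta, seed=0):
    """Simulate the mixture oracle O': with probability beta return the weak
    label y_W, otherwise the strong label y_O, so that
    P_{O'}(y|x) = (1 - beta) * P_O(y|x) + beta * P_W(y|x)."""
    rng = random.Random(seed)

    def label_Oprime(x):
        # Heads (probability beta): query W; tails: query O.
        return label_W(x) if rng.random() < beta else label_O(x)

    return label_Oprime

# Toy deterministic labelers on the real line: W flips O's label
# inside a small margin around the decision boundary.
label_O = lambda x: 1 if x >= 0 else -1
label_W = lambda x: -label_O(x) if abs(x) < 0.1 else label_O(x)
oracle_Oprime = make_mixture_oracle(label_O, label_W, beta=0.3)
```

Away from the margin the two labelers agree, so there $O'$ is deterministic regardless of the coin.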
Assumption 2. For any $r, \eta > 0$, there exists an $h^{df}_{\eta,r} \in \mathcal{H}^{df}$ with the following properties:
\begin{align*}
P_{\mathcal{D}}(h^{df}_{\eta,r}(x) = -1, x\in \Delta(r), y_{O'} \ne y_W) &\le \eta \\
P_{\mathcal{D}}(h^{df}_{\eta,r}(x) = +1, x\in \Delta(r)) &\le \alpha'(r, \eta)
\end{align*}
Recall that the disagreement coefficient at scale $r$ is $\theta(r) = \sup_{h\in\mathcal{H}} \sup_{r'\ge r} \frac{P_U(\mathrm{DIS}(B_U(h, r')))}{r'}$, which depends only on the unlabeled data distribution $U$ and not on $W$ or $O$.
We have the following corollary.

Corollary 3 (Learning with respect to the Mixture). Let $d$ be the VC dimension of $\mathcal{H}$ and let $d'$ be the VC dimension of $\mathcal{H}^{df}$. Suppose Assumption 2 holds and the error of the best classifier in $\mathcal{H}$ on $D'$ is at most $\nu'$. Suppose Algorithm 1 is run with the following inputs: unlabeled distribution $U$, target excess error $\epsilon$, confidence $\delta$, labeling oracle $O'$, weak oracle $W$, hypothesis class $\mathcal{H}$, and hypothesis class for the difference classifier $\mathcal{H}^{df}$. Then with probability $\ge 1 - 2\delta$, the following hold:

1. the classifier $\hat{h}$ output by Algorithm 1 satisfies $\mathrm{err}_{D'}(\hat{h}) \le \mathrm{err}_{D'}(h') + \epsilon$;

2. the total number of label queries made by Algorithm 1 to the oracle $O$ is at most
\[
\tilde{O}\Big((1-\beta)\Big(\sup_{r\ge\epsilon} \frac{\alpha'(2\nu' + r, r/1024) + r}{2\nu' + r}\cdot d\Big(\frac{\nu'^2}{\epsilon^2} + 1\Big) + \theta(2\nu' + \epsilon)\, d'\Big(\frac{\nu'}{\epsilon} + 1\Big)\Big)\Big)
\]
Proof Sketch. Consider running Algorithm 1 in the setting above. By Theorem 1 and Theorem 2, there is an event $F$ with $P(F) \ge 1 - \delta$ such that if event $F$ happens, then $\hat{h}$, the classifier learned by Algorithm 1, satisfies
\[
\mathrm{err}_{D'}(\hat{h}) \le \mathrm{err}_{D'}(h') + \epsilon
\]
By Theorem 2, the number of label requests to $O'$ is at most
\[
m_{O'} = \tilde{O}\Big(\sup_{r\ge\epsilon} \frac{\alpha'(2\nu' + r, r/1024) + r}{2\nu' + r}\cdot d\Big(\frac{\nu'^2}{\epsilon^2} + 1\Big) + \theta(2\nu' + \epsilon)\, d'\Big(\frac{\nu'}{\epsilon} + 1\Big)\Big)
\]
$O'$ is simulated by drawing a Bernoulli random variable $Z_i \sim \mathrm{Ber}(1-\beta)$ at each call of $O'$: if $Z_i = 1$, return $O(x)$; otherwise return $W(x)$. Define the event
\[
H = \Big\{\sum_{i=1}^{m_{O'}} Z_i \le 2\Big((1-\beta)m_{O'} + 4\ln\frac{2}{\delta}\Big)\Big\}
\]
By the Chernoff bound, $P(H) \ge 1 - \delta$. Consider the event $J = F \cap H$; by the union bound, $P(J) \ge 1 - 2\delta$. Conditioned on event $J$, the number of label requests to $O$ is at most $\sum_{i=1}^{m_{O'}} Z_i$, which is at most
\[
\tilde{O}\Big((1-\beta)\Big(\sup_{r\ge\epsilon} \frac{\alpha'(2\nu' + r, r/1024) + r}{2\nu' + r}\cdot d\Big(\frac{\nu'^2}{\epsilon^2} + 1\Big) + \theta(2\nu' + \epsilon)\, d'\Big(\frac{\nu'}{\epsilon} + 1\Big)\Big)\Big)
\]
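The Chernoff-bound step can be sanity-checked empirically. The sketch below (a Monte Carlo illustration with arbitrary parameter values, not part of the proof) draws $m_{O'}$ Bernoulli$(1-\beta)$ variables and tests the event $H$:

```python
import math
import random

def event_H_holds(m, beta, delta, rng):
    """Check the event H from the proof sketch:
    sum(Z_i) <= 2*((1 - beta)*m + 4*ln(2/delta)),
    where Z_i ~ Ber(1 - beta) marks a call forwarded to the strong oracle O."""
    z_sum = sum(1 for _ in range(m) if rng.random() < 1.0 - beta)
    return z_sum <= 2 * ((1.0 - beta) * m + 4.0 * math.log(2.0 / delta))

rng = random.Random(0)
# The proof asserts P(H) >= 1 - delta; with these values the threshold
# exceeds m itself, so H holds in every trial.
holds = sum(event_H_holds(m=1000, beta=0.3, delta=0.05, rng=rng) for _ in range(200))
```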
E Remaining Proofs
Proof of Fact 2. (1) First, by Lemma 1, $P_D(x\in R_{k-1})/2 \le p_k \le P_D(x\in R_{k-1})$ holds with probability $1 - \delta_k/6$. Second, for each classifier $h^{df} \in \mathcal{H}^{df}$, define the functions $f^1_{h^{df}}$ and $f^2_{h^{df}}$ associated with it. Formally,
\begin{align*}
f^1_{h^{df}}(x, y_O, y_W) &= I(h^{df}(x) = -1, y_O \ne y_W) \\
f^2_{h^{df}}(x, y_O, y_W) &= I(h^{df}(x) = +1)
\end{align*}
Consider the function classes $\mathcal{F}^1 = \{f^1_{h^{df}} : h^{df} \in \mathcal{H}^{df}\}$ and $\mathcal{F}^2 = \{f^2_{h^{df}} : h^{df} \in \mathcal{H}^{df}\}$. Note that both $\mathcal{F}^1$ and $\mathcal{F}^2$ have VC dimension $d'$, the same as $\mathcal{H}^{df}$. We note that $\mathcal{A}'_k$ is a random sample of size $m_k$ drawn iid from $\mathcal{A}_k$. The fact follows from the normalized VC inequality applied to $\mathcal{F}^1$ and $\mathcal{F}^2$, together with the choice of $m_k$ in Algorithm 2 called at epoch $k$, along with the union bound.
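For concreteness, the two empirical quantities controlled by the normalized VC inequality can be computed for a candidate difference classifier as follows (a Python sketch with our own names; samples are $(x, y_O, y_W)$ triples):

```python
def empirical_f1_f2(hdf, sample):
    """Empirical means, over a sample of (x, y_O, y_W) triples, of
    f1 = I(hdf(x) = -1 and y_O != y_W)  -- disagreements missed by hdf
    f2 = I(hdf(x) = +1)                 -- mass predicted as disagreement."""
    n = len(sample)
    f1 = sum(1 for x, yO, yW in sample if hdf(x) == -1 and yO != yW) / n
    f2 = sum(1 for x, yO, yW in sample if hdf(x) == 1) / n
    return f1, f2

# Toy check: a difference classifier flagging |x| < 0.5 as "W and O may disagree".
hdf = lambda x: 1 if abs(x) < 0.5 else -1
sample = [(0.2, 1, -1), (0.8, 1, 1), (0.9, 1, -1), (-0.1, -1, -1)]
f1, f2 = empirical_f1_f2(hdf, sample)  # f1 = 0.25, f2 = 0.5
```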
Proof of Fact 3. For fixed $t$, we note that $S^t_k$ is a random sample of size $2^t$ drawn iid from $D$. By Equation (8), for any fixed $t \in \mathbb{N}$,
\[
P\Big(\text{for all } h, h' \in \mathcal{H}:\ \big|(\mathrm{err}(h,S^t_k) - \mathrm{err}(h',S^t_k)) - (\mathrm{err}_D(h) - \mathrm{err}_D(h'))\big| \le \sigma(2^t, \delta^t_k) + \sqrt{\sigma(2^t, \delta^t_k)\,\rho_{S^t_k}(h,h')}\Big) \ge 1 - \delta^t_k/8 \tag{36}
\]
Meanwhile, for fixed $t \in \mathbb{N}$, note that $\hat{S}^t_k$ is a random sample of size $2^t$ drawn iid from $D_k$. By Equation (8),
\[
P\Big(\text{for all } h \in \mathcal{H}:\ \mathrm{err}(h,\hat{S}^t_k) - \mathrm{err}_{D_k}(h) \le \sigma(2^t, \delta^t_k) + \sqrt{\sigma(2^t, \delta^t_k)\,\mathrm{err}_{D_k}(h)}\Big) \ge 1 - \delta^t_k/8 \tag{37}
\]
Moreover, for fixed $t \in \mathbb{N}$, note that $\mathcal{S}^t_k$ is a random sample of size $2^t$ drawn iid from $\mathcal{D}$. By Equation (12),
\[
P\Big(P_{\mathcal{S}^t_k}(h^{df}_k(x) = -1, y_O \ne y_W, x\in R_{k-1}) \le P_{\mathcal{D}}(h^{df}_k(x) = -1, y_O \ne y_W, x\in R_{k-1}) + \sqrt{\gamma(2^t, \delta^t_k)\,P_{\mathcal{D}}(h^{df}_k(x) = -1, y_O \ne y_W, x\in R_{k-1})} + \gamma(2^t, \delta^t_k)\Big) \ge 1 - \delta^t_k/8 \tag{38}
\]
Finally, for fixed $t \in \mathbb{N}$, note that $\mathcal{S}^t_k$ is a random sample of size $2^t$ drawn iid from $\mathcal{D}$. By Equation (12),
\[
P\Big(P_{\mathcal{S}^t_k}(h^{df}_k(x) = +1, x\in R_{k-1}) \le P_{\mathcal{D}}(h^{df}_k(x) = +1, x\in R_{k-1}) + \sqrt{P_{\mathcal{D}}(h^{df}_k(x) = +1, x\in R_{k-1})\,\gamma(2^t, \delta^t_k)} + \gamma(2^t, \delta^t_k)\Big) \ge 1 - \delta^t_k/8 \tag{39}
\]
Note that by algebra,
\[
P_{\mathcal{D}}(h^{df}_k(x) = +1, x\in R_{k-1}) + \sqrt{P_{\mathcal{D}}(h^{df}_k(x) = +1, x\in R_{k-1})\,\gamma(2^t, \delta^t_k)} + \gamma(2^t, \delta^t_k) \le 2\big(P_{\mathcal{D}}(h^{df}_k(x) = +1, x\in R_{k-1}) + \gamma(2^t, \delta^t_k)\big)
\]
Therefore,
\[
P\Big(P_{\mathcal{S}^t_k}(h^{df}_k(x) = +1, x\in R_{k-1}) \le 2\big(P_{\mathcal{D}}(h^{df}_k(x) = +1, x\in R_{k-1}) + \gamma(2^t, \delta^t_k)\big)\Big) \ge 1 - \delta^t_k/8 \tag{40}
\]
The proof follows by applying the union bound over Equations (36), (37), (38), and (40), and over $t \in \mathbb{N}$.

We emphasize that $\mathcal{S}^t_k$ is chosen iid at random after $h^{df}_k$ is determined; thus a uniform convergence argument over $\mathcal{H}^{df}$ is not necessary for Equations (38) and (40).
Proof of Fact 4. We proceed by induction on $k$.
Base Case. For $k = 0$, it follows directly from the normalized VC inequality that $P(F_0) \ge 1 - \delta_0$.

Inductive Case. Assume $P(F_{k-1}) \ge 1 - \delta_0 - \cdots - \delta_{k-1}$ holds. By the union bound,
\[
P(F_k) \ge P(F_{k-1} \cap E^1_k \cap E^2_k) \ge P(F_{k-1}) - \delta_k/2 - \delta_k/2 = P(F_{k-1}) - \delta_k
\]
Hence $P(F_k) \ge 1 - \delta_0 - \cdots - \delta_k$. This finishes the induction. In particular, $P(F_{k_0}) \ge 1 - \delta_0 - \cdots - \delta_{k_0} \ge 1 - \delta$.