Active and Proactive Machine Learning
Jaime Carbonell, Pinar Donmez, Jingrui He & Vamshi Ambati
Language Technologies Institute, Carnegie Mellon University
www.cs.cmu.edu/~jgc
27 October 2010
October 2010 Jaime G. Carbonell, Language Technologies Institute
Why is Active Learning Important?
Labeled data volumes << unlabeled data volumes:
- 1.2% of all proteins have known structures
- < .01% of all galaxies in the Sloan Sky Survey have consensus type labels
- < .0001% of all web pages have topic labels
- << 1E-10% of all internet sessions are labeled as to fraudulence (malware, etc.)
- < .0001 of all financial transactions are investigated w.r.t. fraudulence
If labeling is costly or limited, select the instances with maximal impact for learning.
Is (Pro)Active Learning Relevant to Language Technologies?
- Text classification: by topic, genre, difficulty, ...; in learning to rank search results
- Question answering: question-type classification; answer ranking
- Machine translation: selecting sentences to translate for LDLs (low-density languages); eliciting partial or full alignment
Active Learning

Training data: {(x_i, y_i)}_{i=1..k} ∪ {x_i}_{i=k+1..n}, with an oracle O: x → y
Special case: k = 0 (no initial labels)
Functional space: {f_j(p_l)} (hypothesis families f_j with parameter settings p_l)
Fitness criterion (a.k.a. loss function): argmin_{j,l} Σ_i L(f_{j,p_l}(x_i), y_i)
Sampling strategy: argmin_{x_1,...,x_k} E[ L(f̂(x_test), ŷ_test) | {x_1,...,x_k} ]
Sampling Strategies
- Random sampling (preserves the distribution)
- Uncertainty sampling (Lewis, 1996; Tong & Koller, 2000): proximity to the decision boundary
- Maximal diversity sampling: maximal distance to labeled x's
- Density sampling (kNN-inspired; McCallum & Nigam, 2004)
- Representative sampling (Xu et al., 2003)
- Instability sampling (probability-weighted): x's that maximally change the decision boundary
- Ensemble strategies: boosting-like ensemble (Baram, 2003); DUAL (Donmez & Carbonell, 2007), which dynamically switches from density-based to uncertainty-based sampling by estimating the derivative of expected residual error reduction

(Figure: which point to sample? Grey = unlabeled, red = class A, brown = class B)
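The single-criterion strategies above can be sketched as scoring functions over an unlabeled pool. A minimal sketch: the function names and the kNN-style density estimate are illustrative, not from the talk.

```python
import numpy as np

def uncertainty_scores(probs):
    # Uncertainty sampling: 1 - max posterior, highest near the decision boundary.
    return 1.0 - probs.max(axis=1)

def density_scores(X, k):
    # kNN-inspired density: inverse mean distance to the k nearest pool points.
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)
    return 1.0 / (np.sort(d, axis=1)[:, :k].mean(axis=1) + 1e-12)

def diversity_scores(X, X_labeled):
    # Maximal diversity: distance to the closest already-labeled point.
    d = np.sqrt(((X[:, None, :] - X_labeled[None, :, :]) ** 2).sum(-1))
    return d.min(axis=1)
```

Each strategy queries the argmax of its score; ensemble strategies such as DUAL combine several scores.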
Density-Based Sampling
Centroid of largest unsampled cluster
Uncertainty Sampling
Closest to decision boundary
Maximal Diversity Sampling
Maximally distant from labeled x’s
Ensemble-Based Possibilities
Uncertainty + Diversity criteria
Density + uncertainty criteria
Strategy Selection: No Universal Optimum
• Optimal operating range for AL sampling strategies differs
• How to get the best of both worlds?
• (Hint: ensemble methods, e.g. DUAL)
How does DUAL do better?
- Runs DWUS (density-weighted uncertainty sampling) until it estimates a cross-over point, monitoring the change in expected error at each iteration to detect when DWUS is stuck in a local minimum:
  (∂/∂t) ε̂_DWUS(t) = (∂/∂t) (1/n_t) Σ_{i∈I_U} E[(ŷ_i − y_i)² | x_i] ≈ 0
- After the cross-over (saturation) point, DUAL uses a mixture model:
  x_s* = argmax_{i∈I_U} π · E[(ŷ_i − y_i)² | x_i] + (1 − π) · p(x_i)
- The goal is to minimize the expected future error. If we knew the future error of Uncertainty Sampling (US) to be zero, we would force π = 1; in practice we do not know it.
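The cross-over test and the post-cross-over mixture can be sketched as follows. A minimal sketch: the tolerance `tol` and the fixed mixture weight `pi` stand in for DUAL's estimated quantities.

```python
import numpy as np

def crossed_over(mean_err_history, tol=1e-3):
    # DWUS has saturated when its expected-error curve flattens out,
    # i.e. the finite-difference derivative is approximately zero.
    return len(mean_err_history) >= 2 and \
        abs(mean_err_history[-1] - mean_err_history[-2]) < tol

def dual_select(expected_error, density, pi):
    # x* = argmax_i  pi * E[(y_hat_i - y_i)^2 | x_i] + (1 - pi) * p(x_i)
    return int(np.argmax(pi * expected_error + (1.0 - pi) * density))
```

With pi = 1 this degenerates to pure uncertainty-driven selection; with pi = 0, to pure density-driven selection.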
More on DUAL [ECML 2007]
- After the cross-over, US does better => the uncertainty score should be given more weight
- The weight should reflect how well US performs; it can be calculated from the expected error ε̂(US) of US on the unlabeled data*
- Finally, the selection criterion for DUAL:
  x_s* = argmax_{i∈I_U} (1 − ε̂(US)) · E[(ŷ_i − y_i)² | x_i] + ε̂(US) · p(x_i)

* US is allowed to choose data only from among the already sampled instances, and ε̂(US) is calculated on the remaining unlabeled set.
Results: DUAL vs DWUS
Beyond DUAL
- Paired sampling with geodesic density estimation (Donmez & Carbonell, SIAM 2008)
- Active rank learning: search results (Donmez & Carbonell, WWW 2008); in general (Donmez & Carbonell, ICML 2008)
- Structure learning: inferring 3D protein structure from 1D sequence remains an open problem
Active Sampling for RankSVM I
- Consider a candidate instance x. Assume x is added to the training set with its label.
- The total loss on the pairs that include x is computed over the n training instances with a different label than x.
- The objective function to be minimized becomes the RankSVM objective over the enlarged training set.
Active Sampling for RankSVM II
- Assume the current ranking function is f̂. For each candidate pair there are two possible cases, depending on whether the pair is correctly ranked.
- Take the derivative of the loss w.r.t. the ranker at a single point for each case.
Active Sampling for RankSVM III
- Substitute into the previous equation to estimate the change in the ranker.
- The magnitude of the total derivative estimates the ability of the candidate to change the current ranker if added into training.
- Finally, the candidate with the largest magnitude is sampled.
Active Sampling for RankBoost I
- Again, estimate how the current ranker would change if the candidate were in the training set
- Estimate this change by the difference in ranking loss before and after the candidate is added
- The ranking loss w.r.t. the current ranker is the exponential pairwise loss of Freund et al. (2003)
Active Sampling for RankBoost II
- The difference in the ranking loss between the current and the enlarged set indicates how much the current ranker needs to change to compensate for the loss introduced by the new instance
- Finally, the instance with the highest loss differential is sampled
Results on TREC03
Active vs Proactive Learning

Number of oracles:  Active: individual (only one)  |  Proactive: multiple, with different capabilities, costs, and areas of expertise
Reliability:        Active: infallible (100% right)  |  Proactive: variable across oracles and queries, depending on difficulty, expertise, ...
Reluctance:         Active: indefatigable (always answers)  |  Proactive: variable across oracles and queries, depending on workload, certainty, ...
Cost per query:     Active: invariant (free or constant)  |  Proactive: variable across oracles and queries, depending on workload, difficulty, ...

Note: "Oracle" ∈ {expert, experiment, computation, ...}
Active Learning is Awesome, but ... is it Enough?
Traditional active learning assumes a single perfect source with a fixed labeling cost, fixed over time. Proactive learning goes beyond it [CIKM '08, JMLR '09, KDD '09, SDM '10]:
- Multiple sources, with differing expertise, labeling noise, and answer reluctance
- A varying-cost model, reflecting task difficulty, ambiguity, and expertise level
- Time-varying rather than fixed oracle properties
Scenario 1: Reluctance
- 2 oracles:
  - reliable oracle: expensive, but always answers with a correct label
  - reluctant oracle: cheap, but may not respond to some queries
- Define a utility score as the expected value of information at unit cost:
  U(x, k) = P(ans | x, k) · V(x) / C_k
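Joint selection of instance and oracle under this utility can be sketched as follows. A minimal sketch: the dictionary-based interface is illustrative.

```python
def best_query(P_ans, V, C):
    # Maximize U(x, k) = P(ans | x, k) * V(x) / C_k over instances x and oracles k.
    best, best_u = None, float("-inf")
    for k, probs in P_ans.items():
        for i, p in enumerate(probs):
            u = p * V[i] / C[k]
            if u > best_u:
                best, best_u = (i, k), u
    return best, best_u
```

With equal value of information, a cheap reluctant oracle can beat an expensive reliable one.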
How to estimate P̂(ans | x, k)?
- Cluster the unlabeled data using k-means and ask the reluctant oracle for the label of each cluster centroid x_c
  - If a label is received: increase P̂(ans | x, reluctant) for nearby points
  - If no label: decrease P̂(ans | x, reluctant) for nearby points
- Concretely, each centroid's response h(x_c, y_c) ∈ {+1, −1} (+1 when a label is received, −1 otherwise) shifts P̂(ans | x, reluctant) away from 0.5 by an amount that decays exponentially with the distance ||x − x_c||, normalized by a factor Z
- The number of clusters depends on the clustering budget and the oracle fee
Algorithm for Scenario 1
Scenario 2: Fallibility
- Two oracles:
  - one perfect but expensive oracle
  - one fallible but cheap oracle, which always answers
- Algorithm similar to Scenario 1, with slight modifications
- During exploration, the fallible oracle provides the label together with its confidence P̂(y | x)
- If P̂(y | x) ∈ [0.45, 0.5], we do not use the label, but we still update P̂(correct | x, k)
Scenario 3: Non-uniform Cost
- Uniform cost: fraud detection, face recognition, etc.
- Non-uniform cost: text categorization, medical diagnosis, protein structure prediction, etc.
- 2 oracles: a fixed-cost oracle, and a variable-cost oracle whose fee grows with labeling difficulty:
  C_non-unif(x) = 1 − ( |Y| · max_{y∈Y} P̂(y | x) − 1 ) / ( |Y| − 1 )
  i.e., cost 0 for a fully confident prediction and cost 1 when the posterior is uniform over the |Y| classes
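The difficulty-based cost can be sketched in a few lines. The exact normalization is reconstructed from the slide, so treat it as an assumption.

```python
def nonuniform_cost(posterior):
    # Cost 0 for a fully confident posterior, cost 1 for a uniform one:
    # C(x) = 1 - (|Y| * max_y P(y|x) - 1) / (|Y| - 1)
    Y = len(posterior)
    return 1.0 - (Y * max(posterior) - 1.0) / (Y - 1.0)
```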
Underlying Sampling Strategy
- Conditional entropy based sampling, weighted by a density measure over the close neighbors N(x) of each candidate x:
  U(x) = −log min_{y∈{−1,+1}} P̂(y | x, ŵ) · Σ_{x_k∈N(x)} exp( −min_{y∈{−1,+1}} P̂(y | x_k, ŵ) )
- Captures the information content of a close neighborhood
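For the binary case, a sketch of this kind of utility, substituting plain binary entropy for the slide's exact uncertainty and neighborhood terms (an assumption on my part):

```python
import numpy as np

def binary_entropy(p):
    # Entropy of a Bernoulli posterior, in bits.
    p = float(np.clip(p, 1e-12, 1.0 - 1e-12))
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def utility(p_x, p_neighbors):
    # Candidate uncertainty, weighted by the summed information content
    # of its close neighbors N(x).
    return binary_entropy(p_x) * sum(binary_entropy(p) for p in p_neighbors)
```

A maximally uncertain candidate in an uncertain neighborhood scores highest.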
Results: Reluctance

Results: Cost Varies Non-uniformly
- Improvements are statistically significant (p < 0.01)
Sequential Bayesian Filtering [SDM '10]
- Track the states of multiple systems (oracles) as each evolves over time
- Sequentially arriving observations (noisy labels)
- Goal: estimate the posterior distribution of each oracle's accuracy, which changes with time t
A Closer Look at the Model [SDM '10]
- Predict → Update → Predictor Selection, based on the accuracy at the last time the predictor was selected
- There is a chance that the accuracy might have increased since then
- Our belief about the accuracy diverges over time as the source goes unexplored
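A discretized sequential Bayes filter over an oracle's accuracy captures both effects: the predict step diffuses belief (divergence while the source goes unexplored), and the update step sharpens it on observations. A minimal sketch: the accuracy grid, drift rate, and random-walk dynamics are illustrative assumptions.

```python
import numpy as np

def predict(belief, drift=0.05):
    # Random-walk transition over discretized accuracy states: without new
    # observations, the belief spreads out over time.
    n = len(belief)
    new = np.zeros(n)
    for i, p in enumerate(belief):
        stay = 1.0 - drift * ((i > 0) + (i < n - 1))
        new[i] += stay * p
        if i > 0:
            new[i - 1] += drift * p
        if i < n - 1:
            new[i + 1] += drift * p
    return new / new.sum()

def update(belief, grid, correct):
    # Bayes update: the likelihood of a correct label equals the accuracy state.
    like = grid if correct else 1.0 - grid
    post = belief * like
    return post / post.sum()
```

Each observed correct label shifts the posterior mean accuracy upward; each predict step without an observation flattens the belief.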
(Figure: red = true accuracy, blue = estimated, black = MLE. Copyright 2009 Pinar Donmez)
Does Tracking Predictor Accuracy Actually Help in Proactive Learning? [SDM '10]
Proactive Learning in General
- Multiple experts (a.k.a. oracles) with different areas of expertise, costs, reliabilities, and availability
- What question to ask, and whom to query? Joint optimization of query & oracle selection
- Referrals among oracles (with referral fees)
- Learn about oracle capabilities while solving the active learning problem at hand
- Non-static oracle properties
Current Issues in Proactive Learning
- Large numbers of oracles [Donmez, Carbonell & Schneider, KDD 2009], based on a multi-armed bandit approach
- Non-stationary oracles [Donmez, Carbonell & Schneider, SDM 2010]: expertise changes with time (improves or decays); exploration vs. exploitation tradeoff
- What if the labeled set is empty for some classes? Minority class discovery (unsupervised) [He & Carbonell, NIPS 2007, SIAM 2008, SDM 2009]
- After first-instance discovery: proactive learning, or minority-class characterization [He & Carbonell, SIAM 2010]
Minority Classes vs Outliers
- Rare classes: a group of points, clustered, non-separable from the majority classes
- Outliers: a single point, scattered, separable
The Big Picture
(Flow: raw data → feature extraction (relational, temporal) → feature representation → rare category detection on the unbalanced, unlabeled data set → learning in unbalanced settings → classifier)
Minority Class Discovery Method
1. Calculate a problem-specific similarity scale a
2. For each x_i ∈ S: NN(x_i, a) = {x ∈ A : ||x − x_i|| ≤ a}, and n_i = |NN(x_i, a)|
3. s_i = max_{x_j ∈ NN(x_i, a_t)} (n_i − n_j)
4. Query x = argmax_{x_i ∈ S} s_i and obtain relevance feedback
5. Is x from a new class? If yes, go to 6; if no, increase t by 1 and go to 3
6. Output x
7. Budget exhausted? If no, continue discovering further classes
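Steps 2-4 can be sketched directly. A minimal sketch: a brute-force distance matrix stands in for the paper's data structures, and S = A here.

```python
import numpy as np

def discovery_scores(X, a):
    # n_i = |{x : ||x - x_i|| <= a}|;  s_i = max over neighbors of (n_i - n_j).
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    counts = (d <= a).sum(axis=1)
    scores = np.empty(len(X))
    for i in range(len(X)):
        nbrs = np.where(d[i] <= a)[0]
        scores[i] = (counts[i] - counts[nbrs]).max()
    return scores

def query(X, a):
    # Step 4: query the point with the sharpest local-density differential.
    return int(np.argmax(discovery_scores(X, a)))
```

Points on the boundary between a dense region and a sparse one score highest, which is where compact rare classes sit inside a majority background.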
Summary of Real Data Sets
- Abalone: 4177 examples, 7-dimensional features, 20 classes; largest class 16.50%, smallest class 0.34%
- Shuttle: 4515 examples, 9-dimensional features, 7 classes; largest class 75.53%, smallest class 0.13%
Results on Real Data Sets
(Figures: Abalone and Shuttle; MALICE vs. Interleave vs. random sampling)
Active Learning for MT
(Flow: source-language corpus → active learner selects sentences S → expert translator produces pairs (S, T) → sampled corpus joins the parallel corpus → model trainer → MT system)
ACT Framework: Active Crowd Translation
(Flow: source-language corpus → sentence selection → crowd translators produce candidate pairs (S, T1), (S, T2), ..., (S, Tn) → translation selection → model trainer → MT system)
Active Learning Strategy: Diminishing Density Weighted Diversity Sampling
- density(s) = Σ_{x ∈ Phrases(s)} [ P(x | U) · e^{−count(x | L)} ] / |Phrases(s)|
  (phrases frequent in the untranslated pool U, discounted exponentially by how often they already occur in the translated data L)
- diversity(s) = Σ_{x ∈ Phrases(s)} count(x) · 1[x ∉ L] / |Phrases(s)|
  (the indicator is 1 if the phrase is absent from L, 0 otherwise)
- Score(s) = 2 · density(s) · diversity(s) / (density(s) + diversity(s))

Experiments: language pair Spanish-English; 20 iterations, batch size 1000 sentences each; translation with Moses phrase-based SMT; development set 343 sentences, test set 506 sentences. Graph: X = performance (BLEU), Y = data (thousand words).
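The phrase-level density and diversity terms can be sketched as follows. The n-gram phrase extractor and the harmonic-mean combination are assumptions where the slide is ambiguous.

```python
import math
from collections import Counter

def phrases(sentence, max_n=2):
    # Illustrative phrase extractor: all n-grams up to length max_n.
    toks = sentence.split()
    return [tuple(toks[i:i + n]) for n in range(1, max_n + 1)
            for i in range(len(toks) - n + 1)]

def ddwds_score(sentence, u_counts, l_counts):
    ps = phrases(sentence)
    total_u = sum(u_counts.values()) or 1
    # Density: phrase mass in the untranslated pool U, discounted
    # exponentially by occurrences already in the translated data L.
    density = sum((u_counts[p] / total_u) * math.exp(-l_counts[p])
                  for p in ps) / len(ps)
    # Diversity: pool counts of phrases not yet covered by L.
    diversity = sum(u_counts[p] for p in ps if l_counts[p] == 0) / len(ps)
    if density + diversity == 0:
        return 0.0
    return 2 * density * diversity / (density + diversity)
```

A sentence whose phrases are already covered by the translated data scores near zero; sentences with frequent but untranslated phrases score highest.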
Translation Selection from AMT (Amazon Mechanical Turk)
- Translator reliability
- Translation selection
Parting Thoughts
- Proactive learning: a new field, just started. New work and full details in the Donmez dissertation. Applications abound: e-science (computational biology), finance, network security, language technologies (MT), ...
- Theory still in the making (e.g., Liu Yang). Open challenge: proactive structure learning
- Rare class discovery and classification dovetails with active/proactive learning. New work and full details in the Jingrui He dissertation
THANK YOU!
Specially Designed Exponential Families [Efron & Tibshirani 1996]
- A favorable compromise between parametric and nonparametric density estimation
- Estimated density:
  ĝ(x) = g₀(x) · exp( β₀ + β₁ᵀ t(x) )
  where g₀(x) is the carrier density, β₀ the normalizing parameter, β₁ a p×1 parameter vector, and t(x) a p×1 vector of sufficient statistics
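Evaluating such a density is direct once β₀ is chosen to normalize it. A minimal sketch: the standard-normal carrier and t(x) = (x,) are illustrative choices.

```python
import numpy as np

def exp_family_density(x, g0, beta0, beta1, t):
    # g_hat(x) = g0(x) * exp(beta0 + beta1 . t(x))
    return g0(x) * np.exp(beta0 + np.dot(beta1, t(x)))

# Illustrative carrier: standard normal; sufficient statistic: t(x) = (x,)
g0 = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
t = lambda x: np.array([x])
```

With beta1 = 0 and beta0 = 0 the estimate reduces to the carrier itself; a nonzero beta1 exponentially tilts it toward regions with large t(x).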
SEDER Algorithm
- Carrier density: a kernel density estimator
- Sufficient statistics: t(x) = (x₁², ..., x_d²)ᵀ
- To decouple the estimation of the different parameters, decompose β₀ = Σ_{j=1}^d β₀^j, so the estimated density factors into per-dimension components, each combining the kernel in dimension j with exp(β₀^j + β₁^j x_j²)
- Relax the normalization constraint so that each dimension can be estimated independently
Parameter Estimation
Theorem 3 [SDM 2009]: for j = 1, ..., d, the maximum likelihood estimates β̂₀^j and β̂₁^j of β₀^j and β₁^j satisfy moment-matching conditions: the empirical average of x_j² over the n observations equals its expectation E[(x^j)²] under the estimated per-dimension density, where each observation contributes a kernel weight proportional to exp( −(x_k^j − x_i^j)² / (2σ_j²) ).
Parameter Estimation (cont.)
- Let b_j = 1 / (1 − 2 β̂₁^j σ_j²), a positive parameter, for j = 1, ..., d
- Then b̂_j solves a quadratic: b̂_j = ( −B + sqrt(B² − 4AC) ) / (2A), where A, B, and C are kernel-weighted moment statistics of the j-th coordinate, with C depending on (1/n) Σ_k (x_k^j)²
- In most cases b̂_j > 1
Scoring Function
- The estimated density is a product-kernel mixture:
  g̃(x) = (1/n) Σ_{i=1}^n Π_{j=1}^d ( 1 / (√(2π) b_j σ_j) ) · exp( −(x^j − b_j x_i^j)² / (2 b_j² σ_j²) )
- Scoring function: the norm of the density gradient at each candidate x_k,
  s_k = || ∇ g̃(x_k) ||
  so points where the estimated density changes abruptly (candidate rare-class boundaries) score highest
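The gradient-norm score for a Gaussian product-kernel estimate can be sketched as follows. A minimal sketch: a single shared bandwidth h stands in for the per-dimension b_j σ_j.

```python
import numpy as np

def grad_norm_score(x, data, h):
    # Norm of the gradient of a Gaussian KDE at x: high where the estimated
    # density changes abruptly, i.e. at candidate rare-class boundaries.
    n, d = data.shape
    z = (x[None, :] - data) / h
    w = np.exp(-0.5 * (z ** 2).sum(axis=1))
    norm = (2 * np.pi) ** (d / 2) * h ** d
    grad = -(w[:, None] * (x[None, :] - data) / h ** 2).mean(axis=0) / norm
    return np.linalg.norm(grad)
```

At the center of a symmetric cloud the kernel gradients cancel; off-center points, where the density slopes, score higher.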
Summary of Real Data Sets

Data Set     n     d   m   Largest Class  Smallest Class
Ecoli        336   7   6   42.56%         2.68%
Glass        214   9   6   35.51%         4.21%
Page Blocks  5473  10  5   89.77%         0.51%
Abalone      4177  7   20  16.50%         0.34%
Shuttle      4515  9   7   75.53%         0.13%

(Ecoli and Glass: moderately skewed; Page Blocks, Abalone, and Shuttle: extremely skewed)
Moderately Skewed Data Sets
(Figures: Ecoli and Glass; MALICE)
GRADE: Full Prior Information
1. For each rare class c, 2 ≤ c ≤ m:
2.   Calculate the class-specific similarity scale a^c
3.   For each x_i ∈ S: NN(x_i, a^c) = {x ∈ A : ||x − x_i|| ≤ a^c}, and n_i^c = |NN(x_i, a^c)|
4.   s_i^c = max_{x_j ∈ NN(x_i, a^c_t)} (n_i^c − n_j^c)
5.   Query x = argmax_{x_i ∈ S} s_i^c and obtain relevance feedback
6.   Is x from class c? If yes, go to 7; if no, increase t by 1 and go to 4
7.   Output x
Results on Real Data Sets
(Figures: Ecoli, Glass, Abalone, Shuttle; MALICE)
Performance Measures
- MAP (Mean Average Precision): the average of the AP values over all queries
- NDCG (Normalized Discounted Cumulative Gain): the impact of each relevant document is discounted as a function of its rank position
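Both measures can be computed in a few lines; a standard sketch with binary relevance for AP and graded gains for NDCG:

```python
import math

def average_precision(relevant):
    # relevant: binary relevance flags of the ranked results, in rank order.
    hits, total = 0, 0.0
    for rank, r in enumerate(relevant, start=1):
        if r:
            hits += 1
            total += hits / rank
    return total / max(hits, 1)

def mean_average_precision(queries):
    # MAP: average of AP over all queries.
    return sum(average_precision(q) for q in queries) / len(queries)

def ndcg(gains, k=None):
    # Each document's gain is discounted by log2(rank + 1), and the DCG is
    # normalized by the DCG of the ideal (sorted) ranking.
    k = k or len(gains)
    dcg = sum(g / math.log2(r + 1) for r, g in enumerate(gains[:k], start=1))
    ideal = sum(g / math.log2(r + 1)
                for r, g in enumerate(sorted(gains, reverse=True)[:k], start=1))
    return dcg / ideal if ideal > 0 else 0.0
```

A perfectly ordered ranking has NDCG = 1; moving relevant documents down any ranking lowers both measures.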