1
Introduction to Machine Learning
Active Learning
Barnabás Póczos
2
Credits
Some of the slides are taken from Nina Balcan.
3
Modern applications: massive amounts of raw data.
Only a tiny fraction can be annotated by human experts.
Billions of webpages Images
Classic Supervised Learning Paradigm is Insufficient Nowadays
Sensor measurements
4
Modern applications: massive amounts of raw data.
Expert
We need techniques that minimize the need for expert/human
intervention ⇒ Active Learning
Modern ML: New Learning Approaches
The Large Synoptic
Survey Telescope
15 Terabytes of data …
every night
5
Active Learning Intro
▪ Batch Active Learning vs Selective Sampling Active Learning
▪ Exponential Improvement on # of labels
▪ Sampling bias: Active Learning can hurt performance
Active Learning with SVM
Gaussian Processes
▪ Regression
▪ Properties of Multivariate Gaussian distributions
▪ Ridge regression
▪ GP = Bayesian Ridge Regression + Kernel trick
Active Learning with Gaussian Processes
Contents
6
• Two faces of active learning. Sanjoy Dasgupta. 2011.
• Active Learning. Burr Settles. 2012.
• Active Learning. Balcan-Urner. Encyclopedia of Algorithms. 2015.
Additional resources
7
(Figure: the Data Source provides a set of unlabeled examples to the Learning Algorithm; the algorithm repeatedly sends a "Request for the Label of an Example" to the Expert and receives "A Label for that Example" in return; finally the algorithm outputs a classifier w.r.t. D.)
• Learner can choose specific examples to be labeled.
• Goal: use fewer labeled examples
[pick informative examples to be labeled].
Underlying data distr. D.
Batch Active Learning
8
(Figure: unlabeled examples 𝑥1, 𝑥2, 𝑥3 arrive one by one from the Data Source; for each, the Learning Algorithm decides "Request label" or "Let it go"; the Expert returns a label 𝑦1 for example 𝑥1 and a label 𝑦3 for example 𝑥3; the algorithm finally outputs a classifier w.r.t. D.)
• Selective sampling AL (Online AL): stream of unlabeled examples,
when each arrives make a decision to ask for label or not.
• Goal: use fewer labeled examples [pick informative examples to be labeled].
Underlying data distr. D.
Selective Sampling Active Learning
9
• Need to choose the label requests carefully, to get
informative labels.
• Guaranteed to output a relatively good classifier for
most learning problems.
• Doesn’t make too many label requests.
Hopefully a lot less than passive learning.
What Makes a Good Active Learning Algorithm?
10
• YES! (sometimes)
• We often need far fewer labels for active learning
than for passive.
• This is predicted by theory and has been observed
in practice.
Can adaptive querying really do better than passive/random sampling?
11
• Threshold fns on the real line: h_w(x) = 1(x ≥ w), C = {h_w : w ∈ ℝ}
(Figure: points on the real line, labeled − to the left of the threshold w and + to the right.)
Active Algorithm
• Get N unlabeled examples; with N = O(1/ϵ) we are guaranteed to get a classifier of error ≤ ϵ.
• How can we recover the correct labels with ≪ N queries?
• Do binary search (query at half)! Just need O(log N) labels!
• Output a classifier consistent with the N inferred labels.
Active: only O(log 1/ϵ) labels.
Passive supervised: Ω(1/ϵ) labels to find an ϵ-accurate threshold.
Exponential improvement.
Can adaptive querying help? [CAL92, Dasgupta04]
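To make the binary-search idea concrete, here is a minimal Python sketch; the oracle function, the uniform data, and the threshold value 0.3 are illustrative assumptions, not part of the original slide.

```python
import numpy as np

def active_threshold_learning(xs, oracle):
    """Binary-search label queries for 1D threshold functions.

    xs     : unlabeled points on the real line
    oracle : function x -> label in {-1, +1} (the human expert)
    Returns an estimated threshold using only O(log N) label queries.
    """
    xs = np.sort(xs)
    lo, hi = 0, len(xs) - 1
    queries = 0
    # Invariant (assuming the data is consistent with some threshold w):
    # everything left of lo is negative, everything right of hi is positive.
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(xs[mid]) == -1:
            lo = mid + 1      # threshold lies to the right of xs[mid]
        else:
            hi = mid          # threshold lies at or left of xs[mid]
    return xs[lo], queries

# Usage: N = 1000 unlabeled points, true threshold w = 0.3 (hypothetical example).
rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, size=1000)
w_hat, queries = active_threshold_learning(xs, oracle=lambda x: 1 if x >= 0.3 else -1)
print(w_hat, queries)   # w_hat close to 0.3, queries around log2(1000) ~ 10
```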
12
Uncertainty sampling in SVMs is common and quite useful in practice.
• At any time during the alg., we have a “current guess” wt
of the separator: the max-margin separator of all labeled
points so far.
• Request the label of the example closest to the current
separator.
E.g., [Tong & Koller, ICML 2000; Jain, Vijayanarasimhan & Grauman, NIPS 2010;
Schohn & Cohn, ICML 2000]
Active SVM Algorithm
Active SVM
13
Active SVM seems to be quite useful in practice.
Algorithm (batch version)
Input: S_u = {x1, …, x_mu} drawn i.i.d. from the underlying source D
Start: query for the labels of a few random 𝑥𝑖s.
For 𝒕 = 𝟏, …:
• Find 𝑤𝑡, the max-margin separator of all labeled points so far.
• Request the label of the example closest to the current separator: minimizing |𝑥𝑖 ⋅ 𝑤𝑡| (highest uncertainty).
[Tong & Koller, ICML 2000; Jain, Vijayanarasimhan & Grauman, NIPS 2010]
Active SVM
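A minimal sketch of the batch uncertainty-sampling loop above, assuming a linear SVM from scikit-learn; the helper names (oracle, n_seed, n_queries) and the choice of LinearSVC with a large C to approximate the max-margin separator are my assumptions, not from the slides.

```python
import numpy as np
from sklearn.svm import LinearSVC

def active_svm(X_unlabeled, oracle, n_seed=5, n_queries=20, rng=None):
    """Batch uncertainty sampling with a linear SVM.

    X_unlabeled : (m_u, d) pool drawn i.i.d. from D
    oracle      : function returning the label of a requested example
    """
    rng = rng or np.random.default_rng(0)
    # Seed with a few random labels (assumed to contain both classes).
    labeled_idx = list(rng.choice(len(X_unlabeled), size=n_seed, replace=False))
    labels = {i: oracle(X_unlabeled[i]) for i in labeled_idx}

    for t in range(n_queries):
        # Max-margin separator of all labeled points so far (large C ~ hard margin).
        clf = LinearSVC(C=1e3).fit(X_unlabeled[labeled_idx],
                                   [labels[i] for i in labeled_idx])
        # Distance of every pool point to the current separator.
        margins = np.abs(clf.decision_function(X_unlabeled)) / np.linalg.norm(clf.coef_)
        margins[labeled_idx] = np.inf            # never re-query a labeled point
        i_star = int(np.argmin(margins))         # highest-uncertainty example
        labels[i_star] = oracle(X_unlabeled[i_star])
        labeled_idx.append(i_star)

    # Refit on everything labeled so far and return the classifier.
    clf = LinearSVC(C=1e3).fit(X_unlabeled[labeled_idx],
                               [labels[i] for i in labeled_idx])
    return clf, labeled_idx
```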
14
• Uncertainty sampling works sometimes….
• However, we need to be very very very careful!!!
• Myopic, greedy techniques can suffer from sampling bias.
(The active learning algorithm samples from a different (x,y)
distribution than the true data)
• A bias created because of the querying strategy; as time goes
on the sample is less and less representative of the true data
source. [Dasgupta10]
DANGER!!!
15
DANGER!!!
• Main tension: want to choose informative points, but also want to guarantee that the classifier we output does well on true random examples from the underlying distribution.
• Observed in practice too!!!!
16
It is an interesting open question to analyze under
what conditions these techniques are successful.
Other Interesting Active Learning Techniques used in Practice
17
Centroid of largest unsampled cluster
[Jaime G. Carbonell]
Density-Based Sampling
18
Closest to decision boundary (Active SVM)
[Jaime G. Carbonell]
Uncertainty Sampling
19
Maximally distant from labeled x’s
[Jaime G. Carbonell]
Maximal Diversity Sampling
20
Uncertainty + Diversity criteria
Density + uncertainty criteria
[Jaime G. Carbonell]
Ensemble-Based Possibilities
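One possible way to turn the heuristics on these slides (density, uncertainty, diversity, and their combinations) into a single query score; the particular weighting and the Gaussian similarity used for density are assumptions made for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

def hybrid_query_score(X_pool, X_labeled, clf, alpha=0.5, beta=0.5):
    """Score unlabeled points by combining uncertainty, density and diversity.

    uncertainty : small distance to the current decision boundary
    density     : many close neighbours in the pool (representative points)
    diversity   : far from the already-labeled examples
    """
    # Uncertainty: inverse distance to the decision boundary of a fitted linear classifier.
    margin = np.abs(clf.decision_function(X_pool))
    uncertainty = 1.0 / (1e-8 + margin)

    # Density: average Gaussian similarity to the rest of the pool.
    pool_dists = cdist(X_pool, X_pool)
    density = np.exp(-pool_dists).mean(axis=1)

    # Diversity: distance to the nearest labeled example.
    diversity = cdist(X_pool, X_labeled).min(axis=1)

    return uncertainty + alpha * density + beta * diversity   # query the argmax
```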
21
Active learning can be really helpful and can provide
exponential improvements in label complexity (both
theoretically and practically)!
Need to be very careful due to sampling bias.
Common heuristics (e.g., those based on uncertainty sampling).
What You Should Know so far
22
Gaussian Processes for Regression
23
http://www.gaussianprocess.org/
Some of these slides are taken from D. Lizotte, R. Parr, C. Guestrin
Additional resources
24
• Nonmyopic Active Learning of Gaussian Processes: An Exploration–Exploitation Approach. A. Krause and C. Guestrin, ICML 2007.
• Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. A. Krause, A. Singh, and C. Guestrin, Journal of Machine Learning Research 9 (2008).
• Bayesian Active Learning for Posterior Estimation. K. Kandasamy, J. Schneider, and B. Póczos, International Joint Conference on Artificial Intelligence (IJCAI), 2015.
Additional resources
25
Why GPs for Regression?
Regression methods:
Linear regression, multilayer perceptron, ridge regression, support vector regression, kNN regression, etc.
Motivation:
All of the above regression methods give only point estimates. We would like a method that can also provide a confidence estimate along with the prediction.
Application in Active Learning:
GPs can be used for active learning: query the label of the point where the predictive uncertainty is highest.
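As a preview, a minimal sketch of this querying rule using scikit-learn's GP regressor; the RBF kernel and the noise level alpha are illustrative choices, not part of the slides.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gp_active_query(X_labeled, y_labeled, X_candidates):
    """Pick the candidate input where the GP's predictive uncertainty is highest."""
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
    gp.fit(X_labeled, y_labeled)
    _, std = gp.predict(X_candidates, return_std=True)
    return X_candidates[np.argmax(std)]     # query the label of this point next
```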
26
• Here’s where the function will most likely be (the expected function).
• Here are some examples of what it might look like (samples from the posterior distribution, e.g., the blue, red, and green functions).
• Here is a prediction of what you’ll see if you evaluate your function at x’, with a confidence interval.
GPs can answer the following questions:
Why GPs for Regression?
27
Properties of Multivariate Gaussian
Distributions
28
1D Gaussian Distribution
Parameters:
• Mean, μ
• Variance, σ²
29
Multivariate Gaussian
30
A 2-dimensional Gaussian is defined by
• a mean vector μ = [μ₁, μ₂]
• a covariance matrix:
Σ = ( σ²₁,₁  σ²₁,₂ ; σ²₂,₁  σ²₂,₂ )
where σ²ᵢ,ⱼ = E[(xᵢ − μᵢ)(xⱼ − μⱼ)] is the (co)variance.
Note: Σ is symmetric, “positive semi-definite”: ∀x: xᵀΣx ≥ 0
Multivariate Gaussian
31
Multivariate Gaussian examples
μ = (0, 0), Σ = ( 1  0.8 ; 0.8  1 )
32
Multivariate Gaussian examples
μ = (0, 0), Σ = ( 1  0.8 ; 0.8  1 )
33
Marginal distributions of Gaussians are Gaussian
Given:
(x_a, x_b) ~ N(μ, Σ), where μ = (μ_a, μ_b) and Σ = ( Σ_aa  Σ_ab ; Σ_ba  Σ_bb )
The marginal distribution is:
x_a ~ N(μ_a, Σ_aa)
Useful Properties of Gaussians
34
Marginal distributions of Gaussians are Gaussian
35
Block Matrix Inversion
Theorem
Definition: Schur complements
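The theorem referenced on this slide, written out in a standard form (the block notation below is my choice):

```latex
M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}, \qquad
M/D := A - B D^{-1} C \quad \text{(the Schur complement of } D \text{ in } M\text{)},

M^{-1} =
\begin{pmatrix}
(M/D)^{-1} & -(M/D)^{-1} B D^{-1} \\
-D^{-1} C (M/D)^{-1} & \; D^{-1} + D^{-1} C (M/D)^{-1} B D^{-1}
\end{pmatrix}.
```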
36
Conditional distributions of Gaussians are Gaussian
Notation:
(x_a, x_b) ~ N(μ, Σ), with μ = (μ_a, μ_b) and Σ = ( Σ_aa  Σ_ab ; Σ_ba  Σ_bb )
Conditional Distribution:
x_a | x_b ~ N( μ_a + Σ_ab Σ_bb⁻¹ (x_b − μ_b),  Σ_aa − Σ_ab Σ_bb⁻¹ Σ_ba )
Useful Properties of Gaussians
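A small numpy sketch of the conditioning formula above; the 2D example with correlation 0.8 is just for illustration.

```python
import numpy as np

def gaussian_conditional(mu, Sigma, idx_a, idx_b, x_b):
    """Conditional distribution p(x_a | x_b) of a jointly Gaussian vector.

    Implements  mu_{a|b}    = mu_a + S_ab S_bb^{-1} (x_b - mu_b)
                Sigma_{a|b} = S_aa - S_ab S_bb^{-1} S_ba
    """
    mu_a, mu_b = mu[idx_a], mu[idx_b]
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_ba = Sigma[np.ix_(idx_b, idx_a)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    K = S_ab @ np.linalg.inv(S_bb)              # "gain" matrix S_ab S_bb^{-1}
    mu_cond = mu_a + K @ (x_b - mu_b)
    Sigma_cond = S_aa - K @ S_ba
    return mu_cond, Sigma_cond

# Usage on a 2D Gaussian with correlation 0.8:
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
print(gaussian_conditional(mu, Sigma, idx_a=[0], idx_b=[1], x_b=np.array([1.0])))
# mean 0.8, variance 0.36: observing x_2 shifts and sharpens the belief about x_1
```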
37
Higher Dimensions
Visualizing > 3-dimensional Gaussian random variables is… difficult.
Means and variances of the marginal variables are practical, but then we don’t see the correlations between those variables.
Visualizing an 8-dimensional Gaussian variable f:
Marginals are Gaussian, e.g., f(6) ~ N(µ(6), σ²(6))
(Figure: component indices 1, …, 8 on the horizontal axis, with the marginal mean µ(6) and variance σ²(6) marked at index 6.)
38
Yet Higher Dimensions
Why stop there?
39
Getting Ridiculous
Why stop there?
40
Gaussian Process
Probability distribution indexed by an arbitrary set (integer, real, finite dimensional vector, etc)
Each element (indexed by x) is a Gaussian distribution over the reals with mean µ(x)
These distributions are dependent/correlated as defined by k(x,z)
Any finite subset of indices defines a multivariate Gaussian distribution
Definition of GP:
41
Gaussian Process
Distribution over functions….
If our regression model is a GP, then it won’t give just a point estimate anymore! It can provide regression estimates with confidence.
Domain (index set) of the functions can be pretty much whatever
• Reals
• Real vectors
• Graphs
• Strings
• Sets
• …
42
Bayesian Updates for GPs
• How can we do regression and learn the GP from data?
• We will be Bayesians today:
  • Start with a GP prior
• Get some data
• Compute a posterior
43
Samples from the prior distribution
Picture is taken from Rasmussen and Williams
44
Samples from the posterior distribution
Picture is taken from Rasmussen and Williams
45
Prior
Zero-mean Gaussians with covariance k(x,z)
46
Data
47
Posterior
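A minimal numpy sketch that produces the kind of prior and posterior sample curves shown on these slides; the squared-exponential kernel, the three training points, and the noise level are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance k(x, z) between two sets of points."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

rng = np.random.default_rng(0)
xs = np.linspace(-5, 5, 200)[:, None]          # inputs where the functions are evaluated
jitter = 1e-8 * np.eye(len(xs))

# Samples from the zero-mean GP prior with covariance k(x, z).
K_prior = rbf_kernel(xs, xs)
prior_samples = rng.multivariate_normal(np.zeros(len(xs)), K_prior + jitter, size=3)

# Condition on a few noisy observations to get posterior samples.
X_train = np.array([[-4.0], [-1.0], [2.0]])
y_train = np.sin(X_train).ravel()
sigma_n = 0.1
K_xx = rbf_kernel(X_train, X_train) + sigma_n ** 2 * np.eye(len(X_train))
K_sx = rbf_kernel(xs, X_train)

mean_post = K_sx @ np.linalg.solve(K_xx, y_train)
cov_post = K_prior - K_sx @ np.linalg.solve(K_xx, K_sx.T)
posterior_samples = rng.multivariate_normal(mean_post, cov_post + jitter, size=3)
```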
48
Ridge Regression
Linear regression:
Ridge regression:
The Gaussian Process is a Bayesian Generalization
of the kernelized ridge regression
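For reference, the two objectives mentioned on this slide in a standard form (the notation is my choice):

```latex
\text{Linear regression:}\quad
\hat{w} = \arg\min_w \sum_{i=1}^{N} \big(y_i - w^\top x_i\big)^2
\qquad
\text{Ridge regression:}\quad
\hat{w} = \arg\min_w \sum_{i=1}^{N} \big(y_i - w^\top x_i\big)^2 + \lambda \|w\|_2^2
```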
49
Weight Space View
GP = Bayesian ridge regression in feature space
+ Kernel trick to carry out computations
The training data
50
Bayesian Analysis of Linear Regression with Gaussian noise
Linear regression:
Linear regression
with noise:
51
Bayesian Analysis of Linear Regression with Gaussian noise
The likelihood:
52
Bayesian Analysis of Linear Regression with Gaussian noise
The prior:
Now, we can calculate the posterior:
53
Bayesian Analysis of Linear Regression with Gaussian noise
After “completing the square”
MAP estimation
Ridge Regression
54
Bayesian Analysis of Linear Regression with Gaussian noise
This posterior covariance matrix doesn’t depend on the observations y.
A strange property of Gaussian Processes.
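A standard way to write out the model and posterior discussed on these slides, in the usual weight-space notation where X is the d×N matrix whose columns are the training inputs (the exact notation is my choice, not copied from the slides):

```latex
y = w^\top x + \varepsilon,\quad \varepsilon \sim \mathcal{N}(0,\sigma_n^2),\qquad w \sim \mathcal{N}(0,\Sigma_p)

p(w \mid X, y) = \mathcal{N}\big(\bar{w},\, A^{-1}\big),\qquad
A = \sigma_n^{-2} X X^\top + \Sigma_p^{-1},\qquad
\bar{w} = \sigma_n^{-2} A^{-1} X y
```

The MAP estimate \bar{w} coincides with the ridge-regression solution, and the posterior covariance A⁻¹ involves only the inputs X, not the observations y, which is the property noted above.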
55
Projections of Inputs into Feature Space
Bayesian linear regression suffers from limited expressiveness.
To overcome the problem ⇒ go to a feature space and do linear regression there:
a) explicit features
b) implicit features (kernels)
56
Explicit Features
Linear regression in the feature space
57
Explicit Features
The predictive distribution after feature map:
Reminder: This is what we had without feature maps:
58
Explicit Features
Shorthands:
The predictive distribution after feature map:
59
Explicit Features
The predictive distribution after feature map: (*)
A problem with (*) is that it needs an N×N matrix inversion...
Theorem: (*) can be rewritten:
60
Proofs
• Mean expression. We need:
• Variance expression. We need:
Lemma:
Matrix inversion Lemma:
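The matrix inversion lemma referred to here is presumably the standard Woodbury identity, which lets one trade the large inversion in (*) for a smaller one:

```latex
(Z + U W V^\top)^{-1} \;=\; Z^{-1} - Z^{-1} U \big(W^{-1} + V^\top Z^{-1} U\big)^{-1} V^\top Z^{-1}
```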
61
From Explicit to Implicit Features
62
From Explicit to Implicit Features
The feature space always appears in the form of:
Lemma:
No need to know the explicit N dimensional features.
Their inner product is enough.
63
GP pseudo code
Inputs:
64
GP pseudo code (continued)
Outputs:
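Since the pseudo code itself is not reproduced in this text, here is a sketch of the standard Cholesky-based GP regression algorithm (in the spirit of Rasmussen & Williams, Algorithm 2.1); the function and variable names are my own.

```python
import numpy as np

def gp_regression(X, y, x_star, kernel, sigma_n):
    """GP regression: predictive mean, variance and log marginal likelihood.

    X       : (n, d) training inputs        y      : (n,) training targets
    x_star  : (d,) test input               kernel : function k(x, z) -> float
    sigma_n : noise standard deviation
    """
    n = len(X)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    L = np.linalg.cholesky(K + sigma_n ** 2 * np.eye(n))

    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))          # (K + sigma^2 I)^{-1} y
    k_star = np.array([kernel(xi, x_star) for xi in X])

    mean = k_star @ alpha                                        # predictive mean
    v = np.linalg.solve(L, k_star)
    var = kernel(x_star, x_star) - v @ v                         # predictive variance
    log_ml = (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
              - 0.5 * n * np.log(2 * np.pi))                     # log marginal likelihood
    return mean, var, log_ml

# Example kernel (squared exponential), purely for illustration:
# kernel = lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2))
```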
65
Results
66
Results using Netlab, Sin function
67
Results using Netlab, Sin function
Increased # of training points
68
Results using Netlab, Sin function
Increased noise
69
Results using Netlab, Sinc function
70
Applications: Sensor placement
Temperature modeling with GP
Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. A. Krause, A. Singh, and C. Guestrin, Journal of Machine Learning Research 9 (2008)
71
An example of placements chosen using entropy and mutual information
criteria on temperature data. Diamonds indicate the positions chosen
using entropy; squares the positions chosen using MI.
Applications: Sensor placement
Entropy criterion:
Mutual information criterion:
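The two placement criteria compared in the Krause-Singh-Guestrin paper, written out in a standard form (V is the set of candidate locations and k the number of sensors; the notation is my choice): the entropy criterion picks the k most uncertain locations, while the mutual information criterion picks the k locations that are most informative about the unsensed rest.

```latex
\mathcal{A}^{*}_{H} = \arg\max_{\mathcal{A}\subset\mathcal{V},\,|\mathcal{A}|=k} H\big(y_{\mathcal{A}}\big)
\qquad\qquad
\mathcal{A}^{*}_{MI} = \arg\max_{\mathcal{A}\subset\mathcal{V},\,|\mathcal{A}|=k} I\big(y_{\mathcal{A}};\, y_{\mathcal{V}\setminus\mathcal{A}}\big)
```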
Properties of Multivariate Gaussian distributions
Gaussian process = Bayesian Ridge Regression
GP Algorithm
GP application in active learning
What You Should Know
73
Thanks for the Attention! ☺