
1

Introduction to Machine Learning

Active Learning

Barnabás Póczos

2

Credits

Some of the slides are taken from Nina Balcan.

3

Classic Supervised Learning Paradigm is Insufficient Nowadays

Modern applications: massive amounts of raw data.

Only a tiny fraction can be annotated by human experts.

Examples: billions of webpages, images, sensor measurements.

4

Modern ML: New Learning Approaches

Modern applications: massive amounts of raw data.

Example: the Large Synoptic Survey Telescope produces 15 terabytes of data … every night.

We need techniques that minimize the need for expert/human intervention => Active Learning

5

Contents

Active Learning Intro

▪ Batch Active Learning vs Selective Sampling Active Learning

▪ Exponential improvement on # of labels

▪ Sampling bias: Active Learning can hurt performance

Active Learning with SVM

Gaussian Process Regression

▪ Properties of Multivariate Gaussian distributions

▪ Ridge regression

▪ GP = Bayesian Ridge Regression + Kernel trick

Active Learning with Gaussian Processes

6

Additional resources

• Two Faces of Active Learning. Sanjoy Dasgupta. 2011.

• Active Learning. Burr Settles. 2012.

• Active Learning. Balcan and Urner. Encyclopedia of Algorithms. 2015.

7

Batch Active Learning

[Diagram: the Learning Algorithm receives a set of unlabeled examples from the Data Source, repeatedly sends a request for the label of an example to the Expert and gets a label for that example back, and finally outputs a classifier w.r.t. D.]

Underlying data distr. D.

• Learner can choose specific examples to be labeled.

• Goal: use fewer labeled examples [pick informative examples to be labeled].

8

Selective Sampling Active Learning

[Diagram: unlabeled examples x1, x2, x3, … arrive one at a time from the Data Source; for each one the Learning Algorithm decides "request for label or let it go?"; e.g., it requests labels y1 and y3 for x1 and x3 from the Expert and lets x2 go, and finally outputs a classifier w.r.t. D.]

Underlying data distr. D.

• Selective sampling AL (Online AL): stream of unlabeled examples; when each arrives, make a decision to ask for its label or not.

• Goal: use fewer labeled examples [pick informative examples to be labeled].

9

What Makes a Good Active Learning Algorithm?

• Need to choose the label requests carefully, to get informative labels.

• Guaranteed to output a relatively good classifier for most learning problems.

• Doesn't make too many label requests: hopefully far fewer than passive learning.

10

Can adaptive querying really do better than passive/random sampling?

• YES! (sometimes)

• We often need far fewer labels for active learning than for passive.

• This is predicted by theory and has been observed in practice.

11

Can adaptive querying help? [CAL92, Dasgupta04]

• Threshold functions on the real line: $h_w(x) = \mathbb{1}(x \ge w)$, $C = \{h_w : w \in \mathbb{R}\}$.

Active Algorithm

• Get N unlabeled examples. For $N = O(1/\epsilon)$ we are guaranteed to get a classifier of error $\le \epsilon$.

• How can we recover the correct labels with ≪ N queries?

• Do binary search (query at the half-way point)! Just need O(log N) labels.

• Output a classifier consistent with the N inferred labels.

Passive supervised: $\Omega(1/\epsilon)$ labels to find an $\epsilon$-accurate threshold.

Active: only $O(\log 1/\epsilon)$ labels. Exponential improvement.
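To make the label-complexity claim concrete, here is a minimal sketch (an illustration, not from the slides) of the binary-search active learner for 1D thresholds; the pool size, the hidden threshold, and the helper names are assumptions.

```python
import numpy as np

def active_threshold_learning(xs, query_label):
    """Binary-search active learner for 1D threshold functions h_w(x) = 1(x >= w).

    xs: pool of unlabeled points on the real line.
    query_label: label oracle returning 1 iff x >= w* (the unknown threshold).
    Returns all inferred labels plus the number of label queries used.
    """
    order = np.argsort(xs)
    lo, hi = 0, len(xs)          # in sorted order: labels before lo are 0, labels from hi on are 1
    n_queries = 0
    while lo < hi:               # binary search for the boundary position
        mid = (lo + hi) // 2
        n_queries += 1
        if query_label(xs[order[mid]]) == 1:
            hi = mid
        else:
            lo = mid + 1
    labels = np.zeros(len(xs), dtype=int)
    labels[order[lo:]] = 1       # infer every label from the boundary position
    return labels, n_queries

# Example: N = 1000 unlabeled points, hidden threshold w* = 0.3 (assumption).
rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, size=1000)
labels, n_queries = active_threshold_learning(xs, lambda x: int(x >= 0.3))
print(n_queries)                 # roughly log2(1000) ~ 10 queries instead of 1000 labels
```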

12

Active SVM

Uncertainty sampling in SVMs is common and quite useful in practice.

E.g., [Tong & Koller, ICML 2000; Schohn & Cohn, ICML 2000; Jain, Vijayanarasimhan & Grauman, NIPS 2010]

Active SVM Algorithm

• At any time during the algorithm, we have a "current guess" wt of the separator: the max-margin separator of all labeled points so far.

• Request the label of the example closest to the current separator.

13

Active SVM

Active SVM seems to be quite useful in practice.

Algorithm (batch version)

Input: Su = {x1, …, xmu} drawn i.i.d. from the underlying source D.

Start: query for the labels of a few random xi's.

For t = 1, …:

• Find wt, the max-margin separator of all labeled points so far.

• Request the label of the example closest to the current separator: the xi minimizing |xi ⋅ wt| (highest uncertainty).

[Tong & Koller, ICML 2000; Jain, Vijayanarasimhan & Grauman, NIPS 2010]
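A minimal sketch of this batch loop (my own illustration, not the slides' code), using scikit-learn's LinearSVC; the synthetic pool, the seed set, and the query budget are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Illustrative pool: 2D points labeled by a hidden linear separator (assumption).
X_pool = rng.normal(size=(500, 2))
y_pool = (X_pool @ np.array([1.5, -1.0]) > 0).astype(int)

# Start: a few labeled seeds, making sure both classes are present.
labeled = [int(np.argmax(y_pool)), int(np.argmin(y_pool))]
labeled += list(rng.choice(len(X_pool), size=3, replace=False))

for t in range(30):                                                # label budget
    svm = LinearSVC(C=1.0, max_iter=10_000).fit(X_pool[labeled], y_pool[labeled])
    margins = np.abs(svm.decision_function(X_pool))                # |x_i . w_t + b|: small = uncertain
    margins[labeled] = np.inf                                      # never re-query a labeled point
    labeled.append(int(np.argmin(margins)))                        # query the point closest to the separator

print("pool accuracy:", svm.score(X_pool, y_pool))
```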

14

DANGER!!!

• Uncertainty sampling works sometimes….

• However, we need to be very very very careful!!!

• Myopic, greedy techniques can suffer from sampling bias. (The active learning algorithm samples from a different (x,y) distribution than the true data.)

• A bias created because of the querying strategy; as time goes on, the sample is less and less representative of the true data source. [Dasgupta10]

15

DANGER!!!

• Main tension: want to choose informative points, but also want to guarantee that the classifier we output does well on true random examples from the underlying distribution.

• Observed in practice too!!!!

16

Other Interesting Active Learning Techniques Used in Practice

It is an interesting open question to analyze under what conditions they are successful.

17

Density-Based Sampling: centroid of largest unsampled cluster. [Jaime G. Carbonell]

18

Uncertainty Sampling: closest to decision boundary (Active SVM). [Jaime G. Carbonell]

19

Maximal Diversity Sampling: maximally distant from labeled x's. [Jaime G. Carbonell]

20

Ensemble-Based Possibilities: uncertainty + diversity criteria, or density + uncertainty criteria (see the sketch below). [Jaime G. Carbonell]
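As an illustration of such an ensemble criterion (my own sketch, not from the slides), the score below mixes an uncertainty term with a pool-density term; the Gaussian-kernel density estimate, the bandwidth, and the mixing weight alpha are arbitrary assumptions.

```python
import numpy as np

def ensemble_scores(X_pool, margins, bandwidth=1.0, alpha=0.5):
    """Density + uncertainty acquisition score (higher = more attractive to query).

    margins: |decision_function| values over the pool (smaller = more uncertain).
    Density is a simple Gaussian-kernel average over the pool itself.
    """
    uncertainty = -margins                                        # more uncertain -> higher score
    sq_dists = ((X_pool[:, None, :] - X_pool[None, :, :]) ** 2).sum(-1)
    density = np.exp(-sq_dists / (2 * bandwidth ** 2)).mean(axis=1)
    return alpha * uncertainty + (1 - alpha) * density

# Usage with the Active SVM sketch above (hypothetical names):
# query = int(np.argmax(ensemble_scores(X_pool, np.abs(svm.decision_function(X_pool)))))
```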

21

What You Should Know so far

• Active learning can be really helpful and can provide exponential improvements in label complexity (both theoretically and practically)!

• We need to be very careful due to sampling bias.

• Common heuristics exist (e.g., those based on uncertainty sampling).

22

Gaussian Processes for Regression

23

Additional resources

http://www.gaussianprocess.org/

Some of these slides are taken from D. Lizotte, R. Parr, C. Guestrin.

24

Additional resources

• Nonmyopic Active Learning of Gaussian Processes: An Exploration–Exploitation Approach. A. Krause and C. Guestrin. ICML 2007.

• Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. A. Krause, A. Singh, and C. Guestrin. Journal of Machine Learning Research 9 (2008).

• Bayesian Active Learning for Posterior Estimation. K. Kandasamy, J. Schneider, and B. Póczos. International Joint Conference on Artificial Intelligence (IJCAI), 2015.

25

Why GPs for Regression?

Regression methods: linear regression, multilayer perceptron, ridge regression, support vector regression, kNN regression, etc.

Motivation: all the above regression methods give point estimates. We would like a method that also provides confidence during the estimation.

Application in Active Learning: such a method can be used for active learning, by querying the next point (and its label) where the uncertainty is the highest, as in the sketch below.
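A minimal sketch of that loop (my own illustration, not the slides' code), using scikit-learn's GaussianProcessRegressor and querying the candidate with the largest predictive standard deviation; the target function, the kernel, and the budget are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

f = lambda x: np.sin(3 * x)                        # illustrative unknown target (assumption)
X_pool = np.linspace(0, 3, 200).reshape(-1, 1)     # candidate query locations

queried = [0, 199]                                 # start from the two endpoints
for t in range(10):                                # label budget
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5) + WhiteKernel(1e-4))
    gp.fit(X_pool[queried], f(X_pool[queried]).ravel())
    _, std = gp.predict(X_pool, return_std=True)   # posterior uncertainty at every candidate
    std[queried] = 0.0                             # never re-query a known point
    queried.append(int(np.argmax(std)))            # query where the uncertainty is highest
```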

26

Why GPs for Regression?

GPs can answer the following questions:

• Here's where the function will most likely be (the expected function).

• Here are some examples of what it might look like (samples from the posterior distribution, e.g., the blue, red, and green functions).

• Here is a prediction of what you'll see if you evaluate your function at x', with confidence.

27

Properties of Multivariate Gaussian Distributions

28

1D Gaussian Distribution

Parameters:

• Mean, μ

• Variance, σ²
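For reference, the standard 1D Gaussian density (written here in standard notation, not copied from the slide):

```latex
\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```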

29

Multivariate Gaussian

30

A 2-dimensional Gaussian is defined by

• a mean vector $\mu = [\mu_1, \mu_2]$

• a covariance matrix $\Sigma = \begin{pmatrix} \sigma^2_{1,1} & \sigma^2_{1,2} \\ \sigma^2_{2,1} & \sigma^2_{2,2} \end{pmatrix}$, where $\sigma^2_{i,j} = E[(x_i - \mu_i)(x_j - \mu_j)]$ is the (co)variance.

Note: $\Sigma$ is symmetric and "positive semi-definite": $\forall x:\ x^\top \Sigma x \ge 0$.

31

Multivariate Gaussian example: $\mu = (0, 0)$ with a given covariance matrix (contour-plot figure).

32

Multivariate Gaussian example: $\mu = (0, 0)$ with a given covariance matrix (contour-plot figure).

33

Useful Properties of Gaussians

Marginal distributions of Gaussians are Gaussian.

Given: $x = (x_a, x_b) \sim \mathcal{N}(\mu, \Sigma)$ with $\mu = (\mu_a, \mu_b)$ and $\Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}$.

The marginal distribution is: $x_a \sim \mathcal{N}(\mu_a, \Sigma_{aa})$.

34

Marginal distributions of Gaussians are Gaussian

35

Block Matrix Inversion

Theorem

Definition: Schur complements
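For reference, the standard block matrix inversion theorem in terms of the Schur complement (a reconstruction in standard notation, not copied verbatim from the slide):

```latex
M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}, \qquad
S := A - B D^{-1} C \quad \text{(the Schur complement of } D \text{ in } M)

M^{-1} = \begin{pmatrix}
  S^{-1} & -S^{-1} B D^{-1} \\
  -D^{-1} C S^{-1} & \; D^{-1} + D^{-1} C S^{-1} B D^{-1}
\end{pmatrix}
```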

36

Useful Properties of Gaussians

Conditional distributions of Gaussians are Gaussian.

Notation: $x = (x_a, x_b) \sim \mathcal{N}(\mu, \Sigma)$ with $\mu = (\mu_a, \mu_b)$ and $\Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}$.

Conditional distribution: $x_a \mid x_b \sim \mathcal{N}\!\big(\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b),\ \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\big)$.
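A quick numerical check of this conditioning formula (my own sketch; the example mean and covariance are arbitrary):

```python
import numpy as np

# Partitioned 3D Gaussian: x_a is the first coordinate, x_b the remaining two.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
a, b = [0], [1, 2]

def condition(mu, Sigma, a, b, x_b):
    """Return mean and covariance of x_a | x_b for a jointly Gaussian vector."""
    S_ab = Sigma[np.ix_(a, b)]
    S_bb_inv = np.linalg.inv(Sigma[np.ix_(b, b)])
    mean = mu[a] + S_ab @ S_bb_inv @ (x_b - mu[b])
    cov = Sigma[np.ix_(a, a)] - S_ab @ S_bb_inv @ Sigma[np.ix_(b, a)]
    return mean, cov

print(condition(mu, Sigma, a, b, x_b=np.array([0.5, 0.0])))
```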

37

Higher Dimensions

Visualizing > 3 dimensional Gaussian random variables is… difficult.

Means and variances of the marginal variables are practical to plot, but then we don't see the correlations between those variables.

Visualizing an 8-dimensional Gaussian variable f: plot the marginal mean µ(i) and variance σ²(i) against the index i = 1, …, 8.

Marginals are Gaussian, e.g., f(6) ~ N(µ(6), σ²(6)).

38

Yet Higher Dimensions

Why stop there?

39

Getting Ridiculous

Why stop there?

40

Gaussian Process

Definition of GP:

A probability distribution indexed by an arbitrary set (integers, reals, finite-dimensional vectors, etc.):

• Each element (indexed by x) is a Gaussian distribution over the reals with mean µ(x).

• These distributions are dependent/correlated, as defined by k(x,z).

• Any finite subset of indices defines a multivariate Gaussian distribution.
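To make the "any finite subset is multivariate Gaussian" point concrete, here is a small sketch (not from the slides) that draws functions from a zero-mean GP prior at a finite grid of indices; the squared-exponential kernel and its parameters are illustrative choices.

```python
import numpy as np

def sq_exp_kernel(X, Z, length_scale=0.5):
    """Squared-exponential covariance k(x, z) = exp(-|x - z|^2 / (2 l^2))."""
    d2 = (X[:, None] - Z[None, :]) ** 2
    return np.exp(-d2 / (2 * length_scale ** 2))

x_grid = np.linspace(0, 5, 100)                  # a finite index set
K = sq_exp_kernel(x_grid, x_grid)                # covariance of the corresponding multivariate Gaussian
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=np.zeros(len(x_grid)),
                                  cov=K + 1e-8 * np.eye(len(x_grid)),  # jitter for numerical stability
                                  size=3)        # three draws = three "functions" on the grid
print(samples.shape)                             # (3, 100): each row is one sampled function
```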

41

Gaussian Process

Distribution over functions…

If our regression model is a GP, then it is no longer just a point estimate: it can provide regression estimates together with confidence.

Domain (index set) of the functions can be pretty much whatever

• Reals

• Real vectors

• Graphs

• Strings

• Sets

• …

42

Bayesian Updates for GPs

• How can we do regression and learn the GP from data?

• We will be Bayesians today:

• Start with a GP prior

• Get some data

• Compute a posterior

43

Samples from the prior distribution

Picture is taken from Rasmussen and Williams

44

Samples from the posterior distribution

Picture is taken from Rasmussen and Williams

45

Prior: zero-mean Gaussians with covariance k(x,z)

46

Data

47

Posterior

48

Ridge Regression

Linear regression:

Ridge regression:

The Gaussian Process is a Bayesian generalization of kernelized ridge regression.
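For reference, the standard objectives (written here in standard notation, not copied from the slide):

```latex
\text{Linear regression:}\qquad \hat w = \arg\min_{w}\ \sum_{i=1}^{n} \big(y_i - w^\top x_i\big)^2

\text{Ridge regression:}\qquad \hat w = \arg\min_{w}\ \sum_{i=1}^{n} \big(y_i - w^\top x_i\big)^2 + \lambda \|w\|_2^2
```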

49

Weight Space View

GP = Bayesian ridge regression in feature space + kernel trick to carry out the computations

The training data:

50

Bayesian Analysis of Linear Regression with Gaussian noise

Linear regression:

Linear regression with noise:

51

Bayesian Analysis of Linear Regression with Gaussian noise

The likelihood:

52

Bayesian Analysis of Linear Regression with Gaussian noise

The prior:

Now, we can calculate the posterior:

53

Bayesian Analysis of Linear Regression with Gaussian noise

After “completing the square”

MAP estimation

Ridge Regression
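As a hedged reconstruction of these equations in standard weight-space notation (following Rasmussen and Williams, not copied from the slides): with likelihood $y_i = x_i^\top w + \varepsilon_i$, noise $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$, prior $w \sim \mathcal{N}(0, \Sigma_p)$, and $X$ the $D \times n$ matrix whose columns are the training inputs, completing the square gives

```latex
p(w \mid X, y) = \mathcal{N}\!\Big(w \,\Big|\, \tfrac{1}{\sigma_n^2} A^{-1} X y,\ A^{-1}\Big),
\qquad A = \tfrac{1}{\sigma_n^2} X X^\top + \Sigma_p^{-1}.
```

The MAP estimate (here also the posterior mean) is a ridge-regression solution: with $\Sigma_p = \tau^2 I$ it equals $(X X^\top + \tfrac{\sigma_n^2}{\tau^2} I)^{-1} X y$, i.e., ridge regression with $\lambda = \sigma_n^2/\tau^2$.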

54

Bayesian Analysis of Linear Regression with Gaussian noise

This posterior covariance matrix doesn't depend on the observations y: a strange property of Gaussian processes.

55

Projections of Inputs into Feature Space

Bayesian linear regression suffers from limited expressiveness.

To overcome the problem ⇒ go to a feature space and do linear regression there:

a) explicit features

b) implicit features (kernels)

56

Explicit Features

Linear regression in the feature space

57

Explicit Features

The predictive distribution after feature map:

Reminder: This is what we had without feature maps:

58

Explicit Features

Shorthands:

The predictive distribution after feature map:

59

Explicit Features

The predictive distribution after the feature map:   (*)

A problem with (*) is that it needs an N×N matrix inversion...

Theorem: (*) can be rewritten:

60

Proofs

• Mean expression. We need:

• Variance expression. We need:

Lemma:

Matrix inversion lemma:
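For reference, the standard matrix inversion (Woodbury) identity (a reconstruction, not copied verbatim from the slide):

```latex
\big(Z + U W V^\top\big)^{-1}
  \;=\; Z^{-1} - Z^{-1} U \big(W^{-1} + V^\top Z^{-1} U\big)^{-1} V^\top Z^{-1}
```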

61

From Explicit to Implicit Features

62

From Explicit to Implicit Features

The feature space always appears in the form of:

Lemma:

No need to know the explicit N dimensional features.

Their inner product is enough.
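Writing $k(x, x') = \phi(x)^\top \Sigma_p\, \phi(x')$ for that inner product, the predictive distribution takes the usual kernel (function-space) form of GP regression; this is the standard result, reconstructed here rather than copied from the slides:

```latex
f_* \mid X, y, x_* \;\sim\; \mathcal{N}\big(\bar f_*,\ \mathbb{V}[f_*]\big), \qquad
\bar f_* = k_*^\top \big(K + \sigma_n^2 I\big)^{-1} y, \qquad
\mathbb{V}[f_*] = k(x_*, x_*) - k_*^\top \big(K + \sigma_n^2 I\big)^{-1} k_*
```

Here $K$ is the $n \times n$ kernel matrix on the training inputs and $k_*$ is the vector of kernel values between the test point $x_*$ and the training inputs, so only inner products of features are ever needed.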

63

GP pseudo code

Inputs:

64

GP pseudo code (continued)

Outputs:
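Below is a minimal sketch that follows the standard Cholesky-based GP regression recipe (as in Rasmussen and Williams, Algorithm 2.1); it is an illustration rather than the slides' exact pseudo code, and the kernel choice and noise level in the usage note are assumptions.

```python
import numpy as np

def gp_regression(X_train, y_train, X_test, kernel, noise_var=1e-2):
    """Standard GP regression: predictive mean/variance at X_test and log marginal likelihood.

    Inputs: training inputs/targets, test inputs, covariance function k, noise variance.
    """
    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    L = np.linalg.cholesky(K)                                     # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))     # (K + sigma^2 I)^{-1} y
    K_star = kernel(X_train, X_test)
    mean = K_star.T @ alpha                                       # predictive mean
    v = np.linalg.solve(L, K_star)
    var = np.diag(kernel(X_test, X_test)) - np.sum(v ** 2, axis=0)   # predictive variance
    log_ml = (-0.5 * y_train @ alpha
              - np.sum(np.log(np.diag(L)))
              - 0.5 * len(X_train) * np.log(2 * np.pi))           # log marginal likelihood
    return mean, var, log_ml

# Usage with the squared-exponential kernel from the earlier sketch (hypothetical data names):
# mean, var, _ = gp_regression(x_tr, y_tr, x_te, sq_exp_kernel, noise_var=0.01)
```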

65

Results

66

Results using Netlab, Sin function

67

Results using Netlab, Sin function

Increased # of training points

68

Results using Netlab, Sin function

Increased noise

69

Results using Netlab, Sinc function

70

Applications: Sensor placement

Temperature modeling with GP

Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. A. Krause, A. Singh, and C. Guestrin. Journal of Machine Learning Research 9 (2008).

71

Applications: Sensor placement

An example of placements chosen using the entropy and mutual information criteria on temperature data. Diamonds indicate the positions chosen using entropy; squares the positions chosen using MI.

Entropy criterion:

Mutual information criterion:
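For reference, in the notation of the Krause, Singh, and Guestrin paper (a reconstruction, not copied verbatim from the slide), choosing a set $A$ of sensor locations from a candidate set $V$:

```latex
\text{Entropy criterion:}\qquad A^{*} = \arg\max_{|A| = k}\ H(\mathbf{X}_A)

\text{Mutual information criterion:}\qquad A^{*} = \arg\max_{|A| = k}\ I\big(\mathbf{X}_A;\, \mathbf{X}_{V \setminus A}\big)
  = \arg\max_{|A| = k}\ \big[ H(\mathbf{X}_{V \setminus A}) - H(\mathbf{X}_{V \setminus A} \mid \mathbf{X}_A) \big]
```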

What You Should Know

Properties of Multivariate Gaussian distributions

Gaussian process = Bayesian ridge regression

GP algorithm

GP application in active learning

73

Thanks for your attention! ☺
