
1

Introduction to Machine Learning

Active Learning

Barnabás Póczos

2

Credits

Some of the slides are taken from Nina Balcan.

3

Classic Supervised Learning Paradigm is Insufficient Nowadays

Modern applications: massive amounts of raw data.

Only a tiny fraction can be annotated by human experts.

Examples: billions of webpages, images, sensor measurements.

4

Modern ML: New Learning Approaches

Modern applications: massive amounts of raw data.

Example: the Large Synoptic Survey Telescope produces 15 terabytes of data … every night.

We need techniques that minimize the need for expert/human intervention => Active Learning

5

Contents

Active Learning Intro

▪ Batch Active Learning vs Selective Sampling Active Learning

▪ Exponential improvement on # of labels

▪ Sampling bias: Active Learning can hurt performance

Active Learning with SVM

Gaussian Process Regression

▪ Properties of Multivariate Gaussian distributions

▪ Ridge regression

▪ GP = Bayesian Ridge Regression + Kernel trick

Active Learning with Gaussian Processes

6

Additional resources

• Two Faces of Active Learning. Sanjoy Dasgupta. 2011.

• Active Learning. Burr Settles. 2012.

• Active Learning. Balcan and Urner. Encyclopedia of Algorithms. 2015.

7

Batch Active Learning

[Diagram: the Learning Algorithm receives a set of unlabeled examples from the Data Source, repeatedly sends a request for the label of an example to the Expert and gets a label for that example back, and finally outputs a classifier w.r.t. D.]

Underlying data distr. D.

• Learner can choose specific examples to be labeled.

• Goal: use fewer labeled examples [pick informative examples to be labeled].

8

Selective Sampling Active Learning

[Diagram: unlabeled examples x1, x2, x3, … arrive one at a time from the Data Source; for each one the Learning Algorithm decides "request for label or let it go?"; e.g., it requests labels y1 and y3 for x1 and x3 from the Expert and lets x2 go, and finally outputs a classifier w.r.t. D.]

Underlying data distr. D.

• Selective sampling AL (Online AL): stream of unlabeled examples; when each arrives, make a decision to ask for its label or not.

• Goal: use fewer labeled examples [pick informative examples to be labeled].

9

What Makes a Good Active Learning Algorithm?

• Need to choose the label requests carefully, to get informative labels.

• Guaranteed to output a relatively good classifier for most learning problems.

• Doesn't make too many label requests: hopefully far fewer than passive learning.

10

Can adaptive querying really do better than passive/random sampling?

• YES! (sometimes)

• We often need far fewer labels for active learning than for passive.

• This is predicted by theory and has been observed in practice.

11

Can adaptive querying help? [CAL92, Dasgupta04]

• Threshold functions on the real line: $h_w(x) = \mathbb{1}(x \ge w)$, $C = \{h_w : w \in \mathbb{R}\}$.

Active Algorithm

• Get N unlabeled examples. For $N = O(1/\epsilon)$ we are guaranteed to get a classifier of error $\le \epsilon$.

• How can we recover the correct labels with ≪ N queries?

• Do binary search (query at the half-way point)! Just need O(log N) labels.

• Output a classifier consistent with the N inferred labels.

Passive supervised: $\Omega(1/\epsilon)$ labels to find an $\epsilon$-accurate threshold.

Active: only $O(\log 1/\epsilon)$ labels. Exponential improvement.
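To make the label-complexity claim concrete, here is a minimal sketch (an illustration, not from the slides) of the binary-search active learner for 1D thresholds; the pool size, the hidden threshold, and the helper names are assumptions.

```python
import numpy as np

def active_threshold_learning(xs, query_label):
    """Binary-search active learner for 1D threshold functions h_w(x) = 1(x >= w).

    xs: pool of unlabeled points on the real line.
    query_label: label oracle returning 1 iff x >= w* (the unknown threshold).
    Returns all inferred labels plus the number of label queries used.
    """
    order = np.argsort(xs)
    lo, hi = 0, len(xs)          # in sorted order: labels before lo are 0, labels from hi on are 1
    n_queries = 0
    while lo < hi:               # binary search for the boundary position
        mid = (lo + hi) // 2
        n_queries += 1
        if query_label(xs[order[mid]]) == 1:
            hi = mid
        else:
            lo = mid + 1
    labels = np.zeros(len(xs), dtype=int)
    labels[order[lo:]] = 1       # infer every label from the boundary position
    return labels, n_queries

# Example: N = 1000 unlabeled points, hidden threshold w* = 0.3 (assumption).
rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, size=1000)
labels, n_queries = active_threshold_learning(xs, lambda x: int(x >= 0.3))
print(n_queries)                 # roughly log2(1000) ~ 10 queries instead of 1000 labels
```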

12

Active SVM

Uncertainty sampling in SVMs is common and quite useful in practice.

E.g., [Tong & Koller, ICML 2000; Schohn & Cohn, ICML 2000; Jain, Vijayanarasimhan & Grauman, NIPS 2010]

Active SVM Algorithm

• At any time during the algorithm, we have a "current guess" wt of the separator: the max-margin separator of all labeled points so far.

• Request the label of the example closest to the current separator.

13

Active SVM

Active SVM seems to be quite useful in practice.

Algorithm (batch version)

Input: Su = {x1, …, xmu} drawn i.i.d. from the underlying source D.

Start: query for the labels of a few random xi's.

For t = 1, …:

• Find wt, the max-margin separator of all labeled points so far.

• Request the label of the example closest to the current separator: the xi minimizing |xi ⋅ wt| (highest uncertainty).

[Tong & Koller, ICML 2000; Jain, Vijayanarasimhan & Grauman, NIPS 2010]
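A minimal sketch of this batch loop (my own illustration, not the slides' code), using scikit-learn's LinearSVC; the synthetic pool, the seed set, and the query budget are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Illustrative pool: 2D points labeled by a hidden linear separator (assumption).
X_pool = rng.normal(size=(500, 2))
y_pool = (X_pool @ np.array([1.5, -1.0]) > 0).astype(int)

# Start: a few labeled seeds, making sure both classes are present.
labeled = [int(np.argmax(y_pool)), int(np.argmin(y_pool))]
labeled += list(rng.choice(len(X_pool), size=3, replace=False))

for t in range(30):                                                # label budget
    svm = LinearSVC(C=1.0, max_iter=10_000).fit(X_pool[labeled], y_pool[labeled])
    margins = np.abs(svm.decision_function(X_pool))                # |x_i . w_t + b|: small = uncertain
    margins[labeled] = np.inf                                      # never re-query a labeled point
    labeled.append(int(np.argmin(margins)))                        # query the point closest to the separator

print("pool accuracy:", svm.score(X_pool, y_pool))
```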

14

DANGER!!!

• Uncertainty sampling works sometimes….

• However, we need to be very very very careful!!!

• Myopic, greedy techniques can suffer from sampling bias. (The active learning algorithm samples from a different (x,y) distribution than the true data.)

• A bias created because of the querying strategy; as time goes on, the sample is less and less representative of the true data source. [Dasgupta10]

15

DANGER!!!

• Main tension: want to choose informative points, but also want to guarantee that the classifier we output does well on true random examples from the underlying distribution.

• Observed in practice too!!!!

16

Other Interesting Active Learning Techniques Used in Practice

It is an interesting open question to analyze under what conditions they are successful.

17

Density-Based Sampling: centroid of largest unsampled cluster. [Jaime G. Carbonell]

18

Uncertainty Sampling: closest to decision boundary (Active SVM). [Jaime G. Carbonell]

19

Maximal Diversity Sampling: maximally distant from labeled x's. [Jaime G. Carbonell]

20

Ensemble-Based Possibilities: uncertainty + diversity criteria, or density + uncertainty criteria (see the sketch below). [Jaime G. Carbonell]
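As an illustration of such an ensemble criterion (my own sketch, not from the slides), the score below mixes an uncertainty term with a pool-density term; the Gaussian-kernel density estimate, the bandwidth, and the mixing weight alpha are arbitrary assumptions.

```python
import numpy as np

def ensemble_scores(X_pool, margins, bandwidth=1.0, alpha=0.5):
    """Density + uncertainty acquisition score (higher = more attractive to query).

    margins: |decision_function| values over the pool (smaller = more uncertain).
    Density is a simple Gaussian-kernel average over the pool itself.
    """
    uncertainty = -margins                                        # more uncertain -> higher score
    sq_dists = ((X_pool[:, None, :] - X_pool[None, :, :]) ** 2).sum(-1)
    density = np.exp(-sq_dists / (2 * bandwidth ** 2)).mean(axis=1)
    return alpha * uncertainty + (1 - alpha) * density

# Usage with the Active SVM sketch above (hypothetical names):
# query = int(np.argmax(ensemble_scores(X_pool, np.abs(svm.decision_function(X_pool)))))
```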

21

What You Should Know so far

• Active learning can be really helpful and can provide exponential improvements in label complexity (both theoretically and practically)!

• We need to be very careful due to sampling bias.

• Common heuristics exist (e.g., those based on uncertainty sampling).

22

Gaussian Processes for Regression

23

Additional resources

http://www.gaussianprocess.org/

Some of these slides are taken from D. Lizotte, R. Parr, C. Guestrin.

24

Additional resources

• Nonmyopic Active Learning of Gaussian Processes: An Exploration–Exploitation Approach. A. Krause and C. Guestrin. ICML 2007.

• Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. A. Krause, A. Singh, and C. Guestrin. Journal of Machine Learning Research 9 (2008).

• Bayesian Active Learning for Posterior Estimation. K. Kandasamy, J. Schneider, and B. Póczos. International Joint Conference on Artificial Intelligence (IJCAI), 2015.

25

Why GPs for Regression?

Regression methods: linear regression, multilayer perceptron, ridge regression, support vector regression, kNN regression, etc.

Motivation: all the above regression methods give point estimates. We would like a method that also provides confidence during the estimation.

Application in Active Learning: such a method can be used for active learning, by querying the next point (and its label) where the uncertainty is the highest, as in the sketch below.
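A minimal sketch of that loop (my own illustration, not the slides' code), using scikit-learn's GaussianProcessRegressor and querying the candidate with the largest predictive standard deviation; the target function, the kernel, and the budget are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

f = lambda x: np.sin(3 * x)                        # illustrative unknown target (assumption)
X_pool = np.linspace(0, 3, 200).reshape(-1, 1)     # candidate query locations

queried = [0, 199]                                 # start from the two endpoints
for t in range(10):                                # label budget
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5) + WhiteKernel(1e-4))
    gp.fit(X_pool[queried], f(X_pool[queried]).ravel())
    _, std = gp.predict(X_pool, return_std=True)   # posterior uncertainty at every candidate
    std[queried] = 0.0                             # never re-query a known point
    queried.append(int(np.argmax(std)))            # query where the uncertainty is highest
```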

26

Why GPs for Regression?

GPs can answer the following questions:

• Here's where the function will most likely be (the expected function).

• Here are some examples of what it might look like (samples from the posterior distribution, e.g., the blue, red, and green functions).

• Here is a prediction of what you'll see if you evaluate your function at x', with confidence.

27

Properties of Multivariate Gaussian Distributions

28

1D Gaussian Distribution

Parameters:

• Mean, μ

• Variance, σ²
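For reference, the standard 1D Gaussian density (written here in standard notation, not copied from the slide):

```latex
\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```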

29

Multivariate Gaussian

30

A 2-dimensional Gaussian is defined by

• a mean vector $\mu = [\mu_1, \mu_2]$

• a covariance matrix $\Sigma = \begin{pmatrix} \sigma^2_{1,1} & \sigma^2_{1,2} \\ \sigma^2_{2,1} & \sigma^2_{2,2} \end{pmatrix}$, where $\sigma^2_{i,j} = E[(x_i - \mu_i)(x_j - \mu_j)]$ is the (co)variance.

Note: $\Sigma$ is symmetric and "positive semi-definite": $\forall x:\ x^\top \Sigma x \ge 0$.

31

Multivariate Gaussian example: $\mu = (0, 0)$ with a given covariance matrix (contour-plot figure).

32

Multivariate Gaussian example: $\mu = (0, 0)$ with a given covariance matrix (contour-plot figure).

33

Useful Properties of Gaussians

Marginal distributions of Gaussians are Gaussian.

Given: $x = (x_a, x_b) \sim \mathcal{N}(\mu, \Sigma)$ with $\mu = (\mu_a, \mu_b)$ and $\Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}$.

The marginal distribution is: $x_a \sim \mathcal{N}(\mu_a, \Sigma_{aa})$.

34

Marginal distributions of Gaussians are Gaussian

35

Block Matrix Inversion

Theorem

Definition: Schur complements
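For reference, the standard block matrix inversion theorem in terms of the Schur complement (a reconstruction in standard notation, not copied verbatim from the slide):

```latex
M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}, \qquad
S := A - B D^{-1} C \quad \text{(the Schur complement of } D \text{ in } M)

M^{-1} = \begin{pmatrix}
  S^{-1} & -S^{-1} B D^{-1} \\
  -D^{-1} C S^{-1} & \; D^{-1} + D^{-1} C S^{-1} B D^{-1}
\end{pmatrix}
```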

36

Useful Properties of Gaussians

Conditional distributions of Gaussians are Gaussian.

Notation: $x = (x_a, x_b) \sim \mathcal{N}(\mu, \Sigma)$ with $\mu = (\mu_a, \mu_b)$ and $\Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}$.

Conditional distribution: $x_a \mid x_b \sim \mathcal{N}\!\big(\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b),\ \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\big)$.
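A quick numerical check of this conditioning formula (my own sketch; the example mean and covariance are arbitrary):

```python
import numpy as np

# Partitioned 3D Gaussian: x_a is the first coordinate, x_b the remaining two.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
a, b = [0], [1, 2]

def condition(mu, Sigma, a, b, x_b):
    """Return mean and covariance of x_a | x_b for a jointly Gaussian vector."""
    S_ab = Sigma[np.ix_(a, b)]
    S_bb_inv = np.linalg.inv(Sigma[np.ix_(b, b)])
    mean = mu[a] + S_ab @ S_bb_inv @ (x_b - mu[b])
    cov = Sigma[np.ix_(a, a)] - S_ab @ S_bb_inv @ Sigma[np.ix_(b, a)]
    return mean, cov

print(condition(mu, Sigma, a, b, x_b=np.array([0.5, 0.0])))
```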

37

Higher Dimensions

Visualizing > 3 dimensional Gaussian random variables is… difficult.

Means and variances of the marginal variables are practical to plot, but then we don't see the correlations between those variables.

Visualizing an 8-dimensional Gaussian variable f: plot the marginal mean µ(i) and variance σ²(i) against the index i = 1, …, 8.

Marginals are Gaussian, e.g., f(6) ~ N(µ(6), σ²(6)).

38

Yet Higher Dimensions

Why stop there?

39

Getting Ridiculous

Why stop there?

40

Gaussian Process

Definition of GP:

A probability distribution indexed by an arbitrary set (integers, reals, finite-dimensional vectors, etc.):

• Each element (indexed by x) is a Gaussian distribution over the reals with mean µ(x).

• These distributions are dependent/correlated, as defined by k(x,z).

• Any finite subset of indices defines a multivariate Gaussian distribution.
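To make the "any finite subset is multivariate Gaussian" point concrete, here is a small sketch (not from the slides) that draws functions from a zero-mean GP prior at a finite grid of indices; the squared-exponential kernel and its parameters are illustrative choices.

```python
import numpy as np

def sq_exp_kernel(X, Z, length_scale=0.5):
    """Squared-exponential covariance k(x, z) = exp(-|x - z|^2 / (2 l^2))."""
    d2 = (X[:, None] - Z[None, :]) ** 2
    return np.exp(-d2 / (2 * length_scale ** 2))

x_grid = np.linspace(0, 5, 100)                  # a finite index set
K = sq_exp_kernel(x_grid, x_grid)                # covariance of the corresponding multivariate Gaussian
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=np.zeros(len(x_grid)),
                                  cov=K + 1e-8 * np.eye(len(x_grid)),  # jitter for numerical stability
                                  size=3)        # three draws = three "functions" on the grid
print(samples.shape)                             # (3, 100): each row is one sampled function
```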

41

Gaussian Process

Distribution over functions…

If our regression model is a GP, then it is no longer just a point estimate: it can provide regression estimates together with confidence.

Domain (index set) of the functions can be pretty much whatever

• Reals

• Real vectors

• Graphs

• Strings

• Sets

• …

42

Bayesian Updates for GPs

• How can we do regression and learn the GP from data?

• We will be Bayesians today:

• Start with a GP prior

• Get some data

• Compute a posterior

43

Samples from the prior distribution

Picture is taken from Rasmussen and Williams

44

Samples from the posterior distribution

Picture is taken from Rasmussen and Williams

45

Prior: zero-mean Gaussians with covariance k(x,z)

46

Data

47

Posterior

48

Ridge Regression

Linear regression:

Ridge regression:

The Gaussian Process is a Bayesian generalization of kernelized ridge regression.
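For reference, the standard objectives (written here in standard notation, not copied from the slide):

```latex
\text{Linear regression:}\qquad \hat w = \arg\min_{w}\ \sum_{i=1}^{n} \big(y_i - w^\top x_i\big)^2

\text{Ridge regression:}\qquad \hat w = \arg\min_{w}\ \sum_{i=1}^{n} \big(y_i - w^\top x_i\big)^2 + \lambda \|w\|_2^2
```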

49

Weight Space View

GP = Bayesian ridge regression in feature space + kernel trick to carry out the computations

The training data:

50

Bayesian Analysis of Linear Regression with Gaussian noise

Linear regression:

Linear regression with noise:

51

Bayesian Analysis of Linear Regression with Gaussian noise

The likelihood:

52

Bayesian Analysis of Linear Regression with Gaussian noise

The prior:

Now, we can calculate the posterior:

53

Bayesian Analysis of Linear Regression with Gaussian noise

After “completing the square”

MAP estimation

Ridge Regression
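As a hedged reconstruction of these equations in standard weight-space notation (following Rasmussen and Williams, not copied from the slides): with likelihood $y_i = x_i^\top w + \varepsilon_i$, noise $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$, prior $w \sim \mathcal{N}(0, \Sigma_p)$, and $X$ the $D \times n$ matrix whose columns are the training inputs, completing the square gives

```latex
p(w \mid X, y) = \mathcal{N}\!\Big(w \,\Big|\, \tfrac{1}{\sigma_n^2} A^{-1} X y,\ A^{-1}\Big),
\qquad A = \tfrac{1}{\sigma_n^2} X X^\top + \Sigma_p^{-1}.
```

The MAP estimate (here also the posterior mean) is a ridge-regression solution: with $\Sigma_p = \tau^2 I$ it equals $(X X^\top + \tfrac{\sigma_n^2}{\tau^2} I)^{-1} X y$, i.e., ridge regression with $\lambda = \sigma_n^2/\tau^2$.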

54

Bayesian Analysis of Linear Regression with Gaussian noise

This posterior covariance matrix doesn't depend on the observations y: a strange property of Gaussian processes.

55

Projections of Inputs into Feature Space

Bayesian linear regression suffers from limited expressiveness.

To overcome the problem ⇒ go to a feature space and do linear regression there:

a) explicit features

b) implicit features (kernels)

56

Explicit Features

Linear regression in the feature space

57

Explicit Features

The predictive distribution after feature map:

Reminder: This is what we had without feature maps:

58

Explicit Features

Shorthands:

The predictive distribution after feature map:

59

Explicit Features

The predictive distribution after the feature map:   (*)

A problem with (*) is that it needs an N×N matrix inversion...

Theorem: (*) can be rewritten:

60

Proofs

• Mean expression. We need:

• Variance expression. We need:

Lemma:

Matrix inversion lemma:
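For reference, the standard matrix inversion (Woodbury) identity (a reconstruction, not copied verbatim from the slide):

```latex
\big(Z + U W V^\top\big)^{-1}
  \;=\; Z^{-1} - Z^{-1} U \big(W^{-1} + V^\top Z^{-1} U\big)^{-1} V^\top Z^{-1}
```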

61

From Explicit to Implicit Features

62

From Explicit to Implicit Features

The feature space always appears in the form of:

Lemma:

No need to know the explicit N dimensional features.

Their inner product is enough.
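Writing $k(x, x') = \phi(x)^\top \Sigma_p\, \phi(x')$ for that inner product, the predictive distribution takes the usual kernel (function-space) form of GP regression; this is the standard result, reconstructed here rather than copied from the slides:

```latex
f_* \mid X, y, x_* \;\sim\; \mathcal{N}\big(\bar f_*,\ \mathbb{V}[f_*]\big), \qquad
\bar f_* = k_*^\top \big(K + \sigma_n^2 I\big)^{-1} y, \qquad
\mathbb{V}[f_*] = k(x_*, x_*) - k_*^\top \big(K + \sigma_n^2 I\big)^{-1} k_*
```

Here $K$ is the $n \times n$ kernel matrix on the training inputs and $k_*$ is the vector of kernel values between the test point $x_*$ and the training inputs, so only inner products of features are ever needed.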

63

GP pseudo code

Inputs:

64

GP pseudo code (continued)

Outputs:
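Below is a minimal sketch that follows the standard Cholesky-based GP regression recipe (as in Rasmussen and Williams, Algorithm 2.1); it is an illustration rather than the slides' exact pseudo code, and the kernel choice and noise level in the usage note are assumptions.

```python
import numpy as np

def gp_regression(X_train, y_train, X_test, kernel, noise_var=1e-2):
    """Standard GP regression: predictive mean/variance at X_test and log marginal likelihood.

    Inputs: training inputs/targets, test inputs, covariance function k, noise variance.
    """
    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    L = np.linalg.cholesky(K)                                     # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))     # (K + sigma^2 I)^{-1} y
    K_star = kernel(X_train, X_test)
    mean = K_star.T @ alpha                                       # predictive mean
    v = np.linalg.solve(L, K_star)
    var = np.diag(kernel(X_test, X_test)) - np.sum(v ** 2, axis=0)   # predictive variance
    log_ml = (-0.5 * y_train @ alpha
              - np.sum(np.log(np.diag(L)))
              - 0.5 * len(X_train) * np.log(2 * np.pi))           # log marginal likelihood
    return mean, var, log_ml

# Usage with the squared-exponential kernel from the earlier sketch (hypothetical data names):
# mean, var, _ = gp_regression(x_tr, y_tr, x_te, sq_exp_kernel, noise_var=0.01)
```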

65

Results

66

Results using Netlab, Sin function

67

Results using Netlab, Sin function

Increased # of training points

68

Results using Netlab, Sin function

Increased noise

69

Results using Netlab, Sinc function

70

Applications: Sensor placement

Temperature modeling with GP

Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. A. Krause, A. Singh, and C. Guestrin. Journal of Machine Learning Research 9 (2008).

71

Applications: Sensor placement

An example of placements chosen using the entropy and mutual information criteria on temperature data. Diamonds indicate the positions chosen using entropy; squares the positions chosen using MI.

Entropy criterion:

Mutual information criterion:
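For reference, in the notation of the Krause, Singh, and Guestrin paper (a reconstruction, not copied verbatim from the slide), choosing a set $A$ of sensor locations from a candidate set $V$:

```latex
\text{Entropy criterion:}\qquad A^{*} = \arg\max_{|A| = k}\ H(\mathbf{X}_A)

\text{Mutual information criterion:}\qquad A^{*} = \arg\max_{|A| = k}\ I\big(\mathbf{X}_A;\, \mathbf{X}_{V \setminus A}\big)
  = \arg\max_{|A| = k}\ \big[ H(\mathbf{X}_{V \setminus A}) - H(\mathbf{X}_{V \setminus A} \mid \mathbf{X}_A) \big]
```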

What You Should Know

Properties of Multivariate Gaussian distributions

Gaussian process = Bayesian ridge regression

GP algorithm

GP application in active learning

73

Thanks for your attention! ☺
