Online Linear Learning (CILVR at NYU, January 30, 2013)

Online Linear Learning

John Langford, Large Scale Machine Learning Class, January 29

To follow along:

git clone git://github.com/JohnLangford/vowpal_wabbit.git
wget http://hunch.net/~jl/rcv1.train.txt.gz

(Post presentation updated version)

Linear Learning

Features: a vector x ∈ R^n
Label: y ∈ R
Goal: Learn w ∈ R^n such that y_w(x) = ∑_i w_i x_i is close to y.

Linear Learning

Features: a vector x ∈ R^n
Label: y ∈ {−1, 1}
Goal: Learn w ∈ R^n such that y_w(x) = ∑_i w_i x_i is close to y.

Online Linear Learning

Start with w_i = 0 for all i. Repeatedly:

1. Get features x ∈ R^n.
2. Make linear prediction y_w(x) = ∑_i w_i x_i.
3. Observe label y ∈ [0, 1].
4. Update weights so y_w(x) is closer to y.

Example: w_i ← w_i + η(y − y_w(x)) x_i.
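As a toy sketch of this loop (not VW's implementation; the example stream and learning rate are invented for illustration):

```python
# Online linear learning with the update w_i <- w_i + eta * (y - y_w(x)) * x_i.
def predict(w, x):
    return sum(w_i * x_i for w_i, x_i in zip(w, x))

def update(w, x, y, eta=0.5):
    y_hat = predict(w, x)
    return [w_i + eta * (y - y_hat) * x_i for w_i, x_i in zip(w, x)]

# Toy stream drawn from y = x[0] + 2 * x[1].
examples = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 3.0)] * 50
w = [0.0, 0.0]                     # start with all-zero weights
for x, y in examples:
    w = update(w, x, y)
```

On this consistent stream the weights converge to w = (1, 2).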

Online Linear Learning

Start with w_i = 0 for all i. Repeatedly:

1. Get features x ∈ R^n.
2. Make logistic prediction y_w(x) = 1 / (1 + e^(−∑_i w_i x_i)).
3. Observe label y ∈ [0, 1].
4. Update weights so y_w(x) is closer to y.

Example: w_i ← w_i + η(y − y_w(x)) x_i.


An Example: The RCV1 dataset

Pick whether a document is in category CCAT or not.
Dataset size: 781K examples, 60M nonzero features, 1.1GB.
Format: label | sparse features ...
1 | 13:3.9656971e-02 24:3.4781646e-02 ...
which corresponds to:
1 | tuesday year ...
command: time vw --sgd rcv1.train.txt -c
takes 1-3 seconds on my laptop.

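The sparse format is easy to parse; a minimal sketch handling only the "label | feature[:value] ..." core shown above (VW's full format also has tags, namespaces, and importance weights, which this ignores):

```python
def parse_vw_line(line):
    """Parse 'label | idx:value idx:value ...' into (label, {idx: value})."""
    label_part, feature_part = line.split("|", 1)
    label = float(label_part)
    features = {}
    for tok in feature_part.split():
        idx, _, val = tok.partition(":")
        features[idx] = float(val) if val else 1.0  # bare features default to 1
    return label, features

label, feats = parse_vw_line("1 | 13:3.9656971e-02 24:3.4781646e-02")
```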

Reasons for Online Learning

1. Fast convergence to a good predictor.
2. It's RAM efficient: you need to store only one example in RAM rather than all of them. ⇒ Entirely new scales of data are possible.
3. Online learning algorithm = online optimization algorithm. Online learning algorithms ⇒ the ability to solve entirely new categories of applications.
4. Online learning = ability to deal with drifting distributions.

Defining updates

1. Define a loss function L(y_w(x), y).
2. Update according to w_i ← w_i − η ∂L(y_w(x), y)/∂w_i.

Here η is the learning rate.

[Figure: common loss functions (0/1, squared, hinge, logistic, quantile) plotted as a function of the prediction when y = 1.]
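As a sketch, here is the derivative ∂L/∂p of a few of these losses at prediction p, plugged into the update via the chain rule ∂L/∂w_i = (∂L/∂p) x_i (the losses and constants here are chosen for illustration):

```python
# dL/dp for a prediction p and label y; the chain rule gives
# dL/dw_i = dL/dp * x_i, so the update is w_i <- w_i - eta * dL/dp * x_i.
def dsquared(p, y):
    return 2.0 * (p - y)

def dquantile(p, y, tau=0.5):          # tau = 0.5 gives absolute loss
    return (1 - tau) if p > y else -tau

def dhinge(p, y):                      # y in {-1, +1}
    return -y if y * p < 1 else 0.0

def sgd_step(w, x, y, dloss, eta=0.1):
    p = sum(wi * xi for wi, xi in zip(w, x))
    return [wi - eta * dloss(p, y) * xi for wi, xi in zip(w, x)]

w = sgd_step([0.0, 0.0], [1.0, 2.0], 1.0, dsquared)
```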


Know your loss function semantics

1. What is a typical price for a house? quantile: minimizer = median.
2. What is the expected return on a stock? squared: minimizer = expectation.
3. What is the probability of a click on an ad? logistic: minimizer = probability.
4. Is the digit a 1? hinge: closest 0/1 approximation.
5. What do you really care about? often 0/1.


The proof for quantile regression

Consider the conditional probability distribution D(y|x).

[Figure: probability density of y|x.]

The proof for quantile regression

Discretize it into equal mass bins.

[Figure: the pdf of y|x discretized into equal-mass bins.]

The proof for quantile regression

Where is absolute value minimized?

[Figure: the discretized pdf of y|x with a pair of equal-mass bins highlighted.]

The proof for quantile regression

Note that the minimizer = the discrete median, and take the limit as the bin mass goes to 0.
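A quick numerical check that the absolute (quantile, τ = 0.5) loss is minimized at the median (toy samples, brute-force grid search):

```python
import statistics

samples = [-0.9, -0.2, 0.1, 0.3, 0.8]          # toy draws from some D(y|x)

def abs_loss(pred):
    return sum(abs(pred - y) for y in samples)

# Brute-force the minimizer over a fine grid.
grid = [i / 1000.0 for i in range(-1000, 1001)]
best = min(grid, key=abs_loss)
# best coincides with the median of the samples
```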

rcv1 with different loss functions

All the common loss functions are sound for binary classification, so which is best is an empirical choice.

vw --sgd rcv1.train.txt -c --loss_function hinge
vw --sgd rcv1.train.txt -c --loss_function logistic
vw --sgd rcv1.train.txt -c --loss_function quantile

How do you know when you succeed?

Standard answer: train on the train set, test on the test set. But can we do better?

Progressive Validation

On timestep t, let l_t = L(y_{w_t}(x_t), y_t).

Report loss L = E_t l_t.

PV analysis

Let D be a distribution over (x, y). Let l̄_t = E_{(x,y)∼D} L(y_{w_t}(x), y).
Theorem: For all probability distributions D(x, y) and all online learning algorithms, with probability 1 − δ:

|L − E_t l̄_t| ≤ √( ln(2/δ) / 2T )

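Progressive validation falls out of the online loop for free: score each example before training on it, then average. A toy sketch with squared loss (the stream and learning rate are invented for illustration):

```python
def progressive_validation(stream, eta=0.1):
    """Predict-then-train; return the average squared loss over the stream."""
    w, total = {}, 0.0
    for t, (x, y) in enumerate(stream, 1):
        p = sum(w.get(i, 0.0) * v for i, v in x.items())
        total += (p - y) ** 2                 # score BEFORE updating
        for i, v in x.items():                # then do the SGD update
            w[i] = w.get(i, 0.0) + eta * (y - p) * v
    return total / t

stream = [({0: 1.0}, 1.0)] * 20
pv_loss = progressive_validation(stream)
```

Each example is scored by a model that never saw it, so the average behaves like a held-out estimate while still using every example for training.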

rcv1 with di�erent loss functions

All the common loss functions are sound for binary classification, so which is best is an empirical choice.

vw --sgd rcv1.train.txt -c --loss_function hinge --binary
vw --sgd rcv1.train.txt -c --loss_function logistic --binary
vw --sgd rcv1.train.txt -c --loss_function quantile --binary

Progressive validation often does not replace train/test discipline, but it can greatly aid empirical testing.


Part II, advanced updates

1. Importance weight invariance
2. Adaptive updates
3. Normalized updates

Learning with importance weights

A common scenario: you need to do classification, but one kind of mistake is more expensive than the other. An example: in spam detection, predicting nonspam as spam is worse than predicting spam as nonspam.

Let's say an example is I times more important than a typical example. How do you modify the update to use I?

The baseline approach: w_i ← w_i − ηI ∂L(y_w(x), y)/∂w_i.


Dealing with the importance weights

w_i ← w_i − ηI ∂L(y_w(x), y)/∂w_i performs poorly.

[Figure: squared loss and the baseline importance update as a function of the prediction when y = 1.]

Dealing with the importance weights

A better approach: apply w_i ← w_i − η ∂L(y_w(x), y)/∂w_i, I times.

[Figure: squared loss, the baseline update, and the repeated update as a function of the prediction when y = 1.]

Repeated update

Dealing with the importance weights

An even better approach: w_i ← w_i − s(ηI) ∂L(y_w(x), y)/∂w_i.

[Figure: squared loss with the baseline, repeated, and invariant updates as a function of the prediction when y = 1.]
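For squared loss the invariant update has a closed form: running the gradient update continuously for "time" ηI decays the prediction toward the label exponentially, so it can never overshoot. The sketch below is my own reconstruction under the squared-loss conventions above (constants are illustrative):

```python
import math

def invariant_update_sq(w, x, y, eta, imp):
    """Importance-aware update for squared loss L = (p - y)^2.

    Equivalent to running the gradient update continuously for
    'time' eta * imp: the prediction decays toward y exponentially
    and never crosses it, no matter how large eta * imp is.
    """
    p = sum(wi * xi for wi, xi in zip(w, x))
    xx = sum(xi * xi for xi in x)
    scale = (y - p) * (1.0 - math.exp(-2.0 * eta * imp * xx)) / xx
    return [wi + scale * xi for wi, xi in zip(w, x)]

def baseline_update_sq(w, x, y, eta, imp):
    p = sum(wi * xi for wi, xi in zip(w, x))
    return [wi - eta * imp * 2.0 * (p - y) * xi for wi, xi in zip(w, x)]

x, y = [1.0, 2.0], 1.0
w_inv = invariant_update_sq([0.0, 0.0], x, y, eta=0.5, imp=100.0)
w_base = baseline_update_sq([0.0, 0.0], x, y, eta=0.5, imp=100.0)
```

With a large importance weight, the baseline update wildly overshoots the label while the invariant update stops exactly at it.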

Robust results for unweighted problems

[Figure: four scatter plots of test performance, standard update vs. importance-aware update, on astro (logistic loss), spam (quantile loss), rcv1 (squared loss), and webspam (hinge loss).]

rcv1 with an invariant update

vw rcv1.train.txt -c --binary --invariant

Performs slightly worse with the default learning rate,but much more robust to learning rate choice.

Adaptive Learning

Learning rates must decay to converge, but how?

Common answer: η_t = 1/t^0.5 or η_t = 1/t.

Better answer: at each timestep t, let g_{it} = ∂L(y_w(x_t), y_t)/∂w_i.

New update rule: w_i ← w_i − η g_{it} / √(∑_{t'=1}^{t} g_{it'}^2)

Common features stabilize quickly. Rare features can have large updates.

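A per-coordinate sketch of this rule with squared loss (an AdaGrad-style toy, not VW's implementation; the data is invented):

```python
import math

def adaptive_sgd(stream, eta=1.0):
    """Squared-loss SGD where each coordinate's step is divided by the
    root of its accumulated squared gradients: frequently-seen features
    decay their effective rate quickly, rare features keep a large one."""
    w, g2 = {}, {}
    for x, y in stream:
        p = sum(w.get(i, 0.0) * v for i, v in x.items())
        for i, v in x.items():
            g = 2.0 * (p - y) * v            # gradient for coordinate i
            g2[i] = g2.get(i, 0.0) + g * g
            if g2[i] > 0.0:
                w[i] = w.get(i, 0.0) - eta * g / math.sqrt(g2[i])
    return w

stream = [({"common": 1.0}, 1.0), ({"common": 1.0, "rare": 1.0}, 2.0)] * 10
w = adaptive_sgd(stream)
```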

Adaptive Learning example

vw rcv1.train.txt -c --binary --adaptive

Slightly worse. Adding in --invariant -l 1 helps.

Dimensional Correction

g_{it} for squared loss = 2(y_w(x) − y) x_i, so the update is

w_i ← w_i − C x_i

The same form occurs for all linear updates.

Intrinsic problem! Doubling x_i implies halving w_i to get the same prediction. ⇒ The update rule has mixed units!


A standard solution: Gaussian sphering

For each feature x_i compute:
empirical mean μ_i = E_t x_{it}
empirical standard deviation σ_i = √(E_t (x_{it} − μ_i)^2)
Let x'_i ← (x_i − μ_i)/σ_i.

Problems:
1. You lose the online property.
2. RCV1 becomes a factor of 500 larger.

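A batch sketch of sphering on one feature column (toy data). Note that the zero entries of a sparse feature become nonzero after subtracting μ, which is why RCV1 grows by a large factor:

```python
import statistics

# Column of one feature across examples; most entries are zero (sparse).
col = [0.0, 0.0, 3.0, 0.0, 1.0]
mu = statistics.mean(col)
sigma = statistics.pstdev(col)          # population standard deviation
sphered = [(v - mu) / sigma for v in col]
# The zeros are now all nonzero: sparsity is destroyed.
```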

A scale-free update

NG(learning_rate η)

1. Initially w_i = 0, s_i = 0, N = 0.
2. For each timestep t, observe example (x, y):
   1. For each i, if |x_i| > s_i:
      1. w_i ← w_i s_i^2 / |x_i|^2   (renormalize w_i for the new scale)
      2. s_i ← |x_i|   (adjust the scale)
   2. ŷ = ∑_i w_i x_i
   3. N ← N + ∑_i x_i^2 / s_i^2   (adjust the global scale)
   4. For each i: w_i ← w_i − (η / √(N/t)) (1/s_i^2) ∂L(ŷ, y)/∂w_i
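A sketch of NG as reconstructed above (my own transcription of the pseudocode, so treat details such as the √(N/t) factor as approximate). The motivating property checks out numerically: rescaling the features by a constant leaves the sequence of predictions unchanged:

```python
import math

def ng_run(stream, eta=0.1):
    """Scale-free normalized gradient descent for squared loss;
    returns the sequence of predictions."""
    w, s, N, preds = {}, {}, 0.0, []
    for t, (x, y) in enumerate(stream, 1):
        for i, v in x.items():
            if abs(v) > s.get(i, 0.0):       # feature hit a new scale
                if i in s and s[i] > 0.0:
                    w[i] = w.get(i, 0.0) * s[i] ** 2 / v ** 2
                s[i] = abs(v)
        p = sum(w.get(i, 0.0) * v for i, v in x.items())
        preds.append(p)
        N += sum(v * v / s[i] ** 2 for i, v in x.items())
        for i, v in x.items():
            g = 2.0 * (p - y) * v            # squared-loss gradient
            w[i] = w.get(i, 0.0) - eta / math.sqrt(N / t) / s[i] ** 2 * g
    return preds

stream = [({0: 0.5}, 1.0), ({0: 1.0}, 2.0), ({0: 0.25}, 0.5)]
scaled = [({i: 100.0 * v for i, v in x.items()}, y) for x, y in stream]
```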

In combination

An adaptive, scale-free, importance invariant update rule.

vw rcv1.train.txt -c --binary

References

[RCV1 example] Leon Bottou, Stochastic Gradient Descent, 2007.
[VW] Vowpal Wabbit project, http://hunch.net/~vw, 2007-2012.
[Quantile Regression] Roger Koenker, Quantile Regression, Econometric Society Monograph Series, Cambridge University Press, 2005.
[Classification Consistency] Ambuj Tewari and Peter L. Bartlett, On the consistency of multiclass classification methods, COLT 2005.

References

[Progressive Validation I] Avrim Blum, Adam Kalai, and John Langford, Beating the Holdout: Bounds for K-fold and Progressive Cross-Validation, COLT 1999.
[Progressive Validation II] N. Cesa-Bianchi, A. Conconi, and C. Gentile, On the generalization ability of on-line learning algorithms, IEEE Transactions on Information Theory, 50(9):2050-2057, 2004.
[Importance Aware Updates] Nikos Karampatziakis and John Langford, Importance Weight Aware Gradient Updates, UAI 2010.
[Online Convex Programming] Martin Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, ICML 2003.

References

[Adaptive Updates I] John Duchi, Elad Hazan, and Yoram Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, COLT 2010 & JMLR 2011.
[Adaptive Updates II] H. Brendan McMahan and Matthew Streeter, Adaptive Bound Optimization for Online Convex Optimization, COLT 2010.
[Scale invariant updates] Forthcoming.
