
StreamSVM: Linear SVMs and Logistic Regression When Data Does Not Fit in Memory

S.V.N. (Vishy) Vishwanathan

Purdue University and Microsoft
vishy@purdue.edu

October 9, 2012


Linear Support Vector Machines

Outline

1 Linear Support Vector Machines

2 StreamSVM

3 Experiments - SVM

4 Logistic Regression

5 Experiments - Logistic Regression

6 Conclusion


Binary Classification

[Figure: training points from two classes, labeled y_i = −1 and y_i = +1; a second build adds the separating hyperplane {x | 〈w, x〉 = 0}]

Linear Support Vector Machines

[Figure: the same data with hyperplanes {x | 〈w, x〉 = 0}, {x | 〈w, x〉 = −1}, and {x | 〈w, x〉 = 1}; points x_1 and x_2 lie on the two margin hyperplanes]

From 〈w, x_1 − x_2〉 = 2 it follows that 〈w/‖w‖, x_1 − x_2〉 = 2/‖w‖, so the margin width is 2/‖w‖.

Optimization Problem

min_w  (1/2)‖w‖² + C Σ_{i=1}^m max(0, 1 − y_i 〈w, x_i〉)

[Figure: the hinge loss max(0, 1 − margin) plotted against the margin]
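To make the objective concrete, here is a minimal NumPy sketch (illustrative, not code from the talk) that evaluates the primal objective for a dense data matrix X of shape (m, d) and a label vector y of ±1 entries; the function name is hypothetical:

import numpy as np

def svm_primal_objective(w, X, y, C):
    """(1/2)||w||^2 + C * sum_i max(0, 1 - y_i <w, x_i>)."""
    margins = y * (X @ w)                    # y_i <w, x_i> per example
    hinge = np.maximum(0.0, 1.0 - margins)   # hinge loss per example
    return 0.5 * np.dot(w, w) + C * hinge.sum()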

The Dual

Objective

min_α  D(α) := (1/2) Σ_{i,j} α_i α_j y_i y_j 〈x_i, x_j〉 − Σ_i α_i
s.t.   0 ≤ α_i ≤ C

Convex and quadratic objective
Linear (box) constraints
w = Σ_i α_i y_i x_i
Each α_i corresponds to an x_i
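Correspondingly, a small illustrative sketch of the dual objective, using the identity Σ_{i,j} α_i α_j y_i y_j 〈x_i, x_j〉 = ‖Σ_i α_i y_i x_i‖² to avoid forming the kernel matrix; the names are again hypothetical:

import numpy as np

def svm_dual_objective(alpha, X, y):
    """D(alpha) = 0.5 * sum_ij a_i a_j y_i y_j <x_i, x_j> - sum_i a_i."""
    v = (alpha * y) @ X            # v = sum_i alpha_i y_i x_i = w
    return 0.5 * np.dot(v, v) - alpha.sum()

def recover_w(alpha, X, y):
    """Primal solution from dual variables: w = sum_i alpha_i y_i x_i."""
    return (alpha * y) @ X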

Coordinate Descent

[Figures: coordinate descent illustrated on a two-dimensional bowl-shaped objective; each build minimizes along one coordinate at a time, alternating between the x and y axes, with the one-dimensional slices shown as parabolas]

Coordinate Descent in the Dual

One-dimensional function

D(α_t) = (α_t²/2) 〈x_t, x_t〉 + Σ_{i≠t} α_t α_i y_i y_t 〈x_i, x_t〉 − α_t + const.
s.t.  0 ≤ α_t ≤ C

Solve

∇D(α_t) = α_t 〈x_t, x_t〉 + Σ_{i≠t} α_i y_i y_t 〈x_i, x_t〉 − 1 = 0

α_t = median(0, C, (1 − Σ_{i≠t} α_i y_i y_t 〈x_i, x_t〉) / 〈x_t, x_t〉)
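Since ∇_t D = y_t 〈w, x_t〉 − 1 once w = Σ_i α_i y_i x_i is maintained, each update costs two dot products. A minimal single-threaded sketch of one epoch of this update (illustrative; the streaming machinery of StreamSVM is not shown):

import numpy as np

def dual_cd_epoch(X, y, alpha, w, C):
    """One epoch of dual coordinate descent for the linear SVM.

    Maintains w = sum_i alpha_i y_i x_i so the gradient with respect
    to alpha_t is y_t * <w, x_t> - 1 (one dot product per update).
    median(0, C, v) is the same as clipping v to [0, C].
    """
    for t in np.random.permutation(len(y)):
        x_t, y_t = X[t], y[t]
        q_tt = float(np.dot(x_t, x_t))
        if q_tt == 0.0:               # a zero example never changes w
            continue
        grad = y_t * np.dot(w, x_t) - 1.0
        new_alpha = float(np.clip(alpha[t] - grad / q_tt, 0.0, C))
        w += (new_alpha - alpha[t]) * y_t * x_t
        alpha[t] = new_alpha
    return alpha, w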

StreamSVM


We are Collecting Lots of Data . . .

[Figure: the binary classification picture from before; successive builds add more and more labeled points (y_i = −1, y_i = +1)]

What if Data Does Not Fit in Memory?

Idea 1: Block Minimization [Yu et al., KDD 2010]
  Split data into blocks B1, B2, . . . such that each Bj fits in memory
  Compress and store each block separately
  Load one block of data at a time and optimize only those αi 's

Idea 2: Selective Block Minimization [Chang and Roth, KDD 2011]
  Split data into blocks B1, B2, . . . such that each Bj fits in memory
  Compress and store each block separately
  Load one block of data at a time and optimize only those αi 's
  Retain informative samples from each block in main memory

What are Informative Samples?

[Figure: the classification picture again; points near the margin (the potential support vectors) are the informative ones]

Some Observations

SBM and BM are wasteful
  Both split the data into blocks and compress the blocks
  This requires reading the entire dataset at least once (expensive)
  Both pause optimization while a block is loaded into memory

Hardware 101
  Disk I/O is slower than the CPU (sometimes by a factor of 100)
  Random access on HDD is terrible; sequential access is reasonably fast (factor of 10)
  Multi-core processors are becoming commonplace. How can we exploit this?

Our Architecture

[Figure: a Reader thread streams Data from HDD into a Working Set in RAM; a Trainer thread reads the working set and updates the Weight Vector in RAM]

Our Philosophy

Iterate over the data in main memory while streaming data from disk. Primarily evict those examples from main memory that are "uninformative".

Reader

for k = 1, . . . , max_iter do
  for i = 1, . . . , n do
    if |A| = Ω then
      randomly select i′ ∈ A
      A = A \ {i′}
      delete y_i′, Q_i′i′, x_i′ from RAM
    end if
    read y_i, x_i from Disk
    calculate Q_ii = 〈x_i, x_i〉
    store y_i, Q_ii, x_i in RAM
    A = A ∪ {i}
  end for
  if stopping criterion is met then
    exit
  end if
end for
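A minimal Python sketch of the reader loop under stated assumptions: read_example is a hypothetical helper returning (y_i, x_i) as a NumPy pair, cache is a dict shared with the trainer and plays the role of A, and all synchronization between the two threads is omitted:

import random

def reader(read_example, n, cache, capacity, max_iter, stop):
    """Stream examples from disk into the in-RAM working set.

    cache: dict i -> (y_i, Q_ii, x_i), playing the role of the set A.
    capacity: Omega, the maximum number of cached examples.
    stop: callable that returns True once training should end.
    """
    for _ in range(max_iter):
        for i in range(n):
            if len(cache) >= capacity:
                # evict a uniformly random cached example
                evict = random.choice(list(cache.keys()))
                del cache[evict]
            y_i, x_i = read_example(i)        # sequential disk read
            q_ii = float(x_i @ x_i)           # Q_ii = <x_i, x_i>
            cache[i] = (y_i, q_ii, x_i)
        if stop():
            return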

Trainer

α^1 = 0, w^1 = 0, ε = ∞, ε_new = 0, β = 0.9
while stopping criterion is not met do
  for t = 1, . . . , n do
    if |A| > 0.9 × Ω then ε = βε
    randomly select i ∈ A and read y_i, Q_ii, x_i from RAM
    compute ∇_i D := y_i 〈w^t, x_i〉 − 1
    if (α_i^t = 0 and ∇_i D > ε) or (α_i^t = C and ∇_i D < −ε) then
      A = A \ {i} and delete y_i, Q_ii, x_i from RAM
      continue
    end if
    α_i^{t+1} = median(0, C, α_i^t − ∇_i D / Q_ii)
    w^{t+1} = w^t + (α_i^{t+1} − α_i^t) y_i x_i
    ε_new = max(ε_new, |∇_i D|)
  end for
  update stopping criterion
  ε = ε_new
end while
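And a matching single-threaded sketch of the trainer's inner loop (again illustrative; cache and alpha are shared with the reader sketch above, and synchronization is omitted):

import random
import numpy as np

def trainer_pass(cache, alpha, w, C, eps, capacity, n_updates, beta=0.9):
    """n_updates coordinate updates over the cached working set.

    Evicts an example when alpha_i is pinned at a bound and the
    gradient says it will stay there. Returns the decayed eps and
    the new estimate eps_new = max |grad_i| seen this pass.
    """
    eps_new = 0.0
    for _ in range(n_updates):
        if not cache:
            break
        if len(cache) > 0.9 * capacity:
            eps *= beta               # tighten the eviction threshold
        i = random.choice(list(cache.keys()))
        y_i, q_ii, x_i = cache[i]
        grad = y_i * np.dot(w, x_i) - 1.0
        if (alpha[i] == 0.0 and grad > eps) or (alpha[i] == C and grad < -eps):
            del cache[i]              # uninformative: evict
            continue
        new_alpha = float(np.clip(alpha[i] - grad / q_ii, 0.0, C))
        w += (new_alpha - alpha[i]) * y_i * x_i
        alpha[i] = new_alpha
        eps_new = max(eps_new, abs(grad))
    return eps, eps_new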

Proof of Convergence (Sketch)

Definition (Luo and Tseng Problem)

minimize_α  g(Eα) + b⊤α
subject to  L_i ≤ α_i ≤ U_i

where α, b ∈ R^n, E ∈ R^{d×n}, L_i ∈ [−∞, ∞), and U_i ∈ (−∞, ∞].

The above optimization problem is a Luo and Tseng problem if the following conditions hold:

1. E has no zero columns
2. the set of optimal solutions A* is non-empty
3. the function g is strictly convex and twice continuously differentiable
4. for all optimal solutions α* ∈ A*, ∇²g(Eα*) is positive definite

Proof of Convergence (Sketch)

Lemma (see Theorem 1 of Hsieh et al., ICML 2008)
If no data point x_i = 0, then the SVM dual is a Luo and Tseng problem.

Definition (Almost cyclic rule)
There exists an integer B ≥ n such that every coordinate is iterated upon at least once every B successive iterations.

Theorem (Theorem 2.1 of Luo and Tseng, JOTA 1992)
Let α^t be a sequence of iterates generated by a coordinate descent method using the almost cyclic rule. Then α^t converges linearly to an element of A*.

Proof of Convergence (Sketch)

Lemma
If the trainer accesses the cached training data sequentially, and the following conditions hold:

  The trainer is at most κ ≥ 1 times faster than the reading thread, that is, the trainer performs at most κ coordinate updates in the time it takes the reader to read one training example from disk.
  A point is never evicted from RAM unless the α_i corresponding to that point has been updated.

then the trainer conforms to the almost cyclic rule.

Experiments - SVM


Experiments

dataset     n        d        s (%)   n+ : n−   Data size
ocr         3.5 M    1156     100     0.96      45.28 GB
dna         50 M     800      25      3e−3      63.04 GB
webspam-t   0.35 M   16.61 M  0.022   1.54      20.03 GB
kddb        20.01 M  29.89 M  1e−4    6.18      4.75 GB

(n: number of examples, d: number of features, s: density of non-zero features, n+ : n−: class ratio)

Does Active Eviction Work?

[Figure: linear SVM on webspam-t, C = 1.0; relative function value difference (log scale) vs. wall clock time (sec) for Random and Active eviction]

Comparison with Block Minimization

[Figures: relative function value difference (log scale) vs. wall clock time (sec) for StreamSVM, SBM, and BM with C = 1.0 on ocr, webspam-t, kddb, and dna]

Effect of Varying C

[Figures: relative function value difference vs. wall clock time for StreamSVM, SBM, and BM on dna with C = 10.0, C = 100.0, and C = 1000.0]

Varying Cache Size

[Figure: relative function value difference vs. wall clock time on kddb with C = 1.0, for working-set sizes of 256 MB, 1 GB, 4 GB, and 16 GB]

Expanding Features

[Figure: relative function value difference vs. wall clock time on dna with expanded features, C = 1.0, for cache sizes of 16 GB and 32 GB]

Logistic Regression


Optimization Problem

min_w  (1/2)‖w‖² + C Σ_{i=1}^m log(1 + exp(−y_i 〈w, x_i〉))

[Figure: hinge loss and logistic loss plotted against the margin]
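As with the SVM objective earlier, a small illustrative NumPy sketch (not the talk's code), using numpy.logaddexp to evaluate log(1 + exp(−m)) stably for large |m|:

import numpy as np

def logreg_primal_objective(w, X, y, C):
    """(1/2)||w||^2 + C * sum_i log(1 + exp(-y_i <w, x_i>))."""
    margins = y * (X @ w)
    # logaddexp(0, -m) = log(1 + exp(-m)), computed stably
    loss = np.logaddexp(0.0, -margins)
    return 0.5 * np.dot(w, w) + C * loss.sum()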

The Dual

Objective

min_α  D(α) := (1/2) Σ_{i,j} α_i α_j y_i y_j 〈x_i, x_j〉 + Σ_i [α_i log α_i + (C − α_i) log(C − α_i)]
s.t.   0 ≤ α_i ≤ C

w = Σ_i α_i y_i x_i

Convex objective (but not quadratic)
Box constraints

Coordinate Descent

One-dimensional function

D(α_t) := (α_t²/2) 〈x_t, x_t〉 + Σ_{i≠t} α_t α_i y_i y_t 〈x_i, x_t〉
          + α_t log α_t + (C − α_t) log(C − α_t)
s.t.  0 ≤ α_t ≤ C

Solve
No closed-form solution
A modified Newton method does the job!
See Yu, Huang, and Lin 2011 for details.
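One way to sketch such a solver is a safeguarded Newton iteration on D′(a) = Q_tt·a + m + log(a/(C − a)), where m := Σ_{i≠t} α_i y_i y_t 〈x_i, x_t〉. This is only an illustration of the idea, not the modified Newton method of Yu, Huang, and Lin 2011:

import numpy as np

def solve_1d_lr_dual(q_tt, m, C, a0=None, tol=1e-10, max_iter=50):
    """Minimize (a^2/2)*q_tt + a*m + a*log(a) + (C-a)*log(C-a) on (0, C).

    The entropy terms force the minimizer strictly inside (0, C), so
    each Newton step is clipped back into the open interval (C > 0).
    """
    a = a0 if a0 is not None else 0.5 * C     # start in the interior
    for _ in range(max_iter):
        g = q_tt * a + m + np.log(a / (C - a))    # D'(a)
        h = q_tt + 1.0 / a + 1.0 / (C - a)        # D''(a) > 0
        a_new = a - g / h
        lo, hi = 1e-12 * C, (1.0 - 1e-12) * C     # safeguard bounds
        a_new = min(max(a_new, lo), hi)
        if abs(a_new - a) < tol:
            return a_new
        a = a_new
    return a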

Determining Importance: SVM case

Shrinking: evict example i when

α_i^t = 0 and ∇_i D(α) > ε,  or
α_i^t = C and ∇_i D(α) < −ε
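As a standalone predicate (a restatement of the eviction test in the trainer sketch earlier):

def is_uninformative(alpha_i, grad_i, C, eps):
    """Shrinking test: alpha_i sits at a bound and the gradient
    indicates it will stay there, so the example can be evicted."""
    return (alpha_i == 0.0 and grad_i > eps) or \
           (alpha_i == C and grad_i < -eps)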

Determining Importance: SVM case

[Figure: gradient of the hinge loss plotted against the margin]

∇_w loss(w, x_i, y_i) = α_i   (KKT condition)

∇_i^π D(α) = 0 when |〈w, x_i〉 − 1| ≥ ε   (∇^π denotes the projected gradient)

Computing ε

ε = max_{i∈A} |∇_i^π D(α)|

Determining Importance: Logistic Regression

[Figure: gradient of the logistic loss plotted against the margin]

∇_w loss(w, x_i, y_i) ≈ α_i   (KKT condition)

∇_i D(α) ≈ 0 when |〈w, x_i〉| ≥ ε

Computing ε

ε = max_{i∈A} |∇_i D(α)|

Experiments - Logistic Regression


Does Active Eviction Work?

[Figure: logistic regression on webspam-t, C = 1.0; relative function value difference (log scale) vs. wall clock time (sec) for Random and Active eviction]

Comparison with Block Minimization

[Figure: relative function value difference vs. wall clock time for StreamSVM and BM on webspam-t with C = 1.0]

Conclusion


Things I Did Not Talk About / Work in Progress

StreamSVM
  Amortizing the cost of training different models
  Other storage hierarchies (e.g., solid state drives)
  Extensions to other problems
    Multiclass problems
    Structured output prediction
  Distributed version of the algorithm

References

Shin Matsushima, S. V. N. Vishwanathan, and Alexander J. Smola. Linear Support Vector Machines via Dual Cached Loops. KDD 2012.

Code for StreamSVM is available from
http://www.r.dl.itc.u-tokyo.ac.jp/~masin/streamsvm.html

Joint work with

[Figure: detailed system architecture. The Reading Thread reads the Dataset from Disk with sequential access and loads examples into the Cached Data (Working Set) in RAM with random access; the Training Thread reads the working set with random access and updates the Weight Vector in RAM.]
