
StreamSVM: Linear SVMs and Logistic Regression When Data Does Not Fit in Memory

S.V.N. (Vishy) Vishwanathan

Purdue University and Microsoft
vishy@purdue.edu

October 9, 2012


Linear Support Vector Machines

Outline

1 Linear Support Vector Machines

2 StreamSVM

3 Experiments - SVM

4 Logistic Regression

5 Experiments - Logistic Regression

6 Conclusion


Binary Classification

[Figure: training points from two classes, labeled y_i = −1 and y_i = +1; a second build adds the separating hyperplane {x | 〈w, x〉 = 0}]

Linear Support Vector Machines

[Figure: the same data with hyperplanes {x | 〈w, x〉 = 0}, {x | 〈w, x〉 = −1}, and {x | 〈w, x〉 = 1}; points x_1 and x_2 lie on the two margin hyperplanes]

From 〈w, x_1 − x_2〉 = 2 it follows that 〈w/‖w‖, x_1 − x_2〉 = 2/‖w‖, so the margin width is 2/‖w‖.

Optimization Problem

min_w  (1/2)‖w‖² + C Σ_{i=1}^m max(0, 1 − y_i 〈w, x_i〉)

[Figure: the hinge loss max(0, 1 − margin) plotted against the margin]
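To make the objective concrete, here is a minimal NumPy sketch (illustrative, not code from the talk) that evaluates the primal objective for a dense data matrix X of shape (m, d) and a label vector y of ±1 entries; the function name is hypothetical:

import numpy as np

def svm_primal_objective(w, X, y, C):
    """(1/2)||w||^2 + C * sum_i max(0, 1 - y_i <w, x_i>)."""
    margins = y * (X @ w)                    # y_i <w, x_i> per example
    hinge = np.maximum(0.0, 1.0 - margins)   # hinge loss per example
    return 0.5 * np.dot(w, w) + C * hinge.sum()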

The Dual

Objective

min_α  D(α) := (1/2) Σ_{i,j} α_i α_j y_i y_j 〈x_i, x_j〉 − Σ_i α_i
s.t.   0 ≤ α_i ≤ C

Convex and quadratic objective
Linear (box) constraints
w = Σ_i α_i y_i x_i
Each α_i corresponds to an x_i
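Correspondingly, a small illustrative sketch of the dual objective, using the identity Σ_{i,j} α_i α_j y_i y_j 〈x_i, x_j〉 = ‖Σ_i α_i y_i x_i‖² to avoid forming the kernel matrix; the names are again hypothetical:

import numpy as np

def svm_dual_objective(alpha, X, y):
    """D(alpha) = 0.5 * sum_ij a_i a_j y_i y_j <x_i, x_j> - sum_i a_i."""
    v = (alpha * y) @ X            # v = sum_i alpha_i y_i x_i = w
    return 0.5 * np.dot(v, v) - alpha.sum()

def recover_w(alpha, X, y):
    """Primal solution from dual variables: w = sum_i alpha_i y_i x_i."""
    return (alpha * y) @ X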

Coordinate Descent

[Figures: coordinate descent illustrated on a two-dimensional bowl-shaped objective; each build minimizes along one coordinate at a time, alternating between the x and y axes, with the one-dimensional slices shown as parabolas]

Coordinate Descent in the Dual

One-dimensional function

D(α_t) = (α_t²/2) 〈x_t, x_t〉 + Σ_{i≠t} α_t α_i y_i y_t 〈x_i, x_t〉 − α_t + const.
s.t.  0 ≤ α_t ≤ C

Solve

∇D(α_t) = α_t 〈x_t, x_t〉 + Σ_{i≠t} α_i y_i y_t 〈x_i, x_t〉 − 1 = 0

α_t = median(0, C, (1 − Σ_{i≠t} α_i y_i y_t 〈x_i, x_t〉) / 〈x_t, x_t〉)
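Since ∇_t D = y_t 〈w, x_t〉 − 1 once w = Σ_i α_i y_i x_i is maintained, each update costs two dot products. A minimal single-threaded sketch of one epoch of this update (illustrative; the streaming machinery of StreamSVM is not shown):

import numpy as np

def dual_cd_epoch(X, y, alpha, w, C):
    """One epoch of dual coordinate descent for the linear SVM.

    Maintains w = sum_i alpha_i y_i x_i so the gradient with respect
    to alpha_t is y_t * <w, x_t> - 1 (one dot product per update).
    median(0, C, v) is the same as clipping v to [0, C].
    """
    for t in np.random.permutation(len(y)):
        x_t, y_t = X[t], y[t]
        q_tt = float(np.dot(x_t, x_t))
        if q_tt == 0.0:               # a zero example never changes w
            continue
        grad = y_t * np.dot(w, x_t) - 1.0
        new_alpha = float(np.clip(alpha[t] - grad / q_tt, 0.0, C))
        w += (new_alpha - alpha[t]) * y_t * x_t
        alpha[t] = new_alpha
    return alpha, w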

StreamSVM


We are Collecting Lots of Data . . .

[Figure: the binary classification picture from before; successive builds add more and more labeled points (y_i = −1, y_i = +1)]

What if Data Does Not Fit in Memory?

Idea 1: Block Minimization [Yu et al., KDD 2010]
  Split data into blocks B1, B2, . . . such that each Bj fits in memory
  Compress and store each block separately
  Load one block of data at a time and optimize only those αi 's

Idea 2: Selective Block Minimization [Chang and Roth, KDD 2011]
  Split data into blocks B1, B2, . . . such that each Bj fits in memory
  Compress and store each block separately
  Load one block of data at a time and optimize only those αi 's
  Retain informative samples from each block in main memory

What are Informative Samples?

[Figure: the classification picture again; points near the margin (the potential support vectors) are the informative ones]

Some Observations

SBM and BM are wasteful
  Both split the data into blocks and compress the blocks
  This requires reading the entire dataset at least once (expensive)
  Both pause optimization while a block is loaded into memory

Hardware 101
  Disk I/O is slower than the CPU (sometimes by a factor of 100)
  Random access on HDD is terrible; sequential access is reasonably fast (factor of 10)
  Multi-core processors are becoming commonplace. How can we exploit this?

Our Architecture

[Figure: a Reader thread streams Data from HDD into a Working Set in RAM; a Trainer thread reads the working set and updates the Weight Vector in RAM]

Our Philosophy

Iterate over the data in main memory while streaming data from disk. Primarily evict those examples from main memory that are "uninformative".

Reader

for k = 1, . . . , max_iter do
  for i = 1, . . . , n do
    if |A| = Ω then
      randomly select i′ ∈ A
      A = A \ {i′}
      delete y_i′, Q_i′i′, x_i′ from RAM
    end if
    read y_i, x_i from Disk
    calculate Q_ii = 〈x_i, x_i〉
    store y_i, Q_ii, x_i in RAM
    A = A ∪ {i}
  end for
  if stopping criterion is met then
    exit
  end if
end for
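A minimal Python sketch of the reader loop under stated assumptions: read_example is a hypothetical helper returning (y_i, x_i) as a NumPy pair, cache is a dict shared with the trainer and plays the role of A, and all synchronization between the two threads is omitted:

import random

def reader(read_example, n, cache, capacity, max_iter, stop):
    """Stream examples from disk into the in-RAM working set.

    cache: dict i -> (y_i, Q_ii, x_i), playing the role of the set A.
    capacity: Omega, the maximum number of cached examples.
    stop: callable that returns True once training should end.
    """
    for _ in range(max_iter):
        for i in range(n):
            if len(cache) >= capacity:
                # evict a uniformly random cached example
                evict = random.choice(list(cache.keys()))
                del cache[evict]
            y_i, x_i = read_example(i)        # sequential disk read
            q_ii = float(x_i @ x_i)           # Q_ii = <x_i, x_i>
            cache[i] = (y_i, q_ii, x_i)
        if stop():
            return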

Trainer

α^1 = 0, w^1 = 0, ε = ∞, ε_new = 0, β = 0.9
while stopping criterion is not met do
  for t = 1, . . . , n do
    if |A| > 0.9 × Ω then ε = βε
    randomly select i ∈ A and read y_i, Q_ii, x_i from RAM
    compute ∇_i D := y_i 〈w^t, x_i〉 − 1
    if (α_i^t = 0 and ∇_i D > ε) or (α_i^t = C and ∇_i D < −ε) then
      A = A \ {i} and delete y_i, Q_ii, x_i from RAM
      continue
    end if
    α_i^{t+1} = median(0, C, α_i^t − ∇_i D / Q_ii)
    w^{t+1} = w^t + (α_i^{t+1} − α_i^t) y_i x_i
    ε_new = max(ε_new, |∇_i D|)
  end for
  update stopping criterion
  ε = ε_new
end while
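And a matching single-threaded sketch of the trainer's inner loop (again illustrative; cache and alpha are shared with the reader sketch above, and synchronization is omitted):

import random
import numpy as np

def trainer_pass(cache, alpha, w, C, eps, capacity, n_updates, beta=0.9):
    """n_updates coordinate updates over the cached working set.

    Evicts an example when alpha_i is pinned at a bound and the
    gradient says it will stay there. Returns the decayed eps and
    the new estimate eps_new = max |grad_i| seen this pass.
    """
    eps_new = 0.0
    for _ in range(n_updates):
        if not cache:
            break
        if len(cache) > 0.9 * capacity:
            eps *= beta               # tighten the eviction threshold
        i = random.choice(list(cache.keys()))
        y_i, q_ii, x_i = cache[i]
        grad = y_i * np.dot(w, x_i) - 1.0
        if (alpha[i] == 0.0 and grad > eps) or (alpha[i] == C and grad < -eps):
            del cache[i]              # uninformative: evict
            continue
        new_alpha = float(np.clip(alpha[i] - grad / q_ii, 0.0, C))
        w += (new_alpha - alpha[i]) * y_i * x_i
        alpha[i] = new_alpha
        eps_new = max(eps_new, abs(grad))
    return eps, eps_new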

Proof of Convergence (Sketch)

Definition (Luo and Tseng Problem)

minimize_α  g(Eα) + b⊤α
subject to  L_i ≤ α_i ≤ U_i

where α, b ∈ R^n, E ∈ R^{d×n}, L_i ∈ [−∞, ∞), and U_i ∈ (−∞, ∞].

The above optimization problem is a Luo and Tseng problem if the following conditions hold:

1. E has no zero columns
2. the set of optimal solutions A* is non-empty
3. the function g is strictly convex and twice continuously differentiable
4. for all optimal solutions α* ∈ A*, ∇²g(Eα*) is positive definite

Proof of Convergence (Sketch)

Lemma (see Theorem 1 of Hsieh et al., ICML 2008)
If no data point x_i = 0, then the SVM dual is a Luo and Tseng problem.

Definition (Almost cyclic rule)
There exists an integer B ≥ n such that every coordinate is iterated upon at least once every B successive iterations.

Theorem (Theorem 2.1 of Luo and Tseng, JOTA 1992)
Let α^t be a sequence of iterates generated by a coordinate descent method using the almost cyclic rule. Then α^t converges linearly to an element of A*.

Proof of Convergence (Sketch)

Lemma
If the trainer accesses the cached training data sequentially, and the following conditions hold:

  The trainer is at most κ ≥ 1 times faster than the reading thread, that is, the trainer performs at most κ coordinate updates in the time it takes the reader to read one training example from disk.
  A point is never evicted from RAM unless the α_i corresponding to that point has been updated.

then the trainer conforms to the almost cyclic rule.

Experiments - SVM


Experiments

dataset     n        d        s (%)   n+ : n−   Data size
ocr         3.5 M    1156     100     0.96      45.28 GB
dna         50 M     800      25      3e−3      63.04 GB
webspam-t   0.35 M   16.61 M  0.022   1.54      20.03 GB
kddb        20.01 M  29.89 M  1e−4    6.18      4.75 GB

(n: number of examples, d: number of features, s: density of non-zero features, n+ : n−: class ratio)

Does Active Eviction Work?

[Figure: linear SVM on webspam-t, C = 1.0; relative function value difference (log scale) vs. wall clock time (sec) for Random and Active eviction]

Comparison with Block Minimization

[Figures: relative function value difference (log scale) vs. wall clock time (sec) for StreamSVM, SBM, and BM with C = 1.0 on ocr, webspam-t, kddb, and dna]

Effect of Varying C

[Figures: relative function value difference vs. wall clock time for StreamSVM, SBM, and BM on dna with C = 10.0, C = 100.0, and C = 1000.0]

Varying Cache Size

[Figure: relative function value difference vs. wall clock time on kddb with C = 1.0, for working-set sizes of 256 MB, 1 GB, 4 GB, and 16 GB]

Expanding Features

[Figure: relative function value difference vs. wall clock time on dna with expanded features, C = 1.0, for cache sizes of 16 GB and 32 GB]

Logistic Regression


Optimization Problem

min_w  (1/2)‖w‖² + C Σ_{i=1}^m log(1 + exp(−y_i 〈w, x_i〉))

[Figure: hinge loss and logistic loss plotted against the margin]
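As with the SVM objective earlier, a small illustrative NumPy sketch (not the talk's code), using numpy.logaddexp to evaluate log(1 + exp(−m)) stably for large |m|:

import numpy as np

def logreg_primal_objective(w, X, y, C):
    """(1/2)||w||^2 + C * sum_i log(1 + exp(-y_i <w, x_i>))."""
    margins = y * (X @ w)
    # logaddexp(0, -m) = log(1 + exp(-m)), computed stably
    loss = np.logaddexp(0.0, -margins)
    return 0.5 * np.dot(w, w) + C * loss.sum()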

The Dual

Objective

min_α  D(α) := (1/2) Σ_{i,j} α_i α_j y_i y_j 〈x_i, x_j〉 + Σ_i [α_i log α_i + (C − α_i) log(C − α_i)]
s.t.   0 ≤ α_i ≤ C

w = Σ_i α_i y_i x_i

Convex objective (but not quadratic)
Box constraints

Coordinate Descent

One-dimensional function

D(α_t) := (α_t²/2) 〈x_t, x_t〉 + Σ_{i≠t} α_t α_i y_i y_t 〈x_i, x_t〉
          + α_t log α_t + (C − α_t) log(C − α_t)
s.t.  0 ≤ α_t ≤ C

Solve
No closed-form solution
A modified Newton method does the job!
See Yu, Huang, and Lin 2011 for details.
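One way to sketch such a solver is a safeguarded Newton iteration on D′(a) = Q_tt·a + m + log(a/(C − a)), where m := Σ_{i≠t} α_i y_i y_t 〈x_i, x_t〉. This is only an illustration of the idea, not the modified Newton method of Yu, Huang, and Lin 2011:

import numpy as np

def solve_1d_lr_dual(q_tt, m, C, a0=None, tol=1e-10, max_iter=50):
    """Minimize (a^2/2)*q_tt + a*m + a*log(a) + (C-a)*log(C-a) on (0, C).

    The entropy terms force the minimizer strictly inside (0, C), so
    each Newton step is clipped back into the open interval (C > 0).
    """
    a = a0 if a0 is not None else 0.5 * C     # start in the interior
    for _ in range(max_iter):
        g = q_tt * a + m + np.log(a / (C - a))    # D'(a)
        h = q_tt + 1.0 / a + 1.0 / (C - a)        # D''(a) > 0
        a_new = a - g / h
        lo, hi = 1e-12 * C, (1.0 - 1e-12) * C     # safeguard bounds
        a_new = min(max(a_new, lo), hi)
        if abs(a_new - a) < tol:
            return a_new
        a = a_new
    return a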

Determining Importance: SVM case

Shrinking: evict example i when

α_i^t = 0 and ∇_i D(α) > ε,  or
α_i^t = C and ∇_i D(α) < −ε
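As a standalone predicate (a restatement of the eviction test in the trainer sketch earlier):

def is_uninformative(alpha_i, grad_i, C, eps):
    """Shrinking test: alpha_i sits at a bound and the gradient
    indicates it will stay there, so the example can be evicted."""
    return (alpha_i == 0.0 and grad_i > eps) or \
           (alpha_i == C and grad_i < -eps)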

Determining Importance: SVM case

[Figure: gradient of the hinge loss plotted against the margin]

∇_w loss(w, x_i, y_i) = α_i   (KKT condition)

∇_i^π D(α) = 0 when |〈w, x_i〉 − 1| ≥ ε   (∇^π denotes the projected gradient)

Computing ε

ε = max_{i∈A} |∇_i^π D(α)|

Determining Importance: Logistic Regression

[Figure: gradient of the logistic loss plotted against the margin]

∇_w loss(w, x_i, y_i) ≈ α_i   (KKT condition)

∇_i D(α) ≈ 0 when |〈w, x_i〉| ≥ ε

Computing ε

ε = max_{i∈A} |∇_i D(α)|

Experiments - Logistic Regression


Does Active Eviction Work?

[Figure: logistic regression on webspam-t, C = 1.0; relative function value difference (log scale) vs. wall clock time (sec) for Random and Active eviction]

Comparison with Block Minimization

[Figure: relative function value difference vs. wall clock time for StreamSVM and BM on webspam-t with C = 1.0]

Conclusion


Things I Did Not Talk About / Work in Progress

StreamSVM
  Amortizing the cost of training different models
  Other storage hierarchies (e.g., solid state drives)
  Extensions to other problems
    Multiclass problems
    Structured output prediction
  Distributed version of the algorithm

References

Shin Matsushima, S. V. N. Vishwanathan, and Alexander J. Smola. Linear Support Vector Machines via Dual Cached Loops. KDD 2012.

Code for StreamSVM is available from
http://www.r.dl.itc.u-tokyo.ac.jp/~masin/streamsvm.html

Joint work with

[Figure: detailed system architecture. The Reading Thread reads the Dataset from Disk with sequential access and loads examples into the Cached Data (Working Set) in RAM with random access; the Training Thread reads the working set with random access and updates the Weight Vector in RAM.]
