
NOMAD: Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix factorization

Scaling by Exploiting Structure

S.V.N. (vishy) Vishwanathan

Purdue University and Amazon
vishy@purdue.edu,amazon.com

July 1st, 2013

S.V. N. Vishwanathan (Purdue and Amazon) Decentralized Optimization for ML 1 / 25

Regularized risk minimization

Machine Learning
- We want to build a model which predicts well on data
- A model's performance is quantified by a loss function: a sophisticated discrepancy score
- Our model must generalize to unseen data
- Avoid over-fitting by penalizing complex models (Regularization)

More Formally
- Training data: x_1, ..., x_m
- Labels: y_1, ..., y_m
- Learn a vector: w

\[
\min_w \; J(w) := \underbrace{\lambda \sum_{j=1}^{d} \phi_j(w_j)}_{\text{Regularizer}} \;+\; \underbrace{\frac{1}{m} \sum_{i=1}^{m} \ell(\langle w, x_i\rangle, y_i)}_{\text{Risk } R_{\mathrm{emp}}}
\]
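The objective J(w) can be evaluated directly. The sketch below assumes an L2 regularizer phi_j(w_j) = w_j^2 and the logistic loss; both are illustrative choices, not fixed by the slide.

```python
import numpy as np

def regularized_risk(w, X, y, lam):
    """J(w) = lam * sum_j phi_j(w_j) + (1/m) * sum_i l(<w, x_i>, y_i),
    with phi_j(w_j) = w_j**2 and the logistic loss as illustrative choices."""
    margins = X @ w                           # <w, x_i> for all i at once
    risk = np.log1p(np.exp(-y * margins))     # logistic loss per example
    return lam * np.sum(w ** 2) + risk.mean()

# Tiny sanity check: with w = 0 every margin is 0, the regularizer vanishes,
# and the empirical risk equals log(2).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
J = regularized_risk(np.zeros(2), X, y, lam=0.1)
```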


NOMAD for Matrix Completion

Outline

1 NOMAD for Matrix Completion

2 NOMAD for Regularized Risk Minimization


Collaborative filtering

[Figure: a 6-user × 10-movie ratings matrix (U1–U6 vs. M1–M10); observed entries are marked with ✓ or ✗, and most entries are missing.]

Matrix completion

[Figure: A ≈ W H, the ratings matrix approximated by the product of two thin low-rank factors.]

Matrix completion

\[
\min_{W \in \mathbb{R}^{m \times k},\; H \in \mathbb{R}^{n \times k}} f(W, H), \qquad
f(W, H) = \frac{1}{2} \sum_{(i,j) \in \Omega} \Bigl[\, \underbrace{\bigl(A_{ij} - w_i^\top h_j\bigr)^2}_{\text{loss}} + \underbrace{\lambda \bigl(\lVert w_i \rVert^2 + \lVert h_j \rVert^2\bigr)}_{\text{Regularizer}} \,\Bigr].
\]
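As a concrete check, here is a direct evaluation of f(W, H); the regularizer is read as applied per observed entry, one common convention (the slide's grouping is ambiguous).

```python
import numpy as np

def completion_objective(A, W, H, omega, lam):
    """f(W,H) = 1/2 * sum over (i,j) in Omega of
       (A_ij - <w_i, h_j>)**2 + lam * (||w_i||**2 + ||h_j||**2),
    with the regularizer applied per observed entry (one common reading)."""
    total = 0.0
    for i, j in omega:
        resid = A[i, j] - W[i] @ H[j]
        total += resid ** 2 + lam * (W[i] @ W[i] + H[j] @ H[j])
    return 0.5 * total

# A rank-1 matrix with its exact factors gives zero unregularized loss:
A = np.outer([1.0, 2.0], [3.0, 4.0])          # A = W H^T exactly
W = np.array([[1.0], [2.0]])                   # the w_i are rows of W (k = 1)
H = np.array([[3.0], [4.0]])
omega = [(0, 0), (0, 1), (1, 0), (1, 1)]
f0 = completion_objective(A, W, H, omega, lam=0.0)
```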

Stochastic approximation

\[
f(W,H) \approx f_n(W,H) = \frac{1}{2}\bigl(A_{ij} - w_i^\top h_j\bigr)^2 + \lambda\bigl(\lVert w_i\rVert^2 + \lVert h_j\rVert^2\bigr)
\]

\[
\nabla_{w_{i'}} f_n(W,H) =
\begin{cases}
\lambda w_i - \bigl(A_{ij} - w_i^\top h_j\bigr)\, h_j, & \text{for } i' = i \\
0, & \text{otherwise}
\end{cases}
\]

\[
\nabla_{h_{j'}} f_n(W,H) =
\begin{cases}
\lambda h_j - \bigl(A_{ij} - w_i^\top h_j\bigr)\, w_i, & \text{for } j' = j \\
0, & \text{otherwise}
\end{cases}
\]

Stochastic updates

\[
w_i \leftarrow w_i - \eta\bigl(\lambda w_i - \bigl(A_{ij} - w_i^\top h_j\bigr) h_j\bigr)
\]
\[
h_j \leftarrow h_j - \eta\bigl(\lambda h_j - \bigl(A_{ij} - w_i^\top h_j\bigr) w_i\bigr)
\]
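A minimal NumPy sketch of one such stochastic update (step size and matrix are illustrative; the residual is computed once, before either factor moves):

```python
import numpy as np

def sgd_step(A, W, H, i, j, eta, lam):
    """One stochastic step on the observed entry (i, j):
       w_i <- w_i - eta * (lam * w_i - (A_ij - <w_i, h_j>) h_j)
       h_j <- h_j - eta * (lam * h_j - (A_ij - <w_i, h_j>) w_i)"""
    resid = A[i, j] - W[i] @ H[j]
    wi = W[i].copy()                 # h_j's update must see the pre-step w_i
    W[i] -= eta * (lam * W[i] - resid * H[j])
    H[j] -= eta * (lam * H[j] - resid * wi)

# On a 1x1 example a single step shrinks the residual:
A = np.array([[2.0]])
W = np.array([[0.5]])
H = np.array([[0.5]])
sgd_step(A, W, H, 0, 0, eta=0.1, lam=0.0)
new_resid = float(A[0, 0] - W[0] @ H[0])    # initial residual was 1.75
```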

Decoupling the updates [Gemulla et al., KDD 2011]

[Figure: the ratings matrix is split into a grid of blocks; each worker processes a block that shares no rows or columns with the blocks of the other workers, so their updates proceed in parallel.]
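The reason such blocks can be processed without locking is that any two blocks scheduled together share no rows and no columns, so their updates touch disjoint w_i and h_j. A small self-contained check (the 4-worker grid is a made-up illustration):

```python
# Workers own fixed row blocks; column blocks rotate. Any two blocks active
# in the same sub-epoch share no rows and no columns, so their SGD updates
# touch disjoint w_i and h_j -- no locks are needed.
p = 4                                          # number of workers / blocks
for shift in range(p):                         # one sub-epoch per "diagonal"
    active = [(b, (b + shift) % p) for b in range(p)]
    for a, (r1, c1) in enumerate(active):
        for r2, c2 in active[a + 1:]:
            assert r1 != r2 and c1 != c2       # disjoint rows and columns
```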

Decoupling the updates [Gemulla et al., KDD 2011]

After each set of blocks, the workers synchronize and communicate before moving on to the next set.


Some Observations (also see Gemulla et al., ICDM 2012)

The good
- Updates are decoupled and easy to parallelize
- Easy to implement using map-reduce

The bad
- Communication and computation are interleaved: when the network is active the CPU is idle, and when the CPU is active the network is idle


Question: Can we keep CPU and network simultaneously busy?


Our answer

Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix factorization (NOMAD)

Illustration of NOMAD communication

[Animation: blocks of the matrix migrate between workers step by step; at every instant each worker is computing on the blocks it currently owns while other blocks are in flight, so computation and communication overlap.]
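The communication pattern can be mimicked with a toy token-passing simulation (entirely our own sketch, not the paper's code): column variables circulate through per-worker queues with no global barrier, and with enough steps every column gets updated by every worker.

```python
import random
from collections import deque

random.seed(0)
workers, columns, steps = 4, 8, 4000
queues = [deque() for _ in range(workers)]
for j in range(columns):                   # deal the columns out round-robin
    queues[j % workers].append(j)

visited = {j: set() for j in range(columns)}
for step in range(steps):
    w = step % workers                     # every worker keeps making progress
    if queues[w]:
        j = queues[w].popleft()
        visited[j].add(w)                  # "update h_j against worker w's rows"
        queues[random.randrange(workers)].append(j)   # forward it, no barrier

# Almost surely, every column has by now visited every worker.
all_seen = all(len(v) == workers for v in visited.values())
```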

Eventually . . .

[Figure: the final frame of the communication animation; the circulating blocks have spread across all workers.]

Experiments: Netflix

Size: 2,649,429 × 17,770, nnz = 99 million

[Plot: speedup vs. number of machines (0 to 30) on Netflix.]

Experiments: Netflix

[Plots: test RMSE vs. elapsed secs × num processors; each panel compares a single processor against 2, 4, 8, 15, and 30 processors.]

Experiments: Yahoo! music

Size: 1,000,990 × 624,961, nnz = 252 million

[Plot: speedup vs. number of machines (0 to 30) on Yahoo! music.]

Experiments: Yahoo! music

[Plots: test RMSE vs. elapsed secs × num processors; each panel compares a single processor against 2, 4, 8, 15, and 30 processors.]

Experiments: Synthetic Data

Size: 5,000,000 × 200,000, nnz = 270 million

[Plot: speedup vs. number of machines (0 to 30) on synthetic data.]

Experiments: Synthetic data

[Plots: test RMSE vs. elapsed secs × num processors; each panel compares a single processor against 2, 4, 8, and 15 processors.]

NOMAD for Regularized Risk Minimization

Outline

1 NOMAD for Matrix Completion

2 NOMAD for Regularized Risk Minimization


Problem Formulation

\[
\min_w \; \lambda \sum_{j=1}^{d} \phi_j(w_j) + \frac{1}{m} \sum_{i=1}^{m} \ell(\langle w, x_i\rangle, y_i)
\]

Introduce auxiliary variables u_i:

\[
\min_{w,u} \; \lambda \sum_{j=1}^{d} \phi_j(w_j) + \frac{1}{m} \sum_{i=1}^{m} \ell(u_i)
\quad \text{subject to } u_i = \langle w, x_i\rangle, \; i = 1, \ldots, m
\]

Form the Lagrangian with multipliers α_i:

\[
\min_{w,u} \max_{\alpha} \; \lambda \sum_{j=1}^{d} \phi_j(w_j) + \frac{1}{m} \sum_{i=1}^{m} \ell(u_i) + \frac{1}{m} \sum_{i=1}^{m} \alpha_i\bigl(u_i - \langle w, x_i\rangle\bigr)
\]

Swap the min and the max:

\[
\max_{\alpha} \min_{w,u} \; \lambda \sum_{j=1}^{d} \phi_j(w_j) + \frac{1}{m} \sum_{i=1}^{m} \ell(u_i) + \frac{1}{m} \sum_{i=1}^{m} \alpha_i\bigl(u_i - \langle w, x_i\rangle\bigr)
\]

The minimization over u separates:

\[
\max_{\alpha} \min_{w} \; \lambda \sum_{j=1}^{d} \phi_j(w_j) - \frac{1}{m} \sum_{i=1}^{m} \alpha_i \langle w, x_i\rangle + \frac{1}{m} \min_u \sum_{i=1}^{m} \bigl(\ell(u_i) + \alpha_i u_i\bigr)
\]

In terms of the Fenchel conjugate \(\ell^*\):

\[
\max_{\alpha} \min_{w} \; \lambda \sum_{j=1}^{d} \phi_j(w_j) - \Bigl\langle w, \frac{1}{m} \sum_{i=1}^{m} \alpha_i x_i \Bigr\rangle + \frac{1}{m} \sum_{i=1}^{m} \ell^*(-\alpha_i)
\]

Finally, exploit sparsity by writing the objective as a sum over the nonzero entries (Ω_i are the nonzero coordinates of x_i, and Ω_j the examples with x_ij ≠ 0):

\[
\max_{\alpha} \min_{w} \; \sum_{i=1}^{m} \sum_{j \in \Omega_i} \Bigl( \frac{\lambda}{\lvert\Omega_j\rvert}\, \phi_j(w_j) - \frac{1}{m}\, \alpha_i w_j x_{ij} + \frac{1}{m\,\lvert\Omega_i\rvert}\, \ell^*(-\alpha_i) \Bigr)
\]
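The step that eliminates u relies on the Fenchel conjugate l*(s) = sup_u (s·u − l(u)). For the squared loss l(u) = ½(u − y)², the conjugate has the closed form l*(s) = s·y + ½s², which a brute-force grid maximization confirms (the loss choice and the grid are illustrative):

```python
import numpy as np

y = 1.5                                        # an arbitrary label
loss = lambda u: 0.5 * (u - y) ** 2            # squared loss
conj = lambda s: s * y + 0.5 * s ** 2          # closed-form conjugate l*(s)

u = np.linspace(-10.0, 10.0, 200001)           # fine grid around the optimum
for s in (-2.0, -0.5, 0.0, 1.0):
    numeric = np.max(s * u - loss(u))          # brute-force sup_u (s*u - l(u))
    assert abs(numeric - conj(s)) < 1e-6
```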

Stochastic Gradients

\[
\partial_{w_j} J(w, \alpha) = \lvert\Omega\rvert \Bigl( \frac{\lambda}{\lvert\Omega_j\rvert}\, \partial\phi_j(w_j) - \frac{1}{m}\, \alpha_i x_{ij} \Bigr)
\]
\[
\partial_{\alpha_i} J(w, \alpha) = \lvert\Omega\rvert \Bigl( -\frac{1}{m}\, w_j x_{ij} + \frac{1}{m\,\lvert\Omega_i\rvert}\, \partial_{\alpha_i} \ell^*(-\alpha_i) \Bigr).
\]
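A sketch of one stochastic primal-dual step on a single nonzero x_ij, following the gradients above: descend in the primal variable w_j, ascend in the dual variable α_i. The squared loss, for which d/dα l*(−α) = α − y, the quadratic regularizer, and all sizes below are our illustrative assumptions:

```python
import numpy as np

def primal_dual_step(w, alpha, X, y, i, j, eta, lam, n_omega, row_nnz, col_nnz):
    """One step on nonzero (i, j): descend in w_j, ascend in alpha_i, using
    the slide's stochastic gradients with phi_j(w) = 0.5*w**2 and the squared
    loss, for which d/dalpha l*(-alpha) = alpha - y (our example choices)."""
    m = len(y)
    g_w = n_omega * (lam / col_nnz[j] * w[j] - alpha[i] * X[i, j] / m)
    g_a = n_omega * (-w[j] * X[i, j] / m + (alpha[i] - y[i]) / (m * row_nnz[i]))
    w[j] -= eta * g_w          # primal descent
    alpha[i] += eta * g_a      # dual ascent

# Deterministic toy step (all numbers made up):
X = np.array([[1.0, 2.0], [0.0, 3.0]])
y = np.array([1.0, -1.0])
w = np.zeros(2)
alpha = np.zeros(2)
primal_dual_step(w, alpha, X, y, i=0, j=0, eta=0.1, lam=0.01,
                 n_omega=3, row_nnz=np.array([2, 1]), col_nnz=np.array([1, 2]))
```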

Decoupling the updates

[Figure: the data matrix is again partitioned into blocks; workers update disjoint sets of coordinates w_j and α_i in parallel.]

It Converges!

[Plot: dual gap (log scale, 10⁻² to 10⁻¹) vs. number of iterations (0 to 100) on the KDDA Test Dataset, comparing updates on 10 percent of coordinates per iteration against all coordinates.]

Joint work with

[Diagram: system architecture. A reading thread streams the dataset from disk (sequential access) into a cached working set in RAM (random-access loads), while a training thread reads the cached data (random access) and updates the weight vector in RAM.]

S.V. N. Vishwanathan (Purdue and Amazon) Decentralized Optimization for ML 25 / 25