Distributed Asynchronous Online Learning for
Natural Language Processing
Kevin Gimpel Dipanjan Das Noah A. Smith
Introduction

• Two recent lines of research in speeding up large learning problems:
  • Parallel/distributed computing
  • Online (and mini-batch) learning algorithms: stochastic gradient descent, perceptron, MIRA, stepwise EM
• How can we bring together the benefits of parallel computing and online learning?
Introduction

• We use asynchronous algorithms (Nedic, Bertsekas, and Borkar, 2001; Langford, Smola, and Zinkevich, 2009)
• We apply them to structured prediction tasks:
  • Supervised learning
  • Unsupervised learning with both convex and non-convex objectives
• Asynchronous learning speeds convergence and works best with small mini-batches
Problem Setting

• Iterative learning
  • Moderate to large numbers of training examples
  • Expensive inference procedures for each example
  • For concreteness, we start with gradient-based optimization
• Single machine with multiple processors
  • Exploit shared memory for parameters, lexicons, feature caches, etc.
  • Maintain one master copy of model parameters
Single-Processor Batch Learning

Notation: dataset D; processors Pi; parameters θt after update t.

g = calc(D, θ): calculate gradient g on data D using parameters θ
θ1 = up(θ0, g): update using gradient g to obtain θ1

A single processor P1 alternates the two steps over time: g = calc(D, θ0), then θ1 = up(θ0, g), then g = calc(D, θ1), and so on.
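As a concrete sketch of this loop (not the authors' code; calc, up, and grad_fn are hypothetical stand-ins, and the update rule and step size are illustrative):

    import numpy as np

    def make_calc(grad_fn):
        # Build calc(D, theta): gradient of the objective summed over dataset D,
        # where grad_fn gives a per-example gradient (both names hypothetical).
        def calc(data, theta):
            g = np.zeros_like(theta)
            for example in data:
                g += grad_fn(example, theta)
            return g
        return calc

    def up(theta, g, step_size=0.1):
        # theta_{t+1} = up(theta_t, g): one gradient step; the sign and step
        # size here are illustrative, not taken from the slides.
        return theta - step_size * g

    def batch_learn(data, theta, calc, iterations=10):
        for _ in range(iterations):
            g = calc(data, theta)    # g = calc(D, theta_t)
            theta = up(theta, g)     # theta_{t+1} = up(theta_t, g)
        return theta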
Parallel Batch Learning

Notation: dataset D = D1 ∪ D2 ∪ D3; processors Pi; parameters θt; gradient g = g1 + g2 + g3.

• Divide data into parts, compute gradient on parts in parallel
• One processor updates parameters

Processors P1, P2, and P3 compute g1 = calc(D1, θ0), g2 = calc(D2, θ0), and g3 = calc(D3, θ0) in parallel; one processor then applies θ1 = up(θ0, g), and the cycle repeats with gi = calc(Di, θ1), θ2 = up(θ1, g), and so on.
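A minimal sketch of one parallel batch step under the same assumptions (hypothetical calc/up callables; threads stand in for processors, and with pure-Python gradient code true parallelism would need processes or a GIL-releasing calc):

    from concurrent.futures import ThreadPoolExecutor

    def parallel_batch_step(parts, theta, calc, up):
        # parts = [D1, D2, D3]: each worker computes a partial gradient on its
        # part with the same parameters theta, then a single update is applied.
        with ThreadPoolExecutor(max_workers=len(parts)) as pool:
            grads = list(pool.map(lambda part: calc(part, theta), parts))
        g = grads[0]
        for gi in grads[1:]:
            g = g + gi               # g = g1 + g2 + g3
        return up(theta, g)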
Parallel Synchronous Mini-Batch Learning
Finkel, Kleeman, and Manning (2008)

Notation: mini-batches Bt = B1t ∪ B2t ∪ B3t; processors Pi; parameters θt; gradient g = g1 + g2 + g3.

• Same architecture, just more frequent updates

On each round t, processor Pi computes gi = calc(Bit, θt-1) on its part of the mini-batch; the partial gradients are summed and one processor applies θt = up(θt-1, g) before the next round begins.
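The same step applied per round of mini-batches, again as a sketch with hypothetical calc/up callables; the barrier at the end of each round is the synchronization that the asynchronous scheme on the next slide removes:

    from concurrent.futures import ThreadPoolExecutor

    def synchronous_minibatch_learn(rounds, theta, calc, up):
        # rounds = [[B1_1, B2_1, B3_1], [B1_2, B2_2, B3_2], ...]: one list of
        # mini-batch parts per round. All workers must finish a round (the
        # synchronization point) before the single update for that round.
        for parts in rounds:
            with ThreadPoolExecutor(max_workers=len(parts)) as pool:
                grads = list(pool.map(lambda part: calc(part, theta), parts))
            g = grads[0]
            for gi in grads[1:]:
                g = g + gi           # g = g1 + g2 + g3
            theta = up(theta, g)
        return theta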
Parallel Asynchronous Mini-Batch Learning
Nedic, Bertsekas, and Borkar (2001)

Notation: mini-batches Bj; processors Pi; parameters θt; gradients gk.

Each processor reads the current parameters, computes a gradient on its own mini-batch, and applies its update as soon as it finishes, without waiting for the others. For example, P1, P2, and P3 start with g1 = calc(B1, θ0), g2 = calc(B2, θ0), and g3 = calc(B3, θ0). P1 finishes first and applies θ1 = up(θ0, g1), then immediately begins g1 = calc(B4, θ1) while the others are still running; P2 next applies θ2 = up(θ1, g2) and starts g2 = calc(B5, θ2); P3 applies θ3 = up(θ2, g3) and starts g3 = calc(B6, θ3); then θ4 = up(θ3, g1), and so on.

• Gradients computed using stale parameters
• Increased processor utilization
  • Only idle time caused by lock for updating parameters
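A minimal sketch of this asynchronous scheme (again with hypothetical calc/up callables, not the authors' implementation). Threads share one master parameter vector guarded by a lock; with CPython's GIL, real speed-ups would require processes or gradient code that releases the GIL inside calc:

    import queue
    import threading

    def asynchronous_learn(all_minibatches, theta, calc, up, n_workers=4):
        # theta is the shared master parameter vector (e.g. a numpy array);
        # only parameter updates are serialized by the lock.
        work = queue.Queue()
        for b in all_minibatches:
            work.put(b)
        lock = threading.Lock()

        def worker():
            while True:
                try:
                    batch = work.get_nowait()
                except queue.Empty:
                    return
                snapshot = theta.copy()       # possibly stale parameters
                g = calc(batch, snapshot)     # gradient computed without locking
                with lock:                    # only idle time: this brief lock
                    theta[:] = up(theta, g)   # update the master parameters

        workers = [threading.Thread(target=worker) for _ in range(n_workers)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        return theta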
Theoretical Results

• How does the use of stale parameters affect convergence?
• Convergence results exist for convex optimization using stochastic gradient descent
  • Convergence guaranteed when max delay is bounded (Nedic, Bertsekas, and Borkar, 2001)
  • Convergence rates linear in max delay (Langford, Smola, and Zinkevich, 2009)
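In equation form, these analyses study a delayed-gradient variant of SGD; the notation below is a paraphrase of that setting rather than the slides' own:

    % Delayed (stale-gradient) SGD: the update at step t uses a gradient
    % evaluated at parameters that are tau_t updates old.
    \theta_{t+1} = \theta_t - \eta_t \, \nabla f_t\!\left(\theta_{t - \tau_t}\right),
    \qquad 0 \le \tau_t \le \tau_{\max}

Convergence requires the delay bound τmax to be finite, and the Langford, Smola, and Zinkevich (2009) rates degrade linearly in τmax.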
Experiments

Task                                  Method                        Convex?  |D|    |θ|     Model         m
Named-Entity Recognition              Stochastic Gradient Descent   Y        15k    1.3M    CRF           4
Word Alignment                        Stepwise EM                   Y        300k   14.2M   IBM Model 1   10k
Unsupervised Part-of-Speech Tagging   Stepwise EM                   N        42k    2M      HMM           4

• To compare algorithms, we use wall clock time (with a dedicated 4-processor machine)
• m = mini-batch size
Experiments

Task                       Method                        Convex?  |D|   |θ|    Model   m
Named-Entity Recognition   Stochastic Gradient Descent   Y        15k   1.3M   CRF     4

• CoNLL 2003 English data
• Label each token with entity type (person, location, organization, or miscellaneous) or non-entity
• We show convergence in F1 on development data
Asynchronous Updating Speeds Convergence

[Plot: F1 vs. wall clock time (hours), comparing Asynchronous (4 processors), Synchronous (4 processors), and Single-processor.]

All use a mini-batch size of 4
Comparison with Ideal Speed-up

[Plot: F1 vs. wall clock time (hours), comparing Asynchronous (4 processors) with an ideal speed-up curve.]
Why Does Asynchronous Converge Faster?

• Processors are kept in near-constant use
• Synchronous SGD leads to idle processors → need for load-balancing
[Two plots: F1 vs. wall clock time (hours). One compares Synchronous (4 processors), Synchronous (2 processors), and Single-processor; the other compares Asynchronous (4 processors), Asynchronous (2 processors), and Single-processor.]

Clearer improvement for asynchronous algorithms when increasing number of processors
Artificial Delays

[Plot: F1 vs. wall clock time (hours), comparing Asynchronous with no delay, Single-processor with no delay, and Asynchronous with µ = 5, µ = 10, and µ = 20.]

After completing a mini-batch, 25% chance of delaying
Delay (in seconds) sampled from max(N(µ, (µ/5)²), 0)
Avg. time per mini-batch = 0.62 s
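A sketch of how such a delay could be injected after each mini-batch (illustrative only; maybe_delay is a hypothetical helper, not the authors' code):

    import random
    import time

    def maybe_delay(mu, p_delay=0.25):
        # With probability p_delay, sleep for max(N(mu, (mu/5)^2), 0) seconds.
        if random.random() < p_delay:
            time.sleep(max(random.gauss(mu, mu / 5.0), 0.0))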
Experiments

Task             Method        Convex?  |D|    |θ|     Model         m
Word Alignment   Stepwise EM   Y        300k   14.2M   IBM Model 1   10k

• Given parallel sentences, draw links between words:

      konnten sie es übersetzen ?
      could you translate it ?

• We show convergence in log-likelihood (convergence in AER is similar)
Stepwise EM
(Sato and Ishii, 2000; Cappe and Moulines, 2009)

• Similar to stochastic gradient descent in the space of sufficient statistics, with a particular scaling of the update
• More efficient than incremental EM (Neal and Hinton, 1998)
• Found to converge much faster than batch EM (Liang and Klein, 2009)
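One standard form of the stepwise E-step, following Cappe and Moulines (2009) and Liang and Klein (2009); the slides do not write out the exact variant used, so this is a sketch of the general update, in which the running sufficient statistics are interpolated toward the expected statistics of each mini-batch:

    % s^{(k)}: running sufficient statistics; \bar{s}(B_k; \theta): expected
    % statistics on mini-batch B_k under the current parameters \theta(s^{(k)}).
    s^{(k+1)} = (1 - \eta_k)\, s^{(k)} + \eta_k\, \bar{s}\!\left(B_k;\ \theta\!\left(s^{(k)}\right)\right),
    \qquad \eta_k = (k + 2)^{-\alpha}, \quad \tfrac{1}{2} < \alpha \le 1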
Word Alignment Results

[Plot: log-likelihood vs. wall clock time (minutes), comparing Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), and Batch EM (1 processor).]

For stepwise EM, mini-batch size = 10,000

Asynchronous is no faster than synchronous!
Comparing Mini-Batch Sizes

[Plot: log-likelihood vs. wall clock time (minutes), comparing asynchronous and synchronous stepwise EM with mini-batch sizes m = 10,000, m = 1,000, and m = 100.]

Asynchronous is faster when using small mini-batches.
A callout on the plot marks error from asynchronous updating.
Comparison with Ideal Speed-up

[Plot: log-likelihood vs. wall clock time (minutes), comparing Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), Batch EM (1 processor), and an ideal speed-up curve.]

For stepwise EM, mini-batch size = 10,000
MapReduce?

• We also ran these algorithms on a large MapReduce cluster (M45 from Yahoo!)
• Batch EM
  • Each iteration is one MapReduce job, using 24 mappers and 1 reducer
• Asynchronous Stepwise EM
  • 4 mini-batches processed simultaneously, each run as a MapReduce job
  • Each uses 6 mappers and 1 reducer
MapReduce?

[Two plots: log-likelihood vs. wall clock time (minutes). One shows the single-machine results (Asynch. Stepwise EM and Synch. Stepwise EM with 4 processors, Synch. Stepwise EM with 1 processor, Batch EM with 1 processor); the other compares Asynch. Stepwise EM and Batch EM run on MapReduce.]
Experiments

Task                                  Method        Convex?  |D|   |θ|   Model   m
Unsupervised Part-of-Speech Tagging   Stepwise EM   N        42k   2M    HMM     4

• Bigram HMM with 45 states
• We plot convergence in likelihood and many-to-1 accuracy
Part-of-Speech Tagging Results

[Two plots vs. wall clock time (hours): log-likelihood and many-to-1 accuracy (%), comparing Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), and Batch EM (1 processor).]

mini-batch size = 4 for stepwise EM
Comparison with Ideal

[Two plots vs. wall clock time (hours): log-likelihood and accuracy (%), comparing Asynch. Stepwise EM (4 processors) with an ideal speed-up curve.]

Asynchronous better than ideal?
Conclusions and Future Work

• Asynchronous algorithms speed convergence and do not introduce additional error
• Effective for unsupervised learning and non-convex objectives
• If your problem works well with small mini-batches, try this!
• Future work
  • Theoretical results for non-convex case
  • Explore effects of increasing number of processors
  • New architectures (maintain multiple copies of θ)
Thanks!