Distributed Asynchronous Online Learning for
Natural Language Processing
Kevin Gimpel Dipanjan Das Noah A. Smith
Introduction

• Two recent lines of research in speeding up large learning problems:
  • Parallel/distributed computing
  • Online (and mini-batch) learning algorithms: stochastic gradient descent, perceptron, MIRA, stepwise EM
• How can we bring together the benefits of parallel computing and online learning?
Introduction

• We use asynchronous algorithms (Nedic, Bertsekas, and Borkar, 2001; Langford, Smola, and Zinkevich, 2009)
• We apply them to structured prediction tasks:
  • Supervised learning
  • Unsupervised learning with both convex and non-convex objectives
• Asynchronous learning speeds convergence and works best with small mini-batches
Problem Setting

• Iterative learning
  • Moderate to large numbers of training examples
  • Expensive inference procedures for each example
  • For concreteness, we start with gradient-based optimization
• Single machine with multiple processors
  • Exploit shared memory for parameters, lexicons, feature caches, etc.
  • Maintain one master copy of model parameters
Single-Processor Batch Learning

Notation: dataset D; processors Pi; parameters θt after update t.

g = calc(D, θ): calculate gradient g on data D using parameters θ
θ1 = up(θ0, g): update using gradient g to obtain θ1

A single processor P1 alternates the two steps over time: g = calc(D, θ0), then θ1 = up(θ0, g), then g = calc(D, θ1), and so on.
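As a concrete sketch of this loop (not the authors' code; calc, up, and grad_fn are hypothetical stand-ins, and the update rule and step size are illustrative):

    import numpy as np

    def make_calc(grad_fn):
        # Build calc(D, theta): gradient of the objective summed over dataset D,
        # where grad_fn gives a per-example gradient (both names hypothetical).
        def calc(data, theta):
            g = np.zeros_like(theta)
            for example in data:
                g += grad_fn(example, theta)
            return g
        return calc

    def up(theta, g, step_size=0.1):
        # theta_{t+1} = up(theta_t, g): one gradient step; the sign and step
        # size here are illustrative, not taken from the slides.
        return theta - step_size * g

    def batch_learn(data, theta, calc, iterations=10):
        for _ in range(iterations):
            g = calc(data, theta)    # g = calc(D, theta_t)
            theta = up(theta, g)     # theta_{t+1} = up(theta_t, g)
        return theta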
Parallel Batch Learning

Notation: dataset D = D1 ∪ D2 ∪ D3; processors Pi; parameters θt; gradient g = g1 + g2 + g3.

• Divide data into parts, compute gradient on parts in parallel
• One processor updates parameters

Processors P1, P2, and P3 compute g1 = calc(D1, θ0), g2 = calc(D2, θ0), and g3 = calc(D3, θ0) in parallel; one processor then applies θ1 = up(θ0, g), and the cycle repeats with gi = calc(Di, θ1), θ2 = up(θ1, g), and so on.
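A minimal sketch of one parallel batch step under the same assumptions (hypothetical calc/up callables; threads stand in for processors, and with pure-Python gradient code true parallelism would need processes or a GIL-releasing calc):

    from concurrent.futures import ThreadPoolExecutor

    def parallel_batch_step(parts, theta, calc, up):
        # parts = [D1, D2, D3]: each worker computes a partial gradient on its
        # part with the same parameters theta, then a single update is applied.
        with ThreadPoolExecutor(max_workers=len(parts)) as pool:
            grads = list(pool.map(lambda part: calc(part, theta), parts))
        g = grads[0]
        for gi in grads[1:]:
            g = g + gi               # g = g1 + g2 + g3
        return up(theta, g)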
Parallel Synchronous Mini-Batch Learning
Finkel, Kleeman, and Manning (2008)

Notation: mini-batches Bt = B1t ∪ B2t ∪ B3t; processors Pi; parameters θt; gradient g = g1 + g2 + g3.

• Same architecture, just more frequent updates

On each round t, processor Pi computes gi = calc(Bit, θt-1) on its part of the mini-batch; the partial gradients are summed and one processor applies θt = up(θt-1, g) before the next round begins.
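The same step applied per round of mini-batches, again as a sketch with hypothetical calc/up callables; the barrier at the end of each round is the synchronization that the asynchronous scheme on the next slide removes:

    from concurrent.futures import ThreadPoolExecutor

    def synchronous_minibatch_learn(rounds, theta, calc, up):
        # rounds = [[B1_1, B2_1, B3_1], [B1_2, B2_2, B3_2], ...]: one list of
        # mini-batch parts per round. All workers must finish a round (the
        # synchronization point) before the single update for that round.
        for parts in rounds:
            with ThreadPoolExecutor(max_workers=len(parts)) as pool:
                grads = list(pool.map(lambda part: calc(part, theta), parts))
            g = grads[0]
            for gi in grads[1:]:
                g = g + gi           # g = g1 + g2 + g3
            theta = up(theta, g)
        return theta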
Parallel Asynchronous Mini-Batch Learning
Nedic, Bertsekas, and Borkar (2001)

Notation: mini-batches Bj; processors Pi; parameters θt; gradients gk.

Each processor reads the current parameters, computes a gradient on its own mini-batch, and applies its update as soon as it finishes, without waiting for the others. For example, P1, P2, and P3 start with g1 = calc(B1, θ0), g2 = calc(B2, θ0), and g3 = calc(B3, θ0). P1 finishes first and applies θ1 = up(θ0, g1), then immediately begins g1 = calc(B4, θ1) while the others are still running; P2 next applies θ2 = up(θ1, g2) and starts g2 = calc(B5, θ2); P3 applies θ3 = up(θ2, g3) and starts g3 = calc(B6, θ3); then θ4 = up(θ3, g1), and so on.

• Gradients computed using stale parameters
• Increased processor utilization
  • Only idle time caused by lock for updating parameters
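A minimal sketch of this asynchronous scheme (again with hypothetical calc/up callables, not the authors' implementation). Threads share one master parameter vector guarded by a lock; with CPython's GIL, real speed-ups would require processes or gradient code that releases the GIL inside calc:

    import queue
    import threading

    def asynchronous_learn(all_minibatches, theta, calc, up, n_workers=4):
        # theta is the shared master parameter vector (e.g. a numpy array);
        # only parameter updates are serialized by the lock.
        work = queue.Queue()
        for b in all_minibatches:
            work.put(b)
        lock = threading.Lock()

        def worker():
            while True:
                try:
                    batch = work.get_nowait()
                except queue.Empty:
                    return
                snapshot = theta.copy()       # possibly stale parameters
                g = calc(batch, snapshot)     # gradient computed without locking
                with lock:                    # only idle time: this brief lock
                    theta[:] = up(theta, g)   # update the master parameters

        workers = [threading.Thread(target=worker) for _ in range(n_workers)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        return theta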
Theoretical Results

• How does the use of stale parameters affect convergence?
• Convergence results exist for convex optimization using stochastic gradient descent
  • Convergence guaranteed when max delay is bounded (Nedic, Bertsekas, and Borkar, 2001)
  • Convergence rates linear in max delay (Langford, Smola, and Zinkevich, 2009)
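In equation form, these analyses study a delayed-gradient variant of SGD; the notation below is a paraphrase of that setting rather than the slides' own:

    % Delayed (stale-gradient) SGD: the update at step t uses a gradient
    % evaluated at parameters that are tau_t updates old.
    \theta_{t+1} = \theta_t - \eta_t \, \nabla f_t\!\left(\theta_{t - \tau_t}\right),
    \qquad 0 \le \tau_t \le \tau_{\max}

Convergence requires the delay bound τmax to be finite, and the Langford, Smola, and Zinkevich (2009) rates degrade linearly in τmax.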
Experiments

Task                                  Method                        Convex?  |D|    |θ|     Model         m
Named-Entity Recognition              Stochastic Gradient Descent   Y        15k    1.3M    CRF           4
Word Alignment                        Stepwise EM                   Y        300k   14.2M   IBM Model 1   10k
Unsupervised Part-of-Speech Tagging   Stepwise EM                   N        42k    2M      HMM           4

• To compare algorithms, we use wall clock time (with a dedicated 4-processor machine)
• m = mini-batch size
Experiments

Task                       Method                        Convex?  |D|   |θ|    Model   m
Named-Entity Recognition   Stochastic Gradient Descent   Y        15k   1.3M   CRF     4

• CoNLL 2003 English data
• Label each token with entity type (person, location, organization, or miscellaneous) or non-entity
• We show convergence in F1 on development data
Asynchronous Updating Speeds Convergence

[Plot: F1 vs. wall clock time (hours), comparing Asynchronous (4 processors), Synchronous (4 processors), and Single-processor.]

All use a mini-batch size of 4
Comparison with Ideal Speed-up

[Plot: F1 vs. wall clock time (hours), comparing Asynchronous (4 processors) with an ideal speed-up curve.]
Why Does Asynchronous Converge Faster?

• Processors are kept in near-constant use
• Synchronous SGD leads to idle processors → need for load-balancing
[Two plots: F1 vs. wall clock time (hours). One compares Synchronous (4 processors), Synchronous (2 processors), and Single-processor; the other compares Asynchronous (4 processors), Asynchronous (2 processors), and Single-processor.]

Clearer improvement for asynchronous algorithms when increasing number of processors
Artificial Delays

[Plot: F1 vs. wall clock time (hours), comparing Asynchronous with no delay, Single-processor with no delay, and Asynchronous with µ = 5, µ = 10, and µ = 20.]

After completing a mini-batch, 25% chance of delaying
Delay (in seconds) sampled from max(N(µ, (µ/5)²), 0)
Avg. time per mini-batch = 0.62 s
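A sketch of how such a delay could be injected after each mini-batch (illustrative only; maybe_delay is a hypothetical helper, not the authors' code):

    import random
    import time

    def maybe_delay(mu, p_delay=0.25):
        # With probability p_delay, sleep for max(N(mu, (mu/5)^2), 0) seconds.
        if random.random() < p_delay:
            time.sleep(max(random.gauss(mu, mu / 5.0), 0.0))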
Experiments

Task             Method        Convex?  |D|    |θ|     Model         m
Word Alignment   Stepwise EM   Y        300k   14.2M   IBM Model 1   10k

• Given parallel sentences, draw links between words:

      konnten sie es übersetzen ?
      could you translate it ?

• We show convergence in log-likelihood (convergence in AER is similar)
Stepwise EM
(Sato and Ishii, 2000; Cappe and Moulines, 2009)

• Similar to stochastic gradient descent in the space of sufficient statistics, with a particular scaling of the update
• More efficient than incremental EM (Neal and Hinton, 1998)
• Found to converge much faster than batch EM (Liang and Klein, 2009)
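One standard form of the stepwise E-step, following Cappe and Moulines (2009) and Liang and Klein (2009); the slides do not write out the exact variant used, so this is a sketch of the general update, in which the running sufficient statistics are interpolated toward the expected statistics of each mini-batch:

    % s^{(k)}: running sufficient statistics; \bar{s}(B_k; \theta): expected
    % statistics on mini-batch B_k under the current parameters \theta(s^{(k)}).
    s^{(k+1)} = (1 - \eta_k)\, s^{(k)} + \eta_k\, \bar{s}\!\left(B_k;\ \theta\!\left(s^{(k)}\right)\right),
    \qquad \eta_k = (k + 2)^{-\alpha}, \quad \tfrac{1}{2} < \alpha \le 1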
Word Alignment Results

[Plot: log-likelihood vs. wall clock time (minutes), comparing Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), and Batch EM (1 processor).]

For stepwise EM, mini-batch size = 10,000

Asynchronous is no faster than synchronous!
Comparing Mini-Batch Sizes

[Plot: log-likelihood vs. wall clock time (minutes), comparing asynchronous and synchronous stepwise EM with mini-batch sizes m = 10,000, m = 1,000, and m = 100.]

Asynchronous is faster when using small mini-batches.
A callout on the plot marks error from asynchronous updating.
Comparison with Ideal Speed-up

[Plot: log-likelihood vs. wall clock time (minutes), comparing Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), Batch EM (1 processor), and an ideal speed-up curve.]

For stepwise EM, mini-batch size = 10,000
MapReduce?

• We also ran these algorithms on a large MapReduce cluster (M45 from Yahoo!)
• Batch EM
  • Each iteration is one MapReduce job, using 24 mappers and 1 reducer
• Asynchronous Stepwise EM
  • 4 mini-batches processed simultaneously, each run as a MapReduce job
  • Each uses 6 mappers and 1 reducer
MapReduce?

[Two plots: log-likelihood vs. wall clock time (minutes). One shows the single-machine results (Asynch. Stepwise EM and Synch. Stepwise EM with 4 processors, Synch. Stepwise EM with 1 processor, Batch EM with 1 processor); the other compares Asynch. Stepwise EM and Batch EM run on MapReduce.]
Experiments

Task                                  Method        Convex?  |D|   |θ|   Model   m
Unsupervised Part-of-Speech Tagging   Stepwise EM   N        42k   2M    HMM     4

• Bigram HMM with 45 states
• We plot convergence in likelihood and many-to-1 accuracy
Part-of-Speech Tagging Results

[Two plots vs. wall clock time (hours): log-likelihood and many-to-1 accuracy (%), comparing Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), and Batch EM (1 processor).]

mini-batch size = 4 for stepwise EM
Comparison with Ideal

[Two plots vs. wall clock time (hours): log-likelihood and accuracy (%), comparing Asynch. Stepwise EM (4 processors) with an ideal speed-up curve.]

Asynchronous better than ideal?
Conclusions and Future Work

• Asynchronous algorithms speed convergence and do not introduce additional error
• Effective for unsupervised learning and non-convex objectives
• If your problem works well with small mini-batches, try this!
• Future work
  • Theoretical results for non-convex case
  • Explore effects of increasing number of processors
  • New architectures (maintain multiple copies of θ)
Thanks!