Outline
• Part one
  – Background
  – SINGA
    • Overview
    • Programming Model
    • System Architecture
  – Experimental Study
  – Research Challenges
  – Conclusion
• Part two
  – Basic user guide
  – Advanced user guide
Background: Applications and Models
Applications: drug detection, image caption generation, machine translation, natural language processing, acoustic modeling, image/video classification.
(The source website of each image can be found by clicking on it.)
Background: Applications and Models
Model categories and connection types:
• Directed acyclic connection: Multi-Layer Perceptron (MLP), Convolutional Neural Network (CNN), Auto-Encoder (AE)
• Recurrent connection: Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM)
• Undirected connection: Restricted Boltzmann Machine (RBM), Deep Boltzmann Machine (DBM), Deep Belief Network (DBN)
These model families underpin the applications listed above.
(The source website of each image can be found by clicking on it.)
Background: Parameter Training
• Objective function
  – Loss functions of deep learning models are (usually) non-linear and non-convex, so there is no closed-form solution.
• Mini-batch Stochastic Gradient Descent (SGD)
  – Iterate: compute gradients, then update parameters (see the sketch below)
  – Training a large model this way can take on the order of 1 week
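As a concrete illustration of the loop above (sample a mini-batch, compute gradients, update parameters), here is a minimal NumPy sketch on a toy least-squares problem. It is generic pseudocode for mini-batch SGD, not SINGA code; the problem setup and all variable names are made up for the example.

```python
import numpy as np

# Minimal mini-batch SGD on a toy least-squares problem.
# The loop structure (sample a mini-batch, compute gradients,
# update parameters, repeat) is what the slide describes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                  # training inputs
true_w = rng.normal(size=20)
y = X @ true_w + 0.01 * rng.normal(size=1000)    # training targets

w = np.zeros(20)                                 # model parameters
lr, batch_size = 0.1, 32

for step in range(2000):
    idx = rng.integers(0, len(X), batch_size)    # sample a mini-batch
    xb, yb = X[idx], y[idx]
    grad = 2.0 / batch_size * xb.T @ (xb @ w - yb)   # compute gradients
    w -= lr * grad                                   # update parameters

print("parameter error:", np.linalg.norm(w - true_w))
```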
Background: Distributed Training Framework
• Synchronous frameworks
  – Sandblaster of Google Brain [1]
  – AllReduce in Baidu's DeepImage [2]
• Asynchronous frameworks
  – Downpour of Google Brain [1]
  – Distributed Hogwild in Caffe [4]
(Figures: worker-server topology of each framework, e.g., one group of workers connected to a parameter server or coordinator; a toy sketch of the two update styles follows below.)
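To make the synchronous vs. asynchronous distinction concrete, the toy Python sketch below simulates a parameter server: in the synchronous style, gradients from all workers are aggregated before a single update; in the asynchronous style, each worker's gradient is applied as soon as it arrives, so later workers compute against already-updated parameters. This is an illustrative toy, not code from Sandblaster, DeepImage, Downpour, or Caffe.

```python
import numpy as np

def worker_gradient(params, rng):
    # Stand-in for a worker computing a noisy gradient on its data shard
    # (here, the gradient of 0.5 * ||params||^2 plus noise).
    return params + rng.normal(scale=0.1, size=params.shape)

def synchronous_step(params, workers, lr, rng):
    # All workers compute on the same parameter version;
    # the server averages the gradients, then updates once.
    grads = [worker_gradient(params, rng) for _ in range(workers)]
    return params - lr * np.mean(grads, axis=0)

def asynchronous_step(params, workers, lr, rng):
    # Each gradient is applied immediately, so workers may read
    # parameters that other workers have already updated.
    for _ in range(workers):
        params = params - lr * worker_gradient(params, rng)
    return params

rng = np.random.default_rng(0)
p_sync = p_async = np.ones(4)
for _ in range(100):
    p_sync = synchronous_step(p_sync, workers=4, lr=0.1, rng=rng)
    p_async = asynchronous_step(p_async, workers=4, lr=0.1, rng=rng)
print("sync:", p_sync, "async:", p_async)
```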
SINGA Design Goals
• General
  – Different categories of models
  – Different training frameworks
• Scalable
  – Scale to a large model and training dataset
    • e.g., 1 billion parameters and 10M images
  – Model partition and data partition
• Easy to use
  – Simple programming model
  – Built-in models, Python binding, web interface
  – Without much awareness of the underlying distributed platform
• Abstraction for extensibility, scalability and usability
(Figure: model categories such as CNN, RNN, RBM, ... built on top of training frameworks such as Hogwild, AllReduce, ...)
Outline
• Part one
  – Background
  – SINGA
    • Overview
    • Programming Model
    • System Architecture
  – Experimental Study
  – Research Challenges
  – Conclusion
• Part two
  – Basic user guide
  – Advanced user guide
SINGA Overview
• Users (or programmers) submit a job configuration for 4 main components:
  – ClusterTopology
  – NeuralNet
  – TrainOneBatch
  – Updater
  (an illustrative sketch of such a configuration follows below)
• SINGA trains models over the worker-server architecture.
  – Workers compute parameter gradients
  – Servers perform parameter updates
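For orientation, the sketch below shows roughly what a job configuration covers for the four components. SINGA's actual configuration syntax differs (it is not a Python dictionary), so every key and value here is an illustrative placeholder rather than real SINGA syntax.

```python
# Conceptual sketch of a SINGA job configuration.
# All field names are illustrative placeholders, not SINGA's real syntax.
job_config = {
    "cluster": {               # ClusterTopology: how workers/servers are grouped
        "nworker_groups": 1,
        "nserver_groups": 1,
        "nworkers_per_group": 4,
        "nservers_per_group": 1,
    },
    "neuralnet": [             # NeuralNet: layers and their connections
        {"name": "data",  "type": "data"},
        {"name": "conv1", "type": "convolution", "srclayer": "data"},
        {"name": "loss",  "type": "softmax_loss", "srclayer": "conv1"},
    ],
    "train_one_batch": {"alg": "backpropagation"},       # TrainOneBatch algorithm
    "updater": {"type": "sgd", "learning_rate": 0.01},   # Updater run by servers
}
print(job_config)
```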
SINGA Programming Model
Note: BP, BPTT and CD have different computation flows, so each is implemented as a different TrainOneBatch (sketched below).
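The rough Python sketch below illustrates why the flows differ: BP makes one forward pass over the layers and one backward pass in reverse order, while CD (used for RBMs) runs a positive phase and k Gibbs-sampling steps for the negative phase before computing gradients. The ComputeFeature/ComputeGradient layer methods follow the abstraction described for SINGA, but the code is an illustrative outline with assumed objects (net, layer), not the actual implementation.

```python
# Sketch only: `net` is assumed to expose an ordered list of layers.

def train_one_batch_bp(net, batch):
    # Back-propagation: one forward pass through all layers,
    # then one backward pass in reverse order.
    for layer in net.layers:
        layer.ComputeFeature(phase="forward", batch=batch)
    for layer in reversed(net.layers):
        layer.ComputeGradient()

def train_one_batch_cd(net, batch, k=1):
    # Contrastive divergence (e.g., for RBMs): a positive phase,
    # k Gibbs-sampling steps for the negative phase, then gradients
    # computed from the difference between the two phases.
    for layer in net.layers:
        layer.ComputeFeature(phase="positive", batch=batch)
    for _ in range(k):
        for layer in net.layers:
            layer.ComputeFeature(phase="negative")
    for layer in net.layers:
        layer.ComputeGradient()
```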
Distributed Training
• Worker Group
  – Loads a subset of the training data and computes gradients for a model replica
  – Workers within a group run synchronously
  – Different worker groups run asynchronously
• Server Group
  – Maintains one ParamShard
  – Handles parameter-update requests from multiple worker groups
  – Synchronizes with neighboring server groups
Synchronous Frameworks
• Configuration for AllReduce (Baidu's DeepImage)
  – 1 worker group
  – 1 server group
  – Co-locate worker and server
Synchronous Frameworks
• Configuration for Sandblaster
  – Separate worker and server groups
  – 1 server group
  – 1 worker group
Asynchronous Frameworks
• Configuration for Downpour
  – Separate worker and server groups
  – >= 1 worker groups
  – 1 server group
Asynchronous Frameworks
• Configuration for distributed Hogwild
  – 1 server group per process
  – >= 1 worker groups per process
  – Co-locate worker and server
  – Share ParamShard
(The four configurations above are summarized in the sketch below.)
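The four framework configurations above differ only in a handful of cluster-topology settings. The Python sketch below lines them up side by side; the keys are descriptive placeholders chosen for this summary, not necessarily SINGA's exact configuration field names.

```python
# Illustrative cluster-topology settings for each framework.
# Keys are descriptive placeholders, not exact SINGA field names.
frameworks = {
    "AllReduce (DeepImage)": {   # synchronous
        "nworker_groups": 1, "nserver_groups": 1, "separate_worker_server": False},
    "Sandblaster": {             # synchronous
        "nworker_groups": 1, "nserver_groups": 1, "separate_worker_server": True},
    "Downpour": {                # asynchronous
        "nworker_groups": ">=1", "nserver_groups": 1, "separate_worker_server": True},
    "Distributed Hogwild": {     # asynchronous, per process
        "nworker_groups": ">=1", "nserver_groups": 1,
        "separate_worker_server": False, "share_param_shard": True},
}
for name, cfg in frameworks.items():
    print(name, cfg)
```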
Outline
• Part one
  – Background
  – SINGA
    • Overview
    • Programming Model
    • System Architecture
  – Experimental Study
  – Research Challenges
  – Conclusion
• Part two
  – Basic user guide
  – Advanced user guide
Experiment: Training Performance (Synchronous)
• CIFAR10 dataset: 50K training and 10K test images
• Single-node setting
  – A 24-core server with 500GB memory
  – "SINGA-dist": OpenBLAS multi-threading disabled, Sandblaster topology
• Multi-node setting
  – A 32-node cluster; each node has a quad-core Intel Xeon 3.1 GHz CPU and 8GB memory
  – AllReduce topology
Experiment: Training Performance (Asynchronous)
(Figure: training over 60,000 iterations, comparing Caffe on a single node, SINGA on a single node, and SINGA on a cluster of 32 nodes.)
Challenge: Scalability
• Problem
  – How to reduce the wall-clock time to reach a certain accuracy by increasing the cluster size?
• Key factors
  – Efficiency of one training iteration, i.e., time per iteration
  – Convergence rate, i.e., number of training iterations
• Solutions
  – Efficiency
    • Distributed training over one worker group
    • Hardware, e.g., GPU
  – Convergence rate
    • Numeric optimization
    • Multiple groups
  – Overhead of distributed training
    • Communication can easily become the bottleneck
    • Parameter compression [8,9], elastic averaging SGD [5], etc. (see the sketch below)
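As an example of the communication-reduction techniques cited above, 1-bit SGD with error feedback [9] can be sketched in a few lines: a worker transmits only the (scaled) sign of each gradient entry and carries the quantization error into the next step. The NumPy code below is a simplified illustration of that idea, not the cited implementation.

```python
import numpy as np

def one_bit_compress(grad, residual):
    # Quantize to one bit per element (the sign), scaled so the overall
    # magnitude is roughly preserved, and keep the quantization error
    # to add back into the next step (error feedback).
    corrected = grad + residual
    scale = np.mean(np.abs(corrected))
    quantized = scale * np.sign(corrected)
    new_residual = corrected - quantized
    return quantized, new_residual

# Toy usage: compress a stream of random "gradients".
rng = np.random.default_rng(0)
residual = np.zeros(8)
for step in range(3):
    grad = rng.normal(size=8)
    q, residual = one_bit_compress(grad, residual)
    print("sent:", q, "| carried error norm:", np.linalg.norm(residual))
```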
References
[1] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. W. Senior, P. A. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In NIPS, pages 1232-1240, 2012.
[2] R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. DeepImage: Scaling up image recognition. CoRR, abs/1501.02876, 2015.
[3] B. Recht, C. Re, S. J. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, pages 693-701, 2011.
[4] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[5] S. Zhang, A. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. CoRR, abs/1412.6651, 2014.
[6] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition. CoRR, abs/1003.0358, 2010.
[7] D. Jiang, G. Chen, B. C. Ooi, K. Tan, and S. Wu. epiC: An extensible and scalable system for processing big data. PVLDB, 7(7):541-552, 2014.
[8] M. Courbariaux, Y. Bengio, and J.-P. David. Low precision storage for deep learning. arXiv:1412.7024, 2014.
[9] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. 2014.
[10] C. Zhang and C. Re. DimmWitted: A study of main-memory statistical analytics. PVLDB, 7(12):1283-1294, 2014.
[11] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun. Fast convolutional nets with fbfft: A GPU performance evaluation. 2015.