Page 1

Minerva: A Scalable and Highly Efficient Training Platform for Deep Learning

M. Wang, T. Xiao, J. Li, J. Zhang, C. Hong, & Z. Zhang (2014)

Presentation by Cameron Hamilton

Page 2

Overview

• Problem: disparity between deep learning tools oriented towards productivity/generality (e.g. MATLAB) and task-specific tools designed for speed and scale (e.g. CUDA-Convnet).

• Solution: a matrix-based API, known as Minerva, with a MATLAB-like procedural coding style. The program is translated into an internal dataflow graph at runtime, which is generic enough to be executed on different types of hardware.

Page 3

Minerva System Overview

• Every training iteration has two phases (sketched below):
  – Generate a dataflow graph from the user code
  – Evaluate the dataflow graph
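To make the two phases concrete, here is a minimal C++ sketch of the lazy-evaluation idea: matrix operations only record vertices in a graph, and nothing is computed until an explicit evaluation call (analogous to the EvalAll() call shown later). The class and function names are illustrative, not Minerva's actual internals.

    #include <memory>
    #include <string>
    #include <vector>

    // Illustrative only: each matrix operation records a graph vertex
    // instead of computing a result (phase 1: generate the dataflow graph).
    struct Node {
        std::string op;                             // e.g. "mult", "add"
        std::vector<std::shared_ptr<Node>> inputs;  // dependencies in the graph
    };

    struct LazyMatrix {
        std::shared_ptr<Node> node;
    };

    LazyMatrix operator*(const LazyMatrix& a, const LazyMatrix& b) {
        return { std::make_shared<Node>(Node{"mult", {a.node, b.node}}) };
    }
    LazyMatrix operator+(const LazyMatrix& a, const LazyMatrix& b) {
        return { std::make_shared<Node>(Node{"add", {a.node, b.node}}) };
    }

    // Phase 2: walk the recorded graph and execute each vertex; a real
    // engine would dispatch independent vertices in parallel (CPU or GPU).
    void Evaluate(const LazyMatrix& result) { /* graph traversal omitted */ }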

Page 4

Example of User Code

Page 5

System Overview: Performance via Parallelism

• The performance of deep learning algorithms depends on whether operations can be performed in parallel. Minerva utilizes two forms of parallelism:
  – Model parallelism: a single model is partitioned across multiple workers that train it cooperatively
  – Data parallelism: model replicas are assigned different portions of the data set; replicas exchange updates via a "logically centralized parameter server" (p. 4)

• Always evaluates on GPU if available

Page 6

Programming Model

• The Minerva API has three stages for defining a deep learning task (stages one and two are combined into a single sketch after this list):
  – Define the model architecture
    • Model model;
    • Layer layer1 = model.AddLayer(dim);
    • model.AddConnection(layer1, layer2, FULL);
  – Declare the primary matrices (i.e., weights & biases)
    • Matrix W = Matrix(layer2, layer1, RANDOM);
    • Matrix b(layer2, 1, RANDOM);
    • Vector<Matrix> inputs = LoadBatches(layer1, …);
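Putting the calls from this slide together, stages one and two might look like the sketch below. The calls are exactly those listed above; only the layer dimensions are made up for illustration.

    // Stage 1: define the model architecture.
    Model model;
    Layer layer1 = model.AddLayer(784);           // input layer (dimension is illustrative)
    Layer layer2 = model.AddLayer(256);           // hidden layer (dimension is illustrative)
    model.AddConnection(layer1, layer2, FULL);    // fully connected layers

    // Stage 2: declare the primary matrices (weights and biases).
    Matrix W = Matrix(layer2, layer1, RANDOM);    // weight matrix, randomly initialized
    Matrix b(layer2, 1, RANDOM);                  // bias column vector
    Vector<Matrix> inputs = LoadBatches(layer1, …);  // mini-batches bound to the input layer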

Page 7

Programming Model

  – Specify the training procedure (a rough sketch is given below)

Convolutional neural networks (CNNs) are specified with a different syntax: the architecture is declared with a single call, AddConvConnect(layer1, layer2, …), and Minerva then handles the arrangement of these layers (p. 4).
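The paper's full training-procedure code is not reproduced on this slide; the loop below is only a rough sketch of what stage three could look like with the matrices declared above. The Sigmoid helper, the learning rate eta, and the gradient variables are my assumptions rather than calls confirmed by the slides; AddConvConnect is the one call the slide does give for CNNs.

    // Stage 3 (sketch only): a possible training loop over the loaded batches.
    for (int epoch = 0; epoch < numEpochs; ++epoch) {
        for (int i = 0; i < numBatches; ++i) {
            Matrix a1 = inputs[i];              // activations of layer1
            Matrix a2 = Sigmoid(W * a1 + b);    // forward propagation (Sigmoid is assumed)
            // ... compute the error, back-propagate, then update:
            // W = W - eta * gradW;  b = b - eta * gradb;
        }
    }

    // For convolutional networks, the architecture is instead declared in one call:
    AddConvConnect(layer1, layer2, …);          // Minerva arranges the convolutional layers itself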

Page 8

Programming Model

• Expressing parallelism
  – Model parallelism
    • SetPartition(layer1, 2); SetPartition(layer2, 2);
  – Data parallelism (a rough server-side sketch follows this list)
    • ParameterSet pset;
    • pset.Add("W", W); pset.Add("V", V);
    • pset.Add("b", b); pset.Add("c", c);
    • RegisterToParameterServer(pset);
    • … // learning procedure here
    • if (epoch % 3 == 0) PushToParameterServer(pset);
    • if (epoch % 6 == 0) PullFromParameterServer(pset);
    • EvalAll();
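The slides only show the client side of this exchange. As a rough stand-in for the "logically centralized parameter server" (not Minerva's actual API), the server side could conceptually keep one master copy per registered name, merge pushed replicas, and hand back the current values on a pull:

    #include <map>
    #include <string>
    #include <vector>

    // Hypothetical sketch of a parameter server: one master copy per name.
    struct ParameterServerSketch {
        std::map<std::string, std::vector<float>> store;

        void Register(const std::string& name, const std::vector<float>& init) {
            store[name] = init;
        }
        // Push: merge a replica's parameters into the master copy
        // (a simple average stands in for the real merge rule).
        void Push(const std::string& name, const std::vector<float>& replica) {
            std::vector<float>& master = store[name];
            for (size_t i = 0; i < master.size(); ++i)
                master[i] = 0.5f * (master[i] + replica[i]);
        }
        // Pull: a replica overwrites its local copy with the master values.
        std::vector<float> Pull(const std::string& name) { return store[name]; }
    };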

Page 9

Putting it All Together

Page 10

Putting it All Together

Page 11

System Design: More on Parallelism

• Within a neural network, the operations that occur at each computing vertex (i.e., forward propagation, backward propagation, weight update) are predefined. This allows network training to be partitioned across theoretically any number of threads.
  – Updates are shared between local parameter servers
  – Load balance: achieved by dividing the task up amongst the partitions
  – Coordination and overhead: managed by determining the ownership of each computing vertex based on the location of its input and output vertices; partitions stick to their vertices (see the sketch below)
  – Locality: a vertex in layer n receives its input from layer n-1 and sends its output to layer n+1
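As an illustration of the ownership rule above (a sketch of the stated idea, not Minerva's actual heuristic), a scheduler could assign each computing vertex to whichever partition holds the majority of its input and output data vertices, so the vertex "sticks" to its data:

    #include <map>
    #include <vector>

    struct ComputeVertex {
        std::vector<int> inputPartitions;   // partitions holding its input data vertices
        std::vector<int> outputPartitions;  // partitions holding its output data vertices
    };

    // Assign the vertex to the partition that owns most of its inputs and
    // outputs, keeping computation close to its data (locality) and making
    // ownership stable across iterations.
    int AssignOwner(const ComputeVertex& v) {
        std::map<int, int> votes;
        for (int p : v.inputPartitions)  ++votes[p];
        for (int p : v.outputPartitions) ++votes[p];
        int owner = -1, best = 0;
        for (const auto& kv : votes)
            if (kv.second > best) { best = kv.second; owner = kv.first; }
        return owner;
    }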

Page 12

Model Parallelism

Page 13

Convolutional Networks

• Partitions each handle a patch of the input data; the patches are then merged and convolved with a kernel (a minimal sketch follows).
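A minimal sketch of that pipeline, assuming a 1-D signal split into contiguous chunks (one per partition): the chunks are concatenated and the merged signal is convolved with a kernel. Minerva's real implementation operates on multi-dimensional image data and must handle patch boundaries, which this sketch ignores.

    #include <vector>

    // Merge the per-partition patches back into one signal.
    std::vector<float> MergePatches(const std::vector<std::vector<float>>& patches) {
        std::vector<float> merged;
        for (const std::vector<float>& patch : patches)
            merged.insert(merged.end(), patch.begin(), patch.end());
        return merged;
    }

    // Valid 1-D convolution of the merged signal with a kernel
    // (assumes the signal is at least as long as the kernel).
    std::vector<float> Convolve(const std::vector<float>& x, const std::vector<float>& k) {
        std::vector<float> y(x.size() - k.size() + 1, 0.0f);
        for (size_t i = 0; i < y.size(); ++i)
            for (size_t j = 0; j < k.size(); ++j)
                y[i] += x[i + j] * k[j];
        return y;
    }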

Page 14

More on Data Parallelism

• Each machine/partition has its own local parameter server that updates parameters and exchanges them with its neighboring servers.

• Coordination is done through a belief-propagation-like algorithm (p. 7)
  – Each server merges updates with its neighbors, then "gossips to each of them the missing portion" (see the sketch below)
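A rough sketch of that exchange is below. The "belief-propagation-like" coordination is approximated here by a plain pairwise average with each neighbor; the actual protocol tracks which portion of the update a neighbor has not yet seen and gossips only that missing portion.

    #include <cstddef>
    #include <vector>

    struct LocalParamServer {
        std::vector<float> params;                 // this partition's copy of the parameters
        std::vector<LocalParamServer*> neighbors;  // servers it gossips with
    };

    // One gossip round: merge with each neighbor, then write the merged values
    // back so the neighbor also receives the portion it was missing.
    void GossipRound(LocalParamServer& s) {
        for (LocalParamServer* n : s.neighbors) {
            for (std::size_t i = 0; i < s.params.size(); ++i) {
                float merged = 0.5f * (s.params[i] + n->params[i]);  // stand-in merge rule
                s.params[i] = merged;
                n->params[i] = merged;
            }
        }
    }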

Page 15

Experience and Evaluation

• Minerva implementation highlights
  – ImageNet (CNN) 1K classification task (Krizhevsky et al., 2012)
    • 42.7% top-1 error rate
    • 15x faster than the MATLAB implementation
    • 4.6x faster with a 16-way partition on a 16-core machine than with no partitions
  – Speech-net (a rough weight count is worked out below)
    • 1100 input neurons, 8 hidden layers of 2000 sigmoid neurons each, 9000-unit softmax output layer
    • 1.5-2x faster than the MATLAB implementation
  – RNN
    • 10000 input units, 1000 hidden units, 10000 flat output units
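For a sense of scale (ignoring biases and assuming every layer is fully connected, which is my assumption), the Speech-net above holds roughly:

    1100 × 2000 + 7 × (2000 × 2000) + 2000 × 9000 ≈ 2.2M + 28M + 18M ≈ 48M weights

which is large enough that the partitioned training described earlier matters.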

Page 16

Experience and Evaluation

• Scaling up (Figure 8): with a mini-batch size of 128, Minerva (GPU) trained the CNN faster than Caffe did with mini-batch sizes of 256 and 512

Page 17

Experience and Evaluation

Page 18

Experience and Evaluation

Page 19

Experience and Evaluation

Page 20

Conclusion

• A powerful and versatile framework for big data and deep learning
• Pipelining may be preferable to partitioning fully connected layers, which causes network traffic
• My comments
  – Lacks a restricted Boltzmann machine (RBM), so a deep belief network (DBN) is not currently possible
  – The API appears to be concise and readable
  – Lacks an implementation of an algorithm for genetic design of networks (e.g., NEAT); however, population generation would be ideal for partitioning
  – It is not clear how Minerva handles situations where the partitions do not evenly divide the number of nodes within a given layer

Page 21

References

• Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

• Wang, M., Xiao, T., Li, J., Zhang, J., Hong, C., & Zhang, Z. (2014). Minerva: A scalable and highly efficient training platform for deep learning.

• All figures appearing within this presentation are borrowed from Wang et al., 2014.