Page 1

Minerva: A Scalable and Highly Efficient Training Platform for Deep Learning

M. Wang, T. Xiao, J. Li, J. Zhang, C. Hong, & Z. Zhang (2014)

Presentation by Cameron Hamilton

Page 2

Overview

• Problem: disparity between deep learning tools oriented towards productivity/generality (e.g. MATLAB) and task-specific tools designed for speed and scale (e.g. CUDA-Convnet).

• Solution: a matrix-based API, known as Minerva, with a MATLAB-like procedural coding style. The program is translated into an internal dataflow graph at runtime, which is generic enough to be executed on different types of hardware.

Page 3

Minerva System Overview

• Every training iteration has two phases (sketched below):
  – Generate a dataflow graph from the user code
  – Evaluate the dataflow graph
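To make the two phases concrete, here is a minimal C++ sketch of the lazy-evaluation idea: matrix operations only record vertices in a graph, and nothing is computed until an explicit evaluation call (analogous to the EvalAll() call shown later). The class and function names are illustrative, not Minerva's actual internals.

    #include <memory>
    #include <string>
    #include <vector>

    // Illustrative only: each matrix operation records a graph vertex
    // instead of computing a result (phase 1: generate the dataflow graph).
    struct Node {
        std::string op;                             // e.g. "mult", "add"
        std::vector<std::shared_ptr<Node>> inputs;  // dependencies in the graph
    };

    struct LazyMatrix {
        std::shared_ptr<Node> node;
    };

    LazyMatrix operator*(const LazyMatrix& a, const LazyMatrix& b) {
        return { std::make_shared<Node>(Node{"mult", {a.node, b.node}}) };
    }
    LazyMatrix operator+(const LazyMatrix& a, const LazyMatrix& b) {
        return { std::make_shared<Node>(Node{"add", {a.node, b.node}}) };
    }

    // Phase 2: walk the recorded graph and execute each vertex; a real
    // engine would dispatch independent vertices in parallel (CPU or GPU).
    void Evaluate(const LazyMatrix& result) { /* graph traversal omitted */ }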

Page 4

Example of User Code

Page 5

System Overview: Performance via Parallelism

• The performance of deep learning algorithms depends on whether operations can be performed in parallel. Minerva utilizes two forms of parallelism:
  – Model parallelism: a single model is partitioned across multiple workers that train it cooperatively
  – Data parallelism: model replicas are assigned different portions of the data set; replicas exchange updates via a "logically centralized parameter server" (p. 4)

• Always evaluates on GPU if available

Page 6

Programming Model

• The Minerva API has three stages for defining a deep learning task (stages one and two are combined into a single sketch after this list):
  – Define the model architecture
    • Model model;
    • Layer layer1 = model.AddLayer(dim);
    • model.AddConnection(layer1, layer2, FULL);
  – Declare the primary matrices (i.e., weights & biases)
    • Matrix W = Matrix(layer2, layer1, RANDOM);
    • Matrix b(layer2, 1, RANDOM);
    • Vector<Matrix> inputs = LoadBatches(layer1, …);
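Putting the calls from this slide together, stages one and two might look like the sketch below. The calls are exactly those listed above; only the layer dimensions are made up for illustration.

    // Stage 1: define the model architecture.
    Model model;
    Layer layer1 = model.AddLayer(784);           // input layer (dimension is illustrative)
    Layer layer2 = model.AddLayer(256);           // hidden layer (dimension is illustrative)
    model.AddConnection(layer1, layer2, FULL);    // fully connected layers

    // Stage 2: declare the primary matrices (weights and biases).
    Matrix W = Matrix(layer2, layer1, RANDOM);    // weight matrix, randomly initialized
    Matrix b(layer2, 1, RANDOM);                  // bias column vector
    Vector<Matrix> inputs = LoadBatches(layer1, …);  // mini-batches bound to the input layer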

Page 7

Programming Model

  – Specify the training procedure (a rough sketch is given below)

Convolutional neural networks (CNNs) are specified with a different syntax: the architecture is declared with a single call, AddConvConnect(layer1, layer2, …), and Minerva then handles the arrangement of these layers (p. 4).
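The paper's full training-procedure code is not reproduced on this slide; the loop below is only a rough sketch of what stage three could look like with the matrices declared above. The Sigmoid helper, the learning rate eta, and the gradient variables are my assumptions rather than calls confirmed by the slides; AddConvConnect is the one call the slide does give for CNNs.

    // Stage 3 (sketch only): a possible training loop over the loaded batches.
    for (int epoch = 0; epoch < numEpochs; ++epoch) {
        for (int i = 0; i < numBatches; ++i) {
            Matrix a1 = inputs[i];              // activations of layer1
            Matrix a2 = Sigmoid(W * a1 + b);    // forward propagation (Sigmoid is assumed)
            // ... compute the error, back-propagate, then update:
            // W = W - eta * gradW;  b = b - eta * gradb;
        }
    }

    // For convolutional networks, the architecture is instead declared in one call:
    AddConvConnect(layer1, layer2, …);          // Minerva arranges the convolutional layers itself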

Page 8

Programming Model

• Expressing parallelism
  – Model parallelism
    • SetPartition(layer1, 2); SetPartition(layer2, 2);
  – Data parallelism (a rough server-side sketch follows this list)
    • ParameterSet pset;
    • pset.Add("W", W); pset.Add("V", V);
    • pset.Add("b", b); pset.Add("c", c);
    • RegisterToParameterServer(pset);
    • … // learning procedure here
    • if (epoch % 3 == 0) PushToParameterServer(pset);
    • if (epoch % 6 == 0) PullFromParameterServer(pset);
    • EvalAll();
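The slides only show the client side of this exchange. As a rough stand-in for the "logically centralized parameter server" (not Minerva's actual API), the server side could conceptually keep one master copy per registered name, merge pushed replicas, and hand back the current values on a pull:

    #include <map>
    #include <string>
    #include <vector>

    // Hypothetical sketch of a parameter server: one master copy per name.
    struct ParameterServerSketch {
        std::map<std::string, std::vector<float>> store;

        void Register(const std::string& name, const std::vector<float>& init) {
            store[name] = init;
        }
        // Push: merge a replica's parameters into the master copy
        // (a simple average stands in for the real merge rule).
        void Push(const std::string& name, const std::vector<float>& replica) {
            std::vector<float>& master = store[name];
            for (size_t i = 0; i < master.size(); ++i)
                master[i] = 0.5f * (master[i] + replica[i]);
        }
        // Pull: a replica overwrites its local copy with the master values.
        std::vector<float> Pull(const std::string& name) { return store[name]; }
    };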

Page 9

Putting it All Together

Page 10

Putting it All Together

Page 11

System Design: More on Parallelism

• Within a neural network, the operations that occur at each computing vertex (i.e., forward propagation, backward propagation, weight update) are predefined. This allows network training to be partitioned across theoretically any number of threads.
  – Updates are shared between local parameter servers
  – Load balance: achieved by dividing the task up amongst the partitions
  – Coordination and overhead: managed by determining the ownership of each computing vertex based on the location of its input and output vertices; partitions stick to their vertices (see the sketch below)
  – Locality: a vertex in layer n receives its input from layer n-1 and sends its output to layer n+1
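As an illustration of the ownership rule above (a sketch of the stated idea, not Minerva's actual heuristic), a scheduler could assign each computing vertex to whichever partition holds the majority of its input and output data vertices, so the vertex "sticks" to its data:

    #include <map>
    #include <vector>

    struct ComputeVertex {
        std::vector<int> inputPartitions;   // partitions holding its input data vertices
        std::vector<int> outputPartitions;  // partitions holding its output data vertices
    };

    // Assign the vertex to the partition that owns most of its inputs and
    // outputs, keeping computation close to its data (locality) and making
    // ownership stable across iterations.
    int AssignOwner(const ComputeVertex& v) {
        std::map<int, int> votes;
        for (int p : v.inputPartitions)  ++votes[p];
        for (int p : v.outputPartitions) ++votes[p];
        int owner = -1, best = 0;
        for (const auto& kv : votes)
            if (kv.second > best) { best = kv.second; owner = kv.first; }
        return owner;
    }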

Page 12

Model Parallelism

Page 13

Convolutional Networks

• Partitions each handle a patch of the input data; the patches are then merged and convolved with a kernel (a minimal sketch follows).
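A minimal sketch of that pipeline, assuming a 1-D signal split into contiguous chunks (one per partition): the chunks are concatenated and the merged signal is convolved with a kernel. Minerva's real implementation operates on multi-dimensional image data and must handle patch boundaries, which this sketch ignores.

    #include <vector>

    // Merge the per-partition patches back into one signal.
    std::vector<float> MergePatches(const std::vector<std::vector<float>>& patches) {
        std::vector<float> merged;
        for (const std::vector<float>& patch : patches)
            merged.insert(merged.end(), patch.begin(), patch.end());
        return merged;
    }

    // Valid 1-D convolution of the merged signal with a kernel
    // (assumes the signal is at least as long as the kernel).
    std::vector<float> Convolve(const std::vector<float>& x, const std::vector<float>& k) {
        std::vector<float> y(x.size() - k.size() + 1, 0.0f);
        for (size_t i = 0; i < y.size(); ++i)
            for (size_t j = 0; j < k.size(); ++j)
                y[i] += x[i + j] * k[j];
        return y;
    }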

Page 14

More on Data Parallelism

• Each machine/partition has its own local parameter server that updates parameters and exchanges them with its neighboring servers.

• Coordination is done through a belief-propagation-like algorithm (p. 7)
  – Each server merges updates with its neighbors, then "gossips to each of them the missing portion" (see the sketch below)
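A rough sketch of that exchange is below. The "belief-propagation-like" coordination is approximated here by a plain pairwise average with each neighbor; the actual protocol tracks which portion of the update a neighbor has not yet seen and gossips only that missing portion.

    #include <cstddef>
    #include <vector>

    struct LocalParamServer {
        std::vector<float> params;                 // this partition's copy of the parameters
        std::vector<LocalParamServer*> neighbors;  // servers it gossips with
    };

    // One gossip round: merge with each neighbor, then write the merged values
    // back so the neighbor also receives the portion it was missing.
    void GossipRound(LocalParamServer& s) {
        for (LocalParamServer* n : s.neighbors) {
            for (std::size_t i = 0; i < s.params.size(); ++i) {
                float merged = 0.5f * (s.params[i] + n->params[i]);  // stand-in merge rule
                s.params[i] = merged;
                n->params[i] = merged;
            }
        }
    }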

Page 15

Experience and Evaluation

• Minerva implementation highlights
  – ImageNet (CNN) 1K classification task (Krizhevsky et al., 2012)
    • 42.7% top-1 error rate
    • 15x faster than the MATLAB implementation
    • 4.6x faster with a 16-way partition on a 16-core machine than with no partitions
  – Speech-net (a rough weight count is worked out below)
    • 1100 input neurons, 8 hidden layers of 2000 sigmoid neurons each, 9000-unit softmax output layer
    • 1.5-2x faster than the MATLAB implementation
  – RNN
    • 10000 input units, 1000 hidden units, 10000 flat output units
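For a sense of scale (ignoring biases and assuming every layer is fully connected, which is my assumption), the Speech-net above holds roughly:

    1100 × 2000 + 7 × (2000 × 2000) + 2000 × 9000 ≈ 2.2M + 28M + 18M ≈ 48M weights

which is large enough that the partitioned training described earlier matters.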

Page 16

Experience and Evaluation

• Scaling up (Figure 8): with a mini-batch size of 128, Minerva (GPU) trained the CNN faster than Caffe did with mini-batch sizes of 256 and 512

Page 17

Experience and Evaluation

Page 18

Experience and Evaluation

Page 19

Experience and Evaluation

Page 20

Conclusion

• A powerful and versatile framework for big data and deep learning
• Pipelining may be preferable to partitioning fully connected layers, which causes network traffic
• My comments
  – Lacks a restricted Boltzmann machine (RBM), so a deep belief network (DBN) is not currently possible
  – The API appears to be concise and readable
  – Lacks an implementation of an algorithm for genetic design of networks (e.g., NEAT); however, population generation would be ideal for partitioning
  – It is not clear how Minerva handles situations where the partitions do not evenly divide the number of nodes within a given layer

Page 21

References

• Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

• Wang, M., Xiao, T., Li, J., Zhang, J., Hong, C., & Zhang, Z. (2014). Minerva: A scalable and highly efficient training platform for deep learning.

• All figures appearing within this presentation are borrowed from Wang et al., 2014.