
High Performance Machine Learning
Distributed training

Paweł Rościszewski

pawel.rosciszewski@pg.edu.pl
Office: 521 EA, office hours: Friday 10:00-11:30

March 7, 2018


Supplementary courses

Coursera - Machine Learning (https://www.coursera.org/learn/machine-learning)

Coursera - Deep Learning (https://www.coursera.org/specializations/deep-learning)

Stanford CS231N (http://cs231n.stanford.edu/)

DataCamp premium support (https://www.datacamp.com/)


Recap

HPC is crucial for contemporary Machine Learning workloads

Huge training datasets, big models, compute intensity

Also unsupervised learning, compute-intensive workloads, inference

We will focus mostly on training supervised learning models


Stochastic Gradient Descent to rule them all

Figure: SGD iterations [1]
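The figure itself is not reproduced here; as a reminder of what it illustrates, each SGD iteration samples a mini-batch, computes the gradient of the loss on that batch, and nudges the parameters against it. A minimal NumPy sketch on a toy linear-regression problem (data, learning rate and batch size are illustrative, not taken from the slides):

    import numpy as np

    # Toy data: y = 3x + noise (illustrative only)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 1))
    y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=1000)

    w, lr, batch_size = 0.0, 0.1, 32
    for step in range(200):
        idx = rng.integers(0, len(X), size=batch_size)   # sample a mini-batch
        xb, yb = X[idx, 0], y[idx]
        grad = 2.0 * np.mean((w * xb - yb) * xb)         # d/dw of the mean squared error
        w -= lr * grad                                   # SGD update
    print(w)  # approaches 3.0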


Single-node optimizations

Vectorization - example (see the sketch after this list)

High-performance libraries - NumPy, cuDNN, Intel MKL, Eigen, ...

Hardware - examples (1060, P100, V100 - CPU, mem, P2P)

Monitoring - examples (top, nvidia-smi, ...)

Experiments - NHWC/NCHW, batch size, benchmarks
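To make the vectorization bullet concrete, here is a small sketch comparing a scalar Python loop with a single vectorized NumPy call (array size is illustrative); the speedup comes from dispatching the whole dot product to optimized native (BLAS/SIMD) code:

    import numpy as np
    import time

    n = 1_000_000
    a = np.random.rand(n)
    b = np.random.rand(n)

    # Scalar Python loop: one interpreted multiply-add per element
    t0 = time.perf_counter()
    s = 0.0
    for i in range(n):
        s += a[i] * b[i]
    t_loop = time.perf_counter() - t0

    # Vectorized: one call into optimized native code
    t0 = time.perf_counter()
    s_vec = a @ b
    t_vec = time.perf_counter() - t0

    print(f"loop: {t_loop:.3f}s, vectorized: {t_vec:.4f}s, same result: {np.isclose(s, s_vec)}")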


Computational graph - example
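The example graph from the slide is not reproduced here. As an illustration, a minimal computational graph in the TensorFlow 1.x graph API of the time (shapes and names are illustrative) might look like this: the graph is first built symbolically, then a session executes it:

    import tensorflow as tf  # TensorFlow 1.x graph API assumed

    # Build the graph: nodes are operations, edges are tensors
    x = tf.placeholder(tf.float32, shape=(None, 3), name="x")
    W = tf.Variable(tf.random_normal((3, 1)), name="W")
    b = tf.Variable(tf.zeros((1,)), name="b")
    y = tf.matmul(x, W) + b          # nothing is computed yet, only the graph is defined

    # Execute the graph in a session
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))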


Profiling - example
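The profiling example from the slide is likewise not reproduced. One common way to profile TensorFlow 1.x code at the time was to record run metadata and dump a Chrome trace; a sketch under that assumption (the matrix size is illustrative):

    import tensorflow as tf                                  # TensorFlow 1.x API assumed
    from tensorflow.python.client import timeline

    x = tf.random_normal((1024, 1024))
    y = tf.matmul(x, x)                                      # something to profile

    run_opts = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_meta = tf.RunMetadata()
    with tf.Session() as sess:
        sess.run(y, options=run_opts, run_metadata=run_meta)

    # Dump a Chrome trace; open it at chrome://tracing to inspect per-op timings
    tl = timeline.Timeline(run_meta.step_stats)
    with open("timeline.json", "w") as f:
        f.write(tl.generate_chrome_trace_format())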


Recap

why HPC is crucial for ML

dataset, model sizes, hardware used for training

convolutions and their implementations

benchmarking, profiling of ML code

numerical formats used in ML

distributed training algorithms


Schedule

numerical formats used in ML - recap

distributed training algorithms - recap

4.04 - Włodzimierz Kaoka (VoiceLab.ai) - practical deployment of RNNs for acoustic model inference

4.04 - graph representations?

11.04 - TensorFlow hands-on

18.04 - midterm test (45 min), TF hands-on

25.04 - lab starts


Hardware trends

Figure: Tensor Cores [9]


Hardware trends

Figure: Tensor Processing Unit architecture [10]


Multinode training

Figure: SGD iterations [1]


Multinode training - data parallelism

Figure: General architecture of data parallel multinode training [1]
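As a sketch of the idea (not the exact architecture from the figure): in synchronous data parallelism every worker holds a full replica of the model, computes gradients on its own mini-batch, and the gradients are averaged (e.g. by an all-reduce or a parameter server) before one shared update is applied. Here the workers are simulated with a plain NumPy loop; worker count, model and data are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(4096, 1)); y = 3.0 * X[:, 0]

    n_workers, lr, local_batch = 4, 0.1, 32
    w = 0.0                                                  # one shared model replica
    for step in range(100):
        grads = []
        for worker in range(n_workers):                      # each worker sees a different batch
            idx = rng.integers(0, len(X), size=local_batch)
            xb, yb = X[idx, 0], y[idx]
            grads.append(2.0 * np.mean((w * xb - yb) * xb))
        w -= lr * np.mean(grads)                             # average gradients (the "all-reduce"), then update
    print(w)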


Multinode training - model parallelism

Figure: General architecture of model parallel multinode training [2]
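Model parallelism instead splits the network itself across devices, so each device holds only part of the parameters and activations are exchanged at layer boundaries. A hedged TensorFlow 1.x sketch using explicit device placement (device names and layer sizes are illustrative, not the architecture from [2]):

    import tensorflow as tf  # TensorFlow 1.x device-placement style, graph construction only

    x = tf.placeholder(tf.float32, shape=(None, 1024))

    # First half of the model on one device...
    with tf.device("/gpu:0"):
        h = tf.layers.dense(x, 4096, activation=tf.nn.relu)

    # ...second half on another; the activation tensor h crosses the device boundary
    with tf.device("/gpu:1"):
        logits = tf.layers.dense(h, 10)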


Multinode training - mixed model/data

Convolutional layers cumulatively contain about 90-95% of the computation, about 5% of the parameters, and have large representations. [3]
Fully-connected layers contain about 5-10% of the computation, about 95% of the parameters, and have small representations. [3]

Figure: Mixed data/model parallel multinode training [3]


Multinode training - search parallelism

Figure: US Department of Energy Deep Learning objectives [11]

"DNNs in general do not have good strong scaling behavior, so to fully exploit large-scale parallelism they rely on a combination of model, data and search parallelism." [4]


Multinode training - model averaging

Figure: Parallel training of DNNs with natural gradient and parameter averaging [5]
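The averaging idea can be sketched separately from the natural-gradient part of [5]: each worker runs local SGD on its own replica, and every few steps the replicas are averaged into one model that all workers continue from. A toy NumPy sketch (averaging frequency, worker count and model are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(4096, 1)); y = 3.0 * X[:, 0]

    n_workers, lr, avg_every = 4, 0.1, 10
    w_workers = np.zeros(n_workers)                          # one independent replica per worker
    for step in range(1, 201):
        for k in range(n_workers):                           # independent local SGD steps
            idx = rng.integers(0, len(X), size=32)
            xb, yb = X[idx, 0], y[idx]
            w_workers[k] -= lr * 2.0 * np.mean((w_workers[k] * xb - yb) * xb)
        if step % avg_every == 0:                            # periodically average the replicas
            w_workers[:] = w_workers.mean()
    print(w_workers[0])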


Multinode training - model averaging frequency?

Figure: Experiments with model averaging frequency [6]


Multinode training - model averaging frequency?

Figure: Using experience from our HPCS class to optimize a popular ML framework [6]


Federated Learning

Figure: Federated Learning architecture [7,8]
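Federated Averaging [7] follows a similar pattern, except that the data never leaves the clients: in each round the server sends the current model to a sampled subset of clients, each client takes a few local SGD steps on its private data, and the server averages the returned models. A toy sketch (client count, sampling and local steps are illustrative):

    import numpy as np

    rng = np.random.default_rng(2)
    clients = []                                             # each client keeps its own private shard
    for _ in range(10):
        x = rng.normal(size=(256, 1))
        clients.append((x, 3.0 * x[:, 0]))

    w_global, lr = 0.0, 0.1
    for rnd in range(20):                                    # communication rounds
        picked = rng.choice(len(clients), size=3, replace=False)
        updates = []
        for c in picked:
            x, y = clients[c]
            w = w_global                                     # client starts from the global model
            for _ in range(5):                               # a few local SGD steps on local data only
                idx = rng.integers(0, len(x), size=32)
                w -= lr * 2.0 * np.mean((w * x[idx, 0] - y[idx]) * x[idx, 0])
            updates.append(w)
        w_global = float(np.mean(updates))                   # server averages the client models
    print(w_global)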


References

1 https://github.com/tensorflow/models/tree/master/research/inception

2 Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q.V., et al., 2012. Large scale distributed deep networks. In: Advances in Neural Information Processing Systems, pp. 1223–1231.

3 Krizhevsky, A., 2014. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997.

4 Stevens, R., 2017. Deep Learning in Cancer and Infectious Disease: Novel Driver Problems for Future HPC Architecture. ACM Press, pp. 65–65. https://doi.org/10.1145/3078597.3091526

5 Povey, D., Zhang, X., Khudanpur, S., 2014. Parallel training of deep neural networks with natural gradient andparameter averaging. CoRR, vol. abs/1410.7455.

6 Rosciszewski, P., Kaliski, J., 2017. Minimizing Distribution and Data Loading Overheads in Parallel Training of DNNAcoustic Models with Frequent Parameter Averaging. IEEE, pp. 560–565. https://doi.org/10.1109/HPCS.2017.89

7 McMahan, H.B., Moore, E., Ramage, D., Hampson, S., 2016. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629.

8 https://research.googleblog.com/2017/04/federated-learning-collaborative.html

9 https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/

10 https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

11 https://www.slideshare.net/insideHPC/a-vision-for-exascale-simulation-and-deep-learning

