
High Performance Machine Learning
Distributed training

Paweł Rościszewski

pawel.rosciszewski@pg.edu.pl
Office: 521 EA, office hours: Friday 10:00-11:30

March 7, 2018


Supplementary courses

Coursera - Machine Learning (https://www.coursera.org/learn/machine-learning)

Coursera - Deep Learning (https://www.coursera.org/specializations/deep-learning)

Stanford CS231N (http://cs231n.stanford.edu/)

DataCamp premium support (https://www.datacamp.com/)


Recap

HPC is crucial for contemporary Machine Learning workloads

Huge training datasets, big models, compute intensity

Also unsupervised learning, compute-intensive workloads, inference

We will focus mostly on training supervised learning models


Stochastic Gradient Descent to rule them all

Figure: SGD iterations [1]
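The figure itself is not reproduced here; as a reminder of what it illustrates, each SGD iteration samples a mini-batch, computes the gradient of the loss on that batch, and nudges the parameters against it. A minimal NumPy sketch on a toy linear-regression problem (data, learning rate and batch size are illustrative, not taken from the slides):

    import numpy as np

    # Toy data: y = 3x + noise (illustrative only)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 1))
    y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=1000)

    w, lr, batch_size = 0.0, 0.1, 32
    for step in range(200):
        idx = rng.integers(0, len(X), size=batch_size)   # sample a mini-batch
        xb, yb = X[idx, 0], y[idx]
        grad = 2.0 * np.mean((w * xb - yb) * xb)         # d/dw of the mean squared error
        w -= lr * grad                                   # SGD update
    print(w)  # approaches 3.0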


Single-node optimizations

Vectorization - example (see the sketch after this list)

High-performance libraries - NumPy, cuDNN, Intel MKL, Eigen, ...

Hardware - examples (1060, P100, V100 - CPU, mem, P2P)

Monitoring - examples (top, nvidia-smi, ...)

Experiments - NHWC/NCHW, batch size, benchmarks
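To make the vectorization bullet concrete, here is a small sketch comparing a scalar Python loop with a single vectorized NumPy call (array size is illustrative); the speedup comes from dispatching the whole dot product to optimized native (BLAS/SIMD) code:

    import numpy as np
    import time

    n = 1_000_000
    a = np.random.rand(n)
    b = np.random.rand(n)

    # Scalar Python loop: one interpreted multiply-add per element
    t0 = time.perf_counter()
    s = 0.0
    for i in range(n):
        s += a[i] * b[i]
    t_loop = time.perf_counter() - t0

    # Vectorized: one call into optimized native code
    t0 = time.perf_counter()
    s_vec = a @ b
    t_vec = time.perf_counter() - t0

    print(f"loop: {t_loop:.3f}s, vectorized: {t_vec:.4f}s, same result: {np.isclose(s, s_vec)}")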


Computational graph - example
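The example graph from the slide is not reproduced here. As an illustration, a minimal computational graph in the TensorFlow 1.x graph API of the time (shapes and names are illustrative) might look like this: the graph is first built symbolically, then a session executes it:

    import tensorflow as tf  # TensorFlow 1.x graph API assumed

    # Build the graph: nodes are operations, edges are tensors
    x = tf.placeholder(tf.float32, shape=(None, 3), name="x")
    W = tf.Variable(tf.random_normal((3, 1)), name="W")
    b = tf.Variable(tf.zeros((1,)), name="b")
    y = tf.matmul(x, W) + b          # nothing is computed yet, only the graph is defined

    # Execute the graph in a session
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))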


Profiling - example
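The profiling example from the slide is likewise not reproduced. One common way to profile TensorFlow 1.x code at the time was to record run metadata and dump a Chrome trace; a sketch under that assumption (the matrix size is illustrative):

    import tensorflow as tf                                  # TensorFlow 1.x API assumed
    from tensorflow.python.client import timeline

    x = tf.random_normal((1024, 1024))
    y = tf.matmul(x, x)                                      # something to profile

    run_opts = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_meta = tf.RunMetadata()
    with tf.Session() as sess:
        sess.run(y, options=run_opts, run_metadata=run_meta)

    # Dump a Chrome trace; open it at chrome://tracing to inspect per-op timings
    tl = timeline.Timeline(run_meta.step_stats)
    with open("timeline.json", "w") as f:
        f.write(tl.generate_chrome_trace_format())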


Recap

why HPC is crucial for ML

dataset, model sizes, hardware used for training

convolutions and their implementations

benchmarking, profiling of ML code

numerical formats used in ML

distributed training algorithms


Schedule

numerical formats used in ML - recap

distributed training algorithms - recap

4.04 - Włodzimierz Kaoka (VoiceLab.ai) - practical deployment of RNNs for acoustic model inference

4.04 - graph representations?

11.04 - TensorFlow hands-on

18.04 - midterm test (45 min), TF hands-on

25.04 - lab starts


Hardware trends

Figure: Tensor Cores [9]


Hardware trends

Figure: Tensor Processing Unit architecture [10]


Multinode training

Figure: SGD iterations [1]


Multinode training - data parallelism

Figure: General architecture of data parallel multinode training [1]
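As a sketch of the idea (not the exact architecture from the figure): in synchronous data parallelism every worker holds a full replica of the model, computes gradients on its own mini-batch, and the gradients are averaged (e.g. by an all-reduce or a parameter server) before one shared update is applied. Here the workers are simulated with a plain NumPy loop; worker count, model and data are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(4096, 1)); y = 3.0 * X[:, 0]

    n_workers, lr, local_batch = 4, 0.1, 32
    w = 0.0                                                  # one shared model replica
    for step in range(100):
        grads = []
        for worker in range(n_workers):                      # each worker sees a different batch
            idx = rng.integers(0, len(X), size=local_batch)
            xb, yb = X[idx, 0], y[idx]
            grads.append(2.0 * np.mean((w * xb - yb) * xb))
        w -= lr * np.mean(grads)                             # average gradients (the "all-reduce"), then update
    print(w)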


Multinode training - model parallelism

Figure: General architecture of model parallel multinode training [2]
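Model parallelism instead splits the network itself across devices, so each device holds only part of the parameters and activations are exchanged at layer boundaries. A hedged TensorFlow 1.x sketch using explicit device placement (device names and layer sizes are illustrative, not the architecture from [2]):

    import tensorflow as tf  # TensorFlow 1.x device-placement style, graph construction only

    x = tf.placeholder(tf.float32, shape=(None, 1024))

    # First half of the model on one device...
    with tf.device("/gpu:0"):
        h = tf.layers.dense(x, 4096, activation=tf.nn.relu)

    # ...second half on another; the activation tensor h crosses the device boundary
    with tf.device("/gpu:1"):
        logits = tf.layers.dense(h, 10)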


Multinode training - mixed model/data

Convolutional layers cumulatively contain about 90-95% of the computation, about 5% of the parameters, and have large representations. [3]
Fully-connected layers contain about 5-10% of the computation, about 95% of the parameters, and have small representations. [3]

Figure: Mixed data/model parallel multinode training [3]


Multinode training - search parallelism

Figure: US Department of Energy Deep Learning objectives [11]

"DNNs in general do not have good strong scaling behavior, so to fully exploit large-scale parallelism they rely on a combination of model, data and search parallelism." [4]


Multinode training - model averaging

Figure: Parallel training of DNNs with natural gradient and parameter averaging [5]
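The averaging idea can be sketched separately from the natural-gradient part of [5]: each worker runs local SGD on its own replica, and every few steps the replicas are averaged into one model that all workers continue from. A toy NumPy sketch (averaging frequency, worker count and model are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(4096, 1)); y = 3.0 * X[:, 0]

    n_workers, lr, avg_every = 4, 0.1, 10
    w_workers = np.zeros(n_workers)                          # one independent replica per worker
    for step in range(1, 201):
        for k in range(n_workers):                           # independent local SGD steps
            idx = rng.integers(0, len(X), size=32)
            xb, yb = X[idx, 0], y[idx]
            w_workers[k] -= lr * 2.0 * np.mean((w_workers[k] * xb - yb) * xb)
        if step % avg_every == 0:                            # periodically average the replicas
            w_workers[:] = w_workers.mean()
    print(w_workers[0])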


Multinode training - model averaging frequency?

Figure: Experiments with model averaging frequency [6]


Multinode training - model averaging frequency?

Figure: Using experience from our HPCS class to optimize a popular ML framework [6]


Federated Learning

Figure: Federated Learning architecture [7,8]
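Federated Averaging [7] follows a similar pattern, except that the data never leaves the clients: in each round the server sends the current model to a sampled subset of clients, each client takes a few local SGD steps on its private data, and the server averages the returned models. A toy sketch (client count, sampling and local steps are illustrative):

    import numpy as np

    rng = np.random.default_rng(2)
    clients = []                                             # each client keeps its own private shard
    for _ in range(10):
        x = rng.normal(size=(256, 1))
        clients.append((x, 3.0 * x[:, 0]))

    w_global, lr = 0.0, 0.1
    for rnd in range(20):                                    # communication rounds
        picked = rng.choice(len(clients), size=3, replace=False)
        updates = []
        for c in picked:
            x, y = clients[c]
            w = w_global                                     # client starts from the global model
            for _ in range(5):                               # a few local SGD steps on local data only
                idx = rng.integers(0, len(x), size=32)
                w -= lr * 2.0 * np.mean((w * x[idx, 0] - y[idx]) * x[idx, 0])
            updates.append(w)
        w_global = float(np.mean(updates))                   # server averages the client models
    print(w_global)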


References

1 https://github.com/tensorflow/models/tree/master/research/inception

2 Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q.V., et al., 2012. Large scale distributed deep networks. In: Advances in Neural Information Processing Systems, pp. 1223–1231.

3 Krizhevsky, A., 2014. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997.

4 Stevens, R., 2017. Deep Learning in Cancer and Infectious Disease: Novel Driver Problems for Future HPC Architecture. ACM Press, pp. 65–65. https://doi.org/10.1145/3078597.3091526

5 Povey, D., Zhang, X., Khudanpur, S., 2014. Parallel training of deep neural networks with natural gradient andparameter averaging. CoRR, vol. abs/1410.7455.

6 Rosciszewski, P., Kaliski, J., 2017. Minimizing Distribution and Data Loading Overheads in Parallel Training of DNNAcoustic Models with Frequent Parameter Averaging. IEEE, pp. 560–565. https://doi.org/10.1109/HPCS.2017.89

7 McMahan, H.B., Moore, E., Ramage, D., Hampson, S., 2016. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629.

8 https://research.googleblog.com/2017/04/federated-learning-collaborative.html

9 https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/

10 https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

11 https://www.slideshare.net/insideHPC/a-vision-for-exascale-simulation-and-deep-learning

