Exploiting Multi-core Processors for HPC and Deep Learning: The MVAPICH2 Approach
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Intel Booth Talk at SC ‘19
Increasing Usage of HPC, Big Data and Deep Learning
• HPC (MPI, RDMA, Lustre, etc.)
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
Convergence of HPC, Big Data, and Deep Learning!
Increasing need to run these applications on the cloud!!
Overview of the MVAPICH2 Project
• High-performance, open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002 (Supercomputing ‘02)
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 3,050 organizations in 89 countries
– More than 615,000 (> 0.6 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘19 ranking)
• 3rd, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 5th, 448,448 cores (Frontera) at TACC
• 8th, 391,680 cores (ABCI) in Japan
• 14th, 570,020 cores (Nurion) in South Korea and many others
– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
• Partner in the #5 TACC Frontera system
Architecture of MVAPICH2 Software Family (HPC and DL)
• High-performance parallel programming models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
• High-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, and introspection & analysis
• Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
• Transport protocols: RC, SRD, UD, DC; modern features: UMR, ODP, SR-IOV, multi-rail
• Transport mechanisms: shared memory, CMA, XPMEM, IVSHMEM; modern features: Optane*, NVLink, CAPI* (* upcoming)
One-way Latency: MPI over IB with MVAPICH2
[Charts: small-message and large-message one-way latency (us) vs. message size (bytes) for six interconnects; annotated small-message latencies fall between roughly 1.0 and 1.2 us]
Testbed configurations:
• TrueScale-QDR: 3.1 GHz deca-core (Haswell) Intel, PCIe Gen3, with IB switch
• ConnectX-3-FDR: 2.8 GHz deca-core (IvyBridge) Intel, PCIe Gen3, with IB switch
• ConnectIB-Dual FDR: 3.1 GHz deca-core (Haswell) Intel, PCIe Gen3, with IB switch
• ConnectX-4-EDR: 3.1 GHz deca-core (Haswell) Intel, PCIe Gen3, with IB switch
• Omni-Path: 3.1 GHz deca-core (Haswell) Intel, PCIe Gen3, with Omni-Path switch
• ConnectX-6-HDR: 3.1 GHz deca-core (Haswell) Intel, PCIe Gen3, with IB switch
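These latency curves are produced with osu_latency-style ping-pong measurements. Below is a minimal sketch of that measurement pattern, written with mpi4py rather than the C OSU micro-benchmarks; the message-size range and the skip/iteration counts are illustrative assumptions, not the benchmark's actual settings.

```python
# Ping-pong one-way latency sketch (osu_latency-style), assuming mpi4py and
# exactly 2 MPI ranks, e.g.:  mpirun -np 2 python latency.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
skip, iters = 100, 10000                    # warm-up and timed iterations (illustrative)

for size in [2**i for i in range(0, 21)]:   # 1 B .. 1 MB
    buf = np.zeros(size, dtype=np.uint8)
    comm.Barrier()
    for i in range(skip + iters):
        if i == skip:
            t0 = MPI.Wtime()
        if rank == 0:
            comm.Send([buf, MPI.BYTE], dest=1, tag=0)
            comm.Recv([buf, MPI.BYTE], source=1, tag=0)
        else:
            comm.Recv([buf, MPI.BYTE], source=0, tag=0)
            comm.Send([buf, MPI.BYTE], dest=0, tag=0)
    if rank == 0:
        # one-way latency = half the average round-trip time
        lat_us = (MPI.Wtime() - t0) * 1e6 / (2 * iters)
        print(f"{size} bytes: {lat_us:.2f} us")
```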
Bandwidth: MPI over IB with MVAPICH2
[Charts: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size (bytes) for the same six interconnects; annotated peaks of 24,532 MBytes/sec (unidirectional) and 48,027 MBytes/sec (bidirectional)]
Testbed configurations: same Haswell/IvyBridge PCIe Gen3 systems as in the latency measurements above.
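Bandwidth is measured differently from latency: the sender keeps a window of non-blocking sends in flight so the link stays saturated, and the receiver acknowledges each window. A rough osu_bw-style sketch with mpi4py follows; the window size, message size, and iteration count are assumptions chosen for illustration.

```python
# Streaming bandwidth sketch (osu_bw-style), assuming mpi4py and exactly 2 ranks.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
window, iters = 64, 100                            # in-flight messages per batch (illustrative)
size = 1 << 20                                     # 1 MB messages (illustrative)
sbuf = np.zeros(size, dtype=np.uint8)
rbufs = np.zeros((window, size), dtype=np.uint8)   # one receive slot per in-flight message
ack = np.zeros(1, dtype=np.uint8)

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(iters):
    if rank == 0:
        reqs = [comm.Isend([sbuf, MPI.BYTE], dest=1, tag=0) for _ in range(window)]
        MPI.Request.Waitall(reqs)
        comm.Recv([ack, MPI.BYTE], source=1, tag=1)   # receiver's ack closes the batch
    else:
        reqs = [comm.Irecv([rbufs[w], MPI.BYTE], source=0, tag=0) for w in range(window)]
        MPI.Request.Waitall(reqs)
        comm.Send([ack, MPI.BYTE], dest=0, tag=1)
elapsed = MPI.Wtime() - t0

if rank == 0:
    mbytes = size * window * iters / 1e6
    print(f"Unidirectional bandwidth: {mbytes / elapsed:.0f} MBytes/sec")
```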
Startup Performance on TACC Frontera
• MPI_Init takes 3.9 seconds on 57,344 processes across 1,024 nodes with MVAPICH2 2.3.2, compared to 4.5 seconds with Intel MPI 2019
• All numbers reported with 56 processes per node
• New designs available in MVAPICH2-2.3.2
[Chart: MPI_Init time (milliseconds) vs. number of processes (56 to 57,344) on Frontera, comparing Intel MPI 2019 and MVAPICH2 2.3.2]
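The quantity plotted above is the cost of MPI_Init itself at scale. One way to reproduce the measurement from Python is to disable mpi4py's automatic initialization and time MPI.Init() explicitly, as in the sketch below; the launch command in the comment is only an example.

```python
# Timing MPI_Init, assuming mpi4py; automatic initialization is turned off so
# the call can be timed explicitly.  Launch at the process count of interest:
#   mpirun -np 57344 python init_time.py
import time
import mpi4py
mpi4py.rc.initialize = False       # do not call MPI_Init on import
mpi4py.rc.finalize = False
from mpi4py import MPI

t0 = time.time()
MPI.Init()
init_time = time.time() - t0

comm = MPI.COMM_WORLD
# Report the slowest rank, since startup is only complete when everyone is up.
worst = comm.reduce(init_time, op=MPI.MAX, root=0)
if comm.Get_rank() == 0:
    print(f"MPI_Init took {worst * 1000:.0f} ms on {comm.Get_size()} processes")
MPI.Finalize()
```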
Multicast-based Bcast on Frontera
[Charts: MPI_Bcast latency (us) vs. message size (1 B to 256 KB) for 8 to 512 nodes, and latency (us) vs. number of nodes (8 to 512) for message sizes from 16 B to 512 KB]
• Scalable broadcast as the system size increases
MPI_Allreduce on KNL + Omni-Path (10,240 Processes)
[Charts: OSU micro-benchmark MPI_Allreduce latency (us) vs. message size (4 B to 4 KB and 8 KB to 256 KB), 64 processes per node, comparing MVAPICH2, MVAPICH2-OPT, and IMPI]
• For MPI_Allreduce latency with 32 KB messages, MVAPICH2-OPT reduces the latency by 2.4X (a minimal measurement sketch follows below)
M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, "Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design", Supercomputing '17. Available since MVAPICH2-X 2.3b.
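The numbers above come from an osu_allreduce-style latency sweep. Here is a minimal mpi4py sketch of the same pattern; the warm-up and iteration counts are illustrative assumptions, and a real run would use 64 processes per node as in the chart.

```python
# MPI_Allreduce latency sweep sketch (osu_allreduce-style), assuming mpi4py.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
skip, iters = 10, 1000                        # illustrative counts

for size in [2**i for i in range(2, 19)]:     # 4 B .. 256 KB
    count = size // 4                         # number of float32 elements
    sendbuf = np.zeros(count, dtype=np.float32)
    recvbuf = np.zeros(count, dtype=np.float32)
    comm.Barrier()
    for i in range(skip + iters):
        if i == skip:
            comm.Barrier()
            t0 = MPI.Wtime()
        comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)
    lat_us = (MPI.Wtime() - t0) * 1e6 / iters
    worst = comm.reduce(lat_us, op=MPI.MAX, root=0)   # report the slowest rank
    if rank == 0:
        print(f"{size} bytes: {worst:.2f} us")
```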
Performance of MPI_Ialltoall using HW Tag Matching on Frontera
[Charts: MPI_Ialltoall latency (us) vs. message size (16 KB to 1 MB) on 8, 16, 32, and 64 nodes, comparing MVAPICH2 with and without hardware tag matching (MVAPICH2+HW-TM); annotated improvements range from 1.5X to 1.8X]
• Up to 1.8X performance improvement
• Sustained benefits as system size increases
Overlap with MPI_Ialltoall using HW Tag Matching on Frontera
[Charts: communication/computation overlap (%) vs. message size (16 KB to 1 MB) on 8, 16, 32, and 64 nodes, comparing MVAPICH2 with and without hardware tag matching (MVAPICH2+HW-TM)]
• Maximizes the overlap of communication and computation (see the measurement sketch below)
• Sustained benefits as system size increases
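Overlap here is the fraction of the MPI_Ialltoall time that can be hidden behind computation placed between the collective call and its Wait. The sketch below shows a simplified version of how that is measured; the OSU overlap benchmarks are more careful (they average over many iterations and calibrate the compute loop), and the dummy compute function and per-destination message size are assumptions.

```python
# Communication/computation overlap sketch for MPI_Ialltoall, assuming mpi4py.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()
count = 64 * 1024 // 4                       # 64 KB of float32 per destination (illustrative)
sendbuf = np.zeros(count * nprocs, dtype=np.float32)
recvbuf = np.zeros(count * nprocs, dtype=np.float32)

def compute(ms):
    """Dummy computation that burns roughly `ms` milliseconds."""
    t_end = MPI.Wtime() + ms / 1000.0
    while MPI.Wtime() < t_end:
        pass

# 1) Pure communication time: post the collective and wait immediately.
comm.Barrier(); t0 = MPI.Wtime()
req = comm.Ialltoall(sendbuf, recvbuf)
req.Wait()
t_pure = MPI.Wtime() - t0

# 2) Same collective, with a comparable amount of computation before the Wait.
comm.Barrier(); t0 = MPI.Wtime()
req = comm.Ialltoall(sendbuf, recvbuf)
compute(t_pure * 1000)
req.Wait()
t_total = MPI.Wtime() - t0

# Overlap: how much of the communication was hidden behind the computation.
# (In practice, average both timings over many iterations.)
overlap = max(0.0, 100.0 * (1.0 - (t_total - t_pure) / t_pure))
if rank == 0:
    print(f"Overlap: {overlap:.1f} %")
```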
Evaluation of Applications on Frontera (Cascade Lake + HDR100)
[Charts: MILC (MIMD Lattice Computation) CG time vs. number of processes (13,824 to 69,984; PPN=54) and WRF2 time per step vs. number of processes (7,168 to 28,672; PPN=56)]
• Performance of the MILC and WRF2 applications scales well as the system size increases
Broad Challenge: Exploiting HPC for Deep Learning
How to efficiently scale out Deep Learning (DL) workloads by better exploiting High Performance Computing (HPC) resources like multi-/many-core CPUs?
Data Parallel Training with TensorFlow
• gRPC
  – Officially available and supported
  – Open source; can be enhanced by others
  – Accelerated gRPC (add RDMA to gRPC)
• gRPC+X
  – Use gRPC for bootstrap and rendezvous
  – Actual communication is in "X"
  – X = MPI, Verbs, GPUDirect RDMA (GDR), etc.
• No-gRPC
  – Baidu: the first to use MPI collectives for TensorFlow
  – Horovod: uses NCCL, MPI, or any other future library (e.g., IBM DDL support was recently added); a minimal sketch follows below
*Awan et al., "Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation", CCGrid '19.
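In the No-gRPC category, Horovod routes gradient aggregation through MPI collectives, which is where an MPI library such as MVAPICH2 is used underneath TensorFlow. A minimal data-parallel Keras sketch in that style is shown below; the toy model, the CIFAR-10 stand-in dataset, and the hyper-parameters are placeholders rather than the configuration used in the cited work.

```python
# Data-parallel training sketch in the Horovod (No-gRPC) style, where gradient
# averaging goes through MPI collectives underneath TensorFlow.
# Assumes TensorFlow and horovod.tensorflow.keras are installed; launch e.g.:
#   mpirun -np 8 python train.py
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                   # initializes Horovod (MPI under the hood)

# Toy model and stand-in dataset; a real run would use ResNet-50/152 on ImageNet.
(x, y), _ = tf.keras.datasets.cifar10.load_data()
x = x.astype("float32") / 255.0
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Common heuristic: scale the learning rate by the number of workers.
opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)          # wraps the optimizer with allreduce

model.compile(optimizer=opt,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    # Broadcast initial weights from rank 0 so every worker starts identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Per-process batch size; the effective batch size is 32 * hvd.size().
model.fit(x, y, batch_size=32, epochs=1,
          callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```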
CPU-based Training: Single-Node Multi-Process (MP) Mode
ResNet-152 training performance:
• For BS=64, 4 ppn is better; for BS=32, 8 ppn is slightly better
• However, keeping the effective batch size (EBS) low is more important. Why? DNNs do not converge to state-of-the-art accuracy when the batch size is large. (The EBS is the per-process batch size times the number of processes; e.g., BS=32 with 8 processes per node gives an EBS of 256 on one node.)
ResNet-152: Single Process (SP) vs. Multi-Process (MP):
• MP is better for all effective batch sizes
• Up to 1.35X better performance for MP compared to SP for BS=64
*Jain et al., "Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters", Cluster '19.
CPU-based Training: Multi-Node (MN): MP vs. SP?
• Platform: Skylake (48 cores, 96 threads per node)
• Scale: 32 nodes
• MP-Tuned: up to 1.5X better than SP
• MP-Tuned: 10% better than MP-Default
• Why is MP-Tuned better? It uses the best possible number of inter-op and intra-op threads (see the sketch below)
*Jain et al., "Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters", Cluster '19.
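"MP-Tuned" boils down to sizing TensorFlow's intra-op and inter-op thread pools (and the OpenMP/MKL thread count) to the cores each MPI process actually owns, rather than the framework defaults. A sketch of setting those knobs is below; the values assume 4 processes on a 48-core node and are illustrative, not the tuned settings from the paper.

```python
# Thread-tuning sketch for multi-process CPU training with TensorFlow.
# Assumes 4 MPI processes per 48-core node, i.e. 12 cores per process; the
# exact values are illustrative and should be tuned per platform.
import os

cores_per_proc = 12
# OpenMP threads used by MKL/oneDNN kernels inside each process.
os.environ["OMP_NUM_THREADS"] = str(cores_per_proc)
os.environ["KMP_BLOCKTIME"] = "1"            # common Intel CPU tuning knob
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

import tensorflow as tf

# TensorFlow's own thread pools: intra-op parallelism for individual ops,
# inter-op parallelism for independent ops that can run concurrently.
tf.config.threading.set_intra_op_parallelism_threads(cores_per_proc)
tf.config.threading.set_inter_op_parallelism_threads(2)

# ... build and train the model as usual; with MPI, each process is pinned
# to its own subset of cores by the launcher (e.g. mpirun binding options).
```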
Multi-Node Multi-Process (MN): TensorFlow vs. PyTorch
• This is an early experience with PyTorch
• TensorFlow is up to 2.5X faster than PyTorch on 128 nodes
• TensorFlow: up to 125X speedup for ResNet-152 on 128 nodes
• PyTorch: scales well, but overall lower performance than TensorFlow
*Jain et al., "Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters", Cluster '19.
Scaling ResNet-50 on TACC Frontera: 2,048 Nodes!
• Scaled TensorFlow to 2,048 nodes on Frontera using MVAPICH2 and Intel MPI
• MVAPICH2 and Intel MPI give similar performance for DNN training
• Reported a peak of 260,000 images/sec on 2,048 nodes
• On 2,048 nodes, ResNet-50 can be trained in 7 minutes! (see the back-of-the-envelope check below)
*Jain et al., "Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera", DLS '19 (in conjunction with SC '19).
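A quick back-of-the-envelope check of the 7-minute figure, assuming the standard 90-epoch ImageNet-1K schedule (~1.28 million training images per epoch); both of those values are my assumptions, since the slide does not state them.

```python
# Back-of-the-envelope check of "ResNet-50 in 7 minutes" at 260,000 images/sec.
# Assumes ImageNet-1K (~1.28M training images) and a 90-epoch schedule; both
# values are assumptions, not numbers taken from the slide.
images_per_epoch = 1_281_167
epochs = 90
throughput = 260_000                          # images/sec reported on 2,048 nodes

seconds = images_per_epoch * epochs / throughput
print(f"Estimated training time: {seconds / 60:.1f} minutes")   # ~7.4 minutes
```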
Benchmarking HyPar-Flow on Stampede2
• CPU-based hybrid-parallel (data parallelism and model parallelism) training on Stampede2
• Benchmark developed for various configurations:
  – Batch sizes
  – Number of model partitions
  – Number of model replicas
• Evaluation on a very deep model: ResNet-1000 (a 1,000-layer model)
• 110X speedup on 128 Intel Xeon Skylake nodes (TACC Stampede2 cluster)
*Awan et al., "HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models", arXiv '19. https://arxiv.org/pdf/1911.05146.pdf
Conclusions
• Scalable performance for HPC and distributed training is becoming increasingly important
• Achieving it requires high-performance middleware designs that exploit modern interconnects
• Presented the approaches taken by the MVAPICH2 project to achieve scalable distributed training
• The project will continue to enable the HPC and DL communities to achieve scalability and high performance for their scientific and distributed training workloads
Funding Acknowledgments
Funding support by [sponsor logos]
Equipment support by [equipment vendor logos]
Personnel Acknowledgments
Current Students (Graduate)
– A. Awan (Ph.D.)
– M. Bayatpour (Ph.D.)
– C.-H. Chu (Ph.D.)
– J. Hashmi (Ph.D.)
– A. Jain (Ph.D.)
– K. S. Kandadi (M.S.)
Past Students
– A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– R. Biswas (M.S.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– S. Chakraborty (Ph.D.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– R. Rajachandrasekar (Ph.D.)
– D. Shankar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
– J. Zhang (Ph.D.)
Past Research Scientist
– K. Hamidouche
– S. Sur
– X. Lu
Past Post-Docs
– D. Banerjee
– X. Besseron
– H.-W. Jin
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– Kamal Raj (M.S.)
– K. S. Khorassani (Ph.D.)
– P. Kousha (Ph.D.)
– A. Quentin (Ph.D.)
– B. Ramesh (M. S.)
– S. Xu (M.S.)
– J. Lin
– M. Luo
– E. Mancini
Past Programmers
– D. Bureddy
– J. Perkins
Current Research Specialist
– J. Smith
– S. Marcarelli
– J. Vienne
– H. Wang
Current Post-doc
– M. S. Ghazimeersaeed
– A. Ruhela
– K. Manian
Current Students (Undergraduate)
– V. Gangal (B.S.)
– N. Sarkauskas (B.S.)
Past Research Specialist
– M. Arnold
Current Research Scientist
– H. Subramoni
– Q. Zhou (Ph.D.)
Thank You!
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/