Exploiting Multi-core Processors for HPC and Deep Learning: The MVAPICH2 Approach
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Intel Booth Talk at SC ‘19
Increasing Usage of HPC, Big Data and Deep Learning
• HPC (MPI, RDMA, Lustre, etc.)
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
Convergence of HPC, Big Data, and Deep Learning!
Increasing need to run these applications on the cloud!!
Overview of the MVAPICH2 Project
• High-performance, open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002 (Supercomputing ‘02)
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 3,050 organizations in 89 countries
– More than 615,000 (> 0.6 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘19 ranking)
• 3rd, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 5th, 448,448 cores (Frontera) at TACC
• 8th, 391,680 cores (ABCI) in Japan
• 14th, 570,020 cores (Nurion) in South Korea and many others
– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
• Partner in the #5 TACC Frontera system
Architecture of MVAPICH2 Software Family (HPC and DL)
• High-performance parallel programming models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
• High-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, and introspection & analysis
• Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
• Transport protocols: RC, SRD, UD, DC; modern features: UMR, ODP, SR-IOV, multi-rail
• Transport mechanisms: shared memory, CMA, XPMEM, IVSHMEM; modern features: Optane*, NVLink, CAPI* (* upcoming)
One-way Latency: MPI over IB with MVAPICH2
[Charts: small-message and large-message one-way latency (us) vs. message size (bytes) for six interconnects; annotated small-message latencies fall between roughly 1.0 and 1.2 us]
Testbed configurations:
• TrueScale-QDR: 3.1 GHz deca-core (Haswell) Intel, PCIe Gen3, with IB switch
• ConnectX-3-FDR: 2.8 GHz deca-core (IvyBridge) Intel, PCIe Gen3, with IB switch
• ConnectIB-Dual FDR: 3.1 GHz deca-core (Haswell) Intel, PCIe Gen3, with IB switch
• ConnectX-4-EDR: 3.1 GHz deca-core (Haswell) Intel, PCIe Gen3, with IB switch
• Omni-Path: 3.1 GHz deca-core (Haswell) Intel, PCIe Gen3, with Omni-Path switch
• ConnectX-6-HDR: 3.1 GHz deca-core (Haswell) Intel, PCIe Gen3, with IB switch
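These latency curves are produced with osu_latency-style ping-pong measurements. Below is a minimal sketch of that measurement pattern, written with mpi4py rather than the C OSU micro-benchmarks; the message-size range and the skip/iteration counts are illustrative assumptions, not the benchmark's actual settings.

```python
# Ping-pong one-way latency sketch (osu_latency-style), assuming mpi4py and
# exactly 2 MPI ranks, e.g.:  mpirun -np 2 python latency.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
skip, iters = 100, 10000                    # warm-up and timed iterations (illustrative)

for size in [2**i for i in range(0, 21)]:   # 1 B .. 1 MB
    buf = np.zeros(size, dtype=np.uint8)
    comm.Barrier()
    for i in range(skip + iters):
        if i == skip:
            t0 = MPI.Wtime()
        if rank == 0:
            comm.Send([buf, MPI.BYTE], dest=1, tag=0)
            comm.Recv([buf, MPI.BYTE], source=1, tag=0)
        else:
            comm.Recv([buf, MPI.BYTE], source=0, tag=0)
            comm.Send([buf, MPI.BYTE], dest=0, tag=0)
    if rank == 0:
        # one-way latency = half the average round-trip time
        lat_us = (MPI.Wtime() - t0) * 1e6 / (2 * iters)
        print(f"{size} bytes: {lat_us:.2f} us")
```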
Bandwidth: MPI over IB with MVAPICH2
[Charts: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size (bytes) for the same six interconnects; annotated peaks of 24,532 MBytes/sec (unidirectional) and 48,027 MBytes/sec (bidirectional)]
Testbed configurations: same Haswell/IvyBridge PCIe Gen3 systems as in the latency measurements above.
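Bandwidth is measured differently from latency: the sender keeps a window of non-blocking sends in flight so the link stays saturated, and the receiver acknowledges each window. A rough osu_bw-style sketch with mpi4py follows; the window size, message size, and iteration count are assumptions chosen for illustration.

```python
# Streaming bandwidth sketch (osu_bw-style), assuming mpi4py and exactly 2 ranks.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
window, iters = 64, 100                            # in-flight messages per batch (illustrative)
size = 1 << 20                                     # 1 MB messages (illustrative)
sbuf = np.zeros(size, dtype=np.uint8)
rbufs = np.zeros((window, size), dtype=np.uint8)   # one receive slot per in-flight message
ack = np.zeros(1, dtype=np.uint8)

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(iters):
    if rank == 0:
        reqs = [comm.Isend([sbuf, MPI.BYTE], dest=1, tag=0) for _ in range(window)]
        MPI.Request.Waitall(reqs)
        comm.Recv([ack, MPI.BYTE], source=1, tag=1)   # receiver's ack closes the batch
    else:
        reqs = [comm.Irecv([rbufs[w], MPI.BYTE], source=0, tag=0) for w in range(window)]
        MPI.Request.Waitall(reqs)
        comm.Send([ack, MPI.BYTE], dest=0, tag=1)
elapsed = MPI.Wtime() - t0

if rank == 0:
    mbytes = size * window * iters / 1e6
    print(f"Unidirectional bandwidth: {mbytes / elapsed:.0f} MBytes/sec")
```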
Startup Performance on TACC Frontera
• MPI_Init takes 3.9 seconds on 57,344 processes across 1,024 nodes with MVAPICH2 2.3.2, compared to 4.5 seconds with Intel MPI 2019
• All numbers reported with 56 processes per node
• New designs available in MVAPICH2-2.3.2
[Chart: MPI_Init time (milliseconds) vs. number of processes (56 to 57,344) on Frontera, comparing Intel MPI 2019 and MVAPICH2 2.3.2]
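The quantity plotted above is the cost of MPI_Init itself at scale. One way to reproduce the measurement from Python is to disable mpi4py's automatic initialization and time MPI.Init() explicitly, as in the sketch below; the launch command in the comment is only an example.

```python
# Timing MPI_Init, assuming mpi4py; automatic initialization is turned off so
# the call can be timed explicitly.  Launch at the process count of interest:
#   mpirun -np 57344 python init_time.py
import time
import mpi4py
mpi4py.rc.initialize = False       # do not call MPI_Init on import
mpi4py.rc.finalize = False
from mpi4py import MPI

t0 = time.time()
MPI.Init()
init_time = time.time() - t0

comm = MPI.COMM_WORLD
# Report the slowest rank, since startup is only complete when everyone is up.
worst = comm.reduce(init_time, op=MPI.MAX, root=0)
if comm.Get_rank() == 0:
    print(f"MPI_Init took {worst * 1000:.0f} ms on {comm.Get_size()} processes")
MPI.Finalize()
```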
Multicast-based Bcast on Frontera
[Charts: MPI_Bcast latency (us) vs. message size (1 B to 256 KB) for 8 to 512 nodes, and latency (us) vs. number of nodes (8 to 512) for message sizes from 16 B to 512 KB]
• Scalable broadcast as the system size increases
MPI_Allreduce on KNL + Omni-Path (10,240 Processes)
[Charts: OSU micro-benchmark MPI_Allreduce latency (us) vs. message size (4 B to 4 KB and 8 KB to 256 KB), 64 processes per node, comparing MVAPICH2, MVAPICH2-OPT, and IMPI]
• For MPI_Allreduce latency with 32 KB messages, MVAPICH2-OPT reduces the latency by 2.4X (a minimal measurement sketch follows below)
M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, "Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design", Supercomputing '17. Available since MVAPICH2-X 2.3b.
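The numbers above come from an osu_allreduce-style latency sweep. Here is a minimal mpi4py sketch of the same pattern; the warm-up and iteration counts are illustrative assumptions, and a real run would use 64 processes per node as in the chart.

```python
# MPI_Allreduce latency sweep sketch (osu_allreduce-style), assuming mpi4py.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
skip, iters = 10, 1000                        # illustrative counts

for size in [2**i for i in range(2, 19)]:     # 4 B .. 256 KB
    count = size // 4                         # number of float32 elements
    sendbuf = np.zeros(count, dtype=np.float32)
    recvbuf = np.zeros(count, dtype=np.float32)
    comm.Barrier()
    for i in range(skip + iters):
        if i == skip:
            comm.Barrier()
            t0 = MPI.Wtime()
        comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)
    lat_us = (MPI.Wtime() - t0) * 1e6 / iters
    worst = comm.reduce(lat_us, op=MPI.MAX, root=0)   # report the slowest rank
    if rank == 0:
        print(f"{size} bytes: {worst:.2f} us")
```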
Performance of MPI_Ialltoall using HW Tag Matching on Frontera
[Charts: MPI_Ialltoall latency (us) vs. message size (16 KB to 1 MB) on 8, 16, 32, and 64 nodes, comparing MVAPICH2 with and without hardware tag matching (MVAPICH2+HW-TM); annotated improvements range from 1.5X to 1.8X]
• Up to 1.8X performance improvement
• Sustained benefits as system size increases
Overlap with MPI_Ialltoall using HW Tag Matching on Frontera
[Charts: communication/computation overlap (%) vs. message size (16 KB to 1 MB) on 8, 16, 32, and 64 nodes, comparing MVAPICH2 with and without hardware tag matching (MVAPICH2+HW-TM)]
• Maximizes the overlap of communication and computation (see the measurement sketch below)
• Sustained benefits as system size increases
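Overlap here is the fraction of the MPI_Ialltoall time that can be hidden behind computation placed between the collective call and its Wait. The sketch below shows a simplified version of how that is measured; the OSU overlap benchmarks are more careful (they average over many iterations and calibrate the compute loop), and the dummy compute function and per-destination message size are assumptions.

```python
# Communication/computation overlap sketch for MPI_Ialltoall, assuming mpi4py.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()
count = 64 * 1024 // 4                       # 64 KB of float32 per destination (illustrative)
sendbuf = np.zeros(count * nprocs, dtype=np.float32)
recvbuf = np.zeros(count * nprocs, dtype=np.float32)

def compute(ms):
    """Dummy computation that burns roughly `ms` milliseconds."""
    t_end = MPI.Wtime() + ms / 1000.0
    while MPI.Wtime() < t_end:
        pass

# 1) Pure communication time: post the collective and wait immediately.
comm.Barrier(); t0 = MPI.Wtime()
req = comm.Ialltoall(sendbuf, recvbuf)
req.Wait()
t_pure = MPI.Wtime() - t0

# 2) Same collective, with a comparable amount of computation before the Wait.
comm.Barrier(); t0 = MPI.Wtime()
req = comm.Ialltoall(sendbuf, recvbuf)
compute(t_pure * 1000)
req.Wait()
t_total = MPI.Wtime() - t0

# Overlap: how much of the communication was hidden behind the computation.
# (In practice, average both timings over many iterations.)
overlap = max(0.0, 100.0 * (1.0 - (t_total - t_pure) / t_pure))
if rank == 0:
    print(f"Overlap: {overlap:.1f} %")
```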
Evaluation of Applications on Frontera (Cascade Lake + HDR100)
[Charts: MILC (MIMD Lattice Computation) CG time vs. number of processes (13,824 to 69,984; PPN=54) and WRF2 time per step vs. number of processes (7,168 to 28,672; PPN=56)]
• Performance of the MILC and WRF2 applications scales well as the system size increases
Broad Challenge: Exploiting HPC for Deep Learning
How to efficiently scale out Deep Learning (DL) workloads by better exploiting High Performance Computing (HPC) resources like multi-/many-core CPUs?
Data Parallel Training with TensorFlow
• gRPC
  – Officially available and supported
  – Open source; can be enhanced by others
  – Accelerated gRPC (add RDMA to gRPC)
• gRPC+X
  – Use gRPC for bootstrap and rendezvous
  – Actual communication is in "X"
  – X = MPI, Verbs, GPUDirect RDMA (GDR), etc.
• No-gRPC
  – Baidu: the first to use MPI collectives for TensorFlow
  – Horovod: uses NCCL, MPI, or any other future library (e.g., IBM DDL support was recently added); a minimal sketch follows below
*Awan et al., "Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation", CCGrid '19.
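In the No-gRPC category, Horovod routes gradient aggregation through MPI collectives, which is where an MPI library such as MVAPICH2 is used underneath TensorFlow. A minimal data-parallel Keras sketch in that style is shown below; the toy model, the CIFAR-10 stand-in dataset, and the hyper-parameters are placeholders rather than the configuration used in the cited work.

```python
# Data-parallel training sketch in the Horovod (No-gRPC) style, where gradient
# averaging goes through MPI collectives underneath TensorFlow.
# Assumes TensorFlow and horovod.tensorflow.keras are installed; launch e.g.:
#   mpirun -np 8 python train.py
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                   # initializes Horovod (MPI under the hood)

# Toy model and stand-in dataset; a real run would use ResNet-50/152 on ImageNet.
(x, y), _ = tf.keras.datasets.cifar10.load_data()
x = x.astype("float32") / 255.0
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Common heuristic: scale the learning rate by the number of workers.
opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)          # wraps the optimizer with allreduce

model.compile(optimizer=opt,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    # Broadcast initial weights from rank 0 so every worker starts identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Per-process batch size; the effective batch size is 32 * hvd.size().
model.fit(x, y, batch_size=32, epochs=1,
          callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```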
CPU-based Training: Single-Node Multi-Process (MP) Mode
ResNet-152 training performance:
• For BS=64, 4 ppn is better; for BS=32, 8 ppn is slightly better
• However, keeping the effective batch size (EBS) low is more important. Why? DNNs do not converge to state-of-the-art accuracy when the batch size is large. (The EBS is the per-process batch size times the number of processes; e.g., BS=32 with 8 processes per node gives an EBS of 256 on one node.)
ResNet-152: Single Process (SP) vs. Multi-Process (MP):
• MP is better for all effective batch sizes
• Up to 1.35X better performance for MP compared to SP for BS=64
*Jain et al., "Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters", Cluster '19.
CPU-based Training: Multi-Node (MN): MP vs. SP?
• Platform: Skylake (48 cores, 96 threads per node)
• Scale: 32 nodes
• MP-Tuned: up to 1.5X better than SP
• MP-Tuned: 10% better than MP-Default
• Why is MP-Tuned better? It uses the best possible number of inter-op and intra-op threads (see the sketch below)
*Jain et al., "Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters", Cluster '19.
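"MP-Tuned" boils down to sizing TensorFlow's intra-op and inter-op thread pools (and the OpenMP/MKL thread count) to the cores each MPI process actually owns, rather than the framework defaults. A sketch of setting those knobs is below; the values assume 4 processes on a 48-core node and are illustrative, not the tuned settings from the paper.

```python
# Thread-tuning sketch for multi-process CPU training with TensorFlow.
# Assumes 4 MPI processes per 48-core node, i.e. 12 cores per process; the
# exact values are illustrative and should be tuned per platform.
import os

cores_per_proc = 12
# OpenMP threads used by MKL/oneDNN kernels inside each process.
os.environ["OMP_NUM_THREADS"] = str(cores_per_proc)
os.environ["KMP_BLOCKTIME"] = "1"            # common Intel CPU tuning knob
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

import tensorflow as tf

# TensorFlow's own thread pools: intra-op parallelism for individual ops,
# inter-op parallelism for independent ops that can run concurrently.
tf.config.threading.set_intra_op_parallelism_threads(cores_per_proc)
tf.config.threading.set_inter_op_parallelism_threads(2)

# ... build and train the model as usual; with MPI, each process is pinned
# to its own subset of cores by the launcher (e.g. mpirun binding options).
```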
Multi-Node Multi-Process (MN): TensorFlow vs. PyTorch
• This is an early experience with PyTorch
• TensorFlow is up to 2.5X faster than PyTorch on 128 nodes
• TensorFlow: up to 125X speedup for ResNet-152 on 128 nodes
• PyTorch: scales well, but overall lower performance than TensorFlow
*Jain et al., "Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters", Cluster '19.
Scaling ResNet-50 on TACC Frontera: 2,048 Nodes!
• Scaled TensorFlow to 2,048 nodes on Frontera using MVAPICH2 and Intel MPI
• MVAPICH2 and Intel MPI give similar performance for DNN training
• Reported a peak of 260,000 images/sec on 2,048 nodes
• On 2,048 nodes, ResNet-50 can be trained in 7 minutes! (see the back-of-the-envelope check below)
*Jain et al., "Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera", DLS '19 (in conjunction with SC '19).
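A quick back-of-the-envelope check of the 7-minute figure, assuming the standard 90-epoch ImageNet-1K schedule (~1.28 million training images per epoch); both of those values are my assumptions, since the slide does not state them.

```python
# Back-of-the-envelope check of "ResNet-50 in 7 minutes" at 260,000 images/sec.
# Assumes ImageNet-1K (~1.28M training images) and a 90-epoch schedule; both
# values are assumptions, not numbers taken from the slide.
images_per_epoch = 1_281_167
epochs = 90
throughput = 260_000                          # images/sec reported on 2,048 nodes

seconds = images_per_epoch * epochs / throughput
print(f"Estimated training time: {seconds / 60:.1f} minutes")   # ~7.4 minutes
```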
Benchmarking HyPar-Flow on Stampede2
• CPU-based hybrid-parallel (data parallelism and model parallelism) training on Stampede2
• Benchmark developed for various configurations:
  – Batch sizes
  – Number of model partitions
  – Number of model replicas
• Evaluation on a very deep model: ResNet-1000 (a 1,000-layer model)
• 110X speedup on 128 Intel Xeon Skylake nodes (TACC Stampede2 cluster)
*Awan et al., "HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models", arXiv '19. https://arxiv.org/pdf/1911.05146.pdf
Conclusions
• Scalable performance for HPC and distributed training is becoming increasingly important
• Achieving it requires high-performance middleware designs that exploit modern interconnects
• Presented the approaches taken by the MVAPICH2 project to achieve scalable distributed training
• The project will continue to enable the HPC and DL communities to achieve scalability and high performance for their scientific and distributed training workloads
Funding Acknowledgments
Funding support by [sponsor logos]
Equipment support by [equipment vendor logos]
Personnel Acknowledgments
Current Students (Graduate)
– A. Awan (Ph.D.)
– M. Bayatpour (Ph.D.)
– C.-H. Chu (Ph.D.)
– J. Hashmi (Ph.D.)
– A. Jain (Ph.D.)
– K. S. Kandadi (M.S.)
Past Students
– A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– R. Biswas (M.S.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– S. Chakraborty (Ph.D.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– R. Rajachandrasekar (Ph.D.)
– D. Shankar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
– J. Zhang (Ph.D.)
Past Research Scientist
– K. Hamidouche
– S. Sur
– X. Lu
Past Post-Docs
– D. Banerjee
– X. Besseron
– H.-W. Jin
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– Kamal Raj (M.S.)
– K. S. Khorassani (Ph.D.)
– P. Kousha (Ph.D.)
– A. Quentin (Ph.D.)
– B. Ramesh (M. S.)
– S. Xu (M.S.)
– J. Lin
– M. Luo
– E. Mancini
Past Programmers
– D. Bureddy
– J. Perkins
Current Research Specialist
– J. Smith
– S. Marcarelli
– J. Vienne
– H. Wang
Current Post-doc
– M. S. Ghazimeersaeed
– A. Ruhela
– K. Manian
Current Students (Undergraduate)
– V. Gangal (B.S.)
– N. Sarkauskas (B.S.)
Past Research Specialist
– M. Arnold
Current Research Scientist
– H. Subramoni
– Q. Zhou (Ph.D.)
Thank You!
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/