Qingchun Song, Mellanox | July 2, 2019
Mellanox In-Network Computing For AI and The Development With NVIDIA (SHARP – NCCL)
Data Processing Revolution – Data Centric
[Diagram: the shift from the Compute-Centric Von Neumann machine to the Data-Centric data flow machine]
CPU-Centric HPC/AI Center
[Diagram: the CPU handles everything, with network and storage attached around it]
Data-Centric HPC/AI Center
[Diagram: workloads span In-CPU Computing, In-Network Computing, and In-Storage Computing; CPU functions, network functions such as the communication framework (MPI), and storage functions each run where the data flows]
In-Network Computing Architecture
[Diagram: GPU/CPU clusters compared side by side; CPU-Centric (Onload) shows communication latencies of 30-40us, Data-Centric (Offload) shows communication latencies of 3-4us]
In-Network Computing to Enable Data-Centric Data Centers
[Diagram: a GPU/CPU cluster whose fabric combines GPUDirect RDMA, the Scalable Hierarchical Aggregation and Reduction Protocol (SHARP), and NVMe over Fabrics]
In-Network Computing Connects the World’s Fastest HPC and AI Supercomputer
• Summit CORAL System, World’s Fastest HPC / AI System
• NVIDIA V100 GPU + InfiniBand HCA + In-Network Computing fabric
GPUDirect RDMA Technology and Advantages
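A minimal sketch of what GPUDirect RDMA enables at the application level, assuming a CUDA-aware MPI build with mpi4py and CuPy installed (both are assumptions, not part of the slides): the collective operates directly on GPU buffers, and with GPUDirect RDMA the HCA reads and writes GPU memory without a staging copy through host memory.

# Sketch: allreduce on GPU-resident buffers via CUDA-aware MPI (assumed setup).
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
sendbuf = cp.full(1 << 20, comm.Get_rank(), dtype=cp.float32)  # data stays on the GPU
recvbuf = cp.empty_like(sendbuf)
comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)  # HCA can access GPU memory directly
if comm.Get_rank() == 0:
    print(float(recvbuf[0]))  # sum of all ranks' values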
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
Accelerating HPC and AI Applications
Accelerating HPC Applications
▪ Significantly reduce MPI collective runtime
▪ Increase CPU availability and efficiency
▪ Enable communication and computation overlap (a sketch follows below)

Enabling Artificial Intelligence Solutions to Perform Critical and Timely Decision Making
▪ Accelerating distributed machine learning
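As a concrete illustration of the overlap point above, here is a minimal sketch using a nonblocking MPI collective via mpi4py (an assumed dependency, not something the slides prescribe); offloading the reduction to the fabric is what lets the CPU keep computing while the collective progresses.

# Overlap sketch: start an allreduce, compute locally, then wait for the result.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
grad = np.random.rand(1 << 20)
summed = np.empty_like(grad)
req = comm.Iallreduce(grad, summed, op=MPI.SUM)  # collective progresses in the background
local = np.tanh(grad).sum()                      # overlapped local computation
req.Wait()                                       # block only when the result is needed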
AllReduce Example – Trees
▪ Many2One and One2Many traffic patterns – possible network congestion
▪ Probably not a good solution for large data
▪ Large scale requires a higher tree / larger radix
▪ Result distribution – over the tree / multicast
[Diagram: a two-stage reduction tree of switches and end nodes; a toy model follows below]
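The toy model below (pure Python/NumPy, written for this note rather than taken from the deck) mimics the two-stage tree: contributors are reduced pairwise stage by stage, finishing in O(log N) hops, while each parent absorbs a many-to-one burst from its children, which is the congestion risk flagged above.

# Toy pairwise tree reduction over a power-of-two number of end nodes.
import numpy as np

def tree_reduce(contributions):
    stage = list(contributions)
    while len(stage) > 1:
        # Each parent sums the vectors of its two children (a many-to-one step).
        stage = [stage[i] + stage[i + 1] for i in range(0, len(stage), 2)]
    return stage[0]

nodes = [np.full(4, rank, dtype=float) for rank in range(8)]
print(tree_reduce(nodes))  # elementwise sum over all 8 end nodes -> [28. 28. 28. 28.]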
AllReduce (Example) - Recursive Doubling
▪ The data is recursively divided, processed by the CPUs, and distributed
▪ Each rank's CPU is occupied performing the reduction
▪ The data is sent at least twice, consuming at least 2x the bandwidth
[Diagram: four ranks exchange ½ then ¼ of the data across four steps, in a calculation phase followed by a result-sending phase]
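The single-process simulation below (illustrative, not Mellanox code) captures the pattern in the figure: partners at doubling distances exchange buffers and reduce, so every rank's CPU does the arithmetic at each of the log2(P) steps; this CPU involvement and extra wire traffic are exactly what SHARP removes.

# Simulate recursive-doubling allreduce for P = 2^k "ranks" held in one process.
import numpy as np

def recursive_doubling_allreduce(vectors):
    p = len(vectors)
    data = [v.astype(float) for v in vectors]
    dist = 1
    while dist < p:
        # Partners at distance `dist` exchange buffers; both compute the sum.
        data = [data[r] + data[r ^ dist] for r in range(p)]
        dist *= 2
    return data  # every rank now holds the global sum

ranks = [np.arange(4.0) * (r + 1) for r in range(8)]
print(recursive_doubling_allreduce(ranks)[0])  # [0. 36. 72. 108.] on every rank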
Scalable Hierarchical Aggregation Protocol
Reliable, Scalable, General-Purpose Primitive, Applicable to Multiple Use Cases
▪ In-network tree-based aggregation mechanism
▪ Large number of groups
▪ Multiple simultaneous outstanding operations
▪ Streaming aggregation

Accelerating HPC applications
▪ Scalable high-performance collective offload
▪ Barrier, Reduce, All-Reduce, Broadcast
▪ Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
▪ Integer and floating point, 16 / 32 / 64 bit
▪ Up to 1KB payload size (in Quantum)
▪ Significantly reduce MPI collective runtime
▪ Increase CPU availability and efficiency
▪ Enable communication and computation overlap

Accelerating Machine Learning applications
▪ Proven for the many-to-one traffic pattern
▪ CUDA, GPUDirect RDMA
[Diagram: a SHArP tree rooted at the SHArP tree root, with aggregation nodes and end nodes as processes running on HCAs; an MPI allreduce sketch with SHARP enabled follows below]
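Below is a minimal sketch of the host-side view: an ordinary MPI allreduce with a small payload, which SHARP can offload when the job runs under a SHARP-enabled stack such as HPC-X (the launch line and the HCOLL_ENABLE_SHARP knob are assumptions about that stack; exact names vary by release, so consult its documentation). Note that the application code itself does not change.

# Launch idea (assumed): mpirun -x HCOLL_ENABLE_SHARP=3 python allreduce.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
payload = np.ones(128, dtype=np.float32)     # 512B, inside the 1KB Quantum limit above
result = np.empty_like(payload)
comm.Allreduce(payload, result, op=MPI.SUM)  # candidate for in-switch aggregation
if comm.Get_rank() == 0:
    print(result[0])                         # == number of ranks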
Scalable Hierarchical Aggregation Protocol
▪ The SHARP tree is a logical construct
▪ Nodes in the SHArP tree are IB end nodes
▪ The logical tree is defined on top of the physical underlying fabric
▪ SHArP tree links are implemented on top of the IB transport (Reliable Connection)
▪ Expected to follow the physical topology for performance, but not required

▪ SHARP operations are executed by a SHARP tree
▪ Multiple SHArP trees are supported
▪ Each SHArP tree can handle multiple outstanding SHArP operations
▪ Within a SHArP tree, each operation is uniquely identified by a SHArP-Tuple: GroupID and SequenceNumber
[Diagram: one SHArP tree overlaid on the physical topology of switches/routers and HCAs, with aggregation nodes and end nodes (processes running on HCAs) and the SHArP tree root; an illustrative model of the SHArP-Tuple bookkeeping follows below]
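The dataclass sketch below is an illustrative model written for this note (not Mellanox's implementation) of the bookkeeping the slide implies: an aggregation node tracks several outstanding operations at once, each keyed by its SHArP-Tuple of GroupID and SequenceNumber.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class SharpTuple:
    group_id: int         # which SHArP group the operation belongs to
    sequence_number: int  # orders outstanding operations within the group

@dataclass
class AggregationNode:
    pending: dict = field(default_factory=dict)  # SharpTuple -> (partial sum, count)

    def on_request(self, op, value, expected):
        # Accumulate one child's contribution; return the total once all arrive.
        acc, seen = self.pending.get(op, (0.0, 0))
        acc, seen = acc + value, seen + 1
        if seen == expected:          # all children reported: forward result upward
            self.pending.pop(op)
            return acc
        self.pending[op] = (acc, seen)
        return None

node = AggregationNode()
op = SharpTuple(group_id=7, sequence_number=1)
assert node.on_request(op, 1.0, expected=2) is None  # still waiting on one child
print(node.on_request(op, 2.0, expected=2))          # 3.0 once both children arrive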
SHARP Principles of Operation - Request
[Diagram: aggregation requests flow up from the end nodes toward the SHArP tree root]
SHARP Principles of Operation – Response
[Diagram: the aggregation response is distributed from the SHArP tree root back down to the end nodes; a toy walk-through of both phases follows below]
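The toy walk-through below (a sketch invented for this note, with hypothetical node names) traces the two phases on a small tree: requests are reduced on the way up to the root, and the response fans back down so every end node receives the same result.

def reduce_up(tree, node, values):
    # Phase 1: each node sums its children's aggregation requests, then forwards up.
    children = tree.get(node, [])
    if not children:
        return values[node]
    return sum(reduce_up(tree, c, values) for c in children)

def distribute_down(tree, node, result, out):
    # Phase 2: the aggregation response fans out from the root to all nodes.
    out[node] = result
    for c in tree.get(node, []):
        distribute_down(tree, c, result, out)

tree = {"root": ["sw0", "sw1"], "sw0": ["n0", "n1"], "sw1": ["n2", "n3"]}
values = {"n0": 1, "n1": 2, "n2": 3, "n3": 4}
total = reduce_up(tree, "root", values)   # 10 reaches the SHArP tree root
out = {}
distribute_down(tree, "root", total, out)
print(out["n3"])                          # every end node receives 10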
NCCL Ring
▪ Simple
▪ Linear latency
▪ Supported in NCCL-2.3 and earlier versions
▪ Multiple rings
[Diagram: GPUs linked in a ring spanning two switches]
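The simulation below (illustrative NumPy code, not NCCL's implementation) follows the standard ring algorithm: a reduce-scatter around the ring followed by an allgather. It is bandwidth-efficient, but its 2*(P-1) steps grow linearly with rank count, which is the linear-latency point above.

import numpy as np

def ring_allreduce(vectors):
    p = len(vectors)
    chunks = [np.array_split(v.astype(float), p) for v in vectors]
    for step in range(p - 1):                        # reduce-scatter: p-1 steps
        incoming = [chunks[r][(r - step) % p].copy() for r in range(p)]
        for r in range(p):
            chunks[(r + 1) % p][(r - step) % p] += incoming[r]
    for step in range(p - 1):                        # allgather: p-1 more steps
        for r in range(p):
            i = (r + 1 - step) % p                   # freshest fully reduced chunk
            chunks[(r + 1) % p][i] = chunks[r][i].copy()
    return [np.concatenate(c) for c in chunks]

ranks = [np.arange(8.0) + r for r in range(4)]
print(ring_allreduce(ranks)[0])  # identical on every rank: elementwise sum of the 4 inputs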
NCCL Tree
[Diagram: node leaders connected through NICs to the network fabric]
▪ Support added in NCCL-2.4
▪ Keeps the intra-node chain
▪ Node leaders participate in the tree
▪ Double binary tree
▪ Multiple rings -> multiple trees (a step-count comparison follows below)
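A quick back-of-the-envelope comparison (illustrative numbers only) shows why the move from rings to trees matters at scale: the ring's step count is linear in the number of nodes, while the tree's is logarithmic, and using two complementary binary trees preserves full bandwidth.

import math

def ring_steps(p):   # reduce-scatter + allgather around the ring
    return 2 * (p - 1)

def tree_steps(p):   # reduce up the tree, then broadcast back down
    return 2 * math.ceil(math.log2(p))

for p in (8, 64, 512, 4096):
    print(f"{p:5d} nodes: ring {ring_steps(p):5d} steps, tree {tree_steps(p):3d} steps")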
NCCL SHARP
[Diagram: node leaders connected through NICs to the network fabric, which performs the aggregation]
NCCL SHARP
▪ Collective network plugin
▪ Replaces the inter-node tree with a SHARP tree
▪ Keeps the intra-node ring
▪ Aggregation in the network switch
▪ Streaming from GPU memory with GPUDirect RDMA
▪ 2x bandwidth compared to NCCL-TREE
SHARP Enables 2X Higher Data Throughput for NCCL
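From the framework side nothing changes: the sketch below issues an ordinary PyTorch allreduce over the NCCL backend, and when the collective-network (SHARP) plugin is available NCCL can route the inter-node reduction through the switches. The NCCL_COLLNET_ENABLE knob and the presence of Mellanox's nccl-rdma-sharp plugin are assumptions about the runtime environment.

# Sketch: run under torchrun/mpirun on a SHARP-capable fabric (assumed setup).
import os
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_COLLNET_ENABLE", "1")  # assumes the SHARP plugin is installed
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())
t = torch.ones(1 << 20, device="cuda")
dist.all_reduce(t, op=dist.ReduceOp.SUM)           # NCCL allreduce, SHARP-eligible
print(rank, t[0].item())                           # == world size on every rank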
[Chart: NCCL Allreduce bandwidth (GB/s, 0-25) vs. message size (8-2048 MB) with one ring, comparing SHARP, Ring, and Tree]
SHARP AllReduce Performance Advantages (128 Nodes)
SHARP enables a 75% reduction in latency, providing scalable, flat latency.
SHARP AllReduce Performance Advantages 1500 Nodes, 60K MPI Ranks, Dragonfly+ Topology
SHARP enables the highest performance.
SHARP Performance – Application (OSU)
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
Source: Prof. DK Panda, Ohio State University
SHARP Performance Advantage for AI
▪ SHARP provides a 16% performance increase for deep learning (initial results)
▪ TensorFlow with Horovod running the ResNet50 benchmark over HDR InfiniBand (ConnectX-6, Quantum)
[Chart: initial results showing 16% and 11% gains]
P100 NVIDIA GPUs, RH 7.5, Mellanox OFED 4.4, HPC-X v2.3, TensorFlow v1.11, Horovod 0.15.0
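For context, the benchmark pattern looks roughly like the TF2-style Horovod sketch below (the deck used TensorFlow 1.11 and Horovod 0.15.0, so the exact scripts differed; this follows Horovod's current examples): Horovod's allreduce of the gradients is the traffic SHARP accelerates.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
gpus = tf.config.list_physical_devices("GPU")
if gpus:  # pin each worker to its own GPU
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.applications.ResNet50(weights=None)
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())  # scale the LR with worker count
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

@tf.function
def train_step(images, labels):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    tape = hvd.DistributedGradientTape(tape)      # gradients are allreduced here
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    return loss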
NCCL-SHARP Performance – DL Training
[Chart: ResNet50 training throughput in images/sec: NCCL 18,755 vs. NCCL-SHARP 20,456]
System configuration: (4) HPE Apollo 6500 systems configured with (8) NVIDIA Tesla V100 SXM2 16GB GPUs; (2) HPE DL360 Gen10 Intel Xeon-Gold 6134 (3.2 GHz / 8-core / 130 W) CPUs; (24) DDR4-2666 CAS-19-19-19 registered memory modules; HPE 1.6 TB NVMe SFF (2.5") SSD; ConnectX-6 HCA; InfiniBand Quantum switch (EDR speed); Ubuntu 16.04
Accelerating All Levels of HPC / AI Frameworks

Application
▪ Data Analysis
▪ Real Time
▪ Deep Learning

Communication
▪ Mellanox SHARP In-Network Computing
▪ MPI Tag Matching
▪ MPI Rendezvous
▪ Software Defined Virtual Devices

Network
▪ Network Transport Offload
▪ RDMA and GPUDirect RDMA
▪ SHIELD (Self-Healing Network)
▪ Enhanced Adaptive Routing and Congestion Control

Connectivity
▪ Multi-Host Technology
▪ Socket-Direct Technology
▪ Enhanced Topologies
Thank You