© 2018 Mellanox Technologies | Confidential
Mellanox – The Intelligent Interconnect Company | February 2018
Highest Performance and Scalability for HPC and AI
NVMe, NVMe Over Fabrics
Big Data Storage
Hadoop / Spark
Software Defined Storage
SQL & NoSQL Database
Deep Learning
HPC
GPUDirect
RDMA
MPI
NCCL, SHARP
Image & Video Recognition
Sentiment Analysis
Fraud & Flaw Detection
Voice Recognition & Search
Same Interconnect Technology Enables a Variety of Applications
Accelerating the Next Generation of HPC and AI
Summit CORAL System
Fastest Supercomputer in Japan
Fastest Supercomputer in Canada
InfiniBand Dragonfly+ Topology
Sierra CORAL System
Highest Performance with InfiniBand
Higher Data Speeds
Faster Data Processing
Better Data Security
Adapters, Switches, Cables & Transceivers, SmartNICs, Systems on a Chip
HPC and AI Need the Most Intelligent Interconnect
Cloud Big Data
Enterprise
Business Intelligence
HPC
Storage
Security
Machine Learning
Internet of Things
Exponential Data Growth Everywhere
Big Data? No… REALLY BIG DATA
The average data generated by a self-driving vehicle is expected to reach 40 TB for every eight hours of driving (this mostly applies to full-service fleet vehicles)
The Pratt & Whitney PW1000G engine has 5,000 sensors installed, generating about 10 GB of data per second; with an average 12-hour flight time, a single flight can produce up to 844 TB of data
Mellanox is the de-facto interconnect for deep learning deployments
What’s Important For ML & Big Data Networking?
High Bandwidth and Low Latency
- Massive amounts of data require thick data pipes
- NVIDIA DGX-1 uses 4 x 100G InfiniBand
- Port latency < 90 ns with Mellanox EDR InfiniBand
- Spectrum is the world's lowest-latency Ethernet switch
No Packet Loss! Mellanox Spectrum switches have ZERO packet loss
Offloads – Free Up the CPU!
- Stateless offloads
- RDMA
- GPUDirect RDMA, GPUDirect ASYNC
- Virtualization and container networking
GPUDirect™ RDMA
In-Network Computing – SHARP
Storage Acceleration
- RDMA
- NVMe over Fabrics acceleration
- Erasure coding acceleration
Security – IPsec, TLS offloads
Resiliency – robust end-to-end network solutions
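One of the offloads above, erasure coding, is easy to make concrete. The sketch below shows the single-parity XOR math that such hardware acceleration speeds up; it is a minimal illustration, and the function names are mine, not a Mellanox API.

```python
# Toy single-parity erasure code (RAID-5 style). Real adapters offload this
# math in hardware; here it is spelled out in plain Python for illustration.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def encode(data_blocks):
    """Append one parity block computed as the XOR of all data blocks."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def recover(stripe, lost_index):
    """Rebuild the block at lost_index by XOR-ing every surviving block."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)
```

Losing any one block of the stripe (including the parity block itself) is recoverable, because XOR-ing the survivors cancels everything except the missing block.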
Data Centric Architecture to Overcome Latency Bottlenecks
CPU-Centric (Onload): communication latencies of 30-40 µs
Data-Centric (Offload): communication latencies of 3-4 µs
Network In-Network Computing
Intelligent Interconnect Paves the Road to Exascale Performance
In-Network Computing – delivers highest application performance; 10X performance acceleration; critical for HPC and machine learning applications
GPUDirect™ RDMA – GPU acceleration technology; 10X performance acceleration; critical for HPC and machine learning applications
Self-Healing Technology – unbreakable data centers; 35X faster network recovery
In-Network Computing Delivers Highest Performance
30%-250% Higher Return on Investment
Up to 50% Saving on Capital and Operation Expenses
Highest Applications Performance, Scalability and Productivity
InfiniBand Delivers Best Return on Investment
Molecular Dynamics – 1.9X better
Chemistry – 2X better
Automotive – 1.4X better
Genomics – 2.5X better
Weather – 1.3X better
Cognitive Toolkit
RDMA Supercharges Leading AI Frameworks
Mellanox Supercharges Leading AI Companies
60% Higher ROI
50% Lower CapEx & OpEx
An Intelligent Network Unlocks the Power of AI
10X Higher Performance with GPUDirect™ RDMA Technology
GPUDirect™ RDMA
Purpose-built for Acceleration of Deep Learning
Lowest communication latency for acceleration devices
No unnecessary system memory copies and CPU overhead
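The two bullets above describe a shorter data path: without GPUDirect RDMA, data is staged through host memory before the NIC can send it; with it, the HCA reads GPU memory directly. A minimal sketch of the difference, counting host-memory copies (buffer names and functions are illustrative, not a real API):

```python
# Toy model of the send-side data path with and without GPUDirect RDMA.
# We model only the number of host-memory copies, not bandwidth or latency.

def staged_send(gpu_buf):
    """Classic path: GPU memory -> host bounce buffer -> NIC buffer.
    Two host-side copies, plus CPU cycles spent driving them."""
    host_buf = bytes(gpu_buf)   # copy 1: device-to-host staging
    nic_buf = bytes(host_buf)   # copy 2: host buffer to NIC buffer
    return nic_buf, 2

def gpudirect_send(gpu_buf):
    """GPUDirect RDMA path: the HCA reads GPU memory directly,
    so no host-memory staging copies are needed."""
    return bytes(gpu_buf), 0
```

Both paths deliver the same bytes; the difference is the staging work, which is exactly the "unnecessary system memory copies and CPU overhead" the slide refers to.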
GPUDirect™ ASYNC
GPUDirect RDMA (GPUDirect 3.0) – direct data path between the GPU and the Mellanox interconnect; the control path still uses the CPU:
- CPU prepares and queues communication tasks on the GPU
- GPU triggers communication on the HCA
- Mellanox HCA directly accesses GPU memory
GPUDirect ASYNC (GPUDirect 4.0) – both the data path and the control path go directly between the GPU and the Mellanox interconnect
Maximum Performance for GPU Clusters
Distributed Training
Training on large data sets can take a long time – in some cases weeks
In many cases training needs to happen frequently:
- Model development and tuning
- Real-life use cases may require regular retraining
Accelerate training time with a scale-out architecture – add workers (nodes) to reduce training time
Data parallelism is the solution
The network is a critical element in accelerating distributed training!
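The scale-out idea above is simple arithmetic: each worker trains on its own shard of the data, so per-worker steps shrink roughly in proportion to the number of workers. A minimal sketch (function names are illustrative):

```python
# Data parallelism: split the dataset across workers so each processes
# only its shard, reducing per-epoch steps roughly by 1/num_workers.

def shard(dataset, num_workers):
    """Round-robin split of a dataset across workers."""
    return [dataset[w::num_workers] for w in range(num_workers)]

def steps_per_epoch(samples, batch_size, num_workers):
    """Steps each worker runs per epoch on its shard (ceiling division)."""
    per_worker = -(-samples // num_workers)    # ceil(samples / num_workers)
    return -(-per_worker // batch_size)        # ceil(per_worker / batch_size)
```

For example, 1,000 samples at batch size 10 take 100 steps on one worker but only 25 steps per worker on four, which is why adding nodes reduces training time as long as the network can keep the workers synchronized.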
RDMA Accelerates Distributed Training
Data Parallelism Communication
- Workers to parameter servers, and parameter servers to workers
- Frequent
- Potentially high bandwidth
- Bursty
- Point-to-point and collectives
RDMA Improves Point to Point and Collective operations
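The worker/parameter-server exchange described above is, per step: workers push gradients, the server averages them and applies an update, and workers pull the new parameters. A minimal sketch of one synchronous round (names and the plain SGD update are illustrative assumptions, not a specific framework's API):

```python
# One synchronous parameter-server round for data-parallel training.
# This is the traffic pattern that RDMA accelerates: frequent, bursty
# pushes and pulls between every worker and the server.

def parameter_server_step(params, worker_grads, lr=0.1):
    """Average the gradients pushed by all workers, then apply a plain
    SGD update and return the parameters the workers will pull back."""
    n = len(worker_grads)
    avg = [sum(g[i] for g in worker_grads) / n for i in range(len(params))]
    return [p - lr * g for p, g in zip(params, avg)]
```

Every round moves a full gradient per worker to the server and a full parameter set back, which is why this pattern is both latency- and bandwidth-sensitive at scale.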
Model Parallelism
Model size is limited by the compute engine (a GPU, for example)
In some cases the model doesn't fit in the compute engine:
- Large models
- Small compute devices such as FPGAs
Model parallelism slices the model and runs each part on a different compute engine
Networking becomes a critical element
High bandwidth, low latency, and RDMA are mandatory
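The slicing described above can be sketched in a few lines: assign contiguous groups of layers to devices, and pass activations from one device's last layer to the next device's first. Between real devices that hand-off crosses the network, which is why its bandwidth and latency are mandatory. All names here are illustrative:

```python
# Toy model parallelism: a "model" is a list of layer functions, sliced
# into contiguous partitions, one partition per device.

def make_model():
    """A tiny four-layer 'model' of simple arithmetic layers."""
    return [lambda x: x * 2, lambda x: x + 3, lambda x: x ** 2, lambda x: x - 1]

def slice_model(layers, num_devices):
    """Assign contiguous groups of layers to devices (ceiling-sized chunks)."""
    k = -(-len(layers) // num_devices)
    return [layers[i:i + k] for i in range(0, len(layers), k)]

def forward(partitions, x):
    """Run the forward pass; each partition boundary is a network hop."""
    for device_layers in partitions:
        for layer in device_layers:
            x = layer(x)           # the last activation is "sent" to the next device
    return x
```

The sliced model computes exactly what the unsliced one would; only the placement (and hence the inter-device traffic) changes.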
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) – a reliable, scalable, general-purpose primitive
In-network tree-based aggregation mechanism:
- Large number of groups
- Multiple simultaneous outstanding operations
Applicable to multiple use cases:
- HPC applications using MPI / SHMEM
- Distributed machine learning applications
Scalable high-performance collective offload:
- Barrier, Reduce, All-Reduce, Broadcast, and more
- Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
- Integer and floating-point, 16/32/64/128 bits
Diagram: a SHARP tree, with endnodes (processes running on HCAs) feeding aggregation nodes, up to the SHARP tree root.
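The shape of SHARP's tree-based aggregation can be sketched in software: values are summed pairwise up the tree (each level standing in for the switch aggregation nodes), and the root's result is broadcast back to every endnode. This is a minimal host-side illustration of the allreduce pattern, not SHARP's in-switch implementation:

```python
# Toy tree allreduce (sum): reduce values pairwise up a binary tree,
# then broadcast the root's total back to all endnodes. In SHARP the
# per-level sums happen inside the switches, not on the hosts.

def tree_allreduce(values):
    """Return the all-reduced (summed) value, replicated per endnode."""
    level = list(values)
    while len(level) > 1:                 # reduce up the tree
        nxt = []
        for i in range(0, len(level), 2):
            nxt.append(sum(level[i:i + 2]))   # an aggregation node sums its children
        level = nxt
    total = level[0]                      # the tree root holds the result
    return [total] * len(values)          # broadcast down to every endnode
```

Because each level halves the number of partial results, the reduction takes a logarithmic number of hops, which is what gives SHARP its flat latency as node counts grow.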
SHARP Allreduce Performance Advantages
SHARP Enables 75% Reduction in Latency
Providing Scalable Flat Latency
Performs the gradient averaging in the network – removes the need for a physical parameter server, and with it all parameter-server overhead
Mellanox SHARP Technology Accelerates AI
The CPU in a parameter server quickly becomes the bottleneck (at roughly 4 nodes)
Mellanox Accelerates TensorFlow with RDMA
Unmatched Linear Scalability at No Additional Cost
50% Better Performance
Mellanox Accelerates TensorFlow (Advanced Verbs)
10GbE is not enough for large-scale models
6.5X faster training with 100GbE (chart shows 2.5X and 6.5X speedups)
Mellanox Accelerates NVIDIA NCCL 2.0
50% performance improvement with NVIDIA® DGX-1 across 32 NVIDIA Tesla V100 GPUs, using InfiniBand RDMA and GPUDirect™ RDMA
Questions?