© 2018 Mellanox Technologies | Confidential
Mellanox – The Intelligent Interconnect Company | February 2018
Highest Performance and Scalability for HPC and AI
NVMe, NVMe Over Fabrics
Big Data Storage
Hadoop / Spark
Software Defined Storage
SQL & NoSQL Database
Deep Learning
HPC
GPUDirect
RDMA
MPI
NCCL, SHARP
Image & Video Recognition
Sentiment Analysis
Fraud & Flaw Detection
Voice Recognition & Search
Same Interconnect Technology Enables a Variety of Applications
Accelerating the Next Generation of HPC and AI
Summit CORAL System
Fastest Supercomputer in Japan
Fastest Supercomputer in Canada
InfiniBand Dragonfly+ Topology
Sierra CORAL System
Highest Performance with InfiniBand
Higher Data Speeds
Faster Data Processing
Better Data Security
Adapters, Switches, Cables & Transceivers, SmartNICs, Systems on a Chip
HPC and AI Need the Most Intelligent Interconnect
Cloud Big Data
Enterprise
Business Intelligence
HPC
Storage
Security
Machine Learning
Internet of Things
Exponential Data Growth Everywhere
Big Data? No… REALLY BIG DATA
The average data generated by a self-driving vehicle is expected to reach 40 TB for every eight hours of driving (this mostly applies to full-service fleet vehicles)
The Pratt & Whitney PW1000G engine has 5,000 sensors installed, generating about 10 GB of data per second; with an average 12-hour flight time, a single flight can produce up to 844 TB of data
Mellanox is the de-facto interconnect for deep learning deployments
What’s Important For ML & Big Data Networking?
High Bandwidth and Low Latency
- Massive amounts of data require thick data pipes
- NVIDIA DGX-1 uses 4 x 100G InfiniBand
- Port latency < 90 ns with Mellanox EDR InfiniBand
- Spectrum is the world's lowest-latency Ethernet switch
No Packet Loss! Mellanox Spectrum switches have ZERO packet loss
Offloads – Free Up the CPU!
- Stateless offloads
- RDMA
- GPUDirect RDMA, GPUDirect ASYNC
- Virtualization and container networking
GPUDirect™ RDMA
In-Network Computing – SHARP
Storage Acceleration
- RDMA
- NVMe over Fabrics acceleration
- Erasure coding acceleration
Security – IPsec, TLS offloads
Resiliency – robust end-to-end network solutions
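One of the offloads above, erasure coding, is easy to make concrete. The sketch below shows the single-parity XOR math that such hardware acceleration speeds up; it is a minimal illustration, and the function names are mine, not a Mellanox API.

```python
# Toy single-parity erasure code (RAID-5 style). Real adapters offload this
# math in hardware; here it is spelled out in plain Python for illustration.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def encode(data_blocks):
    """Append one parity block computed as the XOR of all data blocks."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def recover(stripe, lost_index):
    """Rebuild the block at lost_index by XOR-ing every surviving block."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)
```

Losing any one block of the stripe (including the parity block itself) is recoverable, because XOR-ing the survivors cancels everything except the missing block.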
Data Centric Architecture to Overcome Latency Bottlenecks
CPU-Centric (Onload): communication latencies of 30-40 µs
Data-Centric (Offload): communication latencies of 3-4 µs
Network In-Network Computing
Intelligent Interconnect Paves the Road to Exascale Performance
In-Network Computing – delivers highest application performance; 10X performance acceleration; critical for HPC and machine learning applications
GPUDirect™ RDMA – GPU acceleration technology; 10X performance acceleration; critical for HPC and machine learning applications
Self-Healing Technology – unbreakable data centers; 35X faster network recovery
In-Network Computing Delivers Highest Performance
30%-250% Higher Return on Investment
Up to 50% Saving on Capital and Operation Expenses
Highest Applications Performance, Scalability and Productivity
InfiniBand Delivers Best Return on Investment
Molecular Dynamics – 1.9X better
Chemistry – 2X better
Automotive – 1.4X better
Genomics – 2.5X better
Weather – 1.3X better
Cognitive Toolkit
RDMA Supercharges Leading AI Frameworks
Mellanox Supercharges Leading AI Companies
60% Higher ROI
50% Lower CapEx & OpEx
An Intelligent Network Unlocks the Power of AI
10X Higher Performance with GPUDirect™ RDMA Technology
GPUDirect™ RDMA
Purpose-built for Acceleration of Deep Learning
Lowest communication latency for acceleration devices
No unnecessary system memory copies and CPU overhead
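The two bullets above describe a shorter data path: without GPUDirect RDMA, data is staged through host memory before the NIC can send it; with it, the HCA reads GPU memory directly. A minimal sketch of the difference, counting host-memory copies (buffer names and functions are illustrative, not a real API):

```python
# Toy model of the send-side data path with and without GPUDirect RDMA.
# We model only the number of host-memory copies, not bandwidth or latency.

def staged_send(gpu_buf):
    """Classic path: GPU memory -> host bounce buffer -> NIC buffer.
    Two host-side copies, plus CPU cycles spent driving them."""
    host_buf = bytes(gpu_buf)   # copy 1: device-to-host staging
    nic_buf = bytes(host_buf)   # copy 2: host buffer to NIC buffer
    return nic_buf, 2

def gpudirect_send(gpu_buf):
    """GPUDirect RDMA path: the HCA reads GPU memory directly,
    so no host-memory staging copies are needed."""
    return bytes(gpu_buf), 0
```

Both paths deliver the same bytes; the difference is the staging work, which is exactly the "unnecessary system memory copies and CPU overhead" the slide refers to.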
GPUDirect™ ASYNC
GPUDirect RDMA (GPUDirect 3.0) – direct data path between the GPU and the Mellanox interconnect; the control path still uses the CPU:
- CPU prepares and queues communication tasks on the GPU
- GPU triggers communication on the HCA
- Mellanox HCA directly accesses GPU memory
GPUDirect ASYNC (GPUDirect 4.0) – both the data path and the control path go directly between the GPU and the Mellanox interconnect
Maximum Performance for GPU Clusters
Distributed Training
Training on large data sets can take a long time – in some cases weeks
In many cases training needs to happen frequently:
- Model development and tuning
- Real-life use cases may require regular retraining
Accelerate training time with a scale-out architecture – add workers (nodes) to reduce training time
Data parallelism is the solution
The network is a critical element in accelerating distributed training!
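The scale-out idea above is simple arithmetic: each worker trains on its own shard of the data, so per-worker steps shrink roughly in proportion to the number of workers. A minimal sketch (function names are illustrative):

```python
# Data parallelism: split the dataset across workers so each processes
# only its shard, reducing per-epoch steps roughly by 1/num_workers.

def shard(dataset, num_workers):
    """Round-robin split of a dataset across workers."""
    return [dataset[w::num_workers] for w in range(num_workers)]

def steps_per_epoch(samples, batch_size, num_workers):
    """Steps each worker runs per epoch on its shard (ceiling division)."""
    per_worker = -(-samples // num_workers)    # ceil(samples / num_workers)
    return -(-per_worker // batch_size)        # ceil(per_worker / batch_size)
```

For example, 1,000 samples at batch size 10 take 100 steps on one worker but only 25 steps per worker on four, which is why adding nodes reduces training time as long as the network can keep the workers synchronized.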
RDMA Accelerates Distributed Training
Data Parallelism Communication
- Workers to parameter servers, and parameter servers to workers
- Frequent
- Potentially high bandwidth
- Bursty
- Point-to-point and collectives
RDMA Improves Point to Point and Collective operations
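The worker/parameter-server exchange described above is, per step: workers push gradients, the server averages them and applies an update, and workers pull the new parameters. A minimal sketch of one synchronous round (names and the plain SGD update are illustrative assumptions, not a specific framework's API):

```python
# One synchronous parameter-server round for data-parallel training.
# This is the traffic pattern that RDMA accelerates: frequent, bursty
# pushes and pulls between every worker and the server.

def parameter_server_step(params, worker_grads, lr=0.1):
    """Average the gradients pushed by all workers, then apply a plain
    SGD update and return the parameters the workers will pull back."""
    n = len(worker_grads)
    avg = [sum(g[i] for g in worker_grads) / n for i in range(len(params))]
    return [p - lr * g for p, g in zip(params, avg)]
```

Every round moves a full gradient per worker to the server and a full parameter set back, which is why this pattern is both latency- and bandwidth-sensitive at scale.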
Model Parallelism
Model size is limited by the compute engine (a GPU, for example)
In some cases the model doesn't fit in the compute engine:
- Large models
- Small compute devices such as FPGAs
Model parallelism slices the model and runs each part on a different compute engine
Networking becomes a critical element
High bandwidth, low latency, and RDMA are mandatory
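The slicing described above can be sketched in a few lines: assign contiguous groups of layers to devices, and pass activations from one device's last layer to the next device's first. Between real devices that hand-off crosses the network, which is why its bandwidth and latency are mandatory. All names here are illustrative:

```python
# Toy model parallelism: a "model" is a list of layer functions, sliced
# into contiguous partitions, one partition per device.

def make_model():
    """A tiny four-layer 'model' of simple arithmetic layers."""
    return [lambda x: x * 2, lambda x: x + 3, lambda x: x ** 2, lambda x: x - 1]

def slice_model(layers, num_devices):
    """Assign contiguous groups of layers to devices (ceiling-sized chunks)."""
    k = -(-len(layers) // num_devices)
    return [layers[i:i + k] for i in range(0, len(layers), k)]

def forward(partitions, x):
    """Run the forward pass; each partition boundary is a network hop."""
    for device_layers in partitions:
        for layer in device_layers:
            x = layer(x)           # the last activation is "sent" to the next device
    return x
```

The sliced model computes exactly what the unsliced one would; only the placement (and hence the inter-device traffic) changes.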
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) – a reliable, scalable, general-purpose primitive
In-network tree-based aggregation mechanism:
- Large number of groups
- Multiple simultaneous outstanding operations
Applicable to multiple use cases:
- HPC applications using MPI / SHMEM
- Distributed machine learning applications
Scalable high-performance collective offload:
- Barrier, Reduce, All-Reduce, Broadcast, and more
- Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
- Integer and floating-point, 16/32/64/128 bits
Diagram: a SHARP tree, with endnodes (processes running on HCAs) feeding aggregation nodes, up to the SHARP tree root.
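The shape of SHARP's tree-based aggregation can be sketched in software: values are summed pairwise up the tree (each level standing in for the switch aggregation nodes), and the root's result is broadcast back to every endnode. This is a minimal host-side illustration of the allreduce pattern, not SHARP's in-switch implementation:

```python
# Toy tree allreduce (sum): reduce values pairwise up a binary tree,
# then broadcast the root's total back to all endnodes. In SHARP the
# per-level sums happen inside the switches, not on the hosts.

def tree_allreduce(values):
    """Return the all-reduced (summed) value, replicated per endnode."""
    level = list(values)
    while len(level) > 1:                 # reduce up the tree
        nxt = []
        for i in range(0, len(level), 2):
            nxt.append(sum(level[i:i + 2]))   # an aggregation node sums its children
        level = nxt
    total = level[0]                      # the tree root holds the result
    return [total] * len(values)          # broadcast down to every endnode
```

Because each level halves the number of partial results, the reduction takes a logarithmic number of hops, which is what gives SHARP its flat latency as node counts grow.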
SHARP Allreduce Performance Advantages
SHARP Enables 75% Reduction in Latency
Providing Scalable Flat Latency
Performs the gradient averaging in the network – removes the need for a physical parameter server, and with it all parameter-server overhead
Mellanox SHARP Technology Accelerates AI
The CPU in a parameter server quickly becomes the bottleneck (at roughly 4 nodes)
Mellanox Accelerates TensorFlow with RDMA
Unmatched Linear Scalability at No Additional Cost
50% Better Performance
Mellanox Accelerates TensorFlow (Advanced Verbs)
10GbE is not enough for large-scale models
6.5X faster training with 100GbE (chart shows 2.5X and 6.5X speedups)
Mellanox Accelerates NVIDIA NCCL 2.0
50% performance improvement with NVIDIA® DGX-1 across 32 NVIDIA Tesla V100 GPUs, using InfiniBand RDMA and GPUDirect™ RDMA
Questions?