
Co-Design Architecture for Exascale


Page 1: Co-Design Architecture for Exascale

Dror Goldenberg, March 2016, HPCAC Swiss

Co-Design Architecture: Emergence of New Co-Processors

Page 2: Co-Design Architecture for Exascale


Co-Design Architecture to Enable Exascale Performance

CPU-Centric: limited to main CPU usage, which results in a performance limitation

Co-Design: creating synergies across In-CPU, In-Network and In-Storage Computing enables higher performance and scale

Page 3: Co-Design Architecture for Exascale


The Intelligence is Moving to the Interconnect

Past: intelligence in the CPU. Future: intelligence in the interconnect.

Page 4: Co-Design Architecture for Exascale


Intelligent Interconnect Delivers Higher Datacenter ROI

Network functions on the CPU: computing cycles are spent on the network instead of on users' applications

Smart network: the network offloads computing for applications and increases datacenter value

Page 5: Co-Design Architecture for Exascale


Breaking the Application Latency Wall

§ Today: Network device latencies are on the order of 100 nanoseconds

§ Challenge: Enabling the next order of magnitude improvement in application performance

§ Solution: Creating synergies between software and hardware – intelligent interconnect

Intelligent Interconnect Paves the Road to Exascale Performance

10 years ago:  Network ~10 microseconds,   Communication Framework ~100 microseconds
Today:         Network ~0.1 microseconds,  Communication Framework ~10 microseconds
Future:        Co-Design Network ~0.05 microseconds, Communication Framework ~1 microsecond

Page 6: Co-Design Architecture for Exascale


Introducing Switch-IB 2 World’s First Smart Switch

Page 7: Co-Design Architecture for Exascale


Introducing Switch-IB 2 World’s First Smart Switch

§ The world's fastest switch with <90 nanosecond latency

§ 36 ports, 100Gb/s per port, 7.2Tb/s throughput, 7.02 billion messages/sec (see the arithmetic note below)

§ Adaptive Routing, Congestion control, support for multiple topologies
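As a quick consistency check on the figures above (assuming, as is common for switch specifications, that the 7.2Tb/s aggregate counts both directions of every port):

```latex
% Aggregate message rate: per-port rate times port count
36 \times 195\,\mathrm{M\ msg/s} \approx 7.02\ \mathrm{billion\ msg/s}
% Aggregate throughput: 36 ports at 100\,Gb/s, counted in both directions
36 \times 100\,\mathrm{Gb/s} \times 2 = 7.2\,\mathrm{Tb/s}
```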

World’s First Smart Switch

Built for Scalable Compute and Storage Infrastructures

10X Higher Performance with the New Switch SHArP Technology

Page 8: Co-Design Architecture for Exascale


SHArP (Scalable Hierarchical Aggregation Protocol) Technology

Delivering 10X Performance Improvement for MPI and SHMEM/PGAS Communications

Switch-IB 2 Enables the Switch Network to Operate as a Co-Processor

SHArP Enables Switch-IB 2 to Manage and Execute MPI Operations in the Network
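To make the offload concrete, below is a minimal MPI sketch of the kind of collective SHArP targets; nothing in it is Mellanox-specific, and the file name and data sizes are illustrative. On a SHArP-enabled fabric the reduction inside MPI_Allreduce is executed by the switches rather than by the host CPUs, with no change to the application code.

```c
/* allreduce_demo.c - the collective pattern SHArP accelerates.
 * Build: mpicc allreduce_demo.c -o allreduce_demo
 * Run:   mpirun -np <N> ./allreduce_demo
 * With a SHArP-capable fabric the sum below is computed in the
 * switch network; the application code itself does not change. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nranks;
    double local[4], global[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Each rank contributes a small vector (e.g. partial residuals). */
    for (int i = 0; i < 4; i++)
        local[i] = rank + i * 0.1;

    /* Global sum across all ranks - the operation offloaded to the
     * switch network when SHArP is enabled. */
    MPI_Allreduce(local, global, 4, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global[0] = %f across %d ranks\n", global[0], nranks);

    MPI_Finalize();
    return 0;
}
```

In practice the offload is selected through the MPI library's runtime configuration rather than through application changes.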

Page 9: Co-Design Architecture for Exascale


SHArP Performance Advantage

§ MiniFE is a Finite Element mini-application
•  Implements kernels that represent implicit finite-element applications

10X to 25X Performance Improvement

AllReduce MPI Collective

Page 10: Co-Design Architecture for Exascale


The Intelligence is Moving to the Interconnect

Communication Frameworks (MPI, SHMEM/PGAS)

The Only Approach to Deliver 10X Performance Improvements

[Figure: applications and communication frameworks layered over interconnect offloads - Transport, RDMA, SR-IOV, Collectives, Peer-Direct, GPUDirect, MPI/SHMEM offloads and more, rolling out across Q1'16 and Q3'16]

Page 11: Co-Design Architecture for Exascale


Multi-Host Socket Direct™ – Low Latency Socket Communication

§ Each CPU with direct network access

§ QPI avoidance for I/O improves performance

§ Enables GPU / peer direct on both sockets

§ Solution is transparent to software


Multi-Host Socket Direct Performance

50% Lower CPU Utilization

20% lower Latency

Multi Host Evaluation Kit

Lower Application Latency, Free-up CPU

Page 12: Co-Design Architecture for Exascale


Introducing ConnectX-4 Lx Programmable Adapter

Scalable, Efficient, High-Performance and Flexible Solution

Security

Cloud/Virtualization

Storage

High Performance Computing

Precision Time Synchronization

Networking + FPGA

Mellanox Acceleration Engines and FPGA Programmability on One Adapter

Page 13: Co-Design Architecture for Exascale


Mellanox InfiniBand Proven and Most Scalable HPC Interconnect

“Summit” System “Sierra” System

Paving the Road to Exascale

Page 14: Co-Design Architecture for Exascale


NCAR-Wyoming Supercomputing Center (NWSC) – “Cheyenne”

§ Cheyenne supercomputer system

§ 5.34-petaflop SGI ICE XA Cluster

§  Intel “Broadwell” processors

§ More than 4K compute nodes

§ Mellanox EDR InfiniBand interconnect

§ Mellanox Unified Fabric Manager

§ Partial 9D Enhanced Hypercube interconnect topology

§ DDN SFA14KX systems

§ 20 petabytes of usable file system space

§  IBM GPFS (General Parallel File System)

Page 15: Co-Design Architecture for Exascale


High-Performance Designed 100Gb/s Interconnect Solutions

Cables and transceivers: active optical and copper cables (10 / 25 / 40 / 50 / 56 / 100Gb/s), VCSELs, silicon photonics and copper

InfiniBand switch: 36 EDR (100Gb/s) ports, <90ns latency, 7.2Tb/s throughput, 7.02 billion msg/sec (195M msg/sec/port)

Adapters: 100Gb/s, 0.7us latency, 150 million messages per second (10 / 25 / 40 / 50 / 56 / 100Gb/s)

Ethernet switch: 32 100GbE ports or 64 25/50GbE ports (10 / 25 / 40 / 50 / 100GbE), 6.4Tb/s throughput

Page 16: Co-Design Architecture for Exascale


Leading Supplier of End-to-End Interconnect Solutions

Enabling the Use of Data: Store and Analyze

Comprehensive End-to-End InfiniBand and Ethernet (VPI) Portfolio: Software, ICs, Switches/Gateways, Adapter Cards, Cables/Modules, Metro/WAN, NPU & Multicore (NPS, TILE)

Page 17: Co-Design Architecture for Exascale


The Performance Advantage of EDR 100G InfiniBand (28-80%)


Page 18: Co-Design Architecture for Exascale


End-to-End Interconnect Solutions for All Platforms

Highest Performance and Scalability for

X86, Power, GPU, ARM and FPGA-based Compute and Storage Platforms

10, 20, 25, 40, 50, 56 and 100Gb/s Speeds


Smart Interconnect to Unleash The Power of All Compute Architectures

Page 19: Co-Design Architecture for Exascale


Technology Roadmap – One-Generation Lead over the Competition

[Roadmap timeline, 2000 to 2020: 20G, 40G, 56G, 100G (2015), 200G and Mellanox 400G, spanning terascale, petascale and exascale; milestones include the Virginia Tech (Apple) system, 3rd on the TOP500 in 2003, and the Mellanox Connected "Roadrunner", 1st on the TOP500]

Page 20: Co-Design Architecture for Exascale


§ Transparent InfiniBand integration into OpenStack
•  Since the Havana release

§ RDMA directly from the VM via SR-IOV
§ MAC to GUID mapping
§ VLAN to pkey mapping
§ InfiniBand SDN network

§  Ideal fit for High Performance Computing Clouds

OpenStack Over InfiniBand – Extreme Performance in the Cloud

InfiniBand Enables The Highest Performance and Efficiency

Page 21: Co-Design Architecture for Exascale


§ Mellanox end to end
•  Mellanox ConnectX-4 NIC family, Switch-IB/Spectrum switches and 25/100Gb/s cables

§ Bringing 100Gb/s speeds to the cloud with minimal CPU utilization
•  Both VMs and hypervisors
•  Accelerations are critical to reach line rate: SR-IOV, RDMA, etc.

25, 50 And 100Gb/s Clouds Are Here!

92.412 Gb/s throughput at 0.71% CPU utilization

Page 22: Co-Design Architecture for Exascale


The Next Generation HPC Software Framework To Meet the Needs of Future Systems / Applications

Unified Communication – X Framework (UCX)

Page 23: Co-Design Architecture for Exascale


Exascale Co-Design Collaboration

A Collaborative Effort of Industry, National Laboratories and Academia

The Next Generation

HPC Software Framework

Page 24: Co-Design Architecture for Exascale


A Collaboration Effort

§ Mellanox co-designs network interface and contributes MXM technology
•  Infrastructure, transport, shared memory, protocols, integration with OpenMPI/SHMEM, MPICH

§ ORNL co-designs network interface and contributes UCCS project
•  InfiniBand optimizations, Cray devices, shared memory

§ NVIDIA co-designs high-quality support for GPU devices
•  GPUDirect, GDR copy, etc.

§ IBM co-designs network interface and contributes ideas and concepts from PAMI

§ UH/UTK focus on integration with their research platforms

Page 25: Co-Design Architecture for Exascale


Mellanox HPC-X™ Scalable HPC Software Toolkit

§ Complete MPI, PGAS/OpenSHMEM and UPC package (see the OpenSHMEM sketch below)

§ Maximize application performance

§ For commercial and open source applications

§ Based on UCX (Unified Communication – X Framework)
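As a flavor of the PGAS side of the package, here is a minimal OpenSHMEM sketch; it uses only standard OpenSHMEM 1.2 calls, and the compiler and launcher names are the usual Open MPI-based ones rather than anything HPC-X-specific.

```c
/* shmem_put_demo.c - one-sided put into a symmetric variable.
 * Build (with an OpenSHMEM implementation such as the one bundled in HPC-X):
 *   oshcc shmem_put_demo.c -o shmem_put_demo
 * Run:  oshrun -np 2 ./shmem_put_demo */
#include <shmem.h>
#include <stdio.h>

/* Symmetric variable: exists at the same address on every PE. */
static long counter = 0;

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* PE 0 writes directly into PE 1's memory; the transfer is a
     * one-sided RDMA operation, no receive call is needed. */
    if (me == 0 && npes > 1) {
        long value = (long)npes;
        shmem_long_put(&counter, &value, 1, 1);
    }

    shmem_barrier_all();  /* ensure the put has completed and is visible */

    if (me == 1)
        printf("PE %d sees counter = %ld\n", me, counter);

    shmem_finalize();
    return 0;
}
```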

Page 26: Co-Design Architecture for Exascale


Mellanox Delivers Highest MPI (HPC-X) Performance

Enabling the Highest Application Scalability and Performance

Mellanox ConnectX-4 Collectives Offload

Page 27: Co-Design Architecture for Exascale


Mellanox Delivers Highest Applications Performance (HPC-X)

§ Quantum Espresso application

Quantum Espresso runtime, Intel MPI vs. Bull MPI (HPC-X):

Test case   #nodes   Intel MPI time (s)   Bull MPI (HPC-X) time (s)   Gain
A           43       584                  368                         37%
B           196      2592                 998                         61%

Enabling the Highest Application Scalability and Performance

Page 28: Co-Design Architecture for Exascale


Maximize Performance via Accelerator and GPU Offloads

GPUDirect RDMA Technology

Page 29: Co-Design Architecture for Exascale


GPUs are Everywhere!

GPUDirect RDMA / Sync


Page 30: Co-Design Architecture for Exascale


§ Eliminates CPU bandwidth and latency bottlenecks
§ Uses remote direct memory access (RDMA) transfers between GPUs
§ Results in significantly improved MPI efficiency between GPUs in remote nodes
§ Based on PCIe PeerDirect technology (a minimal usage sketch follows below)

GPUDirect™ RDMA (GPUDirect 3.0)

With GPUDirect™ RDMA Using PeerDirect™
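From the application's perspective, the feature typically surfaces through a CUDA-aware MPI: device pointers are passed straight to MPI calls and the HCA moves GPU memory directly over the network. A minimal sketch, assuming a CUDA-aware MPI build and a GPU per rank (buffer size and tag are illustrative):

```c
/* gdr_pingpong.c - passing GPU buffers directly to MPI.
 * Requires a CUDA-aware MPI and two ranks, each with access to a GPU.
 * Build: mpicc gdr_pingpong.c -lcudart -o gdr_pingpong */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define N (1 << 20)   /* 1M floats, illustrative size */

int main(int argc, char **argv)
{
    int rank;
    float *dbuf;      /* device memory */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaSetDevice(0);
    cudaMalloc((void **)&dbuf, N * sizeof(float));
    cudaMemset(dbuf, 0, N * sizeof(float));

    /* The device pointer goes straight into MPI; with GPUDirect RDMA
     * the HCA moves the data without staging it in host memory. */
    if (rank == 0)
        MPI_Send(dbuf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(dbuf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("sent %d floats directly from GPU memory\n", N);

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}
```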

Page 31: Co-Design Architecture for Exascale


Mellanox GPUDirect RDMA Performance Advantage

§ HOOMD-blue is a general-purpose Molecular Dynamics simulation code accelerated on GPUs

§ GPUDirect RDMA allows direct peer-to-peer GPU communications over InfiniBand
•  Unlocks performance between GPU and InfiniBand
•  Provides a significant decrease in GPU-GPU communication latency
•  Provides complete CPU offload from all GPU communications across the network

102% Higher Performance: 2X Application Performance!

Page 32: Co-Design Architecture for Exascale


GPUDirect Sync (GPUDirect 4.0)

§ GPUDirect RDMA (3.0) – direct data path between the GPU and Mellanox interconnect
•  Control path still uses the CPU
-  CPU prepares and queues communication tasks on GPU
-  GPU triggers communication on HCA
-  Mellanox HCA directly accesses GPU memory

§ GPUDirect Sync (GPUDirect 4.0)
•  Both data path and control path go directly between the GPU and the Mellanox interconnect

[Chart: 2D stencil benchmark, average time per iteration (us) vs. number of nodes/GPUs (2 and 4); RDMA+PeerSync is 27% and 23% faster than RDMA only]

Maximum Performance For GPU Clusters

Page 33: Co-Design Architecture for Exascale


Remote GPU Access through rCUDA

[Diagram: the client side runs the CUDA application on top of the rCUDA library; the server side (GPU servers, GPU-as-a-Service) runs the rCUDA daemon on top of the CUDA driver and runtime; the two sides communicate over the network interface, and each client node sees virtual GPUs (vGPUs) backed by the pooled physical GPUs]

rCUDA provides remote access from every node to any GPU in the system

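Because rCUDA interposes at the CUDA runtime library level, an unmodified program sees the remote GPUs as if they were local. A minimal sketch using standard CUDA runtime calls only; how the client is pointed at the remote GPU servers is part of rCUDA's configuration and is not shown here.

```c
/* list_gpus.c - unmodified CUDA runtime code; under rCUDA the devices
 * reported here can live on remote GPU servers reached over the network.
 * Build (linking the CUDA runtime, or the rCUDA client library in its place):
 *   cc list_gpus.c -lcudart -o list_gpus */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        fprintf(stderr, "no CUDA runtime available\n");
        return 1;
    }

    for (int i = 0; i < count; i++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        /* With rCUDA, "device i" may be a virtual GPU backed by a
         * remote server; the application cannot tell the difference. */
        printf("device %d: %s, %zu MB\n", i, prop.name,
               prop.totalGlobalMem >> 20);
    }
    return 0;
}
```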

Page 34: Co-Design Architecture for Exascale


Interconnect Architecture Comparison

Offload versus Onload (Non-Offload)

Page 35: Co-Design Architecture for Exascale


Offload versus Onload (Non-Offload)

§ Two interconnect architectures exist – Offload-based and Onload-based

§ Offload architecture
•  The interconnect manages and executes all network operations
•  The interconnect can include application acceleration engines
•  Offloads the CPU, freeing CPU cycles for the applications (see the overlap sketch below)
•  Development requires a large R&D investment
•  Higher data center ROI

§ Onload architecture
•  A CPU-centric approach: everything must be executed on and by the CPU
•  The CPU is responsible for all network functions; the interconnect only pushes the data onto the wire
•  Cannot support acceleration engines, no support for RDMA, and the network transport is done by the CPU
•  Loads the CPU, reducing the CPU cycles available to the applications
•  Does not require R&D investment or interconnect expertise
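The practical payoff of offloading is communication/computation overlap: once the transfer has been handed to the interconnect, the CPU cycles between posting and completion are free for the application. A minimal sketch of that pattern with standard non-blocking MPI (the buffer size and the compute stub are illustrative):

```c
/* overlap_demo.c - overlapping computation with an offloaded transfer.
 * Build: mpicc overlap_demo.c -o overlap_demo ; run with exactly 2 ranks. */
#include <mpi.h>
#include <stdio.h>

#define N 1000000

static double do_local_work(double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i] * 1.0001;   /* stand-in for real computation */
    return s;
}

int main(int argc, char **argv)
{
    static double sendbuf[N], recvbuf[N], work[N];
    int rank;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;   /* assumes exactly 2 ranks */

    /* Post the exchange, then compute while the interconnect
     * (not the CPU) moves the data. */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    double s = do_local_work(work, N);   /* overlapped computation */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    if (rank == 0)
        printf("local work result %f, exchange complete\n", s);

    MPI_Finalize();
    return 0;
}
```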

Page 36: Co-Design Architecture for Exascale


Sandia National Laboratory Paper – Offloading versus Onloading

Page 37: Co-Design Architecture for Exascale


Interconnect Throughput – Offload versus Onload

The Offloading Advantage!

Network Performance Dramatically Depends on CPU Frequency!

Data Throughput:

20% Higher at common Xeon Frequency

250% Higher at common Xeon Phi Frequency

Common Xeon Frequency 2.6GHz

Common Xeon Phi Frequency ~1GHz

Page 38: Co-Design Architecture for Exascale


Only Offload Architecture Can Enable Co-Processors

Offloading (Highest Performance for all Frequencies)

Onloading (performance loss with lower CPU frequency)

Common Xeon Frequency

Common Xeon Phi Frequency

Onloading Technology Not Suitable for Co-Processors!

Page 39: Co-Design Architecture for Exascale


Mellanox InfiniBand Leadership Over Omni-Path

Switch latency: 20% lower
Message rate: 44% higher
Power consumption per switch port: 25% lower
Scalability / CPU efficiency: 2X higher

100 Gb/s link speed in 2014; 200 Gb/s link speed in 2017

Gain Competitive Advantage Today, Protect Your Future

Smart Network for Smart Systems: RDMA, Acceleration Engines, Programmability

Higher Performance, Unlimited Scalability, Higher Resiliency – Proven!

Page 40: Co-Design Architecture for Exascale

Thank You