
Co-Design Architecture for Exascale


Page 1: Co-Design Architecture for Exascale

Dror Goldenberg, March 2016, HPCAC Swiss

Co-Design Architecture: Emergence of New Co-Processors

Page 2: Co-Design Architecture for Exascale


Co-Design Architecture to Enable Exascale Performance

CPU-Centric: limited to main CPU usage, which results in a performance limitation

Co-Design: creating synergies across In-CPU, In-Network and In-Storage Computing enables higher performance and scale

Page 3: Co-Design Architecture for Exascale


The Intelligence is Moving to the Interconnect

Past: intelligence in the CPU. Future: intelligence in the interconnect.

Page 4: Co-Design Architecture for Exascale


Intelligent Interconnect Delivers Higher Datacenter ROI

Network functions on the CPU: computing cycles are spent on the network instead of on users' applications

Smart network: the network offloads computing for applications and increases datacenter value

Page 5: Co-Design Architecture for Exascale


Breaking the Application Latency Wall

§ Today: Network device latencies are on the order of 100 nanoseconds

§ Challenge: Enabling the next order of magnitude improvement in application performance

§ Solution: Creating synergies between software and hardware – intelligent interconnect

Intelligent Interconnect Paves the Road to Exascale Performance

10 years ago:  Network ~10 microseconds,   Communication Framework ~100 microseconds
Today:         Network ~0.1 microseconds,  Communication Framework ~10 microseconds
Future:        Co-Design Network ~0.05 microseconds, Communication Framework ~1 microsecond

Page 6: Co-Design Architecture for Exascale


Introducing Switch-IB 2 World’s First Smart Switch

Page 7: Co-Design Architecture for Exascale


Introducing Switch-IB 2 World’s First Smart Switch

§ The world's fastest switch with <90 nanosecond latency

§ 36 ports, 100Gb/s per port, 7.2Tb/s throughput, 7.02 billion messages/sec (see the arithmetic note below)

§ Adaptive Routing, Congestion control, support for multiple topologies
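As a quick consistency check on the figures above (assuming, as is common for switch specifications, that the 7.2Tb/s aggregate counts both directions of every port):

```latex
% Aggregate message rate: per-port rate times port count
36 \times 195\,\mathrm{M\ msg/s} \approx 7.02\ \mathrm{billion\ msg/s}
% Aggregate throughput: 36 ports at 100\,Gb/s, counted in both directions
36 \times 100\,\mathrm{Gb/s} \times 2 = 7.2\,\mathrm{Tb/s}
```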

World’s First Smart Switch

Built for Scalable Compute and Storage Infrastructures

10X Higher Performance with the New Switch SHArP Technology

Page 8: Co-Design Architecture for Exascale


SHArP (Scalable Hierarchical Aggregation Protocol) Technology

Delivering 10X Performance Improvement for MPI and SHMEM/PGAS Communications

Switch-IB 2 Enables the Switch Network to Operate as a Co-Processor

SHArP Enables Switch-IB 2 to Manage and Execute MPI Operations in the Network
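To make the offload concrete, below is a minimal MPI sketch of the kind of collective SHArP targets; nothing in it is Mellanox-specific, and the file name and data sizes are illustrative. On a SHArP-enabled fabric the reduction inside MPI_Allreduce is executed by the switches rather than by the host CPUs, with no change to the application code.

```c
/* allreduce_demo.c - the collective pattern SHArP accelerates.
 * Build: mpicc allreduce_demo.c -o allreduce_demo
 * Run:   mpirun -np <N> ./allreduce_demo
 * With a SHArP-capable fabric the sum below is computed in the
 * switch network; the application code itself does not change. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nranks;
    double local[4], global[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Each rank contributes a small vector (e.g. partial residuals). */
    for (int i = 0; i < 4; i++)
        local[i] = rank + i * 0.1;

    /* Global sum across all ranks - the operation offloaded to the
     * switch network when SHArP is enabled. */
    MPI_Allreduce(local, global, 4, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global[0] = %f across %d ranks\n", global[0], nranks);

    MPI_Finalize();
    return 0;
}
```

In practice the offload is selected through the MPI library's runtime configuration rather than through application changes.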

Page 9: Co-Design Architecture for Exascale


SHArP Performance Advantage

§ MiniFE is a Finite Element mini-application
•  Implements kernels that represent implicit finite-element applications

10X to 25X Performance Improvement

AllReduce MPI Collective

Page 10: Co-Design Architecture for Exascale


The Intelligence is Moving to the Interconnect

Communication Frameworks (MPI, SHMEM/PGAS)

The Only Approach to Deliver 10X Performance Improvements

[Figure: applications and communication frameworks layered over interconnect offloads - Transport, RDMA, SR-IOV, Collectives, Peer-Direct, GPUDirect, MPI/SHMEM offloads and more, rolling out across Q1'16 and Q3'16]

Page 11: Co-Design Architecture for Exascale


Multi-Host Socket Direct™ – Low Latency Socket Communication

§ Each CPU with direct network access

§ QPI avoidance for I/O improves performance

§ Enables GPU / peer direct on both sockets

§ Solution is transparent to software


Multi-Host Socket Direct Performance

50% Lower CPU Utilization

20% lower Latency

Multi Host Evaluation Kit

Lower Application Latency, Free-up CPU

Page 12: Co-Design Architecture for Exascale


Introducing ConnectX-4 Lx Programmable Adapter

Scalable, Efficient, High-Performance and Flexible Solution

Security

Cloud/Virtualization

Storage

High Performance Computing

Precision Time Synchronization

Networking + FPGA

Mellanox Acceleration Engines and FPGA Programmability on One Adapter

Page 13: Co-Design Architecture for Exascale


Mellanox InfiniBand Proven and Most Scalable HPC Interconnect

“Summit” System “Sierra” System

Paving the Road to Exascale

Page 14: Co-Design Architecture for Exascale


NCAR-Wyoming Supercomputing Center (NWSC) – “Cheyenne”

§ Cheyenne supercomputer system

§ 5.34-petaflop SGI ICE XA Cluster

§  Intel “Broadwell” processors

§ More than 4K compute nodes

§ Mellanox EDR InfiniBand interconnect

§ Mellanox Unified Fabric Manager

§ Partial 9D Enhanced Hypercube interconnect topology

§ DDN SFA14KX systems

§ 20 petabytes of usable file system space

§  IBM GPFS (General Parallel File System)

Page 15: Co-Design Architecture for Exascale


High-Performance Designed 100Gb/s Interconnect Solutions

Cables and transceivers: active optical and copper cables (10 / 25 / 40 / 50 / 56 / 100Gb/s), VCSELs, silicon photonics and copper

InfiniBand switch: 36 EDR (100Gb/s) ports, <90ns latency, 7.2Tb/s throughput, 7.02 billion msg/sec (195M msg/sec/port)

Adapters: 100Gb/s, 0.7us latency, 150 million messages per second (10 / 25 / 40 / 50 / 56 / 100Gb/s)

Ethernet switch: 32 100GbE ports or 64 25/50GbE ports (10 / 25 / 40 / 50 / 100GbE), 6.4Tb/s throughput

Page 16: Co-Design Architecture for Exascale


Leading Supplier of End-to-End Interconnect Solutions

Enabling the Use of Data: Store and Analyze

Comprehensive End-to-End InfiniBand and Ethernet (VPI) Portfolio: Software, ICs, Switches/Gateways, Adapter Cards, Cables/Modules, Metro/WAN, NPU & Multicore (NPS, TILE)

Page 17: Co-Design Architecture for Exascale


The Performance Advantage of EDR 100G InfiniBand (28-80%)


Page 18: Co-Design Architecture for Exascale


End-to-End Interconnect Solutions for All Platforms

Highest Performance and Scalability for

X86, Power, GPU, ARM and FPGA-based Compute and Storage Platforms

10, 20, 25, 40, 50, 56 and 100Gb/s Speeds


Smart Interconnect to Unleash The Power of All Compute Architectures

Page 19: Co-Design Architecture for Exascale


Technology Roadmap – One-Generation Lead over the Competition

[Roadmap timeline, 2000 to 2020: 20G, 40G, 56G, 100G (2015), 200G and Mellanox 400G, spanning terascale, petascale and exascale; milestones include the Virginia Tech (Apple) system, 3rd on the TOP500 in 2003, and the Mellanox Connected "Roadrunner", 1st on the TOP500]

Page 20: Co-Design Architecture for Exascale


§ Transparent InfiniBand integration into OpenStack
•  Since the Havana release

§ RDMA directly from the VM via SR-IOV
§ MAC to GUID mapping
§ VLAN to pkey mapping
§ InfiniBand SDN network

§  Ideal fit for High Performance Computing Clouds

OpenStack Over InfiniBand – Extreme Performance in the Cloud

InfiniBand Enables The Highest Performance and Efficiency

Page 21: Co-Design Architecture for Exascale


§ Mellanox end to end
•  Mellanox ConnectX-4 NIC family, Switch-IB/Spectrum switches and 25/100Gb/s cables

§ Bringing 100Gb/s speeds to the cloud with minimal CPU utilization
•  Both VMs and hypervisors
•  Accelerations are critical to reach line rate: SR-IOV, RDMA, etc.

25, 50 And 100Gb/s Clouds Are Here!

92.412 Gb/s throughput at 0.71% CPU utilization

Page 22: Co-Design Architecture for Exascale


The Next Generation HPC Software Framework To Meet the Needs of Future Systems / Applications

Unified Communication – X Framework (UCX)

Page 23: Co-Design Architecture for Exascale


Exascale Co-Design Collaboration

A Collaborative Effort of Industry, National Laboratories and Academia

The Next Generation

HPC Software Framework

Page 24: Co-Design Architecture for Exascale


A Collaboration Effort

§ Mellanox co-designs network interface and contributes MXM technology
•  Infrastructure, transport, shared memory, protocols, integration with OpenMPI/SHMEM, MPICH

§ ORNL co-designs network interface and contributes UCCS project
•  InfiniBand optimizations, Cray devices, shared memory

§ NVIDIA co-designs high-quality support for GPU devices
•  GPUDirect, GDR copy, etc.

§ IBM co-designs network interface and contributes ideas and concepts from PAMI

§ UH/UTK focus on integration with their research platforms

Page 25: Co-Design Architecture for Exascale


Mellanox HPC-X™ Scalable HPC Software Toolkit

§ Complete MPI, PGAS/OpenSHMEM and UPC package (see the OpenSHMEM sketch below)

§ Maximize application performance

§ For commercial and open source applications

§ Based on UCX (Unified Communication – X Framework)
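As a flavor of the PGAS side of the package, here is a minimal OpenSHMEM sketch; it uses only standard OpenSHMEM 1.2 calls, and the compiler and launcher names are the usual Open MPI-based ones rather than anything HPC-X-specific.

```c
/* shmem_put_demo.c - one-sided put into a symmetric variable.
 * Build (with an OpenSHMEM implementation such as the one bundled in HPC-X):
 *   oshcc shmem_put_demo.c -o shmem_put_demo
 * Run:  oshrun -np 2 ./shmem_put_demo */
#include <shmem.h>
#include <stdio.h>

/* Symmetric variable: exists at the same address on every PE. */
static long counter = 0;

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* PE 0 writes directly into PE 1's memory; the transfer is a
     * one-sided RDMA operation, no receive call is needed. */
    if (me == 0 && npes > 1) {
        long value = (long)npes;
        shmem_long_put(&counter, &value, 1, 1);
    }

    shmem_barrier_all();  /* ensure the put has completed and is visible */

    if (me == 1)
        printf("PE %d sees counter = %ld\n", me, counter);

    shmem_finalize();
    return 0;
}
```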

Page 26: Co-Design Architecture for Exascale


Mellanox Delivers Highest MPI (HPC-X) Performance

Enabling the Highest Application Scalability and Performance

Mellanox ConnectX-4 Collectives Offload

Page 27: Co-Design Architecture for Exascale


Mellanox Delivers Highest Applications Performance (HPC-X)

§ Quantum Espresso application

Quantum Espresso runtime, Intel MPI vs. Bull MPI (HPC-X):

Test case   #nodes   Intel MPI time (s)   Bull MPI (HPC-X) time (s)   Gain
A           43       584                  368                         37%
B           196      2592                 998                         61%

Enabling the Highest Application Scalability and Performance

Page 28: Co-Design Architecture for Exascale


Maximize Performance via Accelerator and GPU Offloads

GPUDirect RDMA Technology

Page 29: Co-Design Architecture for Exascale


GPUs are Everywhere!

GPUDirect RDMA / Sync


Page 30: Co-Design Architecture for Exascale


§ Eliminates CPU bandwidth and latency bottlenecks
§ Uses remote direct memory access (RDMA) transfers between GPUs
§ Results in significantly improved MPI efficiency between GPUs in remote nodes
§ Based on PCIe PeerDirect technology (a minimal usage sketch follows below)

GPUDirect™ RDMA (GPUDirect 3.0)

With GPUDirect™ RDMA Using PeerDirect™
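From the application's perspective, the feature typically surfaces through a CUDA-aware MPI: device pointers are passed straight to MPI calls and the HCA moves GPU memory directly over the network. A minimal sketch, assuming a CUDA-aware MPI build and a GPU per rank (buffer size and tag are illustrative):

```c
/* gdr_pingpong.c - passing GPU buffers directly to MPI.
 * Requires a CUDA-aware MPI and two ranks, each with access to a GPU.
 * Build: mpicc gdr_pingpong.c -lcudart -o gdr_pingpong */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define N (1 << 20)   /* 1M floats, illustrative size */

int main(int argc, char **argv)
{
    int rank;
    float *dbuf;      /* device memory */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaSetDevice(0);
    cudaMalloc((void **)&dbuf, N * sizeof(float));
    cudaMemset(dbuf, 0, N * sizeof(float));

    /* The device pointer goes straight into MPI; with GPUDirect RDMA
     * the HCA moves the data without staging it in host memory. */
    if (rank == 0)
        MPI_Send(dbuf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(dbuf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("sent %d floats directly from GPU memory\n", N);

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}
```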

Page 31: Co-Design Architecture for Exascale


Mellanox GPUDirect RDMA Performance Advantage

§ HOOMD-blue is a general-purpose Molecular Dynamics simulation code accelerated on GPUs

§ GPUDirect RDMA allows direct peer-to-peer GPU communications over InfiniBand
•  Unlocks performance between GPU and InfiniBand
•  Provides a significant decrease in GPU-GPU communication latency
•  Provides complete CPU offload from all GPU communications across the network

102% Higher Performance: 2X Application Performance!

Page 32: Co-Design Architecture for Exascale


GPUDirect Sync (GPUDirect 4.0)

§ GPUDirect RDMA (3.0) – direct data path between the GPU and Mellanox interconnect
•  Control path still uses the CPU
-  CPU prepares and queues communication tasks on GPU
-  GPU triggers communication on HCA
-  Mellanox HCA directly accesses GPU memory

§ GPUDirect Sync (GPUDirect 4.0)
•  Both data path and control path go directly between the GPU and the Mellanox interconnect

[Chart: 2D stencil benchmark, average time per iteration (us) vs. number of nodes/GPUs (2 and 4); RDMA+PeerSync is 27% and 23% faster than RDMA only]

Maximum Performance For GPU Clusters

Page 33: Co-Design Architecture for Exascale


Remote GPU Access through rCUDA

[Diagram: the client side runs the CUDA application on top of the rCUDA library; the server side (GPU servers, GPU-as-a-Service) runs the rCUDA daemon on top of the CUDA driver and runtime; the two sides communicate over the network interface, and each client node sees virtual GPUs (vGPUs) backed by the pooled physical GPUs]

rCUDA provides remote access from every node to any GPU in the system

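Because rCUDA interposes at the CUDA runtime library level, an unmodified program sees the remote GPUs as if they were local. A minimal sketch using standard CUDA runtime calls only; how the client is pointed at the remote GPU servers is part of rCUDA's configuration and is not shown here.

```c
/* list_gpus.c - unmodified CUDA runtime code; under rCUDA the devices
 * reported here can live on remote GPU servers reached over the network.
 * Build (linking the CUDA runtime, or the rCUDA client library in its place):
 *   cc list_gpus.c -lcudart -o list_gpus */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        fprintf(stderr, "no CUDA runtime available\n");
        return 1;
    }

    for (int i = 0; i < count; i++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        /* With rCUDA, "device i" may be a virtual GPU backed by a
         * remote server; the application cannot tell the difference. */
        printf("device %d: %s, %zu MB\n", i, prop.name,
               prop.totalGlobalMem >> 20);
    }
    return 0;
}
```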

Page 34: Co-Design Architecture for Exascale


Interconnect Architecture Comparison

Offload versus Onload (Non-Offload)

Page 35: Co-Design Architecture for Exascale


Offload versus Onload (Non-Offload)

§ Two interconnect architectures exist – Offload-based and Onload-based

§ Offload architecture
•  The interconnect manages and executes all network operations
•  The interconnect can include application acceleration engines
•  Offloads the CPU, freeing CPU cycles for the applications (see the overlap sketch below)
•  Development requires a large R&D investment
•  Higher data center ROI

§ Onload architecture
•  A CPU-centric approach: everything must be executed on and by the CPU
•  The CPU is responsible for all network functions; the interconnect only pushes the data onto the wire
•  Cannot support acceleration engines, no support for RDMA, and the network transport is done by the CPU
•  Loads the CPU, reducing the CPU cycles available to the applications
•  Does not require R&D investment or interconnect expertise
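The practical payoff of offloading is communication/computation overlap: once the transfer has been handed to the interconnect, the CPU cycles between posting and completion are free for the application. A minimal sketch of that pattern with standard non-blocking MPI (the buffer size and the compute stub are illustrative):

```c
/* overlap_demo.c - overlapping computation with an offloaded transfer.
 * Build: mpicc overlap_demo.c -o overlap_demo ; run with exactly 2 ranks. */
#include <mpi.h>
#include <stdio.h>

#define N 1000000

static double do_local_work(double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i] * 1.0001;   /* stand-in for real computation */
    return s;
}

int main(int argc, char **argv)
{
    static double sendbuf[N], recvbuf[N], work[N];
    int rank;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;   /* assumes exactly 2 ranks */

    /* Post the exchange, then compute while the interconnect
     * (not the CPU) moves the data. */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    double s = do_local_work(work, N);   /* overlapped computation */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    if (rank == 0)
        printf("local work result %f, exchange complete\n", s);

    MPI_Finalize();
    return 0;
}
```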

Page 36: Co-Design Architecture for Exascale


Sandia National Laboratory Paper – Offloading versus Onloading

Page 37: Co-Design Architecture for Exascale


Interconnect Throughput – Offload versus Onload

The Offloading Advantage!

Network Performance Dramatically Depends on CPU Frequency!

Data Throughput:

20% Higher at common Xeon Frequency

250% Higher at common Xeon Phi Frequency

Common Xeon Frequency 2.6GHz

Common Xeon Phi Frequency ~1GHz

Page 38: Co-Design Architecture for Exascale


Only Offload Architecture Can Enable Co-Processors

Offloading (Highest Performance for all Frequencies)

Onloading (performance loss with lower CPU frequency)

Common Xeon Frequency

Common Xeon Phi Frequency

Onloading Technology Not Suitable for Co-Processors!

Page 39: Co-Design Architecture for Exascale


Mellanox InfiniBand Leadership Over Omni-Path

Switch latency: 20% lower
Message rate: 44% higher
Power consumption per switch port: 25% lower
Scalability / CPU efficiency: 2X higher

100 Gb/s link speed in 2014; 200 Gb/s link speed in 2017

Gain Competitive Advantage Today, Protect Your Future

Smart Network for Smart Systems: RDMA, Acceleration Engines, Programmability

Higher Performance, Unlimited Scalability, Higher Resiliency – Proven!

Page 40: Co-Design Architecture for Exascale

Thank You