HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing

on-demand.gputechconf.com/supercomputing/2014/... · 2014-11-19

HETEROGENEOUS HPC,

ARCHITECTURE OPTIMIZATION,

AND NVLINK

Steve Oberlin

CTO, Accelerated Computing

US to Build Two Flagship Supercomputers

Major Step Forward on the Path to Exascale

Partnership for Science

100-300 PFLOPS Peak Performance

10x in Scientific Applications

2017

SUMMIT SIERRA

Just 4 nodes in Summit would make the Top500 list of supercomputers today

Similar Power as Titan

5-10x Faster

1/5th the Size

150 PF = 3M Laptops: One Laptop for Every Resident in the State of Mississippi

Optimizing Serial/Parallel Execution

Application Code

GPU: Parallel Work, Majority of Ops

+

CPU: Serial Work, System and Sequential Ops

IBM POWER CPU: Most Powerful Serial Processor

NVIDIA NVLink: Fastest CPU-GPU Interconnect

NVIDIA Volta GPU: Most Powerful Parallel Processor

NVLink-Enabled Heterogeneous Node: 5x Higher Energy Efficiency

80-200 GB/s

NVLink: Logical Node Integration

5x PCIe bandwidth

Move data at CPU memory speed

3x lower energy/bit

[Diagram: a latency-optimized Power or ARM CPU with DDR memory (DDR4, 50-75 GB/s), linked by NVLink (80 GB/s) to a throughput-optimized Tesla GPU with stacked memory (HBM, 1 Terabyte/s).]
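The bandwidth figures on this slide can be turned into rough transfer times. The following is a minimal sketch, not a measurement: it uses only the peak rates quoted above, ignores latency and protocol overhead, and derives the PCIe figure from the "5x PCIe bandwidth" claim rather than from any number stated on the slide. The 12 GB payload and the names `transfer_seconds` and `PCIE_GBPS` are hypothetical, chosen for illustration.

```python
# Peak bandwidths quoted on the slide, in GB/s.
NVLINK_GBPS = 80.0
DDR4_GBPS_LOW, DDR4_GBPS_HIGH = 50.0, 75.0
HBM_GBPS = 1000.0  # "1 Terabyte/s"

# Assumption: the slide says NVLink has "5x PCIe bandwidth", so the PCIe
# baseline is back-computed here; it is not printed on the slide.
PCIE_GBPS = NVLINK_GBPS / 5.0


def transfer_seconds(gigabytes, gb_per_s):
    """Idealized time to move `gigabytes` at a sustained peak rate.

    Ignores latency, protocol overhead, and contention, so the result is
    a lower bound on the real transfer time.
    """
    if gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return gigabytes / gb_per_s


# Hypothetical payload size, for illustration only.
PAYLOAD_GB = 12.0

links = [
    ("PCIe (derived)", PCIE_GBPS),
    ("DDR4 low", DDR4_GBPS_LOW),
    ("DDR4 high", DDR4_GBPS_HIGH),
    ("NVLink", NVLINK_GBPS),
    ("HBM", HBM_GBPS),
]
for name, rate in links:
    print(f"{name:15s} {rate:7.1f} GB/s  {transfer_seconds(PAYLOAD_GB, rate):.3f} s")
```

Under these assumptions the NVLink rate sits just above the quoted DDR4 range, which is one way to read "move data at CPU memory speed".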

NVLink High-speed GPU Interconnect

[Diagram, 2014: KEPLER GPU attached over PCIe to an x86, ARM64, or POWER CPU.]

[Diagram, 2016: PASCAL GPU with NVLink, attached over NVLink to a POWER CPU or over PCIe to an x86, ARM64, or POWER CPU.]

NVLink Unleashes Multi-GPU Performance

Over 2x Application Performance Speedup When Next-Gen GPUs Connect via NVLink Versus PCIe

[Diagram: two Tesla GPUs interconnected with NVLink (5x faster than PCIe Gen3 x16), with a PCIe switch and a CPU.]

[Bar chart: speedup vs. PCIe-based server, axis 1.00x to 2.25x, for ANSYS Fluent, Multi-GPU Sort, LQCD QUDA, AMBER, and 3D FFT.]

3D FFT, ANSYS: 2 GPU configuration. All other apps compare 4 GPU configurations. AMBER Cellulose (256x128x128), FFT problem size (256^3).

Two Computing Models For Accelerators

Heterogeneous Computing Model: Complementary Processors Work Together

CPU: Optimized for Serial Tasks

GPU Accelerator: Optimized for Parallel Tasks

Many-Weak-Cores (MWC) Model: Single CPU Core for Both Serial & Parallel Work

Xeon Phi (And Others): Many Weak Serial Cores

Amdahl’s Law Analysis

[Figure series, ten slides: one bar chart per parallel fraction. Each chart plots Minutes Run Time, split into Serial and Parallel portions, for three rows labeled Work 1x CPU, 2 x MWC (.25x CPU), and 1 GPU+1 CPU. The run-time axis of each chart is listed below.]

98% Parallel Work: axis 0 to 10 minutes

90% Parallel Work: axis 0 to 10 minutes

80% Parallel Work: axis 0 to 10 minutes

70% Parallel Work: axis 0 to 14 minutes

60% Parallel Work: axis 0 to 18 minutes

50% Parallel Work: axis 0 to 25 minutes

40% Parallel Work: axis 0 to 30 minutes

30% Parallel Work: axis 0 to 30 minutes

20% Parallel Work: axis 0 to 35 minutes

10% Parallel Work: axis 0 to 40 minutes
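The Amdahl's Law charts can be approximated with a short calculation. This is a sketch under stated assumptions, not the model behind the slides: the 10-minute baseline is inferred from the chart axes, the ".25x CPU" label is read as the serial speed of a many-weak-cores part relative to the CPU, and the parallel speedup of 8x given to both accelerators is a hypothetical placeholder because the slides do not state one. The function name `run_time` and the `CONFIGS` table are illustrative.

```python
def run_time(baseline_minutes, parallel_fraction, serial_speed, parallel_speed):
    """Split run time into (serial, parallel) minutes, Amdahl style.

    baseline_minutes: run time of the whole job on the reference CPU.
    parallel_fraction: share of the baseline work that parallelizes (0..1).
    serial_speed, parallel_speed: speed relative to the reference CPU
    for the serial and parallel portions respectively.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial = baseline_minutes * (1.0 - parallel_fraction) / serial_speed
    parallel = baseline_minutes * parallel_fraction / parallel_speed
    return serial, parallel


# Assumption: 10-minute baseline, inferred from the 0 to 10 chart axis.
BASELINE_MINUTES = 10.0

# (serial_speed, parallel_speed) relative to one CPU.
# The 0.25 serial speed is one reading of the ".25x CPU" label.
# The 8.0 parallel speed is hypothetical; it is the same for both
# accelerators so that only the serial speed differs between them.
CONFIGS = {
    "1x CPU": (1.0, 1.0),
    "2 x MWC (.25x CPU)": (0.25, 8.0),
    "1 GPU+1 CPU": (1.0, 8.0),
}

for percent in (98, 90, 80, 70, 60, 50, 40, 30, 20, 10):
    print(f"{percent}% parallel work")
    for name, (s_speed, p_speed) in CONFIGS.items():
        serial, parallel = run_time(BASELINE_MINUTES, percent / 100.0, s_speed, p_speed)
        print(f"  {name:20s} serial {serial:6.2f}  parallel {parallel:6.2f}  "
              f"total {serial + parallel:6.2f} min")
```

Keeping the parallel speed identical for both accelerators isolates the effect the slides emphasize: as the parallel fraction falls, the slow serial cores dominate the total run time.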

TESLA K80

WORLD’S FASTEST ACCELERATOR FOR DATA ANALYTICS AND SCIENTIFIC COMPUTING

Dual-GPU Accelerator for Max Throughput

2x Faster: 2.9 TF | 4992 Cores | 480 GB/s

Double the Memory: Designed for Big Data Apps, 24GB (K40: 12GB)

Maximum Performance: Dynamically Maximize Performance for Every Application

Oil & Gas, Data Analytics, HPC Viz

[Bar chart: Deep Learning: Caffe. Speedup axis 0x to 25x for CPU, Tesla K40, and Tesla K80.]

Caffe Benchmark: AlexNet training throughput based on 20 iterations. CPU: E5-2697v2 @ 2.70GHz, 64GB System Memory, CentOS 6.2

Performance Lead Continues to Grow

[Chart: Peak Double Precision FLOPS, in GFLOPS (axis 0 to 3500), 2008 to 2014, NVIDIA GPU vs. x86 CPU. GPU points: M1060, M2090, K20, K40, K80. CPU points: Westmere, Sandy Bridge, Ivy Bridge, Haswell.]

[Chart: Peak Memory Bandwidth, in GB/s (axis 0 to 600), 2008 to 2014, NVIDIA GPU vs. x86 CPU. GPU points: M1060, M2090, K20, K40, K80. CPU points: Westmere, Sandy Bridge, Ivy Bridge, Haswell.]

10x Faster than CPU on Applications

[Bar chart: K80 vs. CPU speedup, axis 0x to 15x, grouped as Quantum Chemistry, Molecular Dynamics, Physics, and Benchmarks.]

CPU: 12 cores, E5-2697v2 @ 2.70GHz, 64GB System Memory, CentOS 6.2

GPU: Single Tesla K80, Boost enabled

Tesla Platform Enables Optimization

Scalable Nodes, ISA Choice

x86

NVLink

Tesla Platform Enables Optimization

Ecosystem: Industry Standard CPUs and Interconnects

Industry-Driven Solutions

[Diagram: NVIDIA GPU surrounded by ARM64, POWER, x86, InfiniBand, Cray, Ethernet, and Others.]

CORAL Scalable Heterogeneous Node

Approximately 3,400 nodes, each with:

IBM POWER9 CPUs and multiple NVIDIA Tesla® Volta GPUs

CPUs and GPUs integrated on-node with high speed NVLink

Large coherent memory: over 512 GB (HBM + DDR4)

All directly addressable from the CPUs and GPUs

An additional 800 GB of NVRAM, usable as a burst buffer or as extended memory

Over 40 TF peak performance/node(!)
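The per-node figures above can be scaled to a whole system as a sanity check. This is a back-of-the-envelope sketch, assuming exactly 3,400 nodes and treating the "over" figures (40 TF, 512 GB) as lower bounds, so the totals are floors rather than specifications. The function name `system_totals` is hypothetical.

```python
# Per-node figures from the slide. "Approximately" and "over" are
# treated as exact values here, so results are rough lower bounds.
NODES = 3400
PEAK_TF_PER_NODE = 40.0
COHERENT_GB_PER_NODE = 512.0
NVRAM_GB_PER_NODE = 800.0


def system_totals(nodes, tf_per_node, mem_gb_per_node, nvram_gb_per_node):
    """Scale per-node figures to system totals.

    Returns (peak petaflops, coherent memory in PB, NVRAM in PB),
    using decimal units (1 PF = 1000 TF, 1 PB = 1,000,000 GB).
    """
    peak_pf = nodes * tf_per_node / 1000.0
    mem_pb = nodes * mem_gb_per_node / 1.0e6
    nvram_pb = nodes * nvram_gb_per_node / 1.0e6
    return peak_pf, mem_pb, nvram_pb


peak_pf, mem_pb, nvram_pb = system_totals(
    NODES, PEAK_TF_PER_NODE, COHERENT_GB_PER_NODE, NVRAM_GB_PER_NODE
)
print(f"Peak performance: at least {peak_pf:.0f} PF")
print(f"Coherent memory:  at least {mem_pb:.2f} PB")
print(f"NVRAM:            about {nvram_pb:.2f} PB")

# Four nodes, the count mentioned earlier in the deck, in teraflops.
print(f"4 nodes: at least {4 * PEAK_TF_PER_NODE:.0f} TF peak")
```

The resulting peak falls inside the 100-300 PFLOPS range quoted earlier in the deck, which suggests the per-node and system-level numbers are mutually consistent.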

NVLink In Practice

Optimized Heterogeneous Node

CORAL Application Performance Projections

Tesla Accelerated Computing

YOUR PLATFORM FOR DISCOVERY