Romuald Josien - Oct, 2018
GPU COMPUTING TRENDS
Agenda:
- NVIDIA the company
- Examples of NVIDIA contributions to AI use cases
- What a distance covered these past 5+ years!
- Thanks to NVIDIA innovation
- NVIDIA Tesla portfolio
- The more you buy, the more you save!
- NVIDIA programs for education
➢ Founded in 1993
➢ HQ in Santa Clara (CA – USA)
➢ Jensen Huang, Founder & CEO
➢ 12,000 employees WW
➢ $9.7B revenue in FY18 (+41%)
➢ >1B GPUs shipped to date
➢ 6,000 patents WW
NVIDIA FACTS
NVIDIA - GPU COMPUTING
ONE ARCHITECTURE — CUDA
NVIDIA "THE AI COMPUTING COMPANY"
Computer Graphics | GPU Computing | Artificial Intelligence
Gaming | VR | AI & HPC | Self-Driving Cars & Autonomous Machines
NVIDIA CONTRIBUTES TO IMPROVE OUR WORLD
DEVELOPING THE VEHICLES OF THE FUTURE
Zenuity, a joint venture of Volvo and Veoneer, aims to build autonomous driving software for production vehicles by 2021. They chose to build their deep learning infrastructure with NVIDIA DGX-1 servers and Pure Storage FlashBlade systems to accelerate their AI initiative.
THE BRAINS BEHIND SMART CITIES
Verizon's Smart Communities Group is on a mission to make cities safer, smarter and greener. Using NVIDIA Metropolis, an edge-to-cloud video platform for building smarter, faster AI-powered applications, Verizon is working to collect and analyze multiple streams of video data to improve traffic flow, enhance pedestrian safety, optimize parking and more.
SPEEDING UP DRUG DISCOVERY
Classic molecular dynamics simulations are time-consuming and expensive, but machine learning models can help predict the probability that target molecules will bond with chemical compounds. Researchers at the University of Pittsburgh are improving model performance and prediction accuracy. Their convolutional neural network, accelerated by NVIDIA GPUs, improved prediction accuracy from ~52% to 70%, which could reduce the time and cost of bringing new drugs to market.
AI IS ON TRACK TO SAFEGUARD RAILWAY INTEGRITY
To maintain the integrity of its 3,232 km of tracks, the Swiss Federal Railways (SBB) runs diagnostic trains to photograph and monitor tracks in real time. But traditional data processing methods produce false positives/negatives. To remedy this, SBB and CSEM (Swiss Research and Development Center) launched the Railcheck project, which applies deep learning, powered by the NVIDIA DGX Station, to improve the automatic detection and classification of faults.
AI HELPS DOCTORS DIAGNOSE BREAST CANCER
Every day, pathologists are tasked with providing cancer diagnoses to guide patient treatment. However, sifting through millions of normal cells to identify a few malignant cells is extremely laborious using conventional methods. PathAI combines GPU deep learning with traditional pathology to improve accuracy, speed diagnosis, and reduce error rates by 85%.
AI HELPS PERSONALIZE IMMUNOTHERAPY
Immunotherapy has a success rate of only 40% and a risk that it may attack healthy cells. Max Kelsen is using sophisticated AI approaches with NVIDIA V100 GPUs to integrate genomic, transcriptomic and patient information to identify a classifier and develop a test that can predict treatment response.
What a distance covered since 2012!
[Chart: 40 years of processor performance, 1980-2020, log scale. GPU-computing performance grows 1.5X per year, on track for 1000X by 2025; single-threaded CPU performance, which once grew 1.5X per year, now grows only 1.1X per year. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data for 2010-2015 collected by K. Rupp.]
RISE OF GPU COMPUTING
CUDA, a domain-specific computing architecture: 10X in 5 years
[Stack diagram: APPLICATIONS / SYSTEMS / ALGORITHMS / CUDA / ARCHITECTURE]
NVIDIA CONFIDENTIAL – DO NOT DISTRIBUTE
DEEP LEARNING: EXPONENTIAL PERFORMANCE IMPROVEMENTS (500X in 5 years!)
Alex Krizhevsky won the ImageNet competition in 2012.
[Chart: time to train AlexNet on the ImageNet dataset, 2012 vs 2018]
NEURAL NETWORK COMPLEXITY IS EXPLODING: Bigger and More Compute Intensive
[Charts: model complexity (GOP x bandwidth) over time]
- Speech, 2013-2018: DeepSpeech, DeepSpeech 2, DeepSpeech 3 (30X)
- Image, 2011-2017: AlexNet, GoogleNet, Inception-v2, ResNet-50, Inception-v4 (350X)
- Translation, 2014-2018: GNMT, OpenNMT, MoE (10X)
REVOLUTIONARY AI PERFORMANCE: 3X Faster DL Training Performance
3X reduction in time to train over P100
[Chart: relative time-to-train improvements (LSTM)]
- 2x CPU: 15 days
- 1x P100: 18 hours
- 1x V100: 6 hours
Neural machine translation training for 13 epochs | German -> English, WMT15 subset | CPU = 2x Xeon E5-2699 v4
Over 80X DL Training Performance in 3 Years
[Chart: GoogleNet training speedup on successive cuDNN versions, vs 1x K80 with cuDNN 2]
- Q1'15: 1x K80, cuDNN 2 (baseline)
- Q3'15: 4x M40, cuDNN 3
- Q2'16: 8x P100, cuDNN 6
- Q2'17: 8x V100, cuDNN 7 (over 80x)
DGX-1: 140X FASTER THAN CPU
10X PERFORMANCE GAIN IN LESS THAN A YEAR
DGX-1 (Sep '17) vs DGX-2 (Q3 '18), with software improvements across the stack including NCCL, cuDNN, etc.
[Chart: time to train (days); workload: FairSeq, 55 epochs to solution; PyTorch training performance]
- DGX-1 with V100: 15 days
- DGX-2: 1.5 days (10 times faster)
AI AND HPC BENCHMARKS: DGX-2 VS CPU
Replace CPU nodes: save money, power and space in the data center.
[Charts: single-node speed-up, dual-socket CPU (1x) vs HGX-2]
- AI training: HGX-2 replaces 300 CPU-only server nodes (300X). Workload: ResNet-50, 90 epochs to solution | CPU server: dual-socket Intel Xeon Gold 6140
- HPC: HGX-2 replaces 60 CPU-only server nodes (60X). Workload: MILC (particle physics HPC application) | CPU server: dual-socket Intel Xeon Gold 6140
WORLD'S MOST PERFORMANT INFERENCE PLATFORM
Up to 36X faster than CPUs | Accelerates all AI workloads
[Charts: speedup vs CPU server (1.0x baseline)]
- Natural language processing inference (GNMT): Tesla P4 10X, Tesla T4 36X
- Speech inference (DeepSpeech 2): Tesla P4 4X, Tesla T4 21X
- Video inference (ResNet-50, 7ms latency limit): Tesla P4 10X, Tesla T4 27X
[Chart: peak performance, TFLOPS/TOPS]
- Tesla P4: 5.5 TFLOPS (float) | 22 TOPS (INT8)
- Tesla T4: 65 TFLOPS (float) | 130 TOPS (INT8) | 260 TOPS (INT4)
VOLTA TENSOR CORE GPUS POWER SUMMIT: WORLD'S FASTEST AI SUPERCOMPUTER
122 PetaFLOPS HPC | 3 ExaFLOPS AI
27,648 Volta Tensor Core GPUs
DELIVERING MAJORITY OF THE NEW COMPUTING PERFORMANCE
NVIDIA GPUs' share of new FLOPS on Top 500 systems:
- 2015 (Tesla K80): 11%
- 2017 (Tesla P100): 25%
- 2018 (Tesla V100): 56%
Thanks to NVIDIA innovation!
FUSION OF HPC & AI
The Volta Tensor Core GPU fuses HPC & AI computing through multi-precision computing:
- HPC (simulation): FP64, FP32
- AI (deep learning): FP16, INT8
TENSOR CORE: Mixed-Precision Matrix Math on 4x4 Matrices
New CUDA TensorOp instructions & data formats; a 4x4 matrix processing array computes
D[FP32] = A[FP16] * B[FP16] + C[FP32]
Using Tensor Cores via:
- Volta-optimized frameworks and libraries (cuDNN, cuBLAS, TensorRT, ...)
- CUDA C++ warp-level matrix operations
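The mixed-precision contract D[FP32] = A[FP16] * B[FP16] + C[FP32] can be sketched in NumPy: operands stored in FP16, products accumulated in FP32 (matrix sizes and values here are illustrative, not the hardware path):

```python
import numpy as np

# Illustrative 4x4 operands; a Tensor Core consumes FP16 inputs.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = np.zeros((4, 4), dtype=np.float32)

# D = A * B + C with FP32 accumulation: promote before the matmul
# so the sum-of-products is carried in FP32, as the Tensor Core does.
D = A.astype(np.float32) @ B.astype(np.float32) + C

# Accumulating in FP16 instead incurs extra rounding error on the same inputs.
D_fp16_accum = (A @ B).astype(np.float32) + C
```

The FP32 accumulator is what lets mixed-precision training halve memory traffic without giving up model accuracy.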
FASTER RESULTS ON COMPLEX DL AND HPC
V100: up to 50% faster results with 2x the memory (V100 16GB vs V100 32GB)

FASTER RESULTS
- Neural machine translation (NMT): 1.5X faster language translation (0.8 step/sec -> 1.2 step/sec)
- 3D FFT 1k x 1k x 1k: 1.5X faster calculations (2.5 TF -> 3.8 TF)
Dual E5-2698v4 server, 512GB DDR4, Ubuntu 16.04, CUDA9, cuDNN7 | NMT is GNMT-like and run with TensorFlow NGC Container 18.01 (batch size = 128 for 16GB and 256 for 32GB) | FFT is with cufftbench 1k x 1k x 1k, comparing 2 V100 16GB (DGX1V) vs. 2 V100 32GB (DGX1V)

HIGHER ACCURACY
- Object detection: 40% lower error rate (accuracy with 16 layers vs accuracy with 152 layers)
R-CNN for object detection at 1080p with Caffe | V100 16GB uses VGG-16 | V100 32GB uses ResNet-152

HIGHER RESOLUTION
- Unsupervised image translation (a GAN converts an input winter photo to summer): 4X higher resolution (512x512 -> 1024x1024 images)
GAN by NV Research (https://arxiv.org/pdf/1703.00848.pdf) | V100 16GB and V100 32GB with FP32
NVLINK AND MULTI-GPU SCALING
For data-parallel training
[Diagrams: two 8-GPU server topologies, each with two CPUs and PCIe switches]
PCIe-based system:
- Data loading over PCIe
- Gradient averaging over PCIe and the QPI link
- Data loading and gradient averaging share communication resources: congestion
NVLink-based system:
- Data loading over PCIe
- Gradient averaging over NVLink
- No sharing of communication resources: no congestion

30% BETTER PERFORMANCE WITH NVLINK THAN PCIE
- Encoder and decoder embedding size of 512
- Batch size of 256 per GPU
- NVIDIA DGX containers version 17.11, processing real data with cuDNN 7.0.4, NCCL 2.1.2
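The gradient-averaging step that NVLink accelerates can be sketched with plain NumPy (worker count and tensor shape are illustrative; in practice NCCL performs this as a ring all-reduce across GPUs):

```python
import numpy as np

def allreduce_mean(local_grads):
    """Average one gradient tensor across all workers.

    This is the collective that NCCL runs over NVLink in data-parallel
    training: every worker contributes its local gradient and receives
    the same averaged result.
    """
    return sum(local_grads) / len(local_grads)

# Hypothetical 4-GPU job: each worker computed gradients on its own
# mini-batch shard.
rng = np.random.default_rng(1)
grads = [rng.standard_normal(3) for _ in range(4)]

avg = allreduce_mean(grads)
# All replicas now apply the identical update, keeping the model in sync.
```

Because this collective runs once per training step over the full gradient volume, its bandwidth (NVLink vs shared PCIe/QPI) directly bounds multi-GPU scaling.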
NEW TURING TENSOR CORE
MULTI-PRECISION FOR AI INFERENCE
65 TFLOPS FP16 | 130 TeraOPS INT8 | 260 TeraOPS INT4
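INT8 and INT4 inference work by mapping FP32 tensors onto small integer ranges. A minimal symmetric INT8 quantization sketch (the scale selection is deliberately simplified; real deployments, e.g. with TensorRT, calibrate scales per tensor or per channel):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization of an FP32 tensor to INT8."""
    scale = np.abs(x).max() / 127.0   # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 values back to approximate FP32."""
    return q.astype(np.float32) * scale

x = np.array([0.5, -1.0, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
# Round-trip error is bounded by half a quantization step (scale / 2).
```

Integer math is where the 130/260 TOPS figures come from: INT8 and INT4 multiply-accumulates cost far less silicon and energy per operation than FP32.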
RAPIDS: GPU-Accelerated Data Science
RAPIDS is a set of open-source libraries for GPU-accelerating data preparation and machine learning.
rapids.ai | Announced at GTC Europe
NVIDIA GPU CLOUD (NGC): Simple Access to GPU-Accelerated Software
- Discover 35 optimized containers
- Deploy applications in minutes, not days
- Run anywhere with maximum performance: GPU-powered cloud servers and workstations
- Accelerate time to market
DEEP LEARNING CONTAINERS ON NGC
The power to run multiple frameworks at once.
[Diagram: NVIDIA DGX-1 software stack]
- Containerized applications: TensorFlow, CNTK, Caffe2, PyTorch, and other frameworks and apps, each packaged with tuned SW, NVIDIA Docker and the CUDA runtime
- Linux kernel + CUDA driver
Container images are portable across new driver versions.
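Running a framework from NGC is a pull-and-run workflow. A sketch of what that looks like from the command line (the image tag `18.09-py3` and the use of the NVIDIA Docker runtime are assumptions appropriate to late 2018; an NGC account and API key are required):

```shell
# Authenticate against the NGC registry with your NGC API key.
docker login nvcr.io

# Pull a GPU-optimized TensorFlow container (tag is illustrative).
docker pull nvcr.io/nvidia/tensorflow:18.09-py3

# Launch it interactively with GPU access via the NVIDIA container runtime.
docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tensorflow:18.09-py3
```

Because the CUDA runtime and tuned libraries ship inside the image, the same container runs unmodified on a DGX system, a workstation, or a cloud GPU instance.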
TESLA PRODUCT FAMILY

END-TO-END PRODUCT FAMILY
- Desktop: TITAN
- Workstation: Quadro, DGX Station (fully integrated AI system)
- Data center, HPC/training: Tesla V100 (V100 PCIe), DGX-1, DGX-2 (fully integrated AI systems), HGX server configs
- Data center, inference: Tesla T4
- Virtual workstation: Virtual GPU
- Automotive / embedded: Drive AGX Pegasus, Jetson AGX Xavier
TESLA V100: WORLD'S MOST ADVANCED DATA CENTER GPU
- 5,120 CUDA cores
- 640 new Tensor cores
- 7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS | 125 Tensor TFLOPS
- 20MB SM register file | 16MB cache
- 16GB/32GB HBM2 @ 900GB/s | 300GB/s NVLink
- Data center ready: 24/7 uptime, scalable performance
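The headline TFLOPS figures follow from the unit counts and the clock speed. A quick sanity check (the ~1.53 GHz boost clock is an assumption taken from public V100 specifications, not stated on this slide):

```python
# Peak throughput = units x FLOPs-per-unit-per-clock x clock (GHz -> TFLOPS).
boost_clock_ghz = 1.53          # assumed V100 boost clock

cuda_cores = 5120
fp32_tflops = cuda_cores * 2 * boost_clock_ghz / 1000    # FMA = 2 FLOPs/clock
fp64_tflops = fp32_tflops / 2                            # FP64 runs at half rate

tensor_cores = 640
# Each Tensor Core does a 4x4x4 matrix FMA per clock: 64 MACs = 128 FLOPs.
tensor_tflops = tensor_cores * 128 * boost_clock_ghz / 1000

print(round(fp32_tflops, 1), round(fp64_tflops, 1), round(tensor_tflops))
# -> 15.7 7.8 125
```

The same arithmetic explains the 8x ratio between Tensor TFLOPS and FP32 TFLOPS: each Tensor Core delivers 128 FLOPs per clock versus 2 for a CUDA core.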
TESLA T4: WORLD'S MOST ADVANCED INFERENCE GPU
Universal inference acceleration
- 320 Turing Tensor cores
- 2,560 CUDA cores
- 65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS
- 16GB @ 320GB/s
NVIDIA DGX-1 WITH VOLTA: Highest Performance, Fully Integrated HW System
1 PetaFLOPS | 8x Tesla V100 32GB | 300 GB/s NVLink hybrid cube mesh
2x Xeon | 7 TB SSD RAID 0 | Quad IB/Ethernet 100Gbps, dual 10GbE | 3U, 3,200W
NEW NVIDIA DGX-2: The Largest GPU Ever Created
2 PFLOPS | 512GB HBM2 | 16 TB/sec memory bandwidth | 10 kW | 160 kg
POWERING THE DEEP LEARNING ECOSYSTEM
The NVIDIA Deep Learning SDK and CUDA accelerate every major deep learning framework, across:
- Computer vision: object detection, image classification
- Speech & audio: voice recognition, language translation
- Natural language processing: recommendation engines, sentiment analysis
developer.nvidia.com/deep-learning-software
The more you buy, the more you save!
"The More GPUs You Buy, The More You Save" —Jensen Huang
- Traditional hyperscale cluster: 300 dual-CPU servers | $3M | 180 kW
- NVIDIA DGX-2 for deep learning: 1 DGX-2 | $399K | 10 kW
1/8 the cost | 1/60 the space | 1/18 the power
NVIDIA TESLA PLATFORM SAVES MONEY: Game-Changing Inference Performance
Inference workload: image recognition using ResNet-50
- 160 CPU servers: 45,000 images/sec, 65 kW
- 1 HGX server: 45,000 images/sec, 3 kW
Same throughput | 1/20 the space | 1/22 the power
NVIDIA PROGRAMS

DEEP LEARNING INSTITUTE
DLI mission: help the world solve the most challenging problems using AI and deep learning.
We help developers, data scientists and engineers get started in architecting, optimizing, and deploying neural networks to solve real-world problems in diverse industries such as autonomous vehicles, healthcare, robotics, media & entertainment and game development.
HOW TO ACCESS DLI TRAINING

SELF-PACED ONLINE
- Get started anywhere, any time with access to a GPU-accelerated workstation in the cloud
- All you need is a device with an Internet connection
- Courses (8 hrs) are $90; electives (2 hrs) are $0-30
- Take online courses at www.nvidia.com/dli

INSTRUCTOR-LED WORKSHOPS
- Full-day workshops are available by request
- Workshops are delivered by DLI-certified instructors through NVIDIA or DLI partners
- MSRP: $10K/day for up to 20 attendees (EDU pricing available)
- Request a workshop at www.nvidia.com/requestDLI

INDUSTRY CONFERENCES
- Training available as instructor-led and self-paced at industry events
- Deep learning presentations offered for business & technology leaders
- Special training pass available for GTC (NVIDIA's GPU Technology Conference)
- View upcoming DLI workshops at www.nvidia.com/dli
RICH CONTENT PORTFOLIO
Fundamentals and advanced hands-on training in key technologies and application domains that matter:
- Deep Learning Fundamentals
- Accelerated Computing Fundamentals
- Finance
- Medical Image Analysis
- Autonomous Vehicles
- Genomics
- Game Development
- Digital Content Creation
More industry-specific training coming soon…
UNIVERSITY AMBASSADOR PROGRAM: Training the Next Generation of AI Practitioners
The University Ambassador Program enables qualified faculty and researchers to teach DLI courses to their students and academic staff at no cost. 40 universities around the world are part of the program.