Romuald Josien - Oct, 2018
GPU COMPUTING TRENDS
Agenda:
- NVIDIA the company
- Examples of NVIDIA contributions to AI use cases
- What a distance covered these past 5+ years!
- Thanks to NVIDIA innovation
- NVIDIA Tesla portfolio
- The more you buy, the more you save!
- NVIDIA programs for education
➢ Founded in 1993
➢ HQ in Santa Clara (CA – USA)
➢ Jensen Huang, Founder & CEO
➢ 12,000 employees WW
➢ $9.7B revenue in FY18 (+41%)
➢ >1B GPUs shipped to date
➢ 6,000 patents WW
NVIDIA FACTS
NVIDIA - GPU COMPUTING
ONE ARCHITECTURE — CUDA
NVIDIA "THE AI COMPUTING COMPANY"
Computer Graphics | GPU Computing | Artificial Intelligence
Gaming | VR | AI & HPC | Self-Driving Cars & Autonomous Machines
NVIDIA CONTRIBUTES TO IMPROVE OUR WORLD
DEVELOPING THE VEHICLES OF THE FUTURE
Zenuity, a joint venture of Volvo and Veoneer, aims to build autonomous driving software for production vehicles by 2021. They chose to build their deep learning infrastructure with NVIDIA DGX-1 servers and Pure Storage FlashBlade systems to accelerate their AI initiative.
THE BRAINS BEHIND SMART CITIES
Verizon's Smart Communities Group is on a mission to make cities safer, smarter and greener. Using NVIDIA Metropolis, an edge-to-cloud video platform for building smarter, faster AI-powered applications, Verizon is working to collect and analyze multiple streams of video data to improve traffic flow, enhance pedestrian safety, optimize parking and more.
SPEEDING UP DRUG DISCOVERY
Classic molecular dynamics simulations are time-consuming and expensive, but machine learning models can help predict the probability that target molecules will bond with chemical compounds. Researchers at the University of Pittsburgh are improving model performance and prediction accuracy. Their convolutional neural network, accelerated by NVIDIA GPUs, improved prediction accuracy from ~52% to 70%, which could reduce the time and cost of bringing new drugs to market.
AI IS ON TRACK TO SAFEGUARD RAILWAY INTEGRITY
To maintain the integrity of its 3,232 km of tracks, the Swiss Federal Railways (SBB) runs diagnostic trains to photograph and monitor tracks in real time. But traditional data processing methods produce false positives/negatives. To remedy this, SBB and CSEM (Swiss Research and Development Center) launched the Railcheck project, which applies deep learning, powered by the NVIDIA DGX Station, to improve the automatic detection and classification of faults.
AI HELPS DOCTORS DIAGNOSE BREAST CANCER
Every day, pathologists are tasked with providing cancer diagnoses to guide patient treatment. However, sifting through millions of normal cells to identify a few malignant cells is extremely laborious using conventional methods. PathAI combines GPU deep learning with traditional pathology to improve accuracy, speed diagnosis, and reduce error rates by 85%.
AI HELPS PERSONALIZE IMMUNOTHERAPY
Immunotherapy has a success rate of only 40% and a risk that it may attack healthy cells. Max Kelsen is using sophisticated AI approaches with NVIDIA V100 GPUs to integrate genomic, transcriptomic and patient information to identify a classifier and develop a test that can predict treatment response.
What a distance covered since 2012!
[Chart: 40 years of processor performance, 1980-2020, log scale. GPU-computing performance grows 1.5X per year, on track for 1000X by 2025; single-threaded CPU performance, which once grew 1.5X per year, now grows only 1.1X per year. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data for 2010-2015 collected by K. Rupp.]
RISE OF GPU COMPUTING
CUDA, a domain-specific computing architecture: 10X in 5 years
[Stack diagram: APPLICATIONS / SYSTEMS / ALGORITHMS / CUDA / ARCHITECTURE]
NVIDIA CONFIDENTIAL – DO NOT DISTRIBUTE
DEEP LEARNING: EXPONENTIAL PERFORMANCE IMPROVEMENTS (500X in 5 years!)
Alex Krizhevsky won the ImageNet competition in 2012.
[Chart: time to train AlexNet on the ImageNet dataset, 2012 vs 2018]
NEURAL NETWORK COMPLEXITY IS EXPLODING: Bigger and More Compute Intensive
[Charts: model complexity (GOP x bandwidth) over time]
- Speech, 2013-2018: DeepSpeech, DeepSpeech 2, DeepSpeech 3 (30X)
- Image, 2011-2017: AlexNet, GoogleNet, Inception-v2, ResNet-50, Inception-v4 (350X)
- Translation, 2014-2018: GNMT, OpenNMT, MoE (10X)
REVOLUTIONARY AI PERFORMANCE: 3X Faster DL Training Performance
3X reduction in time to train over P100
[Chart: relative time-to-train improvements (LSTM)]
- 2x CPU: 15 days
- 1x P100: 18 hours
- 1x V100: 6 hours
Neural machine translation training for 13 epochs | German -> English, WMT15 subset | CPU = 2x Xeon E5-2699 v4
Over 80X DL Training Performance in 3 Years
[Chart: GoogleNet training speedup on successive cuDNN versions, vs 1x K80 with cuDNN 2]
- Q1'15: 1x K80, cuDNN 2 (baseline)
- Q3'15: 4x M40, cuDNN 3
- Q2'16: 8x P100, cuDNN 6
- Q2'17: 8x V100, cuDNN 7 (over 80x)
DGX-1: 140X FASTER THAN CPU
10X PERFORMANCE GAIN IN LESS THAN A YEAR
DGX-1 (Sep '17) vs DGX-2 (Q3 '18), with software improvements across the stack including NCCL, cuDNN, etc.
[Chart: time to train (days); workload: FairSeq, 55 epochs to solution; PyTorch training performance]
- DGX-1 with V100: 15 days
- DGX-2: 1.5 days (10 times faster)
AI AND HPC BENCHMARKS: DGX-2 VS CPU
Replace CPU nodes: save money, power and space in the data center.
[Charts: single-node speed-up, dual-socket CPU (1x) vs HGX-2]
- AI training: HGX-2 replaces 300 CPU-only server nodes (300X). Workload: ResNet-50, 90 epochs to solution | CPU server: dual-socket Intel Xeon Gold 6140
- HPC: HGX-2 replaces 60 CPU-only server nodes (60X). Workload: MILC (particle physics HPC application) | CPU server: dual-socket Intel Xeon Gold 6140
WORLD'S MOST PERFORMANT INFERENCE PLATFORM
Up to 36X faster than CPUs | Accelerates all AI workloads
[Charts: speedup vs CPU server (1.0x baseline)]
- Natural language processing inference (GNMT): Tesla P4 10X, Tesla T4 36X
- Speech inference (DeepSpeech 2): Tesla P4 4X, Tesla T4 21X
- Video inference (ResNet-50, 7ms latency limit): Tesla P4 10X, Tesla T4 27X
[Chart: peak performance, TFLOPS/TOPS]
- Tesla P4: 5.5 TFLOPS (float) | 22 TOPS (INT8)
- Tesla T4: 65 TFLOPS (float) | 130 TOPS (INT8) | 260 TOPS (INT4)
VOLTA TENSOR CORE GPUS POWER SUMMIT: WORLD'S FASTEST AI SUPERCOMPUTER
122 PetaFLOPS HPC | 3 ExaFLOPS AI
27,648 Volta Tensor Core GPUs
DELIVERING MAJORITY OF THE NEW COMPUTING PERFORMANCE
NVIDIA GPUs' share of new FLOPS on Top 500 systems:
- 2015 (Tesla K80): 11%
- 2017 (Tesla P100): 25%
- 2018 (Tesla V100): 56%
Thanks to NVIDIA innovation!
FUSION OF HPC & AI
The Volta Tensor Core GPU fuses HPC & AI computing through multi-precision computing:
- HPC (simulation): FP64, FP32
- AI (deep learning): FP16, INT8
TENSOR CORE: Mixed-Precision Matrix Math on 4x4 Matrices
New CUDA TensorOp instructions & data formats; a 4x4 matrix processing array computes
D[FP32] = A[FP16] * B[FP16] + C[FP32]
Using Tensor Cores via:
- Volta-optimized frameworks and libraries (cuDNN, cuBLAS, TensorRT, ...)
- CUDA C++ warp-level matrix operations
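The mixed-precision contract D[FP32] = A[FP16] * B[FP16] + C[FP32] can be sketched in NumPy: operands stored in FP16, products accumulated in FP32 (matrix sizes and values here are illustrative, not the hardware path):

```python
import numpy as np

# Illustrative 4x4 operands; a Tensor Core consumes FP16 inputs.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = np.zeros((4, 4), dtype=np.float32)

# D = A * B + C with FP32 accumulation: promote before the matmul
# so the sum-of-products is carried in FP32, as the Tensor Core does.
D = A.astype(np.float32) @ B.astype(np.float32) + C

# Accumulating in FP16 instead incurs extra rounding error on the same inputs.
D_fp16_accum = (A @ B).astype(np.float32) + C
```

The FP32 accumulator is what lets mixed-precision training halve memory traffic without giving up model accuracy.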
FASTER RESULTS ON COMPLEX DL AND HPC
V100: up to 50% faster results with 2x the memory (V100 16GB vs V100 32GB)

FASTER RESULTS
- Neural machine translation (NMT): 1.5X faster language translation (0.8 step/sec -> 1.2 step/sec)
- 3D FFT 1k x 1k x 1k: 1.5X faster calculations (2.5 TF -> 3.8 TF)
Dual E5-2698v4 server, 512GB DDR4, Ubuntu 16.04, CUDA9, cuDNN7 | NMT is GNMT-like and run with TensorFlow NGC Container 18.01 (batch size = 128 for 16GB and 256 for 32GB) | FFT is with cufftbench 1k x 1k x 1k, comparing 2 V100 16GB (DGX1V) vs. 2 V100 32GB (DGX1V)

HIGHER ACCURACY
- Object detection: 40% lower error rate (accuracy with 16 layers vs accuracy with 152 layers)
R-CNN for object detection at 1080p with Caffe | V100 16GB uses VGG-16 | V100 32GB uses ResNet-152

HIGHER RESOLUTION
- Unsupervised image translation (a GAN converts an input winter photo to summer): 4X higher resolution (512x512 -> 1024x1024 images)
GAN by NV Research (https://arxiv.org/pdf/1703.00848.pdf) | V100 16GB and V100 32GB with FP32
NVLINK AND MULTI-GPU SCALING
For data-parallel training
[Diagrams: two 8-GPU server topologies, each with two CPUs and PCIe switches]
PCIe-based system:
- Data loading over PCIe
- Gradient averaging over PCIe and the QPI link
- Data loading and gradient averaging share communication resources: congestion
NVLink-based system:
- Data loading over PCIe
- Gradient averaging over NVLink
- No sharing of communication resources: no congestion

30% BETTER PERFORMANCE WITH NVLINK THAN PCIE
- Encoder and decoder embedding size of 512
- Batch size of 256 per GPU
- NVIDIA DGX containers version 17.11, processing real data with cuDNN 7.0.4, NCCL 2.1.2
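The gradient-averaging step that NVLink accelerates can be sketched with plain NumPy (worker count and tensor shape are illustrative; in practice NCCL performs this as a ring all-reduce across GPUs):

```python
import numpy as np

def allreduce_mean(local_grads):
    """Average one gradient tensor across all workers.

    This is the collective that NCCL runs over NVLink in data-parallel
    training: every worker contributes its local gradient and receives
    the same averaged result.
    """
    return sum(local_grads) / len(local_grads)

# Hypothetical 4-GPU job: each worker computed gradients on its own
# mini-batch shard.
rng = np.random.default_rng(1)
grads = [rng.standard_normal(3) for _ in range(4)]

avg = allreduce_mean(grads)
# All replicas now apply the identical update, keeping the model in sync.
```

Because this collective runs once per training step over the full gradient volume, its bandwidth (NVLink vs shared PCIe/QPI) directly bounds multi-GPU scaling.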
NEW TURING TENSOR CORE
MULTI-PRECISION FOR AI INFERENCE
65 TFLOPS FP16 | 130 TeraOPS INT8 | 260 TeraOPS INT4
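INT8 and INT4 inference work by mapping FP32 tensors onto small integer ranges. A minimal symmetric INT8 quantization sketch (the scale selection is deliberately simplified; real deployments, e.g. with TensorRT, calibrate scales per tensor or per channel):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization of an FP32 tensor to INT8."""
    scale = np.abs(x).max() / 127.0   # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 values back to approximate FP32."""
    return q.astype(np.float32) * scale

x = np.array([0.5, -1.0, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
# Round-trip error is bounded by half a quantization step (scale / 2).
```

Integer math is where the 130/260 TOPS figures come from: INT8 and INT4 multiply-accumulates cost far less silicon and energy per operation than FP32.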
RAPIDS: GPU-Accelerated Data Science
RAPIDS is a set of open-source libraries for GPU-accelerating data preparation and machine learning.
rapids.ai | Announced at GTC Europe
NVIDIA GPU CLOUD (NGC): Simple Access to GPU-Accelerated Software
- Discover 35 optimized containers
- Deploy applications in minutes, not days
- Run anywhere with maximum performance: GPU-powered cloud servers and workstations
- Accelerate time to market
DEEP LEARNING CONTAINERS ON NGC
The power to run multiple frameworks at once.
[Diagram: NVIDIA DGX-1 software stack]
- Containerized applications: TensorFlow, CNTK, Caffe2, PyTorch, and other frameworks and apps, each packaged with tuned SW, NVIDIA Docker and the CUDA runtime
- Linux kernel + CUDA driver
Container images are portable across new driver versions.
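Running a framework from NGC is a pull-and-run workflow. A sketch of what that looks like from the command line (the image tag `18.09-py3` and the use of the NVIDIA Docker runtime are assumptions appropriate to late 2018; an NGC account and API key are required):

```shell
# Authenticate against the NGC registry with your NGC API key.
docker login nvcr.io

# Pull a GPU-optimized TensorFlow container (tag is illustrative).
docker pull nvcr.io/nvidia/tensorflow:18.09-py3

# Launch it interactively with GPU access via the NVIDIA container runtime.
docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tensorflow:18.09-py3
```

Because the CUDA runtime and tuned libraries ship inside the image, the same container runs unmodified on a DGX system, a workstation, or a cloud GPU instance.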
TESLA PRODUCT FAMILY

END-TO-END PRODUCT FAMILY
- Desktop: TITAN
- Workstation: Quadro, DGX Station (fully integrated AI system)
- Data center, HPC/training: Tesla V100 (V100 PCIe), DGX-1, DGX-2 (fully integrated AI systems), HGX server configs
- Data center, inference: Tesla T4
- Virtual workstation: Virtual GPU
- Automotive / embedded: Drive AGX Pegasus, Jetson AGX Xavier
TESLA V100: WORLD'S MOST ADVANCED DATA CENTER GPU
- 5,120 CUDA cores
- 640 new Tensor cores
- 7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS | 125 Tensor TFLOPS
- 20MB SM register file | 16MB cache
- 16GB/32GB HBM2 @ 900GB/s | 300GB/s NVLink
- Data center ready: 24/7 uptime, scalable performance
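The headline TFLOPS figures follow from the unit counts and the clock speed. A quick sanity check (the ~1.53 GHz boost clock is an assumption taken from public V100 specifications, not stated on this slide):

```python
# Peak throughput = units x FLOPs-per-unit-per-clock x clock (GHz -> TFLOPS).
boost_clock_ghz = 1.53          # assumed V100 boost clock

cuda_cores = 5120
fp32_tflops = cuda_cores * 2 * boost_clock_ghz / 1000    # FMA = 2 FLOPs/clock
fp64_tflops = fp32_tflops / 2                            # FP64 runs at half rate

tensor_cores = 640
# Each Tensor Core does a 4x4x4 matrix FMA per clock: 64 MACs = 128 FLOPs.
tensor_tflops = tensor_cores * 128 * boost_clock_ghz / 1000

print(round(fp32_tflops, 1), round(fp64_tflops, 1), round(tensor_tflops))
# -> 15.7 7.8 125
```

The same arithmetic explains the 8x ratio between Tensor TFLOPS and FP32 TFLOPS: each Tensor Core delivers 128 FLOPs per clock versus 2 for a CUDA core.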
TESLA T4: WORLD'S MOST ADVANCED INFERENCE GPU
Universal inference acceleration
- 320 Turing Tensor cores
- 2,560 CUDA cores
- 65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS
- 16GB @ 320GB/s
NVIDIA DGX-1 WITH VOLTA: Highest Performance, Fully Integrated HW System
1 PetaFLOPS | 8x Tesla V100 32GB | 300 GB/s NVLink hybrid cube mesh
2x Xeon | 7 TB SSD RAID 0 | Quad IB/Ethernet 100Gbps, dual 10GbE | 3U, 3,200W
NEW NVIDIA DGX-2: The Largest GPU Ever Created
2 PFLOPS | 512GB HBM2 | 16 TB/sec memory bandwidth | 10 kW | 160 kg
POWERING THE DEEP LEARNING ECOSYSTEM
The NVIDIA Deep Learning SDK and CUDA accelerate every major deep learning framework, across:
- Computer vision: object detection, image classification
- Speech & audio: voice recognition, language translation
- Natural language processing: recommendation engines, sentiment analysis
developer.nvidia.com/deep-learning-software
The more you buy, the more you save!
"The More GPUs You Buy, The More You Save" —Jensen Huang
- Traditional hyperscale cluster: 300 dual-CPU servers | $3M | 180 kW
- NVIDIA DGX-2 for deep learning: 1 DGX-2 | $399K | 10 kW
1/8 the cost | 1/60 the space | 1/18 the power
NVIDIA TESLA PLATFORM SAVES MONEY: Game-Changing Inference Performance
Inference workload: image recognition using ResNet-50
- 160 CPU servers: 45,000 images/sec, 65 kW
- 1 HGX server: 45,000 images/sec, 3 kW
Same throughput | 1/20 the space | 1/22 the power
NVIDIA PROGRAMS

DEEP LEARNING INSTITUTE
DLI mission: help the world solve the most challenging problems using AI and deep learning.
We help developers, data scientists and engineers get started in architecting, optimizing, and deploying neural networks to solve real-world problems in diverse industries such as autonomous vehicles, healthcare, robotics, media & entertainment and game development.
HOW TO ACCESS DLI TRAINING

SELF-PACED ONLINE
- Get started anywhere, any time with access to a GPU-accelerated workstation in the cloud
- All you need is a device with an Internet connection
- Courses (8 hrs) are $90; electives (2 hrs) are $0-30
- Take online courses at www.nvidia.com/dli

INSTRUCTOR-LED WORKSHOPS
- Full-day workshops are available by request
- Workshops are delivered by DLI-certified instructors through NVIDIA or DLI partners
- MSRP: $10K/day for up to 20 attendees (EDU pricing available)
- Request a workshop at www.nvidia.com/requestDLI

INDUSTRY CONFERENCES
- Training available as instructor-led and self-paced at industry events
- Deep learning presentations offered for business & technology leaders
- Special training pass available for GTC (NVIDIA's GPU Technology Conference)
- View upcoming DLI workshops at www.nvidia.com/dli
RICH CONTENT PORTFOLIO
Fundamentals and advanced hands-on training in key technologies and application domains that matter:
- Deep Learning Fundamentals
- Accelerated Computing Fundamentals
- Finance
- Medical Image Analysis
- Autonomous Vehicles
- Genomics
- Game Development
- Digital Content Creation
More industry-specific training coming soon…
UNIVERSITY AMBASSADOR PROGRAM: Training the Next Generation of AI Practitioners
The University Ambassador Program enables qualified faculty and researchers to teach DLI courses to their students and academic staff at no cost. 40 universities around the world are part of the program.