Jeremy Appleyard
July 2016
PASCAL AND CUDA 8.0
TESLA P100
New GPU Architecture to Enable the World's Fastest Compute Node

• Pascal Architecture: Highest Compute Performance
• NVLink: GPU Interconnect for Maximum Scalability
• CoWoS HBM2: Unifying Compute & Memory in a Single Package
• Page Migration Engine: Simple Parallel Programming with a Virtually Unlimited Memory Space

[Diagram: Unified Memory spanning the CPU and Tesla P100]
GIANT LEAPS IN EVERYTHING

• Pascal Architecture: 21 Teraflops of FP16 for Deep Learning
• NVLink: 5x GPU-GPU Bandwidth
• CoWoS HBM2 Stacked Memory: 3x Higher Bandwidth for Massive Data Workloads
• Page Migration Engine: Virtually Unlimited Memory Space

[Charts: Teraflops (FP32/FP16), bidirectional NVLink bandwidth (GB/s), memory bandwidth (GB/s), and addressable memory (GB) for K40, M40, P100 (FP32), and P100 (FP16)]
HIGHEST ABSOLUTE PERFORMANCE DELIVERED
NVLink for Max Scalability: More than 45x Faster with 8x P100

[Chart: speed-up vs. a dual-socket Haswell system for Caffe/AlexNet, VASP, HOOMD-blue, COSMO, MILC, Amber, and HACC, comparing 2x Broadwell CPU, 2x K80 (M40 for AlexNet), 2x P100, 4x P100, and 8x P100]
PASCAL ARCHITECTURE
TESLA P100 GPU: GP100

• 56 SMs
• 3584 CUDA Cores
• 5.3 TF Double Precision
• 10.6 TF Single Precision
• 21.2 TF Half Precision
• 16 GB HBM2
• 720 GB/s Bandwidth
GPU PERFORMANCE COMPARISON

                          P100    M40           K40
Double Precision TFlop/s  5.3     0.2           1.4
Single Precision TFlop/s  10.6    7.0           4.3
Half Precision TFlop/s    21.2    NA            NA
Memory Bandwidth (GB/s)   720     288           288
Memory Size               16 GB   12 GB, 24 GB  12 GB
GP100 SM

CUDA Cores      64
Register File   256 KB
Shared Memory   64 KB
Active Threads  2048
Active Blocks   32
[Diagram: Maxwell SM vs. P100 SM layouts, showing cores, FP64 units, LD/ST units, SFUs, registers, warps, and shared memory per partition]

More resources per core:
• 2x Registers
• 1.33x Shared Memory Capacity
• 2x Shared Memory Bandwidth
• 2x Warps

Higher Instruction Throughput
IEEE 754 FLOATING POINT ON GP100
3 sizes, 3 speeds, all fast

Feature            Half precision    Single precision  Double precision
Layout             s5.10             s8.23             s11.52
Issue rate         pair every clock  1 every clock     1 every 2 clocks
Subnormal support  Yes               Yes               Yes
Atomic Addition    Yes               Yes               Yes
HALF-PRECISION FLOATING POINT (FP16)

• 16 bits: 1 sign bit, 5 exponent bits, 10 fraction bits
• 2^40 dynamic range
• Normalized values: 1024 values for each power of 2, from 2^-14 to 2^15
• Subnormals at full speed: 1024 values from 2^-24 to 2^-15
• Special values: +/- Infinity, Not-a-Number

[Diagram: FP16 bit layout: sign | exponent | fraction]

USE CASES
• Deep Learning Training
• Radio Astronomy
• Sensor Data
• Image Processing
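The "pair every clock" issue rate above is exposed through the paired half2 type: GP100 executes FP16 math two values per instruction. A minimal sketch, assuming CUDA 8's cuda_fp16.h intrinsics and a compute capability 5.3+ device (the kernel name and launch configuration are illustrative, not from the slides):

```cuda
#include <cuda_fp16.h>

// Illustrative kernel: y = a*x + y on packed half2 data. Each __hfma2
// performs a fused multiply-add on two FP16 values at once, which is how
// GP100 reaches 2x FP16 throughput relative to FP32.
__global__ void haxpy_half2(int n2, __half a, const __half2 *x, __half2 *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    __half2 a2 = __halves2half2(a, a);   // broadcast the scalar into both lanes
    if (i < n2)
        y[i] = __hfma2(a2, x[i], y[i]);  // fused multiply-add, 2 halves per op
}

// Launch over n2 = n/2 half2 elements, e.g.:
//   haxpy_half2<<<(n2 + 255) / 256, 256>>>(n2, a, d_x, d_y);
```

Storing data as half2 rather than scalar __half also halves memory traffic per element, which matters as much as the arithmetic rate for the bandwidth-bound use cases listed above.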
NVLINK
• P100 supports 4 NVLinks
• Up to 94% bandwidth efficiency
• Supports reads/writes/atomics to peer GPUs
• Supports read/write access to an NVLink-enabled CPU
• Links can be ganged for higher bandwidth

[Diagram: NVLink on Tesla P100, four links at 40 GB/s each]
NVLINK - GPU CLUSTER

• Two fully connected quads, connected at corners
• 160 GB/s per GPU bidirectional to peers
• Load/store access to peer memory
• Full atomics to peer GPUs
• High-speed copy engines for bulk data copy
• PCIe to/from CPU
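The load/store and atomic peer access described above is enabled through the standard CUDA peer-access API. A minimal sketch (untested; device numbering is an assumption):

```cuda
#include <cuda_runtime.h>

// Let `dev` make direct loads/stores and atomics over NVLink to memory
// that was allocated on `peer`.
void enable_peer_access(int dev, int peer)
{
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, dev, peer);
    if (canAccess) {
        cudaSetDevice(dev);
        cudaDeviceEnablePeerAccess(peer, 0);  // flags must be 0
        // Kernels running on `dev` can now dereference pointers returned
        // by cudaMalloc on `peer`. Bulk transfers can still use the copy
        // engines, e.g.:
        //   cudaMemcpyPeerAsync(dst, peer, src, dev, bytes, stream);
    }
}
```

Fine-grained peer loads/stores suit irregular access patterns; the copy engines remain the better choice for large contiguous transfers, as the slide notes.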
NVLINK TO CPU

• Fully connected quad
• 120 GB/s per GPU bidirectional for peer traffic
• 40 GB/s per GPU bidirectional to CPU
• Direct load/store access to CPU memory
• High-speed copy engines for bulk data movement
UNIFIED MEMORY
PAGE MIGRATION ENGINE
Supports Virtual Memory Demand Paging

• 49-bit virtual addresses: sufficient to cover the 48-bit CPU address space plus all GPU memory
• GPU page faulting capability: can handle thousands of simultaneous page faults
• Up to 2 MB page size: better TLB coverage of GPU memory
KEPLER/MAXWELL UNIFIED MEMORY (CUDA 6+)

Performance Through Data Locality
• Migrate data to the accessing processor
• Guarantee global coherency
• Still allows explicit hand tuning

Simpler Programming & Memory Model
• Single allocation, single pointer, accessible anywhere
• Eliminates the need for explicit copies
• Greatly simplifies code porting

Allocate up to GPU memory size

[Diagram: Unified Memory spanning a Kepler GPU and the CPU]
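The single-allocation, single-pointer model above can be sketched with cudaMallocManaged (a minimal, illustrative example; the kernel and sizes are assumptions, not from the slides):

```cuda
#include <cuda_runtime.h>

__global__ void increment(int *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1;
}

int main()
{
    const int n = 1 << 20;
    int *a;
    cudaMallocManaged(&a, n * sizeof(int));  // one allocation, one pointer

    for (int i = 0; i < n; ++i) a[i] = i;    // CPU writes, no cudaMemcpy

    increment<<<(n + 255) / 256, 256>>>(a, n);
    cudaDeviceSynchronize();                 // on pre-Pascal GPUs the CPU
                                             // must not touch managed data
                                             // while a kernel may be running
    // The result is now visible to the CPU through the same pointer.
    cudaFree(a);
    return 0;
}
```

On Kepler/Maxwell the runtime migrates whole allocations eagerly and the allocation is capped at GPU memory size; the next slide shows how Pascal relaxes both constraints.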
PASCAL UNIFIED MEMORY (CUDA 8)
Large Datasets, Simple Programming, High Performance

Enable Large Data Models
• Oversubscribe GPU memory
• Allocate up to system memory size

Tune Unified Memory Performance
• Usage hints via the cudaMemAdvise API
• Explicit prefetching API

Simpler Data Access
• CPU/GPU data coherence
• Unified memory atomic operations

Allocate beyond GPU memory size

[Diagram: Unified Memory spanning a Pascal GPU and the CPU]
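The tuning APIs named above can be sketched as follows (untested; the function name and the specific hint choices are illustrative assumptions, not a prescription from the slides):

```cuda
#include <cuda_runtime.h>

// On Pascal the managed allocation may exceed GPU memory size;
// pages migrate on demand via the Page Migration Engine.
void tune_managed(float *data, size_t bytes, cudaStream_t stream)
{
    int dev = 0;
    cudaGetDevice(&dev);

    // Hint: prefer keeping these pages resident on the GPU...
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, dev);
    // ...while also mapping them for CPU access, so occasional CPU reads
    // do not force migration back to host memory.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);

    // Explicit prefetch: move the pages to the GPU ahead of a kernel
    // launch instead of paying a per-page fault cost on first touch.
    cudaMemPrefetchAsync(data, bytes, dev, stream);
}
```

The hints are advisory: the runtime remains free to migrate pages, so correctness never depends on them, only performance.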
GPU OVERSUBSCRIPTION
HPGMG: high-performance multi-grid

[Chart: HPGMG performance on Tesla K40 (12 GB) and Tesla P100 (16 GB) as the working set grows beyond GPU memory size*]

*Tesla P100 performance is very early modelling results
TESLA P100
New GPU Architecture to Enable the World's Fastest Compute Node

More P100 features: compute preemption, new instructions, larger L2 cache, more...

Find out more at http://devblogs.nvidia.com/parallelforall/inside-pascal
CUDA 8.0

Pascal Support
• New architecture, stacked memory, NVLink

Unified Memory
• Simple parallel programming with large virtual memory

Libraries
• nvGRAPH: library for accelerating graph analytics apps
• FP16 computation to boost Deep Learning workloads

Developer Tools
• Critical Path Analysis to speed overall app tuning
• OpenACC profiling to optimize directive performance
• Single-GPU debugging on Pascal
GTC EUROPE 2016

• GPU Technology Conference in Europe
• Amsterdam, 28–29 September
• Call for speakers closes on August 21st
• https://www.gputechconf.eu

Thanks for listening!
jappleyard@nvidia.com