Jeremy Appleyard
July 2016
PASCAL AND CUDA 8.0
TESLA P100
New GPU Architecture to Enable the World's Fastest Compute Node

• Pascal Architecture: Highest Compute Performance
• NVLink: GPU Interconnect for Maximum Scalability
• CoWoS HBM2: Unifying Compute & Memory in a Single Package
• Page Migration Engine: Simple Parallel Programming with a Virtually Unlimited Memory Space

[Diagram: Unified Memory spanning the CPU and Tesla P100]
GIANT LEAPS IN EVERYTHING

• Pascal Architecture: 21 Teraflops of FP16 for Deep Learning
• NVLink: 5x GPU-GPU Bandwidth
• CoWoS HBM2 Stacked Memory: 3x Higher Bandwidth for Massive Data Workloads
• Page Migration Engine: Virtually Unlimited Memory Space

[Charts: Teraflops (FP32/FP16), bidirectional NVLink bandwidth (GB/s), memory bandwidth (GB/s), and addressable memory (GB) for K40, M40, P100 (FP32), and P100 (FP16)]
HIGHEST ABSOLUTE PERFORMANCE DELIVERED
NVLink for Max Scalability: More than 45x Faster with 8x P100

[Chart: speed-up vs. a dual-socket Haswell system for Caffe/AlexNet, VASP, HOOMD-blue, COSMO, MILC, Amber, and HACC, comparing 2x Broadwell CPU, 2x K80 (M40 for AlexNet), 2x P100, 4x P100, and 8x P100]
PASCAL ARCHITECTURE
TESLA P100 GPU: GP100

• 56 SMs
• 3584 CUDA Cores
• 5.3 TF Double Precision
• 10.6 TF Single Precision
• 21.2 TF Half Precision
• 16 GB HBM2
• 720 GB/s Bandwidth
GPU PERFORMANCE COMPARISON

                          P100    M40           K40
Double Precision TFlop/s  5.3     0.2           1.4
Single Precision TFlop/s  10.6    7.0           4.3
Half Precision TFlop/s    21.2    NA            NA
Memory Bandwidth (GB/s)   720     288           288
Memory Size               16 GB   12 GB, 24 GB  12 GB
GP100 SM

CUDA Cores      64
Register File   256 KB
Shared Memory   64 KB
Active Threads  2048
Active Blocks   32
[Diagram: Maxwell SM vs. P100 SM layouts, showing cores, FP64 units, LD/ST units, SFUs, registers, warps, and shared memory per partition]

More resources per core:
• 2x Registers
• 1.33x Shared Memory Capacity
• 2x Shared Memory Bandwidth
• 2x Warps

Higher Instruction Throughput
IEEE 754 FLOATING POINT ON GP100
3 sizes, 3 speeds, all fast

Feature            Half precision    Single precision  Double precision
Layout             s5.10             s8.23             s11.52
Issue rate         pair every clock  1 every clock     1 every 2 clocks
Subnormal support  Yes               Yes               Yes
Atomic Addition    Yes               Yes               Yes
HALF-PRECISION FLOATING POINT (FP16)

• 16 bits: 1 sign bit, 5 exponent bits, 10 fraction bits
• 2^40 dynamic range
• Normalized values: 1024 values for each power of 2, from 2^-14 to 2^15
• Subnormals at full speed: 1024 values from 2^-24 to 2^-15
• Special values: +/- Infinity, Not-a-Number

[Diagram: FP16 bit layout: sign | exponent | fraction]

USE CASES
• Deep Learning Training
• Radio Astronomy
• Sensor Data
• Image Processing
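The "pair every clock" issue rate above is exposed through the paired half2 type: GP100 executes FP16 math two values per instruction. A minimal sketch, assuming CUDA 8's cuda_fp16.h intrinsics and a compute capability 5.3+ device (the kernel name and launch configuration are illustrative, not from the slides):

```cuda
#include <cuda_fp16.h>

// Illustrative kernel: y = a*x + y on packed half2 data. Each __hfma2
// performs a fused multiply-add on two FP16 values at once, which is how
// GP100 reaches 2x FP16 throughput relative to FP32.
__global__ void haxpy_half2(int n2, __half a, const __half2 *x, __half2 *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    __half2 a2 = __halves2half2(a, a);   // broadcast the scalar into both lanes
    if (i < n2)
        y[i] = __hfma2(a2, x[i], y[i]);  // fused multiply-add, 2 halves per op
}

// Launch over n2 = n/2 half2 elements, e.g.:
//   haxpy_half2<<<(n2 + 255) / 256, 256>>>(n2, a, d_x, d_y);
```

Storing data as half2 rather than scalar __half also halves memory traffic per element, which matters as much as the arithmetic rate for the bandwidth-bound use cases listed above.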
NVLINK
• P100 supports 4 NVLinks
• Up to 94% bandwidth efficiency
• Supports reads/writes/atomics to peer GPUs
• Supports read/write access to an NVLink-enabled CPU
• Links can be ganged for higher bandwidth

[Diagram: NVLink on Tesla P100, four links at 40 GB/s each]
NVLINK - GPU CLUSTER

• Two fully connected quads, connected at corners
• 160 GB/s per GPU bidirectional to peers
• Load/store access to peer memory
• Full atomics to peer GPUs
• High-speed copy engines for bulk data copy
• PCIe to/from CPU
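The load/store and atomic peer access described above is enabled through the standard CUDA peer-access API. A minimal sketch (untested; device numbering is an assumption):

```cuda
#include <cuda_runtime.h>

// Let `dev` make direct loads/stores and atomics over NVLink to memory
// that was allocated on `peer`.
void enable_peer_access(int dev, int peer)
{
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, dev, peer);
    if (canAccess) {
        cudaSetDevice(dev);
        cudaDeviceEnablePeerAccess(peer, 0);  // flags must be 0
        // Kernels running on `dev` can now dereference pointers returned
        // by cudaMalloc on `peer`. Bulk transfers can still use the copy
        // engines, e.g.:
        //   cudaMemcpyPeerAsync(dst, peer, src, dev, bytes, stream);
    }
}
```

Fine-grained peer loads/stores suit irregular access patterns; the copy engines remain the better choice for large contiguous transfers, as the slide notes.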
NVLINK TO CPU

• Fully connected quad
• 120 GB/s per GPU bidirectional for peer traffic
• 40 GB/s per GPU bidirectional to CPU
• Direct load/store access to CPU memory
• High-speed copy engines for bulk data movement
UNIFIED MEMORY
PAGE MIGRATION ENGINE
Supports Virtual Memory Demand Paging

• 49-bit virtual addresses: sufficient to cover the 48-bit CPU address space plus all GPU memory
• GPU page faulting capability: can handle thousands of simultaneous page faults
• Up to 2 MB page size: better TLB coverage of GPU memory
KEPLER/MAXWELL UNIFIED MEMORY (CUDA 6+)

Performance Through Data Locality
• Migrate data to the accessing processor
• Guarantee global coherency
• Still allows explicit hand tuning

Simpler Programming & Memory Model
• Single allocation, single pointer, accessible anywhere
• Eliminates the need for explicit copies
• Greatly simplifies code porting

Allocate up to GPU memory size

[Diagram: Unified Memory spanning a Kepler GPU and the CPU]
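The single-allocation, single-pointer model above can be sketched with cudaMallocManaged (a minimal, illustrative example; the kernel and sizes are assumptions, not from the slides):

```cuda
#include <cuda_runtime.h>

__global__ void increment(int *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1;
}

int main()
{
    const int n = 1 << 20;
    int *a;
    cudaMallocManaged(&a, n * sizeof(int));  // one allocation, one pointer

    for (int i = 0; i < n; ++i) a[i] = i;    // CPU writes, no cudaMemcpy

    increment<<<(n + 255) / 256, 256>>>(a, n);
    cudaDeviceSynchronize();                 // on pre-Pascal GPUs the CPU
                                             // must not touch managed data
                                             // while a kernel may be running
    // The result is now visible to the CPU through the same pointer.
    cudaFree(a);
    return 0;
}
```

On Kepler/Maxwell the runtime migrates whole allocations eagerly and the allocation is capped at GPU memory size; the next slide shows how Pascal relaxes both constraints.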
PASCAL UNIFIED MEMORY (CUDA 8)
Large Datasets, Simple Programming, High Performance

Enable Large Data Models
• Oversubscribe GPU memory
• Allocate up to system memory size

Tune Unified Memory Performance
• Usage hints via the cudaMemAdvise API
• Explicit prefetching API

Simpler Data Access
• CPU/GPU data coherence
• Unified memory atomic operations

Allocate beyond GPU memory size

[Diagram: Unified Memory spanning a Pascal GPU and the CPU]
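The tuning APIs named above can be sketched as follows (untested; the function name and the specific hint choices are illustrative assumptions, not a prescription from the slides):

```cuda
#include <cuda_runtime.h>

// On Pascal the managed allocation may exceed GPU memory size;
// pages migrate on demand via the Page Migration Engine.
void tune_managed(float *data, size_t bytes, cudaStream_t stream)
{
    int dev = 0;
    cudaGetDevice(&dev);

    // Hint: prefer keeping these pages resident on the GPU...
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, dev);
    // ...while also mapping them for CPU access, so occasional CPU reads
    // do not force migration back to host memory.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);

    // Explicit prefetch: move the pages to the GPU ahead of a kernel
    // launch instead of paying a per-page fault cost on first touch.
    cudaMemPrefetchAsync(data, bytes, dev, stream);
}
```

The hints are advisory: the runtime remains free to migrate pages, so correctness never depends on them, only performance.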
GPU OVERSUBSCRIPTION
HPGMG: high-performance multi-grid

[Chart: HPGMG performance on Tesla K40 (12 GB) and Tesla P100 (16 GB) as the working set grows beyond GPU memory size*]

*Tesla P100 performance is very early modelling results
TESLA P100
New GPU Architecture to Enable the World's Fastest Compute Node

More P100 features: compute preemption, new instructions, larger L2 cache, more...

Find out more at http://devblogs.nvidia.com/parallelforall/inside-pascal
CUDA 8.0

Pascal Support
• New architecture, stacked memory, NVLink

Unified Memory
• Simple parallel programming with large virtual memory

Libraries
• nvGRAPH: library for accelerating graph analytics apps
• FP16 computation to boost Deep Learning workloads

Developer Tools
• Critical Path Analysis to speed overall app tuning
• OpenACC profiling to optimize directive performance
• Single-GPU debugging on Pascal
GTC EUROPE 2016

• GPU Technology Conference in Europe
• Amsterdam, 28–29 September
• Call for speakers closes on August 21st
• https://www.gputechconf.eu

Thanks for listening!
jappleyard@nvidia.com