
Page 1: GPU Tutorial @ Lund Observatory · 2017-10-20

© NVIDIA Corporation 2010

GPU Tutorial @ Lund Observatory

Gernot Ziegler, NVIDIA UK

Page 2

HISTORY / INTRODUCTION

Page 3

[Chart: Parallel vs Sequential Architecture Evolution. High Performance Computing (parallel) architectures: ILLIAC IV, Cray-1, Maspar, Thinking Machines, Blue Gene, Many-Core GPUs. Database/Operating System (sequential) architectures: DEC PDP-1, IBM System 360, VAX, Intel 4004, x86, IBM POWER4, Multi-Core.]

Page 4

Recent History

Specialised machines faded out (e.g. CRAY)

Cost, economies of scale

Intel and AMD chips designed for home/office use

Increasing clock frequencies gave increasing performance

Commodity clusters

Computer gaming drives Graphics Processing Unit (GPU)

NVIDIA and ATI

Page 5

Present

Clock frequency no longer increasing

Power consumption ∝ frequency²

Multi-core dominates

Page 6

GPU Computing

CPU + GPU Co-Processing

[Diagram: a 4-core CPU working alongside a many-core GPU]

Page 7

Graphics Pipelines for the Last 20 Years

Processor per function

T&L evolved to vertex shading

Triangle, point, line – setup

Flat shading, texturing, eventually pixel shading

Blending, Z-buffering, anti-aliasing

Wider and faster over the years

Pipeline stages: Vertex → Triangle → Pixel → ROP → Memory

Page 8

Previous Pipelined Architectures

With separate vertex and pixel shader units, part of the hardware sits idle:

Heavy Geometry Workload: vertex shaders busy, pixel shaders idle (Perf = 4)
Heavy Pixel Workload: pixel shaders busy, vertex shaders idle (Perf = 8)

Page 9

Unified Architecture Replaces the Pipeline Model

The future of GPUs is programmable processing, so build the architecture around the processor.

[Diagram: unified thread processor. The Host feeds a Data Assembler; Vtx, Geom and Pixel Thread Issue units (plus Setup / Rstr / ZCull) dispatch work onto an array of streaming-processor (SP) pairs, each cluster with texture fetch (TF) and L1 cache, backed by shared L2 caches and framebuffer (FB) partitions.]

Page 10

Low Latency or High Throughput?

CPU
Optimised for low-latency access to cached data sets
Control logic for out-of-order and speculative execution

GPU
Optimised for data-parallel, throughput computation
Architecture tolerant of memory latency
More transistors dedicated to computation

[Diagram: the CPU die is dominated by control logic and cache with a few ALUs; the GPU die is dominated by ALUs; both sit above DRAM.]

Page 11

Heterogeneous Computing Domains

Oil & Gas Finance Medical Biophysics Numerics Audio Video Imaging

[Diagram: a spectrum of workloads. CPU (Sequential Computing): instruction-level parallelism, data fits in cache. GPU (Parallel Computing) and Graphics: massive data parallelism, larger data sets.]

Page 12

146X  Medical Imaging (U of Utah)
36X   Molecular Dynamics (U of Illinois, Urbana)
18X   Video Transcoding (Elemental Tech)
50X   Matlab Computing (AccelerEyes)
100X  Astrophysics (RIKEN)
149X  Financial simulation (Oxford)
47X   Linear Algebra (Universidad Jaime)
20X   3D Ultrasound (Techniscan)
130X  Quantum Chemistry (U of Illinois, Urbana)
30X   Gene Sequencing (U of Maryland)

Typical speedups: 50x – 150x

Page 13

Tesla, CUDA & PSC – definitions

CUDA Architecture

Our enabling technology for GPU computing

The architecture of the GPU to support compute, plus C language extensions and a retargeting compiler

Usable with any GeForce 8-series or later GPU

Tesla

Dedicated compute hardware

C1060 and S1070

Fermi: C2050 and S2070

PSC

Personal Super Computer

A desktop machine with at least 3 C1060s

Page 14

NVIDIA Tesla 20-Series (Fermi) Products

Tesla C2050 / C2070 Workstation Board: 1 Tesla GPU; 1030 Gigaflops single precision; 515 Gigaflops double precision; 3 GB (C2050) or 6 GB (C2070) memory

Tesla S2050 / S2070 1U System: 4 Tesla GPUs; 4.12 Teraflops single precision; 2.06 Teraflops double precision; 12 GB (S2050, 3 GB / GPU) or 24 GB (S2070, 6 GB / GPU) memory

Tesla M2050 / M2070 Module: 1 Tesla GPU; 1030 Gigaflops single precision; 515 Gigaflops double precision; 3 GB (M2050) or 6 GB (M2070) memory

The S-series and M-series are data center products; the C-series board targets workstations.

Page 15

The Performance Gap Widens Further

[Chart: peak single-precision performance in GFlops/sec, 2003-2010. NVIDIA GPUs (Tesla 8-series, 10-series, 20-series) pull away from x86 CPUs (Nehalem, 3 GHz). Tesla 20-series: 8x double precision, ECC, L1/L2 caches, 1 TF single precision, 4 GB memory.]

Page 16

GPU Computing Applications

CUDA Parallel Computing Architecture

NVIDIA GPU with the CUDA Parallel Computing Architecture, programmable through C, C++, Fortran, OpenCL™, DirectCompute, Java and Python

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

Page 18

CUDA OVERVIEW

Page 19

Processing Flow

1. Copy input data from CPU memory to GPU memory over the PCI Bus
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory

Page 22

CUDA Parallel Computing Architecture

Parallel computing architecture

and programming model

Includes a CUDA C compiler,

support for OpenCL and

DirectCompute

Architected to natively support

multiple computational

interfaces (standard languages

and APIs)

Page 23

C for CUDA : C with a few keywords

Standard C Code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel C Code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

Page 24

CUDA Parallel Computing Architecture

CUDA defines:

Programming model

Memory model

Execution model

CUDA uses the GPU, but is for general-purpose computing

Facilitate heterogeneous computing: CPU + GPU

CUDA is scalable

Scale to run on 100s of cores/1000s of parallel threads

Page 25

Compiling CUDA C Applications (Runtime API)

    void serial_function(… ) {
        ...
    }
    void other_function(int ... ) {
        ...
    }
    void saxpy_serial(float ... ) {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
    void main( ) {
        float x;
        saxpy_serial(..);
        ...
    }

Modify the key kernels into parallel CUDA code. NVCC (Open64) compiles the C for CUDA kernels into CUDA object files; the rest of the C application goes through the host CPU compiler into CPU object files; the linker combines both into a single CPU-GPU executable.

Page 26

PROGRAMMING MODEL

CUDA Review

Page 27

CUDA Kernels

Parallel portion of application: execute as a kernel

Entire GPU executes kernel, many threads

CUDA threads:

Lightweight

Fast switching

1000s execute simultaneously

CPU (Host): executes functions

GPU (Device): executes kernels

Page 28

CUDA Kernels: Parallel Threads

A kernel is a function executed

on the GPU

Array of threads, in parallel

All threads execute the same

code, can take different paths

Each thread has an ID

Select input/output data

Control decisions

float x = input[threadID];

float y = func(x);

output[threadID] = y;

Page 29

CUDA Kernels: Subdivide into Blocks

Threads are grouped into blocks

Blocks are grouped into a grid

A kernel is executed as a grid of blocks of threads on the GPU

Page 34

Communication Within a Block

Threads may need to cooperate

Memory accesses

Share results

Cooperate using shared memory

Accessible by all threads within a block

Restriction to “within a block” permits scalability: fast communication between N threads is not feasible when N is large

Page 35

Transparent Scalability – G84

A grid of 12 blocks runs two at a time: (1,2), (3,4), (5,6), (7,8), (9,10), (11,12)

Page 36

Transparent Scalability – G80

The same 12 blocks run eight at a time: (1-8), then (9-12)

Page 37

Transparent Scalability – GT200

The same 12 blocks all run at once; the remaining cores idle

Page 38

CUDA Programming Model - Summary

A kernel executes as a grid of thread blocks

A block is a batch of threads that communicate through shared memory

Each block has a block ID; each thread has a thread ID

[Diagram: the Host launches Kernel 1 over a 1D grid (blocks 0-3) and Kernel 2 over a 2D grid (blocks 0,0 to 1,3) on the Device.]

Page 39

MEMORY MODEL

CUDA Review

Page 40

Memory hierarchy

Thread: Registers

Thread: Local memory

Block of threads: Shared memory

All blocks: Global memory

Page 46

Additional Memories

Host can also allocate textures and arrays of constants

Textures and constants have dedicated caches

Page 47

PROGRAMMING ENVIRONMENT

CUDA Review

Page 48

CUDA APIs

API allows the host to manage the devices

Allocate memory & transfer data

Launch kernels

CUDA C “Runtime” API

High level of abstraction - start here!

CUDA C “Driver” API

More control, more verbose

(OpenCL: Similar to CUDA C Driver API)

Page 49

CUDA C and OpenCL

Shared back-end compiler and optimization technology

CUDA C: entry point for developers who prefer high-level C

OpenCL: entry point for developers who want a low-level API

Page 50

Visual Studio

Separate file types

.c/.cpp for host code

.cu for device/mixed code

Compilation rules: cuda.rules

Syntax highlighting

Intellisense

Integrated debugger and

profiler: Nexus

Page 51

Linux

Separate file types

.c/.cpp for host code

.cu for device/mixed code

Typically makefile driven

cuda-gdb for debugging

CUDA Visual Profiler

Page 52

CUDA OPTIMIZATION GUIDELINES

Performance

Page 53

Optimize Algorithms for GPU

Algorithm selection: understand the problem, consider alternate algorithms

Maximize independent parallelism

Maximize arithmetic intensity (math/bandwidth)

Recompute? The GPU allocates transistors to arithmetic, not memory, so it is sometimes better to recompute than to cache

Serial computation on GPU? A low-parallelism computation may still be faster on the GPU than copying to/from the host

Page 54

Optimize Memory Access

Coalesce global memory access

Maximise DRAM efficiency

Order of magnitude impact on performance

Avoid serialization

Minimize shared memory bank conflicts

Understand constant cache semantics

Understand spatial locality

Optimize use of textures to ensure spatial locality

Page 55

Exploit Shared Memory

Hundreds of times faster than global memory

Inter-thread cooperation via shared memory and synchronization

Cache data that is reused by multiple threads

Stage loads/stores to allow reordering

Avoid non-coalesced global memory accesses

Page 56

Use Resources Efficiently

Partition the computation to keep multiprocessors busy: many threads, many thread blocks; multiple GPUs

Monitor per-multiprocessor resource utilization: registers and shared memory

Low utilization per thread block permits multiple active blocks per multiprocessor

Overlap computation with I/O: use asynchronous memory transfers

Page 57

DEBUGGING AND PROFILING

cuda-gdb and Visual Profiler

Page 58

CUDA-GDB

Extended version of GDB with support for C for CUDA

Supported on Linux 32-bit/64-bit systems

Seamlessly debug both the host (CPU) and device (GPU) code

• Set breakpoints on any source line or symbol name

• Single step executes only one warp, except at thread synchronization

• Access and print all CUDA memory allocations: local, global, constant and shared variables

Page 59

Linux GDB

Integration with

EMACS

Page 60

Linux GDB

Integration with

DDD

Page 61

CUDA Driver – Low-level Profiling support

1. Set up environment variables:

    export CUDA_PROFILE=1
    export CUDA_PROFILE_CSV=1
    export CUDA_PROFILE_CONFIG=config.txt
    export CUDA_PROFILE_LOG=profile.csv

2. Set up the configuration file (config.txt):

    gpustarttimestamp
    instructions

3. Run the application:

    matrixMul

4. View the profiler output (profile.csv):

    # CUDA_PROFILE_LOG_VERSION 1.5
    # CUDA_DEVICE 0 GeForce 8800 GT
    # CUDA_PROFILE_CSV 1
    # TIMESTAMPFACTOR fa292bb1ea2c12c
    gpustarttimestamp,method,gputime,cputime,occupancy,instructions
    115f4eaa10e3b220,memcpyHtoD,7.328,12.000
    115f4eaa10e5dac0,memcpyHtoD,5.664,4.000
    115f4eaa10e95ce0,memcpyHtoD,7.328,6.000
    115f4eaa10f2ea60,_Z10dmatrixmulPfiiS_iiS_,19.296,40.000,0.333,4352
    115f4eaa10f443a0,memcpyDtoH,7.776,36.000

Page 62

CUDA Visual Profiler - Overview

• Performance analysis tool to fine tune CUDA applications

• Supported on Linux/Windows/Mac platforms

• Functionality:

• Execute a CUDA application and collect profiling data

• Multiple application runs to collect data for all hardware performance counters

• Profiling data for all kernels and memory transfers

• Analyze profiling data

Page 63

CUDA Visual Profiler – data for kernels

Page 64

CUDA Visual Profiler – computed data for kernels

• Instruction throughput: Ratio of achieved instruction rate to peak single issue instruction rate

• Global memory read throughput (Gigabytes/second)

• Global memory write throughput (Gigabytes/second)

• Overall global memory access throughput (Gigabytes/second)

• Global memory load efficiency

• Global memory store efficiency

Page 65

CUDA Visual Profiler – data for memory transfers

• Memory transfer type and direction

(D=Device, H=Host, A=cuArray)

• e.g. H to D: Host to Device

• Synchronous / Asynchronous

• Memory transfer size, in bytes

• Stream ID

Page 66

CUDA Visual Profiler – data analysis views

• Views:

• Summary table

• Kernel table

• Memcopy table

• Summary plot

• GPU Time Height plot

• GPU Time Width plot

• Profiler counter plot

• Profiler table column plot

• Multi-device plot

• Multi-stream plot

• Analyze profiler counters

• Analyze kernel occupancy

Page 67

CUDA Visual Profiler – Misc.

• Multiple sessions

• Compare views for different sessions

• Comparison Summary plot

• Profiler projects – save & load

• Import/Export profiler data

(.CSV format)

Page 68

NVIDIA Parallel Nsight

Accelerates GPU + CPU application development

The industry’s 1st Development Environment for massively parallel applications

Complete Visual Studio-integrated development environment

Page 69

Parallel Nsight 1.0

Nsight Parallel Debugger

GPU source code debugging

Variable & memory inspection

Nsight Analyzer

Platform-level Analysis

For the CPU and GPU

Nsight Graphics Inspector

Visualize and debug graphics content

Page 70

Source Debugging

Supporting CUDA C and HLSL code.

Hardware breakpoints

GPU memory and variable views

Nsight menu and toolbars

Page 71

View a correlated trace timeline with both CPU and GPU events.

Analysis

Page 72

Detailed tooltips are available for every event on the timeline.

Analysis

Page 73

1.0 System Requirements

Operating System: Windows Server 2008 R2

Windows 7 / Vista

32 or 64-bit

Hardware: GeForce 9 series or higher

Tesla C1060/S1070 or higher

Quadro (G9x or higher)

Visual Studio: Visual Studio 2008 SP1

Page 74

Supported System Configurations

#1: Single machine, Single GPU

Analyzer, Graphics Inspector

#2: Two machines connected over the network

Debugger, Analyzer, Graphics Inspector

TCP/IP

#3: Single SLI Multi-OS machine, two Quadro GPUs

Debugger, Analyzer, Graphics Inspector

Page 75

Parallel Nsight 1.0 Versions

Standard (free): GPU Source Debugger

Graphics Inspector

Professional ($349): Analyzer

Data Breakpoints

Premium ticket-based support

Volume and Site Licensing available

Page 76

NVIDIA Nexus IDE

The industry’s first IDE for massively

parallel applications

Accelerates co-processing (CPU + GPU)

application development

Complete Visual Studio-integrated

development environment

Page 77

NVIDIA Nexus IDE - Debugging

Page 78

NVIDIA Nexus IDE - Profiling

Page 79

RESOURCES

Productivity

Page 80

Getting Started

CUDA Zone

www.nvidia.com/cuda

Introductory tutorials

GPU computing online seminars (aka webinars)

Forums

Documentation

Programming Guide

Best Practices Guide

Examples

CUDA SDK

Page 81

Libraries

NVIDIA

cuBLAS: dense linear algebra (subset of the full BLAS suite)

cuFFT: 1D/2D/3D, real and complex

Third party

NAG: numeric libraries, e.g. RNGs

cuLAPACK/MAGMA

Open Source

Thrust: STL/Boost-style template library

cuDPP: data-parallel primitives (e.g. scan, sort and reduction)

CUSP: sparse linear algebra and graph computation

Many more...

Page 82

Additional material

Page 83

Targeting Multiple Platforms with CUDA

CUDA C / C++

NVCC (NVIDIA CUDA Toolkit): CUDA C/C++ to NVIDIA GPUs

MCUDA (CUDA to multi-core CPUs): http://impact.crhc.illinois.edu/mcuda.php

Ocelot (PTX to multi-core CPUs): http://code.google.com/p/gpuocelot/

Swan (CUDA to OpenCL, for other GPUs): http://www.multiscalelab.org/swan

Page 84

OPTIMIZATION 1:

MEMORY TRANSFERS &

COALESCING

Page 85

Execution Model: Software vs. Hardware

Threads are executed by scalar processors (thread → scalar processor)

Thread blocks are executed on multiprocessors (thread block → multiprocessor); thread blocks do not migrate

Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)

A kernel is launched as a grid of thread blocks (grid → device)

Only one kernel can execute on a device at one time
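The grid/block/thread mapping above boils down to simple index arithmetic. A host-side C++ sketch (the helper name globalThreadId is ours for illustration, not a CUDA API; on the device this is just blockIdx.x * blockDim.x + threadIdx.x):

```cpp
// Host-side model of CUDA's 1D indexing: each thread in a grid of
// blocks gets a unique global index blockIdx * blockDim + threadIdx.
// Illustrative only; on the device these are built-in variables.
int globalThreadId(int blockIdx, int blockDim, int threadIdx) {
    return blockIdx * blockDim + threadIdx;
}
```

For example, thread 5 of block 2 with 256-thread blocks gets global index 2 * 256 + 5 = 517.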

Page 86

Warps and Half Warps

A thread block consists of 32-thread warps

A warp is executed physically in parallel (SIMD) on a multiprocessor

A half-warp of 16 threads can coordinate global memory accesses into a single transaction

(Figure: a thread block on a multiprocessor splits into 32-thread warps; each warp splits into two 16-thread half-warps that access device memory: global and local.)

Page 87

Memory Architecture

(Figure: the host side holds the CPU, chipset, and host DRAM. Device DRAM holds global, constant, texture, and local memory. The GPU contains several multiprocessors, each with registers and shared memory, plus constant and texture caches.)

Page 88

Host-Device Data Transfers

Device-to-host memory bandwidth is much lower than device-to-device bandwidth: 8 GB/s peak (PCI-e x16 Gen 2) vs. 141 GB/s peak (GTX 280)

Minimize transfers

Intermediate data can be allocated, operated on, and deallocated without ever copying it to host memory

Group transfers

One large transfer much better than many small ones
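Why one large transfer beats many small ones can be seen with a toy cost model: every copy pays a fixed launch latency before the bandwidth term kicks in. The numbers below (10 µs per-transfer overhead, 8 GB/s link) are illustrative assumptions, not measurements:

```cpp
// Toy cost model for host<->device copies: each transfer pays a fixed
// latency plus a bandwidth term. Defaults are illustrative:
// 10 microseconds of overhead per transfer, 8 GB/s PCIe bandwidth.
double transferMicros(double totalBytes, int numTransfers,
                      double latencyUs = 10.0, double gbPerSec = 8.0) {
    double perTransferBytes = totalBytes / numTransfers;
    // GB/s * 1e3 = bytes per microsecond
    double copyUs = perTransferBytes / (gbPerSec * 1e3);
    return numTransfers * (latencyUs + copyUs);
}
```

Under these assumptions, moving 4 MB as 1024 separate 4 KB copies costs roughly twenty times more than a single 4 MB copy, because the fixed overhead dominates small transfers.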

Page 89

Page-Locked Data Transfers

cudaMallocHost() allows allocation of page-locked (“pinned”) host memory

Enables highest cudaMemcpy performance: 3.2 GB/s on PCI-e x16 Gen1, 5.2 GB/s on PCI-e x16 Gen2

See the “bandwidthTest” CUDA SDK sample

Use with caution! Allocating too much page-locked memory can reduce overall system performance.

Test your systems and apps to learn their limits

Page 90

Overlapping Data Transfers and Computation

Async and Stream APIs allow overlap of H2D or D2H data transfers with computation

CPU computation can overlap data transfers on all CUDA capable devices

Kernel computation can overlap data transfers on devices with “Concurrent copy and execution” (roughly compute capability >= 1.1)

Stream = sequence of operations that execute in order on GPU

Operations from different streams can be interleaved

Stream ID used as argument to async calls and kernel launches

Page 91

Coalescing

Global Memory

Half-warp of threads

Global memory accesses of 32-, 64-, or 128-bit words by a half-warp of threads (Fermi: a full warp of threads) can result in as few as one (or two) transaction(s) if certain access requirements are met

Float (32-bit) data example:

32-byte segments

64-byte segments

128-byte segments


Page 92

Coalescing: Compute Capability 1.2 and Higher

1 transaction - 64B segment

2 transactions - 64B and 32B segments

1 transaction - 128B segment

Issues transactions for segments of 32B, 64B, and 128B

Smaller transactions used to avoid wasted bandwidth

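The transaction counting above can be sketched with a host-side C++ model that counts how many aligned 128-byte segments one half-warp touches. This is a simplification of the real rules (which also issue 32- and 64-byte transactions), but it captures why alignment and stride matter:

```cpp
#include <set>

// Thread i of a half-warp reads a 4-byte word at byte address
// base + i * strideWords * 4. Count the distinct aligned segments
// touched; each distinct segment costs at least one transaction.
// Simplified model: 128-byte segments only.
int segmentsTouched(int baseByte, int strideWords, int segmentBytes = 128) {
    std::set<int> segments;
    for (int tid = 0; tid < 16; ++tid) {             // one half-warp
        int addr = baseByte + tid * strideWords * 4; // 4-byte words
        segments.insert(addr / segmentBytes);        // aligned segment index
    }
    return (int)segments.size();
}
```

An aligned unit-stride access (base 0, stride 1) stays inside one segment; shifting the base so the half-warp straddles a segment boundary doubles the transaction count, and large strides scatter the threads across many segments.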

Page 93

OPTIMIZATION 2:

EXECUTION CONFIG

Page 94

Occupancy

Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy

Occupancy = number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently

Limited by resource usage:

Registers

Shared memory
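A minimal occupancy calculation, mirroring the spirit of NVIDIA's occupancy spreadsheet. The resource limits below are illustrative G80/GT200-class defaults (16384 registers and 16 KB shared memory per SM, 32 warps and 8 blocks max per SM), not universal constants:

```cpp
#include <algorithm>

// Resident warps per SM = blocks that fit * warps per block, where the
// block count is limited by registers, shared memory, warp slots, and
// the hardware block limit. Defaults are illustrative GT200-class values.
int warpsResident(int threadsPerBlock, int regsPerThread, int smemPerBlock,
                  int regsPerSM = 16384, int smemPerSM = 16384,
                  int maxWarpsPerSM = 32, int maxBlocksPerSM = 8) {
    int warpsPerBlock = (threadsPerBlock + 31) / 32;
    int byRegs  = regsPerSM / std::max(1, regsPerThread * threadsPerBlock);
    int bySmem  = smemPerSM / std::max(1, smemPerBlock);
    int byWarps = maxWarpsPerSM / warpsPerBlock;
    int blocks  = std::min(std::min(byRegs, bySmem),
                           std::min(byWarps, maxBlocksPerSM));
    return blocks * warpsPerBlock;  // occupancy = this / maxWarpsPerSM
}
```

With 256-thread blocks and 16 registers per thread, 4 blocks (32 warps) fit: full occupancy. Doubling register use to 32 per thread halves the resident warps, showing how register pressure limits occupancy.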

Page 95

Blocks per Grid Heuristics

# of blocks > # of multiprocessors

So all multiprocessors have at least one block to execute

# of blocks / # of multiprocessors > 2

Multiple blocks can run concurrently in a multiprocessor

Blocks that aren’t waiting at a __syncthreads() keep the hardware busy

Subject to resource availability – registers, shared memory

# of blocks > 100 to scale to future devices

Blocks executed in pipeline fashion

1000 blocks per grid will scale across multiple generations
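The block-count heuristics above can be expressed as two small helpers (host-side C++ sketch; the names are ours):

```cpp
// Enough blocks to cover n elements with the given block size.
int blocksForN(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
}

// Heuristic from the slide: want more than 2 blocks per multiprocessor
// so every SM has work even while some blocks wait at a barrier.
bool gridKeepsDeviceBusy(int numBlocks, int numSMs) {
    return numBlocks > 2 * numSMs;
}
```

For 1,000,000 elements and 256-thread blocks this gives 3907 blocks, comfortably above 2 blocks per SM on a 30-SM GT200 and large enough to scale across future devices.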

Page 96

Register Pressure

Hide latency by using more threads per multiprocessor

Limiting Factors:

Number of registers per kernel

8K/16K per multiprocessor, partitioned among concurrent threads

Amount of shared memory

16KB per multiprocessor, partitioned among concurrent threadblocks

Compile with the --ptxas-options=-v flag to see per-kernel register and shared memory usage

Use the --maxrregcount=N flag to NVCC (N = desired maximum registers per kernel)

At some point “spilling” into local memory may occur

Reduces performance – local memory is slow

Page 97

Occupancy Calculator

Page 98

Optimizing threads per block

Choose threads per block as a multiple of warp size

Avoid wasting computation on under-populated warps

Facilitates coalescing

Want to run as many warps as possible per multiprocessor (to hide latency)

Multiprocessor can run up to 8 blocks at a time

Heuristics

Minimum: 64 threads per block

Only if multiple concurrent blocks

192 or 256 threads a better choice

Usually still enough regs to compile and invoke successfully

This all depends on your computation, so experiment!

Page 99

Occupancy != Performance

Increasing occupancy does not necessarily increase performance

BUT low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels

(It all comes down to arithmetic intensity and available parallelism)

Page 100

OPTIMIZATION 3:

MATH FUNCS & BRANCHING

Page 101

Runtime Math Library

There are two types of runtime math operations in single precision

__funcf(): direct mapping to hardware ISA

Fast but lower accuracy (see prog. guide for details)

Examples: __sinf(x), __expf(x), __powf(x,y)

funcf() : compile to multiple instructions

Slower but higher accuracy (5 ulp or less)

Examples: sinf(x), expf(x), powf(x,y)

The -use_fast_math compiler option forces every funcf() to compile to __funcf()

Page 102

Control Flow Instructions

Main performance concern with branching is divergence

Threads within a single warp take different paths

Different execution paths must be serialized

Avoid divergence when the branch condition is a function of thread ID

Example with divergence:

if (threadIdx.x > 2) { }

Branch granularity < warp size

Example without divergence:

if (threadIdx.x / WARP_SIZE > 2) { }

Branch granularity is a whole multiple of warp size
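The two predicates can be checked with a host-side model that counts distinct branch outcomes inside one 32-thread warp; more than one outcome means the warp diverges and the paths are serialized:

```cpp
#include <set>

// Number of distinct branch outcomes within a single 32-thread warp
// starting at thread id warpStartTid; > 1 means divergence.
const int WARP_SIZE = 32;

int pathsInWarp(int warpStartTid, bool (*pred)(int)) {
    std::set<bool> outcomes;
    for (int t = 0; t < WARP_SIZE; ++t)
        outcomes.insert(pred(warpStartTid + t));
    return (int)outcomes.size();
}

// The slide's two examples, modeled on thread id:
bool divergentPred(int tid) { return tid > 2; }              // granularity < warp
bool uniformPred(int tid)   { return tid / WARP_SIZE > 2; }  // whole warps
```

The first warp (threads 0-31) takes two paths under (tid > 2) but only one under (tid / WARP_SIZE > 2), since the latter is constant across every warp.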

Page 103

OPTIMIZATION 4: SHARED MEMORY

Page 104

Shared Memory

Roughly a hundred times faster than global memory

Cache data to reduce global memory accesses

Threads can cooperate via shared memory

Use it to avoid non-coalesced access

Stage loads and stores in shared memory to re-order non-coalesceable addressing

Page 105

Shared Memory Architecture

Many threads accessing memory

Therefore, memory is divided into banks

Successive 32-bit words assigned to successive banks

Each bank can service one address per cycle

A memory can service as many simultaneous accesses as it has banks

Multiple simultaneous accesses to a bank result in a bank conflict

Conflicting accesses are serialized


Page 106

Bank Addressing Examples

No bank conflicts: linear addressing, stride == 1

No bank conflicts: random 1:1 permutation

(Figure: in both cases, threads 0-15 each access a distinct bank 0-15.)

Page 107

Bank Addressing Examples

2-way bank conflicts: linear addressing, stride == 2

8-way bank conflicts: linear addressing, stride == 8

(Figure: with stride 2, the half-warp hits only the even banks, two threads per bank; with stride 8, eight threads pile onto bank 0 and eight onto bank 8.)

Page 108

Shared memory bank conflicts

Shared memory is about as fast as registers if there are no bank conflicts

warp_serialize profiler signal reflects conflicts

The fast case:

If all threads of a half-warp access different banks, there is no bank conflict

If all threads of a half-warp read the identical address, there is no bank conflict (broadcast)

The slow case:

Bank Conflict: multiple threads in the same half-warp access the same bank

Must serialize the accesses

Cost = max # of simultaneous accesses to a single bank
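The cost rule can be modeled directly: map each of the 16 half-warp threads to a bank and take the maximum load on any single bank (host-side C++ sketch, 16 banks of 32-bit words):

```cpp
#include <algorithm>
#include <map>
#include <set>

// Conflict degree for a half-warp of 16 threads reading 32-bit words
// at stride strideWords in 16-bank shared memory.
// Cost = max simultaneous accesses to one bank; an identical-address
// read by all threads is a broadcast and costs 1.
int conflictDegree(int strideWords) {
    const int NUM_BANKS = 16, HALF_WARP = 16;
    std::set<int> words;
    std::map<int, int> perBank;
    int worst = 1;
    for (int t = 0; t < HALF_WARP; ++t) {
        int word = t * strideWords;
        words.insert(word);
        worst = std::max(worst, ++perBank[word % NUM_BANKS]);
    }
    if (words.size() == 1) return 1;  // identical address: broadcast
    return worst;
}
```

This reproduces the earlier slides: stride 1 costs 1, stride 2 costs 2, stride 8 costs 8, and a stride of 17 words (a padded tile row) is again conflict-free.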

Page 109

Shared Memory Example: Transpose

Each thread block works on a tile of the matrix

Naïve implementation exhibits strided access to global memory

(Figure: idata → odata; elements transposed by a half-warp of threads.)

Page 110

Naïve Transpose

Loads are coalesced, stores are not (strided by height)


__global__ void transposeNaive(float *odata, float *idata,

int width, int height)

{

int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;

int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;

int index_in = xIndex + width * yIndex;

int index_out = yIndex + height * xIndex;

odata[index_out] = idata[index_in];

}

Page 111

Coalescing through shared memory

Access columns of a tile in shared memory to write contiguous

data to global memory

Requires __syncthreads() since threads access data in

shared memory stored by other threads


Page 112

__global__ void transposeCoalesced(float *odata, float *idata,

int width, int height)

{

__shared__ float tile[TILE_DIM][TILE_DIM];

int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;

int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;

int index_in = xIndex + (yIndex)*width;

xIndex = blockIdx.y * TILE_DIM + threadIdx.x;

yIndex = blockIdx.x * TILE_DIM + threadIdx.y;

int index_out = xIndex + (yIndex)*height;

tile[threadIdx.y][threadIdx.x] = idata[index_in];

__syncthreads();

odata[index_out] = tile[threadIdx.x][threadIdx.y];

}

Coalescing through shared memory

Page 113

Bank Conflicts in Transpose

16x16 shared memory tile of floats

Data in columns are in the same bank

16-way bank conflict reading columns in tile

Solution: pad the shared memory array

__shared__ float tile[TILE_DIM][TILE_DIM+1];

After padding, data in anti-diagonals are in the same bank, so column accesses no longer conflict

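The effect of the +1 padding is easy to verify numerically: compute the bank of each element in one tile column for row widths of 16 and 17 words (host-side sketch, 16 banks of 32-bit words):

```cpp
#include <set>

// Bank of element [row][col] in a 16-row tile of 32-bit floats laid
// out with rowWidthWords words per row: bank = (row*rowWidthWords + col) % 16.
// Returns how many distinct banks a column access (fixed col) touches.
int banksHitByColumn(int rowWidthWords, int col) {
    std::set<int> banks;
    for (int row = 0; row < 16; ++row)
        banks.insert((row * rowWidthWords + col) % 16);
    return (int)banks.size();
}
```

With a 16-word row every column element lands in the same bank (a 16-way conflict); with 17-word rows each column spreads across all 16 banks, which is exactly what the TILE_DIM+1 padding buys.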

Page 114

FERMI: NEW ARCHITECTURE

Page 115

Fermi: The Computational GPU

Disclaimer: Specifications subject to change

Performance
• 13x double precision of CPUs
• IEEE 754-2008 SP & DP floating point

Flexibility
• Increased shared memory from 16 KB to 64 KB
• Added L1 and L2 caches
• ECC on all internal and external memories
• Enables up to 1 terabyte of GPU memory
• High-speed GDDR5 memory interface

(Figure: Fermi die layout: DRAM interfaces, host interface, GigaThread scheduler, L2 cache.)

Usability
• Multiple simultaneous tasks on GPU
• 10x faster atomic operations
• C++ support
• System calls, printf support

Availability: Q2 2010

Page 116

Fermi

Memory operations are done per warp (32 threads) instead of half-warp

Global memory, Shared memory

Shared memory:

16 or 48KB

Now 32 banks, 32-bit wide each

No bank-conflicts when accessing 8-byte words

L1 cache per multiprocessor

Should help with misaligned access, strides access, some register spilling

Much improved dual-issue:

Can dual issue fp32 pairs, fp32-mem, fp64-mem, etc.

IEEE-conformant rounding

64bit address space, uniform

Page 117

L1 cache

For all memory operations

Global memory, Shared memory

Shares 64 KB with shared memory: the split can be switched between 16 KB and 48 KB (CUDA API call)

Caches global memory reads only

Benefits when the compiler detects that all threads load the same value

L1 cache per multiprocessor

NOT coherent! Use volatile for global memory accesses if threads on other SMs may change the location (use with care: since not all blocks run concurrently, inter-block communication risks deadlock)

But it does cache local memory reads and writes, to improve spilling behaviour (coherence is no problem, as local memory is SM-private)

Page 118

Fermi has 64bit address space

But only 32bit registers

In unfortunate cases, 64-bit addresses cause unnecessary register allocation overhead on Fermi

C2050 (3 GB)

Driver API:

Compile kernels in 32bit mode, can be loaded by 64bit app

Runtime API (CUDART):

Use the new __launch_bounds__() intrinsic to help the compiler optimize register usage

Compile the application in 32-bit mode (nvcc -m32), which also produces GPU code in 32-bit

Page 119

__umul24 not optimal on Fermi

On the Tesla C1060 / GT200 architecture, bounded integer multiplications could be accelerated with __umul24(a, b) instead of a * b, e.g. for

unsigned int tid = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;

On Fermi, __umul24() is emulated, and thus slower than a * b
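A host-side model of __umul24's semantics (ours, for illustration): it multiplies only the low 24 bits of each operand, so it agrees with a * b exactly when both operands fit in 24 bits, which is why it was safe for index math like the line above:

```cpp
#include <cstdint>

// Illustrative model of the device intrinsic __umul24: multiply the
// low 24 bits of each operand. Matches a * b only while both values
// fit in 24 bits; on Fermi, plain a * b is the faster choice anyway.
uint32_t umul24_model(uint32_t a, uint32_t b) {
    return (a & 0xFFFFFFu) * (b & 0xFFFFFFu);
}
```

For block/thread indices (always well under 2^24 here) the results coincide; once an operand exceeds 24 bits the high bits are silently dropped.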

Page 120

HPC and IEEE conformance

Default settings for computation on GPU now more conservative (for HPC)

Denormal support, IEEE-conformant division and square root

Accuracy over speed

If your app runs faster on Fermi with -arch=sm_13 than with -arch=sm_20, then the PTX JIT has used "old" Tesla C1060 settings, which favor speed: flush-to-zero instead of denormals, no IEEE-precise division, no IEEE-precise square root

For similar results in -arch=sm_20, use:

-ftz=true -prec-div=false -prec-sqrt=false

NVIDIA CUDA Programming Guide, sections 5.4.1, G.2

The CUDA Compiler Driver NVCC, pg. 14-15

(Section 5.4.1 also contains information on instruction timings)

Page 121

CONCLUSION, QUESTIONS

& GTC INVITE

Page 122

GPU Technology Conference 2010: Monday, Sept. 20 to Thursday, Sept. 23, 2010

San Jose Convention Center, San Jose, California

The most important event in the GPU ecosystem

Learn about seismic shifts in GPU computing

Preview disruptive technologies and emerging applications

Get tools and techniques to impact mission critical projects

Network with experts, colleagues, and peers across industries

Opportunities:

Call for submissions: sessions and posters

Sponsors / exhibitors: reach decision makers

"CEO on Stage" showcase for startups: tell your story to VCs and analysts

“I consider the GPU Technology Conference to be the single best place to see the amazing work enabled by the GPU. It’s a great venue for meeting researchers, developers, scientists, and entrepreneurs from around the world.” -- Professor Hanspeter Pfister, Harvard University and GTC 2009 keynote speaker

Page 123

Thank You

Questions?

Page 124