
Graphic Processors in Computational Applications
Lecture 1

Krzysztof Kaczmarski

15 February 2018


Outline

1 Introduction

2 / 110


Parallel Computing's Dark Age

• Impact of data-parallel computing was limited
  • Thinking Machines (100s of systems total)
  • MasPar (sold 200 systems)
• HPC: bringing data to the supercomputer - no longer possible
• Massively-parallel machines replaced by clusters of ever-more-powerful commodity microprocessors
  • Beowulf, Legion, grid computing, ...

cudaWhitepapers

3 / 110


General Purpose Computation on Graphics Processing Unit (GPGPU)

Why GPU?

Owens:2007:ASO

4 / 110


Evolution of Processors

[Figure: Number of Physical Cores/Multiprocessors, High-End Hardware. Log-scale plot over End of Year (2008-2016) comparing INTEL Xeon CPUs, NVIDIA GeForce GPUs, AMD Radeon GPUs and INTEL Xeon Phis.]

karlrupp

5 / 110


Evolution of Processors

[Figure: Theoretical Peak Performance, Single Precision (GFLOP/sec, log scale) over End of Year (2008-2016), comparing INTEL Xeon CPUs, NVIDIA GeForce GPUs, AMD Radeon GPUs and INTEL Xeon Phis.]

karlrupp

5 / 110


Evolution of Processors

[Figure: Theoretical Peak Performance, Double Precision (GFLOP/sec, log scale) over End of Year (2008-2016), comparing INTEL Xeon CPUs, NVIDIA Tesla GPUs, AMD Radeon GPUs and INTEL Xeon Phis.]

karlrupp

5 / 110


Evolution of Processors

[Figure: Theoretical Peak Memory Bandwidth Comparison (GB/sec, log scale) over End of Year (2008-2016), comparing INTEL Xeon CPUs, NVIDIA Tesla GPUs, AMD Radeon GPUs and INTEL Xeon Phis.]

karlrupp

5 / 110


Kepler SM Architecture

6 / 110


Pascal SM Architecture

7 / 110


Pascal P100 Device

8 / 110


                               Tesla K20M   Tesla K40M   Tesla P100
compute capability (cc)        3.5          3.5          6.0
architecture                   Kepler       Kepler       Pascal
SMPs                           13           15           56
cores per SM                   192          192          64
cores                          2496         2880         3584
processor                      GK110        GK110B       GP100
memory bandwidth (GB/s)        208          288          721
threads per warp               32           32           32
max warps per SM               64           64           64
memory bus width (bit)         320          384          3072
max threads per SM             2048         2048         2048
processor clock (MHz)          706          745          1126
threads per block              1024         1024         1024
memory (MB)                    5120         12288        12288
memory clock (MHz)             1300         1502         704
memory clock, effective (MHz)  5200         6008         1408

9 / 110
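Many of the per-device figures in the table above can be read out at run time. The sketch below is not part of the original slides; it uses the CUDA runtime call cudaGetDeviceProperties and standard cudaDeviceProp fields, and is meant only as an illustration.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);

        for (int dev = 0; dev < count; ++dev)
        {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);   // fill the property record for this device

            printf("%s (cc %d.%d)\n", prop.name, prop.major, prop.minor);
            printf("  multiprocessors (SMPs): %d\n", prop.multiProcessorCount);
            printf("  threads per warp:       %d\n", prop.warpSize);
            printf("  max threads per block:  %d\n", prop.maxThreadsPerBlock);
            printf("  memory:                 %zu MB\n", prop.totalGlobalMem >> 20);
            printf("  memory bus width:       %d bit\n", prop.memoryBusWidth);
            printf("  memory clock:           %d MHz\n", prop.memoryClockRate / 1000);
        }
        return 0;
    }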


GPGPU Applications

cudaWebsite

10 / 110


11 / 110


Sunway TaihuLight

journals/chinaf/FuLYWSHYXLQZYHZ16

12 / 110


GPGPU Clusters

www.olcf.ornl.gov/titan/

13 / 110



GPGPU: The new market...

• NVidia (Tesla, GeForce 8x, Quadro, ...)
  • Cg (C for graphics), GLSL for OpenGL, HLSL for DirectX
  • PhysX (Ageia, taken over by NVidia in February 2008)
  • CUDA (developed since November 2006)
• ATI
  • ATI Stream

14 / 110


GPGPU languages

• CUDA (NVidia only)
  • each new release is compatible with the previous one (old software can be run on new hardware)
• Stream (ATI only)
• OpenCL (Open Computing Language) - Khronos Group consortium
  • designed for all GPGPU models and vendors (currently supported only by NVidia?)
• DirectCompute (for DirectX) - present in Windows 7

15 / 110



GPGPU processing model

• GPU uses the SIMD processing model
  • Single Instruction, Multiple Data
• Threads working in parallel may share a common cache
• Increased memory data transfer
• Results:
  • speedups of hundreds of times for some applications
  • extraordinary scalability for different data sets

16 / 110



SIMD processing model

17 / 110



SISD, MIMD, MISD - Flynn Taxonomy

18 / 110



Heterogeneous programming with host and device

cudaProgrammingGuide

19 / 110


GPGPU drawbacks

• Applications must be (re)written specifically for the GPU
• Good scalability for certain tasks (embarrassingly parallel), but no clear algorithms for all tasks
• Dedicated hardware needed
• So far no common standard for programming (currently waiting for DirectX 11 and OpenCL)
  • however, NVidia CUDA seems to be the industry-leading solution
  • 25,000+ active developers, 100+ applications, 30+ NVIDIA GPU clusters using the CUDA tool chain (04/2009)
• The task must fit into GPU memory or must be divided into portions
• Several programming limitations
  • no efficient synchronization mechanisms between threads - a huge slowdown
  • another slowdown for computations using conditional constructs

20 / 110



Vector threads execution

All threads in a vector machine perform the same action at the same time.

    // Kernel fragment (thid, n, m, off, temp and g_idata are defined in the
    // surrounding code); every thread executes these lines in lockstep.
    if ((thid == num_elements/2) && ((num_elements%2) > 0))
        temp[num_elements-1] = g_idata[num_elements-1];

    for (unsigned int i = 0; i < num_elements; ++i)
    {
        if ( (n+off) <= (num_elements-1) )
        {
            if (temp[m+off] > temp[n+off])
                swap( temp[m+off], temp[n+off] );
        }
        off = 1 - off;
    }

21 / 110


Outline

2 NVidia Compute Unified Device Architecture (CUDA)
    Introduction
    Software Development Kit
    A few words about hardware
    A simple kernel execution
    CUDA Programming Language
    Memory Management
    Synchronization
    Error reporting

3 Advanced topics
    Volatile

22 / 110



CUDA characteristics I

• Scales to 100s of cores, 1000s of parallel threads
• Lets programmers focus on parallel algorithms
• Enables heterogeneous systems (i.e., CPU + GPU)

Definitions:

Device = GPU
Host = CPU
Kernel = function called from the host that runs on the device

24 / 110


CUDA APIs:

• A low-level API called the CUDA driver API - the nvcuda dynamic library (all its entry points prefixed with cu).
• A higher-level API called the C runtime for CUDA, implemented on top of the CUDA driver API - the cudart dynamic library (all its entry points prefixed with cuda).

They were mutually exclusive in older CUDA versions. Since CUDA 4.0 and recent drivers, an application may use them together.

25 / 110
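As an illustration of the two prefixes (this example is not part of the slides), the sketch below allocates device memory once through the runtime API and once through the driver API; error checking is omitted, and the driver-API path additionally needs linking against the driver library.

    #include <cuda.h>          // driver API, cu* entry points (nvcuda / libcuda)
    #include <cuda_runtime.h>  // runtime API, cuda* entry points (cudart)

    int main()
    {
        const size_t bytes = 1 << 20;

        // Runtime API: context creation is implicit.
        void *d_runtime = 0;
        cudaMalloc(&d_runtime, bytes);
        cudaFree(d_runtime);

        // Driver API: the application manages device and context explicitly.
        cuInit(0);
        CUdevice dev;
        cuDeviceGet(&dev, 0);
        CUcontext ctx;
        cuCtxCreate(&ctx, 0, dev);

        CUdeviceptr d_driver;
        cuMemAlloc(&d_driver, bytes);
        cuMemFree(d_driver);

        cuCtxDestroy(ctx);
        return 0;
    }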



Outline

2 NVidia Compute Unified Device Architecture (CUDA)
    Introduction
    Software Development Kit
    A few words about hardware
    A simple kernel execution
    CUDA Programming Language
    Memory Management
    Synchronization
    Error reporting

3 Advanced topics
    Volatile

26 / 110


Compiler

• nvcc - a replacement for the standard C compiler (a compiler driver)
  • works by invoking all the necessary tools and compilers (like cudacc, g++, cl, ...)
• nvcc can output:
  • either C code (CPU code), which must then be compiled with the rest of the application using another tool,
  • or PTX object code directly
• useful command-line parameters:
  • -deviceemu (also defines a preprocessor macro)
  • -arch, -code, -gencode (including -arch=sm_13 for double precision on double-enabled devices)
  • --ptxas-options=-v - the compiler reports total local memory usage per kernel
  • --keep - allows inspection of the intermediate PTX code
• An executable with CUDA code requires:
  • the CUDA core library (cuda)
  • the CUDA run-time library (cudart)

27 / 110
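As a usage illustration (file names are hypothetical, not from the slides): a kernel in first.cu could be compiled with nvcc -arch=sm_13 --ptxas-options=-v -o first first.cu, with ptxas then reporting the register, shared and local memory usage of each kernel.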



Visual Environments

1 NVidia NSight for Visual Studio

2 NVidia NSight for Linux, built on Eclipse

3 computeprof - NVidia Profiler

28 / 110


Outline

2 NVidia Compute Unified Device Architecture (CUDA)
    Introduction
    Software Development Kit
    A few words about hardware
    A simple kernel execution
    CUDA Programming Language
    Memory Management
    Synchronization
    Error reporting

3 Advanced topics
    Volatile

29 / 110


Streaming Multiprocessor

A visualization of a multiprocessor and thread execution (almost accurate)

nvidiaGT200Review

30 / 110


Memory Hierarchy

cudaWhitepapers

31 / 110


Outline

2 NVidia Compute Unified Device Architecture (CUDA)
    Introduction
    Software Development Kit
    A few words about hardware
    A simple kernel execution
    CUDA Programming Language
    Memory Management
    Synchronization
    Error reporting

3 Advanced topics
    Volatile

32 / 110


Thread organization: Grid, Block and Threads

cudaWhitepapers

33 / 110


The first kernel

    int blockSize = 32;
    dim3 threads( blockSize );
    dim3 blocks( arraySize / (float)blockSize + 1 );

    first_kernel<<< blocks, threads >>>( arrayPtr, arraySize );

    __global__ void first_kernel(int *a, int dimx)
    {
        int ix = blockIdx.x * blockDim.x + threadIdx.x;

        // check if still inside array
        if (ix < dimx)
            a[ix] = a[ix] + 1;
    }

34 / 110
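The launch above assumes that arrayPtr already refers to device memory. As a sketch only (memory management is covered later in these slides, and the array size here is an arbitrary assumption), a complete host-side flow around first_kernel could look like this:

    #include <cstdlib>
    #include <cuda_runtime.h>

    // first_kernel as defined above

    int main()
    {
        const int arraySize = 1000;                            // assumed example size
        const size_t bytes  = arraySize * sizeof(int);

        int *hostPtr  = (int*)calloc(arraySize, sizeof(int));  // host copy of the data
        int *arrayPtr = 0;                                     // device copy
        cudaMalloc((void**)&arrayPtr, bytes);
        cudaMemcpy(arrayPtr, hostPtr, bytes, cudaMemcpyHostToDevice);

        int blockSize = 32;
        dim3 threads( blockSize );
        dim3 blocks( (arraySize + blockSize - 1) / blockSize ); // integer ceiling division
        first_kernel<<< blocks, threads >>>( arrayPtr, arraySize );

        cudaMemcpy(hostPtr, arrayPtr, bytes, cudaMemcpyDeviceToHost);

        cudaFree(arrayPtr);
        free(hostPtr);
        return 0;
    }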



The first 2D kernel

    __global__ void first_kernel(int *a, int dimx, int dimy)
    {
        int ix = blockIdx.x * blockDim.x + threadIdx.x;
        int iy = blockIdx.y * blockDim.y + threadIdx.y;

        // check if still inside array
        if ((ix < dimx) && (iy < dimy))
        {
            int idx = iy * dimx + ix;
            a[idx] = a[idx] + 1;
            // equivalent: a[iy][ix] = a[iy][ix] + 1;
        }
    }

    int blockSize = 32;
    dim3 threads( blockSize, blockSize );
    dim3 blocks( array_x_dim / (float)blockSize + 1,
                 array_y_dim / (float)blockSize + 1 );

    first_kernel<<< blocks, threads >>>( array_ptr, array_x_dim, array_y_dim );

35 / 110
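A side note, not from the slides: the float arithmetic used above to round the block count up is more commonly written as an integer ceiling division, which needs no cast:

    dim3 blocks( (array_x_dim + blockSize - 1) / blockSize,
                 (array_y_dim + blockSize - 1) / blockSize );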



Code must be thread block independent

Blocks may be executed concurrently or sequentially (depending on the context and device capabilities).

cudaWhitepapers

36 / 110
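To make the rule concrete, here is a hedged sketch (not from the slides; it already uses shared memory and __syncthreads, which are introduced later): each block produces one partial sum of its own slice and never waits for, or reads results of, any other block, so the blocks can be scheduled in any order.

    // Assumes it is launched with blockDim.x == 256 (a power of two).
    __global__ void partial_sums(const int *in, int *block_out, int n)
    {
        __shared__ int s[256];

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        s[threadIdx.x] = (i < n) ? in[i] : 0;
        __syncthreads();

        // Tree reduction within this block only.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
        {
            if (threadIdx.x < stride)
                s[threadIdx.x] += s[threadIdx.x + stride];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            block_out[blockIdx.x] = s[0];   // each block writes only its own slot
    }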


CUDA Advantages over Legacy GPGPU

• Random access byte-addressable memory
  • thread can access any memory location
• Unlimited access to memory
  • thread can read/write as many locations as needed
• Shared memory (per block) and thread synchronization
  • threads can cooperatively load data into shared memory
  • any thread can then access any shared memory location
• Low learning curve
  • just a few extensions to C
  • no knowledge of graphics is required

cudaWhitepapers

37 / 110



Outline

2 NVidia Compute Unified Device Architecture (CUDA)
    Introduction
    Software Development Kit
    A few words about hardware
    A simple kernel execution
    CUDA Programming Language
    Memory Management
    Synchronization
    Error reporting

3 Advanced topics
    Volatile

38 / 110


CUDA Language Characteristics I

• Modified C language
• A program is built of C functions (executed on the CPU or the GPU)
• A function running on the GPU (streaming processor) is called a kernel.
• Kernel properties:
  • can only access GPU memory (now changing in some devices)
  • no variable number of arguments
  • no static variables
  • no recursion
  • must be void

39 / 110


CUDA Language Characteristics II – Threads

• We write a kernel from a single thread's point of view.
• Each thread is free to execute a unique code path
  (however, with the danger of a huge slowdown).
• Each kernel contains local variables defining the execution context
  (see the indexing sketch below):
  • threadIdx – three-dimensional value, unique within a block
  • blockIdx – two-dimensional value, unique within a grid
  • blockDim – three-dimensional value describing the block dimensions
  • gridDim – two-dimensional value describing the grid dimensions
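A common indexing sketch built from these variables (the kernel name scale and its arguments are illustrative):

    __global__ void scale(float *d_data, int n, float factor)
    {
        // one unique global index per thread in a 1D launch
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)            // guard: the grid may contain more threads than elements
            d_data[i] *= factor;
    }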


CUDA Language Characteristics III – Blocks

• A thread block is a group of threads that can:
  • synchronize their execution
  • communicate via shared memory
• Grid = all blocks for a given launch


CUDA Language Elements

• dim3 type:
  • used for indexing and describing blocks of threads and grids
  • can be constructed from one, two or three values
  • based on uint[3], default value: (1,1,1)
• other built-in vector types (see the fragment below):
  • [u]{char,short,int,long}{1..4}, float{1..4}
  • structures accessed with the x, y, z, w fields:

    uint4 param;
    int y = param.y;
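A small fragment showing how these types are typically constructed (a sketch; the variable names are illustrative):

    dim3   block(16, 16);                              // unspecified z defaults to 1
    float4 v   = make_float4(1.0f, 2.0f, 3.0f, 4.0f);  // built-in helper constructor
    float  sum = v.x + v.y + v.z + v.w;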


CUDA Language Elements

• function qualifiers (a short sketch follows below):
  • __global__ – launched by the CPU, executed on the device (must return void)
  • __device__ – called from other GPU functions (never from the CPU)
  • __host__ – can be executed by the CPU
    (can be used together with __device__)
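A short sketch of the three qualifiers (the function names are illustrative): square can be compiled for both sides, device_sum only for the GPU, and the kernel glues them together.

    __host__ __device__ float square(float x)      // usable from CPU and GPU code
    {
        return x * x;
    }

    __device__ float device_sum(float a, float b)  // callable only from GPU code
    {
        return square(a) + square(b);
    }

    __global__ void combine(float *d_v, int n)     // launched from the CPU, runs on the GPU
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            d_v[i] = device_sum(d_v[i], 1.0f);
    }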

CUDA Language Elements

• kernel launch (see the configuration sketch below):

    f_name<<<gridDim, blockDim, sharedMem, strId>>>(p1, ..., pN)

  where
  • f_name – name of a kernel function with the __global__ qualifier
  • gridDim – dim3 value describing the number of blocks in a grid
  • blockDim – dim3 value describing the number of threads in each block
  • sharedMem – (optional) size of shared memory allocated for each block, in bytes (max 16 kB)
  • strId – (optional) identification of a stream for parallel kernel execution
  • p1, ..., pN – kernel parameters (automatically copied to the device)
• Kernel launches are asynchronous (control returns to the CPU immediately).
• A kernel executes after all previous CUDA calls have completed.
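A typical launch configuration (a sketch; combine is the illustrative kernel from the previous sketch and d_v is assumed to be a device pointer already allocated with cudaMalloc):

    int  n = 1 << 20;
    dim3 threads(256);                               // threads per block
    dim3 blocks((n + threads.x - 1) / threads.x);    // enough blocks to cover n elements

    combine<<<blocks, threads>>>(d_v, n);            // sharedMem and strId omitted
    cudaThreadSynchronize();                         // optional: block the CPU until it finishes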



Allocating and deallocating memory

    int n = 1024;
    int nbytes = n*sizeof(int);
    int *d_array = 0;

• cudaMalloc((void**)&d_array, nbytes)
• cudaMemset(d_array, 0, nbytes)
• cudaFree(d_array)
• cudaMemcpy(void *dst, void *src, size_t nBytes, enum cudaMemcpyKind direction)
  • cudaMemcpyHostToDevice
  • cudaMemcpyDeviceToHost
  • cudaMemcpyDeviceToDevice

  This is the CPU-blocking version (it also ensures that preceding kernels have completed).


Memory Management

De-referencing a CPU pointer on the GPU will crash (and vice versa).

Good naming practices:
  d_ – device pointers
  h_ – host pointers
  s_ – shared memory

First kernel – Host code completed

    cudaSetDevice( cutGetMaxGflopsDeviceId() );

    int arraySize = ?;                        // number of elements, to be chosen
    int blockSize = 32;
    dim3 threads( blockSize );
    dim3 blocks( arraySize / (float)blockSize );
    int numBytes = arraySize * sizeof(int);
    int* h_A = (int*) malloc(numBytes);       // host buffer

    int* d_A = 0;                             // device buffer
    cudaMalloc((void**)&d_A, numBytes);

    cudaMemset( d_A, 0, numBytes );           // zero the whole device array (size in bytes)
    //cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);

    first_kernel<<< blocks, threads >>>( d_A, arraySize );

    cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);   // copy the result back

    cudaFree(d_A);
    free(h_A);

Variable Qualifiers (GPU side)

• __device__
  • Stored in device memory (large, high latency, no cache)
  • Allocated with cudaMalloc (__device__ qualifier implied)
  • Accessible by all threads
  • Lifetime: application
• __shared__ (see the sketch below)
  • Stored in on-chip shared memory (very low latency)
  • Allocated by the execution configuration or declared at compile time
  • Accessible by all threads in the same thread block
  • Lifetime: kernel execution
• Unqualified variables:
  • Scalars and built-in vector types are stored in registers
  • Arrays of more than 4 elements are stored in device memory
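A sketch of the two qualifiers side by side (the names are illustrative):

    __device__ int d_total;                      // device memory, lives for the whole application

    __global__ void sum_blocks(const int *d_in)
    {
        __shared__ int s_buf[256];               // one copy per block, gone when the kernel ends
        int tid = threadIdx.x;                   // unqualified scalar: kept in a register
        s_buf[tid] = d_in[blockIdx.x * blockDim.x + tid];
        __syncthreads();
        if (tid == 0)
            atomicAdd(&d_total, s_buf[0]);       // first element of each block added to the global
    }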


Allocating shared memory

Statically, with the size known at compile time:

    // Device side:
    __global__ void kernel(...)
    {
        ...
        __shared__ float sData[256];
        ...
    }

    // Host side:
    kernel<<<nBlocks, blockSize>>>(...);

Dynamically, with the size given in the execution configuration:

    // Device side:
    __global__ void kernel(...)
    {
        ...
        extern __shared__ float sData[];
        ...
    }

    // Host side:
    smBytes = blockSize*sizeof(float);
    kernel<<<nBlocks, blockSize, smBytes>>>(...);



Threads Synchronization

• Device side: __syncthreads() (see the reduction sketch below)
  • Synchronizes all threads in a block
  • No thread can pass this barrier until all threads in the block reach it
  • Used to avoid conflicts when accessing shared memory
  • Allowed in conditional code only if the conditional is uniform across the entire thread block
• Host side: cudaThreadSynchronize()
  • Blocks the current CPU thread until all GPU calls are finished,
    including all streams.
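A classic use of __syncthreads(): a block-wide sum reduction in shared memory (a sketch; the kernel name is illustrative, blockDim.x is assumed to be a power of two, and the kernel is launched with blockDim.x*sizeof(float) bytes of dynamic shared memory):

    __global__ void block_sum(const float *d_in, float *d_out)
    {
        extern __shared__ float s_data[];
        int tid = threadIdx.x;
        s_data[tid] = d_in[blockIdx.x * blockDim.x + tid];
        __syncthreads();                       // all loads finished before anyone reads

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                s_data[tid] += s_data[tid + stride];
            __syncthreads();                   // uniform across the block: every thread reaches it
        }
        if (tid == 0)
            d_out[blockIdx.x] = s_data[0];     // one partial sum per block
    }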


Advanced Threads Synchronization I

CUDA Toolkit > 3.0 and CC ≥ 2.0
Device side (see the counting sketch below):

• int __syncthreads_count(int predicate); – identical to __syncthreads(), with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to non-zero.
• int __syncthreads_and(int predicate); – similarly, but evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for all of them.
• int __syncthreads_or(int predicate); – similarly, but returns non-zero if predicate evaluates to non-zero for any of the threads.
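A counting sketch with __syncthreads_count (illustrative names; each thread tests one element and thread 0 stores how many tests passed in its block):

    __global__ void count_hits(const int *d_in, int *d_hits, int threshold)
    {
        int i   = blockIdx.x * blockDim.x + threadIdx.x;
        int hit = (d_in[i] > threshold);             // per-thread predicate
        int n   = __syncthreads_count(hit);          // barrier + count over the whole block
        if (threadIdx.x == 0)
            d_hits[blockIdx.x] = n;
    }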


Advanced Threads Synchronization II

Device side memory fence functions (see the producer sketch below):

• void __threadfence_block(); – waits until all global and shared memory accesses made by the calling thread before the call are visible to all threads in the thread block.
• void __threadfence(); – waits until all global and shared memory accesses made by the calling thread prior to __threadfence() are visible to:
  • all threads in the thread block, for shared memory accesses,
  • all threads in the device, for global memory accesses.
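A producer/flag sketch with __threadfence() (illustrative names; the fence guarantees the result is visible device-wide before the flag is raised):

    __device__ float        d_result;
    __device__ volatile int d_ready = 0;

    __global__ void producer(float value)
    {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            d_result = value;      // 1. publish the data
            __threadfence();       // 2. make the write visible to the whole device
            d_ready  = 1;          // 3. only then raise the flag read by other blocks
        }
    }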



CUDA Error Reporting to CPU

• All CUDA calls return an error code: cudaError_t (except for kernel launches)
• cudaError_t cudaGetLastError(void) – returns the code for the last error
• const char* cudaGetErrorString(cudaError_t code) – returns a null-terminated character string describing the error

    printf("%s\n", cudaGetErrorString( cudaGetLastError() ));

A common checking macro built from these calls is sketched below.
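A common checking pattern (a sketch, not part of the CUDA API; the macro name is illustrative): wrap every runtime call, and query cudaGetLastError() after kernel launches, which do not return an error code themselves.

    #include <cstdio>
    #include <cstdlib>

    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess) {                                     \
                fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                        __FILE__, __LINE__, cudaGetErrorString(err));     \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)

    // usage:
    //   CUDA_CHECK( cudaMalloc((void**)&d_A, numBytes) );
    //   first_kernel<<< blocks, threads >>>( d_A, arraySize );
    //   CUDA_CHECK( cudaGetLastError() );   // reports launch configuration errors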



Volatile variables and specific compiler optimizations

• One of the compiler's tricks: reuse references to a memory location
• Result: a reused value may have been changed by another thread in the background

    // myArray is an array of non-zero integers
    // located in global or shared memory
    __global__ void myKernel(int* result)
    {
        int tid  = threadIdx.x;
        int ref1 = myArray[tid] * 1;
        myArray[tid + 1] = 2;
        int ref2 = myArray[tid] * 1;
        result[tid] = ref1 * ref2;
    }

• the first reference to myArray[tid] compiles into a memory read instruction
• the second reference does not, as the compiler simply reuses the result of the first read
  (the volatile fix is sketched below)

cudaProgrammingGuide
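The usual cure (a sketch; myArray is the same hypothetical array as above): accessing the data through a volatile pointer forces the compiler to issue a real memory instruction for every reference, so ref2 re-reads the possibly updated value.

    __global__ void myKernel(int* result)
    {
        volatile int* vArray = myArray;   // accesses through vArray are never cached in registers
        int tid  = threadIdx.x;
        int ref1 = vArray[tid] * 1;       // first read
        myArray[tid + 1] = 2;
        int ref2 = vArray[tid] * 1;       // second read is emitted as well
        result[tid] = ref1 * ref2;
    }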



Outline

4 Optimizations
    Basics
    Measurements
    Instruction Optimizations
    Memory optimizations
        Playing with memory types
        Coalesced memory access
        Avoiding memory bank conflicts
    Asynchronous operations
        Stream API
    Assuring optimum load for a device


Page 190: Graphic Processors in Computational Applications - Lecture 1kaczmars/gpca/resources/01... · 2018-02-15 · Evolution of Processors 10 0 10 1 10 2 2008 2010 2012 2014 2016 HD 3870

Optimization Basics

• Maximize parallel blocks (many threads and many blocks)
• Maximize computations / minimize memory transfer
• Avoid costly operations
• Take advantage of shared memory

65 / 110



Timers API

• cudaEvent_t - event type
• cudaEventSynchronize() - blocks the CPU until the given event has been recorded
• cudaEventRecord() - records the given event in the given stream
• cudaEventElapsedTime() - calculates the time in milliseconds between two events
• cudaEventCreate() - creates an event
• cudaEventDestroy() - destroys an event

67 / 110

Page 196: Graphic Processors in Computational Applications - Lecture 1kaczmars/gpca/resources/01... · 2018-02-15 · Evolution of Processors 10 0 10 1 10 2 2008 2010 2012 2014 2016 HD 3870

Timers Example

cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
kernel<<<grid, threads>>>(d_odata, d_idata, size_x, size_y, NUM_REPS);
cudaEventRecord(stop, 0);

cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);

68 / 110

Page 197: Graphic Processors in Computational Applications - Lecture 1kaczmars/gpca/resources/01... · 2018-02-15 · Evolution of Processors 10 0 10 1 10 2 2008 2010 2012 2014 2016 HD 3870

Timers with cutil

unsigned int timer;
cutilCheckError(cutCreateTimer(&timer));

cutStartTimer(timer);

[...]

cudaThreadSynchronize();

cutStopTimer(timer);

timer_result = cutGetTimerValue(timer);

cutResetTimer(timer);

cutilCheckError(cutDeleteTimer(timer));

69 / 110

Page 198: Graphic Processors in Computational Applications - Lecture 1kaczmars/gpca/resources/01... · 2018-02-15 · Evolution of Processors 10 0 10 1 10 2 2008 2010 2012 2014 2016 HD 3870

Theoretical Bandwidth Calculation

TB = (Clock × 10^6 × MemInt × 2) / 10^9

• TB - theoretical bandwidth [GB/s]
• Clock - memory clock rate [MHz]
• MemInt - width of the memory interface [B]
• 2 - DDR (Double Data Rate) memory

For the NVIDIA GeForce GTX 280 we get:

(1107 × 10^6 × 512/8 × 2) / 10^9 = 141.6 GB/s

70 / 110
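The same formula can be evaluated at run time from the device properties; a minimal host-side sketch (error checking omitted, the factor 2 assumes DDR memory as in the formula above):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // memoryClockRate is reported in kHz, memoryBusWidth in bits
    double tb = (prop.memoryClockRate * 1e3)     // clock in Hz
              * (prop.memoryBusWidth / 8.0)      // interface width in bytes
              * 2.0 / 1e9;                       // DDR, scaled to GB/s

    printf("%s: theoretical bandwidth = %.1f GB/s\n", prop.name, tb);
    return 0;
}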


E�ective Bandwidth Calculation

EB = ((Br + Bw) × 10^-9) / t

• EB - effective bandwidth [GB/s]
• Br - bytes read
• Bw - bytes written
• t - time of the test [s]

71 / 110
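Combining the event timers with this formula; a minimal sketch (the copy kernel and the launch configuration are illustrative assumptions, error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

// Assumed kernel: every element is read once and written once.
__global__ void copyKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Measures the effective bandwidth of copyKernel for n floats already on the device.
void measureEffectiveBandwidth(const float* d_in, float* d_out, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    copyKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);       // milliseconds

    double bytes = 2.0 * n * sizeof(float);       // Br + Bw
    double eb = (bytes * 1e-9) / (ms * 1e-3);     // [GB/s]
    printf("effective bandwidth: %.2f GB/s\n", eb);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}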


Instruction Optimizations

• Use faster equivalents for slower operations
• Use single precision (functions, variables and constants)
• Use specialized math functions

73 / 110

Page 203: Graphic Processors in Computational Applications - Lecture 1kaczmars/gpca/resources/01... · 2018-02-15 · Evolution of Processors 10 0 10 1 10 2 2008 2010 2012 2014 2016 HD 3870

Find balance between speed and precision

Functions working in host and device code (selection):

single prec.: x+y, x*y, x/y, 1/x, sqrtf(x), rsqrtf(x), expf(x), exp2f(x), exp10f(x), logf(x), log2f(x), log10f(x), sinf(x), cosf(x), tanf(x), asinf(x), acosf(x), atanf(x), sinhf(x), ..., powf(x,y), rintf(x), truncf(x), ceilf(x), floorf(x), roundf(x)

double prec.: x+y, x*y, x/y, 1/x, sqrt(x), rsqrt(x), exp(x), exp2(x), exp10(x), log(x), log2(x), log10(x), sin(x), cos(x), tan(x), asin(x), acos(x), atan(x), sinh(x), ..., pow(x,y), rint(x), trunc(x), ceil(x), floor(x), round(x)

integer func.: min(x,y), max(x,y)

When executed in host code, a given function uses the C runtime implementation if available.

74 / 110

Page 204: Graphic Processors in Computational Applications - Lecture 1kaczmars/gpca/resources/01... · 2018-02-15 · Evolution of Processors 10 0 10 1 10 2 2008 2010 2012 2014 2016 HD 3870

Several code samples

If the following code is executed on a device with cc ≤ 1.2, or on a device with cc > 1.2 but with double precision disabled,

float a;
a = a*1.02;

then the multiplication is done in single precision. What happens if the same code is run on the host? The multiplication is done in double precision, because the literal 1.02 has type double. To force single-precision multiplication in both cases, use a single-precision literal:

float a;
a = a*1.02f;

sizeof(char)  = 1
sizeof(short) = 2
sizeof(int)   = 4
sizeof(float) = 4
sizeof(double)= 8

75 / 110


Device intrinsics mathematical functions

• Less accurate but faster device-only versions (selection):
    __fadd_[rn,rz,ru,rd](x,y), __fmul_[rn,rz,ru,rd](x,y), __powf(x,y), __logf(x), __log2f(x), __log10f(x), __expf(x), __sinf(x), __cosf(x), __tanf(x), __umul24(x,y), __float2int_[rn,rz,ru,rd](x), __float2uint_[rn,rz,ru,rd](x)
• Double precision (selection):
    __dadd_[rn,rz,ru,rd](x,y), __dmul_[rn,rz,ru,rd](x,y), __double2float_[rn,rz](x), __double2int_[rn,rz,ru,rd](x), __double2uint_[rn,rz,ru,rd](x)

76 / 110
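A small illustrative kernel (an assumption, not from the lecture) contrasting the standard single-precision functions with their intrinsic counterparts:

__global__ void fastMathDemo(const float* in, float* standard, float* fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        standard[i] = sinf(in[i]) * expf(in[i]);      // regular single-precision functions
        fast[i]     = __sinf(in[i]) * __expf(in[i]);  // intrinsics: faster, less accurate
    }
}

Compiling with nvcc's -use_fast_math option maps the standard single-precision calls onto the intrinsic versions for the whole compilation unit.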


Using di�erent memory types

• On-chip registers - very limited - local kernel variables
• On-chip constant cache - 8 kB - read by kernel
• On-chip shared memory - allocated by device or host - accessed in kernel
• Texture memory - allocated by host - read by kernel
• Global device memory - allocated by host - accessed in kernel and host

78 / 110
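A minimal sketch (names and sizes are illustrative assumptions) with several of these memory spaces used in a single kernel:

// Constant memory, filled by the host with cudaMemcpyToSymbol() before the launch.
__constant__ float c_coeff[4];

__global__ void memorySpacesDemo(const float* g_in, float* g_out, int n)
{
    __shared__ float s_buf[256];                  // on-chip shared memory, one tile per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float r = 0.0f;                               // register: local kernel variable

    if (i < n) s_buf[threadIdx.x] = g_in[i];      // global -> shared
    __syncthreads();

    if (i < n) {
        r = s_buf[threadIdx.x] * c_coeff[i % 4];  // shared and constant -> register
        g_out[i] = r;                             // register -> global
    }
}

The sketch assumes blocks of 256 threads so that the shared tile matches the block size.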


Using host page-locked memory

• Page-locked ("pinned") host memory enables the highest cudaMemcpy performance:
    • 3.2 GB/s on PCI-e x16 Gen1
    • 5.2 GB/s on PCI-e x16 Gen2
• allows parallel execution of data transfers and kernels
• on some devices page-locked memory may be mapped directly into the device's address space
• Allocating too much page-locked memory can reduce overall system performance

79 / 110


Page-locked memory allocation I

• cudaError_t cudaMallocHost(void **ptr, size_t size)
• cudaError_t cudaHostAlloc(void **ptr, size_t size, unsigned int flags)
• possible flags (yes, they are orthogonal):
    • cudaHostAllocDefault - this flag's value is defined to be 0 and causes cudaHostAlloc() to emulate cudaMallocHost().
    • cudaHostAllocPortable - the memory returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation.
    • cudaHostAllocMapped - maps the allocation into the CUDA address space.
    • cudaHostAllocWriteCombined - allocates the memory as write-combined (WC). WC memory can be transferred across the PCI Express bus more quickly on some system configurations, but cannot be read efficiently by most CPUs.

80 / 110


Page-locked memory allocation II

• The device pointer to the memory may be obtained by calling cudaHostGetDevicePointer()
• Before a device pointer to mapped host memory can be obtained, mapping must be enabled in the driver:
    • cudaSetDeviceFlags() with the flag cudaDeviceMapHost
• Memory allocated by this function must be freed with cudaFreeHost().
• Real life:
    • Pinned memory may double the memory transfer rate if used properly
    • For some devices mapped host memory may save on memory transfers ('zero-copy' functionality is available since CUDA 2.2)
    • Write-combined memory is reported to have no influence on current hardware

81 / 110
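A minimal zero-copy sketch under the assumptions above (the device must report canMapHostMemory, error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                  // operates directly on mapped host memory
}

int main()
{
    const int n = 1 << 20;

    cudaSetDeviceFlags(cudaDeviceMapHost);       // enable mapping before the context is created

    float* h_data;
    cudaHostAlloc((void**)&h_data, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float* d_data;
    cudaHostGetDevicePointer((void**)&d_data, h_data, 0);   // device view of the same buffer

    scale<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();                     // after this, the kernel's writes are visible on the host

    printf("h_data[0] = %f\n", h_data[0]);       // expected: 2.0
    cudaFreeHost(h_data);
    return 0;
}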


Global Memory alignment

• A device is capable of reading 4-byte, 8-byte, or 16-byte words from global memory into registers in a single instruction.

__device__ type device[32];
type data = device[tid];

• type must be of size 4, 8, or 16 bytes
• variables of structure types must be aligned:

struct __align__(16) {
    float a;
    float b;
    float c;
};

One load instruction

struct __align__(16) {
    float a;
    float b;
    float c;
    float d;
    float e;
};

Two load instructions instead of five

82 / 110


Coalesced memory access

• Coalesced global memory access - a coordinated read by a half-warp (16 threads).
• Such a read may result in only one or two memory transactions.
• If the threads do not fulfil additional conditions, the reads are serialized, resulting in 16 or more memory accesses.
• Coalescing is possible for segments of 32, 64 and 128 bytes
• The starting address of a region must be a multiple of the region size

83 / 110
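An illustrative pair of kernels (an assumption, not from the lecture): both move the same amount of data, but only the first lets every half-warp read one contiguous, properly aligned segment:

__global__ void copyCoalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                   // thread k reads element k: coalesced
}

__global__ void copyOffset(const float* in, float* out, int n, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - offset) out[i] = in[i + offset]; // shifted start address: extra transactions on older hardware
}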


Coalesced memory access - devices with cc ≤ 1.1

• The kth thread in a half-warp must access the kth element in the block being read (not all threads have to participate)
• A contiguous region of global memory:
    • 64 bytes - each thread reads a word: int, float, ...
    • 128 bytes - each thread reads a double-word: int2, float2, ...
    • 256 bytes - each thread reads a quad-word: int4, float4, ... (two memory transactions)

84 / 110


Coalesced memory access - devices with cc ≥ 1.2

• Memory accesses are coalesced into a single memory transaction if the words accessed by all threads lie in the same segment of size equal to:
    • 32 bytes if all threads access 1-byte words,
    • 64 bytes if all threads access 2-byte words,
    • 128 bytes if all threads access 4-byte or 8-byte words.
• Coalescing is achieved for any pattern of addresses requested by the half-warp, including patterns where multiple threads access the same address.

85 / 110


Coalescing: Timing Results

Kernel: read a float, increment, write back
3M floats (12 MB), times averaged over 10K runs

• 12K blocks x 256 threads reading floats:
    • 356 µs - coalesced
    • 357 µs - coalesced, some threads don't participate
    • 3,494 µs - permuted/misaligned thread access
• 4K blocks x 256 threads reading float3s:
    • 3,302 µs - float3 uncoalesced
    • 359 µs - float3 coalesced through shared memory

cudaWhitepapers

86 / 110


Coalescing: Guidelines I

• Align data to fit equal segments in memory (arrays allocated with cudaMalloc... are positioned at appropriate addresses automatically)
• For single-dimensional arrays:
    • an array of type* accessed by BaseAddress + tid
    • type* must meet the size and alignment requirements
    • if the size of type* is larger than 16, it must be treated with additional care
• For two-dimensional arrays:
    • an array of type* accessed by BaseAddress + width*tiy + tix
    • width is a multiple of 16
    • the width of the thread block is a multiple of half the warp size

87 / 110
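For two-dimensional data the runtime can pad every row to a suitable pitch; a minimal sketch with assumed sizes (error checking omitted):

#include <cstdlib>
#include <cuda_runtime.h>

// cudaMallocPitch pads each row so that every row starts at an address
// satisfying the alignment requirements above.
void alloc2D(int width, int height)
{
    float* h_img = (float*)malloc(width * height * sizeof(float));
    float* d_img = NULL;
    size_t pitch = 0;                            // padded row size in bytes

    cudaMallocPitch((void**)&d_img, &pitch, width * sizeof(float), height);

    // copy row by row, respecting both pitches (host rows are tightly packed)
    cudaMemcpy2D(d_img, pitch,
                 h_img, width * sizeof(float),
                 width * sizeof(float), height, cudaMemcpyHostToDevice);

    // inside a kernel, row r starts at: (float*)((char*)d_img + r * pitch)

    cudaFree(d_img);
    free(h_img);
}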


Coalescing: Guidelines II

• If proper memory alignment is impossible:
    • Use structures of arrays instead of arrays of structures

      AoS: x1 y1 x2 y2 x3 y3 x4 y4
      SoA: x1 x2 x3 x4 y1 y2 y3 y4

    • Use __align__(4), __align__(8) or __align__(16) in structure declarations

88 / 110
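A minimal sketch (illustrative 12-byte element, similar to the float3 case on the following slides) of the two layouts:

struct PointAoS { float x; float y; float z; };  // array-of-structures element (12 bytes)

__global__ void shiftAoS(PointAoS* p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { p[i].x += 1.0f; p[i].y += 1.0f; p[i].z += 1.0f; }  // each field access has a 12-byte stride
}

__global__ void shiftSoA(float* x, float* y, float* z, int n)       // structure of arrays: three plain arrays
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { x[i] += 1.0f; y[i] += 1.0f; z[i] += 1.0f; }        // three fully coalesced accesses
}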


Coalescing example I

cudaWhitepapers

Misaligned memory access with float3 data

__global__ void accessFloat3(float3 *d_in, float3 *d_out)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    float3 a = d_in[index];
    a.x += 2;
    a.y += 2;
    a.z += 2;
    d_out[index] = a;
}

• Each thread reads 3 floats = 12 bytes
• A half-warp reads 16 * 12 = 192 bytes (three 64 B non-contiguous segments)

89 / 110


Coalescing example II

cudaWhitepapers

Coalesced memory access with float3 data

__global__ void accessFloat3Shared(float *g_in, float *g_out)
{
    int index = 3 * blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ float s_data[256*3];
    s_data[threadIdx.x]     = g_in[index];
    s_data[threadIdx.x+256] = g_in[index+256];
    s_data[threadIdx.x+512] = g_in[index+512];
    __syncthreads();
    float3 a = ((float3*)s_data)[threadIdx.x];
    a.x += 2;
    a.y += 2;
    a.z += 2;
    ((float3*)s_data)[threadIdx.x] = a;
    __syncthreads();
    g_out[index]     = s_data[threadIdx.x];
    g_out[index+256] = s_data[threadIdx.x+256];
    g_out[index+512] = s_data[threadIdx.x+512];
}

• The computation part remains the same

90 / 110


Organization of shared memory

• Shared memory is divided into equally sized memory modules, called banks.
• Different banks can be accessed simultaneously.
• A read or write to n addresses falling in n distinct banks multiplies the bandwidth of a single bank by n.
• If many threads refer to the same bank, the access is serialized: the hardware splits a memory request that has bank conflicts into as many separate conflict-free requests as necessary.
• There is one exception: if all threads within a half-warp access the same address, the word is broadcast instead of serialized (a rough bank-index formula is sketched below).
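
As a rough sketch of the mapping, assuming the 16-bank, 32-bit-word organisation described on the next slide:

// Returns the bank (0..15) that a 32-bit shared-memory word at byte
// offset 'byteAddr' maps to on a 16-bank device.
__device__ __host__ int bankOfWord(int byteAddr)
{
    return (byteAddr / 4) % 16;      // 4-byte words, 16 banks
}
// Example: a half-warp reading s[threadIdx.x * 2] (stride of 2 floats) maps
// thread t to bank (2*t) % 16, so pairs of threads collide (2-way conflict).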

91 / 110


Bank con�icts

Shared memory banks are organized so that successive 32-bit words are assigned to successive banks; each bank delivers 32 bits per clock cycle, so the bandwidth of shared memory is 32 bits per bank per clock cycle.

Important

For devices of compute capability 1.x, the warp size is 32 threads and the number of banks is 16.

92 / 110


Access with no bank con�icts

left: stride = 1; right: random permutation

cudaWhitepapers

93 / 110


Access with bank con�icts

left: stride = 2 (2-way bank conflict); right: stride = 8 (8-way bank conflict)

cudaWhitepapers

94 / 110


Avoiding bank con�icts

• Bank checker macro – included in the SDK (common) – not very reliable
• Padding – adding extra space between array elements in order to break the cyclic access to the same bank (a sketch follows below)
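
A minimal sketch of the padding idea for a 2D tile in shared memory, assuming 16 banks and a square matrix whose side 'width' is a multiple of TILE (the classic shared-memory transpose example):

#define TILE 16

// Launch with dim3 block(TILE, TILE) and dim3 grid(width/TILE, width/TILE).
__global__ void transposePadded(float *out, const float *in, int width)
{
    // The "+1" column of padding shifts each row by one bank, so the column
    // read tile[threadIdx.x][threadIdx.y] below walks through 16 different
    // banks instead of hitting a single bank 16 times.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    int xt = blockIdx.y * TILE + threadIdx.x;
    int yt = blockIdx.x * TILE + threadIdx.y;
    out[yt * width + xt] = tile[threadIdx.x][threadIdx.y];
}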

95 / 110


Example of bank con�icts removal in Scan I

Harris_scanCuda

96 / 110


Example of bank con�icts removal in Scan II

Harris_scanCuda

97 / 110


Padding implementation I

Harris_scanCuda

We need more space in shared memory:

unsigned int extra_space = num_elements / NUM_BANKS;

Padding macro:

#define NUM_BANKS 16
#define LOG_NUM_BANKS 4

#ifdef ZERO_BANK_CONFLICTS
#define CONFLICT_FREE_OFFSET(index) \
    (((index) >> LOG_NUM_BANKS) + ((index) >> (2 * LOG_NUM_BANKS)))
#else
#define CONFLICT_FREE_OFFSET(index) ((index) >> LOG_NUM_BANKS)
#endif

Achieving zero bank conflicts requires even more extra space:

#ifdef ZERO_BANK_CONFLICTS
extra_space += extra_space / NUM_BANKS;
#endif

98 / 110


Padding implementation II

Harris_scanCuda

Loading data into shared memory:

int ai = thid, bi = thid + (n/2);

// compute spacing to avoid bank conflicts
int bankOffsetA = CONFLICT_FREE_OFFSET(ai);
int bankOffsetB = CONFLICT_FREE_OFFSET(bi);

TEMP(ai + bankOffsetA) = g_idata[ai];
TEMP(bi + bankOffsetB) = g_idata[bi];

Algorithm:

int ai = offset*(2*thid+1)-1;
int bi = offset*(2*thid+2)-1;

ai += CONFLICT_FREE_OFFSET(ai);
bi += CONFLICT_FREE_OFFSET(bi);

TEMP(bi) += TEMP(ai);
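
TEMP(i) is not defined on these slides; in the referenced scan code it simply expands to an access into the padded shared-memory buffer. A hypothetical reconstruction:

// 'temp' is assumed to be the kernel's __shared__ float array, allocated
// with num_elements + extra_space entries to leave room for the offsets.
#define TEMP(index) temp[index]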

99 / 110


Outline

4 Optimizations
    Basics
    Measurements
    Instruction Optimizations
    Memory optimizations
        Playing with memory types
        Coalesced memory access
        Avoiding memory bank conflicts
    Asynchronous operations
        Stream API
    Assuring optimum load for a device

100 / 110


Streams API

• Applications manage concurrency through streams.
  • A stream is a sequence of commands that execute in order.
  • Different streams may execute their commands out of order with respect to one another, or concurrently.
• cudaStream_t – the stream type
  • cudaStreamCreate(&stream) – creates a new stream
  • cudaStreamDestroy(stream) – waits for all tasks in the stream to complete before destroying it
  • cudaStreamQuery(stream) – checks whether all preceding commands in the stream have completed
  • cudaStreamSynchronize(stream) – forces the runtime to wait until all preceding commands in the stream have completed
  • cudaThreadSynchronize() – forces the runtime to wait until all preceding device tasks in all streams have completed

101 / 110


Streams API – example

cudaProgrammingGuide

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);
float* hostPtr;
cudaMallocHost((void**)&hostPtr, 2 * size);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
    myKernel<<<100, 512, 0, stream[i]>>>
        (outputDevPtr + i * size, inputDevPtr + i * size, size);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
cudaThreadSynchronize();
for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(stream[i]);

102 / 110


Outline

4 Optimizations
    Basics
    Measurements
    Instruction Optimizations
    Memory optimizations
        Playing with memory types
        Coalesced memory access
        Avoiding memory bank conflicts
    Asynchronous operations
        Stream API
    Assuring optimum load for a device

103 / 110


Occupancy calculator I

• A tool for analysing multiprocessor occupancy (a programmatic alternative is sketched after this list)
• Input:
  • Compute capability
  • Threads per block
  • Registers per thread
  • Shared memory per block (bytes)
• Output:
  • GPU occupancy
  • Active blocks per multiprocessor
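
The spreadsheet is the tool referenced on this slide; as a hedged aside, newer CUDA runtimes (6.5 and later) expose the same query programmatically. A minimal sketch, with myKernel standing in for any real kernel:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data) { /* ... */ }

int main()
{
    int numBlocks = 0;                       // active blocks per multiprocessor
    int blockSize = 256;                     // threads per block
    size_t dynamicSmem = 0;                  // dynamic shared memory per block

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel,
                                                  blockSize, dynamicSmem);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = (float)(numBlocks * blockSize) / prop.maxThreadsPerMultiProcessor;
    printf("blocks/SM: %d, occupancy: %.0f%%\n", numBlocks, occupancy * 100);
    return 0;
}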

104 / 110


Occupancy calculator II

• Assuming a device with compute capability ≤ 1.1
• Consider an application with 256 threads per block, 10 registers per thread, and 4 KB of shared memory per thread block.
• Then it can schedule 3 thread blocks and 768 threads on each SM (the arithmetic is sketched below).
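
A minimal sketch of how the limits combine; the per-SM limits used here (768 threads, 8,192 registers, 16 KB of shared memory) are assumptions taken from the compute-capability-1.0/1.1 generation:

#include <stdio.h>

int main(void)
{
    // Per-SM resource limits (compute capability 1.0/1.1).
    const int maxThreadsPerSM = 768;
    const int maxRegsPerSM    = 8192;
    const int maxSmemPerSM    = 16 * 1024;   // bytes

    // Kernel resource usage from the slide.
    const int threadsPerBlock = 256;
    const int regsPerThread   = 10;
    const int smemPerBlock    = 4 * 1024;    // bytes

    int byThreads = maxThreadsPerSM / threadsPerBlock;                 // 3
    int byRegs    = maxRegsPerSM / (threadsPerBlock * regsPerThread);  // 3
    int bySmem    = maxSmemPerSM / smemPerBlock;                       // 4

    int blocks = byThreads;                                            // min of the three
    if (byRegs < blocks) blocks = byRegs;
    if (bySmem < blocks) blocks = bySmem;

    printf("blocks/SM = %d, threads/SM = %d\n", blocks, blocks * threadsPerBlock);
    return 0;
}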

105 / 110


Occupancy calculator III

Applying an optimization:

• Increases each thread's register usage from 10 to 11 (an increase of only 10%)
• Decreases the number of blocks per SM from 3 to 2
• Decreases the number of threads on an SM by 33%, because of the 8,192-register limit per SM (the arithmetic is worked out below)
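
In numbers: 256 threads * 11 registers = 2,816 registers per block; three blocks would need 8,448 registers > 8,192, so only two blocks fit, giving 2 * 256 = 512 threads per SM instead of 768 (a 33% drop).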

Ryoo07programoptimization

107 / 110


Device Computation Capabilities

109 / 110


Bibliography

110 / 110