
GPU Compute


DESCRIPTION

Introduction to GPU Compute (think CUDA, OpenCL, DirectCompute...) for "non-GPU" programmers. Slight focus on the videogame business.


Page 1: GPU Compute


Page 2: GPU Compute

So I found this Batman image on the net. I had to put it somewhere in the presentation. Here it is.

Nvidia has a page on its «CUDA zone» showing applications and speedups; it's a bit silly and not so advertised anymore nowadays, but still useful as an introduction: http://www.nvidia.com/object/cuda-apps-flash-new.html#

Intel's Knights Corner news:
http://www.intel.com/pressroom/archive/releases/2010/20100531comp.htm
http://en.wikipedia.org/wiki/Intel_MIC#Knights_Corner


Page 3: GPU Compute

A «wild» and not even particularly creative example... More computation could enable entirely novel scenarios... The past couple of Unreal Engine demos were architected to show some of what could be possible…


Page 4: GPU Compute

I like the idea of motion graphs: http://cg.cis.upenn.edu/hms/research/wcMG/
Also, imperfect shadow maps are cool: http://cg.ibds.kit.edu/english/publications.php
The two images top right are for megameshes (Lionhead) and megatextures (id).
Fight Night Round 4 and Champion are the first games using EA's Real AI, which is trained from examples.


Page 5: GPU Compute

Naughty Dog famously hand-optimizes SPU kernels. How many games pack so much computation, even today, onto SPUs? And still, these are pretty «uniform» workloads, on a processor that can handle non-uniform loads better than GPU compute can (lower SIMD width).

Heart bokeh image taken from http://www.augustinefou.com/2009/08/bokeh-filter-turns-blurry-lights-into.html; I like how the woman is pushed out of the frame to show the bokeh, as a metaphor for how we did some effects "early" on 360 and PS3.


Page 6: GPU Compute

* Data-parallel is "harder" for more high-level tasks, but it's far from impossible. In fact, we still rely on lots of data; we just let this data be combed through with mostly handmade serial logic… Use learning, classification, probabilistic and optimization methods for games...

E.g. in animation, we have data: go from blend trees to animation clips all the way down to working with individual poses, and use the animation data itself and its metrics to drive animation, not logic! http://graphics.stanford.edu/projects/ccclde/ccclde.pdf http://www.stanford.edu/~svlevine/

Another avenue: simulation, modeling, physics…

The last image is from http://www.creativeapplications.net/theory/meibach-and-posavec-data-visualization-poetry-and-sculpture/; think about what you can do with all the data we already have…


Page 7: GPU Compute


Page 8: GPU Compute

Usually also superscalar, multiple pipes, SIMD… but for now we don't care.


Page 9: GPU Compute

Ok, ok, the last Batman…


Page 10: GPU Compute

...but often it is not; it's «surprising» (though consider that the compiler often does not know whether a given loop is executed often or not) how much manual unrolling (and prefetching, on platforms without strong cache predictors) in tight loops still pays, allowing the compiler to compute a good, dependency-free schedule for the generated assembly.
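As a minimal sketch of the kind of unrolling meant here (a hypothetical example of mine, not taken from the slides): splitting one accumulator into four breaks the serial dependency chain and gives the compiler independent instructions to schedule.

```cpp
// Summing with a single accumulator: every add depends on the previous
// one, so an in-order pipeline stalls on the dependency chain.
float sum_naive(const float* a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// Manually unrolled 4x with independent accumulators: the compiler can
// now keep four adds in flight at once. (Note: reassociating float
// additions can change the result slightly.)
float sum_unrolled(const float* a, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i) s0 += a[i];  // remainder
    return (s0 + s1) + (s2 + s3);
}
```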


Page 11: GPU Compute


Page 12: GPU Compute


Page 13: GPU Compute

First  «insight»  we  learned  


Page 14: GPU Compute

Again,  same  reasoning  


Page 15: GPU Compute


Page 16: GPU Compute

Fibers and coroutines are used, for example, to hide I/O and network access in web servers (a popular example is node.js).


Page 17: GPU Compute

For a sequential CPU loop, all this work is done *in the specified order*. The compiler may reschedule instructions without breaking the logical order of operations, and an OOO CPU (e.g. on PC, but not PS3 or 360) will re-order instructions (well, micro-instructions) to get rid of dependencies, but the processing order is essentially preserved. By contrast, on the GPU, all you're saying is that you want the whole array processed; you don't care about the order.
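A minimal CUDA sketch of that contrast (my own example, with assumed names; not code from the slides): the CPU loop fixes the visiting order, while the kernel only states that each element must be processed, leaving the schedule to the hardware.

```cpp
#include <cuda_runtime.h>

// CPU version: elements are visited strictly in index order.
void scale_cpu(float* data, int n, float k) {
    for (int i = 0; i < n; ++i)
        data[i] *= k;
}

// GPU version: one thread per element, no ordering implied or needed,
// because no element depends on any other.
__global__ void scale_gpu(float* data, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= k;
}

// Launch example: scale_gpu<<<(n + 255) / 256, 256>>>(d_data, n, k);
```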


Page 18: GPU Compute

Of  course  there  is  other  stuff  too,  and  never  mistake  a  logical  model  for  a  physical  one,  but  this  is  enough  for  everything  we’ll  need  


Page 19: GPU Compute

Vectors are usually even larger than the number of ALUs: repeat the same instruction a number of times to process an entire vector (this allows the ALUs to be pipelined; e.g. a vector of 16 float4 elements might be processed by four float4 ALUs issuing 4 at a time, where each ALU pipeline stage can compute a single float per cycle, and with four stages we achieve float4 processing with a throughput of one per cycle and a latency of four).

We say "threads" here, but they are more similar to lightweight fibers. Also, confusingly, different GPU languages use different names for the same concepts… The more computational threads we can pack into the registers, the more independent work we'll have to hide latencies!

This lightweight threading mechanism is fundamental for performance with manycores; see e.g. http://www.linleygroup.com/newsletters/newsletter_detail.php?num=4747&year=2011&tag=3 which seems to be a derivative of the Tilera Tile64 http://www.tilera.com/products/processors/TILE64


Page 20: GPU Compute

Illustration stolen from Running Code at a Teraflop: Overview of GPU Architecture, Kayvon Fatahalian, SIGGRAPH 2009.

Some GPUs do not support variable-length loops; everything gets unrolled and inlined...


Page 21: GPU Compute

E.g. Pixar's Reyes rendering algorithm had interpreted, very parallel shaders. The cost of the interpreter is zero when you repeat the same instruction over thousands of elements. Reyes shaders are essentially the same as the hardware shaders we have on modern GPUs.

Note that GPU shaders have some explicit SIMD instructions in HLSL and similar languages, supporting vectors of elements up to 4-wide. In practice, many modern GPUs process only scalar code, turning vector instructions into scalars. This is because the SIMD parallelism that GPUs implement is much wider than the 4-wide instructions of the language, coupling many (e.g. 16 or more) execution elements together in a big execution vector. This holds regardless of whether each element is scalar or 4-way; some GPUs even have a 4+1 configuration (one 4-way and one scalar instruction in parallel per execution element). In general, when we talk about "GPU vectors" we don't mean the actual vector instructions in the shader code but the aggregation of many ALUs following the same code.


Page 22: GPU Compute


Page 23: GPU Compute


Page 24: GPU Compute

Skinning images from http://blog.wolfire.com/2009/11/volumetric-heat-diffusion-skinning/

If you've done rendering on the GPU, you might notice that the pseudo-code looks nothing like a vertex shader, where you don't fetch data; it's passed as the parameters of the «main» function of a kernel. The data is laid out in GPU memory and bound via «streams», «constants» and other means. I didn't include a real vertex shader code example, even if it would not have been harder to read, because it might even be misleading. The way DirectX or OpenGL exposes programmable functionality to the user might be close to the hardware, but most often nowadays it is not. E.g. shaders might indeed compile to include explicit fetches...

Pixel shaders can «kill» their output, so to be pedantic they can output one or zero elements. Also, pixel processing might be culled «early»; this is very important.

The «independence» that I assert is not entirely true. Pixel shaders need derivative information for all the variables; these are computed using finite differences from the neighboring pixels, which not only constrains them to always be processed in 2x2 pixel units (quads) but also establishes, through the derivatives (ddx, ddy), a means of communication (e.g. see Eric Penner's Shader Amortization using Pixel Quad Message Passing, GPU Pro 2)
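To make the «fetch your own data» point concrete, here is a hypothetical compute-style skinning kernel in CUDA; the struct layout, names, and fixed four influences per vertex are all assumptions for illustration, not the slides' actual pseudo-code:

```cpp
#include <cuda_runtime.h>

struct Influence { int bone; float weight; };  // assumed layout

// Unlike a vertex shader, whose inputs arrive as «main» parameters bound
// to streams and constants, this kernel explicitly fetches everything it
// needs from global memory by index.
__global__ void skin_verts(const float4* __restrict__ restPos,  // per vertex
                           const Influence* __restrict__ infl,  // 4 per vertex
                           const float* __restrict__ boneMats,  // 3x4, row-major
                           float4* __restrict__ outPos, int numVerts) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= numVerts) return;

    float4 p = restPos[v];                  // explicit fetch
    float3 acc = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < 4; ++j) {
        Influence in = infl[v * 4 + j];     // explicit fetch
        const float* m = boneMats + in.bone * 12;
        acc.x += in.weight * (m[0] * p.x + m[1] * p.y + m[2]  * p.z + m[3]);
        acc.y += in.weight * (m[4] * p.x + m[5] * p.y + m[6]  * p.z + m[7]);
        acc.z += in.weight * (m[8] * p.x + m[9] * p.y + m[10] * p.z + m[11]);
    }
    outPos[v] = make_float4(acc.x, acc.y, acc.z, 1.0f);
}
```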


Page 25: GPU Compute

We'll see some more accurate pseudo-code later… for now, let's still pretend to work with a generic "in-parallel" primitive.


Page 26: GPU Compute


Page 27: GPU Compute

* At least in our representation of the GPU... In reality, compilers can decide to spill some of the local state into the shared memory (we'll see later) and allow for a stack and so on, but that is quite a different thing...

E.g. Nvidia Fermi has a limit of 64 temporary registers in kernel code (the number of registers per core is higher, but it is used for multiple contexts).


Page 28: GPU Compute


Page 29: GPU Compute

ALL THE PSEUDO-CODE OMITS BOUNDS CHECKS :)
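In real kernel code the usual guard looks like this (a trivial CUDA sketch, assuming a 1D launch rounded up to cover n elements):

```cpp
// Round the grid up to cover n, then have each thread verify that its
// global index is actually in range before touching memory.
__global__ void add_one(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;  // the bounds check the pseudo-code leaves out
    data[i] += 1.0f;
}

// Launch example: add_one<<<(n + 255) / 256, 256>>>(d_data, n);
```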


Page 30: GPU Compute

http://en.wikipedia.org/wiki/Sorting_network
http://www.cs.brandeis.edu/~hugues/sorting_networks.html
http://www.sciencedirect.com/science/article/pii/0020019089901968
http://www.springerlink.com/content/g002019156467227/

Tangentially related, but interesting: http://www.pvk.ca/Blog/2012/08/27/tabasco-sort-super-optimal-merge-sort/
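Since the slide doesn't show code, here is a tiny illustration of why sorting networks suit SIMD hardware (my example, not from the deck): the sequence of compare-exchanges is fixed and data-independent, so every lane can run exactly the same instructions.

```cpp
#include <math.h>

// Branch-free compare-exchange: after the call, a <= b.
__host__ __device__ inline void cswap(float& a, float& b) {
    float lo = fminf(a, b), hi = fmaxf(a, b);
    a = lo; b = hi;
}

// Optimal 4-input sorting network: 5 comparators, depth 3.
// The comparator sequence never depends on the data.
__host__ __device__ inline void sort4(float v[4]) {
    cswap(v[0], v[1]); cswap(v[2], v[3]);  // layer 1
    cswap(v[0], v[2]); cswap(v[1], v[3]);  // layer 2
    cswap(v[1], v[2]);                     // layer 3
}
```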


Page 31: GPU Compute


Page 32: GPU Compute

Rendering is a non-uniform workload; fully automated scheduling has its benefits… e.g. early-z / early-stencil rejection, which performs a hardware-automated «stream compaction»! Fundamental for rendering...


Page 33: GPU Compute

At first it's «easy» to complain about bank rules and coalescing issues when trying to access shared and global memory... But we have to consider that the GPU offers the ability to access different memory locations per SIMD element, on many of them, on many cores... This is a rather impressive feat; on standard CPUs, to load data into SIMD vectors you have to do it serially, one element at a time. Memory has addressing limitations; it's a limit of the technology.

SIMD communication via voting: all, any, ballot. Atomics: Add, Sub, Inc, Dec, And, Or, Xor, Exch, Min, Max, CompareAndSwap. Append (or) consume structured buffer views (DX11)
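A small CUDA sketch of those two primitive families together (my example; assumes CUDA 9+ for __ballot_sync, a 32-wide warp, and a block size that is a multiple of 32): a warp vote aggregates a predicate, and a single atomic per warp accumulates the count.

```cpp
// Count the elements that pass a predicate, using a warp ballot to
// aggregate and one atomicAdd per warp instead of one per thread.
__global__ void count_positive(const float* data, int n, int* counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    bool pred = (i < n) && (data[i] > 0.0f);

    // Every lane gets a 32-bit mask of the whole warp's predicate.
    unsigned mask = __ballot_sync(0xffffffffu, pred);

    if ((threadIdx.x & 31) == 0)           // first lane of each warp
        atomicAdd(counter, __popc(mask));  // add the warp's vote count
}
```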


Page 34: GPU Compute

If we didn't have control over scheduling, we would not be able to access core-local memory, obviously, as we would not know which elements are scheduled where.


Page 35: GPU Compute

For convenience, in compute, indices can be generated on 1D, 2D or 3D grids.

ResetCoreMemory… memory is not *really* reset, but we can't rely on its contents; the initialization has to be done in the kernel, and any output has to go into GLOBAL_MEM, so I "created" this pseudo-function to illustrate this in a C-like way.
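In CUDA terms the same ideas might look like the following (a hypothetical histogram kernel of mine, not the slides' pseudo-code): 2D indices generated from the grid, an explicit in-kernel initialization standing in for ResetCoreMemory, and results flushed out to global memory.

```cpp
// Per-block histogram of an 8-bit image tile, illustrating:
// 1) 2D index generation, 2) shared memory must be initialized in the
// kernel (its contents are undefined on entry), 3) outputs must be
// written to global memory.
__global__ void histogram_tile(const unsigned char* image,
                               int width, int height, int* globalHist) {
    __shared__ int localHist[256];

    int x = blockIdx.x * blockDim.x + threadIdx.x;   // 2D global index
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int t = threadIdx.y * blockDim.x + threadIdx.x;  // flat thread id

    // The "ResetCoreMemory" step: cooperative initialization.
    for (int b = t; b < 256; b += blockDim.x * blockDim.y)
        localHist[b] = 0;
    __syncthreads();

    if (x < width && y < height)
        atomicAdd(&localHist[image[y * width + x]], 1);
    __syncthreads();

    // Flush the per-core result into global memory.
    for (int b = t; b < 256; b += blockDim.x * blockDim.y)
        atomicAdd(&globalHist[b], localHist[b]);
}
```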


Page 36: GPU Compute


Page 37: GPU Compute


Page 38: GPU Compute

In the picture, the first Connection Machine, a 65536-processor supercomputer of the '80s.


Page 39: GPU Compute

Shared memory is not exposed to pixel and vertex shaders, as it requires manual scheduling of elements onto cores. It's still used internally by the graphics card to store rasterization results and shared information…
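What that manual scheduling buys in compute: threads of the same block are guaranteed to share a core, so they can cooperate through shared memory. A standard illustration (my sketch, assuming 256 threads per block and a power-of-two block size) is the tree reduction:

```cpp
// Each block cooperatively sums 256 elements through shared memory,
// something that can't be expressed in a pixel or vertex shader.
__global__ void block_sum(const float* data, int n, float* partialSums) {
    __shared__ float scratch[256];   // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    scratch[threadIdx.x] = (i < n) ? data[i] : 0.0f;
    __syncthreads();

    // Tree reduction: halve the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            scratch[threadIdx.x] += scratch[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        partialSums[blockIdx.x] = scratch[0];  // one partial sum per block
}
```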


Page 40: GPU Compute


Page 41: GPU Compute

Note how widening the number of outputs per element looks like «unrolling» the computation...

Such scheduling decisions have an impact on latency hiding (switching contexts), on the ability to use registers as a cache (e.g. in this computation, as we average columns, going wide on columns will enable reusing fetches from registers), on the ability to synchronize different tasks in a single kernel (persistent threads), and so on...
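A sketch of «going wide» on a horizontal 3-tap average (my example, not the slide's exact column-averaging kernel): producing four adjacent outputs per thread lets interior fetches be reused from registers.

```cpp
// One thread computes four adjacent outputs of a 3-tap box filter:
// six loads feed four outputs, versus twelve loads for the
// one-output-per-thread version. Borders are skipped for brevity.
__global__ void blur3_wide(const float* in, float* out, int width) {
    int x0 = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (x0 == 0 || x0 + 4 >= width) return;

    // Fetch once into registers; each interior sample feeds up to
    // three of the four outputs below.
    float a = in[x0 - 1], b = in[x0],     c = in[x0 + 1],
          d = in[x0 + 2], e = in[x0 + 3], f = in[x0 + 4];
    const float k = 1.0f / 3.0f;
    out[x0 + 0] = (a + b + c) * k;
    out[x0 + 1] = (b + c + d) * k;
    out[x0 + 2] = (c + d + e) * k;
    out[x0 + 3] = (d + e + f) * k;
}
```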


Page 42: GPU Compute

Also, there are many libraries worth checking out: Thrust, Nvidia Performance Primitives, CUDPP, libCL, etc… Even more for tasks that are inherently parallel: linear algebra, imaging and so on.


Page 43: GPU Compute


Page 44: GPU Compute


Page 45: GPU Compute


Page 46: GPU Compute

All the arrows here represent sums into the existing values; the "self" part of the addition is not shown, to avoid confusing things with too many arrows.

http://en.wikipedia.org/wiki/Prefix_sum
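The diagram corresponds to the naive O(n log n) scheme (often credited to Hillis and Steele): at step d, element i adds in element i - 2^d. A minimal single-block CUDA sketch of it (my code; assumes blockDim.x == 256 and n <= 256), double-buffered to avoid read/write races:

```cpp
// Naive (Hillis-Steele) inclusive scan of up to 256 elements in one block.
__global__ void scan_block(const float* in, float* out, int n) {
    __shared__ float buf[2][256];
    int t = threadIdx.x;
    int cur = 0;

    buf[cur][t] = (t < n) ? in[t] : 0.0f;  // pad with zeros
    __syncthreads();

    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        int nxt = 1 - cur;
        // Element t adds in element t - offset, exactly like the arrows.
        buf[nxt][t] = (t >= offset) ? buf[cur][t] + buf[cur][t - offset]
                                    : buf[cur][t];
        cur = nxt;
        __syncthreads();
    }

    if (t < n) out[t] = buf[cur][t];
}
```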


Page 47: GPU Compute

All the black arrows here represent sums into the existing values; the "self" part of the addition is not shown, as in the previous slide. The red arrows mark element "moves".

http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
www.umiacs.umd.edu/~ramani/cmsc828e_gpusci/ScanTalk.pdf


Page 48: GPU Compute

The moderngpu website has lots of code, and it takes a «scan-centric» approach to parallel computation:
http://www.moderngpu.com/intro/scan.html
http://www.moderngpu.com/scan/segscan.html


Page 49: GPU Compute

Blelloch's paper references many application domains: www.cs.cmu.edu/~guyb/papers/Ble93.pdf

http://back40computing.googlecode.com/svn-history/r225/wiki/documents/RadixSortTR.pdf


Page 50: GPU Compute


Page 51: GPU Compute

http://www.cse.chalmers.se/~olaolss/papers/Efficient%20Stream%20Compaction%20on%20Wide%20SIMD%20Many-Core%20Architectures.pdf
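In scan-based form (my sketch of the standard two-pass approach, assuming the exclusive scan of the keep-flags was produced elsewhere, e.g. by a kernel like the one a few slides back): the scan gives each surviving element its output slot, and a scatter packs the survivors contiguously.

```cpp
// Pass 2 of scan-based stream compaction: flags[i] is 1 if element i is
// kept; exclusiveScan is the exclusive prefix sum of flags, so it holds
// each kept element's destination index.
__global__ void scatter_kept(const float* in, const int* flags,
                             const int* exclusiveScan,
                             float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        out[exclusiveScan[i]] = in[i];
}

// Survivor count = exclusiveScan[n - 1] + flags[n - 1].
```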


Page 52: GPU Compute


Page 53: GPU Compute

Images from the original paper: On-the-fly Point Clouds through Histogram Pyramids. Gernot Ziegler, Art Tevs, Christian Theobalt, Hans-Peter Seidel.
http://www.mpi-inf.mpg.de/~gziegler/gpu_pointlist/paper17_gpu_pointclouds.pdf
http://www.mpi-inf.mpg.de/%7Egziegler/gpu_pointlist/slides_vmv2006.pdf
http://www.astro.lu.se/compugpu2010/resources/bonus_histopyramid.pdf
Marching cubes using HP: http://diglib.eg.org/EG/DL/CGF/volume27/issue8/v27i8pp2028-2039.pdf.abstract.pdf


Page 54: GPU Compute

A good rendering overview: https://graphics.stanford.edu/wikis/cs448s-10/FrontPage?action=AttachFile&do=get&target=08-GPUArchII.pdf


Page 55: GPU Compute

Some of the compromises, many differences in the implementation… See Debunking the 100x GPU vs CPU Myth for more details: http://www.hwsw.hu/kepek/hirek/2010/06/p451-lee.pdf


Page 56: GPU Compute

There are many languages trying to be both GPU and CPU targets; OpenCL has CPU backends, without even considering CPUs and GPUs on the same chip like AMD Fusion (see e.g. synergy.cs.vt.edu/pubs/papers/daga-saahpc11-apu-efficacy.pdf) and languages that target the CPU for wide SIMD computing like Intel's open-source SPMD compiler ISPC http://ispc.github.com/ … http://tbex.twbbs.org/~tbex/pad/CA_papers/Twin%20Peaks.pdf

Throughput computing is everywhere, not only for high-performance calculations but also servers, low power and so on.

Amdahl for multicores: research.cs.wisc.edu/multifacet/papers/ieeecomputer08_amdahl_multicore.pdf


Page 57: GPU Compute


Page 58: GPU Compute

Some  “bonus”  reads…  

BASICS
• Nvidia CUDA programming guide: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
• N-Body Simulation with CUDA, from GPU Gems 3: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html

ADVANCED
• Efficient Primitives and Algorithms for Many-core Architectures: http://www.cs.ucdavis.edu/research/tech-reports/2011/CSE-2011-4.pdf
• Persistent threads programming: http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0157-GTC2012-Persistent-Threads-Computing.pdf
• More on PT and task parallelism: http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0138-GTC2012-Parallelism-Primitives-Applications.pdf
• Registers or shared memory? http://mc.stanford.edu/cgi-bin/images/6/65/SC08_Volkov_GPU.pdf
• Optimization tips: http://www.cs.virginia.edu/~skadron/Papers/cuda_tuning_bof_sc09_final.pdf

DATA STRUCTURES… TONS of spatial ones, very few general-purpose ones… NOTE: attributes of parallel data structures:
