http://cs264.org
http://goo.gl/1K2fI
– Data-independent tasks
– Tasks with statically-known data dependences
– SIMD divergence
– Lacking fine-grained synchronization
– Lacking writeable, coherent caches
Device            32-bit Key-Value Sorting (10^6 pairs/sec)    Keys-Only Sorting (10^6 keys/sec)
NVIDIA GTX 280    449 (3.8x speedup*)                          534 (2.9x speedup*)

* Satish et al., "Designing efficient sorting algorithms for manycore GPUs," in IPDPS '09
Device                              32-bit Key-Value Sorting (10^6 pairs/sec)    Keys-Only Sorting (10^6 keys/sec)
NVIDIA GTX 480                      775                                          1005
NVIDIA GTX 280                      449                                          534
NVIDIA 8800 GT                      129                                          171
Intel Knights Ferry MIC 32-core*                                                 560
Intel Core i7 quad-core*                                                         240
Intel Core 2 quad-core*                                                          138

* Satish et al., "Fast Sort on CPUs, GPUs and Intel MIC Architectures," Intel Tech Report 2010.
– Each output is dependent upon a finite subset of the input
  • Threads are decomposed by output element
  • The output (and at least one input) index is a static function of thread-id

[Diagram: threads statically mapping input elements to output elements]
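A minimal CUDA sketch of this static decomposition (hypothetical example, not from the talk): each thread's output index is a pure function of its thread-id, so no inter-thread coordination is needed.

// Hypothetical statically-mapped task: each output element depends only on a
// small, statically-known window of the input.
__global__ void StaticMap(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // output index: static function of thread-id
    if (i > 0 && i < n - 1)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;   // finite input subset
}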
– Each output element has dependences upon any / all input elements
– E.g., sorting, reduction, compaction, duplicate removal, histogram generation, map-reduce, etc.

[Diagram: input-to-output mapping unknown a priori ("?")]
– Threads are decomposed by output element
– Repeatedly iterate over recycled input streams
– Output stream size is statically known before each pass

[Diagram: a fixed set of threads re-applied to each pass's output stream]
– O(n) global work from passes of pairwise-neighbor-reduction
– Static dependences, uniform output

[Diagram: pairwise-neighbor reduction tree]
– Repeated pairwise swapping
  • Bubble sort is O(n^2)
  • Bitonic sort is O(n log^2 n)
– Need partitioning: dynamic, cooperative allocation
– Repeatedly check each vertex or edge
  • Breadth-first search becomes O(V^2)
  • O(V+E) is work-optimal
– Need queue: dynamic, cooperative allocation
– Variable output per thread
– Need dynamic, cooperative allocation
• Where do I put something in a list?
  – Duplicate removal
  – Sorting
  – Histogram compilation
• Where do I enqueue something?
  – Search space exploration
  – Graph traversal
  – General work queues

[Diagram: many threads contending to place outputs at as-yet-unknown locations ("?")]
• For 30,000 producers and consumers?
  – Locks serialize everything

Use prefix scan instead:
  – O(n) work
  – For allocation: use scan results as a scattering vector
  – Popularized by Blelloch et al. in the '90s
  – Merrill et al. Parallel Scan for Stream Architectures. Technical Report CS2009-14, University of Virginia. 2009

  Input (& allocation requirement):          2 1 0 3 2
  Result of prefix scan (sum):               0 2 3 3 6
  Output (one slot per requested element):   0 1 2 3 4 5 6 7
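To make "scan results as a scattering vector" concrete, a minimal sketch (hypothetical kernel and names; thrust::exclusive_scan is the standard Thrust call): each thread states its allocation requirement, an exclusive prefix sum converts requirements into starting offsets, and each thread scatters its variable-sized output into its own range.

#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Scatter each thread's variable-sized output to the offset computed by scan.
__global__ void Scatter(const int *reqs, const int *offsets,
                        const int *vals, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int base = offsets[i];                  // where thread i's allocation begins
        for (int j = 0; j < reqs[i]; ++j)
            out[base + j] = vals[i];            // cooperative, lock-free allocation
    }
}

void Example(thrust::device_vector<int> &reqs)  // e.g., {2, 1, 0, 3, 2}
{
    thrust::device_vector<int> offsets(reqs.size());
    thrust::exclusive_scan(reqs.begin(), reqs.end(), offsets.begin());  // {0, 2, 3, 3, 6}
    // total output size = offsets.back() + reqs.back() (8 in the example above)
}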
[Figure: radix partitioning by one binary digit place into "0s" and "1s" bins]
  Key sequence:          1110 1010 1100 1000 0011 0111 0101 0001
  Allocation requirements (one flag vector per bin) → scanned allocations (bin relocation offsets) → adjusted allocations (global relocation offsets)
  Output key sequence:   1110 1100 0011 0111 1010 1000 0101 0001
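The same idea drives a radix split. A serial C++ reference of one binary split pass (an illustrative sketch, not the GPU kernels): flag each key by its digit, prefix-sum the flags into within-bin offsets, add each bin's base offset, then scatter.

#include <vector>
#include <cstdint>

// One binary radix split: stable partition of keys by the digit at 'bit'.
void SplitPass(const std::vector<uint32_t> &keys,
               std::vector<uint32_t> &out, int bit)
{
    out.resize(keys.size());
    size_t zeros = 0;                                    // 0s-bin allocation total
    for (uint32_t k : keys) zeros += ((k >> bit) & 1) == 0;
    size_t next0 = 0, next1 = zeros;                     // running prefix sums (bin bases)
    for (uint32_t k : keys)
        out[((k >> bit) & 1) ? next1++ : next0++] = k;   // global relocation offsets
}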
Un-fused vs. fused

[Diagram: un-fused — the host program issues separate launches ("Determine allocation size", the multi-kernel CUDPP Scan, "Distribute output"), each round-tripping through global device memory]

[Diagram: fused — three scan launches, with "determine allocation" and "distribute output" folded into the scan kernels themselves]

1. Heavy SMT (over-threading) yields usable "bubbles" of free computation
2. Propagate live data between steps in fast registers / smem
3. Use scan (or variant) as a "runtime" for everything
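A minimal sketch of the fusion payoff (hypothetical kernels, not the talk's code). Un-fused, a "square" kernel would write x*x to global memory and a separate reduce kernel would read it back; fused, the intermediate values live only in registers and smem. Assumes a 256-thread CTA.

// Fused "square then reduce": one launch, one gmem read and one gmem write per tile.
__global__ void FusedSquareReduce(const float *in, float *block_sums, int n)
{
    __shared__ float smem[256];                       // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = (i < n) ? in[i] : 0.0f;
    smem[threadIdx.x] = x * x;                        // step 1: stays on-chip
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {    // step 2: tree reduction in smem
        if (threadIdx.x < s) smem[threadIdx.x] += smem[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) block_sums[blockIdx.x] = smem[0];
}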
Device           Memory Bandwidth    Compute Throughput        Memory wall      Memory wall
                 (10^9 bytes/s)      (10^9 thread-cycles/s)    (bytes/cycle)    (instrs/word)
GTX 480          169.0               672.0                     0.251            15.9
GTX 285          159.0               354.2                     0.449            8.9
GTX 280          141.7               311.0                     0.456            8.8
Tesla C1060      102.0               312.0                     0.327            12.2
9800 GTX+        70.4                235.0                     0.300            13.4
8800 GT          57.6                168.0                     0.343            11.7
9800 GT          57.6                168.0                     0.343            11.7
8800 GTX         86.4                172.8                     0.500            8.0
Quadro FX 5600   76.8                152.3                     0.504            7.9
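A worked check of how the two derived columns relate (added for clarity; assumes the 32-bit / 4-byte words used throughout the talk):

\[
\text{bytes/cycle} = \frac{\text{memory bandwidth}}{\text{compute throughput}}, \qquad
\text{instrs/word} = \frac{\text{word size (bytes)}}{\text{bytes/cycle}}
\]
\[
\text{GTX 480: } \frac{169.0}{672.0} \approx 0.251 \ \text{bytes/cycle}, \qquad \frac{4}{0.251} \approx 15.9 \ \text{instrs/word}
\]
\[
\text{GTX 285, read+write (8 bytes per element): } \frac{8}{0.449} \approx 17.8 \ \text{instrs per input word}
\]

The last figure is the "r+w memory wall" plotted in the charts below.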
[Chart: thread-instructions per 32-bit scan element vs. problem size (millions); the horizontal line marks the GTX 285 r+w memory wall (17.8 instructions per input word)]
[Chart, built up: a bare "Data Movement Skeleton" kernel and "Our Scan Kernel" plotted against the GTX 285 r+w memory wall (17.8); the gap up to the wall is labeled "insert work here"]
– Increase granularity / redundant computation
  • ghost cells
  • radix bits
– Orthogonal kernel fusion
[Chart: "CUDPP Scan Kernel" vs. "Our Scan Kernel", thread-instructions per 32-bit scan element vs. problem size (millions)]
– Partially-coalesced writes
– 2x write overhead
– 4 total concurrent scan operations (radix 16)

[Chart: "Our Scan Kernel" against the GTX 285 scan-kernel wall and radix-scatter-kernel wall; the remaining gap is labeled "insert work here"]
Insert work here
– Need kernels with tunable local (or redundant) work
• ghost cells
• radix bits
285 Radix Scader Kernel
Wall
480 Radix Scader Kernel
Wall
0
5
10
15
20
25
30
35
40
45
50
0 10 20 30 40 50 60 70 80 90
Thread
-‐instructoins / 32-‐bit w
ord
Problem Size (millions)
– Virtual processors abstract a diversity of hardware configurations
– Leads to a host of inefficiencies
– E.g., use only several hundred CTAs

[Diagram: Grid A — grid-size = (N / tilesize) CTAs, one threadblock per tile; vs. Grid B — grid-size = 150 CTAs (or other small constant), each threadblock iterating over many tiles]
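A sketch of the Grid B style (hypothetical): launch a small constant number of CTAs and let each iterate over tiles, rather than launching N / tilesize CTAs.

// Grid B: fixed grid of e.g. 150 CTAs; each CTA loops over many tiles.
// (Assumes N is a whole number of blockDim.x-sized tiles.)
__global__ void ProcessAllTiles(const float *in, float *out, int num_tiles)
{
    for (int tile = blockIdx.x; tile < num_tiles; tile += gridDim.x) {
        int i = tile * blockDim.x + threadIdx.x;   // offsets recomputed once per tile
        out[i] = in[i];                            // placeholder per-tile work
    }
}
// Launch: ProcessAllTiles<<<150, 256>>>(d_in, d_out, N / 256);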
– Thread-dependent predicates
– Setup and initialization code (notably for smem)
– Offset calculations (notably for smem)
– Common values are hoisted and kept live
– Spills are really bad
– O(N / tilesize) gmem accesses
– 2-4 instructions per access (offset calcs, load, store)
– GPU is least efficient here: get it over with as quickly as possible
– log_tilesize(N)-level tree vs. two-level tree
[Chart: "Compute Load" — thread-instructions per element vs. grid size (# of threadblocks), against the GTX 285 scan kernel wall]
C = number of CTAs; N = problem size; T = tile size; B = tiles per CTA

– Naive: 16.1M / 150 CTAs / 1024 = 109.91 tiles per CTA
  • conditional evaluation
  • singleton loads
– Flooring: floor(16.1M / (1024 * 150)) = 109 tiles per CTA, leaving 16.1M % (1024 * 150) = 136.4 extra tiles
– Even-share: 136 CTAs get 109 + 1 = 110 tiles; the remaining 14 CTAs get 109, leaving only 0.4 extra tiles (one partial tile)
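A host-side sketch of that even-share schedule, in the slide's C / N / T / B terms (hypothetical helper):

// Even-share: split ceil(N/T) tiles among C CTAs; per-CTA work differs by <= 1 tile.
struct EvenShare { int small_b, big_b, big_ctas; };

EvenShare Plan(long long N, int T, int C)
{
    long long tiles = (N + T - 1) / T;      // total tiles, incl. one final partial tile
    EvenShare p;
    p.small_b  = (int)(tiles / C);          // e.g., B = 109 tiles per CTA (14 CTAs)
    p.big_ctas = (int)(tiles % C);          // e.g., 136 CTAs...
    p.big_b    = p.small_b + 1;             // ...process B = 110 tiles
    return p;
}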
– If you breathe on your code, run it through the VP
  • Kernel runtimes
  • Instruction counts
– Indispensable for tuning
  • Host-side timing requires too many iterations
  • Only 1-2 cudaprof iterations for consistent counter-based perf data
– Write tools to parse the output
  • "Dummy" kernels useful for demarcation
[Chart: sorting rate (10^6 keys/sec) vs. problem size (millions); series, fastest to slowest: GTX 480, C2050 (no ECC), GTX 285, C2050 (ECC), GTX 280, C1060, 9800 GTX+]
[Chart: sorting rate (millions of pairs/sec) vs. problem size (millions); series: GTX 480, C2050 (no ECC), GTX 285, GTX 280, C2050 (ECC), C1060, 9800 GTX+]
[Chart: kernel bandwidth (GiB/sec) vs. problem size (millions) for merrill_tree Reduce and merrill_rts Scan]

[Chart: kernel bandwidth (10^9 bytes/sec) vs. problem size (millions) for merrill_linear Reduce and merrill_linear Scan]
– Implement device "memcpy" for tile-processing
  • Optimize for "full tiles"
– Specialize for different SM versions, input types, etc.
– Use templated code to generate various instances
– Run with cudaprof env vars to collect data
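A minimal sketch of such a device "memcpy" skeleton (hypothetical; the real versions are templated over tile size, load width, and SM version, and add a guarded path for the final partial tile):

// Tile-copy skeleton: stream full tiles with vectorized 16-byte loads/stores.
__global__ void TileCopy(const float4 *in, float4 *out, int num_vecs)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < num_vecs;
         i += gridDim.x * blockDim.x)
    {
        out[i] = in[i];   // one 16B load + one 16B store per thread per iteration
    }
}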
[Chart: copy bandwidth (GiB/sec) vs. words copied (millions), 128-thread CTA with 64-byte loads; series: Single, Double, Quad, each with and without overlap]

[Chart: same for 128-byte ld/st, plus an "Intrinsic Copy" series; one-way and two-way traffic compared against cudaMemcpy()]
– SIMD lanes wasted on O(n)-work Brent-Kung (left), but less work when n > warp size
– Kogge-Stone (right) is O(n log n)-work, but faster when n ≤ warp size

[Diagram: Brent-Kung scan network (upsweep / downsweep over x0..x7, producing partial sums ⊕(x0..xk)) vs. Kogge-Stone scan network (log-stride steps, each thread accumulating ⊕ over a doubling window)]
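A sketch of the warp-synchronous Kogge-Stone idiom of this era (hypothetical code; assumes a 32-thread warp executing in lockstep, so no __syncthreads() between steps, smem declared volatile, and 16 identity-padded cells to remove the lane-versus-stride conditionals — a pre-Kepler pattern):

// Warp-synchronous Kogge-Stone inclusive scan (pre-Kepler idiom).
// 'smem' points at this warp's 48-cell region: 16 padding cells + 32 values.
__device__ int WarpScanInclusive(volatile int *smem, int lane, int value)
{
    smem[lane] = 0;                   // zero the padding (cells 0..15 matter)
    volatile int *s = smem + 16 + lane;
    s[0] = value;
    s[0] = value += s[-1];            // stride 1
    s[0] = value += s[-2];            // stride 2
    s[0] = value += s[-4];            // stride 4
    s[0] = value += s[-8];            // stride 8
    s[0] = value += s[-16];           // stride 16
    return value;                     // inclusive prefix for this lane
}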
Tree-based vs. raking-based:

[Diagram: tree-based — all T threads participate, with a barrier between each of the log-depth steps; raking-based — after one barrier, a small subset of threads serially rakes shared-memory segments, followed by a final barrier]
– Barriers make O(n) code O(n log n)
– The rest are "DMA engine" threads
– Use threadblocks to cover pipeline latencies, e.g., for Fermi:
  • 2 worker warps per CTA
  • 6-7 CTAs

[Diagram: raking layout — every thread deposits into smem; only the "worker" warps rake, while the remaining "DMA" threads simply load and store]
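A sketch of the raking pattern itself (hypothetical: a 128-thread CTA where one worker warp rakes four smem cells each and the other 96 threads serve purely as "DMA" threads; relies on the era's warp-lockstep assumption for the finish):

#define CTA_THREADS 128
#define RAKING_THREADS 32                        // one worker warp
#define SEG (CTA_THREADS / RAKING_THREADS)       // 4 cells per raking thread

// Raking reduction: all threads deposit; only one warp computes, barrier-free.
__global__ void RakingReduce(const int *in, int *block_sums, int n)
{
    __shared__ int grid[CTA_THREADS];
    __shared__ volatile int partials[RAKING_THREADS];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    grid[threadIdx.x] = (i < n) ? in[i] : 0;     // "DMA" deposit
    __syncthreads();                             // the only CTA-wide barrier
    if (threadIdx.x < RAKING_THREADS) {
        int *seg = grid + threadIdx.x * SEG;
        partials[threadIdx.x] = seg[0] + seg[1] + seg[2] + seg[3];  // serial rake
        if (threadIdx.x == 0) {                  // warp-synchronous finish (sketch)
            int total = 0;
            for (int j = 0; j < RAKING_THREADS; ++j) total += partials[j];
            block_sums[blockIdx.x] = total;
        }
    }
}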
– Different SMs (varied local storage: registers / smem)
– Different input types (e.g., sorting chars vs. ulongs)
– # of steps for each algorithm phase is configuration-driven
– Template expansion + constant propagation + static loop unrolling + preprocessor macros
– Compiler produces target assembly that is well-tuned for the specific hardware and problem
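A sketch of the meta-programming style (hypothetical): granularity knobs become compile-time template parameters, so template expansion, constant propagation, and static unrolling turn each configuration into straight-line code.

// Each <LOADS_PER_THREAD, T> instantiation compiles to unrolled,
// constant-folded code tuned for one configuration. (Hypothetical sketch.)
template <int LOADS_PER_THREAD, typename T>
__global__ void ProcessTiles(const T *in, T *out, int num_tiles)
{
    for (int tile = blockIdx.x; tile < num_tiles; tile += gridDim.x) {
        #pragma unroll
        for (int j = 0; j < LOADS_PER_THREAD; ++j) {   // statically unrolled
            int i = (tile * LOADS_PER_THREAD + j) * blockDim.x + threadIdx.x;
            out[i] = in[i];                            // placeholder tile work
        }
    }
}
// e.g., a Fermi, 32-bit-key configuration:
//   ProcessTiles<4, unsigned int><<<150, 256>>>(d_in, d_out, num_tiles);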
[Diagram: CTA-local multi-scan pipeline — Serial Reduction (regs) → Serial Reduction (shmem) → SIMD Kogge-Stone Scan (shmem) → Serial Scan (shmem) → Serial Scan (regs) → Digit Decoding. Digit carry-ins (0s: 9, 1s: 25, 2s: 33, 3s: 49) feed the scatter stage (SIMD Kogge-Stone scan, smem exchange), which produces digit carry-outs (0s: 18, 1s: 32, 2s: 42, 3s: 56)]
[Figure: per-thread local exchange offsets and global scatter offsets across the CTA's threads]
Digit flag vectors

[Figure: a subsequence of 32 radix-4 keys — 211 122 302 232 300 021 022 021 123 330 023 130 220 020 112 221 023 322 003 012 022 010 121 323 020 101 212 220 013 301 130 333. One flag vector per digit (0s / 1s / 2s / 3s) is encoded and scanned, yielding digit totals (0s: 9, 1s: 7, 2s: 9, 3s: 7) and per-key local ranks that drive the smem key exchange]
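A serial reference of the flag-vector encode-and-scan (illustrative only; the kernel does this warp-synchronously on packed flag words):

#include <vector>

// Radix-4 digit flag vectors: derive each key's local rank and the digit totals.
void LocalRanks(const std::vector<unsigned> &digits,   // each key's digit, 0..3
                std::vector<unsigned> &ranks,          // rank within its digit's bin
                unsigned totals[4])
{
    unsigned count[4] = {0, 0, 0, 0};
    ranks.resize(digits.size());
    for (size_t i = 0; i < digits.size(); ++i)
        ranks[i] = count[digits[i]]++;    // exclusive running count per digit
    for (int d = 0; d < 4; ++d) totals[d] = count[d];
}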
• Resource-allocation as runtime
1. Kernel memory-wall analysis (kernel fusion)
2. Algorithm serialization
3. Tune for data-movement
4. Warp-synchronous programming
5. Flexible granularity via meta-programming
– Back40Computing (a Google Code project)
  • http://code.google.com/p/back40computing/
– Default sorting method for Thrust
  • http://code.google.com/p/thrust/
– A single host-side procedure call launches a kernel that performs orthogonal program steps
MyUberKernel<<<grid_size, num_threads>>>(d_device_storage);
– No existing public repositories of kernel “subroutines” for scavenging
– Callbacks, iterators, visitors, functors, etc.
  • ReduceKernel<<<grid_size, num_threads>>>(CountingIterator(100));
– E.g., the fused sorting kernel below can't be composed using a callback-based pattern
[Diagram: fused radix sorting kernel — in a single launch: GATHER(key); extract radix digit; encode flag bit (into flag vectors); LOCAL MULTI-SCAN (flag vectors); decode local rank (from flag vectors); EXCHANGE(key); extract radix digit (again); update global radix digit partition offsets; SCATTER(key); with GATHER(value) / EXCHANGE(value) / SCATTER(value) interleaved for the values]
• Digit extraction
• Local prefix scan
• Scatter accordingly
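For contrast, a sketch of the iterator/functor idiom named above (hypothetical CountingIterator and ReduceKernel): the input "gather" becomes a customization point, which works for a single step like reduce but offers no way to interleave the many cooperating steps of the fused sorting kernel.

// Iterator-as-input composition (hypothetical types; reduction kept naive
// on purpose -- the point is the customization point, not the reduction).
struct CountingIterator {
    int base;
    __host__ __device__ explicit CountingIterator(int b) : base(b) {}
    __device__ int operator[](int i) const { return base + i; }  // generated input
};

template <typename InputIt>
__global__ void ReduceKernel(InputIt in, int *sum, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(sum, in[i]);   // input "gather" is the only custom step
}
// Usage: ReduceKernel<<<grid_size, num_threads>>>(CountingIterator(100), d_sum, n);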
– Compiled libraries suffer from code bloat
  • CUDPP primitives library is 100s of MBs, yet still doesn't support all built-in numeric types
  • Specializing for device configurations makes it even worse
– The alternative is to ship source for #include'ing
  • Have to be willing to share source
– Need a way to fit meta-programming in at the JIT / bytecode level to help avoid expansion / mismatch-by-omission
– Can leverage fundamentally different algorithms for different phases
  • How to teach the compiler to do this?