http://cs264.org
http://goo.gl/1K2fI
– Data-independent tasks
– Tasks with statically-known data dependences
– SIMD divergence
– Lacking fine-grained synchronization
– Lacking writeable, coherent caches
Device            32-bit Key-Value Sorting (10^6 pairs/sec)    Keys-Only Sorting (10^6 keys/sec)
NVIDIA GTX 280    449 (3.8x speedup*)                          534 (2.9x speedup*)

* Satish et al., "Designing efficient sorting algorithms for manycore GPUs," in IPDPS '09
Device                              32-bit Key-Value Sorting (10^6 pairs/sec)    Keys-Only Sorting (10^6 keys/sec)
NVIDIA GTX 480                      775                                          1005
NVIDIA GTX 280                      449                                          534
NVIDIA 8800 GT                      129                                          171
Intel Knights Ferry MIC 32-core*                                                 560
Intel Core i7 quad-core*                                                         240
Intel Core 2 quad-core*                                                          138

* Satish et al., "Fast Sort on CPUs, GPUs and Intel MIC Architectures," Intel Tech Report 2010.
– Each output is dependent upon a finite subset of the input
  • Threads are decomposed by output element
  • The output (and at least one input) index is a static function of thread-id

[Diagram: threads statically mapping input elements to output elements]
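A minimal CUDA sketch of this static decomposition (hypothetical example, not from the talk): each thread's output index is a pure function of its thread-id, so no inter-thread coordination is needed.

// Hypothetical statically-mapped task: each output element depends only on a
// small, statically-known window of the input.
__global__ void StaticMap(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // output index: static function of thread-id
    if (i > 0 && i < n - 1)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;   // finite input subset
}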
– Each output element has dependences upon any / all input elements
– E.g., sorting, reduction, compaction, duplicate removal, histogram generation, map-reduce, etc.

[Diagram: input-to-output mapping unknown a priori ("?")]
– Threads are decomposed by output element
– Repeatedly iterate over recycled input streams
– Output stream size is statically known before each pass

[Diagram: a fixed set of threads re-applied to each pass's output stream]
– O(n) global work from passes of pairwise-neighbor-reduction
– Static dependences, uniform output

[Diagram: pairwise-neighbor reduction tree]
– Repeated pairwise swapping
  • Bubble sort is O(n^2)
  • Bitonic sort is O(n log^2 n)
– Need partitioning: dynamic, cooperative allocation
– Repeatedly check each vertex or edge
  • Breadth-first search becomes O(V^2)
  • O(V+E) is work-optimal
– Need queue: dynamic, cooperative allocation
– Variable output per thread
– Need dynamic, cooperative allocation
• Where do I put something in a list?
  – Duplicate removal
  – Sorting
  – Histogram compilation
• Where do I enqueue something?
  – Search space exploration
  – Graph traversal
  – General work queues

[Diagram: many threads contending to place outputs at as-yet-unknown locations ("?")]
• For 30,000 producers and consumers?
  – Locks serialize everything

Use prefix scan instead:
  – O(n) work
  – For allocation: use scan results as a scattering vector
  – Popularized by Blelloch et al. in the '90s
  – Merrill et al. Parallel Scan for Stream Architectures. Technical Report CS2009-14, University of Virginia. 2009

  Input (& allocation requirement):          2 1 0 3 2
  Result of prefix scan (sum):               0 2 3 3 6
  Output (one slot per requested element):   0 1 2 3 4 5 6 7
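To make "scan results as a scattering vector" concrete, a minimal sketch (hypothetical kernel and names; thrust::exclusive_scan is the standard Thrust call): each thread states its allocation requirement, an exclusive prefix sum converts requirements into starting offsets, and each thread scatters its variable-sized output into its own range.

#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Scatter each thread's variable-sized output to the offset computed by scan.
__global__ void Scatter(const int *reqs, const int *offsets,
                        const int *vals, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int base = offsets[i];                  // where thread i's allocation begins
        for (int j = 0; j < reqs[i]; ++j)
            out[base + j] = vals[i];            // cooperative, lock-free allocation
    }
}

void Example(thrust::device_vector<int> &reqs)  // e.g., {2, 1, 0, 3, 2}
{
    thrust::device_vector<int> offsets(reqs.size());
    thrust::exclusive_scan(reqs.begin(), reqs.end(), offsets.begin());  // {0, 2, 3, 3, 6}
    // total output size = offsets.back() + reqs.back() (8 in the example above)
}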
[Figure: radix partitioning by one binary digit place into "0s" and "1s" bins]
  Key sequence:          1110 1010 1100 1000 0011 0111 0101 0001
  Allocation requirements (one flag vector per bin) → scanned allocations (bin relocation offsets) → adjusted allocations (global relocation offsets)
  Output key sequence:   1110 1100 0011 0111 1010 1000 0101 0001
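The same idea drives a radix split. A serial C++ reference of one binary split pass (an illustrative sketch, not the GPU kernels): flag each key by its digit, prefix-sum the flags into within-bin offsets, add each bin's base offset, then scatter.

#include <vector>
#include <cstdint>

// One binary radix split: stable partition of keys by the digit at 'bit'.
void SplitPass(const std::vector<uint32_t> &keys,
               std::vector<uint32_t> &out, int bit)
{
    out.resize(keys.size());
    size_t zeros = 0;                                    // 0s-bin allocation total
    for (uint32_t k : keys) zeros += ((k >> bit) & 1) == 0;
    size_t next0 = 0, next1 = zeros;                     // running prefix sums (bin bases)
    for (uint32_t k : keys)
        out[((k >> bit) & 1) ? next1++ : next0++] = k;   // global relocation offsets
}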
Un-fused vs. fused

[Diagram: un-fused — the host program issues separate launches ("Determine allocation size", the multi-kernel CUDPP Scan, "Distribute output"), each round-tripping through global device memory]

[Diagram: fused — three scan launches, with "determine allocation" and "distribute output" folded into the scan kernels themselves]

1. Heavy SMT (over-threading) yields usable "bubbles" of free computation
2. Propagate live data between steps in fast registers / smem
3. Use scan (or variant) as a "runtime" for everything
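A minimal sketch of the fusion payoff (hypothetical kernels, not the talk's code). Un-fused, a "square" kernel would write x*x to global memory and a separate reduce kernel would read it back; fused, the intermediate values live only in registers and smem. Assumes a 256-thread CTA.

// Fused "square then reduce": one launch, one gmem read and one gmem write per tile.
__global__ void FusedSquareReduce(const float *in, float *block_sums, int n)
{
    __shared__ float smem[256];                       // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = (i < n) ? in[i] : 0.0f;
    smem[threadIdx.x] = x * x;                        // step 1: stays on-chip
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {    // step 2: tree reduction in smem
        if (threadIdx.x < s) smem[threadIdx.x] += smem[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) block_sums[blockIdx.x] = smem[0];
}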
Device           Memory Bandwidth    Compute Throughput        Memory wall      Memory wall
                 (10^9 bytes/s)      (10^9 thread-cycles/s)    (bytes/cycle)    (instrs/word)
GTX 480          169.0               672.0                     0.251            15.9
GTX 285          159.0               354.2                     0.449            8.9
GTX 280          141.7               311.0                     0.456            8.8
Tesla C1060      102.0               312.0                     0.327            12.2
9800 GTX+        70.4                235.0                     0.300            13.4
8800 GT          57.6                168.0                     0.343            11.7
9800 GT          57.6                168.0                     0.343            11.7
8800 GTX         86.4                172.8                     0.500            8.0
Quadro FX 5600   76.8                152.3                     0.504            7.9
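A worked check of how the two derived columns relate (added for clarity; assumes the 32-bit / 4-byte words used throughout the talk):

\[
\text{bytes/cycle} = \frac{\text{memory bandwidth}}{\text{compute throughput}}, \qquad
\text{instrs/word} = \frac{\text{word size (bytes)}}{\text{bytes/cycle}}
\]
\[
\text{GTX 480: } \frac{169.0}{672.0} \approx 0.251 \ \text{bytes/cycle}, \qquad \frac{4}{0.251} \approx 15.9 \ \text{instrs/word}
\]
\[
\text{GTX 285, read+write (8 bytes per element): } \frac{8}{0.449} \approx 17.8 \ \text{instrs per input word}
\]

The last figure is the "r+w memory wall" plotted in the charts below.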
[Chart: thread-instructions per 32-bit scan element vs. problem size (millions); the horizontal line marks the GTX 285 r+w memory wall (17.8 instructions per input word)]
[Chart, built up: a bare "Data Movement Skeleton" kernel and "Our Scan Kernel" plotted against the GTX 285 r+w memory wall (17.8); the gap up to the wall is labeled "insert work here"]
– Increase granularity / redundant computation
  • ghost cells
  • radix bits
– Orthogonal kernel fusion
[Chart: "CUDPP Scan Kernel" vs. "Our Scan Kernel", thread-instructions per 32-bit scan element vs. problem size (millions)]
– Partially-coalesced writes
– 2x write overhead
– 4 total concurrent scan operations (radix 16)

[Chart: "Our Scan Kernel" against the GTX 285 scan-kernel wall and radix-scatter-kernel wall; the remaining gap is labeled "insert work here"]
Insert work here
– Need kernels with tunable local (or redundant) work
• ghost cells
• radix bits
285 Radix Scader Kernel
Wall
480 Radix Scader Kernel
Wall
0
5
10
15
20
25
30
35
40
45
50
0 10 20 30 40 50 60 70 80 90
Thread
-‐instructoins / 32-‐bit w
ord
Problem Size (millions)
– Virtual processors abstract a diversity of hardware configurations
– Leads to a host of inefficiencies
– E.g., use only several hundred CTAs

[Diagram: Grid A — grid-size = (N / tilesize) CTAs, one threadblock per tile; vs. Grid B — grid-size = 150 CTAs (or other small constant), each threadblock iterating over many tiles]
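A sketch of the Grid B style (hypothetical): launch a small constant number of CTAs and let each iterate over tiles, rather than launching N / tilesize CTAs.

// Grid B: fixed grid of e.g. 150 CTAs; each CTA loops over many tiles.
// (Assumes N is a whole number of blockDim.x-sized tiles.)
__global__ void ProcessAllTiles(const float *in, float *out, int num_tiles)
{
    for (int tile = blockIdx.x; tile < num_tiles; tile += gridDim.x) {
        int i = tile * blockDim.x + threadIdx.x;   // offsets recomputed once per tile
        out[i] = in[i];                            // placeholder per-tile work
    }
}
// Launch: ProcessAllTiles<<<150, 256>>>(d_in, d_out, N / 256);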
– Thread-dependent predicates
– Setup and initialization code (notably for smem)
– Offset calculations (notably for smem)
– Common values are hoisted and kept live
– Spills are really bad
– O(N / tilesize) gmem accesses
– 2-4 instructions per access (offset calcs, load, store)
– GPU is least efficient here: get it over with as quickly as possible
– log_tilesize(N)-level tree vs. two-level tree
[Chart: "Compute Load" — thread-instructions per element vs. grid size (# of threadblocks), against the GTX 285 scan kernel wall]
C = number of CTAs; N = problem size; T = tile size; B = tiles per CTA

– Naive: 16.1M / 150 CTAs / 1024 = 109.91 tiles per CTA
  • conditional evaluation
  • singleton loads
– Flooring: floor(16.1M / (1024 * 150)) = 109 tiles per CTA, leaving 16.1M % (1024 * 150) = 136.4 extra tiles
– Even-share: 136 CTAs get 109 + 1 = 110 tiles; the remaining 14 CTAs get 109, leaving only 0.4 extra tiles (one partial tile)
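A host-side sketch of that even-share schedule, in the slide's C / N / T / B terms (hypothetical helper):

// Even-share: split ceil(N/T) tiles among C CTAs; per-CTA work differs by <= 1 tile.
struct EvenShare { int small_b, big_b, big_ctas; };

EvenShare Plan(long long N, int T, int C)
{
    long long tiles = (N + T - 1) / T;      // total tiles, incl. one final partial tile
    EvenShare p;
    p.small_b  = (int)(tiles / C);          // e.g., B = 109 tiles per CTA (14 CTAs)
    p.big_ctas = (int)(tiles % C);          // e.g., 136 CTAs...
    p.big_b    = p.small_b + 1;             // ...process B = 110 tiles
    return p;
}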
– If you breathe on your code, run it through the VP
  • Kernel runtimes
  • Instruction counts
– Indispensable for tuning
  • Host-side timing requires too many iterations
  • Only 1-2 cudaprof iterations for consistent counter-based perf data
– Write tools to parse the output
  • "Dummy" kernels useful for demarcation
[Chart: sorting rate (10^6 keys/sec) vs. problem size (millions); series, fastest to slowest: GTX 480, C2050 (no ECC), GTX 285, C2050 (ECC), GTX 280, C1060, 9800 GTX+]
[Chart: sorting rate (millions of pairs/sec) vs. problem size (millions); series: GTX 480, C2050 (no ECC), GTX 285, GTX 280, C2050 (ECC), C1060, 9800 GTX+]
[Chart: kernel bandwidth (GiB/sec) vs. problem size (millions) for merrill_tree Reduce and merrill_rts Scan]

[Chart: kernel bandwidth (10^9 bytes/sec) vs. problem size (millions) for merrill_linear Reduce and merrill_linear Scan]
– Implement device "memcpy" for tile-processing
  • Optimize for "full tiles"
– Specialize for different SM versions, input types, etc.
– Use templated code to generate various instances
– Run with cudaprof env vars to collect data
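A minimal sketch of such a device "memcpy" skeleton (hypothetical; the real versions are templated over tile size, load width, and SM version, and add a guarded path for the final partial tile):

// Tile-copy skeleton: stream full tiles with vectorized 16-byte loads/stores.
__global__ void TileCopy(const float4 *in, float4 *out, int num_vecs)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < num_vecs;
         i += gridDim.x * blockDim.x)
    {
        out[i] = in[i];   // one 16B load + one 16B store per thread per iteration
    }
}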
[Chart: copy bandwidth (GiB/sec) vs. words copied (millions), 128-thread CTA with 64-byte loads; series: Single, Double, Quad, each with and without overlap]

[Chart: same for 128-byte ld/st, plus an "Intrinsic Copy" series; one-way and two-way traffic compared against cudaMemcpy()]
– SIMD lanes wasted on O(n)-work Brent-Kung (left), but less work when n > warp size
– Kogge-Stone (right) is O(n log n)-work, but faster when n ≤ warp size

[Diagram: Brent-Kung scan network (upsweep / downsweep over x0..x7, producing partial sums ⊕(x0..xk)) vs. Kogge-Stone scan network (log-stride steps, each thread accumulating ⊕ over a doubling window)]
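A sketch of the warp-synchronous Kogge-Stone idiom of this era (hypothetical code; assumes a 32-thread warp executing in lockstep, so no __syncthreads() between steps, smem declared volatile, and 16 identity-padded cells to remove the lane-versus-stride conditionals — a pre-Kepler pattern):

// Warp-synchronous Kogge-Stone inclusive scan (pre-Kepler idiom).
// 'smem' points at this warp's 48-cell region: 16 padding cells + 32 values.
__device__ int WarpScanInclusive(volatile int *smem, int lane, int value)
{
    smem[lane] = 0;                   // zero the padding (cells 0..15 matter)
    volatile int *s = smem + 16 + lane;
    s[0] = value;
    s[0] = value += s[-1];            // stride 1
    s[0] = value += s[-2];            // stride 2
    s[0] = value += s[-4];            // stride 4
    s[0] = value += s[-8];            // stride 8
    s[0] = value += s[-16];           // stride 16
    return value;                     // inclusive prefix for this lane
}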
Tree-based vs. raking-based:

[Diagram: tree-based — all T threads participate, with a barrier between each of the log-depth steps; raking-based — after one barrier, a small subset of threads serially rakes shared-memory segments, followed by a final barrier]
– Barriers make O(n) code O(n log n)
– The rest are "DMA engine" threads
– Use threadblocks to cover pipeline latencies, e.g., for Fermi:
  • 2 worker warps per CTA
  • 6-7 CTAs

[Diagram: raking layout — every thread deposits into smem; only the "worker" warps rake, while the remaining "DMA" threads simply load and store]
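A sketch of the raking pattern itself (hypothetical: a 128-thread CTA where one worker warp rakes four smem cells each and the other 96 threads serve purely as "DMA" threads; relies on the era's warp-lockstep assumption for the finish):

#define CTA_THREADS 128
#define RAKING_THREADS 32                        // one worker warp
#define SEG (CTA_THREADS / RAKING_THREADS)       // 4 cells per raking thread

// Raking reduction: all threads deposit; only one warp computes, barrier-free.
__global__ void RakingReduce(const int *in, int *block_sums, int n)
{
    __shared__ int grid[CTA_THREADS];
    __shared__ volatile int partials[RAKING_THREADS];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    grid[threadIdx.x] = (i < n) ? in[i] : 0;     // "DMA" deposit
    __syncthreads();                             // the only CTA-wide barrier
    if (threadIdx.x < RAKING_THREADS) {
        int *seg = grid + threadIdx.x * SEG;
        partials[threadIdx.x] = seg[0] + seg[1] + seg[2] + seg[3];  // serial rake
        if (threadIdx.x == 0) {                  // warp-synchronous finish (sketch)
            int total = 0;
            for (int j = 0; j < RAKING_THREADS; ++j) total += partials[j];
            block_sums[blockIdx.x] = total;
        }
    }
}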
– Different SMs (varied local storage: registers / smem)
– Different input types (e.g., sorting chars vs. ulongs)
– # of steps for each algorithm phase is configuration-driven
– Template expansion + constant propagation + static loop unrolling + preprocessor macros
– Compiler produces target assembly that is well-tuned for the specific hardware and problem
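A sketch of the meta-programming style (hypothetical): granularity knobs become compile-time template parameters, so template expansion, constant propagation, and static unrolling turn each configuration into straight-line code.

// Each <LOADS_PER_THREAD, T> instantiation compiles to unrolled,
// constant-folded code tuned for one configuration. (Hypothetical sketch.)
template <int LOADS_PER_THREAD, typename T>
__global__ void ProcessTiles(const T *in, T *out, int num_tiles)
{
    for (int tile = blockIdx.x; tile < num_tiles; tile += gridDim.x) {
        #pragma unroll
        for (int j = 0; j < LOADS_PER_THREAD; ++j) {   // statically unrolled
            int i = (tile * LOADS_PER_THREAD + j) * blockDim.x + threadIdx.x;
            out[i] = in[i];                            // placeholder tile work
        }
    }
}
// e.g., a Fermi, 32-bit-key configuration:
//   ProcessTiles<4, unsigned int><<<150, 256>>>(d_in, d_out, num_tiles);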
[Diagram: CTA-local multi-scan pipeline — Serial Reduction (regs) → Serial Reduction (shmem) → SIMD Kogge-Stone Scan (shmem) → Serial Scan (shmem) → Serial Scan (regs) → Digit Decoding. Digit carry-ins (0s: 9, 1s: 25, 2s: 33, 3s: 49) feed the scatter stage (SIMD Kogge-Stone scan, smem exchange), which produces digit carry-outs (0s: 18, 1s: 32, 2s: 42, 3s: 56)]
[Figure: per-thread local exchange offsets and global scatter offsets across the CTA's threads]
Digit flag vectors

[Figure: a subsequence of 32 radix-4 keys — 211 122 302 232 300 021 022 021 123 330 023 130 220 020 112 221 023 322 003 012 022 010 121 323 020 101 212 220 013 301 130 333. One flag vector per digit (0s / 1s / 2s / 3s) is encoded and scanned, yielding digit totals (0s: 9, 1s: 7, 2s: 9, 3s: 7) and per-key local ranks that drive the smem key exchange]
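A serial reference of the flag-vector encode-and-scan (illustrative only; the kernel does this warp-synchronously on packed flag words):

#include <vector>

// Radix-4 digit flag vectors: derive each key's local rank and the digit totals.
void LocalRanks(const std::vector<unsigned> &digits,   // each key's digit, 0..3
                std::vector<unsigned> &ranks,          // rank within its digit's bin
                unsigned totals[4])
{
    unsigned count[4] = {0, 0, 0, 0};
    ranks.resize(digits.size());
    for (size_t i = 0; i < digits.size(); ++i)
        ranks[i] = count[digits[i]]++;    // exclusive running count per digit
    for (int d = 0; d < 4; ++d) totals[d] = count[d];
}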
• Resource-allocation as runtime
1. Kernel memory-wall analysis (kernel fusion)
2. Algorithm serialization
3. Tune for data-movement
4. Warp-synchronous programming
5. Flexible granularity via meta-programming
– Back40Computing (a Google Code project)
  • http://code.google.com/p/back40computing/
– Default sorting method for Thrust
  • http://code.google.com/p/thrust/
– A single host-side procedure call launches a kernel that performs orthogonal program steps
MyUberKernel<<<grid_size, num_threads>>>(d_device_storage);
– No existing public repositories of kernel “subroutines” for scavenging
– Callbacks, iterators, visitors, functors, etc.
  • ReduceKernel<<<grid_size, num_threads>>>(CountingIterator(100));
– E.g., the fused sorting kernel below can't be composed using a callback-based pattern
[Diagram: fused radix sorting kernel — in a single launch: GATHER(key); extract radix digit; encode flag bit (into flag vectors); LOCAL MULTI-SCAN (flag vectors); decode local rank (from flag vectors); EXCHANGE(key); extract radix digit (again); update global radix digit partition offsets; SCATTER(key); with GATHER(value) / EXCHANGE(value) / SCATTER(value) interleaved for the values]
• Digit extraction
• Local prefix scan
• Scatter accordingly
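For contrast, a sketch of the iterator/functor idiom named above (hypothetical CountingIterator and ReduceKernel): the input "gather" becomes a customization point, which works for a single step like reduce but offers no way to interleave the many cooperating steps of the fused sorting kernel.

// Iterator-as-input composition (hypothetical types; reduction kept naive
// on purpose -- the point is the customization point, not the reduction).
struct CountingIterator {
    int base;
    __host__ __device__ explicit CountingIterator(int b) : base(b) {}
    __device__ int operator[](int i) const { return base + i; }  // generated input
};

template <typename InputIt>
__global__ void ReduceKernel(InputIt in, int *sum, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(sum, in[i]);   // input "gather" is the only custom step
}
// Usage: ReduceKernel<<<grid_size, num_threads>>>(CountingIterator(100), d_sum, n);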
– Compiled libraries suffer from code bloat
  • CUDPP primitives library is 100s of MBs, yet still doesn't support all built-in numeric types
  • Specializing for device configurations makes it even worse
– The alternative is to ship source for #include'ing
  • Have to be willing to share source
– Need a way to fit meta-programming in at the JIT / bytecode level to help avoid expansion / mismatch-by-omission
– Can leverage fundamentally different algorithms for different phases
  • How to teach the compiler to do this?