DL: Data Layout System for
Heterogeneous Computing
I-Jui (Ray) Sung, Geng Daniel Liu, and Wen-Mei Hwu
University of Illinois at Urbana-Champaign
Agenda
GPU Global Memory Throughput and Array-of-Structure
ASTA Layout
In-Place Conversion Between Layouts
Global Memory Bandwidth
[Figure: ideal vs. actual global memory bandwidth]
GPU Memory Bandwidth vs. Stride
SAXPY with stride:
y[i * stride] = a * x[i * stride] + y[i * stride];
Nathan Bell and Michael Garland, "Efficient Sparse Matrix-Vector Multiplication on CUDA," NVIDIA Technical Report NVR-2008-004, December 2008
Sources of Strided Accesses
Examples of strided accesses
Structure members of the same name in an array-of-structure
e.g. foo[0].bar and foo[1].bar
Elements in the same column in a row-major array
e.g. A[1][2] and A[2][2]
Unit-stride accesses can be achieved through transposition
Array-of-Structures
Structure:
struct foo{
  float a;
  float b;
  float c;
  int d;
};
Array of Structures:
struct foo{
  float a;
  float b;
  float c;
  int d;
} A[8];
Array-of-Structures
Many data-parallel algorithms naturally take array-of-structures
e.g. simulating temperature, pressure, velocity of the flow of a cell in a regular grid
Computational Fluid Dynamics codes
Structural Engineering codes
Financial Engineering codes
Array-of-Structures
Build an abstract view of related data
A common source of small strided accesses
Can we decouple the abstraction from the actual layout?
“The” actual layout?
Across components of a heterogeneous system?
GPU and CPU
Across nodes?
Shared memory machines? MPI?
Data Layout Alternatives
Array of
Structures
(AoS)
struct foo{
float a;
float b;
float c;
int d;
} A[8];
Structure of
Arrays
(SoA)
struct foo{
float a[8];
float b[8];
float c[8];
int d[8];
} A;
a[8], b[8], and c[8] may be declared as separate arrays, so the term SoA is used interchangeably with Discrete Arrays (DA)
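The offset rules behind the two layouts can be written out concretely. This is a small sketch of mine (not code from the talk), with made-up helper names, assuming S fields per structure and equally sized fields:

```python
# Hypothetical helpers (not from the talk): linear slot of field f of
# element i, for a structure with n_fields fields.

def aos_offset(i, f, n_fields):
    """AoS: all fields of one element are stored contiguously."""
    return i * n_fields + f

def soa_offset(i, f, n_elements):
    """SoA / Discrete Arrays: the same field of all elements is contiguous."""
    return f * n_elements + i

# struct foo {a, b, c, d} A[8]: accessing A[i].b across threads i = 0..7
# touches slots 4 apart in AoS (strided) but adjacent slots in SoA.
print([aos_offset(i, 1, 4) for i in range(8)])  # [1, 5, 9, 13, 17, 21, 25, 29]
print([soa_offset(i, 1, 8) for i in range(8)])  # [8, 9, 10, 11, 12, 13, 14, 15]
```

The stride of 4 slots in the AoS case is exactly the small-stride access pattern that hurts GPU memory throughput.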
Array-of-Structures
Example application: a 1D LBM iterative CFD solver
GPU: Lattice-Boltzmann kernel (updates one iteration)
CPU: communication thread (exchanges boundary cells with other nodes via MPI)
Data grid that logically has multiple properties per cell
On the GPU, a vector of threads updates the same property across nearby cells
The CPU thread prefers the AoS layout, so that the properties of the boundary cells are consecutive in memory
Intuitive Solution
Map AoS dynamically, via layout transformation, to appropriate actual layouts to fit the different layout preferences in a heterogeneous system
Intuitive Solution
This work is about the non-intuitive parts of the
seemingly intuitive solution:
What layout(s)?
How do we convert between layouts efficiently?
Efficiency in both time and space
When should we convert between layouts?
Use array-of-structures as a case study
Data Layout Alternatives
Array of Structures (AoS):
struct foo{
  float a;
  float b;
  float c;
  int d;
} A[8];
Structure of Arrays (SoA):
struct foo{
  float a[8];
  float b[8];
  float c[8];
  int d[8];
} A;
Dividing each SoA array into tiles yields the Array of Structure of Tiled Array (ASTA):
struct foo{
  float a[4];
  float b[4];
  float c[4];
  int d[4];
} A[2];
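The ASTA addressing rule can be sketched the same way. This is my illustration (not code from the talk), assuming S fields and tile size T, matching the struct declarations above:

```python
# Hypothetical helper (not from the talk): linear slot of field f of
# element i under ASTA with tile size `tile`: tiles of `tile` elements,
# laid out SoA-style inside each tile.

def asta_offset(i, f, n_fields, tile):
    return (i // tile) * (n_fields * tile) + f * tile + (i % tile)

# With struct foo { float a[4]; ... } A[2] (4 fields, tile 4, 8 elements),
# same-field accesses by consecutive threads are unit-stride within a tile:
print([asta_offset(i, 1, 4, 4) for i in range(8)])  # [4, 5, 6, 7, 20, 21, 22, 23]
# Tile size 1 degenerates to AoS; tile size == element count degenerates to SoA:
print(asta_offset(5, 1, 4, 1), asta_offset(5, 1, 4, 8))  # 21 13
```

The degenerate cases show why ASTA sits between the two layouts: it keeps SoA-like coalescing inside each tile while keeping a whole structure instance within one tile-sized region.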
Performance of ASTA
As the default layout, ASTA is as good as Discrete Arrays
The advantages of ASTA show during in-place layout conversion:
Fast layout conversion (95 GB/s) from/to AoS
AoS to/from SoA (DA): via ASTA, 8 GB/s; direct, well below 8 GB/s
[Figure: kernel speedup of AoS, Discrete Arrays, ASTA(64), ASTA(32), and ASTA(16) layouts for LBM, BlackScholes, and SpMV (bcsstk18), on an NVIDIA GTX480 and an ATI Radeon HD5870]
Layout Conversion and Transposition
Converting AoS to SoA is not too different from
transposing a tall and thin array
In-Place Transposition: First Attempt
// data[W][H] --> data[H][W]
parallel for (j < W)
  parallel for (i < H)
    float temp = data[j][i]; // offset = j*H + i
    barrier();
    data[i][j] = temp;       // offset = i*W + j
Advantages:
Simple, fast
Disadvantages:
The scope of barrier() is a work-group
Limited by the on-chip memory accessible to one work-group
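The first attempt can be modeled sequentially. This is my sketch, not the authors' kernel: phase 1 stands in for every thread's load into a register, the list boundary plays the role of barrier(), and phase 2 is the store.

```python
# Sketch: barrier-synchronized in-place transposition, simulated
# sequentially. 'regs' models the per-thread registers that live across
# the barrier(); in a real kernel their total footprint is bounded by
# on-chip storage, which is what limits this scheme to small tiles.

def transpose_barrier(data, W, H):
    """data is W x H row-major on entry, H x W row-major on return."""
    regs = [data[j * H + i] for j in range(W) for i in range(H)]  # loads
    # ---- barrier() ----
    k = 0
    for j in range(W):
        for i in range(H):
            data[i * W + j] = regs[k]                             # stores
            k += 1
    return data

buf = list(range(10))          # 2 x 5: rows [0..4], [5..9]
transpose_barrier(buf, 2, 5)
print(buf)                     # [0, 5, 1, 6, 2, 7, 3, 8, 4, 9]
```

The simulation makes the disadvantage visible: all W*H values are held live across the barrier at once, so the array (or tile) must fit in whatever memory one work-group can see.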
Layout Conversion and Transposition
Converting AoS to ASTA is not too different from transposing a bunch of small tiles
The first attempt, barrier sync, is more likely to work here, since each small tile can fit in on-chip memory
AoS to ASTA Transformation

Marshaling Kernel       Global Memory Throughput (GB/s)   Fine Print
Out-of-Place            80                                2x space
In-Place Barrier Sync   95*                               Tile size < on-chip memory

* Current results; the results reported in Table 3 (~80 GB/s) were measured on an earlier implementation

What if the tile size exceeds the on-chip memory capacity?
Layout Conversion and Transposition
Transposition is a permutation
A permutation can be decomposed into independent cycles of shifting
Example (M = 2, N = 5): the same ten linear offsets, viewed as a 2 x 5 array

0 1 2 3 4
5 6 7 8 9

and as a 5 x 2 array

0 1
2 3
4 5
6 7
8 9

Transposing in place moves the element at offset curr to next = (curr % N)*M + curr/N.
Cycles:
{0}
{1, 2, 4, 8, 7, 5, 1}
{3, 6, 3}
{9}
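The decomposition and shifting above can be sketched sequentially. This is my code, not the authors' GPU implementation; it reproduces the cycles listed for M = 2, N = 5:

```python
# Sketch: decompose the in-place transposition of an M x N array into
# cycles of next = (curr % N)*M + curr // N, then shift each cycle.

def transpose_cycles(M, N):
    seen = [False] * (M * N)
    cycles = []
    for start in range(M * N):
        if seen[start]:
            continue
        cyc, curr = [], start
        while not seen[curr]:
            seen[curr] = True
            cyc.append(curr)
            curr = (curr % N) * M + curr // N
        cycles.append(cyc)
    return cycles

def transpose_by_cycles(data, M, N):
    """In place: data is M x N row-major before, N x M row-major after."""
    for cyc in transpose_cycles(M, N):
        val = data[cyc[0]]
        for pos in cyc[1:]:          # shift each value to its successor's slot
            data[pos], val = val, data[pos]
        data[cyc[0]] = val
    return data

print(transpose_cycles(2, 5))        # [[0], [1, 2, 4, 8, 7, 5], [3, 6], [9]]
buf = list(range(10))
transpose_by_cycles(buf, 2, 5)
print(buf)                           # [0, 5, 1, 6, 2, 7, 3, 8, 4, 9]
```

No scratch array is needed beyond one carried value per cycle, which is what makes the scheme work for tiles larger than on-chip memory.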
Cycle Following – Original
Cycles:
thread 0: {0}
thread 1: {1, 2, 4, 8, 7, 5, 1}
thread 2: {3, 6, 3}
thread 3: {9}
This is equivalent to a straightforward parallelization of the IPT algorithm of Gustavson et al., "In-place transposition of rectangular matrices," PARA'06.
Problem: cycle lengths differ widely, so the work is imbalanced across threads
Cycle Following – Load Balanced
{1, 2, 4, 8, 7, 5, 1} {3, 6, 3} {0} {9}
0 1 2 3 4
5 6 7 8 9
t0 t1 t2 t3 t4
AoS to ASTA Transformation

Marshaling Kernel          Global Memory Throughput (GB/s)   Fine Print
Out-of-Place               80                                2x space
In-Place Barrier Sync      95*                               Tile size < on-chip memory
In-Place Cycle Following   14*                               Any tile size

* Current results; Table 3 in the paper was measured on an earlier implementation
Layout Conversion and Transposition
Converting SoA to ASTA is not too different from transposing a matrix of super-elements
The first attempt, barrier sync, would still not work: the super-element matrix spans the whole array, which exceeds on-chip memory
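The super-element view can be sketched as follows. This is my illustration, written out of place for clarity, although the talk performs this conversion in place with cycle following; the helper name is made up:

```python
# Sketch: SoA -> ASTA as a transposition of "super-elements", i.e. blocks
# of `tile` consecutive scalars. SoA is an S x (n/tile) grid of such
# blocks, and ASTA(tile) is its (n/tile) x S transpose.

def soa_to_asta(buf, n_fields, n_elems, tile):
    blocks = n_elems // tile
    out = [None] * len(buf)
    for f in range(n_fields):             # block row = field
        for b in range(blocks):           # block column = tile index
            src = (f * blocks + b) * tile
            dst = (b * n_fields + f) * tile
            out[dst:dst + tile] = buf[src:src + tile]
    return out

# 2 fields (a, b), 4 elements, tile 2:
soa = ["a0", "a1", "a2", "a3", "b0", "b1", "b2", "b3"]
print(soa_to_asta(soa, 2, 4, 2))  # ['a0', 'a1', 'b0', 'b1', 'a2', 'a3', 'b2', 'b3']
```

Because whole blocks move as units, the same cycle-following machinery applies; each "element" of the permutation is simply `tile` scalars wide.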
SoA to ASTA Transformation

Marshaling Kernel          Global Memory Throughput (GB/s)   Fine Print
In-Place Barrier Sync      --                                Does not work
In-Place Cycle Following   9                                 ASTA(64): 17 GB/s; ASTA(32): 9 GB/s; ASTA(16): 4 GB/s
SoA to ASTA Transformation
[Figure: sustained memory bandwidth (GB/s) of the original vs. load-balanced cycle-following algorithms across sparse matrices and tile sizes. The original algorithm varies widely with the input (0.44 to 27.35 GB/s); the load-balanced version is consistent for a given tile size (about 4.6, 9.3, and 17.1 GB/s)]
Summary
A new layout for AoS and tall arrays is proposed
Good locality on GPUs
Enables efficient in-place marshaling
Parallel in-place tiled transposition algorithms for
AoS/SoA ↔ ASTA are proposed
The tool itself is available upon request
A library implementation for ATI and NVIDIA in OpenCL and
CUDA is available at
https://bitbucket.org/ijsung/libmarshal
Questions?
Backup Slides
Marshaling Overhead
Runtime in-place marshaling at transformation boundaries:
GPU kernel invocation and CPU/GPU memory transfer
Parallel high-throughput in-place transposition kernels
ASTA layout
Array-of-Structures and Discrete Arrays
In the Array-of-Structure microbenchmark
t0 t1 t2 t3 t4 t5 t6 t7
A
Array-of-Structures and Discrete Arrays
In the Discrete Arrays microbenchmark
A
t0 t1 t2 t3 t4 t5 t6 t7
Array-of-Structures and Discrete Arrays
[Figure: microbenchmark results on ATI (Evergreen) and NVIDIA (Fermi)]
GPU caches are too small to hold structure instances for every executing wavefront
Future CPUs will have less cache per thread because of energy limitations
DRAM Bank Organization
Each core array has about 1M bits
Each bit is stored in a tiny capacitor, accessed through one transistor
DRAM Bursting
[Figure: a very small (8x2 bit) DRAM bank]
DL for OpenCL
Sources of Strided Accesses
When the stride is large (> 10^3 bytes):
Problems are mostly conflicting cache lines and DRAM banks
When the stride is small (< 10^3 bytes):
The problem can sometimes be alleviated by larger cache lines
Cycle Following - Improvement
{1, 2, 4, 8, 7, 5, 1} {3, 6, 3}
t=1: t1 t2 t3 t4
t=2: t1 t2 t3 t4
t=3: t5 t6 t3 t4
t=4: t5 t7 t8 t4
1. R1 = Load(id). If RED, then quit.
2. Load the next element in the cycle into R2.
3. Atomically set the next element in the cycle to RED.
4. If the atomic set succeeds, store R1 to the next element in the cycle, set R1 = R2, and repeat steps 2-3.
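The four steps can be modeled sequentially. In this sketch of mine, the list `red` stands in for the RED mark that the real kernel sets atomically; since concurrency is only simulated, the load-balancing effect of racing threads does not appear, only the correctness of the claiming protocol:

```python
# Sketch: marker-based cycle following. Every thread starts at its own
# offset. In the real kernel, threads race to claim successor slots with
# an atomic RED mark, so long cycles get split among several threads;
# here the claiming is simulated one thread at a time.

def cycle_follow(data, perm):
    n = len(data)
    red = [False] * n
    for tid in range(n):                 # simulated thread start order
        if red[tid]:                     # step 1: already claimed -> quit
            continue
        r1, curr = data[tid], tid
        while True:
            nxt = perm(curr)
            r2 = data[nxt]               # step 2
            if red[nxt]:                 # step 3 failed: slot taken
                break
            red[nxt] = True              # step 3 (atomic in the kernel)
            data[nxt] = r1               # step 4: deliver the shifted value
            r1, curr = r2, nxt
    return data

M, N = 2, 5
buf = list(range(M * N))
cycle_follow(buf, lambda c: (c % N) * M + c // N)
print(buf)                               # [0, 5, 1, 6, 2, 7, 3, 8, 4, 9]
```

The RED mark serves two purposes: it tells a thread starting inside an already-processed cycle to quit immediately, and it terminates the walk of the thread that comes back around to a claimed slot.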