DL: Data Layout System for
Heterogeneous Computing
I-Jui (Ray) Sung, Geng Daniel Liu, and Wen-Mei Hwu
University of Illinois at Urbana-Champaign
Agenda
GPU Global Memory Throughput and Array-of-Structure
ASTA Layout
In-Place Conversion Between Layouts
Global Memory Bandwidth
[Figure: ideal vs. actual global memory bandwidth]
GPU Memory Bandwidth vs. Stride
SAXPY with stride:
y[i * stride] = a * x[i * stride] + y[i * stride];
Nathan Bell and Michael Garland, "Efficient Sparse Matrix-Vector Multiplication on CUDA," NVIDIA Technical Report NVR-2008-004, December 2008
Sources of Strided Accesses
Examples of strided accesses
Structure members of the same name in an array-of-structure
e.g. foo[0].bar and foo[1].bar
Elements in the same column in a row-major array
e.g. A[1][2] and A[2][2]
Unit-stride accesses can be achieved through transposition
Array-of-Structures
Structure:
struct foo{
  float a;
  float b;
  float c;
  int d;
};
Array of Structures:
struct foo{
  float a;
  float b;
  float c;
  int d;
} A[8];
Array-of-Structures
Many data-parallel algorithms naturally take array-of-structures
e.g. simulating temperature, pressure, velocity of the flow of a cell in a regular grid
Computational Fluid Dynamics codes
Structural Engineering codes
Financial Engineering codes
Array-of-Structures
Build an abstract view of related data
A common source of small strided accesses
Can we decouple the abstraction from the actual layout?
“The” actual layout?
Across components of a heterogeneous system?
GPU and CPU
Across nodes?
Shared memory machines? MPI?
Data Layout Alternatives
Array of
Structures
(AoS)
struct foo{
float a;
float b;
float c;
int d;
} A[8];
Structure of
Arrays
(SoA)
struct foo{
float a[8];
float b[8];
float c[8];
int d[8];
} A;
a[8], b[8], and c[8] may be declared as separate arrays, so the term SoA is used interchangeably with Discrete Arrays (DA)
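The offset rules behind the two layouts can be written out concretely. This is a small sketch of mine (not code from the talk), with made-up helper names, assuming S fields per structure and equally sized fields:

```python
# Hypothetical helpers (not from the talk): linear slot of field f of
# element i, for a structure with n_fields fields.

def aos_offset(i, f, n_fields):
    """AoS: all fields of one element are stored contiguously."""
    return i * n_fields + f

def soa_offset(i, f, n_elements):
    """SoA / Discrete Arrays: the same field of all elements is contiguous."""
    return f * n_elements + i

# struct foo {a, b, c, d} A[8]: accessing A[i].b across threads i = 0..7
# touches slots 4 apart in AoS (strided) but adjacent slots in SoA.
print([aos_offset(i, 1, 4) for i in range(8)])  # [1, 5, 9, 13, 17, 21, 25, 29]
print([soa_offset(i, 1, 8) for i in range(8)])  # [8, 9, 10, 11, 12, 13, 14, 15]
```

The stride of 4 slots in the AoS case is exactly the small-stride access pattern that hurts GPU memory throughput.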
Array-of-Structures
Example application: a 1D LBM iterative CFD solver
GPU: Lattice-Boltzmann kernel (updates one iteration)
CPU: communication thread (exchanges boundary cells with other nodes via MPI)
Data grid that logically has multiple properties per cell
On the GPU, a vector of threads updates the same property across nearby cells
The CPU thread prefers the AoS layout, so that the properties of the boundary cells are consecutive in memory
Intuitive Solution
Map AoS dynamically, via layout transformation, to appropriate actual layouts to fit the different layout preferences in a heterogeneous system
Intuitive Solution
This work is about the non-intuitive parts of the
seemingly intuitive solution:
What layout(s)?
How do we convert between layouts efficiently?
Efficiency in both time and space
When should we convert between layouts?
Use array-of-structures as a case study
Data Layout Alternatives
Array of Structures (AoS):
struct foo{
  float a;
  float b;
  float c;
  int d;
} A[8];
Structure of Arrays (SoA):
struct foo{
  float a[8];
  float b[8];
  float c[8];
  int d[8];
} A;
Dividing each SoA array into tiles yields the Array of Structure of Tiled Array (ASTA):
struct foo{
  float a[4];
  float b[4];
  float c[4];
  int d[4];
} A[2];
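The ASTA addressing rule can be sketched the same way. This is my illustration (not code from the talk), assuming S fields and tile size T, matching the struct declarations above:

```python
# Hypothetical helper (not from the talk): linear slot of field f of
# element i under ASTA with tile size `tile`: tiles of `tile` elements,
# laid out SoA-style inside each tile.

def asta_offset(i, f, n_fields, tile):
    return (i // tile) * (n_fields * tile) + f * tile + (i % tile)

# With struct foo { float a[4]; ... } A[2] (4 fields, tile 4, 8 elements),
# same-field accesses by consecutive threads are unit-stride within a tile:
print([asta_offset(i, 1, 4, 4) for i in range(8)])  # [4, 5, 6, 7, 20, 21, 22, 23]
# Tile size 1 degenerates to AoS; tile size == element count degenerates to SoA:
print(asta_offset(5, 1, 4, 1), asta_offset(5, 1, 4, 8))  # 21 13
```

The degenerate cases show why ASTA sits between the two layouts: it keeps SoA-like coalescing inside each tile while keeping a whole structure instance within one tile-sized region.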
Performance of ASTA
As the default layout, ASTA is as good as Discrete Arrays
The advantages of ASTA show during in-place layout conversion:
Fast layout conversion (95 GB/s) from/to AoS
AoS to/from SoA (DA): via ASTA, 8 GB/s; direct, well below 8 GB/s
[Figure: kernel speedup of AoS, Discrete Arrays, ASTA(64), ASTA(32), and ASTA(16) layouts for LBM, BlackScholes, and SpMV (bcsstk18), on an NVIDIA GTX480 and an ATI Radeon HD5870]
Layout Conversion and Transposition
Converting AoS to SoA is not too different from
transposing a tall and thin array
In-Place Transposition: First Attempt
// data[W][H] --> data[H][W]
parallel for (j < W)
  parallel for (i < H)
    float temp = data[j][i]; // offset = j*H + i
    barrier();
    data[i][j] = temp;       // offset = i*W + j
Advantages:
Simple, fast
Disadvantages:
The scope of barrier() is a work-group
Limited by the on-chip memory accessible to one work-group
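The first attempt can be modeled sequentially. This is my sketch, not the authors' kernel: phase 1 stands in for every thread's load into a register, the list boundary plays the role of barrier(), and phase 2 is the store.

```python
# Sketch: barrier-synchronized in-place transposition, simulated
# sequentially. 'regs' models the per-thread registers that live across
# the barrier(); in a real kernel their total footprint is bounded by
# on-chip storage, which is what limits this scheme to small tiles.

def transpose_barrier(data, W, H):
    """data is W x H row-major on entry, H x W row-major on return."""
    regs = [data[j * H + i] for j in range(W) for i in range(H)]  # loads
    # ---- barrier() ----
    k = 0
    for j in range(W):
        for i in range(H):
            data[i * W + j] = regs[k]                             # stores
            k += 1
    return data

buf = list(range(10))          # 2 x 5: rows [0..4], [5..9]
transpose_barrier(buf, 2, 5)
print(buf)                     # [0, 5, 1, 6, 2, 7, 3, 8, 4, 9]
```

The simulation makes the disadvantage visible: all W*H values are held live across the barrier at once, so the array (or tile) must fit in whatever memory one work-group can see.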
Layout Conversion and Transposition
Converting AoS to ASTA is not too different from transposing a bunch of small tiles
The first attempt, barrier sync, is more likely to work here, since each small tile can fit in on-chip memory
AoS to ASTA Transformation

Marshaling Kernel       Global Memory Throughput (GB/s)   Fine Print
Out-of-Place            80                                2x space
In-Place Barrier Sync   95*                               Tile size < on-chip memory

* Current results; the results reported in Table 3 (~80 GB/s) were measured on an earlier implementation

What if the tile size exceeds the on-chip memory capacity?
Layout Conversion and Transposition
Transposition is a permutation
A permutation can be decomposed into independent cycles of shifting
Example (M = 2, N = 5): the same ten linear offsets, viewed as a 2 x 5 array

0 1 2 3 4
5 6 7 8 9

and as a 5 x 2 array

0 1
2 3
4 5
6 7
8 9

Transposing in place moves the element at offset curr to next = (curr % N)*M + curr/N.
Cycles:
{0}
{1, 2, 4, 8, 7, 5, 1}
{3, 6, 3}
{9}
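The decomposition and shifting above can be sketched sequentially. This is my code, not the authors' GPU implementation; it reproduces the cycles listed for M = 2, N = 5:

```python
# Sketch: decompose the in-place transposition of an M x N array into
# cycles of next = (curr % N)*M + curr // N, then shift each cycle.

def transpose_cycles(M, N):
    seen = [False] * (M * N)
    cycles = []
    for start in range(M * N):
        if seen[start]:
            continue
        cyc, curr = [], start
        while not seen[curr]:
            seen[curr] = True
            cyc.append(curr)
            curr = (curr % N) * M + curr // N
        cycles.append(cyc)
    return cycles

def transpose_by_cycles(data, M, N):
    """In place: data is M x N row-major before, N x M row-major after."""
    for cyc in transpose_cycles(M, N):
        val = data[cyc[0]]
        for pos in cyc[1:]:          # shift each value to its successor's slot
            data[pos], val = val, data[pos]
        data[cyc[0]] = val
    return data

print(transpose_cycles(2, 5))        # [[0], [1, 2, 4, 8, 7, 5], [3, 6], [9]]
buf = list(range(10))
transpose_by_cycles(buf, 2, 5)
print(buf)                           # [0, 5, 1, 6, 2, 7, 3, 8, 4, 9]
```

No scratch array is needed beyond one carried value per cycle, which is what makes the scheme work for tiles larger than on-chip memory.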
Cycle Following – Original
Cycles:
thread 0: {0}
thread 1: {1, 2, 4, 8, 7, 5, 1}
thread 2: {3, 6, 3}
thread 3: {9}
This is equivalent to a straightforward parallelization of the IPT algorithm of Gustavson et al., "In-place transposition of rectangular matrices," PARA'06.
Problem: cycle lengths differ widely, so the work is imbalanced across threads
Cycle Following – Load Balanced
{1, 2, 4, 8, 7, 5, 1} {3, 6, 3} {0} {9}
0 1 2 3 4
5 6 7 8 9
t0 t1 t2 t3 t4
AoS to ASTA Transformation

Marshaling Kernel          Global Memory Throughput (GB/s)   Fine Print
Out-of-Place               80                                2x space
In-Place Barrier Sync      95*                               Tile size < on-chip memory
In-Place Cycle Following   14*                               Any tile size

* Current results; Table 3 in the paper was measured on an earlier implementation
Layout Conversion and Transposition
Converting SoA to ASTA is not too different from transposing a matrix of super-elements
The first attempt, barrier sync, would still not work: the super-element matrix spans the whole array, which exceeds on-chip memory
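The super-element view can be sketched as follows. This is my illustration, written out of place for clarity, although the talk performs this conversion in place with cycle following; the helper name is made up:

```python
# Sketch: SoA -> ASTA as a transposition of "super-elements", i.e. blocks
# of `tile` consecutive scalars. SoA is an S x (n/tile) grid of such
# blocks, and ASTA(tile) is its (n/tile) x S transpose.

def soa_to_asta(buf, n_fields, n_elems, tile):
    blocks = n_elems // tile
    out = [None] * len(buf)
    for f in range(n_fields):             # block row = field
        for b in range(blocks):           # block column = tile index
            src = (f * blocks + b) * tile
            dst = (b * n_fields + f) * tile
            out[dst:dst + tile] = buf[src:src + tile]
    return out

# 2 fields (a, b), 4 elements, tile 2:
soa = ["a0", "a1", "a2", "a3", "b0", "b1", "b2", "b3"]
print(soa_to_asta(soa, 2, 4, 2))  # ['a0', 'a1', 'b0', 'b1', 'a2', 'a3', 'b2', 'b3']
```

Because whole blocks move as units, the same cycle-following machinery applies; each "element" of the permutation is simply `tile` scalars wide.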
SoA to ASTA Transformation

Marshaling Kernel          Global Memory Throughput (GB/s)   Fine Print
In-Place Barrier Sync      --                                Does not work
In-Place Cycle Following   9                                 ASTA(64): 17 GB/s; ASTA(32): 9 GB/s; ASTA(16): 4 GB/s
SoA to ASTA Transformation
[Figure: sustained memory bandwidth (GB/s) of the original vs. load-balanced cycle-following algorithms across sparse matrices and tile sizes. The original algorithm varies widely with the input (0.44 to 27.35 GB/s); the load-balanced version is consistent for a given tile size (about 4.6, 9.3, and 17.1 GB/s)]
Summary
A new layout for AoS and tall arrays is proposed
Good locality on GPUs
Enables efficient in-place marshaling
Parallel in-place tiled transposition algorithms for
AoS/SoA ↔ ASTA are proposed
The tool itself is available upon request
A library implementation for ATI and NVIDIA in OpenCL and
CUDA is available at
https://bitbucket.org/ijsung/libmarshal
Questions?
Backup Slides
Marshaling Overhead
Runtime in-place marshaling at transformation boundaries:
GPU kernel invocation and CPU/GPU memory transfer
Parallel high-throughput in-place transposition kernels
ASTA layout
Array-of-Structures and Discrete Arrays
In the Array-of-Structure microbenchmark
t0 t1 t2 t3 t4 t5 t6 t7
A
Array-of-Structures and Discrete Arrays
In the Discrete Arrays microbenchmark
A
t0 t1 t2 t3 t4 t5 t6 t7
Array-of-Structures and Discrete Arrays
[Figure: microbenchmark results on ATI (Evergreen) and NVIDIA (Fermi)]
GPU caches are too small to hold structure instances for every executing wavefront
Future CPUs will have less cache per thread because of energy limitations
DRAM Bank Organization
Each core array has about 1M bits
Each bit is stored in a tiny capacitor, accessed through one transistor
DRAM Bursting
[Figure: a very small (8x2 bit) DRAM bank]
DL for OpenCL
Sources of Strided Accesses
When the stride is large (> 10^3 bytes):
Problems are mostly conflicting cache lines and DRAM banks
When the stride is small (< 10^3 bytes):
The problem can sometimes be alleviated by larger cache lines
Cycle Following - Improvement
{1, 2, 4, 8, 7, 5, 1} {3, 6, 3}
t=1: t1 t2 t3 t4
t=2: t1 t2 t3 t4
t=3: t5 t6 t3 t4
t=4: t5 t7 t8 t4
1. R1 = Load(id). If RED, then quit.
2. Load the next element in the cycle into R2.
3. Atomically set the next element in the cycle to RED.
4. If the atomic set succeeds, store R1 to the next element in the cycle, set R1 = R2, and repeat steps 2-3.
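The four steps can be modeled sequentially. In this sketch of mine, the list `red` stands in for the RED mark that the real kernel sets atomically; since concurrency is only simulated, the load-balancing effect of racing threads does not appear, only the correctness of the claiming protocol:

```python
# Sketch: marker-based cycle following. Every thread starts at its own
# offset. In the real kernel, threads race to claim successor slots with
# an atomic RED mark, so long cycles get split among several threads;
# here the claiming is simulated one thread at a time.

def cycle_follow(data, perm):
    n = len(data)
    red = [False] * n
    for tid in range(n):                 # simulated thread start order
        if red[tid]:                     # step 1: already claimed -> quit
            continue
        r1, curr = data[tid], tid
        while True:
            nxt = perm(curr)
            r2 = data[nxt]               # step 2
            if red[nxt]:                 # step 3 failed: slot taken
                break
            red[nxt] = True              # step 3 (atomic in the kernel)
            data[nxt] = r1               # step 4: deliver the shifted value
            r1, curr = r2, nxt
    return data

M, N = 2, 5
buf = list(range(M * N))
cycle_follow(buf, lambda c: (c % N) * M + c // N)
print(buf)                               # [0, 5, 1, 6, 2, 7, 3, 8, 4, 9]
```

The RED mark serves two purposes: it tells a thread starting inside an already-processed cycle to quit immediately, and it terminates the walk of the thread that comes back around to a claimed slot.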