
S4289: Efficient solution of multiple scalar and block-tridiagonal equations

Endre László
endre.laszlo [at] oerc.ox.ac.uk

Oxford e-Research Centre, University of Oxford, UK
Pázmány Péter Catholic University, Budapest, Hungary

Mike Giles (Oxford), Jeremy Appleyard (NVIDIA)

GPU Technology Conference
March 26th, 2014, San Jose

Outline for GPU developers

1 Batch scalar-tridiagonal solvers
  - ADI (Alternating Directions Implicit) method
  - Thomas algorithm
  - Multi-dimensional data structures - access patterns
  - Optimization: local data transposition in shared memory
  - Optimization: local data transposition with __shfl()
  - Thomas-PCR hybrid
  - Comparison to CPU, Xeon Phi and LAPACK tridiagonal solver

2 Batch block-tridiagonal solver
  - Block-tridiagonal data structure - access patterns
  - Work-sharing on the GPU
  - Comparison to CPU and LAPACK banded solver

Conclusion


Example: Solving the heat equation with ADI

The heat-diffusion equation is the PDE that is solved with the method:

\frac{du}{dt} = \nabla^2 u \qquad (1)

ADI (Alternating Directions Implicit) method:
  - Classical finite difference scheme
  - Computationally cheaper than Crank-Nicolson
  - Relies on approximate factorization
  - O(∆t², ∆x²) accurate in both space and time
  - Unconditionally stable if the parameters are chosen suitably (positive)
  - Introduced by Peaceman and Rachford [1]

[1] D. W. Peaceman and H. H. Rachford, Jr., "The numerical solution of parabolic and elliptic differential equations," Journal of the Society for Industrial and Applied Mathematics, vol. 3, no. 1, pp. 28-41, 1955.


Example: Solving the heat equation with ADI

3 tridiagonal solves along dimensions X,Y,Z

\begin{aligned}
\text{preproc:} \quad & u^{(0)} = \lambda (\delta_x^2 + \delta_y^2 + \delta_z^2)\, u^n \\
\text{x-dim:}   \quad & (1 - \lambda \delta_x^2)\, u^{(1)} = u^{(0)} \\
\text{y-dim:}   \quad & (1 - \lambda \delta_y^2)\, u^{(2)} = u^{(1)} \\
\text{z-dim:}   \quad & (1 - \lambda \delta_z^2)\, \Delta u = u^{(2)} \\
\text{add:}     \quad & u^{n+1} = u^n + \Delta u
\end{aligned}
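For concreteness, a minimal sketch of the preprocessing step in C, assuming the row-major layout idx = k*NX*NY + j*NX + i used later in these slides and a hypothetical scalar lam for λ; each δ² is the standard central second difference, so the three operators sum to a 7-point stencil:

void preproc(const float *u, float *u0, float lam, int NX, int NY, int NZ)
{
    // u(0) = lam * (dxx + dyy + dzz) u(n), interior points only
    for (int k = 1; k < NZ-1; k++)
        for (int j = 1; j < NY-1; j++)
            for (int i = 1; i < NX-1; i++) {
                int idx = k*NX*NY + j*NX + i;
                u0[idx] = lam * ( u[idx-1]     + u[idx+1]        // delta2_x
                                + u[idx-NX]    + u[idx+NX]       // delta2_y
                                + u[idx-NX*NY] + u[idx+NX*NY]    // delta2_z
                                - 6.0f*u[idx] );
            }
}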

The upcoming discussion of tridiagonal solvers is in the context of the ADI method


A tridiagonal system

Storage:

3 coefficient arrays

1 solution array

1 RHS array

All stored in a cubic data structure:

\begin{bmatrix}
b_0 & c_0 & 0 & \cdots & 0 \\
a_1 & b_1 & c_1 & \cdots & 0 \\
0 & a_2 & b_2 & \ddots & \vdots \\
\vdots & & \ddots & \ddots & c_{N-2} \\
0 & 0 & \cdots & a_{N-1} & b_{N-1}
\end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ \vdots \\ u_{N-1} \end{bmatrix}
=
\begin{bmatrix} d_0 \\ d_1 \\ d_2 \\ \vdots \\ d_{N-1} \end{bmatrix}


Solving tridiagonal systems

Assumptions:

Stems from real-world CFD and financial applications

Computation domain: structured multidimensional

"Hypercubic-ish": Ω = R^(N0 × N1 × ··· × ND), D = 2..8

Large number of systems:
  - N² systems on an N³ cube
  - Enough to saturate the GPU

System sizes on the order of 100s-1000s

Each system has its own coefficients and RHS

No pivoting: diagonal dominance required


Batch scalar-tridiagonal solvers

cusparse?gtsvStridedBatch():
  - Inefficient under the previous assumptions
  - CR-PCR hybrid
  - Lack of multidimensional support
      - Global data transpose needed in the Y and Z dimensions
  - Extra space requirements: 768 MB for a 256³ SP problem
  - Uses two kernel calls

Enough parallelism in batch problems

CR/PCR not necessarily needed

Tesla K40 has 12 GB device memory:
  - Multidimensional problem domain N^d
  - N = (3/4 × 10⁹)^(1/d) for dimensions d = 2..8:

    d      N     # parallel systems
    2   27386    27386
    3     908    824464
    4     165    4492125
    5      59    12117361
    6      30    24300000
    7      18    34012224
    8      12    35831808
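The table can be reproduced with a few lines of C; the 3/4 × 10⁹ figure is consistent with dividing the 12 GB among four SP arrays (a, b, c, d) of 4 bytes per element, which is an interpretation of the slide, not something it states:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double elems = 12e9 / (4 * 4.0);            // 12 GB over four float arrays
    for (int d = 2; d <= 8; d++) {
        long long N   = (long long)floor(pow(elems, 1.0 / d));
        long long sys = (long long)(pow((double)N, d - 1) + 0.5);
        printf("%d  %lld  %lld\n", d, N, sys);  // d, N, # systems = N^(d-1)
    }
    return 0;
}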


Thomas algorithm

Algorithm 1 Thomas algorithm
Require: thomas(a, b, c, d)
 1: d*_0 := d_0 / b_0
 2: c*_0 := c_0 / b_0
 3: for i = 1, ..., N−1 do
 4:    r := 1 / (b_i − a_i c*_{i−1})
 5:    d*_i := r (d_i − a_i d*_{i−1})
 6:    c*_i := r c_i
 7: end for
 8: d_{N−1} := d*_{N−1}
 9: for i = N−2, ..., 0 do
10:    d_i := d*_i − c*_i d_{i+1}
11: end for
12: return d
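As a reference point, a minimal C version of the same sweeps (working in place in d, with a scratch array c2 for the modified coefficients; names are illustrative):

void thomas(const float *a, const float *b, const float *c,
            float *d, float *c2, int N)
{
    // forward pass: eliminate the sub-diagonal, normalising each row
    c2[0] = c[0] / b[0];
    d[0]  = d[0] / b[0];
    for (int i = 1; i < N; i++) {
        float r = 1.0f / (b[i] - a[i]*c2[i-1]);
        d[i]  = r * (d[i] - a[i]*d[i-1]);
        c2[i] = r * c[i];
    }
    // backward pass: back substitution; d now holds the solution
    for (int i = N-2; i >= 0; i--)
        d[i] -= c2[i] * d[i+1];
}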


Multi-dimensional data structures - access patterns

Data layout: idx = k*NX*NY + j*NX + i

Performance depends on how the threads are mapped to the domain

Different efficiency along different dimensions

Assume a sequential dependence in the algorithm iterating along a dimension (see the Y-solve sketch below):
  - X: stride = 1 → worst performance
  - Y: stride = NX → best performance
  - Z: stride = NX*NY → good performance if a high TLB miss rate is avoided
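A minimal CUDA sketch of the favourable Y-dimension mapping, assuming one thread per (i, k) line and a scratch array c2 in the same layout (hypothetical names, not the library's API):

__global__ void thomas_y(const float *a, const float *b, const float *c,
                         float *d, float *c2, int NX, int NY, int NZ)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;  // X index: consecutive across threads
    int k = blockIdx.y;                           // Z index: one XY-plane row per block
    if (i >= NX || k >= NZ) return;

    int idx = k*NX*NY + i;                        // j = 0 element of this system
    int s   = NX;                                 // stride along Y within one system

    // forward pass: at each j, adjacent threads touch adjacent i -> coalesced
    c2[idx] = c[idx] / b[idx];
    d[idx]  = d[idx] / b[idx];
    for (int j = 1; j < NY; j++) {
        int p = idx + j*s, q = p - s;
        float r = 1.0f / (b[p] - a[p]*c2[q]);
        d[p]  = r * (d[p] - a[p]*d[q]);
        c2[p] = r * c[p];
    }
    // backward pass: d holds the solution on exit
    for (int j = NY-2; j >= 0; j--) {
        int p = idx + j*s;
        d[p] -= c2[p] * d[p + s];
    }
}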


Mapping threads to the domain: X/Y/Z-dimension solves

[Figure: time per grid element [ns] for PreProc, X-solve, Y-solve and Z-solve, in SP and DP; the X-solve bars carry a "x16.5" annotation.]

X: offset = 1, stride = NX
  → 4 byte / 32 byte = 12.5% cache line utilization in SP

Y: offset = NX, stride = 1 → perfectly coalesced, 100% utilization

Z: offset = NX*NY, stride = 1 → perfectly coalesced, 100% utilization + moderate TLB hit rate


Mapping threads to the domain: X/Y/Z-dimension solves

Nvidia Tesla K40 (GK110B): 288 GB/s

[Figure: achieved bandwidth in GB/s for the X-, Y- and Z-solves, in SP and DP.]


TLB (Translation Lookaside Buffer) miss rate

CUDA uses a Unified Virtual Address Space

The virtual address space uses memory pages

Memory page frame pointers are cached in the TLB

The TLB is a "coarser" cache that:
  - works with the LLC
  - translates an address tag to a frame pointer
  - caches frame pointers from main memory

On NVIDIA devices the TLB is implemented in hardware and page sizes cannot be changed

Small page size + long strides → high TLB miss rate

NVVP implicitly reports it within the "Global memory replay overhead" counter

753-clock latency in case of a TLB page miss [2]

[2] Measured on GT200 by Wong et al. in "Demystifying GPU microarchitecture through microbenchmarking," Performance Analysis of Systems and Software (ISPASS), 2010.


How to cope with TLB misses and coalescence?

TLB is easy: remap your solver for better locality
  - Change the 2D thread block mapping into 1D thread blocks
  - so that threads within a block solve the closest neighboring set of systems
  - Perform cache/register blocking

Coalesced memory access is more difficult:
  - Only a problem in the X-dim
  - Need for cache blocking:
      - Local transpose in shared memory, or
      - Local transpose with register shuffle (__shfl() intrinsic), or
      - Caching a whole system → Thomas-PCR hybrid


Thomas with shared memory transpose

Forward pass:
  1 Wrap a warp (32 threads) into 4x8 blocks to perform non-caching (32-byte) loads
  2 Load 32x8 size tiles into shared memory: 8 steps of 4x8 block loads
  3 Transpose the data by putting values into registers: float a[8]; is compiled to 8 registers if the array indexing is known at compile time
  4 Perform the calculation with the 8 values along the X dimension
  5 Repeat from 2 until the end of the X-dim is reached

Backward pass:
  1 Do the same backwards: transpose + store

[Figure: a 32x8 tile is loaded from global memory in 8 steps of 4x8 (32-byte) blocks into shared memory, then transposed so that each thread holds one 8-element row in a float reg[8] array in the register file.]
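A minimal sketch of steps 2-3, assuming one warp per block and hypothetical names (g = global base pointer of one coefficient array, sys0/x0 = tile origin); not the authors' exact kernel:

__device__ void load_transpose_tile(const float *g, int sys0, int x0, int NX,
                                    float reg[8])
{
    __shared__ float tile[32][8 + 1];     // +1 column pad avoids bank conflicts
    int lane = threadIdx.x & 31;
    int r = lane >> 3, c = lane & 7;      // wrap the warp into a 4x8 block
    for (int step = 0; step < 8; step++)  // 8 loads of 4x8 fill the 32x8 tile
        tile[step*4 + r][c] = g[(sys0 + step*4 + r)*NX + x0 + c];
    __syncwarp();
    #pragma unroll                        // compile-time indices keep reg[] in registers
    for (int j = 0; j < 8; j++)
        reg[j] = tile[lane][j];
}

Each thread ends up with 8 consecutive X-elements of one system, on which the Thomas forward sweep can proceed locally.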


Thomas with register shuffle transpose

Forward pass:
  1 Wrap 32 threads into 8x4 blocks to perform 4 x float4 vector loads
  2 Load 32x16 size tiles into registers: 4 threads read 4 consecutive float4 vectors = 64 bytes; do this 4 times for the rows under each other
  3 Transpose the data within 4 threads: the 4 threads exchange data on a 4x4 2D array with __shfl() applied to float4 values, where each element of the 2D array is a float4 vector
  4 Perform the calculation with the 16 values along the X dimension
  5 Repeat from 2 until the end of the X-dim is reached

Backward pass:
  1 Do the same backwards: transpose + store

[Figure: each group of 4 threads reads 4 consecutive float4 vectors (64 bytes) per step; after 4 read steps a 32x16 tile sits in registers (float reg[16] per thread), and the 4x4 shuffle transpose redistributes it so each thread holds 16 consecutive X-elements of one system.]
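A minimal sketch of the 4x4 float4 shuffle transpose of step 3, written with the modern __shfl_sync() (the 2014 code would have used __shfl()); lane grouping and names are illustrative:

__device__ float4 shfl4(float4 v, int srcLane)
{
    // __shfl moves one 32-bit word, so a float4 takes four shuffles
    v.x = __shfl_sync(0xffffffffu, v.x, srcLane);
    v.y = __shfl_sync(0xffffffffu, v.y, srcLane);
    v.z = __shfl_sync(0xffffffffu, v.z, srcLane);
    v.w = __shfl_sync(0xffffffffu, v.w, srcLane);
    return v;
}

__device__ void transpose4x4(const float4 v[4], float4 w[4])
{
    int lane = threadIdx.x & 31;
    int l = lane & 3, base = lane & ~3;   // position within the group of 4 lanes
    #pragma unroll
    for (int k = 0; k < 4; k++) {
        int j = (l - k) & 3;              // which row arrives this round
        w[j] = shfl4(v[(l + k) & 3], base + j);
    }
}

After the call, lane l of each 4-lane group holds column l of the group's 4x4 float4 array in w; all 32 lanes must execute the shuffles.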


Thomas/PCR hybrid algorithm

(In outline: each thread reduces its own cached chunk of the system with modified Thomas forward/backward sweeps, the warp then solves the small reduced system coupling the chunks with PCR via register shuffles, and a final back substitution recovers the full solution.)

Performance comparison

[Figure: time per grid element [ns] for the X- and Y-dimension solves (Trid-X/Trid-Y, SP and DP) with: naive, shared transpose, register shuffle, Thomas-PCR hybrid, cuSPARSE v5.5.22, 2-socket Xeon LAPACKE v3.5.0, and Xeon Phi.]

CPU: Intel Xeon E5-2680, 2 socket, 40MB, 16 core (32 HT), 102 GB/s
GPU: Nvidia K40m, 288 GB/s


Scalar-tridiagonal library use with OpenACC

int main()
{
    int n = NX*NY*NZ;
    float* u  = (float *) malloc(sizeof(float)*n);
    float* ax = (float *) acc_malloc(sizeof(float)*n);
    ...

    #pragma acc data copy(u[0:n]) deviceptr(ax,bx,cx,ay,by,cy,az,bz,cz,du)
    for (it = 0; it < iter; it++) {
        ... // calculate r.h.s. and set tridiagonal coefficients

        int ndim    = 3;
        int dims[3] = {256, 256, 256};
        int pads[3] = {320, 320, 320};

        int solvedim = 0; // X-solve
        tridSmtsvStridedBatch(ax, bx, cx, du, u, ndim, solvedim, dims, pads);

        solvedim = 1; // Y-solve
        tridSmtsvStridedBatch(ay, by, cy, du, u, ndim, solvedim, dims, pads);

        ...
    }
}


Batch block-tridiagonal solver

Motivation for a block solver:
  - State variables in CFD/finance PDEs have inter-dependence
  - Block matrices with block sizes of 2-8

Sub-problems to be solved:
  - Inverting and multiplying blocks (matrices) involves branching
  - Data storage shortage limits the number of systems on the device

Optimization strategies on GPUs:
  - Data storage - for better data locality
  - Work sharing - to increase parallelism
      - Inter-thread communication with shared memory
      - Inter-thread communication with register shuffle


Batch block-tridiagonal solver work sharing

Threads within a warp compute:
  - Matrix-matrix products
  - Matrix-vector products
  - Gauss-Jordan block solves

A thread stores one column of a block and one scalar value of a vector

Special attention is needed to help register allocation

Algorithms are implemented with shared memory or __shfl() intrinsic communication (a matrix-vector sketch follows below)

In the worst case (M = 7), 4 threads out of 32 are idle
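A minimal sketch of the shuffle-based matrix-vector product under this layout, where lane l of an M-thread group holds column l of an MxM block in a[] and one vector entry x (modern __shfl_sync(); not the authors' exact code):

template <int M>
__device__ float block_matvec(const float a[M], float x)
{
    int lane = threadIdx.x & 31;
    int l = lane % M, base = lane - l;        // position within the M-lane group
    float y = 0.0f;
    #pragma unroll
    for (int k = 0; k < M; k++) {
        float send = a[(l - k + M) % M] * x;  // the contribution this lane must supply
        y += __shfl_sync(0xffffffffu, send, base + (l + k) % M);
    }
    return y;                                 // lane l now holds row l of A*x
}

All 32 lanes must execute the call; the idle lanes of the last partial group compute values the caller ignores.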


Batch block-tridiagonal matrix storage

Blocks are stored in a row-major format

Blocks of different problems are interleaved for better data locality

A_0^0 \; A_0^1 \; A_0^2 \;\cdots\; A_0^{P-1} \quad A_1^0 \; A_1^1 \;\cdots\; A_1^{P-1} \quad A_2^0 \; A_2^1 \;\cdots \qquad (2)

(A_b^p denotes block-row b of problem p, for P interleaved problems)
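One way to realise this layout (an assumed indexing formula consistent with Eq. (2) and row-major blocks, not necessarily the library's exact one): element (r, c) of block b of problem p, with P problems and MxM blocks, lives at

static inline int block_elem_idx(int p, int b, int r, int c, int P, int M)
{
    // all P blocks with block-row index b sit contiguously (interleaved
    // across problems); within each block, elements are row-major
    return ((b*P + p)*M + r)*M + c;
}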


Batch block-tridiagonal solver

Two versions: shared memory and register shuffle

Register spill above 8x8 DP block size

Approx. 8-16k threads saturate the GPU

Low shared memory use, 576 (1125) bytes/thread block in SP (DP) → good occupancy

Shared memory version, SP, 8x8 blocks:
  - Shared memory efficiency: 84.5%
  - Shared memory load/store throughput: 1700 GB/s
  - L2 hit rate (L1 reads): 51.9%
  - Executed IPC: 1.68
  - Texture cache hit rate: 50%

In SP the shuffle version is better

In DP the shared memory version is better


Data and compute throughput

[Figure (a): effective data throughput in GB/s vs block size M = 2..8, SP and DP.]
[Figure (b): compute throughput in GFLOPS vs block size M = 2..8, SP and DP.]

CPU: Intel Xeon E5-2690, 2 socket, 40MB, 16 core (32 HT), 102 GB/s
GPU: Nvidia K40m, 288 GB/s


Performance comparison

Baseline: multi-threaded LAPACKE_?gbsv_work() banded solver

[Figure: speedup over the LAPACKE baseline vs block size M = 2..8, for the CPU block solver and the GPU shared-memory and shuffle versions; (a) single precision, (b) double precision.]


Conclusion

Batch tridiagonal solvers

Scalar solver:
  - Different optimization strategies:
      - Thomas with shared memory transpose
      - Thomas with register shuffle transpose
      - Thomas/PCR hybrid
  - Library-quality solution for scalar tridiagonal solvers

Block solver:
  - High-throughput solver
  - Higher performance than:
      - a vectorized, multi-threaded CPU block solver, or
      - the banded LAPACK(E) solver

The contributions of the NVIDIA-funded summer interns James Whittle and Catherine Hastings are acknowledged.
