AmgX: Scalability and
Performance on Massively
Parallel Platforms
Maxim Naumov
SIAM Workshop on Exascale Applied Mathematics Challenges and Opportunities, 2014
Applications
Linear Systems
Need to solve a set of linear systems
A_i x_i = f_i for i = 1, …, k
Different methods
Direct (more reliable, but have large memory requirements)
Multigrid (work well for specific classes of problems)
Preconditioned Iterative (more amenable to parallelism)
Algebraic Multigrid (AMG)
Aggregation (unsmoothed)
Selectors: SIZE_2, SIZE_4, SIZE_8
Coarse Generators: LOW_DEG
Classical
Selectors: HMIS and PMIS
Interpolators: D1 and D2
Cycles:
V, W and F
Preconditioned Iterative Methods
Fixed-point iteration x_k = x_{k-1} + M^{-1}(f - A x_{k-1})
Jacobi, GS, DILU, ILU0
(Flexible) Krylov subspace methods K(A,v) = span{v, Av, …, A^k v}
[F]CG, [F]BiCGStab and [F]GMRES
[Plot: residual norm ||r||_2 (10^1 to 10^4, log scale) vs. # iterations (0 to 20) for BiCGStab and GMRES(20)]
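As a concrete instance of the fixed-point iteration above, here is a minimal host-side sketch with M = diag(A), i.e. the Jacobi method. The dense storage and the small test system are illustrative assumptions, not AmgX's implementation.

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Fixed-point iteration x = x + M^{-1}(f - A*x) with M = diag(A) (Jacobi).
// Dense storage for clarity only; real solvers work on sparse matrices.
std::vector<double> jacobi_solve(const std::vector<std::vector<double>>& A,
                                 const std::vector<double>& f,
                                 int max_iter, double tol) {
    std::size_t n = f.size();
    std::vector<double> x(n, 0.0);
    for (int k = 0; k < max_iter; ++k) {
        // residual r = f - A*x
        std::vector<double> r(f);
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                r[i] -= A[i][j] * x[j];
        double norm = 0.0;
        for (double ri : r) norm += ri * ri;
        if (std::sqrt(norm) < tol) break;
        // preconditioned correction: x = x + M^{-1} r
        for (std::size_t i = 0; i < n; ++i)
            x[i] += r[i] / A[i][i];
    }
    return x;
}
```

For a diagonally dominant system such as A = [[4,1],[1,3]], f = [1,2], the iteration converges to the exact solution.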
JSON Config File:
{
  "config_version": 2,
  "solver": {
    "solver": "GMRES",
    "preconditioner": {
      "solver": "AMG",
      "smoother": {
        "solver": "Jacobi"
      },
      "coarse_solver": {
        "solver": "GMRES",
        "preconditioner": {
          "solver": "MC_ILU"
        }
      }
    }
  }
}
GMRES
solve – local and global operations
AMG
setup – graph coarsening and matrix-matrix products
solve – smoothing and matrix-vector products
Jacobi
solve – simple local (neighbor) smoothing operations
MC-ILU
setup – graph coloring and factorization
solve – local (sub)matrix-vector multiplication
Hierarchy/Nesting of Solvers:
GMRES, preconditioned by AMG (smoother: Jacobi; coarse solver: GMRES preconditioned by MC-ILU)
Parallel Algorithms
(within a single node)
Aggregation
Used in the setup of the hierarchy of levels
(in the aggregation-, not classical-based path)
Use a heuristic for merging nodes (strongest neighbor)
Take advantage of graph matching techniques
For example, one-phase handshaking
See Jon and Patrice's GTC presentation, "Efficient Graph Matching and Coloring on the GPU", for details
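The one-phase handshaking idea can be sketched sequentially as follows. This is an illustrative reconstruction (CSR-style arrays `ptr`, `idx` and edge weights `w` are assumed names), not the AmgX kernel: each node extends a hand to its strongest neighbor, and two nodes that choose each other form a matched pair to aggregate.

```cpp
#include <vector>
#include <cstddef>

// One-phase handshaking on a weighted graph in CSR form: each node picks
// its heaviest neighbor; mutual picks form a match (an aggregate of two).
// Sequential sketch; the GPU version processes all nodes in parallel.
std::vector<int> handshake_match(const std::vector<int>& ptr,
                                 const std::vector<int>& idx,
                                 const std::vector<double>& w) {
    std::size_t n = ptr.size() - 1;
    std::vector<int> hand(n, -1);
    for (std::size_t u = 0; u < n; ++u) {      // extend hand to strongest neighbor
        double best = -1.0;
        for (int e = ptr[u]; e < ptr[u + 1]; ++e)
            if (w[e] > best) { best = w[e]; hand[u] = idx[e]; }
    }
    std::vector<int> match(n, -1);             // mutual hands -> matched pair
    for (std::size_t u = 0; u < n; ++u)
        if (hand[u] >= 0 && hand[hand[u]] == (int)u)
            match[u] = hand[u];
    return match;
}
```

Nodes left unmatched after one phase can be merged into neighboring aggregates or retried in a later pass.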
Aggregation
[Figure, built up over several slides: 9×9 matrix sparsity pattern (diagonal a11…a99 plus symmetric off-diagonal entries a41, a51, a62, a73, a84, a94, a95) and the corresponding adjacency graph on nodes 1–9, with nodes progressively merged into aggregates]
Aggregation
[Figure: the same sparsity pattern with the resulting prolongation P (one unit entry per fine node, mapping it to its aggregate) and restriction R = P^T]
Coarse Matrix: C = R A P
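For an aggregation-based P with one unit entry per row, the Galerkin product C = R A P = P^T A P reduces to summing A's entries over aggregate pairs. A dense host sketch, purely for illustration (AmgX computes this with sparse matrix-matrix products):

```cpp
#include <vector>
#include <cstddef>

// Galerkin coarse operator C = P^T * A * P for aggregation-based P:
// agg[i] is the aggregate (coarse node) of fine node i, so P(i, agg[i]) = 1
// and C(I,J) is the sum of A(i,j) over i in aggregate I, j in aggregate J.
std::vector<std::vector<double>>
galerkin(const std::vector<std::vector<double>>& A,
         const std::vector<int>& agg, int n_coarse) {
    std::size_t n = A.size();
    std::vector<std::vector<double>> C(n_coarse,
                                       std::vector<double>(n_coarse, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            C[agg[i]][agg[j]] += A[i][j];
    return C;
}
```

For the 1D Laplacian on four points with aggregates {1,2} and {3,4}, the coarse operator is again a 2×2 Laplacian-like stencil.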
Sparse Matrix – Sparse Matrix Multiply
AMG hierarchy setup critical task
Perform C = R*A*P,
where R is the restriction and P is the prolongation matrix
Focus on Z = A*P
Let all matrices be stored in CSR format
Then, it is convenient to write, row by row,
z_i^T = sum_j a_ij * p_j^T
where z_i^T and p_j^T denote the i-th row of Z and the j-th row of P, and A = (a_ij)
How can we write this in CUDA?
__global__ void csrgemm_count_kernel(int n, int *Ap, int *Ai, double *Av,
                                     int *Bp, int *Bi, double *Bv,
                                     int *Cp, int *Ci, double *Cv){
  //for each row of A
  for (int row = threadIdx.z + blockIdx.z*blockDim.z; row < n; row += blockDim.z*gridDim.z){
    //for each column of A in this row
    for (int i = Ap[row] + threadIdx.y + blockIdx.y*blockDim.y; i < Ap[row+1]; i += blockDim.y*gridDim.y){
      int col = Ai[i]; //also, row of B
      //for each column of B in that row
      for (int j = Bp[col] + threadIdx.x + blockIdx.x*blockDim.x; j < Bp[col+1]; j += blockDim.x*gridDim.x){
        int col_B = Bi[j];
        //perform union (eliminate duplicates); hashTable is a per-row concurrent hash set
        hashTable[row].insert(col_B);
      }
    }
    //row size of C; an exclusive scan over Cp then yields the row pointers
    Cp[row] = hashTable[row].size();
  }
}
CUDA Pseudo-code
Matrix A in CSR:
Ap – row pointers
Ai – column indices
Av – nonzero values
CUDA Pseudo-code
__global__ void csrgemm_compute_kernel(int n, int *Ap, int *Ai, double *Av,
                                       int *Bp, int *Bi, double *Bv,
                                       int *Cp, int *Ci, double *Cv){
  //for each row of A
  for (int row = threadIdx.z + blockIdx.z*blockDim.z; row < n; row += blockDim.z*gridDim.z){
    //for each column of A in this row
    for (int i = Ap[row] + threadIdx.y + blockIdx.y*blockDim.y; i < Ap[row+1]; i += blockDim.y*gridDim.y){
      int col = Ai[i]; //also, row of B
      double val = Av[i];
      //for each column of B in that row
      for (int j = Bp[col] + threadIdx.x + blockIdx.x*blockDim.x; j < Bp[col+1]; j += blockDim.x*gridDim.x){
        int col_B = Bi[j];
        double val_B = Bv[j];
        //perform union (eliminate duplicates – reduce values if keys are the same)
        hashTable[row].insert_by_key(col_B, val*val_B);
      }
    }
    //write the accumulated row into C starting at offset Cp[row]
    hashTable[row].export_keys(&Ci[Cp[row]]);
    hashTable[row].export_values(&Cv[Cp[row]]);
  }
}
Launching a 3D grid of thread blocks
In Practice
Many CUDA kernel optimizations
Coalescing of memory reads
Kepler intrinsics __popc, __ballot, __any and __all
hashTable internal implementation is critical for performance
Many library level optimizations
HyperQ – better overlap of tasks in streams
CUDA Stream Priorities – prioritize certain tasks
See Julien's GTC presentation, "Optimization of Sparse Matrix-Matrix Multiplication on GPU", for details
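The count/compute kernel pair above can be mirrored on the host, with an ordered map standing in for the hash table. A sequential reference sketch (an assumption for illustration, not the optimized GPU path):

```cpp
#include <vector>
#include <map>
#include <cstddef>

// Two-phase CSR SpGEMM C = A*B, mirroring the count/compute kernel pair:
// the per-row map stands in for the kernels' hash table, sizing the row
// (count) and accumulating duplicate columns by key (compute) in one pass.
void csr_spgemm(int n,
                const std::vector<int>& Ap, const std::vector<int>& Ai,
                const std::vector<double>& Av,
                const std::vector<int>& Bp, const std::vector<int>& Bi,
                const std::vector<double>& Bv,
                std::vector<int>& Cp, std::vector<int>& Ci,
                std::vector<double>& Cv) {
    Cp.assign(n + 1, 0);
    Ci.clear(); Cv.clear();
    for (int row = 0; row < n; ++row) {
        std::map<int, double> acc;            // per-row "hash table"
        for (int i = Ap[row]; i < Ap[row + 1]; ++i) {
            int col = Ai[i];                  // also a row of B
            double val = Av[i];
            for (int j = Bp[col]; j < Bp[col + 1]; ++j)
                acc[Bi[j]] += val * Bv[j];    // union + reduce by key
        }
        Cp[row + 1] = Cp[row] + (int)acc.size();
        for (auto& kv : acc) { Ci.push_back(kv.first); Cv.push_back(kv.second); }
    }
}
```

The ordered map also leaves each row's column indices sorted, which the CSR format expects.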
Incomplete-LU
Level scheduling (implicit reordering)
Solve the same linear system A x = f,
but reorder A so that the rows in the same level are adjacent
ILU preconditioner computed for the original A
Can improve the memory access pattern
Does not affect convergence
Graph coloring (explicit reordering)
Solve (P^T A Q) (Q^T x) = P^T f,
where P and Q are permutation matrices
ILU preconditioner computed on the permuted PT A Q
Can significantly increase parallelism
Can adversely affect convergence
Level Scheduling: Example
[Figure, built up over several slides: 9×9 lower triangular sparsity pattern (diagonal l11…l99 plus off-diagonal entries l41, l51, l62, l73, l84, l94, l95) and the corresponding directed acyclic graph (DAG) on nodes 1–9]
Level/Depth: 1 2 3
Level Ptr: 1 4 8 10
Level Index: 1 2 3 4 5 6 7 8 9
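The Level Ptr / Level Index arrays above follow from one sweep over a lower triangular CSR matrix: a row's level is one more than the deepest level it depends on. A host sketch (0-based indexing, illustrative):

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Level scheduling for a lower triangular CSR matrix: the level (depth) of
// row i is 1 + the maximum level of the earlier rows it depends on (its
// off-diagonal column indices). Rows in one level can be solved in parallel.
std::vector<int> compute_levels(int n, const std::vector<int>& Lp,
                                const std::vector<int>& Li) {
    std::vector<int> level(n, 1);
    for (int i = 0; i < n; ++i)
        for (int k = Lp[i]; k < Lp[i + 1]; ++k)
            if (Li[k] < i)   // dependency on an earlier row
                level[i] = std::max(level[i], level[Li[k]] + 1);
    return level;
}
```

On the example matrix from the figure (0-indexed), this recovers levels {1,2,3}, {4,5,6,7}, {8,9}.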
Graph Coloring: Example
[Figure, built up over several slides: the same lower triangular sparsity pattern and its adjacency graph, two-colored so that nodes 1, 2, 3, 8, 9 receive one color and nodes 4, 5, 6, 7 the other]
Permutation: 1 2 3 8 9 4 5 6 7
Level/Depth: 1 2
Level Ptr: 1 6 10
Level Index: 1 2 3 8 9 4 5 6 7
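A sequential greedy coloring reproduces the two-color result above. This is an illustrative sketch on an adjacency-list input, not the GPU algorithm (which typically uses randomized parallel coloring):

```cpp
#include <vector>
#include <cstddef>

// Greedy graph coloring: each node takes the smallest color unused by its
// already-colored neighbors. Nodes of one color are mutually independent,
// so a sparse triangular sweep can process each color class in parallel.
std::vector<int> greedy_color(const std::vector<std::vector<int>>& adj) {
    std::size_t n = adj.size();
    std::vector<int> color(n, -1);
    for (std::size_t u = 0; u < n; ++u) {
        std::vector<bool> used(n, false);
        for (int v : adj[u])
            if (color[v] >= 0) used[color[v]] = true;
        int c = 0;
        while (used[c]) ++c;   // smallest free color
        color[u] = c;
    }
    return color;
}
```

On the example graph (edges 1-4, 1-5, 2-6, 3-7, 4-8, 4-9, 5-9), greedy coloring yields exactly two colors, matching the permutation 1 2 3 8 9 | 4 5 6 7.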
Parallel Algorithms
(across distributed nodes)
Sparse Matrix-Vector Multiplication
Reordering (to minimize communication): METIS, Scotch, …
Packing: compress data (global to local); represent connections between partitions
Identify matrix and vector element types:
Interior: local rows without dependency on other partitions
Boundary: local rows with dependency on other partitions
Halo: rows from other partitions connected to the boundary
Vector: elements follow the matrix
[Figure: block matrix partitioned across ranks r0, r1, r2, with off-diagonal (*) entries marking connections between partitions]
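The interior/boundary classification above can be sketched with a scan over each owned row's column indices. A hypothetical host-side sketch (the ownership range [first, last) is an assumed partitioning convention):

```cpp
#include <vector>
#include <cstddef>

// Classify locally owned rows of a partitioned CSR matrix: a row is
// "boundary" if any column index falls outside this partition's ownership
// range [first, last), and "interior" otherwise. Interior rows can be
// multiplied while the halo exchange for boundary rows is in flight.
void classify_rows(const std::vector<int>& Ap, const std::vector<int>& Ai,
                   int first, int last,
                   std::vector<int>& interior, std::vector<int>& boundary) {
    int n = (int)Ap.size() - 1;
    for (int i = 0; i < n; ++i) {
        bool external = false;
        for (int k = Ap[i]; k < Ap[i + 1]; ++k)
            if (Ai[k] < first || Ai[k] >= last) { external = true; break; }
        (external ? boundary : interior).push_back(first + i);  // global ids
    }
}
```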
Sparse Matrix-Vector Multiplication
Approach 1: exchange vector halo elements at every iteration (less setup)
Approach 2: exchange matrix halo rows once (communicate less often)
[Figure: halo (*) entries exchanged between neighboring partitions under each approach]
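Approach 1's per-iteration exchange begins with the packing step described earlier: gathering the boundary elements a neighbor needs into a contiguous buffer. A minimal sketch (the actual exchange, e.g. an MPI_Isend/MPI_Irecv pair, is omitted; `send_indices` is an assumed name for the precomputed local indices):

```cpp
#include <vector>
#include <cstddef>

// Gather the boundary vector elements a neighboring partition needs into a
// contiguous send buffer; on the GPU this is a gather kernel whose output
// is handed to the MPI layer.
std::vector<double> pack_halo(const std::vector<double>& x,
                              const std::vector<int>& send_indices) {
    std::vector<double> buf;
    buf.reserve(send_indices.size());
    for (int i : send_indices)
        buf.push_back(x[i]);   // local indices of boundary elements
    return buf;
}
```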
Consolidation
Coarse levels: little work to do (most time spent in communication)
Fine levels: used to allow multiple ranks on a single GPU
[Figure: matrix blocks of ranks r0, r1, r2 consolidated onto a single rank]
CUDA MPS*
MPS: Multi-Process Service
allows multiple ranks to share a single GPU context
also used to allow multiple ranks on a single GPU
disadvantage: higher kernel launch overhead
[Figure: ranks r0, r1, r2 sharing one GPU through MPS]
*: http://cudamusing.blogspot.com/2013/07/enabling-cuda-multi-process-service-mps.html
Numerical Experiments
AmgX Classical vs. HYPRE
[Bar chart: speedup (0–7, higher is better) on matrices from the Florida Sparse Matrix Collection]
GPU: NVIDIA K40
CPU: 10-core Xeon E5-2690 v2 @ 3.0GHz
Poisson Equation / Laplace operator
Titan (Oak Ridge National Laboratory)
GPU: NVIDIA K20x (one per node)
CPU: 16-core AMD Opteron 6274 @ 2.2GHz
AmgX Aggregation and Classical Weak Scaling
[Plots: setup time, solve time per iteration, and total time (s) vs. # of GPUs (1, 2, 4, …, 512) for AmgX 1.0 (PMIS) and AmgX 1.0 (AGG)]
NVIDIA Confidential
ANSYS Fluent on NVIDIA GPUs
GPU Acceleration of Water Jacket Analysis
• Unsteady RANS model
• Fluid: water
• Internal flow
• CPU: Intel Xeon E5-2680
• GPU: 2 × Tesla K40
Water jacket model; ANSYS Fluent 15.0 performance on the pressure-based coupled solver
NOTE: times are for 20 time steps (lower is better)
AMG solver time: 4557 s (CPU only) vs. 775 s (CPU + GPU) — a 5.9x speedup
Solution time: 6391 s (CPU only) vs. 2520 s (CPU + GPU) — a 2.5x speedup
GPU Scaling on 111M Aerodynamic Problem
• 111M mixed cells
• External aerodynamics
• Steady, k-e turbulence
• Double-precision solver
• CPU: Intel Xeon E5-2667; 12 cores per node
• GPU: Tesla K40, 4 per node
Truck body model; 144 CPU cores (Amg) vs. 144 CPU cores + 48 GPUs (AmgX), lower is better:
AMG solver time per iteration: 29 s vs. 11 s (2.7x)
Fluent solution time per iteration: 36 s vs. 18 s (2x)
NOTE: AmgX is a GPU solver developed by NVIDIA and implemented by ANSYS in Fluent for accelerating CFD
Better performance on problems with a relatively high %AMG solver time (80% here)
AmgX Team Maxim Naumov, Marat Arsaev, Patrice Castonguay, Jonathan Cohen,
Julien Demouth, Simon Layton, Nikolay Markovskiy, Istvan Reguly,
Nikolai Sakharnykh, Robert Strzodka and Joe Eaton
Public beta http://developer.nvidia.com/amgx
Presentations [1] “High Performance Algebraic Multigrid for Commercial Applications”,
J. Cohen, et al., GTC13.
[2] “AmgX: Performance Acceleration for Large-Scale Iterative Methods”,
J. Eaton, et al., SC2013.
Thank you
Backup Slides
MPI and CUDA
Plain
device -> host, MPI, host -> device
always go through the host
GPUDirect*
single rank + multi-GPU
o memcpy device <-> device
o access another device's data directly
single rank + network/storage
CUDA IPC**
IPC: Inter-Process Communication
multiple ranks on a single node
often part of CUDA-aware MPIs
[Figure: GPU0, GPU1, GPU2 attached to one CPU]
*: https://developer.nvidia.com/gpudirect
**: https://developer.nvidia.com/mpi-solutions-gpus
P2P Outline:
//check
cudaDeviceCanAccessPeer(…)
//enable
cudaDeviceEnablePeerAccess(…)
//use (call kernel or memcpy)
cudaMemcpy(gpu0, gpu1, …, cudaMemcpyDefault)
k<<<…>>>(gpu0, gpu1, …)
//disable
cudaDeviceDisablePeerAccess(…)
IPC Outline:
//get memory (and event) handle
cudaIpcGetMemHandle(…)
//open memory (and event) handle
cudaIpcOpenMemHandle(…)
//use (call kernel or memcpy)
cudaMemcpy(gpu0, gpu1, …, cudaMemcpyDefault)
k<<<…>>>(gpu0, gpu1, …)
//close memory (and event) handle
cudaIpcCloseMemHandle(…)