AmgX: Scalability and Performance on Massively Parallel Platforms
Maxim Naumov
SIAM Workshop on Exascale Applied Mathematics: Challenges and Opportunities, 2014


Page 1: (title slide, as above)

Page 2: Applications

[Figure: gallery of applications, with a "YOUR APPLICATION ???" placeholder slot.]

Page 3: Linear Systems

Need to solve a set of linear systems

A_i x_i = f_i,  for i = 1, ..., k

Different methods:

Direct (more reliable, but have large memory requirements)

Multigrid (works well for specific classes of problems)

Preconditioned Iterative (more amenable to parallelism)

Page 4: Algebraic Multigrid (AMG)

Aggregation (unsmoothed)

Selectors: SIZE_2, SIZE_4, SIZE_8

Coarse Generators: LOW_DEG

Classical

Selectors: HMIS and PMIS

Interpolators: D1 and D2

Cycles: V, W and F

Page 5: Preconditioned Iterative Methods

Fixed-point iteration: x_k = x_{k-1} + M^{-1} (f - A x_{k-1})

Jacobi, GS, DILU, ILU0

(Flexible) Krylov subspace methods: K(A,v) = span{v, Av, ..., A^k v}

[F]CG, [F]BiCGStab and [F]GMRES

[Figure: convergence history, residual norm ||r||_2 (log scale, 10^1 to 10^4) vs. number of iterations (0 to 20), for BiCGStab and GMRES(20).]
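The fixed-point iteration above can be sketched in a few lines; the following is a minimal plain-Python illustration (a hand-rolled dense example with M = diag(A), i.e. Jacobi; illustrative only, not AmgX code):

```python
# Fixed-point iteration x_k = x_{k-1} + M^{-1} (f - A x_{k-1}),
# with M = diag(A) (Jacobi), on a small dense system.

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def jacobi_solve(A, f, iters=50):
    n = len(f)
    x = [0.0] * n
    for _ in range(iters):
        r = [fi - axi for fi, axi in zip(f, matvec(A, x))]  # residual f - A x
        x = [xi + ri / A[i][i] for i, (xi, ri) in enumerate(zip(x, r))]
    return x

# Diagonally dominant system, so Jacobi converges; exact solution is [1, 1, 1].
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
f = [5.0, 6.0, 5.0]
x = jacobi_solve(A, f)
```

The same loop with M = I recovers plain Richardson iteration; better preconditioners (GS, DILU, ILU0) only change how the residual is scaled.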

Page 6: Hierarchy/Nesting of Solvers

JSON Config File:

{
  "config_version": 2,
  "solver": {
    "solver": "GMRES",
    "preconditioner": {
      "solver": "AMG",
      "smoother": {
        "solver": "Jacobi"
      },
      "coarse_solver": {
        "solver": "GMRES",
        "preconditioner": {
          "solver": "MC_ILU"
        }
      }
    }
  }
}

GMRES
  solve – local and global operations

AMG
  setup – graph coarsening and matrix-matrix products
  solve – smoothing and matrix-vector products

Jacobi
  solve – simple local (neighbor) smoothing operations

MC-ILU
  setup – graph coloring and factorization
  solve – local (sub)matrix-vector multiplication

Solver hierarchy: GMRES -> AMG -> { smoother: Jacobi, coarse solver: GMRES -> MC-ILU }
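The nesting described above can be made concrete with a short sketch (plain Python with the standard json module; the config text and the solver_tree helper are illustrative, not part of the AmgX API):

```python
import json

# Illustrative config mirroring the slide; every scope has a "solver" key
# and may nest further solvers under a role ("preconditioner", "smoother",
# "coarse_solver").
config_text = """
{ "config_version": 2,
  "solver": {
    "solver": "GMRES",
    "preconditioner": {
      "solver": "AMG",
      "smoother": { "solver": "Jacobi" },
      "coarse_solver": {
        "solver": "GMRES",
        "preconditioner": { "solver": "MC_ILU" }
      }
    }
  }
}
"""

def solver_tree(scope, depth=0):
    """Return indented lines describing the nesting of solvers in a scope."""
    lines = ["  " * depth + scope["solver"]]
    for role in ("preconditioner", "smoother", "coarse_solver"):
        if role in scope:
            lines += solver_tree(scope[role], depth + 1)
    return lines

cfg = json.loads(config_text)
print("\n".join(solver_tree(cfg["solver"])))
```

Walking the tree recovers the hierarchy on the slide: GMRES at the top, AMG as its preconditioner, Jacobi as the smoother, and a GMRES/MC_ILU pair as the coarse solver.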

Page 7: Parallel Algorithms (within a single node)

Page 8: Aggregation

Used in the setup of the hierarchy of levels (in the aggregation-based, not the classical, path)

Uses a heuristic for merging nodes (strongest neighbor)

Takes advantage of graph matching techniques, for example one-phase handshaking

See Jon and Patrice's GTC presentation for details: "Efficient Graph Matching and Coloring on the GPU"
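A minimal sketch of the one-phase handshaking idea (a simplified sequential version; the graph and edge weights below are hypothetical): every node extends a hand to its strongest neighbor, and two nodes that pick each other are matched into an aggregate.

```python
# One-phase handshaking: node u extends a hand to its strongest neighbor
# h(u); u and v are matched iff h(u) == v and h(v) == u.
# graph: {node: {neighbor: edge_weight}} with symmetric weights.

def handshake_matching(graph):
    hand = {u: max(nbrs, key=nbrs.get) for u, nbrs in graph.items() if nbrs}
    pairs = set()
    for u, v in hand.items():
        if hand.get(v) == u:
            pairs.add((min(u, v), max(u, v)))
    return sorted(pairs)

# Path graph 1-2-3-4 with a strong edge between 2 and 3.
graph = {
    1: {2: 1.0},
    2: {1: 1.0, 3: 5.0},
    3: {2: 5.0, 4: 1.0},
    4: {3: 1.0},
}
matches = handshake_matching(graph)  # [(2, 3)]
```

Nodes 1 and 4 stay unmatched in this phase; repeated phases (or merging leftovers into neighboring aggregates) finish the aggregation.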

Page 9: Aggregation

[Figure (slides 9-12): 9x9 matrix sparsity pattern and its adjacency graph; a heuristic pairwise matching merges strongly connected neighbors into aggregates, animated step by step across the slides.]


Page 13: Aggregation

[Figure: matrix sparsity pattern A alongside the prolongation P and restriction R = P^T; each row of P has a single 1, mapping each of the 9 fine nodes onto its aggregate.]

Page 14: Aggregation

[Figure: as on slide 13, now forming the coarse (Galerkin) matrix C = R A P = P^T A P.]
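The Galerkin product above can be sketched in plain Python for a small dense example (the coarse_matrix helper and the aggregate map agg are illustrative assumptions, not AmgX code). With unsmoothed aggregation, P has a single 1 per row, so C = P^T A P just sums entries of A over aggregate pairs:

```python
# Galerkin coarse matrix C = R*A*P for unsmoothed aggregation:
# P[i][agg[i]] = 1 and R = P^T, so C[I][J] is the sum of A[i][j] over
# fine nodes i in aggregate I and j in aggregate J.

def coarse_matrix(A, agg, n_coarse):
    C = [[0.0] * n_coarse for _ in range(n_coarse)]
    for i, row in enumerate(A):
        for j, a in enumerate(row):
            C[agg[i]][agg[j]] += a
    return C

# 4 fine nodes merged into 2 aggregates: {0,1} -> 0 and {2,3} -> 1.
A = [[ 2.0, -1.0,  0.0,  0.0],
     [-1.0,  2.0, -1.0,  0.0],
     [ 0.0, -1.0,  2.0, -1.0],
     [ 0.0,  0.0, -1.0,  2.0]]
agg = [0, 0, 1, 1]
C = coarse_matrix(A, agg, 2)  # [[2.0, -1.0], [-1.0, 2.0]]
```

Note how the 1D Laplacian stencil survives coarsening: the coarse operator is again a (smaller) Laplacian-like matrix.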

Page 15: Sparse Matrix – Sparse Matrix Multiply

The critical task of the AMG hierarchy setup:

Perform C = R*A*P, where R is the restriction and P the prolongation matrix

Page 16: Sparse Matrix – Sparse Matrix Multiply

The critical task of the AMG hierarchy setup: perform C = R*A*P, where R is the restriction and P the prolongation matrix.

Focus on Z = A*P. Let all matrices be stored in CSR format. Writing the rows of Z and P as z_i^T and p_i^T, and the entries of A as a_ij, it is convenient to write

z_i^T = sum_j a_ij * p_j^T

Page 17: Sparse Matrix – Sparse Matrix Multiply

(same row-wise formulation as on the previous slide: z_i^T = sum_j a_ij * p_j^T)

How can we write this in CUDA?

Page 18: CUDA Pseudo-code

// Count phase of C = A*B in CSR: determine the number of distinct column
// indices per row of C. hashTable[row] stands for an abstract per-row hash
// set; its real implementation is what makes or breaks performance.
__global__ void csrgemm_count_kernel(int n, int *Ap, int *Ai, double *Av,
                                     int *Bp, int *Bi, double *Bv,
                                     int *Cp, int *Ci, double *Cv){
    // for each row of A
    for (int row = threadIdx.z + blockIdx.z*blockDim.z; row < n; row += blockDim.z*gridDim.z){
        // for each nonzero column of A in this row
        for (int i = Ap[row] + threadIdx.y + blockIdx.y*blockDim.y; i < Ap[row+1]; i += blockDim.y*gridDim.y){
            int col = Ai[i]; // also, a row of B
            // for each nonzero column of B in that row
            for (int j = Bp[col] + threadIdx.x + blockIdx.x*blockDim.x; j < Bp[col+1]; j += blockDim.x*gridDim.x){
                int col_B = Bi[j];
                // perform union (eliminate duplicates)
                hashTable[row].insert(col_B);
            }
        }
        Cp[row] = hashTable[row].size();
    }
}

Matrix A in CSR format:
Ap – row pointers
Ai – column indices
Av – nonzero values

Page 19: CUDA Pseudo-code

(count kernel as on the previous slide, shown next to the compute phase)

// Compute phase: same traversal, but values are accumulated; an insert
// with an existing key reduces (adds) into the stored value.
__global__ void csrgemm_compute_kernel(int n, int *Ap, int *Ai, double *Av,
                                       int *Bp, int *Bi, double *Bv,
                                       int *Cp, int *Ci, double *Cv){
    // for each row of A
    for (int row = threadIdx.z + blockIdx.z*blockDim.z; row < n; row += blockDim.z*gridDim.z){
        // for each nonzero column of A in this row
        for (int i = Ap[row] + threadIdx.y + blockIdx.y*blockDim.y; i < Ap[row+1]; i += blockDim.y*gridDim.y){
            int col = Ai[i]; // also, a row of B
            double val = Av[i];
            // for each nonzero column of B in that row
            for (int j = Bp[col] + threadIdx.x + blockIdx.x*blockDim.x; j < Bp[col+1]; j += blockDim.x*gridDim.x){
                int col_B = Bi[j];
                double val_B = Bv[j];
                // perform union (eliminate duplicates – reduce values if keys are the same)
                hashTable[row].insert_by_key(col_B, val * val_B);
            }
        }
        // write the row's keys and values into Ci and Cv starting at Cp[row]
        hashTable[row].export_keys(&Ci[Cp[row]]);
        hashTable[row].export_values(&Cv[Cp[row]]);
    }
}

Matrix A in CSR format:
Ap – row pointers
Ai – column indices
Av – nonzero values

Launching a 3D grid
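The same two-phase hash scheme translates directly into a runnable plain-Python sketch (a dict per row stands in for the GPU hash table; this is an illustration of the algorithm, not the AmgX implementation):

```python
# CSR SpGEMM C = A*B with a per-row hash map, mirroring the count/compute
# kernels: keys are column indices of C, values are the accumulated
# products (duplicates are reduced on insert).

def csr_spgemm(n, Ap, Ai, Av, Bp, Bi, Bv):
    Cp, Ci, Cv = [0], [], []
    for row in range(n):
        acc = {}                        # stands in for hashTable[row]
        for i in range(Ap[row], Ap[row + 1]):
            col, val = Ai[i], Av[i]     # nonzero of A; col is a row of B
            for j in range(Bp[col], Bp[col + 1]):
                acc[Bi[j]] = acc.get(Bi[j], 0.0) + val * Bv[j]
        for c in sorted(acc):           # export keys and values
            Ci.append(c)
            Cv.append(acc[c])
        Cp.append(len(Ci))
    return Cp, Ci, Cv

# Identity (2x2) times B leaves B unchanged.
Ap, Ai, Av = [0, 1, 2], [0, 1], [1.0, 1.0]
Bp, Bi, Bv = [0, 2, 3], [0, 1, 1], [3.0, 4.0, 5.0]
Cp, Ci, Cv = csr_spgemm(2, Ap, Ai, Av, Bp, Bi, Bv)
# Cp == [0, 2, 3], Ci == [0, 1, 1], Cv == [3.0, 4.0, 5.0]
```

On the GPU the two phases are split so that Cp can be prefix-summed and C allocated before values are written; here a single pass suffices.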

Page 20: In Practice

Many CUDA kernel optimizations

Coalescing of memory reads

Kepler intrinsics: __popc, __ballot, __any and __all

The hashTable internal implementation is critical for performance

Many library-level optimizations

Hyper-Q – better overlap of tasks in streams

CUDA Stream Priorities – prioritize certain tasks

See Julien's GTC presentation for details: "Optimization of Sparse Matrix-Matrix Multiplication on GPU"

Page 21: Incomplete-LU

Level scheduling (implicit reordering)

Solve the same linear system A x = f, but schedule the rows so that rows in the same level are processed together

ILU preconditioner computed for the original A

Can improve the memory access pattern

Does not affect convergence

Page 22: Incomplete-LU

Level scheduling (implicit reordering), as on the previous slide.

Graph coloring (explicit reordering)

Solve (P^T A Q) (Q^T x) = P^T f, where P and Q are permutation matrices

ILU preconditioner computed on the permuted P^T A Q

Can significantly increase parallelism

Can adversely affect convergence
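The level computation behind the scheduling can be sketched in plain Python (an illustrative sequential version for a lower-triangular CSR factor, not the AmgX code): a row's level is one more than the deepest row it depends on, and all rows of one level can be solved in parallel.

```python
# Level scheduling for a lower-triangular CSR matrix L:
# level[row] = 1 + max(level of its off-diagonal dependencies), so rows
# sharing a level have no mutual dependencies.

def level_schedule(n, Lp, Li):
    level = [0] * n
    for row in range(n):   # rows in order: dependencies are already leveled
        deps = [Li[k] for k in range(Lp[row], Lp[row + 1]) if Li[k] != row]
        level[row] = 1 + max((level[d] for d in deps), default=0)
    return level

# 3x3 lower-triangular pattern: row 0 has only the diagonal, row 1
# depends on row 0, row 2 depends on row 1 -> a strict chain.
Lp = [0, 1, 3, 5]
Li = [0, 0, 1, 1, 2]
levels = level_schedule(3, Lp, Li)  # [1, 2, 3]
```

Grouping row indices by level then yields the Level Ptr / Level Index arrays shown in the example on the next slides.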

Page 23-26: Level Scheduling: Example

[Figure (slides 23-26): lower-triangular matrix sparsity pattern and the corresponding directed acyclic graph (DAG); nodes are assigned to levels by depth, built up step by step. Final result: Level Ptr = 1 4 8 10, Level Index = 1 2 3 4 5 6 7 8 9, Level/Depth = 1 2 3 (nodes 1-3 at depth 1, nodes 4-7 at depth 2, nodes 8-9 at depth 3).]

Page 27-31: Graph Coloring: Example

[Figure (slides 27-31): matrix sparsity pattern and its graph colored with two colors; nodes 1, 2, 3, 8, 9 receive one color and nodes 4, 5, 6, 7 the other, giving Permutation = 1 2 3 8 9 4 5 6 7. After permutation the two color classes form two levels: Level Ptr = 1 6 10, Level Index = 1 2 3 8 9 4 5 6 7, Level/Depth = 1 2.]
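A minimal sketch of the coloring-based reordering (a sequential greedy coloring in plain Python, not the parallel GPU algorithm; the example graph is hypothetical):

```python
# Greedy graph coloring: give each node the smallest color not used by an
# already-colored neighbor; sorting nodes by color yields the permutation,
# and rows of one color can then be processed in parallel.

def greedy_color(adj):
    color = {}
    for u in sorted(adj):
        used = {color[v] for v in adj[u] if v in color}
        c = 1
        while c in used:
            c += 1
        color[u] = c
    return color

# Path graph 1-2-3: colors alternate along the path.
adj = {1: [2], 2: [1, 3], 3: [2]}
color = greedy_color(adj)                    # {1: 1, 2: 2, 3: 1}
perm = sorted(adj, key=lambda u: color[u])   # nodes grouped by color
```

Unlike level scheduling, this changes the matrix that ILU factors (the permuted P^T A Q), which is why it can hurt convergence while exposing more parallelism.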

Page 32: Parallel Algorithms (across distributed nodes)

Page 33-35: Sparse Matrix-Vector Multiplication

Reordering (to minimize communication): METIS, Scotch, ...

Packing: compress data (global to local indexing); represent connections between partitions

Identify matrix and vector element types:

Interior: local rows without dependency on other partitions

Boundary: local rows with dependency on other partitions

Halo: rows from other partitions connected to the boundary

Vector: elements follow the matrix

[Figure (slides 33-35): block-partitioned matrix across ranks r0, r1, r2, with the off-diagonal (halo) entries marked.]

Page 36-38: Sparse Matrix-Vector Multiplication

Approach 1: exchange vector halo elements at every iteration (less setup)

Approach 2: exchange matrix halo rows once (communicate less often)

[Figure (slides 36-38): the halo entries each approach communicates.]
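The interior/boundary/halo classification from slide 35 can be sketched in plain Python (assuming, purely for illustration, that each rank owns a contiguous block of global rows):

```python
# Classify the rows a rank owns into interior (no off-rank columns) and
# boundary (some off-rank columns), and collect the halo: the off-rank
# elements that boundary rows reference. The rank owns global rows [lo, hi).

def classify_rows(lo, hi, rowptr, colind):
    interior, boundary, halo = [], [], set()
    for row in range(lo, hi):
        cols = colind[rowptr[row - lo]:rowptr[row - lo + 1]]
        off_rank = [c for c in cols if not (lo <= c < hi)]
        if off_rank:
            boundary.append(row)
            halo.update(off_rank)
        else:
            interior.append(row)
    return interior, boundary, sorted(halo)

# The rank owns global rows [2, 4): row 2 touches column 1 (off-rank),
# row 3 is purely local.
rowptr = [0, 3, 5]            # local CSR row pointers for rows 2 and 3
colind = [1, 2, 3, 2, 3]      # global column indices
interior, boundary, halo = classify_rows(2, 4, rowptr, colind)
# interior == [3], boundary == [2], halo == [1]
```

Interior rows can be multiplied while the halo exchange for boundary rows is in flight, which is what makes the overlap of computation and communication possible.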

Page 39: Consolidation

Coarse levels: little work to do (most time is spent in communication)

Fine levels: used to allow multiple ranks on a single GPU

[Figure: matrix blocks of ranks r0, r1, r2 consolidated onto fewer ranks.]

Page 40: Consolidation

(consolidation recap as on the previous slide)

CUDA MPS* (Multi-Process Service)

allows multiple ranks to share a single GPU context

also used to allow multiple ranks on a single GPU

disadvantage: higher kernel launch overhead

*: http://cudamusing.blogspot.com/2013/07/enabling-cuda-multi-process-service-mps.html

[Figure: ranks r0, r1 and r2 sharing one GPU through MPS.]

Page 41: Numerical Experiments

Page 42: AmgX Classical vs. HYPRE

Florida Sparse Matrix Collection

GPU: NVIDIA K40
CPU: 10-core Xeon E5-2690 v2 @ 3.0 GHz

[Figure: speedup of AmgX Classical over HYPRE (0x to 7x, higher is better) across the test matrices.]

Page 43-45: AmgX Aggregation and Classical Weak Scaling

Poisson equation / Laplace operator

Titan (Oak Ridge National Laboratory)
GPU: NVIDIA K20x (one per node)
CPU: 16-core AMD Opteron 6274 @ 2.2 GHz

[Figures (slides 43-45): weak scaling from 1 to 512 GPUs for AmgX 1.0 (PMIS) and AmgX 1.0 (AGG): setup time (0-12 s), solve time per iteration (0-0.16 s) and total time (0-14 s).]

Page 46: ANSYS Fluent on NVIDIA GPUs

ANSYS® Fluent 15.0

Page 47: GPU Acceleration of Water Jacket Analysis

• Unsteady RANS model
• Fluid: water
• Internal flow
• CPU: Intel Xeon E5-2680
• GPU: 2 x Tesla K40

Water jacket model. ANSYS Fluent 15.0 performance on the pressure-based coupled solver. NOTE: times are for 20 time steps; lower is better.

AMG solver time: 4557 s (CPU only) vs. 775 s (CPU + GPU), a 5.9x speedup
Solution time: 6391 s (CPU only) vs. 2520 s (CPU + GPU), a 2.5x speedup

Page 48: GPU Scaling on 111M Aerodynamic Problem

• 111M mixed cells
• External aerodynamics
• Steady, k-e turbulence
• Double-precision solver
• CPU: Intel Xeon E5-2667; 12 cores per node
• GPU: Tesla K40, 4 per node

Truck body model: 144 CPU cores (Amg) vs. 144 CPU cores + 48 GPUs (AmgX); lower is better.

AMG solver time per iteration: 29 s vs. 11 s (2.7x)
Fluent solution time per iteration: 36 s vs. 18 s (2x)

NOTE: AmgX is a GPU solver developed by NVIDIA and implemented by ANSYS in Fluent for accelerating CFD.

Better performance on problems with a relatively high fraction of AMG solver time (here, 80% of the solution time is AMG solver time).

Page 49: AmgX Team

Maxim Naumov, Marat Arsaev, Patrice Castonguay, Jonathan Cohen, Julien Demouth, Simon Layton, Nikolay Markovskiy, Istvan Reguly, Nikolai Sakharnykh, Robert Strzodka and Joe Eaton

Public beta: http://developer.nvidia.com/amgx

Presentations:
[1] "High Performance Algebraic Multigrid for Commercial Applications", J. Cohen, et al., GTC13.
[2] "AmgX: Performance Acceleration for Large-Scale Iterative Methods", J. Eaton, et al., SC2013.

Page 50: Thank you

Page 51: Backup Slides

Page 52: MPI and CUDA

Plain: device -> host, MPI, host -> device; transfers always go through the host

[Figure: GPU0, GPU1, GPU2 attached to a CPU; all traffic is staged through host memory.]

Page 53: MPI and CUDA

Plain: device -> host, MPI, host -> device; transfers always go through the host

GPUDirect*

single rank + multi-GPU
  o memcpy device <-> device
  o access another device's data directly

single rank + network/storage

*: https://developer.nvidia.com/gpudirect

P2P outline:

// check
cudaDeviceCanAccessPeer(&canAccess, gpu0, gpu1);
// enable (called on the device that will access the peer)
cudaDeviceEnablePeerAccess(gpu1, 0);
// use (call a kernel or memcpy)
cudaMemcpy(ptr_gpu0, ptr_gpu1, bytes, cudaMemcpyDefault);
kernel<<<grid, block>>>(ptr_gpu0, ptr_gpu1);
// disable
cudaDeviceDisablePeerAccess(gpu1);

[Figure: GPU0, GPU1, GPU2 attached to a CPU; peer transfers bypass the host.]

Page 54: MPI and CUDA

(Plain, GPUDirect and the P2P outline as on the previous slide)

CUDA IPC** (Inter-Process Communication)

multiple ranks on a single node

often part of CUDA-aware MPIs

**: https://developer.nvidia.com/mpi-solutions-gpus

IPC outline:

// get memory (and event) handle in the owning process
cudaIpcGetMemHandle(&handle, d_ptr);
// open memory (and event) handle in the peer process
cudaIpcOpenMemHandle(&d_peer, handle, cudaIpcMemLazyEnablePeerAccess);
// use (call a kernel or memcpy)
cudaMemcpy(d_local, d_peer, bytes, cudaMemcpyDefault);
kernel<<<grid, block>>>(d_peer);
// close memory (and event) handle
cudaIpcCloseMemHandle(d_peer);

[Figure: GPU0, GPU1, GPU2 attached to a CPU; ranks on one node exchange device pointers via IPC handles.]