AmgX: Scalability and
Performance on Massively
Parallel Platforms
Maxim Naumov
SIAM Workshop on Exascale Applied Mathematics Challenges and Opportunities, 2014
Applications
Linear Systems
Need to solve a set of linear systems
A_i x_i = f_i for i = 1, …, k
Different methods
Direct (more reliable, but have large memory requirements)
Multigrid (work well for specific classes of problems)
Preconditioned Iterative (more amenable to parallelism)
Algebraic Multigrid (AMG)
Aggregation (unsmoothed)
Selectors: SIZE_2, SIZE_4, SIZE_8
Coarse Generators: LOW_DEG
Classical
Selectors: HMIS and PMIS
Interpolators: D1 and D2
Cycles:
V, W and F
Preconditioned Iterative Methods
Fixed-point iteration x_k = x_{k-1} + M^{-1}(f - A x_{k-1})
Jacobi, GS, DILU, ILU0
(Flexible) Krylov subspace methods K(A,v) = span{v, Av, …, A^k v}
[F]CG, [F]BiCGStab and [F]GMRES
[Plot: residual norm ||r||_2 (10^1 to 10^4, log scale) vs. # iterations (0 to 20) for BiCGStab and GMRES(20)]
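As a concrete instance of the fixed-point iteration above, here is a minimal host-side sketch with M = diag(A), i.e. the Jacobi method. The dense storage and the small test system are illustrative assumptions, not AmgX's implementation.

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Fixed-point iteration x = x + M^{-1}(f - A*x) with M = diag(A) (Jacobi).
// Dense storage for clarity only; real solvers work on sparse matrices.
std::vector<double> jacobi_solve(const std::vector<std::vector<double>>& A,
                                 const std::vector<double>& f,
                                 int max_iter, double tol) {
    std::size_t n = f.size();
    std::vector<double> x(n, 0.0);
    for (int k = 0; k < max_iter; ++k) {
        // residual r = f - A*x
        std::vector<double> r(f);
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                r[i] -= A[i][j] * x[j];
        double norm = 0.0;
        for (double ri : r) norm += ri * ri;
        if (std::sqrt(norm) < tol) break;
        // preconditioned correction: x = x + M^{-1} r
        for (std::size_t i = 0; i < n; ++i)
            x[i] += r[i] / A[i][i];
    }
    return x;
}
```

For a diagonally dominant system such as A = [[4,1],[1,3]], f = [1,2], the iteration converges to the exact solution.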
JSON Config File:
{
  "config_version": 2,
  "solver": {
    "solver": "GMRES",
    "preconditioner": {
      "solver": "AMG",
      "smoother": {
        "solver": "Jacobi"
      },
      "coarse_solver": {
        "solver": "GMRES",
        "preconditioner": {
          "solver": "MC_ILU"
        }
      }
    }
  }
}
GMRES
solve – local and global operations
AMG
setup – graph coarsening and matrix-matrix products
solve – smoothing and matrix-vector products
Jacobi
solve – simple local (neighbor) smoothing operations
MC-ILU
setup – graph coloring and factorization
solve – local (sub)matrix-vector multiplication
Hierarchy/Nesting of Solvers:
GMRES, preconditioned by AMG (smoother: Jacobi; coarse solver: GMRES preconditioned by MC-ILU)
Parallel Algorithms
(within a single node)
Aggregation
Used in the setup of the hierarchy of levels
(in the aggregation-, not classical-based path)
Use a heuristic for merging nodes (strongest neighbor)
Take advantage of graph matching techniques
For example, one-phase handshaking
See Jon and Patrice's GTC presentation, "Efficient Graph Matching and Coloring on the GPU", for details
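The one-phase handshaking idea can be sketched sequentially as follows. This is an illustrative reconstruction (CSR-style arrays `ptr`, `idx` and edge weights `w` are assumed names), not the AmgX kernel: each node extends a hand to its strongest neighbor, and two nodes that choose each other form a matched pair to aggregate.

```cpp
#include <vector>
#include <cstddef>

// One-phase handshaking on a weighted graph in CSR form: each node picks
// its heaviest neighbor; mutual picks form a match (an aggregate of two).
// Sequential sketch; the GPU version processes all nodes in parallel.
std::vector<int> handshake_match(const std::vector<int>& ptr,
                                 const std::vector<int>& idx,
                                 const std::vector<double>& w) {
    std::size_t n = ptr.size() - 1;
    std::vector<int> hand(n, -1);
    for (std::size_t u = 0; u < n; ++u) {      // extend hand to strongest neighbor
        double best = -1.0;
        for (int e = ptr[u]; e < ptr[u + 1]; ++e)
            if (w[e] > best) { best = w[e]; hand[u] = idx[e]; }
    }
    std::vector<int> match(n, -1);             // mutual hands -> matched pair
    for (std::size_t u = 0; u < n; ++u)
        if (hand[u] >= 0 && hand[hand[u]] == (int)u)
            match[u] = hand[u];
    return match;
}
```

Nodes left unmatched after one phase can be merged into neighboring aggregates or retried in a later pass.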
Aggregation
[Figure, built up over several slides: 9×9 matrix sparsity pattern (diagonal a11…a99 plus symmetric off-diagonal entries a41, a51, a62, a73, a84, a94, a95) and the corresponding adjacency graph on nodes 1–9, with nodes progressively merged into aggregates]
Aggregation
[Figure: the same sparsity pattern with the resulting prolongation P (one unit entry per fine node, mapping it to its aggregate) and restriction R = P^T]
Coarse Matrix: C = R A P
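For an aggregation-based P with one unit entry per row, the Galerkin product C = R A P = P^T A P reduces to summing A's entries over aggregate pairs. A dense host sketch, purely for illustration (AmgX computes this with sparse matrix-matrix products):

```cpp
#include <vector>
#include <cstddef>

// Galerkin coarse operator C = P^T * A * P for aggregation-based P:
// agg[i] is the aggregate (coarse node) of fine node i, so P(i, agg[i]) = 1
// and C(I,J) is the sum of A(i,j) over i in aggregate I, j in aggregate J.
std::vector<std::vector<double>>
galerkin(const std::vector<std::vector<double>>& A,
         const std::vector<int>& agg, int n_coarse) {
    std::size_t n = A.size();
    std::vector<std::vector<double>> C(n_coarse,
                                       std::vector<double>(n_coarse, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            C[agg[i]][agg[j]] += A[i][j];
    return C;
}
```

For the 1D Laplacian on four points with aggregates {1,2} and {3,4}, the coarse operator is again a 2×2 Laplacian-like stencil.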
Sparse Matrix – Sparse Matrix Multiply
AMG hierarchy setup critical task
Perform C = R*A*P,
where R is the restriction and P is the prolongation matrix
Focus on Z = A*P
Let all matrices be stored in CSR format
Then, it is convenient to write, row by row,
z_i^T = sum_j a_ij * p_j^T
where z_i^T and p_j^T denote the i-th row of Z and the j-th row of P, and A = (a_ij)
How can we write this in CUDA?
__global__ void csrgemm_count_kernel(int n, int *Ap, int *Ai, double *Av,
                                     int *Bp, int *Bi, double *Bv,
                                     int *Cp, int *Ci, double *Cv){
  //for each row of A
  for (int row = threadIdx.z + blockIdx.z*blockDim.z; row < n; row += blockDim.z*gridDim.z){
    //for each column of A in this row
    for (int i = Ap[row] + threadIdx.y + blockIdx.y*blockDim.y; i < Ap[row+1]; i += blockDim.y*gridDim.y){
      int col = Ai[i]; //also, row of B
      //for each column of B in that row
      for (int j = Bp[col] + threadIdx.x + blockIdx.x*blockDim.x; j < Bp[col+1]; j += blockDim.x*gridDim.x){
        int col_B = Bi[j];
        //perform union (eliminate duplicates); hashTable is a per-row concurrent hash set
        hashTable[row].insert(col_B);
      }
    }
    //row size of C; an exclusive scan over Cp then yields the row pointers
    Cp[row] = hashTable[row].size();
  }
}
CUDA Pseudo-code
Matrix A in CSR:
Ap – row pointers
Ai – column indices
Av – nonzero values
CUDA Pseudo-code
__global__ void csrgemm_compute_kernel(int n, int *Ap, int *Ai, double *Av,
                                       int *Bp, int *Bi, double *Bv,
                                       int *Cp, int *Ci, double *Cv){
  //for each row of A
  for (int row = threadIdx.z + blockIdx.z*blockDim.z; row < n; row += blockDim.z*gridDim.z){
    //for each column of A in this row
    for (int i = Ap[row] + threadIdx.y + blockIdx.y*blockDim.y; i < Ap[row+1]; i += blockDim.y*gridDim.y){
      int col = Ai[i]; //also, row of B
      double val = Av[i];
      //for each column of B in that row
      for (int j = Bp[col] + threadIdx.x + blockIdx.x*blockDim.x; j < Bp[col+1]; j += blockDim.x*gridDim.x){
        int col_B = Bi[j];
        double val_B = Bv[j];
        //perform union (eliminate duplicates – reduce values if keys are the same)
        hashTable[row].insert_by_key(col_B, val*val_B);
      }
    }
    //write the accumulated row into C starting at offset Cp[row]
    hashTable[row].export_keys(&Ci[Cp[row]]);
    hashTable[row].export_values(&Cv[Cp[row]]);
  }
}
Launching a 3D grid of thread blocks
In Practice
Many CUDA kernel optimizations
Coalescing of memory reads
Kepler intrinsics __popc, __ballot, __any and __all
hashTable internal implementation is critical for performance
Many library level optimizations
HyperQ – better overlap of tasks in streams
CUDA Stream Priorities – prioritize certain tasks
See Julien's GTC presentation, "Optimization of Sparse Matrix-Matrix Multiplication on GPU", for details
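The count/compute kernel pair above can be mirrored on the host, with an ordered map standing in for the hash table. A sequential reference sketch (an assumption for illustration, not the optimized GPU path):

```cpp
#include <vector>
#include <map>
#include <cstddef>

// Two-phase CSR SpGEMM C = A*B, mirroring the count/compute kernel pair:
// the per-row map stands in for the kernels' hash table, sizing the row
// (count) and accumulating duplicate columns by key (compute) in one pass.
void csr_spgemm(int n,
                const std::vector<int>& Ap, const std::vector<int>& Ai,
                const std::vector<double>& Av,
                const std::vector<int>& Bp, const std::vector<int>& Bi,
                const std::vector<double>& Bv,
                std::vector<int>& Cp, std::vector<int>& Ci,
                std::vector<double>& Cv) {
    Cp.assign(n + 1, 0);
    Ci.clear(); Cv.clear();
    for (int row = 0; row < n; ++row) {
        std::map<int, double> acc;            // per-row "hash table"
        for (int i = Ap[row]; i < Ap[row + 1]; ++i) {
            int col = Ai[i];                  // also a row of B
            double val = Av[i];
            for (int j = Bp[col]; j < Bp[col + 1]; ++j)
                acc[Bi[j]] += val * Bv[j];    // union + reduce by key
        }
        Cp[row + 1] = Cp[row] + (int)acc.size();
        for (auto& kv : acc) { Ci.push_back(kv.first); Cv.push_back(kv.second); }
    }
}
```

The ordered map also leaves each row's column indices sorted, which the CSR format expects.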
Incomplete-LU
Level scheduling (implicit reordering)
Solve the same linear system A x = f,
but reorder A so that the rows in the same level are adjacent
ILU preconditioner computed for the original A
Can improve the memory access pattern
Does not affect convergence
Graph coloring (explicit reordering)
Solve (P^T A Q) (Q^T x) = P^T f,
where P and Q are permutation matrices
ILU preconditioner computed on the permuted PT A Q
Can significantly increase parallelism
Can adversely affect convergence
Level Scheduling: Example
[Figure, built up over several slides: 9×9 lower triangular sparsity pattern (diagonal l11…l99 plus off-diagonal entries l41, l51, l62, l73, l84, l94, l95) and the corresponding directed acyclic graph (DAG) on nodes 1–9]
Level/Depth: 1 2 3
Level Ptr: 1 4 8 10
Level Index: 1 2 3 4 5 6 7 8 9
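The Level Ptr / Level Index arrays above follow from one sweep over a lower triangular CSR matrix: a row's level is one more than the deepest level it depends on. A host sketch (0-based indexing, illustrative):

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Level scheduling for a lower triangular CSR matrix: the level (depth) of
// row i is 1 + the maximum level of the earlier rows it depends on (its
// off-diagonal column indices). Rows in one level can be solved in parallel.
std::vector<int> compute_levels(int n, const std::vector<int>& Lp,
                                const std::vector<int>& Li) {
    std::vector<int> level(n, 1);
    for (int i = 0; i < n; ++i)
        for (int k = Lp[i]; k < Lp[i + 1]; ++k)
            if (Li[k] < i)   // dependency on an earlier row
                level[i] = std::max(level[i], level[Li[k]] + 1);
    return level;
}
```

On the example matrix from the figure (0-indexed), this recovers levels {1,2,3}, {4,5,6,7}, {8,9}.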
Graph Coloring: Example
[Figure, built up over several slides: the same lower triangular sparsity pattern and its adjacency graph, two-colored so that nodes 1, 2, 3, 8, 9 receive one color and nodes 4, 5, 6, 7 the other]
Permutation: 1 2 3 8 9 4 5 6 7
Level/Depth: 1 2
Level Ptr: 1 6 10
Level Index: 1 2 3 8 9 4 5 6 7
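A sequential greedy coloring reproduces the two-color result above. This is an illustrative sketch on an adjacency-list input, not the GPU algorithm (which typically uses randomized parallel coloring):

```cpp
#include <vector>
#include <cstddef>

// Greedy graph coloring: each node takes the smallest color unused by its
// already-colored neighbors. Nodes of one color are mutually independent,
// so a sparse triangular sweep can process each color class in parallel.
std::vector<int> greedy_color(const std::vector<std::vector<int>>& adj) {
    std::size_t n = adj.size();
    std::vector<int> color(n, -1);
    for (std::size_t u = 0; u < n; ++u) {
        std::vector<bool> used(n, false);
        for (int v : adj[u])
            if (color[v] >= 0) used[color[v]] = true;
        int c = 0;
        while (used[c]) ++c;   // smallest free color
        color[u] = c;
    }
    return color;
}
```

On the example graph (edges 1-4, 1-5, 2-6, 3-7, 4-8, 4-9, 5-9), greedy coloring yields exactly two colors, matching the permutation 1 2 3 8 9 | 4 5 6 7.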
Parallel Algorithms
(across distributed nodes)
Sparse Matrix-Vector Multiplication
Reordering (to minimize communication): METIS, Scotch, …
Packing: compress data (global to local); represent connections between partitions
Identify matrix and vector element types:
Interior: local rows without dependency on other partitions
Boundary: local rows with dependency on other partitions
Halo: rows from other partitions connected to the boundary
Vector: elements follow the matrix
[Figure: block matrix partitioned across ranks r0, r1, r2, with off-diagonal (*) entries marking connections between partitions]
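The interior/boundary classification above can be sketched with a scan over each owned row's column indices. A hypothetical host-side sketch (the ownership range [first, last) is an assumed partitioning convention):

```cpp
#include <vector>
#include <cstddef>

// Classify locally owned rows of a partitioned CSR matrix: a row is
// "boundary" if any column index falls outside this partition's ownership
// range [first, last), and "interior" otherwise. Interior rows can be
// multiplied while the halo exchange for boundary rows is in flight.
void classify_rows(const std::vector<int>& Ap, const std::vector<int>& Ai,
                   int first, int last,
                   std::vector<int>& interior, std::vector<int>& boundary) {
    int n = (int)Ap.size() - 1;
    for (int i = 0; i < n; ++i) {
        bool external = false;
        for (int k = Ap[i]; k < Ap[i + 1]; ++k)
            if (Ai[k] < first || Ai[k] >= last) { external = true; break; }
        (external ? boundary : interior).push_back(first + i);  // global ids
    }
}
```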
Sparse Matrix-Vector Multiplication
Approach 1: exchange vector halo elements at every iteration (less setup)
Approach 2: exchange matrix halo rows once (communicate less often)
[Figure: halo (*) entries exchanged between neighboring partitions under each approach]
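Approach 1's per-iteration exchange begins with the packing step described earlier: gathering the boundary elements a neighbor needs into a contiguous buffer. A minimal sketch (the actual exchange, e.g. an MPI_Isend/MPI_Irecv pair, is omitted; `send_indices` is an assumed name for the precomputed local indices):

```cpp
#include <vector>
#include <cstddef>

// Gather the boundary vector elements a neighboring partition needs into a
// contiguous send buffer; on the GPU this is a gather kernel whose output
// is handed to the MPI layer.
std::vector<double> pack_halo(const std::vector<double>& x,
                              const std::vector<int>& send_indices) {
    std::vector<double> buf;
    buf.reserve(send_indices.size());
    for (int i : send_indices)
        buf.push_back(x[i]);   // local indices of boundary elements
    return buf;
}
```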
Consolidation
Coarse levels: little work to do (most time spent in communication)
Fine levels: used to allow multiple ranks on a single GPU
[Figure: matrix blocks of ranks r0, r1, r2 consolidated onto a single rank]
CUDA MPS*
MPS: Multi-Process Service
allows multiple ranks to share a single GPU context
also used to allow multiple ranks on a single GPU
disadvantage: higher kernel launch overhead
[Figure: ranks r0, r1, r2 sharing one GPU through MPS]
*: http://cudamusing.blogspot.com/2013/07/enabling-cuda-multi-process-service-mps.html
Numerical Experiments
AmgX Classical vs. HYPRE
[Bar chart: speedup (0–7, higher is better) on matrices from the Florida Sparse Matrix Collection]
GPU: NVIDIA K40
CPU: 10-core Xeon E5-2690 v2 @ 3.0GHz
Poisson Equation / Laplace operator
Titan (Oak Ridge National Laboratory)
GPU: NVIDIA K20x (one per node)
CPU: 16-core AMD Opteron 6274 @ 2.2GHz
AmgX Aggregation and Classical Weak Scaling
[Plots: setup time, solve time per iteration, and total time (s) vs. # of GPUs (1, 2, 4, …, 512) for AmgX 1.0 (PMIS) and AmgX 1.0 (AGG)]
NVIDIA Confidential
ANSYS Fluent on NVIDIA GPUs
GPU Acceleration of Water Jacket Analysis
• Unsteady RANS model
• Fluid: water
• Internal flow
• CPU: Intel Xeon E5-2680
• GPU: 2 × Tesla K40
Water jacket model; ANSYS Fluent 15.0 performance on the pressure-based coupled solver
NOTE: times are for 20 time steps (lower is better)
AMG solver time: 4557 s (CPU only) vs. 775 s (CPU + GPU) — a 5.9x speedup
Solution time: 6391 s (CPU only) vs. 2520 s (CPU + GPU) — a 2.5x speedup
GPU Scaling on 111M Aerodynamic Problem
• 111M mixed cells
• External aerodynamics
• Steady, k-e turbulence
• Double-precision solver
• CPU: Intel Xeon E5-2667; 12 cores per node
• GPU: Tesla K40, 4 per node
Truck body model; 144 CPU cores (Amg) vs. 144 CPU cores + 48 GPUs (AmgX), lower is better:
AMG solver time per iteration: 29 s vs. 11 s (2.7x)
Fluent solution time per iteration: 36 s vs. 18 s (2x)
NOTE: AmgX is a GPU solver developed by NVIDIA and implemented by ANSYS in Fluent for accelerating CFD
Better performance on problems with a relatively high %AMG solver time (80% here)
AmgX Team Maxim Naumov, Marat Arsaev, Patrice Castonguay, Jonathan Cohen,
Julien Demouth, Simon Layton, Nikolay Markovskiy, Istvan Reguly,
Nikolai Sakharnykh, Robert Strzodka and Joe Eaton
Public beta http://developer.nvidia.com/amgx
Presentations [1] “High Performance Algebraic Multigrid for Commercial Applications”,
J. Cohen, et al., GTC13.
[2] “AmgX: Performance Acceleration for Large-Scale Iterative Methods”,
J. Eaton, et al., SC2013.
Thank you
Backup Slides
MPI and CUDA
Plain
device -> host, MPI, host -> device
always go through the host
GPUDirect*
single rank + multi-GPU
o memcpy device <-> device
o access another device's data directly
single rank + network/storage
CUDA IPC**
IPC: Inter-Process Communication
multiple ranks on a single node
often part of CUDA-aware MPIs
[Figure: GPU0, GPU1, GPU2 attached to one CPU]
*: https://developer.nvidia.com/gpudirect
**: https://developer.nvidia.com/mpi-solutions-gpus
P2P Outline:
//check
cudaDeviceCanAccessPeer(…)
//enable
cudaDeviceEnablePeerAccess(…)
//use (call kernel or memcpy)
cudaMemcpy(gpu0, gpu1, …, cudaMemcpyDefault)
k<<<…>>>(gpu0, gpu1, …)
//disable
cudaDeviceDisablePeerAccess(…)
IPC Outline:
//get memory (and event) handle
cudaIpcGetMemHandle(…)
//open memory (and event) handle
cudaIpcOpenMemHandle(…)
//use (call kernel or memcpy)
cudaMemcpy(gpu0, gpu1, …, cudaMemcpyDefault)
k<<<…>>>(gpu0, gpu1, …)
//close memory (and event) handle
cudaIpcCloseMemHandle(…)