Upload
others
View
8
Download
0
Embed Size (px)
Citation preview
Domain specific libraries for PDEs
Simone Bnà – [email protected] Applications and Innovation Department
Outline
Introduction to Sparse Matrix algebra
The PETSc toolkit
Sparse Matrix Computation with PETSc
Profiling and preliminary tests on KNL
Introduction to Sparse matrix
algebra
Definition of a Sparse Matrix and a Dense Matrix
A sparse matrix is a matrix in which the number of non-zeroes
entries is O(n) (The average number of non-zeroes entries in each
row is bounded independently from n)
A dense matrix is a non-sparse matrix (The number of non-zeroes
elements is O(n2))
Sparsity and Density
The sparsity of a matrix is defined as the number of zero-valued
elements divided by the total number of elements (m x n for an m
x n matrix)
The density of a matrix is defined as the complementary of the
sparsity: density = 1 – sparsity
For Sparse matrices the sparsity is ≈ 1 and the density is << 1
Example:
m = 8 nnzeros = 12
n = 8 nzeros = m*n – nnzeros
sparsity = 64 – 12 / 64 = 0.8125
density = 1 – 0.8125 = 0.1875
Sparsity pattern
The distribution of non-zero elements of a sparse matrix can be
described by the sparsity pattern, which is defined as the set of
entries of the matrix different from zero. In symbols:
{ 𝑖, 𝑗 : 𝐴𝑖𝑗 ≠ 0 }
Sparsity pattern
The sparsity pattern can be represented also as a Graph, where
nodes i and j are connected by an edge if and only if 𝐴𝑖𝑗 ≠ 0
In a Sparse Matrix the degree of a vertex in the graph is
<<relatively low>>
Conceptually, sparsity corresponds to a system loosely coupled
Jacobian of a PDE
Matrices are used to store the Jacobian of a PDE.
The following discretizations generates a sparse matrix
Finite difference
Finite volume
Finite element method (FEM)
Different discretization can lead to a Dense linear matrix:
Spectral element method (SEM)
Sparsity pattern in Finite Difference
The sparsity pattern in finite difference depends on the topology
of the adopted computational grid (e.g. cartesian grid), the
indexing of the nodes and the type of stencil
Sparsity pattern in Finite Difference
The sparsity pattern in finite difference depends on the topology
of the adopted computational grid (e.g. cartesian grid), the
indexing of the nodes and the type of stencil
Sparsity pattern in Finite Element
The sparsity pattern depends on the topology of the adopted
computational grid (e.g. unstructured grid), the kind of the finite
element (e.g. Taylor-Hood, Crouzeix-Raviart, Raviart-Thomas,
Mini-Element,…) and on the indexing of the nodes.
In Finite-Element discretizations, the sparsity of the matrix is a
direct consequence of the small-support property of the finite
element basis
Finite Volume can be seen as a special case of Finite Element
Don’t reinvent the wheel! The use of storage techniques for sparse matrices is fundamental,
in particular for large-scale problems
Standard dense-matrix structures and algorithms are slow and
ineffcient when applied to large sparse matrices
There are some available tools to work with Sparse matrices that
uses specialised algorithms and data structures to take advantage
of the sparse structure of the matrix
The PETSc toolkit (http://www.mcs.anl.gov/petsc/)
The TRILINOS project (https://trilinos.org/)
The PETSc toolkit
PETSc in a nutshell
PETSc – Portable, Extensible Toolkit for ScientificComputation
Is a suite of data structures and routines for the scalable (parallel) solutionof scientific applications mainly modelled by partial differential equations.
Tools for distributed vectors and matrices
Linear system solvers (sparse/dense, iterative/direct)
Non linear system solvers
Serial and parallel computation
Support for Finite Difference and Finite Elements PDE
discretizations
Structured and Unstructured topologies
Support for debugging, profiling and graphical output
PETSc class hierarchy
Frameworks built on top of Petsc
PETSc is a toolkit, not a framework
PETSc is PDE oriented, but not specific to any kind of PDE
Alternatives:
FEM packages: MOOSE, libMesh, DEAL.II, FEniCS
Solvers for classes of problems: CHASTE
MOOSEMultiphy sics object-oriented
Simulation env ironment
libMeshAdaptiv e Finite Element library
PETScPortable, Extensible Toolkit f or Scientif ic
Computation
DEAL.IISophisticated C++ based f inite
element simulation package
PHAMLThe parallel Hierarchical Adaptiv e
MultiLev el Project
ChasteCancer, Heart and Sof t Tissue
Env ironment
FEniCSSophisticated py thon based f inite
element simulation package
PETSc numerical components
External Packages
Dense linear algebra: Scalapack, Plapack
Sparse direct linear solvers: Mumps, SuperLU, SuperLU_dist
Grid partitioning software: Metis, ParMetis, Jostle, Chaco, Party
ODE solvers: PVODE
Eigenvalue solvers (including SVD): SLEPc
Optimization: TAO
PETSc design concepts
Goals
• Portability: available on many platforms, basically anything that has MPI
• Performance
• Scalable parallelism
• Flexibility: easy switch among different implementations
Approach
• Object Oriented Delegation Pattern : many specific implementations of the same object
• Shared interface (overloading):MatMult(A,x,y); // y <- A xsame code for sequential, parallel, dense, sparse
• Command line customization
Drawback
• Nasty details of the implementation hidden
PETSc and Parallelism
PETSc is layered on top of MPI: you do not need to know much MPI when
you use PETSc
All objects in PETSc are defined on a communicator; they can only
interact if on the same communicator
Parallelism through MPI (Pure MPI programming model). Limited support
for use with the hybrid MPI-thread model.
PETSc supports to have individual threads (OpenMP or others) to each manage their own
(sequential) PETSc objects (and each thread can interact only with its own objects).
No support for threaded code that made Petsc calls (OpenMP, Pthreads) since PETSc is not
«thread-safe».
Transparent: same code works sequential and parallel.
Sparse Matrix computation with
PETSc
Vectors
What are PETSc vectors?
• Represent elements of a vector space over a field (e.g. Rn)
• Usually they store field solutions and right-hand sides of PDE
• Vector elements are PetscScalars (there are no vectors of integers)
• Each process locally owns a subvector of contiguously numbered global indices
Features
• Vector types: STANDARD (SEQ on one process and MPI on several), VIENNACL, CUSP…
• Supports all vector space operations
• VecDot(), VecNorm(), VecScale(), …
• Also unusual ops, like e.g. VecSqrt(), VecReciprocal()
• Hidden communication of vector values during assembly
• Communications between different parallel vectors
Numerical vector operations
Matrices
What are PETSc matrices?
• Roughly represent linear operators that belong to the dual of a vector space over a field (e.g. Rn)
• In most of the PETSc low-level implementations, each process logically owns a submatrix of contiguous rows
Features
• Supports many storage formats
• AIJ, BAIJ, SBAIJ, DENSE, VIENNACL, CUSP (on GPU) ...
• Data structures for many external packages
• MUMPS (parallel), SuperLU_dist (parallel), SuperLU, UMFPack
• Hidden communications in parallel matrix assembly
• Matrix operations are defined from a common interface
• Shell matrices via user defined MatMult and other ops
Matrices
The default matrix representation within PETSc is the general sparse AIJ format (Yale sparse matrix or Compressed Sparse Row, CSR)
The nonzero elements are stored by rows Array of corresponding column numbers Array of pointers to the beginning of each row
Matrix memory preallocation
• PETSc matrix creation is very flexible: No preset sparsity pattern
• Memory preallocation is critical for achieving good performance
during matrix assembly, as this reduces the number of allocations
and copies required during the assembling process. Remember:
malloc is very expensive (run your code with –memory_info, -
malloc_log)
• Private representations of PETSc sparse matrices are dynamic data
structures: additional nonzeros can be freely added (if no
preallocation has been explicitly provided).
• No preset sparsity pattern, any processor can set any element:
potential for lots of malloc calls
• Dynamically adding many nonzeros
requires additional memory allocations
requires copies
→ kills performances!
Preallocation of a parallel sparse matrix
Each process logically owns a matrix subset of contiguously numbered global rows. Each subset consists of two sequential matrices corresponding to diagonal and off-diagonal parts.
P0
P1
P2
Process 0
dnz=2, onz=2
dnnz[0]=2, onnz[0]=2
dnnz[1]=2, onnz[1]=2
dnnz[2]=2, onnz[2]=2
Process 1
dnz=3, onz=2
dnnz[0]=3, onnz[0]=2
dnnz[1]=3, onnz[1]=1
dnnz[2]=2, onnz[2]=1
Process 2
dnz=1, onz=4
dnnz[0]=1, onnz[0]=4
dnnz[1]=1, onnz[1]=4
Numerical Matrix Operations
Matrix multiplication (MatMult)
y A * xA + B * xB
• xB needs to be communicated• A * xA can be computed in the
meantime
Algorithm
• Initiate asynchronous sends/receives for xB
• compute A * xA
• make sure xB is in• compute B * xB
Due to the splitting of the matrix storage into A (diag) and B (off-diag) part, code for the sequential case can be reused.
Sparse Matrices and Linear Solvers
• Solve a linear system A x = b using the Gauss Elimination method
can be very time-resource consuming
• Alternatives to direct solvers are iterative solvers
• Convergence of the succession is not always guaranteed
• Possibly much faster and less memory consuming
• Basic iteration: y <- A x executed once x iteration
• Also needed a good preconditioner: B ≈ A-1
Iterative solver basics
• KSP (Krylov SPace Methods) objects are used for solving linear
systems by means of iterative methods.
• Convergence can be improved by using a suitable PC object
(preconditoner).
• Almost all iterative methods are implemented.
• Classical iterative methods (not belonging to KSP solvers) are
classified as preconditioners
• Direct solution for parallel square matrices available through
external solvers (MUMPS, SuperLU_dist). Petsc provides a built-in
LU serial solver.
• Many KSP options can be controlled by command line
• Tolerances, convergence and divergence reason
• Custom monitors and convergence tests
Solver Types
Preconditioner types
Factorization preconditioner
• Exact factorization: A = LU
• Inexact factorization: A ≈ M = L U where L, U obtained by throwing
away the ‘fill-in’ during the factorization process (sparsity pattern of
M is the same as A)
• Application of the preconditioner (that is, solve Mx = y) approx same
cost as matrix-vector product y <- A x
• Factorization preconditioners are sequential
• PCICC: symmetric matrix, PCILU: nonsymmetric matrix
Parallel preconditioners
• Factorization preconditioners are sequential
• We can use them in parallel as a subpreconditioner of a parallel
preconditioner as Block Jacobi or Additive Schwarz Methods (ASM)
• Each processor has its own block(s) to work with
• Block Jacobi is fully parallel, ASM requires communications between
neighbours
• ASM can be more robust than Block Jacobi and have better
convergence properties
Profiling and preliminary tests on
KNL
Profiling and performance tuning
• Integrated profiling of:
time
floating-point performance
memory usage
communication
• User-defined events
• Profiling by stages of an application
-log_view - Prints an ASCII version of performance data at
program’s conclusion. These statistics are comprehensive and concise and require little overhead; thus, -log_view is intended
as the primary means of monitoring the performance of PETSc
codes.
Log view: Overview
Petsc benchmark: ex56 (3D linear elasticity)
• 3D, tri-linear quadrilateral (Q1), displacement finite element formulation
of linear elasticity. E=1.0, nu=0.25.
• Unit box domain with Dirichlet boundary condition on the y=0 side only.
• Load of 1.0 in x + 2y direction on all nodes (not a true uniform load).
• np = number of processes; npe^{1/3} must be integer
• ne = number of elements in the x,y,z direction; (ne+1)%(npe^{1/3})
must equal zero
• Default solver: GMRES + BLOCK_JACOBI + ILU(0)
Petsc benchmark: ex56 (3D linear elasticity)Command Time
Broadwell (ne=80, np=27) mpirun -np 27 ./ex56 -ne 80 -log_view 14.2 s
KNL (ne=80, np=27) + DRAM mpirun -np 64 numactl --membind=0,1
./ex56 -ne 79 -log_view
38.61 s
KNL (ne=80, np=27) +
MCDRAM=FLAT +
NUMA=SNC2
mpirun -np 27 numactl --membind=2,3
./ex56 -ne 80 -log_view12.12 s
KNL (ne=79, np=64) +
MCDRAM=FLAT +
NUMA=SNC2
mpirun -np 64 numactl --membind=2,3
./ex56 -ne 79 -log_view
10.90 s
KNL (ne=80, np=27) +
MCDRAM=CACHE
mpirun -np 27 ./ex56 -ne 80 -log_view 14.12 s
KNL (ne=80, np=64) +
MCDRAM=CACHE
mpirun -np 64 ./ex56 -ne 79 -log_view 12.50 s
Thank you for the attention