Domain specific libraries for PDEs

Simone Bnà – [email protected]

SuperComputing Applications and Innovation Department

Outline

Introduction to Sparse Matrix algebra

The PETSc toolkit

Sparse Matrix Computation with PETSc

Profiling and preliminary tests on KNL

Introduction to Sparse Matrix algebra

Definition of a Sparse Matrix and a Dense Matrix

A sparse matrix is a matrix in which the number of non-zero entries is $O(n)$ (the average number of non-zero entries in each row is bounded independently of $n$)

A dense matrix is a non-sparse matrix (the number of non-zero elements is $O(n^2)$)

Sparsity and Density

The sparsity of a matrix is defined as the number of zero-valued elements divided by the total number of elements (m x n for an m x n matrix)

The density of a matrix is defined as the complement of the sparsity: density = 1 – sparsity

For sparse matrices the sparsity is ≈ 1 and the density is << 1

Example:

m = 8, n = 8

nnzeros = 12, nzeros = m*n – nnzeros = 52

sparsity = (64 – 12) / 64 = 0.8125

density = 1 – 0.8125 = 0.1875
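The two definitions can be checked with a few lines of Python. This is only a minimal sketch; the function name `sparsity_and_density` is chosen here for illustration and is not part of any library.

```python
def sparsity_and_density(m, n, nnz):
    """Return (sparsity, density) of an m x n matrix with nnz non-zero entries."""
    total = m * n
    sparsity = (total - nnz) / total   # zero-valued entries / all entries
    density = nnz / total              # equals 1 - sparsity
    return sparsity, density


# Same numbers as the 8 x 8 example with 12 non-zero entries
sparsity, density = sparsity_and_density(8, 8, 12)
print(sparsity, density)  # 0.8125 0.1875
```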

Sparsity pattern

The distribution of non-zero elements of a sparse matrix can be described by the sparsity pattern, which is defined as the set of entries of the matrix different from zero. In symbols:

$\{ (i, j) : A_{ij} \neq 0 \}$

Sparsity pattern

The sparsity pattern can also be represented as a graph, where nodes i and j are connected by an edge if and only if $A_{ij} \neq 0$

In a sparse matrix the degree of a vertex in the graph is relatively low

Conceptually, sparsity corresponds to a loosely coupled system

Jacobian of a PDE

Matrices are used to store the Jacobian of a PDE.

The following discretizations generate a sparse matrix

Finite difference

Finite volume

Finite element method (FEM)

Other discretizations can lead to a dense matrix:

Spectral element method (SEM)

Sparsity pattern in Finite Difference

The sparsity pattern in finite difference depends on the topology of the adopted computational grid (e.g. Cartesian grid), the indexing of the nodes and the type of stencil
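As an illustration of how grid, indexing and stencil determine the pattern, the sketch below builds the sparsity pattern of a 5-point stencil on a small Cartesian grid. The row-by-row (lexicographic) node numbering and the function name `five_point_pattern` are assumptions made for this example; a different numbering or stencil would give a different pattern.

```python
def five_point_pattern(nx, ny):
    """Sparsity pattern of a 5-point stencil on an nx x ny Cartesian grid.

    Nodes are numbered row by row: k = j * nx + i.
    Returns the set of (row, col) pairs where the matrix is non-zero.
    """
    pattern = set()
    for j in range(ny):
        for i in range(nx):
            k = j * nx + i
            pattern.add((k, k))               # the node itself
            if i > 0:
                pattern.add((k, k - 1))       # west neighbour
            if i < nx - 1:
                pattern.add((k, k + 1))       # east neighbour
            if j > 0:
                pattern.add((k, k - nx))      # south neighbour
            if j < ny - 1:
                pattern.add((k, k + nx))      # north neighbour
    return pattern


# Print the pattern of the 9 x 9 matrix of a 3 x 3 grid ("x" = non-zero)
pattern = five_point_pattern(3, 3)
for r in range(9):
    print("".join("x" if (r, c) in pattern else "." for c in range(9)))
```

Each row has at most five non-zero entries, whatever the grid size, which is the O(n) behaviour of the definition of a sparse matrix.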


Sparsity pattern in Finite Element

The sparsity pattern depends on the topology of the adopted computational grid (e.g. unstructured grid), the kind of finite element (e.g. Taylor-Hood, Crouzeix-Raviart, Raviart-Thomas, Mini-Element, …) and on the indexing of the nodes.

In Finite Element discretizations, the sparsity of the matrix is a direct consequence of the small-support property of the finite element basis

Finite Volume can be seen as a special case of Finite Element

Don’t reinvent the wheel!

The use of storage techniques for sparse matrices is fundamental, in particular for large-scale problems

Standard dense-matrix structures and algorithms are slow and inefficient when applied to large sparse matrices

There are some available tools to work with sparse matrices that use specialised algorithms and data structures to take advantage of the sparse structure of the matrix

The PETSc toolkit (http://www.mcs.anl.gov/petsc/)

The TRILINOS project (https://trilinos.org/)

The PETSc toolkit

PETSc in a nutshell

PETSc – Portable, Extensible Toolkit for Scientific Computation

It is a suite of data structures and routines for the scalable (parallel) solution of scientific applications mainly modelled by partial differential equations.

Tools for distributed vectors and matrices

Linear system solvers (sparse/dense, iterative/direct)

Nonlinear system solvers

Serial and parallel computation

Support for Finite Difference and Finite Element PDE discretizations

Structured and Unstructured topologies

Support for debugging, profiling and graphical output

PETSc class hierarchy

Frameworks built on top of PETSc

PETSc is a toolkit, not a framework

PETSc is PDE oriented, but not specific to any kind of PDE

Alternatives:

FEM packages: MOOSE, libMesh, DEAL.II, FEniCS

Solvers for classes of problems: CHASTE

MOOSE: Multiphysics Object-Oriented Simulation Environment

libMesh: Adaptive Finite Element library

PETSc: Portable, Extensible Toolkit for Scientific Computation

DEAL.II: Sophisticated C++ based finite element simulation package

PHAML: The Parallel Hierarchical Adaptive MultiLevel Project

Chaste: Cancer, Heart and Soft Tissue Environment

FEniCS: Sophisticated Python based finite element simulation package

PETSc numerical components

External Packages

Dense linear algebra: Scalapack, Plapack

Sparse direct linear solvers: Mumps, SuperLU, SuperLU_dist

Grid partitioning software: Metis, ParMetis, Jostle, Chaco, Party

ODE solvers: PVODE

Eigenvalue solvers (including SVD): SLEPc

Optimization: TAO

PETSc design concepts

Goals

• Portability: available on many platforms, basically anything that has MPI

• Performance

• Scalable parallelism

• Flexibility: easy switch among different implementations

Approach

• Object Oriented Delegation Pattern: many specific implementations of the same object

• Shared interface (overloading): MatMult(A,x,y); // y <- A x

  same code for sequential, parallel, dense, sparse

• Command line customization

Drawback

• Nasty details of the implementation hidden
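The delegation idea behind the shared interface can be sketched in Python. This is not PETSc code: the classes `DenseMat` and `CooMat` and the function `mat_mult` are hypothetical names used only to show how one call can serve several storage formats.

```python
class DenseMat:
    """Matrix stored as a full list of rows."""

    def __init__(self, rows):
        self.rows = rows

    def mult(self, x):
        return [sum(a * xj for a, xj in zip(row, x)) for row in self.rows]


class CooMat:
    """Sparse matrix stored as a dictionary {(i, j): value}."""

    def __init__(self, n_rows, entries):
        self.n_rows = n_rows
        self.entries = entries

    def mult(self, x):
        y = [0] * self.n_rows
        for (i, j), a in self.entries.items():
            y[i] += a * x[j]
        return y


def mat_mult(A, x):
    """y <- A x: the caller does not know (or care) how A is stored."""
    return A.mult(x)


dense = DenseMat([[2, 0, 1], [0, 3, 0], [4, 0, 5]])
sparse = CooMat(3, {(0, 0): 2, (0, 2): 1, (1, 1): 3, (2, 0): 4, (2, 2): 5})
x = [1, 2, 3]
print(mat_mult(dense, x), mat_mult(sparse, x))  # same result from both formats
```

The user code calls `mat_mult` only; switching the storage format changes the object that is created, not the code that uses it.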

PETSc and Parallelism

PETSc is layered on top of MPI: you do not need to know much MPI when you use PETSc

All objects in PETSc are defined on a communicator; they can only interact if they are on the same communicator

Parallelism through MPI (pure MPI programming model). Limited support for use with the hybrid MPI-thread model.

PETSc allows individual threads (OpenMP or others) to each manage their own (sequential) PETSc objects; each thread can interact only with its own objects.

No support for threaded code (OpenMP, Pthreads) that makes PETSc calls, since PETSc is not thread-safe.

Transparent: the same code works in sequential and in parallel.

Sparse Matrix Computation with PETSc

Vectors

What are PETSc vectors?

• Represent elements of a vector space over a field (e.g. $\mathbb{R}^n$)

• Usually they store field solutions and right-hand sides of PDEs

• Vector elements are PetscScalars (there are no vectors of integers)

• Each process locally owns a subvector of contiguously numbered global indices
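The contiguous ownership of global indices can be pictured with a short Python sketch. The splitting rule used here (equal shares, with the remainder given to the lowest ranks) is an assumption made for illustration; the actual layout of a PETSc vector is decided by the library or by the sizes the user passes in. The name `ownership_ranges` is hypothetical.

```python
def ownership_ranges(n_global, n_procs):
    """Split indices 0..n_global-1 into contiguous [start, end) ranges, one per process."""
    ranges = []
    start = 0
    for rank in range(n_procs):
        # equal share, plus one extra index for the first (n_global % n_procs) ranks
        n_local = n_global // n_procs + (1 if rank < n_global % n_procs else 0)
        ranges.append((start, start + n_local))
        start += n_local
    return ranges


print(ownership_ranges(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```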

Features

• Vector types: STANDARD (SEQ on one process and MPI on several), VIENNACL, CUSP…

• Supports all vector space operations

• VecDot(), VecNorm(), VecScale(), …

• Also unusual ops, like e.g. VecSqrt(), VecReciprocal()

• Hidden communication of vector values during assembly

• Communications between different parallel vectors

Numerical vector operations

Matrices

What are PETSc matrices?

• Roughly represent linear operators that belong to the dual of a vector space over a field (e.g. $\mathbb{R}^n$)

• In most of the PETSc low-level implementations, each process logically owns a submatrix of contiguous rows

Features

• Supports many storage formats

• AIJ, BAIJ, SBAIJ, DENSE, VIENNACL, CUSP (on GPU) ...

• Data structures for many external packages

• MUMPS (parallel), SuperLU_dist (parallel), SuperLU, UMFPack

• Hidden communications in parallel matrix assembly

• Matrix operations are defined from a common interface

• Shell matrices via user defined MatMult and other ops

Matrices

The default matrix representation within PETSc is the general sparse AIJ format (Yale sparse matrix or Compressed Sparse Row, CSR)

• The nonzero elements are stored by rows

• Array of corresponding column numbers

• Array of pointers to the beginning of each row
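The three arrays of the CSR layout can be built by hand for a small matrix. The sketch below is plain Python, not PETSc's internal implementation; the names `dense_to_csr` and `csr_matvec` are chosen for this example.

```python
def dense_to_csr(dense):
    """Convert a dense matrix (list of rows) into the three CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)       # non-zero values, stored row by row
                col_idx.append(j)      # column number of each stored value
        row_ptr.append(len(values))    # where the next row starts in `values`
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """y = A x using only the stored non-zero entries."""
    y = []
    for i in range(len(row_ptr) - 1):
        y.append(sum(values[k] * x[col_idx[k]]
                     for k in range(row_ptr[i], row_ptr[i + 1])))
    return y


A = [[10, 0, 0, 2],
     [0, 3, 0, 0],
     [0, 0, 0, 0],
     [1, 0, 4, 5]]
values, col_idx, row_ptr = dense_to_csr(A)
print(values)   # [10, 2, 3, 1, 4, 5]
print(col_idx)  # [0, 3, 1, 0, 2, 3]
print(row_ptr)  # [0, 2, 3, 3, 6]
```

An empty row (row 2 here) shows up as two equal consecutive entries in the row pointer array.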

Matrix memory preallocation

• PETSc matrix creation is very flexible: no preset sparsity pattern

• Memory preallocation is critical for achieving good performance during matrix assembly, as this reduces the number of allocations and copies required during the assembly process. Remember: malloc is very expensive (run your code with -memory_info, -malloc_log)

• Private representations of PETSc sparse matrices are dynamic data structures: additional nonzeros can be freely added (if no preallocation has been explicitly provided).

• No preset sparsity pattern, any processor can set any element: potential for lots of malloc calls

• Dynamically adding many nonzeros

  requires additional memory allocations

  requires copies

  → kills performance!

Preallocation of a parallel sparse matrix

Each process logically owns a matrix subset of contiguously numbered global rows. Each subset consists of two sequential matrices corresponding to diagonal and off-diagonal parts.

Example (three processes):

Process 0: dnz=2, onz=2, dnnz[0]=2, onnz[0]=2

Process 1: dnz=3, onz=2

Process 2: dnz=1, onz=4
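The per-row counts for the diagonal and off-diagonal parts can be computed from the sparsity pattern and the range of rows owned by a process. The sketch below is plain Python with hypothetical names (`count_nonzeros`, `d_nnz`, `o_nnz`), and it assumes a square matrix whose columns are split in the same way as its rows. In a PETSc code, arrays of this kind are what one would pass to a preallocation routine such as MatMPIAIJSetPreallocation().

```python
def count_nonzeros(pattern, row_start, row_end):
    """Count non-zeros per local row in the diagonal and off-diagonal blocks.

    pattern: set of global (row, col) pairs; the process owns rows
    [row_start, row_end) and, by assumption, the same range of columns.
    """
    n_local = row_end - row_start
    d_nnz = [0] * n_local   # columns owned by this process
    o_nnz = [0] * n_local   # columns owned by other processes
    for i, j in pattern:
        if row_start <= i < row_end:
            if row_start <= j < row_end:
                d_nnz[i - row_start] += 1
            else:
                o_nnz[i - row_start] += 1
    return d_nnz, o_nnz


# Tridiagonal 6 x 6 matrix distributed over 3 processes, 2 rows each
n = 6
tridiag = {(i, j) for i in range(n) for j in (i - 1, i, i + 1) if 0 <= j < n}
for rank, (start, end) in enumerate([(0, 2), (2, 4), (4, 6)]):
    print(rank, count_nonzeros(tridiag, start, end))
```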



Numerical Matrix Operations

Matrix multiplication (MatMult)

y <- A * xA + B * xB

• xB needs to be communicated

• A * xA can be computed in the meantime

Algorithm

• Initiate asynchronous sends/receives for xB

• compute A * xA

• make sure xB is in

• compute B * xB

Due to the splitting of the matrix storage into A (diag) and B (off-diag) part, code for the sequential case can be reused.
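The four steps can be imitated in a serial Python sketch, where the communication of xB is replaced by a simple lookup in a global list. Everything here (`block_matvec`, `local_matmult`, the dictionary storage) is an assumption made for illustration; it is not how PETSc stores or communicates data, only the order of operations is the same.

```python
def block_matvec(block, x, n_rows):
    """y = block * x, where block maps (local_row, col) -> value."""
    y = [0.0] * n_rows
    for (i, j), a in block.items():
        y[i] += a * x[j]
    return y


def local_matmult(entries, row_start, row_end, x_global):
    """Rows [row_start, row_end) of y = M x, computed as A*xA + B*xB."""
    n_local = row_end - row_start
    owned = range(row_start, row_end)
    # A: diagonal block, columns owned by this process (local column numbering)
    A = {(i - row_start, j - row_start): a
         for (i, j), a in entries.items() if i in owned and j in owned}
    # B: off-diagonal block, columns owned by other processes (global numbering)
    B = {(i - row_start, j): a
         for (i, j), a in entries.items() if i in owned and j not in owned}

    xA = x_global[row_start:row_end]
    ghost_cols = sorted({j for (_, j) in B})    # 1. with MPI: post receives for these
    y = block_matvec(A, xA, n_local)            # 2. compute A * xA in the meantime
    xB = {j: x_global[j] for j in ghost_cols}   # 3. stand-in for "xB has arrived"
    yB = block_matvec(B, xB, n_local)           # 4. compute B * xB
    return [a + b for a, b in zip(y, yB)]


# Tridiagonal (-1, 2, -1) matrix of size 6 on 3 "processes"
entries = {}
for i in range(6):
    entries[(i, i)] = 2.0
    if i > 0:
        entries[(i, i - 1)] = -1.0
    if i < 5:
        entries[(i, i + 1)] = -1.0
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = []
for start, end in [(0, 2), (2, 4), (4, 6)]:
    y += local_matmult(entries, start, end, x)
print(y)
```

Both products use the same `block_matvec` routine, which mirrors the remark that the sequential code can be reused for the diagonal and off-diagonal parts.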

Sparse Matrices and Linear Solvers

• Solving a linear system A x = b with Gaussian elimination can be very time- and resource-consuming

• Alternatives to direct solvers are iterative solvers

• Convergence of the sequence of iterates is not always guaranteed

• Possibly much faster and less memory consuming

• Basic iteration: y <- A x, executed once per iteration

• A good preconditioner is also needed: $B \approx A^{-1}$
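A minimal sketch of these points is a preconditioned Richardson iteration, x <- x + B (b - A x), with the Jacobi choice B = inverse of the diagonal of A. This is one of the simplest iterative schemes, picked here only for illustration (it is not a Krylov method), and the function name `jacobi_richardson` is hypothetical. It performs one matrix-vector product per iteration and shows both a convergent and a non-convergent case.

```python
def jacobi_richardson(A, b, tol=1e-10, max_it=200):
    """Iterate x <- x + B (b - A x) with B = diag(A)^-1.

    Returns (x, converged, iterations).
    """
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]   # the preconditioner B
    x = [0.0] * n
    for it in range(max_it):
        # the only matrix-vector product of this iteration
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        r = [b[i] - Ax[i] for i in range(n)]        # residual
        if max(abs(ri) for ri in r) < tol:
            return x, True, it
        x = [x[i] + inv_diag[i] * r[i] for i in range(n)]
    return x, False, max_it


# Diagonally dominant matrix: the iteration converges to [1, 2, 3]
A_good = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
x_good, ok_good, its = jacobi_richardson(A_good, [6.0, 12.0, 14.0])
print(x_good, ok_good, its)

# Not diagonally dominant: the same iteration diverges
A_bad = [[1.0, 2.0], [2.0, 1.0]]
x_bad, ok_bad, _ = jacobi_richardson(A_bad, [1.0, 1.0], max_it=50)
print(ok_bad)  # False
```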

Iterative solver basics

• KSP (Krylov SPace Methods) objects are used for solving linear systems by means of iterative methods.

• Convergence can be improved by using a suitable PC object (preconditioner).

• Most of the commonly used Krylov methods are implemented (e.g. CG, GMRES, BiCGStab).

• Classical iterative methods (not belonging to KSP solvers) are classified as preconditioners

• Direct solution for parallel square matrices available through external solvers (MUMPS, SuperLU_dist). PETSc provides a built-in LU serial solver.

• Many KSP options can be controlled by command line

• Tolerances, convergence and divergence reason

• Custom monitors and convergence tests

Solver Types

Preconditioner types

Factorization preconditioner

• Exact factorization: A = LU

• Inexact factorization: A ≈ M = L U, where L and U are obtained by throwing away the ‘fill-in’ during the factorization process (the combined sparsity pattern of L and U is the same as that of A)

• Application of the preconditioner (that is, solving Mx = y) costs approximately the same as the matrix-vector product y <- A x

• Factorization preconditioners are sequential

• PCICC: symmetric matrix, PCILU: nonsymmetric matrix
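The idea of throwing away the fill-in can be sketched with a small ILU(0) factorization in Python. The code works on a dense list of lists for readability and is not PETSc's PCILU implementation; the names `ilu0` and `lu_product` are chosen for this example. Entries outside the sparsity pattern of A are never updated, so they stay zero.

```python
def ilu0(A):
    """Incomplete LU with zero fill-in.

    Returns one matrix holding L (strictly lower part, unit diagonal implied)
    and U (upper part including the diagonal), restricted to the pattern of A.
    """
    n = len(A)
    LU = [row[:] for row in A]
    pattern = {(i, j) for i in range(n) for j in range(n) if A[i][j] != 0}
    for i in range(1, n):
        for k in range(i):
            if (i, k) not in pattern:
                continue
            LU[i][k] /= LU[k][k]                  # multiplier l_ik
            for j in range(k + 1, n):
                if (i, j) in pattern:             # update only existing entries
                    LU[i][j] -= LU[i][k] * LU[k][j]
    return LU


def lu_product(LU):
    """Form M = L U from the combined storage returned by ilu0."""
    n = len(LU)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else LU[i][k]
                M[i][j] += l_ik * LU[k][j]
    return M


# 5-point Laplacian on a 2 x 2 grid: a full LU would create fill-in at (1,2) and (2,1)
A = [[4.0, -1.0, -1.0, 0.0],
     [-1.0, 4.0, 0.0, -1.0],
     [-1.0, 0.0, 4.0, -1.0],
     [0.0, -1.0, -1.0, 4.0]]
LU = ilu0(A)
M = lu_product(LU)
print(LU[1][2], LU[2][1])   # both stay 0.0: the fill-in was discarded
print(M[1][2])              # non-zero: M differs from A outside the pattern
```

On the positions where A is non-zero the product M reproduces A; the approximation error sits only in the discarded fill-in positions.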

Parallel preconditioners

• Factorization preconditioners are sequential

• We can use them in parallel as a subpreconditioner of a parallel preconditioner such as Block Jacobi or Additive Schwarz Methods (ASM)

• Each processor has its own block(s) to work with

• Block Jacobi is fully parallel, ASM requires communications between neighbours

• ASM can be more robust than Block Jacobi and have better convergence properties

Profiling and preliminary tests on KNL

Profiling and performance tuning

• Integrated profiling of:

time

floating-point performance

memory usage

communication

• User-defined events

• Profiling by stages of an application

-log_view: prints an ASCII version of performance data at the program’s conclusion. These statistics are comprehensive and concise and require little overhead; thus, -log_view is intended as the primary means of monitoring the performance of PETSc codes.

Log view: Overview

PETSc benchmark: ex56 (3D linear elasticity)

• 3D, tri-linear quadrilateral (Q1), displacement finite element formulation of linear elasticity. E=1.0, nu=0.25.

• Unit box domain with Dirichlet boundary condition on the y=0 side only.

• Load of 1.0 in x + 2y direction on all nodes (not a true uniform load).

• np = number of processes; np^{1/3} must be an integer

• ne = number of elements in the x, y, z directions; (ne+1)%(np^{1/3}) must equal zero

• Default solver: GMRES + BLOCK_JACOBI + ILU(0)
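The two constraints on the number of processes and the number of elements can be checked before submitting a run. The helper below is a hypothetical convenience written for this note, not part of ex56 or PETSc.

```python
def valid_ex56_run(n_procs, ne):
    """Check that n_procs is a perfect cube and that (ne + 1) is divisible by its cube root."""
    c = round(n_procs ** (1.0 / 3.0))   # candidate cube root
    if c ** 3 != n_procs:
        return False
    return (ne + 1) % c == 0


print(valid_ex56_run(27, 80))  # True:  27 = 3^3 and 81 % 3 == 0
print(valid_ex56_run(64, 79))  # True:  64 = 4^3 and 80 % 4 == 0
print(valid_ex56_run(64, 80))  # False: 81 % 4 != 0
```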

PETSc benchmark: ex56 (3D linear elasticity)

Configuration | Command | Time

Broadwell (ne=80, np=27) | mpirun -np 27 ./ex56 -ne 80 -log_view | 14.2 s

KNL (ne=80, np=27) + DRAM | mpirun -np 64 numactl --membind=0,1 ./ex56 -ne 79 -log_view | 38.61 s

KNL (ne=80, np=27) + MCDRAM=FLAT + NUMA=SNC2 | mpirun -np 27 numactl --membind=2,3 ./ex56 -ne 80 -log_view | 12.12 s

KNL (ne=79, np=64) + MCDRAM=FLAT + NUMA=SNC2 | mpirun -np 64 numactl --membind=2,3 ./ex56 -ne 79 -log_view | 10.90 s

KNL (ne=80, np=27) + MCDRAM=CACHE | mpirun -np 27 ./ex56 -ne 80 -log_view | 14.12 s

KNL (ne=80, np=64) + MCDRAM=CACHE | mpirun -np 64 ./ex56 -ne 79 -log_view | 12.50 s

Thank you for the attention
