CR07 Sparse Matrix Computations / Cours de Matrices Creuses (2010-2011)
Jean-Yves L'Excellent (INRIA) and Bora Uçar (CNRS)
LIP-ENS Lyon, Team: ROMA
([email protected], [email protected])
Fridays 10h15-12h15
prepared in collaboration with P. Amestoy (ENSEEIHT-IRIT)
Motivations
Applications with an ever-growing need for computing power: modelling, simulation (rather than experimentation), numerical optimization.
Typically: continuous problem → discretization (mesh) → numerical solution algorithm (according to physical laws) → matrix problem (Ax = b, ...).
Needs:
- increasingly accurate models
- increasingly complex problems
- applications with critical response times
- minimization of computing costs
⇒ High-performance computers, parallelism
⇒ Algorithms that make the best use of these computers and of the properties of the matrices at hand
Example of preprocessing for Ax = b
Original matrix (A = lhr01) and preprocessed matrix (A'(lhr01)):
[Spy plots of both matrices; nz = 18427]
Modified problem: $A'x' = b'$ with $A' = P_n P D_r A Q D_c P^t$
A few examples in the field of scientific computing
Time constraints: weather forecasting
A few examples in the field of scientific computing
Cost constraints: wind tunnels, crash simulation, . . .
Scale constraints:
- large scale: climate modelling, pollution, astrophysics
- tiny scale: combustion, quantum chemistry
Contents of the course
Introduction, reminders on graph theory and on numerical linear algebra:
- Introduction to sparse matrices
- Graphs and hypergraphs, trees, some classical algorithms
- Gaussian elimination, LU factorization, fill-in
- Conditioning/sensitivity of a problem and error analysis
Contents of the course
Sparse Gaussian elimination:
- Sparse matrices and graphs
- Gaussian elimination of sparse matrices
- Ordering and permuting sparse matrices
Contents of the course
Maximum (weighted) matching algorithms and their use in sparse linear algebra
Efficient factorization methods:
- Implementation of sparse direct solvers
- Parallel sparse solvers
- Scheduling of computations to optimize memory usage and/or performance
Graph and hypergraph partitioning
Iterative methods
Current research activities
Tentative outline
A. INTRODUCTION
I. Sparse matrices
II. Graph theory and algorithms
III. Linear algebra basics
------------------------------------------------------
B. SPARSE GAUSSIAN ELIMINATION
IV. Elimination tree and structure prediction
V. Fill-reducing ordering methods
VI. Matching in bipartite graphs
VII. Factorization: Methods
VIII. Factorization: Parallelization aspects
------------------------------------------------------
C. SOME OTHER ESSENTIAL SPARSE MATRIX ALGORITHMS
IX. Graph and hypergraph partitioning
X. Iterative methods
------------------------------------------------------
D. CLOSING
XI. Current research activities
XII. Presentations
Tentative organization of the course
References and teaching material will be made available at http://graal.ens-lyon.fr/~bucar/CR07/
2+ hours dedicated to exercises (manipulation of sparse matrices and graphs)
Evaluation: mainly based on the study of research articles related to the contents of the course; the students will write a report and give an oral presentation.
Outline
Introduction to Sparse Matrix Computations
- Motivation and main issues
- Sparse matrices
- Gaussian elimination
- Parallel and high performance computing
- Numerical simulation and sparse matrices
- Direct vs iterative methods
- Conclusion
A selection of references
Books:
- Duff, Erisman and Reid, Direct Methods for Sparse Matrices, Clarendon Press, Oxford, 1986.
- Dongarra, Duff, Sorensen and van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM, 1991.
- Davis, Direct Methods for Sparse Linear Systems, SIAM, 2006.
- Saad, Iterative Methods for Sparse Linear Systems, 2nd edition, SIAM, 2004.
Articles:
- Gilbert and Liu, Elimination structures for unsymmetric sparse LU factors, SIMAX, 1993.
- Liu, The role of elimination trees in sparse factorization, SIMAX, 1990.
- Heath, Ng and Peyton, Parallel algorithms for sparse linear systems, SIAM Review, 1991.
Motivations
- Solution of linear systems of equations: a key algorithmic kernel.
  Continuous problem → discretization → solution of a linear system Ax = b.
- Main parameters:
  - numerical properties of the linear system (symmetry, positive definiteness, conditioning, ...)
  - size and structure: large (> 1,000,000 × 1,000,000?), square/rectangular, dense or sparse (structured/unstructured)
  - target computer (sequential/parallel/multicore)
⇒ Algorithmic choices are critical
Motivations for designing efficient algorithms
- Time-critical applications
- Solve larger problems
- Decrease elapsed time (parallelism?)
- Minimize the cost of computations (time, memory)
Difficulties
- Access to data:
  - computer: complex memory hierarchy (registers, multilevel cache, main memory (shared or distributed), disk)
  - sparse matrix: large, irregular, dynamic data structures
  → Exploit the locality of references to data on the computer (design algorithms providing such locality).
- Efficiency (time and memory):
  - the number of operations and the memory usage depend very much on the algorithm used and on the numerical and structural properties of the problem
  - the algorithm depends on the target computer (vector, scalar, shared, distributed, clusters of Symmetric Multi-Processors (SMP), multicore)
⇒ Algorithmic choices are critical
Sparse matrices
Example: the system
3 x1 + 2 x2 = 5
2 x2 - 5 x3 = 1
2 x1 + 3 x3 = 0
can be represented as Ax = b, where
$A = \begin{pmatrix} 3 & 2 & 0 \\ 0 & 2 & -5 \\ 2 & 0 & 3 \end{pmatrix}$, $x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}$, and $b = \begin{pmatrix} 5 \\ 1 \\ 0 \end{pmatrix}$
Sparse matrix: only nonzeros are stored.
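For concreteness, here is a minimal sketch (not from the original slides) of the 3 × 3 example stored in coordinate format, i.e. three parallel arrays holding the row index, column index and value of each nonzero:

    program coo_example
      implicit none
      integer, parameter :: nnz = 6
      integer :: k
      integer :: irn(nnz) = (/ 1, 1, 2, 2, 3, 3 /)  ! row indices
      integer :: jcn(nnz) = (/ 1, 2, 2, 3, 1, 3 /)  ! column indices
      double precision :: val(nnz) = &              ! the nonzero values
           (/ 3.d0, 2.d0, 2.d0, -5.d0, 2.d0, 3.d0 /)
      do k = 1, nnz                                 ! list the stored entries
         print '(a,i1,a,i1,a,f6.1)', 'A(', irn(k), ',', jcn(k), ') =', val(k)
      end do
    end program coo_example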
Sparse matrix ?
[Spy plot: nz = 5104]
Matrix dwt 592.rua (N = 592, NZ = 5104): structural analysis of a submarine
Sparse matrix ?
Matrix from Computational Fluid Dynamics (collaboration Univ. Tel Aviv):
[Spy plot: nz = 43105]
Saddle-point problem
Preprocessing sparse matrices
Original matrix (A = lhr01) and preprocessed matrix (A'(lhr01)):
[Spy plots of both matrices; nz = 18427]
Modified problem: $A'x' = b'$ with $A' = P_n P D_r A D_c Q P^t$
Factorization process
Solution of Ax = b:
- A is unsymmetric:
  - A is factorized as A = LU, where L is a lower triangular matrix and U is an upper triangular matrix
  - forward-backward substitution: Ly = b, then Ux = y
- A is symmetric:
  - A = LDL^T or LL^T
- A is rectangular, m × n with m ≥ n, and we seek $\min_x \|Ax - b\|_2$:
  - A = QR, where Q is orthogonal (Q^{-1} = Q^T) and R is triangular
  - solve: y = Q^T b, then Rx = y
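A minimal sketch of the forward-backward substitution step; the factors L (unit diagonal) and U are assumed to be already computed, and the numerical values below are hypothetical:

    program fwd_bwd
      implicit none
      integer, parameter :: n = 3
      double precision :: l(n,n), u(n,n), b(n), y(n), x(n)
      integer :: i
      ! hypothetical dense factors of some matrix A = LU
      l = reshape((/1.d0,0.5d0,0.d0, 0.d0,1.d0,0.25d0, 0.d0,0.d0,1.d0/),(/n,n/))
      u = reshape((/4.d0,0.d0,0.d0, 1.d0,2.d0,0.d0, 0.d0,1.d0,3.d0/),(/n,n/))
      b = (/ 5.d0, 3.5d0, 1.875d0 /)
      do i = 1, n        ! forward substitution: solve L y = b
         y(i) = b(i) - dot_product(l(i,1:i-1), y(1:i-1))
      end do
      do i = n, 1, -1    ! backward substitution: solve U x = y
         x(i) = (y(i) - dot_product(u(i,i+1:n), x(i+1:n))) / u(i,i)
      end do
      print *, x         ! x now solves (LU) x = b
    end program fwd_bwd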
Difficulties
- Only nonzero values are stored.
- Factors L and U have far more nonzeros than A.
- Data structures are complex.
- Computations are only a small portion of the code (the rest is data manipulation).
- Memory size is a limiting factor (→ out-of-core solvers).
Key numbers:
1. Small sizes: 500 MB matrix; factors = 5 GB; flops = 100 Gflops.
2. Example of a 2D problem (Lab. Geosciences Azur, Valbonne): complex 2D finite-difference matrix, n = 16 × 10^6, 150 × 10^6 nonzeros; storage (single precision): 2 GB (12 GB with the factors); flops: 10 TeraFlops.
3. Example of a 3D problem (EDF, Code_Aster, structural engineering): real finite-element matrix, n = 10^6, nz = 71 × 10^6 nonzeros; storage: 3.5 × 10^9 entries (28 GB) for the factors, 35 GB total; flops: 2.1 × 10^13.
4. Typical performance (MUMPS):
   - PC Linux, 1 core (P4, 2 GHz): 1.0 GFlop/s
   - Cray T3E (512 procs): speed-up 170, perf. 71 GFlop/s
   - AMD Opteron 8431, 24 cores @ 2.4 GHz: 50 GFlop/s (1 core: 7 GFlop/s)
Typical test problems:
BMW car body: 227,362 unknowns, 5,757,996 nonzeros (MSC.Software).
Size of factors: 51.1 million entries. Number of operations: 44.9 × 10^9.
Typical test problems:
BMW crankshaft: 148,770 unknowns, 5,396,386 nonzeros (MSC.Software).
Size of factors: 97.2 million entries. Number of operations: 127.9 × 10^9.
Sources of parallelism
Several levels of parallelism can be exploited:
- at problem level: the problem can be decomposed into sub-problems (e.g. domain decomposition)
- at matrix level: sparsity implies independence between calculations
- at submatrix level: within dense linear algebra computations (parallel BLAS, ...)
Data structure for sparse matrices
The storage scheme depends on the pattern of the matrix and on the type of access required:
- band or variable-band matrices
- block bordered or block tridiagonal matrices
- general matrices
- row, column or diagonal access
Data formats for a general sparse matrix A
What needs to be represented:
- Assembled matrices: M × N matrix A with NNZ nonzeros.
- Elemental matrices (unassembled): M × N matrix A with NELT elements.
- Arithmetic: real (4 or 8 bytes) or complex (8 or 16 bytes).
- Symmetric (or Hermitian): store only part of the data.
- Distributed format?
- Duplicate entries and/or out-of-range values?
Classical Data Formats for Assembled Matrices
Example of a 3 × 3 matrix with NNZ = 5 nonzeros:
$A = \begin{pmatrix} a_{11} & 0 & 0 \\ 0 & a_{22} & a_{23} \\ a_{31} & 0 & a_{33} \end{pmatrix}$
- Coordinate format:
    IRN [1 : NNZ] = 1 3 2 2 3
    JCN [1 : NNZ] = 1 1 2 3 3
    VAL [1 : NNZ] = a11 a31 a22 a23 a33
- Compressed Sparse Column (CSC) format:
    IRN [1 : NNZ] = 1 3 2 2 3
    VAL [1 : NNZ] = a11 a31 a22 a23 a33
    COLPTR [1 : N+1] = 1 3 4 6
  Column J is stored in IRN/VAL locations COLPTR(J) ... COLPTR(J+1)-1.
- Compressed Sparse Row (CSR) format: similar to CSC, but row by row.
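To make the role of COLPTR concrete, here is a sketch of a coordinate-to-CSC conversion by counting sort, applied to the example above (the values 11.0, 31.0, ... are hypothetical stand-ins for a11, a31, ...; entries may arrive in any order, and duplicates are assumed absent):

    program coo2csc
      implicit none
      integer, parameter :: n = 3, nnz = 5
      integer :: irn(nnz) = (/ 1, 3, 2, 2, 3 /)
      integer :: jcn(nnz) = (/ 1, 1, 2, 3, 3 /)
      double precision :: val(nnz) = (/ 11.d0, 31.d0, 22.d0, 23.d0, 33.d0 /)
      integer :: colptr(n+1), irn2(nnz), j, k, p
      double precision :: val2(nnz)
      colptr = 0
      do k = 1, nnz                      ! count the entries of each column
         colptr(jcn(k)+1) = colptr(jcn(k)+1) + 1
      end do
      colptr(1) = 1
      do j = 1, n                        ! prefix sums -> column pointers
         colptr(j+1) = colptr(j+1) + colptr(j)
      end do
      do k = 1, nnz                      ! scatter each entry into its column
         j = jcn(k)
         p = colptr(j)
         irn2(p) = irn(k)
         val2(p) = val(k)
         colptr(j) = p + 1
      end do
      do j = n, 1, -1                    ! undo the shift of the pointers
         colptr(j+1) = colptr(j)
      end do
      colptr(1) = 1
      print *, 'COLPTR =', colptr        ! expected: 1 3 4 6
      print *, 'IRN    =', irn2          ! expected: 1 3 2 2 3
    end program coo2csc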
Diagonal format (M = N), for the same 3 × 3 example:
NDIAG = 3, IDIAG = (-2, 0, 1)
    VAL = | na   a11  0   |
          | na   a22  a23 |
          | a31  a33  na  |
(na: not accessed)
VAL(i,j) corresponds to A(i, i+IDIAG(j)) (for 1 ≤ i + IDIAG(j) ≤ N).
Sparse matrix-vector products Y ← AX
The algorithm depends on the sparse matrix format:

Coordinate format:
    Y(1:M) = 0
    DO k = 1, NNZ
       Y(IRN(k)) = Y(IRN(k)) + VAL(k) * X(JCN(k))
    ENDDO

CSC format:
    Y(1:M) = 0
    DO J = 1, N
       Xj = X(J)
       DO k = COLPTR(J), COLPTR(J+1) - 1
          Y(IRN(k)) = Y(IRN(k)) + VAL(k) * Xj
       ENDDO
    ENDDO

CSR format:
    DO I = 1, M
       Yi = 0
       DO k = ROWPTR(I), ROWPTR(I+1) - 1
          Yi = Yi + VAL(k) * X(JCN(k))
       ENDDO
       Y(I) = Yi
    ENDDO

Diagonal format (VAL(i,j) corresponds to A(i, i+IDIAG(j))):
    Y(1:N) = 0
    DO j = 1, NDIAG
       DO i = max(1, 1-IDIAG(j)), min(N, N-IDIAG(j))
          Y(i) = Y(i) + VAL(i,j) * X(i+IDIAG(j))
       ENDDO
    ENDDO
Jagged diagonal storage (JDS)
For the same 3 × 3 example (nonzeros a11, a22, a23, a31, a33):
1. Shift all elements left (similar to CSR) and keep the column indices:
   a11 (1)
   a22 (2)  a23 (3)
   a31 (1)  a33 (3)
2. Sort the rows in decreasing order of their number of nonzeros:
   a22 (2)  a23 (3)
   a31 (1)  a33 (3)
   a11 (1)
3. Store the corresponding row permutation: PERM = 2 3 1
4. Store the jagged diagonals (the columns of step 2):
   VAL     = a22 a31 a11 a23 a33
   COL_IND = 2 1 1 3 3
   COL_PTR = 1 4 6
Pros: manipulates longer vectors than CSR (interesting on vector computers or GPUs).
Cons: an extra indirection due to the permutation array.
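A sketch of the corresponding matrix-vector product in JDS form, using the arrays built above (the values 22.0, 31.0, ... are hypothetical stand-ins for a22, a31, ...); note that the inner loop runs over a whole jagged diagonal, hence the long vectors:

    program jds_spmv
      implicit none
      integer, parameter :: n = 3, njd = 2   ! 2 jagged diagonals
      integer :: perm(n) = (/ 2, 3, 1 /)
      integer :: col_ind(5) = (/ 2, 1, 1, 3, 3 /)
      integer :: col_ptr(njd+1) = (/ 1, 4, 6 /)
      double precision :: val(5) = (/ 22.d0, 31.d0, 11.d0, 23.d0, 33.d0 /)
      double precision :: x(n) = (/ 1.d0, 1.d0, 1.d0 /), y(n)
      integer :: d, i, k
      y = 0.d0
      do d = 1, njd                          ! loop over jagged diagonals
         do i = 1, col_ptr(d+1) - col_ptr(d) ! long, regular inner loop
            k = col_ptr(d) + i - 1
            y(perm(i)) = y(perm(i)) + val(k) * x(col_ind(k))
         end do
      end do
      print *, y    ! expected: 11, 45, 64 (row sums of the example matrix)
    end program jds_spmv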
Example of elemental matrix format

$A = \begin{pmatrix} -1 & 2 & 3 & 0 & 0 \\ 2 & 1 & 1 & 0 & 0 \\ 1 & 1 & 3 & -1 & 3 \\ 0 & 0 & 1 & 2 & -1 \\ 0 & 0 & 3 & 2 & 1 \end{pmatrix} = A_1 + A_2$, with

$A_1$ (variables 1, 2, 3) $= \begin{pmatrix} -1 & 2 & 3 \\ 2 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}$,  $A_2$ (variables 3, 4, 5) $= \begin{pmatrix} 2 & -1 & 3 \\ 1 & 2 & -1 \\ 3 & 2 & 1 \end{pmatrix}$

N = 5, NELT = 2, NVAR = 6, $A = \sum_{i=1}^{NELT} A_i$

ELTPTR [1:NELT+1] = 1 4 7
ELTVAR [1:NVAR]   = 1 2 3 3 4 5
ELTVAL [1:NVAL]   = -1 2 1 2 1 1 3 1 1 2 1 3 -1 2 2 3 -1 1

Remarks:
- NVAR = ELTPTR(NELT+1) - 1
- NVAL = $\sum S_i^2$ (unsymmetric) or $\sum S_i(S_i+1)/2$ (symmetric), with $S_i$ = ELTPTR(i+1) - ELTPTR(i)
- the elements are stored in ELTVAL by columns
File storage: Rutherford-Boeing
- Standard ASCII format for files: header + data (CSC format).
- Key xyz:
  - x = [rcp] (real, complex, pattern)
  - y = [suhzr] (symmetric, unsymmetric, Hermitian, skew symmetric, rectangular)
  - z = [ae] (assembled, elemental)
  - e.g. M_T1.RSA, SHIP003.RSE
- Supplementary files: right-hand sides, solution, permutations, ...
- A canonical format was introduced to guarantee a unique representation (order of entries in each column, no duplicates).
DNV-Ex 1 : Tubular joint-1999-01-17     M_T1
 1733710    9758  492558 1231394       0
rsa        97578   97578 4925574       0
(10I8)          (10I8)          (3e26.16)
       1      49      96     142     187     231     274     346     417     487
     556     624     691     763     834     904     973    1041    1108    1180
    1251    1321    1390    1458    1525    1573    1620    1666    1711    1755
    1798    1870    1941    2011    2080    2148    2215    2287    2358    2428
    2497    2565    2632    2704    2775    2845    2914    2982    3049    3115
...
       1       2       3       4       5       6       7       8       9      10
      11      12      49      50      51      52      53      54      55      56
      57      58      59      60      67      68      69      70      71      72
     223     224     225     226     227     228     229     230     231     232
     233     234     433     434     435     436     437     438       2       3
       4       5       6       7       8       9      10      11      12      49
      50      51      52      53      54      55      56      57      58      59
...
 -0.2624989288237320E+10  0.6622960540857440E+09  0.2362753266740760E+11
  0.3372081648690030E+08 -0.4851430162799610E+08  0.1573652896140010E+08
  0.1704332388419270E+10 -0.7300763190874110E+09 -0.7113520995891850E+10
  0.1813048723097540E+08  0.2955124446119170E+07 -0.2606931100955540E+07
  0.1606040913919180E+07 -0.2377860366909130E+08 -0.1105180386670390E+09
  0.1610636280324100E+08  0.4230082475435230E+07 -0.1951280618776270E+07
  0.4498200951891750E+08  0.2066239484615530E+09  0.3792237438608430E+08
  0.9819999042370710E+08  0.3881169368090200E+08 -0.4624480572242580E+08
File storage: Matrix Market
Example
%%MatrixMarket matrix coordinate real general
% Comments
5 5 8
1 1 1.000e+00
2 2 1.050e+01
3 3 1.500e-02
1 4 6.000e+00
4 2 2.505e+02
4 4 -2.800e+02
4 5 3.332e+01
5 5 1.200e+01
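A minimal sketch of a reader for this format (assumptions: the data above is stored in a file named example.mtx, the field is 'coordinate real general', and the pattern/complex variants are not handled):

    program read_mm
      implicit none
      character(len=256) :: line
      integer :: m, n, nnz, k
      integer, allocatable :: irn(:), jcn(:)
      double precision, allocatable :: val(:)
      open(10, file='example.mtx', status='old')
      do                                 ! skip the banner and % comment lines
         read(10, '(a)') line
         if (line(1:1) /= '%') exit
      end do
      read(line, *) m, n, nnz            ! size line: M N NNZ
      allocate(irn(nnz), jcn(nnz), val(nnz))
      do k = 1, nnz                      ! one entry per line: i j a_ij
         read(10, *) irn(k), jcn(k), val(k)
      end do
      close(10)
      print *, 'read', nnz, 'entries of a', m, 'x', n, 'matrix'
    end program read_mm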
Examples of sparse matrix collections
- The University of Florida Sparse Matrix Collection: http://www.cise.ufl.edu/research/sparse/matrices/
- Matrix Market: http://math.nist.gov/MatrixMarket/
- Rutherford-Boeing: http://www.cerfacs.fr/algor/Softs/RB/index.html
- TLSE: http://gridtlse.org/
Gaussian elimination
$A = A^{(1)}$, $b = b^{(1)}$, $A^{(1)}x = b^{(1)}$:

$\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}$

Eliminate below the first pivot: row 2 ← row 2 − (a_{21}/a_{11}) × row 1, row 3 ← row 3 − (a_{31}/a_{11}) × row 1, giving $A^{(2)}x = b^{(2)}$:

$\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a^{(2)}_{22} & a^{(2)}_{23} \\ 0 & a^{(2)}_{32} & a^{(2)}_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b^{(2)}_2 \\ b^{(2)}_3 \end{pmatrix}$

with $b^{(2)}_2 = b_2 - a_{21}b_1/a_{11}, \ldots$ and $a^{(2)}_{32} = a_{32} - a_{31}a_{12}/a_{11}, \ldots$

Finally, $A^{(3)}x = b^{(3)}$:

$\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a^{(2)}_{22} & a^{(2)}_{23} \\ 0 & 0 & a^{(3)}_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b^{(2)}_2 \\ b^{(3)}_3 \end{pmatrix}$

with $a^{(3)}_{33} = a^{(2)}_{33} - a^{(2)}_{32}a^{(2)}_{23}/a^{(2)}_{22}, \ldots$

Typical Gaussian elimination step k:  $a^{(k+1)}_{ij} = a^{(k)}_{ij} - \dfrac{a^{(k)}_{ik}\, a^{(k)}_{kj}}{a^{(k)}_{kk}}$
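The elimination step above, written out for the dense case (a sketch assuming no pivoting is needed, i.e. all pivots a_kk stay nonzero); it solves the 3 × 3 example used earlier, back substitution included:

    program gauss_elim
      implicit none
      integer, parameter :: n = 3
      double precision :: a(n,n), b(n), x(n), lik
      integer :: i, k
      a = reshape((/3.d0,0.d0,2.d0, 2.d0,2.d0,0.d0, 0.d0,-5.d0,3.d0/),(/n,n/))
      b = (/ 5.d0, 1.d0, 0.d0 /)
      do k = 1, n-1                  ! step k: A^(k) -> A^(k+1)
         do i = k+1, n
            lik = a(i,k) / a(k,k)    ! multiplier l_ik
            a(i,k+1:n) = a(i,k+1:n) - lik * a(k,k+1:n)
            b(i) = b(i) - lik * b(k)
         end do
      end do
      do i = n, 1, -1                ! back substitution on the U part
         x(i) = (b(i) - dot_product(a(i,i+1:n), x(i+1:n))) / a(i,i)
      end do
      print *, x                     ! solution of the 3x3 example
    end program gauss_elim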
Relation with A = LU factorization
One step of Gaussian elimination can be written $A^{(k+1)} = L^{(k)} A^{(k)}$, with

$L^{(k)} = \begin{pmatrix} 1 & & & \\ & \ddots & & \\ & -l_{k+1,k} & \ddots & \\ & -l_{n,k} & & 1 \end{pmatrix}$ and $l_{ik} = \dfrac{a^{(k)}_{ik}}{a^{(k)}_{kk}}$.

Then $A^{(n)} = U = L^{(n-1)} \cdots L^{(1)} A$, which gives $A = LU$,

with $L = [L^{(1)}]^{-1} \cdots [L^{(n-1)}]^{-1} = \begin{pmatrix} 1 & & 0 \\ & \ddots & \\ l_{i,j} & & 1 \end{pmatrix}$.

In dense codes, entries of L and U overwrite entries of A.

Furthermore, if A is symmetric, $A = LDL^T$ with $d_{kk} = a^{(k)}_{kk}$: $A = LU = A^t = U^t L^t$ implies $U(L^t)^{-1} = L^{-1}U^t = D$ diagonal, and $U = DL^t$, thus $A = L(DL^t) = LDL^t$.
Gaussian elimination and sparsity
Step k of the LU factorization ($a_{kk}$ pivot):
- For $i > k$, compute $l_{ik} = a_{ik}/a_{kk}$ ($= a'_{ik}$).
- For $i > k$, $j > k$:
  $a'_{ij} = a_{ij} - \dfrac{a_{ik}\, a_{kj}}{a_{kk}}$, i.e. $a'_{ij} = a_{ij} - l_{ik}\, a_{kj}$
- If $a_{ik} \neq 0$ and $a_{kj} \neq 0$, then in general $a'_{ij} \neq 0$.
- If $a_{ij}$ was zero, its new nonzero value must be stored: fill-in.
[Pictures: at step k, nonzeros in positions (i,k) and (k,j) with pivot (k,k) turn a zero in position (i,j) into a fill-in.]
The same holds for Cholesky:
- For $i > k$, compute $l_{ik} = a_{ik}/\sqrt{a_{kk}}$ ($= a'_{ik}$).
- For $i > k$, $j > k$, $j \le i$ (lower triangular part):
  $a'_{ij} = a_{ij} - \dfrac{a_{ik}\, a_{jk}}{a_{kk}}$, i.e. $a'_{ij} = a_{ij} - l_{ik}\, l_{jk}$
Example
Original matrix (arrowhead: dense first row and column, diagonal elsewhere):
x x x x x
x x
x   x
x     x
x       x
The matrix is full after the first step of elimination.
After reordering the matrix (1st row and column → last row and column):
x       x
  x     x
    x   x
      x x
x x x x x
No fill-in.
⇒ Ordering the variables has a strong impact on the fill-in and on the number of operations.
NP-hard problem in general (Yannakakis, 1981).
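The effect of the two orderings can be checked with a symbolic elimination that propagates only the nonzero pattern (a sketch; it assumes no numerical cancellation):

    program fillin_demo
      implicit none
      logical :: p(5,5)
      integer :: i
      p = .false.                      ! arrowhead: dense first row/column
      do i = 1, 5
         p(i,i) = .true.; p(1,i) = .true.; p(i,1) = .true.
      end do
      print *, 'dense row/col first:', count_fill(p), 'fill-ins'  ! 12 (full)
      p = .false.                      ! reversed: dense last row/column
      do i = 1, 5
         p(i,i) = .true.; p(5,i) = .true.; p(i,5) = .true.
      end do
      print *, 'dense row/col last :', count_fill(p), 'fill-ins'  ! 0
    contains
      integer function count_fill(pat) result(fill)
        logical, intent(inout) :: pat(5,5)
        integer :: i, j, k
        fill = 0
        do k = 1, 4                    ! symbolic Gaussian elimination
           do i = k+1, 5
              do j = k+1, 5
                 if (pat(i,k) .and. pat(k,j) .and. .not. pat(i,j)) then
                    pat(i,j) = .true.  ! new nonzero: fill-in
                    fill = fill + 1
                 end if
              end do
           end do
        end do
      end function count_fill
    end program fillin_demo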
Illustration: Reverse Cuthill-McKee on matrix dwt 592.rua
Harwell-Boeing matrix dwt 592.rua, structural computing on a submarine. NZ(LU factors) = 58202.
Original matrix and factorized matrix:
[Spy plots: original matrix, nz = 5104; LU factors, nz = 58202]
After Reverse Cuthill-McKee reordering: NZ(LU factors) = 16924.
Permuted matrix (RCM) and factorized permuted matrix:
[Spy plots: permuted matrix, nz = 5104; LU factors, nz = 16924]
Table: Benefits of sparsity on a matrix of order 2021 × 2021 with 7353 nonzeros (Dongarra et al. 91).

Procedure                       Total storage   Flops          Time (sec.) on CRAY J90
Full system                     4084 Kwords     5503 x 10^6    34.5
Sparse system                   71 Kwords       1073 x 10^3    3.4
Sparse system and reordering    14 Kwords       42 x 10^3      0.9
Control of numerical stability: numerical pivoting
- In dense linear algebra, partial pivoting is commonly used (at each step, the largest entry in the column is selected).
- In sparse linear algebra, some flexibility is kept in order to preserve sparsity:
  - Partial threshold pivoting: eligible pivots are not too small with respect to the maximum in the column.
    Set of eligible pivots $= \{\, r : |a^{(k)}_{rk}| \ge u \cdot \max_i |a^{(k)}_{ik}| \,\}$, where $0 < u \le 1$.
  - Then, among the eligible pivots, select one that better preserves sparsity.
  - u is called the threshold parameter (u = 1 → partial pivoting).
  - It restricts the maximum possible growth of $a_{ij} = a_{ij} - \dfrac{a_{ik}\,a_{kj}}{a_{kk}}$ to $1 + \dfrac{1}{u}$, which is sufficient to preserve numerical stability.
  - $u \approx 0.1$ is often chosen in practice.
- For symmetric indefinite problems, 2-by-2 pivots (with threshold) are also used, to preserve symmetry and sparsity.
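A sketch of the eligibility test on one (hypothetical) column of the reduced matrix, with u = 0.1:

    program threshold_pivot
      implicit none
      double precision, parameter :: u = 0.1d0
      double precision :: col(4) = (/ 0.01d0, -0.5d0, 2.0d0, 0.3d0 /)
      double precision :: amax
      integer :: r
      amax = maxval(abs(col))          ! largest magnitude in the column
      do r = 1, size(col)              ! rows 2, 3 and 4 pass the test here
         if (abs(col(r)) >= u * amax) print *, 'row', r, 'eligible:', col(r)
      end do
    end program threshold_pivot

A sparse code would then pick, among the eligible rows, one with few nonzeros (to limit fill-in), rather than simply the largest entry.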
Threshold pivoting and numerical accuracy
Table: Effect of variation in the threshold parameter u on a 541 × 541 matrix with 4285 nonzeros (Dongarra et al. 91).

u        Nonzeros in LU factors   Error
1.0          16767                3 x 10^-9
0.25         14249                6 x 10^-10
0.1          13660                4 x 10^-9
0.01         15045                1 x 10^-5
10^-4        16198                1 x 10^2
10^-10       16553                3 x 10^23

Difficulty: numerical pivoting implies dynamic data structures that cannot be forecast symbolically.
Three-phase scheme to solve Ax = b
1. Analysis step:
   - preprocessing of A (symmetric/unsymmetric orderings, scalings)
   - build the dependency graph (elimination tree, eDAG, ...)
2. Factorization (A = LU, LDL^T, LL^T, QR), with numerical pivoting.
3. Solution based on the factored matrices:
   - triangular solves: Ly = b, then Ux = y
   - improvement of the solution (iterative refinement), error analysis
Efficient implementation of sparse algorithms
Indirect addressing is often used in sparse calculations, e.g. sparse SAXPY:

    do i = 1, m
       A( ind(i) ) = A( ind(i) ) + alpha * w( i )
    enddo

- Even if manufacturers provide hardware support to improve indirect addressing, it penalizes performance.
⇒ Identify dense blocks, or switch to dense calculations as soon as the matrix is not sparse enough.
Effect of switch to dense calculations
Matrix from a 5-point discretization of the Laplacian on a 50 × 50 grid (Dongarra et al. 91):

Density for switch to full code   Order of full submatrix   Millions of flops   Time (seconds)
No switch                         0                         7                   21.8
1.00                              74                        7                   21.4
0.80                              190                       8                   15.0
0.60                              235                       11                  12.5
0.40                              305                       21                  9.0
0.20                              422                       50                  5.5
0.10                              531                       100                 3.7
0.005                             1420                      1908                6.1

⇒ The sparse structure should be exploited if density < 10%.
Main processor (r)evolutions
pipelined functional units
superscalar processors
out-of-order execution (ILP)
larger caches
evolution of instruction set (CISC, RISC, EPIC, . . . )
multicores
Why parallel processing?
- Computing needs remain unmet in many disciplines (to solve significant problems).
- Uniprocessor performance is approaching the physical limits: a cycle time of 0.5 nanosecond ↔ 8 GFlop/s (with 4 floating-point operations per cycle).
- A 40 TFlop/s computer → 5000 cores → massively parallel computers.
- Not because it is the simplest way, but because it is necessary.
- Current power (June 2010, cf. http://www.top500.org): Cray XT5, Oak Ridge National Lab: 1.7 PFlop/s, 300 TBytes of memory, 224,256 cores.
Some units for high-performance computing
Speed:
1 MFlop/s  =  1 Megaflop/s  =  10^6 operations / second
1 GFlop/s  =  1 Gigaflop/s  =  10^9 operations / second
1 TFlop/s  =  1 Teraflop/s  =  10^12 operations / second
1 PFlop/s  =  1 Petaflop/s  =  10^15 operations / second
1 EFlop/s  =  1 Exaflop/s   =  10^18 operations / second

Memory:
1 kB / 1 ko  =  1 kilobyte  =  10^3 bytes
1 MB / 1 Mo  =  1 Megabyte  =  10^6 bytes
1 GB / 1 Go  =  1 Gigabyte  =  10^9 bytes
1 TB / 1 To  =  1 Terabyte  =  10^12 bytes
1 PB / 1 Po  =  1 Petabyte  =  10^15 bytes
Performance measures
- Number of floating-point operations per second (not MIPS).
- Peak performance:
  - what appears in the vendors' advertisements
  - assumes that all processing units are active
  - you are guaranteed not to go faster:
    peak performance = (# functional units) / (cycle time in sec.)
- Real performance:
  - usually much lower than the peak (unfortunately)
The ratio (real performance / peak performance) is often low!
Let P be a program:
1. Sequential processor: 1 scalar unit (1 GFlop/s). Execution time of P: 100 s.
2. Parallel machine with 100 processors: each processor at 1 GFlop/s, peak performance 100 GFlop/s.
3. If P = sequential code (10%) + parallelized code (90%): execution time of P: 0.9 + 10 = 10.9 s; real performance: 9.2 GFlop/s.
4. Real performance / peak performance = 0.1.
Moore's law
- Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of integrated circuits would double every 24 months.
- It also served as a goal for manufacturers.
- It got distorted: 24 → 18 months; number of transistors → performance.
How can computing speed be increased?
- Increase the frequency with faster technologies. We are approaching the limits:
  - chip design
  - power consumption and heat dissipation
  - cooling → space problems
- Further miniaturization is possible, but not indefinitely:
  - the resistance of conductors (R = ρl/s) increases, and resistance is responsible for energy dissipation (Joule effect)
  - capacitance effects are hard to control
- Remark: 0.5 nanosecond = the time for a signal to travel along 15 cm of cable.
- Cycle time of 0.5 nanosecond ↔ 8 GFlop/s (with 4 floating-point operations per cycle).
The only solution: parallelism
- Parallelism: simultaneous execution of several instructions within a program.
- Within a core:
  - micro-instructions
  - pipelined processing
  - overlap of instructions executed by distinct computation units
  → transparent for the programmer (handled by the compiler or at run time)
- Between distinct processors or cores:
  - different instruction streams are executed
  - synchronizations: implicit (compiler, automatic parallelization, use of parallel libraries) or explicit (message passing, multithreaded programming, critical sections)
The data access problem
- In practice, one often runs at 10% of the peak performance.
- Faster processors → faster data access is needed: processor organization, memory organization, inter-processor communication.
- More complex hardware: pipelines, technology, network, ...
- More complex software: compilers, operating systems, programming languages, parallelism management, debugging tools, ... and the applications themselves.
⇒ It becomes harder and harder to program efficiently.
Memory bandwidth problems
- Data access is a crucial problem in modern computers:
  - processor performance: +60% per year
  - DRAM memory: +9% per year
  → the ratio processor performance / memory access time increases by about 50% per year! MFlop/s are easier to obtain than MB/s of memory bandwidth.
- The memory hierarchy is more and more complex (and latency increases); the way data is accessed becomes more and more critical:
  - minimize cache misses
  - minimize memory paging
  - locality: improve the ratio of references to local memory over references to remote memory
  - reuse, blocking: increase the flops / memory-access ratio
  - manage data transfers by hand? (Cell, GPU)
Figure: Example of a memory hierarchy (average access time in cycles, hit/miss, and size):

Registers        < 1
Cache level 1    1-2 / 8-66            1-128 KB
Cache level 2    6-15 / 30-200         256 KB-16 MB
Main memory      10-100                1-10 GB
Remote memory    500-5000
Disks            700,000 / 6,000,000
Memory design for a large number of processors?
How can 500 processors access data stored in a shared memory (technology, interconnection, price)?
→ Solution at a reasonable cost: physically distributed memory (each processor or group of processors has its own local memory).
Two solutions:
- globally addressable local memories: virtual shared memory computers
- explicit data transfers between processors through message exchanges
Scalability requires:
- a linear increase of memory bandwidth with processor speed
- an increase of communication bandwidth with the number of processors
⇒ Cost/performance: distributed memory, and a good cost/performance ratio on the processors.
Architecture of multiprocessors
Large number of processors → physically distributed memory.

Table: Organization of the processors

Logical org. \ Physical org.  |  Shared (32 procs max)          |  Distributed
Shared                        |  shared-memory multiprocessors  |  global address space (hard/soft) on top of messages; virtual shared memory
Distributed                   |  emulation of message passing (buffers)  |  message passing
Terminology
SMP (Symmetric MultiProcessor) architecture:
- memory shared both physically and logically
- uniform memory access time
- similar, from the application point of view, to multicore architectures (1 core = 1 logical processor)
- but communications are much faster within multicores (latency < 3 ns, bandwidth > 20 GB/s) than within SMPs (latency ≈ 60 ns, bandwidth ≈ 2 GB/s)

NUMA (Non Uniform Memory Access) architecture:
- memory physically distributed and logically shared
- easier to increase the number of procs than with SMP
- access time depends on the location of the data: local accesses are faster than remote accesses
- hardware ensures cache coherence (ccNUMA)
Programming
Programming standards:
- logically shared organization: POSIX threads, OpenMP directives
- logically distributed organization: PVM, MPI, sockets (message passing)
- hybrid programming: MPI + OpenMP
- machines with 1 million cores? emerging GPGPU-type architectures? No standard yet!
Evolution of high-performance computing
- Rapid evolution of high-performance architectures (SMP, clusters, NUMA, multicores, Cell, GPUs, ...)
- Parallelism at several levels
- Increasingly complex memory hierarchies
⇒ Programming is more and more difficult, with software tools and standards that always lag behind.
Numerical simulation and sparse matrices
General approach in scientific computing:
1. Simulation problem (continuous problem)
2. Application of physical laws (partial differential equations)
3. Discretization, formulation of the equations in finite dimension
4. Solution of linear systems (Ax = b)
5. (Study of the results; possible revision of the model or of the method)

Solution of linear systems = a fundamental algorithmic kernel. Parameters to take into account:
- properties of the system (symmetry, positive definiteness, conditioning, over-determined, ...)
- structure: dense or sparse
- size: several million equations?
Partial differential equations
Modelling of a physical phenomenon:
- differential equations involving forces, moments, temperatures, velocities, energies, time
- analytical solutions are rarely available
Examples of partial differential equations
- Find the electric potential φ for a given charge distribution f: $\nabla^2 \varphi = f$, i.e.
  $\frac{\partial^2 \varphi}{\partial x^2}(x,y,z) + \frac{\partial^2 \varphi}{\partial y^2}(x,y,z) + \frac{\partial^2 \varphi}{\partial z^2}(x,y,z) = f(x,y,z)$
- Heat equation (Fourier equation):
  $\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} = \frac{1}{\kappa}\frac{\partial u}{\partial t}$
  with u = u(x,y,z,t) the temperature and κ the thermal diffusivity of the medium.
- Wave propagation equations, Schrödinger equation, Navier-Stokes, ...
Discretization (the step that follows the physical modelling)
The numerical analyst's job:
- building a mesh (regular, irregular)
- choosing the solution methods and studying their behaviour
- studying the loss of information due to the passage to finite dimension
Main discretization techniques:
- finite differences
- finite elements
- finite volumes
Discretization with finite differences (1D)
Basic approximation (ok if h is small enough):
$\frac{du}{dx}(x) \approx \frac{u(x+h) - u(x)}{h}$
Results from Taylor's formula:
$u(x+h) = u(x) + h\frac{du}{dx} + \frac{h^2}{2}\frac{d^2u}{dx^2} + \frac{h^3}{6}\frac{d^3u}{dx^3} + O(h^4)$
Replacing h by −h:
$u(x-h) = u(x) - h\frac{du}{dx} + \frac{h^2}{2}\frac{d^2u}{dx^2} - \frac{h^3}{6}\frac{d^3u}{dx^3} + O(h^4)$
Thus:
$\frac{d^2u}{dx^2} = \frac{u(x+h) - 2u(x) + u(x-h)}{h^2} + O(h^2)$
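The O(h²) behaviour is easy to check numerically; a small sketch using u(x) = sin(x) at x = 1 (so that the exact second derivative is −sin(1)):

    program fd_check
      implicit none
      double precision :: h, x, d2u, err
      integer :: k
      x = 1.d0
      h = 0.1d0
      do k = 1, 4
         d2u = (sin(x+h) - 2.d0*sin(x) + sin(x-h)) / h**2
         err = abs(d2u + sin(x))     ! exact second derivative is -sin(1)
         print '(a,es9.2,a,es9.2)', ' h =', h, '   error =', err
         h = h / 2.d0                ! the error should shrink by about 4
      end do
    end program fd_check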
Discretization with finite differences (1D)
$\frac{d^2u}{dx^2} = \frac{u(x+h) - 2u(x) + u(x-h)}{h^2} + O(h^2)$
3-point stencil for the centered difference approximation to the second-order derivative: (1  −2  1)
Finite differences for the Laplacian operator (2D)
Assuming the same mesh refinement h in the x and y directions:
$\Delta u(x,y) \approx \frac{u(x-h,y) - 2u(x,y) + u(x+h,y)}{h^2} + \frac{u(x,y-h) - 2u(x,y) + u(x,y+h)}{h^2}$
$\Delta u(x,y) \approx \frac{1}{h^2}\left( u(x-h,y) + u(x+h,y) + u(x,y-h) + u(x,y+h) - 4u(x,y) \right)$
5-point stencils for the centered difference approximation to the Laplacian operator: standard (left) and skewed (right).
[Pictures: four entries 1 around a central −4, axis-aligned (left) and diagonal (right)]
27-point stencil used for 3D geophysical applications (collaboration with Geoscience Azur, S. Operto and J. Virieux).
1D example
Consider the problem:
$-u''(x) = f(x)$ for $x \in (0,1)$, with $u(0) = u(1) = 0$
$x_i = ih$, $i = 0, \ldots, n+1$, $f(x_i) = f_i$, $u(x_i) = u_i$, $h = 1/(n+1)$
Centered difference approximation:
$-u_{i-1} + 2u_i - u_{i+1} = h^2 f_i$ (with $u_0 = u_{n+1} = 0$)
We obtain a linear system $Au = f$, or, for n = 6:
$\frac{1}{h^2}\begin{pmatrix} 2 & -1 & 0 & 0 & 0 & 0 \\ -1 & 2 & -1 & 0 & 0 & 0 \\ 0 & -1 & 2 & -1 & 0 & 0 \\ 0 & 0 & -1 & 2 & -1 & 0 \\ 0 & 0 & 0 & -1 & 2 & -1 \\ 0 & 0 & 0 & 0 & -1 & 2 \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \\ u_5 \\ u_6 \end{pmatrix} = \begin{pmatrix} f_1 \\ f_2 \\ f_3 \\ f_4 \\ f_5 \\ f_6 \end{pmatrix}$
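A sketch that assembles and solves this system for f = 1 (the tridiagonal solver below is the classical Thomas algorithm; the exact solution is u(x) = x(1−x)/2):

    program poisson1d
      implicit none
      integer, parameter :: n = 6
      double precision :: dl(n), d(n), du(n), f(n), u(n), h, w
      integer :: i
      h = 1.d0 / (n + 1)
      d  =  2.d0 / h**2              ! diagonal of A
      dl = -1.d0 / h**2              ! sub-diagonal
      du = -1.d0 / h**2              ! super-diagonal
      f  =  1.d0                     ! right-hand side f_i = 1
      do i = 2, n                    ! forward elimination
         w = dl(i) / d(i-1)
         d(i) = d(i) - w * du(i-1)
         f(i) = f(i) - w * f(i-1)
      end do
      u(n) = f(n) / d(n)             ! back substitution
      do i = n-1, 1, -1
         u(i) = (f(i) - du(i) * u(i+1)) / d(i)
      end do
      print *, u                     ! compare with x_i (1 - x_i) / 2
    end program poisson1d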
Slightly more complicated (2D)
Consider an elliptic PDE:
$-\frac{\partial}{\partial x}\left(a(x,y)\frac{\partial u}{\partial x}\right) - \frac{\partial}{\partial y}\left(b(x,y)\frac{\partial u}{\partial y}\right) + c(x,y)\,u = g(x,y)$ on $\Omega$,
$u(x,y) = 0$ on $\partial\Omega$,
with $0 \le x, y \le 1$, $a(x,y) > 0$, $b(x,y) > 0$, $c(x,y) \ge 0$.
Case of a regular 2D mesh:
[Picture: regular grid with n = 4 interior points per direction, unknowns numbered row by row]
discretization step: h = 1/(n+1), n = 4
5-point finite difference scheme:
$\left(\frac{\partial}{\partial x}\left(a(x,y)\frac{\partial u}{\partial x}\right)\right)_{ij} = \frac{a_{i+1/2,j}\,(u_{i+1,j} - u_{i,j})}{h^2} - \frac{a_{i-1/2,j}\,(u_{i,j} - u_{i-1,j})}{h^2} + O(h^2)$
Similarly:
$\left(\frac{\partial}{\partial y}\left(b(x,y)\frac{\partial u}{\partial y}\right)\right)_{ij} = \frac{b_{i,j+1/2}\,(u_{i,j+1} - u_{i,j})}{h^2} - \frac{b_{i,j-1/2}\,(u_{i,j} - u_{i,j-1})}{h^2} + O(h^2)$
with $a_{i+1/2,j}$, $b_{i,j+1/2}$, $c_{ij}$, ... known.
With the ordering of unknowns of the example, we obtain a linear system of the form Ax = b, where
$x_1 \leftrightarrow u_{1,1} = u(\tfrac{1}{n+1}, \tfrac{1}{n+1})$, $x_2 \leftrightarrow u_{2,1} = u(\tfrac{2}{n+1}, \tfrac{1}{n+1})$, $x_3 \leftrightarrow u_{3,1}$, $x_4 \leftrightarrow u_{4,1}$, $x_5 \leftrightarrow u_{1,2}$, ...
and A is n² × n², b is of size n², with the following structure:
    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
  |x x     x                     |  1   |g11|
  |x x x     x                   |  2   |g21|
  |  x x x     x                 |  3   |g31|
  |    x x 0    x                |  4   |g41|
  |x 0     x x     x             |  5   |g12|
  |  x     x x x     x           |  6   |g22|
  |    x     x x x     x         |  7   |g32|
A=|      x     x x 0     x       |  8  b=|g42|
  |        x 0     x x     x     |  9   |g13|
  |          x     x x x     x   | 10   |g23|
  |            x     x x x     x | 11   |g33|
  |              x     x x 0   x | 12   |g43|
  |                x 0     x x   | 13   |g14|
  |                  x     x x x | 14   |g24|
  |                    x   x x x | 15   |g34|
  |                      x     x x| 16   |g44|
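A sketch generating this matrix in coordinate format (assumptions: constant coefficients a = b = 1 and c = 0, so the stencil reduces to the classical (−1, −1, 4, −1, −1), with the row-by-row ordering above):

    program laplace2d_coo
      implicit none
      integer, parameter :: n = 4
      integer :: irn(5*n*n), jcn(5*n*n)
      double precision :: val(5*n*n)
      integer :: i, j, row, nnz
      nnz = 0
      do j = 1, n
         do i = 1, n
            row = i + (j-1)*n                       ! unknown u_{i,j}
            call add(row, row, 4.d0)                ! diagonal
            if (i > 1) call add(row, row-1, -1.d0)  ! west neighbour
            if (i < n) call add(row, row+1, -1.d0)  ! east neighbour
            if (j > 1) call add(row, row-n, -1.d0)  ! south neighbour
            if (j < n) call add(row, row+n, -1.d0)  ! north neighbour
         end do
      end do
      print *, 'unknowns:', n*n, '  nonzeros:', nnz ! 16 unknowns, 64 nonzeros
    contains
      subroutine add(r, c, v)
        integer, intent(in) :: r, c
        double precision, intent(in) :: v
        nnz = nnz + 1
        irn(nnz) = r; jcn(nnz) = c; val(nnz) = v
      end subroutine add
    end program laplace2d_coo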
Solution of the linear system
- Often the most costly part of a numerical simulation code.
- Direct methods: LU factorization followed by triangular substitutions; parallelism depends highly on the structure of the matrix.
- Iterative methods: usually rely on sparse matrix-vector products; an algebraic preconditioner is useful.
Evolution in time of a complex phenomenon
Examples: climate modelling, evolution of radioactive waste, ...
Heat equation:
$\Delta u(x,y,z,t) = \frac{\partial u(x,y,z,t)}{\partial t}$, with $u(x,y,z,t_0) = u_0(x,y,z)$
Discretization in both space and time (1D case):
- Explicit approaches: $\dfrac{u_j^{n+1} - u_j^n}{t_{n+1} - t_n} = \dfrac{u_{j+1}^n - 2u_j^n + u_{j-1}^n}{h^2}$
- Implicit approaches: $\dfrac{u_j^{n+1} - u_j^n}{t_{n+1} - t_n} = \dfrac{u_{j+1}^{n+1} - 2u_j^{n+1} + u_{j-1}^{n+1}}{h^2}$
Implicit approaches are preferred (more stable, larger timesteps possible) but are more numerically intensive: a sparse linear system must be solved at each iteration.
Discretization with finite elements
Consider a partial differential equation of the form (Poisson equation):
$-\Delta u = -\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) = f$ in $\Omega$, with $u = 0$ on $\partial\Omega$.
Using Green's formula, one can show that this problem is equivalent to:
$a(u,v) = \int_\Omega f\,v \; dx\,dy$ for all $v$ such that $v = 0$ on $\partial\Omega$,
where $a(u,v) = \int_\Omega \left(\frac{\partial u}{\partial x}\frac{\partial v}{\partial x} + \frac{\partial u}{\partial y}\frac{\partial v}{\partial y}\right) dx\,dy$.
Finite element scheme: 1D Poisson equation
$-u'' = -\frac{\partial^2 u}{\partial x^2} = f$, with $u = 0$ on $\partial\Omega$, is equivalent to:
$a(u,v) = g(v)$ for all $v$ (with $v|_{\partial\Omega} = 0$),
where $a(u,v) = \int \frac{\partial u}{\partial x}\frac{\partial v}{\partial x}\,dx$ and $g(v) = \int f(x)\,v(x)\,dx$
(in 1D: similar to integration by parts).
Idea: we search for u of the form $u = \sum_k \alpha_k \Phi_k(x)$, where $(\Phi_k)_{k=1,n}$ is a basis of functions such that each $\Phi_k$ is linear on every element $E_i$ and $\Phi_k(x_i) = \delta_{ik}$ (= 1 if k = i, 0 otherwise).
[Picture: piecewise-linear hat functions $\Phi_{k-1}$, $\Phi_k$, $\Phi_{k+1}$ around node $x_k$, with elements $E_k$ and $E_{k+1}$]
Finite element scheme: 1D Poisson equation
We rewrite $a(u,v) = g(v)$ for all $\Phi_k$:
$a(u,\Phi_k) = g(\Phi_k)$ for all $k$  ⟺  $\sum_i \alpha_i\, a(\Phi_i,\Phi_k) = g(\Phi_k)$
$a(\Phi_i,\Phi_k) = \int \frac{\partial \Phi_i}{\partial x}\frac{\partial \Phi_k}{\partial x}\,dx = 0$ when $|i-k| \ge 2$
k-th equation, associated with $\Phi_k$:
$\alpha_{k-1}\, a(\Phi_{k-1},\Phi_k) + \alpha_k\, a(\Phi_k,\Phi_k) + \alpha_{k+1}\, a(\Phi_{k+1},\Phi_k) = g(\Phi_k)$
with
$a(\Phi_{k-1},\Phi_k) = \int_{E_k} \frac{\partial \Phi_{k-1}}{\partial x}\frac{\partial \Phi_k}{\partial x}$,  $a(\Phi_{k+1},\Phi_k) = \int_{E_{k+1}} \frac{\partial \Phi_{k+1}}{\partial x}\frac{\partial \Phi_k}{\partial x}$,  $a(\Phi_k,\Phi_k) = \int_{E_k} \frac{\partial \Phi_k}{\partial x}\frac{\partial \Phi_k}{\partial x} + \int_{E_{k+1}} \frac{\partial \Phi_k}{\partial x}\frac{\partial \Phi_k}{\partial x}$
Finite element scheme: 1D Poisson equation
From the point of view of element $E_k$, we have a 2 × 2 contribution matrix:
$\begin{pmatrix} \int_{E_k} \frac{\partial \Phi_{k-1}}{\partial x}\frac{\partial \Phi_{k-1}}{\partial x} & \int_{E_k} \frac{\partial \Phi_{k-1}}{\partial x}\frac{\partial \Phi_k}{\partial x} \\ \int_{E_k} \frac{\partial \Phi_{k-1}}{\partial x}\frac{\partial \Phi_k}{\partial x} & \int_{E_k} \frac{\partial \Phi_k}{\partial x}\frac{\partial \Phi_k}{\partial x} \end{pmatrix} = \begin{pmatrix} I_{E_k}(k{-}1,k{-}1) & I_{E_k}(k{-}1,k) \\ I_{E_k}(k,k{-}1) & I_{E_k}(k,k) \end{pmatrix}$
[Picture: nodes 0, 1, 2, 3, 4 and elements $E_1, E_2, E_3, E_4$]
Assembling over the elements gives, for the interior nodes 1, 2, 3:
$\begin{pmatrix} I_{E_1}(1,1) + I_{E_2}(1,1) & I_{E_2}(1,2) & \\ I_{E_2}(2,1) & I_{E_2}(2,2) + I_{E_3}(2,2) & I_{E_3}(2,3) \\ & I_{E_3}(3,2) & I_{E_3}(3,3) + I_{E_4}(3,3) \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{pmatrix} = \begin{pmatrix} g(\Phi_1) \\ g(\Phi_2) \\ g(\Phi_3) \end{pmatrix}$
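A sketch of the assembly loop in 1D (assumptions: uniform mesh on (0,1) with NEL = 4 linear elements and f = 1, so each element contribution matrix is (1/h) [[1, −1], [−1, 1]]):

    program fem1d_assembly
      implicit none
      integer, parameter :: nel = 4               ! elements E1..E4, nodes 0..4
      double precision :: K(0:nel,0:nel), g(0:nel), ke(2,2), h
      integer :: e, a, b, ia, ib
      h = 1.d0 / nel
      ke = reshape((/1.d0,-1.d0, -1.d0,1.d0/), (/2,2/)) / h  ! element matrix
      K = 0.d0
      g = 0.d0
      do e = 1, nel                  ! element E_e couples nodes e-1 and e
         do a = 1, 2
            ia = e - 2 + a
            g(ia) = g(ia) + h / 2.d0            ! g(Phi) for f = 1
            do b = 1, 2
               ib = e - 2 + b
               K(ia,ib) = K(ia,ib) + ke(a,b)    ! scatter-add I_Ee(ia,ib)
            end do
         end do
      end do
      ! rows/columns 0 and nel would now be dropped (u = 0 on the boundary)
      print '(5f8.2)', (K(a,0:nel), a = 0, nel)
    end program fem1d_assembly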
Finite element scheme in higher dimension
- Can be used in higher dimensions
- The mesh can be irregular
- $\Phi_i$ can be a higher-degree polynomial
- The matrix pattern depends on the mesh connectivity/ordering
Finite element scheme in higher dimension
Set of elements (tetrahedra, triangles) to assemble; for a triangle T with vertices i, j, k:
$C(T) = \begin{pmatrix} a^T_{i,i} & a^T_{i,j} & a^T_{i,k} \\ a^T_{j,i} & a^T_{j,j} & a^T_{j,k} \\ a^T_{k,i} & a^T_{k,j} & a^T_{k,k} \end{pmatrix}$
Needs for the parallel case:
- assemble the sparse matrix $A = \sum_k C(T_k)$: graph colouring algorithms
- parallelization domain by domain: graph partitioning
- solution of Ax = b: high-performance matrix computation kernels
Other example: linear least squares
- mathematical model + approximate measurements ⇒ estimate the parameters of the model
- m experiments + n parameters $x_i$: $\min \|Ax - b\|$, with
  - $A \in \mathbb{R}^{m \times n}$, $m \ge n$: data matrix
  - $b \in \mathbb{R}^m$: vector of observations
  - $x \in \mathbb{R}^n$: parameters of the model
- Solving the problem: decompose A in the form A = QR, with Q orthogonal and R triangular; then
  $\|Ax - b\| = \|Q^T Ax - Q^T b\| = \|Q^T QRx - Q^T b\| = \|Rx - Q^T b\|$
- Problems can be large (meteorological data, ...), sparse or not.
Solution of sparse linear systems Ax = b: direct or iterative approaches?

Direct methods:
- very general/robust
- numerical accuracy
- handle irregular/unstructured problems
- factorization of the matrix A may be costly (flops/memory)
- factors can be reused for multiple right-hand sides b
- computing issues: good granularity of computations; several levels of parallelism can be exploited

Iterative methods:
- efficiency depends on: convergence and preconditioning; numerical properties / structure of A
- rely on an efficient matrix-vector product
- memory effective
- successive right-hand sides are problematic
- computing issues: smaller granularity of computations; often only one level of parallelism
Summary: sparse matrices
- Widely used in engineering and industry
- Irregular data structures
- Strong relations with graph theory
- Efficient algorithms are critical:
  - ordering
  - sparse Gaussian elimination
  - sparse matrix-vector multiplication
  - parallelization
- Challenges:
  - modern applications lead to bigger and bigger problems, with different types of matrices and requirements
  - dynamic data structures (numerical pivoting) → need for dynamic scheduling
  - more and more parallelism (evolution of parallel architectures)
Suggested home reading
- Google's PageRank: the world's largest matrix computation, Cleve Moler.
- Architecture and performance characteristics of modern high performance computers, Georg Hager and Gerhard Wellein, Lect. Notes in Physics.
- Optimization techniques for modern high performance computers, Georg Hager and Gerhard Wellein, Lect. Notes in Physics.