CR07 Sparse Matrix Computations / Cours de Matrices Creuses (2010-2011)
Jean-Yves L'Excellent (INRIA) and Bora Uçar (CNRS)
LIP-ENS Lyon, Team: ROMA
([email protected], [email protected])
Fridays 10h15-12h15
prepared in collaboration with P. Amestoy (ENSEEIHT-IRIT)
Motivations
Applications with an ever-growing need for computing power: modelling, simulation (rather than experimentation), numerical optimization.
Typically: continuous problem → discretization (mesh) → numerical solution algorithm (according to physical laws) → matrix problem (Ax = b, ...).
Needs:
- increasingly accurate models
- increasingly complex problems
- applications with critical response times
- minimization of computing costs
⇒ High-performance computers, parallelism
⇒ Algorithms that make the best use of these computers and of the properties of the matrices at hand
Example of preprocessing for Ax = b
Original matrix (A = lhr01) and preprocessed matrix (A'(lhr01)):
[Spy plots of both matrices; nz = 18427]
Modified problem: $A'x' = b'$ with $A' = P_n P D_r A Q D_c P^t$
A few examples in the field of scientific computing
Time constraints: weather forecasting
A few examples in the field of scientific computing
Cost constraints: wind tunnels, crash simulation, . . .
Scale constraints:
- large scale: climate modelling, pollution, astrophysics
- tiny scale: combustion, quantum chemistry
Contents of the course
Introduction, reminders on graph theory and on numerical linear algebra:
- Introduction to sparse matrices
- Graphs and hypergraphs, trees, some classical algorithms
- Gaussian elimination, LU factorization, fill-in
- Conditioning/sensitivity of a problem and error analysis
Contents of the course
Sparse Gaussian elimination:
- Sparse matrices and graphs
- Gaussian elimination of sparse matrices
- Ordering and permuting sparse matrices
Contents of the course
Maximum (weighted) matching algorithms and their use in sparse linear algebra
Efficient factorization methods:
- Implementation of sparse direct solvers
- Parallel sparse solvers
- Scheduling of computations to optimize memory usage and/or performance
Graph and hypergraph partitioning
Iterative methods
Current research activities
Tentative outline
A. INTRODUCTION
I. Sparse matrices
II. Graph theory and algorithms
III. Linear algebra basics
------------------------------------------------------
B. SPARSE GAUSSIAN ELIMINATION
IV. Elimination tree and structure prediction
V. Fill-reducing ordering methods
VI. Matching in bipartite graphs
VII. Factorization: Methods
VIII. Factorization: Parallelization aspects
------------------------------------------------------
C. SOME OTHER ESSENTIAL SPARSE MATRIX ALGORITHMS
IX. Graph and hypergraph partitioning
X. Iterative methods
------------------------------------------------------
D. CLOSING
XI. Current research activities
XII. Presentations
Tentative organization of the course
References and teaching material will be made available at http://graal.ens-lyon.fr/~bucar/CR07/
2+ hours dedicated to exercises (manipulation of sparse matrices and graphs)
Evaluation: mainly based on the study of research articles related to the contents of the course; the students will write a report and give an oral presentation.
Outline
Introduction to Sparse Matrix Computations
- Motivation and main issues
- Sparse matrices
- Gaussian elimination
- Parallel and high performance computing
- Numerical simulation and sparse matrices
- Direct vs iterative methods
- Conclusion
A selection of references
Books:
- Duff, Erisman and Reid, Direct Methods for Sparse Matrices, Clarendon Press, Oxford, 1986.
- Dongarra, Duff, Sorensen and van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM, 1991.
- Davis, Direct Methods for Sparse Linear Systems, SIAM, 2006.
- Saad, Iterative Methods for Sparse Linear Systems, 2nd edition, SIAM, 2004.
Articles:
- Gilbert and Liu, Elimination structures for unsymmetric sparse LU factors, SIMAX, 1993.
- Liu, The role of elimination trees in sparse factorization, SIMAX, 1990.
- Heath, Ng and Peyton, Parallel algorithms for sparse linear systems, SIAM Review, 1991.
Motivations
- Solution of linear systems of equations: a key algorithmic kernel.
  Continuous problem → discretization → solution of a linear system Ax = b.
- Main parameters:
  - numerical properties of the linear system (symmetry, positive definiteness, conditioning, ...)
  - size and structure: large (> 1,000,000 × 1,000,000?), square/rectangular, dense or sparse (structured/unstructured)
  - target computer (sequential/parallel/multicore)
⇒ Algorithmic choices are critical
Motivations for designing efficient algorithms
- Time-critical applications
- Solve larger problems
- Decrease elapsed time (parallelism?)
- Minimize the cost of computations (time, memory)
Difficulties
- Access to data:
  - computer: complex memory hierarchy (registers, multilevel cache, main memory (shared or distributed), disk)
  - sparse matrix: large, irregular, dynamic data structures
  → Exploit the locality of references to data on the computer (design algorithms providing such locality).
- Efficiency (time and memory):
  - the number of operations and the memory usage depend very much on the algorithm used and on the numerical and structural properties of the problem
  - the algorithm depends on the target computer (vector, scalar, shared, distributed, clusters of Symmetric Multi-Processors (SMP), multicore)
⇒ Algorithmic choices are critical
Sparse matrices
Example: the system
3 x1 + 2 x2 = 5
2 x2 - 5 x3 = 1
2 x1 + 3 x3 = 0
can be represented as Ax = b, where
$A = \begin{pmatrix} 3 & 2 & 0 \\ 0 & 2 & -5 \\ 2 & 0 & 3 \end{pmatrix}$, $x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}$, and $b = \begin{pmatrix} 5 \\ 1 \\ 0 \end{pmatrix}$
Sparse matrix: only nonzeros are stored.
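For concreteness, here is a minimal sketch (not from the original slides) of the 3 × 3 example stored in coordinate format, i.e. three parallel arrays holding the row index, column index and value of each nonzero:

    program coo_example
      implicit none
      integer, parameter :: nnz = 6
      integer :: k
      integer :: irn(nnz) = (/ 1, 1, 2, 2, 3, 3 /)  ! row indices
      integer :: jcn(nnz) = (/ 1, 2, 2, 3, 1, 3 /)  ! column indices
      double precision :: val(nnz) = &              ! the nonzero values
           (/ 3.d0, 2.d0, 2.d0, -5.d0, 2.d0, 3.d0 /)
      do k = 1, nnz                                 ! list the stored entries
         print '(a,i1,a,i1,a,f6.1)', 'A(', irn(k), ',', jcn(k), ') =', val(k)
      end do
    end program coo_example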
Sparse matrix ?
[Spy plot: nz = 5104]
Matrix dwt 592.rua (N = 592, NZ = 5104): structural analysis of a submarine
Sparse matrix ?
Matrix from Computational Fluid Dynamics (collaboration Univ. Tel Aviv):
[Spy plot: nz = 43105]
Saddle-point problem
Preprocessing sparse matrices
Original matrix (A = lhr01) and preprocessed matrix (A'(lhr01)):
[Spy plots of both matrices; nz = 18427]
Modified problem: $A'x' = b'$ with $A' = P_n P D_r A D_c Q P^t$
Factorization process
Solution of Ax = b:
- A is unsymmetric:
  - A is factorized as A = LU, where L is a lower triangular matrix and U is an upper triangular matrix
  - forward-backward substitution: Ly = b, then Ux = y
- A is symmetric:
  - A = LDL^T or LL^T
- A is rectangular, m × n with m ≥ n, and we seek $\min_x \|Ax - b\|_2$:
  - A = QR, where Q is orthogonal (Q^{-1} = Q^T) and R is triangular
  - solve: y = Q^T b, then Rx = y
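A minimal sketch of the forward-backward substitution step; the factors L (unit diagonal) and U are assumed to be already computed, and the numerical values below are hypothetical:

    program fwd_bwd
      implicit none
      integer, parameter :: n = 3
      double precision :: l(n,n), u(n,n), b(n), y(n), x(n)
      integer :: i
      ! hypothetical dense factors of some matrix A = LU
      l = reshape((/1.d0,0.5d0,0.d0, 0.d0,1.d0,0.25d0, 0.d0,0.d0,1.d0/),(/n,n/))
      u = reshape((/4.d0,0.d0,0.d0, 1.d0,2.d0,0.d0, 0.d0,1.d0,3.d0/),(/n,n/))
      b = (/ 5.d0, 3.5d0, 1.875d0 /)
      do i = 1, n        ! forward substitution: solve L y = b
         y(i) = b(i) - dot_product(l(i,1:i-1), y(1:i-1))
      end do
      do i = n, 1, -1    ! backward substitution: solve U x = y
         x(i) = (y(i) - dot_product(u(i,i+1:n), x(i+1:n))) / u(i,i)
      end do
      print *, x         ! x now solves (LU) x = b
    end program fwd_bwd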
Difficulties
- Only nonzero values are stored.
- Factors L and U have far more nonzeros than A.
- Data structures are complex.
- Computations are only a small portion of the code (the rest is data manipulation).
- Memory size is a limiting factor (→ out-of-core solvers).
Key numbers:
1. Small sizes: 500 MB matrix; factors = 5 GB; flops = 100 Gflops.
2. Example of a 2D problem (Lab. Geosciences Azur, Valbonne): complex 2D finite-difference matrix, n = 16 × 10^6, 150 × 10^6 nonzeros; storage (single precision): 2 GB (12 GB with the factors); flops: 10 TeraFlops.
3. Example of a 3D problem (EDF, Code_Aster, structural engineering): real finite-element matrix, n = 10^6, nz = 71 × 10^6 nonzeros; storage: 3.5 × 10^9 entries (28 GB) for the factors, 35 GB total; flops: 2.1 × 10^13.
4. Typical performance (MUMPS):
   - PC Linux, 1 core (P4, 2 GHz): 1.0 GFlop/s
   - Cray T3E (512 procs): speed-up 170, perf. 71 GFlop/s
   - AMD Opteron 8431, 24 cores @ 2.4 GHz: 50 GFlop/s (1 core: 7 GFlop/s)
Typical test problems:
BMW car body: 227,362 unknowns, 5,757,996 nonzeros (MSC.Software).
Size of factors: 51.1 million entries. Number of operations: 44.9 × 10^9.
Typical test problems:
BMW crankshaft: 148,770 unknowns, 5,396,386 nonzeros (MSC.Software).
Size of factors: 97.2 million entries. Number of operations: 127.9 × 10^9.
Sources of parallelism
Several levels of parallelism can be exploited:
- at problem level: the problem can be decomposed into sub-problems (e.g. domain decomposition)
- at matrix level: sparsity implies independence between calculations
- at submatrix level: within dense linear algebra computations (parallel BLAS, ...)
Data structure for sparse matrices
The storage scheme depends on the pattern of the matrix and on the type of access required:
- band or variable-band matrices
- block bordered or block tridiagonal matrices
- general matrices
- row, column or diagonal access
Data formats for a general sparse matrix A
What needs to be represented:
- Assembled matrices: M × N matrix A with NNZ nonzeros.
- Elemental matrices (unassembled): M × N matrix A with NELT elements.
- Arithmetic: real (4 or 8 bytes) or complex (8 or 16 bytes).
- Symmetric (or Hermitian): store only part of the data.
- Distributed format?
- Duplicate entries and/or out-of-range values?
Classical Data Formats for Assembled Matrices
Example of a 3 × 3 matrix with NNZ = 5 nonzeros:
$A = \begin{pmatrix} a_{11} & 0 & 0 \\ 0 & a_{22} & a_{23} \\ a_{31} & 0 & a_{33} \end{pmatrix}$
- Coordinate format:
    IRN [1 : NNZ] = 1 3 2 2 3
    JCN [1 : NNZ] = 1 1 2 3 3
    VAL [1 : NNZ] = a11 a31 a22 a23 a33
- Compressed Sparse Column (CSC) format:
    IRN [1 : NNZ] = 1 3 2 2 3
    VAL [1 : NNZ] = a11 a31 a22 a23 a33
    COLPTR [1 : N+1] = 1 3 4 6
  Column J is stored in IRN/VAL locations COLPTR(J) ... COLPTR(J+1)-1.
- Compressed Sparse Row (CSR) format: similar to CSC, but row by row.
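To make the role of COLPTR concrete, here is a sketch of a coordinate-to-CSC conversion by counting sort, applied to the example above (the values 11.0, 31.0, ... are hypothetical stand-ins for a11, a31, ...; entries may arrive in any order, and duplicates are assumed absent):

    program coo2csc
      implicit none
      integer, parameter :: n = 3, nnz = 5
      integer :: irn(nnz) = (/ 1, 3, 2, 2, 3 /)
      integer :: jcn(nnz) = (/ 1, 1, 2, 3, 3 /)
      double precision :: val(nnz) = (/ 11.d0, 31.d0, 22.d0, 23.d0, 33.d0 /)
      integer :: colptr(n+1), irn2(nnz), j, k, p
      double precision :: val2(nnz)
      colptr = 0
      do k = 1, nnz                      ! count the entries of each column
         colptr(jcn(k)+1) = colptr(jcn(k)+1) + 1
      end do
      colptr(1) = 1
      do j = 1, n                        ! prefix sums -> column pointers
         colptr(j+1) = colptr(j+1) + colptr(j)
      end do
      do k = 1, nnz                      ! scatter each entry into its column
         j = jcn(k)
         p = colptr(j)
         irn2(p) = irn(k)
         val2(p) = val(k)
         colptr(j) = p + 1
      end do
      do j = n, 1, -1                    ! undo the shift of the pointers
         colptr(j+1) = colptr(j)
      end do
      colptr(1) = 1
      print *, 'COLPTR =', colptr        ! expected: 1 3 4 6
      print *, 'IRN    =', irn2          ! expected: 1 3 2 2 3
    end program coo2csc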
Diagonal format (M = N), for the same 3 × 3 example:
NDIAG = 3, IDIAG = (-2, 0, 1)
    VAL = | na   a11  0   |
          | na   a22  a23 |
          | a31  a33  na  |
(na: not accessed)
VAL(i,j) corresponds to A(i, i+IDIAG(j)) (for 1 ≤ i + IDIAG(j) ≤ N).
Sparse matrix-vector products Y ← AX
The algorithm depends on the sparse matrix format:

Coordinate format:
    Y(1:M) = 0
    DO k = 1, NNZ
       Y(IRN(k)) = Y(IRN(k)) + VAL(k) * X(JCN(k))
    ENDDO

CSC format:
    Y(1:M) = 0
    DO J = 1, N
       Xj = X(J)
       DO k = COLPTR(J), COLPTR(J+1) - 1
          Y(IRN(k)) = Y(IRN(k)) + VAL(k) * Xj
       ENDDO
    ENDDO

CSR format:
    DO I = 1, M
       Yi = 0
       DO k = ROWPTR(I), ROWPTR(I+1) - 1
          Yi = Yi + VAL(k) * X(JCN(k))
       ENDDO
       Y(I) = Yi
    ENDDO

Diagonal format (VAL(i,j) corresponds to A(i, i+IDIAG(j))):
    Y(1:N) = 0
    DO j = 1, NDIAG
       DO i = max(1, 1-IDIAG(j)), min(N, N-IDIAG(j))
          Y(i) = Y(i) + VAL(i,j) * X(i+IDIAG(j))
       ENDDO
    ENDDO
Jagged diagonal storage (JDS)
For the same 3 × 3 example (nonzeros a11, a22, a23, a31, a33):
1. Shift all elements left (similar to CSR) and keep the column indices:
   a11 (1)
   a22 (2)  a23 (3)
   a31 (1)  a33 (3)
2. Sort the rows in decreasing order of their number of nonzeros:
   a22 (2)  a23 (3)
   a31 (1)  a33 (3)
   a11 (1)
3. Store the corresponding row permutation: PERM = 2 3 1
4. Store the jagged diagonals (the columns of step 2):
   VAL     = a22 a31 a11 a23 a33
   COL_IND = 2 1 1 3 3
   COL_PTR = 1 4 6
Pros: manipulates longer vectors than CSR (interesting on vector computers or GPUs).
Cons: an extra indirection due to the permutation array.
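A sketch of the corresponding matrix-vector product in JDS form, using the arrays built above (the values 22.0, 31.0, ... are hypothetical stand-ins for a22, a31, ...); note that the inner loop runs over a whole jagged diagonal, hence the long vectors:

    program jds_spmv
      implicit none
      integer, parameter :: n = 3, njd = 2   ! 2 jagged diagonals
      integer :: perm(n) = (/ 2, 3, 1 /)
      integer :: col_ind(5) = (/ 2, 1, 1, 3, 3 /)
      integer :: col_ptr(njd+1) = (/ 1, 4, 6 /)
      double precision :: val(5) = (/ 22.d0, 31.d0, 11.d0, 23.d0, 33.d0 /)
      double precision :: x(n) = (/ 1.d0, 1.d0, 1.d0 /), y(n)
      integer :: d, i, k
      y = 0.d0
      do d = 1, njd                          ! loop over jagged diagonals
         do i = 1, col_ptr(d+1) - col_ptr(d) ! long, regular inner loop
            k = col_ptr(d) + i - 1
            y(perm(i)) = y(perm(i)) + val(k) * x(col_ind(k))
         end do
      end do
      print *, y    ! expected: 11, 45, 64 (row sums of the example matrix)
    end program jds_spmv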
Example of elemental matrix format

$A = \begin{pmatrix} -1 & 2 & 3 & 0 & 0 \\ 2 & 1 & 1 & 0 & 0 \\ 1 & 1 & 3 & -1 & 3 \\ 0 & 0 & 1 & 2 & -1 \\ 0 & 0 & 3 & 2 & 1 \end{pmatrix} = A_1 + A_2$, with

$A_1$ (variables 1, 2, 3) $= \begin{pmatrix} -1 & 2 & 3 \\ 2 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}$,  $A_2$ (variables 3, 4, 5) $= \begin{pmatrix} 2 & -1 & 3 \\ 1 & 2 & -1 \\ 3 & 2 & 1 \end{pmatrix}$

N = 5, NELT = 2, NVAR = 6, $A = \sum_{i=1}^{NELT} A_i$

ELTPTR [1:NELT+1] = 1 4 7
ELTVAR [1:NVAR]   = 1 2 3 3 4 5
ELTVAL [1:NVAL]   = -1 2 1 2 1 1 3 1 1 2 1 3 -1 2 2 3 -1 1

Remarks:
- NVAR = ELTPTR(NELT+1) - 1
- NVAL = $\sum S_i^2$ (unsymmetric) or $\sum S_i(S_i+1)/2$ (symmetric), with $S_i$ = ELTPTR(i+1) - ELTPTR(i)
- the elements are stored in ELTVAL by columns
File storage: Rutherford-Boeing
- Standard ASCII format for files: header + data (CSC format).
- Key xyz:
  - x = [rcp] (real, complex, pattern)
  - y = [suhzr] (symmetric, unsymmetric, Hermitian, skew symmetric, rectangular)
  - z = [ae] (assembled, elemental)
  - e.g. M_T1.RSA, SHIP003.RSE
- Supplementary files: right-hand sides, solution, permutations, ...
- A canonical format was introduced to guarantee a unique representation (order of entries in each column, no duplicates).
DNV-Ex 1 : Tubular joint-1999-01-17     M_T1
 1733710    9758  492558 1231394       0
rsa        97578   97578 4925574       0
(10I8)          (10I8)          (3e26.16)
       1      49      96     142     187     231     274     346     417     487
     556     624     691     763     834     904     973    1041    1108    1180
    1251    1321    1390    1458    1525    1573    1620    1666    1711    1755
    1798    1870    1941    2011    2080    2148    2215    2287    2358    2428
    2497    2565    2632    2704    2775    2845    2914    2982    3049    3115
...
       1       2       3       4       5       6       7       8       9      10
      11      12      49      50      51      52      53      54      55      56
      57      58      59      60      67      68      69      70      71      72
     223     224     225     226     227     228     229     230     231     232
     233     234     433     434     435     436     437     438       2       3
       4       5       6       7       8       9      10      11      12      49
      50      51      52      53      54      55      56      57      58      59
...
 -0.2624989288237320E+10  0.6622960540857440E+09  0.2362753266740760E+11
  0.3372081648690030E+08 -0.4851430162799610E+08  0.1573652896140010E+08
  0.1704332388419270E+10 -0.7300763190874110E+09 -0.7113520995891850E+10
  0.1813048723097540E+08  0.2955124446119170E+07 -0.2606931100955540E+07
  0.1606040913919180E+07 -0.2377860366909130E+08 -0.1105180386670390E+09
  0.1610636280324100E+08  0.4230082475435230E+07 -0.1951280618776270E+07
  0.4498200951891750E+08  0.2066239484615530E+09  0.3792237438608430E+08
  0.9819999042370710E+08  0.3881169368090200E+08 -0.4624480572242580E+08
File storage: Matrix Market
Example
%%MatrixMarket matrix coordinate real general
% Comments
5 5 8
1 1 1.000e+00
2 2 1.050e+01
3 3 1.500e-02
1 4 6.000e+00
4 2 2.505e+02
4 4 -2.800e+02
4 5 3.332e+01
5 5 1.200e+01
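A minimal sketch of a reader for this format (assumptions: the data above is stored in a file named example.mtx, the field is 'coordinate real general', and the pattern/complex variants are not handled):

    program read_mm
      implicit none
      character(len=256) :: line
      integer :: m, n, nnz, k
      integer, allocatable :: irn(:), jcn(:)
      double precision, allocatable :: val(:)
      open(10, file='example.mtx', status='old')
      do                                 ! skip the banner and % comment lines
         read(10, '(a)') line
         if (line(1:1) /= '%') exit
      end do
      read(line, *) m, n, nnz            ! size line: M N NNZ
      allocate(irn(nnz), jcn(nnz), val(nnz))
      do k = 1, nnz                      ! one entry per line: i j a_ij
         read(10, *) irn(k), jcn(k), val(k)
      end do
      close(10)
      print *, 'read', nnz, 'entries of a', m, 'x', n, 'matrix'
    end program read_mm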
Examples of sparse matrix collections
- The University of Florida Sparse Matrix Collection: http://www.cise.ufl.edu/research/sparse/matrices/
- Matrix Market: http://math.nist.gov/MatrixMarket/
- Rutherford-Boeing: http://www.cerfacs.fr/algor/Softs/RB/index.html
- TLSE: http://gridtlse.org/
Gaussian elimination
$A = A^{(1)}$, $b = b^{(1)}$, $A^{(1)}x = b^{(1)}$:

$\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}$

Eliminate below the first pivot: row 2 ← row 2 − (a_{21}/a_{11}) × row 1, row 3 ← row 3 − (a_{31}/a_{11}) × row 1, giving $A^{(2)}x = b^{(2)}$:

$\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a^{(2)}_{22} & a^{(2)}_{23} \\ 0 & a^{(2)}_{32} & a^{(2)}_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b^{(2)}_2 \\ b^{(2)}_3 \end{pmatrix}$

with $b^{(2)}_2 = b_2 - a_{21}b_1/a_{11}, \ldots$ and $a^{(2)}_{32} = a_{32} - a_{31}a_{12}/a_{11}, \ldots$

Finally, $A^{(3)}x = b^{(3)}$:

$\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a^{(2)}_{22} & a^{(2)}_{23} \\ 0 & 0 & a^{(3)}_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b^{(2)}_2 \\ b^{(3)}_3 \end{pmatrix}$

with $a^{(3)}_{33} = a^{(2)}_{33} - a^{(2)}_{32}a^{(2)}_{23}/a^{(2)}_{22}, \ldots$

Typical Gaussian elimination step k:  $a^{(k+1)}_{ij} = a^{(k)}_{ij} - \dfrac{a^{(k)}_{ik}\, a^{(k)}_{kj}}{a^{(k)}_{kk}}$
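The elimination step above, written out for the dense case (a sketch assuming no pivoting is needed, i.e. all pivots a_kk stay nonzero); it solves the 3 × 3 example used earlier, back substitution included:

    program gauss_elim
      implicit none
      integer, parameter :: n = 3
      double precision :: a(n,n), b(n), x(n), lik
      integer :: i, k
      a = reshape((/3.d0,0.d0,2.d0, 2.d0,2.d0,0.d0, 0.d0,-5.d0,3.d0/),(/n,n/))
      b = (/ 5.d0, 1.d0, 0.d0 /)
      do k = 1, n-1                  ! step k: A^(k) -> A^(k+1)
         do i = k+1, n
            lik = a(i,k) / a(k,k)    ! multiplier l_ik
            a(i,k+1:n) = a(i,k+1:n) - lik * a(k,k+1:n)
            b(i) = b(i) - lik * b(k)
         end do
      end do
      do i = n, 1, -1                ! back substitution on the U part
         x(i) = (b(i) - dot_product(a(i,i+1:n), x(i+1:n))) / a(i,i)
      end do
      print *, x                     ! solution of the 3x3 example
    end program gauss_elim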
Relation with A = LU factorization
One step of Gaussian elimination can be written $A^{(k+1)} = L^{(k)} A^{(k)}$, with

$L^{(k)} = \begin{pmatrix} 1 & & & \\ & \ddots & & \\ & -l_{k+1,k} & \ddots & \\ & -l_{n,k} & & 1 \end{pmatrix}$ and $l_{ik} = \dfrac{a^{(k)}_{ik}}{a^{(k)}_{kk}}$.

Then $A^{(n)} = U = L^{(n-1)} \cdots L^{(1)} A$, which gives $A = LU$,

with $L = [L^{(1)}]^{-1} \cdots [L^{(n-1)}]^{-1} = \begin{pmatrix} 1 & & 0 \\ & \ddots & \\ l_{i,j} & & 1 \end{pmatrix}$.

In dense codes, entries of L and U overwrite entries of A.

Furthermore, if A is symmetric, $A = LDL^T$ with $d_{kk} = a^{(k)}_{kk}$: $A = LU = A^t = U^t L^t$ implies $U(L^t)^{-1} = L^{-1}U^t = D$ diagonal, and $U = DL^t$, thus $A = L(DL^t) = LDL^t$.
Gaussian elimination and sparsity
Step k of the LU factorization ($a_{kk}$ pivot):
- For $i > k$, compute $l_{ik} = a_{ik}/a_{kk}$ ($= a'_{ik}$).
- For $i > k$, $j > k$:
  $a'_{ij} = a_{ij} - \dfrac{a_{ik}\, a_{kj}}{a_{kk}}$, i.e. $a'_{ij} = a_{ij} - l_{ik}\, a_{kj}$
- If $a_{ik} \neq 0$ and $a_{kj} \neq 0$, then in general $a'_{ij} \neq 0$.
- If $a_{ij}$ was zero, its new nonzero value must be stored: fill-in.
[Pictures: at step k, nonzeros in positions (i,k) and (k,j) with pivot (k,k) turn a zero in position (i,j) into a fill-in.]
The same holds for Cholesky:
- For $i > k$, compute $l_{ik} = a_{ik}/\sqrt{a_{kk}}$ ($= a'_{ik}$).
- For $i > k$, $j > k$, $j \le i$ (lower triangular part):
  $a'_{ij} = a_{ij} - \dfrac{a_{ik}\, a_{jk}}{a_{kk}}$, i.e. $a'_{ij} = a_{ij} - l_{ik}\, l_{jk}$
Example
Original matrix (arrowhead: dense first row and column, diagonal elsewhere):
x x x x x
x x
x   x
x     x
x       x
The matrix is full after the first step of elimination.
After reordering the matrix (1st row and column → last row and column):
x       x
  x     x
    x   x
      x x
x x x x x
No fill-in.
⇒ Ordering the variables has a strong impact on the fill-in and on the number of operations.
NP-hard problem in general (Yannakakis, 1981).
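The effect of the two orderings can be checked with a symbolic elimination that propagates only the nonzero pattern (a sketch; it assumes no numerical cancellation):

    program fillin_demo
      implicit none
      logical :: p(5,5)
      integer :: i
      p = .false.                      ! arrowhead: dense first row/column
      do i = 1, 5
         p(i,i) = .true.; p(1,i) = .true.; p(i,1) = .true.
      end do
      print *, 'dense row/col first:', count_fill(p), 'fill-ins'  ! 12 (full)
      p = .false.                      ! reversed: dense last row/column
      do i = 1, 5
         p(i,i) = .true.; p(5,i) = .true.; p(i,5) = .true.
      end do
      print *, 'dense row/col last :', count_fill(p), 'fill-ins'  ! 0
    contains
      integer function count_fill(pat) result(fill)
        logical, intent(inout) :: pat(5,5)
        integer :: i, j, k
        fill = 0
        do k = 1, 4                    ! symbolic Gaussian elimination
           do i = k+1, 5
              do j = k+1, 5
                 if (pat(i,k) .and. pat(k,j) .and. .not. pat(i,j)) then
                    pat(i,j) = .true.  ! new nonzero: fill-in
                    fill = fill + 1
                 end if
              end do
           end do
        end do
      end function count_fill
    end program fillin_demo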
Illustration: Reverse Cuthill-McKee on matrix dwt 592.rua
Harwell-Boeing matrix dwt 592.rua, structural computing on a submarine. NZ(LU factors) = 58202.
Original matrix and factorized matrix:
[Spy plots: original matrix, nz = 5104; LU factors, nz = 58202]
After Reverse Cuthill-McKee reordering: NZ(LU factors) = 16924.
Permuted matrix (RCM) and factorized permuted matrix:
[Spy plots: permuted matrix, nz = 5104; LU factors, nz = 16924]
Table: Benefits of sparsity on a matrix of order 2021 × 2021 with 7353 nonzeros (Dongarra et al. 91).

Procedure                       Total storage   Flops          Time (sec.) on CRAY J90
Full system                     4084 Kwords     5503 x 10^6    34.5
Sparse system                   71 Kwords       1073 x 10^3    3.4
Sparse system and reordering    14 Kwords       42 x 10^3      0.9
Control of numerical stability: numerical pivoting
- In dense linear algebra, partial pivoting is commonly used (at each step, the largest entry in the column is selected).
- In sparse linear algebra, some flexibility is kept in order to preserve sparsity:
  - Partial threshold pivoting: eligible pivots are not too small with respect to the maximum in the column.
    Set of eligible pivots $= \{\, r : |a^{(k)}_{rk}| \ge u \cdot \max_i |a^{(k)}_{ik}| \,\}$, where $0 < u \le 1$.
  - Then, among the eligible pivots, select one that better preserves sparsity.
  - u is called the threshold parameter (u = 1 → partial pivoting).
  - It restricts the maximum possible growth of $a_{ij} = a_{ij} - \dfrac{a_{ik}\,a_{kj}}{a_{kk}}$ to $1 + \dfrac{1}{u}$, which is sufficient to preserve numerical stability.
  - $u \approx 0.1$ is often chosen in practice.
- For symmetric indefinite problems, 2-by-2 pivots (with threshold) are also used, to preserve symmetry and sparsity.
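A sketch of the eligibility test on one (hypothetical) column of the reduced matrix, with u = 0.1:

    program threshold_pivot
      implicit none
      double precision, parameter :: u = 0.1d0
      double precision :: col(4) = (/ 0.01d0, -0.5d0, 2.0d0, 0.3d0 /)
      double precision :: amax
      integer :: r
      amax = maxval(abs(col))          ! largest magnitude in the column
      do r = 1, size(col)              ! rows 2, 3 and 4 pass the test here
         if (abs(col(r)) >= u * amax) print *, 'row', r, 'eligible:', col(r)
      end do
    end program threshold_pivot

A sparse code would then pick, among the eligible rows, one with few nonzeros (to limit fill-in), rather than simply the largest entry.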
Threshold pivoting and numerical accuracy
Table: Effect of variation in the threshold parameter u on a 541 × 541 matrix with 4285 nonzeros (Dongarra et al. 91).

u        Nonzeros in LU factors   Error
1.0          16767                3 x 10^-9
0.25         14249                6 x 10^-10
0.1          13660                4 x 10^-9
0.01         15045                1 x 10^-5
10^-4        16198                1 x 10^2
10^-10       16553                3 x 10^23

Difficulty: numerical pivoting implies dynamic data structures that cannot be forecast symbolically.
Three-phase scheme to solve Ax = b
1. Analysis step:
   - preprocessing of A (symmetric/unsymmetric orderings, scalings)
   - build the dependency graph (elimination tree, eDAG, ...)
2. Factorization (A = LU, LDL^T, LL^T, QR), with numerical pivoting.
3. Solution based on the factored matrices:
   - triangular solves: Ly = b, then Ux = y
   - improvement of the solution (iterative refinement), error analysis
Efficient implementation of sparse algorithms
Indirect addressing is often used in sparse calculations, e.g. sparse SAXPY:

    do i = 1, m
       A( ind(i) ) = A( ind(i) ) + alpha * w( i )
    enddo

- Even if manufacturers provide hardware support to improve indirect addressing, it penalizes performance.
⇒ Identify dense blocks, or switch to dense calculations as soon as the matrix is not sparse enough.
Effect of switch to dense calculations
Matrix from a 5-point discretization of the Laplacian on a 50 × 50 grid (Dongarra et al. 91):

Density for switch to full code   Order of full submatrix   Millions of flops   Time (seconds)
No switch                         0                         7                   21.8
1.00                              74                        7                   21.4
0.80                              190                       8                   15.0
0.60                              235                       11                  12.5
0.40                              305                       21                  9.0
0.20                              422                       50                  5.5
0.10                              531                       100                 3.7
0.005                             1420                      1908                6.1

⇒ The sparse structure should be exploited if density < 10%.
Main processor (r)evolutions
pipelined functional units
superscalar processors
out-of-order execution (ILP)
larger caches
evolution of instruction set (CISC, RISC, EPIC, . . . )
multicores
Why parallel processing?
- Computing needs remain unmet in many disciplines (to solve significant problems).
- Uniprocessor performance is approaching the physical limits: a cycle time of 0.5 nanosecond ↔ 8 GFlop/s (with 4 floating-point operations per cycle).
- A 40 TFlop/s computer → 5000 cores → massively parallel computers.
- Not because it is the simplest way, but because it is necessary.
- Current power (June 2010, cf. http://www.top500.org): Cray XT5, Oak Ridge National Lab: 1.7 PFlop/s, 300 TBytes of memory, 224,256 cores.
Some units for high-performance computing
Speed:
1 MFlop/s  =  1 Megaflop/s  =  10^6 operations / second
1 GFlop/s  =  1 Gigaflop/s  =  10^9 operations / second
1 TFlop/s  =  1 Teraflop/s  =  10^12 operations / second
1 PFlop/s  =  1 Petaflop/s  =  10^15 operations / second
1 EFlop/s  =  1 Exaflop/s   =  10^18 operations / second

Memory:
1 kB / 1 ko  =  1 kilobyte  =  10^3 bytes
1 MB / 1 Mo  =  1 Megabyte  =  10^6 bytes
1 GB / 1 Go  =  1 Gigabyte  =  10^9 bytes
1 TB / 1 To  =  1 Terabyte  =  10^12 bytes
1 PB / 1 Po  =  1 Petabyte  =  10^15 bytes
Performance measures
- Number of floating-point operations per second (not MIPS).
- Peak performance:
  - what appears in the vendors' advertisements
  - assumes that all processing units are active
  - you are guaranteed not to go faster:
    peak performance = (# functional units) / (cycle time in sec.)
- Real performance:
  - usually much lower than the peak (unfortunately)
The ratio (real performance / peak performance) is often low!
Let P be a program:
1. Sequential processor: 1 scalar unit (1 GFlop/s). Execution time of P: 100 s.
2. Parallel machine with 100 processors: each processor at 1 GFlop/s, peak performance 100 GFlop/s.
3. If P = sequential code (10%) + parallelized code (90%): execution time of P: 0.9 + 10 = 10.9 s; real performance: 9.2 GFlop/s.
4. Real performance / peak performance = 0.1.
Moore's law
- Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of integrated circuits would double every 24 months.
- It also served as a goal for manufacturers.
- It got distorted: 24 → 18 months; number of transistors → performance.
How can computing speed be increased?
- Increase the frequency with faster technologies. We are approaching the limits:
  - chip design
  - power consumption and heat dissipation
  - cooling → space problems
- Further miniaturization is possible, but not indefinitely:
  - the resistance of conductors (R = ρl/s) increases, and resistance is responsible for energy dissipation (Joule effect)
  - capacitance effects are hard to control
- Remark: 0.5 nanosecond = the time for a signal to travel along 15 cm of cable.
- Cycle time of 0.5 nanosecond ↔ 8 GFlop/s (with 4 floating-point operations per cycle).
The only solution: parallelism
- Parallelism: simultaneous execution of several instructions within a program.
- Within a core:
  - micro-instructions
  - pipelined processing
  - overlap of instructions executed by distinct computation units
  → transparent for the programmer (handled by the compiler or at run time)
- Between distinct processors or cores:
  - different instruction streams are executed
  - synchronizations: implicit (compiler, automatic parallelization, use of parallel libraries) or explicit (message passing, multithreaded programming, critical sections)
The data access problem
- In practice, one often runs at 10% of the peak performance.
- Faster processors → faster data access is needed: processor organization, memory organization, inter-processor communication.
- More complex hardware: pipelines, technology, network, ...
- More complex software: compilers, operating systems, programming languages, parallelism management, debugging tools, ... and the applications themselves.
⇒ It becomes harder and harder to program efficiently.
Memory bandwidth problems
- Data access is a crucial problem in modern computers:
  - processor performance: +60% per year
  - DRAM memory: +9% per year
  → the ratio processor performance / memory access time increases by about 50% per year! MFlop/s are easier to obtain than MB/s of memory bandwidth.
- The memory hierarchy is more and more complex (and latency increases); the way data is accessed becomes more and more critical:
  - minimize cache misses
  - minimize memory paging
  - locality: improve the ratio of references to local memory over references to remote memory
  - reuse, blocking: increase the flops / memory-access ratio
  - manage data transfers by hand? (Cell, GPU)
Figure: Example of a memory hierarchy (average access time in cycles, hit/miss, and size):

Registers        < 1
Cache level 1    1-2 / 8-66            1-128 KB
Cache level 2    6-15 / 30-200         256 KB-16 MB
Main memory      10-100                1-10 GB
Remote memory    500-5000
Disks            700,000 / 6,000,000
Memory design for a large number of processors?
How can 500 processors access data stored in a shared memory (technology, interconnection, price)?
→ Solution at a reasonable cost: physically distributed memory (each processor or group of processors has its own local memory).
Two solutions:
- globally addressable local memories: virtual shared memory computers
- explicit data transfers between processors through message exchanges
Scalability requires:
- a linear increase of memory bandwidth with processor speed
- an increase of communication bandwidth with the number of processors
⇒ Cost/performance: distributed memory, and a good cost/performance ratio on the processors.
Architecture of multiprocessors
Large number of processors → physically distributed memory.

Table: Organization of the processors

Logical org. \ Physical org.  |  Shared (32 procs max)          |  Distributed
Shared                        |  shared-memory multiprocessors  |  global address space (hard/soft) on top of messages; virtual shared memory
Distributed                   |  emulation of message passing (buffers)  |  message passing
Terminology
SMP (Symmetric MultiProcessor) architecture:
- memory shared both physically and logically
- uniform memory access time
- similar, from the application point of view, to multicore architectures (1 core = 1 logical processor)
- but communications are much faster within multicores (latency < 3 ns, bandwidth > 20 GB/s) than within SMPs (latency ≈ 60 ns, bandwidth ≈ 2 GB/s)

NUMA (Non Uniform Memory Access) architecture:
- memory physically distributed and logically shared
- easier to increase the number of procs than with SMP
- access time depends on the location of the data: local accesses are faster than remote accesses
- hardware ensures cache coherence (ccNUMA)
Programming
Programming standards:
- logically shared organization: POSIX threads, OpenMP directives
- logically distributed organization: PVM, MPI, sockets (message passing)
- hybrid programming: MPI + OpenMP
- machines with 1 million cores? emerging GPGPU-type architectures? No standard yet!
Evolution of high-performance computing
- Rapid evolution of high-performance architectures (SMP, clusters, NUMA, multicores, Cell, GPUs, ...)
- Parallelism at several levels
- Increasingly complex memory hierarchies
⇒ Programming is more and more difficult, with software tools and standards that always lag behind.
Numerical simulation and sparse matrices
General approach in scientific computing:
1. Simulation problem (continuous problem)
2. Application of physical laws (partial differential equations)
3. Discretization, formulation of the equations in finite dimension
4. Solution of linear systems (Ax = b)
5. (Study of the results; possible revision of the model or of the method)

Solution of linear systems = a fundamental algorithmic kernel. Parameters to take into account:
- properties of the system (symmetry, positive definiteness, conditioning, over-determined, ...)
- structure: dense or sparse
- size: several million equations?
Partial differential equations
Modelling of a physical phenomenon:
- differential equations involving forces, moments, temperatures, velocities, energies, time
- analytical solutions are rarely available
Examples of partial differential equations
- Find the electric potential φ for a given charge distribution f: $\nabla^2 \varphi = f$, i.e.
  $\frac{\partial^2 \varphi}{\partial x^2}(x,y,z) + \frac{\partial^2 \varphi}{\partial y^2}(x,y,z) + \frac{\partial^2 \varphi}{\partial z^2}(x,y,z) = f(x,y,z)$
- Heat equation (Fourier equation):
  $\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} = \frac{1}{\kappa}\frac{\partial u}{\partial t}$
  with u = u(x,y,z,t) the temperature and κ the thermal diffusivity of the medium.
- Wave propagation equations, Schrödinger equation, Navier-Stokes, ...
Discretization (the step that follows the physical modelling)
The numerical analyst's job:
- building a mesh (regular, irregular)
- choosing the solution methods and studying their behaviour
- studying the loss of information due to the passage to finite dimension
Main discretization techniques:
- finite differences
- finite elements
- finite volumes
Discretization with finite differences (1D)
Basic approximation (ok if h is small enough):
$\frac{du}{dx}(x) \approx \frac{u(x+h) - u(x)}{h}$
Results from Taylor's formula:
$u(x+h) = u(x) + h\frac{du}{dx} + \frac{h^2}{2}\frac{d^2u}{dx^2} + \frac{h^3}{6}\frac{d^3u}{dx^3} + O(h^4)$
Replacing h by −h:
$u(x-h) = u(x) - h\frac{du}{dx} + \frac{h^2}{2}\frac{d^2u}{dx^2} - \frac{h^3}{6}\frac{d^3u}{dx^3} + O(h^4)$
Thus:
$\frac{d^2u}{dx^2} = \frac{u(x+h) - 2u(x) + u(x-h)}{h^2} + O(h^2)$
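The O(h²) behaviour is easy to check numerically; a small sketch using u(x) = sin(x) at x = 1 (so that the exact second derivative is −sin(1)):

    program fd_check
      implicit none
      double precision :: h, x, d2u, err
      integer :: k
      x = 1.d0
      h = 0.1d0
      do k = 1, 4
         d2u = (sin(x+h) - 2.d0*sin(x) + sin(x-h)) / h**2
         err = abs(d2u + sin(x))     ! exact second derivative is -sin(1)
         print '(a,es9.2,a,es9.2)', ' h =', h, '   error =', err
         h = h / 2.d0                ! the error should shrink by about 4
      end do
    end program fd_check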
Discretization with finite differences (1D)
$\frac{d^2u}{dx^2} = \frac{u(x+h) - 2u(x) + u(x-h)}{h^2} + O(h^2)$
3-point stencil for the centered difference approximation to the second-order derivative: (1  −2  1)
Finite differences for the Laplacian operator (2D)
Assuming the same mesh refinement h in the x and y directions:
$\Delta u(x,y) \approx \frac{u(x-h,y) - 2u(x,y) + u(x+h,y)}{h^2} + \frac{u(x,y-h) - 2u(x,y) + u(x,y+h)}{h^2}$
$\Delta u(x,y) \approx \frac{1}{h^2}\left( u(x-h,y) + u(x+h,y) + u(x,y-h) + u(x,y+h) - 4u(x,y) \right)$
5-point stencils for the centered difference approximation to the Laplacian operator: standard (left) and skewed (right).
[Pictures: four entries 1 around a central −4, axis-aligned (left) and diagonal (right)]
27-point stencil used for 3D geophysical applications (collaboration with Geoscience Azur, S. Operto and J. Virieux).
1D example
Consider the problem:
$-u''(x) = f(x)$ for $x \in (0,1)$, with $u(0) = u(1) = 0$
$x_i = ih$, $i = 0, \ldots, n+1$, $f(x_i) = f_i$, $u(x_i) = u_i$, $h = 1/(n+1)$
Centered difference approximation:
$-u_{i-1} + 2u_i - u_{i+1} = h^2 f_i$ (with $u_0 = u_{n+1} = 0$)
We obtain a linear system $Au = f$, or, for n = 6:
$\frac{1}{h^2}\begin{pmatrix} 2 & -1 & 0 & 0 & 0 & 0 \\ -1 & 2 & -1 & 0 & 0 & 0 \\ 0 & -1 & 2 & -1 & 0 & 0 \\ 0 & 0 & -1 & 2 & -1 & 0 \\ 0 & 0 & 0 & -1 & 2 & -1 \\ 0 & 0 & 0 & 0 & -1 & 2 \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \\ u_5 \\ u_6 \end{pmatrix} = \begin{pmatrix} f_1 \\ f_2 \\ f_3 \\ f_4 \\ f_5 \\ f_6 \end{pmatrix}$
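A sketch that assembles and solves this system for f = 1 (the tridiagonal solver below is the classical Thomas algorithm; the exact solution is u(x) = x(1−x)/2):

    program poisson1d
      implicit none
      integer, parameter :: n = 6
      double precision :: dl(n), d(n), du(n), f(n), u(n), h, w
      integer :: i
      h = 1.d0 / (n + 1)
      d  =  2.d0 / h**2              ! diagonal of A
      dl = -1.d0 / h**2              ! sub-diagonal
      du = -1.d0 / h**2              ! super-diagonal
      f  =  1.d0                     ! right-hand side f_i = 1
      do i = 2, n                    ! forward elimination
         w = dl(i) / d(i-1)
         d(i) = d(i) - w * du(i-1)
         f(i) = f(i) - w * f(i-1)
      end do
      u(n) = f(n) / d(n)             ! back substitution
      do i = n-1, 1, -1
         u(i) = (f(i) - du(i) * u(i+1)) / d(i)
      end do
      print *, u                     ! compare with x_i (1 - x_i) / 2
    end program poisson1d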
Slightly more complicated (2D)
Consider an elliptic PDE:
$-\frac{\partial}{\partial x}\left(a(x,y)\frac{\partial u}{\partial x}\right) - \frac{\partial}{\partial y}\left(b(x,y)\frac{\partial u}{\partial y}\right) + c(x,y)\,u = g(x,y)$ on $\Omega$,
$u(x,y) = 0$ on $\partial\Omega$,
with $0 \le x, y \le 1$, $a(x,y) > 0$, $b(x,y) > 0$, $c(x,y) \ge 0$.
Case of a regular 2D mesh:
[Picture: regular grid with n = 4 interior points per direction, unknowns numbered row by row]
discretization step: h = 1/(n+1), n = 4
5-point finite difference scheme:
$\left(\frac{\partial}{\partial x}\left(a(x,y)\frac{\partial u}{\partial x}\right)\right)_{ij} = \frac{a_{i+1/2,j}\,(u_{i+1,j} - u_{i,j})}{h^2} - \frac{a_{i-1/2,j}\,(u_{i,j} - u_{i-1,j})}{h^2} + O(h^2)$
Similarly:
$\left(\frac{\partial}{\partial y}\left(b(x,y)\frac{\partial u}{\partial y}\right)\right)_{ij} = \frac{b_{i,j+1/2}\,(u_{i,j+1} - u_{i,j})}{h^2} - \frac{b_{i,j-1/2}\,(u_{i,j} - u_{i,j-1})}{h^2} + O(h^2)$
with $a_{i+1/2,j}$, $b_{i,j+1/2}$, $c_{ij}$, ... known.
With the ordering of unknowns of the example, we obtain a linear system of the form Ax = b, where
$x_1 \leftrightarrow u_{1,1} = u(\tfrac{1}{n+1}, \tfrac{1}{n+1})$, $x_2 \leftrightarrow u_{2,1} = u(\tfrac{2}{n+1}, \tfrac{1}{n+1})$, $x_3 \leftrightarrow u_{3,1}$, $x_4 \leftrightarrow u_{4,1}$, $x_5 \leftrightarrow u_{1,2}$, ...
and A is n² × n², b is of size n², with the following structure:
    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
  |x x     x                     |  1   |g11|
  |x x x     x                   |  2   |g21|
  |  x x x     x                 |  3   |g31|
  |    x x 0    x                |  4   |g41|
  |x 0     x x     x             |  5   |g12|
  |  x     x x x     x           |  6   |g22|
  |    x     x x x     x         |  7   |g32|
A=|      x     x x 0     x       |  8  b=|g42|
  |        x 0     x x     x     |  9   |g13|
  |          x     x x x     x   | 10   |g23|
  |            x     x x x     x | 11   |g33|
  |              x     x x 0   x | 12   |g43|
  |                x 0     x x   | 13   |g14|
  |                  x     x x x | 14   |g24|
  |                    x   x x x | 15   |g34|
  |                      x     x x| 16   |g44|
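A sketch generating this matrix in coordinate format (assumptions: constant coefficients a = b = 1 and c = 0, so the stencil reduces to the classical (−1, −1, 4, −1, −1), with the row-by-row ordering above):

    program laplace2d_coo
      implicit none
      integer, parameter :: n = 4
      integer :: irn(5*n*n), jcn(5*n*n)
      double precision :: val(5*n*n)
      integer :: i, j, row, nnz
      nnz = 0
      do j = 1, n
         do i = 1, n
            row = i + (j-1)*n                       ! unknown u_{i,j}
            call add(row, row, 4.d0)                ! diagonal
            if (i > 1) call add(row, row-1, -1.d0)  ! west neighbour
            if (i < n) call add(row, row+1, -1.d0)  ! east neighbour
            if (j > 1) call add(row, row-n, -1.d0)  ! south neighbour
            if (j < n) call add(row, row+n, -1.d0)  ! north neighbour
         end do
      end do
      print *, 'unknowns:', n*n, '  nonzeros:', nnz ! 16 unknowns, 64 nonzeros
    contains
      subroutine add(r, c, v)
        integer, intent(in) :: r, c
        double precision, intent(in) :: v
        nnz = nnz + 1
        irn(nnz) = r; jcn(nnz) = c; val(nnz) = v
      end subroutine add
    end program laplace2d_coo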
Solution of the linear system
- Often the most costly part of a numerical simulation code.
- Direct methods: LU factorization followed by triangular substitutions; parallelism depends highly on the structure of the matrix.
- Iterative methods: usually rely on sparse matrix-vector products; an algebraic preconditioner is useful.
Evolution in time of a complex phenomenon
Examples: climate modelling, evolution of radioactive waste, ...
Heat equation:
$\Delta u(x,y,z,t) = \frac{\partial u(x,y,z,t)}{\partial t}$, with $u(x,y,z,t_0) = u_0(x,y,z)$
Discretization in both space and time (1D case):
- Explicit approaches: $\dfrac{u_j^{n+1} - u_j^n}{t_{n+1} - t_n} = \dfrac{u_{j+1}^n - 2u_j^n + u_{j-1}^n}{h^2}$
- Implicit approaches: $\dfrac{u_j^{n+1} - u_j^n}{t_{n+1} - t_n} = \dfrac{u_{j+1}^{n+1} - 2u_j^{n+1} + u_{j-1}^{n+1}}{h^2}$
Implicit approaches are preferred (more stable, larger timesteps possible) but are more numerically intensive: a sparse linear system must be solved at each iteration.
Discretization with finite elements
Consider a partial differential equation of the form (Poisson equation):
$-\Delta u = -\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) = f$ in $\Omega$, with $u = 0$ on $\partial\Omega$.
Using Green's formula, one can show that this problem is equivalent to:
$a(u,v) = \int_\Omega f\,v \; dx\,dy$ for all $v$ such that $v = 0$ on $\partial\Omega$,
where $a(u,v) = \int_\Omega \left(\frac{\partial u}{\partial x}\frac{\partial v}{\partial x} + \frac{\partial u}{\partial y}\frac{\partial v}{\partial y}\right) dx\,dy$.
Finite element scheme: 1D Poisson equation
$-u'' = -\frac{\partial^2 u}{\partial x^2} = f$, with $u = 0$ on $\partial\Omega$, is equivalent to:
$a(u,v) = g(v)$ for all $v$ (with $v|_{\partial\Omega} = 0$),
where $a(u,v) = \int \frac{\partial u}{\partial x}\frac{\partial v}{\partial x}\,dx$ and $g(v) = \int f(x)\,v(x)\,dx$
(in 1D: similar to integration by parts).
Idea: we search for u of the form $u = \sum_k \alpha_k \Phi_k(x)$, where $(\Phi_k)_{k=1,n}$ is a basis of functions such that each $\Phi_k$ is linear on every element $E_i$ and $\Phi_k(x_i) = \delta_{ik}$ (= 1 if k = i, 0 otherwise).
[Picture: piecewise-linear hat functions $\Phi_{k-1}$, $\Phi_k$, $\Phi_{k+1}$ around node $x_k$, with elements $E_k$ and $E_{k+1}$]
Finite element scheme: 1D Poisson equation
We rewrite $a(u,v) = g(v)$ for all $\Phi_k$:
$a(u,\Phi_k) = g(\Phi_k)$ for all $k$  ⟺  $\sum_i \alpha_i\, a(\Phi_i,\Phi_k) = g(\Phi_k)$
$a(\Phi_i,\Phi_k) = \int \frac{\partial \Phi_i}{\partial x}\frac{\partial \Phi_k}{\partial x}\,dx = 0$ when $|i-k| \ge 2$
k-th equation, associated with $\Phi_k$:
$\alpha_{k-1}\, a(\Phi_{k-1},\Phi_k) + \alpha_k\, a(\Phi_k,\Phi_k) + \alpha_{k+1}\, a(\Phi_{k+1},\Phi_k) = g(\Phi_k)$
with
$a(\Phi_{k-1},\Phi_k) = \int_{E_k} \frac{\partial \Phi_{k-1}}{\partial x}\frac{\partial \Phi_k}{\partial x}$,  $a(\Phi_{k+1},\Phi_k) = \int_{E_{k+1}} \frac{\partial \Phi_{k+1}}{\partial x}\frac{\partial \Phi_k}{\partial x}$,  $a(\Phi_k,\Phi_k) = \int_{E_k} \frac{\partial \Phi_k}{\partial x}\frac{\partial \Phi_k}{\partial x} + \int_{E_{k+1}} \frac{\partial \Phi_k}{\partial x}\frac{\partial \Phi_k}{\partial x}$
Finite element scheme: 1D Poisson equation
From the point of view of element $E_k$, we have a 2 × 2 contribution matrix:
$\begin{pmatrix} \int_{E_k} \frac{\partial \Phi_{k-1}}{\partial x}\frac{\partial \Phi_{k-1}}{\partial x} & \int_{E_k} \frac{\partial \Phi_{k-1}}{\partial x}\frac{\partial \Phi_k}{\partial x} \\ \int_{E_k} \frac{\partial \Phi_{k-1}}{\partial x}\frac{\partial \Phi_k}{\partial x} & \int_{E_k} \frac{\partial \Phi_k}{\partial x}\frac{\partial \Phi_k}{\partial x} \end{pmatrix} = \begin{pmatrix} I_{E_k}(k{-}1,k{-}1) & I_{E_k}(k{-}1,k) \\ I_{E_k}(k,k{-}1) & I_{E_k}(k,k) \end{pmatrix}$
[Picture: nodes 0, 1, 2, 3, 4 and elements $E_1, E_2, E_3, E_4$]
Assembling over the elements gives, for the interior nodes 1, 2, 3:
$\begin{pmatrix} I_{E_1}(1,1) + I_{E_2}(1,1) & I_{E_2}(1,2) & \\ I_{E_2}(2,1) & I_{E_2}(2,2) + I_{E_3}(2,2) & I_{E_3}(2,3) \\ & I_{E_3}(3,2) & I_{E_3}(3,3) + I_{E_4}(3,3) \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{pmatrix} = \begin{pmatrix} g(\Phi_1) \\ g(\Phi_2) \\ g(\Phi_3) \end{pmatrix}$
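A sketch of the assembly loop in 1D (assumptions: uniform mesh on (0,1) with NEL = 4 linear elements and f = 1, so each element contribution matrix is (1/h) [[1, −1], [−1, 1]]):

    program fem1d_assembly
      implicit none
      integer, parameter :: nel = 4               ! elements E1..E4, nodes 0..4
      double precision :: K(0:nel,0:nel), g(0:nel), ke(2,2), h
      integer :: e, a, b, ia, ib
      h = 1.d0 / nel
      ke = reshape((/1.d0,-1.d0, -1.d0,1.d0/), (/2,2/)) / h  ! element matrix
      K = 0.d0
      g = 0.d0
      do e = 1, nel                  ! element E_e couples nodes e-1 and e
         do a = 1, 2
            ia = e - 2 + a
            g(ia) = g(ia) + h / 2.d0            ! g(Phi) for f = 1
            do b = 1, 2
               ib = e - 2 + b
               K(ia,ib) = K(ia,ib) + ke(a,b)    ! scatter-add I_Ee(ia,ib)
            end do
         end do
      end do
      ! rows/columns 0 and nel would now be dropped (u = 0 on the boundary)
      print '(5f8.2)', (K(a,0:nel), a = 0, nel)
    end program fem1d_assembly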
Finite element scheme in higher dimension
- Can be used in higher dimensions
- The mesh can be irregular
- $\Phi_i$ can be a higher-degree polynomial
- The matrix pattern depends on the mesh connectivity/ordering
Finite element scheme in higher dimension
Set of elements (tetrahedra, triangles) to assemble; for a triangle T with vertices i, j, k:
$C(T) = \begin{pmatrix} a^T_{i,i} & a^T_{i,j} & a^T_{i,k} \\ a^T_{j,i} & a^T_{j,j} & a^T_{j,k} \\ a^T_{k,i} & a^T_{k,j} & a^T_{k,k} \end{pmatrix}$
Needs for the parallel case:
- assemble the sparse matrix $A = \sum_k C(T_k)$: graph colouring algorithms
- parallelization domain by domain: graph partitioning
- solution of Ax = b: high-performance matrix computation kernels
Other example: linear least squares
- mathematical model + approximate measurements ⇒ estimate the parameters of the model
- m experiments + n parameters $x_i$: $\min \|Ax - b\|$, with
  - $A \in \mathbb{R}^{m \times n}$, $m \ge n$: data matrix
  - $b \in \mathbb{R}^m$: vector of observations
  - $x \in \mathbb{R}^n$: parameters of the model
- Solving the problem: decompose A in the form A = QR, with Q orthogonal and R triangular; then
  $\|Ax - b\| = \|Q^T Ax - Q^T b\| = \|Q^T QRx - Q^T b\| = \|Rx - Q^T b\|$
- Problems can be large (meteorological data, ...), sparse or not.
Solution of sparse linear systems Ax = b: direct or iterative approaches?

Direct methods:
- very general/robust
- numerical accuracy
- handle irregular/unstructured problems
- factorization of the matrix A may be costly (flops/memory)
- factors can be reused for multiple right-hand sides b
- computing issues: good granularity of computations; several levels of parallelism can be exploited

Iterative methods:
- efficiency depends on: convergence and preconditioning; numerical properties / structure of A
- rely on an efficient matrix-vector product
- memory effective
- successive right-hand sides are problematic
- computing issues: smaller granularity of computations; often only one level of parallelism
Summary: sparse matrices
- Widely used in engineering and industry
- Irregular data structures
- Strong relations with graph theory
- Efficient algorithms are critical:
  - ordering
  - sparse Gaussian elimination
  - sparse matrix-vector multiplication
  - parallelization
- Challenges:
  - modern applications lead to bigger and bigger problems, with different types of matrices and requirements
  - dynamic data structures (numerical pivoting) → need for dynamic scheduling
  - more and more parallelism (evolution of parallel architectures)
Suggested home reading
- Google's PageRank: the world's largest matrix computation, Cleve Moler.
- Architecture and performance characteristics of modern high performance computers, Georg Hager and Gerhard Wellein, Lect. Notes in Physics.
- Optimization techniques for modern high performance computers, Georg Hager and Gerhard Wellein, Lect. Notes in Physics.