
Memory-Aware Scheduling for Sparse Direct Methods

Emmanuel AGULLO, ICL - University of Tennessee
Jean-Yves L’EXCELLENT, LIP - INRIA

Abdou GUERMOUCHE, LaBRI, Universite de Bordeaux

MS31 Parallel Sparse Matrix Computations and Enabling Algorithms

SIAM CSE 2009, Miami, FL, March 2-6, 2009


Context

Solving sparse linear systems

Ax = b ⇒ Direct methods: A = LU

Typical matrix: BRGM matrix
  - 3.7 × 10^6 variables;
  - 156 × 10^6 non-zeros in A;
  - 4.5 × 10^9 non-zeros in LU;
  - 26.5 × 10^12 flops.

Hardware paradigm
  - Many-core architecture;
  - Large global amount of memory;
  - Limited memory per core.

Software challenge
  → Need for algorithms whose memory usage scales with the number of processors.
  - Case study: MUMPS


Outline

1. MUMPS
2. Limits to memory scalability
3. A new memory-aware algorithm
4. Preliminary results
5. Conclusion


MUMPS

MUMPS: a MUltifrontal Massively Parallel sparse direct Solver

Solution of large sparse linear systems with:
  - symmetric positive definite matrices;
  - general symmetric matrices;
  - general unsymmetric matrices.

Implementation
  - Distributed multifrontal solver (F90, MPI based);
  - Dynamic distributed scheduling;
  - Use of BLAS, BLACS, ScaLAPACK.

Interfaces
  - Fortran, C, Matlab, Scilab, Visual Studio.

The multifrontal method (Duff, Reid ’83)

[Figure: a 5×5 sparse matrix A and its filled factors L+U−I (legend: non-zero / fill-in), with the corresponding elimination tree; each tree node carries a frontal matrix split into a factor part and a contribution block.]

Storage divided into two parts:
  - Factors: systematically written to disk;
  - Active storage: kept in memory (active frontal matrix + stack of contribution blocks).


Memory behaviour (serial postorder traversal)

[Figure: evolution of the active storage (stack of contribution blocks + active frontal matrix) while a three-node elimination tree (leaves 1 and 2, root 3) is processed in postorder; see the sketch below.]

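To make the mechanism above concrete, here is a minimal Python sketch (not MUMPS code; the tree shape and node sizes are invented for illustration) that replays a serial postorder traversal and tracks the stack of contribution blocks:

```python
# Peak active storage of a serial postorder traversal of an elimination
# tree. front[n] = size of the frontal matrix of node n; cb[n] = size of
# the contribution block that n pushes onto the stack. Illustrative only.

def peak_active_storage(tree, front, cb, root):
    stack = 0   # current size of the stack of contribution blocks
    peak = 0    # peak of (stack + active frontal matrix)

    def visit(n):
        nonlocal stack, peak
        for c in tree.get(n, []):
            visit(c)
        # At assembly time, the children's contribution blocks still sit
        # on the stack while the new frontal matrix is allocated.
        peak = max(peak, stack + front[n])
        stack -= sum(cb[c] for c in tree.get(n, []))  # consume children
        stack += cb[n]                                # push own block

    visit(root)
    return peak

# Three-node tree as on the slide: leaves 1 and 2, root 3.
tree = {3: [1, 2]}
front = {1: 4, 2: 4, 3: 9}
cb = {1: 1, 2: 1, 3: 0}   # the root produces no contribution block
print(peak_active_storage(tree, front, cb, 3))  # 11 = cb[1] + cb[2] + front[3]
```

The peak of this curve, not the total factor size, is what must fit in core once factors are systematically written to disk.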

Memory efficiency of MUMPS

Definition: memory efficiency on p processors (or cores)

    e(p) = S_seq / (p × S_max(p)),

where S_seq is the storage of the serial execution and S_max(p) the maximum storage over the p processors.

Results: memory efficiency of MUMPS (with factors on disk)

    Number p of processors    16     32     64     128
    AUDI_KW_1                 0.16   0.12   0.13   0.10
    CONESHL_MOD               0.28   0.28   0.22   0.19
    CONV3D64                  0.42   0.40   0.41   0.37
    QIMONDA07                 0.30   0.18   0.11    -
    ULTRASOUND80              0.32   0.31   0.30   0.26
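As a worked illustration of the definition (all numbers hypothetical; only the resulting 0.16 echoes the AUDI_KW_1 column above):

```python
# e(p) = S_seq / (p * S_max(p)). A value of 1 means the serial storage is
# split perfectly across the p processors; small values mean the busiest
# processor holds a disproportionate share.

def memory_efficiency(s_seq, s_max, p):
    return s_seq / (p * s_max)

print(memory_efficiency(16e9, 1.0e9, 16))   # 1.0  : perfect splitting
print(memory_efficiency(16e9, 6.25e9, 16))  # 0.16 : one processor holds ~39% of S_seq
```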


Limits to memory scalability

Parallel multifrontal scheme

  - Type 1: nodes processed on a single processor;
  - Type 2: nodes processed with a parallel 1D blocked factorization;
  - Type 3: parallel 2D cyclic factorization (root node).

[Figure: an elimination tree mapped onto processors P0-P3 over time. The mapping of the bottom subtrees and the 2D decomposition of the root are static; in the 1D pipelined factorizations, the slave processors (e.g. P3 and P0, chosen by master P2) are selected dynamically at runtime.]

Limits to memory scalability

  - Many simultaneous active tasks;
  - Large master tasks;
  - Large subtrees;
  - Proportional mapping.


Proportional mapping vs postorder traversal (1/2)

Proportional mapping:

[Figure: an elimination tree of depth 4 (d=0 to d=4). The 512 processors on the root are split into 256 and 256 over its two children at d=1, then 128 on each of the four subtrees at d=2, and so on recursively.]

Mapping
  - Initially: all processors on the root node;
  - Recursively split the set of processors over the child subtrees (a sketch follows this slide).

Advantages and drawbacks
  + Fine-grain + coarse-grain parallelism;
  − Bad memory efficiency.
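A minimal Python sketch of the splitting rule just described (illustrative only: work[n] stands for the workload of the subtree rooted at n, and the rounding of shares is left uncorrected, whereas a real mapping must make the shares sum exactly to nprocs):

```python
# Proportional mapping: assign nprocs to a node, then split them among the
# child subtrees in proportion to their workloads, recursively.

def proportional_mapping(tree, work, node, nprocs, mapping=None):
    if mapping is None:
        mapping = {}
    mapping[node] = nprocs
    children = tree.get(node, [])
    if children and nprocs > 1:
        total = sum(work[c] for c in children)
        for c in children:
            share = max(1, round(nprocs * work[c] / total))
            proportional_mapping(tree, work, c, share, mapping)
    return mapping

# Balanced binary tree as in the figure: 512 -> 256/256 -> 128 each.
tree = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
work = {n: 1.0 for n in range(7)}
print(proportional_mapping(tree, work, 0, 512))
# {0: 512, 1: 256, 3: 128, 4: 128, 2: 256, 5: 128, 6: 128}
```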

Proportional mapping vs postorder traversal (2/2)

Postorder traversal:

[Figure: the same elimination tree, traversed node by node in postorder; all 512 processors work on every node, at every depth d=0 to d=4.]

Traversal
  - Postorder traversal, node by node;
  - All processors on each node.

Advantages and drawbacks
  − Only fine-grain parallelism;
  + High memory efficiency.


A new memory-aware algorithm

Memory-aware mapping algorithm

[Figure: the same elimination tree. The 512 processors on the root are split over child subtrees (e.g. 256 and 256) only where memory allows it; where it does not, splitting stops and the whole subtree keeps all the processors mapped on it.]

Mapping
  - Initially: all processors on the root node;
  - Recursively split the set of processors over the child subtrees if memory allows for it.

Advantages
  + Robust: completion within memory is guaranteed (provided the memory M0 per processor satisfies M0 ≥ S_seq/p);
  + Efficient: available memory provides coarse-grain parallelism.
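The sketch below combines the two previous ones to illustrate this rule. It is one reading of the slide's description, not MUMPS's implementation: peak[n] stands for the serial active-storage peak of the subtree rooted at n, M0 for the memory available per processor, and the fit test is an assumed criterion for "memory allows for it".

```python
# Memory-aware mapping: split the processors of a node over its children
# (proportionally to workload) only if every child subtree then fits in
# the memory of its processor share; otherwise keep all processors on the
# whole subtree, which is then processed in postorder (efficiency ~ 1).

def memory_aware_mapping(tree, work, peak, node, nprocs, M0, mapping=None):
    if mapping is None:
        mapping = {}
    mapping[node] = nprocs
    children = tree.get(node, [])
    if not children or nprocs <= 1:
        return mapping
    total = sum(work[c] for c in children)
    shares = {c: max(1, round(nprocs * work[c] / total)) for c in children}
    if all(peak[c] <= shares[c] * M0 for c in children):  # memory allows it
        for c in children:
            memory_aware_mapping(tree, work, peak, c, shares[c], M0, mapping)
    # else: splitting stops; the subtree rooted at `node` keeps all nprocs
    # processors and is traversed in postorder.
    return mapping
```

Because splitting stops whenever a child subtree would not fit in its share of memory, the worst case degenerates to the all-processor postorder traversal of the previous section, whose per-processor storage is about S_seq/p; hence the M0 ≥ S_seq/p condition in the guarantee.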


Preliminary results

  - Excellent memory scalability:
      memory efficiency close to 1.
  - Competitive (time) efficiency:
      close to proportional mapping (if enough memory);
      memory provides coarse-grain parallelism.

[Figure: average number of processors per node (normalized) as a function of the distance to the root node (depth), for proportional mapping and for the memory-aware algorithm with M0 = 1/32, 2/32, 5/32 and 8/32 of the sequential peak.]


Conclusion

Prototype of a memory-aware algorithm
  - Maximizes the amount of coarse-grain parallelism with respect to the amount of memory available per processor/core;
  - New static mapping implemented, with constraints on the dynamic schedulers; experimented within the out-of-core (OOC) version of MUMPS;
  - Very good memory scalability obtained.

On-going work
  - Further tuning and validation;
  - Generalization to the in-core case;
  - Reinjecting dynamic information into the schedulers.
