
Memory-Aware Scheduling for Sparse Direct Methods

Emmanuel AGULLO, ICL - University of Tennessee
Jean-Yves L’EXCELLENT, LIP - INRIA

Abdou GUERMOUCHE, LaBRI, Universite de Bordeaux

MS31 Parallel Sparse Matrix Computations and Enabling Algorithms

SIAM CSE 2009, Miami, FL, March 2-6, 2009


Context

Solving sparse linear systems

Ax = b ⇒ Direct methods: A = LU

Typical matrix: BRGM matrix
  - 3.7 × 10^6 variables;
  - 156 × 10^6 non-zeros in A;
  - 4.5 × 10^9 non-zeros in LU;
  - 26.5 × 10^12 flops.

Hardware paradigm
  - Many-core architecture;
  - Large global amount of memory;
  - Limited memory per core.

Software challenge
  → Need for algorithms whose memory usage scales with the number of processors.
  - Case study: MUMPS


Outline

1. MUMPS
2. Limits to memory scalability
3. A new memory-aware algorithm
4. Preliminary results
5. Conclusion


MUMPS

MUMPS: a MUltifrontal Massively Parallel sparse direct Solver

Solution of large sparse linear systems with:
  - symmetric positive definite matrices;
  - general symmetric matrices;
  - general unsymmetric matrices.

Implementation
  - Distributed multifrontal solver (F90, MPI based);
  - Dynamic distributed scheduling;
  - Use of BLAS, BLACS, ScaLAPACK.

Interfaces
  - Fortran, C, Matlab, Scilab, Visual Studio.

The multifrontal method (Duff, Reid ’83)

[Figure: a 5×5 sparse matrix A and its filled factors L+U−I (legend: non-zero / fill-in), with the corresponding elimination tree; each tree node carries a frontal matrix split into a factor part and a contribution block.]

Storage divided into two parts:
  - Factors: systematically written to disk;
  - Active storage: kept in memory (active frontal matrix + stack of contribution blocks).


Memory behaviour (serial postorder traversal)

[Figure: evolution of the active storage (stack of contribution blocks + active frontal matrix) while a three-node elimination tree (leaves 1 and 2, root 3) is processed in postorder; see the sketch below.]

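To make the mechanism above concrete, here is a minimal Python sketch (not MUMPS code; the tree shape and node sizes are invented for illustration) that replays a serial postorder traversal and tracks the stack of contribution blocks:

```python
# Peak active storage of a serial postorder traversal of an elimination
# tree. front[n] = size of the frontal matrix of node n; cb[n] = size of
# the contribution block that n pushes onto the stack. Illustrative only.

def peak_active_storage(tree, front, cb, root):
    stack = 0   # current size of the stack of contribution blocks
    peak = 0    # peak of (stack + active frontal matrix)

    def visit(n):
        nonlocal stack, peak
        for c in tree.get(n, []):
            visit(c)
        # At assembly time, the children's contribution blocks still sit
        # on the stack while the new frontal matrix is allocated.
        peak = max(peak, stack + front[n])
        stack -= sum(cb[c] for c in tree.get(n, []))  # consume children
        stack += cb[n]                                # push own block

    visit(root)
    return peak

# Three-node tree as on the slide: leaves 1 and 2, root 3.
tree = {3: [1, 2]}
front = {1: 4, 2: 4, 3: 9}
cb = {1: 1, 2: 1, 3: 0}   # the root produces no contribution block
print(peak_active_storage(tree, front, cb, 3))  # 11 = cb[1] + cb[2] + front[3]
```

The peak of this curve, not the total factor size, is what must fit in core once factors are systematically written to disk.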

Memory efficiency of MUMPS

Definition: memory efficiency on p processors (or cores)

    e(p) = S_seq / (p × S_max(p)),

where S_seq is the storage of the serial execution and S_max(p) the maximum storage over the p processors.

Results: memory efficiency of MUMPS (with factors on disk)

    Number p of processors    16     32     64     128
    AUDI_KW_1                 0.16   0.12   0.13   0.10
    CONESHL_MOD               0.28   0.28   0.22   0.19
    CONV3D64                  0.42   0.40   0.41   0.37
    QIMONDA07                 0.30   0.18   0.11    -
    ULTRASOUND80              0.32   0.31   0.30   0.26
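As a worked illustration of the definition (all numbers hypothetical; only the resulting 0.16 echoes the AUDI_KW_1 column above):

```python
# e(p) = S_seq / (p * S_max(p)). A value of 1 means the serial storage is
# split perfectly across the p processors; small values mean the busiest
# processor holds a disproportionate share.

def memory_efficiency(s_seq, s_max, p):
    return s_seq / (p * s_max)

print(memory_efficiency(16e9, 1.0e9, 16))   # 1.0  : perfect splitting
print(memory_efficiency(16e9, 6.25e9, 16))  # 0.16 : one processor holds ~39% of S_seq
```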


Limits to memory scalability

Parallel multifrontal scheme

  - Type 1: nodes processed on a single processor;
  - Type 2: nodes processed with a parallel 1D blocked factorization;
  - Type 3: parallel 2D cyclic factorization (root node).

[Figure: an elimination tree mapped onto processors P0-P3 over time. The mapping of the bottom subtrees and the 2D decomposition of the root are static; in the 1D pipelined factorizations, the slave processors (e.g. P3 and P0, chosen by master P2) are selected dynamically at runtime.]

Limits to memory scalability

  - Many simultaneous active tasks;
  - Large master tasks;
  - Large subtrees;
  - Proportional mapping.


Proportional mapping vs postorder traversal (1/2)

Proportional mapping:

[Figure: an elimination tree of depth 4 (d=0 to d=4). The 512 processors on the root are split into 256 and 256 over its two children at d=1, then 128 on each of the four subtrees at d=2, and so on recursively.]

Mapping
  - Initially: all processors on the root node;
  - Recursively split the set of processors over the child subtrees (a sketch follows this slide).

Advantages and drawbacks
  + Fine-grain + coarse-grain parallelism;
  − Bad memory efficiency.
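A minimal Python sketch of the splitting rule just described (illustrative only: work[n] stands for the workload of the subtree rooted at n, and the rounding of shares is left uncorrected, whereas a real mapping must make the shares sum exactly to nprocs):

```python
# Proportional mapping: assign nprocs to a node, then split them among the
# child subtrees in proportion to their workloads, recursively.

def proportional_mapping(tree, work, node, nprocs, mapping=None):
    if mapping is None:
        mapping = {}
    mapping[node] = nprocs
    children = tree.get(node, [])
    if children and nprocs > 1:
        total = sum(work[c] for c in children)
        for c in children:
            share = max(1, round(nprocs * work[c] / total))
            proportional_mapping(tree, work, c, share, mapping)
    return mapping

# Balanced binary tree as in the figure: 512 -> 256/256 -> 128 each.
tree = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
work = {n: 1.0 for n in range(7)}
print(proportional_mapping(tree, work, 0, 512))
# {0: 512, 1: 256, 3: 128, 4: 128, 2: 256, 5: 128, 6: 128}
```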

Proportional mapping vs postorder traversal (2/2)

Postorder traversal:

[Figure: the same elimination tree, traversed node by node in postorder; all 512 processors work on every node, at every depth d=0 to d=4.]

Traversal
  - Postorder traversal, node by node;
  - All processors on each node.

Advantages and drawbacks
  − Only fine-grain parallelism;
  + High memory efficiency.


A new memory-aware algorithm

Memory-aware mapping algorithm

[Figure: the same elimination tree. The 512 processors on the root are split over child subtrees (e.g. 256 and 256) only where memory allows it; where it does not, splitting stops and the whole subtree keeps all the processors mapped on it.]

Mapping
  - Initially: all processors on the root node;
  - Recursively split the set of processors over the child subtrees if memory allows for it.

Advantages
  + Robust: completion within memory is guaranteed (provided the memory M0 per processor satisfies M0 ≥ S_seq/p);
  + Efficient: available memory provides coarse-grain parallelism.
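The sketch below combines the two previous ones to illustrate this rule. It is one reading of the slide's description, not MUMPS's implementation: peak[n] stands for the serial active-storage peak of the subtree rooted at n, M0 for the memory available per processor, and the fit test is an assumed criterion for "memory allows for it".

```python
# Memory-aware mapping: split the processors of a node over its children
# (proportionally to workload) only if every child subtree then fits in
# the memory of its processor share; otherwise keep all processors on the
# whole subtree, which is then processed in postorder (efficiency ~ 1).

def memory_aware_mapping(tree, work, peak, node, nprocs, M0, mapping=None):
    if mapping is None:
        mapping = {}
    mapping[node] = nprocs
    children = tree.get(node, [])
    if not children or nprocs <= 1:
        return mapping
    total = sum(work[c] for c in children)
    shares = {c: max(1, round(nprocs * work[c] / total)) for c in children}
    if all(peak[c] <= shares[c] * M0 for c in children):  # memory allows it
        for c in children:
            memory_aware_mapping(tree, work, peak, c, shares[c], M0, mapping)
    # else: splitting stops; the subtree rooted at `node` keeps all nprocs
    # processors and is traversed in postorder.
    return mapping
```

Because splitting stops whenever a child subtree would not fit in its share of memory, the worst case degenerates to the all-processor postorder traversal of the previous section, whose per-processor storage is about S_seq/p; hence the M0 ≥ S_seq/p condition in the guarantee.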


Preliminary results

  - Excellent memory scalability:
      memory efficiency close to 1.
  - Competitive (time) efficiency:
      close to proportional mapping (if enough memory);
      memory provides coarse-grain parallelism.

[Figure: average number of processors per node (normalized) as a function of the distance to the root node (depth), for proportional mapping and for the memory-aware algorithm with M0 = 1/32, 2/32, 5/32 and 8/32 of the sequential peak.]


Conclusion

Prototype of a memory-aware algorithm
  - Maximizes the amount of coarse-grain parallelism with respect to the amount of memory available per processor/core;
  - New static mapping implemented, with constraints on the dynamic schedulers; experimented within the out-of-core (OOC) version of MUMPS;
  - Very good memory scalability obtained.

On-going work
  - Further tuning and validation;
  - Generalization to the in-core case;
  - Reinjecting dynamic information into the schedulers.
