Optimization and Parallelization of FIND Algorithm

Song Li, Eric Darve

Institute for Computational and Mathematical Engineering, Stanford University
lisong@stanford.edu

SIAM CSE09, March 4, 2009

Outline

1 Background

2 Serial FIND (Fast Inverse using Nested Dissection)

3 Simulation Results

4 Parallel Methods

Introduction

Modeling the current through nano-devices by the Non-Equilibrium Green's Function approach

System of Schrödinger-Poisson equations

Best known algorithm (RGF) has running time O(nx³ ny)

Our method (FIND): O(nx² ny)

Other devices: nanotubes and nanowires

The Math Problem

What we want: the diagonal of Gr = A⁻¹

What we have: a sparse matrix A from a discretized 2D mesh

Example: a 4×5 mesh (nx = 4, ny = 5) gives a 20×20 matrix A
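To make the setting concrete, here is a minimal sketch (not the FIND code): it builds a sparse matrix with the mesh connectivity of an nx × ny grid, assuming a standard 5-point stencil with a small complex shift as a stand-in for the actual discretized operator, and computes the diagonal of the inverse by brute force as a reference.

```python
# Minimal sketch: a sparse matrix A with 2D-mesh connectivity and the brute-force
# diag(A^-1) that FIND is designed to return without dense inversion.
# The 5-point stencil and the complex shift are assumptions for illustration.
import numpy as np
import scipy.sparse as sp

nx, ny = 4, 5                                   # 4 x 5 mesh -> 20 x 20 matrix A
Ix, Iy = sp.identity(nx), sp.identity(ny)
Tx = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(nx, nx))
Ty = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(ny, ny))
A = sp.kron(Iy, Tx) + sp.kron(Ty, Ix)           # nearest-neighbor coupling on the mesh
A = A + 0.1j * sp.identity(nx * ny)             # small complex shift keeps A invertible

diag_inv = np.diag(np.linalg.inv(A.toarray()))  # O((nx*ny)^3): only viable for tiny meshes
print(diag_inv[:5])
```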

Outline

1 Background

2 Serial FIND (Fast Inverse using Nested Dissection)

3 Simulation Results

4 Parallel Methods

Key Observations

Last entry in A⁻¹ can be obtained through LU factorization: (A⁻¹)nn = (U⁻¹)nn = (Unn)⁻¹

Obtain all the diagonal entries through multiple factorizations

Local connectivity ⇒ problem decomposition: partial factorizations are feasible

Proper ordering makes most of them identical: subproblems overlap ⇒ dynamic programming

Computational cost for all the diagonal entries of the inverse is of the same order as a single LU factorization!
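The first observation can be checked numerically. The sketch below uses a hand-rolled unpivoted Doolittle LU on a diagonally dominant test matrix (an assumption for simplicity; the identity holds for any LU factorization without row pivoting):

```python
# Sketch: verify (A^-1)_nn = 1/U_nn for an unpivoted LU factorization A = LU.
import numpy as np

def lu_nopivot(A):
    """Return (L, U) with A = L @ U, L unit lower triangular, no pivoting."""
    n = A.shape[0]
    L, U = np.eye(n), A.astype(float).copy()
    for k in range(n - 1):
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]
        U[k + 1:, k:] -= np.outer(L[k + 1:, k], U[k, k:])
    return L, U

rng = np.random.default_rng(0)
n = 8
A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally dominant, no pivoting needed
L, U = lu_nopivot(A)
print(np.linalg.inv(A)[n - 1, n - 1], 1.0 / U[n - 1, n - 1])  # the two values agree
```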

Overall Structure: Partition Tree

Order the mesh nodes in a way similar to nested dissection

Partition the whole mesh and form a tree structure to exploit the subproblem overlap
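A minimal sketch of such a partition tree, assuming a simple recursive bisection of the mesh columns (the actual FIND partitioning rule may differ):

```python
# Sketch: recursively bisect the mesh into a binary partition tree.
# Each tree node holds the mesh columns it owns; leaves are the small clusters
# on which partial eliminations are performed first.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Cluster:
    cols: List[int]                      # mesh columns owned by this cluster
    left: Optional["Cluster"] = None
    right: Optional["Cluster"] = None

def build_tree(cols, leaf_size=1):
    node = Cluster(cols)
    if len(cols) > leaf_size:            # split until clusters are small enough
        mid = len(cols) // 2
        node.left = build_tree(cols[:mid], leaf_size)
        node.right = build_tree(cols[mid:], leaf_size)
    return node

root = build_tree(list(range(8)), leaf_size=2)
print(root.left.cols, root.right.cols)   # [0, 1, 2, 3] [4, 5, 6, 7]
```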

One Step of Elimination

Gaussian elimination of the inner nodes i of a cluster, with boundary nodes b and outer nodes o:

A*(b, b) := A(b, b) − A(b, i) A(i, i)⁻¹ A(i, b)

[ A(i, i)  A(i, b)  0       ]                    [ A(i, i)  A(i, b)   0       ]
[ A(b, i)  A(b, b)  A(b, o) ]   elimination ⇒    [ 0        A*(b, b)  A(b, o) ]
[ 0        A(o, b)  A(o, o) ]                    [ 0        A(o, b)   A(o, o) ]

(In the original slide the index sets are color-coded: eliminated node, inner node, boundary node, outer node.)
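The update above is a dense Schur complement on the cluster's blocks. A small numpy sketch, with illustrative block sizes and a solve-based formulation (assumptions for this example, not the actual implementation):

```python
# Sketch: one elimination step as a Schur complement.
# A_ii: inner-inner block, A_ib/A_bi: inner-boundary couplings, A_bb: boundary block.
import numpy as np

def eliminate_inner(A_ii, A_ib, A_bi, A_bb):
    """Return A*(b,b) = A(b,b) - A(b,i) A(i,i)^-1 A(i,b)."""
    return A_bb - A_bi @ np.linalg.solve(A_ii, A_ib)

rng = np.random.default_rng(1)
ni, nb = 3, 4
A_ii = rng.standard_normal((ni, ni)) + ni * np.eye(ni)
A_ib, A_bi = rng.standard_normal((ni, nb)), rng.standard_normal((nb, ni))
A_bb = rng.standard_normal((nb, nb))
S = eliminate_inner(A_ii, A_ib, A_bi, A_bb)
print(S.shape)   # (4, 4): boundary-boundary block after the inner nodes are eliminated
```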

Two Full Elimination Processes

Keep partitioning the mesh to get small clusters

Store the results of each partial elimination

The partial results can be reused

(Figure: successive stages of the two elimination processes on the mesh, with nodes color-coded as eliminated, inner, boundary, outer, and target nodes.)
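The reuse is easiest to see in 1D, where the clusters form a chain and A is block tridiagonal (an assumption for this sketch; in 1D the recurrences coincide with RGF, while FIND organizes the same reuse on a 2D partition tree). Each partial elimination from the left and from the right is computed once and then combined for every target cluster:

```python
# Sketch: store partial eliminations once, reuse them for every target cluster (1D case).
import numpy as np

rng = np.random.default_rng(2)
nb, m = 6, 3                                    # 6 clusters, 3 nodes each
D = [rng.standard_normal((m, m)) + m * np.eye(m) for _ in range(nb)]
L = [rng.standard_normal((m, m)) for _ in range(nb - 1)]   # A[k+1, k] blocks
U = [rng.standard_normal((m, m)) for _ in range(nb - 1)]   # A[k, k+1] blocks

# Forward sweep: eliminate clusters 0..k-1 into the Schur complement SigL[k].
SigL = [np.zeros((m, m)) for _ in range(nb)]
for k in range(1, nb):
    SigL[k] = L[k - 1] @ np.linalg.solve(D[k - 1] - SigL[k - 1], U[k - 1])
# Backward sweep: eliminate clusters k+1..nb-1 into SigR[k].
SigR = [np.zeros((m, m)) for _ in range(nb)]
for k in range(nb - 2, -1, -1):
    SigR[k] = U[k] @ np.linalg.solve(D[k + 1] - SigR[k + 1], L[k])

# Diagonal block k of A^-1 from the two stored partial eliminations.
Ainv_kk = [np.linalg.inv(D[k] - SigL[k] - SigR[k]) for k in range(nb)]

# Check block 2 against a dense inverse.
A = np.zeros((nb * m, nb * m))
for k in range(nb):
    A[k*m:(k+1)*m, k*m:(k+1)*m] = D[k]
for k in range(nb - 1):
    A[(k+1)*m:(k+2)*m, k*m:(k+1)*m] = L[k]
    A[k*m:(k+1)*m, (k+1)*m:(k+2)*m] = U[k]
Aref = np.linalg.inv(A)
print(np.allclose(Ainv_kk[2], Aref[2*m:3*m, 2*m:3*m]))   # True
```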

Extensions and Optimizations

G< = A⁻¹ Σ A⁻† has a similar sparsity pattern, so our method is applicable as well

Also applies to computing off-diagonal entries

Extra sparsity: in the one-step elimination A*(b, b) := A(b, b) − A(b, i) A(i, i)⁻¹ A(i, b), these blocks are themselves sparse; exploit this to optimize

The elimination preserves symmetry, and this further reduces cost
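For reference, the quantity in the first bullet computed by brute force; Σ below is a random Hermitian matrix standing in as a placeholder for the self-energy (an assumption of this sketch):

```python
# Sketch: brute-force diag(G<) with G< = A^-1 Sigma A^-dagger, as a reference for
# what the extended FIND computes without forming dense inverses.
import numpy as np

rng = np.random.default_rng(3)
n = 20
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)) + n * np.eye(n)
Sigma = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Sigma = Sigma + Sigma.conj().T                  # Hermitian stand-in for the self-energy

Ainv = np.linalg.inv(A)
diag_Gless = np.diag(Ainv @ Sigma @ Ainv.conj().T)
print(diag_Gless[:3])
```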

Outline

1 Background

2 Serial FIND (Fast Inverse using Nested Dissection)

3 Simulation Results

4 Parallel Methods

Simulation Device

Running Time Comparison: Log-Log Scale with Reference Lines

(Figure: log-log plot of running time in seconds versus n (= Nx = Ny), from 64 to 1024, comparing FIND and RGF; the reference lines show FIND scaling as O(n³) and RGF as O(n⁴).)

Memory Cost Comparison

FIND: O(N log(N))

RGF: O(N^(3/2))

Outline

1 Background

2 Serial FIND (Fast Inverse using Nested Dissection)

3 Simulation Results

4 Parallel Methods

How to Parallelize?

Straightforward for leaf clusters

Top-level clusters dominate the running time but offer a lower degree of parallelism

Use the idle processors for redundant computations

More floating-point operations, but shorter wall-clock time

Works for 1D, 2D, and 3D domains

Problem and Processor Settings

(Figure: a 1D domain divided into 16 clusters, one per processor P0 through P15.)

16 processors, 16 clusters in 1D

One target cluster per processor

Keep merging all the other clusters until they are all merged as the complement of the target cluster

Eliminate the merged complement clusters and compute the inverse

Detailed Merging Process

(Figure: the 16 processors' merged subdomains at successive stages of the doubling process.)

Each processor keeps the complement of its target cluster with respect to the current subdomain

Start with subdomains of size 2

Expand to subdomains of size 4; some processors are idle

Use the idle processors to prepare for the next subdomain expansion

Continue until the subdomain is expanded to the whole domain

Additional speedup of a factor of 2
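A small sketch of the index bookkeeping behind this doubling scheme (an illustration of the description above, not the actual parallel code): at each level, processor p holds the complement of its target cluster within its current subdomain, and doubling the subdomain merges in the partner half's clusters.

```python
# Sketch: which clusters each processor has merged at each level of the doubling scheme.
P = 16                                            # processors = 1D clusters

def complement_within_subdomain(p, level):
    """Clusters merged by processor p when its subdomain has size 2**level."""
    size = 2 ** level
    start = (p // size) * size                    # first cluster of p's subdomain
    return [c for c in range(start, start + size) if c != p]

for level in range(1, 5):                         # subdomain sizes 2, 4, 8, 16
    merged = complement_within_subdomain(3, level)
    print(f"subdomain size {2**level:2d}: P3 has merged clusters {merged}")
# At the last level the subdomain is the whole domain, so P3 holds the full complement
# of cluster 3 and can eliminate it to obtain its diagonal entries of the inverse.
```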

Communication Pattern

Summary

Direct method for fast inverse

Two extensions, two optimizations

An optimal parallel scheme

Collaboration with other groups for more applications
