Optimization and Parallelization of FIND Algorithm

Song Li, Eric Darve

Institute for Computational and Mathematical Engineering, Stanford University
lisong@stanford.edu

SIAM CSE09, March 4, 2009

Outline

1 Background

2 Serial FIND (Fast Inverse using Nested Dissection)

3 Simulation Results

4 Parallel Methods

Introduction

Modeling the current through nano-devices by the Non-Equilibrium Green's Function approach

System of Schrödinger-Poisson equations

Best known algorithm (RGF) has running time O(nx³ ny)

Our method (FIND): O(nx² ny)

Other devices: nanotubes and nanowires

The Math Problem

What we want: the diagonal of Gr = A⁻¹

What we have: a sparse matrix A from a discretized 2D mesh

Example: a 4×5 mesh (nx = 4, ny = 5) gives a 20×20 matrix A
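To make the setting concrete, here is a minimal sketch (not the FIND code): it builds a sparse matrix with the mesh connectivity of an nx × ny grid, assuming a standard 5-point stencil with a small complex shift as a stand-in for the actual discretized operator, and computes the diagonal of the inverse by brute force as a reference.

```python
# Minimal sketch: a sparse matrix A with 2D-mesh connectivity and the brute-force
# diag(A^-1) that FIND is designed to return without dense inversion.
# The 5-point stencil and the complex shift are assumptions for illustration.
import numpy as np
import scipy.sparse as sp

nx, ny = 4, 5                                   # 4 x 5 mesh -> 20 x 20 matrix A
Ix, Iy = sp.identity(nx), sp.identity(ny)
Tx = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(nx, nx))
Ty = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(ny, ny))
A = sp.kron(Iy, Tx) + sp.kron(Ty, Ix)           # nearest-neighbor coupling on the mesh
A = A + 0.1j * sp.identity(nx * ny)             # small complex shift keeps A invertible

diag_inv = np.diag(np.linalg.inv(A.toarray()))  # O((nx*ny)^3): only viable for tiny meshes
print(diag_inv[:5])
```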

Outline

1 Background

2 Serial FIND (Fast Inverse using Nested Dissection)

3 Simulation Results

4 Parallel Methods

Key Observations

Last entry in A⁻¹ can be obtained through LU factorization: (A⁻¹)nn = (U⁻¹)nn = (Unn)⁻¹

Obtain all the diagonal entries through multiple factorizations

Local connectivity ⇒ problem decomposition: partial factorizations are feasible

Proper ordering makes most of them identical: subproblems overlap ⇒ dynamic programming

Computational cost for all the diagonal entries of the inverse is of the same order as a single LU factorization!
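The first observation can be checked numerically. The sketch below uses a hand-rolled unpivoted Doolittle LU on a diagonally dominant test matrix (an assumption for simplicity; the identity holds for any LU factorization without row pivoting):

```python
# Sketch: verify (A^-1)_nn = 1/U_nn for an unpivoted LU factorization A = LU.
import numpy as np

def lu_nopivot(A):
    """Return (L, U) with A = L @ U, L unit lower triangular, no pivoting."""
    n = A.shape[0]
    L, U = np.eye(n), A.astype(float).copy()
    for k in range(n - 1):
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]
        U[k + 1:, k:] -= np.outer(L[k + 1:, k], U[k, k:])
    return L, U

rng = np.random.default_rng(0)
n = 8
A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally dominant, no pivoting needed
L, U = lu_nopivot(A)
print(np.linalg.inv(A)[n - 1, n - 1], 1.0 / U[n - 1, n - 1])  # the two values agree
```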

Overall Structure: Partition Tree

Order the mesh nodes in a way similar to nested dissection

Partition the whole mesh and form a tree structure to exploit the subproblem overlap
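A minimal sketch of such a partition tree, assuming a simple recursive bisection of the mesh columns (the actual FIND partitioning rule may differ):

```python
# Sketch: recursively bisect the mesh into a binary partition tree.
# Each tree node holds the mesh columns it owns; leaves are the small clusters
# on which partial eliminations are performed first.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Cluster:
    cols: List[int]                      # mesh columns owned by this cluster
    left: Optional["Cluster"] = None
    right: Optional["Cluster"] = None

def build_tree(cols, leaf_size=1):
    node = Cluster(cols)
    if len(cols) > leaf_size:            # split until clusters are small enough
        mid = len(cols) // 2
        node.left = build_tree(cols[:mid], leaf_size)
        node.right = build_tree(cols[mid:], leaf_size)
    return node

root = build_tree(list(range(8)), leaf_size=2)
print(root.left.cols, root.right.cols)   # [0, 1, 2, 3] [4, 5, 6, 7]
```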

One Step of Elimination

Gaussian elimination of the inner nodes i of a cluster, with boundary nodes b and outer nodes o:

A*(b, b) := A(b, b) − A(b, i) A(i, i)⁻¹ A(i, b)

[ A(i, i)  A(i, b)  0       ]                    [ A(i, i)  A(i, b)   0       ]
[ A(b, i)  A(b, b)  A(b, o) ]   elimination ⇒    [ 0        A*(b, b)  A(b, o) ]
[ 0        A(o, b)  A(o, o) ]                    [ 0        A(o, b)   A(o, o) ]

(In the original slide the index sets are color-coded: eliminated node, inner node, boundary node, outer node.)
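The update above is a dense Schur complement on the cluster's blocks. A small numpy sketch, with illustrative block sizes and a solve-based formulation (assumptions for this example, not the actual implementation):

```python
# Sketch: one elimination step as a Schur complement.
# A_ii: inner-inner block, A_ib/A_bi: inner-boundary couplings, A_bb: boundary block.
import numpy as np

def eliminate_inner(A_ii, A_ib, A_bi, A_bb):
    """Return A*(b,b) = A(b,b) - A(b,i) A(i,i)^-1 A(i,b)."""
    return A_bb - A_bi @ np.linalg.solve(A_ii, A_ib)

rng = np.random.default_rng(1)
ni, nb = 3, 4
A_ii = rng.standard_normal((ni, ni)) + ni * np.eye(ni)
A_ib, A_bi = rng.standard_normal((ni, nb)), rng.standard_normal((nb, ni))
A_bb = rng.standard_normal((nb, nb))
S = eliminate_inner(A_ii, A_ib, A_bi, A_bb)
print(S.shape)   # (4, 4): boundary-boundary block after the inner nodes are eliminated
```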

Two Full Elimination Processes

Keep partitioning the mesh to get small clusters

Store the results of each partial elimination

The partial results can be reused

(Figure: successive stages of the two elimination processes on the mesh, with nodes color-coded as eliminated, inner, boundary, outer, and target nodes.)
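The reuse is easiest to see in 1D, where the clusters form a chain and A is block tridiagonal (an assumption for this sketch; in 1D the recurrences coincide with RGF, while FIND organizes the same reuse on a 2D partition tree). Each partial elimination from the left and from the right is computed once and then combined for every target cluster:

```python
# Sketch: store partial eliminations once, reuse them for every target cluster (1D case).
import numpy as np

rng = np.random.default_rng(2)
nb, m = 6, 3                                    # 6 clusters, 3 nodes each
D = [rng.standard_normal((m, m)) + m * np.eye(m) for _ in range(nb)]
L = [rng.standard_normal((m, m)) for _ in range(nb - 1)]   # A[k+1, k] blocks
U = [rng.standard_normal((m, m)) for _ in range(nb - 1)]   # A[k, k+1] blocks

# Forward sweep: eliminate clusters 0..k-1 into the Schur complement SigL[k].
SigL = [np.zeros((m, m)) for _ in range(nb)]
for k in range(1, nb):
    SigL[k] = L[k - 1] @ np.linalg.solve(D[k - 1] - SigL[k - 1], U[k - 1])
# Backward sweep: eliminate clusters k+1..nb-1 into SigR[k].
SigR = [np.zeros((m, m)) for _ in range(nb)]
for k in range(nb - 2, -1, -1):
    SigR[k] = U[k] @ np.linalg.solve(D[k + 1] - SigR[k + 1], L[k])

# Diagonal block k of A^-1 from the two stored partial eliminations.
Ainv_kk = [np.linalg.inv(D[k] - SigL[k] - SigR[k]) for k in range(nb)]

# Check block 2 against a dense inverse.
A = np.zeros((nb * m, nb * m))
for k in range(nb):
    A[k*m:(k+1)*m, k*m:(k+1)*m] = D[k]
for k in range(nb - 1):
    A[(k+1)*m:(k+2)*m, k*m:(k+1)*m] = L[k]
    A[k*m:(k+1)*m, (k+1)*m:(k+2)*m] = U[k]
Aref = np.linalg.inv(A)
print(np.allclose(Ainv_kk[2], Aref[2*m:3*m, 2*m:3*m]))   # True
```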

Extensions and Optimizations

G< = A⁻¹ Σ A⁻† has a similar sparsity pattern, so our method is applicable as well

Also applies to computing off-diagonal entries

Extra sparsity: in the one-step elimination A*(b, b) := A(b, b) − A(b, i) A(i, i)⁻¹ A(i, b), these blocks are themselves sparse; exploit this to optimize

The elimination preserves symmetry, and this further reduces cost
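For reference, the quantity in the first bullet computed by brute force; Σ below is a random Hermitian matrix standing in as a placeholder for the self-energy (an assumption of this sketch):

```python
# Sketch: brute-force diag(G<) with G< = A^-1 Sigma A^-dagger, as a reference for
# what the extended FIND computes without forming dense inverses.
import numpy as np

rng = np.random.default_rng(3)
n = 20
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)) + n * np.eye(n)
Sigma = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Sigma = Sigma + Sigma.conj().T                  # Hermitian stand-in for the self-energy

Ainv = np.linalg.inv(A)
diag_Gless = np.diag(Ainv @ Sigma @ Ainv.conj().T)
print(diag_Gless[:3])
```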

Outline

1 Background

2 Serial FIND (Fast Inverse using Nested Dissection)

3 Simulation Results

4 Parallel Methods

Simulation Device

Running Time Comparison: Log-Log Scale with Reference Lines

(Figure: log-log plot of running time in seconds versus n (= Nx = Ny), from 64 to 1024, comparing FIND and RGF; the reference lines show FIND scaling as O(n³) and RGF as O(n⁴).)

Memory Cost Comparison

FIND: O(N log(N))

RGF: O(N^(3/2))

Outline

1 Background

2 Serial FIND (Fast Inverse using Nested Dissection)

3 Simulation Results

4 Parallel Methods

How to Parallelize?

Straightforward for leaf clusters

Top-level clusters dominate the running time but offer a lower degree of parallelism

Use the idle processors for redundant computations

More floating-point operations, but shorter wall-clock time

Works for 1D, 2D, and 3D domains

Problem and Processor Settings

(Figure: a 1D domain divided into 16 clusters, one per processor P0 through P15.)

16 processors, 16 clusters in 1D

One target cluster per processor

Keep merging all the other clusters until they are all merged as the complement of the target cluster

Eliminate the merged complement clusters and compute the inverse

Detailed Merging Process

(Figure: the 16 processors' merged subdomains at successive stages of the doubling process.)

Each processor keeps the complement of its target cluster with respect to the current subdomain

Start with subdomains of size 2

Expand to subdomains of size 4; some processors are idle

Use the idle processors to prepare for the next subdomain expansion

Continue until the subdomain is expanded to the whole domain

Additional speedup of a factor of 2
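A small sketch of the index bookkeeping behind this doubling scheme (an illustration of the description above, not the actual parallel code): at each level, processor p holds the complement of its target cluster within its current subdomain, and doubling the subdomain merges in the partner half's clusters.

```python
# Sketch: which clusters each processor has merged at each level of the doubling scheme.
P = 16                                            # processors = 1D clusters

def complement_within_subdomain(p, level):
    """Clusters merged by processor p when its subdomain has size 2**level."""
    size = 2 ** level
    start = (p // size) * size                    # first cluster of p's subdomain
    return [c for c in range(start, start + size) if c != p]

for level in range(1, 5):                         # subdomain sizes 2, 4, 8, 16
    merged = complement_within_subdomain(3, level)
    print(f"subdomain size {2**level:2d}: P3 has merged clusters {merged}")
# At the last level the subdomain is the whole domain, so P3 holds the full complement
# of cluster 3 and can eliminate it to obtain its diagonal entries of the inverse.
```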

Communication Pattern

Summary

Direct method for fast inverse

Two extensions, two optimizations

An optimal parallel scheme

Collaboration with other groups for more applications
