A Software Strategy for Simple Parallelization of Sequential PDE Solvers

Hans Petter Langtangen
Xing Cai

Dept. of Informatics, University of Oslo
IMACS 2000
Outline of the Talk
• Background
• A generic parallelization technique
• Implementation aspects
• Numerical experiments
The Question

Starting point: a sequential code. How do we parallelize it?

The resulting parallel solvers should have
• good parallel efficiency
• good overall numerical performance

We need
• a good parallelization strategy
• a good and simple implementation of that strategy
Problem Domain
• Partial differential equations
• Finite elements/differences
• Communication through message passing
Domain Decomposition
• Solves the original large problem by iteratively solving many smaller subproblems
• Can be used as a solution method or as a preconditioner
• Flexibility: localized treatment of irregular geometries, singularities, etc.
• Very efficient numerical methods, even on sequential computers
• Suitable for coarse-grained parallelization
Overlapping DD

Alternating Schwarz method for two subdomains.

Example: solving an elliptic boundary value problem

  $Au = f$ in $\Omega = \Omega_1 \cup \Omega_2$,   $u = g$ on $\partial\Omega$

Generate a sequence of approximations $u^0, u^1, \ldots, u^n, \ldots$ by alternating subdomain solves, where $\Gamma_i$ denotes the artificial internal boundary of $\Omega_i$:

  $Au_1^{n+1} = f$ in $\Omega_1$,  $u_1^{n+1} = g$ on $\partial\Omega_1 \setminus \Gamma_1$,  $u_1^{n+1} = u^n|_{\Gamma_1}$ on $\Gamma_1$

  $Au_2^{n+1} = f$ in $\Omega_2$,  $u_2^{n+1} = g$ on $\partial\Omega_2 \setminus \Gamma_2$,  $u_2^{n+1} = u_1^{n+1}|_{\Gamma_2}$ on $\Gamma_2$
Convergence of the Solution

[Plot: convergence of the solution for single-phase groundwater flow]
Mesh Partition Example
Coarse Grid Correction
• This DD algorithm is a kind of block Jacobi iteration (CBJ)
• Problem: often (very) slow convergence
• Remedy: coarse grid correction
• A kind of two-grid multigrid algorithm
• Coarse grid solve on each processor
Observations
• DD is a good parallelization strategy
• The approach is not PDE-specific
• A program for the original global problem can be reused (modulo B.C.) for each subdomain
• Must communicate overlapping point values
• No need for global data
• Data distribution implied
• Explicit temporal schemes are a special case where no iteration is needed ("exact DD")
A Known Problem

"The hope among early domain decomposition workers was that one could write a simple controlling program which would call the old PDE software directly to perform the subdomain solves. This turned out to be unrealistic because most PDE packages are too rigid and inflexible."
– Smith, Bjørstad and Gropp

One remedy: use of object-oriented programming techniques
Goals for the Implementation
• Reuse the sequential solver as the subdomain solver
• Add DD management and communication as separate modules
• Collect common operations in generic library modules
• Flexibility and portability
• Simplified parallelization process for the end-user
Generic Programming Framework
The Subdomain Simulator

Subdomain simulator = sequential solver + add-on communication
The Communicator
• Need functionality for exchanging point values inside the overlapping regions
• The communicator works with a hidden communication model
• MPI in use, but easy to change
Realization
• Object-oriented programming (C++, Java, Python)
• Use inheritance, polymorphism, dynamic binding
  – Simplifies modularization
  – Supports reuse of the sequential solver (without touching its source code!)
Making the Simulator Parallel

class SimulatorP : public SubdomainFEMSolver,
                   public Simulator
{
  // ... just a small amount of code
  virtual void createLocalMatrix ()
  { Simulator::makeSystem (); }
};

[Class diagram: SimulatorP derives from both SubdomainSimulator/SubdomainFEMSolver and Simulator; an Administrator object drives the subdomain solves]
Performance
• Algorithmic efficiency
  – efficiency of the original sequential simulator(s)
  – efficiency of the domain decomposition method
• Parallel efficiency
  – communication overhead (low)
  – coarse grid correction overhead (normally low)
  – load balancing
    · subproblem size
    · work on subdomain solves
Application
• Single-phase groundwater flow
• DD as the global solution method
• Subdomain solvers use CG+FFT
• Fixed number of subdomains M=32 (independent of P)
• Straightforward parallelization of an existing simulator

P   Sim. Time  Speedup  Efficiency
1 53.08 N/A N/A
2 27.23 1.95 0.97
4 14.12 3.76 0.94
8 7.01 7.57 0.95
16 3.26 16.28 1.02
32 1.63 32.56 1.02
P: number of processors
Diffpack
• O-O software environment for scientific computation
• Rich collection of PDE solution components: portable, flexible, extensible
• www.diffpack.com
• H. P. Langtangen: Computational Partial Differential Equations, Springer, 1999
Straightforward Parallelization
• Develop a sequential simulator, without paying attention to parallelism
• Follow the Diffpack coding standards
• Use the Diffpack add-on libraries for parallel computing
• Add a few new statements to transform it into a parallel simulator
Linear-Algebra-Level Approach
• Parallelize matrix/vector operations
  – inner product of two vectors
  – matrix-vector product
  – preconditioning: block contributions from subgrids
• Easy to use
  – access to all Diffpack v3.0 CG-like methods, preconditioners and convergence monitors
  – "hidden" parallelization
  – need only add a few lines of new code
  – arbitrary choice of number of processors at run-time
  – less flexibility than DD
Linear-Algebra-Level Approach
• Domain decomposition as preconditioner
• Conjugate Gradient (CG)-like or multigrid (MG) solver
• Preconditioner: DD, e.g. 1 iteration
  – implemented as a basic DD solver
• Need to parallelize the CG or MG method (quite easy; matrix-vector/vector operations)
  – general library routine
Single-Phase Groundwater Flow

  $-\nabla \cdot (K(x)\nabla u) = f(x)$

• Highly unstructured grid
• Discontinuity in the coefficient K (0.1 & 1)
Measurements
P # iter Time Speedup
1 480 420.09 N/A
3 660 200.17 2.10
4 691 156.36 2.69
6 522 83.87 5.01
8 541 60.30 6.97
12 586 38.23 10.99
16 564 28.32 14.83
• 130,561 degrees of freedom
• Overlapping subgrids
• Global BiCGStab using (block) ILU prec.
Two-Phase Porous Media Flow

Simulation result obtained on 16 processors
Nonlinear Water Waves
Nonlinear Water Waves
• CG + DD prec. for the global solver
• Multigrid V-cycle as subdomain solver
• Fixed number of subdomains M=16 (independent of P)
• Subgrids from a partition of a global 41x41x41 grid
P Execution time Speedup Efficiency
1 1404.44 N/A N/A
2 715.32 1.96 0.98
4 372.79 3.77 0.94
8 183.99 7.63 0.95
16 90.89 15.45 0.97
2D Linear Elasticity
2D Linear Elasticity
• BiCGStab + DD prec. as the global solver
• Multigrid V-cycle in subdomain solves
• I: number of global BiCGStab iterations needed
• P: number of processors (P = #subdomains)
P CPU Speedup I Subgrid
1 66.01 N/A 19 241 x 241
2 24.64 2.68 12 129 x 241
4 14.97 4.41 14 129 x 129
8 5.96 11.08 11 69 x 129
16 3.58 18.44 13 69 x 69
Test Case: Vortex-Shedding
Simulation Snapshots

Pressure
Animated Pressure Field
Animated Velocity Field
Summary
• Goal: provide software and programming rules for easy parallelization of sequential simulators
• Applicable to a wide range of PDE problems
• Domain decomposition as the parallelization strategy
• Compact visible code/algorithm
• Performance: satisfactory speed-up