Domain Decomposition in High-Level Parallelization of PDE codes
Xing Cai, University of Oslo
Outline of the Talk
Introduction and motivation
A simulator parallel model
A generic programming framework
Applications
Introduction: The Question
Starting point: sequential PDE simulators. How do we parallelize them?
The resulting parallel simulators should have:
- good parallel performance
- good overall numerical performance
- a relatively simple parallelization process
We need a good parallelization strategy and a good implementation of that strategy.
Introduction: 3 Key Words
- Parallel Computing: faster solution, larger simulations
- Domain Decomposition (additive Schwarz method): good algorithmic efficiency; the mathematical foundation of the parallelization
- Object-Oriented Programming: an extensible sequential simulator; a flexible implementation framework for the parallelization
Introduction: A Known Problem
“The hope among early domain decomposition workers was that one could write a simple controlling program which would call the old PDE software directly to perform the subdomain solves. This turned out to be unrealistic because most PDE packages are too rigid and inflexible.”
- Smith, Bjørstad and Gropp
The remedy:
Correct use of object-oriented programming techniques.
Domain Decomposition: Additive Schwarz Method
Example: solving the Poisson problem on the unit square.
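A sketch of the iteration the example illustrates, assuming the standard additive Schwarz formulation (the exact notation on the original slide is not recoverable). For \( -\Delta u = f \) on the unit square, covered by overlapping subdomains \( \Omega_1, \dots, \Omega_M \):

\[ u^{k+1} = u^k + \sum_{i=1}^{M} R_i^T A_i^{-1} R_i \,(f - A u^k), \]

where \( A \) is the global discretization, \( R_i \) restricts a global vector to \( \Omega_i \), and \( A_i \) is the local subdomain problem. All \( M \) subdomain solves are mutually independent, which is what makes the method parallel.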
Design: Parallelization
A simulator-parallel model
Each processor hosts an arbitrary number of subdomains, a trade-off between algorithmic efficiency and load balancing.
Each subdomain is assigned its own sequential simulator.
Flexibility: different linear system solvers, preconditioners, convergence monitors, etc. can easily be chosen for different subproblems.
Domain decomposition at the level of subdomain simulators!
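To make the model concrete, here is a hypothetical C++ sketch of the per-process driver (all names are invented for illustration; the generic SubdomainSimulator interface is sketched later under "The Subdomain Simulator"):

#include <vector>

class SubdomainSimulator;                     // generic interface, see later
void solveLocal (SubdomainSimulator& s);      // one subdomain solve
void exchangeOverlapValues ();                // handled by the communicator

// Each process hosts an arbitrary number of subdomain simulators and
// loops over them in every additive Schwarz iteration.
class SimulatorParallelDriver
{
public:
  void oneSchwarzIteration ()
  {
    for (SubdomainSimulator* s : subdomains)  // independent local solves
      solveLocal (*s);
    exchangeOverlapValues ();                 // update overlapping nodal values
  }
private:
  std::vector<SubdomainSimulator*> subdomains;
};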
Observations: The Simulator-Parallel Model
Reuse of existing sequential simulators
Data distribution is implied
No need for global data
Needs additional functionality for exchanging nodal values inside the overlapping regions
Needs some global administration
OO Implementation: A Generic Programming Framework
- An add-on library (SPMD model)
- Use of object-oriented programming techniques
- Flexibility and portability
- A simplified parallelization process for the end user
OO Implementation: The Administrator
- Parameter interface: solution method or preconditioner, maximum iterations, stopping criterion, etc.
- DD algorithm interface: access to predefined numerical algorithms, e.g. CG
- Operation interface (standard codes & UDC): access to subdomain simulators, matrix-vector product, inner product, etc.
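A hypothetical C++ sketch of how the three interfaces could sit on one administrator class (all member names are invented; only the three interface roles come from the slide):

#include <vector>
typedef std::vector<double> Vector;   // placeholder vector type

class Administrator
{
public:
  // Parameter interface: method choices and stopping criteria.
  void setMaxIterations (int n);
  void setTolerance (double tol);

  // DD algorithm interface: run a predefined algorithm, e.g. CG.
  void solveByCG ();

  // Operation interface (standard codes & UDC): building blocks the
  // algorithms call, implemented via the subdomain simulators.
  void matrixVectorProduct (const Vector& x, Vector& y);
  double innerProduct (const Vector& x, const Vector& y);
};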
OO Implementation: The Communicator
- Encapsulation of communication-related code
- The concrete communication model is hidden
- MPI in use, but easy to change
- Determination of the communication pattern
- Inter-processor communication
- Intra-processor communication
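A minimal sketch of the overlap exchange using MPI (hypothetical function name and buffer layout; the slide only states that MPI is currently used and is easy to replace):

#include <mpi.h>
#include <vector>

// Exchange nodal values in an overlapping region with one neighboring
// process. Intra-processor "communication" between subdomains hosted on
// the same process can instead be a plain memory copy, which is one
// reason for hiding the concrete communication model behind an interface.
void exchangeOverlap (std::vector<double>& sendBuf,
                      std::vector<double>& recvBuf,
                      int neighbor, MPI_Comm comm)
{
  MPI_Sendrecv (sendBuf.data(), (int)sendBuf.size(), MPI_DOUBLE, neighbor, 0,
                recvBuf.data(), (int)recvBuf.size(), MPI_DOUBLE, neighbor, 0,
                comm, MPI_STATUS_IGNORE);
}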
OO Implementation: The Subdomain Simulator
Subdomain simulator: a generic representation
- a C++ class hierarchy
- a standard interface of generic member functions
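A minimal sketch of what this generic representation could look like (createLocalMatrix appears in the adaptation example on the next slide; the other member names are invented):

class SubdomainSimulator              // generic representation
{
public:
  virtual ~SubdomainSimulator () {}
  virtual void initialize () = 0;     // grid, fields, parameters
  virtual void solveLocal () = 0;     // run the sequential solver
};

class SubdomainFEMSolver : public SubdomainSimulator
{
public:
  virtual void createLocalMatrix () = 0;  // assemble the subdomain system
  // ... FEM-specific defaults shared by all subdomain simulators ...
};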
OO Implementation: Adaptation of Subdomain Simulator
class NewSimulator : public SubdomainFEMSolver,
                     public OldSimulator
{
  // ....
  virtual void createLocalMatrix ()
  { OldSimulator::makeSystem (); }
};

Class hierarchy: SubdomainSimulator -> SubdomainFEMSolver -> NewSimulator, where NewSimulator also inherits from the existing OldSimulator.
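The design choice is reuse through multiple inheritance: NewSimulator obtains the generic framework interface from SubdomainFEMSolver and the existing numerics from OldSimulator, so adapting a sequential simulator reduces to writing small glue functions such as createLocalMatrix above, which simply forwards to OldSimulator::makeSystem.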
Performance
- Algorithmic efficiency: efficiency of the original sequential simulator(s); efficiency of the domain decomposition method
- Parallel efficiency: communication overhead (low); coarse grid correction overhead (normally low); load balancing (subproblem size, work in subdomain solves)
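For reference in the result tables that follow, speedup and efficiency appear to follow the standard definitions (they are consistent with the reported numbers, e.g. 53.08/27.23 ≈ 1.95):

\[ S(P) = \frac{T(1)}{T(P)}, \qquad E(P) = \frac{S(P)}{P}, \]

where \( T(P) \) is the execution time on \( P \) processors.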
Simulator-Parallel: Application
P Sim. Time Speedup Efficiency
1 53.08 N/A N/A
2 27.23 1.95 0.97
4 14.12 3.76 0.94
8 7.01 7.57 0.95
16 3.26 16.28 1.02
32 1.63 32.56 1.02
Test case: 2D Poisson problem on the unit square. Fixed number of subdomains, M=32, partitioned from a 481 x 481 global grid. Straightforward parallelization of an existing simulator. Subdomain solves use CG+FFT.
P: number of processors.
Simulator-Parallel: Application
Test case: 2D linear elasticity, 241 x 241 global grid.
Vector equation for the displacement field \( u = (u_1, u_2) \):
\[ \Delta u + \nabla (\nabla \cdot u) = f \]
Straightforward parallelization based on an existing Diffpack simulator
Simulator-Parallel: 2D Linear Elasticity
P: number of processors in use (P=M).I: number of parallel BiCGStab iterations needed.
Multigrid V-cycle in subdomain solves
P CPU Speedup I Subgrid
1 66.01 N/A 19 241 x 241
2 24.64 2.68 12 129 x 241
4 14.97 4.41 14 129 x 129
8 5.96 11.08 11 69 x 129
16 3.58 18.44 13 69 x 69
Application: Unstructured Grid
P Subgrid Time Speedup Efficiency
1 1,503,433 201.30 N/A N/A
2 766,489 114.91 1.75 0.83
4 388,025 54.95 3.66 0.92
8 200,489 25.18 7.99 1.00
16 105,297 13.69 14.70 0.92
32 56,121 7.74 26.01 0.81
Simulator-Parallel: Application
Test case: two-phase porous media flow problem.
P Total CPU Subgrid CPU PEQ I CPU SEQ
1 4053.33 241x241 3586.98 3.10 440.58
2 2497.43 129 x 241 2241.78 3.48 241.08
4 1244.29 129 x 129 1101.58 2.97 134.28
8 804.47 129 x 69 725.58 3.93 72.76
16 490.47 69 x 69 447.27 4.13 39.64
PEQ: \( -\nabla \cdot (\lambda(s) \nabla p) = q \) in \( \Omega \), \( t \in (0,T] \), with velocity \( v = -\lambda(s) \nabla p \)
SEQ: \( \frac{\partial s}{\partial t} + v \cdot \nabla f(s) = 0 \) in \( \Omega \), \( t \in (0,T] \)
I: average number of parallel BiCGStab iterations per step
Multigrid V-cycle in subdomain solves
Simulator-Parallel: Two-Phase Porous Media Flow
Simulation result obtained on 16 processors
Simulator-Parallel: Application
Test case: fully nonlinear 3D water wave problem.
\( \partial\phi/\partial n = 0 \) on solid walls
\( \phi_t + \frac{1}{2}(\phi_x^2 + \phi_y^2 + \phi_z^2) + gz = 0 \) on the water surface
\( \eta_t + \phi_x \eta_x + \phi_y \eta_y - \phi_z = 0 \) on the water surface
\( \nabla^2 \phi = 0 \) in the water volume
Parallelization based on an existing Diffpack simulator.
Simulator-Parallel: Preliminary Results
Fixed number of subdomains M=16. Subdomain grids from partitioning a global 41 x 41 x 41 grid. Simulation over 32 time steps. DD as preconditioner of CG for the Laplace equation. Multigrid V-cycle as subdomain solver.
P Execution time Speedup
1 1404.40 N/A
2 715.32 1.96
4 372.79 3.77
8 183.99 7.63
16 90.89 15.45
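A sketch of what "DD as preconditioner of CG" amounts to, assuming the standard additive Schwarz preconditioner from earlier: CG is applied to the preconditioned system

\[ B A u = B f, \qquad B = \sum_{i=1}^{M} R_i^T \tilde{A}_i^{-1} R_i, \]

where each approximate subdomain inverse \( \tilde{A}_i^{-1} \) is realized by one multigrid V-cycle, as stated above.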
Simulator-Parallel: 3D Water Waves
Summary
High-level parallelization of PDE codes through DD
Introduction of a simulator-parallel model
A generic implementation framework