
Page 1

On the scalability of BDDC-based fast parallel iterative solvers for the discrete Stokes problem with continuous pressures

Alberto F. Martín, Santiago Badia

HPSC Team, Centre Internacional de Mètodes Numèrics a l'Enginyeria (CIMNE), Castelldefels, Spain
Universitat Politècnica de Catalunya, Barcelona, Spain

WCCM XI - ECCM V - ECFD VI, Barcelona, July 2014

Page 2

Outline

1 Introduction

2 BDDC preconditioner and its highly scalable parallel implementation

3 Preconditioners for the Stokes problem

4 Numerical Experiments

5 Conclusions and future work


Page 4

Introduction

• This work builds on recent efforts∗,† towards the development of highly scalable domain decomposition linear solver codes tailored for FE analysis

• These codes rely on a novel implementation approach to the Balancing Domain Decomposition by Constraints (BDDC) preconditioning ideas

• Scalability systematically assessed for 3D elliptic PDEs (Poisson, elasticity), with remarkable results (e.g., weak scaling up to 100K IBM BG/Q cores)

• The final goal is extreme-scale multiphysics solvers+, where highly scalable solver codes for one-physics problems are critical-for-scalability basic building blocks

• On the road to more complex problems: some experiences with BDDC-based parallel solvers for discrete Stokes flows (continuous pressure spaces)

∗ S. Badia, A. F. Martín and J. Principe. A highly scalable parallel implementation of balancing domain decomposition by constraints. SIAM J. Sci. Comput., Vol. 36(2), pp. C190-C218, 2014.
† S. Badia, A. F. Martín and J. Principe. On the scalability of inexact balancing domain decomposition by constraints with overlapped coarse/fine corrections. Submitted, 2014.
+ S. Badia, A. F. Martín and R. Planas. Block recursive LU preconditioners for the thermally coupled incompressible inductionless MHD problem. J. Comput. Physics, Vol. 274, pp. 562-591, 2014.

Page 5

Outline

1 Introduction

2 BDDC preconditioner and its highly scalable parallel implementation

3 Preconditioners for the Stokes problem

4 Numerical Experiments

5 Conclusions and future work


Page 6

Domain Decomposition

Given Ω and a FE partition T(Ω) → build a conforming FE space V_h ⊂ H^1_0(Ω) (also for dG)

• Variational problem: find u ∈ V_h : a(u, v) = (f, v) for any v ∈ V_h, assuming a(·,·) symmetric and coercive (e.g. Laplacian or linear elasticity)

• Algebraic problem: find x ∈ R^n : A x = b, where A is a large, sparse, symmetric positive definite matrix (also for nonsym.)

Motivation: efficient exploitation of distributed-memory machines for large-scale FE problems ⇒ domain decomposition framework

[Figure: partitioned mesh; interior DoFs (I) and interface DoFs (Γ) marked]


Page 9

BDDC: Balancing DD by Constraints

BDDC preconditioner [Dohrmann, ...]

• Replace V_h by V̄_h (reduced continuity)

• Define the injection I : V̄_h → V_h (weight, communicate and add)

• Find x̄_h ∈ V̄_h such that

    a(x̄_h, v̄_h) = ⟨I^t r_h, v̄_h⟩,  ∀ v̄_h ∈ V̄_h,

  and obtain z_h = M_BDDC r_h = E I x̄_h

• Last correction: E is the harmonic extension of the boundary values, which implies local Dirichlet solvers

[Figure: diagram of the injection I and its transpose I^t between V̄_h and V_h]


Page 12

BDDC: Balancing DD by Constraints

BDDC preconditioner [Dohrmann, ...]

• Alternatively, find x̄ ∈ R^n such that

    Ā x̄ = I^t r,

  and obtain z = M_BDDC r = E I x̄

• Ā is a sub-assembled global matrix (only the red corner DoFs in the figure are assembled)

• E = [ 0  −A_II^{-1} A_IΓ ; 0  I_Γ ],  with A_II = diag(A_II^(1), A_II^(2), ..., A_II^(P))

[Figure: diagram of the injection I and its transpose I^t between V̄_h and V_h]
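For concreteness, here is a minimal dense NumPy sketch of applying E with the interior/interface block splitting above; the function and variable names (apply_E, A_II, A_IG) are illustrative only, not FEMPAR's API.

```python
import numpy as np

def apply_E(A_II, A_IG, z_G):
    """Apply E to a vector with interface part z_G: keep the interface
    values and overwrite the interior ones with the discrete harmonic
    extension z_I = -A_II^{-1} (A_IG z_G). Since A_II is block diagonal
    across subdomains, in the parallel code this amounts to one
    independent local Dirichlet solve per subdomain."""
    z_I = np.linalg.solve(A_II, -(A_IG @ z_G))
    return np.concatenate([z_I, z_G])
```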

Page 13

BDDC preconditioning

• Decompose V̄_h = V_F ⊕ V_C, where V_F collects the functions whose coarse DoFs vanish and V_C ⊥_A V_F

• The problem then splits into a fine-grid (x_F) and a coarse-grid (x_C) correction

Fine-grid correction (x_F)

• Find x_F ∈ R^n such that Ā x_F = I^t r, constrained to (x_F)_C = 0

• Equivalent to P independent problems: find x_F^(i) ∈ R^(n^(i)) such that A^(i) x_F^(i) = I_i^t r, constrained to (x_F^(i))_C = 0

[Figure: sub-assembled space V̄_h with coarse (corner) DoFs highlighted]
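The constrained local solves admit a compact sketch when the coarse DoFs are corner values, so the constraint can be imposed by elimination; all names below are illustrative, and the dense solve stands in for a sparse factorization.

```python
import numpy as np

def local_fine_correction(A_i, r_i, coarse):
    """Solve A^(i) x = r^(i) constrained to (x)_C = 0 for the local
    coarse DoFs `coarse`, by restricting the system to the remaining
    DoFs R and leaving the coarse entries at zero."""
    n = A_i.shape[0]
    R = np.setdiff1d(np.arange(n), coarse)
    x = np.zeros(n)
    x[R] = np.linalg.solve(A_i[np.ix_(R, R)], r_i[R])
    return x  # x[coarse] == 0 by construction
```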


Page 16

BDDC preconditioning

• With the splitting V̄_h = V_F ⊕ V_C as above:

Coarse-grid correction (x_C)

Computation of V_C = span{Φ_1, Φ_2, ..., Φ_{n_C}}

• Find Φ ∈ R^(n × n_C) such that Ā Φ = 0, constrained to Φ_C = I

• Equivalent to P independent problems: find Φ^(i) ∈ R^(n^(i) × n_C^(i)) such that A^(i) Φ^(i) = 0, constrained to Φ^(i)_C = I
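A matching sketch for the local coarse basis, again assuming corner constraints imposed by elimination (edge/face average constraints would instead need a saddle-point solve with Lagrange multipliers); names are illustrative.

```python
import numpy as np

def local_coarse_basis(A_i, coarse):
    """Find Phi^(i) with (A^(i) Phi^(i))_R = 0 and Phi^(i)_C = I:
    eliminating gives Phi_R = -A_RR^{-1} A_RC and Phi_C = I."""
    n, nC = A_i.shape[0], len(coarse)
    R = np.setdiff1d(np.arange(n), coarse)
    Phi = np.zeros((n, nC))
    Phi[coarse, np.arange(nC)] = 1.0
    Phi[np.ix_(R, np.arange(nC))] = np.linalg.solve(
        A_i[np.ix_(R, R)], -A_i[np.ix_(R, coarse)])
    return Phi  # local coarse-matrix contribution: Phi.T @ A_i @ Phi
```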

Page 17

BDDC coarse corner function

[Figure: circle domain partitioned into 9 subdomains (left); Φ_j, one of V_C's basis vectors (right)]

Page 18

BDDC preconditioning

• With the splitting V̄_h = V_F ⊕ V_C as above, the coarse-grid correction (x_C) requires the assembly and solution of the coarse-grid problem:

    A_C = assembly( (Φ^(i))^t A^(i) Φ^(i) ),   solve A_C α_C = Φ^t I^t r,   x_C = Φ α_C

The coarse-grid problem is:

• Global, i.e. it couples all subdomains

• But much smaller than the original Schur complement S (size n_C)

• A potential loss of parallel efficiency as P grows

Page 19

Overlapped implementation

[Figure: execution timelines of both approaches. Typical: all P cores on the main MPI communicator first compute the fine-grid correction (T_F), then idle while the coarse-grid correction (T_C) is computed after a global communication. Highly-scalable: P_F cores on a fine-grid MPI communicator and P_C cores on a coarse-grid MPI communicator (OpenMP-based or MPI-based coarse-grid solution) run both corrections concurrently.]

Typical parallel implementation:

• All MPI tasks have fine-grid (f-g) duties, and one/several also have coarse-grid (c-g) duties

• Computation of f-g/c-g duties is serialized (but they are independent!)

• T_C ∝ O(P^2) → idling ≃ P·T_C

• Memory ∝ O(P^(4/3)) → memory per core rapidly exceeded

Highly-scalable parallel implementation:

• MPI tasks have either f-g OR c-g duties

• f-g/c-g corrections are OVERLAPPED in time (asynchronous)

• c-g duties can be MASKED with f-g duties

• MPI-based or OpenMP-based (this work) solutions are possible for the c-g correction
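The duty split of the highly-scalable variant can be sketched with mpi4py communicators; the layout below (a single coarse task and P−1 fine tasks) and all names are assumptions for illustration only, not FEMPAR's actual setup.

```python
from mpi4py import MPI

world = MPI.COMM_WORLD
P, rank = world.Get_size(), world.Get_rank()
is_coarse = rank == P - 1          # e.g. one coarse-grid task, P-1 fine

# Disjoint duty groups: fine- and coarse-grid tasks get separate
# communicators, so their corrections proceed concurrently and only
# synchronize via point-to-point messages on `world`.
duties = world.Split(color=1 if is_coarse else 0, key=rank)

if is_coarse:
    pass  # c-g duties: gather coarse contributions, assemble/factorize
          # A_C (multi-threaded, e.g. OpenMP, in this work) and serve
          # coarse corrections every iteration
else:
    pass  # f-g duties: local Dirichlet/Neumann solves, with nonblocking
          # sends/recvs to the coarse task so its work is masked
```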


Page 22

Outline

1 Introduction

2 BDDC preconditioner and its highly scalable parallel implementation

3 Preconditioners for the Stokes problem

4 Numerical Experiments

5 Conclusions and future work


Page 23

The Stokes system

The continuous stationary problem:

    −ν Δu + ∇p = f,
    ∇ · u = 0.

The discrete problem (e.g. using Galerkin/stabilized FEM):

    [ A  B^T ; B  −C ] [ u ; p ] = [ f ; g ],

where:

• u and p are the velocity and pressure field unknowns

• A is a vector-Laplacian matrix (SPD)

• C is an SPD stabilization matrix (C = 0 for inf-sup stable FEs)
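As a toy illustration, the block system can be assembled from its blocks with SciPy; the blocks below are random stand-ins with illustrative sizes, not an actual FE discretization.

```python
import numpy as np
import scipy.sparse as sp

nu, npr = 12, 4                        # toy velocity/pressure DoF counts
A = 2.0 * sp.eye(nu, format="csr")     # stand-in for the SPD vector Laplacian
B = sp.random(npr, nu, density=0.3, format="csr", random_state=0)
C = 1e-2 * sp.eye(npr, format="csr")   # stabilization (C = 0 if inf-sup stable)

K = sp.bmat([[A, B.T], [B, -C]], format="csr")   # [[A, B^T], [B, -C]]
rhs = np.ones(nu + npr)                          # stand-in for [f; g]
```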

Page 24

Preconditioning the discrete Stokes problem

Approach 1: Block preconditioning

Considers block triangular matrices of the form

    M = [ A  B^T ; 0  −S ]   with   M ≈ [ A  B^T ; B  −C ],

where S = C + B A^{-1} B^T, the pressure Schur complement, is a dense SPD matrix

Solve M z = r:
  1: Solve S z_p = −r_p
  2: Solve A z_u = r_u − B^T z_p

Some properties of M:

• Easy-to-prove optimality: condition number bounded irrespective of h

• 1 vector-Laplacian-type and 1 pressure Schur complement solve per iteration

• Direct solution of linear systems with A and S is computationally prohibitive


Page 26

Preconditioning the discrete Stokes problem

Approach 1: Block preconditioning

Choosing "suitable approximations" (spectrally equivalent) M_A ≈ A and M_S ≈ S,

    M = [ M_A  B^T ; 0  −M_S ]   with   M ≈ [ A  B^T ; B  −C ],

condition number bounds irrespective of h can still be attained

Cheap-to-invert, highly-scalable choices at our disposal:

• M_A = M_A^BDDC: BDDC preconditioner with exact (i.e., sparse direct) solvers, or M_A = M̃_A^BDDC: BDDC preconditioner with inexact (i.e., AMG) solvers

• M_S = D_p, with D_p the lumped pressure mass matrix (i.e., a diagonal matrix)

Solve M z = r:
  1: Solve D_p z_p = −r_p
  2: Solve M_A^BDDC z_u = r_u − B^T z_p (or M̃_A^BDDC z_u = r_u − B^T z_p)

Some properties of M:

• Easy-to-prove optimality: condition number bounded irrespective of h

• 1 BDDC preconditioner and 1 diagonal preconditioner application per iteration

• Applying M_A^BDDC or M̃_A^BDDC, and D_p, is cheap and highly scalable
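A minimal SciPy sketch of one application of this block triangular preconditioner inside GMRES, continuing the toy blocks from the Stokes-system sketch above; an exact sparse LU stands in for the BDDC solve M_A and a unit vector for the lumped pressure mass diagonal D_p.

```python
import numpy as np
import scipy.sparse.linalg as spla

def block_triangular_prec(A, B, Dp, nu, npr):
    """z = M^{-1} r for M = [[M_A, B^T], [0, -M_S]]:
    step 1: z_p = -M_S^{-1} r_p; step 2: z_u = M_A^{-1} (r_u - B^T z_p)."""
    A_solve = spla.splu(A.tocsc())          # stand-in for the M_A solve
    def apply(r):
        r_u, r_p = r[:nu], r[nu:]
        z_p = -r_p / Dp                     # diagonal (lumped mass) solve
        z_u = A_solve.solve(r_u - B.T @ z_p)
        return np.concatenate([z_u, z_p])
    return spla.LinearOperator((nu + npr, nu + npr), matvec=apply)

# Usage with the toy K, rhs above (np.ones(npr) as a toy D_p diagonal):
M = block_triangular_prec(A, B, np.ones(npr), nu, npr)
z, info = spla.gmres(K, rhs, M=M, restart=30)
```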


Page 29

Preconditioning the discrete Stokes problem

Approach 2: Monolithic BDDC

• Apply the BDDC methodology to the Stokes problem as a whole

• Theory not fully understood; available only for discontinuous pressures [Li & Widlund, 2006]

Well-posedness:† the (weak) inf-sup condition is not obvious for (V̄_h, Q̄_h)

• BDDC(c/ce/cef) are well-posed preconditioners

• Proof: inspired by the nonconforming Crouzeix-Raviart FE

• Holds both for inf-sup stable and stabilized formulations with continuous pressures

The BDDC problem: find ū_h ∈ V̄_h and p̄_h ∈ Q̄_h such that

    ν(∇ū_h, ∇v̄_h) − (p̄_h, ∇·v̄_h) − (q̄_h, ∇·ū_h) = ⟨r_h, v̄_h⟩   for all v̄_h ∈ V̄_h, q̄_h ∈ Q̄_h

† S. Badia and A. F. Martín. Balancing domain decomposition preconditioning for the discrete Stokes problem with continuous pressures. In preparation, 2014.


Page 31

Outline

1 Introduction

2 BDDC preconditioner and its highly scalable parallel implementation

3 Preconditioners for the Stokes problem

4 Numerical Experiments

5 Conclusions and future work


Page 32

FEMPAR Software

FEMPAR (in-house developed HPC software, free software under the GNU-GPL): Finite Element Multiphysics PARallel software

• Massively parallel software for FE simulation of multiphysics PDEs

• Scalable preconditioning of fully coupled, implicit systems via block preconditioning techniques

• Scalable preconditioning for one-physics (elliptic) PDEs relies on BDDC → hybrid MPI/OpenMP implementation

• Relies on highly-efficient vendor implementations of the dense/sparse BLAS (Intel MKL, IBM ESSL, etc.)

• Interfaces to external multi-threaded sparse direct solvers (PARDISO, HSL_MA87, etc.) and serial AMG preconditioners (HSL_MI20)

Page 33

Weak scalability for 3D Stokes

Target machine: HELIOS@IFERC-CSC, 4,410 bullx B510 compute blades (2 Intel Xeon E5-2680 8-core CPUs each)

• Target problem: Stokes on Ω = [0,1]^3 (lid-driven cavity problem)

• Uniform global mesh (Q1-Q1 FEs, ASGS-stabilized) + uniform partition

• 8, 432, ..., 16000 cores for fine duties

• An entire 16-core blade for coarse duties

• Three different preconditioners: (a) mono, (b) blk-ex, (c) blk-inex

• Direct (a & b) or AMG(1) (c) solves for the Dirichlet/Neumann/coarse problems

• Gradually larger local problem sizes H/h = 5, 10, ..., 35

Page 34

Weak scaling of BDDC-based solvers

Stokes problem, H/h = 20 (20 × 40 × 40 = 32K FEs/core)

[Figure: # of RGMRES iterations vs. #cores (128 to 16K) for BDDC(c+e)-based (left) and BDDC(c+e+f)-based (right) solvers; curves: mono, blk-ex, blk-inex]

# of RGMRES iterations:

• mono requires c+e+f constraints to reach an asymptotically constant # of iterations

• Also observed by [Šístek et al., 2011] for Taylor-Hood FEs (inf-sup stable)

Page 35

Weak scaling of BDDC-based solvers

Stokes problem, H/h = 20 (20 × 40 × 40 = 32K FEs/core)

[Figure: total wall clock time (secs.) vs. #cores (128 to 16K) for BDDC(c+e)-based (left) and BDDC(c+e+f)-based (right) solvers; curves: mono, blk-ex, blk-inex]

Total time (secs.):

• blk-ex and blk-inex are much faster than mono

• mono-BDDC(cef) weak scalability degrades beyond 8K cores

• Reason: the time spent in coarse duties exceeds that spent in fine duties

• Bad news for mono: H/h = 20 is the largest problem size that fits into memory!

Page 36

Weak scaling of BDDC(c+e)-based solvers

Stokes problem, H/h = 25 (25 × 50 × 50 = 62K FEs/core)

[Figure: # of RGMRES iterations (left) and total wall clock time in secs. (right) vs. #cores (128 to 16K); curves: blk-ex, blk-inex]

BDDC(c+e)-based solvers:

• blk-ex and blk-inex weakly scalable up to 16K cores

• blk-inex becomes faster than blk-ex for "sufficiently large" H/h

• Reason: the O(n^2) vs. O(n) complexity of direct vs. AMG solvers

Page 37

Weak scaling of BDDC(c+e)-based solvers

Stokes problem, H/h = 30 (30 × 60 × 60 = 108K FEs/core)

[Figure: # of RGMRES iterations (left) and total wall clock time in secs. (right) vs. #cores (128 to 16K); curves: blk-ex, blk-inex]

BDDC(c+e)-based solvers:

• blk-ex and blk-inex weakly scalable up to 16K cores

• blk-inex becomes faster than blk-ex for "sufficiently large" H/h

• Reason: the O(n^2) vs. O(n) complexity of direct vs. AMG solvers

Page 38

Memory consumption

Memory consumption on fine-grid MPI tasks (columns: local problem size, FEs/core):

  Precond.   0.5K     4K       13.5K    32K      62.5K    108K     171K
  mono       155.0M   320.3M   973.2M   2.7G     †        †        †
  blk-ex     97.3M    149.1M   336.9M   809.9M   1.7G     3.3G     †
  blk-inex   ∗        ∗        ∗        465.5M   823.5M   1.3G     2.1G

  ∗: experiment not performed   †: out of memory

Memory consumption on the coarse-grid MPI task (columns: # subdomains), c+e constraints:

  Precond.   128      1K       3.5K     8K       16K
  mono       45.3M    258.0M   998.2M   2.7G     6.1G
  blk-ex     34.7M    130.1M   419.8M   1.0G     2.1G
  blk-inex   30.9M    109.8M   326.3M   757.4M   1.4G

Page 39

Outline

1 Introduction

2 BDDC preconditioner and its highly scalable parallel implementation

3 Preconditioners for the Stokes problem

4 Numerical Experiments

5 Conclusions and future work


Page 40

Conclusions and future work

Conclusions:

• 3 different parallel preconditioners for the discrete Stokes problem were evaluated → mono, blk-ex and blk-inex

• Block preconditioning is faster and more scalable than the monolithic approach

• For "large enough" H/h, blk-inex achieves the best trade-off between preconditioner quality and preconditioner set-up/application time

Future work:

• Development/evaluation of inexact BDDC variants for the monolithic problem

• Multilevel extension + MPI-based coarse-grid implementation

• Potential to scale easily up to O(10^6) cores with three levels (on the way)

Page 41

References

S. Badia, A. F. Martín and J. Principe. Enhanced balancing Neumann-Neumann preconditioning in computational fluid and solid mechanics. International Journal for Numerical Methods in Engineering, Vol. 96(4), pp. 203-230, 2013.

S. Badia, A. F. Martín and J. Principe. Implementation and scalability analysis of balancing domain decomposition methods. Archives of Computational Methods in Engineering, Vol. 20(3), pp. 239-262, 2013.

S. Badia, A. F. Martín and J. Principe. A highly scalable parallel implementation of balancing domain decomposition by constraints. SIAM Journal on Scientific Computing, Vol. 36(2), pp. C190-C218, 2014.

S. Badia, A. F. Martín and R. Planas. Block recursive LU preconditioners for the thermally coupled incompressible inductionless MHD problem. Journal of Computational Physics, Vol. 274, pp. 562-591, 2014.

S. Badia, A. F. Martín and J. Principe. On the scalability of inexact balancing domain decomposition by constraints with overlapped coarse/fine corrections. Submitted, 2014.

S. Badia and A. F. Martín. Balancing domain decomposition preconditioning for the discrete Stokes problem with continuous pressures. In preparation, 2014.

Preprints available at http://badia.rmee.upc.edu/sbadia_ar.html
HPSC team: https://web.cimne.upc.edu/groups/comfus/

Work funded by the European Research Council under Starting Grant 258443 - COMFUS: Computational Methods for Fusion Technology (2007-2012).