
Page 1

On the scalability of BDDC-based fast parallel iterative solvers for the discrete Stokes problem with continuous pressures

Alberto F. Martín, Santiago Badia

HPSC Team, Centre Internacional de Mètodes Numèrics a l'Enginyeria (CIMNE), Castelldefels, Spain
Universitat Politècnica de Catalunya, Barcelona, Spain

WCCM XI - ECCM V - ECFD VI, Barcelona, July 2014

Page 2

Outline

1 Introduction

2 BDDC preconditioner and its highly scalable parallel implementation

3 Preconditioners for the Stokes problem

4 Numerical Experiments

5 Conclusions and future work


Page 4

Introduction

• This work builds on recent efforts∗,† towards the development of highly scalable domain decomposition linear solver codes tailored for FE analysis

• These codes rely on a novel implementation approach to the Balancing Domain Decomposition by Constraints (BDDC) preconditioning ideas

• Scalability systematically assessed for 3D elliptic PDEs (Poisson, elasticity), with remarkable results (e.g., weak scaling up to 100K IBM BG/Q cores)

• The final goal is extreme-scale multiphysics solvers+, where highly scalable solver codes for one-physics problems are critical-for-scalability basic building blocks

• On the road to more complex problems: some experiences with BDDC-based parallel solvers for discrete Stokes flows (continuous pressure spaces)

∗ S. Badia, A. F. Martín and J. Principe. A highly scalable parallel implementation of balancing domain decomposition by constraints. SIAM J. Sci. Comput., Vol. 36(2), pp. C190-C218, 2014.
† S. Badia, A. F. Martín and J. Principe. On the scalability of inexact balancing domain decomposition by constraints with overlapped coarse/fine corrections. Submitted, 2014.
+ S. Badia, A. F. Martín and R. Planas. Block recursive LU preconditioners for the thermally coupled incompressible inductionless MHD problem. J. Comput. Physics, Vol. 274, pp. 562-591, 2014.

Page 5

Outline

1 Introduction

2 BDDC preconditioner and its highly scalable parallel implementation

3 Preconditioners for the Stokes problem

4 Numerical Experiments

5 Conclusions and future work


Page 6

Domain Decomposition

Given Ω and a FE partition T(Ω) → build a conforming FE space V_h ⊂ H^1_0(Ω) (also for dG)

• Variational problem: find u ∈ V_h : a(u, v) = (f, v) for any v ∈ V_h, assuming a(·,·) symmetric and coercive (e.g. Laplacian or linear elasticity)

• Algebraic problem: find x ∈ R^n : A x = b, where A is a large, sparse, symmetric positive definite matrix (also for nonsym.)

Motivation: efficient exploitation of distributed-memory machines for large-scale FE problems ⇒ domain decomposition framework

[Figure: partitioned mesh; interior DoFs (I) and interface DoFs (Γ) marked]


Page 9

BDDC: Balancing DD by Constraints

BDDC preconditioner [Dohrmann, ...]

• Replace V_h by V̄_h (reduced continuity)

• Define the injection I : V̄_h → V_h (weight, communicate and add)

• Find x̄_h ∈ V̄_h such that

    a(x̄_h, v̄_h) = ⟨I^t r_h, v̄_h⟩,  ∀ v̄_h ∈ V̄_h,

  and obtain z_h = M_BDDC r_h = E I x̄_h

• Last correction: E is the harmonic extension of the boundary values, which implies local Dirichlet solvers

[Figure: diagram of the injection I and its transpose I^t between V̄_h and V_h]


Page 12

BDDC: Balancing DD by Constraints

BDDC preconditioner [Dohrmann, ...]

• Alternatively, find x̄ ∈ R^n such that

    Ā x̄ = I^t r,

  and obtain z = M_BDDC r = E I x̄

• Ā is a sub-assembled global matrix (only the red corner DoFs in the figure are assembled)

• E = [ 0  −A_II^{-1} A_IΓ ; 0  I_Γ ],  with A_II = diag(A_II^(1), A_II^(2), ..., A_II^(P))

[Figure: diagram of the injection I and its transpose I^t between V̄_h and V_h]
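For concreteness, here is a minimal dense NumPy sketch of applying E with the interior/interface block splitting above; the function and variable names (apply_E, A_II, A_IG) are illustrative only, not FEMPAR's API.

```python
import numpy as np

def apply_E(A_II, A_IG, z_G):
    """Apply E to a vector with interface part z_G: keep the interface
    values and overwrite the interior ones with the discrete harmonic
    extension z_I = -A_II^{-1} (A_IG z_G). Since A_II is block diagonal
    across subdomains, in the parallel code this amounts to one
    independent local Dirichlet solve per subdomain."""
    z_I = np.linalg.solve(A_II, -(A_IG @ z_G))
    return np.concatenate([z_I, z_G])
```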

Page 13

BDDC preconditioning

• Decompose V̄_h = V_F ⊕ V_C, where V_F collects the functions whose coarse DoFs vanish and V_C ⊥_A V_F

• The problem then splits into a fine-grid (x_F) and a coarse-grid (x_C) correction

Fine-grid correction (x_F)

• Find x_F ∈ R^n such that Ā x_F = I^t r, constrained to (x_F)_C = 0

• Equivalent to P independent problems: find x_F^(i) ∈ R^(n^(i)) such that A^(i) x_F^(i) = I_i^t r, constrained to (x_F^(i))_C = 0

[Figure: sub-assembled space V̄_h with coarse (corner) DoFs highlighted]
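The constrained local solves admit a compact sketch when the coarse DoFs are corner values, so the constraint can be imposed by elimination; all names below are illustrative, and the dense solve stands in for a sparse factorization.

```python
import numpy as np

def local_fine_correction(A_i, r_i, coarse):
    """Solve A^(i) x = r^(i) constrained to (x)_C = 0 for the local
    coarse DoFs `coarse`, by restricting the system to the remaining
    DoFs R and leaving the coarse entries at zero."""
    n = A_i.shape[0]
    R = np.setdiff1d(np.arange(n), coarse)
    x = np.zeros(n)
    x[R] = np.linalg.solve(A_i[np.ix_(R, R)], r_i[R])
    return x  # x[coarse] == 0 by construction
```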


Page 16

BDDC preconditioning

• With the splitting V̄_h = V_F ⊕ V_C as above:

Coarse-grid correction (x_C)

Computation of V_C = span{Φ_1, Φ_2, ..., Φ_{n_C}}

• Find Φ ∈ R^(n × n_C) such that Ā Φ = 0, constrained to Φ_C = I

• Equivalent to P independent problems: find Φ^(i) ∈ R^(n^(i) × n_C^(i)) such that A^(i) Φ^(i) = 0, constrained to Φ^(i)_C = I
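A matching sketch for the local coarse basis, again assuming corner constraints imposed by elimination (edge/face average constraints would instead need a saddle-point solve with Lagrange multipliers); names are illustrative.

```python
import numpy as np

def local_coarse_basis(A_i, coarse):
    """Find Phi^(i) with (A^(i) Phi^(i))_R = 0 and Phi^(i)_C = I:
    eliminating gives Phi_R = -A_RR^{-1} A_RC and Phi_C = I."""
    n, nC = A_i.shape[0], len(coarse)
    R = np.setdiff1d(np.arange(n), coarse)
    Phi = np.zeros((n, nC))
    Phi[coarse, np.arange(nC)] = 1.0
    Phi[np.ix_(R, np.arange(nC))] = np.linalg.solve(
        A_i[np.ix_(R, R)], -A_i[np.ix_(R, coarse)])
    return Phi  # local coarse-matrix contribution: Phi.T @ A_i @ Phi
```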

Page 17

BDDC coarse corner function

[Figure: circle domain partitioned into 9 subdomains (left); Φ_j, one of V_C's basis vectors (right)]

Page 18

BDDC preconditioning

• With the splitting V̄_h = V_F ⊕ V_C as above, the coarse-grid correction (x_C) requires the assembly and solution of the coarse-grid problem:

    A_C = assembly( (Φ^(i))^t A^(i) Φ^(i) ),   solve A_C α_C = Φ^t I^t r,   x_C = Φ α_C

The coarse-grid problem is:

• Global, i.e. it couples all subdomains

• But much smaller than the original Schur complement S (size n_C)

• A potential loss of parallel efficiency as P grows

Page 19

Overlapped implementation

[Figure: execution timelines of both approaches. Typical: all P cores on the main MPI communicator first compute the fine-grid correction (T_F), then idle while the coarse-grid correction (T_C) is computed after a global communication. Highly-scalable: P_F cores on a fine-grid MPI communicator and P_C cores on a coarse-grid MPI communicator (OpenMP-based or MPI-based coarse-grid solution) run both corrections concurrently.]

Typical parallel implementation:

• All MPI tasks have fine-grid (f-g) duties, and one/several also have coarse-grid (c-g) duties

• Computation of f-g/c-g duties is serialized (but they are independent!)

• T_C ∝ O(P^2) → idling ≃ P·T_C

• Memory ∝ O(P^(4/3)) → memory per core rapidly exceeded

Highly-scalable parallel implementation:

• MPI tasks have either f-g OR c-g duties

• f-g/c-g corrections are OVERLAPPED in time (asynchronous)

• c-g duties can be MASKED with f-g duties

• MPI-based or OpenMP-based (this work) solutions are possible for the c-g correction
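The duty split of the highly-scalable variant can be sketched with mpi4py communicators; the layout below (a single coarse task and P−1 fine tasks) and all names are assumptions for illustration only, not FEMPAR's actual setup.

```python
from mpi4py import MPI

world = MPI.COMM_WORLD
P, rank = world.Get_size(), world.Get_rank()
is_coarse = rank == P - 1          # e.g. one coarse-grid task, P-1 fine

# Disjoint duty groups: fine- and coarse-grid tasks get separate
# communicators, so their corrections proceed concurrently and only
# synchronize via point-to-point messages on `world`.
duties = world.Split(color=1 if is_coarse else 0, key=rank)

if is_coarse:
    pass  # c-g duties: gather coarse contributions, assemble/factorize
          # A_C (multi-threaded, e.g. OpenMP, in this work) and serve
          # coarse corrections every iteration
else:
    pass  # f-g duties: local Dirichlet/Neumann solves, with nonblocking
          # sends/recvs to the coarse task so its work is masked
```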


Page 22

Outline

1 Introduction

2 BDDC preconditioner and its highly scalable parallel implementation

3 Preconditioners for the Stokes problem

4 Numerical Experiments

5 Conclusions and future work


Page 23

The Stokes system

The continuous stationary problem:

    −ν Δu + ∇p = f,
    ∇ · u = 0.

The discrete problem (e.g. using Galerkin/stabilized FEM):

    [ A  B^T ; B  −C ] [ u ; p ] = [ f ; g ],

where:

• u and p are the velocity and pressure field unknowns

• A is a vector-Laplacian matrix (SPD)

• C is an SPD stabilization matrix (C = 0 for inf-sup stable FEs)
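As a toy illustration, the block system can be assembled from its blocks with SciPy; the blocks below are random stand-ins with illustrative sizes, not an actual FE discretization.

```python
import numpy as np
import scipy.sparse as sp

nu, npr = 12, 4                        # toy velocity/pressure DoF counts
A = 2.0 * sp.eye(nu, format="csr")     # stand-in for the SPD vector Laplacian
B = sp.random(npr, nu, density=0.3, format="csr", random_state=0)
C = 1e-2 * sp.eye(npr, format="csr")   # stabilization (C = 0 if inf-sup stable)

K = sp.bmat([[A, B.T], [B, -C]], format="csr")   # [[A, B^T], [B, -C]]
rhs = np.ones(nu + npr)                          # stand-in for [f; g]
```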

Page 24

Preconditioning the discrete Stokes problem

Approach 1: Block preconditioning

Considers block triangular matrices of the form

    M = [ A  B^T ; 0  −S ]   with   M ≈ [ A  B^T ; B  −C ],

where S = C + B A^{-1} B^T, the pressure Schur complement, is a dense SPD matrix

Solve M z = r:
  1: Solve S z_p = −r_p
  2: Solve A z_u = r_u − B^T z_p

Some properties of M:

• Easy-to-prove optimality: condition number bounded irrespective of h

• 1 vector-Laplacian-type and 1 pressure Schur complement solve per iteration

• Direct solution of linear systems with A and S is computationally prohibitive


Page 26

Preconditioning the discrete Stokes problem

Approach 1: Block preconditioning

Choosing "suitable approximations" (spectrally equivalent) M_A ≈ A and M_S ≈ S,

    M = [ M_A  B^T ; 0  −M_S ]   with   M ≈ [ A  B^T ; B  −C ],

condition number bounds irrespective of h can still be attained

Cheap-to-invert, highly-scalable choices at our disposal:

• M_A = M_A^BDDC: BDDC preconditioner with exact (i.e., sparse direct) solvers, or M_A = M̃_A^BDDC: BDDC preconditioner with inexact (i.e., AMG) solvers

• M_S = D_p, with D_p the lumped pressure mass matrix (i.e., a diagonal matrix)

Solve M z = r:
  1: Solve D_p z_p = −r_p
  2: Solve M_A^BDDC z_u = r_u − B^T z_p (or M̃_A^BDDC z_u = r_u − B^T z_p)

Some properties of M:

• Easy-to-prove optimality: condition number bounded irrespective of h

• 1 BDDC preconditioner and 1 diagonal preconditioner application per iteration

• Applying M_A^BDDC or M̃_A^BDDC, and D_p, is cheap and highly scalable
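A minimal SciPy sketch of one application of this block triangular preconditioner inside GMRES, continuing the toy blocks from the Stokes-system sketch above; an exact sparse LU stands in for the BDDC solve M_A and a unit vector for the lumped pressure mass diagonal D_p.

```python
import numpy as np
import scipy.sparse.linalg as spla

def block_triangular_prec(A, B, Dp, nu, npr):
    """z = M^{-1} r for M = [[M_A, B^T], [0, -M_S]]:
    step 1: z_p = -M_S^{-1} r_p; step 2: z_u = M_A^{-1} (r_u - B^T z_p)."""
    A_solve = spla.splu(A.tocsc())          # stand-in for the M_A solve
    def apply(r):
        r_u, r_p = r[:nu], r[nu:]
        z_p = -r_p / Dp                     # diagonal (lumped mass) solve
        z_u = A_solve.solve(r_u - B.T @ z_p)
        return np.concatenate([z_u, z_p])
    return spla.LinearOperator((nu + npr, nu + npr), matvec=apply)

# Usage with the toy K, rhs above (np.ones(npr) as a toy D_p diagonal):
M = block_triangular_prec(A, B, np.ones(npr), nu, npr)
z, info = spla.gmres(K, rhs, M=M, restart=30)
```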


Page 29

Preconditioning the discrete Stokes problem

Approach 2: Monolithic BDDC

• Apply the BDDC methodology to the Stokes problem as a whole

• Theory not fully understood; available only for discontinuous pressures [Li & Widlund, 2006]

Well-posedness:† the (weak) inf-sup condition is not obvious for (V̄_h, Q̄_h)

• BDDC(c/ce/cef) are well-posed preconditioners

• Proof: inspired by the nonconforming Crouzeix-Raviart FE

• Holds both for inf-sup stable and stabilized formulations with continuous pressures

The BDDC problem: find ū_h ∈ V̄_h and p̄_h ∈ Q̄_h such that

    ν(∇ū_h, ∇v̄_h) − (p̄_h, ∇·v̄_h) − (q̄_h, ∇·ū_h) = ⟨r_h, v̄_h⟩   for all v̄_h ∈ V̄_h, q̄_h ∈ Q̄_h

† S. Badia and A. F. Martín. Balancing domain decomposition preconditioning for the discrete Stokes problem with continuous pressures. In preparation, 2014.


Page 31

Outline

1 Introduction

2 BDDC preconditioner and its highly scalable parallel implementation

3 Preconditioners for the Stokes problem

4 Numerical Experiments

5 Conclusions and future work


Page 32

FEMPAR Software

FEMPAR (in-house developed HPC software, free software under the GNU-GPL): Finite Element Multiphysics PARallel software

• Massively parallel software for FE simulation of multiphysics PDEs

• Scalable preconditioning of fully coupled, implicit systems via block preconditioning techniques

• Scalable preconditioning for one-physics (elliptic) PDEs relies on BDDC → hybrid MPI/OpenMP implementation

• Relies on highly-efficient vendor implementations of the dense/sparse BLAS (Intel MKL, IBM ESSL, etc.)

• Interfaces to external multi-threaded sparse direct solvers (PARDISO, HSL_MA87, etc.) and serial AMG preconditioners (HSL_MI20)

Page 33

Weak scalability for 3D Stokes

Target machine: HELIOS@IFERC-CSC, 4,410 bullx B510 compute blades (2 Intel Xeon E5-2680 8-core CPUs each)

• Target problem: Stokes on Ω = [0,1]^3 (lid-driven cavity problem)

• Uniform global mesh (Q1-Q1 FEs, ASGS-stabilized) + uniform partition

• 8, 432, ..., 16000 cores for fine duties

• An entire 16-core blade for coarse duties

• Three different preconditioners: (a) mono, (b) blk-ex, (c) blk-inex

• Direct (a & b) or AMG(1) (c) solves for the Dirichlet/Neumann/coarse problems

• Gradually larger local problem sizes H/h = 5, 10, ..., 35

Page 34

Weak scaling of BDDC-based solvers

Stokes problem, H/h = 20 (20 × 40 × 40 = 32K FEs/core)

[Figure: # of RGMRES iterations vs. #cores (128 to 16K) for BDDC(c+e)-based (left) and BDDC(c+e+f)-based (right) solvers; curves: mono, blk-ex, blk-inex]

# of RGMRES iterations:

• mono requires c+e+f constraints to reach an asymptotically constant # of iterations

• Also observed by [Šístek et al., 2011] for Taylor-Hood FEs (inf-sup stable)

Page 35

Weak scaling of BDDC-based solvers

Stokes problem, H/h = 20 (20 × 40 × 40 = 32K FEs/core)

[Figure: total wall clock time (secs.) vs. #cores (128 to 16K) for BDDC(c+e)-based (left) and BDDC(c+e+f)-based (right) solvers; curves: mono, blk-ex, blk-inex]

Total time (secs.):

• blk-ex and blk-inex are much faster than mono

• mono-BDDC(cef) weak scalability degrades beyond 8K cores

• Reason: the time spent in coarse duties exceeds that spent in fine duties

• Bad news for mono: H/h = 20 is the largest problem size that fits into memory!

Page 36

Weak scaling of BDDC(c+e)-based solvers

Stokes problem, H/h = 25 (25 × 50 × 50 = 62K FEs/core)

[Figure: # of RGMRES iterations (left) and total wall clock time in secs. (right) vs. #cores (128 to 16K); curves: blk-ex, blk-inex]

BDDC(c+e)-based solvers:

• blk-ex and blk-inex weakly scalable up to 16K cores

• blk-inex becomes faster than blk-ex for "sufficiently large" H/h

• Reason: the O(n^2) vs. O(n) complexity of direct vs. AMG solvers

Page 37

Weak scaling of BDDC(c+e)-based solvers

Stokes problem, H/h = 30 (30 × 60 × 60 = 108K FEs/core)

[Figure: # of RGMRES iterations (left) and total wall clock time in secs. (right) vs. #cores (128 to 16K); curves: blk-ex, blk-inex]

BDDC(c+e)-based solvers:

• blk-ex and blk-inex weakly scalable up to 16K cores

• blk-inex becomes faster than blk-ex for "sufficiently large" H/h

• Reason: the O(n^2) vs. O(n) complexity of direct vs. AMG solvers

Page 38

Memory consumption

Memory consumption on fine-grid MPI tasks (columns: local problem size, FEs/core):

  Precond.   0.5K     4K       13.5K    32K      62.5K    108K     171K
  mono       155.0M   320.3M   973.2M   2.7G     †        †        †
  blk-ex     97.3M    149.1M   336.9M   809.9M   1.7G     3.3G     †
  blk-inex   ∗        ∗        ∗        465.5M   823.5M   1.3G     2.1G

  ∗: experiment not performed   †: out of memory

Memory consumption on the coarse-grid MPI task (columns: # subdomains), c+e constraints:

  Precond.   128      1K       3.5K     8K       16K
  mono       45.3M    258.0M   998.2M   2.7G     6.1G
  blk-ex     34.7M    130.1M   419.8M   1.0G     2.1G
  blk-inex   30.9M    109.8M   326.3M   757.4M   1.4G

Page 39

Outline

1 Introduction

2 BDDC preconditioner and its highly scalable parallel implementation

3 Preconditioners for the Stokes problem

4 Numerical Experiments

5 Conclusions and future work


Page 40

Conclusions and future work

Conclusions:

• 3 different parallel preconditioners for the discrete Stokes problem were evaluated → mono, blk-ex and blk-inex

• Block preconditioning is faster and more scalable than the monolithic approach

• For "large enough" H/h, blk-inex achieves the best trade-off between preconditioner quality and preconditioner set-up/application time

Future work:

• Development/evaluation of inexact BDDC variants for the monolithic problem

• Multilevel extension + MPI-based coarse-grid implementation

• Potential to scale easily up to O(10^6) cores with three levels (on the way)

Page 41

References

S. Badia, A. F. Martín and J. Principe. Enhanced balancing Neumann-Neumann preconditioning in computational fluid and solid mechanics. International Journal for Numerical Methods in Engineering, Vol. 96(4), pp. 203-230, 2013.

S. Badia, A. F. Martín and J. Principe. Implementation and scalability analysis of balancing domain decomposition methods. Archives of Computational Methods in Engineering, Vol. 20(3), pp. 239-262, 2013.

S. Badia, A. F. Martín and J. Principe. A highly scalable parallel implementation of balancing domain decomposition by constraints. SIAM Journal on Scientific Computing, Vol. 36(2), pp. C190-C218, 2014.

S. Badia, A. F. Martín and R. Planas. Block recursive LU preconditioners for the thermally coupled incompressible inductionless MHD problem. Journal of Computational Physics, Vol. 274, pp. 562-591, 2014.

S. Badia, A. F. Martín and J. Principe. On the scalability of inexact balancing domain decomposition by constraints with overlapped coarse/fine corrections. Submitted, 2014.

S. Badia and A. F. Martín. Balancing domain decomposition preconditioning for the discrete Stokes problem with continuous pressures. In preparation, 2014.

Preprints available at http://badia.rmee.upc.edu/sbadia_ar.html
HPSC team: https://web.cimne.upc.edu/groups/comfus/

Work funded by the European Research Council under Starting Grant 258443 - COMFUS: Computational Methods for Fusion Technology (2007-2012).