COSA
Compressible Finite Volume
Parallel Multiblock Multigrid
Navier-Stokes Solver
Dr M. Sergio Campobasso, Jernej Drofelnik
University of Lancaster, Lancaster LA1 4YR, UK
University of Glasgow, Glasgow G12 8QQ, UK
ASEArch study group meeting, 25th-26th April 2012
Outline
• Steady and Time-Domain (TD) Navier-Stokes (NS) equations
– Space discretization
– Numerical integration
• Harmonic Balance (HB) NS equations and their numerical integration
• Code architecture
• Parallelization
• Sample results
Time-Domain Navier-Stokes equations
• Arbitrary Lagrangian-Eulerian (ALE) form of NS equations:
$$\frac{\partial}{\partial t}\left(\int_{C(t)} U\,dV\right) + \oint_{\partial C(t)} \left(\Phi_c - \Phi_d\right)\cdot dS = 0$$

$$U = [\rho\;\;\rho u\;\;\rho v\;\;\rho\varepsilon]^T,\qquad \varepsilon = e + \frac{u^2+v^2}{2},\qquad H = \varepsilon + \frac{p}{\rho}$$

$$\Phi_c = E_c\,i + F_c\,j - v_b\,U,\qquad \Phi_d = E_d\,i + F_d\,j$$

$$E_c = [\rho u\;\;\rho u^2+p\;\;\rho uv\;\;\rho uH]^T,\qquad E_d = [0\;\;\tau_{xx}\;\;\tau_{xy}\;\;u\tau_{xx}+v\tau_{xy}-q_x]^T$$

$$F_c = [\rho v\;\;\rho uv\;\;\rho v^2+p\;\;\rho vH]^T,\qquad F_d = [0\;\;\tau_{xy}\;\;\tau_{yy}\;\;u\tau_{xy}+v\tau_{yy}-q_y]^T$$

$$\tau = 2\mu\left[s - \tfrac{1}{3}(\nabla\cdot v)\,I\right],\qquad s = \tfrac{1}{2}\left(\nabla v + \nabla^T v\right),\qquad q = -k\nabla T$$
• k − ω Shear Stress Transport (SST) for turbulence closure
Space discretization of steady equations
• Convective fluxes discretized with Roe’s flux difference splitting (Roe 1981) and Van Leer’s second/third order MUSCL extrapolations (Van Leer 1974); see the MUSCL sketch below.
$$\Phi^{*}_{i,f} = \frac{1}{2}\left[\Phi_{i,f}(U_L) + \Phi_{i,f}(U_R) - \left|\frac{\partial\Phi_{i,f}}{\partial U}\right|\delta U\right]$$

with Φ*_i,f being the numerical approximation to the continuous flux component Φ_i,f = Φ_i · n at a volume face along the face normal n.
• Diffusive fluxes discretized with second order central differences.
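A minimal sketch (not COSA source) of one common κ-scheme form of the MUSCL extrapolation named above, for a single variable in one grid direction and without a flux limiter; the routine name and the storage of the face i+1/2 states at index i are assumptions, and kappa = 1/3 gives the third-order variant.

c     kappa-scheme MUSCL extrapolation of left/right states at the
c     cell faces of one grid line (illustrative sketch, no limiter)
      subroutine muscl(imax, kappa, u, ul, ur)
      implicit none
      integer imax, i
      double precision kappa, u(imax), ul(imax), ur(imax)
      double precision dum, dup, duq
      do i = 2, imax-2
c       solution differences around face i+1/2
        dum = u(i)   - u(i-1)
        dup = u(i+1) - u(i)
        duq = u(i+2) - u(i+1)
c       left/right states at face i+1/2, stored at index i
        ul(i) = u(i)   + 0.25d0*((1.d0-kappa)*dum + (1.d0+kappa)*dup)
        ur(i) = u(i+1) - 0.25d0*((1.d0-kappa)*duq + (1.d0+kappa)*dup)
      end do
      end

The extrapolated states UL and UR are the inputs of the Roe flux formula above.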
Space discretization and integration of steady equations
• Scheme stencil of NS equations
– 2D: 13 points
– 3D: 25 points
• Explicit numerical integration based on a 4-stage Runge-Kutta (RK) scheme, with convergence acceleration by means of local time-stepping (LTS), centered variable-coefficient implicit residual smoothing (IRS), and multigrid (MG).
Time integration of space discretized TD equations
• After space discretization, one has to solve the system of ODEs:
$$V\frac{dQ}{dt} + R_\Phi(Q) = 0$$
• Dual-time stepping: implicit second-order discretization of dQ/dt to march in physical time t:
$$R_g\!\left(Q^{n+1}\right) = \frac{3Q^{n+1} - 4Q^{n} + Q^{n-1}}{2\Delta t}\,V + R_\Phi\!\left(Q^{n+1}\right) = 0$$
and RK pseudo-time marching with LTS/IRS/MG to obtain the solution at each physical time:
$$V\left(\frac{dQ}{d\tau}\right)^{n+1} + R_g\!\left(Q^{n+1}\right) = 0$$
MG integration of TD equations (for a given physical time)
• Using Jameson’s RK scheme (Jameson 1981); a code sketch of the resulting update loop is given below:
$$W^0 = Q^{n+1,l}$$

$$W^k = W^0 - \frac{\alpha_k\,\Delta\tau}{V}\,L_{irs}\left[R_g - f_{MG}\right]\!\left(W^{k-1}\right),\qquad k = 1,\ldots,N_S$$

$$Q^{n+1,l+1} = W^{N_S}$$
• For each block, W, Rg and fMG have length (imax × jmax) with structure
$$W = [W_1\;\;W_2\;\;\cdots\;\;W_{imax\times jmax}]^T$$
and each subarray has length npde.
• Application of low-speed preconditioning and RK stabilization reported in Campobasso and Baba-Ahmadi, GT2011-45303
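A minimal sketch, under assumed names (a resid routine returning Rg − fMG, stage arrays w0 and w) and omitting the IRS operator and low-speed preconditioning, of the pseudo-time update loop sketched above for one block:

c     Jameson RK pseudo-time update for one block of ncell = imax*jmax
c     cells and npde unknowns (illustrative sketch, not COSA source);
c     resid is assumed to return Rg - fMG evaluated at the current w
      subroutine rkstep(ncell, npde, ns, alpha, dtau, vol, w0, w, rg)
      implicit none
      integer ncell, npde, ns, k, ic, ip
      double precision alpha(ns), dtau(ncell), vol(ncell)
      double precision w0(npde,ncell), w(npde,ncell), rg(npde,ncell)
c     W^0 = Q(n+1,l)
      do ic = 1, ncell
        do ip = 1, npde
          w(ip,ic) = w0(ip,ic)
        end do
      end do
c     stages: W^k = W^0 - alpha_k dtau/V [Rg - fMG](W^(k-1))
      do k = 1, ns
        call resid(ncell, npde, w, rg)
        do ic = 1, ncell
          do ip = 1, npde
            w(ip,ic) = w0(ip,ic) - alpha(k)*dtau(ic)/vol(ic)*rg(ip,ic)
          end do
        end do
      end do
      end

After the last stage w holds Q(n+1,l+1), i.e. the next iterate of the solution at the current physical time.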
Harmonic Balance Navier-Stokes equations

• Arbitrary Lagrangian-Eulerian (ALE) form of the Navier-Stokes equations:
$$\omega D\int_{C(t)} U_H\,dV_H + \oint_{\partial C(t)}\left(\Phi_{c,H} - \Phi_{d,H}\right)\cdot dS_H = 0$$
where UH, Φc,H and Φd,H ∈ R^(npde×(2NH+1)):
$$U_H = \left[U(t_0)\;\;U(t_1)\;\;U(t_2)\;\;\cdots\;\;U(t_{2N_H})\right]^T$$
$$\Phi_{c,H} = \left[\Phi_c(t_0)\;\;\Phi_c(t_1)\;\;\Phi_c(t_2)\;\;\cdots\;\;\Phi_c(t_{2N_H})\right]^T$$
$$\Phi_{d,H} = \left[\Phi_d(t_0)\;\;\Phi_d(t_1)\;\;\Phi_d(t_2)\;\;\cdots\;\;\Phi_d(t_{2N_H})\right]^T$$
NH is the user-specified number of complex harmonics and D is a block antisymmetric matrix of size (2NH + 1) × (2NH + 1).
• k − ω Shear Stress Transport (SST) for turbulence closure
MG integration of HB equations
• Integration based on the same RK/LTS/IRS/MG approach used for the steady problem:

$$\left(\frac{dQ_H}{d\tau}\right)V + R_{g,H}(Q_H) = 0\qquad\text{where}\qquad R_{g,H}(Q_H) = \omega\,Q_H V D + R_{\Phi,H}$$
• Update step reads:
$$W_H^{k} = W_H^{0} - \alpha_k\,\Delta\tau\,V^{-1}L_{irs}\left[R_{g,H}\!\left(W_H^{k-1}\right) + f_{MG,H}\right]$$
• For each block, WH, Rg,H and fMG,H have length (imax × jmax) with structure
$$W_H = [W_{H,1}\;\;W_{H,2}\;\;\cdots\;\;W_{H,imax\times jmax}]^T$$
and each subarray has length npde × (2NH + 1).
• Low-speed preconditioning (LSP) implementation and RK stabilization reported in GT2011-45303
Code architecture

• The majority of the code is written in FORTRAN 77
• Finite Volume cell-centered structured multi-block grids
– Adjacent block connectivity via two rows of halo cells
– No hanging nodes
• Steady and TD:
– W̃ is defined for each multigrid level and has dimension W̃(N), where
$$N = npde\times\sum_{iblock=1}^{nblock} imax(iblock)\times jmax(iblock)\times kmax(iblock)$$
$$\tilde{W} = [W_1, W_2, \ldots, W_{iblock}, \ldots, W_{nblock}]$$
$$W_{iblock}\big(imax(iblock),\,jmax(iblock),\,kmax(iblock),\,npde\big)$$
Code architecture (cont’d)
• HB:
– W̃N is defined for each multigrid level and has dimension W̃N(N), where
$$N = npde\times nharms\times\sum_{iblock=1}^{nblock} imax(iblock)\times jmax(iblock)\times kmax(iblock)$$
$$\tilde{W}_N = [W_1, W_2, \ldots, W_{iblock}, \ldots, W_{nblock}]$$
$$W_{iblock}\big(imax(iblock),\,jmax(iblock),\,kmax(iblock),\,npde,\,nharms\big)$$
• Memory of W̃N allocated dynamically for the entire grid level
• Integer offsets used to move from block to block on each grid level (see the sketch below)
• Pointers used to move from one grid level to another
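A minimal sketch, with assumed (and purely illustrative) names and block sizes, of the offset bookkeeping described above: the length of the flat grid-level array and the starting index of each block are computed once, and block routines then work on the slice beginning at off(iblock).

c     block-offset bookkeeping for one grid level (illustrative sketch,
c     not COSA source); off(ib) is the starting index of block ib in
c     the flat grid-level array
      program offsets
      implicit none
      integer nblock, npde, nharms
      parameter (nblock = 3, npde = 6, nharms = 3)
      integer imax(nblock), jmax(nblock), kmax(nblock)
      integer off(nblock), ntot, ib
      data imax /48, 48, 24/, jmax /48, 48, 60/, kmax /1, 1, 1/
      ntot = 0
      do ib = 1, nblock
        off(ib) = ntot + 1
        ntot = ntot + npde*nharms*imax(ib)*jmax(ib)*kmax(ib)
      end do
c     ntot is the total length N of the flat array, allocated once per
c     grid level; a single-block routine receives the slice starting
c     at off(ib) and re-declares it with the block multi-dimensional
c     shape (imax(ib), jmax(ib), kmax(ib), npde, nharms)
      write(*,*) 'flat array length for this level:', ntot
      end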
Code architecture (cont’d)

• Parent/child structure. For each grid level nl:
c     grid-level ("parent") routine: retrieves the pointer to the flow
c     data of grid level nl, then calls the multi-block routine
      subroutine smooth(nl)
      ...
      pq=p_q(nl)
      ...
      call roflux(q, ...)
      ...
      end

c     multi-block routine: loops over the blocks of the level; the
c     integer offset selects each block's slice of the flat array
      subroutine roflux(q, ...)
      ...
      do iblock = 1,nblock
        iq=offset(iblock)
        call roflux_b(q(iq), ...)
      end do
      end

c     single-block ("child") routine: performs the block operations
      subroutine roflux_b(q, ...)
      block operations
      end
Steady/TD versus HB code structure

• HB equations can be viewed as a system of 2NH + 1 steady problems coupled by the source term ωQHV D.
• Thus, the memory requirement of the HB solver is about 2NH + 1 times that of the steady or TD solver.
• Most routines of the HB solver have one additional loop level with respect to those of the steady and TD solvers.
Typical TD routine:

      do ib = 1,nblock
        do k = 1,kmax
          do j = 1,jmax
            do i = 1,imax
              operations on arrays(i,j,k,1:npde,ib)

Typical HB routine:

      do ib = 1,nblock
        do ih = 0,2*nh
          do k = 1,kmax
            do j = 1,jmax
              do i = 1,imax
                operations on arrays(i,j,k,1:npde,ih,ib)
Parallelization of HB solver
• The steady and TD NS solvers feature distributed-memory MPI parallelization over blocks (grid partitions). The efficiency of this approach stems from the smallness of the data transferred among blocks (halo data).
• For the HB NS solver a hybrid parallelization is used: distributed-memory MPI parallelization over blocks, and shared-memory OpenMP parallelization over harmonics (see the sketch below).
• With pure MPI parallelization, node memory forces the number of MPI processes to be smaller than the number of cores. Adding OpenMP threads recovers the computational speed.
• New multicore processors (IBM BG/Q) allow using more processes than available cores.
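A minimal sketch, with assumed names, of the hybrid layout: each MPI rank loops over the blocks it owns, while inside each block routine the harmonic loop is shared among OpenMP threads.

c     block routine called by each MPI rank for the blocks it owns;
c     the harmonic loop is distributed over OpenMP threads
c     (illustrative sketch, not COSA source)
      subroutine update_b(imax, jmax, npde, nh, q, dq)
      implicit none
      integer imax, jmax, npde, nh, i, j, ip, ih
      double precision q(imax,jmax,npde,0:2*nh)
      double precision dq(imax,jmax,npde,0:2*nh)
c$omp parallel do private(i,j,ip)
      do ih = 0, 2*nh
        do j = 1, jmax
          do i = 1, imax
            do ip = 1, npde
              q(i,j,ip,ih) = q(i,j,ip,ih) + dq(i,j,ip,ih)
            end do
          end do
        end do
      end do
c$omp end parallel do
      end

Only the harmonic loop is threaded, so no OpenMP synchronization is needed across blocks and the MPI halo exchanges remain identical to those of the pure-MPI solver.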
Parallelization options for HB solver
DISTRIBUTED (MPI)

• Handles many-partition analyses
• More efficient than shared-memory (OpenMP) parallelization
• Efficiency may (?) decrease with increasing amount of process communication

SHARED (OpenMP)

• Applicable to harmonic loops, but also (alternatively) to block or cell loops
• Maximum problem size dictated by node memory
• Usually less efficient than MPI, but efficiency increases with the work done by the loop

HYBRID (or MIXED)

• Cluster viewed as a set of shared-memory nodes interlinked in distributed fashion
• Geometric partitions handled by MPI
• Harmonic loops handled by OpenMP
Parallel communications
• MPI communications used to exchange halo data across grid cuts (see the sketch below)
• Each MPI process handles a block set (minimum set size: one block)
• Global values (forces, residual RMSs, etc.) computed with MPI reduction operations
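A minimal sketch, with assumed buffer names and a single block cut, of the two communication patterns listed above: MPI_SENDRECV swaps packed halo data with the rank owning the adjacent block, and MPI_ALLREDUCE assembles a global residual from the per-process contributions.

c     halo exchange across one cut and global residual sum
c     (illustrative sketch, not COSA source)
      subroutine exchange(nhalo, sendbuf, recvbuf, nbr, res_loc, res)
      implicit none
      include 'mpif.h'
      integer nhalo, nbr, ierr, status(MPI_STATUS_SIZE)
      double precision sendbuf(nhalo), recvbuf(nhalo)
      double precision res_loc, res
c     two rows of halo cells, packed in sendbuf, are swapped with the
c     neighbouring rank nbr owning the adjacent block
      call MPI_SENDRECV(sendbuf, nhalo, MPI_DOUBLE_PRECISION, nbr, 1,
     &     recvbuf, nhalo, MPI_DOUBLE_PRECISION, nbr, 1,
     &     MPI_COMM_WORLD, status, ierr)
c     global residual contribution: sum over all processes
      call MPI_ALLREDUCE(res_loc, res, 1, MPI_DOUBLE_PRECISION,
     &     MPI_SUM, MPI_COMM_WORLD, ierr)
      end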
Parallel I/O
• Entire multiblock grid in a single file (mesh.dat).
• Entire flow field in a single file (restart).
• Entire solution (grid and flow field) in a single TECPLOT file (flowtec.dat).
• All MPI processes read/write from/to the same global file.
• Processes use MPI_FILE_OPEN and MPI_FILE_CLOSE to open and close files.
• Each process works out the location of its data (block set) in the file; MPI_FILE_SEEK moves the file pointer to the desired location.
• Data at that location is written using MPI I/O functionality, e.g. MPI_FILE_WRITE (see the sketch below).
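A minimal sketch, with an assumed file name and byte offset argument, of the shared-file write pattern described above: every rank opens the same restart file, seeks to the location of its own block set and writes its data with MPI_FILE_WRITE.

c     parallel write of one process' block set to the shared restart
c     file (illustrative sketch, not COSA source); myoffset is the
c     byte offset of this rank's data, computed from the block sizes
      subroutine write_restart(nwords, buf, myoffset)
      implicit none
      include 'mpif.h'
      integer nwords, fh, ierr, status(MPI_STATUS_SIZE)
      integer (kind=MPI_OFFSET_KIND) myoffset
      double precision buf(nwords)
      call MPI_FILE_OPEN(MPI_COMM_WORLD, 'restart',
     &     MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr)
c     move the individual file pointer to this rank's location
      call MPI_FILE_SEEK(fh, myoffset, MPI_SEEK_SET, ierr)
c     write this rank's block set
      call MPI_FILE_WRITE(fh, buf, nwords, MPI_DOUBLE_PRECISION,
     &     status, ierr)
      call MPI_FILE_CLOSE(fh, ierr)
      end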
Parallel performance of MPI
Pitching NACA 0015 airfoil:
• Number of blocks: 2048
• Block size: 48×48
• Overall number of cells: 4.7 million
• HB nharms = 8
• Cluster name: FERMI
• Cluster characteristics:
– IBM BG/Q
– IBM PowerA2 processor, 1.6 GHz
– 16 cores per node
– 16 GB RAM per node
[Figure: speedup versus number of cores (128 to 2048), comparing the XLF and gfortran builds against ideal scaling.]
Parallel performance of MPI (cont’d)
Three-blade vertical-axis wind turbine:
• Number of blocks: 3072
• Average block size: 24×60
• Overall number of cells: 4.7 million
• HB nharms = 8
• Cluster name: FERMI
• Cluster characteristics:
– IBM BG/Q
– IBM PowerA2 processor, 1.6 GHz
– 16 cores per node
– 16 GB RAM per node
[Figure: speedup versus number of cores (128 to 3072), comparing the XLF and gfortran builds against ideal scaling.]
Thank you for your attention
For further enquiries, please contact
M. Sergio Campobasso
E-mail: [email protected]

Jernej Drofelnik
E-mail: [email protected]