
Numerical methods and optimizations that enhance semi-Lagrangian gyrokinetic calculations

G. Latu, Y. Asahi, J. Bigot, G. Dif-Pradalier, P. Donnel,

C. Ehrlacher, X. Garbet, P. Ghendrih, V. Grandgirard,

M. Ottaviani, C. Passeron, Y. Sarazin

CEA, France

N. Bouzat, M. Mehrenberger

INRIA + U. Strasbourg

Y. Guclu, E. Sonnendrucker

IPP, Germany

Acknowledgements to: Eurofusion funding, EoCoE project funding

G. Latu & al., Enhancing semi-Lagrangian gyrokinetic calculations, 20/10/2016

Outline

• Short introduction to Gysela parallel setting

• Aligned advection along θ,ϕ

• Cubic splines versus Lagrange interpolants in Gysela

• Removing boundary condition at magnetic axis

Numerical scheme: overview

• Main unknown: $f^n(r, \theta, \varphi, v_\parallel, \mu)$

Input: physics parameters, f0
Output: diagnostics

for time step n ≥ 0 do
    Integrals: $N_i^n(r, \theta, \varphi) = \iint f^n \, B(r, \theta) \, J(k_\perp \rho_C) \, dv_\parallel \, d\mu$;
    Push fields (Poisson Eq.): $N_i^n(r, \theta, \varphi) \to \Phi^n(r, \theta, \varphi)$;
    Diagnostics/Outputs for time step n;
    Push particles (Vlasov Eq. + other terms): $\Phi^n(r, \theta, \varphi),\ f^n \to f^{n+1}$;

Algorithm 1: Simplified overall Gysela algorithm

• In practice: predictor-corrector time integration scheme, $O(\Delta t^2)$


Hybrid parallelization: MPI + OpenMP

Fortran 90 code, hybrid MPI+OpenMP

• MPI parallelization in variables µ, r, θ (most of the time)
  (in: Integrals, Diagnostics, Vlasov; not in: Poisson solver)
  → Work well balanced between processors
  → Parallel overhead: mainly MPI communications

• OpenMP fine-grain parallelization
  → Avoids MPI communication costs
  → Easy to change OpenMP parallelization (vs. MPI)


Parallel algo. for a 1D advection (in ϕ direction)

Input: f⋆(r, θ, ϕ, v‖, µ)
Output: f⋄(r, θ, ϕ, v‖, µ)

for µ do in parallel MPI    /* One MPI communicator per µ-value */
    for r do in parallel MPI
        for θ do in parallel MPI
            for θ do in parallel OpenMP
                for v‖ do
                    Compute cubic spline representation of f⋆(r, θ, ϕ = ∗, v‖, µ)
                    for ϕ do
                        ∆ϕ ← (v‖ + other terms) ∆t
                        f⋄(r, θ, ϕ, v‖, µ) = spline_interpolate(f⋆(r, θ, ϕ − ∆ϕ, v‖, µ))

Cubic splines are used in every advection of Gysela
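
The innermost loop of the algorithm above (backward displacement ∆ϕ, then interpolation at ϕ − ∆ϕ) can be sketched in a few lines. This is only an illustration under stated assumptions: a constant advection velocity, a periodic ϕ grid, and a local 4-point Lagrange interpolation standing in for the cubic splines used in Gysela. The names `lagrange4_periodic` and `advect_phi` are hypothetical.

```python
import math

def lagrange4_periodic(values, s):
    """Interpolate periodic samples at fractional index s with a 4-point
    (cubic) Lagrange stencil. Stand-in for the cubic spline interpolation."""
    n = len(values)
    j = math.floor(s)
    t = s - j                      # offset in [0, 1) from grid point j
    # Cubic Lagrange weights for the nodes j-1, j, j+1, j+2
    w = (-t * (t - 1.0) * (t - 2.0) / 6.0,
         (t + 1.0) * (t - 1.0) * (t - 2.0) / 2.0,
         -(t + 1.0) * t * (t - 2.0) / 2.0,
         (t + 1.0) * t * (t - 1.0) / 6.0)
    return sum(w[k] * values[(j - 1 + k) % n] for k in range(4))

def advect_phi(f_star, velocity, dt, dphi):
    """One backward semi-Lagrangian step for df/dt + v df/dphi = 0 on a
    periodic grid: f_new(phi_j) = f_star(phi_j - v*dt)."""
    shift = velocity * dt / dphi   # displacement in units of grid cells
    return [lagrange4_periodic(f_star, j - shift) for j in range(len(f_star))]
```

Because the foot of the characteristic is located by interpolation rather than by a finite-difference stencil, the displacement may exceed one cell without a stability limit on ∆t.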


Parallelization of Vlasov solver: transposition used

for time step n ≥ 0 do
    Integrals, Poisson, Diagnostics
    Vlasov solver:
        1D Advection in v‖ (∀(µ, r, θ) = [local], ∀(ϕ, v‖) = [∗]);
        1D Advection in ϕ (∀(µ, r, θ) = [local], ∀(ϕ, v‖) = [∗]);
        Transposition of f;
        2D Advection in (r, θ) (∀(µ, ϕ, v‖) = [local], ∀(r, θ) = [∗]);
        Transposition of f;
        1D Advection in ϕ (∀(µ, r, θ) = [local], ∀(ϕ, v‖) = [∗]);
        1D Advection in v‖ (∀(µ, r, θ) = [local], ∀(ϕ, v‖) = [∗]);

Algorithm 2: Transposes, two parallel data decompositions

• Two transposes in each µ communicator (per Vlasov solve)
• No CFL constraint on time step dt for 2D advection


Communication schemes - Transpose

[Figure 1: Transpose communication scheme within an MPI communicator identified by µ.
 Transpose forward: local subdomain D1 (block_r × block_θ, larger than 32×32 pts; v‖ = ∗, ϕ = ∗) → local subdomain D2 (block_ϕ × block_v‖; r = ∗, θ = ∗).
 Transpose backward: D2 → D1.]

• Large communication amount: Θ((Nr Nθ) Nϕ Nv‖ Nµ)
• However, scales well up to 64k cores
• Less than 15% of total elapsed time







Motivation: aligned advection

• Anisotropy within the (θ, ϕ) plane [at a given r]: structures are aligned along field lines
• Strong gradients perpendicular to field lines
• Safety factor q(r) gives the direction of the field line
• Aim: interpolate along the field line, where variations are smooth

Numerical method - aligned advection

Aligned scheme: interpolation of a target point at position (θ⋆, ϕ⋆).
Assumption: smooth variation along the green line.

[Figure: (θ, ϕ) grid with the target point (θ⋆, ϕ⋆), the green line passing through it, and the grid lines ϕ_{j⋆−1}, ϕ_{j⋆}, ϕ_{j⋆+1}]

• Squares are located at the intersections of the green line with the grid lines along θ.
• Values at the squares are interpolated using the values known at the black points (grid).
• The value at the red circle position (θ⋆, ϕ⋆) is interpolated using the values known at the square positions.


Algorithm - aligned advection

/* Input: distrib. function on (θ, ϕ) + feet (θ⋆, ϕ⋆) */
Input: g(:, :), theta⋆(:, :), phi⋆(:, :)

/* Output: distrib. function on (θ, ϕ) plane */
Output: g†(:, :)

for j = 0, Nϕ do
    η(i = ∗, j) ← spline coefficients along θ for g(i = ∗, j)
for j = 0, Nϕ do
    for i = 0, Nθ do
        ϕ⋆ ← phi⋆(i, j); θ⋆ ← theta⋆(i, j);
        j⋆ ← index of the left grid point close to ϕ⋆;
        for k = −d, d+1 do
            θk ← fieldline_θ(θ⋆, ϕ⋆, j⋆ + k);
            uk ← 1D spline interpolation along θ at θk using η(i = ∗, j⋆ + k);
        g†(i, j) ← 1D Lagrange interpolation using values (uk), k = −d, ..., d+1

Algorithm 3: Aligned interpolation in (θ, ϕ) plane
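
A minimal sketch of Algorithm 3 for a single target point follows. The assumptions are loud ones: the field line is modelled as a straight line dθ/dϕ = 1/q with constant q (standing in for fieldline_θ), both directions are 2π-periodic, and a local 4-point Lagrange interpolation along θ replaces the 1D spline. All function names are hypothetical.

```python
import math

def cubic_periodic(values, s):
    """Periodic 4-point Lagrange interpolation at fractional index s
    (stand-in for the 1D cubic spline along theta)."""
    n = len(values)
    i = math.floor(s)
    t = s - i
    w = (-t * (t - 1.0) * (t - 2.0) / 6.0,
         (t + 1.0) * (t - 1.0) * (t - 2.0) / 2.0,
         -(t + 1.0) * t * (t - 2.0) / 2.0,
         (t + 1.0) * t * (t - 1.0) / 6.0)
    return sum(w[k] * values[(i - 1 + k) % n] for k in range(4))

def lagrange_weights(nodes, x):
    """Lagrange basis polynomials of `nodes`, evaluated at x."""
    weights = []
    for j, xj in enumerate(nodes):
        w = 1.0
        for k, xk in enumerate(nodes):
            if k != j:
                w *= (x - xk) / (xj - xk)
        weights.append(w)
    return weights

def aligned_interp(g, theta_star, phi_star, q, d=1):
    """Interpolate g at (theta_star, phi_star). g[j][i] holds the value at
    phi_j = j*dphi and theta_i = i*dtheta."""
    nphi, ntheta = len(g), len(g[0])
    dphi = 2.0 * math.pi / nphi
    dtheta = 2.0 * math.pi / ntheta
    j_star = math.floor(phi_star / dphi)           # left grid plane
    planes = [j_star + k for k in range(-d, d + 2)]
    u = []
    for j in planes:
        # Assumed field-line model: d(theta)/d(phi) = 1/q with constant q
        theta_k = theta_star + (j * dphi - phi_star) / q
        u.append(cubic_periodic(g[j % nphi], theta_k / dtheta))
    # 1D Lagrange interpolation along the field line, parametrized by phi
    w = lagrange_weights([j * dphi for j in planes], phi_star)
    return sum(wk * uk for wk, uk in zip(w, u))
```

For a function that is constant along the field lines, all the values uk are (nearly) equal, so the accuracy is set by the resolution along θ even when ϕ is coarsely sampled. Passing a very large q makes the line degenerate to θ = θ⋆, which reproduces a plain interpolation along ϕ for comparison.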


Estimate of derivatives

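• Reduced number of points along ϕ
  → which method gives an accurate derivative along ϕ?
  → major issue: derivative of the electric potential Φ

• Method used (based on aligned interpolation):

[Figure: (θ, ϕ) grid showing the points (θi, ϕj + ε) and (θi, ϕj − ε) between the grid lines ϕ_{j⋆−1}, ϕ_{j⋆}, ϕ_{j⋆+1}]

\[ \frac{\partial f}{\partial \varphi}(\theta_i, \varphi_j) \approx \frac{f(\theta_i, \varphi_j + \varepsilon) - f(\theta_i, \varphi_j - \varepsilon)}{2\,\varepsilon} \]

The two f-values are obtained by aligned interpolation.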
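
The centered difference above can be written as a small helper. In this sketch, `interp` is a hypothetical callable returning the aligned-interpolated value at (θ, ϕ); any function of two arguments works for testing.

```python
def dphi_centered(interp, theta_i, phi_j, eps):
    """Centered estimate of df/dphi at (theta_i, phi_j) from two
    interpolated values at phi_j + eps and phi_j - eps."""
    f_plus = interp(theta_i, phi_j + eps)
    f_minus = interp(theta_i, phi_j - eps)
    return (f_plus - f_minus) / (2.0 * eps)
```

The point of the construction is that ε can be chosen much smaller than the grid spacing along ϕ, since the two values come from interpolation along the field line rather than from neighbouring grid planes.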


Global algorithm for Vlasov solver

Vlasov solver:
    1D Advection in v‖ (∀(µ, r, θ) = [local], ∀(ϕ, v‖) = [∗]);
    2D Aligned advection in (θ, ϕ);
    2D Advection in (r, θ) (∀(µ, ϕ, v‖) = [local], ∀(r, θ) = [∗]);
    2D Aligned advection in (θ, ϕ);
    1D Advection in v‖ (∀(µ, r, θ) = [local], ∀(ϕ, v‖) = [∗]);

Algorithm 4: Aligned method, Vlasov algorithm

• Domain decomposition for the aligned advection? → θ = ∗ needed for splines
• Which MPI communications should be done?


Parallel algorithm (v1)

• Whole plane (θ = ∗, ϕ = ∗) known → simplifies implementation

1D advection in v‖ (∀(µ, r, θ) = [local], ∀(ϕ, v‖) = [∗]);
Get feet for 2D advection in (θ, ϕ) (∀(µ, r, θ) = [local], ∀(ϕ, v‖) = [∗]);
Transpose f, and redistribute feet;
2D aligned advection in (θ, ϕ) (∀(µ, v‖) = [local], ∀(r, θ, ϕ) = [∗]);
Transpose f;
2D advection in (r, θ) (∀(µ, ϕ, v‖) = [local], ∀(r, θ) = [∗]);
Get feet for 2D advection in (θ, ϕ) (∀(µ, ϕ, v‖) = [local], ∀(r, θ) = [∗]);
Transpose f, and redistribute feet;
2D aligned advection in (θ, ϕ) (∀(µ, v‖) = [local], ∀(r, θ, ϕ) = [∗]);
Transpose f;
1D advection in v‖ (∀(µ, r, θ) = [local], ∀(ϕ, v‖) = [∗]);

Algorithm 5: Aligned Vlasov solver (v1)

• 3 domain decompositions
• Overheads: communication + memory + compute costs


Performance - aligned method (v1)

Execution time    Standard Nphi=32    Standard Nphi=64    Aligned Nphi=32
Transposes              34.6                46.0               145.2
Advections              81.8               159.6               162.1
Others                  94.1               184.6                96.4
Total run time         210.6               390.3               403.7

Table 1: Time (in s.) of a short Gysela run

• Domain size: Nr = 256, Ntheta = 256, Nvpar = 48, Nmu = 4, nbtimestep = 16, nbcpus = 256
• Algo (v1) → Pb 1: execution time → significant overhead
• Algo (v1) → Pb 2: memory footprint increased by 3×
  → data structures containing feet + MPI buffers
• Algo (v1) → Aligned Nphi = 32 results close to standard Nphi = 128 or 256


Numerical experiments (Standard Nϕ = 256 close to Aligned Nϕ = 32)

Linear simulations (mode n = 10, full torus) [Y. Sarazin]
ρ⋆ = 1/150, Nr = 256, Ntheta = 256, Nvpar = 128, Nmu = 16

Plotting amplitude of the 4 most unstable modes

[Figure: amplitude of the 4 most unstable modes; panels for Nϕ = 32, Nϕ = 128, Nϕ = 256; row label: Standard]


Parallel algorithm (v2): improved performance

1D advection in v‖ (∀(µ, r, θ) = [local], ∀(ϕ, v‖) = [∗]);
Transpose f and get ghost cells on f along ϕ direction;
2D aligned advection in (θ, ϕ) (∀(µ, ϕ, v‖) = [local], ∀(r, θ) = [∗]);
Comm: if feet outside of local domain, interpolate on another MPI process;
2D advection in (r, θ) (∀(µ, ϕ, v‖) = [local], ∀(r, θ) = [∗]);
Comm: get ghost cells on f along ϕ direction;
2D aligned advection in (θ, ϕ) (∀(µ, ϕ, v‖) = [local], ∀(r, θ) = [∗]);
Comm: if feet outside of local domain, interpolate on another MPI process;
Transpose f;
1D advection in v‖ (∀(µ, r, θ) = [local], ∀(ϕ, v‖) = [∗]);

Algorithm 6: Aligned Vlasov solver (v2)

• Subdomain (θ = ∗, ϕ = [local]) → comm. at ϕ boundaries
• Transmit particles escaping subdomain → comm.
• Extra communication costs → less than (v1)


Performance - aligned method (v2)

Execution time    Standard Nphi=32    Standard Nphi=64    Aligned (v2) Nphi=32    Aligned (v1) Nphi=32
Transposes              34.6                46.0                 42.2                   145.2
Advections              81.8               159.6                143.0                   162.1
Others                  94.1               184.6                 96.6                    96.4
Total run time         210.6               390.3                281.8                   403.7

Table 2: Time (in s.) of a short Gysela run

• Domain size: Nr = 256, Ntheta = 256, Nvpar = 48, Nmu = 4, nbtimestep = 16, nbcpus = 256
• Algo (v2) → No more 3 domain decompositions as in (v1)
• Algo (v2) → Almost no memory costs vs standard version
• Algo (v2) → Execution time reduction vs (v1)


Overheads estimate - aligned method

• Algorithm v1
  • 2D advection instead of 1D advection along ϕ
  • Feet computation is decoupled from advection in (θ, ϕ) → bad cache effect (temporal locality)
  • Feet (θ⋆, ϕ⋆) must be stored/sent to other processes
  • Extra communication cost due to another domain decomposition (∀(µ, v‖) = [local], ∀(r, θ, ϕ) = [∗])

• Algorithm v2 (best)
  • 2D advection instead of 1D advection along ϕ
  • Ghost cells exchange of f along ϕ → small comm. costs
  • Points outside subdomain during (θ, ϕ) advection:
    send a request for each point (small comm. cost);
    message to send back the interpolated value (small cost)


Aligned method - Conclusion

• Main aligned method achievement:
  → Reduced number of points along ϕ (4×)

• Algorithm v2 has reasonable overheads for production

• However, spurious modes (electric potential Φ) grow in long-time runs for small Nϕ. Solutions:
  → Filter them out in Fourier space in (θ, ϕ); many gyrokinetic codes do that, for various reasons
  → Explain the cause: some modes are not damped as they should be (if Nϕ is too small) [Ottaviani, Physics Letters A 375 (2011) 1677]; still under investigation







Interpolations in Gysela

• Interpolations have a main role in advections:
  • along ϕ, along v‖, along (r, θ), along (θ, ϕ) (aligned method)

• Interpolations used in derivative estimates:
  • along ϕ (aligned method)

• Cubic splines
  • frequently used in semi-Lagrangian codes (plasma physics, atmospheric models, ...)
  • non-local → couple all values along one direction (drawback)
  • good compromise between computational efficiency and numerical accuracy (advantage)
  • smooth interpolation, C1 continuity (advantage)
  • degrade well when the distribution is under-resolved on the mesh


Uniform cubic spline interpolation (for Semi-Lagrangian method)

Input: set of values $G = (g_0, g_1, \dots, g_N)^T$, a location x
Output: g(x), interpolated value at location x

begin
    Compute $g_{-1}$ and $g_{N+1}$ depending on boundary conditions;
    Solve $A\,(\eta_{-1}, \eta_0, \dots, \eta_{N+1})^T = (g_{-1}, G, g_{N+1})^T$ with $A = L\,U$;
    Set index $i \leftarrow \lfloor (x - x_0)/dx \rfloor$    /* local support */;
    Interpolate g(x) using coefficients $\eta_{i-1}, \eta_i, \eta_{i+1}, \eta_{i+2}$;
end

For several interpolations at different locations using G: the LU system is solved only once
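
The two stages of the algorithm above (one global solve for the coefficients η, then cheap local evaluations) can be sketched as follows. Assumptions of this sketch: periodic boundary conditions (so no g−1, gN+1 are needed), a standard cubic B-spline basis, and a dense Gaussian elimination used for brevity in place of a reusable banded LU factorization. Function names are hypothetical.

```python
import math

def spline_coefficients_periodic(g):
    """Solve (eta[i-1] + 4*eta[i] + eta[i+1]) / 6 = g[i] with periodic wrap.
    The matrix is strictly diagonally dominant, so no pivoting is needed."""
    n = len(g)
    a = [[0.0] * (n + 1) for _ in range(n)]       # augmented matrix [A | g]
    for i in range(n):
        a[i][(i - 1) % n] += 1.0 / 6.0
        a[i][i] += 4.0 / 6.0
        a[i][(i + 1) % n] += 1.0 / 6.0
        a[i][n] = g[i]
    for col in range(n):                          # forward elimination
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            if factor != 0.0:
                for k in range(col, n + 1):
                    a[row][k] -= factor * a[col][k]
    eta = [0.0] * n
    for i in range(n - 1, -1, -1):                # back substitution
        s = a[i][n] - sum(a[i][k] * eta[k] for k in range(i + 1, n))
        eta[i] = s / a[i][i]
    return eta

def spline_eval_periodic(eta, x, x0, dx):
    """Evaluate the spline at x using the 4 coefficients of its local support."""
    n = len(eta)
    s = (x - x0) / dx
    i = math.floor(s)
    t = s - i
    # Cubic B-spline weights for eta[i-1], eta[i], eta[i+1], eta[i+2]
    w = ((1.0 - t) ** 3 / 6.0,
         (3.0 * t ** 3 - 6.0 * t ** 2 + 4.0) / 6.0,
         (-3.0 * t ** 3 + 3.0 * t ** 2 + 3.0 * t + 1.0) / 6.0,
         t ** 3 / 6.0)
    return sum(w[k] * eta[(i - 1 + k) % n] for k in range(4))
```

The solve couples all values along the direction (the non-local part), while each evaluation touches only four coefficients, which is why the coefficients are computed once and reused for every target location.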

Lagrange interpolants

• Lagrange polynomials, an alternative to cubic splines:
  • more local than splines

• Definition: g a discrete function (defined on x ∈ [x0, xN]).

\[ L(x) = \sum_{j=1}^{n} L_j(x), \qquad L_j(x) = g(x_j) \prod_{k=1,\ k \neq j}^{n} \frac{x - x_k}{x_j - x_k} \]

• Property: ∀j ∈ [1, n], L(xj) = g(xj)

• Property: n points, degree of the polynomial L is n − 1
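
The definition above translates directly into code. The function name `lagrange_interpolate` is a hypothetical choice for this sketch; it evaluates L(x) from the n nodes and values without any global solve.

```python
def lagrange_interpolate(xs, gs, x):
    """Evaluate L(x) = sum_j g(x_j) * prod_{k != j} (x - x_k) / (x_j - x_k)."""
    total = 0.0
    for j, (xj, gj) in enumerate(zip(xs, gs)):
        term = gj
        for k, xk in enumerate(xs):
            if k != j:
                term *= (x - xk) / (xj - xk)
        total += term
    return total
```

With n = 4 nodes, any polynomial of degree 3 is reproduced exactly, which is a convenient check of the second property.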


One drawback of Lagrange polynomial...

• Assumption: sharp gradients in input data
  • Often the case for turbulent simulations

• Everything seems fine, let's zoom in ...

[Figure: interpolation of input points with sharp gradients, horizontal axis from 5 to 6, vertical axis from −1 to 2; curves: cubic splines, Lagrange 6-pts, Lagrange 8-pts; markers: input points]


One drawback of Lagrange polynomial: lack of continuity

• Lagrange polynomial: lack of continuity

[Figure: zoom on the previous plot, horizontal axis from 5.64 to 5.8, vertical axis from −0.9 to 0.1; annotation: "Pb: Lagrange not C1"; curves: cubic splines, Lagrange 6-pts, Lagrange 8-pts; markers: input points]
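
The loss of C1 continuity comes from the stencil changing from one cell to the next: the two polynomials meeting at a grid point agree in value but not in slope. The sketch below shows this on step-like data with a 4-point stencil on a unit-spaced grid (a smaller stencil than the 6-pts and 8-pts of the slides, chosen so that the slopes can be checked by hand). All names are hypothetical.

```python
import math

def lagrange_interpolate(xs, gs, x):
    """Evaluate the Lagrange polynomial through (xs, gs) at x."""
    total = 0.0
    for j, (xj, gj) in enumerate(zip(xs, gs)):
        term = gj
        for k, xk in enumerate(xs):
            if k != j:
                term *= (x - xk) / (xj - xk)
        total += term
    return total

def piecewise_lagrange(g, x, npts=4):
    """Piecewise interpolation: in cell [i, i+1] use the npts nodes
    centered on that cell (unit grid spacing, no boundary handling)."""
    i = math.floor(x)
    half = npts // 2
    idx = list(range(i - half + 1, i + half + 1))
    return lagrange_interpolate(idx, [g[k] for k in idx], x)

def one_sided_slopes(g, i, h=1e-6, npts=4):
    """Finite-difference slopes just left and just right of grid point i."""
    center = piecewise_lagrange(g, float(i), npts)
    left = (center - piecewise_lagrange(g, i - h, npts)) / h
    right = (piecewise_lagrange(g, i + h, npts) - center) / h
    return left, right

# Sharp gradient: step between grid points 2 and 3
g = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
left, right = one_sided_slopes(g, 3)   # left is about 5/6, right about 1/3
```

For this data the two slopes at grid point 3 differ by about 0.5, a kink of the kind visible in the zoomed plot; smooth, well-resolved data would show a much smaller jump.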


Hybrid Lagrange-Hermite polynomials

• Recipe for new Lagrange polynomials [M. Mehrenberger]:
  • Take a Lagrange polynomial (nb points: 2m), C0, not C1
  • Remove first and last point
  • Increase multiplicity of 2 zeros to fix 2 derivatives → C1 continuity
  • Finally: 2m points in input, degree of polynomial: 2m − 1

[Figure: same zoom, horizontal axis from 5.64 to 5.8, vertical axis from −0.9 to 0.1; annotations: "Success: new Lagrange is C1!", "new Lagrange close to cubic spline"; curves: cubic splines, new Lagrange 6-pts, new Lagrange 8-pts; markers: input points]


Amplification factor

• Measures the quality of the numerical scheme
• Def: amplitude error for a linear advection of a wave
• Higher is better (but should remain less than 1 to be stable)

[Figure, first panel: "Amplification for Omega = PI/2 (Lagrange)"; curves: Lagrange 6pts, Lagrange 8pts, Lagrange 10pts, cubic splines; horizontal axis from 0 to 1, vertical axis from 0.94 to 1.02]

[Figure, second panel: "Amplification for Omega = PI/2 (NEW Lagrange)"; curves: New Lagrange 6pts, New Lagrange 8pts, New Lagrange 10pts, cubic splines; same axes]

• New Lagrange 8 & 10 pts → better than splines
• New Lagrange has derivative = 0 at 0 and 1 → C1 continuity


Dispersion/Phase error

• Also measures the quality of the numerical scheme
• Def: phase error for a linear advection of a wave
• Closer to 0 is better [Filbet, Sonnendrucker, CPC 2003]

[Figure, first panel: "Dispersion for Omega = PI/2"; curves: Lagrange 6pts, Lagrange 8pts, Lagrange 10pts, cubic splines; horizontal axis from 0 to 1, vertical axis from −0.04 to 0.04]

[Figure, second panel: "Dispersion for Omega = PI/2"; curves: New Lagrange 6pts, New Lagrange 8pts, New Lagrange 10pts, cubic splines; same axes]

• Lagrange performs better than New Lagrange
• New Lagrange 10 pts better than splines, 8 pts is good
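
For standard Lagrange interpolation, both quality measures follow from a von Neumann analysis: a Fourier mode exp(iωj) interpolated at a fractional displacement α (in grid cells) is multiplied by the complex number Σ w_k(α) exp(iωk), whose modulus is the amplification factor and whose argument, compared with the exact ωα, gives the phase error. The sketch below assumes a stencil centered on the cell containing the foot and a particular sign convention for the phase error; the normalization used for the plots is not stated in the slides, and the New Lagrange and spline curves are not covered. All names are hypothetical.

```python
import cmath
import math

def centered_lagrange_weights(npts, alpha):
    """Weights of the npts-point Lagrange interpolation at x = alpha in
    [0, 1], with integer nodes -(npts/2 - 1), ..., npts/2."""
    half = npts // 2
    nodes = list(range(-half + 1, half + 1))
    weights = []
    for xj in nodes:
        w = 1.0
        for xk in nodes:
            if xk != xj:
                w *= (alpha - xk) / (xj - xk)
        weights.append(w)
    return nodes, weights

def symbol(npts, omega, alpha):
    """Complex factor applied to the mode exp(i*omega*j) by the interpolation."""
    nodes, weights = centered_lagrange_weights(npts, alpha)
    return sum(w * cmath.exp(1j * omega * k) for k, w in zip(nodes, weights))

def amplification(npts, omega, alpha):
    """Modulus of the factor: 1 means no damping."""
    return abs(symbol(npts, omega, alpha))

def phase_error(npts, omega, alpha):
    """Argument of the factor minus the exact phase shift omega*alpha."""
    return cmath.phase(symbol(npts, omega, alpha)) - omega * alpha
```

Sweeping α from 0 to 1 with ω = π/2 gives curves of the same kind as the Lagrange panels; by construction both measures are exact at α = 0 and α = 1, where the foot falls on a grid point.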


Numerical experiment with Gysela

• Advections with Lagrange interpolations instead of splines
• Many short-time runs behave well (no significant changes compared to splines) with Lagrange or New Lagrange using 5-pts up to 10-pts

• Run a difficult case: kinetic e−, not-so-fine mesh, aligned advec. (θ, ϕ)
  Nr = 256, Ntheta = 256, Nvpar = 48, Nphi = 32, Nmu = 4, ρ⋆ = 1/95

Interpolation          Status
Spline                 OK
Lagrange 6-pts         fail
Lagrange 7-pts         fail
Lagrange 8-pts         fail
New Lagrange 6-pts     OK
New Lagrange 8-pts     OK

Table 3: Status after a Gysela run of ≈ 400 time steps

• Failures of Lagrange interp. are due to errors on the distrib. function
• New Lagrange degrades well if the distrib. is under-resolved on the mesh


Time measurements with Gysela

• Run a difficult case: kinetic e−, not-so-fine mesh, aligned advec. (θ, ϕ)

Interpolation          Total    1D advection v‖    2D advections
Lagrange 8-pts          628          102                332
New Lagrange 8-pts      595           99                300
Spline                  590          110                280
New Lagrange 6-pts      566           96                273
Lagrange 6-pts          560           95                267

Table 4: Execution time (in s.) of a Gysela run of 20 time steps

• New Lagrange is competitive against splines
  Vectorizations/optimizations will be undertaken soon


Costs of Lagrange 8-pts versus splines

• Considering one advection step, assuming cache large enough, 64-bit computations, excluding cost to get grid index at left of target location

• Average cost of 1D interpolation (Lagrange 8-pts):
  • 1 load, 1 store, 48 multiply, 37 additions

• Average cost of 1D interpolation (cubic spline):
  • 1 load, 1 store, 26 multiply, 16 additions, 1 divide

• Average cost of 2D interpolation (Lagrange 8-pts):
  • 1 load, 1 store, 144 multiply, 122 additions

• Average cost of 2D interpolation (cubic spline):
  • 1 load, 1 store, 60 multiply, 40 additions, 2 divide

• Why is the Lagrange 8-pts exec. time so close to spline then?
  • under investigation, possibly: mem. bandwidth, vectorization, instruction parallelism

Lagrange polynomials - Conclusion

• Results obtained:

  • Standard Lagrange polynomial
    → good Gysela simulations
    → However, some under-resolved simulations fail
    → New hybrid Hermite-Lagrange solved this problem

  • To recover the same accuracy as splines with New Lagrange
    → Needs 6-pts or 8-pts or 10-pts (higher order vs spline)
    → Compute costs competitive for 6-pts or 8-pts (1D & 2D)
    → Vectorizations/optimizations will be undertaken







Issue near r = 0: artificial radial inner boundary condition

• Cause: assuming there is a point at r = 0
  • Several operators/solvers contain terms in 1/r
    Pb: field solver, field derivative computations, ...
  • Mesh has a singularity near r = 0 (large nb of θ points)

• Gysela simulations use an inner radius (rmin) boundary condition
  • Physical submodels are needed for all operators at rmin (up to now)
  • Transport solver: what if an eddy goes through the center?
  • Field solver: how to avoid adding an artificial boundary at rmin?

[Figure: numerical artifact at rmin due to boundary conditions. Zoom on distribution function, (r, θ) cut at a given ϕ = 0, µ = 0.05, v‖ = 3.1 vth]


Poisson solver upgraded

• Method to deal with inner boundary condition: Lai, M.-C., Wang, W.-C., Fast direct solvers for Poisson equation on 2D polar and spherical geometries. Numer. Methods for Partial Differential Equations (2002)

• Method directly integrated into Gysela; recipe for the new 2D poloidal solver:
  • Finite differences along r, spectral along θ (as before)
  • First radial point fixed to rmin = ∆r/2 (clever trick), cancellation of 2 terms → no boundary condition at rmin

• Results:
  • No more boundary condition at magnetic axis (used to be Dirichlet or Neumann in the past)
  • No more numerical artifacts due to boundary cond. at rmin
  • Eddies can go through the magnetic axis
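
The cancellation can be checked on a reduced problem. With the shifted grid r_i = (i − 1/2)∆r, the centered discretization of u'' + u'/r gives u_{i−1} the coefficient 1/∆r² − 1/(2 r_i ∆r), which is exactly zero at r_1 = ∆r/2, so no value is needed on the inner side of the first point. The sketch below is an assumption-laden illustration, not the Gysela solver: it treats only the axisymmetric (m = 0) mode of a plain Laplacian on the unit disc with u(1) = 0, and the name `solve_radial_poisson` is hypothetical.

```python
def solve_radial_poisson(n, rhs):
    """Solve u'' + u'/r = rhs(r) for 0 < r < 1 with u(1) = 0 on the grid
    r_i = (i - 1/2)*dr, i = 1..n, where dr = 2/(2n+1) puts r_{n+1} at 1."""
    dr = 2.0 / (2 * n + 1)
    r = [(i - 0.5) * dr for i in range(1, n + 1)]
    # Coefficients of u_{i-1}, u_i, u_{i+1} in the centered scheme
    lower = [1.0 / (dr * dr) - 1.0 / (2.0 * ri * dr) for ri in r]
    diag = [-2.0 / (dr * dr)] * n
    upper = [1.0 / (dr * dr) + 1.0 / (2.0 * ri * dr) for ri in r]
    f = [rhs(ri) for ri in r]
    # lower[0] vanishes: the first row needs no inner boundary value.
    # u_{n+1} = 0 (Dirichlet at r = 1) adds nothing to the last row.
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = upper[0] / diag[0]                    # Thomas algorithm
    dp[0] = f[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * cp[i - 1]
        cp[i] = upper[i] / denom
        dp[i] = (f[i] - lower[i] * dp[i - 1]) / denom
    u = [0.0] * n
    u[n - 1] = dp[n - 1]
    for i in range(n - 2, -1, -1):
        u[i] = dp[i] - cp[i] * u[i + 1]
    return r, u, lower[0]

# Right-hand side -4 has the exact solution u(r) = 1 - r^2
r, u, inner_coefficient = solve_radial_poisson(32, lambda radius: -4.0)
```

The third returned value is the coefficient that would multiply the unknown on the inner side of the first grid point; it is zero, which is the discrete counterpart of "no boundary condition at rmin".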


Poisson upgrade - result

Removing boundary condition at rmin: zoom on electric potential, poloidal cut, at a given time step

Figure 2: Neumann at rmin (old), artifact in the center

Figure 3: Lai & Wang trick at rmin (new), nothing bad in the center




Vlasov solver - upgrade

• Interpolation at r ≫ rmin:
  using cubic splines as usual (or Lagrange polynomial)

• Interpolation at the very center (near r = 0):
  using bilinear interpolation in x, y
  removing the dependency along the θ direction to avoid the singularity

• Interpolation in-between:
  ratio mixing: 2D cubic splines interp., bilinear interp.
  weighting coefficient: depending on the r value

[Figure: weighting coefficients (between 0 and 1, with 0.5 marked) versus r, from r = 0 to r = rmax, with r = 1.5∆r marked. Regions: bilinear interpolation in (x, y) for r ∈ [0, rmin]; smooth transition from 2D splines to bilinear; cubic splines 2D interpolation in (r, θ). Curves: bilinear interp., 2D cubic splines]
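
The mixing of the two interpolants can be sketched with a radius-dependent weight. The shape of the weight and the transition radii are not given in the slides: the linear ramp below, and the parameters `r_inner` and `r_outer`, are assumptions made only for illustration.

```python
def blend_weight(r, r_inner, r_outer):
    """Weight of the 2D spline value: 0 below r_inner (pure bilinear),
    1 above r_outer (pure splines), linear ramp in between (assumed shape)."""
    if r <= r_inner:
        return 0.0
    if r >= r_outer:
        return 1.0
    return (r - r_inner) / (r_outer - r_inner)

def blended_value(r, bilinear_value, spline_value, r_inner, r_outer):
    """Convex combination of the two interpolated values at radius r."""
    w = blend_weight(r, r_inner, r_outer)
    return (1.0 - w) * bilinear_value + w * spline_value
```

Because the weights sum to one everywhere, the blended result stays between the two interpolated values and changes continuously from one method to the other.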


Vlasov upgrade - result

Removing boundary condition at rmin

Figure 6: No interp. in [0, rmin] (old), artifact at r = 0

Figure 7: Interp. in [0, rmin] (new), nothing specific at r = 0




Conclusion - inner boundary condition

• Two methods integrated to suppress the inner boundary condition
  • Poisson solver: Lai & Wang method
  • Vlasov solver: specific interp. in the center r ∈ [0 : rmin]
    → Alternative to bilinear will be investigated

• Results:
  • Removes possible artifacts close to r = 0
  • In general, simulations are close to those using the previous boundary condition
    → does not invalidate previous Gysela simulations

