Performance of the hybrid MPI/OpenMP version of the HERACLES code on the Curie "Fat nodes" system
Edouard Audit, Matthias Gonzalez, Pierre Kestener and Pierre-François Lavallée
SIAM meeting, Savannah, February 2012
The HERACLES code
(Magneto)hydrodynamics: finite volume, 2nd-order Godunov, explicit or implicit
Multigroup radiative transfer: moment method, implicit
Gravity, fully coupled to hydro / split
Thermochemistry and/or heating/cooling function (local)
Turbulent forcing (local)
Fixed-grid finite-volume code working in 1, 2, and 3D in Cartesian, cylindrical and spherical coordinates. Fortran + MPI, domain decomposition.
Used in astrophysics (star formation, interstellar medium studies, …) and to interpret laser-generated plasma experiments.
Domain Decomposition
[Figure: the computational domain split into subdomains, one per MPI process]
Domain Decomposition
[Figure: ghost zones filled either from physical boundary conditions or by communications with neighboring MPI processes]
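HERACLES handles this decomposition internally; for illustration, a minimal sketch of how such a block decomposition can be set up with MPI's Cartesian topology routines in Fortran (program and variable names are illustrative, not HERACLES routines):

  program decomp_sketch
    use mpi
    implicit none
    integer :: ierr, rank, nprocs, cart_comm, left, right
    integer :: dims(3), coords(3)
    logical :: periods(3)

    call MPI_Init(ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

    dims = 0                                  ! let MPI pick a balanced 3D grid
    call MPI_Dims_create(nprocs, 3, dims, ierr)
    periods = .false.                         ! physical (non-periodic) boundaries
    call MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, .true., cart_comm, ierr)
    call MPI_Comm_rank(cart_comm, rank, ierr)
    call MPI_Cart_coords(cart_comm, rank, 3, coords, ierr)

    ! Neighbours along x; MPI_PROC_NULL at physical boundaries, so the
    ! same ghost-cell exchange code works for interior and edge ranks.
    call MPI_Cart_shift(cart_comm, 0, 1, left, right, ierr)

    call MPI_Finalize(ierr)
  end program decomp_sketch

With MPI_PROC_NULL neighbors, the same exchange code covers both cases shown in the figure: physical boundaries and inter-process communications.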
The HERACLES code
Read simulation parameters
Split domain over the MPI processes
Initial conditions
Loop over time:
  Fill the ghost cells: boundary conditions or communications   (not multi-threaded)
  Compute time step                                             (OpenMP)
  Hydro step:
    Loop over chunks                                            (OpenMP)
      Loop over cells (slopes, Riemann solver, …)
  Compute cooling (local)                                       (OpenMP)
  Stirring (local)                                              (OpenMP)
  Output
End
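For illustration, a minimal Fortran/OpenMP sketch of the multi-threaded chunk loop above, where threads share whole chunks and the per-cell work (slopes, Riemann solver, …) lives in a kernel routine; all names here are hypothetical stand-ins, not HERACLES code:

  program hybrid_sketch
    implicit none
    integer, parameter :: nchunks = 64, n = 1000
    real(8) :: u(n, nchunks)
    integer :: ic

    u = 1.0d0
    ! Each thread processes whole chunks independently; the loop over
    ! cells is hidden inside the per-chunk kernel.
    !$omp parallel do schedule(dynamic)
    do ic = 1, nchunks
       call update_chunk(u(:, ic))
    end do
    !$omp end parallel do
    print *, 'u(1,1) =', u(1, 1)

  contains

    subroutine update_chunk(col)            ! toy stand-in for the hydro kernel
      real(8), intent(inout) :: col(:)
      col = 0.5d0 * (col + cshift(col, 1))  ! trivial stencil-like update
    end subroutine update_chunk

  end program hybrid_sketch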
Pure MPI vs. MPI/OpenMP
[Figure: the same domain split among 16 MPI processes (pure MPI) or 4 MPI processes with 4 threads each]
Pure MPI: 16 messages of size 1. MPI + 4 threads: 4 messages of size 2.
With threads, fewer MPI subdomains remain, so fewer but larger messages cross the network: less latency overhead and less total ghost-cell traffic.
The Curie system
Fat nodes (February 2011): 360 BullX S6010, Intel Nehalem-EX at 2.26 GHz, 11,520 cores (32 cores/node), 128 GB/node, 105 TFlops
Thin nodes (March 2012): 5,040 BullX B510, Intel Sandy Bridge, 80,640 cores (16 cores/node), 4 GB/core, 128 GB SSD, 1.5+ PFlops
Hybrid nodes (October 2011): 144 BullX B505 with 288 Nvidia M2090 GPUs, 184 + 11 TFlops
Interconnect: InfiniBand QDR
Storage (1st level): Lustre, 6 PB, 150 GB/s
Strong Scaling (900³ run)
[Plot: performance vs. number of cores for pure MPI and for 2, 4, and 8 threads per MPI process]
Weak scaling (256³ per node, 32 cores)
[Plot: performance vs. number of cores for pure MPI and for 2, 4, and 8 threads per MPI process]
Scaling on the Blue Gene at IDRIS (strong scaling)
IO – the craftsman way
All processes write their output at the same time…
This fails beyond a few thousand (10³) processes.
Remedy: write by packets, with a wait between packets (temporization):
Ncpu_write ~ 100–1000, T_wait ~ 2–10 seconds
One output costs about 5 time steps.
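A minimal sketch of this write-by-packet idea, assuming a simple rank-ordered grouping and a busy-wait temporization; the actual HERACLES implementation may differ:

  program packet_io_sketch
    use mpi
    implicit none
    integer, parameter :: ncpu_write = 128    ! ~100-1000 on Curie
    real(8), parameter :: t_wait = 2.0d0      ! ~2-10 s between packets
    integer :: ierr, rank, nprocs, packet, npackets
    character(len=32) :: fname

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    npackets = (nprocs - 1) / ncpu_write + 1

    do packet = 0, npackets - 1
       if (packet == rank / ncpu_write) then  ! this rank's turn to write
          write(fname, '(a,i6.6,a)') 'out_', rank, '.dat'
          open(10, file=fname, form='unformatted')
          write(10) rank                      ! stands in for the real output
          close(10)
       end if
       call pause_for(t_wait)                 ! temporization between packets
    end do

    call MPI_Finalize(ierr)

  contains

    subroutine pause_for(t)                   ! busy wait; sleep() would also do
      real(8), intent(in) :: t
      real(8) :: t0
      t0 = MPI_Wtime()
      do while (MPI_Wtime() - t0 < t)
      end do
    end subroutine pause_for

  end program packet_io_sketch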
IO – the professional approach (P. Wautelet and P. Kestener)
Four different I/O approaches were tested: POSIX (one file per MPI process), MPI-IO, HDF5, Parallel-NetCDF.
STEP 1: Optimizing the MPI-IO hints. MPI-IO hints can have a dramatic effect on I/O performance, and the best parameters depend on the application: 7 of the 23 available hints were tested! (A sketch of setting hints through MPI_Info follows below.)
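For illustration, a hedged sketch of how MPI-IO hints are passed through an MPI_Info object at file-open time; the hint names are standard ROMIO/Lustre hints, but the values shown are illustrative, not the tuned Curie settings:

  program mpiio_hints_sketch
    use mpi
    implicit none
    integer :: ierr, info, fh

    call MPI_Init(ierr)
    call MPI_Info_create(info, ierr)
    ! Standard ROMIO / Lustre hints; values here are illustrative only.
    call MPI_Info_set(info, 'romio_cb_write',  'enable',   ierr)  ! collective buffering
    call MPI_Info_set(info, 'cb_buffer_size',  '16777216', ierr)  ! 16 MB aggregation buffer
    call MPI_Info_set(info, 'striping_factor', '64',       ierr)  ! Lustre stripe count
    call MPI_File_open(MPI_COMM_WORLD, 'out.dat', &
                       MPI_MODE_WRONLY + MPI_MODE_CREATE, info, fh, ierr)
    ! ... collective writes (MPI_File_set_view / MPI_File_write_all) here ...
    call MPI_File_close(fh, ierr)
    call MPI_Info_free(info, ierr)
    call MPI_Finalize(ierr)
  end program mpiio_hints_sketch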
STEP 2: Strong scaling test.
Conclusions
Multi-threading is necessary for large numbers of cores.
OpenMP is "easy" to implement but not always easy to understand…
Multi-threaded communications are probably necessary.
Good results for a small number of threads.