RAMSES @CSCS
Kotsalos Christos & Claudio Gheller
Refactoring of the RAMSES code and performance optimisation on CPUs and GPUs
RAMSES: modular physics
[Module diagram, repeated once per time step ⤾: AMR build, Domain decomposition / Load balancing, Gravity, Hydro, MHD, N-body, Cooling, Star formation, Other physics, RT]
Our goal
[Module diagram: AMR build, Load balance, Gravity, Hydro, MHD, N-body, Cooling, Star formation, Other physics, RT]
GPU porting using OpenACC directives:
❖ Recognise the GPU-friendly parts: computational intensity + data independence (see the sketch below)
❖ Minimise the data transfer: GPU <—> CPU communication
❖ GPU-to-GPU communication through GPUDirect, and communication in general
❖ GPU porting and optimisation
} Infrastructure
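As an illustration of what "GPU-friendly" means here, a minimal OpenACC sketch of a compute-intensive loop whose iterations are independent of each other (hypothetical array names and update formula, not actual RAMSES code):

program acc_sketch
   implicit none
   integer, parameter :: ncell = 1000000
   real(kind=8) :: uold(ncell), unew(ncell), dt
   integer :: i
   dt   = 1.0d-3
   uold = 1.0d0
   ! copy the input to the device once, keep the result there until the end
   !$acc data copyin(uold) copyout(unew)
   !$acc parallel loop
   do i = 1, ncell
      unew(i) = uold(i) + dt * (uold(i)**2 - uold(i))   ! independent per-cell work
   end do
   !$acc end parallel loop
   !$acc end data
   print *, unew(1)
end program acc_sketch

Every iteration touches only its own cell, so the loop maps cleanly onto the GPU, and the surrounding data region is what keeps the GPU <—> CPU traffic down.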
Problems to overcome!
[Module diagram: AMR build, Load balance, Gravity, Hydro, MHD, N-body, Cooling, Star formation, Other physics, RT]
❖ Dependencies between the modules
❖ amr_step is more complex than this simplified representation
❖ Recursive calls
❖ Communicators (OpenACC & memory)
❖ The number of grids depends on the level of refinement
❖ GPU porting issues: difficulty to fit the loops in the architecture!
1st part of the Project
Redesign the communication for GPUs & CPUs:
❖ Is the CPU communication optimal?
❖ Is the communication suitable for GPU programming?
Communication between the subdomains
• Point-to-point communication: ISend <—> IRecv
• The subdomains exchange their solutions with their physical neighbours
• Everything goes through the communicators
MPI
Send Buffer Recv Buffer
emission communicator (structure/ derived data types)
reception communicator (structure/ derived data types)
Stored information:
‣ Number of cells
‣ Cell index
‣ Double precision arrays
‣ Single precision arrays
Advantages: elegant structure
Disadvantages: data locality issues; not fully supported by OpenACC (a sketch of such a structure follows below)
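A rough sketch of what such a derived-type communicator might look like, based only on the stored information listed above (hypothetical type and field names, not the actual RAMSES definition):

type communicator
   integer                   :: ncell       ! number of cells to exchange
   integer,      allocatable :: igrid(:)    ! cell indices
   real(kind=8), allocatable :: u_dp(:,:)   ! double precision arrays
   real(kind=4), allocatable :: u_sp(:,:)   ! single precision arrays
end type communicator

! one emission and one reception communicator per (cpu, level)
type(communicator), allocatable :: emission(:,:), reception(:,:)

The allocatable members inside the derived type are exactly what OpenACC handles poorly, which is where the data locality and support problems come from.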
Type of communication: point-to-point or collective?
❖ Point-to-point: Isend & Irecv (original implementation in RAMSES)
❖ Collective: Alltoall(v); the data to be communicated are gathered in arrays (one per PE), and these arrays (of intrinsic data types) are scattered from all PEs to all PEs
OpenACC & allocatable derived data types:
• Not fully supported (Cray only, and even then only partially)
• No GPUDirect support (GPU-to-GPU communication)
⇒ Two solutions to overcome this problem:
❖ Use collective communication (the buffers are of intrinsic data types): check the performance
❖ Replace the communicators, locally or globally, with regular arrays: everywhere in the code, not an easy solution
GPUDirect
❖ GPU-to-GPU communication
❖ The same calls as regular MPI, but the environment must be set up:
export MPICH_RDMA_ENABLED_CUDA=1
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
and each MPI call is embedded in a host_data region:
!$acc host_data use_device(send buffer)
   call MPI_ISEND( normal arguments )
!$acc end host_data
!$acc host_data use_device(recv buffer)
   call MPI_IRECV( normal arguments )
!$acc end host_data
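Putting the pieces together, a sketch of one GPUDirect exchange with concrete buffers (hypothetical buffer names, counts and ranks; assumes use mpi and the environment settings above):

real(kind=8) :: sendbuf(nbuf), recvbuf(nbuf)
integer      :: req(2), stat(MPI_STATUS_SIZE, 2), ierr

!$acc data create(sendbuf, recvbuf)
! ... fill sendbuf on the device ...
!$acc host_data use_device(sendbuf, recvbuf)
call MPI_ISEND(sendbuf, nbuf, MPI_DOUBLE_PRECISION, idest, tag, &
               MPI_COMM_WORLD, req(1), ierr)
call MPI_IRECV(recvbuf, nbuf, MPI_DOUBLE_PRECISION, isrc, tag, &
               MPI_COMM_WORLD, req(2), ierr)
!$acc end host_data
call MPI_WAITALL(2, req, stat, ierr)
! recvbuf now holds the neighbour's data directly in device memory
!$acc end data

The host_data region makes MPI see the device addresses of the buffers, so with MPICH_RDMA_ENABLED_CUDA=1 the transfer goes GPU-to-GPU without staging through the host.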
AlltoAll
[Diagram: each PE's send buffer is divided into n equal segments, one per PE; after the Alltoall, PE i's receive buffer contains segment i from every PE]
MPI_ALLTOALL(sendbuf, sendcount, sendtype, recvbuf, recvcnt, recvtype, comm, ierr)
Restriction: sendcount & recvcnt are fixed per PE
AlltoAllv
[Diagram: PE i's send buffer is divided into n variable-length segments, segment j of length sendcnts(j)]
MPI_ALLTOALLV(sendbuf, sendcnts, sdispls, sendtype, recvbuf, recvcnts, rdispls, recvtype, comm, ierr)
sdispls(1) = 0
sdispls(2) = sdispls(1)+sendcnts(1)
sdispls(n) = sdispls(n-1)+sendcnts(n-1)
The sendcnts & recvcnts are NOT fixed per PE
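A sketch of how the counts and displacements above translate into an actual call (hypothetical buffer names; assumes use mpi and that sendcnts/recvcnts have already been filled from the load balancing):

integer :: sendcnts(npes), recvcnts(npes), sdispls(npes), rdispls(npes)
integer :: i, ierr

sdispls(1) = 0
rdispls(1) = 0
do i = 2, npes
   sdispls(i) = sdispls(i-1) + sendcnts(i-1)   ! prefix sums, as above
   rdispls(i) = rdispls(i-1) + recvcnts(i-1)
end do

call MPI_ALLTOALLV(sendbuf, sendcnts, sdispls, MPI_DOUBLE_PRECISION, &
                   recvbuf, recvcnts, rdispls, MPI_DOUBLE_PRECISION, &
                   MPI_COMM_WORLD, ierr)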
Experiment: Replace the point-to-point with the collective communication everywhere (CPU version of RAMSES)
Alltoallv vs ISend/IRecv
Latency issue!
Point-to-point is ~3 times faster
Load balancing
The subdomains need to communicate mainly with their neighbours
⇒ Collective communication sends and receives too many empty buffers
Latency Issue!
AlltoAllv_tuned (user specified)
[Diagram: PE i's send buffer split into n segments of length sendcnts(1) ... sendcnts(n)]
If sendcnts(i) = 0, the corresponding segment of the buffer is filled with zeros anyway; the special case of this is the plain AlltoAll.
Bandwidth issue! Too much data to communicate!
Final solution (communication part)
❖ Point-to-point communication
❖ Replace the communicators locally with regular arrays
❖ GPU-to-GPU through GPUDirect works flawlessly
[Diagram: the emission/reception communicators (derived data types) are converted into emission/reception arrays (intrinsic data types), which serve as the send/recv buffers]
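A rough sketch of this local replacement, reusing the hypothetical communicator type from the earlier sketch: the derived-type payload is packed into a flat, intrinsic-type array that OpenACC and GPUDirect can handle directly (names and loop structure are illustrative only):

real(kind=8), allocatable :: emission_buf(:)
integer :: icpu, icell, ivar, ioff

allocate(emission_buf(ncell_total * nvar))
ioff = 0
do icpu = 1, ncpu
   do icell = 1, emission(icpu, ilevel)%ncell
      do ivar = 1, nvar
         emission_buf(ioff + ivar) = emission(icpu, ilevel)%u_dp(icell, ivar)
      end do
      ioff = ioff + nvar
   end do
end do
! emission_buf is a plain array, so it can live on the device and be passed
! to MPI_ISEND inside an !$acc host_data use_device region (GPUDirect)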
2nd part of the Project
GPU porting of the Poisson Solver
[Module diagram: AMR build, Load balance, Gravity, Hydro, MHD, N-body, Cooling, Star formation, Other physics, RT]
GPU porting of the Poisson solver
Communicators caused issues because of data locality:
❖ Implicit synchronisation barriers
❖ Poor performance
Local communicators
Stored information:
‣ Number of grids
‣ Grid index
‣ Double precision arrays
‣ Single precision arrays
⇒ Replaceable by 1D arrays (intrinsic data types), indexed per level, component, cpu and cell
Result: data locality, increased performance (2 to 3 times faster)
The replacement is done at the beginning of multigrid_fine (see the sketch below).
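A schematic of the kind of flat, intrinsic-type arrays that take the communicators' place (hypothetical names and index arithmetic, purely illustrative):

integer,      allocatable :: comm_ngrid(:)   ! number of grids per (level, cpu)
real(kind=8), allocatable :: comm_u_dp(:)    ! flattened double precision payload
integer :: idx

! entry for component ivar of grid igrid exchanged with cpu icpu at level ilevel
idx = (((ilevel-1)*ncpu + (icpu-1))*ngridmax + (igrid-1))*nvar + ivar
comm_u_dp(idx) = value_to_send

Because everything is a contiguous array of an intrinsic type, the data are local in memory and OpenACC can move them without restrictions, which is where the 2 to 3 times speed-up comes from.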
3rd part of the Project
Infrastructure development:
❖ Which parts on CPU, which on GPU?
❖ Minimise interaction between CPU and GPU
[Module diagram: AMR build, Load balance, Gravity, Hydro, MHD, N-body, Cooling, Star formation, Other physics, RT]
Infrastructure
Subroutines that update host/ device (optimised data transfer)
#if defined(_OPENACC)
   call update_globalvar_dp_to_host  (var, level)
   call update_globalvar_dp_to_device(var, level)
#endif
instead of
!$acc update device/host(var)
Use of the update directive, but in an optimal way: communicate only what is needed.
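One plausible shape of such a wrapper, a sketch only (the level bookkeeping arrays level_first/level_last are assumptions, not actual RAMSES names):

subroutine update_globalvar_dp_to_device(var, level)
   implicit none
   real(kind=8), intent(inout) :: var(:)
   integer,      intent(in)    :: level
   integer :: ifirst, ilast
   ifirst = level_first(level)   ! assumed bookkeeping of the level's index range
   ilast  = level_last(level)
   ! transfer only the slice of the global variable that this level needs
   !$acc update device(var(ifirst:ilast))
end subroutine update_globalvar_dp_to_device

The !$acc update directive is still used underneath, but only on the relevant part of the global variable instead of the whole array.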
Manual Profiler (MPROF)
Why?
❖ Bugs in CRAYPAT
❖ Strange (implicit) SYNC barriers
❖ Discrepancy between CRAYPAT and NVIDIA's tools
It is enabled from the Makefile by the flag MPROF
It uses the MPI_WTIME function
It is adapted for all the GPU-ported subroutines of RAMSES
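The timing pattern itself is simple; a sketch of what an MPROF-instrumented region might look like (hypothetical variable and region names, with t0 and t_region declared as real(kind=8)):

#ifdef MPROF
   t0 = MPI_WTIME()
#endif
   call gpu_ported_region()                   ! hypothetical region being timed
#ifdef MPROF
   t_region = t_region + (MPI_WTIME() - t0)   ! accumulated over the time steps
#endif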
Until this point
[Module diagram: AMR build, Load balance, Gravity, Hydro, MHD, N-body, Cooling, Star formation, Other physics, RT; each module marked as running on the GPU or still on the CPU]
❖ Recognise the GPU-friendly parts
❖ Construct an optimised infrastructure so as to minimise the data transfer
❖ GPU-to-GPU communication through GPUDirect, and communication in general
❖ GPU porting completed (~95%)
❖ Optimisation (ongoing, with daily encouraging results)
❖ OpenMP porting of the non-GPU-ported parts
Working environment : Piz Daint
Results
Optimisation goal: 1 GPU > 8 cores
Test 128, time steps 100 to 110
0"
50"
100"
150"
200"
250"
1" 2" 3" 4" 5" 6" 7" 8"
Time%(sec)%
Number%of%PES%(if%ACCyes%then%#%PES%=%#%GPUs%with%1%task%per%node)%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%(if%ACCno%then%#%PES%=%#%nodes%with%8%tasks%per%node)%%
Summary%of%the%current%situaDon%
new_ACCyes"Total"Time"
original_ACCno"Total"Time"
original RAMSES ~ 1.7 times faster
The non-GPU parts must be ported using OpenMP for full comparability!
CRAY-PAT report
Fair comparison: 216 - 80 ≈ 140 sec
original RAMSES ~ 1.1 times faster
Manual Profiler (MPROF): 1 GPU vs 8 cores
Infrastructure
GPUs                    1      2      4      8
ACC_COPY (sec)       10.5    7.7    6.3    6.8
Total time (sec)    216.8  195.2  129.7  111.4
ACC_COPY / total (%)  4.8    3.9    4.9    6.1
0"
25"
50"
75"
1" 2" 3" 4" 5" 6" 7" 8"
MPI$Tim
e$(sec)$
Number$of$PES$(if$ACCyes$then$#$PES$=$#$GPUs$with$1$task$per$node)$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$(if$ACCno$then$#$PES$=$#$nodes$with$8$tasks$per$node)$$
RAMSES$communicaGon$
new_ACCyes"MPI"Time"
original_ACCno"MPI"Time"
~2 times less communication in the new approach
0"
25"
50"
75"
100"
1" 2" 3" 4" 5" 6" 7" 8"
Commun
ica)
on*/*Total*Tim
e*(%
)*
Number*of*PES*(if*ACCyes*then*#*PES*=*#*GPUs*with*1*task*per*node)*******************************(if*ACCno*then*#*PES*=*#*nodes*with*8*tasks*per*node)**
new_ACCyes"
original_ACCno"
GPUDirect ~ 1.55 x regular MPI (CPU-to-CPU)
0"
5"
10"
15"
20"
0" 8" 16" 24" 32" 40" 48" 56" 64"
Isen
d+Ire
cv*+me*(sec)*
Number*of*GPUs*(if*ACCyes)*&*Number*of*CPUs*(if*ACCno)*
GPUDirect*
original_GPUDirectNO"
new_GPUDirectYES"
Optimisation (15-07): 1 GPU vs 8 cores
RAMSES @CSCS
Thank you very much for your attention!