
Page 1: RAMSES @CSCS

RAMSES @CSCS

Kotsalos Christos & Claudio Gheller

Refactoring of the Ramses code and performance optimisation on CPUs and GPUs

Page 2: RAMSES @CSCS

RAMSES: modular physics

AMR build

Domain decomposition - Load balancing

Gravity

Hydro

MHD

N-body

Cooling / Star formation / Other physics / RT

per time step ⤾

Page 3: RAMSES @CSCS

Our goal

AMR build Load balance Gravity

Hydro

MHD

N-body

Cooling / Star formation / Other physics / RT

GPU using OpenACC directives

❖ Recognise the GPU-friendly parts: computational intensity + data independence

❖ Minimise the data transfer: GPU <—> CPU communication

❖ GPU-to-GPU communication through GPUDirect & communication generally

❖ GPU porting and optimisation

} Infrastructure

Page 4: RAMSES @CSCS

Problems to overcome!

AMR build Load balance Gravity

Hydro

MHD

N-body

Cooling / Star formation / Other physics / RT

❖ Dependencies between the modules
❖ A more complex amr_step than this simplified representation
❖ Recursive calls
❖ Communicators (OpenACC & memory)
❖ The number of grids depends on the level of refinement ⇒ GPU porting issues: difficulty to fit the loops to the architecture!

Page 5: RAMSES @CSCS

1st part of the Project

Redesign communication for GPUs & CPUs:
❖ Is the CPU communication optimal?
❖ Is the communication suitable for GPU programming?

Page 6: RAMSES @CSCS

Communication between the subdomains

• Point-to-Point communication :: ISend <—> IRecv
• The subdomains communicate the solutions with their (physical) neighbours
• Everything goes through the communicators

MPI

Send Buffer Recv Buffer

emission communicator (structure/ derived data types)

reception communicator (structure/ derived data types)

Stored information:
‣ Number of cells
‣ Cell index
‣ Double precision arrays
‣ Single precision arrays

} Advantages: elegant structure

Disadvantages: data locality issues; not fully supported by OpenACC
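A minimal sketch, with hypothetical field names, of what such an emission/reception communicator could look like in Fortran (the actual RAMSES derived type may differ):

! Illustrative only: one communicator per (remote CPU, AMR level)
type communicator_sketch
   integer                   :: ncell        ! number of cells to exchange
   integer,      allocatable :: icell(:)     ! cell indices
   real(kind=8), allocatable :: u_dp(:,:)    ! double precision arrays
   real(kind=4), allocatable :: u_sp(:,:)    ! single precision arrays
end type communicator_sketch
type(communicator_sketch), allocatable :: emission(:,:), reception(:,:)

It is precisely the allocatable components that make the structure elegant on the CPU but problematic for OpenACC data clauses.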

Page 7: RAMSES @CSCS

Type of Communication: Point-to-Point or Collective ?

Point-to-point: Isend & Irecv (the original implementation in RAMSES)

Collective: Alltoall(v). The data to be communicated are gathered in arrays (one per PE); these arrays (of intrinsic data types) are then scattered from all PEs to all PEs.

Page 8: RAMSES @CSCS

OpenACC & Allocatable Derived data types

• Not fully supported (Cray compiler only, and even then only partially)
• No GPUDirect support (GPU-to-GPU communication)

⇒ Two solutions to overcome this problem

❖ Use collective communication (the buffers are of intrinsic data types) ⇒ check performance

❖ Replace the communicators, locally or globally, with regular arrays ⇒ doing it everywhere in the code is not an easy solution

Page 9: RAMSES @CSCS

GPUDirect

❖ GPU-to-GPU communication
❖ The same calls as regular MPI, but:

export MPICH_RDMA_ENABLED_CUDA=1

!$acc host_data use_device(send buffer)
call MPI_ISEND( normal arguments )
!$acc end host_data

!$acc host_data use_device(recv buffer)
call MPI_IRECV( normal arguments )
!$acc end host_data

Wrap the MPI call inside the host_data region

export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
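A slightly fuller sketch of the same pattern, with hypothetical buffer and argument names (sendbuf, recvbuf, nbuf, idest, isrc): the buffers live on the device, host_data hands their device addresses to MPI, and with MPICH_RDMA_ENABLED_CUDA=1 the transfer goes GPU-to-GPU.

subroutine exchange_gpudirect(sendbuf, recvbuf, nbuf, idest, isrc)
  use mpi
  implicit none
  integer, intent(in) :: nbuf, idest, isrc
  real(kind=8), intent(in)  :: sendbuf(nbuf)
  real(kind=8), intent(out) :: recvbuf(nbuf)
  integer :: req_send, req_recv, ierr
  integer, parameter :: tag = 1
  !$acc data copyin(sendbuf) copyout(recvbuf)
  ! expose the device addresses of the buffers to MPI
  !$acc host_data use_device(sendbuf)
  call MPI_ISEND(sendbuf, nbuf, MPI_DOUBLE_PRECISION, idest, tag, &
                 MPI_COMM_WORLD, req_send, ierr)
  !$acc end host_data
  !$acc host_data use_device(recvbuf)
  call MPI_IRECV(recvbuf, nbuf, MPI_DOUBLE_PRECISION, isrc, tag, &
                 MPI_COMM_WORLD, req_recv, ierr)
  !$acc end host_data
  call MPI_WAIT(req_send, MPI_STATUS_IGNORE, ierr)
  call MPI_WAIT(req_recv, MPI_STATUS_IGNORE, ierr)
  !$acc end data
end subroutine exchange_gpudirect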

Page 10: RAMSES @CSCS

AlltoAll

Send Buffers                         Recv Buffers
pe1: 1@  2@  …  n@                   pe1: 1@  1#  …  1*
pe2: 1#  2#  …  n#                   pe2: 2@  2#  …  2*
…                                    …
pen: 1*  2*  …  n*                   pen: n@  n#  …  n*

MPI_Alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm, ierr)

Restriction: sendcount & recvcount are fixed, i.e. the same for every PE
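For reference, a minimal Fortran sketch of this fixed-count variant (all names hypothetical): every PE sends the same number of double precision values to every other PE.

subroutine exchange_alltoall(sendbuf, recvbuf, ncnt)
  use mpi
  implicit none
  integer, intent(in) :: ncnt                 ! values per destination PE
  real(kind=8), intent(in)  :: sendbuf(*)     ! size ncnt * npes
  real(kind=8), intent(out) :: recvbuf(*)     ! size ncnt * npes
  integer :: ierr
  call MPI_ALLTOALL(sendbuf, ncnt, MPI_DOUBLE_PRECISION, &
                    recvbuf, ncnt, MPI_DOUBLE_PRECISION, &
                    MPI_COMM_WORLD, ierr)
end subroutine exchange_alltoall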

Page 11: RAMSES @CSCS

AlltoAllv

Send Buffer of pe_i: 1@  2@  …  n@
(segments of sizes sendcnts(1), sendcnts(2), …, sendcnts(n))

MPI_Alltoallv(sendbuf, sendcnts, sdispls, sendtype, recvbuf, recvcnts, rdispls, recvtype, comm, ierr)

sdispls(1) = 0

sdispls(2) = sdispls(1)+sendcnts(1)

sdispls(n) = sdispls(n-1)+sendcnts(n-1)

The sendcnts & recvcnts are NOT fixed per PE: each PE may send a different amount to every other PE
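A minimal sketch of the variable-count variant (names hypothetical): the displacements are built from the per-PE counts exactly as in the recurrence above, then a single MPI_ALLTOALLV moves everything.

subroutine exchange_alltoallv(sendbuf, sendcnts, recvbuf, recvcnts, npes)
  use mpi
  implicit none
  integer, intent(in) :: npes
  integer, intent(in) :: sendcnts(npes), recvcnts(npes)
  real(kind=8), intent(in)  :: sendbuf(*)     ! size sum(sendcnts)
  real(kind=8), intent(out) :: recvbuf(*)     ! size sum(recvcnts)
  integer :: sdispls(npes), rdispls(npes), ipe, ierr
  ! displacements follow from the counts
  sdispls(1) = 0
  rdispls(1) = 0
  do ipe = 2, npes
     sdispls(ipe) = sdispls(ipe-1) + sendcnts(ipe-1)
     rdispls(ipe) = rdispls(ipe-1) + recvcnts(ipe-1)
  end do
  call MPI_ALLTOALLV(sendbuf, sendcnts, sdispls, MPI_DOUBLE_PRECISION, &
                     recvbuf, recvcnts, rdispls, MPI_DOUBLE_PRECISION, &
                     MPI_COMM_WORLD, ierr)
end subroutine exchange_alltoallv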

Page 12: RAMSES @CSCS

Experiment: Replace the point-to-point with the collective communication everywhere (CPU version of RAMSES)

(Timing plot: Alltoallv vs ISend/IRecv)

Latency issue!

Point-to-point is ~3 times faster

Page 13: RAMSES @CSCS

Load balancing

The subdomains need to communicate mainly with their neighbours

⇒ The collective communication sends and recvs too many empty buffers

Latency Issue!

Page 14: RAMSES @CSCS

AlltoAllv_tuned

User specified

Bandwidth Issue! Too much data to communicate!

Send Buffer of pe_i: 1@  2@  …  n@
(segments of sizes sendcnts(1), sendcnts(2), …, sendcnts(n))

If sendcnts(i) = 0, the corresponding buffer segment is still filled with zeros

A special case of this is the AlltoAll

Page 15: RAMSES @CSCS

Final solution (communication part)

❖ Point-to-Point communication
❖ Replace the communicators locally with regular arrays (see the packing sketch below)
❖ GPU-to-GPU through GPUDirect, flawlessly

Send Buffer: emission communicator (derived data types) ↧ emission array (intrinsic data types)

Recv Buffer: reception communicator (derived data types) ↧ reception array (intrinsic data types)
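A minimal sketch of the local replacement, reusing the hypothetical communicator_sketch type from the earlier slide and a hypothetical variable count nvar: the derived-type payload is copied into one contiguous intrinsic-type array, which OpenACC and GPUDirect can handle directly.

! assumes communicator_sketch is provided by a module (see the communicator sketch)
subroutine pack_emission(emission, ncpu, nvar, emission_array)
  implicit none
  integer, intent(in) :: ncpu, nvar
  type(communicator_sketch), intent(in) :: emission(ncpu)
  real(kind=8), intent(out) :: emission_array(*)
  integer :: icpu, n, offset
  offset = 0
  do icpu = 1, ncpu
     n = emission(icpu)%ncell * nvar
     if (n > 0) then
        ! flatten the 2D payload into the contiguous send buffer
        emission_array(offset+1:offset+n) = reshape(emission(icpu)%u_dp, (/ n /))
     end if
     offset = offset + n
  end do
end subroutine pack_emission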

Page 16: RAMSES @CSCS

2nd part of the Project

GPU porting of the Poisson Solver

AMR build Load balance Gravity

Hydro

MHD

N-body

CoolingStar formationOther physics RT

Page 17: RAMSES @CSCS

GPU porting of the Poisson solver

Communicators caused issues because of data locality:
❖ Implicit synchronisation barriers
❖ Poor performance

Local communicators

Stored information:
‣ Number of grids
‣ Grid index
‣ Double precision arrays
‣ Single precision arrays

↧ Replaceable by 1D arrays (intrinsic data types), indexed by level, component, cpu and cell

Page 18: RAMSES @CSCS

1D arrays (intrinsic data types), indexed by level, component, cpu and cell, built at the beginning of multigrid_fine

Data locality ⇒ increased performance: 2 to 3 times faster
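A minimal illustration of why the flat buffers help (the names and the dummy kernel are illustrative only): a contiguous intrinsic-type array maps directly onto the device with standard OpenACC data clauses, which is exactly what the allocatable derived-type communicators made awkward.

subroutine scale_on_device(buf_dp, nbuf)
  implicit none
  integer, intent(in) :: nbuf
  real(kind=8), intent(inout) :: buf_dp(nbuf)
  integer :: i
  !$acc data copy(buf_dp)
  !$acc parallel loop present(buf_dp)
  do i = 1, nbuf
     buf_dp(i) = 2.0d0 * buf_dp(i)   ! placeholder for the real device work
  end do
  !$acc end data
end subroutine scale_on_device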

Page 19: RAMSES @CSCS

3rd part of the Project

Infrastructure development:
❖ Which parts on the CPU, which on the GPU?
❖ Minimise the interaction between CPU and GPU

AMR build Load balance Gravity

Hydro

MHD

N-body

Cooling / Star formation / Other physics / RT

Page 20: RAMSES @CSCS

Infrastructure

Subroutines that update host/device (optimised data transfer):

#if defined(_OPENACC)
  call update_globalvar_dp_to_host  (var, level)
  call update_globalvar_dp_to_device(var, level)
#endif

instead of

!$acc update device/host(var)

Communicate only what is needed: use of the update directive, but in an optimal way.
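A sketch of what such a wrapper might look like (get_level_bounds is a hypothetical helper returning the index range owned by ilevel; the real routine in the RAMSES port may differ): only the slice of the global variable that the level actually touches is moved between device and host.

subroutine update_globalvar_dp_to_host(var, ilevel)
  implicit none
  real(kind=8), intent(inout) :: var(:)
  integer, intent(in) :: ilevel
  integer :: ifirst, ilast
  call get_level_bounds(ilevel, ifirst, ilast)   ! hypothetical helper
  !$acc update host(var(ifirst:ilast))           ! partial update only
end subroutine update_globalvar_dp_to_host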

Page 21: RAMSES @CSCS

Manual Profiler (MPROF)

Why?
❖ Bugs in CRAYPAT
❖ Strange (implicit) SYNC barriers
❖ Discrepancy between CRAYPAT and NVIDIA's tools

It is enabled from the Makefile by the flag MPROF

It uses the MPI_WTIME function

It is adapted for all the GPU-ported subroutines of RAMSES
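A minimal sketch of a timing region in the MPROF spirit (the profiled routine and the accumulator t_gravity are illustrative): MPI_WTIME brackets the call and the elapsed time is accumulated only when the code is compiled with the MPROF flag.

subroutine timed_gravity_step()
  use mpi
  implicit none
  real(kind=8), save :: t_gravity = 0.0d0   ! accumulated time for this region
  real(kind=8) :: tstart
#if defined(MPROF)
  tstart = MPI_WTIME()
#endif
  call gravity_solver_stub()                ! hypothetical GPU-ported routine
#if defined(MPROF)
  t_gravity = t_gravity + (MPI_WTIME() - tstart)
#endif
end subroutine timed_gravity_step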

Page 22: RAMSES @CSCS

Until this point

AMR build Load balance Gravity

Hydro

MHD

N-body

Cooling / Star formation / Other physics / RT

❖ Recognise the GPU-friendly parts
❖ Construct an optimised infrastructure so as to minimise the data transfer
❖ GPU-to-GPU communication through GPUDirect & communication generally
❖ GPU porting completed (~95%)
❖ Optimisation (ongoing, with daily encouraging results)
❖ OpenMP porting of the non-GPU-ported parts

(In the diagram, each module is labelled GPU or CPU according to its current porting status.)

Page 23: RAMSES @CSCS

Working environment: Piz Daint

Results

Optimisation target: 1 GPU > 8 cores

Test case: Test128, time steps 100 to 110

Page 24: RAMSES @CSCS

0"

50"

100"

150"

200"

250"

1" 2" 3" 4" 5" 6" 7" 8"

Time%(sec)%

Number%of%PES%(if%ACCyes%then%#%PES%=%#%GPUs%with%1%task%per%node)%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%(if%ACCno%then%#%PES%=%#%nodes%with%8%tasks%per%node)%%

Summary%of%the%current%situaDon%

new_ACCyes"Total"Time"

original_ACCno"Total"Time"

The original RAMSES is ~1.7 times faster; the non-GPU parts must be ported using OpenMP for full comparability!

Page 25: RAMSES @CSCS

CRAY-PAT report

Fair comparison: 216 - 80 ≈ 140 sec

The original RAMSES is ~1.1 times faster

Page 26: RAMSES @CSCS

Manual Profiler (MPROF): 1 GPU vs 8 cores

Page 27: RAMSES @CSCS

Infrastructure

GPUs                     1      2      4      8
ACC_COPY (sec)        10.5    7.7    6.3    6.8
Total Time (sec)     216.8  195.2  129.7  111.4
ACC_COPY/Total (%)     4.8    3.9    4.9    6.1

Page 28: RAMSES @CSCS

0"

25"

50"

75"

1" 2" 3" 4" 5" 6" 7" 8"

MPI$Tim

e$(sec)$

Number$of$PES$(if$ACCyes$then$#$PES$=$#$GPUs$with$1$task$per$node)$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$(if$ACCno$then$#$PES$=$#$nodes$with$8$tasks$per$node)$$

RAMSES$communicaGon$

new_ACCyes"MPI"Time"

original_ACCno"MPI"Time"

~2× less communication (MPI time) in the new approach

Page 29: RAMSES @CSCS

0"

25"

50"

75"

100"

1" 2" 3" 4" 5" 6" 7" 8"

Commun

ica)

on*/*Total*Tim

e*(%

)*

Number*of*PES*(if*ACCyes*then*#*PES*=*#*GPUs*with*1*task*per*node)*******************************(if*ACCno*then*#*PES*=*#*nodes*with*8*tasks*per*node)**

new_ACCyes"

original_ACCno"

Page 30: RAMSES @CSCS

GPUDirect ~ 1.55× regular MPI (CPU-to-CPU)

0"

5"

10"

15"

20"

0" 8" 16" 24" 32" 40" 48" 56" 64"

Isen

d+Ire

cv*+me*(sec)*

Number*of*GPUs*(if*ACCyes)*&*Number*of*CPUs*(if*ACCno)*

GPUDirect*

original_GPUDirectNO"

new_GPUDirectYES"

Page 31: RAMSES @CSCS

Optimisation 15-07: 1 GPU vs 8 cores

Page 32: RAMSES @CSCS

RAMSES @CSCS

Thank you very much for your attention!