RAMSES @CSCS
Kotsalos Christos & Claudio Gheller
Refactoring of the RAMSES code and performance optimisation on CPUs and GPUs
RAMSES: modular physics
[Module diagram, repeated once per time step ⤾: AMR build, Domain decomposition / Load balancing, Gravity, Hydro, MHD, N-body, Cooling, Star formation, Other physics, RT]
Our goal
[Module diagram: AMR build, Load balance, Gravity, Hydro, MHD, N-body, Cooling, Star formation, Other physics, RT]
GPU porting using OpenACC directives:
❖ Recognise the GPU-friendly parts: computational intensity + data independence (see the sketch below)
❖ Minimise the data transfer: GPU <—> CPU communication
❖ GPU-to-GPU communication through GPUDirect, and communication in general
❖ GPU porting and optimisation
} Infrastructure
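As an illustration of what "GPU-friendly" means here, a minimal OpenACC sketch of a compute-intensive loop whose iterations are independent of each other (hypothetical array names and update formula, not actual RAMSES code):

program acc_sketch
   implicit none
   integer, parameter :: ncell = 1000000
   real(kind=8) :: uold(ncell), unew(ncell), dt
   integer :: i
   dt   = 1.0d-3
   uold = 1.0d0
   ! copy the input to the device once, keep the result there until the end
   !$acc data copyin(uold) copyout(unew)
   !$acc parallel loop
   do i = 1, ncell
      unew(i) = uold(i) + dt * (uold(i)**2 - uold(i))   ! independent per-cell work
   end do
   !$acc end parallel loop
   !$acc end data
   print *, unew(1)
end program acc_sketch

Every iteration touches only its own cell, so the loop maps cleanly onto the GPU, and the surrounding data region is what keeps the GPU <—> CPU traffic down.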
Problems to overcome!
[Module diagram: AMR build, Load balance, Gravity, Hydro, MHD, N-body, Cooling, Star formation, Other physics, RT]
❖ Dependencies between the modules
❖ amr_step is more complex than this simplified representation
❖ Recursive calls
❖ Communicators (OpenACC & memory)
❖ The number of grids depends on the level of refinement
❖ GPU porting issues: difficulty to fit the loops in the architecture!
1st part of the Project
Redesign the communication for GPUs & CPUs:
❖ Is the CPU communication optimal?
❖ Is the communication suitable for GPU programming?
Communication between the subdomains
• Point-to-point communication: ISend <—> IRecv
• The subdomains exchange their solutions with their physical neighbours
• Everything goes through the communicators
MPI
Send Buffer Recv Buffer
emission communicator (structure/ derived data types)
reception communicator (structure/ derived data types)
Stored information:
‣ Number of cells
‣ Cell index
‣ Double precision arrays
‣ Single precision arrays
Advantages: elegant structure
Disadvantages: data locality issues; not fully supported by OpenACC (a sketch of such a structure follows below)
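A rough sketch of what such a derived-type communicator might look like, based only on the stored information listed above (hypothetical type and field names, not the actual RAMSES definition):

type communicator
   integer                   :: ncell       ! number of cells to exchange
   integer,      allocatable :: igrid(:)    ! cell indices
   real(kind=8), allocatable :: u_dp(:,:)   ! double precision arrays
   real(kind=4), allocatable :: u_sp(:,:)   ! single precision arrays
end type communicator

! one emission and one reception communicator per (cpu, level)
type(communicator), allocatable :: emission(:,:), reception(:,:)

The allocatable members inside the derived type are exactly what OpenACC handles poorly, which is where the data locality and support problems come from.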
Type of communication: point-to-point or collective?
❖ Point-to-point: Isend & Irecv (original implementation in RAMSES)
❖ Collective: Alltoall(v); the data to be communicated are gathered in arrays (one per PE), and these arrays (of intrinsic data types) are scattered from all PEs to all PEs
OpenACC & allocatable derived data types:
• Not fully supported (Cray only, and even then only partially)
• No GPUDirect support (GPU-to-GPU communication)
⇒ Two solutions to overcome this problem:
❖ Use collective communication (the buffers are of intrinsic data types): check the performance
❖ Replace the communicators, locally or globally, with regular arrays: everywhere in the code, not an easy solution
GPUDirect
❖ GPU-to-GPU communication
❖ The same calls as regular MPI, but the environment must be set up:
export MPICH_RDMA_ENABLED_CUDA=1
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
and each MPI call is embedded in a host_data region:
!$acc host_data use_device(send buffer)
   call MPI_ISEND( normal arguments )
!$acc end host_data
!$acc host_data use_device(recv buffer)
   call MPI_IRECV( normal arguments )
!$acc end host_data
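Putting the pieces together, a sketch of one GPUDirect exchange with concrete buffers (hypothetical buffer names, counts and ranks; assumes use mpi and the environment settings above):

real(kind=8) :: sendbuf(nbuf), recvbuf(nbuf)
integer      :: req(2), stat(MPI_STATUS_SIZE, 2), ierr

!$acc data create(sendbuf, recvbuf)
! ... fill sendbuf on the device ...
!$acc host_data use_device(sendbuf, recvbuf)
call MPI_ISEND(sendbuf, nbuf, MPI_DOUBLE_PRECISION, idest, tag, &
               MPI_COMM_WORLD, req(1), ierr)
call MPI_IRECV(recvbuf, nbuf, MPI_DOUBLE_PRECISION, isrc, tag, &
               MPI_COMM_WORLD, req(2), ierr)
!$acc end host_data
call MPI_WAITALL(2, req, stat, ierr)
! recvbuf now holds the neighbour's data directly in device memory
!$acc end data

The host_data region makes MPI see the device addresses of the buffers, so with MPICH_RDMA_ENABLED_CUDA=1 the transfer goes GPU-to-GPU without staging through the host.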
AlltoAll
[Diagram: each PE's send buffer is divided into n equal segments, one per PE; after the Alltoall, PE i's receive buffer contains segment i from every PE]
MPI_ALLTOALL(sendbuf, sendcount, sendtype, recvbuf, recvcnt, recvtype, comm, ierr)
Restriction: sendcount & recvcnt are fixed per PE
AlltoAllv
[Diagram: PE i's send buffer is divided into n variable-length segments, segment j of length sendcnts(j)]
MPI_ALLTOALLV(sendbuf, sendcnts, sdispls, sendtype, recvbuf, recvcnts, rdispls, recvtype, comm, ierr)
sdispls(1) = 0
sdispls(2) = sdispls(1)+sendcnts(1)
sdispls(n) = sdispls(n-1)+sendcnts(n-1)
The sendcnts & recvcnts are NOT fixed per PE
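A sketch of how the counts and displacements above translate into an actual call (hypothetical buffer names; assumes use mpi and that sendcnts/recvcnts have already been filled from the load balancing):

integer :: sendcnts(npes), recvcnts(npes), sdispls(npes), rdispls(npes)
integer :: i, ierr

sdispls(1) = 0
rdispls(1) = 0
do i = 2, npes
   sdispls(i) = sdispls(i-1) + sendcnts(i-1)   ! prefix sums, as above
   rdispls(i) = rdispls(i-1) + recvcnts(i-1)
end do

call MPI_ALLTOALLV(sendbuf, sendcnts, sdispls, MPI_DOUBLE_PRECISION, &
                   recvbuf, recvcnts, rdispls, MPI_DOUBLE_PRECISION, &
                   MPI_COMM_WORLD, ierr)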
Experiment: Replace the point-to-point with the collective communication everywhere (CPU version of RAMSES)
Alltoallv vs ISend/IRecv
Latency issue!
Point-to-point is ~3 times faster
Load balancing
The subdomains need to communicate mainly with their neighbours
⇒ Collective communication sends and receives too many empty buffers
Latency Issue!
AlltoAllv_tuned (user specified)
[Diagram: PE i's send buffer split into n segments of length sendcnts(1) ... sendcnts(n)]
If sendcnts(i) = 0, the corresponding segment of the buffer is filled with zeros anyway; the special case of this is the plain AlltoAll.
Bandwidth issue! Too much data to communicate!
Final solution (communication part)
❖ Point-to-point communication
❖ Replace the communicators locally with regular arrays
❖ GPU-to-GPU through GPUDirect works flawlessly
[Diagram: the emission/reception communicators (derived data types) are converted into emission/reception arrays (intrinsic data types), which serve as the send/recv buffers]
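A rough sketch of this local replacement, reusing the hypothetical communicator type from the earlier sketch: the derived-type payload is packed into a flat, intrinsic-type array that OpenACC and GPUDirect can handle directly (names and loop structure are illustrative only):

real(kind=8), allocatable :: emission_buf(:)
integer :: icpu, icell, ivar, ioff

allocate(emission_buf(ncell_total * nvar))
ioff = 0
do icpu = 1, ncpu
   do icell = 1, emission(icpu, ilevel)%ncell
      do ivar = 1, nvar
         emission_buf(ioff + ivar) = emission(icpu, ilevel)%u_dp(icell, ivar)
      end do
      ioff = ioff + nvar
   end do
end do
! emission_buf is a plain array, so it can live on the device and be passed
! to MPI_ISEND inside an !$acc host_data use_device region (GPUDirect)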
2nd part of the Project
GPU porting of the Poisson Solver
[Module diagram: AMR build, Load balance, Gravity, Hydro, MHD, N-body, Cooling, Star formation, Other physics, RT]
GPU porting of the Poisson solver
Communicators caused issues because of data locality:
❖ Implicit synchronisation barriers
❖ Poor performance
Local communicators
Stored information:
‣ Number of grids
‣ Grid index
‣ Double precision arrays
‣ Single precision arrays
⇒ Replaceable by 1D arrays (intrinsic data types), indexed per level, component, cpu and cell
Result: data locality, increased performance (2 to 3 times faster)
The replacement is done at the beginning of multigrid_fine (see the sketch below).
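A schematic of the kind of flat, intrinsic-type arrays that take the communicators' place (hypothetical names and index arithmetic, purely illustrative):

integer,      allocatable :: comm_ngrid(:)   ! number of grids per (level, cpu)
real(kind=8), allocatable :: comm_u_dp(:)    ! flattened double precision payload
integer :: idx

! entry for component ivar of grid igrid exchanged with cpu icpu at level ilevel
idx = (((ilevel-1)*ncpu + (icpu-1))*ngridmax + (igrid-1))*nvar + ivar
comm_u_dp(idx) = value_to_send

Because everything is a contiguous array of an intrinsic type, the data are local in memory and OpenACC can move them without restrictions, which is where the 2 to 3 times speed-up comes from.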
3rd part of the Project
Infrastructure development:
❖ Which parts on CPU, which on GPU?
❖ Minimise interaction between CPU and GPU
[Module diagram: AMR build, Load balance, Gravity, Hydro, MHD, N-body, Cooling, Star formation, Other physics, RT]
Infrastructure
Subroutines that update host/ device (optimised data transfer)
#if defined(_OPENACC)
   call update_globalvar_dp_to_host  (var, level)
   call update_globalvar_dp_to_device(var, level)
#endif
instead of
!$acc update device/host(var)
Use of the update directive, but in an optimal way: communicate only what is needed.
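One plausible shape of such a wrapper, a sketch only (the level bookkeeping arrays level_first/level_last are assumptions, not actual RAMSES names):

subroutine update_globalvar_dp_to_device(var, level)
   implicit none
   real(kind=8), intent(inout) :: var(:)
   integer,      intent(in)    :: level
   integer :: ifirst, ilast
   ifirst = level_first(level)   ! assumed bookkeeping of the level's index range
   ilast  = level_last(level)
   ! transfer only the slice of the global variable that this level needs
   !$acc update device(var(ifirst:ilast))
end subroutine update_globalvar_dp_to_device

The !$acc update directive is still used underneath, but only on the relevant part of the global variable instead of the whole array.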
Manual Profiler (MPROF)
Why?
❖ Bugs in CRAYPAT
❖ Strange (implicit) SYNC barriers
❖ Discrepancy between CRAYPAT and NVIDIA's tools
It is enabled from the Makefile by the flag MPROF
It uses the MPI_WTIME function
It is adapted for all the GPU-ported subroutines of RAMSES
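The timing pattern itself is simple; a sketch of what an MPROF-instrumented region might look like (hypothetical variable and region names, with t0 and t_region declared as real(kind=8)):

#ifdef MPROF
   t0 = MPI_WTIME()
#endif
   call gpu_ported_region()                   ! hypothetical region being timed
#ifdef MPROF
   t_region = t_region + (MPI_WTIME() - t0)   ! accumulated over the time steps
#endif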
Until this point
[Module diagram: AMR build, Load balance, Gravity, Hydro, MHD, N-body, Cooling, Star formation, Other physics, RT; each module marked as running on the GPU or still on the CPU]
❖ Recognise the GPU-friendly parts
❖ Construct an optimised infrastructure so as to minimise the data transfer
❖ GPU-to-GPU communication through GPUDirect, and communication in general
❖ GPU porting completed (~95%)
❖ Optimisation (ongoing, with daily encouraging results)
❖ OpenMP porting of the non-GPU-ported parts
Working environment : Piz Daint
Results
Optimisation goal: 1 GPU > 8 cores
Test 128, time steps 100 to 110
0"
50"
100"
150"
200"
250"
1" 2" 3" 4" 5" 6" 7" 8"
Time%(sec)%
Number%of%PES%(if%ACCyes%then%#%PES%=%#%GPUs%with%1%task%per%node)%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%(if%ACCno%then%#%PES%=%#%nodes%with%8%tasks%per%node)%%
Summary%of%the%current%situaDon%
new_ACCyes"Total"Time"
original_ACCno"Total"Time"
original RAMSES ~ 1.7 times faster
The non-GPU parts must be ported using OpenMP for full comparability!
CRAY-PAT report
Fair comparison: 216 - 80 ≈ 140 sec
original RAMSES ~ 1.1 times faster
Manual Profiler (MPROF): 1 GPU vs 8 cores
Infrastructure
GPUs                    1      2      4      8
ACC_COPY (sec)       10.5    7.7    6.3    6.8
Total time (sec)    216.8  195.2  129.7  111.4
ACC_COPY / total (%)  4.8    3.9    4.9    6.1
0"
25"
50"
75"
1" 2" 3" 4" 5" 6" 7" 8"
MPI$Tim
e$(sec)$
Number$of$PES$(if$ACCyes$then$#$PES$=$#$GPUs$with$1$task$per$node)$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$(if$ACCno$then$#$PES$=$#$nodes$with$8$tasks$per$node)$$
RAMSES$communicaGon$
new_ACCyes"MPI"Time"
original_ACCno"MPI"Time"
~2 times less communication in the new approach
0"
25"
50"
75"
100"
1" 2" 3" 4" 5" 6" 7" 8"
Commun
ica)
on*/*Total*Tim
e*(%
)*
Number*of*PES*(if*ACCyes*then*#*PES*=*#*GPUs*with*1*task*per*node)*******************************(if*ACCno*then*#*PES*=*#*nodes*with*8*tasks*per*node)**
new_ACCyes"
original_ACCno"
GPUDirect ~ 1.55 x regular MPI (CPU-to-CPU)
0"
5"
10"
15"
20"
0" 8" 16" 24" 32" 40" 48" 56" 64"
Isen
d+Ire
cv*+me*(sec)*
Number*of*GPUs*(if*ACCyes)*&*Number*of*CPUs*(if*ACCno)*
GPUDirect*
original_GPUDirectNO"
new_GPUDirectYES"
Optimisation (15-07): 1 GPU vs 8 cores
RAMSES @CSCS
Thank you very much for your attention!