Porting Telemac–Mascaret to OpenPower and experimenting with GPU offloading to accelerate the Tomawac module
TUC 2019, 16-17th October, CERFACS, Toulouse, France
Judicael Grasset(1), Stephen Longshaw(1), Charles Moulinec(1), David R. Emerson(1)
Yoann Audouin(2), Pablo Tassi(2)
October 17, 2019
(1) STFC, Daresbury Laboratory, Warrington, United Kingdom
(2) EDF R&D, Chatou, France
Computing used
OpenPower architecture in a nutshell:
• IBM POWER processors
• NVIDIA GPUs
• NVIDIA NVLink
The machine used for this work is Paragon. Each of its nodes consists of:
• 2 IBM POWER8 processors, with 8 cores each
• Each core has simultaneous multithreading (SMT) capability
• In this case the cores are able to run either 1 thread (SMT1), 2
threads (SMT2), 4 threads (SMT4) or 8 threads (SMT8) at the
same time
• 4 NVIDIA P100 GPUs
• NVIDIA NVLink for GPU–GPU and GPU–CPU interconnections
Porting to OpenPower
• Why? Summit and Sierra, the two most powerful supercomputers in the world (Top500, June 2019), are based on an OpenPower architecture
• Porting to a different architecture might reveal bugs in the code (increased robustness)
Porting to OpenPower
Status of the port:

Version           | PGI 18.10         | GCC 9.1  | XL 16.1.1.1
v8p0r2            | compiles          | compiles | does not compile*
trunk (Oct. 2019) | does not compile* | compiles | does not compile*

*Problem known and solved; it compiles when a small patch is applied.

All tests were done with the Spectrum MPI library.
Experimenting with GPUs
Or trying to port Telemac to the architecture of the "future" (i.e. the present)
The test case
Test case used: tomawac/fetch_limited/tom_test6.cas
• This is a limited test with a small mesh: 75k elements, 32k points.
• It spends all of its time in a single Fortran subroutine: qnlin3.f
• This subroutine was reported to be a bottleneck by some users during the 2018 TELEMAC User Conference.
qnlin3.f
In a nutshell (a fuller Fortran sketch follows this outline):
  do loop
    init some variables
    do loop
      init some variables
      do loop
        init some variables
        do loop
          tmp_array(x,y,z) = tmp_array(x,y,z) + k
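For illustration, a minimal free-form Fortran sketch of this pattern is shown below; the subroutine name, array names, bounds and the accumulated coefficient are placeholders, not the actual contents of qnlin3.f (in the real routine the indices of the updated array are computed from the loop variables).

    ! Illustrative sketch of the qnlin3-style loop nest; names and bounds are
    ! placeholders, not the real code.
    subroutine qnlin3_sketch(tmp_array, coef, n1, n2, n3, n4)
      implicit none
      integer, intent(in) :: n1, n2, n3, n4
      double precision, intent(inout) :: tmp_array(n1, n2, n3)
      double precision, intent(in)    :: coef(n1, n2, n3, n4)
      integer :: i, j, k, l
      do i = 1, n1
        do j = 1, n2
          do k = 1, n3
            do l = 1, n4
              tmp_array(i, j, k) = tmp_array(i, j, k) + coef(i, j, k, l)
            end do
          end do
        end do
      end do
    end subroutine qnlin3_sketch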
Porting to GPUs, methods
Different solutions exist:
• Pragma based: OpenMP, OpenACC
• Library based: Magma, cuBLAS...
• Language extension: CUDA, OpenCL
MPI+OpenACC (PGI compiler) on GPU
Move data to the GPU and execute the loop on it:
  !$acc data copy(array)
  !$acc parallel loop collapse(4)
  do loop
    do loop
      do loop
        do loop
          !$acc atomic
          array(x,y,z) = array(x,y,z) + k
  ...
  !$acc end data
Elsewhere, during the initialisation of the code, we have linked each MPI task to a specific GPU (see the sketch below).
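A minimal sketch of both steps is given below, assuming free-form Fortran and illustrative names (not the actual TELEMAC sources); the binding routine uses the standard OpenACC runtime API and assumes a round-robin assignment of MPI ranks to the node's GPUs.

    ! Same illustrative loop nest as above, offloaded with OpenACC.
    subroutine qnlin3_acc_sketch(tmp_array, coef, n1, n2, n3, n4)
      implicit none
      integer, intent(in) :: n1, n2, n3, n4
      double precision, intent(inout) :: tmp_array(n1, n2, n3)
      double precision, intent(in)    :: coef(n1, n2, n3, n4)
      integer :: i, j, k, l
      ! Keep the arrays on the GPU for the whole loop nest.
      !$acc data copy(tmp_array) copyin(coef)
      !$acc parallel loop collapse(4)
      do i = 1, n1
        do j = 1, n2
          do k = 1, n3
            do l = 1, n4
              ! Several collapsed iterations may update the same cell,
              ! hence the atomic update.
              !$acc atomic update
              tmp_array(i, j, k) = tmp_array(i, j, k) + coef(i, j, k, l)
            end do
          end do
        end do
      end do
      !$acc end data
    end subroutine qnlin3_acc_sketch

    ! Illustrative initialisation step: bind each MPI rank to one of the
    ! node's GPUs, round robin (hypothetical helper routine).
    subroutine bind_rank_to_gpu()
      use mpi
      use openacc
      implicit none
      integer :: rank, ngpus, ierr
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      ngpus = acc_get_num_devices(acc_device_nvidia)
      if (ngpus > 0) call acc_set_device_num(mod(rank, ngpus), acc_device_nvidia)
    end subroutine bind_rank_to_gpu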
MPI+OpenACC (PGI compiler) on GPU
[results figure]
MPI+OpenMP (IBM compiler) on GPU
Move data to the GPU and execute the loop on it:
  !$omp target data map(array)
  !$omp target teams distribute parallel do collapse(4)
  do loop
    do loop
      do loop
        do loop
          !$omp atomic
          array(x,y,z) = array(x,y,z) + k
  ...
  !$omp end target data
Elsewhere, during the initialisation of the code, we have linked each MPI task to a specific GPU (see the sketch below).
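A matching sketch using OpenMP target directives is shown below, again with illustrative names only (not the actual TELEMAC sources); the rank-to-GPU binding can be done analogously with the OpenMP runtime API, as noted in the closing comment.

    ! Same illustrative loop nest, offloaded with OpenMP target directives.
    subroutine qnlin3_omp_sketch(tmp_array, coef, n1, n2, n3, n4)
      implicit none
      integer, intent(in) :: n1, n2, n3, n4
      double precision, intent(inout) :: tmp_array(n1, n2, n3)
      double precision, intent(in)    :: coef(n1, n2, n3, n4)
      integer :: i, j, k, l
      ! Map the arrays onto the GPU for the duration of the loop nest.
      !$omp target data map(tofrom: tmp_array) map(to: coef)
      !$omp target teams distribute parallel do collapse(4)
      do i = 1, n1
        do j = 1, n2
          do k = 1, n3
            do l = 1, n4
              !$omp atomic update
              tmp_array(i, j, k) = tmp_array(i, j, k) + coef(i, j, k, l)
            end do
          end do
        end do
      end do
      !$omp end target data
    end subroutine qnlin3_omp_sketch

    ! Binding each MPI rank to one GPU can be done analogously to the OpenACC
    ! case, e.g. (with use omp_lib, after retrieving the rank via MPI_Comm_rank):
    !   call omp_set_default_device(mod(rank, omp_get_num_devices()))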
MPI+OpenMP (IBM compiler) on GPU
[results figure]
Somme test case
• Somme, 7 days simulated
• Telemac2d-Tomawac-Sisyphe
[Pie chart: distribution of runtime per subroutine (other subroutines, semimp, qwind1, propa, fremoy, schar41_per_4d, dlog, qnlin1, bief_interp), with individual shares between 6% and 21.1%]
Inclusion in the codebase
• OpenACC and OpenMP redundancy (two directive sets for the same loops)
• Could be solved with pragmas in this case (see the sketch after this list)
• But might not always be possible
• Use of the optional directory
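One possible reading of "could be solved with pragmas" is sketched below (hypothetical loop and names, not actual TELEMAC code): both directive families are placed on the same loop, and whichever programming model is not enabled at compile time is ignored as a comment, so only a single copy of the loop body has to be maintained.

    ! Illustrative only: stacking OpenACC and OpenMP directives on the same loop.
    ! Only one of the two models should be enabled in a given build; the
    ! directives of the other model are then treated as plain comments.
    subroutine add_sketch(a, b, ni, nj)
      implicit none
      integer, intent(in) :: ni, nj
      double precision, intent(inout) :: a(ni, nj)
      double precision, intent(in)    :: b(ni, nj)
      integer :: i, j
      !$acc parallel loop collapse(2)
      !$omp target teams distribute parallel do collapse(2)
      do j = 1, nj
        do i = 1, ni
          a(i, j) = a(i, j) + b(i, j)
        end do
      end do
    end subroutine add_sketch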
Conclusion
Results achieved:
• Telemac-Mascaret ported to OpenPower
• The port revealed bugs in Telemac-Mascaret and in some compilers
• Good improvement when using the GPU for the qnlin3 subroutine
• Work is still ongoing, but it will be more difficult for real-world test cases
Acknowledgements
• This work is supported by the Hartree Centre through the Innovation
Return on Research (IROR) programme.
Thank you for your attention
If you think the code is too slow or uses too much memory for you (partel, Telemac, Tomawac...), please contact us.
Contact:
[email protected] [email protected]