Automatic code generation for highly parallel multigrid solvers
Sebastian Kuckuk1, Christian Schmitt2, Harald Köstler1
1 Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Department of Computer Science 10 (System Simulation)
2 Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Department of Computer Science 12 (Hardware-Software-Co-Design)
References:
[1] Christian Schmitt, Sebastian Kuckuk, Harald Köstler, Frank Hannig, and Jürgen Teich. An Evaluation of Domain-Specific Language Technologies for Code Generation. To appear in Proceedings of the 14th International Conference on Computational Science and Its Applications (ICCSA 2014), June 2014.
[2] Stefan Kronawitter and Christian Lengauer. Optimization of two Jacobi Smoother Kernels by Domain-Specific Program Transformation. In Proceedings of the 1st International Workshop on High-Performance Stencil Computations (HiStencils), pages 75–80, January 2014.
[3] Alexander Grebhahn, Norbert Siegmund, Sven Apel, Sebastian Kuckuk, Christian Schmitt, and Harald Köstler. Optimizing Performance of Stencil Code with SPL Conqueror. In Proceedings of the 1st International Workshop on High-Performance Stencil Computations (HiStencils), pages 7–14, January 2014.
[4] Sebastian Kuckuk, Björn Gmeiner, Harald Köstler, and Ulrich Rüde. A Generic Prototype to Benchmark Algorithms and Data Structures for Hierarchical Hybrid Grids. In Proceedings of the International Conference on Parallel Computing (ParCo), pages 813–822, September 2013.
Code Generation with Scala
Necessary due to the high variance of the multigrid domain:
Hardware - CPU, GPU or both? Number of nodes, sockets and cores? Cache characteristics? Network characteristics?
Software - MPI, OpenMP or both? CUDA or OpenCL? Which version?
MG components - Cycle type? Which smoother(s)? Which coarse grid solver? Which inter-grid operators?
MG parameters - Relaxation parameter? Number of smoothing steps?
Optimizations - Vectorization? Temporal blocking? Loop transformations?
Problem description - Which PDE? Which boundary conditions?
Discretization - Finite Differences, Finite Elements or Finite Volumes?
Domain - Uniform or block-structured? How to partition?
…
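To make this variability concrete, here is a minimal sketch of how such features could be collected in a typed configuration object that steers a generator. All names are hypothetical and not the actual ExaStencils knowledge base.

// Minimal sketch of a feature/configuration model for a solver generator.
// All names are hypothetical; the real framework defines its own knowledge base.
object Parallelization extends Enumeration { val MPI, OMP, Hybrid = Value }
object Smoother extends Enumeration { val Jacobi, GaussSeidel = Value }

case class SolverConfig(
  dimensions: Int                         = 3,                    // 2D or 3D problem
  parallelization: Parallelization.Value  = Parallelization.MPI,
  useGPU: Boolean                         = false,                // emit CUDA/OpenCL kernels?
  smoother: Smoother.Value                = Smoother.Jacobi,
  preSmoothingSteps: Int                  = 3,                    // e.g. V(3,3)
  postSmoothingSteps: Int                 = 3,
  omega: Double                           = 0.8,                  // relaxation parameter
  vectorize: Boolean                      = true
) {
  // simple consistency check the generator could run before emitting code
  require(preSmoothingSteps >= 0 && postSmoothingSteps >= 0)
}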
Sebastian Kuckuk, Harald Köstler
Ulrich Rüde
Alexander Grebhahn, Sven Apel
Stefan Kronawitter, Armin Größlinger
Christian Lengauer
Christian Schmitt, Frank Hannig, Jürgen Teich
Project ExaStencils
Generation of efficient, robust and exa-scalable geometric multigrid solvers
Modular and feature-rich code generation and transformation framework written in Scala [1] (a minimal rewriting sketch follows below)
Automatic low-level optimization via polyhedral transformations [2]
Interface to SPL and LFA prediction and optimization [3]
Hannah Rittich, Matthias Bolten
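The transformation framework [1] rewrites an intermediate representation step by step. A minimal sketch of such pattern-matching-based rewriting in Scala follows; the node types and the constant-folding rule are hypothetical, not the actual ExaStencils IR.

// Minimal sketch of AST rewriting via pattern matching
// (hypothetical node types, not the actual ExaStencils API).
sealed trait Expr
case class Const(v: Double) extends Expr
case class Add(l: Expr, r: Expr) extends Expr
case class Mul(l: Expr, r: Expr) extends Expr

// One transformation pass: bottom-up constant folding and zero elimination.
def simplify(e: Expr): Expr = e match {
  case Add(l, r) => (simplify(l), simplify(r)) match {
    case (Const(a), Const(b)) => Const(a + b)
    case (a, b)               => Add(a, b)
  }
  case Mul(l, r) => (simplify(l), simplify(r)) match {
    case (Const(a), Const(b))               => Const(a * b)
    case (Const(0.0), _) | (_, Const(0.0))  => Const(0.0)
    case (a, b)                             => Mul(a, b)
  }
  case c: Const => c
}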
Geometric Multigrid
Smoothing of high-frequency errors
Coarsened representation of low-frequency errors
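These two complementary principles combine into the multigrid cycle. Below is a minimal, compilable sketch of a recursive V-cycle over abstract grid operations; all operator names and signatures are hypothetical, not the ExaStencils API.

// Minimal V-cycle sketch over abstract grid operations (hypothetical signatures).
trait MultigridOps {
  type Grid
  def smooth(u: Grid, f: Grid, steps: Int): Grid  // e.g. Jacobi or Gauss-Seidel
  def residual(u: Grid, f: Grid): Grid            // r = f - A u
  def restrict(r: Grid): Grid                     // fine -> coarse transfer
  def prolongate(e: Grid): Grid                   // coarse -> fine transfer
  def coarseSolve(f: Grid): Grid                  // direct solve on coarsest level
  def plus(a: Grid, b: Grid): Grid
  def zeroLike(g: Grid): Grid

  // V(nu1, nu2) cycle: pre-smoothing damps high-frequency error,
  // the coarse-grid correction handles the remaining low-frequency error.
  def vCycle(level: Int, u: Grid, f: Grid, nu1: Int, nu2: Int): Grid =
    if (level == 0) coarseSolve(f)
    else {
      val v  = smooth(u, f, nu1)                              // pre-smoothing
      val rc = restrict(residual(v, f))                       // restricted residual
      val e  = vCycle(level - 1, zeroLike(rc), rc, nu1, nu2)  // coarse-grid correction
      smooth(plus(v, prolongate(e)), f, nu2)                  // correct and post-smooth
    }
}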
Preliminary Results
First scaling results with generated solvers match the behavior of earlier reference experiments [4]:
3D FD discretization of Poisson's equation on uniform grids (the assumed stencil is sketched below the plots)
4 threads per core, pure MPI
1M unknowns per core
[Figure: Weak Scaling for two Configurations; parallel efficiency vs. number of cores (512 to 256k); legend: V(3,3) with Gauss-Seidel, V(4,2) with Jacobi]
[Figure: Weak Scaling for two Configurations; mean time per V-cycle [ms] vs. number of cores (512 to 256k); legend: V(3,3) with Gauss-Seidel, V(4,2) with Jacobi]
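For completeness, the discretization assumed above: the poster does not spell out the stencil, so this shows the canonical choice, the standard second-order 7-point finite difference stencil for -Δu = f on a uniform grid with mesh size h.

\[
  \frac{1}{h^2} \left( 6 u_{i,j,k}
    - u_{i-1,j,k} - u_{i+1,j,k}
    - u_{i,j-1,k} - u_{i,j+1,k}
    - u_{i,j,k-1} - u_{i,j,k+1} \right) = f_{i,j,k}
\]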
The domain partition is directly mapped to the parallelization:
Each domain consists of one or more blocks
Each block consists of one or more fragments
Each fragment consists of several data points / cells
Each block corresponds to one MPI rank
Each fragment corresponds to one OpenMP thread
Pure MPI corresponds to one fragment per block
Pure OpenMP corresponds to one block
Hybrid MPI/OpenMP corresponds to multiple blocks with multiple fragments per block
Possible optimization: aggregate all fragments within one block and parallelize field operations directly with OpenMP
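A minimal sketch of this hierarchy as Scala data structures; the types and helper predicates are hypothetical, not the actual ExaStencils implementation.

// Hypothetical model of the domain partition hierarchy described above.
case class Fragment(id: Int, neighbors: Seq[Int])        // unit of OpenMP-level parallelism
case class Block(mpiRank: Int, fragments: Seq[Fragment]) // unit of MPI-level parallelism
case class Domain(blocks: Seq[Block])

// Pure MPI: one fragment per block; pure OpenMP: a single block.
def isPureMPI(d: Domain): Boolean = d.blocks.forall(_.fragments.size == 1)
def isPureOMP(d: Domain): Boolean = d.blocks.size == 1
def isHybrid(d: Domain): Boolean  = !isPureMPI(d) && !isPureOMP(d)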
Support of various communication patterns:
Different regions (overlap, ghost layers)
Arbitrary lists of neighbors (represented by directions)
Easy-to-use subsets, e.g. to send to all processes with larger or equal coordinates
A generated domain initialization function sets relevant information, e.g. connections to local/remote primitives, ids, ranks, etc., on each process at run-time.
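To illustrate the direction-based neighbor subsets mentioned above, here is a minimal sketch in which directions are integer offset vectors such as (1, 0, 0) for the neighbor in positive x-direction; the names are hypothetical.

// Hypothetical sketch of direction-based neighbor selection.
case class Direction(dx: Int, dy: Int, dz: Int)

// All 26 neighbor directions of a fragment in 3D.
val allNeighbors: Seq[Direction] =
  for {
    dx <- -1 to 1; dy <- -1 to 1; dz <- -1 to 1
    if (dx, dy, dz) != (0, 0, 0)
  } yield Direction(dx, dy, dz)

// Subset example from the text: all neighbors with larger or equal coordinates.
val upperNeighbors: Seq[Direction] =
  allNeighbors.filter(d => d.dx >= 0 && d.dy >= 0 && d.dz >= 0)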
Multi-Layered DSL Approach
From abstract problem specification on layer 1 to concrete solver implementation on layer 4:
L1: mathematical formulation of the problem
L2: discretization of the problem
L3: specification of algorithmic components
L4: complete program specification
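As an illustration of this refinement pipeline (deliberately not the actual ExaSlang syntax, which the project defines itself), the four layers can be modeled as increasingly concrete specifications, each wrapping the layer above:

// Hypothetical model of the four DSL layers; each layer refines the previous one.
case class Layer1(equation: String, domain: String)                  // continuous problem
case class Layer2(l1: Layer1, discretization: String, h: Double)     // discrete problem
case class Layer3(l2: Layer2, cycle: String, smoother: String,
                  coarseSolver: String)                              // algorithmic components
case class Layer4(l3: Layer3, parallelization: String,
                  dataLayout: String)                                // complete program

val spec = Layer4(
  Layer3(
    Layer2(Layer1("-laplace u = f", "unit cube"), "finite differences", 1.0 / 1024),
    cycle = "V(3,3)", smoother = "Gauss-Seidel", coarseSolver = "CG"),
  parallelization = "pure MPI", dataLayout = "one field per level")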