Performance Portability Experiments with the Grid C++ Lattice QCD Library
Alejandro Vaquero
University of Utah
August 22, 2017
DOE Exascale Software project, with:
Peter Boyle, University of Edinburgh
Kate Clark, NVIDIA
Carleton DeTar, University of Utah
Meifeng Lin, BNL
Verinder Rana, BNL
Motivation
Exascale Computing
The US DOE is planning to bring exascale-capable machines online by 2021-2023
The architecture is expected to be complex:
- Heterogeneity
- Complex memory hierarchy
- Multiple levels of parallelism
USQCD and its partners are working on a new scientific software stack that will be Exascale-ready under the Exascale Computing Project (ECP)
ECP requirements
Efficiency
- The code must exploit the multiple levels of parallelism of the exascale architectures, as well as their heterogeneous nature
- Communications are expected to become a serious bottleneck and might deserve special treatment
Flexibility and ease of use
- The code must be flexible enough so that the different algorithms required in our physics calculations can be implemented in a simple way
- Support for several layers of abstraction that simplify development
Performance Portability
- The code should be portable with minimal modifications, while preserving competitive performance
The Grid library
Data parallel C++ mathematical object library
P. A. Boyle, G. Cossu, A. Yamaguchi, A. Portelli
https://www.github.com/paboyle/Grid
Adapted for Quantum Field Theories
Abstracts away the complexities of the target machine
Efficient: offers a high-level interface that exploits all levels of parallelism
MPI ⊗ OpenMP ⊗ SIMD
Flexible: uses C++ classes to wrap SIMD intrinsics (a minimal sketch of this idea follows below)
- Object operations are defined independently of vector length or layout
- Vector class operators translate directly into intrinsic code
- Performance is not impacted by the high-level interface
- Currently supported SIMD instruction sets: SSE4, AVX, AVX2, AVX512, IMCI, ARM NEON, QPX
Performance portable: easily hits maximum memory bandwidth on supported platforms
Different architectures are supported through redefinitions of the SIMD intrinsics
Unfortunately this approach won't work for GPUs; indeed, GPU support is missing from the list
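To illustrate the intrinsics-wrapping idea, here is a minimal sketch in the spirit of Grid's vector classes, assuming AVX; the type vRealF and its members are illustrative, not Grid's actual definitions.

// Minimal sketch of a SIMD wrapper (illustrative only; Grid's real types also
// handle complex numbers and select the instruction set at configure time).
#include <immintrin.h>

struct vRealF {
    __m256 v;                         // 8 single-precision lanes (AVX)

    static constexpr int Nsimd() { return 8; }

    // High-level operators translate directly into single intrinsics,
    // so the abstraction costs nothing at run time.
    friend vRealF operator+(vRealF a, vRealF b) { return { _mm256_add_ps(a.v, b.v) }; }
    friend vRealF operator*(vRealF a, vRealF b) { return { _mm256_mul_ps(a.v, b.v) }; }
};

// Code written against vRealF is independent of the vector length:
// re-targeting to SSE4, AVX512, etc. only changes the wrapper's internals.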
Data layout in QCD (and computer simulated QFTs)
Quantum Chromodynamics (QCD) is our current theory of the strong force
QCD is simulated on supercomputers on a 4D lattice
- Each site of the lattice carries complex vectors of 3 or 12 components
- Each link of the lattice carries an SU(3) matrix (3 × 3 complex matrix)
- There are D independent links per site (D = dimension)
- (Anti)periodic boundary conditions are enforced in each dimension
Grid natively implements the required datatypes (complex matrices, vectors, ...)
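As a concrete picture of the per-site data described above, here is a minimal scalar (non-SIMD) sketch; the type and field names are illustrative, not Grid's own class hierarchy.

// Illustrative scalar view of the per-site QCD data: a colour 3-vector on
// every site and one SU(3) link matrix per site-direction pair.
// Names are hypothetical, not Grid's own types.
#include <array>
#include <complex>
#include <vector>

using Complex   = std::complex<double>;
using ColourVec = std::array<Complex, 3>;                 // 3 components per site
using SU3Matrix = std::array<std::array<Complex, 3>, 3>;  // 3x3 complex link matrix

struct LatticeFields {
    static constexpr int D = 4;                 // space-time dimensions
    std::vector<ColourVec> psi;                 // one colour vector per site
    std::vector<std::array<SU3Matrix, D>> U;    // D links per site
    // (anti)periodic boundary conditions are applied when indexing neighbours
};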
SIMD parallelism in Grid
Split the lattice into virtual nodes (outer sites)
Each SIMD lane deals with a virtual node (inner sites); see the index-mapping sketch below
Advantages
- Easy to parallelize
Disadvantages
- Permutations/rotations needed to keep the right boundary conditions
Vector datatypes appear at the lowest level (AoSoV)
[Figure: AoSoV layout of a vectorized 3 × 3 matrix — each group of four elements a_ij1 ... a_ij4 (the same matrix element taken from four virtual nodes) forms one SIMD vector]
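The mapping from a global lattice coordinate to an outer site plus a SIMD lane can be sketched as below; this assumes, for clarity, that a single dimension is split across the lanes, whereas Grid actually splits several dimensions and handles the boundary permutations.

// Minimal sketch of the virtual-node decomposition (assumption: one
// dimension split across SIMD lanes; the struct and names are illustrative).
#include <cstdio>

struct SimdDecomp {
    int L;       // global extent of the split dimension
    int Nsimd;   // number of SIMD lanes = number of virtual nodes
    int Lsub() const { return L / Nsimd; }      // extent of each virtual node

    // Map a global coordinate to (outer site within a virtual node, lane).
    void map(int x, int &osite, int &lane) const {
        lane  = x / Lsub();    // which virtual node owns this coordinate
        osite = x % Lsub();    // position inside that virtual node
    }
};

int main() {
    SimdDecomp d{16, 4};
    int osite, lane;
    d.map(9, osite, lane);     // coordinate 9 -> lane 2, outer site 1
    std::printf("osite=%d lane=%d\n", osite, lane);
}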
Expression Templates in Grid
Using C++11 templates we can abstract complex datatypes and treat them as numbers in an expression
Grid grid(vol);           // Creates a cartesian grid of points
Lattice<Su3f> x(&grid);   // Creates objects that populate the cartesian
Lattice<Su3f> y(&grid);   // grid with Su3f objects (= SU(3) matrix, single
Lattice<Su3f> z(&grid);   // precision)
z = x*(y+x) - y*y;        // Expression to be evaluated
Achieved through a recursive evaluation of expressions
Sample code from the mini-app used for the portability tests:
// operator= of Lattice<obj>: evaluates the whole expression site by site
template <typename Op, typename T1, typename T2>
inline Lattice<obj> & operator=(const LatticeBinaryExpression<Op,T1,T2> &expr) {
#pragma omp parallel for
  for(int ss = 0; ss < this->oSites(); ss++) {
    _odata[ss] = eval(ss,expr);          // recursive evaluation at site ss
  }
  return *this;
}

// eval on a binary node: evaluate both children, then apply the operation
template <typename Op, typename T1, typename T2>
auto inline eval(const unsigned int ss, const LatticeBinaryExpression<Op,T1,T2> &expr)
  -> decltype(expr.Op.func(eval(ss,expr.arg1),eval(ss,expr.arg2))) {
  return expr.Op.func(eval(ss,expr.arg1),eval(ss,expr.arg2));
}

// eval on a leaf: a Lattice simply returns its data at site ss
template<class obj>
inline const obj& eval(const unsigned int ss, const Lattice<obj> &arg) {
  return arg._odata[ss];
}
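The expression node type used above is not shown on the slide; a minimal illustrative version, with field names chosen to match the eval() calls rather than Grid's exact definition, could look like this.

// Illustrative expression node matching the eval() calls above
// (Grid's real definition differs in details such as reference handling).
template <typename OpType, typename T1, typename T2>
struct LatticeBinaryExpression {
    OpType Op;     // functor with a func(a,b) member, e.g. add or multiply
    T1     arg1;   // left sub-expression (another node or a Lattice)
    T2     arg2;   // right sub-expression
};

// Example functor: the recursion bottoms out at Lattice leaves, so
// z = x*(y+x) builds nested nodes that are evaluated site by site
// without creating whole-lattice temporaries.
struct BinaryAdd {
    template <typename A, typename B>
    auto func(const A &a, const B &b) const -> decltype(a + b) { return a + b; }
};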
Porting Grid to GPUs
Porting to GPUs is quite challenging
- C++11 code strongly based on the STL, which is not well supported on GPUs
- Deep templating makes deep copy (or unified memory) a requirement
- Achieving coalesced access with the AoSoV layout can be complicated
Several possibilities:

OpenACC/OpenMP
- Directive-based, unifies code
- Easy to adapt existing code
- Portable
- Lacks deep-copy support
- Use in C++ code is non-trivial
- Compiler dependent
- Developer has little control

Jitify (www.github.com/NVIDIA/jitify)
- CUDA extensions not needed
- CPU and GPU execution policies can be present simultaneously
- Runtime compilation might affect performance
- Kernel functions must be supplied in header files

CUDA
- Most mature approach
- C++ support improving
- Developer has full control
- Need to write CUDA kernels, code branching
- Only NVIDIA GPU support
- Must add __host__ __device__ decorations
Porting Grid to GPUs
Kernel examples
OpenACC
#pragma acc parallel loop independent copyin(expr[0:1])
for(int ss = 0; ss < this->oSites(); ss++) {
_odata[ss] = eval(ss,expr);
}
OpenMP
#pragma omp target device(0) map(to:expr) map(tofrom:_odata[0:this->oSites()])
{
#pragma omp teams distribute parallel for
for(int ss = 0; ss < this->oSites(); ss++) {
_odata[ss] = eval(ss,expr);
}
}
Jitify
static const Location ExecutionSpaces[] = { DEVICE };
ExecutionPolicy policy(location);   // location taken from ExecutionSpaces
...
parallel_for(policy, 0, this->oSites(),
             JITIFY_LAMBDA( (_odata, expr), _odata[i] = eval(i, expr); ));
CUDA
// Kernel: one outer site per block (nBs blocks, nTs threads per block)
template<class Expr, class obj> __global__
void ETapply(int N, obj *_odata, Expr Op) {
  int ss = blockIdx.x;
  if (ss < N) _odata[ss] = eval(ss, Op);   // eval must be callable on the device
}
// Host side:
LatticeBinaryExpression<Op,T1,T2> temp = expr;
ETapply<decltype(temp),obj> <<<nBs,nTs>>> (this->oSites(), this->_odata, temp);
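For the CUDA kernel above to compile, every function reached from it (the eval overloads and the functors inside the expression) must be callable on the device. A minimal sketch of what the decorated leaf eval looks like follows; it is illustrative, and in practice a macro can hide the keywords so the same source still builds without nvcc.

// Decorated leaf eval (sketch): same body as the CPU version, now usable
// both from host code and from inside __global__ kernels.
template <class obj>
__host__ __device__ inline const obj & eval(const unsigned int ss, const Lattice<obj> &arg) {
    return arg._odata[ss];
}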
Porting Grid to GPUs
GPU compilers won't give coalesced access with an AoSoV layout
- Usually they assign a whole vector to a thread, and performance decreases due to register spilling
- Performance can be improved with a coalesced wrapper that dynamically transforms the layout
But Grid's native layout can be used effectively by the GPU, giving coalesced access
- Assign osites → blocks, isites → threads
- Requires some modifications to make the Expression Templates work properly
[Figure: the vectorized matrix is mapped onto threads — thread 1 reads SIMD lane 1 (a_ij1), thread 2 reads lane 2 (a_ij2), and so on, so consecutive threads read consecutive memory addresses]
Each osite is a vectorized matrix; each isite accesses a particular SIMD lane of the vectorized matrix
Porting Grid to GPUs
// Kernel: one block per outer site, one thread per SIMD lane (inner site)
template<class Expr, class vObj> __global__ void ETapply(vObj *_odata, Expr Op) {
  typedef typename vObj::scalar_object sObj;
  auto sD = evalS(blockIdx.x, Op, threadIdx.x);   // scalar result for this lane
  mergeS(_odata[blockIdx.x], sD, threadIdx.x);    // write it back into the lane
}

template <typename Op, typename T1, typename T2>
inline Lattice<vObj> & operator=(const LatticeBinaryExpression<Op,T1,T2> expr) {
#ifdef __NVCC__
  ETapply<decltype(expr), vObj> <<<this->oSites(), this->iSites()>>> (this->_odata, expr);
#else
  // CPU code goes here...
#endif
  return *this;
}

// Scalar (per-lane) evaluation of a binary node
template <typename Op, typename T1, typename T2>
auto inline evalS(const unsigned int ss, const LatticeBinaryExpression<Op,T1,T2> &expr, const int tIdx)
  -> decltype(expr.Op.func(evalS(ss,expr.arg1,tIdx), evalS(ss,expr.arg2,tIdx))) {
  return expr.Op.func(evalS(ss,expr.arg1,tIdx), evalS(ss,expr.arg2,tIdx));
}

// Leaf: extract lane tIdx of the vectorized object at site ss
template<class vObj>
inline auto evalS(const unsigned int ss, const Lattice<vObj> &arg, const int tIdx) {
  typedef typename vObj::scalar_object sObj;
  auto sD = extractS<vObj, sObj>(arg._odata[ss], tIdx);
  return (sObj) sD;
}
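The helpers extractS and mergeS used above are not shown on the slide. The sketch below illustrates their intent, assuming single-precision data and treating the vectorized object as an array of lanes; the real implementation works on the nested tensor types rather than raw floats.

// Illustrative lane extract/merge helpers (assumption: vObj holds N SIMD
// vectors of W lanes, and sObj holds the N scalar words of one lane).
template <class vObj, class sObj>
__device__ inline sObj extractS(const vObj &vec, int lane) {
    const float *v = reinterpret_cast<const float *>(&vec);
    sObj s;
    float *sp = reinterpret_cast<float *>(&s);
    constexpr int words = sizeof(sObj) / sizeof(float);   // words per lane
    constexpr int width = sizeof(vObj) / sizeof(sObj);    // SIMD width
    for (int i = 0; i < words; i++) sp[i] = v[i * width + lane];  // coalesced read
    return s;
}

template <class vObj, class sObj>
__device__ inline void mergeS(vObj &vec, const sObj &s, int lane) {
    float *v = reinterpret_cast<float *>(&vec);
    const float *sp = reinterpret_cast<const float *>(&s);
    constexpr int words = sizeof(sObj) / sizeof(float);
    constexpr int width = sizeof(vObj) / sizeof(sObj);
    for (int i = 0; i < words; i++) v[i * width + lane] = sp[i];  // coalesced write
}

With threadIdx.x as the lane index, consecutive threads touch consecutive addresses for each word, which is what gives the coalesced access pattern.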
Porting Grid to GPUs
[Figure: GFlops/s and GBytes/s vs problem size (KB), compared against an approximate stream-triad (ST) bound, for: OpenACC scalar datatype, Jitify scalar datatype, Jitify vector datatype, Jitify vector w/ coalesced ptr, CUDA scalar datatype, CUDA scalar w/ coalesced ptr, CUDA w/ Grid native datatype]
Porting Grid to GPUs
Comparison P100, V100, KNL. CUDA + Grid Native datatypes
[Figure: GFlops/s and GBytes/s vs problem size (KB) for CUDA + Grid native datatypes on Xeon E5-2695v4 (18 cores, AVX2), KNL (64 cores, A2A cache mode), Quadro GP100 (Pascal), and Tesla V100 (Volta), each compared to its stream-triad (ST) bound]
Porting Grid to GPUs
Results with native datatypes are almost independent of the vector length
Can be improved if we assign several vectors per block (a kernel sketch follows below)
[Figure: GFlops/s and GBytes/s vs block size (in 8-byte elements) for lattice volumes 16^4, 24^4, 32^4, 48^4 and 64^4, compared to the GPU stream-triad (ST) bound]
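Packing several outer sites (vectorized objects) into one thread block, as suggested above, could be sketched as follows; the kernel name, launch parameters and vectorsPerBlock are hypothetical, and evalS/mergeS are the per-lane helpers from the previous slides.

// Sketch: several outer sites per block, one SIMD lane per thread in x.
template <class Expr, class vObj>
__global__ void ETapplyPacked(int oSites, vObj *_odata, Expr Op) {
    int lane = threadIdx.x;                             // SIMD lane (inner site)
    int ss   = blockIdx.x * blockDim.y + threadIdx.y;   // outer site for this slice
    if (ss >= oSites) return;
    auto sD = evalS(ss, Op, lane);                      // per-lane scalar evaluation
    mergeS(_odata[ss], sD, lane);                       // write the lane back
}

// Launch: blockDim.x = iSites (SIMD width), blockDim.y = vectorsPerBlock.
// dim3 threads(iSites, vectorsPerBlock);
// ETapplyPacked<<<(oSites + vectorsPerBlock - 1) / vectorsPerBlock, threads>>>(oSites, _odata, expr);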
Summary Grid
Main difficulties
- By design, Grid easily creates many temporary objects (constructor calls) during any operation
- The coalesced pointer also requires constructor calls to operate
- In most implementations, constructors are not easily called from device code
  - OpenACC completely forbids this
  - CUDA is more relaxed, but we still find segfaults
  - Jitify uses the GNU compiler directly and seemed to be the most flexible approach
- Unified memory is a must for our implementation
OpenACC offers easy portability, but it is not mature enough and performance is not good
Jitify must place kernels and datatypes in header files, which can be cumbersome
CUDA with native Grid datatypes
- Superb performance
- No need to change the layout
- The nvcc compiler might struggle with the C++
- What do we do with the STL?
Other members of the team are trying other approaches, for example NIM (see Xian-Yong Jin's talk, Wed. 2:35 PM)
Next steps
Integrate Grid's full ET engine (not our stripped-down version), a tough C++ test for the compilers
Implement a simple linear operator w = Mv (dslash)
Other approaches: Kokkos (B. Joo, T. Kurth, J. Deslippe and K. Clark)
Single Right Hand Side (SRHS) Case:
Naive implementation: no explicit vectorization over lattice sites
Not much vectorization opportunity for the compiler
- 3 × 3 and 3 × 4 complex matrices, short trip-count loops
Expect performance similar to legacy codes on KNL
- Lack of vectorization causes low performance on KNL
GPU performance is very good
- 80% of the QUDA library on P100
- No vectorization needed due to the SIMT programming model
- Each thread works on a single site, in scalar mode
[Figure: Single RHS Dslash kernel performance. V100 results from NVIDIA, P100 results from SummitDev (OLCF), KNL results from Cori (NERSC). KNL nodes have 68 cores, but in some cases (QPhiX AVX512) only 64 may have been used, for load-balance reasons. Lattice volume was 32^4 sites. B. Joo (JLab), T. Kurth and J. Deslippe (NERSC), K. Clark (NVIDIA)]
Other approaches: Kokkos (B. Joo, T. Kurth, J. Deslippe and K. Clark)
Multi Right Hand Side (MRHS) Case:
Potential to vectorize over right-hand sides and reuse the gauge field
Using regular Kokkos complex, performance was low
- Still problems with vectorization
Specializing the vector type and using some AVX512 intrinsics gave a major boost (navy blue bar)
- 12× speedup over the unspecialized version
- Comparable to the QPhiX SRHS code
GPU performance was really good, especially on Volta
- Lower latency, unoptimized code runs better
- 4× speedup in int32 indexing operations
Code is performance portable
[Figure: Multi RHS Dslash kernel performance. V100 results from NVIDIA, P100 results from SummitDev (OLCF), KNL results from Cori (NERSC). KNL nodes have 68 cores, but in some cases (QPhiX AVX512) only 64 may have been used, for load-balance reasons. The lattice volume was 16^3 × 32 sites, with 8 RHS for KNL and 16 for GPU (natural H/W length). B. Joo (JLab), T. Kurth and J. Deslippe (NERSC), K. Clark (NVIDIA)]
Thank you
Thank you for your attention
Acknowledgements
We thank Mathias Wagner (NVIDIA) for his help during the recent hackathon
B. Joo thanks NERSC for travel support and a summer Associate appointment for the work involving Kokkos, for which we also gratefully acknowledge the use of computer resources at NERSC (Cori), OLCF (SummitDev) and NVIDIA (P100 and V100 devices).
This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, in support of the nation's exascale computing imperative.
Part of this research was carried out at the Brookhaven Hackathon 2017. The Brookhaven Hackathon is a collaboration between Brookhaven National Laboratory, University of Delaware, Stony Brook University, and the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, and used resources of Brookhaven National Laboratory.