Supporting Diverse Parallel Models in the Trilinos Library
Chris Baker Computational Engineering and Energy Studies
Oak Ridge National Laboratory, USA
MS 42: Parallel Programming Models, Algorithms and Frameworks for Scalable Manycore Systems SIAM Parallel Processing 2012 February 15-17, Savannah, GA
2 Managed by UT-Battelle for the U.S. Department of Energy SIAM PP12: Supporting Diverse Parallel Models in the Trilinos Library
Collaborators
• Oak Ridge National Laboratory
  – Ross Bartlett
• Sandia National Laboratories
  – Mike Heroux
  – Mark Hoemmen
  – Alan Williams
  – Carter Edwards
• École Polytechnique Fédérale de Lausanne
  – Radu Popescu
Dominant Scientific Library Paradigm
• Library provides a specific capability.
  – Apps can grab the data in order to expand functionality.
• In an MPI-only scenario, expansion comes via domain-specific serial kernels coded by domain specialists.
  – i.e., not doing any shared-memory programming
• With a single memory pool, data easily shared between library and app.
• With a single target architecture, compilation is relatively simple.
  – Use any language for which you have a compiler.
  – Mechanisms exist for mixed language capability.
Enter the Hybrid Parallel Environment
• The path to exascale apparently requires addressing many-core.
  – LANL RoadRunner: Cell BE and multi-core CPUs
  – Tianhe 1A: NVIDIA GPUs and multi-core CPUs
  – “K” Computer: simply consists of nodes of 8-core CPUs
  – TACC Stampede: dual octo-core CPUs and Intel Knights Corner
  – OLCF Titan/Cray XK6: one NVIDIA GPU per 12-core AMD CPU
• Ditch the assumptions of the previous slide/paradigm.
  – We must investigate other parallel programming models.
  – We must revisit the app/library relationship.
  – We may need to consider other programming languages.
  – Portability is more challenging than it has been recently.
Numerous Considerations
• Parallel Programming Model
  – MPI-only is the status quo for a large number of codes.
    • Well-defined message passing API is an optimization target for vendors
    • Users write serial, portable code
  – MPI-plus is where many codes are going.
    • e.g., MPI+OpenMP, MPI+CUDA, MPI+directives
    • Explicit two-level shared/distributed hybrid.
• Programming Language
  – Programmer productivity is rooted in languages and APIs.
  – C++, Fortran, OpenCL, CUDA offer different levels of expressiveness.
• Library Extension
  – “Grab the data and run” extension model requires addressing parallelism.
  – Intrusive modification to a living library is untenable.
Challenges
• MPI-only not enough
  – Need to port: it doesn’t work for accelerators.
  – Inefficient: it misses a lot of shared-memory benefits.
• MPI-plus can entail significant work
  – We want to minimize the number of code bases.
  – We want to minimize the effort to add a new code base.
• Language issues
  – Many APIs require a particular language.
  – Developers resent being told what language to use.
• Lib/User interface issues
  – Extending the library should not introduce serial bottlenecks.
  – Shouldn’t require users to be shared-memory API experts.
Some approaches in Stage 2 Trilinos
• Templated C++ code
  – Templating data allows more efficient use of cache and bandwidth.
  – Templating data expands capability (e.g., integer limit, complex)
• Generic shared memory parallel node
  – Kokkos provides shared memory parallel node API
  – Interface to numerous APIs via template metaprogramming layer
• Hybrid programming model
  – Hybrid programming skeletons to support most common patterns
  – Expose models for high-productivity, performance-portable apps
• Non-intrusive modification of structures and algorithms
  – Expose the SMP node to apps; enable node-optimized kernels.
Kokkos and Tpetra Packages
• Kokkos is an API for shared-memory parallel nodes.
  – Provides parallel_for and parallel_reduce skeletons
  – Memory model addresses challenge of accelerator memory
  – Provides reference linear algebra kernels
  – Currently supports multiple shared-memory APIs:
    • ThreadPool Interface (TPI, a Trilinos pthreads package)
    • Intel Threading Building Blocks (TBB)
    • NVIDIA CUDA-capable GPUs (via Thrust)
    • OpenMP (new; implemented by Radu Popescu, EPFL)
• Tpetra is a distributed linear algebra library.
  – Heavily exploits templated C++
  – Employs hybrid (distributed + shared) parallelism via Kokkos
Programming Heterogeneous Clusters
• Kokkos handles shared-memory.
• Tpetra handles communication between nodes.
  – How do we handle heterogeneous multi-core architectures?
• Multiple disjoint memories → distributed memory
  – We have significant tools built around this model.
• One MPI process per shared-memory pool.
  – Have to be even more careful with communication than before.
• A lot can be done with a two-level hybrid model.
• Templated classes differentiate node types.
• Emulate MPI: identify common patterns, provide skeletons.
Tpetra Hybrid Parallelism
• The typical Tpetra computational kernel concerns:
  1) member data structures
  2) calls to the Kokkos NodeAPI for shared-memory programming
  3) calls to a communicator for message passing
e.g., Tpetra::Vector::norm1()
  (1) internal class data
        Scalar *x; int N;
  (2) call the Kokkos NodeAPI
        DotOp<Scalar> op(x);
        lcl = node.parallel_reduce( 0, N, op );
  (3) call the Comm
        gbl = comm.reduceAll( lcl, SUM );
• Extending library functionality can be done via external input at these three junctions.
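The three junctions can be sketched in a single process. This is only an illustration, not Tpetra code: `LocalVector`, `reduceAllSum` and the free function `norm1` are hypothetical names, `std::accumulate` stands in for the node's `parallel_reduce`, and a plain sum over simulated ranks stands in for the communicator.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the communicator: sums one contribution
// per simulated rank (a real Comm would exchange messages instead).
double reduceAllSum(const std::vector<double>& perRank) {
  return std::accumulate(perRank.begin(), perRank.end(), 0.0);
}

// One rank's share of a distributed vector.
struct LocalVector {
  std::vector<double> x;  // (1) internal class data

  // (2) local shared-memory reduction; std::accumulate stands in
  //     for a call such as node.parallel_reduce(0, N, op).
  double localNorm1() const {
    return std::accumulate(x.begin(), x.end(), 0.0,
                           [](double acc, double v) { return acc + std::fabs(v); });
  }
};

// (3) the global result is the reduction of the local results over all ranks.
double norm1(const std::vector<LocalVector>& ranks) {
  assert(!ranks.empty());
  std::vector<double> lcl;
  for (const LocalVector& r : ranks) lcl.push_back(r.localNorm1());
  return reduceAllSum(lcl);
}
```

Swapping the data layout, the local reduction, or the global reduction changes exactly one of the three numbered pieces, which is the extension point the slide describes.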
Tpetra Vector Methods
• Set of stand-alone non-member methods, e.g.:
  – unary_transform<UOP>(Vector &v, UOP op)
  – binary_transform<BOP>(Vector &v1, const Vector &v2, BOP op)
  – reduce<G>(const Vector &v1, const Vector &v2, G op_glob)
• Kernel level provides maximal expressiveness, but coarser levels bring convenience.
    // single-prec dot() with double-prec accumulator via custom kernel
    result = reduce( *x, *y, myDotProductKernel<float,double>() );
    // Or a composite adaptor and standard functors
    result = reduce( *x, *y, reductionGlob<ZeroOp<double>>(
                                 std::multiplies<float>(),
                                 std::plus<double>()) );
    // Or using inline functors via C++11 lambda functions
    result = reduce( *x, *y, reductionGlob<ZeroOp<double>>(
                                 [](float x, float y) {return x*y;},
                                 [](double a, double b){return a+b;}) );
    // Or using a convenience macro to generate all of that
    result = REDUCE2( x, y, x*y, ZeroOp<float>, std::plus<double>() );
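The "generate in one precision, accumulate in another" idea behind these calls can be tried out locally. The sketch below is an assumption-laden analogue, not the Tpetra interface: `reduce2`, `dot_functors` and `dot_lambdas` are hypothetical names, and the reduction runs serially over `std::vector` rather than over a distributed Vector.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical local-only analogue of reduce(): `gen` combines one pair of
// entries, `red` folds the generated value into an accumulator of type Acc.
template <class In, class Acc, class Gen, class Red>
Acc reduce2(const std::vector<In>& x, const std::vector<In>& y,
            Acc init, Gen gen, Red red) {
  assert(x.size() == y.size());
  Acc acc = init;
  for (std::size_t i = 0; i < x.size(); ++i)
    acc = red(acc, static_cast<Acc>(gen(x[i], y[i])));
  return acc;
}

// Single-precision products, double-precision accumulator, standard functors.
inline double dot_functors(const std::vector<float>& x, const std::vector<float>& y) {
  return reduce2(x, y, 0.0, std::multiplies<float>(), std::plus<double>());
}

// The same reduction written with C++11 lambdas.
inline double dot_lambdas(const std::vector<float>& x, const std::vector<float>& y) {
  return reduce2(x, y, 0.0,
                 [](float a, float b) { return a * b; },
                 [](double a, double b) { return a + b; });
}
```

Both wrappers compute the same value; they differ only in how the two operations are spelled, which mirrors the convenience levels listed on the slide.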
Easy Parallel Algorithm Development
    for (k=0; k<numIters; ++k) {
      A->apply( *p, *Ap );                                   // Ap = A*p
      S pAp = REDUCE2( p, Ap, p*Ap, ZeroOp<S>, plus<S>() );  // p'*Ap
      const S alpha = rr / pAp;                              // alpha = r'*r/p'*Ap
      BINARY_TRANSFORM( x, p, x + alpha*p );                 // x = x + alpha*p
      S rrold = rr;
      rr = BINARY_PRETRANSFORM_REDUCE( r, Ap,                // fused kernels
                r - alpha*Ap,                                // r - alpha*Ap
                r*r, ZeroOp<S>, plus<S>() );                 // sum r'*r
      const S beta = rr / rrold;                             // beta = r'*r/old(r'*r)
      BINARY_TRANSFORM( p, r, r + beta*p);                   // p = r + beta*p
    }
• Inline templated hybrid-parallel conjugate gradient.
  – Fun game: Find the MPI or threading!
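For readers who want to run the iteration, here is a plain serial version of the same loop. It is a sketch under stated assumptions: the names `cg`, `dot`, `laplace1d` and `residualNorm` are hypothetical, the initial guess is zero, and ordinary loops replace the REDUCE2 / BINARY_TRANSFORM macros.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using MatVec = std::function<void(const Vec&, Vec&)>;

double dot(const Vec& a, const Vec& b) {
  assert(a.size() == b.size());
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// Example operator (assumed for illustration): 1D Laplacian, tridiag(-1, 2, -1).
void laplace1d(const Vec& x, Vec& y) {
  const std::size_t n = x.size();
  assert(y.size() == n);
  for (std::size_t i = 0; i < n; ++i)
    y[i] = 2.0 * x[i] - (i > 0 ? x[i - 1] : 0.0) - (i + 1 < n ? x[i + 1] : 0.0);
}

// Unpreconditioned CG for a symmetric positive definite operator.
// Starts from x = 0 and returns the number of iterations performed.
int cg(const MatVec& apply, const Vec& b, Vec& x, int maxIters, double tol) {
  const std::size_t n = b.size();
  x.assign(n, 0.0);
  Vec r = b, p = b, Ap(n);
  double rr = dot(r, r);
  const double stop = tol * tol * rr;  // relative test on squared norms
  int k = 0;
  for (; k < maxIters && rr > stop; ++k) {
    apply(p, Ap);                                  // Ap = A*p
    const double alpha = rr / dot(p, Ap);          // alpha = r'*r / p'*Ap
    for (std::size_t i = 0; i < n; ++i) x[i] += alpha * p[i];
    const double rrold = rr;
    rr = 0.0;
    for (std::size_t i = 0; i < n; ++i) {          // fused update and reduction
      r[i] -= alpha * Ap[i];
      rr += r[i] * r[i];
    }
    const double beta = rr / rrold;
    for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
  }
  return k;
}

// Euclidean norm of b - A*x, for checking a computed solution.
double residualNorm(const MatVec& apply, const Vec& b, const Vec& x) {
  Vec Ax(b.size());
  apply(x, Ax);
  double s = 0.0;
  for (std::size_t i = 0; i < b.size(); ++i) s += (b[i] - Ax[i]) * (b[i] - Ax[i]);
  return std::sqrt(s);
}
```

Each inner loop here corresponds to one macro call in the slide's version, where the same line of user code is expanded into a node-level kernel plus any needed global reduction.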
Example: Recursive Multi-Prec. CG
    for (k=0; k<numIters; ++k) {
      A->apply(*p,*Ap);                                     // Ap = A*p
      T pAp = REDUCE2( p, Ap, p*Ap, ZeroOp<T>, plus<T>());  // p'*Ap
      const T alpha = zr / pAp;
      BINARY_TRANSFORM( x, p, x + alpha*p );                // x = x + alpha*p
      BINARY_TRANSFORM( rold, r, r );                       // rold = r
      T rr = BINARY_PRETRANSFORM_REDUCE( r, Ap,             // fused:
                  r - alpha*Ap,                             //   r - alpha*Ap
                  r*r, ZeroOp<T>, plus<T>() );              //   sum r'*r
      recursiveFPCG<TS::next,LO,GO,Node>(out,db_T2);        // recurse
      auto plusTT = make_pair_op<T,T>(plus<T>());
      pair<T,T> both = REDUCE3( z, r, rold,                 // fused:
                  make_pair( z*r, z*rold ),                 //   z'*r, z'*r_old
                  ZeroPTT, plusTT );
      const T beta = (both.first - both.second) / zr;
      zr = both.first;
      BINARY_TRANSFORM( p, z, z + beta*p );                 // p = z + beta*p
    }
Example: Simple CG
• Problem dimension 5M
• 500 iterations
• Double precision arithmetic
• MPI + TBB parallel node
• #threads = #mpi x #tbb
• invocation like:
    mpirun -np 4 ./driver.exe --machine-file=tbb4.xml
[Figure: runtime (log sec) vs. total number of threads (1, 2, 4, 8, 16); one curve each for MPI 1, MPI 2, MPI 4, MPI 8, MPI 16]
Example: Simple CG
• Problem dimension 512K
• 125 iterations
• Quad-double precision
• MPI + TBB parallel node
• #threads = #mpi x #tbb
• Same codebase, simply instantiated on qd_real instead of double.
[Figure: runtime (log sec) vs. total number of threads (1, 2, 4, 8, 16); one curve each for MPI 1, MPI 2, MPI 4, MPI 8, MPI 16]
Example: Recursive Multi-Prec. CG
    TBBNode initializing with numThreads == 2
    TBBNode initializing with numThreads == 2
    Running test with Node==Kokkos::TBBNode on rank 0/2
    Beginning recursiveFPCG<qd_real>
    Beginning recursiveFPCG<dd_real>
    |res|/|res_0|: 1.269903e-14
    |res|/|res_0|: 3.196573e-24
    |res|/|res_0|: 6.208795e-35
    Convergence detected!
    Leaving recursiveFPCG<dd_real> after 2 iterations.
    |res|/|res_0|: 2.704682e-32
    Beginning recursiveFPCG<dd_real>
    |res|/|res_0|: 4.531185e-09
    |res|/|res_0|: 6.341084e-20
    |res|/|res_0|: 8.326745e-31
    Convergence detected!
    Leaving recursiveFPCG<dd_real> after 2 iterations.
    |res|/|res_0|: 3.661388e-58
    Leaving recursiveFPCG<qd_real> after 2 iterations.
Example: Recursive Multi-Prec. CG
• Problem: Oberwolfach/gyro
• N=17K, nnz=1M
• qd_real / dd_real / double
• MPI + TBB parallel node
• #threads = #mpi x #tbb
• Solved to over 60 digits
• Around 99.9% of time spent in double precision computation.
• Single codebase.
[Figure: two panels, qd_real and dd_real; horizontal axis ticks 4, 8, 16; one curve each for MPI 1, MPI 2, MPI 4, MPI 8, MPI 16]
Problems With Generic Kernels
• Generic kernels are not always successful:
  – e.g., CRS mat-vec on GPUs is sub-optimal
• Different kernel may need different data structure.
• We want vendors and researchers to be able to substitute kernels into our library.
• Solution #1 treats the kernel as a first-class object.
  – It is also a template parameter, potentially informing the structure of the local data.
• Solution #2 allows a class to be “specialized” to a particular platform, non-intrusively.
Kernel-Agnostic Sparse Matrix
    class CrsMatrix<Scalar,Ord,Node,Matvec> {
      Comm comm;
      typename Matvec::rebind<Scalar>::type lclMatVecOp;
      typename Matvec::matrix<Scalar,Ord,Node>::type lclMatrix;
    };
    CrsMatrix::fillComplete() {
      // ... use comm to communicate non-local entries
      lclMatrix.fill( ... );
      lclMatVecOp.submitEntries( lclMatrix );
    }
    CrsMatrix::multiply(Vector x, Vector y) {
      // ... use comm to perform exchange on x
      Kokkos::Vector lclx = x.getLocalVector();
      Kokkos::Vector lcly = y.getLocalVector();
      lclMatVecOp.apply(lclx, lcly);
      // ... use comm to perform exchange on y
    }
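The kernel-as-template-parameter pattern above can be written out as compilable C++. The following is a minimal local-only sketch, assuming hypothetical names (`CrsStorage`, `CrsKernel`, `DefaultMatvec`, `LocalCrsMatrix`) and omitting the communicator entirely; it is not the Tpetra class.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical local compressed-row storage.
template <class Scalar>
struct CrsStorage {
  std::vector<std::size_t> rowPtr;  // size numRows + 1
  std::vector<std::size_t> colInd;
  std::vector<Scalar> val;
};

// Hypothetical default kernel. It keeps its own copy of the entries,
// so a replacement kernel is free to reorganize them into another layout.
template <class Scalar>
class CrsKernel {
 public:
  void submitEntries(const CrsStorage<Scalar>& m) {
    assert(!m.rowPtr.empty());
    m_ = m;
  }
  void apply(const std::vector<Scalar>& x, std::vector<Scalar>& y) const {
    assert(!m_.rowPtr.empty() && y.size() + 1 == m_.rowPtr.size());
    for (std::size_t row = 0; row + 1 < m_.rowPtr.size(); ++row) {
      Scalar sum = Scalar(0);
      for (std::size_t k = m_.rowPtr[row]; k < m_.rowPtr[row + 1]; ++k)
        sum += m_.val[k] * x[m_.colInd[k]];
      y[row] = sum;
    }
  }
 private:
  CrsStorage<Scalar> m_;
};

// Policy tag: names both the kernel type and the storage type it wants.
struct DefaultMatvec {
  template <class Scalar> struct rebind { typedef CrsKernel<Scalar> type; };
  template <class Scalar> struct matrix { typedef CrsStorage<Scalar> type; };
};

// The matrix class only forwards to whatever the policy provides.
template <class Scalar, class Matvec = DefaultMatvec>
class LocalCrsMatrix {
 public:
  typedef typename Matvec::template matrix<Scalar>::type storage_type;
  void fillComplete(const storage_type& entries) {
    lclMatrix_ = entries;
    lclMatVecOp_.submitEntries(lclMatrix_);
  }
  void multiply(const std::vector<Scalar>& x, std::vector<Scalar>& y) const {
    lclMatVecOp_.apply(x, y);
  }
 private:
  storage_type lclMatrix_;
  typename Matvec::template rebind<Scalar>::type lclMatVecOp_;
};
```

A vendor kernel would be plugged in by passing a different policy struct as the second template argument; the matrix class itself is not edited.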
Specializations for Fine-Tuning
• Metaprogramming-based generic node is not perfect.
  – Some APIs are not amenable to this approach (e.g., OpenCL).
  – We don’t want to have to expose every kernel as we did for mat-vec.
• You could hack up the library with #ifdefs.
  – This is the main benefit of FOSS.
  – But once you touch it, you own it. And upgrades are hard.
• Template specializations provide a non-intrusive means for augmenting/modifying library capability.
    template <>
    class Tpetra::Vector<double,int,int,OpenCLNode> {
      // manual implementation for double/int under OpenCL
    };
    template <>
    class Tpetra::Vector<float,int,int,OpenCLNode> {
      // manual implementation for float/int under OpenCL
    };
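How a full specialization overrides the generic class without touching it can be seen in a toy setting. In this sketch every name is hypothetical (`Vector` here has only two template parameters, and `SerialNode` / `OpenCLNode` are empty tag types); it shows the C++ mechanism only.

```cpp
#include <cassert>
#include <string>

struct SerialNode {};
struct OpenCLNode {};  // hypothetical tag for a platform the generic path cannot serve

// "Library" code: the primary template, generic over Scalar and Node.
template <class Scalar, class Node>
struct Vector {
  static std::string backend() { return "generic"; }
};

// "User" code, kept outside the library: a full specialization that
// replaces the implementation for exactly one Scalar/Node combination.
template <>
struct Vector<double, OpenCLNode> {
  static std::string backend() { return "hand-written OpenCL"; }
};
```

Only `Vector<double, OpenCLNode>` picks up the replacement; every other instantiation, including `Vector<float, OpenCLNode>`, still uses the primary template until it gets its own specialization.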
Conclusion
• C++ templates and metaprogramming are being used in Trilinos to define a programming model that:
  – provides support for research into efficient solvers
  – allows user-authored serial code to be executed in hybrid parallel on heterogeneous platforms
  – provides non-intrusive modification/extension of the library by users, researchers and vendors.
• The goal is to optimize programmer efficiency without significant performance sacrifices.
• This is largely an experimental capability, deployed in only parts of the library.
appendix
Tpetra Operator Methods
• Tpetra Reduction/Transformation Interface provides convenience methods/macros for applying user Kokkos kernels to Tpetra Vectors/MultiVectors.
    RCP< Tpetra::Map<LO,GO,Node> > domMap, rngMap, rowMap, colMap;
    RCP< Tpetra::Import<LO,GO,Node> > importer = ...;
    RCP< Tpetra::Export<LO,GO,Node> > exporter = ...;
    MyKernel<T,LO> kern(...);
    RCP< Tpetra::Operator<T,LO,GO,Node> > op;
    op = Tpetra::RTI::kernelOp<T>(kern,domMap,rngMap,importer,exporter);
    op->apply(x, y);
• Also wrappers for applying general functors.
  – e.g.: simple diagonal operator using a C++11 lambda function
    RCP< Tpetra::Map<LO,GO,Node> > map;
    RCP< Tpetra::Operator<T,LO,GO,Node> > op;
    op = Tpetra::RTI::binaryOp<T>( [](T, T x) {return 2.0 * x;} , map );
    op->apply(x, y);
Tool: Tpetra HybridPlatform
• Encapsulate main in a templated class method:
    template <class Node>
    class myMainRoutine {
    public:
      static void run(ParameterList &runParams,
                      const RCP<const Comm<int> > &comm,
                      const RCP<Node> &node) {
        // do something interesting
      }
    };
• HybridPlatform maps the communicator rank to the Node type, instantiates a node and the user routine:
    int main(...) {
      Comm<int> comm = ...
      ParameterList machine_file = ...
      // instantiate appropriate node and myMainRoutine
      Tpetra::HybridPlatform platform( comm , machine_file );
      platform.runUserCode< myMainRoutine >();
      return 0;
    }
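The step from a run-time node name to a compile-time `Node` type can be sketched with a template template parameter. Everything below is hypothetical (`SerialNode`, `ThreadedNode`, `MyMainRoutine`, and the free function `runUserCode` with its string and rank arguments); it shows one way such dispatch could work, not how HybridPlatform is implemented.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical node types; each only reports its own name.
struct SerialNode   { static const char* name() { return "SerialNode"; } };
struct ThreadedNode { static const char* name() { return "ThreadedNode"; } };

// User code: "main" wrapped in a class template over the node type.
template <class Node>
struct MyMainRoutine {
  static std::string run(int rank) {
    return "rank " + std::to_string(rank) + " on " + Node::name();
  }
};

// Dispatcher: chooses the instantiation of the user template that matches
// the node type named at run time (e.g., read from a machine file).
template <template <class> class UserCode>
std::string runUserCode(const std::string& nodeType, int rank) {
  if (nodeType == "SerialNode")   return UserCode<SerialNode>::run(rank);
  if (nodeType == "ThreadedNode") return UserCode<ThreadedNode>::run(rank);
  throw std::invalid_argument("unknown node type: " + nodeType);
}
```

The user routine is written once; the dispatcher instantiates it for every node type it knows about and selects one per process.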
HybridPlatform Machine File
    <ParameterList>
      <ParameterList name="%2=0">
        <Parameter name="NodeType" type="string" value="Kokkos::ThrustGPUNode"/>
        <Parameter name="Verbose" type="int" value="1"/>
        <Parameter name="Device Number" type="int" value="0"/>
        <Parameter name="Node Weight" type="int" value="4"/>
      </ParameterList>
      <ParameterList name="%2=1">
        <Parameter name="NodeType" type="string" value="Kokkos::TPINode"/>
        <Parameter name="Verbose" type="int" value="1"/>
        <Parameter name="Num Threads" type="int" value="15"/>
        <Parameter name="Node Weight" type="int" value="15"/>
      </ParameterList>
    </ParameterList>
[Diagram: resulting rank-to-node assignment]
    hostname0: rank 0 → ThrustGPUNode, rank 1 → TPINode
    hostname1: rank 2 → ThrustGPUNode, rank 3 → TPINode
    ...
Rank-matching forms for the ParameterList name:
    round-robin assignment    %M=N
    interval assignment       [M,N]
    explicit assignment       =N
    default                   default
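The four matching forms can be made concrete with a small predicate. This is a hypothetical helper, not the Tpetra parser, and it assumes the following readings of the table: `%M=N` matches ranks with rank mod M equal to N, `[M,N]` matches the inclusive interval, `=N` matches a single rank, and `default` matches everything.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Returns true if `rank` is selected by the sublist name `name`.
// Assumed semantics only; trailing characters after a valid pattern are ignored.
bool matchesRank(const std::string& name, int rank) {
  assert(rank >= 0);
  if (name == "default") return true;
  int m = 0, n = 0;
  char close = 0;
  // round-robin assignment: "%M=N"
  if (std::sscanf(name.c_str(), "%%%d=%d", &m, &n) == 2)
    return m > 0 && rank % m == n;
  // interval assignment: "[M,N]"
  if (std::sscanf(name.c_str(), "[%d,%d%c", &m, &n, &close) == 3 && close == ']')
    return m <= rank && rank <= n;
  // explicit assignment: "=N"
  if (std::sscanf(name.c_str(), "=%d", &n) == 1)
    return rank == n;
  return false;
}
```

Under these assumptions the machine file above sends even ranks to the `%2=0` sublist (the GPU node) and odd ranks to `%2=1` (the TPI node), which agrees with the rank layout in the diagram.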
Refresher: Kokkos Parallel Constructs
• Parallel for: execute loop iterations in parallel
• User-defined struct (work-data pair) contains:
  – the necessary data and execute(int iter)
• Parallel reduce: reduce implicit set of elements in parallel via user-specified associative binary operation
  – typedef ReductionType
  – ReductionType identity()
  – ReductionType generate(int i)
  – ReductionType reduce(ReductionType a, ReductionType b)
• Template meta-programming fuses generic loop skeleton with user data and kernel specifications.
    Node::parallel_for<WDP>(int beg, int end, WDP args);
    Node::parallel_reduce<WDP>(int beg, int end, WDP args);
Kokkos parallel_for example
• Consider simple vector axpy: y = α ∗ x + y
    template <class Scalar>
    struct AxpyOp {
      Scalar alpha;
      const Scalar *x;
      Scalar *y;
      inline void execute(int i) { y[i] += alpha * x[i]; }
    };

    AxpyOp<double> daxpy( ... );
    Node::parallel_for(0,N,daxpy);
    AxpyOp<complex<float> > caxpy( ... );
    Node::parallel_for(0,N,caxpy);
Kokkos parallel_reduce example
• Consider real-valued vector inner product: α = xᵀ y
    template <class Scalar>
    struct DotOp {
      const Scalar *x, *y;
      typedef Scalar ReductionType;
      Scalar identity() { return 0; }
      Scalar generate(int i) { return x[i]*y[i]; }
      Scalar reduce(Scalar a, Scalar b) { return a+b; }
    };

    DotOp<float> fdot( ... );
    float f = Node::parallel_reduce(0,N,fdot);
    DotOp<qd_real> qddot( ... );
    qd_real q = Node::parallel_reduce(0,N,qddot);
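The two kernel examples leave the Node itself unspecified. A minimal serial node that satisfies the same calling convention is sketched below; `SerialNode` is a hypothetical stand-in written for this illustration (not the Kokkos implementation), while `AxpyOp` and `DotOp` repeat the structs from the slides so the block compiles on its own.

```cpp
#include <cassert>
#include <vector>

// Hypothetical serial node: the "skeletons" are plain loops that call
// into the user-supplied work-data pair (WDP).
struct SerialNode {
  template <class WDP>
  static void parallel_for(int beg, int end, WDP wd) {
    assert(beg <= end);
    for (int i = beg; i < end; ++i) wd.execute(i);
  }

  template <class WDP>
  static typename WDP::ReductionType parallel_reduce(int beg, int end, WDP wd) {
    assert(beg <= end);
    typename WDP::ReductionType acc = wd.identity();
    for (int i = beg; i < end; ++i) acc = wd.reduce(acc, wd.generate(i));
    return acc;
  }
};

// Kernel from the parallel_for slide: y = alpha*x + y.
template <class Scalar>
struct AxpyOp {
  Scalar alpha;
  const Scalar *x;
  Scalar *y;
  inline void execute(int i) { y[i] += alpha * x[i]; }
};

// Kernel from the parallel_reduce slide: inner product of x and y.
template <class Scalar>
struct DotOp {
  const Scalar *x, *y;
  typedef Scalar ReductionType;
  Scalar identity() { return 0; }
  Scalar generate(int i) { return x[i] * y[i]; }
  Scalar reduce(Scalar a, Scalar b) { return a + b; }
};
```

Because the kernels only expose `execute`, `identity`, `generate` and `reduce`, a threaded or GPU node could run the same structs by changing how the index range is split, not the kernels themselves.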
Some Ugly Details
• Host compiler: implicit instantiation handles coupling
  – important to use inline/static whenever possible
• Device compiler (nvcc): need explicit instantiation
  1. put explicit instantiations in .cu file:
       #include "Kokkos_ThrustGPUNode.cuh"   // Node routines, in CUDA
       #include "TestOps.hpp"                // Kernels, in C
       template void Kokkos::ThrustGPUNode::parallel_for<InitOp<int> >
                                                (int, int, InitOp<int>);
  2. compile via nvcc:
       prompt> nvcc -c -o libkernels_cuda.a exp_inst_cuda_kernels.cu
  – nvcc supports templates and template meta-programming :)
  – OpenCL does not (yet?) :(