Supporting Diverse Parallel Models in the Trilinos Library

Chris Baker Computational Engineering and Energy Studies

Oak Ridge National Laboratory, USA

MS 42: Parallel Programming Models, Algorithms and Frameworks for Scalable Manycore Systems SIAM Parallel Processing 2012 February 15-17, Savannah, GA

2 Managed by UT-Battelle for the U.S. Department of Energy SIAM PP12: Supporting Diverse Parallel Models in the Trilinos Library

Collaborators

• Oak Ridge National Laboratory –  Ross Bartlett

• Sandia National Laboratories –  Mike Heroux –  Mark Hoemmen –  Alan Williams –  Carter Edwards

• École Polytechnique Fédérale de Lausanne –  Radu Popescu

Dominant Scientific Library Paradigm

•  Library provides a specific capability. –  Apps can grab the data in order to expand functionality.

•  In an MPI-only scenario, expansion comes via domain-specific serial kernels coded by domain specialists. –  i.e., not doing any shared-memory programming

• With a single memory pool, data easily shared between library and app.

• With a single target architecture, compilation is relatively simple. –  Use any language for which you have a compiler. –  Mechanisms exist for mixed language capability.

Enter the Hybrid Parallel Environment

•  The path to exascale apparently requires addressing many-core. –  LANL RoadRunner: Cell BE and multi-core CPUs –  Tianhe 1A: NVIDIA GPUs and multi-core CPUs –  “K” Computer: simply consists of nodes of 8-core CPUs –  TACC Stampede: dual octo-core CPUs and Intel Knights Corner –  OLCF Titan/Cray XK6: one NVIDIA GPU per 12-core AMD CPU

• Ditch the assumptions of the previous slide/paradigm. –  We must investigate other parallel programming models. –  We must revisit the app/library relationship. –  We may need to consider other programming languages. –  Portability is more challenging than it has been recently.

Numerous Considerations

• Parallel Programming Model

–  MPI-only is the status quo for a large number of codes. •  Well-defined message passing API is an optimization target for vendors •  Users write serial, portable code

–  MPI-plus is where many codes are going. •  e.g., MPI+OpenMP, MPI+CUDA, MPI+directives •  Explicit two-level shared/distributed hybrid.

• Programming Language –  Programmer productivity is rooted in languages and APIs. –  C++, Fortran, OpenCL, CUDA offer different levels of expressiveness.

•  Library Extension –  “Grab the data and run” extension model requires addressing parallelism. –  Intrusive modification to a living library is untenable.

Challenges

• MPI-only not enough

–  Need to port: it doesn’t work for accelerators. –  Inefficient: it misses a lot of shared-memory benefits.

• MPI-plus can entail significant work –  We want to minimize the number of code bases. –  We want to minimize the effort to add a new code base.

•  Language issues –  Many APIs require a particular language. –  Developers resent being told what language to use.

•  Lib/User interface issues –  Extending the library should not introduce serial bottlenecks. –  Shouldn’t require users to be shared-memory API experts.

Some approaches in Stage 2 Trilinos

•  Templated C++ code –  Templating data allows more efficient use of cache and bandwidth. –  Templating data expands capability (e.g., integer limit, complex)

• Generic shared memory parallel node –  Kokkos provides shared memory parallel node API –  Interface to numerous APIs via template metaprogramming layer

• Hybrid programming model –  Hybrid programming skeletons to support most common patterns –  Expose models for high-productivity, performance-portable apps

• Non-intrusive modification of structures and algorithms –  Expose the SMP node to apps; enable node-optimized kernels.

Kokkos and Tpetra Packages

• Kokkos is an API for shared-memory parallel nodes. –  Provides parallel_for and parallel_reduce skeletons –  Memory model addresses challenge of accelerator memory –  Provides reference linear algebra kernels –  Currently supports multiple shared-memory APIs:

•  ThreadPool Interface (TPI, a Trilinos pthreads package) •  Intel Threading Building Blocks (TBB) •  NVIDIA CUDA-capable GPUs (via Thrust) •  OpenMP (new; implemented by Radu Popescu, EPFL)

•  Tpetra is a distributed linear algebra library. –  Heavily exploits templated C++ –  Employs hybrid (distributed + shared) parallelism via Kokkos

Programming Heterogeneous Clusters

• Kokkos handles shared-memory. •  Tpetra handles communication between nodes.

–  How do we handle heterogeneous multi-core architectures?

• Multiple disjoint memories → distributed memory –  We have significant tools built around this model.

• One MPI process per shared-memory pool. –  Have to be even more careful with communication than before.

• A lot can be done with a two-level hybrid model. •  Templated classes differentiate node types. • Emulate MPI: identify common patterns, provide skeletons.

Tpetra Hybrid Parallelism

•  The typical Tpetra computational kernel involves three concerns:
   1)  member data structures
   2)  calls to the Kokkos NodeAPI for shared-memory programming
   3)  calls to a communicator for message passing

e.g., Tpetra::Vector::norm1()

(1) internal class data
    Scalar *x; int N;

(2) call the Kokkos NodeAPI
    DotOp<Scalar> op(x);
    lcl = node.parallel_reduce( 0, N, op );

(3) call the Comm
    gbl = comm.reduceAll( lcl, SUM );

• Extending library functionality can be done via external input at these three junctions.
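The three junctions can be sketched in plain C++. This is a minimal illustration, not Tpetra code: `SerialNode`, `FakeComm`, `AbsSumOp`, and `ToyVector` are hypothetical stand-ins, the node runs serially, and the "communicator" simply adds partial sums that would otherwise arrive from other ranks.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for a shared-memory node: runs the reduction serially.
struct SerialNode {
  template <class Op>
  typename Op::ReductionType parallel_reduce(int beg, int end, Op op) const {
    typename Op::ReductionType acc = op.identity();
    for (int i = beg; i < end; ++i) acc = op.reduce(acc, op.generate(i));
    return acc;
  }
};

// Hypothetical stand-in for a communicator: holds the partial results that
// the other ranks would contribute to a global sum.
struct FakeComm {
  std::vector<double> otherRanks;
  double reduceAllSum(double lcl) const {
    double gbl = lcl;
    for (double v : otherRanks) gbl += v;
    return gbl;
  }
};

// Work-data pair for the 1-norm: sum of absolute values.
template <class Scalar>
struct AbsSumOp {
  const Scalar* x;
  typedef Scalar ReductionType;
  Scalar identity() const { return Scalar(0); }
  Scalar generate(int i) const { return std::abs(x[i]); }
  Scalar reduce(Scalar a, Scalar b) const { return a + b; }
};

struct ToyVector {
  std::vector<double> data;  // (1) member data: entries owned by this rank

  double norm1(const SerialNode& node, const FakeComm& comm) const {
    assert(!data.empty());
    AbsSumOp<double> op{data.data()};
    // (2) shared-memory reduction over the local entries
    double lcl = node.parallel_reduce(0, static_cast<int>(data.size()), op);
    // (3) message-passing reduction across ranks
    return comm.reduceAllSum(lcl);
  }
};
```

A caller could swap any of the three pieces (the data layout, the work-data pair, or the communicator) without touching the other two, which is the extension property the slide describes.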

Tpetra Vector Methods

• Set of stand-alone non-member methods, e.g.:

–  unary_transform<UOP>(Vector &v, UOP op)
–  binary_transform<BOP>(Vector &v1, const Vector &v2, BOP op)
–  reduce<G>(const Vector &v1, const Vector &v2, G op_glob)

• The kernel level provides maximal expressiveness, but coarser levels bring convenience.

// single-prec dot() with double-prec accumulator via custom kernel
result = reduce( *x, *y, myDotProductKernel<float,double>() );

// Or a composite adaptor and standard functors
result = reduce( *x, *y, reductionGlob<ZeroOp<double>>(
                           std::multiplies<float>(),
                           std::plus<double>() ) );

// Or using inline functors via C++11 lambda functions
result = reduce( *x, *y, reductionGlob<ZeroOp<double>>(
                           [](float x, float y)   { return x*y; },
                           [](double a, double b) { return a+b; } ) );

// Or using a convenience macro to generate all of that
result = REDUCE2( x, y, x*y, ZeroOp<float>, std::plus<double>() );
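The idea of pairing a per-entry functor with a separate accumulation functor can be sketched without Tpetra. The names `ReductionGlob`, `makeGlob`, and `localReduce` below are hypothetical, the loop is serial, and the identity is passed as a value rather than as a `ZeroOp` type; the point is only that products are formed in `float` while the sum is carried in `double`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical composite: a per-entry "generate" functor, a "reduce" functor,
// and the identity value of the reduction.
template <class Acc, class Gen, class Red>
struct ReductionGlob {
  Gen gen;
  Red red;
  Acc identity;
};

// Helper so that only the accumulator type has to be spelled out.
template <class Acc, class Gen, class Red>
ReductionGlob<Acc, Gen, Red> makeGlob(Gen g, Red r, Acc id) {
  return {g, r, id};
}

// Serial stand-in for a two-vector reduction driven by a glob.
template <class T, class Acc, class Gen, class Red>
Acc localReduce(const std::vector<T>& x, const std::vector<T>& y,
                ReductionGlob<Acc, Gen, Red> glob) {
  assert(x.size() == y.size());
  Acc acc = glob.identity;
  for (std::size_t i = 0; i < x.size(); ++i) {
    // gen works in the vector's precision; red works in the accumulator's.
    acc = glob.red(acc, static_cast<Acc>(glob.gen(x[i], y[i])));
  }
  return acc;
}
```

Usage with standard functors would look like `localReduce(x, y, makeGlob<double>(std::multiplies<float>(), std::plus<double>(), 0.0))`; lambdas can be passed in the same two slots.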

Easy Parallel Algorithm Development

for (k=0; k<numIters; ++k) {
  A->apply( *p, *Ap );                                   // Ap = A*p
  S pAp = REDUCE2( p, Ap, p*Ap, ZeroOp<S>, plus<S>() );  // p'*Ap
  const S alpha = rr / pAp;                              // alpha = r'*r/p'*Ap
  BINARY_TRANSFORM( x, p, x + alpha*p );                 // x = x + alpha*p
  S rrold = rr;
  rr = BINARY_PRETRANSFORM_REDUCE( r, Ap,                // fused kernels
                                   r - alpha*Ap,         // r - alpha*Ap
                                   r*r, ZeroOp<S>, plus<S>() ); // sum r'*r
  const S beta = rr / rrold;                             // beta = r'*r/old(r'*r)
  BINARY_TRANSFORM( p, r, r + beta*p );                  // p = r + beta*p
}

•  Inline templated hybrid-parallel conjugate gradient. –  Fun game: Find the MPI or threading!
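The structure of that loop can be reproduced in a serial, self-contained sketch. Everything here is an assumption made for illustration: `binaryTransform`, `reduce2`, and `pretransformReduce` are hypothetical function stand-ins for the macros, the operator is a 1-D Laplacian stencil chosen as a small test problem, and a relative-residual stopping test is added so the loop can exit early.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

template <class S> using Vec = std::vector<S>;

// x[i] = f(x[i], y[i]) for every i.
template <class S, class F>
void binaryTransform(Vec<S>& x, const Vec<S>& y, F f) {
  assert(x.size() == y.size());
  for (std::size_t i = 0; i < x.size(); ++i) x[i] = f(x[i], y[i]);
}

// Returns the sum over i of g(x[i], y[i]).
template <class S, class G>
S reduce2(const Vec<S>& x, const Vec<S>& y, G g) {
  assert(x.size() == y.size());
  S acc = S(0);
  for (std::size_t i = 0; i < x.size(); ++i) acc += g(x[i], y[i]);
  return acc;
}

// Fused kernel: update x[i] = f(x[i], y[i]) and accumulate g(x[i]) in one pass.
template <class S, class F, class G>
S pretransformReduce(Vec<S>& x, const Vec<S>& y, F f, G g) {
  assert(x.size() == y.size());
  S acc = S(0);
  for (std::size_t i = 0; i < x.size(); ++i) {
    x[i] = f(x[i], y[i]);
    acc += g(x[i]);
  }
  return acc;
}

// y = A*x for the tridiagonal stencil [-1 2 -1] (assumed test operator).
template <class S>
void applyLaplacian(const Vec<S>& x, Vec<S>& y) {
  const std::size_t n = x.size();
  for (std::size_t i = 0; i < n; ++i) {
    S v = S(2) * x[i];
    if (i > 0) v -= x[i - 1];
    if (i + 1 < n) v -= x[i + 1];
    y[i] = v;
  }
}

// Unpreconditioned CG started from x = 0; returns the final residual 2-norm.
template <class S>
S conjugateGradient(const Vec<S>& b, Vec<S>& x, int numIters, S tol) {
  assert(b.size() == x.size());
  for (auto& xi : x) xi = S(0);
  Vec<S> r = b, p = b, Ap(b.size());
  S rr = reduce2(r, r, [](S a, S c) { return a * c; });
  const S rr0 = rr;
  for (int k = 0; k < numIters && rr > tol * tol * rr0; ++k) {
    applyLaplacian(p, Ap);                                           // Ap = A*p
    const S pAp = reduce2(p, Ap, [](S a, S c) { return a * c; });    // p'*Ap
    const S alpha = rr / pAp;
    binaryTransform(x, p, [alpha](S xi, S pi) { return xi + alpha * pi; });
    const S rrold = rr;
    rr = pretransformReduce(r, Ap,
                            [alpha](S ri, S Api) { return ri - alpha * Api; },
                            [](S ri) { return ri * ri; });           // r'*r
    const S beta = rr / rrold;
    binaryTransform(p, r, [beta](S pi, S ri) { return ri + beta * pi; });
  }
  return std::sqrt(rr);
}
```

Because the solver is templated on the scalar type `S`, the same source can be instantiated for `float` or `double` by changing only the vector element type at the call site.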

Example: Recursive Multi-Prec. CG

for (k=0; k<numIters; ++k) {
  A->apply(*p,*Ap);                                      // Ap = A*p
  T pAp = REDUCE2( p, Ap, p*Ap, ZeroOp<T>, plus<T>());   // p'*Ap
  const T alpha = zr / pAp;
  BINARY_TRANSFORM( x, p, x + alpha*p );                 // x = x + alpha*p
  BINARY_TRANSFORM( rold, r, r );                        // rold = r
  T rr = BINARY_PRETRANSFORM_REDUCE( r, Ap,              // fused:
                                     r - alpha*Ap,       //   r - alpha*Ap
                                     r*r, ZeroOp<T>, plus<T>() ); // sum r'*r
  recursiveFPCG<TS::next,LO,GO,Node>(out,db_T2);         // recurse
  auto plusTT = make_pair_op<T,T>(plus<T>());
  pair<T,T> both = REDUCE3( z, r, rold,                  // fused:
                            make_pair( z*r, z*rold ),    //   z'*r, z'*r_old
                            ZeroPTT, plusTT );
  const T beta = (both.first - both.second) / zr;
  zr = both.first;
  BINARY_TRANSFORM( p, z, z + beta*p );                  // p = z + beta*p
}

Example: Simple CG

• Problem dimension 5M •  500 iterations • Double precision arithmetic • MPI + TBB parallel node •  #threads = #mpi x #tbb

•  invocation like: mpirun -np 4 ./driver.exe --machine-file=tbb4.xml

[Figure: time (log sec) versus total number of threads (1, 2, 4, 8, 16); surviving legend label: MPI 16.]

Example: Simple CG

• Problem dimension 512K •  125 iterations • Quad-double precision • MPI + TBB parallel node •  #threads = #mpi x #tbb

• Same codebase, simply instantiated on qd_real instead of double.

[Figure: time (log sec) versus total number of threads (1, 2, 4, 8, 16); surviving legend label: MPI 16.]

Example: Recursive Multi-Prec. CG

TBBNode initializing with numThreads == 2
TBBNode initializing with numThreads == 2
Running test with Node==Kokkos::TBBNode on rank 0/2
Beginning recursiveFPCG<qd_real>
Beginning recursiveFPCG<dd_real>
|res|/|res_0|: 1.269903e-14
|res|/|res_0|: 3.196573e-24
|res|/|res_0|: 6.208795e-35
Convergence detected!
Leaving recursiveFPCG<dd_real> after 2 iterations.
|res|/|res_0|: 2.704682e-32
Beginning recursiveFPCG<dd_real>
|res|/|res_0|: 4.531185e-09
|res|/|res_0|: 6.341084e-20
|res|/|res_0|: 8.326745e-31
Convergence detected!
Leaving recursiveFPCG<dd_real> after 2 iterations.
|res|/|res_0|: 3.661388e-58
Leaving recursiveFPCG<qd_real> after 2 iterations.

Example: Recursive Multi-Prec. CG

• Problem: Oberwolfach/gyro • N=17K, nnz=1M •  qd_real / dd_real / double • MPI + TBB parallel node •  #threads = #mpi x #tbb • Solved to over 60 digits • Around 99.9% of time spent in double precision computation. • Single codebase.

[Figure: two panels, qd_real and dd_real, each plotted against thread counts 4, 8, 16; surviving legend labels: MPI 1, MPI 16.]

Problems With Generic Kernels

• Generic kernels are not always successful: –  e.g., CRS mat-vec on GPUs is sub-optimal

• Different kernels may need different data structures.

• We want vendors and researchers to be able to substitute kernels into our library.

• Solution #1 treats the kernel as a first-class object. –  It is also a template parameter, potentially informing the structure of the local data.

• Solution #2 allows a class to be “specialized” to a particular platform, non-intrusively.

Kernel-Agnostic Sparse Matrix

template <class Scalar, class Ord, class Node, class Matvec>
class CrsMatrix {
  Comm comm;
  typename Matvec::template rebind<Scalar>::type          lclMatVecOp;
  typename Matvec::template matrix<Scalar,Ord,Node>::type lclMatrix;
};

CrsMatrix::fillComplete() {
  // ... use comm to communicate non-local entries
  lclMatrix.fill( ... );
  lclMatVecOp.submitEntries( lclMatrix );
}

CrsMatrix::multiply(Vector x, Vector y) {
  // ... use comm to perform exchange on x
  Kokkos::Vector lclx = x.getLocalVector();
  Kokkos::Vector lcly = y.getLocalVector();
  lclMatVecOp.apply(lclx, lcly);
  // ... use comm to perform exchange on y
}
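A compilable sketch of the kernel-as-template-parameter idea follows. It is an illustration under assumptions, not the Tpetra implementation: `CrsStorage`, `DefaultCrsMatvec`, and `LocalCrsMatrix` are hypothetical, there is no communicator, and the matrix is purely local. The kernel class names the storage type it wants (`matrix<...>::type`) and can be re-targeted to the matrix's scalar type (`rebind<...>::type`).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical local compressed-row storage.
template <class Scalar, class Ord>
struct CrsStorage {
  std::vector<Ord> rowPtr, colInd;
  std::vector<Scalar> values;
};

// Hypothetical mat-vec kernel policy.
template <class Scalar, class Ord>
struct DefaultCrsMatvec {
  // Same kernel, different scalar type.
  template <class S2> struct rebind { typedef DefaultCrsMatvec<S2, Ord> type; };
  // The storage layout this kernel expects.
  template <class S, class O> struct matrix { typedef CrsStorage<S, O> type; };

  const CrsStorage<Scalar, Ord>* A = nullptr;

  void submitEntries(const CrsStorage<Scalar, Ord>& m) { A = &m; }

  void apply(const std::vector<Scalar>& x, std::vector<Scalar>& y) const {
    assert(A != nullptr);
    for (std::size_t row = 0; row + 1 < A->rowPtr.size(); ++row) {
      Scalar sum = Scalar(0);
      for (Ord k = A->rowPtr[row]; k < A->rowPtr[row + 1]; ++k)
        sum += A->values[k] * x[A->colInd[k]];
      y[row] = sum;
    }
  }
};

// The matrix class never looks inside the kernel or its storage.
// Note: the kernel keeps a pointer to lclMatrix, so this sketch must not be copied.
template <class Scalar, class Ord, class Matvec>
class LocalCrsMatrix {
  typename Matvec::template rebind<Scalar>::type      lclMatVecOp;
  typename Matvec::template matrix<Scalar, Ord>::type lclMatrix;
public:
  void fillComplete(std::vector<Ord> rowPtr, std::vector<Ord> colInd,
                    std::vector<Scalar> values) {
    lclMatrix.rowPtr = std::move(rowPtr);
    lclMatrix.colInd = std::move(colInd);
    lclMatrix.values = std::move(values);
    lclMatVecOp.submitEntries(lclMatrix);
  }
  void multiply(const std::vector<Scalar>& x, std::vector<Scalar>& y) const {
    lclMatVecOp.apply(x, y);
  }
};
```

A vendor kernel with a different layout (for example a blocked or padded format) would supply its own `matrix<...>::type` and `apply`, and `LocalCrsMatrix` would compile against it unchanged.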

Specializations for Fine-Tuning

• Metaprogramming-based generic node is not perfect. –  Some APIs are not amenable to this approach (e.g., OpenCL). –  We don’t want to have to expose every kernel as we do for mat-vec.

• You could hack up the library with #ifdefs. –  This is the main benefit of FOSS. –  But once you touch it, you own it. And upgrades are hard.

•  Template specializations provide a non-intrusive means for augmenting/modifying library capability.

template <>
class Tpetra::Vector<double,int,int,OpenCLNode> {
  // manual implementation for double/int under OpenCL
};

template <>
class Tpetra::Vector<float,int,int,OpenCLNode> {
  // manual implementation for float/int under OpenCL
};
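The mechanism is ordinary C++ full specialization, sketched here with hypothetical names (`ToyVector`, `SerialNode`, `HandTunedNode`); the "hand-tuned" variant is just a two-way unrolled loop standing in for a platform-specific implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct SerialNode {};     // hypothetical node tags
struct HandTunedNode {};

// Primary template: the library's generic implementation.
template <class Scalar, class Node>
class ToyVector {
  std::vector<Scalar> data_;
public:
  explicit ToyVector(std::vector<Scalar> d) : data_(std::move(d)) {}
  Scalar dot(const ToyVector& o) const {
    assert(data_.size() == o.data_.size());
    Scalar s = Scalar(0);
    for (std::size_t i = 0; i < data_.size(); ++i) s += data_[i] * o.data_[i];
    return s;
  }
  std::string implementation() const { return "generic"; }
};

// Full specialization, written outside the library sources. Only the
// <double, HandTunedNode> combination picks it up.
template <>
class ToyVector<double, HandTunedNode> {
  std::vector<double> data_;
public:
  explicit ToyVector(std::vector<double> d) : data_(std::move(d)) {}
  double dot(const ToyVector& o) const {
    assert(data_.size() == o.data_.size());
    double s0 = 0.0, s1 = 0.0;
    std::size_t i = 0;
    const std::size_t n = data_.size();
    for (; i + 1 < n; i += 2) {  // two independent partial sums
      s0 += data_[i] * o.data_[i];
      s1 += data_[i + 1] * o.data_[i + 1];
    }
    if (i < n) s0 += data_[i] * o.data_[i];
    return s0 + s1;
  }
  std::string implementation() const { return "hand-tuned double"; }
};
```

User code that names `ToyVector<double, HandTunedNode>` gets the specialization; every other instantiation falls back to the primary template, so the library sources stay untouched.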

Conclusion

• C++ templates and metaprogramming are being used in Trilinos to define a programming model that: –  provides support for research into efficient solvers –  allows user-authored serial code to be executed in hybrid parallel on heterogeneous platforms –  provides non-intrusive modification/extension of the library by users, researchers and vendors.

•  The goal is to optimize programmer efficiency without significant performance sacrifices.

•  This is largely an experimental capability, deployed in only parts of the library.

Appendix

Tpetra Operator Methods

•  The Tpetra Reduction/Transformation Interface provides convenience methods/macros for applying user Kokkos kernels to Tpetra Vectors/MultiVectors.

RCP< Tpetra::Map<LO,GO,Node> > domMap, rngMap, rowMap, colMap;
RCP< Tpetra::Import<LO,GO,Node> > importer = ...;
RCP< Tpetra::Export<LO,GO,Node> > exporter = ...;
MyKernel<T,LO> kern(...);
RCP< Tpetra::Operator<T,LO,GO,Node> > op;
op = Tpetra::RTI::kernelOp<T>(kern,domMap,rngMap,importer,exporter);
op->apply(x, y);

• Also wrappers for applying general functors. –  e.g.: simple diagonal operator using a C++11 lambda function

RCP< Tpetra::Map<LO,GO,Node> > map;
RCP< Tpetra::Operator<T,LO,GO,Node> > op;
op = Tpetra::RTI::binaryOp<T>( [](T, T x) {return 2.0 * x;} , map );
op->apply(x, y);

Tool: Tpetra HybridPlatform

• Encapsulate main in a templated class method:

template <class Node>
class myMainRoutine {
public:
  static void run(ParameterList &runParams,
                  const RCP<const Comm<int> > &comm,
                  const RCP<Node> &node)
  {
    // do something interesting
  }
};

•  HybridPlatform maps the communicator rank to the Node type, instantiates a node and the user routine:

int main(...) {
  Comm<int> comm = ...
  ParameterList machine_file = ...
  // instantiate appropriate node and myMainRoutine
  Tpetra::HybridPlatform platform( comm , machine_file );
  platform.runUserCode< myMainRoutine >();
  return 0;
}
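The dispatch step (run-time rank to compile-time node type) can be sketched with a template template parameter. The names `ToyPlatform`, `ReportNode`, `SerialNodeTag`, and `ThreadedNodeTag` are hypothetical, and a `std::map` stands in for the machine file; ranks missing from the table are assumed to fall back to the serial node.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical node types; a real node would own threads or a device.
struct SerialNodeTag   { static std::string name() { return "SerialNode"; } };
struct ThreadedNodeTag { static std::string name() { return "ThreadedNode"; } };

// User code, written once and templated on the node type.
template <class Node>
struct ReportNode {
  static std::string run(int rank) {
    return "rank " + std::to_string(rank) + " -> " + Node::name();
  }
};

class ToyPlatform {
  int rank_;
  std::map<int, std::string> rankToNode_;  // stand-in for the machine file
public:
  ToyPlatform(int rank, std::map<int, std::string> table)
      : rank_(rank), rankToNode_(std::move(table)) {
    assert(rank_ >= 0);
  }

  // Chooses the node type for this rank, then instantiates the user routine
  // on that type. Each supported node needs one branch here.
  template <template <class> class UserCode>
  std::string runUserCode() const {
    auto it = rankToNode_.find(rank_);
    const std::string kind = (it == rankToNode_.end()) ? "SerialNode" : it->second;
    if (kind == "SerialNode")   return UserCode<SerialNodeTag>::run(rank_);
    if (kind == "ThreadedNode") return UserCode<ThreadedNodeTag>::run(rank_);
    throw std::runtime_error("unknown node type: " + kind);
  }
};
```

The user routine is compiled once per node type, and the rank decides at run time which instantiation is executed.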

HybridPlatform Machine File

hostname0: rank 0 (ThrustGPUNode), rank 1 (TPINode)
hostname1: rank 2 (ThrustGPUNode), rank 3 (TPINode)
...

Assignment type          Machine-file key
round-robin assignment   %M=N
interval assignment      [M,N]
explicit assignment      =N
default                  default
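One plausible reading of the four key forms is sketched below. The semantics are assumptions, since the slide only lists the keys: `%M=N` is taken to match ranks with `rank % M == N`, `[M,N]` to match `M <= rank <= N`, `=N` to match `rank == N`, and `default` to match everything. The `Rule` struct, `resolveNode`, and the first-match-wins ordering are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical in-memory form of one machine-file entry.
struct Rule {
  enum Kind { Explicit, Interval, RoundRobin, Default } kind;
  int m;                 // M in "%M=N" and "[M,N]"; unused otherwise
  int n;                 // N in "%M=N", "[M,N]" and "=N"; unused for Default
  std::string nodeType;  // e.g. "ThrustGPUNode"

  bool matches(int rank) const {
    switch (kind) {
      case Explicit:   return rank == n;
      case Interval:   return m <= rank && rank <= n;
      case RoundRobin: return m > 0 && rank % m == n;
      case Default:    return true;
    }
    return false;
  }
};

// Returns the node type of the first matching rule, or "" if none matches.
std::string resolveNode(const std::vector<Rule>& rules, int rank) {
  assert(rank >= 0);
  for (const Rule& r : rules)
    if (r.matches(rank)) return r.nodeType;
  return "";
}
```

With a round-robin rule `%2=0` for the GPU node and a default rule for the threaded node, even ranks get the GPU node and odd ranks the threaded node, which matches the two-ranks-per-host picture on the slide.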

Refresher: Kokkos Parallel Constructs

• Parallel for: execute loop iterations in parallel

• User-defined struct (work-data pair) contains: –  the necessary data and execute(int iter)

• Parallel reduce: reduce implicit set of elements in parallel via user-specified associative binary operation –  typedef ReductionType –  ReductionType identity() –  ReductionType generate(int i) –  ReductionType reduce(ReductionType a, ReductionType b)

•  Template meta-programming fuses generic loop skeleton with user data and kernel specifications.

Node::parallel_for   <WDP>(int beg, int end, WDP args);
Node::parallel_reduce<WDP>(int beg, int end, WDP args);
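A minimal node that satisfies both signatures can be written serially. This `SerialNode` is a hypothetical sketch of the interface contract, not one of the Kokkos nodes; `FillOp` and `SumOp` are made-up work-data pairs used only to exercise it.

```cpp
#include <cassert>
#include <vector>

// Hypothetical serial node implementing the two skeletons.
struct SerialNode {
  template <class WDP>
  static void parallel_for(int beg, int end, WDP wd) {
    assert(beg <= end);
    for (int i = beg; i < end; ++i) wd.execute(i);
  }

  template <class WDP>
  static typename WDP::ReductionType parallel_reduce(int beg, int end, WDP wd) {
    assert(beg <= end);
    typename WDP::ReductionType result = wd.identity();
    for (int i = beg; i < end; ++i) result = wd.reduce(result, wd.generate(i));
    return result;
  }
};

// Work-data pair for parallel_for: data (out, value) plus execute(int).
struct FillOp {
  int* out;
  int value;
  void execute(int i) { out[i] = value; }
};

// Work-data pair for parallel_reduce: data (in) plus the four required members.
struct SumOp {
  const int* in;
  typedef int ReductionType;
  int identity() { return 0; }
  int generate(int i) { return in[i]; }
  int reduce(int a, int b) { return a + b; }
};
```

Because the skeleton is a template and the work-data pair is passed by value, the compiler can inline `execute`, `generate`, and `reduce` into the loop body, which is the fusion the last bullet refers to.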

Kokkos parallel_for example

• Consider simple vector axpy: y = α ∗ x + y

template <class Scalar>
struct AxpyOp {
  Scalar alpha;
  const Scalar *x;
  Scalar *y;
  inline void execute(int i) { y[i] += alpha * x[i]; }
};

AxpyOp<double> daxpy( ... );
Node::parallel_for(0,N,daxpy);

AxpyOp<complex<float> > caxpy( ... );
Node::parallel_for(0,N,caxpy);
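The axpy kernel can be exercised end to end with a stand-in node. `SerialNode` here is a hypothetical serial replacement for a Kokkos node; `AxpyOp` follows the slide, and the two instantiations show the same kernel source running on `double` and on `std::complex<float>`.

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Hypothetical serial node: calls execute(i) for each index in [beg, end).
struct SerialNode {
  template <class WDP>
  static void parallel_for(int beg, int end, WDP wd) {
    assert(beg <= end);
    for (int i = beg; i < end; ++i) wd.execute(i);
  }
};

// y[i] += alpha * x[i], for any Scalar that supports * and +=.
template <class Scalar>
struct AxpyOp {
  Scalar alpha;
  const Scalar* x;
  Scalar* y;
  inline void execute(int i) { y[i] += alpha * x[i]; }
};
```

A call such as `SerialNode::parallel_for(0, N, AxpyOp<double>{2.0, x.data(), y.data()})` updates `y` in place; switching the scalar type changes only the template argument.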

Kokkos parallel_reduce example

• Consider real-valued vector inner product: α = x^T y

template <class Scalar>
struct DotOp {
  const Scalar *x, *y;
  typedef Scalar ReductionType;
  Scalar identity()                 { return 0; }
  Scalar generate(int i)            { return x[i]*y[i]; }
  Scalar reduce(Scalar a, Scalar b) { return a+b; }
};

DotOp<float> fdot( ... );
float f = Node::parallel_reduce(0,N,fdot);

DotOp<qd_real> qddot( ... );
qd_real q = Node::parallel_reduce(0,N,qddot);
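Why the reduction needs an identity and an associative `reduce` becomes visible once the range is split. The `ChunkedNode` below is a hypothetical node that mimics a parallel schedule serially: each chunk starts from `identity()`, reduces its own indices, and the partial results are then combined with `reduce`. `DotOp` follows the slide.

```cpp
#include <cassert>
#include <vector>

template <class Scalar>
struct DotOp {
  const Scalar *x, *y;
  typedef Scalar ReductionType;
  Scalar identity()                 { return 0; }
  Scalar generate(int i)            { return x[i] * y[i]; }
  Scalar reduce(Scalar a, Scalar b) { return a + b; }
};

// Hypothetical node: splits [beg, end) into numChunks pieces, as a threaded
// node would, but processes the pieces one after another.
struct ChunkedNode {
  int numChunks;

  template <class WDP>
  typename WDP::ReductionType parallel_reduce(int beg, int end, WDP wd) const {
    assert(numChunks > 0 && beg <= end);
    typedef typename WDP::ReductionType R;
    const int len = end - beg;
    R total = wd.identity();
    for (int c = 0; c < numChunks; ++c) {
      const int lo = beg + len * c / numChunks;
      const int hi = beg + len * (c + 1) / numChunks;
      R partial = wd.identity();  // an empty chunk contributes the identity
      for (int i = lo; i < hi; ++i) partial = wd.reduce(partial, wd.generate(i));
      total = wd.reduce(total, partial);
    }
    return total;
  }
};
```

For exactly representable inputs the result does not depend on the chunk count; with general floating-point data the grouping changes the rounding, so results from different nodes can differ in the last bits.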

Some Ugly Details

• Host compiler: implicit instantiation handles coupling –  important to use inline/static whenever possible

• Device compiler (nvcc): need explicit instantiation

1.  put explicit instantiations in .cu file:

#include "Kokkos_ThrustGPUNode.cuh"   // Node routines, in CUDA
#include "TestOps.hpp"                // Kernels, in C++
template void Kokkos::ThrustGPUNode::parallel_for<InitOp<int> >(int, int, InitOp<int>);

2.  compile via nvcc:

prompt> nvcc -c -o libkernels_cuda.a exp_inst_cuda_kernels.cu

–  nvcc supports templates and template meta-programming :)
–  OpenCL does not (yet?) :(
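The explicit-instantiation pattern is plain C++ and can be tried without a GPU. In this sketch `HostNode` is a hypothetical host-only node and this `InitOp` is a made-up kernel (it writes the index into each entry; what the slide's `InitOp` does is not shown). The last line has the same form as the instantiation placed in the .cu file.

```cpp
#include <cassert>
#include <vector>

// Hypothetical node with a member template declared in the class...
struct HostNode {
  template <class WDP>
  static void parallel_for(int beg, int end, WDP wd);
};

// ...and defined separately, as it would be in a device-side source file.
template <class WDP>
void HostNode::parallel_for(int beg, int end, WDP wd) {
  assert(beg <= end);
  for (int i = beg; i < end; ++i) wd.execute(i);
}

// Made-up kernel for illustration: x[i] = i.
template <class T>
struct InitOp {
  T* x;
  void execute(int i) { x[i] = T(i); }
};

// Explicit instantiation definition: forces the compiler that sees the
// template body to emit code for this one node/kernel combination.
template void HostNode::parallel_for<InitOp<int> >(int, int, InitOp<int>);
```

In a split build, the translation units compiled by the host compiler would see only the declaration (optionally with an `extern template` declaration) and link against the object produced from this file.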
