GSL: Stencil computations made cool
Mauro Bianco, Ugo Varetto
ETH-CSCS
Background
• GOAL: increase the productivity of scientific programmers
• Communities use subsets of algorithms
• The usual DSL dilemma
– Traditional languages are too low level for portability
– Ways of expressing the required algorithms exist
– Developers are conservative
– Compilers are complex
• Work with communities to
– Shift toward high-level specification
– Improve portability and reusability of code
– Help service providers to identify resources
Motivation
• Many computations are represented by iterating over a data set to apply functions to its elements
• Developers do not have a uniform abstraction to express these computations
– Re-inventing the wheel every time: developing, debugging, maintaining codes
– Several performance issues are shared among all these versions
– High rewrite effort for portability to new machines
Stencil Computation (For Regular Grids)
• Given a regular D-dimensional grid
• Compute a function in all pertinent elements, which depends on elements at fixed offsets
– Fixed w.r.t. grid size
• Pertinent elements are those for which the offsets are well defined
• The iteration order is a parameter of the computation
Example
for i = 1 ... N−2   /* 0−based indices */
  for j = 1 ... M−2
    for k = 1 ... L−2
      /* Stencil operator */
      G1(i,j,k) = 1/6 * (G0(i−1,j,k) + G0(i+1,j,k)
                       + G0(i,j−1,k) + G0(i,j+1,k)
                       + G0(i,j,k−1) + G0(i,j,k+1))
    end for
  end for
end for
[Figures: "Applying function" and "Induced Core Space"]
GSL Concepts
• Generic Programming Approach
– Decoupling of algorithms, data, operators
• Our classification:
– Storage: area of memory with data
– Grid: representation of storage as a D-dimensional grid
– Stencil operator: function to apply to grid elements
– Stencil/Shape: area accessed by the operator (halo)
– Iteration Space: partial order required by the algorithm
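The decoupling above can be sketched in a few lines. This is an illustrative example, not GSL code: the names `Grid2D`, `apply_all`, and `SetTo` are hypothetical, chosen only to show how algorithm, data, and operator stay independent.

```cpp
#include <cassert>
#include <vector>

// Data: a flat buffer viewed as a 2D grid.
struct Grid2D {
    std::vector<double> data;
    int n, m;                               // rows, columns
    Grid2D(int n_, int m_) : data(n_ * m_, 0.0), n(n_), m(m_) {}
    double& at(int i, int j)       { return data[i * m + j]; }
    double  at(int i, int j) const { return data[i * m + j]; }
};

// Algorithm: visits every element; it knows nothing about the operator.
template <typename Op>
void apply_all(Grid2D& g, Op op) {
    for (int i = 0; i < g.n; ++i)
        for (int j = 0; j < g.m; ++j)
            op(g.at(i, j));
}

// Operator: knows nothing about loops or storage layout.
struct SetTo { double v; void operator()(double& x) const { x = v; } };
```

Any of the three pieces can be swapped without touching the other two, which is the point of the generic-programming approach.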
The Computation is Parallel
• Loops are not in the jargon of GSL
• The stencil operator is applied according to a partial order
– Partial orders are DAGs = not easy in general
• But our DAGs typically have structure (guarantees)
– do_all: no order specified
– do_i, do_j, do_k: increment i, j, or k
– do_diamond: when at (i,j,k), (i-1,j-1,k-1) is done
– Etc.
• We call them iteration spaces
– Each has an associated semantics
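A minimal sketch of what one such partial-order guarantee means, under assumed names (this is not GSL's implementation): a `do_i`-style space promises that (i,j) is visited only after (i-1,j), so an operator that reads its i-predecessor is well defined. A plain i-outer loop is just one valid schedule; the j loop could run in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One valid schedule for a do_i-style iteration space:
// i strictly increasing, j in any order.
template <typename Op>
void do_i_inc(std::vector<std::vector<double>>& u, Op op) {
    for (std::size_t i = 1; i < u.size(); ++i)
        for (std::size_t j = 0; j < u[i].size(); ++j)
            op(u, i, j);
}

// A stencil that reads (i-1,j): a running sum along i.
// It is only correct under the do_i partial order.
struct PrefixSumI {
    void operator()(std::vector<std::vector<double>>& u,
                    std::size_t i, std::size_t j) const {
        u[i][j] += u[i - 1][j];
    }
};
```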
A stencil computation
• Grid = Storage + Stencil
• Grid is a wrapper
• Stencil operator written in terms of the stencil interface
• Stencil interface provided by the stencil/shape
• Preamble performs checks
• Stencil operator may exist in multiple copies: no state
• Post-amble enforces storage semantics

[Diagram: Storage and Stencil/Shape form the Grid, to which the Stencil Operator is applied]

Iteration Space:
1. Preamble
2. Embed elements in access interface
3. Pass them to stencil operators
4. [Compute reduction]
5. Post-amble
Example
typedef default_DBStorage<arch_type, double>::type storage_type;
storage_type storage(n, m, 1); // max halo is needed

typedef Grid2D<stencil_1x1_stateful, DBStorage<double> > grid_i_t;
grid_i_t grid_init(storage, n, m);

do_all( ctx, grid_init, init_function() );

typedef Grid2D<stencil_3x3, DBStorage<double> > grid_t;
grid_t grid(storage, n, m);

do {
  do_all( ctx, grid, stencil_operator_avg() );
} while ( !do_reduce( ctx, grid, iden(), check() ) );
struct init_f {
  template <typename St>
  void operator()(St &u) const {
    int i, j;
    get_index(i, j);
    if ( (i == 0) || (j == 0) )
      u() = 1.0;
    else
      u() = 0.0;
  }
};

struct stencil_operator_3x3_avg {
  template <typename St>
  void operator()(St &u) const {
    u() = 1.0/16.0 * (4*u() - (u(-1,0) + u(0,-1) + u(0,1) + u(1,0)));
  }
};

struct iden {
  typedef bool result_type;
  template <typename St>
  bool operator()(St const& u) const {
    return (u() != 0.0);
  }
};

struct check {
  template <typename St>
  void operator()(St &u) const {
    u() = 1.0/16.0 * (4*u() - (u(-1,0) + u(0,-1) + u(0,1) + u(1,0)));
  }
};
Specifying the Architecture
• Defines a hierarchy of programming models

typedef make_arch<3, mpi, openmp, sequential>::type arch_type;

– Application is 3D
– Top level is MPI
– Each MPI task uses OpenMP
– Each OpenMP thread runs on a regular sequential processor
• Defaults for best system utilization

typedef default_DBStorage<arch_type, double>::type storage_type;

• Specific architectures can guarantee certain semantics
– Contexts can be relaxed, data access may be safe, etc.
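One way such a hierarchy can be encoded purely in types is a cons-cell typelist, where each architecture level wraps the next. The sketch below mimics the `make_arch` name and level tags from the slide, but the encoding is an assumption, not GSL's actual definition.

```cpp
#include <cassert>
#include <type_traits>

// Tags for the programming-model levels (names from the slides).
struct mpi {}; struct openmp {}; struct sequential {};

// A cons-cell typelist: each layer wraps the next one.
template <typename Level, typename Inner> struct layer {
    typedef Level level_type;
    typedef Inner inner_type;
};
struct leaf {};

// Hypothetical make_arch: dimensionality plus three nested levels.
template <int Dims, typename L0, typename L1, typename L2>
struct make_arch {
    static const int dimensions = Dims;
    typedef layer<L0, layer<L1, layer<L2, leaf> > > type;
};

typedef make_arch<3, mpi, openmp, sequential>::type arch_type;

// The hierarchy is queryable at compile time.
static_assert(std::is_same<arch_type::level_type, mpi>::value,
              "top level is MPI");
```

Because the hierarchy is a type, storage defaults and iteration-space implementations can be specialized on it with ordinary template machinery.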
Context semantics
• Contexts depend on the architecture
• Iterations can be executed only between Begin and End
• Accessing data between Begin and End can be undefined (to deal with GPUs)

typedef make_arch<3, mpi, cuda>::type arch_type;
...
context<arch_type> ctx = GSL_Begin<arch_type>();

do_all( ctx, grid, stencil_operator() ); // OK
cout << grid(3,3,3) << "\n";             // UNDEFINED

GSL_End(ctx);

do_all( ctx, grid, stencil_operator() ); // ERROR
cout << grid(3,3,3) << "\n";             // OK
Portability
• To make programs portable:
• The architecture can be obtained
• Storage is specialized based on the architecture
– default_storage<arch_type, double>::type
– Same structure as the architecture
• Contexts are used to pass information about architecture layers down to iteration spaces
• The semantics of contexts is always enforced
Storage
• Abstracts a 1D contiguous address space
• Currently two types of storage are supported:
– Single buffer: behaves just like a regular buffer
– Double buffer: written results are visible after commit
• Commit and sync ensure correctness
– Executed at the end of an iteration, based on information available from stencil operator traits
– Commit makes writes visible to subsequent reads
– Sync makes the buffer ready to use after initialization
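A minimal sketch of the double-buffer semantics described above (the class name and interface are illustrative, not GSL's actual storage API): writes go to a back buffer and only become visible to reads after `commit()`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Double-buffered storage: reads see the front buffer,
// writes land in the back buffer until commit().
class DBStorage {
    std::vector<double> front_, back_;
public:
    explicit DBStorage(std::size_t n) : front_(n, 0.0), back_(n, 0.0) {}
    double read(std::size_t i) const { return front_[i]; }
    void write(std::size_t i, double v) { back_[i] = v; }
    void commit() { front_.swap(back_); }  // make pending writes visible
};
```

This is what makes a Jacobi-style stencil safe: every read within one iteration sees the values from before that iteration, regardless of visit order.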
Stencils/Shapes
• Template class that specifies the extension of the halo
– E.g., a 3D shape specifies
• iminus: number of cells by which the halo extends over elements preceding the core (indices less than core)
• iplus: number of cells by which the halo extends over elements following the core (indices greater than core)
• jminus: ...
• Constructor that takes a pointer to a grid and coordinates
– Sets the core pointer
• Methods to access elements around the core in the halo region (for reading)
– value_type const & operator()(int, int, int) const;
• Methods to access the core element for read/write
• Methods for modifying the core pointer (moving the shape)
Stateful Stencils
• Provide methods to obtain
– Index of the core element in the Grid
– Global index of the core in case the Grid is a subgrid of a bigger grid
• Additional flexibility at the cost of
– Memory usage
– Performance (actual tests tend to confirm the impact is visible only if the index methods are used)
Grids = Storage + Shape
• GSL::Grid3D<stencil_3x3, SBStorage<int>, ijk> Grid(storage, m, n, l);
• The shape/stencil is the first argument
• The second argument is the storage type
• Then comes the layout argument
– Specifies how data would be traversed by a minimal-stride loop: ijk means 'for i, for j, for k'
• Several utilities to deal with sub-grids
Subgrids and Regions
• Given a region
– Obtain a subgrid
• grid.subgrid(region)
• Result type is the same as grid
– Obtain a re-shaped subgrid
• grid.reshape<newstencil>(region)
• Result type is the grid type with a different shape
• Grid alignment is the user's responsibility
– To avoid making too restrictive assumptions
– Error messages are raised if something is wrong
• Strict mode and relaxed mode available
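A subgrid can be understood as a view that shares the parent's storage and only re-maps indices through a region offset. The sketch below is an assumed illustration (the `GridView`/`Region` names are hypothetical); it mirrors the stated property that `grid.subgrid(region)` yields the same type as `grid`.

```cpp
#include <cassert>
#include <vector>

// A rectangular window into a parent grid.
struct Region { int i0, j0, ni, nj; };

class GridView {
    std::vector<double>* data_;
    int m_;          // parent row length
    Region r_;       // window into the parent
public:
    GridView(std::vector<double>* d, int m, Region r)
        : data_(d), m_(m), r_(r) {}
    // Indices are local to the subgrid; the region supplies the offset.
    double& at(int i, int j) {
        return (*data_)[(r_.i0 + i) * m_ + (r_.j0 + j)];
    }
    // Same type as the parent: only the region composes.
    GridView subgrid(Region r) const {
        return GridView(data_, m_,
                        Region{r_.i0 + r.i0, r_.j0 + r.j0, r.ni, r.nj});
    }
};
```

No data is copied, which is why alignment between grids sharing storage is left to the user, as noted above.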
Stencil Operators
• Function objects with additional traits to specify useful characteristics

struct stencil_operator_eq_copy {
  typedef bool result_type;
  typedef boost::fusion::vector<flag_read, flag_write> access_list_type;

  template <typename St1, typename St2>
  bool operator()(St1 const &u, St2 &v) const {
    bool res = (u() == v()); // Equal?
    v() = u();               // Copy
    return res;
  }
};

The user can specify the access pattern to ease debugging and improve performance. Not mandatory, but highly recommended!
Access List Type
• One flag per argument of the stencil operator
• flag_read if the stencil is only read
• flag_write if the stencil is read and written
• flag_init if the stencil is only written
• At the end of an iteration, commit and/or sync operations are invoked on the storage classes
– The semantics of these operations are exposed to users
• Note: at the moment it is a boost.fusion vector, but a simpler syntax can be devised
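A sketch of how per-argument flags can drive what happens at the end of an iteration. The flag names follow the slides, but the dispatch mechanism shown here (tag-based overloads) is an assumption for illustration, not GSL's internal machinery.

```cpp
#include <cassert>
#include <vector>

// Access flags, one per stencil-operator argument (names from the slides).
struct flag_read {}; struct flag_write {}; struct flag_init {};

struct Storage {
    std::vector<double> data;
    int commits = 0;
    void commit() { ++commits; }
};

// End-of-iteration action selected by the flag type:
// read-only arguments need no commit; written ones do.
inline void finalize(Storage&,   flag_read)  {}
inline void finalize(Storage& s, flag_write) { s.commit(); }
inline void finalize(Storage& s, flag_init)  { s.commit(); }
```

Because the flags are types, the selection costs nothing at run time, which is why declaring the access pattern can improve performance as well as debuggability.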
Iteration Spaces
• Iteration spaces specify requirements!
• Available in GSL (in decreasing parallelism)
– do_all: visit all the (core) elements somehow
– do_reduce: visit elements and compute a reduction on values returned by a (commutative) operator
– do_i_inc: ensure (i,j,k) is processed only after (i-1,j,k) has been (j- and k- versions available)
– do_i_dec, do_j_dec, do_k_dec: analogous
– do_diamond: (i-1,j-1,k-1) is processed before (i,j,k)
Iteration Spaces Implementation
• General first – specialization later
– Basic implementation with not much care for performance
– Specializations can be provided for specific cases and/or applications*

* Code can look ugly (for several reasons: reducing redundancy, improving performance, ...), but this is internal code, not seen at the top level; it is provided by the library developers, or by an advanced user after the basic code is up.
template <typename Grid, typename Operator>
void ARCH::do_all(Grid const &g, Operator const &op) {
  for (int i = 0; i < g.nx(); ++i)
    for (int j = 0; j < g.mx(); ++j)
      for (int k = 0; k < g.lx(); ++k) {
        typename Grid::stencil_type s(&g, i, j, k);
        op(s);
      }
}
#define MACRO_IMPL(z, n, _) \
template <BOOST_PP_ENUM_PARAMS_Z(z, BOOST_PP_INC(n), typename _Grid), typename _stencil_operator> \
struct sequential::_DO_(all, n)<BOOST_PP_ENUM_PARAMS_Z(z, BOOST_PP_INC(n), _Grid), _stencil_operator, \
  typename boost::enable_if< \
    GSL::same_major_with_base<typename GSL::_3D_major TYPE_CHK(BOOST_PP_INC(n))> \
  >::type > \
: GSL::nary_loop<void, BOOST_PP_INC(n)> \
{ \
  TYPE_INST(BOOST_PP_INC(n)) \
  typedef typename boost::remove_reference<_stencil_operator>::type stencil_op; \
  typedef typename Grid0::major_type major_type; \
  \
  void operator()( BOOST_PP_ENUM_BINARY_PARAMS_Z(z, BOOST_PP_INC(n), Grid, const &grid), stencil_op const &sten_op) \
  { \
    assert(_impl::check_grids3D(BOOST_PP_ENUM_PARAMS_Z(z, BOOST_PP_INC(n), grid))); \
    boost::tuple<int, int, int> bounds(grid0.nx(), grid0.mx(), grid0.lx()); \
    int i, j, k; \
    boost::tuple<int&, int&, int&> indices(i, j, k); \
    int & i1 = boost::get<major_type::_3D_outer_dimension>(indices); \
    int & i2 = boost::get<major_type::_3D_middle_dimension>(indices); \
    int & i3 = boost::get<major_type::_3D_inner_dimension>(indices); \
    const int N1 = boost::get<major_type::_3D_outer_dimension>(bounds); \
    const int N2 = boost::get<major_type::_3D_middle_dimension>(bounds); \
    const int N3 = boost::get<major_type::_3D_inner_dimension>(bounds); \
    i3 = 0; \
    int NN = N3; \
    for (i1 = 0; i1 < N1; ++i1) { \
      for (i2 = 0; i2 < N2; ++i2) { \
        STEN_INST(BOOST_PP_INC(n)) \
        for (int ii = 0; ii < NN; ++ii) { \
          sten_op(BOOST_PP_ENUM_PARAMS_Z(z, BOOST_PP_INC(n), stencil)); \
          STEN_INC(BOOST_PP_INC(n)) \
        } \
      } \
    } \
  } \
};
BOOST_PP_REPEAT(GSL_MAX_GRIDS, MACRO_IMPL, nil)
Conclusions & Future Work
• Generic programming for portable stencil computations
• GSL is a C++ generic library
– No specific compiler => some boilerplate code
• Using DSEL technology to perform optimizations
Initial Results
Tips and Tricks
• GSL tries to reduce the amount of platform-specific considerations
• Performance requires a minimum number of iteration spaces, but this has repercussions on code readability (fusion of operators)
• GSL provides facilities for fusing loops, but
• Loop fusion can be automated by using SEL
SEL: Stencil Embedded Language
• A prototype of a DSEL for combining stencil computations
do_all(grd3x3x3, avg) + do_reduce(grd3x3x3, grd_1x1x1_bf, equ, l_and)

• Meaning: "perform a do_all averaging on a 3x3x3 grid followed by a reduction to check correctness"
• Since the same grid appears in both loops, the DSEL fuses the computation into

do_reduce(grd3x3x3, grd_1x1x1_bf, fuse(avg, equ), l_and)
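Semantically, the fusion merges two traversals of the same grid into one pass that applies the operator and folds the predicate together. The sketch below illustrates the equivalence with stand-in operators (`avg_update` and `near_zero` are hypothetical substitutes for the slides' avg and equ); it is not the SEL expression-template machinery itself.

```cpp
#include <cassert>
#include <vector>

inline double avg_update(double x) { return x * 0.5; }   // stand-in for avg
inline bool   near_zero(double x)  { return x < 1.0; }   // stand-in for equ

// Unfused: two traversals of the data (do_all, then do_reduce).
inline bool two_passes(std::vector<double>& v) {
    for (double& x : v) x = avg_update(x);
    bool all = true;
    for (double x : v) all = all && near_zero(x);  // logical_and reduction
    return all;
}

// Fused: one traversal, same result, half the memory traffic.
inline bool fused_pass(std::vector<double>& v) {
    bool all = true;
    for (double& x : v) { x = avg_update(x); all = all && near_zero(x); }
    return all;
}
```

For memory-bound stencils the saved traversal is exactly where the "no penalty" fusion speedup reported later comes from.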
From Imperative to Declarative
• When using SEL the user adopts a declarative approach
– Specification + Information
– All arguments are placeholders
• Needed to postpone execution – lazy evaluation
• Allows symbolic analysis of programs
• Downside: expressions tend to grow in size

eval(do_all(grd3x3x3, avg) + do_reduce(grd3x3x3, grd_1x1x1_bf, equ, l_and), context)
Lunch Bill
• Need to associate placeholders to real data

// Define grids as before
Grid3D< stencil_3x3x3 > grid3x3x3(storage, n, m, l);
Grid3D< stencil_1x1x1 > grid1x1x1_bef(storage_bef, n, m, l);

// Map grids and operators (types and values) to indices (in this case, of vectors)
typedef fvector<Grid3D<stencil_3x3x3>, Grid3D<stencil_1x1x1> > GVEC;
typedef fvector<operator_avg, operator_eq, std::logical_and<bool> > OVEC;
GVEC Gvec(grid3x3x3, grid1x1x1_bef);
OVEC Ovec(operator_avg(), operator_eq(1.0e-4), std::logical_and<bool>());

// Give the execution engine information about these maps (and about the implementation to use)
SEL_context<cuda, GVEC, OVEC> context(Gvec, Ovec);

// Associate mnemonic (placeholder) symbols to the indices, to be used in expressions
MAKE_GRID(0, grd3x3x3);
MAKE_GRID(1, grd_1x1x1_bf);
MAKE_OPER(0, avg);
MAKE_OPER(1, equ);
MAKE_OPER(2, l_and);

template <int I> struct _grid {};
template <int I> struct _oper {};

There is boilerplate code that can be reduced (e.g., by using macros), but with some potential drawbacks.
GSL vs SEL
[Chart comparing GSL BASE, SEL BASE, and SEL FUSION]
• Aggressive inlining does not guarantee the best performance
• Default inlining can get worse from beginning to end
• SEL allows loop fusion with no penalty
DSEL Considerations
• SEL is for loop constructs
– Useful to analyze macrostructures
– No drastic change w.r.t. GSL syntax
• More semantics can be available (e.g., loop fusion)
• DSEL for stencil operators
– More than merely syntax embellishment
– More semantics is the golden rule
• Performs transformations the user may not be aware of
• Enables auto-tuning at the expression level
Sensitivity to expression writing
• Given an expression like

u(0,0,0) = 1./7.0 * (u(0,0,0) + u(1,0,0) + u(0,1,0) + u(0,0,1)
                   + u(-1,0,0) + u(0,-1,0) + u(0,0,-1))

• It can be written in (at least) 5040 different ways
– Does the way it is written matter?
– If yes, how much?
• In this case, a factor of 2!
[Chart: run time (ms) of each implementation, sorted by time, on a 400x400x400 3D grid of doubles]

In this case it was easy (after analyzing the permutations): each big step corresponds to moving u(0,0,-1) to the right!
A little more formally
• Given the set A = {+, -, ~} we can define a descriptor (a,b,c) where a, b, c belong to A
• A storage has a Fastest Iteration Order (FIO)
– The iteration order that guarantees the fastest scan of elements
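The FIO idea can be made concrete with linear-index arithmetic: for a given layout, exactly one index has stride 1, and scanning that index innermost gives the fastest traversal. The layout structs below are simplified stand-ins (not GSL's ijk layout arguments) chosen to show how the stride-1 dimension follows from the layout.

```cpp
#include <cassert>

// 'for i, for j, for k' layout: k is the stride-1 (innermost) index.
struct LayoutIJK {
    int n, m, l;
    int offset(int i, int j, int k) const { return (i * m + j) * l + k; }
};

// Reversed layout: i is the stride-1 index instead.
struct LayoutKJI {
    int n, m, l;
    int offset(int i, int j, int k) const { return (k * m + j) * n + i; }
};
```

Iterating the stride-1 index innermost walks memory contiguously; any other order jumps by l or m*l elements per step, which is why the FIO of the storage should match the loop nest that the iteration space generates.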