GSL: Stencil computations made cool
Mauro Bianco, Ugo Varetto
ETH-CSCS
Background
• GOAL: increase the productivity of scientific programmers
• Communities use subsets of algorithms
• The usual DSL dilemma
– Traditional languages are too low level for portability
– Ways of expressing the required algorithms exist
– Developers are conservative
– Compilers are complex
• Work with communities to
– Shift toward high-level specification
– Improve portability and reusability of code
– Help service providers to identify resources
Motivation
• Many computations are represented by iterating over a data set to apply functions to its elements
• Developers do not have a uniform abstraction to express these computations
– Re-inventing the wheel every time: developing, debugging, maintaining codes
– Several performance issues are shared among all these versions
– High rewrite effort for portability to new machines
Stencil Computation (For Regular Grids)
• Given a regular D-dimensional grid
• Compute a function in all pertinent elements, which depends on elements at fixed offsets
– Fixed w.r.t. grid size
• Pertinent elements are those for which the offsets are well defined
• The iteration order is a parameter of the computation
Example
for i = 1 ... N−2   /* 0−based indices */
  for j = 1 ... M−2
    for k = 1 ... L−2
      /* Stencil operator */
      G1(i,j,k) = 1/6 * (G0(i−1,j,k) + G0(i+1,j,k)
                       + G0(i,j−1,k) + G0(i,j+1,k)
                       + G0(i,j,k−1) + G0(i,j,k+1))
    end for
  end for
end for
[Figures: "Applying function" and "Induced Core Space"]
GSL Concepts
• Generic Programming Approach
– Decoupling of algorithms, data, operators
• Our classification:
– Storage: area of memory with data
– Grid: representation of storage as a D-dimensional grid
– Stencil operator: function to apply to grid elements
– Stencil/Shape: area accessed by the operator (halo)
– Iteration Space: partial order required by the algorithm
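The decoupling above can be sketched in a few lines. This is an illustrative example, not GSL code: the names `Grid2D`, `apply_all`, and `SetTo` are hypothetical, chosen only to show how algorithm, data, and operator stay independent.

```cpp
#include <cassert>
#include <vector>

// Data: a flat buffer viewed as a 2D grid.
struct Grid2D {
    std::vector<double> data;
    int n, m;                               // rows, columns
    Grid2D(int n_, int m_) : data(n_ * m_, 0.0), n(n_), m(m_) {}
    double& at(int i, int j)       { return data[i * m + j]; }
    double  at(int i, int j) const { return data[i * m + j]; }
};

// Algorithm: visits every element; it knows nothing about the operator.
template <typename Op>
void apply_all(Grid2D& g, Op op) {
    for (int i = 0; i < g.n; ++i)
        for (int j = 0; j < g.m; ++j)
            op(g.at(i, j));
}

// Operator: knows nothing about loops or storage layout.
struct SetTo { double v; void operator()(double& x) const { x = v; } };
```

Any of the three pieces can be swapped without touching the other two, which is the point of the generic-programming approach.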
The Computation is Parallel
• Loops are not in the jargon of GSL
• The stencil operator is applied according to a partial order
– Partial orders are DAGs = not easy in general
• But our DAGs typically have structure (guarantees)
– do_all: no order specified
– do_i, do_j, do_k: increment i, j, or k
– do_diamond: when at (i,j,k), (i-1,j-1,k-1) is done
– Etc.
• We call them iteration spaces
– Each has an associated semantics
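A minimal sketch of what one such partial-order guarantee means, under assumed names (this is not GSL's implementation): a `do_i`-style space promises that (i,j) is visited only after (i-1,j), so an operator that reads its i-predecessor is well defined. A plain i-outer loop is just one valid schedule; the j loop could run in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One valid schedule for a do_i-style iteration space:
// i strictly increasing, j in any order.
template <typename Op>
void do_i_inc(std::vector<std::vector<double>>& u, Op op) {
    for (std::size_t i = 1; i < u.size(); ++i)
        for (std::size_t j = 0; j < u[i].size(); ++j)
            op(u, i, j);
}

// A stencil that reads (i-1,j): a running sum along i.
// It is only correct under the do_i partial order.
struct PrefixSumI {
    void operator()(std::vector<std::vector<double>>& u,
                    std::size_t i, std::size_t j) const {
        u[i][j] += u[i - 1][j];
    }
};
```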
A stencil computation
• Grid = Storage + Stencil
• Grid is a wrapper
• Stencil operator written in terms of the stencil interface
• Stencil interface provided by the stencil/shape
• Preamble performs checks
• Stencil operator may exist in multiple copies: no state
• Post-amble enforces storage semantics

[Diagram: Storage and Stencil/Shape form the Grid, to which the Stencil Operator is applied]

Iteration Space:
1. Preamble
2. Embed elements in access interface
3. Pass them to stencil operators
4. [Compute reduction]
5. Post-amble
Example
typedef default_DBStorage<arch_type, double>::type storage_type;
storage_type storage(n, m, 1); // max halo is needed

typedef Grid2D<stencil_1x1_stateful, DBStorage<double> > grid_i_t;
grid_i_t grid_init(storage, n, m);

do_all( ctx, grid_init, init_function() );

typedef Grid2D<stencil_3x3, DBStorage<double> > grid_t;
grid_t grid(storage, n, m);

do {
  do_all( ctx, grid, stencil_operator_avg() );
} while ( !do_reduce( ctx, grid, iden(), check() ) );
struct init_f {
  template <typename St>
  void operator()(St &u) const {
    int i, j;
    get_index(i, j);
    if ( (i == 0) || (j == 0) )
      u() = 1.0;
    else
      u() = 0.0;
  }
};

struct stencil_operator_3x3_avg {
  template <typename St>
  void operator()(St &u) const {
    u() = 1.0/16.0 * (4*u() - (u(-1,0) + u(0,-1) + u(0,1) + u(1,0)));
  }
};

struct iden {
  typedef bool result_type;
  template <typename St>
  bool operator()(St const& u) const {
    return (u() != 0.0);
  }
};

struct check {
  template <typename St>
  void operator()(St &u) const {
    u() = 1.0/16.0 * (4*u() - (u(-1,0) + u(0,-1) + u(0,1) + u(1,0)));
  }
};
Specifying the Architecture
• Defines a hierarchy of programming models

typedef make_arch<3, mpi, openmp, sequential>::type arch_type;

– Application is 3D
– Top level is MPI
– Each MPI task uses OpenMP
– Each OpenMP thread runs on a regular sequential processor
• Defaults for best system utilization

typedef default_DBStorage<arch_type, double>::type storage_type;

• Specific architectures can guarantee certain semantics
– Contexts can be relaxed, data access may be safe, etc.
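One way such a hierarchy can be encoded purely in types is a cons-cell typelist, where each architecture level wraps the next. The sketch below mimics the `make_arch` name and level tags from the slide, but the encoding is an assumption, not GSL's actual definition.

```cpp
#include <cassert>
#include <type_traits>

// Tags for the programming-model levels (names from the slides).
struct mpi {}; struct openmp {}; struct sequential {};

// A cons-cell typelist: each layer wraps the next one.
template <typename Level, typename Inner> struct layer {
    typedef Level level_type;
    typedef Inner inner_type;
};
struct leaf {};

// Hypothetical make_arch: dimensionality plus three nested levels.
template <int Dims, typename L0, typename L1, typename L2>
struct make_arch {
    static const int dimensions = Dims;
    typedef layer<L0, layer<L1, layer<L2, leaf> > > type;
};

typedef make_arch<3, mpi, openmp, sequential>::type arch_type;

// The hierarchy is queryable at compile time.
static_assert(std::is_same<arch_type::level_type, mpi>::value,
              "top level is MPI");
```

Because the hierarchy is a type, storage defaults and iteration-space implementations can be specialized on it with ordinary template machinery.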
Context semantics
• Contexts depend on the architecture
• Iterations can be executed only between Begin and End
• Accessing data between Begin and End can be undefined (to deal with GPUs)

typedef make_arch<3, mpi, cuda>::type arch_type;
...
context<arch_type> ctx = GSL_Begin<arch_type>();

do_all( ctx, grid, stencil_operator() ); // OK
cout << grid(3,3,3) << "\n";             // UNDEFINED

GSL_End(ctx);

do_all( ctx, grid, stencil_operator() ); // ERROR
cout << grid(3,3,3) << "\n";             // OK
Portability
• To make programs portable:
• The architecture can be obtained
• Storage is specialized based on the architecture
– default_storage<arch_type, double>::type
– Same structure as the architecture
• Contexts are used to pass information about architecture layers down to iteration spaces
• The semantics of contexts is always enforced
Storage
• Abstracts a 1D contiguous address space
• Currently two types of storage are supported:
– Single buffer: behaves just like a regular buffer
– Double buffer: written results are visible after commit
• Commit and sync ensure correctness
– Executed at the end of an iteration, based on information available from stencil operator traits
– Commit makes writes visible to subsequent reads
– Sync makes the buffer ready to use after initialization
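A minimal sketch of the double-buffer semantics described above (the class name and interface are illustrative, not GSL's actual storage API): writes go to a back buffer and only become visible to reads after `commit()`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Double-buffered storage: reads see the front buffer,
// writes land in the back buffer until commit().
class DBStorage {
    std::vector<double> front_, back_;
public:
    explicit DBStorage(std::size_t n) : front_(n, 0.0), back_(n, 0.0) {}
    double read(std::size_t i) const { return front_[i]; }
    void write(std::size_t i, double v) { back_[i] = v; }
    void commit() { front_.swap(back_); }  // make pending writes visible
};
```

This is what makes a Jacobi-style stencil safe: every read within one iteration sees the values from before that iteration, regardless of visit order.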
Stencils/Shapes
• Template class that specifies the extension of the halo
– E.g., a 3D shape specifies
• iminus: number of cells by which the halo extends over elements preceding the core (indices less than core)
• iplus: number of cells by which the halo extends over elements following the core (indices greater than core)
• jminus: ...
• Constructor that takes a pointer to a grid and coordinates
– Sets the core pointer
• Methods to access elements around the core in the halo region (for reading)
– value_type const & operator()(int, int, int) const;
• Methods to access the core element for read/write
• Methods for modifying the core pointer (moving the shape)
Stateful Stencils
• Provide methods to obtain
– Index of the core element in the Grid
– Global index of the core in case the Grid is a subgrid of a bigger grid
• Additional flexibility at the cost of
– Memory usage
– Performance (actual tests tend to confirm the impact is visible only if the index methods are used)
Grids = Storage + Shape
• GSL::Grid3D<stencil_3x3, SBStorage<int>, ijk> Grid(storage, m, n, l);
• The shape/stencil is the first argument
• The second argument is the storage type
• Then comes the layout argument
– Specifies how data would be traversed by a minimal-stride loop: ijk means 'for i, for j, for k'
• Several utilities to deal with sub-grids
Subgrids and Regions
• Given a region
– Obtain a subgrid
• grid.subgrid(region)
• Result type is the same as grid
– Obtain a re-shaped subgrid
• grid.reshape<newstencil>(region)
• Result type is the grid type with a different shape
• Grid alignment is the user's responsibility
– To avoid making too restrictive assumptions
– Error messages are raised if something is wrong
• Strict mode and relaxed mode available
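A subgrid can be understood as a view that shares the parent's storage and only re-maps indices through a region offset. The sketch below is an assumed illustration (the `GridView`/`Region` names are hypothetical); it mirrors the stated property that `grid.subgrid(region)` yields the same type as `grid`.

```cpp
#include <cassert>
#include <vector>

// A rectangular window into a parent grid.
struct Region { int i0, j0, ni, nj; };

class GridView {
    std::vector<double>* data_;
    int m_;          // parent row length
    Region r_;       // window into the parent
public:
    GridView(std::vector<double>* d, int m, Region r)
        : data_(d), m_(m), r_(r) {}
    // Indices are local to the subgrid; the region supplies the offset.
    double& at(int i, int j) {
        return (*data_)[(r_.i0 + i) * m_ + (r_.j0 + j)];
    }
    // Same type as the parent: only the region composes.
    GridView subgrid(Region r) const {
        return GridView(data_, m_,
                        Region{r_.i0 + r.i0, r_.j0 + r.j0, r.ni, r.nj});
    }
};
```

No data is copied, which is why alignment between grids sharing storage is left to the user, as noted above.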
Stencil Operators
• Function objects with additional traits to specify useful characteristics

struct stencil_operator_eq_copy {
  typedef bool result_type;
  typedef boost::fusion::vector<flag_read, flag_write> access_list_type;

  template <typename St1, typename St2>
  bool operator()(St1 const &u, St2 &v) const {
    bool res = (u() == v()); // Equal?
    v() = u();               // Copy
    return res;
  }
};

The user can specify the access pattern to ease debugging and improve performance. Not mandatory, but highly recommended!
Access List Type
• One flag per argument of the stencil operator
• flag_read if the stencil is only read
• flag_write if the stencil is read and written
• flag_init if the stencil is only written
• At the end of an iteration, commit and/or sync operations are invoked on the storage classes
– The semantics of these operations are exposed to users
• Note: at the moment it is a boost.fusion vector, but a simpler syntax can be devised
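A sketch of how per-argument flags can drive what happens at the end of an iteration. The flag names follow the slides, but the dispatch mechanism shown here (tag-based overloads) is an assumption for illustration, not GSL's internal machinery.

```cpp
#include <cassert>
#include <vector>

// Access flags, one per stencil-operator argument (names from the slides).
struct flag_read {}; struct flag_write {}; struct flag_init {};

struct Storage {
    std::vector<double> data;
    int commits = 0;
    void commit() { ++commits; }
};

// End-of-iteration action selected by the flag type:
// read-only arguments need no commit; written ones do.
inline void finalize(Storage&,   flag_read)  {}
inline void finalize(Storage& s, flag_write) { s.commit(); }
inline void finalize(Storage& s, flag_init)  { s.commit(); }
```

Because the flags are types, the selection costs nothing at run time, which is why declaring the access pattern can improve performance as well as debuggability.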
Iteration Spaces
• Iteration spaces specify requirements!
• Available in GSL (in decreasing parallelism)
– do_all: visit all the (core) elements somehow
– do_reduce: visit elements and compute a reduction on values returned by a (commutative) operator
– do_i_inc: ensure (i,j,k) is processed only after (i-1,j,k) has been (j- and k- versions available)
– do_i_dec, do_j_dec, do_k_dec: analogous
– do_diamond: (i-1,j-1,k-1) is processed before (i,j,k)
Iteration Spaces Implementation
• General first – specialization later
– Basic implementation with not much care for performance
– Specializations can be provided for specific cases and/or applications*

* Code can look ugly (for several reasons: reducing redundancy, improving performance, ...), but this is internal code, not seen at the top level; it is provided by the library developers, or by an advanced user after the basic code is up.
template <typename Grid, typename Operator>
void ARCH::do_all(Grid const &g, Operator const &op) {
  for (int i = 0; i < g.nx(); ++i)
    for (int j = 0; j < g.mx(); ++j)
      for (int k = 0; k < g.lx(); ++k) {
        typename Grid::stencil_type s(&g, i, j, k);
        op(s);
      }
}
#define MACRO_IMPL(z, n, _) \
template <BOOST_PP_ENUM_PARAMS_Z(z, BOOST_PP_INC(n), typename _Grid), typename _stencil_operator> \
struct sequential::_DO_(all, n)<BOOST_PP_ENUM_PARAMS_Z(z, BOOST_PP_INC(n), _Grid), _stencil_operator, \
  typename boost::enable_if< \
    GSL::same_major_with_base<typename GSL::_3D_major TYPE_CHK(BOOST_PP_INC(n))> \
  >::type > \
: GSL::nary_loop<void, BOOST_PP_INC(n)> \
{ \
  TYPE_INST(BOOST_PP_INC(n)) \
  typedef typename boost::remove_reference<_stencil_operator>::type stencil_op; \
  typedef typename Grid0::major_type major_type; \
  \
  void operator()( BOOST_PP_ENUM_BINARY_PARAMS_Z(z, BOOST_PP_INC(n), Grid, const &grid), stencil_op const &sten_op) \
  { \
    assert(_impl::check_grids3D(BOOST_PP_ENUM_PARAMS_Z(z, BOOST_PP_INC(n), grid))); \
    boost::tuple<int, int, int> bounds(grid0.nx(), grid0.mx(), grid0.lx()); \
    int i, j, k; \
    boost::tuple<int&, int&, int&> indices(i, j, k); \
    int & i1 = boost::get<major_type::_3D_outer_dimension>(indices); \
    int & i2 = boost::get<major_type::_3D_middle_dimension>(indices); \
    int & i3 = boost::get<major_type::_3D_inner_dimension>(indices); \
    const int N1 = boost::get<major_type::_3D_outer_dimension>(bounds); \
    const int N2 = boost::get<major_type::_3D_middle_dimension>(bounds); \
    const int N3 = boost::get<major_type::_3D_inner_dimension>(bounds); \
    i3 = 0; \
    int NN = N3; \
    for (i1 = 0; i1 < N1; ++i1) { \
      for (i2 = 0; i2 < N2; ++i2) { \
        STEN_INST(BOOST_PP_INC(n)) \
        for (int ii = 0; ii < NN; ++ii) { \
          sten_op(BOOST_PP_ENUM_PARAMS_Z(z, BOOST_PP_INC(n), stencil)); \
          STEN_INC(BOOST_PP_INC(n)) \
        } \
      } \
    } \
  } \
};
BOOST_PP_REPEAT(GSL_MAX_GRIDS, MACRO_IMPL, nil)
Conclusions & Future Work
• Generic programming for portable stencil computations
• GSL is a C++ generic library
– No specific compiler => some boilerplate code
• Using DSEL technology to perform optimizations
Initial Results
Tips and Tricks
• GSL tries to reduce the amount of platform-specific considerations
• Performance requires a minimum number of iteration spaces, but this has repercussions on code readability (fusion of operators)
• GSL provides facilities for fusing loops, but
• Loop fusion can be automated by using SEL
SEL: Stencil Embedded Language
• A prototype of a DSEL for combining stencil computations
do_all(grd3x3x3, avg) + do_reduce(grd3x3x3, grd_1x1x1_bf, equ, l_and)

• Meaning: "perform a do_all averaging on a 3x3x3 grid followed by a reduction to check correctness"
• Since the same grid appears in both loops, the DSEL fuses the computation into

do_reduce(grd3x3x3, grd_1x1x1_bf, fuse(avg, equ), l_and)
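Semantically, the fusion merges two traversals of the same grid into one pass that applies the operator and folds the predicate together. The sketch below illustrates the equivalence with stand-in operators (`avg_update` and `near_zero` are hypothetical substitutes for the slides' avg and equ); it is not the SEL expression-template machinery itself.

```cpp
#include <cassert>
#include <vector>

inline double avg_update(double x) { return x * 0.5; }   // stand-in for avg
inline bool   near_zero(double x)  { return x < 1.0; }   // stand-in for equ

// Unfused: two traversals of the data (do_all, then do_reduce).
inline bool two_passes(std::vector<double>& v) {
    for (double& x : v) x = avg_update(x);
    bool all = true;
    for (double x : v) all = all && near_zero(x);  // logical_and reduction
    return all;
}

// Fused: one traversal, same result, half the memory traffic.
inline bool fused_pass(std::vector<double>& v) {
    bool all = true;
    for (double& x : v) { x = avg_update(x); all = all && near_zero(x); }
    return all;
}
```

For memory-bound stencils the saved traversal is exactly where the "no penalty" fusion speedup reported later comes from.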
From Imperative to Declarative
• When using SEL the user adopts a declarative approach
– Specification + Information
– All arguments are placeholders
• Needed to postpone execution – lazy evaluation
• Allows symbolic analysis of programs
• Downside: expressions tend to grow in size

eval(do_all(grd3x3x3, avg) + do_reduce(grd3x3x3, grd_1x1x1_bf, equ, l_and), context)
Lunch Bill
• Need to associate placeholders to real data

// Define grids as before
Grid3D< stencil_3x3x3 > grid3x3x3(storage, n, m, l);
Grid3D< stencil_1x1x1 > grid1x1x1_bef(storage_bef, n, m, l);

// Map grids and operators (types and values) to indices (in this case, of vectors)
typedef fvector<Grid3D<stencil_3x3x3>, Grid3D<stencil_1x1x1> > GVEC;
typedef fvector<operator_avg, operator_eq, std::logical_and<bool> > OVEC;
GVEC Gvec(grid3x3x3, grid1x1x1_bef);
OVEC Ovec(operator_avg(), operator_eq(1.0e-4), std::logical_and<bool>());

// Give the execution engine information about these maps (and about the implementation to use)
SEL_context<cuda, GVEC, OVEC> context(Gvec, Ovec);

// Associate mnemonic (placeholder) symbols to the indices, to be used in expressions
MAKE_GRID(0, grd3x3x3);
MAKE_GRID(1, grd_1x1x1_bf);
MAKE_OPER(0, avg);
MAKE_OPER(1, equ);
MAKE_OPER(2, l_and);

template <int I> struct _grid {};
template <int I> struct _oper {};

There is boilerplate code that can be reduced (e.g., by using macros), but with some potential drawbacks.
GSL vs SEL
[Chart comparing GSL BASE, SEL BASE, and SEL FUSION]
• Aggressive inlining does not guarantee the best performance
• Default inlining can get worse from beginning to end
• SEL allows loop fusion with no penalty
DSEL Considerations
• SEL is for loop constructs
– Useful to analyze macrostructures
– No drastic change w.r.t. GSL syntax
• More semantics can be available (e.g., loop fusion)
• DSEL for stencil operators
– More than merely syntax embellishment
– More semantics is the golden rule
• Performs transformations the user may not be aware of
• Enables auto-tuning at the expression level
Sensitivity to expression writing
• Given an expression like

u(0,0,0) = 1./7.0 * (u(0,0,0) + u(1,0,0) + u(0,1,0) + u(0,0,1)
                   + u(-1,0,0) + u(0,-1,0) + u(0,0,-1))

• It can be written in (at least) 5040 different ways
– Does the way it is written matter?
– If yes, how much?
• In this case, a factor of 2!
[Chart: run time (ms) of each implementation, sorted by time, on a 400x400x400 3D grid of doubles]

In this case it was easy (after analyzing the permutations): each big step corresponds to moving u(0,0,-1) to the right!
A little more formally
• Given the set A = {+, -, ~} we can define a descriptor (a,b,c) where a, b, c belong to A
• A storage has a Fastest Iteration Order (FIO)
– The iteration order that guarantees the fastest scan of elements
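The FIO idea can be made concrete with linear-index arithmetic: for a given layout, exactly one index has stride 1, and scanning that index innermost gives the fastest traversal. The layout structs below are simplified stand-ins (not GSL's ijk layout arguments) chosen to show how the stride-1 dimension follows from the layout.

```cpp
#include <cassert>

// 'for i, for j, for k' layout: k is the stride-1 (innermost) index.
struct LayoutIJK {
    int n, m, l;
    int offset(int i, int j, int k) const { return (i * m + j) * l + k; }
};

// Reversed layout: i is the stride-1 index instead.
struct LayoutKJI {
    int n, m, l;
    int offset(int i, int j, int k) const { return (k * m + j) * n + i; }
};
```

Iterating the stride-1 index innermost walks memory contiguously; any other order jumps by l or m*l elements per step, which is why the FIO of the storage should match the loop nest that the iteration space generates.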