Designing Architecture-aware Library using Boost.Proto

Joel Falcou, Mathias Gaunard, Pierre Esterie, Eric Jourdanneau

LRI - University Paris Sud XI - CNRS

13/03/2012

Context

In Scientific Computing ...

• there is Scientific

    - Applications are domain driven
    - Users ≠ Developers
    - Users are reluctant to change

• there is Computing

    - Computing requires performance ...
    - ... which implies architecture-specific tuning
    - ... which requires expertise
    - ... which may or may not be available

The Problem

People using computers to do science want to do science first.

The Problem and how to solve it

The Facts

• The “library to bind them all” doesn’t exist (or we would have it already)

• The performance of new architectures is underused because of their inherent complexity

• Few people are experts in both parallel programming and SomeOtherField

The Ends

• Find a way to shield users from parallel architecture details

• Find a way to shield developers from parallel architecture details

• Isolate software from hardware evolution

The Means

• Generic Programming

• Template Meta-Programming

• Embedded Domain Specific Languages

Talk Layout

Introduction

NT2

Expression Templates v2.0

Supporting GPUs

Conclusion


What’s NT2?

A Scientific Computing Library

• Provide a simple, Matlab-like interface for users

• Provide high-performance computing entities and primitives

• Easily extendable

A Research Platform

• Simple framework to add new optimization schemes

• Test bench for EDSL development methodologies

• Test bench for Generic Programming in real life projects

The NT2 API

Principles

• table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionalities

• 300+ functions usable directly either on table or on any scalar values, as in Matlab

How does it work

• Take a .m file, copy it to a .cpp file

• Add #include <nt2/nt2.hpp> and make cosmetic changes

• Compile the file and link with libnt2.a

Matlab, you said?

R = I(:,:,1);
G = I(:,:,2);
B = I(:,:,3);

Y = min(abs(0.299.*R + 0.587.*G + 0.114.*B), 235);
U = min(abs(-0.169.*R - 0.331.*G + 0.5.*B), 240);
V = min(abs(0.5.*R - 0.419.*G - 0.081.*B), 240);

Now with NT2

table<double> R = I(_,_,1);
table<double> G = I(_,_,2);
table<double> B = I(_,_,3);
table<double> Y, U, V;

Y = min(abs(0.299*R + 0.587*G + 0.114*B), 235);
U = min(abs(-0.169*R - 0.331*G + 0.5*B), 240);
V = min(abs(0.5*R - 0.419*G - 0.081*B), 240);

Now with NT2

table<float, settings(shallow_, of_size_<N,M>)> R = I(_,_,1);
table<float, settings(shallow_, of_size_<N,M>)> G = I(_,_,2);
table<float, settings(shallow_, of_size_<N,M>)> B = I(_,_,3);
table<float, of_size_<N,M> > Y, U, V;

Y = min(abs(0.299*R + 0.587*G + 0.114*B), 235);
U = min(abs(-0.169*R - 0.331*G + 0.5*B), 240);
V = min(abs(0.5*R - 0.419*G - 0.081*B), 240);

Some real life applications


NT2 before Boost.Proto

EDSL Core

• Code base was 11 KLOC

• 8 KLOC were dedicated to the various Expression Template glue

• 3 KLOC of actual smart stuff

Architectural support

• Altivec and SSE2 extension support

• Some vague pthread support

• Efforts were stagnant: adding a simple feature required changing multiple KLOC of code

Embedded Domain Specific Languages

What’s an EDSL?

• DSL = Domain Specific Language

• Declarative language, easy to use, fitting the domain

• EDSL = DSL within a general purpose language

EDSL in C++

• Relies on operator overload abuse (Expression Templates)

• Carries semantic information around code fragments

• Generic implementations become aware of the optimizations available to them

Expression Templates

matrix x(h,w), a(h,w), b(h,w);
x = cos(a) + (b*a);

expr<assign, expr<matrix&>, expr<plus, expr<cos, expr<matrix&> >, expr<multiplies, expr<matrix&>, expr<matrix&> > > >(x,a,b);

[Figure: AST of the expression, with = at the root over x and +; + has the children cos(a) and *(b, a)]

#pragma omp parallel for
for(int j=0;j<h;++j)
{
  for(int i=0;i<w;++i)
  {
    x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) );
  }
}

Arbitrary transforms applied on the meta-AST
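To make the mechanism on this slide concrete, here is a minimal, hand-rolled expression template sketch using only the standard library. It is not NT2 or Boost.Proto code: the names et::Vec, et::Plus, et::Times, et::Cos and et::stored are hypothetical, and the sketch works on 1D vectors of double rather than matrices. It requires C++11 or later.

```cpp
#include <cmath>
#include <cstddef>
#include <type_traits>
#include <vector>

namespace et {  // hypothetical namespace, for illustration only

struct expr_tag {};  // marks every type allowed inside an expression

// Leaf of the AST: owns its data and can be assigned from any expression.
struct Vec : expr_tag {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }

    // Assigning an expression evaluates it element by element in one loop,
    // without building intermediate vectors.
    template <class E,
              class = typename std::enable_if<std::is_base_of<expr_tag, E>::value>::type>
    Vec& operator=(const E& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

// How a node stores a child: leaves by reference, sub-expressions by value.
template <class T> struct stored { using type = T; };
template <> struct stored<Vec> { using type = const Vec&; };

template <class L, class R>
struct Plus : expr_tag {
    typename stored<L>::type l;
    typename stored<R>::type r;
    Plus(const L& l_, const R& r_) : l(l_), r(r_) {}
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

template <class L, class R>
struct Times : expr_tag {
    typename stored<L>::type l;
    typename stored<R>::type r;
    Times(const L& l_, const R& r_) : l(l_), r(r_) {}
    double operator[](std::size_t i) const { return l[i] * r[i]; }
};

template <class A>
struct Cos : expr_tag {
    typename stored<A>::type a;
    explicit Cos(const A& a_) : a(a_) {}
    double operator[](std::size_t i) const { return std::cos(a[i]); }
};

// The overloads only build AST nodes; no arithmetic happens here.
template <class L, class R>
using both_expr = typename std::enable_if<
    std::is_base_of<expr_tag, L>::value && std::is_base_of<expr_tag, R>::value>::type;

template <class L, class R, class = both_expr<L, R>>
Plus<L, R> operator+(const L& l, const R& r) { return Plus<L, R>(l, r); }

template <class L, class R, class = both_expr<L, R>>
Times<L, R> operator*(const L& l, const R& r) { return Times<L, R>(l, r); }

template <class A,
          class = typename std::enable_if<std::is_base_of<expr_tag, A>::value>::type>
Cos<A> cos(const A& a) { return Cos<A>(a); }

}  // namespace et
```

With this in place, x = et::cos(a) + (b * a) builds a Plus<Cos<Vec>, Times<Vec, Vec>> object and evaluates it in the single loop of Vec::operator=. Most of the code is glue of the kind that Boost.Proto takes over.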

Boost.Proto

What’s Boost.Proto

• An EDSL for defining EDSLs in C++

• Generalizes Expression Templates

• Easy way to define and test an EDSL

• EDSL = some Grammar Rules + some Semantics

Boost.Proto Benefits

• Fast development process

• Compiler people are at ease: Grammar + Semantics + Code Generation process

• Easily extendable through Transforms

• Easy to handle: MPL and Fusion compatible

Why Boost.Proto?

Benefits

• Scalability: keeps EDSL glue code size low

• Extensibility: simplifies the design of new optimizations

• Debugging: allows for meaningful error reporting

Remaining Challenges

• Provide a generic “compilation” process

• Select the best function definition with respect to the hardware

• Allow new passes to be defined non-intrusively

State of the art EDSL

Usual approaches and results

• Eigen: supports SIMD arithmetic, thread only algorithms

• uBLAS: parallelism comes from a potential back-end

What’s the problem?

• Hardware considerations come after the ET code

• Retro-fitting parallelism is hard

• Need for a hardware-aware EDSL

• Back to Generative Programming

Principles of Generative Programming

[Figure: Domain Specific Application Description → Generative Component (Translator, Parametric Sub-components) → Concrete Application]

Generative Programming v2


Our Approach

Principles

• Segment EDSL evaluation into phases

• Each phase uses Proto transforms to advance code generation

• Hardware specifications are active elements in function overloading

• Use Generalized Tag Dispatching (GTD) to select the proper function implementation

Advantages

• Hardware can influence part or all of the evaluation process

• Proto transform syntax is easy to use

• GTD increases opportunities for optimization

Generalized Tag Dispatching

Principles

• Tag dispatching uses tag types to overload functions

• Classical example: standard iterators

• We augment the system with:

    - A tag computed from the function identifier
    - A tag computed from the hardware configuration
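As a reminder of the classical case mentioned above, the sketch below re-implements an advance-like function by dispatching on the standard iterator category tags. The namespace sketch and the Path enum are hypothetical and only there to make the selected overload observable.

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <vector>

namespace sketch {  // hypothetical namespace, for illustration only

enum class Path { linear, constant };  // reports which overload was selected

// Fallback for any input iterator: step n times.
template <class It>
Path advance_impl(It& it, std::ptrdiff_t n, std::input_iterator_tag) {
    for (; n > 0; --n) ++it;
    return Path::linear;
}

// Random-access iterators can jump in one operation.
template <class It>
Path advance_impl(It& it, std::ptrdiff_t n, std::random_access_iterator_tag) {
    it += n;
    return Path::constant;
}

// The tag is computed from the iterator type; overload resolution does the rest.
// Tags form an inheritance chain, so a bidirectional iterator falls back to the
// input iterator overload.
template <class It>
Path advance(It& it, std::ptrdiff_t n) {
    return advance_impl(it, n,
                        typename std::iterator_traits<It>::iterator_category{});
}

}  // namespace sketch
```

Calling sketch::advance on a std::vector iterator selects the constant-time overload, while a std::list iterator falls back to the linear one. GTD applies the same fallback idea to function and hardware tags.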


namespace tag { struct plus_ : elementwise_<plus_> {}; }

functor<tag::plus_> callee;
callee( some_a, some_b );


template<class Tag, class Site>
template<class... A>
auto functor<Tag,Site>::operator()(A&&... args)
{
  dispatch< typename hierarchy_of<Site>::type
          , typename hierarchy_of<Tag>::type(typename hierarchy_of<A>::type...)
          > callee;
  return callee( std::forward<A>(args)... );
}


template<class A0>
struct dispatch< tag::sse_
               , tag::plus_( simd_< real_< float_<A0> > >
                           , simd_< real_< float_<A0> > >
                           )
               >
{
  A0 operator()(A0&& a, A0&& b)
  {
    return _mm_add_ps( a, b );
  }
};
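The selection logic can be sketched without NT2. The example below swaps the dispatch class template specializations shown on the slide for plain function overloads on tag objects, which is a simpler technique with the same fallback behavior. All names (gtd::cpu_, gtd::sse_, gtd::avx_, gtd::impl, gtd::result) are hypothetical, and the "sse" overload uses scalar arithmetic as a stand-in for _mm_add_ps.

```cpp
namespace gtd {  // hypothetical namespace, for illustration only

// Hardware sites form an inheritance chain, from most generic to most specific.
struct cpu_ {};
struct sse_ : cpu_ {};
struct avx_ : sse_ {};

// Function tags.
struct plus_ {};
struct multiplies_ {};

enum class impl { generic, sse };          // which implementation was selected
struct result { float value; impl chosen; };

// Generic implementations, valid on every site.
inline result dispatch(cpu_, plus_, float a, float b) { return {a + b, impl::generic}; }
inline result dispatch(cpu_, multiplies_, float a, float b) { return {a * b, impl::generic}; }

// Site-specific overload: only plus_ has one here (stand-in for _mm_add_ps).
inline result dispatch(sse_, plus_, float a, float b) { return {a + b, impl::sse}; }

// The functor builds both tags and lets overload resolution pick the closest
// match in the site hierarchy.
template <class Tag, class Site>
struct functor {
    result operator()(float a, float b) const { return dispatch(Site{}, Tag{}, a, b); }
};

}  // namespace gtd
```

With these definitions, functor<plus_, avx_> resolves to the sse_ overload, because avx_ derives from sse_ and no AVX-specific version exists, while functor<multiplies_, avx_> falls all the way back to the cpu_ version. Adding a new site then only requires overloads for the functions that benefit from it.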


template<class S, class R, class L, class N, class A0>
struct dispatch<openmp_<S>, reduction_<R,L,N>(ast_<A0>)>
{
  typename A0::value_type operator()(A0&& ast)
  {
    auto result = functor<N,S>()(as_<typename A0::value_type>());
    #pragma omp parallel
    {
      auto local = functor<N,S>()(as_<typename A0::value_type>());
      #pragma omp for nowait
      for(int i=0;i<size(ast);++i)
        functor<R,S>()(local, ast(i));

      #pragma omp critical
      functor<L,S>()(result, local);
    }

    return result;
  }
};
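The same reduction pattern, specialized by hand for a sum over a std::vector<double>, looks as follows. This is a sketch and not NT2 code: the neutral element N becomes 0.0, the reduction functor R becomes +=, and the merge functor L becomes += on the shared result. The function name parallel_sum is hypothetical. When compiled without OpenMP support the pragmas are ignored and the function runs serially with the same result.

```cpp
#include <vector>

namespace sketch {  // hypothetical namespace, for illustration only

inline double parallel_sum(const std::vector<double>& v) {
    double result = 0.0;                        // neutral element of the sum
    const long n = static_cast<long>(v.size());

    #pragma omp parallel
    {
        double local = 0.0;                     // per-thread accumulator

        // Each thread reduces its share of the iterations; nowait lets a
        // thread move on to the merge step as soon as its share is done.
        #pragma omp for nowait
        for (long i = 0; i < n; ++i)
            local += v[i];

        // Merge the per-thread partial results one thread at a time.
        #pragma omp critical
        result += local;
    }
    return result;
}

}  // namespace sketch
```

Keeping the accumulation in a thread-private variable means the critical section is entered once per thread rather than once per element.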

Multipass EDSL Code Generation

Principles

• Don’t walk the AST directly

• Capture where hardware introduces extension points

• Use GTD to select the proper phase implementation

Structure

• Optimize: AST to AST transforms

• Schedule: AST to Forest of Code Generator transforms

• Run: Code Generator to Code transforms


Optimize

• AST pattern matching using Proto grammars

• Replace large sequences of operations by equivalent, faster kernels

• E.g.: a+b*c to gemm or fma, x = inv(a)*b to solve(a,x,b)

Schedule

• Segment the AST into loop-compatible loop functors

• Use the AST node hierarchy (and GTD) for splitting

Run

• Generate code for each AST in the scheduled forest

• Potentially generate runtime calls for runtime optimization

• Classical EDSL code generation happens here, on the correct hardware
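The Optimize pass can be illustrated with a small AST-to-AST rewrite done purely through overload resolution. This is a sketch under stated assumptions: it uses plain function template overloads instead of Proto grammars and transforms, the node types (opt::term, opt::plus_, opt::mult_, opt::fma_) are hypothetical, and it needs C++14 for deduced return types.

```cpp
#include <cmath>
#include <type_traits>

namespace opt {  // hypothetical namespace, for illustration only

// Tiny AST: scalar terminals and binary nodes.
struct term { double v; };
template <class L, class R> struct plus_ { L l; R r; };
template <class L, class R> struct mult_ { L l; R r; };
template <class A, class B, class C> struct fma_ { A a; B b; C c; };  // a*b + c

// Optimize pass: rebuild the tree, rewriting x + y*z into fma(y, z, x).
inline term optimize(term t) { return t; }

template <class L, class R>
auto optimize(mult_<L, R> e) {
    auto l = optimize(e.l);
    auto r = optimize(e.r);
    return mult_<decltype(l), decltype(r)>{l, r};
}

template <class L, class R>
auto optimize(plus_<L, R> e) {
    auto l = optimize(e.l);
    auto r = optimize(e.r);
    return plus_<decltype(l), decltype(r)>{l, r};
}

// More specialized pattern: it wins overload resolution when the right
// operand of + is a multiplication.
template <class A, class B, class C>
auto optimize(plus_<A, mult_<B, C>> e) {
    auto a = optimize(e.l);
    auto b = optimize(e.r.l);
    auto c = optimize(e.r.r);
    return fma_<decltype(b), decltype(c), decltype(a)>{b, c, a};
}

// Run pass: plain recursive evaluation of either tree.
inline double eval(term t) { return t.v; }
template <class L, class R> double eval(plus_<L, R> e) { return eval(e.l) + eval(e.r); }
template <class L, class R> double eval(mult_<L, R> e) { return eval(e.l) * eval(e.r); }
template <class A, class B, class C>
double eval(fma_<A, B, C> e) { return std::fma(eval(e.a), eval(e.b), eval(e.c)); }

// The rewrite happens at compile time: only the type of the tree changes.
static_assert(std::is_same<decltype(optimize(plus_<term, mult_<term, term>>{})),
                           fma_<term, term, term>>::value,
              "a + b*c is rewritten into a fused multiply-add node");

}  // namespace opt
```

Evaluating the original tree and the optimized tree gives the same value for exactly representable inputs. Only the kernel selected for the a + b*c pattern differs.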


The new NT2 Structure

State of the code base

• Expression Template handling: 200-300 LOC

• GTD handling: 1 KLOC

• Actual features and function implementation: 10 KLOC

Evolution process

• Adding a new site: 100-500 LOC

• Adding a function: 10-100 LOC

• Time for complete new architecture support: 1 to 3 weeks

Some results

Support for OpenMP

• 1 build system script

• 4 new overloads for the run phase

• 1 new allocator to prevent false sharing

Support for Next Gen SIMD

• AVX: 2 weeks of work, functional

• AVX2: prototype ready, works in simulator

• LarBn: prototype ready, works in simulator

OpenCL in action

Principle

• Standard for manycore/multicore systems

• Uses runtime kernels and JIT compilation

Example

const char* OpenCLSource[] = {
  "__kernel void VectorAdd(__global int* c, __global int* a, __global int* b, unsigned int nb)",
  "{",
  "  // Index of the elements to add \n",
  "  unsigned int n = get_global_id(0);",
  "  // Sum the n'th element of vectors a and b and store in c \n",
  "  c[n]=0;",
  "  for (int i=0;i<nb;i++) c[n] += (a[n]+b[n]);",
  "}"
};

From Proto to OpenCL

The Schedule pass

• Split the AST as usual on the function hierarchy

• Replace each AST by a functor generating the equivalent OpenCL

• Compute a static hash of the tree to remember already compiled expressions

The Run pass

• Stream data at the beginning of each kernel

• Handle generic calls and retrieval of the compiled kernel

• Fire asynchronous kernel calls over the forest
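A rough sketch of the Schedule pass idea follows: the expression type generates OpenCL source text, and the type itself serves as the key that remembers already generated kernels. Everything here is hypothetical (clgen::arg, clgen::plus, clgen::times, clgen::sin_, kernel_source, compiled_kernel), the kernel signature is fixed to two inputs for brevity, and the call to the OpenCL runtime compiler is replaced by a counter, so no OpenCL installation is needed.

```cpp
#include <string>

namespace clgen {  // hypothetical namespace, for illustration only

// Expression node types: each one knows how to print itself as OpenCL C.
template <int I> struct arg {
    static std::string code() { return "a" + std::to_string(I) + "[n]"; }
};
template <class L, class R> struct plus {
    static std::string code() { return "(" + L::code() + " + " + R::code() + ")"; }
};
template <class L, class R> struct times {
    static std::string code() { return "(" + L::code() + " * " + R::code() + ")"; }
};
template <class A> struct sin_ {
    static std::string code() { return "sin(" + A::code() + ")"; }
};

// Wrap the generated expression into an element-wise kernel.
template <class Expr>
std::string kernel_source() {
    return "__kernel void k(__global float* out, __global const float* a0, "
           "__global const float* a1) "
           "{ unsigned int n = get_global_id(0); out[n] = " + Expr::code() + "; }";
}

// Counts how many times a kernel was "compiled" (stand-in for the JIT call).
inline int& compile_count() { static int n = 0; return n; }

// One function-local static per expression type: the type acts as the static
// key, so an expression is generated and compiled only on first use.
template <class Expr>
const std::string& compiled_kernel() {
    static const std::string src = (++compile_count(), kernel_source<Expr>());
    return src;
}

}  // namespace clgen
```

Requesting the kernel for the same expression type twice returns the same object and increments the counter only once, which mirrors the gap between the first GPU run and later runs when JIT compilation is involved.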

Support for OpenCL

NT2 Example

int main()
{
  int nbpoints = 512000;
  table<float, settings(_1D)> xx(ofSize(nbpoints));
  table<float, settings(_1D)> yy(ofSize(nbpoints));

  for (int i=1;i<=nbpoints;i++)
  {
    xx(i) = -nbpoints/2+i;
    yy(i) = 0.0;
  }
  yy = 100*sin(8*3.14*xx/512000);

  std::cout << xx(1) << " " << yy(1);
}

Support for OpenCL

NT2 Example

int main()
{
  int nbpoints = 512000;
  table<float, settings(_1D, nt2::gpu)> xx(ofSize(nbpoints));
  table<float, settings(_1D, nt2::gpu)> yy(ofSize(nbpoints));

  for (int i=1;i<=nbpoints;i++)
  {
    xx(i) = -nbpoints/2+i;
    yy(i) = 0.0;
  }
  yy = 100*sin(8*3.14*xx/512000);

  std::cout << xx(1) << " " << yy(1);
}

Preliminary results for OpenCL

Generating some signals

for (int i=1;i<=nbpoints;i++)
{
  xx(i) = -nbpoints/2+i;
  yy(i) = 0.0;
}

yy = 100*sin(8*3.14*xx/512000);
yy = yy + 0.0001*xx;

for (int ii=1;ii<512000;ii+=15000)
{
  addGaussian(xx, yy, (float)ii, 1000);
  yy = 1-yy;
}

yy = 1-yy;
yy = yy*yy;
yy = yy*cos(xx);


template<typename A>
void addGaussian(A& xx, A& yy, float centre, float largeur)
{
  yy = yy + ( 100000/(val(largeur)*sqrt(2*3.1415)) )
          * exp( -(xx-val(centre))*(xx-val(centre))
                 /(2*val(largeur)*val(largeur))
               );
}
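For reference, the array expression in addGaussian adds a scaled Gaussian bump to every element of yy. A scalar sketch over std::vector<float>, assuming xx and yy have the same length, could look like this. The function ref::add_gaussian is hypothetical and not part of NT2; it keeps the constant 3.1415 and the parameter names centre (center) and largeur (width) from the slide.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

namespace ref {  // hypothetical namespace, for illustration only

// Element-wise equivalent of the NT2 expression:
//   yy[i] += 100000 / (largeur * sqrt(2 * 3.1415))
//            * exp(-(xx[i] - centre)^2 / (2 * largeur^2))
// Assumes xx.size() == yy.size().
inline void add_gaussian(const std::vector<float>& xx, std::vector<float>& yy,
                         float centre, float largeur) {
    const float scale = 100000.0f / (largeur * std::sqrt(2.0f * 3.1415f));
    for (std::size_t i = 0; i < yy.size(); ++i) {
        const float d = xx[i] - centre;
        yy[i] += scale * std::exp(-(d * d) / (2.0f * largeur * largeur));
    }
}

}  // namespace ref
```

The NT2 version expresses the same computation without the explicit loop, which is what lets the library choose between the scalar, SIMD, OpenMP and GPU back-ends.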


Generating some signals: timings

Configuration         Time
CPU 1 core            520.2 ms
CPU 4 cores           141.3 ms
CPU 4 cores + SIMD     36.1 ms
GPU first run         878.2 ms
GPU other runs         13.1 ms
GPU manual OCL         12.9 ms

• Compares favorably to ViennaCL and AccelerEyes

Let’s round this up!

Parallel Computing for Scientists

• Parallel programming tools are required

• EDSLs are an efficient way to design such tools

• But hardware has to stay in the loop

Our claims

• Software libraries built as Generic and Generative components can solve a large chunk of parallelism-related problems while being easy to use.

• Like regular languages, EDSLs need information about the hardware system

• Integrating hardware descriptions as Generic components increases tool portability and re-targetability

Current and Future Work

Recent activity

• A public release of NT2 is planned “soon”

• A Matlab to NT2 compiler has been designed

• Prototype for single source CUDA support in NT2

• PhD started on MPI support in NT2

What we’ll be cooking in the future

• std::future is the future

• Experimental research on NT2 for embedded systems

• Toward a global generic approach to parallelism

Thanks for your attention
