Designing C++ portable SIMD support


Joel Falcou

NumScale

CppCon 2016

NumScale in a few words

Our company

• French start-up specialized in software performance
• We sell C++ libraries to master modern hardware performance
• Consulting & training on all things C++ or HPC

NumScale and C++

• Member of the ISO C++ French National Body
• Enthusiastic user of & contributor to OSS projects
• Involved in the European C++ community

What in the world is SIMD?

Is SIMD that obscure?

Let’s have a poll

• Who needs performance in their daily job?
• Who knows about parallel programming?
• Who knows about SIMD/multimedia extensions?
• Who uses SIMD extensions like SSE, AVX, VMX or NEON?
• Who has nightmares because of those?


What is SIMD? - A French cuisine approach

[Figure sequence; captions only]

• Some workload to process
• A regular CPU: Single Instruction, Single Data processing
• A SIMD-enabled CPU: Single Instruction, Multiple Data processing

What is SIMD? - For real

[Figure: SISD vs. SIMD, showing instructions, data and results]

Principles

• Wide registers store N > 1 values.
• Special instructions process those registers.
• Code uses a blitter-like approach.

Benefits

• Speed-up of N on cache-hot data
• Avoid premature scale-out
• Maximize FLOPS/Watt
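To make the "wide register" idea concrete, here is a minimal sketch in plain C++ that models a 4-wide register with an ordinary struct. The names `vec4`, `add`, `add_scalar` and `add_wide` are hypothetical and exist only for this illustration. No real SIMD instruction is emitted; the point is only the shape of the loop: N elements per step, then a scalar loop for the leftovers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 4-wide "register": a stand-in for a hardware SIMD register.
struct vec4
{
  std::array<float, 4> lane;
};

// Models one "instruction" that processes all four lanes at once.
inline vec4 add(vec4 const& a, vec4 const& b)
{
  vec4 r;
  for (std::size_t i = 0; i < 4; ++i) r.lane[i] = a.lane[i] + b.lane[i];
  return r;
}

// SISD shape: one addition per loop iteration.
inline void add_scalar(std::vector<float>& a, std::vector<float> const& b)
{
  assert(a.size() == b.size());
  for (std::size_t i = 0; i < a.size(); ++i) a[i] += b[i];
}

// SIMD shape: four additions per iteration, then a scalar loop for the tail.
inline void add_wide(std::vector<float>& a, std::vector<float> const& b)
{
  assert(a.size() == b.size());
  std::size_t const n = a.size();
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4)
  {
    vec4 va{{a[i], a[i + 1], a[i + 2], a[i + 3]}};   // "load"
    vec4 vb{{b[i], b[i + 1], b[i + 2], b[i + 3]}};
    vec4 vr = add(va, vb);                           // one wide operation
    for (std::size_t k = 0; k < 4; ++k) a[i + k] = vr.lane[k];  // "store"
  }
  for (; i < n; ++i) a[i] += b[i];                   // leftover elements
}
```

Both functions produce the same values; the wide version simply needs about a quarter of the loop iterations, which is where the "speed-up of N" on cache-hot data comes from.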

1,001 flavors of SIMD

Intel x86

• MMX: 64 bits, int8, int16, int32
• SSE: 128 bits, float
• SSE2: 128 bits, int8, int16, int32, int64, double
• SSE3, SSSE3
• SSE4a (AMD)
• SSE4.1, SSE4.2
• AVX: 256 bits, float, double
• AVX2: 256 bits, int8, int16, int32, int64
• FMA3
• FMA4, XOP (AMD)
• AVX512: 512 bits, float, double, int32, int64

PowerPC

• VMX: 128 bits, int8, int16, int32, int64, float
• VSX: 128 bits, int8, int16, int32, int64, float, double
• QPX: 256 bits, double

ARM

• VFP: 64 bits, float, double
• NEON: 64 and 128 bits, double, float, int8, int16, int32, int64

The Many Ways to Vectorize

Implicit vectorization

• Auto-vectorization
• Compiler hints

Explicit vectorization

• Language extensions
• SIMD intrinsics libraries
• Vector intrinsics
• Inline assembly (let’s just say no right now)


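Auto-vectorizer

• Compile-time analysis of the loop nest
• May use special hints
• Only safe transformations are applied

    template <typename T>
    void f(T* __restrict a, T* __restrict b, int size)
    {
      // __restrict is a compiler extension: ISO C++ has no restrict keyword.
      // ivdep is a compiler-specific hint that the loop carries no dependency.
      #pragma ivdep
      for(int i = 0; i < size; ++i)
        a[i] += b[i];
    }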

OpenMP4

Principle

• Flag a loop as one that must be vectorized
• Support for reductions & SIMD functions
• The user is in charge of checking validity

    template <typename T> T f(T* a, T* b, int size)
    {
      T res = 0;
      #pragma omp simd reduction(+:res)
      for(int i = 0; i < size; ++i)
      {
        a[i] += b[i];
        res += b[i];
      }
      return res;
    }
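As a rough picture of what a vectorized `+` reduction does, the sketch below keeps one partial sum per lane and combines the lanes at the end. It is plain C++ with an assumed lane count of 4, and the name `simd_style_sum` is hypothetical. It does not show what any particular compiler generates.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Sums a sequence the way a 4-wide SIMD reduction typically proceeds:
// one accumulator per lane, a horizontal sum of the lanes, then a scalar tail.
inline float simd_style_sum(std::vector<float> const& v)
{
  constexpr std::size_t width = 4;        // assumed lane count
  std::array<float, width> partial{};     // per-lane accumulators, all zero

  std::size_t i = 0;
  for (; i + width <= v.size(); i += width)
    for (std::size_t k = 0; k < width; ++k)
      partial[k] += v[i + k];             // lane k only ever sees elements i + k

  float res = 0.f;
  for (float p : partial) res += p;       // combine the lanes
  for (; i < v.size(); ++i) res += v[i];  // leftover elements
  return res;
}
```

The additions happen in a different order than in the sequential loop, so floating-point results can differ in the last bits. That reordering is one reason the validity check is left to the user.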

SIMD Intrinsics Library

Principle

• Wrap SIMD computations in a library
• Support SIMD idioms with algorithms or other abstractions
• Improve portability across compilers

Examples

• Agner Fog’s x86 library
• Vc, NOVA
• gSIMD, Cyme
• Boost.SIMD

Explicit usage of intrinsics

The same 32-bit integer multiplication, written for four instruction sets:

    // NEON
    return vmul_s32(a0, a1);   // 64-bit
    return vmulq_s32(a0, a1);  // 128-bit

    // SSE4.1
    return _mm_mullo_epi32(a0, a1);

    // SSE2
    return _mm_or_si128(
             _mm_and_si128( _mm_mul_epu32(a0, a1)
                          , _mm_setr_epi32(0xffffffff, 0, 0xffffffff, 0)
                          )
           , _mm_slli_si128(
               _mm_and_si128( _mm_mul_epu32( _mm_srli_si128(a0, 4)
                                           , _mm_srli_si128(a1, 4)
                                           )
                            , _mm_setr_epi32(0xffffffff, 0, 0xffffffff, 0)
                            )
             , 4)
           );

    // Altivec
    // reinterpret as u16
    short0 = (__vector unsigned short)a0;
    short1 = (__vector unsigned short)a1;

    // shifting constant
    shift_ = vec_splat_u32(-16);
    sf = vec_rl(a1, shift_);

    // Compute high part of the product
    high = vec_msum( short0, (__vector unsigned short)sf
                   , vec_splat_u32(0)
                   );

    // Complete by adding low part of the 16 bits product
    return vec_add( vec_sl(high, shift_)
                  , vec_mulo(short0, short1)
                  );

Implicit vs Explicit SIMD

Implicit

• Automatic dependency analysis (e.g. reductions).
• Recognises idioms with data dependencies.
• Non-inline functions are scalar.
• Limited support for outer-loop vectorisation.
• Relies on the compiler’s library of vectorizable patterns.

Explicit

• No dependency analysis.
• Recognises idioms without data dependencies.
• Non-inline functions can be vectorised.
• Outer loops can be vectorised.
• May be more cross-compiler portable.

bSIMD and Boost.SIMD†

†Boost.SIMD is a candidate for acceptance as a Boost Library

From bSIMD to Boost.SIMD

bSIMD

• NumScale closed-source software for SIMD programming
• Explicit SIMD library
• Supports x86, PPC and ARM architectures
• Provides domain-specific algorithms

Boost.SIMD

• Open-source subset of bSIMD
• Supports x86 and Power6
• Provides STD-like algorithms

The Boost.SIMD register abstraction

pack<T,N>

• Usable as a regular Value Type
• Wraps a block of N contiguous elements of type T
• pack<T> picks the optimal N for the current hardware

Constraints

• T is a fundamental type
• logical<T> is used to handle booleans
• N must be a power of 2

What if?

Let C be the number of elements of type T that fit in a hardware register on the current target.

• If N == C, use the native register directly
• If N < C, use a scalar array (for now)
• If N > C, use an aggregate of 2 x pack<T,N/2>
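The three-way choice above can be sketched with ordinary template machinery. The code below is a toy model, not Boost.SIMD's implementation: `native_cardinal`, `storage_kind`, `toy_pack` and the tag types are hypothetical names, a 128-bit register is assumed, and a `std::array` stands in for the native register.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>

// Assumption for illustration: 128-bit registers, so C = 16 / sizeof(T).
template <class T>
struct native_cardinal : std::integral_constant<std::size_t, 16 / sizeof(T)> {};

struct native_tag {};     // N == C: would map to the hardware register
struct scalar_tag {};     // N <  C: plain array of scalars
struct aggregate_tag {};  // N >  C: two half-sized packs

// Selects the storage strategy from N and C at compile time.
template <class T, std::size_t N>
using storage_kind = std::conditional_t<
    N == native_cardinal<T>::value, native_tag,
    std::conditional_t<(N < native_cardinal<T>::value), scalar_tag, aggregate_tag>>;

// Native and scalar cases: in this toy both are backed by a std::array.
template <class T, std::size_t N, class Kind = storage_kind<T, N>>
struct toy_pack
{
  static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of 2");
  std::array<T, N> data;
  T get(std::size_t i) const { assert(i < N); return data[i]; }
};

// N > C: recursively split into a low and a high half.
template <class T, std::size_t N>
struct toy_pack<T, N, aggregate_tag>
{
  toy_pack<T, N / 2> lo, hi;
  T get(std::size_t i) const
  {
    assert(i < N);
    return i < N / 2 ? lo.get(i) : hi.get(i - N / 2);
  }
};
```

Under these assumptions `toy_pack<float,16>` is made of two `toy_pack<float,8>`, each of which is made of two register-sized `toy_pack<float,4>`, so the recursion always ends on the N == C case.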

Getting data into packs

Constructors

• pack<T,N> x(U v): fill x with N copies of v
• pack<T,N> x(U v...): fill x with (v0,v1,...)
• pack<T,N> x(T* ptr): load N elements from aligned memory ptr
• pack<T,N> x(It b, It e): load N elements from the range [b,e)

Explicit Memory Load

• load<T>(U* ptr [,Offset o] )
• load<T>(mask_ptr<U> ptr [,Offset o] )
• aligned_load<T>(U* ptr [,int o] )
• aligned_load<T>(mask_ptr<U> ptr [,int o] )

Misaligned Loads

• aligned_store<T,N>(ptr)
• Load an unaligned address with static misalignment
• Optimized to be faster than unaligned load

Explicit Memory Store

• store<T>(U* ptr [,Offset o] )
• store<T>(mask_ptr<U> ptr [,Offset o] )
• aligned_store<T>(U* ptr [,int o] )
• aligned_store<T>(mask_ptr<U> ptr [,int o] )

Supported operations on pack

Basic Operators

• All operators are available, with possible scalar mixing
• No implicit conversion or promotion

Comparisons

• ==, !=, <, <=, >, >= perform SIMD (element-wise) comparisons.
• compare_equal, compare_less perform reductive comparisons.

Other properties

• Models RandomAccessRange
• p[i] returns a proxy to access the register's internal value
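The difference between the two comparison families can be sketched in plain C++. Here `std::array<bool,N>` stands in for a pack of `logical<T>`, and the names `lanewise_less` and `whole_equal` are hypothetical. The reductive function assumes "equal" means "every lane is equal"; the exact semantics of `compare_equal` and `compare_less` should be checked in the library documentation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Element-wise comparison: one boolean per lane, like operator< on packs.
template <class T, std::size_t N>
std::array<bool, N> lanewise_less(std::array<T, N> const& a, std::array<T, N> const& b)
{
  std::array<bool, N> m{};
  for (std::size_t i = 0; i < N; ++i) m[i] = a[i] < b[i];
  return m;
}

// Reductive comparison: a single boolean for the whole register.
// Assumption: true only when every pair of lanes compares equal.
template <class T, std::size_t N>
bool whole_equal(std::array<T, N> const& a, std::array<T, N> const& b)
{
  for (std::size_t i = 0; i < N; ++i)
    if (!(a[i] == b[i])) return false;
  return true;
}
```

An element-wise result is what feeds selection and masking functions, while a reductive result is what an ordinary `if` statement needs.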

Selection of available functions

Arithmetic

• saturated arithmetic
• long multiplication
• float/int conversion
• round, floor, ceil, trunc
• sqrt, cbrt
• hypot
• average
• random
• min/max
• rounded division and remainder

Bitwise

• select
• andnot, ornot
• popcnt
• ffs
• ror, rol
• rshr, rshl
• twopower

IEEE

• ilogb
• frexp
• ldexp
• next/prev
• ulpdist
• exponent/mantissa

Predicates

• comparison to zero
• negated comparisons
• is_unord, is_nan, is_invalid
• is_odd, is_even
• majority

Selection of available functions

Reduction

• any, all, none
• nbtrue
• minimum/maximum
• sum
• product, dot product

SWAR

• group/split
• combine/slice
• splatted reduction
• cumsum
• sort
• shuffle
• interleaving
• deinterleaving

{and, Shuffling, Permutation}1,0,2

Principles

• Data reordering is the #1 technique in SIMD
• Use cases: transpose, AoS/SoA transformations
• Turn memory accesses into computations
• Support for specific permutations
• Support for arbitrary shuffling patterns

Basic permutations

• reverse, broadcast
• interleave, deinterleave
• repeat, slide
• runtime lookup
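As an illustration of the AoS/SoA use case, the following plain C++ sketch deinterleaves an array of (x, y) pairs into separate x and y arrays. It only shows the data movement that a deinterleave permutation performs; the function name `deinterleave_xy` is hypothetical and no SIMD instruction is involved.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Splits interleaved data [x0 y0 x1 y1 ...] (array of structures)
// into [x0 x1 ...] and [y0 y1 ...] (structure of arrays).
template <class T, std::size_t N>
std::pair<std::array<T, N / 2>, std::array<T, N / 2>>
deinterleave_xy(std::array<T, N> const& xy)
{
  static_assert(N % 2 == 0, "expects an even number of elements");
  std::array<T, N / 2> xs{}, ys{};
  for (std::size_t i = 0; i < N / 2; ++i)
  {
    xs[i] = xy[2 * i];      // even positions hold x
    ys[i] = xy[2 * i + 1];  // odd positions hold y
  }
  return std::make_pair(xs, ys);
}
```

Once the data is in SoA form, each array can be processed with full-width element-wise operations, which is why this reordering is so common in SIMD code.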

Shuffle

• Arbitrary permutation of elements
• Optimizable if known at compile-time
• Available for one or two parameters


Shuffle with an explicit index list:

    pack<float,4> a{1,2,3,4};
    pack<float,4> b{10,20,30,40};

    // r1 = [1 1 4 4]
    auto r1 = shuffle<0,0,3,3>(a);

    // r2 = [1 0 0 10]
    auto r2 = shuffle<0,-1,-1,4>(a,b);
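The index convention used above (indices 0 to N-1 read the first input, N to 2N-1 read the second, and -1 yields zero) can be modelled on `std::array`. This is a sketch of the semantics only; `toy_shuffle` and `pick` are hypothetical names and the code says nothing about how the library selects instructions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Computes one output lane: -1 yields zero, [0,N) reads a, [N,2N) reads b.
template <class T, std::size_t N>
T pick(int idx, std::array<T, N> const& a, std::array<T, N> const& b)
{
  if (idx < 0) return T(0);
  auto const i = static_cast<std::size_t>(idx);
  assert(i < 2 * N);
  return i < N ? a[i] : b[i - N];
}

// Two-input shuffle: one compile-time index per output lane.
template <int... Is, class T, std::size_t N>
std::array<T, N> toy_shuffle(std::array<T, N> const& a, std::array<T, N> const& b)
{
  static_assert(sizeof...(Is) == N, "one index per output lane");
  return {{ pick(Is, a, b)... }};
}

// One-input shuffle: indices are expected to stay in [0,N) or be -1.
template <int... Is, class T, std::size_t N>
std::array<T, N> toy_shuffle(std::array<T, N> const& a)
{
  return toy_shuffle<Is...>(a, a);
}
```

Because the indices are template arguments, a real implementation can inspect them at compile time and pick a cheap dedicated instruction for recognizable patterns, which is what "optimizable if known at compile-time" refers to.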


Shuffle with a pattern meta-function:

    struct reverse_
    {
      template <class I, class C>
      struct apply : std::integral_constant<int, C::value - I::value - 1>
      {};
    };

    // res = [4 3 2 1]
    pack<float> res = shuffle<reverse_>(a);


Shuffle with a constexpr pattern function:

    constexpr int mix_half(int i, int c)
    {
      return i < c/2 ? i+c : i;
    }

    // res = [10 20 3 4]
    pack<float> res = shuffle<pattern<mix_half>>(a,b);
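One way a constexpr pattern function could be turned into a compile-time index list is to evaluate it for every lane through `std::index_sequence`. The sketch below does this on `std::array` in C++14. `pattern_shuffle`, `pattern_shuffle_impl` and `pick_lane` are hypothetical names; this is an illustration of the idea, not the library's mechanism.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Same pattern as on the slide: the lower half of the lanes comes from b.
constexpr int mix_half(int i, int c) { return i < c / 2 ? i + c : i; }

// Reads lane Idx of the concatenation (a, b); Idx is a compile-time constant.
template <int Idx, class T, std::size_t N>
T pick_lane(std::array<T, N> const& a, std::array<T, N> const& b)
{
  static_assert(Idx >= 0 && Idx < 2 * static_cast<int>(N), "index out of range");
  constexpr std::size_t i = static_cast<std::size_t>(Idx);
  return (i < N ? a : b)[i % N];
}

// Evaluates Pattern(lane, N) at compile time for each lane 0..N-1.
template <int (*Pattern)(int, int), class T, std::size_t N, std::size_t... I>
std::array<T, N> pattern_shuffle_impl(std::array<T, N> const& a,
                                      std::array<T, N> const& b,
                                      std::index_sequence<I...>)
{
  return {{ pick_lane<Pattern(static_cast<int>(I), static_cast<int>(N))>(a, b)... }};
}

template <int (*Pattern)(int, int), class T, std::size_t N>
std::array<T, N> pattern_shuffle(std::array<T, N> const& a, std::array<T, N> const& b)
{
  return pattern_shuffle_impl<Pattern>(a, b, std::make_index_sequence<N>{});
}
```

The advantage of a pattern function over a literal index list is that the same function works for any lane count, so the code does not change when the register width does.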

Integration with the Standard Library

• Algorithms:
  • SIMD transform
  • SIMD reduce
  • Use a generic functor/lambda to mix scalar and SIMD
• Allocators
• Ranges:
  • boost::simd::input_range
  • boost::simd::output_range
  • boost::simd::aligned_input_range
  • boost::simd::aligned_output_range
  • boost::simd::segmented_input_range
  • boost::simd::segmented_output_range


    std::vector<float, simd::allocator<float>> v(N);

    simd::transform( v.begin(), v.end()
                   , [](auto const& p)
                     {
                       return p * 2.f;
                     }
                   );
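The reason a generic lambda is used can be seen in a small model of such an algorithm: full chunks are handed to the callable as a wide value, and leftover elements are handed to the same callable as plain scalars. The sketch below is plain C++14; `float4` and `toy_transform` are hypothetical stand-ins, and the in-place signature is a simplification chosen for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 4-wide value type standing in for pack<float>.
struct float4
{
  std::array<float, 4> lane;
};

// Scalar mixing: multiply every lane by the same scalar.
inline float4 operator*(float4 const& p, float s)
{
  float4 r;
  for (std::size_t i = 0; i < 4; ++i) r.lane[i] = p.lane[i] * s;
  return r;
}

// In-place transform: whole chunks go through f as float4, leftovers as float.
template <class F>
void toy_transform(std::vector<float>& v, F f)
{
  std::size_t i = 0;
  for (; i + 4 <= v.size(); i += 4)
  {
    float4 p{{v[i], v[i + 1], v[i + 2], v[i + 3]}};
    float4 r = f(p);                                  // wide call
    for (std::size_t k = 0; k < 4; ++k) v[i + k] = r.lane[k];
  }
  for (; i < v.size(); ++i) v[i] = f(v[i]);           // scalar call, same f
}
```

With a callable such as `[](auto const& p) { return p * 2.f; }`, the compiler instantiates the body once for `float4` and once for `float`, so a single expression covers both the vector body and the scalar tail.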


    std::vector<float, simd::allocator<float>> i(N), o(N);

    auto x = simd::reduce( i.data(), i.data()+N, 0.f );

    auto y = simd::reduce( i.data(), i.data()+N, 0.f
                         , [](auto&& a, auto&& e) { return a+e*e; }
                         , 0.f, simd::plus
                         );

Under the SIMD Hood

Performance!

Basic Functions

Single precision math functions (cycles/value)
Hardware: Core i7 SandyBridge, AVX

    Function           Range            std   Scalar   SIMD
    exp                [−10, 10]        46    38       7
    log                [−10, −10]       42    37       5
    asin               [−1, 1]          40    35       13
    cos                [−20π, 20π]      66    47       6
    restricted_(cos)   [−π/4, π/4]      32    9        1.3

Julia set generator

• Generate a fractal image using the Julia function
• Largely compute-bound
• Challenge: the workload depends on pixel location

    template <class T> auto julia(T const& a, T const& b)
    {
      as_integer_t<T> res{0};
      std::size_t max_iter{0};
      T x{0}, y{0};
      decltype(x < y) mask;  // declared outside the loop so the while condition can see it

      do
      {
        auto x2 = x * x;
        auto y2 = y * y;
        mask = x2 + y2 < 4;          // lanes that have not escaped yet
        auto xy = 2 * x * y;
        x = x2 - y2 + a;
        y = xy + b;
        res = if_inc(mask, res);     // increment only the active lanes
      } while(any(mask) && max_iter++ < 256);

      return res;
    }
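The masking idea behind this kernel can be modelled in plain C++: all lanes step together, a per-lane flag records which ones are still inside the radius, and the loop stops only when no lane is active. In the sketch below `toy_julia`, `floats`, `ints` and `width` are hypothetical names, the lane count of 4 is an assumption, and the loop is a simplified `for` rather than the do/while above, so iteration counts at the limit can differ by one.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t width = 4;  // assumed lane count
using floats = std::array<float, width>;
using ints   = std::array<int, width>;

// Iterates z = z*z + c on all lanes at once, counting per lane how many
// iterations started with |z|^2 < 4. Stops when no lane is active any more.
inline ints toy_julia(floats const& a, floats const& b, int max_iter = 256)
{
  assert(max_iter >= 0);
  floats x{}, y{};
  ints res{};
  bool any_active = true;

  for (int iter = 0; any_active && iter < max_iter; ++iter)
  {
    any_active = false;
    for (std::size_t k = 0; k < width; ++k)
    {
      float const x2 = x[k] * x[k];
      float const y2 = y[k] * y[k];
      bool const mask = x2 + y2 < 4.f;   // is this lane still active?
      float const xy = 2.f * x[k] * y[k];
      x[k] = x2 - y2 + a[k];             // every lane keeps computing
      y[k] = xy + b[k];
      if (mask) ++res[k];                // but only active lanes count
      any_active = any_active || mask;
    }
  }
  return res;
}
```

The whole register keeps iterating as long as its slowest lane does, so neighbouring pixels with very different iteration counts waste work. This is the challenge named above: the workload depends on pixel location.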

[Figure: Julia set generator timings with Boost.SIMD and other solutions, from "An Evaluation of Current SIMD Programming Models for C++", Pohl et al., 2015]

Interaction with Boost.Odeint

Coupled/Uncoupled Roessler system

• Written by Mario Mulansky
• Showcases the effects of both cache and SIMD
• Uses Boost.Odeint for the ODE system
• Uses Boost.SIMD to vectorize the system

Results

• Minimal disruption in the code
• Overall 3x performance gain
• See the whole code at https://github.com/mariomulansky/olsos


    template <class S, class D>
    void operator()(const S &x_, D &dxdt_, double t) const
    {
      auto x = boost::begin( x_ );
      auto dxdt = boost::begin( dxdt_ );
      const int N = boost::size(x_);
      for( int j=1; j<N/dim-1; ++j )
      {
        const int i = j*dim;
        dxdt[i] = -1.0*x[i + 1] - x[i + 2] +
                  m_d * (x[i - dim] + x[i + dim] - 2.0 * x[i]);
        dxdt[i + 1] = x[i] + m_a * x[i + 1];
        dxdt[i + 2] = m_b + x[i + 2] * (x[i] - m_c);
      }
    }


    // Scalar call
    using state_type = std::vector<double>;
    state_type x(N);

    odeint::runge_kutta4<state_type> rk4;
    odeint::integrate_const(rk4, roessler, x, 0.0, T, dt);

    // Boost.SIMD call
    using alloc_t = simd::allocator<double>;
    using state_type = vector<pack<double>, alloc_t>;
    state_type x( N / pack<double>::static_size );

    odeint::runge_kutta4<state_type> rk4;
    odeint::integrate_const(rk4, roessler, x, 0.0, T, dt);

Conclusion


High-level SIMD in C++11/14

• Designing a C++ library for low-level performance primitives is possible
• C++11/14 features play nicely with SIMD intrinsics
• SIMD-specific idioms map to modern C++ components

Boost.SIMD

• To be proposed for review this fall
• Find us on https://github.com/numscale/boost.simd
• Tests and feedback welcome

This talk would not have been feasible without

The bSIMD team

• Lead Developer: Charly Chevalier
• Developers: Jean-Thierry Lapresté, Guillaume Quintin
• Tests and Doc: Alan Kelly, Kenny Peou

Our supporters

• Tim Blenchman, our earliest adopter
• Serge Guelton, for integrating Boost.SIMD into pythran
• Mario Mulansky, for his work with Boost.SIMD & Boost.Odeint
• Sylvain Jubertie and Ian Masliah, for testing Boost.SIMD in clever ways

Thanks for your attention !
