Designing C++ portable SIMD support
Joel Falcou
NumScale
CppCon 2016
NumScale in a few words
Our company
• French start-up specialized in software performance
• We sell C++ libraries to master modern hardware performance
• Consulting & training on all things C++ or HPC
NumScale and C++
• Member of the ISO C++ French National Body
• Enthusiastic user & contributor to OSS projects
• Involved in the European C++ community
What in the world is SIMD?
Is SIMD that obscure?
Let’s have a poll
• Who needs performance in their daily job?
• Who knows about parallel programming?
• Who knows about SIMD/multimedia extensions?
• Who uses SIMD extensions like SSE, AVX, VMX or NEON?
• Who has nightmares because of those?
What is SIMD? - A French cuisine approach
[Figure sequence; only the captions survive]
• Some workload to process
• A regular CPU
• Single Instruction, Single Data processing
• A SIMD-enabled CPU
• Single Instruction, Multiple Data processing
What is SIMD? - For real
[Figure: instructions, data and results for a SISD unit and for a SIMD unit]
Principles
• Wide registers store N > 1 values.
• Special instructions process those registers.
• Code uses a blitter-like approach.
Benefits
• Speed-up of N on cache-hot data
• Avoid premature scale-out
• Maximize FLOPS/Watt
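The principle can be sketched in portable C++ by modelling a wide register as a small fixed-size array. The names `reg4`, `add`, `add_scalar` and `add_simd_style` are hypothetical and exist only for this illustration; on real hardware `reg4` would be a native 128-bit register and `add` a single instruction.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical model of a 128-bit register holding N = 4 floats.
using reg4 = std::array<float, 4>;

// One "instruction" that processes all four lanes at once.
inline reg4 add(reg4 const& a, reg4 const& b)
{
  return {{ a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3] }};
}

// SISD version: one element per loop iteration.
inline void add_scalar(float const* a, float const* b, float* out, std::size_t n)
{
  for (std::size_t i = 0; i < n; ++i)
    out[i] = a[i] + b[i];
}

// SIMD-style version: one register (four elements) per loop iteration.
// Assumes n is a multiple of 4 to keep the sketch short.
inline void add_simd_style(float const* a, float const* b, float* out, std::size_t n)
{
  assert(n % 4 == 0);
  for (std::size_t i = 0; i < n; i += 4)
  {
    reg4 ra{{ a[i], a[i + 1], a[i + 2], a[i + 3] }};  // load
    reg4 rb{{ b[i], b[i + 1], b[i + 2], b[i + 3] }};  // load
    reg4 rc = add(ra, rb);                            // compute
    for (std::size_t k = 0; k < 4; ++k)               // store
      out[i + k] = rc[k];
  }
}
```

Both functions produce the same results; the second one performs a quarter of the loop iterations, which is where the "speed-up of N" comes from when the data is already in cache.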
1,001 flavors of SIMD
Intel x86
• MMX: 64 bits, int8, int16, int32
• SSE: 128 bits, float
• SSE2: 128 bits, int8, int16, int32, int64, double
• SSE3, SSSE3
• SSE4a (AMD)
• SSE4.1, SSE4.2
• AVX: 256 bits, float, double
• AVX2: 256 bits, int8, int16, int32, int64
• FMA3
• FMA4, XOP (AMD)
• AVX512: 512 bits, float, double, int32, int64
PowerPC
• VMX: 128 bits, int8, int16, int32, int64, float
• VSX: 128 bits, int8, int16, int32, int64, float, double
• QPX: 256 bits, double
ARM
• VFP: 64 bits, float, double
• NEON: 64 bits and 128 bits, double, float, int8, int16, int32, int64
The Many Ways to Vectorize
Implicit vectorization
• Auto-vectorization
• Compiler hints
Explicit vectorization
• Language extensions
• SIMD intrinsics libraries
• Vector intrinsics
• Inline assembly (let’s just say no right now)
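Implicit vectorization
Auto-vectorizer
• Compile-time analysis of loop nest
• May use special hints
• Only safe transformations are applied

    // Standard C++ has no restrict keyword; __restrict is the usual compiler spelling.
    template <typename T>
    void f(T* __restrict a, T* __restrict b, int size)
    {
      // Compiler-specific hint: assume no loop-carried dependencies.
      #pragma ivdep
      for (int i = 0; i < size; ++i)
        a[i] += b[i];
    }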
OpenMP 4
Principle
• Flag a loop as "must be vectorized"
• Support for reductions & SIMD functions
• User is in charge of checking validity

    template <typename T>
    T f(T* a, T* b, int size)
    {
      T res = 0;
      #pragma omp simd reduction(+:res)
      for (int i = 0; i < size; ++i)
      {
        a[i] += b[i];
        res += b[i];
      }
      return res;
    }
SIMD Intrinsics Library
Principle
• Wrap SIMD computations in a library
• Support SIMD idioms with algorithms or other abstractions
• Improve portability across compilers
Examples
• Agner Fog’s x86 library
• Vc, NOVA
• gSIMD, Cyme
• Boost.SIMD
Explicit usage of intrinsics
The same 32-bit integer multiplication, written for each instruction set:

    // NEON
    return vmul_s32(a0, a1);   // 64-bit
    return vmulq_s32(a0, a1);  // 128-bit

    // SSE4.1
    return _mm_mullo_epi32(a0, a1);

    // SSE2
    return _mm_or_si128(
             _mm_and_si128( _mm_mul_epu32(a0, a1)
                          , _mm_setr_epi32(0xffffffff, 0, 0xffffffff, 0)
                          )
           , _mm_slli_si128(
               _mm_and_si128( _mm_mul_epu32( _mm_srli_si128(a0, 4)
                                           , _mm_srli_si128(a1, 4)
                                           )
                            , _mm_setr_epi32(0xffffffff, 0, 0xffffffff, 0)
                            )
             , 4)
           );

    // Altivec
    // reinterpret as u16
    short0 = (__vector unsigned short)a0;
    short1 = (__vector unsigned short)a1;

    // shifting constant
    shift = vec_splat_u32(-16);
    sf = vec_rl(a1, shift);

    // Compute high part of the product
    high = vec_msum( short0, (__vector unsigned short)sf
                   , vec_splat_u32(0));

    // Complete by adding low part of the 16 bits product
    return vec_add( vec_sl(high, shift)
                  , vec_mulo(short0, short1));
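The AltiVec sequence appears to rebuild the low 32 bits of a 32-bit product from 16-bit multiplications. The arithmetic behind it can be checked with a scalar sketch; `mullo32_from_halves` is a hypothetical helper written for this note, not part of any SIMD library.

```cpp
#include <cassert>
#include <cstdint>

// Low 32 bits of a*b, rebuilt from 16-bit halves:
//   a*b = a_hi*b_hi * 2^32 + (a_hi*b_lo + a_lo*b_hi) * 2^16 + a_lo*b_lo
// The first term only affects bits 32 and above, so it can be dropped.
inline std::uint32_t mullo32_from_halves(std::uint32_t a, std::uint32_t b)
{
  std::uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
  std::uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

  std::uint32_t cross = a_hi * b_lo + a_lo * b_hi;  // role played by vec_msum
  std::uint32_t low   = a_lo * b_lo;                // role played by vec_mulo

  return (cross << 16) + low;                       // vec_sl followed by vec_add
}
```

Unsigned arithmetic wraps modulo 2^32, so the result matches a plain `a * b` on `std::uint32_t` for every input.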
Implicit vs Explicit SIMD
Implicit
• Automatic dependency analysis (e.g. reductions).
• Recognises idioms with data dependencies.
• Non-inline functions are scalar.
• Limited support for outer-loop vectorisation.
• Relies on the compiler’s vectorizable patterns library.
Explicit
• No dependency analysis.
• Recognises idioms without data dependencies.
• Non-inline functions can be vectorised.
• Outer loops can be vectorised.
• May be more cross-compiler portable.
bSIMD and Boost.SIMD†
†Boost.SIMD is a candidate for acceptance as a Boost Library
From bSIMD to Boost.SIMD
bSIMD
• NumScale closed source software for SIMD programming
• Explicit SIMD library
• Supports x86, PPC, ARM architectures
• Provides domain-specific algorithms
Boost.SIMD
• Open Source sub-part of bSIMD
• Supports x86 and Power6
• Provides std-like algorithms
The Boost.SIMD register abstraction
pack<T,N>
• Usable as a regular Value Type
• Wraps a block of N contiguous elements of type T
• pack<T> picks the optimal N for current hardware
Constraints
• T is a fundamental type
• logical<T> is used to handle booleans
• N must be a power of 2.
What if?
Let C be the number of elements of type T that fit in the current hardware register.
• If N == C, use the native register directly
• If N < C, use a scalar array (for now)
• If N > C, use an aggregate of 2 x pack<T,N/2>
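The three cases can be written down as a compile-time selection. This is only a sketch of the rule as stated on the slide; `storage`, `pick_storage` and `native_registers` are hypothetical names and do not come from Boost.SIMD.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tag describing how a pack-like type could be stored.
enum class storage { native_register, scalar_array, aggregate_of_halves };

// n: requested number of elements, c: elements per hardware register.
constexpr storage pick_storage(std::size_t n, std::size_t c)
{
  return n == c ? storage::native_register
       : n <  c ? storage::scalar_array
                : storage::aggregate_of_halves;
}

// Native registers needed once every aggregate has been split in halves.
// Returns 0 when a scalar array is used instead of a register.
constexpr std::size_t native_registers(std::size_t n, std::size_t c)
{
  return n > c  ? 2 * native_registers(n / 2, c)
       : n == c ? 1
                : 0;
}
```

For example, with 128-bit registers and `float` (C = 4), a 16-element pack would recursively split into four native registers, while a 2-element pack would fall back to a scalar array.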
Getting data into packs
Constructors
• pack<T,N> x(U v) : fill x with N copies of v
• pack<T,N> x(U v...) : fill x with (v0,v1,...)
• pack<T,N> x(T* ptr) : load N elements from aligned memory ptr
• pack<T,N> x(It b, It e) : load N elements from the range [b,e)
Explicit Memory Load
• load<T>(U* ptr [,Offset o] )
• load<T>(mask_ptr<U> ptr [,Offset o] )
• aligned_load<T>(U* ptr [,int o] )
• aligned_load<T>(mask_ptr<U> ptr [,int o] )
Misaligned Loads
• aligned_load<T,N>(ptr)
• Load an unaligned address with static misalignment
• Optimized to be faster than unaligned load
Explicit Memory Store
• store<T>(U* ptr [,Offset o] )
• store<T>(mask_ptr<U> ptr [,Offset o] )
• aligned_store<T>(U* ptr [,int o] )
• aligned_store<T>(mask_ptr<U> ptr [,int o] )
Supported operations on pack
Basic Operators
• All operators are available, with possible scalar mixing
• No implicit conversion or promotion
Comparisons
• ==, !=, <, <=, >, >= perform SIMD comparisons.
• compare_equal, compare_less perform reductive comparisons.
Other properties
• Models RandomAccessRange
• p[i] returns a proxy to access the register internal value
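The difference between a SIMD comparison and a reductive comparison can be illustrated with a toy type. `toy_pack` and `all_lanes_equal` are hypothetical stand-ins used only for this sketch; they are not the Boost.SIMD `pack` or `compare_equal`, and the real library returns a pack of `logical<T>` rather than an array of `bool`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy stand-in for a SIMD pack: N values of type T.
template <class T, std::size_t N>
struct toy_pack { std::array<T, N> v; };

// Operator with scalar mixing: the scalar is applied to every lane.
template <class T, std::size_t N>
toy_pack<T, N> operator+(toy_pack<T, N> p, T s)
{
  for (auto& e : p.v) e += s;
  return p;
}

// SIMD comparison: one boolean per lane.
template <class T, std::size_t N>
std::array<bool, N> operator<(toy_pack<T, N> const& a, toy_pack<T, N> const& b)
{
  std::array<bool, N> m{};
  for (std::size_t i = 0; i < N; ++i) m[i] = a.v[i] < b.v[i];
  return m;
}

// Reductive comparison: a single boolean for the whole pack.
template <class T, std::size_t N>
bool all_lanes_equal(toy_pack<T, N> const& a, toy_pack<T, N> const& b)
{
  for (std::size_t i = 0; i < N; ++i)
    if (a.v[i] != b.v[i]) return false;
  return true;
}
```

With `a = {1,5,3,7}` and `b = {2,4,3,8}`, `a < b` yields a per-lane mask, whereas `all_lanes_equal(a, b)` collapses the comparison into one `bool`.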
Selection of available functions
Arithmetic
• saturated arithmetic
• long multiplication
• float/int conversion
• round, floor, ceil, trunc
• sqrt, cbrt
• hypot
• average
• random
• min/max
• rounded division and remainder
Bitwise
• select
• andnot, ornot
• popcnt
• ffs
• ror, rol
• rshr, rshl
• twopower
IEEE
• ilogb
• frexp
• ldexp
• next/prev
• ulpdist
• exponent/mantissa
Predicates
• comparison to zero
• negated comparisons
• is_unord, is_nan, is_invalid
• is_odd, is_even
• majority
Reduction
• any, all, none
• nbtrue
• minimum/maximum
• sum
• product, dot product
SWAR
• group/split
• combine/slice
• splatted reduction
• cumsum
• sort
• shuffle
• interleaving
• deinterleaving
{and, Shuffling, Permutation}1,0,2
Principles
• Data reordering is the #1 technique in SIMD
• Use cases: transpose, AoS/SoA transformations
• Turn memory access into computations
• Support for specific permutations
• Support for arbitrary shuffling patterns
Basic permutations
• reverse, broadcast
• interleave, deinterleave
• repeat, slide
• runtime lookup
Shuffle
• Arbitrary permutation of elements
• Optimizable if known at compile-time
• Available for one or two parameters

    pack<float,4> a{1,2,3,4};
    pack<float,4> b{10,20,30,40};

    // r1 = [1 1 4 4]
    auto r1 = shuffle<0,0,3,3>(a);

    // r2 = [1 0 0 10]
    auto r2 = shuffle<0,-1,-1,4>(a,b);

    struct reverse_
    {
      template <class I, class C>
      struct apply : std::integral_constant<int, C::value - I::value - 1>
      {};
    };

    // res = [4 3 2 1]
    pack<float> res = shuffle<reverse_>(a);

    constexpr int mix_half(int i, int c)
    {
      return i < c/2 ? i+c : i;
    }

    // res = [10 20 3 4]
    pack<float> res = shuffle<pattern<mix_half>>(a,b);
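The index convention used in these examples can be reproduced with a toy implementation on plain arrays. This is a sketch of the semantics as inferred from the slide results (index -1 gives zero, indices N and above read from the second argument); `lanes`, `pick`, `toy_shuffle` and `toy_shuffle_pattern` are hypothetical names, and the code makes no claim about how Boost.SIMD implements `shuffle`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

template <std::size_t N>
using lanes = std::array<float, N>;

// Selects one output lane: -1 yields zero, [0,N) reads a, [N,2N) reads b.
template <std::size_t N>
float pick(int idx, lanes<N> const& a, lanes<N> const& b)
{
  assert(idx >= -1 && idx < static_cast<int>(2 * N));
  if (idx < 0) return 0.f;
  std::size_t i = static_cast<std::size_t>(idx);
  return i < N ? a[i] : b[i - N];
}

// Two-input shuffle with indices given at compile time.
template <int... Is>
lanes<sizeof...(Is)> toy_shuffle(lanes<sizeof...(Is)> const& a,
                                 lanes<sizeof...(Is)> const& b)
{
  return {{ pick(Is, a, b)... }};
}

// One-input shuffle: the second input is all zeros and never selected
// as long as every index is below N.
template <int... Is>
lanes<sizeof...(Is)> toy_shuffle(lanes<sizeof...(Is)> const& a)
{
  return toy_shuffle<Is...>(a, lanes<sizeof...(Is)>{});
}

// Pattern-driven shuffle: f(i, N) gives the source index of output lane i.
template <std::size_t N, class F>
lanes<N> toy_shuffle_pattern(F f, lanes<N> const& a, lanes<N> const& b)
{
  lanes<N> r{};
  for (std::size_t i = 0; i < N; ++i)
    r[i] = pick(f(static_cast<int>(i), static_cast<int>(N)), a, b);
  return r;
}

// Same pattern as on the slide: first half from b, second half from a.
constexpr int mix_half(int i, int c) { return i < c / 2 ? i + c : i; }
```

With `a = {1,2,3,4}` and `b = {10,20,30,40}`, this toy version gives the same three results as the slide comments. A real implementation would map each compile-time pattern to the cheapest available permutation instruction instead of looping over lanes.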
Integration with the Standard Library
• Algorithms:
  - SIMD transform
  - SIMD reduce
  - Use generic functor/lambda for mixing scalar/SIMD
• Allocators
• Ranges:
  - boost::simd::input_range
  - boost::simd::output_range
  - boost::simd::aligned_input_range
  - boost::simd::aligned_output_range
  - boost::simd::segmented_input_range
  - boost::simd::segmented_output_range

    std::vector<float, simd::allocator<float>> v(N);

    simd::transform( v.begin(), v.end()
                   , [](auto const& p) { return p * 2.f; }
                   );

    std::vector<float, simd::allocator<float>> i(N), o(N);

    auto x = simd::reduce( i.data(), i.data()+N, 0.f );

    auto y = simd::reduce( i.data(), i.data()+N, 0.f
                         , [](auto&& a, auto&& e) { return a+e*e; }
                         , 0.f, simd::plus
                         );
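Why a generic lambda helps when mixing scalar and SIMD code can be seen in a toy transform: the same callable is applied to full chunks as packs and to leftover elements as scalars. This is a sketch under that assumption; `toy_pack` and `toy_transform` are hypothetical and do not reflect the actual `simd::transform` implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a SIMD pack: N values of type T.
template <class T, std::size_t N>
struct toy_pack { std::array<T, N> v; };

// pack * scalar, applied to every lane.
template <class T, std::size_t N>
toy_pack<T, N> operator*(toy_pack<T, N> p, T s)
{
  for (auto& e : p.v) e *= s;
  return p;
}

// Applies f to full chunks of N elements as packs,
// then to the remaining elements one scalar at a time.
template <std::size_t N, class T, class F>
void toy_transform(std::vector<T> const& in, std::vector<T>& out, F f)
{
  assert(in.size() == out.size());
  std::size_t i = 0;
  for (; i + N <= in.size(); i += N)
  {
    toy_pack<T, N> p;
    for (std::size_t k = 0; k < N; ++k) p.v[k] = in[i + k];  // load
    toy_pack<T, N> r = f(p);                                 // pack call
    for (std::size_t k = 0; k < N; ++k) out[i + k] = r.v[k]; // store
  }
  for (; i < in.size(); ++i)
    out[i] = f(in[i]);                                       // scalar call
}
```

A call such as `toy_transform<4>(in, out, [](auto const& p) { return p * 2.f; });` instantiates the lambda twice, once for `toy_pack<float,4>` and once for `float`, so a 7-element input is handled as one pack plus three scalars.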
Under the SIMD Hood
Performance!
Basic Functions
Single precision math functions (cycles/value)
Hardware: Core i7 SandyBridge, AVX

    Function          Range          std   Scalar   SIMD
    exp               [−10, 10]       46     38      7
    log               [−10, −10]      42     37      5
    asin              [−1, 1]         40     35     13
    cos               [−20π, 20π]     66     47      6
    restricted_(cos)  [−π/4, π/4]     32      9      1.3
Julia set generator
• Generate a fractal image using the Julia function
• Largely compute-bound
• Challenge: workload depends on pixel location

    template <class T> auto julia(T const& a, T const& b)
    {
      as_integer_t<T> res{0};
      std::size_t max_iter{0};
      T x{0}, y{0};

      do
      {
        auto x2 = x * x;
        auto y2 = y * y;
        auto mask = x2 + y2 < 4;
        auto xy = 2 * x * y;
        x = x2 - y2 + a;
        y = xy + b;
        res = if_inc(mask, res);
      } while(any(mask) && max_iter++ < 256);

      return res;
    }
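For comparison, the same loop can be written for a single point. This scalar sketch mirrors the slide code lane by lane; `julia_scalar` is a hypothetical name, and the `if` statement stands in for `if_inc(mask, res)`.

```cpp
#include <cassert>

// Scalar counterpart of the SIMD kernel: iteration count for one point (a, b).
inline int julia_scalar(float a, float b)
{
  int res = 0;
  int iter = 0;
  float x = 0.f, y = 0.f;
  bool inside = false;

  do
  {
    float x2 = x * x;
    float y2 = y * y;
    inside = x2 + y2 < 4.f;   // one lane of the SIMD mask
    float xy = 2.f * x * y;
    x = x2 - y2 + a;
    y = xy + b;
    if (inside) ++res;        // scalar form of if_inc(mask, res)
  } while (inside && iter++ < 256);

  return res;
}
```

In the scalar version the loop stops as soon as the point escapes. In the SIMD version `any(mask)` keeps the loop running until every lane has escaped, which is why the workload depends on pixel location: a pack costs as much as its slowest pixel.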
Julia set generator
[Figure: timing with Boost.SIMD and other solutions, from "An Evaluation of Current SIMD Programming Models for C++", Pohl et al., 2015]
Interaction with Boost.Odeint
Coupled/Uncoupled Roessler system
• Written by Mario Mulansky
• Showcases the effects of both cache and SIMD
• Uses Boost.Odeint for the ODE system
• Uses Boost.SIMD to vectorize the system
Results
• Minimal disruption in the code
• Overall 3x performance gain
• See the whole code at https://github.com/mariomulansky/olsos

    template <class S, class D>
    void operator()(const S &x_, D &dxdt_, double t) const
    {
      auto x = boost::begin( x_ );
      auto dxdt = boost::begin( dxdt_ );
      const int N = boost::size(x_);
      for( int j=1; j<N/dim-1; ++j )
      {
        const int i = j*dim;
        dxdt[i] = -1.0*x[i + 1] - x[i + 2] +
                  m_d * (x[i - dim] + x[i + dim] - 2.0 * x[i]);
        dxdt[i + 1] = x[i] + m_a * x[i + 1];
        dxdt[i + 2] = m_b + x[i + 2] * (x[i] - m_c);
      }
    }

    // Scalar call
    using state_type = std::vector<double>;
    state_type x(N);

    odeint::runge_kutta4<state_type> rk4;
    odeint::integrate_const(rk4, roessler, x, 0.0, T, dt);

    // Boost.SIMD call
    using alloc_t = simd::allocator<double>;
    using state_type = std::vector<pack<double>, alloc_t>;
    state_type x( N / pack<double>::static_size );

    odeint::runge_kutta4<state_type> rk4;
    odeint::integrate_const(rk4, roessler, x, 0.0, T, dt);
Conclusion
High level SIMD in C++11/14
• Designing a C++ library for low-level performance primitives is possible
• C++11/14 features play nicely with SIMD intrinsics
• SIMD-specific idioms map to modern C++ components
Boost.SIMD
• To be proposed for review this fall
• Find us on https://github.com/numscale/boost.simd
• Tests and feedback welcome
This talk would not have been feasible without
The bSIMD team
• Lead Developer: Charly Chevalier
• Developers: Jean-Thierry Lapresté, Guillaume Quintin
• Tests and Doc: Alan Kelly, Kenny Peou
Our supporters
• Tim Blenchman, our earliest adopter
• Serge Guelton, for integrating Boost.SIMD into pythran
• Mario Mulansky, for his work with Boost.SIMD & Boost.Odeint
• Sylvain Jubertie, Ian Masliah, for testing Boost.SIMD in clever ways
Thanks for your attention !