Copyright© 2011, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Intel® Software Development Products
Getting Existing Code to Auto-Vectorize with
Intel® Composer XE
Alex Wells and Brandon Hewitt, March 10, 2011
Optimization Notice
Before We Start
• All webinars are recorded and will be available on Intel Learning Lab within a week
– http://www.intel.com/go/learninglab
• Use the Question Module to ask questions
• If you have audio issues, give it 5 seconds to resolve. If audio issues persist, we suggest:
– Drop and reenter the webinar
– Call into the phone bridge
How Intel® Composer XE can improve your application performance
• Intel® Composer XE is the latest compiler offering from Intel for 32-bit and 64-bit applications
• What are SIMD instructions and how do they help performance
• What is auto-vectorization and how does it help generate optimized SIMD code
• What are the typical obstacles to using the auto-vectorizer and how do I resolve them
• What's improved in the vectorizer in Composer XE
• Skinning Kernel Example
• Using high-level objects while still vectorizing
• Using Strided Array Access to Vectorize Arrays of Structures
• Combining these Techniques into a Kernel Template Library
• Performance with Intel® Advanced Vector Extensions
• More options and info
SIMD: Single Instruction, Multiple Data
• Scalar mode: one instruction produces one result
• SIMD processing: with Intel® Streaming SIMD Extensions (SSE) or Advanced Vector Extensions (AVX) instructions, one instruction can produce multiple results

[Figure: a scalar X + Y produces a single sum, while one SIMD add computes x0+y0, x1+y1, …, x7+y7 at once]
Automatic Vectorization by the Compiler
• Translates loops into SIMD parallelism: the loop is stripmined (unrolled), with a strip length of 8 for floats with Intel® AVX, or 4 for floats with Intel® SSE (128-bit registers)

for (i=0; i<=MAX; i++)
  c[i] = a[i] + b[i];

[Figure: A[0..7] and B[0..7] are added element-wise into C[0..7] by SIMD adds]
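Stripmining can be pictured in plain scalar code. This is a conceptual sketch of the transformation the compiler performs (not the actual generated code): the outer loop advances in strips of 8 floats, the AVX width, and a remainder loop handles leftover elements.

```cpp
#include <cstddef>

// Scalar picture of stripmining for floats with Intel AVX: each 8-wide
// strip of the inner loop corresponds to one SIMD add in generated code.
void addStripmined(const float *a, const float *b, float *c, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)               // one strip = one SIMD add
        for (std::size_t lane = 0; lane < 8; ++lane)
            c[i + lane] = a[i + lane] + b[i + lane];
    for (; i < n; ++i)                        // scalar remainder loop
        c[i] = a[i] + b[i];
}
```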
Did my loop auto-vectorize?

icc -vec-report1 -c Multiply.c
Multiply.c(31): (col. 3) remark: LOOP WAS VECTORIZED.

• Vectorization and reports are enabled only at -O2 and above
• Intel® SSE2 is enabled by default
• Intel® AVX is enabled with /Q[a]xAVX, -[a]xAVX
– Read the documentation to find the option that works best for you
• If you use /arch:IA32 or -mia32, the vectorizer is disabled
• /Qvec-reportN or -vec-reportN options enable reports
– N is a number from 0-5
– 1 reports any loops vectorized
– Recommend 3 to get information on loops vectorized and not vectorized, and why
Obstacles to Successful Vectorization: Pointer Aliasing and Loop Iteration Dependencies
• Vectorization entails changes in the order of operations within a loop, since each SIMD instruction operates on several data elements at once. Vectorization is only possible if this change of order does not change the results of the calculation.
• One major cause can be the assumption of pointer aliasing:

void add(char *cp_a, char *cp_b, int n) {
  for (int i = 0; i < n; i++) {
    cp_a[i] += cp_b[i];
  }
}

• The compiler must by default assume cp_a and cp_b can overlap in memory, causing potential loop dependencies (e.g. a write to cp_a[1] might affect the read from cp_b[12])
• For simple cases like this one, the compiler can insert runtime checks for aliasing and still vectorize
– Extra checks can impact performance
– Difficult to resolve for more complicated algorithms
Solutions for Loop Dependencies and Pointer Aliasing
• #pragma ivdep
– Just place it before the loop, and the compiler will assume no dependencies unless it can prove otherwise
– The compiler may still be able to prove dependencies, which will still cause auto-vectorization problems
– Applied per loop
• restrict keyword
– Use with pointer declarations to assert that memory is only referenced through that pointer
– Requires an extra compiler option (/Qrestrict or -restrict), or C99
– Applied per pointer
• There are also compiler options to affect compiler assumptions about loop aliasing:
– /Qansi-alias, /Oa, /Qalias-args- (Windows*)
– -ansi-alias, -fno-alias, -fargument-noalias (Linux*/Mac OS*)
• Note that any of these changes may result in incorrect code if the assumptions being changed are not correct (e.g. your pointers do overlap)
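The add() loop from the previous slide with the no-aliasing assertion made explicit. Standard C++ has no restrict keyword, so this sketch uses the __restrict extension, which icc, gcc, and MSVC all accept (in C99 the keyword is spelled restrict).

```cpp
// With __restrict the compiler may assume cp_a and cp_b never overlap,
// so the loop can vectorize without runtime aliasing checks. Calling it
// with overlapping pointers would be undefined behavior.
void add(char *__restrict cp_a, const char *__restrict cp_b, int n) {
    for (int i = 0; i < n; i++)
        cp_a[i] += cp_b[i];
}
```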
Obstacles to Successful Vectorization: Function Calls
• Vectorization cannot occur when function calls are made in the loop
– It can be hard to recognize that calls are happening, especially in C++
• Solutions
– Inline wherever possible
– Use __forceinline on function definitions
– Use #pragma forceinline recursive to inline all calls (and nested calls) in a statement
– Use __declspec(vector) on function declarations and definitions to safely auto-vectorize them (part of Intel® Cilk™ Plus)
– Many standard math library functions already vectorize
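A small hypothetical illustration of the inlining point: because scaleElem is an ordinary inline function, the compiler can fold it into the loop body, leaving the loop a clean vectorization candidate, whereas an opaque out-of-line call in the same spot would block vectorization.

```cpp
// Hypothetical helper; once inlined, the loop below contains no calls
// and can be auto-vectorized.
inline float scaleElem(float x, float s) { return x * s; }

void scaleAll(float *a, float s, int n) {
    for (int i = 0; i < n; i++)
        a[i] = scaleElem(a[i], s);
}
```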
Obstacles to Successful Vectorization: Not Enough Work
• The compiler may decide that vectorizing a loop will not generate more efficient code.
• Solution
– Use #pragma vector always to override compiler heuristics
– Note that the compiler will still not vectorize if it determines it's unsafe or illegal to do so
– Can use #pragma vector always assert to get a compile-time error if the loop does not vectorize
– Be sure that vectorization does improve performance before use
Obstacles to Successful Vectorization: C++ STL Data Types
• Can be done, but you need to help the compiler

$ cat vec-stl-1.cpp
#include <vector>

std::vector<double> XX, YY;

void foo(int iters, int x, int y) {
  for (int i=0; i < iters; i++) {
    XX[x+i] += YY[y+i];
  }
}

$ icc -vec-report3 -c vec-stl-1.cpp
vec-stl-1.cpp(6): (col. 3) remark: loop was not vectorized: existence of vector dependence.
vec-stl-1.cpp(7): (col. 7) remark: vector dependence: assumed FLOW dependence between _M_start line 7 and _M_start line 7.

• Some similar problems with loop dependence (due to the STL implementation)
• Also a problem with the compiler not recognizing global addresses as invariant
Solution for Enabling Auto-vectorization

$ cat vec-stl-2.cpp
#include <vector>

std::vector<double> XX, YY;

void foo(int iters, int x, int y) {
  double *x1 = &XX[x];
  double *y1 = &YY[y];
#pragma ivdep
  for (int i=0; i < iters; i++) {
    x1[i] += y1[i];
  }
}

$ icc -vec-report3 -c vec-stl-2.cpp
vec-stl-2.cpp(9): (col. 3) remark: LOOP WAS VECTORIZED.
Obstacles to Successful Vectorization: Data Alignment
• Data alignment (on 16 bytes) was a major issue for Intel® SSE performance prior to Intel® Core™ i7
• It's still a potential performance issue, as the compiler will generate runtime checks for alignment
• It will become important again in the next generation of Core i7 (with Intel® AVX instructions), this time on 32 bytes
• Two places where explicit coding is needed:
– When declaring new pointers
– When declaring function arguments
Data Alignment Solutions
• For new arrays:
– __declspec(align(N)) type name[bounds];
• For new dynamically allocated memory:
– type * p = (type*) _mm_malloc(size, N); … _mm_free(p);
– Threading Building Blocks memory allocator: scalable_aligned_malloc / scalable_aligned_free
– Can also overload new/delete operators for classes
• For function arguments:
– __assume_aligned(name, N);
• For specific loops:
– #pragma vector aligned/unaligned
• Can cause runtime exceptions if the assumption is false
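A portable sketch of 32-byte aligned allocation for AVX data. The slide's Intel-specific route is _mm_malloc/_mm_free (or TBB's scalable_aligned_malloc); this variant uses C++17 std::aligned_alloc instead, which requires the requested size to be a multiple of the alignment.

```cpp
#include <cstdlib>
#include <cstddef>

// Allocate `count` floats on a 32-byte boundary for AVX loads/stores.
// std::aligned_alloc needs size to be a multiple of the alignment, so
// round the byte count up to the next multiple of 32. Free with free().
float *allocAvxFloats(std::size_t count) {
    std::size_t bytes = count * sizeof(float);
    std::size_t rounded = (bytes + 31u) & ~static_cast<std::size_t>(31u);
    return static_cast<float *>(std::aligned_alloc(32, rounded));
}
```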
Improvements in Intel® Composer XE (compared to previous versions of the Intel® C++ Compiler)
• Mixed data types
– Loops containing mixed data types (such as float and double) will now auto-vectorize

$ cat mix-data.c
void foo(int n, float *restrict A, double *restrict B)
{
  int i;
  float t = 0.0f;
  for (i=0; i<n; i++) {
    A[i] = t;
    B[i] = t;
    t += 1.0f;
  }
}

11.1 Update 7:
$ icc -vec-report3 -restrict -c mix-data.c
mix-data.c(4): (col. 3) remark: loop was not vectorized: unsupported data type.

12.0 Update 2:
$ icc -vec-report3 -restrict -c mix-data.c
mix-data.c(4): (col. 3) remark: LOOP WAS VECTORIZED.
Improvements in Intel® Composer XE (compared to previous versions of the Intel® C++ Compiler)
• Support for more complicated nested if statements

$ cat vec-if.c
void foo(int n, double *A, double *B, double *C)
{
  int i;
  #pragma ivdep
  for (i=0; i<n; i++) {
    if (A[i] > 0 || B[i] < 0) {
      C[i] = 0;
    }
    else {
      C[i] = 1;
    }
  }
}

11.1 Update 7:
$ icc -vec-report3 -c vec-if.c
vec-if.c(9): (col. 7) remark: loop was not vectorized: statement cannot be vectorized.

12.0 Update 2:
$ icc -vec-report3 -c vec-if.c
vec-if.c(4): (col. 3) remark: LOOP WAS VECTORIZED.
Application of these Techniques in Skinning Kernel Example
For Auto-Vectorization, data needs to be accessed component-wise as arrays
• Skinning Kernel Example: compute the influence of a joint over a set of vertices.
• One array per component of the input and output data streams
• Expand high-level math to be performed per component

High-level form:
out[i] = (offset[i]*joint)*normalizedWeight[i];

Per-component form:
#pragma vector always assert
#pragma ivdep
for(unsigned int i=0; i < count; ++i) {
  const float x = offsetX[i];
  const float y = offsetY[i];
  const float z = offsetZ[i];
  const float nw = normalizedWeight[i];
  outX[i] = (x * joint.m[0][0] + y * joint.m[1][0] + z * joint.m[2][0] + joint.m[3][0]) * nw;
  outY[i] = (x * joint.m[0][1] + y * joint.m[1][1] + z * joint.m[2][1] + joint.m[3][1]) * nw;
  outZ[i] = (x * joint.m[0][2] + y * joint.m[1][2] + z * joint.m[2][2] + joint.m[3][2]) * nw;
}
You can express a kernel's algorithm using higher-level objects, like Point or Matrix, and still auto-vectorize!
• Within a kernel body:
– Local objects can be created on the stack
– Only arrays external to the kernel body need to be accessed by the control loop's index
– Fixed-offset data members can be accessed
– Methods may be called as long as they inline
• Auto-vectorization will still work!

float length = offset.length();
Vector3 scaledOffset = offset*scale;

const Vector3 vertex(offsetX[i], offsetY[i], offsetZ[i]);
float lengthSquared = (offset.x*offset.x) + (offset.y*offset.y) + (offset.z*offset.z);
float length = sqrt(lengthSquared);
Define high-level helper classes for use inside the kernel body
• All helper classes & methods inline code to the stack
• The compiler optimizes out needless copies: by the time the auto-vectorizer gets to it, the code resembles the long-hand per-component version

Algorithm expressed using high-level objects:
const Matrix4x3 joint(usersJoint);
#pragma vector always assert
#pragma ivdep
for(unsigned int i=0; i < count; ++i) {
  const Vector3 offset(offsetX[i], offsetY[i], offsetZ[i]);
  const float nw = normalizedWeight[i];
  const Point3 out = (offset*joint)*nw;
  outX[i]=out.x(); outY[i]=out.y(); outZ[i]=out.z();
}
Anything special about the helper classes?
• Flat data layout
• All methods are __forceinline
• Otherwise plain old C++

Flat layout:
struct Matrix4x3
{
private:
  float m00; float m01; float m02;
  float m10; float m11; float m12;
  float m20; float m21; float m22;
  float m30; float m31; float m32;
public:
  // ... implement methods
};

vs. the 2D-array layout:
struct Matrix4x3
{
private:
  float m[4][3];
public:
  // ... implement methods
};

__forceinline Vector3 Vector3::operator*(float aScalar)
{
  x() *= aScalar;
  y() *= aScalar;
  z() *= aScalar;
  return *this;
}
Multiple Arrays: getting data to the "vector" SSE registers in the CPU
• SIMD lanes (registers) must be loaded with the same logical kernel value from multiple elements.
• Best is linear memory access (a contiguous array per component): one instruction to load.
• Multiple arrays can lose cache locality and pressure the hardware prefetchers and the TLB versus a single AOS.

C[i] = A[i] + B[i];

[Figure: contiguous arrays A, B, and C; elements A0–A3 and B0–B3 are loaded into CPU SSE registers XMM0 and XMM1 and added to produce C0–C3]
Tiled SOA: A Hybrid Data Layout
• We can bring back cache locality by changing the data layout to an Array of Structures of Fixed-Length Arrays
– Let's just call it "Tiled SOA"
– Requires an outer loop to walk through tiles while the inner loop vectorizes over the tile's Width
• A single data stream made up of blocks of fixed-sized arrays
– Provides a linear memory access pattern

const size_t Width = 4;
struct InputTile
{
  float a[Width];
  float b[Width];
};
InputTile inStreamOfTiles[2500];

[Figure: CPU SSE registers XMM0 and XMM1 loaded from the tiled stream A0–A3, B0–B3, A4–A7, B4–B7, A8–A11, B8–B11, …]

for(size_t tileIndex=0u; tileIndex < 2500; ++tileIndex)
{
  const InputTile & tile = inStreamOfTiles[tileIndex];
  const float * A = tile.a;
  const float * B = tile.b;
  float * C_forTile = &C[tileIndex*Width];
  #pragma ivdep
  for(size_t index=0u; index < Width; ++index)
  {
    C_forTile[index] = A[index] + B[index];
  }
}
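The tiled loop above references an output array C defined elsewhere; here is a self-contained sketch of the same technique with the tile count passed in as a parameter (the #pragma ivdep annotation is omitted for portability).

```cpp
#include <cstddef>

const std::size_t Width = 4;
struct InputTile {
    float a[Width];
    float b[Width];
};

// Outer loop walks the tiles; the inner unit-stride loop over Width is
// the auto-vectorization candidate. C must hold tileCount*Width floats.
void addTiles(const InputTile *tiles, std::size_t tileCount, float *C) {
    for (std::size_t t = 0; t < tileCount; ++t) {
        const InputTile &tile = tiles[t];
        float *C_forTile = &C[t * Width];
        for (std::size_t i = 0; i < Width; ++i)
            C_forTile[i] = tile.a[i] + tile.b[i];
    }
}
```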
Array of Structures: Use Strided Array Access to pack component data into SIMD lanes
• If A and B were in AOS format:

struct Input
{
  float a;
  float b;
};
Input in[10000];

C[i] = in[i].a + in[i].b;

• Keep a pointer to each component
• Use strided array access to operate on the correct component within the elements

float *A = &in[0].a;
float *B = &in[0].b;
const unsigned int Stride=2;
C[i] = A[i*Stride] + B[i*Stride];

[Figure: interleaved array in (A0 B0 A1 B1 …); strided loads pack A0–A3 into XMM0 and B0–B3 into XMM1, which are added into C0–C3]
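Wrapping the strided-access idiom from the slide into a complete function (mirroring the slide's technique of indexing one component pointer with the structure's stride in floats):

```cpp
struct Input {
    float a;
    float b;
};

// Walk each component of the AOS layout with a stride of 2 floats
// (sizeof(Input)/sizeof(float)), so the vectorizer can emit strided
// loads that pack one component per SIMD lane.
void addAos(const Input *in, float *c, int n) {
    const float *A = &in[0].a;
    const float *B = &in[0].b;
    const int Stride = 2;
    for (int i = 0; i < n; i++)
        c[i] = A[i * Stride] + B[i * Stride];
}
```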
Improving AOS access
• Maintaining a pointer per component increases register pressure, as different pointers are used to access each component.
• Alternatively, the data accesses can be defined in terms of a single pointer + the fixed offset to the data member of the structure + (stride of the structure * index).
• stddef.h contains a macro that computes the fixed offset to a structure's data member at compile time: offsetof(structure, member)
• It would be nice to have a template library that simplifies accessing AOS and Tiled SOA data for vectorization…

typedef unsigned char BYTE;
const BYTE * inArray = reinterpret_cast<const BYTE *>(&in[0]);
for(size_t index=0u; index < 10000; ++index)
{
  const BYTE * pointerA = inArray + offsetof(Input, a) + (sizeof(Input)*index);
  const BYTE * pointerB = inArray + offsetof(Input, b) + (sizeof(Input)*index);
  float localA = *reinterpret_cast<const float *>(pointerA);
  float localB = *reinterpret_cast<const float *>(pointerB);
  C[index] = localA + localB;
}
Kernel Template Library (KTL)
• Express algorithms with high-level helper objects
• Maps accessing multiple arrays into high-level objects
• Hides accessing AOS as strided arrays from the user
• Supports Tiled SOA data streams
• Applies the kernel to each element of the data streams in a data-parallel manner:
– vectorized,
– threaded,
– vectorized + threaded,
– or serial (for non-vectorizing compilers or comparison)
• Utilizes only C++0x standard features, allowing other compilers to still compile the code
• Generates code that optimizes well
Kernel Code that Optimizes Well
• GOAL: lay out code to give the compiler all available information, enabling it to perform better optimizations.
• Put the constants, loop control, and kernel body on the same stack level with no interruptions.
– A function call interrupts the optimizer's view of the stack.
• No function calls
– The kernel body must inline inside the control loop
– All functions or methods called must inline
• Constants and variables used inside the kernel must be declared on the stack
– A reference or pointer to a constant isn't good enough: the compiler can't assume it hasn't changed in between iterations
– With a local stack instance, the compiler can prove it wasn't changed and only load a register once
• Identify array data as independent
Code Layout that Optimizes Well

#pragma forceinline recursive
{
  // Define the context for the kernel body here:
  // local stack variables for all constants
  Constant1 constant1;
  Constant2 constant2;
  //...
  const Array1 * array1;
  Array2 * array2;
  //...

  // Loop control structure
  #pragma ivdep
  for(size_t index=0; index < ElementCount; ++index)
  {
    // define kernel body here, using only context variables
    // declared on the stack above
    array2[index] = array1[index]*constant1 + constant2;
  }
}

• All of the objects and helpers need to inline for auto-vectorization to work
• Identify array data as independent
• No function calls between the local declarations and the loop to interrupt the stack
C++0x Lambda Functions with a template function can generate such a code layout

// User code defines the kernel as a lambda function that operates
// on a single element
auto kernel = [=](const size_t &iIndex)
{
  array2[iIndex] = array1[iIndex]*constant1 + constant2;
};
// template library function generates the code for loop control
layoutCodeForKernel(kernel, ElementCount);

template <typename KernelT>
__declspec(noinline)
void layoutCodeForKernel(const KernelT &iKernel, const size_t iElementCount)
{
  #pragma forceinline recursive
  {
    // Copy kernel closure onto the local stack
    KernelT localKernel(iKernel);
    // Loop control structure
    #pragma ivdep
    for(size_t index=0u; index < iElementCount; ++index)
    {
      localKernel(index);
    }
  }
}
• Use the "auto" keyword to hide the type of the right-hand side
• Use a lambda function with capture by value to define the body of the kernel
• Any external variables accessed will be implicitly captured by value inside the lambda function's closure
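The pattern above can be compiled without the Intel-specific annotations. This portable sketch drops __declspec(noinline), #pragma forceinline recursive, and #pragma ivdep, keeping the essential idea: copy the kernel closure onto the local stack, then run the control loop over it.

```cpp
#include <cstddef>

// Generic loop-control template: the lambda closure (with its by-value
// captures) is copied to a local, so the optimizer can see every
// constant on the current stack frame while it unrolls the loop.
template <typename KernelT>
void layoutCodeForKernel(const KernelT &iKernel, std::size_t iElementCount) {
    KernelT localKernel(iKernel);   // closure copied onto the local stack
    for (std::size_t index = 0; index < iElementCount; ++index)
        localKernel(index);         // kernel body inlines into this loop
}
```

Usage mirrors the slide: define the kernel as a capture-by-value lambda over the context pointers and constants, then hand it to the template.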
Kernel Template Library Example
• Using KTL to vectorize our skinning example when data is in Array of Structures (AOS) format.

struct InputElement
{
  Vector3 offset;
  float weight;
};

struct OutputElement
{
  Vector3 position;
};

const size_t elementCount = 10000;
InputElement containerInput[elementCount];
OutputElement containerOutput[elementCount];

const Matrix4x3 * joint;
Kernel Template Library Example
• Using KTL to vectorize our skinning example when data is in Array of Structures (AOS) format.

const InputElement * inArray = &containerInput[0];
OutputElement * outArray = &containerOutput[0];
const Matrix4x3 localJoint(*joint);

auto kernel = [=](const ElementAccess &access)
{
  // Access array pointers as ktl elements
  // (ktl elements encapsulate strided arrays and loop control indexes)
  auto inElem = access.inputElementOfAos(inArray);
  auto outElem = access.outputElementOfAos(outArray);

  // Get all input data out of the element(s) into local high-level objects
  Vector3 localOffset;
  float localWeight;
  inElem.get<offsetof(InputElement, offset)>(localOffset);
  inElem.get<offsetof(InputElement, weight)>(localWeight);

  // Express the algorithm using high-level objects
  const Vector3 result = (localOffset*localJoint)*localWeight;

  // Set all output element(s) with the results
  outElem.set<offsetof(OutputElement, position)>(result);
};
ktl::Engine::vectorizeKernel(kernel, elementCount);
• ElementAccess will create elements that encapsulate getting or setting data for the current iteration
• Use the C++0x keyword "auto" to hide template types
• Use a C++0x lambda function to define the body of the kernel
• Get the element data onto the stack in terms of high-level objects
• Express the algorithm using high-level objects
• Store the results out in terms of high-level objects
• Ask the Engine to vectorize the kernel body
Kernel Template Library Example
• Can use macros to simplify

KTL_KERNEL_BEGIN(kernel)
{
  KTL_ELEM_IN_AOS(inElem, inArray)
  KTL_ELEM_OUT_AOS(outElem, outArray)

  KTL_LOCAL_FROM_MEMBER(localOffset, inElem, offset)
  KTL_LOCAL_FROM_MEMBER(localWeight, inElem, weight)

  const Vector3 result = (localOffset*localJoint)*localWeight;

  KTL_SET_MEMBER_WITH(outElem, position, result)
}
KTL_KERNEL_END
ktl::Engine::vectorizeKernel(kernel, elementCount);
Intel® AVX performance
• Automatic CPU dispatch to add Intel® AVX code paths: /QaxAVX /Qdiag-enable:cpu-dispatch
• Weighted Joint Deformation: 256 elements (fits in L1), 1,000,000 iterations
• Test platform: Intel® Core™ i7 2nd generation processor [Sandy Bridge], 3.10 GHz, 6MB L3, 4GB RAM, Windows Server 2008* R2
• Compiled with Intel® C++ Composer XE [Version 12.0.0.104 for Intel® 64]
• Microsoft Visual Studio 2010* Release Config plus: /O3 /Oi /Ot /fp:fast /QaxAVX
What's Next for the Kernel Template Library?
• Containers
– Runtime Tiled SOA defined and accessed as AOS
– Take on managing the tiled structure of arrays, allowing the user to pretend it's an AOS for setup/editing and when defining a kernel, while under the hood it will be Tiled SOA
• More data types for use in kernels
– Current data types are just for example purposes
• Gather/scatter or other staging data streams
– Gather and scatter currently don't vectorize, so an additional stage needs to be introduced to gather/scatter to/from an intermediate buffer before the rest of the kernel can vectorize
More Options and Information
• Lots of other ways to get the benefits of vectorization and SIMD instructions:
– Automatic CPU dispatch (/Qax, -ax)
– Manual CPU dispatch (APIs to dispatch multiple versions of specific functions explicitly)
– Intel® Cilk™ Plus array notations and #pragma simd
– Intel® Integrated Performance Primitives and Intel® Math Kernel Library
• KTL is available at http://software.intel.com/en-us/articles/kernel-template-library/
• More presentations at http://software.intel.com/en-us/articles/intel-software-development-products-technical-presentations/
• Go to http://software.intel.com for more
Optimization Notice
Intel compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the “Intel Compiler User and Reference Guides” under “Compiler Options." Many library routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.
Intel compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel Streaming SIMD Extensions 2 (Intel SSE2), Intel Streaming SIMD Extensions 3 (Intel SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.
Notice revision #20110228
Legal Disclaimers
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license. http://software.intel.com/en-us/articles/intel-sample-source-code-license-agreement/?wapkw=(Samples+Software+License+Agreement)
Configurations: Details on slide 33. For more information go to http://www.intel.com/performance
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number
Intel, Core i7 and Cilk Plus are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2011 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.