Intrinsics
Lecture 1
Manfred Liebmann
Technische Universität München
Chair of Optimal Control
Center for Mathematical Sciences, M17
January 12, 2016
Manfred Liebmann January 12, 2016
Programming with Intrinsics
What are intrinsics?
Intrinsics are functions that the compiler replaces with the proper assembly instructions. Intrinsics are primarily used to access the vector processing capabilities of modern CPUs.
• Long history of Intrinsics
– MMX : Multi Media Extensions 8 x 64-bit (1997)
– SSE/SSE2/SSE3/SSSE3/SSE4.x : Streaming SIMD Extensions 8 x 128-bit (1999)
– AVX/AVX2/FMA : Advanced Vector Extensions 16 x 256-bit (2008)
– AVX-512/KNC : Advanced Vector Extensions 32 x 512-bit (2012)
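As a minimal sketch of what an intrinsic looks like in practice, the function below adds four floats with one SSE intrinsic call. The function name add4 is illustrative and not part of the lecture; the sketch assumes an x86 target with SSE enabled, which is the default on x86-64.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Adds four packed floats at once: out[k] = a[k] + b[k] for k = 0..3.
// add4 is a hypothetical helper name chosen for this sketch.
void add4(const float* a, const float* b, float* out) {
    __m128 va  = _mm_loadu_ps(a);      // unaligned load of four floats
    __m128 vb  = _mm_loadu_ps(b);
    __m128 sum = _mm_add_ps(va, vb);   // typically becomes one packed-add instruction
    _mm_storeu_ps(out, sum);           // unaligned store of four floats
}
```

The call looks like an ordinary function call, but the compiler is expected to emit the vector instruction directly instead of generating a call.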
Choose the Right Header!
Intrinsics are supported by all modern C/C++ compilers.
• Every generation has its own header!
– #include <mmintrin.h> //MMX
– #include <xmmintrin.h> //SSE
– #include <emmintrin.h> //SSE2
– #include <pmmintrin.h> //SSE3
– #include <tmmintrin.h> //SSSE3
– #include <smmintrin.h> //SSE4.1
– #include <nmmintrin.h> //SSE4.2
– #include <ammintrin.h> //SSE4A
– #include <wmmintrin.h> //AES
– #include <immintrin.h> //AVX
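In practice, with GCC, Clang, and MSVC the newest header in this list, immintrin.h, also pulls in the older SSE-family headers, so a single include is often enough. The sketch below relies on that behaviour and uses an SSE2 integer intrinsic with only immintrin.h included. The function name add4_i32 is illustrative, and an x86-64 target (where SSE2 is always available) is assumed.

```cpp
#include <immintrin.h>  // AVX header; assumed here to also provide the SSE2 intrinsics
#include <cstdint>

// Adds four packed 32-bit integers: out[k] = a[k] + b[k] for k = 0..3.
// add4_i32 is a hypothetical helper name chosen for this sketch.
void add4_i32(const std::int32_t* a, const std::int32_t* b, std::int32_t* out) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    // _mm_add_epi32 is an SSE2 intrinsic, declared in emmintrin.h.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(va, vb));
}
```

Including the narrowest header that declares the intrinsics in use remains a reasonable choice when the set of instructions a file may use should stay obvious.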
Advanced Vector Extensions (AVX)
Intel Advanced Vector Extensions (AVX) is a set of instructions for doing Single Instruction Multiple Data (SIMD) operations on Intel architecture CPUs. These instructions extend the previous SIMD offerings, MMX instructions and Intel Streaming SIMD Extensions (SSE).
Intel Intrinsics Guide
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
Complete interactive reference for all intrinsic functions!
Instruction Set Architecture (ISA) Extensions
https://software.intel.com/en-us/isa-extensions
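Because AVX is an extension, not every x86 CPU provides it, and code that executes AVX instructions on a CPU without them will fault. One possible runtime check is sketched below. It assumes GCC or Clang on an x86 target, where the builtin __builtin_cpu_supports is available; the function name cpu_has_avx is illustrative, and other compilers need a different mechanism.

```cpp
// Returns true if the running CPU reports AVX support.
// cpu_has_avx is a hypothetical helper name chosen for this sketch.
bool cpu_has_avx() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    // GCC/Clang builtin that queries the CPU feature bits at run time.
    return __builtin_cpu_supports("avx") != 0;
#else
    // Unknown compiler or non-x86 target: conservatively report no AVX.
    return false;
#endif
}
```

A program can use such a check once at startup to choose between an AVX code path and a scalar or SSE fallback.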
Intel AVX Suffix Markings
All modern C++ compilers support the same intrinsic operations to simplify using Intel AVX from C or C++ code. Intrinsics are functions that the compiler replaces with the proper assembly instructions. Most Intel AVX intrinsic names follow this format:
_mm256_op_suffix(data_type param1, data_type param2, data_type param3)
where _mm256 is the prefix for working on the new 256-bit registers; op is the operation, like add for addition or sub for subtraction; and suffix denotes the type of data to operate on, with the first letters denoting packed (p), extended packed (ep), or scalar (s). The remaining letters are the types given in the table below.
• Suffix Markings
[s/d] : Single- or double-precision floating point
[i/u]nnn : Signed or unsigned integer of bit size nnn, where nnn is 128, 64, 32, 16, or 8
[ps/pd/sd] : Packed single, packed double, or scalar double
epi32 : Extended packed 32-bit signed integer
si256 : Scalar 256-bit integer
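To make the naming scheme concrete, the sketch below uses the same operation (add) with two different suffixes, ps and pd. The macro AVX_TARGET and the function names add8_ps and add4_pd are assumptions introduced for this example: the macro uses a GCC/Clang function attribute so that the file builds without a global AVX flag, and with other compilers AVX would have to be enabled through build options instead.

```cpp
#include <immintrin.h>

// Hypothetical helper macro: on GCC/Clang, enable AVX for a single function.
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET  // other compilers: enable AVX via build flags
#endif

// _mm256_add_ps: prefix _mm256, operation add, suffix ps (packed single, 8 floats).
AVX_TARGET void add8_ps(const float* a, const float* b, float* out) {
    _mm256_storeu_ps(out, _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b)));
}

// _mm256_add_pd: same operation, suffix pd (packed double, 4 doubles).
AVX_TARGET void add4_pd(const double* a, const double* b, double* out) {
    _mm256_storeu_pd(out, _mm256_add_pd(_mm256_loadu_pd(a), _mm256_loadu_pd(b)));
}
```

Both functions must only be called on a CPU that supports AVX. The load and store intrinsics follow the same pattern, with loadu and storeu as the operation and ps or pd as the suffix.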
Intel AVX Intrinsics Data Types
• Data Types
__m256 : 256-bit as eight single-precision floating-point values
__m256d : 256-bit as four double-precision floating-point values
__m256i : 256-bit as integers (bytes, words, etc.)
__m128 : 128-bit single-precision floating-point (32 bits each)
__m128d : 128-bit double-precision floating-point (64 bits each)
Figure 1: Intel AVX and Intel SSE data types
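The relation between the 256-bit and 128-bit types can be sketched in code: a 256-bit float vector is split into its two 128-bit halves, which are then added with an SSE intrinsic. The macro AVX_TARGET and the function name fold_halves are assumptions introduced for this example (GCC/Clang function attribute; other compilers need AVX enabled through build options).

```cpp
#include <immintrin.h>

// Hypothetical helper macro: on GCC/Clang, enable AVX for a single function.
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET  // other compilers: enable AVX via build flags
#endif

// The vector types are plain fixed-size values: 32 bytes and 16 bytes.
static_assert(sizeof(__m256)  == 32, "__m256 holds eight floats");
static_assert(sizeof(__m256d) == 32, "__m256d holds four doubles");
static_assert(sizeof(__m128)  == 16, "__m128 holds four floats");

// out4[k] = in8[k] + in8[k + 4] for k = 0..3.
AVX_TARGET void fold_halves(const float* in8, float* out4) {
    __m256 v  = _mm256_loadu_ps(in8);          // eight floats in one 256-bit value
    __m128 lo = _mm256_castps256_ps128(v);     // lower four floats, no data movement implied
    __m128 hi = _mm256_extractf128_ps(v, 1);   // upper four floats
    _mm_storeu_ps(out4, _mm_add_ps(lo, hi));   // 128-bit add on the two halves
}
```

As with any AVX code, fold_halves must only run on a CPU that supports AVX.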
Mandelbrot Set Code Example
Pseudocode for calculating the Mandelbrot set.
z, p are complex numbers
for each point p on the complex plane
    z = 0
    for count = 0 to max_iterations
        if abs(z) > 2.0
            break
        z = z*z+p
    set color at p based on count reached
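The inner loop of the pseudocode can be sketched for a single point with plain float arithmetic, writing the complex number z as a pair (zx, zy). The function name mandel_count is illustrative. The test abs(z) > 2.0 is expressed as zx*zx + zy*zy > 4.0 to avoid a square root.

```cpp
// Escape-time count for one point p = (px, py), following the pseudocode above.
// mandel_count is a hypothetical helper name chosen for this sketch.
int mandel_count(float px, float py, int max_iterations) {
    float zx = 0.0f, zy = 0.0f;           // z = 0
    int count = 0;
    for (; count < max_iterations; ++count) {
        float x2 = zx * zx;
        float y2 = zy * zy;
        if (x2 + y2 > 4.0f) break;        // abs(z) > 2.0, tested on the squared magnitude
        float xy = zx * zy;
        zx = x2 - y2 + px;                // real part of z*z + p
        zy = xy + xy + py;                // imaginary part of z*z + p
    }
    return count;
}
```

Points inside the set, such as p = 0, never escape and return max_iterations; points far outside escape after very few iterations.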
Mandelbrot Set Visualization
Figure 2: Mandelbrot set 0.29768+0.48354i to 0.29778+0.48364i with 4096 max iterations
Simple Mandelbrot C++ STL Code
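#include <iostream>
#include <complex>
#include <vector>
using namespace std;
int main(int argc, char** argv)
{
    // Corners of the region shown in Figure 2.
    float x1 = 0.29768f, y1 = 0.48364f, x2 = 0.29778f, y2 = 0.48354f;
    int width = 2048, height = 2048, maxIters = 4096;
    // One iteration count per pixel; counts up to 4096 fit in an unsigned short.
    vector<unsigned short> image(width * height);
    // Step between neighbouring pixels in the complex plane.
    float dx = (x2-x1)/width, dy = (y2-y1)/height;
    for (int j = 0; j < height; ++j) {
        for (int i = 0; i < width; ++i) {
            complex<float> c(x1+dx*i, y1+dy*j), z(0,0);
            int count = -1;
            // norm(z) returns |z|^2, so the bound abs(z) = 2 corresponds to norm(z) = 4.
            while ((++count < maxIters) && (norm(z) < 4.0f))
                z = z*z+c;
            image[j*width + i] = count;
        }
    }
}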
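The code above handles one pixel at a time. One way such a loop could be vectorized with AVX intrinsics is sketched below: eight points are iterated in lockstep, and a per-lane mask records which points are still inside the escape radius. This is an illustration under stated assumptions, not the implementation behind the benchmark in this lecture. The macro AVX_TARGET and the function name mandel8_avx are introduced here, the macro relies on a GCC/Clang function attribute, and counts are accumulated as floats, which is exact only for max_iters below 2^24.

```cpp
#include <immintrin.h>

// Hypothetical helper macro: on GCC/Clang, enable AVX for a single function.
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET  // other compilers: enable AVX via build flags
#endif

// Escape counts for eight points c[k] = (cx[k], cy[k]), k = 0..7, iterated in lockstep.
AVX_TARGET void mandel8_avx(const float* cx, const float* cy, int max_iters, int* counts) {
    const __m256 vcx  = _mm256_loadu_ps(cx);
    const __m256 vcy  = _mm256_loadu_ps(cy);
    const __m256 four = _mm256_set1_ps(4.0f);
    const __m256 one  = _mm256_set1_ps(1.0f);
    __m256 zx  = _mm256_setzero_ps();
    __m256 zy  = _mm256_setzero_ps();
    __m256 cnt = _mm256_setzero_ps();                   // per-lane counts, kept as floats
    __m256 alive = _mm256_cmp_ps(zx, zx, _CMP_EQ_OQ);   // all lanes start active (0 == 0)
    for (int it = 0; it < max_iters; ++it) {
        __m256 x2 = _mm256_mul_ps(zx, zx);
        __m256 y2 = _mm256_mul_ps(zy, zy);
        // A lane stays active only while |z|^2 <= 4; once cleared, it stays cleared.
        alive = _mm256_and_ps(alive, _mm256_cmp_ps(_mm256_add_ps(x2, y2), four, _CMP_LE_OQ));
        if (_mm256_movemask_ps(alive) == 0) break;      // every lane has escaped
        cnt = _mm256_add_ps(cnt, _mm256_and_ps(alive, one));  // +1 for active lanes only
        __m256 xy = _mm256_mul_ps(zx, zy);
        zx = _mm256_add_ps(_mm256_sub_ps(x2, y2), vcx); // real part of z*z + c
        zy = _mm256_add_ps(_mm256_add_ps(xy, xy), vcy); // imaginary part of z*z + c
    }
    // Convert the float counts to 32-bit integers and store all eight.
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(counts), _mm256_cvttps_epi32(cnt));
}
```

Escaped lanes keep being updated and may overflow, but their counts no longer change because the mask excludes them. The loop runs until the slowest of the eight points escapes, so the gain over scalar code depends on how similar neighbouring pixels are.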
Mandelbrot Set Benchmark
Cores  STL      FPU      AVX
1      63.5186  11.9445  1.64415
2      50.1687  9.42479  1.26957
4      42.7716  8.02288  1.05672
8      23.2062  4.34219  0.569152
16     13.9921  2.62823  0.345063
Table 1: Total runtimes in seconds for the Mandelbrot set benchmark with a 2048 x 2048 grid on 2x Intel Xeon E5-2650 @ 2.00GHz.
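The runtimes in Table 1 can be turned into speedup and parallel-efficiency figures with a few lines of code. The sketch below copies the table values; the names Row, kTable1, speedup, and parallel_efficiency are illustrative. On one core it gives a speedup of roughly 38x for AVX over STL and roughly 7x for AVX over FPU.

```cpp
// One row of Table 1: runtimes in seconds for a given core count.
struct Row {
    int cores;
    double stl, fpu, avx;
};

// Values copied from Table 1.
const Row kTable1[] = {
    {1,  63.5186, 11.9445, 1.64415},
    {2,  50.1687, 9.42479, 1.26957},
    {4,  42.7716, 8.02288, 1.05672},
    {8,  23.2062, 4.34219, 0.569152},
    {16, 13.9921, 2.62823, 0.345063},
};

// How many times faster a run is than the baseline run.
double speedup(double baseline_seconds, double seconds) {
    return baseline_seconds / seconds;
}

// Fraction of ideal linear scaling achieved when going from 1 to n cores.
double parallel_efficiency(double seconds_1_core, double seconds_n_cores, int n) {
    return seconds_1_core / (seconds_n_cores * n);
}
```

For example, speedup(kTable1[0].stl, kTable1[0].avx) compares the STL and AVX versions on one core, and parallel_efficiency(kTable1[0].avx, kTable1[4].avx, 16) measures how well the AVX version scales to 16 cores.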