High-Performance Computing with C++


Quant Programmer (C++, .NET, MATLAB)
Microsoft MVP, Visual C# (since 2009)
Pluralsight course author (MATLAB, CUDA, D, Boost, ...)
Technical Evangelist @ JetBrains

An overview of available technologies for computation
A look at managed vs. unmanaged code
How to leverage the capabilities of the x86 architecture
What COTS and specialized acceleration hardware exists and how to use it

Native code vs. managed code

More portable. (But C++ is also portable, provided you do not use platform-specific features.)
In theory, gets optimized for various platforms. In practice, this isn't great.
Does not permit low-level interaction with the processor.
Additional safety (managed): array bounds checks, type conversion checks, etc.

Not always portable (e.g., .NET is only partially portable, excluding UI, WCF, ...)
Typically supports garbage collection.
Has ways of interacting with native code (JNI, P/Invoke, C++/CLI).

Developer vs. software productivity?
Managed languages are simpler to use

This talk focuses on CPU-bound problems
Some problems bottleneck on I/O
SSDs made things a lot better
Optimization mechanisms

Don't expect CPU clock speeds to pick up
PC/server architecture does not scale
The only way to accelerate computation is to provide more processing units (vector lanes, cores, devices) to compute on.

Instruction-level
Thread-level
Machine-level

Via inline assembly
Via intrinsics
Compiler vectorization
Use magical compilers (e.g., the Intel SPMD Program Compiler, ispc)
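A minimal sketch of the intrinsics route, assuming an x86 target with SSE available. The names add_arrays and add_vectors and the SKETCH_HAVE_SSE macro are hypothetical, chosen only for this illustration. The trailing scalar loop doubles as the fallback on other targets, and it is also the kind of loop a compiler may vectorize on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE/SSE2 intrinsics
#define SKETCH_HAVE_SSE 1
#endif

// Element-wise sum: out[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#ifdef SKETCH_HAVE_SSE
    // Four floats per iteration in one 128-bit register.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar tail; also the whole loop when SSE is not available.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Convenience wrapper over std::vector (hypothetical helper).
std::vector<float> add_vectors(const std::vector<float>& a,
                               const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    add_arrays(a.data(), b.data(), out.data(), a.size());
    return out;
}
```

Unaligned loads are used so the sketch works on any buffer; aligned loads can be faster but need 16-byte-aligned data.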

SIMD things

Processing data in an array
OpenMP
Intel Threading Building Blocks (TBB)
Parallel Patterns Library (Microsoft)
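A minimal sketch of array processing with OpenMP. The function name sum_of_squares is hypothetical. The pragma only takes effect when OpenMP is enabled in the compiler (for example -fopenmp for GCC/Clang or /openmp for MSVC); otherwise it is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums the squares of all elements.
// With OpenMP enabled, iterations are split among threads and the
// per-thread partial sums are combined by the reduction clause.
double sum_of_squares(const std::vector<double>& data) {
    double total = 0.0;
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(data.size());
    #pragma omp parallel for reduction(+ : total)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        total += data[i] * data[i];
    }
    return total;
}

// Usage check: 1 + 4 + 9 + 16 = 30.
void sum_of_squares_demo() {
    assert(sum_of_squares({1.0, 2.0, 3.0, 4.0}) == 30.0);
}
```

The loop index is a signed type because older OpenMP implementations require one.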

GPGPU
Expansion boards
Custom chips

Hardware platforms: NVIDIA, ATI (now AMD)
Software platforms for computation: CUDA, OpenCL, C++ AMP

Typically 2 GPUs per machine; effectiveness drops off after that
PCI bus congestion, but it depends on usage patterns

CUDA is the principal commercially successful GPGPU platform
CUDA is supported by many software vendors (Photoshop, MATLAB, etc.)
In many domains (e.g., video transcoding), the situation with GPU leveraging is dire
In terms of performance, it is thought that NVIDIA (CUDA) has better floating-point math, AMD better integer math

CUDA is actually a managed technology
CUDA is not device-independent
CUDA C is the primary development language

A GPU has several streaming multiprocessors (SMs)
Each SM has lots of processors (SPs)
We can launch a large number of threads in parallel
The very large number of SPs ensures that, even at lower clock speeds, the GPU wins out over the CPU on parallel workloads

A look at CUDA development

GPU does not support ordinary x86.
Running several tasks on a GPU is difficult
Branch divergence: branching code (even a simple if) can turn computation from parallel to sequential.
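A simplified C++ model of branch divergence (not CUDA code). GPU threads run in lockstep groups (CUDA calls them warps, 32 threads on NVIDIA hardware). If the threads in one group disagree on an if, the hardware runs each side in turn while the other threads wait. All names below are hypothetical, and the model ignores everything except the pass count.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Lanes that execute in lockstep; 32 is the warp size on NVIDIA hardware.
constexpr std::size_t kWarpSize = 32;
using LaneMask = std::array<bool, kWarpSize>;

// Number of sequential passes one if/else costs a lockstep group:
// each side of the branch taken by at least one lane needs its own pass,
// during which the lanes on the other side sit idle.
int passes_for_branch(const LaneMask& takes_if) {
    bool any_if = false;
    bool any_else = false;
    for (bool t : takes_if) {
        if (t) {
            any_if = true;
        } else {
            any_else = true;
        }
    }
    return (any_if ? 1 : 0) + (any_else ? 1 : 0);
}

// All lanes take the if side: no divergence.
LaneMask uniform_lanes() {
    LaneMask m;
    m.fill(true);
    return m;
}

// Even lanes take the if side, odd lanes the else side: divergence.
LaneMask alternating_lanes() {
    LaneMask m{};
    for (std::size_t i = 0; i < kWarpSize; ++i) {
        m[i] = (i % 2 == 0);
    }
    return m;
}

// Usage check of the model.
void divergence_demo() {
    assert(passes_for_branch(uniform_lanes()) == 1);      // one pass
    assert(passes_for_branch(alternating_lanes()) == 2);  // both sides, in sequence
}
```

In this model the work done under a divergent if/else is serialized into two passes, which is why data layouts that keep neighbouring threads on the same side of a branch tend to perform better.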

How do you plug a few more CPUs into a motherboard?
You cannot. The architecture doesn't scale. (And never will.)
An alternative is to put a coprocessor on the PCI bus

Xeon Phi: a commercial coprocessor implementation from Intel
PCI board with 60 cores
Supports x86!
Supports different technologies
Runs its own micro Linux (not a driver)
Can be used in either independent or offload mode
Requires special development tools (Intel C++ compiler)

Intel makes a lot of tools for C++ developers
To work with Xeon Phi, you need

Offload mode
Native execution mode
Symmetric execution

Programming the Xeon Phi

60 cores
4 hardware threads per core
8 GB memory
512-bit SIMD
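A quick back-of-the-envelope check of what these figures imply, written as a small C++ sketch. The constant and function names are hypothetical, and the element widths assume the usual 32-bit float and 64-bit double.

```cpp
#include <cassert>

constexpr int kCores = 60;          // cores on the board
constexpr int kThreadsPerCore = 4;  // hardware threads per core
constexpr int kSimdBits = 512;      // width of one SIMD register

// Total hardware threads the board exposes.
constexpr int hardware_threads() { return kCores * kThreadsPerCore; }

// Elements of the given width (in bits) that fit in one SIMD register.
int simd_lanes(int element_bits) {
    assert(element_bits > 0 && kSimdBits % element_bits == 0);
    return kSimdBits / element_bits;
}
```

So the board exposes 60 x 4 = 240 hardware threads, and one 512-bit register holds 16 single-precision or 8 double-precision values.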

Same programming models as on ordinary PCs, i.e., OpenMP, MPI, pthreads
Other models coming soon

FPGA = Field-Programmable Gate Array
Design your own processing mechanism
Middle ground between a hard-wired ASIC and a very flexible general-purpose CPU
Uses special hardware description languages (HDLs): VHDL, Verilog.
There are others (SystemC, OpenCL) and higher-level solutions (e.g., MATLAB, Embeddr).

Intrinsically parallel
Low-power
Better scalability
Not a COTS solution

FPGA lets us offload some tasks from the CPU
FPGA is a lot less flexible. Not so good for math.
FPGA is a low-level construct.
FPGAs are relatively expensive to operate.

FPGAs do not directly compete with ordinary CPUs
They gain an advantage due to their highly asynchronous nature
The goal is to pre-program an FPGA to solve a single problem very quickly
E.g., protocol parsing in hardware (a so-called feed handler)

JetBrains is working on a C++ IDE
And C++ support in ReSharper
Questions?