Bringing Next Generation C++ to GPUs
Michael Haidl1, Michel Steuwer2, Lars Klein1 and Sergei Gorlatch1
1University of Muenster, Germany
2University of Edinburgh, UK
The Problem: Dot Product
std::vector<int> a(N), b(N), tmp(N);
std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
               std::multiplies<int>());
auto result = std::accumulate(tmp.begin(), tmp.end(), 0,
                              std::plus<int>());
• The STL is the C++ programmer's Swiss Army knife
• STL containers, iterators and algorithms introduce a high level of abstraction
• Since C++17 it is also parallel (see the sketch below)
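A minimal sketch (not from the slides) of the same dot product using the C++17 parallel algorithms; std::execution::par and std::reduce require a standard library that implements the parallel algorithms:

#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int parallel_dot(const std::vector<int>& a, const std::vector<int>& b) {
  std::vector<int> tmp(a.size());
  // Same algorithms as above, but with a parallel execution policy.
  std::transform(std::execution::par, a.begin(), a.end(), b.begin(),
                 tmp.begin(), std::multiplies<int>());
  // std::reduce is the order-relaxed, parallelizable counterpart of accumulate.
  return std::reduce(std::execution::par, tmp.begin(), tmp.end(), 0,
                     std::plus<int>());
}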
The two steps are typically factored out into reusable functions in separate headers (f1.hpp and f2.hpp):

// f1.hpp
template <typename T>
auto mult(const std::vector<T>& a, const std::vector<T>& b) {
  std::vector<T> tmp(a.size());
  std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                 std::multiplies<T>());
  return tmp;
}

// f2.hpp
template <typename T>
auto sum(const std::vector<T>& a) {
  return std::accumulate(a.begin(), a.end(), T(), std::plus<T>());
}
With these headers, the dot product becomes a simple composition:

std::vector<int> a(N), b(N);
auto result = sum(mult(a, b));
Performance:
• vectors of size 25600000
• Clang/LLVM 5.0.0svn -O3 optimized
[Bar chart: runtime (ms) of transform/accumulate, transform/accumulate* and inner_product]
* LLVM patched with extended D17386 (loop fusion)
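For reference, the inner_product bar presumably measures the fused single-pass formulation, which avoids the temporary vector entirely; a minimal sketch:

#include <numeric>
#include <vector>

// Fused dot product: one pass over the data, no temporary vector.
int dot(const std::vector<int>& a, const std::vector<int>& b) {
  return std::inner_product(a.begin(), a.end(), b.begin(), 0);
}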
The Problem: Dot Product on GPUs
thrust::device_vector<int> a(N), b(N), tmp(N);
thrust::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                  thrust::multiplies<int>());
auto result = thrust::reduce(tmp.begin(), tmp.end(), 0,
                             thrust::plus<int>());
• Highly tuned STL-like library for GPU programming
• Thrust offers containers, iterators and algorithms
• Based on CUDA
Same experiment:
• nvcc -O3 (from CUDA 8.0)
[Bar chart: runtime (ms) of transform/reduce and inner_product]
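The inner_product bar presumably refers to Thrust's fused counterpart, which needs no temporary device vector; a minimal sketch:

#include <thrust/device_vector.h>
#include <thrust/inner_product.h>

// Fused dot product on the GPU: a single pass, no tmp vector.
int gpu_dot(const thrust::device_vector<int>& a,
            const thrust::device_vector<int>& b) {
  return thrust::inner_product(a.begin(), a.end(), b.begin(), 0);
}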
The Next Generation: Ranges for the STL
• range-v3 prototype implementation by E. Niebler
• Proposed as N4560 for the C++ Standard
std::vector<int> a(N), b(N);
auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
auto result = accumulate(view::transform(view::zip(a, b), mult), 0);
Performance?
• Clang/LLVM 5.0.0svn -O3 optimized
[Bar chart: runtime (ms) of transform/accumulate and inner_product]
• Views describe lazy, non-mutating operations on ranges
• Evaluation happens inside an algorithm (e.g., accumulate)
• Fusion is guaranteed by the implementation
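To make the guaranteed fusion concrete: the range-v3 expression above behaves like the hand-written single loop below, with no intermediate vector materialized (an illustration of the effect, not of range-v3's implementation):

#include <cstddef>
#include <vector>

// What the fused evaluation of
// accumulate(view::transform(view::zip(a, b), mult), 0)
// effectively computes: one pass, no temporaries.
int fused_dot(const std::vector<int>& a, const std::vector<int>& b) {
  int acc = 0;
  for (std::size_t i = 0; i < a.size(); ++i)
    acc += a[i] * b[i];
  return acc;
}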
Ranges for GPUs
• Extended range-v3 with GPU-enabled container and algorithms
• Original code of range-v3 remains unmodified
std::vector<int> a(N), b(N);
auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
auto ga = gpu::copy(a);
auto gb = gpu::copy(b);
auto result = gpu::reduce(view::transform(view::zip(ga, gb), mult), 0);
Programming Accelerators with C++ (PACXX)
[Architecture diagram: in the offline stage, C++ source code is compiled by the PACXX Offline Compiler (Clang frontend, LLVM, libc++) into an executable that embeds LLVM IR; in the online stage, the PACXX Runtime's LLVM-based online compiler lowers this IR through an OpenCL backend (LLVM IR to SPIR) or a CUDA backend (NVPTX), and the resulting SPIR or PTX is executed by the OpenCL runtime (AMD GPU, Intel MIC) or the CUDA runtime (Nvidia GPU).]
• Based entirely on LLVM / Clang
• Supports C++14 for GPU programming
• Just-in-time compilation of LLVM IR for target accelerators
Multi-Stage Programming
template <typename InRng, typename T, typename Fun>
auto reduce(InRng&& in, T init, Fun&& fun) {
  // 1. preparation of kernel call ...
  // 2. create GPU kernel
  auto kernel = pacxx::kernel(
      [fun](auto&& in, auto&& out, int size, auto init) {
        // 2a. stage elements per thread
        auto ept = stage([&] { return size / get_block_size(0); });
        // 2b. start reduction computation
        auto sum = init;
        for (int x = 0; x < ept; ++x) {
          sum = fun(sum, *(in + gid));
          gid += glbSize;
        }
        // 2c. perform reduction in shared memory ...
        // 2d. write result back
        if (lid == 0) *(out + bid) = shared[0];
      }, blocks, threads);
  // 3. execute kernel
  kernel(in, out, distance(in), init);
  // 4. finish reduction on the CPU
  return std::accumulate(out, init, fun);
}
MSP Integration into PACXX
[Architecture diagram: as before, but the executable produced by the PACXX Offline Compiler now embeds two LLVM IR modules, the MSP IR and the Kernel IR; at run time the MSP Engine JIT-compiles the MSP IR, and the online compiler lowers the Kernel IR to SPIR or PTX for the OpenCL or CUDA runtimes.]
• MSP Engine JIT compiles the MSP IR,
• evaluates stage prior to a kernel launch, and
• replaces the calls to stage in the kernel's IR with the results.
• Enables more optimizations (e.g., loop unrolling) in the online stage.
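A conceptual C++ illustration of why staging helps (this models the effect of the IR rewrite; it is not PACXX code): once the staged value is known before the launch, the loop trip count is a constant for the online compiler and the loop can be unrolled:

// EptStaged models the result of stage(): a value fixed before the kernel is
// JIT-compiled, so the loop below has a known trip count and can be fully
// unrolled by the online compiler.
template <int EptStaged>
int per_thread_sum(const int* in, int gid, int glbSize) {
  int sum = 0;
  for (int x = 0; x < EptStaged; ++x) {  // constant trip count -> unrollable
    sum += in[gid];
    gid += glbSize;
  }
  return sum;
}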
Performance Impact of MSP
gpu::reduce on Nvidia K20c
[Plot: speedup (0.9 to 1.4) over input sizes 2^15 to 2^25 for the Dot and Sum benchmarks, with and without multi-staging (Dot, Sum, Dot +MS, Sum +MS)]
Up to 35% better performance compared to non-MSP version
Just-In-Time Compilation Overhead
Comparing MSP in PACXX with Nvidia’s nvrtc library
[Bar chart: compilation time (ms) for the Dot and Sum kernels with CUDA 7.5, CUDA 8.0 RC and PACXX]
10 to 20 times faster, because the front-end actions have already been performed in the offline stage.
Benchmarks
range-v3 + PACXX vs. Nvidia’s Thrust
[Bar chart: speedup of PACXX over Thrust for vadd, saxpy, sum, dot, Monte Carlo, Mandelbrot and Voronoi; data labels: 8.37%, 1.61%, 82.32%, 73.31%, 33.6%, −3.13%, −7.38%]
• Evaluated on a Nvidia K20c GPU
• 11 different input sizes
• 1000 runs for each benchmark
Competitive performance with a composable GPU programming API
Going Native: Work in Progress
[Architecture diagram: the PACXX toolchain as before, extended by a native CPU backend; in the online stage the Kernel IR is vectorized by WFV [1], compiled with MCJIT and executed on CPUs (Intel / AMD / IBM ...), alongside the existing OpenCL (SPIR) and CUDA (PTX) backends.]
• PACXX is extended by a native CPU backend
• The Kernel IR is modified to be runnable on a CPU
• Kernels are vectorized by the Whole-Function Vectorizer (WFV) [1]
• MCJIT compiles the kernels and TBB executes them in parallel (conceptual sketch below)
[1] Ralf Karrenberg and Sebastian Hack. "Whole-Function Vectorization." CGO 2011, pp. 141–150
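A conceptual sketch of the block-parallel execution with TBB (assumed for illustration only; the real backend operates on the WFV-vectorized, MCJIT-compiled kernel IR and additionally has to handle barriers):

#include <functional>
#include <tbb/parallel_for.h>

// Run a "grid" of work-groups on the CPU: blocks in parallel via TBB,
// the work-items of each block sequentially (before vectorization).
void run_grid(int num_blocks, int block_size,
              const std::function<void(int /*bid*/, int /*lid*/)>& body) {
  tbb::parallel_for(0, num_blocks, [&](int bid) {
    for (int lid = 0; lid < block_size; ++lid)
      body(bid, lid);
  });
}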
Benchmarks
range-v3 + PACXX vs. OpenCL on x86_64
[Bar chart: speedup of PACXX over Intel OpenCL for vadd, saxpy, sum, dot and Mandelbrot; data labels: −1.45%, −13.94%, −12.64%, −20.02%, −36.48%]
• Running on 2x Intel Xeon E5-2620 CPUs
• Intel's auto-vectorizer optimizes the OpenCL C code
Benchmarks
range-v3 + PACXX vs. OpenCL on x86_64
[Bar chart: speedup of PACXX over AMD OpenCL for vadd, saxpy, sum, dot and Mandelbrot; data labels: 58.12%, 58.45%, 153.82%]
• Running on 2x Intel Xeon E5-2620 CPUs
• AMD OpenCL SDK has no auto-vectorizer
• Barriers are very expensive in AMD's OpenCL implementation (speedup up to 126x for sum)
Benchmarks
range-v3 + PACXX vs. OpenMP on IBM Power8
[Bar chart: speedup of PACXX over OpenMP for vadd, saxpy, sum, dot and Mandelbrot; data labels: 0.45%, 35.09%, −20.05%, −42.95%, 17.97%]
• Running on a PowerNV 8247-42L with 4x IBM Power8e CPUs
• No OpenCL implementation from IBM available
• Loops parallelized with #pragma omp parallel for simd (see the sketch below)
• Compiled with XL C++ 13.1.5
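A sketch of what such an OpenMP baseline presumably looks like for the dot benchmark (assumed for illustration, not the exact benchmark code):

#include <cstddef>
#include <vector>

// OpenMP baseline in the style described above: the loop is parallelized
// across threads and vectorized, with a reduction over the partial sums.
int omp_dot(const std::vector<int>& a, const std::vector<int>& b) {
  int sum = 0;
  #pragma omp parallel for simd reduction(+ : sum)
  for (std::size_t i = 0; i < a.size(); ++i)
    sum += a[i] * b[i];
  return sum;
}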