
Bringing Next Generation C++ to GPUs

Michael Haidl¹, Michel Steuwer², Lars Klein¹ and Sergei Gorlatch¹

¹University of Muenster, Germany

²University of Edinburgh, UK

The Problem: Dot Product

std::vector<int> a(N), b(N), tmp(N);
std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
               std::multiplies<int>());
auto result = std::accumulate(tmp.begin(), tmp.end(), 0,
                              std::plus<int>());

• The STL is the C++ programmer's Swiss Army knife
• STL containers, iterators and algorithms introduce a high level of abstraction

• Since C++17 it is also parallel
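The same computation can be expressed directly with the C++17 parallel algorithms. A minimal sketch (not from the slides; it assumes a standard library that ships the parallel algorithms), using the fused std::transform_reduce in place of the transform/accumulate pair:

// Minimal sketch, assuming a standard library with the C++17 parallel
// algorithms: the dot product as one fused, parallel call.
#include <execution>
#include <functional>
#include <numeric>
#include <vector>

int dot(const std::vector<int>& a, const std::vector<int>& b) {
  return std::transform_reduce(std::execution::par,
                               a.begin(), a.end(), b.begin(), 0,
                               std::plus<>{}, std::multiplies<>{});
}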



// f1.hpp
template <typename T>
auto mult(const std::vector<T>& a, const std::vector<T>& b) {
  std::vector<T> tmp(a.size());
  std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                 std::multiplies<T>());
  return tmp;
}

// f2.hpp
template <typename T>
auto sum(const std::vector<T>& a) {
  return std::accumulate(a.begin(), a.end(), T(), std::plus<T>());
}



std::vector<int> a(N), b(N);
auto result = sum(mult(a, b));



Performance:
• vectors of size 25600000
• Clang/LLVM 5.0.0svn -O3 optimized

[Bar chart: Runtime (ms) for transform/accumulate, transform/accumulate*, and inner_product. * LLVM patched with extended D17386 (loop fusion)]
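The inner_product variant measured above is presumably the classic fused formulation, roughly:

// Sketch of the fused single-pass dot product: multiply and add happen in
// one traversal, with no temporary vector.
#include <numeric>
#include <vector>

int dot(const std::vector<int>& a, const std::vector<int>& b) {
  return std::inner_product(a.begin(), a.end(), b.begin(), 0);
}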

The Problem: Dot Product on GPUs

thrust::device_vector<int> a(N), b(N), tmp(N);
thrust::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                  thrust::multiplies<int>());
auto result = thrust::reduce(tmp.begin(), tmp.end(), 0,
                             thrust::plus<int>());

• Highly tuned STL-like library for GPU programming
• Thrust offers containers, iterators and algorithms
• Based on CUDA



Same Experiment:
• nvcc -O3 (from CUDA 8.0)

[Bar chart: Runtime (ms) for transform/reduce and inner_product]

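The inner_product bar here presumably corresponds to Thrust's fused variant, roughly:

// Sketch (assumption: this is the fused variant the comparison refers to):
// thrust::inner_product computes the dot product in a single pass over the
// data, without materializing the temporary vector.
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>

int dot(const thrust::device_vector<int>& a,
        const thrust::device_vector<int>& b) {
  return thrust::inner_product(a.begin(), a.end(), b.begin(), 0);
}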

The Next Generation: Ranges for the STL

• range-v3 prototype implementation by E. Niebler
• Proposed as N4560 for the C++ Standard

std::vector<int> a(N), b(N);
auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
auto result = accumulate(view::transform(view::zip(a, b), mult), 0);



Performance?
• Clang/LLVM 5.0.0svn -O3 optimized

[Bar chart: Runtime (ms) for transform/accumulate and inner_product]



• Views describe lazy, non-mutating operations on ranges
• Evaluation happens inside an algorithm (e.g., accumulate)
• Fusion is guaranteed by the implementation
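Conceptually, the fused evaluation of the zip/transform/accumulate pipeline collapses into a single pass; a hand-written equivalent for illustration:

// Illustration only: what the fused pipeline boils down to, namely
// one loop and no temporary vector.
#include <cstddef>
#include <vector>

int dot(const std::vector<int>& a, const std::vector<int>& b) {
  int acc = 0;
  for (std::size_t i = 0; i < a.size(); ++i)
    acc += a[i] * b[i];
  return acc;
}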


Ranges for GPUs

• Extended range-v3 with GPU-enabled container and algorithms

• Original code of range-v3 remains unmodified

std::vector<int> a(N), b(N);
auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
auto ga = gpu::copy(a);
auto gb = gpu::copy(b);
auto result = gpu::reduce(view::transform(view::zip(ga, gb), mult), 0);


Programming Accelerators with C++ (PACXX)

[Architecture diagram. Offline stage: C++ source is compiled by the PACXX Offline Compiler (Clang frontend, LLVM, libc++) into an executable that embeds LLVM IR. Online stage: the PACXX Runtime's LLVM-based online compiler lowers the IR either through the OpenCL backend (LLVM IR to SPIR) for the OpenCL runtime (AMD GPU, Intel MIC), or through the CUDA backend (NVPTX) to PTX for the CUDA runtime (Nvidia GPU).]

• Based entirely on LLVM / Clang
• Supports C++14 for GPU programming
• Just-in-time compilation of LLVM IR for target accelerators


Multi-Stage Programming

template <typename InRng, typename T, typename Fun>
auto reduce(InRng&& in, T init, Fun&& fun) {
  // 1. preparation of kernel call
  ...
  // 2. create GPU kernel
  auto kernel = pacxx::kernel(
      [fun](auto&& in, auto&& out, int size, auto init) {
        // 2a. stage elements per thread
        auto ept = stage([&] { return size / get_block_size(0); });
        // 2b. start reduction computation
        auto sum = init;
        for (int x = 0; x < ept; ++x) {
          sum = fun(sum, *(in + gid));
          gid += glbSize;
        }
        // 2c. perform reduction in shared memory
        ...
        // 2d. write result back
        if (lid == 0) *(out + bid) = shared[0];
      },
      blocks, threads);
  // 3. execute kernel
  kernel(in, out, distance(in), init);
  // 4. finish reduction on the CPU
  return std::accumulate(out, init, fun);
}


MSP Integration into PACXX

[Architecture diagram, extended for MSP: the executable now carries MSP IR alongside the kernel IR. The MSP Engine processes the MSP IR before the LLVM-based online compiler lowers the kernel IR through the OpenCL backend (SPIR) or the CUDA backend (PTX) onto the OpenCL runtime (AMD GPU, Intel MIC) or the CUDA runtime (Nvidia GPU).]

• MSP Engine JIT compiles the MSP IR,
• evaluates stage prior to a kernel launch, and
• replaces the calls to stage in the kernel's IR with the results.
• Enables more optimizations (e.g., loop-unrolling) in the online stage.
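As a host-side analogy (illustration only, not PACXX internals): replacing a stage call with its value turns a runtime trip count into a compile-time constant, which is what makes the unrolling possible.

// Host-side analogy, not PACXX internals: staging turns a loop bound that is
// unknown when the kernel is written into a constant at JIT time.

// Before staging: 'ept' is a runtime value, so the loop bound is opaque.
int reduce_dynamic(const int* in, int ept) {
  int sum = 0;
  for (int x = 0; x < ept; ++x) sum += in[x];
  return sum;
}

// After staging: the MSP engine has evaluated the staged expression (say, to
// 128), and the online compiler effectively compiles this specialization.
template <int Ept>
int reduce_staged(const int* in) {
  int sum = 0;
  for (int x = 0; x < Ept; ++x) sum += in[x];  // constant bound: unrollable
  return sum;
}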

Performance Impact of MSP

gpu::reduce on Nvidia K20c

[Line chart: speedup over input sizes 2^15 to 2^25 for Dot, Sum, Dot + MS, and Sum + MS]

Up to 35% better performance compared to non-MSP version


Just-In-Time Compilation Overhead

Comparing MSP in PACXX with Nvidia’s nvrtc library

[Bar chart: Compilation Time (ms) for Dot and Sum, comparing CUDA 7.5, CUDA 8.0 RC, and PACXX]

10 to 20 times faster, because the front-end actions have already been performed during the offline stage.


Benchmarks

range-v3 + PACXX vs. Nvidia’s Thrust

[Bar chart: speedup of PACXX relative to Thrust for vadd (8.37%), saxpy (1.61%), sum (82.32%), dot (73.31%), Monte Carlo (33.6%), Mandelbrot (−3.13%) and Voronoi (−7.38%)]

• Evaluated on an Nvidia K20c GPU
• 11 different input sizes
• 1000 runs for each benchmark

Competitive performance with a composable GPU programming API

Going Native: Work in Progress

[Architecture diagram, extended with a native CPU backend: besides the OpenCL backend (LLVM IR to SPIR) and the CUDA backend (PTX), the online compiler can vectorize kernels with WFV [1] and compile them with MCJIT for CPUs (Intel / AMD / IBM ...).]

• PACXX is extended by a native CPU backend
• The kernel IR is modified to be runnable on a CPU
• Kernels are vectorized by the Whole Function Vectorizer (WFV) [1]
• MCJIT compiles the kernels and TBB executes them in parallel (see the sketch below)

[1] Karrenberg, Ralf, and Sebastian Hack. "Whole-Function Vectorization." CGO '11, pp. 141–150.
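A rough sketch of the execution side (hypothetical kernel signature and launcher, not PACXX's actual runtime code): once MCJIT has produced a function pointer for a kernel, TBB can distribute the blocks of the launch grid over the CPU cores.

// Rough sketch; the kernel signature and launcher are hypothetical, not
// PACXX's real API. MCJIT hands back a function pointer, and TBB spreads
// the blocks of the launch grid across CPU threads.
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

using KernelFn = void (*)(int block_id, void** args);  // hypothetical

void launch_on_cpu(KernelFn kernel, int num_blocks, void** args) {
  tbb::parallel_for(tbb::blocked_range<int>(0, num_blocks),
                    [&](const tbb::blocked_range<int>& r) {
                      // each worker thread runs a contiguous range of blocks
                      for (int b = r.begin(); b != r.end(); ++b)
                        kernel(b, args);
                    });
}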

Benchmarks

range-v3 + PACXX vs. OpenCL on x86_64

[Bar chart: speedup of PACXX relative to the Intel OpenCL implementation for vadd (−1.45%), saxpy (−13.94%), sum (−12.64%), dot (−20.02%) and Mandelbrot (−36.48%)]

• Running on 2x Intel Xeon E5-2620 CPUs
• Intel's auto-vectorizer optimizes the OpenCL C code


Benchmarks

range-v3 + PACXX vs. OpenCL on x86_64

[Bar chart: speedup of PACXX relative to the AMD OpenCL implementation for vadd, saxpy, sum, dot and Mandelbrot; visible labels: 58.12%, 58.45%, 153.82%]

• Running on 2x Intel Xeon E5-2620 CPUs
• AMD OpenCL SDK has no auto-vectorizer
• Barriers are very expensive in AMD's OpenCL implementation (speedup up to 126x for sum)


Benchmarks

range-v3 + PACXX vs. OpenMP on IBM Power8

[Bar chart: speedup of PACXX relative to OpenMP (IBM) for vadd (0.45%), saxpy (35.09%), sum (−20.05%), dot (−42.95%) and Mandelbrot (17.97%)]

• Running on a PowerNV 8247-42L with 4x IBM Power8e CPUs
• No OpenCL implementation from IBM available
• Loops parallelized with #pragma omp parallel for simd
• Compiled with XL C++ 13.1.5


Questions?
