Thrust

Thrust: a parallel algorithms library

@krustf

Boost.勉強会 (Boost Study Meeting) #13 Sendai, 2013/10/19

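About me

• @krustf
• Currently a fourth-year undergraduate (B4); moving to Tsukuba next year
• Working around GPGPU, MPI, and similar topics
• Bachelor's thesis is finished (except for Appendix B)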

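Excuses

• Last week: a high-school reunion, a talk at a computational simulation study meeting, and a university alumni gathering
• Yesterday: a drinking party from about 19:00 to 23:00
• Until one hour ago: the deadline for the final version of a conference paper
• So there was no time to prepare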

GPGPU

• The currently popular practice of running general-purpose computation on GPUs
• GPU clusters are now in use worldwide

MIC (Xeon Phi)

• An x86_64-based coprocessor released by Intel
• The big difference from a GPU: a MIC can execute programs on its own

GPU and MIC

• In the TOP500 list of June 2013:
1st: Tianhe-2 (MIC), Rpeak 54 PFLOPS
2nd: Titan (GPU), Rpeak 27 PFLOPS
3rd: Sequoia (BlueGene/Q), Rpeak 20 PFLOPS
4th: K computer (SPARC64), Rpeak 11 PFLOPS
(Rpeak = theoretical peak performance.)

Problems with GPU and MIC

• A GPU requires extensive modifications to the original program
• A MIC runs the original program unmodified, but performance is poor
• In the end, both require tuning

APIs and such

• Even for GPUs alone, there are:
• CUDA (NVIDIA)
• OpenCL (NVIDIA/AMD)
• OpenACC
• DirectCompute (DirectX 11)
• and many more, which makes life very painful

OpenCL

• Usable not only on GPUs but on several kinds of accelerators
• Heterogeneous Computing
• Brings parallel computing to general-purpose applications

OpenACC

• Uses pragma directives, like OpenMP
• Leans toward GPUs, but can also target MIC
• PGI, the company that develops the OpenACC-capable PGI compiler, is an NVIDIA subsidiary

Programming model

• GPGPU programming is difficult
• Some would (perhaps) call it tedious
• OpenCL is also regarded as fairly tedious

STL-like libraries

• Recently, STL-like libraries have been implemented for parallel programming
• Bolt C++ Template Library (OpenCL)
• Intel TBB
• Thrust (CUDA)

Bolt C++ Template Library

• An STL-like library built on OpenCL
• However, it only works with AMD GPUs and also requires VS2010
• I have no Windows machine, no AMD GPU, and no VS2010, so I am skipping it

Intel TBB

• An STL-like library for CPU parallelism
• Already introduced at the earlier Boost.勉強会 #11 Tokyo

Thrust

• An STL-like library for CUDA
• The backend can be switched to OpenMP or TBB, so it is not limited to CUDA
• The main topic of this talk

Using Thrust

• Installed by default with the CUDA SDK
• The bundled version varies from release to release
• The latest version is 1.7.0
• Header-only, so there is no library to link against

Libraries that use Thrust

• Boost.Odeint
• Solvers for ordinary differential equations
• MATLAB
• The well-known numerical analysis software

Host and Device

• CUDA distinguishes the CPU and the GPU by the names Host and Device
• Host : CPU
• Device : GPU
• Thrust follows the same naming

Host-Device data transfer

• In CUDA, the Host and the Device do not share a memory space, so data must be transferred between them
• The same applies to Thrust

Sample

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/copy.h>
#include <iostream>
#include <iterator>

int main(int argc, char** argv) {
    static const int n = 10;
    thrust::device_vector<double> x(n), y(n);
    // initialize x to 0,1,2,3...
    thrust::sequence(x.begin(), x.end());
    // compute y = 2x
    using namespace thrust::placeholders;
    thrust::transform(x.begin(), x.end(), y.begin(), 2 * _1);
    // print y (each element is copied from the Device to the Host)
    thrust::copy(y.begin(), y.end(),
                 std::ostream_iterator<double>(std::cout, " "));
}

> 0 2 4 6 8 10 12 14 16 18

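How to build

• When using the Thrust that ships with the CUDA installation
• nvcc => the compiler for CUDA
• The bundled Thrust lives in /usr/local/cuda/include, but do not add that directory to the include path with -I, CPLUS_INCLUDE_PATH, or similar (nvcc finds it on its own)

$ nvcc -o thrust.run main.cu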

When downloading the source

• Install it somewhere other than /usr/local/cuda/include and add that location to the include path

• http://thrust.github.io


Features of Thrust

• Carrying STL algorithms over to CUDA unchanged would be inefficient, so Thrust provides its own dedicated functions
• C++11 lambdas cannot be used, but Thrust defines its own placeholders

Defining a functor

template<class T>
struct axpy_functor {
    const T a;
    axpy_functor(T _a) : a(_a) {}

    __host__ __device__
    T operator()(T const& x, T const& y) const {
        return a * x + y;
    }
};

static const int n = 10;
thrust::device_vector<double> x(n), y(n);
// initialize...
// y = a * x + y (binary transform; a = 2.0 is an arbitrary example value)
thrust::transform(x.begin(), x.end(), y.begin(), y.begin(),
                  axpy_functor<double>(2.0));

Qualifiers

• __host__
• Marks a function that runs on the CUDA Host
• __device__
• Marks a function that runs on the CUDA Device
• Both can be specified on the same function

OpenMP and TBB backends

• Thrust can use OpenMP and TBB as backends, not only CUDA
• Parts that cannot run on CUDA, or that are better handled on the CPU than on CUDA, can also be computed with Thrust, which is efficient

Build-time configuration

• Select the backend for the whole program at build time
• OpenMP

$ nvcc -o thrust.run main.cu \
    -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP \
    -Xcompiler -fopenmp -lgomp

• TBB

$ nvcc -o thrust.run main.cu \
    -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_TBB \
    -ltbb

Building with g++, icpc, etc.

• When Thrust is used with OpenMP or TBB, it can be built with an ordinary C++ compiler such as g++ instead of nvcc
• It runs even in environments without CUDA

Selecting the backend per container in code

#include <thrust/system/omp/vector.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/copy.h>
#include <iostream>
#include <iterator>

int main(int argc, char** argv) {
    static const int n = 10;
    thrust::omp::vector<double> x(n), y(n);
    // initialize x to 0,1,2,3... (by OpenMP)
    thrust::sequence(x.begin(), x.end());
    // compute y = 2x (by OpenMP)
    using namespace thrust::placeholders;
    thrust::transform(x.begin(), x.end(), y.begin(), 2 * _1);
    // print y
    thrust::copy(y.begin(), y.end(),
                 std::ostream_iterator<double>(std::cout, " "));
}

Selecting the backend with an execution policy

#include <thrust/system/omp/execution_policy.h>
#include <thrust/system/tbb/execution_policy.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/copy.h>
#include <vector>
#include <iostream>
#include <iterator>

int main(int argc, char** argv) {
    static const int n = 10;
    std::vector<double> x(n), y(n);
    // initialize x to 0,1,2,3... (by OpenMP)
    thrust::sequence(thrust::omp::par, x.begin(), x.end());
    // compute y = 2x (by TBB)
    using namespace thrust::placeholders;
    thrust::transform(thrust::tbb::par,
                      x.begin(), x.end(), y.begin(), 2 * _1);
    // print y
    thrust::copy(y.begin(), y.end(),
                 std::ostream_iterator<double>(std::cout, " "));
}

Fancy Iterator

• zip_iterator
• An iterator over tuples
• transform_iterator
• Applies a functor as it advances
• permutation_iterator
• For parallel patterns such as map

Implemented algorithms

• searching
• copying (gather, scatter)
• reduction
• merging
• reordering
• prefix sums
• sorting
• transform
• random

Things I was not sure about

• What if several vectors are involved?
• Bundling them into tuples with zip_iterator is the easy way
• What if the vectors differ in size?

Thoughts

• Parallel processing will probably become easier on the software (that is, programming) side
• Ease of writing and ease of tuning
• Libraries such as TBB, Thrust, and Bolt are what support this

Q&A

• Any questions?
