
Luciano Martins and Robert Sohigian, 2018-11-22

Prototyping and Developing GPU-Accelerated Solutions with Python and CUDA


2

Agenda

Introduction to Python

GPU-Accelerated Computing

NVIDIA® CUDA® technology

Why Use Python with GPUs?

Methods: PyCUDA, Numba, CuPy, and scikit-cuda

Summary

Q&A


3

Introduction to Python

Released by Guido van Rossum in 1991

The Zen of Python:

Beautiful is better than ugly.

Explicit is better than implicit.

Simple is better than complex.

Complex is better than complicated.

Flat is better than nested.

Interpreted language (CPython, Jython, ...)

Dynamically typed; based on objects


4

Introduction to Python

Small core structure:

~30 keywords

~80 built-in functions

Indentation is a pretty serious thing

Dynamically typed; based on objects

Binds to many different languages

Supports GPU acceleration via modules
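As a tiny illustration of the points above (the function and variable names are illustrative), indentation delimits blocks and names are dynamically typed:

def describe(value):
    # indentation, not braces, defines the block structure
    if isinstance(value, str):
        return "a string of length %d" % len(value)
    return "a number: %s" % value

x = 42              # x is bound to an int...
print(describe(x))
x = "forty-two"     # ...and can be rebound to a str at run time
print(describe(x))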


5

Introduction to Python


6

Introduction to Python


7

Introduction to Python


8

GPU-Accelerated Computing

“[T]he use of a graphics processing unit (GPU) together with a CPU to accelerate deep learning, analytics, and engineering applications” (NVIDIA)

Most common GPU-accelerated operations:

Large vector/matrix operations (Basic Linear Algebra Subprograms - BLAS)

Speech recognition

Computer vision


9

GPU-Accelerated Computing

Important concepts for GPU-accelerated computing:

Host ― the machine running the workload (CPU)

Device ― the GPU(s) inside a host

Kernel ― the part of the code that runs on the GPU

SIMT ― Single Instruction, Multiple Threads


10

GPU-Accelerated Computing


11

GPU-Accelerated Computing


12

CUDA

Parallel computing platform and programming model developed by NVIDIA:

Stands for Compute Unified Device Architecture

Based on C/C++ with some extensions

Fairly short learning curve for programmers with OpenMP and MPI experience

CUDA on a system has three components:

Driver (software that controls the graphics card)

Toolkit (nvcc, several libraries, debugging tools)

SDK (examples and error-checking utilities)
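As a quick, hedged sketch (assuming PyCUDA, covered later in this deck, is installed), the driver and the devices it exposes can be inspected from Python:

import pycuda.driver as drv

drv.init()                                          # loads the CUDA driver
print("Driver version:", drv.get_driver_version())
for i in range(drv.Device.count()):                 # GPUs visible on this host
    print("GPU %d: %s" % (i, drv.Device(i).name()))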


13

CUDA

A kernel is executed as a grid of thread blocks

All threads within a block share a portion of data memory

A thread block is a batch of threads that can cooperate with each other by:

Synchronizing their execution to provide hazard-free common memory accesses

Efficiently sharing data through low-latency shared memory

Multiple blocks are combined to form a grid

Blocks on a grid contain the same number of threads
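A minimal sketch of this hierarchy using Numba's CUDA support (introduced later in this deck; sizes are illustrative): each thread computes a global index from its block and thread coordinates.

import numpy as np
from numba import cuda

@cuda.jit
def fill_with_index(out):
    # global index = block index * block size + thread index within the block
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if i < out.size:             # guard: the grid may be larger than the data
        out[i] = i

data = np.zeros(1000, dtype=np.float32)
threads_per_block = 128
blocks_per_grid = (data.size + threads_per_block - 1) // threads_per_block
fill_with_index[blocks_per_grid, threads_per_block](data)   # launch the grid
print(data[:5])                  # -> [0. 1. 2. 3. 4.]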


14

CUDA

[Diagram: the Host launches Kernel 1, which executes on the Device as Grid 1, a 3 x 2 arrangement of blocks (0,0) through (2,1); Kernel 2 executes as Grid 2. Block (1,1) is expanded to show its 5 x 3 arrangement of threads (0,0) through (4,2).]


15

CUDA

The host performs the following tasks (CPU):

1. Initializes the GPU card(s)

2. Allocates memory on the host and on the device

3. Copies data from host memory to device memory

4. Launches instances of the kernel on the device(s)

5. Copies data from device memory back to the host

6. Repeats steps 3-5 as needed

7. De-allocates all memory and terminates
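A minimal sketch of this flow using PyCUDA's GPUArray (PyCUDA is covered a few slides ahead; the array and the operation are illustrative):

import numpy as np
import pycuda.autoinit                 # step 1: initializes the GPU
import pycuda.gpuarray as gpuarray

a = np.random.randn(1024).astype(np.float32)   # step 2: host memory
a_gpu = gpuarray.to_gpu(a)             # steps 2-3: allocate device memory and copy to it
b_gpu = 2 * a_gpu + 1                  # step 4: elementwise kernels run on the device
b = b_gpu.get()                        # step 5: copy the result back to host memory
print(np.allclose(b, 2 * a + 1))       # step 7: device memory is freed when the GPUArrays are released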


16

“Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together.”

https://www.python.org/doc/essays/blurb/


17

Python (and the Need for Speed)

Interpreted, high-level languages can be too slow for high-performance workloads, so Python needs assistance for those tasks.

Keep the best of both scenarios:

Quick development and prototyping with Python

Use the processing power and speed of the GPU


18

Accelerating Python

Accelerated code may be pure Python or may also involve C code.

Focusing here on the following modules:

PyCUDA

Numba

CuPy

scikit-cuda


19

PyCUDA

A Python wrapper to the CUDA API

Gives speed to Python – near-zero wrapping overhead

Requires C programming knowledge (for the kernel)

Compiles the CUDA code and copies it to the GPU

CUDA errors translated to Python exceptions

Easy installation (pip)
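A minimal PyCUDA sketch along these lines (in the spirit of the PyCUDA documentation examples; names and sizes are illustrative): the kernel is plain CUDA C in a string, and drv.In/drv.Out handle the host/device copies.

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# the kernel is ordinary CUDA C, compiled by PyCUDA at run time
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")

a = np.random.randn(400).astype(np.float32)
b = np.random.randn(400).astype(np.float32)
dest = np.zeros_like(a)

# drv.In/drv.Out perform the host <-> device copies automatically
multiply_them(drv.Out(dest), drv.In(a), drv.In(b), block=(400, 1, 1), grid=(1, 1))
print(np.allclose(dest, a * b))        # -> True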


20

PyCUDA


21

PyCUDA


22

Numba

No need to write C-code

High-performance functions written in Python

On-the-fly code generation

Native code generation for the CPU and GPU

Integration with the Python scientific stack

Takes advantage of Python decorators

Code translation done using LLVM compiler
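A minimal Numba sketch (function names and sizes are illustrative): the decorators trigger on-the-fly native code generation via LLVM, for the CPU with @jit and for the GPU with @vectorize.

import numpy as np
from numba import jit, vectorize

@jit(nopython=True)                    # compiled to native CPU code via LLVM
def total(arr):
    s = 0.0
    for x in arr:                      # this loop runs as machine code
        s += x
    return s

@vectorize(['float32(float32, float32)'], target='cuda')
def gpu_add(a, b):                     # compiled into a GPU ufunc
    return a + b

x = np.arange(1_000_000, dtype=np.float32)
print(total(x))
print(gpu_add(x, x)[:5])               # elementwise addition executed on the GPU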


23

Numba

https://numba.pydata.org/numba-examples/examples/finance/blackscholes/results.html


24

Numba


25

CuPy

An implementation of NumPy-compatible multi-dimensional array on CUDA

Useful to perform matrix ops on GPUs

Provides easy ways to define three types of CUDA kernels:

Elementwise kernels

Reduction kernels

Raw kernels

Also easy to install (pip)
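A minimal CuPy sketch (shapes and the kernel name are illustrative): the array API mirrors NumPy, and ElementwiseKernel is the first of the three kernel types listed above.

import cupy as cp

x = cp.arange(6, dtype=cp.float32).reshape(2, 3)   # allocated in GPU memory
y = cp.ones((2, 3), dtype=cp.float32)
print(cp.asnumpy(x @ y.T))             # NumPy-style matrix operation, run on the GPU

# an elementwise kernel: the body is a fragment of CUDA C
squared_diff = cp.ElementwiseKernel(
    'float32 a, float32 b',            # inputs
    'float32 out',                     # output
    'out = (a - b) * (a - b)',         # per-element operation
    'squared_diff')
print(squared_diff(x, y))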


26

CuPy


27

CuPy

Array Size    NumPy [ms]    CuPy [ms]
10^4          0.03          0.58
10^5          0.20          0.97
10^6          2.00          1.84
10^7          55.55         12.48
10^8          517.17        84.73
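The deck does not show which operation produced these numbers; purely as an illustration of how such a comparison can be timed, note that the GPU must be synchronized before the clock is stopped (the expression below is an assumption, not the original benchmark):

import time
import numpy as np
import cupy as cp

def bench_ms(xp, n):
    a = xp.arange(n, dtype=xp.float32)
    if xp is cp:
        cp.cuda.Stream.null.synchronize()    # finish setup before timing
    start = time.perf_counter()
    b = a * 2.0 + 1.0                        # illustrative elementwise expression
    if xp is cp:
        cp.cuda.Stream.null.synchronize()    # wait for the GPU to finish
    return (time.perf_counter() - start) * 1000.0

for n in (10**4, 10**6, 10**8):
    print(n, bench_ms(np, n), bench_ms(cp, n))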


28

scikit-cuda

Motivated by the idea of enhancing PyCUDA

Exposes GPU-powered libraries such as CUBLAS and CUFFT

Tested on Linux (potentially works elsewhere)

Can be seen as “SciPy on GPU juice”

Presents low-level and high-level functions


29

scikit-cuda

Low-Level Functions

Wrapping C functions via ctypes

Catching errors and mapping to Python exceptions

High-Level Functions

Take advantage of PyCUDA GPUArray to manipulate matrices in GPU memory

Available high-level functions include FFT/IFFT, numerical integration, randomized linear algebra, and NumPy-like routines not available in PyCUDA (cumsum, zeros, etc.)
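A minimal scikit-cuda sketch of one of the high-level functions named above (an FFT run on the GPU and checked against NumPy; the signal size is illustrative):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import skcuda.fft as cufft

x = np.random.rand(1024).astype(np.float32)
x_gpu = gpuarray.to_gpu(x)                             # signal in GPU memory
xf_gpu = gpuarray.empty(1024 // 2 + 1, np.complex64)   # real-to-complex output size

plan = cufft.Plan(x.shape, np.float32, np.complex64)   # cuFFT plan via scikit-cuda
cufft.fft(x_gpu, xf_gpu, plan)                         # FFT executed on the GPU

print(np.allclose(xf_gpu.get(), np.fft.rfft(x), atol=1e-2))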


30

scikit-cuda


31

Summary

Many GPU-acceleration modules ported to Python are readily available

Keeps the simplicity of Python whilst adding GPU performance

Allows faster prototype development cycles

Supports C-level performance, depending on the module (approach) chosen

Covers everything from matrix operations and scientific programming to custom kernel creation


32

Summary Pages


33

PyCUDA Summary

PyCUDA

CUDA Python wrapper

C code added directly in the Python project

Support for all CUDA libraries

Significant complexity due to the kernels being written in C

https://documen.tician.de/pycuda/


34

Numba Summary

Numba

Similar coverage to PyCUDA

No C coding needed

Takes advantage of LLVM and JIT compiling

Missing: dynamic parallelism and texture memory

http://numba.pydata.org/doc.html


35

CuPy Summary

CuPy

Fully supports NumPy structures

Performs the same operations at scale using the GPU

Allows CPU/GPU agnostic code creation

https://docs-cupy.chainer.org/en/stable


36

scikit-cuda Summary

scikit-cuda:

Scientific computing using Python and GPUs

Presents high-level and low-level functions

Broad coverage of operations already available

Depends on PyCUDA GPUArray mechanisms

http://scikit-cuda.readthedocs.io/en/latest/
