
Luciano Martins and Robert Sohigian, 2018-11-22

Prototyping and Developing GPU-Accelerated Solutions with Python and CUDA


2

Agenda

Introduction to Python

GPU-Accelerated Computing

NVIDIA® CUDA® technology

Why Use Python with GPUs?

Methods: PyCUDA, Numba, CuPy, and scikit-cuda

Summary

Q&A


3

Introduction to Python

Released by Guido van Rossum in 1991

The Zen of Python:

Beautiful is better than ugly.

Explicit is better than implicit.

Simple is better than complex.

Complex is better than complicated.

Flat is better than nested.

Interpreted language (CPython, Jython, ...)

Dynamically typed; based on objects


4

Introduction to Python

Small core structure:

~30 keywords

~80 built-in functions

Indentation is a pretty serious thing

Dynamically typed; based on objects

Binds to many different languages

Supports GPU acceleration via modules
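As a tiny illustration of the points above (the function and variable names are illustrative), indentation delimits blocks and names are dynamically typed:

def describe(value):
    # indentation, not braces, defines the block structure
    if isinstance(value, str):
        return "a string of length %d" % len(value)
    return "a number: %s" % value

x = 42              # x is bound to an int...
print(describe(x))
x = "forty-two"     # ...and can be rebound to a str at run time
print(describe(x))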


5

Introduction to Python


6

Introduction to Python


7

Introduction to Python


8

GPU-Accelerated Computing

“[T]he use of a graphics processing unit (GPU) together with a CPU to accelerate deep learning, analytics, and engineering applications” (NVIDIA)

Most common GPU-accelerated operations:

Large vector/matrix operations (Basic Linear Algebra Subprograms - BLAS)

Speech recognition

Computer vision


9

GPU-Accelerated Computing

Important concepts for GPU-accelerated computing:

Host ― the machine running the workload (CPU)

Device ― the GPU(s) inside a host

Kernel ― the part of the code that runs on the GPU

SIMT ― Single Instruction, Multiple Threads


10

GPU-Accelerated Computing


11

GPU-Accelerated Computing


12

CUDA

Parallel computing platform and programming model developed by NVIDIA:

Stands for Compute Unified Device Architecture

Based on C/C++ with some extensions

Fairly short learning curve for programmers with OpenMP and MPI experience

CUDA on a system has three components:

Driver (software that controls the graphics card)

Toolkit (nvcc, several libraries, debugging tools)

SDK (examples and error-checking utilities)
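As a quick, hedged sketch (assuming PyCUDA, covered later in this deck, is installed), the driver and the devices it exposes can be inspected from Python:

import pycuda.driver as drv

drv.init()                                          # loads the CUDA driver
print("Driver version:", drv.get_driver_version())
for i in range(drv.Device.count()):                 # GPUs visible on this host
    print("GPU %d: %s" % (i, drv.Device(i).name()))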


13

CUDA

A kernel is executed as a grid of thread blocks

All threads within a block share a portion of data memory

A thread block is a batch of threads that can cooperate with each other by:

Synchronizing their execution to provide hazard-free common memory accesses

Efficiently sharing data through low-latency shared memory

Multiple blocks are combined to form a grid

Blocks on a grid contain the same number of threads
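A minimal sketch of this hierarchy using Numba's CUDA support (introduced later in this deck; sizes are illustrative): each thread computes a global index from its block and thread coordinates.

import numpy as np
from numba import cuda

@cuda.jit
def fill_with_index(out):
    # global index = block index * block size + thread index within the block
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if i < out.size:             # guard: the grid may be larger than the data
        out[i] = i

data = np.zeros(1000, dtype=np.float32)
threads_per_block = 128
blocks_per_grid = (data.size + threads_per_block - 1) // threads_per_block
fill_with_index[blocks_per_grid, threads_per_block](data)   # launch the grid
print(data[:5])                  # -> [0. 1. 2. 3. 4.]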


14

CUDA

[Diagram: the Host launches Kernel 1, which executes on the Device as Grid 1, a 3 x 2 arrangement of blocks (0,0) through (2,1); Kernel 2 executes as Grid 2. Block (1,1) is expanded to show its 5 x 3 arrangement of threads (0,0) through (4,2).]


15

CUDA

The host performs the following tasks (CPU):

1. Initializes the GPU card(s)

2. Allocates memory on the host and on the device

3. Copies data from host memory to device memory

4. Launches instances of the kernel on the device(s)

5. Copies data from device memory back to the host

6. Repeats steps 3-5 as needed

7. De-allocates all memory and terminates
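A minimal sketch of this flow using PyCUDA's GPUArray (PyCUDA is covered a few slides ahead; the array and the operation are illustrative):

import numpy as np
import pycuda.autoinit                 # step 1: initializes the GPU
import pycuda.gpuarray as gpuarray

a = np.random.randn(1024).astype(np.float32)   # step 2: host memory
a_gpu = gpuarray.to_gpu(a)             # steps 2-3: allocate device memory and copy to it
b_gpu = 2 * a_gpu + 1                  # step 4: elementwise kernels run on the device
b = b_gpu.get()                        # step 5: copy the result back to host memory
print(np.allclose(b, 2 * a + 1))       # step 7: device memory is freed when the GPUArrays are released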


16

“Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together.”

https://www.python.org/doc/essays/blurb/


17

Python (and the Need for Speed)

Interpreted, high-level languages can be too slow for high-performance workloads, so Python needs assistance for those tasks.

Keep the best of both scenarios:

Quick development and prototyping with Python

Use the processing power and speed of the GPU


18

Accelerating Python

Accelerated code may be pure Python or may also involve C code.

Focusing here on the following modules:

PyCUDA

Numba

CuPy

scikit-cuda


19

PyCUDA

A Python wrapper to the CUDA API

Gives speed to Python – near-zero wrapping overhead

Requires C programming knowledge (for the kernel)

Compiles the CUDA code and copies it to the GPU

CUDA errors translated to Python exceptions

Easy installation (pip)
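A minimal PyCUDA sketch along these lines (in the spirit of the PyCUDA documentation examples; names and sizes are illustrative): the kernel is plain CUDA C in a string, and drv.In/drv.Out handle the host/device copies.

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# the kernel is ordinary CUDA C, compiled by PyCUDA at run time
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")

a = np.random.randn(400).astype(np.float32)
b = np.random.randn(400).astype(np.float32)
dest = np.zeros_like(a)

# drv.In/drv.Out perform the host <-> device copies automatically
multiply_them(drv.Out(dest), drv.In(a), drv.In(b), block=(400, 1, 1), grid=(1, 1))
print(np.allclose(dest, a * b))        # -> True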


20

PyCUDA


21

PyCUDA


22

Numba

No need to write C-code

High-performance functions written in Python

On-the-fly code generation

Native code generation for the CPU and GPU

Integration with the Python scientific stack

Takes advantage of Python decorators

Code translation done using LLVM compiler
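A minimal Numba sketch (function names and sizes are illustrative): the decorators trigger on-the-fly native code generation via LLVM, for the CPU with @jit and for the GPU with @vectorize.

import numpy as np
from numba import jit, vectorize

@jit(nopython=True)                    # compiled to native CPU code via LLVM
def total(arr):
    s = 0.0
    for x in arr:                      # this loop runs as machine code
        s += x
    return s

@vectorize(['float32(float32, float32)'], target='cuda')
def gpu_add(a, b):                     # compiled into a GPU ufunc
    return a + b

x = np.arange(1_000_000, dtype=np.float32)
print(total(x))
print(gpu_add(x, x)[:5])               # elementwise addition executed on the GPU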


23

Numba

https://numba.pydata.org/numba-examples/examples/finance/blackscholes/results.html


24

Numba


25

CuPy

An implementation of NumPy-compatible multi-dimensional array on CUDA

Useful to perform matrix ops on GPUs

Provides easy ways to define three types of CUDA kernels:

Elementwise kernels

Reduction kernels

Raw kernels

Also easy to install (pip)
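A minimal CuPy sketch (shapes and the kernel name are illustrative): the array API mirrors NumPy, and ElementwiseKernel is the first of the three kernel types listed above.

import cupy as cp

x = cp.arange(6, dtype=cp.float32).reshape(2, 3)   # allocated in GPU memory
y = cp.ones((2, 3), dtype=cp.float32)
print(cp.asnumpy(x @ y.T))             # NumPy-style matrix operation, run on the GPU

# an elementwise kernel: the body is a fragment of CUDA C
squared_diff = cp.ElementwiseKernel(
    'float32 a, float32 b',            # inputs
    'float32 out',                     # output
    'out = (a - b) * (a - b)',         # per-element operation
    'squared_diff')
print(squared_diff(x, y))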


26

CuPy


27

CuPy

Array Size    NumPy [ms]    CuPy [ms]
10^4          0.03          0.58
10^5          0.20          0.97
10^6          2.00          1.84
10^7          55.55         12.48
10^8          517.17        84.73
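The deck does not show which operation produced these numbers; purely as an illustration of how such a comparison can be timed, note that the GPU must be synchronized before the clock is stopped (the expression below is an assumption, not the original benchmark):

import time
import numpy as np
import cupy as cp

def bench_ms(xp, n):
    a = xp.arange(n, dtype=xp.float32)
    if xp is cp:
        cp.cuda.Stream.null.synchronize()    # finish setup before timing
    start = time.perf_counter()
    b = a * 2.0 + 1.0                        # illustrative elementwise expression
    if xp is cp:
        cp.cuda.Stream.null.synchronize()    # wait for the GPU to finish
    return (time.perf_counter() - start) * 1000.0

for n in (10**4, 10**6, 10**8):
    print(n, bench_ms(np, n), bench_ms(cp, n))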


28

scikit-cuda

Motivated by the idea of enhancing PyCUDA

Exposes GPU-powered libraries such as CUBLAS and CUFFT

Tested on Linux (potentially works elsewhere)

Can be seen as “SciPy on GPU juice”

Presents low-level and high-level functions


29

scikit-cuda

Low-Level Functions

Wrapping C functions via ctypes

Catching errors and mapping to Python exceptions

High-Level Functions

Take advantage of PyCUDA GPUArray to manipulate matrices in GPU memory

Available high-level functions include FFT/IFFT, numerical integration, randomized linear algebra, and NumPy-like routines not available in PyCUDA (cumsum, zeros, etc.)
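A minimal scikit-cuda sketch of one of the high-level functions named above (an FFT run on the GPU and checked against NumPy; the signal size is illustrative):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import skcuda.fft as cufft

x = np.random.rand(1024).astype(np.float32)
x_gpu = gpuarray.to_gpu(x)                             # signal in GPU memory
xf_gpu = gpuarray.empty(1024 // 2 + 1, np.complex64)   # real-to-complex output size

plan = cufft.Plan(x.shape, np.float32, np.complex64)   # cuFFT plan via scikit-cuda
cufft.fft(x_gpu, xf_gpu, plan)                         # FFT executed on the GPU

print(np.allclose(xf_gpu.get(), np.fft.rfft(x), atol=1e-2))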


30

scikit-cuda


31

Summary

Many GPU-acceleration modules ported to Python are readily available

Keeps the simplicity of Python whilst adding GPU performance

Allows faster prototype development cycles

Supports C-level performance, depending on the module (approach) chosen

Covers everything from matrix operations and scientific programming to custom kernel creation


32

Summary Pages


33

PyCUDA Summary

PyCUDA

CUDA Python wrapper

C code added directly in the Python project

Support for all CUDA libraries

Significant complexity due to the kernels being written in C

https://documen.tician.de/pycuda/


34

Numba Summary

Numba

Similar coverage to PyCUDA

No C coding needed

Takes advantage of LLVM and JIT compiling

Missing: dynamic parallelism and texture memory

http://numba.pydata.org/doc.html


35

CuPy Summary

CuPy

Fully supports NumPy structures

Performs the same operations at scale using the GPU

Allows CPU/GPU agnostic code creation

https://docs-cupy.chainer.org/en/stable


36

scikit-cuda Summary

scikit-cuda:

Scientific computing using Python and GPUs

Presents high-level and low-level functions

Broad coverage of operations already available

Depends on PyCUDA GPUArray mechanisms

http://scikit-cuda.readthedocs.io/en/latest/
