93
PyCUDA Loo.py GPU-DG Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA Andreas Kl¨ ockner Courant Institute of Mathematical Sciences New York University PASI: The Challenge of Massive Parallelism Lecture 4 · January 8, 2011 AndreasKl¨ockner GPU-Python with PyOpenCL and PyCUDA

Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

  • Upload
    others

  • View
    14

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

Easy, Effective, Efficient:GPU Programming in Pythonwith PyOpenCL and PyCUDA

Andreas Klockner

Courant Institute of Mathematical SciencesNew York University

PASI: The Challenge of Massive ParallelismLecture 4 · January 8, 2011

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 2: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

Outline

1 PyCUDA

2 Automatic GPU Programming

3 GPU-DG: Challenges and Solutions

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 3: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

Lab Solutions

Lab solutions:

Lab 1 yesterday:Sorry, posted wrong tarball(I think)

Will post lab solutions after secondlab today:http://tiker.net/tmp/

pasi-lab-solution.tar.gz

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 4: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

Outline

1 PyCUDA

2 Automatic GPU Programming

3 GPU-DG: Challenges and Solutions

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 5: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

Whetting your appetite

1 import pycuda.driver as cuda2 import pycuda.autoinit , pycuda.compiler3 import numpy45 a = numpy.random.randn(4,4).astype(numpy.float32)6 a gpu = cuda.mem alloc(a.nbytes)7 cuda.memcpy htod(a gpu, a)

[This is examples/demo.py in the PyCUDA distribution.]

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 6: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

Whetting your appetite

1 mod = pycuda.compiler.SourceModule(”””2 global void twice( float ∗a)3 4 int idx = threadIdx.x + threadIdx.y∗4;5 a[ idx ] ∗= 2;6 7 ”””)89 func = mod.get function(”twice”)

10 func(a gpu, block=(4,4,1))1112 a doubled = numpy.empty like(a)13 cuda.memcpy dtoh(a doubled, a gpu)14 print a doubled15 print a

Compute kernel

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 7: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

Whetting your appetite

1 mod = pycuda.compiler.SourceModule(”””2 global void twice( float ∗a)3 4 int idx = threadIdx.x + threadIdx.y∗4;5 a[ idx ] ∗= 2;6 7 ”””)89 func = mod.get function(”twice”)

10 func(a gpu, block=(4,4,1))1112 a doubled = numpy.empty like(a)13 cuda.memcpy dtoh(a doubled, a gpu)14 print a doubled15 print a

Compute kernel

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 8: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

Whetting your appetite, Part II

Did somebody say “Abstraction is good”?

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 9: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

Whetting your appetite, Part II

1 import numpy2 import pycuda.autoinit3 import pycuda.gpuarray as gpuarray45 a gpu = gpuarray.to gpu(6 numpy.random.randn(4,4).astype(numpy.float32))7 a doubled = (2∗a gpu).get()8 print a doubled9 print a gpu

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 10: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

gpuarray: Simple Linear Algebra

pycuda.gpuarray:

Meant to look and feel just like numpy.

gpuarray.to gpu(numpy array)

numpy array = gpuarray.get()

+, -, ∗, /, fill, sin, exp, rand,basic indexing, norm, inner product, . . .

Mixed types (int32 + float32 = float64)

print gpuarray for debugging.

Allows access to raw bits

Use as kernel arguments, textures, etc.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 11: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

Sparse Matrix-Vector on the GPU

New feature in 0.94:Sparse matrix-vectormultiplication

Uses “packeted format”by Garland and Bell (alsoincludes parts of their code)

Integrates with scipy.sparse.

Conjugate-gradients solverincluded

Deferred convergencechecking

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 12: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

PyOpenCL ↔ PyCUDA: A (rough) dictionary

PyOpenCL PyCUDAContext Context

CommandQueue Stream

Buffer mem alloc / DeviceAllocation

Program SourceModule

Kernel Function

Event (eg. enqueue marker) Event

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 13: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

Scripting: Interpreted, not Compiled

Program creation workflow:

Edit

Compile

Link

Run

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 14: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

Scripting: Interpreted, not Compiled

Program creation workflow:

Edit

Compile

Link

Run

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 15: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

Scripting: Interpreted, not Compiled

Program creation workflow:

Edit

Compile

Link

Run

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 16: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

PyCUDA: Workflow

Edit

PyCUDA

Run

SourceModule("...")

Cache?

nvcc

no

.cubin

Upload to GPU

Run on GPU

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 17: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

PyCUDA in the CUDA ecosystem

Hardware

Kernel Driver

Driver API

Runtime API PyCuda

C/C++ Python

CUDA has two ProgrammingInterfaces:

“Runtime” high-level(separate install)

“Driver” low-level(libcuda.so, comes withGPU driver)

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 18: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

PyCUDA: Vital Information

http://mathema.tician.de/

software/pycuda

Complete documentation

X Consortium License(no warranty, free for all use)

Convenient abstractionsArray, Fast Vector Math, Reductions

Requires: numpy, Python 2.4+(Win/OS X/Linux)

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 19: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

Outline

1 PyCUDA

2 Automatic GPU Programming

3 GPU-DG: Challenges and Solutions

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 20: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

Automating GPU Programming

GPU programming can be time-consuming, unintuitive anderror-prone.

Obvious idea: Let the computer do it.

One way: Smart compilers

GPU programming requires complex tradeoffsTradeoffs require heuristicsHeuristics are fragile

Another way: Dumb enumeration

Enumerate loop slicingsEnumerate prefetch optionsChoose by running resulting code on actual hardware

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 21: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

Automating GPU Programming

GPU programming can be time-consuming, unintuitive anderror-prone.

Obvious idea: Let the computer do it.

One way: Smart compilers

GPU programming requires complex tradeoffsTradeoffs require heuristicsHeuristics are fragile

Another way: Dumb enumeration

Enumerate loop slicingsEnumerate prefetch optionsChoose by running resulting code on actual hardware

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 22: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

Automating GPU Programming

GPU programming can be time-consuming, unintuitive anderror-prone.

Obvious idea: Let the computer do it.

One way: Smart compilers

GPU programming requires complex tradeoffsTradeoffs require heuristicsHeuristics are fragile

Another way: Dumb enumeration

Enumerate loop slicingsEnumerate prefetch optionsChoose by running resulting code on actual hardware

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 23: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

Loo.py Example

Empirical GPU loop optimization:

a, b, c, i , j , k = [var(s) for s in ”abcijk”]n = 500k = make loop kernel([

LoopDimension(”i”, n),LoopDimension(”j”, n),LoopDimension(”k”, n),], [(c[ i+n∗j], a[ i+n∗k]∗b[k+n∗j])])

gen kwargs = ”min threads”: 128,”min blocks”: 32,

→ Ideal case: Finds 160 GF/s kernelwithout human intervention.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 24: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

Loo.py Status

Limited scope:

Require input/output separationKernels must be expressible using“loopy” model(i.e. indices decompose into “output”and “reduction”)Enough for DG, LA, FD, . . .

Kernel compilation limits trial rate

Non-Goal: Peak performance

Good results currently for dense linearalgebra and (some) DG subkernels

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 25: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG

Loo.py Status

Limited scope:

Require input/output separationKernels must be expressible using“loopy” model(i.e. indices decompose into “output”and “reduction”)Enough for DG, LA, FD, . . .

Kernel compilation limits trial rate

Non-Goal: Peak performance

Good results currently for dense linearalgebra and (some) DG subkernels

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 26: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Outline

1 PyCUDA

2 Automatic GPU Programming

3 GPU-DG: Challenges and SolutionsIntroductionChallengesBenefits of MetaprogrammingGPU-DG: Performance and GeneralityViscous Shock Capture

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 27: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Outline

1 PyCUDA

2 Automatic GPU Programming

3 GPU-DG: Challenges and SolutionsIntroductionChallengesBenefits of MetaprogrammingGPU-DG: Performance and GeneralityViscous Shock Capture

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 28: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Discontinuous Galerkin Method

Let Ω :=⋃

i Dk ⊂ Rd .

Goal

Solve a conservation law on Ω: ut +∇ · F (u) = 0

Example

Maxwell’s Equations: EM field: E (x , t), H(x , t) on Ω governed by

∂tE −1

ε∇× H = − j

ε, ∂tH +

1

µ∇× E = 0,

∇ · E =ρ

ε, ∇ · H = 0.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 29: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Discontinuous Galerkin Method

Let Ω :=⋃

i Dk ⊂ Rd .

Goal

Solve a conservation law on Ω: ut +∇ · F (u) = 0

Example

Maxwell’s Equations: EM field: E (x , t), H(x , t) on Ω governed by

∂tE −1

ε∇× H = − j

ε, ∂tH +

1

µ∇× E = 0,

∇ · E =ρ

ε, ∇ · H = 0.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 30: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Discontinuous Galerkin Method

Let Ω :=⋃

i Dk ⊂ Rd .

Goal

Solve a conservation law on Ω: ut +∇ · F (u) = 0

Example

Maxwell’s Equations: EM field: E (x , t), H(x , t) on Ω governed by

∂tE −1

ε∇× H = − j

ε, ∂tH +

1

µ∇× E = 0,

∇ · E =ρ

ε, ∇ · H = 0.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 31: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Discontinuous Galerkin Method

Multiply by test function, integrate by parts:

0 =

ˆDk

utϕ+ [∇ · F (u)]ϕ dx

=

ˆDk

utϕ− F (u) · ∇ϕ dx +

ˆ∂Dk

(n · F )∗ϕ dSx ,

Subsitute in basis functions, introduce elementwise stiffness, mass,and surface mass matrices matrices S , M, MA:

∂tuk = −

∑ν

D∂ν ,k [F (uk)] + Lk [n · F − (n · F )∗]|A⊂∂Dk.

For straight-sided simplicial elements:Reduce D∂ν and L to reference matrices.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 32: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Decomposition of a DG operator into Subtasks

DG’s execution decomposes into two (mostly) separate branches:

uk

Flux Gather Flux Lifting

F (uk) Local Differentiation

∂tuk

Green: Element-local parts of the DG operator.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 33: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

DG on GPUs: Possible Advantages

DG on GPUs: Why?

GPUs have deep Memory Hierarchy

The majority of DG is local.

Compute Bandwidth Memory Bandwidth

DG is arithmetically intense.

GPUs favor dense data.

Local parts of the DG operator are dense.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 34: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

DG on the GPU: What are we trying to achieve?

Objectives:

Main: SpeedReduce need for compute-bound clusters

Secondary: GeneralityBe applicable to many problems

Tertiary: Ease-of-UseHide complexity of GPU hardware

Setting (for now):

Specialize to straight-sided simplices

Optimize for (but don’t specialize to)tetrahedra (ie. 3D)

Optimize for “medium” order (3. . . 5)

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 35: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Outline

1 PyCUDA

2 Automatic GPU Programming

3 GPU-DG: Challenges and SolutionsIntroductionChallengesBenefits of MetaprogrammingGPU-DG: Performance and GeneralityViscous Shock Capture

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 36: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Element-Local Operations: Differentiation

Local TemplatedDerivative MatricesLocal Templated

Derivative MatricesLocal Templated

Derivative Matrices

Np

Np

K

NpField Data

Geometric Factors

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 37: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Element-Local Operations: Lifting

Local TemplatedLifting Matrix

NfNfp

Np

K

NfNfpFacial

Field Data

(Inverse) Jacobians

On-Chip Storage

?

?

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 38: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Element-Local Operations: Lifting

Local TemplatedLifting Matrix

NfNfp

Np

K

NfNfpFacial

Field Data

(Inverse) Jacobians

On-Chip Storage

?

?

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 39: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Element-Local Operations: Lifting

Local TemplatedLifting Matrix

NfNfp

Np

K

NfNfpFacial

Field Data

(Inverse) Jacobians

On-Chip Storage

?

?

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 40: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Element-Local Operations: Lifting

Local TemplatedLifting Matrix

NfNfp

Np

K

NfNfpFacial

Field Data

(Inverse) Jacobians

On-Chip Storage

?

?

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 41: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Best use for on-chip memory?

Basic Problem

On-chip storage is scarce. . .. . . and will be for the foreseeable future.

Possible uses:

Matrix/Matrices

Part of a matrix

Field Data

Both

How to decide? Does it matter?

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 42: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Work Partition for Element-Local Operators

Natural Work Decomposition:One Element per Block

+ Straightforward to implement+ No granularity penalty- Cannot fill wide SIMD: unused compute

power for small to medium elements- Data alignment: Padding wastes memory- Cannot amortize cost of preparation steps

(e.g. fetching)

?

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 43: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Work Partition for Element-Local Operators

Natural Work Decomposition:One Element per Block

+ Straightforward to implement+ No granularity penalty- Cannot fill wide SIMD: unused compute

power for small to medium elements- Data alignment: Padding wastes memory- Cannot amortize cost of preparation steps

(e.g. fetching)

?

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 44: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Loop Slicing for element-local parts of GPU DG

Per Block: KL element-local mat.mult. + matrix loadPreparation

Question: How should one assign work to threads?

ws : in sequenceThread

t

wi : “inline-parallel”Thread

t

wp: in parallel

Thread

t

(amortize preparation) (exploit register space)

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 45: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Loop Slicing for element-local parts of GPU DG

Per Block: KL element-local mat.mult. + matrix loadPreparation

Question: How should one assign work to threads?

ws : in sequenceThread

t

wi : “inline-parallel”Thread

t

wp: in parallel

Thread

t

(amortize preparation) (exploit register space)

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 46: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Best Work Partition?

Basic Problem

Additional tier in parallelism offers additional choices. . .. . . but very little in the way of guidance.

Possible work partitions:

One or multiple elements perblock?

One or multiple DOFs per thread?

In parallel?In sequence?In-line?

How to decide? Does it matter?

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 47: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Work Partition for Surface Flux Evaluation

Granularity Tradeoff:

Large Blocks:+ More Data Reuse- Less Parallelism- Less Latency Hiding

Block Size limited by two factors:

Output buffer sizeFace metadata size

Optimal Block Size:not obvious

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 48: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Work Partition for Surface Flux Evaluation

Granularity Tradeoff:

Large Blocks:+ More Data Reuse- Less Parallelism- Less Latency Hiding

Block Size limited by two factors:

Output buffer sizeFace metadata size

Optimal Block Size:not obvious

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 49: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Work Partition for Surface Flux Evaluation

Granularity Tradeoff:

Large Blocks:+ More Data Reuse- Less Parallelism- Less Latency Hiding

Block Size limited by two factors:

Output buffer sizeFace metadata size

Optimal Block Size:not obvious

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 50: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

More than one Granularity

Different block sizes introduced so far:

Differentiation

Lifting

Surface Fluxes

Idea

Introduce another, smaller block size tosatisfy SIMD width and alignmentconstraints. (“Microblock”)

And demand other block sizes be amultiple of this new size

How big? Not obvious.

Element

Element

. . .

Element

Element

. . .

Element

Element

. . .

Padding

NpKMNp

128

64

0

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 51: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

More than one Granularity

Different block sizes introduced so far:

Differentiation

Lifting

Surface Fluxes

Idea

Introduce another, smaller block size tosatisfy SIMD width and alignmentconstraints. (“Microblock”)

And demand other block sizes be amultiple of this new size

How big? Not obvious.

Element

Element

. . .

Element

Element

. . .

Element

Element

. . .

Padding

NpKMNp

128

64

0

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 52: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

More than one Granularity

Different block sizes introduced so far:

Differentiation

Lifting

Surface Fluxes

Idea

Introduce another, smaller block size tosatisfy SIMD width and alignmentconstraints. (“Microblock”)

And demand other block sizes be amultiple of this new size

How big? Not obvious.

Element

Element

. . .

Element

Element

. . .

Element

Element

. . .

Padding

NpKMNp

128

64

0

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 53: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

DG on GPUs: Implementation Choices

Many difficult questions

Insufficient heuristics

Answers are hardware-specific andhave no lasting value

Proposed Solution: Tune automaticallyfor hardware at computation time, cachetuning results.

Decrease reliance on knowledge ofhardware internals

Shift emphasis fromtuning results to tuning ideas

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 54: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

DG on GPUs: Implementation Choices

Many difficult questions

Insufficient heuristics

Answers are hardware-specific andhave no lasting value

Proposed Solution: Tune automaticallyfor hardware at computation time, cachetuning results.

Decrease reliance on knowledge ofhardware internals

Shift emphasis fromtuning results to tuning ideas

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 55: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Outline

1 PyCUDA

2 Automatic GPU Programming

3 GPU-DG: Challenges and SolutionsIntroductionChallengesBenefits of MetaprogrammingGPU-DG: Performance and GeneralityViscous Shock Capture

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 56: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Metaprogramming for GPU-DG

Specialize code for user-given problem:

Flux Terms

(*)

Automated Tuning:

Memory layoutLoop slicing

(*)

Gather granularity

Constants instead of variables:

DimensionalityPolynomial degreeElement propertiesMatrix sizes

Loop Unrolling

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 57: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Metaprogramming for GPU-DG

Specialize code for user-given problem:

Flux Terms

(*)

Automated Tuning:

Memory layoutLoop slicing

(*)

Gather granularity

Constants instead of variables:

DimensionalityPolynomial degreeElement propertiesMatrix sizes

Loop Unrolling

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 58: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Metaprogramming for GPU-DG

Specialize code for user-given problem:

Flux Terms

(*)

Automated Tuning:

Memory layoutLoop slicing

(*)

Gather granularity

Constants instead of variables:

DimensionalityPolynomial degreeElement propertiesMatrix sizes

Loop Unrolling

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 59: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Metaprogramming for GPU-DG

Specialize code for user-given problem:

Flux Terms

(*)

Automated Tuning:

Memory layoutLoop slicing

(*)

Gather granularity

Constants instead of variables:

DimensionalityPolynomial degreeElement propertiesMatrix sizes

Loop Unrolling

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 60: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Metaprogramming for GPU-DG

Specialize code for user-given problem:

Flux Terms (*)

Automated Tuning:

Memory layoutLoop slicing (*)Gather granularity

Constants instead of variables:

DimensionalityPolynomial degreeElement propertiesMatrix sizes

Loop Unrolling

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 61: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Loop Slicing for Differentiation

15 20 25 30wp

0.8

1.0

1.2

1.4

1.6

1.8

2.0

2.2

Execu

tion t

ime [ms]

Local differentiation, matrix-in-shared,order 4, with microblockingpoint size denotes wi ∈

1, ,4

1.0

1.2

1.4

1.6

1.8

2.0

2.2

2.4

2.6

2.8

3.0

ws

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 62: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Metaprogramming DG: Flux Terms

0 =

ˆDk

utϕ+ [∇ · F (u)]ϕ dx −ˆ∂Dk

[n · F − (n · F )∗]ϕ dSx︸ ︷︷ ︸Flux term

Flux terms:

vary by problem

expression specified by user

evaluated pointwise

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 63: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Metaprogramming DG: Flux Terms

0 =

ˆDk

utϕ+ [∇ · F (u)]ϕ dx −ˆ∂Dk

[n · F − (n · F )∗]ϕ dSx︸ ︷︷ ︸Flux term

Flux terms:

vary by problem

expression specified by user

evaluated pointwise

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 64: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Metaprogramming DG: Flux Terms Example

Example: Fluxes for Maxwell’s Equations

n · (F − F ∗)E :=1

2[n × (JHK− αn × JEK)]

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 65: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Metaprogramming DG: Flux Terms Example

Example: Fluxes for Maxwell’s Equations

n · (F − F ∗)E :=1

2[n × (JHK− αn × JEK)]

User writes: Vectorial statement in math. notation

flux = 1/2∗cross(normal, h. int−h.ext−alpha∗cross(normal, e. int−e.ext))

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 66: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Metaprogramming DG: Flux Terms Example

Example: Fluxes for Maxwell’s Equations

n · (F − F ∗)E :=1

2[n × (JHK− αn × JEK)]

We generate: Scalar evaluator in C (6×)

a flux += (((( val a field5 − val b field5 )∗ fpair−>normal[2]− ( val a field4 − val b field4 )∗ fpair−>normal[0])

+ val a field0 − val b field0 )∗ fpair−>normal[0]− ((( val a field4 − val b field4 ) ∗ fpair−>normal[1]− ( val a field1 − val b field1 )∗ fpair−>normal[2])

+ val a field3 − val b field3 ) ∗ fpair−>normal[1])∗value type (0.5);

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 67: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Hedge DG Solver

High-Level Operator Description

Maxwell’sEulerPoissonCompressible Navier-Stokes, . . .

One Code runs. . .

. . . on CPU, CUDA

. . . on CPU,CUDA+MPI

. . . in 1D, 2D, 3D

. . . at any order

Uses CPU, GPU codegeneration

Open Source (GPL3)

Written in Python,

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 68: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Outline

1 PyCUDA

2 Automatic GPU Programming

3 GPU-DG: Challenges and SolutionsIntroductionChallengesBenefits of MetaprogrammingGPU-DG: Performance and GeneralityViscous Shock Capture

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 69: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Nvidia GTX280 vs. single core of Intel Core 2 Duo E8400

0 2 4 6 8 10Polynomial Order N

0

50

100

150

200

250

300

GFl

ops/

s

GPU

CPU

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 70: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Memory Bandwidth on a GTX 280

1 2 3 4 5 6 7 8 9Polynomial Order N

20

40

60

80

100

120

140

160

180

200

Glo

bal M

em

ory

Bandw

idth

[G

B/s

]

GatherLiftDiffAssy.Peak

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 71: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Multiple GPUs via MPI: 16 GPUs vs. 64 CPUs

0 2 4 6 8 10Polynomial Order N

0

1000

2000

3000

4000

GFl

ops/

s

Flop Rates: 16 GPUs vs 64 CPU cores

GPUCPU

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 72: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

GPU-DG in Double Precision

0 2 4 6 8 10Polynomial Order N

0

50

100

150

200

250

300

350

400

GFl

ops/

s

GPU-DG: Double vs. Single Precision

SingleDouble

0.0

0.2

0.4

0.6

0.8

1.0

Ratio

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 73: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Outline

1 PyCUDA

2 Automatic GPU Programming

3 GPU-DG: Challenges and SolutionsIntroductionChallengesBenefits of MetaprogrammingGPU-DG: Performance and GeneralityViscous Shock Capture

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 74: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Nonlinear conservation laws → shocks?

0 2 4 6 8 10x

u(x

)

t=0.00

t=0.67

t=1.33

0 2 4 6 8 10x

u(x

)

t=0.00

t=0.67

t=1.33

1D advection:

∂tu + ∂xu = 0

1D advection with viscosity:

∂tu + v · ∇xu = ∇x · (ν∇xu).

Important: Conservation form.

Upwind fluxes for advection, IPDG for second-order

Detector → ν?GPU-suitability? Data locality?Properties? [Build on work byPersson/Peraire ‘06]

Time integrationImplicit/explicit? Adaptivity? RKC forbigger ∆t with viscosity?

Accuracy?Near shocks? Away from them?

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 75: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Nonlinear conservation laws → shocks?

0 2 4 6 8 10x

u(x

)

t=0.00

t=0.67

t=1.33

0 2 4 6 8 10x

u(x

)

t=0.00

t=0.67

t=1.33

1D advection:

∂tu + ∂xu = 0

1D advection with viscosity:

∂tu + v · ∇xu = ∇x · (ν∇xu).

Important: Conservation form.

Upwind fluxes for advection, IPDG for second-order

Detector → ν?GPU-suitability? Data locality?Properties? [Build on work byPersson/Peraire ‘06]

Time integrationImplicit/explicit? Adaptivity? RKC forbigger ∆t with viscosity?

Accuracy?Near shocks? Away from them?

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 76: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Nonlinear conservation laws → shocks?

0 2 4 6 8 10x

u(x

)

t=0.00

t=0.67

t=1.33

0 2 4 6 8 10x

u(x

)

t=0.00

t=0.67

t=1.33

1D advection:

∂tu + ∂xu = 0

1D advection with viscosity:

∂tu + v · ∇xu = ∇x · (ν∇xu).

Important: Conservation form.

Upwind fluxes for advection, IPDG for second-order

Detector → ν?GPU-suitability? Data locality?Properties? [Build on work byPersson/Peraire ‘06]

Time integrationImplicit/explicit? Adaptivity? RKC forbigger ∆t with viscosity?

Accuracy?Near shocks? Away from them?

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 77: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Nonlinear conservation laws → shocks?

0 2 4 6 8 10x

u(x

)

t=0.00

t=0.67

t=1.33

0 2 4 6 8 10x

u(x

)

t=0.00

t=0.67

t=1.33

1D advection:

∂tu + ∂xu = 0

1D advection with viscosity:

∂tu + v · ∇xu = ∇x · (ν∇xu).

Important: Conservation form.

Upwind fluxes for advection, IPDG for second-order

Detector → ν?GPU-suitability? Data locality?Properties? [Build on work byPersson/Peraire ‘06]

Time integrationImplicit/explicit? Adaptivity? RKC forbigger ∆t with viscosity?

Accuracy?Near shocks? Away from them?

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 78: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Results: Euler’s Equations of Gas Dynamics

Euler’s equations with viscosity:

∂tρ+∇x · (ρu) = ∇x · (ν∇xρ),

∂t(ρu) +∇x · (u⊗ (ρu)) +∇xp = ∇x · (ν∇x(ρu)),

∂tE +∇x · (u(E + p)) = ∇x · (ν∇xE ).

Again: Single ν, sensed on ρ. → Undue pollution ofthe other field?

[Persson/Peraire ‘06] suggest Navier-Stokes-like vis-cosity. No good: can’t control jumps in ρ.

Rusanov fluxes for Euler, IPDG for viscosity.

0.0 0.2 0.4 0.6 0.8 1.0x

0.0

0.2

0.4

0.6

0.8

1.0

1.2

ρ, p

Sod's Problem with N=5 and K=80

ρ

p

ρ (exact, L2 proj.)

p (exact, L2 proj.)

6 4 2 0 2 4 6x

0

2

4

6

8

10

12

ρ, p

Shock-Wave Interaction Problemwith N=5 and K=80

ρ

p

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 79: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Results: Euler’s Equations of Gas Dynamics

Euler’s equations with viscosity:

∂tρ+∇x · (ρu) = ∇x · (ν∇xρ),

∂t(ρu) +∇x · (u⊗ (ρu)) +∇xp = ∇x · (ν∇x(ρu)),

∂tE +∇x · (u(E + p)) = ∇x · (ν∇xE ).

Again: Single ν, sensed on ρ. → Undue pollution ofthe other field?

[Persson/Peraire ‘06] suggest Navier-Stokes-like vis-cosity. No good: can’t control jumps in ρ.

Rusanov fluxes for Euler, IPDG for viscosity.

0.0 0.2 0.4 0.6 0.8 1.0x

0.0

0.2

0.4

0.6

0.8

1.0

1.2

ρ, p

Sod's Problem with N=5 and K=80

ρ

p

ρ (exact, L2 proj.)

p (exact, L2 proj.)

6 4 2 0 2 4 6x

0

2

4

6

8

10

12

ρ, p

Shock-Wave Interaction Problemwith N=5 and K=80

ρ

p

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 80: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Results: Euler’s Equations of Gas Dynamics

Euler’s equations with viscosity:

∂tρ+∇x · (ρu) = ∇x · (ν∇xρ),

∂t(ρu) +∇x · (u⊗ (ρu)) +∇xp = ∇x · (ν∇x(ρu)),

∂tE +∇x · (u(E + p)) = ∇x · (ν∇xE ).

Again: Single ν, sensed on ρ. → Undue pollution ofthe other field?

[Persson/Peraire ‘06] suggest Navier-Stokes-like vis-cosity. No good: can’t control jumps in ρ.

Rusanov fluxes for Euler, IPDG for viscosity.

0.0 0.2 0.4 0.6 0.8 1.0x

0.0

0.2

0.4

0.6

0.8

1.0

1.2

ρ, p

Sod's Problem with N=5 and K=80

ρ

p

ρ (exact, L2 proj.)

p (exact, L2 proj.)

6 4 2 0 2 4 6x

0

2

4

6

8

10

12

ρ, p

Shock-Wave Interaction Problemwith N=5 and K=80

ρ

p

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 81: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Results: Euler’s Equations of Gas Dynamics

Euler’s equations with viscosity:

∂tρ+∇x · (ρu) = ∇x · (ν∇xρ),

∂t(ρu) +∇x · (u⊗ (ρu)) +∇xp = ∇x · (ν∇x(ρu)),

∂tE +∇x · (u(E + p)) = ∇x · (ν∇xE ).

Again: Single ν, sensed on ρ. → Undue pollution ofthe other field?

[Persson/Peraire ‘06] suggest Navier-Stokes-like vis-cosity. No good: can’t control jumps in ρ.

Rusanov fluxes for Euler, IPDG for viscosity.

0.0 0.2 0.4 0.6 0.8 1.0x

0.0

0.2

0.4

0.6

0.8

1.0

1.2

ρ, p

Sod's Problem with N=5 and K=80

ρ

p

ρ (exact, L2 proj.)

p (exact, L2 proj.)

6 4 2 0 2 4 6x

0

2

4

6

8

10

12

ρ, p

Shock-Wave Interaction Problemwith N=5 and K=80

ρ

p

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 82: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Results: Euler’s Equations of Gas Dynamics

Euler’s equations with viscosity:

∂tρ+∇x · (ρu) = ∇x · (ν∇xρ),

∂t(ρu) +∇x · (u⊗ (ρu)) +∇xp = ∇x · (ν∇x(ρu)),

∂tE +∇x · (u(E + p)) = ∇x · (ν∇xE ).

Again: Single ν, sensed on ρ. → Undue pollution ofthe other field?

[Persson/Peraire ‘06] suggest Navier-Stokes-like vis-cosity. No good: can’t control jumps in ρ.

Rusanov fluxes for Euler, IPDG for viscosity.

0.0 0.2 0.4 0.6 0.8 1.0x

0.0

0.2

0.4

0.6

0.8

1.0

1.2

ρ, p

Sod's Problem with N=5 and K=80

ρ

p

ρ (exact, L2 proj.)

p (exact, L2 proj.)

6 4 2 0 2 4 6x

0

2

4

6

8

10

12

ρ, p

Shock-Wave Interaction Problemwith N=5 and K=80

ρ

p

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 83: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Results: Euler’s Equations of Gas Dynamics

Euler’s equations with viscosity:

∂tρ+∇x · (ρu) = ∇x · (ν∇xρ),

∂t(ρu) +∇x · (u⊗ (ρu)) +∇xp = ∇x · (ν∇x(ρu)),

∂tE +∇x · (u(E + p)) = ∇x · (ν∇xE ).

Again: Single ν, sensed on ρ. → Undue pollution ofthe other field?

[Persson/Peraire ‘06] suggest Navier-Stokes-like vis-cosity. No good: can’t control jumps in ρ.

Rusanov fluxes for Euler, IPDG for viscosity.

0.0 0.2 0.4 0.6 0.8 1.0x

0.0

0.2

0.4

0.6

0.8

1.0

1.2

ρ, p

Sod's Problem with N=5 and K=80

ρ

p

ρ (exact, L2 proj.)

p (exact, L2 proj.)

6 4 2 0 2 4 6x

0

2

4

6

8

10

12

ρ, p

Shock-Wave Interaction Problemwith N=5 and K=80

ρ

p

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 84: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Results: Euler’s Equations of Gas Dynamics

Euler’s equations with viscosity:

∂tρ+∇x · (ρu) = ∇x · (ν∇xρ),

∂t(ρu) +∇x · (u⊗ (ρu)) +∇xp = ∇x · (ν∇x(ρu)),

∂tE +∇x · (u(E + p)) = ∇x · (ν∇xE ).

Again: Single ν, sensed on ρ. → Undue pollution ofthe other field?

[Persson/Peraire ‘06] suggest Navier-Stokes-like vis-cosity. No good: can’t control jumps in ρ.

Rusanov fluxes for Euler, IPDG for viscosity.

0.0 0.2 0.4 0.6 0.8 1.0x

0.0

0.2

0.4

0.6

0.8

1.0

1.2

ρ, p

Sod's Problem with N=5 and K=80

ρ

p

ρ (exact, L2 proj.)

p (exact, L2 proj.)

6 4 2 0 2 4 6x

0

2

4

6

8

10

12

ρ, p

Shock-Wave Interaction Problemwith N=5 and K=80

ρ

p

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 85: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Results: Euler’s Equations of Gas Dynamics

Euler’s equations with viscosity:

∂tρ+∇x · (ρu) = ∇x · (ν∇xρ),

∂t(ρu) +∇x · (u⊗ (ρu)) +∇xp = ∇x · (ν∇x(ρu)),

∂tE +∇x · (u(E + p)) = ∇x · (ν∇xE ).

Again: Single ν, sensed on ρ. → Undue pollution ofthe other field?

[Persson/Peraire ‘06] suggest Navier-Stokes-like vis-cosity. No good: can’t control jumps in ρ.

Rusanov fluxes for Euler, IPDG for viscosity.

0.0 0.2 0.4 0.6 0.8 1.0x

0.0

0.2

0.4

0.6

0.8

1.0

1.2

ρ, p

Sod's Problem with N=5 and K=80

ρ

p

ρ (exact, L2 proj.)

p (exact, L2 proj.)

6 4 2 0 2 4 6x

0

2

4

6

8

10

12

ρ, p

Shock-Wave Interaction Problemwith N=5 and K=80

ρ

p

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 86: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

GPU DG Showcase

Eletromagnetism

Poisson

CFD

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 87: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

GPU DG Showcase

Eletromagnetism

Poisson

CFD

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 88: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

GPU DG Showcase

Eletromagnetism

Poisson

CFD

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 89: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

GPU DG Showcase

Eletromagnetism

Poisson

CFD

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 90: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Where to from here?

PyCUDA, PyOpenCL, hedge

→ http://www.cims.nyu.edu/~kloeckner/

GPU-DG Article

AK, T. Warburton, J. Bridge, J.S. Hesthaven, “NodalDiscontinuous Galerkin Methods on Graphics Processors”,J. Comp. Phys., 228 (21), 7863–7882.

GPU RTCG

AK, N. Pinto et al. PyCUDA: GPU Run-Time Code Generation forHigh-Performance Computing, submitted.

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 91: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Conclusions

GPUs and scripting work surprisingly well together

Enable Run-Time Code Generation

GPU-DG is significantly faster than CPU-DG

Method well-suited a prioriNumerous tricks enable good performance

Further work in GPU-DG:

Curvilinear Elements (T. Warburton)Local Time SteppingShock Capturing for Nonlinear Equations

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 92: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Questions?

?

Thank you for your attention!

http://www.cims.nyu.edu/~kloeckner/

image credits

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA

Page 93: Easy, E ective, E cient: GPU Programming in Python with ...On-Chip Storage?? Andreas Kl ockner GPU-Python with PyOpenCL and PyCUDA. PyCUDALoo.pyGPU-DG IntroductionChallengesBene tsPerformanceShocks

PyCUDA Loo.py GPU-DG Introduction Challenges Benefits Performance Shocks

Image Credits

Exclamation mark: sxc.hu/cobrasoft

Adding Machine: flickr.com/thomashawk

Floppy disk: flickr.com/ethanheinCarrot: OpenClipart.orgDart board: sxc.hu/195617Question Mark: sxc.hu/svilen001Question Mark: sxc.hu/svilen001?/! Marks: sxc.hu/svilen001

Andreas Klockner GPU-Python with PyOpenCL and PyCUDA