119
OpenCL PyOpenCL Runtime Patterns Code gen. PyOpenCL Tutorial Andreas Kl¨ ockner Computer Science University of Illinois at Urbana-Champaign May 7, 2014 AndreasKl¨ockner PyOpenCL Tutorial

PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

Embed Size (px)

Citation preview

Page 1: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

PyOpenCL Tutorial

Andreas Klockner

Computer ScienceUniversity of Illinois at Urbana-Champaign

May 7, 2014

Andreas Klockner PyOpenCL Tutorial

Page 2: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Outline

1 OpenCL

2 PyOpenCL

3 OpenCL runtime

4 Parallel patterns

5 Code writes Code

6 OpenCL implementations

Andreas Klockner PyOpenCL Tutorial

Page 3: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

CL vs CUDA side-by-side

CUDA source code:global void transpose(

float ∗A t, float ∗A,int a width, int a height )

{int base idx a =blockIdx .x ∗ BLK SIZE +blockIdx .y ∗ A BLOCK STRIDE;

int base idx a t =blockIdx .y ∗ BLK SIZE +blockIdx .x ∗ A T BLOCK STRIDE;

int glob idx a =base idx a + threadIdx.x+ a width ∗ threadIdx.y;

int glob idx a t =base idx a t + threadIdx.x+ a height ∗ threadIdx .y;

shared float A shared[BLK SIZE][BLK SIZE+1];

A shared[threadIdx .y ][ threadIdx .x] =A[glob idx a ];

syncthreads ();

A t[ glob idx a t ] =A shared[threadIdx .x ][ threadIdx .y ];

}

OpenCL source code:void transpose(

global float ∗a t, global float ∗a,unsigned a width, unsigned a height){

int base idx a =get group id (0) ∗ BLK SIZE +get group id (1) ∗ A BLOCK STRIDE;

int base idx a t =get group id (1) ∗ BLK SIZE +get group id (0) ∗ A T BLOCK STRIDE;

int glob idx a =base idx a + get local id (0)+ a width ∗ get local id (1);

int glob idx a t =base idx a t + get local id (0)+ a height ∗ get local id (1);

local float a local [BLK SIZE][BLK SIZE+1];

a local [ get local id (1)∗BLK SIZE+get local id(0)] =a[ glob idx a ];

barrier (CLK LOCAL MEM FENCE);

a t [ glob idx a t ] =a local [ get local id (0)∗BLK SIZE+get local id(1)];

}

Andreas Klockner PyOpenCL Tutorial

Page 4: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

OpenCL ↔ CUDA: A dictionary

OpenCL CUDAGrid Grid

Work Group BlockWork Item Thread

kernel global

global device

local shared

private local

imagend t texture<type, n, ...>barrier(LMF) syncthreads()

get local id(012) threadIdx.xyz

get group id(012) blockIdx.xyz

get global id(012) – (reimplement)

Andreas Klockner PyOpenCL Tutorial

Page 5: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

OpenCL: Computing as a Service

Host(CPU)

Memory

Compute Device 0 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 0 (Platform 1)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 1)

· · ·· · ·· · ·

Memory

Platform 0 (e.g. CPUs)

Platform 1 (e.g. GPUs)

(think “chip”,has memoryinterface)

Compute Unit(think “processor”,has insn. fetch)

Processing Element(think “SIMD lane”)

Python

Device Language: ∼ C99

Andreas Klockner PyOpenCL Tutorial

Page 6: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

OpenCL: Computing as a Service

Host(CPU)

Memory

Compute Device 0 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 0 (Platform 1)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 1)

· · ·· · ·· · ·

Memory

Platform 0 (e.g. CPUs)

Platform 1 (e.g. GPUs)

(think “chip”,has memoryinterface)

Compute Unit(think “processor”,has insn. fetch)

Processing Element(think “SIMD lane”)

Python

Device Language: ∼ C99

Andreas Klockner PyOpenCL Tutorial

Page 7: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

OpenCL: Computing as a Service

Host(CPU)

Memory

Compute Device 0 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 0 (Platform 1)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 1)

· · ·· · ·· · ·

Memory

Platform 0 (e.g. CPUs)

Platform 1 (e.g. GPUs)

(think “chip”,has memoryinterface)

Compute Unit(think “processor”,has insn. fetch)

Processing Element(think “SIMD lane”)

Python

Device Language: ∼ C99

Andreas Klockner PyOpenCL Tutorial

Page 8: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

OpenCL: Computing as a Service

Host(CPU)

Memory

Compute Device 0 (Platform 0)

· · ·· · ·· · ·

MemoryCompute Device 1 (Platform 0)

· · ·· · ·· · ·

MemoryCompute Device 0 (Platform 1)

· · ·· · ·· · ·

MemoryCompute Device 1 (Platform 1)

· · ·· · ·· · ·

Memory

Platform 0 (e.g. CPUs)

Platform 1 (e.g. GPUs)

(think “chip”,has memoryinterface)

Compute Unit(think “processor”,has insn. fetch)

Processing Element(think “SIMD lane”)

Python

Device Language: ∼ C99

Andreas Klockner PyOpenCL Tutorial

Page 9: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

OpenCL: Computing as a Service

Host(CPU)

Memory

Compute Device 0 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 0 (Platform 1)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 1)

· · ·· · ·· · ·

Memory

Platform 0 (e.g. CPUs)

Platform 1 (e.g. GPUs)

(think “chip”,has memoryinterface)

Compute Unit(think “processor”,has insn. fetch)

Processing Element(think “SIMD lane”)

Python

Device Language: ∼ C99

Andreas Klockner PyOpenCL Tutorial

Page 10: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

OpenCL: Computing as a Service

Host(CPU)

Memory

Compute Device 0 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 0 (Platform 1)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 1)

· · ·· · ·· · ·

Memory

Platform 0 (e.g. CPUs)

Platform 1 (e.g. GPUs)

(think “chip”,has memoryinterface)

Compute Unit(think “processor”,has insn. fetch)

Processing Element(think “SIMD lane”)

Python

Device Language: ∼ C99

Andreas Klockner PyOpenCL Tutorial

Page 11: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

OpenCL: Computing as a Service

Host(CPU)

Memory

Compute Device 0 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 0 (Platform 1)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 1)

· · ·· · ·· · ·

Memory

Platform 0 (e.g. CPUs)

Platform 1 (e.g. GPUs)

(think “chip”,has memoryinterface)

Compute Unit(think “processor”,has insn. fetch)

Processing Element(think “SIMD lane”)

Python

Device Language: ∼ C99

Andreas Klockner PyOpenCL Tutorial

Page 12: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

OpenCL: Computing as a Service

Host(CPU)

Memory

Compute Device 0 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 0 (Platform 1)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 1)

· · ·· · ·· · ·

Memory

Platform 0 (e.g. CPUs)

Platform 1 (e.g. GPUs)

(think “chip”,has memoryinterface)

Compute Unit(think “processor”,has insn. fetch)

Processing Element(think “SIMD lane”)

Python

Device Language: ∼ C99

Andreas Klockner PyOpenCL Tutorial

Page 13: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

OpenCL: Computing as a Service

Host(CPU)

Memory

Compute Device 0 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 0 (Platform 1)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 1)

· · ·· · ·· · ·

Memory

Platform 0 (e.g. CPUs)

Platform 1 (e.g. GPUs)

(think “chip”,has memoryinterface)

Compute Unit(think “processor”,has insn. fetch)

Processing Element(think “SIMD lane”)

Python

Device Language: ∼ C99

Andreas Klockner PyOpenCL Tutorial

Page 14: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

OpenCL: Computing as a Service

Host(CPU)

Memory

Compute Device 0 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 0 (Platform 1)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 1)

· · ·· · ·· · ·

Memory

Platform 0 (e.g. CPUs)

Platform 1 (e.g. GPUs)

(think “chip”,has memoryinterface)

Compute Unit(think “processor”,has insn. fetch)

Processing Element(think “SIMD lane”)

Python

Device Language: ∼ C99

Andreas Klockner PyOpenCL Tutorial

Page 15: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

OpenCL: Computing as a Service

Host(CPU)

Memory

Compute Device 0 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 0 (Platform 1)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 1)

· · ·· · ·· · ·

Memory

Platform 0 (e.g. CPUs)

Platform 1 (e.g. GPUs)

(think “chip”,has memoryinterface)

Compute Unit(think “processor”,has insn. fetch)

Processing Element(think “SIMD lane”)

Python

Device Language: ∼ C99

Andreas Klockner PyOpenCL Tutorial

Page 16: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

OpenCL: Computing as a Service

Host(CPU)

Memory

Compute Device 0 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 0 (Platform 1)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 1)

· · ·· · ·· · ·

Memory

Platform 0 (e.g. CPUs)

Platform 1 (e.g. GPUs)

(think “chip”,has memoryinterface)

Compute Unit(think “processor”,has insn. fetch)

Processing Element(think “SIMD lane”)

Python

Device Language: ∼ C99

Andreas Klockner PyOpenCL Tutorial

Page 17: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

OpenCL: Computing as a Service

Host(CPU)

Memory

Compute Device 0 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 0 (Platform 1)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 1)

· · ·· · ·· · ·

Memory

Platform 0 (e.g. CPUs)

Platform 1 (e.g. GPUs)

(think “chip”,has memoryinterface)

Compute Unit(think “processor”,has insn. fetch)

Processing Element(think “SIMD lane”)

Python

Device Language: ∼ C99

Andreas Klockner PyOpenCL Tutorial

Page 18: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

OpenCL: Computing as a Service

Host(CPU)

Memory

Compute Device 0 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 0)

· · ·· · ·· · ·

Memory

Compute Device 0 (Platform 1)

· · ·· · ·· · ·

Memory

Compute Device 1 (Platform 1)

· · ·· · ·· · ·

Memory

Platform 0 (e.g. CPUs)

Platform 1 (e.g. GPUs)

(think “chip”,has memoryinterface)

Compute Unit(think “processor”,has insn. fetch)

Processing Element(think “SIMD lane”)

Python

Device Language: ∼ C99

Andreas Klockner PyOpenCL Tutorial

Page 19: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 20: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 21: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 22: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 23: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 24: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 25: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 26: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 27: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 28: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

Hardware

Software representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 29: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 30: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

Grid

(Kernel: Func-

tion on Grid)

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 31: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

Grid

(Kernel: Func-

tion on Grid)

(Work) Group

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 32: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

Grid

(Kernel: Func-

tion on Grid)

(Work) Group

(Work) Item

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 33: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 34: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 35: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 36: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 37: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 38: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 39: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 40: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 41: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 42: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 43: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 44: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 45: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

get local id(axis)?/size(axis)?

get group id(axis)?/num groups(axis)?

get global id(axis)?/size(axis)?

axis=0,1,2,...

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 46: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Connection: Hardware ↔ Programming Model

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Fetch/Decode

32 kiB CtxPrivate

(“Registers”)

16 kiB CtxShared

Who cares how

many cores?

Idea:

Program as if there were“infinitely” many cores

Program as if there were“infinitely” many ALUs percore

Consider: Which is easy to do automatically?

Parallel program → sequential hardware

or

Sequential program → parallel hardware?

Axis 0

Axi

s1

HardwareSoftware representation

?

Really: Group providespool of concurrency todraw from.

X,Y,Z order within groupmatters. (Not amonggroups, though.)

get local id(axis)?/size(axis)?

get group id(axis)?/num groups(axis)?

get global id(axis)?/size(axis)?

axis=0,1,2,...

Grids can be 1,2,3-dimensional.

Andreas Klockner PyOpenCL Tutorial

Page 47: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Outline

1 OpenCL

2 PyOpenCL

3 OpenCL runtime

4 Parallel patterns

5 Code writes Code

6 OpenCL implementations

Andreas Klockner PyOpenCL Tutorial

Page 48: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

DEMO TIME

Andreas Klockner PyOpenCL Tutorial

Page 49: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Outline

1 OpenCL

2 PyOpenCL

3 OpenCL runtimeA Kingdom of NounsSynchronization

4 Parallel patterns

5 Code writes Code

6 OpenCL implementations

Andreas Klockner PyOpenCL Tutorial

Page 50: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Outline

1 OpenCL

2 PyOpenCL

3 OpenCL runtimeA Kingdom of NounsSynchronization

4 Parallel patterns

5 Code writes Code

6 OpenCL implementations

Andreas Klockner PyOpenCL Tutorial

Page 51: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

OpenCL Object Diagram

Last Revision Date: 9/30/10 Page 20

Figure 2.1 - OpenCL UML Class Diagram

Credit: Khronos Group

Andreas Klockner PyOpenCL Tutorial

Page 52: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

CL “Platform”

“Platform”: a collection of devices, all fromthe same vendor.

All devices in a platform use same CLdriver/implementation.

Multiple platforms can be used from oneprogram → ICD.

libOpenCL.so: ICD loader

/etc/OpenCL/vendors/somename.icd:Plain text file with name of .so containingCL implementation.

Andreas Klockner PyOpenCL Tutorial

Page 53: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

CL “Compute Device”

CL Compute Devices:

CPUs, GPUs, accelerators, . . .

Anything that fits the programming model.

A processor die with an interface to off-chipmemory

Can get list of devices from platform.

Andreas Klockner PyOpenCL Tutorial

Page 54: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Contexts

context = cl.Context(devices=None | [dev1, dev2], dev type=None)context = cl. create some context( interactive =True)

Spans one or more Devices

Create from device type or list of devices

See docs for cl.Platform, cl.Device

dev type: DEFAULT , ALL, CPU, GPU

Needed to. . .

. . . allocate Memory Objects

. . . create and build Programs

. . . host Command Queues

. . . execute Grids

Andreas Klockner PyOpenCL Tutorial

Page 55: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

OpenCL: Command Queues

Host and Device runasynchronously

Host submits to queue:

ComputationsMemory TransfersSync primitives. . .

Host can wait fordrained queue

Profiling

. . .HostHost

DeviceDevice

Qu

eue

1Q

ueu

e1

Qu

eue

2Q

ueu

e2

Andreas Klockner PyOpenCL Tutorial

Page 56: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Command Queues and Events

queue = cl.CommandQueue(context, device=None,properties =None | [(prop, value ),...])

Attached to single device

cl.command queue properties. . .

OUT OF ORDER EXEC MODE ENABLE:Do not force sequential executionPROFILING ENABLE:Gather timing info

Andreas Klockner PyOpenCL Tutorial

Page 57: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Capturing Dependencies

B = f(A)C = g(B)E = f(C)F = h(C)G = g(E,F)P = p(B)Q = q(B)R = r(G,P,Q)

A

C

B

E

G

F Q

P

R

h

r

g

rg

r

g

q

f

p

f

Switch queue to out-of-ordermode!

Specify as list of events usingwait for= optional keyword toenqueue XXX.

Can also enqueue barrier.

Common use case:Transmit/receive from other MPIranks.

Possible in hardware on Nv Fermi,AMD Cayman: Submit parallelwork to increase machine use.

Not yet ubiquitouslyimplemented

Andreas Klockner PyOpenCL Tutorial

Page 58: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Capturing Dependencies

B = f(A)C = g(B)E = f(C)F = h(C)G = g(E,F)P = p(B)Q = q(B)R = r(G,P,Q)

A

C

B

E

G

F Q

P

R

h

r

g

rg

r

g

q

f

p

f

Switch queue to out-of-ordermode!

Specify as list of events usingwait for= optional keyword toenqueue XXX.

Can also enqueue barrier.

Common use case:Transmit/receive from other MPIranks.

Possible in hardware on Nv Fermi,AMD Cayman: Submit parallelwork to increase machine use.

Not yet ubiquitouslyimplemented

Andreas Klockner PyOpenCL Tutorial

Page 59: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Memory Objects: Buffers

buf = cl. Buffer(context , flags , size =0, hostbuf=None)

Chunk of device memory

No type information: “Bag of bytes”

Observe: Not tied to device.→ no fixed memory address→ pointers do not survive kernel launches→ movable between devices→ not even allocated before first use!

flags:

READ ONLY/WRITE ONLY/READ WRITE

{ALLOC,COPY,USE} HOST PTR

Andreas Klockner PyOpenCL Tutorial

Page 60: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Memory Objects: Buffers

buf = cl. Buffer(context , flags , size =0, hostbuf=None)

COPY HOST PTR:

Use hostbuf as initial content of buffer

USE HOST PTR:

hostbuf is the buffer.

Caching in device memory is allowed.

ALLOC HOST PTR:

New host memory (unrelated tohostbuf) is visible from device and host.

Andreas Klockner PyOpenCL Tutorial

Page 61: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Memory Objects: Buffers

buf = cl. Buffer(context , flags , size =0, hostbuf=None)

Specify hostbuf or size (or both)

hostbuf: Needs Python Buffer Interfacee.g. numpy.ndarray, str.

Important: Memory layout matters

Passed to device code as pointers(e.g. float *, int *)

enqueue copy(queue, dest, src)

Can be mapped into host address space:cl.MemoryMap.

Andreas Klockner PyOpenCL Tutorial

Page 62: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Programs and Kernels

prg = cl.Program(context, src)

src: OpenCL device code

Derivative of C99Functions with kernel attributecan be invoked from host

prg.build(options="",

devices=None)

kernel = prg.kernel name

kernel(queue,

(Gx ,Gy ,Gz), (Lx , Ly , Lz),arg, ...,

wait for=None)

Andreas Klockner PyOpenCL Tutorial

Page 63: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Program Objects

kernel (queue, (Gx,Gy,Gz), (Sx,Sy,Sz), arg , ..., wait for =None)

arg may be:

None (a NULL pointer)

numpy sized scalars:numpy.int64,numpy.float32,...

Anything with buffer interface:numpy.ndarray, str

Buffer Objects

Also: cl.Image, cl.Sampler,cl.LocalMemory

Andreas Klockner PyOpenCL Tutorial

Page 64: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Program Objects

kernel (queue, (Gx,Gy,Gz), (Sx,Sy,Sz), arg , ..., wait for =None)

Explicitly sized scalars:6 Annoying, error-prone.

Better:kernel.set scalar arg dtypes([

numpy.int32, None,

numpy.float32])

Use None for non-scalars.

Andreas Klockner PyOpenCL Tutorial

Page 65: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

OpenCL Object Diagram

Last Revision Date: 9/30/10 Page 20

Figure 2.1 - OpenCL UML Class Diagram

Credit: Khronos Group

Andreas Klockner PyOpenCL Tutorial

Page 66: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Outline

1 OpenCL

2 PyOpenCL

3 OpenCL runtimeA Kingdom of NounsSynchronization

4 Parallel patterns

5 Code writes Code

6 OpenCL implementations

Andreas Klockner PyOpenCL Tutorial

Page 67: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Recap: Concurrency and Synchronization

GPUs have layers of concurrency.Each layer has its synchronization primitives.

Intra-group:barrier(...),... =CLK {LOCAL,GLOBAL} MEM FENCE

Inter-group:Kernel launch

CPU-GPU:Command queues, Events

Andreas Klockner PyOpenCL Tutorial

Page 68: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Recap: Concurrency and Synchronization

GPUs have layers of concurrency.Each layer has its synchronization primitives.

Intra-group:barrier(...),... =CLK {LOCAL,GLOBAL} MEM FENCE

Inter-group:Kernel launch

CPU-GPU:Command queues, Events

Andreas Klockner PyOpenCL Tutorial

Page 69: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Synchronization between Groups

Golden Rule:

Results of the algorithm must be independent of the order in whichwork groups are executed.

Consequences:

Work groups may read the same information from globalmemory.

But: Two work groups may not validly write different thingsto the same global memory.

Kernel launch serves as

Global barrierGlobal memory fence

Andreas Klockner PyOpenCL Tutorial

Page 70: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Synchronization between Groups

Golden Rule:

Results of the algorithm must be independent of the order in whichwork groups are executed.

Consequences:

Work groups may read the same information from globalmemory.

But: Two work groups may not validly write different thingsto the same global memory.

Kernel launch serves as

Global barrierGlobal memory fence

Andreas Klockner PyOpenCL Tutorial

Page 71: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Synchronization

What is a Barrier?

Andreas Klockner PyOpenCL Tutorial

Page 72: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Synchronization

What is a Barrier?

Andreas Klockner PyOpenCL Tutorial

Page 73: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Synchronization

What is a Barrier?

Andreas Klockner PyOpenCL Tutorial

Page 74: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Synchronization

What is a Barrier?

Andreas Klockner PyOpenCL Tutorial

Page 75: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Synchronization

What is a Barrier?

Andreas Klockner PyOpenCL Tutorial

Page 76: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Synchronization

What is a Barrier?

Andreas Klockner PyOpenCL Tutorial

Page 77: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. A Kingdom of Nouns Synchronization

Synchronization

What is a Barrier?

Andreas Klockner PyOpenCL Tutorial

Page 78: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Outline

1 OpenCL

2 PyOpenCL

3 OpenCL runtime

4 Parallel patternsMapReduceScan

5 Code writes Code

6 OpenCL implementations

Andreas Klockner PyOpenCL Tutorial

Page 79: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Outline

1 OpenCL

2 PyOpenCL

3 OpenCL runtime

4 Parallel patternsMapReduceScan

5 Code writes Code

6 OpenCL implementations

Andreas Klockner PyOpenCL Tutorial

Page 80: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Map

yi = fi(xi)where i ∈ {1, . . . ,N}.

Notation: (also for rest of this lecture)

xi : inputs

yi : outputs

fi : (pure) functions (i.e. no side effects)

When does a function have a “side effect”?

In addition to producing a value, it

modifies non-local state, or

has an observable interaction with theoutside world.

Often: f1 = · · · = fN . Then

Python function map

Andreas Klockner PyOpenCL Tutorial

Page 81: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Map

yi = fi(xi)where i ∈ {1, . . . ,N}.

Notation: (also for rest of this lecture)

xi : inputs

yi : outputs

fi : (pure) functions (i.e. no side effects)

When does a function have a “side effect”?

In addition to producing a value, it

modifies non-local state, or

has an observable interaction with theoutside world.

Often: f1 = · · · = fN . Then

Python function map

Andreas Klockner PyOpenCL Tutorial

Page 82: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Map

yi = fi(xi)where i ∈ {1, . . . ,N}.

Notation: (also for rest of this lecture)

xi : inputs

yi : outputs

fi : (pure) functions (i.e. no side effects)

When does a function have a “side effect”?

In addition to producing a value, it

modifies non-local state, or

has an observable interaction with theoutside world.

Often: f1 = · · · = fN . Then

Python function map

Andreas Klockner PyOpenCL Tutorial

Page 83: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Map

yi = fi(xi)where i ∈ {1, . . . ,N}.

Notation: (also for rest of this lecture)

xi : inputs

yi : outputs

fi : (pure) functions (i.e. no side effects)

When does a function have a “side effect”?

In addition to producing a value, it

modifies non-local state, or

has an observable interaction with theoutside world.

Often: f1 = · · · = fN . Then

Python function map

Andreas Klockner PyOpenCL Tutorial

Page 84: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Map: Graph Representation

x0

y0

f0

x1

y1

f1

x2

y2

f2

x3

y3

f3

x4

y4

f4

x5

y5

f5

x6

y6

f6

x7

y7

f7

x8

y8

f8

Trivial? Often: no.

Andreas Klockner PyOpenCL Tutorial

Page 85: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Map: Graph Representation

x0

y0

f0

x1

y1

f1

x2

y2

f2

x3

y3

f3

x4

y4

f4

x5

y5

f5

x6

y6

f6

x7

y7

f7

x8

y8

f8

Trivial? Often: no.

Andreas Klockner PyOpenCL Tutorial

Page 86: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Embarrassingly Parallel: Examples

Surprisingly useful:

Element-wise linear algebra:Addition, scalar multiplication (notinner product)

Image Processing: Shift, rotate,clip, scale, . . .

Monte Carlo simulation

(Brute-force) Optimization

Random Number Generation

Encryption, Compression(after blocking)

Andreas Klockner PyOpenCL Tutorial

Page 87: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

DEMO TIME

Andreas Klockner PyOpenCL Tutorial

Page 88: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Outline

1 OpenCL

2 PyOpenCL

3 OpenCL runtime

4 Parallel patternsMapReduceScan

5 Code writes Code

6 OpenCL implementations

Andreas Klockner PyOpenCL Tutorial

Page 89: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Reduction

y = f (· · · f (f (x1, x2), x3), . . . , xN)where N is the input size.

Also known as. . .

Python function reduce

Andreas Klockner PyOpenCL Tutorial

Page 90: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Reduction

y = f (· · · f (f (x1, x2), x3), . . . , xN)where N is the input size.

Also known as. . .

Python function reduce

Andreas Klockner PyOpenCL Tutorial

Page 91: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Reduction: Graph

y

x1 x2

x3

x4

x5

x6

Painful! Not parallelizable.

Andreas Klockner PyOpenCL Tutorial

Page 92: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Reduction: Graph

y

x1 x2

x3

x4

x5

x6

Painful! Not parallelizable.

Andreas Klockner PyOpenCL Tutorial

Page 93: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Approach to Reduction

f (x ,y)

?

Can we do better?

“Tree” very imbalanced. What propertyof f would allow ‘rebalancing’?

f (f (x , y), z) = f (x , f (y , z))

Looks less improbable if we letx ◦ y = f (x , y):

x ◦ (y ◦ z)) = (x ◦ y) ◦ z

Has a very familiar name: Associativity

Andreas Klockner PyOpenCL Tutorial

Page 94: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Approach to Reduction

f (x ,y)

?

Can we do better?

“Tree” very imbalanced. What propertyof f would allow ‘rebalancing’?

f (f (x , y), z) = f (x , f (y , z))

Looks less improbable if we letx ◦ y = f (x , y):

x ◦ (y ◦ z)) = (x ◦ y) ◦ z

Has a very familiar name: Associativity

Andreas Klockner PyOpenCL Tutorial

Page 95: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Reduction: A Better Graph

y

x0 x1 x2 x3 x4 x5 x6 x7

Andreas Klockner PyOpenCL Tutorial

Page 96: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Reduction: Examples

Sum, Inner Product, Norm

Occurs in iterative methods

Minimum, Maximum

Data Analysis

Evaluation of Monte CarloSimulations

List Concatenation, Set Union

Matrix-Vector product (but. . . )

Andreas Klockner PyOpenCL Tutorial

Page 97: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

DEMO TIME

Andreas Klockner PyOpenCL Tutorial

Page 98: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Outline

1 OpenCL

2 PyOpenCL

3 OpenCL runtime

4 Parallel patternsMapReduceScan

5 Code writes Code

6 OpenCL implementations

Andreas Klockner PyOpenCL Tutorial

Page 99: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Scan

y1 = x1y2 = f (y1, x2)... = ...

yN = f (yN−1, xN)where N is the input size. (Think: N large, f (x , y) = x + y)

Prefix Sum/Cumulative Sum

Abstract view of: loop-carried dependence

Also possible: Segmented Scan

Andreas Klockner PyOpenCL Tutorial

Page 100: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Scan: Graph

x0

y0

x1

y1

x2

y2

x3

y3

x4

y4

x5

y5

y1

Id

y2

Id

y3

Id

y4

Id y5

Id

Id

This can’t possibly be parallelized.Or can it?

Andreas Klockner PyOpenCL Tutorial

Page 101: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Scan: Graph

x0

y0

x1

y1

x2

y2

x3

y3

x4

y4

x5

y5

y1

Id

y2

Id

y3

Id

y4

Id y5

Id

Id

This can’t possibly be parallelized.Or can it?

Andreas Klockner PyOpenCL Tutorial

Page 102: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Scan: Graph

x0

y0

x1

y1

x2

y2

x3

y3

x4

y4

x5

y5

y1

Id

y2

Id

y3

Id

y4

Id y5

Id

Id

This can’t possibly be parallelized.Or can it?Again: Need assumptions on f .Associativity, commutativity.

Andreas Klockner PyOpenCL Tutorial

Page 103: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Scan: Implementation

Work-efficient?

Andreas Klockner PyOpenCL Tutorial

Page 104: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Scan: Implementation

Work-efficient?

Andreas Klockner PyOpenCL Tutorial

Page 105: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Scan: Examples

Low-level building block for manyhigher-level algorithms algorithms

Index computations (!)

E.g. sorting

Anything with a loop-carrieddependence

One row of triangular solve

Segment numbering if boundariesare known

FIR/IIR Filtering

G.E. Blelloch:Prefix Sums and their Applications

Andreas Klockner PyOpenCL Tutorial

Page 106: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Scan: Issues

Subtlety: Inclusive/Exclusive Scan

Pattern sometimes hard torecognize

But shows up surprisingly oftenNeed to proveassociativity/commutativity

Andreas Klockner PyOpenCL Tutorial

Page 107: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

DEMO TIME

Andreas Klockner PyOpenCL Tutorial

Page 108: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Scan: Features

“Map” processing on input: f (xi )

Also: stencils f (xi−1, xi )

“Map” processing on output

Output stencilsInclusive/Exclusive scan

Segmented scan

Works on compound types

Efficient!

Scan: a fundamental parallel primitive.

Anything involving indexchanges/renumbering!(e.g. sort, filter, . . . )

Andreas Klockner PyOpenCL Tutorial

Page 109: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Scan: Features

“Map” processing on input: f (xi )

Also: stencils f (xi−1, xi )

“Map” processing on output

Output stencilsInclusive/Exclusive scan

Segmented scan

Works on compound types

Efficient!Scan: a fundamental parallel primitive.

Anything involving indexchanges/renumbering!(e.g. sort, filter, . . . )

Andreas Klockner PyOpenCL Tutorial

Page 110: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen. Map Reduce Scan

Scan: More Algorithms

copy if

remove if

partition

unique

sort (plain and key-value)

build list of lists

bin sort

All in pyopencl, all built on scan.

Andreas Klockner PyOpenCL Tutorial

Page 111: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Outline

1 OpenCL

2 PyOpenCL

3 OpenCL runtime

4 Parallel patterns

5 Code writes Code

6 OpenCL implementations

Andreas Klockner PyOpenCL Tutorial

Page 112: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Code generation

Three main ways to generate code:

Textual substitution

Templating

Abstract syntax trees

Andreas Klockner PyOpenCL Tutorial

Page 113: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

DEMO TIME

Andreas Klockner PyOpenCL Tutorial

Page 114: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Outline

1 OpenCL

2 PyOpenCL

3 OpenCL runtime

4 Parallel patterns

5 Code writes Code

6 OpenCL implementations

Andreas Klockner PyOpenCL Tutorial

Page 115: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

The Nvidia CL implementation

Targets only GPUs

Notes:

Nearly identical to CUDA

No native C-level JIT in CUDA (→PyCUDA)

Page-locked memory:Use CL MEM ALLOC HOST PTR.(Careful: double meaning)

Andreas Klockner PyOpenCL Tutorial

Page 116: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

The Apple CL implementation

Targets CPUs and GPUs

General notes:

Different header nameOpenCL/cl.h instead of CL/cl.hUse -framework OpenCL for Caccess.

Beware of imperfect compiler cacheimplementation(ignores include files)

CPU notes:

One work item per processor

GPU similar to hardware vendorimplementation.(New: Intel w/ Sandy Bridge)

Andreas Klockner PyOpenCL Tutorial

Page 117: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

The AMD CL implementation

Targets CPUs and GPUs (from both AMD and Nvidia)

GPU notes:

Wide SIMD groups (64)

GCN: Vector and scalar unit (previously VLIW4/5)

very flop-heavy machine→ ILP and explicit SIMD

CPU notes:

Many work items per processor (emulated)

“APU”: Growing CPU/GPU integration

Andreas Klockner PyOpenCL Tutorial

Page 118: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

The Intel CL implementation

CPUs, GPUs with Ivy Bridge+

CPU notes:

Good vectorizing compiler

Only implementation of out-of-order queuesfor now

Based on Intel TBB

GPU notes:

Flexible design: SIMDm VLIWn

Lots of fixed-function hardware

Last-level Cache (LLC) integrated betweenCPU and GPU

®

Andreas Klockner PyOpenCL Tutorial

Page 119: PyOpenCL Tutorial - Iniciofisica.cab.cnea.gov.ar/gpgpu/images/charlas/lab-slides.pdf · May 7, 2014 Andreas Kl ockner PyOpenCL Tutorial. ... blockIdx.y A BLOCK STRIDE; ... Grids can

OpenCL PyOpenCL Runtime Patterns Code gen.

Questions?

?

Andreas Klockner PyOpenCL Tutorial