
Page 1: Dynamic Parallelism

Dynamic Parallelism
Martin Kruliš
3. 12. 2015 (v1.0)

Page 2: Types of Parallelism


Different Types of Parallelism
◦ Task parallelism
  - The problem is divided into tasks, which are processed independently
◦ Data parallelism
  - The same operation is performed over many data items
◦ Pipeline parallelism
  - Data flow through a sequence (or oriented graph) of stages, which operate concurrently
◦ Other types of parallelism
  - Event driven, …

Page 3: GPU Execution Model


Parallelism on the GPU
◦ Data parallelism (see the sketch after this list)
  - The same kernel is executed by many threads
  - Each thread processes one data item
◦ Limited task parallelism
  - Multiple kernels may be executed simultaneously (since Fermi)
  - At most as many concurrent kernels as there are SMPs
◦ But we do not have
  - Any means of kernel-wide synchronization (barrier)
  - Any guarantee that two blocks/kernels will actually run concurrently
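A minimal sketch of the data-parallel model described above: every thread executes the same kernel and handles one data item. The kernel name, the scaling operation, and the launch configuration are made up for illustration.

__global__ void scale(float *data, float factor, int n) {
    // Every thread computes its own global index and processes one item.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Host side: one thread per data item, rounded up to whole blocks, e.g.:
// scale<<<(n + 255) / 256, 256>>>(devData, 2.0f, n);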

Page 4: Problematic Cases


Unsuitable Problems for GPUs
◦ Processing irregular data structures
  - Trees, graphs, …
◦ Regular structures with irregular processing workload
  - Difficult simulations, iterative approximations
◦ Iterative tasks with explicit synchronization
◦ Pipeline-oriented tasks with many simple stages

Page 5: Problematic Cases


Solutions
◦ Iterative kernel execution (see the host-side sketch after this list)
  - Usually applicable only when there are no (or only a few) dependencies between subsequent kernels
  - The state (or most of it) is kept on the GPU
◦ Mapping irregular structures to regular grids
  - May be too fine/coarse grained
  - Not always possible
◦ 2-phase algorithms
  - The first phase determines the amount of work (items, …)
  - The second phase processes the tasks mapped by the first phase
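A rough host-side sketch of the iterative approach: the state stays in device memory and only a small convergence flag travels back each iteration. The names step, devState and devFlag are placeholders, not part of the slides.

// Placeholder iteration kernel; a real one would update devState and
// set *converged only when the computation has finished.
__global__ void step(float *state, int *converged) { *converged = 1; }

void iterate(float *devState, int *devFlag, int blocks, int threads) {
    int hostFlag = 0;
    while (!hostFlag) {
        cudaMemset(devFlag, 0, sizeof(int));           // clear the convergence flag
        step<<<blocks, threads>>>(devState, devFlag);  // one iteration on the GPU
        cudaMemcpy(&hostFlag, devFlag, sizeof(int),
                   cudaMemcpyDeviceToHost);            // also waits for the kernel to finish
    }
}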

Page 6: Persistent Threads


Persistent Threads
◦ A single kernel is executed with only as many thread blocks as there are SMPs
◦ The workload is generated and processed on the GPU
  - Using a global queue implemented with atomic operations (see the sketch after this list)
◦ Possible problems
  - The kernel may be quite complex
  - There are no guarantees that two blocks will ever run concurrently
    - Possibility of creating a deadlock
  - The synchronization overhead is significant
    - But it might still be lower than host-based synchronization
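A simplified sketch of the persistent-threads pattern with a pre-filled global queue; processItem and the launch configuration are hypothetical. Dynamically growing the queue from inside the kernel would need extra care to avoid the deadlocks mentioned above.

__device__ void processItem(int item) { /* hypothetical per-item work */ }

__global__ void persistentKernel(const int *items, int itemCount, int *head) {
    while (true) {
        int idx = atomicAdd(head, 1);  // each thread pops the next item atomically
        if (idx >= itemCount)
            return;                    // queue exhausted, the thread retires
        processItem(items[idx]);
    }
}

// Launched with only as many blocks as there are SMPs, e.g.:
// persistentKernel<<<numberOfSMPs, 256>>>(devItems, n, devHead);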

Page 7: Dynamic Parallelism


CUDA Dynamic Parallelism
◦ A new feature introduced in CC 3.5 (Kepler)
◦ GPU threads are allowed to launch other kernels

Page 8: Dynamic Parallelism


Dynamic Parallelism Purpose
◦ The device does not need to synchronize with the host to issue new work to the device
◦ Irregular parallelism may be expressed more easily

Page 9: Dynamic Parallelism


How It Works
◦ Portions of the CUDA runtime are ported to the device side
  - Kernel execution
  - Device synchronization
  - Streams, events, and asynchronous memory operations (see the sketch after this list)
◦ Kernel launches are asynchronous
  - No guarantee that the child kernel starts immediately
  - Synchronization points may cause a context switch
    - Entire blocks are switched on an SMP
◦ Block-wise locality of resources
  - Streams and events are shared within a thread block
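An illustrative sketch of the device-side runtime: a parent kernel creates its own (block-local) stream and launches child work into it. The childKernel name and the sizes are placeholders; device-created streams must use the cudaStreamNonBlocking flag.

__global__ void childKernel(float *data, int n) { /* placeholder child work */ }

__global__ void parentKernel(float *data, int n) {
    if (threadIdx.x == 0) {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking); // block-local stream
        childKernel<<<(n + 255) / 256, 256, 0, s>>>(data, n); // asynchronous launch
        cudaStreamDestroy(s);  // resources are released once the child finishes
    }
    // Even without an explicit cudaDeviceSynchronize(), the child grid is
    // guaranteed to complete before the parent grid itself finishes.
}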

Page 10: Dynamic Parallelism


Example

__global__ void child_launch(int *data) {
    // Each child thread increments one item prepared by the parent.
    data[threadIdx.x] = data[threadIdx.x] + 1;
}

__global__ void parent_launch(int *data) {
    data[threadIdx.x] = threadIdx.x;
    __syncthreads();                     // make all writes visible before the launch
    if (threadIdx.x == 0) {
        child_launch<<<1, 256>>>(data);  // thread 0 launches the child grid
        cudaDeviceSynchronize();         // wait for the child grid to finish
    }
    __syncthreads();                     // make the child's results visible to all threads
}

void host_launch(int *data) {
    parent_launch<<<1, 256>>>(data);
}

◦ Synchronization does not have to be invoked by all threads
◦ Thread 0 invokes a grid of child threads
◦ Device synchronization does not synchronize the threads in the block

Page 11: Dynamic Parallelism


Compilation Issues
◦ CC 3.5 or higher is required
◦ Relocatable device code must be enabled (-rdc=true)
◦ Dynamic linking
  - The device code is linked before host linking
  - Against the cudadevrt library (example commands after this list)
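Example nvcc invocations (the file names are placeholders): relocatable device code is enabled with -rdc=true and the final link pulls in the cudadevrt library.

# compile with relocatable device code for CC 3.5
nvcc -arch=sm_35 -rdc=true -c dynpar.cu -o dynpar.o
# device-link and host-link against the device runtime library
nvcc -arch=sm_35 dynpar.o -lcudadevrt -o dynpar

# or, as a single step:
nvcc -arch=sm_35 -rdc=true dynpar.cu -lcudadevrt -o dynpar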

Page 12: Discussion


Discussion