Software Pipelined Execution of Stream Programs on GPUs
Abhishek Udupa et al.
Indian Institute of Science, Bangalore
2009
Abstract
- Describes the challenges in mapping StreamIt to GPUs
- Proposes an efficient technique to software pipeline the execution of stream programs on GPUs
- Formulates the problem of scheduling and of assigning actors to processors as an efficient ILP
- Also describes an efficient buffer layout technique for GPUs that facilitates exploiting their high memory bandwidth
- Yields speedups between 1.87X and 36.83X over a single-threaded CPU
Motivations
- GPUs have emerged as massively parallel machines
- GPUs can perform general-purpose computation
- The latest GPUs, consisting of hundreds of SPUs, can support thousands of concurrent threads with zero-cost hardware-controlled context switching
- GeForce 8800 GTS: 128 SPUs forming 16 multiprocessors, connected by a 256-bit wide bus to 512 MB of memory
Motivations Contd.
- Each multiprocessor in NVIDIA GPUs is conceptually a wide SIMD processor
- CPUs use hierarchies of caches to tolerate memory latency
- GPUs address the problem by providing a high-bandwidth processor-memory link and by supporting a high degree of simultaneous multithreading (SMT) within each processing element
- The combination of SIMD and SMT enables GPUs to achieve a peak throughput of 400 GFLOPs
Motivations Contd.
- The high performance of GPUs comes at the cost of reduced flexibility and greater programming effort from the user
- For instance, in NVIDIA GPUs, threads executing on different multiprocessors cannot synchronize efficiently or communicate reliably and consistently through the device memory within a kernel invocation
- The GPU cannot access the memory of the host system
- While the memory bus is capable of delivering very high bandwidth, accesses to device memory by threads executing on a multiprocessor need to be combined (coalesced) to achieve it
Motivations Contd.
- ATI and NVIDIA have proposed the CTM and CUDA frameworks, respectively, for developing general-purpose applications targeting GPUs
- But both frameworks still require the programmer to express the program as data-parallel computations that can be executed efficiently on the GPU
- Programming with these frameworks ties the application to the platform
- Any change in the platform would require significant rework in porting the application
Prior Works
- I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, "Brook for GPUs: Stream Computing on Graphics Hardware," ACM Transactions on Graphics, vol. 23, no. 3, pp. 777–786, 2004
- D. Tarditi, S. Puri, and J. Oglesby, "Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses," in ASPLOS-XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, 2006, pp. 325–335
- These techniques still require considerable effort by the programmer to transform the streaming application to execute efficiently on the GPU
- Maps StreamIt onto the Cell BE platform [M. Kudlur and S. Mahlke, PLDI 2008]
[Figure: executing filters X, Y, Z of a stream graph: (a) unthreaded CPU, (b) 4-wide SIMD, (c) on Cell, with instances split/joined across SPE0 and SPE1]
SIMD reduces the trip-count of the loop
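As a concrete illustration of this point (invented, not from the paper), the Python sketch below runs a simple doubling filter one element per trip and then four elements per trip, as a 4-wide SIMD unit would:

```python
# Illustrative only: compare loop trip counts for a hypothetical
# doubling filter executed one element at a time vs. in 4-wide chunks.

def run_scalar(xs):
    out, trips = [], 0
    for x in xs:                      # one stream element per trip
        out.append(2 * x)
        trips += 1
    return out, trips

def run_simd4(xs):
    out, trips = [], 0
    for i in range(0, len(xs), 4):    # four stream elements per trip
        out.extend(2 * x for x in xs[i:i + 4])
        trips += 1
    return out, trips

data = list(range(8))
assert run_scalar(data)[0] == run_simd4(data)[0]  # same results
assert run_scalar(data)[1] == 8                   # 8 trips scalar
assert run_simd4(data)[1] == 2                    # 2 trips SIMD
```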
With GPU Model
[Figure: steady-state iterations of filters X, Y, Z (X1…X2n, Y1…Y2n, Z1…Z2n) software pipelined across GPU threads]
Thousands of iterations are carried out in parallel
Greatly reduces the trip-count of the loop
Motivations Contd.
- Parallelism must be exploited at various levels:
- The data parallelism across threads executing on a multiprocessor and the SMT parallelism among threads on the same multiprocessor need to be managed to provide optimum performance
- The parallelism offered by having multiple multiprocessors on a GPU should be used to exploit task- and pipeline-level parallelism
- Further, accesses to device memory need to be coalesced as far as possible to ensure optimal usage of the available bandwidth
- These challenges are addressed in this paper
Contributions
This paper makes the following contributions:
- Describes a software pipelining framework for efficiently mapping StreamIt programs onto GPUs
- Presents a buffer mapping scheme for StreamIt programs to make efficient use of the memory bandwidth of the GPU
- Describes a profile-based approach to decide the optimal number of threads assigned to a multiprocessor (the execution configuration) for StreamIt filters
- Implements the scheme in the StreamIt compiler and demonstrates a 1.87X to 36.83X speedup over a single-threaded CPU on a set of streaming applications
StreamIt Overview
- StreamIt is a novel language for streaming
- Exposes parallelism and communication
- Architecture independent
- Modular and composable: simple structures are composed to create complex graphs
- Malleable: change program behavior with small modifications
[Figure: StreamIt constructs: filter, pipeline, splitjoin (splitter/joiner, for parallel computation), and feedback loop (joiner/splitter); the body of each construct may be any StreamIt language construct]
Organization of the NVIDIA GeForce 8800 series of GPUs
[Figure: architecture of the GeForce 8800 GPU and of an individual SM]
ILP Formulation: Definition
- Formulate the scheduling of the stream graph across the multiprocessors (SMs) of the GPU as an ILP
- Each instance of each filter is the fundamental schedulable entity
- V denotes the set of all nodes (filters) in the stream graph
- N is the number of nodes in the stream graph
- E denotes the set of all (directed) edges in the stream graph
- k_v represents the number of firings of a filter v ∈ V in the steady state
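To make these definitions concrete, here is a sketch in Python of the quantities the ILP consumes, for a hypothetical three-filter pipeline (all names and rates are invented for illustration, not taken from the paper):

```python
# Hypothetical three-filter pipeline (Source -> Scale -> Sink) written
# out in terms of the quantities the ILP formulation uses.

V = ["Source", "Scale", "Sink"]                 # filters (nodes)
E = [("Source", "Scale"), ("Scale", "Sink")]    # directed edges
N = len(V)                                      # number of nodes

k = {"Source": 1, "Scale": 2, "Sink": 2}   # steady-state firings k_v
d = {"Source": 3, "Scale": 5, "Sink": 2}   # execution time d(v)

O = {("Source", "Scale"): 2, ("Scale", "Sink"): 1}  # pushed per firing of u
I = {("Source", "Scale"): 1, ("Scale", "Sink"): 1}  # popped per firing of v
m = {("Source", "Scale"): 0, ("Scale", "Sink"): 0}  # initial tokens on edge

# In the steady state every edge is balanced: tokens produced per
# iteration equal tokens consumed per iteration.
for (u, v) in E:
    assert k[u] * O[(u, v)] == k[v] * I[(u, v)]
```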
ILP Formulation: Definition Contd.
- T represents the Initiation Interval (II) of the software pipelined loop
- d(v) is the delay, or execution time, of filter v ∈ V
- I_uv is the number of elements consumed by filter v on each firing of v, for an edge (u, v) ∈ E
- O_uv is the number of elements produced by filter u on each firing of u, for an edge (u, v) ∈ E
- m_uv denotes the number of elements initially present on the edge (u, v) ∈ E
ILP Formulation: Resource Constraints
- w_{k,v,p} are 0-1 integer variables, for all k ∈ [0, k_v), all v ∈ V, and all p ∈ [0, P_max)
- w_{k,v,p} = 1 implies that the kth instance of filter v has been assigned to SM p
- The following constraint ensures that each instance of each filter is assigned to exactly one SM:

    Σ_{p=0}^{P_max-1} w_{k,v,p} = 1,   for all k ∈ [0, k_v), v ∈ V    (1)
ILP Formulation: Resource Constraints Contd.
- This constraint models the fact that all the filter instances assigned to an SM can be scheduled within the specified II (T):

    Σ_{v ∈ V} Σ_{k=0}^{k_v-1} w_{k,v,p} · d(v) ≤ T,   for all p ∈ [0, P_max)    (2)
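A minimal Python sketch (not the actual solver) that checks the two resource constraints, each instance on exactly one SM and each SM's workload fitting within the II, for a toy assignment with invented sizes:

```python
# Illustrative check of the resource constraints for a toy assignment
# w[(k, v, p)] of filter instances to P_max = 2 SMs. All numbers are
# invented for illustration.

P_max, T = 2, 10
V = ["A", "B"]
k_v = {"A": 2, "B": 2}        # instances per filter
d = {"A": 4, "B": 5}          # execution times

# Hypothetical assignment: instance k of each filter goes to SM k % 2.
w = {(kk, v, p): int(p == kk % P_max)
     for v in V for kk in range(k_v[v]) for p in range(P_max)}

# Constraint (1): each instance is assigned to exactly one SM.
for v in V:
    for kk in range(k_v[v]):
        assert sum(w[(kk, v, p)] for p in range(P_max)) == 1

# Constraint (2): total work on each SM fits within the II, T.
for p in range(P_max):
    load = sum(w[(kk, v, p)] * d[v] for v in V for kk in range(k_v[v]))
    assert load <= T              # here: 4 + 5 = 9 <= 10 on each SM
```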
ILP Formulation Contd.
- For dependences among filters, we first need to ensure that the execution of any instance of any filter cannot wrap around into the next II
- Consider the linear form of a software pipelined schedule, given by

    σ(j, k, v) = T·j + A_{k,v}

  where σ(j, k, v) represents the time at which the execution of the kth instance of filter v in the jth steady-state iteration of the stream graph is scheduled, and the A_{k,v} are integer variables with A_{k,v} ≥ 0, for all k ∈ [0, k_v), v ∈ V
ILP Formulation Contd.
- Defining f_{k,v} = ⌊A_{k,v} / T⌋ and o_{k,v} = A_{k,v} mod T, we can write A_{k,v} = T·f_{k,v} + o_{k,v}
- Thus, the linear form of the schedule is given by

    σ(j, k, v) = T·(j + f_{k,v}) + o_{k,v}    (3)
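The decomposition of A_{k,v} into f_{k,v} = ⌊A_{k,v}/T⌋ and o_{k,v} = A_{k,v} mod T can be sanity-checked numerically; the values of T and A below are invented:

```python
# Decompose a hypothetical schedule offset A into an iteration part f
# and an offset-within-II part o, so that sigma(j) = T*(j + f) + o.

T = 10
A = 23                        # hypothetical A_{k,v} for some (k, v)
f, o = A // T, A % T          # f = 2, o = 3
assert T * f + o == A         # A = T*f + o

def sigma(j):
    # Start time of this instance in steady-state iteration j.
    return T * (j + f) + o

assert sigma(0) == 23
assert sigma(1) == 33         # one II (T = 10) later each iteration
```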
ILP Formulation Contd.
- Constrain the starting times o_{k,v} of the filters as follows:

    o_{k,v} + d(v) ≤ T,   for all v ∈ V, k ∈ [0, k_v)    (4)

- This constraint ensures that every firing of a filter is scheduled so that it completes within the same kernel, or II
ILP Formulation: Dependency Constraints
- To ensure that the firing rule for each instance of each filter is satisfied by a schedule, the admissibility condition of a schedule is given below
- It assumes that the firings of the instances of filter v are serialized:

    σ_v(i) ≥ σ_u(⌈((i+1)·I_uv - m_uv) / O_uv⌉ - 1) + d(u),   for all (u, v) ∈ E, i ≥ 0
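The firing rule behind this condition can be illustrated with a small Python helper that counts how many firings of the producer u must complete before the ith (0-indexed) firing of the consumer v can start; the rates are invented:

```python
# The i-th firing of v needs (i+1)*I_uv tokens in total; m_uv tokens
# are on the edge initially, and each firing of u adds O_uv more.
# Rates below are invented for illustration.
from math import ceil

I_uv, O_uv, m_uv = 3, 2, 1

def firings_needed(i):
    # Number of completed firings of u before firing i of v may start.
    return max(0, ceil(((i + 1) * I_uv - m_uv) / O_uv))

assert firings_needed(0) == 1   # needs 3 tokens, 1 initial -> 1 firing
assert firings_needed(1) == 3   # needs 6 tokens, 1 initial -> ceil(5/2)
```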
ILP Formulation: Dependency Constraints Contd.
- However, this assumption does not hold for our model, where instances of each filter could execute out of order, or in parallel across different SMs, as long as the firing rule is satisfied
- So we need to ensure that the schedule is admissible with respect to all the predecessors of the ith firing of filter v
ILP Formulation: Dependency Constraints Contd.
[Figure: example stream graph with filters A and B, where A pushes 2 tokens and B pops 3; the steady state unrolls into instances A0, A1, A2 and B0, B1]
ILP Formulation: Dependency Constraints Contd.
- Intuitively, the ith firing of a filter v must wait for all the I_uv tokens produced after its (i-1)th firing:

    σ_v(i) ≥ σ_u(⌈(i·I_uv + l - m_uv) / O_uv⌉ - 1) + d(u),   for all l ∈ [1, I_uv], (u, v) ∈ E, i ≥ 0
ILP Formulation: Dependency Constraints Contd.
- So far we have assumed that the result of executing a filter u is available d(u) time units after filter u has started executing
- However, the limitations of a GPU imply that this may not always be the case
- If the results are required from a producer instance of a filter u scheduled on an SM different from the SM where the consumer instance of a filter v is scheduled, then the data produced can only be reliably accessed in the next steady-state iteration
ILP Formulation: Dependency Constraints Contd.
- Model this using the w_{k,v,p} variables, which indicate which SM a particular instance of a given filter is assigned to
- Define 0-1 integer variables g_{l,k,u,v}, where k_l denotes the instance of u producing the lth token consumed by the kth instance of v:

    g_{l,k,u,v} ≥ w_{k_l,u,p} - w_{k,v,p}
    g_{l,k,u,v} ≥ w_{k,v,p} - w_{k_l,u,p}
    for all p ∈ [0, P_max), k ∈ [0, k_v), (u, v) ∈ E, l ∈ [1, I_uv]    (5)
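A small Python check (illustrative only) of the intent of the g_{l,k,u,v} variables: the smallest 0-1 value satisfying g ≥ w_u[p] - w_v[p] and g ≥ w_v[p] - w_u[p] for every SM p is 1 exactly when the producer and consumer instances sit on different SMs:

```python
# w_u and w_v are the assignment rows for a producer instance and a
# consumer instance: w[p] = 1 iff the instance runs on SM p.
# Assignments below are invented for illustration.

P_max = 2

def min_g(w_u, w_v):
    # Smallest g in {0, 1} satisfying both inequalities for all p.
    return max(abs(w_u[p] - w_v[p]) for p in range(P_max))

same_sm = min_g([1, 0], [1, 0])   # both instances on SM0
diff_sm = min_g([1, 0], [0, 1])   # producer on SM0, consumer on SM1
assert same_sm == 0 and diff_sm == 1
```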
ILP Formulation: Dependency Constraints Contd.
- Finally, simplifying:

    T·f_{k,v} + o_{k,v} ≥ T·(j_lag + f_{k_l,u}) + o_{k_l,u} + d(u)
    T·f_{k,v} + o_{k,v} ≥ T·(j_lag + f_{k_l,u} + g_{l,k,u,v}) + o_{k_l,u}
    for all (u, v) ∈ E, k ∈ [0, k_v), l ∈ [1, I_uv]    (6)

  where j_lag is the steady-state iteration lag between the producing firing k_l of u and the consuming firing k of v
Equations (1)-(6) together form the ILP
Code Generation for GPUs
[Figure: overview of the compilation process, targeting a StreamIt program onto the GPU]
CUDA Memory Model
- All threads of a thread block are assigned to exactly one SM; up to 8 thread blocks can be assigned to one SM
- A group of thread blocks forms a grid
- Finally, a kernel call dispatched to the GPU through the CUDA runtime consists of exactly one grid
Profiling and Execution Configuration Selection
- It is important to determine the optimal execution configuration, specified by the number of threads per block and the number of registers allocated per thread
- This is achieved by the profiling and execution configuration selection phase
- The StreamIt compiler is modified to generate the CUDA sources, along with a driver routine, for each filter in the StreamIt program
The purpose of profiling
- First, to accurately estimate the execution time of each filter on the GPU, for use in the ILP
- Second, by running multiple profile runs of the same filter while varying the number of threads and the number of registers, to identify the optimal number of threads
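A sketch of this selection step: time_filter below is a fake stand-in for a real profile run on the GPU, and its cost model is invented purely for illustration; only the "try candidate configurations, keep the fastest" structure reflects the slide:

```python
# Illustrative execution-configuration selection: profile a filter
# under several (threads-per-block, registers-per-thread) candidates
# and keep the fastest one.

def time_filter(threads, regs):
    # Fake cost model, for illustration only: more threads hide memory
    # latency until register pressure limits occupancy.
    occupancy = min(1.0, 8192 / (threads * regs))
    return 1000.0 / (threads * occupancy)

def best_config(candidates):
    return min(candidates, key=lambda c: time_filter(*c))

configs = [(32, 32), (64, 16), (128, 16), (256, 8)]
assert best_config(configs) == (256, 8)
```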
Algorithm for Profiling Filters
Optimizing Buffer Layout
[Figure: a filter with pop rate 4, executing with 4 threads, on a device with memory organized as 8 banks]
Optimizing Buffer Layout Contd.
- Each thread of each cluster of 128 threads pops or pushes the first token it consumes or produces from contiguous locations in memory
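One way to realize such an interleaved layout can be sketched in Python, assuming (illustratively) that the jth token of thread t is stored at address j * num_threads + t, so that at every step all threads in a cluster touch a contiguous, coalescable range of addresses:

```python
# Illustrative interleaved buffer layout: token j of thread t lives at
# address j * num_threads + t, instead of each thread owning its own
# contiguous chunk. Sizes match the slide's example (pop rate 4,
# 4 threads) but are otherwise arbitrary.

num_threads, pop_rate = 4, 4

def addr(thread, token):
    return token * num_threads + thread

# At token step j, the 4 threads hit 4 consecutive addresses.
for j in range(pop_rate):
    step = [addr(t, j) for t in range(num_threads)]
    assert step == list(range(j * num_threads, (j + 1) * num_threads))
```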
Experimental Evaluation
- Only stateless filters are considered
Experimental Setup
- Each benchmark was compiled and executed on a machine with dual quad-core Intel Xeon processors running at 2.83 GHz, with 16 GB of FB-DIMM main memory
- The machine runs Linux, kernel version 2.6.18, with NVIDIA driver version 173.14.09
- A GeForce 8800 GTS 512 GPU with 512 MB of device memory was used
    speedup = t_host / t_gpu

[Figure: speedups over a single-threaded CPU. SWPNC: SWP implementation with No Coalescing; Serial: serial execution of filters using a SAS schedule]
Phased Behaviour
[Figure: SWP: optimized SWP schedule with no coarsening; SWP 4: coarsened 4 times; SWP 8: coarsened 8 times; SWP 16: coarsened 16 times]
About the ILP Solver
- CPLEX version 9.0, running on Linux, was used to solve the ILP
- The machine used was a quad-processor Intel Xeon 3.06 GHz with 4 GB of RAM
- CPLEX ran as a single-threaded application and hence did not make use of the available SMP
- The solver was allotted 20 seconds to attempt a solution with a given II; if it failed to find a solution in 20 seconds, the II was relaxed by 0.5% and the process repeated until a feasible solution was found
- Handling stateful filters on GPUs is a possible future work
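That retry loop can be sketched as follows; solve_ilp is a stub standing in for a time-limited CPLEX call, with an invented feasibility threshold:

```python
# Illustrative II-relaxation loop: attempt the ILP at the current II,
# and on failure relax the II by 0.5% and retry until feasible.

def solve_ilp(T, time_limit=20):
    # Stub for a solver call with a 20-second budget: pretend schedules
    # become feasible only once T >= 104 (invented threshold).
    return T >= 104

def find_feasible_ii(T0):
    T = T0
    while not solve_ilp(T):
        T *= 1.005            # relax II by 0.5% and retry
    return T

T = find_feasible_ii(100.0)
assert 104 <= T < 105         # first relaxed II at or past the threshold
```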
Questions?