Software Pipelined Execution of Stream Programs on GPUs
Abhishek Udupa et al.
Indian Institute of Science, Bangalore
2009
Abstract
- Describes the challenges in mapping StreamIt to GPUs
- Proposes an efficient technique to software pipeline the execution of stream programs on GPUs
- Formulates the problem of scheduling and of assigning actors to processors as an efficient ILP
- Also describes an efficient buffer layout technique for GPUs that facilitates exploiting their high memory bandwidth
- Yields speedups between 1.87X and 36.83X over a single-threaded CPU
Motivations
- GPUs have emerged as massively parallel machines
- GPUs can perform general-purpose computation
- The latest GPUs, consisting of hundreds of SPUs, can support thousands of concurrent threads with zero-cost hardware-controlled context switching
- GeForce 8800 GTS: 128 SPUs forming 16 multiprocessors, connected by a 256-bit wide bus to 512 MB of memory
Motivations Contd.
- Each multiprocessor in NVIDIA GPUs is conceptually a wide SIMD processor
- CPUs use hierarchies of caches to tolerate memory latency
- GPUs address the problem by providing a high-bandwidth processor-memory link and by supporting a high degree of simultaneous multithreading (SMT) within each processing element
- The combination of SIMD and SMT enables GPUs to achieve a peak throughput of 400 GFLOPs
Motivations Contd.
- The high performance of GPUs comes at the cost of reduced flexibility and greater programming effort from the user
- For instance, in NVIDIA GPUs, threads executing on different multiprocessors cannot synchronize efficiently or communicate reliably and consistently through the device memory within a kernel invocation
- The GPU cannot access the memory of the host system
- While the memory bus is capable of delivering very high bandwidth, accesses to device memory by threads executing on a multiprocessor need to be combined (coalesced) to achieve it
Motivations Contd.
- ATI and NVIDIA have proposed the CTM and CUDA frameworks, respectively, for developing general-purpose applications targeting GPUs
- But both frameworks still require the programmer to express the program as data-parallel computations that can be executed efficiently on the GPU
- Programming with these frameworks ties the application to the platform
- Any change in the platform would require significant rework in porting the application
Prior Works
- I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, "Brook for GPUs: Stream Computing on Graphics Hardware," ACM Transactions on Graphics, vol. 23, no. 3, pp. 777–786, 2004
- D. Tarditi, S. Puri, and J. Oglesby, "Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses," in ASPLOS-XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, 2006, pp. 325–335
- These techniques still require considerable effort by the programmer to transform the streaming application to execute efficiently on the GPU
- Maps StreamIt onto the Cell BE platform [M. Kudlur and S. Mahlke, PLDI 2008]
[Figure: executing filters X, Y, Z of a stream graph: (a) unthreaded CPU, (b) 4-wide SIMD, (c) on Cell, with instances split/joined across SPE0 and SPE1]
SIMD reduces the trip-count of the loop
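As a concrete illustration of this point (invented, not from the paper), the Python sketch below runs a simple doubling filter one element per trip and then four elements per trip, as a 4-wide SIMD unit would:

```python
# Illustrative only: compare loop trip counts for a hypothetical
# doubling filter executed one element at a time vs. in 4-wide chunks.

def run_scalar(xs):
    out, trips = [], 0
    for x in xs:                      # one stream element per trip
        out.append(2 * x)
        trips += 1
    return out, trips

def run_simd4(xs):
    out, trips = [], 0
    for i in range(0, len(xs), 4):    # four stream elements per trip
        out.extend(2 * x for x in xs[i:i + 4])
        trips += 1
    return out, trips

data = list(range(8))
assert run_scalar(data)[0] == run_simd4(data)[0]  # same results
assert run_scalar(data)[1] == 8                   # 8 trips scalar
assert run_simd4(data)[1] == 2                    # 2 trips SIMD
```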
With GPU Model
[Figure: steady-state iterations of filters X, Y, Z (X1…X2n, Y1…Y2n, Z1…Z2n) software pipelined across GPU threads]
Thousands of iterations are carried out in parallel
Greatly reduces the trip-count of the loop
Motivations Contd.
- Parallelism must be exploited at various levels:
- The data parallelism across threads executing on a multiprocessor and the SMT parallelism among threads on the same multiprocessor need to be managed to provide optimum performance
- The parallelism offered by having multiple multiprocessors on a GPU should be used to exploit task- and pipeline-level parallelism
- Further, accesses to device memory need to be coalesced as far as possible to ensure optimal usage of the available bandwidth
- These challenges are addressed in this paper
Contributions
This paper makes the following contributions:
- Describes a software pipelining framework for efficiently mapping StreamIt programs onto GPUs
- Presents a buffer mapping scheme for StreamIt programs to make efficient use of the memory bandwidth of the GPU
- Describes a profile-based approach to decide the optimal number of threads assigned to a multiprocessor (the execution configuration) for StreamIt filters
- Implements the scheme in the StreamIt compiler and demonstrates a 1.87X to 36.83X speedup over a single-threaded CPU on a set of streaming applications
StreamIt Overview
- StreamIt is a novel language for streaming
- Exposes parallelism and communication
- Architecture independent
- Modular and composable: simple structures are composed to create complex graphs
- Malleable: change program behavior with small modifications
[Figure: StreamIt constructs: filter, pipeline, splitjoin (splitter/joiner, for parallel computation), and feedback loop (joiner/splitter); the body of each construct may be any StreamIt language construct]
Organization of the NVIDIA GeForce 8800 series of GPUs
[Figure: architecture of the GeForce 8800 GPU and of an individual SM]
ILP Formulation: Definition
- Formulate the scheduling of the stream graph across the multiprocessors (SMs) of the GPU as an ILP
- Each instance of each filter is the fundamental schedulable entity
- V denotes the set of all nodes (filters) in the stream graph
- N is the number of nodes in the stream graph
- E denotes the set of all (directed) edges in the stream graph
- k_v represents the number of firings of a filter v ∈ V in the steady state
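To make these definitions concrete, here is a sketch in Python of the quantities the ILP consumes, for a hypothetical three-filter pipeline (all names and rates are invented for illustration, not taken from the paper):

```python
# Hypothetical three-filter pipeline (Source -> Scale -> Sink) written
# out in terms of the quantities the ILP formulation uses.

V = ["Source", "Scale", "Sink"]                 # filters (nodes)
E = [("Source", "Scale"), ("Scale", "Sink")]    # directed edges
N = len(V)                                      # number of nodes

k = {"Source": 1, "Scale": 2, "Sink": 2}   # steady-state firings k_v
d = {"Source": 3, "Scale": 5, "Sink": 2}   # execution time d(v)

O = {("Source", "Scale"): 2, ("Scale", "Sink"): 1}  # pushed per firing of u
I = {("Source", "Scale"): 1, ("Scale", "Sink"): 1}  # popped per firing of v
m = {("Source", "Scale"): 0, ("Scale", "Sink"): 0}  # initial tokens on edge

# In the steady state every edge is balanced: tokens produced per
# iteration equal tokens consumed per iteration.
for (u, v) in E:
    assert k[u] * O[(u, v)] == k[v] * I[(u, v)]
```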
ILP Formulation: Definition Contd.
- T represents the Initiation Interval (II) of the software pipelined loop
- d(v) is the delay, or execution time, of filter v ∈ V
- I_uv is the number of elements consumed by filter v on each firing of v, for an edge (u, v) ∈ E
- O_uv is the number of elements produced by filter u on each firing of u, for an edge (u, v) ∈ E
- m_uv denotes the number of elements initially present on the edge (u, v) ∈ E
ILP Formulation: Resource Constraints
- w_{k,v,p} are 0-1 integer variables, for all k ∈ [0, k_v), all v ∈ V, and all p ∈ [0, P_max)
- w_{k,v,p} = 1 implies that the kth instance of filter v has been assigned to SM p
- The following constraint ensures that each instance of each filter is assigned to exactly one SM:

    Σ_{p=0}^{P_max-1} w_{k,v,p} = 1,   for all k ∈ [0, k_v), v ∈ V    (1)
ILP Formulation: Resource Constraints Contd.
- This constraint models the fact that all the filter instances assigned to an SM can be scheduled within the specified II (T):

    Σ_{v ∈ V} Σ_{k=0}^{k_v-1} w_{k,v,p} · d(v) ≤ T,   for all p ∈ [0, P_max)    (2)
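A minimal Python sketch (not the actual solver) that checks the two resource constraints, each instance on exactly one SM and each SM's workload fitting within the II, for a toy assignment with invented sizes:

```python
# Illustrative check of the resource constraints for a toy assignment
# w[(k, v, p)] of filter instances to P_max = 2 SMs. All numbers are
# invented for illustration.

P_max, T = 2, 10
V = ["A", "B"]
k_v = {"A": 2, "B": 2}        # instances per filter
d = {"A": 4, "B": 5}          # execution times

# Hypothetical assignment: instance k of each filter goes to SM k % 2.
w = {(kk, v, p): int(p == kk % P_max)
     for v in V for kk in range(k_v[v]) for p in range(P_max)}

# Constraint (1): each instance is assigned to exactly one SM.
for v in V:
    for kk in range(k_v[v]):
        assert sum(w[(kk, v, p)] for p in range(P_max)) == 1

# Constraint (2): total work on each SM fits within the II, T.
for p in range(P_max):
    load = sum(w[(kk, v, p)] * d[v] for v in V for kk in range(k_v[v]))
    assert load <= T              # here: 4 + 5 = 9 <= 10 on each SM
```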
ILP Formulation Contd.
- For dependences among filters, we first need to ensure that the execution of any instance of any filter cannot wrap around into the next II
- Consider the linear form of a software pipelined schedule, given by

    σ(j, k, v) = T·j + A_{k,v}

  where σ(j, k, v) represents the time at which the execution of the kth instance of filter v in the jth steady-state iteration of the stream graph is scheduled, and the A_{k,v} are integer variables with A_{k,v} ≥ 0, for all k ∈ [0, k_v), v ∈ V
ILP Formulation Contd.
- Defining f_{k,v} = ⌊A_{k,v} / T⌋ and o_{k,v} = A_{k,v} mod T, we can write A_{k,v} = T·f_{k,v} + o_{k,v}
- Thus, the linear form of the schedule is given by

    σ(j, k, v) = T·(j + f_{k,v}) + o_{k,v}    (3)
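The decomposition of A_{k,v} into f_{k,v} = ⌊A_{k,v}/T⌋ and o_{k,v} = A_{k,v} mod T can be sanity-checked numerically; the values of T and A below are invented:

```python
# Decompose a hypothetical schedule offset A into an iteration part f
# and an offset-within-II part o, so that sigma(j) = T*(j + f) + o.

T = 10
A = 23                        # hypothetical A_{k,v} for some (k, v)
f, o = A // T, A % T          # f = 2, o = 3
assert T * f + o == A         # A = T*f + o

def sigma(j):
    # Start time of this instance in steady-state iteration j.
    return T * (j + f) + o

assert sigma(0) == 23
assert sigma(1) == 33         # one II (T = 10) later each iteration
```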
ILP Formulation Contd.
- Constrain the starting times o_{k,v} of the filters as follows:

    o_{k,v} + d(v) ≤ T,   for all v ∈ V, k ∈ [0, k_v)    (4)

- This constraint ensures that every firing of a filter is scheduled so that it completes within the same kernel, or II
ILP Formulation: Dependency Constraints
- To ensure that the firing rule for each instance of each filter is satisfied by a schedule, the admissibility condition of a schedule is given below
- It assumes that the firings of the instances of filter v are serialized:

    σ_v(i) ≥ σ_u(⌈((i+1)·I_uv - m_uv) / O_uv⌉ - 1) + d(u),   for all (u, v) ∈ E, i ≥ 0
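The firing rule behind this condition can be illustrated with a small Python helper that counts how many firings of the producer u must complete before the ith (0-indexed) firing of the consumer v can start; the rates are invented:

```python
# The i-th firing of v needs (i+1)*I_uv tokens in total; m_uv tokens
# are on the edge initially, and each firing of u adds O_uv more.
# Rates below are invented for illustration.
from math import ceil

I_uv, O_uv, m_uv = 3, 2, 1

def firings_needed(i):
    # Number of completed firings of u before firing i of v may start.
    return max(0, ceil(((i + 1) * I_uv - m_uv) / O_uv))

assert firings_needed(0) == 1   # needs 3 tokens, 1 initial -> 1 firing
assert firings_needed(1) == 3   # needs 6 tokens, 1 initial -> ceil(5/2)
```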
ILP Formulation: Dependency Constraints Contd.
- However, this assumption does not hold for our model, where instances of each filter could execute out of order, or in parallel across different SMs, as long as the firing rule is satisfied
- So we need to ensure that the schedule is admissible with respect to all the predecessors of the ith firing of filter v
ILP Formulation: Dependency Constraints Contd.
[Figure: example stream graph with filters A and B, where A pushes 2 tokens and B pops 3; the steady state unrolls into instances A0, A1, A2 and B0, B1]
ILP Formulation: Dependency Constraints Contd.
- Intuitively, the ith firing of a filter v must wait for all the I_uv tokens produced after its (i-1)th firing:

    σ_v(i) ≥ σ_u(⌈(i·I_uv + l - m_uv) / O_uv⌉ - 1) + d(u),   for all l ∈ [1, I_uv], (u, v) ∈ E, i ≥ 0
ILP Formulation: Dependency Constraints Contd.
- So far we have assumed that the result of executing a filter u is available d(u) time units after filter u has started executing
- However, the limitations of a GPU imply that this may not always be the case
- If the results are required from a producer instance of a filter u scheduled on an SM different from the SM where the consumer instance of a filter v is scheduled, then the data produced can only be reliably accessed in the next steady-state iteration
ILP Formulation: Dependency Constraints Contd.
- Model this using the w_{k,v,p} variables, which indicate which SM a particular instance of a given filter is assigned to
- Define 0-1 integer variables g_{l,k,u,v}, where k_l denotes the instance of u producing the lth token consumed by the kth instance of v:

    g_{l,k,u,v} ≥ w_{k_l,u,p} - w_{k,v,p}
    g_{l,k,u,v} ≥ w_{k,v,p} - w_{k_l,u,p}
    for all p ∈ [0, P_max), k ∈ [0, k_v), (u, v) ∈ E, l ∈ [1, I_uv]    (5)
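A small Python check (illustrative only) of the intent of the g_{l,k,u,v} variables: the smallest 0-1 value satisfying g ≥ w_u[p] - w_v[p] and g ≥ w_v[p] - w_u[p] for every SM p is 1 exactly when the producer and consumer instances sit on different SMs:

```python
# w_u and w_v are the assignment rows for a producer instance and a
# consumer instance: w[p] = 1 iff the instance runs on SM p.
# Assignments below are invented for illustration.

P_max = 2

def min_g(w_u, w_v):
    # Smallest g in {0, 1} satisfying both inequalities for all p.
    return max(abs(w_u[p] - w_v[p]) for p in range(P_max))

same_sm = min_g([1, 0], [1, 0])   # both instances on SM0
diff_sm = min_g([1, 0], [0, 1])   # producer on SM0, consumer on SM1
assert same_sm == 0 and diff_sm == 1
```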
ILP Formulation: Dependency Constraints Contd.
- Finally, simplifying:

    T·f_{k,v} + o_{k,v} ≥ T·(j_lag + f_{k_l,u}) + o_{k_l,u} + d(u)
    T·f_{k,v} + o_{k,v} ≥ T·(j_lag + f_{k_l,u} + g_{l,k,u,v}) + o_{k_l,u}
    for all (u, v) ∈ E, k ∈ [0, k_v), l ∈ [1, I_uv]    (6)

  where j_lag is the steady-state iteration lag between the producing firing k_l of u and the consuming firing k of v
Equations (1)-(6) together form the ILP
Code Generation for GPUs
[Figure: overview of the compilation process, targeting a StreamIt program onto the GPU]
CUDA Memory Model
- All threads of a thread block are assigned to exactly one SM; up to 8 thread blocks can be assigned to one SM
- A group of thread blocks forms a grid
- Finally, a kernel call dispatched to the GPU through the CUDA runtime consists of exactly one grid
Profiling and Execution Configuration Selection
- It is important to determine the optimal execution configuration, specified by the number of threads per block and the number of registers allocated per thread
- This is achieved by the profiling and execution configuration selection phase
- The StreamIt compiler is modified to generate the CUDA sources, along with a driver routine, for each filter in the StreamIt program
The purpose of profiling
- First, to accurately estimate the execution time of each filter on the GPU, for use in the ILP
- Second, by running multiple profile runs of the same filter while varying the number of threads and the number of registers, to identify the optimal number of threads
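A sketch of this selection step: time_filter below is a fake stand-in for a real profile run on the GPU, and its cost model is invented purely for illustration; only the "try candidate configurations, keep the fastest" structure reflects the slide:

```python
# Illustrative execution-configuration selection: profile a filter
# under several (threads-per-block, registers-per-thread) candidates
# and keep the fastest one.

def time_filter(threads, regs):
    # Fake cost model, for illustration only: more threads hide memory
    # latency until register pressure limits occupancy.
    occupancy = min(1.0, 8192 / (threads * regs))
    return 1000.0 / (threads * occupancy)

def best_config(candidates):
    return min(candidates, key=lambda c: time_filter(*c))

configs = [(32, 32), (64, 16), (128, 16), (256, 8)]
assert best_config(configs) == (256, 8)
```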
Algorithm for Profiling Filters
Optimizing Buffer Layout
[Figure: a filter with pop rate 4, executing with 4 threads, on a device with memory organized as 8 banks]
Optimizing Buffer Layout Contd.
- Each thread of each cluster of 128 threads pops or pushes the first token it consumes or produces from contiguous locations in memory
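One way to realize such an interleaved layout can be sketched in Python, assuming (illustratively) that the jth token of thread t is stored at address j * num_threads + t, so that at every step all threads in a cluster touch a contiguous, coalescable range of addresses:

```python
# Illustrative interleaved buffer layout: token j of thread t lives at
# address j * num_threads + t, instead of each thread owning its own
# contiguous chunk. Sizes match the slide's example (pop rate 4,
# 4 threads) but are otherwise arbitrary.

num_threads, pop_rate = 4, 4

def addr(thread, token):
    return token * num_threads + thread

# At token step j, the 4 threads hit 4 consecutive addresses.
for j in range(pop_rate):
    step = [addr(t, j) for t in range(num_threads)]
    assert step == list(range(j * num_threads, (j + 1) * num_threads))
```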
Experimental Evaluation
- Only stateless filters are considered
Experimental Setup
- Each benchmark was compiled and executed on a machine with dual quad-core Intel Xeon processors running at 2.83 GHz, with 16 GB of FB-DIMM main memory
- The machine runs Linux, kernel version 2.6.18, with NVIDIA driver version 173.14.09
- A GeForce 8800 GTS 512 GPU with 512 MB of device memory was used
    speedup = t_host / t_gpu

[Figure: speedups over a single-threaded CPU. SWPNC: SWP implementation with No Coalescing; Serial: serial execution of filters using a SAS schedule]
Phased Behaviour
[Figure: SWP: optimized SWP schedule with no coarsening; SWP 4: coarsened 4 times; SWP 8: coarsened 8 times; SWP 16: coarsened 16 times]
About the ILP Solver
- CPLEX version 9.0, running on Linux, was used to solve the ILP
- The machine used was a quad-processor Intel Xeon 3.06 GHz with 4 GB of RAM
- CPLEX ran as a single-threaded application and hence did not make use of the available SMP
- The solver was allotted 20 seconds to attempt a solution with a given II; if it failed to find a solution in 20 seconds, the II was relaxed by 0.5% and the process repeated until a feasible solution was found
- Handling stateful filters on GPUs is a possible future work
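That retry loop can be sketched as follows; solve_ilp is a stub standing in for a time-limited CPLEX call, with an invented feasibility threshold:

```python
# Illustrative II-relaxation loop: attempt the ILP at the current II,
# and on failure relax the II by 0.5% and retry until feasible.

def solve_ilp(T, time_limit=20):
    # Stub for a solver call with a 20-second budget: pretend schedules
    # become feasible only once T >= 104 (invented threshold).
    return T >= 104

def find_feasible_ii(T0):
    T = T0
    while not solve_ilp(T):
        T *= 1.005            # relax II by 0.5% and retry
    return T

T = find_feasible_ii(100.0)
assert 104 <= T < 105         # first relaxed II at or past the threshold
```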
Questions?