
Parallel Programming Models for Heterogeneous Multi-Core Architectures

Ferrer, R., Bellens, P., Beltran, V., Gonzalez, M., Martorell, X., Badia, R. M., Ayguade, E., Yeom, J., Schneider, S., Koukos, K., Alvanos, M., Nikolopoulos, D., & Bilas, A. (2010). Parallel Programming Models for Heterogeneous Multi-Core Architectures. IEEE Micro, 30(5), 42-53. https://doi.org/10.1109/MM.2010.94

Published in: IEEE Micro



PARALLEL PROGRAMMING MODELS FOR HETEROGENEOUS MULTICORE ARCHITECTURES

This article evaluates the scalability and productivity of six parallel programming models for heterogeneous architectures, and finds that task-based models using code and data annotations require the minimum programming effort while sustaining nearly best performance. However, achieving this result requires both extensions of programming models to control locality and granularity and proper interoperability with platform-specific optimizations.

Vendors have commoditized heterogeneous parallel architectures through single-chip multicore processors with heterogeneous and asymmetric cores (for example, IBM Cell, AMD Fusion, and NVIDIA Tegra) and through stand-alone wide single-instruction, multiple-data (SIMD) and single-instruction, multiple-thread (SIMT) processors that serve as board-level computational accelerators for high-performance computing nodes (such as NVIDIA GPUs and Intel's Larrabee and Graphics Media Accelerator). Amdahl's law of the multicore era suggests that heterogeneous parallel architectures have more potential than homogeneous architectures to accelerate workloads and parallel applications.1

Developing accessible, productive, and scalable parallel programming models for heterogeneous architectures is challenging. The compiler and runtime environment of a programming model for heterogeneous architectures must manage multiple instruction-set architectures (ISAs), multiple address spaces, software-managed local memories, heterogeneous computational power, memory capacities, and communication and synchronization mechanisms. Many vendor and academic efforts attempt to address these challenges. IBM advocates the use of OpenMP to exploit parallelism in the heterogeneous Cell processor.2 Vendors have agreed to provide OpenCL on their heterogeneous platforms. Several research environments are also being used, including Sequoia,3 RapidMind (http://www.rapidmind.com), Offload,4 Star Superscalar (StarSs),5 and CellGen.6

Efforts seem to converge in at least two common patterns for programming heterogeneous multicore processors. The first is the task-offloading pattern, whereby one or more cores of the processor, preferably those cores that run the operating system and support a general-purpose ISA (for example, x86 and PowerPC), offload work to the other cores, which typically have special acceleration capabilities, such as vector execution units and fine-grained multithreading. The second pattern is the explicit management of locality, via control of data transfers to and from the accelerator-type cores, and via runtime support for scheduling these data transfers so they can be overlapped with computation.

Despite convergence, little is known about these models' relative performance on a common hardware platform and their productivity under a common set of applications and assumptions regarding programmers. The literature lacks insights on the importance of parallel programming constructs and the implications of using these constructs on programmer productivity. Furthermore, the literature lacks insight on the relationship between performance and productivity within parallel programming models. To bridge this knowledge gap, we evaluate six programming models designed for and implemented on the Cell processor: Sequoia, StarSs, CellGen, Tagged Procedure Calls (TPC), CellMP (both TPC and CellMP were developed in the context of the SARC project), and IBM's low-level Cell software developer's kit (SDK). Our evaluation departs from prior work in that it simultaneously considers performance and productivity in these models, using rigorous metrics on both counts. We evaluate the models using the same hardware, the same software stack running beneath the programming models, the same applications (see the "Applications" sidebar for a description of the three applications used in our study), and programmers with similar skills and backgrounds (advanced graduate students with extensive exposure to parallel programming and parallel code optimization).

Programming models under study

We make a first attempt at classifying programming models for heterogeneous multicore processors in terms of how they express parallelism and locality (directives versus language types versus runtime library calls), the vehicles of parallel execution (such as tasks and loops), their scheduling scheme (for example, static or dynamic), their means for controlling and optimizing data transfers, and the availability of compiler support for automatic work outlining. Table 1 summarizes this classification.

Hand coding on the Cell

Programming the Cell by hand requires using IBM's Cell SDK 3.0. The SDK exposes Cell architectural details to the programmer, such as SIMD intrinsics for synergistic processor unit (SPU) code. It also provides libraries for low-level, Pthread-style thread-based parallelization, and direct memory access (DMA) commands based on a get/put interface for managing locality and data transfers.

Programming in the Cell SDK is similar in complexity to programming with an explicit remote DMA-style communication interface on a typical cluster7 and with an explicit multithreading interface (such as Posix threads) on a typical multiprocessor. The programmer needs both a deep understanding of thread-level parallelization and a deep understanding of the Cell processor and memory hierarchy.

Although programming models can transparently manage data transfers, the Cell SDK requires that the programmer explicitly identify and schedule all data transfers. Furthermore, the programmer is solely responsible for data alignment, for setting up and sizing buffers to achieve computation/communication overlap, and for synchronizing threads running on different cores. Although tedious, hand-tuned parallelization also has well-known advantages. A programmer with insight into the parallel algorithm and the Cell architecture can maximize locality, eliminate unnecessary data transfers, and schedule data and computation on cores in an optimal manner.
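To make the get/put style concrete, here is a minimal SPU-side sketch of this kind of hand-coded streaming, assuming a hypothetical kernel and chunk size; it uses the SDK's MFC intrinsics from spu_mfcio.h and, for brevity, single buffering rather than the double buffering a tuned version would use.

```c
#include <spu_mfcio.h>   /* Cell SDK MFC (DMA) intrinsics for SPU code */
#include <stdint.h>

#define CHUNK 16384      /* hypothetical chunk size; one DMA moves at most 16 Kbytes */

/* Local-store buffer, 128-byte aligned as the DMA engine expects. */
static volatile char buf[CHUNK] __attribute__((aligned(128)));

/* Stream `size` bytes (assumed a multiple of CHUNK) starting at
   effective address `ea`: pull each chunk in, process it, push it back. */
void stream(uint64_t ea, uint32_t size)
{
    const uint32_t tag = 1;                 /* DMA tag group */
    uint32_t off;

    for (off = 0; off < size; off += CHUNK) {
        /* Pull one chunk from main memory into the local store. */
        mfc_get(buf, ea + off, CHUNK, tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();          /* block until the get completes */

        /* ... compute on buf here (hypothetical kernel) ... */

        /* Push the processed chunk back to main memory. */
        mfc_put(buf, ea + off, CHUNK, tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();          /* block until the put completes */
    }
}
```

A hand-tuned version would allocate two or more such buffers and issue the mfc_get for the next chunk while computing on the current one, which is exactly the buffer setup and sizing work the text above describes.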

The SARC European Project Team

Members of the SARC European Project team include:

- Roger Ferrer, Pieter Bellens, Vicenc Beltran, Marc Gonzalez, Xavier Martorell, Rosa M. Badia, and Eduard Ayguade from the Barcelona Supercomputing Center;
- Jae-Seung Yeom and Scott Schneider from Virginia Tech; and
- Konstantinos Koukos, Michail Alvanos, Dimitrios S. Nikolopoulos, and Angelos Bilas from the Foundation for Research and Technology-Hellas.

Sequoia

Sequoia expresses parallelism through explicit task and data subdivision. The programmer constructs trees of dependent tasks in which the inner tasks call tasks further down the tree, eventually ending in a leaf task, which is typically where the real computation occurs. At each level, the data is decomposed and copied to the child tasks as specified. Each task has a private address space.

Sequoia strictly enforces locality because tasks can only reference local data. In this manner, there can be a direct mapping of tasks to the Cell architecture, where the SPU local storage is divorced from the typical memory hierarchy. By providing a programming model in which tasks operate on local data, and providing abstractions to subdivide data and pass it on to subtasks, Sequoia can completely abstract away the underlying architecture from programmers. Sequoia lets programmers explicitly define data and computation subdivision through a specialized notation. Using these definitions, the Sequoia compiler generates code that divides and transfers the data between tasks and performs the computations on the data as described by programmers for the specific architecture. The mappings of data to task and task to hardware are fixed at compile time.

Sequoia relieves programmers from awareness of the underlying architectural data-transfer mechanism between processors and memories by providing a custom memory-allocation API, through which the programmer can reserve memory spaces that meet the alignment constraints of the architecture. The current Sequoia runtime does not support noncontiguous data transfers.

Star Superscalar

StarSs is a task-based parallel-programming model.8 Similarly to code in OpenMP 3.0, the code is annotated with pragmas, although in this case the pragmas annotate when a function is a task. Additionally, pragmas indicate the direction and access properties (input, output, or inout) of the function's parameters, with the objective of giving hints to the runtime. This lets the StarSs runtime system automatically discover the actual data dependencies between tasks. For this article, we used the Cell Superscalar (CellSs)5 instance of StarSs, tailored to the Cell broadband engine (Cell BE) platform.
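As a minimal sketch of this annotation style (assuming CellSs-style pragmas; the block size and functions are hypothetical):

```c
#define BS 1024   /* hypothetical block size */

/* The pragma marks vadd as a task; the input/output clauses give the
   runtime the direction of each parameter, so it can build the
   task-dependency graph and schedule the corresponding DMA transfers. */
#pragma css task input(a[BS], b[BS]) output(c[BS])
void vadd(float a[BS], float b[BS], float c[BS])
{
    int i;
    for (i = 0; i < BS; i++)
        c[i] = a[i] + b[i];
}

void compute(float *a, float *b, float *c, int nblocks)
{
    int i;
    /* Each call becomes an asynchronous task instance; the runtime
       discovers dependencies from the annotated parameter directions. */
    for (i = 0; i < nblocks; i++)
        vadd(&a[i * BS], &b[i * BS], &c[i * BS]);

    #pragma css barrier   /* wait for all outstanding tasks */
}
```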

CellSs is based on a source-to-source compiler and a runtime library. The source-to-source compiler can separate the code corresponding to the PowerPC processing unit (PPU) and to the SPU. It also generates the necessary glue code and the corresponding calls to the runtime library. CellSs then compiles and links this generated code with the IBM SDK toolchain.

The runtime library exploits the existing parallelism by building a task-dependency graph at runtime. The graph parallelism is further increased through renaming. This technique replicates intermediate data, eliminating false dependencies. The runtime system issues data transfers before issuing the task needing the data, letting us overlap computation with communication. Task scheduling is centralized; the runtime system groups in a single bundle several tasks that are assigned to an SPU. The objective is to exploit reuse of data that is generated and consumed inside the bundle in the SPU local stores.

Table 1. Qualitative properties of the Cell broadband engine programming models evaluated in this study.

| Model | Concurrency constructs | Parallel execution constructs | Scheduling scheme | Locality control | Automatic function outlining |
|---|---|---|---|---|---|
| Cell SDK | Library calls | Threads | Static/dynamic | Explicit (direct memory accesses, buffers) | No |
| CellGen | Directives | Parallel loops/tasks | Static | Implicit (compiler) | Yes |
| CellMP | Directives | Parallel loops/tasks | Static | Explicit (copy in/out directives, blocking) | Yes |
| Sequoia | Language types | Tasks | Static | Explicit (language in/out data types) | No |
| StarSs | Directives | Tasks | Dynamic | Explicit (in/out clauses) | No |
| Tagged Procedure Calls (TPC) | Library calls | Tasks | Static | Explicit (argument tags) | No |


Applications

We use three applications for our analysis: a memory bandwidth benchmark and two realistic supercomputer-class applications.

CellStream

CellStream is a memory bandwidth benchmark that attempts to maximize the rate at which synergistic processor units (SPUs) can transfer data to and from main memory.1 The benchmark is designed so that a small computational kernel can be dropped in to perform work on data as it streams through SPUs. If no computational kernel is used, the benchmark can match the performance of the read/write direct memory access (DMA) benchmark bundled with the Cell SDK 3.0.

For our experiments, we use an input/output stream of 192 Mbytes of data flowing across the SPUs. Data originates from a file, is read in by a PowerPC processing unit (PPU) thread, transferred in and out of one of the participating SPUs, and written by another PPU thread onto a separate file.

Fixedgrid

Fixedgrid is a comprehensive prototypical atmospheric model written entirely in C. It describes chemical transport via a third-order upwind-biased advection discretization and second-order diffusion discretization.2-4 It uses an implicit Rosenbrock method to integrate a 79-species SAPRC-99 atmospheric chemical mechanism for volatile organic compounds (VOCs) and nitrogen oxides (NOx) on every grid point.5 Scientists can selectively disable chemical or transport processes to observe their effect on monitored concentrations.

To calculate mass flux on a 2D domain, Fixedgrid calculates a two-component wind vector, horizontal diffusion tensor, and concentrations for every species of interest. To promote data contiguity, Fixedgrid stores the data according to function. The latitudinal wind field, longitudinal wind field, and horizontal diffusion tensor are each stored in a separate NX × NY array, where NX is the domain's width and NY is the domain's height. Concentration data is stored in an NS × NX × NY array, where NS is the number of monitored chemical species. To calculate ozone (O3) concentrations on a 600 × 600 domain as in our experiments, Fixedgrid calculates approximately 1,080,000 double-precision values (8.24 Mbytes) at each time step, using 25,920,000 double-precision values (24.7 Mbytes) in the calculation.

PBPI

PBPI is a parallel implementation of the Bayesian phylogenetic inference method, which constructs phylogenetic trees from DNA or amino acid sequences using a Markov chain Monte Carlo (MCMC) sampling method.6 Two factors determine the computation time of a Bayesian phylogenetic inference based on MCMC: the length of the Markov chains for approximating the posterior probability of the phylogenetic trees, and the computation time needed for evaluating the likelihood values at each generation. PBPI developers can reduce the length of the Markov chains by developing improved MCMC strategies to propose high-quality candidate states and to improve acceptance/rejection decisions. We can accelerate the computation time per generation by optimizing the likelihood evaluation and exploiting parallelism. PBPI implements both techniques, and achieves linear speedup with the number of processors for large problem sizes.

For our experiments, we used a data set of 107 taxa with 19,989 nucleotides for a tree. Three computational loops are called a total of 324,071 times and account for most of the program's execution time. The first loop accounts for 88 percent of the calls and requires 1.2 Mbytes to compute a result of 0.6 Mbytes. The second loop accounts for 6 percent of the calls and requires 1.8 Mbytes to compute a result of 0.6 Mbytes. The third also accounts for 6 percent of the calls and requires 0.6 Mbytes to compute the weighted reduction of a vector onto a result of 8 bytes.

References

1. B. Rose, "Intra- and Inter-chip Communication Support for Asymmetric Multicore Processors with Explicitly Managed Memory Hierarchies," master's thesis, Dept. of Computer Science, Virginia Polytechnic Inst. and State Univ., 2008.
2. W. Hundsdorfer, Numerical Solution of Advection-Diffusion-Reaction Equations, tech. report, Centrum voor Wiskunde en Informatica, 1996.
3. J.C. Linford and A. Sandu, "Optimizing Large Scale Chemical Transport Models for Multicore Platforms," Proc. 2008 Spring Simulation Multiconf., Soc. for Modeling and Simulation Int'l, 2008, pp. 369-376.
4. A. Sandu et al., "Adjoint Sensitivity Analysis of Regional Air Quality Models," J. Computational Physics, vol. 204, no. 1, 2005, pp. 222-252.
5. W.P.L. Carter, "Documentation of the SAPRC-99 Chemical Mechanism for VOC Reactivity Assessment," final report, contract no. 92-329, Calif. Air Resources Board, 8 May 2000.
6. X. Feng, K.W. Cameron, and D.A. Buell, "PBPI: A High Performance Implementation of Bayesian Phylogenetic Inference," Proc. Conf. Supercomputing (SC 06), ACM Press, 2006, article no. 75.


CellGen

CellGen implements a subset of OpenMP on the Cell.6 The model uses a source-to-source optimizing compiler. Programmers identify parallel sections of their code in the form of loops accessing particular segments of memory. Programmers must annotate these sections to mark them for parallel execution, and indicate how the data accessed in these sections should be handled. This model provides the abstraction of a shared-memory architecture and an indirect and implicit abstraction of data locality, via the annotation of the data set accessed by each parallel section. Note that although the data set accessed by each parallel section is annotated, the data set accessed by each task that executes a part of the work in the parallel section is not annotated.

Data is annotated as private or shared, using the same keywords as in OpenMP. Private variables follow OpenMP semantics. They are copied into local stores using DMAs, and each SPU gets a private copy of the variable. Shared variables are further classified internally in the CellGen compiler as in, out, or inout variables, using reference analysis. This classification departs from OpenMP semantics and serves as the main vehicle for managing locality on the Cell. In data must be streamed into the SPU's local store, out data must be streamed out of local stores, and inout data must be streamed both in and out of local stores. By annotating the data referenced in the parallel section, programmers implicitly tell CellGen what data they want transferred to and from the local stores. The CellGen compiler manages locality by triggering and dynamically scheduling the associated data transfers.

Being able to stream in, out, and inout data simultaneously in CellGen is paramount for two reasons: the local stores are small, so they can only contain a fraction of the working sets of parallel sections; and the DMA time required to move data in and out of local stores might dominate performance. Overlapping DMAs with computation is necessary to achieve high performance. Data classified by the compiler as in or out are streamed using double buffering, whereas inout data are streamed using triple buffering. The number of states a variable can be in determines the depth of buffering. In variables can be either streaming in or computing; out variables can be either computing or streaming out; and inout variables can be streaming in, computing, or streaming out. The CellGen compiler creates a buffer for each of these states. For array references inside parallel sections, the goal is to maximize computation/DMA overlap by having different array elements in two (for in and out arrays) or three (for inout arrays) states simultaneously.

SPUs operate on independent loop iterations in parallel. CellGen, like OpenMP, assumes that it is the programmer's responsibility to ensure that loop iterations are in fact independent. However, the compiler schedules loop iterations to SPUs. The current implementation uses static scheduling, whereby the iterations are divided equally among all SPUs.
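The following sketch illustrates this usage pattern; the pragma spelling is an OpenMP-like approximation of CellGen's directive and may not match the compiler's exact syntax, and the loop itself is hypothetical.

```c
#define N 65536   /* illustrative loop bound */

void scale(float *in_v, float *out_v, float f)
{
    int i;
    /* Reference analysis classifies in_v as "in" and out_v as "out",
       so the compiler generates double-buffered DMA streams for both;
       iterations are divided statically among the SPUs. */
    #pragma cell shared(in_v, out_v) private(f)
    for (i = 0; i < N; i++)
        out_v[i] = f * in_v[i];
}
```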

Tagged Procedure Calls

TPC is a programming model that exploits parallelism through asynchronous execution of tasks.9 The TPC runtime system creates a task using a function descriptor and argument descriptors as input and identifies the task using a unique task ID. The programmer uses library calls to identify certain procedure calls as concurrent tasks and specify properties of the data accessed by them, to facilitate their transfers to and from local memories. Each argument descriptor is formed by a triplet containing the base address, the argument size, and the argument type, which can be in, out, or inout. The TPC runtime handles contiguous and strided arguments with a fixed stride. For the latter, the programmer masks the type field with the stride flag and packs the number of strides, stride size, and offset within the size field using a macro. The task argument sizes define the granularity of parallelism within a region of straight-line code or a subset of a loop's iteration space. The programmer implements synchronization using either point-to-point or collective wait primitives.

The PPU issues tasks across SPUs round robin, using remote stores. Each active SPU thread has its own private queue that the issuer can access to post tasks. Each SPU runs a thread that continuously polls its local queue and executes any available tasks in first-in, first-out (FIFO) order. For each new task, the main thread running on the PPU tries to find an empty slot in some SPU queue to issue the task. Upon task completion, each SPU thread updates the task status in a thread-specific completion queue located in the PPU cache. In this case, to avoid cache invalidation and the resulting off-chip transfer, the TPC runtime uses atomic DMA commands to unconditionally update the PPU's cache. The PPU polls these completion queues to detect task completion. This polling is done when no more tasks can be issued or a synchronization primitive has been reached.

The TPC runtime uses only on-chip operations when initiating and completing tasks, whereas argument data might require off-chip transfers. The runtime uses task argument prefetching and outstanding write-backs to overlap communication with computation.
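The sketch below illustrates how a task issue with argument descriptors might look; the type names and entry points (tpc_arg, tpc_issue, tpc_wait_all) are hypothetical stand-ins rather than the actual TPC API, which is documented in the TPC paper.9

```c
#include <stdint.h>

/* Hypothetical descriptor mirroring the (base address, size, type)
   triplet described in the text. */
typedef enum { TPC_ARG_IN, TPC_ARG_OUT, TPC_ARG_INOUT } tpc_arg_type;

typedef struct {
    void        *base;   /* base address of the argument */
    uint32_t     size;   /* argument size in bytes */
    tpc_arg_type type;   /* in, out, or inout */
} tpc_arg;

/* Hypothetical issue/wait entry points standing in for the TPC library. */
extern int  tpc_issue(int func_id, int nargs, const tpc_arg *args);
extern void tpc_wait_all(void);

void offload_vadd(float *a, float *b, float *c, uint32_t n)
{
    tpc_arg args[3] = {
        { a, n * sizeof(float), TPC_ARG_IN  },
        { b, n * sizeof(float), TPC_ARG_IN  },
        { c, n * sizeof(float), TPC_ARG_OUT },
    };
    /* The runtime posts the task to some SPU queue (round robin)
       and prefetches the in/inout arguments. */
    tpc_issue(/* func_id = */ 0, 3, args);
    tpc_wait_all();   /* collective wait primitive */
}
```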

CellMP

CellMP uses the OpenMP 3.0 task approach to express the work to be executed on the accelerators. It uses a new clause, device(accelerator-kind-list),10 to annotate tasks to be executed on an accelerator. In CellMP, the use of the copy_in(variable | array-section) clause causes a data transfer from main memory to the accelerator; the copy_out(variable | array-section) clause will similarly cause the variable or array section to be written back from the accelerator to main memory. The programmer can choose to split the parallel computation to work on chunks of main memory arrays that fit in the local store, given that local stores typically have limited size and cannot fit entire arrays from main memory. To stream large arrays through local memories, a programmer specifies array sections, as in Fortran90. Communication in CellMP is implemented synchronously. To support communication-computation overlap, CellMP uses CellMT threads.11,12 Usually, two SPU threads are created in each SPU, so that while one of them is computing, the other is in the data-transfer stage.

CellMP also uses loop blocking13-15 to coarsen the granularity of work and to overcome accelerators' memory constraints. The scope of blocking includes the N loops surrounding the enclosed loop body, and it is defined by a clause factors(F_1, F_2, ..., F_N) that specifies the blocking factor of each loop in the considered loop nest, starting from the outermost loop. Code examples showing the use of CellMP directives are available elsewhere.10
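Based on the clauses just described, a CellMP-annotated offload might look like the following sketch; the device kind, array-section spelling, and block size are illustrative assumptions, and the cited paper10 gives the authoritative examples.

```c
#define N  1048576
#define BS 4096      /* illustrative chunk size fitting the local store */

void scale(float *a, float *b, float f)
{
    int i;
    /* Each BS-element chunk becomes a task on an SPE; copy_in/copy_out
       take Fortran90-style array sections so the arrays stream through
       the local store chunk by chunk. */
    for (i = 0; i < N; i += BS) {
        #pragma omp task device(spe) copy_in(a[i:i+BS-1]) copy_out(b[i:i+BS-1])
        {
            int j;
            for (j = i; j < i + BS; j++)
                b[j] = f * a[j];
        }
    }
    #pragma omp taskwait   /* wait for all offloaded chunks */
}
```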

Evaluation

We first evaluate the memory bandwidth achieved by each of the programming models using CellStream, and then present the evaluation of two realistic applications, Fixedgrid and Parallel Bayesian Phylogenetic Inference (PBPI) (see the "Applications" sidebar for a description of these applications). The PhD student authors of this article parallelized these applications. They all have similar experience and backgrounds in parallel programming and parallel application optimization on Cell.

We performed the evaluation on QS20 Cell BE blades at the Barcelona Supercomputing Center. Two Cell processors clocked at 3.2 GHz and 1 Gbyte of DRAM power the blades. We installed the IBM Cell SDK 3.0 on each blade. We ran all experiments within the default Linux (version 2.6.22.5-fc7) execution environment, with the default virtual page size of 65,536 bytes (64 Kbytes). We executed each experiment 20 times and calculated means and standard deviations. The machines were used in dedicated mode.

We evaluated the benchmarks with respect to changes in the number of lines needed for the parallelization and exploitation of the Cell BE SPUs. To count lines, we eliminate comments and empty lines, and then use indent so that the code expressed in all programming models is organized with the same rules. The codes are available at http://people.ac.upc.edu/xavim/sarc/pm-codes.tar.bz2.

CellStream

Table 2 shows, for each of the programming models under evaluation, the number of lines we added or removed and the pragmas we added to CellStream to be able to run it using the Cell SPUs as accelerators. We made all changes manually, before compiling the application with the corresponding programming model.

The programmer must add more than 100 lines of code to port the benchmark to run on the SDK. This code includes thread creation and synchronization on the PPU, explicit DMA transfers on the PPU and SPUs, and management of DMA completion notifications. Sequoia needs half the number of lines added, and TPC reduces them to one third. Sequoia eliminates code for explicit DMA transfers; however, it requires two versions of each function enclosed in a parallel task: a nonleaf version, which is executed on the PPU to partition work between SPUs in leaf tasks and maps these tasks to SPUs; and a leaf version, which encloses the actual task executed on the SPUs.

TPC also eliminates code for DMA transfers; however, it requires the programmer to write an SPU task dispatcher for selecting between different offloaded tasks on the SPU and for calculating the number and size of copy-in and copy-out arguments. This requirement adds 27 lines of SPU code. CellGen, StarSs, and CellMP require one fourth or fewer changes, due to the use of pragmas. The changes for all three models are similar. They consist of using pragmas to annotate the function (StarSs) or the code (CellGen and CellMP) to be offloaded. CellMP allows an additional blocking transformation of copy_in and copy_out data with pragmas, so it usually has more pragmas added than the other models. CellGen is closer to OpenMP, which annotates entire loops for parallelization, and therefore incurs the fewest code changes.

Figure 1 shows the bandwidth for CellStream measured in Gbytes per second (GBps). Because memory bandwidth limits the benchmark and Cell blades have two distributed memory modules, we present our evaluation with and without nonuniform memory access (NUMA)-aware memory allocation. The plot clearly shows the benefits of using NUMA-aware memory allocation.

Table 2. Changes in line counts in CellStream.

| File | Type | Change | SDK | Sequoia | CellGen | StarSs | TPC | CellMP |
|---|---|---|---|---|---|---|---|---|
| main.c | Offloading function | Added lines | 29 | 7 | 7 | 16 | 10 | 19 |
| | | Added pragmas | — | — | 1 | 2 | — | 8 |
| | | Removed lines | 2 | 2 | 3 | 5 | 18 | 7 |
| spe_main.c | SPE task/kernel | Added lines | 81 | 46 | — | — | 27 | — |
| | | Removed lines | 0 | 0 | — | — | 0 | — |
| Total | All lines | Added | 110 | 53 | 8 | 18 | 37 | 27 |
| | | Removed | 2 | 2 | 3 | 5 | 18 | 7 |
| | User functions | Added | One task | One task and one leaf | None | None | One task | None |

[Figure 1. Comparison of the programming models with respect to memory bandwidth (in Gbytes per second) obtained from CellStream, for 1 to 16 SPUs. Curves: Sequoia, StarSs, TPC, CellMP, SDK, CellGen-NUMA, CellMP-NUMA, and SDK-NUMA; the Cell broadband engine (BE) peak bandwidth is shown as a reference.]


The NUMA-aware versions of CellGen, CellMP, and the SDK outperform their NUMA-unaware counterparts, as well as Sequoia, TPC, and StarSs. These runtime systems enumerate SPUs and establish affinity between SPUs and their local DRAM modules, so that the memory touched by each SPU is initially allocated in a local DRAM module. StarSs, on the other hand, uses interleaved memory allocation, which places every other page in the same node upon first access to the page. StarSs cannot control the data-to-SPU mapping because of its dynamic scheduler. This, in turn, limits the achieved memory bandwidth. The TPC runtime is unaware of NUMA and suffers from a problem similar to StarSs because of the use of a dynamic task scheduler.

The differences in performance between the NUMA-aware versions of the CellGen, CellMP, and SDK runtime systems arise because of different SPU enumeration and clustering schemes. When the application requests fewer than eight SPUs, the SDK distributes the requested number of SPUs evenly between the blade's two Cell nodes, whereas CellGen and CellMP cluster the requested SPUs on the same node. The SDK's SPU-distribution scheme therefore increases the memory bandwidth available to each SPU.
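As an illustration of the kind of placement such NUMA-aware runtimes perform, the sketch below uses the standard libnuma interface to allocate a buffer on the DRAM module local to a chosen Cell node; the node numbering and fallback are illustrative, and the actual runtimes implement their own SPU enumeration and affinity logic.

```c
#include <numa.h>      /* libnuma; link with -lnuma */
#include <stdlib.h>

/* Allocate `size` bytes on the DRAM module of `node`, so SPUs
   enumerated on that node touch only local memory. */
void *alloc_on_node(size_t size, int node)
{
    if (numa_available() < 0)
        return malloc(size);            /* fall back if NUMA is unsupported */
    return numa_alloc_onnode(size, node);
}
```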

Fixedgrid

Table 3 shows the changes we made to Fixedgrid. Our observations about coding complexity in CellStream also hold for Fixedgrid. The Sequoia version uses additional SPE code to improve the performance of DMA transfers. CellGen minimizes the required code changes; however, this happens not only because of the use of pragmas to annotate entire loops, but also because CellGen cannot offload three matrix transpositions to SPUs, which are offloaded in the other programming models. The reason is that the current implementation of strided data accesses in CellGen requires explicit data padding from the programmer. Such data padding makes the resulting matrix incompatible with the format expected by the vectorized kernels. In this case, interoperability between the memory-allocation scheme in the programming model and the underlying machine-specific vectorization framework is essential to maximize performance.

Table 3. Changes in line counts in Fixedgrid.

| File | Type | Change | SDK | Sequoia-Scalar | Sequoia-SIMD | CellGen | StarSs | TPC | CellMP |
|---|---|---|---|---|---|---|---|---|---|
| transport.c (103 lines) | Offloading functions | Added lines | 42 | 27 | 27 | 32 | 22 | 95 | 40 |
| | | Added pragmas | — | — | — | 2 | 5 | — | 26 |
| | | Removed lines | 33 | 15 | 33 | 11 | 17 | 14 | 11 |
| transport_spe.c | SPE task | Added lines | 349 | 0 | 0 | — | — | — | — |
| discretize.c (143 lines) | SPE kernel | Added lines | 219 | 119 | 314 | 0 | 19 | 51 | 0 |
| | | Removed lines | 60 | 0 | 60 | 0 | 0 | 0 | 1 |
| Total | All lines | Added | 610 | 146 | 341 | 34 | 46 | 146 | 66 |
| | | Removed | 93 | 15 | 93 | 11 | 17 | 14 | 12 |
| | User functions | Added | Two tasks | Two tasks and two leaves | Two tasks and two leaves | None | Two tasks | Two tasks | None |

Figure 2a shows the performance of Fixedgrid in terms of the speedup of each solution with regard to the time obtained when running the serial version on the PPU (237 seconds). We use the SDK-SIMD version as a reference. The SDK-SIMD and Sequoia-SIMD versions fully vectorize the row discretization performed in Fixedgrid. With this optimization, data transfers and SIMD operations are fully overlapped and optimized, respectively. All programming models scale similarly, except for CellGen, because of its inability to offload the data transpositions. For TPC and CellMP, we use two kernel codes. The gcc compiler automatically vectorizes the gccSIMD version; we hand-tuned the optimized kernel (OPTK) version, eliminating most branches in the kernel's innermost function. The results show that autovectorization can occasionally achieve performance comparable to manual vectorization and that both TPC and CellMP interoperate efficiently with alternative kernel vectorization schemes, in the sense that their memory-management schemes prevent neither automatic nor hand-tuned vectorization.

PBPI

Table 4 summarizes the changes in source code for PBPI. Our observations on coding effort are similar to the ones we made for the other benchmarks. TPC in particular adds 215 lines of code on the PPU because of the use of a separate offloading code path for each of the three main kernels of PBPI that are enclosed in tasks (labeled L1, L2, and L3). For CellGen-SIMD, StarSs-SIMD, and CellMP-SIMD, we provide separate line counts for each kernel (rows labeled L[123]_spe.c), although the kernels are not really added into separate files. Overall, SDK, Sequoia, and TPC-SIMD need more changes to the original code. Note that if the native compiler could vectorize the kernels automatically, we could save about 150 lines in each model.

Table 4. Changes in line counts in PBPI.

| File | Type | Change | SDK-Scalar | SDK-SIMD | Sequoia-Scalar | Sequoia-SIMD | CellGen-SIMD | StarSs-Scalar | StarSs-SIMD | TPC-SIMD | CellMP-Scalar | CellMP-SIMD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| likelihood.c (463 lines) | Offloading functions | Added lines | 32 | 32 | 19 | 19 | 14 | 123 | 64 | 215 | 17 | 17 |
| | | Added pragmas | — | — | — | — | 3 | 4 | 4 | — | 19 | 19 |
| | | Removed lines | 52 | 52 | 54 | 54 | 70 | 58 | 58 | 13 | 2 | 115 |
| pbpi.spe.c | Task/kernel | Added lines | 200 | 219 | 223 | 170 | — | — | — | 48 | — | — |
| L1_spe.c | Kernel | Added lines | 18 | 28 | 0 | 56 | 43 | 0 | 60 | 71 | 0 | 55 |
| L2_spe.c | Kernel | Added lines | 25 | 41 | 0 | 72 | 65 | 0 | 74 | 94 | 0 | 74 |
| L3_spe.c | Kernel | Added lines | 17 | 21 | 0 | 53 | 36 | 0 | 24 | 26 | 15 | 60 |
| Total | All lines | Added | 292 | 341 | 242 | 370 | 158 | 123 | 222 | 454 | 32 | 206 |
| | | Removed | 52 | 52 | 54 | 54 | 70 | 58 | 58 | 13 | 2 | 115 |
| | User functions | Added | Three tasks | Three tasks | Three tasks and three leaves | Three tasks and three leaves | Three tasks | Three tasks | Three tasks | Three PPU/three SPU | None | None |

[Figure 2. Comparison of the programming models and variants (Scalar, SIMD, gccSIMD, and OPTK) with respect to the speedup obtained in Fixedgrid (a) and PBPI (b), for 1 to 16 SPUs.]


Figure 2b shows the performance obtained from each of the programming models in terms of speedup with regard to the serial time obtained on the PPU (671 seconds). It shows the Scalar and SIMD versions for each programming model evaluated, and the hand-coded SDK-SIMD version. SIMD versions get increased performance compared to scalar versions. In PBPI, tasks are small enough to stress the PPU spawning mechanism. We have found two ways to achieve higher performance. On one hand, StarSs and CellMP can eliminate barriers between parallel regions. StarSs does so by using runtime detection of dependences between tasks, and running chains of dependent tasks on the same SPU. CellMP's static scheduling of tasks to SPUs allows the programmer to remove the barriers if there is no reduction operation in the parallel loops. On the other hand, CellGen, CellMP, and the SDK versions exploit parallelism in a way similar to OpenMP parallel for loops. Exploiting parallel loops in this way coarsens granularity and creates fewer tasks. Having to create fewer tasks puts less pressure on the runtime system's task instantiation and scheduling mechanism that runs on the PPU. Furthermore, each task must bring in the data it works with at runtime. CellMP achieves this through an option to express copy_in/copy_out data movements in the middle of task code, rather than only at the beginning and the end. Even with this approach, CellMP-SIMD is 4 seconds slower than the SDK version when using 16 SPUs.

We are currently working to incorporate the dependence-graph management of StarSs and the data-movement hints into OpenMP for accelerators, and participating in the definition of the OpenMP standard.16 In the future, we will continue porting applications to work on TPC and CellMP to further demonstrate their usefulness. We will also move toward the exploitation of GPUs with pragma-based programming models. In this direction, we think that the appropriate way to go is to have full interoperability between OpenMP and OpenCL: OpenMP does the hard work of code offloading, and OpenCL solves most of the problems related to the exploitation of vectorization.17


Acknowledgments

We gratefully acknowledge the support of the European Commission through the SARC IP project (contract no. 27648), the ENCORE STREP project (contract no. 248647), the HiPEAC-2 Network of Excellence (FP7/ICT 217068), the MCF IRG project I-Cores (contract no. IRG-224759), the Spanish Ministry of Education (contracts TIN2007-60625 and CSD2007-00050), the Generalitat de Catalunya (2009-SGR-980), and the BSC-IBM MareIncognito project; the support of the US National Science Foundation through grants CCR-0346867, CCF-0715051, CNS-0521381, CNS-0720750, and CNS-0720673; the support of the US Department of Energy through grants DE-FG02-06ER25751 and DE-FG02-05ER25689; and the support of IBM through grant VTF-874197.

References

1. M.D. Hill and M.R. Marty, "Amdahl's Law in the Multicore Era," Computer, vol. 41, no. 7, July 2008, pp. 33-38.
2. K. O'Brien et al., "Supporting OpenMP on Cell," Int'l J. Parallel Programming, vol. 36, no. 3, 2008, pp. 289-311.
3. K. Fatahalian et al., "Sequoia: Programming the Memory Hierarchy," Proc. 2006 Conf. High-Performance Networking and Computing (SC 06), IEEE CS Press, 2006, pp. 83-92.
4. P. Cooper et al., "Offload: Automating Code Migration to Heterogeneous Multicore Systems," Proc. High-Performance Embedded Architectures and Compilers (HiPEAC 10), LNCS 5952, Springer, 2010, pp. 307-321.
5. J.M. Perez et al., "CellSs: Making It Easier to Program the Cell Broadband Engine Processor," IBM J. Research and Development, vol. 51, no. 5, Sept. 2007, pp. 593-604.
6. S. Schneider et al., "A Comparison of Programming Models for Multiprocessors with Explicitly Managed Memory Hierarchies," Proc. 14th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP 09), ACM Press, 2009, pp. 131-140.
7. J. Nieplocha et al., "High Performance Remote Memory Access Communication: The ARMCI Approach," Int'l J. High Performance Computing Applications, vol. 20, no. 2, 2006, pp. 233-253.
8. J.M. Perez, R.M. Badia, and J. Labarta, "A Dependency-aware Task-based Programming Environment for Multi-core Architectures," Proc. IEEE Int'l Conf. Cluster Computing, IEEE CS Press, 2008, pp. 142-151.
9. G. Tzenakis et al., "Tagged Procedure Calls (TPC): Efficient Runtime Support for Task-Based Parallelism on the Cell Processor," Proc. High-Performance Embedded Architectures and Compilers (HiPEAC 10), LNCS 5952, Springer, 2010, pp. 307-321.
10. R. Ferrer et al., "Analysis of Task Offloading for Accelerators," Proc. High-Performance Embedded Architectures and Compilers (HiPEAC 10), LNCS 5952, Springer, 2010, pp. 322-336.
11. V. Beltran et al., "CellMT: A Cooperative Multithreading Library for the Cell BE," Proc. 16th Ann. IEEE Int'l Conf. High Performance Computing (HiPC 09), IEEE CS Press, 2009, pp. 245-253.
12. V. Beltran et al., Cooperative Multithreading on the Cell BE, tech. report, Computer Architecture Dept., Technical Univ. of Catalonia, 2009.
13. K. Kennedy and J.R. Allen, Optimizing Compilers for Modern Architectures: A Dependence-based Approach, Morgan Kaufmann Publishers, 2002.
14. S.S. Muchnick, Advanced Compiler Design and Implementation, Morgan Kaufmann Publishers, 1997.
15. J. Xue, Loop Tiling for Parallelism, Kluwer Academic Publishers, 2000.
16. E. Ayguade et al., "A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures," Proc. Evolving OpenMP in an Age of Extreme Parallelism (IWOMP 09), LNCS 5568, Springer, 2009, pp. 154-167.
17. R. Ferrer et al., "Optimizing the Exploitation of Multicore Processors and GPUs with OpenMP and OpenCL," Proc. 23rd Int'l Workshop Languages and Compilers for Parallel Computing (LCPC 10), Springer-Verlag, 2010.

Roger Ferrer is a researcher at the Barcelona Supercomputing Center. His research interests include language and compiler support for parallel programming models. Ferrer has a master's degree in computer architecture and network systems from the Universitat Politecnica de Catalunya.

Pieter Bellens is a PhD student at the Universitat Politecnica de Catalunya and a member of the Parallel Programming Models group at the Barcelona Supercomputing Center. His research interests include parallel programming, scheduling, and the Cell broadband engine. Bellens has a master's degree in computer science from the Katholieke Universiteit Leuven, Belgium.

Vicenc Beltran is a senior researcher in the Computer Science Department at the Barcelona Supercomputing Center. His research interests are heterogeneous and hybrid systems, domain-specific languages, and operating systems. Beltran has a PhD in computer science from the Universitat Politecnica de Catalunya.

Marc Gonzalez is an associate professor in the Computer Architecture Department at the Universitat Politecnica de Catalunya (UPC). His research interests include parallel computing, virtualization, and power consumption. Gonzalez has a PhD in computer science from UPC.

Xavier Martorell is an associate professor in the Computer Architecture Department at the Universitat Politecnica de Catalunya (UPC). His research interests include support for parallel computing, programming models, and operating systems. Martorell has a PhD in computer science from UPC. He is a member of IEEE.

Rosa M. Badia is manager of the grid computing and clusters group at the Barcelona Supercomputing Center and a scientific researcher at the Spanish National Research Council. Her research interests include performance prediction and modeling of MPI programs and programming models for complex platforms (from multicore to the grid/cloud). Badia has a PhD in computer science from the Universitat Politecnica de Catalunya.

Eduard Ayguade is a full professor at the Universitat Politecnica de Catalunya and associate director for research in computer sciences at the Barcelona Supercomputing Center. His research interests include multicore architectures, programming models, and their architectural support.

Jae-Seung Yeom is a graduate student at Virginia Tech. His research interests include parallel programming models and large-scale simulation. Yeom has an MS in information networking from Carnegie Mellon University. He is a member of IEEE.

Scott Schneider is a PhD candidate in the Computer Science Department at Virginia Tech and a member of the Parallel Emerging Architecture Research Laboratory. His research interests include high-performance computing, systems, and programming languages. Schneider has a master's degree in computer science from the College of William and Mary. He is a member of IEEE and the ACM.

Konstantinos Koukos received an MS in computer science from the University of Crete. His research interests include support for parallel applications and heterogeneous computing.

Michail Alvanos is a PhD student at the Universitat Politecnica de Catalunya. His research interests include parallel programming models and heterogeneous architectures. Alvanos has an MS in computer science from the University of Crete.

Dimitrios S. Nikolopoulos is an associate professor of computer science at the University of Crete and an affiliated faculty member of the Institute of Computer Science at the Foundation for Research and Technology-Hellas (FORTH). His research interests include the hardware-software interface of parallel computer architectures. He is a member of IEEE and the ACM.

Angelos Bilas is an associate professor at the Institute of Computer Science at the Foundation for Research and Technology-Hellas and the University of Crete. His research interests include architectures and runtime-system support for scalable systems, low-latency high-bandwidth communication protocols, and miniaturization of computer systems. Bilas has a PhD in computer science from Princeton University.

Direct questions and comments about this article to Xavier Martorell, Computer Sciences Dept., Barcelona Supercomputing Center, Campus Nord, C6, c/Jordi Girona 1-3, 08034 Barcelona, Spain; [email protected].
