Scheduling on Asymmetric Parallel Architectures
Filip Blagojevic
Dissertation submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in
Computer Science and Applications
Committee Members:
Dimitrios S. Nikolopoulos (Chair)
Kirk W. Cameron
Wu-chun Feng
David K. Lowenthal
Calvin J. Ribbens

May 30, 2008
Blacksburg, Virginia
Keywords: Multicore processors, Cell BE, process scheduling, high-performance computing, performance prediction, runtime adaptation
© Copyright 2008, Filip Blagojevic
Scheduling on Asymmetric Parallel Architectures
Filip Blagojevic
(ABSTRACT)
We explore runtime mechanisms and policies for scheduling dynamic multi-grain parallelism on heterogeneous multi-core processors. Heterogeneous multi-core processors integrate conventional cores that run legacy codes with specialized cores that serve as computational accelerators. The term multi-grain parallelism refers to the exposure of multiple dimensions of parallelism from within the runtime system, so as to best exploit a parallel architecture with heterogeneous computational capabilities between its cores and execution units. To maximize performance on heterogeneous multi-core processors, programs need to expose multiple dimensions of parallelism simultaneously. Unfortunately, programming with multiple dimensions of parallelism is to date an ad hoc process, relying heavily on the intuition and skill of programmers. Formal techniques are needed to optimize multi-dimensional parallel program designs. We investigate user- and kernel-level schedulers that dynamically "rightsize" the dimensions and degrees of parallelism on asymmetric parallel platforms. The schedulers address the problem of mapping application-specific concurrency to an architecture with multiple hardware layers of parallelism, without requiring programmer intervention or sophisticated compiler support. Our runtime environment outperforms the native Linux and MPI scheduling environment by up to a factor of 2.7. We also present a model of multi-dimensional parallel computation for steering the parallelization process on heterogeneous multi-core processors. The model predicts with high accuracy the execution time and scalability of a program using conventional processors and accelerators simultaneously. More specifically, the model reveals optimal degrees of multi-dimensional, task-level and data-level concurrency, to maximize performance across cores. We evaluate our runtime policies, as well as the performance model we developed, on an IBM Cell BladeCenter and on a cluster composed of PlayStation 3 nodes, using two realistic bioinformatics applications.
ACKNOWLEDGMENTS
I would like to thank my advisor Dr. Dimitrios S. Nikolopoulos for his guidance during my graduate studies. I would also like to thank Dr. Alexandros Stamatakis, Dr. Xizhou Feng, and Dr. Kirk Cameron for providing us with the original MPI implementations of PBPI and RAxML and for discussions on scheduling and modeling the Cell/BE. I would like to thank the members of the PEARL group, Dr. Christos Antonopoulos, Dr. Matthew Curtis-Maury, Scott Schneider, Jae-Sung Yeom, and Benjamin Rose, for their involvement in the projects presented in this dissertation. I would also like to thank my Ph.D. committee for their discussion and suggestions for this work: Dr. Kirk W. Cameron, Dr. David Lowenthal, Dr. Wu-chun Feng, and Dr. Calvin J. Ribbens. Also, I thank Georgia Tech, its Sony-Toshiba-IBM Center of Competence, and NSF, for the Cell/BE resources that have contributed to this research. Finally, I would like to thank the institutions that have funded this research: the National Science Foundation and the U.S. Department of Energy.
Contents
1 Problem Statement 1
1.1 Mapping Parallelism to Asymmetric Parallel Architectures . . . . . . . . . . . 2
2 Statement of Objectives 5
2.1 Dynamic Multigrain Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Rightsizing Multigrain Parallelism . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 MMGP Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3 Experimental Testbed 11
3.1 RAxML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 PBPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Hardware Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4 Code Optimization Methodologies for Asymmetric Multi-core Systems with Explicitly Managed Memories 17
4.1 Porting and Optimizing RAxML on Cell . . . . . . . . . . . . . . . . . . . . . 18
4.2 Function Off-loading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2.1 Optimizing Off-Loaded Functions . . . . . . . . . . . . . . . . . . . . 19
4.2.2 Vectorizing Conditional Statements . . . . . . . . . . . . . . . . . . . 20
4.2.3 Double Buffering and Memory Management . . . . . . . . . . . . . . 23
4.2.4 Vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2.5 PPE-SPE Communication . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.6 Increasing the Coverage of Offloading . . . . . . . . . . . . . . . . . . 28
4.3 Parallel Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5 Scheduling Multigrain Parallelism on Asymmetric Systems 33
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2 Scheduling Multi-Grain Parallelism on Cell . . . . . . . . . . . . . . . . . . . 33
5.2.1 Event-Driven Task Scheduling . . . . . . . . . . . . . . . . . . . . . . 34
5.2.2 Scheduling Loop-Level Parallelism . . . . . . . . . . . . . . . . . . . 36
5.2.3 Implementing Loop-Level Parallelism . . . . . . . . . . . . . . . . . . 42
5.3 Dynamic Scheduling of Task- and Loop-Level Parallelism . . . . . . . . . . . 43
5.3.1 Application-Specific Hybrid Parallelization on Cell . . . . . . . . . . . 44
5.3.2 MGPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.4 S-MGPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.4.1 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.4.2 Sampling-Based Scheduler for Multi-grain Parallelism . . . . . . . . . 51
5.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6 Model of Multi-Grain Parallelism 61
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.2 Modeling Abstractions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.2.1 Hardware Abstraction . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.2.2 Application Abstraction . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.3 Model of Multi-grain Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.3.1 Modeling sequential execution . . . . . . . . . . . . . . . . . . . . . . 66
6.3.2 Modeling parallel execution on APUs . . . . . . . . . . . . . . . . . . 67
6.3.3 Modeling parallel execution on HPUs . . . . . . . . . . . . . . . . . . 69
6.3.4 Using MMGP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.3.5 MMGP Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.4 Experimental Validation and Results . . . . . . . . . . . . . . . . . . . . . . . 72
6.4.1 MMGP Parameter approximation . . . . . . . . . . . . . . . . . . . . 73
6.4.2 Case Study I: Using MMGP to parallelize PBPI . . . . . . . . . . . . . 74
6.4.3 Case Study II: Using MMGP to Parallelize RAxML . . . . . . . . . . 77
6.4.4 MMGP Usability Study . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7 Scheduling Asymmetric Parallelism on a PS3 Cluster 85
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.2 Experimental Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.3 PS3 Cluster Scalability Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.3.1 MPI Communication Performance . . . . . . . . . . . . . . . . . . . . 88
7.3.2 Application Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.4 Modeling Hybrid Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.4.1 Modeling PPE Execution Time . . . . . . . . . . . . . . . . . . . . . . 94
7.4.2 Modeling the off-loaded Computation . . . . . . . . . . . . . . . . . . 96
7.4.3 DMA Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.4.4 Cluster Execution Modeling . . . . . . . . . . . . . . . . . . . . . . . 98
7.4.5 Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.5 Co-Scheduling on Asymmetric Clusters . . . . . . . . . . . . . . . . . . . . . 99
7.6 PS3 versus IBM QS20 Blades . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
8 Kernel-Level Scheduling 107
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
8.2 SLED Scheduler Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
8.3 ready to run List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
8.3.1 ready to run List Organization . . . . . . . . . . . . . . . . . . . . . . 110
8.3.2 Splitting ready to run List . . . . . . . . . . . . . . . . . . . . . . . . 111
8.4 SLED Scheduler - Kernel Level . . . . . . . . . . . . . . . . . . . . . . . . . 113
8.5 SLED Scheduler - User Level . . . . . . . . . . . . . . . . . . . . . . . . . . 116
8.6 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
8.6.1 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8.6.2 Microbenchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8.6.3 PBPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
8.6.4 RAxML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
8.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
9 Future Work 127
9.1 Integrating ready-to-run list in the Kernel . . . . . . . . . . . . . . . . . . . . 128
9.2 Load Balancing and Task Priorities . . . . . . . . . . . . . . . . . . . . . . . . 130
9.3 Increasing Processor Utilization . . . . . . . . . . . . . . . . . . . . . . . . . 131
9.4 Novel Applications and Programming Models . . . . . . . . . . . . . . . . . . 132
9.5 Conventional Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
9.6 MMGP extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
10 Overview of Related Research 135
10.1 Cell – Related Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
10.2 Process Scheduling - Related Research . . . . . . . . . . . . . . . . . . . . . . 138
10.3 Modeling – Related Research . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
10.3.1 PRAM Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
10.3.2 BSP model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
10.3.3 LogP model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
10.3.4 Models Describing Nested Parallelism . . . . . . . . . . . . . . . . . . 144
Bibliography 147
List of Figures
2.1 A hardware abstraction of an accelerator-based architecture. Host processing units (HPUs) supply coarse-grain parallel computation across accelerators. Accelerator processing units (APUs) are the main computation engines and may support internally finer grain parallelism. . . . . . . . . . . . . . . . 6
3.1 Organization of Cell. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.1 The likelihood vector structure is used in almost all memory traffic between main memory and the local storage of the SPEs. The structure is 128-bit aligned, as required by the Cell architecture. . . . . . . . . . . . . . . . 23
4.2 The body of the first loop in newview(): a) Non-vectorized code, b) Vectorized code. . . . . . . . . . . . . . . . 25
4.3 The second loop in newview(). Non-vectorized code shown on the left, vectorized code shown on the right. spu_madd() multiplies the first two arguments and adds the result to the third argument. spu_splats() creates a vector by replicating a scalar element. . . . . . . . . . . . . . . . 26
4.4 Performance of (a) RAxML and (b) PBPI with different numbers of MPI processes. . . . . . . . . . . . . . . . 29
5.1 Scheduler behavior for two off-loaded tasks, representative of RAxML. Case (a) illustrates the behavior of the EDTLP scheduler. Case (b) illustrates the behavior of the Linux scheduler with the same workload. The numbers correspond to MPI processes. The shaded slots indicate context switching. The example assumes a Cell-like system with four SPEs. . . . . . . . . . . . . . . . 36
5.2 Parallelizing a loop across SPEs using a work-sharing model with an SPE designated as the master. . . . . . . . . . . . . . . . 39
5.3 The data structure Pass is used for communication among SPEs. The vi_ad variables are used to pass input arguments for the loop body from one local storage to another. The variable sig is used as a notification signal that the memory transfer for the shared data updated during the loop is completed. The variable res is used to send results back to the master SPE, and as a dependence resolution mechanism. . . . . . . . . . . . . . . . 42
5.4 Parallelization of the loop from function evaluate() in RAxML. The left side depicts the code executed by the master SPE, while the right side depicts the code executed by a worker SPE. Num_SPE represents the number of SPE worker threads. . . . . . . . . . . . . . . . 44
5.5 Comparison of task-level and hybrid parallelization schemes in RAxML, on the Cell BE. The input file is 42_SC. The number of ML trees created is (a) 1–16, (b) 1–128. . . . . . . . . . . . . . . . 45
5.6 MGPS, EDTLP and static EDTLP-LLP. Input file: 42_SC. Number of ML trees created: (a) 1–16, (b) 1–128. . . . . . . . . . . . . . . . 49
5.7 Execution time of RAxML with a variable number of SPE threads. The input dataset is 25_SC. . . . . . . . . . . . . . . . 51
5.8 Execution times of RAxML with various static multi-grain scheduling strategies. The input dataset is 25_SC. . . . . . . . . . . . . . . . 51
5.9 The sampling phase of S-MGPS. Samples are taken from four execution intervals, during which the code performs identical operations. For each sample, each MPI process uses a variable number of SPEs to parallelize its enclosed loops. . . . . . . . . . . . . . . . 53
5.10 PBPI executed with different levels of TLP and LLP parallelism: deg(TLP)=1–4, deg(LLP)=1–16 . . . . . . . . . . . . . . . 56
6.1 A hardware abstraction of an accelerator-based architecture with two layers of parallelism. Host processing units (HPUs) supply coarse-grain parallel computation across accelerators. Accelerator processing units (APUs) are the main computation engines and may support internally finer grain parallelism. Both HPUs and APUs have local memories and communicate through shared memory or message passing. Additional layers of parallelism can be expressed hierarchically in a similar fashion. . . . . . . . . . . . . . . . 62
6.2 Our application abstraction of two parallel tasks. Two tasks are spawned by the main process. Each task exhibits phased, multi-level parallelism of varying granularity. In this dissertation, we address the problem of mapping tasks and subtasks to accelerator-based systems. . . . . . . . . . . . . . . . 64
6.3 The sub-phases of a sequential application are readily mapped to HPUs and APUs. In this example, sub-phases 1 and 3 execute on the HPU and sub-phase 2 executes on the APU. HPUs and APUs are assumed to communicate via shared memory. . . . . . . . . . . . . . . . 66
6.4 Parallel APU execution. The HPU (leftmost bar in parts a and b) offloads computations to one APU (part a) and two APUs (part b). The single point-to-point transfer of part a is modeled as overhead plus computation time on the APU. For multiple transfers, there is additional overhead (g), but also benefits due to parallelization. . . . . . . . . . . . . . . . 68
6.5 Parallel HPU execution. The HPU (center bar) offloads computations to 4 APUs (2 on the right and 2 on the left). The first thread on the HPU offloads computation to APU1 and APU2, then idles. The second HPU thread is switched in, offloads code to APU3 and APU4, and then idles. APU1 and APU2 complete and return data, followed by APU3 and APU4. . . . . . . . . . . . . . . . 69
6.6 MMGP predictions and actual execution times of PBPI, when the code uses one dimension of PPE (HPU) parallelism. . . . . . . . . . . . . . . . 75
6.7 MMGP predictions and actual execution times of PBPI, when the code uses one dimension of SPE (APU) parallelism, with a data-parallel implementation of the maximum likelihood calculation. . . . . . . . . . . . . . . . 76
6.8 MMGP predictions and actual execution times of PBPI, when the code uses two dimensions of SPE (APU) and PPE (HPU) parallelism. The mix of degrees of parallelism which optimizes performance is 4-way PPE parallelism combined with 4-way SPE parallelism. The chart illustrates the results when both SPE parallelism and PPE parallelism are scaled to two Cell processors. . . . . . . . . . . . . . . . 78
6.9 MMGP predictions and actual execution times of RAxML, when the code uses one dimension of PPE (HPU) parallelism: (a) with DS1, (b) with DS2. . . . . . . . . . . . . . . . 79
6.10 MMGP predictions and actual execution times of RAxML, when the code uses one dimension of SPE (APU) parallelism: (a) with DS1, (b) with DS2. . . . . . . . . . . . . . . . 80
6.11 MMGP predictions and actual execution times of RAxML, when the code uses two dimensions of SPE (APU) and PPE (HPU) parallelism. Performance is optimized by oversubscribing the PPE and maximizing task-level parallelism. . . . . . . . . . . . . . . . 82
6.12 Overhead of the sampling phase when the MMGP scheduler is used with the PBPI application. PBPI is executed multiple times with 107 input species. The sequence size of the input file is varied from 1,000 to 10,000. In the worst case, the overhead of the sampling phase is 2.2% (sequence size 7,000). . . . . . . . . . . . . . . . 83
7.1 MPI_Allreduce() performance on the PS3 cluster. Processes are distributed evenly between nodes. Each node runs up to 6 processes, using shared memory for communication within the node. . . . . . . . . . . . . . . . 89
7.2 MPI_Send/Recv() latency on the PS3 cluster. Processes are distributed evenly between nodes. Each node runs up to 6 processes, using shared memory for communication within the node. . . . . . . . . . . . . . . . 90
7.3 Measured and predicted performance of applications on the PS3 cluster. PBPI is executed with weak scaling. RAxML is executed with strong scaling. x-axis notation: Nnode - number of nodes, Nprocess - number of processes per node, NSPE - number of SPEs per process. . . . . . . . . . . . . . . . 92
7.4 Four cases illustrating the importance of co-scheduling PPE threads and SPE threads. Threads labeled "P" are PPE threads, while threads labeled "S" are SPE threads. We assume that P-threads and S-threads communicate through shared memory. P-threads poll shared memory locations directly to detect if a previously off-loaded S-thread has completed. Striped intervals indicate yielding of the PPE, dark intervals indicate computation leading to a thread off-load on an SPE, light intervals indicate computation yielding the PPE without off-loading on an SPE. Stars mark cases of mis-scheduling. . . . . . . . . . . . . . . . 95
7.5 SPE execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.6 Double buffering template for tiled parallel loops. . . . . . . . . . . . . . . . . 97
7.7 Performance of the yield-if-not-ready policy and the native Linux scheduler in PBPI and RAxML. x-axis notation: Nnode - number of nodes, Nprocess - number of processes per node, NSPE - number of SPEs per process. . . . . . . . . . . . . . . . 101
7.8 Performance of different scheduling strategies in PBPI and RAxML. . . . . . . 103
7.9 Comparison between the PS3 cluster and an IBM QS20 cluster. . . . . . . . . . 104
8.1 Upon completing the assigned tasks, the SPEs send a signal to the PPE processes through the ready-to-run list. The PPE process which decides to yield passes the data from the ready-to-run list to the kernel, which in return can schedule the appropriate process on the PPE. . . . . . . . . . . . . . . . 108
8.2 Vertical overview of the SLED scheduler. The user level part contains the ready-to-run list, shared among the processes, while the kernel part contains the system call through which the information from the ready-to-run list is passed to the kernel. . . . . . . . . . . . . . . . 109
8.3 Process P1, which is bound to CPU1, needs to be scheduled to run by the scheduler that was invoked on CPU2. Consequently, the kernel needs to perform migration of the process P1 from CPU1 to CPU2. . . . . . . . . . . . . . . . 112
8.4 System call for migrating the processes across the execution contexts. Function sched_migrate_task() performs the actual migration. The SLEDS_yield() function schedules the process to be the next to run on the CPU. . . . . . . . . . . . . . . . 113
8.5 The ready_to_run list is split in two parts. Each of the two sublists contains processes that are sharing an execution context (CPU1 or CPU2). This approach avoids any possibility of expensive process migration across the execution contexts. . . . . . . . . . . . . . . . 114
8.6 Execution flow of the SLEDS_yield() function: (a) The appropriate process is found in the running list (tree), (b) The process is pulled out from the list, and its priority is increased, (c) The process is returned to the list, and since its priority is increased it will be stored at the leftmost position. . . . . . . . . . . . . . . . 115
8.7 Outline of the SLEDS scheduler: Upon off-loading, a process is required to call the SLEDS_Offload() function. SLEDS_Offload() checks if the off-loaded task has finished (Line 14), and if not, calls the yield() function. yield() scans the ready_to_run list, and yields to the next process by executing the SLEDS_yield() system call. . . . . . . . . . . . . . . . 117
8.8 Execution times of RAxML when the ready_to_run list is scanned between 50 and 1000 times. The x-axis represents the number of scans of the ready_to_run list. The y-axis represents the execution time. Note that the lowest value for the y-axis is 12.5, and the difference between the lowest and the highest execution time is 4.2%. The input file contains 10 species, each represented by 1800 nucleotides. . . . . . . . . . . . . . . . 118
8.9 Comparison of the EDTLP and SLED schemes using microbenchmarks: Total execution time is measured as the length of the off-loaded tasks is increased. . . . . . . . . . . . . . . . 119
8.10 Comparison of the EDTLP and SLED schemes using microbenchmarks: Total execution time is measured as the length of the off-loaded tasks is increased – task size is limited to 2.1 µs. . . . . . . . . . . . . . . . 120
8.11 EDTLP outperforms SLED for small task sizes due to the higher complexity of the SLED scheme. . . . . . . . . . . . . . . . 121
8.12 Comparison of the EDTLP scheme and the combination of the SLED and EDTLP schemes using microbenchmarks. EDTLP is used for task sizes smaller than 15 µs. . . . . . . . . . . . . . . . 121
8.13 Comparison of the EDTLP scheme and the combination of the SLED and EDTLP schemes using microbenchmarks. EDTLP is used for task sizes smaller than 15 µs – task size is limited to 2.1 µs. . . . . . . . . . . . . . . . 122
8.14 Comparison of the EDTLP and SLED schemes using the PBPI application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis). . . . . . . . . . . . . . . . 123
8.15 Comparison of EDTLP and the combination of the SLED and EDTLP schemes using the PBPI application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis). . . . . . . . . . . . . . . . 124
8.16 Comparison of the EDTLP and SLED schemes using the RAxML application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis). . . . . . . . . . . . . . . . 124
8.17 Comparison of EDTLP and the combination of the SLED and EDTLP schemes using the RAxML application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis). . . . . . . . . . . . . . . . 125
9.1 Upon completing the assigned tasks, SPEs send signals to PPE processes through the ready-to-run list. The PPE process which decides to yield passes the data from the ready-to-run queue to the kernel, which in return can schedule the appropriate process on the PPE. . . . . . . . . . . . . . . . 129
List of Tables
4.1 Execution time of RAxML (in seconds). The input file is 42_SC. (a) The whole application is executed on the PPE, (b) newview() is offloaded on one SPE. . . . . . . . . . . . . . . . 20
4.2 Execution time of RAxML after the floating-point conditional statement is transformed to an integer conditional statement and vectorized. The input file is 42_SC. . . . . . . . . . . . . . . . 22
4.3 Execution time of RAxML with double buffering applied to overlap DMA transfers with computation. The input file is 42_SC. . . . . . . . . . . . . . . . 24
4.4 Execution time of RAxML following vectorization. The input file is 42_SC. . . 27
4.5 Execution time of RAxML following the optimization of communication to use direct memory-to-memory transfers. The input file is 42_SC. . . . . . . . . . . 28
4.6 Execution time of RAxML after offloading and optimizing three functions: newview(), makenewz() and evaluate(). The input file is 42_SC. . . . . . . . . 29
5.1 Performance comparison for (a) RAxML and (b) PBPI with two schedulers. The second column shows execution time with the EDTLP scheduler. The third column shows execution time with the native Linux kernel scheduler. The workload for RAxML contains 42 organisms. The workload for PBPI contains 107 organisms. . . . . . . . . . . . . . . . 37
5.2 Execution time of RAxML when loop-level parallelism (LLP) is exploited in one bootstrap, via work distribution between SPEs. The input file is 42_SC: (a) DNA sequences are represented with 10,000 nucleotides, (b) DNA sequences are represented with 20,000 nucleotides. . . . . . . . . . . . . . . . 40
5.3 Execution time of PBPI when loop-level parallelism (LLP) is exploited via work distribution between SPEs. The input file is 107_SC: (a) DNA sequences are represented with 1,000 nucleotides, (b) DNA sequences are represented with 10,000 nucleotides. . . . . . . . . . . . . . . . 41
5.4 Efficiency of different program configurations with two data sets in RAxML. The best configuration for the 42_SC input is deg(TLP)=8, deg(LLP)=1. The best configuration for 25_SC is deg(TLP)=4, deg(LLP)=2. deg() corresponds to the degree of a given dimension of parallelism (LLP or TLP). . . . . . . . . . . . . . . . 54
5.5 RAxML – Comparison between S-MGPS and static scheduling schemes, illustrating the convergence overhead of S-MGPS. . . . . . . . . . . . . . . . 55
5.6 PBPI – comparison between S-MGPS and static scheduling schemes: (a) deg(TLP)=1, deg(LLP)=1–16; (b) deg(TLP)=2, deg(LLP)=1–8; (c) deg(TLP)=4, deg(LLP)=1–4; (d) deg(TLP)=8, deg(LLP)=1–2. . . . . . . . . . . . . . . . 58
Chapter 1
Problem Statement
In the quest for delivering higher performance to scientific applications, hardware designers began to move away from superscalar processor models and embraced architectures with multiple processing cores. Although all commodity microprocessor vendors are marketing multicore processors, these processors are largely based on replication of superscalar cores. Unfortunately, superscalar designs exhibit well-known performance and power limitations. These limitations, in conjunction with a sustained requirement for higher performance, stimulated interest in unconventional processor designs that combine parallelism with acceleration. These designs leverage multiple cores, some of which are customized accelerators for data-intensive computation. Examples of these heterogeneous, accelerator-based parallel architectures include the Cell BE [3], GPGPUs [4], the Rapport KiloCore [2], and EXOCHI [96].
As a case study and a representative of accelerator-based asymmetric architectures, in this dissertation we investigate the Cell Broadband Engine (CBE). Cell has recently drawn considerable attention from industry and academia. Since it was originally designed for the game console market, Cell has low cost and a modest power budget. Nevertheless, the processor is able to achieve unprecedented peak performance for some real-world applications. IBM recently announced the use of Cell chips in RoadRunner, a new Petaflop system with 16,000 Cells, due for delivery in 2008.
The potential of the Cell BE has been demonstrated convincingly in a number of studies [33, 39, 69, 74, 91]. Thanks to eight high-frequency execution cores with pipelined SIMD capabilities and an aggressive data transfer architecture, Cell has a theoretical peak performance of over 200 Gflops for single-precision FP calculations and a peak memory bandwidth of over 25 Gigabytes/s. These performance figures position Cell ahead of the most powerful commodity microprocessors. Cell has already demonstrated impressive performance in applications and computational kernels with highly vectorizable data parallelism, such as signal processing, compression, encryption, and dense and sparse numerical kernels [12, 13, 15, 39, 48, 49, 66, 75, 78, 79, 99].
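The 200 Gflops figure can be reconstructed from the organization of the synergistic cores: at 3.2 GHz, each of the eight SPEs can issue one 4-wide single-precision fused multiply-add per cycle, i.e., 8 flops per cycle per SPE. A back-of-the-envelope check:

```latex
8\ \text{SPEs} \times 3.2\ \tfrac{\text{Gcycles}}{\text{s}}
\times \underbrace{4}_{\text{SIMD lanes}}
\times \underbrace{2}_{\text{flops per FMA}}
= 204.8\ \text{Gflops}
```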
1.1 Mapping Parallelism to Asymmetric Parallel Architectures
Arguably, one of the most difficult problems programmers face while migrating to a new parallel architecture is the mapping of algorithms and data to the architecture. Accelerator-based multi-core processors complicate this problem in two ways. Firstly, because these processors introduce heterogeneous execution cores, the user needs to map each component of the application to the type of core that best matches the computational and memory bandwidth demands of that component. Secondly, because they provide multiple cores with embedded SIMD or multi-threading capabilities, the user needs to extract multiple dimensions of parallelism from the application and map each dimension to parallel execution units, so as to maximize performance.
Cell provides a motivating and timely example for the problem of mapping algorithmic
parallelism to modern multi-core architectures. The processor can exploit task and data par-
allelism, both across and within its cores. On accelerator-based multi-core architectures the
programmer must be aware of core heterogeneity, and carefully balance execution between the
host and accelerator cores. Furthermore, the programmer faces a seemingly vast number of
options for parallelizing code on these architectures. Functional and data decompositions of the
program can be implemented on both the host and the accelerator cores. Functional decom-
positions can be achieved by dividing functions between the hosts and the accelerators and by
off-loading functions from the hosts to the accelerators at runtime. Data decompositions are also
possible: through SIMDization on the vector units of the accelerator cores, through loop-level
parallelization across accelerators, or through a combination of loop-level parallelization across
accelerators and SIMDization within accelerators.
In this thesis we explore different approaches to automating the mapping of applications to
asymmetric parallel architectures. We explore both runtime and static approaches for combin-
ing and managing functional and data decomposition. We combine and orchestrate multiple
levels of parallelism inside an application in order to achieve both harmonious utilization
of all host and accelerator cores and the high memory bandwidth available on asymmetric
multi-core processors. Although we chose Cell as our case study, our scheduling algorithms
and decisions are general and can be applied to any asymmetric parallel architecture.
Chapter 2
Statement of Objectives
2.1 Dynamic Multigrain Parallelism
While many studies have focused on performance evaluation and optimization of
heterogeneous multi-core architectures [23, 31, 54, 63, 65, 74, 98], the optimal mapping of par-
allel applications to these architectures has not been investigated. In this thesis we explore
heterogeneous multi-core architectures from a different perspective, namely that of multigrain
parallelization. Asymmetric parallel architectures have a specific design: they can exploit
orthogonal dimensions of task and data parallelism on a single chip. The processor is controlled
by one or more host processing elements, which schedule the computation off-loaded to
accelerator processing units. The accelerators are usually SIMD processors and provide the bulk
of the processor's computational power. A general design of heterogeneous, accelerator-based
architectures is shown in Figure 2.1.
To simplify programming and improve efficiency on asymmetric parallel architectures, we
present a set of dynamic scheduling policies and the associated mechanisms. We introduce an
event-driven scheduler, EDTLP, which oversubscribes the host processing cores and exposes
dynamic parallelism across accelerators. We also propose MGPS, a scheduling module which
controls multi-grain parallelism on the fly to monotonically increase accelerator utilization.
[Figure 2.1 depicts HPU/LM units #1 through #NHP and APU/LM units #1 through #NAP, all connected through a shared memory / message interface.]

Figure 2.1: A hardware abstraction of an accelerator-based architecture. Host processing units (HPUs) supply coarse-grain parallel computation across accelerators. Accelerator processing units (APUs) are the main computation engines and may support internally finer-grain parallelism.
MGPS monitors the number of active accelerators used by off-loaded tasks over discrete inter-
vals of execution and makes a prediction on the best combination of dimensions and granularity
of parallelism to expose to the hardware. The purpose of these policies is to exploit the proper
layers and degrees of parallelism from the application, in order to maximize efficiency of the
processor’s computational cores. We explore the design and implementation of our schedul-
ing policies using two real-world scientific applications, RAxML [87] and PBPI [45]. RAxML
and PBPI are bioinformatics applications used to construct phylogenetic trees; we
describe them in more detail in Chapter 3.
One of the most efficient execution models on asymmetric parallel architectures, which
reduces idle time on the host processors as well as on the accelerators, is to oversubscribe
the host processor with multiple processes. In this approach, one or more accelerators are
assigned to each process for off-loading the expensive computation. Although the offloading
approach enables high utilization of the architecture, it also increases contention and the number
of context switches on the host processor, as well as the time necessary for a single context
switch to complete. To reduce the contention caused by context switching, and the idle time
that it causes on the accelerator cores, we designed and implemented a slack-minimizing
scheduler (SLED). In our case study, the SLED scheduler improves performance on the Cell
processor by up to 17%.
The study related to dynamic scheduling strategies makes the following contributions:
• We present a runtime system and scheduling policies that exploit polymorphic (task and
loop-level) parallelism on asymmetric parallel processors. Our runtime system is adap-
tive, in the sense that it chooses the form and degree of parallelism to expose to the
hardware, in response to workload characteristics. Since the right choice of form(s) and
degree(s) of parallelism depends non-trivially on workload characteristics and user input,
our runtime system lifts an important burden from the programmer.
• We show that dynamic multigrain parallelization is a necessary optimization for sustain-
ing maximum performance on asymmetric parallel architectures, since no static paral-
lelization scheme is able to achieve high accelerator efficiency in all cases.
• We present an event-driven multithreading execution engine, which achieves higher effi-
ciency on accelerators by oversubscribing the host core.
• We present a feedback-guided scheduling policy for dynamically triggering and throttling
loop-level parallelism across accelerators. We show that work-sharing of divisible tasks
across accelerators should be used when the event-driven multithreading engine leaves
more than half of the accelerators idle. We observe benefits from loop-level paralleliza-
tion of off-loaded tasks across accelerators. However, we also observe that loop-level
parallelism should be exposed only in conjunction with low-degree task-level parallelism.
• We present kernel-level extensions to our runtime system, which enable efficient pro-
cess scheduling when the host core is oversubscribed with multiple processes.
2.2 Rightsizing Multigrain Parallelism
When executing a multi-level parallel application on asymmetric parallel processors, perfor-
mance can be strongly affected by the execution configuration. In the case of RAxML running
on the Cell processor, depending on the runtime degree of each level of parallelism in the ap-
plication, the performance variation can be as high as 40%. To address the problem of determining
the optimal parallel configuration, we introduce a new runtime scheduler, S-MGPS, which
performs sampling and timing of the dominant phases in the application in order to determine
the most efficient mapping of the different levels of parallelism to the architecture. There are
several essential differences between S-MGPS and our previously introduced runtime scheduler,
MGPS. MGPS is a utilization-driven scheduler, which seeks the highest possible accelerator
utilization by exploiting additional layers of parallelism when some accelerator cores appear
underutilized. MGPS attempts to increase utilization by creating more accelerator tasks from
innermost layers of parallelism, more specifically, as many tasks as the number of idle acceler-
ators recorded during intervals of execution. S-MGPS is a scheduler which seeks the optimal
application-system configuration, in terms of layers of parallelism exposed to the hardware and
degree of granularity per layer of parallelism, based on runtime task throughput of the appli-
cation and regardless of system utilization. S-MGPS takes into account the cumulative effects
of contention and other system bottlenecks on software parallelism and can converge to the
best multi-grain parallel execution algorithm. MGPS on the other hand uses only informa-
tion on SPE utilization and may often converge to a suboptimal multi-grain parallel execution
algorithm. A further contribution of S-MGPS is that the scheduler is immune to the initial
configuration of parallelism in the application and uses a sampling method that is indepen-
dent of application-specific parameters and input. In contrast, the performance of MGPS is
sensitive to both the initial structure of parallelism in the application and the input.
Although the scientific codes we use in this thesis implement similar functionality, they
differ in their structure and parallelization strategies and raise different challenges for user-level
schedulers. We show that S-MGPS performs within 2% of the optimal scheduling algorithm in
PBPI and within 2%–10% of the optimal scheduling algorithm in RAxML. We also show that
S-MGPS adapts well to variation of the input size and granularity of parallelism, whereas the
performance of MGPS is sensitive to both these factors.
2.3 MMGP Model
The technique used by the S-MGPS scheduler might not be scalable to large, complex systems,
large applications, or applications with behavior that varies significantly with the input. The
execution time of a complex application is a function of many parameters. A given parallel
application may consist of N phases where each phase is affected differently by accelerators.
Each phase can exploit d dimensions of parallelism or any combination thereof such as ILP,
TLP, or both. Each phase or dimension of parallelism can use any of m different programming
and execution models such as message passing, shared memory, SIMD, or any combination
thereof. Accelerator availability or use may consist of c possible configurations, involving dif-
ferent numbers of accelerators. Exhaustive analysis of the execution time for all combinations
requires at least N × d×m× c trials with any given input.
Models of parallel computation have been instrumental in the adoption and use of parallel
systems. Unfortunately, commonly used models [24,35] are not directly portable to accelerator-
based systems. First, the heterogeneous processing common to these systems is not reflected
in most models of parallel computation. Second, current models do not capture the effects of
multi-grain parallelism. Third, few models account for the effects of using multiple program-
ming models in the same program. Parallel programming at multiple dimensions and with a
synthesis of models consumes both enormous amounts of programming effort and significant
amounts of execution time, if not handled with care. To overcome these deficits, we present a
model for multi-dimensional parallel computation on asymmetric multi-core processors. Con-
sidering that each dimension of parallelism reflects a different degree of computation granular-
ity, we name the model MMGP, for Model of Multi-Grain Parallelism.
MMGP is an analytical model which formalizes the process of programming accelerator-
based systems and reduces the need for exhaustive measurements. This thesis presents a
generalized MMGP model for accelerator-based architectures with one layer of host processor
parallelism and one layer of accelerator parallelism, followed by a specialization of this model
for the Cell Broadband Engine.
The input to MMGP is an explicitly parallel program, with parallelism expressed with
machine-independent abstractions, using common programming libraries and constructs. Upon
identification of a few key parameters of the application derived from micro-benchmarking and
profiling of a sequential run, MMGP predicts with reasonable accuracy the execution time with
all feasible mappings of the application to host processors and accelerators. MMGP is fast
and reasonably accurate; therefore, it can be used to quickly identify optimal operating points,
in terms of the exposed layers of parallelism and the degree of parallelism in each layer, on
accelerator-based systems. Experiments with two complete applications from the field of com-
putational phylogenetics on a shared-memory multiprocessor with single and multiple nodes
that contain the Cell BE, show that MMGP models parallel execution time of complex parallel
codes with multiple layers of task and data parallelism, with mean error in the range of 1%–6%,
across all feasible program configurations on the target system. Due to the narrow margin of
error, MMGP predicts accurately the optimal mapping of programs to cores for the cases we
have studied so far.
Chapter 3
Experimental Testbed
This chapter provides details on our experimental testbed, including the two applications that
we used to study user-level schedulers on the Cell BE (RAxML and PBPI) and the hardware
platform on which we conducted this research.
RAxML and PBPI are computational biology applications designed to construct phy-
logenetic trees. Phylogenetic trees are used to represent the evolutionary history of a set of n
organisms. An alignment of the DNA or AA sequences representing those n organisms (also
called taxa) can be used as input for the computation of phylogenetic trees. In a phylogeny,
the organisms of the input data set are located at the tips (leaves) of the tree, whereas the inner
nodes represent extinct common ancestors. The branches of the tree represent the time
required for one species to mutate into another. The generation of phylo-
genies with computational methods has many important applications in medical and biological
research (see [14] for a summary).
The fundamental algorithmic problem computational phylogeny faces is the immense
number of alternative tree topologies, which grows exponentially with the number of or-
ganisms n; for n = 50 organisms there exist 2.84 ∗ 10^76 alternative trees (the number of atoms
in the universe is ≈ 10^80). In fact, it has only recently been shown that the phylogeny problem is
NP-hard [34]. In addition, generating phylogenies is a very memory- and floating-point-intensive
process, such that the application of high-performance computing techniques, as well as the as-
sessment of new CPU architectures, can contribute significantly to the reconstruction of larger
and more accurate trees. The computation of the phylogenetic tree containing representatives
of all living beings on earth is still one of the grand challenges in Bioinformatics.
3.1 RAxML
RAxML-VI-HPC (v2.1.3) (Randomized Axelerated Maximum Likelihood version VI for High
Performance Computing) [87] is a program for large-scale ML-based (Maximum Likelihood
[43]) inference of phylogenetic (evolutionary) trees using multiple alignments of DNA or AA
(Amino Acid) sequences. The program is freely available as open source code at icwww.epfl.ch/˜stamatak.
The current version of RAxML incorporates a rapid hill climbing search algorithm. A re-
cent performance study [87] on real world datasets with ≥ 1,000 sequences reveals that it is
able to find better trees in less time and with lower memory consumption than other current ML
programs (IQPNNI, PHYML, GARLI). Moreover, RAxML-VI-HPC has been parallelized with
MPI (Message Passing Interface), to enable embarrassingly parallel non-parametric bootstrap-
ping and multiple inferences on distinct starting trees in order to search for the best-known ML
tree. Like every ML-based program, RAxML exhibits a source of fine-grained loop-level par-
allelism in the likelihood functions which consume over 90% of the overall computation time.
This source of parallelism scales well on large memory-intensive multi-gene alignments due to
increased cache efficiency.
The MPI version of RAxML is the basis of our Cell version of the code [20]. In RAxML
multiple inferences on the original alignment are required in order to determine the best-known
(best-scoring) ML tree (we use the term best-known because the problem is NP-hard). Fur-
thermore, bootstrap analyses are required to assign confidence values ranging between 0.0 and
1.0 to the internal branches of the best-known ML tree. This allows determining how well-
supported certain parts of the tree are and is important for the biological conclusions drawn
from it. All these individual tree searches, whether bootstrap replicates or multiple inferences, are completely
independent of each other and can thus be distributed with a simple master-worker MPI scheme.
Each search can further exploit data parallelism via thread-level parallelization of loops and/or
SIMDization.
3.2 PBPI
PBPI is based on Bayesian phylogenetic inference, which constructs phylogenetic trees from
DNA or AA sequences using the Markov Chain Monte Carlo (MCMC) sampling method. The
program is freely available as open source code at www.pbpi.org. The MCMC method is inher-
ently sequential: the state of each time step depends on previous time steps. Therefore,
PBPI uses the algorithmic improvements described below to achieve highly effi-
cient parallel inference of phylogenetic trees. PBPI exploits multi-grain parallelism to achieve
scalability on large-scale distributed memory systems, such as the IBM BlueGene/L [45]. The
algorithm of PBPI can be summarized as follows:
1. Partition the Markov chains into chain groups, and split the data set into segments along
the sequences.
2. Organize the virtual processors that execute the code into a two-dimensional grid; map
each chain group to a row on the grid and map each segment to a column on the grid.
3. During each generation, compute the partial likelihood across all columns and use all-to-
all communication to collect the complete likelihood values to all virtual processors on
the same row.
4. When there are multiple chains, randomly choose two chains for swapping using point-
to-point communication.
[Figure 3.1 depicts the PowerPC PPE, eight SPEs each with a local store (LS), the memory controller, and the I/O controller, all attached to the Element Interconnect Bus (EIB).]

Figure 3.1: Organization of Cell.
From a computational perspective, PBPI differs substantially from RAxML. While RAxML
is embarrassingly parallel, PBPI uses a predetermined virtual processor topology and a corre-
sponding data decomposition method. While the degree of task parallelism in RAxML may
vary considerably at runtime, PBPI exposes from the beginning of execution, a high-degree of
two-dimensional data parallelism to the runtime system. On the other hand, while the degree
of task parallelism can be controlled dynamically in RAxML without performance penalty, in
PBPI changing the degree of outermost data parallelism requires data redistribution and incurs
a high performance penalty.
3.3 Hardware Platform
The Cell BE is a heterogeneous multi-core processor which integrates a simultaneous multi-
threading PowerPC core (the Power Processing Element or PPE) and eight specialized accel-
erator cores (the Synergistic Processing Elements or SPEs) [40]. These elements are connected
in a ring topology on an on-chip network called the Element Interconnect Bus (EIB). The orga-
nization of Cell is illustrated in Figure 3.1.
The PPE is a 64-bit SMT processor running the PowerPC ISA, with vector/SIMD multime-
dia extensions [71]. The PPE has two levels of on-chip cache. The L1-I and L1-D caches of the
PPE have a capacity of 32 KB. The L2 cache of the PPE has a capacity of 512 KB.
Each SPE is a 128-bit vector processor with two major components: a Synergistic Processor
Unit (SPU) and a Memory Flow Controller (MFC). All instructions are executed on the SPU.
The SPU includes 128 registers, each 128 bits wide, and 256 KB of software-controlled local
storage. The SPU can fetch instructions and data only from its local storage and can write data
only to its local storage. The SPU implements a Cell-specific set of SIMD intrinsics. All single
precision floating point operations on the SPU are fully pipelined and the SPU can issue one
single-precision floating point operation per cycle. Double precision floating point operations
are partially pipelined and two double-precision floating point operations can be issued every six
cycles. Double-precision FP performance is therefore significantly lower than single-precision
FP performance. With all eight SPUs active, the Cell BE is capable of a peak performance of
21.03 Gflops for fully pipelined double-precision FP operations, and of 230.4 Gflops for
single-precision FP operations [33].
The SPE can access RAM through direct memory access (DMA) requests. DMA transfers
are handled by the MFC. All programs running on an SPE use the MFC to move data and
instructions between local storage and main memory. Data transferred between local storage
and main memory must be 128-bit aligned. The size of each DMA transfer can be at most 16
KB. DMA-lists can be used for transferring more than 16 KB of data. A list can have up to
2,048 DMA requests, each for up to 16 KB. The MFC supports only DMA transfer sizes that
are 1, 2, 4, 8 or multiples of 16 bytes long.
The EIB is an on-chip coherent bus that handles communication between the PPE, SPE,
main memory, and I/O devices. Physically, the EIB is a 4-ring structure, which can transmit 96
bytes per cycle, for a maximum theoretical memory bandwidth of 204.8 Gigabytes/second. The
EIB can support more than 100 outstanding DMA requests.
In this work we are using a Cell blade (IBM BladeCenter QS20) with two Cell BEs running
at 3.2 GHz, and 1GB of XDR RAM (512 MB per processor). The PPEs run Linux Fedora Core
6. We use IBM SDK 2.1 and LAM/MPI 7.1.3.
Chapter 4
Code Optimization Methodologies for Asymmetric Multi-core Systems with Explicitly Managed Memories
Accelerator-based architectures with explicitly managed memories have the advantage of achiev-
ing a high degree of communication-computation overlap. While this is a highly desirable
property in high-performance computing, it is also a significant drawback from the programmability
perspective: managing all memory accesses at the application level significantly increases the
complexity of the code. In our work, we investigate execution models that reduce
the complexity of code written for asymmetric architectures while still achieving the desired
performance and high utilization of the available architectural resources. We investigate the set of
optimizations with the most significant impact on the performance of scientific applications
executed on asymmetric architectures. As a case study, we investigate the optimization
process that enables efficient execution of RAxML and PBPI on the Cell architecture.
The results presented in this chapter indicate that RAxML and PBPI are highly optimized for
Cell, and they also motivate the discussion in the rest of the thesis. The Cell-specific optimiza-
tions applied to the two bioinformatics applications resulted in more than a twofold speedup. At
the same time, we show that despite being extensively optimized for sequential execution,
parallel applications still demand sophisticated scheduling support for efficient parallel execution
on heterogeneous multi-core platforms.
4.1 Porting and Optimizing RAxML on Cell
We ported RAxML to Cell in four steps:
1. We ported the MPI code to the PPE;
2. We offloaded the most time-consuming parts of each MPI process on the SPEs;
3. We optimized the SPE code using vectorization of floating point computation, vectoriza-
tion of control statements coupled with a specialized casting transformation, overlapping
of computation and communication (double buffering) and other communication opti-
mizations;
4. Lastly, we implemented multi-level parallelization schemes across and within SPEs in
selected cases, as well as a scheduler for effective simultaneous exploitation of task, loop,
and SIMD parallelism.
We outline optimizations 1–3 in the rest of this chapter. We focus on multi-level paralleliza-
tion, as well as the different scheduling policies, in Chapter 5.
4.2 Function Off-loading
We profiled the application using gprof to identify the computationally intensive functions
that could be candidates for offloading and optimization on SPEs. We used an IBM Power5
processor for profiling RAxML. For the profiling and benchmarking runs of RAxML presented
in this chapter, we used the input file 42_SC, which contains 42 organisms, each represented by
a DNA sequence of 1,167 nucleotides. The number of distinct data patterns in the DNA alignment
is on the order of 250.
On the IBM Power5, 98.77% of the total execution time is spent in three functions:
• 77.24% in newview() - which computes the partial likelihood vector [44] at an inner node
of the phylogenetic tree,
• 19.16% in makenewz() - which optimizes the length of a given branch with respect to the
tree likelihood using the Newton–Raphson method,
• 2.37% in evaluate() - which calculates the log likelihood score of the tree at a given branch
by summing over the partial likelihood vector entries.
These functions are the best candidates for offloading on SPEs.
The prerequisite for computing evaluate() and makenewz() is that the likelihood vectors
at the nodes of the phylogenetic tree that are right and left of the current branch have been
computed. Thus, makenewz() and evaluate() initially make calls to newview(), before they can
execute their own computation. The newview() function at an inner node p of a tree, calls itself
recursively when the two children r and q are not tips (leaves) and the likelihood array for r and
q has not already been computed. Consequently, the first candidate for offloading is newview().
Although makenewz() and evaluate() each account for a smaller portion of the execution time
than newview(), offloading these two functions results in a significant speedup (see Section 4.2.6).
Besides the fact that each function can be executed faster on an SPE, having all three functions
offloaded to an SPE significantly reduces the amount of PPE-SPE communication.
In order to have a function executed on an SPE, we spawn an SPE thread at the beginning
of each MPI process. The thread executes the offloaded function upon receiving a signal from
the PPE and returns the result back to the PPE upon completion. To avoid excessive overhead
from repeated thread spawning and joining, threads remain bound on SPEs and busy-wait for
the PPE signal, before starting to execute a function.
4.2.1 Optimizing Off-Loaded Functions
The discussion in this section refers to function newview(), which is the most computationally
expensive in the code. Table 4.1 summarizes the execution times of RAxML before and after
newview() is offloaded. The first column shows the number of workers (MPI processes) used
in the experiment and the amount of work (bootstraps) performed. The maximum number
                          (a) PPE only   (b) newview() offloaded
1 worker, 1 bootstrap          24.4s             45s
2 workers, 8 bootstraps       134.1s            201.9s
2 workers, 16 bootstraps      267.7s            401.7s
2 workers, 32 bootstraps      539s              805s

Table 4.1: Execution time of RAxML (in seconds). The input file is 42_SC. (a) The whole application is executed on the PPE; (b) newview() is offloaded on one SPE.
of workers we use is 2, since more workers would conflict on the PPE, which is a 2-way SMT
processor. Executing a small number of workers results in low SPE utilization (each worker uses
one SPE). In Section 4.3, we present results when the PPE is oversubscribed with up to 8 worker
processes.
As shown in Table 4.1, merely offloading newview() causes performance degradation. We
profiled the new version of the code in order to get a better understanding of the major bot-
tlenecks. Inside newview(), we identified three parts in which the function spends almost its entire
lifetime: the first part includes a large if(. . .) statement with a conjunction of four arithmetic
comparisons, used to check whether small likelihood vector entries need to be scaled to avoid numerical
underflow (similar checks are used in every ML implementation); the second time-consuming
part involves DMA transfers; the third includes the loops that perform the actual likelihood
vector calculation. In the next few sections we describe the techniques used to optimize these
parts of newview(). The same techniques were applied to the other offloaded
functions.
4.2.2 Vectorizing Conditional Statements
RAxML always invokes newview() at an inner node of the tree (p) which is at the root of a sub-
tree. The main computational kernel in newview() has a switch statement which selects one out
of four paths of execution. If one or both descendants (r and q) of p are tips (leaves), the com-
putations of the main loop in newview() can be simplified. This optimization leads to significant
performance improvements [87]. To activate the optimization, we use four implementations of
the main computational part of newview() for the case that r and q are tips, r is a tip, q is a tip,
or r and q are both inner nodes.
Each of the four execution paths in newview() leads to a distinct—highly optimized—
version of the loop which performs the actual likelihood vector calculations. Each iteration
of this loop executes the previously mentioned if() statement (Section 4.2.1), to check for like-
lihood scaling. Mis-predicted branches in the compiled code for this statement incur a penalty
of approximately 20 cycles [92]. We profiled newview() and found that 45% of the execution
time is spent in this particular conditional statement. Furthermore, almost all the time is spent
in checking the condition, while negligible time is spent in the body of code in the fall-through
part of the conditional statement. The problematic conditional statement is shown below. The
symbol ml is a constant and all operands are double precision floating point numbers.
if (ABS(x3->a) < ml && ABS(x3->g) < ml &&
    ABS(x3->c) < ml && ABS(x3->t) < ml) {
  . . .
}
This statement is a challenge for a branch predictor, since it implies 8 conditions, one for
each of the four ABS() macros and the four comparisons against the minimum likelihood value
constant (ml).
On an SPE, comparing integers can be significantly faster than comparing doubles, since
integer values can be compared using the SPE intrinsics. Although the current SPE intrinsics
support only comparison of 32-bit integer values, the comparison of 64-bit integers is also pos-
sible by combining different intrinsics that operate on the 32-bit integers. The current spu-gcc
compiler automatically optimizes an integer branch using the SPE intrinsics. To optimize the
problematic branches, we made the observation that integer comparison is faster than floating
1 worker, 1 bootstrap          32.5s
2 workers, 8 bootstraps       151.7s
2 workers, 16 bootstraps      302.7s
2 workers, 32 bootstraps      604s

Table 4.2: Execution time of RAxML after the floating-point conditional statement is transformed to an integer conditional statement and vectorized. The input file is 42_SC.
point comparison on an SPE. According to the IEEE standard, numbers represented in float and
double formats are “lexicographically ordered” [61], i.e., if two floating point numbers in the
same format are ordered, then they are ordered the same way when their bits are reinterpreted as
Sign-Magnitude integers [61]. In other words, instead of comparing two floating point numbers
we can interpret their bit pattern as integers, and do an integer comparison. The final outcome
of comparing the integer interpretation of two doubles (floats) will be the same as comparing
their floating point values, as long as one of the numbers is positive. In our case, all operands
are positive; consequently, instead of a floating point comparison we can perform an integer com-
parison.
To get the absolute value of a floating point number, we used the spu_and() logic intrinsic, which performs a vector bit-wise AND operation. With spu_and() we always clear the sign (left-most) bit of a floating point number. If the number is already positive, nothing changes, since the sign bit is already zero. In this way, we avoid using ABS(), which uses a conditional statement to check whether the operand is greater than or less than 0. After obtaining absolute
values of all the operands involved in the problematic if() statement, we cast each operand to
an unsigned long long value and perform the comparison. The optimized conditional statement
is presented in Figure 4.2.2. Following optimization of the offending conditional statement,
its contribution to execution time in newview() comes down to 6%, as opposed to 45% before
optimization. The total execution time (Table 4.2) improves by 25%–27%.
unsigned long long a[4];

a[0] = *(unsigned long long*)&x3->a & 0x7fffffffffffffffULL;
a[1] = *(unsigned long long*)&x3->c & 0x7fffffffffffffffULL;
a[2] = *(unsigned long long*)&x3->g & 0x7fffffffffffffffULL;
a[3] = *(unsigned long long*)&x3->t & 0x7fffffffffffffffULL;

if (a[0] < minli && a[1] < minli &&
    a[2] < minli && a[3] < minli) {
    . . .
}
4.2.3 Double Buffering and Memory Management
Depending on the size of the input alignment, the major calculation loop (the loop that performs
the calculation of the likelihood vector) in newview() can execute up to 50,000 iterations. The
number of iterations is directly related to the alignment length. The loop operates on large
arrays, and each member in the arrays is an instance of a likelihood vector structure, shown in
Figure 4.1 . The arrays are allocated dynamically at runtime. Since there is no limit on the
typedef struct likelihood_vector {
    double a, c, g, t;
    int exp;
} likelivector __attribute__((aligned(128)));

Figure 4.1: The likelihood vector structure is used in almost all memory traffic between main memory and the local storage of the SPEs. The structure is 128-byte aligned, as required by the Cell architecture.
size of these arrays, we are unable to keep all the members of the arrays in the local storage of
    1 worker, 1 bootstrap       31.1s
    2 workers, 8 bootstraps    145.4s
    2 workers, 16 bootstraps   290s
    2 workers, 32 bootstraps   582.6s

Table 4.3: Execution time of RAxML with double buffering applied to overlap DMA transfers with computation. The input file is 42 SC.
SPEs. Instead, we strip-mine the arrays, by fetching a few array elements to local storage at a
time, and execute the corresponding loop iterations on a batch of elements at a time. We use a
2 KByte buffer for caching likelihood vectors, which is enough to store the data needed for 16
loop iterations. It should be noted that the space used for buffers is much smaller than the size
of the local storage.
In the original code where SPEs wait for all DMA transfers, the idle time accounts for 11.4%
of execution time of newview(). We eliminated the waiting time by using double buffering to
overlap DMA transfers with computation. The total execution time of the application after
applying double buffering and tuning the data transfer size (set to 2 KBytes) is shown in Table
4.3.
4.2.4 Vectorization
All calculations in newview() are enclosed in two loops. The first loop has a small trip count
(typically 4–25 iterations) and computes the individual transition probability matrices (see Sec-
tion 4.2.1) for each distinct rate category of the CAT or Γ models of rate heterogeneity [86].
Each iteration executes 36 double precision floating point operations. The second loop com-
putes the likelihood vector. Typically, the second loop has a large trip count, which depends on
the number of distinct data patterns in the data alignment. For the 42 SC input file, the second
loop has 228 iterations and executes 44 double precision floating point operations per iteration.
Each SPE on the Cell is capable of exploiting data parallelism via vectorization. The SPE vector
registers can store two double precision floating point elements. We vectorized the two loops in
newview() using these registers.
The kernel of the first loop in newview() is shown in Figure 4.2(a); Figure 4.2(b) shows the same code vectorized for the SPE.

(a)
for( ... ){
    ki = *rptr++;
    d1c = exp(ki * lz10);
    d1g = exp(ki * lz11);
    d1t = exp(ki * lz12);
    *left++ = d1c * *EV++;
    *left++ = d1g * *EV++;
    *left++ = d1t * *EV++;
    *left++ = d1c * *EV++;
    *left++ = d1g * *EV++;
    *left++ = d1t * *EV++;
    . . .
}

(b)
1: vector double *left_v = (vector double*)left;
2: vector double lz1011 = (vector double)(lz10,lz11);
   . . .
for( ... ){
3:  ki_v = spu_splats(*rptr++);
4:  d1cg = _exp_v( spu_mul(ki_v,lz1011) );
    d1tc = _exp_v( spu_mul(ki_v,lz1210) );
    d1gt = _exp_v( spu_mul(ki_v,lz1112) );
    left_v[0] = spu_mul(d1cg,EV_v[0]);
    left_v[1] = spu_mul(d1tc,EV_v[1]);
    left_v[2] = spu_mul(d1gt,EV_v[2]);
    . . .
}

Figure 4.2: The body of the first loop in newview(): (a) non-vectorized code, (b) vectorized code.

For better understanding of the vectorized code, we
briefly describe the SPE vector instructions we used:
• Instruction labeled 1 creates a vector pointer to an array consisting of double elements.
• Instruction labeled 2 joins two double elements, lz10 and lz11, into a single vector
element.
• Instruction labeled 3 creates a vector from a single double element.
• Instruction labeled 4 is a composition of 2 different vector instructions:
1. spu_mul() multiplies two vectors (in this case the arguments are vectors of doubles).

2. _exp_v() is the vector version of the exponential function.

for( . . . ){
    ump_x1_0 = x1->a;
    ump_x1_0 += x1->c * *left++;
    ump_x1_0 += x1->g * *left++;
    ump_x1_0 += x1->t * *left++;
    ump_x1_1 = x1->a;
    ump_x1_1 += x1->c * *left++;
    ump_x1_1 += x1->g * *left++;
    ump_x1_1 += x1->t * *left++;
    . . .
}

for( . . . ){
    a_v = spu_splats(x1->a);
    c_v = spu_splats(x1->c);
    g_v = spu_splats(x1->g);
    t_v = spu_splats(x1->t);
    l1 = (vector double)(left[0],left[3]);
    l2 = (vector double)(left[1],left[4]);
    l3 = (vector double)(left[2],left[5]);
    ump_v1[0] = spu_madd(c_v,l1,a_v);
    ump_v1[0] = spu_madd(g_v,l2,ump_v1[0]);
    ump_v1[0] = spu_madd(t_v,l3,ump_v1[0]);
    . . .
}

Figure 4.3: The second loop in newview(). Non-vectorized code shown on the left, vectorized code shown on the right. spu_madd() multiplies the first two arguments and adds the result to the third argument. spu_splats() creates a vector by replicating a scalar element.
After vectorization, the number of floating point instructions executed in the body of the first loop is 24, plus one additional instruction for creating a vector from a scalar element.
Note that due to involved pointer arithmetic on dynamically allocated data structures, automatic
vectorization of this code would be particularly challenging for a compiler.
Figure 4.3 illustrates the second loop (showing a few selected instructions which dominate
execution time in the loop). The variables x1->a, x1->c, x1->g, and x1->t belong to the same
C structure (likelihood vector) and occupy contiguous memory locations. Only three of these
variables are multiplied by the elements of the array left[ ]. This makes vectorization more dif-
ficult, since the code requires vector construction instructions such as spu_splats(). Obviously,
there are many different possibilities for vectorizing this code. The scheme shown in Figure 4.3
    1 worker, 1 bootstrap       27.8s
    2 workers, 8 bootstraps    132.3s
    2 workers, 16 bootstraps   265.2s
    2 workers, 32 bootstraps   527s

Table 4.4: Execution time of RAxML following vectorization. The input file is 42 SC.
is the one that achieved the best performance in our tests. After vectorization, the number of floating point instructions in the body of the loops drops from 36 to 24 for the first loop, and from 44 to 22 for the second loop. Vectorization adds 25 instructions for creating vectors.
Without vectorization, newview() spends 69.4% of its execution time in the two loops. Fol-
lowing vectorization, the time spent in loops drops to 57% of the execution time of newview().
Table 4.4 shows execution times following vectorization.
4.2.5 PPE-SPE Communication
Although newview() accounts for most of the execution time, its granularity is fine and its con-
tribution to execution time is attributed to the large number of invocations. For the 42 SC input,
newview() is invoked 230,500 times and the average execution time per invocation is 71µs. In
order to invoke an offloaded function, the PPE needs to send a signal to an SPE. Also, after an
offloaded function completes, it sends the result back to the PPE.
In an early implementation of RAxML, we used mailboxes to implement the communica-
tion between the PPE and SPEs. We observed that PPE-SPE communication can be significantly
improved if it is performed through main memory and SPE local storage instead of mailboxes.
Using memory-to-memory communication improves execution time by 5%–6.4%. Table 4.5
shows RAxML execution times, including all optimizations discussed so far and direct memory
to memory communication, for the 42 SC input. It is interesting to note that direct memory-
    1 worker, 1 bootstrap       26.4s
    2 workers, 8 bootstraps    123.3s
    2 workers, 16 bootstraps   246.8s
    2 workers, 32 bootstraps   493.3s

Table 4.5: Execution time of RAxML following the optimization of communication to use direct memory-to-memory transfers. The input file is 42 SC.
to-memory communication is an optimization which scales with parallelism on Cell, i.e. its
performance impact grows as the code uses more SPEs. As the number of workers and boot-
straps executed on the SPEs increases, the code becomes more communication-intensive, due
to the fine granularity of the offloaded functions.
4.2.6 Increasing the Coverage of Offloading
In addition to newview(), we offloaded makenewz() and evaluate(). All three offloaded functions
were packaged in a single code module loaded on the SPEs. The advantage of using a single
module is that it can be loaded to the local storage once when an SPE thread is created and
remain pinned in local storage for the rest of the execution. Therefore, the cost of loading the
code on SPEs is amortized and communication between the PPE and SPEs is reduced. For
example, when newview() is called by makenewz() or evaluate(), there is no need for any PPE-
SPE communication, since all functions already reside in SPE local storage.
Offloading all three critical functions improves performance by a further 25%–31%. A
more important implication is that after offloading and optimization of all three functions, the
RAxML code split between the PPE and one SPE becomes actually faster than the sequential
code executed exclusively on the PPE, by as much as 19%. Function offloading is another
optimization which scales with parallelism. When more than one MPI process is used and more than one bootstrap is offloaded to the SPEs by each process, the gains from offloading rise to 36%. Table 4.6 illustrates execution times after full function offloading.
    1 worker, 1 bootstrap       19.8s
    2 workers, 8 bootstraps     86.8s
    2 workers, 16 bootstraps   173s
    2 workers, 32 bootstraps   344.4s

Table 4.6: Execution time of RAxML after offloading and optimizing three functions: newview(), makenewz() and evaluate(). The input file is 42 SC.
[Figure: two plots of execution time (s) versus number of processes (1-8), one for RAxML and one for PBPI.]
Figure 4.4: Performance of (a) RAxML and (b) PBPI with different numbers of MPI processes.
4.3 Parallel Execution
After improving the performance of RAxML and PBPI using the presented optimization tech-
niques, we investigated parallel execution of both applications on the Cell processor. To achieve
higher utilization of the Cell chip, we oversubscribed the PPE with different numbers of MPI processes (2-8) and assigned a single SPE to each MPI process. The execution time of the different parallel configurations is presented in Figure 4.4. In these experiments we use weak scaling, i.e., the total computation grows with the number of processes.
In Figure 4.4(a) we observe that for any number of processes larger than two, the execution
time of RAxML remains constant. Two factors are responsible for this behavior:

1. On-chip contention, as well as bus and memory contention, which occur on the PPE side when the PPE is oversubscribed by multiple processes;
2. The Linux kernel is oblivious to off-loading, which results in poor scheduling decisions. Each process following the off-loading execution model constantly alternates execution between the PPE and an SPE. Unaware of this alternation, the OS allows processes to retain control over resources they are not actually using. In other words, the PPE might be assigned to a process whose execution is currently switched to an SPE.
In the case of PBPI, Figure 4.4(b), we observe performance similar to that of RAxML. From the presented experiments it is clear that naive parallelization, where the PPE is simply oversubscribed with multiple processes, does not provide satisfactory performance. The poor scaling of the applications is a strong motivation for a detailed exploration of different parallel programming models, as well as scheduling policies, for asymmetric processors. We continue the discussion of parallel execution on heterogeneous architectures in Chapter 5.
4.4 Chapter Summary
In this chapter we presented a set of optimizations which enable efficient sequential execution of scientific applications on asymmetric platforms. We exploited the fact that our test applications contain large computational functions (loops) which consume the majority of the execution time. Nevertheless, this assumption does not reduce the generality of the presented techniques, since large, time-consuming computational loops are common in most scientific codes.
We explored a total of five optimizations and their performance implications: I) off-loading the bulk of the maximum likelihood tree calculation to the accelerators; II) casting and vectorization of expensive conditional statements involving multiple, hard-to-predict conditions; III) double buffering for overlapping memory communication with computation; IV) vectorization of the core of the floating point computation; V) optimization of communication between the host core and accelerators using direct memory-to-memory transfers.
In our case study, starting from an optimized version of RAxML and PBPI for conventional
uniprocessors and multiprocessors, we were able to boost performance on the Cell processor by
more than a factor of two.
Chapter 5
Scheduling Multigrain Parallelism on Asymmetric Systems
5.1 Introduction
In this chapter, we investigate runtime scheduling policies for mapping different layers of par-
allelism, exposed by an application, to the Cell processor. We assume that applications describe
all available algorithmic parallelism to the runtime system explicitly, while the runtime system
dynamically selects the degree of granularity and the dimensions of parallelism to expose to the
hardware at runtime, using dynamic scheduling mechanisms and policies. In other words, the
runtime system is responsible for partitioning algorithmic parallelism in layers that best match
the diverse capabilities of the processor cores, while at the same time rightsizing the granularity
of parallelism in each layer.
5.2 Scheduling Multi-Grain Parallelism on Cell
We hereby explore the possibilities for exploiting multi-grain parallelism on Cell. The Cell PPE can execute two threads or processes simultaneously, from which parts of the code can be off-loaded and executed on SPEs. To increase the sources of parallelism for the SPEs, the user may consider two approaches:
• The user may oversubscribe the PPE with more processes or threads than the number of
processes/threads that the PPE can execute simultaneously. In other words, the program-
mer attempts to find more parallelism to off-load to accelerators, by attempting a more
fine-grain task decomposition of the code. In this case, the runtime system needs to sched-
ule the host processes/threads so as to minimize the idle time on the host core while the
computation is off-loaded to accelerators. We present an event-driven task-level scheduler
(EDTLP) which achieves this goal in Section 5.2.1.
• The user can introduce a new dimension of parallelism to the application by distributing
loops from within the off-loaded functions across multiple SPEs. In other words, the
user can exploit data parallelism both within and across accelerators. Each SPE can work
on a part of a distributed loop, which can be further accelerated with SIMDization. We
present case studies that motivate the dynamic extraction of multi-grain parallelism via
loop distribution in Section 5.2.2.
5.2.1 Event-Driven Task Scheduling
EDTLP is a runtime scheduling module which can be embedded transparently in MPI codes.
The EDTLP scheduler operates under the assumption that the code to off-load to accelerators
is specified by the user at the level of functions. In the case of Cell, this means that the user
has either constructed SPE threads in a separate code module, or annotated the host PPE code
with directives to extract SPE threads via a compiler [17]. The EDTLP scheduler avoids un-
derutilization of SPEs by oversubscribing the PPE and preventing a single MPI process from
monopolizing the PPE.
Informally, the EDTLP scheduler off-loads tasks from MPI processes. A task ready for off-
loading serves as an event trigger for the scheduler. Upon the event occurrence, the scheduler
immediately attempts to serve the MPI process that carries the task to off-load and sends the
task to an available SPE, if any. While off-loading a task, the scheduler suspends the MPI
process that spawned the task and switches to another MPI process, anticipating that more tasks
will be available for off-loading from ready-to-run MPI processes. Switching upon off-loading
prevents MPI processes from blocking the PPE while waiting for their tasks to return. The
scheduler attempts to sustain a high supply of tasks for off-loading to SPEs by serving MPI
processes round-robin.
The downside of a scheduler based on oversubscribing a processor is context-switching
overhead. Cell in particular also suffers from the problem of interference between processes
or threads sharing the SMT PPE core. The granularity of the off-loaded code determines if the
overhead introduced by oversubscribing the PPE can be tolerated. The code off-loaded to an
SPE should be coarse enough to marginalize the overhead of context switching performed on
the PPE. The EDTLP scheduler addresses this issue by performing granularity control of the
off-loaded tasks and preventing off-loading of code that does not meet a minimum granularity
threshold.
Figure 5.1 illustrates an example of the difference between scheduling MPI processes with
the EDTLP scheduler and the native Linux scheduler. In this example, each MPI process has
one task to off-load to SPEs. For illustrative purposes only, we assume that there are only 4
SPEs on the chip. In Figure 5.1(a), once a task is sent to an SPE, the scheduler forces a context
switch on the PPE. Since the PPE is a two-way SMT, two MPI processes can simultaneously
off-load tasks to two SPEs. The EDTLP scheduler enables the use of four SPEs via function off-
loading. On the contrary, if the scheduler waits for the completion of a task before providing
an opportunity to another MPI process to off-load (Figure 5.1 (b)), the application can only
utilize two SPEs. Realistic application tasks often have significantly shorter lengths than the time quanta used by the Linux scheduler. For example, in RAxML, task lengths are on the order of tens of microseconds, while Linux time quanta are on the order of tens of milliseconds.
Table 5.1(a) compares the performance of the EDTLP scheduler to that of the native Linux
scheduler, using RAxML and running a workload comprising 42 organisms. In this experiment,
the number of performed bootstraps is not constant; it is equal to the number of MPI processes. The EDTLP scheduler outperforms the Linux scheduler by up to a factor of 2.7. In the
(a) (b)
Figure 5.1: Scheduler behavior for two off-loaded tasks, representative of RAxML. Case (a)illustrates the behavior of the EDTLP scheduler. Case (b) illustrates the behavior of the Linuxscheduler with the same workload. The numbers correspond to MPI processes. The shadedslots indicate context switching. The example assumes a Cell-like system with four SPEs.
experiment with PBPI, Table 5.1(b), we execute the code with one Markov chain for 20,000
generations and we change the number of MPI processes used across runs. PBPI is also exe-
cuted with weak scaling, i.e. we increase the size of the DNA alignment with the number of
processes. The workload for PBPI includes 107 organisms. EDTLP outperforms the Linux
scheduler policy in PBPI by up to a factor of 2.7.
5.2.2 Scheduling Loop-Level Parallelism
The EDTLP model described in Section 5.2 is effective if the PPE has enough coarse-grained
functions to off-load to SPEs. In cases where the degree of available task parallelism is less
than the number of SPEs, the runtime system can activate a second layer of parallelism, by
splitting an already off-loaded task across multiple SPEs. We implemented runtime support
for parallelization of for-loops enclosed within off-loaded SPE functions. We parallelize loops
in off-loaded functions using work-sharing constructs similar to those found in OpenMP. In
RAxML, all for-loops in the three off-loaded functions have no loop-carried dependencies, and
obtain speedup from parallelization, assuming that there are enough idle SPEs dedicated to their
execution. The number of SPEs activated for work-sharing is user- or system-controlled, as in
                              EDTLP    Linux
    1 worker, 1 bootstrap     19.7s    19.7s
    2 workers, 2 bootstraps   22.2s    30s
    3 workers, 3 bootstraps   26s      40.7s
    4 workers, 4 bootstraps   28.1s    43.3s
    5 workers, 5 bootstraps   33s      60.7s
    6 workers, 6 bootstraps   34s      61.8s
    7 workers, 7 bootstraps   38.8s    81.2s
    8 workers, 8 bootstraps   39.8s    81.7s

(a)

                              EDTLP    Linux
    1 worker, 20,000 gen.     27.77s   27.54s
    2 workers, 20,000 gen.    30.2s    30s
    3 workers, 20,000 gen.    31.92s   56.16s
    4 workers, 20,000 gen.    36.4s    63.7s
    5 workers, 20,000 gen.    40.12s   93.71s
    6 workers, 20,000 gen.    41.48s   93s
    7 workers, 20,000 gen.    53.93s   144.81s
    8 workers, 20,000 gen.    52.64s   135.92s

(b)

Table 5.1: Performance comparison for (a) RAxML and (b) PBPI with two schedulers. The second column shows execution time with the EDTLP scheduler. The third column shows execution time with the native Linux kernel scheduler. The workload for RAxML contains 42 organisms. The workload for PBPI contains 107 organisms.
OpenMP. We discuss dynamic system-level control of loop parallelism further in Section 5.3.
The parallelization scheme is outlined in Figure 5.2. The program is executed on the PPE
until the execution reaches the parallel loop to be off-loaded. At that point the PPE sends a
signal to a single SPE which is designated as the master. The signal is processed by the master
and further broadcasted to all workers involved in parallelization. Upon a signal reception,
each SPE worker fetches the data necessary for loop execution. We ensure that SPEs work
on different parts of the loop and do not overlap by assigning a unique identifier to each SPE
thread involved in parallelization of the loop. Global data, changed by any of the SPEs during
loop execution, is committed to main memory at the end of each iteration. After processing
the assigned parts of the loop, the SPE workers send a notification back to the master. If the
loop includes a reduction, the master also collects partial results from the SPEs and accumulates
them locally. All communication between SPEs is performed on chip in order to avoid the long
latency of communicating through shared memory.
Note that in our loop parallelization scheme on Cell, all work performed by the master SPE
can also be performed by the PPE. In this case, the PPE would broadcast a signal to all SPE
threads involved in loop parallelization and the partial results calculated by SPEs would be
accumulated back at the PPE. Such collective operations increase the frequency of SPE-PPE
communication, especially when the distributed loop is a nested loop. In the case of RAxML,
in order to reduce SPE-PPE communication and avoid unnecessary invocation of the MPI pro-
cess that spawned the parallelized loop, we opted to use an SPE to distribute loops to other
SPEs and collect the results from other SPEs. In PBPI, we let the PPE execute the master
thread during loop parallelization, since loops are coarse enough to overshadow the loop exe-
cution overhead. Optimizing and selecting between these loop execution schemes is a subject
of ongoing research.
SPE threads participating in loop parallelization are created once upon off-loading the code
for the first parallel loop to SPEs. The threads remain active and pinned to the same SPEs during
the entire program execution, unless the scheduler decides to change the parallelization strategy
and redistribute the SPEs between one or more concurrently executing parallel loops. Pinned
SPE threads can run multiple off-loaded loop bodies, as long as the code of these loop bodies
fits on the local storage of the SPEs. If the loop parallelization strategy is changed on the fly by
the runtime system, a new code module with loop bodies that implement the new parallelization
strategy is loaded on the local storage of the SPEs.
Table 5.2 illustrates the performance of the basic loop-level parallelization scheme of our
runtime system in RAxML. Table 5.2(a) illustrates the execution time of RAxML using one
MPI process and performing one bootstrap, on a data set which comprises 42 organisms. This
[Figure: with x total iterations, the master SPE sends start signals to Worker1 through Worker7; the master executes iterations 1 to x/8, Worker1 executes x/8 to x/4, and so on up to Worker7, which executes 7x/8 to x; each worker then sends a stop signal back to the master.]
Figure 5.2: Parallelizing a loop across SPEs using a work-sharing model with an SPE designated as the master.
experiment isolates the impact of our loop-level parallelization mechanisms on Cell. The num-
ber of iterations in parallelized loops depends on the size of the input alignment in RAxML. For
the given data set, each parallel loop executes 228 iterations.
The results shown in Table 5.2(a) suggest that when using loop-level parallelism RAxML
sees a reasonable yet limited performance improvement. The highest speedup (1.72) is achieved
with 7 SPEs. The reasons for the modest speedup are the non-optimal coverage of loop-level parallelism (more specifically, less than 90% of the original sequential code is covered by parallelized loops), the fine granularity of the loops, and the fact that most loops have reductions, which create bottlenecks on the Cell DMA engine. The performance degradation that occurs when 5 or 6 SPEs are used is caused by specific memory alignment constraints that have to be met on the SPEs. Due to these constraints, it is sometimes impossible to evenly distribute the data used in the loop body, and therefore the workload of iterations, between SPEs. More specifically, the use of character arrays for the main data set in RAxML
    1 worker, 1 boot., no LLP                19.7s
    1 worker, 1 boot., 2 SPEs used for LLP   14s
    1 worker, 1 boot., 3 SPEs used for LLP   13.36s
    1 worker, 1 boot., 4 SPEs used for LLP   12.8s
    1 worker, 1 boot., 5 SPEs used for LLP   13.8s
    1 worker, 1 boot., 6 SPEs used for LLP   12.47s
    1 worker, 1 boot., 7 SPEs used for LLP   11.4s
    1 worker, 1 boot., 8 SPEs used for LLP   11.44s

(a)

    1 worker, 1 boot., no LLP                47.9s
    1 worker, 1 boot., 2 SPEs used for LLP   29.5s
    1 worker, 1 boot., 3 SPEs used for LLP   23.3s
    1 worker, 1 boot., 4 SPEs used for LLP   20.5s
    1 worker, 1 boot., 5 SPEs used for LLP   18.7s
    1 worker, 1 boot., 6 SPEs used for LLP   18.1s
    1 worker, 1 boot., 7 SPEs used for LLP   17.1s
    1 worker, 1 boot., 8 SPEs used for LLP   16.8s

(b)

Table 5.2: Execution time of RAxML when loop-level parallelism (LLP) is exploited in one bootstrap, via work distribution between SPEs. The input file is 42 SC: (a) DNA sequences are represented with 10,000 nucleotides, (b) DNA sequences are represented with 20,000 nucleotides.
forces array transfers in multiples of 16 array elements. Consequently, loop distribution across
processors is done with a minimum chunk size of 16 iterations.
Loop-level parallelization in RAxML can achieve higher speedup in a single bootstrap with
larger input data sets. Alignments that have a larger number of nucleotides per organism have
more loop iterations to distribute across SPEs. To illustrate the behavior of loop-level paral-
lelization with coarser loops, we repeated the previous experiment using a data set where the
DNA sequences are represented with 20,000 nucleotides. The results are shown in Table 5.2(b).
The performance of the loop-level parallelization scheme always increases with the number of
SPEs in this experiment.
PBPI exhibits clearly better scalability than RAxML with LLP, since the granularity of loops
is coarser in PBPI than RAxML. Table 5.3 illustrates the execution times when PBPI is executed
with a variable number of SPEs used for LLP. Again, we control the granularity of the off-loaded
code by using different data sets: Table 5.3(a) shows execution times for a data set that contains
107 organisms, each represented by a DNA sequence of 3,000 nucleotides. Table 5.3(b) shows
execution times for a data set that contains 107 organisms, each represented by a DNA sequence
of 10,000 nucleotides. We run PBPI with one Markov chain for 20,000 generations. For the
two data sets, PBPI achieves a maximum speedup of 4.6 and 6.1 respectively, after loop-level
parallelization.
    1 worker, 1,000 gen., no LLP                27.2s
    1 worker, 1,000 gen., 2 SPEs used for LLP   14.9s
    1 worker, 1,000 gen., 3 SPEs used for LLP   11.3s
    1 worker, 1,000 gen., 4 SPEs used for LLP   8.4s
    1 worker, 1,000 gen., 5 SPEs used for LLP   7.3s
    1 worker, 1,000 gen., 6 SPEs used for LLP   6.8s
    1 worker, 1,000 gen., 7 SPEs used for LLP   6.2s
    1 worker, 1,000 gen., 8 SPEs used for LLP   5.9s

(a)

    1 worker, 20,000 gen., no LLP         262s
    1 worker, 20,000 gen., 2 SPEs used    131.3s
    1 worker, 20,000 gen., 3 SPEs used    92.3s
    1 worker, 20,000 gen., 4 SPEs used    70.1s
    1 worker, 20,000 gen., 5 SPEs used    58.1s
    1 worker, 20,000 gen., 6 SPEs used    49s
    1 worker, 20,000 gen., 7 SPEs used    43s
    1 worker, 20,000 gen., 8 SPEs used    39.7s

(b)

Table 5.3: Execution time of PBPI when loop-level parallelism (LLP) is exploited via work distribution between SPEs. The input file is 107 SC: (a) DNA sequences are represented with 1,000 nucleotides, (b) DNA sequences are represented with 10,000 nucleotides.
struct Pass {
    volatile unsigned int v1_ad;
    volatile unsigned int v2_ad;
    //...arguments for loop body
    volatile unsigned int vn_ad;
    volatile double res;
    volatile int sig[2];
} __attribute__((aligned(128)));

Figure 5.3: The data structure Pass is used for communication among SPEs. The vi_ad variables are used to pass input arguments for the loop body from one local storage to another. The variable sig is used as a notification signal that the memory transfer for the shared data updated during the loop is completed. The variable res is used to send results back to the master SPE, and as a dependence resolution mechanism.
5.2.3 Implementing Loop-Level Parallelism
The SPE threads participating in loop work-sharing constructs are created once upon function
off-loading. Communication among SPEs participating in work-sharing constructs is imple-
mented using DMA transfers and the communication structure Pass, depicted in Figure 5.3.
The Pass structure is private to each thread. The master SPE thread allocates an array of
Pass structures. Each member of this array is used for communication with an SPE worker
thread. Once the SPE threads are created, they exchange the local addresses of their Pass
structures. This address exchange is performed through the PPE. Whenever one thread needs
to send a signal to a thread on another SPE, it issues an mfc_put() request and sets the
destination address to be the address of the Pass structure of the recipient.
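The handshake built on the Pass structure can be sketched in plain C. In this single-process sketch the DMA transfers (mfc_put()) are replaced by ordinary stores into a shared Pass array, so only the signaling and accumulation logic is shown; the helper names master_start, worker_run and master_collect are illustrative, not part of RAxML.

```c
#include <assert.h>

#define NUM_SPE 4

struct Pass {
    volatile unsigned int v1_ad;   /* argument addresses for the loop body */
    volatile double res;           /* result sent back to the master */
    volatile int sig[2];           /* sig[0]: start signal, sig[1]: done signal */
};

static struct Pass pass[NUM_SPE];

/* Master: raise the start signal for every worker (stands in for mfc_put()). */
static void master_start(void) {
    for (int i = 0; i < NUM_SPE; i++)
        pass[i].sig[0] = 1;
}

/* Worker i: wait for the start signal, run its loop portion, publish the result. */
static void worker_run(int i, double partial) {
    while (pass[i].sig[0] == 0)
        ;                          /* poll the local Pass copy */
    pass[i].res = partial;
    pass[i].sig[1] = 1;            /* notify the master */
}

/* Master: wait for each worker's done signal and accumulate the results. */
static double master_collect(double own_sum) {
    for (int i = 0; i < NUM_SPE; i++) {
        while (pass[i].sig[1] == 0)
            ;
        pass[i].sig[1] = 0;
        own_sum += pass[i].res;
    }
    return own_sum;
}
```

On the actual hardware each store into a remote Pass structure is a DMA transfer between local storages, and the polling loops run on independent SPEs rather than sequentially.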
In Figure 5.4, we illustrate a RAxML loop parallelized with work-sharing among SPE
threads. Before executing the loop, the master thread sets the parameters of the Pass struc-
ture for each worker SPE and issues one mfc_put() request per worker. This is done in
send_to_spe(). Worker i uses the parameters of the received Pass structure and fetches
the data needed for the loop execution to its local storage (function fetch_data()). After
finishing the execution of its portion of the loop, a worker sets the res parameter in the local
copy of the structure Pass and sends it to the master, using send_to_master(). The master
accumulates the results from all workers and commits the sum to main memory.
Immediately after calling send to spe(), the master participates in the execution of the
loop. The master tends to have a slight head start over the workers. The workers need to
complete several DMA requests before they can start executing the loop, in order to fetch the
required data from the master’s local storage or shared memory. In fine-grained off-loaded
functions such as those encountered in RAxML, load imbalance between the master and the
workers is noticeable. To achieve better load balancing, we set the master to execute a slightly
larger portion of the loop. A fully automated and adaptive implementation of this purposeful
load unbalancing is obtained by timing idle periods in the SPEs across multiple invocations of
the same loop. The collected times are used for tuning iteration distribution in each invocation,
in order to reduce idle time on SPEs.
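The purposeful load unbalancing described above can be sketched as a small tuner. The function below (an illustrative name, not from the runtime system) splits n loop iterations between one master and nw workers, shifting extra iterations to the master in proportion to the workers' measured idle time, expressed as idle_iters, i.e. the idle time divided by the cost of one iteration.

```c
#include <assert.h>

/* Hypothetical tuner: split n iterations among one master and nw workers,
 * giving the master extra iterations to cover the workers' observed idle
 * time.  Returns the master's share; each worker receives
 * (n - master_share) / nw iterations (the division remainder also goes
 * to the master, keeping worker chunks equal). */
static int master_share(int n, int nw, int idle_iters) {
    int base = n / (nw + 1);           /* even split as a starting point */
    int m = base + idle_iters;         /* shift work toward the master   */
    if (m > n) m = n;
    int w = (n - m) / nw;              /* equal worker chunk size        */
    return n - w * nw;
}
```

With no observed idle time the split degenerates to an even distribution; as idle time accumulates across invocations of the same loop, the master's share grows and the workers' DMA start-up cost is hidden.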
5.3 Dynamic Scheduling of Task- and Loop-Level Parallelism
Merging task-level and loop-level parallelism on Cell can improve the utilization of acceler-
ators. A non-trivial problem with such a hybrid parallelization scheme is the assignment of
accelerators to tasks. The optimal assignment is largely application-specific, task-specific and
input-specific. We support this argument using RAxML as an example. The discussion in this
section is limited to RAxML, where the degree of outermost parallelism can be changed ar-
bitrarily by varying the number of MPI processes executing bootstraps, with a small impact
on performance. PBPI uses a data decomposition approach which depends on the number of
processors, therefore dynamically varying the number of MPI processes executing the code at
runtime cannot be accomplished without data redistribution.
Master SPE:

    struct Pass pass[Num_SPE];

    for (i = 0; i < Num_SPE; i++) {
        pass[i].sig[0] = 1;
        ...
        send_to_spe(i, &pass[i]);
    }

    /* Parallelized loop */
    for ( ... ) {
        . . .
    }
    tr->likeli = sum;

    for (i = 0; i < Num_SPE; i++) {
        while (pass[i].sig[1] == 0);
        pass[i].sig[1] = 0;
        tr->likeli += pass[i].res;
    }

    commit(tr->likeli);

Worker SPE:

    struct Pass pass;

    while (pass.sig[0] == 0);
    fetch_data();

    /* Parallelized loop */
    for ( ... ) {
        . . .
    }
    tr->likeli = sum;
    pass.res = sum;
    pass.sig[1] = 1;
    send_to_master(&pass);
Figure 5.4: Parallelization of the loop from function evaluate() in RAxML. The left side depicts the code executed by the master SPE, while the right side depicts the code executed by a worker SPE. Num_SPE represents the number of SPE worker threads.
5.3.1 Application-Specific Hybrid Parallelization on Cell
We present a set of experiments with RAxML performing a number of bootstraps ranging be-
tween 1 and 128. In these experiments we use three versions of RAxML. Two of the three ver-
sions use hybrid parallelization models combining task- and loop-level parallelism. The third
version exploits only task-level parallelism and uses the EDTLP scheduler. More specifically,
in the first version, each off-loaded task is parallelized across 2 SPEs, and 4 MPI processes
are multiplexed on the PPE, executing 4 concurrent bootstraps. In the second version, each
off-loaded task is parallelized across 4 SPEs and 2 MPI processes are multiplexed on the PPE,
[Figure: two panels plotting execution time in seconds against the number of bootstraps, comparing EDTLP+LLP with 4 SPEs per parallel loop, EDTLP+LLP with 2 SPEs per parallel loop, and EDTLP. Panel (a) spans 0–16 bootstraps; panel (b) spans 0–140 bootstraps.]
Figure 5.5: Comparison of task-level and hybrid parallelization schemes in RAxML, on the Cell BE. The input file is 42 SC. The number of ML trees created is (a) 1–16, (b) 1–128.
executing 2 concurrent bootstraps. In the third version, the code concurrently executes 8 MPI
processes, the off-loaded tasks are not parallelized and the tasks are scheduled with the EDTLP
scheduler. Figure 5.5 illustrates the results of the experiments, with a data set representing 42
organisms. The x-axis shows the number of bootstraps, while the y-axis shows execution time
in seconds.
As expected, the hybrid model outperforms EDTLP when up to 4 bootstraps are executed,
since only a combination of EDTLP and LLP can off-load code to more than 4 SPEs simul-
taneously. With 5 to 8 bootstraps, the hybrid models execute bootstraps in batches of 2 and 4
respectively, while the EDTLP model executes all bootstraps in parallel. EDTLP activates 5
to 8 SPEs solely for task-level parallelism, leaving room for loop-level parallelism on at most
3 SPEs. This proves to be unnecessary, since the parallel execution time is determined by the
length of the non-parallelized off-loaded tasks that remain on at least one SPE. In the range
between 9 and 12 bootstraps, combining EDTLP and LLP selectively, so that the first 8 boot-
straps execute with EDTLP and the last 4 bootstraps execute with the hybrid scheme is the best
option. For the input data set with 42 organisms, performance of EDTLP and hybrid EDTLP-
LLP schemes is almost identical when the number of bootstraps is between 13 and 16. When
the number of bootstraps is higher than 16, EDTLP clearly outperforms any hybrid scheme
(Figure 5.5(b)).
The reader may notice that the problem of hybrid parallelization is trivialized when the
problem size is scaled beyond a certain point, which is 28 bootstraps in the case of RAxML
(see Section 5.3.2). A production run of RAxML for real-world phylogenetic analysis would
require up to 1,000 bootstraps, thus rendering hybrid parallelization seemingly unnecessary.
However, if a production RAxML run with 1,000 bootstraps were to be executed across multiple
Cell BEs, and assuming equal division of bootstraps between the processors, the cut-off point
for EDTLP outperforming the hybrid EDTLP-LLP scheme would be set at 36 Cell processors.
Beyond this scale, performance per processor would be maximized only if LLP were employed
in conjunction with EDTLP on each Cell. Although this observation is empirical and somewhat
simplifying, it is further supported by the argument that scaling across multiple processors will
in all likelihood increase communication overhead and therefore favor a parallelization scheme
with fewer MPI processes. The hybrid scheme reduces the number of MPI processes compared
to the pure EDTLP scheme, when the granularity of work per Cell becomes fine.
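The 36-processor figure follows directly from the single-Cell cut-off: with bootstraps divided equally, each of P processors runs 1,000/P bootstraps, and the hybrid scheme wins once that per-processor count falls below the 28-bootstrap cut-off of Section 5.3.2. A quick check of this arithmetic, assuming integer (floor) division of bootstraps per processor:

```c
#include <assert.h>

/* Smallest processor count P for which total_boots bootstraps, divided
 * equally, leave fewer than cutoff bootstraps per processor -- the regime
 * where the hybrid EDTLP-LLP scheme wins on each Cell. */
static int hybrid_cutoff_procs(int total_boots, int cutoff) {
    int p = 1;
    while (total_boots / p >= cutoff)
        p++;
    return p;
}
```

For 1,000 bootstraps and a cut-off of 28, the function returns 36, matching the figure quoted above.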
5.3.2 MGPS
The purpose of MGPS is to dynamically adapt the parallel execution by either exposing only
one layer of task parallelism to the SPEs via event-driven scheduling, or expanding to the second
layer of data parallelism and merging it with task parallelism when SPEs are underutilized at
runtime.
MGPS extends the EDTLP scheduler with an adaptive processor-saving policy. The sched-
uler runs locally in each process and it is driven by two events:
• arrivals, which correspond to off-loading functions from PPE processes to SPE threads;
• departures, which correspond to completion of SPE functions.
MGPS is invoked upon arrivals and departures of tasks. Initially, upon arrivals, the scheduler
conservatively assigns one SPE to each off-loaded task. Upon a departure, the scheduler mon-
itors the degree of task-level parallelism exposed by each MPI process, i.e. how many discrete
tasks were off-loaded to SPEs while the departing task was executing. This number reflects the
history of SPE utilization from task-level parallelism and is used to switch from the EDTLP
scheduling policy to a hybrid EDTLP-LLP scheduling policy. The scheduler monitors the num-
ber of SPEs that execute tasks over epochs of 100 off-loads. If the observed SPE utilization
is over 50% the scheduler maintains the most recently selected scheduling policy (EDTLP or
EDTLP-LLP). If the observed SPE utilization falls under 50% and the scheduler uses EDTLP,
it switches to EDTLP-LLP by loading parallelized versions of the loops in the local storages of
SPEs and performing loop distribution.
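The switching rule can be summarized as a small state machine. The epoch length of 100 off-loads and the 50% utilization threshold are the values quoted above; the function and constant names are illustrative.

```c
#include <assert.h>

enum policy { EDTLP, EDTLP_LLP };

#define EPOCH_OFFLOADS 100   /* off-loads per monitoring epoch */
#define UTIL_THRESHOLD 50    /* percent SPE utilization */

/* Decide the policy for the next epoch from the SPE utilization observed
 * in the current one.  At or above the threshold the most recently
 * selected policy is kept; below it, EDTLP expands to the hybrid
 * EDTLP-LLP scheme. */
static enum policy next_policy(enum policy cur, int util_percent) {
    if (util_percent >= UTIL_THRESHOLD)
        return cur;                     /* keep the recent choice */
    return cur == EDTLP ? EDTLP_LLP : cur;
}
```

The switch to EDTLP-LLP is the point at which the runtime system loads parallelized loop versions into the SPE local storages and begins distributing iterations.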
To switch between different parallel execution models at runtime, the runtime system uses
code versioning. It maintains three versions of the code of each task. One version is used for
execution on the PPE. The second version is used for execution on an SPE from start to finish,
using SIMDization to exploit the vector execution units of the SPE. The third version is used
for distributing the loop enclosed by the task across more than one SPE. The use of code
versioning increases code management overhead, as SPEs may need to load different versions
of the code of each off-loaded task at runtime. On the other hand, code versioning obviates
the need for conditionals that would be used in a monolithic version of the code. These con-
ditionals are expensive on SPEs, which lack branch prediction capabilities. Our experimental
analysis indicates that overlaying code versions on the SPEs via code transfers ends up being
slightly more efficient than using monolithic code with conditionals. This happens because of
the overhead and frequency of the conditionals in the monolithic version of the SPE code, but
also because the code overlays leave more space available in the local storage of SPEs for data
caching and buffering to overlap computation and communication [20].
We compare MGPS to EDTLP and two static hybrid (EDTLP-LLP) schedulers, using 2
SPEs per loop and 4 SPEs per loop respectively. Figure 5.6 shows the execution times of MGPS,
EDTLP-LLP and EDTLP with various RAxML workloads. The x-axis shows the number of
bootstraps, while the y-axis shows execution time. We observe benefits from using MGPS for up
to 28 bootstraps. Beyond 28 bootstraps, MGPS converges to EDTLP and both are increasingly
faster than static EDTLP-LLP execution, as the number of bootstraps increases.
A clear disadvantage of MGPS is that the time needed for any adaptation decision depends
on the total number of off-loading requests, which in turn is inherently application-dependent
and input-dependent. If the off-loading requests from different processes are spaced apart, there
may be extended idle periods on SPEs, before adaptation takes place. Another disadvantage of
MGPS is the dependency of its dynamic scheduling policy on the initial configuration used to
execute the application. In RAxML, MGPS converges to the best execution strategy only if the
application begins by oversubscribing the PPE and exposing the maximum degree of task-level
parallelism to the runtime system. This strategy is unlikely to converge to the best scheduling
policy in other applications, where task-level parallelism is limited and data parallelism is more
dominant. In this case, MGPS would have to commence its optimization process from a dif-
ferent program configuration favoring data-level rather than task-level parallelism. We address
the aforementioned shortcomings via a sampling-based MGPS algorithm (S-MGPS), which we
[Figure: two panels plotting execution time in seconds against the number of bootstraps, comparing MGPS, EDTLP+LLP with 4 SPEs per parallel loop, EDTLP+LLP with 2 SPEs per parallel loop, and EDTLP. Panel (a) spans 0–16 bootstraps; panel (b) spans 0–140 bootstraps.]
Figure 5.6: MGPS, EDTLP and static EDTLP-LLP. Input file: 42 SC. Number of ML trees created: (a) 1–16, (b) 1–128.
introduce in the next section.
5.4 S-MGPS
We begin this section by presenting a motivating example to show why controlling concur-
rency on the Cell is useful, even if SPEs are seemingly fully utilized. This example motivates
the introduction of a sampling-based algorithm that explores the space of program and system
configurations that utilize all SPEs, under different distributions of SPEs between concurrently
executing tasks and parallel loops. We present S-MGPS and evaluate S-MGPS using RAxML
and PBPI.
5.4.1 Motivating Example
Increasing the degree of task parallelism on Cell comes at a cost, namely increasing contention
between MPI processes that time-share the PPE. Pairs of processes that execute in parallel on
the PPE suffer from contention for shared resources, a well-known problem of simultaneous
multithreaded processors. Furthermore, with more processes, context switching overhead and
lack of co-scheduling of SPE threads and PPE threads from which the SPE threads originate,
may harm performance. On the other hand, while loop-level parallelization can ameliorate PPE
contention, its performance benefit depends on the granularity and locality properties of parallel
loops.
Figure 5.7 shows the efficiency of loop-level parallelism in RAxML when the input data
set is relatively small. The input data set in this example (25 SC) has 25 organisms, each
of them represented by a DNA sequence of 500 nucleotides. In this experiment, RAxML is
executed multiple times with a single worker process and a variable number of SPEs used for
LLP. The best execution time is achieved with 5 SPEs. The behavior illustrated in Figure 5.7 is
caused by several factors, including the granularity of loops relative to the overhead of PPE-SPE
communication and load imbalance (discussed in Section 5.2.2).
By using two dimensions of parallelism to execute an application, the runtime system can
control both PPE contention and loop-level parallelization overhead. Figure 5.8 illustrates an
example in which multi-grain parallel executions outperform one-dimensional parallel execu-
tions in RAxML, for any number of bootstraps. In this example, RAxML is executed with three
static parallelization schemes, using 8 MPI processes and 1 SPE per process, 4 MPI processes
and 2 SPEs per process, or 2 MPI processes and 4 SPEs per process respectively. The input data
[Figure: plot of execution time in seconds against the number of SPEs (1–8).]
Figure 5.7: Execution time of RAxML with a variable number of SPE threads. The input dataset is 25 SC.
[Figure: plot of execution time in seconds against the number of bootstraps (0–140), comparing 8 worker processes with 1 SPE per off-loaded task, 4 worker processes with 2 SPEs per off-loaded task, and 2 worker processes with 4 SPEs per off-loaded task.]
Figure 5.8: Execution times of RAxML, with various static multi-grain scheduling strategies. The input dataset is 25 SC.
set is 25 SC. Using this data set, RAxML performs the best with a multi-level parallelization
model when 4 MPI processes are simultaneously executed on the PPE and each of them uses 2
SPEs for loop-level parallelization.
5.4.2 Sampling-Based Scheduler for Multi-grain Parallelism
The S-MGPS scheduler automatically determines the best parallelization scheme for a specific
workload, by using a sampling period. During the sampling period, S-MGPS performs a search
of program configurations along the available dimensions of parallelism. The search starts with
a single MPI process and during the first step S-MGPS determines the optimal number of SPEs
that should be used by a single MPI process. The search is implemented by sampling execution
phases of the MPI process with different degrees of loop-level parallelism. Phases represent
code that is executed repeatedly in an application and dominates execution time. In case of
RAxML and PBPI, phases are the off-loaded tasks. Although we identify phases manually in
our execution environment, the selection process for phases is trivial and can be automated in a
compiler. Furthermore, parallel applications almost always exhibit a very strong runtime peri-
odicity in their execution patterns, which makes the process of isolating the dominant execution
phases straightforward.
Once the first sampling step of S-MGPS is completed, the search continues by sampling ex-
ecution intervals with every feasible combination of task-level and loop-level parallelism. In the
second phase of the search, the degree of loop-level parallelism never exceeds the optimal value
determined by the first sampling step. For each execution interval, the scheduler uses execution
time of phases as a criterion for selecting the optimal dimension(s) and granularity of paral-
lelism per dimension. S-MGPS uses a performance-driven mechanism to rightsize parallelism
on Cell, as opposed to the utilization-driven mechanism used in MGPS.
Figure 5.9 illustrates the steps of the sampling phase when 2 MPI processes are executed
on the PPE. This process can be performed for any number of MPI processes that can be exe-
cuted on a single Cell node. For each MPI process, the runtime system uses a variable number
of SPEs, ranging from 1 up to the optimal number of SPEs determined by the first phase of
sampling.
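The two sampling steps can be sketched as a search over a caller-supplied timing function. In this sketch, sample(tlp, llp) stands for timing one execution phase under a given configuration; the mock timing surface and the tlp*100+llp result encoding are our illustrative choices, not part of S-MGPS.

```c
#include <assert.h>

#define MAX_SPE 8

/* Phase 1: with one MPI process, find the LLP degree (1..MAX_SPE) that
 * minimizes the sampled phase time. */
static int best_llp(double (*sample)(int, int)) {
    int best = 1;
    for (int llp = 2; llp <= MAX_SPE; llp++)
        if (sample(1, llp) < sample(1, best))
            best = llp;
    return best;
}

/* Phase 2: sample every feasible TLP x LLP combination, never exceeding
 * the LLP bound from phase 1, and return the fastest pair encoded as
 * tlp*100 + llp. */
static int best_config(double (*sample)(int, int), int llp_cap, int max_tlp) {
    int bt = 1, bl = 1;
    for (int tlp = 1; tlp <= max_tlp; tlp++)
        for (int llp = 1; llp <= llp_cap; llp++)
            if (sample(tlp, llp) < sample(bt, bl)) { bt = tlp; bl = llp; }
    return bt * 100 + bl;
}

/* A mock timing surface with its optimum at TLP=4, LLP=2. */
static double mock_sample(int tlp, int llp) {
    return (tlp - 4) * (tlp - 4) + (llp - 2) * (llp - 2) + 1.0;
}
```

In the real scheduler the timing function measures phases of the running application, so the search cost is the sampling period discussed below.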
The purpose of the sampling period is to determine the configuration of parallelism that
maximizes efficiency. We define a throughput metric W as:
[Figure: four diagrams (a)–(d) of a Cell processor, each showing the PPE running Process1 and Process2, connected through the EIB to SPE1–SPE8, with a different assignment of SPEs to the two MPI processes in each sampling step.]
Figure 5.9: The sampling phase of S-MGPS. Samples are taken from four execution intervals, during which the code performs identical operations. For each sample, each MPI process uses a variable number of SPEs to parallelize its enclosed loops.
W = C / T    (5.1)
where C is the number of completed tasks and T is execution time. Note that a task is de-
fined as a function off-loaded on SPEs, therefore C captures application- and input-dependent
behavior. S-MGPS computes C by counting the number of task off-loads. This metric works
reasonably well, assuming that tasks of the same type (i.e. the same function or chunk of an
expensive computational loop, off-loaded multiple times on an SPE) have approximately the
same execution time. This is indeed the case in the applications that we studied. The metric can
be easily extended so that each task is weighed with its execution time relative to the execution
time of other tasks, to account for unbalanced task execution times. We do not explore this
option further in this thesis.
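Equation 5.1 and the weighted extension sketched above can be written directly. The function names are ours; the weighted variant assigns each task type a weight proportional to its execution time relative to other task types.

```c
#include <assert.h>

/* Throughput (Equation 5.1): completed off-loaded tasks per second. */
static double throughput(long completed, double seconds) {
    return (double)completed / seconds;
}

/* Weighted variant: each task type i contributes count[i] tasks weighted
 * by its relative execution time w[i], to account for unbalanced task
 * execution times. */
static double weighted_throughput(const long *count, const double *w,
                                  int types, double seconds) {
    double c = 0.0;
    for (int i = 0; i < types; i++)
        c += (double)count[i] * w[i];
    return c / seconds;
}
```

With uniform weights the second function reduces to the first, which is the form S-MGPS uses for the applications studied here.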
S-MGPS calculates efficiency for every sampled configuration and selects the configuration
with the maximum efficiency for the rest of the execution. In Table 5.4 we represent partial
results of the sampling phase in RAxML for different input datasets. In this example, the
degree of task-level parallelism sampled is 8, 4 and 2, while the degree of loop-level parallelism
sampled is 1, 2 and 4. In the case of RAxML we set a single sampling phase to be the time necessary
for all active worker processes to finish a single bootstrap. Therefore, in the case of RAxML
in Table 5.4, the number of bootstraps and the execution time differ across sampling phases:
when the number of active workers is 8, the sampling phase will contain 8 bootstraps, when the
number of active workers is 4 the sampling phase will contain 4 bootstraps, etc. Nevertheless,
the throughput (W ) remains invariant across different sampling phases and always represents
the efficiency of a certain configuration, i.e. amount of work done per second. Results presented
in Table 5.4 confirm that S-MGPS converges to the optimal configurations (4x2 and 8x1) for
the input files 25 SC and 42 SC.
    Dataset   deg(TLP) x   # bootstr. per    # off-loaded   phase      W
              deg(LLP)     sampling phase    tasks          duration
    42 SC     8x1          8                 2,526,126      41.73s     60,535
    42 SC     4x2          4                 1,263,444      21.05s     60,021
    42 SC     2x4          2                 624,308        14.42s     43,294
    25 SC     8x1          8                 1,261,232      16.53s     76,299
    25 SC     4x2          4                 612,155        8.01s      76,423
    25 SC     2x4          2                 302,394        5.6s       53,998
Table 5.4: Efficiency of different program configurations with two data sets in RAxML. The best configuration for 42 SC input is deg(TLP)=8, deg(LLP)=1. The best configuration for 25 SC is deg(TLP)=4, deg(LLP)=2. deg() corresponds to the degree of a given dimension of parallelism (LLP or TLP).
Since the scheduler performs an exhaustive search, for the 25 SC input, the total number
of bootstraps required for the sampling period on Cell is 17, for up to 8 MPI processes and
1 to 5 SPEs used per MPI process for loop-level parallelization. The upper bound of 5 SPEs
per loop is determined by the first step of the sampling period. Assuming that performance is
optimized if the maximum number of SPEs of the processor are involved in parallelization, the
feasible configurations to sample are constrained by deg(TLP)×deg(LLP)=8, for a single Cell
with 8 SPEs. Under this constraint, the number of samples needed by S-MGPS on Cell drops
to 3. Unfortunately, when considering only configurations that use all SPEs, the scheduler may
omit a configuration that does not use all SPEs but still performs better than the best scheme
that uses all processor cores. In principle, this situation may occur in certain non-scalable
codes or code phases. To address such cases, we recommend the use of exhaustive search in
S-MGPS, given that the total number of feasible configurations of SPEs on a Cell is manageable
and small compared to the number of tasks and the number of instances of each task executed
in real applications. This assumption may need to be revisited in the future for large-scale
systems with many cores and exhaustive search may need to be replaced by heuristics such as
hill climbing or simulated annealing.
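Under the deg(TLP)×deg(LLP)=8 constraint, and with the LLP bound of 5 from the first sampling step, the feasible set can be enumerated mechanically; the function name is ours.

```c
#include <assert.h>

/* Count configurations with deg(TLP) * deg(LLP) == num_spe and
 * deg(LLP) no larger than the cap found by the first sampling step. */
static int feasible_samples(int num_spe, int llp_cap) {
    int n = 0;
    for (int tlp = 1; tlp <= num_spe; tlp++)
        for (int llp = 1; llp <= llp_cap; llp++)
            if (tlp * llp == num_spe)
                n++;
    return n;
}
```

For 8 SPEs the unconstrained set is {8x1, 4x2, 2x4, 1x8}; the LLP cap of 5 excludes 1x8, leaving the 3 samples quoted above.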
In Table 5.5 we compare the performance of S-MGPS to the static scheduling policies with
both one-dimensional (TLP) and multi-grain (TLP-LLP) parallelism on Cell, using RAxML.
For a small number of bootstraps, S-MGPS underperforms the best static scheduling scheme
by 10%. The reason is that S-MGPS expends a significant percentage of execution time in
the sampling period, while executing the program in mostly suboptimal configurations. As the
number of bootstraps increases, S-MGPS comes closer to the performance of the best static
scheduling scheme (within 3%–5%).
                 deg(TLP)=8,   deg(TLP)=4,   deg(TLP)=2,
                 deg(LLP)=1    deg(LLP)=2    deg(LLP)=4    S-MGPS
    32 boots.    60s           57s           80s           63s
    64 boots.    117s          112s          161s          118s
    128 boots.   231s          221s          323s          227s
Table 5.5: RAxML – Comparison between S-MGPS and static scheduling schemes, illustrating the convergence overhead of S-MGPS.
To map PBPI to Cell, we used a hybrid parallelization approach where a fixed number of
MPI processes is multiplexed on the PPE and multiple SPEs are used for loop-level paralleliza-
tion. The performance of the parallelized off-loaded code in PBPI is influenced by the same
[Figure: plot of execution time in seconds against the number of SPEs (1–16), with one curve each for 1, 2, 4 and 8 MPI processes.]
Figure 5.10: PBPI executed with different levels of TLP and LLP parallelism: deg(TLP)=1–8, deg(LLP)=1–16.
factors as in RAxML: granularity of the off-loaded code, PPE-SPE communication, and load
imbalance. In Figure 5.10 we present the performance of PBPI when a variable number of SPEs
is used to execute the parallelized off-loaded code. The input file we used in this experiment is
107 SC, including 107 organisms, each represented by a DNA sequence of 1,000 nucleotides.
We run PBPI with one Markov chain for 200,000 generations. Figure 5.10 contains four exe-
cutions of PBPI with 1, 2, 4 and 8 MPI processes with 1–16, 1–8, 1–4 and 1–2 SPEs used per
MPI process respectively. In all experiments we use a single BladeCenter with two Cell BE
processors (total of 16 SPEs).
In the experiments with 1 and 2 MPI processes, the off-loaded code scales successfully only
up to a certain number of SPEs, which is always smaller than the number of total available SPEs.
Furthermore, the best performance in these two cases is reached when the number of SPEs used
for parallelization is smaller than the total number of available SPEs. The optimal number of
SPEs in general depends on the input data set and on the outermost parallelization and data
decomposition scheme of PBPI. The best performance for the specific dataset is reached by
using 4 MPI processes, spread across 2 Cell BEs, with each process using 4 SPEs on one Cell
BE. This optimal operating point shifts with different data set sizes.
The fixed virtual processor topology and data decomposition method used in PBPI prevents
dynamic scheduling of MPI processes at runtime without excessive overhead. We have exper-
imented with the option of dynamically changing the number of active MPI processes via a
gang scheduling scheme, which keeps the total number of active MPI processes constant, but
co-schedules MPI processes in gangs of size 1, 2, 4, or 8 on the PPE and uses 8, 4, 2, or 1
SPE(s) per MPI process per gang respectively, for the execution of parallel loops. This scheme
also suffered from system overhead, due to process control and context switching on the SPEs.
Pending better solutions for adaptively controlling the number of processes in MPI, we evalu-
ated S-MGPS in several scenarios where the number of MPI processes remains fixed. Using
S-MGPS we were able to determine the optimal degree of loop-level parallelism, for any given
degree of task-level parallelism (i.e. initial number of MPI processes) in PBPI. Being able to
pinpoint the optimal SPE configuration for LLP is still important since different loop paral-
lelization strategies can result in a significant difference in execution time. For example, the
naïve parallelization strategy, where all available SPEs are used for parallelization of off-loaded
loops, can result in up to 21% performance degradation (see Figure 5.10).
Table 5.6 shows a comparison of execution times when S-MGPS is used and when different
static parallelization schemes are used. S-MGPS performs within 2% of the optimal static
parallelization scheme. S-MGPS also performs up to 20% better than the naïve parallelization
scheme where all available SPEs are used for LLP (see Table 5.6(b)).
5.5 Chapter Summary
In this chapter we investigated policies and mechanisms pertaining to scheduling multigrain
parallelism on the Cell Broadband Engine. We proposed an event-driven task scheduler, striv-
ing for higher utilization of SPEs via oversubscribing the PPE. We have explored the conditions
under which loop-level parallelism within off-loaded code can be used. We have also proposed a
comprehensive scheduling policy for combining task-level and loop-level parallelism autonom-
ically within MPI code, in response to workload fluctuation. Using a bio-informatics code with
(a) deg(TLP)=1

    deg(LLP)          1      2      3      4      5      6      7      8
    Time (s)          502    267.8  222.8  175.8  142.1  118.6  108.1  134.3
    deg(LLP)          9      10     11     12     13     14     15     16
    Time (s)          122    111.9  138.3  109.2  122.3  133.2  115.3  116.5
    S-MGPS Time (s)   110.3

(b) deg(TLP)=2

    deg(LLP)          1      2      3      4      5      6      7      8
    Time (s)          275.9  180.8  139.4  113.5  91.3   97.3   102.55 115
    S-MGPS Time (s)   93

(c) deg(TLP)=4

    deg(LLP)          1      2      3      4
    Time (s)          180.6  118.67 94.63  83.61
    S-MGPS Time (s)   85.9

(d) deg(TLP)=8

    deg(LLP)          1      2
    Time (s)          355.5  265
    S-MGPS Time (s)   267
Table 5.6: PBPI – comparison between S-MGPS and static scheduling schemes: (a) deg(TLP)=1, deg(LLP)=1–16; (b) deg(TLP)=2, deg(LLP)=1–8; (c) deg(TLP)=4, deg(LLP)=1–4; (d) deg(TLP)=8, deg(LLP)=1–2.
inherent multigrain parallelism as a case study, we have shown that our user-level scheduling
policies outperform the native OS scheduler by up to a factor of 2.7.
Our MGPS scheduler proves to be responsive to small and large degrees of task-level and
data-level parallelism, at both fine and coarse levels of granularity. This kind of parallelism
is commonly found in optimization problems where many workers are spawned to search a
very large space of solutions, using a heuristic. RAxML is representative of these applica-
tions. MGPS is also appropriate for adaptive and irregular applications such as adaptive mesh
refinement, where the application has task-level parallelism with variable granularity (because
of load imbalance incurred while meshing subdomains with different structural properties) and,
in some implementations, a statically unpredictable degree of task-level parallelism (because
of non-deterministic dynamic load balancing which may be employed to improve execution
time). N-body simulations and ray-tracing are applications that exhibit similar properties and
can also benefit from our scheduler. As a final note, we observe that MGPS reverts to the best
static scheduling scheme for regular codes with a fixed degree of task-level parallelism, such as
blocked linear algebra kernels.
We also investigated the problem of mapping multi-dimensional parallelism on hetero-
geneous parallel architectures with both conventional and accelerator cores. We proposed a
feedback-guided dynamic scheduling scheme, S-MGPS, which rightsizes parallelism on the fly,
without a priori knowledge of application-specific information and regardless of the input data
set.
Chapter 6
Model of Multi-Grain Parallelism
6.1 Introduction
The migration of parallel programming models to accelerator-based architectures raises many
challenges. Accelerators require platform-specific programming interfaces and re-formulation
of parallel algorithms to fully exploit the additional hardware. Furthermore, scheduling code
on accelerators and orchestrating parallel execution and data transfers between host processors
and accelerators is a non-trivial exercise, as discussed in Chapter 5.
Although the S-MGPS scheduler (Section 5.4) is able to accurately determine the most efficient
execution configuration of a multi-level parallel application, it requires sampling of many
different configurations at runtime. The sampling time grows with the number of accelerators
on the chip, and with the number of different levels of parallelism available in the applica-
tion. To pinpoint the most efficient execution configuration without using the sampling phase,
we develop a model for multi-dimensional parallel computation on heterogeneous multi-core
processors. We name the model Model of Multi-Grain Parallelism (MMGP). The model is
applicable to any type of accelerator-based architecture, and in Section 6.4 we test the accuracy
and usability of the MMGP model on the multicore Cell architecture.
[Figure: host processing units HPU/LM #1 through #N_HP and accelerator processing units APU/LM #1 through #N_AP, each with a local memory (LM), connected through a shared-memory / message interface.]
Figure 6.1: A hardware abstraction of an accelerator-based architecture with two layers of parallelism. Host processing units (HPUs) supply relatively coarse-grain parallel computation across accelerators. Accelerator processing units (APUs) are the main computation engines and may support internally finer-grain parallelism. Both HPUs and APUs have local memories and communicate through shared memory or message passing. Additional layers of parallelism can be expressed hierarchically in a similar fashion.
6.2 Modeling Abstractions
Performance can be dramatically affected by the assignment of tasks to resources on a complex
parallel architecture with multiple types of parallel execution vehicles. We intend to create a
model of performance that captures the important costs of parallel task assignment at multiple
levels of granularity, while maintaining simplicity. Additionally, we want our techniques to be
independent of both programming models and the underlying hardware. Thus, in this section
we identify abstractions necessary to allow us to define a simple, accurate model of parallel
computation for accelerator-based architectures.
6.2.1 Hardware Abstraction
Figure 6.1 shows our abstraction for accelerator-based architectures. In this abstraction, each
node consists of multiple host processing units (HPU) and multiple accelerator processing units
(APU). Both the HPUs and APUs have local and shared memory. Multiple HPU-APU nodes
form a cluster. We model the communication cost between endpoints i and j, where i and j are HPUs, APUs, and/or HPU-APU nodes, using a variant of the LogP model [35] of point-to-point communication:

Ci,j = Oi + L + Oj    (6.1)

where Ci,j is the communication cost, Oi and Oj are the overheads of the sender and receiver, respectively, and L is the communication latency.
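As a minimal sketch of this cost model (the parameter values are hypothetical, since the endpoint overheads and latency are platform-specific and must be measured):

```python
# Sketch of Equation 6.1: cost of a point-to-point transfer between two
# endpoints. The overheads and latency below are hypothetical values in
# nanoseconds, not measurements from any particular platform.

def comm_cost(o_sender, o_receiver, latency):
    """C_{i,j} = O_i + L + O_j (Equation 6.1)."""
    return o_sender + latency + o_receiver

# Example: 1000 ns overhead at each endpoint, 3000 ns wire latency.
print(comm_cost(1000, 1000, 3000))  # 5000
```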
In this hardware abstraction, we model an HPU, APU, or HPU-APU node as a sequential
device with streaming memory accesses. For simplicity, we assume that additional levels of
parallelism in HPUs or APUs, such as ILP and SIMD, can be reflected with a parameter that
represents computing capacity. We could alternatively express multi-grain parallelism hierarchically, but this would complicate the model description without much added value. The assumption of streaming memory accesses allows the model to include the effects of communication and computation overlap.
6.2.2 Application Abstraction
Figure 6.2 provides an illustrative view of the succeeding discussion. We model the workload
of a parallel application using a version of the Hierarchical Task Graph (HTG [52]). An HTG
represents multiple levels of concurrency with progressively finer granularity when moving
from outermost to innermost layers. We use a phased HTG, in which we partition the application
into multiple phases of execution and split each phase into nested sub-phases, each modeled as
a single, potentially parallel task. Each subtask may incorporate one or more layers of data or
sub-task parallelism. The degree of concurrency may vary between tasks and within tasks.
Mapping a workload with nested parallelism as shown in Figure 6.2 to an accelerator-based
multi-core architecture can be challenging. In the general case, any application task of any
granularity could map to any combination of HPU and APU types. The solution space under
these conditions can be unmanageable.
Figure 6.2: Our application abstraction of two parallel tasks. Two tasks are spawned by the main process. Each task exhibits phased, multi-level parallelism of varying granularity. In this chapter, we address the problem of mapping tasks and subtasks to accelerator-based systems.

In this work, we confine the solution space by making some assumptions about the application and hardware. First, we assume that the amount and type of parallelism is known a priori
for all phases in the application. In other words, we assume that the application is explicitly par-
allelized, in a machine-independent fashion. More specifically, we assume that the application
exposes all available layers of inherent parallelism to the runtime environment, without how-
ever specifying how to map this parallelism to parallel execution vehicles in hardware. In other
words, we assume that the application’s parallelism is expressed independently of the number
and the layout of processors in the architecture. The parallelism of the application is represented
by a phased HTG graph. The intent of our work is to improve and formalize programming of
accelerator-based multicore architectures. We believe it is not unreasonable to assume those
interested in porting code and algorithms to such systems would have detailed knowledge about
the inherent parallelism of their application. Furthermore, explicit, processor-independent par-
allel programming is considered by many as a means to simplify parallel programming mod-
els [10].
Second, we prune the number and type of hardware configurations. We assume hardware
64
configurations consist of a hierarchy of nested resources, even though the actual resources may
not be physically nested in the architecture. Each resource is assigned to an arbitrary level of
parallelism in the application and resources are grouped by level of parallelism in the applica-
tion. For instance, the Cell Broadband Engine can be considered as 2 HPUs and 8 APUs, where
the two HPUs correspond to the PowerPC dual-thread SMT core and APUs to the synergistic
(SPE) accelerator cores. HPUs support parallelism of any granularity, whereas APUs support the same or finer, but not coarser, granularity. This assumption is reasonable since it faithfully represents all current accelerator architectures, where front-end processors offload computation
and data to accelerators. This assumption simplifies modeling of both communication and com-
putation.
6.3 Model of Multi-grain Parallelism
This section provides theoretical rigor to our approach. We present MMGP, a model which
predicts execution time on accelerator-based system configurations and applications under the
assumptions described in the previous section. Readers familiar with point-to-point models of
parallel computation may want to skim this section and continue directly to the results of our
execution time prediction techniques discussed in Section 6.4.
We follow a bottom-up approach. We begin by modeling sequential execution on the HPU,
with part of the computation off-loaded to a single APU. Next, we incorporate multiple APUs
in the model, followed by multiple HPUs. We end up with a general model of execution time,
which is not particularly practical. Hence, we reduce the general model to reflect different uses
of HPUs and APUs on real systems. More specifically, we specialize the model to capture the
scheduling policy of threads on the HPUs and to estimate execution times under different map-
pings of multi-grain parallelism across HPUs and APUs. Lastly, we describe the methodology
we use to apply MMGP to real systems.
Figure 6.3: (a) an architecture with one HPU and one APU; (b) an application with three phases. The sub-phases of a sequential application are readily mapped to HPUs and APUs. In this example, sub-phases 1 and 3 execute on the HPU and sub-phase 2 executes on the APU. HPUs and APUs are assumed to communicate via shared memory.
6.3.1 Modeling sequential execution
As the starting point, we consider the mapping of the program to an accelerator-based architec-
ture that consists of one HPU and one APU, and an application with one phase decomposed into
three sub-phases, a prologue and epilogue running on the HPU, and a main accelerated phase
running on the APU, as illustrated in Figure 6.3.
Offloading computation incurs additional communication cost for loading code and data on the APU, and for saving results computed on the APU. We model each of these communication
costs with a latency and an overhead at the end-points, as in Equation 6.1. We assume that
APU’s accesses to data during the execution of a procedure are streamed and overlapped with
APU computation. This assumption reflects the capability of current streaming architectures,
such as the Cell and Merrimac [37], to aggressively overlap memory latency with computa-
tion, using multiple buffers. Due to overlapped memory latency, communication overhead is
assumed to be visible only during loading the code and arguments of a procedure on the APU
and during returning the result of a procedure from the APU to the HPU. We combine the com-
munication overhead for offloading the code and arguments of a procedure and signaling the
execution of that procedure on the APU in one term (Os), and the overhead for returning the
result of a procedure from the APU to the HPU in another term (Or).
We can model the execution time for the offloaded sequential execution for sub-phase 2 in
Figure 6.3 as:
Toffload(w2) = TAPU(w2) + Or + Os    (6.2)
where TAPU(w2) is the time needed to complete sub-phase 2 without additional overhead.
Further, we can write the total execution time of all three sub-phases as:
T = THPU(w1) + TAPU(w2) + Or + Os + THPU(w3)    (6.3)
To reduce complexity, we replace THPU(w1) + THPU(w3) with THPU, TAPU(w2) with TAPU, and Os + Or with Ooffload. Therefore, we can rewrite Equation 6.3 as:

T = THPU + TAPU + Ooffload    (6.4)
The application model in Figure 6.3 is representative of one of potentially many phases in
an application. We further modify Equation 6.4 for a generic application with N phases, where
each phase i offloads a part of its computation on one APU:
T = Σ_{i=1}^{N} (THPU,i + TAPU,i + Ooffload)    (6.5)
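Equation 6.5 is a plain accumulation over phases; a minimal sketch follows (the per-phase times, in milliseconds, and the helper name are our own illustrative choices):

```python
# Sketch of Equation 6.5: total time of an N-phase application where
# phase i runs T_HPU,i on the HPU and offloads T_APU,i to one APU,
# paying O_offload per phase. All numbers are hypothetical milliseconds.

def total_time(phases, o_offload):
    """T = sum over phases of (T_HPU,i + T_APU,i + O_offload)."""
    return sum(t_hpu + t_apu + o_offload for t_hpu, t_apu in phases)

phases = [(10, 80), (5, 120), (20, 60)]  # (T_HPU,i, T_APU,i) per phase
print(total_time(phases, o_offload=1))  # 298
```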
6.3.2 Modeling parallel execution on APUs
Each offloaded part of a phase may contain fine-grain parallelism, such as task-level parallelism
at the sub-procedural level or data-level parallelism in loops. This parallelism can be exploited
by using multiple APUs for the offloaded workload. Figure 6.4 shows the execution time de-
composition for execution using one APU and two APUs. We assume that the code off-loaded
to an APU during phase i, has a part which can be further parallelized across APUs, and a part
executed sequentially on the APU. We denote TAPU,i(1, 1) as the execution time of the further
parallelized part of the APU code during the ith phase. The first index 1 refers to the use of
one HPU thread in the execution. We denote TAPU,i(1, p) as the execution time of the same
part when p APUs are used to execute this part during the ith phase. We denote as CAPU,i the
non-parallelized part of APU code in phase i. Therefore, we obtain:
TAPU,i(1, p) = TAPU,i(1, 1)/p + CAPU,i    (6.6)
Figure 6.4: Parallel APU execution: (a) offloading to one APU; (b) offloading to two APUs. The HPU (leftmost bar in parts a and b) offloads computations to one APU (part a) and two APUs (part b). The single point-to-point transfer of part a is modeled as overhead plus computation time on the APU. For multiple transfers, there is additional overhead (g), but also benefits due to parallelization.
Given that the HPU offloads to APUs sequentially, there exists a latency gap between con-
secutive offloads on APUs. Similarly, there exists a gap between receiving return values from
two consecutive offloaded procedures on the HPU. We denote with g the larger of the two gaps.
On a system with p APUs, parallel APU execution will incur an additional overhead as large as
p · g. Thus, we can model the execution time in phase i as:
Ti(1, p) = THPU,i + TAPU,i(1, 1)/p + CAPU,i + Ooffload + p · g    (6.7)
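Equation 6.7 captures a trade-off: the parallel region shrinks as 1/p while the offload gap grows as p · g. A sketch with hypothetical parameter values (the function name is our own):

```python
# Sketch of Equation 6.7: per-phase time when the offloaded region is
# split across p APUs. The parallel part shrinks as 1/p, but each APU
# adds an offload/receive gap g. All parameter values are hypothetical.

def phase_time(t_hpu, t_apu_11, c_apu, o_offload, g, p):
    """T_i(1, p) = T_HPU,i + T_APU,i(1,1)/p + C_APU,i + O_offload + p*g."""
    return t_hpu + t_apu_11 / p + c_apu + o_offload + p * g

# With a large gap, adding APUs eventually hurts: the p*g term keeps
# growing while the 1/p term flattens out.
for p in (1, 2, 4, 8):
    print(p, phase_time(t_hpu=1.0, t_apu_11=40.0, c_apu=2.0,
                        o_offload=0.2, g=2.0, p=p))
```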
6.3.3 Modeling parallel execution on HPUs
An accelerator-based architecture can support parallel HPU execution in several ways, by pro-
viding a multi-core HPU, an SMT HPU or combinations thereof. As a point of reference, we
consider an architecture with one SMT HPU, which is representative of the Cell BE.
Since the compute-intensive parts of an application are typically offloaded to APUs, the HPUs are expected to remain idle for extended intervals. Therefore, multiple threads can
be used to reduce idle time on the HPU and provide more sources of work for APUs, so that
APUs are better utilized. It is also possible to oversubscribe the HPU with more threads than
the number of available hardware contexts, in order to expose more parallelism via offloading
on APUs.
Figure 6.5 illustrates the execution timeline when two threads share the same HPU, and each
thread offloads parallelized code on two APUs. We use different shade patterns to represent the
workload of different threads.
Figure 6.5: Parallel HPU execution. The HPU (center bar) offloads computations to 4 APUs (2 on the right and 2 on the left). The first thread on the HPU offloads computation to APU1 and APU2, then idles. The second HPU thread is switched in, offloads code to APU3 and APU4, and then idles. APU1 and APU2 complete and return data, followed by APU3 and APU4.
For m concurrent HPU threads, where each thread uses p APUs for distributing a single
APU task, the execution time of a single off-loading phase can be represented as:
T^k_i(m, p) = T^k_HPU,i(m, p) + T^k_APU,i(m, p) + Ooffload + p · g    (6.8)

where T^k_i(m, p) is the completion time of the k-th HPU thread during the i-th phase.
Modeling the APU time
Similarly to Equation 6.6, we can write the APU time of the k-th thread in phase i in Equation
6.8 as:
T^k_APU,i(m, p) = TAPU,i(m, 1)/p + CAPU,i    (6.9)
Different parallel implementations may result in different TAPU,i(m, 1) terms and a different
number of offloading phases. For example, the implementation could parallelize each phase
among m HPU threads and then offload the work of each HPU thread to p APUs, resulting
in the same number of offloading phases and a reduced APU time during each phase, i.e.,
TAPU,i(m, 1) = TAPU,i(1, 1)/m. As another example, the HPU threads can be used to execute multiple identical tasks, resulting in a reduced number of offloading phases (i.e., N/m, where N is
the number of offloading phases when there is only one HPU thread) and the same APU time
in each phase, i.e., TAPU,i(m, 1) = TAPU,i(1, 1).
Modeling the HPU time
The execution time of each HPU thread is affected by three factors:
1. Contention between HPU threads for shared resources.
2. Context switch overhead related to resource scheduling.
3. Global synchronization between dependent HPU threads.
Considering all three factors, we can model the execution time of an HPU thread in phase i as:
T^k_HPU,i(m, p) = αm · THPU,i(1, p) + TCSW + OCOL    (6.10)
In this equation TCSW is context switching time on the HPU, and OCOL is the time needed for
collective communication. The parameter αm is introduced to account for contention between
threads that share resources on the HPU. On SMT and CMP HPUs, such resources typically
include one or more levels of the on-chip cache memory. On SMT HPUs in particular, shared
resources also include TLBs, branch predictors and instruction slots in the pipeline. Contention
between threads often introduces artificial load imbalance due to occasional unfair hardware
policies of allocating resources between threads.
Synthesis
Combining Equations (6.8)-(6.10) and summing over all phases, we can write the execution time for MMGP as:

T(m, p) = αm · THPU(1, 1) + TAPU(1, 1)/(m · p) + CAPU + N · (Ooffload + TCSW + OCOL + p · g)    (6.11)
Due to limited hardware resources (i.e. number of HPUs and APUs), we further constrain
this equation to m × p ≤ NAPU , where NAPU is the number of available APUs. As described
later in this chapter, we can either measure or approximate all parameters in Equation 6.11 from
microbenchmarks and profiles of sequential runs of the program.
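Under the constraint m · p ≤ NAPU, Equation 6.11 can be evaluated exhaustively over the feasible mappings. The sketch below uses hypothetical parameter values, treats αm as a constant for simplicity (in the model it depends on m), and the function and variable names are our own:

```python
# Sketch of Equation 6.11 and its intended use: evaluate T(m, p) for
# every feasible mapping with m * p <= N_APU and return the cheapest.
# alpha_m is treated as constant here for simplicity, and all parameter
# values are hypothetical placeholders for measured/profiled quantities.

def mmgp_time(m, p, alpha_m, t_hpu, t_apu, c_apu,
              n_phases, o_offload, t_csw, o_col, g):
    """T(m,p) = a_m*T_HPU + T_APU/(m*p) + C_APU + N*(O_off + T_CSW + O_COL + p*g)."""
    return (alpha_m * t_hpu + t_apu / (m * p) + c_apu
            + n_phases * (o_offload + t_csw + o_col + p * g))

def best_mapping(n_apu, **params):
    """Exhaustively search the feasible (m, p) pairs under m * p <= N_APU."""
    feasible = [(m, p) for m in range(1, n_apu + 1)
                for p in range(1, n_apu + 1) if m * p <= n_apu]
    return min(feasible, key=lambda mp: mmgp_time(*mp, **params))

params = dict(alpha_m=1.28, t_hpu=1.3, t_apu=370.0, c_apu=5.0,
              n_phases=100, o_offload=0.0172, t_csw=0.002,
              o_col=0.0, g=0.0008)
print(best_mapping(16, **params))  # the (m, p) pair with lowest predicted T
```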
6.3.4 Using MMGP
Given a parallel application, MMGP can be applied using the following process:
1. Calculate parameters including Ooffload, αm, TCSW and OCOL using micro-benchmarks for the target platform.
2. Profile a short run of the sequential execution with off-loading to a single APU, to estimate
THPU(1), g, TAPU(1, 1) and CAPU .
3. Solve a special case of Equation 6.11 (e.g. 6.7) to find the optimal mapping between
application concurrency and HPUs and APUs available on the target platform.
6.3.5 MMGP Extensions
We note that the concepts and assumptions mentioned in this section do not preclude further
specialization of MMGP for higher accuracy. For example, in Section 6.3.1 we assume com-
putation and data communication overlap. This assumption reflects the fact that streaming
processors can typically overlap memory access latency completely with computation. For
non-overlapped memory accesses, we can employ a DMA model as a specialization of the
overhead factors in MMGP. Also, in Sections 6.3.2 and 6.3.3 we assume only two levels of
parallelism. MMGP is easily extensible to additional levels but the terms of the equations grow
quickly without conceptual additions. Furthermore, MMGP can be easily extended to reflect
specific scheduling policies for threads on HPUs and APUs, as well as load imbalance in the
distribution of tasks between HPUs and APUs. To illustrate the usefulness of our techniques we
apply them to a real system. We next present results from applying MMGP to Cell.
6.4 Experimental Validation and Results
We use MMGP to derive multi-grain parallelization schemes for two bioinformatics applica-
tions, RAxML and PBPI, described in Chapter 3, on a shared-memory dual Cell blade, IBM
QS20. Although we are using only two applications in our experimental evaluation, we should
point out that these are complete applications used for real-world biological data analyses, and
that they are fully optimized for the Cell BE using an arsenal of optimizations, including vector-
ization, loop unrolling, double buffering, if-conversion and dynamic scheduling. Furthermore,
these applications have inherent multi-grain concurrency and non-trivial scaling properties in
their phases, therefore scheduling them optimally on Cell is a challenging exercise for MMGP.
Lastly, in the absence of comprehensive suites of benchmarks (such as NAS or SPEC HPC)
ported to Cell, optimized, and made available to the community by experts, we opted to use
PBPI and RAxML, codes on which we could verify that enough effort has been invested towards
Cell-specific parallelization and optimization.
6.4.1 MMGP Parameter approximation
MMGP has eight free parameters, THPU , TAPU , CAPU , Ooffload, g, TCSW , OCOL and αm. We
estimate four of the parameters using micro-benchmarks.
αm captures contention between processes or threads running on the PPE. This contention
depends on the scheduling algorithm on the PPE. We estimate αm under an event-driven schedul-
ing model which oversubscribes the PPE with more processes than the number of hardware
threads supported for simultaneous execution on the PPE, and switches between processes upon
each off-loading event on the PPE [19].
To estimate αm, we use a parallel micro-benchmark that computes the product of two M×M
square matrices consisting of double-precision floating point elements. Matrix-matrix multiplication involves O(n³) computation and O(n²) data transfers, thus stressing the impact of
sharing execution resources and the L1 and L2 caches between processes on the PPE. We used
several different matrix sizes, ranging from 100× 100 to 500× 500, to exercise different levels
of pressure on the thread-shared caches of the PPE. In the MMGP model, we use the mean of
αm obtained from these experiments, which is 1.28.
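The methodology can be sketched as follows; the pure-Python kernel, matrix size, and single sibling process are stand-ins for the actual PPE benchmark, so the printed ratio will vary by machine (the dissertation reports a mean of 1.28 across matrix sizes):

```python
# Sketch of the alpha_m estimation: time the same kernel running alone
# and then while a sibling process competes for the shared core, and
# take the slowdown ratio. Kernel and size are illustrative stand-ins.
import time
from multiprocessing import Process

def matmul(n):
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def timed(fn, *args):
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

def contention_factor(t_shared, t_alone):
    """alpha_m: slowdown of a thread when it shares the HPU."""
    return t_shared / t_alone

if __name__ == "__main__":
    n = 60
    t_alone = timed(matmul, n)          # kernel running alone
    sibling = Process(target=matmul, args=(n,))
    sibling.start()
    t_shared = timed(matmul, n)         # kernel with a competing process
    sibling.join()
    print(round(contention_factor(t_shared, t_alone), 2))
```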
PPE-SPE communication is optimally implemented through DMAs on Cell. We devised
a ping-pong micro-benchmark using DMAs to send a single word from the PPE to one SPE
and back. We measured the PPE→SPE→PPE round-trip communication overhead (Ooffload) to be 70 ns. To measure the overhead caused by various collective communications, we used
mpptest [55] on the PPE. Using a micro-benchmark that repeatedly executes the sched_yield() system call, we estimate the overhead caused by context switching (TCSW) on the PPE to be
2 µs. This is a conservative upper bound for context switching overhead, since it includes some
user-level library overhead.
THPU , TAPU , CAPU and the gap g between consecutive DMAs on the PPE are application-
dependent and cannot be approximated easily with a micro-benchmark. To estimate these pa-
rameters, we use a profile of a sequential run of the code, with tasks off-loaded on one SPE. We
use timing instructions inserted into the applications at specific locations. To estimate THPU
we measure the time that applications spend on the HPU. To estimate TAPU and CAPU we mea-
sure the time that applications spend on the accelerators, in large computational loops which
can be parallelized (TAPU ), and in the sequential accelerator code outside of the large loops
(CAPU ). To estimate g, we measure the time intervals between the consecutive task off-loads
and task completions.
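As an illustration of how g can be extracted from such a profile (the timestamps below are a hypothetical trace, not measured data, and the helper name is our own):

```python
# Sketch of estimating g from a profiled trace: given the times at which
# consecutive tasks were offloaded and at which their results returned,
# take g as the larger of the two mean inter-event gaps (Section 6.3.2
# defines g as the larger of the offload and receive gaps). The
# timestamps below are a hypothetical trace, in seconds.

def mean_gap(timestamps):
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return sum(gaps) / len(gaps)

offload_ts  = [0.0, 0.8, 1.6, 2.4]  # when each task was offloaded
complete_ts = [5.0, 5.9, 6.8, 7.7]  # when each result came back

g = max(mean_gap(offload_ts), mean_gap(complete_ts))
print(round(g, 3))  # the receive-side gap dominates in this trace
```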
6.4.2 Case Study I: Using MMGP to parallelize PBPI
PBPI with One Dimension of Parallelism
We compare the PBPI execution times predicted by MMGP to the actual execution times ob-
tained on real hardware, using various degrees of PPE and SPE parallelism, i.e. the equivalents
of HPU and APU parallelism on Cell. These experiments illustrate the accuracy of MMGP, in a
sample of the feasible program configurations. The sample includes one-dimensional decompo-
sitions of the program between PPE threads, with simultaneous off-loading of code to one SPE
from each PPE thread, one-dimensional decompositions of the program between SPE threads,
where the execution of tasks on the PPE is sequential and each task off-loads code which is
data-parallel across SPEs, and two-dimensional decompositions of the program, where multi-
ple tasks run on the PPE threads concurrently and each task off-loads code which is data-parallel
across SPEs. In all cases, the SPE code is SIMDized in the innermost loops, to exploit the vector units of the SPEs. We believe that this sample of program configurations is representative of what a user would reasonably experiment with while trying to optimize the codes on the Cell.

Figure 6.6: MMGP predictions and actual execution times of PBPI, when the code uses one dimension of PPE (HPU) parallelism.
For these experiments, we used the arch107 L10000 input data set. This data set consists
of 107 sequences, each with 10000 characters. We run PBPI with one Markov chain for 20000
generations. Using the time base register on the PPE and the decrementer register on one SPE,
we obtained the following model parameters for PBPI: THPU = 1.3s, TAPU = 370s, g = 0.8s
and O = 1.72s.
Figure 6.6 compares MMGP and actual execution times for PBPI, when PBPI only ex-
ploits one-dimensional PPE (HPU) parallelism in which each PPE thread uses one SPE for
off-loading. We execute the code with up to 16 MPI processes, which off-load code to up to
16 SPEs on two Cell BEs. Referring to Equation 6.11, we set p = 1 and vary the value of
m from 1 to 8. The X-axis shows the number of processes running on the PPE (i.e. HPU
parallelism), and the Y-axis shows the predicted and measured execution times. The maximum
prediction error of MMGP is 5%. The arithmetic mean of the error is 2.3% and the standard
deviation is 1.4.
Figure 6.7: MMGP predictions and actual execution times of PBPI, when the code uses one dimension of SPE (APU) parallelism, with a data-parallel implementation of the maximum likelihood calculation.

Figure 6.7 illustrates predicted and actual execution times when PBPI uses one dimension of SPE (APU) parallelism. Referring to Equation 6.11, we set m = 1 and vary p from 1 to 8. MMGP remains accurate: the mean prediction error is 4.1% and the standard deviation is
3.2. The maximum prediction error in this case is higher (approaching 10%) when the APU
parallelism increases and the code uses SPEs on both Cell processors. A closer inspection of
this result reveals that the data-parallel implementation of tasks in PBPI stops scaling beyond
the 8 SPEs confined in one Cell processor, because of DMA bottlenecks and non-uniformity in
the latency of memory accesses by the two Cell processors on the blade. Capturing the DMA
bottlenecks requires the introduction of a model of DMA contention in MMGP, while captur-
ing the NUMA bottleneck would require an accurate memory hierarchy model integrated with
MMGP. The NUMA bottleneck can be resolved by a better page placement policy implemented
in the operating system. We intend to examine these issues in our future work. For the purposes
of this chapter, it suffices to observe that MMGP is accurate enough despite its generality. As we
show later, MMGP predicts accurately the optimal mapping of the program to the Cell multi-
processor, regardless of inaccuracies in execution time prediction in certain edge cases.
PBPI with Two Dimensions of Parallelism
Multi-grain parallelization aims at exploiting simultaneously task-level and data-level paral-
lelism in PBPI. We only consider multi-grain parallelization schemes in which degHPU·degAPU ≤
16, i.e. the total number of SPEs (APUs) on the dual-processor Cell Blade we used in this study.
deg() denotes the degree of a layer of parallelism, which corresponds to the number of SPE or
PPE threads used to run the code. Figure 6.8 shows the predicted and actual execution times
of PBPI for all feasible combinations of multi-grain parallelism under the aforementioned con-
straint. MMGP’s mean prediction error is 3.2%, the standard deviation of the error is 2.6 and
the maximum prediction error is 10%. The important observation in these results is that MMGP
agrees with the experimental outcome in terms of the mix of PPE and SPE parallelism to use
in PBPI for maximum performance. In a real program development scenario, MMGP would
point the programmer to the direction of using both task-level and data-level parallelism with a
balanced allocation of PPE contexts and SPEs between the two layers.
6.4.3 Case Study II: Using MMGP to Parallelize RAxML
RAxML with a Single Layer of Parallelism
The units of work (bootstraps) in RAxML are distributed evenly between MPI processes, there-
fore the degree of PPE (HPU) concurrency is bound by the number of MPI processes. As
discussed in Section 6.3.3, the degree of HPU concurrency may exceed the number of HPUs, so
that on an architecture with more APUs than HPUs, the program can expose more concurrency
to APUs. The degree of SPE (APU) concurrency may vary per MPI process. In practice, the
degree of PPE concurrency cannot meaningfully exceed the total number of SPEs available on the system, since that many MPI processes can already utilize all available SPEs via simultaneous
off-loading. Similarly to PBPI, each MPI process in RAxML can exploit multiple SPEs via
data-level parallel execution of off-loaded tasks across SPEs. To enable maximal PPE and SPE
concurrency in RAxML, we use a version of the code scheduled by a Cell BE event-driven
scheduler [19], in which context switches on the PPE are forced upon task off-loading and PPE processes are served with a fair-share scheduler, so as to have even chances for off-loading on SPEs.

Figure 6.8: MMGP predictions and actual execution times of PBPI, when the code uses two dimensions of SPE (APU) and PPE (HPU) parallelism. The mix of degrees of parallelism which optimizes performance is 4-way PPE parallelism combined with 4-way SPE parallelism. The chart illustrates the results when both SPE parallelism and PPE parallelism are scaled to two Cell processors.
We evaluate the performance of RAxML when each process performs the same amount of
work, i.e. the number of distributed bootstraps is divisible by the number of processes. The case
of unbalanced distribution of bootstraps between MPI processes can be handled with a minor
modification to Equation 6.11, to scale the MMGP parameters by a factor of ⌈B/M⌉ · M / B, where B
is the number of bootstraps (tasks) and M is the number of MPI processes used to execute the
code.
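The correction factor can be sketched directly (the values of B and M below are illustrative):

```python
# Sketch of the load-imbalance correction: with B bootstraps distributed
# over M MPI processes, the busiest processes execute ceil(B/M) rounds,
# so the MMGP parameters scale by ceil(B/M) * M / B relative to the
# balanced case. The example values of B and M are illustrative.
import math

def imbalance_factor(b, m):
    return math.ceil(b / m) * m / b

print(imbalance_factor(16, 4))  # 1.0  -- 16 bootstraps split evenly
print(imbalance_factor(16, 5))  # 1.25 -- ceil(16/5) = 4 rounds on 5 processes
```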
We compare the execution time of RAxML to the time predicted by MMGP, using two
input data sets. The first data set contains 10 organisms, each represented by a DNA sequence
of 20,000 nucleotides. We refer to this data set as DS1. The second data set (DS2) contains 10
organisms, each represented by a DNA sequence of 50,000 nucleotides. For both data sets, we set RAxML to perform a total of 16 bootstraps using different parallel configurations.

Figure 6.9: MMGP predictions and actual execution times of RAxML, when the code uses one dimension of PPE (HPU) parallelism: (a) with DS1, (b) with DS2.
The MMGP parameters for RAxML, obtained from profiling a sequential run of the code, are
THPU = 3.3s, TAPU = 63s, CAPU = 104s for DS1, and THPU = 8.8s, TAPU = 118s, CAPU =
157s for DS2. The values of other MMGP parameters are negligible compared to TAPU , THPU ,
and CAPU , therefore we disregard them for RAxML. Note that the off-loaded code that cannot
be parallelized (CAPU ) takes 57-62% of the execution time of a task on the SPE. Figure 6.9
illustrates the estimated and actual execution times of RAxML with up to 16 bootstraps, using
one dimension of PPE (HPU) parallelism. In this case, each MPI process offloads tasks to one
SPE and SPEs are utilized by oversubscribing the PPE with more processes than the number of
hardware threads available on the PPEs. For DS1, the mean MMGP prediction error is 7.1%,
the standard deviation is 6.4, and the maximum error is 18%. For DS2, the mean MMGP
prediction error is 3.4%, the standard deviation is 1.9 and the maximum error is 5%.
Figure 6.10: MMGP predictions and actual execution times of RAxML, when the code uses one dimension of SPE (APU) parallelism: (a) with DS1, (b) with DS2.
Figure 6.10 illustrates estimated and actual execution times of RAxML, when the code uses
one dimension of SPE (APU) parallelism, with a data-parallel implementation of the maxi-
mum likelihood calculation functions across SPEs. We should point out that although both
RAxML and PBPI perform maximum likelihood calculations in their computational cores,
RAxML’s loops have loop-carried dependencies that prevent scalability and parallelization in
many cases [20], whereas PBPI’s core computation loops are fully parallel and coarse enough to
achieve scalability. The limited scalability of data-level parallelization of RAxML is the reason
why we confine the executions with data-level parallelism to at most 8 SPEs. As shown in Figure 6.10, the data-level parallel implementation of RAxML does not scale substantially beyond
4 SPEs. When only APU parallelism is extracted from RAxML, for DS1 the mean MMGP
prediction error is 0.9%, the standard deviation is 0.8, and the maximum error is 2%. For DS2,
the mean MMGP prediction error is 2%, the standard deviation is 1.3 and the maximum error
is 4%.
RAxML with Two Dimensions of Parallelism
Figure 6.11 shows the actual and predicted execution times in RAxML, when the code exposes
two dimensions of parallelism to the system. Once again, regardless of execution time prediction accuracy, MMGP pinpoints the optimal parallelization model, which in the case
of RAxML is task-level parallelization with no further data-parallel decompositions of tasks
between SPEs, as the opportunity for scalable data-level parallelization in the code is limited.
Innermost loops in tasks are still SIMDized within each SPE. MMGP remains accurate, with
mean execution time prediction error of 4.3%, standard deviation of 4%, and maximum prediction error of 18% for DS1, and mean execution time prediction error of 2.8%, standard deviation of 1.9%, and maximum prediction error of 7% for DS2. It is worth noting that although the two
codes tested are fundamentally similar in their computational core, their optimal parallelization
model is radically different. MMGP accurately reflects this disparity, using a small number of
parameters and rapid prediction of execution times across a large number of feasible program
configurations.
6.4.4 MMGP Usability Study
We demonstrate a practical use of MMGP through a simple usability study. We modified PBPI
to execute an MMGP sampling phase at the beginning of the execution. During the sampling
phase, the application is profiled and all MMGP parameters are determined. After finishing
the sampling phase, MMGP estimates the optimal configuration and the application is executed
with the MMGP-recommended configuration. The profiling, sampling and MMGP actuation
phases are performed automatically without any user intervention. We set PBPI to execute 10^6 generations, since this is a number of generations typically required by biologists. We set the
sampling phase to be 10,000 generations. Even with the overhead introduced by the sampling
phase included in the measurements, the configuration provided by MMGP outperforms all
other configurations by margins ranging from 1.1% (compared to the next best configuration
Figure 6.11: MMGP predictions and actual execution times of RAxML, when the code usestwo dimensions of SPE (APU) and PPE (HPU) parallelism. Performance is optimized by over-subscribing the PPE and maximizing task-level parallelism.
Figure 6.12: Overhead of the sampling phase when MMGP scheduler is used with the PBPIapplication. PBPI is executed multiple times with 107 input species. The sequence size of theinput file is varied from 1,000 to 10,000. In the worst case, the overhead of the sampling phaseis 2.2% (sequence size 7,000).
identified via an exhaustive search), and up to 4 times (compared to the worst configuration
identified via an exhaustive search). The sampling phase takes in the worst case 2.2% of the
total execution time, but completely eliminates the exhaustive search that would otherwise be
necessary to find the best mapping of the application to the Cell architecture. Figure 6.12 illustrates the overhead of the sampling phase with the PBPI application.
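The sample-then-commit flow described above can be written as a short driver. All three callables are hypothetical stand-ins for the instrumented PBPI entry points, not the actual implementation:

```python
def run_with_sampling(profile, mmgp_best_config, run_generations,
                      total_gens=10**6, sample_gens=10_000):
    """Profile a short prefix of the run to obtain the MMGP parameters,
    let the model recommend a configuration, then execute the remaining
    generations under that configuration."""
    params = profile(sample_gens)       # sampling phase: measure times and overheads
    config = mmgp_best_config(params)   # model picks (Nprocess, NSPE)
    run_generations(config, total_gens - sample_gens)
    return config

# Toy stand-ins, for illustration only:
chosen = run_with_sampling(
    profile=lambda gens: {"Tppe": 1.0, "Tspe": 4.0},
    mmgp_best_config=lambda params: (1, 6),
    run_generations=lambda config, gens: None)
```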
6.5 Chapter Summary
The introduction of accelerator-based parallel architectures complicates the problem of map-
ping algorithms to systems, since parallelism can no longer be considered as a one-dimensional
abstraction of processors and memory. We presented a new model of multi-dimensional parallel
computation, MMGP, which we introduced to relieve users from the arduous task of mapping
parallelism to accelerator-based architectures. We have demonstrated that the model is fairly
accurate, albeit simple, and that it is extensible and easy to specialize for a given architecture.
We envision three uses of MMGP: i) As a rapid prototyping tool for porting algorithms to
accelerator-based architectures. More specifically, MMGP can help users derive not only a de-
composition strategy, but also an actual mix of programming models to use in the application in
order to best utilize the architecture, while using architecture-independent programming tech-
niques. ii) As a compiler tool for assisting compilers in deriving efficient mappings of programs
to accelerator-based architectures automatically. iii) As a runtime tool for dynamic control of
parallelism in applications, whereby the runtime system searches for optimal program configu-
rations in the neighborhood of optimal configurations derived by MMGP, using execution time
sampling or prediction-based techniques.
Chapter 7
Scheduling Asymmetric Parallelism on a
PS3 Cluster
7.1 Introduction
Cluster computing is already feeling the impact of multi-core processors [30]. Several highly
ranked entries of the latest Top-500 list include clusters of commodity dual-core processors1.
The availability of abundant chip-level and board-level parallelism changes fundamental as-
sumptions that developers currently make while writing software for HPC clusters. While recent
work has improved our understanding of the implications of small-scale symmetric multi-core
processors on cluster computing [7], emerging asymmetric multi-core processors such as the
Cell/BE and boards with conventional processors and hardware accelerators, such as GPUs,
are rapidly making their way into HPC clusters [94]. There are strong incentives that support
this trend, not the least of which is higher performance with higher energy-efficiency made
possible through asymmetric, rather than symmetric multi-core processor organizations [57].
Understanding the implications of asymmetric multi-core processors on cluster computing
and providing models and software support to ease the migration of parallel programs to these
1. http://www.top500.org
platforms is a challenging and relevant problem. This study makes four contributions:
i) We conduct a performance analysis of a Linux cluster of Sony PlayStation3 (PS3) nodes.
To the best of our knowledge, this is the first study to evaluate this cost-effective and unconven-
tional HPC platform with microbenchmarks and realistic applications from the area of bioin-
formatics. The cluster we used has 22 PS3 nodes connected with a GigE switch and was built
at Virginia Tech for less than $15,000. We first evaluate the performance of MPI collective
and point-to-point communication on the PS3 cluster, and explore the scalability of MPI com-
munication operations under contention for bandwidth within and across PS3 nodes. We then
evaluate performance and scalability of the PS3 cluster with bioinformatics applications. Our
analysis reveals the sensitivity of computation and communication to the mapping of asym-
metric parallelism to the cluster and the importance of coordinated scheduling across multiple
layers of parallelism. Optimal scheduling of MPI codes on the PS3 cluster requires coordi-
nated scheduling and mapping of at least three layers of parallelism (two layers within each
Cell processor and an additional layer across Cell processors), whereas the optimal mapping
and schedule changes with the application, the input data set, and the number of nodes used for
execution.
ii) We adapt and validate MMGP on the PS3 cluster. We model a generic heterogeneous clus-
ter built from compute nodes with front-end host cores and back-end accelerator cores. The
extended model combines analytical components with empirical measurements, to navigate the
optimization space for mapping MPI programs with nested parallelism on the PS3 cluster. Our
evaluation of the extended MMGP model shows that it estimates execution time with an average
error rate of 5.2% on a cluster composed of PlayStation3 nodes. The model captures the effects
of application characteristics, input data sets, and cluster scale on performance. Furthermore,
the model pinpoints optimal mappings of MPI applications to the PS3 cluster with remarkable
accuracy.
iii) Using the cluster of Playstation3 nodes, we analyze earlier proposed user-level scheduling
heuristics for co-scheduling threads (Chapter 5). We show that co-scheduling algorithms yield
significant performance improvements (1.7–2.7×) over the native OS scheduler in MPI appli-
cations. We also explore the trade-off between different co-scheduling policies that selectively
spin or yield the host cores, based on runtime prediction of task execution lengths on the accel-
erator cores.
iv) We present a comparison between our PS3 cluster and an IBM QS20 blade cluster (based on
Cell/BE), illustrating that despite important limitations in computational ability and the com-
munication substrate, the PS3 cluster is a viable platform for HPC research and development.
The rest of this Chapter is organized as follows: Section 7.2 presents our experimental
platform. Section 7.3 presents our performance analysis of the PS3 cluster. Section 7.4 presents
the extended model of hybrid parallelism and its validation. Section 7.5 presents co-scheduling
policies for clusters of asymmetric multi-core processors and evaluates these policies. Section
7.6 compares the PS3 cluster against an IBM QS20 Cell-blade cluster. Section 7.7 concludes
the chapter.
7.2 Experimental Platform
Our experimental platform is a cluster of 22 PS3 nodes, 8 of which were available to us in dedicated mode for the purposes of this thesis. The PS3 nodes are connected
to a 1000BASE-T Gigabit Ethernet switch with 96 Gbps of switching capacity. Each
PS3 runs Linux FC5 with kernel version 2.6.16, compiled for the 64-bit PowerPC architecture
with platform-specific kernel patches for managing the heterogeneous cores of the Cell/BE.
The nodes communicate with LAM/MPI 7.1.1. We used the IBM Cell SDK 2.1 for intra-
Cell/BE parallelization of the MPI codes. The Linux kernel on the PS3 is running on top of a
proprietary hypervisor. Though some devices are accessed directly, the built-in Gigabit Ethernet controller in the PS3 is accessed via hypervisor calls, so communication performance is not optimal.
7.3 PS3 Cluster Scalability Study
7.3.1 MPI Communication Performance
As we extend our MMGP model to the cluster of PlayStation3 machines, the cost of MPI calls becomes a more significant parameter of the prediction model than it is on a single machine. To study how the MMGP model scales in the new parallel computing environment, we experimented with two real-world parallel applications, PBPI and RAxML.
To measure communication performance on the PS3 cluster, we use mpptest [56]. We
present mpptest results only for two MPI communication primitives, which dominate com-
munication time in our application benchmarks. Figure 7.1 shows the overhead of MPI Allreduce(),
with various message sizes. Each data point represents a number and a distribution of MPI
processes between PS3 nodes. For any given number of PS3 nodes, we use 1 to 6 MPI pro-
cesses, using shared memory for communication within the PS3. Our evaluation methodology
stresses the impact of contention for communication bandwidth both within and across PS3
nodes. There is benefit in exploiting shared memory for communication between MPI pro-
cesses on each PS3. For example, collective operations between 8 processes running on 2 PS3
nodes are up to 30% faster than collective operations between 8 processes running across 8 PS3
nodes. However, there is also a noticeable penalty for oversubscribing each PPE with more
than two processes, due to OS overhead, despite our use of blocking shared memory commu-
nication within each PS3. Similar observations can be made for point-to-point communication
(Figure 7.2), although the effect of using shared memory within each PS3 is less pronounced and the effect of oversubscribing the PPE is more pronounced.
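When these measurements feed the model of Section 7.4, a per-primitive latency table with linear interpolation between measured message sizes is sufficient. A sketch follows; the sample latencies are invented, not the mpptest results:

```python
from bisect import bisect_left

def comm_cost(table, msg_bytes):
    """Linearly interpolate MPI latency (usec) from (size, latency)
    samples measured with a tool such as mpptest."""
    sizes = [s for s, _ in table]
    lats = [t for _, t in table]
    if msg_bytes <= sizes[0]:
        return lats[0]
    if msg_bytes >= sizes[-1]:
        return lats[-1]
    i = bisect_left(sizes, msg_bytes)
    frac = (msg_bytes - sizes[i - 1]) / (sizes[i] - sizes[i - 1])
    return lats[i - 1] + frac * (lats[i] - lats[i - 1])

# Hypothetical MPI_Allreduce() samples on 4 nodes: (bytes, usec).
allreduce_4n = [(8, 400.0), (1024, 900.0), (8192, 4200.0)]
```

One table per communication primitive and per (node count, processes-per-node) pair captures the contention effects discussed above.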
7.3.2 Application Benchmarks
We evaluate the performance of two state-of-the-art phylogenetic tree construction codes, RAxML
and PBPI, on our PS3 cluster, described in Chapter 3. The applications have been painstakingly
(a) MPI Allreduce() latency, one double. (b) MPI Allreduce() latency, arrays of doubles.
Figure 7.1: MPI Allreduce() performance on the PS3 cluster. Processes are distributedevenly between nodes. Each node runs up to 6 processes, using shared memory for communi-cation within the node.
Figure 7.2: MPI Send/Recv() latency on the PS3 cluster. Processes are distributed evenlybetween nodes. Each node runs up to 6 processes, using shared memory for communicationwithin the node.
optimized for the Cell BE, using vectorization, loop unrolling and tiling, branch optimizations,
double buffering, and optimized numerical implementations of kernels utilizing fast single-
precision arithmetic to implement double-precision operations. The optimization process is
described in Chapter 4. Both RAxML and PBPI are capable of exploiting multiple levels of
PPE and SPE parallelism. We used a task off-loading execution model in the codes. The execu-
tion commences on the PPE, and SPEs are used for accelerating computation-intensive loops.
The off-loaded loops are parallelized across SPEs and vectorized within SPEs. The number of
PPE processes and the number of SPEs per PPE process are user-specified.
When the PPE is oversubscribed with more than two processes, the processes are scheduled
using an event-driven task-level parallelism (EDTLP) scheduler, described in Chapter 5. Each
process executes until the point it off-loads an SPE task and then releases the PPE, while waiting
for the off-loaded task to complete. The same process resumes execution on the PPE only after
all other processes off-load SPE tasks at least once. The RAxML and PBPI ports on the PS3
cluster are adaptations of the original MPI codes and are capable of execution in a distributed
environment. No algorithmic modifications have been applied to the applications.
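The EDTLP hand-off can be illustrated with a small simulation of the off-load order it produces; this is a sketch of the policy, not the Chapter 5 implementation:

```python
from collections import deque

def edtlp_offload_order(num_processes, offloads_per_process):
    """Simulate EDTLP: a process runs on the PPE until it off-loads an
    SPE task, then moves to the back of the queue; it runs again only
    after every other process has off-loaded at least once."""
    ready = deque(range(num_processes))
    order = []
    remaining = {p: offloads_per_process for p in range(num_processes)}
    while ready:
        p = ready.popleft()
        order.append(p)        # p off-loads one SPE task and yields the PPE
        remaining[p] -= 1
        if remaining[p] > 0:
            ready.append(p)    # rejoins after all other processes get a turn
    return order

# Round-robin off-load order: 0,1,2,0,1,2 for 3 processes, 2 off-loads each.
```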
For each application we use three data sets, briefly termed small, medium, and large. The
large data set occupies the entire memory of a PS3, minus memory used by the operating system
and the hypervisor. The medium and small data sets occupy 40% and 15% of the free memory
of the PS3 respectively. In PBPI the small, medium, and large data sets represent 218 species
with 1250, 3000, and 5000 nucleotides respectively. In RAxML, the small, medium, and large
datasets represent 42, 50, and 107 species respectively. We execute PBPI using weak scaling, i.e., we scale the data set as we add more PS3 nodes, which is the recommended execution
mode. For RAxML we use strong scaling, since the application uses a master-worker paradigm,
where each worker performs independent, parameterized phylogenetic tree bootstrapping and
processes the entire tree independently. Workers are distributed between nodes to maximize
throughput. We perform 192 bootstraps, which is a realistic workload for real-world phyloge-
netic analysis.
Figure 7.3 illustrates the measured execution times of RAxML and PBPI on the PS3 clus-
ter. The predicted execution times on the same charts are derived from the extended MMGP
model, which is discussed in Section 7.4. We make three observations regarding measured
performance:
i) The PS3 cluster scales well under strong scaling (RAxML) and relatively well under weak
scaling (PBPI), for the problem sizes considered. PBPI is more communication-bound than
RAxML, as it involves several collective operations between executions of its Markov-chain
Monte Carlo kernel. We note that due to the hypervisor of the PS3 and the lack of Cell/BE-
specific optimization of the MPI library we used, the performance measurements on the PS3
cluster are conservative.
ii) The optimal layered decomposition of the applications is at the opposite ends of the opti-
mization space. RAxML executes optimally if the PPE on each PS3 is oversubscribed by 6
MPI processes, each off-loading simultaneously on 1 SPE. PBPI generally executes optimally
Figure 7.3: Measured and predicted performance of applications on the PS3 cluster. PBPI isexecuted with weak scaling. RAxML is executed with strong scaling. x-axis notation: Nnode -number of nodes, Nprocess - number of processes per node, NSPE - number of SPEs per process.
with 1 MPI process per PS3 using all 6 SPEs for data-parallel computation, as this configuration
reduces inter-PS3 communication volume and avoids PPE contention.
iii) Although the optimal layered decomposition does not change with the problem size for the
three data sets used in this experiment2, it changes with the scale of the cluster. When PBPI is executed on 8 nodes, the optimal operating point of the code shifts from 1 to 2 MPI processes per node, each off-loading simultaneously on 3 SPEs. We have verified with an out-of-band experiment that this shift is permanent beyond 8 nodes. This shift happens because of the large
drop in the per process overhead of MPI Allreduce() (Figure 7.1), when 2 MPI processes
are packed per node, on 3 or more PS3 nodes. This drop is large enough to outweigh the over-
head due to contention between MPI processes on the PPE. The difficulty in experimentally
discovering the hardware and software implications on the optimal mapping of applications to
asymmetric multi-core clusters motivates the introduction of an analytical model presented in
the next section.
7.4 Modeling Hybrid Parallelism
We present an analytical model of layered parallelism on clusters of asymmetric multi-core
nodes, which is a generalization of a model of parallelism on stand-alone asymmetric multi-core
processors (MMGP), presented in Chapter 6. Our generalized model captures computation and
communication across nodes with host cores and acceleration cores. We specialize the model
for the PS3 cluster, to capture the overhead of non-overlapped DMA operations, wait times
during communication operations in the presence of contention for bandwidth both within and
across nodes, and non-overlapped scheduling overhead on the PPEs. In the rest of the section
we present an overview of the MMGP model and discuss the extensions related to context-switch overhead and to on-chip and inter-node communication.
2. The optimal decomposition does not change with the data set; however, as we show in Section 7.5, the optimal scheduling of an application may change with the data set.
We model the non-overlapped components of execution time on the Cell/BE’s PPE and SPE,
for single-threaded PPE code which off-loads to one SPE as:
T = (Tppe + Oppe) + (Tspe + Ospe) (7.1)
where Tppe and Tspe represent non-overlapped computation, while Oppe and Ospe represent non-
overlapped overhead on the PPE and SPE respectively. We apply the aforementioned model for
phases of parallel computation individually. Phases are separated by collective communication
operations.
7.4.1 Modeling PPE Execution Time
The overhead on the PPE includes instructions and DMA operations to off-load data and code
to SPEs, and wait time for receiving synchronization signals from SPEs on the PPE.
Assuming that multiple PPE threads can simultaneously off-load computation, we introduce
an additional factor for context switching overhead on the PPE. This factor depends on the
thread scheduling algorithm on the PPE. In the general case, Oppe for code off-loaded from a
single PPE thread to l SPEs is modeled as:
Oppe = l · Ooff-load + Tcsw(p) (7.2)
We assume that a single PPE thread off-loads to multiple SPEs sequentially and that the context
switching overhead is a function of the number of threads co-executed on the PPE, which is
denoted by p. Ooff-load is application-dependent and includes the DMA setup overhead, which we measure with microbenchmarks. Tcsw depends on system software and includes the context switching overhead for p/C context switches, where C is the number of hardware contexts on the PPE. The overhead per context switch is also measured with microbenchmarks.
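Equation 7.2 in code form; the cost constants are placeholders for the microbenchmark measurements, not measured values:

```python
def ppe_overhead(l, p, o_offload, t_csw, hw_contexts=2):
    """Oppe = l*Ooff-load + Tcsw(p) (Eq. 7.2): off-loading to l SPEs is
    sequential, and p threads sharing C hardware contexts (C = 2 on the
    PPE) incur roughly p/C context switches of cost t_csw each."""
    return l * o_offload + (p / hw_contexts) * t_csw

# Example: 6 threads, each off-loading to 1 SPE, with hypothetical
# per-off-load and per-switch costs of 5 us and 8 us: 5 + 3*8 = 29 us.
o = ppe_overhead(l=1, p=6, o_offload=5.0, t_csw=8.0)
```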
If a hardware thread on the PPE is oversubscribed with multiple application threads, the
Figure 7.4: Four cases illustrating the importance of co-scheduling PPE threads and SPEthreads. Threads labeled ”P” are PPE threads, while threads labeled ”S” are SPE threads. We as-sume that P-threads and S-threads communicate through shared memory. P-threads poll sharedmemory locations directly to detect if a previously off-loaded S-thread has completed. Stripedintervals indicate yielding of the PPE, dark intervals indicate computation leading to a threadoff-load on an SPE, light intervals indicate computation yielding the PPE without off-loadingon an SPE. Stars mark cases of mis-scheduling.
computation time of each thread may increase due to on-chip resource contention. To accu-
rately model this case, we introduce a scaling parameter, α(p), to the PPE computation component, which depends on the number of threads co-executed on the PPE. The PPE component of the model therefore becomes α(p) · Tppe + Oppe. The factor α(p) is estimated using linear regression with one free parameter, the number of threads sharing a PPE hardware thread, and coefficients derived from training samples of Tppe taken during executions of a microbenchmark
that oversubscribes the PPE with 3-6 threads and executes a parameterized ratio of computation
to memory instructions.
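The regression has a single free parameter, so a closed-form least-squares line is enough. The training samples below are invented for illustration:

```python
def fit_alpha(samples):
    """Least-squares line alpha(p) = a + b*p from (p, slowdown) pairs,
    where slowdown is Tppe measured under contention divided by the
    uncontended Tppe. Returns a predictor for any thread count p."""
    n = len(samples)
    sx = sum(p for p, _ in samples)
    sy = sum(a for _, a in samples)
    sxx = sum(p * p for p, _ in samples)
    sxy = sum(p * a for p, a in samples)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return lambda p: a + b * p

# Hypothetical training samples from a PPE oversubscribed with 3-6
# threads: (thread count, measured slowdown factor).
alpha = fit_alpha([(3, 1.20), (4, 1.32), (5, 1.41), (6, 1.52)])
```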
Figure 7.5: SPE execution
The formulation of Tppe derived thus far ignores additional wait time of threads on the PPE
due to lack of co-scheduling between a PPE thread and an SPE thread off-loaded from it. This
scenario arises when the PPE hardware threads are time-shared between application threads, as
shown in Figure 7.4(a). Ideal co-scheduling requires accurate knowledge of the execution time
of tasks on SPEs by both the operating system and the runtime system. This knowledge is not
generally available. Our model assumes an idealized co-scheduling scenario. SPE tasks for a
given phase of computation are assumed to be of the same execution length and are off-loaded in
bundles with as many tasks per bundle as the number of SPEs on a Cell/BE. We also assume that
the SPE execution time of the first task is long enough to allow for idealized co-scheduling, i.e.
each PPE thread that off-loads a task is rescheduled on the PPE in time to receive immediately the signal from the corresponding finishing SPE task. We explore this scheduling problem in
Section 7.5 under more realistic assumptions and propose solutions.
7.4.2 Modeling the off-loaded Computation
Execution on SPEs is divided into stages, as shown in Figure 7.5. Tspe is modeled as:
Tspe = Tp + Ts (7.3)
Tp denotes the computation executed in parallel by more than one SPE. An example is a parallel
loop distributed across SPEs. Ts denotes the part of the off-loaded computation that is inherently
sequential and cannot be parallelized across SPEs.
When l SPEs are used for parallelization of off-loaded code, the Tspe term becomes:

Tspe = Tp/l + Ts (7.4)
The accelerated execution on SPEs includes three more stages, shown in Figure 7.5. Trec
and Tsen account for PPE-SPE communication latency, while Tc captures the SPE overhead that
occurs when an SPE sends to or receives a message from the PPE. The per-byte latencies for
Trec, Tsen and Tc are application-independent and are obtained from microbenchmarks designed
to stress the PPE-SPE communication. Tp and Ts are application-dependent and are obtained
from a profile of a sequential run of the application, annotated with directives that delimit the
code off-loaded to SPEs.
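Equations 7.3 and 7.4 describe the usual Amdahl-style accelerator scaling; in code, with a hypothetical 90/10 parallel/sequential split:

```python
def spe_time(t_p, t_s, l):
    """Tspe = Tp/l + Ts (Eq. 7.4): the parallel part scales with the
    number of SPEs l; the inherently sequential part does not."""
    return t_p / l + t_s

# Hypothetical split: 90 s parallel, 10 s sequential (invented numbers).
t1 = spe_time(90.0, 10.0, 1)   # 100.0 s on 1 SPE
t8 = spe_time(90.0, 10.0, 8)   # 21.25 s on 8 SPEs
speedup = t1 / t8              # ~4.7x, capped by the sequential part
```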
7.4.3 DMA Modeling
Each SPE on the Cell/BE is capable of moving data between main memory and local storage,
while at the same time executing computation. To overlap computation and communication,
applications use loop tiling and double buffering, which are illustrated in pseudocode in Fig-
ure 7.6. When double-buffering is used, the off-loaded loop can be either computation or com-
1: DMA(Fetch Iteration 1, TAG1);
2: DMA_Wait(TAG1);
3: for( ... ) {
4:   DMA(Fetch Iteration i+1, TAG1);
5:   compute(Iteration i);
6:   DMA_Wait(TAG1);
7:   DMA(Commit Iteration i, TAG2);
   }
8: DMA_Wait(TAG2);
Figure 7.6: Double buffering template for tiled parallel loops.
munication bound: if the amount of computation in a single iteration of the loop is sufficient
to completely mask the latency of fetching the data necessary for the next iteration, the loop is
computation bound. Otherwise the loop is communication bound.
Note that a parallel off-loaded loop can be described using Equation 7.4, independently of
whether the parallel part of the loop is computation or communication bound. In both cases, the
loop iterations are assumed to be distributed evenly across SPEs and blocking DMA accesses
can be interspersed with computation in the loop. With double buffering, the DMA request used
to fetch data for the first iteration, as well as the DMA request necessary to commit data to main
memory after the last iteration, can be neither overlapped with computation, nor distributed
(lines 2 and 8 in Figure 7.6). We capture the effect of blocking and non-overlapped DMA in the
model as:
Ospe = Trec + Tsen + Tc + TDMA (7.5)
The last term in Equation 7.5 is itemized into the blocking DMAs performed within loop iterations and the non-overlapped DMAs exposed when the loop is unrolled, tiled, and executed with double buffering. We use static analysis of the code to capture the DMA sizes.
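The non-overlapped DMA term can be estimated from the loop structure alone, which is why static analysis suffices. An illustrative sketch, with placeholder timings:

```python
def double_buffered_loop_time(n_iters, compute_us, dma_us):
    """Steady-state cost of the loop in Figure 7.6: each iteration pays
    the maximum of its computation and the prefetch it overlaps (the
    loop is computation bound iff compute_us >= dma_us), plus the first
    fetch and final commit, which cannot be overlapped and contribute
    to TDMA in Eq. 7.5."""
    overlapped = n_iters * max(compute_us, dma_us)
    t_dma_exposed = 2 * dma_us   # initial fetch (line 2) + final commit (line 8)
    return overlapped + t_dma_exposed

# Computation bound: 40 us of compute fully masks a 25 us prefetch.
t_cb = double_buffered_loop_time(100, 40.0, 25.0)
# Communication bound: 10 us of compute cannot mask it.
t_mb = double_buffered_loop_time(100, 10.0, 25.0)
```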
7.4.4 Cluster Execution Modeling
We generalize our model of a single asymmetric multi-core processor to a cluster by introducing
an inter-processor communication component as:
T = (Tppe + Oppe) + (Tspe + Ospe) + C (7.6)
We further decompose the communication term C into the communication latency due to each distinct type of communication pattern in the program, including point-to-point and all-to-all com-
munication. Assuming MPI as the programming model used to communicate across nodes or
between address spaces within nodes, we use mpptest to estimate the MPI communication
overhead for variable message sizes and communication primitives. The message sizes are
captured by static analysis of the application code.
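Putting Equations 7.1 through 7.6 together, the cluster-level estimate per computation phase is a simple sum. A sketch, with every input a placeholder parameter rather than a measured value:

```python
def mmgp_phase_time(t_ppe, alpha_p, o_ppe, t_p, t_s, l, o_spe, comm):
    """T = (alpha(p)*Tppe + Oppe) + (Tp/l + Ts + Ospe) + C (Eq. 7.6,
    with the contention factor of Section 7.4.1 folded in). `comm` is
    the MPI cost for this phase, looked up from mpptest measurements."""
    ppe = alpha_p * t_ppe + o_ppe
    spe = t_p / l + t_s + o_spe
    return ppe + spe + comm

# Hypothetical phase on 1 node, 1 process, 6 SPEs (all times in ms):
t = mmgp_phase_time(t_ppe=5.0, alpha_p=1.0, o_ppe=0.5,
                    t_p=60.0, t_s=4.0, l=6, o_spe=1.5, comm=2.0)
```

Summing this expression over all phases, and taking the maximum over nodes when work is unevenly distributed, gives the whole-program estimate that the verification below exercises.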
7.4.5 Verification
We verify our model by exhaustively executing PBPI and RAxML on all feasible layered de-
compositions that use 1 to 6 PPE threads, 1 to 6 SPEs per PPE and up to 8 PS3 nodes. Fig-
ure 7.3(a),(b) illustrates that the model is accurate both in terms of predicting execution time and
in terms of discovering optimal application decompositions and mappings for different cluster
scales and data sets. The optimal decomposition may vary across multiple dimensions, includ-
ing application characteristics, such as granularity of off-loaded tasks, and frequency and size
of communication and DMA operations, size and structure of the data set used in the applica-
tion, and number of nodes available to the application for execution. Accurate modeling of the
application under each scenario is valuable to tame the complexity of discovering the optimal
decomposition and mapping experimentally. In our test cases, the model achieves error rates
consistently under 15%. The mean error rate is 5.2%. The errors tend to be higher when the
PPE is oversubscribed with a large number of processes, due to error in estimating the thread
interference factor. With respect to prediction accuracy for any given application, data set, and
number of PS3 nodes, the model predicts accurately the optimal configuration and mapping in
all 48 test cases.
7.5 Co-Scheduling on Asymmetric Clusters
Although our model projects optimal mappings of MPI applications on the PS3 cluster with
high accuracy, it is oblivious to the implications of user-level and kernel-level scheduling on
oversubscribed cores. More specifically, the model ignores cases in which PPE threads and
SPE threads are not co-scheduled when they need to synchronize through shared memory. We
explore user-level co-scheduling solutions to this problem.
The main objective of co-scheduling is to minimize slack time on SPEs, since SPEs bear
the brunt of the computation in practical cases. This slack is minimized whenever a thread off-
loaded to an SPE needs to communicate or synchronize with its originating thread at the PPE
and the originating thread is running on a PPE hardware context.
As illustrated in Figure 7.4, different scheduling policies can have a significant impact on co-
scheduling, slack, SPE utilization, and eventually performance. In Figure 7.4(a), PPE threads
are spinning while waiting for the corresponding off-loaded threads to return results from SPEs.
The time quantum allocated to each PPE thread by the OS can cause continuous mis-scheduling
of PPE threads with respect to SPE threads.
In Figure 7.4(b), the user-level scheduler uses a yield-if-not-ready policy, which forces each
PPE thread to yield the processor, whenever a corresponding off-loaded SPE thread is pend-
ing completion. This policy can be implemented at user-level by having PPE threads poll
shared-memory flags that matching SPE threads set upon completion. Figure 7.7 illustrates
the performance of this policy in PBPI and RAxML on a PS3 cluster, when the PPE on each
node is oversubscribed with 6 MPI processes, each off-loading on 1 SPE (recall that the PPE
is a two-way SMT processor). The results show that compared to a scheduling policy which
is oblivious to PPE-SPE co-scheduling (Linux scheduling policy), yield-if-not-ready achieves
a performance improvement of 1.7–2.7×, on a cluster composed of PS3 nodes. Yield-if-not-
ready bounds the slack by the time needed to context switch across p − 1 PPE threads, where p is the total number of active PPE threads, but can still cause temporary mis-scheduling and slack, as
shown in Figure 7.4(c). Figure 7.4(d) illustrates an adaptive spinning policy, in which a thread
either spins or yields the processor, based on which thread is anticipated to offload the soonest
on an SPE. This policy uses a prediction which can be derived with various algorithms; the simplest uses the execution length of the most recently off-loaded task from any
Figure 7.7: Performance of the yield-if-not-ready policy and the native Linux scheduler in PBPI and RAxML. x-axis notation: Nnode - number of nodes, Nprocess - number of processes per node, NSPE - number of SPEs per process.
given thread as a predictor of the earliest time that the same thread will be ready to off-load in
the future. The thread spins if it anticipates that it will be the first to off-load, otherwise it yields
the processor.
Although the aforementioned adaptive policy can reduce accelerator slack compared to the
yield-if-not-ready policy, it is still suboptimal, as it may mis-schedule threads due to variations
in the execution lengths of consecutive tasks off-loaded by the same thread, or variations in
the run lengths between any two consecutive off-loads on a PPE thread. We should also note
that better policies, with tighter bounds on the maximum slack, can be obtained if the user-level scheduler is not oblivious to the kernel-level scheduler and vice versa. Devising and implementing such policies is described in Chapter 8.
Figure 7.8 illustrates results when RAxML and PBPI are executed with various co-scheduling
policies. Both applications are executed with variable sequence length (x-axis), hence variable
SPE task sizes. In RAxML, Figure 7.8(a), adaptive spinning performs better for small data sets,
while yield-if-not-ready performs better for large data sets. In PBPI, Figure 7.8(b), adaptive
spinning outperforms yield-if-not-ready in all cases. In RAxML, the variance in the length of the off-loaded tasks increases with the size of the input sequence, which causes more mis-scheduling under the adaptive policy. In PBPI, the task length does not vary, which enables nearly optimal co-scheduling by the adaptive spinning policy. In general, the best co-scheduling algorithm can improve performance by more than 10%. We emphasize that the optimal co-scheduling policy changes with the dataset; therefore, support for flexible co-scheduling algorithms in system software is essential on the PS3 cluster.
7.6 PS3 versus IBM QS20 Blades
We compare the performance of the PS3 cluster to a cluster of IBM QS20 dual-Cell/BE blades
located at Georgia Tech. The Cell/BE processors on the QS20 have 8 active SPEs, versus 6 on the PS3, and possibly other undisclosed microarchitectural differences. Furthermore, although both the QS20 cluster
and the PS3 cluster use GigE, communication latencies tend to be markedly lower on the QS20
cluster, first due to the absence of a hypervisor, which is a communication bottleneck on the PS3
cluster, and second due to exploitation of shared-memory communication between two Cell/BE
processors on each QS20, instead of one Cell/BE processor on each PS3.
We present selected experimental data points where the two platforms use the same number
of Cell processors. On the QS20 cluster, we use both Cell processors per node. Figure 7.9
illustrates execution times of PBPI and RAxML on the two platforms. We report the execution
Figure 7.8: Performance of different scheduling strategies in PBPI and RAxML.
time of the most efficient pair of application configuration and co-scheduling policy, on any
given number of Cell processors.
We observe that the performance of the PS3 cluster is reasonably close (within 14% to 27%
for PBPI and 11% to 13% for RAxML) to the performance of the QS20 cluster. The difference
is attributed to the reduced number of active SPEs per processor on the PS3 cluster (6 versus 8
for the QS20 cluster), and to the faster communication on the QS20 cluster. The difference between the two platforms is smaller in RAxML than in PBPI, as RAxML is not as communication-intensive.
Interestingly, if we compare data points with the same total number of SPEs (48 SPEs on 8 PS3s versus 48 SPEs on 6 QS20s), in RAxML the PS3 cluster outperforms the QS20 cluster.
Figure 7.9: Comparison between the PS3 cluster and an IBM QS20 cluster.
This result does not indicate superiority of the PS3 hardware or system software, as we apply
experimentally defined optimal decompositions and scheduling policies on both platforms. It
rather indicates the implications of layered parallelization. Oversubscribing the QS20 with 8
MPI processes (versus 6 on the PS3) introduces significantly higher scheduling overhead and
brings performance below that of the PS3. This result stresses our earlier observations on the
necessity of models and better schedulers for asymmetric multi-core clusters.
7.7 Chapter Summary
We evaluated a very low-cost HPC cluster based on PS3 consoles and proposed a model of
asymmetric parallelism and software support for orchestrating asymmetric parallelism extracted
from MPI programs on the PS3 cluster. While the Sony PlayStation 3 has several limitations as an HPC platform, including limited storage and limited support for advanced networking, it offers substantial computational power compared to vastly more expensive multi-processor blades
and forms a solid experimental testbed for research on programming and runtime support for
asymmetric multi-core clusters, before migrating software and applications to production-level
asymmetric machines, such as the LANL RoadRunner. The model presented in this thesis
accurately captures heterogeneity in computation and communication substrates and helps the
user or the runtime environment map layered parallelism effectively to the target architecture.
The co-scheduling heuristics presented in this thesis increase parallelism and minimize slack
on computational accelerators.
Chapter 8
Kernel-Level Scheduling
8.1 Introduction
The ideal scheduling policy, which minimizes the context-switching overhead, ensures that
whenever an SPE communicates with the PPE, the corresponding PPE thread is scheduled and
running on the PPE. In Chapter 7 we discussed the possibility of predicting the next-to-run
thread on the PPE. We implemented a prototype of a scheduling strategy capable of predicting
which process will run next, and the results imply that such prediction is difficult, especially
if the off-loaded tasks exhibit high variance in execution time.
As another approach to minimizing the context-switching overhead on the PPE, we investigate
a user-level scheduler capable of influencing kernel scheduling decisions. We explore scheduling
strategies in which the scheduler decides not only when a process should release the PPE, but
also which process will run next on the PPE. By reducing the scheduling response time on the
PPE, our new approach also reduces the idle time that the SPEs spend waiting for the next
off-loaded task. We call our new scheduling strategy the Slack-minimizer Event-Driven Scheduler (SLEDS).
Figure 8.1: Upon completing its assigned task, each SPE signals the PPE processes through the ready-to-run list. The PPE process which decides to yield passes the data from the ready-to-run list to the kernel, which in turn can schedule the appropriate process on the PPE.

Besides improving the overall performance, the new scheduling strategy enables more accurate performance modeling. Although the MMGP model projects the most efficient mappings
of MPI applications on the Cell processor with high accuracy, it is oblivious to the implications
of user-level and kernel-level scheduling on oversubscribed cores. More specifically, the model
ignores cases in which PPE threads and SPE threads are not appropriately co-scheduled. A
scheduling policy in which PPE threads are not arbitrarily scheduled by the OS introduces
more regularity in the application execution and consequently improves the MMGP predictions.
8.2 SLED Scheduler Overview
The SLED scheduler is invoked through user-level library calls that can easily be integrated
into an existing Cell application. An overview of the SLED scheduler is illustrated in
Figure 8.1. Upon completing its assigned task, each SPE thread writes its pid to the shared
ready to run list, from which this information is passed on to the kernel. Knowing which SPE
threads have finished processing their assigned tasks, the kernel can decide which process
will run next on the PPE.
Although it is invoked through a user-level library, part of the scheduler resides in the ker-
nel. Therefore, the implementation of the SLED scheduler can be vertically divided into two
distinguishable parts:
1. The user-level library, and
2. The kernel code that enables accepting and processing the user-level information, which
is further used in making kernel-level scheduling decisions.
Figure 8.2: Vertical overview of the SLED scheduler. The user-level part contains the ready-to-run list, shared among the processes, while the kernel part contains the system call through which the information from the ready-to-run list is passed to the kernel.
Passing the information from the ready to run list to the kernel can be achieved in two ways:
• The information from the list can be read by the processes running on the PPE, and the
information can be passed to the kernel through a system call, or
• The ready to run list can be visible to the kernel and the kernel can directly read the
information from the list.
In the current study we follow the first approach, where the information is passed to the
kernel through a system call (see Figure 8.2). Placing the ready to run list inside the kernel will
be the subject of future research. In the current implementation of the SLED scheduler, the
size of the list is constant and is equal to the total number of SPEs available on the system. Each
SPE is assigned an entry in the list. We describe the organization of the ready to run list in
the following section.
8.3 ready to run List
In the current implementation of the IBM SDK on the Cell BE, the local storage of every SPE
is memory-mapped to the address space of the process which has spawned the SPE thread.
Using DMA requests, the running SPE thread is capable of accessing the global memory of the
system. However, these accesses are restricted to the areas of main memory that belong to the
address space of the corresponding PPE thread. Therefore, if the SPE threads do not belong
to the same process, the only way to share a data structure among them is through globally
shared memory segments residing in main memory.
The ready to run list needs to be a shared structure accessible by all SPE threads (even if
they belong to different processes). Therefore, it is implemented as a part of a global shared
memory region. The shared memory region is attached to each process at the beginning of
execution.
8.3.1 ready to run List Organization
Initial observation suggests that the ready to run list should be organized in the FIFO order,
i.e. the process corresponding to the SPE thread which was the first to finish processing a
task, should also be the first to run on the PPE. Nevertheless, the strict FIFO organization of
the scheduler might cause certain problems. Consider the situation where the PPE process
A has off-loaded, but the granularity of the off-loaded task is relatively small, and the SPE
execution finishes before process A has had a chance to yield on the PPE side. If process B is
waiting in the ready to run list to be scheduled, and the FIFO order is strictly followed,
process A will yield the PPE and process B will be scheduled to run on the PPE. In this
scenario, strict FIFO scheduling causes an unnecessary context switch (there is no need for
process A to yield the PPE to process B).
Therefore, the SLED scheduler is not designed as a strict FIFO scheduler. Instead, after
off-loading and before yielding, the process verifies that the off-loaded task is still executing.
If the SPE task has already finished, the PPE process continues executing instead of yielding.
Following the described soft FIFO policy, it is possible that a process (call it A) will not
yield the PPE upon off-loading, while at the same time the completed off-loaded task writes the
pid of process A to the ready to run list. Because the pid is written to the list, at some point
process A will be scheduled by the SLED scheduler. However, when scheduled by the SLED
scheduler, process A might not have anything useful to process, since it did not yield upon
off-loading. To avoid this situation, a process that decides not to yield upon off-loading also
needs to clear the field in the ready to run list that has been filled with its own pid. Since
multiple processes would require simultaneous read/write access to the list, maintaining the list
in a consistent state would require locks, which can introduce significant overhead.
Instead of allowing processes to access any field in the ready to run list, we found it more
efficient to assign each process an exclusive entry in the list. By not allowing processes to
share entries in the ready to run list, we avoid any type of locking, which significantly
reduces the overhead of keeping the list consistent.
8.3.2 Splitting ready to run List
The ready to run list serves as a buffer through which the necessary off-loading-related
information is passed to the kernel. Initially, the SLED scheduler was designed to use a single
ready to run list. However, in certain cases the single-list design forced the SLED scheduler to
perform process migration across the PPE execution contexts.
Consider the situation described in Figure 8.3. The off-loaded task belonging to process P1
has finished processing on the SPE side (the pid of process P1 has been written to the
ready to run list). Process P1 is bound to CPU1, but process P2, which is running on CPU2,
off-loads and initiates the context switch by passing the pid of process P1 to the kernel.
Since the context switch occurred on CPU2 and P1 is bound to run on CPU1, the kernel needs
to migrate process P1 to CPU2. Initially, we implemented a system call which performs the
process migration; the design of this system call is outlined in Figure 8.4. The essential
step in this system call is in Line 9, where the sched_migrate_task() function is invoked.
This is a kernel function which accepts two parameters: the task to be migrated and the
destination cpu to which the task should migrate.
Figure 8.3: Process P1, which is bound to CPU1, needs to be scheduled to run by the scheduler that was invoked on CPU2. Consequently, the kernel needs to migrate process P1 from CPU1 to CPU2.
The described process migration has some drawbacks. Specifically, the sched_migrate_task()
function can be expensive due to the required locking of the run queues, and it can also create
an uneven distribution of processes across the available cpus. To avoid these drawbacks, we
redesigned the ready to run list. Instead of a single ready to run list shared among all
processes in the system, we assign one ready to run list to each execution context on the PPE.
In this way, only processes sharing an execution context access the same ready to run list.
This mechanism is presented in Figure 8.5. With the separate ready to run lists there is no
longer any need for expensive task migration, and we also avoid possible load imbalance on the
PPE processor.
1:  void migrate(pid_t next){
2:      int i, j = 0;
3:      p = find_process_by_pid(next);
4:      if (p){
5:          rq_p = task_rq(p);
6:          this_cpu = smp_processor_id();
7:          p_cpu = task_cpu(p);
8:          if (p_cpu != this_cpu && p != rq_p->curr){
9:              sched_migrate_task(p, this_cpu);
10:         }
11:         SLEDS_yield(next); ...
12: }

Figure 8.4: System call for migrating processes across the execution contexts. The function sched_migrate_task() performs the actual migration. The SLEDS_yield() function schedules the process to be the next to run on the CPU.
8.4 SLED Scheduler - Kernel Level
The standard scheduler used in the Linux kernel, starting from version 2.6.23, is the Completely
Fair Scheduler (CFS). CFS implements a simple algorithm based on the idea that at any given
moment in time, the CPU should be evenly divided across all active processes in the system.
While this is a desirable theoretical goal, in practice it cannot be achieved since at any moment
in time the CPU can serve only one process. For each process in the system, the CFS records
the amount of time that the process has been waiting to be scheduled on the CPU. Based on the
amount of time spent waiting to be scheduled and the number of processes in the system, as
well as the static priority of the process, each process is assigned a dynamic priority. The
dynamic priority of a process determines when and for how long the process will be scheduled
to run.
The structure used by the CFS for storing the active processes is a red-black tree. The
processes are stored in the nodes of the tree, and the process with the highest dynamic priority
Figure 8.5: The ready to run list is split in two parts. Each of the two sublists contains the processes that share an execution context (CPU1 or CPU2). This approach avoids any possibility of expensive process migration across the execution contexts.
(which will be the first to run on the CPU) is stored in the leftmost node of the tree.
The SLED scheduler passes the information from the ready-to-run list to the kernel through
the SLEDS_yield() system call. SLEDS_yield() extends the standard sched_yield() system call
by accepting an integer parameter pid, which represents the process that should be the next
to run. A high-level overview of the SLEDS_yield() function is given in Figure 8.6(a)-(c)
(assuming the passed parameter pid is nonzero). First, the process which should be the next
to run is pulled out of the running tree, and its static priority is increased to the maximum
value. The process is then returned to the running tree, where it will be stored in the
leftmost node (since it has the highest priority). After the process is returned to the tree,
its static priority is decreased back to the normal value. Besides increasing the static
priority of the process, we also increase the time that the process is allowed to run on the
CPU. Increasing the CPU time is important: if a process is artificially scheduled to run
many times, it might exhaust all the CPU time that it was assigned by the Linux scheduler.
Figure 8.6: Execution flow of the SLEDS_yield() function: (a) The appropriate process is found in the running list (tree), (b) The process is pulled out from the list, and its priority is increased, (c) The process is returned to the list, and since its priority is increased it will be stored at the leftmost position.
In that case, although we are capable of scheduling the process to run on the CPU using the
SLEDS_yield() function, the process would almost immediately be switched out by the kernel.
Before it exits, the SLEDS_yield() function calls the kernel-level schedule() function, which
initiates context switching.
We measured the overhead in the SLEDS_yield() system call caused by the operations performed
on the running tree. We found that SLEDS_yield() incurs an overhead of approximately 8%
compared to the standard sched_yield() system call.
8.5 SLED Scheduler - User Level
Figure 8.7 outlines the part of the SLED scheduler that resides in user space. Upon off-loading,
the process is required to call the SLEDS_Offload() function (Figure 8.7, Line 13). This
function polls a member of the structure signal in order to check whether the SPE has finished
processing the off-loaded task. Structure signal resides in the local storage of an SPE, and
the process executing on the PPE knows the address of this structure and uses it to access the
structure's members. While the SPE task is running, the stop field of the structure signal is
equal to zero; upon completion of the off-loaded task, the value of this field is set to one.
If the SPE has not finished processing the off-loaded task, the SLEDS_Offload() function
calls the _yield() function (Figure 8.7, Line 15). The _yield() function scans the
ready to run list searching for an SPE that has finished processing its assigned task
(Figure 8.7, Lines 3–10). Two interesting things can be noticed in the function _yield().
First, the function scans only three entries in the ready to run list. The reason for this is
that the list is divided among the execution contexts on the PPE, as described in Section 8.3.2.
Since the presented version of the scheduler is adapted to the PlayStation 3 (which contains a
Cell processor with only 6 SPEs), each ready to run list contains only 3 entries. Second, the
list is scanned at most N times (see Figure 8.7, Line 3), after which the process is forced to
yield. If the N parameter is relatively large, repeated scanning of the ready to run list
becomes harmful to the process executing on the adjacent PPE execution context. However, if
the parameter N is not large enough, the process might yield before having a chance to find
the next-to-run process.
Although the results presented in Figure 8.8 show that the execution time of RAxML depends on
N, a theoretical model capable of describing this dependence is left for future work.
Currently, for RAxML we chose N to be 300, as this is the value which achieves the
most efficient execution in our test cases (Figure 8.8). For PBPI we did not see any variance in
execution times for values of N smaller than 1000. When N is larger than 1000, performance of
1:  void _yield(){
2:      int next = 0, i, j = 0;
3:      while(next == 0 && j < N){
4:          i = 0;
5:          j++;
6:          while(next == 0 && i < 3){
7:              next = ready_to_run[i];
8:              i++;
9:          }
10:     }
11:     SLEDS_yield(next);
12: }

13: void SLEDS_Offload(){
14:     while (((struct signal *)signal)->stop == 0){
15:         _yield();
16:     }
17: }

Figure 8.7: Outline of the SLEDS scheduler: Upon off-loading, a process is required to call the SLEDS_Offload() function. SLEDS_Offload() checks if the off-loaded task has finished (Line 14) and, if not, calls the _yield() function. _yield() scans the ready to run list and yields to the next process by executing the SLEDS_yield() system call.
PBPI decreases due to contention caused by scanning the ready to run list.
8.6 Experimental Setup
To test the SLED scheduler we used the Cell processor built into the PlayStation 3 console.
As an operating system, we used a variant of the 2.6.23 Linux kernel, specially adapted to run
on the PlayStation 3. We also modified the kernel by introducing the system calls necessary
for executing the SLED scheduler. We used SDK 2.1 to execute our applications on the Cell.
Figure 8.8: Execution times of RAxML when the ready to run list is scanned between 50 and 1000 times. The x-axis represents the number of scans of the ready to run list; the y-axis represents the execution time. Note that the lowest value on the y-axis is 12.5, and the difference between the lowest and the highest execution time is 4.2%. The input file contains 10 species, each represented by 1800 nucleotides.
8.6.1 Benchmarks
In this section we describe the benchmarks used to test the performance of the SLED scheduler.
We compared the SLED scheduler to the EDTLP scheduler using microbenchmarks and real-
world bioinformatics applications, RAxML and PBPI.
The microbenchmarks we used are designed to imitate the behavior of real applications
utilizing the off-loading execution model. Using the microbenchmarks, we aimed to determine
the dependence of the context-switch overhead on the size of the off-loaded tasks.
8.6.2 Microbenchmarks
The microbenchmarks we designed are composed of multiple MPI processes, and each process
uses an SPE for task off-loading. The tasks in each process are repeatedly off-loaded inside a
loop which iterates 1,000,000 times. The part of the process executed on the PPE only initiates
task off-loading and waits for the off-loaded task to complete. The off-loaded task executes
a loop which may vary in length. In our experiments we oversubscribe the PPE with 6 MPI
processes.

Figure 8.9: Comparison of the EDTLP and SLED schemes using microbenchmarks: total execution time is measured as the length of the off-loaded tasks is increased.
We compare the performance of the microbenchmarks using the SLED and EDTLP sched-
ulers. Figure 8.9 represents the total execution time of the microbenchmarks that are executed
with different lengths of the off-loaded tasks. For large task sizes the SLED scheduler outper-
forms EDTLP by up to 17%. However, when the size of the off-loaded task is relatively small,
the EDTLP scheme outperforms the SLED scheme by up to 29%, as represented in Figure 8.10.
We will use the example presented in Figure 8.11 to explain the behavior of the EDTLP and
SLED schemes for small task sizes. Assume that 3 processes, P1, P2, and P3 are oversubscribed
on the PPE. In the EDTLP scheme (Figure 8.11, EDTLP), upon off-loading, P1 yields and the
operating system decides which process should run next on the PPE. Since the process
P1 was the first to off-load and yield, it is not likely that the same process will be scheduled
until all other processes have off-loaded and been switched out from the PPE. If the size of
the off-loaded task is relatively small, by the time the process P1 gets scheduled to run again
on the PPE, the off-loaded task will already be completed and the process P1 can immediately
continue running on the PPE.
Figure 8.10: Comparison of the EDTLP and SLED schemes using microbenchmarks: total execution time is measured as the length of the off-loaded tasks is increased; task size is limited to 21µs.
Consider now the situation represented in Figure 8.11 (SLED), when the SLED scheduler is
used for scheduling processes with small off-loaded tasks. Due to the complexity introduced
by the SLED scheduler, the time necessary for a context switch to complete is increased.
Consequently, the time interval for process P1 between off-loading and the next opportunity
to run on the PPE increases. Based on this analysis, we can conclude that for scheduling
processes with relatively fine-grain off-loaded tasks (execution time shorter than 15µs),
it is more efficient to use the EDTLP scheme than the SLED scheme.
For coarser task sizes (the execution time of a task is longer than 15µs), the SLED scheme
almost always outperforms the EDTLP scheme. The exceptions are certain task sizes that are
an exact multiple of the scheduling interval, as can be seen in Figure 8.9. The scheduling
interval is the time after which a process, upon off-loading, will be scheduled to run again
on the PPE. For such task sizes, the processes are ready to run at the exact moment they get
scheduled on the PPE under the EDTLP scheme alone. We point out that these situations are
rare; in the real applications described in Section 8.6.3 and Section 8.6.4 we did not
observe this behavior.
Figure 8.11: EDTLP outperforms SLED for small task sizes due to the higher complexity of the SLED scheme.
Figure 8.12: Comparison of the EDTLP scheme and the combination of the SLED and EDTLP schemes using microbenchmarks. EDTLP is used for task sizes smaller than 15µs.
Figure 8.13: Comparison of the EDTLP scheme and the combination of the SLED and EDTLP schemes using microbenchmarks. EDTLP is used for task sizes smaller than 15µs; task size is limited to 21µs.
To address the issues related to the small task sizes (when EDTLP outperforms SLED), we
combined the two schemes into a single scheduling policy. The EDTLP scheme is used when
the size of the off-loaded tasks is smaller than 15µs. The results of the combined scheme are
presented in Figure 8.12 and Figure 8.13.
8.6.3 PBPI
We also compared the performance of the two schemes, EDTLP and SLED, using the PBPI
application. As an input file for PBPI we used a set that contains 107 species and we varied
the length of the DNA sequence that represents the species. In the PBPI application, the length
of the input DNA sequence is directly related to the size of the off-loaded tasks. We varied the
length of the DNA sequence from 200 to 5,000 nucleotides. Figure 8.14 shows the execution
time of PBPI when the EDTLP and SLED scheduling schemes are used. In all experiments the
configuration for PBPI was 6 MPI processes, each assigned an SPE for off-loading the
expensive computation. As in the previous example, the EDTLP outperforms the SLED scheme
[Plot: Execution Time (s), 0–40, versus Sequence Size, 200–5,000, for the SLEDS and EDTLP schemes.]
Figure 8.14: Comparison of the EDTLP and SLED schemes using the PBPI application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis).
for small task sizes. Again we combined the two schemes, using EDTLP for task sizes smaller
than 15µs and SLED for larger task sizes; the resulting performance is presented in Figure 8.15.
The combined scheme consistently outperforms the EDTLP scheduler, and the highest
difference we recorded is 13%.
8.6.4 RAxML
We executed RAxML with an input file that contained 10 species in order to compare the
EDTLP and SLED schedulers. As in the PBPI case, we varied the length of the input DNA
sequence, since the size of the input sequence is directly related to the size of the off-loaded
tasks. The length of the sequence in our experiments was between 100 and 5,000 nucleotides.
In the case of RAxML, SLED outperforms EDTLP by up to 7%. As in previous experiments,
for relatively small task sizes the EDTLP scheme outperforms the SLED scheme, as shown
in Figure 8.16 and Figure 8.17. For larger task sizes the SLED scheme outperforms EDTLP.
Again, by combining the two schemes we can achieve the best performance.
[Plot: Execution Time (s), 0–40, versus Sequence Size, 200–5,000, for the SLEDS+EDTLP and EDTLP schemes.]
Figure 8.15: Comparison of EDTLP and the combination of SLED and EDTLP schemes using the PBPI application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis).
[Plot: Execution Time (s), 0–50, versus Sequence Size, 100–4,900, for the SLEDS and EDTLP schemes.]
Figure 8.16: Comparison of the EDTLP and SLED schemes using the RAxML application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis).
[Plot: Execution Time (s), 0–25, versus Sequence Size, 100–600, for the SLEDS and EDTLP schemes.]
Figure 8.17: Comparison of EDTLP and the combination of SLED and EDTLP schemes using the RAxML application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis).
8.7 Chapter Summary
In this chapter we investigated strategies for reducing the scheduling overhead that occurs
on the PPE side of the Cell BE. We designed and tested the SLED scheduler, which uses
user-level off-loading information to influence kernel-level scheduling decisions.
On a PlayStation 3, which provides 6 SPEs, we conducted a set of experiments comparing the
SLED and EDTLP scheduling schemes. For comparison we used the real scientific applications
RAxML and PBPI, as well as a set of microbenchmarks developed to simulate the behavior of
larger applications. Using the microbenchmarks, we found that the SLED scheme is capable
of outperforming the EDTLP scheme by up to 17%. SLEDS performs better by up to 13% with
PBPI, and by up to 7% with RAxML. Note that a higher advantage of the SLED scheme is likely
on a Cell BE with all 8 SPEs available (the Cell BE used in the PS3 exposes only 6 SPEs),
due to higher PPE contention and consequently higher context-switch overhead.
Chapter 9
Future Work
This chapter discusses directions for future work. The proposed extensions are summarized as
follows:
• We plan to extend the presented kernel-level scheduling policies by moving the
ready-to-run list into the kernel and by considering additional scheduling parameters, such as
load balancing and job priorities, when making scheduling decisions.
• We plan to increase utilization of the host and accelerator cores by sharing the accelerators
among multiple tasks and by extending loop-level parallelism to include the host
core in addition to the already considered accelerator cores.
• We plan to port more applications to Cell. Specifically, we will focus on streaming,
memory-intensive applications, and evaluate the capability of Cell to execute these
applications. By using memory-intensive applications, we hope to gain better insight into
scheduling strategies that would enable efficient execution of communication-bound
applications on asymmetric processors. We consider this to be an important problem,
since memory and bus contention will grow rapidly as the number of cores in multi-core
asymmetric architectures increases.
• Most of the techniques presented in this thesis are not specifically designed for Cell and
heterogeneous accelerator-based architectures, and in our future work we plan to extend
them to homogeneous parallel architectures.
• Finally, we plan to extend the MMGP model by capturing the overhead caused by
Element Interconnect Bus congestion, which can significantly limit the ability of Cell to
overlap computation and communication.
We expand on our plans for future work in the following sections.
9.1 Integrating the Ready-to-Run List into the Kernel
As described in Chapter 8, the SLED scheduler spans both the kernel and the user space.
The ready-to-run list resides in user space and is shared among all active processes. The
information from the ready-to-run list is passed to the kernel-level part of the SLED scheduler
through a system call. Based on the received information, the kernel part of the SLED scheduler
biases kernel scheduling decisions. In the remainder of this section we explain the possible
drawbacks of having the ready-to-run list reside in user space.
The timeline diagram of the SLED scheduler is presented in Figure 9.1 (upper figure). Upon
off-loading, each process issues a call to the SLED scheduler. The scheduler iterates through the
ready-to-run list in order to determine the pid of the next process. As presented in Figure 9.1
(upper figure), it is possible that all processes have been switched to SPE execution, in which
case the scheduler will iterate through the list until one of the SPEs sends a signal to the ready-to-run
list. Therefore, some idle time (when no useful work is performed) is likely to occur
after off-loading and before the next-to-run process is found. Once it finds the next-to-run
process, the scheduler switches to kernel mode and influences the kernel scheduler to run
the appropriate process.
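The user-level path just described can be sketched as follows. The list layout, the spin loop, and the `sled_hint` stub standing in for the system call are all illustrative assumptions about the mechanism, not the actual SLED code:

```c
#include <stddef.h>
#include <sys/types.h>

#define MAX_PROCS 8

/* Hypothetical layout of the shared ready-to-run list: one slot per PPE
 * process; an SPE writes the owning process's pid into a slot when the
 * task that process off-loaded has completed. */
struct ready_list {
    volatile pid_t ready[MAX_PROCS];   /* 0 = slot empty */
    int nprocs;
};

static pid_t last_hint;                /* records the last hint issued */

/* Stub for the system call that passes the chosen pid to the kernel-level
 * part of SLED, which then biases the kernel scheduler toward it. */
static void sled_hint(pid_t next) { last_hint = next; }

/* Called by a process right after it off-loads: iterate over the shared
 * list until some SPE signals a completion, then hand that pid to the
 * kernel.  The busy iteration is exactly the idle window discussed above. */
static pid_t sled_yield(struct ready_list *rl)
{
    for (;;) {
        for (int i = 0; i < rl->nprocs; i++) {
            pid_t p = rl->ready[i];
            if (p != 0) {              /* an SPE marked this process ready */
                rl->ready[i] = 0;
                sled_hint(p);          /* user-to-kernel switch happens here */
                return p;
            }
        }
    }
}
```

The two context switches discussed below (user process to kernel, kernel back to the chosen process) both happen after `sled_yield` returns its decision.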
The possible drawback of this scheme is that, upon determining which process should run
next, the system still needs to perform two context switches: between the user process
[Timeline diagrams of the SLED scheduler, split into user-level and kernel-level activity. Upper: a process runs, off-loads, and switches to SLED; SLED iterates through the ready_to_run list in user space until it finds the next process; a SLED-to-kernel context switch follows, the kernel schedules the next process to run, and after a kernel-to-process context switch the next process continues. Lower: the SLED-to-kernel context switch happens first and SLED iterates through the ready_to_run list from kernel space, so once the process is found only the kernel-to-process context switch remains before the next process continues.]
Figure 9.1: Upon completing the assigned tasks, SPEs send signals to PPE processes through the ready-to-run list. The PPE process which decides to yield passes the data from the ready-to-run queue to the kernel, which in turn can schedule the appropriate process on the PPE.
and the kernel, and between the kernel and the user process. In our future work we plan to allow
the kernel to directly access the list, which would eliminate one context switch. In other words, by
allowing the kernel to see the ready-to-run list, we overlap the first context switch with the idle
time which occurs before one of the active processes is ready to be rescheduled on the PPE.
When the next-to-run process is determined, the scheduler will already be in kernel space,
and only one context switch remains to return execution to the specific process in
user space; see Figure 9.1 (bottom figure).
9.2 Load Balancing and Task Priorities
So far we have considered applying the MGPS and SLED schedulers only to a single application.
In our future work we plan to investigate the described scheduling strategies in the context of
multi-program workloads. Since the schedulers are already designed to work in a distributed
environment, using them with entirely separate applications should be relatively simple.
However, we envision several challenges with multi-program workloads that could potentially
influence system performance.
First, using the SLED scheduler with multi-program workloads can cause load imbalance.
The SLED scheduler contains two ready-to-run lists, and each list is shared among the processes
running on a single CPU. Therefore, the scheduler needs to be capable of deciding how to
group the active processes across CPUs in order to minimize load imbalance. The grouping of
processes will depend on parameters such as the granularities of the PPE and SPE tasks, PPE-SPE
communication, and inter-node and on-chip communication. Furthermore, the scheduler needs to
be able to recognize when the load of the system has changed (for example, when one of the
processes has finished executing), and appropriately reschedule the remaining tasks across the
available CPUs.
Besides being able to handle load-balancing issues, our future work will focus on including
support for real-time tasks in our scheduling policies. So far in our experiments all
processes were assumed to have the same priority. This is not the case in all situations; one
example would be streaming video applications. While trying to increase system throughput
with different process grouping and load-balancing policies, we might actually hurt the performance
of the real-time jobs in the system. A simple example would be a real-time task
grouped with processes that require a lot of resources. Although this might be the best grouping
decision for overall system performance, that particular real-time task might suffer performance
degradation. To address these issues, we plan to include multiple applications in our
experiments and focus more on load-balancing problems as well as real-time task priorities.
9.3 Increasing Processor Utilization
Our initial scheduling scheme, Event-Driven Task-Level Parallelization (EDTLP), reduces the
idle time on the PPE by forcing each process to yield upon off-loading, and assigning the PPE
to a process which is ready to do work on the PPE side. To further reduce the idle time on
both the PPE and the SPEs, we developed the Slack-Minimizer Event-Driven Scheduler (SLEDS).
In our future work, as another approach to increasing utilization of the SPEs, we plan to introduce
sharing of SPEs among multiple PPE threads. The processes in an MPI application are
almost identical, and the off-loaded parts of each process are exactly the same. Therefore, a single
SPE thread could potentially execute the off-loaded computation from multiple processes.
However, different processes cannot share SPE threads, since SPE threads exclusively
belong to the process which created them. Therefore, we plan to investigate another level of
parallelism on the Cell processor, namely thread-level parallelism. Within a single node,
instead of having multiple MPI processes, a parallel application would operate with multiple
threads which could share SPEs among themselves. Separate processes would still be used across
nodes. To further increase utilization of the PPE, we will consider extended loop-level scheduling
policies which would also involve the PPE in the computation, besides the already used accelerator
cores.
9.4 Novel Applications and Programming Models
In our thesis we used a limited number of applications that were able to benefit from the off-loading
execution approach. While it is obvious that many scientific (computationally expensive)
applications will benefit from the proposed execution models and scheduling strategies, in
our future work we plan to focus on applications with high-bandwidth requirements. Specifically,
we plan to investigate the capability of accelerator-based architectures to execute applications
such as database servers and network packet processing.
The mentioned applications are computationally intensive, but they also usually require
high memory bandwidth because they stream large amounts of data. Besides being
extremely computationally powerful, Cell has a high-bandwidth bus which connects the on-chip
cores to one another and to main memory. While the high-bandwidth bus is capable of
improving the performance of streaming applications, in the near future it might
become a bottleneck as the number of on-chip cores increases. Therefore, in our future work
we will focus on runtime systems which improve the execution of data-intensive applications
on asymmetric processors.
9.5 Conventional Architectures
The main focus of our thesis was heterogeneous, accelerator-based architectures. However,
parallel architectures comprising homogeneous cores represent the majority of processors
in use today. When working with conventional, highly parallel architectures, it is likely
that problems similar to those we faced on heterogeneous architectures will occur.
As with asymmetric architectures, applications designed for homogeneous parallel architectures
need to be parallelized at multiple levels in order to achieve efficient execution. Applications
with multiple levels of parallelism are likely to experience load imbalance, which
might result in poor utilization of chip resources. Therefore, we need techniques that
are capable of detecting and correcting these anomalies.
Most of the techniques we presented in this thesis are not bound to heterogeneous architectures.
In our future work we plan to extend our scheduling and modeling work to homogeneous
parallel architectures. While scheduling approaches such as MGPS and S-MGPS
might be relatively simple to apply on any kind of architecture, the MMGP modeling
approach will require more detailed communication modeling. On the Cell architecture, because of
the specifics of the SPE design, we were able to assume significant computation-communication
overlap. This obviously will not be the case on architectures with conventional caches; therefore,
we will focus more on modeling communication patterns.
9.6 MMGP Extensions
Another direction of our future work regarding the MMGP model is more accurate modeling
of the off-loaded tasks, specifically of the DMA communication in the off-loaded tasks. Each SPE on
the Cell/BE is capable of moving data between main memory and local storage while at the
same time executing computation. To overlap computation and communication, applications
use loop tiling and double buffering.
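The double-buffering pattern can be illustrated with the following sketch, in which `memcpy` stands in for the asynchronous DMA get an SPE would issue (on a real SPE, a tagged `mfc_get` with a tag-status wait before each buffer is used); the tile size and the summation kernel are arbitrary choices for illustration:

```c
#include <string.h>

/* Schematic double-buffering loop: while tile i is being processed in one
 * local buffer, tile i+1 is fetched into the other, so transfer time is
 * hidden behind computation. */
#define TILE 4

static long process_tiles(const long *src, int ntiles)
{
    long buf[2][TILE];
    long sum = 0;
    int cur = 0;

    memcpy(buf[cur], src, sizeof buf[cur]);        /* prefetch tile 0 */
    for (int i = 0; i < ntiles; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < ntiles)                        /* start next transfer */
            memcpy(buf[nxt], src + (i + 1) * TILE, sizeof buf[nxt]);
        for (int j = 0; j < TILE; j++)             /* compute on current tile */
            sum += buf[cur][j];
        cur = nxt;                                 /* swap buffers */
    }
    return sum;
}
```

Loop tiling determines `TILE` (bounded by the 256 KB local store); double buffering doubles the local-store footprint in exchange for the overlap.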
In this thesis, our MMGP model includes all blocking DMA requests that cannot
be overlapped with computation. However, unrolling and increased DMA communication can
influence performance at a completely different architectural level. Although the Element
Interconnect Bus (the structure that connects the cores on Cell) can achieve a bandwidth of over 200
GB/s, the processor-memory bandwidth is limited to 25 GB/s. When many SPEs work simultaneously,
the available bandwidth might not be sufficient. Consider a case where each SPE executes
exactly the same loop, a realistic scenario when an off-loaded loop is parallelized across
multiple accelerators. If the off-loaded execution is synchronized, all SPEs will issue DMA
requests at the same time. Although the total bandwidth requirements might be less than 25
GB/s, when all SPEs simultaneously and synchronously perform memory communication, the
instantaneous requirements might exceed the available bandwidth. This scenario is likely to occur
when significant loop unrolling is performed, due to the heavily increased DMA communication
necessary for bringing in data for the enlarged loop bodies. In our future work we plan to extend
the MMGP model by capturing the on-chip contention caused by the high bandwidth requirements
of the off-loaded code.
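A back-of-the-envelope check makes this effect concrete: if each SPE needs a given average bandwidth but concentrates its DMA traffic into a fraction of each iteration (its duty cycle), the synchronized burst rate is the average divided by that fraction. Only the 25 GB/s limit comes from the text; the other numbers in the usage example below are made up for illustration:

```c
/* Returns 1 when the aggregate average demand fits within the memory
 * interface, yet the synchronized bursts (all SPEs issuing DMAs during
 * the same fraction of time) exceed it. */
static int bursts_exceed_bandwidth(int n_spes, double avg_gbs_per_spe,
                                   double dma_duty_cycle, double limit_gbs)
{
    double avg_total = n_spes * avg_gbs_per_spe;      /* sustained demand  */
    double burst_total = avg_total / dma_duty_cycle;  /* demand during bursts */
    return avg_total <= limit_gbs && burst_total > limit_gbs;
}
```

For example, 8 SPEs averaging 2.5 GB/s each fit under 25 GB/s on average, but if every SPE issues its DMAs during half of each iteration, the synchronized bursts demand 40 GB/s.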
Chapter 10
Overview of Related Research
10.1 Cell – Related Research
Cell has recently attracted considerable attention as a high-end computing platform. Recent
work on Cell covers modeling, performance analysis, programming and compilation environ-
ments, and application studies.
Kistler et al. [63] analyze the performance of Cell's on-chip interconnection network and
provide insights into its communication and synchronization protocols. They present experiments
that estimate the DMA latencies and bandwidth of Cell, using microbenchmarks. They
also investigate the system behavior under different patterns of communication between local
storage and main memory. Based on the presented results, the Cell communication network
provides the speed and bandwidth that applications need to exploit the processor's computational
power. Williams et al. [98] present an analytical framework to predict performance on
Cell. In order to test their model, they use several computational kernels, including dense matrix
multiplication, sparse matrix-vector multiplication, stencil computations, and 1D/2D FFTs.
In addition, they propose micro-architectural modifications that can increase the performance
of Cell when operating on double-precision floating point elements. Chen et al. [33] investigate
communication (DMA) performance on the SPEs. They found a strong relation between
the size of the prefetching buffers allocated in local storage and application performance. To
determine the optimal buffer size, they present a detailed analytical model of DMA accesses on
Cell and use the model to optimize the buffer size for DMAs. To evaluate the performance of
their model, they use a set of micro-kernels. Our work differs in that it considers the overall
performance implications of multigrain parallelization strategies on Cell.
Balart et al. [16] present a runtime library for asynchronous communication on the Cell BE
processor. The library is organized as a software cache and provides opportunities for overlapping
communication and computation. They found that a fully associative scheme offers
better chances for communication-computation overlap. To evaluate their system they used
benchmarks from the HPCC suite. While their concern was the design and implementation of the
off-loaded code, in our work we assume that the application is already Cell-optimized, and we
focus on scheduling of already application-exposed parallelism.
Eichenberger et al. [39] present several compiler techniques targeting automatic generation
of highly optimized code for Cell. These techniques attempt to exploit two levels of parallelism,
thread-level and SIMD-level, on the SPEs. The techniques include compiler-assisted memory
alignment, branch prediction, SIMD parallelization, OpenMP thread-level parallelization, and
compiler-controlled software caching. The study of Eichenberger et al. does not present details
on how multiple dimensions of parallelism are exploited and scheduled simultaneously by the
compiler. Our contribution addresses this issue. The compiler techniques presented in [39] are
complementary to the work presented in this thesis. They focus primarily on extracting high
performance out of each individual SPE, whereas our work focuses on scheduling and orchestrating
computation across SPEs. Zhao and Kennedy [102] present a dependence-driven compilation
framework for simultaneous automatic loop-level parallelization and SIMDization on
Cell. They also implement strategies to boost performance by managing DMA data movement,
improving data alignment, and exploiting memory reuse in the innermost loop. To evaluate the
performance of their techniques, Zhao and Kennedy use microbenchmarks. Similar to the results
presented in our study, they do not see linear speedup when parallelizing tasks across multiple
SPEs. The framework of Zhao and Kennedy does not consider task-level functional parallelism
and its coordinated scheduling with data parallelism, two central issues explored in this thesis.
Although Cell has been a focal point in numerous articles in the popular press, published research
using Cell for real-world applications beyond games was scarce until recently. Hjelte [58]
presents an implementation of a smooth particle hydrodynamics simulation on Cell. This simulation
requires good interactive performance, since it lies on the critical path of real-time applications
such as interactive simulation of human organ tissue, body fluids, and vehicular traffic.
Benthin et al. [18] present an implementation of ray-tracing algorithms on Cell, also targeting
high interactive performance. They have shown how to efficiently map the ray-tracing algorithm
to Cell, with performance improvements of nearly an order of magnitude over conventional
processors. However, they found that for certain algorithms Cell does not perform well
due to frequent memory accesses. Petrini et al. [73] recently reported experiences from porting
and optimizing Sweep3D on Cell, in which they consider multi-level data parallelization on the
SPEs. They heavily optimized Sweep3D for Cell and achieved impressive performance of 9.3
Gflops for double-precision and 50 Gflops for single-precision floating point computation.
Contrary to their conclusion that memory performance and data communication patterns
play the central role in Sweep3D, we were able to achieve complete communication-computation
overlap in the bioinformatics codes we ported to Cell. The same authors presented a study of
graph exploration algorithms on Cell [75]. They investigated the suitability of the breadth-first
search (BFS) algorithm on the Cell BE. The achieved performance is an order of magnitude better
compared to conventional architectures. Bader et al. [13] examine the implementation of list
ranking algorithms on Cell. List ranking is a challenging algorithm for Cell due to its highly
irregular access patterns. When utilizing the entire Cell chip, they reported an overall speedup of 8.34
over a PPE-only implementation of the same algorithm. Recently, several Cell studies have been
conducted as a result of the 2007 IBM Cell Challenge. Moorkanikara-Nageswaran et al. [1] developed
a brain circuit simulation on a PS3 node. As part of the same contest, De Kruijf ported
MapReduce [38] to Cell. The main goal of our work is both to develop and optimize
applications for Cell and to develop system software tools and methodologies for improving
performance on the Cell architecture across application domains. We use a case study from
bioinformatics to understand the implications of static and dynamic multi-grain parallelization
on Cell.
10.2 Process Scheduling – Related Research
Dynamic and off-line process scheduling to improve the performance and overall throughput
of the system has been a very active research area. With the introduction of multi-core
systems, many scheduling-related studies have been conducted targeting performance improvements
on these novel systems. We list several contributions in this area.
Anderson et al. [9] argue that the performance of kernel-level threads is inherently worse
than that of user-level threads. While user-level threads are essential for high-performance
computation, kernel-level threads, which support user-level threads, are a poor kernel-level
abstraction due to their inherently bad performance. The authors propose a new kernel interface
and user-level thread package that together provide the same functionality as kernel threads,
while the performance of their thread library is comparable to that of any
user-level thread library.
Siddha et al. [82] conducted a thorough study of possible scheduling strategies on emerging
multi-core architectures. They consider different multi-core topologies and the associated
power management technologies, and point to possible tradeoffs when performing
scheduling on these novel architectures. They focus on symmetric processors and do not consider
any asymmetric architectures. Somewhat similar to the results obtained from our study
(using asymmetric cores), they conclude that the most efficient performance can be achieved by
making the process scheduler aware of the multi-core topologies and the task characteristics.
Fedorova et al. [42] designed a kernel-level scheduling algorithm aiming to improve the
performance of multi-core architectures with shared levels of cache. The motivation for their
work comes from the fact that applications on multi-core systems depend for performance on
the behavior of their co-runners. This performance dependency occurs as a consequence of
shared on-chip resources, such as the cache. Their algorithm ensures that processes always
run as quickly as they would if the cache were fairly shared among all co-running processes. To
achieve this behavior, they adjust the CPU timeslices assigned to the running processes by the
kernel scheduler.
Calandrino et al. [26] developed an approach for scheduling soft real-time periodic tasks in
Linux on asymmetric multi-core processors. Their approach performs dynamic scheduling of
real-time tasks, while at the same time attempting to provide good performance for non-real-time
processes. To evaluate their approach they used a Linux scheduler simulator, as well as the real
Linux operating system running on a dual-core Intel Xeon processor.
Settle et al. [80] proposed a memory monitoring framework, architectural support that
provides cache resource information to the operating system. The authors introduce the concept
of an activity vector, which represents a collection of event counters for a set of contiguous cache
blocks. Using this runtime information, the operating system can improve process scheduling.
Their scheme schedules threads based on the run-time cache use and miss pattern of each active
hardware thread. Their techniques improve system performance by 5%, with the improvements
caused by an increased cache hit rate.
Thekkath and Eggers [93] tested the hypothesis that scheduling threads that share data on the
same processor will decrease compulsory and invalidation misses. They evaluated a variety
of thread placement algorithms. Their workload was composed of fourteen parallel programs
representative of real-world scientific applications. They found that placing threads
that share data on the same processor does not have any impact on performance; instead,
performance was mostly affected by thread load balancing.
Rajagopalan et al. [76] introduce a scheduling framework for multi-core processors that
targets a balance between control over the system and the level of abstraction. Their
framework uses high-level information supplied by the user to guide thread scheduling and also,
where necessary, gives the programmer fine control over thread placement.
Snavely and Tullsen [83] designed the SOS (Sample, Optimize, Symbios) scheduler, an
OS-level scheduler that dynamically chooses the best scheduling strategy in order to increase
the throughput of the system. The SOS scheduler samples the space of possible process combinations
and collects values of the hardware counters for different scheduling combinations.
The scheduler applies heuristics to the collected counters in order to determine the most efficient
scheduling strategy. The scheduler is designed for SMT architectures and is capable
of improving system performance by up to 17%. The same authors extend their initial work
by introducing job priorities [84]. While different jobs might have various priorities from the
user's perspective, the SOS scheduler might be unaware of that; thus, while trying to
improve system throughput, the SOS scheduler might increase the response time of high-priority
jobs.
Sudarsan et al. [90] developed ReSHAPE, a runtime scheduler for dynamic resizing of parallel
applications executed in a distributed environment. MPI-based applications using the ReSHAPE
framework can expand or shrink depending on the availability of the underlying hardware. Using
ReSHAPE they demonstrated improvements in job turn-around time and overall system throughput.
McCann et al. [67] propose a dynamic processor-scheduling policy for multiprogrammed
shared-memory multiprocessors. Their scheduling policy also assumes multiple independent
processes, and it is capable of reallocating processors from one parallel job to another based on the
requirements of the parallel jobs. The authors show that it is possible to beneficially run low-priority
jobs on the same CPU as high-priority jobs, without hurting the high-priority jobs.
Their new scheduling scheme can improve system performance by up to 40%.
Curtis-Maury et al. [36] present a prediction model for identifying energy-efficient operating
points of concurrency in multithreaded scientific applications. Their runtime system optimizes
applications at runtime, using live analysis of hardware event rates. Zhang et al. [100] developed
an OpenMP-based loop scheduler that selects the number of threads to use per processor
based on sample executions of each possibility. The authors extend that work to incorporate
decision-tree-based prediction of the optimal number of threads to use [101]. Springer et al. [85]
developed a scheduler that satisfies two conditions: the scheduling strategy meets an external
upper limit on energy consumption and minimizes the execution time. The execution model
chosen by their scheduler is usually within 2% of optimal.
10.3 Modeling – Related Research
In this section we review related work on programming environments and models for parallel
computation on conventional homogeneous parallel systems, and on programming support for
nested parallelization. The list of related work in models of parallel computation is by no
means complete, but we believe it provides adequate context for the model presented in this
thesis.
10.3.1 PRAM Model
Fortune and Wyllie presented a model based on random access machines operating in parallel
and sharing a common memory [46]. They model the execution of a finite program on a
PRAM (parallel random access machine), which consists of an unbounded set of processors connected
through an unbounded global shared memory. The model is rather simple but not realistic for modern
multicore processors, since it assumes that all processors work synchronously and that interprocessor
communication is free. PRAM also does not consider network congestion. There
are several variants of the PRAM model: (i) EREW (exclusive read, exclusive write) does not
allow simultaneous read or write operations; (ii) CREW (concurrent read, exclusive write)
allows simultaneous reading but prevents simultaneous writing; (iii) CRCW (concurrent read,
concurrent write) allows both simultaneous read and simultaneous write operations. Cameron et al. [28]
describe two different implementations of CRCW PRAM: priority and arbitrary.
Several extensions of the PRAM model have been developed in order to make it more practical
while preserving its simplicity [5, 6, 51, 62, 68, 72]. Aggarwal et al. [5] add
communication latency to the PRAM model, and the same authors account for reduced communication
costs when blocks of data are transferred [6].
The original PRAM model assumes no asynchronous execution. The Asynchronous PRAM (APRAM)
model includes synchronization costs [51, 68]. APRAM contains four different types of
instructions: global reads, global writes, local operations, and synchronization steps. A synchronization
step represents a global synchronization among processors.
10.3.2 BSP Model
Valiant introduced the bulk-synchronous parallel model (BSP) [95], which is a bridging model
between parallel software and hardware. The BSP model is intended neither as a hardware nor
as a programming model, but something in between. The model is defined as a combination
of three attributes: 1) A number of components each performing processing and/or memory
functions; 2) A router that delivers point-to-point messages between the components; and 3)
Facilities for synchronizing all or a subset of components at regular intervals. Computation proceeds in supersteps: each component is allocated a task, and all components synchronize at the end of each superstep. BSP allows processors to work asynchronously within a superstep and models latency and limited bandwidth. However, like the other models mentioned, BSP does not capture the overhead of context switching, which is a significant component of accelerator-based execution and of the MMGP model.
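The per-superstep cost in BSP is commonly summarized by a simple formula. As a brief illustration, using the standard BSP cost parameters (not notation introduced in this thesis): if superstep $i$ performs at most $w_i$ local operations on any component and routes an $h_i$-relation (no component sends or receives more than $h_i$ messages), and the machine has gap $g$ and synchronization cost $l$, then

```latex
% Standard BSP cost model (Valiant [95]):
%   w_i : maximum local computation in superstep i
%   h_i : maximum messages sent or received by any component in superstep i
%   g   : gap (time per message under continuous traffic)
%   l   : cost of a barrier synchronization
T_{\mathrm{superstep}\,i} = w_i + h_i \, g + l,
\qquad
T_{\mathrm{program}} = \sum_{i=1}^{S} \left( w_i + h_i \, g + l \right)
```

so the total execution time of a program with $S$ supersteps is the sum of the superstep costs.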
Bäumker et al. [25] extend the BSP model with blockwise communication. A good parallel algorithm should communicate using a small number of large messages rather than a large number of small messages; they therefore introduce a new parameter B, which represents the minimum message size required to fully exploit the bandwidth of the router. Fantozzi et al. [41] introduce D-BSP, a model in which a machine can be divided into submachines capable of exploiting locality. Furthermore, each submachine can execute a different algorithm independently. Juurlink et al. [60] extend the BSP model by providing a way to deal with unbalanced communication patterns, and by adding a notion of general locality, where the delay of a remote memory access depends on the relative location of the processors in the interconnection network.
10.3.3 LogP Model
LogP [35] is another widely used machine-independent model for parallel computation. The
LogP model captures the performance of parallel applications using four parameters: the communication latency (L), the overhead of sending or receiving a message (o), the gap between consecutive messages (g), which reflects the per-processor communication bandwidth, and the number of processors (P).
The drawback of LogP is that it can accurately predict performance only when communication uses short messages. Alexandrov et al. [8] propose the LogGP model, an extension of LogP that supports large messages and high bandwidth, introducing an extra parameter G that captures the bandwidth obtained for large messages.
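The difference between the two models can be made concrete with their standard point-to-point cost expressions. The sketch below is illustrative only: the formulas are the textbook LogP/LogGP costs (assuming the gap dominates the overhead, g >= o), and the parameter values are hypothetical, chosen only for the example rather than measured on any machine discussed in this thesis.

```python
# Textbook LogP and LogGP point-to-point transfer costs (sketch).

def logp_time(k, L, o, g):
    """LogP: a k-word transfer must be sent as k single-word messages."""
    return o + (k - 1) * g + L + o

def loggp_time(k, L, o, g, G):
    """LogGP: one bulk message; the per-word gap G (G << g) replaces the
    per-message gap g for all words after the first."""
    return o + (k - 1) * G + L + o

L, o, g, G = 10.0, 2.0, 4.0, 0.1  # hypothetical machine parameters

# For a single word the two models agree; for a large transfer LogP
# grossly overestimates the cost, which motivates the extra parameter G.
assert logp_time(1, L, o, g) == loggp_time(1, L, o, g, G)
print(logp_time(1024, L, o, g))                # 4106.0
print(round(loggp_time(1024, L, o, g, G), 1))  # 116.3
```

The roughly 35x gap between the two predictions for the 1024-word transfer illustrates why LogP alone is inadequate once bulk transfers dominate communication.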
Ino et al. [59] introduce an extension of LogGP, named LogGPS. LogGPS improves the accuracy of the LogGP model by capturing the synchronization that high-level communication libraries perform before sending a long message. They introduce a new parameter S, defined as the message-length threshold above which messages are sent synchronously. Frank et al. [47] extend the LogP model by capturing the impact of contention for message processing resources. Cameron et al. [27] extend the LogP model by modeling the point-to-point memory latencies of inter-node communication in a shared-memory cluster.
Traditional models of parallel computation, such as PRAM [51], BSP [95], LogP [35], and their derivatives [8, 27, 59, 70], were developed in response to changes in the relative impact of architectural components on the performance of parallel systems. They are based on a minimal set of parameters that capture the impact of communication overhead on computation running across a homogeneous collection of interconnected processors. MMGP borrows elements from LogP and its derivatives to estimate the performance of parallel computations on heterogeneous parallel systems with
multiple dimensions of parallelism implemented in hardware. A variation of LogP, HLogP [24],
considers heterogeneous clusters with variability in the computational power and interconnec-
tion network latencies and bandwidths between the nodes. Although HLogP is applicable to
heterogeneous multi-core architectures, it does not consider nested parallelism. It should be
noted that although MMGP has been evaluated on architectures with heterogeneous processors,
it can readily support architectures with heterogeneity in their communication substrates as well
(e.g. architectures providing both shared-memory and message-passing communication).
10.3.4 Models Describing Nested Parallelism
Several parallel programming models have been developed to support nested parallelism, including nested parallel languages such as NESL [21], task-level parallelism extensions to data-parallel languages such as HPF [89], extensions of common parallel programming libraries such
as MPI and OpenMP to support nested parallel constructs [29, 64], and techniques for combining constructs from parallel programming libraries, typically MPI and OpenMP, to better exploit nested parallelism [11, 50, 77]. Prior work on languages and libraries for nested parallelism with MPI and OpenMP rests largely on empirical observations of the relative speed of communication through cache-coherent shared memory versus message passing through switching networks. Our work attempts to formalize these observations into a model that seeks the optimal allocation of work between the layers of parallelism in an application and the optimal mapping of these layers to heterogeneous parallel execution hardware. NESL [21]
and Cilk [22] are languages based on formal algorithmic models of performance that guarantee tight bounds on the estimated performance of multithreaded computations and enable nested parallelization. Both NESL and Cilk assume homogeneous machines.
Subhlok and Vondran [88] present a model for estimating the optimal number of homogeneous processors to assign to each parallel task in a chain of tasks that form a pipeline. MMGP has a similar goal of assigning co-processors to simultaneously active tasks originating from the host processors; however, MMGP also searches for the optimal number of tasks to activate on the host processors, in order to balance the supply from host processors against the demand from co-processors. Sharapov et al. [81] use a combination of queuing theory and cycle-accurate simulation of processors and interconnection networks to predict the performance of hybrid parallel codes written in MPI/OpenMP on ccNUMA architectures. MMGP uses a simpler model, designed to estimate scalability along more than one dimension of parallelism on heterogeneous parallel architectures.
Research on optimizing compilers for novel microprocessors, such as tiled and streaming
processors, has contributed methods for multi-grain parallelization of scientific and media computations. Gordon et al. [53] present a compilation framework for exploiting three layers of parallelism (data, task, and pipelined) on streaming microprocessors running DSP applications. The framework uses a combination of fusion and fission transformations on data-parallel computations to "right-size" the degrees of task and data parallelism in a program running on a homogeneous multi-core microprocessor. MMGP is a complementary tool which can assist
both compile-time and runtime optimization on heterogeneous multi-core platforms. The development of MMGP coincides with several related efforts on measuring, modeling, and optimizing performance on the Cell Broadband Engine [32, 75]. An analytical model of the Cell, presented by Williams et al. [97], considers the execution of floating-point code and DMA accesses on the Cell SPEs for scientific kernels parallelized at one level across SPEs and vectorized further within each SPE. MMGP models the use of both the PPE and the SPEs, and has been demonstrated to work effectively with complete application codes. In particular, MMGP factors the effects of PPE thread scheduling, PPE-SPE communication, and SPE-SPE communication into the Cell performance model.
Bibliography
[1] http://www-304.ibm.com/jct09002c/university/students/contests/cell/index.html.
[2] http://www.rapportincorporated.com.
[3] The Cell project at IBM Research; http://www.research.ibm.com/cell.
[4] www.gpgpu.org.
[5] A. Aggarwal, A. K. Chandra, and M. Snir. On communication latency in PRAM computations. In SPAA '89: Proceedings of the First Annual ACM Symposium on Parallel Algorithms and Architectures, pages 11–21, New York, NY, USA, 1989. ACM.
[6] Alok Aggarwal, Ashok K. Chandra, and Marc Snir. Communication complexity of PRAMs. Theor. Comput. Sci., 71(1):3–28, 1990.
[7] S. Alam, R. Barrett, J. Kuehn, P. Roth, and J. Vetter. Characterization of scientific workloads on systems with multi-core processors. In Proc. of IEEE International Symposium
on Workload Characterization (IISWC), 2006.
[8] A. Alexandrov, M. Ionescu, C. Schauser, and C. Scheiman. LogGP: Incorporating Long Messages into the LogP Model: One Step Closer towards a Realistic Model for Parallel Computation. In Proc. of the 7th Annual ACM Symposium on Parallel Algorithms and
Architectures, pages 95–105, Santa Barbara, CA, June 1995.
[9] Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy. Scheduler activations: effective kernel support for the user-level management of parallelism. ACM Trans. Comput. Syst., 10(1):53–79, 1992.
[10] K. Asanovic, R. Bodik, C. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, and K. Yelick. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California–Berkeley, December 2006.
[11] E. Ayguade, X. Martorell, J. Labarta, M. Gonzalez, and N. Navarro. Exploiting Multiple Levels of Parallelism in OpenMP: A Case Study. In Proc. of the 1999 International Conference on Parallel Processing (ICPP'99), pages 172–180, Aizu, Japan, August 1999.
[12] A. Azevedo, C.H. Meenderinck, B.H.H. Juurlink, M. Alvarez, and A. Ramirez. Analysis of video filtering on the Cell processor. In Proceedings of the ProRISC Conference, pages 116–121, November 2007.
[13] D. Bader, V. Agarwal, and K. Madduri. On the Design and Analysis of Irregular Algorithms on the Cell Processor: A Case Study on List Ranking. In Proc. of the 21st International Parallel and Distributed Processing Symposium, Long Beach, CA, March 2007.
[14] D.A. Bader, B.M.E. Moret, and L. Vawter. Industrial applications of high-performance computing for phylogeny reconstruction. In Proc. of SPIE ITCom, volume 4528, pages 159–168, 2001.
[15] David A. Bader, Virat Agarwal, Kamesh Madduri, and Seunghwa Kang. High performance combinatorial algorithm design on the Cell Broadband Engine processor. Parallel
Comput., 33(10-11):720–740, 2007.
[16] Jairo Balart, Marc Gonzalez, Xavier Martorell, Eduard Ayguade, Zehra Sura, Tong Chen, Tao Zhang, Kevin O'Brien, and Kathryn O'Brien. A novel asynchronous software cache implementation for the Cell/BE processor. In The 20th International Workshop on Languages and Compilers for Parallel Computing, 2007.
[17] P. Bellens, J. Perez, R. Badia, and J. Labarta. CellSs: A Programming Model for the Cell BE Architecture. In Proc. of Supercomputing'2006, Tampa, FL, November 2006.
[18] Carsten Benthin, Ingo Wald, Michael Scherbaum, and Heiko Friedrich. Ray Tracing on the CELL Processor. Technical Report inTrace-2006-001, inTrace Realtime Ray Tracing GmbH (submitted for publication), 2006.
[19] F. Blagojevic, D. Nikolopoulos, A. Stamatakis, and C. Antonopoulos. Dynamic Multi-grain Parallelization on the Cell Broadband Engine. In Proc. of the 12th ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming, pages 90–100, March 2007.
[20] F. Blagojevic, A. Stamatakis, C. Antonopoulos, and D. Nikolopoulos. RAxML-CELL: Parallel Phylogenetic Tree Construction on the Cell Broadband Engine. In Proc. of the
21st International Parallel and Distributed Processing Symposium, March 2007.
[21] G. Blelloch, S. Chatterjee, J. Harwick, J. Sipelstein, and M. Zagha. Implementation of a Portable Nested Data Parallel Language. In Proc. of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'93), pages 102–112, San Diego, CA, June 1993.
[22] R. Blumofe, C. Joerg, B. Kuszmaul, C. Leiserson, K. Randall, and Y. Zhou. Cilk: an Efficient Multithreaded Runtime System. In Proc. of the 5th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP'95), pages 207–216, Santa Barbara, California, August 1995.
[23] Jeff Bolz, Ian Farmer, Eitan Grinspun, and Peter Schröder. Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Trans. Graph., 22(3):917–924, 2003.
[24] J. Bosque and L. Pastor. A Parallel Computational Model for Heterogeneous Clusters.IEEE Transactions on Parallel and Distributed Systems, 17(12):1390–1400, December2006.
[25] Armin Bäumker and Wolfgang Dittrich. Fully dynamic search trees for an extension of the BSP model.
[26] John M. Calandrino, Dan Baumberger, Tong Li, Scott Hahn, and James H. Anderson.Soft real-time scheduling on performance asymmetric multicore platforms. In RTAS ’07:
Proceedings of the 13th IEEE Real Time and Embedded Technology and Applications
Symposium, pages 101–112, Washington, DC, USA, 2007. IEEE Computer Society.
[27] K. Cameron and X. Sun. Quantifying Locality Effect in Data Access Delay: Memory LogP. In Proc. of the 17th International Parallel and Distributed Processing Symposium, Nice, France, April 2003.
[28] Kirk W. Cameron and Rong Ge. Predicting and evaluating distributed communication performance. In SC '04: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, page 43, Washington, DC, USA, 2004. IEEE Computer Society.
[29] F. Cappello and D. Etiemble. MPI vs. MPI+OpenMP on the IBM SP for the NAS Benchmarks. In Proc. of the IEEE/ACM Supercomputing'2000: High Performance Networking and Computing Conference (SC'2000), Dallas, Texas, November 2000.
[30] L. Chai, Q. Gao, and D. K. Panda. Understanding the Impact of Multi-Core Architecture in Cluster Computing: A Case Study with Intel Dual-Core System. In Proc. of
CCGrid2007, May 2007.
[31] Maria Charalambous, Pedro Trancoso, and Alexandros Stamatakis. Initial experiences porting a bioinformatics application to a graphics processor. In Panhellenic Conference
on Informatics, pages 415–425, 2005.
[32] T. Chen, Z. Sura, K. O'Brien, and K. O'Brien. Optimizing the Use of Static Buffers for DMA on a Cell Chip. In Proc. of the 19th International Workshop on Languages and
Compilers for Parallel Computing, New Orleans, LA, November 2006.
[33] Thomas Chen, Ram Raghavan, Jason Dale, and Eiji Iwata. Cell Broadband Engine architecture and its first implementation. IBM developerWorks, Nov 2005.
[34] Benny Chor and Tamir Tuller. Maximum likelihood of evolutionary trees: hardness and approximation. Bioinformatics, 21(1):97–106, 2005.
[35] D. Culler, R. Karp, D. Patterson, A. Sahay, K. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a Realistic Model of Parallel Computation. In Proc. of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'93), pages 1–12, San Diego, California, May 1993.
[36] Matthew Curtis-Maury, Filip Blagojevic, Christos D. Antonopoulos, and Dimitrios S. Nikolopoulos. Prediction-based power-performance adaptation of multithreaded scientific codes. IEEE Transactions on Parallel and Distributed Systems.
[37] William J. Dally, Francois Labonte, Abhishek Das, Patrick Hanrahan, Jung-Ho Ahn, Jayanth Gummaraju, Mattan Erez, Nuwan Jayasena, Ian Buck, Timothy J. Knight, and Ujval J. Kapasi. Merrimac: Supercomputing with streams. In SC '03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, page 35, Washington, DC, USA, 2003. IEEE Computer Society.
[38] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. of the 6th Symposium on Operating Systems Design and Implementation (OSDI'04), pages 137–150, 2004.
[39] A. Eichenberger, Z. Sura, A. Wang, T. Zhang, P. Zhao, M. Gschwind, K. O'Brien, K. O'Brien, P. Wu, T. Chen, P. Oden, D. Prener, J. Shepherd, and B. So. Optimizing Compiler for the CELL Processor. In Proc. of the 14th International Conference on
Parallel Architectures and Compilation Techniques, pages 161–172, Saint Louis, MO, September 2005.
[40] B. Flachs et al. The Microarchitecture of the Streaming Processor for a CELL Processor.Proceedings of the IEEE International Solid-State Circuits Symposium, pages 184–185,February 2005.
[41] Carlo Fantozzi, Andrea Pietracaprina, and Geppino Pucci. Translating submachine locality into locality of reference. J. Parallel Distrib. Comput., 66(5):633–646, 2006.
[42] Alexandra Fedorova, Margo Seltzer, and Michael D. Smith. Improving performance isolation on chip multiprocessors via an operating system scheduler. In PACT '07: Proceedings of the 16th International Conference on Parallel Architecture and Compilation
Techniques, pages 25–38, Washington, DC, USA, 2007. IEEE Computer Society.
[43] J. Felsenstein. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution, 17:368–376, 1981.
[44] J. Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 17:368–376, 1981.
[45] X. Feng, K. Cameron, B. Smith, and C. Sosa. Building the Tree of Life on Terascale Systems. In Proc. of the 21st International Parallel and Distributed Processing Symposium, Long Beach, CA, March 2007.
[46] Steven Fortune and James Wyllie. Parallelism in random access machines. In STOC
’78: Proceedings of the tenth annual ACM symposium on Theory of computing, pages 114–118, New York, NY, USA, 1978. ACM.
[47] Matthew Frank, Anant Agarwal, and Mary K. Vernon. LoPC: Modeling contention in parallel algorithms. In Principles and Practice of Parallel Programming, pages 276–287, 1997.
[48] Bugra Gedik, Rajesh Bordawekar, and Philip S. Yu. CellSort: High performance sorting on the Cell processor. In Proc. of the 33rd Very Large Data Bases Conference, pages 1286–1297, 2007.
[49] Bugra Gedik, Philip S. Yu, and Rajesh R. Bordawekar. Executing stream joins on the Cell processor. In VLDB '07: Proceedings of the 33rd international conference on Very
large data bases, pages 363–374. VLDB Endowment, 2007.
[50] A. Gerndt, S. Sarholz, M. Wolter, D. An Mey, C. Bischof, and T. Kuhlen. Particles and Continuum – Nested OpenMP for Efficient Computation of 3D Critical Points in Multiblock Data Sets. In Proc. of Supercomputing'2006, Tampa, FL, November 2006.
[51] P. Gibbons. A More Practical PRAM Model. In Proc. of the First Annual ACM Sym-
posium on Parallel Algorithms and Architectures, pages 158–168, Santa Fe, NM, June 1989.
[52] M. Girkar and C. Polychronopoulos. The Hierarchical Task Graph as a Universal Intermediate Representation. International Journal of Parallel Programming, 22(5):519–551, October 1994.
[53] M. Gordon, W. Thies, and S. Amarasinghe. Exploiting Coarse-Grained Task, Data andPipelined Parallelism in Stream Programs. In Proc. of the 12th International Conference
on Architectural Support for Programming Languages and Operating Systems, pages 151–162, San Jose, CA, October 2006.
[54] Naga K. Govindaraju, Brandon Lloyd, Wei Wang, Ming Lin, and Dinesh Manocha. Fast computation of database operations using graphics processors. In SIGMOD '04: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pages 215–226, New York, NY, USA, 2004. ACM.
[55] W. Gropp and E. Lusk. Reproducible Measurements of MPI Performance Characteristics. In Proc. of the 6th European PVM/MPI User's Group Meeting, pages 11–18, Barcelona, Spain, September 1999.
[56] W. Gropp and E. Lusk. Reproducible Measurements of MPI Performance Characteristics.In Proc. of the 6th European PVM/MPI Users Group Meeting, pages 11–18, September1999.
[57] M. Hill and M. Marty. Amdahl's Law in the Multi-core Era. Technical Report 1593, Department of Computer Sciences, University of Wisconsin–Madison, March 2007.
[58] Nils Hjelte. Smoothed particle hydrodynamics on the cell broadband engine. Master’sthesis, Umea University, Department of Computer Science, Jun 2006.
[59] F. Ino, N. Fujimoto, and K. Hagihara. LogGPS: A Parallel Computational Model forSynchronization Analysis. In Proc. of the 8th ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming, pages 133–142, Snowbird, UT, June 2001.
[60] Ben H. H. Juurlink and Harry A. G. Wijshoff. The e-BSP model: Incorporating general locality and unbalanced communication into the BSP model. In Euro-Par, Vol. II, pages 339–347, 1996.
[61] W. Kahan. Lecture notes on the status of IEEE Standard 754 for binary floating-point arithmetic. 1997.
[62] Richard M. Karp, Michael Luby, and Friedhelm Meyer auf der Heide. Efficient PRAM simulation on a distributed memory machine. In STOC '92: Proceedings of the twenty-
fourth annual ACM symposium on Theory of computing, pages 318–326, New York, NY,USA, 1992. ACM.
[63] Mike Kistler, Michael Perrone, and Fabrizio Petrini. Cell Multiprocessor Interconnection Network: Built for Speed. IEEE Micro, 26(3), May-June 2006. Available from http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf.
[64] G. Krawezik. Performance Comparison of MPI and three OpenMP Programming Styles on Shared Memory Multiprocessors. In Proc. of the 15th Annual ACM Symposium on
Parallel Algorithms and Architectures, pages 118–127, San Diego, CA, June 2003.
[65] E. Scott Larsen and David McAllister. Fast matrix multiplies using graphics hardware. In Supercomputing '01: Proceedings of the 2001 ACM/IEEE Conference on Supercomputing (CDROM), pages 55–55, New York, NY, USA, 2001. ACM.
[66] L-K. Liu, Q. Li, A. Natsev, K.A. Ross, J.R. Smith, and A.L. Varbanescu. Digital media indexing on the Cell processor. In ICME 2007, pages 1866–1869. IEEE Signal Processing Society, July 2007.
[67] Cathy McCann, Raj Vaswani, and John Zahorjan. A dynamic processor allocation policy for multiprogrammed shared-memory multiprocessors. ACM Trans. Comput. Syst., 11(2):146–178, 1993.
[68] Kurt Mehlhorn and Uzi Vishkin. Randomized and deterministic simulations of PRAMs by parallel machines with restricted granularity of parallel memories. Acta Inf., 21(4):339–374, 1984.
[69] Barry Minor, Gordon Fossum, and Van To. Terrain Rendering Engine (TRE); http://www.research.ibm.com/cell/whitepapers/tre.pdf. May 2005.
[70] C. Moritz and M. Frank. LoGPC: Modeling Network Contention in Message Passing Programs. In Proc. of the 1998 ACM SIGMETRICS Conference on Measurement and
Modeling of Computer Systems, pages 254–263, Madison, WI, June 1998.
[71] PowerPC Microprocessor Family: Vector/SIMD Multimedia Extension Technology Programming Environments Manual. http://www-306.ibm.com/chips/techlib.
[72] Christos Papadimitriou and Mihalis Yannakakis. Towards an architecture-independent analysis of parallel algorithms. In STOC '88: Proceedings of the twentieth annual ACM
symposium on Theory of computing, pages 510–513, New York, NY, USA, 1988. ACM.
[73] F. Petrini, G. Fossum, A. Varbanescu, M. Perrone, M. Kistler, and J. Fernandez Periador. Multi-core Surprises: Lessons Learned from Optimized Sweep3D on the Cell Broadband Engine. In Proc. of the 21st International Parallel and Distributed Processing Symposium, Long Beach, CA, March 2007.
[74] Fabrizio Petrini, Gordon Fossum, Mike Kistler, and Michael Perrone. Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine.
[75] Fabrizio Petrini, Daniel Scarpazza, Oreste Villa, and Juan Fernandez. Challenges in Mapping Graph Exploration Algorithms on Advanced Multi-core Processors. In Proc.
of the 21st International Parallel and Distributed Processing Symposium, Long Beach,CA, March 2007.
[76] Mohan Rajagopalan, Brian T. Lewis, and Todd A. Anderson. Thread scheduling for multi-core platforms. In HotOS 2007: Proceedings of the Eleventh Workshop on Hot
Topics in Operating Systems, 2007.
[77] T. Rauber and G. Ruenger. Library Support for Hierarchical Multiprocessor Tasks. InProc. of Supercomputing’2002, Baltimore, MD, November 2002.
[78] Daniele Paolo Scarpazza, Oreste Villa, and Fabrizio Petrini. Peak-performance DFA-based string matching on the Cell processor. In IPDPS, pages 1–8. IEEE, 2007.
[79] Harald Servat, Cecilia Gonzalez, Xavier Aguilar, Daniel Cabrera, and Daniel Jimenez. Drug design on the Cell Broadband Engine. In PACT '07: Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, page 425, Washington, DC, USA, 2007. IEEE Computer Society.
[80] Alex Settle, Joshua Kihm, Andrew Janiszewski, and Dan Connors. Architectural support for enhanced SMT job scheduling. In PACT '04: Proceedings of the 13th International
Conference on Parallel Architectures and Compilation Techniques, pages 63–73, Wash-ington, DC, USA, 2004. IEEE Computer Society.
[81] I. Sharapov, R. Kroeger, G. Delamater, R. Cheveresan, and M. Ramsay. A Case Study in Top-Down Performance Estimation for a Large-Scale Parallel Application. In Proc. of the 11th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 81–89, New York, NY, March 2006.
[82] Suresh Siddha, Venkatesh Pallipadi, and Asit Mallick. Process Scheduling Challenges in the Era of Multi-core Processors. Intel Technology Journal, 11, 2007.
[83] Allan Snavely and Dean M. Tullsen. Symbiotic jobscheduling for a simultaneous multithreaded processor. In ASPLOS-IX: Proceedings of the ninth international conference
on Architectural support for programming languages and operating systems, pages 234–244, New York, NY, USA, 2000. ACM.
[84] Allan Snavely, Dean M. Tullsen, and Geoff Voelker. Symbiotic jobscheduling with priorities for a simultaneous multithreading processor. In SIGMETRICS '02: Proceedings
of the 2002 ACM SIGMETRICS international conference on Measurement and modeling
of computer systems, pages 66–76, New York, NY, USA, 2002. ACM.
[85] Robert Springer, David K. Lowenthal, Barry Rountree, and Vincent W. Freeh. Minimizing execution time in MPI programs on an energy-constrained, power-scalable cluster. In PPoPP '06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and
practice of parallel programming, pages 230–238, New York, NY, USA, 2006. ACM.
[86] A. Stamatakis. Phylogenetic models of rate heterogeneity: A high performance computing perspective. In Proceedings of the 20th IEEE/ACM International Parallel and Distributed Processing Symposium (IPDPS 2006), High Performance Computational Biology Workshop, Proceedings on CD, Rhodes, Greece, April 2006.
[87] Alexandros Stamatakis. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics, page btl446, 2006.
[88] J. Subhlok and G. Vondran. Optimal Use of Mixed Task and Data Parallelism for Pipelined Computations. Journal of Parallel and Distributed Computing, 60(3):297–319, March 2000.
[89] J. Subhlok and B. Yang. A New Model for Integrated Nested Task and Data Parallelism.In Proc. of the 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, pages 1–12, Las Vegas, NV, June 1997.
[90] Rajesh Sudarsan and Calvin J. Ribbens. ReSHAPE: A framework for dynamic resizing and scheduling of homogeneous applications in a parallel environment, 2007.
[91] Alias Systems. Alias cloth technology demonstration for the Cell processor; http://www.research.ibm.com/cell/whitepapers/alias_cloth.pdf. 2005.
[92] Cell Broadband Engine Programming Tutorial Version 1.0; http://www-106.ibm.com/developerworks/eserver/library/es-archguide-v2.html.
[93] R. Thekkath and S. J. Eggers. Impact of sharing-based thread placement on multithreadedarchitectures. SIGARCH Comput. Archit. News, 22(2):176–186, 1994.
[94] John A. Turner. Roadrunner: Heterogeneous Petascale Computing for Predictive Simulation. Technical Report LANL-UR-07-1037, Los Alamos National Laboratory, February 2007. ASC Principal Investigator Meeting, Las Vegas, NV.
[95] L. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, August 1990.
[96] Perry H. Wang, Jamison D. Collins, Gautham N. Chinya, Hong Jiang, Xinmin Tian, Milind Girkar, Nick Y. Yang, Guei-Yuan Lueh, and Hong Wang. Exochi: architecture and programming environment for a heterogeneous multi-core multithreaded system. In PLDI '07: Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 156–166, New York, NY, USA, 2007. ACM.
[97] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick. The Potential of the Cell Processor for Scientific Computing. In Proc. of the 3rd Conference on Computing
Frontiers, pages 9–20, Ischia, Italy, June 2006.
[98] Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, and Katherine Yelick. The Potential of the Cell Processor for Scientific Computing. ACM International Conference on Computing Frontiers, May 3–6, 2006.
[99] Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, and Katherine Yelick. Scientific computing kernels on the cell processor. Int. J. Parallel Program., 35(3):263–298, 2007.
[100] Yun Zhang, Mihai Burcea, Victor Cheng, Ron Ho, and Michael Voss. An adaptive OpenMP loop scheduler for hyperthreaded SMPs. In David A. Bader and Ashfaq A. Khokhar, editors, ISCA PDCS, pages 256–263. ISCA, 2004.
[101] Yun Zhang and Michael Voss. Runtime empirical selection of loop schedulers on hyperthreaded SMPs. In IPDPS '05: Proceedings of the 19th IEEE International Parallel and
Distributed Processing Symposium (IPDPS’05) - Papers, page 44.2, Washington, DC,USA, 2005. IEEE Computer Society.
[102] Y. Zhao and K. Kennedy. Dependence-based Code Generation for a Cell Processor.In Proc. of the 19th International Workshop on Languages and Compilers for Parallel
Computing, New Orleans, LA, November 2006.