
SANDIA REPORT
SAND2009-6229
Unlimited Release
Printed September 2009

Building More Powerful Less Expensive Supercomputers Using Processing-In-Memory (PIM) LDRD
Final Report

Richard C. Murphy

Prepared by
Sandia National Laboratories
Albuquerque, New Mexico 87185 and Livermore, California 94550

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under Contract DE-AC04-94-AL85000.

Approved for public release; further dissemination unlimited.

Page 2: Building More Powerful Less Expensive Supercomputers …Final Report Richard C. Murphy Prepared by Sandia National Laboratories Albuquerque, New Mexico 87185 and Livermore, California

Issued by Sandia National Laboratories, operated for the United States Department of Energy by Sandia Corporation.

NOTICE: This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government, nor any agency thereof, nor any of their employees, nor any of their contractors, subcontractors, or their employees, make any warranty, express or implied, or assume any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represent that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government, any agency thereof, or any of their contractors or subcontractors. The views and opinions expressed herein do not necessarily state or reflect those of the United States Government, any agency thereof, or any of their contractors.

Printed in the United States of America. This report has been reproduced directly from the best available copy.

Available to DOE and DOE contractors from
U.S. Department of Energy
Office of Scientific and Technical Information
P.O. Box 62
Oak Ridge, TN 37831

Telephone: (865) 576-8401
Facsimile: (865) 576-5728
E-Mail: [email protected]
Online ordering: http://www.osti.gov/bridge

Available to the public from
U.S. Department of Commerce
National Technical Information Service
5285 Port Royal Rd
Springfield, VA 22161

Telephone: (800) 553-6847
Facsimile: (703) 605-6900
E-Mail: [email protected]
Online ordering: http://www.ntis.gov/help/ordermethods.asp?loc=7-4-0#online


SAND2009-6229
Unlimited Release

Printed September 2009

Building More Powerful Less Expensive Supercomputers Using Processing-In-Memory (PIM) LDRD
Final Report

Richard C. Murphy
Scalable Computer Architectures Department, 1422

Sandia National Laboratories
P.O. Box 5800, MS-1319

Albuquerque, NM 87185-1319

Abstract

This report summarizes the activities of the Processing-In-Memory (PIM) LDRD project from FY07-FY09.


Contents

1 Introduction

    Introduction
    Impact
    Publications, Presentations, and Press
        Refereed Publications
        Invited Talks
        Coverage in the Press
        Workshops
    Future Work

Appendix

A Reproduction of Refereed Papers


List of Figures


List of Tables



Chapter 1

Introduction

Introduction

This report details the accomplishments of the “Building More Powerful Less Expensive Supercomputers Using Processing-In-Memory (PIM)” LDRD (“PIM LDRD”, number 105809) for FY07-FY09.

Latency dominates all levels of supercomputer design. Within a node, increasing memory latency, relative to processor cycle time, limits CPU performance. Between nodes, the same increase in relative latency impacts scalability. Processing-In-Memory (PIM) is an architecture that directly addresses this problem using enhanced chip fabrication technology and machine organization. PIMs combine high-speed logic with dense, low-latency, high-bandwidth DRAM and lightweight threads that tolerate latency by performing useful work during memory transactions. This work examines the potential of PIM-based architectures to support mission-critical Sandia applications and an emerging class of more data-intensive informatics applications.

This work has resulted in a stronger architecture/implementation collaboration between 1400 and 1700. Additionally, key technology components have impacted vendor roadmaps, and we are in the process of pursuing these new collaborations. This work has the potential to impact future supercomputer design and construction, reducing power and increasing performance.

This final report is organized as follows: this summary chapter discusses the impact of the project (Section 1), provides an enumeration of publications and other public discussion of the work (Section 1), and concludes with a discussion of future work and impact from the project (Section 1). The appendix contains reprints of the refereed publications resulting from this work.

Impact

This LDRD has helped to establish a processor and memory architecture research focus within the lab, facilitated collaboration between 1400 and 1700, and served as an intellectual basis for a large-scale proposal effort to DARPA targeted at the end of Fall 2009. A summary of the impact follows:

• A collaboration with Micron was started to examine alternatives for implementing PIM-based computers, with the focus now being on 3D integration of DRAM and logic. Sandia’s Microelectronics center has deep experience in 3DI, and it has allowed us to begin post-LDRD collaboration with industry on both near- and long-term supercomputers.

• We have simulated a set of real applications and application kernels, particularly those from the Mantevo LDRD project, and demonstrated that those applications are memory bound in performance.

• We have identified processor inefficiencies related to the large data sets in HPC applications and the fact that commodity processors are not designed to address them.

• We have quantitatively demonstrated that memory latency, not memory bandwidth, limits performance.

• We have developed a multithreaded programming model, qthreads, that increases on-node application concurrency (a minimal usage sketch appears after this list). Given that increased concurrency lowers effective latency, this is a key method for creating future memory-latency-tolerant systems.

• We have enhanced the SST simulator to include improved memory models and provide better PIM processor simulations. This can be leveraged by academic and industry collaborators in the future.

• We have published six refereed publications and given three invited talks related to the LDRD, and expect future results to contribute to two additional papers in the Fall.
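To make the qthreads item above concrete, the sketch below shows the basic fork/sync pattern of the lightweight-threading model. It is a minimal illustration only, assuming the core qthreads C API (qthread_initialize, qthread_fork, qthread_readFF, and the aligned_t word type); the worker function, task count, and reduction are invented for the example and are not taken from any LDRD application.

/* Minimal sketch of spawning lightweight tasks with qthreads.
 * Assumes the qthreads C API (qthread_initialize, qthread_fork,
 * qthread_readFF); the worker and its squaring "work" are
 * illustrative placeholders, not code from the LDRD's applications. */
#include <stdio.h>
#include <qthread/qthread.h>

#define NTASKS 1024

/* Each lightweight thread runs this function; qthreads writes its
 * return value into the result word passed to qthread_fork and marks
 * that word full. */
static aligned_t worker(void *arg)
{
    aligned_t id = *(aligned_t *)arg;
    return id * id;                   /* stand-in for real per-task work */
}

int main(void)
{
    static aligned_t args[NTASKS], results[NTASKS];
    aligned_t sum = 0;

    qthread_initialize();             /* start the qthreads runtime */

    for (int i = 0; i < NTASKS; i++) {
        args[i] = (aligned_t)i;
        /* Spawn a lightweight thread; results[i] stays empty until it returns. */
        qthread_fork(worker, &args[i], &results[i]);
    }
    for (int i = 0; i < NTASKS; i++) {
        aligned_t r;
        qthread_readFF(&r, &results[i]);   /* block until results[i] is full */
        sum += r;
    }
    printf("sum of squares = %lu\n", (unsigned long)sum);
    return 0;
}

The full/empty synchronization on the result words is what allows very large numbers of such threads to block cheaply while waiting on memory, which is the latency-tolerance mechanism the bullet above refers to.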

Publications, Presentations, and Press

This section enumerates the public presentations of our work.

Refereed Publications

• Wheeler, Kyle, Douglas Thain, and Richard Murphy, Portable Performance from Workstation to Supercomputer: Distributing Data Structures with Qthreads, Proceedings of the First Workshop on Programming Models for Emerging Architectures, 2009.

• Barrett, Brian W., Jonathan W. Berry, Richard C. Murphy, and Kyle B. Wheeler, Implementing a Portable Multi-threaded Graph Library: the MTGL on Qthreads, International Parallel and Distributed Processing Symposium 2009 (IPDPS09), Rome, Italy, pages 1–8, May 2009.

• Murphy, Richard C., DOE’s Institute for Advanced Architecture and Algorithms: an Application-Driven Approach, Journal of Physics: Conference Series 180 (2009), 012044.

• Wheeler, Kyle, Richard C. Murphy, and Douglas Thain, Qthreads: An API for Programming with Millions of Lightweight Threads, in the Proceedings of the Workshop on Multithreaded Architectures and Applications, Miami, FL, 2008.

• Murphy, Richard C., On the Effects of Memory Latency and Bandwidth on Supercomputer Application Performance, in the IEEE International Symposium on Workload Characterization 2007 (IISWC07), Boston, MA, September 27-29, 2007.

• Murphy, Richard C., and Peter M. Kogge, On the Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications, IEEE Transactions on Computers 56(7): 937-945, July 2007.

Invited Talks

• DOE’s Institute for Advanced Architecture and Algorithms: an Application-Driven Approach, SciDAC 2009, June 14-18, San Diego, CA.

• Data Movement Dominates: An Application-Centric View of High Performance Interconnects, HSD 2009 20th Annual Workshop on Interconnects within High Speed Digital Systems, May 3-6, 2009, Santa Fe, NM.

• Can We Continue to Build Supercomputers Out of Processors Optimized for Laptops?, 13th Workshop on Distributed Supercomputing (SOS 13), March 9-12, 2009, Hilton Head, SC.

Coverage in the Press

• Moore, Samuel K., “Multicore Is Bad News for Supercomputers”, IEEE Spectrum, November 2008.

Workshops

This LDRD also contributed to the ideas behind the CSRI Workshop Memory Opportunities for High Performance Computing (MOHPC), January 9-10, 2008.


Future Work

We plan to continue to pursue three key points of impact from this work:

1. Work with Micron on advanced memory technologies for near-term (2014) and Exascale (2017) supercomputers will continue, and expand to include both DRAM and nonvolatile memory. This work will continue through the DOE Institute for Advanced Architecture and Algorithms (IAA), and the Sandia/Los Alamos Alliance for Computing at Extreme Scale (ACES). We believe impact can be shown in our Trinity supercomputer in the 2014 time-frame. This work has also generated interest from Other Government Agency (OGA) activities.

2. The qthreads programming model will be expanded and serve as the basis for an experimental on-node programming model for future machines. We expect qthreads will have to be expanded to include additional communication and synchronization primitives, particularly transactional memory, but that real applications could be targeted to a qthreads environment for evaluation purposes in FY10 and FY11.

3. The X-caliber strawman architecture, generated from this LDRD, will serve as the basis for a Sandia-led proposal to the DARPA UHPC program. We expect a BAA to be issued in the Fall of 2009, and have formed a multi-institutional team including industry and academia.


Appendix A

Reproduction of Refereed Papers

The following pages reproduce the refereed publications enumerated in Section 1.


On The Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications

Richard C. Murphy†, Member, IEEE, Peter M. Kogge, Fellow, IEEE

†Richard Murphy is at Sandia National Laboratories. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Abstract— This paper compares the SPEC Integer and Floating Point suites to a set of real-world applications for high performance computing at Sandia National Laboratories. These applications focus on the high-end scientific and engineering domains; however, the techniques presented in this paper are applicable to any application domain. The applications are compared in terms of three memory properties: first, temporal locality (or reuse over time); second, spatial locality (or the use of data “near” data that has already been accessed); and third, data intensiveness (or the number of unique bytes the application accesses). The results show that real world applications exhibit significantly less spatial locality, often exhibit less temporal locality, and have much larger data sets than the SPEC benchmark suite. They further quantitatively demonstrate the memory properties of real supercomputing applications.

Index Terms— B.8.2 Performance Analysis and Design Aids, C.1.0 General, C.4.c Measurement techniques, C.4.g Measurement, evaluation, modeling, simulation of multiple-processor systems

I. INTRODUCTION

The selection of benchmarks relevant to the supercomputing community is challenging at best. In particular, there is a discrepancy between the workloads that are most extensively studied by the computer architecture community, and the codes relevant to high performance computing. This paper examines these differences quantitatively in terms of the memory characteristics of a set of real applications from the high-end science and engineering domain as they compare to the SPEC CPU2000 benchmark suite, and more general High Performance Computing (HPC) benchmarks. The purpose of this paper is two-fold: first to demonstrate what general memory characteristics the computer architecture community should look for when identifying benchmarks relevant to HPC (and how they differ from SPEC); and second, to quantitatively explain applications’ memory characteristics to the HPC community, which often relies on intuition when discussing memory locality. Finally, although most studies discuss temporal and spatial locality when referring to memory performance, this work introduces a new measure, data intensiveness, that serves as the biggest differentiator in application properties between real applications and benchmarks. These techniques can be applied to any architecture-independent comparison of any benchmark or application suite, and this study could be repeated for other markets of interest. The three key characteristics of the application are:

1) Temporal Locality: the reuse over time of a data item from memory;

2) Spatial Locality: the use of data items in memory near other items that have already been used; and,

3) Data Intensiveness: the amount of unique data the application accesses.

Quantifying each of these three measurements is extremely difficult (see Section IV and Section II). Beyond quantitatively defining the measurements, there are two fundamental problems: first, choosing the applications to measure; and second, performing the measurement in an architecture-independent fashion that allows general conclusions to be drawn about the application rather than specific observations of the application’s performance on one particular architecture. This paper addresses the former problem by using a suite of real codes that consume significant compute time at Sandia National Laboratories; and it addresses the latter by defining the metrics to be orthogonal to each other, and measuring them in an architecture independent fashion. Consequently, these results (and the techniques used to generate them) are applicable for comparing any set of benchmarks’ or applications’ memory properties without regard to how those properties perform on any particular architectural implementation.

The remainder of this paper is organized as follows: Section II examines the extensive related work in measuring spatial and temporal locality, as well as applications’ working sets; Section III describes the Sandia integer and floating point applications, as well as the SPEC suite; Section IV quantitatively defines the measures of temporal locality, spatial locality, and data intensiveness; Section V compares the applications’ properties; Section VI presents the results; and Section VII ends with the conclusions.

II. RELATED WORK

Beyond the somewhat intuitive definitions of spatial and temporal locality provided in computer architecture textbooks [14], [27], there have been numerous attempts to quantitatively define spatial and temporal locality [37]. Early research in computer architecture [7] examined working sets, or the data actively being used by a program, in the context of paging. That work focused on efficiently capturing the working set in limited core memory, and has been an active area of research [9], [31], [35].


More recent work is oriented towards addressing the memory wall. It examines the spatial and temporal locality properties of cache accesses, and represents modern hierarchical memory structures [6], [10], [32]. Compiler writers have also extensively examined the locality properties of applications in the context of code generation and optimization [11], [12].

In addition to definitions that are continuously refined, the methodology for modeling and measuring the working set has evolved continuously. Early analytical models were validated by small experiments [7], [9], [31], [35], while modern techniques have focused on the use of trace-based analysis and full system simulation [18], [21], [32].

Because of its preeminence in computer architecture benchmarks, the SPEC suite has been extensively analyzed [12], [18], [19], [21], [33], as have other relevant workloads such as database and OLTP applications [3], [6], [20].

Finally, specialized benchmarks such as the HPC Challenge RandomAccess benchmark [8] and the STREAM benchmark [22] were constructed specifically to address memory performance. Real world applications on a number of platforms have been studied extensively [26], [36].

III. APPLICATIONS AND BENCHMARKS

This section describes a set of floating point and integer applications from Sandia National Laboratories, as well as the SPEC Floating Point and Integer benchmark suites to which they will be compared. The HPC Challenge RandomAccess benchmark, which measures random memory accesses in GUPS (Giga-Updates Per Second), and the STREAM benchmark, which measures effective bandwidth, are used as comparison points to show very basic memory characteristics. Finally, LINPACK, the standard supercomputing benchmark used to generate the Top500 list, is included for comparison. In the case of MPI codes, the user portion of MPI calls is included in the trace.

A. Floating Point Benchmarks

Real scientific applications tend to be significantly different from common processor benchmarks, such as the SPEC suite. Their datasets are larger, the applications themselves are more complex, and they are designed to run on large-scale machines. The following benchmarks were selected to represent critical problems in supercomputing seen by the largest scale deployments in the United States. The input sets were all chosen to be representative of real problems, or, when they are benchmark problems, are the typical performance evaluation benchmarks used during new system deployment. Two of the codes are benchmarks: sPPM (see Section III-A.5), which is part of the ASCI 7x benchmark suite (that set requirements for the ASCI Purple supercomputer), and Cube3, which is used as a simple driver for the Trilinos linear algebra package. The sPPM code is a slightly simplified version of a real-world problem, and, in the case of Trilinos, linear algebra is so fundamental to many areas of scientific computing that studying core kernels is significantly important.

All of the codes are written for MPPs using the MPI programming model, but, for the purposes of this study, were traced as a single node run of the application. Even without the use of MPI the codes are structured to be MPI-scalable. Other benchmarking (both performance register and trace-based) has shown that the local memory access patterns for a single node of the application and serial runs are substantially the same.

1) LAMMPS: LAMMPS represents a classic molecular dynamics simulation designed to represent systems at the atomic or molecular level [28], [29]. The program is used to simulate proteins in solution, liquid crystals, polymers, zeolites, and simple Lennard-Jones systems. The version under study is written in C++, and two significant inputs were chosen for analysis:

• Lennard-Jones Mixture: This input simulated a 2048 atom system consisting of three different types;

• Chain: simulates 32000 atoms and 31680 bonds.

LAMMPS consists of approximately 30,000 lines of code.

2) CTH: CTH is a multi-material, large deformation, strong shock wave, solid mechanics code developed over the last three decades at Sandia National Laboratories [16]. CTH has models for multi-phase, elastic viscoplastic, porous and explosive materials. CTH supports multiple types of meshes:

• three-dimensional rectangular meshes;
• two-dimensional rectangular and cylindrical meshes; and
• one-dimensional rectilinear, cylindrical, and spherical meshes.

It uses second-order accurate numerical methods to reduce dispersion and dissipation and produce accurate, efficient results. CTH is used extensively within the Department of Energy laboratory complexes for studying armor/anti-armor interactions, warhead design, high explosive initiation physics and weapons safety issues. It consists of approximately 500,000 lines of Fortran and C.

CTH has two modes of operation: with or without adaptive mesh refinement (AMR)¹. Adaptive mesh refinement changes the application properties significantly and is useful for only certain types of input problems. One AMR problem and two non-AMR problems were chosen for analysis.

Three input sets were examined:

• 2-Gas: The input set uses an 80×80×80 mesh to simulate two gases intersecting on a 45 degree plane. This is the most “benchmark-like” (e.g., simple) input set, and is included to better understand how representative it is of real problems.

• Explosively Formed Projectile (EFP): The simulation represents a simple Explosively Formed Projectile (EFP) that was designed by Sandia National Laboratories staff. The original design was a combined experimental and modeling activity where design changes were evaluated computationally before hardware was fabricated for testing. The design features a concave copper liner that is formed into an effective fragment by the focusing of shock waves from the detonation of the high explosive. The measured fragment size, shape, and velocity are accurately (within 5%) modeled by CTH.

• CuSt AMR: This input problem simulates a 4.52 km/s impact of a 4 mm copper ball on a steel plate at a 90 degree angle. Adaptive mesh refinement is used in this problem.

¹AMR typically uses graph partitioning as part of the refinement, two algorithms for which are part of the integer benchmarks under study. One of the most interesting results of including a code like CTH in a “benchmark suite” is its complexity.

3) Cube3: Cube3 is meant to be a generic linear solver and drives the Trilinos [15] frameworks for parallel linear and eigensolvers. Cube3 mimics a finite element analysis problem by creating a beam of hexagonal elements, then assembling and solving a linear system. The problem can be varied by width, depth, and degrees of freedom (e.g., temperature, pressure, velocity, or whatever physical modeling the problem is meant to represent). The physical problem is three dimensional. The number of equations in the linear system is equal to the number of nodes in the mesh multiplied by the degrees of freedom at each node. There are two variants based on how the sparse matrices are stored:

• CRS: a 55×55 sparse compressed row system; and
• VBR: a 32×16 variable block row system.

These systems were chosen to represent a large system of equations.

lations of reacting flow problems [34]. These problems requireboth fluid flow and chemical kinetics modeling.

5) sPPM: The sPPM [5] benchmark is part of the ASCI Purple benchmark suite as well as the 7× application list for ASCI Red Storm. It solves a 3D gas dynamics problem on a uniform Cartesian mesh using a simplified version of the PPM (Piecewise Parabolic Method) code. The hydrodynamics algorithm requires three separate sweeps through the mesh per time step. Each sweep requires approximately 680 flops to update the state variables for each cell. The sPPM code contains over 4000 lines of mixed Fortran 77 and C routines. The problem solved by sPPM involves a simple, but strong (about Mach 5) shock propagating through a gas with a density discontinuity.

B. Integer Benchmarks

While floating point applications represent the classic supercomputing workload, problems in discrete mathematics, particularly graph theory, are becoming increasingly prevalent. Perhaps most significant of these are fundamental graph theory algorithms. These routines are important in the fields of proteomics, genomics, data mining, pattern matching and computational geometry (particularly as applied to medicine). Furthermore, their performance emphasizes the critical need to address the von Neumann bottleneck in a novel way. The data structures in question are very large, sparse, and referenced indirectly (e.g., through pointers) rather than as regular arrays. Despite their vital importance, these applications are significantly underrepresented in computer architecture research, and there is currently little joint work between architects and graph algorithms developers.

In general, the integer codes are more “benchmark” problems (in the sense that they use non-production input sets), heavily weighted towards graph theory codes, than are the floating point benchmarks.

1) Graph Partitioning: There are two large-scale graph partitioning heuristics included here: Chaco [13] and Metis [17]. Graph partitioning is used extensively in automation for VLSI circuit design, static and dynamic load balancing on parallel machines, and numerous other applications. The input set in this work consists of a 143,437 vertex and 409,593 edge graph to be partitioned into 1,024 balanced parts (with minimum edge cut between partitions).

2) Depth First Search (DFS): DFS implements a Depth First Search on a graph with 2,097,152 vertices and 25,690,112 edges. DFS is used extensively in higher-level algorithms, including identifying connected components, tree and cycle detection, solving the two-coloring problem, finding Articulation Vertices (e.g., the vertex in a connected graph that, when deleted, will cause the graph to become a disconnected graph), and topological sorting.

3) Shortest Path: Shortest Path computes the shortest path on a graph of 1,048,576 vertices and 7,864,320 edges, and incorporates a breadth first search. Extensive applications exist in real world path planning and networking and communications.

4) Isomorphism: The graph isomorphism problem determines whether or not two graphs have the same shape or structure. Two graphs are isomorphic if there exists a one-to-one mapping between vertices and edges in the graph (independent of how those vertices and edges are labeled). The problem under study confirms that two graphs of 250,000 vertices and 10 million edges are isomorphic. There are numerous applications in finding similarity (particularly, subgraph isomorphism) and relationships between two differently labeled graphs.

5) BLAST: The Basic Local Alignment Search Tool (BLAST) [1] is the most heavily used method for quickly searching nucleotide and protein databases in biology. The algorithm attempts to find both local and global alignment of DNA nucleotides, as well as identifying regions of similarity embedded in two proteins. BLAST is implemented as a dynamic programming algorithm.

The input sequence chosen was obtained by training a hidden Markov model on approximately 15 examples of piggyBac transposons from various organisms. This model was used to search the newly assembled Aedes aegypti genome (a mosquito). The best result from this search was the sequence used in the BLAST search. The target sequence obtained was blasted against the entire Aedes aegypti sequence to identify other genes that could be piggyBac transposons, and to double-check that the subsequence is actually a transposon.

6) zChaff: The zChaff program implements the Chaff heuristic [23] for finding solutions to the Boolean satisfiability problem. A formula in propositional logic is satisfiable if there exists an assignment of truth values to each of its variables that will make the formula true. Satisfiability is critical in circuit validation, software validation, theorem proving, model analysis and verification, and path planning. The zChaff input comes from circuit verification and consists of 1,534 Boolean variables, 132,295 clauses with five instances, that are all satisfiable.

Page 17: Building More Powerful Less Expensive Supercomputers …Final Report Richard C. Murphy Prepared by Sandia National Laboratories Albuquerque, New Mexico 87185 and Livermore, California

4

TABLE I
SPEC CPU2000 INTEGER SUITE

Benchmark    Lang.  Description
164.gzip     C      Data Compression
175.vpr      C      FPGA Placement and Routing
176.gcc      C      GNU C Compiler
181.mcf      C      Combinatorial Optimization
186.crafty   C      Chess
197.parser   C      Word Processing
252.eon      C++    Visualization
253.perlbmk  C      PERL
254.gap      C      Group Theory
255.vortex   C      Object Oriented Database
256.bzip2    C      Data Compression
300.twolf    C      VLSI Placement and Routing

C. SPEC

The SPEC CPU2000 suite is currently by far the most studied benchmark suite for processor performance [4]. This work uses both the SPEC-Integer and SPEC-FP components of the suite, as summarized in Tables I and II respectively, as its baseline comparison for benchmark evaluation.

1) SPEC Integer Benchmarks: The SPEC Integer Suite, summarized in Table I, is by far the most studied half of the SPEC suite. It is meant to generally represent workstation class problems. Compiling (176.gcc), compression (164.gzip and 256.bzip2), and systems administration tasks (253.perlbmk) have many input sets in the suite. These tasks tend to be somewhat streaming on average (the perl benchmarks, in particular, perform a lot of line-by-line processing of data files). The more scientific and engineering oriented benchmarks (175.vpr, 181.mcf, 252.eon, 254.gap, 255.vortex, and 300.twolf) are somewhat more comparable to the Sandia integer benchmark suite. However, selectively choosing benchmarks from SPEC produces generally less accurate comparisons than using the entire suite (although it would lessen the computational requirements for analysis significantly).

It should be noted that the SPEC suite is specifically designed to emphasize computational rather than memory performance. Indeed, other benchmark suites, such as the STREAM benchmark or RandomAccess, focus much more extensively on memory performance. However, given the nature of the memory wall, what is important is a mix of the two. SPEC, in this work, represents the baseline only because it is, architecturally, the most studied benchmark suite. Indeed, a benchmark such as RandomAccess would undoubtedly overemphasize the memory performance at the expense of computation, as compared to the real-world codes in the Sandia suite.

2) SPEC Floating Point Benchmarks: The SPEC Floating Point suite is summarized in Table II, and primarily represents scientific applications. At first glance, these applications would appear very similar to the Sandia Floating Point suite; however, the scale of the applications (in terms of execution time, code complexity, and input set size) differs significantly.

TABLE II
SPEC FLOATING POINT SUITE

Benchmark     Lang  Description
168.wupwise   F77   Quantum Chromodynamics
171.swim      F77   Shallow Water Modeling
172.mgrid     F77   Multi-grid Solver
173.applu     F77   Parabolic PDEs
177.mesa      C     3D Graphics
178.galgel    F90   Comp. Fluid Dynamics
179.art       C     Adaptive Resonance Theory
183.equake    C     Seismic Wave Propagation
187.facerec   F90   Face Recognition
188.ammp      C     Computational Chemistry
189.lucas     F90   Primality Testing
191.fma3d     F90   Finite Element Crash Simulation
200.sixtrack  F77   High Energy Physics Accelerator
301.apsi      F77   Pollutant Distribution

D. RandomAccess

The RandomAccess benchmark is part of the HPC Challenge suite [8] and measures the performance of the memory system by updating random entries in a very large table that is unlikely to be cached. This benchmark is specifically designed to exhibit very low spatial and temporal locality, and a very large data set (as the table update involves very little computation). It represents the most extreme of memory intensive codes, and is used as a comparison point to the benchmarks and real applications in this work.
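As an illustration of why RandomAccess stresses the memory system, the sketch below approximates its core update loop in C. It is not the official HPCC source: the table size, update count, and the stand-in random-number generator are assumptions chosen only to show the access pattern (8-byte XOR updates at effectively random table locations).

/* Sketch of a RandomAccess (GUPS)-style update loop. Approximates the
 * benchmark's behavior (random 64-bit XOR updates to a large table);
 * the generator and sizes are illustrative, not copied from the
 * official HPCC source. */
#include <stdint.h>
#include <stdlib.h>

#define LOG2_TABLE_SIZE 26                        /* 2^26 words = 512 MB table */
#define TABLE_SIZE      (1ULL << LOG2_TABLE_SIZE)
#define NUPDATES        (4 * TABLE_SIZE)

int main(void)
{
    uint64_t *table = malloc(TABLE_SIZE * sizeof(uint64_t));
    if (!table) return 1;

    for (uint64_t i = 0; i < TABLE_SIZE; i++)
        table[i] = i;

    /* A simple multiplicative generator stands in for the benchmark's
     * primitive-polynomial sequence; each update touches an essentially
     * random 8-byte word, defeating caches and prefetchers. */
    uint64_t ran = 1;
    for (uint64_t i = 0; i < NUPDATES; i++) {
        ran = ran * 6364136223846793005ULL + 1442695040888963407ULL;
        table[ran & (TABLE_SIZE - 1)] ^= ran;
    }

    free(table);
    return 0;
}

Because consecutive updates land in unrelated cache lines, nearly every update is a main-memory access, which is why the benchmark is reported in giga-updates per second rather than as a bandwidth figure.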

E. STREAM

The STREAM benchmark [22] is used to measure sustainable bandwidth on a platform and does so via four simple operations performed non-contiguously on three large arrays:

• Copy: a(i) = b(i)
• Scale: a(i) = q ∗ b(i)
• Sum: a(i) = b(i) + c(i)
• Triad: a(i) = b(i) + q ∗ c(i)

To measure the performance of main memory, the STREAM rule is that the data size is scaled to four times the size of the platform’s L2 cache. Because this work is focused on architecture independent numbers, each array size was scaled to 32MB, which is reasonably large for a workstation.
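For reference, each of the four kernels above is a single unit-stride loop over large arrays; the sketch below shows the Triad case in C. It is illustrative only: the arrays are sized at 32 MB each to match the setting used in this work, while the official STREAM code adds timing, repetition over all four kernels, and result verification.

/* Sketch of the STREAM Triad kernel: a(i) = b(i) + q*c(i).
 * Each array is 32 MB of doubles to mirror the setting in this paper;
 * the real benchmark times repeated runs and checks the results. */
#include <stdio.h>
#include <stdlib.h>

#define N (32UL * 1024 * 1024 / sizeof(double))   /* 32 MB per array */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    const double q = 3.0;
    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    /* Triad: two loads and one store per element, streaming through
     * memory with unit stride -- a pure sustainable-bandwidth test. */
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];

    printf("a[0] = %f\n", a[0]);   /* keep the loop from being optimized away */
    free(a); free(b); free(c);
    return 0;
}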

IV. METHODOLOGY AND METRICS

This work evaluates the temporal and spatial locality characteristics of applications separately. This section describes the methodology used in this work and formally defines the temporal locality, spatial locality, and data intensiveness measures.

A. Methodology

The applications in this work were each traced using the Amber instruction trace generator [2] for the PowerPC. Trace files containing 4 billion sequential instructions were generated by identifying and capturing each instruction executed in critical sections of the program. The starting point for each trace was chosen using a combination of performance register profiling of the memory system, code reading, and, in the case of SPEC, accumulated knowledge of good sampling points. The advice of application experts was also used for the Sandia codes. The traces typically represent multiple executions of the main loop (multiple time steps for the Sandia floating point benchmarks). These traces have been used extensively in other work, and are well understood [25].

B. Temporal Locality

The application’s temporal working set describes its temporal locality. As in prior work [25], a temporal working set of size N is modeled as an N byte, fully associative, true least recently used cache with native machine word sized blocks. The hit rate of that cache is used to describe the effectiveness of the fixed-size temporal working set at capturing the application’s data set. The same work found that the greatest differentiation between conventional and supercomputer applications occurred in the 32KB-64KB level one cache sized region of the temporal working set. The temporal locality in this work is given by a temporal working set of size 64 KB. The temporal working set is measured over a long-studied 4 billion instruction trace from the core of each application. The number of instructions is held constant for each application. This puts the much shorter running SPEC benchmark suite on comparable footing to the longer running supercomputing applications.

It should be noted that there is significant latitude in the choice of temporal working set size. The choice of a level one cache sized working set is given for two reasons: first, it has been demonstrated to offer the greatest differentiation of applications between the floating point benchmarks in this suite and SPEC FP; and second, while there is no direct map to a conventionally constructed L1 cache, the L1 hit rate strongly impacts performance. There are two other compelling choices for temporal working set size:

1) Level 2 Cache Sized: in the 1-8 MB region. Arguably, the hit rate of the cache closest to memory most impacts performance (given very long memory latencies).

2) Infinite: describes the temporal hit rate required to capture the application’s total data set size.

Given N memory accesses, H of which hit the cache described above, the temporal locality is given by H/N.
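As a concrete rendering of this metric, the sketch below simulates the 64 KB fully associative, true-LRU cache with word-sized blocks over a stream of data addresses and reports H/N. It is a simplified stand-in for the trace-driven tooling used in this paper: the linear lookup and eviction scans favor clarity over speed, and the synthetic address stream merely takes the place of a real 4-billion-instruction trace.

/* Sketch of the temporal-locality measure: hit rate of a 64 KB fully
 * associative, true-LRU cache with 8-byte (machine word) blocks, fed
 * by a stream of data addresses. Illustrative only; not the paper's
 * actual trace infrastructure. */
#include <stdint.h>
#include <stdio.h>

#define WORD_BYTES 8
#define CAPACITY   (64 * 1024 / WORD_BYTES)   /* 8192 word-sized blocks in 64 KB */

static uint64_t blocks[CAPACITY];    /* word addresses currently resident */
static uint64_t last_use[CAPACITY];  /* logical time of most recent touch */
static int      nvalid;
static uint64_t hits, accesses, now;

/* Record one load or store to byte address addr; returns 1 on a hit. */
static int touch(uint64_t addr)
{
    uint64_t blk = addr / WORD_BYTES;
    accesses++; now++;

    for (int i = 0; i < nvalid; i++) {        /* fully associative lookup */
        if (blocks[i] == blk) {
            last_use[i] = now;                /* refresh true-LRU timestamp */
            hits++;
            return 1;
        }
    }
    int victim = 0;                           /* miss: fill a free slot or evict LRU */
    if (nvalid < CAPACITY) {
        victim = nvalid++;
    } else {
        for (int i = 1; i < CAPACITY; i++)
            if (last_use[i] < last_use[victim])
                victim = i;
    }
    blocks[victim] = blk;
    last_use[victim] = now;
    return 0;
}

int main(void)
{
    /* Illustrative address stream standing in for a real trace: a small
     * array swept repeatedly (high reuse) plus a large strided scan (none). */
    for (int rep = 0; rep < 4; rep++)
        for (uint64_t a = 0; a < 16 * 1024; a += WORD_BYTES)
            touch(a);
    for (uint64_t a = 1 << 20; a < (1 << 20) + (1 << 22); a += 4096)
        touch(a);

    printf("temporal locality H/N = %.3f\n", (double)hits / (double)accesses);
    return 0;
}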

C. Spatial Locality

Measuring the spatial locality of an application may be the most challenging aspect of this work. Significant prior work has examined it as the application’s stride of memory access. The critical measurement is how quickly the application consumes all the data presented to it in a cache block. Thus, given a cache block size, and a fixed interval of instructions, the spatial locality can be described as the ratio of data the application actually uses (through a load or store) to the cache line size. This work uses an instruction interval of 1,000 instructions, and a cache block size of 64 bytes. For this work, every 1,000 instruction window in the application’s 4 billion instruction trace is examined for unique loads and stores. Those loads and stores are then clustered into 64-byte blocks, and the ratio of used to requested data in the block is computed. The block size is chosen as a typical conventional cache system’s block size. There is much more latitude in the instruction window size. It must be large enough to allow for meaningful computation, while being small enough to report differentiation in the application’s spatial locality. For example, a window size of the number of instructions in the application should report that virtually all cache lines are 100% used. The 1,000 instruction window was chosen based on prior experience with the applications [24].

Fig. 1. An example of temporal locality, spatial locality, and data intensiveness.

Given U_1000 unique bytes accessed in an average interval of 1,000 instructions that are clustered into L 64-byte cache lines, the spatial locality is given by U_1000/(64L).
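The spatial-locality computation can be sketched in the same style: for every 1,000-instruction window, collect the unique bytes actually loaded or stored, cluster them into 64-byte lines, and average U_1000/(64L) over all windows. The trace record layout and the linear-scan sets below are illustrative stand-ins for the paper's real trace infrastructure; note that applying the same unique-byte counting to the entire 4-billion-instruction trace yields the data-intensiveness measure U_4B described next.

/* Sketch of the spatial-locality metric: for each 1,000-instruction
 * window, count the unique bytes loaded/stored (U_1000), cluster them
 * into 64-byte lines (L), and average U_1000 / (64 * L) over windows.
 * Trace records and linear-scan "sets" are illustrative only. */
#include <stdint.h>
#include <stdio.h>

#define WINDOW    1000    /* instructions per window */
#define LINE      64      /* cache block size in bytes */
#define MAX_UNIQ  8192    /* bound on unique bytes/lines per window */

typedef struct { int is_mem; uint64_t addr; int size; } TraceRec;

/* Average U_1000 / (64 * L) over all full windows of a trace of n records
 * (one record per instruction; is_mem marks loads and stores). */
static double spatial_locality(const TraceRec *trace, long n)
{
    uint64_t bytes[MAX_UNIQ], lines[MAX_UNIQ];
    int nb = 0, nl = 0, insn = 0;
    long windows = 0;
    double sum = 0.0;

    for (long i = 0; i < n; i++) {
        if (trace[i].is_mem) {
            for (int off = 0; off < trace[i].size; off++) {
                uint64_t byte = trace[i].addr + (uint64_t)off;
                uint64_t line = byte / LINE;
                int seen = 0;
                for (int j = 0; j < nb; j++)
                    if (bytes[j] == byte) { seen = 1; break; }
                if (!seen && nb < MAX_UNIQ) bytes[nb++] = byte;
                seen = 0;
                for (int j = 0; j < nl; j++)
                    if (lines[j] == line) { seen = 1; break; }
                if (!seen && nl < MAX_UNIQ) lines[nl++] = line;
            }
        }
        if (++insn == WINDOW) {               /* close out this window */
            if (nl > 0) { sum += (double)nb / (double)(LINE * nl); windows++; }
            insn = 0; nb = 0; nl = 0;
        }
    }
    return windows ? sum / (double)windows : 0.0;
}

int main(void)
{
    /* Tiny illustrative trace: one 4-byte load every 4th instruction,
     * each landing in a different 64-byte line (low spatial locality). */
    static TraceRec trace[WINDOW];
    for (int i = 0; i < WINDOW; i++) {
        trace[i].is_mem = (i % 4 == 0);
        trace[i].addr   = 0x10000 + (uint64_t)i * 256;
        trace[i].size   = 4;
    }
    printf("spatial locality = %.4f\n", spatial_locality(trace, WINDOW));
    return 0;
}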

D. Data Intensiveness

One critical yet often overlooked metric of an application’s memory performance is its data intensiveness, or the total amount of unique data that the application accesses (regardless of ordering) over a fixed interval of instructions. Over the course of the entire application, this would be the application’s memory footprint. This is not fully captured by the measurements given above, and it is nearly impossible to determine from a cache miss rate. This differs from the application’s memory footprint because it only includes program data that is accessed via a load or store (where the memory footprint would also include program text). Because a cache represents a single instantiation used to capture an application’s working set, a high miss rate could be more indicative of the application accessing a relatively small amount of memory in a temporal order that is poorly suited to the cache’s parameters, or that the application exhibits very low spatial locality. It is not necessarily indicative of the application accessing a large data set, which is critical to supercomputing application performance. This work presents the data intensiveness as the total number of unique bytes that the application’s trace accessed over its 4 billion instruction interval.

This is directly measured by counting the total number of unique bytes accessed over the given interval of 4 billion instructions. This is the same as the unique bytes measure given above, except it is measured over a larger interval (U_4B).

E. An Example

Figure 1 shows an example instruction sequence. Assuming that this is the entire sequence under analysis, each of the metrics given above is computed as follows:


Temporal Locality: is the hit rate of a fully associative cache. The first 3 loads in the sequence (of 0xA0000, 0xA0004, and 0xB0000) miss the cache. The final store (to 0xA0000) hits the item in the cache that was loaded 3 memory references prior. Thus, the temporal locality is:

1 hit / 4 memory references = 0.25

Spatial Locality: is the ratio of used to requested bytes in a 64-byte cache line. Assuming that each load request is 32 bits, there are two unique lines requested, 0xA0000 (to 0xA0040), and 0xB0000 (to 0xB0040). Two 32-bit words are consumed from 0xA0000, and one 32-bit word from 0xB0000. The spatial locality is calculated as:

12 consumed bytes / 128 requested bytes = 0.09375

Data Intensiveness: is the total number of unique bytes consumed by the stream. In this case, 3 unique 32-bit words are requested, for a total of 12 bytes.

V. INITIAL OBSERVATIONS OF PROGRAM CHARACTERISTICS

Fig. 2. Benchmark Suite Mean Instruction Mix

Figure 2 shows the instruction mix breakdown for the benchmark suites. Of particular importance is that the Sandia Floating Point applications perform significantly more integer operations than their SPEC Floating Point counterparts, in excess of 1.66 times the number of integer operations, in fact. This is largely due to the complexity of the Sandia applications (with many configuration operations requiring integer tests, table look ups requiring integer index calculations, etc.) as well as their typically more complicated memory addressing patterns [30]. This is largely due to the complexity of the algorithm, and the fact that significantly more indirection is used in memory address calculations. Additionally, in the case of the floating point applications, although the Sandia applications perform only about 1.5% more total memory references than their SPEC-FP counterparts, the Sandia codes perform 11% more loads, and only about 2/3 the number of stores, indicating that the results produced require more memory inputs to produce fewer memory outputs. The configuration complexity can also be seen in that the Sandia codes perform about 11% more branches than their SPEC counterparts.

In terms of the integer applications, the Sandia codes perform about 12.8% fewer memory references over the same number of instructions; however, those references are significantly harder to capture in a cache. The biggest difference is that the Sandia Integer codes perform 4.23 times the number of floating point operations as their SPEC Integer counterparts. This is explained by the fact that three of the Sandia Integer benchmarks perform somewhat significant floating point computations.

TABLE III
SANDIA INTEGER APPLICATIONS WITH SIGNIFICANT FLOATING POINT COMPUTATION

Application   Percent Floating Point Instructions
Chaco         15.84%
DFS           14.74%
Isomorphism   13.41%

Table III summarizes the three Sandia Integer Suite applications with significant floating point work: Chaco, DFS, and Isomorphism. Their floating point ratios are well below the median for SPEC FP (28.69%), but above the Sandia Floating Point median (10.67%). They are in the integer category because their primary computation is an integer graph manipulation, whereas CTH is in the floating point category even though its runs have a lower floating point percentage (a mean over its three input runs of 6.83%), because the floating point work is the primary computation. For example, Chaco is a multilevel partitioner and uses spectral partitioning in its base case, which requires the computation of an eigenvector (a floating point operation). However, graph partitioning is fundamentally a combinatorial algorithm, and is consequently in the integer category. In the case of CTH, which is a floating point application with a large number of integer operations, it is a shock physics code. The flops fundamentally represent the “real work”, and the integer operations can be accounted for by the complexity of the algorithms, and the large number of table look-ups employed by CTH to find configuration parameters. In either case, the SPEC FP suite is significantly more floating point intensive.

VI. RESULTS

The experimental results given by the metrics from Section IV are presented below. Each graph depicts the temporal locality on the X-axis, and the spatial locality on the Y-axis. The area of each circle on the graph depicts each application’s relative data intensiveness (or the total amount of unique data consumed over the instruction stream).

Figure 3 provides the summary results for each suite of applications, and the RandomAccess memory benchmark.


Fig. 3. Mean Temporal vs. Spatial Locality and Data Intensiveness for each benchmark suite.

The Sandia Floating Point suite exhibits approximately 36% greater spatial locality and nearly 7% less temporal locality than its SPEC-FP counterpart. The nearness in temporal locality and the increased spatial locality are somewhat surprising when taken out of context. One would typically expect scientific applications to be less well structured. The critical data intensiveness measure proves the most enlightening. The Sandia FP suite accesses over 2.6 times the amount of data as SPEC FP. The data intensiveness is the most important differentiator between the suites. A larger data set size would reflect significantly worse performance in any real cache implementation. Without the additional measure, the applications would appear more comparable. It should be noted that the increased spatial locality seen in the Sandia Floating Point applications is likely because those applications use the MPI programming model, which generally groups data to be operated upon into a buffer for transmission over the network (increasing the spatial locality).

The Sandia integer suite is significantly farther from the SPEC integer suite in all dimensions. It exhibits close to 30% less temporal locality, nearly 40% less spatial locality, and has a unique data set over 5.9 times the size of the SPEC integer suite.

The LINPACK benchmark shows the highest spatial and temporal locality of any benchmark, and by far the smallest data intensiveness (the dot is hardly visible on the graph). It is over 3,000 times smaller than any of the real world Sandia applications. It exhibits 17% less temporal locality than, and roughly the same spatial locality as, the Sandia FP suite. The Sandia Integer suite has half the temporal locality and less than one third the spatial locality.

The STREAM benchmark showed over 100 times less temporal locality than RandomAccess, and 2.4 times the spatial locality. However, critically, the data intensiveness for STREAM is 1/95th that of RandomAccess. The Sandia Integer Suite is only 1% less spatially local than STREAM, indicating that most of the bandwidth used to fetch a cache line is wasted.

While it is expected that RandomAccess exhibits very low spatial and temporal locality, given its truly random memory access pattern, its data set is 3.7× the size of the Sandia FP suite, 4.5× the size of the Sandia Integer suite, and 9.7× and 26.5× the SPEC floating point and integer suites respectively.

Figure 4(a) shows each individual floating point application in the Sandia and SPEC suites. On the basis of spatial and temporal locality measurements alone, the 177.mesa SPEC FP benchmark would appear to dominate all others in the suite. However, it has the second smallest unique data set size in the entire SPEC suite. In fact, the Sandia FP applications average over 9 times the data intensiveness of 177.mesa. There are numerous very small data set applications in SPEC FP, including 177.mesa, 178.galgel, 179.art, 187.facerec, 188.ammp, and 200.sixtrack. In fact, virtually all of the applications from SPEC FP that are “close” to a Sandia application in terms of spatial and temporal locality exhibit a much smaller unique data set. The mpsalsa application from the Sandia suite and 183.equake are good examples. While they are quite near on the graph, mpsalsa has almost 17 times the unique data set of equake. 183.equake is also very near the mean spatial and temporal locality point for the entire Sandia FP suite, except that the Sandia applications average more than 15 times 183.equake’s data set size.

Unfortunately, it would be extremely difficult to identify a SPEC FP application that is “representative” of the Sandia codes (either individually, or on average). Often papers choose a subset of a given benchmark suite’s applications when presenting the results. Choosing the five applications in SPEC FP with the largest data intensiveness (168.wupwise, 171.swim, 173.applu, 189.lucas, 301.apsi), and 183.equake (because of its closeness to the average and to mpsalsa) yields a suite that averages 90% of the Sandia suite’s temporal locality, 86% of its spatial locality, and 75% of its data intensiveness. While somewhat far from “representative”, particularly in terms of data intensiveness, this subset is more representative of the real applications than the whole.

Several interesting Sandia applications are shown on the graph. The CTH application exhibits the most temporal locality, but relatively low spatial locality, and a relatively small data set size. The LAMMPS (lmp) molecular dynamics code is known to be compute intensive, but it exhibits a relatively small memory footprint, and shows good memory performance. The temporal and spatial locality measures are quite low. sPPM exhibits very high spatial locality, very low temporal locality, and a moderate data set size.

Figure 4(b) depicts the Sandia and SPEC Integer benchmark suites. These applications are strikingly more different than the floating point suite. All of the applications exhibit relatively low spatial locality, although the majority of Sandia applications exhibit significantly less spatial locality than their SPEC counterparts. The DFS code in the Sandia suite is the most “RandomAccess-like”, with 255.vortex in the SPEC suite being the closest counterpart in terms of spatial and temporal locality. 255.vortex’s temporal and spatial locality are within 25% and 15% of DFS’s respectively. However, once again, DFS’s data set size is over 19 times that of 255.vortex’s.


Fig. 4. (a) Integer and (b) Floating Point Applications Temporal vs. Spatial Locality and Data Intensiveness.

300.twolf actually comes closest in terms of spatial and temporal locality to representing the “average” Sandia Integer application; however, the average Sandia code has nearly 140 times the data set size.

VII. CONCLUSIONS

This work has measured the temporal and spatial locality, and the relative data intensiveness of a set of real world Sandia applications, and compared them to the SPEC Integer and Floating Point suites, as well as the RandomAccess memory benchmark. While the SPEC floating point suite exhibits greater temporal locality and less spatial locality than the Sandia floating point suite, it averages significantly less data intensiveness. This is crucial because the number of unique items consumed by the application can affect the performance of hierarchical memory systems more than the average efficiency with which those items are stored in the hierarchy.

The integer suites showed even greater divergence in all three dimensions (temporal locality, spatial locality, and data intensiveness). Many of the key integer benchmarks, which represent applications of emerging importance, are close to RandomAccess in their behavior.

This work has further quantitatively demonstrated the difference between a set of real applications (both current and emerging) relevant to the high performance computing community, and the most studied set of benchmarks in computer architecture. The real integer codes are uniformly harder on the memory system than the SPEC integer suite. In the case of floating point codes, the Sandia applications exhibit a significantly larger data intensiveness, and lower temporal locality. Because of the dominance of the memory system in achieving performance, this indicates that architects should focus on codes with significantly larger data set sizes.

The emerging applications characterized by the Sandia Integer suite are the most challenging applications (next to the RandomAccess benchmark). Because of their importance and their demands on the memory system, they represent a core group of applications that require significant attention.

Finally, beyond a specific study of one application domain, this work presents an architecture-independent methodology for quantifying the difference in memory properties between any two applications (or suites of applications). This study can be repeated for other problem domains of interest (the desktop, multimedia, business, etc.).

ACKNOWLEDGMENTS

The authors would like to thank Arun Rodrigues and Keith Underwood at Sandia National Laboratories for their valuable comments.

The application analysis and data gathering discussed here used tools that were funded in part by DARPA through Cray, Inc. under the HPCS Phase 1 program.


Richard C. Murphy is a computer architect in the scalable systems group at Sandia National Laboratories. He received his Ph.D. in computer engineering at the University of Notre Dame. His research interests include computer architecture, with a focus on memory systems and Processing-In-Memory, VLSI, and massively parallel architectures, programming languages, and runtime systems. He spent 2000 to 2002 at Sun Microsystems focusing on hardware resource management and dynamic configuration. He is a member of the IEEE.

Peter M. Kogge was with IBM, Federal Systems Division, from 1968 until 1994, and was appointed an IEEE Fellow in 1990 and an IBM Fellow in 1993. In 1977 he was a Visiting Professor in the ECE Dept. at the University of Massachusetts, Amherst. From 1977 through 1994, he was also an Adjunct Professor in the Computer Science Dept. of the State University of New York at Binghamton. In August 1994 he joined the University of Notre Dame as first holder of the endowed McCourtney Chair in Computer Science and Engineering (CSE). Since the summer of 1997, he has been a Distinguished Visiting Scientist at the Center for Integrated Space Microsystems at JPL. He is also the Research Thrust Leader for Architecture in Notre Dame's Center for Nano-Science and Technology. For the 2000-2001 academic year he was the Interim Schubmehl-Prein Chairman of the CSE Dept. at Notre Dame. Since August 2001 he has been the Associate Dean for Research, College of Engineering. Since the fall of 2003 he has also been a Concurrent Professor of Electrical Engineering. He is a Fellow of the IEEE.

On the Effects of Memory Latency and Bandwidth on Supercomputer Application Performance

Richard Murphy

Sandia National Laboratories∗

PO Box 5800, MS-1110, Albuquerque, NM
[email protected]

∗Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Abstract

Since the first vector supercomputers in the mid-1970's, the largest scale applications have traditionally been floating point oriented numerical codes, which can be broadly characterized as the simulation of physics on a computer. Supercomputer architectures have evolved to meet the needs of those applications. Specifically, the computational work of the application tends to be floating point oriented, and the decomposition of the problem is two or three dimensional. Today, an emerging class of critical applications may change those assumptions: they are combinatorial in nature, integer oriented, and irregular. The performance of both classes of applications is dominated by the performance of the memory system. This paper compares the memory performance sensitivity of both traditional and emerging HPC applications, and shows that the new codes are significantly more sensitive to memory latency and bandwidth than their traditional counterparts. Additionally, these codes exhibit lower baseline performance, which only exacerbates the problem. As a result, the construction of future supercomputer architectures to support these applications will most likely be different from those used to support traditional codes. Quantitatively understanding the difference between the two workloads will form the basis for future design choices.

1. Introduction and Motivation

Supercomputing is in the midst of large technological, architectural, and application changes that greatly impact the way designers and programmers think about the system. Technologically, the constraints of power and the speed of light dictate that multicore architectures will form the basis for the commodity processors that constitute the heart of massively parallel processing (MPP) supercomputers. This impacts the architecture in three ways:

1. Power constraints dictate that clock rates will not improve appreciably as they have in the past.

2. The combination of power, the constraint of the speed of light, and the architectural limits of instruction level parallelism dictate that the trend in scalar processors towards higher performing individual cores will not hold.

3. Even though die area is increasing, cost is still dictated by packaging, and the number of pins available for external communication will likely not grow as quickly as the number of cores.

These technological and architectural trends are no less significant than those which dictated the transition from vector-based supercomputers to commodity MPP architectures in the early 1990's. In fact, given the maturity of vector architectures, such a change was likely inevitable. One critical architectural trend still holds across any implementation: the challenge posed by the memory wall. In today's supercomputers, the memory wall manifests itself as a dramatic difference between the increase in processor clock rate and the rate of a memory access. In multicore machines – even with flat clock rates – the memory wall manifests itself due to the increasing number of cores compared to available channels (or independent access paths) to memory.

Unlike prior large-scale technological and architectural changes, today's architects must also contend with a shift in the application base. Historically, supercomputing has been dominated by the simulation of physics on a computer, which itself can be thought of as fundamentally structured in nature. In a three dimensional universe, problem decomposition can be performed in three dimensions (by dividing the simulated area into cubes). The types of supercomputer architectures that have been adopted often reflect this structure. For example, 3d mesh topologies reflect the kind of "nearest neighbors" communication pattern common in physics codes. Furthermore, the "work" performed in these applications is fundamentally floating point: the computation of temperature, pressure, volume, etc. Many emerging applications, however, are different. They are combinatorial in nature, fundamentally unstructured, and often consist of integer computations.

This paper examines the memory performance of a suite of real world applications from both the traditional and emerging problem domains. It examines the impact of memory latency and bandwidth on the applications. The results demonstrate that both sets of applications are fundamentally dominated by memory latency, but that the emerging applications both begin with a lower baseline performance and are more sensitive to memory than their traditional counterparts. This represents a significant challenge to supercomputer architects, and quantitatively establishes how emerging applications differ from their traditional counterparts.

The remainder of this paper is organized as follows: Section 2 discusses the related work. Section 3 examines the applications under study. Section 4 specifies the methodology and metrics used for evaluation. Section 5 presents the results. Conclusions are given in Section 6.

2. Related Work

Characterizing performance is a richly studied area of computer architecture. The SPEC suite has been extensively characterized [10, 8, 12, 14, 26], as have other workloads such as OLTP [13, 4, 2]. These codes are generally chosen to represent "typical" machine workloads for a class of applications.

In the area of supercomputing, specialized benchmarks have been constructed to test specific areas of machine performance. The HPC Challenge RandomAccess benchmark [6] and the STREAM benchmark [15] have been specifically constructed to measure the performance of the memory system. Although this information is useful to architects, it is difficult to directly map back to application performance.

The applications in this work have been studied extensively [19, 18, 20]. The floating point suite is similar in structure to other real world supercomputer applications that have been previously examined [21, 28].

Numerous definitions of spatial and temporal locality have been proffered to characterize the memory performance of applications, both canonical [22, 9] and experimental [19, 29, 5, 7, 27, 25]. This study examines the performance impact of the memory system on applications.

The memory wall is also an extremely well studied problem in computer architecture [30, 16]. It has been argued that this is the dominant problem in computer architecture [11]. Indeed, the results of this work further support this conclusion by affirming that key supercomputer applications are memory latency dominated in performance, which is the classic definition of the memory wall.

3. Applications

This study examines two classes of important applications from Sandia National Laboratories: a traditional set of primarily floating point codes, and an emerging class of primarily integer codes. These codes have been discussed extensively previously, and are significantly different from traditional suites such as SPEC CPU [20, 19]. Their description and basic instruction mix follow.

3.1. Traditional Floating Point Codes

Each of the floating point applications is a production MPI code designed to run at very large scale. Broadly speaking, they are scientific and engineering applications that represent physical simulations. They are:

• ALEGRA is a finite element shock physics code capable of modeling near- and far-field responses to explosions, impacts, and energy depositions.

• CTH is a multi-material, large deformation, strong shock wave, solid mechanics code developed at Sandia National Laboratories over the last 30 years. CTH models multi-phase, elastic viscoplastic, porous and explosive materials with multiple mesh refinement methods.

• Cube3 is a generic linear solver that drives the Trilinos framework for parallel linear and eigensolvers. It mimics a finite element analysis problem by creating hexagonal elements, then assembling and solving a linear system. The width, depth, and degrees of freedom (e.g., temperature, pressure, velocity, etc.) can be varied.

• ITS performs Monte Carlo simulations of linear time-independent coupled electron/photon radiation transport.

• MPSalsa is a high resolution 3d simulation of reacting flow. The simulation requires both fluid flow and chemical kinetics modeling.

• Xyce is a parallel circuit simulation system capable of modeling very large circuits at multiple layers of abstraction (device, analog, digital, and mixed-signal). It includes both SPICE-like models and radiation models.

Page 25: Building More Powerful Less Expensive Supercomputers …Final Report Richard C. Murphy Prepared by Sandia National Laboratories Albuquerque, New Mexico 87185 and Livermore, California

3.2. Emerging Integer Applications

Unlike their traditional counterparts, the emerging applications studied in this work are integer-oriented, generally problems from Discrete Math, typically unstructured, and written for a variety of programming models. They are at the core of important problems in proteomics, genomics, data mining, pattern matching, and computational geometry. Although many are classic algorithms problems (DFS, for example), the implementations are typically "newer" than their traditional counterparts (that is, there are no three decade old codes among these applications).

• DFS implements a depth-first search on a large graph and forms the basis for several high-level algorithms including connected components, tree and cycle detection, solving the two-coloring problem, finding Articulation Vertices, and topological sorting.

• Connected Components breaks a graph into components. Two vertices are in a connected component if and only if there is a path between them.

• Subgraph Isomorphism determines whether or not a subgraph exists within a larger graph.

• Full Graph Isomorphism determines whether or not two graphs have the same shape or structure.

• Shortest Path computes the shortest path between two vertices using a breadth first search. Real world applications include path planning and networking and communication.

• Graph Partitioning is used extensively in VLSI circuit design and adaptive mesh refinement. The problem divides a graph into k partitions while minimizing the cut between the partitions (or the total weight of the edges that cross from one partition to another).

• BLAST is the Basic Local Alignment Search Tool and is the most heavily used method for quickly searching nucleotide and protein databases in biology.

• zChaff is a heuristic for solving the Boolean Satisfiability Problem. In propositional logic, a formula is satisfiable if there exists an assignment of truth values to each of its variables that makes the formula true.

3.3. Instruction Mix for Each Suite

Figure 1. Sandia Integer and Floating Point Suite Instruction Mix

Figure 1 shows the instruction mix for the integer and floating point suites. Although the "work" for the floating point suite is primarily floating point, real applications perform significantly less floating point than do typical benchmark suites. For example, the SPEC CPU 2000 suite averages 32% floating point, while real Sandia codes average only about 12% floating point [19]. Other key differences between the integer and floating point codes are apparent. The integer codes perform 15% fewer memory references (although those references are much more likely to miss the cache, as Section 5 will demonstrate). Furthermore, the integer codes perform twice as many branches as their floating point counterparts. This is unsurprising since scientific codes tend to have larger basic blocks due to complex formula calculations [23].

The combination of an increased cache miss rate and an increased number of branches makes the integer codes challenging for modern superscalar processors.

4. Methodology and Metrics

This section discusses the application traces used in this work, the metrics used for evaluation, and the simulation environment. The simulation methodology is similar to that employed in prior studies of this application base, although the metrics are entirely new.

4.1. Application Traces

Each of the applications used in this work was analyzed using traces of 4 billion sequential instructions produced by the Amber [1] instruction trace generator for the PowerPC. These traces have been used in several prior studies [19, 18, 20, 23, 17, 24] and are well understood. The traces typically represent multiple executions of the main loop of the program and were originally generated with the input from applications experts and platform profiling tools. In the case of the MPI programs, traces were of single node execution.

Page 26: Building More Powerful Less Expensive Supercomputers …Final Report Richard C. Murphy Prepared by Sandia National Laboratories Albuquerque, New Mexico 87185 and Livermore, California

Table 1. Processor Configuration

Parameter                     Value
Issue Width                   8
Commit Width                  4
RUU Size                      64
L1 Instruction/Data Cache     64 KB, 2-way set associative, 64-byte blocks, least recently used
L1 Cache Latency              3 cycles
L2 Unified Cache              1 MB, 16-way set associative, 64-byte blocks, least recently used
L2 Cache Latency              20 cycles
Integer ALUs                  3
Integer Multiplier/Dividers   1
FP ALUs                       2
FP Multiplier/Dividers        1
Clock Rate                    2.5 GHz

4.2. Metrics

This work defines bandwidth and latency as follows:

• Latency: the time between when the processor requests a memory value and when the first byte of that request arrives.

• Bandwidth: the transfer speed of the second and all subsequent bytes of a memory request.

The simulations in this work vary the memory bandwidth and latency according to these definitions for each run.

4.3. Simulation Environment

The traces were used as inputs to Sandia's Structural Simulation Toolkit (SST). The SST is a parallel machine simulator (for both shared and distributed memory architectures) that uses an enhanced version of SimpleScalar's sim-outorder processor simulator [3] as the baseline processor. SST has significantly enhanced cache and memory models, and has been used to simulate several supercomputer architectures.

The memory simulated in this work is a DDR-like interface which performs transfers in 16-byte blocks.
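To make the latency and bandwidth definitions from Section 4.2 concrete, the sketch below models the service time of a single memory request when the first block is covered by latency and the remaining blocks by bandwidth. This is only an illustration of the definitions, not the SST memory model; the request size and the helper function are hypothetical, and only the 16-byte transfer block size is taken from the text above.

    #include <stdio.h>

    /* Illustrative sketch only: service time of one memory request under the
     * paper's definitions (latency covers the first bytes to arrive, bandwidth
     * covers the rest).  Not the SST memory model. */
    static double request_time_ns(double latency_ns, double bandwidth_gbs,
                                  unsigned request_bytes, unsigned first_block_bytes)
    {
        unsigned remaining = request_bytes > first_block_bytes
                                 ? request_bytes - first_block_bytes
                                 : 0;
        /* 1 GB/sec moves one byte per nanosecond, so GB/sec converts bytes to ns. */
        return latency_ns + (double)remaining / bandwidth_gbs;
    }

    int main(void)
    {
        /* A 64-byte cache line fill over 16-byte blocks at the Opteron-like
         * baseline (60ns, 10 GB/sec) and at a doubled-latency point. */
        printf("64B fill @ 60ns, 10 GB/sec:  %.1f ns\n",
               request_time_ns(60.0, 10.0, 64, 16));
        printf("64B fill @ 120ns, 10 GB/sec: %.1f ns\n",
               request_time_ns(120.0, 10.0, 64, 16));
        return 0;
    }

For a single cache-line fill the latency term dominates the transfer term by an order of magnitude, which is consistent with the latency-dominated results reported in Section 5.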

Table 1 summarizes the conventional superscalar processor configuration used in this study. The memory latency and bandwidth were varied and the committed Instructions Per Cycle (IPC) measured.

The memory latencies examined were: 15ns, 30ns, 60ns, 120ns, and 240ns. The bandwidths in this experiment were: 2.5 GB/sec, 5 GB/sec, 10 GB/sec, 20 GB/sec, and 40 GB/sec.

The baseline memory latency and bandwidth numbers in this study look somewhat more aggressive than a modern Opteron (60ns latency and 10 GB/sec of bandwidth). However, several points around that baseline were examined, and can be used for comparison given the reader's assumptions about what is appropriate. Those points were generated by halving and doubling each configuration parameter (latency and bandwidth) twice.

Figure 2. Average Latency and Bandwidth Effects for the Sandia Floating Point and Integer Suites

5. Results

Figure 2 shows the average effect of varying latency and bandwidth around the baseline (60ns of memory latency and 10 GB/sec of memory bandwidth). The center point for each graph (relative latency and bandwidth of 1.0) is this baseline, and each bar represents the IPC achieved at that latency/bandwidth point.

Figure 3. Complete Results for the Integer and Floating Point Suites (panels: Floating Point Applications; Integer Applications)

Figure 3 depicts the latency/bandwidth results for each application in both the integer and floating point suites.

On average, in the floating point case, halving the available bandwidth results in an average drop in performance of 1.24%. In the case of the integer suite, the average drop in performance is 3.59%. By contrast, doubling the memory latency leads to drops in performance of 11% and 32%, respectively. Thus, all of the codes are more latency than bandwidth dominated. This is a critical point for two reasons:

1. Memory bandwidth is the typical unit of memory performance measure discussed by supercomputer architects. This may be because the construction of MPP-based supercomputers is dominated by the construction of a high performance network interface, and MPP system architects therefore tend to think in more networking oriented terms. It may also be because bandwidth is an easier system characteristic to affect than latency. Regardless, the more critical unit of measurement is quite conclusively latency.

2. As discussed in Section 1, one of the key concerns for supercomputer architects is the transition to multicore machines. If today's instruction streams in a unicore processor were memory bandwidth bound, and available bandwidth did not grow with the number of cores, performance would suffer. However, these results indicate that there is potentially headroom in bandwidth.

Additionally, the structure of these applications generally makes it very difficult for the processor to compute memory addresses quickly enough to keep the memory bus busy. This tends to make bandwidth less important than latency.

The integer codes are clearly more sensitive to memory performance than their traditional floating point counterparts (doubling the latency or halving the bandwidth has 2.9× the impact on the integer suite as compared to the floating point suite).

It is also critical to note that the baseline performance of each suite is significantly different. The floating point suite has a baseline IPC of 1.22, while the integer suite has a baseline performance of 0.70. Thus, the integer suite starts with 43% less performance than the floating point suite, and is significantly more sensitive to latency and bandwidth variations beyond that.

Figure 4. The Most Sensitive Applications From Each Suite

Figure 4 shows the most affected applications for the floating point and integer suites. Cube3 from the floating point suite and Shortest Path from the integer suite look remarkably similar. Cube3 has a baseline performance only 10% lower than Shortest Path, and both applications experience a 55% drop in performance if the memory latency is doubled. Cube3 experiences a 6% drop in performance if the bandwidth is halved, while Shortest Path experiences a 7% drop in performance. This is not surprising, since sparse graphs can be represented as sparse matrices, similar to those that are fundamental to linear algebra problems. The sparse matrix representation is not used in this particular instance of the graph problem; however, the fundamental nature of the data structures used is similarly sparse.

The least affected applications from each suite (Xyce from the floating point suite and BLAST from the integer suite) show nearly flat latency/bandwidth curves. Doubling the memory latency shows less than a 1.25% performance degradation for Xyce and less than a 5% performance degradation for BLAST. Similarly, halving the bandwidth yields less than half a percent performance degradation. These applications are known to generally be more compute intensive, and this result confirms that prior knowledge. These are the only applications with relatively flat curves.

Cube3 and ALEGRA are the most sensitive to latency or bandwidth changes in the floating point suite, while Connected Components and Shortest Path dominate the integer suite.

Page 29: Building More Powerful Less Expensive Supercomputers …Final Report Richard C. Murphy Prepared by Sandia National Laboratories Albuquerque, New Mexico 87185 and Livermore, California

To better quantify each suite's sensitivity to bandwidth and latency effects, the sensitivity must be defined. Intuitively, the sensitivity can be thought of as the slope of either the latency or bandwidth lines. As a first-order approximation, these curves can be thought of as linear (they are very nearly so). For a given latency or bandwidth, the slope of the resulting 2-dimensional slice of Figure 2 (either latency vs. IPC or bandwidth vs. IPC) can be calculated. For example, given the baseline latency of 60ns, the bandwidth sensitivity (or slope of the bandwidth line) can be calculated as follows:

\[ S_{bandwidth} = \frac{P_{60ns,\,2.5GB/sec} - P_{60ns,\,40GB/sec}}{2.5\,GB/sec - 40\,GB/sec} \tag{1} \]

where P_{latency,bandwidth} represents the performance at a given latency and bandwidth. Similarly, given the baseline bandwidth of 10 GB/sec, the latency sensitivity can be computed by:

\[ S_{latency} = \frac{P_{15ns,\,10GB/sec} - P_{240ns,\,10GB/sec}}{15\,ns - 240\,ns} \tag{2} \]

The sensitivity to bandwidth is a positive number because an increase in bandwidth leads to an increase in performance, while the sensitivity to latency is a negative number because an increase in latency leads to a decrease in performance.
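As a minimal sketch of how equations (1) and (2) can be evaluated, the fragment below computes both slopes from a grid of measured IPC values over the latency and bandwidth sweep points listed in Section 4.3. The IPC numbers filled in here are placeholders, not the measured data; only the sweep points and the baseline indices come from the text.

    #include <stdio.h>

    #define NLAT 5
    #define NBW  5

    /* Sweep points from Section 4.3; index 2 is the 60ns / 10 GB/sec baseline. */
    static const double lat_ns[NLAT] = { 15, 30, 60, 120, 240 };
    static const double bw_gbs[NBW]  = { 2.5, 5, 10, 20, 40 };

    /* ipc[i][j] is the committed IPC measured at lat_ns[i] and bw_gbs[j].
     * These values are placeholders for illustration only. */
    static double ipc[NLAT][NBW];

    /* Equation (1): slope of IPC vs. bandwidth along the 60ns baseline row. */
    static double bandwidth_sensitivity(void)
    {
        return (ipc[2][0] - ipc[2][NBW - 1]) / (bw_gbs[0] - bw_gbs[NBW - 1]);
    }

    /* Equation (2): slope of IPC vs. latency along the 10 GB/sec baseline column. */
    static double latency_sensitivity(void)
    {
        return (ipc[0][2] - ipc[NLAT - 1][2]) / (lat_ns[0] - lat_ns[NLAT - 1]);
    }

    int main(void)
    {
        /* Fill the grid with made-up numbers so the example runs standalone. */
        for (int i = 0; i < NLAT; i++)
            for (int j = 0; j < NBW; j++)
                ipc[i][j] = 1.0 - 0.002 * lat_ns[i] + 0.004 * bw_gbs[j];

        printf("bandwidth sensitivity: %+f IPC per GB/sec\n", bandwidth_sensitivity());
        printf("latency sensitivity:   %+f IPC per ns\n", latency_sensitivity());
        return 0;
    }

With the placeholder grid, the two slopes come out to +0.004 IPC per GB/sec and -0.002 IPC per ns, matching the sign convention described above.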

Figure 5(a) shows the average sensitivity to latency and bandwidth for both suites, while (b) includes each individual application for the floating point suite, and (c) includes the same information for the integer suite. As discussed earlier, the integer suite is clearly more sensitive to both bandwidth and latency than the floating point suite. The integer suite is 42–60% more sensitive to latency than the floating point suite. While it begins 66% more sensitive to bandwidth than the floating point suite, at very high memory latencies the integer suite is nearly 30% less sensitive to bandwidth than the floating point suite.

Figure 5. Latency and Bandwidth Sensitivity. It should be noted that bandwidth sensitivity is positive because increasing bandwidth increases performance, while latency sensitivity is negative because increasing latency decreases performance. Part (a) of the figure depicts the average for each of the benchmark suites, while (b) shows the floating point applications and (c) the integer.

6. Conclusions and Future Work

This paper has examined the impact of memory latency and bandwidth on a set of traditional floating-point oriented and emerging integer oriented supercomputer applications. The results demonstrate that both application suites are more dominated by latency than bandwidth. It has further shown that the emerging integer applications are 2.9× more sensitive to a halving of the memory bandwidth or doubling of the memory latency than their traditional counterparts, which is critical to understanding the nature of these emerging codes.

This result has two critical impacts on the field of supercomputer architecture: first, it demonstrates that there is some degree of "bandwidth headroom" in the construction of multicore supercomputers; and second, it quantitatively shows that emerging applications are much more memory sensitive than traditional scientific computing codes. This provides further evidence in support of the long-held belief that the lessons of scientific computing have little applicability to these applications, particularly as they relate to data decomposition.

Because the transition to multicore MPPs will impact the memory system in future machines, understanding the on-node memory performance of real applications is critical. The technology determines the number of available independent channels into memory (which affects aggregate latency by constraining the number of simultaneous memory accesses the memory system can sustain), and the speed and width of those channels (which affects bandwidth). Both are likely to be more constrained than the number of cores that can be placed on a die. Choosing the right balance for supercomputer applications will depend on characterizing the requirements of the workloads of those machines. This study does so, and shows that this is an even more significant problem for emerging codes than for classical supercomputing applications.

Finally, this study identifies four critical applications (two from each suite) that demonstrate the most sensitivity to latency and bandwidth: Cube3 and ALEGRA from the floating point suite, and Connected Components and Shortest Path from the integer suite. As a whole, problems in graph theory are shown to be particularly challenging to the memory system.

Future work will extend this study to examine the impact of moving to multicore architectures with simpler cores.

References

[1] Apple Architecture Performance Groups. Computer Hardware Understanding Development Tools 2.0 Reference Guide for MacOS X. Apple Computer Inc, July 2002.

[2] Luiz Andre Barroso, Kourosh Gharachorloo, and Edouard Bugnion. Memory system characterization of commercial workloads. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 3–14. IEEE Computer Society, 1998.

[3] Doug Burger and Todd Austin. The SimpleScalar Tool Set, Version 2.0. SimpleScalar LLC.

[4] Z. Cvetanovic and D. Bhandarkar. Characterization of Alpha AXP performance using TP and SPEC workloads. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 60–70. IEEE Computer Society Press, 1994.

[5] Peter J. Denning. The working set model for program behavior. In Proceedings of the First ACM Symposium on Operating System Principles, pages 15.1–15.12. ACM Press, 1967.

[6] J. Dongarra and P. Luszczek. Introduction to the HPC Challenge Benchmark Suite. Technical Report ICL-UT-05-01, 2005.

[7] Domenico Ferrari. A generative model of working set dynamics. In Proceedings of the 1981 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 52–57. ACM Press, 1981.

[8] Swathi Tanjore Gurumani and Aleksandar Milenkovic. Execution characteristics of SPEC CPU2000 benchmarks: Intel C++ vs. Microsoft VC++. In Proceedings of the 42nd Annual Southeast Regional Conference, pages 261–266. ACM Press, 2004.

[9] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 2002.

[10] Kimberly Keeton, David A. Patterson, Yong Qiang He, Roger C. Raphael, and Walter E. Baker. Performance Characterization of a Quad Pentium Pro SMP using OLTP Workloads. In Proceedings of the International Symposium on Computer Architecture, pages 15–26, 1998.

[11] Peter M. Kogge, Jay B. Brockman, and Vincent Freeh. Processing-In-Memory Based Systems: Performance Evaluation Considerations. In Workshop on Performance Analysis and its Impact on Design, held in conjunction with ISCA, Barcelona, Spain, June 27-28, 1998.

[12] Dennis C. Lee, Patrick Crowley, Jean-Loup Baer, Thomas E. Anderson, and Brian N. Bershad. Execution characteristics of desktop applications on Windows NT. In International Symposium on Computer Architecture, pages 27–38, 1998.

[13] Jack L. Lo, Luiz Andre Barroso, Susan J. Eggers, Kourosh Gharachorloo, Henry M. Levy, and Sujay S. Parekh. An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors. In International Symposium on Computer Architecture, pages 39–50, 1998.

[14] Ann Marie Grizzaffi Maynard, Colette M. Donnelly, and Bret R. Olszewski. Contrasting characteristics and cache performance of technical and multi-user commercial workloads. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 145–156. ACM Press, 1994.

[15] John D. McCalpin. STREAM: Sustainable memory bandwidth in high performance computers, 1997.

[16] Sally A. McKee. Reflections on the memory wall. In CF '04: Proceedings of the 1st Conference on Computing Frontiers, page 162, New York, NY, USA, 2004. ACM Press.

[17] Richard C. Murphy. Traveling Threads: A New Multithreaded Execution Model. Ph.D. Dissertation, University of Notre Dame, May 2006.

[18] Richard C. Murphy, Jonathan Berry, William McLendon, Bruce Hendrickson, Douglas Gregor, and Andrew Lumsdaine. DFS: A Simple to Write Yet Difficult to Execute Benchmark. In IEEE International Symposium on Workload Characterization 2006 (IISWC06), October 25-27, 2006.

[19] Richard C. Murphy and Peter M. Kogge. On the Memory Access Patterns of Supercomputer Applications: Benchmark Selection and its Implications. To appear in IEEE Transactions on Computers.

[20] Richard C. Murphy, Arun Rodrigues, Peter Kogge, and Keith Underwood. The Implications of Working Set Analysis on Supercomputing Memory Hierarchy Design. In The 2005 International Conference on Supercomputing, June 20-22, 2005.

[21] Leonid Oliker, Andrew Canning, Jonathan Carter, John Shalf, and Stephane Ethier. Scientific Computations on Modern Parallel Vector Systems. In Proceedings of Supercomputing, page 10, 2004.

[22] David A. Patterson and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface, 2ed. Morgan Kaufmann Publishers, 1997.

[23] Arun Rodrigues, Richard Murphy, Peter Kogge, and Keith Underwood. Characterizing a New Class of Threads in Scientific Applications for High End Supercomputers. In Proceedings of the 18th Annual ACM International Conference on Supercomputing, pages 164–174, June 26-July 1, 2004.

[24] Arun F. Rodrigues. Programming Future Architectures: Dusty Decks, Memory Walls, and the Speed of Light. Ph.D. Dissertation, University of Notre Dame, 2006.

[25] Juan Rodriguez-Rosell. Empirical working set behavior. Communications of the ACM, 16(9):556–560, 1973.

[26] Rafael Saavedra and Alan Smith. Analysis of benchmark characteristics and benchmark performance prediction. ACM Transactions on Computer Systems, 14(4):344–84, 1996.

[27] D. R. Slutz and I. L. Traiger. A note on the calculation of average working set size. Communications of the ACM, 17(10):563–565, 1974.

[28] Jeffrey S. Vetter and Andy Yoo. An Empirical Performance Evaluation of Scalable Scientific Applications. In Proceedings of Supercomputing, pages 1–18, 2002.

[29] Jonathan Weinberg, Michael McCracken, Alan Snavely, and Erich Strohmaier. Quantifying Locality In The Memory Access Patterns of HPC Applications. In Supercomputing 2005, page 50, November 2005.

[30] Wm. A. Wulf and Sally A. McKee. Hitting the memory wall: implications of the obvious. SIGARCH Comput. Archit. News, 23(1):20–24, 1995.


Qthreads: An API for Programming with Millions of Lightweight Threads

Kyle B. Wheeler
University of Notre Dame
South Bend, Indiana, USA
[email protected]

Richard C. Murphy
Sandia National Laboratories∗
Albuquerque, New Mexico, USA
[email protected]

Douglas Thain
University of Notre Dame
South Bend, Indiana, USA
[email protected]

∗Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Abstract

Large scale hardware-supported multithreading, an attractive means of increasing computational power, benefits significantly from low per-thread costs. Hardware support for lightweight threads is a developing area of research. Each architecture with such support provides a unique interface, hindering development for them and comparisons between them. A portable abstraction that provides basic lightweight thread control and synchronization primitives is needed. Such an abstraction would assist in exploring both the architectural needs of large scale threading and the semantic power of existing languages. Managing thread resources is a problem that must be addressed if massive parallelism is to be popularized. The qthread abstraction enables development of large-scale multithreading applications on commodity architectures. This paper introduces the qthread API and its Unix implementation, discusses resource management, and presents performance results from the HPCCG benchmark.

1. Introduction

Lightweight threading primitives, crucial to large scale multithreading, are typically either platform-dependent or compiler-dependent. Generic programmer-visible multithreading interfaces, such as pthreads, were designed for "reasonable" numbers of threads: less than one hundred or so. In large-scale multithreading situations, the features and guarantees provided by these interfaces prevent them from scaling to "unreasonable" numbers of threads (a million or more), necessary for multithreaded teraflop-scale problems.

Parallel execution has largely existed in two different worlds: the world of the very large, where programmers explicitly create parallel threads of execution, and the world of the very small, where processors extract parallelism from serial instruction streams. Recent hardware architectural research has investigated lightweight threading and programmer-defined large scale shared-memory parallelism. The lightweight threading concept allows exposure of greater potential parallelism, increasing performance via greater hardware parallelism. The Cray XMT [7], with the Threadstorm CPU architecture, avoids memory dependency stalls by switching among 128 concurrent threads. XMT systems scale to over 8,000 processors. To maximize throughput, the programmer must provide at least 128 threads per processor, or over 1,024,000 threads.

Taking advantage of large-scale parallel systems with current parallel programming APIs requires significant computational and memory overhead. For example, standard POSIX threads must be able to receive signals, which either requires an OS representation of every thread or requires user-level signal multiplexing [14]. Threads in a large-scale multithreading context often need only a few bytes of stack (if any) and do not require the ability to receive signals. Some architectures, such as the Processor-in-Memory (PIM) designs [5, 18, 23], suggest threads that are merely a few instructions included in the thread's context.

While hardware-based lightweight threading constructs are important developments, the methods for exposing such parallelism to the programmer are platform-specific and typically rely on custom compilers [3, 4, 7, 10, 11, 26] or entirely new languages [1, 6, 8, 13], or have architectural limitations that prevent scaling to millions of threads [14, 22]. This makes useful comparisons between architectures difficult. With a standard way of expressing parallelism that can be used with existing compilers, comparing cross-platform algorithms becomes convenient. For example, the MPI standard allows a programmer to create a parallel application that is portable to any system providing an MPI library, and different systems can be compared with the same code on each system. Development and study of large-scale multithreaded applications is limited because of the platform-specific nature of the available interfaces. Having a portable large-scale multithreading interface allows application development on commodity hardware that can exploit the resources available on large-scale systems.

Lightweight threading requires a lightweight synchronization model [9]. The model used by the Cray XMT and PIM designs, pioneered by the Denelcor HEP [15], uses full/empty bits (FEBs). This technique marks each word in memory with a "full" or "empty" state, allows programs to wait for either state, and makes the state change atomically with the word's contents. This technique can be implemented directly in hardware, as it is in the XMT. Alternatives include ADA-like protected records [24] and fork-sync [4], which lack a clear hardware analog.

This paper discusses programming models' impact on efficient multithreading and the resource management necessary for those models. It introduces the qthread lightweight threading API and its Unix implementation. The API is designed to be a lightweight threading standard for current and future architectures. The Unix implementation is a proof of concept that provides a basis for developing applications for large scale multithreaded architectures.

2. Recursive threaded programming models and resource management

Managing large numbers of threads requires managing per-thread resources, even if those requirements are low. This management can affect whether multithreaded applications run to completion and whether they execute faster than an equivalent serial implementation. The worst consequence of poor management is deadlock: if more threads are needed than resources are available, and reclaiming thread resources depends on spawning more threads, the system cannot make forward progress.

2.1. Parallel programming models

An illustrative example of the effect of the programming model on resource management is the problem of summing integers in an array. A serial solution is trivial: start at the beginning, and tally each sequential number until the end of the array. There are at least three parallel execution models that could compute the sum. They can be referred to as the recursive tree model, the equal distribution model, and the lagging-loop model.

A recursive tree solution to summing the numbers in an array is simple to program: divide the array in half, and spawn two threads to sum up both halves. Each thread does the same until its array has only one value, whereupon the thread returns that value. Threads that spawned threads wait for their children to return and return the sum of their values. This technique is parallel, but uses a large amount of state. At any point, most of the threads are not doing useful work. While convenient, this is a wasteful technique.

The equal distribution solution is also simple: divide the array equally among all of the available processors and spawn a thread for each. Each thread must sum its segment serially and return the result. The parent thread sums the return values. This technique is efficient because it matches the needed parallelism to the available parallelism, and the processors do minimal communication. However, equal distribution is not particularly tolerant of load imbalances: execution is as slow as the slowest thread. (A short sketch of this model appears below.)

The lagging-loop model relies upon arbitrary workload divisions. It breaks the array into small chunks and spawns a thread for each chunk. Each thread sums its chunk and then waits for the preceding thread (if any) to return an answer before combining the sum and returning its own total. Eventually the parent thread will do the same with the last chunk. This model is more efficient than the tree model, and the number of threads depends on the chunk size. The increased number of threads makes it more tolerant of load imbalances, but has more overhead.
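As a concrete illustration of the equal distribution model described above, here is a minimal pthread-based sketch that divides an array of one million integers among four worker threads and sums the partial results in the parent. The thread count, array size, and helper names are arbitrary examples, not part of the qthread API introduced later.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* One contiguous chunk of the array, handed to a single worker thread. */
    struct chunk {
        const int *base;
        size_t     len;
        long       sum;
    };

    static void *sum_chunk(void *arg)
    {
        struct chunk *c = arg;
        c->sum = 0;
        for (size_t i = 0; i < c->len; i++)
            c->sum += c->base[i];
        return NULL;
    }

    int main(void)
    {
        enum { N = 1000000, NTHREADS = 4 };
        int *a = malloc(N * sizeof *a);
        for (size_t i = 0; i < N; i++)
            a[i] = 1;

        pthread_t    tid[NTHREADS];
        struct chunk c[NTHREADS];
        size_t       per = N / NTHREADS;

        /* Equal distribution: one thread per "processor", each with a fixed slice. */
        for (int t = 0; t < NTHREADS; t++) {
            c[t].base = a + (size_t)t * per;
            c[t].len  = (t == NTHREADS - 1) ? N - (size_t)t * per : per;
            pthread_create(&tid[t], NULL, sum_chunk, &c[t]);
        }

        /* The parent sums the per-thread partial results. */
        long total = 0;
        for (int t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);
            total += c[t].sum;
        }
        printf("total = %ld\n", total);
        free(a);
        return 0;
    }

Because the number of threads matches the number of processors and each slice is fixed, the sketch communicates only once per thread, but it inherits the load-imbalance weakness noted above.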

2.2. Handling resource exhaustion

These methods differ in the way resources must be managed to guarantee forward progress. Whenever a new thread is requested, one of four things can be done:

1. Execute the function inline.
2. Create a new thread.
3. Create a record that will become a thread later.
4. Block until sufficient resources are available.

In a large enough parallel program, eventually the resources will run out. Requests for new threads must either block until resources become available or must fail and let the program handle the problem.

Blocking to wait for resources to become available affects each parallel model differently. The lagging loop method works well with blocking requests, because the spawned threads don't rely on spawning more threads. When these threads complete, their resources may be reused, and deadlock is easily avoided. The equal distribution method has a similar advantage. However, because it avoids using more than the minimum number of threads, it does not cope as well with load imbalances.

The recursive tree method gathers a lot of state quickly and releases it slowly, making the method particularly susceptible to resource exhaustion deadlock, where all running threads are blocked spawning more threads. In order to guarantee forward progress, resources must be reserved when threads spawn, and threads must execute serially when reservation fails. The minimum state that must be reserved is the amount necessary to get to the bottom of the recursive tree serially. Thus, if there are only enough resources for a single depth-first exploration of the tree, recursion may only occur serially. If there are enough resources for two serial explorations of the tree, the tree may be divided into two segments to be explored in parallel, and so forth. Once resource reservation fails, only a serial traversal of the recursive tree may be performed. Thus, blocking for resources is a poor behavior for a recursive tree, as forward progress cannot be assured.

Such an algorithm is only possible when the maximum depth of the recursive tree is known. If the depth is unknown, then sufficient resources for a serial execution cannot be reserved. Any resources reserved for a parallel execution could prevent the serial recursive tree from completing.

It is worth noting that a threading library can only be responsible for the resources necessary for basic thread state. Additional state required during recursion has the potential to cause deadlock and must be managed similarly.
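The spawn-time policy discussed above (reserve resources when a thread spawns, and fall back to serial, inline execution when reservation fails) can be sketched with ordinary pthreads. The reservation counter, its limit, and the helper names below are illustrative assumptions, not part of any library described in this paper.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* A crude reservation pool: each spawned thread consumes one slot. */
    static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
    static int free_slots = 128;          /* example resource limit */

    static bool reserve_slot(void)
    {
        bool ok = false;
        pthread_mutex_lock(&pool_lock);
        if (free_slots > 0) { free_slots--; ok = true; }
        pthread_mutex_unlock(&pool_lock);
        return ok;
    }

    static void release_slot(void)
    {
        pthread_mutex_lock(&pool_lock);
        free_slots++;
        pthread_mutex_unlock(&pool_lock);
    }

    struct task { void (*fn)(void *); void *arg; };

    static void *run_task(void *p)
    {
        struct task *t = p;
        t->fn(t->arg);
        release_slot();
        return NULL;
    }

    /* Spawn in a new thread when a slot is available; otherwise run inline so
     * that forward progress never depends on creating another thread. */
    static void spawn_or_inline(struct task *t)
    {
        if (reserve_slot()) {
            pthread_t tid;
            if (pthread_create(&tid, NULL, run_task, t) == 0) {
                pthread_detach(tid);
                return;
            }
            release_slot();               /* creation failed: undo the reservation */
        }
        t->fn(t->arg);                    /* fall back to inline execution */
    }

    static void hello(void *arg) { printf("task %s\n", (const char *)arg); }

    int main(void)
    {
        static struct task t = { hello, "a" };   /* static so a detached worker can outlive main's frame */
        spawn_or_inline(&t);
        pthread_exit(NULL);               /* let any detached worker finish */
    }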

3. Application programming interface

The qthread API provides several key features:

• Large scale lightweight multithreading support
• Access to or emulation of lightweight synchronization
• Basic thread-resource management
• Source-level compatibility between platforms
• A library-based API, forgoing custom compilers

The qthread API maximizes portability to architectures supporting lightweight threads and synchronization primitives by providing a stable interface to the programmer. Because architectures and operating systems supporting lightweight threading are difficult to obtain, initial analysis of the API's performance and usability uses commodity architectures such as Itanium and PowerPC processors.

The qthread API consists of three components: the core lightweight thread command set, a set of commands for resource-limit-aware threads ("futures"), and an interface for basic threaded loops. Qthreads have a restricted stack size, and provide a locking scheme based on the full/empty bit concept. The API provides alternate threads, called "futures", which are created as resources are available.

One of the likely features of machines supporting large scale multithreading is non-uniform memory access (NUMA). To take advantage of NUMA systems, they must be described to the library, in the form of "shepherds," which define memory locality.

3.1. Basic thread control

The API is an anonymous threading interface. Threads, once created, cannot be controlled by other threads. However, they can provide FEB-protected return values so that a thread can easily wait for another. FEBs do not require polling, which is discouraged as the library does not guarantee preemptive scheduling.

Threads are assigned to one of several "shepherds" at creation. A shepherd is a grouping construct. The number of shepherds is defined when the library is initialized. In an environment supporting traveling threads, shepherds allow threads to identify their location. Shepherds may correspond to nodes in the system, memory regions, or protection domains. In the Unix implementation, a shepherd is managed by at least one pthread which executes qthreads. It is worth noting that this hierarchical thread structure, particular to the Unix implementation (not inherent to the API), is not new but rather useful for mapping threads to mobility domains. A similar strategy was used by the Cray X-MP [30], as well as Cilk [4] and other threading models.

Only two functions are required for creating threads: qthread_init(shep), which initializes the library with shep shepherds; and qthread_fork(func, arg, ret), which creates a thread to perform the equivalent of *ret = func(arg). The API also provides mutex-style and FEB-style locking functions. Using synchronization external to the qthread library is not encouraged, as that prevents the library from making scheduling decisions.

The mutex operations are qthread_lock(addr) and qthread_unlock(addr). The FEB semantics are more complex, with functions to manipulate the FEB state in a non-blocking way (qthread_empty(addr) and qthread_fill(addr)), as well as blocking reads and blocking writes. The blocking read functions wait for a given address to be full and then copy the contents of that address elsewhere. One (qthread_readFF()) will leave the address marked full; the other (qthread_readFE()) will then mark the address empty. There are also two write actions. Both will fill the address being written, but one (qthread_writeEF()) will wait for the address to be empty first, while the other (qthread_writeF()) won't. Using the two synchronization techniques on the same addresses at the same time produces undefined behavior, as they may be implemented using the same underlying mechanism.
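The following minimal sketch shows the calls just described working together: the library is initialized, a single qthread is forked, and the parent blocks on the FEB-protected return value with qthread_readFF(). The aligned_t type, the header name, the worker's signature, and the exact argument order of qthread_readFF() are assumptions made for illustration; the real prototypes should be taken from the library's headers.

    #include <stdio.h>
    #include <qthread.h>                    /* assumed header name */

    /* Worker body: qthread_fork() arranges for *ret = func(arg). */
    static aligned_t square(void *arg)
    {
        aligned_t x = *(aligned_t *)arg;
        return x * x;
    }

    int main(void)
    {
        aligned_t arg = 7, ret = 0, value = 0;

        qthread_init(2);                    /* two shepherds */
        qthread_fork(square, &arg, &ret);   /* spawn; ret's FEB starts empty */

        /* Block until the return slot is marked full, leaving it full. */
        qthread_readFF(&value, &ret);
        printf("square(7) = %lu\n", (unsigned long)value);
        return 0;
    }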

3.2. Futures

Though the API has no built-in limits on the number of threads, thread creation may fail due to memory limits or other system-specific limits. "Futures" are threads that allow the programmer to set limits on the number of futures that may exist. The library tracks the futures that exist, and stalls attempts to create too many. Once a future exits, a future waiting to be created is spawned and its parent thread is unblocked. The futures API has its own initialization function (future_init(limit)) to specify the maximum number of futures per shepherd, and a way to create a future (future_fork(func, arg, ret)) that behaves like qthread_fork().
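A sketch of the futures interface follows: future_init() caps the number of live futures per shepherd, and future_fork() blocks whenever that cap is reached. As in the previous sketch, the header names, the aligned_t type, and the worker signature are assumptions for illustration only.

    #include <qthread.h>                       /* assumed header names */
    #include <qthread/futurelib.h>

    #define NTASKS 1000

    static aligned_t work(void *arg)
    {
        return *(aligned_t *)arg + 1;
    }

    int main(void)
    {
        static aligned_t args[NTASKS], rets[NTASKS];
        aligned_t tmp;

        qthread_init(2);
        future_init(64);                       /* at most 64 live futures per shepherd */

        for (int i = 0; i < NTASKS; i++) {
            args[i] = (aligned_t)i;
            /* Blocks here whenever the per-shepherd future limit is reached. */
            future_fork(work, &args[i], &rets[i]);
        }
        for (int i = 0; i < NTASKS; i++)
            qthread_readFF(&tmp, &rets[i]);    /* wait for each result slot to fill */
        return 0;
    }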


3.3. Threaded loops and utility functions

The qthread API includes several threaded loop inter-faces, built on the core threading components. Both C++-based templated loops and C-based loops are provided.Several utility functions are also included as examples.These utility functions are relatively simple, such as sum-ming all numbers in an array, finding the maximum value,or sorting an array.

There are two parallel loop behaviors: one spawns aseparate thread for each iteration of the loop, and theother uses an equal distribution technique. The func-tions that provide one thread per iteration are qt loop ()and qt loop future () , using either qthreads or futures, re-spectively. The functions that use equal distribution areqt loop balance() and qt loop balance future () . A variantof these, qt loopaccum balance(), allows iterations to returna value that is collected (“accumulated”).

The qt_loop() functions take arguments start, stop, stride, func, and argptr. They behave like this loop:

unsigned int i;
for (i = start; i < stop; i += stride) {
    func(NULL, argptr);
}

The qt_loop_balance() functions, since they distribute the iteration space, require a function that takes its iteration space as an argument. Thus, while it behaves similarly to qt_loop(), it requires that its func argument point to a function structured like this:

void func(qthread_t *me, const size_t startat,
          const size_t stopat, void *arg)
{
    for (size_t i = startat; i < stopat; i++) {
        /* do work */
    }
}
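A usage sketch under stated assumptions: the text does not spell out the prototype of qt_loop_balance() itself, so the argument order (start, stop, func, argptr, mirroring qt_loop() minus the stride) and the qloop header name are guesses to be checked against the library.

#include <stddef.h>
#include <qthread.h>
#include <qthread/qloop.h>   /* assumed location of the loop interface */

#define N 1000000
static double data[N];

/* each call receives one contiguous slice [startat, stopat) of the iteration space */
static void scale_slice(qthread_t *me, const size_t startat,
                        const size_t stopat, void *arg)
{
    const double factor = *(double *)arg;
    for (size_t i = startat; i < stopat; i++) {
        data[i] *= factor;
    }
}

int main(void)
{
    double factor = 2.0;
    qthread_init(4);
    qt_loop_balance(0, N, scale_slice, &factor);   /* assumed argument order */
    return 0;
}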

The qt_loopaccum_balance() functions require an accumulation function so that return values can be gathered. The function behaves similarly to the following loop:

unsigned int i;
for (i = start; i < stop; i++) {
    func(NULL, argptr, tmp);
    accumulate(retval, tmp);
}

Similar to the qt_loop_balance() function, it uses the equal distribution technique. The func function must store its return value in tmp, which is then given to the accumulate function to gather and store in retval.

4. Performance

The design of the qthread API is based around two primary goals: efficiency in handling large numbers of threads and portability to large-scale multithreaded architectures. The implementation of the API discussed in this section is the Unix implementation, which is for POSIX-compatible Unix-like systems running on traditional CPUs, such as PowerPC, x86, and IA-64 architectures. In this environment, the qthread library relies on pthreads to allow multiple threads to run in parallel. Lightweight threads are created as a processor context and a small (4k) stack. These lightweight threads are executed by the pthreads. Context switching between qthreads is performed as necessary rather than on an interrupt basis. For performance, memory is pooled in shepherd-specific structures, allowing shepherds to operate independently.

Without hardware support, FEB locks are emulated via a central hash table. This table is a bottleneck that would not exist on a system with hardware lightweight synchronization support. However, the FEB semantics still allow applications to exploit asynchrony even when using a centralized implementation of those semantics.

4.1. Benchmarks

To demonstrate qthread's advantages, six microbenchmarks were designed and tested using both pthreads and qthreads. The algorithms of both implementations are identical, with the exception that one uses qthreads as the basic unit of threading and the other uses pthreads. The benchmarks are as follows:

1. Ten threads atomically increment a shared counter one million times each
2. 1,000 threads lock and unlock a shared mutex ten thousand times each
3. Ten threads lock and unlock 1 million mutexes
4. Ten threads spinlock and unlock ten mutexes 100 times
5. Create and execute 1 million threads in blocks of 200, with at most 400 concurrently executing threads
6. Create and execute 1 million concurrent threads

Figure 1 illustrates the difference between using qthreads and pthreads on a 1.3 GHz dual-processor PowerPC G5 with 2 GB of RAM. Figure 2 illustrates the same on a 48-node 1.5 GHz Itanium Altix with 64 GB of RAM. Both systems used the Native POSIX Thread Library Linux pthread implementation. The bars in each chart in Figure 1 are, from left to right, the pthread implementation, and the qthread implementation with a single shepherd, with two shepherds, and with four shepherds. The bars in each chart in Figure 2 are, from left to right, the pthread implementation, and the qthread implementation with a single shepherd, with 16 shepherds, with 48 shepherds, and with 128 shepherds.

In Figures 1(a) and 2(a), qthreads outperforms pthreads because qthreads uses a hardware-based atomic increment while pthreads is forced to rely on a mutex. Because of contention, additional shepherds do not improve the qthread performance but rather decrease it slightly. Since the qthread locking implementation is built with pthread mutexes, it cannot compete with raw pthread mutexes for speed, as illustrated in Figures 1(b), 2(b), 1(c),


[Figure 1: Microbenchmarks on a dual PPC. Panels (a) Increment, (b) Lock/Unlock, (c) Mutex Chaining, (d) Spinlock Chaining, (e) Thread Creation, and (f) Concurrent Threads plot execution time for the pthread implementation (P, where applicable) and the qthread implementation with 1, 2, and 4 shepherds (Q1, Q2, Q4).]

[Figure 2: Microbenchmarks on a 48-node Altix. Panels (a)-(f) as in Figure 1, plotting execution time for the pthread implementation (P, where applicable) and the qthread implementation with 1, 16, 48, and 128 shepherds (Q1, Q16, Q48, Q128).]

and 2(c). This is a detail that would likely not be true on a system that had hardware support for FEBs, and would be significantly improved with a better centralized data structure, such as a lock-free hash table. Because of the qthread library's simple scheduler, it outperforms pthreads when using spinlocks and a low number of shepherds, as illustrated in Figure 1(d). The impact of the scheduler is demonstrated by larger numbers of shepherds (Figure 2(d)).

The pthread library was incapable of more than several hundred concurrent threads; requesting too many threads deadlocked the kernel (Figures 1(f) and 2(f)). A benchmark was designed that worked within pthreads' limitations by allowing a maximum of 400 concurrent threads. Threads are spawned in blocks of 200, and after each block, threads are joined until only 200 are outstanding before a new block of 200 threads is spawned. In this benchmark, Figures 1(e) and 2(e), pthreads performs more comparably to qthreads; on the PowerPC system, it is only a factor of two more expensive.

5. Application development

Development of software that realistically takes advantage of lightweight threading is important to research, but difficult to achieve due to the lack of lightweight threading interfaces. To evaluate the performance potential of the API and how difficult it is to integrate into existing code, two representative applications were considered. First, a parallel quicksort algorithm was analyzed and modified to fit the qthread model. Second, a small parallel benchmark was modified to use qthreads.

5.1. Quicksort

Portability of an API does not completely free the programmer from taking the hardware into consideration when designing an algorithm. There are features of alternative threading environments that the qthread API does not emulate, such as the hashed memory design found in the Cray MTA-2. Memory addresses in the MTA-2 are distributed throughout the machine at word boundaries. When dividing work amongst several threads on the MTA-2, the boundaries of the work regions can be fine-grained without significant loss of performance. Conventional processors, on the other hand, assume that memory within page boundaries is contiguous. Thus, conventional cache designs reward programs that allow an entire page to reside in a single processor's cache, and limit the degree to which tasks can be divided among multiple processors without paying a heavy cache-coherency penalty.

An example wherein the granularity of data distribution can be crucial to performance is a parallel quicksort algorithm. In any quicksort algorithm, there are two phases: first the array is partitioned into two segments around a "pivot" point, and then both segments are sorted independently. Sorting the segments independently is relatively easy, but partitioning the array in parallel is more complex. On the MTA-2, elements of the array to be partitioned can be divided up among each thread without regard to the location of the elements. On conventional processors, however, that behavior is very likely to result in multiple processors transferring the same cache line or memory page between processors. Constantly sending the same memory back and forth between processors prevents the parallel algorithm from exploiting the capabilities of multiple processors.

The qthread library includes an implementation of the quicksort algorithm that avoids contention problems by ensuring that work chunks are always at least the size of a page. This avoids cache-line competition between processors while still exploiting the parallel computational power of all available processors on sufficiently large arrays. Figure 3 illustrates the scalability of the qthread-based quicksort implementation, and compares its performance to the libc qsort() function. This benchmark sorts an array of one billion double-precision floating point numbers on a 48-node SGI Altix SMP with 1.5 GHz Itanium processors.


[Figure 3: qutil_qsort() versus libc's qsort(): execution time as the number of available processors varies from 1 to 48.]

5.2. High performance computing conjugate gradient benchmark

The qthread API makes parallelizing ordinary serial code simple. As a demonstration of its capabilities, the HPCCG benchmark from the Mantevo project [20] was parallelized with the Qloop interface of the qthread library. The HPCCG program is a conjugate gradient benchmarking code for a 3-D chimney domain, largely based on code in the Trilinos [21] solver package. The code relies largely upon tight loops where every iteration of the loop is essentially independent of every other iteration. With simple modifications to the code structure, the serial implementation of HPCCG was transformed into multithreaded code. As illustrated in Figure 4, the parallelization is able to scale well. Results are presented using strong scaling with a uniform 75x75x1024 domain on a 48-node SGI Altix SMP. The SGI MPI results are presented to 48 processes, or one process per CPU, as further results would over-subscribe the processors, which generally underperforms with SGI MPI.

[Figure 4: HPCCG on a 48-node SGI Altix SMP: execution time of the serial, qthread, and MPI implementations as the number of processes/shepherds varies from 1 to 128.]

One of the features of the HPCCG benchmark is that it comes with an optimized MPI implementation. The MPI implementation, using SGI's MPI library, is entirely different from the qthread implementation and does not use shepherds. The qthread and MPI implementations scale approximately equally well up to about sixteen nodes. Beyond sixteen nodes, however, MPI performance degrades badly, while the qthread implementation's execution time does not change significantly.

Upon analysis of the MPI code, the poor performance of the MPI implementation is caused by MPI_Allreduce() in one of the main functions of the code. While this call takes almost 18.9% of execution time with eight MPI processes, it takes 84.1% of the execution time with 48 MPI processes. While it is tempting to simply blame the problem on a bad implementation of MPI_Allreduce(), it is probably more valid to examine the difference between the qthread and MPI implementations. The qthread implementation performs the same computation as the MPI_Allreduce(), but rather than requiring all nodes to reach the same point before the reduction can be computed and distributed, the computation is performed as the component data becomes available: as computational threads return their results they can exit, and other threads scheduled on the shepherds can proceed. The qthread implementation exposes the asynchronous nature of the whole benchmark, while the MPI implementation does not. This asynchrony is revealed even though the Unix implementation of the qthread library relies upon centralized synchronization, and would likely provide further improvement on a real massively parallel architecture.
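As a rough illustration of the difference (not the actual HPCCG code), a partial result can be folded into a shared accumulator under the documented qthread_lock()/qthread_unlock() calls as soon as each worker finishes, so the reduction proceeds incrementally instead of behind a barrier. The worker name, the accumulator, and the abbreviated prototypes are hypothetical.

#include <qthread.h>

static double global_dot = 0.0;   /* shared accumulator */

/* hypothetical worker: computes a partial dot product over its slice of the
 * vectors, then folds it into the shared total as soon as it is ready */
static aligned_t partial_dot(qthread_t *me, void *arg)
{
    double local = 0.0;
    /* ... accumulate this thread's slice into local ... */

    qthread_lock(&global_dot);    /* mutex-style lock keyed on the accumulator's address */
    global_dot += local;
    qthread_unlock(&global_dot);
    return 0;
}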

6. Related work

Lightweight threading models generally fit one of two descriptions: they either require a special compiler or they are not sufficiently designed for large-scale threading (or both). For example, Python stackless threads [28] provide extremely lightweight threads. Putting aside issues of usability, which is a significant issue with stackless threads, the interface allows for no method of applying data parallelism to the stackless threads: a thread may be scheduled on any processor. Many other threading models, from nano-threads [26] to OpenMP [11], lack a sufficient means of allowing the programmer to specify locality. This becomes a significant issue as machines get larger and memory access becomes non-uniform [31]. Languages such as Chapel [8] and X10 [6], or modifications to existing languages such as UPC [13] and Cilk [4], that require special compilers are interesting and allow for better parallel semantic expressiveness than approaches based on adding library calls to existing languages. However, such models not only break compatibility with existing large codebases but also do not provide for strong comparisons between architectures. Some threading models, such as Cilk, use a fork-and-join style of synchronization that, while semantically convenient, does not allow for as fine-grained control over communication


between threads as the FEB-based model, which allows individual load and store instructions to be synchronized.

The drawbacks of heavyweight, kernel-supported threading such as pthreads are well known [2], leading to the development of a plethora of user-level threading models. The GNU Portable Threads [14], for example, allow a programmer to use user-level threading on any system that supports the full C standard library. It uses a signal stack to allow the subthreads to receive signals, which limits its ability to scale. Coroutines [29] are another model that allows for virtual threading even in a serial-execution-only environment, by specifying alternative contexts that get used at specific times. Coroutines can be viewed as the most basic form of cooperative multitasking, though they can use more synchronization points than just context-switch barriers when run in an actual parallel context. One of the more powerful details of coroutines is that generally one routine specifies which routine gets processing time next, which is behavior that can also be obtained when using continuations [19, 27]. Continuations, in the broadest sense, are primarily a way of minimizing state during blocking operations. When using heavyweight threads, whenever a thread does something that causes it to stop executing, its full context (local variables, a full set of processor registers, and the program counter) is saved so that when the thread becomes unblocked it may continue as if it had not blocked. A continuation allows the programmer to specify that when a thread blocks it exits, and that unblocking causes a new thread to be created with specific arguments, thus requiring the programmer to save any necessary state to memory while any unnecessary state can be disposed of. Protothreads [12, 17] and Python stackless threads [28], by contrast, assert that outside of CPU context there is no thread-specific state (i.e. "stack") at all. This makes them extremely lightweight but limits the flexibility (at most, only one of them can call a function), which has repercussions for ease of use. User-level threading models can be further enhanced with careful kernel modification [25] to enable convenient support of many of the features of heavyweight kernel threads, such as signals, advanced scheduling conventions, and even limited software interrupt handling.

The problem of resource exhaustion due to excessive parallelism was considered by Goldstein et al. [16]. Their "lazy-threads" concept addresses the issue that most threading models conflate logical parallelism and actual parallelism. This semantic problem often requires that programmers tailor the expression of parallelism to the available parallelism, thereby forcing programmers either to incur too much overhead in low-parallelism situations or to forgo the full use of parallelism in high-parallelism situations.

The qthread API combines many of the advantages of other threading models. The API allows parallelism to be expressed independently of the parallelism used, much like Goldstein's lazy-thread approach. However, rather than requiring a customized compiler, the qthread API does this within a library that uses two different categories of threads: thread workers (shepherds) and stateful thread work units (qthreads). This technique, while convenient, has overhead that a compiler-based optional-inlining method would not: every qthread requires memory. This overhead can be limited arbitrarily through the use of futures, which are a powerful abstraction for expressing resource limitations without limiting the expressibility of inherent algorithmic parallelism.

7. Future work

Much work still remains in the development of the qthread API. A demonstration of how well the API maps to the APIs of existing large-scale architectures, such as the Cray MTA/XMT systems, is important to reinforce the claim of portability. Custom implementations for other architectures would be useful, if not crucial.

Along similar lines, development of additional benchmarks to demonstrate the potential of the qthread API and large-scale multithreading would be useful for studying the effect of large-scale multithreading on standard algorithms. The behavior and scalability of such benchmarks will provide guidance for the development of new large-scale multithreading architectures.

Thread migration is an important detail of large-scale multithreading environments. The qthread API addresses this with the shepherd concept, but the details of mapping shepherds to real systems require additional study. For example, shepherds may need to have limits enforced upon them, such as CPU pinning, in some situations. The effect of such limitations on multithreaded application performance is unknown, and deserving of further study.

8. Conclusions

Large-scale computation of the sort performed by common computational libraries can benefit significantly from low-cost threading, as demonstrated here. Lightweight threading with hardware support is a developing area of research that the qthread library assists in exploring, while simultaneously providing a solid platform for lighter-weight threading on common operating systems. It provides basic lightweight thread control and synchronization primitives in a way that is portable to existing highly parallel architectures as well as to future and potential architectures. Because the API can provide scalable performance on existing platforms, it allows study and modeling of the behavior of large-scale parallel scientific applications for the purposes of developing and refining such parallel architectures.


References

[1] E. Allen, D. Chase, J. Hallett, V. Luchangco, J.-W. Maessen, S. Ryu, G. L. S. Jr., and S. Tobin-Hochstadt. The Fortress Language Specification. Sun Microsystems, Inc., 1.0β edition, March 2007.
[2] T. E. Anderson, B. N. Bershad, E. D. Lazowska, and H. M. Levy. Scheduler activations: Effective kernel support for the user-level management of parallelism. ACM Transactions on Computer Systems, 10(1):53–79, 1992.
[3] A. Begel, J. MacDonald, and M. Shilman. Picothreads: Lightweight threads in Java.
[4] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: an efficient multithreaded runtime system. SIGPLAN Not., 30(8):207–216, 1995.
[5] J. B. Brockman, S. Thoziyoor, S. K. Kuntz, and P. M. Kogge. A low cost, multithreaded processing-in-memory system. In WMPI '04: Proceedings of the 3rd Workshop on Memory Performance Issues, pages 16–22, New York, NY, USA, 2004. ACM Press.
[6] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In OOPSLA '05: Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 519–538, New York, NY, USA, 2005. ACM.

[7] Cray XMT platform. http://www.cray.com/products/xmt/index.html, October 2007.
[8] Cray Inc., Seattle, WA 98104. Chapel Language Specification, 0.750 edition.
[9] H.-E. Crusader. High-end computing needs radical programming change. HPCWire, 13(37), September 2004.
[10] D. E. Culler, S. C. Goldstein, K. E. Schauser, and T. von Eicken. TAM: a compiler controlled threaded abstract machine. Journal of Parallel and Distributed Computing, 18(3):347–370, 1993.
[11] L. Dagum and R. Menon. OpenMP: An industry-standard API for shared-memory programming. IEEE Computational Science & Engineering, 5(1):46–55, 1998.
[12] A. Dunkels, O. Schmidt, T. Voigt, and M. Ali. Protothreads: Simplifying event-driven programming of memory-constrained embedded systems. In SenSys '06: Proceedings of the 4th International Conference on Embedded Networked Sensor Systems, pages 29–42, New York, NY, USA, 2006. ACM Press.

[13] T. El-Ghazawi and L. Smith. UPC: Unified Parallel C. In SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, page 27, New York, NY, USA, 2006. ACM.
[14] R. S. Engelschall. Portable multithreading: The signal stack trick for user-space thread creation. In ATEC '00: Proceedings of the 2000 USENIX Annual Technical Conference, pages 20–20, Berkeley, CA, USA, 2000. USENIX Association.
[15] M. C. Gilliland, B. J. Smith, and W. Calvert. HEP - a semaphore-synchronized multiprocessor with central control (heterogeneous element processor). In Summer Computer Simulation Conference, pages 57–62, Washington, D.C., July 1976.
[16] S. C. Goldstein, K. E. Schauser, and D. E. Culler. Lazy threads: Implementing a fast parallel call. Journal of Parallel and Distributed Computing, 37(1):5–20, 1996.
[17] B. Gu, Y. Kim, J. Heo, and Y. Cho. Shared-stack cooperative threads. In SAC '07: Proceedings of the 2007 ACM Symposium on Applied Computing, pages 1181–1186, New York, NY, USA, 2007. ACM Press.
[18] M. Hall, P. Kogge, J. Koller, P. Diniz, J. Chame, J. Draper, J. LaCoss, J. Granacki, J. Brockman, A. Srivastava, W. Athas, V. Freeh, J. Shin, and J. Park. Mapping irregular applications to DIVA, a PIM-based data-intensive architecture. In Supercomputing '99: Proceedings of the 1999 ACM/IEEE Conference on Supercomputing (CDROM), page 57, New York, NY, USA, 1999. ACM Press.

[19] C. T. Haynes, D. P. Friedman, and M. Wand. Continuations and coroutines. In LFP '84: Proceedings of the 1984 ACM Symposium on LISP and Functional Programming, pages 293–298, New York, NY, USA, 1984. ACM Press.
[20] M. Heroux. Mantevo. http://software.sandia.gov/mantevo/index.html, December 2007.
[21] M. Heroux, R. Bartlett, V. Hoekstra, J. Hu, T. Kolda, R. Lehoucq, K. Long, R. Pawlowski, E. Phipps, A. Salinger, et al. An overview of Trilinos. Technical Report SAND2003-2927, Sandia National Laboratories, 2003.
[22] Institute of Electrical and Electronics Engineers. IEEE Std 1003.1-1990: Portable Operating Systems Interface (POSIX.1), 1990.
[23] P. Kogge, S. Bass, J. Brockman, D. Chen, and E. Sha. Pursuing a petaflop: Point designs for 100 TF computers using PIM technologies. In Proceedings of the 1996 Frontiers of Massively Parallel Computation Symposium, 1996.
[24] C. D. Locke, T. J. Mesler, and D. R. Vogel. Replacing passive tasks with Ada9X protected records. Ada Letters, XIII(2):91–96, 1993.

[25] B. D. Marsh, M. L. Scott, T. J. LeBlanc, and E. P. Markatos. First-class user-level threads. In SOSP '91: Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, pages 110–121, New York, NY, USA, 1991. ACM Press.
[26] X. Martorell, J. Labarta, N. Navarro, and E. Ayguade. A library implementation of the nano-threads programming model. In Euro-Par, Vol. II, pages 644–649, 1996.
[27] A. Meyer and J. G. Riecke. Continuations may be unreasonable. In LFP '88: Proceedings of the 1988 ACM Conference on LISP and Functional Programming, pages 63–71, New York, NY, USA, 1988. ACM Press.
[28] Stackless python. http://www.stackless.org, January 2008.
[29] S. E. Sevcik. An analysis of uses of coroutines. Master's thesis, 1976.
[30] F. Szelenyi and W. E. Nagel. A comparison of parallel processing on Cray X-MP and IBM 3090 VF multiprocessors. In ICS '89: Proceedings of the 3rd International Conference on Supercomputing, pages 271–282, New York, NY, USA, 1989. ACM.
[31] J. Tao, W. Karl, and M. Schulz. Memory access behavior analysis of NUMA-based shared memory programs, 2001.


DOE's Institute for Advanced Architecture and Algorithms: an Application-Driven Approach

Richard C. Murphy
Sandia National Laboratories, PO Box 5800, MS-1319, Albuquerque, NM 87185

E-mail: [email protected]

Abstract. This paper describes an application-driven methodology for understanding the impact of future architecture decisions on the end of the MPP era. Fundamental transistor device limitations combined with application performance characteristics have created the switch to multicore/multithreaded architectures. Designing large-scale supercomputers to match application demands is particularly challenging since performance characteristics are highly counter-intuitive. In fact, data movement more than FLOPS dominates. This work discusses some basic performance analysis for a set of DOE applications, the limits of CMOS technology, and the impact of both on future architectures.

1. Introduction
As the MPP era draws to a close, the energy inefficiency of modern supercomputer architectures is rapidly becoming the limiting factor for architectural scalability. The MPP era's total reliance on commodity processors that are optimized for less challenging application spaces has created energy inefficiencies throughout the system. Further, fundamental transistor device physics has forced scaling in the form of increased parallelism at the chip level, rather than the exponential increase in single-thread performance that characterized the MPP era. Indeed, the transition to multicore architectures is driven entirely by the aging of the CMOS fabrication process. Without a well-defined successor capable of Moore's Law scaling, the transistor limits the potential of future supercomputers deployed in energy-constrained environments. As a result, changes to computer architecture are required to enable newer, more highly-scalable applications. However, choosing the right architectural path is extremely difficult. Generally, computer architects do everything possible to hide the details of the hardware implementation from the programmer: cache sizes, branch prediction policies, out-of-order execution mechanisms, and thread scheduling at the hardware level are all examples of critical architectural mechanisms and structures that are typically observed only indirectly by the programmer.

More worrisome still, each of these structures that architects use to boost single-thread performance and hide the depths and complexities of modern communication hierarchies (memory or interconnect) consumes energy that could otherwise be available for the computation if the programmer could exploit simpler, lighter-weight mechanisms. It has long been argued that these mechanisms should be exposed to the programmer [9, 5, 1, 4], but the transition to multithreaded/multicore architectures leaves little choice in an energy-constrained environment.

This paper presents a synthesis of application analysis results derived from simulation, analysis tools, and real hardware. The remainder of the paper is structured as follows:


Section 2 discusses some of the applications studied in this work; Section 3 examines the limits of CMOS transistors; Section 4 provides results to show that performance bottlenecks tend to be latency and concurrency dominated; and Section 5 describes the impact of these trends on future architectures.

2. Applications Discussed In This Work
The application studies referenced in this work are divided into two basic application classes: physics codes, which tend to be floating point oriented; and informatics codes, which tend to be more integer oriented. The full application analysis and simulation methodology has been previously described [7, 6, 8]. The results are summarized in this paper.

2.1. Physics Applications
Broadly speaking, the physics applications analyzed by Sandia are large-scale MPI applications, and represent real-world physical simulations. They are:

• ALEGRA is a finite element shock physics code capable of modeling near- and far-field responses to explosions, impacts, and energy depositions.

• CTH is a multi-material, large deformation, strong shock wave, solid mechanics code developed at Sandia National Laboratories over the last 30 years. CTH models multi-phase, elastic viscoplastic, porous, and explosive materials with multiple mesh refinement methods.

• Cube3 is a generic linear solver that drives the Trilinos framework for parallel linear and eigensolvers. It mimics a finite element analysis problem by creating hexagonal elements, then assembling and solving a linear system. The width, depth, and degrees of freedom (e.g., temperature, pressure, velocity, etc.) can be varied.

• ITS performs Monte Carlo simulations of linear time-independent coupled electron/photon radiation transport.

• MPSalsa is a high-resolution 3D simulation of reacting flow. The simulation requires both fluid flow and chemical kinetics modeling.

• Xyce is a parallel circuit simulation system capable of modeling very large circuits at multiple layers of abstraction (device, analog, digital, and mixed-signal). It includes both SPICE-like models and radiation models.

This suite of applications is somewhat more narrow than those of interest to the Office of Science, but future work in the DOE Institute for Advanced Architecture will apply the same fundamental analysis techniques to specific Office of Science applications.

2.2. Informatics Applications
Recently, new large-scale applications in the area of informatics have emerged as important and substantially different from the physics applications that have been the core of supercomputing over the past three decades. These applications tend to be more integer-oriented, and represent problems in discrete math. They are typically less structured (in terms of memory access patterns), and are often techniques used to analyze the output of a physical simulation, or to examine real-world data sets in an attempt to find patterns or form hypotheses. Additionally, because informatics applications tend to be less structured in nature, they are written using a variety of programming models. The applications discussed are:

• DFS implements a depth-first search on a large graph and forms the basis for several high-level algorithms including connected components, tree and cycle detection, solving the two-coloring problem, finding articulation vertices, and topological sorting.


• Connected Components breaks a graph into components. Two vertices are in a connected component if and only if there is a path between them.

• Subgraph Isomorphism determines whether or not a subgraph exists within a larger graph.

• Full Graph Isomorphism determines whether or not two graphs have the same shape or structure.

• Shortest Path computes the shortest path between two vertices using a breadth-first search. Real-world applications include path planning and networking and communication.

• Graph Partitioning is used extensively in VLSI circuit design and adaptive mesh refinement. The problem divides a graph into k partitions while minimizing the cut between the partitions (or the total weight of the edges that cross from one partition to another).

• BLAST is the Basic Local Alignment Search Tool and is the most heavily used method for quickly searching nucleotide and protein databases in biology.

• zChaff is a heuristic solver for the Boolean Satisfiability Problem. In propositional logic, a formula is satisfiable if there exists an assignment of truth values to each of its variables that makes the formula true.

These applications have the potential to significantly change the methodology used in computer-driven science, and fundamentally represent a new and disruptive use of high performance computing resources.

3. CMOS Transistor Limitations
The switch to multicore is fundamentally driven by a change in the CMOS transistor that forms the implementation technology for commodity processors. Power dissipation (manifesting itself as heat which must be removed from the device) can be broken into two components: Static Power Dissipation (Pstat), which is dissipated continuously while the circuit is powered, and Dynamic Power Dissipation (Pdyn), which is dissipated during switching. Although increasingly important, static power dissipation cannot be addressed by the computer architect (other than by powering off parts of the machine not in use), and is primarily a function of the fabrication process. It is defined as:

Pstat = Ileakage Vdd

where Ileakage is the leakage current of the device, and Vdd is the voltage. It should be noted that while the computer architect cannot do much to control leakage current, the system architect and the environment in which the machine is deployed can. At the junction, leakage is generally caused by thermally generated carriers, and increases exponentially as temperature rises. In fact, at 85◦C, leakage current increases by a factor of 60 over room-temperature values. These factors and increased failure rates must be taken into account in the "hotter running" compute environments that have become a topic of significant recent interest.

The majority of power dissipation is dynamic, and is defined as:

Pdyn = CL Vdd² f

where CL is the capacitance of the entire circuit, Vdd the voltage, and f the switching frequency (proportional to the clock frequency since not all devices will switch every cycle). This equation is critically important in describing the switch to multicore.
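To make the scaling argument concrete (illustrative round numbers, not figures from this report): doubling the device count roughly doubles CL, and historically this was offset by scaling the supply voltage, whose contribution is quadratic.

% Illustrative only: 2x the devices at 0.7x the supply voltage and the same
% switching frequency leaves dynamic power roughly unchanged.
\[
  \frac{P_{dyn}'}{P_{dyn}}
  = \frac{(2 C_L)\,(0.7\,V_{dd})^2\, f}{C_L\, V_{dd}^2\, f}
  \approx 0.98 .
\]
% With V_dd no longer able to drop, the same doubling of devices now doubles
% P_dyn unless the frequency f is held down; hence more, slower cores.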

Moore's Law as originally defined demands an exponential increase in the number of transistors, which increases CL. Specifically, for a given area, the total capacitance is the sum of the capacitance of each device. Although capacitance per device decreases as the feature size


decreases (decreasing the individual transistor's contribution to CL), the number of devices is increasing exponentially. Similarly, until the multicore era the frequency increased substantially in each generation. To keep the total dynamic power dissipation essentially constant (so that chips could be cooled with relatively inexpensive heat sinks at roughly room temperature), the equation was balanced by decreasing Vdd, which as it approaches 1 volt is nearing a fundamental limit (in fact, the rate at which Vdd has decreased has already begun to flatten).

Although there are additional manufacturing benefits to moving to multicore architectures, including simplifying the verification of designs, this fundamental technology limit is the real driver. The devices are capable of higher clock rates, at higher densities, but engineering a system to cope with the power dissipation is nearly impossible. The era is characterized by an increasing bounty of transistors (Moore's original observation), but an inability to increase single-thread performance (through faster clock rates). The inevitable result is more parallelism.

Of course, the end of the CMOS scaling curve in the next decade or so will cause additional fundamental physical limitations, unless a solution can be found.

4. Performance Bottlenecks Tend to Be Counterintuitive
The following section discusses basic application properties, and performance bottlenecks in the network and memory system.

4.1. Real Floating Point Applications Do Less Floating Point Than Expected

Figure 1. Instruction Mix

The instruction mix for the Physics and Informatics suites is depicted in Figure 1, and compared to the industry-standard SPEC CPU 2000 suite. While SPEC FP averages approximately one floating point instruction in three, real Sandia Physics codes average only 12% (and typically much less than 10%) because they exhibit more complex memory addressing patterns, which, in turn, require more integer instructions to calculate the addresses. In fact, nearly 85% of integer instructions are calculating memory addresses, with the remaining 15% split between real integer data and branch (boolean) support.

The Informatics applications are very different from the Physics codes as well, suggesting the possibility of two different supercomputer architecture classes in the future. They perform 15% fewer memory references, and those references are much more likely to miss the cache, which


stresses the memory subsystem. Additionally, the Informatics applications perform nearly twice as many branches as their Physics counterparts, leading to relatively smaller basic blocks. (This result is unsurprising due to the complex floating point calculations typically performed by the Physics codes [9].) This combination makes the Informatics codes more difficult to scale on modern superscalar processors than their Physics counterparts.

4.2. Latency and Concurrency Matter More than Bandwidth
Data movement systems are typically rated by bandwidth, but most evidence points to applications being more sensitive to latency or concurrency. This can be simplistically described by Little's Law (from Economics), which says that:

throughput = concurrency / latency

Particularly in the case of memory systems, increasing the number of memory requests outstanding (currently limited by hardware and instruction sequences) or decreasing the latency required to receive a response has significantly more impact than changing the bandwidth.
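As an illustrative (not measured) calculation of what Little's Law implies for a memory system: sustaining 10 GB/s with a 100 ns round-trip latency requires about

% concurrency = throughput x latency (illustrative numbers)
\[
  \text{concurrency} = 10\,\mathrm{GB/s} \times 100\,\mathrm{ns}
  = 1000\ \text{bytes in flight}
  \approx 16\ \text{outstanding 64-byte cache-line requests,}
\]

a level of memory-level parallelism that the address-generation-bound codes described above rarely reach.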

Figure 2. Latency and Bandwidth Sensitivity for (a) Physics Applications and (b) Informatics Applications

Figure 2 shows the change in Instructions Per Cycle (IPC, the computer architect's typical measure of performance) as relative latency and bandwidth are altered (the center bar being the latency and bandwidth of a typical processor's memory system). The full results and methodology have been described previously [6]. The results demonstrate that both application suites are significantly more sensitive to latency than to bandwidth. A decrease in bandwidth by half on either suite affects performance by less than 5%, while doubling the memory latency halves performance.1

1 It should be noted that this bandwidth measure is more strict than that presented for a typical memory part, which will also include a decrease in latency in subsequent generations.


These combined results are confirmed by observations from performance counters and other real-system measurements going back to ASCI Red, which demonstrate that real applications have difficulty saturating the memory bus on a modern architecture because address generation (the integer instructions discussed in Section 4.1) is the bottleneck more than typical memory performance.

This is further demonstrated by typical out-of-order processors that are designed to mask the multi-cycle latency of a Level 1 cache hit, more than the latency of a memory access (due to limitations of both the incoming instruction stream and the implementation technology).

The performance of the two application suites is also significantly different. The Physics applications achieve reasonable performance on the simulated out-of-order execution unit, achieving an IPC of 1.22 for the base case. The processor is capable of retiring 4 instructions per cycle, so the achieved IPC is approximately 30% of maximum. However, any IPC greater than one is quite good for a typical application suite. In contrast, the baseline case for the Informatics suite is only 0.70, indicating that the applications are significantly more memory intensive, causing the processor additional idle time. Key graph algorithms, including both isomorphism problems and connected components, show IPCs significantly less than 0.5. Aside from being less than an eighth of achievable performance, the results indicate that significant power is being wasted on high clock rates for those applications.

Figure 3. Application Memory Performance Properties

Figure 3 describes the fundamental memory performance properties of each benchmark suite, and has been fully described previously [7]. The spatial locality describes the probability that data near previously referenced data will be referenced. The temporal locality describes the probability that data already referenced will be reused. The relative size of the dots represents the amount of unique data that each application suite consumes over a fixed interval of instructions. These fundamental application properties dictate the performance observed in Figure 2, with applications consuming more data, and doing so in a more random fashion (i.e., closer to the origin on the graph), showing similarly low performance in comparison to applications with smaller data sets consumed in a more structured pattern. These fundamental application properties determine the class of architecture required to provide high performance at low power more than any other issue.


4.3. Network Performance
These issues are much more difficult to measure for the network, but preliminary results and prior experience show the importance of network injection rate vs. raw bandwidth.

Figure 4. Measurement of Bandwidth Restricted Performance on Red Storm. This is a recast of data collected by Kurt Ferreira at Sandia, a subset of which is discussed in [3].

The trade-offs between latency, bandwidth, and message rate are difficult to measure for modern networks, and they appear to be highly application dependent. This state of affairs makes drawing firm conclusions on minimal network design difficult, which may lead to either poor application performance for a subset of applications or over-provisioned networks for some subset of applications.

Figure 4 shows the effects of running three applications on Sandia's Red Storm machine, with bandwidth degraded in the mesh by 75%, while not degrading small message latency or injection rate. The mesh is intentionally capable of approximately twice the bandwidth of the link between processor and NIC to offer more stable performance in the face of message contention, whether from non-3D communication patterns or poor application and process placement. The applications shown are CTH, SAGE, and POP. CTH and SAGE were run at 2,048 nodes, and POP at 2,500. CTH and SAGE were run with weak scaling data sets, and POP as a strong scaling problem.

POP is limited by a combination of message rate and latency, and it is therefore not surprising that reducing network bandwidth does not impact application performance. SAGE, like POP, appears to be immune to bandwidth changes, although it is unclear whether this is problem specific. CTH, on the other hand, is traditionally thought to be insensitive to latency and message rate, yet shows a 27% performance degradation when mesh bandwidth is reduced to half of injection bandwidth. The difference between the behavior of POP and CTH suggests a significant challenge for hardware design: a need to balance hardware cost, energy usage, and performance across a wide variety of applications.

In an energy-optimized environment, one could envision a system that dynamically scales network bandwidth to meet both application and overall system demands (contention, I/O, etc.) when full topological bandwidth is not required for the current set of running applications.


5. Impact on Future Architectures: Multicore, Multithreading
Section 3 described the fundamental technology changes pushing multicore processors. The discussion of IPC in Section 4.2 shows that the processor is tremendously underutilized, especially for applications with very large, sparse data sets. Because the problem is fundamentally one of latency, there are only two basic solutions: avoiding the latency through faster devices, memory hierarchies, and caches; or tolerating the latency via mechanisms such as multiple outstanding loads, out-of-order execution, vector pipelines, or threads.

Avoiding latency via faster devices is fundamentally an economically and technologically driven process, and often provides a "one time" benefit rather than a sustained and continuous benefit over multiple generations. For example, cheap 3D integration of memory onto the processor would reduce wire lengths between the processor and memory from centimeters to microns, providing a one-time latency reduction.

Architectural techniques for avoiding latency, such as caching, also show diminishing returns. Caches place frequently used items closer to the processor, but become less effective in proportion to cost as they increase in size. They are also large consumers of energy.

In terms of latency toleration, many of the classic techniques have reached their limits. Given the relatively aggressive out-of-order execution in modern processors and the observed IPCs from Figure 2, it is clear that there is not much room for improvement. In fact, most processors are moving towards simpler latency toleration mechanisms because the power dissipation of out-of-order execution is so high compared with its relatively small benefits. As discussed previously, most applications fail to fully utilize memory system bandwidth, meaning that the problem of generating more outstanding loads is not due to a hardware limitation, but to fundamental application properties. For the Informatics suite in particular, the pointer chasing requires the results from one load to be available before another load can be generated, limiting architectural options for solving the problem.

The remaining option for power-efficient latency toleration is threads, as can be seen in the successes of early throughput-oriented architectures such as the Sun Niagara. Threads are:

• Relatively inexpensive to implement in hardware, requiring additional memory structures to support larger register files, but not requiring complex logic as is required for out-of-order execution;

• Efficient, in that switching to another thread while the currently executing thread waits on a long-latency event still allows "useful" work to be performed during the idle time (in contrast with techniques such as speculative execution, which simply create heat when the architecture predicts an erroneous path of execution);

• Low power, primarily because useful work is always being done and they replace more complex hardware structures;

• Capable of fully utilizing the memory system by exposing more address generation streams to the programmer (if the programmer can exploit them), which can be seen in the trend toward fewer memory controllers per core in modern processors; and,

• Potentially plentiful, in that for the desktop and server space for which the processors are optimized there typically exists sufficient parallelism (even at the task level) to allow multithreaded processors not to starve, which is potentially a significant application problem for future supercomputers.

Consequently, the trend towards multicore/multithreaded architectures is inevitable, especially in highly energy-constrained environments. The DARPA Exascale Report [2] describes the dominance of data movement in future exascale architecture energy budgets.


Acknowledgments
I would like to thank Kurt Ferreira for sharing his data used in the networking discussion, as well as Arun Rodrigues, Brian Barrett, Scott Hemmert, and Jim Ang for many useful discussions over the course of preparing this paper.

References
[1] Jay B. Brockman, Peter M. Kogge, Vincent Freeh, Shannon K. Kuntz, and Thomas Sterling. Microservers: A new memory semantics for massively parallel computing. In ICS, 1999.
[2] Peter M. Kogge (editor). ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems. University of Notre Dame, CSE Department Technical Report TR-2008-13, September 28, 2008.
[3] Kurt Ferreira, Ron Brightwell, and Patrick Bridges. Characterizing Application Sensitivity to OS Interference Using Kernel-Level Noise Injection. In Supercomputing 2008 (SC08), Austin, TX, November 2008.
[4] Mary Hall, Peter Kogge, Jeff Koller, Pedro Diniz, Jacqueline Chame, Jeff Draper, Jeff LaCoss, John Granacki, Apoorv Srivastava, William Athas, Jay Brockman, Vincent Freeh, Joonseok Park, and Jaewook Shin. Mapping Irregular Applications to DIVA, A PIM-based Data-Intensive Architecture. In Supercomputing, Portland, OR, November 1999.
[5] Shannon K. Kuntz, Richard C. Murphy, Michael T. Niemier, Jesus Izaguirre, and Peter M. Kogge. Petaflop Computing for Protein Folding. In Proceedings of the Tenth SIAM Conference on Parallel Processing for Scientific Computing, Portsmouth, VA, March 12-14, 2001.
[6] Richard C. Murphy. On the Effects of Memory Latency and Bandwidth on Supercomputer Application Performance. In IEEE International Symposium on Workload Characterization 2007 (IISWC2007), September 27-29, 2007.
[7] Richard C. Murphy and Peter M. Kogge. On the Memory Access Patterns of Supercomputer Applications: Benchmark Selection and its Implications. To appear in IEEE Transactions on Computers.
[8] Richard C. Murphy, Arun Rodrigues, Peter Kogge, and Keith Underwood. The Implications of Working Set Analysis on Supercomputing Memory Hierarchy Design. In The 2005 International Conference on Supercomputing, June 20-22, 2005.
[9] Arun Rodrigues, Richard Murphy, Peter Kogge, and Keith Underwood. Characterizing a New Class of Threads in Scientific Applications for High End Supercomputers. In Proceedings of the 18th Annual ACM International Conference on Supercomputing, pages 164–174, June 26-July 1, 2004.


Implementing a Portable Multi-threaded Graph Library: the MTGL on Qthreads

Brian W. Barrett, Jonathan W. Berry, and Richard C. Murphy
Sandia National Laboratories
Albuquerque, NM
{bwbarre,jberry,rcmurph}@sandia.gov

Kyle B. Wheeler
University of Notre Dame and Sandia National Laboratories
Notre Dame, IN
[email protected]

1. Introduction

Graph-based informatics applications challenge traditional high-performance computing (HPC) environments due to their unstructured communications and poor load-balancing. As a result, such applications have typically been relegated to either poor efficiency or specialized platforms, such as the Cray MTA/XMT series. The multi-threaded nature of the Cray MTA architecture1 presents an ideal platform for graph-based informatics applications. As commodity processors adopt features to enable greater levels of multi-threaded programming and higher memory densities, the ability to run these multi-threaded algorithms on less expensive, more available hardware becomes attractive.

The Cray MTA architecture provides both an auto-threading compiler and a number of architectural features to assist the programmer in developing multi-threaded applications. Unfortunately, commodity processors have increased the amount of concurrency available by adding an ever-growing number of processor cores on a single socket, but have not added the fine-grained synchronization available on the Cray MTA architecture. Further, while auto-threading compilers are being discussed, none provide the feature set of the Cray offerings.

Although massively multi-threaded architectures have shown tremendous potential for graph algorithms, development poses unique challenges. Algorithms typically rely on light-weight synchronization primitives (Full/Empty bits, discussed in Section 3.1). Parallelism is not expressed explicitly, but instead compiler hints and careful code construction allow the compiler to parallelize a given code. Unlike the Parallel Boost Graph Library (PBGL) [8], which runs on the developer's laptop as well as the largest supercomputers, applications developed for the MTA architecture only run on the MTA architecture. Experiments with the programming paradigm require access to the platform, which is obviously a constrained resource.

In this paper, we explore the possibility of using the Qthreads user-level threading library to increase the portability of scalable multi-threaded algorithms.

1. Throughout this paper, we will use the phrase MTA architecture to refer to Cray's multi-threaded architecture, including both the Cray MTA-2 and Cray XMT platforms.

The Multi-Threaded Graph Library (MTGL) [2], which provides generic programming access to the XMT, is our testbed for this work. We show the use of important algorithms from the MTGL on emerging commodity multi-core and multi-threaded platforms, with only minor changes to the code base. Although performance is not at the same level as the same algorithm on a Cray XMT, the performance motivates our technique as a workable solution for developing multi-threaded codes for a variety of architectures.

2. Background

Recent work [9], [12] has described the challenges in HPC graph processing. These challenges are fundamentally related to locality (both spatial and temporal), and the lack thereof when graph algorithms are applied to highly unstructured datasets. The PBGL [8] attempts to meet these challenges through storage techniques that reduce communication. These techniques have been shown to work well in certain contexts, though they introduce other challenges such as memory scalability. Even when they achieve run-time scalability, the processor utilization on commodity CPUs is considerably lower than that found in the MTA architecture.

2.1. Cray XMT

The Cray XMT is the successor to the Cray MTA-2 highly multi-threaded architecture. Unlike the MTA-2, in which all memory was equidistant from any processor on the network, the XMT uses a more traditional model in which memory is closer to a single processor than all others. The Cray XMT utilizes similar processors to the MTA-2, including the ability to sustain 128 simultaneous hardware threads, but with an improved 500 MHz clock rate. Rather than the custom network found on the MTA-2, the XMT utilizes the SeaStar-based network found on the Cray XT massively parallel processor distributed memory platform.

2.2. Multi-core Architectures

As processor vendors have begun offering quad-core processors, as well as more commodity multi-threaded processors such as the Sun Niagara processors, it has become possible to write multi-threaded applications on more traditional platforms. Given the high cost of even a small XMT platform, the ability of modern workstations to support a growing number of threads makes them attractive for algorithm development and experimentation.

The Sun Niagara platform opens even greater multi-threaded opportunities, supporting 8 threads per core and 8 cores per socket, for a total of 64 threads per socket. Current-generation Niagara processors support single-, dual-, and quad-socket installations. Unlike the Cray XMT, the Sun Niagara uses a more traditional memory system, including L1 and shared L2 cache structures, and an unhashed memory system. The machines are also capable of running unmodified UltraSPARC executables.

2.3. Multi-threaded Programming

Our approach is to take algorithm codes that have already been carefully designed to perform on the MTA architecture, and run them without altering the core algorithm on commodity multi-core machines by simulating the threading hardware. In contrast, codes written using frameworks specifically designed for multi-core commodity machines (e.g., SWARM [13]) will not run on the MTA architecture.

Standard multi-core software designs, such as Intel's Thread Building Blocks [10], OpenMP [6], and Cilk [3], target current multicore systems, and their architecture reflects this. For example, they lack a means of associating threads with a locale. This becomes a significant issue as machines get larger and memory access becomes more non-uniform.

Another important consideration is the granularity and overhead of synchronization. Existing large-scale multi-threaded hardware, such as the XMT, implements full/empty bits (FEBs). These provide blocking synchronization in a locality-efficient way. Existing multi-threaded software systems tend to use lock-based techniques, such as mutexes and spinlocks, or require tight control over memory layout. These methods are logically equivalent, but are not as efficient to implement. FEBs are memory efficient when implemented in hardware, and thus allow tight memory structures that can be operated upon safely without locking structures being inserted into them.

3. Qthreads

The Qthread API [16] is a library-based API for accessing lightweight threading and synchronization primitives similar to those provided on the MTA architecture. The API was designed to support large-scale lightweight threading and synchronization in a cross-platform library that can be readily implemented on both conventional and massively parallel architectures. On architectures where there is no hardware support for the features it provides, or where native threads are heavyweight, these features are emulated.

There are several existing threading models that support lightweight threading and lightweight synchronization, but none emulate the MTA architecture's semantics sufficiently closely.

Equivalents for basic thread control, FEB-based read and write functions, and basic threaded loops (analogs for many of the pragma-defined compiler loop optimizations available on the MTA architecture) are all provided by the API. Even though the operations that lack hardware support, such as the FEB-based operations, are emulated, they retain their usefulness as a means of inter-thread communication.

The API establishes convenient management of the basic memory requirements of threads as they are created. When insufficient resources are available, either thread creation fails or it waits for the resources to become available, depending on how the API is used.

Relatively speaking, locality of reference is not an important consideration on the MTA architecture, as the address space is hashed and divided among all processors at word boundaries. This is an unusual environment; locality is an important consideration on most other large parallel machines. To address this, the Qthread API provides a generalized notion of locality, called a "shepherd", which identifies the location of a thread. A machine may be described to the library as a set of shepherds, which can refer to memory boundaries, CPUs, nodes, or whatever is a useful division. Threads are assigned to specific shepherds when they are created.

3.1. Implementation of MTA Intrinsics

The MTA architecture has several intrinsic features that the Qthread library emulates. These features include full/empty bits (FEBs), fast atomic increments, and conditionally created threads.

On the MTA architecture, a full/empty bit (FEB) is an extra hardware flag associated with every word in memory, marking that word either full or empty. Qthreads uses a centralized collection data structure to achieve the same effect: if an address is present in the collection, it is considered "empty", and if not, it is considered "full". Thus, all memory addresses are considered full until they are operated upon by one of the commands that alter a memory word's contents and full/empty status. The synchronization protecting each word is pushed into the centralized data structure. Not all of the semantics of the MTA architecture can be fully emulated, however. For example, on the MTA architecture, all writes to memory implicitly mark the corresponding memory words as full. However, when pieces of memory are being used for synchronization purposes, even implicit operations are done purposefully by the programmer, and replacing implicit writes with explicit calls is trivial.
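As a concrete illustration of these emulated FEB operations, the sketch below uses the qthreads calls named in the library's documentation (qthread_empty, qthread_writeEF, qthread_readFE, qthread_fork); exact signatures have varied across qthreads releases, so this is a sketch rather than version-accurate code.

    // Minimal FEB-style producer/consumer sketch (assumed qthreads interface).
    #include <qthread/qthread.h>
    #include <cstdio>

    static aligned_t mailbox;               // one word guarded by an (emulated) full/empty bit

    static aligned_t producer(void *)
    {
        aligned_t value = 42;
        qthread_writeEF(&mailbox, &value);  // wait for "empty", write, mark "full"
        return 0;
    }

    int main()
    {
        aligned_t ret, got;
        qthread_initialize();
        qthread_empty(&mailbox);            // start the word in the "empty" state
        qthread_fork(producer, nullptr, &ret);
        qthread_readFE(&got, &mailbox);     // block until "full", read, mark "empty"
        std::printf("received %lu\n", (unsigned long)got);
        return 0;
    }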

The MTA architecture also provides a hardware atomic increment intrinsic. Atomic increment functions have often been considered useful, even on commodity architectures, and so hardware-based techniques for performing atomic increments are common. The Qthread API provides an atomic increment function that uses a hardware-based implementation on supported architectures, but falls back to emulated locks to achieve the same behavior on architectures without explicit hardware support in the library. This is an example of opportunistically using hardware features while providing a standardized interface, a key feature of the Qthread API.

3.2. Qthreads Implementation of Thread Virtualization

Conditionally created threads are called "futures" in MTA architecture terminology, and are used to indicate that threads need not be created now, but merely whenever there are resources available for them. This can be crucial on the MTA, as each processor can handle at most 128 threads, and extremely parallel algorithms may generate significantly more. The Qthread API provides an analogous feature through alternate thread-creation semantics that allow the programmer to specify the permissible number of threads that may exist concurrently, and which stall thread creation until the number of threads falls below that limit.

A key application of this is in loops. While a given loop may have a large number of entirely independent iterations, it is typically unwise to spawn all of the iterations as threads, because each thread has a context and eventually the machine will run out of memory to hold all the thread contexts. Limiting the number of concurrently extant threads limits the amount of overhead consumed by the threads. In a loop, the option to stall thread creation while the maximum number of threads already exists makes it possible to specify a threaded loop without the risk of using an excessive amount of memory for thread contexts. The limit on the number of threads is a per-shepherd limit, which helps with load balancing.
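The same throttling idea can be expressed with standard primitives. The sketch below is a conceptual analogue using a C++20 counting semaphore, not the Qthread futures API; the 128-task cap mirrors the per-processor thread limit discussed above.

    // Conceptual "futures"-style throttled task creation (standard C++, not qthreads).
    #include <semaphore>
    #include <thread>

    constexpr int MAX_LIVE = 128;                      // cap on concurrently extant tasks
    std::counting_semaphore<MAX_LIVE> slots(MAX_LIVE);

    void loop_body(int i) { (void)i; /* one independent iteration */ }

    int main() {
        const int iterations = 1 << 20;
        for (int i = 0; i < iterations; ++i) {
            slots.acquire();                           // stall creation while MAX_LIVE tasks exist
            std::thread([i] { loop_body(i); slots.release(); }).detach();
        }
        for (int i = 0; i < MAX_LIVE; ++i)             // drain: wait until every task has finished
            slots.acquire();
        return 0;
    }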

4. The Multi-Threaded Graph Library

The Multi-Threaded Graph Library is a graph library designed in the spirit of the Boost Graph Library (BGL) and the Parallel Boost Graph Library. The library utilizes the generic component features of the C++ language to allow flexibility in graph structures without changes to a given algorithm. Unlike the distributed-memory, message-passing-based PBGL, the MTGL was designed specifically for the shared-memory multi-threaded MTA architecture. The MTGL includes a number of common graph algorithms, including the breadth-first search, connected components, and PageRank algorithms discussed in this paper.

To facilitate writing new algorithms, the MTGL provides a small number of basic intrinsics upon which graph algorithms can be implemented. The intrinsics hide much of the complexity of multi-threaded race conditions and load balancing from algorithm developers and users. Parallel Search (PSearch), a recursive parallel variant of depth-first search (which is not truly depth-first, in order to achieve parallelism), combined with an extensive vertex and edge visitor interface, provides powerful parallelism for a number of algorithms.

MTA architecture-specific features used by the MTGL are either compiler hints specified via the #pragma mechanism or are encapsulated into a limited number of templated functions, which are easily re-implementable for a new architecture. An example is the mt_readfe call, which translates to readfe on the MTA architecture, a simple read for serial builds on commodity architectures, and qthread_readfe on commodity architectures using the Qthreads library.
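A hypothetical sketch of such a wrapper is shown below; the guard macros, the cast through aligned_t, and the exact qthreads entry point are illustrative assumptions rather than the MTGL's actual definitions, and T is assumed to be a machine word.

    // Illustrative portability wrapper: one call site, three back-ends.
    #if defined(USING_QTHREADS)
    # include <qthread/qthread.h>
    #endif

    template <typename T>
    inline T mt_readfe(T &target)
    {
    #if defined(__MTA__)
        return readfe(&target);                        // MTA/XMT full/empty intrinsic
    #elif defined(USING_QTHREADS)
        aligned_t tmp;
        qthread_readFE(&tmp, (aligned_t *)&target);    // emulated full/empty read
        return (T)tmp;
    #else
        return target;                                 // serial build: a simple read
    #endif
    }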

The combination of an internal interface for explicit parallelism and the set of core intrinsics upon which much of the MTGL is based provides an ideal platform for extension to new platforms. While auto-threading compilers like those found on the MTA architecture are not available for other platforms, the small number of intrinsics can be hand-parallelized with a reasonable amount of effort.

5. Qthreads and the MTGL

Making the MTGL into a cross-platform library required overcoming significant development challenges. The MTA architecture programming environment has a large number of intrinsic semantics, and its cacheless hashed memory architecture has unusual performance characteristics. The MTA compiler also recognizes common programming patterns, such as reductions, and optimizes them transparently. For these reasons, the MTA developer is encouraged to develop "close to the compiler".

The necessary stack size, for example, presents a challenge. Some MTGL routines are highly recursive, and the MTA transparently expands the stack for each thread as needed. The Qthread library, however, has a fixed stack size. Iterative solutions, combined with larger stacks, were required to address the issue.

Both the MTA architecture and commodity processors are susceptible to the problem of hot spotting: performance degradation due to repeated access to the same memory location. The MTA architecture suffers from both read and write hot spotting, due to constraints in traffic across the platform's network. Commodity processors, however, provide cache structures that improve performance and benefit from read hot spotting. Commodity architectures also have a larger granularity of memory sharing: a cache line, which can be as large as 64 bytes. Concurrent writes within a cache line create a hot spot, even if the writes affect independent addresses. The cache was a consideration for atomic operations as well, as they typically cause a cache flush to memory. Avoiding atomic operations where possible, such as in reductions, is important for performance.

6. Multi-platform Graph Algorithms

We consider three graph kernel algorithms: a search, a component-finding algorithm, and an algebraic algorithm. There are myriad other graph algorithms, but we use these three as primitive representatives on which other algorithms can be built.

6.1. BFS

Breadth-first search (BFS) is perhaps the most fundamental of graph algorithms. Given a vertex v, find the neighbors of v, then the neighbors of those neighbors, and so on. Furthermore, BFS is well suited to parallelization. Pseudocode for BFS from [5] is included in Figure 1.

BFS(G, s)
 1  for each vertex u ∈ V[G] − {s}
 2      do color[u] ← WHITE
 3         d[u] ← ∞
 4  color[s] ← GRAY
 5  d[s] ← 0
 6  Q ← {s}
 7  while Q ≠ ∅
 8      do u ← DEQUEUE(Q)
 9         for each vertex v ∈ Adj[u]
10             do if color[v] = WHITE
11                then color[v] ← GRAY
12                     d[v] ← d[u] + 1
13                     ENQUEUE(Q, v)
14         color[u] ← BLACK

Figure 1. The basic BFS algorithm

There are two inherent problems with using this basic algorithm in a multithreaded environment. The first is that a parallel version of the for loop beginning on Line 9 will make many synchronized writes to the color array. This is a problem on machines like the Niagara regardless of the data characteristics. It is also a problem on the XMT if there is a vertex v of high in-degree (since many vertices u would test v's color simultaneously, making it a hot spot).

The second problem is even more basic: the ENQUEUE operation of Line 13 typically involves incrementing a tail pointer. As all threads will increment this same location, it is an obvious hot spot.

We avoid these problems by chunking and sorting: suppose that the next BFS level contains k vertices, whose adjacency lists have combined length l. We divide the work of processing these adjacencies into ⌈l/C⌉ chunks, each of size C (except for the last one). Then ⌈l/C⌉ threads process the chunks individually, saving newly discovered vertices to local stores. Each thread can then increment the Q tail pointer only once, mitigating that hot spot. However, in order to handle the color hot spot, we do not write the local stores directly into the Q. Rather, we concatenate them into a buffer, sort that buffer with a thread-safe sorting routine (qsort in Qthreads, or a counting sort on the XMT), then have a single thread put the unique elements of this array into the Q. This thread does linear work in serial, but the "hot spot" is now used to advantage on cache-based multicore architectures.
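The sketch below illustrates the chunk-and-sort expansion of one BFS level in simplified, serial form; the inner per-chunk loop is the part that would be run by the ⌈l/C⌉ threads. The names and the omission of the distance array are our own simplifications, not MTGL code.

    #include <algorithm>
    #include <vector>

    void expand_level(const std::vector<std::vector<int>> &adj,
                      const std::vector<int> &level,        // current BFS level
                      std::vector<int> &color,              // 0 = WHITE
                      std::vector<int> &next_q,             // next BFS level
                      std::size_t C)                        // chunk size
    {
        // Flatten the adjacencies of this level, then carve them into ceil(l/C) chunks.
        std::vector<int> adjacencies;
        for (int u : level)
            adjacencies.insert(adjacencies.end(), adj[u].begin(), adj[u].end());

        std::vector<int> buffer;                            // concatenated local stores
        for (std::size_t base = 0; base < adjacencies.size(); base += C) {
            std::size_t end = std::min(base + C, adjacencies.size());
            std::vector<int> local;                         // one thread's local store
            for (std::size_t j = base; j < end; ++j)
                if (color[adjacencies[j]] == 0)             // candidate discovery
                    local.push_back(adjacencies[j]);
            buffer.insert(buffer.end(), local.begin(), local.end());
        }

        // Sort, then let a single thread emit the unique vertices into the queue,
        // touching each color word exactly once.
        std::sort(buffer.begin(), buffer.end());
        for (std::size_t j = 0; j < buffer.size(); ++j) {
            if (j > 0 && buffer[j] == buffer[j - 1]) continue;
            if (color[buffer[j]] == 0) {
                color[buffer[j]] = 1;                       // GRAY
                next_q.push_back(buffer[j]);
            }
        }
    }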

A better BFS algorithm is known for the XMT. Although we currently do not have an implementation of this algorithm, it would be a straightforward exercise to incorporate it into the MTGL so that the same program could run efficiently on either type of platform.

6.2. Connected Components

A connected component of a graph G is a set S of vertices with the property that any pair of vertices u, v ∈ S is connected by a path. Finding connected components is a prerequisite for dividing many graph problems into smaller parts. The canonical algorithm for finding connected components in parallel is the Shiloach-Vishkin algorithm (SV) [15], and the MTGL has an implementation of this algorithm that roughly follows [1].

Unfortunately, a key property of many real-world datasets will limit the performance of SV in practice. Specifically, it is known both theoretically [7] (for random graphs) and in practice (for interaction networks such as the World-Wide Web and many social networks) that the majority of the vertices tend to be grouped into one "giant component" (GCC). Algorithms like SV work by assigning a representative to each vertex. Toward the end of these algorithms, all vertices in the GCC are pointing at the same representative, making it a severe hot spot.

We adopt a simple alternative to SV, which we call GCC-SV. It is overwhelmingly likely (though we do not provide any formal analysis here) that the vertex of highest degree is in the GCC. Given this assumption, we BFS from that vertex using the method of Section 6.1 (or psearch on the XMT), then collect all orphaned edges that do not link vertices discovered during this search. Running SV on the subgraph induced by the orphaned edges, we find the remaining components. This subproblem is likely to be small enough that even if the largest component of the induced subgraph is a GCC of that graph (which is likely), the running time is dwarfed by that of the original BFS. If there is no GCC in the original graph, then the original SV would perform well.


6.3. PageRank

#pragma mta assert nodep
for (int i = 0; i < n; i++) {
    double total = 0.0;
    int begin = g[i];
    int end   = g[i+1];
    for (int j = begin; j < end; j++) {
        int src = rev_end_points[j];
        double r = rinfo[src].rank;
        double incr = (r / rinfo[src].degree);
        total += incr;
    }
    rinfo[i].acc = total;
}

Figure 2. The MTGL code for PageRank's inner loop on the XMT

PageRank, the algorithm made famous by Google for ranking web pages [14], is a linear algebraic technique for modeling the propagation of votes through a directed graph, where each page contributes a fraction of its vote to each of its out-neighbors. Ranks continue propagating until convergence. A thorough mathematical explanation of PageRank is beyond the scope of this paper. However, at an abstract level PageRank is a sequence of matrix-vector multiplications, each followed by a normalization step. In graph terms, the most computationally expensive portion of the algorithm is simply traversing all of the adjacencies in the graph in order to accumulate votes.
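For concreteness, the iteration just described can be written in a standard form; the damping factor α and the handling of dangling vertices are not specified in this paper, so the following is a generic formulation rather than the MTGL's exact computation:

    r_i^{(k+1)} = \frac{1-\alpha}{n} + \alpha \sum_{j \in \mathrm{in}(i)} \frac{r_j^{(k)}}{\mathrm{outdeg}(j)}

The inner loop of Figure 2 computes exactly the sum over the in-neighbors j of vertex i, accumulating rank/degree contributions into a per-vertex total.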

Figure 2 shows the vote-accumulation loops of PageRank used by the MTGL on the XMT. The structure of these loops enables the XMT compiler to merge them into one, and to remove the reduction of votes into the variable total from the final line of the inner loop. The result is excellent performance. We simulate this in a Qthread-enabled version of this code in the MTGL in order to achieve good scaling on multi-core machines.

6.4. R-MAT graphs

R-MAT [4] is a parameterized generator of graphs that can mimic real-world datasets. The term stands for "Recursive MATrix," derived from the generation procedure, which is a simulation of repeated Kronecker products [11] of the adjacency matrix by itself. Intuitively, the R-MAT procedure can be thought of as repeatedly dropping marbles through a series of plastic trays. The topmost one typically is divided into 4 quadrants, the second one into 16, etc. The bottom tray is the adjacency matrix. At each level, a marble will pass through one of 4 holes with probability given by 4 input parameters: a, b, c, d. Multiple edges are not allowed, so if a marble ends up on top of another marble in the adjacency matrix, it is discarded and we try again.

Varying the parameters a, b, c, d determines much about the structure of the resulting graph. For example, using a = 0.25, b = 0.25, c = 0.25, d = 0.25 would generate an Erdos-Renyi random graph. Putting more weight on one of the quadrants tends to generate an inverse power-law degree distribution, which is found in many real datasets.

In our experiments we generate two different classes ofR-MAT graphs:

• nice graphs have a = 0.45, b = 0.15, c = 0.15, d = 0.25. These graphs feature two natural communities at each of many levels of recursion (quadrants a and d). However, even in graphs with a quarter of a billion edges, the maximum vertex degree is only roughly a thousand.

• nasty graphs have a = 0.57, b = 0.19, c = 0.19, d = 0.05. These feature a much steeper degree distribution, with a maximum degree of roughly 200,000 in our quarter-billion-edge example. Load balancing would naturally be more challenging in this case.

Furthermore, we label our graphs with the exponent of the number of vertices and hold the average degree at a constant 16, since this is relatively close to (though an over-estimate of) the average degree of a page in the WWW. For example, graph "R-MAT 21 Nasty" has 2^21 vertices, 2^24 undirected edges, and R-MAT parameters as given above.
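A minimal sketch of the marble-dropping procedure follows. It is our own illustration, assuming the POSIX drand48 for the random draws; duplicate edges would be rejected and re-drawn by the caller, as described above.

    #include <cstdlib>
    #include <utility>

    // Returns one candidate edge of a 2^scale x 2^scale adjacency matrix.
    std::pair<int, int> rmat_edge(int scale, double a, double b, double c)
    {
        int row = 0, col = 0;
        for (int level = 0; level < scale; ++level) {
            double p = drand48();                              // choose one of the four holes
            int q = (p < a) ? 0 : (p < a + b) ? 1 : (p < a + b + c) ? 2 : 3;
            row = (row << 1) | (q >> 1);                       // quadrants c, d fall in the lower half
            col = (col << 1) | (q & 1);                        // quadrants b, d fall in the right half
        }
        return std::make_pair(row, col);                       // caller discards duplicates and retries
    }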

7. Multiplatform Experiments

We compare performance of the three graph kernel algorithms described in Section 6 (breadth-first search, connected components, and PageRank) on three platforms capable of executing multiple threads simultaneously: the Cray XMT, the Sun Niagara T2, and a traditional multi-socket, multi-core platform.

The Cray XMT used in testing contains 64 ThreadStorm processors running at 500 MHz, each capable of sustaining 128 simultaneous hardware threads, and 500 GB of shared memory. The SeaStar-based network is a 3-D torus in an 8x4x2 configuration. The system was running version 6.2.1 of the XMT operating system.

A Sun SPARC Enterprise T5240 server, with two 1.2 GHz UltraSPARC T2 processors, each capable of sustaining 64 simultaneous hardware threads, was also used in testing. The system contains 128 GB of memory and was running Sun Solaris 10, 5/08 Release. The Sun CoolThreads version of GCC was used to compile all tests.

Finally, a quad-socket, quad-core Opteron system, clocked at 2.2 GHz, provides a traditional multi-core environment. The system provides 32 GB of memory and is running Red Hat EL 5.1. GCC 4.1.2 was used to compile all tests.


[Plot omitted: completion time (seconds) vs. number of threads for R-MAT 21/23/25, Nice and Nasty inputs.]
Figure 3. Opteron Breadth-First Search

[Plot omitted: completion time (seconds) vs. number of threads for R-MAT 21/23/25, Nice and Nasty inputs.]
Figure 4. Niagara T2 Breadth-First Search

7.1. Breadth-First Search

We find that our method of avoiding hot spots in BFS enables scaling beyond what would be achievable by a naive algorithm. At the time of this writing, our implementation runs on the XMT, but does not perform as well as native XMT BFS implementations have done in the past. However, our method does leverage the multi-core and Niagara platforms effectively. As implied before, MTGL programmers will run BFS by associating a visitor object with the kernel algorithm, then running the latter. Underlying differences in the kernel implementation, such as that likely in the XMT implementation of BFS, will be hidden from the programmer.

7.2. Connected Components

Our connected components codes demonstrate strong scaling on multi-core and Niagara, as the GCC-SV algorithm is dominated by a single run of BFS on the realistic datasets we address.

[Plot omitted: completion time (seconds) vs. number of threads for R-MAT 21/23/25, Nice and Nasty inputs.]
Figure 5. Opteron Connected Components GCC-SV

[Plot omitted: completion time (seconds) vs. number of threads for R-MAT 21/23/25, Nice and Nasty inputs.]
Figure 6. Niagara T2 Connected Components GCC-SV

[Plot omitted: completion time (seconds) vs. number of processors for GCC-SV and SV on R-MAT 25 Nice and Nasty inputs.]
Figure 7. Cray XMT Connected Components - GCC-SV and SV

Furthermore, we are able to demonstrate strong scaling on the XMT as well by replacing the BFS with the recursive psearch. Note the effect of data on algorithm performance in Figure 7.


[Plot omitted: time per iteration (seconds) vs. number of threads for R-MAT 21/23/25, Nice and Nasty inputs.]
Figure 8. Opteron PageRank

[Plot omitted: time per iteration (seconds) vs. number of threads for R-MAT 21/23/25, Nice and Nasty inputs.]
Figure 9. Niagara T2 PageRank

Ironically, the "nasty" datasets are most friendly to the algorithm, as the vast majority of all vertices fall into the GCC in this case. As we consider the "nice" datasets, this GCC membership becomes less pathological (and less realistic). Therefore, the inherently hot-spotting SV algorithm has more work to do once the GCC has been processed.

7.3. PageRank

As we saw in Figure 2, PageRank can be written to leverage the auto-parallelizing compiler of the XMT quite effectively. We cannot match the XMT's performance in emulation without work to reconstruct the compiler's optimization. However, a straightforward parallelization of the outer loop using qthreads still provides significant benefit, as we see in Figures 8 and 9.
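The sketch below shows the shape of that outer-loop parallelization. It assumes a qt_loop(start, stop, worker, argument) helper of the kind shipped with the qthreads distribution, whose exact interface varies between releases; the rank_info layout is copied from Figure 2 rather than from the MTGL source, so treat the whole fragment as illustrative.

    #include <cstddef>

    struct rank_info { double rank; double acc; int degree; };
    struct pr_args   { const int *g; const int *rev_end_points; rank_info *rinfo; };

    static void accumulate(size_t start, size_t stop, void *varg)
    {
        pr_args *a = static_cast<pr_args *>(varg);
        for (size_t i = start; i < stop; ++i) {            // each task owns rows [start, stop)
            double total = 0.0;
            for (int j = a->g[i]; j < a->g[i + 1]; ++j) {
                int src = a->rev_end_points[j];
                total += a->rinfo[src].rank / a->rinfo[src].degree;
            }
            a->rinfo[i].acc = total;                       // private accumulator: no atomics needed
        }
    }
    /* ... qt_loop(0, n, accumulate, &args); */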

[Plot omitted: time per iteration (seconds) vs. number of processors for R-MAT 25 Nasty.]
Figure 10. Cray XMT PageRank

8. Conclusions and future work

Developing multi-threaded graph algorithms, even when using the MTGL infrastructure, presents a number of challenges, including discovering appropriate levels of parallelism, preventing memory hot spotting, and eliminating accidental synchronization. In this paper, we have demonstrated that using the combination of Qthreads and the MTGL with commodity processors enables the development and testing of algorithms without the expense and complexity of a Cray XMT. While achievable performance is lower for both the Opteron and Niagara platforms, performance issues are similar.

While we believe it is possible to port Qthreads to the Cray XMT, this work is still ongoing. Therefore, porting work must still be done to move algorithm implementations between commodity processors and the XMT. Although it is likely that the Qthreads version of an algorithm will not be as optimized as a natively implemented version, such a performance impact may be an acceptable trade-off for ease of implementation.

9. Acknowledgments

The BFS used in this paper leverages a load-balancing algorithm to divide adjacencies into chunks. Cynthia Phillips and Jon Berry developed a preliminary version of this algorithm in 2008.

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

References

[1] BADER, D., CONG, G., AND FEO, J. On the architectural requirements of efficient execution of graph algorithms. In The 33rd International Conference on Parallel Processing (ICPP) (2005), pp. 547–556.

[2] BERRY, J. W., HENDRICKSON, B. A., KAHAN, S., AND KONECNY, P. Software and algorithms for graph queries on multithreaded architectures. In Proceedings of the International Parallel & Distributed Processing Symposium (2007), IEEE.

[3] BLUMOFE, R. D., JOERG, C. F., KUSZMAUL, B. C., LEISERSON, C. E., RANDALL, K. H., AND ZHOU, Y. Cilk: an efficient multithreaded runtime system. SIGPLAN Not. 30, 8 (1995), 207–216.

[4] CHAKRABARTI, D., ZHAN, Y., AND FALOUTSOS, C. R-MAT: A recursive model for graph mining. In SDM (2004).

[5] CORMEN, T., LEISERSON, C., RIVEST, R., AND STEIN, C. Introduction to Algorithms. MIT Press, 2001.

[6] DAGUM, L., AND MENON, R. OpenMP: An industry-standard API for shared-memory programming. IEEE Computational Science and Engineering 5, 1 (1998), 46–55.

[7] ERDOS, P., AND RENYI, A. On random graphs I. Publicationes Mathematicae 6 (1959), 290–297.

[8] GREGOR, D., AND LUMSDAINE, A. The Parallel BGL: A generic library for distributed graph computations. In Parallel Object-Oriented Scientific Computing (POOSC) (July 2005).

[9] HENDRICKSON, B., AND BERRY, J. Graph analysis with high-performance computing. Computing in Science and Engineering 10, 2 (2008), 14–19.

[10] INTEL CORPORATION. Intel® Thread Building Blocks, 1.6 ed., 2007.

[11] LESKOVEC, J., AND FALOUTSOS, C. Scalable modeling of real graphs using Kronecker multiplication. In Proceedings of the 24th International Conference on Machine Learning (2007).

[12] LESKOVEC, J., LANG, K. J., DASGUPTA, A., AND MAHONEY, M. W. Statistical properties of community structure in large social and information networks. In WWW (2008), pp. 695–704.

[13] MINAR, N., BURKHART, R., LANGTON, C., AND ASKENAZI, M. The Swarm simulation system: A toolkit for building multi-agent simulations. Working Paper 96-06-042, Santa Fe Institute, 1996.

[14] PAGE, L., BRIN, S., MOTWANI, R., AND WINOGRAD, T. The PageRank citation ranking: Bringing order to the web. Tech. rep., Stanford Digital Library Technologies Project, 1998.

[15] SHILOACH, Y., AND VISHKIN, U. An O(log n) parallel connectivity algorithm. J. Algorithms 3, 1 (1982), 57–67.

[16] WHEELER, K., MURPHY, R., AND THAIN, D. Qthreads: An API for programming with millions of lightweight threads. In Proceedings of the 22nd IEEE International Parallel & Distributed Processing Symposium (April 2008), MTAAP '08, IEEE Computer Society Press, pp. 1–8.


Portable Performance from Workstation to Supercomputer: Distributing Data Structures with Qthreads

Kyle B. Wheeler 1,2, Douglas Thain 2, and Richard C. Murphy 1
1 Sandia National Laboratories, Albuquerque, New Mexico, USA ({kbwheel,rcmurph}@sandia.gov)
2 University of Notre Dame, Computer Science and Engineering, Notre Dame, Indiana, USA ({kwheeler,dthain}@cse.nd.edu)

<&0$#%-$

Locality and data layout are critical to the performance of threaded parallel programs, but standard threading interfaces incorrectly presume that data is equally accessible from all threads. The qthread library's locality framework has been expanded to provide convenient CPU affinity. This locality support enables development of three locality-aware distributed data structures: a memory pool, an array, and a queue. This paper presents benchmark results illustrating the performance characteristics of these data structures on an Altix ccNUMA SMP system, a highly multithreaded Niagara-based server, and a conventional SMP workstation. The combination of locality and threading interface with adaptive distributed data structures provides scalable performance on multiple parallel architectures.

1. Introduction

+/=/5:#/7$C$3=520 5,3$5,B5 =($5C,0=57)2=27/# 57(/##$3.$0235C-#=27,)$5/::#27/=2,35:$)B,)C/37$& M37)$/023.#"* C-#N=2:),7$00,) 5C/7(23$0 5 (/A$ 5 3,3N-32B,)C 5C$C,)" 5 /77$00O<@9DP #/=$372$0& @3B,)=-3/=$#"* :/)/##$#5C/7(23$05(/A$/5F24$5A/)2$="5,B50=)-7=-)/#542BB$)$37$05=(/=52C:,0$5/5#/).$:$)B,)C/37$5:$3/#="5,353,3N,:=2C/#54/=/5#/",-= QR* SST* 0,4/=/5#,7/#2="5C-0=5?$5$G:,0$45B,)5:$)B,)C/37$&;=/34/)45=()$/423.523=$)B/7$0U 0-7(5/05:=()$/40 QVVT*

W:$39X QSYT* /34 5 M3=$# 5 1()$/423. 5 %-2#423. 5 %#,7J0O1%%P QVSTU/)$54$02.3$45B,)57,CC,42="5C-#=2:),7$0023./345C-#=27,)$50"0=$C0& 1($"5:),A24$5:,F$)B-#5/3457,3NA$32$3=5=,,#05B,)54$A$#,:23.50,B=F/)$5=/2#,)$45=,50-7(50"0N

!;/342/5205/5C-#=2:),.)/C5#/?,)/=,)"5,:$)/=$45?"5;/342/58,):,)/=2,3*/5>,7J($$459/)=2358,C:/3"* B,)5=($5@32=$45;=/=$05+$:/)=C$3=5,B5H3N$)."Z05</=2,3/#5<-7#$/)5;$7-)2="5D4C2320=)/=2,35-34$)57,3=)/7=5+HND8Y[N\[D>]RYYY&

=$C0& '(2#$5=($0$523=$)B/7$05C/J$52=5$/0"5=,57)$/=$5:/)/##$#=/0J0* =($"5/00-C$5=(/=5/##54/=/5205$E-/##"5/77$002?#$5B),C5/##=()$/40& %$7/-0$5=(205/00-C:=2,35205.$3$)/##"5B/#0$* =($0$523N=$)B/7$054,53,=5C/23=/2354/=/5/345=()$/45#,7/#2="5,)5:),A24$=,,#05=,54207,A$)5/345$G:#,2=50"0=$C5=,:,#,."& ;,C$5,:$)/=N23.50"0=$C05:),A24$5C$7(/320C05=,54207,A$)5C/7(23$5=,:,#N,."5/345=,50:$72B"5=()$/45/345C$C,)"5?#,7J5#,7/#2="& ;-7(C$7(/320C05/)$50"0=$CN0:$72^75/345.$3$)/##"53,3N:,)=/?#$&W-)50,#-=2,35=,5=(205:),?#$C5205=($5E=()$/45#2?)/)" QS_T*

F(27(523=$.)/=$05#,7/#2="5F2=(5=($5=()$/423.523=$)B/7$& 1($E=()$/45#2?)/)"5205/57),00N:#/=B,)C5=()$/423.5#2?)/)"5=(/=5:),NA24$05#2.(=F$2.(=5=()$/40* /3523=$.)/=$45#,7/#2="5B)/C$F,)J*/345#2.(=F$2.(=50"37(),32`/=2,35235/5F/"5=(/=5C/:05F$##5=,4$A$#,:23.5C-#=2=()$/4$45/)7(2=$7=-)$0& M=50-::,)=05C/3"42BB$)$3=5,:$)/=23.50"0=$C05/345(/)4F/)$5/)7(2=$7=-)$05/34:),A24$05/57,3020=$3=523=$)B/7$5=,5=($2)5-32E-$5C$=(,405,B4207,A$)23.5/3450:$72B"23.5#,7/#2="& '$5$G:/34$452=05#,N7/#2="5B)/C$F,)J5/345/44$45420=)2?-=$454/=/50=)-7=-)$05=(/=/4/:=5=,5<@9D $3A2),3C$3=0& 1($53$F5420=)2?-=$454/=/0=)-7=-)$05(24$5=($57,C:#$G2="5,B54/=/5#,7/#2="& 1(205:/:$):)$0$3=05=($54$02.35,B5/5420=)2?-=$45C$C,)"5:,,#* /5420=)2?N-=$45/))/"* /345=F,5B,)C05,B5420=)2?-=$45E-$-$&%$37(C/)J5)$0-#=0* ./=($)$45B),C50$A$)/#5:/)/##$#5/)7(2N

=$7=-)$0* 4$C,30=)/=$5=($507/#/?2#2="5,B5=($0$54/=/50=)-7=-)$0&%"57,C:/)23.5,:$)/=2,30N:$)N0$7,345/345$BB$7=2A$5?/34NF24=(* =($5?$37(C/)J050(,F5=($52C:,)=/37$5,B57(,,023.5/:N:),:)2/=$50"0=$CN0:$72^754$02.35:/)/C$=$)0* 0-7(5/05C$CN,)"5420=)2?-=2,35:/==$)35/3450$.C$3=502`$&1($5)$C/234$)5,B5=(205:/:$)5205,)./32`$45/05B,##,F0& D

0-)A$"5,B5)$#/=$45F,)J5205:)$0$3=$45235;$7=2,3 S& ;$7=2,3 _0-CC/)2`$05=($5E=()$/45#2?)/)"5/345=($5#,7/#2="5B$/=-)$052=:),A24$0& ;$7=2,3 [ 4$07)2?$05=($5=()$$5/)7(2=$7=-)$05-0$45=,?$37(C/)J5=($5420=)2?-=$454/=/50=)-7=-)$0& 1($54$02.35,B5=($4/=/50=)-7=-)$0* /#,3.5F2=(5?$37(C/)J5)$0-#=054$C,30=)/=23.=($2)5$B^72$37"* 205:)$0$3=$45235;$7=2,3 R& ;$7=2,3 a 0-..$0=0B-=-)$5)$0$/)7(&


[Table omitted: abridged Qthread API, with columns for thread operations (initialization, shutdown, spawning, spawning to a shepherd, migration, shepherd distance), memory pool operations (create, alloc, free, destroy), array operations (create, element access, parallel iteration, destroy), and queue operations (create, enqueue, dequeue, empty test, destroy). A qlfqueue is a lock-free queue; a qdqueue is a distributed queue.]
Figure 1. Qthread API (abridged)

2. Related work

1(205)$0$/)7(54)/F05B),C5=()$$57/=$.,)2$05,B5:)2,)5F,)Jd#2.(=F$2.(=5=()$/423.5C,4$#0* 4/=/50=)-7=-)$0* /345=,:,#,."23=$)B/7$0& 82#J Q[T5/345W:$39X QSYT5/)$5$G/C:#$05,B57,3NA$32$3=5/345#2.(=F$2.(=5=()$/423.5C,4$#0& e,F$A$)* =($"2.3,)$5#,7/#2="* B,)723.5=($5:),.)/CC$)5=,5)$#"5,35=($507($4N-#$)5=,5J$$:5$/7(5=()$/453$/)5=($54/=/52=5205C/32:-#/=23.& @3NB,)=-3/=$#"* =($5F,)JN0=$/#23.507($4-#23.5/#.,)2=(C05=($0$=()$/423.5C,4$#05-0$5/00-C$5=(/=5/##5=/0J057/35?$5:$)B,)C$4$E-/##"5F$##5,35/3"5:),7$00,)* F(27(5205/3523A/#245/00-C:N=2,35,35<@9D C/7(23$05F2=(5/5(2.(5#/=$37"5A/)2/?2#2="&8,37-))$3=5A$7=,)0* (/0($0* /345E-$-$0* 0-7(5/05:),A24$4

?"5M3=$#Z051()$/423.5%-2#423.5%#,7J0 QVST5/)$5.,,45$G/CN:#$05,B54/=/50=)-7=-)$054$02.3$45/),-345:/)/##$# 5C/3/.$NC$3=5,:$)/=2,305)/=($)5=(/354/=/5,:$)/=2,30* 2.3,)23.5#,7/#N2="& 8,N/))/"5f,)=)/3 QV]T5:),A24$05/5:/)/##$#57,3=/23$)U=($57,N/))/"U =(/=523=$.)/=$05#,7/#2="5F2=(5=($5$G$7-=2,3C,4$#* ?-=5-0$05#,/4$)N4$^3$450=/=275420=)2?-=2,3& D 420=)2?N-=$45E-$-$5205/50:$72/#57/0$5,B5/57,37-))$3=5:,,# QVRT* F(27(2054$02.3$45B,)5:/)/##$#5$B^72$37"& W-)5420=)2?-=$45E-$-$-0$05#,7/#2="523B,)C/=2,35=,57,C?23$5g,(30,3Z050=,7(/0=27E-$-$ QV_T* =($5:-0(N?/0$45E-$-$5?"5D):/72N+-00$/-5$=&/#& Q_T* /345/5(2.(N0:$$45#,7JNB)$$5E-$-$5?/0$45,35=($5F,)J,B5927(/$#5/345;7,== QVaT&8,-3=$)23=-2=2A$#"* =()$/423.523=$)B/7$05=$3453,=5=,5/4N

4)$005C$C,)"5=,:,#,."& 1(205205:),?/?#"5/5)$0-#=5,B5=($5(20N=,)27/#5#,F5A/)2/37$5235C$C,)"5/77$005#/=$37" QVhT& 9,4$)3/345B-=-)$5#/).$N07/#$50(/)$4NC$C,)"5C/7(23$05(/A$5#/).$)#/=$37"5A/)2/37$0* /345=($52C:/7=5205C/.32^$45?"5237)$/023.8X@ 0:$$40& 1,:,#,."523=$)B/7$05=$345=,5?$5,:$)/=23.N0"0N=$C50:$72^7& >23-G50"0=$C05#/).$#"5)$#"5,35=($5#2?3-C/ QV[T#2?)/)"5=,5$G:#,2=50"0=$C5=,:,#,."& 1(205#2?)/)"5:)$0$3=07,C:-=$)05/05/50$=5,B53-C?$)$453,4$0* 8X@0* /345i:("02N7/#j58X@05=(/=5,A$)#/:& 1($5420=/37$5=,5/53,4$Z05,F35C$CN,)"52053,)C/#2`$45=,5=$3* /345,=($)5420=/37$05/)$5$G:)$00$4,35=(/=507/#$& '2=(,-=5#2?3-C/* >23-G5:),7$00$057/354$N^3$5=($2)58X@ /B^32="5-023.5=($ .&/(0%.(*)1234*- +, 23=$)B/7$&@3B,)=-3/=$#"* =(20523=$)B/7$57(/3.$05/7),005J$)3$#5A$)02,30&

;,B=F/)$5=(/=5C-0=5F,)J5)$#2/?#"5/7),005C-#=2:#$5>23-G50"0N=$C05C-0=5$2=($)54$=$7=5=($5/A/2#/?#$523=$)B/7$5A/)2$="5,)5-0$=($5X,)=/?#$5>23-G5X),7$00,)5DB^32="5OX>XDP #2?)/)" QV\T*F(27(5:),A24$05/50=/?#$523=$)B/7$& ;,#/)2050"0=$C05:),A24$=($5#2?#.):5#2?)/)" QSVT* F(27(5:)$0$3=05/5(2$)/)7(27/#54$N07)2:=2,35,B5=,:,#,."& H/7(5#,7/#2="5.),-:5O#.):P57,3=/230/50$=5,B58X@05/34k,)5/442=2,3/#5#.):05/345205/00,72/=$45F2=(/5?#,7J5,B5C$C,)"& 6$7$3=5A$)02,305,B5=($5#2?)/)"57/35)$N:,)=5=($5#/=$37"5?$=F$$35#.):05235C/7(23$N0:$72^75-32=0& D##,B5=($0$523=$)B/7$05$3/?#$5=()$/458X@ /B^32="5/345#2?3-C//345#2?#.):5$3/?#$5C$C,)"58X@ /B^32="& %$7/-0$5=($0$23=$)B/7$05C/32:-#/=$5=($5,:$)/=23.50"0=$CZ0507($4-#$)5/34C$C,)"50-?0"0=$C* =($"57/35,3#"5/BB$7=5=(23.05=(/=5=($5,:$)N/=23.50"0=$C507($4-#$0* F(27(5/)$523($)$3=#"5($/A"F$2.(=&X),.)/CC23.5#/3.-/.$050-7(5/058(/:$# Q]T* lVY QaT*

f,)=)$00 QST* /345@X8 Q\T* :),A24$5$G:#272= 5 #,7/#2="5/05/B-37=2,35,B5=($5:),.)/CC23.5$3A2),3C$3=& 1($0$5#/3.-/.$0:),A24$5$G:)$002A$50$C/3=2705/3457/35?$5$BB$7=2A$5235$G:#,2=N23.5/A/2#/?#$5(/)4F/)$5)$0,-)7$0& @3B,)=-3/=$#"* =($"5/)$237,C:/=2?#$5F2=(5$G20=23.50,B=F/)$&

3. The Qthread library

1($5E=()$/45#2?)/)" QS_T5205/57),00N:#/=B,)C5#2.(=F$2.(==()$/423.5#2?)/)"5F2=(5/3523=$.)/=$45#,7/#2="5B)/C$F,)J5/34#2.(=F$2.(=50"37(),32`/=2,35=(/=5?)24.$05=($5./:5?$=F$$37,CC,42="50"0=$C05/3454$A$#,:23.5C-#=2=()$/4$45/)7(2N=$7=-)$0& D iE=()$/4j5205/53$/)#"N/3,3"C,-05=()$/45F2=(/50C/##50=/7J5=(/=5#/7J05=($5$G:$302A$5.-/)/3=$$05/345B$/N=-)$05,B5($/A"F$2.(=5=()$/40* 0-7(5/05:$)N=()$/45:),7$0024$3=2^$)05OXM+0P* 02.3/#5A$7=,)0* /345=($5/?2#2="5=,5?$57/3N7$#$4& 1(205#2.(=F$2.(=53/=-)$50-::,)=05#/).$53-C?$)05,B=()$/405F2=(5B/0=57,3=$G=N0F2=7(23.* $3/?#23.5(2.(N:$)B,)NC/37$5^3$N.)/23$45=()$/4N#$A$#5:/)/##$#20C& 1($5DXM ,B=($5E=()$/45#2?)/)"52050-CC/)2`$45235f2.-)$ V& 1($5#,7/#N2="5B)/C$F,)J5(/05?$$35$G:/34$45=,50-::,)=5=()$/45/3454/=//B^32="5235:/)/##$#50"0=$C0&m2)=-/##"5/##57,C:-=$)5/)7(2=$7=-)$05:),A24$5/=,C275,:N


[Diagram omitted: (a) Xeon, 1 node x 4 cores; (b) Altix, components of 16 nodes x 2 CPUs and 8 nodes x 2 CPUs; (c) Niagara, 2 chips x 8 cores x 8 threads.]
Figure 2. System Topologies

$)/=2,30* "$= 5 =($" 5 /)$ 5 3,= 5 42)$7=#" 5 /77$002?#$ 5 23 5 C,0=:),.)/CC23. 5 23=$)B/7$0n W:$39X 20 5 / 5 3,=/?#$ 5 $G7$:N=2,3& 1($ 5 E=()$/4 5 #2?)/)" 5 :),A24$0 5 /3 5 /=,C27 5 237)$NC$3=5A2/ !*/'()0%43&'+,* /345/=,C2757,C:/)$N/34N0F/:5F2=(!*/'()0%&).+,& +/=/N?/0$4 5?#,7J23. 5 0"37(),32`/=2,3 5 20 5 /:,F$)B-#5F/"5=,5$G:)$005=/0J54$:$34$372$0& 1($5E=()$/4#2?)/)"5:),A24$05?,=(5B-##k$C:="5?2=5OfH%P /345C-=$GN#2J$?#,7J23.5,:$)/=2,30& fH% ,:$)/=2,305/)$5/=,C275)$/4kF)2=$0=(/=54$:$345,35:$)NF,)45B-##k$C:="50=/=-0& ;,C$50"0=$C0*0-7(5/05=($58)/"591Dkl91*5:),A24$5(/)4F/)$50-::,)=5B,)fH% ,:$)/=2,30* ?-=5-0-/##"5=($"5C-0=5?$5$C-#/=$4&1($ 5 E=()$/4 5 #2?)/)" 5 -0$0 5 7,,:$)/=2A$NC-#=2=/0J23.d

?#,7J23.5,:$)/=2,305=)2..$)57,3=$G=50F2=7($0& 1($0$57,3=$G=0F2=7($05,77-)5235-0$)50:/7$5A2/5B-37=2,357/##05/345/)$5=(-07($/:$)5=(/3502.3/#N?/0$457,3=$G=50F2=7($0& 1(205/::),/7(/##,F05=()$/405=,5(24$57,CC-327/=2,35#/=$372$05/345/A,247,3=$G=N0F2=7(5,A$)($/45-3=2#5-3/A/2#/?#$54/=/52053$$4$4&H/7(5E=()$/45(/05/5#,7/=2,3* ,)5i0($:($)4&j ;($:($)40

4$^3$ 5 2CC,A/?#$ 5 )$.2,30 5F2=(23 5 =($ 5 0"0=$C& b=()$/407/3 5 ?$ 5 0:/F3$4 5 42)$7=#" 5 =, 5 / 5 0:$72^7 5 0($:($)4 5 F2=(!*/'()0%1#'5%*#+,* /34 5 C2.)/=$ 5 ?$=F$$3 5 0($:($)40 5 F2=(!*/'()0%647')*(%*#+,& HG:#272= 5 C2.)/=2,3 5 .-/)/3=$$0 5 =(/==()$/4057/33,=5$G$7-=$5235-3$G:$7=$45:#/7$0& ;-7(57,3=),#:)$A2,-0#"5)$E-2)$45=($5-0$5,B5:#/=B,)CN0:$72^75#2?)/)2$05/34($/A"F$2.(=5=()$/40& 1($5420=/37$5?$=F$$350($:($)405C/"?$54$=$)C23$45-023.5=($ !*/'()0%04.*)3&(+, B-37=2,3&

4. Parallel architectures

;$A$)/# 5 42BB$)$3= 5 0"0=$C0 5F2=( 5 4)/C/=27/##" 5 42BB$)$3==,:,#,.2$05/)$5-0$45=,54$C,30=)/=$5-023.5=,:,#,."5=,523B,)C)-3=2C$54$7202,30d /54-/#N:),7$00,)54-/#N7,)$5M3=$#5l$,3RVRY5F,)J0=/=2,3* /5[]N:),7$00,)5;oM D#=2G5_hYY577<@9D;9X*5/345/54-/#N:),7$00,)5VaN7,)$5;-35<2/./)/ S50$)A$)&1($0$5C/7(23$05F$)$50$#$7=$45=,54$C,30=)/=$5=()$$542BB$)N$3=5=":$05,B5:/)/##$#50"0=$C0d /57,CC,354$A$#,:C$3=5F,)JN0=/=2,3* /5#/).$577<@9D 0"0=$C* /345/5C/002A$#"5C-#=2N=()$/4$457(2:5/)7(2=$7=-)$& 1($5=,:,#,."5,B5$/7(5,B5=($0$

C/7(23$052052##-0=)/=$45235f2.-)$ S&1($5l$,350"0=$C* f2.-)$ SO/P* 205/54-/#N7,)$54-/#N:),N

7$00,)& H/7(5:),7$00,)5(/052=05,F357/7($5?-=50(/)$05=($VYaa9(`5B),3=N024$5?-0& 1(2050"0=$C5)-305>23-G* :),A24N23.5=($5#2?3-C/523=$)B/7$& +-$5=,5=($50(/)$45?-0* #2?3-C/:)$0$3=05=(2050"0=$C5/05/5023.#$53,4$5F2=(5B,-)5$E-2420=/3=:),7$00,)0& 1(205#/7J5,B54$=/2#5205-3B,)=-3/=$* 0237$57/7($N234$:$34$37$5205/352C:,)=/3=5/0:$7=5,B5=($5C$C,)"5(2$)/)N7("& '2=(,-=57/7($* =($5C$C,)"5#/=$37"5205-32B,)C&1($5D#=2G5;9X*5f2.-)$ SO?P* -0$05M3=$#5M=/32-C S5:),7$0N

0,)0& 1($53,4$05/)$57,33$7=$45?"54-/#5_&S5o%k05-3242)$7N=2,3/#5#23J0& H/7(53,4$5(/05=F,58X@05/345/5#/).$5?#,7J5,B6D9&51($5C/7(23$520542A24$4523=,5=F,57,C:,3$3=0d ,3$ODP F2=(5VaN3,4$05=($5,=($)5O%P F2=(5]& 8,C:,3$3=5D 205/)N)/3.$45235B,-)57#-0=$)05,B5B,-)53,4$0& 8,C:,3$3=5%Z053,4$0B,)C5/54-/#N:#/3$5B/=N=)$$& 1($5=F,57,C:,3$3=05/)$5/0"CNC$=)27/##"57,33$7=$45?"5/442=2,3/#5-3242)$7=2,3/#5#23J0& 1($0"0=$C5)-305>23-G5F2=(5#2?3-C/&1($5<2/./)/ S50$)A$)* f2.-)$ SO7P* (/05=F,5$2.(=N7,)$

:),7$00,)0& H/7(57,)$50-::,)=05$2.(=57,37-))$3=5(/)4F/)$=()$/40* B,)5/5=,=/#5,B5VS]57,37-))$3=5(/)4F/)$5=()$/40& H/7(7,)$5(/052=05,F35?/3J5235=($5>S57/7($5F(27(5205/77$002?#$B),C5/3"57,)$5235=(/=5:),7$00,)& H/7(57/7($5?/3J5:/2)50(/)$0/54-/#57(/33$#5f%+M99 C$C,)"57,3=),##$)& 1(2050"0=$C)-305;,#/)20 VY5/3450-::,)=05=($5#2?#.):523=$)B/7$&

5. Distributed data structures

>,7/#2=" 5 /F/)$3$00 5 7/3 5 2C:),A$ 5 =($ 5 :$)B,)C/37$ 5 ,B420=)2?-=$454/=/ 50=)-7=-)$0& 1()$$54/=/ 50=)-7=-)$54$02.304$C,30=)/=$5=(205,::,)=-32="d /5:,,#* /35/))/"* /345/5E-$-$&1($0$54/=/50=)-7=-)$5=":$05/)$57,)3$)0=,3$05,B5/::#27/=2,34$02.3* /345/)$5-0$45235/5F24$5)/3.$5,B5(2.(N:$)B,)C/37$/::#27/=2,30& 1($5J$"5B$/=-)$5,B5=($0$53$F54$02.305205=(/==($"5/4/:=5=,5=($5=,:,#,."5,B5=($50"0=$C5235-0$5/=5)-3=2C$&


[Plot omitted: allocations per second vs. number of threads on (a) Xeon, (b) Altix, and (c) Niagara-2, comparing qpool_alloc(), malloc(), a mutex-protected pool, and (on Niagara) mtmalloc.]
Figure 3. Multithreaded 44-byte memory allocations/sec, 100-million allocations.

5.1. Distributed memory pool

9$C,)"5/##,7/=2,35205/5B)$E-$3=#"5,A$)#,,J$454$=/2#5=(/=7/3502.32^7/3=#"52C:/7=5:$)B,)C/37$& ;=/34/)45C$C,)"5/#N#,7/=2,35#2?)/)2$05/)$5=":27/##"54$02.3$45B,)5.$3$)/#N:-):,0$/##,7/=2,35235023.#$N=()$/4$45/::#27/=2,30* ?/#/3723.5/##,7/N=2,350:$$45F2=(5#2C2=23.5B)/.C$3=/=2,3* F2=(,-=57,37$)35B,)#,7/#2="& 9$C,)"5#,7/#2="5205=":27/##"50:$72^$45:$)N:/.$5/343$$405/35/?0=)/7=2,35B,)5.$3$)/#N:-):,0$5-0$&D 0$=5,B5C$C,)"5:,,#05F2=(5C-=$G$05=,5:),A24$5=()$/4N

0/B$5/77$005205B-37=2,3/#* ?-=50-BB$)05B),C5(2.(5C-=$G5,A$)N($/4& 1($5E=()$/45C$C,)"5:,,#* E:,,#* 2052C:#$C$3=$45/0/50$=5,B5#,7/=2,3N0:$72^75#,7JNB)$$50=/7J05F(27(5:),A24$5B/0=/77$005=,5#,7/#5C$C,)"& 1,:,#,."523B,)C/=2,35205-0$45=,:-##5C$C,)"5B),C53$/)?"5C$C,)"5:,,#05F($353$7$00/)"&D E:,,#52057)$/=$45A2/ !"##$%&'()*(+,& W37$57)$/=$4* $#N

$C$3=057/35?$5B$=7($45B),C5=($5:,,#5F2=( !"##$%)$$#&+, /34)$=-)3$45=,5=($5:,,#5-023. !"##$%1'((+,& D :,,#52054$0=),"$4?" !"##$%0(.*'#-+,& H/7(5/##,7/=2,35205:-##$45B),C5=($5#,7/#0=/7J& >,7/#5C$C,)"5205/##,7/=$45235#/).$5?#,7J05/345:,)N=2,3$45,-=52357(-3J05,B5=($502`$50:$72^$45/=5:,,#57)$/=2,3&9$C,)"5=(/=5205B)$$45205:-0($45,3=,5/5#,7/#5#,7JNB)$$5)$N-0$0=/7J& @:,35/##,7/=2,3* =(2050=/7J52057($7J$45^)0=& MB5=($#,7/#50=/7J5205$C:="* C$C,)"5205:-##$45B),C5=($57-))$3=5#,N7/#5#/).$5/##,7/=$45?#,7J& '($35=(205?#,7J5205$G(/-0=$4* =($)$N-0$50=/7J05,B53$2.(?,)23.50($:($)45:,,#05/)$57($7J$4* 23)/34,C5,)4$)& MB53,3$5,B5=($C5(/A$5C$C,)"* /53$F5#/).$?#,7J5,B5C$C,)"5205/##,7/=$45B),C5=($5#,7/#53,4$Z05C$CN,)"* /345/44$45=,5/5#20=5,B5/##,7/=$45?#,7J0& MB5/##,7/=2,3B/2#0* =($5/##,7/=$45?#,7J05,B5/##5:,,#05/)$57($7J$4* 235,)4$),B5=($2)5420=/37$5B),C5=($5)$E-$0=23.5=()$/4& MB53,5C$C,)"7/35?$5B,-34* =($5/##,7/=2,35)$E-$0=5B/2#0&1($5.)/:(05235f2.-)$ _ 7,C:/)$5=($5/##,7/=2,305:$)50$7N

,345,B 5/ 5C-#=2=()$/4$45/::#27/=2,35 =(/= 5/##,7/=$0* F)2=$0=,* /3454$/##,7/=$05VYY5C2##2,350$:/)/=$5[[N?"=$5C$C,)"?#,7J0 5 O=($ 5 02`$ 5,B 5 / "*/'()0%68*(9%* ,3 5 0,C$ 5 0"0=$C0P&1()$$5/##,7/=2,35C$=(,405/)$57,C:/)$45,35=($5>23-G50"0N=$C0d =($5E:,,#* /502C2#/)5C-=$GN:),=$7=$457,37-))$3=5C$CN,)"5:,,#* /3450=/34/)45C/##,7& 1($5E:,,#5,-=:/7$05.#2?7C/##,7 QVYT5/=507/#$& 1($5C-=$GN?/0$45C$C,)"5:,,#050-BN

B$)5B),C5>23-GZ050#,F5C-=$G5,:$)/=2,30& 1($5C/##,752CN:#$C$3=/=2,3* F(2#$53,=5)$=-)323.5#,7/=2,3N0:$72^75C$C,)"*-0$05/4/:=2A$5/)$3/5/##,7/=2,35=,507/#$& 1($5;,#/)205C/#N#,75#2?)/)"* 4$02.3$45B,)50$)2/#5/::#27/=2,30* 205-32B,)C#"0#,F& ;,#/)20Z0 5C-#=2=()$/4$45C=C/##,7 5 #2?)/)" 5:),A24$0?$==$)5:$)B,)C/37$5B,)5)$#/=2A$#"5#,F53-C?$)05,B5=()$/40*?-=54,$053,=5/::$/)5=,5?$54$02.3$45B,)5C,)$5=(/3502G=$$3=()$/405/3457/35,3#"5/##,7/=$5?#,7J05235:,F$)N,BN=F,502`$0&

5.2. Distributed array

1($5420=)2?-=$45/))/"* ,)5E/))/"* -0$05/5?/0275i?#,7J$4j4$02.3* F2=(53,4$N0:$72^75/))/"50$.C$3=0& '$5$G/C23$=()$$5/0:$7=05,B5=(2054$02.3d 0$.C$3=5420=)2?-=2,35:/==$)3*0$.C$3=502`$* /345$#$C$3=502`$& 1($5E/))/"5420=)2?-=$052=0C$C,)"5F($357)$/=$4* A2/ !)'')-%&'()*(+,& W37$57)$/=$4* $#$NC$3=05F2=(235=($5/))/"57/35?$5/77$00$45F2=( !)'')-%($(6+,* ,)-023.5:,23=$)5C/=(5F2=(235/50$.C$3=& 8,3A$32$3=5:/)/##$#52=N$)/=2,357/35?$54,3$5F2=( !)'')-%4*('%$##"+, & 1(205C$7(/320C-0$05/5-0$)N0:$72^$45B-37=2,35=,5:),7$005234$G5)/3.$0* $3N0-)23.5=(/=5=($5B-37=2,35$G$7-=$053$/)5=($5)/3.$52=5:),7$00$0&D))/"05/)$54$/##,7/=$45F2=( !)'')-%0(.*'#-+,&1($5420=)2?-=2,35:/==$)35/345C$=(,45,B5#,7/=23.50$.C$3=0

2C:/7=05:$)B,)C/37$& W3$5,:=2,35205=,5:#/7$5$/7(50$.C$3=/77,)423.5=,52=05,)4$)5235=($5/))/"* A2/5/5(/0(& 1(205(/05=($A2)=-$05,B502C:#272="5/345-32B,)C2="* /345/A,2405/77$0023.C$C,)"5=,5#,7/=$50$.C$3=0* ?-=57/33,=5)$#,7/=$50$.C$3=0&D#=$)3/=$#"* =($5#,7/=2,35,B5$/7(50$.C$3=57/35?$50=,)$45F2=(=($50$.C$3=& ;$.C$3=05C-0=5?$5/77$00$45=,54207,A$)5=($2)#,7/=2,3* ?-=5=(205$3/?#$05/5F24$)5)/3.$5,B5420=)2?-=2,35:/=N=$)305/3450$.C$3=5)$#,7/=2,3& ;$A$)/#5,:=2,305/)$57,C:/)$4235f2.-)$ [& 1($5i;=/=275e/0(j5:/==$)35-0$05=($50$.C$3=,)4$)5=,54$=$)C23$50$.C$3=5#,7/=2,30& 1($5i+20=j5:/==$)30/##50=,)$50$.C$3=5#,7/=2,305F2=(235=($50$.C$3=0& i+20=56$.;=)2:$0j5420=)2?-=$0502C2#/)#"5=,5=($5;=/=275e/0(* i+20=56/34j420=)2?-=$05)/34,C#"* /345i+20=56$.5f2$#40j57#-0=$)050$E-$3N=2/#50$.C$3=05$A$3#"& i;$)2/#5M=$)/=2,3j52053,=5/5420=)2?-=2,3:/==$)3* ?-=50$)A$05=,57,C:/)$5=($5420=)2?-=2,35:/==$)305F2=(=":27/#53,3NE/))/"5/))/"52=$)/=2,3&M=5205F,)=(53,=23.5=(/=5:/)/##$#52=$)/=2,3507/#$05F$##5,3


[Plot omitted: bandwidth (MB/s) vs. number of threads on (a) Xeon, (b) Altix, and (c) Niagara-2 for the Static Hash, Dist Reg Stripes, Dist Rand, and Dist Reg Fields distributions, and for Serial Iteration.]
Figure 4. Impact of distribution pattern on multithreaded memory bandwidth over large arrays.

[Plot omitted: bandwidth (MB/s) vs. number of pages per segment at fixed thread counts on (a) Xeon, (b) Altix, and (c) Niagara-2, for the same distribution patterns as Figure 4.]
Figure 5. Impact of segment size and distribution pattern on multithreaded memory bandwidth over 100-million element arrays.

=($0$50"0=$C0& M=$)/=2,35F2=(5VS]5=()$/405F/05]a&SG5B/0=$)=(/350$)2/#5 2=$)/=2,35,35=($5<2/./)/ S& M=$)/=2,35F2=(5_S3,4$05F/05_V&SG5B/0=$)5=(/350$)2/#52=$)/=2,35,35=($5D#=2G& 1($l$,35F,)J0=/=2,35:$/J$45/=5S&VG5B/0=$)5=(/350$)2/#5,:$)/=2,3*#2J$#"5?$7/-0$5=($)$5/)$5,3#"5=F,5C$C,)"57,3=),##$)0* 7,CN:$=23.5B,)5/5)$#/=2A$#"5#,FN?/34F24=(5C$C,)"5?-0&1($50=/=275(/0(5F/05=($5B/0=$0=5,35=($5D#=2G5/345l$,35?$N

7/-0$5/77$0023.5)$C,=$5C$C,)"5=,5E-$)"52=05#,7/=2,352050#,F/3457/-0$05237,))$7=5:)$B$=7(23.& 1(-0* 0=/=275(/0(5205=($4$B/-#=5420=)2?-=2,35C$=(,45B,)5/##5?-=5=($5<2/./)/ S& 1($<2/./)/5S5?$3$^=05B),C5B$=7(23.5i237,))$7=j5?#,7J05?$7/-0$F(/=5205237,))$7=5B,)5,3$5=()$/45205#2J$#"57,))$7=5B,)5/53$/)?"=()$/4& +20=56$.5;=)2:$05#$A$)/.$05=($5<2/./)/ SZ050(/)$47/7($5?"57#-0=$)23.5=($5F,)J23.50$=5,B5C$C,)"&+20=)2?-=$45/))/"5:$)B,)C/37$5/#0,54$:$3405,35=($5420=)2N

?-=2,35.)/3-#/)2="* ,)5i0$.C$3=502`$&j 9,0=5#,7/#2="N/F/)$C$C,)"5/##,7/=2,35C$=(,405,:$)/=$5,35:/.$N02`$5C$C,)"?#,7J0* =($)$?"54$^323.5=($5C232C-C50$.C$3=502`$& 1(-0*,3#"5#/).$5/))/"057/35?$5$B^72$3=#"5420=)2?-=$4* /345=($50$.NC$3=502`$5205/5C-#=2:#$5,B5=($5:/.$502`$& f2.-)$ R 2##-0=)/=$0=($52C:/7=5,B50$.C$3=502`$& >/).$)50$.C$3=502`$057/35$2N=($)5$C:,F$)5:)$B$=7(23.5,)57)$/=$5#,/452C?/#/37$0& +20N

=)2?-=2,35(/05/502.32^7/3=52C:/7=5,35,:=2C/#50$.C$3=502`$&1($50=/=275(/0(5:$)B,)C/37$5A/)2$45#$005=(/35VSp5,35=($<2/./)/ S50$)A$)* F(2#$5=($5?$0=5:$)B,)C23.50$.C$3=502`$B,)5+20=56$.5f2$#405:),A24$05/5S&SG52C:),A$C$3=5,A$)5=($F,)0=5:$)B,)C23.502`$& 1($50=/=275(/0(5420=)2?-=2,35:),A24$4=($5?$0=50C/##50$.C$3=502`$5:$)B,)C/37$5,B5=($5420=)2?-=2,3C$=(,405=$0=$4n ,=($)5C$=(,4053$$45C-#=2:#$5:/.$05=,5C/0J=($57,0=5,B5237,))$7=5:)$B$=7(23.UVa5:/.$05205/5.,,454$B/-#=&D#2.3C$3=57/35?$57)2=27/#5=,5=/J23.5B-##5/4A/3=/.$5,B5=($

7/7($5/05F$##5/05/A,2423.5-33$7$00/)"5?-05=)/B^7& 8/7($0/#C,0=5/#F/"05#,/45/#2.3$454/=/5B),C5C$C,)"& D77$0023.-3/#2.3$454/=/57/35)$0-#=5235C-#=2:#$5#,/45230=)-7=2,30* 7,CN:,02=23.5230=)-7=2,30* /345$A$357)/0($0& f,)5$G/C:#$* f2.N-)$ aO7P50(,F05=(/=5=($5:$3/#="5B,)5-023.5-3/#2.3$454/=/57/3?$5/05(2.(5/05/5RYp5)$4-7=2,35235?/34F24=(& 1($50:2J$05/34A/)2/37$05235?/34F24=(50(,F35/)$5)$:$/=/?#$* 3,=5=($5)$0-#=,B5=$C:,)/)"5200-$0& DA,2423.50-7(5:),?#$C05205/5=/0J5=(/=205,B=$35#$B=5=,5=($57,C:2#$)5=,5(/34#$* :/)=27-#/)#"5B,)5.#,?/#/3450=/7J5A/)2/?#$0& ;237$5/5E/))/"5:$)B,)C05#/",-=5/=5)-3N=2C$* 4/=/5/#2.3C$3=5C-0=5?$5(/34#$45C/3-/##"&f2.-)$ a 0(,F05=($52C:/7=5,B5$#$C$3=502`$5/345/#2.3C$3=

,35:$)B,)C/37$& 1($0$5.)/:(057,C:/)$5=($5C$C,)"5?/34N


[Plot omitted: bandwidth (MB/s) vs. element size (bytes) on (a) Xeon, (b) Altix, and (c) Niagara-2, comparing aligned and unaligned reads/writes.]
Figure 6. Impact of alignment on multithreaded memory bandwidth over million-element arrays.

F24=(5/7(2$A$45F(2#$5/77$0023.5C2##2,3N$#$C$3=5/))/"05F2=(/5A/)2$="5,B5$#$C$3=502`$0& D :/7J$45/))/"52057,C:/)$45=,5/3]N?"=$5/#2.3$45/))/"& 9/3-/##"5/#2.323.54/=/5(/05/57#$/)5?$3N$^=5B,)50C/##5$#$C$3=502`$0* ?-=5(/05#$005?$3$^=5F2=(5#/).$5$#N$C$3=502`$0& 1(205205#2J$#"5?$7/-0$5-3/#2.3$45#/",-=5205C,)$7,34$30$4* /345=(/=5?$3$^=5,-=F$2.(05=($5:$3/#="5B,)5-3N/#2.3$45/77$005?$",345/57$)=/23502`$5$#$C$3=& 1(-0* -3#$00,=($)F20$542)$7=$4* =($5E/))/"5/-=,C/=27/##"5),-3405$#$C$3=02`$05-34$)5a[5?"=$05=,5=($53$G=N#/).$0=5C-#=2:#$5,B5$2.(=&

5.3. Distributed queue

X/)/##$# 5E-$-$ 54$02.30 54$:$345,3 5 =($ 5-0$N7/0$& MB 5/E-$-$5205/5?-BB$)5?$=F$$35=F,5=()$/40* /00-C:=2,3057/352CN:),A$50:$$4& f,)5$G/C:#$* /00-C23.5#,F57,3=$3=2,35B,)=($5E-$-$Z05($/45/345=/2#* =($)$53$$45,3#"5?$5,3$5,B5$/7(&b-$-$05/)$5/#0,5-0$45B,)5420=)2?-=23.5F,)J5/C,3.5C-#=2:#$=()$/40* F2=(5C-#=2:#$5:),4-7$)05/345C-#=2:#$57,30-C$)0& M3=(/=502=-/=2,3* =($5.-/)/3=$$05,B5/502C:#$)5E-$-$* #2J$5.#,?/#,)4$)23.* C/"5?$5)$#/G$45235,)4$)5=,5237)$/0$50:$$4& ;-7(/5E-$-$5C/"5:)$B$)54$E-$-$23.53$/)?"5$#$C$3=05,A$)5=($,#4$0=5$#$C$3=05235=($5,A$)/##5E-$-$&1($ 5 E=()$/4 5 #2?)/)" 5 :),A24$0 5 ?,=( 5 =":$0 5 ,B 5 E-$-$&

1($ 5 E#BE-$-$* ?/0$4 5 ,3 5927(/$# 5 /34 5 ;7,==Z0 5 #,7JNB)$$E-$-$ QVaT* .-/)/3=$$05.#,?/#5,)4$)23.& 1($5$34N=,N$345,)N4$)$45E-$-$5205=($5E4E-$-$& <$F5E#BE-$-$05/)$57)$/=$45F2=(!$1!8(8(%&'()*(+, /3454$0=),"$45F2=( !$1!8(8(%0(.*'#-+,& H#$NC$3=05C/"5?$5E-$-$45F2=( !$1!8(8(%(3!8(8(+, /3454$E-$-$4F2=( !$1!8(8(%0(!8(8(+,& D E-27J5$#$C$3=57($7J5C/"5?$5:$)NB,)C$45F2=( !$1!8(8(%(6"*-+,& HE-2A/#$3=5E4E-$-$5B-37=2,30?$.235F2=(5iE4E-$-$j5230=$/45,B5iE#BE-$-$&jf2.-)$ h 2##-0=)/=$05=($5$BB$7=5,B50=)27=5,)4$)23.5,35=($

07/#/?2#2="5,B5/5E-$-$5?"57,C:/)23.5=($5#,7JNB)$$5E#BE-$-$=,5/5023.#$NC-=$G5E-$-$52C:#$C$3=/=2,35B),C5=($57:),:05#2N?)/)" QVT& H30-)23.5.#,?/#5,)4$)23.5)$E-2)$05/50$)2/#2`$457)2=N27/#50$7=2,3& D #,7JNB)$$5E-$-$5205B/0=$)5=(/35/5C-=$GN?/0$4E-$-$5?$7/-0$52=5-0$05(/)4F/)$5/0020=/37$5=,5C232C2`$5,)N4$)23.5,A$)($/4& <$A$)=($#$00* =(2050$)2/#2`/=2,35/7=05/05/?,==#$3$7J5=(/=5:)$7#-4$0507/#23.&

1($57,)$5,B5/35$34N=,N$345,)4$)$45E-$-$5205C/=7(23.57,3N0-C$)05=,5:),4-7$)0& D 7$3=)/#5C/=7(C/J$)5205/502C:#$5/:N:),/7(* ?-=57)$/=$05/5?,==#$3$7J& e2$)/)7(27/#5C/=7(C/JN23.5)$4-7$057,30-C$)57,3=$3=2,3* ?-=5237)$/0$05=($5F,)J,B5:),4-7$)05F2=(,-=5/44)$0023.5=($2)5?,==#$3$7J5200-$0& Di0=,7(/0=275420=)2?-=$45E-$-$j QV_T* F($)$57,30-C$)05)$N:$/=$4#"5:),?$5)/34,C5:),4-7$)5E-$-$05-3=2#50/=20^$4* 20$B^72$3=5F($35=($5E-$-$05/)$5)/)$#"5$C:="& %-=52B5E-$-$0/)$5B)$E-$3=#"5$C:="* )/34,C5:,##23.57)$/=$05-33$7$00/)"F,)J& 6/34,C5:,##23.57/35?$5/A,24$45F2=(5/4A$)=20$C$3=0&X$) 5 #,7/=2,3* =($ 5 E4E-$-$ 5 -0$0 5 / 5 E#BE-$-$* / 5 #20= 5 ,B

Per location, the qdqueue uses a qlfqueue, a list of "ads" received, and a record of the last consumed element's source. Elements are enqueued locally. If the queue was not empty, ads are posted to nearby queues. An ad informs remote consumers that there is data available in this queue. The set of "neighbor" shepherds to receive advertisements from any given shepherd is determined at setup time, based on distance. Thus, the maximum work of a producer is fixed, independent of the size of the system. A producer can avoid resending ads by tagging them with a counter and tracking the highest consumed ad counter value. If the last-issued ads have not been consumed, ads need not be re-issued.
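The counter-based suppression of redundant ads can be pictured with the following toy snippet; the structure and field names are invented for illustration and are not the library's internal representation.

    /* Toy model of the rule above: a producer tags ads with a counter and
     * skips re-advertising while its newest ad is still unconsumed. */
    struct ad_state {
        unsigned long last_ad_issued;    /* counter of the newest ad posted    */
        unsigned long last_ad_consumed;  /* highest counter acked by consumers */
    };

    static int should_post_ad(struct ad_state *s)
    {
        if (s->last_ad_issued > s->last_ad_consumed)
            return 0;                    /* outstanding ad: nothing to resend  */
        s->last_ad_issued++;             /* otherwise issue a new, tagged ad   */
        return 1;
    }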

To dequeue, the local queue has priority. If the local queue was empty, any known ads are checked in order of distance. The source of the last-dequeued element is recorded locally. When responding to an ad, the consumer updates the advertiser's record of ads consumed. If none of the ads result in an element, it is necessary to check all remote queues, in order of distance from the consumer. If a remote queue is empty, that queue's last-consumed record is treated as an ad. If an ad is received while checking remote queues, it is checked immediately. Thus, consumers cooperate to find elements. If all remote queues are empty, the consumer must either return empty-handed or wait for an ad to be posted.
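The consumer-side search order can be summarized with a self-contained toy model (single-threaded and array-based, with distance approximated by index offset); it mirrors only the priority order described above, not the library's actual lock-free implementation.

    #include <stdio.h>

    #define NLOC 4   /* number of locations (shepherds) in the toy model */
    #define QCAP 4

    struct loc_queue {
        int items[QCAP];
        int count;
        int advertised;   /* has this location posted an ad? */
    };

    static struct loc_queue qs[NLOC];

    static int try_dequeue(struct loc_queue *q)
    {
        return (q->count > 0) ? q->items[--q->count] : -1;
    }

    static int consumer_dequeue(int me)
    {
        int v = try_dequeue(&qs[me]);                 /* 1. local queue first    */
        if (v >= 0)
            return v;
        for (int d = 1; d < NLOC; d++) {              /* 2. known ads, nearest   */
            int r = (me + d) % NLOC;                  /*    advertiser first     */
            if (qs[r].advertised && (v = try_dequeue(&qs[r])) >= 0)
                return v;
        }
        for (int d = 1; d < NLOC; d++) {              /* 3. all remote queues,   */
            int r = (me + d) % NLOC;                  /*    in order of distance */
            if ((v = try_dequeue(&qs[r])) >= 0)
                return v;
        }
        return -1;                                    /* 4. empty-handed         */
    }

    int main(void)
    {
        qs[2].items[qs[2].count++] = 99;   /* data two hops away, advertised */
        qs[2].advertised = 1;
        printf("consumer 0 dequeued %d\n", consumer_dequeue(0));
        return 0;
    }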

Figure 8 compares the scaling of the qdqueue with the TBB concurrent queue, which have similar ordering guarantees. The benchmark, in both cases, transfers word-sized data using a variable number of enqueuing threads with an identical number of dequeuing threads. The qdqueue can use CPU pinning, labeled "(Affinity)."


[Figure 7 consists of three panels: (a) Xeon, (b) Altix, and (c) Niagara 2. Each panel plots operations per second against the number of threads (up to 4, 32, and 128, respectively) for the lock-free queue and the one-mutex queue.]

Figure 7. The ops/sec of strictly ordered multithreaded queues.

[Figure 8 consists of three panels: (a) Xeon, (b) Altix, and (c) Niagara 2. Each panel plots operations per second against the number of threads for the Qthread distributed queue with and without affinity; panels (a) and (b) also include the Intel TBB queue.]

Figure 8. The ops/sec of end-to-end ordered multithreaded queues with small elements.

In single-threaded mode the qdqueue is just as fast as the qlfqueue, but improves on the qlfqueue's multi-thread performance. At scale, the qdqueue performed 9.6x better than the qlfqueue on the Altix, 72x better on the Niagara 2, and 4x better on the Xeon. The Niagara 2's performance is a result of cheap communication via shared cache. Cache-coherency penalizes shared information on the Altix and Xeon. Figure 9 illustrates the impact of using larger (1024-byte) blocks of data. Large inter-node transfers impose a higher penalty than small ones, magnifying the benefit of location affinity.

6. Future work

There are many additional array operations common in scientific computing that benefit from locality-awareness. Array stencils are common in signal processing, image processing, and solving partial differential equations. Foreknowledge of the stencils can be used to influence distributed array layout, such as by aligning stencil and segment boundaries and avoiding unnecessary thread spawns. Array combination, such as in string comparison, genetic research, matrix multiplication, and multi-dimensional arrays, is also common. These array combination operations have well-understood memory access patterns that provide opportunities for locality-aware optimization. A key question is where best to execute operations with dispersed input.

MapReduce [7] is a powerful asynchronous and easily pipelined data-processing abstraction popularized by Google. It could be implemented with distributed queues instead of the usual "master" node approach, naturally preserving locality between stages. This design may be useful for exploring the best number of workers at each stage, the optimal information transfer method, and the effect of worker loss and worker migration.

7. Conclusion

Developing portable locality-aware data structures, even with the advantage of a threading library with integrated locality, is a significant challenge. We have presented a portable threading interface that integrates locality and several data structure designs that use locality information, and we have examined the selection of optimal system-specific design parameters. We have demonstrated the effectiveness of locality-aware design in distributed data structures. Memory pools can be up to 155 times faster than traditional malloc() while providing location-specific memory. The qarray distributed array design supports strong scaling on large NUMA systems, executing 31.2 times faster with 32 nodes than with one. The qdqueue distributed queue design provides up to a 47x improvement over a fast lock-free queue by providing only end-to-end ordering.


[Figure 9 consists of three panels: (a) Xeon, (b) Altix, and (c) Niagara 2. Each panel plots operations per second against the number of threads for the Qthread distributed queue with and without affinity; panels (a) and (b) also include the Intel TBB queue.]

Figure 9. The ops/sec of end-to-end ordered multithreaded queues with large (1024 byte) elements.

Locality-awareness provides up to an 8.3x improvement in performance over the state-of-the-art concurrent queues on a large NUMA system. Importantly, the use of locality information does not significantly impact serial performance or performance on small systems with uniform memory access latencies. Exposing hardware non-uniformity is becoming common, and the unification of thread handling and topology is an important direction for future shared-memory data structure research.

References

[1] I. Aelion. cprops - C prototyping tools. http://cprops.sourceforge.net, March 2009.

[2] E. Allen, D. Chase, J. Hallett, V. Luchangco, J.-W. Maessen, S. Ryu, G. L. Steele Jr., and S. Tobin-Hochstadt. The Fortress Language Specification v. 1.0β. Sun Microsystems, 2007.

[3] R. H. Arpaci-Dusseau, E. Anderson, N. Treuhaft, D. E. Culler, J. M. Hellerstein, D. Patterson, and K. Yelick. Cluster I/O with River: Making the fast case common. In Proc. of the 6th Workshop on I/O in Parallel and Distributed Systems, pages 10-22, New York, NY, 1999. ACM.

[4] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: an efficient multithreaded runtime system. In Proc. of the 5th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, pages 207-216, New York, NY, 1995. ACM.

[5] J. Chapin, A. Herrod, M. Rosenblum, and A. Gupta. Memory system performance of UNIX on CC-NUMA multiprocessors. In Proc. of the 1995 ACM SIGMETRICS Joint International Conf. on Measurement and Modeling of Computer Systems, pages 1-13, New York, NY, 1995. ACM.

[6] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioğlu, C. von Praun, and V. Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In Proc. of the 20th Ann. ACM SIGPLAN Conf. on Object-Oriented Programming, Systems, Languages, and Applications, pages 519-538, New York, NY, 2005. ACM.

[7] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. of the 6th Symp. on Operating Systems Design & Implementation, page 10, Berkeley, CA, 2004. USENIX Assoc.

[8] R. Diaconescu and H. Zima. An approach to data distributions in Chapel. International Journal of High Performance Computing Applications, 21(3):313-335, 2007.

[9] T. El-Ghazawi and L. Smith. UPC: unified parallel C. In Proc. of the 2006 ACM/IEEE Conf. on Supercomputing, page 27, New York, NY, 2006. ACM.

[10] W. Gloger. Dynamic memory allocator implementations in Linux system libraries. http://www.dent.med.uni-muenchen.de/~wmglo/, May 1997.

[11] Institute of Electrical and Electronics Engineers. IEEE Std 1003.1-1990: Portable Operating Systems Interface (POSIX.1), 1990.

[12] Intel Corp. Intel Threading Building Blocks v. 1.6, 2007.

[13] T. Johnson. Designing a distributed queue. In Proc. of the 7th IEEE Symp. on Parallel and Distributed Processing, pages 304-311, Washington, DC, 1995. IEEE Comp. Soc.

[14] A. Kleen. An NUMA API for Linux. http://halobates.de/numaapi3.pdf, August 2004.

[15] U. Manber. On maintaining dynamic information in a concurrent environment. In Proc. of the 16th Ann. ACM Symp. on Theory of Computing, pages 273-278, New York, NY, 1984. ACM.

[16] M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proc. of the 15th Ann. ACM Symp. on Princ. of Dist. Comput., pages 267-275, New York, NY, 1996. ACM.

[17] D. S. Nikolopoulos, T. S. Papatheodorou, C. D. Polychronopoulos, J. Labarta, and E. Ayguadé. Is data distribution necessary in OpenMP? In Proc. of the 2000 ACM/IEEE Conf. on Supercomputing, page 47, Washington, DC, 2000. IEEE Comp. Soc.

[18] R. W. Numrich and J. Reid. Co-array Fortran for parallel programming. SIGPLAN Fortran Forum, 17(2):1-31, 1998.

[19] Open MPI Team. Portable Linux processor affinity. http://www.open-mpi.org/projects/plpa/, March 2009.

[20] OpenMP Architecture Review Board. OpenMP Application Program Interface, 2.5 edition, May 2005.

[21] Sun Microsystems, Inc., Santa Clara, CA. Memory and Thread Placement Optimization Developer's Guide, 2007.

[22] J. Tao, W. Karl, and M. Schulz. Memory access behavior analysis of NUMA-based shared memory programs, 2001.

[23] K. Wheeler, R. Murphy, and D. Thain. Qthreads: An API for programming with millions of lightweight threads. In Proc. of the 22nd IEEE International Parallel & Distributed Processing Symp. IEEE Comp. Soc., April 2008.


DISTRIBUTION:

1 MS 1322 John Aidun, 1435

1 MS 1319 Jim Ang, 1422

1 MS 1319 Brian Barrett, 1423

1 MS 1319 Ron Brightwell, 1423

1 MS 1320 Scott Collis, 1414

1 MS 1319 Doug Doerfler, 1422

1 MS 1322 Sudip Dosanjh, 1420

1 MS 1319 K. Scott Hemmert, 1422

1 MS 1318 Bruce Hendrickson, 1410

1 MS 9151 Howard Hirano, 8960

1 MS 9158 Curtis Janssen, 8961

1 MS 1319 Sue Kelly, 1423

1 MS 0801 Rob Leland, 9300

1 MS 0321 John Mitchner, 1430

1 MS 0321 James Peery, 1400

1 MS 1316 Danny Rintoul, 1409/1412

1 MS 1319 Arun Rodrigues, 1423

1 MS 0822 David Rogers, 1424

1 MS 1318 Suzanne Rountree, 1415

1 MS 1318 David Womble, 1540

2 MS 9018 Central Technical Files, 8944

2 MS 0899 Technical Library, 4536





