To appear in IEEE Trans. on Software Engr., Nov. 2000.

POEMS: End-to-End Performance Design of Large Parallel Adaptive Computational Systems

Vikram S. Adve, Rajive Bagrodia, James C. Browne, Ewa Deelman, Aditya Dube, Elias Houstis, John Rice, Rizos Sakellariou, David Sundaram-Stukel, Patricia J. Teller and Mary K. Vernon

Abstract—The POEMS project is creating an environment for end-to-end performance modeling of complex parallel and distributed systems, spanning the domains of application software, runtime and operating system software, and hardware architecture. Towards this end, the POEMS framework supports composition of component models from these different domains into an end-to-end system model. This composition can be specified using a generalized graph model of a parallel system, together with interface specifications that carry information about component behaviors and evaluation methods. The POEMS Specification Language compiler, under development, will generate an end-to-end system model automatically from such a specification. The components of the target system may be modeled using different modeling paradigms (analysis, simulation, or direct measurement) and may be modeled at various levels of detail. As a result, evaluation of a POEMS end-to-end system model may require a variety of evaluation tools including specialized equation solvers, queuing network solvers, and discrete-event simulators. A single application representation based on static and dynamic task graphs serves as a common workload representation for all these modeling approaches. Sophisticated parallelizing compiler techniques allow this representation to be generated automatically for a given parallel program. POEMS includes a library of predefined analytical and simulation component models of the different domains, and a knowledge base that describes performance properties of widely-used algorithms. This paper provides an overview of the POEMS methodology and illustrates several of its key components. The methodology and modeling capabilities are demonstrated by predicting the performance of alternative configurations of Sweep3D, a complex benchmark for evaluating wavefront application technologies and high-performance, parallel architectures.

Index Terms—Performance modeling, parallel system, message passing, analytical modeling, parallel simulation, processor simulation, task graph, parallelizing compiler, compositional modeling, recommender system.

1 INTRODUCTION

Determining the performance of large-scale computational systems across all stages of design enables more effective design and development of these complex software/hardware systems. Towards this end, the Performance Oriented End-to-end Modeling System (POEMS) project is creating and experimentally evaluating a problem-solving environment for end-to-end performance analysis of complex parallel/distributed systems, spanning application software, operating system (OS) and runtime system software, and hardware architecture. The POEMS project leverages innovations in communication models, data mediation, parallel programming, performance modeling, software engineering, and CAD/CAE into a comprehensive methodology to realize this goal. This paper presents an overview of the POEMS methodology, illustrates the component models being developed in the POEMS project by applying them to analyze the performance of a highly-scalable application, and finally describes preliminary work being performed in integrating multiple different models for a single application using the POEMS methodology.


• Vikram Adve is with the Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, IL 61801. E-mail: [email protected].

• Rajive Bagrodia and Ewa Deelman are with the Computer Science Department, University of California at Los Angeles, Los Angeles, CA 90095-1596. E-mail: {rajive,deelman}@cs.ucla.edu.

• James C. Browne and Aditya Dube are with the Department of Computer Science, University of Texas at Austin, Austin, TX 78712-1188. E-mail: {browne,dube}@cs.utexas.edu.

• Elias Houstis and John Rice are with the Computer Science Department, Purdue University, West Lafayette, IN 47907-1398. E-mail: {jrr,enh}@cs.purdue.edu.

• Rizos Sakellariou is with the Computer Science Department, University of Manchester, Manchester M13 9PL, U.K. E-mail: [email protected].

• David Sundaram-Stukel and Mary Vernon are with the Computer Science Department, University of Wisconsin-Madison, Madison, WI 53706. E-mail: {sunderam,vernon}@cs.wisc.edu.

• Patricia Teller is with the Department of Computer Science, The University of Texas at El Paso, El Paso, TX 79968. E-mail: [email protected].

A preliminary version of this paper appeared in the Proceedings of the First International Workshop on Software and Performance (WOSP) ’98, October 1998.


The key innovation in POEMS is a methodology that makes it possible to compose multi-domain, multi-paradigm, and multi-resolution component models into a coherent system model. Multi-domain models integrate component models from multiple semantic domains; in the case of POEMS, these domains are application software, OS/runtime software, and hardware. Multi-paradigm models allow the analyst to use multiple evaluation paradigms—analysis, simulation, or direct measurement of the software or hardware system itself—in a single system model. To facilitate the integration of models from different paradigms, the specification and implementation of POEMS models is accomplished through formulating component models as compositional objects with associative interfaces [11,12]. Associative module interfaces extend conventional interfaces to include complete specification of component interactions. A specification language (the POEMS Specification Language or PSL) for specification of multi-domain, multi-paradigm, multi-resolution performance models has been designed, and a compiler for this language is under development.

The POEMS project is building an initial library of component models, at multiple levels of granularity, for analyzing both non-adaptive and adaptive applications on highly-scalable architectures. POEMS supports analytical models, discrete-event simulation models at multiple levels of detail, and direct execution. The analytical models include deterministic task graph analysis [1], the LogP [15] family of models including LogGP [5] and LoPC [18], and customized Approximate Mean Value Analysis (AMVA) [37]. Simulation models include validated state-of-the-art processor and memory hierarchy models based on SimpleScalar [14], interconnection network models using the PARSEC parallel simulation language [9], large-scale parallel program simulations using the MPI-Sim simulator [28,10], and parallel I/O system simulators [8]. A unified application representation based on a combination of static and dynamic task graphs has been developed that can serve as a common workload representation for this wide range of performance models.
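To make the flavor of these analytic models concrete, the following sketch gives a generic LogGP-style estimate of point-to-point message cost. It is a minimal illustration of the model family cited above, not the calibrated POEMS LogGP model of Sweep3D; the parameter values are hypothetical.

    def loggp_send_time(k_bytes, L, o, G):
        # LogGP cost of one k-byte point-to-point message: sender overhead,
        # per-byte gap for the payload, wire latency, receiver overhead.
        return o + (k_bytes - 1) * G + L + o

    # Example: a 4 KB message with made-up parameters (times in seconds).
    print(loggp_send_time(4096, L=5e-6, o=2e-6, G=1e-8))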

Ongoing research within POEMS is developing techniques to integrate subsets of two or more of these component models, as a first step towards automatic integration of models within the POEMS framework. One such effort is integrating a compiler-generated analytical model based on the static task graph with the MPI-Sim simulator. This integrated model has the potential to greatly increase the size of problems and systems that can be simulated in practice. In addition, it can be expanded to include detailed processor simulation (e.g., with the SimpleScalar simulator), ideally for a small subset of the computational tasks. Another effort is examining the integration of MPI-Sim, LogGP, and SimpleScalar models. These ongoing efforts are described briefly in Section 6.

The project also is building a knowledge base of performance data, gathered during the modeling and evaluation process, which can be used to estimate, as a function of architectural characteristics, the performance properties of widely-used algorithms. The knowledge base, along with the component models, application task graphs, and formal task execution descriptions (called TEDs), is stored in the POEMS database, an integral part of the POEMS system.

POEMS development is being driven by the design evaluations of highly-scalable applications executed on parallel architectures. The first driver application is the Sweep3D [22] program that is being used to evaluate advanced and future parallel architectures at Los Alamos National Laboratory.

Section 2 presents an overview of the Sweep3D application, which is used to illustrate the POEMS methodology and performance prediction capabilities. Section 3 describes the conceptual elements of the POEMS methodology, and illustrates it with an example. Section 4 presents the suite of initial POEMS performance tools and the initial library of component models that are under development for Sweep3D. Section 5 presents results of applying these models to provide performance projections and design results for Sweep3D. Section 6 describes ongoing research on the development and evaluation of integrated multi-paradigm models in POEMS. Section 7 discusses related work. Conclusions are presented in Section 8.

2 POEMS DRIVER APPLICATION: SWEEP3D

The initial application driving the development of POEMS is the analysis of an ASCI kernel application called Sweep3D executed on high-performance, parallel architectures such as the IBM SP/2, the SGI Origin 2000, and future architectures.

Figure 2.1. One Wavefront of Sweep3D on a 2×4 Processor Grid.


The Sweep3D application is an important benchmark because it is representative of the computation that occupies 50-80% of the execution time of many simulations on the leading-edge DOE production systems [19]. Our analysis of this application has three principal goals. One goal is to determine which of the alternative configurations of the Sweep3D application has the lowest total execution time on a given architecture. A related, second goal is to provide quantitative estimates of execution time for larger systems that are expected to be available in the near future. The third goal is to predict the quantitative impact of various possible architectural and/or OS/runtime system improvements in reducing the required execution time. This section contains a brief description of this application; Sections 4 and 5 discuss how the POEMS methodology is being applied to obtain the desired performance projections for Sweep3D.

The Sweep3D kernel is a solver for the three-dimensional, time-independent, neutron particle transport equation on an orthogonal mesh [22]. The main part of the computation consists of a balance loop in which particle flux out of a cell in three Cartesian directions is updated based on the fluxes into that cell and other quantities such as local sources, cross section data, and geometric factors.

Figure 2.1 illustrates how the three-dimensional computation is partitioned on a 2×4 two-dimensional processor grid. That is, each processor performs the calculations for a column of cells containing J/2 × I/4 × K cells. The output fluxes are computed along a number of directions (called angles) through the cube. The angles are grouped into eight octants, corresponding to the eight diagonals of the cube (i.e., there is an outer loop over octants and an inner loop over angles within each octant). Along any angle, the flux out of a given cell cannot be computed until each of its upstream neighbors has been computed, implying a wavefront structure for each octant. The dark points in the figure illustrate one wavefront that originates from the corner marked by the circle. The processor containing the marked corner cell is the first to compute output fluxes, which are forwarded to the neighboring processors in the i and j dimensions, and so on. In order to increase the available parallelism, the wavefront is pipelined in the k dimension. That is, a processor computes the output fluxes for a partial column of cells of height mk, as shown in the figure, and then forwards the output fluxes for the partial column to its neighboring processors before computing the next partial block of results. (Since the octant starts in the marked upper corner of the cube, the next partial block of cells to be computed is below the dark shaded partial block in each column.) The computation is pipelined by groups of mmi angles, not shown in the figure. The amount of computation for each "pipelined block" is therefore proportional to it×jt×mk×mmi, where it and jt are the number of points mapped to a processor in the i and j dimensions, mk is the number of points in the k dimension per pipeline stage, and mmi is the number of angles per pipeline stage.
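The pipeline structure just described lends itself to a simple back-of-the-envelope cost sketch. The following Python fragment is a minimal sketch, not one of the POEMS models: it counts pipeline steps for one octant under the simplifying assumptions of uniform block cost, one block per step, and no communication overlap; the angle count per octant (angles) and the per-point work rate (w) are hypothetical parameters.

    def sweep_octant_steps(Px, Py, K, angles, mk, mmi):
        blocks = (K // mk) * (angles // mmi)   # pipelined blocks per processor
        startup = (Px - 1) + (Py - 1)          # steps before the far-corner processor starts
        return startup + blocks                # steps until the last processor finishes

    def octant_time(Px, Py, it, jt, K, angles, mk, mmi, w):
        block_cost = it * jt * mk * mmi * w    # work per pipelined block (see text)
        return sweep_octant_steps(Px, Py, K, angles, mk, mmi) * block_cost

    # Example: a 2x4 grid, K=1000, 6 angles per octant, mk=10, mmi=3.
    print(octant_time(2, 4, 500, 250, 1000, 6, 10, 3, w=1e-7))

The sketch already exhibits the basic trade-off the text describes: smaller mk and mmi shorten the pipeline fill time but multiply the number of (per-block) communication steps.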

Figure 3.1. Overview of the POEMS Environment


The inputs to Sweep3D include the total problem size (I, J, K), the size of the processor grid, the k-blocking factor (mk), and the angle-blocking factor (mmi). Two aggregate problem sizes of particular interest for the DOE ASCI program are one billion cells (i.e., I=1000, J=1000, K=1000) and 20 million cells (i.e., I=255, J=255, K=255). Key configuration questions include how many processors should be used for these problem sizes, and what are the optimal values of mk and mmi.

3 POEMS METHODOLOGY

Using POEMS, a performance analyst specifies the workload, operating system, and hardware architecture for a system under study, henceforth referred to as the target system. In response, as depicted in Figure 3.1, the completed POEMS system will generate and run an end-to-end model of the specified software/hardware system. The target system is defined via an integrated graphical/textual specification language (the POEMS Specification Language or PSL) for a generalized dependence graph model of parallel computation. Section 3.1 describes the process of component model composition and evaluation tool interfacing. The generalized dependence graph model of parallel computation is described in Section 3.2. The PSL is briefly introduced and illustrated in Section 3.3.

The nodes of the dependence graph are models of system components, i.e., instances of component models in the application, OS/runtime, and hardware domains. In specifying the target system, the analyst defines the properties of each system component in the context of, and in terms of, the attribute set of its semantic domain. For example, for a hardware component, the analyst defines the design parameters, the modeling paradigm, and the level of detail of the component model. As discussed in Section 3.3, the component models are implemented as compositional objects, i.e., "standard objects" encapsulated with associative interfaces [11, 12, 16] specified in the PSL. These component models, together with application task graphs and performance results, are stored in the POEMS database, described in Section 3.6. POEMS also includes a knowledge-based system called the Performance Recommender to assist analysts in choosing component models. The Performance Recommender is sketched in Section 3.5. As described in Section 3.2, the application domain represents a parallel computation by a combination of static and dynamic task graphs, which are specialized forms of generalized dependence graphs. Generalized dependence graphs can be used in the operating system domain to model process and memory management, interprocess communication, and parallel file systems. In the hardware domain, the nodes of a graph are associated with models of hardware components. Figure 3.2 is a schematic of a generalized dependence graph model spanning multiple domains.

When specifying the target system, the analyst selects the components that are most appropriate for the goals of her/his performance study. As discussed in Section 3.4, model composition is facilitated by task execution descriptions (TEDs), which characterize the execution of components of the task graph on particular hardware domain components, e.g., processors, memory hierarchies, and interconnection networks. It also relies on the models being implemented as compositional objects and on data mediation methods. The performance knowledge base of the POEMS database may provide guidance to the analyst in the selection of components. The compiler for the POEMS specification language will access the specified component models from the POEMS database, when appropriate component models are present, and will incorporate them into the system model.

3.1 Component Model Composition and Evaluation Tool Integration

As noted earlier, POEMS composes its system models from component models of different types, ranging from simple analytical models such as LogGP to detailed simulation models of instructions executing on modern processor/cache architectures. Each component model can be characterized as a "document" that carries an external interface defining its properties and behavior. Successful composition requires that this interface specify the properties and behavior of the component models with which the given component model can be composed successfully.

Evaluation of the multi-paradigm system models requires interfacing and integration of evaluation tools for each component model type. Each evaluation tool can be characterized as a processor that "understands" and acts upon information supplied for the component models it evaluates; the input to each tool must be in the type set it "understands."

Figure 3.2. Multi-Domain Dependence Graph


Interfacing and integration of tools requires that the output of one tool be the input of another tool. Successful interfacing and integration of tools, each of which "understands" a different type set, requires that the output of a source tool (which will be in the type set it "understands") be mapped to the type set "understood" by its target tool. The information necessary to support these mappings must be defined in the external interfaces from which the tools are invoked.

Manual composition of component models to create system models is now accomplished on an ad hoc basis by humans reading documentation and applying their experience and knowledge to select component models with properties and behaviors such that the components can be integrated. Manual interfacing of evaluation tools that utilize different representations of components is now accomplished by manual specification of mappings from outputs to inputs and hand coding of these mappings. Design and implementation of these mappings is often a very laborious and error-prone process, sufficiently so that these tasks are seldom attempted. A major reason for the difficulty of these tasks is that there is no consistent specification language in which to describe the properties and behaviors of either components or tools. The human system model composer and evaluation tool integrator must resolve the ambiguities and inconsistencies of mappings among languages with different semantics on an ad hoc, case-by-case basis. Automation of component model composition and tool interfacing and integration will require that the interfaces of component models and evaluation tools be specified in a common representation with semantics sufficiently precise to support automated translations.

The POEMS Specification Language (PSL) [16] is the representation used in POEMS to specify the information necessary to support composition of component models and interfacing and integration of evaluation tools. PSL is an interface specification language. Components and tools are treated as objects, and each is encapsulated with a PSL-specified interface. PSL-specified interfaces, called associative interfaces, are a common language for expressing both component model composition and mappings across evaluation tool interfaces. An object encapsulated with an associative interface is called a compositional object. Associative interfaces and compositional objects are defined and described in Section 3.3.2. An example of PSL specifications is given in Section 3.3.4.

A compiler for PSL-specified interfaces that has access to a library of component models and evaluation tools can automatically compose a set of component models selected by a performance engineer into a system model and generate the mappings among the outputs and inputs of the evaluation tools. The compiler for PSL programs will generate an instance of the generalized hierarchical dependence graph defined in Section 3.2. A feasibility demonstration prototype of a PSL compiler has been developed [16] and a more capable version of the PSL compiler is under development.

PSL-specified associative interfaces also facilitate manual composition and integration processes since they provide a systematic framework for formulation of component model and evaluation tool interfaces. The applications of the POEMS methodology to performance studies of Sweep3D given in Sections 5 and 6 are manually executed examples of the model composition and evaluation tool interfacing methods and concepts defined and described in this section and in Sections 4 and 5.

3.2 General Model of Parallel Computation

Given a PSL program, the PSL compiler generates an instance of a generalized hierarchical dependence graph. This model of parallel computation and its use in POEMS system modeling are described in this section, after introducing some basic definitions and terminology.

3.2.1 Definitions and Terminology

Generalized Hierarchical Dependence Graph: A general graphical model of computation in which each node represents a component of the computation and each edge represents a flow of information from node to node. The graph is hierarchical in that each node can itself be an interface to another generalized dependence graph. Each edge in such a graph may represent either a dataflow relationship or a precedence relationship. For each node, an extended firing rule (that may include local and global state variables in addition to input data values) specifies when each node may begin execution. For a precedence edge, the node at the head of the edge may not begin execution until the node at the tail has completed.

Dynamic Task Graph: An acyclic, hierarchical dependence graph in which each node represents a sequential task, precedence edges represent control flow or synchronization, and dataflow edges (called communication edges) represent explicit data transfers between processes.

Static Task Graph: A static (symbolic) representation of the possible dynamic task graphs of a program, in which each node represents a set of parallel tasks and each edge represents a set of edge instances. Unlike the dynamic task graph, this graph includes loop and branch nodes to capture logical control flow (and hence the graph may contain cycles).

3.2.2 Dependence Graph Model of Parallel Systems

The POEMS representation of a parallel system is a hierarchical dependence graph in which the nodes are instances of compositional objects and the edges represent flows of information from node to node. As shown in Figure 3.2, the nodes may be defined in different domains, and a node in one domain may invoke nodes in both its own and implementing domains. For example, the implementing domains for an application are the OS/runtime system and hardware domains.

The graph model is executable, where an execution represents a model solution for the specified application software and system configuration. The nodes and edges may be instantiated during execution, thus defining a dynamic instance of the graph that captures the behavior of the entire system during the execution being modeled. This dynamic graph may or may not be constructed explicitly at runtime, depending on the specified solution methods.

A node executes when the computation reaches a state in which its associated "firing rule" evaluates to true. A firing rule is a conditional expression over the state of the input edges and the local state of a node. Each edge has associated with it a data type specification, which is called a transaction. Clock-driven execution is obtained by adding to the data dependence relationships (and specifications of the firing rules for the nodes) an edge between each source-sink node pair carrying the current time of each source node. Each node executes the Lamport [23] distributed clock algorithm to determine the current time.
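These execution semantics can be made concrete with a small sketch. The following Python fragment is a minimal illustration of firing-rule-driven, dataflow-style execution, not the POEMS runtime; the node names and the default firing rule (one pending transaction on every input edge) are illustrative assumptions.

    from collections import deque

    class Node:
        """A dependence-graph node: fires when its firing rule over the state of
        its input edges evaluates to true, then emits transactions on its output edges."""
        def __init__(self, name, needs, action):
            self.name = name
            self.needs = needs                       # input edge names required to fire
            self.pending = {e: deque() for e in needs}
            self.action = action                     # callable(inputs) -> [(node, edge, txn)]

        def ready(self):                             # default firing rule
            return all(self.pending[e] for e in self.needs)

    def run(seed):
        work = deque(seed)                           # (target node, edge name, transaction)
        while work:
            node, edge, txn = work.popleft()
            node.pending[edge].append(txn)
            if node.ready():
                inputs = {e: node.pending[e].popleft() for e in node.needs}
                work.extend(node.action(inputs))

    # Example: a recv -> k-block -> send chain, echoing the per-processor tasks of Figure 3.3.
    send = Node("send", ["data"], lambda ins: [])
    kblk = Node("k-block", ["flux"], lambda ins: [(send, "data", ins["flux"] + 1)])
    run([(kblk, "flux", 0)])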

3.2.3 Common Application Representation

The POEMS application representation is designed to provide workload information at various levels of abstraction for both analytical and simulation models. More specifically, the representation is designed to meet four key goals [3]. First, it should provide a common source of workload information for the various modeling techniques envisaged in POEMS. Second, the representation should be computable automatically using parallelizing compiler technology. Third, the representation should be concise and efficient enough to support modeling terascale applications on very large parallel systems. Finally, the representation should be flexible enough to support performance prediction studies that can predict the impact of changes to the application, such as changes to the parallelization strategy, communication, and scheduling. The design of the representation is described in more detail in [3], and is summarized briefly here.

The application representation in POEMS is based on a combination of static and dynamic task graphs. As defined earlier, the static task graph (STG) is a compact, symbolic graph representation of the parallel structure of the program. The nodes represent computational tasks, CPU components of communication, or control flow. The graph also includes extensive symbolic information in order to capture symbolically the structure of the parallel computation as a function of parameters such as the number of processors and relevant program input variables. For example, each static task node contains a symbolic integer set that describes the set of task instances that are executed at runtime (e.g., the set {[i] : 0 ≤ i ≤ P−1} can be used to denote a task node with P instances). Each edge between parallel task nodes contains a symbolic integer mapping that describes the edge instances connecting task instances (e.g., the mapping {[i] → [j] : 1 ≤ i ≤ P−1 ∧ j = i+1} denotes P−1 edge instances connecting instance i of the node at the tail to instance i+1 of the node at the head). Finally, control-flow nodes contain symbolic information about loop bounds or branching conditions. Together, this information enables the STG to capture the parallel structure of the program while remaining independent of program input values. In addition, each computational task also has an associated scaling function describing how the computational work for the task scales as a function of relevant program variables. For example, this information can be used to support simple extrapolation of performance metrics as a function of input size.
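Concretely, binding the symbolic parameter P instantiates these sets. The fragment below is a minimal sketch using direct enumeration in place of the dHPF integer-set framework; the function names are illustrative.

    def task_instances(P):
        # {[i] : 0 <= i <= P-1} -- the P instances of one static task node
        return [i for i in range(P)]

    def edge_instances(P):
        # {[i] -> [j] : 1 <= i <= P-1 and j = i+1} -- P-1 pipeline edges
        return [(i, i + 1) for i in range(1, P)]

    print(task_instances(4))   # [0, 1, 2, 3]
    print(edge_instances(4))   # [(1, 2), (2, 3), (3, 4)]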

The dynamic task graph, which is instantiated from the static task graph, is a directed acyclic graph that captures the precise parallel structure of an execution of the application for a given input [1]. The dynamic task graph is important for supporting detailed and precise modeling of parallel program behavior. For many modeling techniques, however, the dynamic task graph need not be instantiated explicitly; the same information can be obtained by traversing the static task graph at model runtime. This is a key capability already available in the graph-based runtime system on which the POEMS implementation will be based.

3.2.4 Sweep3D Task Graph

The task graph concepts are illustrated by the static task graph for the sweep phase of Sweep3D, shown on the left-hand side of Figure 3.3. This task graph was generated manually and is a simplified version of the graph that would be generated using the task graph synthesis techniques in the dHPF compiler, described in Section 4.1. (The actual static task graph differs mainly in that there are additional small computational tasks between the communication tasks, and the k-block actually consists of several computation and a few control-flow nodes.)


The right-hand side of Figure 3.3 shows the dynamic communication pattern of the sweep phase that would be realized assuming a 3×3 processor grid. This shows four of the eight octants in each time step of Sweep3D, and the computation and communication tasks performed by each processor for each octant can be seen on the left. An illustration of the dynamic task graph and a more detailed description can be found in [3].

3.3 Modeling Using Compositional Objects

The interfaces of compositional objects carry sufficient information to enable compiler-based integration of multi-domain, multi-paradigm, multi-resolution component models into a system model. Section 3.3.1 defines compositional objects. Section 3.3.2 defines the interfaces of compositional objects. Section 3.3.3 describes how compositional objects are incorporated into the dependence graph model of computation, and Section 3.3.4 illustrates compositional modeling with an example from Sweep3D.

3.3.1 Compositional Objects

The POEMS Specification Language and programming environment enable creation of performance models as instances of the general dependence graph model of parallel computation. The nodes of the graph are instances of compositional objects that represent components. Compositional objects are defined in the context of an application semantic domain, an OS/runtime semantic domain, and/or a hardware semantic domain. The properties of objects are defined in terms of the attribute set of the appropriate semantic domain.

Each component is specified by a set of attributes. A component model is an instantiation of a component. There may be many instances of a given component model, each with the same set of attributes but with different values bound to these attributes.

Figure 3.3. Static Task Graph for Sweep3D and the Dynamic Communication Structure on a 3×3 Processor Grid.


3.3.2 Associative Interfaces

An associative interface is an extension of the associative model of communication [11,12] used to specify complex, dynamic interactions among object instances. An associative interface specifies the functions it implements, domain-specific properties of the functions it implements, the functions it requires, and the domain-specific properties of the functions it requires.

An associative interface has two elements: an "accepts" interface for the services that the component model implements, and a "requests" interface that specifies the services the component model requires. Interfaces are specified in terms of the attributes that define the behavior and the states of standard objects [13,31,33]. An object that has associative interfaces is said to be a compositional object, and an object that interacts through associative interfaces is said to have associative interactions.

An "accepts" interface consists of a profile, a transaction, and a protocol. A profile is a set of name/value pairs over the attributes of a domain. An object may change its profile during execution. A transaction is a type definition for a parameterized unit of work to be executed. A protocol specifies a mode of interaction, such as call-return or dataflow (transfer of control), and/or a sequence of elementary interactions. A "requests" interface consists of a selector expression, a transaction, and a protocol. A selector expression is a conditional expression over the attributes of a domain. The selector references attributes from the profiles of target objects. When evaluated using the attributes of a profile, a selector is said to match the profile whenever it evaluates to true. A match that causes the selector to evaluate to true selects an object as the target of an interaction. The parameters of the transaction in the match should either conform, or the POEMS component model library must include a routine to translate the parameters of the transaction in the requests interface instance to the parameters of the transaction in the matched accepts interface and vice versa. A compositional object may have multiple accepts and requests in its associative interface. Multiple accepts arise when a component model implements more than one behavior. A component model may have a request instance for a service from an implementing domain and a request instance for continuation of execution in its own domain.
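A small sketch may help fix the mechanics of matching. The fragment below is a hypothetical data model, not PSL or the PSL compiler: a selector is a predicate over profile attributes, and a successful match with a conformable transaction corresponds to creating a dependence-graph edge.

    def matches(selector, profile):
        # selector: callable over a profile dict of attribute name/value pairs
        return selector(profile)

    def conformable(txn_a, txn_b):
        # A crude stand-in for parameter conformance checking.
        return txn_a["name"] == txn_b["name"] and len(txn_a["params"]) == len(txn_b["params"])

    accepts = {
        "profile": {"domain": "runtime", "evaluation mode": "simulation[MPI-Sim]"},
        "transaction": {"name": "MPI-send", "params": ["destination", "message", "mode"]},
        "protocol": "call-return",
    }
    request = {
        "selector": lambda p: p.get("domain") == "runtime"
                              and p.get("evaluation mode") == "simulation[MPI-Sim]",
        "transaction": {"name": "MPI-send", "params": ["destination", "message", "mode"]},
    }

    if matches(request["selector"], accepts["profile"]) and \
       conformable(request["transaction"], accepts["transaction"]):
        print("edge created: request -> accepts, protocol =", accepts["protocol"])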

3.3.3 Mapping of Compositional Objects to Dependence Graphs

POEMS compositional objects are defined by encapsulating "standard objects" [13,31,33] with associative interfaces and data-flow execution semantics. Figure 3.4 shows the structure of a POEMS compositional object. The edges of the generalized dependence graph defined in Section 3.2 are derived by matching requests interfaces with accepts interfaces. The selector must evaluate to true and the transaction must be conformable for an edge to be created. Transactions are conformable if the parameters of the transaction can be mapped to one another. The requirement for the mapping of parameters arises when component models in a system model either are defined at different levels of resolution or use different paradigms for evaluation. The POEMS compiler uses a library of domain-aware type coercions to implement mappings when required. The firing rule for a component model is derived by requiring that all of the parameters in the transaction in the interface be present on its input edges. Requests and accepts interfaces can be matched when a system model is compiled or at runtime.

The matching of selectors in requests to profiles in accepts thus composes a dynamic data flow graph and controls the traversal of the graph that models the execution behavior of the system.

Note that a component model at the application or OS/runtime level may have dependence arcs to and from its own and implementing levels.

3.3.4 Illustration of Compositional Modeling for Sweep3D

As an example of compositional modeling, the specifications of two of the component models that are composed to create an end-to-end model of Sweep3D are illustrated in Figures 3.5 and 3.6. The target hardware system is an IBM SP/2, except that the memory hierarchy of the 604e processors is modeled as being different from that of the SP/2.

Figure 3.4. A Node of a Dependence Graph as a Compositional Object.


(This modification is particularly relevant because measurements have shown that Sweep3D utilizes only about 20%-25% of the available cycles on a single processor in high-performance systems [35].)

This example assumes that the Sweep3D application being modeled is represented by the task graph of Figure 3.3, with "recv" nodes, "k-block" nodes, and "send" nodes. It also is assumed that the analyst has specified that execution of the compute operations be modeled using detailed instruction-level simulation, including a detailed simulation of memory access, and that the execution of communication and synchronization operations be modeled using MPI simulation.

This example involves the application domain, the OS/runtime domain, and the hardware domain. (In the interest of brevity and clarity, the syntax is a simplified form of the actual syntax of PSL.) In the example that follows, the PSL interfaces for a "k-block" node of the dependence graph and the first "send" node after a "k-block" node are specified. In the specification for the "k-block" component model, the profile element "domain = application[Sweep3D]" specifies that this component is from the Sweep3D application. "Evaluation mode = simulation[SimpleScalar]" specifies that the evaluation of this component will be done by the SimpleScalar simulator. In this case the "k-block" is a sequence of instructions to be executed by the SimpleScalar simulator. The PSL compiler composes these component model specifications into an instance of the generalized hierarchical dependence graph defined in Section 3.2.1.

The accepts interface for the compositional object representing the k-block node is straightforward. The k-block node has been generated by the task graph compiler and belongs to the application domain. The identifier "k-block" specifies a set of instructions that are to be executed. The requests interface for the k-block node has two request instances: one for the implementing (hardware) domain for the k-block, and a second request instance continuing the flow of control to the next send node in the application domain.

The first request in the requests interface selects the SimpleScalar simulator to execute the instructions of the k-block node and evaluate the execution time of the code for the k-block node. The transaction associated with the first selector invokes an execute-code-block entry point defined for SimpleScalar. The protocol for this selector is call-return, since SimpleScalar must complete its execution and return control to the k-block node before the k-block node can transfer the computed data to the send node and send node execution can be initiated.

Accepts:
  Profile = { domain = application[Sweep3D], node type = k-block, evaluation mode = simulation[SimpleScalar] };
  Transaction = { k-block(k-block-code) };
  Protocol = dataflow;

Body:
  /* Invocation of the SimpleScalar simulator to execute the instructions that comprise the computation of this k-block. The instructions for the k-block (and the other nodes) have been previously loaded into SimpleScalar's data space as the memory contents of the processor/memory pair being simulated. Initialization of SimpleScalar is done using an initialization component in the POEMS database. The parameter of the transaction is a pointer to the base address in memory for the instructions. */

Requests:
  Selector = { domain = hardware[processor/memory], node type = N/A, evaluation mode = simulation[SimpleScalar] };
  Transaction = { execute_code_block(k-block-code) };
  Protocol = call-return;

  Selector = { domain = Application[Sweep3D], node type = send, evaluation mode = simulation[MPI-Sim] };
  Transaction = { MPI-send(destination, message, mode) };
  Protocol = dataflow;

Figure 3.5. Specification of Compositional Objects: an Example Using the Representation of a "k-block" Node of the Sweep3D Task Graph.

Accepts:
  Profile = { domain = application[Sweep3D], node type = send, evaluation mode = simulation[MPI-Sim] };
  Transaction = { MPI-send(destination, message, mode) };
  Protocol = dataflow;

Body:
  /* Invocation of the MPI-Sim simulator to simulate the modified MPI send operation specified in the transaction. */

Requests:
  Selector = { domain = runtime, node type = N/A, evaluation mode = simulation[MPI-Sim] };
  Transaction = { MPI-send(destination, message, mode) };
  Protocol = call-return;

  Selector = { domain = Application[Sweep3D], node type = send, evaluation mode = simulation[MPI-Sim] };
  Transaction = { MPI-send(destination, message, mode) };
  Protocol = dataflow;

Figure 3.6. Specification of Compositional Objects: an Example Using the Representation of a "send" Node of the Sweep3D Task Graph.


SimpleScalar has been encapsulated in the POEMS library as a single-node dependence graph in the hardware domain.

The second selector in the requests interface selects the send node immediately following the k-block instance in the task graph to continue execution of the system model. The transaction in this request invokes the MPI-Sim simulator to evaluate the execution time of the MPI-send function defined in MPI-Sim. The protocol is the "dataflow" protocol for data-flow graphs; in this case, control follows the data along the arc that is defined by the matching of the request and accepts interfaces. Note that even though MPI-Sim is in the runtime domain, we have assigned the send node to the application domain. This choice was made because the send node is constructed as a part of the application by the task graph compiler (see Section 4.1 for a description of the task graph compiler). This is the most direct translation of the task graph into a data-flow graph. MPI-Sim is encapsulated in the POEMS component model library as a single-node graph. MPI-Sim is invoked from the application send node with call-return semantics. This encapsulation of the runtime-domain MPI-Sim node in the application node allows the send node to pass control to its successor send node as soon as the MPI-Sim simulator has completed execution.

For this example, the representation of the first "send" node following the k-block node in the task graph for Sweep3D is given in Figure 3.6. As shown, the profile in the accepts interface of the send component model specifies that it derives from the Sweep3D application, is a "send" node, and is to be evaluated using MPI-Sim. The transaction specifies an entry point of MPI-Sim.

The selector in the first request instance matches the MPI-Sim simulator, and the transaction specifies the entry point for the MPI-send operation. The protocol is call-return because MPI-Sim must complete execution and return control to the application before this send node can transfer control to the next send node. The selector of the second request instance identifies the next node to which control should be transferred.

The accepts and requests interfaces of each compositional object or task graph node direct the search of the POEMS database by the POEMS compiler to identify appropriate component models to be used to instantiate components and link them to generate this segment of the system model. (The current version of the POEMS Specification Language compiler [16] does not yet interface to the POEMS database; it accesses component models from a Unix file system.) This database search provides a link to the code that implements an instance of the specified component model. In this case, the code that implements the k-block component model is the SimpleScalar simulator and the code that implements the send node is MPI-Sim. (See Section 4 for a description of SimpleScalar and MPI-Sim.)

Figure 3.7. Overview of the POEMS Database.


The accepts and requests interfaces, the link to the code implementing the instance of a component model, and other parameters associated with the component model will be stored in a Task Execution Description (TED) (see Section 3.4) in the POEMS database.

To execute the specified system model, SimpleScalar and MPI-Sim execute as processes on a host processor. SimpleScalar runs as the executing program for which MPI-Sim is modeling communication. SimpleScalar takes as input the executable file of the k-block node, which is stored in simulated memory. At the conclusion of executing the "k-block" node, the POEMS environment invokes the MPI-send module of MPI-Sim. A special built-in interface procedure that links SimpleScalar and MPI-Sim copies the data to be transmitted from the simulated memory of SimpleScalar into the real memory of the host, which allows MPI-Sim to model the communication operation.
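The data mediation step can be pictured with a short sketch. The following fragment is a minimal illustration of the copy-out-then-send idea; the real coupling is a built-in C procedure inside the simulators, and the function names here are hypothetical.

    def mediate_send(simulated_memory: bytes, base: int, length: int, mpi_send):
        # Copy `length` bytes at `base` from the simulated address space into
        # host memory, then let the communication simulator model the MPI send.
        payload = simulated_memory[base:base + length]
        return mpi_send(payload)

    # Example with stand-in objects.
    image = bytes(range(16))
    print(mediate_send(image, base=4, length=4, mpi_send=lambda p: f"sent {p.hex()}"))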

3.4 Task Execution Descriptions (TEDs)

Considerable information is needed to characterize the behavior and properties of each component, and different instances of each component model will have different attribute values. For example, there may be two instances of a component, one that is analytically evaluated and one that is evaluated by simulation. Analytical modeling may require parameter values that specify task execution time, while simulation of a single task requires either an executable representation of the task or its memory address trace. As a result, these two instances of the same component require different modes of integration into a system model. The information necessary to accomplish these integration functions must be associated with each component instance. POEMS will use a Task Execution Description (TED) to describe the modeled execution of a task; a TED is associated with each node of a task graph.

In addition, a TED contains the attributes required to define the method used to model single-task execution. The methods that are being evaluated for simulating individual tasks are instruction-driven, execution-driven, and trace-driven simulation. For example, a TED would define the input parameters for SimpleScalar that would enable the creation of a particular instantiation of the SimpleScalar component model.

3.5 Performance Recommender

The POEMS Performance Recommender system facilitates the selection of computational parameters for widely used algorithms to achieve specified performance goals. For example, in the Sweep3D context, the system is used to obtain the parameters of the algorithm (e.g., grid size, spacing, scattering order, angles, k-blocking factor, convergence test), system (e.g., I/O switches), and machine (e.g., number and configuration of processors).

Capturing the results of system measurement as well as modeling studies (discussed in Section 5), this facility can provide insight into how inter-relationships among variables and problem features affect application performance. It functions at several levels, ranging from the capture of analytical and simulation model results to those of the measured application.

POEMS is using a kernel (IFESTOS), developed at Purdue [29], that supports the rapid prototyping of recommender systems. IFESTOS abstracts the architecture of a recommender system as a layered system with clearly defined subsystems for problem formulation, knowledge acquisition, performance modeling, and knowledge discovery. The designer of the recommender system first defines a database of application classes (problems) and computation class instances (methods). The data acquisition subsystem generates performance data by invoking the appropriate application (e.g., Sweep3D). The performance data management subsystem provides facilities for the selective editing, viewing, and manipulation of the generated information. Performance analysis is performed by traditional attribute-value statistical techniques, and "mining" this information produces useful rules that can be used to drive the actual recommender system. This approach has been demonstrated successfully for problem domains in numerical quadrature and elliptic partial differential equations [29]. Currently it is being applied to the Sweep3D application.

3.6 POEMS Database

The volume and complexity of the data environment for POEMS make the POEMS database a critical component of the project. In fact, POEMS as a tool could not be built without a database as a searchable repository for a wide spectrum of model and performance data. The POEMS database will be the repository for:

a) The component model definitions as compositional objects, and instances of the component models.

b) Static task graphs for the applications, generated by the extended dHPF compiler.

c) The Task Execution Descriptions, which characterize each component model, as discussed in Section 3.4.

d) The knowledge base, which will guide analysts and designers in the development of total systems.

e) Measurements of performance in several formats, including measurements and predictions of execution time.

The POEMS Specification Language compiler will be interfaced to the database, as will the Performance Recommender. Figure 3.7 is a schematic of the role of the database in POEMS.



4 THE INITIAL POEMS PERFORMANCE ANALYSIS TOOLS

Currently, several performance analysis tools are being integrated in POEMS. These include the following tools, which are described briefly in this section: an automatic task graph generator [3], the LogGP [5] and LoPC [18] analytic models, the MPI-Sim simulator [28], and the SimpleScalar instruction-level processor/memory architecture simulator [14]. Each tool contains capabilities for modeling the application, OS/runtime, and hardware domains. Together, these tools provide the capability to analyze, with a fairly high degree of confidence, the performance of current and future applications and architectures, for very large and complex system configurations. As a prerequisite to developing multi-paradigm models, each of these tools has been used to develop a single-paradigm, multi-domain model of Sweep3D. This has allowed us to understand the unique role that each paradigm can play in total system performance analysis.

4.1 Automatic Task Graph Generator

A comprehensive performance modeling environment like POEMS will be used by system designers in practice only if model construction and solution can be largely automated. For an existing application, one critical step towards this goal is to automatically construct the application representation described in Section 3.2. Data-parallel compiler technology from the dHPF compiler project [2] has been extended to compute the task-graph-based application representation automatically for High Performance Fortran (HPF) programs. In normal use, the dHPF compiler compiles a given HPF program into an explicitly parallel program in SPMD (Single Program Multiple Data) form, for message-passing (MPI), shared-memory (e.g., pthreads), or software distributed shared-memory (TreadMarks [6]) systems. The synthesized task graphs represent this explicitly parallel program. In the future, the task graph synthesis will be extended to handle existing message-passing programs as well.

There are three key innovations in the compiler support for task graph construction:

• The use of symbolic integer sets and mappings: These are critical for capturing the set of possible dynamic task graphs as a concise, static task graph. Although this is a design feature of the representation itself, the feature depends directly on the integer set framework technology that is a foundation of the dHPF compiler [2]. The construction of these sets and mappings for an STG is described in more detail in [3].

• Techniques for condensing the static task graph: The construction of static task graphs in dHPF is a two-phase process [3]. First, an initial static task graph with fine-grain tasks is constructed directly from the dHPF compiler's internal program representation and analysis information. Second, a pass over the STG condenses fine-grain computational tasks (e.g., single statements) into coarse-grain tasks (e.g., entire loop iterations or even entire loop nests). The degree of granularity clearly depends on the goals of the modeling study, and can be controlled as discussed below.

• Integer set code generation techniques for instantiating a dynamic task graph for a given input: Although this step is conceptually performed outside the compiler, it can be done in a novel, highly-efficient manner for many programs using dHPF's capability of generating code to enumerate the elements of an integer set or mapping. This also is explained further below.

The goal of condensing the task graph is to obtain a representation that is accurate, yet of a manageable size. For example, it may be appropriate to assume that all operations of a process between two communication points constitute a single task, permitting a coarse-grain modeling approach. In this case, in order to preserve precision, the scaling function of the condensed task must be computed as the symbolic sum of the scaling functions of the component tasks, each multiplied by the symbolic number of iterations of the surrounding loops or by the branching probabilities for surrounding control flow, as appropriate. Note, however, that condensing conditional branches can introduce fairly coarse approximations in the modeling of task execution times. The compiler, therefore, takes a conservative approach and does not collapse branches by default. Instead, it is desirable to allow the modeler to intervene and specify that portions of the task graph can be collapsed even further (e.g., by inserting a special directive before control flow that can be collapsed). Even in the default case, however, the compiler can increase the granularity of tasks further in some cases by moving loop-invariant branches out of enclosing loops, a standard compiler transformation. For example, this would be possible for a key branch within the k-block of Sweep3D.
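As a minimal illustration of the condensing step, the sketch below computes the scaling function of a condensed task as the symbolic sum described above, using the sympy package; the variable names (w_body, w_fixup, p_fixup) are illustrative assumptions, not part of the dHPF implementation.

    import sympy as sp

    # Symbolic loop bounds, per-task costs, and a branching probability.
    it, jt, w_body, w_fixup, p_fixup = sp.symbols('it jt w_body w_fixup p_fixup')

    # Two fine-grain tasks inside an it x jt loop nest: the loop body itself,
    # and a conditional "fixup" taken with branching probability p_fixup.
    # The condensed task's scaling function is the symbolic sum of the
    # component scaling functions, weighted by iteration counts/probabilities.
    scaling_function = it * jt * (w_body + p_fixup * w_fixup)

    print(sp.expand(scaling_function))  # it*jt*w_body + it*jt*p_fixup*w_fixup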

Constructing the DTG for a given program input requires symbolic interpretation of the parallel structure and part of the control flow of the program. This interpretation must enumerate the loop iterations, resolve all dynamic instances of each branch, and instantiate the actual tasks, edges, and communication events. For many regular, non-adaptive codes, the control flow (loop iterations and branches) can be determined uniquely by the program input, so that the DTG can be instantiated statically. (Again, in some cases, a few data-dependent branches may have to be ignored for approximate modeling, under the control of the modeler as proposed above.) To instantiate parallel tasks or edges, the elements of the corresponding integer set or mapping must be enumerated. The key to doing this is that, for a given symbolic integer set, the dHPF compiler can synthesize code to enumerate the elements of that set [2]. (Any mapping can be converted into an equivalent set for code generation.) We exploit this capability and generate a separate procedure for each set, parameterized by the symbolic program variables that appear in the set; typically these are process identifiers, program input variables, and loop index variables. Then this generated code is compiled separately and linked with the program that performs the instantiation of the DTG. When instantiating dynamic task instances for a given static task, the code for that task's symbolic set is invoked and is provided with the current values of the symbolic parameters at this point in the interpretation process. This directly returns the required index values for the task instances. Edge instances of a static edge are instantiated in exactly the same manner, from the code for the integer set mapping of that static edge.
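The following sketch suggests what a generated enumeration procedure and its use during DTG instantiation might look like; the function name, the integer set, and the parameters are hypothetical stand-ins for the code dHPF would synthesize, not the compiler's actual output.

    # Hypothetical enumeration procedure for one static task whose symbolic
    # integer set is {(p, i, j) : 0 <= p < nprocs, 1 <= i <= it, 1 <= j <= jt}.
    def enumerate_task_instances(nprocs, it, jt):
        for p in range(nprocs):
            for i in range(1, it + 1):
                for j in range(1, jt + 1):
                    yield (p, i, j)

    # During instantiation, the interpreter invokes the generated procedure
    # with the current values of the symbolic parameters and creates one
    # dynamic task node per returned index value.
    dynamic_tasks = list(enumerate_task_instances(nprocs=4, it=2, jt=3))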

In some irregular and adaptive programs, the control flow may depend on intermediate computational results of the program, and the DTG would have to be instantiated dynamically using an actual or simulated program execution. The DTG for a given input can be either synthesized and stored offline for further modeling with any model, or instantiated on the fly during model execution for modeling techniques such as execution-driven or instruction-driven simulation. Both these approaches will be key for multi-paradigm modeling of advanced adaptive codes.

The automatic construction of the static task graph has been exploited directly in integrating a task-graph-based model with MPI-Sim for improving the scalability of simulation, as described in Section 6.1. The automatic construction of the dynamic task graph makes it possible to do automatic analytical modeling of program performance, using deterministic task graph analysis [1].

4.2 LogGP/LoPC

The task graph for a given application elucidates the principal structure of the code, including the inter-processor communication events, from which it is relatively easy to derive the LogGP or LoPC model equations. The approach is illustrated by deriving the LogGP model of the Sweep3D application that uses blocking MPI send and receive primitives for communication. The task graph for the sweep (or main) phase of this application is given in Figure 3.3.

The LogGP model of the Sweep3D application is given in Figure 4.1. The hardware domain is modeled by three simple parameters: L, G, and Wg, which are defined below.

Send (message length < 4KB) = o    (1a)

Send (message length ≥ 4KB) = os + L + os + os + L + ol    (1b)

Receive (message length < 4KB) = o    (2a)

Receive (message length ≥ 4KB) = os + L + ol + (message_size × Gl) + L + ol    (2b)

Total_Comm (message length < 4KB) = o + (message_size × G) + L + o    (3a)

Total_Comm (message length ≥ 4KB) = os + L + os + os + L + ol + (message_size × Gl) + L + ol    (3b)

Wi,j = Wg × mmi × mk × it × jt    (4)

StartPi,j = max(StartPi−1,j + Wi−1,j + Total_Comm + Receive, StartPi,j−1 + Wi,j−1 + Send + Total_Comm)    (5)

T5,6 = StartP1,m + 2[(W1,m + SendE + ReceiveN + (m−1)L) × #k-blocks × #angle-groups]    (6)

T7,8 = StartPn−1,m + 2[(Wn−1,m + SendE + ReceiveW + ReceiveN + (m−1)L + (n−2)L) × #k-blocks × #angle-groups] + ReceiveW + Wn,m    (7)

T = 2(T5,6 + T7,8)    (8)

Figure 4.1. LogGP Model of MPI Communication and the Sweep3D Application.

Page 14: POEMS: End-to-End Performance Design of Large …pages.cs.wisc.edu/~vernon/papers/poems.00tse.pdfTo appear in IEEE Trans. on Software Engr., Nov. 2000. 1 POEMS: End-to-End Performance

The first six equations, (1a) through (3b), model the runtime system components used by the Sweep3D application. That is, these equations model the MPI communication primitives as they are implemented on the SP/2 that was used to validate the model. Equations (4) through (8) model the execution time of the application as derived from the dynamic task graph.

Equations (1a) through (3b) reflect a fairly precise, yet simple, model of how the MPI-send and MPI-receive primitives are implemented on the SP/2. Information about how the primitives are implemented was obtained from the author of the MPI software. For messages smaller than four kilobytes, the cost of a send or receive operation is simply the LogGP processing overhead (o) parameter. The total communication cost for these messages (equation 3a) is the sum of the send processing overhead, the message transmission time (modeled as the network latency (L), plus the message size times the gap per byte (G) parameter), and the receive processing overhead.¹ A subscript on the processing overhead parameter denotes the value of this parameter for messages smaller (os) or larger (ol) than one kilobyte. When the subscript is omitted, the appropriate value is assumed. For messages larger than four kilobytes (equation 3b), an additional handshake is required. The sending processor sends a short message to the receiving processor, which is acknowledged by a short message from the receiving processor when the receive system call has been executed and the buffer space is available to hold the message. After that, the message transfer takes place. In this case, the Send cost (1b) or Receive cost (2b) is the duration of the communication event on the processor where the corresponding MPI runtime system call occurs. Further details about the accuracy of these communication models and how the parameter values were measured are given in [36].
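For concreteness, the sketch below evaluates equations (1a) through (3b); the parameter names mirror the LogGP parameters above, and the values passed in would come from measurements such as those described in [36]. The code is an illustrative sketch, not part of the POEMS tools.

    # Communication cost for one blocking MPI send/receive pair, per
    # equations (1a)-(3b). Times in microseconds; G and G_l in us/byte.
    def total_comm(msg_bytes, o, o_s, o_l, L, G, G_l):
        if msg_bytes < 4096:
            # Equation (3a): send overhead + transmission + receive overhead.
            return o + (msg_bytes * G) + L + o
        # Equation (3b): handshake (request, acknowledgment), then transfer.
        return o_s + L + o_s + o_s + L + o_l + (msg_bytes * G_l) + L + o_l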

The equations that model the MPI communication primitives might need to be modified for future versions of the MPI library, or if Sweep3D is run on a different message-passing system or is modified to use non-blocking MPI primitives. The equations illustrate a general approach for capturing the impact of such system modifications.

Equations (4) through (8) model the application execution time, taking advantage of the symmetry in the Sweep3D task graph (see Figure 3.3). For simplicity, the Sweep3D application model presented here assumes each processor in the m×n Sweep3D processor grid is mapped to a different SMP node in the SP/2. In this case, network latency, L, is the same for all (nearest-neighbor) communication in Sweep3D. As explained in [36], the equations that model communication can be modified easily for the case when 2×2 regions of the processor grid are mapped to the same SMP node.

¹ The communication structure of Sweep3D is such that we can ignore the LogGP gap (g) parameter, since the time between consecutive message transmissions is greater than the minimum allowed value of inter-message transmission time.

Equation (4) models the time required for a single task to compute the values for a portion of the grid of size mmi × mk × it × jt. In this equation, Wg is the measured time to compute one grid point, and mmi, mk, it, and jt are the Sweep3D input parameters that specify the number of angles and grid points per block per processor.

Equation (5) models the precedence constraints in the task graph for the sweeps for octants 5 and 6, assuming the processors are numbered according to their placement in the two-dimensional grid, with the processor in the upper left being numbered (1,1). Specifically, the recursive equation computes the time that processor pi,j begins its calculations for these sweeps, where i denotes the horizontal position of the processor in the grid. The first term in equation (5) corresponds to the case where the message from the West is the last to arrive at processor pi,j. In this case, due to the blocking nature of the MPI primitives, the message from the North has already been sent but cannot be received until the message from the West is processed. The second term in equation (5) models the case where the message from the North is the last to arrive. Note that StartP1,1 = 0, and that the appropriate one of the two terms in equation (5) is deleted for each of the other processors at the east or north edges of the processor grid.
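The recurrence in equation (5) is straightforward to evaluate numerically. The sketch below does so under the simplifying assumption of a uniform task time W and precomputed Send, Receive, and Total_Comm values; it is a minimal illustration of the recurrence, not the POEMS implementation.

    # Evaluate StartP[i,j] per equation (5) for an n-wide by m-tall grid,
    # with the upper-left processor numbered (1, 1) and StartP[1,1] = 0.
    def start_times(n, m, W, send, receive, tot_comm):
        start = {(1, 1): 0.0}
        for j in range(1, m + 1):
            for i in range(1, n + 1):
                if (i, j) == (1, 1):
                    continue
                terms = []
                if i > 1:  # message arriving from the West
                    terms.append(start[(i - 1, j)] + W + tot_comm + receive)
                if j > 1:  # message arriving from the North
                    terms.append(start[(i, j - 1)] + W + send + tot_comm)
                start[(i, j)] = max(terms)  # edge processors drop one term
        return start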

The Sweep3D application makes sweeps across the processors in the same direction for each octant pair. The critical path time for the two right-downward sweeps is computed in equation (6) of Figure 4.1. This is the time until the lower-left corner processor p1,m has finished communicating the results from its last block of the sweep for octant 6. At this point, the sweeps for octants 7 and 8 (to the upper right) can start at processor p1,m and proceed toward pn,1. The subscripts on the Send and Receive terms in equation (6) are included only to indicate the direction of the communication event, to make it easier to understand why the term is included in the equation.

Since the sweeps for octants 1 and 2 (in the next iteration) will not begin until processor pn,1 is finished, the critical path for the sweeps for octants 7 and 8 is the time until all processors in the grid complete their calculations for the sweeps. Due to the symmetry in the Sweep3D algorithm, captured in the task graph, the time for the sweeps to the Northeast is the same as the total time for the sweeps for octants 5 and 6, which is computed in equation (7) of the figure. Due to the symmetry between the sweeps for octants 1 through 4 and the sweeps for octants 5 through 8, the total execution time of one iteration is computed as in equation (8) of Figure 4.1. Equation (6) contains one term [(m−1)L], and equation (7) contains two terms [(m−1)L and (n−2)L], that account for synchronization costs, as explained in [36].

The input parameters to the LogGP model derived above are the L, o, G, P, and Wg parameters. The first three parameters were derived by measuring the round-trip communication times for three different message sizes on the IBM SP system, and solving equations (3a) and (3b) with the derived measures (see [36] for details). The Wi,j parameter value was measured on a 2×2 grid of processors so that the work done by corner, edge, and interior processors could be measured. In fact, to obtain the accuracy of the results in this paper, Wi,j for each per-processor grid size was measured to account for differences (up to 20%) that arise from cache miss and other effects. Since the Sweep3D program contains extra calculations ("fixups") for five of the twelve iterations, Wi,j values for both of these iteration types were measured. Although this is more detailed than the creators of LogP/LogGP may have intended, the increased accuracy is substantial and needed for the large-scale performance projections in Section 5. In the future, other POEMS tools will be used to obtain these input parameters, as explained in Section 6.3.

The LogGP model of the SP/2 MPI communication primitives is shown to be highly accurate in [36]. Selected validations and performance projections of the LogGP model of Sweep3D are presented in Section 5.

4.3 MPI-Sim: Direct Execution-Driven System Simulation

POEMS includes a modular, direct execution-driven, parallel program simulator called MPI-Sim that has been developed at UCLA [10, 28]. MPI-Sim can evaluate the performance of existing MPI programs as a function of various hardware and system software characteristics that include the number of processors, interconnection network characteristics, and message-passing library implementations. The simulator also can be used to evaluate the performance of parallel file systems and I/O systems [8]. Supported capabilities include a number of different disk caching algorithms, collective I/O techniques, disk cache replacement algorithms, and I/O device models. The parallel discrete-event simulator uses a set of conservative synchronization protocols together with a number of optimizations to reduce the time to execute simulation models.

MPI-Sim models the application and the underlying system. An application is represented by its local code blocks and their communication requirements. The local code block model is evaluated by direct execution. MPI programs execute as a collection of single-threaded processes and, in general, the host machine has fewer processors than the target machine. (For sequential simulation, the host machine has only one processor.) This requires that the simulator support multithreaded execution of MPI programs. MPI-LITE, a portable library for multithreaded execution of MPI programs, has been developed for this purpose.

The MPI communication layer, which is part of the runtime domain, is simulated by MPI-Sim in detail; buffer allocation and internal MPI synchronization messages are taken into account. The simulator does not simulate every MPI call; rather, all collective communication functions are first translated by the simulator in terms of point-to-point communication functions, and all point-to-point communication functions are implemented using a set of core non-blocking MPI functions. Note that the translation of collective communication functions in the simulator is identical to how they are implemented on the target architecture. A preprocessor replaces all MPI calls by equivalent calls to corresponding routines in the simulator. The physical communications between processors, which are part of the hardware domain, are modeled by simple end-to-end latencies, similar to the communication latency parameter in the LogP model.
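As a hedged illustration of translating a collective operation into point-to-point functions (the general strategy described above, not MPI-Sim's actual code), the sketch below expresses a broadcast as a binomial tree of sends and receives:

    # Broadcast from `root` expressed purely in terms of point-to-point
    # operations; send(dst) and recv(src) are caller-supplied callables.
    def broadcast(rank, nprocs, send, recv, root=0):
        rel = (rank - root) % nprocs       # rank relative to the root
        mask = 1
        while mask < nprocs:
            if rel < mask:                 # this process already has the data
                peer = rel + mask
                if peer < nprocs:
                    send((peer + root) % nprocs)
            elif rel < 2 * mask:           # this process receives this round
                recv(((rel - mask) + root) % nprocs)
            mask <<= 1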

For many applications, these runtime domain and hardware domain communication models are highly accurate [10, 28]. MPI-Sim has been validated against the NAS MPI benchmarks and has demonstrated excellent performance improvement from parallel execution on these benchmarks [28]. As shown in Section 5, it also is highly accurate for the Sweep3D application, and the simulation results (which reflect fairly detailed modeling of the runtime domain) greatly increase confidence in the scalability projections of the more abstract LogGP model.

4.4 Instruction-Level Simulation

As stated above, the system model provided by MPI-Sim is based on direct execution of computational code and simulation of MPI communication. As a result, the processor/memory architecture of the target system to be evaluated must be identical to that of the host system.

To provide performance evaluation of applications on alternative (future) processor and memory subsystem designs, the POEMS component model library includes processor, memory, and transport component models that are evaluated using instruction-level, discrete-event simulation. These models are in the process of being composed with MPI-Sim to predict the performance of programs for proposed next-generation multiprocessor systems. The results of the more abstract task graph, LogGP/LoPC, and MPI-Sim analyses can be used to identify the most important regions of the design space to be evaluated with these, more detailed, hardware component models. The detailed hardware models can be used to validate the more abstract models, and can provide parameter values for future processor/memory architectures needed in the more abstract models.

In order to provide this kind of modeling capability, a simulator that models instruction-level parallelism is essential. The "sim-outorder" component of the SimpleScalar tool set [14] meets this requirement for simulating complex modern processor/memory architectures. As a result, it is being integrated in the POEMS environment. sim-outorder can be used to model state-of-the-art superscalar processors, which support out-of-order instruction issue and execution. The processor attributes of these hardware processor/memory subsystem component models include the processor fetch, issue, and decode rates, the number and types of functional units, and the defining characteristics of the branch predictor. Currently, its integrated memory components include level-one and level-two instruction and data caches, a translation-lookaside buffer, and main memory. The simulator is fairly portable and customizable with reasonable effort.

Currently, the MPI version of Sweep3D successfully executes on multiple instantiations of sim-outorder, each executed by a separate MPI process. This simulation produces estimates of the parameter value called Wi,j in the LogGP model, thus enabling the LogGP model to predict the performance of Sweep3D on alternative processor-cache architectures. This simulation capability also demonstrates a proof-of-concept for integrating the sim-outorder and MPI-Sim modeling tools. The future integration of MPI-Sim and sim-outorder will support detailed simulation of both the OS/runtime domain and the hardware domain.

5 APPLICATION OF POEMS TO PERFORMANCE ANALYSIS OF SWEEP3D

The goal of the POEMS performance modeling system is to enable complete end-to-end performance studies of applications executed on specific system architectures. To accomplish this, the system must be able to generate information that can be used to identify and characterize performance bottlenecks, analyze scalability, determine the optimal mapping of the application on a given architecture, and analyze the sensitivity of the application to architectural changes. This section presents a representative sample of the results obtained in modeling Sweep3D. These results are for the IBM SP system.

5.1 Model Validations

We first demonstrate the accuracy of the performance models described above. Both MPI-Sim and LogGP model Sweep3D accurately for a variety of application parameters such as the mk and mmi parameters that define the size of a pipelined block, different total problem sizes, and number of processors. Figures 5.1(a) and (b) present the measured and predicted execution time of the program for the 150³ and 50³ total problem sizes, respectively, as a function of the number of processors. For these results, the k-blocking factor (mk) is 10 and the angle-blocking factor (mmi) is 3. Both the MPI-Sim and LogGP models show excellent agreement with the measured values, with discrepancies of at most 7%. The next section shows that these two problem sizes have very different scalability, yet Figure 5.1 shows that the LogGP and MPI-Sim estimates are highly accurate for both problem sizes.

5.2 Scalability Analysis

It is important to application developers to determine how well an application scales as the number of processors in the system is increased. On today's systems, the user could conduct such studies by measuring the runtime of the application as the number of processors is increased to the maximum number of processors in the system, e.g., 128 (see Figure 5.2(a)). The figure clearly shows that the small problem size (50³) cannot efficiently exploit more than 64 processors. On the other hand, the larger problem size shows excellent speedup up to 128 processors. However, due to the system size limitation, no information about the behavior of larger problem sizes on very large numbers of processors is available to the user. Modeling tools enable users to look beyond the available multiprocessor systems. Figure 5.2(b) shows the projections of LogGP and MPI-Sim to hardware configurations with as many as 2500 processors. Although the application for the 150³ problem size shows good performance for up to 900 processors, it would not perform well on machines with a greater number of processors.

Figure 5.2 also demonstrates the excellent agreement between the analytical LogGP model and the MPI-Sim simulation model. Each model was independently validated for a variety of application configurations on as many as 128 processors, and the good cross-validation between the models for up to 2500 processors increases our confidence in both models.

Figure 5.3(a) demonstrates the projective capability of the simulation and analytical models. This figure shows the measured and estimated performance as a function of the number of processors, for a fixed per-processor grid size of 14×14×255. For this per-processor grid size the total problem size on 400 processors is approximately 20 million grid points, which is one problem size of interest to the application developers. Obviously, measurement is limited to the size of the physical system (128 processors). The maximum problem size that can be measured, with the per-processor grid size of 14×14×255, is 6.4 million cells. MPI-Sim can extend the projections to 1,600 processors (80 million cells) before running out of memory, because the simulator needs at least as much aggregate memory as the application it models. (In Section 6.1, we describe compiler-based techniques that eliminate the memory bottleneck for simulation of regular programs such as Sweep3D and thus greatly increase the system and problem sizes that can be simulated.) Finally, LogGP can take the projections to as many as 28,000 processors (i.e., for the given application configuration parameters, a 1.4 billion-cell problem size).

One important issue addressed in the figure is the validity of the model projections for very large systems. In particular, the close agreement between measurement and the MPI-Sim and LogGP models for up to 128 processors, and the close agreement between MPI-Sim and LogGP for up to 1,600 processors, greatly increases our confidence in the projections of MPI-Sim and LogGP for large problem sizes of interest to the application developers.

The mutual validation between the two different models is one key way in which simulation and analytical modeling complement each other. Furthermore, the two models have complementary strengths. The key strength of MPI-Sim is that it can be used to study program performance by users with little or no modeling expertise, and can be used to study the performance impact of detailed changes in the design of Sweep3D or in the implementation of the MPI communication primitives. The key strength of the LogGP model is that it can project the performance of design changes before the changes are implemented, and for the largest problem sizes of interest.

5.3 Application Mapping

For applications such as Sweep3D, which allow varying the degree of pipelining, it is often important to explore not only how many resources are needed to achieve good results, but also how best to map the application onto the machine. Figure 5.3(b) shows how LogGP explores the parameter space for the 20 million-cell Sweep3D problem on up to 20,000 processors. Here the k- and angle-blocking factors are varied, which results in various degrees of pipelining in the algorithm. The graph shows that for a small number of processors (fewer than 400) the choice of blocking factors is not very important. However, as more processors are used and the grid size per processor decreases, the degree of pipelining in the k-dimension has a significant impact, resulting in poor performance for the smallest blocking factor (mk=1). For this blocking factor, the performance of the application also is sensitive to the angle-blocking factor. However, when the k-blocking factor is properly chosen (mk=10), the impact of the mmi parameter value is negligible. The results also indicate that the optimal operating point for this total problem size is perhaps one or two thousand processors; increasing the number of processors beyond this number leads to greatly diminished returns in terms of reducing application execution time. This projected optimal operating point is a key design question that was unanswered prior to the POEMS analysis.

[Figure 5.1. Validation of LogGP and MPI-Sim. (a) Validation for problem size 150×150×150; (b) validation for problem size 50×50×50. Both panels plot measured, MPI-Sim, and LogGP execution times versus number of processors.]

[Figure 5.2. Projected Sweep3D Speedup. (a) Speedup for up to 128 processors; (b) speedup for up to 2500 processors, for the 50³ (LogGP) and 150³ (LogGP and MPI-Sim) problem sizes.]

5.4 Communication Requirements Analysis

Another issue of interest is the application's performance on a system with upgraded communication capability. A related issue is the application's performance on high-performance, tightly-coupled multiprocessor systems versus its performance on a network of loosely-coupled workstations. Both simulation and analytical models can be used to study these issues by modifying their communication components to reflect changes in the communication latency of the system.

As depicted in Figure 5.4, MPI-Sim was used in this way to show the impact of latency on the performance of the 20 million-cell Sweep3D. In this experiment, the latency was set to n times the latency of the IBM SP (denoted n×SP in the figure), for a range of values of n. The MPI-Sim projections shown in Figure 5.4(a) indicate that if the communication latency is 0, not much improvement is gained. In addition, these results show that for large numbers of processors, 400 and over, the application may perform well on a network of workstations, because even if the latency is increased to 50 times that of the SP, performance does not suffer significantly. However, as Figure 5.3(b) demonstrates, the performance of the 20 million-cell Sweep3D does not improve when more than 5000 processors are used. Since the grid size per processor diminishes, reducing the computation time, and latency does not seem to be a factor, one would expect performance gains to persist. The detailed communication component in the LogGP model provides an explanation. As can be seen in Figure 5.4(b), the computation time decreases, but the communication time remains flat and, more importantly, the synchronization costs resulting from blocking sends and receives dominate the total execution time as the number of processors grows to 20,000. This implies that modifications to the algorithm are necessary to effectively use large numbers of processors. In addition, the use of non-blocking communication primitives might be worth investigating. Again, these are key design results produced by the POEMS modeling and analysis effort.

[Figure 5.3. Projective Capabilities and Model Projections for Sweep3D, 20 Million Cell Problem. (a) Projective capability of analysis and simulation (mk=10, mmi=3): measured, MPI-Sim, and LogGP times versus number of processors. (b) Effect of pipeline parameters: LogGP projections for mk=1 and mk=10 with mmi=3 and mmi=6, on up to 25,000 processors.]

[Figure 5.4. Effect of Communications on Sweep3D, 20 Million Cell Problem. (a) Latency variation (MPI-Sim): execution time for latencies of 0, 1×, 10×, 50×, and 100× the SP latency on 100, 400, and 1600 processors. (b) Communication costs (LogGP): total, computation, communication, and synchronization times versus number of processors.]

6 RESEARCH IN PROGRESS - INTEGRATION OF MODELING PARADIGMS

One goal of the POEMS methodology is to facilitate the integration of different modeling paradigms. The use of compositional objects and the task graph abstraction as a workload representation together provide the framework needed for this integration. Compositional objects facilitate the use of different models for a particular system or application component. The task graph explicitly separates the representation of sequential computations (tasks) from inter-process communication or synchronization. This separation directly enables combinations of modeling paradigms where different paradigms are used to model various tasks as well as the parallel communication behavior. Other combinations (e.g., combining analysis and simulation to model the execution of a particular sequential task) can be performed with some additional effort, using task component models that are themselves composed of submodels. This section presents an overview of some of the multi-paradigm modeling approaches that are being designed and evaluated for inclusion in the POEMS environment.

6.1 Integrating Task Graph and Simulation Models

When simulating communication performance with a simulator such as MPI-Sim, the computational code of the application is executed or simulated in detail to determine its impact on performance. State-of-the-art simulators such as MPI-Sim use both parallel simulation and direct-execution simulation to reduce overall simulation time greatly, but these simulators still consume at least as much aggregate memory as the original application. This high memory usage is a major bottleneck limiting the size of problems that can be simulated today, especially if the target system has many more processors than the host system used to run the simulations.

In principle, only two aspects of the computations are needed to predict communication behavior and overall performance: (a) the elapsed time for each computational interval, and (b) those intermediate computational results that affect the computation times and communication behavior. We refer to the computations required for part (b) as "essential" computations (signifying that their results affect program performance), and the rest of the computations as "non-essential." Essential computations are exactly those that affect the control flow (and therefore computation times) or the communication behavior of the program. Assume for now that computation times for the non-essential computations can be estimated analytically using compiler-driven performance prediction. Then, if the essential computations can be isolated, the non-essential computations can be ignored during the detailed simulation, and the data structures used exclusively by the non-essential computations can be eliminated. If the memory savings are substantial and the simplified simulation is accurate, this technique can make it practical to simulate much larger systems and data sets than is currently possible even with parallel direct-execution simulation.

We have integrated the static task graph model with the MPI-Sim simulator (plus additional compiler analysis) in order to implement and evaluate the technique described above [4]. Compiler analysis is essential because identifying and eliminating the "non-essential" computations requires information about the communication and control flow in the application. The static task graph model serves two purposes. First, it provides an abstract representation for the compiler analysis, in which the computation intervals (tasks) are clearly separated from the communication structure. Second, it serves as the interface to MPI-Sim (in the form of a simplified MPI program that captures the task graph structure plus the "essential" computations).

Briefly, the integration works as follows. The static task graph directly identifies the computational intervals: these simply correspond to subgraphs containing no communication. First, the compiler identifies the values (i.e., the uses of variables) that affect the control flow within the computation intervals, and the values that affect the communication behavior (note that this does not yet include the values being communicated). The compiler then uses a standard technique known as program slicing [20] to identify the subset of the computations that affect these variable values; these are exactly the essential computations.

Second, the compiler computes a symbolic expression representing the elapsed time for each non-essential computation interval. These task time estimates are similar to equation (4) of the LogGP model, but derived automatically from program analysis. The Wi,j parameters can be measured directly, estimated by more detailed analysis, or simulated using SimpleScalar as described in Sections 4.4 and 6.3. The current implementation simply measures these values by generating an instrumented version of the original source code for one or more relatively small problem sizes.

Finally, the compiler transforms the original parallel code into a simplified MPI program that has exactly the same parallel structure as the original task graph, but where the non-essential computations are replaced by function calls to a special simulator function. MPI-Sim has been augmented to provide this special simulator function, which takes an argument specifying a time value and simply advances the simulation clock for the current process by that value. The symbolic estimate derived by the compiler is passed as an argument to the function.
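A minimal sketch of this transformation, under assumed names (SimClock, elapse), is shown below: the non-essential computation disappears, and the per-process simulation clock simply advances by the compiler-derived time estimate.

    # The special simulator function: advance one process's simulation clock.
    class SimClock:
        def __init__(self):
            self.now = {}                  # simulation time per process

        def elapse(self, proc, seconds):
            self.now[proc] = self.now.get(proc, 0.0) + seconds

    # In the transformed program, a non-essential interval is replaced by a
    # call like the following, where the argument is the compiler's symbolic
    # task time estimate evaluated at runtime (cf. equation (4)); the
    # parameter values here are purely illustrative.
    W_g, mmi, mk, it, jt = 2.4e-7, 3, 10, 14, 14
    clock = SimClock()
    clock.elapse(proc=0, seconds=W_g * mmi * mk * it * jt)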

Preliminary results demonstrating the scalability of the integrated simulator can be seen in Figure 6.1. The per-processor problem size is fixed (6×6×1000) in the figure, so that the total problem size scales linearly with the number of processors. The original MPI-Sim could not be used to simulate applications running on more than 400 processors in this case (i.e., an aggregate 14.4 million-cell problem size), whereas the integrated task graph simulator model scaled up to 6400 processors (i.e., a 230 million-cell problem size). The primary reason for the improved scalability of the simulation is that the integrated model requires a factor of 1760 less memory than the original simulator! The total simulation time is also improved by about a factor of 2. Finally, the integrated model has an average error under 10% for this problem, compared with an average error of 3.6% for the original simulator. In fact, for other cases we have looked at, the two approaches are comparable in their accuracy [4].

The composition of models described above was developed manually because substantial research issues were involved. In practice, the POEMS methodology for component model composition and evaluation tool integration described in Section 3 can be applied to perform this composition. The dHPF compiler would first generate a modified static task graph that creates separate tasks for intervals of essential and non-essential computations. (The code for each graph node in PSL is encapsulated in a separate function, mainly for implementation convenience.) The evaluation tool for the communication operations is of course the MPI-Sim simulator. The evaluation tool for the essential task nodes, including essential control-flow nodes, would be direct execution within the POEMS runtime system. Last and most important, several different evaluation tools can be used for the non-essential task nodes: compiler-synthesized symbolic expressions parameterized with measurements as in our experiments above (in fact, the POEMS environment would further simplify the measurement process), purely analytical estimates derived by the compiler, or the SimpleScalar processor simulator. Furthermore, different evaluation tools could be used for different nodes. For example, the SimpleScalar simulations could be used for a few instances of each node in order to model cache behavior on a hypothetical system, and these results used as parameter values in the symbolic expressions for modeling other instances of the node.

[Figure 6.1. Aggregate Problem and System Sizes that can be Simulated with the Integrated Static Task Graph + MPI-Sim Model; Per-processor Problem Size is Fixed at 6×6×1000. Predicted runtime (in sec) versus number of target processors for STG + MPI-Sim, MPI-Sim, and measurement.]

The specification of this integrated model in PSL is closely analogous to the example given in Section 3.3. The component models for the task graph nodes are MPI-Sim for communication nodes, direct execution for the essential computation nodes, and one of the evaluation methods described above for each of the non-essential computation nodes. The accepts interfaces for the component models implemented through MPI-Sim identify the position of the given computation or communication node in the static task graph, and the function to be modeled or the signature of the MPI component to be simulated. The requests interfaces specify the evaluation tools and the successor nodes in the task graph. The accepts interfaces of the non-essential computation nodes specify the function being encapsulated, the position in the static task graph of the given computation node, and an evaluation mode. The requests interfaces of the computation nodes specify the evaluation tools, the next communication node in the task graph, and the signature of the desired successor communication function. The PSL compiler would automatically generate a single executable that invokes the appropriate model components at runtime. An execution of this program essentially can be envisioned as synthesizing the dynamic task graph from the static task graph on the fly and traversing it, alternating between execution of computation nodes and communication nodes, where each node is evaluated or executed by its component model.

6.2 Integration of MPI-Sim and the LogGP Models

As discussed in Section 5, both analytical and simulation techniques can predict the performance of large-scale parallel applications as a function of various architectural characteristics. Simulation can model system performance at much greater levels of detail than analytical models, and can evaluate application performance for architectural modifications that would change the analytical model input parameters in unknown ways. As a simple example, MPI-Sim results were used to gain confidence in the very large-scale projections of Sweep3D performance from the LogGP model. On the other hand, because of the resource and time constraints of simulation, analytical models can elucidate the principal system performance parameters and can provide performance projections for much larger configurations than is possible with simulation models. Because of these complementary strengths, significant benefits can be derived from combining the two approaches.

One key advantage of further integrating MPI-Sim and the LogGP models is that the performance of large-scale applications with modified implementations of the MPI communication primitives can be evaluated. For example, the MPI runtime software implementation on the SP/2 has not yet been optimized for communication among the processors in the same SMP node. For the experiments reported in Section 5, the Sweep3D processes were mapped to different nodes in the SP/2, in order to utilize the efficient MPI communication between nodes. To obtain performance estimates of Sweep3D for next-generation systems, MPI-Sim component models can be used to simulate the efficient communication that is expected to be available in future versions of the MPI runtime library. The measured communication costs from these simulations can then be used in the LogGP model, appropriately updated to reflect non-uniform intra-node/inter-node communication costs [36], to predict application scalability when Sweep3D uses intra-node as well as inter-node communication on the SP/2.

The Sweep3D application that uses MPI communication primitives is accurately modeled by the LogGP and MPI-Sim models, which assume no significant queuing delays at the network interfaces or in the interconnection network switches. Other applications, including the shared-memory version of Sweep3D, may require estimates of queuing delays in the network, or the network interfaces, in order to achieve accurate performance models. In many previous studies, analytical models of contention in interconnection networks have proven to be both very efficient to evaluate and highly accurate. Thus, a detailed analytical component model for the interconnection network might be composed with (1) a more abstract model of the application running time (as in the LoPC model [18]), and/or (2) an MPI-Sim component model of the MPI runtime system communication primitives.

6.3 Integration of SimpleScalar with MPI-Sim and LogGP

Although both MPI-Sim and LogGP have flexible communication components, the processor model in both systems is simple and relies on measurement of the local code blocks on an existing processor and memory system. To allow the POEMS modeling capabilities to be expanded to include assessment of the impact of next-generation processors and memory systems, detailed component models based on SimpleScalar are under development. These models interact with the MPI-Sim and LogGP component models to project the impact of processor and memory system architecture changes on the execution time of large-scale Sweep3D simulations (on thousands of processor nodes).

The modeling of specific processors by SimpleScalar still needs to be validated. For example, the POEMS SimpleScalar model of the PowerPC 604e is not exact because of certain aspects of the 604e; e.g., the modeled ISA is not identical to that of the 604e. A key question is whether the approximate model of the ISA is sufficiently accurate for applications such as Sweep3D. Several validation methods are under consideration:

1) execute narrow-spectrum benchmarks on a 604e and on SimpleScalar configured as a 604e to validate the memory hierarchy design parameters;

2) use on-chip performance counters to validate the microarchitecture modeling; and

3) compare measured Sweep3D task times on the SP/2 with task execution times attained by running multiple SimpleScalar instances configured as 604es under MPI.

One example of combined simulation and analysis is to use the LogGP model to obtain application scalability projections for next-generation processor and cache architectures, by using estimates of task execution times for those architectures that are derived from SimpleScalar simulations. For example, in the Sweep3D application, a SimpleScalar component model can produce the Wi,j values for a fixed per-processor grid size on a 2×2 processor grid, and then the LogGP component model can be run using the estimates for various per-processor grid sizes to project the performance scalability of the application. This is one of the near-future goals of the POEMS multi-paradigm modeling efforts. As another example, instead of simulating or directly executing the MPI communication primitives, a LogGP component model of the MPI communication costs might be composed with a SimpleScalar component model of the node architecture.

7 RELATED WORK

The conceptual framework for POEMS is a synthesis from models of naming and communication [11, 12], Computer-Aided Design (CAD), software frameworks, parallel computation [24], object-oriented analysis [33], data mediation [38] and intelligent agents. In this section, however, we focus on research projects that share our primary goal of end-to-end performance evaluation for parallel systems. A more extensive but still far from comprehensive survey of related work and a list of references can be found on the POEMS project Web page at http://www.cs.utexas.edu/users/poems.

The projects most closely related to POEMS are probably the Maisie/Parsec parallel discrete-event simulation framework and its use in parallel program simulation [7, 9, 27, 28], SimOS [32], RSIM [25, 26], PACE [21], and the earlier work in program simulators, direct-execution simulators, and parallel discrete-event simulation. In addition, an early system that shared many of the goals of the POEMS modeling environment, but did not incorporate recent results from object-oriented analysis, data mediation and intelligent agents, nor the range of modern analytic and simulation modeling tools being incorporated in POEMS, was the SARA system [17]. SimOS simulates the computer hardware of both uniprocessor and multiprocessor computer systems in enough detail to run an entire operating system, thus providing a simulation environment that can investigate realistic workloads. Different modes of operation provide a trade-off between the speed and detail of a simulation. Thus, SimOS supports multi-domain and multi-resolution modeling but, unlike POEMS, it primarily uses a single evaluation paradigm, namely simulation. RSIM supports detailed instruction-level and direct-execution simulation of parallel program performance for shared-memory multiprocessors with ILP processors. PACE (Performance Analysis and Characterisation Environment) is designed to predict and analyze the performance of parallel systems defined by a user, while hiding the underlying performance characterization models and their evaluation processes from the user.

None of the above projects supports the general integration of multiple paradigms for model evaluation, a key goal of POEMS. The conceptual extensions used to achieve this in POEMS are a formal specification language for composition of component models into a full system model, a unified application representation that supports multiple modeling paradigms, and automatic synthesis of this workload representation using a parallelizing compiler. The alternative modeling paradigms support validation and allow different levels of analyses of existing and future application programs within a common framework.

Finally, there are many effective commercial products for simulation modeling of computer and communication systems. The March 1994 issue of IEEE Communications Magazine presents a survey of such products.

8 CONCLUSIONS

The POEMS project is creating a problem-solving environment for end-to-end performance models of complex parallel and distributed systems and applying this environment for performance prediction of application software executed on current and future generations of such systems. This paper has described the key components of the POEMS framework: a generalized task graph model for describing parallel computations, automatic generation of the task graph by a parallelizing compiler, a specification language for mapping the computation on component models from the operating system and hardware domains, compositional modeling of multi-paradigm, multi-scale, multi-domain models, integration of a Performance Recommender for selecting the computational parameters for a given target performance, and a wide set of modeling tools ranging from analytical models to parallel discrete-event simulation tools.

The paper illustrates the POEMS modeling methodology and approach by using a number of the POEMS tools for performance prediction of the Sweep3D application kernel selected by Los Alamos National Laboratory for evaluation of ASCI architectures. The paper validates the performance predicted by the analytical and simulation models against the measured application performance. The Sweep3D kernel used for this study is an example of a regular CPU-bound application. Reusable versions of the analytical and simulation models, parameterized for three-dimensional wavefront applications, will form the initial component model library for POEMS. Future development of POEMS methods and tools will be largely driven by MOL [30], which is a modular program that implements the Method of Lines for solving partial differential equations. It is designed to be a "simple" program (less than 1000 lines of code) that has all the features of a "sophisticated" dynamic code. Features that can be varied easily include (1) work load needed to maintain quality of service, (2) number of processors needed, (3) communication patterns, (4) communication bandwidth needed, (5) internal data structures, etc.

In addition to continuing efforts on the preceding topics, several interesting research directions are being pursued in the project. First, the POEMS modeling framework and tools will be extended to directly support the evaluation of parallel programs expressed using the task graph notation. Second, in the study reported in this paper, there was considerable synergy between the development of the analytical and simulation models, enabling validations to occur more rapidly than if each model had been developed in isolation. As the next step, the project is enhancing the multi-paradigm modeling capability in POEMS, in which the analytical models will be used by the execution-driven simulator, e.g., to estimate communication delays and/or task execution times, and simulation models will be invoked automatically to derive analytical model input parameters. Several initial examples of such integrated modeling approaches were described in Section 6. Innovations in parallel discrete-event simulation technology to reduce the execution time for the integrated models will continue to be investigated, with and without compiler support. The integration of compiler support with analytical and parallel simulation models will enable (to our knowledge) the first fully-automatic, end-to-end performance prediction capability for large-scale parallel applications and systems.

ACKNOWLEDGMENT

A number of people from the member institutions represented by the POEMS team contributed to this work. In particular, the authors acknowledge Adolfy Hoisie, Olaf Lubeck, Yong Luo, and Harvey Wasserman of Los Alamos National Laboratory for suggesting the Sweep3D application, providing significant assistance in understanding the application and the performance issues that are of importance to the application developers, and providing key feedback on our research results. We also wish to thank Thomas Phan and Steven Docy for their help with the use of MPI-Sim to predict the Sweep3D performance on the SP/2. Thanks to the Office of Academic Computing at UCLA and to Paul Hoffman for help with the IBM SP/2 on which many of these experiments were executed. Thanks also to Lawrence Livermore National Laboratory for providing extensive computer time on the IBM SP/2.

This work was supported by DARPA/ITO under Contract N66001-97-C-8533, "End-to-End Performance Modeling of Large Heterogeneous Adaptive Parallel/Distributed Computer/Communication Systems," 10/01/97 - 09/30/00, and by an NSF grant titled "Design of Parallel Algorithms, Language, and Simulation Tools," Award ASC9157610, 08/15/91 - 7/31/98. Thanks to Frederica Darema for her support of this research.

REFERENCES

[1] Adve, V. S., “Analyzing the Behavior and Performanceof Parallel Programs”, Univ. of Wisconsin-Madison,UW CS Tech. Rep. #1201, Oct. 1993.

[2] Adve, V. S., and J. Mellor-Crummey, “Using IntegerSets for Data-Parallel Program Analysis andOptimization”, Proc. SIGPLAN98 Conf. on Prog.Lang. Design and Implementation, Montreal, June1998.

[3] Adve, V. S., and R. Sakellariou, “ApplicationRepresentations for Multi-Paradigm PerformanceModeling”, International Journal of High PerformanceComputing Applications 14(4), 2000.

[4] Adve, V. S., R. Bagrodia, E. Deelman, T. Phan and R.Sakellariou, “Compiler-Supported Simulation ofHighly Scalable Parallel Applications”, SC99: HighPerformance Computing and Networking, Portland,OR, Nov. 1999.

[5] Alexandrov, A., M. Ionescu, K. E. Schauser, and C.Scheiman, “LogGP: Incorporating Long Messages intothe LogP Model”, Proc. 7th Ann. ACM Symp. onParallel Algorithms and Architectures (SPAA ’95),Santa Barbara, CA, July 1995.

[6] Amza, C., A. Cox, S. Dwarkadas, P. Keleher, H. Lu, R.Rajamony, W. Lu, and W. Zwaenepoel, “TreadMarks:Shared Memory Computing on Networks ofWorkstations”, Computer, 29(2), Feb. 1996, pp. 18-28.

[7] Bagrodia, R., and W. Liao, “Maisie: A Language forDesign of Efficient Discrete-event Simulations”, IEEETran. on Software Engineering, 20(4), Apr. 1994.

[8] Bagrodia, R., S. Docy, and A. Kahn, “ParallelSimulation of Parallel File Systems and I/O Programs”,Proc. Supercomputing ’97, San Jose, 1997.

[9] Bagrodia, R., R. Meyer, M. Takai, Y. Chen, X. Zeng, J.Martin, B. Park, and H. Song, “Parsec: A ParallelSimulation Environment for Complex Systems”,Computer, 31(10), Oct. 1998, pp. 77-85.

[10] Bagrodia, R., E. Deelman, S. Docy, and T. Phan,“Performance Prediction of Large ParallelApplications Using Parallel Simulations”, 7th ACMSIGPLAN Symp. on the Principles and Practices ofParallel Programming (PPoPP ‘99), Atlanta, May1999.

[11] Bayerdorffer, B., Associative Broadcast and theCommunication Semantics of Naming in ConcurrentSystems, Ph.D. Dissertation, Dept. of ComputerSciences, Univ. of Texas at Austin, Dec. 1993.

[12] Bayerdorffer, B., “Distributed Programming withAssociative Broadcast”, Proc. of the Twenty-eighthHawaii International Conf. on System Sciences, Jan.

Page 24: POEMS: End-to-End Performance Design of Large …pages.cs.wisc.edu/~vernon/papers/poems.00tse.pdfTo appear in IEEE Trans. on Software Engr., Nov. 2000. 1 POEMS: End-to-End Performance

1995, pp. 525-534.

[13] Booch, G., J. Rumbaugh, and I. Jacobson, UnifiedModeling Language User Guide, Addison-Wesley,Englewood Cliffs, NJ, 1997.

[14] Burger, D., and T. M. Austin, “The SimpleScalar ToolSet, Version 2.0”, Univ. of Wisconsin-Madison, UWCS Tech Rpt. #1342, June 1997.

[15] Culler, D., R. Karp, D. Patterson, A. Sahay, K. E.Schauser, E. Santos, R. Subramonian, and T.VonEiken, “LogP: Towards a Realistic Model ofParallel Computation”, Proc. 4th ACM SIGPLANSymp. on Principles and Practices of ParallelProgramming (PPoPP ’93), San Diego, CA, May1993, pp. 1-12.

[16] Dube, A., “A Language for CompositionalDevelopment of Performance Models and itsTranslation”, Masters Thesis, Dept. of ComputerScience, Univ. of Texas at Austin, Aug., 1998.

[17] Estrin, G., R. Fenchel, R. Razouk, and M. K. Vernon, “SARA: Modeling, Analysis, and Simulation Support for Design of Concurrent Systems”, IEEE Trans. on Software Engineering, Special Issue on Software Design Methods, SE-12(2), Feb. 1986, pp. 293-311.

[18] Frank, M. I., A. Agarwal, and M. K. Vernon, “LoPC: Modeling Contention in Parallel Algorithms”, Proc. 6th ACM SIGPLAN Symp. on Principles and Practices of Parallel Programming (PPoPP ’97), Las Vegas, NV, June 1997, pp. 62-73.

[19] Hoisie, A., O. M. Lubeck, and H. J. Wasserman, “Performance and Scalability Analysis of Teraflop-Scale Parallel Architectures Using Multidimensional Wavefront Applications”, Proc. Frontiers ’99.

[20] Horwitz, S., T. Reps, and D. Binkley, “Interprocedural Slicing Using Dependence Graphs”, ACM Trans. on Programming Languages and Systems, 12(1), Jan. 1990, pp. 26-60.

[21] Kerbyson, D. J., J. S. Harper, A. Craig, and G. R. Nudd, “PACE: A Toolset to Investigate and Predict Performance in Parallel Systems”, European Parallel Tools Meeting, ONERA, Paris, Oct. 1996.

[22] Koch, K. R., R. S. Baker, and R. E. Alcouffe, “Solution of the First-Order Form of the 3-D Discrete Ordinates Equation on a Massively Parallel Processor”, Trans. of the Amer. Nuc. Soc., 65(198), 1992.

[23] Lamport, L., “Time, Clocks, and the Ordering of Events in a Distributed System”, Communications of the ACM, 21(7), July 1978, pp. 558-565.

[24] Newton, P., and J. C. Browne, “The CODE 2.0 Graphical Parallel Programming Language”, Proc. ACM Int. Conf. on Supercomputing, July 1992, pp. 167-177.

[25] Pai, V. S., P. Ranganathan, and S. V. Adve, “RSIM Reference Manual, Version 1.0”, Dept. of Electrical and Computer Engineering, Rice Univ., Tech. Rep. #9705, Aug. 1997.

[26] Pai, V. S., P. Ranganathan, and S. V. Adve, “The Impact of Instruction Level Parallelism on Multiprocessor Performance and Simulation Methodology”, Proc. 3rd Int. Conf. on High Performance Computer Architecture, San Antonio, TX, Mar. 1997, pp. 72-83.

[27] Prakash, S., and R. Bagrodia, “Parallel Simulation of Data Parallel Programs”, Proc. 8th Workshop on Languages and Compilers for Parallel Computing, Columbus, OH, Aug. 1995.

[28] Prakash, S., and R. Bagrodia, “Using Parallel Simulation to Evaluate MPI Programs”, Proc. Winter Simulation Conf., Washington, D.C., Dec. 1998.

[29] Ramakrishnan, N., Recommender Systems for Problem Solving Environments, Ph.D. Dissertation, Dept. of Computer Sciences, Purdue Univ., 1997.

[30] Rice, J. R., Numerical Methods, Software and Analysis, 2nd Edition, Academic Press, New York, 1993, Section 10.2.E, pp. 524-527.

[31] Rumbaugh, J., et al., Object-Oriented Modeling and Design, Prentice-Hall, Englewood Cliffs, NJ, 1991.

[32] Rosenblum, M., S. A. Herrod, E. Witchel, and A. Gupta, “Complete Computer System Simulation: The SimOS Approach”, IEEE Parallel and Distributed Technology, Winter 1995, pp. 34-43.

[33] Shlaer, S., and S. Mellor, Object Lifecycles: Modeling the World in States, Yourdon Press, NY, 1992.

[35] Sun, X. H., D. He, K. W. Cameron, and Y. Luo, “A Factorial Performance Evaluation for Hierarchical Memory Systems”, Proc. Int’l Parallel Processing Symposium (IPPS ’99), San Juan, PR, Apr. 1999.

[36] Sundaram-Stukel, D., and M. K. Vernon, “Predictive Analysis of a Wavefront Application Using LogGP”, Proc. 7th ACM SIGPLAN Symp. on Principles and Practices of Parallel Programming (PPoPP ’99), Atlanta, May 1999.

[37] Vernon, M. K., E. D. Lazowska, and J. Zahorjan, “An Accurate and Efficient Performance Analysis Technique for Multiprocessor Snooping Cache-Consistency Protocols”, Proc. 15th Annual Int’l Symp. on Computer Architecture, Honolulu, May 1988, pp. 308-315.

[38] Wiederhold, G., “Mediation in Information Systems”, in Research Directions in Software Engineering, ACM Computing Surveys, 27(2), June 1995, pp. 265-267.