
Future Generation Computer Systems 22 (2006) 204–216

A component-based architecture for parallel multi-physics PDE simulation

Steven G. Parker

Scientific Computing and Imaging Institute, School of Computing, University of Utah, 50 S. Central Campus Dr., Room 3490, Salt Lake City, UT 84112, USA

E-mail address: [email protected]. URL: http://www.cs.utah.edu/~sparker.

Available online 13 May 2005

Abstract

We describe the Uintah Computational Framework (UCF), a set of software components and libraries that facilitate the simulation of partial differential equations on structured adaptive mesh refinement grids using hundreds to thousands of processors. The UCF uses a non-traditional approach to achieving parallelism, employing an abstract taskgraph representation to describe computation and communication. This representation has a number of advantages that affect the performance of the resulting simulation. We demonstrate performance of the system on a solid mechanics algorithm, two different computational fluid-dynamics (CFD) algorithms, as well as coupled CFD/mechanics algorithms. We show performance of the UCF using up to 2000 processors.
© 2005 Published by Elsevier B.V.

Keywords: Structured AMR; Components; Scientific computing

1. Introduction

Computational scientists continue to push the capabilities of current computer hardware to the limits in order to simulate complex real-world phenomena. These simulations necessitate the use of ever increasing computational resources. Furthermore, the software written to model real-world scientific and engineering problems is typically very complex. Grid generation, non-linear and linear solvers, visualization systems, and parallel run-time systems all combine to provide a very powerful environment for solving complex scientific and engineering problems. Such complexities are further compounded when multiple simulation codes are combined to simulate the interaction of multiple phenomena.

Frameworks for PDE simulation are common, including systems such as Diffpack [1], Ellpack [2], Overture [3], POOMA [4], SAMRAI [5], and SPRINT2D [6]. Each of these frameworks has its own strengths and weaknesses, but each is designed to simplify the process of implementing PDE simulations. They usually provide support for grid generation and management and for parallel communication, and they simplify common operations.

Table 1
Abbreviations used in this paper

AMR: Adaptive Mesh Refinement, using a denser discretization in parts of the domain that require more accuracy
ASCI: Accelerated Strategic Computing Initiative, a US Department of Energy supported program that is pursuing large-scale scientific computation
CCA: Common Component Architecture, components designed for high-performance computing
CFD: Computational Fluid Dynamics
C-SAFE: Center for Simulation of Accidental Fires and Explosions, the project that is utilizing this work
MPI: Message Passing Interface, a commonly used mechanism for communication in a distributed-memory parallel machine
MPM: Material Point Method, a particle-based method for solid mechanics
PDE: Partial Differential Equation
PSE: Problem Solving Environment, a tool for easily accessing various tools required to solve a scientific problem
SAMR: Structured Adaptive Mesh Refinement, a style of AMR using overlapping structured grids
UCF: Uintah Computational Framework, the PDE framework described here

The Uintah Computational Framework (UCF) is a set of software components and libraries that facilitate the simulation of partial differential equations (PDEs) on structured adaptive mesh refinement (SAMR) grids using hundreds to thousands of processors. The UCF uses a non-traditional approach to achieving parallelism, employing an abstract taskgraph representation to describe computation and communication. This representation has a number of advantages, including efficient fine-grained coupling of multi-physics components, flexible load-balancing mechanisms, and a separation of application concerns from parallelism concerns. The taskgraph representation distinguishes the UCF from other approaches.

The taskgraph concept will be described, along with a number of the implementation details. We will discuss the advantages and disadvantages of this methodology, and will present results from UCF simulations. This paper contains numerous acronyms that are described in Table 1 for the convenience of the reader.

2. Overview

The system described here involves several connections between a number of different pieces. In order to adequately describe how they all interact, we first give an overview of the individual pieces in this section. The remainder of the paper will discuss how those pieces fit together to achieve large-scale parallel simulations.

2.1. C-SAFE

In 1997, the University of Utah created an alliance with the DOE Accelerated Strategic Computing Initiative (ASCI) to form the Center for the Simulation of Accidental Fires and Explosions (C-SAFE) [7,8]. C-SAFE focuses specifically on providing state-of-the-art, science-based tools for the numerical simulation of accidental fires and explosions, especially within the context of handling and storage of highly flammable materials. The primary objective of C-SAFE is to provide a software system comprising a problem solving environment (PSE) in which fundamental chemistry and engineering physics are fully coupled with non-linear solvers, optimization, computational steering, visualization and experimental data verification. The availability of simulations using this system will help to better evaluate the risks and safety issues associated with fires and explosions. Our goal is to integrate and deliver a system that is validated and documented for practical application to accidents involving both hydrocarbon and energetic materials. Efforts of this nature require expertise from a wide variety of academic disciplines. A typical C-SAFE problem is shown in Fig. 1.

2.2. Common Component Architecture

The Common Component Architecture (CCA) forum [9,10] was established by a group of researchers from several Department of Energy national laboratories and several universities to address the need for a software component architecture that fulfilled the needs of high-performance computing. The CCA architecture aims to provide higher performance, explicit support for multi-dimensional arrays, and explicit support for parallelism. Uintah, described below, is a research vehicle for implementing these ideas and for exercising their efficacy on a complex scientific application, such as C-SAFE simulations.


2.3. SCIRun

SCIRun is a scientific PSE that allows the interactive construction and steering of large-scale scientific computations [11-13]. A scientific application is constructed by connecting computational elements (modules) to form a program (network). This program may contain several computational elements as well as several visualization elements, all of which work together in orchestrating a solution to a scientific problem. Geometric inputs and computational parameters may be changed interactively, and the results of these changes provide immediate feedback to the investigator. SCIRun is designed to facilitate large-scale scientific computation and visualization on a wide range of machines from the desktop to large supercomputers. SCIRun is pronounced "ski-run" and derives its name from the Scientific Computing and Imaging (SCI) Institute at the University of Utah.

2.4. Uintah

C-SAFE's Uintah PSE [14,15] is a massively parallel, component-based PSE designed to simulate large-scale scientific problems, while allowing the scientist to interactively visualize, steer, and verify simulation results. Uintah is a derivative of the SCIRun PSE, and adds support for the more powerful CCA component model and support for distributed-memory parallel computers.

3. Component architecture

Due to space constraints, the overall Uintah architecture will be described only briefly here. Further details can be found in [10,14,15], as well as in future publications from C-SAFE.

The Uintah Component Architecture is based on the CCA. Since the CCA is an evolving standard, we restricted ourselves to a subset of the overall standard in order to focus on C-SAFE simulations. In particular, we created a C++-only implementation, and ignored the multi-language features that were still under development.

The primary feature of the architecture is the port model. The port model is the mechanism by which components communicate. In C++, this looks like little more than an abstract class on which the caller can perform method invocations. Components provide a set of interfaces, which other components can use. At component creation time, the component declares a set of Provides and Uses ports. An external entity, called a builder, connects the provides port of one component to the uses port of another component. For the simulations described here, this connection is provided by a stand-alone main program. However, in general the builder can be a script, a graphical user interface, or even another component.
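To make the provides/uses pattern concrete, the following minimal C++ sketch shows a port as an abstract class, a component that provides it, and a stand-alone main program acting as the builder. The names (Port, SolverPort, LinearSolver, Simulation) are invented for illustration and are not the actual CCA or Uintah declarations.

#include <iostream>

// A port is little more than an abstract class to invoke methods on.
struct Port {
  virtual ~Port() = default;
};

// An interface that one component "provides".
struct SolverPort : Port {
  virtual void solve() = 0;
};

// A component providing SolverPort.
struct LinearSolver : SolverPort {
  void solve() override { std::cout << "solving linear system\n"; }
};

// A component that "uses" SolverPort; the pointer is filled in by the builder.
struct Simulation {
  SolverPort* solver = nullptr;
  void timestep() { if (solver) solver->solve(); }
};

int main() {
  // The "builder": here a stand-alone main program connecting a
  // provides port of one component to a uses port of another.
  LinearSolver provider;
  Simulation user;
  user.solver = &provider;
  user.timestep();
}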

The CCA is centered around the Scientific Interface Definition Language (SIDL) [16]. SIDL is the mechanism by which component interfaces are described. A SIDL compiler generates code for inter-language operation as well as for remote method invocations in a distributed-memory environment. Although Uintah contains support for such distributed-memory operations, the component interactions described here all occur within a single address space. Operation in a distributed-memory parallel environment occurs using MPI calls within individual components.

4. Uintah Computational Framework

The UCF is implemented in the context of the Uintah PSE. The Uintah PSE architecture is very general, facilitating a wide range of computational and visualization applications. On top of the Uintah architecture, we have designed a set of components and supporting libraries that are targeted toward the solution of PDEs on massively parallel architectures (hundreds to thousands of processors). This set of components and libraries is collectively called the UCF.

4.1. Overview

The UCF employs a non-traditional approach to achieving parallelism. Instead of explicit MPI calls placed throughout the program, applications are cast in terms of a taskgraph, a construct that describes the data dependencies between various steps in the problem.

The UCF exposes flexibility in dynamic application structure by adopting an execution model based on software-based "macro"-dataflow. Computations are expressed as directed acyclic graphs of tasks, each of which produces some output and consumes some input (which is in turn the output of some previous task).


These inputs and outputs are specified for each patch in a Structured AMR grid. Tasks form a UCF data structure called the taskgraph, which represents imminent computation. Associated with each task is a C++ method which is used to perform the actual computation. UCF data structures are compatible with Fortran arrays, so that the application writer can use Fortran subroutines to provide numeric kernels on each patch.
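As a rough picture of what describing a task might look like, this hypothetical sketch declares a task by name, lists its per-patch inputs and outputs, and attaches the C++ callback that does the work. Task and Scheduler here are simplified stand-ins, not the real UCF classes.

#include <functional>
#include <string>
#include <utility>
#include <vector>

struct Patch {};            // stand-in for one rectangular piece of the SAMR grid
struct DataWarehouse {};    // stand-in for the name-based variable store (see below)

// Hypothetical task description: what it needs, what it produces, and the
// C++ callback that performs the per-patch computation.
struct Task {
  std::string name;
  std::vector<std::string> requiresVars;   // inputs, per patch
  std::vector<std::string> computesVars;   // outputs, per patch
  std::function<void(const Patch&, DataWarehouse& oldDW, DataWarehouse& newDW)> callback;
};

struct Scheduler {
  std::vector<Task> tasks;
  void addTask(Task t) { tasks.push_back(std::move(t)); }
};

// A solid-mechanics component might register, for example:
//   Task t;
//   t.name         = "interpolateParticlesToGrid";
//   t.requiresVars = {"particle.mass", "particle.velocity"};
//   t.computesVars = {"grid.velocity"};
//   t.callback     = ...;   // often a thin wrapper around a Fortran kernel
//   scheduler.addTask(std::move(t));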

Each execution of a taskgraph integrates a single timestep, or a single non-linear iteration, or some other coarse algorithm step. Tasks "communicate" with each other through an entity called the DataWarehouse. The DataWarehouse is accessed through a simple name-based dictionary mechanism, and provides each task with the illusion that all memory is global. If the tasks correctly describe their data dependencies, then the data stored in the DataWarehouse will match the data (variable and region of space) needed by the task. In other words, the DataWarehouse is an abstraction of a global single-assignment memory, with automatic data lifetime management and storage reclamation. Values stored in the DataWarehouse are typically array-structured.
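The toy DataWarehouse below makes the name-keyed, single-assignment discipline concrete. It is only a sketch under that assumption: the real UCF stores per-patch, array-structured values, tracks regions of space, and reclaims storage automatically.

#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

class DataWarehouse {
  std::map<std::string, std::vector<double>> values_;  // name -> per-cell data
public:
  void put(const std::string& name, std::vector<double> v) {
    assert(!values_.count(name) && "single assignment: a value is written once");
    values_[name] = std::move(v);
  }
  const std::vector<double>& get(const std::string& name) const {
    return values_.at(name);  // the scheduler guarantees this was produced
  }
};

// A task callback sees only the warehouse, never raw MPI:
void advanceVelocity(const DataWarehouse& oldDW, DataWarehouse& newDW, double dt) {
  const auto& v = oldDW.get("velocity");
  const auto& a = oldDW.get("acceleration");
  std::vector<double> vNew(v.size());
  for (std::size_t i = 0; i < v.size(); ++i) vNew[i] = v[i] + a[i] * dt;
  newDW.put("velocity", vNew);   // becomes an input for later tasks
}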

Communication is scheduled by a local scheduling algorithm that approximates the true globally optimal communication schedule. Because of the flexibility of single-assignment semantics, the UCF is free to execute tasks close to data or move data to minimize future communication.

The UCF storage abstraction is sufficiently high-level that it can be efficiently mapped onto both message-passing and shared-memory communication mechanisms. Threads sharing a memory can access their input data directly; single-assignment dataflow semantics eliminate the need for any locking of values. Threads running in disjoint address spaces communicate by a message-passing protocol, and the UCF is free to optimize such communication by message aggregation. Tasks need not be aware of the transports used to deliver their inputs, and thus the UCF has complete flexibility in control and data placement to optimize communication both between address spaces and within a single shared-memory node. Latency in requesting data from the DataWarehouse is not an issue; the correct data is deposited into the DataWarehouse before the task is executed.

Consider the taskgraph in Fig. 2. Ovals represent tasks, each of which is a simple array program easily treated by traditional compiler array optimizations. Edges represent named values stored by the UCF. Solid edges have values defined at each material point (Particle Data) and dashed edges have values defined at each grid vertex (Grid Data). Variables denoted with a prime (′) have been updated during the timestep. The figure shows the slice of the actual Uintah Material Point Method (MPM) [17] taskgraph concerned with advancing Newtonian material point motion on a single patch for a single timestep.

The idea of the dataflow graph as an organizing structure for execution is well known. The SMARTS [18] dataflow engine that underlies the POOMA [4] toolkit shares goals and philosophy with UCF. SISAL compilers [19] used dataflow concepts at a much finer granularity to structure code generation and execution. Dataflow is a simple, natural and efficient way of exposing parallelism and managing computation, and is an intuitive way of reasoning about parallelism. What distinguishes implementations of dataflow ideas is that each caters to a particular higher-level presentation. SMARTS caters to POOMA's C++ implementation and stylistic template-based presentation. The SISAL compiler was of course developed to support the SISAL language. UCF is implemented to support a presentation catering to C++ and Fortran based mixed particle/grid algorithms on a structured adaptive mesh. The primary algorithms of importance to C-SAFE are the MPM and Eulerian CFD algorithms.

4.2. Taskgraph advantages/disadvantages

This dataflow-based representation of parallel computation fits very well with the Structured AMR grid, and with the nature of the computations that C-SAFE is performing. In particular, we decided to use this approach in order to accommodate a number of important needs.

First, the taskgraph helps accommodate flexible multi-physics integration needs. In particular, integration of a particle-based solid mechanics algorithm (MPM) with a state-of-the-art CFD code was a primary C-SAFE goal. However, scientists still wanted to be able to execute the CFD algorithm by itself, or the MPM algorithm by itself or even with a different CFD algorithm. The taskgraph facilitates this integration by allowing each application component (MPM and CFD in this example) to describe their tasks independently.


The scheduler connects these tasks where data is exchanged between the different algorithm phases. In this fashion, a fine-grained interleaving of these different algorithms is obtained.
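To see how this interleaving falls out of name matching, the sketch below has two hypothetical components declare their tasks independently; the scheduler can then order tasks across components purely by matching produced names to consumed names. TaskDecl and the task and variable names are invented for illustration.

#include <iostream>
#include <map>
#include <set>
#include <string>

struct TaskDecl { std::set<std::string> needs, makes; };
using TaskGraph = std::map<std::string, TaskDecl>;

void addMPMTasks(TaskGraph& g) {
  g["mpm.interpolateToGrid"] = {{"particles.state"}, {"grid.velocity"}};
  g["mpm.advanceParticles"]  = {{"grid.velocityStar"}, {"particles.stateNew"}};
}
void addCFDTasks(TaskGraph& g) {
  g["cfd.momentumSolve"] = {{"grid.velocity"}, {"grid.velocityStar"}};
}

int main() {
  TaskGraph g;
  addMPMTasks(g);   // each component describes its tasks independently;
  addCFDTasks(g);   // the scheduler interleaves them via the shared names
  for (const auto& [name, t] : g)
    for (const auto& n : t.needs)
      std::cout << name << " depends on the producer of " << n << "\n";
}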

Second, the taskgraph can accommodate a wide range of unforeseen workloads. In C-SAFE simulations, load imbalances arise from a variety of situations: with particle-based methods particles may exist only in small portions of the domain; or ordinary differential equation-based reaction solvers may be more costly in some regions of space than in others. Using the taskgraph, the UCF can map patches to other processors to minimize overall load imbalance. Communication is performed automatically, and the system has the information necessary to predict whether data motion is likely to pay off. These features would be more difficult to implement if each scientist were burdened with these complexities when writing simulation components.

Third, the taskgraph helps manage the complexity of a mixed threads/MPI programming model. Many modern supercomputing architectures employ a number of shared-memory nodes connected together by a fast interconnect. These architectures are often programmed with a flat MPI model, ignoring the fact that 2-128 of the nodes actually share a single memory. Using the UCF, tasks can be mapped to threads in order to achieve multi-threaded execution within a node. The semantics of the DataWarehouse enable true sharing of the data, eliminating explicit communication and data redundancy between neighboring shared-memory processors. Once again, this would not be possible if simulation scientists were burdened with these additional complexities.

Fourth, the taskgraph can accommodate a mix of static and dynamic load-balancing mechanisms. In particular, we are developing a mechanism that uses fine-grained dynamic load balance within a shared-memory node, and a less frequent (every several timesteps) coarse-grained mechanism between address spaces. Threads execute tasks from a pool of ready tasks, while asynchronous MPI calls are made to communicate data between nodes.

Fifth, the taskgraph facilitates development of simulation components that allow pieces of the simulation to evolve independently. In many respects, this is the most important advantage for a large interdisciplinary project such as C-SAFE. Since C-SAFE is a research project, we need to accommodate the fact that most of the simulation components are still under development. The component-based architecture allows pieces of the system to be implemented in a basic form at first, and then to evolve as the technologies mature. Most importantly, the UCF allows the aspects of parallelism (schedulers, load balancers, parallel I/O, and so forth) to evolve independently of the simulation components. This allows the computer science effort to focus on these problems without waiting for the completion of the scientific applications or vice versa. Object-oriented and component-based programming techniques, such as adherence to well-defined interfaces, encapsulation, and dynamic composition, are used to provide that isolation. One example of the power of the composition model is the SingleProcessorScheduler. Many problems can be debugged on a single processor without the complexities of parallel debuggers. The SingleProcessorScheduler still provides execution over multiple patches, so (usually) if a component is made to work with this scheduler, it will work in parallel.

However, in addition to these advantages there are some disadvantages to the approach that we chose.

First, creating an optimal schedule for the taskgraph is known to be an NP-hard problem. However, we have found that by using some simple heuristics and exploiting the regularity in the problem, we can obtain respectable performance using straightforward scalable algorithms. Further refinements in these algorithms will receive more attention in the near future. Second, creation of the schedule can be very costly. We take advantage of the fact that the schedule does not need to be recomputed for each execution. It does need to be recomputed if the algorithm changes, or if the grid changes. In addition, the schedule may need to be recomputed periodically to maintain an optimal load balance. Third, the taskgraph requires a mental shift for parallel application programmers. Nevertheless, we found that the programmers were able to easily take a description of their algorithm and cast that into a set of tasks. The application was then able to run in parallel, even on hundreds of processors. For our applications, this benefit outweighed the cost of casting the algorithms in the dataflow execution model.


Fig. 1. A typical C-SAFE problem involving hydrocarbon fires and explosions of energetic materials.

Fig. 2. An example UCF taskgraph.

Fig. 3. UCF simulation components.

Fig. 4. A simple computational domain and a four-patch decomposition of that domain.

For C-SAFE, the advantages far outweighed the disadvantages. We note that for a typical purely structured-grid computation, the taskgraph may be overkill. For a purely unstructured-grid computation, the granularity would be too small and data dependencies would be more complex. The Structured AMR grids employed by C-SAFE seem to have just the right granularity for this approach to be successful.

Fig. 5. Communication of ghost nodes in a simple four-patch domain.


4.3. Components involved

Fig. 3 shows the main components involved in a typical C-SAFE simulation. The SimulationController is the component in charge of the simulation. It will manage restart files if necessary, and control the integration through time. First, it reads the specification of the problem from an XML input file. After setting up the initial grid, it passes the description to the simulation component. The simulation component can be a number of different things, including one of the two different CFD algorithms, the MPM algorithm, or a coupled MPM-CFD algorithm. The simulation components define a set of tasks to the scheduler. In addition, the DataArchiver component describes a set of output tasks to the scheduler. These tasks will save a specified set of variables to disk. Once all tasks are known to the scheduler, the load-balancer component uses the configuration of the machine (including processor counts, communication topologies, etc.) to assign tasks to processing resources. The scheduler uses MPI to communicate the data to the right processor at the right time and then executes callbacks into the simulation or DataArchiver components to perform the actual work. This process continues until the taskgraph is fully executed. The execution process is repeated to integrate further timesteps.
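Laid out as code, the control flow just described might look like the following schematic. Every type and method name here is an invented placeholder rather than the actual UCF interface.

struct ProblemSpec {};                 // parsed from the XML input file
struct Grid {};
struct Scheduler { void compile() {} void execute() {} };
struct SimulationComponent { void scheduleTimestepTasks(Scheduler&) {} };
struct DataArchiver { void scheduleOutputTasks(Scheduler&) {} };
struct LoadBalancer { void assignTasks(Scheduler&) {} };

void runSimulation(int numTimesteps) {
  ProblemSpec spec;                    // read the problem description (XML)
  Grid grid;                           // set up the initial grid
  SimulationComponent sim;             // MPM, CFD, or coupled MPM-CFD
  DataArchiver archiver;
  LoadBalancer lb;
  Scheduler sched;
  sim.scheduleTimestepTasks(sched);    // simulation tasks
  archiver.scheduleOutputTasks(sched); // output tasks (save variables to disk)
  lb.assignTasks(sched);               // map tasks to processing resources
  sched.compile();                     // build the detailed taskgraph
  for (int t = 0; t < numTimesteps; ++t)
    sched.execute();                   // MPI communication + task callbacks
}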

Each of these components runs concurrently on each processor. The components communicate with their counterparts on other processors using MPI. However, the scheduler is typically the only component that needs to communicate with other processors.

4.4. Definitions

Consider Fig. 4. We define several terms which we use in discussing Structured AMR grids:

• Patch: A contiguous rectangular region of index space and a corresponding region of simulated physical space. The domain on the right of Fig. 4 is the same as the domain on the left, except that it has been decomposed into four patches.
• Cell: A single coordinate in the integer index space, also corresponding to the smallest unit in simulated physical space. A variable centered at the cells in the simulation would have a value corresponding to each of the X's in Fig. 4.
• Node: An entity at the corners of each of the cells. A variable centered at the nodes in the simulation would have a value corresponding to each of the O's in Fig. 4.
• Face: The faces join two cells. The UCF represents values on X, Y, and Z faces separately.
• Ghost cell: Cells (or nodes) that are associated with a neighboring patch, but are copied locally to fulfill data dependencies from outside of the patch.

4.5. Variable types

UCF simulations are performed using a strict "owner computes" strategy. This means that each topological entity (a node, cell or face) belongs to exactly one patch. There are several variable types that represent data associated with these entities. An NCVariable (node-centered variable) contains values at each Node in the domain. Similarly, CCVariables contain values for each cell, and XFCVariables, YFCVariables and ZFCVariables are face-centered values for the faces corresponding to the X, Y and Z axes. A single variable class represents all of the values for a single patch (possibly with ghost cells) and is accessed as a three-dimensional array. Each of these variable types is a C++ template class, therefore a node/cell/face-centered value can be any arbitrary type. Typically values are a double-precision number representing pressure, temperature, or some other scalar throughout the field, but values may also be a more complex entity, such as a vector representing velocity or a tensor representing a stress. Variables can be passed to Fortran numerical kernels without copying. PerPatch variables store one (templated) value with each patch in the domain. Reduction variables are used to compute a global sum, min, max, or other operations over the entire domain.

In addition to the topological-based variables described above, there is one additional variable type: ParticleVariable. This variable contains values associated with each particle in the domain. A special particle variable contains the position of the particle. Other particle variables are defined by the simulations, and in the case of the MPM algorithm include quantities like temperature, acceleration, stress and so forth. For the purposes of the discussions below, particles can be considered a fancy type of cell-centered variable, since each particle is associated with a single cell. It is important, however, to point out that explicit lists of particles within a cell are not maintained.


We have found it more efficient to determine particle/cell associations as they are needed instead of paying the high cost of maintaining lists of particles for each cell.
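A bare-bones rendering of the variable families named above, with the per-patch, three-dimensional array machinery stripped away. These are hypothetical declarations meant only to show how the naming maps onto centering.

#include <vector>

template <class T> struct NCVariable  { std::vector<T> data; };  // node-centered
template <class T> struct CCVariable  { std::vector<T> data; };  // cell-centered
template <class T> struct XFCVariable { std::vector<T> data; };  // x-face-centered
template <class T> struct PerPatch    { T value; };              // one value per patch
template <class T> struct ReductionVariable { T value; };        // global sum/min/max
template <class T> struct ParticleVariable  { std::vector<T> data; };  // one value per particle

struct Vector3 { double x, y, z; };

// Typical instantiations: scalars, vectors, or tensors per entity.
NCVariable<double>       pressure;        // scalar at each node of a patch
CCVariable<Vector3>      velocity;        // vector at each cell
ParticleVariable<double> particleTemp;    // value carried by each particle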

4.6. A around B

Tasks describe data requirements in terms of their computations on node, cell and face-centered quantities. A task that computes a cell-centered quantity from the values on surrounding nodes would establish a requirement for one layer of nodes around the cells within a patch. This is termed "nodes around cells" in UCF terminology. As shown in Fig. 5, a layer of ghost nodes would be copied from neighboring patches on the top and right edges of the lower-left patch. In a four-processor simulation this copy would involve MPI messages from each of the other three processors. It is important to note the asymmetry in this process; data is often not required from all 26 (in 3D) neighbors to satisfy a computation. Symmetry comes when a subsequent step uses "cells around nodes" to satisfy another data dependency.
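In a task declaration, an "A around B" requirement might be phrased roughly as follows; the enum and struct are illustrative stand-ins for the real UCF task interface. The scheduler would turn such a requirement into the ghost-node copies (and MPI messages) shown in Fig. 5.

#include <string>

enum class Ghost { None, NodesAroundCells, CellsAroundNodes };

struct TaskRequirement {
  std::string variable;   // name of the value in the DataWarehouse
  Ghost ghostType;        // which "A around B" pattern, if any
  int ghostLayers;        // how many layers outside the patch
};

// A task computing a cell-centered quantity from surrounding nodes:
TaskRequirement nodeInput{"grid.velocity", Ghost::NodesAroundCells, 1};
// Its output needs (and is allowed) no ghost cells:
TaskRequirement cellOutput{"cell.divergence", Ghost::None, 0};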

In this fashion, each task in the algorithm specifies a set of data requirements. Similarly, each task specifies a set of data which it will compute, but in this case no ghost cells are necessary (or allowed). These "computes and requires" lists for each task are collected to create the full taskgraph.

A task could specify that it requires data from the entire computational domain. However, for typical scalable algorithms, the tasks ask for only one (or possibly two) layers of data outside of the patch.

4.7. Compilation

Using the data dependencies described above, the UCF scheduler compiles a coarse representation of the taskgraph. This representation contains all of the data dependency information, but contains only a single vertex (graph bubble) for each task. The full taskgraph will contain a vertex for each patch/task combination. However, this full representation is not fully instantiated on each processor.

First, a topological sort is performed on the coarse taskgraph. The scheduler then creates a detailed taskgraph for the subregion of the total graph which covers the neighborhood of a particular processor. It will instantiate graph vertices for the task/patch combinations which are assigned to the processor, and will instantiate graph edges which connect to or from those vertices. Finally, it will create vertices that connect to those edges. In this fashion, the detailed taskgraph contains the vertices for tasks owned by this processor and for those tasks on other processors with which it will communicate. The detailed taskgraph uses a very compact representation since the number of detailed tasks can be significant. However, there are O(1) detailed tasks on each processor for scalable simulations.

After instantiating the detailed tasks, the UCF scheduler performs a set of analysis functions on the resulting taskgraph. It ensures that the application programmers use every variable that is computed, and that they do not require variables that are never produced. It also ensures that the types of variables match between the computes and requires lists, including the type of the underlying templated data. Finally, the scheduler analyzes the lifetimes of each variable used throughout the execution of the taskgraph. Variables that hold intermediate quantities are scheduled for deletion when no more tasks require them.

The compilation process proceeds simultaneously on each processor without communication between processors.


4.8. I/O and checkpointing

Data output is scheduled using the taskgraph just like any other computation. Constraints specified with the task allow the load-balancing components to direct those tasks (and the associated data) to the processors where data I/O should occur. In typical simulations, each processor writes data independently for the portions of the dataset which it owns. This requires no additional parallel communication for output tasks. However, in some cases this may not be ideal. The UCF can also accommodate situations where disks are physically attached to only a portion of the nodes, or a parallel filesystem where I/O is more efficient when performed by only a fraction of the total nodes.

Checkpointing is obtained by using these output tasks to save all of the data in the DataWarehouse at the end of the timestep. Data lifetime analysis ensures that only the data required by subsequent iterations will be saved. If the simulation components have been correctly written to store all of their data in the DataWarehouse, restart is a trivial process.


During restart, the components process the XML specification of the problem that was saved with the datasets, and then the UCF creates input tasks that load the DataWarehouse from the checkpoint files. If necessary, data redistribution is performed automatically during the first execution of the taskgraph. In a similar fashion, changing the number of processors is possible. The current implementation does not redistribute data among the patches when the number of processors is changed. Patch redistribution is a useful component even beyond changing the processor count, and will be implemented in the future.

4.9. Legacy MPI compatibility

To accommodate software packages that were not written using the UCF execution model, we allow tasks to be specially flagged as "using MPI". These tasks will be gang-scheduled on all processors simultaneously, and will be associated with all of the patches assigned to each processor. In this fashion, UCF applications can use available MPI libraries, such as PETSc [20] and hypre [21].

4.10. Execution

On a single processor, execution of the taskgraph is simple. The tasks are simply executed in the topologically sorted order. This is valuable for debugging, since multi-patch problems can be tested and debugged on a single processor. In most cases, if the multi-patch problem passes the taskgraph analysis and executes correctly on a single processor, then it will execute correctly in parallel.

In a multi-processor machine the execution process is more complex. In an MPI-only implementation, there are a number of ways to utilize MPI functionality to overlap communication and I/O. We describe one way that is currently implemented in the UCF.

We process each detailed task in a topologically sorted order. For each task, the scheduler posts non-blocking receives (using MPI_Irecv) for each of the data dependencies. Subsequently we call MPI_Waitall to wait for the data to be sent from neighboring processors. After all data has arrived, we execute the task. When the task is finished, we call MPI_Isend to initiate data transfer to any dependent tasks. Periodic calls to MPI_Waitsome for these posted sends ensure that resources are cleaned up when the sends actually complete.
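A stripped-down skeleton of that MPI-only strategy is sketched below. The MPI routines are the real ones named in the paragraph; the task and buffer bookkeeping around them is invented for illustration.

#include <mpi.h>
#include <vector>

struct Dependency { double* buffer; int count; int fromRank; int tag; };
struct Output     { double* buffer; int count; int toRank;   int tag; };
struct Task {
  std::vector<Dependency> inputs;
  std::vector<Output> outputs;
  void run() { /* callback into the simulation component */ }
};

void executeTasks(std::vector<Task>& sortedTasks, std::vector<MPI_Request>& pendingSends) {
  for (Task& task : sortedTasks) {                 // topologically sorted order
    std::vector<MPI_Request> recvs;
    for (Dependency& d : task.inputs) {            // post non-blocking receives
      recvs.emplace_back();
      MPI_Irecv(d.buffer, d.count, MPI_DOUBLE, d.fromRank, d.tag,
                MPI_COMM_WORLD, &recvs.back());
    }
    MPI_Waitall((int)recvs.size(), recvs.data(), MPI_STATUSES_IGNORE);
    task.run();                                    // all inputs have arrived
    for (Output& o : task.outputs) {               // initiate sends to dependents
      pendingSends.emplace_back();
      MPI_Isend(o.buffer, o.count, MPI_DOUBLE, o.toRank, o.tag,
                MPI_COMM_WORLD, &pendingSends.back());
    }
    if (!pendingSends.empty()) {                   // periodically reclaim the
      int done = 0;                                // resources of completed sends
      std::vector<int> idx(pendingSends.size());
      MPI_Waitsome((int)pendingSends.size(), pendingSends.data(),
                   &done, idx.data(), MPI_STATUSES_IGNORE);
    }
  }
}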

The mixed MPI/thread execution is somewhat different. First, non-blocking MPI_Irecvs are posted for all of the tasks assigned to the processor. Then each thread will concurrently call MPI_Waitsome and will block for internal data dependencies (i.e. from other tasks) until the data dependencies for any task are complete. That task is executed and the data that it produces is sent out. The thread then goes back and tries to complete a next task. This implements a completely asynchronous scheduling algorithm. Preliminary results for this scheduler indicate that a performance improvement of approximately 2x is obtainable. However, thread-safety issues in vendor MPI implementations have slowed this effort.

It can be seen that dramatically different communication styles can be employed by simply changing out the scheduler component. The application components are completely insulated from these variations. This is a very important aspect that allows the Computer Science teams to focus on the best way to utilize the communication software and hardware on the machine without requiring sweeping changes in the application. Each scheduler implementation consists of less than 1000 lines of code, so it is relatively easy to write one that will take advantage of the properties of the communication hardware available on a machine. Often, the only difficult part is getting the correct information from the vendor in order to determine the best strategy for communicating data.

4.11. Load balancing

The load balancer component is responsible for assigning each detailed task to one processor.

To date, we have implemented only simple static load-balancing mechanisms. However, the UCF was designed to allow very sophisticated load-balance analysis algorithms. In particular, a cost model associated with each task will allow an optimization process to determine the optimal assignment of tasks to processing resources. Cost models associated with the communication architecture of the underlying machine are also available. One interesting aspect of the load-balance problem is that integrated performance analysis in the UCF [22] will (in the future) allow the cost models to be corrected at run-time to provide the most accurate cost information possible to the optimization process.

The mixed thread/MPI scheduler described above implements a dynamic load-balancing mechanism (i.e. a work queue) within a shared-memory node, and uses a static load-balancing mechanism between nodes. We feel that this is a powerful combination that we will pursue further.

Careful readers will pick up on the fact that the creation of detailed tasks requires knowledge of processor assignment. However, sophisticated load-balance components may require this detailed information before they can optimize the task/processor assignments. We use a two-phase approach where tasks are assigned arbitrarily, then an optimization is performed and the final assignments are made to the tasks. Subsequent load-balance iterations use the previous approximation as a starting point for the optimization process.

4.12. Adding a new component

Adding a new simulation to the system consists of writing a new simulation component. The other components in the system (schedulers, load-balancers, simulation controller, DataArchiver, etc.) do not need to be modified when a new simulation component is introduced into the system.

The UCF has a structure that is quite different from a typical parallel application, and the applications presented here have all been developed specifically for the UCF. One of them, Arches, existed prior to the UCF and has been modified to work with this model. However, one may wish to take an existing PDE solver and create a UCF component to take advantage of the parallel infrastructure. The complexity of adapting an existing code can vary, depending on the modularity of the code and the specifics of the algorithm. Specific steps required to port an existing PDE solver might include:

• Replace the high-level structure (typically the main program) with the UCF mechanism for describing the taskgraph.
• Add adapters from the UCF task callbacks to the PerPatch steps of each algorithm. We have mechanisms for passing UCF variables into both Fortran and C/C++ arrays without copying. Due to language conventions, Fortran arrays are accessed in X-axis major form, and C/C++ arrays in Z-axis major form.

• Modify the code to execute each step of the algorithm on a single patch. Typically this requires reconsideration of how boundary conditions are applied to the domain.

• Optionally, modify the parameter setting mechanisms in the code to use the UCF XML description.

If the code is already parallel with a patch-wise decomposition, one may be able to utilize the gang-scheduled task description described in Section 4.9 for portions of the code without restructuring the code completely.
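Put together, a newly ported component might have roughly the following shape. The scheduler interface and method names are hypothetical; the point of the sketch is that only this class is written when a new simulation is added, while the schedulers, load balancers, and DataArchiver remain untouched.

#include <string>
#include <vector>

struct Patch {};
struct DataWarehouse {};
struct Scheduler {
  // register a named task with its requires/computes lists (sketch only)
  void addTask(const std::string& name,
               std::vector<std::string> requiresVars,
               std::vector<std::string> computesVars) {}
};

class MySolverComponent {
public:
  void problemSetup(/* parsed UCF XML description */) {}

  void scheduleTimeAdvance(Scheduler& sched) {
    // replaces the old main loop: describe each per-patch step as a task
    sched.addTask("mySolver.computeFlux", {"conserved"}, {"flux"});
    sched.addTask("mySolver.advance", {"conserved", "flux"}, {"conservedNew"});
  }

  // callbacks invoked by the scheduler for each patch, typically thin
  // adapters around existing Fortran or C/C++ kernels
  void computeFlux(const Patch&, DataWarehouse& oldDW, DataWarehouse& newDW) {}
  void advance(const Patch&, DataWarehouse& oldDW, DataWarehouse& newDW) {}
};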

4.13. Applications and results

The system described here has been used to implement a variety of simulations. Two different CFD algorithms, one MPM (solid mechanics) algorithm, and two different coupled CFD-MPM algorithms have been implemented to date. Each of these simulations runs quite well in parallel. Fig. 6 shows the scalability of the MPM application on up to 2000 processors of the ASCI Nirvana machine at Los Alamos National Laboratory.

Fig. 6. Parallel performance of a typical C-SAFE problem. This is a 16 million particle MPM computation, running on up to 2000 processors of the Los Alamos National Laboratory Nirvana machine. The straight line represents ideal performance, while the boxes show data points of performance actually obtained. There is a slight slowdown as the computation spans more than one 128 processor Origin 2000, but continues on linearly to the maximum number of processors on the machine.


Nirvana is a cluster of 16 SGI Origin 2000 ccNUMA machines, each consisting of 128 MIPS R10K processors running at 250 MHz. The machines are connected with a 6.4 Gigabit GSN (Gigabyte System Network).

125 of the 128 processors on each node were utilized for the simulation, leaving three to attend to operating system and communication issues. Super-linear speedups at low processor counts are the result of higher cache efficiencies due to the constant-size problem. As a result, we show the linear speedup line from the highest performance per processor, at eight processors for this problem.

Fig. 7. Visualization of two different simulations from C-SAFE. On the left is a simulation of a heptane fire. On the right is a simulation of stress propagation through a block of granular material. Each of these simulations was performed using the UCF and was executed on 1000 processors.


It should be noted that the problem shown is relatively small, with timesteps completing in much less than 1 s for the large processor configurations. Typical computations contain 100s of millions of particles or more, resulting in even better parallel efficiency in practice. One of the CFD components (Arches) has similar scaling properties to MPM; the others we have not yet had the opportunity to push to large numbers of processors. Fig. 7 shows results from 1000 processor simulations using the Arches CFD component and the MPM component.

We have also demonstrated parallel performance on 960 processors of Lawrence Livermore National Laboratory's Frost, consisting of 1088 POWER3 processors running at 375 MHz.

4.14. Future work

The past few years of this project have focused on developing a flexible framework in which to accomplish scalable parallel simulations. Now that we have the basic infrastructure in place, there are a number of research projects worthy of pursuit, including: development of feedback-based load-balancing components that utilize task performance information collected at run-time to optimize resource allocations; further development and tuning of the mixed thread/MPI scheduler; investigation and development of optimization algorithms that operate on an incomplete representation of the global graph; and further development of the UCF's AMR capabilities.

We have described only a fraction of a very complex system. Other interesting aspects, such as the template implementations of the DataWarehouse, the augmented run-time typing system, and other features will be described in more detail in the future.

5. Conclusions

We have presented a powerful way of building complex PDE simulation software. We have demonstrated that the UCF system can obtain good performance on up to 2000 processors. Using the UCF, C-SAFE scientists have been able to simulate complex physical phenomena that were not achievable using smaller-scale versions of the code.

Acknowledgements

This work was supported by the DOE ASCI ASAP Program. James Bigler, J. Davison de St. Germain, and Wayne Witzel are very valuable members of the C-SAFE PSE team that developed this software. C-SAFE visualization images were provided by Kurt Zimmerman and Wing Yee. Datasets were created by Scott Bardenhagen, Jim Guilkey, Todd Harman, Seshadri Kumar, Rajesh Rawat, and John Schmidt. The DOE ASCI ASAP program also provided computing time for the simulations shown.

References

[1] A.M. Bruaset, H.P. Langtangen, Comprehensive set of tools for solving partial differential equations, in: M. Daehlen, A. Tveito (Eds.), Numerical Methods and Software Tools in Industrial Mathematics, Diffpack, Birkhauser, Basel, 1996, Chapter A.

[2] W.R. Dyksen, C.J. Ribbens, Interactive Ellpack: an interactive problem-solving environment for elliptic partial differential equations, ACM Trans. Math. Softw. (TOMS) 13 (2) (1987) 113-132.

[3] F. Bassetti, D. Brown, K. Davis, W. Henshaw, D. Quinlan, Overture: an object-oriented framework for high performance scientific computing, in: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing (CDROM), IEEE Computer Society Press, Washington D.C., 1998, pp. 1-9.

[4] S. Atlas, S. Banerjee, J.C. Cummings, P.J. Hinker, M. Srikant, J.V.W. Reynders, M. Tholburn, POOMA: a high-performance distributed simulation environment for scientific applications, in: Proceedings of Supercomputing '95, December 1995.

[5] R. Hornung, S. Kohn, The use of object-oriented design patterns in the SAMRAI structured AMR framework, in: Proceedings of the SIAM Workshop on Object-oriented Methods for Interoperable Scientific and Engineering Computing, October 1998. http://www.llnl.gov/CASC/SAMRAI.

[6] M. Berzins, R. Fairlie, S.V. Pennington, J.M. Ware, L.E. Scales, SPRINT2D: adaptive software for PDEs, ACM Trans. Math. Softw. 24 (4) (1998) 475-499.

[7] Center for the Simulation of Accidental Fires and Explosions, Annual Report, Year 2. http://www.csafe.utah.edu/documents.

[8] T.C. Henderson, P.A. McMurtry, P.J. Smith, G.A. Voth, C.A. Wight, D.W. Pershing, Simulating accidental fires and explosions, Comput. Sci. Eng. 2 (1994) 64-76.

[9] Common Component Architecture Forum. http://z.ca.sandia.gov/~cca-forum.

[10] R. Armstrong, D. Gannon, A. Geist, K. Keahey, S. Kohn, L. McInnes, S. Parker, B. Smolinski, Toward a common component architecture for high-performance scientific computing, in: Proceedings of the Eighth IEEE International Symposium on High Performance Distributed Computing, 1999.


[11] S.G. Parker, D.M. Beazley, C.R. Johnson, Computational steering software systems and strategies, IEEE Comput. Sci. Eng. 4 (4) (1997) 50-59.

[12] S.G. Parker, The SCIRun problem solving environment and computational steering software system, Ph.D. Thesis, University of Utah, 1999.

[13] S.G. Parker, C.R. Johnson, SCIRun: a scientific programming environment for computational steering, in: Proceedings of Supercomputing '95, IEEE Press, New York, 1995.

[14] J.D. de St. Germain, J. McCorquodale, S.G. Parker, C.R. Johnson, Uintah: a massively parallel problem solving environment, in: Proceedings of the Ninth IEEE International Symposium on High Performance and Distributed Computing, August 2000.

[15] J. McCorquodale, S. Parker, J. Davison, C. Johnson, The Utah parallelism infrastructure: a performance evaluation, in: Proceedings of the 2001 High Performance Computing Symposium (HPC'01), Advanced Simulation Technologies Conference, 2001.

[16] A. Cleary, S. Kohn, S.G. Smith, B. Smolinski, Language interoperability mechanisms for high-performance scientific applications, in: Proceedings of the SIAM Workshop on Object-oriented Methods for Interoperable Scientific and Engineering Computing, October 1998.

[17] D. Sulsky, Z. Chen, H.L. Schreyer, A particle method for history dependent materials, Comput. Meth. Appl. Mech. Eng. 118 (1994) 179-196.

[18] S. Vajracharya, S. Karmesin, P. Beckman, J. Crotinger, A. Malony, S. Shende, R. Oldehoeft, S. Smith, SMARTS: exploiting temporal locality and parallelism through vertical execution, 1999.

[19] J.T. Feo, D.C. Cann, R.R. Oldehoeft, A report on the Sisal language project, J. Parallel Distrib. Comput. 10 (4) (1990) 349-366.

[20] S. Balay, W. Gropp, L. McInnes, B. Smith, PETSc home page, 1999.

[21] hypre: high performance preconditioners. http://www.llnl.gov/casc/hypre/.

[22] J.D. de St. Germain, A. Morris, S.G. Parker, A.D. Malony, S. Shende, Performance analysis in the Uintah software development cycle, Int. J. Parallel Program. 31 (1) (2003) 35-53.

Steven G. Parker is a Research Assistant Professor in the School of Computing and the Scientific Computing and Imaging (SCI) Institute at the University of Utah. His research focuses on problem solving environments, which tie together scientific computing, scientific visualization, and computer graphics. He is the principal architect of the SCIRun Software System, which formed the core of his PhD dissertation, and is currently the chief architect of Uintah, a software system designed to simulate accidental fires and explosions using thousands of processors. He was a recipient of the Computational Science Graduate Fellowship from the Department of Energy. He received a BS in electrical engineering from the University of Oklahoma in 1992, and a PhD from the University of Utah in 1999.