Scalability of Parallel and Distributed Modeling ...Communication scalability (or lack of scalability) is sometimes explained in the following way. Consider two machines with 8 processors

Simulation Interoperability Workshop, Paper 11S-SIW-065 DRAFT 4/4/11

1

Scalability of Parallel and Distributed Modeling & Simulation Architectures

Jeffrey S. Steinman, Ph.D. Craig N. Lammers Maria E. Valinski

Wendy L. Steinman

WarpIV Technologies, Inc.

5230 Carroll Canyon Road, Suite 306 San Diego, CA 92121

(858) 605-1646

[email protected], [email protected], [email protected], [email protected]

Key Words: Scalability, Open Unified Technical Framework, OpenUTF, Parallel and Distributed Simulation, PDES, PDMS, HPC,

Multicore, Manycore, OpenMSA, OSAMS, OpenCAF, WarpIV Kernel

ABSTRACT: Characterizing scalability is vital for shaping future modeling & simulation architecture and framework standards. Common performance-related scalability metrics include: (1) execution speed, (2) memory consumption, and (3) communication bandwidth and/or throughput. Other scalability factors that are often neglected include: costs for technical training, operational usage, scenario generation, testing, troubleshooting, performing VV&A, and providing overall life-cycle software support for the program. Scalability metrics are often sensitive to: (a) the number of entities in a particular scenario, (b) processor configurations, (c) machine and network topology, (d) model composition structures, (e) scenario layout, (f) cause and effect dependencies within the physical system being modeled, (g) time and/or model resolution, and (h) software engineering practices. Scalability metrics are often sensitive to internal algorithms, data structures, inter-process communication techniques, publish and subscribe data distribution services, modeling constructs, critical-path dependencies between entity interaction patterns, time resolution and/or model resolution used, approach towards system composability, and the actual models themselves.

This paper first defines performance-related scalability in terms of observable metrics that are affected by one or more independent variables. Then, the subject of scalable software engineering is discussed with an emphasis on reducing costs through composability methodologies based on emerging architecture standards. Finally, this paper describes a series of specific parameterized performance benchmarks that can be used to measure the scalability of parallel and distributed simulation architectures and their framework implementations.

The Simulation Interoperability Standards Organization (SISO), Parallel and Distributed Modeling & Simulation Standing Study Group (PDMS-SSG), is investigating the potential standardization of three synergistic architectures that comprise the Open Unified Technical Framework (OpenUTF). These architectures are: (1) Open Modeling & Simulation Architecture (OpenMSA) – focuses on the technology of parallel and distributed simulation, (2) Open System Architecture for Modeling & Simulation (OSAMS) – focuses on plug-and-play model composability, and (3) Open Cognitive Architecture Framework (OpenCAF) – focuses on the representation of intelligent or cognitive behaviors. The WarpIV Kernel provides a freely available open-source reference implementation of the OpenUTF for use by United States citizens. The OpenUTF and its three architectures provide the basis for discussing scalability in this paper. A series of standard benchmarks are proposed for the OpenUTF to help ensure scalability across all metrics. These benchmarks will be used to characterize scalability and overall performance of future OpenUTF implementations and/or enhancements made to the current reference implementation.

1.0 Introduction

Scalability, in its simplest form, usually refers to how a measurable function (i.e., a metric) of several variables behaves as one or more of its variables changes, while

holding the other variables constant. In general, the function could represent anything of interest, including costs for manufacturing and distributing a product, engineering staff required to support a program over its life cycle, time to complete a task, labor required to

10S-SIW-052 DRAFT – WarpIV Technologies, Inc. 3/1/11

2

verify, validate, and accredit a simulation program, etc. This paper primarily focuses on scalability from the standpoint of Parallel and Distributed Modeling & Simulation (PDMS) applications. In particular, this paper focuses on performance-related scalability concerning execution time, memory consumption, and overall communication overheads. A secondary focus is on metrics relating to the cost of software engineering, interoperability, testing, and general software reuse. These cost-related issues can achieve scalability through architecture standards that promote component-based plug-and-play interoperability.

Scalability and performance do not mean the same thing. An application that does not scale very well might actually perform better than another application with perfect scalability. This is common in data structures such as list and tree containers, where for small numbers of elements, managing those elements in a continually updated sorted list might be more efficient than maintaining them in a more naturally sorted binary tree. Lists, being very simple data structures, generally have lower overheads than trees. However, as the number of elements increases, the tree gradually begins to outperform the list because it scales as Log2(n), while the list scales as n, where n is the number of elements. These concepts extend beyond just software execution speed to software engineering practices with regard to managing small and large teams. Mature software development organizations understand that the same engineering processes for large teams do not necessarily translate well when applied to small nimble teams.

2.0 Performance-related Scalability

The run time speed of an entity-based simulation executing on multiple processors is typically affected by several independent variables, including:

1. The number of entities in the scenario 2. The number of processors used during execution 3. Communication bandwidth and latency 4. Time resolution and/or fidelity of the models

For example, one might consider how the number of processing nodes utilized during execution affects run time performance, while holding all other variables such as the scenario, time/model resolution, communication infrastructure, etc., constant (see Figure 1).

For general functions of multiple variables, if its graph as a function of one of those variables is linear, with the absolute value1 of the slope equal to one, the function is

1 The absolute value of the slope is used because some

functions (e.g., run time) might decrease, while other functions (e.g., speedup) increase. Whether the function

said to perfectly scale with 100% efficiency, with respect to that variable. Graphs with a linear slope between zero and one are also said to scale, but do not achieve 100% efficiency. Often times, a graph might initially scale2 in a linear manner and then begin to flatten out. The asymptote helps establish the limits of scalability.3 One might say that the scalability of the application breaks down at that point.

Figure 1: Examples of scalability graphs.

A weaker form of scalability is logarithmic, where the performance degrades by a logarithmic factor. For example when executing an application on a hypercube parallel computer (see Figure 2), doubling the number of nodes adds one more communication hop for worst-case message routing. The number of worst-case hops goes as Log2(n), where n is the number of nodes. Other communication topologies such as 2D and 3D meshes behave as the sum of the mesh dimensions, resulting in square root or cube root scalability as a function of the number of nodes. This can dominate performance for applications that are bounded by communication overheads.4 Cluster computing architectures often provide high-speed communications between blades stacked within a rack, and then additional communication support between multiple racks. Clusters are relatively inexpensive to develop, but typically offer less scalable communication between processors than, for example,

increases or decreases as a function of the independent variable depends on how the function is defined.

2 There is usually a performance hit taken when going from one node to two nodes because of the additional overhead for supporting messages, rollbacks, time management, etc.

3 Sometimes, functions not only flatten out, but also get worse. So, an application might actually run slower as more processors are applied.

4 Applications typically have several performance factors that are often hidden by other factors until they are stressed. So, an application may have minimal communication overheads for small numbers of nodes, but then becomes communication bound for larger processor configurations.


3

cores residing within a multicore chip5 or several multicore chips residing on a single board. Providing scalable memory access between cores on a multicore chip is a primary focus of modern chip designers.

Figure 2: Hypercube and 2D/3D mesh parallel computing communication topologies.

100% efficiency over all possible independent variables is never achieved in practice. For example, adding more compute nodes at some point will not improve run time performance, and will likely degrade performance. However, it is common to consider scalability in terms of two variables. It is reasonable to say that an application scales if the run time, memory consumption per node, and communications overhead per node are unaffected when simultaneously (1) doubling the number of processing nodes and (2) doubling the number of entities. These kinds of considerations lead to establishing scalability criteria where the size of the problem scales with the computing resources provided. In reality, a modeling & simulation application is said to scale well if doubling the number of entities and number of processors only slightly increases the execution time, memory consumed per node, and communication overhead per node.

Communication scalability (or lack of scalability) is sometimes explained in the following way. Consider two machines with 8 processors and then consider how the machines communicate when hooking them up as if the machine was designed as a 16-processor machine. Hypercubes have a fairly rich approach towards scalability. Hooking two 8-node hypercubes together into a 16-node hypercube is accomplished by having each node add one more communication link to its counterpart (see inner box and outer box in Figure 2).

5 Multicore chips with 2-8 cores typically serialize access to

shared memory on the chip. Larger numbers of cores on a chip typically use hierarchical memory structures with internal cache coherency mechanisms to achieve scalability.

More typical network topologies might connect two 8-node machines through a low cost commodity router (see Figure 3), which can quickly become a bottleneck. Scalability would be hindered for parallel software applications requiring frequent communication between the two 8-node machines. Load balancing techniques that attempt to minimize message passing between nodes, while equally distributing computations can help preserve scalability in these situations.

Figure 3: Communication bottlenecks using off-the-shelf network routers to connect multiple parallel computers can impact scalability when machines frequently communicate.

Unfortunately, some commonly used algorithms exhibit quadratic scalability. For example, algorithms that require every entity to perform a computation for every other entity results in n2 execution-time behavior, where n is used to represent the number of entities. One example of this is brute force sensor models that loop through all other entities during each scan event to determine which entities are actually detected. Another example of this is range-based filtering techniques that compute and schedule all possible events for when entities enter and exit each other’s detection range.6 Assuming perfect communications scalability, systems with quadratic scalability typically require quadrupling the number of processors when doubling the number of entities. In practice, these kinds of algorithms only work for simulations with small numbers of entities7 and should be avoided for use in any large-scale application.8

6 This technique is sometimes called the “pure” discrete event

approach. However, it does not scale for several reasons. First the number of events pending in the event queue goes as n2, which quickly becomes excessive for large scenarios. This causes slow-down in event-queue management as well as out-of-control memory consumption. In addition, this approach introduces thrashing for simulations where entities continually change their motion unexpectedly.

7 Engineering-level simulations often study one-on-one type engagements between entities. So n2 algorithms might be acceptable for these types of simulations.

8 Publish and subscribe data distribution techniques must be used to avoid the situation where every entity interacts with every other entity.


4

Another interesting situation that can occur is superlinear scalability. A good example of this is how events are managed in parallel discrete event simulations. Pending events are often managed in Log2(n) event queue data structures9 such as trees or heaps, where n is the number of pending events at a given point in time. In steady state, n is constant. When distributing entities across nodes and executing the simulation in parallel, the number of events per node decreases on average to n/N, where N represents the number of processing nodes. This assumes that N < n. When executing in parallel, the time for each node to manage an event is Log2(n/N). In addition, each node manages its events in parallel with the other nodes, so the effective time per event decreases by another factor of 1/N. Thus the average time per event across all nodes is (1/N) Log2(n/N). Speedup is defined as the time to perform an operation on one node divided by the time to perform the same operation on N nodes. So, the speedup for managing events in a logarithmic event queue data structure is represented as N / [1 - Log2(N) / Log2(n)]. Notice that (superlinear) speedup is larger than N.

Superlinear speedup for event management in parallel simulation has caused some debate within the simulation community over the years. Some would say that superlinear speedup should never occur and that the problem is within the data structure, itself. Notice that a simpler data structure for managing events, such as a linked list with O(n) computational complexity per event exhibits even more O(N2) apparent superlinear speedup.

However, there is a simple explanation. When executing sequentially, events as they are processed in simulated time are automatically able to produce output in time-ascending order.10 When executing in parallel, extra work is required to order the output streams generated by multiple nodes. If done properly, the extra work to merge the parallel output streams exactly cancels the superlinear speedup term. The superlinear speedup exhibited in event management can be explained by considering information lost when executing in parallel. If a simulation does not need to produce output streams in ascending time order, then perhaps the sequential simulation is doing more work than necessary and other event queue data structures might be more efficient. On the other hand, if streaming time-ordered output is required, then the superlinear speedup issue goes away.

9 Event queue data structures require three operations, (1) insert

event with given time stamp, (2) remove event with earliest time stamp, and (3) remove a pending event with a given time stamp. Trees satisfy all three operations, while heaps only satisfy the first two. The third operation is needed for simulations to retract or cancel previously scheduled events.

10 For example, a simulation might need to generate output for driving a time-based display.

Instantaneous scalability can be thought of as determining the scalability, i.e., slope with respect to a particular variable, of a function at a point. This is represented as the partial derivative of the function with respect to one of its variables, while holding the other variables constant. For example, one could take the partial derivative of an application’s run time as a function of the number of nodes, while holding the number of entities, time resolution, or any other variable, constant. In this case, the absolute value of the slope represents the instantaneous efficiency at a point along the graph. When the slope levels out (i.e., equals 0), the performance is unchanged by adding more nodes. Eventually, the slope changes from negative (i.e., run times decrease as the number of nodes increases) to positive (i.e., the run time actually increases as the number of nodes increases). The concept of instantaneous scalability can be extended to multiple variables through the use of gradients.

Scalability is also affected by the time or resolution of the model. For continuous time-stepped simulations, smaller steps result in more accurate computations, but at the expense of increasing execution times. Some models use a resolution parameter to compute their next time step, which can vary from step to step. This discrete-event approach tends to produce more accurate results with faster execution times,11 but requires a significantly more sophisticated simulation engine when executing in parallel than simple time stepped engines. Nevertheless, tighter resolutions still result in smaller time steps, which increases execution times. Therefore, one should also consider run-time performance and scalability as a function of either time or model resolution parameters.

3.0 Scalable Software Engineering

One of the difficult challenges in managing software programs is minimizing overall life-cycle engineering costs. This includes formulating the vision of the program, performing basic R&D through the use of rapid prototyping and benchmarking to confirm general feasibility, producing meaningful and executable requirements, designing the system, training the engineering staff on the tools and frameworks that will be used throughout the program, developing the software, testing, troubleshooting, benchmarking performance, analyzing scalability metrics, providing users guides and

11 Often, time-stepped simulations boast of near-perfect speedup

when running in parallel. Data decomposition simply distributes portions of the problem to each node where the same workload is performed at each time step. The discrete-event approach can significantly reduce execution times by only scheduling and processing events when necessary, which is usually much less often than time stepped approaches. Because discrete-event simulations can also handle very tight interactions when necessary, this also makes this approach more accurate than time stepping.


5

other documentation, and eventually performing VV&A. All of this is usually accomplished using formal well-documented 12processes that are standard practices for DoD programs. Once all of the development is completed, the software application typically goes into further rounds of requirements, development, and testing before eventually going into basic maintenance mode. By that time, the original team of developers has already moved on to other programs that are more interesting and require their support. Often, critical subject matter experts move to other companies or organizations as their legacy software programs go into maintenance mode and/or when their funding is cut. At this point, maintenance becomes prohibitively expensive because the remaining skeleton engineering team is inadequately prepared to enhance or even just maintain the software. The program eventually is shelved after years of expensive, yet inadequate support. Despite the best of intentions, this pattern repeats itself on many government programs.

The problem with traditional software engineering is that each system is developed with its own unique framework without serious consideration for interoperability, reuse, modularity, and modern multicore computing technologies [22]. The “not invented here” syndrome minimizes interoperability and reuse, both for (1) using models developed on other programs and (2) making models developed by the current program usable in other contexts.

The thinking goes like this… How can a model developed in one system be reused in another system anyway? After all, each system has its own infrastructure with its own set of modeling constructs, dependencies with other models, and design assumptions [1,28]. Why should a program go beyond its agreed-upon requirements with the sponsoring agency to provide additional capabilities such as model-component interoperability within a common framework?

Modeling & Simulation has become prohibitively expensive because of this way of thinking.

New software development methodologies, architectures, frameworks, tools, libraries, and languages are required to fully embrace the multicore computing revolution that is now underway. Providing scalable (1) software engineering methodologies and (2) run-time performance metrics are primary focuses of the Open Unified Technical Framework (OpenUTF) [8,23,25,26].

12 VV&A should be implemented throughout the life cycle of

every meaningful M&S application, but can only be completed once software development is finished and the application has been thoroughly tested by the test team. It makes no sense to initiate independent verification, validation, or accreditation efforts while software is still being developed and/or debugged by the development team.

The OpenUTF is comprised of three synergistic open architectures.

1. The Open Modeling & Simulation Architecture (OpenMSA) [19] modularizes all of the necessary technology layers for supporting scalable parallel and distributed computing.

2. The Open System Architecture for Modeling & Simulation (OSAMS) [21,29] is part of the OpenMSA and primarily focuses on plug-and-play model interoperability and composability.

3. The Open Cognitive Architecture Framework (OpenCAF) [24] extends OSAMS by focusing on providing modeling constructs for representing thought processes and cognitive behaviors.

The Parallel and Distributed Modeling & Simulation Standing Study Group (PDMS-SSG) [12,13] is governed by the Simulation Interoperability Standards Organization (SISO) [14] and is in the process of evaluating these three architectures for future standardization. The WarpIV Kernel is the reference implementation of the OpenUTF. As regulated by the United States Department of Commerce, it is freely provided with source code to United States organizations or citizens requesting a license.

Figure 4: Background of the OpenUTF, starting with the development of the Time Warp Operating System (TWOS) and the Synchronous Parallel Environment for Emulation and Discrete Event Simulation (SPEEDES) at JPL, through its transition into various programs and the eventual development of its architectures (OpenMSA, OSAMS, OpenCAF) with the WarpIV Kernel reference implementation.

As shown in Figure 4, the OpenUTF has evolved over the past 20 years, starting with the invention of scalable optimistic time-management and flow control techniques originally developed at the Jet Propulsion Laboratory and Caltech, followed by numerous mainstream government and research programs that used the technology. Lessons have been learned from each effort, which has helped to


6

evolve the OpenUTF, its three architectures, and its open source reference implementation. Over the past 20 years, more than an estimated $35M dollars have been invested in maturing this technology. This figure does not include the investment made by government programs that have actually used the technology to develop models.

The OpenMSA, shown in Figure 5, provides a common layered architecture for technologists to contribute their ideas and R&D efforts back to the rest of the community.13 This includes high speed communications, event management techniques, modeling constructs, time management, publish and subscribe data distribution, utilities, and automatic interfaces to other distributed computing standards [2,3,4,5,6,7].14

Figure 5: Technology layers of the OpenMSA.

The Open System Architecture for Modeling & Simulation (OSAMS) is subset of the Open Modeling & Simulation Architecture (OpenMSA), which is part of the Open Unified Technical Framework (OpenUTF). The yellow OpenMSA layers shown in Figure 5 represent OSAMS utilities, services, and modeling constructs. Model developers use OSAMS constructs to build their models. OSAMS could be implemented for other frameworks besides the OpenMSA. OSAMS-compliant models could execute within any OSAMS-compliant framework. This makes OSAMS ideal for standardization

13 Research performed by universities or research institutions

that focus primarily on publications often result in throwaway software. The OpenMSA solves this problem by providing an open source reference implementation that (a) focuses and enhances research and (b) streamlines the transition of successful software back to the user community.

14 Such standards include Live, Virtual, and Constructive (LVC) standards such as the High Level Architecture (HLA), Distributed Interactive Simulation (DIS), and the Test/Training Enabling Architecture (TENA). Other mechanisms include XML-based web standards such as the Simple Object Access Protocol (SOAP) and Web Service Description Language (WSDL).

apart from the OpenMSA. OSAMS provides the following cost-saving benefits.

1. Powerful modeling constructs that minimize the amount of user-written software necessary to develop complex models.15 Fewer lines of code translate to lower development, testing, and maintenance costs.

2. Common modeling framework that can be used by the software community at large. This means that an OASMS trained community of developers can quickly maintain collections of models without having to learn a new system.

3. Models and/or services are developed as encapsulated plug-and-play software components with no direct dependencies on other component implementations. The encapsulation mechanisms ensure that models do not become so large and complex that they cannot be maintained.

4. Reuse of models and services becomes standard practice as repositories of components, along with standard interfaces and data representations are developed. This lowers software engineering costs through wholesale interoperability and reuse.

5. Verification and validation tests for model components can be created that operate on more than one model or service component implementation using their common abstract interfaces. This means that a single verification or validation test can support multiple component implementations. It also helps to compare/contrast models using the exact same test.

OSAMS achieves plug-and-play interoperability between model and service components through several important encapsulation mechanisms.

1. Model/service components within an entity invoke services provided by other components using abstract interfaces, known as Polymorphic Methods. The technique used by OSAMS goes beyond simple inheritance and virtual functions by allowing users to independently define abstract interfaces, and then allow objects to dynamically register (or unregister) their methods to support those interfaces. Polymorphic methods do not require inheritance from base classes. Registered methods can also have arbitrary names. However,

15 In comparison to other modeling frameworks, experience has

shown that the OSAMS programming constructs reduce the number of lines of code necessary to develop complex models by factors ranging from three to five.


7

their method signature must match the abstract interface signature. Conflicts between defined interfaces and method implementations are caught at compile time.

2. A data registry is provided within each simulated entity to dynamically register (or unregister) arbitrary data. Each component hierarchically structured within an entity registers its data that might be needed by other components. Similarly, each component can access data registered by other components within the entity. String tags are used to handle name space conflicts. This supports the abstract sharing of data between components residing within an entity.

3. Embedded attributes within a component and/or its internal classes can be mapped into Federation Objects16 that are published for other subscribing entity models to discover. Publish and subscribe data distribution provides mechanisms for entities to interact with object proxies of other entities. An entity, such as an aircraft carrier, can publish multiple Federation Objects.

4. Directed and Undirected Interactions17 can be scheduled to support generic interactions between entities. Models that schedule interactions never know what methods are invoked by receiving models to handle their interactions. Filters can be applied to undirected interactions that are scheduled to reduce recipients.

To support meaningful composability, each component should provide important metadata that describes (1) all of the abstract interfaces that are invoked by the component, (2) all of the abstract interfaces that are provided by the component, (3) all data that is needed by the component that must be registered by other components, (4) all data registered by the component, (5) all data attributes that are mapped into published Federation Objects, (6) the complete set of subscription filters created by the component for discovering Federation Objects that are published by other entities, (7) all interaction types that are generated by the component, and finally, (8) all interaction types that can be handled by the component.

16 Federation Objects provide a mechanism to distribute a

collection of attributes associated with an object to subscribers within the OpenUTF. This does not imply use of HLA. Rather, Federation Objects are optimized for use within the OpenUTF, but because of their similar nature, can automate integration with HLA.

17 Interactions provide an HLA-like mechanism for scheduling events between entities within the OpenUTF. Parameters are passed in a generic “Run Time Class” data structure that is easily translated to other formats such as XML, DIS, or HLA.

In theory, two different component implementations representing the same kind of model18 can be plug-and-play swappable if their metadata is identical. In addition, the same set of verification and validation “black box tests” could be used for similar components with identical metadata. This approach to testing can significantly lower the cost of performing VV&A. A technical paper on the subject of model/service component composability will be provided in the near future.

The OpenCAF19 provides a critical capability for representing autonomous behavior in a standard way. Its proposed reasoning engine allows users to construct cognitive graphs that represent stimulus inputs and outputs to various reasoning techniques (see Figure 6). Each reasoning node in a graph (i.e., Emotion-based, Training-based, or Rule-based nodes) represents a thought process that is implemented as a method on an object. Each method is triggered when any of its inputs (i.e., stimulus) change. The order of triggered methods can be prioritized based on each stimulus input and the overall structure of the graph. The cognitive engine activates the graph when any stimulus changes.

Figure 6: Depiction of a cognitive graph operating within the OpenCAF Reasoning Engine. The red, green, and blue nodes in the graph represent various kinds of thought processes. Emotion-based reasoning weighs factors leading to decision outputs. Training-based reasoning uses trained neural networks to produce outputs based on certain inputs. Rule-based reasoning is simply an algorithm that uses certain inputs to produce certain outputs.

As shown in Figure 7, stimulus to a graph can be modified from several sources, including data provided by Federation Objects, received Interactions, outputs of other models, internal events that update state, time updates,

18 OSAMS provides the framework for supporting plug-and-

play component interoperability. However, this alone does not guarantee that models are truly interoperable in a meaningful way. Additional metadata is required to ensure consistent assumptions between models.

19 The OpenCAF is currently being designed and has not yet been fully implemented.


8

and/or data fusion performed by sensors and intelligence models that refine the model’s perception of the world. Reprioritized goals generated by the graph are then provided to the behavior management structure that decides which task to perform to accomplish a set of ordered goals (i.e., a mission).

Figure 7: OpenCAF infrastructure showing how external inputs trigger the reasoning engine thought process, which in turn establishes priorities in goals, resulting in managing behaviors.

The OpenCAF provides a common infrastructure to represent complex behaviors of autonomous entities. Without the OpenCAF, modelers would have to write their own reasoning engine or worse, hard-code embedded if-then-else statements as complex behavior representations. Ultimately, manually hard-coding complex behaviors leads to unmaintainable software, which eventually causes the program to become too costly to sustain, especially after subject matter experts have left the program. The OpenCAF provides critical functionality for supporting meaningful entity-based simulations and is the final architecture component of the OpenUTF.

4.0 Establishing Standard Benchmarks

This section describes a set of OpenUTF benchmarks that have been (1) previously developed on other programs, (2) planned for future development, and/or (3) already developed. These benchmarks form a somewhat comprehensive set of parameterizable tests that can be used to study critical scalability and performance issues. It is anticipated that these benchmarks will mature and become standardized for future use by the OpenUTF community and beyond.

It is important to note that a parallel discrete-event simulation has various overheads that individually contribute to the overall event-processing overhead. Examples are (1) event management algorithms and data structures, (2) time management techniques and flow

control, (3) incremental state saving for rollback support, (4) memory consumption, (5) communication throughput and bandwidth consumption, and (6) publish and subscribe data distribution. Keep in mind as the various benchmarks are described below, that it is important to isolate each of these overheads through specific tests to fully understand how each scales and affects overall run-time performance. It is also important to recognize that some overheads are negligible compared to others, even though they might at first appear to be large. For example, state-saving overheads for supporting rollbacks might at first appear to be excessive until other factors are considered. The time to save the state of a double-precision variable for rollback support is often negligible compared to the processing time spent to actually compute its value. Also, the time and memory-consumption overheads associated with rollbacks have proven to almost always be negligible compared to event management and actual event-computation overheads.20

4.1 Machine

The Machine benchmark is a very simple standalone program that measures the execution time of basic operations. It provides important insight on hardware performance and compiler efficiencies [10].21 Each operation is repeated many times inside a loop. Overhead of the loop is subtracted out to isolate the actual timing of each studied operation. The number of iterations performed in each loop is dynamically determined at run time22 to ensure accurate measurements.

Operations include: adding two numbers, multiplying two numbers, taking the square root of a double, computing the cosine of a double, throwing random numbers, obtaining the system and wall clock, etc.

4.2 Communications

The OpenUTF High Speed Communication (OpenUTF-HSC) framework defines a set of standard services. Each

20 The WarpIV Kernel reference implementation of the

OpenUTF provides a highly optimized incremental state saving framework for rollback support that (1) consumes minimal memory, (2) employs optimized free lists for rapid memory management of generated rollback items, and (3) provides very fast state-saving, rollback, and rollforward operations.

21 It is interesting to run the Machine benchmark on a PC that is dual-bootable in Windows and Linux. The Microsoft Visual Studio compiler is used in Windows, while the GNU compiler is used in Linux. Different optimization settings can be explored. Results from the two compilers are vastly different even on the same machine.

22 Each loop currently runs for 2 seconds within an accuracy of 0.1 seconds. The time per operation is computed as the time divided by the number of iterations minus the time to perform the same number of loops doing nothing.


9

service is timed within a loop that is automatically adjusted to run for a specified period of time. This provides a standard way to measure the performance of each operation on different computing architectures and network configurations.

The communications benchmarks is being modified to transparently time the same operations using the standard Message Passing Interface (MPI) that is used on many high performance computing programs.

Timing results are saved in comma-separated value files that can be imported into spreadsheets and automatically plotted or compared with other runs.

4.3 Rollback Operations

A rollback benchmark was developed for SPEEDES incremental state saving in 1993 [15] that carefully isolated each important rollback operation and timed performance in carefully controlled loops. Rollback operations include primitive state variables, pointers, strings, container classes, memory allocation and deallocation operations with and without free lists, and other miscellaneous operations.

Several timings were performed on each operation to measure all relevant information.

1. Time to perform native operation without rollback overheads.

2. Time to perform and commit operation with rollback support.

3. Time to perform the operation, roll it back, and then uncommit.

4. Time to perform the operation, roll it back, roll it forward, and then commit.

These benchmarks should be reconstructed within the WarpIV Kernel reference implementation of the OpenUTF. Results should be published in a future paper. In addition, any simulation or benchmark developed in the WarpIV Kernel can execute in a repeatable manner with the following settings:

1. TestRollback – Processes each event, rolls the event back, and then reprocesses the event. If internal overheads are minimal, the simulation should take twice as long to execute because each event is processed twice. On the other hand, if the execution time is minimally changed, this implies that the event management overheads dominate execution time, which is often true for simulations with low-grained events.

2. TestRollforward – Processes each event, rolls the event back, and then rolls the event forward. If the simulation execution time is minimally changed, this implies that rollback and rollforward overheads are minimal. If the simulation takes three times as long to execute, this means that rollback overheads are dominating the performance. Usually, this test produces a minimal effect to an application’s execution time.

3. TestFullRollback – Processes multiple events within an entire Global Virtual Time (GVT) cycle, rolls all events back, and then reprocesses the GVT cycle. Like TestRollback, this test provides insight into event management overheads. It also stresses models further to ensure that rollbacks are supported correctly. Usually, this test results in slightly more than doubling the execution time of an application.

It is important to note that the incremental state saving overheads (i.e., time and memory) used within the OpenUTF scale as a function of processed events. The time management flow control techniques bound these overheads to produce a completely scalable rollback framework. So, when discussing rollback overheads, one should not confuse scalability with performance or memory consumption. The rollback techniques in the OpenUTF fundamentally scale.

4.4 Event Queue

Numerous event queue data structures have been benchmarked for use in SPEEDES and the WarpIV Kernel. These frameworks have traditionally used the fastest ones. The event queue benchmark inserts a specified number of events with random times into the event queue to initialize the test. Then, within a timed loop, the following operations are performed: (1) remove the earliest time-tagged event from the event queue, followed by (2) insert a new event with a random time tag back into the event queue. Random times are pre-generated to minimize the effect of random number generation on the timings. Various distributions are generated to study their effects on event queue performance [17].

4.5 P-hold

A simple steady-state simulation23 is constructed with a run-time configurable number of simulation objects and initial events per object. A typical run might contain 10,000 objects with 10 initial events per simulation object. The simulation can run on one or more nodes. 23 Each event, when processed, generates a single new event to

keep the number of pending events constant.


10

Event processing granularities are controlled by a run-time parameter to study performance for simulations with either heavy or light event-processing loads. Two configurations are supported.

1. Fully connected – Each event randomly schedules an event for any other object in the simulation with a future time generated according to a specified probability distribution.

2. Ring connected – Each event schedules an event for the next object connected in the ring with a future time generated according to a specified probability distribution. When executing in parallel, the SCATTER decomposition strategy generates inter-node messages for every event. This provides a worst-case message-generation mechanism that is interesting to study. In contrast, the BLOCK decomposition strategy generates almost all events locally on the same node because the next object is on the same node except at block boundaries. In this configuration, BLOCK decomposition minimizes inter-node message overheads and provides the best-case scenario for communication overhead.

The various configurations of P-hold can be used to study internal event queue overheads, scheduler overheads, message-passing overheads, rollback sensitivity to various event time statistics, etc. As discussed in the rollback benchmark section, the P-hold benchmark can also be executed with TestRollback, TestRollforward, and TestFullRollback enabled.

4.6 Crazy Cancel

The Crazy Cancel benchmark was established to understand potential instabilities in optimistic simulations and various flow control mechanisms that mitigate those instabilities. Crazy cancel essentially specifies a certain number of objects in the simulation. Within a Generate event, each object schedules a specified number of CrazyGenerate events for other objects in the simulation 10 units of time into the future. 9 units of time later, however, the original scheduling object retracts those events. When executing sequentially, those originally scheduled CrazyGenerate events are never processed. However, when executing optimistically in parallel, the retraction might be received after those CrazyGenerate events have been processed. To produce instability, CrazyGenerate events when processed randomly generate a specified number of additional CrazyGenerate events for other objects with zero time delay. Thus, an artificial explosion of events is potentially generated when running optimistically in parallel that would never occur when running sequentially or in a conservative manner in parallel. Without flow control mechanisms to throttle

event processing and message passing, optimistic simulations quickly break down, especially when running on systems with slow communications. The pattern repeats by each simulation object every 10 units of simulation time.

4.7 Publish and Subscribe Data Distribution

Publish and subscribe data distribution [11,20,27] benchmarks were originally developed for SPEEDES. Several configurations were established to isolate data distribution overheads between publishers and subscribers. The tests measured (1) discovery overheads, (2) removal overheads, and (3) attribute update/reflection overheads [9]. The goal was to understand both performance overheads and scalability of the infrastructure. A simple curve-fitted performance prediction model was generated based on number of nodes, publishers, and subscribers. This performance model proved to accurately predict publish and subscribe data distribution overheads for realistic configurations. Basic configurations benchmarked were:

1. Single publisher with a specified number of subscribers.

2. Single subscriber with a specified number of publishers.

3. Specified numbers of publishers and subscribers.

Similar benchmarks with the same publisher and subscriber configurations were developed for SPEEDES to measure performance and scalability for Interactions.

4.8 Range-based filtering

A large-scale benchmark was developed and published in 1994 and 1999 [16,18] that simulated up to 1,000,000 randomly moving aircraft distributed around the globe with range-based radar sensors. The benchmark ran on various processor configurations, including runs on 96 processors. Each aircraft moved about the globe with random speed with occasional motion [30] changes. Tests with various sensor ranges were conducted that validated the expected number of detections made by each aircraft. A first-generation distributed hierarchical grid data structure provided a scalable foundation for publish and subscribe data distribution. The results demonstrated near-perfect scalability in every aspect: (1) overall processing time, (2) memory consumption per node, and (3) communication overhead per node.

4.9 Real-world applications

The final type of benchmark is real-world applications executing within the OpenUTF on a variety of hardware


11

configurations. Real-world simulation applications, such as missile defense, ground campaigns, complex networks, air-missions, chemical and biological defense, naval operations, etc., can all be tested for scalability on a wide variety of computer architectures and configurations by measuring execution time, memory consumption, and message overheads as a function of the number of nodes.

The WarpIV Kernel reference implementation of the OpenUTF provides an embedded critical path analysis capability that dynamically computes the maximum speedup possible for any application during its execution. The maximum speedup calculation assumes that each entity executes on its own processor and that all event management and communication overheads are negligable. The maximum speedup is calculated as the sum of all event-processing times divided by the largest critical path time of all entities.

The WarpIV Kernel also provides a wide variety of performance statistics such as event processing times, number of rollbacks per event type, messages and antimessages generated, event memory consumption, etc. to help users understand the scalability of their applications. These automatically generated performance metrics are critical for studying scalability.

5.0 Next-generation Standards

The Parallel and Distributed Modeling & Simulation Standing Study Group (PDMS-SSG) operates within the Simulation Interoperability Standards Organization (SISO) and provides a forum for investigating open architecture standards, such as OpenMSA, OSAMS, and OpenCAF. The benchmarks described in this paper provide a starting point for establishing future standards for categorizing scalability within the PDMS-SSG, OpenUTF Users Group, and broader communities.

6.0 Summary and Conclusions

This paper first introduced the subject of scalability in terms of general concepts before shifting the focus to (1) run time performance (i.e., processing time, memory consumption, and message-passing overheads) and (2) cost-saving scalable software engineering principles of the OpenUTF. Then this paper described a series of benchmarks designed to measure critical performance overheads that collectively contribute to the overall overhead in operating parallel discrete-event simulations.

These benchmarks should collectively evolve into formal standards that can be used to (1) understand performance and scalability and (2) to compare/contrast different independent implementations of OpenUTF architecture layers in a uniform manner. Standard benchmarks will

also provide important feedback to the PDMS-SSG for assessing the feasibility of the OpenUTF. Future publications should provide benchmark performance results on various architectures and run-time configurations.

7.0 Acknowledgements

The authors of this paper would like to thank Chris Gaughan, Deputy Technology Program Manager of the U.S. Army Research, Development and Engineering Command (RDECOM) Modeling Architecture for Technology, Research & EXperimentation (MATREX) program. Mr. Gaughan is the Vice Chair of the PDMS-SSG and is actively involved in bringing multicore computing to Army Force Modeling Simulation (FMS) and Chemical and Biological Defense (CBD) programs.

The authors of this paper would like to thank Bob Pritchard, Program Manager of the Joint Virtual Network Centric Warfare (JVNCW) program at SPAWAR Systems Center Pacific, for his continued support of the OpenUTF. Bob Pritchard is a strong advocate of the PDMS Standing Study Group and is a proponent of open source software methodologies.

8.0 References

1. Davis Paul, and Anderson Robert, 2004. “Improving the Composability of Department of Defense Models and Simulations.” Prepared for the Defense Modeling & Simulation Office (DMSO). RAND National Defense Research Institute.

2. Distributed Interactive Simulation - Application protocols, IEEE 1278.1A-1998.

3. Distributed Interactive Simulation - Communication Services and Profiles, IEEE 1278.2-1995.

4. HLA Framework and Rules, Version IEEE 1516-2000. 5. HLA Interface Specification, Version IEEE 1516.1-2000. 6. HLA Object Model Template, Version IEEE 1516.2-2000. 7. HLA Federation Development and Execution Process,

Version IEEE 1516.3. 8. Joint Program Executive Office (JPEO) Chemical and

Biological Defense (CBD) Modeling & Simulation Strategic Plan.

9. Lammers Craig, Steinman Jeff, 2004, “JSIMS CCSE Performance Analysis.” In proceedings of the Fall 2004 Simulation Interoperability Workshop, paper 04F-SIW-34.

10. Lammers Craig, Valinski Maria, Steinman Jeff, 2009. “Multiplatform Support for the OpenMSA/OSAMS Reference Implementation.” In proceedings of the Spring 2009 Simulation Interoperability Workshop (SIW).

11. Morse Katherine, Steinman Jeff, 1997. “Data Distribution Management in the HLA: Multidimensional Regions and Physically Correct Filtering.” In proceedings of the Spring


12

Simulation Interoperability Workshop. Paper 97S-SIW-052.

12. Parallel and Distributed Modeling & Simulation Standing Study Group (PDMS-SSG) Terms Of Reference (TOR).

13. Parallel and Distributed Modeling & Simulation Standing Study Group (PDMS-SSG), 2009. “Parallel and Distributed Modeling & Simulation Standing Study Group (PDMS-SSG) DRAFT REPORT Volume 1: PDMS Technology.” Submitted to the SISO Standards Activities Committee (SAC), 14 April 2010.

14. Simulation Interoperability Standards Organization (SISO), http://www.sisostds.org.

15. Steinman Jeff, 1993. "Incremental State Saving in SPEEDES Using C++." In proceedings of the 1993 Winter Simulation Conference. Pages 687-696.

16. Steinman Jeff and Wieland Fred, 1994. "Parallel Proximity Detection and the Distribution List Algorithm." In proceedings of the 1994 Parallel And Distributed Simulation Conference. Pages 3-11.

17. Steinman Jeff, 1996. "Discrete-Event Simulation and the Event Horizon Part 2: Event List Management." In proceedings of the 1996 Parallel And Distributed Simulation Conference. Pages 170-178.

18. Steinman Jeff, Tran Tuan, Burckhardt Jacob, Brutocao Jim, 1999. “Logically Correct Data Distribution Management in SPEEDES.” In proceedings of the 1999 Fall Simulation Interoperability Workshop, Paper 99F-SIW-067.

19. Steinman Jeff and Hardy Doug, 2004. “Evolution of the Standard Simulation Architecture.” In proceedings of the Command and Control Research Technology Symposium, paper 067.

20. Steinman Jeff, 2004. “The Distributed Simulation Management Services Layer in SPEEDES.” In proceedings of the Spring 2004 Simulation Interoperability Workshop, paper 04S-SIW-98.

21. Steinman Jeff, Park Jennifer, Walters Wally, and Delane Nathan, 2007. “A Proposed Open System Architecture for Modeling and Simulation.” In proceedings of the Fall 2007 Simulation Interoperability Workshop, 07F-SIW-044.

22. Steinman Jeff, 2008. “The Need for Change in Modeling and Simulation.” Article in Chem-Bio Defense Quarterly.

23. Steinman Jeff, Lammers Craig, Valinski Maria, 2008. “A Unified Technical Framework for Net-centric Systems of Systems, Test and Evaluation, Training, Modeling and Simulation, and Beyond…” In proceedings of the Fall 2008 Simulation Interoperability Workshop (SIW). Paper 08F-SIW-041.

24. Steinman Jeff, Lammers Craig, and Valinski Maria, 2009. “A Proposed Open Cognitive Architecture Framework (OpenCAF).” In the proceedings of the 2009 Winter Simulation Conference.

25. Steinman Jeff, Lammers Craig, Valinski Maria, 2009. “Composability Requirements for the Open Unified Technical Framework.” In proceedings of the Fall 2009 Simulation Interoperability Workshop, paper 09F-SIW-022.

26. Steinman Jeff, Lammers Craig, Valinski Maria, Steinman Wendy, 2009. “Migrating Software to the Open Unified

Technical Framework.” In proceedings of the Spring 2010 Simulation Interoperability Workshop, paper 10S-SIW-052.

27. Steinman Jeff, Lammers Craig, Valinski Maria, and Steinman Wendy, 2010. “Scalable Publish and Subscribe Data Distribution.” In proceedings of the Fall 2010 Simulation Interoperability Workshop, 10F-SIW-027.

28. Tolk Andreas, Muguira James, 2003. “The Levels of Conceptual Interoperability Model.” In proceedings of the 2003 Fall Simulation Interoperability Workshop. Paper 03F-SIW-007.

29. WarpIV Technologies, Inc. 2009. “Open Unified Technical Framework (OpenUTF) Volume 1: OSAMS User’s Guide for Model Developers.”

30. WarpIV Technologies, Inc. 2009. “Open Unified Technical Framework (OpenUTF): Motion Library Description and Derivation.

9.0 Author Biographies

JEFFREY S. STEINMAN, President and CEO of WarpIV Technologies, Inc. received his Ph.D. in High Energy Physics from UCLA in 1988 where he studied the quark structure function of high-energy virtual photons at the Stanford Linear Accelerator Center. Dr. Steinman was the creator of the Synchronous Parallel Environment for Emulation and Discrete Event Simulation (SPEEDES), has published more than 70 papers and articles in the field of high performance computing, has five U.S. patents in high performance M&S technology, is a frequent invited speaker at conferences, seminars, and forums, and is the principle designer/developer of the WarpIV Kernel Reference Implementation of the OpenMSA, OSAMS, and OpenCAF architectures that make up the Open Unified Technical Framework (OpenUTF). Dr. Steinman is currently the Chair of the Parallel and Distributed Modeling & Simulation Standing Study Group (PDMS-SSG) that is governed by the Simulation Interoperability Standards Organization (SISO).

CRAIG N. LAMMERS, Vice President of Applied Research and Development at WarpIV Technologies, Inc. is the program manager for an effort with AFRL, Rome Labs conducting new research and development in parallel and distributed simulation to support formal estimation and prediction of complex systems using the OpenUTF Kernel. He is also the lead engineer for a new research effort with the Missile Defense Agency applying five-dimensional simulation to support the analysis of interceptor engagement strategies. His professional interests are in the areas of M&S, artificial intelligence, signal processing, user interface design, and music. Mr. Lammers received his B.S. degree in Industrial and Systems Engineering from the University of Michigan, Dearborn in 2001, where he graduated with high honors.


13

MARIA E. VALINSKI, Vice President of Software Engineering at WarpIV Technologies, Inc., has provided modeling, simulation, and integration technical support for numerous programs and organizations. Maria Valinski is the lead software engineer for the Joint Virtual Network Centric Warfare (JVNCW) program that models wireless communications on supercomputers using the OpenUTF reference implementation. Maria Valinski is currently providing Subject Matter Expert (SME) support for OpenUTF users at the Space and Naval Warfare (SPAWAR) Systems Center Pacific (SSC-PAC). Ms. Valinski received her B.S. degree in Computer Science from East Stroudsburg University in 1997.

WENDY L. STEINMAN, Software Engineer at WarpIV Technologies, Inc. received her B.S. degree in Cognitive Science specializing in Neuroscience with a minor in Biology from UCSD in 2009. She is currently supporting several initiatives involving the OpenUTF. This includes developing data registry services, implementing two and three dimensional grid decomposition strategies for simulation objects, exploring hybrid computation techniques, conducting performance benchmarks, comparing the internal OpenUTF High Speed Communications (HSC) library with the standard Message Passing Interface (MPI), analyzing M&S data provided by the Chemical, Biological Radiological and Nuclear Defense (CBRND) Data Backbone, assisting in writing the OpenUTF User’s Guide, and editing the Parallel and Distributed Force Modeling & Simulation Technical Report for the PDMS-SSG.

Documents

Scalability of Parallel and Distributed Modeling ...Communication scalability (or lack of scalability) is sometimes explained in the following way. Consider two machines with 8 processors