Distributed Adaptive Simulations using Structured Adaptive Mesh-Refinement (SAMR)
Manish Parashar, The Applied Software Systems Laboratory
ECE/CAIP, Rutgers University, http://www.caip.rutgers.edu/TASSL
(Ack: NSF, DoE, NIH, DoD)


  • Distributed Adaptive Simulations using Structured Adaptive Mesh-Refinement (SAMR)

    Manish Parashar
    The Applied Software Systems Laboratory
    ECE/CAIP, Rutgers University
    http://www.caip.rutgers.edu/TASSL

    (Ack: NSF, DoE, NIH, DoD)

  • Overview

    • Computational engines for SAMR applications – distributed, dynamic data-management

    • Runtime (reactive and proactive) management
      – dynamic (application- and system-sensitive) partitioning and load-balancing
      – AHMP – Adaptive Hierarchical Meta-Partitioning
      – Dispatch – addressing point-wise varying loads

    • Conclusion

  • Adaptive Mesh Refinement

    • Start with a base coarse grid with the minimum acceptable resolution

    • Tag regions in the domain requiring additional resolution, cluster the tagged cells, and fit finer grids over these clusters

    • Proceed recursively so that regions on the finer grid requiring more resolution are similarly tagged and even finer grids are overlaid on these regions

    • Resulting grid structure is a dynamic adaptive grid hierarchy

    The Berger-Oliger Algorithm

    Recursive Procedure Integrate(level)
      If (RegridTime) Regrid
      Step Δt on all grids at level “level”
      If (level + 1 exists)
        Integrate(level + 1)
        Update(level, level + 1)
      End if
    End Recursion

    level = 0
    Integrate(level)
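    The following is a minimal, runnable Python sketch of this recursion. The regrid/advance/update bodies are placeholders, and the refinement factor, level count, and regrid interval are illustrative assumptions, not GrACE's actual API; a real SAMR code would advance PDE state on every patch at the given level.

```python
# Berger-Oliger time stepping: each finer level takes REFINEMENT_FACTOR
# sub-steps of size dt/REFINEMENT_FACTOR, then updates its parent level.

REFINEMENT_FACTOR = 2   # sub-steps per finer level (assumed)
MAX_LEVEL = 2           # 3 levels: 0, 1, 2 (assumed)
REGRID_INTERVAL = 4     # regrid every 4 steps on a level (assumed)

step_count = {lvl: 0 for lvl in range(MAX_LEVEL + 1)}

def regrid(level):
    print(f"  regrid at level {level}")

def advance(level, dt):
    print(f"  step dt={dt:.4f} on all grids at level {level}")

def update(coarse, fine):
    # Inject/average the fine solution back onto the coarse grids.
    print(f"  update level {coarse} from level {fine}")

def integrate(level, dt):
    """One step at `level`, recursing into finer levels."""
    if step_count[level] % REGRID_INTERVAL == 0 and step_count[level] > 0:
        regrid(level)
    advance(level, dt)
    step_count[level] += 1
    if level + 1 <= MAX_LEVEL:
        for _ in range(REFINEMENT_FACTOR):
            integrate(level + 1, dt / REFINEMENT_FACTOR)
        update(level, level + 1)

for _ in range(2):          # two coarse time steps
    integrate(0, dt=0.1)
```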

    Structured Adaptive Mesh Refinement (SAMR)

  • Related Work: SAMR Infrastructures

    • SAMRAI, Lawrence Livermore National Lab
      – Object-oriented structured adaptive mesh refinement application infrastructure
      – Modules handle visualization, mesh management, integration, geometry, etc.

    • Chombo, Lawrence Berkeley National Lab
      – Set of tools for implementing finite difference methods for PDE solutions
      – Distributed infrastructure for parallel calculations over block-structured, adaptively refined grids

    • Paramesh, NASA Goddard Space Flight Center
      – Fortran 90 subroutines to extend existing serial code into parallel AMR code
      – Hierarchy of Cartesian mesh grids which form the nodes of a tree data structure

    • Batsrus, University of Michigan
      – Block-based approach with adaptation distributed over processors in the computational pool in phases

    • GrACE, Rutgers University
      – Adaptive computational and data-management engine for structured grids
      – Distributed adaptive grid hierarchy, grid function, and geometry abstractions
      – Parallel support for AMR computations in various scientific domains

  • GrACE: Adaptive Computational Engine for SAMR

    • Semantically Specialized DSM
      – Application-centric programming abstractions
      – Regular access semantics to dynamic, heterogeneous, and physically distributed data objects
        • Encapsulate distribution, communication, and interaction
      – Coupling/interactions between multiple physics, models, structures, scales

    • Distributed Shared Objects
      – Virtual Hierarchical Distributed Dynamic Array
        • Hierarchical index-space + extendible hashing + heterogeneous objects
      – Multifaceted objects
        • Integration of computation + data + visualization + interaction

    • Adaptive Run-time Management
      – Application- and system-sensitive management
        • Algorithms, partitioners, load-balancing, communications, etc.
        • Policy-based automated adaptations

    1024x128x128, 3 levels, 2K PEs: time ~15%, memory ~25%

    Richtmyer-Meshkov (3D)

    IPARS Multi-block Oil Reservoir Simulation

  • Data-Management for Adaptive Applications

    • Application requirements
      – Adaptive Finite Difference
        • Large hierarchical objects; dynamic size, orientation, and interactions
      – Adaptive Finite Element
        • Dynamic number of objects of dynamic size
      – Adaptive Fast Multipole
        • Dynamic number of small objects with dynamic interactions

    • Traditional data-management for computation – multi-dimensional arrays
      – data set, index space, injective function

    • Data-management abstraction for distributed adaptive applications
      – An extended definition of an array where
        • Each element of the array can itself be an array
        • Each element of the array can be an object of arbitrary and variable size
        • The array can grow and shrink dynamically
        • The array is distributed
      – Hierarchical Distributed Dynamic Array (sketched below)
        • Performance, Performance, Performance
        • Locality, Locality, Locality
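    A toy, single-process sketch of what such a hierarchical dynamic array abstraction might look like. The class name and methods are illustrative assumptions, not the actual HDDA/DAGH interface, and distribution across processors (the "Distributed" in HDDA) is omitted.

```python
# Each element, addressed by a (level, index) key, may itself hold a
# nested array or an arbitrary object; the structure grows and shrinks
# dynamically. Sparse dict storage stands in for the hierarchical
# index-space + extendible hashing described on the GrACE slide.

class HierarchicalDynamicArray:
    def __init__(self):
        self._store = {}   # only elements that exist consume memory

    def put(self, level, index, value):
        """Insert or replace the element at (level, index). `value` may be
        another HierarchicalDynamicArray (a finer level) or any object
        of arbitrary and variable size."""
        self._store[(level, index)] = value

    def get(self, level, index):
        return self._store.get((level, index))

    def remove(self, level, index):
        """Shrink dynamically: drop an element when its region is derefined."""
        self._store.pop((level, index), None)

# Usage: a coarse patch at level 0 whose refined region is itself an array.
h = HierarchicalDynamicArray()
fine = HierarchicalDynamicArray()
fine.put(1, (4, 4), {"patch": "level-1 data"})
h.put(0, (2, 2), fine)                 # element that is itself an array
h.put(0, (2, 3), [0.0] * 16)           # element of arbitrary/variable size
print(h.get(0, (2, 2)).get(1, (4, 4)))
```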

    “A Common Data Management Infrastructure for Parallel Adaptive Algorithms for PDE Solutions,” M. Parashar, J. C. Browne, C. Edwards, and K. Klimkowsky, Proceedings of Supercomputing ’97, San Jose, CA, November 1997.

  • MACE: Supporting Dynamic Coupling/Interactions

    • High-Performance Geometry-based Shared Spaces
      – Models/numerics, as well as the interactions, are typically based on the geometry of the discretized domain
      – Use SFCs to create a distributed directory of shared geometric regions
      – Processors can create shared regions and can read and write objects related to a shared region – e.g., a mortar grid
      – Complements MPI, OpenMP, PVM, etc.

    Multi-numerics / Multi-physics / Multi-scale
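    A minimal sketch of the geometry-based shared-space idea: a region is mapped to a directory home by its space-filling-curve (Morton/Z-order) index, so nearby regions tend to share a home. The function names and the write/read interface are illustrative assumptions, not MACE's API.

```python
# Morton (Z-order) SFC key: interleave the bits of (x, y) so that
# spatially nearby regions get nearby keys, preserving locality.

def morton_key(x, y, bits=16):
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

class GeometricSharedSpace:
    def __init__(self, num_procs):
        self.num_procs = num_procs
        self.directory = {}          # home -> [(region, object), ...]

    def home(self, region):
        """Directory home of a region = hash of its SFC key."""
        x, y = region
        return morton_key(x, y) % self.num_procs

    def write(self, region, obj):
        self.directory.setdefault(self.home(region), []).append((region, obj))

    def read(self, region):
        return [o for r, o in self.directory.get(self.home(region), []) if r == region]

space = GeometricSharedSpace(num_procs=8)
space.write((12, 7), "mortar-grid values")   # producer registers a shared region
print(space.read((12, 7)))                   # consumer reads by geometry
```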

    “A Dynamic Geometry-Based Shared Space Interaction Framework for Parallel Scientific Applications,” L. Zhang* and M. Parashar, Proceedings of the 11th International Conference on High Performance Computing (HiPC 2004), Bangalore, India, December 2004.

  • A Selection of SAMR Applications Enabled

    • Multi-block grid structure and oil concentration contours (IPARS, M. Peszynska, UT Austin)

    • Blast wave in the presence of a uniform magnetic field – 3 levels of refinement (Zeus + GrACE + Cactus, P. Li, NCSA, UCSD)

    • Mixture of H2 and air in stoichiometric proportions with a non-uniform temperature field (GrACE + CCA, Jaideep Ray, SNL, Livermore)

    • Richtmyer-Meshkov – detonation in a deforming tube – 3 levels; the z=0 plane is visualized on the right (VTF + GrACE, R. Samtaney, CIT)

  • SAMR: Spatial and Temporal Heterogeneity and Dynamics

    [Figure: snapshots of the adaptive grid structure at regrid steps 5, 96, 114, 176, and 201, with a plot of total load (in units of 100k) versus regrid step for RM3D (200 regrid steps, size = 256x64x64)]

    Spatial and temporal heterogeneity and load dynamics of a 3D Richtmyer-Meshkov simulation using SAMR

  • Analysis of Computation and Communication Patterns of Distributed SAMR Applications

    [Figure: timing diagram showing interleaved computation and communication time slots for processors P1 and P2 across refinement levels, with an enlarged view detailing intra-level communication, inter-level communication, and synchronization]

    * The number in each time-slot box denotes the refinement level of the load under processing.
    * In this case, the number of refinement levels is 3 and the refinement factor is 2.
    * The communication time consists of three types: intra-level, inter-level, and synchronization cost.

    Timing diagram for distributed SAMR

  • Runtime Management for SAMR Applications

    • Partitioning/load-balancing strategy
      – maximize parallelism; minimize inter/intra-level communication; maintain inter/intra-level locality; support efficient repartitioning; …
      – the partitioning/load-balancing strategy depends on the structure of the grid hierarchy and the current application/system state [IEEE TPDS 2002]

    • Granularity
      – patch size, AMR efficiency, comm./comp. ratio, overhead, node performance, load balance, …

    • Number of processors / load per processor
      – dynamic allocation/configuration/management
      – 1000+ processors from the beginning, or “on-demand”

    • Hierarchical “emergent” distributions using dynamic processor groups
    • Communication optimizations / latency tolerance / multithreading
    • Availability, capabilities, and state of system resources

  • Partitioning Approaches

    Ack. X. Li, OSU

  • SAMR – Partitioning Systems

    System       | Execution Mode                 | Granularity                  | Partitioner Organization                                   | Decomposition         | Institute
    -------------|--------------------------------|------------------------------|------------------------------------------------------------|-----------------------|-------------
    CHARM        | Comp-intensive                 | Coarse-grained               | Static single-partitioner                                  | Domain-based          | UIUC
    Chombo       | Comp-intensive                 | Fine-grained, coarse-grained | Static single-partitioner                                  | Domain-based          | LBNL
    HRMS/GrACE   | Comp-intensive                 | Fine-grained, coarse-grained | Adaptive hierarchical multi-partitioner, hybrid strategies | Domain-based, hybrid  | Rutgers
    Nature+Fable | Comp-intensive                 | Coarse-grained               | Single meta-partitioner                                    | Domain-based, hybrid  | Sandia
    ParaMesh     | Comp-intensive                 | Fine-grained, coarse-grained | Static single-partitioner                                  | Domain-based          | NASA
    ParMetis     | Comp-intensive, comm-intensive | Fine-grained                 | Static single-partitioner                                  | Graph-based           | Minnesota
    PART         | Comp-intensive                 | Coarse-grained               | Static single-partitioner                                  | Domain-based          | Northwestern
    SAMRAI       | Comp-intensive, comm-intensive | Fine-grained, coarse-grained | Static single-partitioner                                  | Patch-based           | LLNL

    Ack. X. Li, OSU

  • Proactive & Reactive Runtime Management

    • Reactively and proactively manage and optimize application execution using current system and application state and predictive models of system behavior and application performance
      – runtime sensing of current system and application state
      – analyze, characterize, and anticipate system and application behavior
      – reactively and proactively adapt application execution

    • Application-sensitive adaptation
      – characterizes current application state
      – determines resource allocation, partitioning/mapping of application components, granularity, load-balancing, and communication mechanisms

    • System-sensitive adaptation
      – driven by system state and system performance predictions
      – determines application granularity, communication strategies based on bandwidth, and the nature of refinements based on the availability and “health” of computing elements

    • Performance prediction functions (S. Hariri, Univ. of AZ)

    “Investigating Autonomic Runtime Management Strategies for SAMR Applications”, S. Chandra, J. Yang, Y. Zhang, M. Parashar, and S. Hariri, International Journal of Parallel Processing, Editor: F. Darema, Kluwer Academic Publishers, 2005.

  • ARMaDA: Adaptive Application-Sensitive Management for SAMR Applications

    • Identify and characterize cliques
    • Define management objective and strategy
    • Hierarchically partition, map, and tune

    [Figure: ARMaDA architecture – application-state characterization (computation/communication, application dynamics, nature of adaptation) drives identification and characterization of clique regions, state analysis, and runtime prescriptions, which feed partitioning, scheduling, mapping, distribution, and redistribution of the dynamic driver application]

    Optimization repository:
    – Partitioning algorithms: ISP, LPA, HPA, G-MISP, …
    – Load-balancing algorithms: Greedy, Binpack, Level-based
    – Communication strategies: staggered sends, delayed waits
    – Clustering algorithms: segmentation/level based
    – Space-time hybrid schemes: application-level pipelining, application-level out-of-core

    State analysis considers: data migration, application locality, communication costs, load balancing, adaptive partitioning, adaptation overheads, memory requirements, granularity control

    * A clique region is a relatively homogeneous region in the SAMR grid hierarchy.

  • ARMaDA: Adaptive Application-Sensitive Management for SAMR Applications

    • Runtime application monitoring and characterization
      – computation/communication requirements, application dynamics, nature of adaptation, etc.

    • Deduction
      – map partitioners to application state

    • Adaptation (meta-partitioner)
      – dynamically select, configure, and invoke the “best” partitioner at runtime (see the sketch below)
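    A minimal sketch of the meta-partitioner loop: characterize the current application state, then select and invoke the partitioner registered for that state. The state keys and partitioner names follow the slide's vocabulary, but the mapping and interfaces are illustrative assumptions, not ARMaDA's actual policy tables.

```python
# Two stand-in partitioners from the optimization repository.
def partition_greedy(hierarchy, nprocs):
    return f"greedy partition of {hierarchy} over {nprocs} procs"

def partition_level_based(hierarchy, nprocs):
    return f"level-based partition of {hierarchy} over {nprocs} procs"

# Deduction: map characterized application state -> partitioner.
POLICY = {
    ("comp-dominated", "localized"): partition_greedy,
    ("comm-dominated", "scattered"): partition_level_based,
}

def characterize(hierarchy):
    """Placeholder: a real implementation derives these from the
    geometry of the grid hierarchy (see the next slide)."""
    return ("comp-dominated", "localized")

def meta_partition(hierarchy, nprocs):
    state = characterize(hierarchy)
    partitioner = POLICY.get(state, partition_greedy)  # fallback choice
    return partitioner(hierarchy, nprocs)

print(meta_partition("RM3D hierarchy", nprocs=64))
```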

  • Characterizing Application State at Runtime

    • Application state is characterized using operations on the geometry of the grid hierarchy
      – Computation/communication requirements
        • computationally intensive or communication dominated
      – Application dynamics
        • speed of changes in application refinement patterns
      – Nature of adaptation
        • scattered or localized refinements, affecting overheads

    • Fast and efficient characterization algorithms minimize overheads

    “Towards Autonomic Application-Sensitive Partitioning for SAMR Applications”, S. Chandra and M. Parashar, Journal of Parallel and Distributed Computing, Academic Press, Vol. 65, Issue 4, pp. 519 – 531, April 2005.

  • Reactive System Sensitive Partitioning

    • A cost model is used to calculate the relative capacities of nodes in terms of CPU, memory, and bandwidth availability

    • Relative capacity for node k:

        C_k = w_p P_k + w_m M_k + w_b B_k,   where   w_p + w_m + w_b = 1

      – w_p, w_m, and w_b are the weights associated with relative CPU, memory, and bandwidth availability, respectively

    • Evaluation
      – Linux-based 32-node Beowulf cluster with synthetic load generators
      – RM3D kernel, 128x32x32 base grid, 3 refinement levels, regrid every 4 steps
      – 18% improvement in execution time over the non-system-sensitive scheme
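    A small, runnable sketch of the relative-capacity computation above: each node's available CPU, memory, and bandwidth are normalized against the totals across nodes, combined with weights summing to 1, and used to size each node's share of the workload. The node values and weights below are made up for illustration.

```python
nodes = {
    "n0": {"cpu": 1.5, "mem": 16.0, "bw": 1.0},   # available, not raw specs
    "n1": {"cpu": 0.8, "mem": 8.0,  "bw": 1.0},
    "n2": {"cpu": 1.2, "mem": 4.0,  "bw": 0.5},
}
w_p, w_m, w_b = 0.5, 0.3, 0.2                      # w_p + w_m + w_b = 1

totals = {k: sum(n[k] for n in nodes.values()) for k in ("cpu", "mem", "bw")}

# C_k = w_p*P_k + w_m*M_k + w_b*B_k with P, M, B as relative availabilities;
# the capacities then sum to 1 across nodes.
capacity = {
    name: w_p * n["cpu"] / totals["cpu"]
        + w_m * n["mem"] / totals["mem"]
        + w_b * n["bw"]  / totals["bw"]
    for name, n in nodes.items()
}

total_work = 120_000                               # e.g., grid points
for name, c in capacity.items():
    print(f"{name}: C_k = {c:.3f} -> assign {c * total_work:,.0f} units")
```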

    [Figure: a resource monitoring tool feeds CPU, memory, and bandwidth availability to a capacity calculator, which combines them with the weights to produce the available capacities; the heterogeneous system-sensitive partitioner uses these, together with the application, to generate partitions]

    “Adaptive System-Sensitive Partitioning of AMR Applications on Heterogeneous Clusters”, S. Sinha and M. Parashar, Cluster Computing: The Journal of Networks, Software Tools, and Applications, Kluwer Academic Publishers, Vol. 5, Issue 4, pp. 343 - 352, 2002.

  • Adaptive Hierarchical Multiple-Partitioner Strategy (AHMP)

    • Rationale
      – Extends the Hierarchical Partitioning Algorithm (HPA)
        • reduce global communication overheads
        • enable incremental repartitioning
        • expose more concurrent communication and computation
      – Addresses spatial heterogeneity

    • Approach: divide-and-conquer (see the sketch after the figure below)
      – Identify clique regions and characterize their states through clustering
      – Select an appropriate partitioner for each clique region, matching the characteristics of partitioners and cliques
      – Adapt to the states of the computational domains and systems
      – Repartition and reschedule within the local resource group

    [Figure: AHMP flow – clustering (LBC/SBC) turns the grid hierarchy into a clique hierarchy; then, recursively for each clique: characterize the clique, select a partitioner from the partitioner repository using the selection policies, and partition the clique; repartitioning re-enters the loop]
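    A minimal sketch of AHMP's divide-and-conquer flow as read off the figure above: cluster the hierarchy into cliques, then for each clique characterize it, pick a partitioner from a repository via a selection policy, partition it, and recurse within local resource groups. All names, the selection policy, and the clique structure are illustrative assumptions.

```python
PARTITIONER_REPOSITORY = {
    "uniform":     lambda clique, procs: f"GPA on {clique['name']} -> {procs}",
    "deep-refine": lambda clique, procs: f"LPA on {clique['name']} -> {procs}",
}

def characterize(clique):
    # Selection-policy stand-in: deep refinement favors level-based partitioning.
    return "deep-refine" if clique["levels"] > 2 else "uniform"

def split(procs, n):
    """Evenly split a processor list into n contiguous groups."""
    size = max(1, len(procs) // max(n, 1))
    return [procs[i * size:(i + 1) * size] for i in range(n)] if n else []

def ahmp(clique, resource_group):
    print(PARTITIONER_REPOSITORY[characterize(clique)](clique, resource_group))
    # Recurse: each sub-clique is handled within its local resource group,
    # so later repartitioning stays local instead of global.
    for child, group in zip(clique["children"],
                            split(resource_group, len(clique["children"]))):
        ahmp(child, group)

root = {"name": "domain", "levels": 3, "children": [
    {"name": "clique-A", "levels": 1, "children": []},
    {"name": "clique-B", "levels": 3, "children": []},
]}
ahmp(root, resource_group=list(range(8)))
```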

    “Using Clustering to Address the Heterogeneity and Dynamism in Parallel SAMR Applications”, X. Li and M. Parashar, Proceedings of the 12th International Conference on High Performance Computing, Goa, India, December 2005.

  • AHMP: Operation

    [Figure: AHMP operation – the domain is partitioned and scheduled across resource groups RG1–RG4, with a different partitioner (e.g., GPA, LPA) selected per group; repartitioning and rescheduling then occur within each resource group]

    RG: resource group
    T^i_exe: estimated execution time for processor i in an RG

    The load-imbalance factor (LIF) for a resource group is defined by

        LIF(RG_k) = ( max_i T^i_exe - min_i T^i_exe ) / avg_i T^i_exe

    where the max, min, and average are taken over the processors in RG k.
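    A direct, runnable translation of the LIF definition above: for each resource group, the spread between the slowest and fastest processor, normalized by the group's average estimated execution time. The per-processor times and the repartitioning threshold are made-up illustrative values.

```python
def load_imbalance_factor(exec_times):
    """LIF = (max T_exe - min T_exe) / avg T_exe over one resource group."""
    avg = sum(exec_times) / len(exec_times)
    return (max(exec_times) - min(exec_times)) / avg

resource_groups = {
    "RG1": [10.2, 10.5, 9.8, 10.1],    # well balanced -> small LIF
    "RG2": [6.0, 14.0, 9.5, 10.5],     # imbalanced -> large LIF
}

for name, times in resource_groups.items():
    lif = load_imbalance_factor(times)
    # AHMP would repartition within the group when LIF grows too large;
    # the threshold here is an assumption for illustration.
    action = "repartition locally" if lif > 0.25 else "ok"
    print(f"{name}: LIF = {lif:.2f} ({action})")
```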

  • AHMP: Experimental Evaluation

    Experiment setup:
    – IBM SP4 cluster (DataStar at the San Diego Supercomputer Center, 1632 processors total)
    – SP4 (p655) node: 8 processors (1.5 GHz), 16 GB memory, 6.0 GFlops

    Performance gain:
    – AHMP: 30% - 42%
    – LPA: 20% - 29%

    (GPA: greedy partitioner; LPA: level-based partitioner)

    [Figure: execution time (seconds) for the RM3D application (1000 time steps, size = 256x64x64) on 64-1280 processors, comparing GPA, LPA, and SBC+AHMP]

    Overall performance, RM3D: refinement levels = 3, refinement factor = 2

  • SBC Clustering Time

    On average, the clustering cost is less than 0.01 second, while the execution time between regridding steps is typically > 10 seconds.

    Applications:
    – RM3D: turbulence, Caltech
    – RM2D: turbulence, Caltech
    – BL3D: oil-water flow, UT Austin
    – TP2D: heat transport
    – BL2D: oil-water flow, UT Austin
    – ENO2D: VTF, Caltech

    [Figure: clustering time (microseconds) for SBC across the SAMR applications rm3d, rm2d, bl3d, tp2d, bl2d, and eno2d]

  • Dispatch: Heterogeneous Workload Simulations

    • Applications with computational heterogeneity
      – reactive flows, such as simulations of hydrocarbon flames
      – pointwise processes operate at different timescales than diffusive and convective processes, and are hence approximately decoupled
      – operator-split integration methods in PDEs
      – highly uneven distribution of workload as a function of space
      – traditional partitioning/load-balancing approaches are not suitable
      – preserving spatial coupling reduces communication costs

    • Dispatch strategy (see the sketch below)
      – dynamic structured partitioning for parallel applications with computational heterogeneity
      – integrated with the GrACE computational framework
      – combines inverse space-filling-curve-based partitioning with in-situ weighted global load balancing
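    A minimal sketch of the Dispatch idea as described above: lay the blocks of the structured grid along a space-filling curve (for locality), weight each block by its pointwise computational load, and cut the curve into contiguous, weight-balanced segments, one per processor. The SFC ordering is simplified to a precomputed index and the weights are illustrative; this is not Dispatch's actual code.

```python
def weighted_sfc_partition(blocks, nprocs):
    """blocks: list of (sfc_index, weight). Returns per-processor lists of
    block ids, cutting the SFC-ordered sequence into balanced segments."""
    ordered = sorted(range(len(blocks)), key=lambda i: blocks[i][0])
    total = sum(w for _, w in blocks)
    target = total / nprocs                 # ideal weight per processor
    parts, current, acc = [[] for _ in range(nprocs)], 0, 0.0
    for i in ordered:
        parts[current].append(i)
        acc += blocks[i][1]
        # advance to the next processor once its cumulative target is met
        if acc >= target * (current + 1) and current < nprocs - 1:
            current += 1
    return parts

# Blocks near a flame front (high pointwise load) get large weights.
blocks = [(0, 1.0), (1, 1.0), (2, 8.0), (3, 9.0), (4, 1.5), (5, 1.0)]
for p, ids in enumerate(weighted_sfc_partition(blocks, nprocs=3)):
    print(f"P{p}: blocks {ids}, load {sum(blocks[i][1] for i in ids):.1f}")
```

    Because the cuts are made along the space-filling curve, each processor receives a spatially contiguous run of blocks, preserving the spatial coupling noted above while balancing the weighted load.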

    “Dynamic Structured Partitioning for Parallel Scientific Applications with Pointwise Varying Workloads”, S. Chandra, M. Parashar and J. Ray, Proceedings of 20th IEEE/ACM International Parallel and Distributed Processing Symposium, IEEE Computer Society Press, April 2006 .

  • Dispatch: Illustrative Reactive-Diffusion Kernel

    • R-D application
      – the model approximates the ignition of a CH4-air mixture in a non-uniform temperature field with 3 hot spots
      – high dynamism, varying workloads, space-time heterogeneity
      – reactive processes near the flame front have high computation requirements
      – solves a reaction-diffusion equation using operator-split integration

  • Dispatch: Experimental Evaluation

    • Evaluation setup
      – 256x256 base grid resolution, 200 iterations of the R-D kernel
      – 8-128 processors on the SDSC IBM SP4 “DataStar”
      – single uniform level
      – compare the performance of Dispatch against the homogeneous scheme

    • Evaluation results
      – Dispatch improves execution time by 11.23% - 46.34%
      – Dispatch considers the weights of pointwise processes, achieving smaller deviation in compute times and reduced synchronization times

  • Conclusion

    • Adaptive and interactive simulations can enable accurate solutions of physically realistic models of complex phenomena
      – Large-scale, efficient parallel/distributed implementations present significant challenges

    • Conceptual and implementation solutions for enabling adaptive and interactive simulations
      – Computational engines: HDDA/DAGH/GrACE/MACE
      – Adaptive runtime management/optimization: PRAGMA/ARMaDA
      – Interactive/collaborative monitoring/steering (computational collaboratories): Discover/DIOS

    • More information, publications, software
      – www.caip.rutgers.edu/~parashar/
      – [email protected]


  • The Team

    • TASSL, Rutgers University
      – Viraj Bhat, Andrez Quiroz Hernandez, Nanyan Jiang, Zhen Li (Jenny), Vincent Matossian, Sumir Chandra, Mingliang Wang, Li Zhang

    • Key CS collaborators
      – HPDC, University of Arizona: Salim Hariri
      – Biomedical Informatics, The Ohio State University: Tahsin Kurc, Joel Saltz
      – CS, University of Maryland: Alan Sussman

    • Key applications collaborators
      – CSM, University of Texas at Austin: Hector Klie, Mary Wheeler
      – IG, University of Texas at Austin: Mrinal Sen, Paul Stoffa
      – PPPL: R. Samtaney
      – CRL, Sandia National Laboratory, Livermore: Jaideep Ray, Johan Steensland
      – University of Arizona: T.-C. Jim Yeh
      – Rutgers University: S. Garofilini, A. Cutinho, N. Zabusky
