Distributed Adaptive Simulations using Structured Adaptive Mesh-Refinement (SAMR)
Manish Parashar, The Applied Software Systems Laboratory
ECE/CAIP, Rutgers University, http://www.caip.rutgers.edu/TASSL
(Ack: NSF, DoE, NIH, DoD)


  • Distributed Adaptive Simulations using Structured Adaptive Mesh-Refinement (SAMR)

    Manish Parashar
    The Applied Software Systems Laboratory
    ECE/CAIP, Rutgers University
    http://www.caip.rutgers.edu/TASSL

    (Ack: NSF, DoE, NIH, DoD)

  • Overview

    • Computational engines for SAMR applications – distributed, dynamic data-management

    • Runtime (reactive and proactive) management
      – dynamic (application- and system-sensitive) partitioning and load-balancing
      – AHMP – Adaptive Hierarchical Meta-Partitioning
      – Dispatch – addressing point-wise varying loads

    • Conclusion

  • Adaptive Mesh Refinement

    • Start with a base coarse grid with the minimum acceptable resolution

    • Tag regions in the domain requiring additional resolution, cluster the tagged cells, and fit finer grids over these clusters

    • Proceed recursively so that regions on the finer grid requiring more resolution are similarly tagged and even finer grids are overlaid on these regions

    • Resulting grid structure is a dynamic adaptive grid hierarchy

    The Berger-Oliger Algorithm

    Recursive Procedure Integrate(level)
      If (RegridTime) Regrid
      Step Δt on all grids at level “level”
      If (level + 1 exists)
        Integrate(level + 1)
        Update(level, level + 1)
      End if
    End Recursion

    level = 0
    Integrate(level)
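    The following is a minimal, runnable Python sketch of this recursion. The regrid/advance/update bodies are placeholders, and the refinement factor, level count, and regrid interval are illustrative assumptions, not GrACE's actual API; a real SAMR code would advance PDE state on every patch at the given level.

```python
# Berger-Oliger time stepping: each finer level takes REFINEMENT_FACTOR
# sub-steps of size dt/REFINEMENT_FACTOR, then updates its parent level.

REFINEMENT_FACTOR = 2   # sub-steps per finer level (assumed)
MAX_LEVEL = 2           # 3 levels: 0, 1, 2 (assumed)
REGRID_INTERVAL = 4     # regrid every 4 steps on a level (assumed)

step_count = {lvl: 0 for lvl in range(MAX_LEVEL + 1)}

def regrid(level):
    print(f"  regrid at level {level}")

def advance(level, dt):
    print(f"  step dt={dt:.4f} on all grids at level {level}")

def update(coarse, fine):
    # Inject/average the fine solution back onto the coarse grids.
    print(f"  update level {coarse} from level {fine}")

def integrate(level, dt):
    """One step at `level`, recursing into finer levels."""
    if step_count[level] % REGRID_INTERVAL == 0 and step_count[level] > 0:
        regrid(level)
    advance(level, dt)
    step_count[level] += 1
    if level + 1 <= MAX_LEVEL:
        for _ in range(REFINEMENT_FACTOR):
            integrate(level + 1, dt / REFINEMENT_FACTOR)
        update(level, level + 1)

for _ in range(2):          # two coarse time steps
    integrate(0, dt=0.1)
```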

    Structured Adaptive Mesh Refinement (SAMR)

  • Related Work: SAMR Infrastructures

    • SAMRAI, Lawrence Livermore National Lab
      – Object-oriented structured adaptive mesh refinement application infrastructure
      – Modules handle visualization, mesh management, integration, geometry, etc.

    • Chombo, Lawrence Berkeley National Lab
      – Set of tools for implementing finite difference methods for PDE solutions
      – Distributed infrastructure for parallel calculations over block-structured, adaptively refined grids

    • Paramesh, NASA Goddard Space Flight Center
      – Fortran 90 subroutines to extend existing serial code into parallel AMR code
      – Hierarchy of Cartesian mesh grids which form the nodes of a tree data structure

    • Batsrus, University of Michigan
      – Block-based approach with adaptation distributed over processors in the computational pool in phases

    • GrACE, Rutgers University
      – Adaptive computational and data-management engine for structured grids
      – Distributed adaptive grid hierarchy, grid function, and geometry abstractions
      – Parallel support for AMR computations in various scientific domains

  • GrACE: Adaptive Computational Engine for SAMR

    • Semantically Specialized DSM
      – Application-centric programming abstractions
      – Regular access semantics to dynamic, heterogeneous, and physically distributed data objects
        • Encapsulate distribution, communication, and interaction
      – Coupling/interactions between multiple physics, models, structures, scales

    • Distributed Shared Objects
      – Virtual Hierarchical Distributed Dynamic Array
        • Hierarchical index-space + extendible hashing + heterogeneous objects
      – Multifaceted objects
        • Integration of computation + data + visualization + interaction

    • Adaptive Run-time Management
      – Application- and system-sensitive management
        • Algorithms, partitioners, load-balancing, communications, etc.
        • Policy-based automated adaptations

    1024x128x128, 3 levels, 2K PEs: time ~15%, memory ~25%

    Richtmyer-Meshkov (3D)

    IPARS Multi-block Oil Reservoir Simulation

  • Data-Management for Adaptive Applications

    • Application requirements
      – Adaptive Finite Difference
        • Large hierarchical objects; dynamic size, orientation, and interactions
      – Adaptive Finite Element
        • Dynamic number of objects of dynamic size
      – Adaptive Fast Multipole
        • Dynamic number of small objects with dynamic interactions

    • Traditional data-management for computation – multi-dimensional arrays
      – data set, index space, injective function

    • Data-management abstraction for distributed adaptive applications
      – An extended definition of an array where
        • Each element of the array can itself be an array
        • Each element of the array can be an object of arbitrary and variable size
        • The array can grow and shrink dynamically
        • The array is distributed
      – Hierarchical Distributed Dynamic Array (sketched below)
        • Performance, Performance, Performance
        • Locality, Locality, Locality
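    A toy, single-process sketch of what such a hierarchical dynamic array abstraction might look like. The class name and methods are illustrative assumptions, not the actual HDDA/DAGH interface, and distribution across processors (the "Distributed" in HDDA) is omitted.

```python
# Each element, addressed by a (level, index) key, may itself hold a
# nested array or an arbitrary object; the structure grows and shrinks
# dynamically. Sparse dict storage stands in for the hierarchical
# index-space + extendible hashing described on the GrACE slide.

class HierarchicalDynamicArray:
    def __init__(self):
        self._store = {}   # only elements that exist consume memory

    def put(self, level, index, value):
        """Insert or replace the element at (level, index). `value` may be
        another HierarchicalDynamicArray (a finer level) or any object
        of arbitrary and variable size."""
        self._store[(level, index)] = value

    def get(self, level, index):
        return self._store.get((level, index))

    def remove(self, level, index):
        """Shrink dynamically: drop an element when its region is derefined."""
        self._store.pop((level, index), None)

# Usage: a coarse patch at level 0 whose refined region is itself an array.
h = HierarchicalDynamicArray()
fine = HierarchicalDynamicArray()
fine.put(1, (4, 4), {"patch": "level-1 data"})
h.put(0, (2, 2), fine)                 # element that is itself an array
h.put(0, (2, 3), [0.0] * 16)           # element of arbitrary/variable size
print(h.get(0, (2, 2)).get(1, (4, 4)))
```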

    “A Common Data Management Infrastructure for Parallel Adaptive Algorithms for PDE Solutions,” M. Parashar, J. C. Browne, C. Edwards, and K. Klimkowsky, Proceedings of Supercomputing ’97, San Jose, CA, November 1997.

  • MACE: Supporting Dynamic Coupling/Interactions

    • High-Performance Geometry-based Shared Spaces
      – Models/numerics, as well as the interactions, are typically based on the geometry of the discretized domain
      – Use SFCs to create a distributed directory of shared geometric regions
      – Processors can create shared regions and can read and write objects related to a shared region – e.g., a mortar grid
      – Complements MPI, OpenMP, PVM, etc.

    Multi-numerics / Multi-physics / Multi-scale
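    A minimal sketch of the geometry-based shared-space idea: a region is mapped to a directory home by its space-filling-curve (Morton/Z-order) index, so nearby regions tend to share a home. The function names and the write/read interface are illustrative assumptions, not MACE's API.

```python
# Morton (Z-order) SFC key: interleave the bits of (x, y) so that
# spatially nearby regions get nearby keys, preserving locality.

def morton_key(x, y, bits=16):
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

class GeometricSharedSpace:
    def __init__(self, num_procs):
        self.num_procs = num_procs
        self.directory = {}          # home -> [(region, object), ...]

    def home(self, region):
        """Directory home of a region = hash of its SFC key."""
        x, y = region
        return morton_key(x, y) % self.num_procs

    def write(self, region, obj):
        self.directory.setdefault(self.home(region), []).append((region, obj))

    def read(self, region):
        return [o for r, o in self.directory.get(self.home(region), []) if r == region]

space = GeometricSharedSpace(num_procs=8)
space.write((12, 7), "mortar-grid values")   # producer registers a shared region
print(space.read((12, 7)))                   # consumer reads by geometry
```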

    “A Dynamic Geometry-Based Shared Space Interaction Framework for Parallel Scientific Applications,” L. Zhang* and M. Parashar, Proceedings of the 11th International Conference on High Performance Computing (HiPC 2004), Bangalore, India, December 2004.

  • A Selection of SAMR Applications Enabled

    • Multi-block grid structure and oil concentration contours (IPARS, M. Peszynska, UT Austin)

    • Blast wave in the presence of a uniform magnetic field – 3 levels of refinement (Zeus + GrACE + Cactus, P. Li, NCSA, UCSD)

    • Mixture of H2 and air in stoichiometric proportions with a non-uniform temperature field (GrACE + CCA, Jaideep Ray, SNL, Livermore)

    • Richtmyer-Meshkov – detonation in a deforming tube – 3 levels; the z=0 plane is visualized on the right (VTF + GrACE, R. Samtaney, CIT)

  • SAMR: Spatial and Temporal Heterogeneity and Dynamics

    [Figure: snapshots of the adaptive grid structure at regrid steps 5, 96, 114, 176, and 201, with a plot of total load (in units of 100k) versus regrid step for RM3D (200 regrid steps, size = 256x64x64)]

    Spatial and temporal heterogeneity and load dynamics of a 3D Richtmyer-Meshkov simulation using SAMR

  • Analysis of Computation and Communication Patterns of Distributed SAMR Applications

    [Figure: timing diagram showing interleaved computation and communication time slots for processors P1 and P2 across refinement levels, with an enlarged view detailing intra-level communication, inter-level communication, and synchronization]

    * The number in each time-slot box denotes the refinement level of the load under processing.
    * In this case, the number of refinement levels is 3 and the refinement factor is 2.
    * The communication time consists of three types: intra-level, inter-level, and synchronization cost.

    Timing diagram for distributed SAMR

  • Runtime Management for SAMR Applications

    • Partitioning/load-balancing strategy
      – maximize parallelism; minimize inter/intra-level communication; maintain inter/intra-level locality; support efficient repartitioning; …
      – the partitioning/load-balancing strategy depends on the structure of the grid hierarchy and the current application/system state [IEEE TPDS 2002]

    • Granularity
      – patch size, AMR efficiency, comm./comp. ratio, overhead, node performance, load balance, …

    • Number of processors / load per processor
      – dynamic allocation/configuration/management
      – 1000+ processors from the beginning, or “on-demand”

    • Hierarchical “emergent” distributions using dynamic processor groups
    • Communication optimizations / latency tolerance / multithreading
    • Availability, capabilities, and state of system resources

  • Partitioning Approaches

    Ack. X. Li, OSU

  • SAMR – Partitioning Systems

    System       | Execution Mode                 | Granularity                  | Partitioner Organization                                   | Decomposition         | Institute
    -------------|--------------------------------|------------------------------|------------------------------------------------------------|-----------------------|-------------
    CHARM        | Comp-intensive                 | Coarse-grained               | Static single-partitioner                                  | Domain-based          | UIUC
    Chombo       | Comp-intensive                 | Fine-grained, coarse-grained | Static single-partitioner                                  | Domain-based          | LBNL
    HRMS/GrACE   | Comp-intensive                 | Fine-grained, coarse-grained | Adaptive hierarchical multi-partitioner, hybrid strategies | Domain-based, hybrid  | Rutgers
    Nature+Fable | Comp-intensive                 | Coarse-grained               | Single meta-partitioner                                    | Domain-based, hybrid  | Sandia
    ParaMesh     | Comp-intensive                 | Fine-grained, coarse-grained | Static single-partitioner                                  | Domain-based          | NASA
    ParMetis     | Comp-intensive, comm-intensive | Fine-grained                 | Static single-partitioner                                  | Graph-based           | Minnesota
    PART         | Comp-intensive                 | Coarse-grained               | Static single-partitioner                                  | Domain-based          | Northwestern
    SAMRAI       | Comp-intensive, comm-intensive | Fine-grained, coarse-grained | Static single-partitioner                                  | Patch-based           | LLNL

    Ack. X. Li, OSU

  • Proactive & Reactive Runtime Management

    • Reactively and proactively manage and optimize application execution using current system and application state and predictive models of system behavior and application performance
      – runtime sensing of current system and application state
      – analyze, characterize, and anticipate system and application behavior
      – reactively and proactively adapt application execution

    • Application-sensitive adaptation
      – characterizes current application state
      – determines resource allocation, partitioning/mapping of application components, granularity, load-balancing, and communication mechanisms

    • System-sensitive adaptation
      – driven by system state and system performance predictions
      – determines application granularity, communication strategies based on bandwidth, and the nature of refinements based on the availability and “health” of computing elements

    • Performance prediction functions (S. Hariri, Univ. of AZ)

    “Investigating Autonomic Runtime Management Strategies for SAMR Applications”, S. Chandra, J. Yang, Y. Zhang, M. Parashar, and S. Hariri, International Journal of Parallel Processing, Editor: F. Darema, Kluwer Academic Publishers, 2005.

  • ARMaDA: Adaptive Application-Sensitive Management for SAMR Applications

    • Identify and characterize cliques
    • Define management objective and strategy
    • Hierarchically partition, map, and tune

    [Figure: ARMaDA architecture – application-state characterization (computation/communication, application dynamics, nature of adaptation) drives identification and characterization of clique regions, state analysis, and runtime prescriptions, which feed partitioning, scheduling, mapping, distribution, and redistribution of the dynamic driver application]

    Optimization repository:
    – Partitioning algorithms: ISP, LPA, HPA, G-MISP, …
    – Load-balancing algorithms: Greedy, Binpack, Level-based
    – Communication strategies: staggered sends, delayed waits
    – Clustering algorithms: segmentation/level based
    – Space-time hybrid schemes: application-level pipelining, application-level out-of-core

    State analysis considers: data migration, application locality, communication costs, load balancing, adaptive partitioning, adaptation overheads, memory requirements, granularity control

    * A clique region is a relatively homogeneous region in the SAMR grid hierarchy.

  • ARMaDA: Adaptive Application-Sensitive Management for SAMR Applications

    • Runtime application monitoring and characterization
      – computation/communication requirements, application dynamics, nature of adaptation, etc.

    • Deduction
      – map partitioners to application state

    • Adaptation (meta-partitioner)
      – dynamically select, configure, and invoke the “best” partitioner at runtime (see the sketch below)
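    A minimal sketch of the meta-partitioner loop: characterize the current application state, then select and invoke the partitioner registered for that state. The state keys and partitioner names follow the slide's vocabulary, but the mapping and interfaces are illustrative assumptions, not ARMaDA's actual policy tables.

```python
# Two stand-in partitioners from the optimization repository.
def partition_greedy(hierarchy, nprocs):
    return f"greedy partition of {hierarchy} over {nprocs} procs"

def partition_level_based(hierarchy, nprocs):
    return f"level-based partition of {hierarchy} over {nprocs} procs"

# Deduction: map characterized application state -> partitioner.
POLICY = {
    ("comp-dominated", "localized"): partition_greedy,
    ("comm-dominated", "scattered"): partition_level_based,
}

def characterize(hierarchy):
    """Placeholder: a real implementation derives these from the
    geometry of the grid hierarchy (see the next slide)."""
    return ("comp-dominated", "localized")

def meta_partition(hierarchy, nprocs):
    state = characterize(hierarchy)
    partitioner = POLICY.get(state, partition_greedy)  # fallback choice
    return partitioner(hierarchy, nprocs)

print(meta_partition("RM3D hierarchy", nprocs=64))
```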

  • Characterizing Application State at Runtime

    • Application state is characterized using operations on the geometry of the grid hierarchy
      – Computation/communication requirements
        • computationally intensive or communication dominated
      – Application dynamics
        • speed of changes in application refinement patterns
      – Nature of adaptation
        • scattered or localized refinements, affecting overheads

    • Fast and efficient characterization algorithms minimize overheads

    “Towards Autonomic Application-Sensitive Partitioning for SAMR Applications”, S. Chandra and M. Parashar, Journal of Parallel and Distributed Computing, Academic Press, Vol. 65, Issue 4, pp. 519 – 531, April 2005.

  • Reactive System Sensitive Partitioning

    • A cost model is used to calculate the relative capacities of nodes in terms of CPU, memory, and bandwidth availability

    • Relative capacity for node k:

        C_k = w_p P_k + w_m M_k + w_b B_k,   where   w_p + w_m + w_b = 1

      – w_p, w_m, and w_b are the weights associated with relative CPU, memory, and bandwidth availability, respectively

    • Evaluation
      – Linux-based 32-node Beowulf cluster with synthetic load generators
      – RM3D kernel, 128x32x32 base grid, 3 refinement levels, regrid every 4 steps
      – 18% improvement in execution time over the non-system-sensitive scheme
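    A small, runnable sketch of the relative-capacity computation above: each node's available CPU, memory, and bandwidth are normalized against the totals across nodes, combined with weights summing to 1, and used to size each node's share of the workload. The node values and weights below are made up for illustration.

```python
nodes = {
    "n0": {"cpu": 1.5, "mem": 16.0, "bw": 1.0},   # available, not raw specs
    "n1": {"cpu": 0.8, "mem": 8.0,  "bw": 1.0},
    "n2": {"cpu": 1.2, "mem": 4.0,  "bw": 0.5},
}
w_p, w_m, w_b = 0.5, 0.3, 0.2                      # w_p + w_m + w_b = 1

totals = {k: sum(n[k] for n in nodes.values()) for k in ("cpu", "mem", "bw")}

# C_k = w_p*P_k + w_m*M_k + w_b*B_k with P, M, B as relative availabilities;
# the capacities then sum to 1 across nodes.
capacity = {
    name: w_p * n["cpu"] / totals["cpu"]
        + w_m * n["mem"] / totals["mem"]
        + w_b * n["bw"]  / totals["bw"]
    for name, n in nodes.items()
}

total_work = 120_000                               # e.g., grid points
for name, c in capacity.items():
    print(f"{name}: C_k = {c:.3f} -> assign {c * total_work:,.0f} units")
```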

    [Figure: a resource monitoring tool feeds CPU, memory, and bandwidth availability to a capacity calculator, which combines them with the weights to produce the available capacities; the heterogeneous system-sensitive partitioner uses these, together with the application, to generate partitions]

    “Adaptive System-Sensitive Partitioning of AMR Applications on Heterogeneous Clusters”, S. Sinha and M. Parashar, Cluster Computing: The Journal of Networks, Software Tools, and Applications, Kluwer Academic Publishers, Vol. 5, Issue 4, pp. 343 - 352, 2002.

  • Adaptive Hierarchical Multiple-Partitioner Strategy (AHMP)

    • Rationale
      – Extends the Hierarchical Partitioning Algorithm (HPA)
        • reduce global communication overheads
        • enable incremental repartitioning
        • expose more concurrent communication and computation
      – Addresses spatial heterogeneity

    • Approach: divide-and-conquer (see the sketch after the figure below)
      – Identify clique regions and characterize their states through clustering
      – Select an appropriate partitioner for each clique region, matching the characteristics of partitioners and cliques
      – Adapt to the states of the computational domains and systems
      – Repartition and reschedule within the local resource group

    [Figure: AHMP flow – clustering (LBC/SBC) turns the grid hierarchy into a clique hierarchy; then, recursively for each clique: characterize the clique, select a partitioner from the partitioner repository using the selection policies, and partition the clique; repartitioning re-enters the loop]
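    A minimal sketch of AHMP's divide-and-conquer flow as read off the figure above: cluster the hierarchy into cliques, then for each clique characterize it, pick a partitioner from a repository via a selection policy, partition it, and recurse within local resource groups. All names, the selection policy, and the clique structure are illustrative assumptions.

```python
PARTITIONER_REPOSITORY = {
    "uniform":     lambda clique, procs: f"GPA on {clique['name']} -> {procs}",
    "deep-refine": lambda clique, procs: f"LPA on {clique['name']} -> {procs}",
}

def characterize(clique):
    # Selection-policy stand-in: deep refinement favors level-based partitioning.
    return "deep-refine" if clique["levels"] > 2 else "uniform"

def split(procs, n):
    """Evenly split a processor list into n contiguous groups."""
    size = max(1, len(procs) // max(n, 1))
    return [procs[i * size:(i + 1) * size] for i in range(n)] if n else []

def ahmp(clique, resource_group):
    print(PARTITIONER_REPOSITORY[characterize(clique)](clique, resource_group))
    # Recurse: each sub-clique is handled within its local resource group,
    # so later repartitioning stays local instead of global.
    for child, group in zip(clique["children"],
                            split(resource_group, len(clique["children"]))):
        ahmp(child, group)

root = {"name": "domain", "levels": 3, "children": [
    {"name": "clique-A", "levels": 1, "children": []},
    {"name": "clique-B", "levels": 3, "children": []},
]}
ahmp(root, resource_group=list(range(8)))
```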

    “Using Clustering to Address the Heterogeneity and Dynamism in Parallel SAMR Applications”, X. Li and M. Parashar, Proceedings of the 12th International Conference on High Performance Computing, Goa, India, December 2005.

  • AHMP: Operation

    [Figure: AHMP operation – the domain is partitioned and scheduled across resource groups RG1–RG4, with a different partitioner (e.g., GPA, LPA) selected per group; repartitioning and rescheduling then occur within each resource group]

    RG: resource group
    T^i_exe: estimated execution time for processor i in an RG

    The load-imbalance factor (LIF) for a resource group is defined by

        LIF(RG_k) = ( max_i T^i_exe - min_i T^i_exe ) / avg_i T^i_exe

    where the max, min, and average are taken over the processors in RG k.
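    A direct, runnable translation of the LIF definition above: for each resource group, the spread between the slowest and fastest processor, normalized by the group's average estimated execution time. The per-processor times and the repartitioning threshold are made-up illustrative values.

```python
def load_imbalance_factor(exec_times):
    """LIF = (max T_exe - min T_exe) / avg T_exe over one resource group."""
    avg = sum(exec_times) / len(exec_times)
    return (max(exec_times) - min(exec_times)) / avg

resource_groups = {
    "RG1": [10.2, 10.5, 9.8, 10.1],    # well balanced -> small LIF
    "RG2": [6.0, 14.0, 9.5, 10.5],     # imbalanced -> large LIF
}

for name, times in resource_groups.items():
    lif = load_imbalance_factor(times)
    # AHMP would repartition within the group when LIF grows too large;
    # the threshold here is an assumption for illustration.
    action = "repartition locally" if lif > 0.25 else "ok"
    print(f"{name}: LIF = {lif:.2f} ({action})")
```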

  • AHMP: Experimental Evaluation

    Experiment setup:
    – IBM SP4 cluster (DataStar at the San Diego Supercomputer Center, 1632 processors total)
    – SP4 (p655) node: 8 processors (1.5 GHz), 16 GB memory, 6.0 GFlops

    Performance gain:
    – AHMP: 30% - 42%
    – LPA: 20% - 29%

    (GPA: greedy partitioner; LPA: level-based partitioner)

    [Figure: execution time (seconds) for the RM3D application (1000 time steps, size = 256x64x64) on 64-1280 processors, comparing GPA, LPA, and SBC+AHMP]

    Overall performance, RM3D: refinement levels = 3, refinement factor = 2

  • SBC Clustering Time

    On average, the clustering cost is less than 0.01 second, while the execution time between regridding steps is typically > 10 seconds.

    Applications:
    – RM3D: turbulence, Caltech
    – RM2D: turbulence, Caltech
    – BL3D: oil-water flow, UT Austin
    – TP2D: heat transport
    – BL2D: oil-water flow, UT Austin
    – ENO2D: VTF, Caltech

    [Figure: clustering time (microseconds) for SBC across the SAMR applications rm3d, rm2d, bl3d, tp2d, bl2d, and eno2d]

  • Dispatch: Heterogeneous Workload Simulations

    • Applications with computational heterogeneity
      – reactive flows, such as simulations of hydrocarbon flames
      – pointwise processes operate at different timescales than diffusive and convective processes, and are hence approximately decoupled
      – operator-split integration methods in PDEs
      – highly uneven distribution of workload as a function of space
      – traditional partitioning/load-balancing approaches are not suitable
      – preserving spatial coupling reduces communication costs

    • Dispatch strategy (see the sketch below)
      – dynamic structured partitioning for parallel applications with computational heterogeneity
      – integrated with the GrACE computational framework
      – combines inverse space-filling-curve-based partitioning with in-situ weighted global load balancing
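    A minimal sketch of the Dispatch idea as described above: lay the blocks of the structured grid along a space-filling curve (for locality), weight each block by its pointwise computational load, and cut the curve into contiguous, weight-balanced segments, one per processor. The SFC ordering is simplified to a precomputed index and the weights are illustrative; this is not Dispatch's actual code.

```python
def weighted_sfc_partition(blocks, nprocs):
    """blocks: list of (sfc_index, weight). Returns per-processor lists of
    block ids, cutting the SFC-ordered sequence into balanced segments."""
    ordered = sorted(range(len(blocks)), key=lambda i: blocks[i][0])
    total = sum(w for _, w in blocks)
    target = total / nprocs                 # ideal weight per processor
    parts, current, acc = [[] for _ in range(nprocs)], 0, 0.0
    for i in ordered:
        parts[current].append(i)
        acc += blocks[i][1]
        # advance to the next processor once its cumulative target is met
        if acc >= target * (current + 1) and current < nprocs - 1:
            current += 1
    return parts

# Blocks near a flame front (high pointwise load) get large weights.
blocks = [(0, 1.0), (1, 1.0), (2, 8.0), (3, 9.0), (4, 1.5), (5, 1.0)]
for p, ids in enumerate(weighted_sfc_partition(blocks, nprocs=3)):
    print(f"P{p}: blocks {ids}, load {sum(blocks[i][1] for i in ids):.1f}")
```

    Because the cuts are made along the space-filling curve, each processor receives a spatially contiguous run of blocks, preserving the spatial coupling noted above while balancing the weighted load.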

    “Dynamic Structured Partitioning for Parallel Scientific Applications with Pointwise Varying Workloads”, S. Chandra, M. Parashar and J. Ray, Proceedings of 20th IEEE/ACM International Parallel and Distributed Processing Symposium, IEEE Computer Society Press, April 2006 .

  • Dispatch: Illustrative Reactive-Diffusion Kernel

    • R-D application
      – the model approximates the ignition of a CH4-air mixture in a non-uniform temperature field with 3 hot spots
      – high dynamism, varying workloads, space-time heterogeneity
      – reactive processes near the flame front have high computation requirements
      – solves a reaction-diffusion equation using operator-split integration

  • Dispatch: Experimental Evaluation

    • Evaluation setup
      – 256x256 base grid resolution, 200 iterations of the R-D kernel
      – 8-128 processors on the SDSC IBM SP4 “DataStar”
      – single uniform level
      – compare the performance of Dispatch against the homogeneous scheme

    • Evaluation results
      – Dispatch improves execution time by 11.23% - 46.34%
      – Dispatch considers the weights of pointwise processes, achieving smaller deviation in compute times and reduced synchronization times

  • Conclusion

    • Adaptive and interactive simulations can enable accurate solutions of physically realistic models of complex phenomena
      – Large-scale, efficient parallel/distributed implementations present significant challenges

    • Conceptual and implementation solutions for enabling adaptive and interactive simulations
      – Computational engines: HDDA/DAGH/GrACE/MACE
      – Adaptive runtime management/optimization: PRAGMA/ARMaDA
      – Interactive/collaborative monitoring/steering (computational collaboratories): Discover/DIOS

    • More information, publications, software
      – www.caip.rutgers.edu/~parashar/
      – [email protected]


  • The Team

    • TASSL, Rutgers University
      – Viraj Bhat, Andrez Quiroz Hernandez, Nanyan Jiang, Zhen Li (Jenny), Vincent Matossian, Sumir Chandra, Mingliang Wang, Li Zhang

    • Key CS collaborators
      – HPDC, University of Arizona: Salim Hariri
      – Biomedical Informatics, The Ohio State University: Tahsin Kurc, Joel Saltz
      – CS, University of Maryland: Alan Sussman

    • Key applications collaborators
      – CSM, University of Texas at Austin: Hector Klie, Mary Wheeler
      – IG, University of Texas at Austin: Mrinal Sen, Paul Stoffa
      – PPPL: R. Samtaney
      – CRL, Sandia National Laboratory, Livermore: Jaideep Ray, Johan Steensland
      – University of Arizona: T.-C. Jim Yeh
      – Rutgers University: S. Garofilini, A. Cutinho, N. Zabusky
