1
Histograms detect imbalances Variable sizes capture variance Lawrence Livermore National Laboratory Center for Applied Scientific Computing NC STATE UNIVERSITY Department of Computer Science An Open Framework for Scalable, Reconfigurable Performance Analysis Todd Gamblin , Prasun Ratn *† , Bronis R. de Supinski * , Martin Schulz * , Frank Mueller , Robert J. Fowler , Daniel Reed * Lawrence Livermore National Laboratory, North Carolina State University, Renaissance Computing Institute This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 . This work was also supported in part by NSF grants CNS-0410203, CCF-0429653 and CAREER CCR-0237570, as well as the SciDAC Performance Engineering Research Institute, DE-FC02-06ER25764. UCRL-POST-236200 ScalaTrace: Reconfigurable, Scalable Performance Analysis Size of machines is rapidly increasing (130,000+ processors) Tools will be overwhelmed with data Need scalable, online measurement and analysis Future Directions Large Scale MPI Applications Implicit Program Behavior Interactive User Tools and Visualization Understanding of Program Behavior and Optimizations Feedback, Steering & Global Configuration Data Collection & Analysis MPI tasks User Interaction & Storage Reduction Nodes Local Configuration Data Input: from Tracer, Profiler, Child Nodes Data Output: to Parent Node, Tool Communication paths OPS Attr.1 Attr.2 Attr.N extensible, self-describing format Attribute examples: PRSDs, timing, … Reducer Transformer Annotater Analyzer Compressor PRSDs Timing Loop Detection Annotator MPI PRSD Time Loop MPI PRSD Time MPI PRSD Time MPI PRSD Time Loop Information Time PRSD Types of tool components LOCAL Node INSTANTIATION GLOBAL REDUCTION Trace representation using Power Regular Section Descriptors (PRSDs) Idea: preserve time in compressed traces • Encode time deltas instead of timestamps • Create delta histograms automatically • Dynamically balance histograms •Time depends on path taken •Distinguish histograms by path Description of problem Bins generated for synthetic input span entire range with similar sample counts Number of histograms per record depends on number of possible call paths Sample bimodal distribution from UMT2K collectives ... MPI _Allreduce ( ... ) ; for ( ... ) { for ( ... ) { MPI _Send ( ... ) ; MPI _Recv ( ... ) ; } MPI _Barrier ( ... ) ; } ... Replay accuracy (NAS Benchmarks and UMT2K) LU FT IS FT Trace sizes (NAS Benchmarks and UMT2K) MG BT Near-constant trace size DT, EP, LU Sub-linear trace size CG, MG, FT Non-scalable trace size BT, IS, UMT2K Accurate replay: DT, EP, FT, LU, IS, UMT2K Replay inaccurate in MPI Time: CG, MG Replay inaccurate in Compute Time: BT ScalaReplay: Replay Using Histogram Timing Annotations Evolutionary Load-Balance Analysis with Scalable Data Collection Idea: Normalize measurements and models based on application semantics. Progress loops Typically outer loops in SPMD codes Indicate absolute progress towards some domain-specific goal Basis for comparison of load over time Effort Loops Variable-time loops, represent load Data-dependent execution Load Modeling Framework Visualization Effort Region Analysis Progress Event Annotator MPI PRSD Compressor Trace merge Wavelet Compression MPI PRSD Progress Local Effort Records PRSDs MPI PRSD MPI Tracer Load-annotated Trace MPI PRSD Progress Progress instrumentation User marks progress loop explicitly Effort modeled with code regions Dynamically detected at runtime Split MPI-op trace at collectives and wait operations User can further divide code into phases with instrumentation Models dislocation dynamics in crystals Dislocations discretized as nodes & arms Recursive spatial domain decomposition Balancer subdivides nodes/arms along x, then y, then z Compress . . . . . . . . . Scalable Compression Using Wavelet Transform Distributed per-process effort measurements are consolidated onto fewer processes Parallel wavelet transform Transformed data is locally compressed by thresholding, run-length and entropy coding Compressed representation is merged in parallel Load measurements are reconstructed on client Visualization tool allows intuitive data browsing Compress Compress Near-constant low-overhead MPI traces Annotate with additional reconfigurable data, e.g. Time using adaptive histograms (left) Progress rates, load imbalance (right) Compression Ratio Merge Time Error Process Id Progress Process Id Progress Process Id Progress Process Id Progress Process Id Progress Process Id Progress Checkpoint Force Computation Effort for 64-node ParaDiS Run Exact Reconstructed Re-partition Load Balance in ParaDiS z x y Flexible framework for application-specific tools Ability to select compression schemes for different fields, data types Foster combination, interaction between data collection, analysis mechanisms Adaptive, self-tuning runtime systems Near term: Adaptive wavelet transform topology, equivalence class detection Full time-sequence compression

Progress loops ScalaTrace: Reconfigurable, Scalable ...mueller/ftp/pub/mueller/papers/sc07poster.pdfTransformed data is locally compressed by thresholding, run-length and entropy coding

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Progress loops ScalaTrace: Reconfigurable, Scalable ...mueller/ftp/pub/mueller/papers/sc07poster.pdfTransformed data is locally compressed by thresholding, run-length and entropy coding

Histograms detect imbalances

Variable sizes capture variance

Lawrence LivermoreNational Laboratory

Center for Applied Scientific Computing

NC STATE UNIVERSITYDepartment of Computer Science

An Open Framework for Scalable, Reconfigurable Performance AnalysisTodd Gamblin‡, Prasun Ratn*†, Bronis R. de Supinski*, Martin Schulz*, Frank Mueller†, Robert J. Fowler‡, Daniel Reed‡

*Lawrence Livermore National Laboratory, †North Carolina State University, ‡Renaissance Computing Institute

This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.This work was also supported in part by NSF grants CNS-0410203, CCF-0429653 and CAREER CCR-0237570, as well as the SciDAC Performance Engineering Research Institute, DE-FC02-06ER25764.

UCRL-POST-236200

ScalaTrace: Reconfigurable,Scalable Performance Analysis

Size of machines is rapidly increasing(130,000+ processors)

Tools will be overwhelmed with data Need scalable, online measurement and analysis

Future Directions

Large Scale MPI Applications Implicit Program Behavior

Interactive User Tools and Visualization Understanding of Program Behavior and Optimizations

Feedback, Steering & G

lobal Configuration

Dat

a C

olle

ctio

n &

Ana

lysi

s

…MPI tasks

User Interaction& Storage

ReductionNodes

Local Configuration

Data Input: from Tracer, Profiler, Child Nodes

Data Output: to Parent Node, Tool

Communication pathsOPS Attr.1 Attr.2 Attr.N…

extensible, self-describing formatAttribute examples: PRSDs, timing, …Reducer

Transformer

Annotater

Analyzer

Compressor PRSDs Timing

Loop Detection

Annotator

MPI PRSD Time Loop

MPI PRSD Time

MPI PRSD Time

MPI PRSD Time

LoopInformation

Time

PRSD

Types of tool components

LOCAL Node INSTANTIATION

GLOBALREDUCTION

Trace representation using Power Regular Section Descriptors (PRSDs)

Idea: preserve time in compressed traces

• Encode time deltas instead of timestamps

• Create delta histograms automatically

• Dynamically balance histograms

•Time depends on path taken

•Distinguish histograms by path

Description of problem

Bins generated for synthetic input spanentire range with similar sample counts

Number of histograms per record depends onnumber of possible call paths

Sample bimodal distribution from UMT2K collectives

... MPI _Allreduce ( ... ) ; for ( ... ) { for ( ... ) { MPI _Send ( ... ) ; MPI _Recv ( ... ) ; } MPI _Barrier ( ... ) ; } ...

Replay accuracy (NAS Benchmarks and UMT2K)

IS

LU

FT

IS

FT

Trace sizes (NAS Benchmarks and UMT2K)

MG

BT

Near-constant trace size

DT, EP, LU

Sub-linear trace sizeCG, MG, FT

Non-scalable trace sizeBT, IS, UMT2K

Accurate replay: DT, EP,FT, LU, IS, UMT2K

Replay inaccurate in MPITime: CG, MG

Replay inaccurate inCompute Time: BT

ScalaReplay: Replay UsingHistogram Timing Annotations

Evolutionary Load-Balance Analysiswith Scalable Data Collection

Idea: Normalize measurements andmodels based on applicationsemantics.

Progress loops• Typically outer loops in SPMD codes• Indicate absolute progress towards some

domain-specific goal• Basis for comparison of load over time

Effort Loops• Variable-time loops, represent load• Data-dependent execution

Load Modeling Framework

Visualization

Effort Region Analysis

Progress EventAnnotator

MPI PRSD

Compressor

Trace merge WaveletCompression

MPI PRSD Progress

LocalEffort

Records

PRSDs

MPI

PRSD

MPI Tracer

Load-annotatedTrace

MPI PRSD Progress

Progress instrumentation• User marks progress loop explicitly

Effort modeled with code regions• Dynamically detected at runtime• Split MPI-op trace at collectives and wait operations• User can further divide code into phases with instrumentation

Models dislocation dynamics in crystals• Dislocations discretized as nodes & arms• Recursive spatial domain decomposition• Balancer subdivides nodes/arms along x, then y, then z

Com

press

. . .

. . .

. . .

Scalable Compression Using Wavelet TransformDistributed per-process effort measurements areconsolidated onto fewer processes

Parallel wavelet transform

Transformed data is locally compressed bythresholding, run-length and entropy coding

Compressed representation is merged in parallel

• Load measurements are reconstructed on client• Visualization tool allows intuitive data browsing

Com

press

Com

press

Near-constant low-overhead MPI traces

Annotate with additional reconfigurable data, e.g.

Time using adaptive histograms (left)

Progress rates, load imbalance (right)

Compression Ratio

Merge Time

Error

Process IdProgress

Process IdProgress

Process IdProgressProcess IdProgress

Process IdProgress

Process IdProgress

Checkpoint

ForceComputation

Effort for 64-node ParaDiS RunExact Reconstructed

Re-partition

Load Balance in ParaDiSz

x

y

Flexible framework for application-specific tools

Ability to select compression schemes fordifferent fields, data types

Foster combination, interaction between datacollection, analysis mechanisms

Adaptive, self-tuning runtime systems

Near term:

Adaptive wavelet transform topology,equivalence class detection

Full time-sequence compression