
Histograms detect imbalances

Variable sizes capture variance

Lawrence Livermore National Laboratory

Center for Applied Scientific Computing

NC STATE UNIVERSITY

Department of Computer Science

An Open Framework for Scalable, Reconfigurable Performance Analysis

Todd Gamblin‡, Prasun Ratn*†, Bronis R. de Supinski*, Martin Schulz*, Frank Mueller†, Robert J. Fowler‡, Daniel Reed‡

*Lawrence Livermore National Laboratory, †North Carolina State University, ‡Renaissance Computing Institute

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. This work was also supported in part by NSF grants CNS-0410203, CCF-0429653 and CAREER CCR-0237570, as well as the SciDAC Performance Engineering Research Institute, DE-FC02-06ER25764.

UCRL-POST-236200

ScalaTrace: Reconfigurable, Scalable Performance Analysis

Size of machines is rapidly increasing (130,000+ processors)

Tools will be overwhelmed with data

Need scalable, online measurement and analysis

Future Directions

Large Scale MPI Applications Implicit Program Behavior

Interactive User Tools and Visualization Understanding of Program Behavior and Optimizations

Feedback, Steering & Global Configuration

Data Collection & Analysis

…MPI tasks

User Interaction & Storage

Reduction Nodes

Local Configuration

Data Input: from Tracer, Profiler, Child Nodes

Data Output: to Parent Node, Tool

Communication paths

OPS Attr.1 Attr.2 Attr.N…

Extensible, self-describing format

Attribute examples: PRSDs, timing, …

Reducer

Transformer

Annotator

Analyzer

Compressor PRSDs Timing

Loop Detection

Annotator

[Figure: example records with fields MPI, PRSD, Time, Loop; annotation labels: Loop Information, Time, PRSD]

Types of tool components

LOCAL Node INSTANTIATION

GLOBAL REDUCTION

Trace representation using Power Regular Section Descriptors (PRSDs)

Idea: preserve time in compressed traces

• Encode time deltas instead of timestamps

• Create delta histograms automatically

• Dynamically balance histograms

• Time depends on path taken

• Distinguish histograms by path
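The bullets above can be illustrated with a minimal sketch. It assumes a trace is a time-ordered list of `(timestamp, path)` pairs, where `path` is any hashable call-path key; the function name `delta_histograms` and the fixed `bin_width` are hypothetical choices for illustration, not the poster's implementation (which balances its bins dynamically).

```python
from collections import Counter, defaultdict


def delta_histograms(events, bin_width):
    """Build one histogram of time deltas per call path.

    events: time-ordered iterable of (timestamp, path) pairs.
    Returns {path: Counter({bin_index: sample_count})}.
    """
    hists = defaultdict(Counter)
    prev = None
    for timestamp, path in events:
        if prev is not None:
            # Store the delta to the previous event, not the absolute timestamp.
            delta = timestamp - prev
            hists[path][int(delta // bin_width)] += 1
        prev = timestamp
    return hists


trace = [(0, "a"), (2, "b"), (3, "a"), (7, "b")]
for path, hist in sorted(delta_histograms(trace, bin_width=2).items()):
    print(path, dict(hist))
```

Keying the histograms by path keeps deltas observed on different call paths from being mixed into one distribution.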

Description of problem

Bins generated for synthetic input span entire range with similar sample counts

Number of histograms per record depends on number of possible call paths
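As a rough illustration of bins that span the entire range with similar sample counts, the sketch below computes equal-count (quantile) bin edges offline from a full sample list. This is a simple stand-in under stated assumptions: the poster balances histograms dynamically, and the names `balanced_bin_edges` and `bin_counts` are hypothetical.

```python
import bisect


def balanced_bin_edges(samples, nbins):
    """Return nbins + 1 edges so each bin holds roughly the same sample count."""
    s = sorted(samples)
    n = len(s)
    edges = [s[0]]
    for k in range(1, nbins):
        edges.append(s[(k * n) // nbins])
    edges.append(s[-1])
    return edges


def bin_counts(samples, edges):
    """Count samples per bin; the last bin also includes the maximum value."""
    counts = [0] * (len(edges) - 1)
    for x in samples:
        i = bisect.bisect_right(edges, x) - 1
        counts[min(max(i, 0), len(counts) - 1)] += 1
    return counts


# A bimodal sample: variable-width bins adapt to both modes.
samples = [1, 2, 3, 4, 100, 101, 102, 103]
edges = balanced_bin_edges(samples, 4)
print(edges, bin_counts(samples, edges))
```

With fixed-width bins, a bimodal sample like this one would leave most bins empty; variable-width edges keep the per-bin counts even.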

Sample bimodal distribution from UMT2K collectives

...
MPI_Allreduce(...);
for (...) {
    for (...) {
        MPI_Send(...);
        MPI_Recv(...);
    }
    MPI_Barrier(...);
}
...
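To make the idea of a loop-structured trace concrete, here is a minimal sketch that folds a flat event stream from a loop nest like the one above into nested loop descriptors. It is a simplification under stated assumptions: events are plain strings, the greedy tail-matching strategy and the names `Loop`, `compress`, and `expand` are hypothetical, and real PRSDs also capture call parameters, which this sketch ignores.

```python
from typing import NamedTuple


class Loop(NamedTuple):
    count: int   # number of iterations
    body: tuple  # events or nested Loop descriptors


def compress(events, max_window=8):
    """Greedily fold repeated tail windows into Loop descriptors."""
    out = []
    for ev in events:
        out.append(ev)
        folded = True
        while folded:
            folded = False
            for w in range(1, max_window + 1):
                tail = out[-w:]
                # Case 1: the tail repeats the body of the loop just before it.
                if (len(out) > w and isinstance(out[-w - 1], Loop)
                        and list(out[-w - 1].body) == tail):
                    prev = out[-w - 1]
                    del out[-w:]
                    out[-1] = Loop(prev.count + 1, prev.body)
                    folded = True
                    break
                # Case 2: two identical adjacent windows start a new loop.
                if len(out) >= 2 * w and out[-2 * w:-w] == tail:
                    del out[-2 * w:]
                    out.append(Loop(2, tuple(tail)))
                    folded = True
                    break
    return out


def expand(items):
    """Inverse of compress: unroll Loop descriptors into a flat event list."""
    out = []
    for it in items:
        if isinstance(it, Loop):
            out.extend(expand(it.body) * it.count)
        else:
            out.append(it)
    return out


inner = ["MPI_Send", "MPI_Recv"] * 2 + ["MPI_Barrier"]
trace = ["MPI_Allreduce"] + inner * 3
print(compress(trace))
```

In this sketch the compressed form does not grow with the iteration counts, only with the depth and shape of the loop nest.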

Replay accuracy (NAS Benchmarks and UMT2K)

[Plot panels: IS, LU, FT, IS, FT]

Trace sizes (NAS Benchmarks and UMT2K)

[Plot panels: MG, BT]

Near-constant trace size: DT, EP, LU

Sub-linear trace size: CG, MG, FT

Non-scalable trace size: BT, IS, UMT2K

Accurate replay: DT, EP, FT, LU, IS, UMT2K

Replay inaccurate in MPI time: CG, MG

Replay inaccurate in compute time: BT

ScalaReplay: Replay Using Histogram Timing Annotations
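One plausible way to turn a histogram annotation back into a delay during replay is sketched below. The poster does not describe the sampling scheme, so everything here is an assumption: the hypothetical `sample_delta` picks a bin with probability proportional to its count, then draws uniformly within that bin.

```python
import random


def sample_delta(edges, counts, rng):
    """Draw one time delta from a histogram.

    edges:  bin edges, len(counts) + 1 values in increasing order.
    counts: number of recorded samples per bin.
    rng:    a random.Random instance, so replays can be seeded.
    """
    i = rng.choices(range(len(counts)), weights=counts)[0]
    return rng.uniform(edges[i], edges[i + 1])


rng = random.Random(0)
delays = [sample_delta([0.0, 1.0, 10.0], [1, 3], rng) for _ in range(5)]
print(delays)
```

A replay tool could wait for the sampled delay before issuing the next recorded MPI operation; seeding the generator makes a replay repeatable.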

Evolutionary Load-Balance Analysis with Scalable Data Collection

Idea: Normalize measurements and models based on application semantics.

Progress loops
• Typically outer loops in SPMD codes
• Indicate absolute progress towards some domain-specific goal
• Basis for comparison of load over time

Effort loops
• Variable-time loops, represent load
• Data-dependent execution

Load Modeling Framework

Visualization

Effort Region Analysis

[Diagram components: Progress Event Annotator; MPI Tracer; Compressor; Trace merge; Wavelet Compression; Local Effort Records; PRSDs; Load-annotated Trace. Record fields: MPI, PRSD, Progress]

Progress instrumentation
• User marks progress loop explicitly
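A user-level progress marker might look like the following sketch. The class `EffortTrace` and its methods are hypothetical names, not the framework's API; the clock is injectable so the example can be run deterministically.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class EffortTrace:
    """Records elapsed time per named region, indexed by progress step."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.step = 0
        # effort[region][step] -> accumulated elapsed time
        self.effort = defaultdict(lambda: defaultdict(float))

    def progress(self):
        """Call once per iteration of the outer (progress) loop."""
        self.step += 1

    @contextmanager
    def region(self, name):
        """Time a variable-cost region within the current progress step."""
        start = self.clock()
        try:
            yield
        finally:
            self.effort[name][self.step] += self.clock() - start


trace = EffortTrace()
for _ in range(3):              # the progress loop
    with trace.region("force"):
        sum(range(10_000))      # stand-in for data-dependent work
    trace.progress()
print(sorted(trace.effort["force"]))
```

Indexing effort by progress step rather than by wall-clock time gives every process the same axis, which is what makes load comparable over time.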

Effort modeled with code regions
• Dynamically detected at runtime
• Split MPI-op trace at collectives and wait operations
• User can further divide code into phases with instrumentation
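Splitting an operation trace at collectives and wait operations can be sketched as follows. The set `SPLIT_OPS` lists a few standard MPI collective and wait calls as an assumption; which operations the tool actually treats as region boundaries is not stated here, and `split_regions` is a hypothetical name.

```python
# Assumed boundary operations: some MPI collectives and wait calls.
SPLIT_OPS = frozenset({
    "MPI_Barrier", "MPI_Bcast", "MPI_Reduce", "MPI_Allreduce",
    "MPI_Wait", "MPI_Waitall", "MPI_Waitany",
})


def split_regions(ops):
    """Split a list of MPI operation names into effort regions.

    Each region ends with (and includes) a boundary operation;
    trailing operations form a final, unterminated region.
    """
    regions, current = [], []
    for op in ops:
        current.append(op)
        if op in SPLIT_OPS:
            regions.append(current)
            current = []
    if current:
        regions.append(current)
    return regions


ops = ["MPI_Isend", "MPI_Irecv", "MPI_Waitall",
       "MPI_Send", "MPI_Allreduce", "MPI_Recv"]
for region in split_regions(ops):
    print(region)
```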

ParaDiS models dislocation dynamics in crystals
• Dislocations discretized as nodes & arms
• Recursive spatial domain decomposition
• Balancer subdivides nodes/arms along x, then y, then z


Scalable Compression Using Wavelet Transform

Distributed per-process effort measurements are consolidated onto fewer processes

Parallel wavelet transform

Transformed data is locally compressed by thresholding, run-length and entropy coding

Compressed representation is merged in parallel

• Load measurements are reconstructed on client
• Visualization tool allows intuitive data browsing
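The transform, threshold, run-length, and reconstruct steps above can be sketched serially in a few lines. This is an illustration under stated assumptions: it uses a 1-D Haar wavelet on a power-of-two-length list (the poster does not name the wavelet family), runs on one process rather than in parallel, and omits the entropy-coding stage. All function names are hypothetical.

```python
def haar_forward(data):
    """Full Haar decomposition: [overall average, coarse..fine differences]."""
    out = [float(x) for x in data]
    n = len(out)
    assert n > 0 and (n & (n - 1)) == 0, "length must be a power of two"
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        dif = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = avg + dif
        n = half
    return out


def haar_inverse(coeffs):
    """Reconstruct the signal from haar_forward's coefficient layout."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        pairs = []
        for a, d in zip(out[:n], out[n:2 * n]):
            pairs += [a + d, a - d]
        out[:2 * n] = pairs
        n *= 2
    return out


def threshold(coeffs, eps):
    """Zero out coefficients with magnitude below eps (the lossy step)."""
    return [c if abs(c) >= eps else 0.0 for c in coeffs]


def run_length_encode(values):
    """Collapse runs of equal values into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


effort = [4, 4, 4, 4, 8, 8, 8, 8]   # per-process effort, two load levels
coeffs = threshold(haar_forward(effort), eps=0.5)
print(run_length_encode(coeffs))
print(haar_inverse(coeffs))
```

Smooth or piecewise-constant load leaves most detail coefficients at or near zero, so thresholding produces long zero runs that run-length coding stores compactly.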


Near-constant low-overhead MPI traces

Annotate with additional reconfigurable data, e.g.

Time using adaptive histograms (left)

Progress rates, load imbalance (right)

[Plots: Compression Ratio; Merge Time; Error]

[Figure: Effort for 64-node ParaDiS Run, Exact vs. Reconstructed; axes: Process Id, Progress; phase labels: Checkpoint, Force Computation, Re-partition]

[Figure: Load Balance in ParaDiS; axes: x, y, z]

Flexible framework for application-specific tools

Ability to select compression schemes for different fields, data types

Foster combination, interaction between data collection, analysis mechanisms

Adaptive, self-tuning runtime systems

Near term:

Adaptive wavelet transform topology, equivalence class detection

Full time-sequence compression
