Tool Visualizations, Metrics, and Profiled Entities Overview [Brief Version]
Adam Leko
HCS Research Laboratory, University of Florida
Summary
- Give characteristics of existing tools to aid our design discussions
  - Metrics (what is recorded, any hardware counters, etc.)
  - Profiled entities
  - Visualizations
- Most information & some slides taken from tool evaluations
- Tools overviewed: TAU, Paradyn, MPE/Jumpshot, Dimemas/Paraver/MPITrace, mpiP, Dynaprof, KOJAK, Intel Cluster Tools (old Vampir/VampirTrace), Pablo, MPICL/ParaGraph
TAU
Metrics recorded
- Two modes: profile, trace
- Profile mode
  - Inclusive/exclusive time spent in functions
  - Hardware counter information
    - PAPI/PCL: L1/L2/L3 cache reads/writes/misses, TLB misses, cycles, integer/floating-point/load/store instructions and stalls executed, wall clock time, virtual time
  - Other OS timers (gettimeofday, getrusage)
  - MPI message size sent
- Trace mode
  - Same as profile (minus hardware counters?)
  - Message send time, message receive time, message size, message sender/recipient(?)
Profiled entities
- Functions (automatic & dynamic instrumentation); loops and regions (manual instrumentation; see the sketch below)
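
Since TAU only picks up loops and regions through manual instrumentation, a minimal sketch of what that can look like with TAU's C timer macros is shown below. The macro names and arguments are quoted from memory and should be treated as assumptions; check the TAU documentation for your version.

    /* Sketch of manual region instrumentation with TAU's C macros; macro
     * names/arguments are assumptions to be checked against the TAU docs. */
    #include <TAU.h>

    int main(int argc, char **argv)
    {
        TAU_PROFILE_TIMER(loop_timer, "main-loop", "", TAU_USER);
        TAU_PROFILE_INIT(argc, argv);
        TAU_PROFILE_SET_NODE(0);        /* single-process example */

        TAU_PROFILE_START(loop_timer);
        /* ... region/loop to be timed ... */
        TAU_PROFILE_STOP(loop_timer);
        return 0;
    }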
TAU
Visualizations
- Profile mode
  - Text-based: pprof, shows a summary of profile information
  - Graphical: racy (old), jracy a.k.a. ParaProf
- Trace mode
  - No built-in visualizations
  - Can export to CUBE (see KOJAK), Jumpshot (see MPE), and Vampir format (see Intel Cluster Tools)
Paradyn
Metrics recorded
- Number of CPUs, number of active threads, CPU time and inclusive CPU time
- Function calls to and by
- Synchronization (# operations, wait time, inclusive wait time)
- Overall communication (# messages, bytes sent and received), collective communication (# messages, bytes sent and received), point-to-point communication (# messages, bytes sent and received)
- I/O (# operations, wait time, inclusive wait time, total bytes)
- All metrics recorded as "time histograms" (fixed-size data structure; see the sketch after this slide)
Profiled entities
- Functions only (but includes functions linked to in existing libraries)
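
The "fixed-size data structure" note refers to Paradyn's time-histogram idea: a constant number of bins covers the whole run, and when execution outlives the covered interval, adjacent bins are merged and the bin width doubles. The sketch below only illustrates that folding behaviour; the type, sizes, and function names are invented for illustration and are not Paradyn's implementation.

    /* Conceptual fixed-size time histogram (illustrative, not Paradyn code). */
    #include <stddef.h>
    #include <stdio.h>

    #define NBINS 1000

    typedef struct {
        double bin[NBINS];   /* accumulated metric value per time slot   */
        double bin_width;    /* seconds of execution covered by each bin */
    } time_histogram;

    static void th_init(time_histogram *h, double initial_width)
    {
        h->bin_width = initial_width;
        for (size_t i = 0; i < NBINS; i++)
            h->bin[i] = 0.0;
    }

    /* Fold: merge pairs of bins so the same storage covers twice the time. */
    static void th_fold(time_histogram *h)
    {
        for (size_t i = 0; i < NBINS / 2; i++)
            h->bin[i] = h->bin[2 * i] + h->bin[2 * i + 1];
        for (size_t i = NBINS / 2; i < NBINS; i++)
            h->bin[i] = 0.0;
        h->bin_width *= 2.0;
    }

    /* Add a sample observed at elapsed time t (seconds since start). */
    static void th_add(time_histogram *h, double t, double value)
    {
        while (t >= h->bin_width * NBINS)
            th_fold(h);
        h->bin[(size_t)(t / h->bin_width)] += value;
    }

    int main(void)
    {
        time_histogram h;
        th_init(&h, 0.2);
        th_add(&h, 1.0, 0.05);      /* fits in the initial interval      */
        th_add(&h, 500.0, 0.05);    /* triggers folding before insertion */
        printf("bin width is now %g s\n", h.bin_width);
        return 0;
    }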
Paradyn
Visualizations
- Time histograms
- Tables
- Bar charts
- "Terrains" (3-D histograms)
MPE/Jumpshot
Metrics collected
- MPI message send time, receive time, size, message sender/recipient
- User-defined event entry & exit
Profiled entities
- All MPI functions
- Functions or regions via manual instrumentation and custom events (see the MPE logging sketch below)
Visualization
- Jumpshot: timeline view (space-time diagram overlaid on Gantt chart), histogram
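
A short sketch of how user-defined states can be logged with MPE's logging calls (MPE_Log_get_event_number, MPE_Describe_state, MPE_Log_event). The call names are from the MPE logging library that ships with MPICH, but exact signatures and linking details should be verified against your MPE installation; this is an assumption-laden example, not a definitive recipe.

    /* Sketch: defining and logging a custom state for Jumpshot with MPE.
     * Signatures are assumed from the MPICH/MPE logging library. */
    #include <mpi.h>
    #include <mpe.h>

    int main(int argc, char **argv)
    {
        int ev_start, ev_end;

        MPI_Init(&argc, &argv);
        MPE_Init_log();

        /* Reserve two event numbers and describe the state they bracket. */
        ev_start = MPE_Log_get_event_number();
        ev_end   = MPE_Log_get_event_number();
        MPE_Describe_state(ev_start, ev_end, "compute", "red");

        MPE_Log_event(ev_start, 0, NULL);
        /* ... user region that should show up as a "compute" bar ... */
        MPE_Log_event(ev_end, 0, NULL);

        MPE_Finish_log("myrun");   /* writes a log file Jumpshot can open */
        MPI_Finalize();
        return 0;
    }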
Dimemas/Paraver/MPITrace
Metrics recorded (MPITrace)
- All MPI functions
- Hardware counters (2 per run, one from each of the following lists; collected via PAPI, see the sketch after this slide)
  - Counter 1:
    - Cycles
    - Issued instructions, loads, stores, store conditionals
    - Failed store conditionals
    - Decoded branches
    - Quadwords written back from scache(?)
    - Correctable scache data array errors(?)
    - Primary/secondary I-cache misses
    - Instructions mispredicted from scache way prediction table(?)
    - External interventions (cache coherency?)
    - External invalidations (cache coherency?)
    - Graduated instructions
  - Counter 2:
    - Cycles
    - Graduated instructions, loads, stores, store conditionals, floating-point instructions
    - TLB misses
    - Mispredicted branches
    - Primary/secondary data cache miss rates
    - Data mispredictions from scache way prediction table(?)
    - External intervention/invalidation (cache coherency?)
    - Store/prefetch exclusive to clean/shared block
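
Since MPITrace's counters come through PAPI, here is a minimal sketch of reading two hardware counters with the PAPI preset interface. The two presets chosen (total cycles and L1 data-cache misses) are only examples and are not claimed to be the counters MPITrace configures.

    /* Sketch: sampling two hardware counters via PAPI presets. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    int main(void)
    {
        int evset = PAPI_NULL;
        long long counts[2];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            exit(1);
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_TOT_CYC);   /* cycles               */
        PAPI_add_event(evset, PAPI_L1_DCM);    /* L1 data-cache misses */

        PAPI_start(evset);
        /* ... instrumented region ... */
        PAPI_stop(evset, counts);

        printf("cycles=%lld, L1 D-cache misses=%lld\n", counts[0], counts[1]);
        return 0;
    }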
Dimemas/Paraver/MPITrace
Profiled entities (MPITrace)
- All MPI functions (message start time, message end time, message size, message recipient/sender)
- User regions/functions via manual instrumentation
Visualization
- Timeline display (like Jumpshot)
  - Shows Gantt chart and messages
  - Can also overlay hardware counter information
  - Clicking on the timeline brings up a text listing of events near where you clicked
- 1D/2D analysis modules
mpiP
Metrics collected
- Start time, end time, message size for each MPI call
Profiled entities
- MPI function calls, intercepted via PMPI wrappers (see the wrapper sketch below)
Visualization
- Text-based output, with a graphical browser that displays statistics inline with the source
- Displayed information:
  - Overall time (%) for each MPI node
  - Top 20 call sites by time (MPI%, App%, variance)
  - Top 20 call sites by message size (MPI%, App%, variance)
  - Min/max/average/MPI%/App% time spent at each call site
  - Min/max/average/sum of message sizes at each call site
- Definitions:
  - App time = wall clock time between MPI_Init and MPI_Finalize
  - MPI time = all time consumed by MPI functions
  - App% = % of metric relative to overall app time
  - MPI% = % of metric relative to overall MPI time
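
mpiP intercepts MPI calls through the standard PMPI profiling interface: the tool provides its own MPI_* entry points and forwards to the PMPI_* versions. A minimal sketch of that technique for MPI_Send is below; the timing and the record_callsite placeholder are illustrative only and are not mpiP's actual code (the const qualifier on buf matches MPI-3 headers and may differ for older MPI versions).

    /* Sketch of PMPI interposition: wrap MPI_Send, time it, forward to
     * PMPI_Send.  record_callsite() is a hypothetical placeholder. */
    #include <mpi.h>

    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        double start = MPI_Wtime();
        int rc = PMPI_Send(buf, count, type, dest, tag, comm);  /* real send */
        double elapsed = MPI_Wtime() - start;

        /* A real tool would attribute `elapsed` and the message size to the
         * calling site here, e.g. record_callsite(elapsed, count, type). */
        (void)elapsed;
        return rc;
    }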
Dynaprof
Metrics collected
- Wall clock time or PAPI metric for each profiled entity
- Collects inclusive, exclusive, and 1-level call tree % information
Profiled entities
- Functions (dynamic instrumentation)
Visualizations
- Simple text-based output
- Simple GUI (shows same info as the text-based output)
KOJAK
Metrics collected
- MPI: message start time, receive time, size, message sender/recipient
- Manual instrumentation: start and stop times
- 1 PAPI metric per run (only FLOPS and L1 data misses visualized)
Profiled entities
- MPI calls (MPI wrapper library)
- Function calls (automatic instrumentation, only available on a few platforms)
- Regions and function calls via manual instrumentation
Visualizations
- Can export traces to Vampir trace format (see ICT)
- Shows profile and analyzed data via CUBE
Intel Cluster Tools (ICT)
Metrics collected
- MPI functions: start time, end time, message size, message sender/recipient
- User-defined events: counter, start & end times
- Code location for source-code correlation
Instrumented entities
- MPI functions via wrapper library
- User functions via binary instrumentation(?)
- User functions & regions via manual instrumentation (see the sketch below)
Visualizations
- Different types: timelines, statistics & counter info
Pablo
Metrics collected
- Time inclusive/exclusive of a function
- Hardware counters via PAPI
- Summary metrics computed from timing info: min/max/avg/stdev/count (see the sketch below)
Profiled entities
- Functions, function calls, and outer loops (all selected via GUI)
Visualizations
- Displays derived summary metrics color-coded and inline with source code
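
To make the derived summary metrics concrete, here is a generic sketch of accumulating min/max/avg/stdev/count from per-call timing samples; it is ordinary statistics code, not Pablo's implementation.

    /* Generic running summary (min/max/avg/stdev/count) over timing samples. */
    #include <math.h>
    #include <stdio.h>

    typedef struct {
        double min, max, sum, sumsq;
        long   count;
    } summary;

    static void summary_add(summary *s, double x)
    {
        if (s->count == 0 || x < s->min) s->min = x;
        if (s->count == 0 || x > s->max) s->max = x;
        s->sum   += x;
        s->sumsq += x * x;
        s->count++;
    }

    int main(void)
    {
        summary s = {0.0, 0.0, 0.0, 0.0, 0};
        double samples[] = {1.2, 0.9, 1.5, 1.1};    /* e.g. per-call times */
        for (int i = 0; i < 4; i++)
            summary_add(&s, samples[i]);

        double avg   = s.sum / s.count;
        double stdev = sqrt(s.sumsq / s.count - avg * avg);
        printf("count=%ld min=%g max=%g avg=%g stdev=%g\n",
               s.count, s.min, s.max, avg, stdev);
        return 0;
    }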
MPICL/ParaGraph
Metrics collected
- MPI functions: start time, end time, message size, message sender/recipient
- Manual instrumentation: start time, end time, "work" done (up to the user to pass this in; see the sketch below)
Profiled entities
- MPI function calls via PMPI interface
- User functions/regions via manual instrumentation
Visualizations
- Many, separated into 4 categories: utilization, communication, task, "other"
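
To illustrate the manually supplied "work" count, the sketch below shows a begin/end event pair that carries a user-chosen work value. The begin_task/end_task names are hypothetical placeholders invented for illustration; they are not MPICL's real tracing API, which should be taken from the MPICL documentation.

    /* Hypothetical begin/end task events carrying a user-supplied "work"
     * count; function names are placeholders, not MPICL's API. */
    #include <stdio.h>

    static void begin_task(int task_id)           { printf("begin task %d\n", task_id); }
    static void end_task(int task_id, long work)  { printf("end task %d, work=%ld\n", task_id, work); }

    int main(void)
    {
        long flops = 0;

        begin_task(1);                  /* tool records the start time here   */
        for (int i = 0; i < 1000; i++)
            flops += 2;                 /* the user decides what "work" means */
        end_task(1, flops);             /* end time + work count recorded     */
        return 0;
    }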
ParaGraph visualizations
Utilization visualizations
- Display a rough estimate of processor utilization
- Utilization broken down into 3 states:
  - Idle: the program is blocked waiting for a communication operation (or it has stopped execution)
  - Overhead: the program is performing communication but is not blocked (time spent within the MPI library)
  - Busy: execution is in a part of the program other than communication
- "Busy" doesn't necessarily mean useful work is being done, since anything that is not communication is counted as busy (see the classification sketch below)
Communication visualizations
- Display different aspects of communication: frequency, volume, overall pattern, etc.
- "Distance" computed by setting the topology in the options menu
Task visualizations
- Display information about when processors start & stop tasks
- Requires manually instrumented code to identify when processors start/stop tasks
Other visualizations
- Miscellaneous things
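
A tiny sketch of the three-way classification described above: each slice of a process's timeline is labelled idle, overhead, or busy depending on whether it is blocked in communication, inside the communication library but not blocked, or outside it entirely. The enum and function are invented for illustration and are not ParaGraph's code.

    /* Conceptual idle/overhead/busy classification (illustrative only). */
    #include <stdio.h>

    typedef enum { STATE_IDLE, STATE_OVERHEAD, STATE_BUSY } util_state;

    static util_state classify(int in_comm_library, int blocked_waiting)
    {
        if (blocked_waiting)  return STATE_IDLE;      /* waiting on a message       */
        if (in_comm_library)  return STATE_OVERHEAD;  /* communicating, not blocked */
        return STATE_BUSY;    /* anything else counts as busy, useful or not        */
    }

    int main(void)
    {
        printf("%d %d %d\n",
               classify(1, 1),    /* idle     */
               classify(1, 0),    /* overhead */
               classify(0, 0));   /* busy     */
        return 0;
    }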