Tool Visualizations, Metrics, and Profiled Entities Overview [Brief Version]
Adam Leko
HCS Research Laboratory, University of Florida
Summary
- Give characteristics of existing tools to aid our design discussions
  - Metrics (what is recorded, any hardware counters, etc.)
  - Profiled entities
  - Visualizations
- Most information & some slides taken from tool evaluations
- Tools overviewed: TAU, Paradyn, MPE/Jumpshot, Dimemas/Paraver/MPITrace, mpiP, Dynaprof, KOJAK, Intel Cluster Tools (old Vampir/VampirTrace), Pablo, MPICL/ParaGraph
TAU
Metrics recorded
- Two modes: profile, trace
- Profile mode
  - Inclusive/exclusive time spent in functions
  - Hardware counter information
    - PAPI/PCL: L1/L2/L3 cache reads/writes/misses, TLB misses, cycles, integer/floating-point/load/store instructions and stalls executed, wall clock time, virtual time
  - Other OS timers (gettimeofday, getrusage)
  - MPI message size sent
- Trace mode
  - Same as profile (minus hardware counters?)
  - Message send time, message receive time, message size, message sender/recipient(?)
Profiled entities
- Functions (automatic & dynamic instrumentation); loops and regions (manual instrumentation; see the sketch below)
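
Since TAU only picks up loops and regions through manual instrumentation, a minimal sketch of what that can look like with TAU's C timer macros is shown below. The macro names and arguments are quoted from memory and should be treated as assumptions; check the TAU documentation for your version.

    /* Sketch of manual region instrumentation with TAU's C macros; macro
     * names/arguments are assumptions to be checked against the TAU docs. */
    #include <TAU.h>

    int main(int argc, char **argv)
    {
        TAU_PROFILE_TIMER(loop_timer, "main-loop", "", TAU_USER);
        TAU_PROFILE_INIT(argc, argv);
        TAU_PROFILE_SET_NODE(0);        /* single-process example */

        TAU_PROFILE_START(loop_timer);
        /* ... region/loop to be timed ... */
        TAU_PROFILE_STOP(loop_timer);
        return 0;
    }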
TAU
Visualizations
- Profile mode
  - Text-based: pprof, shows a summary of profile information
  - Graphical: racy (old), jracy a.k.a. ParaProf
- Trace mode
  - No built-in visualizations
  - Can export to CUBE (see KOJAK), Jumpshot (see MPE), and Vampir format (see Intel Cluster Tools)
Paradyn
Metrics recorded
- Number of CPUs, number of active threads, CPU time and inclusive CPU time
- Function calls to and by
- Synchronization (# operations, wait time, inclusive wait time)
- Overall communication (# messages, bytes sent and received), collective communication (# messages, bytes sent and received), point-to-point communication (# messages, bytes sent and received)
- I/O (# operations, wait time, inclusive wait time, total bytes)
- All metrics recorded as "time histograms" (fixed-size data structure; see the sketch after this slide)
Profiled entities
- Functions only (but includes functions linked to in existing libraries)
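
The "fixed-size data structure" note refers to Paradyn's time-histogram idea: a constant number of bins covers the whole run, and when execution outlives the covered interval, adjacent bins are merged and the bin width doubles. The sketch below only illustrates that folding behaviour; the type, sizes, and function names are invented for illustration and are not Paradyn's implementation.

    /* Conceptual fixed-size time histogram (illustrative, not Paradyn code). */
    #include <stddef.h>
    #include <stdio.h>

    #define NBINS 1000

    typedef struct {
        double bin[NBINS];   /* accumulated metric value per time slot   */
        double bin_width;    /* seconds of execution covered by each bin */
    } time_histogram;

    static void th_init(time_histogram *h, double initial_width)
    {
        h->bin_width = initial_width;
        for (size_t i = 0; i < NBINS; i++)
            h->bin[i] = 0.0;
    }

    /* Fold: merge pairs of bins so the same storage covers twice the time. */
    static void th_fold(time_histogram *h)
    {
        for (size_t i = 0; i < NBINS / 2; i++)
            h->bin[i] = h->bin[2 * i] + h->bin[2 * i + 1];
        for (size_t i = NBINS / 2; i < NBINS; i++)
            h->bin[i] = 0.0;
        h->bin_width *= 2.0;
    }

    /* Add a sample observed at elapsed time t (seconds since start). */
    static void th_add(time_histogram *h, double t, double value)
    {
        while (t >= h->bin_width * NBINS)
            th_fold(h);
        h->bin[(size_t)(t / h->bin_width)] += value;
    }

    int main(void)
    {
        time_histogram h;
        th_init(&h, 0.2);
        th_add(&h, 1.0, 0.05);      /* fits in the initial interval      */
        th_add(&h, 500.0, 0.05);    /* triggers folding before insertion */
        printf("bin width is now %g s\n", h.bin_width);
        return 0;
    }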
Paradyn
Visualizations
- Time histograms
- Tables
- Bar charts
- "Terrains" (3-D histograms)
MPE/Jumpshot
Metrics collected
- MPI message send time, receive time, size, message sender/recipient
- User-defined event entry & exit
Profiled entities
- All MPI functions
- Functions or regions via manual instrumentation and custom events (see the MPE logging sketch below)
Visualization
- Jumpshot: timeline view (space-time diagram overlaid on Gantt chart), histogram
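
A short sketch of how user-defined states can be logged with MPE's logging calls (MPE_Log_get_event_number, MPE_Describe_state, MPE_Log_event). The call names are from the MPE logging library that ships with MPICH, but exact signatures and linking details should be verified against your MPE installation; this is an assumption-laden example, not a definitive recipe.

    /* Sketch: defining and logging a custom state for Jumpshot with MPE.
     * Signatures are assumed from the MPICH/MPE logging library. */
    #include <mpi.h>
    #include <mpe.h>

    int main(int argc, char **argv)
    {
        int ev_start, ev_end;

        MPI_Init(&argc, &argv);
        MPE_Init_log();

        /* Reserve two event numbers and describe the state they bracket. */
        ev_start = MPE_Log_get_event_number();
        ev_end   = MPE_Log_get_event_number();
        MPE_Describe_state(ev_start, ev_end, "compute", "red");

        MPE_Log_event(ev_start, 0, NULL);
        /* ... user region that should show up as a "compute" bar ... */
        MPE_Log_event(ev_end, 0, NULL);

        MPE_Finish_log("myrun");   /* writes a log file Jumpshot can open */
        MPI_Finalize();
        return 0;
    }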
Dimemas/Paraver/MPITrace
Metrics recorded (MPITrace)
- All MPI functions
- Hardware counters (2 per run, one from each of the following lists; collected via PAPI, see the sketch after this slide)
  - Counter 1:
    - Cycles
    - Issued instructions, loads, stores, store conditionals
    - Failed store conditionals
    - Decoded branches
    - Quadwords written back from scache(?)
    - Correctable scache data array errors(?)
    - Primary/secondary I-cache misses
    - Instructions mispredicted from scache way prediction table(?)
    - External interventions (cache coherency?)
    - External invalidations (cache coherency?)
    - Graduated instructions
  - Counter 2:
    - Cycles
    - Graduated instructions, loads, stores, store conditionals, floating-point instructions
    - TLB misses
    - Mispredicted branches
    - Primary/secondary data cache miss rates
    - Data mispredictions from scache way prediction table(?)
    - External intervention/invalidation (cache coherency?)
    - Store/prefetch exclusive to clean/shared block
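
Since MPITrace's counters come through PAPI, here is a minimal sketch of reading two hardware counters with the PAPI preset interface. The two presets chosen (total cycles and L1 data-cache misses) are only examples and are not claimed to be the counters MPITrace configures.

    /* Sketch: sampling two hardware counters via PAPI presets. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    int main(void)
    {
        int evset = PAPI_NULL;
        long long counts[2];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            exit(1);
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_TOT_CYC);   /* cycles               */
        PAPI_add_event(evset, PAPI_L1_DCM);    /* L1 data-cache misses */

        PAPI_start(evset);
        /* ... instrumented region ... */
        PAPI_stop(evset, counts);

        printf("cycles=%lld, L1 D-cache misses=%lld\n", counts[0], counts[1]);
        return 0;
    }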
Dimemas/Paraver/MPITrace
Profiled entities (MPITrace)
- All MPI functions (message start time, message end time, message size, message recipient/sender)
- User regions/functions via manual instrumentation
Visualization
- Timeline display (like Jumpshot)
  - Shows Gantt chart and messages
  - Can also overlay hardware counter information
  - Clicking on the timeline brings up a text listing of events near where you clicked
- 1D/2D analysis modules
mpiP
Metrics collected
- Start time, end time, message size for each MPI call
Profiled entities
- MPI function calls, intercepted via PMPI wrappers (see the wrapper sketch below)
Visualization
- Text-based output, with a graphical browser that displays statistics inline with the source
- Displayed information:
  - Overall time (%) for each MPI node
  - Top 20 call sites by time (MPI%, App%, variance)
  - Top 20 call sites by message size (MPI%, App%, variance)
  - Min/max/average/MPI%/App% time spent at each call site
  - Min/max/average/sum of message sizes at each call site
- Definitions:
  - App time = wall clock time between MPI_Init and MPI_Finalize
  - MPI time = all time consumed by MPI functions
  - App% = % of metric relative to overall app time
  - MPI% = % of metric relative to overall MPI time
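
mpiP intercepts MPI calls through the standard PMPI profiling interface: the tool provides its own MPI_* entry points and forwards to the PMPI_* versions. A minimal sketch of that technique for MPI_Send is below; the timing and the record_callsite placeholder are illustrative only and are not mpiP's actual code (the const qualifier on buf matches MPI-3 headers and may differ for older MPI versions).

    /* Sketch of PMPI interposition: wrap MPI_Send, time it, forward to
     * PMPI_Send.  record_callsite() is a hypothetical placeholder. */
    #include <mpi.h>

    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        double start = MPI_Wtime();
        int rc = PMPI_Send(buf, count, type, dest, tag, comm);  /* real send */
        double elapsed = MPI_Wtime() - start;

        /* A real tool would attribute `elapsed` and the message size to the
         * calling site here, e.g. record_callsite(elapsed, count, type). */
        (void)elapsed;
        return rc;
    }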
Dynaprof
Metrics collected
- Wall clock time or PAPI metric for each profiled entity
- Collects inclusive, exclusive, and 1-level call tree % information
Profiled entities
- Functions (dynamic instrumentation)
Visualizations
- Simple text-based output
- Simple GUI (shows same info as the text-based output)
KOJAK
Metrics collected
- MPI: message start time, receive time, size, message sender/recipient
- Manual instrumentation: start and stop times
- 1 PAPI metric per run (only FLOPS and L1 data misses visualized)
Profiled entities
- MPI calls (MPI wrapper library)
- Function calls (automatic instrumentation, only available on a few platforms)
- Regions and function calls via manual instrumentation
Visualizations
- Can export traces to Vampir trace format (see ICT)
- Shows profile and analyzed data via CUBE
Intel Cluster Tools (ICT)
Metrics collected
- MPI functions: start time, end time, message size, message sender/recipient
- User-defined events: counter, start & end times
- Code location for source-code correlation
Instrumented entities
- MPI functions via wrapper library
- User functions via binary instrumentation(?)
- User functions & regions via manual instrumentation (see the sketch below)
Visualizations
- Different types: timelines, statistics & counter info
Pablo
Metrics collected
- Time inclusive/exclusive of a function
- Hardware counters via PAPI
- Summary metrics computed from timing info: min/max/avg/stdev/count (see the sketch below)
Profiled entities
- Functions, function calls, and outer loops (all selected via GUI)
Visualizations
- Displays derived summary metrics color-coded and inline with source code
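
To make the derived summary metrics concrete, here is a generic sketch of accumulating min/max/avg/stdev/count from per-call timing samples; it is ordinary statistics code, not Pablo's implementation.

    /* Generic running summary (min/max/avg/stdev/count) over timing samples. */
    #include <math.h>
    #include <stdio.h>

    typedef struct {
        double min, max, sum, sumsq;
        long   count;
    } summary;

    static void summary_add(summary *s, double x)
    {
        if (s->count == 0 || x < s->min) s->min = x;
        if (s->count == 0 || x > s->max) s->max = x;
        s->sum   += x;
        s->sumsq += x * x;
        s->count++;
    }

    int main(void)
    {
        summary s = {0.0, 0.0, 0.0, 0.0, 0};
        double samples[] = {1.2, 0.9, 1.5, 1.1};    /* e.g. per-call times */
        for (int i = 0; i < 4; i++)
            summary_add(&s, samples[i]);

        double avg   = s.sum / s.count;
        double stdev = sqrt(s.sumsq / s.count - avg * avg);
        printf("count=%ld min=%g max=%g avg=%g stdev=%g\n",
               s.count, s.min, s.max, avg, stdev);
        return 0;
    }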
MPICL/ParaGraph
Metrics collected
- MPI functions: start time, end time, message size, message sender/recipient
- Manual instrumentation: start time, end time, "work" done (up to the user to pass this in; see the sketch below)
Profiled entities
- MPI function calls via PMPI interface
- User functions/regions via manual instrumentation
Visualizations
- Many, separated into 4 categories: utilization, communication, task, "other"
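
To illustrate the manually supplied "work" count, the sketch below shows a begin/end event pair that carries a user-chosen work value. The begin_task/end_task names are hypothetical placeholders invented for illustration; they are not MPICL's real tracing API, which should be taken from the MPICL documentation.

    /* Hypothetical begin/end task events carrying a user-supplied "work"
     * count; function names are placeholders, not MPICL's API. */
    #include <stdio.h>

    static void begin_task(int task_id)           { printf("begin task %d\n", task_id); }
    static void end_task(int task_id, long work)  { printf("end task %d, work=%ld\n", task_id, work); }

    int main(void)
    {
        long flops = 0;

        begin_task(1);                  /* tool records the start time here   */
        for (int i = 0; i < 1000; i++)
            flops += 2;                 /* the user decides what "work" means */
        end_task(1, flops);             /* end time + work count recorded     */
        return 0;
    }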
ParaGraph visualizations
Utilization visualizations
- Display a rough estimate of processor utilization
- Utilization broken down into 3 states:
  - Idle: the program is blocked waiting for a communication operation (or it has stopped execution)
  - Overhead: the program is performing communication but is not blocked (time spent within the MPI library)
  - Busy: execution is in a part of the program other than communication
- "Busy" doesn't necessarily mean useful work is being done, since anything that is not communication is counted as busy (see the classification sketch below)
Communication visualizations
- Display different aspects of communication: frequency, volume, overall pattern, etc.
- "Distance" computed by setting the topology in the options menu
Task visualizations
- Display information about when processors start & stop tasks
- Requires manually instrumented code to identify when processors start/stop tasks
Other visualizations
- Miscellaneous things
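
A tiny sketch of the three-way classification described above: each slice of a process's timeline is labelled idle, overhead, or busy depending on whether it is blocked in communication, inside the communication library but not blocked, or outside it entirely. The enum and function are invented for illustration and are not ParaGraph's code.

    /* Conceptual idle/overhead/busy classification (illustrative only). */
    #include <stdio.h>

    typedef enum { STATE_IDLE, STATE_OVERHEAD, STATE_BUSY } util_state;

    static util_state classify(int in_comm_library, int blocked_waiting)
    {
        if (blocked_waiting)  return STATE_IDLE;      /* waiting on a message       */
        if (in_comm_library)  return STATE_OVERHEAD;  /* communicating, not blocked */
        return STATE_BUSY;    /* anything else counts as busy, useful or not        */
    }

    int main(void)
    {
        printf("%d %d %d\n",
               classify(1, 1),    /* idle     */
               classify(1, 0),    /* overhead */
               classify(0, 0));   /* busy     */
        return 0;
    }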