
Eurographics Conference on Visualization (EuroVis) (2014), STAR – State of The Art Report
R. Borgo, R. Maciejewski, and I. Viola (Editors)

State of the Art of Performance Visualization

Katherine E. Isaacs¹, Alfredo Giménez¹, Ilir Jusufi¹, Todd Gamblin², Abhinav Bhatele², Martin Schulz², Bernd Hamann¹, and Peer-Timo Bremer²

¹Department of Computer Science, University of California, Davis
²Lawrence Livermore National Laboratory

Abstract
Performance visualization comprises techniques that aid developers and analysts in improving the time and energy efficiency of their software. In this work, we discuss performance as it relates to visualization and survey existing approaches in performance visualization. We present an overview of what types of performance data can be collected and a categorization of the types of goals that performance visualization techniques can address. We develop a taxonomy for the contexts in which different performance visualizations reside and describe the state of the art research pertaining to each. Finally, we discuss unaddressed and future challenges in performance visualization.

Categories and Subject Descriptors (according to ACM CCS): I.3.3 [Computer Graphics]: Picture/Image Generation – Line and curve generation

1. Introduction

High performance computing (HPC) simulations drive innovation across a wide range of scientific fields, including astrophysics, climate simulation, material science, combustion, and energy production. Numerical problems in these disciplines would take hundreds of years to compute without massively parallel machines. To shape the development of future fast, power-efficient architectures and to accelerate the pace of computational science, it is critical to gain a comprehensive understanding of the factors that affect performance and power consumption on HPC systems.

Optimizing the performance of parallel applications is not straightforward, and performance analysis has become increasingly complex. Programs now must take advantage of multicore processors, programmable Graphics Processing Units (GPUs), and multi-level, non-uniform memory hierarchies. On-node performance counters and instrumentation tools allow detailed performance measurements, but the profusion of data they generate when applied to parallel programs makes exploring and understanding the data difficult.

This highlights the need for performance visualization techniques. We present an overview of performance visualization and survey existing work. Our contributions are:

• An overview of the available performance data.

• A classification of the goals of developers and analysts who use performance visualization.

• A context-based classification and survey of existing performance visualizations.

• A discussion of challenge areas in developing new and more powerful performance visualizations.

Others [MMC02, KS93] have reviewed software visualization techniques but tended to lump visualizations focusing on performance into a single category. We focus solely on performance, avoiding other software visualization areas such as software evolution, programming environments, visual programming, and software design. Performance tends to overlap with debugging and general program comprehension, so we include work from those areas as appropriate.

2. Performance Data

We detail methods for acquiring performance data and the types of performance data that can be generated. Many tools can be used to record performance measurements [GKM04, NS07, Rei05, MBDH99, SM06, BM11], allowing visualization developers to gather their own datasets.

© The Eurographics Association 2014.


2.1. Methods for Acquiring Performance Data

2.1.1. Instrumentation

Instrumentation is the act of modifying a program for an alternative purpose, in this case for acquiring performance data. At the most basic level, an instrumentation tool inserts extra code into a program's control flow. The instrumentation code may record timer values, or it may perform more complex analysis, such as writing out program variable values and recording variable accesses and conditions met. Instrumentation can be applied to source code before compiling, or it can be applied at runtime using binary modification [BM11] or sampling [ABF*10]. Care must be taken to ensure that the instrumentation does not change normal program behavior or add excessive overhead.
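As a minimal illustration of source-level instrumentation, the following Python sketch wraps a function with extra code that records a timer value on every call. The names `instrument`, `timings`, and `work` are hypothetical and chosen only for this example; a production tool would also need to bound the overhead that the wrapper itself adds.

```python
import functools
import time

# Hypothetical in-memory store; a real tool would write to a buffer or file.
timings = {}

def instrument(func):
    """Wrap func so that each call records its elapsed wall-clock time."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Inserted measurement code: runs even if func raises.
            elapsed = time.perf_counter() - start
            timings.setdefault(func.__name__, []).append(elapsed)
    return wrapper

@instrument
def work(n):
    # Stand-in for application code; its result must be unchanged by the wrapper.
    return sum(i * i for i in range(n))

result = work(1000)
```

The wrapper returns the original result untouched, reflecting the requirement that instrumentation not change normal program behavior.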

2.1.2. Interception

Interception is a form of instrumentation that leverages function calls already present in program source code. Interceptor functions are typically grouped together into a library, which is then linked with a program either dynamically or statically. The program's original function calls are linked to the interception library, which executes special measurement code, then delegates to the original implementation of the intercepted function. Interception is useful for profiling libraries because it can record the dynamic values of parameters passed to library calls. This can give more semantic context to measurements. For example, communication performance of many parallel programs is often measured by intercepting calls to the Message Passing Interface (MPI), and interceptor calls can differentiate between send and receive operations based on the size of data passed to them. As with other types of instrumentation, interception must be used sparingly to avoid incurring overhead.
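The measure-then-delegate pattern can be sketched in a few lines of Python. The `comm` object below is a stand-in for a communication library, and its `send` and `recv` functions are assumptions made for illustration; they are not MPI calls. Real interception libraries are typically linked in at build or load time rather than patched in at runtime as done here.

```python
import types

# Hypothetical communication library with two calls to intercept.
comm = types.SimpleNamespace(
    send=lambda dest, data: len(data),
    recv=lambda src, nbytes: bytes(nbytes),
)

log = []  # (operation, byte count) records gathered by the interceptors

def intercept(namespace, name, size_of):
    """Replace namespace.<name> with a wrapper that logs, then delegates."""
    original = getattr(namespace, name)

    def interceptor(*args):
        log.append((name, size_of(*args)))  # measurement code
        return original(*args)              # delegate to the original call

    setattr(namespace, name, interceptor)

# size_of extracts the message size from each call's own parameters.
intercept(comm, "send", lambda dest, data: len(data))
intercept(comm, "recv", lambda src, nbytes: nbytes)

comm.send(1, b"hello")
comm.recv(0, 16)
```

After the two calls, `log` holds one record per operation with the size of the data passed, which is the kind of semantic context that interception adds to raw timings.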

2.1.3. Profiling and Tracing

Profiling and tracing are measurement techniques that determine where a single execution of a program spends its time. Profiling tools such as gprof [GKM04] and VTune [Rei05] pause execution of a program repeatedly over a specified sampling period and record the contents of either the instruction pointer or the entire call stack. At the end of a profiling run, samples are analyzed to determine the percentage of time spent in each part of the code. Profiles lose temporal information but quickly identify key bottlenecks in a program. Tracing is similar to profiling in that it measures a program's execution, but it records a detailed time line of when events occurred. For example, a trace might record function entry and exit times for an entire run. Because tracing does not aggregate over time, recorded traces can require large amounts of memory, which can cause excessive overhead compared to profiling. TAU [SM06], Vampir [NAW*96], and EPILOG [WMfAM04] provide resources for creating and dumping traces of programs that use MPI.

Profiling trades off comprehensive data for low overhead, while tracing provides complete data on runtime events at the cost of much higher overhead.
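The difference in retained information can be made concrete with a small Python sketch that collapses a trace into per-function time shares of the kind a profile reports. The trace contents and the function `profile_from_trace` are hypothetical. Real profilers estimate these shares from periodic samples rather than from a full trace, so this only illustrates what aggregation discards.

```python
from collections import defaultdict

# Hypothetical trace: (function, entry time, exit time), non-nested for simplicity.
trace = [
    ("load", 0.0, 2.0),
    ("solve", 2.0, 7.0),
    ("load", 7.0, 8.0),
    ("write", 8.0, 10.0),
]

def profile_from_trace(events):
    """Aggregate a trace into the fraction of run time spent in each function."""
    totals = defaultdict(float)
    for name, start, end in events:
        totals[name] += end - start
    run_time = sum(totals.values())
    return {name: spent / run_time for name, spent in totals.items()}

profile = profile_from_trace(trace)
```

The resulting dictionary has one entry per function regardless of run length, but it no longer shows that `load` ran twice or in what order the calls occurred.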

2.2. System Monitoring

The measurements discussed so far are application-level measurements in that they measure the performance of a single application process. We can also acquire system-wide performance information during a program's execution by executing external processes, enabling system-wide counters (described in Section 2.3.1), or gathering other metadata at runtime. Such methods require no modification to the code or executables involved and as such are simple to use. However, this kind of data is generally very coarse and semantically low-level. Further, because data is collected outside the measured application process, it can be difficult to attribute measurements to the target program's source code.

2.3. Types of Performance Data

2.3.1. Counters

A counter is a special hardware register that accumulates the number of occurrences of a specified event over time. The counted events can be either software events, such as system calls, or hardware events, such as floating-point operations, cache misses, and packets received over a network link. The complete set of countable events is specific to the platform being used but is generally quite extensive. Commonly, counters are either instrumented to initialize and terminate around a block of code designated for analysis or run system-wide during program execution. PAPI [MBDH99] provides a portable interface to specify, initialize, terminate, and read out counters.

The overhead and precision of performance counter measurements depend on how frequently they are sampled. Sampling performance counters too frequently gives high overhead, which can limit precision and make attributing counted events to particular instructions difficult.

Counter data most directly benefits visualizations in the hardware (Section 5) and software (Section 6) contexts. For example, in a network visualization, packet counters can be recorded per-link to visualize network traffic. Other hardware counters can be mapped directly to the resource (CPU, memory, etc.) responsible for generating the event. Counters measured within instrumentation can also be attributed indirectly to the instrumented code, giving software context.

2.3.2. Hardware Samples

Traditional accumulative hardware counters have been extended to provide more precise and detailed information about particular instructions. Instead of simply incrementing a counter, modern hardware performance units can write detailed information about an instruction's execution, including its precise instruction pointer, progress within the processor's pipeline, total latency, and more. Intel and AMD processor architectures both include hardware capabilities to measure memory loads and stores, and Intel in addition provides a capability to sample branching events [Int07, DC07].

Hardware sampling provides finer granularity with low overhead because it is implemented as part of a microprocessor. Tools still need to conduct detailed analysis to attribute such samples to program source code.

2.3.3. Traces and Call Paths

Trace files contain lists of timestamped point events recorded during program execution. These events can include procedure entry and exit, message sends and receives, and object acquisitions and releases. By following function entries and exits, the call stack at any point in time can be derived. These events may also be associated with certain hardware elements like memory addresses or particular CPUs.
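Deriving the call stack from entry and exit events amounts to replaying the trace against a stack. The sketch below assumes a hypothetical event format of `(timestamp, kind, function)` tuples sorted by timestamp and a single thread of execution; actual trace formats differ between tools.

```python
def call_stack_at(events, t):
    """Return the call stack (outermost first) in effect at time t.

    events: (timestamp, kind, function) tuples sorted by timestamp,
    where kind is "enter" or "exit". Events at exactly time t are applied.
    """
    stack = []
    for timestamp, kind, function in events:
        if timestamp > t:
            break
        if kind == "enter":
            stack.append(function)
        elif kind == "exit":
            stack.pop()
    return stack

# Hypothetical single-threaded trace.
events = [
    (0, "enter", "main"),
    (1, "enter", "solve"),
    (2, "enter", "dot"),
    (3, "exit", "dot"),
    (5, "exit", "solve"),
    (6, "exit", "main"),
]

stack_mid = call_stack_at(events, 2.5)
stack_later = call_stack_at(events, 4)
```

In a parallel trace, the same replay would be run once per process or thread, each with its own stack.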

In some parallel environments, there may be one trace file generated for each process or thread, so trace data size typically scales with the number of concurrent tasks. Parallel systems may not guarantee high resolution clock synchronization, resulting in some inaccuracy in event timestamps.

Depending on the features in the tracing tool and the options selected by the user, more or less information can be included, for example message sizes with the sends and receives or parameters with the procedures. Some tools can also record counter values with each event.

3. Performance Goals

The main goal of performance analysis, and thus of performance visualization, is to make the application execute faster or use less power. There are several sub-goals on the road to efficiency that have utilized visualization. In this section, we discuss these goals, dividing them into three main categories: global comprehension, problem detection, and diagnosis and attribution.

3.1. Global Comprehension

Often the first step in optimizing an application is understanding the big picture regarding what occurred during an execution. When specific targets for optimization are unknown, analysts must narrow down regions of interest from the whole application. Global comprehension goals also exist so users can get a sense of normal behavior as well as compare predicted and achieved performance. Visualizations that present a strong overview or allow for pattern matching may be particularly useful here.

The tasks involve understanding program structures and resource utilization. Program structures include phases of execution, algorithms, data structures, communication patterns, data motion, access patterns, and data dependences. Resource utilization includes the magnitude and distribution of demands on processors, memory, and the network. Understanding the intricate relationships between these different aspects of program behavior forms the necessary foundation for identifying and understanding the performance of an execution.

3.2. Problem Detection

Visualization can help developers detect performance problems such as anomalous behavior, performance bottlenecks, load imbalance, and resource usage issues. Outlier detection, pattern detection, focus+context features, and dependency tracking can aid in finding problems.

Anomalous behavior includes deadlocks, livelocks, data race issues, or unexpected behavior. Bottlenecks and imbalance may exhibit similar symptoms, like outlier computation or message durations and significant idle times. These problems could also be detected by recognizing network congestion or memory contention, or by characteristics of the critical path through the execution.

Resource misuse includes low parallelism and false parallelism, where many threads are created unnecessarily. Synchronization may also be unnecessary and impede performance. Another resource issue could be poor locality in data accesses.
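To illustrate how outlier durations can signal load imbalance, the following Python sketch computes a simple imbalance ratio over per-process busy times and flags the processes that stand out. Both the max-over-mean metric and the threshold of 1.5 are assumptions chosen for this example; they are not prescribed by any of the surveyed tools.

```python
from statistics import mean

def imbalance(times):
    """Maximum over mean busy time; 1.0 means perfectly balanced."""
    return max(times) / mean(times)

def outliers(times, factor=1.5):
    """Ranks whose busy time exceeds factor times the mean (assumed threshold)."""
    average = mean(times)
    return [rank for rank, t in enumerate(times) if t > factor * average]

# Hypothetical busy time in seconds for four processes.
busy = [10.0, 10.0, 10.0, 30.0]

ratio = imbalance(busy)
slow_ranks = outliers(busy)
```

Here the last process carries three times the work of the others, so the remaining processes would show significant idle time while waiting for it.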

3.3. Diagnosis and Attribution

Diagnosis of a problem may follow directly from detection or be more subtle. Problems may be attributed to software, through relationships with lines of code, variables, data structures, or third party libraries. This may be a step in the process of recognizing poor distribution or division of work, a sub-optimal algorithm or data structure, or an opportunity for better overlap of messaging and computation. In distributed and parallel systems, the mapping of tasks to the system can likewise be an issue.

Problems may also be attributed to the system on which the code is run. Operating system effects, memory or scheduling policies, and network routing algorithms may contribute to poor performance. Finding these effects gives developers the information necessary to take any steps they can to ameliorate them.

Highlighting the true sources of inefficiency can be difficult. Linking and correlation, pattern detection, dependency tracking, and ensemble comparison features may aid in achieving these goals.

4. Taxonomy

We organize our survey by the main context represented by the visualization. By context, we refer to the concepts onto which the data is mapped and of which the visualization is constructed. In some cases, this context can be derived directly from recorded data, such as a visualization focusing on a specific data column. In other cases, the context may require some form of additional input about the environment from which the data was collected, such as structure mined from the source code or the graph of a distributed system. Sometimes this information is assumed and hard-coded into the visualization. We define four major contexts in performance visualization: hardware, software, tasks, and application.

Hardware is the natural context for data collected from performance counters, as these are associated with individual hardware elements like nodes, cores, or links. Additional context in hardware includes the hierarchical grouping of the elements, the network topology of these elements, and queues, scheduling, and interfaces associated with these elements (even if they may be implemented by low-level software).

Software covers contexts related to a program's source code. This includes static information, such as the class structure of the program and individual variables, as well as dynamic data associated with executions, such as call graphs, developed data structures, and program flow.

Tasks contexts involve the individual tasks performing the computation. Tasks contexts exist at many levels of granularity. Processing elements, threads, and processes are fractions of a single program. Jobs and commands represent entire programs that may share a system. Note that processing elements form a tasks context when viewed as mostly anonymous actors performing work, but they form a software context when considered as specific (and largely different) objects interacting in an object-oriented program.

Application refers to the context of what is actually being computed. In scientific simulations, this is often bounded physical space. Another common application context is the set of matrices used in matrix libraries.

Some visualizations draw contexts from multiple categories. For example, a tasks layout may be influenced by the underlying topology of the processors on which they run. In these situations, we classify the visualization by the dominant context but make mention of the additional contexts used. In the specific case of contexts related to the operating system (OS), we generally classify them under hardware, with the justification that the OS is typically not programmable to the extent that the software is.

Figure 1 provides an alternative picture, organizing the most recent visualizations by complete tool rather than by individual view, as we discuss in the following sections. This shows which tools cover multiple contexts as well as which goals they address and what sizes of problems they handle.

5. Hardware Visualization

Visualizations in the hardware context create a visual representation of the hardware on which an application code is run. Often these visualizations map data from performance counters onto a depiction of the hardware from which the data originated. Building representations of different hardware requires developing an intuitive metaphor for the topology of the hardware. An effective hardware metaphor decomposes the hardware into its basic elements while retaining its unique characteristics necessary for performance analysis. These techniques aim to illustrate complex hardware topology, identify hardware-based performance problems, and show the relationship between software and hardware.

We categorize visualizations in the hardware context into those depicting the computing network and those depicting individual compute nodes.

Figure 2: A screenshot of Boxfish [LLB*12, ILG*12] showing a selected set of nodes in a 3-dimensional torus network. Nodes are arranged in a 3-dimensional graph with links denoted by edges (left), and an orthogonal projection shows a subset of the edges without occlusion (right).

5.1. Network

Supercomputing nodes are connected via a network. The system can be interpreted as a graph where vertices are nodes and edges are network links. The performance visualizations therefore often take the form of graph visualizations. Because network topologies vary so widely in their structures, a challenge in creating network visualizations is that each must be highly tailored to a specific topology. Tree-based networks, such as fat-trees, lend themselves to hierarchical visualizations, while others, such as torus and hypercube networks, lend themselves to complex graph layouts and dimension-reducing projections.

[Figure 1 (table): one row per tool or paper. Columns: Visualization Techniques; Papers; Taxonomy (H, S, T, A); Demonstrated Scale (Data, Parallel); Global Comprehension (Program Structure, Resource Usage); Problem Detection (Anomalies, Bottlenecks and Imbalance, Resource Misuse); Diagnosis and Attribution (Software, System). Rows: Bhatele et al. [BGI12]; Boxfish [LLB*12, ILG*12]; Choudhury and Rosen [CR11]; DOTS [BKS05]; Frishman et al. [FT05]; Heapvis [AKG*10]; Kim et al. [KLJ07]; Lin et al. [LTOB10]; Rose et al. [RHJ07]; Sambasivan et al. [SSMG13]; Sigovan et al. [SMM*13a]; STAT [AdSL*09]; Streamsight [DPA09]; Threadscope [WT10]; Weidendorfer et al. [WKT04]; de Pauw et al. [DPWB13]; Muelder et al. [MGM09]; Muelder et al. [MSM*11]; TraceVis [RZ05]; Triva [SHN10]; Zinsight [DPH10]; Cheadle and Field [CFA*06]; Moreta and Telea [MT07]; Extravis [CHZ*07]; HPCToolkit [ABF*10, TMCF11, LMC13]; Lumière [BBH08]; Projections [KZKL06, LMK08]; ParaProf [SML12]; Schulz et al. [SLB*11]; Trumper et al. [TBD10]; SynchroVis [WWF13]; SyncTrace [KTD13]; Vampir [NAW*96, BW12, ISC12, VMa13]; Choudhury et al. [CPP08]; lviz [WYH10]; Scalasca [GWW10, WG11]; Sigovan et al. [SMM13]; Trevis [AH10]. Cell values (X marks and reported scales) are omitted here.]

Figure 1: Classification of recent visualizations by context, scale, and goal. We limit the scale and goal to what was reported (rounded); in practice, the visualization may exceed what is listed in the chart. Any value that was not clear or was missing entirely in the publications is marked not reported (NR). We focus on works published in the last 10 years.

A common general representation of the network graph is an adjacency matrix with computation nodes as the x and y axes, sometimes referred to as a communication matrix [HE91]. ParaGraph [HE91] depicts communication matrices and color-codes the elements to indicate areas of heavy link traffic. Zhou and Summers [ZSC03] use an adjacency matrix to show quaternary fat-trees and depict transactions by animating 3-dimensional glyphs on matrix element locations. While communication matrices effectively show all links, they are very ineffective in showing the shape of the network and the distance between non-neighboring nodes. Furthermore, because they show all possible links, and existing links are shown twice (once for each direction), they contain much visual redundancy.
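The data behind a communication matrix can be sketched as follows. The function `communication_matrix` and the message tuples are hypothetical; the sketch only shows how traffic is accumulated into one cell per ordered pair of nodes, which is also why every link occupies two cells and why cells for non-existent links still take up space.

```python
def communication_matrix(num_nodes, messages):
    """Accumulate bytes sent from each source (row) to each destination (column)."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for src, dst, nbytes in messages:
        matrix[src][dst] += nbytes
    return matrix

# Hypothetical messages: (source node, destination node, bytes).
messages = [(0, 1, 100), (1, 0, 50), (0, 1, 25), (2, 3, 10)]

matrix = communication_matrix(4, messages)
```

A visualization would then map each cell value to a color, so that heavily used directions of a link appear as hot spots.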

Haynes et al. [HCR01] make another general representation by depicting all nodes in a 2-dimensional grid and color-coding all network links. As such, the user can follow links by finding matching colors, with no redundant links or wasted space. However, this technique still suffers in showing the network shape and paths with multiple links. Another issue is that humans are visually limited in discerning multiple unique colors with accuracy, which limits the size of the network this visualization can usefully represent.

Zhou and Summers [ZSC03] also use a modified 2-dimensional H-tree layout to demonstrate the network topology of the quaternary fat-tree used in a variety of HPC systems. They aggregate histograms and arcs in a third dimension to show messages passed between nodes. Muelder et al. [MSM*11] demonstrate another hierarchical style graph visualization of the I/O network for Blue Gene/P. They depict the network in a radial layout with storage nodes in the center, compute nodes on the outside, and I/O nodes in between. The performance data is aggregated on the drawn links between the different types of nodes. Both hierarchical visualizations demonstrate the ability to discover areas of heavy communication traffic within their respective network topology. Muelder et al. [MSM*11] depict the entirety of the performance data in a single 2-dimensional view, while Zhou and Summers [ZSC03] take advantage of encoding data in the third dimension and using animation at the cost of occlusion and complexity. Purely hierarchical approaches are only possible for specific network topologies but effectively reduce the visual complexity of the network graph by utilizing well-known hierarchical metaphors. We also note that the aforementioned visualizations deal with network trees of relatively shallow depth, but this has not yet been an issue because existing HPC interconnects typically do not use much deeper hierarchies.

Many visualizations lay out the network graph of specific network topologies as 2- or 3-dimensional meshes with nodes as vertices and links as edges. Boxfish [ILG*12] provides an interface for displaying performance data on a mesh representing a 3-dimensional torus network. Landge et al. [LLB*12] create 2-dimensional projections of the 3-dimensional torus network in Boxfish with no occlusion (Fig. 2). Haynes et al. [HCR01] depict, in addition to the general 2-dimensional visualizations, another 3-dimensional mesh layout of a 3-dimensional torus network. The mesh layouts create more intuitive depictions of the network, but there does not always exist an intuitive mesh projection for a network topology type. As topologies increase in dimension, such as the 5-dimensional and 6-dimensional torus, it becomes much more difficult to create a low-dimensional mesh that is easily understood.

5.2. Node

The topologies of CPU nodes are often relatively small compared to network topologies; these are on the order of tens and hundreds of processors and memory resources. For this reason, parallel programs usually employ a mapping from N tasks to M processors with N M. Techniques for visualizing on-node computation space often take the form of task-based visualizations [TBD10, KLJ07, dKSB00, Rei05, ABF*10]. However, such a mapping typically does not exist between tasks and memory resources, especially in the context of multi-level memory hierarchies, where multiple processors share resources simultaneously. As a result, on-node hardware visualizations have mostly targeted memory address space and resource usage.

5.2.1. Processor Topology

Processor-based visualizations typically visually encode cumulative performance data per-processor. Often the layout of processors is based on the hardware numbering, and data is represented with histograms or stacked bar charts [BD01, ABF*10]. Schulz et al. [SLB*11] arranged processors based on the 2D layout of the application (see Section 8) and displayed values using color.

Processor topologies are often embedded within larger network visualizations to show processor resources within individual nodes. Haynes et al. [HCR01] depicted nodes with aggregated glyphs representing multiple processors on each node. Similarly, Zhou and Summers [ZSC03] showed each node as a subdivided grid with cells representing individual processors.

5.2.2. Memory Topology

Figure 3: Left: Memory visualization by [GT89]. Memory address space depicted as a 1-dimensional array wrapping down several rows. Blocks indicate allocations along address space; color legend indicates variable data types. Printed with author's permission. Right: Memory visualization by [MT07]. Allocations depicted as 2D blocks with address space on the vertical axis and time on the horizontal axis. Image courtesy of A. Telea.

Figure 4: The output of lstopo, a tool provided by hwloc [BCOM*10], to visualize hardware topology. The memory hierarchy is shown as a hierarchical space-filling layout with larger resources at the root and smaller caches and processors at the leaves.

Several on-node memory visualizations represent the virtual address space of memory as an infinite one-dimensional space. A program is allocated a finite subset of that space by the operating system, and all program variables lie within it. As such, many techniques [GT89, MT07, CFA*06] depict the space of a single program with variables as finite contiguous blocks within the program's memory. Griswold et al. [GT89] color-code different variables and their datatypes on a line which wraps down multiple rows to more effectively utilize screen space. Moreta and Telea [MT07] expand the 1-dimensional layout to depict allocations and deallocations over time, with the address space on the vertical axis and time on the horizontal axis. They also include an overview visualization with the address space collapsed, which more clearly shows the order of allocations over time.
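The wrapped 1-dimensional layout can be sketched as a mapping from an allocation's address range to one rectangle per screen row it touches. The function `wrap_block` and its parameters are hypothetical and only illustrate the general idea of wrapping an address line; they are not taken from the cited tools.

```python
def wrap_block(start, size, row_width):
    """Split the allocation [start, start + size) into per-row segments.

    Returns (row, first_column, last_column) tuples for an address line
    that wraps after row_width addresses.
    """
    segments = []
    address = start
    end = start + size
    while address < end:
        row, column = divmod(address, row_width)
        # Stay within the current row and within the allocation.
        span = min(end - address, row_width - column)
        segments.append((row, column, column + span - 1))
        address += span
    return segments

# Hypothetical 8-address allocation starting at address 6, rows of width 8.
segments = wrap_block(6, 8, 8)
```

An allocation that crosses a row boundary is drawn as several segments, which is the price paid for using screen space more effectively than a single long line.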

All the aforementioned address space-based visualizations target an understanding of allocator behavior and depict fragmentation issues effectively. However, the visualized datasets are often small, especially compared to common HPC applications. As programs utilize more and more of the growing 1-dimensional memory space, we believe the visualizations will not scale well enough to continue being useful.

Other memory visualization techniques have focused on depicting properties of the memory hierarchy, e.g., multi-level caches, RAM, and disk, rather than the address space.

Alpern [ACS90] created an early visualization showing the memory hierarchy of various hardware for the purpose of observing data migration between disk, memory, translation lookaside buffer, and registers. The visualization showed the different memory resources as boxes connected by drawn links and also drew subsets of the data within the memory resource in which they resided. While it did not embed performance data, it created a model for understanding what occurs in hardware and how cache-optimized algorithms more efficiently utilize memory resources. hwloc [BCOM*10], a software package for the analysis of system attributes, provides a tool called lstopo that detects and displays the topology of different architectures in a hierarchical space-filling layout (Figure 4) but, like the work of Alpern, does not encode performance data.

Choudhury et al. [CPP08] created an interactive visualization depicting simulated memory access data embedded within diagrams of the caches, address space, and iterations. This visualization shows accesses and misses from individual cache lines and addresses. While highly detailed, it would be infeasible to scale beyond the demonstrated number of memory resources. Rivet [BD01] displays another diagram-like visualization of different caches with per-processor memory performance data mapped to cache resources.

Choudhury and Rosen [CR11] created a more abstract representation using a radial space-filling layout, also for simulated data. They represent different levels of cache as rings around a central processor, with lower levels closer to the processor, and depict data migration between levels of cache as lines between ring segments, as seen in Figure 5. Mu et al. [MTSM03] created a visualization targeting NUMA effects on multi-socket nodes by depicting different NUMA domains and transactions between them. The visualization also includes sufficient information to map areas of NUMA transactions to source lines of code. Because they focus on a specific performance issue, the visualization is able to depict a small amount of information in a way that is directly useful in optimizing code for NUMA efficiency.

Rosen [Ros13] specifically targets the memory topology of NVIDIA graphics processors and creates a visual model depicting both processor and memory layout. The visualization decomposes performance data of multiple processing units (warps) to find representative subsets with which to compare. The visualized warps include information about memory banks used by each warp for the purpose of identifying bank conflicts, which represent major memory access bottlenecks. The idea to use representative subsets is an effective way to handle the plethora of performance data and the extensively large processor topologies while retaining information about average behavior and outliers.



Figure 5: Memory hierarchy visualization by [CR11]. Left: Addresses are represented as points, and different sets of points represent different memory resources. The outer ring represents main memory, the outer four arcs of points represent L2 cache, and the inner two arcs represent L1 cache. Lines between points denote migration of data between resources. Right: Simulated transactions are associated with the lines of code which caused them. Image courtesy of P. Rosen.

6 Software Visualization

We survey software visualization only as it relates to performance visualization. Therefore, software visualization techniques applied for other purposes, such as education or software maintenance, fall outside the scope of this work. As mentioned earlier, we define the software context as visualizations related to a program’s source code. This includes visualization of the software structures in terms of classes and packages, visualization of the code itself, serial traces of events related to method invocation, and call graphs of specific executions. Data-structure visualization tools are mainly used for education and debugging, although Heapviz [AKG*10] offers features that could be used for performance visualization.

6.1 Serial Trace Visualization

Serial trace visualization shows a sequence of events. Several different visualization metaphors have been used for the visualization of traces. One of them is a variation of icicle plots, where width encodes duration [RR99, TBD10, KTD13]. Figure 10 shows one such example; the icicle plot is located at the top of the screenshot.
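To make the icicle encoding concrete, the following minimal sketch derives one rectangle per call from a serial trace, with call-stack depth as the row and call duration as the width. The event format (timestamp, kind, name) and the function name icicle_layout are assumptions made for illustration; they are not taken from any of the cited tools.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    name: str
    depth: int     # call-stack depth, used as the row of the icicle plot
    start: float   # left edge: time at which the call was entered
    width: float   # width: duration of the call


def icicle_layout(events):
    """Turn (timestamp, kind, name) events into icicle rectangles.

    `kind` is "enter" or "exit". Events are assumed to be time-ordered
    and properly nested, as in a single-threaded trace.
    """
    stack = []   # currently open calls as (name, enter_time)
    rects = []
    for timestamp, kind, name in events:
        if kind == "enter":
            stack.append((name, timestamp))
        else:
            open_name, start = stack.pop()
            if open_name != name:
                raise ValueError(f"unbalanced trace at exit of {name!r}")
            # After the pop, the stack height equals the depth of this call.
            rects.append(Rect(name, len(stack), start, timestamp - start))
    # Order rows top to bottom and calls left to right.
    rects.sort(key=lambda r: (r.depth, r.start))
    return rects


# Hypothetical trace used only to exercise the sketch.
trace = [
    (0.0, "enter", "main"),
    (1.0, "enter", "load"),
    (4.0, "exit", "load"),
    (4.0, "enter", "solve"),
    (5.0, "enter", "kernel"),
    (9.0, "exit", "kernel"),
    (10.0, "exit", "solve"),
    (10.0, "exit", "main"),
]
layout = icicle_layout(trace)
```

Each Rect can be handed to any drawing library; a child always lies within the horizontal extent of its parent because nested calls are contained in time.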

A common practice in trace visualizations is to assign one of the axes to the time variable, while the other axis is used to represent different processes, classes, instructions, or methods [JSB97, DPH10, CHZ*07, MHJ91, MSM*11, RZ05]. In essence, most of these approaches represent different variations of Gantt charts. Some trace visualization tools animate the execution of events and even provide additional views for visualizing algorithms [JSB97, BBH08].

In contrast to the presented approaches, Cornelissen et al. [CHZ*07] place the methods in a circular layout, while the edges (events) that represent method calls are bundled to avoid clutter (cf. Figure 6). They provide an additional linked view where different methods are placed on top of the view, calls between them are shown with horizontal lines, and time is shown on the y axis. In somewhat similar fashion, De Pauw and Heisig [DPH10] use the vertical axis for encoding time, while the horizontal axis encodes different processes. Within each process column, events are represented as blocks, color-coded according to different software components. The horizontal position within a process column denotes the load module where the calls were generated.

Figure 6: Inner nodes represent different methods. Edges show caller-callee relationships. The edge color gradient can be used to denote the call direction or to show the call time, from less recent (light) to most recent (dark). Radial icicle plots show the software structure, which provides insight into how different packages are linked to each other. Image ©2007 IEEE. Reprinted, with permission, from [CHZ*07].

The trace visualizations thus far are usually viewed in fractions of the total duration. Wu et al. [WYH10] create dot plots of the entire trace versus itself, marking where the events are the same. Additional information is encoded as “bar codes” along the axes. The dot plot shows global patterns along the full timespan of the trace. The same method can be used to compare two different traces. Sambasivan et al. [SSMG13] focus specifically on comparing two request-flow traces with a side-by-side view, a difference view, and animation between them.
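The dot plot idea can be sketched in a few lines: cell (i, j) is marked when event i of one trace equals event j of the other, so repeated phases appear as off-diagonal line segments. This is an illustrative sketch only; the function names and the use of plain event labels are assumptions, and it does not reproduce the implementation of [WYH10].

```python
def dot_plot(trace_a, trace_b=None):
    """Return the set of (i, j) cells where trace_a[i] == trace_b[j].

    With a single argument the trace is compared against itself.
    """
    if trace_b is None:
        trace_b = trace_a
    # Index the second trace so each event is looked up once.
    positions = {}
    for j, event in enumerate(trace_b):
        positions.setdefault(event, []).append(j)
    return {
        (i, j)
        for i, event in enumerate(trace_a)
        for j in positions.get(event, [])
    }


def render(cells, rows, cols):
    """Draw the dot plot as text, one character per cell."""
    return "\n".join(
        "".join("#" if (i, j) in cells else "." for j in range(cols))
        for i in range(rows)
    )


# Hypothetical event labels; a real trace would use function or message ids.
trace = ["init", "send", "recv", "send", "recv", "done"]
cells = dot_plot(trace)
picture = render(cells, len(trace), len(trace))
```

Passing two different traces to dot_plot assigns one to rows and the other to columns, which corresponds to the comparison use described above.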

To cope with the size of the data, or to gain insight into the possible branching of events, traces are aggregated into call graphs, as described in [JSB97].
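As a rough illustration of this aggregation step, the sketch below collapses an enter/exit trace into caller-callee edges annotated with a call count and total inclusive time. The event format and the "<root>" placeholder for the top-level caller are assumptions for this example and are not taken from [JSB97].

```python
from collections import defaultdict


def aggregate_call_graph(events):
    """Collapse a nested enter/exit trace into call graph edges.

    Returns {(caller, callee): [call_count, total_inclusive_time]}.
    """
    edges = defaultdict(lambda: [0, 0.0])
    stack = []  # open calls as (name, enter_time)
    for timestamp, kind, name in events:
        if kind == "enter":
            stack.append((name, timestamp))
        else:
            callee, start = stack.pop()
            # The caller is whatever remains on top of the stack.
            caller = stack[-1][0] if stack else "<root>"
            edge = edges[(caller, callee)]
            edge[0] += 1
            edge[1] += timestamp - start
    return dict(edges)


# Hypothetical trace: main calls step twice and io once.
trace = [
    (0.0, "enter", "main"),
    (1.0, "enter", "step"),
    (3.0, "exit", "step"),
    (3.0, "enter", "step"),
    (6.0, "exit", "step"),
    (6.0, "enter", "io"),
    (7.0, "exit", "io"),
    (8.0, "exit", "main"),
]
graph = aggregate_call_graph(trace)
```

The eight trace events reduce to three edges, which shows how aggregation trades the ordering of events for a much smaller structure.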



6.2 Call Graph Visualization

Together with serial trace views, call graph views appear to be one of the most common visualizations in the software context for performance data analysis. In most cases, call graphs are tree structures, such as calling context trees, that are usually produced by profilers to help understand caller-callee relationships. Here one should take care not to confuse the debugging goal with the performance optimization goal. Call graph visualization for performance analysis purposes usually encodes additional performance data in itself or is shown together with other contexts, such as tasks or hardware.

The most common representation of call graphs uses the node-link metaphor, where the node is usually a function (method) and the link represents a function call. In this regard, there are several tools that use an indented tree layout for visualizing the call trees [ABF*10, MW03, WG11]. Some of these tools integrate performance data directly into the nodes by color-coding them [MW03]. Others use the horizontal space provided by the indented layout to add tabular data or even small bar charts or histograms. They may also employ computational methods to find hot paths, at which point the corresponding branch expands and directs users’ attention to the relevant portion of the tree [ABF*10]. However, due to the size and complexity of the call graphs, performance and statistical data are usually visualized using multiple coordinated views.
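One simple way to compute a hot path, sketched here under stated assumptions, is to descend from the root into the most expensive child for as long as that child accounts for at least a given fraction of its parent's inclusive cost. The nested-dictionary tree format and the 50% default threshold are illustrative choices, not a description of how any cited tool implements the analysis.

```python
def hot_path(tree, threshold=0.5):
    """Return the names along the hot path of a call tree.

    `tree` is a nested dict: {"name": str, "cost": float, "children": [...]},
    where "cost" is the inclusive cost of the node.
    """
    path = [tree["name"]]
    node = tree
    while node.get("children"):
        hottest = max(node["children"], key=lambda child: child["cost"])
        # Stop when no single child dominates the parent's cost.
        if node["cost"] <= 0 or hottest["cost"] / node["cost"] < threshold:
            break
        path.append(hottest["name"])
        node = hottest
    return path


# Hypothetical profile used only to exercise the sketch.
profile = {
    "name": "main", "cost": 100.0, "children": [
        {"name": "init", "cost": 5.0, "children": []},
        {"name": "solve", "cost": 90.0, "children": [
            {"name": "assemble", "cost": 30.0, "children": []},
            {"name": "kernel", "cost": 55.0, "children": [
                {"name": "pack", "cost": 20.0, "children": []},
                {"name": "compute", "cost": 25.0, "children": []},
            ]},
        ]},
    ],
}
path = hot_path(profile)
```

In this example the path stops at kernel, because neither of its children reaches half of its cost; a tree view would expand exactly these nodes.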

Other node-link layouts mainly use conventional tree drawing algorithms [SM06, DPH10, Rei90, LTOB10, WKT04, AdSL*09]. Usually some data is visualized using the color, shape, or size of the nodes. For instance, DeRose et al. [DHJ07] managed to integrate load balancing data inside the nodes of the call tree by using the width and the height of the nodes as well as by integrating small bar charts inside the nodes.

Space-filling approaches, such as treemaps [WKT04] and sunbursts [AH10], have also been used to represent call trees. Additionally, Adamoli and Hauswirth [AH10] present a view where a dissimilarity matrix is used to compare several calling context trees.

6.3 Code and Code Structure Visualization

Sometimes it is important to inspect the specific lines of code where a potential performance problem is detected. Many tools that employ call graph visualization show the code as well, so that when users click on a specific node in the graph, the corresponding line of code is shown or highlighted in the code view [ABF*10, SG93].

However, there are also approaches that visualize the code and the performance data together. One of the first such examples is the Seesoft tool [ES92]. Here each line of code is represented by a line of pixels and color-coded according to the number of executions, providing the user with an easy way to notice “hot spots”. A similar idea is presented by Liao et al. [LDB*99], where each code character has been encoded into a pixel and the color denotes various cycles in the code. The aim of this visualization is improving parallelization.

There are approaches to show specific parts of code in other contexts as well. TraceVis offers functionality where the user can specify regions of static code [RZ05]. It will then color-code the background regions of the trace view, showing which static code elements map to the dynamic trace data. It is also possible to select a specific region in the dynamic trace view and automatically color-code all static instructions.

In some cases, certain structural features of the code, such as class hierarchy, should be analyzed in the context of the performance data in order to understand whether a potential problem originates from the application code or an external library. One approach is to visualize software modules or class hierarchy. Icicle plots could be used in this case as well [CHZ*07]. Figure 6 shows the use of icicle plots in a radial layout. SynchroVis shows program traces in the static structure of the program, visualized using a three-dimensional city metaphor [WWF*13]. Here different features of the city are mapped to code structures. For example, districts represent packages, while buildings represent classes. This work is conceptually similar to the previous two-dimensional representation approach [JSB97]. One of the most straightforward methods to map different software components or modules in other contexts is color. For instance, different parts of a call tree can be color-coded according to the component they belong to [AH10, LTOB10].

7 Tasks Visualization

The fundamental context required by tasks visualizations is the attribution of the performance data to the abstract actors that generated it. These actors include processes, threads, and jobs. Further context in this area includes the hierarchical structure of the actors (e.g., what threads belong to what process). Some tasks visualizations are able to take advantage of other contexts, such as the specific nodes or sites where the process is being run.

Execution traces and system logs are often recorded with tasks context. These documents capture timestamped events such as function entry, message receives, and job initiation. Traces and logs offer analysts a full record of what occurred, but this increases the difficulty of making sense of them. Ordering of events can unveil bottlenecks, delays, and anomalies. Patterns in utilization and communication can be found over the duration of the data collection. The time component of this data is essential in this analysis, so there are many visualizations that attempt to display the time streams per task. We discuss these in Section 7.1.

However, time is not a necessary component in tasks visualizations. Sometimes aggregating information over a duration, either from the trace or through profiling, can yield insight into program behavior as well. Non-time tasks visualizations create a snapshot of task interactions over time. These are discussed in Section 7.2.

7.1 Time in Tasks Visualizations

Figure 7: Vampir [NAW*96]. The main visualization is the timeline view. Bar color indicates whether application or MPI functions are active. The black lines indicate messages between processes.

The majority of time-based tasks visualizations assign time to the x or y axis and then constrain the events of each task to bars in a row or column, respectively, similar to a Gantt chart. These visualizations generally omit the call stack information found in single task trace visualizations (Section 6), as available space for the parallel tasks is already a challenge. A typical visualization of this type is Vampir [NAW*96], as shown in Figure 7. We refer to these as timelines and classify them and closely related ones in Section 7.1.1. Other techniques, such as animation and sonification, are discussed in Section 7.1.2.

7.1.1 Task Timelines

We classify task timeline visualizations by their representation of both time and the relationships between individual timelines. Most visualizations use physical time (e.g., wall-clock time, system time, and cycle counts), which is generally what is recorded in traces and logs. However, some visualizations support logical time, a partial ordering based on dependency information, often Lamport clocks [Lam78]. Cuny et al. [CHK92] claim that logical time is needed for debugging the correctness of parallel programs, while physical time is more important for performance, where the ultimate goal is decreasing the total time required by the program. PARADE [KS98] supports phase time, which is a partial ordering of phases of an execution rather than individual events. However, computing phases from trace data without extra information is difficult.
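For readers unfamiliar with logical time, the following sketch assigns Lamport clocks [Lam78] to a small message-passing trace: every event advances its process's clock by one, and a receive is additionally placed after the matching send. The event tuple format and the requirement that events arrive in a causally consistent order (each send listed before its receive) are assumptions of this example.

```python
def lamport_timestamps(events):
    """Assign Lamport clocks to (process, kind, message_id) events.

    `kind` is "local", "send", or "recv". Events must be listed so that
    every send appears before its matching receive.
    """
    clock = {}     # process -> latest logical time on that process
    sent_at = {}   # message_id -> logical time of the send
    stamped = []
    for process, kind, message_id in events:
        time = clock.get(process, 0) + 1
        if kind == "recv":
            # A receive must be ordered after the corresponding send.
            time = max(time, sent_at[message_id] + 1)
        elif kind == "send":
            sent_at[message_id] = time
        clock[process] = time
        stamped.append((process, kind, message_id, time))
    return stamped


# Hypothetical two-process exchange.
events = [
    ("P0", "local", None),
    ("P0", "send", "m1"),
    ("P1", "local", None),
    ("P1", "recv", "m1"),
    ("P1", "send", "m2"),
    ("P0", "recv", "m2"),
]
stamped = lamport_timestamps(events)
times = [entry[3] for entry in stamped]
```

Plotting events at their logical time rather than their wall-clock time aligns causally related events across timelines, which is why this representation is associated with correctness debugging.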

Some visualizations show no relationships between timelines [DPH10, SG93, LSV*89, TBD10, Sha90, Rei90]. Zinsight [DPH10] recognizes a hierarchy of tasks and allows users to select the granularity at which to plot events. While Vampir’s default view shows messages between timelines, it also provides cluster timelines, which show aggregated events over the cluster with no messages [VMa13].

Figure 8: SIEVE contour plot shows event wavefront. Reprinted from [SG93] with permission from Elsevier.

SIEVE [SG93] draws contour lines across the tasks where the events are equivalent, as shown in Figure 8. Muelder et al. [MGM09] show log-scale duration versus time rather than placing tasks or groups of tasks on the y-axis. Instead, events from all tasks were drawn over the same area, and overplotting and blending techniques were used to show consensus (or lack thereof) among tasks.

In some cases, inter-timeline relationship data may not exist. HPCToolkit [ABF*10] visualizes sampled data rather than full traces. It shows all tasks in an information mural style display, sampling each pixel for its task and sample contributors. Individual tasks can be selected for a detailed single timeline display.

De Pauw et al. [DPWB13], SeeLog [EL96], and lviz [WYH10] show separate program instances which, unlike processes, do not interact directly. The De Pauw et al. visualization (Figure 9) displays job lifetimes on a shared system in an online stacked graph-like visualization that groups jobs by user. Rather than assigning rows to jobs permanently, De Pauw et al. change the y value over time so the clusters remain contiguous and separate from each other. SeeLog shows classes of applications per row, with glyphs indicating how many are active rather than bars. lviz visualizes job logs on Microsoft Windows in a dot plot, revealing repeated patterns of jobs over time. The dot plot can be used to compare two separate logs by assigning one log to rows and another to columns.



Figure 9: De Pauw et al. timeline of jobs on a shared system. Image ©2013 IEEE. Reprinted, with permission, from [DPWB13].

Timelines may affect each other through dependency constraints, such as message sends and receives or access to shared objects. Message dependencies are often shown as a line or arrow from the send on one timeline to the receive on another [YSM95, SKV03, ZLGS99, LMCF90, FB89, dKSB00, KS98, HE91, PLCG95, KZLK06, TSS98, KTD13, SHN10, SRWS99, KTM97, KG96]. Including these types of dependencies makes it possible to highlight critical paths.

Virtue [SRWS99] draws timelines in 3-dimensional space, using a ring layout for tasks rather than an axis. The visualization is also compatible with a CAVE environment. VisuaLinda [KTM97] and Triva [SHN10] use 3D in order to cluster tasks by their location on physical processors in two of the dimensions.

SyncTrace [KTD13], shown in Figure 10, draws a serial timeline overview for a selected thread and a focus view which draws multiple threads as sectors of a circle. The call stack is maintained for these threads, resulting in a sunburst-like design. Relationships between the focus thread and other threads are drawn as aggregated edges, similar to a chord diagram.

Figure 10: SyncTrace [KTD13] visualization shows thread call stacks and relationships between them. Image courtesy of B. Karran.

In addition to message and shared object dependencies, there may be relationships regarding the lifetime of threads. While these are also dependencies, we separate them because they involve the addition or removal of tasks in the visualization. Many of the visualizations supporting lifetime relationships also support message or shared object dependencies.

DOTS [BKS05] uses a Sugiyama-style layout algorithm to assign threads to columns and route dependency lines. The threads are grouped by processor. ThreadScope [WT10] also uses a layered node-link diagram, with line styles representing different relationships and node styles representing both threads and memory. The graph can be condensed through grouping by malloc calls or classes.

Several visualizations represent parent-child relationships among tasks but do not emphasize creation and destruction; instead, the space is allocated to the task far past the extent of its lifetime. We do not consider these as showing lifetime relationships because it is not clear whether the task is nonexistent or just idle. Wang and Kunz [WK00] modify the usual timeline view to show the lifetimes of migratable objects as they move between individual timelines representing machines in distributed systems.

Table 1 organizes the task timeline visualizations by what relationships are present between the individual timelines and what type of time is displayed.

7.1.2 Other Time-based Tasks Visualizations

Several visualizations use animation to represent time, showing the state of the tasks at every instant. VISTOP [BB92] uses a mailbox metaphor to show messaging and semaphore activity and a directory to show thread spawning relationships. SynchroVis [WWF*13] uses a city metaphor to represent the static structure of the program, with special buildings where added floors represent thread and shared object creation. Arrows connecting to the special buildings show the evolution of the system in time.

Belvedere [HC88] and its follow-up Ariadne [CFH*93] animate messages between processes in logical time for debugging. Streamsight [DPA09] creates a node-link diagram with processing elements as nodes and streams between them as links. Grouping by job or host makes aggregation and clutter reduction possible. This visualization allows for real-time monitoring but can also be recorded and replayed.

Sigovan et al. [SMM13b] animate events as rising bubbles per process, which fade into the background at the end of their duration, creating a contextual history. Using overplotting and blending techniques, this animation is able to scale to 16K processes.

PARADE [KS98] and PVaniM [TSS98] place processes on a circle and animate messages moving between them. Growing Squares [ET03] similarly places processors when animating dependency relationships between processes in logical time. Process squares ‘grow’ outlines that incorporate the colors of other processes that have causally affected them. Frishman and Tal [FT05] animate migrating object interactions using the Growing Squares technique to show object history. Clusters representing object locations lend their colors to objects that pass through. This visualization reduces clutter by more aggressively aggregating clusters further from the user’s focus.

Relationships | Time | Visualizations
None | Physical | ConcurrencyVisualizer [GN10], De Pauw et al. [DPWB13], Devise [KMLM97], Falcon [GEK*95], HPCToolkit [ABF*10], lviz [WYH10], Muelder et al. [MGM09], PIE [LSV*89], Reilly [Rei90], SeeLog [EL96], Sharma [Sha90], SIEVE [SG93], Trümper et al. [TBD10], Vampir [NAW*96], Zinsight [DPH10]
Dependency | Physical | AIMS [YSM95], Jumpshot [ZLGS99], Moviola [LMCF90, FB89], Pajé [dKSB00], PARADE [KS98], ParaGraph [HE91], PARAVER [PLCG95], Projections [KZLK06], PVaniM [TSS98], SyncTrace [KTD13], Triva [SHN10], Virtue [SRWS99], VisuaLinda [KTM97], XPVM [KG96]
Dependency | Logical | Concurrency Maps [Sto88], DeWiz [SKV03], Moviola [LMCF90, FB89]
Lifetime | Physical | Wang and Kunz [WK00]
Lifetime | Logical | DOTS [BKS05], ThreadScope [WT10], Zernik et al. [ZR91, ZSM92]

Table 1: Classification of Tasks Timelines

Figure 11: Snapshot from the animated trace visualization by Sigovan et al. [SMM13b]. As events persist, they rise upward logarithmically. Image courtesy of C. Sigovan.

Yamaguchi and Itoh [YI03] animate the moving locations of tasks on hierarchical distributed systems. The hierarchy is shown using nested rectangles. Using the third dimension, they encode other metrics in the height of each process.

Sonification methods have also been used. Francioni et al. [FAJ91] and Pablo [RRA*93] map tasks to separate instruments or tones, having them sound for the duration of particular events or other states of interest (e.g., idleness).

7.2 Visualization of Non-Time Tasks

Several visualizations show the process communication graph, a summary of all messages among processes in some time frame. Adjacency matrices are frequently used [HE91, RRA*93, VMa13]. Bhatele et al. [BGI*12] modify a node-link diagram to aggregate processes with similar delay behavior into arcs.
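The adjacency matrix underlying such views is straightforward to build from a message log. The sketch below accumulates bytes sent per sender-receiver pair; the (sender, receiver, size) record format and the function name are assumptions made for illustration.

```python
def communication_matrix(messages, num_processes):
    """Sum message sizes into a num_processes x num_processes matrix.

    Entry [s][r] holds the total bytes sent from process s to process r.
    """
    matrix = [[0] * num_processes for _ in range(num_processes)]
    for sender, receiver, size in messages:
        matrix[sender][receiver] += size
    return matrix


# Hypothetical ring exchange among four processes, with one extra message.
messages = [
    (0, 1, 100),
    (1, 2, 100),
    (2, 3, 100),
    (3, 0, 100),
    (0, 1, 50),
]
matrix = communication_matrix(messages, 4)
```

Rendering each entry as a colored cell gives the familiar matrix view; a ring pattern such as this one shows up as a single shifted diagonal.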

Kim et al.’s [KLJ07] method represents threads as points on a cone, with the distance from the apex indicating creation depth. Threads can be aggregated to reduce clutter. ParaProf [SML*12] colors tasks by user-chosen metrics and provides a scripting language so the user can decide the task layout.

8 Application Visualization

Application contexts are specific to the problems that their target programs are attempting to solve. This generally corresponds to the perceived space of the input and output of the program. For example, the application context of a matrix multiplication program is the space of the matrices involved in the operation.

ParaGraph [HE91] includes facilities to generate performance displays in the application context, noting that such displays could provide new detail and insight, but also mentioning that such displays are highly non-trivial and application specific. They show an example of data transaction counts of a matrix operation overlaid onto the input and output matrices.

A similar visualization involving parallel prefix sums was shown by Stasko and Kraemer [SK93]. They created an animation showing different processors operating on different parts of the input dataset. This visualization proved useful in debugging parallelism issues in the prefix code and provided a stronger understanding of the utilized parallelism.

Schulz et al. [SLB*11] observed that application developers find the application context highly intuitive. They created visualizations which successfully uncover application-specific performance bottlenecks by arranging processors and their generated counter data by the physical regions within the bounded fluid dynamics simulation they computed. The visualization showed that areas of high computational and bandwidth costs occurred in areas of high fluid turbulence, as seen in Figure 12. Schulz et al. also observed a need to complement the application context view with other contexts, such as hardware and communication, noting that problems become more obvious when projected into the context from which they originate.

Wylie and Geimer [WG11] likewise created a visualization with processor computation time attributed to physical regions in a large scale reservoir simulation. They also observed application-specific performance bottlenecks, where certain areas of the dataset incurred larger computational overhead.

(a) Aluminum density visualization over several timesteps.

(b) Floating-point operations mapped to application context over several timesteps.

Figure 12: Floating-point operations are highly correlated with aluminum density of their associated application areas. Images ©2011 IEEE. Reprinted, with permission, from [SLB*11].

The aforementioned applications generally have intuitive contexts in 2- or 3-dimensional space, but it is a challenge to depict the application context of programs with high-dimensional or abstract output. Furthermore, application visualizations have required significant implementation effort by analysts, who have not been able to leverage existing visualization tools, as these are associated with simulation output rather than performance.

9 Challenges

In this section, we discuss challenges in performance visualization. Many of the challenges highlight a need for close and continued collaborations with domain experts. The challenges of parallel scale, system complexity, and attribution require expert input to craft useful and informative visualizations. Experts in HPC are not strangers to issues of data scale, and it is highly beneficial to harness this experience in an effort to create scalable and useful visualizations. Finally, the integration of new visualizations with data collection or performance workflows necessitates sustained partnerships between communities.

9.1 Scale

As the scale of computing resources continues to grow exponentially, and along with it the scale of the collected performance data, it is becoming increasingly critical to create highly scalable performance visualizations. There are two major scale challenges facing performance visualization: parallel scale and data scale. Parallel scale refers to the number of elements required by the context that the visualization is attempting to represent simultaneously. This includes nodes, cores, and memory addresses in the hardware context and tasks, processes, threads, and jobs in the tasks context. Data scale refers to the amount of data collected that need not necessarily be displayed all at once but must be processed by the visualization. There are several ways to think about data scale: file size, execution time, and total number of samples or events. Figure 1 shows the reported parallel and data scales of the most recent performance visualizations.

As Figure 1 shows, few of the visualization methods cited demonstrated an ability to handle tens of thousands of simultaneous tasks, and some that do only do so for statistical plots, not for more sophisticated views. Others simply average across pixels, which may hide the insights users seek. At the same time, requiring users to pan extensively within detailed visualizations is not reasonable. While some tools may scale, the utility of the visualization does not. Creating sweet spots between full aggregation and largely unprocessed detail, so that the necessary contexts are still shown, remains a challenge.
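The risk of averaging across pixels can be seen in a small numerical sketch. Here 16,000 per-task values are reduced to 100 pixel rows, once by mean and once by maximum; a single slow task nearly vanishes under the mean but survives under the maximum. The data and the function bin_per_pixel are hypothetical and do not describe how any surveyed tool aggregates.

```python
def bin_per_pixel(values, num_pixels, reduce):
    """Reduce one value per task to one value per pixel row.

    Assumes `values` is non-empty. Tasks are split into contiguous,
    nearly equal chunks, one chunk per pixel.
    """
    n = len(values)
    bins = []
    for pixel in range(num_pixels):
        lo = pixel * n // num_pixels
        # Guarantee at least one task per pixel when n < num_pixels.
        hi = max(lo + 1, (pixel + 1) * n // num_pixels)
        bins.append(reduce(values[lo:hi]))
    return bins


def mean(chunk):
    return sum(chunk) / len(chunk)


# Hypothetical wait times: every task waits 1.0 except one outlier.
wait_times = [1.0] * 16000
wait_times[12345] = 500.0

mean_bins = bin_per_pixel(wait_times, 100, mean)
max_bins = bin_per_pixel(wait_times, 100, max)
```

With 160 tasks per pixel, the outlier raises its pixel's mean only to roughly 4, while the maximum preserves the value of 500. Which reduction is appropriate depends on whether the analyst is looking for typical behavior or for outliers.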

As the size of acquired performance data increases, it has become necessary to scale not only the visualization but also the underlying data. Though this problem has often been neglected by the visualization community, tools developed within the performance community have begun to address the issue. HPCToolkit [TMCF*11] maintains interactivity of its views by sampling the data rather than reprocessing all of it during panning and zooming. It reduces the size of data during collection through sampling as well. Vampir [ISC*12] can utilize the same systems it is meant to analyze, handling terabytes of data via the parallel filesystem and an allotment of processors. It can also employ data reduction techniques during collection. As the data size increases, it may not be feasible to save the entirety of the collected data, so integrating more approaches like the parallel system usage of Vampir or the sampling-based functionality of HPCToolkit is crucial. The greater use of sampling and the effects of overhead and clock synchronization necessitate more techniques for handling uncertainty.

9.2 System Complexity

Many of the continuing challenges in performance visualization are the product of the ever-evolving technology in high performance computing systems. Network topologies are constantly changing and increasing in dimensionality to improve parallelism and efficiency, and as a result, existing techniques may quickly become obsolete. Previous networks had natural embeddings into 2- or 3-dimensional space, but this is no longer the case for the largest systems. New layouts are needed that can leverage the inherent structure of these networks and improve developers’ understanding of them. Furthermore, as hardware developers expose new ways to capture different performance events, visualizations must adapt to fully utilize the new and changing performance data and contexts.

9.3 Ensemble Runs

Analysts often must make comparisons among different executions of the same application to determine the most likely causes of performance differences or to validate the performance benefits of changes in algorithms or parameters. Few visualizations we surveyed had support for handling ensemble datasets. Those that did were limited to, or demonstrated, only a few at a time [BW12, WYH10, TMCF*11]. Instead, users generally compare two executions by examining visualizations of each one individually. This is an area where visualization can help reduce the cognitive load on the analyst. Showing differences can be tricky: even in the two-run case, there are issues of normalizing metrics and resolving multiple corresponding entities.
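A small sketch illustrates why normalization matters in the two-run case. Below, two flat profiles are compared by each function's share of total time, with entities matched by name and missing entities treated as zero. Both the matching rule and the share-of-total normalization are assumptions chosen for illustration; other choices (for example, comparing absolute times) can lead to different conclusions.

```python
def compare_profiles(run_a, run_b):
    """Compare two {function: seconds} profiles by share of total time.

    Returns rows (name, share_a, share_b, share_b - share_a), sorted by
    the magnitude of the change. Functions are matched by name.
    """
    total_a = sum(run_a.values())
    total_b = sum(run_b.values())
    rows = []
    for name in sorted(set(run_a) | set(run_b)):
        share_a = run_a.get(name, 0.0) / total_a
        share_b = run_b.get(name, 0.0) / total_b
        rows.append((name, share_a, share_b, share_b - share_a))
    rows.sort(key=lambda row: abs(row[3]), reverse=True)
    return rows


# Hypothetical profiles: the second run adds checkpointing and slower I/O.
run_a = {"solve": 60.0, "io": 30.0, "init": 10.0}
run_b = {"solve": 60.0, "io": 90.0, "init": 10.0, "checkpoint": 40.0}
rows = compare_profiles(run_a, run_b)
```

In this example solve shows the largest change in share even though its absolute time is identical in both runs, because the second run's total doubled. A difference view therefore needs to make its normalization explicit.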

9.4 Coordination

The current state of performance visualization software is scattered among a variety of tools and techniques serving different purposes. As different techniques are able to accomplish different subsets of the tasks delineated in Section 3, and no individual technique accomplishes all of them, this chaotic state is largely unavoidable. While we presented the relevant research in four context categories, we have observed that the distinction between contexts is not always well-defined, and as such, visualizations need not be constrained to any single one. Many of the studies we surveyed have validated the usefulness of combining multiple techniques, whether closely tied together in the form of linked views [VMa13] or simply applying different techniques to the same target program [KZLK06]. One of the main challenges in performance visualization, therefore, is the development of improved integrations of multiple views and performance data in intuitive ways.

9.5 Attribution

While complex visualizations can elucidate novel or interesting performance data, it is important to keep in mind that the visualization has to aid the developer in accomplishing some set of performance goals. Many of the examples in Figure 1 reflect case studies that rely heavily on expertise or insight from the user. Most solutions that handle attribution directly do so at the function call or line-of-code level. The area of performance goals targeted least by the surveyed papers was attributing performance problems to semantically high-level reasons and determining possible avenues of improving the code.

9.6 Evaluation

When dealing with more complex systems and programs, especially in the high-performance computing field, the number of domain experts capable of participating in user studies is small. Therefore, full-fledged usability evaluations or controlled experiments with large numbers of participants are rare (one example is [SSMG13]). However, variations of expert evaluations can be performed [TM05], as was done by De Pauw and Heisig [DPH10]. These require a small number of domain and visualization experts, making them more feasible to conduct. The surveyed papers have in general not studied the usability of performance visualization methods and interfaces. Expert evaluations with the inclusion of usability would further help to fill this gap in knowledge.

10 Conclusions

Performance visualization is a growing field which continues to adapt to the growing ecosystem of high performance computing. As supercomputers become more powerful, increasing effort is required to understand how different software runs on such machines and to optimize its performance. The rising complexity of systems and of the performance data collected on them invites the utilization of visualization and analysis tools. Largely driven by necessity, performance visualization presents new and challenging research questions, many of which remain to be answered.

We have presented a survey of existing approaches in performance visualization. The current work has been organized based on the primary contexts in which the data has been visualized. Moreover, we have presented and categorized the goals that domain experts seek to address through visualization. Finally, we have discussed the existing challenges in this domain. This survey should act as an introduction to the state of the art for information visualization experts seeking to apply their knowledge to new domains. It also may aid HPC professionals in exploring new tools and methods to analyze their data.

11 Acknowledgements

We thank the participants of the Dagstuhl Perspectives Workshop 14022 “Connecting Performance Analysis and Visualization to Advance Extreme Scale Computing” for their constructive discussions that inspired parts of this paper.

This work was supported in part by the University of California Laboratory Fees Research Grant Program and the Department of Energy Office of Science Graduate Fellowship Program (DOE SCGF), administered by ORISE-ORAU under contract no. DE-AC05-06OR23100. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-CONF-652873.

References

[ABF*10] ADHIANTO L., BANERJEE S., FAGAN M., KRENTEL M., MARIN G., MELLOR-CRUMMEY J., TALLENT N. R.: HPCToolkit: Tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience 22, 6 (2010), 685–701. doi:10.1002/cpe.1553.

[ACS90] ALPERN B CARTER L SELKER T Visualizing com-puter memory architectures In Proceedings of the 1st Con-ference on Visualization rsquo90 (Los Alamitos CA USA 1990)VIS rsquo90 IEEE Computer Society Press pp 107ndash113 doi101109VISUAL1990146371 7

[AdSLlowast09] AHN D H DE SUPINSKI B R LAGUNA I LEEG L LIBLIT B MILLER B P SCHULZ M Scalable tempo-ral order analysis for large scale debugging In Proceedings of theConference on High Performance Computing Networking Stor-age and Analysis (New York NY USA 2009) SC rsquo09 ACMpp 441ndash4411 doi10114516540591654104 9

[AH10] ADAMOLI A HAUSWIRTH M Trevis A context treevisualization amp analysis framework and its use for classifyingperformance failure reports In Proceedings of the 5th Inter-national Symposium on Software Visualization (New York NYUSA 2010) SOFTVIS rsquo10 ACM pp 73ndash82 doi10114518792111879224 9

[AKGlowast10] AFTANDILIAN E E KELLEY S GRAMAZIO CRICCI N SU S L GUYER S Z Heapviz Interactive heapvisualization for program understanding and debugging In Pro-ceedings of the 5th International Symposium on Software Visu-alization (New York NY USA 2010) SOFTVIS rsquo10 ACMpp 53ndash62 doi10114518792111879222 8

[BB92] BEMMERL T BRAUN P Visualization of messagepassing parallel programs In Parallel Processing CON-PAR 92-VAPP V Bougeacute L Cosnard M Robert Y TrystramD (Eds) vol 634 of Lecture Notes in Computer ScienceSpringer Berlin Heidelberg 1992 pp 79ndash90 doi1010073-540-55895-0_400 11

[BBH08] BERNARDIN T BUDGE B C HAMANN B Stacked-widget visualization of scheduling-based algorithms In Pro-ceedings of the 4th ACM Symposium on Software Visualization(New York NY USA 2008) SoftVis rsquo08 ACM pp 165ndash174doi10114514097201409746 8

[BCOMlowast10] BROQUEDIS F CLET-ORTEGA J MOREAUD SFURMENTO N GOGLIN B MERCIER G THIBAULT SNAMYST R hwloc A generic framework for managing hard-ware affinities in hpc applications In Parallel Distributed andNetwork-Based Processing (PDP) 2010 18th Euromicro Inter-national Conference on (Feb 2010) pp 180ndash186 doi101109PDP201067 7

[BD01] BOSCH R DEPT S U C S Using visualization to un-derstand the behavior of computer systems Stanford University2001 6 7 14

[BGI*12] BHATELE A., GAMBLIN T., ISAACS K. E., GUNNEY B. T. N., SCHULZ M., BREMER P.-T., HAMANN B.: Novel views of performance data to analyze large-scale adaptive applications. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (Los Alamitos, CA, USA, 2012), SC ’12, IEEE Computer Society Press, pp. 31:1–31:11. doi:10.1109/SC.2012.80. 12

[BKS05] BLOCHINGER W KAUFMANN M SIEBENHALLERM Visualizing structural properties of irregular parallel compu-tations In Proceedings of the 2005 ACM Symposium on SoftwareVisualization (New York NY USA 2005) SoftVis rsquo05 ACMpp 125ndash134 doi10114510560181056036 11 12

[BM11] BERNAT A. R., MILLER B. P.: Anywhere, any-time binary instrumentation. In Proceedings of the 10th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools (New York, NY, USA, 2011), PASTE ’11, ACM, pp. 9–16. doi:10.1145/2024569.2024572. 1, 2

[BW12] BRUNST H WEBER M Custom hot spot analysis ofhpc software with the vampir performance tool suite In Par-allel Tools Workshop (2012) pp 95ndash114 doi101007978-3-642-37349-7_7 14

[CFAlowast06] CHEADLE A M FIELD A J AYRES J W DUNNN HAYDEN R A NYSTROM-PERSSON J Visualising dy-namic memory allocators In Proceedings of the 5th Inter-national Symposium on Memory Management (New York NYUSA 2006) ISMM rsquo06 ACM pp 115ndash125 doi10114511339561133972 6

[CFHlowast93] CUNY J FORMAN G HOUGH A KUNDU J LINC SNYDER L STEMPLE D The Ariadne debugger Scal-able application of event-based abstraction In Proceedings ofthe 1993 ACMONR Workshop on Parallel and Distributed De-bugging (New York NY USA 1993) PADD rsquo93 ACM pp 85ndash95 doi101145174266174276 11

[CHK92] CUNY J HOUGH A KUNDU J Logical time in visu-alizations produced by parallel programs In Visualization 1992Visualization rsquo92 Proceedings IEEE Conference on (1992)pp 186ndash193 doi101109VISUAL1992235209 10

[CHZlowast07] CORNELISSEN B HOLTEN D ZAIDMAN A MOO-NEN L VAN WIJK J J VAN DEURSEN A Understand-ing execution traces using massive sequence and circular bun-dle views In Proceedings of the 15th IEEE International Con-ference on Program Comprehension (Washington DC USA2007) ICPC rsquo07 IEEE Computer Society pp 49ndash58 doi101109ICPC200739 8 9

[CPP08] CHOUDHURY A I POTTER K C PARKER S G In-teractive visualization for memory reference traces ComputerGraphics Forum 27 3 (May 2008) 815ndash822 doi101111j1467-8659200801212x 7 14

[CR11] CHOUDHURY A N M I ROSEN P Abstract visu-alization of runtime memory behavior In Proc 6th IEEE IntWorkshop on Visualizing Software for Understanding and Anal-ysis (2011) doi101109VISSOF20116069452 78

[DC07] DRONGOWSKI P J CENTER B D Instruction-basedsampling A new performance analysis technique for amd family10h processors Advanced Micro Devices Inc (2007) 3

[DHJ07] DEROSE L HOMER B JOHNSON D Detecting ap-plication load imbalance on high end massively parallel sys-tems In Euro-Par (2007) Kermarrec A-M Bougeacute L Priol T(Eds) vol 4641 of Lecture Notes in Computer Science Springerpp 150ndash159 doi101007978-3-540-74466-5_179

[dKSB00] DE KERGOMMEAUX J C STEIN B BERNARD PPajeacute an interactive visualization tool for tuning multi-threadedparallel applications Parallel Computing 26 10 (2000) 1253 ndash1274 doi101016S0167-8191(00)00010-7 6 1112



[DPA09] DE PAUW W ANDRADE H Visualizing large-scalestreaming applications Information Visualization 8 2 (2009)87ndash106 doi101057ivs20095 11

[DPH10] DE PAUW W HEISIG S Zinsight A visual and an-alytic environment for exploring large event traces In Proceed-ings of the 5th International Symposium on Software Visualiza-tion (New York NY USA 2010) SOFTVIS rsquo10 ACM pp 143ndash152 doi10114518792111879233 8 9 10 12 14

[DPWB13] DE PAUW W WOLF J L BALMIN A Visualiz-ing jobs with shared resources in distributed environments InVISSOFT (2013) Telea A Kerren A Marcus A (Eds) IEEEpp 1ndash10 doi101109VISSOFT20136650535 1011 12

[EL96] EICK S G LUCAS P J Displaying tracefiles Softw Pract Exper 26 4 (Apr 1996) 399ndash409 doi101002(SICI)1097-024X(199604)264lt399AID-SPE8gt30CO2-J 10 12

[ES92] EICK S G STEFFEN J L Visualizing code profil-ing line oriented statistics In Proceedings of the 3rd Con-ference on Visualization rsquo92 (Los Alamitos CA USA 1992)VIS rsquo92 IEEE Computer Society Press pp 210ndash217 doi101109VISUAL1992235206 9

[ET03] ELMQVIST N TSIGAS P Growing squares Animatedvisualization of causal relations In Proceedings of the 2003ACM Symposium on Software Visualization (New York NYUSA 2003) SoftVis rsquo03 ACM pp 17ndashff doi101145774833774836 12

[FAJ91] FRANCIONI J M ALBRIGHT L JACKSON J A De-bugging parallel programs using sound SIGPLAN Not 26 12(Dec 1991) 68ndash75 doi101145127695122765 12

[FB89] FOWLER R BELLA I The programmerrsquos guide to Movi-ola An interactive execution history browser Tech rep DTICDocument 1989 11 12

[FT05] FRISHMAN Y TAL A Visualization of mobile objectenvironments In Proceedings of the 2005 ACM Symposium onSoftware Visualization (New York NY USA 2005) SoftVis rsquo05ACM pp 145ndash154 doi10114510560181056038 12

[GEKlowast95] GU W EISENHAUER G KRAEMER E SCHWANK STASKO J VETTER J MALLAVARUPU N Falcon on-line monitoring and steering of large-scale parallel programsIn Frontiers of Massively Parallel Computation 1995 Proceed-ings Frontiers rsquo95 Fifth Symposium on the (1995) pp 422ndash429doi101109FMPC1995380483 12

[GKM04] GRAHAM S. L., KESSLER P. B., MCKUSICK M. K.: gprof: A call graph execution profiler. SIGPLAN Not. 39, 4 (Apr. 2004), 49–57. doi:10.1145/989393.989401. 1, 2

[GN10] GEORGE B NAGPAL P Optimizing Parallel Applica-tions Using Concurrency Visualizer A Case Study Tech rep2010 12

[GT89] GRISWOLD R TOWNSEND G The Visualization ofDynamic Memory Management in the Icon Programming Lan-guage Tech Rep TR 89-30 Department of Computer ScienceUniversity of Arizona December 1989 6 7

[GWWlowast10] GEIMER M WOLF F WYLIE B J N AacuteBRAHAacuteME BECKER D MOHR B The Scalasca performance toolsetarchitecture Concurr Comput Pract Exper 22 6 (Apr 2010)702ndash719 doi101002cpev226 14

[HC88] HOUGH A A CUNY J E Initial experiences with apattern-oriented parallel debugger In Proceedings of the 1988ACM SIGPLAN and SIGOPS Workshop on Parallel and Dis-tributed Debugging (New York NY USA 1988) PADD rsquo88ACM pp 195ndash205 doi1011456821069234 11

[HCR01] HAYNES R CROSSNO P RUSSELL E A visual-ization tool for analyzing cluster performance data In ClusterComputing 2001 Proceedings 2001 IEEE International Con-ference on (2001) pp 295ndash302 doi101109CLUSTR2001959990 6

[HE91] HEATH M ETHERIDGE J Visualizing the performanceof parallel programs Software IEEE 8 5 (1991) 29ndash39 doi1011095284214 4 6 11 12

[ILGlowast12] ISAACS K E LANDGE A G GAMBLIN T BRE-MER P-T PASCUCCI V HAMANN B Abstract Explor-ing performance data with Boxfish In Proceedings of the2012 SC Companion High Performance Computing Network-ing Storage and Analysis (Washington DC USA 2012) SCCrsquo12 IEEE Computer Society pp 1380ndash1381 doi101109SCCompanion2012202 4 6

[Int07] INTEL Intel 64 and IA-32 Architectures Software Devel-operrsquos Manual - Volume 3B Intel Corporation August 2007 3

[ISClowast12] ILSCHE T SCHUCHART J COPE J KIMPE DJONES T KNUumlPFER A ISKRA K ROSS R NAGEL W EPOOLE S Enabling event tracing at leadership-class scalethrough io forwarding middleware In Proceedings of the21st International Symposium on High-Performance Parallel andDistributed Computing (New York NY USA 2012) HPDC rsquo12ACM pp 49ndash60 doi10114522870762287085 13

[JSB97] JERDING D F STASKO J T BALL T Visualiz-ing interactions in program executions In Proceedings of the19th International Conference on Software Engineering (NewYork NY USA 1997) ICSE rsquo97 ACM pp 360ndash370 doi101145253228253356 8 9

[KG96] KOHL J A GEIST G The pvm 34 tracing facility andxpvm 11 In System Sciences 1996 Proceedings of the Twenty-Ninth Hawaii International Conference on (1996) vol 1 IEEEpp 290ndash299 doi101109HICSS1996495474 11 12

[KLJ07] KIM Y-J LIM J-S JUN Y-K Scalable threadvisualization for debugging data races in openmp programsIn Proceedings of the 2nd International Conference on Ad-vances in Grid and Pervasive Computing (Berlin Heidelberg2007) GPCrsquo07 Springer-Verlag pp 310ndash321 doi101007978-3-540-72360-8_27 6 12

[KMLM97] KARAVANIC K L MYLLYMAKI J LIVNY MMILLER B P Integrated visualization of parallel program per-formance data Parallel Comput 23 1-2 (Apr 1997) 181ndash198doi101016S0167-8191(96)00104-4 12

[KS93] KRAEMER E STASKO J T The visualization of paral-lel systems An overview J Parallel Distrib Comput 18 2 (June1993) 105ndash117 doi101006jpdc19931050 1

[KS98] KRAEMER E STASKO J T Creating an accurate por-trayal of concurrent executions IEEE Concurrency 6 1 (Jan1998) 36ndash46 doi1011094434656778 10 11 12

[KTD13] KARRAN B., TRÜMPER J., DÖLLNER J.: SyncTrace: Visual thread-interplay analysis. In Proceedings (electronic) of the 1st Working Conference on Software Visualization (VISSOFT) (2013), IEEE Computer Society, p. 10. doi:10.1109/VISSOFT.2013.6650534. 8, 11, 12

[KTM97] KOIKE H TAKADA T MASUI T Visualinda aframework for visualizing parallel linda programs In Visual Lan-guages 1997 Proceedings 1997 IEEE Symposium on (1997)pp 174ndash178 doi101109VL1997626578 11 12

[KZLK06] KALE L. V., ZHENG G., LEE C. W., KUMAR S.: Scaling applications to massively parallel machines using Projections performance analysis tool. In Future Generation Computer Systems Special Issue on: Large-Scale System Performance Modeling and Analysis (February 2006), vol. 22, pp. 347–358. doi:10.1016/j.future.2004.11.020. 11, 12, 14

[Lam78] LAMPORT L Time clocks and the ordering of eventsin a distributed system Commun ACM 21 7 (July 1978) 558ndash565 doi101145359545359563 10

[LDBlowast99] LIAO S-W DIWAN A BOSCH JR R P GHU-LOUM A LAM M S Suif explorer An interactive and in-terprocedural parallelizer In Proceedings of the Seventh ACMSIGPLAN Symposium on Principles and Practice of ParallelProgramming (New York NY USA 1999) PPoPP rsquo99 ACMpp 37ndash48 doi101145301104301108 9

[LLBlowast12] LANDGE A LEVINE J BHATELE A ISAACS KGAMBLIN T SCHULZ M LANGER S BREMER P-T PAS-CUCCI V Visualizing network traffic to understand the perfor-mance of massively parallel simulations Visualization and Com-puter Graphics IEEE Transactions on 18 12 (2012) 2467ndash2476doi101109TVCG2012286 4 6

[LMC13] LIU X MELLOR-CRUMMEY J A data-centric pro-filer for parallel programs In Proceedings of the InternationalConference on High Performance Computing Networking Stor-age and Analysis (New York NY USA 2013) SC rsquo13 ACMpp 281ndash2812 doi10114525032102503297 14

[LMCF90] LEBLANC T J MELLOR-CRUMMEY J MFOWLER R J Analyzing parallel program executions usingmultiple views J Parallel Distrib Comput 9 2 (June 1990)203ndash217 doi1010160743-7315(90)90046-R 1112

[LMK08] LEE C W MENDES C KALEacute L V Towards Scal-able Performance Analysis and Visualization through Data Re-duction In 13th International Workshop on High-Level Paral-lel Programming Models and Supportive Environments (MiamiFlorida USA April 2008) 14

[LSVlowast89] LEHR T SEGALL Z VRSALOVIC D F CAPLANE CHUNG A L FINEMAN C E Visualizing performancedebugging Computer 22 10 (Oct 1989) 38ndash51 doi101109242013 10 12

[LTOB10] LIN S TAIumlANI F ORMEROD T C BALL L JTowards anomaly comprehension Using structural compressionto navigate profiling call-trees In Proceedings of the 5th In-ternational Symposium on Software Visualization (New YorkNY USA 2010) SOFTVIS rsquo10 ACM pp 103ndash112 doi10114518792111879228 9

[MBDH99] MUCCI P. J., BROWNE S., DEANE C., HO G.: PAPI: A portable interface to hardware performance counters. In Proc. Department of Defense HPCMP User Group Conference (June 1999). 1, 2

[MGM09] MUELDER C GYGI F MA K-L Visual analysis ofinter-process communication for large-scale parallel computingIEEE Transactions on Visualization and Computer Graphics 156 (2009) 1129ndash1136 doi101109TVCG2009196 1012

[MHJ91] MALONY A HAMMERSLAG D JABLONOWSKI DTraceview a trace visualization tool Software IEEE 8 5 (1991)19ndash28 doi1011095284213 8

[MMC02] MALETIC J MARCUS A COLLARD M A task ori-ented view of software visualization In Visualizing Softwarefor Understanding and Analysis 2002 Proceedings First In-ternational Workshop on (2002) pp 32ndash40 doi101109VISSOF20021019792 1

[MSM*11] MUELDER C., SIGOVAN C., MA K.-L., COPE J., LANG S., ISKRA K., BECKMAN P., ROSS R.: Visual analysis of I/O system behavior for high-end computing. In Proceedings of the Third International Workshop on Large-scale System and Application Performance (New York, NY, USA, 2011), LSAP ’11, ACM, pp. 19–26. doi:10.1145/1996029.1996036. 6, 8

[MT07] MORETA S TELEA A Visualizing dynamic mem-ory allocations In Visualizing Software for Understanding andAnalysis 2007 VISSOFT 2007 4th IEEE International Work-shop on (2007) pp 31ndash38 doi101109VISSOF20074290697 6 7

[MTSM03] MU T TAO J SCHULZ M MCKEE S A In-teractive locality optimization on numa architectures In Pro-ceedings of the 2003 ACM Symposium on Software Visualiza-tion (New York NY USA 2003) SoftVis rsquo03 ACM pp 133ndashffdoi101145774833774853 7

[MW03] MOHR B., WOLF F.: KOJAK - a tool set for automatic performance analysis of parallel programs. In Euro-Par 2003 Parallel Processing, Kosch H., Böszörményi L., Hellwagner H. (Eds.), vol. 2790 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2003, pp. 1301–1304. doi:10.1007/978-3-540-45209-6_177. 9

[NAW*96] NAGEL W. E., ARNOLD A., WEBER M., HOPPE H. C., SOLCHENBACH K.: VAMPIR: Visualization and analysis of MPI resources. Supercomputer 12, 1 (1996), 69–80. 2, 10, 12

[NS07] NETHERCOTE N SEWARD J Valgrind A frameworkfor heavyweight dynamic binary instrumentation In Proceedingsof the 2007 ACM SIGPLAN Conference on Programming Lan-guage Design and Implementation (New York NY USA 2007)PLDI rsquo07 ACM pp 89ndash100 doi10114512507341250746 1

[PLCG95] PILLET V LABARTA J CORTES T GIRONA SParaver A tool to visualize and analyze parallel code In Pro-ceedings of WoTUG-18 Transputer and occam Developments(1995) vol 44 pp 17ndash31 11 12

[Rei90] REILLY M Presentation tools for performance visu-alization the m31 instrumentation experience In System Sci-ences 1990 Proceedings of the Twenty-Third Annual HawaiiInternational Conference on (1990) vol i pp 307ndash313 vol1doi101109HICSS1990205129 9 10 12

[Rei05] REINDERS J.: VTune Performance Analyzer Essentials: Measurement and Tuning Techniques for Software Developers. Engineer to Engineer Series. Intel Press, 2005. 1, 2, 6

[Ros13] ROSEN P A visual approach to investigating shared andglobal memory behavior of CUDA kernels In Computer Graph-ics Forum (EuroVis) (2013) vol 32 doi101111cgf12103 7

[RR99] RENIERIS M., REISS S. P.: ALMOST: Exploring program traces. In 1999 Workshop on New Paradigms in Information Visualization and Manipulation (1999), ACM, pp. 70–77. doi:10.1145/331770.331788. 8

[RRAlowast93] REED D A ROTH P AYDT R A SHIELDS KTAVERA L NOE R SCHWARTZ B Scalable performanceanalysis The pablo performance analysis environment In Scal-able Parallel Libraries Conference 1993 Proceedings of the(1993) IEEE pp 104ndash113 doi101109SPLC1993365577 12

[RZ05] ROBERTS J ZILLES C Tracevis An execution tracevisualization tool In MoBS rsquo05 (2005) 8 9

[SG93] SARUKKAI S R GANNON D SIEVE A performancedebugging environment for parallel programs J Parallel DistribComput 18 2 (June 1993) 147ndash168 doi101006jpdc19931053 9 10 12



[Sha90] SHARMA S Real-time visualization of concurrent pro-cesses Springer 1990 doi1010073-540-53065-7_160 10 12

[SHN10] SCHNORR L M HUARD G NAVAUX P O TrivaInteractive 3d visualization for performance analysis of paral-lel applications Future Generation Computer Systems 26 3(2010) 348 ndash 358 doihttpdxdoiorg101016jfuture200910006 11 12

[SK93] STASKO J T KRAEMER E A methodology for build-ing application-specific visualizations of parallel programs JParallel Distrib Comput 18 2 (June 1993) 258ndash264 doi101006jpdc19931062 12

[SKV03] SCHAUBSCHLAumlGER C KRANZLMUumlLLER D VOLK-ERT J Event-based program analysis with DeWiz In Proceed-ings of the Fifth International Workshop on Automated Debug-ging AADEBUG2003 (2003) 11 12

[SLBlowast11] SCHULZ M LEVINE J BREMER P-T GAMBLINT PASCUCCI V Interpreting performance data across intu-itive domains In Parallel Processing (ICPP) 2011 InternationalConference on (2011) pp 206ndash215 doi101109ICPP201160 6 12 13

[SM06] SHENDE S., MALONY A. D.: The TAU parallel performance system. International Journal of High Performance Computing Applications 20, 2 (2006), 287–311. doi:10.1177/1094342006064482. 1, 2, 9

[SMLlowast12] SPEAR W MALONY A D LEE C W BIERS-DORFF S SHENDE S An approach to creating performancevisualizations in a parallel profile analysis tool In Proceedings ofthe 2011 International Conference on Parallel Processing - Vol-ume 2 (Berlin Heidelberg 2012) Euro-Par rsquo11 Springer-Verlagpp 156ndash165 doi101007978-3-642-29740-3_1912

[SMMlowast13a] SIGOVAN C MUELDER C MA K-L COPEJ ISKRA K ROSS R A visual network analysis methodfor large-scale parallel io systems Parallel and Dis-tributed Processing Symposium International 0 (2013) 308ndash319 doihttpdoiieeecomputersocietyorg101109IPDPS201396 14

[SMM13b] SIGOVAN C MUELDER C W MA K-L Visu-alizing large-scale parallel communication traces using a particleanimation technique Computer Graphics Forum 32 3pt2 (2013)141ndash150 doi101111cgf12101 11

[SRWS99] SHAFFER E REED D WHITMORE S SCHAEFFERB Virtue performance visualization of parallel and distributedapplications Computer 32 12 (1999) 44ndash51 doi1011092809250 11 12

[SSMG13] SAMBASIVAN R SHAFER I MAZUREK MGANGER G Visualizing request-flow comparison to aid perfor-mance diagnosis in distributed systems Visualization and Com-puter Graphics IEEE Transactions on 19 12 (2013) 2466ndash2475doi101109TVCG2013233 8 14

[Sto88] STONE J M A graphical representation of concur-rent processes In Proceedings of the 1988 ACM SIGPLANand SIGOPS Workshop on Parallel and Distributed Debugging(New York NY USA 1988) PADD rsquo88 ACM pp 226ndash235doi1011456821069237 12

[TBD10] TRUumlMPER J BOHNET J DOumlLLNER J Understand-ing complex multithreaded software systems by using trace vi-sualization In Proceedings of the 5th International Symposiumon Software Visualization (New York NY USA 2010) SOFT-VIS rsquo10 ACM pp 133ndash142 doi10114518792111879232 6 8 10 12

[TKS01] TAO J., KARL W., SCHULZ M.: Visualizing the memory access behavior of shared memory applications on NUMA architectures. In Proceedings of the 2001 International Conference on Computational Science (ICCS), volume 2074 of LNCS (2001), pp. 861–870. doi:10.1007/3-540-45718-6_91. 14

[TM05] TORY M MOLLER T Evaluating visualizations Doexpert reviews work IEEE Comput Graph Appl 25 5 (Sept2005) 8ndash11 doi101109MCG2005102 14

[TMCFlowast11] TALLENT N R MELLOR-CRUMMEY J FRANCOM LANDRUM R ADHIANTO L Scalable fine-grained callpath tracing In Proceedings of the International Conference onSupercomputing (New York NY USA 2011) ICS rsquo11 ACMpp 63ndash74 doi10114519958961995908 13 14

[TSS98] TOPOL B STASKO J T SUNDERAM V Pvanima tool for visualization in network computing environmentsConcurrency Practice and Experience 10 14 (1998) 1197ndash1222 doi101002(SICI)1096-9128(19981210)1014lt1197AID-CPE364gt30CO2-O 11 12

[VMa13] Manual - Vampir 82 httpwwwvampireu Novem-ber 2013 10 12 14

[WG11] WYLIE B J N GEIMER M Large-scale performanceanalysis of PFLOTRAN with Scalasca In Proc of the 53rd CrayUser Group meeting Fairbanks AK USA (May 2011) CrayUser Group Inc 9 12

[WK00] WANG Y KUNZ T Visualizing mobile agent exe-cutions In Proceedings of the Second International Workshopon Mobile Agents for Telecommunication Applications (Lon-don UK UK 2000) MATA rsquo00 Springer-Verlag pp 103ndash114doi1010073-540-45391-1_8 11 12

[WKT04] WEIDENDORFER J KOWARSCHIK M TRINITIS CA tool suite for simulation based analysis of memory access be-havior In Computational Science - ICCS 2004 Bubak M Al-bada G Sloot P Dongarra J (Eds) vol 3038 of Lecture Notesin Computer Science Springer Berlin Heidelberg 2004 pp 440ndash447 doi101007978-3-540-24688-6_58 9

[WMfAM04] WOLF F., MOHR B., FÜR ANGEWANDTE MATHEMATIK Z.: EPILOG Binary Trace-data Format. FZJ-ZAM, 2004. 2

[WT10] WHEELER K B THAIN D Visualizing massivelymultithreaded applications with threadscope Concurr Comput Pract Exper 22 1 (Jan 2010) 45ndash67 doi101002cpev221 11 12

[WWFlowast13] WALLER J WULF C FITTKAU F DOumlHRING PHASSELBRING W Synchrovis 3d visualization of monitor-ing traces in the city metaphor for analyzing concurrency In1st IEEE International Working Conference on Software Visual-ization (VISSOFT 2013) (September 2013) doi101109VISSOFT20136650520 9 11

[WYH10] WU Y YAP R H HALIM F Visualizing windowssystem traces In Proceedings of the 5th International Symposiumon Software Visualization (New York NY USA 2010) SOFT-VIS rsquo10 ACM pp 123ndash132 doi10114518792111879231 8 10 12 14

[YI03] YAMAGUCHI Y ITOH T Visualization of distributedprocesses using data jewelry box algorithm In ComputerGraphics International 2003 Proceedings (2003) pp 162ndash169doi101109CGI20031214461 12

[YSM95] YAN J SARUKKAI S MEHRA P Performancemeasurement visualization and modeling of parallel and dis-tributed programs using the aims toolkit Software Practiceand Experience 25 4 (1995) 429ndash461 doi101002spe4380250406 11 12



[ZLGS99] ZAKI O LUSK E GROPP W SWIDER D Towardscalable performance visualization with Jumpshot High Per-formance Computing Applications 13 2 (Fall 1999) 277ndash288doi101109ipdps20084536187 11 12

[ZR91] ZERNIK D RUDOLPH L Animating work and timefor debugging parallel programs foundation and experience InProceedings of the 1991 ACMONR Workshop on Parallel andDistributed Debugging (New York NY USA 1991) PADD rsquo91ACM pp 46ndash56 doi101145122759122763 12

[ZSC03] ZHOU C SUMMERS K L CAUDELL T P Graphvisualization for the analysis of the structure and dynamics ofextreme-scale supercomputers In Proceedings of the 2003 ACMSymposium on Software Visualization (New York NY USA2003) SoftVis rsquo03 ACM pp 143ndash149 doi101145774833774854 6

[ZSM92] ZERNIK D SNIR M MALKI D Using visualizationtools to understand concurrency Software IEEE 9 3 (1992)87ndash92 doi10110952136185 12

Biography

Katherine E. Isaacs is a third-year computer science Ph.D. student at the University of California, Davis, researching information visualization techniques for performance analysis. In 2012, she was awarded a Department of Energy Office of Science Graduate Fellowship (DOE SCGF). She completed a B.S. in computer science and a B.A. in mathematics at San José State University and a B.S. in physics at the California Institute of Technology.

Alfredo Giménez is a third-year Ph.D. student at the University of California at Davis. His research focus is in instrumentation and information visualization for performance analysis on HPC systems. He received a B.S. in Computer Science at the University of California at Davis in 2010 and worked for 2 years developing performance optimization tools for graphics hardware at Intel Corporation.

Ilir Jusufi is a Postdoctoral Scholar at the University of California, Davis. His research focuses on visualization of performance analysis data for HPC. He received a B.S. in Computer Science at the South East European University in Macedonia and an M.S. in Computer Science at Växjö University in Sweden. He earned his Ph.D. degree at Linnaeus University in Sweden, focusing on visualization and interaction techniques for multivariate networks.

Todd Gamblin is a computer scientist in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. His research focuses mainly on scalable algorithms for measuring, analyzing, and visualizing performance data from massively parallel applications. He is also interested in fault tolerance, resilience, MPI, and parallel programming models. Todd has been at LLNL since 2008.

Todd works closely with researchers in CASC and with staff in the Development Environment Group in Livermore Computing. He is the team leader for the Performance Analysis and Visualization at Exascale (PAVE) project, and he also works on the Exascale Computing Technologies LDRD project, the SciDAC Sustained Performance, Energy, and Resilience (SUPER) project, and many other ASC projects at LLNL.

Todd received the Ph.D. and M.S. degrees in Computer Science from the University of North Carolina at Chapel Hill in 2009 and 2005. He received his B.A. in Computer Science and Japanese from Williams College in 2002. He has also worked as a software developer in Tokyo and held graduate research internships at the University of Tokyo and IBM Research.

Abhinav Bhatele is a computer scientist in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. His interests lie in performance optimizations through analysis, visualization, and tuning, and in developing algorithms for high-end parallel systems. His thesis was on topology-aware task mapping and distributed load balancing for parallel applications.

Abhinav received a B.Tech. degree in Computer Science and Engineering from IIT Kanpur, India, in May 2005 and M.S. and Ph.D. degrees in Computer Science from the University of Illinois at Urbana-Champaign in 2007 and 2010, respectively. Abhinav was an ACM/IEEE-CS George Michael Memorial HPC Fellow in 2009. He has received several awards for his dissertation work, including the David J. Kuck Outstanding M.S. Thesis Award in 2009, a Distinguished Paper Award at Euro-Par 2009, and the David J. Kuck Outstanding Ph.D. Thesis Award in 2011. Recently, a paper that he co-authored with LLNL and external collaborators was selected for a best paper award at IPDPS in 2013.

Martin Schulz is a Computer Scientist at the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory (LLNL). He earned his Doctorate in Computer Science in 2001 from the Technische Universität München (Munich, Germany) and also holds a Master of Science in Computer Science from the University of Illinois at Urbana-Champaign. He has published over 150 peer-reviewed papers. He is the PI for the Office of Science X-Stack project Performance Insights for Programmers and Exascale Runtimes (PIPER) and for the ASC/CCE project on Open|SpeedShop. Further, he is the chair of the MPI Forum, the standardization body for the Message Passing Interface, and is involved in the DOE/Office of Science Exascale Projects CESAR, ExMatEx, and ARGO. Martin’s research interests include parallel and distributed architectures and applications; performance monitoring, modeling, and analysis; memory system optimization; parallel programming paradigms; tool support for parallel programming; power efficiency for parallel systems; optimizing parallel and distributed I/O; and fault tolerance at the application and system level. In his position at LLNL, he especially focuses on the issue of scalability for parallel applications, code correctness tools, and parallel performance analyzers, as well as scalable tool infrastructures to support these efforts.



Bernd Hamann is a professor of computer science at the University of California, Davis. He studied mathematics and computer science at the Technical University of Braunschweig, Germany, and received a Ph.D. in computer science from Arizona State University in 1991. His main teaching and research interests are data visualization, data analysis, and geometric modeling.

Peer-Timo Bremer is a member of technical staff and project leader at the Center for Applied Scientific Computing (CASC) at the Lawrence Livermore National Laboratory (LLNL) and Associate Director for Research at the Center for Extreme Data Management, Analysis, and Visualization at the University of Utah. His research interests include large-scale data analysis, performance analysis, and visualization, and he recently co-organized a Dagstuhl Perspectives workshop on integrating performance analysis and visualization. Prior to his tenure at CASC, he was a postdoctoral research associate at the University of Illinois, Urbana-Champaign. Peer-Timo earned a Ph.D. in Computer Science at the University of California, Davis, in 2004 and a Diploma in Mathematics and Computer Science from the Leibniz University in Hannover, Germany, in 2000. He is a member of the IEEE Computer Society and ACM.



2.1 Methods for Acquiring Performance Data

2.1.1 Instrumentation

Instrumentation is the act of modifying a program for an alternative purpose, in this case for acquiring performance data. At the most basic level, an instrumentation tool inserts extra code into a program’s control flow. The instrumentation code may record timer values, or it may perform more complex analysis, such as writing out program variable values and recording variable accesses and conditions met. Instrumentation can be applied to source code before compiling, or it can be applied at runtime using binary modification [BM11] or sampling [ABF*10]. Care must be taken to ensure that the instrumentation does not change normal program behavior or add excessive overhead.
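As a minimal sketch of source-level instrumentation, the following Python fragment wraps a function so that timer reads are inserted on entry and exit. The names `instrument`, `TIMINGS`, and `dot` are hypothetical and chosen only for illustration; none of the tools cited above is implied to work this way.

```python
import functools
import time

# Hypothetical in-memory store: function name -> list of elapsed seconds.
TIMINGS = {}


def instrument(func):
    """Wrap func so that every call records its elapsed wall-clock time."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()  # inserted code: read the timer on entry
        try:
            return func(*args, **kwargs)  # the program's original control flow
        finally:
            # Inserted code: read the timer on exit and store the difference.
            elapsed = time.perf_counter() - start
            TIMINGS.setdefault(func.__name__, []).append(elapsed)

    return wrapper


@instrument
def dot(xs, ys):
    """Example workload: a plain dot product."""
    return sum(x * y for x, y in zip(xs, ys))


result = dot([1, 2, 3], [4, 5, 6])
```

The wrapper returns the same value as the original function, so program behavior is unchanged, but each call pays for two timer reads and a dictionary update. That cost is the overhead the paragraph above warns about.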

2.1.2 Interception

Interception is a form of instrumentation that leverages function calls already present in program source code. Interceptor functions are typically grouped together into a library, which is then linked with a program either dynamically or statically. The program’s original function calls are linked to the interception library, which executes special measurement code, then delegates to the original implementation of the intercepted function. Interception is useful for profiling libraries because it can record the dynamic values of parameters passed to library calls. This can give more semantic context to measurements. For example, communication performance of many parallel programs is often measured by intercepting calls to the Message Passing Interface (MPI), and interceptor calls can differentiate between send and receive operations based on the size of data passed to them. As with other types of instrumentation, interception must be used sparingly to avoid incurring overhead.
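The measure-then-delegate pattern can be sketched in Python by rebinding a library's methods at runtime. This is only an analogue of link-level interception, and `FakeComm` is a hypothetical stand-in rather than a real MPI binding; `intercept`, `payload_size`, and `CALL_LOG` are likewise illustrative names.

```python
import functools
import time


class FakeComm:
    """Hypothetical communication library used only for this sketch."""

    def send(self, payload, dest):
        return len(payload)

    def recv(self, nbytes, source):
        return bytes(nbytes)


# Each entry is (operation name, message size in bytes, elapsed seconds).
CALL_LOG = []


def payload_size(arg):
    """Size of a message given either a byte buffer or a byte count."""
    return len(arg) if isinstance(arg, (bytes, bytearray)) else int(arg)


def intercept(cls, name):
    """Replace cls.name with an interceptor that measures, then delegates."""
    original = getattr(cls, name)

    @functools.wraps(original)
    def interceptor(self, first, *args, **kwargs):
        start = time.perf_counter()
        try:
            # Delegate to the original implementation, unchanged.
            return original(self, first, *args, **kwargs)
        finally:
            # Measurement code: record a dynamic parameter value and the time.
            elapsed = time.perf_counter() - start
            CALL_LOG.append((name, payload_size(first), elapsed))

    setattr(cls, name, interceptor)


for method in ("send", "recv"):
    intercept(FakeComm, method)

comm = FakeComm()
comm.send(b"hello", dest=1)
comm.recv(16, source=1)
```

The calling code is untouched: it still calls `send` and `recv`, and the log gains the operation name and message size as semantic context for each timing.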

2.1.3 Profiling and Tracing

Profiling and tracing are measurement techniques that determine where a single execution of a program spends its time. Profiling tools such as gprof [GKM04] and VTune [Rei05] pause execution of a program repeatedly over a specified sampling period and record the contents of either the instruction pointer or the entire call stack. At the end of a profiling run, samples are analyzed to determine the percentage of time spent in each part of the code. Profiles lose temporal information but quickly identify key bottlenecks in a program. Tracing is similar to profiling in that it measures a program's execution, but it records a detailed timeline of when events occurred. For example, a trace might record function entry and exit times for an entire run. Because tracing does not aggregate over time, recorded traces can require large amounts of memory, which can cause excessive overhead compared to profiling. TAU [SM06], Vampir [NAW*96], and EPILOG [WMfAM04] provide resources for creating and dumping traces of programs that use MPI.

Profiling trades off comprehensive data for low overhead, while tracing provides complete data on runtime events at the cost of much higher overhead.
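To make the difference in data concrete, the sketch below aggregates a small synthetic trace of function entry and exit events into a flat profile of exclusive time per function. The trace contents and the name `profile_from_trace` are assumptions made for illustration. The profiling tools cited above obtain their profiles by sampling, not by reducing a trace; the example shows only what aggregation discards.

```python
from collections import defaultdict

# A synthetic trace: (timestamp in ms, event kind, function name), in time order.
TRACE = [
    (0, "enter", "main"),
    (10, "enter", "solve"),
    (70, "exit", "solve"),
    (80, "enter", "write"),
    (95, "exit", "write"),
    (100, "exit", "main"),
]


def profile_from_trace(trace):
    """Aggregate a trace into exclusive time per function (a flat profile)."""
    exclusive = defaultdict(int)
    stack = []        # names of the currently active functions
    last_time = None  # timestamp of the previous event
    for timestamp, kind, name in trace:
        if stack:
            # Time since the previous event belongs to the innermost active function.
            exclusive[stack[-1]] += timestamp - last_time
        if kind == "enter":
            stack.append(name)
        else:
            stack.pop()
        last_time = timestamp
    return dict(exclusive)


profile = profile_from_trace(TRACE)
total = sum(profile.values())
percentages = {name: 100 * t / total for name, t in profile.items()}
```

The resulting profile shows that `solve` accounts for 60% of the run but no longer says when it ran or what preceded it. The trace keeps that ordering at the cost of one record per event.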

2.2 System Monitoring

The measurements discussed so far are application-level measurements in that they measure the performance of a single application process. We can also acquire system-wide performance information during a program's execution by executing external processes, enabling system-wide counters (described in Section 2.3.1), or gathering other metadata at runtime. Such methods require no modification to the code or executables involved and as such are simple to use. However, this kind of data is generally very coarse and semantically low-level. Further, because data is collected outside the measured application process, it can be difficult to attribute measurements to the target program's source code.
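A rough sketch of measuring from outside the target process is given below. It runs an unmodified command and reports wall-clock time and the CPU time charged to child processes. The function name `monitor` and the example workload are assumptions, and this covers only one narrow form of external measurement, not system-wide counters.

```python
import os
import subprocess
import sys
import time


def monitor(command):
    """Run command unmodified and report coarse, process-external measurements."""
    before = os.times()
    start = time.perf_counter()
    completed = subprocess.run(command, check=False)
    wall = time.perf_counter() - start
    after = os.times()
    return {
        "returncode": completed.returncode,
        "wall_seconds": wall,
        # CPU time of waited-for child processes; reported as 0 on Windows.
        "child_user_seconds": after.children_user - before.children_user,
        "child_system_seconds": after.children_system - before.children_system,
    }


# The target program here is an arbitrary example workload.
stats = monitor([sys.executable, "-c", "sum(i * i for i in range(200000))"])
```

The numbers describe the run as a whole. Nothing in them indicates which function or source line consumed the time, which illustrates the attribution difficulty noted above.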

2.3 Types of Performance Data

2.3.1 Counters

A counter is a special hardware register that accumulates the number of occurrences of a specified event over time. These can be either software events such as system calls, or hardware events such as floating-point operations, cache misses, and packets received over a network link. The complete set of countable events is specific to the platform being used but is generally quite extensive. Commonly, counters are either instrumented to initialize and terminate around a block of code designated for analysis, or run system-wide during program execution. PAPI [MBDH99] provides a portable interface to specify, initialize, terminate, and read out counters.
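The specify, initialize, terminate, and read-out cycle around a block of code can be sketched with a software stand-in. The `EventCounters` class, its methods, and the event names below are hypothetical and are not the PAPI API; real hardware counters are incremented by the processor, not by explicit `record` calls.

```python
from collections import Counter


class EventCounters:
    """Hypothetical software stand-in for a set of event counters."""

    def __init__(self, events):
        self.events = set(events)  # which events are being counted
        self.counts = Counter()
        self.running = False

    def start(self):
        self.counts.clear()
        self.running = True

    def stop(self):
        self.running = False

    def record(self, event, amount=1):
        # Called where the event occurs; ignored unless counting is active.
        if self.running and event in self.events:
            self.counts[event] += amount

    def read(self):
        return dict(self.counts)


counters = EventCounters(["flop", "cache_miss"])


def axpy(a, xs, ys):
    out = []
    for x, y in zip(xs, ys):
        out.append(a * x + y)
        counters.record("flop", 2)  # one multiply and one add
    return out


axpy(2.0, [1.0], [1.0])  # before start(): not counted
counters.start()
result = axpy(2.0, [1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
counters.stop()
axpy(2.0, [1.0], [1.0])  # after stop(): not counted
```

Only events between `start()` and `stop()` are accumulated, so the count can be attributed to the designated block rather than to the whole run.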

The overhead and precision of performance counter measurements depend on how frequently they are sampled. Sampling performance counters too frequently gives high overhead, which can limit precision and make attributing counted events to particular instructions difficult.
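The effect of sampling frequency on what a counter reveals can be illustrated with synthetic data. In the sketch below, `CUMULATIVE` is an invented series of cumulative counter values and `sample_counter` is a hypothetical helper; the number of reads serves only as a crude proxy for overhead.

```python
def sample_counter(cumulative, interval):
    """Read a cumulative counter every `interval` steps and return per-sample deltas."""
    reads = cumulative[::interval]
    if (len(cumulative) - 1) % interval != 0:
        reads = reads + [cumulative[-1]]  # always include the final value
    return [b - a for a, b in zip(reads, reads[1:])]


# Synthetic cumulative count of an event (e.g. cache misses) at time steps 0..8.
CUMULATIVE = [0, 0, 1, 9, 10, 10, 10, 12, 12]

fine = sample_counter(CUMULATIVE, 1)    # 9 reads of the counter
coarse = sample_counter(CUMULATIVE, 4)  # 3 reads of the counter
```

Both samplings account for the same total of 12 events. The fine sampling isolates the burst of 8 events in a single step but needs three times as many reads, while the coarse sampling can only place that burst somewhere in its first four-step window.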

Counter data most directly benefits visualizations in the hardware (Section 5) and software (Section 6) contexts. For example, in a network visualization, packet counters can be recorded per-link to visualize network traffic. Other hardware counters can be mapped directly to the resource (CPU, memory, etc.) responsible for generating the event. Counters measured within instrumentation can also be attributed indirectly to the instrumented code, giving software context.

2.3.2 Hardware Samples

Traditional accumulative hardware counters have been extended to provide more precise and detailed information about particular instructions. Instead of simply incrementing a counter, modern hardware performance units can write detailed information about an instruction's execution, including its precise instruction pointer, progress within the processor's pipeline, total latency, and more. Intel and AMD









3 Performance Goals














4 Taxonomy
















5.1 Network







[Table residue; recoverable labels: Software, System, Problem Detection, Program Structure, Resource Usage, Bottlenecks and Imbalance, Resource Misuse, plus the truncated fragments "malies" and "Global Compre"; cell marks (X, NR, NA) cannot be matched to rows or columns]


















5.2 Node

5.2.2 Memory Topology

















































7.1.1 Task Timelines






















































9 Challenges


9.1 Scale









9.3 Ensemble Runs

9.4 Coordination

9.5 Attribution

9.6 Evaluation


10 Conclusions



11 Acknowledgements






























































































































Biography






















