Parallel Performance Wizard: A Generalized Performance Analysis Tool

Hung-Hsun Su, Max Billingsley III, Seth Koehler, John Curreri, Alan D. George

PPW Overview

• Computationally intensive parallel applications are constantly being developed in many scientific fields using parallel programming paradigms such as:
  • Message-passing: MPI, etc.
  • Partitioned Global Address Space (PGAS): Unified Parallel C (UPC), SHMEM, Co-array Fortran (CAF), Titanium, etc.
  • Reconfigurable Computing (RC) systems and other non-traditional paradigms

• Performance optimization is often needed to minimize the application’s overall execution time

• Performance analysis tools are very useful in this process, but existing tools have limited programming paradigm support

Data Visualizations

Generalized Operation Types

Timeline visualization (through export to Jumpshot) of Synthetic Aperture Radar MPI application using PPW

Visualization representing time spent in N-Queens RC benchmark program

Data transfer visualization of Synthetic Aperture Radar MPI application

PGAS model-specific array distribution visualization of UPC NPB FT benchmark

Tree table visualization of N-Queens RC benchmark program

Automatic Bottleneck Detection

RC Application Performance Analysis

• Parallel Performance Wizard (PPW) was originally designed and developed to improve the much-needed performance tool support for PGAS programming models

• Global Address Space Performance (GASP) interface introduced (http://gasp.hcs.ufl.edu)

• Version 1.0 released in April 2007

• Latest PPW updates & extensions include
  • Redesigned framework to enable additional model/paradigm support with minimal effort
  • Automatic performance bottleneck detection
  • Enhanced Cray XT UPC support; HP UPC support coming very soon

• Version 1.1 available for download at http://ppw.hcs.ufl.edu

[Figure labels: Unoptimized Parallel Application; Optimized Parallel Application]

• Previous versions of PPW (as with other tools) were largely model-dependent

• Multiple versions of the same component (one per model) had to be developed in a very similar fashion

• However, constructs from different models behave very similarly to one another, and thus can be handled similarly by the tool

• Latest version of PPW takes advantage of a generalized operation type abstraction

• Model constructs are classified into one of the pre-defined operation types

• Components are categorized into model-dependent or model-independent parts

• With this redesign in place, new programming model support can be added to PPW in a relatively short time

• In most cases, adding new model support can be achieved by performing
  • Classification of model constructs
  • Implementation of instrumentation and bottleneck resolution units

• MPI support was added in a matter of months (as opposed to years)

[Figure: generalized operation types]

Data exchange: One-sided (put / get); Two-sided (send / receive)

Pair-wise sync.: Lock manipulation; Wait on remote (fence, join)

Group-wise sync.: Sub-group (barrier, collectives); Global (barrier, collectives)

Local processing: Work distribution (for-all); User functions & I/O operations
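
As a rough illustration of how model constructs could be classified into the operation types listed above, the following Python sketch uses a plain lookup table. The `OpType` enum, the `classify` helper, and the specific construct-to-type assignments are assumptions made for this example; they are not taken from PPW's source. The construct names themselves (`upc_memget`, `MPI_Send`, and so on) are ordinary UPC and MPI identifiers.

```python
from enum import Enum


class OpType(Enum):
    """Generalized operation types, following the poster's categories."""
    ONE_SIDED = "data exchange: one-sided (put / get)"
    TWO_SIDED = "data exchange: two-sided (send / receive)"
    LOCK = "pair-wise sync.: lock manipulation"
    WAIT_ON_REMOTE = "pair-wise sync.: wait on remote (fence, join)"
    SUBGROUP_SYNC = "group-wise sync.: sub-group (barrier, collectives)"
    GLOBAL_SYNC = "group-wise sync.: global (barrier, collectives)"
    WORK_DISTRIBUTION = "local processing: work distribution (for-all)"
    USER_OR_IO = "local processing: user functions & I/O operations"


# Illustrative classification only: this table is an assumption made for
# the example, not PPW's actual mapping.
CONSTRUCT_TO_OPTYPE = {
    ("UPC", "upc_memput"): OpType.ONE_SIDED,
    ("UPC", "upc_memget"): OpType.ONE_SIDED,
    ("UPC", "upc_lock"): OpType.LOCK,
    ("UPC", "upc_fence"): OpType.WAIT_ON_REMOTE,
    ("UPC", "upc_barrier"): OpType.GLOBAL_SYNC,
    ("UPC", "upc_forall"): OpType.WORK_DISTRIBUTION,
    ("MPI", "MPI_Send"): OpType.TWO_SIDED,
    ("MPI", "MPI_Recv"): OpType.TWO_SIDED,
    ("MPI", "MPI_Barrier"): OpType.GLOBAL_SYNC,
}


def classify(model: str, construct: str) -> OpType:
    """Return the operation type of a construct.

    Names missing from the table are treated as user functions.
    """
    return CONSTRUCT_TO_OPTYPE.get((model, construct), OpType.USER_OR_IO)
```

With a table like this, supporting another model would mostly mean adding rows; any code that consumes `OpType` values would not need to change.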

[Figure: PPW framework, organized into Instrumentation, Measurement, Analysis, and Presentation stages]

Model-independent components: Measurement Unit (MU), Instrumentation-Measurement Interface (IMI), Performance-Data Manager (PDM), Visualization Manager (VM), Bottleneck-Detection Unit (BDU), High-Level Analysis Unit (HAU), Data-Format Converter (DFC)

Model-dependent components: Event-Type Mapper (ETM), Instrumentation Unit (IU), Bottleneck-Resolution Unit (BRU)
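
The split between model-dependent and model-independent components can be sketched as a small plug-in design. In the Python sketch below, the class names echo the component names on the poster, but the interfaces, method names, and mapping tables are hypothetical and only meant to show the idea: the model-independent part sees operation types only, so a new model needs a new mapper and nothing else.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class EventTypeMapper(ABC):
    """Model-dependent part: translates a model's constructs into operation types."""

    @abstractmethod
    def op_type(self, construct: str) -> str:
        ...


class UpcMapper(EventTypeMapper):
    # Hypothetical, partial table for illustration.
    _TABLE = {"upc_memget": "one-sided", "upc_memput": "one-sided",
              "upc_barrier": "global sync"}

    def op_type(self, construct: str) -> str:
        return self._TABLE.get(construct, "user function")


class MpiMapper(EventTypeMapper):
    # Hypothetical, partial table for illustration.
    _TABLE = {"MPI_Send": "two-sided", "MPI_Recv": "two-sided",
              "MPI_Barrier": "global sync"}

    def op_type(self, construct: str) -> str:
        return self._TABLE.get(construct, "user function")


class PerformanceDataManager:
    """Model-independent part: only ever sees operation types."""

    def __init__(self, mapper: EventTypeMapper):
        self._mapper = mapper
        self._total_time = defaultdict(float)

    def record(self, construct: str, seconds: float) -> None:
        # Accumulate time under the construct's generalized operation type.
        self._total_time[self._mapper.op_type(construct)] += seconds

    def profile(self) -> dict:
        return dict(self._total_time)


pdm = PerformanceDataManager(MpiMapper())
pdm.record("MPI_Send", 0.25)
pdm.record("MPI_Recv", 0.5)
pdm.record("MPI_Barrier", 1.0)
mpi_profile = pdm.profile()
```

Swapping `MpiMapper()` for `UpcMapper()` is the only change needed to profile a UPC run in this sketch.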

• Automatic bottleneck detection is a desirable feature for a performance analysis tool because
  • Novice users often do not know where they should concentrate their efforts
  • Performance data generated by long-running or complex applications can be difficult to visualize and understand

• A new post-mortem bottleneck detection approach is currently being developed for PPW
  • Performs data filtering at various stages to minimize execution time
  • Detection mechanism is parallelizable (each node performs analysis semi-independently)
    • Potential speedup for large applications
    • Performance data from all nodes need not be merged
  • Operates using the generalized operation type abstraction
    • New operation type-specific detection mechanisms to identify known bottleneck classes
    • Potential to support analysis of multi-model applications (those that use two or more models)

[Figure: bottleneck detection flow. Stages: Baseline filtering, Deviation filtering, Cause analysis. Data: Profile data (local); Trace data (local, all); Trace data (remote, selective). Results: Potential bottlenecks; Bottlenecks & causes]
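
The poster names the filtering stages but does not define them, so the following Python sketch is only one possible reading. It assumes that baseline filtering compares each event against a per-operation-type baseline time, and that deviation filtering keeps events well above the mean for their type on the local node. The event layout, the `slack` and `k` thresholds, and all function names are assumptions made for the example.

```python
from statistics import mean, pstdev


def baseline_filter(events, baselines, slack=1.5):
    """Keep events that ran longer than `slack` times their type's baseline."""
    return [e for e in events
            if e["duration"] > slack * baselines[e["op_type"]]]


def deviation_filter(events, all_events, k=1.0):
    """Keep events more than k standard deviations above the mean of their type."""
    kept = []
    for e in events:
        same_type = [x["duration"] for x in all_events
                     if x["op_type"] == e["op_type"]]
        mu, sigma = mean(same_type), pstdev(same_type)
        if e["duration"] > mu + k * sigma:
            kept.append(e)
    return kept


def potential_bottlenecks(local_trace, baselines):
    """Run both filters on one node's local trace; no remote data is needed."""
    stage1 = baseline_filter(local_trace, baselines)
    return deviation_filter(stage1, local_trace)


local_trace = (
    [{"op_type": "one-sided", "duration": 1.0} for _ in range(4)]
    + [{"op_type": "one-sided", "duration": 10.0}]
    + [{"op_type": "global sync", "duration": 2.0} for _ in range(2)]
)
baselines = {"one-sided": 1.0, "global sync": 1.0}
suspects = potential_bottlenecks(local_trace, baselines)
```

Because each node filters its own trace first, only the surviving events would need selective remote trace data during cause analysis, which fits the poster's point that data from all nodes need not be merged.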

• Instrumentation and measurement of both CPUs and FPGAs, towards a unified performance tool for RC systems

• Automated instrumentation of hardware & software for ease-of-use

• Runtime storage & transfer of performance data for continued monitoring of performance

• Configurable profiling, tracing, and sampling in hardware to complement software data

• Low overhead (application can run at or near full-speed to improve accuracy of results)

• Visualization of performance data in tables, charts, and timeline views

• Allows for strategic instrumentation and measurement from hardware and software

• Enables a cohesive view of system performance in order to facilitate locating performance bottlenecks

• Provides useful information to aid the designer in fixing bottlenecks

[Figure: hardware measurement module. Labels: Triggers; Signal Analysis Module; Signals; Data; Profile Counters (0, 1, 2, ..., P - 1); Trace Data; Cycle Counter; Module Statistics; Module Control; Request; Perf. Data; Sample Control; signal, value, comp, trigger; Buffer; Block RAM; On-board Memory (DDR/QDR)]
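
To make the roles of the profile counters, cycle counter, and trace storage more concrete, here is a small software model in Python. It is not HDL and not PPW code: the class, its methods, and the `fifo_full` signal are hypothetical, and the model assumes a trigger is a comparison applied to one signal's value each cycle.

```python
class HardwareMeasurementModel:
    """Software model of profile counters, a cycle counter, and a trace buffer."""

    def __init__(self, num_counters, trace_capacity):
        self.cycle = 0
        self.profile_counters = [0] * num_counters  # counters 0 .. P - 1
        self.trace = []                             # (cycle, counter id, value)
        self.trace_capacity = trace_capacity
        self.triggers = {}                          # counter id -> (signal, compare)

    def add_trigger(self, counter_id, signal, compare):
        """Attach a comparison on `signal` to one profile counter."""
        self.triggers[counter_id] = (signal, compare)

    def clock(self, signals):
        """Advance one cycle; `signals` maps signal names to current values."""
        for counter_id, (signal, compare) in self.triggers.items():
            value = signals[signal]
            if compare(value):
                self.profile_counters[counter_id] += 1      # profiling
                if len(self.trace) < self.trace_capacity:   # tracing
                    self.trace.append((self.cycle, counter_id, value))
        self.cycle += 1


hmm = HardwareMeasurementModel(num_counters=2, trace_capacity=4)
hmm.add_trigger(0, "fifo_full", lambda v: v == 1)
for level in [0, 1, 1, 0, 1]:
    hmm.clock({"fifo_full": level})
```

In this model, profiling costs one counter per trigger regardless of run length, while tracing is bounded by the buffer capacity, which is one reason a design might offer both.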

[Figure: instrumentation of an RC application. Software labels: User Application (HLL); Hardware Measurement Thread / Process; Lock; FPGA Access Methods (Wrapper); CPU(s). Hardware labels: FPGA(s); New top-level file; Original top-level file; User Application (HDL); Module; Submodule; Modified component interface; Measurement and Interface; Hardware Measurement Module (HMM); Data Transfer Module. Legend: Original RC Application; Additions by Instrumentation]
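
On the software side, wrapping the FPGA access methods is one way instrumentation can be added without editing the original application. The Python sketch below shows the general wrapper pattern with a decorator. The `fpga_write` function is a hypothetical stand-in for a vendor access routine, and the log format is an assumption for the example.

```python
import functools
import time

timing_log = []  # entries are (function name, elapsed seconds)


def instrumented(func):
    """Wrap an access call so that every invocation is timed."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Record the elapsed time even if the call raises.
            timing_log.append((func.__name__, time.perf_counter() - start))
    return wrapper


@instrumented
def fpga_write(buffer):
    # Hypothetical stand-in for a vendor FPGA access routine.
    return len(buffer)


bytes_written = fpga_write(b"abcd")
```

The application keeps calling `fpga_write` as before; only the wrapper knows that measurements are being taken.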

http://gasp.hcs.ufl.edu/

http://ppw.hcs.ufl.edu/
