April 29, 2003 Scalability 1: A Multi-Pronged Approach To Overcoming Scalability Barriers In Paradyn Philip C. Roth [email protected] Computer Sciences Department University of Wisconsin-Madison Madison, WI 53706 USA


Page 1: Scalability 1: A Multi-Pronged Approach To Overcoming

April 29, 2003

Scalability 1: A Multi-Pronged Approach To Overcoming Scalability Barriers In Paradyn

Philip C. Roth
[email protected]

Computer Sciences Department
University of Wisconsin-Madison
Madison, WI 53706
USA

Page 2: Scalability 1: A Multi-Pronged Approach To Overcoming

© Philip C. Roth 2003 -2- Paradyn Scalability

The HPC Situation Today

• Large parallel computing resources
  – Tightly coupled systems
    • Earth Simulator (Japan, 5120 CPUs)
    • HPCx (UK, 1280 CPUs)
    • ASCI Q (LANL, 4096 CPUs)
  – Clusters
    • Pink (LANL, 2048 CPUs)
    • MCR Linux Cluster (LLNL, 2304 CPUs)
    • Aspen Systems (Forecast Systems Lab, 1536 CPUs)
  – Grid
• Large applications
  – ASCI Blue Mountain job sizes (2001)
    • 512 CPUs: 17.8%
    • 1024 CPUs: 34.9%
    • 2048 CPUs: 19.9%

Page 3: Scalability 1: A Multi-Pronged Approach To Overcoming

Barriers to Large-Scale Performance Diagnosis

[Figure: Tool Front End; Tool Daemons d0, d1, d2, d3, ..., dP-4, dP-3, dP-2, dP-1; App Processes a0, a1, a2, a3, ..., aP-4, aP-3, aP-2, aP-1]

Centralized data collection and processing

• Computation cost
• Communication cost
• Storage cost

Page 4: Scalability 1: A Multi-Pronged Approach To Overcoming

Our Approach

1. MRNet: infrastructure for building scalable tools
2. SStart: strategy for improving tool start-up latency
3. Distributed Performance Consultant: strategy for efficiently finding performance bottlenecks in large-scale applications
4. Multithreaded Data Manager: for effective front-end performance data management on SMPs
5. Sub-Graph Folding Algorithm: algorithm for effectively presenting bottleneck search results for large-scale applications

Page 5: Scalability 1: A Multi-Pronged Approach To Overcoming

MRNet: Multicast/Reduction Network

• Software infrastructure for building scalable parallel performance and administration tools
• Scalable data aggregation reduces centralized data processing cost for tasks like:
  – Computing global performance measures (e.g., CPU utilization for function F across all processes)
  – Collecting application meta-data (e.g., names, addresses of functions in each application process)
• Scalable multicast: efficient delivery of control requests

→ Details in Dorian’s talk, coming next!
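To make the aggregation idea concrete, here is a minimal sketch of a tree-based reduction that computes a global average CPU utilization. It does not use MRNet's API; the nested-list tree, the `reduce_tree` name, and the sample values are all assumptions made for illustration.

```python
# Hypothetical sketch: not MRNet's API. A leaf is one process's CPU
# utilization (a float); an internal node is a list of child subtrees.

def reduce_tree(node):
    """Return (sum, count) for the subtree rooted at node."""
    if isinstance(node, (int, float)):
        return (float(node), 1)
    total, count = 0.0, 0
    for child in node:
        child_sum, child_count = reduce_tree(child)
        total += child_sum      # each internal node combines its children's
        count += child_count    # partial results before forwarding upward
    return (total, count)

# Four processes behind two internal nodes (illustrative values).
tree = [[0.5, 0.25], [0.75, 1.0]]
total, count = reduce_tree(tree)
global_average = total / count

# The root receives one partial result per child, not one per process.
messages_at_root = len(tree)
```

In this sketch the root handles two partial results instead of four raw samples; with a wider and deeper tree, the work at the root stays proportional to its fan-out rather than to the total number of processes.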

Page 6: Scalability 1: A Multi-Pronged Approach To Overcoming

Problem of Tool Start-Up Latency

• Some tools transfer lots of data at tool start-up
  – Debugger needs function names and addresses to set breakpoints by name
  – Paradyn needs information about modules, functions, processes, threads, synchronization objects, call graph
• Front-end bottleneck leads to high latency:
  – Reduces tool interactivity
  – May cause failures for application’s communication run-time library (e.g., MPI)

Page 7: Scalability 1: A Multi-Pronged Approach To Overcoming

SStart: Scalable Tool Start-up

• Reduce redundant data transfer
  – Daemons deliver summary to front end using custom MRNet reduction to find equivalence classes
  – Front end asks equivalence class representatives for complete info
  – Representative daemons send full info to front end
  – Metrics, code resources, call graph
• Increase efficiency of non-redundant data transfer
  – In-network concatenation of messages
    • More efficient point-to-point transfers
    • Front-end sees single message instead of many
  – Metric definition broadcast
  – Machine resources, daemon info, process info
• Clock skew detection
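The equivalence-class step can be sketched as follows. This is an illustration under stated assumptions, not SStart's implementation: the digest-based summary, the function names `summarize` and `merge`, and the sample function lists are all hypothetical.

```python
import hashlib

def summarize(daemon_id, functions):
    """Daemon side: a small summary (digest of the code resources)."""
    text = "\n".join(sorted(functions))
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {digest: [daemon_id]}

def merge(summaries):
    """Reduction step: combine summaries so equal digests share one class."""
    merged = {}
    for summary in summaries:
        for digest, members in summary.items():
            merged.setdefault(digest, []).extend(members)
    return merged

# Illustrative data: daemons 0-2 run the same executable, daemon 3 differs.
common = ["main", "Do_row", "Do_col", "Do_mult"]
summaries = [summarize(i, common) for i in range(3)]
summaries.append(summarize(3, ["main", "helper"]))

classes = merge(summaries)

# Front end: ask only one representative per class for the complete info.
representatives = sorted(min(members) for members in classes.values())
```

Here only two of the four daemons would be asked for full information; the other two are covered by their class representative.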

Page 8: Scalability 1: A Multi-Pronged Approach To Overcoming

Clock Skew Detection Algorithm

• Phase 1:
  – Repeated broadcast/reduce pairs to compute each process’ clock skew with directly connected children
• Phase 2:
  – Upward sweep to compute cumulative clock skew to all reachable daemons
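The slide does not say how a single broadcast/reduce pair yields a skew value. One common technique, assumed here purely for illustration, is a round-trip midpoint estimate averaged over repeated exchanges. The function names and timestamps below are hypothetical, and this may differ from what SStart actually does.

```python
from statistics import mean

def skew_sample(t_send, t_child, t_recv):
    """One exchange: the parent sends at t_send and receives the reply at
    t_recv (both on the parent's clock); the child reports t_child from its
    own clock. Assumes the child read its clock halfway through the trip."""
    return t_child - (t_send + t_recv) / 2.0

def estimate_skew(samples):
    """Average over repeated exchanges to damp network jitter."""
    return mean(skew_sample(*sample) for sample in samples)

# Two illustrative exchanges between a parent and one child.
samples = [(10.0, 14.0, 11.0), (20.0, 24.0, 21.0)]
child_skew = estimate_skew(samples)  # 3.5
```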

Page 9: Scalability 1: A Multi-Pronged Approach To Overcoming

Clock Skew Detection: Phase 1

[Figure: Front-End above Daemon0, Daemon1, Daemon2, and Daemon3, annotated with numeric values (7, 10, 8, 9, 6); the placement of the values did not survive extraction]

Page 10: Scalability 1: A Multi-Pronged Approach To Overcoming

Clock Skew Detection: Phase 1

[Figure: Front-End above Daemon0, Daemon1, Daemon2, and Daemon3, annotated with numeric values including -0.5, 1, 0.5, 3.5, and 0; the placement of the values did not survive extraction]

Page 11: Scalability 1: A Multi-Pronged Approach To Overcoming

Clock Skew Detection: Phase 2

[Figure: Front-End above Daemon0, Daemon1, Daemon2, and Daemon3, annotated with numeric values; the placement of the values did not survive extraction. The last row of values reads 0.5, 0, 7, 3.5]
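The upward sweep of Phase 2 can be sketched as below. The two-level topology, the node names `internal0` and `internal1`, and the per-edge skews are assumptions chosen so that the result matches the final row of values on this slide (0.5, 0, 7, 3.5); the original figure may have been laid out differently.

```python
def leaf(name):
    """A daemon at the bottom of the tree (hypothetical representation)."""
    return {"name": name, "children": []}

def sweep_up(node):
    """Return {daemon: cumulative skew relative to node}.

    Each child reports the skews of the daemons below it; the parent adds
    the skew of the edge to that child (known from Phase 1) and passes the
    combined table upward."""
    if not node["children"]:
        return {node["name"]: 0.0}
    cumulative = {}
    for edge_skew, child in node["children"]:
        for daemon, skew in sweep_up(child).items():
            cumulative[daemon] = edge_skew + skew
    return cumulative

# Assumed topology and edge skews (parent-to-child), for illustration only.
front_end = {
    "name": "Front-End",
    "children": [
        (-0.5, {"name": "internal0",
                "children": [(1.0, leaf("Daemon0")), (0.5, leaf("Daemon1"))]}),
        (3.5, {"name": "internal1",
               "children": [(3.5, leaf("Daemon2")), (0.0, leaf("Daemon3"))]}),
    ],
}

skews = sweep_up(front_end)
# {'Daemon0': 0.5, 'Daemon1': 0.0, 'Daemon2': 7.0, 'Daemon3': 3.5}
```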

Page 12: Scalability 1: A Multi-Pronged Approach To Overcoming

SStart Results: Smg2000 on ASCI Blue Pacific

[Figure: Start-up Latency. x-axis: Daemons (0 to 600); y-axis: Time (sec) (0 to 2500); series: Baseline, Flat, 4-way, 8-way, 16-way]

“Baseline” is the start-up latency without any SStart optimizations

Page 13: Scalability 1: A Multi-Pronged Approach To Overcoming

SStart Results: Smg2000 on ASCI Blue Pacific

[Figure: SStart Latency. x-axis: Daemons (0 to 600); y-axis: Time (sec) (0 to 70); series: No MRNet, 4-way, 8-way, 16-way]

Page 14: Scalability 1: A Multi-Pronged Approach To Overcoming

SStart Results

[Figure: Latency (sec), axis from 0 to 25, per Activity, for the series 512-512 and 512-008. Activities: Report Self, Report Metrics, Find Clock Skew, Parse Executable, Report Process, Report Machine Resources, Report Code Eq Classes, Report Code Resources, Report Callgraph Eq Classes, Report Callgraph, Report Done]

Page 15: Scalability 1: A Multi-Pronged Approach To Overcoming

SStart Results

Error of MRNet-based clock skew detection algorithm as compared to direct connection scheme
(Paradyn’s minimum sampling rate is 0.2 s)

Daemons  Fanout  Average (ms)   Stdev (ms)    Min (ms)  Max (ms)
      4       4   0.13325       0.059369079    0.061     0.204
      8       8  -0.1755        0.406402202    0.075    -0.782
     16       4  -0.064125      0.333511221   -0.018    -0.575
     64       4  -1.47925       0.524946158   -0.486    -2.589
     64       8  -0.52975       0.844241801   -0.039    -2.22
    256       4  -4.038492188   0.696410389    0.298    -5.623
    512       8  -6.511570313   1.167506052   -3.275    -9.359

Min and Max are by magnitude.
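As a rough check on these numbers, the short calculation below compares the largest error magnitude with the 0.2 s figure quoted on the slide. The Max values are copied from the table as parsed from the flattened text, so treat them as a best-effort reading; the variable names are illustrative.

```python
# Max error (ms) per (daemons, fanout) configuration, as read from the table.
max_error_ms = {
    (4, 4): 0.204,
    (8, 8): -0.782,
    (16, 4): -0.575,
    (64, 4): -2.589,
    (64, 8): -2.22,
    (256, 4): -5.623,
    (512, 8): -9.359,
}

SAMPLING_INTERVAL_MS = 200.0  # the 0.2 s figure from the slide

worst_ms = max(abs(error) for error in max_error_ms.values())
fraction_of_interval = worst_ms / SAMPLING_INTERVAL_MS
```

Under this reading, the worst-case error (512 daemons, fan-out 8) stays below 5% of the 0.2 s figure.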

Page 16: Scalability 1: A Multi-Pronged Approach To Overcoming

Multithreaded Data Manager

• Problem: single-threaded Data Manager in Paradyn front-end is data management bottleneck, limiting overall scalability
• Approach: Data Manager to use multiple threads, taking advantage of increasingly common SMP hardware
  – Part of overall effort to improve performance data management scalability with MRNet, daemon support for Distributed Performance Consultant
• Status: front-end scalability study underway

Page 17: Scalability 1: A Multi-Pronged Approach To Overcoming

Distributed Performance Consultant

• Problem: front-end bottleneck when searching for performance problems in large-scale applications
  – MRNet reduces front-end load when processing global performance data (e.g., CPU utilization across all application processes)
  – Front-end still processes local performance data (e.g., CPU utilization in process 5247 on host blue199.pacific.llnl.gov)

Page 18: Scalability 1: A Multi-Pronged Approach To Overcoming

Distributed Performance Consultant

• Approach:

[Figure: a node labeled CPUbound; hosts cham.cs.wisc.edu, c001.cs.wisc.edu, c002.cs.wisc.edu, and c128.cs.wisc.edu; processes myapp367, myapp4287, and myapp27549]

Page 19: Scalability 1: A Multi-Pronged Approach To Overcoming

Distributed Performance Consultant

• Approach:

[Figure: a node labeled CPUbound with sub-graphs for c001.cs.wisc.edu (myapp{367}), c002.cs.wisc.edu (myapp{4287}), ..., and c128.cs.wisc.edu (myapp{27549}), each listing the functions main, Do_row, Do_col, and Do_mult; a further sub-graph lists the same functions without a host label. Also shown: hosts cham.cs.wisc.edu, c001.cs.wisc.edu, c002.cs.wisc.edu, and c128.cs.wisc.edu; processes myapp367, myapp4287, and myapp27549]

Page 20: Scalability 1: A Multi-Pronged Approach To Overcoming

Distributed Performance Consultant

• Approach:

[Figure: same labels as on the previous slide; any differences in the drawing did not survive extraction]

Page 21: Scalability 1: A Multi-Pronged Approach To Overcoming

Distributed Performance Consultant

• Status: design underway
  – New data management support in daemons
  – New instrumentation cost model
  – New instrumentation scheduling policy

Page 22: Scalability 1: A Multi-Pronged Approach To Overcoming

Sub-Graph Folding Algorithm

• Problem: Search History Graph displays grow too complex when showing results from large-scale applications
• Approach: Inspired by the TMC/Sun PRISM parallel debugger, fold similar sub-graphs of the Search History Graph into a single composite sub-graph
• Status: Algorithm designed, feasibility study done, code integration needed
• Dramatic reduction of graph node count in feasibility study
  – 1232 nodes to 153
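One way to picture folding is to group per-process sub-graphs that share the same shape and labels. The sketch below is only an illustration of that idea, not the algorithm from the talk: the tuple-based tree encoding, the `signature` and `fold` names, the assumed call structure, and the `myapp{hypothetical}` entry are all invented for the example.

```python
from collections import defaultdict

def signature(tree):
    """Canonical form of a sub-graph: (label, sorted child signatures).
    Sorting makes the result independent of child order."""
    label, children = tree
    return (label, tuple(sorted(signature(child) for child in children)))

def fold(subgraphs):
    """Group owners whose sub-graphs have identical signatures."""
    groups = defaultdict(list)
    for owner, tree in subgraphs.items():
        groups[signature(tree)].append(owner)
    return dict(groups)

# Assumed structure: main calls Do_row and Do_col, which call Do_mult.
full = ("main", [("Do_row", [("Do_mult", [])]), ("Do_col", [("Do_mult", [])])])
reordered = ("main", [("Do_col", [("Do_mult", [])]), ("Do_row", [("Do_mult", [])])])
partial = ("main", [("Do_row", [("Do_mult", [])])])  # hypothetical outlier

subgraphs = {
    "myapp{367}": full,
    "myapp{4287}": reordered,
    "myapp{27549}": full,
    "myapp{hypothetical}": partial,
}

folded = fold(subgraphs)
group_sizes = sorted(len(owners) for owners in folded.values())  # [1, 3]
```

In this example the three matching processes collapse into one composite entry, and the display would need two sub-graphs instead of four.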

Page 23: Scalability 1: A Multi-Pronged Approach To Overcoming

Summary

• Taking many approaches to improving Paradyn scalability
  – MRNet
  – SStart
  – Distributed Performance Consultant
  – Multithreaded Data Manager
  – Sub-Graph Folding Algorithm
• Lots yet to do, but efforts paying off
  – SStart reduces start-up latency dramatically
  – MRNet results in Dorian’s talk

Page 24: Scalability 1: A Multi-Pronged Approach To Overcoming

Scalability 1: A Multi-Pronged Approach To Overcoming Scalability Barriers In Paradyn

http://www.paradyn.org

[email protected]