
Slide 1

The Scalasca Performance Analysis Toolset

Markus Geimer
Jülich Supercomputing Centre
[email protected]

April 3, 2013
SEA Conference 2013, Boulder, CO


Slide 2

“A picture is worth 1000 words…”


Slide 3

“But what about 1000’s of pictures?”


Slide 4

Automatic trace analysis

Idea: Automatic search for patterns of inefficient behavior

Classification of behavior & quantification of significance

Advantages: Guaranteed to cover the entire event trace

Quicker than manual/visual trace analysis

Helps to identify hot-spots for in-depth manual analysis

Complements the functionality of other tools

[Diagram: the analysis transforms a low-level event trace into a high-level result, organized along the dimensions property, call path, and location]
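To make the three result dimensions concrete, the following minimal C sketch (illustrative only; not Scalasca's actual data structures, and all sizes are hypothetical) stores each detected inefficiency as a severity value keyed by property, call path, and location:

#include <stdio.h>

/* The three result dimensions from the diagram above. */
enum property { LATE_SENDER, LATE_RECEIVER, WAIT_AT_NXN, NUM_PROPERTIES };
#define NUM_CALLPATHS 4   /* hypothetical number of call paths */
#define NUM_LOCATIONS 8   /* hypothetical number of locations, e.g. MPI ranks */

/* severity[property][call path][location] = accumulated waiting time in seconds */
static double severity[NUM_PROPERTIES][NUM_CALLPATHS][NUM_LOCATIONS];

/* Record one detected pattern instance together with its waiting time. */
static void record_instance(enum property p, int callpath, int location,
                            double waiting_time)
{
    severity[p][callpath][location] += waiting_time;
}

int main(void)
{
    /* Two hypothetical Late Sender instances at call path 1 on rank 3. */
    record_instance(LATE_SENDER, 1, 3, 0.042);
    record_instance(LATE_SENDER, 1, 3, 0.013);
    printf("Late Sender severity (call path 1, rank 3): %.3f s\n",
           severity[LATE_SENDER][1][3]);
    return 0;
}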


Slide 5

Higher degrees of parallelism

Number-of-cores share in the TOP500 list of November 2012

Average system size: 29,772 cores

Median system size: 15,360 cores

NCore        Count    Share    ∑Rmax         Share    ∑NCore
1025-2048        1     0.2%       122 TF      0.1%         1,280
2049-4096        2     0.4%       155 TF      0.1%         7,104
4097-8192       81    16.2%     8,579 TF      5.3%       551,624
8193-16384     206    41.2%    24,543 TF     15.1%     2,617,986
> 16384        210    42.0%   128,574 TF     79.4%    11,707,806
Total          500     100%   161,973 TF      100%    14,885,800


Slide 6

Higher degrees of parallelism (II)

Also new demands on the scalability of software tools: familiar tools cease to work in a satisfactory manner for large processor/core counts

Optimization of applications becomes more difficult: increasing machine complexity, and every doubling of scale reveals a new bottleneck

Need for scalable performance tools: efficient, to meet performance expectations, and effective to use, so that programmer productivity is maximized


Slide 7

The Scalasca project

Project started in 2006 as a follow-up to the pioneering KOJAK project (started 1998) on automatic pattern-based trace analysis; initial funding by the Helmholtz Initiative & Networking Fund, with many follow-up projects

Objective: development of a scalable performance analysis toolset, specifically targeting large-scale parallel applications such as those running on IBM Blue Gene or Cray XT with 10,000s or 100,000s of processes

Now a joint project of Jülich Supercomputing Centre and the German Research School for Simulation Sciences


Slide 8

The Scalasca toolset

Scalasca 1.4.3:
Custom instrumentation & measurement system
Scalasca trace analysis components based on the custom trace format EPILOG
Analysis report explorer & algebra utilities: CUBE v3
New BSD license

Scalasca 2.0:
Community instrumentation & measurement system: Score-P
Scalasca trace analysis components based on the community trace format OTF2
Analysis report explorer & algebra utilities: CUBE v4
New BSD license


http://www.scalasca.org


Slide 9

Scalasca features

Open source

Portable: Blue Gene/Q, Blue Gene/P, IBM SP & blade clusters, SGI Altix, Solaris & Linux clusters; Scalasca 1.4.3 only: Cray XT, NEC SX, K Computer, Fujitsu FX10

Supports parallel programming paradigms & languages: MPI, OpenMP & hybrid MPI+OpenMP; Fortran, C, C++

Scalable trace analysis: automatic wait-state search; parallel replay exploits memory & processors to deliver scalability


Slide 10

Scalasca workflow


[Workflow diagram: source modules pass through the instrumenter (compiler/linker) to produce an instrumented executable, which runs against the measurement library (optionally recording hardware counters, HWC). Measurement yields a summary report and, in tracing mode, local event traces; the parallel wait-state search processes the traces into a wait-state report. Reports can be post-processed with the report-manipulation utilities, and the summary feeds back into an optimized measurement configuration. The resulting reports answer: Which problem? Where in the program? Which process?]


Slide 11

Example: MPI patterns

[Time-line diagrams (process vs. time, with ENTER, LEAVE, SEND, RECV, and COLLECTIVE events) of four wait-state patterns: (a) Late Sender, (b) Late Receiver, (c) Late Sender / Wrong Order, (d) Wait at N x N]
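As an illustration of pattern (a), the following minimal MPI program (hypothetical; not from the slides) provokes a Late Sender: rank 1 posts its receive immediately, while rank 0 still performs work before sending, so nearly all of rank 1's time inside MPI_Recv is waiting time. A collective call such as MPI_Allreduce placed after an imbalanced phase would analogously produce Wait at N x N.

#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        sleep(2);                        /* imbalance: extra work before the send */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);     /* blocked for about 2 s: Late Sender */
    }

    MPI_Finalize();
    return 0;
}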


Slide 12

Example: Late Sender

Sender:

Triggered by send event

Determine enter event

Send both events to receiver

Receiver:

Triggered by receive event

Determine enter event

Receive remote events

Detect Late Sender situation

Calculate & store waiting time

[Time-line diagram of a replayed Late Sender message, with ENTER, LEAVE, SEND, and RECV events on the sender and receiver processes]
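The sketch below (a simplification for illustration only; not Scalasca's actual implementation or trace data structures) shows this exchange for a single replayed message: the sender-side analysis forwards the timestamps of its ENTER and SEND events, and the receiver-side analysis compares the remote ENTER timestamp with its own to obtain the Late Sender waiting time. The timestamps are hard-coded here; in reality they come from the local event traces.

#include <mpi.h>
#include <stdio.h>

/* Timestamps of the ENTER event of the MPI call and of the SEND/RECV event itself. */
typedef struct { double enter_time, event_time; } RegionTimes;

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Sender side of the replay: forward the ENTER and SEND timestamps
         * from the local trace to the receiving rank. */
        RegionTimes local = { 10.5 /* ENTER MPI_Send */, 10.6 /* SEND */ };
        MPI_Send(&local, 2, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receiver side: combine the remote timestamps with the local
         * ENTER and RECV events and check for the Late Sender situation. */
        RegionTimes local = { 8.0 /* ENTER MPI_Recv */, 10.7 /* RECV */ };
        RegionTimes remote;
        MPI_Recv(&remote, 2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        double waiting_time = remote.enter_time - local.enter_time;
        if (waiting_time > 0.0)   /* the sender entered its call after the receiver did */
            printf("Late Sender detected: waiting time = %.1f s\n", waiting_time);
    }

    MPI_Finalize();
    return 0;
}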


Slide 13

CESM Sea Ice Module – Late Sender Analysis


Slide 14

Scalasca trace analysis sweep3D@294,912 BG/P

10 min Sweep3D runtime
11 sec analysis
4 min trace data write/read (576 files)
7.6 TB buffered trace data
510 billion events

B. J. N. Wylie, M. Geimer, B. Mohr, D. Böhme, Z. Szebenyi, F. Wolf: Large-scale performance analysis of Sweep3D with the Scalasca toolset. Parallel Processing Letters, 20(4):397-414, 2010.


Slide 15

Scalasca trace analysis bt-mz@1,048,704 BG/Q

Execution imbalance in “z_solve”


Slide 16

Scalasca trace analysis bt-mz@1,048,704 BG/Q

Wait at implicit barrier in “z_solve”


Slide 17

Research: Root Cause Analysis

Root-cause analysis:
Wait states are typically caused by load or communication imbalances earlier in the program
Waiting time can also propagate (e.g., indirect waiting time)
Goal: enhance the performance analysis to find the root cause of wait states

Approach:
Distinguish between direct and indirect waiting time
Identify call-path/process combinations that delay other processes and cause first-order waiting time
Identify the original delay

[Time-line diagram with three processes A, B, and C: excess time in routine bar on process A (the DELAY) causes direct waiting time in a Recv on process B, which in turn delays B's Send to C, causing indirect waiting time in a Recv on process C]
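To make the propagation concrete, here is a hypothetical three-rank MPI chain (illustration only; not taken from the slides): the extra work on rank 0 is the delay, rank 1 incurs direct waiting time in its MPI_Recv, and rank 2 incurs indirect waiting time because rank 1's forwarding send is itself delayed, even though ranks 1 and 2 perform no excess work of their own.

#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, token = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* run with exactly 3 ranks */

    if (rank == 0)
        sleep(2);                    /* the root cause: a delay on rank 0 only */

    if (rank > 0)                    /* rank 1: direct wait; rank 2: indirect wait */
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    if (rank < 2)                    /* forward the token along the chain */
        MPI_Send(&token, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}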


Slide 18

CESM Sea Ice Module – Direct Wait Time


Slide 19

CESM Sea Ice Module – Indirect Wait Time


Slide 20

CESM Sea Ice Module – Delay Costs


Slide 21

Acknowledgements

Scalasca team (JSC and GRS): Michael Knobloch, Bernd Mohr, Peter Philippen, Markus Geimer, Daniel Lorenz, Christian Rössel, David Böhme, Marc-André Hermanns, Pavel Saviankou, Marc Schlütter, Ilja Zhukov, Alexandre Strube, Brian Wylie, Felix Wolf, Anke Visser, Monika Lücke, Aamer Shah, Alexandru Calotoiu, Jie Jiang, Sergei Shudler, Guoyong Mao

Sponsors


Slide 22

Thank you!


http://www.scalasca.org