The Scalasca Performance Analysis Toolset
Markus Geimer
Jülich Supercomputing Centre
April 3, 2013
Slide 2
“A picture is worth 1000 words…”
M. Geimer | SEA Conference'13, Boulder, CO
Slide 3
“But what about 1000’s of pictures?”
Slide 4
Automatic trace analysis
Idea:
- Automatic search for patterns of inefficient behavior
- Classification of behavior & quantification of significance

Advantages:
- Guaranteed to cover the entire event trace
- Quicker than manual/visual trace analysis
- Helps to identify hot spots for in-depth manual analysis
- Complements the functionality of other tools

[Figure: the analysis maps a low-level event trace to a high-level result
organized along three dimensions: property, call path, location]
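The three result dimensions above can be illustrated with a toy aggregation step. This is only a sketch: the pattern names are real Scalasca properties, but the instance records and call paths are invented for illustration; the real analysis extracts them from the event trace during replay.

```python
from collections import defaultdict

# Each detected pattern instance contributes its waiting time to a
# (property, call path, location) cell of the analysis report.
# Records below are made-up examples.
instances = [
    ("Late Sender", "main/exchange/MPI_Recv", "rank 3", 0.42),
    ("Late Sender", "main/exchange/MPI_Recv", "rank 7", 0.17),
    ("Wait at N x N", "main/reduce/MPI_Allreduce", "rank 3", 0.08),
]

severity = defaultdict(float)
for prop, callpath, location, waiting_time in instances:
    severity[(prop, callpath, location)] += waiting_time

for cell, seconds in sorted(severity.items()):
    print(cell, round(seconds, 2))
```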
Slide 5
Higher degrees of parallelism
Number-of-cores share of the TOP500 list, November 2012
Average system size: 29,772 cores
Median system size: 15,360 cores

  NCore         Count   Share      ∑Rmax     Share       ∑NCore
  1025-2048         1    0.2%     122 TF      0.1%        1,280
  2049-4096         2    0.4%     155 TF      0.1%        7,104
  4097-8192        81   16.2%   8,579 TF      5.3%      551,624
  8193-16384      206   41.2%  24,543 TF     15.1%    2,617,986
  > 16384         210   42.0% 128,574 TF     79.4%   11,707,806
  Total           500  100.0% 161,973 TF    100.0%   14,885,800
Slide 6
Higher degrees of parallelism (II)
- New demands on the scalability of software tools:
  familiar tools cease to work satisfactorily at large processor/core counts
- Optimization of applications becomes more difficult:
  - Increasing machine complexity
  - Every doubling of scale reveals a new bottleneck
- Need for scalable performance tools:
  - Efficient, to meet performance expectations
  - Effective to use, so that programmer productivity is maximized
Slide 7
The Scalasca project
- Project started in 2006 as a follow-up to the pioneering KOJAK project
  (started 1998): automatic pattern-based trace analysis
- Initial funding by the Helmholtz Initiative & Networking Fund;
  many follow-up projects
- Objective: development of a scalable performance analysis toolset,
  specifically targeting large-scale parallel applications such as those
  running on IBM Blue Gene or Cray XT with 10,000s or 100,000s of processes
- Now a joint project of the Jülich Supercomputing Centre and the
  German Research School for Simulation Sciences
Slide 8
The Scalasca toolset
Scalasca 1.4.3:
- Custom instrumentation & measurement system
- Scalasca trace analysis components based on the custom trace format EPILOG
- Analysis report explorer & algebra utilities: CUBE v3
- New BSD license

Scalasca 2.0:
- Community instrumentation & measurement system: Score-P
- Scalasca trace analysis components based on the community trace format OTF2
- Analysis report explorer & algebra utilities: CUBE v4
- New BSD license
http://www.scalasca.org
Slide 9
Scalasca features
- Open source
- Portable: Blue Gene/Q, Blue Gene/P, IBM SP & blade clusters, SGI Altix,
  Solaris & Linux clusters
  - Scalasca 1.4.3 only: Cray XT, NEC SX, K Computer, Fujitsu FX10
- Supports parallel programming paradigms & languages:
  MPI, OpenMP & hybrid MPI+OpenMP; Fortran, C, C++
- Scalable trace analysis: automatic wait-state search;
  parallel replay exploits distributed memory & processors to deliver scalability
Slide 10
Scalasca workflow
[Figure: Scalasca workflow. Source modules pass through the instrumenter
(compiler/linker) to produce an instrumented executable. Running the
instrumented target application with the measurement library (optionally
reading hardware counters, HWC) under an optimized measurement configuration
yields a summary report and local event traces. A parallel wait-state search
over the traces produces a wait-state report; report-manipulation utilities
then help answer: Which problem? Where in the program? Which process?]
Slide 11
Example: MPI patterns
[Figure: time-line diagrams of MPI wait-state patterns, drawn over ENTER,
LEAVE, SEND, RECV, and COLLECTIVE events:
 (a) Late Sender, (b) Late Receiver,
 (c) Late Sender / Wrong Order, (d) Wait at N x N]
Slide 12
Example: Late Sender
Sender:
- Triggered by the send event
- Determine the corresponding enter event
- Send both events to the receiver

Receiver:
- Triggered by the receive event
- Determine the corresponding enter event
- Receive the remote events
- Detect the Late Sender situation
- Calculate & store the waiting time

[Figure: time-lines of the sender and receiver processes with ENTER, LEAVE,
SEND, and RECV events]
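The receiver-side check can be sketched in a few lines. This is a sequential illustration with invented timestamps and record layout; in Scalasca the sender actually forwards its events during a parallel replay of the trace, using a message that mirrors the original communication.

```python
from dataclasses import dataclass

@dataclass
class Timestamps:
    enter_send: float  # sender's ENTER into the send operation
    enter_recv: float  # receiver's ENTER into the receive operation

def late_sender_wait(ts: Timestamps) -> float:
    """Time the receiver spends blocked because the matching send
    started late; zero if the sender was ready first."""
    return max(0.0, ts.enter_send - ts.enter_recv)

# Receiver enters the receive at t=1.0, sender reaches the send at t=3.5:
print(late_sender_wait(Timestamps(enter_send=3.5, enter_recv=1.0)))  # 2.5
# Sender was ready before the receiver: no Late Sender waiting time.
print(late_sender_wait(Timestamps(enter_send=1.0, enter_recv=3.0)))  # 0.0
```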
Slide 13
CESM Sea Ice Module – Late Sender Analysis
Slide 14
Scalasca trace analysis: Sweep3D @ 294,912 cores (BG/P)
- 10 min Sweep3D runtime
- 11 sec analysis
- 4 min trace-data write/read (576 files)
- 7.6 TB buffered trace data
- 510 billion events

B. J. N. Wylie, M. Geimer, B. Mohr, D. Böhme, Z. Szebenyi, F. Wolf:
Large-scale performance analysis of Sweep3D with the Scalasca toolset.
Parallel Processing Letters, 20(4):397–414, 2010.
Slide 15
Scalasca trace analysis: bt-mz @ 1,048,704 (BG/Q)
[Screenshot: execution imbalance in "z_solve"]
Slide 16
Scalasca trace analysis: bt-mz @ 1,048,704 (BG/Q)
[Screenshot: Wait at Implicit Barrier in "z_solve"]
Slide 17
Research: Root Cause Analysis
Root-cause analysis:
- Wait states are typically caused by load or communication imbalances
  earlier in the program
- Waiting time can also propagate (e.g., indirect waiting time)
- Goal: enhance the performance analysis to find the root cause of wait states

Approach:
- Distinguish between direct and indirect waiting time
- Identify call-path/process combinations that delay other processes and
  cause first-order waiting time
- Identify the original delay

[Figure: time-lines of processes A, B, and C executing "foo" and "bar" with
Send/Recv events; a DELAY in "bar" on one process causes a direct wait in a
Recv on the next process, which in turn causes an indirect wait further
downstream]
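The direct/indirect distinction above can be sketched as a simple accounting rule: while one process is blocked, the causing process is either computing (a genuine delay, yielding direct waiting time) or itself waiting (so the waiting time propagates, yielding indirect waiting time). This is only a hedged illustration of the idea; the state/duration records and the function below are invented, not Scalasca's actual interface.

```python
def split_wait(peer_activity):
    """peer_activity: ("compute" | "wait", duration) pairs describing what
    the causing process was doing during the interval in which this
    process was blocked. Returns the direct/indirect split."""
    direct = sum(d for state, d in peer_activity if state == "compute")
    indirect = sum(d for state, d in peer_activity if state == "wait")
    return {"direct": direct, "indirect": indirect}

# The peer computed for 1.5 s and was itself blocked for 0.5 s while we
# waited: 1.5 s is direct waiting time, 0.5 s propagated from upstream.
print(split_wait([("compute", 1.5), ("wait", 0.5)]))
```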
Slide 18
CESM Sea Ice Module – Direct Wait Time
Slide 19
CESM Sea Ice Module – Indirect Wait Time
Slide 20
CESM Sea Ice Module – Delay Costs
Slide 21
Acknowledgements
Scalasca team (JSC & GRS):
Markus Geimer, Michael Knobloch, Bernd Mohr, Peter Philippen, Daniel Lorenz,
Christian Rössel, David Böhme, Marc-André Hermanns, Pavel Saviankou,
Marc Schlütter, Ilja Zhukov, Alexandre Strube, Brian Wylie, Felix Wolf,
Anke Visser, Monika Lücke, Aamer Shah, Alexandru Calotoiu, Jie Jiang,
Sergei Shudler, Guoyong Mao

Sponsors
Slide 22
Thank you!
http://www.scalasca.org