Tools for Engineering Analysis of High Performance Parallel Programs
David Culler,
Frederick Wong, Alan Mainwaring
Computer Science Division
U.C.Berkeley
http://www.cs.berkeley.edu/~culler/talks
11/5/99 LLNL ASCI III 2
Traditional Parallel Programming Tools
• Focus on showing "what the program did" and "when it did it"
– microscopic analysis of deterministic events
– oriented towards initial development of small programs on small data sets and small machines
• Instrumentation
– traces, counters, profiles
• Visualization
• Examples
– AIMS, PTOOLS, PPP
– Pablo + Paradyn + ... => Delphi
– ACTS TAU – tuning and analysis utilities
Example: Pablo
Beyond Zeroth-order Analysis
• Basic level: get to a system design that is reasonable and behaves properly under "ideal conditions"
• Subject the system to various stresses to understand its operating regime and gain deeper insight into its dynamic behavior
• Combine empirical data with analytical models
• Iterate
• from What? to What if?
[Figure: maximum displacement vs. wind speed]
Approach: Framework for Parameterized Sensitivity Analysis
• Framework performs analysis over numerous runs
– statistical filtering
– vary the parameter of interest
• Provides a means of combining data to isolate effects of interest
=> ROBUSTNESS
[Framework diagram: a well-developed parallel program plus a study parameter drive a problem data set generator, instrumentation tools, and machine characterizers; results feed visualization and modeling]
Study parameters:
• Procs
• Comm. perf.
• Cache
• Scheduling
• ...
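A driver for such a framework can be sketched in a few lines; everything below (the function names, the outlier-filtering rule, the stand-in benchmark) is hypothetical, not the actual UCB tooling:

```python
import statistics

def run_study(run_once, param_values, samples=5):
    """Hypothetical framework driver: run the program several times at
    each value of the study parameter, apply a simple statistical filter,
    and keep (mean, stdev) per point."""
    results = {}
    for p in param_values:
        times = [run_once(p) for _ in range(samples)]
        med = statistics.median(times)
        # crude filter: drop runs more than 50% away from the median
        kept = [t for t in times if abs(t - med) <= 0.5 * med]
        results[p] = (statistics.mean(kept), statistics.pstdev(kept))
    return results

# Usage with a stand-in for a real benchmark launcher:
fake = lambda p: 100.0 / p          # ideal-scaling stand-in
study = run_study(fake, [1, 2, 4, 8], samples=3)
```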
Simplest Example: Performance( P )
• NPB 2.2 on NOW and Origin 2000 (250 MHz)
[Figures: Origin and Cluster speedup vs. machine size (0–48 processors) for BT, SP, LU, MG, FT, IS, against the ideal line]
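Speedup curves like these are just measured runtimes normalized to a base run; a minimal sketch (the numbers below are illustrative, not the NPB results):

```python
def speedup(times, base_p=1):
    """times: {processor_count: runtime_seconds}.
    Speedup is relative to the run at base_p processors."""
    base = times[base_p]
    return {p: base / t for p, t in times.items()}

# Illustrative runtimes (seconds):
s = speedup({1: 1200.0, 4: 320.0, 16: 90.0})
# s[16] is about 13.3, i.e. below the ideal of 16
```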
Where Time is Spent ( P )
• Reveals basic processor and network loading (vs. P)
• Basis for model derivation – comm(P)
[Figures: LU on Origin and Cluster — Total, Comp, Comm, and Ideal time (seconds) vs. machine size (0–40 processors)]
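The per-P breakdown above is the raw material for deriving a comm(P) model. One hedged approach is to fit a power law comm(P) = a·P^b to the measured communication times by least squares in log-log space (the data below is illustrative, not from the talk):

```python
import math

def fit_power_law(comm_times):
    """Fit comm(P) = a * P**b in log-log space.
    comm_times: {processor_count: measured communication time}."""
    xs = [math.log(p) for p in comm_times]
    ys = [math.log(t) for t in comm_times.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope = covariance / variance of the log-transformed data
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b

# Illustrative data where comm grows roughly as sqrt(P):
a, b = fit_power_law({4: 20.0, 16: 40.0, 64: 80.0})
```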
Where Time is Spent ( P ) - cont
• Reveals basic processor and network loading (vs. P)
[Figures: FT on Cluster and Origin — Total, Comp, Comm, and Ideal time (seconds) vs. machine size (0–40 processors)]
Communication Volume ( P )
[Figures: total communication volume (GB) and bytes per processor (MB) vs. machine size (0–40 processors) for BT, SP, LU, MG, FT, IS]
Communication Structure ( P )
[Figures: normalized messages per processor, and average message size (KB, log scale) vs. machine size (0–40 processors) for BT, SP, LU, MG, FT, IS]
Understanding Efficiency ( P, M )
• Want to understand both what load the program is placing on the system
• and how well the system is handling that load
=> characterize the capability of the system via simple benchmarks (rather than advertised peaks)
=> combine with measured load for a predictive model, & compare
[Figures: MPI one-way latency (usec) vs. message size (1–1,000 bytes) on Cluster and Origin]
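Combining a microbenchmark-derived cost curve with the measured message load is what turns these plots into a predictive model. A sketch assuming a simple linear latency model T(n) = alpha + beta·n (the constants are illustrative, not the measured Cluster or Origin values):

```python
def predicted_comm_time(message_sizes, alpha_us=20.0, beta_us_per_byte=0.01):
    """Predict total communication time (microseconds) from a measured
    message log, using a microbenchmark-derived linear cost model
    T(n) = alpha + beta * n. Constants here are illustrative only."""
    return sum(alpha_us + beta_us_per_byte * n for n in message_sizes)

# e.g. a measured load of 1000 messages of 4 KB each:
t_us = predicted_comm_time([4096] * 1000)
```

Comparing this prediction against the measured communication time is what yields the efficiency numbers on the next slide.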
Communication Efficiency
[Figures: communication efficiency (%) vs. machine size (0–40 processors) on Cluster (rendezvous) and Origin for the NPB codes (BT, SP, LU, MG, FT, IS)]
Tools => Improvements in Run Time
• Efficiency analysis (vs. parameters) gives insight into where to improve the system or the program
– use traditional profiling to see where in the program the 'bad stuff' happens
– or go back and tune the system to do better
[Figures: communication efficiency (%) vs. machine size (0–40 processors) on the Cluster under eager vs. rendezvous protocols for the NPB codes]
Cache Behavior (P, $)
• Combining trace generation with simulation provides new structural insight
• Here: clear knees in the program working set ($); these shift with machine size (P)
[Figures: LU miss rate (%) vs. per-processor cache size (1 KB–4 MB) for machine sizes P = 4, 8, 16, 32]
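Curves like these come from feeding an address trace through a cache simulator while sweeping the cache size. A minimal direct-mapped sketch of the idea (the study's actual simulator and traces are not shown here):

```python
def miss_rate(trace, cache_kb, line_bytes=64):
    """Minimal direct-mapped cache simulator: replay an address trace
    and report the miss rate for a given cache size."""
    n_lines = (cache_kb * 1024) // line_bytes
    tags = [None] * n_lines
    misses = 0
    for addr in trace:
        line = addr // line_bytes
        idx = line % n_lines
        if tags[idx] != line:       # conflict or cold miss
            tags[idx] = line
            misses += 1
    return misses / len(trace)

# A trace that sweeps a 128 KB working set twice: the "knee" appears
# once the cache is large enough to hold the whole working set.
trace = [i * 64 for i in range(2048)] * 2
small = miss_rate(trace, 64)    # working set doesn't fit: every access misses
large = miss_rate(trace, 256)   # fits: the second sweep hits
```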
Cache Behavior (P, $)
• Clear knees in program working set ($) not affected by P
[Figure: FT miss rate (%) vs. per-processor cache size (1 KB–4 MB) for machine sizes P = 4, 8, 16, 32]
Sensitivity to Multiprogramming
• Parallel machines are increasingly general purpose
– multiprogramming, at least interrupts and daemons
• Many 'ideal' programs are very sensitive to perturbations
– message passing is loosely coupled, but the implementation may not be!
Slowdown vs. dedicated run, with competing sequential jobs:

            LU     FT    MG
Dedicated  1.00   1.00  1.00
1-Seq      6.39   1.43  4.11
2-Seq     19.05   1.63  5.86
3-Seq     20.25   1.65  6.53

Slowdown vs. dedicated run, with competing parallel programs:

            LU     FT    MG
Dedicated  1.00   1.00  1.00
2-PP       4.20   1.28  4.18
3-PP      18.24   1.51  6.27
Tools => Improvements in Run Time
• MPI implementation spin-waits on send until the network is available (or the queue is not full), and on recv-complete
• Should use two-phase spin-block
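The two-phase spin-block idea: spin briefly for the common fast case, then block so the CPU is yielded to the competing job. A sketch in terms of ordinary threads, not the actual MPI implementation:

```python
import threading
import time

def two_phase_wait(ready, spin_us=50):
    """Two-phase spin-block wait on a completion flag.
    Phase 1: spin for roughly the expected completion time.
    Phase 2: block, yielding the processor, until signaled.
    `ready` is a threading.Event set by the completion handler."""
    deadline = time.perf_counter() + spin_us / 1e6
    while time.perf_counter() < deadline:   # phase 1: spin
        if ready.is_set():
            return True
    ready.wait()                            # phase 2: block
    return True

# Usage: another thread signals completion after 10 ms.
ev = threading.Event()
threading.Timer(0.01, ev.set).start()
two_phase_wait(ev)
```

The spin threshold is the tuning knob: too short and the fast case pays a blocking overhead; too long and a perturbed run burns CPU the competing job needs.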
Slowdown vs. dedicated run, with two-phase spin-block:

            LU     FT    MG
Dedicated  1.00   1.00  1.00
2-PP       1.24   0.96  1.16
3-PP       1.31   0.91  1.20
Sensitivity to Seemingly Unrelated Activity
• The mechanism for doing parameter studies is naturally extended to get statistically valid data through multiple samples at each point
– tend to get crisp, fast results in the wee hours
• Extend study outside the app
• Example: two programs on a big (64 P) Origin, run alone vs. together

                         alone      together
– 8-processor IS run:    4.71 sec    6.18 sec
– 36-processor SP run:  26.36 sec   65.28 sec
11/5/99 LLNL ASCI III 19
Repeatability
• The variance across repeated runs is a key result for production codes – the real world is not ideal
[Figures: scatter plots of FT and LU runtime (seconds) vs. machine size on the Origin, 30 samples per point, with averages marked]
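The summary statistics behind such scatter plots are straightforward; a sketch with illustrative sample values (not the measured FT/LU data):

```python
import statistics

def summarize(samples):
    """Summarize repeated runs of one configuration: mean, sample
    standard deviation, and the max-min spread a scatter plot shows."""
    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),
        "spread": max(samples) - min(samples),
    }

s = summarize([100.0, 104.0, 98.0, 110.0])
```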
Plans
• Integrate our instrumentation and analysis tools with ACTS TAU
– port to UCB Millennium environment
– experiment with ASCI platforms
• Refine and complete the automated sensitivity analysis framework
• Backend performance data storage
– Pablo SPPF?
• Next year
– integrate performance model development, prediction