Fast Classification of MPI Applications Using Lamport’s Logical Clocks
Zhou Tong, Scott Pakin, Michael Lang, Xin Yuan
Florida State University
Los Alamos National Laboratory
1
Motivation
2
Conventional trace-based performance simulation and analysis process
• Useful but expensive:
  1. One simulation per target platform
  2. Time-consuming and resource-demanding
  3. Requires a large number of simulations
Can we simulate the performance of an application on many interconnection networking options in one run?
Message Passing Trace Files
Trace-driven Simulator
Networking Options
1GbE 10GbE IB QDR
Result Result Result
Motivation (cont.)
Fast: one simulation run produces performance prediction on all pre-defined interconnect technologies
Classification:
1. Test the application’s sensitivity to slow-down and speed-up scenarios of each interconnection technology
2. Identify performance-limiting factors by observing the performance trend and classify each application into:
• Bandwidth-bound (BW)
• Latency-bound (L)
• Communication-bound (Comm.)
• Computation-bound (Comp.)
• Load-imbalance-bound (Imb.)
3
Interconnection Technologies
InfiniBand
SDR
DDR
QDR
FDR
EDR
HDR
Ethernet
1GbE
10GbE
20GbE
40GbE
80GbE
…
Other
Classification Example
4
• Communication time and wait time are not affected by changes in bandwidth and latency.
• Upgrading to a faster interconnection network will probably not improve its performance, but investing in CPUs will.
MiniFE: Finite element solver
• One simulation – 7 predictions:
  • x/8 (1.25 Gbps, 40 μs)
  • x/4 (2.5 Gbps, 20 μs)
  • x/2 (5 Gbps, 10 μs)
  • 1x (10 Gbps, 5 μs)
  • 2x (20 Gbps, 2.5 μs)
  • 4x (40 Gbps, 1.25 μs)
  • 8x (80 Gbps, 0.625 μs)
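The seven predictions come from scaling the 10 GbE baseline (10 Gbps, 5 μs) by powers of two: bandwidth multiplied by the factor, latency divided by it. A minimal sketch of that sweep (constant and function names are illustrative, not from the tool):

```python
# Sketch: derive the seven (bandwidth, latency) pairs from the
# 10 GbE baseline by powers-of-two scaling. Names are illustrative.
BASE_BW_GBPS = 10.0   # 1x bandwidth
BASE_LAT_US = 5.0     # 1x latency

def sweep(factors=(1/8, 1/4, 1/2, 1, 2, 4, 8)):
    # Bandwidth scales up with the factor; latency scales down with it.
    return [(BASE_BW_GBPS * f, BASE_LAT_US / f) for f in factors]

for bw, lat in sweep():
    print(f"{bw:g} Gbps, {lat:g} us")
```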
Our Tool
• One simulation predicts application performance for many latency and bandwidth parameters
• Uncovers the benefits of various networking options
• Classifies the application using the predicted performance trend for a range of network configurations
Usage
• Application: rapidly forecast communication-related performance bottlenecks
• System: quickly gauge the relative benefits of various networking options
• Complements slower but more accurate simulation approaches
5
Flowchart
6
Input: DUMPI traces
Fast Classification Tool
Output: classification summary
Identify performance characteristics
1. Computation
2. Load-imbalance
3. Communication
• Bandwidth
• Latency
MPI Application
DUMPI
traces
DUMPI
Process
0
DUMPI
Process
1
Fast Classification Tool
Calculating Logical Timestamps
Update Logical Time Counters
Clock Synchronization
Rank 0
Processing
Rank 1
Processing
Rank n
Processing
Classification Summary
DUMPI
7
DUMPI: MPI communication trace
• Developed at Sandia National Laboratories
• Records communication-related events:
  • Enter/exit wall time (in nanoseconds)
  • Source/destination
  • Communicator
  • Status
  • Count
  • Datatype
  • Tag
• Generally, no topology or task-mapping information is provided
Process 0:
MPI_Init: enter time = 10;
MPI_Init: exit time = 30;
…
MPI_Isend: enter time = 100;
  Dest = 1;
  Datatype = INT;
  Count = 4;
  Tag = 1000;
  Communicator = 2 (MPI_COMM_WORLD)
MPI_Isend: exit time = 200;
…
MPI_Finalize: enter time = 300;
MPI_Finalize: exit time = 350;
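Actual DUMPI traces are binary, but the rendering above suggests how per-call timing is recovered. A hypothetical sketch that parses such a listing and derives the computation gaps between calls (the listing text and helper names are my own):

```python
# Illustrative only: DUMPI traces are binary; this parses a
# human-readable rendering like the one above to recover per-call
# enter/exit times, from which compute gaps between calls follow.
import re

LISTING = """
MPI_Init: enter time = 10;
MPI_Init: exit time = 30;
MPI_Isend: enter time = 100;
MPI_Isend: exit time = 200;
"""

def parse(listing):
    events = re.findall(r"(\w+): (enter|exit) time\s*=\s*(\d+);", listing)
    return [(name, kind, int(t)) for name, kind, t in events]

def compute_gaps(events):
    # Gap from one call's exit to the next call's entry = computation.
    gaps = []
    for (n1, k1, t1), (n2, k2, t2) in zip(events, events[1:]):
        if k1 == "exit" and k2 == "enter":
            gaps.append((n1, n2, t2 - t1))
    return gaps

print(compute_gaps(parse(LISTING)))   # → [('MPI_Init', 'MPI_Isend', 70)]
```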
Lamport’s Logical Clocks
8
• What if we let t be a vector of timestamps, [t0, t1, …, tn]?
• What if we use non-unit communication and computation times?
[Timelines: P0 issues MPI_Send at t = 20 in both cases. Left: MPI_Recv enters at time 10 and exits at 21. Right: MPI_Recv enters at time 20 and exits at 21.]
• Provides a clock-synchronization mechanism that honors the happened-before relationship:
  a) Each process represents time with a local counter, T
  b) A process increments its counter by 1 unit of time before a communication or computation event
  c) The counter value t is transmitted with each message
  d) The receiver updates: T = max(t_in, t) + 1
     (t_in: routine enter time; t: timestamp received from the sender)
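Rules (a)–(d) amount to the classic Lamport clock update. A minimal sketch, assuming unit increments (class and method names are mine):

```python
# Minimal Lamport logical clock, following rules (a)-(d) above.
# Each process keeps a counter T; it ticks before every event,
# stamps outgoing messages, and on receive takes the max.
class LamportClock:
    def __init__(self):
        self.T = 0

    def tick(self):        # rule (b): advance before a local event
        self.T += 1
        return self.T

    def send(self):        # rule (c): stamp the outgoing message
        return self.tick()

    def recv(self, t_msg): # rule (d): T = max(T, t_msg) + 1
        self.T = max(self.T, t_msg) + 1
        return self.T

p0, p1 = LamportClock(), LamportClock()
t = p0.send()   # p0 sends at logical time 1
p1.tick()       # p1 does local work: logical time 1
p1.recv(t)      # receive honors happened-before: max(1, 1) + 1 = 2
```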
Simulation with Extended Lamport’s Logical Clocks
9
Process 0
MPI_Init
MPI_Barrier
Compute 5
MPI_Send
Compute 5
MPI_Recv:
MPI_Finalize
Goal: find the execution time of the application under different system configurations
1. Decide how to count the computation and communication time
2. Decide when each routine will return, using the modeled communication time
Process 1
MPI_Init
MPI_Barrier
Compute 10
MPI_Recv
Compute 10
MPI_Send
MPI_Finalize
1. MPI_Send: 10 ns to copy the message into the system buffer (eager protocol)
2. MPI_Recv: the logical exit time depends on the time of the matching send and the point-to-point communication time modeled by Hockney's model (20 ns)
[Timeline: computation and send/receive phases of P0 and P1 on a 0–50 ns axis, showing the matching MPI_Send/MPI_Recv pairs.]
Implementation
10
1. Extended Lamport’s logical clocks
2. Logical time counters:
  • Computation: gap from the exit time of an MPI routine to the entry time of the next
  • Wait: time spent waiting for the corresponding party to start its communication
  • Latency: the fixed latency for the first byte to reach the receiver
  • Bandwidth: time needed for the rest of the message to reach the receiver
3. Modeling communication:
  • P2P: Hockney's model: α + nβ
    (α = latency, β = reciprocal bandwidth, n = message size)
  • Collective: Thakur and Gropp's model
4. Implemented in MPI
Modeling Point-to-point Communication
11
• Blocking send and non-blocking send
• Blocking recv = non-blocking recv + wait
• Waitall = array of blocking wait requests

[Figures: (a) concurrent sender & receiver: 9 units of bandwidth time are experienced at the receiver P1; (b) early receiver (t = 5, t = 18): 8 units of wait time, 2 units of latency, and 10 units of bandwidth are experienced at the receiver P1.]

Example: α = 2, nβ = 10, memory copy = 2
Send: t_out = t_in + memcopy
Recv: t_out = max(t_in + 1, t + α + nβ), where t ∈ [t0, t1, …, tn]
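The send/receive rules above, evaluated per network configuration with a vector of timestamps, can be sketched as follows (function names and the single-configuration numbers are illustrative; α = 2 and nβ = 10 follow the slide's example):

```python
# Sketch of the point-to-point rules above, evaluated per network
# configuration: every timestamp is a vector with one entry per
# configuration. MEMCOPY / alpha / n_beta values are illustrative.
MEMCOPY = 2  # eager-protocol copy cost at the sender

def send_exit(t_in):
    # The sender returns once the message is copied to the system buffer.
    return [t + MEMCOPY for t in t_in]

def recv_exit(t_in, t_send, alpha, n_beta):
    # Receiver exit: max(t_in + 1, t_send + alpha + n*beta),
    # evaluated independently for each configuration.
    return [max(ti + 1, ts + a + nb)
            for ti, ts, a, nb in zip(t_in, t_send, alpha, n_beta)]

# One configuration with the slide's numbers (alpha=2, n*beta=10):
print(send_exit([10]))                  # → [12]
print(recv_exit([5], [18], [2], [10]))  # → [30]
```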
Modeling Collective Communication
12
Exit time per configuration c: t_out^c = max over all processes of t_in^c, plus model_time(n)
[Figures: (d) a collective where all processes exit at t = 18; (e) a collective with multiple network configurations.]
All processes perform an MPI_Allreduce and obtain t_in = [10, 20, 30, …], where each entry is the maximum t_in^c across processes. All processes then exit the collective (e.g., MPI_Alltoall) at t_out = [10 + m0, 20 + m1, 30 + m2, …], where m_i is the model_time(n) of the ith configuration under Thakur and Gropp's model.
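The collective rule can be sketched with an elementwise max standing in for the MPI_Allreduce (function name and the m_i values are illustrative):

```python
# Sketch of the collective rule above: an elementwise max across the
# processes' enter-time vectors stands in for the MPI_Allreduce, and
# each configuration then adds its own modeled time m_i.
def collective_exit(enter_times, model_times):
    # enter_times: one vector per process, one entry per configuration
    # model_times: modeled collective cost m_i per configuration
    t_in = [max(col) for col in zip(*enter_times)]   # allreduce(max)
    return [t + m for t, m in zip(t_in, model_times)]

# Two processes, three network configurations:
print(collective_exit([[8, 20, 30], [10, 15, 25]], [4, 3, 2]))
# → [14, 23, 32]
```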
Progression logical times for multiple network configurations
Scenarios: trace-replay of a process can be blocked
1. Blocking receive: A process must receive a logical timestamp from the matching sending process in order to proceed
2. Collective: Each process must perform an MPI_Allreduce operation to obtain the largest logical enter timestamps from all processes
Traces are generated for a correctly executed MPI program
1) A blocking receive operation always has a matching send operation
2) All P2P communications before a collective operation must complete before the collective operation
13
Experimental Results
Benchmarks & Applications
• 9 DOE Design Forward extracted kernels, miniapps, and full-sized applications
• NPB: NAS Parallel Benchmarks
Validation
Classification & Results
• Classification criteria
• Classification results
• Classification vs. application problem sizes
Simulation Performance
• Speed-up: Simulation time vs. application time
• Overhead: Simulation time vs. number of network configurations
14
Validation
15
• 1st column (eag.): eager protocol + Hockney's model
• 2nd column (eag.*): eager protocol + look-up-table approach
• 3rd column (rend.): rendezvous protocol + look-up table
Network Configurations Selection
16
Ethernet 10G, latency model (bandwidth fixed, latency scaled):
              x/8    x/4    x/2    1x    2x    4x    8x
Latency (μs)  40     20     10     5     2.5   1.25  0.625
BW (Gbps)     10     10     10     10    10    10    10

Ethernet 10G, bandwidth model (latency fixed, bandwidth scaled):
              x/8    x/4    x/2    1x    2x    4x    8x
Latency (μs)  5      5      5      5     5     5     5
BW (Gbps)     1.25   2.5    5      10    20    40    80
Models:
• Latency model
• Bandwidth model
• Communication model
Classification
17
Category | Execution Time | Scaling Latency and Bandwidth
Computation-bound (Comp.) | T_comp ≥ 90% | Computation time is insensitive to 8x and x/8 (< 5% difference)
Load-imbalance-bound (Imb.) | T_wait ≥ 25% | Wait time is insensitive to 8x and x/8 (< 5% difference)
Bandwidth-bound (BW) | T_comm ≥ 25% | T_comm is insensitive to 8x and x/8 latency; T_comm drops by 2x when BW goes from BW/2 to 2·BW
Latency-bound (Latency) | T_comm ≥ 25% | T_comm is insensitive to 8x and x/8 BW; T_comm decreases by 2x as latency goes from 2L to L/2
Communication-bound (Comm.) | T_comm ≥ 25% | T_comm decreases by a factor of 2 as BW increases by a factor of 4 and as latency improves by a factor of 2

T_comm = T_latency + T_BW + T_wait
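The criteria in the table can be sketched as a rule cascade. The check order and the boolean inputs are my assumptions; the actual tool derives the sensitivities from its time counters under the 8x and x/8 sweeps:

```python
# Sketch of the classification criteria above. Thresholds follow the
# table; the sensitivity flags stand in for the tool's 8x / x/8 sweep
# results, and the order of the checks is my assumption.
def classify(comp_frac, wait_frac, comm_frac,
             bw_sensitive, lat_sensitive, wait_sensitive):
    if comp_frac >= 0.90:
        return "Comp."                   # computation-bound
    if wait_frac >= 0.25 and not wait_sensitive:
        return "Imb."                    # load-imbalance-bound
    if comm_frac >= 0.25:
        if bw_sensitive and lat_sensitive:
            return "Comm."               # communication-bound
        if bw_sensitive:
            return "BW"                  # bandwidth-bound
        if lat_sensitive:
            return "Latency"             # latency-bound
    if 0.10 <= comm_frac < 0.25:
        return "-s"                      # merely sensitive
    return "unclassified"

print(classify(0.95, 0.02, 0.03, False, False, False))   # → Comp.
```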
Classification Results: BigFFT
18
-s: sensitive, if T_comm ∈ [10%, 25%)
BigFFT(100) is load-imbalance-bound (Imb.) as wait time accounts for nearly 25% of total time with the QDR communication model
Classification Results: CLAMR
19
-s: sensitive, if T_comm ∈ [10%, 25%)
CLAMR(64) is latency-bound as the latency time becomes negligible with decreasing latency under the 1GbE latency model
[Chart: time in seconds (WAIT, LAT, BW, COMP, Total) across the x/8–8x configurations.]
Simulation Speed-up
20
(a) Simulation shows a 3–45× speedup for 64-rank runs of the NPB benchmarks (Class C)
(b) 2–15× speedup for 4096-rank runs of the NPB benchmarks (Class D)
Simulation Time vs. Number of Network Configurations
21
Negligible simulation overhead as the number of simulated network configurations increases
Conclusions & Future work
22
Trace-based and communication-centric fast classification tool:
• Provides insight into application performance characteristics
• Predicts execution time on many network configurations in nearly the same time needed to predict the time for a single configuration
• Enables new analyses that would be too computationally expensive with traditional, one-configuration-at-a-time simulation
• Classification results provide early diagnosis for further investigation
• Could assist code optimization and better overlap of communication and computation

Future work:
• Explore the tradeoff of various overlapping options for the bandwidth and latency time counters
• Validate classification results with larger runs
• Model computation by scaling processor frequency
Thanks & Questions
Fast Classification of MPI Applications Using Lamport’s Logical Clocks
Zhou Tong, Scott Pakin, Michael Lang, Xin Yuan
Florida State University
Los Alamos National Laboratory
IEEE International Parallel & Distributed Processing Symposium (IPDPS '16), Chicago, Illinois, May 23–27, 2016
23
24
Motivation
25
• Conventional trace-based performance simulation and analysis process
• Useful but expensive:
  • One simulation per target platform
  • Time-consuming and resource-demanding
  • Requires a large number of simulations
• Can we simulate the performance of an application on many interconnection networking options in one run?
[Flowchart: 1. message-passing code → 2. MPI & tracing library → 4. trace files → 5. trace-driven simulator → visualization & analysis; code modification feeds back into the code, and parameter modification feeds 3. sequential & parallel machine parameters into the simulator.]
Fast Classification Tool
• Simulates performance for many networking options in one run
• Extended Lamport’s Logical clocks maintains time counters that track computation, bandwidth, latency and wait time
• Tests application’s sensitivity to the slowdown and speedup of bandwidth and latency of each interconnection technology
• Classify based on the performance trend
26
Application sensitivity → classification:
• Bandwidth → Bandwidth-bound
• Latency → Latency-bound
• Both → Communication-bound
• Neither → Computation-bound
• Other: load imbalance → Load-imbalance-bound
Simulation
27
Process 0:
MPI_Barrier: exit time = 50;
…
MPI_Isend: enter time = 100;
Dest = 1;
Datatype = INT;
Count = 4;
Tag=1000;
Communicator=2 (MPI_COMM_WORLD)
MPI_Isend: exit time = 150;
…
Process 1:
MPI_Barrier: exit time = 75;
…
MPI_Recv: enter time = 200;
Source= 0;
Datatype = INT;
Count = 4;
Tag=1000;
Communicator=2 (MPI_COMM_WORLD)
MPI_Recv: exit time = 300;
…
Assumptions:
1) Process 0 and Process 1 exited the MPI_Barrier at logical time 0
2) It takes 100 ns to send a 4-integer message from Process 0 to Process 1
Process 0:
T = T_barrier = 0
T_in = T_barrier + T_comp = 0 + (100 − 50) = 50
T_out = T_in + T_isend = 50 + (150 − 100) = 100

Process 1:
T_in = T_barrier + T_comp = 0 + (200 − 75) = 125
T_out = max(T_in + 1, t + α + nβ) = max(125 + 1, 50 + 100) = max(126, 150) = 150
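The arithmetic above can be checked in a few lines (variable names are mine; all times are the slide's nanosecond values):

```python
# Checking the worked example: both processes leave MPI_Barrier at
# logical time 0; the 4-integer message costs 100 ns (alpha + n*beta).
T_barrier = 0

# Process 0: compute from barrier exit (wall 50) to Isend entry (wall 100).
T0_in = T_barrier + (100 - 50)          # 50
T0_out = T0_in + (150 - 100)            # Isend's traced duration → 100

# Process 1: compute from barrier exit (wall 75) to Recv entry (wall 200).
T1_in = T_barrier + (200 - 75)          # 125
t_send = T0_in                          # sender's logical timestamp: 50
T1_out = max(T1_in + 1, t_send + 100)   # max(126, 150) = 150

print(T0_out, T1_out)   # → 100 150
```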
Benchmarks & Applications
9 DOE Design Forward extracted kernels, miniapps, and full-sized applications; NPB: NAS Parallel Benchmarks
28
Name | Description | Type
AMR | BoxLib adaptive mesh refinement (cosmology) | Miniapp
BigFFT | 3D fast Fourier transform solver | Kernel
CR | Crystal Router (Nek5000) | Kernel
FB | Halo-update PDE solver code | Application
MG | Geometric multigrid elliptic solver | Application
MiniFE | Finite element solver | Miniapp
PARTISN | Neutral-particle transport | Application
NPB | NAS Parallel Benchmarks | Application
CLAMR | Cell-based adaptive mesh refinement | Miniapp
Classification: BigFFT & Crystal Router
29
BigFFT(100) is load-imbalance-bound (Imb.) as wait time accounts for nearly 25% of total time with the QDR communication model.
CR(64) is communication-bound as the communication time becomes negligible with faster interconnection networks under the 10G Ethernet communication model.
Classification Results: MiniFE
30
-s: sensitive, if T_comm ∈ [10%, 25%)
MiniFE(1152) is computation-bound (Comp.) as it spends over 90% of its time on computation with the 10G Ethernet communication model
Classification vs. Application Problem Sizes
31
The percentage of computation time decreases as the problem size and the number of ranks in AMG increase.
Processing (cont.)
32
Clock Synchronization:
• Computation time is measured from the computation time observed in the DUMPI trace
• Communication time is modeled by network bandwidth and latency parameters
• Ensures partial ordering of causally related events
• Increments of 1 unit of time for both computation and communication

Factors affecting performance:
• Computation (architecture): CPU frequency, cache, processor/core affinity, memory contention, NIC contention
• Communication (network): topology, network delay, network contention, routing, queuing/buffering
• Application: problem decomposition, task-mapping