Fast Classification of MPI Applications Using Lamport’s Logical Clocks
Zhou Tong, Scott Pakin, Michael Lang, Xin Yuan
Florida State University
Los Alamos National Laboratory
1
Motivation
2
Conventional trace-based performance simulation and analysis process
• Useful but expensive:
  1. One simulation per target platform
  2. Time-consuming and resource-demanding
  3. Requires a large number of simulations
Can we simulate the performance of an application on many interconnection networking options in one run?
Message Passing Trace Files
Trace-driven Simulator
Networking Options
1GbE 10GbE IB QDR
Result Result Result
Motivation (cont.)
Fast: one simulation run produces performance prediction on all pre-defined interconnect technologies
Classification:
1. Test the application’s sensitivity to slow-down and speed-up scenarios of each interconnection technology
2. Identify performance-limiting factors by observing the performance trend and classify each application into:
• Bandwidth-bound (BW)
• Latency-bound (L)
• Communication-bound (Comm.)
• Computation-bound (Comp.)
• Load-imbalance-bound (Imb.)
3
Interconnection Technologies
InfiniBand
SDR
DDR
QDR
FDR
EDR
HDR
Ethernet
1GbE
10GbE
20GbE
40GbE
80GbE
…
Other
Classification Example
4
• Communication time and wait time are not affected by changes in bandwidth and latency.
• Upgrading to a faster interconnection network will probably not improve its performance, but investing in CPUs will.
MiniFE: Finite element solver
• One simulation – 7 predictions:
  • x/8 (1.25 Gbps, 40 μs)
  • x/4 (2.5 Gbps, 20 μs)
  • x/2 (5 Gbps, 10 μs)
  • 1x (10 Gbps, 5 μs)
  • 2x (20 Gbps, 2.5 μs)
  • 4x (40 Gbps, 1.25 μs)
  • 8x (80 Gbps, 0.625 μs)
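The seven predictions come from scaling the 10 GbE baseline (10 Gbps, 5 μs) by powers of two: bandwidth multiplied by the factor, latency divided by it. A minimal sketch of that sweep (constant and function names are illustrative, not from the tool):

```python
# Sketch: derive the seven (bandwidth, latency) pairs from the
# 10 GbE baseline by powers-of-two scaling. Names are illustrative.
BASE_BW_GBPS = 10.0   # 1x bandwidth
BASE_LAT_US = 5.0     # 1x latency

def sweep(factors=(1/8, 1/4, 1/2, 1, 2, 4, 8)):
    # Bandwidth scales up with the factor; latency scales down with it.
    return [(BASE_BW_GBPS * f, BASE_LAT_US / f) for f in factors]

for bw, lat in sweep():
    print(f"{bw:g} Gbps, {lat:g} us")
```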
Our Tool
• One simulation predicts application performance for many latency and bandwidth parameters
• Uncovers the benefits of various networking options
• Classifies the application using the predicted performance trend for a range of network configurations
Usage
• Application: rapidly forecast communication-related performance bottlenecks
• System: quickly gauge the relative benefits of various networking options
• Complements slower but more accurate simulation approaches
5
Flowchart
6
Input: DUMPI traces
Fast Classification Tool
Output: classification summary
Identify performance characteristics
1. Computation
2. Load-imbalance
3. Communication
• Bandwidth
• Latency
MPI Application
DUMPI
traces
DUMPI
Process
0
DUMPI
Process
1
Fast Classification Tool
Calculating Logical Timestamps
Update Logical Time Counters
Clock Synchronization
Rank 0
Processing
Rank 1
Processing
Rank n
Processing
Classification Summary
DUMPI
7
DUMPI: MPI communication trace
• Developed at Sandia National Laboratories
• Records communication-related events:
  • Enter/exit wall time (in nanoseconds)
  • Source/destination
  • Communicator
  • Status
  • Count
  • Datatype
  • Tag
• Generally, no topology or task-mapping information is provided
Process 0:
MPI_Init: enter time = 10;
MPI_Init: exit time = 30;
…
MPI_Isend: enter time = 100;
  Dest = 1;
  Datatype = INT;
  Count = 4;
  Tag = 1000;
  Communicator = 2 (MPI_COMM_WORLD)
MPI_Isend: exit time = 200;
…
MPI_Finalize: enter time = 300;
MPI_Finalize: exit time = 350;
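Actual DUMPI traces are binary, but the rendering above suggests how per-call timing is recovered. A hypothetical sketch that parses such a listing and derives the computation gaps between calls (the listing text and helper names are my own):

```python
# Illustrative only: DUMPI traces are binary; this parses a
# human-readable rendering like the one above to recover per-call
# enter/exit times, from which compute gaps between calls follow.
import re

LISTING = """
MPI_Init: enter time = 10;
MPI_Init: exit time = 30;
MPI_Isend: enter time = 100;
MPI_Isend: exit time = 200;
"""

def parse(listing):
    events = re.findall(r"(\w+): (enter|exit) time\s*=\s*(\d+);", listing)
    return [(name, kind, int(t)) for name, kind, t in events]

def compute_gaps(events):
    # Gap from one call's exit to the next call's entry = computation.
    gaps = []
    for (n1, k1, t1), (n2, k2, t2) in zip(events, events[1:]):
        if k1 == "exit" and k2 == "enter":
            gaps.append((n1, n2, t2 - t1))
    return gaps

print(compute_gaps(parse(LISTING)))   # → [('MPI_Init', 'MPI_Isend', 70)]
```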
Lamport’s Logical Clocks
8
• What if we let t be a vector of timestamps, [t0, t1, …, tn]?
• What if we use non-unit communication and computation times?
[Timelines: P0 issues MPI_Send at t = 20 in both cases. Left: MPI_Recv enters at time 10 and exits at 21. Right: MPI_Recv enters at time 20 and exits at 21.]
• Provides a clock-synchronization mechanism that honors the happened-before relationship:
  a) Each process represents time with a local counter, T
  b) A process increments its counter by 1 unit of time before a communication or computation event
  c) The counter value t is transmitted with each message
  d) The receiver updates: T = max(t_in, t) + 1
     (t_in: routine enter time; t: timestamp received from the sender)
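Rules (a)–(d) amount to the classic Lamport clock update. A minimal sketch, assuming unit increments (class and method names are mine):

```python
# Minimal Lamport logical clock, following rules (a)-(d) above.
# Each process keeps a counter T; it ticks before every event,
# stamps outgoing messages, and on receive takes the max.
class LamportClock:
    def __init__(self):
        self.T = 0

    def tick(self):        # rule (b): advance before a local event
        self.T += 1
        return self.T

    def send(self):        # rule (c): stamp the outgoing message
        return self.tick()

    def recv(self, t_msg): # rule (d): T = max(T, t_msg) + 1
        self.T = max(self.T, t_msg) + 1
        return self.T

p0, p1 = LamportClock(), LamportClock()
t = p0.send()   # p0 sends at logical time 1
p1.tick()       # p1 does local work: logical time 1
p1.recv(t)      # receive honors happened-before: max(1, 1) + 1 = 2
```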
Simulation with Extended Lamport’s Logical Clocks
9
Process 0
MPI_Init
MPI_Barrier
Compute 5
MPI_Send
Compute 5
MPI_Recv:
MPI_Finalize
Goal: find the execution time of the application under different system configurations
1. Decide how to count the computation and communication time
2. Decide when each routine will return, using the modeled communication time
Process 1
MPI_Init
MPI_Barrier
Compute 10
MPI_Recv
Compute 10
MPI_Send
MPI_Finalize
1. MPI_Send: 10 ns to copy the message into the system buffer (eager protocol)
2. MPI_Recv: the logical exit time depends on the time of the matching send and the point-to-point communication time modeled by Hockney's model (20 ns)
[Timeline: computation and send/receive phases of P0 and P1 on a 0–50 ns axis, showing the matching MPI_Send/MPI_Recv pairs.]
Implementation
10
1. Extended Lamport’s logical clocks
2. Logical time counters:
  • Computation: gap from the exit time of an MPI routine to the entry time of the next
  • Wait: time spent waiting for the corresponding party to start its communication
  • Latency: the fixed latency for the first byte to reach the receiver
  • Bandwidth: time needed for the rest of the message to reach the receiver
3. Modeling communication:
  • P2P: Hockney's model: α + nβ
    (α = latency, β = reciprocal bandwidth, n = message size)
  • Collective: Thakur and Gropp's model
4. Implemented in MPI
Modeling Point-to-point Communication
11
• Blocking send and non-blocking send
• Blocking recv = non-blocking recv + wait
• Waitall = array of blocking wait requests

[Figures: (a) concurrent sender & receiver: 9 units of bandwidth time are experienced at the receiver P1; (b) early receiver (t = 5, t = 18): 8 units of wait time, 2 units of latency, and 10 units of bandwidth are experienced at the receiver P1.]

Example: α = 2, nβ = 10, memory copy = 2
Send: t_out = t_in + memcopy
Recv: t_out = max(t_in + 1, t + α + nβ), where t ∈ [t0, t1, …, tn]
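The send/receive rules above, evaluated per network configuration with a vector of timestamps, can be sketched as follows (function names and the single-configuration numbers are illustrative; α = 2 and nβ = 10 follow the slide's example):

```python
# Sketch of the point-to-point rules above, evaluated per network
# configuration: every timestamp is a vector with one entry per
# configuration. MEMCOPY / alpha / n_beta values are illustrative.
MEMCOPY = 2  # eager-protocol copy cost at the sender

def send_exit(t_in):
    # The sender returns once the message is copied to the system buffer.
    return [t + MEMCOPY for t in t_in]

def recv_exit(t_in, t_send, alpha, n_beta):
    # Receiver exit: max(t_in + 1, t_send + alpha + n*beta),
    # evaluated independently for each configuration.
    return [max(ti + 1, ts + a + nb)
            for ti, ts, a, nb in zip(t_in, t_send, alpha, n_beta)]

# One configuration with the slide's numbers (alpha=2, n*beta=10):
print(send_exit([10]))                  # → [12]
print(recv_exit([5], [18], [2], [10]))  # → [30]
```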
Modeling Collective Communication
12
Exit time per configuration c: t_out^c = max over all processes of t_in^c, plus model_time(n)
[Figures: (d) a collective where all processes exit at t = 18; (e) a collective with multiple network configurations.]
All processes perform an MPI_Allreduce and obtain t_in = [10, 20, 30, …], where each entry is the maximum t_in^c across processes. All processes then exit the collective (e.g., MPI_Alltoall) at t_out = [10 + m0, 20 + m1, 30 + m2, …], where m_i is the model_time(n) of the ith configuration under Thakur and Gropp's model.
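The collective rule can be sketched with an elementwise max standing in for the MPI_Allreduce (function name and the m_i values are illustrative):

```python
# Sketch of the collective rule above: an elementwise max across the
# processes' enter-time vectors stands in for the MPI_Allreduce, and
# each configuration then adds its own modeled time m_i.
def collective_exit(enter_times, model_times):
    # enter_times: one vector per process, one entry per configuration
    # model_times: modeled collective cost m_i per configuration
    t_in = [max(col) for col in zip(*enter_times)]   # allreduce(max)
    return [t + m for t, m in zip(t_in, model_times)]

# Two processes, three network configurations:
print(collective_exit([[8, 20, 30], [10, 15, 25]], [4, 3, 2]))
# → [14, 23, 32]
```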
Progression logical times for multiple network configurations
Scenarios: trace-replay of a process can be blocked
1. Blocking receive: A process must receive a logical timestamp from the matching sending process in order to proceed
2. Collective: Each process must perform an MPI_Allreduce operation to obtain the largest logical enter timestamps from all processes
Traces are generated for a correctly executed MPI program
1) A blocking receive operation always has a matching send operation
2) All P2P communications before a collective operation must complete before the collective operation
13
Experimental Results
Benchmarks & Applications
• 9 DOE Design Forward extracted kernels, miniapps, and full-sized applications
• NPB: NAS Parallel Benchmarks
Validation
Classification & Results
• Classification criteria
• Classification results
• Classification vs. application problem sizes
Simulation Performance
• Speed-up: Simulation time vs. application time
• Overhead: Simulation time vs. number of network configurations
14
Validation
15
• 1st column (eag.): eager protocol + Hockney's model
• 2nd column (eag.*): eager protocol + look-up-table approach
• 3rd column (rend.): rendezvous protocol + look-up table
Network Configurations Selection
16
Ethernet 10G, latency model (bandwidth fixed, latency scaled):
              x/8    x/4    x/2    1x    2x    4x    8x
Latency (μs)  40     20     10     5     2.5   1.25  0.625
BW (Gbps)     10     10     10     10    10    10    10

Ethernet 10G, bandwidth model (latency fixed, bandwidth scaled):
              x/8    x/4    x/2    1x    2x    4x    8x
Latency (μs)  5      5      5      5     5     5     5
BW (Gbps)     1.25   2.5    5      10    20    40    80
Models:
• Latency model
• Bandwidth model
• Communication model
Classification
17
Category | Execution Time | Scaling Latency and Bandwidth
Computation-bound (Comp.) | T_comp ≥ 90% | Computation time is insensitive to 8x and x/8 (< 5% difference)
Load-imbalance-bound (Imb.) | T_wait ≥ 25% | Wait time is insensitive to 8x and x/8 (< 5% difference)
Bandwidth-bound (BW) | T_comm ≥ 25% | T_comm is insensitive to 8x and x/8 latency; T_comm drops by 2x when BW goes from BW/2 to 2·BW
Latency-bound (Latency) | T_comm ≥ 25% | T_comm is insensitive to 8x and x/8 BW; T_comm decreases by 2x as latency goes from 2L to L/2
Communication-bound (Comm.) | T_comm ≥ 25% | T_comm decreases by a factor of 2 as BW increases by a factor of 4 and as latency improves by a factor of 2

T_comm = T_latency + T_BW + T_wait
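The criteria in the table can be sketched as a rule cascade. The check order and the boolean inputs are my assumptions; the actual tool derives the sensitivities from its time counters under the 8x and x/8 sweeps:

```python
# Sketch of the classification criteria above. Thresholds follow the
# table; the sensitivity flags stand in for the tool's 8x / x/8 sweep
# results, and the order of the checks is my assumption.
def classify(comp_frac, wait_frac, comm_frac,
             bw_sensitive, lat_sensitive, wait_sensitive):
    if comp_frac >= 0.90:
        return "Comp."                   # computation-bound
    if wait_frac >= 0.25 and not wait_sensitive:
        return "Imb."                    # load-imbalance-bound
    if comm_frac >= 0.25:
        if bw_sensitive and lat_sensitive:
            return "Comm."               # communication-bound
        if bw_sensitive:
            return "BW"                  # bandwidth-bound
        if lat_sensitive:
            return "Latency"             # latency-bound
    if 0.10 <= comm_frac < 0.25:
        return "-s"                      # merely sensitive
    return "unclassified"

print(classify(0.95, 0.02, 0.03, False, False, False))   # → Comp.
```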
Classification Results: BigFFT
18
-s: sensitive, if T_comm ∈ [10%, 25%)
BigFFT(100) is load-imbalance-bound (Imb.) as wait time accounts for nearly 25% of total time with the QDR communication model
Classification Results: CLAMR
19
-s: sensitive, if T_comm ∈ [10%, 25%)
CLAMR(64) is latency-bound as the latency time becomes negligible with decreasing latency under the 1GbE latency model
[Chart: time in seconds (WAIT, LAT, BW, COMP, Total) across the x/8–8x configurations.]
Simulation Speed-up
20
(a) Simulation shows a 3–45× speedup for 64-rank runs of the NPB benchmarks (Class C)
(b) 2–15× speedup for 4096-rank runs of the NPB benchmarks (Class D)
Simulation Time vs. Number of Network Configurations
21
Negligible simulation overhead as the number of simulated network configurations increases
Conclusions & Future work
22
Trace-based and communication-centric fast classification tool:
• Provides insight into application performance characteristics
• Predicts execution time on many network configurations in nearly the same time needed to predict the time for a single configuration
• Enables new analyses that would be too computationally expensive with traditional, one-configuration-at-a-time simulation
• Classification results provide early diagnosis for further investigation
• Could assist code optimization and better overlap of communication and computation

Future work:
• Explore the tradeoff of various overlapping options for the bandwidth and latency time counters
• Validate classification results with larger runs
• Model computation by scaling processor frequency
Thanks & Questions
Fast Classification of MPI Applications Using Lamport’s Logical Clocks
Zhou Tong, Scott Pakin, Michael Lang, Xin Yuan
Florida State University
Los Alamos National Laboratory
IEEE International Parallel & Distributed Processing Symposium (IPDPS '16), Chicago, Illinois, May 23–27, 2016
23
24
Motivation
25
• Conventional trace-based performance simulation and analysis process
• Useful but expensive:
  • One simulation per target platform
  • Time-consuming and resource-demanding
  • Requires a large number of simulations
• Can we simulate the performance of an application on many interconnection networking options in one run?
[Flowchart: 1. message-passing code → 2. MPI & tracing library → 4. trace files → 5. trace-driven simulator → visualization & analysis; code modification feeds back into the code, and parameter modification feeds 3. sequential & parallel machine parameters into the simulator.]
Fast Classification Tool
• Simulates performance for many networking options in one run
• Extended Lamport’s Logical clocks maintains time counters that track computation, bandwidth, latency and wait time
• Tests application’s sensitivity to the slowdown and speedup of bandwidth and latency of each interconnection technology
• Classify based on the performance trend
26
Application sensitivity → classification:
• Bandwidth → Bandwidth-bound
• Latency → Latency-bound
• Both → Communication-bound
• Neither → Computation-bound
• Other: load imbalance → Load-imbalance-bound
Simulation
27
Process 0:
MPI_Barrier: exit time = 50;
…
MPI_Isend: enter time = 100;
Dest = 1;
Datatype = INT;
Count = 4;
Tag=1000;
Communicator=2 (MPI_COMM_WORLD)
MPI_Isend: exit time = 150;
…
Process 1:
MPI_Barrier: exit time = 75;
…
MPI_Recv: enter time = 200;
Source= 0;
Datatype = INT;
Count = 4;
Tag=1000;
Communicator=2 (MPI_COMM_WORLD)
MPI_Recv: exit time = 300;
…
Assumptions:
1) Process 0 and Process 1 exited the MPI_Barrier at logical time 0
2) It takes 100 ns to send a 4-integer message from Process 0 to Process 1
Process 0:
T = T_barrier = 0
T_in = T_barrier + T_comp = 0 + (100 − 50) = 50
T_out = T_in + T_isend = 50 + (150 − 100) = 100

Process 1:
T_in = T_barrier + T_comp = 0 + (200 − 75) = 125
T_out = max(T_in + 1, t + α + nβ) = max(125 + 1, 50 + 100) = max(126, 150) = 150
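The arithmetic above can be checked in a few lines (variable names are mine; all times are the slide's nanosecond values):

```python
# Checking the worked example: both processes leave MPI_Barrier at
# logical time 0; the 4-integer message costs 100 ns (alpha + n*beta).
T_barrier = 0

# Process 0: compute from barrier exit (wall 50) to Isend entry (wall 100).
T0_in = T_barrier + (100 - 50)          # 50
T0_out = T0_in + (150 - 100)            # Isend's traced duration → 100

# Process 1: compute from barrier exit (wall 75) to Recv entry (wall 200).
T1_in = T_barrier + (200 - 75)          # 125
t_send = T0_in                          # sender's logical timestamp: 50
T1_out = max(T1_in + 1, t_send + 100)   # max(126, 150) = 150

print(T0_out, T1_out)   # → 100 150
```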
Benchmarks & Applications
9 DOE Design Forward extracted kernels, miniapps, and full-sized applications; NPB: NAS Parallel Benchmarks
28
Name | Description | Type
AMR | BoxLib adaptive mesh refinement (cosmology) | Miniapp
BigFFT | 3D fast Fourier transform solver | Kernel
CR | Crystal Router (Nek5000) | Kernel
FB | Halo-update PDE solver code | Application
MG | Geometric multigrid elliptic solver | Application
MiniFE | Finite element solver | Miniapp
PARTISN | Neutral-particle transport | Application
NPB | NAS Parallel Benchmarks | Application
CLAMR | Cell-based adaptive mesh refinement | Miniapp
Classification: BigFFT & Crystal Router
29
BigFFT(100) is load-imbalance-bound (Imb.) as wait time accounts for nearly 25% of total time with the QDR communication model.
CR(64) is communication-bound as the communication time becomes negligible with faster interconnection networks under the 10G Ethernet communication model.
Classification Results: MiniFE
30
-s: sensitive, if T_comm ∈ [10%, 25%)
MiniFE(1152) is computation-bound (Comp.) as it spends over 90% of its time on computation with the 10G Ethernet communication model
Classification vs. Application Problem Sizes
31
The percentage of computation time decreases as the problem size and the number of ranks in AMG increase.
Processing (cont.)
32
Clock Synchronization:
• Computation time is measured from the computation time observed in the DUMPI trace
• Communication time is modeled by network bandwidth and latency parameters
• Ensures partial ordering of causally related events
• Increments of 1 unit of time for both computation and communication

Factors affecting performance:
• Computation (architecture): CPU frequency, cache, processor/core affinity, memory contention, NIC contention
• Communication (network): topology, network delay, network contention, routing, queuing/buffering
• Application: problem decomposition, task-mapping