
Center for Information Services and High Performance Computing (ZIH)

Deadlock-Free Oblivious Routing for Arbitrary Topologies

Jens Domke, Torsten Hoefler, Wolfgang E. Nagel

May 18th, 2011

Zellescher Weg 12, Willers-Bau A 219, 01062 Dresden, Tel. +49 0351 - 463 39114


Outline

1 Basics and previous work

2 Deadlocks

3 Deadlock-free SSSP routing algorithm

4 Simulations and measurements

5 Conclusion

Jens Domke Slide 2

Outline

Basics and previous work

InfiniBand interconnect

InfiniBand subnet manager – OpenSM

Motivation

Previous work


InfiniBand interconnect

Based on an open standard, developed by the InfiniBand Trade Association

One of the most widely used interconnects in the field of HPC

Figure: Top500 List, Interconnects, Nov. 2010 (pie chart; segments: Gigabit Ethernet, InfiniBand, Proprietary, Others; shares: 42.4%, 45.6%, 6.2%, 5.8%)



InfiniBand subnet manager – OpenSM

Tasks

Scan the components of the IB subnet

Initialize the IB ports

Calculate paths for each port pair in the subnet

Generate linear forwarding tables (LFT)

Configure the IB ports with additional preferences, e.g. QoS

Reconfiguration, if the subnet changes
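
As an illustration of destination-based forwarding, the following minimal sketch models a linear forwarding table as a per-switch map from destination to next hop. The three-switch topology, the names `lft`, `attached_switch`, and `follow_path`, and the use of node names instead of port numbers are assumptions made for the example; they are not taken from OpenSM.

```python
# Hypothetical line topology: s1 -- s2 -- s3, with one host per switch.
# Each table maps a destination to the next hop (standing in for an output port).
lft = {
    "s1": {"h1": "h1", "h2": "s2", "h3": "s2"},
    "s2": {"h1": "s1", "h2": "h2", "h3": "s3"},
    "s3": {"h1": "s2", "h2": "s2", "h3": "h3"},
}
attached_switch = {"h1": "s1", "h2": "s2", "h3": "s3"}


def follow_path(source, destination, max_hops=16):
    """Walk the forwarding tables hop by hop from source host to destination host."""
    path = [source]
    current = attached_switch[source]
    for _ in range(max_hops):
        path.append(current)
        next_hop = lft[current][destination]
        if next_hop == destination:
            path.append(destination)
            return path
        current = next_hop
    raise RuntimeError("forwarding loop or path longer than max_hops")
```

Because every switch looks only at the destination, the path of a packet is fixed once the tables are written, which is why path calculation and table generation are separate tasks.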



Implemented static/destination-based routing algorithms

MinHop

Up*/Down*

Fat-Tree

LASH

DOR


Motivation

General problems for most of the routing algorithms

No global balancing of the traffic ⇒ congestion reduces the bandwidth

Only designed for a small set of topologies

Not deadlock-free for every topology

Not usable for production systems, because of long runtime

The algorithm should support irregular topologies, because

HPC systems grow during their lifetime

Additional nodes such as I/O or login nodes are connected

Network components can fail


Previous work

Single-source-shortest-path routing algorithm

"Optimized Routing for Large-Scale InfiniBand Networks" [Hoefler et al., 2009] presented SSSP

Minimizes congestion through global balancing

Higher effective bisection bandwidth compared to other algorithms

Disadvantages of the presented approach

Algorithm is not deadlock-free

LFTs are calculated by an external program (not OpenSM)


Outline

Deadlocks

Definition

Deadlocks in interconnects

Approaches for deadlock-free routing

Theorem of Dally and Seitz

Virtual channels and channel dependency graph


Definition

Definition Deadlock [Tanenbaum, 2007]

A set of processes is deadlocked if each process in the set is waiting for an event that only a process in the set can cause.


Deadlocks in interconnects

Figure: Packet source, switch buffer, packet destination


Approaches for deadlock-free routing

Packet lifetime (only breaks deadlocks once they occur)

Controller principle

Up*/Down* routing

Virtual channels

"Deadlock-Free Message Routing in Multiprocessor Interconnection Networks" [Dally and Seitz, 1987]

Each link will be split into multiple virtual channels

Channel dependency graph




Theorem of Dally and Seitz

A routing algorithm for an interconnect is deadlock-free if and only if there are no cycles in the corresponding channel dependency graph.
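
The criterion can be checked mechanically. The sketch below is an illustration under stated assumptions: a path is represented as a list of channel names, the channel dependency graph gets an edge from channel a to channel b whenever some path uses a directly followed by b, and the function names `build_cdg` and `has_cycle` are hypothetical.

```python
from collections import defaultdict


def build_cdg(paths):
    """Channel dependency graph: edge a -> b if a path uses channel a, then b."""
    cdg = defaultdict(set)
    for path in paths:
        for channel in path:
            cdg[channel]  # register every used channel as a vertex
        for a, b in zip(path, path[1:]):
            cdg[a].add(b)
    return cdg


def has_cycle(cdg):
    """Depth-first search with three colors; reaching a grey vertex means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {v: WHITE for v in cdg}

    def visit(v):
        color[v] = GREY
        for w in cdg[v]:
            if color[w] == GREY or (color[w] == WHITE and visit(w)):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in list(cdg))


# Four channels on a ring; every route uses two consecutive channels.
ring_paths = [["c1", "c2"], ["c2", "c3"], ["c3", "c4"], ["c4", "c1"]]
```

With all four routes the dependencies close into a cycle, so this routing could deadlock; dropping any one route leaves an acyclic graph.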



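Virtual channels and channel dependency graph

Figure: Network with nodes n1 to n4 and channels c1 to c4, together with its channel dependency graph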



Figure: Network with nodes n1 to n4 where each channel is split into two virtual channels (c1,1 to c1,4 and c2,1 to c2,4), together with its channel dependency graph



Figure: Routes r1 to r3 over channels c1 to c4, and the same routes over the virtual channels c1,1 to c1,3 and c2,1 to c2,4


Outline

Deadlock-free SSSP routing algorithm

DFSSSP routing algorithm

How to identify the "weakest" edge?



Algorithm 1 DFSSSP routing algorithm

/* Phase 1: Identification of all network components */
Scan(...)
/* Phase 2: Calculate paths */
SSSP(...)
/* Phase 3: Assign paths to virtual layers */
RemoveDeadlocks(...)
/* Phase 4: Balancing of all virtual layers */
Balancing(...)



Algorithm 2 Remove deadlocks from the channel dependency graph (Phase 3)

Input: Linear forwarding tables
Output: Assign each path to a virtual layer

/* Initialization of layer 1 */
for all PortPairs(source, destination) do
    Update CDG[1] with the source-destination path
end for
/* Search cycles in the channel dependency graph */
for i = 1, ..., max − 1 do
    repeat
        Search for cycle in CDG[i]
        Identify "weakest" edge of the cycle
        Move port pairs or paths on this edge to CDG[i + 1]
    until no cycle found in CDG[i]
end for
Search for cycle in CDG[max]
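
A minimal Python sketch of this phase follows. It is an illustration, not the OpenSM implementation: paths are lists of channel names, the names `find_cycle`, `uses_edge`, and `assign_layers` are hypothetical, the CDG of a layer is rebuilt from scratch on every search for simplicity, and the "weakest" edge is taken to be the one induced by the fewest paths (one of several possible heuristics).

```python
def find_cycle(paths):
    """Return the edges of one cycle in the CDG induced by `paths`, or None."""
    succ = {}
    for path in paths:
        for channel in path:
            succ.setdefault(channel, set())
        for a, b in zip(path, path[1:]):
            succ[a].add(b)
    state = {}   # channel -> "open" (on the DFS stack) or "done"
    stack = []

    def visit(v):
        state[v] = "open"
        stack.append(v)
        for w in sorted(succ[v]):
            if state.get(w) == "open":
                cycle = stack[stack.index(w):] + [w]
                return list(zip(cycle, cycle[1:]))
            if w not in state:
                found = visit(w)
                if found:
                    return found
        stack.pop()
        state[v] = "done"
        return None

    for v in sorted(succ):
        if v not in state:
            found = visit(v)
            if found:
                return found
    return None


def uses_edge(path, edge):
    """True if the path uses the two channels of `edge` back to back."""
    return edge in zip(path, path[1:])


def assign_layers(paths, max_layers):
    """Distribute paths over virtual layers so that each layer's CDG is acyclic."""
    layers = [list(paths)] + [[] for _ in range(max_layers - 1)]
    for i in range(max_layers - 1):
        while (cycle := find_cycle(layers[i])) is not None:
            # Assumed heuristic: break the cycle at the edge with the fewest paths.
            weakest = min(cycle, key=lambda e: sum(uses_edge(p, e) for p in layers[i]))
            moved = [p for p in layers[i] if uses_edge(p, weakest)]
            layers[i] = [p for p in layers[i] if not uses_edge(p, weakest)]
            layers[i + 1].extend(moved)
    if find_cycle(layers[-1]) is not None:
        raise RuntimeError("not enough virtual layers")
    return layers


# Example: all clockwise routes on a unidirectional ring of four channels.
ring = ["c0", "c1", "c2", "c3"]
ring_paths = [[ring[(s + k) % 4] for k in range(n)] for s in range(4) for n in (1, 2, 3)]
layers = assign_layers(ring_paths, max_layers=2)
```

The loop terminates because every iteration moves at least one path out of the current layer. The final check mirrors the last line of Algorithm 2: if the last layer still contains a cycle, the available layers do not suffice.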


How to identify the "weakest" edge?

... to minimize the number of needed virtual layers.

Abstract formulation: "acyclic path partitioning" problem (APP)

Split a set of paths into subsets, each of which produces an acyclic channel dependency graph.

Shown to be NP-complete

Proof based on a polynomial transformation from the graph k-colorability problem to APP

APP is NP-complete ⇒ use a heuristic to identify the "weakest" edge

Edge with most paths in the cycle

Random edge of the cycle

Edge with smallest number of paths
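
The three heuristics can be written as one selection function. This is a sketch under assumptions: a cycle is given as a list of CDG edges (pairs of channel names), paths are lists of channel names, and `paths_on_edge` and `choose_weakest_edge` are hypothetical names.

```python
import random


def paths_on_edge(edge, paths):
    """Number of paths that use the two channels of `edge` back to back."""
    return sum(edge in zip(p, p[1:]) for p in paths)


def choose_weakest_edge(cycle_edges, paths, heuristic="fewest", rng=None):
    """Pick the edge of a cycle to break, using one of the three heuristics."""
    if heuristic == "most":
        return max(cycle_edges, key=lambda e: paths_on_edge(e, paths))
    if heuristic == "fewest":
        return min(cycle_edges, key=lambda e: paths_on_edge(e, paths))
    if heuristic == "random":
        return (rng or random).choice(cycle_edges)
    raise ValueError(f"unknown heuristic: {heuristic}")


# Example cycle c1 -> c2 -> c3 -> c1 with one, two, and three paths per edge.
cycle = [("c1", "c2"), ("c2", "c3"), ("c3", "c1")]
paths = [["c1", "c2"]] + [["c2", "c3"]] * 2 + [["c3", "c1"]] * 3
```

Here `choose_weakest_edge(cycle, paths, "fewest")` selects the edge from c1 to c2 and `"most"` selects the edge from c3 to c1. The choice determines how many paths move to the next layer in each step, and therefore how many layers are needed overall.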


Outline

Simulations and measurements

Simulations with IBSim

Real existing topologies

Measurements on a real system – Deimos

PC-Farm Deimos

Netgauge

BenchIT

NAS parallel benchmarks


Real existing topologies

Figure: Simulation with IBSim and ORCS [Schneider et al., 2009]. Two panels over the topologies CHiC, Deimos, JUROPA, Odin, Ranger, and Tsubame: effective bisection bandwidth (axis from 0 to 1) and runtime in s (logarithmic axis from 10^-4 to 10^4), each for MinHop, Up*/Down*, FatTree, LASH, DOR, SSSP, and DFSSSP.


Measurements on a real system – Deimos

HPC-system operated by ZIH

Linux Networx PC-Farm (13.9 TFlop/s)

726 compute nodes connected by 108 IB switches

2.6 GHz AMD Opteron X85 dual core

1, 2 or 4 processors per node

2 GByte RAM per core


Measurements on a real system – Deimos

Measurement environment and used benchmarks

Exclusive access

One MPI process per node (for measurements with ≤ 512 cores)

Same number of MPI processes ⇒ same compute nodes used

Eff. bisection bandwidth with Netgauge [Hoefler et al., 2007]

Runtime and bandwidths of pure MPI communication measured with micro-benchmarks (BenchIT [Juckeland et al., 2004])

Performance gain for NASA application benchmarks (NAS Parallel Benchmarks [Bailey et al., 1995])


Netgauge

Figure: Approximation with 1000 random bisections. Effective bisection bandwidth in MiByte/s (axis from 0 to 400) over the number of cores (128, 256, 512, 1024) for MinHop, LASH, SSSP, and DFSSSP.


BenchIT

Figure: Collective N-to-N MPI operation on 128 nodes. Runtime in s (axis from 0 to 0.08) over the number of elements in the send buffer (#floats, 0 to 4096).


NAS parallel benchmarks

Figure: BT, class C – equation system solver. Total Gflop/s (axis from 0 to 250) over the number of cores (121, 256, 484, 1024).


Conclusion

Developed deadlock-free SSSP routing for arbitrary network topologies

DFSSSP and SSSP routing algorithms integrated into OpenSM

Patch available: http://unixer.de/research/dfsssp/

Not limited to InfiniBand; usable for all interconnects which support virtual channels

Modeled the "acyclic path partitioning" problem; proved NP-completeness

Doubled the eff. bisection bandwidth of Deimos for 512 nodes

Performance gain (communication bound) for application benchmarks of up to 95%


References

D. Bailey, T. Harris, W. Saphir, R. V. D. Wijngaart, A. Woo, and M. Yarrow. The NAS Parallel Benchmarks 2.0. Technical Report NAS-95-020, NASA Ames Research Center, Dec. 1995.

W. Dally and C. Seitz. Deadlock-Free Message Routing in Multiprocessor Interconnection Networks. Computers, IEEE Transactions on, C-36(5):547–553, May 1987. ISSN 0018-9340. doi: 10.1109/TC.1987.1676939.

T. Hamada and N. Nakasato. InfiniBand Trade Association, InfiniBand Architecture Specification, Volume 1, Release 1.0. In International Conference on Field Programmable Logic and Applications, pages 366–373, 2005.

T. Hoefler, T. Mehlan, A. Lumsdaine, and W. Rehm. Netgauge: A Network Performance Measurement Framework. In High Performance Computing and Communications, Third International Conference, HPCC 2007, Houston, USA, September 26-28, 2007, Proceedings, volume 4782, pages 659–671. Springer, Sept. 2007. ISBN 978-3-540-75443-5.

T. Hoefler, T. Schneider, and A. Lumsdaine. Optimized Routing for Large-Scale InfiniBand Networks. In 17th Annual IEEE Symposium on High Performance Interconnects (HOTI 2009), Aug. 2009.

G. Juckeland, S. Borner, M. Kluge, S. Kolling, W. Nagel, S. Pfluger, H. Roding, S. Seidl, T. William, and R. Wloch. BenchIT – Performance Measurement and Comparison for Scientific Applications. In F. P. G.R. Joubert, W.E. Nagel and W. Walter, editors, Parallel Computing - Software Technology, Algorithms, Architectures and Applications, volume 13 of Advances in Parallel Computing, pages 501–508. North-Holland, 2004.

T. Schneider, T. Hoefler, and A. Lumsdaine. ORCS: An Oblivious Routing Congestion Simulator. Technical Report 675, Indiana University, Feb. 2009.

A. S. Tanenbaum. Modern Operating Systems. Prentice Hall Press, Upper Saddle River, NJ, USA, 3rd edition, 2007. ISBN 9780136006633.


Backup – Complexity analysis

Time complexity

The time complexity for the DFSSSP routing algorithm is

$$O\left(|N|^2 \cdot (\log |N| + \nabla) + |N| \cdot |C| + \nabla \cdot (|C| + |E|)\right)$$

Memory complexity

The memory complexity for DFSSSP is

$$O\left(\nabla \cdot d(I) \cdot |N|^2 + \nabla \cdot (|C| + |E|) + |N|\right)$$

Variables:

N – nodes in the network

C – channels/links

E – edges in the channel dependency graph

∇ – minimum number of virtual layers needed

d(I) – diameter of network I


Backup – InfiniBand subnet

Figure: InfiniBand subnet (labels: Switch 1, Switch 2, ..., Switch n; Link; HCA; TCA; Router; OpenSM; Subnet; compute node with CPUs; I/O node with tapes)


Backup – Metrics for interconnects

Significant properties

Low latency

High bandwidth for packet transfer

Absence of deadlocks in the routing

Established metrics to rate the interconnect

Latency

Bandwidth

Bisection bandwidth

Effective bandwidth

Effective bisection bandwidth


Backup – SSSP algorithm

Algorithm 3 SSSP routing algorithm (Phase 2)

Input: Context of DFSSSP routing
Output: Linear forwarding tables

/* N-to-N, multi-graph Dijkstra algorithm */
for all Port ∈ Subnet do
    Dijkstra(...) for this port as source
    Update all linear forwarding tables
    Increase edge weights
end for
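
The loop can be sketched in Python as follows. Several points are assumptions made for the illustration and are not stated on the slide: the network is an undirected graph given as adjacency lists, the Dijkstra source is treated as the destination of the forwarding entries, every edge starts with weight 1, and an edge's weight grows by one for each path routed over it. The names `dijkstra_tree` and `sssp_routing` are hypothetical.

```python
import heapq


def dijkstra_tree(graph, weight, root):
    """Shortest-path tree towards `root`; returns the next hop for every other node."""
    dist = {root: 0}
    next_hop = {}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v in graph[u]:
            nd = d + weight[(v, u)]  # traffic flows v -> u towards the root
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                next_hop[v] = u
                heapq.heappush(heap, (nd, v))
    return next_hop


def sssp_routing(graph):
    """Build tables lft[node][destination] = next hop, one Dijkstra run per
    destination, raising edge weights after each run to spread later paths."""
    weight = {(u, v): 1 for u in graph for v in graph[u]}
    lft = {u: {} for u in graph}
    for dest in sorted(graph):
        next_hop = dijkstra_tree(graph, weight, dest)
        for node, hop in next_hop.items():
            lft[node][dest] = hop
        # Every source's path to dest adds one unit of load to each edge it uses.
        for src in next_hop:
            node = src
            while node != dest:
                weight[(node, next_hop[node])] += 1
                node = next_hop[node]
    return lft


# Example: four switches connected in a cycle.
square = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
tables = sssp_routing(square)
```

Because heavily used edges become more expensive, later Dijkstra runs tend to prefer less loaded links, which is the global balancing effect the algorithm aims for.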
