Center for Information Services and High Performance Computing (ZIH) Deadlock-Free Oblivious Routing for Arbitrary Topologies Jens Domke , Torsten Hoefler, Wolfgang E. Nagel May 18th, 2011 Zellescher Weg 12 Willers-Bau A 219 01062 Dresden Tel. +49 0351 - 463 39114 Jens Domke ( [email protected] )

Center for Information Services and High Performance Computing (ZIH)

Deadlock-Free Oblivious Routing

for Arbitrary Topologies

Jens Domke, Torsten Hoefler, Wolfgang E. Nagel

May 18th, 2011

Zellescher Weg 12, Willers-Bau A 219, 01062 Dresden, Tel. +49 0351 - 463 39114

Jens Domke ( [email protected] )

Outline

1 Basics and previous work

2 Deadlocks

3 Deadlock-free SSSP routing algorithm

4 Simulations and measurements

5 Conclusion

Jens Domke Slide 2

Outline

Basics and previous work

InfiniBand interconnect

InfiniBand subnet manager – OpenSM

Motivation

Previous work

InfiniBand interconnect

Based on an open standard, developed by the InfiniBand Trade Association

One of the most widely used interconnects in the field of HPC

[Pie chart: interconnect families Gigabit Ethernet, InfiniBand, Proprietary, Others; segment shares 42.4%, 45.6%, 6.2%, 5.8%]

Figure: Top500 List, Interconnects, Nov. 2010

InfiniBand subnet manager – OpenSM

Tasks

Scan the components of the IB subnet

Initialize the IB ports

Calculate paths for each port pair in the subnet

Generate linear forwarding tables (LFT)

Configure the IB ports with additional preferences, e.g. QoS

Reconfiguration, if the subnet changes

InfiniBand subnet manager – OpenSM

Implemented static/destination-based routing algorithms

MinHop

Up*/Down*

Fat-Tree

LASH

DOR

Motivation

General problems for most of the routing algorithms

No global balancing of the traffic ⇒ congestion reduces the bandwidth

Only designed for a small set of topologies

Not deadlock-free for every topology

Not usable for production systems, because of long runtime

The algorithm should support irregular topologies, because

HPC-systems grow in their lifetime

Additional nodes, such as I/O or login nodes, are connected

Network components can fail

Previous work

Single-source-shortest-path routing algorithm

"Optimized Routing for Large-Scale InfiniBand Networks" [Hoefler et al., 2009] presented SSSP

Minimizes congestion through global balancing

Higher effective bisection bandwidth compared to other algorithms

Disadvantages of the presented approach

Algorithm is not deadlock-free

LFTs are calculated by an external program (not OpenSM)

Outline

Deadlocks

Definition

Deadlocks in interconnects

Approaches for deadlock-free routing

Theorem of Dally and Seitz

Virtual channels and channel dependency graph

Definition

Definition Deadlock [Tanenbaum, 2007]

A set of processes is deadlocked if each process in the set is waiting for an event that only a process in the set can cause.

Deadlocks in interconnects

[Figure legend: packet source, switch buffer, packet destination]

Approaches for deadlock-free routing

Packet lifetime (only breaks deadlocks after they occur)

Controller principle

Up*/Down* routing

Virtual channels

"Deadlock-Free Message Routing in Multiprocessor Interconnection Networks" [Dally and Seitz, 1987]

Each link is split into multiple virtual channels

Channel dependency graph

Theorem of Dally and Seitz

Theorem of Dally and Seitz

A routing algorithm for an interconnect is deadlock-free iff there are no cycles in the corresponding channel dependency graph.
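The theorem turns deadlock freedom into a graph check. The following is a minimal sketch, assuming each route is given as an ordered list of channel names; the helper names `build_cdg` and `has_cycle` are hypothetical and not taken from the talk or from OpenSM. A dependency edge is added whenever a route requests one channel while still holding the previous one.

```python
def build_cdg(paths):
    """Build a channel dependency graph (CDG) from routes.

    Each path is an ordered list of channel names. The CDG has one vertex
    per channel and an edge a -> b if some path uses b directly after a.
    """
    cdg = {}
    for path in paths:
        for channel in path:
            cdg.setdefault(channel, set())
        for held, requested in zip(path, path[1:]):
            cdg[held].add(requested)
    return cdg


def has_cycle(cdg):
    """Iterative depth-first search with three colors; True if a cycle exists."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {channel: WHITE for channel in cdg}
    for start in cdg:
        if color[start] != WHITE:
            continue
        color[start] = GREY
        stack = [(start, iter(cdg[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if color[nxt] == GREY:      # back edge closes a cycle
                    return True
                if color[nxt] == WHITE:
                    color[nxt] = GREY
                    stack.append((nxt, iter(cdg[nxt])))
                    break
            else:                           # all successors explored
                color[node] = BLACK
                stack.pop()
    return False


# Illustrative input: four routes around a ring of channels c1..c4.
ring = [["c1", "c2"], ["c2", "c3"], ["c3", "c4"], ["c4", "c1"]]
print(has_cycle(build_cdg(ring)))       # True: the routes can deadlock
print(has_cycle(build_cdg(ring[:3])))   # False: without the last route the CDG is acyclic
```

Moving the fourth route to its own set of virtual channels gives two acyclic graphs, which is the idea behind splitting links into virtual channels.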

Virtual channels and channel dependency graph

[Figure labels: nodes n1–n4; channels c1–c4]

Virtual channels and channel dependency graph

[Figure labels: nodes n1–n4; virtual channels c1,1–c1,4 and c2,1–c2,4]

Virtual channels and channel dependency graph

[Figure labels: r1–r3; channels c1–c4; virtual channels c1,1–c1,3 and c2,1–c2,4]

Outline

Deadlock-free SSSP routing algorithm

DFSSSP routing algorithm

How to identify the "weakest" edge?

DFSSSP routing algorithm

Algorithm 1 DFSSSP routing algorithm

/* Phase 1: Identification of all network components */
Scan(...)
/* Phase 2: Calculate paths */
SSSP(...)
/* Phase 3: Assign paths to virtual layers */
RemoveDeadlocks(...)
/* Phase 4: Balancing of all virtual layers */
Balancing(...)

DFSSSP routing algorithm

Algorithm 2 Remove deadlocks from the channel dependency graph (Phase 3)

Input: Linear forwarding tables
Output: Assignment of each path to a virtual layer

/* Initialization of layer 1 */
for all PortPairs(source, destination) do
    Update CDG[1] with the source-destination path
end for
/* Search cycles in the channel dependency graph */
for i = 1, ..., max − 1 do
    repeat
        Search for cycle in CDG[i]
        Identify "weakest" edge of the cycle
        Move port pairs or paths on this edge to CDG[i + 1]
    until no cycle found in CDG[i]
end for
Search for cycle in CDG[max]
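The loop of Algorithm 2 can be sketched in a few lines. This is an illustration under stated assumptions, not the OpenSM patch: routes are tuples of channel names, the CDG of a layer is rebuilt from scratch in every iteration (a real implementation would presumably update it incrementally), and the "weakest" edge is taken to be the cycle edge induced by the fewest paths, which is only one of the candidate heuristics. The names `find_cycle` and `assign_layers` are hypothetical.

```python
def find_cycle(edges):
    """Return one cycle as a list of (a, b) edges, or None if acyclic."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, []).append(b)
        succ.setdefault(b, [])
    state = {}                      # 1 = on the current DFS trail, 2 = finished
    for start in succ:
        if start in state:
            continue
        state[start] = 1
        trail, iters = [start], [iter(succ[start])]
        while iters:
            for nxt in iters[-1]:
                if state.get(nxt) == 1:             # back edge: cycle found
                    cyc = trail[trail.index(nxt):] + [nxt]
                    return list(zip(cyc, cyc[1:]))
                if nxt not in state:
                    state[nxt] = 1
                    trail.append(nxt)
                    iters.append(iter(succ[nxt]))
                    break
            else:                                   # node fully explored
                state[trail.pop()] = 2
                iters.pop()
    return None


def path_edges(path):
    """CDG edges induced by one path: consecutive channel pairs."""
    return set(zip(path, path[1:]))


def assign_layers(paths, max_layers):
    """Distribute paths over virtual layers so that every layer's CDG is acyclic."""
    layers = [list(paths)] + [[] for _ in range(max_layers - 1)]
    for i in range(max_layers - 1):
        while True:
            usage = {}                              # CDG edge -> paths inducing it
            for p in layers[i]:
                for e in path_edges(p):
                    usage.setdefault(e, []).append(p)
            cycle = find_cycle(set(usage))
            if cycle is None:
                break
            weakest = min(cycle, key=lambda e: len(usage[e]))
            for p in usage[weakest]:                # move these paths one layer up
                layers[i].remove(p)
                layers[i + 1].append(p)
    last = set().union(*(path_edges(p) for p in layers[-1]))
    if find_cycle(last) is not None:
        raise ValueError("not enough virtual layers for a deadlock-free assignment")
    return layers
```

Each pass of the inner loop moves at least one path out of layer i, so the loop terminates. The final check mirrors the last line of Algorithm 2: if the highest layer still contains a cycle, the available number of layers is insufficient for this heuristic.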

How to identify the "weakest" edge?

... to minimize the number of needed virtual layers.

Abstract formulation: "acyclic path partitioning" problem (APP)

Split a set of paths into subsets, each of which produces an acyclic channel dependency graph.

Shown to be NP-complete

Proof based on a polynomial transformation from the graph k-colorability problem to APP

APP is NP-complete ⇒ use a heuristic to identify the "weakest" edge

Edge with most paths in the cycle

Random edge of the cycle

Edge with smallest number of paths
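The three candidate heuristics differ only in how one edge of a detected cycle is selected. A minimal sketch, assuming the cycle is a list of CDG edges and `paths_per_edge` maps each edge to the number of paths that induce it; the function name, the parameter names, and the heuristic labels are hypothetical.

```python
import random


def pick_weakest_edge(cycle, paths_per_edge, heuristic="fewest", rng=None):
    """Select the cycle edge whose paths are moved to the next virtual layer.

    cycle          -- list of CDG edges, e.g. [("c1", "c2"), ("c2", "c3"), ...]
    paths_per_edge -- dict mapping each edge to the number of paths inducing it
    heuristic      -- "most", "random", or "fewest"
    rng            -- optional random.Random instance for reproducible choices
    """
    if heuristic == "most":
        return max(cycle, key=lambda edge: paths_per_edge[edge])
    if heuristic == "fewest":
        return min(cycle, key=lambda edge: paths_per_edge[edge])
    if heuristic == "random":
        return (rng or random).choice(cycle)
    raise ValueError(f"unknown heuristic: {heuristic!r}")


# Illustrative cycle with made-up path counts.
cycle = [("c1", "c2"), ("c2", "c3"), ("c3", "c1")]
counts = {("c1", "c2"): 5, ("c2", "c3"): 1, ("c3", "c1"): 3}
print(pick_weakest_edge(cycle, counts, "fewest"))   # ('c2', 'c3')
print(pick_weakest_edge(cycle, counts, "most"))     # ('c1', 'c2')
```

Choosing the edge with the fewest paths moves the least traffic to the next layer per broken cycle, while choosing the edge with the most paths may break several overlapping cycles at once; which one needs fewer layers depends on the topology.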

Outline

Simulations and measurements

Simulations with IBSim

Real existing topologies

Measurements on a real system – Deimos

PC-Farm Deimos

Netgauge

BenchIT

NAS parallel benchmarks

Real existing topologies

[Two plots over the topologies CHiC, Deimos, JUROPA, Odin, Ranger, Tsubame: (1) eff. bisection bandwidth, axis 0 to 1; (2) runtime in s, logarithmic axis 10^-4 to 10^4. Legend: MinHop, Up*/Down*, FatTree, LASH, DOR, SSSP, DFSSSP]

Figure: Simulation with IBSim and ORCS [Schneider et al., 2009]

Measurements on a real system – Deimos

HPC-system operated by ZIH

Linux Networx PC-Farm (13.9 TFlop/s)

726 compute nodes connected by 108 IB switches

2.6 GHz AMD Opteron X85 dual core

1, 2 or 4 processors per node

2 GByte RAM per core

Measurements on a real system – Deimos

Measurement environment and used benchmarks

Exclusive access

One MPI process per node (for measurements with ≤ 512 cores)

Same number of MPI processes ⇒ same compute nodes used

Eff. bisection bandwidth with Netgauge [Hoefler et al., 2007]

Runtime and bandwidths of pure MPI communication measured with micro-benchmarks (BenchIT [Juckeland et al., 2004])

Performance gain for application benchmarks of NASA (NAS Parallel Benchmarks [Bailey et al., 1995])

Netgauge

[Plot: eff. bisection bandwidth in MiByte/s (axis 0 to 400) over number of cores (128, 256, 512, 1024). Legend: MinHop, LASH, SSSP, DFSSSP]

Figure: Approximation with 1000 random bisections
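The idea of approximating the effective bisection bandwidth by sampling random bisections can be illustrated with a small model. This is not Netgauge code (Netgauge measures on the real hardware); it is a simplified congestion model under loudly stated assumptions: each sampled pair sends one unidirectional flow, every link has capacity 1, and a flow receives the share 1 / (maximum number of flows on any link of its route). The function name and the `route` callback are hypothetical.

```python
import random


def effective_bisection_bandwidth(endpoints, route, trials=1000, seed=0):
    """Average per-flow bandwidth share over random bisections.

    endpoints -- list of endpoint identifiers
    route     -- callable (src, dst) -> list of link identifiers on the path
    trials    -- number of random bisections to sample
    """
    rng = random.Random(seed)
    total_share, flow_count = 0.0, 0
    for _ in range(trials):
        shuffled = list(endpoints)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        # First half sends, second half receives; pairs are matched by position.
        pairs = list(zip(shuffled[:half], shuffled[half:2 * half]))
        load = {}
        for src, dst in pairs:
            for link in route(src, dst):
                load[link] = load.get(link, 0) + 1
        for src, dst in pairs:
            links = route(src, dst)
            # Assumption: the most congested link on the route limits the flow.
            share = 1.0 / max(load[link] for link in links) if links else 1.0
            total_share += share
            flow_count += 1
    return total_share / flow_count if flow_count else 0.0


# Illustrative topology: every endpoint has a private link to one central switch.
def star_route(src, dst):
    return [(src, "switch"), ("switch", dst)]

print(effective_bisection_bandwidth(list(range(8)), star_route, trials=100))  # 1.0
```

In the star example no two flows of a bisection share a link, so every flow gets the full share. A routing that concentrates flows on few links lowers the average, which is the effect the plot compares across routing algorithms.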

BenchIT

[Plot: runtime in s (axis 0 to 0.08) over elements in send buffer (#floats, 0 to 4096). Legend: MinHop, LASH, SSSP, DFSSSP]

Figure: Collective N-to-N MPI operation on 128 nodes

NAS parallel benchmarks

[Plot: Gflop/s (total) (axis 0 to 250) over number of cores (121, 256, 484, 1024). Legend: MinHop, LASH, SSSP, DFSSSP]

Figure: BT, class C – equation system solver

Conclusion

Developed deadlock-free SSSP routing for arbitrary network topologies

DF-/SSSP routing algorithm integrated in OpenSM

Patch available: http://unixer.de/research/dfsssp/

Not limited to InfiniBand; usable for all interconnects which support virtual channels

Modeled the "acyclic path partitioning" problem; proved NP-completeness

Doubled the eff. bisection bandwidth of Deimos for 512 nodes

Performance gain (communication bound) for application benchmarks of up to 95%

References

D. Bailey, T. Harris, W. Saphir, R. V. D. Wijngaart, A. Woo, and M. Yarrow. The NAS Parallel Benchmarks 2.0. Technical Report NAS-95-020, NASA Ames Research Center, Dec. 1995.

W. Dally and C. Seitz. Deadlock-Free Message Routing in Multiprocessor Interconnection Networks. Computers, IEEE Transactions on, C-36(5):547–553, May 1987. ISSN 0018-9340. doi: 10.1109/TC.1987.1676939.

T. Hamada and N. Nakasato. InfiniBand Trade Association, InfiniBand Architecture Specification, Volume 1, Release 1.0. In International Conference on Field Programmable Logic and Applications, pages 366–373, 2005.

T. Hoefler, T. Mehlan, A. Lumsdaine, and W. Rehm. Netgauge: A Network Performance Measurement Framework. In High Performance Computing and Communications, Third International Conference, HPCC 2007, Houston, USA, September 26-28, 2007, Proceedings, volume 4782, pages 659–671. Springer, Sept. 2007. ISBN 978-3-540-75443-5.

T. Hoefler, T. Schneider, and A. Lumsdaine. Optimized Routing for Large-Scale InfiniBand Networks. In 17th Annual IEEE Symposium on High Performance Interconnects (HOTI 2009), Aug. 2009.

G. Juckeland, S. Borner, M. Kluge, S. Kolling, W. Nagel, S. Pfluger, H. Roding, S. Seidl, T. William, and R. Wloch. BenchIT – Performance Measurement and Comparison for Scientific Applications. In F. P. G.R. Joubert, W.E. Nagel and W. Walter, editors, Parallel Computing - Software Technology, Algorithms, Architectures and Applications, volume 13 of Advances in Parallel Computing, pages 501–508. North-Holland, 2004.

T. Schneider, T. Hoefler, and A. Lumsdaine. ORCS: An Oblivious Routing Congestion Simulator. Technical Report 675, Indiana University, Feb. 2009.

A. S. Tanenbaum. Modern Operating Systems. Prentice Hall Press, Upper Saddle River, NJ, USA, 3rd edition, 2007. ISBN 9780136006633.

Backup – Complexity analysis

Time complexity

The time complexity for the DFSSSP routing algorithm is

$O\left(|N|^2 \cdot (\log |N| + \nabla) + |N| \cdot |C| + \nabla \cdot (|C| + |E|)\right)$

Memory complexity

The memory complexity for DFSSSP is

$O\left(\nabla \cdot d(I) \cdot |N|^2 + \nabla \cdot (|C| + |E|) + |N|\right)$

Variables:

N – nodes in the network

C – channels/links

E – edges in the channel dependency graph

∇ – minimal number of needed virtual layers

d(I ) – diameter of network I

Backup – InfiniBand subnet

[Figure labels: subnets connected by a router; switches 1 to n; links; OpenSM; compute nodes (CPUs) attached via HCA; I/O nodes (tapes) attached via TCA]

Backup – Metrics for interconnects

Significant properties

Low latency

High bandwidth for packet transfer

Absence of deadlocks in the routing

Established metrics to rate the interconnect

Latency

Bandwidth

Bisection bandwidth

Effective bandwidth

Effective bisection bandwidth

Backup – SSSP algorithm

Algorithm 3 SSSP routing algorithm (Phase 2)

Input: Context of DFSSSP routing
Output: Linear forwarding tables

/* N-to-N, multi-graph Dijkstra algorithm */
for all Port ∈ Subnet do
    Dijkstra(...) for this port as source
    Update all linear forwarding tables
    Increase edge weights
end for
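The loop of Algorithm 3 can be sketched as repeated Dijkstra runs with growing link weights. The sketch below makes several assumptions that are not stated on the slide: the network is an undirected simple graph given as a symmetric adjacency dict, every link starts with weight 1, and a link's weight grows by 1 for each route placed on it. The names `sssp_routing` and `route` are hypothetical. Because links are symmetric, the shortest-path tree rooted at a port directly yields the next hop of every other node towards that port.

```python
import heapq


def sssp_routing(adjacency):
    """Balanced shortest-path routing; returns (forwarding tables, link weights).

    adjacency -- dict node -> list of neighbor nodes (must be symmetric)
    The table lft[node][destination] holds the next hop towards destination.
    """
    nodes = sorted(adjacency)
    weight = {frozenset((u, v)): 1 for u in nodes for v in adjacency[u]}
    lft = {u: {} for u in nodes}
    for root in nodes:
        # Dijkstra from `root` using the current (load-dependent) weights.
        dist, parent, done = {root: 0}, {}, set()
        heap = [(0, root)]
        while heap:
            d, u = heapq.heappop(heap)
            if u in done:
                continue
            done.add(u)
            for v in adjacency[u]:
                nd = d + weight[frozenset((u, v))]
                if v not in dist or nd < dist[v]:
                    dist[v], parent[v] = nd, u
                    heapq.heappush(heap, (nd, v))
        # Update the forwarding tables: parent[v] is v's next hop towards root.
        for v, next_hop in parent.items():
            lft[v][root] = next_hop
        # Increase edge weights by the number of new routes crossing each link.
        for v in parent:
            hop = v
            while hop != root:
                weight[frozenset((hop, parent[hop]))] += 1
                hop = parent[hop]
    return lft, weight


def route(lft, src, dst):
    """Follow the forwarding tables hop by hop from src to dst."""
    path = [src]
    while path[-1] != dst:
        path.append(lft[path[-1]][dst])
    return path


# Illustrative four-node ring.
ring = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
tables, weights = sssp_routing(ring)
print(route(tables, "a", "c"))
```

Links that already carry many routes become more expensive, so later Dijkstra runs prefer less loaded links; this is the global balancing effect. With these simple initial weights a route may become longer than the minimal hop count, so the choice of initial weight and increment is a design decision the sketch does not settle.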
