Center for Information Services and High Performance Computing (ZIH)
Deadlock-Free Oblivious Routing
for Arbitrary Topologies
Jens Domke, Torsten Hoefler, Wolfgang E. Nagel
May 18th, 2011
Zellescher Weg 12, Willers-Bau A 219, 01062 Dresden, Tel. +49 0351 - 463 39114
Jens Domke ( [email protected] )
Outline
1 Basics and previous work
2 Deadlocks
3 Deadlock-free SSSP routing algorithm
4 Simulations and measurements
5 Conclusion
Jens Domke Slide 2
Outline
Basics and previous work
InfiniBand interconnect
InfiniBand subnet manager – OpenSM
Motivation
Previous work
InfiniBand interconnect
Based on an open standard, developed by the InfiniBand Trade Association
One of the most widely used interconnects in the field of HPC
Figure: Top500 List, Interconnects, Nov. 2010 (pie chart with the categories Gigabit Ethernet, InfiniBand, Proprietary, and Others; shares shown: 42.4%, 45.6%, 6.2%, 5.8%)
InfiniBand subnet manager – OpenSM
Tasks
Scan the components of the IB subnet
Initialize the IB ports
Calculate paths for each port pair in the subnet
Generate linear forwarding tables (LFT)
Configure the IB ports with additional preferences, e.g. QoS
Reconfiguration, if the subnet changes
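As an illustration of what a linear forwarding table holds, the following Python sketch models destination-based forwarding on a made-up two-switch subnet. The switch names, LIDs, port numbers, and the `links` cabling map are assumptions for this example and do not reflect OpenSM's internal data structures.

```python
# Hypothetical toy subnet: two switches and three end ports (LIDs 1 to 3).
# lft[switch][destination_lid] -> output port number on that switch.
lft = {
    "sw1": {1: 1, 2: 2, 3: 3},
    "sw2": {1: 1, 2: 1, 3: 2},
}

# Hypothetical cabling: (switch, output port) -> next switch name,
# or the LID of the end port attached to that switch port.
links = {
    ("sw1", 1): 1, ("sw1", 2): 2, ("sw1", 3): "sw2",
    ("sw2", 1): "sw1", ("sw2", 2): 3,
}


def forward(first_switch, dest_lid, max_hops=16):
    """Follow the forwarding tables hop by hop; return the switches visited."""
    visited, current = [], first_switch
    for _ in range(max_hops):
        visited.append(current)
        # The output port depends only on the destination, not on the source.
        nxt = links[(current, lft[current][dest_lid])]
        if nxt == dest_lid:
            return visited
        current = nxt
    raise RuntimeError("hop limit exceeded")
```

For instance, `forward("sw2", 2)` walks from the second switch through the first one to reach LID 2. Because each table entry depends only on the destination, all traffic toward one port shares the same tree of channels.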
InfiniBand subnet manager – OpenSM
Implemented static/destination-based routing algorithms
MinHop
Up*/Down*
Fat-Tree
LASH
DOR
Motivation
General problems for most of the routing algorithms
No global balancing of the traffic ⇒ congestion reduces the bandwidth
Only designed for a small set of topologies
Not deadlock-free for every topology
Not usable for production systems, because of long runtime
The algorithm should support irregular topologies, because
HPC systems grow over their lifetime
Additional nodes, such as I/O or login nodes, are connected
Network components can fail
Previous work
Single-source-shortest-path routing algorithm
"Optimized Routing for Large-Scale InfiniBand Networks" [Hoefler et al., 2009] presented SSSP
Minimizes congestion through global balancing
Higher effective bisection bandwidth compared to other algorithms
Disadvantage of the presented approach
Algorithm is not deadlock-free
LFTs are calculated by an external program (not OpenSM)
Outline
Deadlocks
Definition
Deadlocks in interconnects
Approaches for deadlock-free routing
Theorem of Dally and Seitz
Virtual channels and channel dependency graph
Definition
Definition Deadlock [Tanenbaum, 2007]
A set of processes is deadlocked if each process in the set is waiting for an event that only a process in the set can cause.
Deadlocks in interconnects
Figure (labels only): packet source, switch buffer, packet destination
Approaches for deadlock-free routing
Packet lifetime (only breaks deadlocks after they occur)
Controller principle
Up*/Down* routing
Virtual channels
"Deadlock-Free Message Routing in Multiprocessor Interconnection Networks" [Dally and Seitz, 1987]
Each link will be split into multiple virtual channels
Channel dependency graph
Theorem of Dally and Seitz
Theorem of Dally and Seitz
A routing algorithm for an interconnect is deadlock-free iff there are no cycles in the corresponding channel dependency graph.
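The theorem suggests a direct check: build the channel dependency graph (CDG) from the routed paths and search it for a cycle. The Python sketch below does this under the assumption that each path is given as a list of channel names; `build_cdg`, `has_cycle`, and the channel labels are illustrative and not taken from the authors' implementation.

```python
from collections import defaultdict


def build_cdg(paths):
    """Channel dependency graph: edge a -> b if some path uses b right after a."""
    cdg = defaultdict(set)
    for path in paths:
        for a, b in zip(path, path[1:]):
            cdg[a].add(b)
            cdg.setdefault(b, set())  # make sure every channel is a vertex
    return cdg


def has_cycle(cdg):
    """Depth-first search with three colors; reaching a grey vertex means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {c: WHITE for c in cdg}

    def visit(c):
        color[c] = GREY
        for d in cdg[c]:
            if color[d] == GREY or (color[d] == WHITE and visit(d)):
                return True
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and visit(c) for c in list(cdg))


# Four channels c1..c4 forming a ring; each path uses two consecutive channels.
ring_paths = [["c1", "c2"], ["c2", "c3"], ["c3", "c4"], ["c4", "c1"]]
```

With all four paths the CDG contains the cycle c1, c2, c3, c4, c1, so this routing could deadlock; dropping any one path leaves an acyclic graph.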
Virtual channels and channel dependency graph
Figure: network with nodes n1 to n4 and channels c1 to c4, shown next to its channel dependency graph
Virtual channels and channel dependency graph
Figure: network with nodes n1 to n4 whose channels are split into virtual channels c1,1 to c1,4 and c2,1 to c2,4, shown next to the corresponding channel dependency graph
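The effect of virtual channels on the dependency graph can be sketched in a few lines of Python. The example assumes a ring of four channels and, as a simplification, pins each path to exactly one virtual layer; the function names and the choice of which path moves to layer 2 are illustrative only.

```python
def is_acyclic(edges):
    """Kahn's algorithm: acyclic iff all vertices can be removed in topological order."""
    vertices = {v for edge in edges for v in edge}
    indegree = {v: 0 for v in vertices}
    for _, b in edges:
        indegree[b] += 1
    ready = [v for v in vertices if indegree[v] == 0]
    removed = 0
    while ready:
        v = ready.pop()
        removed += 1
        for a, b in edges:
            if a == v:
                indegree[b] -= 1
                if indegree[b] == 0:
                    ready.append(b)
    return removed == len(vertices)


def dependencies(paths_with_layer):
    """CDG edges between virtual channels (channel, layer); one layer per path."""
    edges = set()
    for path, layer in paths_with_layer:
        for a, b in zip(path, path[1:]):
            edges.add(((a, layer), (b, layer)))
    return edges


ring = [["c1", "c2"], ["c2", "c3"], ["c3", "c4"], ["c4", "c1"]]
# All paths on one layer: the dependencies close a cycle.
single_layer = dependencies((p, 1) for p in ring)
# Moving one path to a second virtual layer breaks the cycle.
two_layers = dependencies((p, 2 if p == ["c4", "c1"] else 1) for p in ring)
```

Here `is_acyclic(single_layer)` is false while `is_acyclic(two_layers)` is true: the physical links are unchanged, but the dependencies are spread over two disjoint sets of virtual channels.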
Virtual channels and channel dependency graph
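Figure: channel dependency graph over channels c1 to c4 with labels r1 to r3, shown next to the corresponding graph over virtual channels c1,1 to c1,3 and c2,1 to c2,4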
Outline
Deadlock-free SSSP routing algorithm
DFSSSP routing algorithm
How to identify the ”weakest” edge?
Jens Domke Slide 17
DFSSSP routing algorithm
Algorithm 1 DFSSSP routing algorithm
/* Phase 1: Identification of all network components */
Scan(...)
/* Phase 2: Calculate paths */
SSSP(...)
/* Phase 3: Assign paths to virtual layers */
RemoveDeadlocks(...)
/* Phase 4: Balancing of all virtual layers */
Balancing(...)
DFSSSP routing algorithm
Algorithm 2 Remove deadlocks from the channel dependency graph (Phase 3)
Input: Linear forwarding tables
Output: Assignment of each path to a virtual layer

/* Initialization of layer 1 */
for all PortPairs(source, destination) do
    Update CDG[1] with the source-destination path
end for
/* Search cycles in the channel dependency graph */
for i = 1, ..., max − 1 do
    repeat
        Search for cycle in CDG[i]
        Identify "weakest" edge of the cycle
        Move port pairs or paths on this edge to CDG[i + 1]
    until no cycle found in CDG[i]
end for
Search for cycle in CDG[max]
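The layer assignment of Algorithm 2 can be approximated by the following Python sketch. It assumes paths are tuples of channel names, rebuilds the CDG from scratch on every search (the slides do not say how the real implementation maintains it), and picks the edge carrying the fewest paths as the "weakest" one, which is only one possible heuristic. `find_cycle`, `uses_edge`, and `remove_deadlocks` are hypothetical names.

```python
from collections import defaultdict


def find_cycle(paths):
    """Return one cycle of the CDG induced by `paths` as a list of edges, or None."""
    cdg = defaultdict(set)
    for path in paths:
        for a, b in zip(path, path[1:]):
            cdg[a].add(b)
    state, stack = {}, []  # state: 1 = on the DFS stack, 2 = finished

    def visit(c):
        state[c] = 1
        stack.append(c)
        for d in sorted(cdg.get(c, ())):
            if state.get(d) == 1:  # back edge closes a cycle
                nodes = stack[stack.index(d):] + [d]
                return list(zip(nodes, nodes[1:]))
            if d not in state:
                found = visit(d)
                if found:
                    return found
        stack.pop()
        state[c] = 2
        return None

    for c in sorted(cdg):
        if c not in state:
            found = visit(c)
            if found:
                return found
    return None


def uses_edge(path, edge):
    """True if the path traverses the two channels of `edge` back to back."""
    return edge in zip(path, path[1:])


def remove_deadlocks(paths, max_layers):
    """Assign each path to a layer so that every layer's CDG is acyclic."""
    layers = [list(paths)] + [[] for _ in range(max_layers - 1)]
    for i in range(max_layers - 1):
        while (cycle := find_cycle(layers[i])) is not None:
            # Assumed heuristic: the "weakest" edge carries the fewest paths.
            weakest = min(cycle, key=lambda e: sum(uses_edge(p, e) for p in layers[i]))
            moved = [p for p in layers[i] if uses_edge(p, weakest)]
            layers[i] = [p for p in layers[i] if not uses_edge(p, weakest)]
            layers[i + 1].extend(moved)
    if find_cycle(layers[-1]) is not None:
        raise RuntimeError("not enough virtual layers")
    return layers
```

Each pass of the inner loop moves at least one path to the next layer, so the loop terminates; a cycle remaining in the last layer means the available number of virtual layers was not sufficient.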
How to identify the ”weakest” edge?
... to minimize the number of needed virtual layers.
Abstract formulation: ”acyclic path partitioning” problem (APP)
Split a set of paths into subsets, each of which produces an acyclic channel dependency graph.
Shown to be NP-complete
Proof based on a polynomial transformation from the graph k-colorability problem to APP
APP is NP-complete ⇒ use heuristic to identify the ”weakest” edge
Edge with most paths in the cycle
Random edge of the cycle
Edge with smallest number of paths
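The three heuristics listed above differ only in how an edge of the detected cycle is selected. A minimal Python sketch, assuming a cycle is a list of CDG edges and paths are tuples of channels (`pick_weakest` and `paths_on_edge` are hypothetical helper names):

```python
import random


def paths_on_edge(paths, edge):
    """Number of paths that traverse the two channels of `edge` back to back."""
    return sum(edge in zip(p, p[1:]) for p in paths)


def pick_weakest(cycle, paths, heuristic="fewest", rng=random):
    """Select the edge of `cycle` whose paths are moved to the next layer."""
    if heuristic == "most":
        return max(cycle, key=lambda e: paths_on_edge(paths, e))
    if heuristic == "fewest":
        return min(cycle, key=lambda e: paths_on_edge(paths, e))
    if heuristic == "random":
        return rng.choice(cycle)
    raise ValueError(f"unknown heuristic: {heuristic}")


# Example data: edge (c1, c2) is used by two paths, the other edges by one each.
paths = [("c1", "c2", "c3"), ("c1", "c2"), ("c3", "c1")]
cycle = [("c1", "c2"), ("c2", "c3"), ("c3", "c1")]
```

Moving the edge with the fewest paths disturbs the current layer least, while moving the edge with the most paths shifts more load to the next layer at once; which choice needs fewer layers overall depends on the topology.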
Outline
Simulations and measurements
Simulations with IBSim
Real existing topologies
Measurements on a real system – Deimos
PC-Farm Deimos
Netgauge
BenchIT
NAS parallel benchmarks
Real existing topologies
Figure: Simulation with IBSim and ORCS [Schneider et al., 2009]. Two panels over the topologies CHiC, Deimos, JUROPA, Odin, Ranger, and Tsubame: effective bisection bandwidth (axis 0 to 1) and runtime in s (logarithmic axis, 10^-4 to 10^4), each for MinHop, Up*/Down*, FatTree, LASH, DOR, SSSP, and DFSSSP.
Measurements on a real system – Deimos
HPC-system operated by ZIH
Linux Networx PC-Farm (13.9 TFlop/s)
726 compute nodes connected by 108 IB switches
2.6 GHz AMD Opteron X85 dual core
1, 2 or 4 processors per node
2 GByte RAM per core
Measurements on a real system – Deimos
Measurement environment and used benchmarks
Exclusive access
One MPI process per node (for measurements with ≤ 512 cores)
Same number of MPI processes ⇒ same compute nodes used
Eff. bisection bandwidth with Netgauge [Hoefler et al., 2007]
Runtime and bandwidths of pure MPI communication measured with micro-benchmarks (BenchIT [Juckeland et al., 2004])
Performance gain for application benchmarks of NASA (NAS Parallel Benchmarks [Bailey et al., 1995])
Netgauge
Figure: Approximation with 1000 random bisections. Effective bisection bandwidth in MiByte/s (axis 0 to 400) over the number of cores (128, 256, 512, 1024) for MinHop, LASH, SSSP, and DFSSSP.
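The sampling idea behind this approximation can be sketched in Python. Note that Netgauge measures real transfers, whereas the sketch below substitutes a simple congestion model as an assumption: every flow receives the fraction 1/k of a link's bandwidth, where k is the highest number of flows sharing any channel on its path. The `route` callback, returning the non-empty list of channels used from a source to a destination, is hypothetical.

```python
import random
from collections import Counter


def effective_bisection_bandwidth(nodes, route, samples=1000, seed=0):
    """Average relative per-flow bandwidth over random bisections of the nodes."""
    rng = random.Random(seed)
    nodes = list(nodes)
    total, flows = 0.0, 0
    for _ in range(samples):
        # Random bisection: shuffle, split in halves, pair the halves up.
        rng.shuffle(nodes)
        half = len(nodes) // 2
        pairs = list(zip(nodes[:half], nodes[half:2 * half]))
        # Count how many flows of this pattern cross each channel.
        load = Counter(ch for s, d in pairs for ch in route(s, d))
        for s, d in pairs:
            # Assumed model: the most congested channel limits the flow.
            total += 1.0 / max(load[ch] for ch in route(s, d))
            flows += 1
    return total / flows
```

A routing that never lets two flows share a channel yields 1.0 under this model, and one that sends every flow over a single shared channel yields 1 divided by the number of pairs.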
BenchIT
Figure: Collective N-to-N MPI operation on 128 nodes. Runtime in s (axis 0 to 0.08) over the number of elements in the send buffer (#floats, 0 to 4096) for MinHop, LASH, SSSP, and DFSSSP.
NAS parallel benchmarks
Figure: BT, class C – equation system solver. Total Gflop/s (axis 0 to 250) over the number of cores (121, 256, 484, 1024) for MinHop, LASH, SSSP, and DFSSSP.
Conclusion
Developed deadlock-free SSSP routing for arbitrary network topologies
DF-/SSSP routing algorithm integrated in OpenSM
Patch available: http://unixer.de/research/dfsssp/
Not limited to InfiniBand; usable for all interconnects that support virtual channels
Modeled the "acyclic path partitioning" problem; proved NP-completeness
Doubled the eff. bisection bandwidth of Deimos for 512 nodes
Performance gain (communication bound) of up to 95% for application benchmarks
References
D. Bailey, T. Harris, W. Saphir, R. V. D. Wijngaart, A. Woo, and M. Yarrow. The NAS Parallel Benchmarks 2.0. Technical Report NAS-95-020, NASA Ames Research Center, Dec. 1995.
W. Dally and C. Seitz. Deadlock-Free Message Routing in Multiprocessor Interconnection Networks. Computers, IEEE Transactions on, C-36(5):547–553, May 1987. ISSN 0018-9340. doi: 10.1109/TC.1987.1676939.
T. Hamada and N. Nakasato. InfiniBand Trade Association, InfiniBand Architecture Specification, Volume 1, Release 1.0. In International Conference on Field Programmable Logic and Applications, pages 366–373, 2005.
T. Hoefler, T. Mehlan, A. Lumsdaine, and W. Rehm. Netgauge: A Network Performance Measurement Framework. In High Performance Computing and Communications, Third International Conference, HPCC 2007, Houston, USA, September 26-28, 2007, Proceedings, volume 4782, pages 659–671. Springer, Sept. 2007. ISBN 978-3-540-75443-5.
T. Hoefler, T. Schneider, and A. Lumsdaine. Optimized Routing for Large-Scale InfiniBand Networks. In 17th Annual IEEE Symposium on High Performance Interconnects (HOTI 2009), Aug. 2009.
G. Juckeland, S. Borner, M. Kluge, S. Kolling, W. Nagel, S. Pfluger, H. Roding, S. Seidl, T. William, and R. Wloch. BenchIT – performance measurement and comparison for scientific applications. In F. P. G.R. Joubert, W.E. Nagel and W. Walter, editors, Parallel Computing - Software Technology, Algorithms, Architectures and Applications, volume 13 of Advances in Parallel Computing, pages 501–508. North-Holland, 2004.
T. Schneider, T. Hoefler, and A. Lumsdaine. ORCS: An Oblivious Routing Congestion Simulator. Technical Report 675, Indiana University, Feb. 2009.
A. S. Tanenbaum. Modern Operating Systems. Prentice Hall Press, Upper Saddle River, NJ, USA, 3. edition, 2007. ISBN 9780136006633.
Backup – Complexity analysis
Time complexity
The time complexity for the DFSSSP routing algorithm is
O(|N|^2 · (log |N| + ∇) + |N| · |C| + ∇ · (|C| + |E|))
Memory complexity
The memory complexity for DFSSSP is
O(∇ · d(I) · |N|^2 + ∇ · (|C| + |E|) + |N|)
Variables:
N – nodes in the network
C – channels/links
E – edges in the channel dependency graph
∇ – minimum number of virtual layers needed
d(I) – diameter of network I
Backup – InfiniBand subnet
Figure: InfiniBand subnet with switches 1 to n connected by links, compute nodes (CPUs) attached through HCAs, I/O nodes (tape) attached through TCAs, OpenSM managing the subnet, and a router connecting to further subnets
Backup – Metrics for interconnects
Significant properties
Low latency
High bandwidth for packet transfer
Absence of deadlocks in the routing
Established metrics to rate the interconnect
Latency
Bandwidth
Bisection bandwidth
Effective bandwidth
Effective bisection bandwidth
Backup – SSSP algorithm
Algorithm 3 SSSP routing algorithm (Phase 2)
Input: Context of DFSSSP routing
Output: Linear forwarding tables

/* N-to-N, multi-graph Dijkstra algorithm */
for all Port ∈ Subnet do
    Dijkstra(...) for this port as source
    Update all linear forwarding tables
    Increase edge weights
end for
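The loop above can be sketched in Python as follows. Several simplifications are assumptions of this sketch and not details from the slides: the network is a simple graph instead of a multi-graph, each port serves as the Dijkstra source and the resulting tree is used for traffic toward that port, and the weight of a channel grows by one for every path routed over it. `dijkstra` and `sssp_routing` are illustrative names.

```python
import heapq


def dijkstra(graph, weight, source):
    """Shortest paths toward `source`; returns {node: next hop toward source}."""
    dist = {source: 0}
    pred = {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v in graph[u]:
            # Traffic flows from v to u, so the channel (v, u) is what counts.
            nd = d + weight[(v, u)]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                pred[v] = u
                heapq.heappush(heap, (nd, v))
    return pred


def sssp_routing(graph):
    """graph: {node: [neighbors]}, with every link listed in both directions."""
    weight = {(u, v): 1 for u in graph for v in graph[u]}
    next_hop = {u: {} for u in graph}  # next_hop[node][destination] -> neighbor
    for dest in sorted(graph):
        pred = dijkstra(graph, weight, dest)
        # Update the forwarding tables of all nodes for this destination.
        for node, toward_dest in pred.items():
            next_hop[node][dest] = toward_dest
        # Increase the weight of every channel once per path crossing it.
        for node in pred:
            hop = node
            while hop != dest:
                weight[(hop, pred[hop])] += 1
                hop = pred[hop]
    return next_hop
```

Because channels that already carry many paths become more expensive, later Dijkstra runs tend to prefer less loaded channels, which is how this scheme balances routes globally across the network.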