Exploring pattern-aware routing in generalized fat tree networks


German Rodriguez
Barcelona Supercomputing Center (BSC), Barcelona, Spain

Ramon Beivide
University of Cantabria, Cantabria

Cyriel Minkenberg
IBM Research GmbH, Zurich Research Laboratory, Rüschlikon

Jesus Labarta
Universitat Politècnica de Catalunya and BSC, Barcelona, Spain

Mateo Valero
Universitat Politècnica de Catalunya and BSC, Barcelona, Spain

ABSTRACT

New static source routing algorithms for High Performance Computing (HPC) are presented in this work. The target parallel architectures are based on the commonly used fat-tree networks and their slimmed versions. The evaluation of such proposals and their comparison against currently used routing mechanisms have been driven by realistic traffic generated by HPC applications. Our experimental framework is based on the integration of two existing simulators, one replaying an MPI application and another simulating the network details. The resulting simulation platform has been fed with traces from real executions.

We have obtained several interesting findings: (i) contrary to the widely accepted belief, random static routing in k-ary n-trees (which is the default option for InfiniBand and Myrinet technologies) is not a good solution for HPC applications; (ii) some existing oblivious routing techniques can be very good for certain communication patterns present in applications, but clearly fail for some others; and (iii) one of the proposed pattern-aware routing algorithms could be used to better utilize network resources and thus achieve higher performance, particularly for the case of cost-effective networks.

Categories and Subject Descriptors

C.2.2 [Computer-Communication Networks]: Network Protocols - Routing protocols; B.4.3 [Input/Output and Data Communications]: Interconnections (Subsystems) - Topology (Fat Trees); C.4 [Performance of Systems]: Design studies

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICS’09, June 8–12, 2009, Yorktown Heights, New York, USA.
Copyright 2009 ACM 978-1-60558-498-0/09/06 ...$5.00.

General Terms

Performance, algorithms

Keywords

Extended Generalized Fat Trees (XGFTs), k-ary n-trees, communication/traffic patterns, network topologies, routing algorithms, Clos networks

1. INTRODUCTION

Current High Performance Computing (HPC) systems consist of thousands of processors connected by customized interconnection networks. The generalized use of such massive parallelism has increased the impact of the network on the overall system performance and cost. Although some of the fastest supercomputers in the Top500 list use Torus networks, a large number are built around indirect networks based mainly on fat-tree topologies. The present work focuses on enhancing the performance of this second class of networks by providing better routing algorithms.

Different papers have studied the effect of routing on the performance of regular and irregular indirect networks [20], [4]. Such works are mainly based on simulations fed by constant, randomly generated traffic managed under static or adaptive routing. Two main conclusions were obtained: (i) random routing is good because it uniformly distributes traffic, and (ii) intelligent adaptive routing can balance traffic and maximize memory utilization at the switches so as not to block packet injectors.

It is tempting to extrapolate these results, which can be valuable in certain contexts, to the communication patterns of supercomputer applications. However, the bursty and causal nature of HPC traffic is quite unlike random non-reactive traffic. In general, supercomputer applications exhibit very regular and repetitive communication patterns alternated with computation phases that act as implicit traffic flow control. Hence, the main networking objective in the HPC domain is not congestion control under a constant packet injection, which usually means taking decisions that directly or indirectly reduce the injection rate. In contrast, networks for HPC must provide the maximum possible peak rate for the transmissions involved in the various communication phases of applications. This can be achieved by minimizing packet contention over network resources.

In addition, many recent works [9], [23], [2] conclude that, given the communication requirements of HPC applications, current networks are overdesigned. These works are based on the observed resource usage of such “overdesigned” networks, but their conclusions do not necessarily hold for more cost-effective networks. Instead of a typical non-blocking network (as defined in Sec. 2), a slimmed blocking version can be used. An important question to ask in this context is how much a network can be cut down in switching resources without incurring a significant performance degradation.

The present work focuses on reducing the cost-performance ratio of interconnection networks by optimizing message routing. Good routing mechanisms should improve performance at little or no cost by removing or reducing the contention of packets for network ports. A bad routing scheme can underutilize an overdesigned network, and a good routing scheme for a non-blocking network is not necessarily good for its blocking version. Hence, our analysis addresses both the non-blocking and blocking network variants.

The main contributions of this work are as follows: (i) we propose an offline contention metric to factor out the contention at the adapters from the contention in the network fabric; (ii) we devise two new pattern-aware routing heuristics that lead to optimized routes according to the offline contention metric; (iii) we compare these routing schemes with several well-known routing techniques for both rearrangeable non-blocking networks and their slimmed (blocking) versions; and (iv) we show that performance gains can be achieved in different realistic HPC scenarios at the negligible cost of modifying the routes. Moreover, the proposed pattern-aware routing opens the door to research targeted at its application to power-saving and fault-tolerant supercomputers.

It has been pointed out in [5] that the performance of a static routing scheme (D-mod-k) can be a lower bound for adaptive routing techniques. In our work, we have studied several static routing schemes in the HPC domain that can help to set and better understand these bounds, extending the study to slimmed (blocking) networks.

Our study is based on a realistic experimental methodology. We obtained traces of communication patterns extracted from real applications running on a production machine. We fed these traces to a trace-driven MPI simulator integrated with a detailed network-level simulator. Finally, we evaluated the routing performance in a family of popular network topologies, of which k-ary n-trees are a sub-family.

The remainder of this paper is organized as follows. In Sec. 2 we review the background and related work on the topologies studied and oblivious routing algorithms. In Sec. 3 we introduce pattern-aware routing, propose an offline contention metric, and propose two new heuristic pattern-aware routing techniques that attempt to minimize this contention metric. Sec. 5 explains our evaluation methodology and presents the results. We conclude in Sec. 6.

2. BACKGROUND AND RELATED WORK

Many current supercomputers employ k-ary n-tree networks [19]. They are a popular parametric family of indirect multi-tree networks. A k-ary n-tree has $N = k^n$ leaf nodes used as processing nodes and $n \cdot k^{n-1}$ inner nodes (2k-port switches). These full bisection bandwidth networks exhibit path redundancy, and also the property of being rearrangeable [7]. This means that any scheduled permutation of sources over destinations can be routed without blocking, i.e., no messages contend for the same network port. Each specific permutation needs an appropriate set of routes. Therefore, these networks will be referred to as non-blocking.

As stated earlier, several recent works have identified a potential over-provisioning of bandwidth of k-ary n-trees [9], [2], [15]. Consequently, the use of “slimmed” k-ary n-trees has been considered. Slimmed k-ary n-tree topologies have fewer than $n \cdot k^{n-1}$ switches, losing both the full bisection bandwidth and the rearrangeable non-blocking properties.

Formally, k-ary n-trees and their slimmed versions belong to the family of Extended Generalized Fat Trees (XGFT) [16]. This family includes many popular Multi-stage Interconnection Networks (MIN), such as m-ary complete trees, k-ary n-trees [19], fat trees as described in [12], and slimmed k-ary n-trees. An $XGFT(h; m_1, \ldots, m_i, \ldots, m_h; w_1, \ldots, w_i, \ldots, w_h)$ of height h has $N = \prod_{i=1}^{h} m_i$ leaf processors, with the inner nodes serving only as routers. Each non-leaf node in level i has $m_i$ child nodes, and each non-root has $w_{i+1}$ parent nodes [16]. An XGFT of height h has h + 1 levels. Leaf nodes are at level l = 0. XGFTs are constructed recursively, each sub-tree at level l having parents numbered from 0 to $(w_{l+1} - 1)$. See Figure 1 for some examples.

A k-ary n-tree is an $XGFT(n; \underbrace{k, \ldots, k}_{n}; 1, \underbrace{k, \ldots, k}_{n-1})$, where $h = n$, $w_1 = 1$, $m_1 = k$, and $m_i = k$, $w_i = k$, $\forall i$ with $2 \le i \le n$.

A slimmed k-ary n-tree is precisely defined by the vectors $w_i$ and $m_i$ when $\exists i \mid (w_i < k)$ with $2 \le i \le n$. Slimmed trees are blocking networks.

In both cases, the number of inner switches I can be computed as

\[ I = \sum_{i=1}^{h} \left( \prod_{j=i+1}^{h} m_j \cdot \prod_{j=1}^{i} w_j \right). \tag{1} \]

Figure 1: Several XGFTs
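The leaf count and Equation (1) can be checked numerically. The following is a minimal Python sketch, not part of the original work; the function names are hypothetical, and the lists `m` and `w` hold $m_1..m_h$ and $w_1..w_h$ at indices 0..h-1.

```python
from math import prod

def kary_ntree(k, n):
    """Return the (m, w) vectors of the XGFT equivalent to a k-ary n-tree."""
    m = [k] * n                 # m_1 = ... = m_n = k
    w = [1] + [k] * (n - 1)     # w_1 = 1, w_2 = ... = w_n = k
    return m, w

def num_leaves(m):
    """N = product of all m_i."""
    return prod(m)

def num_inner_switches(m, w):
    """Equation (1): sum over levels i = 1..h of the switches at level i."""
    h = len(m)
    # m[i:] covers m_{i+1}..m_h and w[:i] covers w_1..w_i (empty product is 1).
    return sum(prod(m[i:]) * prod(w[:i]) for i in range(1, h + 1))

m, w = kary_ntree(4, 3)
print(num_leaves(m), num_inner_switches(m, w))   # 64 leaves, 48 = 3 * 4**2 switches
print(num_inner_switches([4, 4, 4], [1, 2, 4]))  # a slimmed variant with w_2 = 2
```

For the full 4-ary 3-tree the result matches $n \cdot k^{n-1}$, while the slimmed variant needs fewer switches.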

Finding a minimal deadlock-free path for connection (s → d) between source node s and destination node d in an XGFT network can be done by choosing any of their Nearest Common Ancestors (NCAs). Having selected the NCA, it is trivial to compute the unique ascending and descending paths to the nodes using their identifiers (self-routing property) [19].
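As an illustration of NCA selection, the sketch below computes the level at which two leaves first share a sub-tree and how many NCAs exist at that level. It assumes, hypothetically, that leaves are numbered sequentially so that the leaves of each sub-tree are consecutive; the function names are not from the original work.

```python
from math import prod

def nca_level(s, d, m):
    """Smallest level l at which leaves s and d belong to the same sub-tree."""
    if s == d:
        return 0
    span = 1  # number of leaves under one sub-tree of the current level
    for level, children in enumerate(m, start=1):
        span *= children
        if s // span == d // span:
            return level
    raise ValueError("s or d lies outside the tree")

def num_ncas(level, w):
    """Number of top switches of a level-l sub-tree: w_1 * ... * w_l."""
    return prod(w[:level])

m, w = [4, 4, 4], [1, 4, 4]          # 4-ary 3-tree
level = nca_level(0, 5, m)
print(level, num_ncas(level, w))     # leaves 0 and 5 meet at level 2, with 4 NCAs
```

Any of the `num_ncas` candidates yields a minimal path; the routing schemes discussed next differ only in which candidate they pick.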

Different oblivious algorithms have been proposed to fill the routing tables in k-ary n-trees. Random routing [22], [6], [4] is used as the default mechanism in Myrinet and InfiniBand interconnects. A path (s → d) from node s to node d is created by choosing any random NCA between both nodes. Two other oblivious routing techniques have been proposed independently without a common agreement on their names: what we will call Source-mod-k routing [16], [12] and Destination-mod-k routing [13], [10], [5], [8]. Both techniques employ the same function to select routes. The difference is that the former uses the source node identifier and the latter the destination identifier. S-mod-k and D-mod-k routings can be concisely described for k-ary n-trees: to establish a path (s → d) from node s to node d, S-mod-k routing chooses parent $\lfloor s / k^{l-1} \rfloor \bmod k$ at hop l, and D-mod-k routing chooses $\lfloor d / k^{l-1} \rfloor \bmod k$.

Routing in any XGFT is identical to finding paths in k-ary n-trees (see [16]). The routing tables of XGFTs can be filled using straightforward adaptations of the algorithms used in k-ary n-trees. For instance, S-mod-k and D-mod-k can be adapted by replacing the denominator $k^{l-1}$ by $\prod_{j=1}^{l-1} m_j$ for $l > 1$ (and by 1 for $l = 1$), and k by $w_l$. We have used a roughly equivalent adaptation, using a self-routing variable-radix base to label the nodes as described in [13], applying the modulo $w_l$ to the corresponding digit of the node label at each level. A route r from s to d that has NCAs at level $l_{NCA}$ is determined by the sequence of local output ports to reach the NCA. Local output ports of switches at level l are numbered from 0 to $w_{l+1} - 1$. Each local output port corresponds to one of the possible parents of the switch reached at level l. A route r is therefore described as $\langle r_0, \ldots, r_l, \ldots, r_{l(s,d)-1} \rangle$, the path to the NCA. The second half of the route to the destination can easily be reconstructed from the first half by knowing the destination (d) identifier [3].
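The S-mod-k and D-mod-k selection rule for k-ary n-trees can be written in a few lines. This is a sketch that applies the formula exactly as stated in the text; the function names and the `hops` parameter (number of upward hops to the NCA) are illustrative assumptions.

```python
def mod_k_parents(node_id, k, hops):
    """Parent chosen at each hop l = 1..hops: floor(node_id / k**(l-1)) mod k."""
    return [(node_id // k ** (l - 1)) % k for l in range(1, hops + 1)]

def s_mod_k(s, d, k, hops):
    """Source-mod-k: the upward path depends only on the source identifier."""
    return mod_k_parents(s, k, hops)

def d_mod_k(s, d, k, hops):
    """Destination-mod-k: the upward path depends only on the destination."""
    return mod_k_parents(d, k, hops)

print(s_mod_k(5, 9, k=2, hops=3))   # parents derived from s = 5
print(d_mod_k(5, 9, k=2, hops=3))   # parents derived from d = 9
```

Because the choice ignores the other endpoint, all connections from one source (S-mod-k) or to one destination (D-mod-k) follow the same sequence of parents, regardless of the communication pattern.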

Very few pattern-aware routing schemes have been proposed for XGFT networks. A pattern-aware routing scheme tries to optimize the set of connections C = (s → d) present in an application for any leaf nodes s and d. A special case of a set of connections is a permutation in which every node sends to a sole distinct destination. A very efficient algorithm to route a particular permutation without conflicts (i.e., realizing the rearrangeability property) is proposed in [3]. Optimizing a more general set of connections is a combinatorial problem. A brute-force search using a breadth-first search or depth-first search backtracking algorithm is impractical for the node counts of current supercomputers. Breadth-first search strategies have been used to find minimal distance deadlock-free up*/down* routes in networks of workstations [20], [21], but no attempt has been made to optimize the global set of routes for a particular communication pattern. Brute-force and greedy strategies have also been used to optimize multicast traffic [1]. Whereas our work also tries to minimize contention for a many-sources many-destinations set, it is fundamentally different because we cannot assume that the source data is the same for different destinations. Finally, we have not found any specific pattern-aware routing scheme for slimmed networks, except the obvious adaptations of the general ones discussed above for k-ary n-trees.

An orthogonal work for more general network topologies, Application Specific Routing (APSRA) [17], uses the communication pattern to remove channel dependencies so that more deadlock-free paths can be found. In XGFTs, all minimum paths are deadlock free.

In contrast to the aforementioned works, this one tries to optimize routing to adequately manage the communication patterns of HPC applications. Such patterns are much more general than permutations. Our main goal is to obtain a global set of fixed routes for the entire application execution if possible, or at least for a reasonable time span, as globally re-programming the routing tables in the switches and adapters can be very costly, in the order of seconds [8].

3. PATTERN-AWARE ROUTING

A pattern-aware routing scheme takes the connectivity matrix M (N × N) of a communication pattern C as input and produces an optimized set of routes for this pattern as output. The connectivity matrix M of C records its set of connections with elements $m_{ij} \neq 0$ iff the connection (i → j) ∈ C. The actual value of $m_{ij}$ can represent a useful cost metric of (i → j) such as, for example, the number of bytes. The connectivity matrix of a permutation has at most N non-zero elements, namely, a single non-zero element per row, such that no two non-zero elements are in the same column.
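A small Python sketch of the connectivity matrix and the permutation test follows. The `(source, destination, bytes)` transfer format and the function names are assumptions made for illustration; the byte count is only one possible choice for the matrix entries.

```python
def connectivity_matrix(n, transfers):
    """Build an n x n matrix whose entry [s][d] accumulates the bytes sent s -> d."""
    M = [[0] * n for _ in range(n)]
    for s, d, nbytes in transfers:
        M[s][d] += nbytes
    return M

def is_permutation(M):
    """True if no row and no column holds more than one non-zero element."""
    n = len(M)
    rows_ok = all(sum(1 for v in row if v != 0) <= 1 for row in M)
    cols_ok = all(sum(1 for r in range(n) if M[r][c] != 0) <= 1 for c in range(n))
    return rows_ok and cols_ok

ring = connectivity_matrix(3, [(0, 1, 100), (1, 2, 50), (2, 0, 10)])
print(is_permutation(ring))    # each node sends to one distinct destination

fanout = connectivity_matrix(3, [(0, 1, 100), (0, 2, 5)])
print(is_permutation(fanout))  # node 0 sends to two destinations
```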

The connectivity matrix is built from a set of connections within a certain time span of the execution of the application that could range from one instantaneous moment to a complete communication phase or the entire application. The connectivity matrix has no timing information. The information about when each communication started, or how long it took, is lost. Different executions of the same application will probably experience different communication timings. However, the structure of the connectivity matrix will be almost the same across different runs with the same number of processors, as evidenced by our own experimental analysis and by [9].

Whereas much effort has been devoted to minimizing path lengths while guaranteeing deadlock freedom, our work focuses on computing an optimized routing table to increase performance by reducing network fabric contention. In order to do that, we will define a cost function that only accounts for network contention and eliminates the effect of endpoint contention.

3.1 Cost function: Offline Contention Metric

We can differentiate between two kinds of contending messages in an application: those that contend for the network adapter because they were produced by or are going to be consumed at the same node (assuming that each node has a single network adapter), and those that were injected by different nodes and compete to go through some switch port. A routing scheme by itself can only address the latter kind.

We have devised a cost metric for every switch port p that solely accounts for the network contention. We define

routes(p) = {(s → d), such that (s → d) uses p}

as the set of routes that go through port p. The cost function is computed as follows:

srcs(p) = {s | ∃(s → d) ∈ routes(p)} (2)

dsts(p) = {d | ∃(s → d) ∈ routes(p)} (3)

cost(p) = min(|srcs(p)|, |dsts(p)|) (4)

The Max-flow Min-cut theorem tells us that for a fully connected graph connecting sources (2) to destinations (3), the maximum flow of the set routes(p) is achieved by the minimum cut. Assuming all flows to be 1, the minimum cut corresponds to (4). In our case, the computed maximum flow (4) has to go through a single port p. Hence, the bandwidth loss of sharing port p in comparison to a fully connected network is at most (4). The cost function has a useful secondary property: it assigns low costs to very busy ports that would benefit very little from a fully connected network because the contention is concentrated at the endpoints.
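Equations (2) to (4) translate directly into code. In this sketch a route set is represented as a Python set of `(s, d)` tuples, which is an illustrative choice rather than the authors' data structure.

```python
def port_cost(routes):
    """cost(p) = min(|srcs(p)|, |dsts(p)|) for the routes crossing one port."""
    srcs = {s for s, _ in routes}   # Equation (2)
    dsts = {d for _, d in routes}   # Equation (3)
    return min(len(srcs), len(dsts))  # Equation (4)

# Three sources converging on one destination: the port is busy, but the
# contention is at the endpoint, so the cost stays at 1.
print(port_cost({(0, 5), (1, 5), (2, 5)}))

# Two unrelated connections sharing the port: genuine fabric contention.
print(port_cost({(0, 4), (1, 5)}))
```

The first case shows the secondary property mentioned above: a many-to-one port is not penalized beyond a single flow, because the receiving adapter would serialize the messages anyway.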

Finally, the global cost function of a partial or a complete route $r = \langle r_0, \ldots, r_{n-1} \rangle$, where n is the number of hops, is the maximum of the cost functions of the individual ports p that it traverses (5). The global cost function of a complete route includes both the upward and the downward path:

\[ \mathrm{Global\ cost}(r) = \max_{i=0}^{n-1} \mathrm{cost}(r_i). \tag{5} \]

The two heuristics we present next try to find a single routing table that minimizes the maximum global cost function for the entire connectivity matrix.
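The objective can be evaluated for any candidate routing table as sketched below. The representation of a routing table as a dictionary from `(s, d)` to a sequence of port identifiers, and the port names in the example, are hypothetical.

```python
def port_cost(routes):
    """min(|srcs|, |dsts|) over the connections crossing one port, as in (4)."""
    return min(len({s for s, _ in routes}), len({d for _, d in routes}))

def build_port_routes(routing):
    """Invert a routing table: port id -> set of connections using that port."""
    port_routes = {}
    for conn, route in routing.items():
        for p in route:
            port_routes.setdefault(p, set()).add(conn)
    return port_routes

def global_cost(route, port_routes):
    """Equation (5): the maximum port cost along one route."""
    return max((port_cost(port_routes[p]) for p in route), default=0)

def worst_global_cost(routing):
    """The quantity to minimize: the largest global cost over all connections."""
    port_routes = build_port_routes(routing)
    return max((global_cost(r, port_routes) for r in routing.values()), default=0)

shared = {(0, 2): ("up0", "down2"), (1, 3): ("up0", "down3")}
spread = {(0, 2): ("up0", "down2"), (1, 3): ("up1", "down3")}
print(worst_global_cost(shared), worst_global_cost(spread))
```

The two routing tables serve the same connections; only the second avoids sharing an upward port between unrelated flows.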

3.2 Non-backtracking Best-first Search with Branch and Bound Heuristic (BeFS)

BeFS is a greedy heuristic that takes the connections of M and, for each one, finds a route using a Best-first Search heuristic. A Breadth-first Search can be thought of as a Best-first Search with ties in the priority queue resolved as FIFO and equal costs for all search nodes [18]. We describe the internals of this heuristic below.

BeFS Heuristic. Input: connectivity matrix, M. Output: optimized routes for M according to Global cost (5).

Step 1: Initialization. Insert elements (s → d) with M(s, d) ≠ 0 into the list L sorted by source node. Initialize the set of routes found: S = ∅. Initialize the port annotations (routes) $P_i$ = ∅, where i is a global network port identifier, with 0 ≤ i ≤ (I · K), I being the number of inner switches and K being that of ports per switch.

Step 2: for each (s → d) ∈ L, perform BeFS(s → d) with Branch and Bound to find the first route r with a value for (5) close to 0 or the minimum that can still be achieved. Update S = S ∪ r, and $P_i$ = $P_i$ ∪ r, ∀i | r uses port i. An accepted route is never backtracked. Finding the paths inside the BeFS(s → d) can result in backtracking. Subsequent calls to BeFS(s → d) will use the globally updated port annotations $P_i$ internally.
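The outer loop of the two steps above can be sketched as follows. This is a deliberate simplification: instead of the Best-first Search with Branch and Bound, it scans an explicit list of candidate routes per connection and keeps the first one with the lowest resulting global cost. The `candidates` callback and the port names are hypothetical.

```python
def port_cost(routes):
    """min(|srcs|, |dsts|) over the connections crossing one port, as in (4)."""
    return min(len({s for s, _ in routes}), len({d for _, d in routes}))

def greedy_assign(connections, candidates):
    """Assign one route per connection, never revisiting an accepted route.

    candidates(s, d) must return a list of candidate routes, each a tuple of
    port identifiers (a stand-in for the search over the topology).
    """
    port_routes = {}   # port id -> set of (s, d), the annotations P_i
    chosen = {}        # the set of accepted routes S
    for s, d in sorted(connections):          # list L sorted by source node
        def resulting_cost(route):
            # Global cost (5) of the route if (s, d) were added to its ports.
            return max(port_cost(port_routes.get(p, set()) | {(s, d)})
                       for p in route)
        best = min(candidates(s, d), key=resulting_cost)  # first minimum wins
        chosen[(s, d)] = best
        for p in best:
            port_routes.setdefault(p, set()).add((s, d))
    return chosen

two_uplinks = lambda s, d: [("upA",), ("upB",)]
print(greedy_assign({(0, 2), (1, 3)}, two_uplinks))
```

In the example the second connection is steered to the free uplink because reusing `upA` would raise its cost to 2.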

The function BeFS(s → d) searches a path from s to d by inserting reachable ports from the currently inspected switch or node into the priority queue. At each step towards a solution, the port that is first in the priority queue (ties are resolved in a FIFO manner) is expanded. When all costs are equal, the algorithm behaves as a traditional Breadth-first Search. The search stops with the first route found that achieves a minimum value for (5). If a partial route already has a higher value of (5) than the best complete route found for (s → d), the partial route is not further expanded.

Because a solution to BeFS(s → d) reaching an optimal value for (5) is not unique, the order in which routes are found (which depends on the ordering of the priority queue) is relevant. To reproduce the algorithm, it is therefore crucial to define the ordering of the priority queue.

The priority queue has two levels of ordering: a first level based on the properties (kinds) of the ports, and a second, internal ordering based on either the FIFO ordering or the cost function. The different kinds are, from most preferable to least, the following:

(1) Ports whose assigned routes ($P_i$) all have either s as source or d as destination. Formally: either ∀(s′ → d′) ∈ $P_i$, (s = s′) or ∀(s′ → d′) ∈ $P_i$, (d = d′).

(2) Ports with no routes assigned yet.

(3) Ports whose assigned routes ($P_i$) share either s or d among other conflicting routes (those that have neither s nor d in common).

(4) All other ports with conflicting routes.

Ports of kinds 1 and 2 are in the default FIFO order. Ports of kinds 3 and 4 are ordered by the cost function (4) that would result from adding the current connection to the set of routes of that port.

The ordering of the priority queue tries to reuse the same paths as much as possible if the communication topology is sending from one to many or from many to one. Ports of kind 1 are the non-conflicting busy ports (value of cost function is 1) that will not suffer from added network fabric contention by adding the path (s → d), therefore saving links for other connections. Ports of kind 2 are the free ports: a new path will be used if any other busy path will cause contention. Ports of kind 3 are those conflicting ports that will not experience more contention by adding the path (s → d), and finally, kind 4 comprises all the conflicting ports, ordered by the number of conflicting paths, in an attempt to evenly distribute the conflicts if they cannot be avoided.

In summary, this heuristic tries to find paths that economize links without causing additional network fabric contention. This is done by leaving more room for the remaining connections by finding non-conflicting paths through free links, and eventually distributing the conflicts across ports for the conflicting connections.
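One possible reading of the two-level queue ordering is sketched below. The classification of the third kind (at least one assigned route shares s or d, while others conflict) is an interpretation of the text, and the function names are hypothetical. Python's `sorted` is stable, so ports with equal keys keep their insertion (FIFO) order.

```python
def port_cost(routes):
    """min(|srcs|, |dsts|) over the connections crossing one port, as in (4)."""
    return min(len({s for s, _ in routes}), len({d for _, d in routes}))

def port_kind(assigned, s, d):
    """Classify a port for connection (s, d); 1 is most preferable, 4 least."""
    if not assigned:
        return 2                                  # free port
    if all(s2 == s for s2, _ in assigned) or all(d2 == d for _, d2 in assigned):
        return 1                                  # busy but non-conflicting
    if any(s2 == s or d2 == d for s2, d2 in assigned):
        return 3                                  # shares s or d among conflicts
    return 4                                      # only conflicting routes

def queue_key(assigned, s, d):
    """First level: kind. Second level: FIFO (kinds 1, 2) or resulting cost."""
    kind = port_kind(assigned, s, d)
    if kind <= 2:
        return (kind, 0)
    return (kind, port_cost(assigned | {(s, d)}))

ports = [("a", {(2, 5), (3, 6)}), ("b", set()), ("c", {(0, 5)})]
ordered = sorted(ports, key=lambda p: queue_key(p[1], 0, 1))
print([name for name, _ in ordered])   # preference order for connection (0 -> 1)
```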

3.3 The Colored Heuristic (Colored)

Under certain conditions for the matrix M, the problem of assigning an optimized set of routes could be formulated as a graph-coloring problem (assigning an NCA for each communicating pair). However, the general case cannot be formulated as such in a practical way because (i) it would have to be formulated as minimum weighted coloring (the weight being the cost function), and (ii) the weights depend on the coloring assignment. We have derived a heuristic that makes use of some properties specific to the recursive nature of the XGFT to approximate the original graph-coloring problem formulation. We will call this heuristic Colored.

The Colored heuristic relies on a routing property of XGFTs, proved in [3], that states that, regardless of the NCA chosen for (s → d), the relative parents of the complete route (upward and downward path) are symmetric. All complete routes in an XGFT have an odd number of hops, and the middle hop selects the NCA. Once in the NCA, it is no longer possible to choose a different set of relative parents of the smallest sub-trees down to the destination node d. The first hop downward from the NCA is determined by d, and the rest of the route will follow exactly the same relative sequence of parents (in inverse order) as the upward path.

Our algorithm explores the outgoing and incoming connections hierarchically from the leaves towards the roots of the trees. The goal is to achieve the least conflicting assignments, by level, considering all the nodes under a certain level of the tree as a cloud (SuperNode) that sends and receives messages. At level 0, there are as many clouds as leaf nodes; at level 1, there will be $N/m_1$ clouds of $m_1$ nodes each. At each level from 0 to (h − 1), the algorithm will assign some or all of the $w_{l+1}$ parents to the different outgoing or incoming connections of the cloud. The assignment is done per level such that the cost function (5) is minimized for all connections. Note that the sequences of parents chosen in going up the network fabric are the resulting routes. The implementation of the algorithm distinguishes between sending and receiving communications of the cloud. We denote SourceGroups and TargetGroups as the sets of sending and receiving connections, respectively.

SourceGroups and TargetGroups are ordered, and the assignment of parents is done group by group. The benefit of doing the assignment of parents (partial routes) by SourceGroups and TargetGroups is that the communication topology is taken into account. The embedding of routes into the physical topology is done in such a way that it optimizes both the conflicts and the use of resources. This is achieved by concentrating the contention of the “sending” nodes (SourceGroups) in the upward paths, and the contention of the “receiving” nodes (TargetGroups) in the downward paths.

Next, we introduce some definitions and, after that, the Colored heuristic.

Definition 1: A SuperNode $N^l_i$ of level l is the set that contains the nodes belonging to a sub-tree of level l. Node i is SuperNode $N^0_i$, nodes connected to the first level (l = 1) switch i constitute SuperNodes $N^1_i$, and so on. SuperNodes capture the recursive nature of the physical topology.

Definition 2: The SourceGroup $S^l_i$ is the set of communicating pairs $(s_k \to d_k)$ whose sources belong to SuperNode $N^l_i$ and whose destinations belong to any other SuperNode, but not itself. Equivalently, a TargetGroup $T^l_j$ is the set of communicating pairs whose targets belong to $N^l_j$ and whose sources belong to any other SuperNode, but not itself.

Note that a SourceGroup or TargetGroup of level l does not contain the inbound communication pairs, but only those that go outbound level l. SourceGroups and TargetGroups superpose the structure of the communication topology to the physical topology.
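Definitions 1 and 2 can be illustrated with a short sketch. It assumes, hypothetically, that leaves are numbered sequentially so that a level-l SuperNode contains $m_1 \cdots m_l$ consecutive leaves; the function names are not from the original work.

```python
from collections import defaultdict
from math import prod

def supernode(node, level, m):
    """Index i of the level-l SuperNode N^l_i that contains the given leaf."""
    return node // prod(m[:level])

def groups(connections, level, m):
    """SourceGroups and TargetGroups of one level, keyed by SuperNode index."""
    source_groups, target_groups = defaultdict(set), defaultdict(set)
    for s, d in connections:
        ns, nd = supernode(s, level, m), supernode(d, level, m)
        if ns != nd:                      # pairs internal to a SuperNode are skipped
            source_groups[ns].add((s, d))
            target_groups[nd].add((s, d))
    return dict(source_groups), dict(target_groups)

m = [2, 2]                                # four leaves, two per first-level switch
pattern = {(0, 1), (0, 2), (1, 3), (2, 0)}
print(groups(pattern, 1, m))              # (0, 1) stays inside SuperNode 0
```

At level 1 the pair (0 → 1) does not leave its SuperNode and therefore appears in no group, which matches the note that only outbound pairs are considered.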

Definition 3: A route for s → d is defined as the sequence of selected intermediate parents $\langle r_0, r_1, \ldots, r_{(l-1)} \rangle$ to one of the NCAs at level l connecting s and d. The $r_*$ elements are specific for each communication pair, but will be omitted for brevity.

The available routes for s → d depend on the number of roots. Given two nodes (s, d) whose common level is $h_{s,d}$, they can choose from as many as $\prod_{i=0}^{h_{s,d}} w_i$ roots. If the route from s → d has already been set up to level $l < h_{s,d}$, the set of available roots at level $h_{s,d}$ gets restricted to $\prod_{i=l}^{h_{s,d}} w_i$. Moreover, two assignments $\langle r_0, r_1, \ldots, r_l \rangle$ and $\langle r'_0, r'_1, \ldots, r'_l \rangle$ for any two communicating pairs of any SourceGroup or TargetGroup will share the restricted set of common roots only if they have a common prefix: $\forall i \le l,\ r_i = r'_i$.

Algorithm Colored Heuristic. Input: connectivity matrix, M. Output: optimized routes for M according to Global cost (5).

Step 1: for each level l, 0 < l < (h − 1) of the XGFT, build a list $L_l$ containing the SourceGroups and TargetGroups of level l ordered by their cardinality, i.e., by the number of outgoing/incoming communications in each group. The order is such that the groups (SourceGroup or TargetGroup) that potentially need more resources will be analyzed first and will direct the future assignments of roots for the remainder of the communicating pairs.

Step 2: Analysis of $L_l$: Analyze every $G^l_i \in L_l$ ($G^l_i$ being either an $S^l_i$ or a $T^l_i$). $G^l_i$ is a collection of communicating pairs. The assignments $r^j_l$ made to the communicating pairs $(s_j \to d_j)$ are tracked independently of $G^l_i$. When a $G^l_i$ is analyzed, communicating pairs are queried for their assignment of $r^j_l$. Some communicating pairs might have $r^j_l$ assigned from a previous analysis of another group, and others might not.

Step 3: Analysis of $G^l_i$: for each set $S = \{(s \to d) \in G^l_i\}$ sharing the same route prefix, choose an $r_l$ for a communicating pair. To decide which root $r_l$ from the available roots $(0, \ldots, w_{(l-1)})$ should be assigned to each communicating pair, a matrix R is built. The rows of the matrix are labeled by the individual communicating pairs and the columns represent the available parents $0, \ldots, (w_l - 1)$. The matrix is filled in with a weight indicating how good (positive value) or how bad (negative value) it is to choose a parent in column k for row j.

Step 4: Rules to fill row j of matrix R, for SourceGroup or TargetGroup. Given row j:

$(s_j \to d_j, \langle r^j_0, \ldots, r^j_{l-1} \rangle) \in G^l_i$,

we will call $S^l_{s_j}$ the SourceGroup containing $s_j$ as a source and $T^l_{d_j}$ the TargetGroup containing $d_j$ as a destination.

(1) Communicating pairs $(s_k \to d_k)$ for SourceGroup $S^l_{s_j}$ with a root $r^k_l$ assigned are analyzed:

• For each communicating pair k, with $d_j \neq d_k$, and a previously assigned root $r^k_l$, penalize the column $r^k_l$ heavily.

• For each communicating pair k, with $d_j = d_k$, and a previously assigned root $r^k_l$, increase the preference of the column $r^k_l$.

(2) Communicating pairs $(s_k \to d_k)$ for TargetGroup $T^l_{d_j}$ with a root $r^k_l$ assigned are analyzed:

• For each communicating pair k, with $s_j \neq s_k$, and a previously assigned root $r^k_l$, penalize the column $r^k_l$ heavily.

• For each communicating pair k, with $s_j = s_k$, and a previously assigned root $r^k_l$, increase the preference of the column $r^k_l$.

The row j having the highest positive value for the entire matrix in column k will be chosen: $r_l = k$ will be set for $s_j \to d_j$. Step 4 is repeated with the non-assigned communicating pairs as long as there are pairs still remaining in S. At the first iteration of the algorithm, all values of matrix R will be 0, and the parent chosen will be the one with the smaller index, i.e., 0. As the algorithm iterates, more information to choose the roots becomes available.

At some point after Step 3, a particular $G^l_i$ will have all terms in the matrix with negative numbers (conflicting). When this happens, the current assignments for $r_l$ are removed, and the group is given a second chance by putting it again at the end of the list L to be analyzed. The aim of this is that as the algorithm progresses, there will be more information to assign the roots to better balance the conflicts. This second chance (a kind of backtracking) is only given once to each group, and only if all elements in the matrix were negative (conflicting). In the worst case, each group is analyzed twice.
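The row-filling rules of the matrix R can be sketched as follows. The text only says "penalize heavily" and "increase the preference", so the numeric weights (`penalty=-10`, `bonus=1`) are assumptions, as are the function names and the dictionary `assigned` that maps already routed pairs to their parent at the current level.

```python
def fill_row(pair, source_group, target_group, assigned, num_parents,
             penalty=-10, bonus=1):
    """Weights of each candidate parent (column) for one communicating pair."""
    sj, dj = pair
    row = [0] * num_parents
    # Rule for the SourceGroup of s_j: share a parent only when the
    # destination is the same, otherwise avoid it.
    for sk, dk in source_group:
        if (sk, dk) != pair and (sk, dk) in assigned:
            row[assigned[(sk, dk)]] += bonus if dk == dj else penalty
    # Rule for the TargetGroup of d_j: share a parent only when the
    # source is the same, otherwise avoid it.
    for sk, dk in target_group:
        if (sk, dk) != pair and (sk, dk) in assigned:
            row[assigned[(sk, dk)]] += bonus if sk == sj else penalty
    return row

def best_parent(row):
    """Column with the highest weight; ties go to the smallest index."""
    return max(range(len(row)), key=lambda c: row[c])

assigned = {(1, 5): 0, (2, 4): 1}          # parents already chosen at this level
row = fill_row((0, 4),
               source_group={(0, 4), (1, 5)},
               target_group={(0, 4), (2, 4)},
               assigned=assigned, num_parents=3)
print(row, best_parent(row))
```

With nothing assigned yet, every weight is 0 and `best_parent` returns column 0, which matches the behaviour described for the first iteration.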

The complexity of this algorithm does not increase exponentially because at each level l (i) the communicating pairs that do not go outbound are ignored, and (ii) for each SourceGroup or TargetGroup, only the communicating pairs sharing the same prefix of assigned roots up to level l will be analyzed together. The time needed to compute the pattern-aware routing tables for the applications studied does not exceed 8 sec in the worst case. Typical run times for these HPC applications are several hours.

4. WORKLOADS AND EXPERIMENTAL METHODOLOGY

In this section, the applications chosen as benchmarks are described and the employed methodology and tools are presented.

4.1 Applications

Most of the research done in routing has focused on maximizing the throughput of synthetic, flow-controlled, generated traffic. In this work, we simulate the MPI level of execution traces of the following applications:

1. WRF (Weather Research and Forecasting) is a numerical weather prediction system designed to serve the atmospheric research community. We include results with 256 processors (WRF-256).

2. Alya is a Finite Element Method (FEM) solver code. Alya uses the Metis partitioning library to balance the workload among threads. We include results with 101, 200 and 201 processors (Alya-101, Alya-200, and Alya-201). We have also executed a replay of Alya-200 replacing the synchronous Send calls with asynchronous ones; we will refer to this run as Alya-200Isends.

3. The NAS Parallel Benchmarks (NPB) is a set of pseudo-applications and numerical kernels designed to compare the performance of HPC machines. We present the results for Conjugate Gradient (CG) from the NPB suite, which is one of the most demanding in terms of point-to-point communication performance. We include here results with 128 processors for data-set class D: CG.D-128.

4.2 Tools and Experimental Framework

To study the effect of the routing scheme on network contention, we have used two coupled simulators: Venus and Dimemas. Venus is an event-driven simulator developed at the IBM Zurich Lab that is able to simulate any generic network topology of nodes, switches and wires at the flit level. It can simulate the full range of XGFTs as well as many other topologies. Dimemas [11] is an MPI simulator driven by a post-mortem trace of a real application execution. The trace contains the MPI calls the application performed, which in turn include the communication pattern as well as the causal relationships between messages. Dimemas reconstructs the temporal behavior according to a parametric bus network model.

We have implemented a co-simulation approach between Venus and Dimemas to substitute the default network model from Dimemas with the detailed network model from Venus [14]. We have used an input/output buffered switch model, a link speed of 2 Gbit/s, a flit size of 8 bytes, and a segment size of 1 KB with a round-robin interleaving of messages at the network adapter.

We have obtained execution traces from runs of the selected applications. Dimemas was fed with the execution trace, relying on Venus to do the detailed network simulation of the communications. We extracted the connectivity matrix M (source-destination pairs) for each communication phase. For each topology under study (instantiations of XGFTs) we fed our routing algorithms with (i) the connectivity matrix, (ii) the topology file, and (iii) the mapping of processes to nodes (sequential). The routes obtained were then supplied, along with the topology and mapping, to the Venus simulator.
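As a rough illustration of the first step of this workflow, the sketch below builds one 0/1 connectivity matrix per communication phase from a list of messages and then picks out the pairs that leave their first-level switch. The trace format `(phase, src, dst)`, the function names, and the toy sizes are assumptions made for this example; they are not the Dimemas trace format or the authors' tooling.

```python
def connectivity_matrices(n_procs, messages):
    """Return {phase: M}, where M[src][dst] == 1 if src sends to dst.

    messages is an iterable of (phase, src, dst) tuples (assumed format).
    """
    phases = {}
    for phase, src, dst in messages:
        if phase not in phases:
            phases[phase] = [[0] * n_procs for _ in range(n_procs)]
        phases[phase][src][dst] = 1
    return phases

def outbound_pairs(M, nodes_per_switch):
    """Pairs whose endpoints sit on different first-level switches,
    assuming a sequential mapping of processes to nodes."""
    return [(s, d)
            for s, row in enumerate(M)
            for d, bit in enumerate(row)
            if bit and s // nodes_per_switch != d // nodes_per_switch]

# Toy trace: 8 processes, 4 nodes per switch, two phases.
trace = [(0, 0, 1), (0, 1, 0), (1, 0, 4), (1, 4, 0)]
phases = connectivity_matrices(8, trace)
local_phase = outbound_pairs(phases[0], 4)
remote_phase = outbound_pairs(phases[1], 4)
```

Only the pairs returned by `outbound_pairs` have to climb the tree, so only they are relevant to the choice of roots.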

For better comparison, we have scaled the reported times against the time taken by a single ideal full-crossbar connecting all the nodes. Simulating a full-crossbar with hundreds of ports provides the best performance that can be obtained in the absence of network contention. A full-crossbar does not need any routing algorithm.

5. EVALUATION

In this section we present the results obtained for non-slimmed and slimmed networks. We will address here the question posed in Section 1 of how much a network can be trimmed without incurring a significant performance degradation. We will take into account the effect of the routing schemes, comparing several well-known and some custom techniques. For the slimmed networks, we will analyze every application in greater detail. Finally, we will summarize the results to draw overall conclusions on the joint effects of topology and routing decisions.

5.1 Non-slimmed networks

Figure 2 shows the relative degradation of the various routing schemes for the communication patterns of the applications we have evaluated. WRF performance is almost identical for all routing schemes, except for Random, which reduces it by more than a factor of 3. CG is also noticeably affected by the choice of the routing scheme: here S-mod-k, D-mod-k and BeFS are unable to obtain the optimum performance for CG, which only Colored achieves.

The communication phases of Alya-101, Alya-201 and Alya-200Isends exhibit almost no difference for the different routing schemes compared with the Full Crossbar. None of the five routing schemes is able to achieve the Full Crossbar performance for Alya-200, and they all perform almost identically. A deeper analysis of the Alya-200 communication pattern reveals that the routing schemes are already close to the optimum that is achievable. The chain of several synchronous sends in the implementation causes dependencies that can only be resolved by appropriately scheduling the sends, which exceeds the scope of the routing scheme.

[Figure 2 plot: "Comparison of Routing schemes for non-blocking networks". Y axis: slowdown (0 to 4). X axis: WRF-256, Alya-101, Alya-201, Alya-200, Alya-200Isends, CG-128. Series: Random, S-mod-k, D-mod-k, BeFS, Colored.]

Figure 2: Routing schemes in Non-slimmed Networks vs. Full Crossbar (no routing)

Figure 2 shows that pattern-aware routing schemes work well with complete k-ary n-tree topologies. Random static is probably the least advisable routing technique, despite the fact that most of the literature is in favor of it. This recommendation seems to be mostly based on research done using continuous injection of synthetic traffic. As we can see, static routing techniques, including S-mod-k, which was proposed in the first works [12] [16] on these topologies, do reasonably well for non-slimmed networks.

5.2 Slimmed trees

When using the slimmed-tree versions, which constrain the availability of paths, the routing problem plays a more important role. So far, routing techniques in regular slimmed-trees have not been studied using detailed simulations of the actual traffic generated by applications. The following subsections show the results obtained for the applications studied in this paper.

5.2.1 WRF

The WRF communication pattern consists of a pairwise exchange between the neighboring nodes in a 2-D mesh (±1, ±16 nodes away). The phase analyzed here is the non-local phase (±16 nodes away).
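To make the non-local phase concrete, the following sketch generates the +16 exchange for 256 processes and counts how many distinct sources share each up-link of a first-level switch in the complete 16-ary 2-tree. It rests on several assumptions that are not taken from the paper: a sequential mapping, a non-periodic mesh, only the +16 direction, and only the up-links being examined. D-mod-k is modeled as "up-port = d mod k" and S-mod-k, by analogy, as "up-port = s mod k"; all function names are hypothetical.

```python
from collections import Counter

K = 16   # radix: nodes per first-level switch
N = 256  # processes in WRF-256

def nonlocal_pairs(n=N, stride=K):
    """(source, destination) pairs of the +stride exchange."""
    return [(s, s + stride) for s in range(n - stride)]

def max_uplink_load(pairs, port_of):
    """Largest number of flows sharing one (switch, up-port) link."""
    load = Counter()
    for s, d in pairs:
        if s // K != d // K:              # only traffic leaving the switch
            load[(s // K, port_of(s, d))] += 1
    return max(load.values())

def d_mod_k(s, d):
    return d % K

def s_mod_k(s, d):
    return s % K

pairs = nonlocal_pairs()
worst_d = max_uplink_load(pairs, d_mod_k)
worst_s = max_uplink_load(pairs, s_mod_k)
```

Under these assumptions both schemes place exactly one flow on every used up-link, which is in line with WRF performing almost identically for the regular routing schemes in the non-slimmed network.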

In Figure 3(a) we see how the routing algorithms perform when we slim the tree and use fewer switches at the second level. The first point (w2 = 16 middle switches) corresponds to the histogram in Figure 2 for the WRF case (non-slimmed networks). The X axis records the topology, whereas the Y axis is the slowdown with respect to a Full Crossbar. The network with only 1 middle switch (rightmost, w2 = 1) can be considered as the worst case. In this minimum-cost network, routing decisions do not matter because there is only one path between each pair of nodes. The following effect can be observed: if a single middle switch is taken out (w2 = 15 middle switches), the duration of the communication phase doubles, but if we take out 2, 3, 4, or even 8 middle switches, none of the D-mod-k, S-mod-k, and Colored algorithms suffers any additional decrease in performance. The performance degradation for these three routing schemes exhibits a step-wise behavior: once an additional switch has been removed, several other switches can be removed as well without degrading the performance further. The vertical lines in Figure 3(a) are placed between the steps of the most efficient algorithm. The BeFS approach is very sensitive to the slimming, and the Random approach has a high variability.

[Figure 3(a) plot: "WRF, Progressive tree-slimming". Y axis: slowdown (1 to 15). X axis: value of w2 (#middle switches) for XGFT(2;16,16;1,w2), from 16 down to 1. Series: Full-Crossbar, Random, S-mod-k, D-mod-k, BeFS, Colored.]

(a) WRF: slimmed trees with 32-port switches, m1, m2 = 16

[Figure 3(b) plot: "WRF, Progressive tree-slimming". Y axis: slowdown (1 to 15). Bottom X axis: values of w2,w3 for XGFT(3;8,8,4;1,w2,w3), from 8,8 down to 1,1. Top X axis ("Inner Switches"): total number of switches of each topology, from 128 (w2,w3 = 8,8) down to 37 (w2,w3 = 1,1).]

(b) WRF: slimmed trees with 16-port switches, m1, m2 = 8, m3 = 4

Figure 3: Routing schemes in slimmed versions of (a) 16-ary 2-trees and (b) 8-ary 3-trees, for WRF-256. The x axis indicates the number of inner switches of the progressively slimmed topologies with parameters (w2, w3, ...) of a corresponding slimmed XGFT from the complete k-ary n-tree.

Figure 3(b) shows the relative performance degradation for progressive slimming (of the second and third levels) of an XGFT(3; 8, 8, 4; 1, w2, w3). The top X axis shows the total number of switches of the corresponding topology in the bottom X axis. The case for w2 = w3 = 8 corresponds to the 8-ary 3-tree. We draw similar conclusions as for the case with 16 middle switches: Random is not advisable, whereas S-mod-k, D-mod-k and Colored exhibit stable behavior and a performance close to the optimum achievable for each slimmed topology, close to that of the Full Crossbar. We note that S-mod-k, D-mod-k, and Colored manage to achieve the performance of the first step down to the slimmed topology with w2 = 4, w3 = 2, using only 56 switches for WRF-256 (in comparison to the 128 switches used by topology w2 = w3 = 8). A closer look at Figure 3(b) shows that from configuration (w2 = 7, w3 = 7) to configuration (w2 = 4, w3 = 1), only the configurations with w3 = 1, i.e., a single top-most level switch per sub-tree, suffer a degradation of more than a factor of four. WRF needs very little connectivity at the top-most level, but it needs w3 ≥ 2; otherwise, its performance halves again. If we could accept a performance degradation of four times that of a Full Crossbar for WRF, we could choose the configuration w2 = 3, w3 = 1, with only 47 switches. However, the penalty incurred by a bad routing scheme would be huge.
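The switch counts quoted above (128, 56 and 47) can be reproduced with a short calculation. The sketch assumes the usual XGFT definition, in which level l of XGFT(h; m1..mh; w1..wh) contains (m_{l+1} · ... · m_h) · (w_1 · ... · w_l) switches; the function name `xgft_switches` is hypothetical.

```python
from math import prod

def xgft_switches(m, w):
    """Total number of switches in XGFT(h; m[0..h-1]; w[0..h-1]).

    Level l (1-based) holds prod(m[l:]) * prod(w[:l]) switches;
    the leaf nodes themselves are not counted.
    """
    h = len(m)
    return sum(prod(m[l:]) * prod(w[:l]) for l in range(1, h + 1))

def slimmed_3level(w2, w3):
    """Switch count for XGFT(3; 8,8,4; 1,w2,w3) as used for WRF-256."""
    return xgft_switches([8, 8, 4], [1, w2, w3])

full_tree = slimmed_3level(8, 8)    # complete 8-ary 3-tree
first_step = slimmed_3level(4, 2)   # last topology on the first step
cheapest = slimmed_3level(3, 1)     # accepts a 4x slowdown
```

For this family the formula reduces to 32 + 4·w2 + w2·w3, so each removed second-level parameter step saves both second-level and top-level switches.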

5.2.2 Alya

Figure 4 shows three executions of Alya with different numbers of processors (101, 201 and 200), whereas Figure 4(d) shows the Alya-200 case executed with asynchronous calls (MPI_Isend) instead of the synchronous ones (MPI_Sendrecv).

The results for Alya-101 and Alya-201 (Figures 4(a) and 4(b)) show that the routing schemes studied have almost no impact on Alya's performance with even as few as w2 = 6 middle switches. With fewer middle switches, only the BeFS routing scheme performs badly. Alya-200 (Figure 4(c)) shows that most routing schemes except BeFS achieve a performance close to that of a Full Crossbar.

The low importance of the routing algorithm and a performance close to that of a Full Crossbar might suggest that most communications are local. However, that is not the case: the communication pattern has many non-local communications. The small effect of routing decisions on the performance of the communication phases of all variants of Alya is instead related to the implementation of the communication phase, which uses synchronous send/receive calls and thereby serializes each of the data exchanges between the nodes, underutilizing the network. We have tested how this communication pattern would perform if all calls were turned into asynchronous ones. This change is possible for this application because the causal dependencies introduced by the blocking nature of the MPI_Sendrecv calls are not inherent to the algorithm, but to the implementation alone.

Figure 4(d) shows the performance results normalized to the completion time for Full Crossbar for the synchronous case. In comparison with Figure 4(c), the performance with the implementation change doubles. However, the variability in the performance of the routing schemes becomes only slightly more noticeable. The communication pattern of Alya is limited by endpoint contention, which the routing scheme cannot mitigate.

As evidenced by Figures 4(c) and 4(d), most routing schemes do reasonably well with as few as w2 = 8 middle switches for the synchronous and w2 = 11 for the asynchronous case. Colored manages to achieve a stable performance with as few as five middle switches for both the synchronous and the asynchronous case.

5.2.3 CG

The results for CG are plotted in Figure 5. CG has a communication pattern that consists of five exchanges of equal size, four of which are local to the first-level switch for the radix (see footnote 2) we have used (m1 = 16). Only the fifth phase is non-local, so whatever degradation in performance this application might suffer due to the routing decision corresponds exclusively to the fifth exchange phase. It can be seen that all routing schemes, except Colored, entail a huge performance degradation. The fifth phase of CG performs exchanges with destinations whose differences are multiples of the radix, which is precisely the kind of exchange that the D-mod-k algorithm cannot route without conflicts. When the tree is slimmed, i.e., with w2 = 15 middle switches, the best possible performance can no longer be obtained. At least one non-local communication will suffer contention, doubling the time of this fifth phase, and therefore increasing the ideal time, i.e., that of a Full Crossbar, by 1/5. Colored can route the pattern with only nine middle switches without increased performance degradation. With eight middle switches, a new conflict arises, therefore increasing the total time by an additional 1/5.
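The arithmetic behind these fractions can be written out as follows, assuming (as stated above) five exchanges of equal duration on a Full Crossbar. The symbols t (duration of one exchange) and T (total communication time) are introduced here only for illustration, and reading the "additional 1/5" as a third serialized transfer in the fifth phase is an interpretation.

```latex
\begin{align*}
T_{\text{crossbar}} &= 5t \\
T_{w_2 = 15,\dots,9} &= 4t + 2t = 6t = \left(1 + \tfrac{1}{5}\right) T_{\text{crossbar}} \\
T_{w_2 = 8} &= 4t + 3t = 7t = \left(1 + \tfrac{2}{5}\right) T_{\text{crossbar}}
\end{align*}
```

The four local phases are unaffected by slimming, so every extra conflict in the fifth phase adds one more t to the total.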

[Figure 5 plot: "CG, 128 processors, Progressive tree-slimming". Y axis: slowdown (0 to 5). X axis: value of w2 (#middle switches) for XGFT(2;16,16;1,w2), from 16 down to 1.]

Figure 5: CG.D: 128 processors

We performed a detailed analysis of the performance degradation incurred by D-mod-k for the non-slimmed case (w2 = 16), i.e., a k = 16-ary 2-tree. There is no contention in the first four phases, which are local to the switch. However, the degradation caused by the fifth phase (all phases transfer an equal number of bytes, namely 750 KB) accounts for a slowdown of more than a factor of two. The simulated trace reveals that this last phase takes eight times longer with D-mod-k routing. This is due to the nature of the communication pattern of CG: each processor s inside a switch communicates with a processor d, where

$$d = \left\lfloor \frac{s}{2} \right\rfloor \cdot 16 + (s \bmod 2). \qquad (6)$$

D-mod-k routing will choose r1 = (d mod 16) as the first local port going up into the tree. Given (6), r1 can only be either 0 for the eight sources within a switch where s ≡ 0 (mod 2), or 1 for the other eight sources, where s ≡ 1 (mod 2).

Footnote 2: The radix is the k parameter of a k-ary n-tree.

[Figure 4 plots, four panels, each showing slowdown (Y axis) against the number of middle switches w2 from 16 down to 1 (X axis). Panel titles: "Alya, 101 processors, Progressive tree-slimming" (a); "Alya, 200 processors, using Isends, Progressive tree-slimming" (d). Y range: 0 to 1.6 in (a) and (b), 0 to 3 in (c) and (d). Series: Full-Crossbar, Random, S-mod-k, D-mod-k, BeFS, Colored; panel (d) additionally shows Full-Crossbar for Alya 200 without Isends.]

(a) Alya 101

(b) Alya 201

(c) Alya 200

(d) Alya 200 (using Isends)

Figure 4: (a) Alya-101 (b) Alya-201 (c) Alya-200 (d) Alya-200-ISends (using Immediate Sends) normalized to the performance of a full crossbar with synchronous sends from (c).
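The D-mod-k argument for the fifth CG phase can be checked numerically. The sketch below applies equation (6) literally to the sixteen values s = 0..15 of one switch (reading the fraction s/2 as an integer division, which is an assumption) and tabulates the up-port r1 = d mod 16. The slowdown estimate further assumes five phases of equal duration and that the eight flows sharing a port are fully serialized; the names used are hypothetical.

```python
from collections import Counter

K = 16  # radix of the 16-ary 2-tree

def cg_fifth_phase_dest(s):
    """Destination given by equation (6), with s/2 taken as floor division."""
    return (s // 2) * K + (s % 2)

def d_mod_k_up_port(d):
    """First up-port chosen by D-mod-k: r1 = d mod k."""
    return d % K

# Number of sources of one switch that are sent to each up-port.
port_load = Counter(d_mod_k_up_port(cg_fifth_phase_dest(s)) for s in range(K))

# Four contention-free phases plus a fifth phase serialized over the
# most heavily loaded port, relative to five ideal phases.
serialization = max(port_load.values())
estimated_slowdown = (4 + serialization) / 5
```

Only two of the sixteen up-ports are used, with eight sources each, which agrees with the fifth phase taking eight times longer and with an overall slowdown above a factor of two.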

6. CONCLUSIONS AND FUTURE WORK

In this paper, a detailed analysis of the impact of routing on communication performance has been carried out for a broad family of commonly used networks. The workload considered is based on several benchmarks and production applications. This analysis allows us to draw several conclusions for both oblivious and pattern-aware routing schemes.

Regarding oblivious routing, one of the most interesting conclusions is that a random distribution of paths is not advisable. It introduces great variability and produces bad performance. Even simple regular routings do better for a non-slimmed network. Both the S-mod-k and D-mod-k routing schemes are good, but their performance strongly depends on both the communication pattern and the application mapping. There is a group of common permutations in parallel computing that cause network conflicts and degrade performance.

Regarding pattern-aware techniques, the BeFS strategy suffers from its greediness. The first paths are routed without conflicts, but when more paths are added without the possibility of backtracking the algorithm performs poorly. The nature of the algorithm concentrates the contention in the upward paths. The Colored approach, however, which tries to optimize both the conflicting paths and the resource usage, proved useful for improving the performance of the communication phases in complete and strongly slimmed trees for all the benchmarks studied.

Routing is coupled with mapping, and production machines usually cannot offer a sequential mapping of processes to nodes, but only fragments scattered across the network. This makes the routing policies even more difficult to evaluate. We plan to continue this work to study the effect of this fragmentation on both oblivious and pattern-aware routing techniques. In addition, fault-tolerant and power-saving supercomputers could benefit from pattern-aware routing schemes, such as Colored, that try to optimally embed the communication topology of the application into the physical topology of the network.


7. ACKNOWLEDGEMENTS

This work has been partially supported by the Ministry of Science and Technology of Spain under contracts TIN-2004-07739-C02-01, TIN-2007-60625, TIN2007-68023-C02-01, the BSC-IBM MareIncognito research agreement and the HiPEAC European Network of Excellence. Part of it has been carried out during German Rodriguez's internship at IBM Zurich Research Labs. We would also like to thank Phillip Stanley-Marbell from IBM Zurich Research Labs for his thorough reading and valuable comments.

8. REFERENCES

[1] S. Coll, D. Duato, F. Petrini, and F. J. Mora. Scalable hardware-based multicast trees. In SC '03: Proc. 2003 ACM/IEEE Conference on Supercomputing, page 54, Washington, DC, USA, 2003. IEEE Computer Society.

[2] N. Desai, P. Balaji, P. Sadayappan, and M. Islam. Are Nonblocking Networks Really Needed for High-End-Computing Workloads? In Proc. 2008 IEEE International Conference on Cluster Computing, pages 152–159, Washington, DC, USA, 2008. IEEE Computer Society.

[3] Z. Ding, R. R. Hoare, A. K. Jones, and R. Melhem. Level-wise scheduling algorithm for fat tree interconnection networks. In Proc. 2006 ACM/IEEE Conference on Supercomputing, page 96, New York, NY, USA, 2006. ACM.

[4] J. Flich, M. P. Malumbres, P. Lopez, and J. Duato. Improving routing performance in Myrinet networks. In Proc. of the 14th International Parallel and Distributed Processing Symposium, pages 27–32, Los Alamitos, CA, USA, 2000. IEEE Computer Society.

[5] C. Gomez, F. Gilabert, M. Gomez, P. Lopez, and J. Duato. Deterministic versus adaptive routing in fat-trees. Proc. of the 21st Parallel and Distributed Processing Symposium, 2007, pages 1–8, Mar. 2007.

[6] R. I. Greenberg and C. E. Leiserson. Randomized routing on fat-trees. In Proc. of the 26th Annual Symposium on the Foundations of Computer Science, pages 241–249, 1985.

[7] A. Jajszczyk. Nonblocking, repackable, and rearrangeable Clos networks: fifty years of the theory evolution. Communications Magazine, IEEE, 41(10):28–33, Oct. 2003.

[8] G. Johnson, D. J. Kerbyson, and M. Lang. Optimization of InfiniBand for Scientific Applications. In Proc. of the 22nd International Parallel and Distributed Processing Symposium, pages 1–8. IEEE, 2008.

[9] S. Kamil, J. Shalf, L. Oliker, and D. Skinner. Understanding ultra-scale application communication requirements. Proc. Workload Characterization Symposium, pages 178–187, Oct. 2005.

[10] H. Kariniemi. On-Line Reconfigurable Extended Generalized Fat Tree Network-on-Chip for Multiprocessor System-on-Chip Circuits. PhD thesis, Tampere University of Technology, 2006.

[11] J. Labarta, S. Girona, V. Pillet, T. Cortes, and L. Gregoris. DiP: A parallel program development environment. In Proc. of the Second International Euro-Par Conference on Parallel Processing, volume II, pages 665–674, London, UK, 1996. Springer-Verlag.

[12] C. E. Leiserson et al. The network architecture of the Connection Machine CM-5. In Proc. of the Fourth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 272–285, San Diego, California, June 1992.

[13] X.-Y. Lin, Y.-C. Chung, and T.-Y. Huang. A multiple LID routing scheme for fat-tree-based InfiniBand networks. Proc. of the 18th International Parallel and Distributed Processing Symposium, pages 11–, 2004.

[14] C. Minkenberg and G. Rodriguez Herrera. Trace-driven Co-simulation of High-Performance Computing Systems using OMNeT++. In Proc. 2nd International Workshop on OMNeT++, held in conjunction with the Second International Conference on Simulation Tools and Techniques (SIMUTools’09), 2009.

[15] J. Navaridas, J. Miguel-Alonso, F. J. Ridruejo, and W. Denzel. Reducing complexity in tree-like computer interconnection networks. Technical Report EHU-KAT-IK-06-07, UPV/EHU, 2007.

[16] S. R. Ohring, M. Ibel, S. K. Das, and M. J. Kumar. On generalized fat trees. In Proc. of the 9th International Parallel Processing Symposium, page 37, Washington, DC, USA, 1995. IEEE Computer Society.

[17] M. Palesi, R. Holsmark, S. Kumar, and V. Catania. Application specific routing algorithms for networks on chip. IEEE Trans. Parallel Distrib. Syst., 20(3), 2009.

[18] J. Pearl. Heuristics: intelligent search strategies for computer problem solving. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1984.

[19] F. Petrini and M. Vanneschi. A comparison of wormhole-routed interconnection networks. In Proc. Third International Conference on Computer Science and Informatics, Research Triangle Park, NC, USA, Mar. 1997.

[20] J. C. Sancho and A. Robles. Improving the Up*/Down* routing scheme for networks of workstations. In Proc. 6th International Euro-Par Conference on Parallel Processing, pages 882–889, London, UK, 2000. Springer-Verlag.

[21] J. C. Sancho, A. Robles, and J. Duato. Effective strategy to compute forwarding tables for InfiniBand networks. In Proc. of the International Conference on Parallel Processing, page 48, Los Alamitos, CA, USA, 2001. IEEE Computer Society.

[22] L. G. Valiant and G. J. Brebner. Universal schemes for parallel communication. In STOC, pages 263–277. ACM, 1981.

[23] J. S. Vetter and F. Mueller. Communication characteristics of large-scale scientific applications for contemporary cluster architectures. J. Parallel Distrib. Comput., 63(9):853–865, 2003.
