13
2019 52nd IEEE/ACM International Symposium on Microarchitecture SWAP: S ynchronized W eaving of A djacent P ackets for Network Deadlock Resolution Mayank Parasar 1 , Natalie Enright Jerger 2 , Paul V. Gratz 3 , Joshua San Miguel 4 , and Tushar Krishna 1 1 School of ECE at Georgia Institute of Technology, 2 Department of ECE at the University of Toronto, 3 Department of ECE at Texas A&M University, 4 Department of ECE at the University of Wisconsin-Madison 1 [email protected] 2 [email protected] 3 [email protected] 4 [email protected] 1 [email protected] ABSTRACT An interconnection network forms the communication back- bone in both on-chip and off-chip systems. In networks, con- gestion causes packets to be blocked. Indefinite blocking can occur if cyclic dependencies exist, leading to deadlock. All modern networks devote resources to either avoid deadlock by eliminating cyclic dependences or to detect and recover from it. Conventional buffered flow control does not allow a blocked packet to move forward unless the buffer at the next hop is guaranteed to be free. We introduce SWAP, a novel mech- anism for enabling a blocked packet to perform an in-place swap with a buffered packet at the next hop. We prove that in-place swaps are sufficient to break any deadlock and are ag- nostic to the underlying topology or routing algorithm. This makes SWAP applicable across homogeneous or heteroge- neous on-chip and off-chip topologies. We present a light- weight implementation of SWAP that reuses conventional router resources with minor additions to enable these swaps. The additional path diversity provided by SWAP provides 20- 80% higher throughput with synthetic traffic patterns across regular and irregular topologies compared to baseline es- cape VC based solutions, and consumes 2-8× lower network energy compared to deflection and global-synchronization based solutions. 1. INTRODUCTION Interconnection networks form the communication back- bone of today’s computer systems. They provide the means for communication between different entities of the system, be they different cores of a many-core chip multiprocessor (CMP), or a system-wide network where each node itself could be a CMP or server. In a buffered network, packets naturally get blocked due to contention for shared links. A deadlock occurs when some packets remain blocked inside the network indefinitely due to a cyclic dependence between buffers, and never reach their destination, causing application or system level failure. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. MICRO ’52, October 12–16, 2019, Columbus, OH, USA c 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6938-1/19/10...$15.00 https://doi.org/10.1145/3352460.3358255 Deadlock freedom is a matter of correctness, rather than performance. Therefore, careful consideration is given while designing the network to make it deadlock free; more often than not, via over provisioning of resources. Almost every commercial interconnection network today prevents deadlocks from occurring by ensuring that the afore- mentioned cyclic dependence between buffers never gets cre- ated in the first place. This is done by restricting certain turns (either for all packets such as with dimension-ordered XY routing [1], or within extra escape virtual channels (VCs) [2]) or by restricting the injection of packets [3, 4]. These restric- tions reduce path diversity; throughput is lost since avoiding runtime formation of cycles means that links will go unused by certain packets even if they are idle. Moreover, they are inherently tied to the topology (e.g., XY routing only works for a mesh; any other topology requires a full channel depen- dence graph analysis to disable turns [1], BFC only works in rings/tori) – making them inflexible as plug-and-play solu- tions in arbitrary topologies in heterogeneous SoCs or when links/routers fail due to waning silicon reliability [5]. We focus on schemes that provide full path diversity, and resolve deadlocks that have occurred. We characterize prior work on deadlock resolution via the following taxonomy: 1 Deadlock resolution via escaping (e.g., escape buffers [5, 6]); the key drawback is the need for extra buffers; Deadlock resolution via misrouting (e.g., deflection rout- ing [7, 8, 10]); the key drawback is increased energy con- sumption and loss in throughput; Deadlock resolution via coordinating or synchronizing (e.g., SPIN [9]); the key drawback is expensive global coordination for detection and spinning. None of the state-of-the-art deadlock-freedom solutions (avoidance/recovery) provide full path diversity and high- throughput without requiring deadlock detection or global coordination for arbitrary topologies. This motivates our work. Going back to first principles, we argue that a network deadlocks because a packet is indefinitely blocked; the reason is a fundamental network design rule that says that a packet should be forwarded from an upstream router to a downstream router if and only if the downstream buffer is free (which is known via credits/on-off signaling), or is guaranteed to become free by the time the packet arrives [9, 11]. If this rule is violated, packets may be dropped. We question this fundamental rule and propose Synchro- 1 Section 2 provides more details on each of these approaches. 873

SWAP: Synchronized Weaving of Adjacent Packets for Network

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: SWAP: Synchronized Weaving of Adjacent Packets for Network

2019 52nd IEEE/ACM International Symposium on Microarchitecture

SWAP: Synchronized Weaving of Adjacent Packets forNetwork Deadlock Resolution

Mayank Parasar1, Natalie Enright Jerger2, Paul V. Gratz3, Joshua San Miguel4, and Tushar Krishna1

1School of ECE at Georgia Institute of Technology, 2Department of ECE at the University of Toronto,

3Department of ECE at Texas A&M University, 4Department of ECE at the University of Wisconsin-Madison

[email protected] [email protected] [email protected] [email protected] [email protected]

ABSTRACT

An interconnection network forms the communication back-bone in both on-chip and off-chip systems. In networks, con-gestion causes packets to be blocked. Indefinite blocking canoccur if cyclic dependencies exist, leading to deadlock. Allmodern networks devote resources to either avoid deadlockby eliminating cyclic dependences or to detect and recoverfrom it.

Conventional buffered flow control does not allow a blockedpacket to move forward unless the buffer at the next hop isguaranteed to be free. We introduce SWAP, a novel mech-anism for enabling a blocked packet to perform an in-placeswap with a buffered packet at the next hop. We prove thatin-place swaps are sufficient to break any deadlock and are ag-nostic to the underlying topology or routing algorithm. Thismakes SWAP applicable across homogeneous or heteroge-neous on-chip and off-chip topologies. We present a light-weight implementation of SWAP that reuses conventionalrouter resources with minor additions to enable these swaps.The additional path diversity provided by SWAP provides 20-80% higher throughput with synthetic traffic patterns acrossregular and irregular topologies compared to baseline es-cape VC based solutions, and consumes 2-8× lower networkenergy compared to deflection and global-synchronizationbased solutions.

1. INTRODUCTION

Interconnection networks form the communication back-bone of today’s computer systems. They provide the meansfor communication between different entities of the system,be they different cores of a many-core chip multiprocessor(CMP), or a system-wide network where each node itselfcould be a CMP or server.

In a buffered network, packets naturally get blocked dueto contention for shared links. A deadlock occurs when somepackets remain blocked inside the network indefinitely dueto a cyclic dependence between buffers, and never reachtheir destination, causing application or system level failure.

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies are notmade or distributed for profit or commercial advantage and that copies bearthis notice and the full citation on the first page. Copyrights for componentsof this work owned by others than ACM must be honored. Abstracting withcredit is permitted. To copy otherwise, or republish, to post on servers or toredistribute to lists, requires prior specific permission and/or a fee. Requestpermissions from [email protected] ’52, October 12–16, 2019, Columbus, OH, USAc© 2019 Association for Computing Machinery.

ACM ISBN 978-1-4503-6938-1/19/10...$15.00https://doi.org/10.1145/3352460.3358255

Deadlock freedom is a matter of correctness, rather thanperformance. Therefore, careful consideration is given whiledesigning the network to make it deadlock free; more oftenthan not, via over provisioning of resources.

Almost every commercial interconnection network todayprevents deadlocks from occurring by ensuring that the afore-mentioned cyclic dependence between buffers never gets cre-ated in the first place. This is done by restricting certain turns(either for all packets such as with dimension-ordered XYrouting [1], or within extra escape virtual channels (VCs) [2])or by restricting the injection of packets [3, 4]. These restric-tions reduce path diversity; throughput is lost since avoidingruntime formation of cycles means that links will go unusedby certain packets even if they are idle. Moreover, they areinherently tied to the topology (e.g., XY routing only worksfor a mesh; any other topology requires a full channel depen-dence graph analysis to disable turns [1], BFC only works inrings/tori) – making them inflexible as plug-and-play solu-tions in arbitrary topologies in heterogeneous SoCs or whenlinks/routers fail due to waning silicon reliability [5].

We focus on schemes that provide full path diversity, andresolve deadlocks that have occurred. We characterize priorwork on deadlock resolution via the following taxonomy:1

• Deadlock resolution via escaping (e.g., escape buffers [5,6]); the key drawback is the need for extra buffers;

• Deadlock resolution via misrouting (e.g., deflection rout-ing [7, 8, 10]); the key drawback is increased energy con-sumption and loss in throughput;

• Deadlock resolution via coordinating or synchronizing(e.g., SPIN [9]); the key drawback is expensive globalcoordination for detection and spinning.

None of the state-of-the-art deadlock-freedom solutions(avoidance/recovery) provide full path diversity and high-throughput without requiring deadlock detection or globalcoordination for arbitrary topologies. This motivates ourwork. Going back to first principles, we argue that a networkdeadlocks because a packet is indefinitely blocked; the reasonis a fundamental network design rule that says that a packetshould be forwarded from an upstream router to a downstreamrouter if and only if the downstream buffer is free (whichis known via credits/on-off signaling), or is guaranteed tobecome free by the time the packet arrives [9, 11]. If this ruleis violated, packets may be dropped.

We question this fundamental rule and propose Synchro-

1Section 2 provides more details on each of these approaches.

873

Page 2: SWAP: Synchronized Weaving of Adjacent Packets for Network

clk

BA0

1

0

1

Swap=0clk

BA0

1

0

1

clk

AB0

1

0

1

clk

AB0

1

0

1

Swap=0 Swap=1 Swap=1 Swap=1 Swap=1 Swap=0 Swap=0

Figure 1: The basic hardware implementation for swapping the content of two Flip Flops / FIFOs.Table 1: Qualitative Comparison of Deadlock Freedom Mechanisms

Deadlock Freedom Mechanism Resolution Type Full Path Diversity No VCs Required Topology Agnostic No coordination

CDG / Dally [1] No Cycle ✗ ✓ ✗ ✓

Duato [2, 6] Escape ✓/✗* ✗ ✓/✗* ✓

BFC [3, 4, 5] Bubble ✗ ✓ ✗ ✓

Deflection [7, 8] Misroute ✗** ✓ ✓ ✓

SPIN [9] Synchronize ✓ ✓ ✓ ✗

SWAP Backtrack ✓ ✓ ✓ ✓

* Within esc VC: limited path diversity + requires topology info for esc path.**At low-loads, full path diversity is available. But at medium-high loads, packets cannot control the directions or paths along with they are deflected.

nized Weaving of Adjacent Packets (SWAP). SWAP intro-duces a new deadlock resolution paradigm: backtracking,where deadlocked packets yield their position and allow otherpackets to move forward. Backtracking is performed viain-place packet swaps across buffers in neighboring routers.Fig. 1 shows the conceptual idea. Unlike software, where aswap requires additional temporary storage, in hardware, anin-place swap is conceptually the same as a cyclic shift reg-ister. Intuitively, SWAP allows any blocked packet to makeguaranteed forward progress to its destination, regardless ofthe congestion in the network, via the mechanism of swaps.More formally, we prove that performing swaps at periodicintervals ensures that any cycles, if they form in the network,are broken dynamically via swaps, guaranteeing deadlockfreedom. Further we show that despite infrequent misroutingof packets, SWAP is free from livelock.

This paper makes the following contributions:

• We propose SWAP, a novel mechanism for in-place packetswaps across routers in a network.

• SWAP provides deadlock-freedom via packet backtrack-ing, while providing desired metrics of full path diversity,high throughput, no additional VCs, no deadlock detec-tion, and topology agnosticism. Table 1 qualitatively con-trasts SWAP with current deadlock-freedom solutions.

• We present a light-weight implementation of SWAP thatadds 4% area overhead over a state-of-the-art VC router.

• SWAP increases throughput by 20-80% with syntheticbenchmarks for a full and faulty mesh, compared to es-cape VCs, and reduces network link activity by 2-8× com-pared to deflection and synchronization based schemes.

• We show that using SWAP with conventional deadlock-free routing enhances network throughput by 10% sinceswaps allow packets to move away from congested partsof the network. Thus SWAP emulates the behavior ofVCs without adding any additional buffers, making itapplicable beyond deadlock resolution.

2. BACKGROUND AND RELATED WORK

There exists a significant body of work on deadlock free-dom in interconnection networks. For the purpose of discus-sion, we categorize them into deadlock avoidance and dead-lock resolution techniques as per our taxonomy in Section 1.Table 1 summarizes key attributes of different approaches.

2.1 Deadlock Avoidance

Routing Restrictions. The most common technique to avoiddeadlocks is to make the Channel Dependency Graph (CDG)acyclic [1]. In one variant of this technique, certain turns ina given topology are not allowed, to ensure that a deadlockis never created. The turn model [12] for a mesh is the mostprevalent implementation. These algorithms allow selectiveadaptivity in the routing algorithm–for example, west-firstrouting only allows adaptivity if the destination happens tobe in the North-East or South-East quadrant of the mesh. Analternate implementation is to change the virtual channel (VC)at which the packet would sit at downstream router whenevercertain turns are made, to ensure that the VCs themselvesdo not form a cyclic dependence. This is used in off-chipnetworks for algorithms such as UGAL [13] in dragon-flynetworks and require at least three VCs. Fully adaptiverouting can be implemented to allow full path diversity acrossall VCs, while each VC itself has turn restrictions [14]. Inirregular topologies, arising due to network faults [15, 16, 17]or power-gated nodes [18], spanning trees are often used toguarantee the same acyclic CDG behavior.Injection Restrictions. Another technique to avoid dead-locks is to ensure that a cyclic dependence never forms atruntime even though the CDG is cyclic. This can be ensuredby cleverly managing the injection of packets into the net-work. Bubble flow control [3] and its variants [3,4,19,20] arethe most common implementation of this idea, but only workfor rings and tori. Some extensions of BFC for meshes [21]and dragon-fly [22] have been explored.

Both of these approaches limit throughput either by restrict-ing path diversity or by limiting packet injection. They alsotend to be topology-specific, limiting their use in irregular orfaulty networks.

2.2 Deadlock Resolution

Escape Virtual Channels. Escape VCs [2, 23, 24] allow allVCs to use deadlock-prone routing with no turn restrictions,except one (the escape VC) that uses a deadlock-free routingpath. This provides an acyclic CDG only within the escapeVC. Thus, escape VC-based solutions require at least 2 VCsto provide fully adaptive routing. Escape VCs have beenused across a suite of networks [24, 25, 26, 27, 28, 29, 30]. Akey challenge with escape VCs, just like the CDG schemes,is that they are topology-dependent–the path through the

874

Page 3: SWAP: Synchronized Weaving of Adjacent Packets for Network

A

F E

B C

D

BC

CD

AB

DEEF

FA

BC

CD

AB

DEEF

FA

cc

BC

CD

AB

DEEF

FA

c

Cyclic Cyclic

Channel Dependence Graph (CDG) for SWAP

Deadlock! No Deadlock

spin spin

Deadlock!

timeout

(1) Timeout triggered if no movement for K cycles

(2) Router A sends “probe” to map deadlock path

(3) Router A sends “move” to synchronize spin time

Deadlock!

timeout

swap

Deadlock!

swap

No Deadlock No Deadlock

A

F E

B C

D

c d f

z y x

z

c d

y x f

A

F E

B C

D

z

c d

y x f

A

F E

B C

D

c

x df

y z

A

F E

B C

D

A

F E

B C

D

d f

y x

packet

going

to D

c

z

packet

going

to C

Deadlock!

A

F E

B C

D

c d f

z y x

cd f

z y x

backtrack

A

F E

B C

D

A

F E

B C

D

cd f

z y x

backtrack

d f

z

y x

c

ejectBC

CD

AB

DEEF

FA Cyclic

BC

CD

AB

DEEF

FAAcyclic Acyclic

Channel

Dependence

Network State

(c) SWAP

A coordinates with B B coordinates with C

A

F E

B C

D

packet

going

to C

Deadlock!

A

F E

B C

D

c d f

z y x

d f

y x

packet

going

to D

c

z

A

F E

B C

D

packet

going

to C

A

F E

B C

D

c d f

z y x

d f

y x

packet

going

to D

c

z

cd

W->E

L->E

Conflict for

East port!

East

deflectA

F E

B C

D

d

fz

b

x

c misroute

misroute

y

dnew packet injected for D A

F E

B C

D

b

cre-routeA

F E

B C

Dfx

y

dmisroute

k conflicting packet for S port

d

re-route

(b) SPIN

(a) Deflection

d

deflect

suppose conflict for S port

A

Detection and

Synchronization

Messages

(Probe, Move)

(1) Timeout triggered if no movement for K cycles

(2) Router A sends “probe” to map deadlock path

(3) Router A sends “move” to synchronize spin time

Figure 2: Example comparing (a) Deflection, (b) SPIN and (c) SWAP using a 3×2 mesh. The left side of the figure(before dotted line) sets up the same initial condition of deadlock in the three designs, and the right side demonstrateshow they operate. In deflection routing, the deadlock does not persist as packets move every cycle. However, the greenpacket (going to Router C) and the purple packet (going to Router D) are both misrouted due to conflicts, and takemultiple cycles to be re-routed to their destinations. In SPIN, if a packet in a specific VC (e.g., at Router A) does notmove for a specified number of cycles, a timeout occurs, and a probe is sent to map the possible deadlock path. Theprobe returns after 12 cycles. A move message synchronizes all routers on the deadlock path to perform a spin. Oncethe move returns, the spin is performed, and every packet moves forward one hop. The deadlock still persists, so thetimeout, probe, move, and spin process repeats. In the last step, packet c reaches its destination and the deadlock isresolved. In SWAP, Router A (at a fixed period), coordinates locally with its neighbor (Router B) and performs a swap:packet d is backtracked and packet c moves forward. The deadlock still persists. Packet c performs another swap,reaches its destination, and the deadlock is resolved. The corresponding CDG at every step in SWAP is also shown.

escape VCs needs to be deadlock-free via routing restrictions.Further, they require the overhead of provisioning at leasttwo VCs, one of which typically is not well utilized.Misrouting packets (Deflection routing). Deflection rout-ing [7,8,31,32] assigns every input flit some output port everycycle. When more than one flit requests the same output port,only one (chosen according to a priority scheme) is allot-ted the output port and the rest are deflected to some otheravailable output port. It is primarily intended for bufferlessrouters. It is inherently deadlock free because every packetmakes progress every cycle. Although it is not guaranteedthat same packet will make forward progress at each router.This increases network congestion and energy consumptiondue to mis-routing which will be especially bad at high loads,limiting network throughput significantly [33].Coordination. Deadlock detection and recovery is an alter-native approach to provide deadlock freedom. These are mo-tivated by the fact that deadlocks are actually quite rare [5, 6]and argue that the routing restrictions or additional escapeVCs are overly conservative. These solutions fundamen-tally rely on a deadlock detection mechanism [5, 9, 34, 35]involving time-outs and probes, followed by a recovery mech-anisms that introduces additional buffers [5, 6, 35] or coordi-nates packet movement [9] to guarantee forward progress.

Although their datapath overheads are low, these solutionsrequire extremely complex control circuitry in the form oftime-out counters and state machines to detect deadlocks andmanage false positives and negatives. Moreover, due to theoverhead of deadlock detection, they suffer from low through-put once the network starts to deadlock. Thus, they have notmade it into mainstream systems.Backtracking. SWAP uses the same underlying theory ofguaranteed forward progress as deflection routing to providedeadlock freedom–thereby not requiring any turn or injectionrestrictions, or detection and recovery. Packet swaps can beviewed at a high level as controlled deflections. There is asubtle yet important difference: misrouting is tricky becauseyou have to make sure that you misroute a packet only todirections where it still has a legal path to its final destination;whereas backtracking guarantees this. The key differencesof SWAP compared to a deflection-based mechanism andSPIN [9] (coordination-based) are highlighted in Fig. 2 anddiscussed in detail in Section 4.4.

3. SWAP THEORY

In this section, we present the theoretical underpinnings ofour SWAP scheme. First, we provide necessary definitionsfor the reader, overview the basic operation and provide a

875

Page 4: SWAP: Synchronized Weaving of Adjacent Packets for Network
Page 5: SWAP: Synchronized Weaving of Adjacent Packets for Network

DEADLOCK!!

DEADLOCK!!

DEADLOCK!!

DEADLOCK!!

21

2

3

4

5

6

1 7

83

4 95

6

1

2

3

6

5

4

7

1 4

56

2 3

(a) Ring (b) Mesh (d) Irregular-2(c) Irregular-1

Figure 4: Examples of Deadlocks in Arbitrary Topologies

network is deadlock-free.Proof: A deadlock, by definition, is an indefinite blocking

of a packet. As long as the implementation of SWAP canguarantee that every packet will have a chance to performa swap, it will make forward progress, making the networkinherently deadlock free. However, it is possible for thepacket that made forward progress to later be backtrackedby another packet. Indefinite backtracking could lead to alivelock (i.e., the packet never reaches its destination). Next,we discuss why SWAP is livelock free.

3.3 Proof of Livelock Freedom

Theorem: For a given system that implements SWAP, aslong as any backtracked packet eventually has the opportunityto move forward by two hops before being backtracked again,the network is livelock-free.

Proof: Backtracking can be viewed as incrementing apacket’s number of remaining hops h (to its destination) byat most 1. Assume that the system allows a packet to movetwo hops before being backtracked again (i.e., before beingselected as a swapBack packet). For every increment by 1 inh due to backtracking, there is a decrement by 2. This impliesthat in this system, for any given packet, h eventually goes to0, guaranteeing forward progress to the destination.

Implementation: There are two important design consid-erations to implement such a system. First, the swapPeriodmust be greater than the time it would take for a packet tomove two hops forward in the absence of contention. Thisavoids any pathological cases of packets continuously be-ing backtracked. Second, the selection scheme that decideswhich packet should be swapped next must be fair. Ideally,this selection is random; though for practical purposes, roundrobin is sufficient. A fair selection scheme ensures that nopacket is starved in the presence of contention. Even if apacket is not able to move two hops forward now, it will even-tually be able to, as guaranteed by the fair selection. Since thenetwork topology is finite, h is upper-bounded by the networkdiameter (i.e., the largest number of hops between any tworouters). Thus no amount of contention in the system cancause a packet to be backtracked indefinitely.

3.4 SWAP in Arbitrary Topologies

Arbitrary topologies are challenging for popular deadlock-avoidance solutions such as XY/West-first routing algorithms;routing algorithms now require CDG analysis to determinetopology-specific turn restrictions (for all paths or withinescape VCs). In contrast, SWAP is agnostic to the topologyas any swap just involves neighboring routers. For example,the deadlock ring in Fig. 2 could lie within any topology inFig. 4 and use the same mechanism of swaps for deadlockfreedom. SWAP is also agnostic to the underlying routingalgorithm. The routing algorithm decides the output port

(i.e., neighboring router) of the swapFwd packet, and theswapBack packet is chosen from the corresponding inputport at the neighbor.

4. SWAP IMPLEMENTATION

Multiple implementations of SWAP are possible. We favoran implementation with low complexity. We describe the pos-sible design space and the intuition behind our given designchoices, acknowledging that alternate implementations arepossible.

4.1 Initiating a Swap

Although it is possible to map out the full deadlock loopat runtime via timeout and probes [5, 9], and then performcontrolled swaps to recover from the deadlock (Proof 1 inSection 3.2), we prefer a less expensive approach. Recall thatany SWAP implementation ensures deadlock and livelockfreedom if it ensures that (a) every packet gets the chance tomake forward progress via a swap, and (b) the system allowsa packet to move two hops in an uncongested scenario, beforebeing backtracked. To ensure (a), we enforce periodic swapsby every router at a configurable time period (swapPeriod)and we add a pointer in every router to cycle through allVCs at all ports that decides which VC will try and initiate aswap. To ensure (b), we need to account for the worst casedelay for a packet in two adjacent routers without any stallsdue to insufficient credits. This would be a packet in a VCcontending with all other VCs at that router for a specificoutput port, followed by traversing the router and link, andrepeating the same at the next router. Thus,

swapPeriod ≥ 2× (# ports×# vcs/port+(router_pipeline_delay

+link_delay))+ serialization_delay

This works out to be 54 cycles for a 5-ported mesh router with4 VCs per port, 5-flit packets, 4-cycle routers and 1-cyclelinks, and 18 cycles for a 1-cycle, 1 VC per port mesh router.

In our implementation, each router performs a swap duringits swapCycle. The swapCycle is defined as

(cycle/m)%(K ×N) == router_id (1)

where K is a configurable swapDutyCycle, N is the number ofrouters in the network, m is the maximum number of flits ofany packet in the system and K ×N is the swapPeriod. K de-termines how often each router initiates swaps; the lower thevalue of K (minimum could be 1), the more swaps performedin the network. When K = 1 and m = 1, each router initiates aswap every N cycles in a TDM manner. In a 64-core system,this means that even with K = 1, each router attempts a swaponce every 64 cycles, which is greater than the minimumswapPeriod calculated above for livelock avoidance.

The router initiating the swap, as dictated by its ID andcurrent cycle, is the upstream router and router with whichit will swap its packet, is the downstream router. At anygiven cycle, by design, there can only be one upstream routerand several possible downstream routers depending on thetopology.

During the swapCycle, the upstream router selects a swap-Fwd packet from one of its internally buffered packets. Itsends a swap request signal via a 1-bit wire to the downstream

877

Page 6: SWAP: Synchronized Weaving of Adjacent Packets for Network
Page 7: SWAP: Synchronized Weaving of Adjacent Packets for Network

Table 2: SWAP Operation Details.

Updating swapPointer Conditions for Failed Swaps

* When the packet pointed by swapPointer leaves the router naturally bywinning switch arbitration, the swapPointer moves in round-robin fashion tothe next non-empty VC.

* At least one of the VCs within the virtual network of the swap_req isempty. In this case, the packet could arrive by normal means, and a swap isnot required.

* When a swapFwd packet arrives at this router from an upstream router viaa swap. This packet now becomes the swapFwd packet to give it the highestpriority at the next swapCycle in case it does not leave naturally.

* In virtual cut-through routers, if the candidate swapFwd and swapBackpacket is distributed across two routers, a swap is not performed. In worm-hole, this condition leads to packet truncation [7, 8].

X

vc-0vc-1 vc

-k

vc-0vc-1 vc

-k

vc-0

vc-1

vc-k

vc-0

vc-1

vc-k

Swap_busRouter

N

E

S

W

is_swapBack?

is_swapBack?

is_swap

Back?

is_swap

Back?

is_swap

Back?

is_swapBack?

is_swapBack?

is_swapBack?

Swap

Management Unit

Swap_bus

Arbiter

uTurn

support

Route

Compute

VC

Allocate

SWAP PTR

N_in

E_in

S_in

W_in

W-inport

E-inport

N-inport

S-inport

N_out

W_out

S_out

E_out1

010

1

0 0

0

0

0

0

1

1

1

1

1

swap_req

swap_ack

swap_req

swap_ack

swap_req swap_ack

swap_reqswap_ack

Figure 6: SWAP Router Microarchitecture. Featuresadded by SWAP are shaded in grey. Datapath: bus con-necting all input ports to allow a swapBack packet fromthe downstream router to get buffered at any input VC,and u-turn support in the crossbar. Control path: SwapManagement Unit controlling when and what to swap.The blue and red paths show a swapFwd packet goingfrom South in port to East out port, and a correspondingswapBack packet entering from East out port and gettingbuffered in the South in port.

(even though our implementation restricts the swaps to occurwithin the same VC ID).

• forward path (A_South_VC1 to B_West_VC3): theswapFwd packet reuses A’s crossbar to traverse to B’sWest input port (blue path in Fig. 6) and gets bufferedinto VC3. This is exactly like a regular traversal. Asmentioned earlier, during the swap, the output link fromA to B is not allocated to any other packet.

• backward path (B_West_VC3 to A_South_VC1): theswapBack packet reuses B’s crossbar to make a u-turntowards A. The swapBack packet arrives at the East inputport at A, but needs to be buffered at the South input port.The pre-set swap_bus transports the packet from East toSouth, and buffers it in VC1 (red path in Fig. 6).

Why is a simple bus sufficient? SWAP does not supportmultiple swaps in the same cycle. Thus, we do not needa crossbar at the input of the router to support multipleswaps from multiple downstream routers simultaneously. Theswap_bus is pre-configured by the swap_ack. This makes the

Table 3: SWAP vs. Deflection Routing

Deflection Routing SWAP

Mis-

routing

Forces packet deflections uponbuffers overflow [10] (every cy-cle in case of bufferless de-signs [7]) without support forstalls. This leads to high mis-routing and congestion.

Provides localized mis-routing, which we callbacktracking, the rate ofwhich can be controlledusing the swapPeriod pa-rameter.

Spread

Deflections in one part of thenetwork can trigger deflectionsin another part, leading to highlatencies and dropped through-put for all packets.

Backtracking is con-trolled by swapPeriod andswapCycle parameters.

Router

Ports

Indirect restriction on the routermicro-architecture: number ofinput ports must equal the num-ber of output ports of the router.

Places no restrictions onthe router’s radix, makingit more amenable to arbi-trary irregular topologies.

Router

Critical

Path

High hardware overhead forswitch-arbiter to perform thebest matching upon packet con-flict. This also lies in the criticalpath of the router.

Adds minimal changes tothe baseline router micro-architecture (Fig. 6), withthe SMU operating off thecritical path.

Routing

Deflection routing algorithm isa de-facto routing algorithm,controlled purely by current net-work congestion

Any routing algo-rithm (minimal/non-minimal/adaptive) whichby itself may or may notbe deadlock-prone.

Table 4: SWAP vs. SPIN

SPIN SWAP

Detection

Approach

Maps entire deadlock pathupon a timeout using probesthat take multiple cycles.

No detection. Performsswaps periodically based ona VC occupancy threshold.

Detection

Time

Longer deadlock cycles takelonger to map and resolve

Independent of deadlock cy-cle length

Synchro-

nization

Global: all routers in dead-lock must spin at same time

Local: with neighbor whowill be performing swap

Resolution

Approach

All packets in deadlockedring move forward simulta-neously

Only two packets move si-multaneously.

Resolution

Time

(N-1) spins in worst case fordeadlock of length N

(N-1) swaps of specificpacket in worst case fordeadlock of length N

Misrouting None Backtracks packet one hop

SWAP implementation extremely light-weight.Virtual Cut-Through (VCT) and Wormhole Implemen-tations. VCT routers have buffers deep enough to hold anentire packet. This design naturally works well for SWAP.Swaps are only performed once the entire packet is received,as shown in Table 2. To support SWAP with wormholerouters, we would add packet truncation support, similar toprior works in deflection routers [7, 8, 10]. Packet truncationoccurs on swapFwd and/or swapBack packets if the formerinitiates a swap.Multi-flit Packets For m-flit packets, the swap operationtakes m cycles, as the flits are serially swapped.Control Path. The control path of SWAP adds a Swap Man-agement Unit (SMU) that handles if, when and what to swap,as described in Section 4.2.

879

Page 8: SWAP: Synchronized Weaving of Adjacent Packets for Network
Page 9: SWAP: Synchronized Weaving of Adjacent Packets for Network
Page 10: SWAP: Synchronized Weaving of Adjacent Packets for Network
Page 11: SWAP: Synchronized Weaving of Adjacent Packets for Network
Page 12: SWAP: Synchronized Weaving of Adjacent Packets for Network

7. REFERENCES

[1] W. J. Dally and C. L. Seitz, “Deadlock-free message routing inmultiprocessor interconnection networks,” IEEE Trans. Comput.,vol. 36, pp. 547–553, May 1987.

[2] J. Duato, “A new theory of deadlock-free adaptive routing inwormhole networks,” IEEE Trans. Parallel Distrib. Syst., vol. 4,pp. 1320–1331, Dec. 1993.

[3] C. Carrion, R. Beivide, J. A. Gregorio, and F. Vallejo, “A flow controlmechanism to avoid message deadlock in k-ary n-cube networks,” inProceedings of the Fourth International Conference onHigh-Performance Computing, pp. 322–329, 1997.

[4] V. Puente, C. Izu, R. Beivide, J. Gregorio, F. Vallejo, and J. Prellezo,“The adaptive bubble router,” J. Parallel Distrib. Comput., vol. 61,pp. 1180–1208, Sept. 2001.

[5] A. Ramrakhyani and T. Krishna, “Static bubble: A framework fordeadlock-free irregular on-chip topologies,” in IEEE InternationalSymposium on High Performance Computer Architecture (HPCA),pp. 253–264, 2017.

[6] K. V. Anjan and T. M. Pinkston, “An efficient, fully adaptive deadlockrecovery scheme: DISHA,” in Proceedings of the 22nd AnnualInternational Symposium on Computer Architecture, (ISCA),pp. 201–210, 1995.

[7] T. Moscibroda and O. Mutlu, “A case for bufferless routing in on-chipnetworks,” in Proceedings of the 36th Annual InternationalSymposium on Computer Architecture, pp. 196–207, 2009.

[8] C. Fallin, C. Craik, and O. Mutlu, “CHIPPER: A low-complexitybufferless deflection router,” in Proceedings of the 17th IEEEInternational Symposium on High Performance ComputerArchitecture, pp. 144–155, 2011.

[9] A. Ramrakhyani, P. V. Gratz, and T. Krishna, “Synchronized progressin interconnection networks (SPIN) : A new theory for deadlockfreedom,” in International Symposium on Computer ArchitectureISCA, 2018.

[10] C. Fallin, G. Nazario, X. Yu, K. K. Chang, R. Ausavarungnirun, andO. Mutlu, “MinBD: Minimally-buffered deflection routing forenergy-efficient interconnect,” in 2012 Sixth IEEE/ACM InternationalSymposium on Networks-on-Chip (NOCS), pp. 1–10, 2012.

[11] L.-S. Peh and W. J. Dally, “Flit-reservation flow control,” inProceedings of the IEEE International Symposium onHigh-Performance Computer Architecture, pp. 73–84, 2000.

[12] C. J. Glass and L. M. Ni, “The turn model for adaptive routing,” J.ACM, vol. 41, pp. 874–902, Sept. 1994.

[13] J. Kim, W. J. Dally, S. Scott, and D. Abts, “Technology-driven,highly-scalable dragonfly topology,” in Proceedings of the 35th AnnualInternational Symposium on Computer Architecture, pp. 77–88, 2008.

[14] M. Ebrahimi and M. Daneshtalab, “A new theory on forming acyclicchannel dependency graph for the design of deadlock-free networks,”in Proceedings of the 44th International Symposium on ComputerArchitecture (ISCA), 2017.

[15] K. Aisopos, A. DeOrio, L.-S. Peh, and V. Bertacco, “ARIADNE:agnostic reconfiguration in a disconnected network environment,” inInternational Conference on Parallel Architectures and CompilationTechniques, PACT, pp. 298–309, 2011.

[16] R. Parikh and V. Bertacco, “uDIREC: Unified diagnosis andreconfiguration for frugal bypass of NoC faults,” in Proceedings of theIEEE/ACM International Symposium on Microarchitecture (MICRO),,2013.

[17] D. Lee, R. Parikh, and V. Bertacco, “Brisk and limited-impact NoCrouting reconfiguration,” in Design, Automation and Test in EuropeConference (DATE), 2014.

[18] R. Parikh, R. Das, and V. Bertacco, “Power-aware NoCs throughrouting and topology reconfiguration,” in Design AutomationConference (DAC), 2014.

[19] L. Chen, R. Wang, and T. M. Pinkston, “Critical bubble scheme: Anefficient implementation of globally aware network flow control,” in25th IEEE International Symposium on Parallel and DistributedProcessing, IPDPS, pp. 592–603, 2011.

[20] L. Chen and T. M. Pinkston, “Worm-bubble flow control,” inProceedings of the 19th IEEE International Symposium on HighPerformance Computer Architecture (HPCA), pp. 366–377, 2013.

[21] C. Xiao, M. Zhang, Y. Dou, and Z. Zhao, “Dimensional bubble flowcontrol and fully adaptive routing in the 2-d mesh network on chip,” in2008 IEEE/IPIP International Conference on Embedded andUbiquitous Computing (EUC, pp. 353–358, 2008.

[22] M. Garcia, E. Vallejo, R. Beivide, M. Odriozola, C. Camarero,M. Valero, G. Rodriguez, J. Labarta, and C. Minkenberg, “On-the-flyadaptive routing in high-radix hierarchical networks,” in Proceedingsof the 41st International Conference on Parallel Processing,pp. 279–288, 2012.

[23] J. Duato and T. M. Pinkston, “A general theory for deadlock-freeadaptive routing using a mixed set of resources,” IEEE Trans. ParallelDistrib. Syst., vol. 12, pp. 1219–1235, Dec. 2001.

[24] J. Duato, “A necessary and sufficient condition for deadlock-freeadaptive routing in wormhole networks,” IEEE Trans. Parallel Distrib.Syst., vol. 6, pp. 1055–1067, Oct. 1995.

[25] X. Lin, P. K. McKinley, and L. M. Ni, “The message flow model forrouting in wormhole-routed networks,” IEEE Trans. Parallel Distrib.Syst., vol. 6, no. 7, pp. 755–760, 1995.

[26] S. Ma, N. Enright Jerger, and Z. Wang, “DBAR: An efficient routingalgorithm to support multiple concurrent applications innetworks-on-chip,” in Proceedings of the 38th Annual InternationalSymposium on Computer Architecture, pp. 413–424, 2011.

[27] P. Gratz, B. Grot, and S. W. Keckler, “Regional congestion awarenessfor load balance in networks-on-chip,” in 14th InternationalSympoisum on High-Performance Computer Architecture (HPCA-14,pp. 203–214, 2008.

[28] C. A. Nicopoulos, D. Park, J. Kim, N. Vijaykrishnan, M. S. Yousif,and C. R. Das, “ViChaR: A dynamic virtual channel regulator fornetwork-on-chip routers,” in Proceedings of the 39th AnnualIEEE/ACM International Symposium on Microarchitecture,pp. 333–346, 2006.

[29] A. Samih, R. Wang, A. Krishna, C. Maciocco, T. C. Tai, andY. Solihin, “Energy-efficient interconnect via router parking,” in 19thIEEE International Symposium on High Performance ComputerArchitecture, HPCA, pp. 508–519, 2013.

[30] L. Chen and T. M. Pinkston, “NoRD: Node-router decoupling foreffective power-gating of on-chip routers,” in Proceedings of theIEEE/ACM International Symposium on Microarchitecture (MICRO),2012.

[31] P. Baran, “On distributed communication networks,” IEEE Trans. onCommunications, 1964.

[32] S. A. R. Jafri, Y.-J. Hong, M. Thottethodi, and T. N. Vijaykumar,“Adaptive flow control for robust performance and energy,” inProceedings of the 43rd Annual IEEE/ACM International Symposiumon Microarchitecture, pp. 433–444, 2010.

[33] G. Michelogiannakis, D. Sanchez, W. J. Dally, and C. Kozyrakis,“Evaluating bufferless flow control for on-chip networks,” inProceedings of the 2010 Fourth ACM/IEEE International Symposiumon Networks-on-Chip, pp. 9–16, IEEE Computer Society, 2010.

[34] P. Lopez, J. M. Martinez, and J. Duato, “A very efficient distributeddeadlock detection mechanism for wormhole networks,” inProceedings of the Fourth International Symposium onHigh-Performance Computer Architecture, pp. 57–66, 1998.

[35] Y. H. Song and T. M. Pinkston, “A new mechanism for congestion anddeadlock resolution,” in Proceedings of the International Conferenceon Parallel Processing, pp. 81–, 2002.

[36] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi,A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen,K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “Thegem5 simulator,” SIGARCH Comput. Archit. News, vol. 39, pp. 1–7,Aug. 2011.

[37] N. Agarwal, T. Krishna, L.-S. Peh, and N. K. Jha, “GARNET: Adetailed on-chip network model inside a full-system simulator,” inProc of the IEEE International Symposium on Performance Analysisof Systems and Software, 2009.

[38] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC benchmarksuite: Characterization and architectural implications,” inInternational Conference on Parallel Architectures and CompilationTechniques (PACT), 2008.

[39] J. Shun and G. E. Blelloch, “Ligra: A lightweight graph processingframework for shared memory,” in Proceedings of the 18th ACM

884

Page 13: SWAP: Synchronized Weaving of Adjacent Packets for Network

SIGPLAN Symposium on Principles and Practice of ParallelProgramming, pp. 135–146, 2013.

[40] M. D. Schroeder, A. Birrell, M. Burrows, H. Murray, R. M. Needham,T. L. Rodeheffer, E. H. Satterthwaite, and C. P. Thacker, “Autonet: Ahigh-speed, self-configuring local area network using point-to-point

links,” IEEE Journal on Selected Areas in Communications, vol. 9,no. 8, pp. 1318–1335, 1991.

[41] H. Kwon and T. Krishna, “OpenSMART: Single-cycle multi-hop NoCgenerator in BSV and Chisel,” in Proc of the IEEE InternationalSymposium on Performance Analysis of Systems and Software, 2017.

885