Dual-mode inter-router communication channel for deflection-routed networks-on-chip

Igor Z. Stojanovic, Milica D. Jovanovic, Goran Lj. Djordjevic

Faculty of Electronic Engineering, University of Nis, A. Medvedeva 14, 18000 Nis, Serbia

Igor Z. Stojanovic (corresponding author) University of Nis, Faculty of Electronic Engineering, A. Medvedeva 14, 18000 Nis, Serbia tel. +381 18 529 601 fax. +381 18 529 105 e-mail: [email protected]

M. D. Jovanovic, G. Lj. Djordjevic: Faculty of Electronic Engineering, University of Nis, Nis, Serbia

M. D. Jovanovic e-mail: [email protected]

G. Lj. Djordjevic e-mail: [email protected]

Abstract: Deflection routing, characterized by routing simplicity and minimal in-router buffer resources, has recently emerged as a promising approach for improving the power and area efficiency of on-chip networks. With this routing strategy, packet contentions in routers are resolved by intentionally misrouting some of the packets along unwanted directions instead of storing them. However, at higher network loads, when contentions are more frequent, packets take longer paths on average to their destinations, which increases energy consumption and delay and reduces network throughput. To address this problem, we enhance the inter-router communication channels with a lightweight link-control mechanism that prevents unnecessary network hops by forcing deflected packets, when possible, to loop back to their current routers instead of being misrouted. The effect of packet loop-back is similar to that of storing a deflected packet in a small central in-router buffer, but it is accomplished at lower implementation cost (i.e. there is no need for additional buffer memory) and without any modification to the underlying router microarchitecture. Evaluations on synthetic traffic patterns show that the proposed misrouting suppression mechanism yields an improvement of 11.8-14.5% in network saturation throughput when coupled with the conventional bufferless and buffered deflection-based routers.

Keywords: Network-on-chip, multi-core, deflection routing, misrouting suppression.

1. Introduction

With the constantly growing complexity of modern system-on-chip (SoC) architectures, the role played by the on-chip interconnection infrastructure becomes increasingly important for overall system performance and energy efficiency. Traditional bus-based interconnect architectures, which use broadcast communication and serialization of bus transactions, cannot reach a high degree of scalability with respect to the increasing number of integrated computational resources. Nowadays, it is widely recognized that highly structured network-on-chip (NoC) architectures, whereby packets are routed in a way similar to traditional large-scale multi-processors and wide-area networks, represent the most viable design solution to accommodate the communication needs and reduce the communication energy consumption of large-scale SoCs [1].

The NoC consists of multiple routers interconnected with each other using point-to-point physical communication channels to form a suitable network topology [4]. In order to transport data packets between network nodes, most current NoC designs employ wormhole packet switching in combination with either deterministic or minimal adaptive routing policies [7]. In wormhole switching, a packet is decomposed into flits, and flits are delivered in a pipelined fashion through the network with each router holding one flit. If blocking occurs (due to port contention), the flits of the packet are blocked in place. To gain higher throughput and avoid potential deadlock situations, wormhole routers use virtual channel flow control such that the input buffer is organized as several independent buffers allocated to different packets [8]. Although in-router buffering improves the bandwidth efficiency, the virtual channel buffers draw a significant fraction of NoC power and area, and can increase router latency. For a static random access memory (SRAM) buffer implementation, the input buffers can consume 46% of the total on-chip network power while occupying 17% of the total area [9].

Deflection routing has recently emerged as an attractive power-efficient NoC design alternative, but is generally viewed as suitable only for multi-core SoCs with low to medium network loads [10][11]. In comparison to the wormhole packet switching method, deflection routing has the advantage that it does not require in-router buffering, because any port contention between multiple arriving flits is resolved by forwarding one of the contending flits through the preferred output port and deflecting the others to other (free) output ports. Because the deflection router needs neither to manage virtual channels nor to control an internal pipeline, its datapath is very simple (typically, it consists of several multiplexers to allow flits to enter and leave) and can achieve a higher speed with much lower hardware cost compared with a wormhole router. Recent studies have shown that in such bufferless NoCs, the power consumption is reduced by 20-40%, and the router area on die is reduced by 40-75% [10]. In addition, the adaptive nature of deflection routing enables hot-spot avoidance and provides fault tolerance in the network [19]. The drawback is that flits are occasionally misrouted, i.e. sent out in the wrong direction, which increases the amount of time they spend in the network. At high network load, when flits are misrouted more frequently, the cost and energy benefits of this low-cost routing scheme are offset by the performance degradation [21].

The key aspect of deflection routing is how to map a set of input flits, each with its preferred outputs, to the set of output ports in order to minimize the number of deflected flits. This task is performed by a switch allocator stage, which usually dominates the router delay [22]. In order to attain a low deflection rate, some designs rely on complex arbitration schemes that involve flit priorities [23]. For example, in the BLESS router, input flits are passed to output ports through a 4x4 crossbar switch controlled by a global switch allocator unit that gives older flits a higher priority [10]. The full priority ordering of flits results in fewer deflections, but it incurs a long critical path delay, thus limiting router operation to slow clock frequencies. In order to speed up the critical path, the CHIPPER router replaces the global allocator and crossbar with a two-stage permutation network (PN) composed of four 2x2 switch modules, each controlled by a dedicated allocator unit [13]. This design parallelizes port allocation and reduces hardware cost significantly. However, the use of a randomized and overly simplified port allocation algorithm occasionally leads to unnecessary deflections at the outputs of individual switch modules, and consequently increases the deflection rate. The minimally buffered deflection router (MinBD) improves the performance of the PN-based deflection router by attaching a small side-buffer that forms a registered feedback path from the output to the input of the PN [15]. At each clock cycle, the side-buffer (if not full) can accept up to one of the deflected flits from the PN output, and resubmit that flit to the PN at some later cycle. By saving a fraction of deflected flits from being misrouted, the side-buffer can significantly reduce the delay overhead of deflection. Note that side-buffering may be thought of as a mechanism for misrouting suppression in that it attempts to prevent misrouting of an already deflected flit rather than preventing the deflection from occurring. The drawback of this buffered deflection scheme is that it requires an additional pair of inject/eject stages, which increases not only the hardware complexity but also the router propagation delay.

In this paper, we introduce a low-cost and practical technique which provides misrouting suppression in deflection-routed NoC architectures by enhancing the functionality of inter-router communication channels, as opposed to increasing the hardware complexity of the internal router microarchitecture. In our proposal, depending on the routing statuses of the flits, the inter-router channels switch independently and dynamically between two operational modes. In particular, if deflected flits are present on both ends of the channel, or one flit is deflected and the other one is absent, then the channel activates the loop-back mode. In this mode, the flits are returned to the corresponding input ports of their current routers. Otherwise, the channel is configured in the normal mode, allowing both flits to make one network hop. The loop-backing technique is similar to side-buffering in that it suppresses misrouting, i.e. it prevents a transfer of the deflected flits to the next router. Unlike side-buffering, the loop-backing technique neither needs additional buffers nor requires any changes in the internal router microarchitecture. The only modification of the deflection-routed NoC architecture is that a pair of multiplexers is added to the hardware structure of the inter-router communication channel in order to control the flow of flits more efficiently. With this approach, the loop-backing technique can be applied to any deflection-routed NoC architecture (bufferless or buffered) in which the neighboring routers are connected through bidirectional (i.e., full-duplex) communication channels. As we show in our evaluations, the proposed technique provides a higher saturation throughput and lower transport delay compared to the conventional, baseline deflection-routed NoC designs.

The remainder of the paper is organized as follows. Section 2 provides background on deflection routing, including an overview of two representative classes of deflection router architectures: bufferless and buffered. Section 3 presents the novel misrouting suppression scheme for deflection-routed NoCs. In Section 4, evaluation and results are presented. Section 5 concludes this paper.

2. Baseline deflection-routed NoC architectures

In this section, we will analyze models of the following two general deflection-routed NoC architectures: the bufferless deflection NoC and the buffered deflection NoC. Through these models, we will consider only the essential features reported in several previous deflection-routed NoC proposals [10][13][14][15]. In particular, we will consider a network of 2D mesh topology composed of non-pipelined (i.e. combinational) and non-optimized routers connected by synchronous bidirectional communication channels. The main reason for using such general architectural models instead of optimized ones is that the proposed method of suppressing flit misrouting can be implemented without restriction in any actual deflection NoC architecture.

2.1. Bufferless deflection NoC

The deflection-routed 2D mesh network is constructed as a grid of routers where each router is connected by bidirectional communication channels to its (at most) four neighbors, as illustrated in Fig. 1a. In addition, each router is also connected to a network interface (NI). The NI implements the interface to a local processing core (not shown in Fig. 1a). The processing cores serve as sources and sinks of data packets. Packets are split into smaller flow control units, so-called flits, and each flit is routed independently. The flit size matches the inter-router channel width so that a single flit can traverse a single hop in a single clock cycle. An inter-router communication channel consists of a pair of unidirectional synchronous links (Fig. 1b). A single synchronous link is a group of wires segmented by an edge-triggered flit-register. All flit-registers in the network are clocked by the same clock signal. Note that each link includes an additional status signal v(alid) which flags the presence of a flit.

Fig. 1 2D mesh deflection-routed NoC architecture: a) topology, and b) inter-router channel

The network operates synchronously. The time axis is divided into clock cycles and each link transfers one flit per cycle. The deflection router is a pure combinational logic block, which directs the incoming flits from the input ports to the proper output ports. Since there are no in-router buffers, the flit-registers are the only memory elements for storing flits in transit. While traveling towards their destinations, flits are always on the move, hopping between the flit-registers and propagating through the routers without any waiting or stalling. Routers attempt to route each flit along a shortest path to its destination. A router forwards a flit through a productive port, i.e. in a productive direction, if the hop decreases the distance between the current flit position and its destination. Otherwise, the flit is deflected. Here, we make a distinction between the terms deflection and misrouting. Deflection occurs within the internal router structure. As a consequence of deflection, the flit is forwarded to a non-productive output port. On the other hand, misrouting refers to an external manifestation of the flit deflection. It corresponds to a transfer of a deflected flit over the inter-router channel one hop further in a non-productive direction. The cost of misrouting is two clock cycles since each non-productive hop must be compensated by one productive hop in the opposite direction. Note that in the baseline bufferless deflection-routed network, every flit deflection leads to a flit misrouting.
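The notion of a productive port can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the coordinate convention (x grows eastward, y grows northward) and the function names productive_ports and is_deflected are hypothetical and are not part of the router design described here.

```python
from typing import Set, Tuple

Coord = Tuple[int, int]  # (x, y); assumed: x grows eastward, y grows northward


def productive_ports(cur: Coord, dst: Coord) -> Set[str]:
    """Return the output ports that reduce the Manhattan distance to dst."""
    ports = set()
    if dst[0] > cur[0]:
        ports.add("E")
    elif dst[0] < cur[0]:
        ports.add("W")
    if dst[1] > cur[1]:
        ports.add("N")
    elif dst[1] < cur[1]:
        ports.add("S")
    return ports


def is_deflected(cur: Coord, dst: Coord, out_port: str) -> bool:
    """A flit is deflected when it leaves through a non-productive port."""
    return out_port not in productive_ports(cur, dst)
```

Under this sketch, a locally-addressed flit that was not ejected has no productive port at all, so any output port it takes counts as a deflection.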

Figure 2a shows the architecture of the deflection router with four pairs of input and output network ports (denoted as N - North, S - South, W - West and E - East) and a pair of eject and inject ports which are connected to the NI. The router is composed of three consecutive stages: the eject stage, the inject stage and the port allocation and switching stage (PAS). Through these stages, four internal flit-channels, C1, ..., C4, are established to guide flits from the set of input ports to the set of output ports. The eject stage compares the destination addresses of the incoming flits with the router's own address in order to differentiate between locally-addressed flits (i.e. flits that are destined for the local processing core) and in-transit flits (destined for other processing cores). A locally-addressed flit is removed from the flit-channel and is forwarded to the NI. If more than one locally-addressed flit is present, the eject block randomly picks one. Non-ejected locally-addressed flits propagate through the rest of the router logic and are eventually deflected. A new flit (generated by the local processing core) is injected into the router by the NI through the inject stage. The inject stage detects the presence of an empty flit-channel and directs the new flit to that channel. A new flit can be injected only in clock cycles in which the router did not receive an in-transit flit through every one of its four input network ports. If the new flit is not injected into the network, then it remains in the NI's queue and is resubmitted in the next clock cycle.
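The behavior of the eject and inject stages described above can be sketched as follows. This is a behavioral illustration, not the hardware: the flit representation (a dictionary with a "dst" field), the list of four channels and the function names are assumptions introduced for this example.

```python
import random
from typing import List, Optional, Tuple

Flit = dict  # hypothetical flit record; only the "dst" field is used here


def eject_stage(channels: List[Optional[Flit]], router_addr,
                rng: random.Random) -> Tuple[Optional[Flit], List[Optional[Flit]]]:
    """Remove at most one locally-addressed flit, picked at random."""
    local = [i for i, f in enumerate(channels)
             if f is not None and f["dst"] == router_addr]
    out = list(channels)
    if not local:
        return None, out
    pick = rng.choice(local)
    ejected, out[pick] = out[pick], None  # the channel becomes empty
    return ejected, out


def inject_stage(channels: List[Optional[Flit]],
                 new_flit: Optional[Flit]) -> Tuple[bool, List[Optional[Flit]]]:
    """Place new_flit into the first empty flit-channel, if there is one."""
    out = list(channels)
    if new_flit is not None:
        for i, f in enumerate(out):
            if f is None:
                out[i] = new_flit
                return True, out
    return False, out  # the flit stays in the NI queue for the next cycle
```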

Fig. 2 Architecture of baseline bufferless deflection router: a) internal structure, and b) PAS based on permutation network

The most complex stage in the deflection router is the PAS stage. It permutes and passes the flits from flit-channels (C1, ..., C4) to output network ports (Sout, Nout, Eout, Wout). Here, we adopt the PAS stage introduced in the CHIPPER router, where the deflection-routing problem is mapped to a four-input permutation network [13]. The permutation network consists of four two-input switch modules arranged into two stages (Fig. 2b). Each switch module either passes or swaps a pair of flits from its input to its output ports. In this manner, a 1-to-1 mapping of four input flits to four network output ports is achieved. Each switch module is controlled by arbitration logic which first decides the winner between two flits, and then sends the winning flit toward its productive output port. The losing flit is directed to the other output of the module. If both outputs of the module are productive for the winning flit, then the pass configuration is selected. If neither of the module's output ports is productive, the winning flit can take either port. The winner between two input flits is determined according to the silver-flit arbitration policy [15]. In this arbitration scheme, a single flit (which is randomly selected among the flits that enter the permutation network at every clock cycle) is designated as the silver flit, i.e. it is prioritized above the others. The silver flit always wins in arbitration. The winner between any two non-silver flits is decided randomly. This rule helps to reduce the deflection rate, because it ensures that at least one flit will win in arbitration at both stages of the permutation network.
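The arbitration performed by a single switch module can be sketched in Python as shown below. The sketch makes several assumptions that are not fixed by the text: each module output is described by the set of network ports reachable through it (reach0, reach1), the Flit record and function name are hypothetical, and when neither output is productive the pass configuration is used, which is just one of the permitted choices.

```python
import random
from dataclasses import dataclass
from typing import FrozenSet, Optional, Set, Tuple


@dataclass(frozen=True)
class Flit:
    productive: FrozenSet[str]  # network ports that bring the flit closer
    silver: bool = False        # True for the single prioritized flit


def switch_module(in0: Optional[Flit], in1: Optional[Flit],
                  reach0: Set[str], reach1: Set[str],
                  rng: random.Random) -> Tuple[Optional[Flit], Optional[Flit]]:
    """Return the flits on (output 0, output 1) of a 2x2 switch module."""
    if in0 is None and in1 is None:
        return None, None
    # Step 1: decide the winner (a lone flit wins, silver beats non-silver).
    if in1 is None:
        win_idx = 0
    elif in0 is None:
        win_idx = 1
    elif in0.silver != in1.silver:
        win_idx = 0 if in0.silver else 1
    else:
        win_idx = rng.randrange(2)  # random choice between equal-rank flits
    winner, loser = (in0, in1) if win_idx == 0 else (in1, in0)
    # Step 2: send the winner toward an output that leads to a productive port.
    want0 = bool(winner.productive & reach0)
    want1 = bool(winner.productive & reach1)
    if want0 != want1:
        out_idx = 0 if want0 else 1
    else:
        out_idx = win_idx  # pass configuration (assumed also for "neither")
    return (winner, loser) if out_idx == 0 else (loser, winner)
```

In this sketch the losing flit simply takes the remaining output, which is where an unnecessary deflection can arise at the level of an individual module.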

2.2. Buffered deflection NoC

In the buffered deflection network, a small side buffer is attached to each router. The side buffer can be implemented either as a single flit-register, or as a small FIFO (composed of several flit-registers). With the side buffer at its disposal, the router is able to buffer at most one deflected flit per clock cycle. In this way, by replacing misrouting with less costly buffering for a fraction of the deflected flits, a performance improvement is achieved. Since the buffered flit does not change its position in the network, the delay overhead due to deflection for such a flit is reduced to only one clock cycle.

Figure 3 shows the architecture of the buffered deflection router. A side buffer is attached to the deflection router via two additional stages: the buffer-eject stage and the buffer-inject stage. The buffer-eject stage recognizes the deflected flits at the output of the PAS stage, and puts one of them into the side buffer if the side buffer is not full. This flit is picked randomly among the deflected flits. The buffered flit will be re-injected through the buffer-inject stage in some later clock cycle, when there is a free flit-channel after flit ejection. The fact that the buffer-inject stage precedes the inject stage gives priority to flit re-injection from the side buffer over flit injection from the NI. As a consequence of this stage arrangement, a buffered locally-addressed flit cannot be delivered to the NI directly from the side buffer. Instead, such a flit is moved to the PAS stage and deflected (and possibly buffered) again. The only way for that flit to reach the NI is to be first misrouted to a neighboring router. Accordingly, the total buffering cost for a locally-addressed flit will be three clock cycles: one for buffering plus two due to misrouting. For this reason, in the considered baseline buffered deflection architecture, the buffer-eject stage prevents buffering of non-ejected locally-addressed flits, which represents a modification with respect to the MinBD deflection router [15].

Fig. 3 Architecture of baseline buffered deflection router (figure labels: Eject, Inject, Buffer Inject, Buffer Eject)

3. Dual-mode inter-router channel

In this section, we will present a novel approach for flit misrouting suppression in deflection-routed NoC architectures. The approach is based on enhancing the flow-control capability of the inter-router communication channels as an additional feature to the conventional bufferless and buffered deflection-routed NoC architectures. With this enhancement, the inter-router channel is able to dynamically create local, external feedback paths between router ports. In this way, it can force the deflected flits to stay at their current route positions, instead of being misrouted to the next router.

3.1. Overview

The general idea of the proposed approach is depicted in Fig. 4. The figure shows the segment of a deflection network with the bidirectional channel between routers A and B. The conventional inter-router channel forces the flits residing at opposite ends of the channel to exchange their current route positions, as indicated by dashed curved lines in Fig. 4a. Flit fA is transferred from router A to router B, and simultaneously, flit fB is transferred in the opposite direction (from router B to router A). The exchange of flits between adjacent routers provides the mechanism for flit transfer through the deflection network. However, this mechanism cannot prevent a deflected flit from going one hop further in the wrong direction.

Suppose now that flits are redirected within the channel before they are written into the in-channel flit-registers, that is, flit fA is sent to rBA instead of rAB, while flit fB is sent to rAB instead of rBA, as indicated by dashed lines in Fig. 4b. The two loop-back paths formed by such redirection prevent the flits from leaving their current route positions. This means that both flits only move from one input port to another input port of the same router. In this manner, flit fA is transferred from the North to the East port of router A, while flit fB is transferred from the South to the West port of router B.

IF (fA.isAbsent OR fA.isDeflected) AND (fB.isAbsent OR fB.isDeflected) THEN
    mode = loop_back;
ELSE
    mode = exchange;
END IF;

(c)

Fig. 4 High-level illustration of dual-mode inter-router channel operation: a) exchange mode, b) loop-back mode, and c) operating mode selection rule

The in-channel redirection is beneficial for the deflected flits because they avoid misrouting, but it blocks the progress of productively-routed flits. Therefore, there is a need for a configurable inter-router communication channel that is able to switch dynamically (i.e. on a cycle-by-cycle basis) between two modes of operation: a) the exchange mode, in which neighboring routers exchange their flits, allowing the non-deflected flits to make productive hops, and b) the loop-back mode, in which routers keep their flits, preventing the deflected flits from making non-productive hops. Obviously, the choice of the operating mode depends on the routing statuses of the flits that are currently present at the opposite ends of the channel. The deflected flits demand the loop-back mode, while the productively-routed flits request the exchange mode. In order not to disturb the flow of the productively-routed flits, we introduce the following operating mode selection rule: if there is a productively-routed flit on either side of the channel, the channel is configured in the exchange mode; otherwise, the channel is configured in the loop-back mode. For a formal definition of the operating mode selection rule, see the pseudo code given in Fig. 4c. The consequences of this rule are:

a) A deflected flit will be misrouted only if there is a productively-routed flit on the opposite side of the channel. In all other cases, the deflected flit will stay at its current route position, thus saving one clock cycle;

b) The loop-back mechanism is transparent for productively-routed flits, which will flow as in a network with the conventional inter-router channels.
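Consequences a) and b) can be checked with a small behavioral model of the mode selection rule. The sketch below is illustrative only; the Flit record with a deflected flag and the function names select_mode and channel_step are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Flit:
    name: str
    deflected: bool  # True if the router sent the flit to a non-productive port


def select_mode(f_a: Optional[Flit], f_b: Optional[Flit]) -> str:
    """Loop back only if each side is either empty or holds a deflected flit."""
    def allows_loop_back(f: Optional[Flit]) -> bool:
        return f is None or f.deflected
    if allows_loop_back(f_a) and allows_loop_back(f_b):
        return "loop_back"
    return "exchange"


def channel_step(f_a: Optional[Flit],
                 f_b: Optional[Flit]) -> Tuple[Optional[Flit], Optional[Flit]]:
    """Return (flit arriving at router A, flit arriving at router B)."""
    if select_mode(f_a, f_b) == "loop_back":
        return f_a, f_b  # each router gets its own flit back
    return f_b, f_a      # flits swap routers, as in a conventional channel
```

In this model a deflected flit crosses the channel only when the opposite side holds a productively-routed flit, and a productively-routed flit always crosses it.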

3.2. Hardware implementation

The dual-mode inter-router communication channel can be designed using the datapath shown in Fig. 5. It consists of two flit-registers, a 2x2 switch module and simple mode selection logic. The switch module regulates the connection between the routers' output ports and the in-channel flit-registers. The swap configuration of the switch module corresponds to the exchange mode, while the no-swap configuration corresponds to the loop-back mode of the inter-router channel. In order to implement the mode selection rule, each router's output port needs to be extended with a binary output signal (denoted as p in Fig. 5). This signal indicates the routing status of the flit which is present on that port. In particular, the signal p is set to 1 if the corresponding output port is occupied by a productively-routed flit. The exchange mode is selected if at least one of the two routers indicates the presence (v=1) of a productively-routed flit (p=1); otherwise, the channel is configured in the loop-back mode.
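At the signal level, the mode selection logic and the switch in front of the two flit-registers can be modeled as follows. This is a sketch under assumptions: the Port record, the function names, and the convention that register r_ab feeds router B while r_ba feeds router A are introduced here for illustration, and the model abstracts from how the switch is drawn in Fig. 5.

```python
from typing import NamedTuple, Optional, Tuple


class Port(NamedTuple):
    v: int               # valid: 1 if a flit is present on the output port
    p: int               # productive: 1 if that flit is productively routed
    flit: Optional[str]  # flit payload (a label is enough for this sketch)


def exchange_select(a: Port, b: Port) -> int:
    """1 selects the exchange mode, 0 selects the loop-back mode."""
    return int((a.v and a.p) or (b.v and b.p))


Reg = Tuple[int, Optional[str]]  # (valid bit, flit) stored in a flit-register


def next_registers(a: Port, b: Port) -> Tuple[Reg, Reg]:
    """Return the next contents of (r_ab, r_ba) at the clock edge."""
    sel = exchange_select(a, b)
    to_r_ab = a if sel else b  # multiplexer in front of r_ab (feeds router B)
    to_r_ba = b if sel else a  # multiplexer in front of r_ba (feeds router A)
    return (to_r_ab.v, to_r_ab.flit), (to_r_ba.v, to_r_ba.flit)
```

For example, a deflected flit on router A's port facing an empty port of router B ends up in r_ba and therefore re-enters router A in the next cycle.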

Fig. 5 Architecture of proposed dual-mode inter-router communication channel

3.3. Side-buffering vs. looping-back

Side-buffering and the dual-mode inter-router channel, as mechanisms for misrouting suppression in deflection-routed networks, share similarities, but also have some important differences. First of all, once a deflection happens within the PAS stage of the router, it cannot be canceled by either of the two mechanisms. Instead, these mechanisms only reduce the deflection overhead by one clock cycle. Another important common characteristic of both mechanisms is that they do not directly influence the passage of productively-routed flits, which are never held in the side-buffer nor looped back in the dual-mode inter-router channel. Both mechanisms attempt to avoid flit misrouting by temporarily holding the deflected flits in flit-registers. The side-buffer technique uses a centralized flit-register, dedicated specifically to that purpose. When it is not full, this register can save one deflected flit from misrouting in each clock cycle. When the router is overloaded, the lack of free flit-channels may postpone re-injection of the buffered flit for several clock cycles. During that period the side-buffer is practically useless because it is not able to accept new deflected flits. On the other hand, the loop-back mechanism relies on the existing in-channel flit-registers. Unless there are productively-routed flits coming from neighboring routers, the loop-back mechanism can potentially save all deflected flits that appear on the router's output ports during a clock cycle. In addition, by entering the router via an input network port, the looped-back flit has direct access to the PAS stage. Therefore, the looped-back flit will have a higher chance of getting a productive port in the very next clock cycle than one residing in the side buffer. Finally, the side-buffer and the loop-back techniques do not exclude each other but can be combined. If both mechanisms are implemented in the same network, they will operate independently in succession, with side-buffering being the first stage and loop-backing being the second stage of this combined misrouting suppression approach.

4. Evaluation

In this section, we use an in-house architectural-level cycle-accurate NoC simulator to evaluate the impact of the proposed dual-mode inter-router communication channel on the performance of deflection-routed networks. The simulator models a 2D mesh NoC of NxN nodes by using the models of the baseline deflection router architectures presented in Section 2. The design parameters varied between simulations are: a) the baseline deflection architecture (BL - bufferless, and BF - buffered), b) the inter-router channel configuration (SMC - conventional single-mode channel, and DMC - dual-mode channel), and c) the NoC size NxN, where N is taken from the set of even integers in the range 4 to 16. The test configurations are represented in the results as A_C, where $A \in \{\mathrm{BL}, \mathrm{BF}\}$ and $C \in \{\mathrm{SMC}, \mathrm{DMC}\}$. Each simulation runs for a warm-up period of 1,000 cycles, plus a measurement period of 20,000 cycles.

4.1. Performance in saturation mode

The first evaluation was carried out in the saturation mode under the uniform traffic pattern. In this mode, each node injects a new flit into the network in every clock cycle in which at least one flit-channel is available in the router inject stage. The injected flits are destined randomly to other nodes with an equal probability. The performance measures are: a) the deflection rate, b) the transport delay, and c) the saturation throughput. Fig. 6 shows a comparison of the saturation performance between various deflection-routed NoC configurations.

Fig. 6 Performance in saturation mode: a) deflection rate, b) transport delay, c) saturation throughput, and d) percentage improvement in saturation throughput.

Deflection rate is the probability of a flit being deflected within each router on its path to destination. As can be observed from Fig. 6a, the deflection rate in the saturation mode varies slightly with the change of NoC size and/or the configuration of the inter-router channel. This result indicates that the rate at which deflections are produced in the network does not depend significantly on the misrouting suppression mechanism employed but is primarily determined by the efficiency of the router PAS stage (i.e. permutation network with silver flit arbitration policy).

Transport delay is defined as the time, measured in clock cycles, elapsed from the instant when the source node injects the flit into the network to the instant when the destination node receives it. The transport delay, $T$, can be decomposed into two parts, where one part is the shortest path delay, $T_{sp}$, and the other part is the deflection overhead, $T_{def}$:

$$T = T_{sp} + T_{def} \qquad (1)$$

The shortest path delay is a lower bound on transport delay, which is determined by the number of hops along the shortest path between the flit's source and destination node. It can only be obtained in the absence of contention, when flits travel through the network without deflections. Under the uniform traffic pattern, the average shortest path delay depends on network topology only, and equals $2N/3$ cycles for an $N \times N$ mesh network [6]. In the presence of contentions, the transport delay can be substantially larger due to increased deflection overhead. The deflection overhead, $T_{def}$, depends not only on the deflection rate, but also on how deflections are handled in the network. In the baseline bufferless architecture, i.e. without any misrouting suppression mechanism implemented, each deflection causes misrouting, thus prolonging the transport delay by two clock cycles. In other architectures, a fraction of deflected flits are buffered and/or looped back and the rest of them are misrouted. Let $p_m$, $p_b$, and $p_l$ be the probabilities of a deflected flit being misrouted, buffered and looped back at each router, respectively. Obviously, $p_m + p_b + p_l = 1$. Given that each flit misrouting incurs a delay overhead of two clock cycles, while each flit buffering and loop-back adds an overhead of one clock cycle, the expected deflection overhead for a flit that spent $T$ cycles in the network with the deflection rate $\delta$ equals: $T_{def} = T\delta(2p_m + p_b + p_l) = T\delta(2 - p_b - p_l)$. Substituting this equation in Eq. (1) and then rearranging we obtain:

$$T = \frac{T_{sp}}{1 - \delta(2 - p_b - p_l)} \qquad (2)$$

Since the deflection rate and the shortest path delay can both be considered as constant values, Eq. (2) signifies that the implementation of side-buffering ($p_b > 0$) and/or dual-mode channels ($p_l > 0$) leads to reduced transport delay. In fact, the per-deflection overhead is decreased from 2 clock cycles, as in the baseline bufferless architecture, to $2 - p_b - p_l$ clock cycles in the general case.
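The effect of the per-deflection overhead on transport delay can be made concrete with a small numeric sketch. Here t_sp denotes the shortest path delay, delta the deflection rate, and p_b and p_l the probabilities that a deflected flit is buffered or looped back; the model is transport delay = t_sp / (1 - delta * (2 - p_b - p_l)), which follows from charging two cycles per misrouting and one cycle per buffering or loop-back. The function name and all numeric values below are illustrative assumptions, not measured results.

```python
def transport_delay(t_sp: float, delta: float,
                    p_b: float = 0.0, p_l: float = 0.0) -> float:
    """Expected transport delay for a given per-deflection overhead."""
    per_deflection = 2.0 - p_b - p_l       # cycles lost per deflection
    denom = 1.0 - delta * per_deflection
    if denom <= 0.0:
        # The model predicts unbounded delay for such parameter values.
        raise ValueError("deflection overhead consumes every cycle")
    return t_sp / denom


# Illustrative values only: 8x8 mesh (t_sp = 16/3), deflection rate 0.2.
baseline = transport_delay(16 / 3, 0.2)            # every deflection misroutes
with_dmc = transport_delay(16 / 3, 0.2, p_l=0.5)   # half of them loop back
```

With these assumed inputs the loop-back case yields a shorter delay than the baseline, since the per-deflection overhead drops from 2 to 1.5 cycles.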

As observed in Fig. 6b, the increase of NoC size leads to a proportional increase in the transport delay. This effect is due to a linear dependence of the shortest path delay on the NoC size. For a given NoC size, the variation in transport delay is a consequence of the difference in the deflection overhead induced by different inter-router channel configurations. The use of the dual-mode channel alone (BL_DMC configuration) decreases the average transport delay relative to the baseline bufferless architecture (BL_SMC configuration) by up to 11%. The side-buffer technique (BF_SMC) provides a similar improvement, i.e. up to 15%. The combination of the two misrouting suppression mechanisms (BF_DMC) lowers the transport delay by up to 21%.
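The linear dependence of the shortest path delay on the NoC size can be checked by brute force. The sketch below assumes one clock cycle per hop and destinations chosen uniformly among all other nodes; the function name is hypothetical. Under these assumptions the average hop count evaluates to 2N/3 for the mesh sizes tested.

```python
from fractions import Fraction
from itertools import product


def avg_shortest_path(n: int) -> Fraction:
    """Average Manhattan distance between distinct nodes of an n x n mesh."""
    nodes = list(product(range(n), repeat=2))
    total = 0
    pairs = 0
    for (x1, y1) in nodes:
        for (x2, y2) in nodes:
            if (x1, y1) != (x2, y2):  # a node never sends to itself
                total += abs(x1 - x2) + abs(y1 - y2)
                pairs += 1
    return Fraction(total, pairs)  # exact rational result
```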

Figure 6c shows the saturation throughput (i.e. maximum traffic accepted by the network measured in flits per node and per clock cycle) as a function of NoC size for different NoC configurations. Figure 6d shows the same data, but represented as the percentage improvement in saturation throughput of architectures with misrouting suppression capability over the baseline bufferless NoC. The general trend of decreasing throughput with increasing NoC size observed in Fig. 6c is a consequence of a lower average transport delay in small-scale than in large-scale NoC architectures. From Fig. 6d it is evident that the misrouting suppression mechanisms bring a significant improvement in saturation throughput, which varies slightly with NoC size. The implementation of dual-mode channels in the baseline bufferless architecture improves the throughput by 11.8-14.5%. This improvement is smaller than with the side-buffering technique, which raises the throughput by 21.8-28.1%. With both misrouting suppression techniques implemented, the increase in saturation throughput reaches 27.6-34.1%.

4.2. Fairness

One of the major concerns for any routing scheme, deflection or otherwise, is its ability to balance network loads, giving each node a fair chance to send its flits through the network. It is known that many routing algorithms are not able to balance traffic in 2D mesh, even under the uniform traffic pattern [24]. By steering flits toward the middle of the network, they leave the edge links underutilized and congest the center of the mesh. With deflection routing, where in-transit traffic takes precedence in the router over new injections, congested nodes may not be able to inject their flits into the network as fast as nodes at the network boundary. In order to evaluate fairness in different deflection NoC architectures, we measure the injection rates of individual nodes, i.e. the number of flits injected by each node per clock cycle. Fairness is achieved when the injection rates of all nodes are in close proximity.

Fig. 7 Injection rate distribution under saturation load in deflection-routed 2D mesh NoCs: a) baseline bufferless; b) bufferless with dual-mode inter-router channels; c) baseline buffered, and d) buffered with dual-mode inter-router channels

Fig. 7 shows distribution of the injection rate over all nodes in different deflection-routed NoC architectures. As observed in Fig. 7a, the injection rate differences between nodes in the baseline bufferless deflection architecture are minimal. This advantage occurs because the deflection routing is inherently adaptive, i.e. able to effectively spread traffic away from congested areas to underutilized areas of the network. Fig. 7b shows that the fairness property is not compromised with the inclusion of dual-mode inter-router channels. This is because the loop-back mechanism is transparent for the deflection router, which treats each incoming flit equally, regardless of whether the flit is looped back or comes from a neighboring router.

As presented in Fig. 7c, the injection rate differences between nodes in the baseline buffered architecture are significant: while corner nodes can inject their flits at almost every cycle, the nodes in the middle of the mesh get a chance to inject their flits on every tenth cycle. The strong unfairness observed in the buffered deflection architecture appears because the flits residing in the side-buffer have injection precedence over new flits waiting at the NI inject ports. Under saturation load, links in the middle area of the network are almost fully utilized, and a free flit-channel in the router appears only after flit ejection. In most cases, the just released flit-channel is immediately occupied by the buffered flit, leaving the new flit to wait for another chance. Under such conditions, the ability of nodes in the middle of the mesh to send their flits is significantly reduced, although they are still able to receive flits at the full rate. As can be seen in Fig. 7d, the loop-back mechanism slightly improves the fairness in the buffered network, but the main problem of reduced injection rate of middle nodes still persists. This observation suggests that the buffered deflection router needs further improvements to avoid imbalance between the incoming and the outgoing throughput of nodes with highly loaded routers in throughput-oriented workloads.
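The fairness criterion used in this section (injection rates of all nodes in close proximity) can be illustrated with a minimal Python sketch. The two summary metrics below, the min/max ratio and the coefficient of variation, are assumptions chosen for illustration and are not taken from the paper; the flit counts are hypothetical, not measured data.

```python
from statistics import mean, pstdev


def injection_rates(flits_injected, cycles):
    """Per-node injection rate in flits per node per clock cycle."""
    return [count / cycles for count in flits_injected]


def fairness_summary(rates):
    """Return (min/max ratio, coefficient of variation) of the rates.

    Both metrics are illustrative choices (assumptions): a ratio near 1
    and a variation near 0 mean the injection rates are close together.
    """
    avg = mean(rates)
    return min(rates) / max(rates), pstdev(rates) / avg


# Hypothetical flit counts collected over 1000 cycles (not measured data).
balanced = injection_rates([300, 310, 295, 305], 1000)
skewed = injection_rates([950, 100, 100, 950], 1000)

balanced_ratio, balanced_cv = fairness_summary(balanced)
skewed_ratio, skewed_cv = fairness_summary(skewed)
```

In this sketch, the `skewed` case mimics the situation described for Fig. 7c, where boundary nodes inject far more often than nodes in the middle of the mesh.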

4.3. Latency

In this section, we evaluate the latency of deflection NoC architectures for different values of traffic load. In these simulations, flits are generated in each node with exponentially distributed inter-arrival times, whose mean is calculated according to the desired load. More specifically, the inter-arrival time is varied so that the offered load (i.e. the average number of flits generated in each node per clock cycle) changes from zero to network saturation. Two synthetic traffic patterns are considered: uniform and transpose. In the uniform traffic pattern, each node sends flits randomly to other nodes with equal probability. In the transpose traffic pattern, the source node positioned at (x, y) sends flits to the destination node (y, x) for all x ≠ y. The uniform traffic is benign in the sense that it naturally balances load. On the other hand, the transpose traffic is adversarial since it causes load imbalance.
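The traffic model described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulator used in the paper: the function names, the coordinate representation of nodes, and the seeded random generator are all hypothetical.

```python
import random


def uniform_destination(src, size, rng):
    """Pick a destination other than src with equal probability."""
    while True:
        dst = (rng.randrange(size), rng.randrange(size))
        if dst != src:
            return dst


def transpose_destination(src):
    """Node (x, y) sends to (y, x); diagonal nodes (x == y) have no target."""
    x, y = src
    return None if x == y else (y, x)


def generation_times(offered_load, horizon, rng):
    """Flit generation times with exponential inter-arrival times.

    The mean inter-arrival time is 1 / offered_load cycles, so that on
    average offered_load flits are generated per node per clock cycle.
    """
    times = []
    t = rng.expovariate(offered_load)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(offered_load)
    return times


rng = random.Random(7)  # fixed seed, chosen arbitrarily for repeatability
times = generation_times(0.2, 10000, rng)
```

Sweeping `offered_load` from a small value up to the saturation point reproduces the kind of load axis used in load-latency graphs.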

Fig. 8 contains load-latency graphs for the deflection NoC configurations of size 8x8 across two synthetic traffic patterns. Latency numbers presented in these graphs are measured from the time the flit was generated at the source node to the time it arrives at the destination node, including the time the flit spends in the NI queue. As can be seen from these figures, for both traffic patterns, the four deflection schemes have almost the same performance at low traffic load. As the traffic load increases, the flit latency dramatically increases due to the network congestion. In the uniform traffic pattern, the inclusion of misrouting suppression mechanisms provides improvements in latency over the baseline bufferless router design with a higher saturation point. Among the analyzed deflection schemes, BF_DMC architecture performs best, followed in order by BF_SMC, BL_DMC and BL_SMC. Fig. 8b shows that the buffered deflection-routed NoC suffers significant performance loss under the adversarial traffic pattern. In fact, when the side buffer is added, the saturation point drops by 25% relative to the baseline bufferless architecture. This occurs because the negative effects of the inherent unfairness of the buffered scheme are additionally worsened by the load imbalance produced by the transpose traffic pattern. On the other hand, the inclusion of the dual-mode channel improves latency in both baseline architectures. From these results, we can conclude that the loop-back mechanism fits more naturally into the deflection routing than the side buffer technique. This is due to the fact that the use of dual-mode channels improves performance of the baseline deflection scheme without sacrificing its adaptability and load balancing properties.

Fig. 8 Load-latency graphs comparing baseline deflection NoC architecture and architectures with misrouting suppression support under different traffic patterns: a) uniform; b) transpose

4.4. Hardware complexity

The main datapath components of the deflection-routed NoC architecture are: the two-input flit-wide multiplexer (or flit-multiplexer, for short) and the flit-register. Because of relatively large flit size, which varies from 32 to 256 bits in most modern deflection NoC designs, these two components dominate the hardware complexity of the deflection network. The flit-multiplexers are used to control the flow of flits in both the deflection router and the dual-mode inter-router channel. The eject stage consumes three, and the inject stage consumes four flit-multiplexers. Each of four switch modules in the PAS stage as well as the single switch module in the dual-mode channel contains two flit-multiplexers.

Table 1 shows hardware comparison between bufferless and buffered deflection-routed 2D mesh architectures with the conventional single-mode channels (SMC) and the dual-mode inter-router channels (DMC). The comparison is based on the numbers of flit-registers and flit-multiplexers expressed per inner router node. Implementation of the side-buffer technique requires inclusion of an additional pair of eject-inject stages into the datapath of the bufferless deflection router, which incurs an overhead of 46.6% in terms of combinational logic, along with an increase of 25% in memory elements (assuming a side-buffer capacity of one flit). On the other hand, replacing the conventional inter-router channels with the dual-mode channels requires the addition of only two flit-multiplexers per channel, that is, four flit-multiplexers per (inner) router node; no additional flit-registers are needed. When applied to the bufferless architecture, the proposed technique introduces an overhead of only 26.6% in combinational logic. The dual-mode inter-router channel can also be combined with the side-buffering with minimal architectural impact, requiring only an 18.2% increase in combinational logic compared to the baseline buffered router.

Table 1. Hardware cost summary of different deflection NoC architectures

Baseline deflection architecture    #flit-registers (SMC/DMC)    #flit-multiplexers (SMC/DMC)
Bufferless (BL)                     4/4                          15/19
Buffered (BF)                       5/5                          22/26
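The entries of Table 1 and the overhead percentages quoted in the text follow from the per-stage component counts given in this section. The short Python sketch below recomputes them; the function and constant names are hypothetical, and the counts are taken from the text (three flit-multiplexers in the eject stage, four in the inject stage, two in each of the four PAS switch modules, and four per inner router node for the dual-mode channels).

```python
EJECT_MUXES = 3          # flit-multiplexers in the eject stage
INJECT_MUXES = 4         # flit-multiplexers in the inject stage
PAS_MUXES = 4 * 2        # four switch modules, two flit-multiplexers each
DMC_MUXES_PER_NODE = 4   # two per channel, stated as four per inner router node
BASE_REGISTERS = 4       # flit-registers of the bufferless router (Table 1)


def flit_multiplexers(buffered, dual_mode):
    """Flit-multiplexer count per inner router node."""
    count = EJECT_MUXES + INJECT_MUXES + PAS_MUXES
    if buffered:
        count += EJECT_MUXES + INJECT_MUXES  # additional eject-inject pair
    if dual_mode:
        count += DMC_MUXES_PER_NODE
    return count


def flit_registers(buffered):
    """Flit-register count, assuming a side-buffer capacity of one flit."""
    return BASE_REGISTERS + (1 if buffered else 0)


def overhead_percent(new, base):
    """Relative increase of new over base, in percent."""
    return 100.0 * (new - base) / base


bl_smc = flit_multiplexers(buffered=False, dual_mode=False)
bl_dmc = flit_multiplexers(buffered=False, dual_mode=True)
bf_smc = flit_multiplexers(buffered=True, dual_mode=False)
bf_dmc = flit_multiplexers(buffered=True, dual_mode=True)

side_buffer_overhead = overhead_percent(bf_smc, bl_smc)
dmc_on_bufferless = overhead_percent(bl_dmc, bl_smc)
dmc_on_buffered = overhead_percent(bf_dmc, bf_smc)
register_overhead = overhead_percent(flit_registers(True), flit_registers(False))
```

The computed multiplexer counts match Table 1, and the overheads agree with the quoted 46.6%, 26.6%, 18.2% and 25% up to rounding.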

5. Conclusions

In this paper we have presented a lightweight link-control strategy for reducing the overhead of flit deflections in deflection-routed networks-on-chip. The proposed scheme is transparent to the underlying network infrastructure, as no modification of the deflection router architecture is needed. Only the inter-router channels are augmented with simple multiplexer logic that, although it represents an overhead, does not incur a significant penalty in terms of hardware cost. The simulation analysis has shown that by using the proposed scheme it is possible to improve the performance of both the bufferless and (minimally) buffered deflection-routed networks-on-chip. Specifically, as compared to the baseline implementations, an increase of up to 14.5% in the saturation throughput and a reduction of 12.3% in the network transport delay have been observed. We conclude that the proposed dual-mode inter-router communication channel brings additional performance enhancements to the existing deflection-routed on-chip networks at low cost.

Acknowledgement

This work was partially supported by the Serbian Ministry of Science and Technological Development, Projects No. TR-32009 and TR-33035.

References

[1] Owens JD, Dally WJ, Ho R, Jayasimha DN, Keckler SW, Peh LS (2007) Research challenges for on-chip interconnection networks. IEEE Micro 27(5):96-108. doi:10.1109/MM.2007.4378787

[2] Borkar S (2010) Future of interconnect fabric: a contrarian view. In: Proceedings of 12th ACM/IEEE international workshop on System level interconnect prediction, ACM, New York, pp. 1-2. doi:10.1145/1811100.1811101

[3] Arteris. A comparison of network-on-chip and busses. Available online: http://www.arteris.com/noc whitepaper.pdf . Accessed 27 October 2014

[4] Gebali F, Elmiligi H, El-Kharashi MW (2009) Networks-on-chips: theory and practice. Taylor & Francis Group, LLC.

[5] Dally WJ, Towles B (2001) Route packets, not wires: On-chip interconnection networks. In: Proceedings of the 38th Conference on Design Automation, ACM, New York, pp. 684-689. doi:10.1109/DAC.2001.156225

[6] Dally WJ, James W, Towles B (2003) Principles and practices of interconnection networks. Morgan Kaufmann Publishers Inc. San Francisco, CA, USA

[7] Ni LM (1993) A survey of wormhole routing techniques in direct networks, Computer 26(2): 62-76. doi: 10.1109/2.191995

[8] Dally WJ (1992) Virtual-channel flow control. IEEE Transactions on Parallel and Distributed Systems, 3(2): 194-205. doi: 10.1109/71.127260

[9] Kumar A, Kundu P, Singh A, Peh LS, Jha N (2007) A 4.6 Tbits/s 3.6 GHz single-cycle NOC router with a novel switch allocator in 65 nm CMOS. In: Proceedings of 25th International Conference on Computer Design, ICCD, pp. 63-70. doi: 10.1109/ICCD.2007.4601881

[10] Moscibroda T, Mutlu O (2009) A Case for Bufferless Routing in On-Chip Networks. In: Proceedings of the 36th annual international symposium on Computer architecture, ACM, New York, pp. 196-207. doi: 10.1145/1555754.1555781

[11] Assad A, Mazhar A, Ahmad F, et al. (2014) A survey on energy-efficient methodologies and architectures of network-on-chip. Computers and Electrical Engineering, 40(8): 333-347. doi: 10.1016/j.compeleceng.2014.07.012

[12] Hayenga M (2009) SCARAB: A single cycle adaptive routing and bufferless network. In: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), pp. 244-254. doi:10.1145/1669112.1669144

[13] Fallin C, Craik C, Mutlu O (2011) CHIPPER: A low-complexity bufferless deflection router. In: Proceedings of the 17th International Symposium on High Performance Computer Architecture (HPCA), pp. 144-155. doi: 10.1109/HPCA.2011.5749724

[14] Jose J, Nayak B, Kumar K, Muyam M (2013) DeBAR: Deflection based adaptive router with minimal buffering. In: Proceedings of Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1583-1588. doi: 10.7873/DATE.2013.322

[15] Fallin C, Nazario G, Yu X, Chang K, Ausavarungnirun R, Mutlu O (2012) MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect. In: Proceedings of the 6th IEEE/ACM International Symposium on Networks on Chip, pp. 1-10. doi:10.1109/NOCS.2012.8

[16] Jafri S, Hong Y, Thottethodi M, Vijaykumar T (2010) Adaptive flow control for robust performance and energy. In: Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 433-444. doi: 10.1109/MICRO.2010.48

[17] Yan J, Lai G, Lin X (2014) A novel distributed congestion control for bufferless network-on-chip. The Journal of Supercomputing 68(2): 849-866. doi: 10.1007/s11227-013-1069-6

[18] Nychis GP, Fallin C, Moscibroda T, Mutlu O, Seshan S (2012) On-chip networks from a networking perspective: Congestion and scalability in many-core interconnects. ACM SIGCOMM Computer Communication Review 42(4): 407-418. doi: 10.1145/2342356.2342436

[19] Kohler A, Radetzki M (2009) Fault-tolerant architecture and deflection routing for degradable NoC switches. In: Proceedings of the 3rd IEEE International Symposium on Networks-on-Chip, pp. 22-31. doi: 10.1109/NOCS.2009.5071441

[20] Feng C, Zhang M, Li J, Lu Z, Jantsch A (2011) A Low-Overhead Fault-Aware Deflection Routing Algorithm for 3D Network-on-Chip. In: Proceedings of IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 19-24. doi: 10.1109/ISVLSI.2011.42

[21] Michelogiannakis G, Sanchez D, Dally WJ, Kozyrakis C (2010) Evaluating bufferless flow control for on-chip networks. In: Proceedings of the 4th ACM/IEEE International Symposium on Networks-on-Chip, pp. 9-16. doi: 10.1109/NOCS.2010.10

[22] Zhang N, Huaxi G, Yang Y, Fan D (2014) QBNoC: QoS-aware bufferless NoC architecture. Microelectronics Journal, 45 (6): 751-758. doi: 10.1016/j.mejo.2014.04.015

[23] Feng C, Li J, Lu Z, Jantsch A, Zhang M (2011) Evaluation of Deflection Routing on Various NoC Topologies. In: Proceedings of IEEE 9th International Conference on ASIC (ASICON), pp. 163-166. doi: 10.1109/ASICON.2011.6157147

[24] Guz Z, Walter I, Bolotin E, Cidon I, Ginosar R, Kolodny A (2006) Efficient link capacity and QoS design for network-on-chip. In: Proceedings of Design, Automation and Test in Europe, DATE '06, pp. 9-14. doi: 10.1109/DATE.2006.243951
