High-Performance Networks for Dataflow Architectures
Pravin Bhat
Andrew Putnam
Overview
  Motivation & Design Constraints
  Network Design
  Performance
  Adaptive Routing
  Conclusion
Motivation
  Signal delay on wires is more important than transistor switching speed
  Seriously decreased reliability in future processes
    Factory testing will not be possible
    Expect 20% of transistors to be DOA
    Expect 10% more to die over several months
  Dataflow is an answer, but the network is currently a bottleneck
Dataflow Characteristics
  Unpredictable traffic
    Cannot pre-allocate resources
  Highly bursty traffic
    Quick delivery of bursts is critical
  Nodes are not guaranteed to consume messages
    Potential for livelock & deadlock
Network Requirements
  High performance during bursts
  Area efficient
  Guarantee message delivery
  Deadlock & livelock free
  Fault tolerant
  Regular 2-D physical structure
Topology
  On-chip - must be implementable in 2-D
  Regular tiled structure suggests:
    Grid
    Torus
    Hypercube
    Fat Tree
  Hypercube is difficult to route and scale
  Fat Tree has a single point of failure
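To make the grid-versus-torus comparison concrete, the following is a minimal sketch of hop counts on the two remaining candidates. The function names and the 8x8 array size are illustrative assumptions, not part of the original design.

```python
def grid_hops(src, dst):
    """Hop count between two (x, y) tiles on a grid with no wraparound links."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def torus_hops(src, dst, width, height):
    """Shortest hop count when every row and column wraps around."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    return min(dx, width - dx) + min(dy, height - dy)


# Corner to corner on a hypothetical 8x8 array of tiles.
print(grid_hops((0, 0), (7, 7)))         # 14
print(torus_hops((0, 0), (7, 7), 8, 8))  # 2
```

The torus needs fewer hops in the worst case, but its links are physically longer, which the per-link timing in the ASIC model reflects.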
Routing
  Static routing does not provide essential fault tolerance
  Use a modified Virtual Channel (VC) algorithm
    VC guarantees deadlock freedom if nodes consume messages
    Dynamically adaptive to handle transient faults & congestion
  Initial studies used static routing
Flow Control
  Resource reservation not possible
  Long-latency wires prohibit handshakes
  Send messages assuming accept
  Buffer just enough to allow receiver to send reject signal on subsequent clock cycle
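The send-then-reject scheme above can be sketched as a small cycle-by-cycle model. This is a simplification under stated assumptions: the `Receiver` class, `send_all`, and the `drain_every` parameter are hypothetical, and the one-cycle delay of the reject signal is collapsed into a single step.

```python
from collections import deque


class Receiver:
    """Hypothetical receiver with a small fixed-size input queue."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def offer(self, msg):
        """Return True on accept, False to signal a reject."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(msg)
        return True


def send_all(messages, receiver, drain_every):
    """Send one message per cycle without asking first; resend on reject.

    The sender keeps only the message in flight, because a reject (if any)
    would arrive on the next cycle. Returns the number of cycles used.
    Assumes capacity >= 1 and drain_every >= 1.
    """
    pending = deque(messages)
    cycle = 0
    while pending:
        cycle += 1
        if cycle % drain_every == 0 and receiver.queue:
            receiver.queue.popleft()   # receiver consumes one message
        in_flight = pending[0]         # copy held until no reject is seen
        if receiver.offer(in_flight):
            pending.popleft()          # accepted: drop the sender's copy
    return cycle


print(send_all(list("abcd"), Receiver(capacity=2), drain_every=3))  # 6
```

With room in the receiver, every message costs one cycle and no handshake; only a full receiver costs retries.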
Deadlock-Free Operation
  Nodes cannot always consume messages
  Add a dedicated channel to and from memory
    Adds 8% area overhead
  Rotate stalled operands out of PEs to ensure forward progress
  Send first operand back at a faster rate to avoid livelock
Performance
  Ran network-centric simulations
    20 billion instructions
    Spec2000, Splash2, and Dataflow benchmarks
  Goal is to find optimum balance of:
    Number of Virtual Channels
    Queue Length
    Link Bandwidth
    Packets per message
[Figure: Virtual Channels. x-axis: Virtual Channels (0 to 16); y-axis: Performance (Runtime); series: ocean, lu, fir, art, mcf, each in (G) and (T) variants]
[Figure: Queue Length. x-axis: Queue Length (0 to 64); y-axis: Performance (Runtime); series: ocean, lu, fir, art, mcf, each in (G) and (T) variants]
[Figure: Link Bandwidth. x-axis: Bandwidth (0 to 16); y-axis: Performance (Runtime); series: ocean, lu, fir, art, mcf, each in (G) and (T) variants]
[Figure: Link Width. x-axis: Packets per Message (0 to 64); y-axis: Performance (Runtime); series: ocean, lu, fir, art, mcf, each in (G) and (T) variants]
ASIC Model
  Performance must be balanced with area
  Developed RTL model of WaveScalar network architecture
  90 nm process ASIC standard cell library
  Timing per link:
    Grid links: 2.76 ns
    Torus links: 6.16 ns
  Network switch is 11.6% of chip area
[Figure: Virtual Channels. x-axis: Virtual Channels (0 to 18); y-axis: Performance / Area; series: ocean, lu, fir, art, mcf, each in (G) and (T) variants]
[Figure: Link Bandwidth. x-axis: Number of Links (0 to 16); y-axis: Performance / Area; series: ocean, lu, fir, art, mcf, each in (G) and (T) variants]
[Figure: Queue Length. x-axis: Queue Length (0 to 64); y-axis: Performance / Area; series: ocean, lu, fir, art, mcf, each in (G) and (T) variants]
Virtual Channels Flow Control
  In hardware, only the head of a queue can be dequeued in one clock cycle
  If the first message in a queue is blocked, then every message behind it is blocked
  Network utilization suffers due to idle links
Virtual Channels Flow Control
  Virtual Channels: several small queues instead of one long queue
  Decouples buffer resources from link resources
  Increases network throughput by increasing link usage
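The head-of-queue blocking problem and the virtual channel remedy can be illustrated with a short sketch. The function `next_to_send` and the destination labels are hypothetical; messages are represented only by their destination, and the link carries at most one message per cycle.

```python
from collections import deque


def next_to_send(queues, blocked):
    """Pick one message for the shared link this cycle.

    Only the head of each queue is visible to the hardware, so the link
    takes the first head whose destination is not blocked.
    """
    for q in queues:
        if q and q[0] not in blocked:
            return q.popleft()
    return None  # every visible head is blocked: the link idles


blocked = {"A"}

# One long queue: the blocked head "A" stalls "B" and "C" behind it.
single = [deque(["A", "B", "C"])]
print(next_to_send(single, blocked))    # None

# Two virtual channels with the same total buffering: "B" can proceed.
virtual = [deque(["A"]), deque(["B", "C"])]
print(next_to_send(virtual, blocked))   # B
```

The buffers hold the same three messages in both cases; splitting them into separate queues is what lets the link stay busy.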
Dimension Order Routing
  Old WaveScalar routing protocol
  Network topology is a static grid
  Packets first travel to the correct x-coordinate and then to the correct y-coordinate
  Low network utilization from not using all available paths
  Not fault tolerant
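The x-then-y rule can be sketched in a few lines. The function names and the (x, y) tuple representation of tiles are illustrative assumptions.

```python
def dor_next_hop(cur, dst):
    """Next tile under dimension order routing: fix x first, then y."""
    x, y = cur
    if x != dst[0]:
        return (x + (1 if dst[0] > x else -1), y)
    if y != dst[1]:
        return (x, y + (1 if dst[1] > y else -1))
    return cur  # already at the destination


def dor_path(src, dst):
    """Full list of tiles visited from src to dst on a grid."""
    path = [src]
    while path[-1] != dst:
        path.append(dor_next_hop(path[-1], dst))
    return path


print(dor_path((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Every source and destination pair gets exactly one path, which is why a single failed link or congested tile on that path cannot be routed around.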
Adaptive Routing
  Progressively chooses longer routes instead of waiting for an unavailable resource
  High network utilization
  Fault tolerant
  Can cause deadlock
Deadlock-Free Adaptive Routing
  Some Virtual Channels are reserved for Dimension Order Routing; the rest are used for adaptive routing
  Every time a packet is routed in the wrong direction, its Dimension Reversal count is incremented
  No packet is allowed to wait in a virtual channel with a packet that has a lower Dimension Reversal count
  Mathematically proven to be deadlock free
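The dimension reversal rule can be sketched as a pair of checks. All names here (`Packet`, `route_hop`, `may_wait`, `choose_channel_class`) are hypothetical, and the fallback to the reserved dimension-order channels when a packet may not wait is an assumption about how the two channel classes interact, not something the slides state.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    name: str
    reversals: int = 0  # Dimension Reversal count


def route_hop(packet, wrong_direction):
    """Record one hop; a hop in the wrong direction counts as a reversal."""
    if wrong_direction:
        packet.reversals += 1


def may_wait(packet, occupants):
    """A packet may not wait behind any packet with a lower reversal count."""
    return all(other.reversals >= packet.reversals for other in occupants)


def choose_channel_class(packet, occupants):
    """Assumed policy: use the reserved dimension-order channels when waiting is not allowed."""
    return "adaptive" if may_wait(packet, occupants) else "dimension-order"


p = Packet("p")
route_hop(p, wrong_direction=True)                    # p.reversals is now 1
print(choose_channel_class(p, [Packet("q", 2)]))      # adaptive
print(choose_channel_class(p, [Packet("r", 0)]))      # dimension-order
```

Because waiting is only ever allowed behind packets with an equal or higher count, the count gives the ordering that such a deadlock-freedom argument relies on.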
[Figure: Virtual Channels. x-axis: Virtual Channels (0 to 16); y-axis: Performance (Runtime); series: ocean, lu, fir, art, mcf, each in (G) and (T) variants]
[Figure: Queue Length (Adaptive Speedup). x-axis: Queue Length (0 to 64); y-axis: Performance (Runtime); series: ocean, lu, fir, art, mcf, each in (G) and (T) variants]
[Figure: Link Bandwidth (Adaptive Speedup). x-axis: Bandwidth (0 to 16); y-axis: Performance (Runtime); series: ocean, lu, fir, art, mcf, each in (G) and (T) variants]
Conclusion
  Best performance per area with:
    2 Virtual Channels
    2 Links
    2-4 entries per queue
    Torus topology
    Adaptive routing
  Dataflow chip networks can be high-performance at reasonable area