High-Performance Networks for Dataflow Architectures
Pravin Bhat, Andrew Putnam
Overview
  Motivation & Design Constraints
  Network design
  Performance
  Adaptive Routing
  Conclusion
Motivation
  Signal delay on wires is more important than transistor switching speed
  Seriously decreased reliability in future processes
  Factory testing will not be possible
  Expect 20% of transistors to be DOA
  Expect 10% more to die over several months
  Dataflow is an answer, but the network is currently a bottleneck
Dataflow Characteristics
  Unpredictable traffic
  Cannot pre-allocate resources
  Highly bursty traffic
  Quick delivery of bursts is critical
  Nodes are not guaranteed to consume messages
  Potential for livelock & deadlock
Network Requirements
  High performance during bursts
  Area efficient
  Guarantee message delivery
  Deadlock & livelock free
  Fault tolerant
  Regular 2-D physical structure
Topology
  On-chip: must be implementable in 2-D
  Regular tiled structure suggests:
    Grid
    Torus
    Hypercube
    Fat Tree
  Hypercube is difficult to route, scale
  Fat Tree has a single point of failure
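One way to see the difference between the grid and the torus is to compare minimal hop counts between tiles. The sketch below is only an illustration: the function names and the 8x8 array size are assumptions, not values from the slides.

```python
def grid_hops(src, dst):
    """Minimal hops between two (x, y) tiles on a grid (no wraparound links)."""
    (sx, sy), (dx, dy) = src, dst
    return abs(sx - dx) + abs(sy - dy)


def torus_hops(src, dst, width, height):
    """Minimal hops on a torus, where each dimension wraps around."""
    (sx, sy), (dx, dy) = src, dst
    x = abs(sx - dx)
    y = abs(sy - dy)
    # In each dimension, take the shorter of the direct and the wraparound way.
    return min(x, width - x) + min(y, height - y)


# Hypothetical 8x8 array of tiles; corner to corner is the worst case on a grid.
print(grid_hops((0, 0), (7, 7)))         # 14
print(torus_hops((0, 0), (7, 7), 8, 8))  # 2
```

The torus shortens the longest routes, at the cost of longer physical wires for the wraparound links.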
Routing
  Static routing does not provide essential fault tolerance
  Use a modified Virtual Channel (VC) algorithm
  VC guarantees deadlock freedom if nodes consume messages
  Dynamically adaptive to handle transient faults & congestion
  Initial studies used static routing
Flow Control
  Resource reservation not possible
  Long-latency wires prohibit handshakes
  Send messages assuming accept
  Buffer just enough to allow receiver to send reject signal on subsequent clock cycle
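The "send assuming accept" scheme can be sketched as a small cycle-based simulation. Everything here is an assumption made for illustration (the `Receiver` class, the one-entry `in_flight` buffer, the `drain_every` consumption rate); the slides give only the idea, not an implementation.

```python
from collections import deque


class Receiver:
    """Hypothetical receiver with a small bounded queue."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def offer(self, msg):
        """Accept the message if there is room, otherwise reject it."""
        if len(self.queue) < self.capacity:
            self.queue.append(msg)
            return True
        return False


def simulate(messages, capacity, drain_every):
    """Sender transmits without a handshake; a reject is seen one cycle later.

    Returns (messages in delivery order, number of cycles simulated).
    """
    rx = Receiver(capacity)
    pending = deque(messages)  # messages not yet sent
    in_flight = None           # (msg, accepted) for the message sent last cycle
    delivered = []
    cycle = 0
    while pending or in_flight is not None or rx.queue:
        cycle += 1
        # 1. The verdict for last cycle's send arrives now; retry a reject first.
        if in_flight is not None:
            msg, accepted = in_flight
            if not accepted:
                pending.appendleft(msg)
            in_flight = None
        # 2. The receiving node consumes one message every `drain_every` cycles.
        if rx.queue and cycle % drain_every == 0:
            delivered.append(rx.queue.popleft())
        # 3. The sender transmits optimistically, keeping a copy for one cycle.
        if pending:
            msg = pending.popleft()
            in_flight = (msg, rx.offer(msg))
    return delivered, cycle


order, cycles = simulate([0, 1, 2, 3], capacity=1, drain_every=2)
print(order, cycles)
```

The sender only needs to hold each message until the reject window has passed, which is what keeps the buffering small in this sketch.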
Deadlock-Free Operation
  Nodes cannot always consume messages
  Add a dedicated channel to and from memory
  Adds 8% area overhead
  Rotate stalled operands out of PEs to ensure forward progress
  Send first operand back at a faster rate to avoid livelock
Performance
  Ran network-centric simulations
  20 billion instructions
  Spec2000, Splash2, and Dataflow benchmarks
  Goal is to find optimum balance of:
    Number of Virtual Channels
    Queue Length
    Link Bandwidth
    Packets per message
[Figure: Virtual Channels. Performance (Runtime) vs. number of virtual channels (0 to 16). Series: ocean, lu, fir, art, mcf, each for (G) and (T).]
[Figure: Queue Length. Performance (Runtime) vs. queue length (0 to 64). Series: ocean, lu, fir, art, mcf, each for (G) and (T).]
[Figure: Link Bandwidth. Performance (Runtime) vs. bandwidth (0 to 16). Series: ocean, lu, fir, art, mcf, each for (G) and (T).]
[Figure: Link Width. Performance (Runtime) vs. packets per message (0 to 64). Series: ocean, lu, fir, art, mcf, each for (G) and (T).]
ASIC Model
  Performance must be balanced with area
  Developed RTL model of WaveScalar network architecture
  90 nm process ASIC standard cell library
  Timing per link:
    Grid links: 2.76 ns
    Torus links: 6.16 ns
  Network switch is 11.6% of chip area
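The two link timings can be put in perspective with a little arithmetic. The sketch below assumes, purely for illustration, that each hop costs exactly one link delay and that switch and queueing delay are ignored; the hop counts in the example are hypothetical.

```python
GRID_LINK_NS = 2.76   # from the slide
TORUS_LINK_NS = 6.16  # from the slide


def path_latency_ns(hops, link_ns):
    """Wire latency of a path under the one-link-delay-per-hop assumption."""
    return hops * link_ns


ratio = TORUS_LINK_NS / GRID_LINK_NS
print(f"torus link delay / grid link delay = {ratio:.2f}")  # 2.23

# Hypothetical routes: 14 hops on the grid vs. 2 hops on the torus.
print(f"{path_latency_ns(14, GRID_LINK_NS):.2f} ns")   # 38.64 ns
print(f"{path_latency_ns(2, TORUS_LINK_NS):.2f} ns")   # 12.32 ns
```

Under this assumption, a torus route breaks even on wire delay only when it needs about 2.23 times fewer hops than the corresponding grid route.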
[Figure: Virtual Channels. Performance / Area vs. number of virtual channels (0 to 18). Series: ocean, lu, fir, art, mcf, each for (G) and (T).]
[Figure: Link Bandwidth. Performance / Area vs. number of links (0 to 16). Series: ocean, lu, fir, art, mcf, each for (G) and (T).]
[Figure: Queue Length. Performance / Area vs. queue length (0 to 64). Series: ocean, lu, fir, art, mcf, each for (G) and (T).]
Virtual Channels Flow Control
  In hardware, only the head of a queue can be dequeued in one clock cycle
  If the first message in a queue is blocked, then every message behind it is blocked
  Network utilization suffers due to idle links
Virtual Channels Flow Control
  Virtual Channels: several small queues instead of one long queue
  Decouples buffer resources from link resources
  Increases network throughput by increasing link usage
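The effect of splitting one long queue into several short ones can be shown with a toy model of a single link. The function names, the round-robin assignment of messages to virtual channels, and the use of destination labels as messages are all assumptions made for this sketch.

```python
from collections import deque


def forward_single_queue(messages, blocked, cycles):
    """One long FIFO: only the head may leave, so a blocked head stalls all."""
    queue = deque(messages)
    sent = []
    for _ in range(cycles):
        if queue and queue[0] not in blocked:
            sent.append(queue.popleft())
    return sent


def forward_virtual_channels(messages, blocked, cycles, num_vcs):
    """Several short FIFOs share one link; any unblocked head may use it."""
    vcs = [deque() for _ in range(num_vcs)]
    for i, dest in enumerate(messages):
        vcs[i % num_vcs].append(dest)  # hypothetical round-robin VC assignment
    sent = []
    for _ in range(cycles):
        for vc in vcs:  # the link still carries at most one message per cycle
            if vc and vc[0] not in blocked:
                sent.append(vc.popleft())
                break
    return sent


# Messages are labelled by destination; the path towards "N" is blocked.
traffic = ["N", "E", "E", "E"]
print(forward_single_queue(traffic, {"N"}, cycles=4))         # []
print(forward_virtual_channels(traffic, {"N"}, 4, num_vcs=2)) # ['E', 'E']
```

With two virtual channels the link stays busy even though one head is blocked; the message queued behind the blocked head in the same channel still waits.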
Dimension Order Routing
  Old WaveScalar Routing Protocol
  Network topology is a static grid
  Packets first travel to the correct x-coordinate and then to the correct y-coordinate
  Low network utilization from not using all available paths
  Not fault tolerant
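The x-then-y rule can be sketched in a few lines. The function name and the (x, y) tile coordinates are assumptions for illustration, not the WaveScalar implementation.

```python
def dimension_order_route(src, dst):
    """Return the tiles visited from src to dst, correcting x first, then y."""
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    while x != dst_x:                  # travel to the correct x-coordinate
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:                  # then to the correct y-coordinate
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


print(dimension_order_route((0, 0), (2, 1)))
# [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Every packet between the same pair of tiles takes the same path, which is why a single failed link on that path cannot be routed around and why the other minimal paths go unused.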
Adaptive Routing
  Progressively chooses longer routes instead of waiting for an unavailable resource
  High network utilization
  Fault tolerant
  Can cause deadlock
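A minimal sketch of the adaptive choice, assuming a grid of tiles and a set of neighbouring tiles that are currently unavailable (busy or faulty). The function name, the `unavailable` set, and the tie-breaking order are hypothetical.

```python
def next_hop(cur, dst, unavailable, width, height):
    """Pick an available neighbour, preferring one that gets closer to dst.

    Returns (tile, is_longer_route), or None if no neighbour is available.
    """
    def dist(tile):
        return abs(tile[0] - dst[0]) + abs(tile[1] - dst[1])

    x, y = cur
    neighbours = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    candidates = [
        t for t in neighbours
        if 0 <= t[0] < width and 0 <= t[1] < height and t not in unavailable
    ]
    if not candidates:
        return None
    best = min(candidates, key=dist)
    # If even the best neighbour is farther away, the packet takes a longer
    # route rather than waiting for the unavailable resource.
    return best, dist(best) > dist(cur)


print(next_hop((0, 0), (2, 0), set(), 3, 3))     # ((1, 0), False)
print(next_hop((0, 0), (2, 0), {(1, 0)}, 3, 3))  # ((0, 1), True)
```

Nothing in this sketch prevents packets from waiting on each other in a cycle, which is the deadlock risk the slide points out.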
Deadlock Free Adaptive Routing
  Some Virtual Channels are reserved for Dimension Order Routing; the rest are used for Adaptive Routing
  Every time a packet is routed in the wrong direction, its Dimension Reversal count is incremented
  No packet is allowed to wait in a virtual channel with a packet that has a lower Dimension Reversal count
  Mathematically proven to be deadlock free
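The dimension reversal rule can be sketched as a check applied before a packet is allowed to wait in a channel. The `Packet` class, the helper names, and the fallback to a channel reserved for dimension order routing when no adaptive channel qualifies are assumptions for illustration; the slides state only the rule itself.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    name: str
    reversals: int = 0  # Dimension Reversal count carried by the packet


def record_hop(packet, wrong_direction):
    """Increment the count each time the packet is routed the wrong way."""
    if wrong_direction:
        packet.reversals += 1


def may_wait(packet, occupants):
    """A packet may not wait behind any packet with a lower reversal count."""
    return all(other.reversals >= packet.reversals for other in occupants)


def choose_channel(packet, adaptive_channels):
    """Return the index of an adaptive channel the packet may wait in.

    Returns None when no adaptive channel qualifies; in this sketch that
    means "use a channel reserved for dimension order routing" (assumption).
    """
    for index, occupants in enumerate(adaptive_channels):
        if may_wait(packet, occupants):
            return index
    return None


p = Packet("p")
record_hop(p, wrong_direction=True)  # p.reversals is now 1
channels = [[Packet("a", 0)], [Packet("b", 2)]]
print(choose_channel(p, channels))               # 1
print(choose_channel(p, [[Packet("a", 0)]]))     # None
```

Because a packet only ever waits behind packets with an equal or higher count, the count gives an ordering on waiting relationships in this sketch.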
[Figure: Virtual Channels. Performance (Runtime) vs. number of virtual channels (0 to 16). Series: ocean, lu, fir, art, mcf, each for (G) and (T).]
[Figure: Queue Length (Adaptive Speedup). Performance (Runtime) vs. queue length (0 to 64). Series: ocean, lu, fir, art, mcf, each for (G) and (T).]
[Figure: Link Bandwidth (Adaptive Speedup). Performance (Runtime) vs. bandwidth (0 to 16). Series: ocean, lu, fir, art, mcf, each for (G) and (T).]
Conclusion
  Best performance per area with:
    2 Virtual Channels
    2 Links
    2-4 entries per queue
    Torus Topology
    Adaptive Routing
  Dataflow chip networks can be high-performance at reasonable area