Designing Regular Network-on-Chip Topologies under Technology,
Architecture and Software Constraints
F.Gilabert†, D.Ludovici §, S.Medardoni‡, D.Bertozzi‡, L.Benini††, G.N.Gaydadjiev§
‡University of Ferrara. ††University of Bologna. †Universidad Politecnica de Valencia. §Delft University of Technology.
Multi-dimension topologies
• 2D mesh frequently used for NoC design
– perfectly matches the 2D silicon surface
– high level of modularity
– controllability of electrical parameters
• But its average latency and resource consumption scale poorly with network size
• Topologies with more than 2 dimensions are attractive:
– higher bandwidth and lower average latency
– on-chip wiring is more cost-effective than off-chip wiring
• But physical design issues might impact their effectiveness and even feasibility (decreased operating frequency, higher link latency)
Objective
Explore the effectiveness and feasibility of multi-dimensional topologies under realistic technological constraints
1. Physical synthesis impact over performance
• Over-the-cell routing?
• Latency in injection links?
• Latency in express links?
• Which switch operating frequency?
• Regularity broken by asymmetric tile size or heterogeneous tiles!
Our approach
• Physical parameters from the physical synthesis are applied to system-level simulations
• Silicon-aware performance analysis
Objective
Explore the effectiveness and feasibility of multi-dimensional topologies under realistic architectural constraints
1. Physical synthesis impact over performance
2. Impact of chip I/O interface over topology performance
• May introduce an upper bound to the topology performance, affecting the performance differentiation between topologies
Our approach
• Chip I/O interface modeling
• Capture the implications of I/O performance on topology performance differentiation
Objective
Explore the effectiveness and feasibility of multi-dimensional topologies under software constraints: communication semantics of the middleware
1. Physical synthesis impact over performance
2. Impact of chip I/O interface over topology performance
3. Realistically capture traffic behavior
• Traffic pattern usually abstracted as an average link bandwidth utilization or as a synthetic traffic pattern
• May lead to highly inaccurate performance predictions (traffic peaks, different kinds of messaging, synchronization mismatches)
Our approach
• Project network traffic based on latest advances in MPSoC communication middleware
• Generate traffic patterns for the NoC "shaped" by the above communication middleware (e.g., synchronization, communication semantics)
Outline
• Backend synthesis flow
• Communication semantics
• Topologies under test
• Physical synthesis
• Layout-aware topology performance
• Conclusions
Backend synthesis flow (diagram)
• Topology specification, topology generation
• RTL SystemC/Verilog
• Simulation with OCP traffic generator and transactional simulator; VCD trace
• Physical synthesis: floorplan, placement, clock tree synthesis, power grid, routing, post-routing optimization
• Netlist, parasitic extraction
• PrimeTime: SDF (timing), power estimation
Tile Architecture
• Processor core – Connected through a Network Interface Initiator
• Local memory core – Connected through a Network Interface Target
• Two network interfaces can be used in parallel
(Figure: a tile containing a processor core, a memory core, a network interface initiator and a network interface target.)
Communication protocol
• Step 1: Producer checks local semaphores for pending messages for the destination
– If there are none, it writes data to the local tile memory and unblocks a semaphore at the consumer tile
– The producer is then free to carry out other tasks
• Step 2: Consumer detects the unblocked semaphore
– Requests the data from the producer
• Step 3: Consumer reads data from the producer
• Step 4: Consumer sends a notification upon completion
– This allows the producer to send another message to this consumer
(Figure: producer tile with local polling, write message and reset semaphore; consumer tile with local polling and read operation; steps 1 to 4 marked.)
• Message sent only when the consumer is ready to read it
• Only one outstanding message per producer-consumer pair
• Low network bandwidth utilization
• Tight latency constraints on the topology
Dalla Torre, A. et al., "MP-Queue: an Efficient Communication Library for Embedded Streaming Multimedia Platform", IEEE Workshop on Embedded Systems for Real-Time Multimedia, 2007.
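The four protocol steps can be sketched as a small Python model. This is only an illustration of the handshake described above, not the MP-Queue library itself: the `Tile` class, the function names and the dictionary-based semaphores are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Tile:
    """Hypothetical tile model: local memory plus one semaphore per peer."""
    name: str
    memory: dict = field(default_factory=dict)      # local tile memory
    semaphores: dict = field(default_factory=dict)  # peer name -> pending flag


def producer_send(producer: Tile, consumer: Tile, data) -> bool:
    """Step 1: send only if no message to this consumer is still pending."""
    if producer.semaphores.get(consumer.name, False):
        return False                                # previous message not consumed yet
    producer.memory[consumer.name] = data           # write into the producer's local memory
    producer.semaphores[consumer.name] = True       # remember the outstanding message
    consumer.semaphores[producer.name] = True       # unblock the semaphore at the consumer
    return True


def consumer_receive(consumer: Tile, producer: Tile):
    """Steps 2 to 4: detect the semaphore, read remotely, notify the producer."""
    if not consumer.semaphores.get(producer.name, False):
        return None                                 # step 2: nothing to read yet
    data = producer.memory.pop(consumer.name)       # step 3: read from the producer tile
    consumer.semaphores[producer.name] = False      # reset the local semaphore
    producer.semaphores[consumer.name] = False      # step 4: completion notification
    return data
```

The second call to `producer_send` for the same pair returns `False` until the consumer has read the first message, which is the "only one outstanding message per producer-consumer pair" property.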
Topologies Under Test – 16 tiles
                  4-ary 2-mesh        2-ary 4-mesh      2-ary 2-mesh
                  (2D Mesh)           (Hypercube)       (Concentrated)
                  Baseline topology   High bandwidth    Low latency
Switches          16                  16                4
Bis. Band.        4                   8                 2
Tiles x Switch    1                   1                 4
Switch Arity      6                   6                 10
Max. Hops         6                   4                 2
Topologies Under Test – 64 tiles
                  8-ary 2-mesh        2-ary 6-mesh      2-ary 4-mesh
                  (2D Mesh)           (Hypercube)       (Concentrated)
                  Baseline topology   High bandwidth    Low latency
Switches          64                  64                16
Bis. Band.        8                   32                8
Tiles x Switch    1                   1                 4
Switch Arity      6                   8                 12
Max. Hops         14                  6                 4
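The table entries for both system sizes can be reproduced from the k-ary n-mesh parameters. The sketch below is an inference from the tables, not a formula given in the slides: it assumes an even k, a bisection count of k^(n-1) links, and two switch ports per attached tile (one for the initiator and one for the target network interface). The function name `mesh_metrics` is hypothetical.

```python
def mesh_metrics(k: int, n: int, tiles_per_switch: int = 1,
                 ports_per_tile: int = 2) -> dict:
    """Structural metrics of a (possibly concentrated) k-ary n-mesh."""
    switches = k ** n
    # With k == 2 each switch has one neighbour per dimension;
    # otherwise an inner switch has two neighbours per dimension.
    neighbours = n if k == 2 else 2 * n
    return {
        "switches": switches,
        "tiles": switches * tiles_per_switch,
        "bisection": k ** (n - 1),              # links cut by a middle bisection (even k)
        "arity": neighbours + ports_per_tile * tiles_per_switch,
        "max_hops": n * (k - 1),                # corner-to-corner distance in links
    }


if __name__ == "__main__":
    for label, args in [("4-ary 2-mesh", (4, 2, 1)),
                        ("2-ary 4-mesh", (2, 4, 1)),
                        ("2-ary 2-mesh (concentrated)", (2, 2, 4))]:
        print(label, mesh_metrics(*args))
```

Under these assumptions the function matches every row of the 16-tile and 64-tile tables, including the arity of 10 and 12 for the concentrated topologies.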
Physical Synthesis
• Link latency and maximum frequency
– Performance, area and power
– Quantified by post-layout analysis
• For 16 tile systems
– Real physical parameter values were obtained
• For 64 tile systems
– Physical parameter values extrapolated based on the 16 tile results
– Synthesis time constraints
Physical Synthesis – 16 Tiles
• Network building blocks synthesized for maximum performance
• Timing path in network logic
– Ignore switch-to-switch links
• Critical paths are in the switches, never in the network interfaces
– Network speed closely reflects the maximum switch radix
                        4-ary 2-mesh   2-ary 4-mesh   2-ary 2-mesh
                        (2D Mesh)      (Hypercube)    (Concentrated)
Switch Arity            6              6              10
Post-synthesis freq.    1 GHz          1 GHz          850 MHz
Post-layout freq.       786 MHz        640 MHz        600 MHz
Core speed (max. 500)   393 MHz        320 MHz        300 MHz
Cell Area               949k μm²       1108k μm²      733k μm²
Physical Synthesis – 16 Tiles
• Inter-switch wiring reduces performance
• The connectivity pattern of the 2-ary 4-mesh results in a larger frequency drop than the 2D mesh
• The 2-ary 2-mesh pays for its lower number of switching resources with a larger switch-to-switch separation
– Severe degradation of network performance
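The frequency drop caused by inter-switch wiring can be quantified from the post-synthesis and post-layout frequencies in the 16-tile table. The following is a minimal sketch; the function and dictionary names are illustrative only.

```python
def frequency_drop(post_synthesis_mhz: float, post_layout_mhz: float) -> float:
    """Fraction of the post-synthesis frequency lost after place and route."""
    return 1.0 - post_layout_mhz / post_synthesis_mhz


# (post-synthesis MHz, post-layout MHz), taken from the 16-tile table
freq_16_tiles = {
    "4-ary 2-mesh (2D mesh)": (1000.0, 786.0),
    "2-ary 4-mesh (hypercube)": (1000.0, 640.0),
    "2-ary 2-mesh (concentrated)": (850.0, 600.0),
}

if __name__ == "__main__":
    for name, (synth, layout) in freq_16_tiles.items():
        print(f"{name}: {frequency_drop(synth, layout):.1%} drop")
```

With these values the drop is roughly 21% for the 2D mesh, 36% for the hypercube and 29% for the concentrated mesh, in line with the bullets above.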
Physical Synthesis – 16 Tiles
• Frequency-ratioed clock domain crossing in network interface
– Network speed affects core speed
• Maximum core speed of 500 MHz is assumed
• Post-layout speed drop
– Cores cannot sustain the network speed
– A divider of 2 is applied
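The core speeds in the table follow from dividing the post-layout network frequency by an integer ratio. A minimal sketch of that arithmetic is shown below; picking the smallest divider that respects the 500 MHz ceiling is an assumption made here for illustration (the slides only state that a divider of 2 is applied), and both function names are hypothetical.

```python
import math


def min_integer_divider(network_mhz: float, max_core_mhz: float = 500.0) -> int:
    """Smallest integer ratio that keeps the core at or below its maximum speed."""
    return max(1, math.ceil(network_mhz / max_core_mhz))


def core_speed(network_mhz: float, divider: int) -> float:
    """Core clock obtained from the network clock through an integer divider."""
    return network_mhz / divider


if __name__ == "__main__":
    for network_mhz in (786.0, 640.0, 600.0):
        d = min_integer_divider(network_mhz)
        print(network_mhz, "MHz network ->", core_speed(network_mhz, d), "MHz core")
```

For the three 16-tile topologies this yields a divider of 2 and core speeds of 393, 320 and 300 MHz, matching the table. A network running at exactly 500 MHz would need no division under this rule.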
Physical Synthesis – 16 Tiles
• 2-ary 4-mesh has a larger area footprint than the 2D mesh
• 2-ary 2-mesh reduces the number of switches
– Larger radix
– Area not halved
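The "area not halved" observation can be checked from the cell areas and switch counts of the 16-tile tables. This is a rough sketch only: it divides the total cell area by the number of switches, which assumes the reported area is dominated by switches (the slides do not break the area down), and the names used are illustrative.

```python
# (total cell area in thousands of square micrometres, number of switches)
area_16_tiles = {
    "4-ary 2-mesh": (949.0, 16),
    "2-ary 4-mesh": (1108.0, 16),
    "2-ary 2-mesh": (733.0, 4),
}


def area_per_switch(total_kum2: float, switches: int) -> float:
    """Average cell area attributed to each switch, in thousands of um^2."""
    return total_kum2 / switches


def area_ratio(topology: str, baseline: str = "4-ary 2-mesh") -> float:
    """Total cell area of a topology relative to the baseline 2D mesh."""
    return area_16_tiles[topology][0] / area_16_tiles[baseline][0]


if __name__ == "__main__":
    for name, (area, switches) in area_16_tiles.items():
        print(name, area_per_switch(area, switches), round(area_ratio(name), 2))
```

The concentrated mesh uses a quarter of the switches but still occupies about 77% of the 2D mesh area, because each radix-10 switch is roughly three times larger than a radix-6 one.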
Physical Synthesis – 64 tiles
• 64 tile hypercubes present very long links
– Switch-to-switch link delay impacts overall network speed
– Overall network speed unacceptably low for 64 tile systems
• Link pipelining becomes mandatory
– Makes it possible to sustain network speed even in the presence of long links
• Number of pipeline stages depends on the link length on the layout
(Figure: concentrated 2-ary 4-mesh.)
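The dependence of link latency on link length can be illustrated with a simple first-order model: a link needs as many clock cycles as its wire delay spans. Everything numeric below is hypothetical (the wire delay per millimetre and the link lengths are not taken from the slides), and the model ignores register setup time and repeater insertion.

```python
import math


def link_cycles(length_mm: float, delay_ns_per_mm: float, clock_mhz: float) -> int:
    """Clock cycles needed to traverse a link of the given length.

    A result of 1 means the link fits in a single cycle; larger values
    mean retiming (pipeline) stages must be inserted along the wire.
    """
    period_ns = 1000.0 / clock_mhz
    wire_delay_ns = length_mm * delay_ns_per_mm
    return max(1, math.ceil(wire_delay_ns / period_ns))


if __name__ == "__main__":
    # Hypothetical values: 0.5 ns/mm wire delay, 640 MHz network clock
    for length_mm in (2.0, 4.0, 8.0):
        print(length_mm, "mm ->", link_cycles(length_mm, 0.5, 640.0), "cycle(s)")
```

The qualitative behaviour is the point: longer links on the higher dimensions need more cycles.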
Physical Synthesis – 64 tiles
                        8-ary 2-mesh   2-ary 6-mesh   2-ary 4-mesh     Reduced
                        (2D Mesh)      (Hypercube)    (Concentrated)   2-ary 4-mesh
Switch Arity            6              8              12               12
Post-synthesis freq.    1 GHz          900 MHz        790 MHz          790 MHz
Post-layout freq.       786 MHz        640 MHz        500 MHz          500 MHz
Core speed (max. 500)   393 MHz        320 MHz        250 MHz          500 MHz
Cell Area               4461k μm²      7356k μm²      2610k μm²        2610k μm²
Latency on top dimensions
Dimension 3             -              1              1                1
Dimension 4             -              1              2                2
Dimension 5             -              2              -                -
Dimension 6             -              3              -                -
Physical Synthesis – 64 tiles
                        8-ary 2-mesh   2-ary 6-mesh   2-ary 6-mesh
                        (2D Mesh)      (Hypercube)    High-Speed
Switch Arity            6              8              8
Post-synthesis freq.    1 GHz          900 MHz        900 MHz
Post-layout freq.       786 MHz        640 MHz        786 MHz
Core speed (max. 500)   393 MHz        320 MHz        393 MHz
Cell Area               4461k μm²      7356k μm²      22784k μm²
Latency on top dimensions
Dimension 3             -              1              2
Dimension 4             -              1              2
Dimension 5             -              2              2
Dimension 6             -              3              3
Aggressive link pipelining
• 200% area overhead for 20% improvement in performance
• Not usable
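The cost of aggressive link pipelining can be checked against the table by comparing the high-speed 2-ary 6-mesh with the regular hypercube. The sketch below reads "improvement in performance" as the gain in post-layout frequency, which is an assumption; the helper name is illustrative.

```python
def relative_change(new: float, base: float) -> float:
    """Relative increase of `new` over `base` (0.2 means +20%)."""
    return new / base - 1.0


# Values from the 64-tile table: regular vs. high-speed 2-ary 6-mesh
area_overhead = relative_change(22784.0, 7356.0)   # cell area, thousands of um^2
frequency_gain = relative_change(786.0, 640.0)     # post-layout frequency, MHz

if __name__ == "__main__":
    print(f"area overhead:  {area_overhead:.0%}")
    print(f"frequency gain: {frequency_gain:.0%}")
```

The table values give an area overhead of about 210% for a frequency gain of about 23%, consistent with the rounded 200% and 20% quoted on the slide.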
Workload distribution
• Producer, worker and consumer tasks
• I/O devices dedicated to input OR output data
(Figure: workload distribution with external I/O.)
Topology performance
• 1 input and 1 output port to the external memory are assumed for 16 tile systems
• 4 input and 4 output ports to the external memory are assumed for 64 tile systems
• I/O ports are accessed through sidewall tiles
– The mapping of producer(s) and consumer(s) tasks is therefore constrained to these tiles
Topology performance
• Several I/O mapping strategies were considered
• For the sake of space, we only show the most significant one here
– OneSided: all the I/O tiles are placed on the same side of the chip
Topology performance - 16 tiles
• 2-ary 4-mesh reduces the total number of cycles by 27.4%
• 2-ary 2-mesh reduces cycles by only 1.6% over the hypercube
– Chip I/O becomes the bottleneck
• Real operating frequency of each topology changes conclusions
– Physical degradation is too severe to be compensated
• 2-ary 2-mesh shows superior energy saving properties
– 50% over the 2D mesh
Topology performance - 64 tiles
• 2D mesh outperforms the non-reduced hypercubes
• Systems under test are I/O constrained
– Computation tiles spend around 50% of their time waiting to send data to the consumer tile
– Upper bound to topology-related performance optimization
• Improvement in terms of execution cycles
– Performance improvements in cycles are not large enough to offset the lower operating speed
• Removal of the I/O bottleneck has to be considered mandatory to achieve performance differentiation between topologies
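The trade-off between fewer cycles and a slower clock can be made explicit: elapsed time is cycles divided by frequency, so a topology with a lower clock must cut cycles by at least the same fraction just to break even. The sketch below uses the post-layout frequencies from the 64-tile table; the cycle counts in the demo are hypothetical and are not results from the slides.

```python
def elapsed_us(cycles: float, freq_mhz: float) -> float:
    """Real elapsed time in microseconds for a given cycle count and clock."""
    return cycles / freq_mhz


def break_even_cycle_reduction(base_mhz: float, new_mhz: float) -> float:
    """Fraction of cycles a slower topology must save to match the baseline time."""
    return 1.0 - new_mhz / base_mhz


if __name__ == "__main__":
    # Post-layout frequencies from the table: 2D mesh 786 MHz, hypercube 640 MHz
    needed = break_even_cycle_reduction(786.0, 640.0)
    print(f"hypercube must save at least {needed:.1%} of the cycles")

    # Hypothetical cycle counts, for illustration only
    mesh_time = elapsed_us(1_000_000, 786.0)
    cube_time = elapsed_us(900_000, 640.0)   # 10% fewer cycles is not enough
    print(mesh_time < cube_time)
```

Because the cores run at half the network frequency in both cases, the same break-even fraction applies whether cycles are counted in the network or in the core clock domain.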
Topology performance - 64 tiles (reduced 2-ary 4-mesh)
• Network and tiles work at the same frequency
– Maximum frequency for all tiles: I/O tiles and processing tiles
• Very similar performance
– Reduced number of cycles
– Low network frequency
• Reduced hardware resources
– 4 times fewer switches, half the number of ports, and works at half the frequency
Conclusions
• Bottom-up approach to assess k-ary n-mesh topologies
• A number of real-life issues are considered:
– Physical constraints of nanoscale technologies
– Impact of I/O interface
– Communication semantics of the middleware
• The intricate wiring of multi-dimension topologies or the long links required by concentrated k-ary n-meshes can be turned into two different kinds of performance overhead by means of proper design techniques:
Conclusions
• Operating frequency reduction: in spite of a lower number of execution cycles, multi-dimension topologies lose in terms of real elapsed time (RET) due to a lower working frequency
– Concentrated topologies provide a way to trade performance for power/area
• Increase of link latency: the use of retiming stages makes it possible to sustain the operating frequency while increasing the network latency
– Area and power overhead have to be taken into account
– Link pipelining cannot deliver a frequency higher than the one allowed by the switch radix itself
• For 64 tile systems we found that, in general, the 2D mesh outperforms the hypercubes
– In spite of a lower number of execution cycles, the real elapsed time is worse because of a lower operating frequency
Conclusions
• Unexpected results for the reduced 2-ary 4-mesh:
– Expected: low cost, low performance solution
– Results: low cost with similar performance as the 2D mesh
– Increase in core speed helps reduce the impact: I/O tile congestion, processing tiles
• Possible solution to hypercube physical degradation issues:
– Decouple network speed from core speed (GALS)
• Other solutions:
– High performance, high radix switches