Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Designing Regular Network-on-Chip Topologies under Technology,

Architecture and Software Constraints

F.Gilabert†, D.Ludovici §, S.Medardoni‡, D.Bertozzi‡, L.Benini††, G.N.Gaydadjiev§

‡University of Ferrara.††University of Bologna.†Universidad Politecnica de Valencia.§Delft University of Technology,.

Multi-dimension topologies2D mesh frequently used for NoC design - perfectly matches 2D silicon surface- high level of modularity- controllability of electrical parameters

But its avg latency and resource consumption scale poorly with network size

Topology with more than 2 dimensions attractive: - higher bandwidth and lower avg latency- on-chip wiring more cost-effective than off-chip

But physical design issues might impact their effectiveness and even feasibility(decreased operating frequency)

(higher link latency)

ObjectiveExplore the effectiveness and feasibility of multi-dimensional topologies

Under realistic technological constraintsUnder realistic technological constraints

1. Physical synthesis impact over performance

Over-the-cell routing?

Latency in injection

links?

Latency in express

links?

Which switch

operating frequency?

Regularity broken by asymmetric

tile size or heterogeneous

tiles!

Our approachPhysical parameters from the physical synthesis are applied to system-level

simulationsSilicon-aware performance analysis


Under realistic architectural constraintsUnder realistic architectural constraints

Our approach•Chip I/O interface modeling•Capture the implications of I/O performance on topology performance differentiation

1. Physical synthesis impact over performance2. Impact of chip I/O interface over topology performance

May introduce an upper bound to the topology performance, affecting the performance differentiation between topologies


Software constraints: communication semantics of the middlewareSoftware constraints: communication semantics of the middleware

Traffic pattern usually abstracted as an average link

bandwidth utilization or as a synthetic

traffic pattern

May lead to highly inaccurate performance predictions(traffic peaks, different kinds of messaging, synchronization mismatches)

Our approach• Project network traffic based on latest advances in MPSoC communication middleware• Generate traffic patterns for the NoC “shaped” by the above communication middleware (e.g., synchronization, communication semantics)

1. Physical synthesis impact over performance2. Impact of chip I/O interface over topology performance3. Realistically capture traffic behavior

Backend synthesis flow Communication semantics Topologies under test Physical synthesis Layout-aware topology performance Conclusions

Topology generation

Topology specification

RTL SystemC/Verilog

Simulation

VCD Trace

Physical Synthesis

PlacementFloorplan

Clock Tree Synth., Power Grid, routing, post-routing opt

Netlist, Parasitic Extraction Prime time

SDF (timing)Prime timePower estimation

OCP Traffic Generator

Transactional Simulator

Backend synthesis flow

Communication semantics Topologies under test Physical synthesis Layout-aware topology performance Conclusions

Tile Architecture

• Processor core – Connected through a Network Interface Initiator

• Local memory core – Connected through a Network Interface Target

• Two network interfaces can be used in parallel

ProcessorCore

MemoryCore

Network IF Initiator

Network IF Target

Tile

Communication protocol• Step 1: Producer checks

local semaphores for pending messages for the destination• If not, it writes data to

the local tile memory and unblocks a semaphore at the consumer tile

• The producer is free to carry out other tasks

Local Polling

Producer Tile

WriteMessage

Reset Semaphore

Local Polling

ConsumerTile

ReadOperation

1

2

3

4

• Step 2: Consumer detects unblocked semaphore• Requests producer for

data

• Step 3: Consumer reads data from the producer

• Step 4: Consumer sends a notification upon completion– This allows the producer

to send another message to this consumer

• Message sent only when consumer is ready to read it • Only one outstanding message for a producer-consumer pair•Low network bandwidth utilization•Tight latency constraints on the topologyDalla Torre, A. et al., ”MP-Queue: an Efficient Communication Library for Embedded Streaming Multimedia Platform”, IEEE Workshop on Embedded Systems for Real-Time Multimedia, 2007.

Backend synthesis flow Communication semantics

Topologies under test Physical synthesis Layout-aware topology performance Conclusions

Topologies Under Test – 16 tiles4-ary 2-mesh

(2D Mesh)

Switches 16

Bis. Band. 4

Tiles x Switch 1

Switch Arity 6

Max. Hops 6

4-ary 2-meshBaseline Topology

TileSwitch


(2D Mesh)2-ary 4-mesh(Hypercube)

Switches 16 16

Bis. Band. 4 8

Tiles x Switch 1 1

Switch Arity 6 6

Max. Hops 6 4


2-ary 4-meshHigh Bandwith

TileSwitch

TileSwitch

Topologies Under Test – 16 tiles


2-ary 2-meshLow latency

4-ary 2-mesh(2D Mesh)

2-ary 4-mesh(Hypercube)

2-ary 2-mesh(Concentrated

)

Switches 16 16 4

Bis. Band. 4 8 2

Tiles x Switch 1 1 4

Switch Arity 6 6 10

Max. Hops 6 4 2Tile

Switch

TileSwitch


(2D Mesh)

Switches 64

Bis. Band. 8

Tiles x Switch 1

Switch Arity 8

Max. Hops 14




Switches 64 64

Bis. Band. 8 32

Tiles x Switch 1 1

Switch Arity 6 8

Max. Hops 14 6


2-ary 6-meshHigh Bandwith



2-ary 4-mesh(Concentrated)

Switches 64 64 16

Bis. Band. 8 32 8

Tiles x Switch 1 1 4

Switch Arity 6 8 12

Max. Hops 14 6 4


2-ary 4-meshLow Latency

Backend synthesis flow Communication semantics Topologies under test

Physical synthesis Layout-aware topology performance Conclusions

Physical Synthesis• Link latency and maximum frequency

– Performance, area and power – Quantified by post-layout analysis

• For 16 tile systems– Real physical parameter values were obtained

• For 64 tile systems– Physical parameter values extrapolated based on

16 tiles results– Synthesis time constraints

Physical Synthesis – 16 Tiles

• Network building blocks synthesized for maximum performance

• Timing path in network logic– Ignore switch-to-switch links.

• Critical paths are in the switches – never in the network interfaces– Network speed closely reflects the maximum switch radix




Switch Arity 6 6 10

Post-synthesis freq. 1 Ghz 1 Ghz 850 Mhz

Post-layout freq. 786 MHz 640 Mhz 600 Mhz

Core speed (max. 500) 393 MHz 320 Mhz 300 Mhz

Cell Area 949k μm2 1108k μm2 733k μm2


• Inter-switch wiring reduces performance • The connectivity pattern of 2-ary 4-mesh results into a

larger frequency drop than the 2D mesh• The 2-ary 2-mesh pays its lower number of switching

resources with a larger switch-to-switch separation– Severe degradation of network performance




Switch Arity 6 6 10






• Frequency-ratioed clock domain crossing in network interface– Network speed affects core speed.

• Maximum core speed of 500 MHz is assumed • Post-layout speed drop

– Cores cannot sustain the network speed – A divider of 2 is applied




Switch Arity 6 6 10






• 2-ary 4-mesh larger area footprint than the 2D mesh

• 2-ary 2-mesh reduces the number of switches– Larger radix– Area not halved




Switch Arity 6 6 10





Physical Synthesis – 64 tiles• 64 tile hypercubes present very long links– Switch-to-switch link delay impacts overall network speed– Overall network speed unacceptably low for 64 tiled

systems

• Link pipelining becomes mandatory– Allows to sustain network speed even in the presence of

long links

• Number of pipeline stages depends on the link length on the layout

Concentrated 2-ary 4-mesh

Physical Synthesis – 64 tiles

Physical Synthesis – 64 tiles8-ary 2-mesh



Switch Arity 6 8 12




Cell Area 4461kμm2 7356k μm2 2610k μm2

Latency on top dimensions

Dimension 3 - 1 1

Dimension 4 - 1 2

Dimension 5 - 2 -

Dimension 6 - 3 -




Reduced 2-ary 4-mesh

Switch Arity 6 8 12 12

Post-synthesis freq. 1 Ghz 900 Ghz 790 Mhz 790 Mhz

Post-layout freq. 786 MHz 640 Mhz 500 Mhz 500 Mhz

Core speed (max. 500) 393 MHz 320 Mhz 250 Mhz 500 Mhz

Cell Area 4461kμm2 7356k μm2 2610k μm2 2610k μm2


Dimension 3 - 1 1 1

Dimension 4 - 1 2 2

Dimension 5 - 2 - -

Dimension 6 - 3 - -



2-ary 6-meshHigh-Speed

Switch Arity 6 8 8

Post-synthesis freq. 1 Ghz 900 Ghz 900Mhz


Core speed (max. 500) 393 MHz 320 Mhz 393Mhz

Cell Area 4461kμm2 7356k μm2 22784k μm2



Dimension 3 - 1 2

Dimension 4 - 1 2

Dimension 5 - 2 2

Dimension 6 - 3 3

Aggressive link pipelining200% area overhead for 20% improvement in performanceNot usable

Backend synthesis flow Communication semantics Topologies under test Physical synthesis

Layout-aware topology performance Conclusions

Workload distribution

• Producer, worker and consumer tasks• I/O devices dedicated to input OR output data

External I/O

Topology performance• 1 Input and 1 Output ports to the external memory

are assumed for 16 tile systems• 4 Input and 4 Output ports to the external memory

are assumed for 64 tile systems• I/O ports are accessed through sidewall tiles

– The mapping of producer(s) and consumer(s) tasks is therefore constrained to these tiles

Topology performance• Several I/O mapping strategies were considered:• For sake of space, we only show here the most

significative– OneSided: all the I/O tiles are placed on the same side of

the chip.

Topology performance - 16 tiles

• 2-ary 4-mesh reduces total number of cycles by 27.4%• 2-ary 2-mesh reduces cycles only by 1.6% over the hypercube

– Chip I/O becomes the bottleneck• Real operating frequency of each topology changes conclusions

– Physical degradation is too severe to be compensated• 2-ary 2-mesh shows superior energy saving properties

– 50% over the 2D mesh


• 2D mesh outperforms the non-reduced hypercubes• Systems under test are I/O constrained

– Computation tiles spend around 50% of their time waiting to send data to the consumer tile– Upper bound to topology-related performance optimization

• Improvement in terms of execution cycles– Performance improvements in cycles are not such to offset the lower operating speed

Removal of the I/O bottleneck has to be considered as mandatory to achieve performance differentiation between topologies


• Network and tiles work at the same frequency– Maximum frequency for all tiles: I/O tiles and processing tiles.• Very similar performance– Reduced number of cycles– Low network frequency

• Reduced hardware resources – 4 times less switches, half the number of ports and works at half the

frequency

Backend synthesis flow Communication semantics Topologies under test Physical synthesis Layout-aware topology performance

Conclusions

Conclusions Bottom-up approach to assess k-ary n-mesh

topologies A number of real life issues are considered:

Physical constraints of nanoscale technologies Impact of I/O interface Communication semantics of the middleware

The intricate wiring of multi-dimension topologies or the long links required by concentrated k-ary n-meshes can be changed into 2 different kind of performance overhead by means of proper design techniques:

Conclusions

Operating frequency reduction: in spite of a lower number of execution cycles, multi-dimension topologies loose in terms of RET due to lower working frequency. Concentrated topologies provide a way to trade

performance for power/area

Increase of link latency: the utilization of retiming stages allows to sustain operating frequency while increasing the network latency. Area and power overhead have to be taken into account Link pipelining can not materialize a frequency

higher than the switch radix itself for 64 tile systems we found that in general, the 2D

mesh outperforms the hypercubes. In spite of a better execution cycles, the real elapsed

time is worst because of a lower operating frequency

Conclusions Unexpected results for the reduced 2-ary 4-mesh:

Expected: Low cost - Low performance solution Results: Low cost with similar performance as

2D mesh Increment in core speed allows to reduce the

impact: I/O tile congestion Processing tiles

Possible solution to hypercube physical degradation issues: Decouple network speed from core speed

(GALS) Other solutions:

High performance – high radix switches

Designing Regular Network-on-Chip Topologies under Technology,

Architecture and Software Constraints

F.Gilabert†, D.Ludovici §, S.Medardoni‡, D.Bertozzi‡, L.Benini††, G.N.Gaydadjiev§

‡University of Ferrara.††University of Bologna.†Universidad Politecnica de Valencia.§Delft University of Technology,.

Documents

Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,