40
Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert†, D.Ludovici §, S.Medardoni‡, D.Bertozzi‡, L.Benini††, G.N.Gaydadjiev§ ‡University of Ferrara. ††University of Bologna. †Universidad Politecnica de Valencia. §Delft University of Technology,.

Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Embed Size (px)

Citation preview

Page 1: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Designing Regular Network-on-Chip Topologies under Technology,

Architecture and Software Constraints

F.Gilabert†, D.Ludovici §, S.Medardoni‡, D.Bertozzi‡, L.Benini††, G.N.Gaydadjiev§

‡University of Ferrara.††University of Bologna.†Universidad Politecnica de Valencia.§Delft University of Technology,.

Page 2: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Multi-dimension topologies2D mesh frequently used for NoC design - perfectly matches 2D silicon surface- high level of modularity- controllability of electrical parameters

But its avg latency and resource consumption scale poorly with network size

Topology with more than 2 dimensions attractive: - higher bandwidth and lower avg latency- on-chip wiring more cost-effective than off-chip

But physical design issues might impact their effectiveness and even feasibility(decreased operating frequency)

(higher link latency)

Page 3: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

ObjectiveExplore the effectiveness and feasibility of multi-dimensional topologies

Under realistic technological constraintsUnder realistic technological constraints

1. Physical synthesis impact over performance

Over-the-cell routing?

Latency in injection

links?

Latency in express

links?

Which switch

operating frequency?

Regularity broken by asymmetric

tile size or heterogeneous

tiles!

Our approachPhysical parameters from the physical synthesis are applied to system-level

simulationsSilicon-aware performance analysis

Page 4: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

ObjectiveExplore the effectiveness and feasibility of multi-dimensional topologies

Under realistic architectural constraintsUnder realistic architectural constraints

Our approach•Chip I/O interface modeling•Capture the implications of I/O performance on topology performance differentiation

1. Physical synthesis impact over performance2. Impact of chip I/O interface over topology performance

May introduce an upper bound to the topology performance, affecting the performance differentiation between topologies

Page 5: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

ObjectiveExplore the effectiveness and feasibility of multi-dimensional topologies

Software constraints: communication semantics of the middlewareSoftware constraints: communication semantics of the middleware

Traffic pattern usually abstracted as an average link

bandwidth utilization or as a synthetic

traffic pattern

May lead to highly inaccurate performance predictions(traffic peaks, different kinds of messaging, synchronization mismatches)

Our approach• Project network traffic based on latest advances in MPSoC communication middleware• Generate traffic patterns for the NoC “shaped” by the above communication middleware (e.g., synchronization, communication semantics)

1. Physical synthesis impact over performance2. Impact of chip I/O interface over topology performance3. Realistically capture traffic behavior

Page 6: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Backend synthesis flow Communication semantics Topologies under test Physical synthesis Layout-aware topology performance Conclusions

Page 7: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Topology generation

Topology specification

RTL SystemC/Verilog

Simulation

VCD Trace

Physical Synthesis

PlacementFloorplan

Clock Tree Synth., Power Grid, routing, post-routing opt

Netlist, Parasitic Extraction Prime time

SDF (timing)Prime timePower estimation

OCP Traffic Generator

Transactional Simulator

Page 8: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Backend synthesis flow

Communication semantics Topologies under test Physical synthesis Layout-aware topology performance Conclusions

Page 9: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Tile Architecture

• Processor core – Connected through a Network Interface Initiator

• Local memory core – Connected through a Network Interface Target

• Two network interfaces can be used in parallel

ProcessorCore

MemoryCore

Network IF Initiator

Network IF Target

Tile

Page 10: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Communication protocol• Step 1: Producer checks

local semaphores for pending messages for the destination• If not, it writes data to

the local tile memory and unblocks a semaphore at the consumer tile

• The producer is free to carry out other tasks

Local Polling

Producer Tile

WriteMessage

Reset Semaphore

Local Polling

ConsumerTile

ReadOperation

1

2

3

4

• Step 2: Consumer detects unblocked semaphore• Requests producer for

data

• Step 3: Consumer reads data from the producer

• Step 4: Consumer sends a notification upon completion– This allows the producer

to send another message to this consumer

• Message sent only when consumer is ready to read it • Only one outstanding message for a producer-consumer pair•Low network bandwidth utilization•Tight latency constraints on the topologyDalla Torre, A. et al., ”MP-Queue: an Efficient Communication Library for Embedded Streaming Multimedia Platform”, IEEE Workshop on Embedded Systems for Real-Time Multimedia, 2007.

Page 11: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Backend synthesis flow Communication semantics

Topologies under test Physical synthesis Layout-aware topology performance Conclusions

Page 12: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Topologies Under Test – 16 tiles4-ary 2-mesh

(2D Mesh)

Switches 16

Bis. Band. 4

Tiles x Switch 1

Switch Arity 6

Max. Hops 6

4-ary 2-meshBaseline Topology

TileSwitch

Page 13: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Topologies Under Test – 16 tiles4-ary 2-mesh

(2D Mesh)2-ary 4-mesh(Hypercube)

Switches 16 16

Bis. Band. 4 8

Tiles x Switch 1 1

Switch Arity 6 6

Max. Hops 6 4

4-ary 2-meshBaseline Topology

2-ary 4-meshHigh Bandwith

TileSwitch

TileSwitch

Page 14: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Topologies Under Test – 16 tiles

4-ary 2-meshBaseline Topology

2-ary 2-meshLow latency

4-ary 2-mesh(2D Mesh)

2-ary 4-mesh(Hypercube)

2-ary 2-mesh(Concentrated

)

Switches 16 16 4

Bis. Band. 4 8 2

Tiles x Switch 1 1 4

Switch Arity 6 6 10

Max. Hops 6 4 2Tile

Switch

TileSwitch

Page 15: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Topologies Under Test – 64 tiles8-ary 2-mesh

(2D Mesh)

Switches 64

Bis. Band. 8

Tiles x Switch 1

Switch Arity 8

Max. Hops 14

8-ary 2-meshBaseline Topology

Page 16: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Topologies Under Test – 64 tiles8-ary 2-mesh

(2D Mesh)2-ary 6-mesh(Hypercube)

Switches 64 64

Bis. Band. 8 32

Tiles x Switch 1 1

Switch Arity 6 8

Max. Hops 14 6

8-ary 2-meshBaseline Topology

2-ary 6-meshHigh Bandwith

Page 17: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Topologies Under Test – 64 tiles8-ary 2-mesh

(2D Mesh)2-ary 6-mesh(Hypercube)

2-ary 4-mesh(Concentrated)

Switches 64 64 16

Bis. Band. 8 32 8

Tiles x Switch 1 1 4

Switch Arity 6 8 12

Max. Hops 14 6 4

8-ary 2-meshBaseline Topology

2-ary 4-meshLow Latency

Page 18: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Backend synthesis flow Communication semantics Topologies under test

Physical synthesis Layout-aware topology performance Conclusions

Page 19: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Physical Synthesis• Link latency and maximum frequency

– Performance, area and power – Quantified by post-layout analysis

• For 16 tile systems– Real physical parameter values were obtained

• For 64 tile systems– Physical parameter values extrapolated based on

16 tiles results– Synthesis time constraints

Page 20: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Physical Synthesis – 16 Tiles

• Network building blocks synthesized for maximum performance

• Timing path in network logic– Ignore switch-to-switch links.

• Critical paths are in the switches – never in the network interfaces– Network speed closely reflects the maximum switch radix

4-ary 2-mesh(2D Mesh)

2-ary 4-mesh(Hypercube)

2-ary 2-mesh(Concentrated)

Switch Arity 6 6 10

Post-synthesis freq. 1 Ghz 1 Ghz 850 Mhz

Post-layout freq. 786 MHz 640 Mhz 600 Mhz

Core speed (max. 500) 393 MHz 320 Mhz 300 Mhz

Cell Area 949k μm2 1108k μm2 733k μm2

Page 21: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Physical Synthesis – 16 Tiles

• Inter-switch wiring reduces performance • The connectivity pattern of 2-ary 4-mesh results into a

larger frequency drop than the 2D mesh• The 2-ary 2-mesh pays its lower number of switching

resources with a larger switch-to-switch separation– Severe degradation of network performance

4-ary 2-mesh(2D Mesh)

2-ary 4-mesh(Hypercube)

2-ary 2-mesh(Concentrated)

Switch Arity 6 6 10

Post-synthesis freq. 1 Ghz 1 Ghz 850 Mhz

Post-layout freq. 786 MHz 640 Mhz 600 Mhz

Core speed (max. 500) 393 MHz 320 Mhz 300 Mhz

Cell Area 949k μm2 1108k μm2 733k μm2

Page 22: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Physical Synthesis – 16 Tiles

• Frequency-ratioed clock domain crossing in network interface– Network speed affects core speed.

• Maximum core speed of 500 MHz is assumed • Post-layout speed drop

– Cores cannot sustain the network speed – A divider of 2 is applied

4-ary 2-mesh(2D Mesh)

2-ary 4-mesh(Hypercube)

2-ary 2-mesh(Concentrated)

Switch Arity 6 6 10

Post-synthesis freq. 1 Ghz 1 Ghz 850 Mhz

Post-layout freq. 786 MHz 640 Mhz 600 Mhz

Core speed (max. 500) 393 MHz 320 Mhz 300 Mhz

Cell Area 949k μm2 1108k μm2 733k μm2

Page 23: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Physical Synthesis – 16 Tiles

• 2-ary 4-mesh larger area footprint than the 2D mesh

• 2-ary 2-mesh reduces the number of switches– Larger radix– Area not halved

4-ary 2-mesh(2D Mesh)

2-ary 4-mesh(Hypercube)

2-ary 2-mesh(Concentrated)

Switch Arity 6 6 10

Post-synthesis freq. 1 Ghz 1 Ghz 850 Mhz

Post-layout freq. 786 MHz 640 Mhz 600 Mhz

Core speed (max. 500) 393 MHz 320 Mhz 300 Mhz

Cell Area 949k μm2 1108k μm2 733k μm2

Page 24: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Physical Synthesis – 64 tiles• 64 tile hypercubes present very long links– Switch-to-switch link delay impacts overall network speed– Overall network speed unacceptably low for 64 tiled

systems

• Link pipelining becomes mandatory– Allows to sustain network speed even in the presence of

long links

• Number of pipeline stages depends on the link length on the layout

Page 25: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Concentrated 2-ary 4-mesh

Physical Synthesis – 64 tiles

Page 26: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Physical Synthesis – 64 tiles8-ary 2-mesh

(2D Mesh)2-ary 6-mesh(Hypercube)

2-ary 4-mesh(Concentrated)

Switch Arity 6 8 12

Post-synthesis freq. 1 Ghz 900 Ghz 790 Mhz

Post-layout freq. 786 MHz 640 Mhz 500 Mhz

Core speed (max. 500) 393 MHz 320 Mhz 250 Mhz

Cell Area 4461kμm2 7356k μm2 2610k μm2

Latency on top dimensions

Dimension 3 - 1 1

Dimension 4 - 1 2

Dimension 5 - 2 -

Dimension 6 - 3 -

Page 27: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Physical Synthesis – 64 tiles8-ary 2-mesh

(2D Mesh)2-ary 6-mesh(Hypercube)

2-ary 4-mesh(Concentrated)

Reduced 2-ary 4-mesh

Switch Arity 6 8 12 12

Post-synthesis freq. 1 Ghz 900 Ghz 790 Mhz 790 Mhz

Post-layout freq. 786 MHz 640 Mhz 500 Mhz 500 Mhz

Core speed (max. 500) 393 MHz 320 Mhz 250 Mhz 500 Mhz

Cell Area 4461kμm2 7356k μm2 2610k μm2 2610k μm2

Latency on top dimensions

Dimension 3 - 1 1 1

Dimension 4 - 1 2 2

Dimension 5 - 2 - -

Dimension 6 - 3 - -

Page 28: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Physical Synthesis – 64 tiles8-ary 2-mesh

(2D Mesh)2-ary 6-mesh(Hypercube)

2-ary 6-meshHigh-Speed

Switch Arity 6 8 8

Post-synthesis freq. 1 Ghz 900 Ghz 900Mhz

Post-layout freq. 786 MHz 640 Mhz 786 Mhz

Core speed (max. 500) 393 MHz 320 Mhz 393Mhz

Cell Area 4461kμm2 7356k μm2 22784k μm2

Latency on top dimensions

Latency on top dimensions

Dimension 3 - 1 2

Dimension 4 - 1 2

Dimension 5 - 2 2

Dimension 6 - 3 3

Aggressive link pipelining200% area overhead for 20% improvement in performanceNot usable

Page 29: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Backend synthesis flow Communication semantics Topologies under test Physical synthesis

Layout-aware topology performance Conclusions

Page 30: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Workload distribution

• Producer, worker and consumer tasks• I/O devices dedicated to input OR output data

External I/O

Page 31: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Topology performance• 1 Input and 1 Output ports to the external memory

are assumed for 16 tile systems• 4 Input and 4 Output ports to the external memory

are assumed for 64 tile systems• I/O ports are accessed through sidewall tiles

– The mapping of producer(s) and consumer(s) tasks is therefore constrained to these tiles

Page 32: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Topology performance• Several I/O mapping strategies were considered:• For sake of space, we only show here the most

significative– OneSided: all the I/O tiles are placed on the same side of

the chip.

Page 33: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Topology performance - 16 tiles

• 2-ary 4-mesh reduces total number of cycles by 27.4%• 2-ary 2-mesh reduces cycles only by 1.6% over the hypercube

– Chip I/O becomes the bottleneck• Real operating frequency of each topology changes conclusions

– Physical degradation is too severe to be compensated• 2-ary 2-mesh shows superior energy saving properties

– 50% over the 2D mesh

Page 34: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Topology performance - 64 tiles

• 2D mesh outperforms the non-reduced hypercubes• Systems under test are I/O constrained

– Computation tiles spend around 50% of their time waiting to send data to the consumer tile– Upper bound to topology-related performance optimization

• Improvement in terms of execution cycles– Performance improvements in cycles are not such to offset the lower operating speed

Removal of the I/O bottleneck has to be considered as mandatory to achieve performance differentiation between topologies

Page 35: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Topology performance - 64 tiles

• Network and tiles work at the same frequency– Maximum frequency for all tiles: I/O tiles and processing tiles.• Very similar performance– Reduced number of cycles– Low network frequency

• Reduced hardware resources – 4 times less switches, half the number of ports and works at half the

frequency

Page 36: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Backend synthesis flow Communication semantics Topologies under test Physical synthesis Layout-aware topology performance

Conclusions

Page 37: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Conclusions Bottom-up approach to assess k-ary n-mesh

topologies A number of real life issues are considered:

Physical constraints of nanoscale technologies Impact of I/O interface Communication semantics of the middleware

The intricate wiring of multi-dimension topologies or the long links required by concentrated k-ary n-meshes can be changed into 2 different kind of performance overhead by means of proper design techniques:

Page 38: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Conclusions

Operating frequency reduction: in spite of a lower number of execution cycles, multi-dimension topologies loose in terms of RET due to lower working frequency. Concentrated topologies provide a way to trade

performance for power/area

Increase of link latency: the utilization of retiming stages allows to sustain operating frequency while increasing the network latency. Area and power overhead have to be taken into account Link pipelining can not materialize a frequency

higher than the switch radix itself for 64 tile systems we found that in general, the 2D

mesh outperforms the hypercubes. In spite of a better execution cycles, the real elapsed

time is worst because of a lower operating frequency

Page 39: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Conclusions Unexpected results for the reduced 2-ary 4-mesh:

Expected: Low cost - Low performance solution Results: Low cost with similar performance as

2D mesh Increment in core speed allows to reduce the

impact: I/O tile congestion Processing tiles

Possible solution to hypercube physical degradation issues: Decouple network speed from core speed

(GALS) Other solutions:

High performance – high radix switches

Page 40: Designing Regular Network-on-Chip Topologies under Technology, Architecture and Software Constraints F.Gilabert, D.Ludovici §, S.Medardoni, D.Bertozzi,

Designing Regular Network-on-Chip Topologies under Technology,

Architecture and Software Constraints

F.Gilabert†, D.Ludovici §, S.Medardoni‡, D.Bertozzi‡, L.Benini††, G.N.Gaydadjiev§

‡University of Ferrara.††University of Bologna.†Universidad Politecnica de Valencia.§Delft University of Technology,.