Networks on Chips (NoC) - SLIP onlinesliponline.org/SLIP07/presentations/4A_Kolodny.pdf · Networks...

Preview:

Citation preview

11

Networks on Chips (NoC) –

Keeping up with Rent’s Rule and Moore’s Law

Avi Kolodny

Technion – Israel Institute of Technology

International Workshop on System Level Interconnect Prediction (SLIP)

March 2007

22

Acknowledgements

Research colleagues Israel Cidon, Ran Ginosar, Idit Keidar,Uri Weiser

Students:Evgeny Bolotin,Reuven Dobkin,Roman Gindin,Zvika Guz,Tomer Morad, Arkadiy Morgenshtein,Zigi Walter

Sponsors

33

Keeping up with Moore’s law:Principles for dealing with complexity:

AbstractionHierarchyRegularityDesign Methodology

0.001

0.01

0.1

1

10

100

1000

1970 1980 1990 2000 2010 2020

Millions ofTransistors

per chip

Source: S. Borkar, Intel

100

1000

100 101 102 103 104

Terminalsper module

1

10

Transistorsper module

Rent’s exponent <

1

44

NoC = More Regularity and Higher AbstractionFrom: Dedicated signal wires To: Shared network

ComputingModule

Networkswitch

Networklink

Module

Module Module

Module Module

Module Module

Module

Module

Module

Module

Module

Similar to:- Road system- Telephone system

55

Module

Module Module

Module Module

Module Module

Module

Module

Module

Module

Module

NoC essentials

Communication by packets of bitsRouting of packets through several hops, via switchesParallelism Efficient sharing of wires

point-to-point link

Switch (a.k.a. Router)

66

Origins of the NoC concept

The idea was talked about in the 90’s,but actual research came in the new Millenium.Some well-known early publications:

Guerrier and Greiner (2000) – “A generic architecture for on-chip packet-switched interconnections”Hemani et al. (2000) – “Network on chip: An architecture for billion transistor era”Dally and Towles (2001)– “Route packets, not wires: on-chip interconnection networks”Wingard (2001)– “MicroNetwork-based integration of SoCs”Rijpkema, Goossens and Wielage (2001)– “A router architecture for networks on silicon”Kumar et al. (2002) – “A Network on chip architecture and design methodology”De Micheli and Benini (2002) – “Networks on chip: A new paradigm for systems on chip design”

77

From buses to networks

Shared Bus

B

B

Segmented Bus

Original bus features:• One transaction at a time• Central Arbiter• Limited bandwidth• Synchronous• Low cost

88

Original bus features:• One transaction at a time• Central Arbiter• Limited bandwidth• Synchronous• Low cost

Advanced bus

Multi-Level Segmented

BusB

B

Segmented Bus

New features:• Versatile bus architectures• Pipelining capability• Burst transfer • Split transactions• Overlapped arbitration • Transaction preemption and resumption • Transaction reordering…

B

B

99

Evolution or Paradigm Shift?

Computingmodule

Networkrouter

Networklink

Architectural paradigm shift Replace the wire spaghetti by a network

Usage paradigm shift Pack everything in packets

Organizational paradigm shift Confiscate communications from logic designersCreate a new discipline, a new infrastructure responsibilityuAlready done for power grid, clock grid, …

Bus

1010

Past examples of paradigm shifts in VLSI

Logic SynthesisFrom: Schematic entry To: HDLs and Cell libraries

Logic designers became programmersEnabled ASIC industry and Fab-less companies “System-on-Chip”

The MicroprocessorFrom: Hard-wired state machines To: Programmable chips

Created a new computer industry

1111

Characteristics of a paradigm shift

Solves a critical problem (or several problems)

Step-up in abstraction

Design is affected:Design becomes more restricted

New tools

The changes enable higher complexity and capacity

Jump in design productivity

Initially: skepticism. Finally: change of mindset!

successful

Let’s look at the problems

addressed by NoC

1212

Critical problems addressed by NoC

3) Chip Multi Processors(key to power-efficient computing)

1) Global interconnect design problem:delay, power, noise, scalability, reliability

2) System integrationproductivity problem

Module

Module Module

Module Module

Module Module

Module

Module

Module

Module

Module

1313

1(a): NoC and Global wire delay

Long wire delay is dominated by Resistance

Add repeaters

Repeaters become latches (with clock frequency scaling)

NoCrouter

NoCrouter

NoCrouter

Latches evolve to NoC routers

Source: W. Dally

1414

1(b): Wire Design for NoC

NoC links:RegularPoint-to-point (no fanout tree)Can use transmission-line layoutWell-defined current return path

Can be optimized for noise / speed / powerLow swing, current mode, ….

h

wg

wSIGNAL

GROUND

t

tg

h

wg

wSIGNAL

GROUND

t

tg

ws

t

d

ws

t

wg

tgGROUND

hS S

d

1515

1(c): NoC Scalability

For Same Performance, compare the wire-area cost of:

n

n

dd

n

n

dd

NoC:n

n

dd

Simple Bus:

SegmentedBus:

Point-to-

Point:

( )3O n n

( )2O n n

( )O n

( )2O n nE. Bolotin at al. , “Cost Considerations in Network on Chip”, Integration, special issue on Network on Chip, October 2004

1616

1(d): NoC and communication reliability

Fault tolerance and error correction

A. Morgenshtein, E. Bolotin, I. Cidon, A. Kolodny, R. Ginosar, “Micro-modem – reliability solution for NOC communications”, ICECS 2004

1717

1(e): NoC and GALS

System modules may use different clocksMay use different voltages

NoC can take care of synchronizationNoC design may be asynchronous

No waste of power when the links and routers are idle

1818

2: NoC and engineering productivity

NoC eliminates ad-hoc global wire engineering

NoC separates computation from communication

NoC supports modularity and reuse of cores

NoC is a platform for system integration, debugging and testing

1919

3: NoC and CMP

Inter-connect

Gate

Diff.

Uniprocessordynamic power

(Magen et al., SLIP 2004)

Die Area (or Power)

Uniprocessor Performance

“Pollack’s rule”

(F. Pollack. Micro 32, 1999)

Uniprocessors cannot providePower-efficient performance growth

Interconnect dominates dynamic powerGlobal wire delay doesn’t scaleInstruction-level parallelism is limited

Power-efficiency requires many parallel local computations

Chip Multi Processors (CMP)Thread-Level Parallelism (TLP)

Network is a natural choice for CMP!

2020

Why Now is the time for NoC?

Module

Module Module

Module Module

Module Module

Module

Module

Module

Module

Module

Difficulty of DSM wire design

Productivity pressure

CMPs

2121

Characteristics of a paradigm shift

Solves a critical problem (or several problems)

Step-up in abstraction

Design is affected:Design becomes more restricted

New tools

The changes enable higher complexity and capacity

Jump in design productivity

Initially: skepticism. Finally: change of mindset!

successful

Now, let’s look at the Abstraction

provided by NoC

2222

Traffic model abstraction

Traffic model may be captured from actual traces of functional simulationA statistical distribution is often assumed for messages

12nsec

15nsec

12nsec

5nsec

22nsec

12nsec

22nsec

15nsec

22nsec

15nsec

5nsec

12nsec

5nsec

Latency

3Kb

2Kb

1.5Kb

5Kb

1Kb

1Kb

5Kb

3Kb

1Kb

2Kb

2Kb

3Kb

1Kb

Packet Size

1.5Mb/s10 11

300Kb/s9 11

50Kb/s8 11

1.5Mb/s7 11

300Kb/s6 11

50Kb/s5 11

1.5Mb/s4 11

300Kb/s3 11

50Kb/s2 11

50Kb/s1 11

200Kb/s7 10

1.5Mb/s4 5

500Kb/s1 4

BWFlow

PE11

PE8 PE9 PE10

PE1 PE2 PE3 PE4 PE5

PE6 PE7

2323

Data abstraction

Message

PacketHeader Payload

Flit(Flow control digit)

Type

Dest.

Phit(Physical unit)

VC Type

Body

VC Type

Tail

VC

2424

Layers of Abstraction in Network ModelingSoftware layers

O/S, applicationNetwork and transport layers

Network topologySwitchingAddressingRoutingQuality of ServiceCongestion control, end-to-end flow control

Data link layerFlow control (handshake)Handling of contentionCorrection of transmission errors

Physical layerWires, drivers, receivers, repeaters, signaling, circuits,..

e.g. crossbar, ring, mesh, torus, fat tree,…Circuit / packet switching: SAF, VCT, wormhole

e.g. guaranteed-throughput, best-effort

Logical/physical, source/destination, flow, transactioStatic/dynamic, distributed/source, deadlock avoidance

Let’s skip a tutorial here,

and look at an example

2525

Architectural choices depend on system needs

A large design space for NoCs!

Flexibility

Reconfigurationrate

single application

Generalpurpose computer

at design time

at boot time

during run time

ASIC

CMPASSP

FPGA

I. Cidon and K. Goossens, in “Networks on Chips” , G. De Micheli and L. Benini, Morgan Kaufmann, 2006

2626

Example: QNoCTechnion’s Quality-of-service NoC architecture

Application-Specific system (ASIC) assumed~10 to 100 IP coresTraffic requirements are known a-priori

Overall approachPacket switchingBest effort(“statistical guarantee”)Quality of Service(priorities)

Module

Module

Module

Module

Module

Module

Module

Module

Module

Module

ModuleModule Module Module Module

ModuleModule Module Module Module

ModuleModule Module Module Module

R

R

R

R

R R

R

R

R

RR R R R

RR R R R

RR R R R

R

* E. Bolotin, I. Cidon, R. Ginosar and A. Kolodny., “QNoC: QoS architecture and design process for Network on Chip”, JSA special issue on NoC, 2004.

2727

Choice of generic network topology

2828

Topology customization

Irregular meshAddress = coordinates in the basic grid

2929

Message routing path

Fixed shortest-path routing (X-Y)Simple Router No deadlock scenarioNo retransmissionNo reordering of messagesPower-efficient

3030

IP1

Inte

rface

IP2Interface

Small number of buffersLow latency

Wormhole Switching

3131

IP3Interface

IP2

Inte

rface

IP1

Inte

rface

The “hot module” IP1 is not a local problem. Traffic destined elsewheresuffers too!

The Green packetexperiences a long delay even though it does NOT share any link with IP1 traffic

Blocking issue

3232

Statistical network delay

Some packets get more delay than others, because of blocking

% of packets

Time

3333

Average delay depends on load

3434

Quality-of-Service in QNoCMultiple priority (service) levels

Define latency / throughputExample: u Signalingu Real Time Streamu Read-Writeu DMA Block TransferPreemptive

Best effort performanceE.g. 0.01% arrivelater then required

N

T

* E. Bolotin, I. Cidon, R. Ginosar and A. Kolodny., “QNoC: QoS architecture and design process for Network on Chip”, JSA special issue on NOC, 2004.

3535

Router structure

• Flits stored in input ports• Output port schedules

transmission of pending flits according to:• Priority (Service Level)• Buffer space in next router• Round-Robin on input ports

of same SL• Preempt lower priority

packets

Router

Module

Moduleor

another router

3636

Virtual Channels

3737

QNoC router with multipleVirtual Channels

×5×5

3838

Simulation ModelOPNET Models for QNoCAny topology and traffic loadStatistical or trace-based traffic generation at source nodes

3939

Simulation ResultsFlit-accurate simulations

Delay of high-priority service levels

is not affected by load

4040

Perspective 1: NoC vs. Bus

Aggregate bandwidth growsLink speed unaffected by N Concurrent spatial reusePipelining is built-inDistributed arbitrationSeparate abstraction layers

However:No performance guaranteeExtra delay in routersArea and power overhead?Modules need network interfaceUnfamiliar methodology

Bandwidth is limited, sharedSpeed goes down as N growsNo concurrency Pipelining is toughCentral arbitrationNo layers of abstraction(communication and computation are coupled)

However:Fairly simple and familiar

BusNoC

4141

Perspective 2: NoC vs. Off-chip Networks

Sensitive to cost:area power

Wires are relatively cheapLatency is criticalTraffic may be known a-prioriDesign time specializationCustom NoCs are possible

Cost is in the links

Latency is tolerableTraffic/applications unknownChanges at runtimeAdherence to networking standards

Off-Chip NetworksNoC

4242

NoC can provide system services Example: Distributed CMP cache

* E.Bolotin, Z. Guz, I.Cidon, R. Ginosar and A. Kolodny, “The Power of Priority: NoC based Distributed Cache Coherency”, NoCs 2007.

4343

Characteristics of a paradigm shift

Solves a critical problem (or several problems)

Step-up in abstraction

Design is affected:Design becomes more restricted

New tools

The changes enable higher complexity and capacity

Jump in design productivity

Initially: skepticism. Finally: change of mindset!

successful

OK, what’s the impact of NoC

on chip design?

4444

VLSI CAD problemsApplication mapping

Floorplanning / placement

Routing

Buffer sizing

Timing closure

Simulation

Testing

4545

VLSI CAD problems reframed for NoCApplication mapping (map tasks to cores)Floorplanning / placement (within the network)Routing (of messages)Buffer sizing (size of FIFO queues in the routers)Timing closure (Link bandwidth capacity allocation)Simulation (Network simulation, traffic/delay/power modeling)Testing

… combined with problems of designing the NoC itself(topology synthesis, switching, virtual channels, arbitration,flow control,……)

Let’s see a NoC-based

design flow example

4646

QNoC-based SoC design flow

PlaceModules

Adjustlink

capacities

Inter-module Traffic model

Determinerouting

Trim routers / ports /

links

4747

Routing on Irregular Mesh

Goal: Minimize the total size of routing tables required in the switches

Around the Block

Dead End

E. Bolotin, I. Cidon, R. Ginosar and A. Kolodny, "Routing Table Minimization for Irregular Mesh NoCs", DATE 2007.

4848

Routing Heuristics for Irregular Mesh

Routing Cost Reduction in Real Aplications

1

10

100

1000

MPEG4 VOPD

Log

( Rou

ting

Cos

t )

DR

TT

XYDT

SR

SRDP

Routing Cost in 12x12 NoC (many holes, high hotspost probability)

0

10,000

20,000

30,000

40,000

50,000

60,000

70,000

80,000

90,000

10 30 50Hotspot Number

Rou

ting

Cos

t [g

ates

]

DRXYDTSRSRDP

Distributed Routing (full tables)X-Y Routing with Deviation TablesSource RoutingSource Routing for Deviation Points

Systems with real applications

Random problem instances

4949

Timing closure in NoC

Module

Module

Module

Module

Module

Module

Module

Module

Module Module Module

Module

Module

Module

R

R

R R R

R

RR R R R

R R

R

R

R

R

Module

Module

Module

Module

Module

Module

Module

Module

Module

Module Module Module

Module

Module

Module

R

R

R R R

R

RR R R R

R R

R

R

R

R

Module

Define inter-module traffic

Place modules

Increase link capacities

Too low capacity results in poor QoSToo high capacity wastes power/areaUniform link capacities are a waste in application-specific systems!

Module

Module

Module

Module

Module

Module

Module

Module

Module Module Module

Module

Module

Module

R

R

R R R

R

RR R R R

R R

R

R

R

R

Module

Module

Module

Module

Module

Module

Module

Module

Module

Module Module Module

Module

Module

Module

R

R

R R R

R

RR R R R

R R

R

R

R

R

Module

QoSSatisfied?

5050

Network Delay Modeling

Analysis of mean packet delay in wormhole networkMultiple Virtual-ChannelsDifferent link capacitiesDifferent communicationdemands

| f

ij f f

jf j f i

ltC l m

π

λ∈ ∧ ≠

=− ⋅ ⋅∑

Flit interleaving delay approximation:

11 22 ( )

ii network

iinetwork

TQ

= −⋅ −

Queuing delay:

* I. Walter, Z. Guz, I. Cidon, R. Ginosar and A. Kolodny, “Efficient Link Capacity and QoS Design for Wormhole Network-on-Chip,” DATE 2006.

5151

Capacity Allocation ProblemGiven:

system topology and routingEach flow’s bandwidth (fi ) and delay bound (Ti

REQ)Minimize total link capacity e

e E

C∈

| ( ): i

ei e path i

link e f C∈

∀ <∑ : i i

REQflow i T T∀ ≤

Such that:

5252

State of the art:NoC is already here!

> 50 different NoC architecture proposals in the literature;2 books; hundreds of papers since 2000Companies use (try) it

Freescale, Philips, ST, Infineon, IBM, Intel, …

Companies sell itSonics (USA), Arteris (France), Silistix (UK), …

1st IEEE Conference: NOCS 2007102 papers submitted

5353

NoC research community

Academe and industryVLSI / CAD peopleComputer system architectsInterconnect expertsAsynchronous circuit expertsNetworking/Telecomm experts

5454

Possible impact:Expect new forms of Rent’s Rule?

View interconnection as transmission of messages over virtual wires(through the NoC)

Model system interconnections among blocks in terms of required bandwidth and timing

Dependence on NoC topologyDependence on the S/W application (in a CMP)Usage for prediction of hop-lengths, router design, ….

* D. Greenfield, A. Banerjee, J. Lee and S. Moore, “Implications of Rent's Rule for NoC Design and Its Fault-Tolerance”, NoCS 2007 (to appear)

5555

Summary

NoC is a scalable platform for billion-transistor chipsSeveral driving forces behind itMany open research questionsMay change the way we structure and model VLSI systems

Module

Module Module

Module Module

Module Module

Module

Module

Module

Module

Module

Recommended