Communication Modeling for System-Level Design

Andrew B. Kahng#,*

Kambiz Samadi*

CSE# and ECE* Departments, UCSD

November 24, 2008


ISSOC-2008 2

Outline
• Motivation
• Communication Synthesis for Network-on-Chip
• Network-on-Chip Architecture Modeling
    • Buffered Interconnect Model
    • Router Power and Area Model
• Bus Architecture Modeling
• Conclusions


Motivation
• Focus of design process is shifting from "computation" to "communication"
• Device and interconnect performance scaling mismatches cause breakdown of traditional across-chip communication
• System-level designers require accurate, yet simple models to bridge planning and implementation stages
• Today's system-level performance and power modeling suffers from:
    • Ad hoc selection of models
    • Poor balance between accuracy and simplicity
    • Lack of model extensibility across future technology nodes
• Due to design performance / power constraints, early-stage design exploration has become crucial
• Our Goal: Develop accurate models that are easily usable by system-level design early in the design cycle


Communication Synthesis for Network-on-Chip
• Given
    • An input specification as a set of communication constraints
    • A library of communication components
    • An objective function (e.g., power, area, delay)
• Find
    • A network-on-chip implementation as a composition of library components that
        • Satisfies the specification
        • Minimizes the cost function
• Communication Synthesis Infrastructure (COSI)
    • Based on the Platform-Based Design methodology
    • Takes specification and library descriptions in XML format
    • Produces a variety of outputs, including a cycle-accurate SystemC implementation of the optimal network-on-chip
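The Given / Find formulation can be illustrated with a deliberately tiny sketch. It is not COSI's algorithm: it assumes each point-to-point flow is implemented by a single library link and simply picks the cheapest feasible one, whereas real synthesis also shares routers and links across flows. The `LinkType` and `Flow` classes, their fields, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkType:
    """Hypothetical library component."""
    name: str
    max_bandwidth_gbps: float  # capacity offered by the component
    max_length_mm: float       # longest distance it can span
    power_mw: float            # cost charged by the objective function


@dataclass(frozen=True)
class Flow:
    """Hypothetical point-to-point communication constraint."""
    src: str
    dst: str
    bandwidth_gbps: float      # required throughput
    distance_mm: float         # source-to-destination distance


def synthesize(flows, library):
    """Pick, per flow, the lowest-power component that satisfies its constraints."""
    plan = {}
    for flow in flows:
        feasible = [c for c in library
                    if c.max_bandwidth_gbps >= flow.bandwidth_gbps
                    and c.max_length_mm >= flow.distance_mm]
        if not feasible:
            raise ValueError(f"no component satisfies {flow.src}->{flow.dst}")
        plan[(flow.src, flow.dst)] = min(feasible, key=lambda c: c.power_mw)
    return plan


library = [
    LinkType("narrow", 4.0, 2.0, 1.0),
    LinkType("wide", 16.0, 2.0, 3.5),
    LinkType("wide_buffered", 16.0, 6.0, 5.0),
]
flows = [
    Flow("cpu", "mem", 8.0, 1.5),
    Flow("cpu", "dsp", 2.0, 1.0),
    Flow("dsp", "mem", 2.0, 4.0),
]

plan = synthesize(flows, library)
total_power_mw = sum(c.power_mw for c in plan.values())
for (src, dst), comp in plan.items():
    print(f"{src}->{dst}: {comp.name}")
print(f"total power: {total_power_mw} mW")
```

Even in this toy form, the quality of the result depends entirely on the cost numbers attached to the library, which is why the accuracy of the component models matters.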


[Figure: Constraint-Driven Communication Synthesis. Labels: Application, Implementation, Constraints Propagation, Point-to-Point Specification, On-Chip Communication Library, Perf. / Cost Abstractions, Synthesis, Synthesis Result.]


Buffered Interconnect Model
Components
• Repeater delay model
    • Separate models for intrinsic delay, output slew, input capacitance
• Wire delay model
    • Accounts for coupling capacitance impact on wire delay
• Repeater power model
    • Accounts for sub-threshold and gate leakages
• Repeater area model
    • Derived from existing cell layouts (can be extrapolated)
• Wire area model
    • Derived from wire width and spacing (can be extrapolated)

[Figure: Technology parameter extraction flows. Device labels: ITRS, PTM, MASTAR, SPICE Sim., .lib (Automatic Extraction), Min. Inverter (Rd, Cin, Ioff, t_intrinsic). Interconnect labels: ITRS Interconnect Chapter, LEF/ITF (Automatic Extraction), T, H_ILD, W_min, S_min, ε_ILD, TIERS (L, I, SG, G): Local, Intermediate, Semi-global, Global.]

• Inputs for repeater delay calculation
    • Delay and slew values for a set of input slew and load capacitance values (obtained from Liberty / SPICE)
    • Input capacitance for different repeater sizes (Liberty, PTM)
• Inputs for wire delay calculation
    • Wire dimensions (ITRS/PTM, LEF, ITF)
    • Inter-wire spacings for global and intermediate layers (ITRS/PTM, LEF, ITF)
• Inputs for power calculation
    • Input capacitance (Liberty, PTM)
    • Wire parasitics (computed in wire delay calculation)


Repeater and Wire Models
• Repeater delay: delay = i(slew_in) + r(slew_in) * C_L
    • r(slew_in) = f(size, slew_in)
    • slew_out = f(slew_in, C_L)
• Wire delay: Elmore model
• Repeater area and power are linear in repeater size
• Predictions extend down to 16nm
• Delay model is within 15% of PrimeTime

[Figures: Intrinsic Delay Model, i(slew_in); Drive Resistance Model, r(slew_in)]
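As a rough illustration of how the repeater and wire models combine for one buffered stage, the sketch below evaluates delay = i(slew_in) + r(slew_in) * C_L and adds an Elmore wire delay. The linear fits for i and r, the coefficient values, and the per-mm wire parasitics are assumptions made for the example; the slides give only the functional form, and the fits here stand for a single repeater size. The per-mm wire capacitance is assumed to lump ground and coupling capacitance.

```python
def repeater_delay_ps(slew_in_ps, c_load_ff, i_fit, r_fit):
    """delay = i(slew_in) + r(slew_in) * C_L, with linear fits assumed for i and r."""
    i0_ps, i1 = i_fit       # intrinsic delay fit: i0 + i1 * slew_in (ps)
    r0_kohm, r1 = r_fit     # drive resistance fit: r0 + r1 * slew_in (kOhm)
    intrinsic_ps = i0_ps + i1 * slew_in_ps
    drive_res_kohm = r0_kohm + r1 * slew_in_ps
    return intrinsic_ps + drive_res_kohm * c_load_ff  # kOhm * fF = ps


def elmore_wire_delay_ps(r_kohm_per_mm, c_ff_per_mm, length_mm, c_sink_ff):
    """Elmore delay of a distributed RC wire terminated by a sink capacitance."""
    r_wire = r_kohm_per_mm * length_mm
    c_wire = c_ff_per_mm * length_mm
    return r_wire * (0.5 * c_wire + c_sink_ff)


def stage_delay_ps(slew_in_ps, length_mm, c_sink_ff, i_fit, r_fit,
                   r_kohm_per_mm, c_ff_per_mm):
    """One repeater driving one wire segment into the next repeater's input."""
    c_wire = c_ff_per_mm * length_mm
    # The repeater sees the whole wire capacitance plus the sink capacitance.
    repeater = repeater_delay_ps(slew_in_ps, c_wire + c_sink_ff, i_fit, r_fit)
    wire = elmore_wire_delay_ps(r_kohm_per_mm, c_ff_per_mm, length_mm, c_sink_ff)
    return repeater + wire


# Hypothetical coefficients and parasitics, chosen only to make the arithmetic easy.
total_ps = stage_delay_ps(
    slew_in_ps=50.0, length_mm=1.0, c_sink_ff=5.0,
    i_fit=(10.0, 0.1), r_fit=(1.0, 0.002),
    r_kohm_per_mm=0.2, c_ff_per_mm=200.0,
)
print(f"stage delay: {total_ps:.1f} ps")
```

In a real flow the fit coefficients would come from the Liberty or SPICE characterization listed among the model inputs, one set per repeater size.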


Impact on System-Level Design
• Testcases
    • VPROC: video processor with 42 cores and 128-bit datawidth
    • dVOPD: dual video object plane decoder with 26 cores and 128-bit datawidth
• Original model (Orig.) underestimates power compared to the proposed model (Prop.)
• Original model is very optimistic in delay
    • This becomes more critical as technology scales and the chip size becomes larger

SoC    Tech.  Dynamic Power (mW)  Leakage Power (mW)  Total Area (mm2)  Avg # hops      Max # hops
              Orig.    Prop.      Orig.    Prop.      Orig.    Prop.    Orig.   Prop.   Orig.   Prop.
VPROC  90nm   117.3    364.8      38.1     99.6       0.37     0.346    3.09    3.01    4       5
VPROC  65nm   51.1     179.9      69.9     86.7       0.217    0.223    3.1     3.42    4       6
dVOPD  90nm   63.4     88         14.2     32.5       0.141    0.162    1.76    1.76    3       3
dVOPD  65nm   27.3     73.2       25.7     33.2       0.082    0.085    1.76    1.91    3       4


ORION2.0: Accurate NoC Router Models

[Figure: ORION2.0 model inputs and outputs (ORION2.0 – NEW!)]
• Inputs
    • circuit implementation & buffering scheme
        • SRAM and register FIFO
        • MUX-tree and matrix crossbar
        • different arbitration schemes
        • hybrid buffering scheme
    • architectural parameters
        • # of ports; # of buffers
        • # of xbar ports; # of VCs
        • voltage, frequency
    • technology parameters
        • interconnect parameters
        • device parameters
        • scaling factors for future technologies
        • …
• Modeled components: FIFO, arbiter, crossbar, clock, link
• Outputs: leakage power, dynamic power, area

• Built on top of ORION1.0
• Provides previously missing power subcomponents
• Provides significant accuracy improvement vs. ORION1.0
• Uses our automatic flows to obtain technology inputs
• To appear in DATE-2009 (A. B. Kahng, B. Li, L.-S. Peh, and K. Samadi)


Validation and Significance Assessment
• Validation: Two Intel NoC Chips
    • (1) Intel 80-core Teraflops, and (2) Intel SCC
    • ORION2.0 offers significant accuracy improvement

                      Intel 80-core      Intel SCC
                      v1.0     v2.0      v1.0      v2.0
%diff (total power)   -85.3    -19.4     +202.4    +20.47
%diff (total area)    -80.9    -23.6     +31.87    +26.37

• System-level Impact: COSI-OCC
    • ORION2.0 models lead to better-performing NoC: (1) fewer hops, and (2) fewer routers
    • Relative power due to an additional port is not as high in ORION2.0 as in ORION1.0

SoC     P (mW)          A (mm2)         # routers     max. # router ports   max. # hops
        v1.0    v2.0    v1.0    v2.0    v1.0   v2.0   v1.0   v2.0           v1.0   v2.0
VPROC   0.875   0.924   2.043   2.329   33     25     8      12             6      5
dVOPD   0.412   0.486   1.217   1.343   18     16     6      6              11     10


AMBA Models
• Signal Bus Modeling:
    • System-level interconnect model (described earlier)
• Logic Modeling (multiplexers, decoders, and arbiter):
    • Block latency based on gate delay model (cf. Carloni et al. ASPDAC'08)
    • Dynamic power is computed after measuring the switching capacitance
    • Leakage power is computed from average device leakages
    • Area is computed from cell areas of logic gates


AMBA Modeling and Bus vs. NoC Study
• Delay, power, area models within 11% of physical implementation
• Functional forms verified against physical implementation of AMBA-AHB controller
• Bus vs. NoC study enables design space explorations of heterogeneous communication fabrics

[Figure: AMBA model inputs and outputs]
• Inputs
    • technology & design style
        • min. width, spacing, thickness
        • dielectric thickness, constant
        • device drive res, cap, leakage
        • width/spacing, buffering scheme
    • floorplan
        • location of all masters, slaves
        • bit widths of all masters, slaves
        • optionally, locations of arbiter, decoder, and multiplexers
    • transaction
        • read and write
        • length
        • address progression
• Outputs: delay, leakage power, dynamic power, area


Conclusions and Future Directions
• Accurate models can drive effective system-level exploration
• Reproducible methodology for extracting inputs to models
• Modeling at different levels of abstraction
    • protocol encapsulation (e.g., hand-shaking for AMBA bus allocation)
    • buses, pipelined rings (e.g., EIB in IBM Cell)
    • routers, network interfaces
    • FIFOs, queues, crossbar switches (ORION2.0)
• Extending to other technologies
    • 3D IC integration (e.g., TSV modeling, multi-layer router modeling)


Backup Slides


Communication Synthesis Key Elements
• Specification of input constraints
    • Set of IP cores: area and interface
    • End-to-end communication requirements between pairs of IP cores: latency and throughput
• Characterization of library of components
    • Interface types, max number of ports
    • Max capacities: bandwidth, latency, max distance
    • Performance and cost model
• Component instantiation and parallel composition
    • Rename, set parameters of library components
    • Composition based on algebra on quantities (including type compatibility)


Communication Synthesis Example
• Synthesis of optimal network-on-chip
    • Return valid composition that meets input constraints and
    • Minimizes the objective function (e.g., power dissipation)

[Figure: Original Specification, Platform Instance 1, Platform Instance 2]


COSI: Communication Synthesis Infrastructure
• COSI is a public-domain software package for NoC synthesis
• http://embedded.eecs.berkeley.edu/cosi/


Dynamic and Leakage Power Models
• Leakage Power: Subthreshold and Gate

$$I_{leak}(Block) = \sum_{i} \sum_{s} Prob(i,s) \cdot \left( W_{sub}(i,s) \cdot I'_{sub}(i,s) + W_{gate}(i,s) \cdot I'_{gate}(i,s) \right)$$

    • From 65nm and beyond, gate leakage becomes significant
    • $I'_{sub}(i,s)$ and $I'_{gate}(i,s)$ are subthreshold and gate leakage currents per unit transistor width for a specific technology
    • $W_{sub}(i,s)$ and $W_{gate}(i,s)$ are the effective widths of component i at input state s for subthreshold and gate leakage, respectively
    • Key circuit components: INVx1, NAND2x1, NOR2x1, and DFF
• Dynamic Power: Switching Capacitance
    • Clock power: $P_{clk} = C_{clk} V_{dd}^2 f$, with $C_{clk} = C_{sram\text{-}fifo} + C_{pipeline\text{-}registers} + C_{register\text{-}fifo} + C_{wiring}$
    • Physical links: due to charging and discharging of capacitive load, $P_d = C_{load} V_{dd}^2 f$, with $C_{load} = C_{ground} + C_{coupling} + C_{input}$
    • Register-based FIFO: implemented as shift registers
    • Other components: we use ORION 1.0 models
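The leakage and switching-power expressions on this slide are straightforward to evaluate once the per-state widths and capacitances are known. The following is a minimal sketch, assuming the (component, input state) pairs are supplied as a flat list of tuples; the function names, the tuple layout, and every numeric value are hypothetical and are not taken from ORION2.0.

```python
def block_leakage_current(states, i_sub_per_um, i_gate_per_um):
    """Sum Prob * (W_sub * I'_sub + W_gate * I'_gate) over (component, state) pairs.

    states: iterable of (prob, w_sub_um, w_gate_um) tuples, one per pair.
    i_sub_per_um, i_gate_per_um: leakage currents per unit transistor width.
    """
    return sum(prob * (w_sub * i_sub_per_um + w_gate * i_gate_per_um)
               for prob, w_sub, w_gate in states)


def switching_power(c_farads, vdd_volts, freq_hz):
    """P = C * Vdd^2 * f, the form used for both clock and link power on the slide."""
    return c_farads * vdd_volts ** 2 * freq_hz


# Hypothetical two-state component; widths in um, currents in nA/um.
states = [(0.5, 1.0, 2.0), (0.5, 3.0, 4.0)]
i_leak_na = block_leakage_current(states, i_sub_per_um=10.0, i_gate_per_um=1.0)

# Clock capacitance as the sum of its four contributions (values are placeholders).
c_clk = sum([0.4e-12, 0.3e-12, 0.2e-12, 0.1e-12])
p_clk_w = switching_power(c_clk, vdd_volts=1.0, freq_hz=1e9)

print(f"leakage current: {i_leak_na} nA")
print(f"clock power: {p_clk_w * 1e3:.3f} mW")
```

Multiplying the leakage current by the supply voltage gives leakage power; the slide's dynamic-power form carries no explicit activity factor, so the sketch omits one as well.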


Area Model
• As the number of cores increases, the area occupied by communication components becomes significant (19% of total tile area in the Intel 80-core Teraflops chip)
• Gate area model by Yoshida et al. (DAC'04)
• Link area model by Carloni et al. (ASPDAC'08)
• We model FIFO, crossbar switch, and arbiter areas using the adopted gate area model
• Matrix arbiter:

$$Area_{arbiter} = Area_{NOR2x1} \cdot 2(R-1)R + Area_{DFF} \cdot \frac{R(R-1)}{2} + Area_{INVx1} \cdot R$$
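The matrix arbiter area expression reduces to counting gates and weighting them by cell area. A small sketch follows, assuming R is the number of requesters and using placeholder cell areas (the slides do not give values); the function name and return layout are illustrative only.

```python
def matrix_arbiter_area(num_requesters, area_nor2x1, area_dff, area_invx1):
    """Evaluate the matrix arbiter area formula; R is assumed to be the requester count."""
    r = num_requesters
    nor_count = 2 * (r - 1) * r      # NOR2x1 gates
    dff_count = r * (r - 1) // 2     # DFFs; r * (r - 1) is always even
    inv_count = r                    # INVx1 gates
    area = (area_nor2x1 * nor_count
            + area_dff * dff_count
            + area_invx1 * inv_count)
    return area, (nor_count, dff_count, inv_count)


# Placeholder cell areas in arbitrary units, not taken from any cell library.
area, counts = matrix_arbiter_area(5, area_nor2x1=1.0, area_dff=4.0, area_invx1=0.5)
print(f"gate counts (NOR2, DFF, INV): {counts}")
print(f"arbiter area: {area}")
```

The NOR2x1 and DFF counts both grow quadratically with R, so arbiter area rises quickly as ports are added.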
