Communication Modeling for System-Level Design
Andrew B. Kahng#,*
Kambiz Samadi*
CSE# and ECE* Departments, UCSD
November 24, 2008
ISSOC-2008 2
Outline
• Motivation
• Communication Synthesis for Network-on-Chip
• Network-on-Chip Architecture Modeling
  • Buffered Interconnect Model
  • Router Power and Area Model
• Bus Architecture Modeling
• Conclusions
Motivation
• Focus of the design process is shifting from "computation" to "communication"
• Device and interconnect performance scaling mismatches cause breakdown of traditional across-chip communication
• System-level designers require accurate, yet simple models to bridge planning and implementation stages
• Today's system-level performance and power modeling suffers from:
  • Ad hoc selection of models
  • Poor balance between accuracy and simplicity
  • Lack of model extensibility across future technology nodes
• Due to design performance / power constraints, early-stage design exploration has become crucial
• Our Goal: Develop accurate models that are easily usable by system-level designers early in the design cycle
Communication Synthesis for Network-on-Chip
• Given
  • An input specification as a set of communication constraints
  • A library of communication components
  • An objective function (e.g., power, area, delay)
• Find
  • A network-on-chip implementation as a composition of library components that
    • Satisfies the specification
    • Minimizes the cost function
• Communication Synthesis Infrastructure (COSI)
  • Based on the Platform-Based Design methodology
  • Takes specification and library descriptions in XML format
  • Produces a variety of outputs, including a cycle-accurate SystemC implementation of the optimal network-on-chip
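The Given / Find formulation above can be illustrated with a toy example. This is only a minimal sketch under stated assumptions, not COSI's algorithm or input format: the `Flow` and `Component` classes, their fields, and the brute-force search are all hypothetical, and each flow is simply mapped to one library component with power as the objective.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Flow:
    """One point-to-point communication constraint (hypothetical fields)."""
    src: str
    dst: str
    bandwidth: float   # required bandwidth, arbitrary units
    max_latency: int   # latency bound, in cycles


@dataclass(frozen=True)
class Component:
    """One library component with its performance and cost (hypothetical)."""
    name: str
    capacity: float    # bandwidth it can sustain, same units as Flow.bandwidth
    latency: int       # latency it adds, in cycles
    power: float       # cost term used by the objective function


def synthesize(flows, library):
    """Return (names, cost) of the cheapest feasible assignment, or (None, inf)."""
    best_names, best_cost = None, float("inf")
    # Enumerate every way of assigning one library component to each flow.
    for choice in product(library, repeat=len(flows)):
        feasible = all(
            c.capacity >= f.bandwidth and c.latency <= f.max_latency
            for f, c in zip(flows, choice)
        )
        cost = sum(c.power for c in choice)
        if feasible and cost < best_cost:
            best_names, best_cost = [c.name for c in choice], cost
    return best_names, best_cost


flows = [Flow("cpu", "mem", 2.0, 3), Flow("dsp", "mem", 0.5, 10)]
library = [Component("slow_link", 1.0, 8, 1.0), Component("fast_link", 4.0, 2, 3.0)]
names, cost = synthesize(flows, library)
print(names, cost)
```

Exhaustive enumeration is used here only because it makes the "satisfy the specification, minimize the cost function" structure explicit; it would not scale to realistic specifications.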
Constraint-Driven Communication Synthesis
[Figure: constraint-driven communication synthesis flow. Labels: Application, Implementation, Constraints Propagation, Point-to-Point Specification, On-Chip Communication Library, Perf. / Cost Abstractions, Synthesis, Synthesis Result]
Buffered Interconnect Model
Components
• Repeater delay model
  • Separate models for intrinsic delay, output slew, input capacitance
• Wire delay model
  • Accounts for coupling capacitance impact on wire delay
• Repeater power model
  • Accounts for sub-threshold and gate leakages
• Repeater area model
  • Derived from existing cell layouts (can be extrapolated)
• Wire area model
  • Derived from wire width and spacing (can be extrapolated)
[Figure: Technology parameter extraction flows. Device labels: ITRS, PTM, MASTAR, SPICE Sim., Min. Inverter (Rd, Cin, Ioff, t_intrinsic). Interconnect labels: ITRS Interconnect Chapter, PTM, T, H_ILD, W_min, S_min, ε_ILD, TIERS (L, I, SG, G) = Local, Intermediate, Semi-global, Global. Automatic Extraction from LEF/ITF and .lib]
• Inputs for repeater delay calculation
  • Delay and slew values for a set of input slew and load capacitance values (obtained from Liberty / SPICE)
  • Input capacitance for different repeater sizes (Liberty, PTM)
• Inputs for wire delay calculation
  • Wire dimensions (ITRS/PTM, LEF, ITF)
  • Inter-wire spacings for global and intermediate layers (ITRS/PTM, LEF, ITF)
• Inputs for power calculation
  • Input capacitance (Liberty, PTM)
  • Wire parasitics (computed in wire delay calculation)
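Repeater and Wire Models
• Repeater delay: $\mathrm{delay} = i(\mathrm{slew}_{in}) + r(\mathrm{slew}_{in}) \cdot C_L$
• Drive resistance: $r(\mathrm{slew}_{in}) = f(\mathrm{size}, \mathrm{slew}_{in})$
• Output slew: $\mathrm{slew}_{out} = f(\mathrm{slew}_{in}, C_L)$
• Wire delay: Elmore delay model
[Figure: fitted model plots. Panels: Intrinsic Delay Model, i(slew_in); Drive Res Model, r(slew_in)]
• Repeater area and power are linear in repeater size
• Predictions extend down to 16nm
• Delay model is within 15% of PrimeTime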
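The delay expressions on this slide can be sketched numerically. This is a rough illustration under stated assumptions: the slides give only the functional forms, so the linear dependence of i and r on input slew, the coefficient values, and the function names are all hypothetical, and the wire term uses the textbook Elmore delay of a distributed RC line driving a lumped load.

```python
import math


def repeater_delay(slew_in, c_load, i_coeffs, r_coeffs):
    """delay = i(slew_in) + r(slew_in) * C_L.

    i and r are taken as linear in slew_in (an assumption made for this
    sketch); i_coeffs and r_coeffs are (offset, slope) pairs.
    """
    i0, i1 = i_coeffs
    r0, r1 = r_coeffs
    intrinsic = i0 + i1 * slew_in   # i(slew_in), in seconds
    drive_res = r0 + r1 * slew_in   # r(slew_in), in ohms
    return intrinsic + drive_res * c_load


def elmore_wire_delay(r_wire, c_wire, c_load):
    """Elmore delay of a distributed RC wire driving a lumped load."""
    return r_wire * (0.5 * c_wire + c_load)


# Hypothetical numbers: 20 ps input slew, 50 fF load, made-up coefficients.
d_rep = repeater_delay(20e-12, 50e-15, (10e-12, 0.1), (1000.0, 0.0))
# Hypothetical wire: 100 ohm, 200 fF total capacitance, 10 fF receiver load.
d_wire = elmore_wire_delay(100.0, 2e-13, 1e-14)
print(d_rep, d_wire)
```

In a real flow the coefficients would come from fitting Liberty or SPICE data, as listed in the model inputs.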
Impact on System-Level Design
• Testcases
  • VPROC: video processor with 42 cores and 128-bit datawidth
  • dVOPD: dual video object plane decoder with 26 cores and 128-bit datawidth
• Original model (Orig.) underestimates power compared to the Proposed Model (Prop.)
• Original Model is very optimistic in delay
  • becomes more critical as technology scales and the chip size becomes larger

SoC    Node   Dynamic Power (mW)   Leakage Power (mW)   Total Area (mm2)   Avg # hops       Max # hops
              Orig.     Prop.      Orig.     Prop.      Orig.    Prop.     Orig.    Prop.   Orig.   Prop.
VPROC  90nm   117.3     364.8      38.1      99.6       0.37     0.346     3.09     3.01    4       5
VPROC  65nm   51.1      179.9      69.9      86.7       0.217    0.223     3.1      3.42    4       6
dVOPD  90nm   63.4      88         14.2      32.5       0.141    0.162     1.76     1.76    3       3
dVOPD  65nm   27.3      73.2       25.7      33.2       0.082    0.085     1.76     1.91    3       4
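To put a number on how far the original model underestimates power, the dynamic power figures reported for the two testcases can be divided directly. The snippet below is a small worked calculation; only the helper name `underestimation_factor` is introduced here, and the values are the Orig. and Prop. dynamic power entries from the slide.

```python
# Dynamic power (mW) per testcase and node: (original model, proposed model).
dynamic_power_mw = {
    ("VPROC", "90nm"): (117.3, 364.8),
    ("VPROC", "65nm"): (51.1, 179.9),
    ("dVOPD", "90nm"): (63.4, 88.0),
    ("dVOPD", "65nm"): (27.3, 73.2),
}


def underestimation_factor(orig, prop):
    """Ratio of proposed-model power to original-model power."""
    return prop / orig


factors = {
    key: underestimation_factor(orig, prop)
    for key, (orig, prop) in dynamic_power_mw.items()
}
for (soc, node), factor in factors.items():
    print(f"{soc} {node}: proposed / original = {factor:.2f}x")
```

In every case the ratio is above 1, which is the underestimation the slide refers to.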
ORION2.0: Accurate NoC Router Models
[Figure: ORION2.0 (new). Component labels: FIFO, Arbiter, Crossbar, Clock, Link. Output labels: Leakage, Dynamic, Area. Input groups:]
• circuit implementation & buffering scheme
  • SRAM and register FIFO
  • MUX-tree and Matrix crossbar
  • different arbitration scheme
  • hybrid buffering scheme
• architectural parameters
  • # of ports; # of buffers
  • # of xbar ports; # of VC
  • voltage, frequency
• technology parameters
  • interconnect parameters
  • device parameters
  • scaling factors for future technologies
  • …
ORION2.0
• Built on top of ORION1.0
• Provides previously missing power subcomponents
• Provides significant accuracy improvement vs. ORION1.0
• Uses our automatic flows to obtain technology inputs
• To appear in DATE-2009 (A. B. Kahng, B. Li, L.-S. Peh, and K. Samadi)
Validation and Significance Assessment
• Validation: Two Intel NoC Chips
  • (1) Intel 80-core Teraflops, and (2) Intel SCC
  • ORION2.0 offers significant accuracy improvement

                     Intel 80-core        Intel SCC
                     v1.0      v2.0       v1.0      v2.0
%diff (total power)  -85.3     -19.4      +202.4    +20.47
%diff (total area)   -80.9     -23.6      +31.87    +26.37

• System-level Impact: COSI-OCC
  • ORION2.0 models lead to better-performing NoC: (1) fewer hops, and (2) fewer routers
  • Relative power due to an additional port is not as high in ORION2.0 as in ORION1.0

SoC     P (mW)           A (mm2)          # routers        max. # router ports   max. # hops
        v1.0    v2.0     v1.0    v2.0     v1.0    v2.0     v1.0    v2.0          v1.0    v2.0
VPROC   0.875   0.924    2.043   2.329    33      25       8       12            6       5
dVOPD   0.412   0.486    1.217   1.343    18      16       6       6             11      10
AMBA Models
• Signal Bus Modeling:
  • system-level interconnect model (described earlier)
• Logic Modeling (multiplexers, decoders, and arbiter):
  • Block latency based on gate delay model (cf. Carloni et al. ASPDAC'08)
  • Dynamic power is computed after measuring the switching capacitance
  • Leakage power is computed from average device leakages
  • Area is computed from cell areas of logic gates
AMBA Modeling and Bus vs. NoC Study
• Delay, power, area models within 11% of physical implementation
• Functional forms verified against physical implementation of AMBA-AHB controller
• Bus vs. NoC study enables design space explorations of heterogeneous communication fabrics
[Figure: AMBA Model. Output labels: Delay, Leakage, Dynamic, Area. Input groups:]
• technology & design style
  • min. width, spacing, thickness
  • dielectric thickness, constant
  • device drive res, cap, leakage
  • width/spacing, buffering scheme
• floorplan
  • location of all masters, slaves
  • bit widths of all masters, slaves
  • optionally, locations of arbiter, decoder, and multiplexers
• transaction
  • read and write
  • length
  • address progression
Conclusions and Future Directions
• Accurate models can drive effective system-level exploration
• Reproducible methodology for extracting inputs to models
• Modeling at different levels of abstraction
  • protocol encapsulation (e.g., hand-shaking for AMBA bus allocation)
  • buses, pipelined rings (e.g., EIB in IBM Cell)
  • routers, network interfaces
  • FIFOs, queues, crossbar switches (ORION2.0)
• Extending to other technologies
  • 3D IC integration (e.g., TSV modeling, multi-layer router modeling)
Backup Slides
Communication Synthesis Key Elements
• Specification of input constraints
  • Set of IP cores: area and interface
  • End-to-end communication requirements between pairs of IP cores: latency and throughput
• Characterization of library of components
  • Interface types, max number of ports
  • Max capacities: bandwidth, latency, max distance
  • Performance and cost model
• Component instantiation and parallel composition
  • Rename, set parameters of library components
  • Composition based on algebra on quantities (including type compatibility)
Communication Synthesis Example
• Synthesis of optimal network-on-chip
  • Return valid composition that meets input constraints and
  • Minimizes the objective function (e.g., power dissipation)
[Figure: (Original Specification), Platform Instance 1, Platform Instance 2]
COSI: Communication Synthesis Infrastructure
• COSI is a public-domain software package for NoC synthesis
• http://embedded.eecs.berkeley.edu/cosi/
Dynamic and Leakage Power Models
$$I_{leak}(\mathrm{Block}) = \sum_i \sum_s \mathrm{Prob}(i,s) \cdot \left( W_{sub}(i,s) \cdot I'_{sub}(i,s) + W_{gate}(i,s) \cdot I'_{gate}(i,s) \right)$$
• Leakage Power: Subthreshold and Gate
  • From 65nm and beyond gate leakage becomes significant
  • $I'_{sub}(i,s)$ and $I'_{gate}(i,s)$ are subthreshold and gate leakage currents per unit transistor width for a specific technology
  • $W_{sub}(i,s)$ and $W_{gate}(i,s)$ are the effective widths of component i at input state s for subthreshold and gate leakage, respectively
  • Key circuit components: INVx1, NAND2x1, NOR2x1, and DFF
• Dynamic Power: Switching Capacitance
  • Clock power: $P_{clk} = C_{clk} V_{dd}^2 f$
    • $C_{clk} = C_{sram\text{-}fifo} + C_{pipeline\text{-}registers} + C_{register\text{-}fifo} + C_{wiring}$
  • Physical Links: due to charging and discharging of capacitive load
    • $P_d = C_{load} V_{dd}^2 f$; $C_{load} = C_{ground} + C_{coupling} + C_{input}$
  • Register-based FIFO: implemented as shift registers
  • Other components: we use ORION 1.0 models
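The leakage and dynamic power expressions on this slide translate almost directly into code. The following is a minimal sketch, not ORION code: the `LeakageState` class, the function names, and all numeric values are hypothetical and chosen only to exercise the formulas.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LeakageState:
    """One input state s of a component i (all values hypothetical)."""
    prob: float     # Prob(i, s): probability of being in this state
    w_sub: float    # W_sub(i, s): effective width for subthreshold leakage
    w_gate: float   # W_gate(i, s): effective width for gate leakage
    i_sub: float    # I'_sub(i, s): subthreshold current per unit width
    i_gate: float   # I'_gate(i, s): gate leakage current per unit width


def block_leakage_current(components):
    """Sum Prob * (W_sub * I'_sub + W_gate * I'_gate) over components and states."""
    return sum(
        s.prob * (s.w_sub * s.i_sub + s.w_gate * s.i_gate)
        for states in components.values()
        for s in states
    )


def clock_capacitance(c_sram_fifo, c_pipeline_registers, c_register_fifo, c_wiring):
    """C_clk as the sum of the four contributions listed on the slide."""
    return c_sram_fifo + c_pipeline_registers + c_register_fifo + c_wiring


def dynamic_power(capacitance, vdd, freq):
    """P = C * Vdd^2 * f, used for both clock and link power."""
    return capacitance * vdd ** 2 * freq


# A block holding a single inverter with two equally likely input states.
block = {
    "INVx1": [
        LeakageState(0.5, 1.0, 2.0, 10e-9, 1e-9),
        LeakageState(0.5, 2.0, 1.0, 10e-9, 1e-9),
    ]
}
i_leak = block_leakage_current(block)
p_clk = dynamic_power(clock_capacitance(4e-13, 3e-13, 2e-13, 1e-13), 1.0, 1e9)
print(i_leak, p_clk)
```

Multiplying the leakage current by the supply voltage would give leakage power; that step is left out because the slide states the current only.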
Area Model
• As number of cores increases, the area occupied by communication components becomes significant (19% of total tile area in the Intel 80-core Teraflops Chip)
• Gate area model by Yoshida et al. (DAC'04)
• Link area model by Carloni et al. (ASPDAC'08)
• We model FIFO, crossbar switch, and arbiter areas using the adopted gate area model
Matrix Arbiter
$$\mathrm{Area}_{arbiter} = \mathrm{Area}_{NOR2x1} \cdot 2(R-1)R + \mathrm{Area}_{DFF} \cdot \frac{R(R-1)}{2} + \mathrm{Area}_{INVx1} \cdot R$$
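The matrix arbiter area expression can be evaluated for a given arbiter size. This is a small sketch under stated assumptions: R is taken to be the number of requesters (the slide does not define it), and the cell areas are made-up placeholders, not values from any library.

```python
def matrix_arbiter_area(r, area_nor2, area_dff, area_inv):
    """Area of a matrix arbiter from its gate counts.

    Gate counts follow the slide's expression: 2(R-1)R NOR2x1 gates,
    R(R-1)/2 flip-flops, and R inverters. R is assumed to be the number
    of requesters.
    """
    n_nor2 = 2 * (r - 1) * r
    n_dff = r * (r - 1) // 2   # always an integer: r * (r - 1) is even
    n_inv = r
    return area_nor2 * n_nor2 + area_dff * n_dff + area_inv * n_inv


# Hypothetical cell areas in arbitrary units, for a 4-requester arbiter.
area = matrix_arbiter_area(4, area_nor2=1.0, area_dff=5.0, area_inv=0.5)
print(area)
```

The NOR2 term grows quadratically with R, so it dominates the area for wide arbiters.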