102
Maurizio Palesi [[email protected]] 1 Some Key Issues in Embedded Some Key Issues in Embedded System Design System Design Maurizio Palesi Università di Catania Università di Catania Dipartimento di Ingegneria Informatica e delle Telecomunicazioni Dipartimento di Ingegneria Informatica e delle Telecomunicazioni

Some Key Issues in Embedded System Design

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Maurizio Palesi [[email protected]] 1

Some Key Issues in EmbeddedSome Key Issues in EmbeddedSystem DesignSystem Design

Maurizio Palesi

Università di CataniaUniversità di CataniaDipartimento di Ingegneria Informatica e delle TelecomunicazioniDipartimento di Ingegneria Informatica e delle Telecomunicazioni

Maurizio Palesi [[email protected]] 2

OutlineOutlineDesign Space Exploration of ParameterizedSystemsNetwork on Chip ArchitecturesInstruction Level Power Estimation

Maurizio Palesi [[email protected]] 3

OutlineOutlineDesign Space Exploration ofParameterized SystemsNetwork on Chip ArchitecturesInstruction Level Power Estimation

Maurizio Palesi [[email protected]] 4

MotivationMotivationDesign trends

Growing demand forportable devicesGrowing demand forlow power designIncreased applicationcomplexityShrinking time-to-market windows

Technology trendsIncreased chipcapacityIncreased I/O pinsImproved on-chipintegration techniques(storage, digital,analog, digital, …)SOC era

Need for greater design productivity!

Maurizio Palesi [[email protected]] 5

MotivationMotivationOne approach: reuse ofexisting IP

IP selection?IP integration?SOC verification?Multi-source IP licensing…

SOCCPU Memory

JPEGCODEC

Math/FPU

UART

MMXBRIDGE

?

???

?

MIPS

RAM

JPEGCODEC1

Math/FPU

UART

ISABRIDGE

ARM

SRAM

DRAM

AMBABRIDGE

JPEGCODEC2

USB

Maurizio Palesi [[email protected]] 6

MotivationMotivationAlternate approach:reuse of SOC

Designed, integrated,testedDomain specificParameterized

Designed by firmsspecializing in SOCUser: map application,then, “configure-and-execute”

Parameterized SOC

CPU Memory

JPEGCODEC

Math/FPU

UART

MMXBRIDGE

Maurizio Palesi [[email protected]] 7

MotivationMotivationComposed of 100s ofcoresCores are “configurable”Configurations impactpower/performanceLarge number of totalconfigurations!

Parameterized SOC

CPU Memory

JPEGCODEC

Math/FPU

UART

MMXBRIDGE

Architecture is otherwise fixed!

Maurizio Palesi [[email protected]] 8

MotivationMotivationATI Technologies – XILLEON™ 220 SOC forDigital Set-top Box MarketTensilica – Xtensa™ 1040 configurable processorcoresPhilips Semiconductors – Nexperia™ SOCplatformsAdelante Technologies – offers complete SOCcustomizable platforms for DSP domainsMore…

Maurizio Palesi [[email protected]] 9

OverviewOverviewGiven:

Parameterized SOCarchitectureFixed application

Automatically explore thedesign space

Find optimal pointsw/respect to power andperformance

Explore

0

200

400

600

800

1000

1200

1400

0 200 400 600 800 1000 1200 1400 1600 1800

Execution Time (us)

Pow

er(u

W)

void main(){ while(1){ Receive(); Decode(); Display(); }} Application

SOCCPU Memory

JPEGCODEC

Math/FPU

UART

I$-D$BRIDGE

Size = {1K, 4K, 8K}Line = {4, 8, 16}Assoc = {1, 2, 4}

Maurizio Palesi [[email protected]] 10

Sample SOC Platform for Digital CameraSample SOC Platform for Digital Camera

A/DA/D CCDCCD MemoryMemory D/AD/A

JPEG CodecJPEG Codec

DMADMAMIPSMIPS

UARTUART

I$I$

D$D$BridgeBridge

LCD driverLCD driver

Size

Associativity

Block size

Widthencoding

TX/RXbuf size

Pixelwidth

>1025 configurations

Maurizio Palesi [[email protected]] 11

Design Space Exploration (DSE)Design Space Exploration (DSE)Defining strategies for tuning the parametersso as to obtain the Pareto-optimal set ofconfigurations that provide multi-criteriaoptimisationCriteria (a.k.a. objectives)

Power dissipationPerformance (delay, execution time, …)Area (cost, complexity)Energy…

Maurizio Palesi [[email protected]] 12

Pareto’s ConceptPareto’s ConceptA new notion of optimality is required in thepresence of objective conflicts

power

dela

y

A

B

C

Maurizio Palesi [[email protected]] 13

DSE ApproachesDSE ApproachesDependency analysis (dep) [Givargis et al. TVLSI02]

GA-based DSE (ga) [Palesi et al. TCAD05]

Sensitivity Analysis [Zaccaria et al. DAES02]

Pareto-based Sensitivity Analysis (pbsa) [Palesi et al. IWSOC02]

Dependency + GA (depga) [Palesi, Givargis CODES02]

Sensitivity + GA (saga) [Palesi et al. ISCS02]

Maurizio Palesi [[email protected]] 14

Parameter DependencyParameter DependencyA → B: Pareto-optimal configurations of B shouldbe calculated after Pareto-optimal configurationsof nodes along the path A → B are calculated

A → B → A, (cycle): Pareto-optimalconfigurations of all the parameters on the cycleneed to be calculated simultaneously

A: Pareto-optimal configurations calculated inisolation

Maurizio Palesi [[email protected]] 15

Dependency GraphDependency Graph

AB

C

D

J KE

F

G

H I

N O

L M

R S

P Q

V W

T U

X

YZ

Node Core Parameter

A MIPS

Voltage scale

B I$ Total size

C Line size

D Associativity

E D$ Total size

F Line size

G Associativity

H CPUI$

bus

Data buswidth

I Data bus code

J Addr buswidth

K Addr bus code

X UART

Tx buffer size

Y Rx buffer size

Node Core Parameter

L CPUD$ bus

Data buswidth

M Data buscode

N Addr buswidth

O Addr buscode

P I/D$Membus

Data buswidth

Q Data buscode

R Addr buswidth

S Addr buscode

T Peripheral bus

Data buswidth

U Data buscode

V Addr buswidth

W Addr buscode

Z DCTCODE

C

Pixelresolution

Maurizio Palesi [[email protected]] 16

dep: Exploration Algorithmdep: Exploration Algorithm

A

BC

D

J K

EF

GH I

N O

L M

R S

P Q

V W

T UX

Y Z

Step 1: Clustering followed by simulation

Maurizio Palesi [[email protected]] 17

dep: Exploration Algorithmdep: Exploration Algorithm

Step 2: Pair-wise merge followed by simulationA,H,I B,C,D,E,

F,GJ,K,T,U

L,M,P,Q N,O,V,W X,Y,R,S

Z

A,H,I,B,C,D,E,F,G

J,K,T,U,Z

L,M,P,Q,N,O,V,W

X,Y,R,S

A,H,I,B,C,D,E,F,G,J,K,T,U,Z

L,M,P,Q,N,O,V,W,X,Y,R,S

A,H,I,B,C,D,E,F,G,J,K,T,U,Z,L,M,P,Q,N,O,V,W,X,Y,R,S

Maurizio Palesi [[email protected]] 18

dep: Problemsdep: ProblemsDependency graph

Based on designer knowledgeIf the parameters are heavilyinterdependent

Big clusterLocal configuration space too largeThe approach becomes infeasible

Maurizio Palesi [[email protected]] 19

GA: ApplicationGA: Application5 items

Configuration representationFeasible functionCost/Objective functionsConstraint functionsConvergency criteria

Maurizio Palesi [[email protected]] 20

Configuration RepresentationConfiguration RepresentationSystemSystem

ParametersdeterminationParameters

determination

ParametersParameters Parametervalues

Parametervalues

Chromosomemapping

Chromosomemapping

ChromosomeChromosome P1 P2 P3 PN

Maurizio Palesi [[email protected]] 21

Feasible/Cost/Constraints FunctionsFeasible/Cost/Constraints Functions

ConfigurationConfiguration

Feasiblefunction

Feasiblefunction

Feasible/Not feasibleFeasible/

Not feasible

ConfigurationConfiguration

Sys levelsimulationSys levelsimulation

Obj1Obj1 Obj1Obj1 Obj1Obj1

Obj1Obj1 Obj1Obj1 Obj1Obj1

Constraintsfunction

Constraintsfunction

Okay/rejectedOkay/rejected

Maurizio Palesi [[email protected]] 22

How Many Generations?How Many Generations?Fixed number of generationsAutostop criteria

Based on convergency

power

dela

y

Maurizio Palesi [[email protected]] 23

dep: Problems & Solutionsdep: Problems & SolutionsIf the parameters are heavily interdependent

Big clusterLocal configuration space too largeThe approach becomes infeasible

SolutionSubstituite a GA based approach in place ofthe exhaustive search

Maurizio Palesi [[email protected]] 24

depga: 1st phasedepga: 1st phaseC: set of configurations

to be explored

|C|>Threshold

Init random population

GA evolve

Exhaustive search

LPOS LPOS

Y N

Maurizio Palesi [[email protected]] 25

depga: 2nd phasedepga: 2nd phase

|CIJ|>Threshold

Init population with LPOSI and LPOSJ

GA evolve

Exhaustive search

LPOS LPOS

MergeLPOSI LPOSJ

Y N

Maurizio Palesi [[email protected]] 26

EfficiencyEfficiencyNumber of Configuration Visited

0

5000

10000

15000

20000

25000

Dependency Mixed, T=100 Mixed, T=200 Mixed, T=400

Maurizio Palesi [[email protected]] 27

depga: Experimentsdepga: Experiments

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0 0.005 0.01 0.015 0.02 0.025 0.03

execution time (sec)

pow

er (W

)

DependencyMixed, T=100Mixed, T=200Mixed, T=400

Maurizio Palesi [[email protected]] 28

Sensitivity AnalysisSensitivity AnalysisLet us consider N parameters P1,P2,...,PNeach of them can assume respectively {V1},{V2},...,{VN} values

The number of possible configurations will be|V1|×| V2|×,...,× |VN|Sensitivity analysis allows to test only|V1|+|V2|+,...,+|VN| configurations

Two phasesParameters sorted according to their sensitivityExhaustive search in the individuated subspace

Maurizio Palesi [[email protected]] 29

Sensitivity Analysis (example)Sensitivity Analysis (example)Minimization power-delay (PD) of a cache

A configuration is a triple c=<s,b,a> (size,bsize,assoc)Sensitivity analysis

Fix b and a and let s variable ⇒ Pdsmin, Pds

max

Fix s and a and let b variable ⇒ Pdbmin, Pdb

max

Fix s and b and let a variable ⇒ Pdamin, Pda

maxSorting starting from the most sensitive to the less one (e.g. s,a, b)

ExplorationFix a=a0, b=b0 and make s variable ⇒ sopt

Fix s=sopt, b=b0 and make a variable ⇒ aopt

Fix s=sopt, a=aopt and make b variable ⇒ bopt

Maurizio Palesi [[email protected]] 30

SA + GA = SAGASA + GA = SAGAParametersParameters

Sensitivityanalysis

Sensitivityanalysis

High-sensitiveparameters

High-sensitiveparameters

GAGA

ChromosomeChromosome

ReferenceparametersReferenceparameters

ThresholdThreshold

Maurizio Palesi [[email protected]] 31

Overall ResultsOverall Resultsdepsims % dist % sav % dist % sav % dist % sav % dist % sav

adpcm 14694 0.04 89.21 0.02 78.75 0.00 80.53 0.03 91.40bcnt 22642 0.02 93.01 0.01 89.67 0.00 83.99 0.03 95.42blit 37787 0.04 95.82 0.00 90.99 0.02 93.09 0.05 97.03compress 14035 0.03 89.02 0.01 75.24 0.01 60.41 0.03 91.12crc 21205 0.04 92.94 0.01 83.75 0.01 93.78 0.03 96.42des 21970 0.02 93.00 0.01 82.60 0.00 58.77 0.03 93.18engine 15532 0.03 90.36 0.02 78.44 0.00 70.35 0.03 91.93fir 16832 0.03 91.23 0.01 82.60 0.00 67.77 0.03 92.43g3fax 27382 0.04 94.26 0.02 86.20 0.01 93.09 0.02 96.83jpeg 15996 0.04 90.61 0.01 82.15 0.00 57.48 0.03 91.87pocsag 19102 0.04 91.49 0.01 83.93 0.00 71.49 0.03 94.66qurt 13231 0.03 88.68 0.00 77.58 0.00 73.92 0.02 91.63ucbqsort 19102 0.03 91.82 0.01 84.60 0.00 84.21 0.03 94.66Average 0.03 91.65 0.01 82.81 0.01 76.07 0.03 93.74

sagaBenchmark ga depga pbsa

Maurizio Palesi [[email protected]] 32

VLIW Embedded SystemVLIW Embedded SystemSimultaneous interaction between

Hardware parametersSoftware parameters

Analysis of the tradeoffsPower/PerformanceEnergy/Performance

Maurizio Palesi [[email protected]] 33

Instruction Level ParallelismInstruction Level ParallelismHigh performance processors in the 1980s:maximize ILP

Issue more than one single instruction in agiven clock cycleWho decides which instructions can beexecuted in parallel?

Two different philosophiesSuperscalarVery Long Instruction Word (VLIW)

Maurizio Palesi [[email protected]] 34

ILP philosophy: SuperscalarILP philosophy: SuperscalarHide the process of finding ILPILP is discovered dynamically at run-time by thecontrol hardware of the processor

HWOp1,Op2Op3Op4,Op5…

Foo.c

Op1Op2Op3Op4Op5…

compiler

Instruction stream

Run-time

Maurizio Palesi [[email protected]] 35

ILP philosophy: VLIWILP philosophy: VLIWHardware resources are architecturally visible to thecompilerCompiler can create a sequence of Very LongInstructions that defines the plan of executionHW simply execute the plan

HWFoo.cOp1,Op2

Op3Op4,Op5

compiler

Hardware resources configuration

Plan of execution

Run-time

Maurizio Palesi [[email protected]] 36

VLIW past & futureVLIW past & futureDecline of VLIWs for general purpose systems

Couldn’t be integrated in a single chipBinary compatibility between implementations

Rediscovery of VLIW in embeddedNo more integrability issuesBinary incompatibility not relevant

SW is very specific and supplied with the HW

Advanteges of VLIWSimplified hardwareOptimize ad-hoc the architecture to achieve ILP

Maurizio Palesi [[email protected]] 37

ASIP VLIW in Embedded SystemASIP VLIW in Embedded SystemASIP based on VLIW architecturesExamples are

TI C6x familyAgere/Motorola StarCore architectureSun MAJCFujitsu FR-VST HP/ST LxPhilips TriMedia…

Maurizio Palesi [[email protected]] 38

Reference architecture (HPL-PD)Reference architecture (HPL-PD)

L2 U

nifie

d C

ache

L2 U

nifie

d C

ache

PrefetchCache

PrefetchUnit

FetchUnit Instruction

Queue

Dec

ode

and

Con

trol

Log

ic

PredicateRegisters

BranchRegisters

GeneralPruposeRegisters

FloatingPoint

Registers

ControlRegisters

Load/StoreUnit

BranchUnit

IntegerUnit

FloatingPointUnit

L1 D

ata

Cac

heL1

Dat

aC

ache

L1 In

stru

ctio

nC

ache

L1 In

stru

ctio

nC

ache

Maurizio Palesi [[email protected]] 39

An Open Platform: EPIC ExplorerAn Open Platform: EPIC ExplorerInterfacing to the Trimaran framework thatprovide VLIW compiler and simulator for dynamicstatisticsEstimator component implementing high levelmodelsExplorer component implementing multi-objectivedesign space exploration algorithms

EPICExplorerhttp://epic-explorer.sourceforge.net

Maurizio Palesi [[email protected]] 40

The Exploration Data FlowThe Exploration Data Flow

IMPACTIMPACTFoo.cFoo.c

System configuration

ProcessorProcessor

MemoryMemory

EmulibEmulib

foo.exefoo.exe

Execution statisticsExecution statistics

EstimatorEstimator

EnergyEnergy PowerPowerCyclesCycles

ExplorerExplorer

ELCORELCOR

HMDES

Maurizio Palesi [[email protected]] 41

Energy estimationEnergy estimationSubdivide architecture in Functional Block Unit (FBU)

Instruction decode logic, Integer units, floating point units, registerfiles

For each FBU (from ST Microelectronics LX)Active power: average power dissipated when the FBU is usedInactive power: average power dissipated when the FBU is not used

From the execution statistic, we know how many cycles eachFBU has been active/inactive

EFBU=(Pactive× cyclesactive+ Pinactive×cyclesinactive) × Tclock

Discrete degree of accuracy (about 25%)investigate relative power savings beetween designs

Maurizio Palesi [[email protected]] 42

Configuration SpaceConfiguration SpaceThree main parameter categories:

VLIW core:Number of Registers in each register file (from 16 to 256)Number of istancies for Functional Units of each type (from 1 to 6)

Mem Hierarchy:Size, Blocksize, Associativity for each of the caches

(L1 Instruction, L1 Data, L2)

Compiler:Conservative compilation strategy (basic blocks)Aggressive ILP oriented compilation strategy (hyperblocks)

Total space size: 1.47 x 1013 configurations!

Maurizio Palesi [[email protected]] 43

ExplorationExploration MethodologyMethodologyPreliminary analisys of compilation

Impact of ILP oriented code transformationsPredict the right compilation strategy

Basic Blocks (conservative)Hyper Blocks (aggressive, ILP-oriented)

Multi-objective Design Space ExplorationExtract Pareto Set

Maurizio Palesi [[email protected]] 44

PreliminaryPreliminary AnalisysAnalisys (1/3) (1/3)For each objective, Unpaired two sample t-test allows toestimate the average effect of hyperblock formation

ConfigurationSpace

CN

CH

Random subsets of n configurations

T-test

ON

OH

Compilation with (H)and without (N)hyperblock formation

Is the mean effect on the objectivesignificant respect to the chosencritical difference?

Maurizio Palesi [[email protected]] 45

PreliminaryPreliminary AnalisysAnalisys (2/3) (2/3)Example of a metric for critical difference in means: δ > 50% M

δM

Maurizio Palesi [[email protected]] 46

PreliminaryPreliminary AnalisysAnalisys (3/3) (3/3)

-56.6+6.4359.60-0.24+0.090.54-23.83+2.5821.55gsm-dec

-0.97+0.121.40-0.27+0.090.79-0.26+0.080.68fir

-27.74+7.358.54-0.5+0.121.02-6.19+3.3124.2adpcm-dec

-32.4+5.965.53-0.39+0.080.76-7.23+2.9522.76g721-enc

-3.48+9.8862.500.25+0.160.88-5.28+4.8533.39mpeg-dec

-8.56+3.7346.12-0.89+0.141.258.17+2.215.8adpcm-enc

-2.31+1.019.72-0.07+0.090.89-0.97+0.514.07jpeg

55.84+9.8279.28-0.48+0.140.8833.25+4.7936.62gsm-enc

30.82+4.5549.010.38+0.161.646.76+1.8416.64ieee810

µN-µH∆µN-µH∆µN-µH∆

Energy (mJ)Power (W)Time (ms)Application

ILP-oriented compilation impact (positive,negative)

Maurizio Palesi [[email protected]] 47

Pareto Set (G721 encode)Pareto Set (G721 encode)

13%

Maurizio Palesi [[email protected]] 48

ParetoPareto Set ( Set (GSM-encodeGSM-encode))

Maurizio Palesi [[email protected]] 49

Pareto SurfacePareto Surface

Maurizio Palesi [[email protected]] 50

OutlineOutlineDesign Space Exploration of ParameterizedSystemsNetwork on Chip ArchitecturesInstruction Level Power Estimation

Maurizio Palesi [[email protected]] 51

The Evolution of SoC PlatformsThe Evolution of SoC Platforms

General-purposeScalable RISCProcessor• 50 to 300+ MHz• 32-bit or 64-bit

Library of DeviceIP Blocks• Imagecoprocessors• DSPs• UART• 1394• USB

Scalable VLIWMedia Processor:• 100 to 300+ MHz• 32-bit or 64-bit

Nexperia™System Buses• 32-128 bit

2 Cores: Philips’ Nexperia PNX8850 SoC platform for High-end digital video (2001)

Maurizio Palesi [[email protected]] 52

Running Forward…Running Forward…

Four 350/400 MHz StarCore SC140DSP extended cores16 ALUs: 5600/6400 MMACS1436 KB of internal SRAM & multi-level memory hierarchyInternal DMA controller supports 16TDM unidirectional channels,Two internal coprocesssors (TCOPand VCOP) to provide special-purpose processing capability inparallel with the core processors

6 Cores: Motorola’s MSC8126 SoC platform for 3G base stations (late 2003)

Maurizio Palesi [[email protected]] 53

What´s Happening in SoCs?What´s Happening in SoCs?Technology: no slow-down in sight!

Faster and smaller transistors: 90 → 65 → 45 nm… but slower wires, lower voltage, more noise!

80% or more of the delay of critical paths will be due tointerconnects

Design complexity: from 2 to 10 to 100 cores!Design reuse is essential…but differentiation/innovation is key for winning on themarket!

Performance and power: GOPS for MWs!Performance requirements keep going up…but power budgets don’t!

Maurizio Palesi [[email protected]] 54

The Deep Submicron EffectsThe Deep Submicron Effects

Maurizio Palesi [[email protected]] 55

Communication ArchitecturesCommunication ArchitecturesShared bus

Low areaPoor scalabilityHigh energy consumption

Network-on-ChipScalability and modularityLow energy consumptionIncrease of design complexity

Shared bus

IP IP IP

IP IP IP

IP IP IP

IPIPIP

IP IP IP

Maurizio Palesi [[email protected]] 56

Design Space Exploration for NoCDesign Space Exploration for NoC

Ogras et al., ASAP’05

Maurizio Palesi [[email protected]] 57

NOC Research TopicsNOC Research TopicsTopological Mapping ProblemRouting Strategies

Routing FunctionSelection Function

Maurizio Palesi [[email protected]] 58

Topological MappingTopological Mapping

Maurizio Palesi [[email protected]] 59

The Mapping ProblemThe Mapping Problem

Application

Graph of concurrent tasks

IPLibrary

ASIC1

CPU1DSP1

CPU2MEM1

Mapping(NP-hard)

NOC

Theapplication isdivided into a

graph ofconcurrent

tasks

Theapplicationtasks are

assigned andscheduled

Decide to whichtile each

selected IPshould be

mapped suchthat the metricsof interest are

optimized

11 22 33

Maurizio Palesi [[email protected]] 60

Impact on PerformanceImpact on Performance

Maurizio Palesi [[email protected]] 61

Impact on EnergyImpact on Energy

Maurizio Palesi [[email protected]] 62

Design Space ExplorationDesign Space Exploration

Communication characteristics Trace files Communication statistics …

Communication characteristics Trace files Communication statistics …

MappingEvaluatorMapping

EvaluatorMapping Space

ExplorerMapping Space

Explorer

BestMappings

BestMappings

Initial mapping New mapping

Constraints Bandwidth Delay Jitter …

Constraints Bandwidth Delay Jitter …

Delay Throughput Power/Energy …

Maurizio Palesi [[email protected]] 63

Exploration EnginesExploration EnginesGAMAP

Multi-objective mapping based on genetic algorithmsPBBB

Pareto-based Branch & Bound approachAdaptation of the approach proposed by Hu and Marculescu[ASPDAC03] for a multi-objective exploration of the mappingspace

PBNMAPPareto-based NMAP approach

Adaptation of the approach proposed by Murali and De Micheli[DATE04] for a multi-objective exploration of the mapping space

– For the XY routing

Maurizio Palesi [[email protected]] 64

GAMAPGAMAPChromosome: A representation of the solution tothe problem (the mapping)

Each tile in the mesh has an associated gene whichencodes the identifier of the core mapped in the tileIn an n×m mesh

The chromosome is formed by n×m genesThe i-th gene encodes the identifier of the core in the tile in rowi/n and column i%m

2 0 3 1

Chromosome2 0

3 1

NOC

Maurizio Palesi [[email protected]] 65

GAMAPGAMAPGenetic operators

Hot-spot

crossover

BA

A B

mutation

Maurizio Palesi [[email protected]] 66

PBBBPBBBMulti-objective extension of the Branch-and-Bound algorithm

Core 1Core 2Core 3Core 4

Communicationdemand

Search tree

Maurizio Palesi [[email protected]] 67

PBNMAPPBNMAPMulti-objective extension of the NMAP algorithm proposed by Murali and DeMicheli in [DATE04] for the single XY routing

The core that hasmaximum communicationdemand is placed ontoone of the mesh nodes

with maximum number ofneighboring

For each core yet to be mapped, thecore that communicate most with thealready mapped cores is selected. Thecore is placed onto the mesh node thatminimize the communication cost with

mapped cores.

Maurizio Palesi [[email protected]] 68

ExperimentsExperimentsReal Traffic (MPEG-2)Real Traffic (MPEG-2)

Maurizio Palesi [[email protected]] 69

ExperimentsExperimentsReal Traffic (MPEG-2)Real Traffic (MPEG-2)

0 1 2 30 ASIC5 ASIC4 DSP2 CPU21 CPU1 ASIC3 MEM1 ASIC12 DSP1 ASIC2 IP3 IP13 MEM2 DSP3 IP2 IP4

Execution time 41.0 msEnergy: 20.8 mJ

High performance mapping

0 1 2 30 ASIC5 ASIC3 ASIC1 DSP21 DSP1 ASIC4 MEM1 CPU22 CPU1 ASIC2 IP3 IP13 MEM2 DSP3 IP2 IP4

Execution time 42.4 msEnergy: 17.8 mJ

Low energy mapping

Maurizio Palesi [[email protected]] 70

ExperimentsExperimentsEfficiencyEfficiency

GAMAP 840 simsPBBB 7238 simsPBNMAP 2670 sims

SpeedupPBBB = 8.6SppedupPBNMAP = 3.2

Maurizio Palesi [[email protected]] 71

ExperimentsExperimentsEfficiencyEfficiency

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

Region

2x2

Region

2x2o

verlap

Nsew4x

4

Nsewbia

s4x4

Neighb

oring

4x4

Rando

m4x4

Averag

esi

mul

atio

nsGAMAP PBNMAP PBBB

Maurizio Palesi [[email protected]] 72

RoutingRouting

Maurizio Palesi [[email protected]] 73

RoutingRoutingDeterministic

Path completely determined by the source andthe destination address

AdaptiveThe path taken by a particular packet dependson dynamic network conditions

Fault,Congestion, …

Routing packets along alternative paths inorder to avoid congested areas

Maurizio Palesi [[email protected]] 74

Routing and SelectionRouting and Selection

Routingfunction

Routingfunction

Selectionfunction

Selectionfunction

message

Output portsSet of possibleoutput ports

Maurizio Palesi [[email protected]] 75

Neighbors on PathNeighbors on PathCongestion-Aware RoutingCongestion-Aware Routing

Maurizio Palesi [[email protected]] 76

SituationsSituations of of IndecisionIndecision

0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.15

10

15

20

25

30

35

40

45

packet injection rate (packets/cycle/node)

% s

ituaz

ioni

inde

cisi

onal

i

UniformM1M2

Packet injection rate (packets/cycle/node)

% In

deci

sion

al s

ituat

ions

Maurizio Palesi [[email protected]] 77

SelectionSelection ProblemProblem

They both can accept the flit

Maurizio Palesi [[email protected]] 78

SelectionSelection ProblemProblem

Send the packet to the nodethat can early dispatches it witha higher probability

Maurizio Palesi [[email protected]] 79

NoPCARNoPCAR AlgorithmAlgorithm

Maurizio Palesi [[email protected]] 80

x x

x

o o

o

NoPCARNoPCAR AlgorithmAlgorithm

xo

Not available

AvailableNESWox-oScore=2

NESWxox-Score=1

Maurizio Palesi [[email protected]] 81

NoPCARNoPCAR AlgorithmAlgorithm

xo

Not available

Availablex

x

o

o

o

Maurizio Palesi [[email protected]] 82

NoPCARNoPCAR AlgorithmAlgorithm

xo

Not available

Availablex

x

o

o

o

NESWox-oxoxxxxxxScore=0

NESWxox-ooxxxoxxScore=1

Route a flit towards the router which has themost possibility to send the flit to neighborsand is part of a routing path leading to thedestination

Route a flit towards the router which has themost possibility to send the flit to neighborsand is part of a routing path leading to thedestination

Maurizio Palesi [[email protected]] 83

NoPCAR algorithmNoPCAR algorithmNoPCAR_Selection(in AC: set of admissible output channels, nd: destination node, out SC: selected output channel){ credits[] = 0; foreach channel1 in AC { node1 = target(channel1); foreach channel2 in R(node1, nd) { node2 = target(channel2); if bsv[node1][node2] is available then credits[channel1]++; } } sc = ch s.t. credits[ch] = max(credits[]);}

Maurizio Palesi [[email protected]] 84

Block Diagram of the RouterBlock Diagram of the Router

Synthesised with Synopsys Design Compiler0.13 um tech. Library from Virtual SiliconFlit size: 64 bitFIFO size 3 flitsBuffer area significantly dominatesthe logic

Area overhead due to NoPCAR is less then 6%

Maurizio Palesi [[email protected]] 85

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05 0.055 0.060

50

100

150

200

250

300

Average packet injection rate [packets/cycle/node]

Ave

rage

pac

ket d

elay

[cyc

les]

OELAXY

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05 0.055 0.060

50

100

150

200

250

300

Average packet injection rate [packets/cycle/node]

Ave

rage

pac

ket d

elay

[cyc

les]

OELAXY

M1: (i,j) (N-1-j,N-1-i) M2: (i,j) (j,i)

M1/M2 M1/M2 TrafficTraffic

Maurizio Palesi [[email protected]] 86

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.040

50

100

150

200

250

300

Average packet injection rate [packets/cycle/node]

Ave

rage

pac

ket d

elay

[cyc

les]

OELAXY

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.040

50

100

150

200

250

300

Average packet injection rate [packets/cycle/node]

Ave

rage

pac

ket d

elay

[cyc

les]

OELAXY

1 node HS (hsp=15%) 5 nodes HS (hsp=40%)

Hot-Spot Hot-Spot TrafficTraffic (1/2) (1/2)

Maurizio Palesi [[email protected]] 87

0 0.005 0.01 0.015 0.02 0.025 0.030

50

100

150

200

250

300

Average packet injection rate [packets/cycle/node]

Ave

rage

pac

ket d

elay

[cyc

les]

OELAXY

(2,2) (3,3)

0 0.005 0.01 0.015 0.02 0.025 0.030

50

100

150

200

250

300

Average packet injection rate [packets/cycle/node]

Ave

rage

pac

ket d

elay

[cyc

les]

OELAXY

(1,1) (4,4)

Hot-Spot Hot-Spot TrafficTraffic (2/2) (2/2)

Maurizio Palesi [[email protected]] 88

OutlineOutlineDesign Space Exploration of ParameterizedSystemsNetwork on Chip ArchitecturesInstruction Level Power Estimation

Maurizio Palesi [[email protected]] 89

Why Low Power Design?Why Low Power Design?Mobile applications

Large & heavy devicesBattery life

High-performance devicesHeat dissipation problemsExpensive packaging

ReliabilityFault probability doubling for every 10°Cincrease in temperature

Maurizio Palesi [[email protected]] 90

SoC Architecture: an exampleSoC Architecture: an exampleSystem bus

Peripheral bus

AMS VCs

Logic VCs

Embedded SoftwareRTOS, Drivers, Applications

µP

Application logic

Memory control

Glue logic

Test EIDE USB10Base-T

PCI

Bridge

PLL

ROM

DSP

MPEGDecode

AudioCodec

Video I/F

RAM

FLASH

time

pow

er

Maurizio Palesi [[email protected]] 91

The Average Cost ModelThe Average Cost Model“By measuring the power absorbed by a processor Xrepeating certain instructions or short sequences ofinstructions it is possible to obtain much of theinformation needed to calculate the power dissipatedwhen the processor X executes the instructions of anyprogram Y”.

Tiwari et. al ’94

Maurizio Palesi [[email protected]] 92

Case StudyCase StudySTMicroelectronics ST20 C2P

Stack based architectureStrong data dependency

STMicroelectronics LXVLIW architecture

Work supported by CNR (Project MADESS II) and STMicroelectronics

Maurizio Palesi [[email protected]] 93

The Average Cost ModelThe Average Cost Model

add r1, r2, r3add r1, r2, r3add r1, r2, r3add r1, r2, r3add r1, r2, r3add r1, r2, r3add r1, r2, r3… … … … …add r1, r2, r3add r1, r2, r3add r1, r2, r3add r1, r2, r3

Transistor- or Gate-level

Power estimator

Instruction Costadd 1.5sub 1.9lw 4.2sw 7.9…

Maurizio Palesi [[email protected]] 94

Inter-instruction EffectsInter-instruction Effects

15.3052Add

353%58.223Stl

0%9.6162Stl

85%16.5719Stl

25.885LdnlpIncr.CostInstr.

Maurizio Palesi [[email protected]] 95

Code OptimizationCode Optimization…ldl 0ldl 1addldl 2addldl 3addldl 4...ldl 28addldl 29

...ldl 0ldl 1ldl 2addaddldl 3ldl 4add...ldl 28ldl 29add

13 µJ 7.3 µJ

44% saving!

Maurizio Palesi [[email protected]] 96

Data DependencyData Dependency

Clock cycles

EU

cur

rent

abs

orpt

ion

(mA

)

Low activityHigh activity

Maurizio Palesi [[email protected]] 97

Data DependencyData Dependency

Maurizio Palesi [[email protected]] 98

Data DependencyData Dependency

Maurizio Palesi [[email protected]] 99

Data Dependent ModelData Dependent ModelE(IN) = base(IN) + Int(IN,IN-1) + Data(IN)

Maurizio Palesi [[email protected]] 100

FFTFFT

Maurizio Palesi [[email protected]] 101

ExperimentsExperiments

Real ACM DDM(mA) (mA) (mA) ACM DDM ACM DDM

sum_low 6,79 13,44 6,8 97,77% 0,12% 97,79% 0,86%sum_high 14,24 13,44 14,31 5,58% 0,50% 20,37% 3,47%rand_inst 40,62 48,14 40,12 18,51% 1,24% 31,37% 6,44%mtx_sum 31,97 37,83 30,97 18,32% 3,13% 31,44% 6,81%mtx_mul 36,01 41,26 34,02 7,07% 5,53% 27,21% 8,61%dft 32,07 28,57 30,92 10,91% 3,54% 15,13% 8,94%

26,36% 2,34% 37,22% 5,86%

Benchmark Relative Error Absolute Error

Average

Maurizio Palesi [[email protected]] 102

OutlineOutlineDesign Space Exploration of ParameterizedSystemsNetwork on Chip ArchitecturesInstruction Level Power Estimation