Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Maurizio Palesi [[email protected]] 1
Some Key Issues in EmbeddedSome Key Issues in EmbeddedSystem DesignSystem Design
Maurizio Palesi
Università di CataniaUniversità di CataniaDipartimento di Ingegneria Informatica e delle TelecomunicazioniDipartimento di Ingegneria Informatica e delle Telecomunicazioni
Maurizio Palesi [[email protected]] 2
OutlineOutlineDesign Space Exploration of ParameterizedSystemsNetwork on Chip ArchitecturesInstruction Level Power Estimation
Maurizio Palesi [[email protected]] 3
OutlineOutlineDesign Space Exploration ofParameterized SystemsNetwork on Chip ArchitecturesInstruction Level Power Estimation
Maurizio Palesi [[email protected]] 4
MotivationMotivationDesign trends
Growing demand forportable devicesGrowing demand forlow power designIncreased applicationcomplexityShrinking time-to-market windows
Technology trendsIncreased chipcapacityIncreased I/O pinsImproved on-chipintegration techniques(storage, digital,analog, digital, …)SOC era
Need for greater design productivity!
Maurizio Palesi [[email protected]] 5
MotivationMotivationOne approach: reuse ofexisting IP
IP selection?IP integration?SOC verification?Multi-source IP licensing…
SOCCPU Memory
JPEGCODEC
Math/FPU
UART
MMXBRIDGE
?
???
?
MIPS
RAM
JPEGCODEC1
Math/FPU
UART
ISABRIDGE
ARM
SRAM
DRAM
AMBABRIDGE
JPEGCODEC2
USB
Maurizio Palesi [[email protected]] 6
MotivationMotivationAlternate approach:reuse of SOC
Designed, integrated,testedDomain specificParameterized
Designed by firmsspecializing in SOCUser: map application,then, “configure-and-execute”
Parameterized SOC
CPU Memory
JPEGCODEC
Math/FPU
UART
MMXBRIDGE
Maurizio Palesi [[email protected]] 7
MotivationMotivationComposed of 100s ofcoresCores are “configurable”Configurations impactpower/performanceLarge number of totalconfigurations!
Parameterized SOC
CPU Memory
JPEGCODEC
Math/FPU
UART
MMXBRIDGE
Architecture is otherwise fixed!
Maurizio Palesi [[email protected]] 8
MotivationMotivationATI Technologies – XILLEON™ 220 SOC forDigital Set-top Box MarketTensilica – Xtensa™ 1040 configurable processorcoresPhilips Semiconductors – Nexperia™ SOCplatformsAdelante Technologies – offers complete SOCcustomizable platforms for DSP domainsMore…
Maurizio Palesi [[email protected]] 9
OverviewOverviewGiven:
Parameterized SOCarchitectureFixed application
Automatically explore thedesign space
Find optimal pointsw/respect to power andperformance
Explore
0
200
400
600
800
1000
1200
1400
0 200 400 600 800 1000 1200 1400 1600 1800
Execution Time (us)
Pow
er(u
W)
void main(){ while(1){ Receive(); Decode(); Display(); }} Application
SOCCPU Memory
JPEGCODEC
Math/FPU
UART
I$-D$BRIDGE
Size = {1K, 4K, 8K}Line = {4, 8, 16}Assoc = {1, 2, 4}
Maurizio Palesi [[email protected]] 10
Sample SOC Platform for Digital CameraSample SOC Platform for Digital Camera
A/DA/D CCDCCD MemoryMemory D/AD/A
JPEG CodecJPEG Codec
DMADMAMIPSMIPS
UARTUART
I$I$
D$D$BridgeBridge
LCD driverLCD driver
Size
Associativity
Block size
Widthencoding
TX/RXbuf size
Pixelwidth
>1025 configurations
Maurizio Palesi [[email protected]] 11
Design Space Exploration (DSE)Design Space Exploration (DSE)Defining strategies for tuning the parametersso as to obtain the Pareto-optimal set ofconfigurations that provide multi-criteriaoptimisationCriteria (a.k.a. objectives)
Power dissipationPerformance (delay, execution time, …)Area (cost, complexity)Energy…
Maurizio Palesi [[email protected]] 12
Pareto’s ConceptPareto’s ConceptA new notion of optimality is required in thepresence of objective conflicts
power
dela
y
A
B
C
Maurizio Palesi [[email protected]] 13
DSE ApproachesDSE ApproachesDependency analysis (dep) [Givargis et al. TVLSI02]
GA-based DSE (ga) [Palesi et al. TCAD05]
Sensitivity Analysis [Zaccaria et al. DAES02]
Pareto-based Sensitivity Analysis (pbsa) [Palesi et al. IWSOC02]
Dependency + GA (depga) [Palesi, Givargis CODES02]
Sensitivity + GA (saga) [Palesi et al. ISCS02]
Maurizio Palesi [[email protected]] 14
Parameter DependencyParameter DependencyA → B: Pareto-optimal configurations of B shouldbe calculated after Pareto-optimal configurationsof nodes along the path A → B are calculated
A → B → A, (cycle): Pareto-optimalconfigurations of all the parameters on the cycleneed to be calculated simultaneously
A: Pareto-optimal configurations calculated inisolation
Maurizio Palesi [[email protected]] 15
Dependency GraphDependency Graph
AB
C
D
J KE
F
G
H I
N O
L M
R S
P Q
V W
T U
X
YZ
Node Core Parameter
A MIPS
Voltage scale
B I$ Total size
C Line size
D Associativity
E D$ Total size
F Line size
G Associativity
H CPUI$
bus
Data buswidth
I Data bus code
J Addr buswidth
K Addr bus code
X UART
Tx buffer size
Y Rx buffer size
Node Core Parameter
L CPUD$ bus
Data buswidth
M Data buscode
N Addr buswidth
O Addr buscode
P I/D$Membus
Data buswidth
Q Data buscode
R Addr buswidth
S Addr buscode
T Peripheral bus
Data buswidth
U Data buscode
V Addr buswidth
W Addr buscode
Z DCTCODE
C
Pixelresolution
Maurizio Palesi [[email protected]] 16
dep: Exploration Algorithmdep: Exploration Algorithm
A
BC
D
J K
EF
GH I
N O
L M
R S
P Q
V W
T UX
Y Z
Step 1: Clustering followed by simulation
Maurizio Palesi [[email protected]] 17
dep: Exploration Algorithmdep: Exploration Algorithm
Step 2: Pair-wise merge followed by simulationA,H,I B,C,D,E,
F,GJ,K,T,U
L,M,P,Q N,O,V,W X,Y,R,S
Z
A,H,I,B,C,D,E,F,G
J,K,T,U,Z
L,M,P,Q,N,O,V,W
X,Y,R,S
A,H,I,B,C,D,E,F,G,J,K,T,U,Z
L,M,P,Q,N,O,V,W,X,Y,R,S
A,H,I,B,C,D,E,F,G,J,K,T,U,Z,L,M,P,Q,N,O,V,W,X,Y,R,S
Maurizio Palesi [[email protected]] 18
dep: Problemsdep: ProblemsDependency graph
Based on designer knowledgeIf the parameters are heavilyinterdependent
Big clusterLocal configuration space too largeThe approach becomes infeasible
Maurizio Palesi [[email protected]] 19
GA: ApplicationGA: Application5 items
Configuration representationFeasible functionCost/Objective functionsConstraint functionsConvergency criteria
Maurizio Palesi [[email protected]] 20
Configuration RepresentationConfiguration RepresentationSystemSystem
ParametersdeterminationParameters
determination
ParametersParameters Parametervalues
Parametervalues
Chromosomemapping
Chromosomemapping
ChromosomeChromosome P1 P2 P3 PN
Maurizio Palesi [[email protected]] 21
Feasible/Cost/Constraints FunctionsFeasible/Cost/Constraints Functions
ConfigurationConfiguration
Feasiblefunction
Feasiblefunction
Feasible/Not feasibleFeasible/
Not feasible
ConfigurationConfiguration
Sys levelsimulationSys levelsimulation
Obj1Obj1 Obj1Obj1 Obj1Obj1
Obj1Obj1 Obj1Obj1 Obj1Obj1
Constraintsfunction
Constraintsfunction
Okay/rejectedOkay/rejected
Maurizio Palesi [[email protected]] 22
How Many Generations?How Many Generations?Fixed number of generationsAutostop criteria
Based on convergency
power
dela
y
Maurizio Palesi [[email protected]] 23
dep: Problems & Solutionsdep: Problems & SolutionsIf the parameters are heavily interdependent
Big clusterLocal configuration space too largeThe approach becomes infeasible
SolutionSubstituite a GA based approach in place ofthe exhaustive search
Maurizio Palesi [[email protected]] 24
depga: 1st phasedepga: 1st phaseC: set of configurations
to be explored
|C|>Threshold
Init random population
GA evolve
Exhaustive search
LPOS LPOS
Y N
Maurizio Palesi [[email protected]] 25
depga: 2nd phasedepga: 2nd phase
|CIJ|>Threshold
Init population with LPOSI and LPOSJ
GA evolve
Exhaustive search
LPOS LPOS
MergeLPOSI LPOSJ
Y N
Maurizio Palesi [[email protected]] 26
EfficiencyEfficiencyNumber of Configuration Visited
0
5000
10000
15000
20000
25000
Dependency Mixed, T=100 Mixed, T=200 Mixed, T=400
Maurizio Palesi [[email protected]] 27
depga: Experimentsdepga: Experiments
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0 0.005 0.01 0.015 0.02 0.025 0.03
execution time (sec)
pow
er (W
)
DependencyMixed, T=100Mixed, T=200Mixed, T=400
Maurizio Palesi [[email protected]] 28
Sensitivity AnalysisSensitivity AnalysisLet us consider N parameters P1,P2,...,PNeach of them can assume respectively {V1},{V2},...,{VN} values
The number of possible configurations will be|V1|×| V2|×,...,× |VN|Sensitivity analysis allows to test only|V1|+|V2|+,...,+|VN| configurations
Two phasesParameters sorted according to their sensitivityExhaustive search in the individuated subspace
Maurizio Palesi [[email protected]] 29
Sensitivity Analysis (example)Sensitivity Analysis (example)Minimization power-delay (PD) of a cache
A configuration is a triple c=<s,b,a> (size,bsize,assoc)Sensitivity analysis
Fix b and a and let s variable ⇒ Pdsmin, Pds
max
Fix s and a and let b variable ⇒ Pdbmin, Pdb
max
Fix s and b and let a variable ⇒ Pdamin, Pda
maxSorting starting from the most sensitive to the less one (e.g. s,a, b)
ExplorationFix a=a0, b=b0 and make s variable ⇒ sopt
Fix s=sopt, b=b0 and make a variable ⇒ aopt
Fix s=sopt, a=aopt and make b variable ⇒ bopt
Maurizio Palesi [[email protected]] 30
SA + GA = SAGASA + GA = SAGAParametersParameters
Sensitivityanalysis
Sensitivityanalysis
High-sensitiveparameters
High-sensitiveparameters
GAGA
ChromosomeChromosome
ReferenceparametersReferenceparameters
ThresholdThreshold
Maurizio Palesi [[email protected]] 31
Overall ResultsOverall Resultsdepsims % dist % sav % dist % sav % dist % sav % dist % sav
adpcm 14694 0.04 89.21 0.02 78.75 0.00 80.53 0.03 91.40bcnt 22642 0.02 93.01 0.01 89.67 0.00 83.99 0.03 95.42blit 37787 0.04 95.82 0.00 90.99 0.02 93.09 0.05 97.03compress 14035 0.03 89.02 0.01 75.24 0.01 60.41 0.03 91.12crc 21205 0.04 92.94 0.01 83.75 0.01 93.78 0.03 96.42des 21970 0.02 93.00 0.01 82.60 0.00 58.77 0.03 93.18engine 15532 0.03 90.36 0.02 78.44 0.00 70.35 0.03 91.93fir 16832 0.03 91.23 0.01 82.60 0.00 67.77 0.03 92.43g3fax 27382 0.04 94.26 0.02 86.20 0.01 93.09 0.02 96.83jpeg 15996 0.04 90.61 0.01 82.15 0.00 57.48 0.03 91.87pocsag 19102 0.04 91.49 0.01 83.93 0.00 71.49 0.03 94.66qurt 13231 0.03 88.68 0.00 77.58 0.00 73.92 0.02 91.63ucbqsort 19102 0.03 91.82 0.01 84.60 0.00 84.21 0.03 94.66Average 0.03 91.65 0.01 82.81 0.01 76.07 0.03 93.74
sagaBenchmark ga depga pbsa
Maurizio Palesi [[email protected]] 32
VLIW Embedded SystemVLIW Embedded SystemSimultaneous interaction between
Hardware parametersSoftware parameters
Analysis of the tradeoffsPower/PerformanceEnergy/Performance
Maurizio Palesi [[email protected]] 33
Instruction Level ParallelismInstruction Level ParallelismHigh performance processors in the 1980s:maximize ILP
Issue more than one single instruction in agiven clock cycleWho decides which instructions can beexecuted in parallel?
Two different philosophiesSuperscalarVery Long Instruction Word (VLIW)
Maurizio Palesi [[email protected]] 34
ILP philosophy: SuperscalarILP philosophy: SuperscalarHide the process of finding ILPILP is discovered dynamically at run-time by thecontrol hardware of the processor
HWOp1,Op2Op3Op4,Op5…
Foo.c
Op1Op2Op3Op4Op5…
compiler
Instruction stream
Run-time
Maurizio Palesi [[email protected]] 35
ILP philosophy: VLIWILP philosophy: VLIWHardware resources are architecturally visible to thecompilerCompiler can create a sequence of Very LongInstructions that defines the plan of executionHW simply execute the plan
HWFoo.cOp1,Op2
Op3Op4,Op5
compiler
Hardware resources configuration
Plan of execution
Run-time
Maurizio Palesi [[email protected]] 36
VLIW past & futureVLIW past & futureDecline of VLIWs for general purpose systems
Couldn’t be integrated in a single chipBinary compatibility between implementations
Rediscovery of VLIW in embeddedNo more integrability issuesBinary incompatibility not relevant
SW is very specific and supplied with the HW
Advanteges of VLIWSimplified hardwareOptimize ad-hoc the architecture to achieve ILP
Maurizio Palesi [[email protected]] 37
ASIP VLIW in Embedded SystemASIP VLIW in Embedded SystemASIP based on VLIW architecturesExamples are
TI C6x familyAgere/Motorola StarCore architectureSun MAJCFujitsu FR-VST HP/ST LxPhilips TriMedia…
Maurizio Palesi [[email protected]] 38
Reference architecture (HPL-PD)Reference architecture (HPL-PD)
L2 U
nifie
d C
ache
L2 U
nifie
d C
ache
PrefetchCache
PrefetchUnit
FetchUnit Instruction
Queue
Dec
ode
and
Con
trol
Log
ic
PredicateRegisters
BranchRegisters
GeneralPruposeRegisters
FloatingPoint
Registers
ControlRegisters
Load/StoreUnit
BranchUnit
IntegerUnit
FloatingPointUnit
L1 D
ata
Cac
heL1
Dat
aC
ache
L1 In
stru
ctio
nC
ache
L1 In
stru
ctio
nC
ache
Maurizio Palesi [[email protected]] 39
An Open Platform: EPIC ExplorerAn Open Platform: EPIC ExplorerInterfacing to the Trimaran framework thatprovide VLIW compiler and simulator for dynamicstatisticsEstimator component implementing high levelmodelsExplorer component implementing multi-objectivedesign space exploration algorithms
EPICExplorerhttp://epic-explorer.sourceforge.net
Maurizio Palesi [[email protected]] 40
The Exploration Data FlowThe Exploration Data Flow
IMPACTIMPACTFoo.cFoo.c
System configuration
ProcessorProcessor
MemoryMemory
EmulibEmulib
foo.exefoo.exe
Execution statisticsExecution statistics
EstimatorEstimator
EnergyEnergy PowerPowerCyclesCycles
ExplorerExplorer
ELCORELCOR
HMDES
Maurizio Palesi [[email protected]] 41
Energy estimationEnergy estimationSubdivide architecture in Functional Block Unit (FBU)
Instruction decode logic, Integer units, floating point units, registerfiles
For each FBU (from ST Microelectronics LX)Active power: average power dissipated when the FBU is usedInactive power: average power dissipated when the FBU is not used
From the execution statistic, we know how many cycles eachFBU has been active/inactive
EFBU=(Pactive× cyclesactive+ Pinactive×cyclesinactive) × Tclock
Discrete degree of accuracy (about 25%)investigate relative power savings beetween designs
Maurizio Palesi [[email protected]] 42
Configuration SpaceConfiguration SpaceThree main parameter categories:
VLIW core:Number of Registers in each register file (from 16 to 256)Number of istancies for Functional Units of each type (from 1 to 6)
Mem Hierarchy:Size, Blocksize, Associativity for each of the caches
(L1 Instruction, L1 Data, L2)
Compiler:Conservative compilation strategy (basic blocks)Aggressive ILP oriented compilation strategy (hyperblocks)
Total space size: 1.47 x 1013 configurations!
Maurizio Palesi [[email protected]] 43
ExplorationExploration MethodologyMethodologyPreliminary analisys of compilation
Impact of ILP oriented code transformationsPredict the right compilation strategy
Basic Blocks (conservative)Hyper Blocks (aggressive, ILP-oriented)
Multi-objective Design Space ExplorationExtract Pareto Set
Maurizio Palesi [[email protected]] 44
PreliminaryPreliminary AnalisysAnalisys (1/3) (1/3)For each objective, Unpaired two sample t-test allows toestimate the average effect of hyperblock formation
ConfigurationSpace
CN
CH
Random subsets of n configurations
T-test
ON
OH
Compilation with (H)and without (N)hyperblock formation
Is the mean effect on the objectivesignificant respect to the chosencritical difference?
Maurizio Palesi [[email protected]] 45
PreliminaryPreliminary AnalisysAnalisys (2/3) (2/3)Example of a metric for critical difference in means: δ > 50% M
δM
Maurizio Palesi [[email protected]] 46
PreliminaryPreliminary AnalisysAnalisys (3/3) (3/3)
-56.6+6.4359.60-0.24+0.090.54-23.83+2.5821.55gsm-dec
-0.97+0.121.40-0.27+0.090.79-0.26+0.080.68fir
-27.74+7.358.54-0.5+0.121.02-6.19+3.3124.2adpcm-dec
-32.4+5.965.53-0.39+0.080.76-7.23+2.9522.76g721-enc
-3.48+9.8862.500.25+0.160.88-5.28+4.8533.39mpeg-dec
-8.56+3.7346.12-0.89+0.141.258.17+2.215.8adpcm-enc
-2.31+1.019.72-0.07+0.090.89-0.97+0.514.07jpeg
55.84+9.8279.28-0.48+0.140.8833.25+4.7936.62gsm-enc
30.82+4.5549.010.38+0.161.646.76+1.8416.64ieee810
µN-µH∆µN-µH∆µN-µH∆
Energy (mJ)Power (W)Time (ms)Application
ILP-oriented compilation impact (positive,negative)
Maurizio Palesi [[email protected]] 50
OutlineOutlineDesign Space Exploration of ParameterizedSystemsNetwork on Chip ArchitecturesInstruction Level Power Estimation
Maurizio Palesi [[email protected]] 51
The Evolution of SoC PlatformsThe Evolution of SoC Platforms
General-purposeScalable RISCProcessor• 50 to 300+ MHz• 32-bit or 64-bit
Library of DeviceIP Blocks• Imagecoprocessors• DSPs• UART• 1394• USB
Scalable VLIWMedia Processor:• 100 to 300+ MHz• 32-bit or 64-bit
Nexperia™System Buses• 32-128 bit
2 Cores: Philips’ Nexperia PNX8850 SoC platform for High-end digital video (2001)
Maurizio Palesi [[email protected]] 52
Running Forward…Running Forward…
Four 350/400 MHz StarCore SC140DSP extended cores16 ALUs: 5600/6400 MMACS1436 KB of internal SRAM & multi-level memory hierarchyInternal DMA controller supports 16TDM unidirectional channels,Two internal coprocesssors (TCOPand VCOP) to provide special-purpose processing capability inparallel with the core processors
6 Cores: Motorola’s MSC8126 SoC platform for 3G base stations (late 2003)
Maurizio Palesi [[email protected]] 53
What´s Happening in SoCs?What´s Happening in SoCs?Technology: no slow-down in sight!
Faster and smaller transistors: 90 → 65 → 45 nm… but slower wires, lower voltage, more noise!
80% or more of the delay of critical paths will be due tointerconnects
Design complexity: from 2 to 10 to 100 cores!Design reuse is essential…but differentiation/innovation is key for winning on themarket!
Performance and power: GOPS for MWs!Performance requirements keep going up…but power budgets don’t!
Maurizio Palesi [[email protected]] 55
Communication ArchitecturesCommunication ArchitecturesShared bus
Low areaPoor scalabilityHigh energy consumption
Network-on-ChipScalability and modularityLow energy consumptionIncrease of design complexity
Shared bus
IP IP IP
IP IP IP
IP IP IP
IPIPIP
IP IP IP
Maurizio Palesi [[email protected]] 56
Design Space Exploration for NoCDesign Space Exploration for NoC
Ogras et al., ASAP’05
Maurizio Palesi [[email protected]] 57
NOC Research TopicsNOC Research TopicsTopological Mapping ProblemRouting Strategies
Routing FunctionSelection Function
Maurizio Palesi [[email protected]] 59
The Mapping ProblemThe Mapping Problem
Application
Graph of concurrent tasks
IPLibrary
ASIC1
CPU1DSP1
CPU2MEM1
Mapping(NP-hard)
NOC
Theapplication isdivided into a
graph ofconcurrent
tasks
Theapplicationtasks are
assigned andscheduled
Decide to whichtile each
selected IPshould be
mapped suchthat the metricsof interest are
optimized
11 22 33
Maurizio Palesi [[email protected]] 62
Design Space ExplorationDesign Space Exploration
Communication characteristics Trace files Communication statistics …
Communication characteristics Trace files Communication statistics …
MappingEvaluatorMapping
EvaluatorMapping Space
ExplorerMapping Space
Explorer
BestMappings
BestMappings
Initial mapping New mapping
Constraints Bandwidth Delay Jitter …
Constraints Bandwidth Delay Jitter …
Delay Throughput Power/Energy …
Maurizio Palesi [[email protected]] 63
Exploration EnginesExploration EnginesGAMAP
Multi-objective mapping based on genetic algorithmsPBBB
Pareto-based Branch & Bound approachAdaptation of the approach proposed by Hu and Marculescu[ASPDAC03] for a multi-objective exploration of the mappingspace
PBNMAPPareto-based NMAP approach
Adaptation of the approach proposed by Murali and De Micheli[DATE04] for a multi-objective exploration of the mapping space
– For the XY routing
Maurizio Palesi [[email protected]] 64
GAMAPGAMAPChromosome: A representation of the solution tothe problem (the mapping)
Each tile in the mesh has an associated gene whichencodes the identifier of the core mapped in the tileIn an n×m mesh
The chromosome is formed by n×m genesThe i-th gene encodes the identifier of the core in the tile in rowi/n and column i%m
2 0 3 1
Chromosome2 0
3 1
NOC
Maurizio Palesi [[email protected]] 65
GAMAPGAMAPGenetic operators
Hot-spot
crossover
BA
A B
mutation
Maurizio Palesi [[email protected]] 66
PBBBPBBBMulti-objective extension of the Branch-and-Bound algorithm
Core 1Core 2Core 3Core 4
Communicationdemand
Search tree
Maurizio Palesi [[email protected]] 67
PBNMAPPBNMAPMulti-objective extension of the NMAP algorithm proposed by Murali and DeMicheli in [DATE04] for the single XY routing
The core that hasmaximum communicationdemand is placed ontoone of the mesh nodes
with maximum number ofneighboring
For each core yet to be mapped, thecore that communicate most with thealready mapped cores is selected. Thecore is placed onto the mesh node thatminimize the communication cost with
mapped cores.
Maurizio Palesi [[email protected]] 68
ExperimentsExperimentsReal Traffic (MPEG-2)Real Traffic (MPEG-2)
Maurizio Palesi [[email protected]] 69
ExperimentsExperimentsReal Traffic (MPEG-2)Real Traffic (MPEG-2)
0 1 2 30 ASIC5 ASIC4 DSP2 CPU21 CPU1 ASIC3 MEM1 ASIC12 DSP1 ASIC2 IP3 IP13 MEM2 DSP3 IP2 IP4
Execution time 41.0 msEnergy: 20.8 mJ
High performance mapping
0 1 2 30 ASIC5 ASIC3 ASIC1 DSP21 DSP1 ASIC4 MEM1 CPU22 CPU1 ASIC2 IP3 IP13 MEM2 DSP3 IP2 IP4
Execution time 42.4 msEnergy: 17.8 mJ
Low energy mapping
Maurizio Palesi [[email protected]] 70
ExperimentsExperimentsEfficiencyEfficiency
GAMAP 840 simsPBBB 7238 simsPBNMAP 2670 sims
SpeedupPBBB = 8.6SppedupPBNMAP = 3.2
Maurizio Palesi [[email protected]] 71
ExperimentsExperimentsEfficiencyEfficiency
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
Region
2x2
Region
2x2o
verlap
Nsew4x
4
Nsewbia
s4x4
Neighb
oring
4x4
Rando
m4x4
Averag
esi
mul
atio
nsGAMAP PBNMAP PBBB
Maurizio Palesi [[email protected]] 73
RoutingRoutingDeterministic
Path completely determined by the source andthe destination address
AdaptiveThe path taken by a particular packet dependson dynamic network conditions
Fault,Congestion, …
Routing packets along alternative paths inorder to avoid congested areas
Maurizio Palesi [[email protected]] 74
Routing and SelectionRouting and Selection
Routingfunction
Routingfunction
Selectionfunction
Selectionfunction
message
Output portsSet of possibleoutput ports
Maurizio Palesi [[email protected]] 75
Neighbors on PathNeighbors on PathCongestion-Aware RoutingCongestion-Aware Routing
Maurizio Palesi [[email protected]] 76
SituationsSituations of of IndecisionIndecision
0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.15
10
15
20
25
30
35
40
45
packet injection rate (packets/cycle/node)
% s
ituaz
ioni
inde
cisi
onal
i
UniformM1M2
Packet injection rate (packets/cycle/node)
% In
deci
sion
al s
ituat
ions
Maurizio Palesi [[email protected]] 77
SelectionSelection ProblemProblem
They both can accept the flit
Maurizio Palesi [[email protected]] 78
SelectionSelection ProblemProblem
Send the packet to the nodethat can early dispatches it witha higher probability
Maurizio Palesi [[email protected]] 80
x x
x
o o
o
NoPCARNoPCAR AlgorithmAlgorithm
xo
Not available
AvailableNESWox-oScore=2
NESWxox-Score=1
Maurizio Palesi [[email protected]] 81
NoPCARNoPCAR AlgorithmAlgorithm
xo
Not available
Availablex
x
o
o
o
Maurizio Palesi [[email protected]] 82
NoPCARNoPCAR AlgorithmAlgorithm
xo
Not available
Availablex
x
o
o
o
NESWox-oxoxxxxxxScore=0
NESWxox-ooxxxoxxScore=1
Route a flit towards the router which has themost possibility to send the flit to neighborsand is part of a routing path leading to thedestination
Route a flit towards the router which has themost possibility to send the flit to neighborsand is part of a routing path leading to thedestination
Maurizio Palesi [[email protected]] 83
NoPCAR algorithmNoPCAR algorithmNoPCAR_Selection(in AC: set of admissible output channels, nd: destination node, out SC: selected output channel){ credits[] = 0; foreach channel1 in AC { node1 = target(channel1); foreach channel2 in R(node1, nd) { node2 = target(channel2); if bsv[node1][node2] is available then credits[channel1]++; } } sc = ch s.t. credits[ch] = max(credits[]);}
Maurizio Palesi [[email protected]] 84
Block Diagram of the RouterBlock Diagram of the Router
Synthesised with Synopsys Design Compiler0.13 um tech. Library from Virtual SiliconFlit size: 64 bitFIFO size 3 flitsBuffer area significantly dominatesthe logic
Area overhead due to NoPCAR is less then 6%
Maurizio Palesi [[email protected]] 85
0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05 0.055 0.060
50
100
150
200
250
300
Average packet injection rate [packets/cycle/node]
Ave
rage
pac
ket d
elay
[cyc
les]
OELAXY
0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05 0.055 0.060
50
100
150
200
250
300
Average packet injection rate [packets/cycle/node]
Ave
rage
pac
ket d
elay
[cyc
les]
OELAXY
M1: (i,j) (N-1-j,N-1-i) M2: (i,j) (j,i)
M1/M2 M1/M2 TrafficTraffic
Maurizio Palesi [[email protected]] 86
0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.040
50
100
150
200
250
300
Average packet injection rate [packets/cycle/node]
Ave
rage
pac
ket d
elay
[cyc
les]
OELAXY
0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.040
50
100
150
200
250
300
Average packet injection rate [packets/cycle/node]
Ave
rage
pac
ket d
elay
[cyc
les]
OELAXY
1 node HS (hsp=15%) 5 nodes HS (hsp=40%)
Hot-Spot Hot-Spot TrafficTraffic (1/2) (1/2)
Maurizio Palesi [[email protected]] 87
0 0.005 0.01 0.015 0.02 0.025 0.030
50
100
150
200
250
300
Average packet injection rate [packets/cycle/node]
Ave
rage
pac
ket d
elay
[cyc
les]
OELAXY
(2,2) (3,3)
0 0.005 0.01 0.015 0.02 0.025 0.030
50
100
150
200
250
300
Average packet injection rate [packets/cycle/node]
Ave
rage
pac
ket d
elay
[cyc
les]
OELAXY
(1,1) (4,4)
Hot-Spot Hot-Spot TrafficTraffic (2/2) (2/2)
Maurizio Palesi [[email protected]] 88
OutlineOutlineDesign Space Exploration of ParameterizedSystemsNetwork on Chip ArchitecturesInstruction Level Power Estimation
Maurizio Palesi [[email protected]] 89
Why Low Power Design?Why Low Power Design?Mobile applications
Large & heavy devicesBattery life
High-performance devicesHeat dissipation problemsExpensive packaging
ReliabilityFault probability doubling for every 10°Cincrease in temperature
Maurizio Palesi [[email protected]] 90
SoC Architecture: an exampleSoC Architecture: an exampleSystem bus
Peripheral bus
AMS VCs
Logic VCs
Embedded SoftwareRTOS, Drivers, Applications
µP
Application logic
Memory control
Glue logic
Test EIDE USB10Base-T
PCI
Bridge
PLL
ROM
DSP
MPEGDecode
AudioCodec
Video I/F
RAM
FLASH
time
pow
er
Maurizio Palesi [[email protected]] 91
The Average Cost ModelThe Average Cost Model“By measuring the power absorbed by a processor Xrepeating certain instructions or short sequences ofinstructions it is possible to obtain much of theinformation needed to calculate the power dissipatedwhen the processor X executes the instructions of anyprogram Y”.
Tiwari et. al ’94
Maurizio Palesi [[email protected]] 92
Case StudyCase StudySTMicroelectronics ST20 C2P
Stack based architectureStrong data dependency
STMicroelectronics LXVLIW architecture
Work supported by CNR (Project MADESS II) and STMicroelectronics
Maurizio Palesi [[email protected]] 93
The Average Cost ModelThe Average Cost Model
add r1, r2, r3add r1, r2, r3add r1, r2, r3add r1, r2, r3add r1, r2, r3add r1, r2, r3add r1, r2, r3… … … … …add r1, r2, r3add r1, r2, r3add r1, r2, r3add r1, r2, r3
Transistor- or Gate-level
Power estimator
Instruction Costadd 1.5sub 1.9lw 4.2sw 7.9…
Maurizio Palesi [[email protected]] 94
Inter-instruction EffectsInter-instruction Effects
15.3052Add
353%58.223Stl
0%9.6162Stl
85%16.5719Stl
25.885LdnlpIncr.CostInstr.
Maurizio Palesi [[email protected]] 95
Code OptimizationCode Optimization…ldl 0ldl 1addldl 2addldl 3addldl 4...ldl 28addldl 29
...ldl 0ldl 1ldl 2addaddldl 3ldl 4add...ldl 28ldl 29add
13 µJ 7.3 µJ
44% saving!
Maurizio Palesi [[email protected]] 96
Data DependencyData Dependency
Clock cycles
EU
cur
rent
abs
orpt
ion
(mA
)
Low activityHigh activity
Maurizio Palesi [[email protected]] 99
Data Dependent ModelData Dependent ModelE(IN) = base(IN) + Int(IN,IN-1) + Data(IN)
Maurizio Palesi [[email protected]] 101
ExperimentsExperiments
Real ACM DDM(mA) (mA) (mA) ACM DDM ACM DDM
sum_low 6,79 13,44 6,8 97,77% 0,12% 97,79% 0,86%sum_high 14,24 13,44 14,31 5,58% 0,50% 20,37% 3,47%rand_inst 40,62 48,14 40,12 18,51% 1,24% 31,37% 6,44%mtx_sum 31,97 37,83 30,97 18,32% 3,13% 31,44% 6,81%mtx_mul 36,01 41,26 34,02 7,07% 5,53% 27,21% 8,61%dft 32,07 28,57 30,92 10,91% 3,54% 15,13% 8,94%
26,36% 2,34% 37,22% 5,86%
Benchmark Relative Error Absolute Error
Average
Maurizio Palesi [[email protected]] 102
OutlineOutlineDesign Space Exploration of ParameterizedSystemsNetwork on Chip ArchitecturesInstruction Level Power Estimation