F. Catthoor © imec 2004
Requirements of system level design approach
Algorithms + Data Structures
[Figure: architecture blocks (ARM, IP1, IP2, RAM, ROM) and platform architecture blocks (microprocessor, DSP, custom logic, MMU, RAM, ROM)]
Current challenges and solutions
• System Specification and System-level Refinement with Exploration Support (algorithm design level, concurrent task level, system timing simulation)
• Data Transfer and Storage Exploration for Massive Real Time Data Manipulation (dynamic memory management, static transfer and storage, address generation)
• Co-Design for Heterogeneous Implementation Paradigms (refinement from unified HW/SW model, RTOS modeling, complete system simulation)
• RF front-end exploration (fast mixed-signal co-simulation, chip-package co-design, noise coupling)
Task vs Array data vs Instr level issues
[Figure: refinement flow starting from the optimized system specification. Labels: task level issues; task-level system architecture (Task1, Task2, Task3); array data level issues; array-level system architecture; instr.-level issues (arithmetic + local control + address issues); proc.-level system architecture (Proc1, Proc2, Proc3)]
Concurrency versus DTSE issues
[Figure: refinement flow starting from the optimized system specification. Labels: data transfer and storage exploration issues; DTSE optimized specification; concurrency issues; proc.-level system architecture (Proc1, Proc2, Proc3 with arithmetic + local control + addressing); background memories]
Why fix data storage/transfer before concurrency management issues?
Recursive image processing algorithm on local neighbourhoods:
(i : 0 .. I-1) ::
  (j : 0 .. J-1) ::
    img[i][j] = f(img[i][j-k], old_img[i][j]);
[Figure: image of I rows by J columns]
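The recurrence on this slide can be written out as a minimal Python sketch. The function name `recursive_filter` and the `boundary` parameter are illustrative assumptions: the slide does not say what is read when `j-k` falls left of the image, so a fixed boundary value is used here.

```python
def recursive_filter(old_img, k, f, boundary=0):
    """Apply img[i][j] = f(img[i][j-k], old_img[i][j]) over an I x J image."""
    rows, cols = len(old_img), len(old_img[0])
    img = [[boundary] * cols for _ in range(rows)]
    for i in range(rows):        # rows do not depend on each other
        for j in range(cols):    # column j needs the result from k columns back
            # Assumption: positions left of the image read a fixed boundary value.
            prev = img[i][j - k] if j >= k else boundary
            img[i][j] = f(prev, old_img[i][j])
    return img


if __name__ == "__main__":
    # With f = addition and k = 1 each row becomes a running sum.
    print(recursive_filter([[1, 2, 3], [4, 5, 6]], k=1, f=lambda a, b: a + b))
```

The only dependency is along a row, at distance k. That is why the following slides compare unrolling the i loop (independent rows) with unrolling the j loop.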
Why fix data storage/transfer before concurrency management issues?
For a given speed-up M: minimally M data-paths for f()
Unrolling the i loop (limited by I): M J-word double-buffered FIFOs
[Figure: image of I rows by J columns; annotation: 14.4 mm² (0.7 µm)]
Why fix data storage/transfer before concurrency management issues?
Unrolling the j loop (limited by k): M - 1 buffer registers
(i : 0 .. I-1) ::
  (j : 0 .. (J div 2)-1) ::
  begin
    img[i][2j-1] = f(img[i][2j-k-1], ... old_img[i][2j-1]);
    img[i][2j] = f(img[i][2j-k], ... old_img[i][2j]);
  end;
[Figure: image of I rows by J columns]
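A hedged Python sketch of the j-loop unrolling for M = 2, checked against the straightforward loop. The function names are illustrative. Two things are assumptions rather than content of the slide: reads left of the image return a fixed `boundary` value, and the body uses columns `2j` and `2j+1` (instead of `2j-1` and `2j`) so that zero-based indices stay in range.

```python
def reference(old_img, k, f, boundary=0):
    """Original loop nest: img[i][j] = f(img[i][j-k], old_img[i][j])."""
    rows, cols = len(old_img), len(old_img[0])
    img = [[boundary] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            prev = img[i][j - k] if j >= k else boundary
            img[i][j] = f(prev, old_img[i][j])
    return img


def unrolled_by_two(old_img, k, f, boundary=0):
    """Same computation with the j loop unrolled by a factor of 2."""
    rows, cols = len(old_img), len(old_img[0])
    if cols % 2 != 0:
        raise ValueError("this sketch assumes an even number of columns")
    img = [[boundary] * cols for _ in range(rows)]
    for i in range(rows):
        row = img[i]
        for j in range(cols // 2):
            a, b = 2 * j, 2 * j + 1
            # For k >= 2 neither statement reads what the other writes,
            # so two data-paths could evaluate them side by side.
            row[a] = f(row[a - k] if a >= k else boundary, old_img[i][a])
            row[b] = f(row[b - k] if b >= k else boundary, old_img[i][b])
    return img


if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8]]
    combine = lambda prev, cur: 2 * prev + cur
    for k in (1, 2, 3):
        print(k, unrolled_by_two(image, k, combine) == reference(image, k, combine))
```

Executed sequentially, both versions give the same image for any k. The two statements in the unrolled body are independent only when k is at least the unrolling factor, which is presumably what "limited by k" refers to.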
Global data management design flow for dynamic concurrent tasks with data-dominated behaviour
[Figure: design flow. Stages: concurrent OO spec; data type refinement; task concurrency mgmt; physical memory mgmt; address optimization; SW/HW co-design; SW design flow and HW design flow. Other labels: dynamic data types (key/data records, binary tree (BT)); virtual memory mgmt (sub-pool per size, free blocks); memory allocation and assignment (processor, memories); memory mgmt unit, memory controller, ASUs]
Data Management Flow
[Figure: labels: dynamic data type exploration; physical memory management; concrete data types; virtual memory segments; physical memories. Steps: DDT (dynamic data type) trafo & refinement; dynamic memory mgmt refinement; physical memory mgmt refinement]
Data-transfer and data-storage bottlenecks: SDRAM access
[Figure: SDRAM main memory serving a client over a 128 to 1024 bit data bus. Labels: cache and bank combine; local latch and local select for bank 1 to bank N; global bank select/control with addr and ctrl inputs. Annotations: wide word, burst mode]
Data-transfer and data-storage bottlenecks: cache misses
[Figure: memory hierarchy. Labels: processors with data-paths and register file; L1 cache (16 kB N-port SRAM); L2 cache (1 MB 1/2-port SRAM); main memory (256 MB (S)DRAM). Annotations: many cache misses, page loading]
Data-transfer and data-storage bottlenecks: system bus load
[Figure: labels: system chip (datapaths, L1 cache); L2 cache on the L2 bus; main memory and other system resources on the main system bus; hard disk on the disk access bus]
Multi-processor System Design
Image Proc System
• Standard subsystem: detailed solution locally optimized by expert. E.g.: 2D convolution
• Subsystem resembles a standard solution but needs small adaptations. E.g.: DCT for specific coder
• New complex subsystem. E.g.: quadtree coder
Locally optimized
Globally optimized => exploration!
[Figure: subsystems connected through buffers]
Platform design requires change
[Figure: cartoon map. Labels: multi-media platform city; traditional architecture city; traditional compiler boulevard; power volcano (multi-media); processor trend = application engineer; cobblestone bypass road (requires paving)]
Ad-Hoc Design: Backtracking ?
[Figure: path from system specification to memory organization, with question marks at every decision]
Systematic System Exploration
[Figure: path from system specification to memory organization; at each step one alternative is marked "!" among alternatives marked "?"]
Data Transfer & Storage Exploration (DTSE) Principles
[Figure: memory hierarchy. On chip: processor data paths, L1 cache, L2 cache. Off-chip SDRAM: cache & bank recombine, local latch 1 + bank 1 to local latch N + bank N]
Data Transfer & Storage Exploration (DTSE) Principles
[Figure: memory hierarchy. On chip: processor data paths, L1 cache, L2 cache. Off-chip SDRAM: cache & bank recombine, local latch 1 + bank 1 to local latch N + bank N. Annotation: ANALYSIS !]
Main Data Transfer & Storage Principles
1 Reduce redundant transfers
2 Introduce locality
3 Exploit memory hierarchy
4 Avoid N-port memories
5 Meet real-time constraints
6 Exploit limited life-time and data layout freedom
[Figure: the six principles annotated on the memory hierarchy (processor data paths, L1 cache, L2 cache, off-chip SDRAM with cache & bank recombine and local latch + bank 1 to N)]
Time-Efficient System Exploration Design Flow
[Figure: labels: initial system specification; design alternatives; system-level feedback; accurate cost figures to guide decision; fast implementation with tools]
Physical Memory Management
Cavity detection application: medical imaging
Initial description
function GaussBlur
function ComputeEdges
function DetectRoots
function LabelRoots
(image_in : W[N][N]) image_out : W[N][N] = ...
Every function computes new matrix information from the output of the previous step. The new value of a pixel depends on its neighbors.
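The structure described here, a chain of neighbourhood operations that each produce a full N x N image, can be sketched as follows. This is only an illustration of the data flow: the stage bodies are placeholders chosen for the example, not the actual GaussBlur or ComputeEdges kernels, and the helper names `apply_neighbourhood` and `run_pipeline` are assumptions.

```python
def apply_neighbourhood(image, combine):
    """Return a new N x N image where each pixel is combine(centre, 3x3 window)."""
    n = len(image)
    out = [[0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            # Window is clipped at the image border (an assumption).
            window = [image[yy][xx]
                      for yy in range(max(0, y - 1), min(n, y + 2))
                      for xx in range(max(0, x - 1), min(n, x + 2))]
            out[y][x] = combine(image[y][x], window)
    return out


# Placeholder stage bodies, NOT the real cavity detector kernels.
def gauss_blur(image):
    return apply_neighbourhood(image, lambda c, w: sum(w) // len(w))


def compute_edges(image):
    return apply_neighbourhood(image, lambda c, w: max(abs(c - v) for v in w))


def run_pipeline(image, stages):
    """Run the stages in order, keeping every intermediate N x N image."""
    intermediates = []
    for stage in stages:
        image = stage(image)
        intermediates.append(image)
    return image, intermediates


if __name__ == "__main__":
    flat = [[7] * 4 for _ in range(4)]
    result, kept = run_pipeline(flat, [gauss_blur, compute_edges])
    print(len(kept), "intermediate images of", len(flat) ** 2, "words each")
```

In this form every stage writes a whole image before the next stage reads it. These intermediate arrays are the kind of storage and transfer cost that the transformations summarised on the results slide aim to reduce.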
Cavity detector results: overall summary
[Figure: bar chart, vertical axis 0 to 600, showing accesses, size and cycles for each stage: Original, DF trafo, Loop trafo, Data reuse, In-place, Data layout, ADOPT - modulo, ADOPT - rest]
Conclusions for DTSE stage
• Typically an order of magnitude can be gained on system bus load
• As a result, the energy consumption in the data memory hierarchy is also reduced by about the same factor
• The system performance (board level) also improves significantly, because there is less competition for resources on these system busses
• Penalty on code size is small (less than 20%)
• Typically the pure CPU speed is improved IF there was a data transfer bottleneck that could not be "hidden" by overlapping the computation and communication in the original code (which was certainly so for the cavity detector)
Task- versus Proc./Instr-level: mapping
[Figure: Task1, Task2 and Task3 mapped onto Proc1, Proc2, Proc3, Array Proc1 and Array Proc2]
Pareto curves allow task trade-off decision: DAB illustration
[Figure: energy versus execution time curves for TASK-1, TASK-2 and TASK-3, mapped on two processors. Source: Digital Audio Broadcast]
Pareto curves allow task trade-off decision
[Figure: energy versus execution time curves for TASK-1, TASK-2 and TASK-3. Annotation: single proc., large mem. overhead. Source: Digital Audio Broadcast]
Pareto curves allow task trade-off decision
[Figure: energy versus execution time curves for TASK-1, TASK-2 and TASK-3. Source: Digital Audio Broadcast]
Trade-offs in memory organisation (e.g. voice coder SW controlled cache)
[Figure: relative power (cache power and main memory power, axis 0 to 2) versus cache size in words (32w, 64w, 96w, 128w, 256w, 512w)]
Gain in power of additional factor 6 compared to optimized (platform independent code)
Global concurrency management design flow for dynamic concurrent tasks with data-dominated behaviour
[Figure: design flow. Stages: concurrent OO spec; data type refinement; task concurrency mgmt; physical memory mgmt; address optimization; SW/HW co-design; SW design flow and HW design flow. Other labels: transform (Task1, Task2, Task3); task schedule, allocate/assign; unified model, partition, refine/compile (uProc, DSP, HW); system control (HW-Ctrl, uCtrl); memory organ.]
Why are Applications becoming more dynamic and concurrent?
The workload decreases but the tasks are dynamically created and their size is data dependent
[Figure: MPEG4 versus JPEG task structure, with tasks T1, T1', T2, T3, T4]
Terminal QoS (3D demonstrator)
Reduce global system energy by task scheduling + assignment (e.g. 2-processor approach)
[Figure: TN1, TN2, ..., TNn assigned to ARM processor 1 (Vdd = 1 V) and ARM processor 2 (Vdd = 3.3 V)]
Codes'01, System Design Automation book Verlag'01
Tradeoff between time-budget and energy
Processor 1 (low Vdd): Vdd = 1.5 V, 5 nJ/instr., 2 TUs/instr.
Processor 2 (high speed): Vdd = 3.0 V, 20 nJ/instr., 1 TU/instr.

Tradeoff          Processor 1                      Processor 2                      Total
More time units   90 M instr., 180 M TUs, 450 mJ   20 M instr., 20 M TUs, 400 mJ    180 M TUs, 850 mJ
More energy       40 M instr., 80 M TUs, 200 mJ    70 M instr., 70 M TUs, 1400 mJ   80 M TUs, 1600 mJ
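The figures on this slide follow from the per-instruction costs of the two processors, as the small Python check below shows. The function name `evaluate` is illustrative, and the sketch assumes the two processors run in parallel, so the slower one sets the total time. That assumption matches the totals on the slide.

```python
# Per-instruction figures taken from the slide.
NJ_PER_INSTR = {"P1": 5, "P2": 20}   # energy per instruction in nJ
TU_PER_INSTR = {"P1": 2, "P2": 1}    # time units per instruction


def evaluate(minstr_p1, minstr_p2):
    """Return (time in M TUs, energy in mJ) for a split given in millions of instructions."""
    time_p1 = minstr_p1 * TU_PER_INSTR["P1"]
    time_p2 = minstr_p2 * TU_PER_INSTR["P2"]
    # 1e6 instructions x 1e-9 J per instruction = 1e-3 J, so M instr x nJ gives mJ.
    energy = minstr_p1 * NJ_PER_INSTR["P1"] + minstr_p2 * NJ_PER_INSTR["P2"]
    # Assumption: both processors work in parallel.
    return max(time_p1, time_p2), energy


if __name__ == "__main__":
    print(evaluate(90, 20))   # (180, 850): slower, lower energy
    print(evaluate(40, 70))   # (80, 1600): faster, higher energy
```

Both splits execute the same 110 M instructions. Moving work to the low-Vdd processor saves energy at the price of a longer time budget.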
Trade-off between time budget (period/latency) and cost (e.g. energy) leads to Pareto curves
[Figure: cost versus time curve with time budgets TB1 to TB6 on the time axis. The points are processor alloc/assign and scheduling alternatives for TNs in code version 1; points marked x are non-optimal]
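Separating Pareto points from non-optimal points can be sketched in a few lines of Python. The function name `pareto_front` and the sample numbers are invented for illustration. A point is kept only if no other point is at least as good in both time and cost.

```python
def pareto_front(points):
    """Keep the (time, cost) points that no other point dominates."""
    front = []
    best_cost = float("inf")
    # Sorted by time, then cost: a point survives only if it is cheaper
    # than every point that is at least as fast.
    for time, cost in sorted(set(points)):
        if cost < best_cost:
            front.append((time, cost))
            best_cost = cost
    return front


if __name__ == "__main__":
    # Illustrative (time, cost) alternatives, not data from the slides.
    alternatives = [(180, 850), (80, 1600), (120, 1700), (200, 900), (100, 1200)]
    print(pareto_front(alternatives))
```

In this example `(120, 1700)` and `(200, 900)` are dropped because a faster alternative is also cheaper. They play the role of the points marked x.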
Not single working point but Pareto curves needed in global trade-off
[Figure: energy (nJ, axis 0 to 3500) versus time budget (us, axis 0 to 250)]
Both data transfer-storage and concurrency aspects have to be combined!
Comparison of scheduling the original and transformed task-level descriptions
[Figure: energy (nJ, axis 0 to 3500) versus time budget (us, axis 0 to 250), one curve for the original and one for the transformed description]
Overall solution: combination of complex design-time and simple run-time schedulers
• Design-time scheduling: at compile time, exploring all the optimization possibilities
• Run-time scheduling: at run time, providing flexibility and dynamic control at low cost as part of synthesized RTOS
[Figure: thread frame 1 (nodes 1, 2, 3) and thread frame 2 (nodes A, B) each pass through design-time scheduling, giving a cost versus time curve per thread frame (TF 1, TF 2); run-time scheduling then yields one combined order (1 A B 3 2)]
Cases'00, ISSS'01, Design&Test - Sep.'01
Run-time: original Pareto point selection
[Figure: task energy versus task execution time curves for Task 1 and Task 2, and the resulting application energy versus application execution time curve with a time limit]
Run-time: one selection if new task enters
[Figure: task energy versus task execution time curves for Task 1, Task 2 and the new Task 3, and the resulting application energy versus application execution time curve with a time limit]
Run-time: better selection if new task enters
[Figure: task energy versus task execution time curves for Task 1, Task 2 and Task 3, and the application energy versus application execution time curve with a time limit; a different choice of Pareto points yields an energy gain]
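The run-time step on these slides, picking one Pareto point per task so that the application meets its time limit at the lowest energy, can be illustrated with an exhaustive search. This is a sketch under stated assumptions, not the scheduler used by the authors: `select_points` is a hypothetical name, the curves are invented numbers, and tasks are assumed to run one after another so that their times add up.

```python
from itertools import product


def select_points(task_curves, time_limit):
    """Pick one (time, energy) point per task, minimising total energy within the limit.

    Returns (total_energy, total_time, chosen_points), or None if nothing fits.
    """
    best = None
    for choice in product(*task_curves):
        total_time = sum(t for t, _ in choice)      # assumption: sequential tasks
        total_energy = sum(e for _, e in choice)
        if total_time <= time_limit and (best is None or total_energy < best[0]):
            best = (total_energy, total_time, choice)
    return best


if __name__ == "__main__":
    # Illustrative Pareto curves as (execution time, energy) pairs.
    task1 = [(2, 10), (4, 6), (6, 4)]
    task2 = [(1, 8), (3, 5)]
    task3 = [(1, 6), (2, 3)]
    print(select_points([task1, task2], time_limit=7))
    print(select_points([task1, task2, task3], time_limit=7))
```

When the third task enters, the best choice for Task 2 moves to its faster, more expensive point to make room. Re-selecting across all tasks can therefore beat keeping the earlier choices fixed. Exhaustive search grows exponentially with the number of tasks, which is why practical run-time schedulers rely on cheaper heuristics.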
Quality of Service (QoS) result
Energy (J)   no DVS   inter-task DVS   greedy heur.   DP
fps=5        17.53    14.32            6.211          6.171
fps=10       17.53    14.65            9.487          9.469
65% energy saving for 5 fps, 46% for 10 fps
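The quoted savings can be reproduced from the chart values. The sketch below assumes the percentages compare the DP result against the "no DVS" baseline; the slide does not state this explicitly, but the numbers agree. The names `ENERGY_J` and `saving_percent` are illustrative.

```python
# Energy in joules read from the chart, per frame rate and method.
ENERGY_J = {
    5: {"no DVS": 17.53, "inter-task DVS": 14.32, "greedy heur.": 6.211, "DP": 6.171},
    10: {"no DVS": 17.53, "inter-task DVS": 14.65, "greedy heur.": 9.487, "DP": 9.469},
}


def saving_percent(fps, method, baseline="no DVS"):
    """Energy saving of `method` relative to `baseline`, in percent."""
    row = ENERGY_J[fps]
    return 100.0 * (1.0 - row[method] / row[baseline])


if __name__ == "__main__":
    for fps in (5, 10):
        print(fps, "fps:", round(saving_percent(fps, "DP")), "% saving with DP")
```

The greedy heuristic lands within about one percent of the DP result at both frame rates.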