Requirements of system level design approach

F. Catthoor © imec 2004

[Figure: "Algorithms + Data Structures" and "Architecture" (ARM, IP1, IP2, RAM, ROM) above a "Platform architecture" (RAM, ROM, MMU, custom logic, DSP, microprocessor)]


Current challenges and solutions

• System Specification and System-level Refinement with Exploration Support (algorithm design level, concurrent task level, system timing simulation)

• Data Transfer and Storage Exploration for Massive Real-Time Data Manipulation (dynamic memory management, static transfer and storage, address generation)

• Co-Design for Heterogeneous Implementation Paradigms (refinement from unified HW/SW model, RTOS modeling, complete system simulation)

• RF front-end exploration (fast mixed-signal co-simulation, chip-package co-design, noise coupling)


Task vs Array data vs Instr level issues

[Figure: refinement chain from the optimized system specification to task-level, array-level, and processor-level system architecture. Task level issues: Task1, Task2, Task3. Array data level issues. Instr.-level issues: arithmetic + local control + address issues on Proc1, Proc2, Proc3.]


Concurrency versus DTSE issues

[Figure: concurrency issues and data transfer and storage exploration issues, between the optimized system specification, the DTSE optimized specification, and the processor-level system architecture (Proc1, Proc2, Proc3 with arithmetic + local control + addressing, plus background memories)]


Why fix data storage/transfer before concurrency management issues?

Recursive image processing algorithm on local neighbourhoods:

    (i : 0 .. I-1) ::
      (j : 0 .. J-1) ::
        img[i][j] = f(img[i][j-k], old_img[i][j]);

[Figure: image of I rows by J columns]


Why fix data storage/transfer before concurrency management issues?

For a given speed-up M: minimally M data-paths for f()

Unrolling i loop (limited by I): M J-word double-buffered FIFOs

[Figure: image of I rows by J columns, annotated 14.4 mm2 (0.7 um)]


Why fix data storage/transfer before concurrency management issues?

Unrolling j loop (limited by k): M - 1 buffer registers

    (i : 0 .. I-1) ::
      (j : 0 .. (J div 2)-1) ::
        begin
          img[i][2j-1] = f(img[i][2j-k-1], old_img[i][2j-1]);
          img[i][2j]   = f(img[i][2j-k], old_img[i][2j]);
        end;

[Figure: image of I rows by J columns]
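The "limited by k" remark can be checked with a small Python sketch. It is only an illustration under stated assumptions: the pixel function `f`, the rule that columns left of the image read as 0, an even J, and the column pair (2j, 2j+1) are all choices made here, not taken from the slides. Both reads are done before either write, which mimics M = 2 parallel data-paths.

```python
def f(left, old):
    # Hypothetical pixel function; the slides leave f unspecified.
    return (left + old) // 2

def pixel(row, j, k):
    # Assumed boundary rule: columns left of the image read as 0.
    return row[j - k] if j >= k else 0

def sequential(old_img, k):
    # Reference version: one data-path, columns visited left to right.
    img = [[0] * len(row) for row in old_img]
    for i, old_row in enumerate(old_img):
        for j in range(len(old_row)):
            img[i][j] = f(pixel(img[i], j, k), old_row[j])
    return img

def unrolled_by_2(old_img, k):
    # j loop unrolled by 2 (J assumed even). Both operands are read
    # before either result is written, as two parallel data-paths would.
    img = [[0] * len(row) for row in old_img]
    for i, old_row in enumerate(old_img):
        for j in range(len(old_row) // 2):
            a = pixel(img[i], 2 * j, k)
            b = pixel(img[i], 2 * j + 1, k)
            img[i][2 * j] = f(a, old_row[2 * j])
            img[i][2 * j + 1] = f(b, old_row[2 * j + 1])
    return img

old = [[10, 20, 30, 40], [5, 6, 7, 8]]
print(sequential(old, 2) == unrolled_by_2(old, 2))  # dependency distance k = 2
print(sequential(old, 1) == unrolled_by_2(old, 1))  # dependency distance k = 1
```

With k = 2 the two pixels of one iteration do not depend on each other, so the parallel version matches the sequential one. With k = 1 the second pixel needs the first one's result, and the two versions diverge.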


Global data management design flow for dynamic concurrent tasks with data-dominated behaviour

[Figure: design flow from a concurrent OO spec through data type refinement, task concurrency management, physical memory management, and address optimization to SW/HW co-design (SW design flow, HW design flow). Insets: dynamic data types (key/data records, binary tree (BT)); virtual memory management (sub-pool per size, free blocks); memory allocation and assignment; a target with processor, ASUs, memories, memory controller, and management unit.]


Data Management Flow

[Figure: three refinement stages: DDT (Dynamic Data Type) transformation and refinement, dynamic memory management refinement, and physical memory management refinement. Intermediate results are labelled concrete data types, virtual memory segments, and physical memories.]


Data-transfer and data-storage bottlenecks: SDRAM access

[Figure: client connected over a 128 - 1024 bit data bus to a main memory built from banks 1 to N, each with local select and local latch, followed by a cache and bank combination stage, under global bank select control (addr, ctrl). Annotations: wide word, burst mode.]


Data-transfer and data-storage bottlenecks: cache misses

[Figure: memory hierarchy from processor data-paths and register file through a 16 kB N-port SRAM L1 cache and a 1 MB 1/2-port SRAM L2 cache to a 256 MB (S)DRAM main memory. Annotations: many cache misses, page loading.]


Data-transfer and data-storage bottlenecks: system bus load

[Figure: system chip (datapaths, L1 cache), L2 cache, main memory, hard disk, and other system resources, connected by the L2 bus, the main system bus, and the disk access bus]


Multi-processor System Design

Image processing system built from three kinds of subsystem:

• Standard subsystem: detailed solution, locally optimized by expert. E.g.: 2D convolution

• Subsystem resembles a standard solution but needs small adaptations. E.g.: DCT for specific coder

• New complex subsystem. E.g.: quadtree coder

Locally optimized

Globally optimized => exploration!

[Figure: the subsystems connected through buffers]


Platform design requires change

[Figure: road-map cartoon. Labels: multi-media platform city; traditional architecture city; traditional compiler boulevard; power volcano (multi-media); processor trend = application engineer; cobblestone bypass road (requires paving).]


Ad-Hoc Design: Backtracking ?

[Figure: system specification and candidate memory organizations, connected by many question marks]


Systematic System Exploration

[Figure: system specification refined step by step to a memory organization; at each step the alternatives are marked "?" and one is marked "!"]


Data Transfer & Storage Exploration (DTSE) Principles

[Figure: chip with processor data paths, L1 cache, L2 cache, and cache & bank recombine logic; off-chip SDRAM with local latch 1 + bank 1 through local latch N + bank N]


Data Transfer & Storage Exploration (DTSE) Principles


[Figure: chip with L1 cache and L2 cache, and off-chip SDRAM, annotated: ANALYSIS!]


Main Data Transfer & Storage Principles


[Figure: chip with L1 cache and L2 cache, and off-chip SDRAM, annotated with the six principles]

1 Reduce redundant transfers
2 Introduce locality
3 Exploit memory hierarchy
4 Avoid N-port memories
5 Meet real-time constraints
6 Exploit limited life-time and data layout freedom


Time-Efficient System Exploration Design Flow

[Figure: initial system specification explored across design alternatives (question marks), with system-level feedback. Annotations: fast implementation with tools; accurate cost figures to guide decision.]


Physical Memory Management


Cavity detection application: medical imaging

Initial description

function GaussBlur
function ComputeEdges
function DetectRoots
function LabelRoots
    (image_in : W[N][N]) image_out : W[N][N] = ...

Every function computes new matrix information from the output of the previous step. The new value of a pixel depends on its neighbors.
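As a hedged illustration of why such a chain of neighbourhood functions invites data transfer and storage optimization, the Python sketch below merges two stages so that the full intermediate array is never stored. It is a 1-D simplification with made-up kernels: `blur_at`, `two_pass`, and `fused` are hypothetical names, and the clamped borders are an assumption. These are not the actual cavity detector functions.

```python
def blur_at(x, i):
    # Hypothetical 3-tap blur with clamped borders (an assumption).
    n = len(x)
    lo, hi = max(i - 1, 0), min(i + 1, n - 1)
    return (x[lo] + x[i] + x[hi]) // 3

def two_pass(x):
    # Stage 1 writes a full N-word intermediate array, stage 2 reads it back.
    n = len(x)
    blur = [blur_at(x, i) for i in range(n)]
    return [abs(blur[min(i + 1, n - 1)] - blur[max(i - 1, 0)]) for i in range(n)]

def fused(x):
    # Merged loops: only a sliding window of the intermediate values is kept,
    # so the N-word intermediate array and its transfers disappear.
    n = len(x)
    out = []
    prev = cur = blur_at(x, 0)      # stands for blur[i-1] and blur[i] at i = 0
    for i in range(n):
        nxt = blur_at(x, min(i + 1, n - 1))   # blur[i+1], clamped at the border
        out.append(abs(nxt - prev))
        prev, cur = cur, nxt
    return out

signal = [3, 9, 27, 81, 243, 5]
print(fused(signal) == two_pass(signal))
```

The fused version produces the same output while holding only a few intermediate words. This is the kind of storage and transfer saving that loop transformations and in-place mapping aim for, shown here in simplified form.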


Cavity detector results: overall summary

[Figure: bar chart (vertical axis 0 to 600) of accesses, size, and cycles after each step: original, DF trafo, loop trafo, data reuse, in-place, data layout, ADOPT - modulo, ADOPT - rest]


Conclusions for DTSE stage

• Typically an order of magnitude can be gained on system bus load

• As a result, the energy consumption in the data memory hierarchy is also reduced by about the same amount

• The system performance (board level) is also significantly improved, because contention for the competing resources on these system busses is reduced

• Penalty on code size is small (less than 20%)

• Typically the pure CPU speed is improved IF there was a data transfer bottleneck that could not be “hidden” by overlapping the computation and communication in the original code (which was certainly so for the cavity detector)


Task- versus Proc./Instr-level: mapping

[Figure: Task1, Task2, Task3 mapped onto Proc1, Proc2, Proc3 and onto array processors (Array Proc1, Array Proc2)]


Pareto curves allow task trade-off decision: DAB illustration

[Figure: three panels (TASK-1, TASK-2, TASK-3), each plotting energy against execution time]

Source: Digital Audio Broadcast, mapped on two processors


Pareto curves allow task trade-off decision

[Figure: three panels plotting energy against execution time, annotated: single proc., large mem. overhead]


Pareto curves allow task trade-off decision

[Figure: three panels plotting energy against execution time]


Trade-offs in memory organisation (e.g. voice coder SW controlled cache)

[Figure: relative power (0 to 2) versus cache size in words (0, 32, 64, 96, 128, 256, 512), showing cache power and main memory power]

Gain in power of an additional factor 6 compared to optimized (platform-independent) code


Global concurrency management design flow for dynamic concurrent tasks with data-dominated behaviour

[Figure: design flow from a concurrent OO spec through data type refinement, task concurrency management, physical memory management, and address optimization to SW/HW co-design (SW design flow, HW design flow). Task concurrency management: transform, task schedule, allocate/assign (Task1, Task2, Task3). SW/HW co-design: unified model, partition, refine/compile onto uProc, DSP, and HW, with system control (HW-Ctrl, uCtrl) and memory organization.]


Why are Applications becoming more dynamic and concurrent?

The workload decreases but the tasks are dynamically created and their size is data dependent

[Figure: MPEG4 and JPEG examples with tasks T1, T1', T2, T3, T4]


Terminal QoS (3D demonstrator)


Reduce global system energy by task scheduling + assignment (e.g. 2-processor approach)

[Figure: TN1, TN2, ..., TNn assigned to ARM processor 1 (Vdd = 1 V) and ARM processor 2 (Vdd = 3.3 V)]

Codes’01, System Design Automation book Verlag’01


Tradeoff between time-budget and energy

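Processor 1 (low Vdd): Vdd = 1.5 V, 5 nJ/instr., 2 TUs/instr.
Processor 2 (high speed): Vdd = 3.0 V, 20 nJ/instr., 1 TU/instr.

Trade-off: more time units versus more energy

More time units:
  Processor 1: 90 M instr., 180 M TUs, 450 mJ
  Processor 2: 20 M instr., 20 M TUs, 400 mJ
  Total: 180 M TUs, 850 mJ

More energy:
  Processor 1: 40 M instr., 80 M TUs, 200 mJ
  Processor 2: 70 M instr., 70 M TUs, 1400 mJ
  Total: 80 M TUs, 1600 mJ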
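The totals on this slide can be reproduced with a few lines of Python. The per-instruction figures come from the slide. The dictionary layout and the function name `evaluate` are illustrative, and taking the time budget as the maximum over both processors assumes that they run in parallel, which is consistent with the 180 M and 80 M TU totals.

```python
# Per-instruction figures taken from the slide.
PROCESSORS = {
    "P1": {"nj_per_instr": 5, "tu_per_instr": 2},    # low Vdd (1.5 V)
    "P2": {"nj_per_instr": 20, "tu_per_instr": 1},   # high speed (3.0 V)
}

def evaluate(assignment):
    """assignment maps processor name -> instruction count in millions.

    Returns (time budget in M TUs, energy in mJ).
    """
    energy_mj = 0
    time_mtu = 0
    for name, m_instr in assignment.items():
        p = PROCESSORS[name]
        energy_mj += m_instr * p["nj_per_instr"]   # M instr * nJ/instr = mJ
        # Assumption: processors run in parallel, so the slowest one sets the time.
        time_mtu = max(time_mtu, m_instr * p["tu_per_instr"])
    return time_mtu, energy_mj

print(evaluate({"P1": 90, "P2": 20}))   # (180, 850)
print(evaluate({"P1": 40, "P2": 70}))   # (80, 1600)
```

Moving work to the fast processor cuts the time budget from 180 M to 80 M TUs but raises the energy from 850 mJ to 1600 mJ. Neither assignment dominates the other.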


Trade-off between time budget (period/latency) and cost (e.g. energy) leads to Pareto curves

[Figure: cost versus time, with time budgets TB1 to TB6 on the time axis. The points (x) are processor alloc/assign and scheduling alternatives for the TNs in code version 1; some are marked as non-optimal points.]
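A minimal Python sketch of how the non-optimal points can be separated from the Pareto curve, assuming each alternative is summarized as a (time, cost) pair. The function name `pareto_front` and the sample points are illustrative only and are not taken from the slides.

```python
def pareto_front(points):
    """Return the (time, cost) points that no other point beats in both time and cost."""
    front = []
    best_cost = float("inf")
    # Visit points by increasing time; a point is kept only if it is
    # strictly cheaper than every faster (or equally fast) point seen so far.
    for time, cost in sorted(set(points)):
        if cost < best_cost:
            front.append((time, cost))
            best_cost = cost
    return front

alternatives = [(180, 850), (80, 1600), (200, 900), (100, 1700)]
print(pareto_front(alternatives))
```

In this example (200, 900) and (100, 1700) are dropped, because another alternative is both faster and cheaper than each of them.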


Not a single working point but Pareto curves are needed in the global trade-off

[Figure: energy (nJ, 0 to 3500) versus time budget (us, 0 to 250)]

Both data transfer-storage and concurrency aspects have to be combined!


Comparison of scheduling the original and transformed task-level descriptions

[Figure: energy (nJ, 0 to 3500) versus time budget (us, 0 to 250) for the original and the transformed description]


Overall solution: combination of complex design-time and simple run-time schedulers

Cases’00, ISSS’01, Design&Test Sep.’01

[Figure: design-time scheduling produces a cost versus time curve for each thread frame (TF 1 with nodes 1, 2, 3; TF 2 with nodes A, B); run-time scheduling combines them into one order, e.g. 1 A B 3 2]

• Design-time scheduling: at compile time, exploring all the optimization possibilities

• Run-time scheduling: at run time, providing flexibility and dynamic control at low cost as part of synthesized RTOS
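The run-time step can be sketched in Python as a selection over Pareto curves prepared at design time. This is a simplified illustration, not the scheduler described in the cited papers. It assumes the tasks run one after another so that their times add, uses exhaustive search (practical only for a handful of tasks), and all names and numbers are made up.

```python
from itertools import product

def select_points(task_curves, time_limit):
    """Pick one (time, energy) point per task, minimizing total energy.

    task_curves: one list of (time, energy) Pareto points per task,
    prepared at design time. Returns (total_energy, chosen_points),
    or None if no combination meets the time limit.
    """
    best = None
    for combo in product(*task_curves):
        total_time = sum(t for t, _ in combo)      # assumption: tasks run back to back
        total_energy = sum(e for _, e in combo)
        if total_time <= time_limit and (best is None or total_energy < best[0]):
            best = (total_energy, combo)
    return best

task1 = [(2, 10), (4, 6), (6, 3)]
task2 = [(1, 8), (3, 4)]
print(select_points([task1, task2], time_limit=7))

# When a new task enters, the selection is redone over all curves.
task3 = [(1, 5), (2, 2)]
print(select_points([task1, task2, task3], time_limit=7))
```

Because the curves are computed ahead of time, the run-time work reduces to choosing among a few stored points per task, which keeps the run-time scheduler simple.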


Run-time: original Pareto point selection

[Figure: energy versus execution time Pareto curves for Task 1 and Task 2, and the resulting application energy versus application execution time curve with a time limit]


Run-time: one selection if new task enters

[Figure: energy versus execution time Pareto curves for Task 1, Task 2, and the new Task 3, and the application energy curve with a time limit]


Run-time: better selection if new task enters

[Figure: energy versus execution time Pareto curves for Task 1, Task 2, and Task 3, and the application energy curve with a time limit; the energy gain of the better selection is marked]


Quality of Service (QoS) result

[Figure: bar chart of energy (J, axis 0 to 20) per scheduling approach, for two frame rates]

            no DVS   inter-task DVS   greedy heur.   DP
fps=5       17.53    14.32            6.211          6.171
fps=10      17.53    14.65            9.487          9.469

65% energy saving for 5 fps, 46% for 10 fps
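The quoted savings can be checked against the bar values with a short Python calculation. Which bar belongs to which label is an assumption made here from the order of the extracted numbers: 17.53 J is taken as "no DVS" for both frame rates, and 6.171 J and 9.469 J as "DP" for 5 fps and 10 fps.

```python
# Energy in joules; the label-to-value mapping is assumed from the chart order.
energy_j = {
    5: {"no DVS": 17.53, "DP": 6.171},
    10: {"no DVS": 17.53, "DP": 9.469},
}

def saving_percent(fps):
    # Relative saving of the DP schedule against running without DVS.
    e = energy_j[fps]
    return 100 * (1 - e["DP"] / e["no DVS"])

for fps in (5, 10):
    print(fps, "fps:", round(saving_percent(fps)), "% saving")
```

Under that mapping the results round to 65% and 46%, matching the figures on the slide.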
