
July 20, 2003

Reiner Hartenstein: A Mead-&-Conway-like Break-through is overdue; Seminar Nº 03301, Dynamically Reconfigurable Architectures; Dagstuhl, Germany, July 20-25, 2003

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de

1

Seminar Nº 03301, Dynamically Reconfigurable Architectures

A Mead-&-Conway-like Break-through is overdue

Reiner Hartenstein

Kaiserslautern University of Technology

Dagstuhl, July 20-25, 2003

© 2003, http://hartenstein.de


2

Ubiquitous embedded systems

Embedded System Engineering (ESE) requires:

• Hardware (HW) / (E)Software (ESW) co-design

• Configware (CW) / ESW co-design

• HW / CW / ESW co-design

ESE becomes the main focus in system design:

ESW becomes the main vehicle for product differentiation



3

Reconfigurable Computing: a second programming domain

Migration of programming to the structural domain

The structural domain has become RAM-based

The opportunity to introduce the structural domain to programmers ...

... to bridge the gap by clever abstraction mechanisms using a simple new machine paradigm



4

>> outline (1) <<

• Embedded System Design Crisis

• Supercomputing Crisis

• µP Crisis

• CS crisis

• CS for Embedded Systems?

• New Machine Paradigm

• final remarks

http://www.uni-kl.de

more crises



5

Embedded System Design Crisis

[Figure: chart plotted over year; curves not recoverable]



6

Mask & NRE cost [ST microelectronics]







7

Foundries: Adoption Rate By Process [Nick Tredennick]



8

„EDA industry shifts into CS mentality“ [Wojciech Maly]

• patches instead of engineering

• innovation stalled many years ago

• netlist-based: do not care about efficiency, ...

• ... do not care about transistor density

• 85% users hate their tools



9

Where are we heading ?

[Figure: growth factor (ticks 1, 2) plotted over time (0, 10, 12, 18 months)]

*) Department of Trade and Industry, London

90% by 2010

10 times more programmers will write embedded applications than computer software by 2010



10

Panels on the 2nd Design Crisis: proposing a solution

Lacking Sense of Direction



11

>> outline (2) <<





12

Dead Supercomputer Society

•ACRI •Alliant •American Supercomputer •Ametek •Applied Dynamics •Astronautics •BBN •CDC •Convex •Cray Computer •Cray Research •Culler-Harris •Culler Scientific •Cydrome •Dana/Ardent/Stellar/Stardent

•DAPP •Denelcor •Elexsi •ETA Systems •Evans and Sutherland Computer •Floating Point Systems •Galaxy YH-1 •Goodyear Aerospace MPP •Gould NPL •Guiltech •ICL •Intel Scientific Computers •International Parallel Machines •Kendall Square Research •Key Computer Laboratories

•MasPar •Meiko •Multiflow •Myrias •Numerix •Prisma •Tera •Thinking Machines •Saxpy •Scientific Computer Systems (SCS) •Soviet Supercomputers •Supertek •Supercomputer Systems •Suprenum •Vitesse Electronics

[Gordon Bell, keynote at ISCA 2000]







13

microprocessor architectures (1)

©Arndt Bode LRR-TUM

Development of microprocessor architectures (1)

Until 1995: narrowing; since 1995: growth in the variety of types and architectures

Transistor count (Moore's law): trade-off between computing performance, power consumption, cost and compatibility

MPR Analysts' Choice Awards categories:

- PC Processors: Intel P4 (HyperThreading), AMD Athlon (x86-64, HyperTransport), Transmeta (Binary Compilation, VLIW), ...

- Server Processors: Intel Xeon MP and Itanium 2 (EPIC), AMD Opteron (x86-64), HP Alpha EV-7, Fujitsu Sparc 64 V (out-of-order superscalar)

- High-Performance Embedded Processors: Broadcom BCM 1250, IBM 440 GX, Intrinsity FastMIPS, Motorola MPC 7455, NEC VR7701, PMC Sierra RM9000x2

- Low-Power Embedded Processors: AMD Au1100, Intel PXA 250, NEC VR 4131, DragonBall MX1, NeoMagic MiMagic5 (1 mW per MHz)

- Extreme Processors: CMU PipeRench, Intrinsity FastMath, Micron Yukon, NEC DRP, PACT XPP, Sandbridge Sand Blaster (up to 512 ALUs)

- Embedded IP Processor Cores: ARCtangent-A5, ARM 1026 EJ-S/1136JF-S, Improv Crescendo, MIPS M4K, Tensilica Xtensa V

- Graphics Processors: 3Dlabs Wildcat VP900, ATI Radeon 9700, Nvidia GeForce FX



14

Some Supercomputing people now looking at us

Reconfigurable Computing

PetaFlop/s (10^15) Initiative

Steroids for the aging microprocessor:



15

>> outline (3) <<





16

CS: young ? dynamic?

.. but the von Neumann Paradigm is still the dominant doctrine ...

Microelectronics is ignored (except falling cost of computational effort)

... still pushing the basic models from the times of mainframe dinosaurs

after >10 technology generations ...

• 1st 4004 • 2nd 8008 • 3rd 8086 • 4th 80286 • 5th 80386 • 6th 80486 • 7th P5 (Pentium) • 8th P6 (Pentium Pro / Pentium II) • 9th Pentium III • 10th .... • 11th • .......

... the vN Microprocessor is a Methuselah, the steam engine of the silicon age.



17

stolen from Bob Colwell

processor/memory communication bottleneck

vN bottleneck vN: unbalanced



18

MPU designs more complex

greatly complicates the verification process

chip-level multiprocessing + simultaneous multithreading

many bugs relate to concurrency issues

new kinds of concurrency are becoming important







19

„Pollack‘s Law“ (simplified) [intel]

[Figure: growth factor of performance and of area efficiency plotted over feature size (µm, down to 0.1)]



20

MPU performance stalled

Moore’s law will stall soon for MPUs

Bill Gates’ law: relative computation time needed doubles every 2 years

(had been compensated by Moore’s law)



21

>> outline (4) <<





22

Crusty Computing Sciences

[David Padua, John Hennessy]

shrinking supercomputing conferences

more and more efforts yield only marginal improvements

dataflow machines dead

98.5% vN-only

this monopoly is the problem

areas fade away



23

blinders:

„we are o.k. !“ (no new direction)

Lacking Sense of Direction ?

for ignoring the impact of RC


24

Stealthy CS Crisis

progress in CS stalled by qualification problems in industry and academia

communication barriers between disciplines

severe software quality problems

often hardware people needed to solve CS problems







25

What‘s the problem ?

Traditional CS: programming is (control-)procedural, instruction-stream-based; sources: software

The typical programmer has problems understanding function evaluation without machine mechanisms ...

.... by signals rippling through a network of transistors.

[Figure labels: accelerators, µprocessor]

It‘s the gap between procedural and structural mind set

Crossing the Hardware / Software Chasm [Mike Butts]



26

What‘s the problem ?

[Figure labels: accelerators, µprocessor]

The brain hurts on paradigm shift ?

no, it can‘t ...

Brain usage: procedural-only

structural hemisphere missing

Crossing the Hardware / Software Chasm [Mike Butts]



27

>> outline (5) <<





28

ITRS SoC design cost model [ITRS 2001]

[Figure: SoC design cost over time, with two curves: "RTL methodology only" and "w. future improvements". Design technology innovations marked along the chart: tall thin engineer, small block reuse, large block reuse, IC implementation tools, intelligent testbench, ES level methodology]

http://public.itrs.net/Files/2001ITRS/Design.pdf



29

SoC System level Design: Embedded SW (ESW)

new design automation from high level descriptions

ESE becomes the main focus in system design:

HW-(E)SW codesign onto highly programmable platforms (SoC)

ESW becomes the main vehicle for product differentiation

formal verification for (E)SW

HW-(E)SW co-verification

SW synthesis included (SoC)

[Slide annotations overlaid on the items above to add configware: "CW-", "CW and", "and CW", "(ECW)", "ECW"]



30

Complexity: System Level Design Challenge

“abstraction levels must be raised above present-day RT-level

from HW + (processor-dependent embedded) C code level

language infrastructures for complex models (SystemC etc.) must be leveraged by industry consensus on use-methodology and abstraction levels”

[ITRS 2001]







31

>> outline (6) <<





32

Why a dichotomy of machine paradigms?

data stream machine:

• bad news: caches do not help

• good news: no vN bottleneck

• caches not needed

[Figure stolen from Bob Colwell: vN bottleneck; vN: unbalanced]

The anti machine has no von Neumann bottleneck



33

computing paradigms and methodologies

1946: machine paradigm (von Neumann)

1980: data streams (Kung, Leiserson)

1989: anti machine paradigm

1990: rDPU (Rabaey)

1994: anti machine high level programming language

1995: super systolic rDPA

1996+: SCCC (LANL), SCORE, ASPRC, Bee (UCB), ...

1997+: discipline of distributed memory architecture

1997: configware / software partitioning compiler

flowware*



34

Flowware heading toward mainstream

•Data-stream-based Computing is heading for mainstream

–1997 SCCC (LANL) Streams-C Configurable Computing

–SCORE (UCB) Stream Computations Organized for Reconfigurable Execution

–ASPRC (UCB) Adapting Software Pipelining for Reconfigurable Computing

–2000 Bee (UCB), ...

–Most stream-based multimedia systems, etc.

–Many other areas ....

Flowware:

managing data streams

Software:

managing instruction streams



35

control-procedural vs. data-procedural

The structural domain is primarily data-stream-based:

..... mostly not yet modelled that way: most flowware is hidden by its indirect instruction-stream-based implementation

Flowware provides a (data-)procedural abstraction from the (data-stream-based) structural domain

Flowware converts „procedural vs. structural“ into „control-procedural vs. data-procedural“ ...

... a Trojan horse to introduce the structural domain to the procedural mind set of programmers

Flowware



36

flowware defines ....

[Figure: a DPA with input data streams entering and output data streams leaving its ports; every stream is drawn as a sequence of data items (x) over the axes "time" and "port #"]

... which data item at which time at which port
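The statement above can be made concrete with a small sketch. It assumes, purely for illustration, a systolic-style input skew in which row `i` of the data enters port `i` delayed by `i` time steps; the function names `skewed_input_schedule` and `items_at` are hypothetical and not taken from any flowware tool.

```python
def skewed_input_schedule(rows):
    """Build a flowware-like schedule: (time, port) -> data item.

    Assumption for this sketch: row i is fed into port i, one item per
    time step, starting i steps late (the usual systolic input skew).
    """
    schedule = {}
    for port, row in enumerate(rows):
        for k, item in enumerate(row):
            schedule[(port + k, port)] = item
    return schedule


def items_at(schedule, time):
    """Return {port: item} for everything that enters the array at `time`."""
    return {port: item for (t, port), item in sorted(schedule.items()) if t == time}


if __name__ == "__main__":
    sched = skewed_input_schedule([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    for t in range(5):
        print(t, items_at(sched, t))
```

The schedule is only a table of "which item, when, where"; it says nothing about what the DPUs compute, which is the configware side of the job.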

Placement & routing (configware) done:







37

Programming Language Paradigms

language category | Computer Languages | Languages f. Anti Machine

both: deterministic procedural sequencing: traceable, checkpointable

operation sequence driven by: | read next instruction, goto (instr. addr.), jump (to instr. addr.), instr. loop, loop nesting, no parallel loops, escapes, instruction stream branching | read next data item, goto (data addr.), jump (to data addr.), data loop, loop nesting, parallel loops, escapes, data stream branching

state register | program counter | data counter(s)

address computation | massive memory cycle overhead | overhead avoided

instruction fetch | memory cycle overhead | overhead avoided

parallel memory bank access | interleaving only | no restrictions
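A toy comparison may help to see the "instruction fetch" row of this table. The sketch below is an illustration under strong simplifying assumptions, not a model of any real machine: the instruction-stream version fetches every instruction again for every data item, while the data-stream version fixes the data path once (standing in for configuration at load time) and then only walks the data. All names and the example operation `2*x + 1` are made up for this sketch.

```python
def run_instruction_stream(data):
    """Instruction-stream style: each operation needs an instruction fetch."""
    program = [("load",), ("mul", 2), ("add", 1), ("store",)]
    fetches = 0
    out = []
    for x in data:
        acc = None
        for instr in program:          # program counter walks the program
            fetches += 1               # one memory cycle per instruction
            op = instr[0]
            if op == "load":
                acc = x
            elif op == "mul":
                acc *= instr[1]
            elif op == "add":
                acc += instr[1]
            elif op == "store":
                out.append(acc)
    return out, fetches


def run_data_stream(data):
    """Data-stream style: data path set up once, then driven by data only."""
    def datapath(x):                   # stands in for a configured (r)DPU
        return x * 2 + 1
    out = [datapath(x) for x in data]  # data counter walks the stream
    return out, 0                      # no instruction fetches at run time


if __name__ == "__main__":
    print(run_instruction_stream([1, 2, 3]))
    print(run_data_stream([1, 2, 3]))
```

Both functions return the same results; they differ only in the number of run-time instruction fetches that the sketch counts.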



38

Machine paradigms

[Figure, left: von Neumann instruction stream machine: memory M and I/O feed a CPU (instruction sequencer + DPU) with an instruction stream; programmed by Software]

[Figure, right: data-stream machine: a distributed memory architecture* of memory banks M (asM**), each with a data address generator (data sequencer), and I/O feed a DPU or rDPU, or a (r)DPA, with data streams; programmed by Flowware (and Configware if reconfigurable)]

*) the new discipline came just in time: see Herz et al.: Proc. IEEE ICECS 2002



39

heavy anti atoms: DPA = DPU array

[Figure: a heavy anti atom: the nucleus is a DPA (−) made of nine DPUs (−), surrounded by several (+) particles]



40

Distributed Memory

SA: scrambling and descrambling the data ?

Just in time: a new research area:

Application-specific distributed memory:

e. g. book by F. Catthoor et al. ...

Data address generators - 20 years research:



41

Synthesizable distributed memory architecture for a Stream-based Soft Machine

[Figure: Compiler and Scheduler; Memory (data memory) built from several memory banks; Sequencers (data stream generator); rDPA “instructions”]



42

>> outline (7) <<









43

Conclusion: all knowledge needed is available

•machine paradigm

•anti machine architectural resources

•sequencing methodology: hw & sw

•parallel memory IP core and module generator vendors

•anything else needed

•compilation techniques

•hw / sw partitioning methodology

• languages



44

The Situation in Computing Sciences

• Computing Sciences are in a severe crisis

• New fundamentals and R&D directions are inevitable

• my mission: getting you involved

• All knowledge needed is readily available ...

• ... even from Computing Sciences

• Silicon application and EDA provide useful concepts

• Reconfigurable Computing has the remedy



45

>>> we need ... <<<<<

We need a Mead-&-Conway-like text book

We need undergraduate lab courses on HW / CW / SW partitioning

We need new courses with extended scope on parallelism and algorithmic cleverness for HW / CW / SW migration / partitioning

What else do we need ? Your proposals ?


46

>>> we need support <<<<<

We need the support of the open-minded

members of the classical CS community

Let us assemble a list with e-mail addresses



47

>>> thank you <<<<<

thank you for your patience



48

>>> END <<<

END







49

microprocessor architectures (8)

TU Dresden, 09.05.2003

©Arndt Bode LRR-TUM

Microprocessor architectures (8): highly parallel systems

[Figure: array of PEs framed by SRAM blocks and I/O ports, with a Configuration Manager]


50

PACT XPP: Reference Module: XPU128 Co-Processor

XPP128 rDPA

[Figure: XPP128 rDPA built from rDPUs; each PAE core contains ALUs and Ctrl, with CFG blocks alongside; buses not shown]

• Full 32 or 24 Bit Design, working silicon

• 2 Configuration Hierarchies

• Evaluation Board

• XDS Development Tool with Simulator

• all used by SIEMENS Corporation

• Other contractors preparing .... : ask Ron Mabry (here in the audience)



51

wide variety of speed-up factors

platform | application | speed-up factor | method

PACT Xtreme 4-by-4 array [2003] | 16 tap FIR filter | x16 (MOPS/mW) | straight forward

MoM anti machine with DPLA* [1983] | grid-based DRC**: 1-metal 1-poly nMOS, 256 reference patterns | > x1000 (computation time) | multiple aspects

key issue: algorithmic cleverness

*) MPC fabrication via E.I.S. multi university project

**) Design Rule Check



52

>>> flowware-based <<<

flowware -based



53

Configware / Flowware Compilation

[Figure: compilation flow with the elements: high level source program, intermediate, mapper (configware), scheduler (flowware), wrapper, r. Data Path Array (rDPA), memory banks M (asM) with address generator and data sequencer, data streams]



54

Configware / Flowware Co-Compilation

[Figure: co-compilation flow with the elements: intermediate, mapper (configware), scheduler (flowware), wrapper, r. Data Path Array (rDPA), memory banks M with address generator and data sequencer, data streams]







55

>>> 2nd machine paradigm <<<

2nd machine paradigm



56

Matter & Antimatter

The World of Matter machine paradigm: the Atom

+ + - The World of Anti Matter

machine paradigm: Anti Atom

- - +



57

Matter & Antimatter of Informatics :

- DPU

+

Anti Machine paradigm

+

CPU

-

nothing central !



58

mapping algorithms efficiently onto rDPA

SNN filter on KressArray

array size: 10 x 16 = 160 rDPUs

Legend: rDPU not used; rDPU used for routing only (rout thru only); operator and routing; port location marker; backbus connect

SNN filter on KressArray

by the way: example of scalability / relocatability by EDA support

also FPGA scalability (avoid routing congestion) by EDA solution



59

One more argument for coarse grain

[Figure: MOPS / mW (log scale, 0.001 to 1000) versus feature size in µm (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07); T. Claasen et al.: ISSCC 1999]

we have already seen the first day:

Wiring by abutment: a 32 Bit KressArray example

if coarse grain cells are full custom and mesh-connected, and 2nd level interconnect resources are laid out over the cells, the array is almost as area-efficient as hardwired

*) R. Hartenstein: ISIS 1997



60

The Secret of Success: Co-Compilation

[Figure: co-compilation flow with the elements: High level PL source, Analyzer / Profiler, Partitioner, Resource Parameters, SW compiler (SW code for the “vN” machine paradigm), CW compiler (CW Code for the anti machine paradigm)]

supporting different platforms

supporting platform-based design

could provide the platforms







61

Machine Paradigms

machine category | Computer (the Machine: “v. Neumann”) | The Anti Machine

driven by: | instruction streams | data streams (no “dataflow”)

engine principles | instruction sequencing | sequencing data streams

state register | single program counter | (multiple) data counter(s)

communication path set-up (“instruction fetch”) | at run time | at load time

data path resource | DPU (e.g. single ALU) | DPU or DPA (DPU array) etc.

data path operation | sequential | parallel pipe network etc.

also hardwired implementations*

*) e.g. Bee project, Prof. Brodersen



62

Programming Language Paradigms

language category | Computer Languages | Languages f. Anti Machine

operation sequence driven by: | read next instruction, goto (instr. addr.), jump (to instr. addr.), instr. loop, loop nesting, no parallel loops, escapes, instruction stream branching | read next data item, goto (data addr.), jump (to data addr.), data loop, loop nesting, parallel loops, escapes, data stream branching

state register | program counter | data counter(s)

address computation | massive memory cycle overhead | overhead avoided

instruction fetch | memory cycle overhead | overhead avoided

parallel memory bank access | interleaving only | no restrictions



63

Why Coarse Grain instead of FPGA ?

[Figure: transistors / chip (log scale, 1000 to 100 000 000 000) versus year (1980 to 2010); curves: physical, logical, FPGA physical, FPGA logical, FPGA routed; gaps between curves marked ~ 10 and ~ 10 000]

• drastically smaller configuration memory

• much faster loading

• reduced reconfigurability overhead by up to ~ 1000

• a lot more benefits

Sources: Proc ISSCC, ICSPAT, DAC, DSPWorld



64

KressArray Family generic Fabrics: a few examples

Select Function Repertory

Select Nearest Neighbour (NN) Interconnect: an example

Select mode, number, width of NNports

more NNports: rich routing resources

rout-through and function; rout-through only

Examples of 2nd Level Interconnect: laid out over the rDPU cell - no separate routing areas !

[Figure labels on the rDPU: 2, 4, 8, 16, 24, 32]

http://kressarray.de



65

Changing Models of Computing

[Figure: a “von Neumann” host (RAM, data path, instruction sequencer, I / O) gets (procedural) Software by downloading (software design); hardwired accelerator(s) are produced from a hardware spec via CAD; both sides together: hardware/software co-design]

the problem with typical CS people:

- the dominance of von Neumann

- they cannot partition

- they cannot migrate

hardware people needed



66

Changing Models of Computing

[Figure: a “von Neumann” host (RAM, data path, instruction sequencer, I / O) gets (procedural) Software by downloading (software design); hardwired accelerator(s) are produced as Hardware via CAD (hardware/software co-design); reconf. accelerator(s) (Morphware) get (structural) Configware by downloading into RAM (configware/software co-design); all three together: hardware/configware/software co-design]







67

Super Pipe Networks

array | applications | pipeline properties: shape | pipeline properties: resources | mapping | scheduling (data stream formation)

systolic array | regular data dependencies only | linear only | uniform only | linear projection or algebraic synthesis | linear projection or algebraic synthesis

super-systolic rDPA* | no restrictions | no restrictions | no restrictions | simulated annealing or P&R algorithm | (e.g. force-directed) scheduling algorithm

*) KressArray [1995]
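To give an idea of what "mapping by simulated annealing" means for a mesh-connected array, here is a minimal sketch. It is a generic textbook-style annealer and not the KressArray mapper: the cost function (total Manhattan distance between connected operators), the cooling schedule and every name (`wire_length`, `anneal_placement`) are assumptions chosen for the illustration.

```python
import math
import random


def wire_length(placement, nets):
    """Total Manhattan distance over all two-point nets."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in nets)


def anneal_placement(ops, nets, cols, rows, steps=2000, seed=0):
    """Place operators on a cols x rows array by simulated annealing."""
    rng = random.Random(seed)
    cells = [(x, y) for y in range(rows) for x in range(cols)]
    if len(ops) > len(cells):
        raise ValueError("array too small for the operator list")
    # slots[i] holds the operator sitting on cells[i], or None if unused
    slots = list(ops) + [None] * (len(cells) - len(ops))
    rng.shuffle(slots)

    def placement_of(current):
        return {op: cells[i] for i, op in enumerate(current) if op is not None}

    cost = wire_length(placement_of(slots), nets)
    best, best_cost = placement_of(slots), cost
    temp = 2.0
    for _ in range(steps):
        i, j = rng.randrange(len(slots)), rng.randrange(len(slots))
        slots[i], slots[j] = slots[j], slots[i]          # propose a swap
        new_cost = wire_length(placement_of(slots), nets)
        accept = new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp)
        if accept:
            cost = new_cost
            if cost < best_cost:
                best, best_cost = placement_of(slots), cost
        else:
            slots[i], slots[j] = slots[j], slots[i]      # undo the swap
        temp = max(0.01, temp * 0.995)                   # cool down
    return best, best_cost


if __name__ == "__main__":
    ops = ["a", "b", "c", "d"]
    nets = [("a", "b"), ("b", "c"), ("c", "d")]
    print(anneal_placement(ops, nets, cols=3, rows=3))
```

Because any operator may land on any cell, this style of mapping places no restriction on the shape of the pipe network, which is the point of the "no restrictions" row.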



68

>>> distributed memory <<<

distributed memory



69

instruction stream-based Compilation Principles

scheduler

parser

source text

library

link/load instruction call placement

1-D memory space

execution order by location



70

Datastream-based Compilation Principles

library

data stream assembly

scheduler

mapper placement & routing



71

>>> flowware languages <<<

flowware languages



72

Similar Programming Language Paradigms

language category Computer Languages Xputer Languages


sequencing driven by:

read next instruction, goto (instruction addr.), jump (to instruction addr.), instruction loop, instruction loop nesting no parallel loops, instruction loop escapes, instruction stream branching

read next data object, goto (data addr.), jump (to data addr.), data loop, data loop nesting, parallel data loops, data loop escapes, data stream branching







73

JPEG zigzag scan pattern

Flowware language example (MoPL)

*> Declarations

EastScan is step by [1,0] end EastScan;

SouthScan is step by [0,1] end SouthScan;

NorthEastScan is loop 8 times until [*,1] step by [1,-1] endloop end NorthEastScan;

SouthWestScan is loop 8 times until [1,*] step by [-1,1] endloop end SouthWestScan;

HalfZigZag is EastScan loop 3 times SouthWestScan SouthScan NorthEastScan EastScan endloop end HalfZigZag;

Main program:

goto PixMap[1,1]

HalfZigZag; SouthWestScan uturn (HalfZigZag)

[Figure: JPEG zigzag scan over the x/y pixel map; the two HalfZigZag sections are each traced by a data counter]
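The MoPL program can be traced with a short Python sketch. Two things are assumptions of the sketch rather than documented MoPL semantics: `until` is checked before each step, and `uturn (HalfZigZag)` is modeled as replaying the steps of HalfZigZag in reverse order. With these assumptions the data counter visits the 8 x 8 PixMap in the usual JPEG zigzag order. The helper names (`run`, `zigzag_path`, ...) are hypothetical.

```python
def run(pos, dx, dy, max_steps=1, until=None):
    """Step by (dx, dy) up to max_steps times; stop once until(pos) holds."""
    x, y = pos
    steps = []
    for _ in range(max_steps):
        if until is not None and until((x, y)):
            break
        x, y = x + dx, y + dy
        steps.append((dx, dy))
    return (x, y), steps


def east(pos):
    return run(pos, 1, 0)                                   # step by [1,0]


def south(pos):
    return run(pos, 0, 1)                                   # step by [0,1]


def north_east(pos):
    return run(pos, 1, -1, 8, until=lambda p: p[1] == 1)    # until [*,1]


def south_west(pos):
    return run(pos, -1, 1, 8, until=lambda p: p[0] == 1)    # until [1,*]


def half_zigzag(pos):
    pos, steps = east(pos)
    for _ in range(3):
        for scan in (south_west, south, north_east, east):
            pos, more = scan(pos)
            steps = steps + more
    return pos, steps


def zigzag_path():
    start = (1, 1)                        # goto PixMap[1,1]
    pos, half = half_zigzag(start)        # HalfZigZag
    pos, sw = south_west(pos)             # SouthWestScan
    steps = half + sw + half[::-1]        # uturn (HalfZigZag): assumed reverse replay
    path = [start]
    x, y = start
    for dx, dy in steps:
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(zigzag_path()[:10])
```

Note that the whole scan is described by data counter movements only; no pixel value is touched, which is what separates flowware from the operators configured into the data path.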






75

>>> address generators <<<

address generators



76

GAG generic address generator Scheme

[Figure: a GAG consists of a Base Slider (B0, ΔB), a Limit Slider (L0, ΔL) and an Address Stepper (ΔA) that steps the address A between base B and limit L]

all 3 are copies of the same BSU stepper circuit



77

GAG Slider Model

GAG: Generic Address Generator

[Figure: slider model with LimitStepper, BaseStepper and AddressStepper (B0, A, L0); the sliders move the base (floor) and the limit (ceiling), and the address A is stepped by ΔA between them]



78

GAG: Address Stepper

GAG = Generic Address Generator

[Figure: Address Stepper circuit; inputs: Limit, Base, stepVector (ΔA), maxStepCount, init; blocks: adder, Escape Clause, End Detect, Step Counter, limit comparison; outputs: Address A, tag, endExec]
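The behaviour suggested by the stepper and slider figures can be sketched in a few lines. This is a behavioural guess, not the published GAG circuit: the parameter names (`base`, `limit`, `step`, `max_steps`) loosely follow the figure labels Base, Limit, stepVector and maxStepCount, the escape clause is not modeled, and `slider_scan` is a hypothetical helper that moves base and limit once per scan line.

```python
def address_stepper(base, limit, step, max_steps=None):
    """Yield addresses from `base` in increments of `step`.

    Stops when the next address would pass `limit` or when `max_steps`
    addresses have been issued (whichever comes first).
    """
    if step == 0:
        raise ValueError("step must be non-zero")
    addr, count = base, 0
    while (addr <= limit) if step > 0 else (addr >= limit):
        if max_steps is not None and count >= max_steps:
            break
        yield addr
        addr += step
        count += 1


def slider_scan(base0, base_step, limit0, limit_step, lines, step):
    """Base and limit sliders shift floor and ceiling after every scan line."""
    base, limit = base0, limit0
    for _ in range(lines):
        yield list(address_stepper(base, limit, step))
        base += base_step
        limit += limit_step


if __name__ == "__main__":
    print(list(address_stepper(0, 10, 3)))
    print(list(slider_scan(0, 8, 3, 8, lines=2, step=1)))
```

In this sketch a video-style scan over a row-major memory is simply a slider pair that advances by the row length while the stepper walks each row; no address arithmetic is left for the data path.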







79

GAG Complex Sequencer Implementation

GAG: Generic Address Generator

[Figure: complex sequencer built from several GAGs; each GAG contains a Limit Slider, a Base Slider and an Address Stepper (B0, ΔA, L0, A); further blocks: GAU, SDS, VLIW stack]

all published in 1990



80

Generic Sequence Examples

[Figure: example address sequences a) to g) generated by one GAG (Limit Slider, Base Slider, Address Stepper; B0, ΔA, L0, A)]



81

GAG Slider Operation Demo Example

[Figure: in the x/y address space the address A moves between floor F (base B, B0) and ceiling C (limit L, L0)]



82

Storage scheme optimization: scanline unrolling

MoM anti machine architecture

[Figure: a scan window is moved along a scan pattern over the x/y memory space (handle positions: high level sequencing); intra scan window accesses (r, r/w: low level sequencing) are mapped to Bank a and Bank b; design stages shown: initial design, after scan line unrolling, after inner scan line loop unrolling, hardw. level access optim., final design]



83

Computer: the wrong Machine Paradigm (“von Neumann”)

[Figure: Compiler, RAM (instructions), Sequencer with program counter as state register, hardwired Datapath; tightly coupled by compact instruction code; does not support soft data paths]

Xputer: The Soft Machine Paradigm (anti machine)

[Figure: Compiler and Scheduler, RAM, (multiple) sequencer with data counters, Datapath Array (reconfigurable; also for hardwired) receiving “instructions”; loosely coupled by decision data bits only]

University of Kaiserslautern, Xputer Lab


84

Binding Time vs. Computing Domain

Binding time (set-up of communication channels) and programming domain:

• at run time: microprocessor, parallel computer; time domain (procedural)

• at loading time, at compile time: Reconfigurable Computing, array processor, systolic arrays, supersystolic arrays; time & space (hybrid)

• later fabrication step: ASICs; space domain (structural)

• before fabrication: full custom ICs; space domain (structural)










86

Paradigm Shifts: Nick Tredennick‘s view

instruction-stream-based computing: resources fixed, algorithms variable

reconfigurable computing: resources variable, algorithms variable

programmable

why 2 program sources ?



87

Compilation for (r)DPA of anti machine

[Figure: compilation flow with the elements: expression tree, mapper, scheduler, DPU library, wrapper, parameters, code generators; results: configware (for morphware) and flowware / streamware (software notation)]



88

machine paradigm: some differences

+ CPU

-

- DPA

+ +

+

- DPU

+

no. of streams ≥ 1



89

Annihilation?

- +

-

+ - + avoidable

by tools ....



90

Matter & Antimatter: Atom and Anti Atom

The World of Matter

Machine paradigm: the Atom

Anti Matter

Machine paradigm: Anti Atom

+ + -

- - +







91

Parallelism by Concurrency

+ -

+ -

- +

- +

+ -

- +

- +

independent instruction streams



92

Co-Compilation

We introduce: Co-Compilation

[Figure: a partitioning compiler takes a high level programming language source and produces Software running on the µProcessor (Computer Machine Paradigm) and Configware running on Reconfigurable Accelerators (Xputer “Soft” Machine Paradigm), i.e. a Reconfigurable Architecture (RA) instead of hardwired; both sides are connected by an interface]



93

Loop Transformation Examples

loop 1-8 body body endloop

loop 1-8 body endloop

loop 9-16 body endloop

fork

join

strip mining

loop 1-4 trigger endloop



reconf.array: host: loop 1-16 body endloop

sequential processes: resource parameter driven Co-Compilation

loop unrolling
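The transformations named on this slide can be checked against each other with a small sketch. The loop body (`i * i`) is a placeholder, the thread pool merely stands in for the fork / join of two concurrent resources, and the strip length of 4 is chosen so that the host loop issues four triggers as on the slide; none of this reflects how an actual co-compiler emits code.

```python
from concurrent.futures import ThreadPoolExecutor


def body(i):
    return i * i                          # placeholder loop body


def original():
    return [body(i) for i in range(1, 17)]            # loop 1-16 body endloop


def unrolled():
    out = []
    for i in range(1, 17, 2):                         # loop 1-8 body body endloop
        out.append(body(i))
        out.append(body(i + 1))
    return out


def forked():
    with ThreadPoolExecutor(max_workers=2) as pool:   # fork
        low = pool.submit(lambda: [body(i) for i in range(1, 9)])
        high = pool.submit(lambda: [body(i) for i in range(9, 17)])
        return low.result() + high.result()           # join


def strip_mined(strip=4):
    out = []
    for start in range(1, 17, strip):                 # host: loop 1-4 trigger endloop
        out.extend(body(i) for i in range(start, start + strip))  # one strip per trigger
    return out


if __name__ == "__main__":
    print(original() == unrolled() == forked() == strip_mined())
```

All four variants compute the same sequence; which one is chosen would depend on the resource parameters, e.g. how many data paths are available at the same time.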



94

„new“ terms

Flowware*: to schedule data streams; similar to software, but with data counter manipulation (programming data streams instead of instruction streams)

Configware: sources for programming morphware

Software: you all know (programming instruction streams)

Hardware: you all know (not programmable)

Morphware: structurally programmable „hardware“

(only some terms are „new“, however, not their subject)

clean terminology needed for taxonomy and comprehensibility

*) flowware is not related to the „dataflow machine“

Granularity defines block path width:

• fine grain: 1-2 bit

• coarse grain: > 2 bit

• multi grain: > 2 bit, variable

algorithms variable, resources variable

algorithms variable, resources fixed



95

Why data streams are a common model

Flowware: to schedule data streams

Configware: programming the resources

Data streams (flowware) are derived from configware having been compiled before

Data stream execution resources: distributed memory architectures. This new discipline came just in time: see Herz et al.: Proc. IEEE ICECS 2002 (link via „recent talks“)

all other details are defined here: Nick Tredennick‘s paradigm shifts:

• fully hardwired: algorithms fixed, resources fixed (not programmable)

• CPU: algorithms variable, resources fixed

• reconfigurable: algorithms variable, resources variable

*) only one source needed
