
[email protected] July 20, 2003

Reiner Hartenstein: A Mead-&-Conway-like Break-through is overdue; Seminar Nº 03301, Dynamically Reconfigurable Architectures; Dagstuhl, Germany, July 20-25, 2003

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de


Seminar Nº 03301, Dynamically Reconfigurable Architectures

A Mead-&-Conway-like Break-through is overdue

Reiner Hartenstein

Kaiserslautern University of Technology

Dagstuhl, July 20-25, 2003

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology


Ubiquitous embedded systems

Embedded System Engineering (ESE) requires:

• Hardware (HW) / (E)Software (ESW) co-design

• Configware (CW) / ESW co-design

• HW / CW / ESW co-design

ESE becomes the main focus in system design:

ESW becomes the main vehicle for product differentiation


Reconfigurable Computing: a second programming domain

Migration of programming to the structural domain

The structural domain has become RAM-based

The opportunity to introduce the structural domain to programmers ...

... to bridge the gap by clever abstraction mechanisms using a simple new machine paradigm


>> outline (1) <<

• Embedded System Design Crisis
• Supercomputing Crisis
• µP Crisis
• CS crisis
• CS for Embedded Systems?
• New Machine Paradigm
• final remarks

http://www.uni-kl.de

more crises


Embedded System Design Crisis

[Figure: chart plotted over the year; data not recoverable]


Mask & NRE cost [ST microelectronics]


Foundries: Adoption Rate By Process [Nick Tredennick]


„EDA industry shifts into CS mentality“ [Wojciech Maly]

• patches instead of engineering

• innovation stalled many years ago

• netlist-based: do not care about efficiency, ...

• ... do not care about transistor density

• 85% of users hate their tools


Where are we heading ?

[Figure: growth factor (1 to 2) plotted against months (0, 10, 12, 18)]

10 times more programmers will write embedded applications than computer software by 2010

90% by 2010

*) Department of Trade and Industry, London


Panels on the 2nd Design Crisis: proposing a solution

Lacking Sense of Direction


>> outline (2) <<


Dead Supercomputer Society

•ACRI •Alliant •American Supercomputer •Ametek •Applied Dynamics •Astronautics •BBN •CDC •Convex •Cray Computer •Cray Research •Culler-Harris •Culler Scientific •Cydrome •Dana/Ardent/Stellar/Stardent

•DAPP •Denelcor •Elexsi •ETA Systems •Evans and Sutherland Computer •Floating Point Systems •Galaxy YH-1 •Goodyear Aerospace MPP •Gould NPL •Guiltech •ICL •Intel Scientific Computers •International Parallel Machines •Kendall Square Research •Key Computer Laboratories

•MasPar •Meiko •Multiflow •Myrias •Numerix •Prisma •Tera •Thinking Machines •Saxpy •Scientific Computer Systems (SCS) •Soviet Supercomputers •Supertek •Supercomputer Systems •Suprenum •Vitesse Electronics

[Gordon Bell, keynote at ISCA 2000]


microprocessor architectures (1)

© Arndt Bode, LRR-TUM

Evolution of microprocessor architectures (1)

Until 1995: narrowing of the variety of types and architectures; since 1995: growing variety

Transistor count (Moore's law): trade-off between computing performance, power consumption, cost, and compatibility

MPR Analysts' Choice Awards categories:

- PC Processors: Intel P4 (HyperThreading), AMD Athlon (x86-64, HyperTransport), Transmeta (Binary Compilation, VLIW), ...

- Server Processors: Intel Xeon MP and Itanium 2 (EPIC), AMD Opteron (x86-64), HP Alpha EV-7, Fujitsu Sparc 64 V (out-of-order superscalar)

- High-Performance Embedded Processors: Broadcom BCM 1250, IBM 440 GX, Intrinsity FastMIPS, Motorola MPC 7455, NEC VR7701, PMC Sierra RM9000x2

- Low-Power Embedded Processors: AMD Au1100, Intel PXA 250, NEC VR 4131, DragonBall MX1, NeoMagic MiMagic5 (1 mW per MHz)

- Extreme Processors: CMU PipeRench, Intrinsity FastMath, Micron Yukon, NEC DRP, PACT XPP, Sandbridge Sandblaster (up to 512 ALUs)

- Embedded IP Processor Cores: ARCtangent-A5, ARM 1026 EJ-S/1136JF-S, Improv Crescendo, MIPS M4K, Tensilica Xtensa V

- Graphics Processors: 3Dlabs Wildcat VP900, ATI Radeon 9700, Nvidia GeForce FX


Some Supercomputing people now looking at us

PetaFlop/s (10^15) Initiative

Steroids for the aging microprocessor: Reconfigurable Computing


>> outline (3) <<


CS: young ? dynamic?

.. but the von Neumann Paradigm is still the dominant doctrine ...

Microelectronics is ignored (except falling cost of computational effort)

... still pushing the basic models from the times of mainframe dinosaurs after >10 technology generations ...

• 1st 4004
• 2nd 8008
• 3rd 8086
• 4th 80286
• 5th 80386
• 6th 80486
• 7th P5 (Pentium)
• 8th P6 (Pentium Pro / Pentium II)
• 9th Pentium III
• 10th ....
• 11th
• .......

... the vN Microprocessor is a Methuselah, the steam engine of the silicon age.


stolen from Bob Colwell

processor/memory communication bottleneck

vN bottleneck

vN: unbalanced


MPU designs more complex

greatly complicates the verification process

chip-level multiprocessing + simultaneous multithreading

many bugs relate to concurrency issues

new kinds of concurrency are becoming important


„Pollack‘s Law“ (simplified) [Intel]

[Figure: growth factor of performance and of area efficiency plotted against feature size (µm, down to 0.1)]


MPU performance stalled

Bill Gates’ law: relative computation time needed doubles every 2 years

had been compensated by Moore’s law

Moore’s law will stall soon for MPUs


>> outline (4) <<


Crusty Computing Sciences

[David Padua, John Hennessy]

shrinking supercomputing conferences

more and more efforts yield only marginal improvements

dataflow machines dead

98.5% vN-only

this monopoly is the problem

areas fade away


blinders:

„we are o.k. !“ (no new direction)

Lacking Sense of Direction ?

for ignoring the impact of RC

Stealthy CS Crisis

progress in CS stalled by qualification problems in industry and academia

communication barriers between disciplines

severe software quality problems

often hardware people needed to solve CS problems


What‘s the problem ?

Traditional CS: programming is (control-)procedural, instruction-stream-based (sources: software)

The typical programmer has difficulty understanding function evaluation without machine mechanisms ....

.... by signals rippling through a network of transistors.

[Figure: µprocessor and accelerators]

It‘s the gap between the procedural and the structural mind set

Crossing the Hardware / Software Chasm [Mike Butts]


What‘s the problem ?

[Figure: µprocessor and accelerators]

The brain hurts on paradigm shift ?

no, it can‘t ...

Brain usage: procedural-only

structural hemisphere missing

Crossing the Hardware / Software Chasm [Mike Butts]


>> outline (5) <<


ITRS SoC design cost model [ITRS 2001]

[Figure: SoC design cost for two cases, "RTL methodology only" and "w. future improvements"; labelled design technology steps: tall thin engineer, small block reuse, large block reuse, IC implementation tools, intelligent testbench, ES level methodology]

http://public.itrs.net/Files/2001ITRS/Design.pdf


SoC System level Design: Embedded SW (ESW)

new design automation from high level descriptions

ESE becomes the main focus in system design:

HW-(E)SW codesign onto highly programmable platforms (SoC)

ESW becomes the main vehicle for product differentiation

formal verification for (E)SW

HW-(E)SW co-verification

SW synthesis included (SoC)

[Overlay annotations on this slide extend the items above to configware: "CW-", "CW and", "and CW", "(ECW)", "ECW"]


Complexity: System Level Design Challenge

“abstraction levels must be raised above present-day RT-level

from HW + (processor-dependent embedded) C code level

language infrastructures for complex models (SystemC etc.)

must be leveraged by industry consensus on use-methodology and abstraction levels”

[ITRS 2001]


>> outline (6) <<


Why a dichotomy of machine paradigms?

data stream machine:

• bad news: caches do not help

• good news: no vN bottleneck

• caches not needed

stolen from Bob Colwell

vN bottleneck

vN: unbalanced

The anti machine has no von Neumann bottleneck


computing paradigms and methodologies

1946: machine paradigm (von Neumann)

1980: data streams (Kung, Leiserson)

1989: anti machine paradigm

1990: rDPU (Rabaey)

1994: anti machine high level programming language

1995: super systolic rDPA

1996+: SCCC (LANL), SCORE, ASPRC, Bee (UCB), ...

1997+: discipline of distributed memory architecture

1997: configware / software partitioning compiler

[Side label: flowware*]


Flowware heading toward mainstream

•Data-stream-based Computing is heading for mainstream

–1997 SCCC (LANL) Streams-C Configurable Computing

–SCORE (UCB) Stream Computations Organized for Reconfigurable Execution

–ASPRC (UCB) Adapting Software Pipelining for Reconfigurable Computing

–2000 Bee (UCB), ...

–Most stream-based multimedia systems, etc.

–Many other areas ....

Flowware: managing data streams

Software: managing instruction streams


control-procedural vs. data-procedural

The structural domain is primarily data-stream-based:

..... mostly not yet modelled that way: most flowware is hidden by its indirect instruction-stream-based implementation

Flowware provides a (data-)procedural abstraction from the (data-stream-based) structural domain

Flowware converts „procedural vs. structural“ into „control-procedural vs. data-procedural“ ...

... a Trojan horse to introduce the structural domain to the procedural mind set of programmers
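The contrast between control-procedural and data-procedural sequencing can be illustrated with a small sketch. It is not taken from the slides: the function names, the tiny instruction set and the use of a Python lambda as a stand-in for a configured DPU are all assumptions made for illustration.

```python
# Hypothetical illustration; none of these names come from the slides.

def control_procedural(a, b):
    """Instruction-stream style: a program counter steps through
    instructions, and every operand is fetched by an explicit load."""
    program = [("load", "x"), ("load", "y"), ("mul",), ("store",)]
    out = []
    for i in range(len(a)):
        stack = []
        pc = 0  # program counter: the single state register
        while pc < len(program):
            op = program[pc]
            if op[0] == "load":
                stack.append(a[i] if op[1] == "x" else b[i])
            elif op[0] == "mul":
                y, x = stack.pop(), stack.pop()
                stack.append(x * y)
            elif op[0] == "store":
                out.append(stack.pop())
            pc += 1
    return out


def data_stream(memory, start, count, stride=1):
    """Data sequencer: a data counter generates the addresses;
    no instruction is fetched while the stream runs."""
    addr = start  # data counter
    for _ in range(count):
        yield memory[addr]
        addr += stride


def data_procedural(a, b):
    """Data-stream style: the operator is fixed before the run
    (assumed to model a configured DPU); only data is sequenced."""
    multiplier = lambda x, y: x * y
    xs = data_stream(a, 0, len(a))
    ys = data_stream(b, 0, len(b))
    return [multiplier(x, y) for x, y in zip(xs, ys)]
```

Both functions compute the same element-wise product; the difference lies only in what is sequenced at run time: instructions in the first case, data items in the second.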


Placement & routing (configware) done:

flowware defines .... which data item at which time at which port

[Figure: DPA with input data streams and output data streams, each drawn as data items (x) over time and port #]
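A minimal sketch of what "which data item at which time at which port" could look like as a data structure: a schedule keyed by (time, port). The skewing rule used here (row r enters port r, delayed by r time steps, as is common for systolic arrays) is an assumption for illustration, and the function names are hypothetical.

```python
def flowware_schedule(matrix):
    """Return {(time, port): item}.

    Assumed rule: row r of the matrix is streamed into port r,
    skewed by r time steps.
    """
    schedule = {}
    for port, row in enumerate(matrix):
        for k, item in enumerate(row):
            schedule[(k + port, port)] = item
    return schedule


def stream_at_port(schedule, port, length):
    """Data stream seen at one port; None marks an idle time step."""
    return [schedule.get((t, port)) for t in range(length)]
```

For the 2-by-3 input `[[1, 2, 3], [4, 5, 6]]`, port 0 receives 1, 2, 3 starting at time 0, while port 1 stays idle for one step and then receives 4, 5, 6.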


Programming Language Paradigms

language category | Computer Languages | Languages f. Anti Machine
both | deterministic procedural sequencing: traceable, checkpointable
operation sequence driven by: | read next instruction, goto (instr. addr.), jump (to instr. addr.), instr. loop, loop nesting, no parallel loops, escapes, instruction stream branching | read next data item, goto (data addr.), jump (to data addr.), data loop, loop nesting, parallel loops, escapes, data stream branching
state register | program counter | data counter(s)
address computation | massive memory cycle overhead | overhead avoided
Instruction fetch | memory cycle overhead | overhead avoided
parallel memory bank access | interleaving only | no restrictions
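The "data counter(s)", "data loop", "loop nesting" and "parallel loops" entries of the table can be sketched as a generic address generator. This is an illustrative assumption, not the author's design: `data_counter` and its `(count, stride)` parameters are hypothetical names.

```python
def data_counter(base, steps):
    """Generate data addresses from nested data loops.

    steps: list of (count, stride) pairs, outermost loop first.
    The address sequence is produced by the counter itself, so the
    data path spends no cycles on address arithmetic.
    """
    def walk(level, addr):
        if level == len(steps):
            yield addr
            return
        count, stride = steps[level]
        for i in range(count):
            yield from walk(level + 1, addr + i * stride)

    return walk(0, base)


# Loop nesting: scan a 2-by-3 window of an image stored 8 words per row.
window = list(data_counter(10, [(2, 8), (3, 1)]))

# Parallel loops: two data counters advance in lockstep.
pairs = list(zip(data_counter(0, [(3, 1)]), data_counter(100, [(3, 2)])))
```

Here `window` is the address sequence 10, 11, 12, 18, 19, 20, and `pairs` shows two independent counters stepping together, which has no direct counterpart in a single program counter.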


Machine paradigms

[Figure: two block diagrams.
von Neumann instruction stream machine: CPU (instruction sequencer + DPU), memory (M), I/O; driven by the instruction stream (Software).
data-stream machine: DPU or rDPU, or (r)DPA, with I/O and a distributed memory architecture* of asM** memory blocks (M), each with a data address generator (data sequencer); driven by data streams (Flowware), plus (Configware) when reconfigurable.
Icons: CPU drawn as an atom, DPU with memory drawn as an anti atom.]

*) the new discipline came just in time: see Herz et al.: Proc. IEEE ICECS 2002


heavy anti atoms: DPA = DPU array

[Figure: a DPA drawn as a heavy anti atom built from an array of DPUs]


Distributed Memory

SA: scrambling and descrambling the data ?

Just in time: a new research area:

Application-specific distributed memory:

e. g. book by F. Catthoor et al. ...

Data address generators - 20 years of research:


Synthesizable distributed memory architecture for a Stream-based Soft Machine

[Figure: block diagram with Compiler, Scheduler, rDPA “instructions”, Sequencers (data stream generator), and Memory (data memory) consisting of several memory banks]
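As a rough sketch of the idea behind this diagram, several memory banks can each own a sequencer and deliver one data item per step in parallel. The class and function names below are hypothetical, and the model ignores timing, bus widths and configuration.

```python
class MemoryBank:
    """One memory bank with its own sequencer (data counter)."""

    def __init__(self, data, start=0, stride=1):
        self.data = list(data)
        self.addr = start
        self.stride = stride

    def next_item(self):
        # The bank sequences itself; addresses wrap around the bank size.
        item = self.data[self.addr % len(self.data)]
        self.addr += self.stride
        return item


def run(banks, steps, operator):
    """Each step, every bank delivers one item and the operator
    (standing in for the rDPA) consumes them all at once."""
    return [operator(*(bank.next_item() for bank in banks))
            for _ in range(steps)]


banks = [MemoryBank([1, 2, 3, 4]),
         MemoryBank([10, 20, 30, 40], start=3, stride=-1)]
sums = run(banks, 4, lambda x, y: x + y)
```

The second bank is read backwards while the first is read forwards, which shows that each bank's access pattern is independent of the others.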


>> outline (7) <<


Conclusion: all knowledge needed is available

•machine paradigm

•anti machine architectural resources

•sequencing methodology: hw & sw

•parallel memory IP core and module generator vendors

•anything else needed

•compilation techniques

•hw / sw partitioning methodology

• languages


The Situation in Computing Sciences

• Computing Sciences are in a severe crisis

• New fundamentals and R&D directions are inevitable

• my mission: getting you involved

• All knowledge needed is readily available ...

• ... even from Computing Sciences

• Silicon application and EDA provide useful concepts

• Reconfigurable Computing has the remedy


>>> we need ... <<<<<

We need a Mead-&-Conway-like text book

We need undergraduate lab courses on HW / CW / SW partitioning

We need new courses with extended scope on parallelism and algorithmic cleverness for HW / CW / SW migration / partitioning

What else do we need ? Your proposals ?

>>> we need support <<<<<

We need the support of the open-minded members of the classical CS community

Let us assemble a list with e-mail addresses


>>> thank you <<<<<

thank you for your patience


>>> END <<<


microprocessor architectures (8)

TU Dresden, 09.05.2003

© Arndt Bode, LRR-TUM

Microprocessor architectures (8): highly parallel systems

[Figure: array of processing elements (PE) bordered by SRAM blocks and I/O ports, with a Configuration Manager]

PACT XPP: Reference Module: XPU128 Co-Processor

XPP128 rDPA

• Full 32 or 24 Bit Design working silicon
• 2 Configuration Hierarchies
• Evaluation Board
• XDS Development Tool with Simulator
• all used by SIEMENS Corporation
• Other contractors preparing .... : ask Ron Mabry (here in the audience)

[Figure: rDPA built from rDPUs (PAE cores with ALU and Ctrl) and CFG blocks; buses not shown]


wide variety of speed-up factors

platform | application | speed-up factor | method
PACT Xtreme 4-by-4 array [2003] | 16 tap FIR filter | x16 MOPS/mW | straight forward
MoM anti machine with DPLA* [1983] | grid-based DRC**, 1-metal 1-poly nMOS, 256 reference patterns | > x1000 (computation time) | multiple aspects

key issue: algorithmic cleverness

*) MPC fabrication via E.I.S. multi university project

**) Design Rule Check
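To make the FIR filter entry concrete, here is a sketch of a FIR filter written as a data-stream pipeline: one multiply-accumulate cell per tap, all cells active in every step. It uses the standard transposed-form structure; it is not a description of how the PACT array was actually programmed, and `fir_stream` is a hypothetical name.

```python
def fir_stream(samples, coeffs):
    """Transposed-form FIR filter as a pipeline of tap cells.

    regs[k] is the pipeline register at the output of cell k.
    In hardware all cells would update in the same clock step;
    here the update is written as one list comprehension.
    """
    taps = len(coeffs)
    regs = [0] * taps
    for x in samples:
        regs = [coeffs[k] * x + (regs[k + 1] if k + 1 < taps else 0)
                for k in range(taps)]
        yield regs[0]


def fir_direct(samples, coeffs):
    """Reference: y[n] = sum over k of coeffs[k] * samples[n - k]."""
    return [sum(c * samples[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
            for n in range(len(samples))]
```

One output sample leaves the pipeline per input sample regardless of the number of taps, which is where a 16-cell array gets its advantage over a sequential loop of 16 multiply-accumulate instructions.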


>>> flowware-based <<<


Configware / Flowware Compilation

[Figure: compilation flow with the blocks: high level source program, wrapper, intermediate, mapper (configware), scheduler (flowware), r. Data Path Array (rDPA), asM memory blocks (M) with data sequencer / address generator, and data streams]


Configware / Flowware Co-Compilation

[Figure: co-compilation flow with the blocks: high level source program, wrapper, intermediate, mapper (configware), scheduler (flowware), r. Data Path Array (rDPA), memory blocks (M) with data sequencer / address generator, and data streams]


>>> 2nd machine paradigm <<<


Matter & Antimatter

The World of Matter: machine paradigm: the Atom

The World of Anti Matter: machine paradigm: Anti Atom

[Figure: atom and anti atom with opposite charge signs]


Matter & Antimatter of Informatics :

Anti Machine paradigm

nothing central !

[Figure: CPU drawn as an atom and DPU drawn as an anti atom, with opposite charge signs]


SNN filter on KressArray

mapping algorithms efficiently onto rDPA

array size: 10 x 16 = 160 rDPUs

[Figure: mapped array. Legend: rDPU not used; used for routing only (rout thru only); operator and routing; port location marker; backbus connect]

by the way: example of scalability / relocatability by EDA support

also FPGA scalability (avoid routing congestion) by EDA solution


One more argument for coarse grain

we have already seen on the first day:

[Figure: MOPS / mW (0.001 to 1000, logarithmic) plotted against feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07 µm); source: T. Claasen et al.: ISSCC 1999]

Wiring by abutment: a 32 Bit KressArray example

if coarse grain cells are full custom and mesh-connected, and 2nd level interconnect resources are laid out over the cells, the array is almost as area-efficient as hardwired

*) R. Hartenstein: ISIS 1997


The Secret of Success: Co-Compilation

[Figure: co-compilation flow with the blocks: High level PL source, Analyzer / Profiler, Partitioner, SW compiler (SW code, “vN” machine paradigm), CW compiler (CW Code, anti machine paradigm), Resource Parameters]

Resource Parameters: supporting different platforms

supporting platform-based design

could provide the platforms
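The role of the Partitioner between the Analyzer / Profiler and the two compilers can be sketched as follows. This is a deliberately simple greedy heuristic chosen for illustration, not the partitioning algorithm used in the author's tools; the `Task` fields, the rDPU budget as a "resource parameter" and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    sw_time: float  # profiled run time on the vN host (assumed given)
    cw_time: float  # estimated run time on the rDPA (assumed given)
    rdpus: int      # rDPUs needed when mapped to the rDPA (must be > 0)


def partition(tasks, rdpu_budget):
    """Greedy sketch: migrate the tasks with the best time saving
    per rDPU to configware until the resource budget is used up."""
    candidates = sorted(
        (t for t in tasks if t.sw_time > t.cw_time),
        key=lambda t: (t.sw_time - t.cw_time) / t.rdpus,
        reverse=True,
    )
    cw, used = [], 0
    for t in candidates:
        if used + t.rdpus <= rdpu_budget:
            cw.append(t.name)
            used += t.rdpus
    sw = [t.name for t in tasks if t.name not in cw]
    return sw, cw


tasks = [Task("fir", 100, 5, 16), Task("ui", 10, 50, 4),
         Task("dct", 80, 10, 64), Task("crc", 20, 2, 8)]
sw_part, cw_part = partition(tasks, rdpu_budget=32)
```

Changing `rdpu_budget` models retargeting the same source to a different platform: with a budget of 32 rDPUs only `fir` and `crc` migrate, and `dct` stays in software because it does not fit.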


Machine Paradigms

machine category | Computer (the Machine: “v. Neumann”) | The Anti Machine
driven by: | Instruction streams | data streams (no “dataflow”)
engine principles | instruction sequencing | sequencing data streams
state register | single program counter | (multiple) data counter(s)
Communication path set-up (“instruction fetch”) | at run time | at load time
data path: resource | DPU (e.g. single ALU) | DPU or DPA (DPU array) etc.
data path: operation | sequential | parallel pipe network etc.

also hardwired implementations*

*) e. g. Bee project, Prof. Brodersen



Why Coarse Grain instead of FPGA ?

[Figure: transistors / chip (1000 to 100 000 000 000, logarithmic) over the years 1980 to 2010, with curves labelled physical, logical, FPGA physical, FPGA logical and FPGA routed; gaps marked ~ 10 and ~ 10 000. Sources: Proc ISSCC, ICSPAT, DAC, DSPWorld]

reduced reconfigurability overhead by up to ~ 1000

drastically smaller configuration memory

much faster loading

many more benefits


KressArray Family generic Fabrics: a few examples

Select Function Repertory

Select Nearest Neighbour (NN) Interconnect: an example

Select mode, number, width of NNports

more NNports: rich Rout Resources

[Figure: rDPU cells with NNport variants labelled 2, 4, 8, 16, 24, 32, shown as rout-through only and as rout-through and function]

Examples of 2nd Level Interconnect: laid out over the rDPU cell - no separate routing areas !

http://kressarray.de


Changing Models of Computing

[Figure: two models of computing.
“von Neumann” host: instruction sequencer, data path, I / O and RAM; (procedural) Software is downloaded into RAM; design task: software design.
Host with hardwired accelerator(s): Software is downloaded into RAM, the accelerator hardware comes from CAD and a hardware spec; design task: hardware/software co-design; hardware people needed.]

the problem with typical CS people:

- the dominance of von Neumann

- they cannot partition

- they cannot migrate


66

Changing Models of Computing

• “von Neumann”: (procedural) Software, downloaded into RAM; instruction sequencer, data path, I / O; software design
• host with hardwired accelerator(s): Software downloaded into RAM; Hardware from CAD; hardware/software co-design
• host with reconf. accelerator(s): Software and (structural) Configware, each downloaded into RAM; Morphware; configware/software co-design
• hardware/configware/software co-design


67

Super Pipe Networks

| array | applications | pipeline properties: shape | pipeline properties: resources | mapping | scheduling (data stream formation) |
| --- | --- | --- | --- | --- | --- |
| systolic array | regular data dependencies only | linear only | uniform only | linear projection or algebraic synthesis | linear projection or algebraic synthesis |
| super-systolic rDPA *) | no restrictions | no restrictions | no restrictions | simulated annealing or P&R algorithm | (e.g. force-directed) scheduling algorithm |

*) KressArray [1995]


68

>>> distributed memory <<<


69

instruction stream-based Compilation Principles

scheduler

parser

source text

library

link/load instruction call placement

1-D memory space

execution order by location


70

Datastream-based Compilation Principles

library

data stream assembly

scheduler

mapper placement & routing


71

>>> flowware languages <<<


72

Similar Programming Language Paradigms

both language categories: deterministic procedural sequencing; traceable, checkpointable

| language category | Computer Languages | Xputer Languages |
| --- | --- | --- |
| sequencing driven by | read next instruction, goto (instruction addr.), jump (to instruction addr.), instruction loop, instruction loop nesting, no parallel loops, instruction loop escapes, instruction stream branching | read next data object, goto (data addr.), jump (to data addr.), data loop, data loop nesting, parallel data loops, data loop escapes, data stream branching |


73

JPEG zigzag scan pattern

Flowware language example (MoPL)

*> Declarations

    EastScan is
      step by [1,0]
    end EastScan;

    SouthScan is
      step by [0,1]
    end SouthScan;

    NorthEastScan is
      loop 8 times until [*,1]
        step by [1,-1]
      endloop
    end NorthEastScan;

    SouthWestScan is
      loop 8 times until [1,*]
        step by [-1,1]
      endloop
    end SouthWestScan;

    HalfZigZag is
      EastScan
      loop 3 times
        SouthWestScan
        SouthScan
        NorthEastScan
        EastScan
      endloop
    end HalfZigZag;

Main program:

    goto PixMap[1,1]
    HalfZigZag;
    SouthWestScan
    uturn (HalfZigZag)

[Figure: x/y pixel map; the data counter traces the two HalfZigZag paths]
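The MoPL zigzag program can be mimicked in ordinary Python to see which addresses the data counter visits. This is only a sketch under stated assumptions: the `DataCounter` class and all function names are hypothetical, the `until` condition is checked after each step, and `uturn (HalfZigZag)` is assumed to replay the step vectors of `HalfZigZag` in reverse order. The slides do not define these semantics.

```python
class DataCounter:
    """Hypothetical 2-D data counter holding the current [x, y] address."""

    def __init__(self, x, y):
        self.x, self.y = x, y
        self.trace = [(x, y)]  # every address visited so far
        self.steps = []        # every step vector applied so far

    def step(self, dx, dy):
        """MoPL 'step by [dx,dy]': move the data counter by one step vector."""
        self.x += dx
        self.y += dy
        self.trace.append((self.x, self.y))
        self.steps.append((dx, dy))


def east_scan(dc):
    dc.step(1, 0)


def south_scan(dc):
    dc.step(0, 1)


def north_east_scan(dc):
    # loop 8 times until [*,1]: stop as soon as row y = 1 is reached
    for _ in range(8):
        dc.step(1, -1)
        if dc.y == 1:
            break


def south_west_scan(dc):
    # loop 8 times until [1,*]: stop as soon as column x = 1 is reached
    for _ in range(8):
        dc.step(-1, 1)
        if dc.x == 1:
            break


def half_zigzag(dc):
    east_scan(dc)
    for _ in range(3):
        south_west_scan(dc)
        south_scan(dc)
        north_east_scan(dc)
        east_scan(dc)


def zigzag():
    dc = DataCounter(1, 1)        # goto PixMap[1,1]
    half_zigzag(dc)
    first_half = list(dc.steps)   # remember the step vectors of HalfZigZag
    south_west_scan(dc)           # long anti-diagonal from [8,1] to [1,8]
    # Assumed meaning of uturn(HalfZigZag): same steps, reverse order.
    for dx, dy in reversed(first_half):
        dc.step(dx, dy)
    return dc.trace


if __name__ == "__main__":
    trace = zigzag()
    print(len(trace), trace[0], trace[-1])
```

Under these assumptions the trace touches each of the 64 positions of the 8x8 block exactly once, without any instruction fetch being involved in the address sequence itself.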


75

>>> address generators <<<


76

GAG generic address generator Scheme

[Diagram labels: Base Slider (B0, B); Limit Slider (L0, L); Address Stepper (ΔA, A); limit]

all 3 are copies of the same BSU stepper circuit


77

GAG Slider Model

GAG: Generic Address Generator

[Diagram labels: LimitStepper, BaseStepper, AddressStepper; registers B0, B, L0, L, ΔA, A; sliders; floor; ceiling]


78

GAG: Address Stepper

GAG = Generic Address Generator

[Block diagram labels: Limit (L), Base (B0), stepVector (ΔA), maxStepCount, init tag; Step Counter, End Detect, Escape Clause; Address (A), endExec; limit; further labels "+ /" and "=o"]
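How the three steppers could cooperate can be illustrated with a small Python sketch. It is an assumption-laden model, not the published circuit: the function names `stepper` and `gag`, the slider step widths `d_b` and `d_l`, and the `max_lines` bound are hypothetical; only the names Base, Limit, stepVector (ΔA), maxStepCount, floor and ceiling come from the slides. For simplicity the sketch handles positive address steps only.

```python
def stepper(start, limit, delta, max_steps):
    """Yield start, start + delta, ... up to limit, at most max_steps values."""
    if delta <= 0:
        raise ValueError("this sketch only handles positive step widths")
    value = start
    for _ in range(max_steps):   # step counter bounded by maxStepCount
        if value > limit:        # end detect: limit passed
            return
        yield value
        value += delta


def gag(b0, l0, d_a, d_b, d_l, floor, ceiling, max_lines, max_steps=1024):
    """Generate addresses line by line while base and limit slide along."""
    base, limit = b0, l0
    for _ in range(max_lines):
        # Escape: stop when a slider leaves [floor, ceiling] or they cross.
        if base < floor or limit > ceiling or base > limit:
            return
        yield from stepper(base, limit, d_a, max_steps)  # address stepper
        base += d_b    # base slider moves (assumed step width d_b)
        limit += d_l   # limit slider moves (assumed step width d_l)


if __name__ == "__main__":
    # 4-wide, 3-high block inside a row-major memory with 8 words per row
    print(list(gag(10, 13, 1, 8, 8, 0, 63, 3)))
    # shrinking scan lines: the base slides, the limit stays
    print(list(gag(0, 3, 1, 1, 0, 0, 3, 10)))
```

In this model, changing only the slider parameters turns a rectangular block scan into a triangular one, which is the kind of generic sequence the sliders are meant to cover.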


79

GAG Complex Sequencer Implementation

[Diagram labels: three copies of a block with Limit Slider, Base Slider, Address Stepper (B0, ΔA, L0, A) and GAU; GAG (Generic Address Generator); SDS; VLIW stack]

all published in 1990


80

Generic Sequence Examples

[Figure: example address sequences a) to g), with a GAG consisting of Limit Slider, Base Slider and Address Stepper (B0, ΔA, L0, A)]


81

GAG Slider Operation Demo Example

[Figure labels: x, y; address A; base B and limit L (initial values B0, L0); floor F; ceiling C]


82

Storage scheme optimization: scanline unrolling

MoM anti machine architecture

[Figure: scan window with handle positions on an x/y plane; scan pattern (high level sequencing); example of intra scan window accesses (low level sequencing)]

[Figure: read (r) and read/write (r/w) access maps over memory banks a and b for the stages: initial design; after scan line unrolling; after inner scan line loop unrolling; hardw. level access optim.; final design]


83

Computer: the wrong Machine Paradigm (“von Neumann”)

• Compiler, RAM (instructions), hardwired Sequencer (program counter: state register), Datapath
• tightly coupled by compact instruction code
• “von Neumann” does not support soft data paths

Xputer: The Soft Machine Paradigm (anti machine)

• Compiler, Scheduler, RAM, (multiple) sequencer with data counters, reconfigurable Datapath Array (“instructions”)
• loosely coupled by decision data bits only
• also for hardwired

84

Binding Time vs. Computing Domain

programming domain: time domain (procedural); time & space (hybrid); space domain (structural)

Binding time (Set-up of Communication Channels):
• before fabrication: full custom ICs
• later fabrication step: ASICs
• at compile time
• at loading time
• at run time: microprocessor, parallel computer

Further chart entries: Reconfigurable Computing; array processor; systolic arrays; supersystolic arrays


86

Paradigm Shifts: Nick Tredennick's view

• instruction-stream-based computing: algorithms variable, resources fixed
• reconfigurable computing: algorithms variable, resources variable

programmable

why 2 program sources ?


87

Compilation for (r)DPA of anti machine

[Diagram labels: high level source program (software notation); expression tree; mapper; scheduler; DPU library; wrapper; parameters; code generators; configware; flowware; streamware; morphware]


88

machine paradigm: some differences

[Figure: CPU, DPA and DPU marked with + and - signs]

no. of streams ≥ 1


89

Annihilation?

[Figure: + and - signs]

avoidable by tools ....


90

Matter & Antimatter: Atom and Anti Atom

The World of Matter. Machine paradigm: the Atom

Anti Matter. Machine paradigm: Anti Atom

[Figure: atom and anti atom drawn with + and - signs]


91

Parallelism by Concurrency

[Figure: several units marked with + and - signs]

independent instruction streams


92

Co-Compilation

We introduce: Co-Compilation

• high level programming language source
• partitioning compiler
• Computer Machine Paradigm: Software running on µProcessor
• Xputer “Soft” Machine Paradigm: Configware running on Reconfigurable Accelerators
• interface between µProcessor and Reconfigurable Accelerators

Reconfigurable Architecture (RA) -- instead of hardwired


93

Loop Transformation Examples

sequential processes: resource parameter driven Co-Compilation

• host: loop 1-16 body endloop
• strip mining: fork; loop 1-8 body endloop; loop 9-16 body endloop; join
• loop unrolling: loop 1-8 body body endloop
• reconf. array: loop 1-8 trigger endloop; loop 1-4 trigger endloop; loop 1-2 trigger endloop
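The strip mining and loop unrolling transformations named on this slide can be sketched in Python. The loop body (squaring the index) and all function names are hypothetical placeholders; the slide only gives the loop bounds. The point of the sketch is that both transformed loops compute the same result as the original 16-iteration loop.

```python
def body(i, out):
    """Hypothetical loop body: record the square of the loop index."""
    out.append(i * i)


def original():
    out = []
    for i in range(1, 17):            # loop 1-16 body endloop
        body(i, out)
    return out


def strip_mined():
    strips = []
    # fork: two independent strips that could run on separate resources
    for lo, hi in ((1, 8), (9, 16)):  # loop 1-8 ... / loop 9-16 ...
        out = []
        for i in range(lo, hi + 1):
            body(i, out)
        strips.append(out)
    return strips[0] + strips[1]      # join


def unrolled():
    out = []
    for k in range(1, 9):             # loop 1-8 body body endloop
        body(2 * k - 1, out)          # first copy of the body
        body(2 * k, out)              # second copy of the body
    return out


if __name__ == "__main__":
    print(original() == strip_mined() == unrolled())
```

Strip mining leaves the body untouched and splits the iteration space, so the strips can be assigned to different resources; unrolling halves the trip count and doubles the work per iteration.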


94

"new" terms

• Software: you all know (programming instruction streams)
• Flowware*: schedules data streams; similar to software, but by data counter manipulation (programming data streams instead of instruction streams)
• Configware: sources for programming morphware
• Hardware: you all know (not programmable)
• Morphware: structurally programmable "hardware"

(only some terms are "new", however, not their subject)

clean terminology needed for taxonomy and comprehensibility

*) flowware is not related to the "dataflow machine"

Granularity defines block path width:
• fine grain: 1-2 bit
• coarse grain: > 2 bit
• multi grain: > 2 bit, variable

[Figure boxes: algorithms variable, resources variable; algorithms variable, resources fixed]


95

Why data streams are a common model

• Flowware: to schedule data streams
• Configware: programming the resources
• all other details are defined here:

Nick Tredennick's paradigm shifts:
• CPU: algorithms variable, resources fixed
• fully hardwired: algorithms fixed, resources fixed; not programmable
• reconfigurable: algorithms variable, resources variable

*) only one source needed

Data streams (flowware) are derived from configware that has been compiled before.

Data stream execution resources: distributed memory architectures. This new discipline came just in time.

see Herz et al.: Proc. IEEE ICECS 2002. Link (via "recent talks") also here: