15
[email protected] ; Reiner Hartenstein, University of Kaiserslautern, Germany http://hartestein.de Swedish INTELECT Summer School on Multiprocessor Systems on Chip Reconfigurable Computing and its Compilation Techiques Reiner Hartenstein Kaiserslautern University of Technology Örebro, Aug. 25-27, 2003 8.45 10.30 hrs © 2003, [email protected] http://hartenstein.de Kaiserslautern University of Technology 2 Reconfigurable Computing: a second programming domain Migration of programming to the structural domain The opportunity to introduce the structural domain to programmers ... The structural domain has become RAM-based ... to bridge the gap by clever abstraction mechanisms using a simple new machine paradigm © 2003, [email protected] http://hartenstein.de Kaiserslautern University of Technology 3 >> outline << why coarse grain reconfigurable ? terminology toward higher abstraction levels flowware languages why a new Machine Paradigm ? (co-) compilation techniques final remarks http://www.uni-kl.de © 2003, [email protected] http://hartenstein.de Kaiserslautern University of Technology 4 granularity Datapath width 1 bit CLB: fine grain Word level CFB: coarse grain bundling of nibble or byte width CFBs: multiple granularity © 2003, [email protected] http://hartenstein.de Kaiserslautern University of Technology 5 One more argument for coarse grain 100 1000 10 1 0.1 0.01 0.001 2 1 0.5 0.25 0.13 0.1 0,07 MOPS / mW μ feature size T. Claasen et al.: ISSCC 1999 Wiring by abutment: a 32 Bit KressArray example if coarse grain cells are full custom and mesh-connected, and 2 nd level interconnect ressources layouted over the cells *) R. Hartenstein: ISIS 1997 the array is almost as area-efficient as hardwired we have already seen the first day: © 2003, [email protected] http://hartenstein.de Kaiserslautern University of Technology 6 array size: 10 x 16 = 160 rDPUs mapping algorithms efficently onto rDPA rout thru only not used backbus connect SNN filter on KressArray by the way: example of scalability / relocatability by EDA support also FPGA scalability (avoid routing congestion) by EDA solution

Reiner Hartenstein, University of Kaiserslautern, Germany ... · Reiner Hartenstein, University of Kaiserslautern, Germany ... Reiner Hartenstein, University of Kaiserslautern,

Embed Size (px)

Citation preview

[email protected]

;

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartestein.de

Swedish INTELECT

Summer School

on Multiprocessor

Systems on Chip

Reconfigurable Computing and its Compilation Techiques

Reiner Hartenstein

Kaiserslautern University of Technology

Örebro, Aug. 25-27, 2003

8.45 – 10.30 hrs

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

2

Reconfigurable Computing: a second programming domain

Migration of programming to the structural domain

The opportunity to introduce the structural domain to programmers ...

The structural domain has become RAM-based

... to bridge the gap by clever abstraction mechanisms using a simple new machine paradigm

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

3

>> outline <<

• why coarse grain reconfigurable ? • terminology • toward higher abstraction levels • flowware languages • why a new Machine Paradigm ? • (co-) compilation techniques • final remarks

http://www.uni-kl.de

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

4

granularity

Datapath width 1 bit CLB: fine grain

Word level CFB: coarse grain

bundling of nibble or byte width CFBs: multiple granularity

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

5

One more argument for coarse grain

100

1000

10

1

0.1

0.01

0.001 2 1 0.5 0.25 0.13 0.1 0,07

MOPS / mW

µ feature size

T. Claasen et al.: ISSCC 1999

Wiring by abutment: a 32 Bit KressArray example

if coarse grain cells are full custom and

mesh-connected, and 2nd level interconnect

ressources layouted over the cells

*) R. Hartenstein: ISIS 1997

the array is almost as

area-efficient as hardwired

we have already seen the first day:

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

6

rDPU not used used for routing only operator and routing port location markerLegend: backbus connect

array size: 10 x 16 = 160 rDPUs

mapping algorithms efficently onto rDPA

rout thru only

not used backbus connect

SNN filter on KressArray

by the way: example of scalability / relocatability by EDA support

also FPGA scalability (avoid routing congestion) by EDA solution

[email protected]

;

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartestein.de

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

7

rDPU not used used for routing only operator and routing port location markerLegend: backbus connect

route-thru-only rDPU

3 vert. NNports, 32 bit

http://kressarray.de

Xplorer Plot: SNN Filter Example

+ [13]

2 hor. NNports, 32 bit

operator

result

operand

operand

route thru

backbus connect

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

8

PACT XPP: Reference Module: XPU128 Co-Processor

XPP128 rDPA

• Evaluation Board • XDS Development Tool with Simulator

buses not

shown

rDPU

CF

G

PAE

core

ALU CtrlALU

CF

GC

FG

PAE

core

CF

GC

FG

PAE

core

PAE

core

ALU CtrlALUALU CtrlALU

CF

GC

FG

CF

GC

FG

• all used by SIEMENS Corporation • Other contractors preparing .... : ask Ron Mabry (here in the audience)

• Full 32 or 24 Bit Design working silicon • 2 Configuration Hierarchies

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

9

microprocessor architectures (8)

TU Dresden, 09.05.2003

©Arndt Bode LRR-TUM 9

Mikroprozessorarchitekturen (8):

hochgradig parallele Systeme

E/A SRAM PE PE PE PE PE PE PE PE PE SRAM

E/A

SRAM PE PE PE PE PE PE PE PE PE SRAM

SRAM PE PE PE PE PE PE PE PE SRAM PE

SRAM PE PE PE PE PE PE PE PE SRAM PE

SRAM PE PE PE PE PE PE PE PE SRAM PE

SRAM PE PE PE PE PE PE PE PE SRAM PE

SRAM PE PE PE PE PE PE PE PE SRAM PE E/A E/A

Konfigu-

ration

Manager

©Arndt Bode LRR-TUM © 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

10

XPP64A: Platform Development Board

- SDR Board In Debug Phase -> XPP64A Chips from STMicro Fab

- Assembly & Test / Available March 2003

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

11

PACT Corp

• Xtreme Processor Platform (XPP) family of IP cores, high-speed data-stream-capable, scalable, reconfigurable clusters of arrays of 32-bit DPUs with embedded memories, and high-speed I/O ports -

• Application development support software featuring a flow graph-style algorithm mapping language - to minimize training requirements.

• XPP's fabrics, featuring automatic DataFlow synchronization and flagged Event Network to dynamically configure the execution flow,

• Supports dynamic RTR: hierarchical configuration managers free the designer from chip-level details and ensure that configurations are independently loaded in exactly the intended order.

• Automatic event-based task swapping along with data streams: released resources automatically reconfigured immediately

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

12

microprocessor architectures (1)

©Arndt Bode LRR-TUM 12

Entwicklung der Mikroprozessor Architekturen (1)

Bis 1995: Einschränkung - , seit 1995 Erhöhung der Typen- und Architekturvielfalt

Transistorzahl (Moore‘s Gesetz): Abwägung Rechenleistung-Leistungsaufnahme-Kosten-

Kompatibilität

MPR Analysts‘ Choice Awards Kategorien:

- PC Processors: Intel P4 (HyperThreading), AMD Athlon (x 86-64,

Hyper Transport), Transmeta (Binary Compilation, VLIW),...

- Server Processors: Intel Xeon MP und Itanium 2 (EPIC), AMD Opteron

(x86-64), HP Alpha EV-7, Fujitsu Sparc 64 V (out-of-order superscalar)

- High-Performance Embedded Processors: Broadcom BCM 1250, IBM 440

GX, Intrinsity FastMIPS, Motorola MPC 7455, NEC VR7701, PMC Sierra

RM9000x2

- Low-Power Embedded Processors: AMD Au1100, Intel PXA 250, NEC VR

4131, DragonBall MX1, NeoMagic MiMagic5 (1mW pro MHz)

- Extreme Processors: CmU PipeRench, Intrinsity FastMath, Micron Yukon,

NEC DRP, PACT XPP, Sandbridge Sand Blaster (bis 512 ALUs)

- Embedded IP Processor Cores: ARCtangent-A5, ARM 1026 EJ-S/1136JF-S,

Improv Crescendo, MIPS M4K, Tensilica Xtensa V

- Graphics Processors: 3Dlabs Wildcat VP900, ATI Radeon 9700, Nvidia GeForce FX

[email protected]

;

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartestein.de

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

13

wide variety of speed-up factors

platform application speed-up factor method

PACT Xtreme 4-by-4 array [2003]

16 tap FIR filter x16 MOPS/mW straight forward

*) MPC fabrication via E.I.S. multi university project

key issue: algorithmic cleverness

MoM anti machine with DPLA* [1983]

grid-based DRC**

1-metal 1-poly nMOS 256 reference patterns

> x1000

(computation time)

multiple aspects

**) Design Rule Check

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

14

instruction stream-based Compilation Principles

scheduler

parser

source text

library

link/load instruction call placement

1-D memory space

execution order by location

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

15

Datastream-based Compilation Principles

library

data stream assembly

scheduler

mapper placement & routing

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

16

Sequential Processor Model

Conventional processors use the sequential model:

Each operation takes one clock cycle.

Multiple operations are computed consecutively.

Register Operation 1

Time

Operation 2

Operation 3

Operation 4

Operation 5

© 2003, PACT AG

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

17

A New Parallel Processor Paradigm

Multiple computations are configured as code sections onto

a two dimensional array.

Time

x

y

Data Buffer

© 2003, PACT AG

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

18

Parallel Processor Model

Multiple code sections are computed sequentially.

Time

y

x Operation 2

Section 1

Section 2

Section 3

© 2003, PACT AG

[email protected]

;

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartestein.de

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

19

Dataflow Performance

Traditional Microprocessor XPP Architecture

SHIFT

Instruction

Memory and cache

Configuration

Memory and cache

MULT

One operation

per cycle

Many

operations

per cycle FFT

Viterbi

Basic machine operations

performed on

single words

Complex Functions

performed on

data streams

ADD

Register

One word

ALU Buffer

Stream

of words

Array of ALUs

Filter

© 2003, PACT AG

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

20

1

8 7

3 2 4

3 13

5

3 13

6

3 13

Dataflow Synchronisation: Transport Triggered

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

21

Matrix Multiplication

Flow Graph

x

+

x a

x

c

x

y b

x

d

+

x’ y’

a b

c d

x x

y =

ax+by

cx+dy =

x’

y’

Matrix is Constant

XPP: Parallel Algorithm Example PACT

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

22

I/O I/O

• SCM configures Opcodes and Constant Registers via CM

x

+

x a

x

c

x

y b

x

d

+

x’ y’

SCM

+

CM

MUL

a

Note: MAC Opcode is not used in this example to improve clarity of the presentation

XPP: Parallel Algorithm Example

PACT

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

23

MUL

a

MUL

c

MUL

d

MUL

b

ADD ADD • CM Configures

Opcodes and Constant Registers

x

+

x a

x

c

x

y b

x

d

+

x’ y’

out_x out_y in_x mul1 mul2 mul3 mul4 in_y

adder2 adder1

XPP: Parallel Algorithm Example

PACT

SCM

+

CM

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

24

MUL

a

MUL

c

MUL

d

MUL

b

ADD ADD • CM Configures

Routing Resources

x

+

x a

x

c

x

y b

x

d

+

x’ y’

out_x out_y in_x mul1 mul2 mul3 mul4 in_y

adder2 adder1

x

y

x’

y’

XPP: Parallel Algorithm Example

PACT

SCM

+

CM

[email protected]

;

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartestein.de

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

25

I/O I/O MUL

a

MUL

c

MUL

d

MUL

b

ADD ADD

x

+

x a

x

c

x

y b

x

d

+

x’ y’

• Data Packets are routed through the Network

out_x out_y in_x mul1 mul2 mul3 mul4 in_y

adder2 adder1

x

y

x’

y’

XPP: Parallel Algorithm Example

PACT

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

26

>> terminology <<

• why coarse grain reconfigurable ? • terminology • toward higher abstraction levels • flowware languages + mapping • why a new Machine Paradigm ? • (co-) compilation techniques • final remarks

http://www.uni-kl.de

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

27

Tredennick’s Paradigm Shifts

TTL µproc., memory

custom

standard

1957

1967

1977

1987

1997

2007

ASICs, accel’s

LSI, MSI

hardwired

algorithm: fixed

resources: fixed

procedural programming

algorithm: variable

resources: fixed

structural programming

algorithm: variable

resources: variable

vN machine paradigm

new machine paradigm needed

2 sources

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

28

Paradigm Shifts: Nick Tredennick‘s view

algorithms variable

resources fixed

instruction-stream-based computing:

algorithms variable

resources variable

reconfigurable computing:

programmable

why 2 program sources ?

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

29

Co-Compilation programming media and platforms

µProcessor

program Memory

instruction stream

configuration Memory

bit stream

Reconfigurable

Accelerators inte

rface

asM - auto sequencing

data Memory

data streams

...

data adress generators

hardware

configware

software

morphware

flowware

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

30

flowware defines ....

DPA

x x x

x x x

x x x

|

| |

x x

x

x

x

x

x x

x

- -

-

input data streams

x x

x

x

x

x

x x

x

- -

-

-

-

-

-

-

-

-

-

-

x x x

x x x

x x x

|

|

|

|

|

|

|

|

|

|

|

| output data streams

time

port #

time

time

port # time

port #

... which data item at which time at which port

Placement & routing (configware) done:

[email protected]

;

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartestein.de

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

31

Terminology: Digital System Platforms clearly distinguished

platform source running on it

machine paradigm

hardware (not running on it)

none

morphware

fine grain rGA (FPGA) configware

coarse grain

rDPU, rDPA reconfigurable data stream processor

flowware & configware anti machine

data stream processor (hardwired) flowware

instruction stream processor software von Neumann machine

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

32

>> higher abstraction levels <<

• why coarse grain reconfigurable ? • terminology • toward higher abstraction levels • flowware languages + mapping • why a new Machine Paradigm ? • (co-) compilation techniques • final remarks

http://www.uni-kl.de

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

33

„EDA industry shifts into CS mentality“ [Wojciech Maly]

• patches instead of engineering

• innovation stalled many years ago

• netlist-based: do not care about efficiency, ...

• ... do not care about transistor density

• 85% users hate their tools

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

34

Paradigm

Shift

Mainstream

Tornado

Development of Hypergrowth Markets

Harper Business 1995

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

35

McKinsey Curve: dynamics of R&D disciplines

maturity of a discipline

year

fundmental issues

consolidation

saturation: limitations met

new discipline on top of it by ....

... by innovation

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

36

EDA Industry Revolutions

1978

Transistor entry: Applicon, Calma, CV ...

1992

Synthesis: Cadence, Synopsys ... 1985

Schematics entry: Daisy, Mentor, Valid ...

courtesy [Keutzer / Newton]

EDA industry paradigm switching every 7 years

1999 HLLs, (Co-) Compilation

Data-Stream-based DPU arrays

2006 coming closer to programmers‘ mind set

[email protected]

;

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartestein.de

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

37

SoC System level Design: Embedded SW (ESW)

new design automation from high level descriptions

ESE becomes the main focus in system design:

HW-(E)SW codesign onto highly programmable platforms (SoC)

ESW becomes main vehicle to product differentiation

formal verification for (E)SW

HW-(E)SW-co-verificationH.]

SW synthesis included (SoC)

CW-

CW and

CW-

and CW

(ECW)

ECW

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

38

Complexity: System Level Design Challenge

language infrastructures for complex models (SystemC etc.)

must be leveraged by industry consensus on use-methodology and abstraction levels”

[ITRS 2001]

from HW + (processor-dependent embedded) C code level

“abstraction levels must be raised above present-day RT-level

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

39

>> flowware languages <<

• why coarse grain reconfigurable ? • terminology • toward higher abstraction levels • flowware languages + mapping • why a new Machine Paradigm ? • (co-) compilation techniques • final remarks

http://www.uni-kl.de

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

40

x x x

x x x

x x x

|

| |

x x

x

x

x

x

x x

x

- -

-

input data streams

x x

x

x

x

x

x x

x

- -

-

-

-

-

-

-

-

-

-

-

x x x

x x x

x x x

|

|

|

|

|

|

|

|

|

|

|

| output data streams

time

port #

time

time

port # time

port #

good reading: Nikolay Petkov: Systolic Parallel Processing; North-Holland; 1992

DPA

DPU architecture

math formula

linear projection or

algebraic method

preprocessing

only uniform DPA with linear pipes: only for applications with strictly regular data dependencies

mathematic methods for systolic array synthesis

map

ping

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

41

Compilation for (r)DPA of anti machine

mapper

scheduler

expressionmorphware

configware

streamware

tree

high level source program

wrapperparameters

codegenerators

DPU library

(software notation)

flowware

simulated annealing

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

42

Super Pipe Networks

pipeline properties array applications

shape resources

mapping scheduling

(data stream formation)

systolic array

regular data

dependencies only

linear only

uniform only

linear projection or algebraic synthesis

super-systolic rDPA

no restrictions simulated

annealing or P&R algorithm

(e.g. force-directed) scheduling algorithm

*

*) KressArray [1995]

[email protected]

;

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartestein.de

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

43

Programming Language Paradigms

language category Computer Languages Languages f. Anti Machine

both deterministic procedural sequencing: traceable, checkpointable

operation sequence driven by:

read next instruction, goto (instr. addr.),

jump (to instr. addr.), instr. loop, loop nesting

no parallel loops, escapes, instruction stream branching

read next data item, goto (data addr.),

jump (to data addr.), data loop, loop nesting, parallel loops, escapes, data stream branching

state register program counter data counter(s)

address computation

massive memory cycle overhead overhead avoided

Instruction fetch memory cycle overhead overhead avoided

parallel memory bank access interleaving only no restrictions

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

44

Basics of Binding Time

run time

loading time

compile time

time of “Instruction Fetch”

microprocessor parallel computer

Reconfigurable Computing

anti machine

v.N. machine

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

45

Similar Programming Language Paradigms

language category Computer Languages Xputer Languages

both deterministic procedural sequencing: traceable, checkpointable

sequencingdriven by:

read next instruction, goto (instruction addr.), jump (to instruction addr.), instruction loop, instruction loop nesting no parallel loops, instruction loop escapes, instruction stream branching

read next data object, goto (data addr.), jump (to data addr.), data loop, data loop nesting, parallel data loops, data loop escapes, data stream branching

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

46

JPEG zigzag scan pattern

x

y

*> Declarations

HalfZigZag is EastScan loop 3 times SouthWestScan SouthScan NorthEastScan EastScan endloop end HalfZigZag;

goto PixMap[1,1]

HalfZigZag; SouthWestScan reverse(uturn(HalfZigZag))

HalfZigZag

data counter data counter

data counter data counter

HalfZigZag

EastScan is step by [1,0] end EastScan;

SouthWestScan is loop 8 times until [1,*] step by [-1,1] endloop end SouthWestScan;

SouthScan is step by [0,1] endSouthScan;

NorthEastScan is loop 8 times until [*,1] step by [1,-1] endloop end NorthEastScan;

Flowware language example (MoPL)

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

47

>> new Machine Paradigm <<

• why coarse grain reconfigurable ? • terminology • toward higher abstraction levels • flowware languages + mapping • why a new Machine Paradigm ? • (co-) compilation techniques • final remarks

http://www.uni-kl.de

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

48

CS: young ? dynamic?

.. but the von Neumann Paradigm is still the dominant doctrine ...

Microelectronics is ignored (except falling cost

of computational effort)

... still pushing he basic models from the times of mainframe dinosaurs

after >10 technology generations ...

• 1th 4004 • 2nd 8008 • 3rd 8086 • 4th 80286 • 5th 80386 • 6th 80486 • 7th P5 (Pentium) • 8th P6 (Pentium Pro / Pentium II) • 9th Pentium III • 10th .... • 11th

• .......

... the vN Microprocessor is a methusela, the steam engine of the silicon age.

[email protected]

;

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartestein.de

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

49

MPU designs more complex

greatly complicates the verification process

chip-level multiprocessing + simultaneous multithreading

many bugs relate to concurrency issues

new kinds of concurrency are becoming important

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

50

„Pollack‘s Law“ (simplified) [intel]

growth factor

µm

0.1

performance

area efficiency

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

51

KressArray principles

• take systolic array principles

• replace classical synthesis by simulated annealing

• yields the super systolic array

• a generalization of the systolic array

• no more restricted to regular data dependencies

• now reconfigurability makes sense

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

52

control-procedural vs. data-procedural

The structural domain is primarily data-stream-based:

..... mostly not yet modelled that way: most flowware is hidden by its indirect

instruction-stream-based implementation

Flowware provides a (data-)procedural abstraction from the (data-stream-based) structural domain

Flowware converts „procedural vs. structural“ into „control-procedural vs. data-procedural“ ...

... a Troyan horse to introduce the structural domain to the procedural mind set of programmers

Flowware

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

53

Why a dichotomy of machine paradigms?

data stream machine:

• bad message: caches do not help

• good message: no vN bottleneck

• caches not needed stolen from Bob Colwell

vN bottleneck vN: unbalanced

The anti machine has no von Neumann bottleneck

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

54

computing paradigms and methodologies

1946: machine paradigm (von Neumann)

1980: data streams (Kung, Leiserson)

1989: anti machine paradigm

1990: rDPU (Rabaey)

1994: anti machine high level programming language

1995: super systolic rDPA

1996+: SCCC (LANL), SCORE, ASPRC, Bee (UCB), ...

1997+: discipline of distributed memory architecture

1997: configware / software partitioning compiler

flow

war

e*

[email protected]

;

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartestein.de

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

55

Flowware heading toward mainstream

•Data-stream-based Computing is heading for mainstream

–1997 SCCC (LANL) Streams-C Configurabble Computing

–SCORE (UCB) Stream Computations Organized for Reconfigurable Execution

–ASPRC (UCB) Adapting Software Pipelining for Reconfigurable Computing

–2000 Bee (UCB), ...

–Most stream-based multimedia systems, etc.

–Many other areas ....

Flowware:

managing data streams

Software:

managing instruction streams

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

56

Matter & Antimatter: Atom and Anti Atom

The World of Matter

Machine paradigm: the Atom

Anti Matter

Machine paradigm: Anti Atom

+ + -

- - +

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

57

Matter & Antimatter of Informatics :

- DPU

+

Anti Machine paradigm

+

CPU

-

nothing central !

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

58

machine paradigm: some differences

+ CPU

-

- DPA

+ +

+

- DPU

+

no. of streams ³ 1

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

59

Parallelism by Concurrency

+ -

+ -

- +

- +

+ -

- +

- +

independent instruction streams

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

60

Dead Supercomputer Society

•ACRI •Alliant •American Supercomputer

•Ametek •Applied Dynamics •Astronautics •BBN •CDC •Convex •Cray Computer •Cray Research •Culler-Harris •Culler Scientific •Cydrome •Dana/Ardent/ Stellar/Stardent

•DAPP •Denelcor •Elexsi •ETA Systems •Evans and Sutherland •Computer •Floating Point Systems •Galaxy YH-1 •Goodyear Aerospace MPP •Gould NPL •Guiltech •ICL •Intel Scientific Computers •International Parallel Machines

•Kendall Square Research •Key Computer Laboratories

[Gordon Bell, keynote at ISCA 2000]

•MasPar •Meiko •Multiflow •Myrias •Numerix •Prisma •Tera •Thinking Machines •Saxpy •Scientific Computer •Systems (SCS) •Soviet Supercomputers •Supertek •Supercomputer Systems •Suprenum •Vitesse Electronics

[email protected]

;

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartestein.de

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

61

blinders:

„we are o.k. !“ (no new direction)

Lacking Sense of Direction ?

for ignoring the impact of RC © 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

62

Some Supercomputing people now looking at us

Reconfigurable Computing

Steroids for the aging microprocessor:

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

63

Machine paradigms

von Neumann instruction

stream machine M

I/O

instruction sequencer

CPU

instruction stream

I/O M M M M M

(r)DPU

DPU

Software

I/O M M M M M

(r)DPA

memory distributed memory architecture*

data stream

data-stream machine

M

DPU or rDPU

data address generator (data sequencer)

memory

I/O

asM**

Flowware

(Configware)

(reconf.)

*) the new discipline came just in time: see Herz et al.: Proc. IEEE ICECS 2002

+ CPU

- -

DPU

+

memory

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

64

heavy anti atoms: DPA = DPU array

- DPA

- DPU

- DPU

- DPU

- DPU

- DPU

- DPU

- DPU

- DPU

- DPU -

DPA

+

+

+

+

+

+

+ +

+

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

65

Distributed Memory

SA: scrambling and descrambling the data ?

Just in time: a new research area:

Application-specific distributed memory:

e. g. book by F. Catthoor et al. ...

Data address generators - 20 years research:

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

66

>> compilation techniques <<

• why coarse grain reconfigurable ? • terminology • toward higher abstraction levels • flowware languages + mapping • why a new Machine Paradigm ? • (co-) compilation techniques • final remarks

http://www.uni-kl.de

[email protected]

;

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartestein.de

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

67

Co-Compilation

Xputer

“Soft” Machine Paradigm

Configware running on

partitioning compiler

high level programming language source

mProcessor Reconfigurable

Accelerators inte

rface

Reconfigurable Architecture (RA)

-- instead of hardwired

We introduce: Co-Compilation

Computer Machine Paradigm

Software running on

Xputer

“Soft” Machine Paradigm

Configware running on

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

68

The Secret of Success: Co-Compilation

Analyzer / Profiler

SW code

SW compiler

para d igm “vN" machine

CW Code

CW compiler

anti machine paradigm

Partitioner

Resource Parameters

supporting different platforms

supporting platform-based design

High level PL source

could provide the platforms

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

69

Loop Transformation Examples

loop 1-8 body body endloop

loop 1-8 body endloop

loop 9-16 body endloop

fork

join

strip mining

loop 1-4 trigger endloop

loop 1-2 trigger endloop

loop 1-8 trigger endloop

reconf.array: host: loop 1-16 body endloop

sequential processes: resource parameter driven Co-Compilation

loop unrolling

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

70

Machine Paradigms

machine category Computer (the Machine:

“v. Neumann”) The Anti Machine

driven by: Instruction streams data streams (no “dataflow”)

engine principles instruction sequencing sequencing data streams

state register single program counter (multiple) data counter(s)

Communication path set-up .

at run time at load time

resource DPU (e.g. single ALU) DPU or DPA (DPU array) etc. data path

operation sequential parallel pipe network etc.

( “instruction fetch” )

also hardwired implementations* *) e g. Bee project Prof. Broderson

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

71

KressArray Family generic Fabrics: a few examples

Examples of 2nd Level Interconnect: layouted over rDPU cell - no separate routing areas !

+

rout-through and function

rout-through

only more NNports:

rich Rout Resources

Select Function

Repertory

select Nearest Neighbour (NN) Interconnect: an example

16 32 8 24

4

2 rDPU

Select mode, number, width of NNports

http://kressarray.de © 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

72

DPSS

KressArray DPSS

VHDL Verilog

HDL Generator Simulator

User Interface

Architecture Editor

Mapping Editor

Power Estimator

Power Data

Selection

interm. form

Mapper Bus & I / O Schedule

Scheduler

User

ALEX Code

interm. form

ALE-X Compiler

Architecture Estimator

interm. form

Design Rules

Datapath Generator Generator

Kress rDPU

Layout Data Path Synthesis System

[email protected]

;

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartestein.de

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

73

DPSS

KressArray DPSS

User Interface

Architecture Editor

Mapping Editor

Selection

interm. form

Mapper Bus & I / O Schedule

Scheduler

Application Set

Improvement Proposal Generator

User

ALEX Code

interm. form

ALE-X Compiler

Architecture Estimator

interm. form

Su

gg

estio

n

Delay Estim.

Sug- gest- ion

Inference Engine (FOX) Analyzer

statist. Data

KressArray (Design Space) (Platform Space) Xplorer

Xplorer

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

74

Ulrich Nageldinger‘s Ph. D. thesis

http://hartenstein.de

click „recent talks“

this page: also link to Ph. D thesis download

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

75

>> final remarks <<

• why coarse grain reconfigurable ? • terminology • toward higher abstraction levels • flowware languages + mapping • why a new Machine Paradigm ? • (co-) compilation techniques • final remarks

http://www.uni-kl.de

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

76

Where are we heading ?

1

2

0 10 12 18 months

factor

*) Department of Trade and Industry, London

90% by 2010

10 times more programmers

will write embedded applications

than computer software by 2010

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

77

data streams ...

PS: Personal Supercomputer replaces the PC

mainframes PC

1957

1967

1977

1987

1997

2007

PS: personal supercomputer

morphware

µProc. rDPA

co-compiler

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

78

What‘s the problem ?

.... by signals rippling through a network of transistors.

The typical programmer has problems to understand function evaluation without machine mechanisms....

Traditional CS: programming is (control-)procedural, instruction-stream-based – sources: software

accelerators accelerators µprocessor µprocessor

It‘s the gap between procedural and structural mind set

Crossing the Hardware / Software Chasm [Mike Butts]

[email protected]

;

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartestein.de

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

79

What‘s the problem ?

accelerators accelerators µprocessor µprocessor

The brain hurts on paradigm shift ?

no, it can‘t ...

Brain usage: procedural-only

structural hemisphere missing

Crossing the Hardware / Software Chasm [Mike Butts]

solution only with user-friendly SW / CW / FW co-compilers

based on anti machine paradigm used as a Troyan Horse into CS

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

80

Annihilation?

- +

-

+ - + avoidable

by tools ....

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

81

>>> thank you <<<<<

thank you

for your

patience

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

82

>>> END <<<

END

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

83

Conclusion: all knowledge needed is available

•machine paradigm

•anti machine architectural resources

•sequencing methodology: hw & sw

•parallel memory IP core and module generator vendors

•anything else needed

•compilation techniques

•hw / sw partitioning methodology

• languages

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

84

The Situation in Computing Sciences

• Computing Sciences are in a severe crisis

• New fundamentals and R&D directions are inevitable

• my mission: getting you involved

• All knowledge needed is readily available ...

• ... even from Computing Sciences

• Silicon application and EDA provide useful concepts

• Reconfigurable Computing has the remedy

[email protected]

;

Reiner Hartenstein, University of Kaiserslautern, Germany http://hartestein.de

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

85

asM

Configware / Flowware Compilation

r. Data Path

Array

rDPA intermediate

high level source program

wrapper

configware configware

mapper

flowware flowware

scheduler

M M M M

M M M M

M

M

M

M

M

M

M

M

data streams

data sequencer

address generator

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

86 © 2001, [email protected]

University of Kaiserslautern

Xputer Lab

instructions

program cou n ter: state register

Compiler RAM

Datapath

har dw ired

Sequencer

Computer Computer tightly coupled

by compact instruction code

“von Neumann” “von Neumann” does not support soft data paths does not support soft data paths

Datapath

Xputer Xputer

Scheduler

Compiler

RAM

(multiple) sequencer

Datapath Array

“instructions”

University of Kaiserslautern

Xputer Lab

loosely coupled by decision data bits only

Xputer: Xputer: The Soft Machine Paradigm

The Soft Machine Paradigm reconfigurable reconfigurable

also for hardwired also for hardwired

Computer: the wrong Machine Paradigm

“von Neumann”

s d a ta cou n ter

(anti machine)

© 2003, [email protected] http://hartenstein.de

Kaiserslautern University of Technology

87

Sources: Proc ISSCC, ICSPAT, DAC, DSPWorld

Why Coarse Grain instead of FPGA ?

physical logical

FPGA logical

1980 1990 2000 2010

FPGA physical

100 000 000 000

10 000 000 000

1000 000 000

100 000 000

10 000 000

1000 000

100 000

10 000

1000

Tra

nsis

tors

/ c

hip

~ 10

~ 10 000

drastically smaller configuration memory

a lot of more benefits

much faster loading

FPGA routed

reduced reconfigurability overhead by up to ~ 1000