Reconfigurable Supercomputing: Hurdles and Chances Reiner Hartenstein TU Kaiserslautern Dresden,...

Reconfigurable

Supercomputing: Hurdles

and Chances

Reiner Hartenstein

TU Kaiserslautern

Dresden, Gemany, June 28 - 30, 2006

International Supercomputer Conference

http://hartenstein.de

TU Kaiserslautern>> Outline <<

http://www.uni-kl.de

•Preface

•The von Neumann paradigm trap

•Supercomputing: the wrong Road Map

•The Solution ignored for decades

•Fine-grained vs. coarse-grained

•The wrong Road Map for CS Curricula

•Conclusions

TU KaiserslauternPreface

The talk illustrates why behind the success of FPGAs there is a hidden paradigm shift

My talk does not really cover the performance of bulk storage, discs, etc.

My talk highlights the supercomputing paradigm trapand the fully ignored early solution

TU Kaiserslautern

The Pervasiveness of Reconfigurable Computing (RC)

162,000

127,000

158,000113,000

171,000194,000

# of hits by Google

1,620,000

915,000

398,000

272,000

647,000

1,490,000

# of hits by Google

“FPGA and ….” FPGAs are used everywhere

Nov. 2005

~150 conferences~150 conferences

TU Kaiserslautern

An Example: FPGAs in Oil and Gas ....

Saves more than $10,000 in electricity bills per year (7¢ / kWh) - .... per 64-processor 19" rack

„Application migration [from supercomputer] has resulted in a 17-to-1 increase in performance"

[Herb Riley, R. Associates]

… 25% of Amsterdam‘s electric energy consumption goes into server farms ?

… a quarter square-kilometer of office floor space within New York City is occupied by server farms ?

did you know …

TU Kaiserslautern

Oil and Gas as a strategic issue

It should be investigated, how far the migrational achievements obtained for computationally intensive applications, can also be utilized for servers

You know the amount of Google’ s electricity bill?

TU Kaiserslautern

15 GigaFLOPs on a single FPGA chip

15 GigaFLOPs on single chip for matrix computations

A surprize: much less memory needed than expected

Last night I met Stamatis Vassiliadis (TU Delft)

TU Kaiserslautern

X 2/yr

some published speed-up factors

1980 1990 2000 2010100

Pentium 4

50%/yr

real-time face detectionreal-time face detection6000

video-rate stereo vision

900pattern

recognitionpattern

recognition730

SPIHT wavelet-based image compressionSPIHT wavelet-based image compression 457Smith-Waterman pattern matching

Smith-Waterman pattern matching

BLASTBLAST52protein identificationprotein identification

molecular dynamics simulationmolecular dynamics simulation

Reed-Solomon Decoding

Reed-Solomon Decoding2400

Viterbi DecodingViterbi Decoding

FFTFFT

1000MA

DSP and wirelessImage processing,Pattern matching,

Multimedia

Bioinformatics

GRAPEGRAPE20

AstrophysicsMicroprocessor

Memory

cryptocrypto

although the effective integration density of FPGAs is by 4 orders of magnitude behind the Moore curvewiring overheadreconfigurability overheadrouting congestion

The RC paradox

the memor

y wall

invented

TU KaiserslauternEducational Deficits

Transdisciplinary fragmentation: each application domain uses its own trick boxes

Too many sophisticated very clever architecturesWe need a fundamental model with a methodology which all application domains have in common

Educational deficits have stalled Reconfigurable Computing (RC) as well as classical supercomputing

Transdisciplinary education & basic research needed

•Preface

•Conclusions

TU KaiserslauternThe basic model paradigm

High performance computing stalled for decades by the von Neuman paradigm trap

stolen from Bob Colwell

For decades the right roadmap was hidden by another paradigm trap

most systems are extremely

unbalanced

TU Kaiserslautern

Computer Science not prepared

Transdisciplinary Education?

for decades: the Hardware / Software chasm

Lacking intradisciplinary cohesion between the mind sets of:

•Hardware

People

•Theoreticians (Math

background)

•Software People (Application Development)

turns into the Configware / Software chasm

•Embedded Syst. Designers•Computer Architects

TU Kaiserslautern

Flag ship conference series: IEEE ISCA

Parallelism

faded away

98.5 % von

Neumann98.5 % von

Neumann

t al.]

Other: cache

coherence ?

speculative

scheduling?

(2001: 84%)

Jean-Loup Baer

migration of the

lemings

TU Kaiserslautern

The Dead Supercomputer Society

•ACRI •Alliant •American Supercomputer •Ametek •Applied Dynamics •Astronautics •BBN •CDC•Convex•Cray Computer •Cray Research •Culler-Harris •Culler Scientific •Cydrome •Dana/Ardent/ Stellar/Stardent

•DAPP •Denelcor •Elexsi •ETA Systems •Evans and Sutherland•Computer•Floating Point Systems •Galaxy YH-1 •Goodyear Aerospace MPP •Gould NPL •Guiltech •ICL •Intel Scientific Computers •International Parallel Machines •Kendall Square Research •Key Computer Laboratories•MasPar

Research 1985 – 1995 [Gordon Bell, keynote ISCA 2000]

•Meiko •Multiflow •Myrias •Numerix •Prisma •Tera •Thinking Machines •Saxpy •Scientific Computer•Systems (SCS) •Soviet Supercomputers •Supertek •Supercomputer Systems •Suprenum •Vitesse Electronics

49: died just in 1 decade

•Preface

•Conclusions

TU Kaiserslautern

Moving Data around

5120 Processors, 5000 pins eachCrossbar weight: 220 t, 3000 km of thick cable,

ES 20: TFLOPSpeak or sustained?

TU KaiserslauternThe Memory Wall (1)

Moving data to the processor:

TU KaiserslauternData meeting the Processing Unit

by Software

byConfigware

routing the data by memory-cycle-hungry instruction streams

placement of the execution locality

We have 2 choices

optimize a pipe network: place PU in data stream

TU Kaiserslautern

Illustrating the von Neumann paradigm trap

The data-stream-based approach

The instruction-stream-based approach

von Neuman

n bottle-

von Neuman

n bottle-

the watering pot model [Hartenstein]

many watering pots

TU KaiserslauternThe Memory Wall (2)

Supercomputing urgently needs a fundamentally different approach

toward interconnect efficiency.

Key problem is the inefficiency and complexity of moving data, not processor performance.

Tear down this Wall !

Most important goal is the minimization of the number of main memory cycles.

•Preface

•Conclusions

TU Kaiserslautern

input data stream

|output data streams

port #

port #time

port #

defines: ... which data item at which time at which port

(H. T. Kung paradigm)

CS Mathematicians‘ hobby, early 80ies

DPA*(pipe network)*) DataPath Array

(array of DPUs)

The Systolic Array

introducing Data streams

no instruction streams needed

DataPath Unit hasno program counter!

it’s no CPU!

nice time/space notation -

TU KaiserslauternTerminology

termprogra

m counter

execution triggered

byparadigm

CPUyes instructio

n fetchinstruction-stream-based

DPU** no data arrival*

data-stream-based

programcounter

DPUCPUCPU

*) “transport-triggered”**) does not have a program counter

TU Kaiserslautern

The new paradigm: how the data are traveling

An old hat: transport-triggered

pipeline, or chaining

super systolic array

better not by instruction executionbetter not by instruction execution

DPU DPU DPU

vN Move Processor

instruction-driven

+ instruction-driven

[Jack Lipovski, EUROMiCRO, Nice, 1975]

P&R: move locality of operation, not data !

TU Kaiserslautern

No Memory Wall

input data streams

DPU operation is transport-triggered

no instruction streamsno message passing

nor thru common memory

where were the supercomputing people ?

massively reducing memory cycles

The right road map to HPC: there ignored for

decades

TU Kaiserslautern

Mathematicians X-ing

Systolic Synthesis

Mathematicians like the beauty and elegance of Systolic Arrays. Due to a lacking intra-disciplinary view, their efforts yielded poor synthesis algorithms.

Reiner Hartenstein

TU Kaiserslautern

of course, algebraic !Synthesis Method?

Algebraic means linear projection, restricted to uniform arrays, only with linear pipes

useful only for applications with strictly regular data dependencies:

Mathematicians caught by their own paradigm trap for more than a decade

Rainer Kress discarded their algebraic synthesis methods and replaced it by simulated annealing.

Generalization* by a transdisciplinary hardware guy:

1995*) super-systolic

the specialist trap

TU KaiserslauternGenerating the Data Streams

input data streams

Mathematicians: it‘s not our job

Who generates the data streams ?

(it‘s not algebraic)

TU Kaiserslautern

datacounter

GAG RAM

ASM: Auto-Sequencing

Memory

ASM: Auto-Sequencing

Memory

use data counters, no program

counter

rDPArDPA

ASM Data stream generators

(pipe network)

implemented by distributed on-

chip memory

implemented by distributed on-

chip memory

50 & more on-chip ASM are feasible

reconfigurablereconfigurable

32 ports, orn x 32 ports

other example non-von-Neumannmachine paradigm

TU Kaiserslautern

Compilation: Software vs. Configware

source program

softwarecompiler

software code

Software Engineeri

Configware

Engineering

Configware

Engineeringplaceme

nt & routing

scheduler

flowware code

C, FORTRANMATHLAB, …

instruction streams data streamsconfiguration

configware code

mapperconfigware

compiler

source „program“

•Preface

•Conclusions

TU Kaiserslautern

Coarse-grained vs. fine-grained

Definition of FPGA see previous talk by Dr. Thomas Steinke

device granularity path width eff’ve density

flexibility

FPGA fine-grained ~ 1 bit very low general purpose

DPA coarse-grained multi bit: e.g. 32 bits very high

specializedrDPA coarse-grained

domain-specificplatform

FPGAfine-grained &

embedded hdw. mixed high

TU KaiserslauternFPGA with island architecture

reconfigurable logic box

switch box

connect box

TU Kaiserslautern

example: wire

routed for 1 net

TU KaiserslauternWhy coarse grain

much more MOPS/milliWatt

reconfigurable Data Path Unit (e. g. rALU)

mind set close to classical computing background

instead of rLB (~1 bit wide) use rDPU (e. g. 32 bits wide)

instead of FPGA use rDPA

rDPU rDPU rDPU rDPUrDPU rDPU rDPU rDPUrDPU rDPU rDPU rDPUrDPU rDPU rDPU rDPU

much more area-efficientmuch less

reconfigurability overhead

TU Kaiserslautern

rDPU not used used for routing only operator and routing port location markerLegend: backbus connect

array size: 10 x 16 = 160 rDPUs

Coarse grain is about computing, not logic

rout thru only

not usedbackbus connect

SNN filter on KressArray (mainly a pipe network)

[Ulrich Nageldinger]

Example: mapping onto rDPA by DPSS: based on simulated annealing

reconfigurable function block, e. g. 32 bits wide

no CPU

TU Kaiserslauterncommercial rDPA

example:

PACT XPP - XPU128XPP128 rDPA

• Evaluation Board available, and • XDS Development Tool with Simulator

buses not

ALU CtrlALU

ALU CtrlALUALU CtrlALU

• Full 32 or 24 Bit Design working silicon • 2 Configuration Hierarchies

(r)DPA

TU Kaiserslautern

e. g.: array w. 56 rDPUs: running under 500 MHz

GamesGames MusicMusicVideosVideos

SMeXPPSMeXPP

CameraCamera

Baseband-Baseband-ProcessorProcessor

Radio-Radio-InterfaceInterface

AudioAudio--InterfaceInterface

SD/MMC CardsSD/MMC Cards

LCD DISPLAY

rDPArDPA

• Variable resolutions and refresh rates• Variable scan mode characteristics• Noise Reduction and Artifact Removal• High performance requirements• Variable file encoding formats• Variable content security formats• Variable Displays• Luminance processing• Detail enhancement• Color processing• Sharpness Enhancement• Shadow Enhancement• Differentiation • Programmable de-interlacing heuristics• Frame rate detection and conversion• Motion detection & estimation & compensation• Different standards (MPEG2/4, H.264)• A single device handles all modes

World TV & game console & multi media center

http://pactcorp.com

TU Kaiserslautern

500MHz FlexibleSoft Logic Architecture

200KLogic Cells

500MHz Programmable DSP Execution Units

0.6-11.1GbpsSerial Transceivers

500MHz PowerPC™ Processors(680DMIPS) with

Auxiliary Processor Unit

1Gbps Differential I/O

500MHz multi-portDistributed 10 Mb SRAM

500MHz DCM DigitalClock Management

DSP platform FPGA[courtesy Xilinx Corp.]

•Preface

•Conclusions

TU Kaiserslautern

Computing Curricula 2004fully ignores

Reconfigurable Computing

Joint Task Force for

FPGA & synonyma: 0 hits

not even here

(Google: 10 million hits)

Curricula ?

TU Kaiserslautern

Upon my complaints* the only change: including at end of last paragraph of the survey volume:

Curriculum Recommendations, v. 2005

"programmable hardware (including FPGAs, PGAs, PALs,

GALs, etc.)." However, no structural changes at all

v. 2005 intended to be the final version (?) torpedoing the transdisciplinary

responsibility of CS curriculaThis is criminal !This is criminal !

Peter Denning …

*) no reply

TU Kaiserslautern

Here is the common model

data-stream-based

instruction-stream-

software code

accelerator reconfigurable

accelerator hardwired

configware code

it’s not von Neumann

wagging the dog

the tail is

most accumulated

MIPS have been migrated here

most accumulated

MIPS have been migrated here mainly just

for running legacy code etc.

mainly just for running legacy code etc.

TU Kaiserslautern

Dual Paradigm Application Development

instruction-stream-

software code

accelerator reconfigurable

accelerator hardwired

configware codedata-stream-based

software/configwareco-compiler

high level language

CPUCPU

SWcompiler

CWcompiler

C language source

Partitioner

rDPUrDPU rDPUrDPU rDPUrDPU rDPUrDPU

Juergen Becker’s CoDe-X, 1996

Road map to the Personal

Supercomputer

intel future

multi core ?

intel future

multi core ?

dual mode

PUsdual m

TU Kaiserslautern

We need a curricular dual-paradigm approach

structural procedural

instruction-stream-based

data-stream-

For Transdisciplinary CS Education

procedural-only

The von-Neumann-only mind set is obsolete

•Preface

•Conclusions

TU Kaiserslautern

Taxonomy of Algorithm Migration (1)

(Instruction-stream-based algorithm taxonomy: partially existing, not really systematic)Algorithms migrated to time-space domain (for RC): a taxonomy is not existing

Steadily coming and going data streams are best candidates

Computationally intensive applications are the best candidates for migration to FPGA

bulk data bases might be subject of FPGA usage to avoid memory cycles for address computation

A few algorithms (e. g. Turbocode or Viterbi) require a massive amount of interconnect

TU Kaiserslautern

Taxonomy of Algorithm Migration (2)

Migration efficiency (reducing memory cycles):Servers: to be investigated - for sure is: •loop transformations: efficient, deterministic•caches: indeterministic and energy guzzlers•much less local memory needed•secondary data memory: distributed on-chip memory architectures highly promising

•address computations: efficient migration

TU KaiserslauternConclusions

highly promising for servers

excellent results proven for computationally intensive applications

improvements likely for bulk data & storage applications

tool and language scenario needs an urgent transdisciplinary clean-up

TU Kaiserslautern

thank you

TU Kaiserslautern

Backup:

TU Kaiserslautern

SW / CW Co-Compilation

softwarecompiler

software code

Software / Configware Co-Compiler

configware code

mapperconfigware

compiler

scheduler

flowware code

C, FORTRAN, MATHLAB, etc.

automatic SW / CW partitioner

TU Kaiserslautern

configware resources: variable

Why 2 different Codes needed ? Nick Tredennick’s Paradigm Shifts

2 programming sources needed

flowware algorithm: variable

Configware EngineeringConfigware Engineering

Software EngineeringSoftware Engineering

1 programming source needed

algorithm: variable

resources: fixedsoftware

TU Kaiserslautern

rDPU rDPU rDPU rDPU

for demo: a tiny section of the pipe networkinter-rDPU-communication: no memory cycles needed

configware solution: computing in space

TU KaiserslauternCompare it to software solution on CPU

on a very simple CPU C = 1

memory cycles

nanoseconds

if C then read A

read instruction

instruction decoding

read operand*

operate & register transfers

if not C then read B

read instruction

add & store

read instruction

operate & register transfers

store result

S = R + (if C then A else B endif);

Clock200

TU Kaiserslautern

hypothetical branching example to illustrate software-to-configware

migration

*) if no intermediate storage in register file

C = 1simple conservative CPU example

memory cycles

nanoseconds

if C then read A

read instruction 1 100instruction decoding

read operand* 1 100operate & reg. transfers

if not C then read B

add & store

operate & reg. transfers

store result 1 100

total 5 500

clock200 MHz(5 nanosec)

TU Kaiserslautern

The wrong mind set ....

section of a very large pipe network:

decision

not knowing this solution:symptom of the hardware / software chasm

and the configware / software chasm

„but you can‘t implement decisions!“

TU Kaiserslautern(anti-von-Neumann machine

paradigm)Data Counter instead of Program CounterGeneralization of the DMA

datacounter

GAG RAM

ASM: Auto-Sequencing MemoryASM

GAG & enabling technology:published 1989 [by TU-KL],Survey paper: [M. Herz et al.*: IEEE ICECS 2003, Dubrovnik] *) IMEC & TU-KL

**) -- patented by TI** 1995

Storge Scheme optimization methodology, etc.

TU Kaiserslautern

Xputer Lab at Kaiserslautern: MoM I and II1986:

TU Kaiserslautern

Reconfigurable Supercomputing: Hurdles and Chances Reiner Hartenstein TU Kaiserslautern Dresden,...

Documents

Reiner Hartenstein, University of Kaiserslautern, · PDF fileUniversity of Kaiserslautern Xputer Lab 7 The Software Alliance EDA Program ... Eureka Technology Inc. ... University of

Reconfigurable Supercomputing: Hindernisse und Chancen Reiner Hartenstein TU Kaiserslautern Universität Mannheim, 13. Dez. 2005

SMART CITY HACKATHON - TU Kaiserslautern · 2018-04-20 · SMART CITY HACKATHON 8.-9. Mai 2018 Kaiserslautern. In Kooperation mit TU Kaiserslautern, DFKI Kaiserslautern, Hochschule

Reiner Hartenstein, TU Kaiserslautern, Germanyhelios.informatik.uni-kl.de/staff/hartenstein/lot/... · •Data Stream Parallelism (on accelerators; on FPGAs) Qualified Programmer

Reiner Hartenstein, University of Kaiserslautern, · PDF fileand Software / Configware Co-Design, ... Eureka Technology Inc. ... Reiner Hartenstein, University of Kaiserslautern, Germany

Gemany sebastian2

Reiner Hartenstein, University of Kaiserslautern, Germany - Xputer Lab … · 2013. 5. 8. · Enabling Technologies for System-on-Chip Development, November 19-20, 2001, Tampere,

Kaiserslautern Pikes e.V 110 % Kaiserslautern

Culture for gemany and saudi arabia by Mahmoud Naseh and Tim

How to cope with the Power Wall Reiner Hartenstein TU Kaiserslautern IEEE fellow FPL fellow SDPS fellow PATMOS 2015, the 25 th International

Reiner Hartenstein, TU Kaiserslautern, Germany http

Personal overview Lars Hartenstein

Reiner Hartenstein, University of Kaiserslautern, GermanyUniversity of Kaiserslautern 14 Dead Supercomputer Society •ACRI •Alliant • •American Supercomputer •Ametek •Applied

Reiner Hartenstein, University of Kaiserslautern, · PDF fileReiner Hartenstein, University of Kaiserslautern, Germany ... . ... Reiner Hartenstein, University of Kaiserslautern,

ISICT 2005 Supercomputing going Reconfigurable Reiner Hartenstein TU Kaiserslautern Jan. 4-6, 2005, Capetown, South Africa

Reiner Hartenstein, University of Kaiserslautern, Germany ...helios.informatik.uni-kl.de/staff/hartenstein/lot/... · •Floating Point Systems •Galaxy YH-1 •Goodyear Aerospace

The PATMOS Workshop Series Reiner Hartenstein TU Kaiserslautern IEEE fellow FPL fellow SDPS fellow PATMOS 2015, the 25 th International

Workshop Selbstoptimierung und Adaption Reiner Hartenstein* TU Kaiserslautern Paderborn, 24. - 25. November 2003 Morphware: neue Perspektiven für eingebettete

PowerPoint-Präsentation - - TU Kaiserslautern · reiner@hartenstein.de 21 July 2010 Reiner Hartenstein, TU Kaiserslautern, Germany © 2010, rein er@hartenstein.de http: /hart nstei

CS curricula update proposed: by adding Reconfigurable Computing Reiner Hartenstein TU Kaiserslautern EAB meeting, Philadelphia,1 Nov 2005