Reconfigurable HPC
Reconfigurable HPC
part 1Introduction
Reiner Hartenstein
TU Kaiserslautern
May 14, 2004 , TU Tallinn, Estonia
© 2004, [email protected] http://hartenstein.de2
TU Kaiserslautern
Preface
The White House, Sept 2000:Bill Clinton condemns the Digital Divide in America:
World Economic Forum 2002: The Global Digital Divide,disparity between the "haves" and "have nots“
The Digital Divide of Parallel Computing:The Digital Divide of Parallel Computing:Access to Configware (CW) SolutionsAccess to Configware (CW) Solutions
access to the internet
© 2004, [email protected] http://hartenstein.de3
TU KaiserslauternThe „havenots“
Configware methodology to move data around more efficiently:
Configware engineering as a qualification for programming embedded systems*:
„havenots“ are found in the HPC community
The „havenots“ are our typical CS graduatesReconfigurable HPC is
torpedoed by deficits in education:
curricular revisions are overdue*) also HPC !
© 2004, [email protected] http://hartenstein.de4
TU KaiserslauternSoftware to Configware
Migration
this talk will illustrate the performance benfitwhich may be obtained from Reconfigurable Computing stressing coarse grain Reconfigurable Computing (RC),point of view, this talk hardly mentions FPGAs(But coarse grain may be always mapped onto FPGAs)
Software to Configware Migration is the most important source of speed-upHardware is just frozen Configware
© 2004, [email protected] http://hartenstein.de5
TU Kaiserslautern>> HPC <<
•HPC
•Embedded Computing
•The wrong Roadmap
•Configware Engineering
•Dual Machine Paradigms
•Speed-up Examples
•Final Remarkshttp://www.uni-kl.de
© 2004, [email protected] http://hartenstein.de6
TU Kaiserslautern
Earth Simulator
5120 Processors, 5000 pins eachES 20: TFLOPS
Crossbar weight: 220 t, 3000 km of cable,moving data around
inside the
© 2004, [email protected] http://hartenstein.de7
TU Kaiserslautern
data are moved around by software
(slower than CPU clock by 2 orders of magnitude)
i.e. by memory-cycle-hungry instruction streams which fully hit the memory wall
extremely unbalanced
stolen from Bob Colwell
CPU
© 2004, [email protected] http://hartenstein.de8
TU Kaiserslautern
path of least resistance*:avoiding a paradigm shift
Many researchers seem never to stop working on sophisticated solutions for marginal improvements ...... continously ignoring methodologies promising speed-ups by orders of magnitude ....
blinders to ignore the impact of morphware
... continue to bang their heads against the memory
wallinstead
of
*) [Michel Dubois]
© 2004, [email protected] http://hartenstein.de9
TU Kaiserslautern
the data-stream-based approach
has no von Neumann bottle-neck
has no von Neumann bottle-neck
… understand only this parallelism solution:
the instruction-stream-based approach
von Neuman
n bottle-necks
von Neuman
n bottle-necks
... c
annot
cope w
ith
this
one
© 2004, [email protected] http://hartenstein.de10
TU Kaiserslautern>> Embedded Computing
<<
•HPC
•Embedded Computing
•The wrong Roadmap
•Configware Engineering
•Dual Machine Paradigms
•Speed-up Examples
•Final Remarkshttp://www.uni-kl.de
© 2004, [email protected] http://hartenstein.de11
TU Kaiserslautern
?
What’s coming next ?
The History of Paradigm Shifts
“Mainstream Silicon Applicationis switching every 10 Years”
TTL µproc.,memory
“The Programmable System-on-a-Chipis the next wave“
custom
standard
1957
1967
1977
1987
1997
2007
Makimoto’s Wave
ASICs,accel’s
LSI,MSI
1st D
esig
n C
risis
2n
d D
esig
n C
risis
?
reconfigurablePublished
in 1989
© 2004, [email protected] http://hartenstein.de12
TU Kaiserslautern
Makimoto’s 3rd Wave
• Fine Grain Subsystems (FPGAs):–
1st half of 3rd wave
–
universal (but less efficient)
• Coarse Grain Subsystems:–
2nd half of 3rd wave
–
domain-specific
–
much more flexible than 2nd half of 2rd wave
© 2004, [email protected] http://hartenstein.de13
TU Kaiserslautern
How’s next Wave ?
2007FPGAs
custom
standard
1957
1967
1977
1987
1997
Tredennick’sParadigm Shifts
procedural programming
algorithm: variable
resources: fixed
hardwired
algorithm: fixed
resources: fixed
2007
?
structural programming
algorithm: variable
resources: variable
Coarse grain
RAs
no further wave !
Hartenstein’s Curve
?4th wave ?
© 2004, [email protected] http://hartenstein.de14
TU KaiserslauternHistory of Silicon Application
TTL µproc.,memory
1957
1967
1977
1987
1997
2007
ASICs,accel’s
LSI,MSI
FPGAs
coarsegrain
soft CPUs
hardware people CSpeople new breed needed
Common terminology needed
3 different mind sets
© 2004, [email protected] http://hartenstein.de15
TU Kaiserslautern
History of Machine Models
1957
1967
1977
1987
1997
2007
mainframe age
mainframe.
compile
procedural mind set: instruction-stream-based
procedural mind set: instruction-stream-based
(coordinates by Makimtos wave)
computer age (PC age)
accel.µProc.
compile
users: RIKEN institute, ARI, Heidelberg, etc.
MD-GRAPE-2 PCI board [1997]4 chips for N-body simulation
converts a PC to 64 GFlops
scientific computing example:molecular dynamics, astrophysics ,
plasma physics, hydrodynamics:
© 2004, [email protected] http://hartenstein.de16
TU Kaiserslautern
History of Machine Models
1957
1967
1977
1987
1997
2007
mainframe age
mainframe.
compile
procedural mind set: instruction-stream-based
procedural mind set: instruction-stream-based
(coordinates by Makimtos wave)
computer age (PC age)
accel.µProc.
compile
structural mind set: data-stream-basedstructural mind set: data-stream-based
by hardware guysby hardware guysdesign
© 2004, [email protected] http://hartenstein.de17
TU Kaiserslautern
the hardware / Software Chasm:
typical programmers don‘t understand function evaluation without machine mechanisms (counters, state registers)
It‘s the gap between procedural (instruction-stream-based) and structural (datastream-based) mind set
acceleratorsacceleratorsµprocessorµprocessor
© 2004, [email protected] http://hartenstein.de18
TU Kaiserslautern
Growth Rate of Embedded Software
1
2
0 10 12 18 months
factor
*) Department of Trade and Industry, London
(1.4/year)
[Moore
’s law]
>10 times more programmers will write embedded applications
than computer software by 2010
Embe
dded
sof
twar
e [D
TI*]
(~2.
5/yr
)
already to-day, more than 98% of all microprocessors are
used within embedded systems
© 2004, [email protected] http://hartenstein.de19
TU Kaiserslauterntypical CS graduates: the
„havenots“
To-day, „typical“ CS graduates are unqualified for this labor market
… cannot cope with Hardware / Configware / Software partitioning issues
… cannot implement Configware
© 2004, [email protected] http://hartenstein.de20
TU Kaiserslautern
the current CS mind set is based on the Submarine
Model
Hardware invisible:under the surface
Hardware
Algorithm
Assembly Language
procedural high level Programming
Language
SoftwareThis model does not suport Hardware / Configware / Software partitioning
© 2004, [email protected] http://hartenstein.de21
TU Kaiserslautern
Hardware / Configware / Software Partitioning skills urgently needed
Algorithm
partitioning
HWHW
CWCW
SWSW
to cope with each of it:SW, CW, HW
.
SW/HWSW/HW
SW/CW/HWSW/CW/HWSW/CWSW/CW
CW/H
WCW
/HW
or: to cope with anycombination of co-design
.
Software to Configware Migration is the most important source of speed-up
Hardware is just frozen Configware
© 2004, [email protected] http://hartenstein.de22
TU Kaiserslautern
By the way ...
http://fpl.org
International Conference on Field-Programmable Logic
and Applications (FPL)
Aug. 20 – Sept 1, 2004, Antwerp, Belgium
288 submissions !
... the oldest and largest conference in the field:
accel.µProc.... going into every type of application
they all work on high performance
© 2004, [email protected] http://hartenstein.de23
TU Kaiserslautern
Dominance of the Submarine Model ...
Hardware
... indicates, that our CS education system produces
zillions of mentally disabled CS graduates
(procedural) structurallydisabled
… disabled to cope with solutions other than
instruction-stream-based
© 2004, [email protected] http://hartenstein.de24
TU Kaiserslautern
CS Education
proceduralhave not
You cannot *teach Hardware to a Programmer*) efficiently
But to a Hardware Guy you always can teach
Programming
structuralhave natural
© 2004, [email protected] http://hartenstein.de25
TU Kaiserslautern>> the wrong Roadmap <<
•HPC
•Embedded Computing
• the wrong Roadmap
•Configware Engineering
•Dual Machine Paradigms
•Speed-up Examples
•Final Remarkshttp://www.uni-kl.de
© 2004, [email protected] http://hartenstein.de26
TU Kaiserslautern
Completely wrong roadmap
beef up old architectural principles by new technology?
growth factor
µm
0.1
performance
area efficiency
„Pollack‘s Law“ (simplified)
[intel]
... the CPU is a methusela, the steam
engine of the silicon age
© 2004, [email protected] http://hartenstein.de27
TU Kaiserslautern
Completely wrong mind set
The key problem, the memory wall, cannot be solved by new CPU technology
We need a 2nd machine paradigm (a 2nd mind set ...)
The vN paradigm is not a communication paradigmIts monopoly creates a completely wrong mind set
We need an architectural communication paradigmBut we need both paradigms: a dichotomy
© 2004, [email protected] http://hartenstein.de28
TU Kaiserslautern3rd machine model became
mainstream
1957
1967
1977
1987
1997
2007
computer age (PC age)
accel.
design
µProc.
compile
(Makimtos wave)
mainframe age
mainframe
compile
instruction-stream-based DPArrµProc.
programma
bleprogramma
ble
most CS curricula &
HPC are still here
most CS curricula &
HPC are still here
morphware age
© 2004, [email protected] http://hartenstein.de29
TU Kaiserslautern>> Configware Engineering <<
•Supercomputing (HPC)
•Embedded Computing
•The wrong Roadmap
•Configware Engineering
•Dual Machine Paradigms
•Speed-up Examples
•Final Remarkshttp://www.uni-kl.de
© 2004, [email protected] http://hartenstein.de30
TU Kaiserslautern
de facto Duality of RAM-based platforms
traditional new
RAM-based platform CPU morphware (FPGA, rDPA ..)
„running“ on it: software configware
machine paradigm von Neumann etc.:instruction-stream-based
anti machine: data-stream-based
2nd paradigm
We now have 2 types of programmable platforms
soft hardwaremorphware[DARPA]
hardware viewed as frozen configware: Just earlier binding
© 2004, [email protected] http://hartenstein.de31
TU Kaiserslautern
[Gordon Bell]
... going into every type of application
[Gordon Bell]
.... the brain hurts
CW has become mainstream ...
Others experienced, that the brain hurts, when trying the paradigm shift
The HPC scene believed to be smart, when smiling about us CW guys
morphware: fastest growing sector of the IC market
© 2004, [email protected] http://hartenstein.de32
TU Kaiserslautern
DPA
morphware age
rr
From Software to Configware Industry
structuralpersonalization:
RAM-based
Repeat Success Story bya 2nd Machine Paradigm !
Growing Configware IndustryGrowing Configware Industry
1957
1967
1977
1987
1997
2007
computer age (PC age)
µProc.
compileProcedural
personalization via RAM-based .
Machine Paradigm
Software IndustrySoftware Industry
1)2)
Software Industry’sSecret of Success anti machine
© 2004, [email protected] http://hartenstein.de33
TU Kaiserslautern
benefit from RAM-based & 2nd paradigm
RAM-based platform needed for:• flexibility, programmability
• avoiding the need of specific silicon
mask cost: currently 2 mio $ - rapidly growing
1)
simple 2nd machine paradigm needed as a common model:• to avoid the need of circuit expertize
• needed to to educate zillions of programmers
2)
(vN relocatability is based on the von Neumann bottleneck) high price
By the way: relocatability is more difficult, but not hopeless
© 2004, [email protected] http://hartenstein.de34
TU Kaiserslautern
configware resources: variable
Nick Tredennick’s Paradigm Shifts explain the differences
2 programming sources needed
flowware algorithm: variable
Configware EngineeringConfigware Engineering
Software EngineeringSoftware Engineering
1 programming source needed
algorithm: variable
resources: fixedsoftware
CPU
© 2004, [email protected] http://hartenstein.de35
TU Kaiserslautern
Compilation: Software vs. Configware
source program
softwarecompiler
software code
Software Engineeri
ng
Software Engineeri
ng
configware code
mapper
configwarecompiler
scheduler
flowware code
source „program“
Configware
Engineering
Configware
Engineering
placement &
routing
data
© 2004, [email protected] http://hartenstein.de36
TU Kaiserslautern
DPA
xxx
xxx
xxx
|
||
x x
x
x
x
x
x x
x
- -
-
input data streams
xx
x
x
x
x
xx
x
--
-
-
-
-
-
-
-
-
-
-
xxx
xxx
xxx
|
|
|
|
|
|
|
|
|
|
|
|
|
|output data streams
„data
streams“ time
port #
time
time
port #time
port #
Flowware defines: ... which data item at which time at which port
Flowware programs data streams
© 2004, [email protected] http://hartenstein.de37
TU Kaiserslautern
Flowware: not new
1957
1967
1977
1987
1997
2007
computer age (PC age)
accel.
design
µProc.
compile
(Makimtos wave)
mainframe age
mainframe
compile
DPArrµProc.
morphware age
*) no confusion, please: no „dataflow
machine“ !!!
*) no confusion, please: no „dataflow
machine“ !!!data stream* ...Flowware:
around 1980
© 2004, [email protected] http://hartenstein.de38
TU Kaiserslauterndata streams*: not new
1980: data streams (Kung, Leiserson: systolic arrays)
1989: data-stream-based Xputer architecture
1990: rDPU (Rabaey)
1994: Flowware Language MoPL (Becker et al.)
1995: super systolic array (rDPA) + DPSS tool (Kress)
1996+: Stream-C language, SCCC (Los Alamos), SCORE, ASPRC, Bee (UC Berkeley), ...
1996+: streaming languages (Stanford et al.)
1996: configware / software partitioning compiler (Becker)
*) please, don‘t
confuse with „data
flow“
© 2004, [email protected] http://hartenstein.de39
TU Kaiserslautern>> Dual Machine Paradigms <<
•HPC
•Embedded Computing
•The wrong Roadmap
•Configware Engineering
•Dual Machine Paradigms
•Speed-up Examples
•Final Remarkshttp://www.uni-kl.de
© 2004, [email protected] http://hartenstein.de40
TU Kaiserslautern
Why a new machine
paradigm ???The anti machine as the 2nd paradigm is the key to curricular innovation
rDPArDPAµprocessorµprocessor
... a Troyan horse to introduce the structural domain to the procedural-only mind set of programmers
Programming by flowware instead of software is very easy to learn
Flowware education: no fully fledged hardware expert needed to program embedded systems
(... same language primitives)
© 2004, [email protected] http://hartenstein.de41
TU Kaiserslautern
von Neumann vs. anti machine
programcounter
DPUCPU
RAMmemory
von Neumann bottleneck
(r)DPA
withoutsequencerno
CP
U !
asMA: auto-sequencing Memory Array
........
asM
asM
asM
asM
asM
(r)DPA
....
....
data stream machine
(anti machine)
datacounter
memory
bank
asM
asM: auto-sequencing Memory
instruction stream machine(von Neumann etc.)
© 2004, [email protected] http://hartenstein.de42
TU Kaiserslautern
Behavior of the Counter
datacounter
memory
bank
asM
asM
asM
asM
asM
asM
........
programcounter
DPUCPU
programme
d
by flowwaredata
streams
programm
e
d by software
(r)DPA
programm
e
d by flowware
© 2004, [email protected] http://hartenstein.de43
TU Kaiserslautern
Counters: the same micro architecture ?
data stream machine (anti
machine)
datacounter
memory
bank
asM
programcounter
DPUCPU
instruction stream machine:(von Neumann etc.)
yes, is possible, but for data counters ...
*) for history of AGUs see Herz et al.: Proc. ICECS 2002, Dubrovnik, Croatia
... a much better AGU methodology is available*
AGU: address generator unit
© 2004, [email protected] http://hartenstein.de44
TU KaiserslauternThe dichotomy of models
• Note for von Neumann: state register is with the CPU
• Note for the anti machine: state register is with memory bank / state registers are within memory banks
© 2004, [email protected] http://hartenstein.de45
TU Kaiserslauterncommercial rDPA example:
PACT XPP - XPU128
XPP128 rDPA
• Evaluation Board available, and • XDS Development Tool with Simulator
buses not
shown
rDPU
CF
G
PAE
core
ALU CtrlALU
CF
GC
FG
PAE
core
CF
GC
FG
PAE
core
PAE
core
ALU CtrlALUALU CtrlALU
CF
GC
FG
CF
GC
FG
• Full 32 or 24 Bit Design working silicon • 2 Configuration Hierarchies
© PACT AG, Munich http://pactcorp.com
(r)DPA
© 2004, [email protected] http://hartenstein.de46
TU Kaiserslautern
XPP64A: Platform Development Board
- SDR Board In Debug Phase -> XPP64A Chips from STMicro Fab
- Assembly & Test / Available March 2003
© 2004, [email protected] http://hartenstein.de47
TU Kaiserslautern
symbiosis of machine models
1957
1967
1977
1987
1997
2007
computer age (PC age)
accel.
design
µProc.
compile
(Makimtos wave)
mainframe age
mainframe
compile
morphware age
DPArrµProc.replace PC by PS
co-compiler
symbiosis
© 2004, [email protected] http://hartenstein.de48
TU Kaiserslautern>> Speed-up Examples <<
•HPC
•Embedded Computing
•The wrong Roadmap
•Configware Engineering
•Dual Machine Paradigms
•Speed-up Examples
•Final Remarkshttp://www.uni-kl.de
© 2004, [email protected] http://hartenstein.de49
TU Kaiserslautern
Better solutions by Configware
Memory cycles minimizede.g.: no instruction fetch at run time & other
effects
Memory access for data: caches do not help anyhowLoop xforms: no intra-stream data memory cyclesComplex address computation: no memory cycles
No cache misses!
instead of software
methodologies not new: high level synthesis (1980+)loop transformations (1970+)
many other areas
© 2004, [email protected] http://hartenstein.de50
TU Kaiserslautern
speed-up examples
platform application example speed-up factor method
PACT Xtreme4-by-4 array[2003]
16 tap FIR filter x16 MOPS/mWstraight forward
MoM anti machine with DPLA* [1983]
grid-based DRC** 1-metal 1-poly nMOS*** 256 reference patterns
> x1000(computation
time)
multipleaspects
*) DPLA: MPC fabr. via E.I.S. multi univ. project
key issue: algorithmic cleverness
**) Design Rule Check
CPU 2 FPGA [FPGA 2004]
migrate several simple application exampes
x7 – x46(compute
time)
hi levelsynthesis
***) for 10-metal 3-poly cMOS expected: >> x10,000
DSP 2 FPGA [Xilinx 20042]
from fastest DSP: 10 gMACs to 1 teraMAC
X 100(compute
time)not spec.
2) Wim Roelandts
© 2004, [email protected] http://hartenstein.de51
TU KaiserslauternSoftware to Configware Migration:
(RAW’99 at Orlando)
question by a highly respected industrial senior researcher:
Ulrich Nageldinger‘s talk about KressArray Xplorer:
„But you can‘t implement decisions!“
(symptom of ...)
© 2004, [email protected] http://hartenstein.de52
TU Kaiserslautern
branching example: time-to-space migration
*) if no intermed. storage in register file
on a very simple CPU C = 1
memory
cycles
if C then read A
read instruction 1
instruction decoding
read operand* 1
operate & register transfers
if not C then read B
read instruction 1
instruction decoding
add & store
read instruction 1
instruction decoding
operate & register transfers
store result 1
total 5
S = R + (if C then A else B endif);
S
+
ABR C
clock
=1
on rDPU:
no m
emor
y cy
cles
© 2004, [email protected] http://hartenstein.de53
TU Kaiserslautern
Why the speed-up ...
... although FPGA is clock slower by x 3 or even more(most know-how from „high level synthesis“ discipline)
moving operator to the data stream (before run time)
support operations: no clock nor memory cycle
decisions without memory cycles nor clock cycles
most „data fetch“ without memory cycle
© 2004, [email protected] http://hartenstein.de54
TU Kaiserslautern>> Final Remarks <<
http://www.uni-kl.de
•HPC
•Embedded Computing
•The wrong Roadmap
•Configware Engineering
•Dual Machine Paradigms
•Speed-up Examples
•Final Remarks
© 2004, [email protected] http://hartenstein.de55
TU Kaiserslautern
First Indications of Change
10th RAW at IPDPS, Nice, France, April 2003: after a decade of non-overlap: first IPDPS people coming
HPC Asia 2004 - 7th Int‘l Conference on High Performance Computing, July 20-22, 2004 Omiya Sonic City, Tokyo Area, Japan: Workshop on Reconfigurable Systems f. HPC (RHPC) + keynote address*
HPCA-11, 11th International Symposium on High-Performance Computer Architecture, San Francisco, Febr. 12-16, 2005: topic area explicitely: Embedded and reconfigurable architectures
SBAC-PAD 2004 - 16th Symposium on Computer Architecture and High Performance Computing, Foz do Iguacu, PR, Brazil, October 27-29, 2004: topic area explicitely: Reconfigurable Systems
*) keynote speaker:
PARS & Speed-up, Basel, Switzerland, March 2003: keynote address*
IPDPS, Santa Fe, NM, USA, April 2004: keynote address*
PDP’04, La Coruna, Spain, Febr. 2004: keynote address*
© 2004, [email protected] http://hartenstein.de56
TU Kaiserslautern
HPC experts coming ...
ARI, Astrononisches Rechen-Institut, founded 1700 in Berlin, moved 1945 to Heidelberg by
August Kopff
Gottfried Kirch
Simulation of Star Clusters: x10 speed-upby supercomputer-to-morphware migration
(also molecular biology et al.)
Rainer Spurzem, University of Heidelberg
Reinhard Männer, University of MannheimHPC pioneer since 1976 (Physics Dept Heidelberg)
Configware by
Astrophysics by
example: N-body problem went configwarepaper already
at FPL 1999http://fpl.org
© 2004, [email protected] http://hartenstein.de57
TU Kaiserslautern
August Kopff
18th Director, Astrononisches Rechen-Institut (ARI) 1924 - 1954
discovered the Kopff comet, Koenigstuhl Observatory, Heidelberg, Germany, 1906
Copyright © 1996 by Masayuki Suzuki
The Galileo spacecraft's 14-year odyssey came to an end on Sunday, Sept. 21, 2003
discovered the asteriod 631 Philippina, 21 March 1907,
which became the first asteroid ever visited by a spacecraft - on the Galileo mission to Jupiter
© 2004, [email protected] http://hartenstein.de58
TU KaiserslauternConclusions
We need an academic grass roots movement, for ....
RC has become mainstream in all kinds of applications
... by a merger with the embedded systems mind set
CS education deficits: a curricular revision is overdue
...free material & tools for undergraduate lab courses to program and emulate small SW/CW/HW examples
all know-how needed readily available:
get involved !