A Computational Physicist’s View of Reconfigurable High Performance
Computing
Vincent Natoli
Reconfigurable Systems Summer Institute
11 July 2005
Outline
I. High Performance Computing: Past and Present
II. Field Programmable Gate Arrays
III. Reconfigurable High Performance Computing
PART I
High Performance Computing Past and Present
Brief History of HPC
The Early Years
[Timeline figure, 1945 to 1990. Early machines: Z3, Colossus, ENIAC, BINAC, UNIVAC, ERA 1101, SEAC. Vendors: ERA, RAND, CDC, IBM, Fujitsu, Cray, Hitachi, NEC, Alliant, Convex, Sperry. Later machines: CDC 6600, IBM 360, Cray 1, Intel 4004, Cray XMP, Cray YMP.]
Brief History of HPC
The Phantom Menace
1993-2000
Decline of Vector Processors
Rise of Commodity Processors
Brief History of HPC
Attack of the Clones
2000-2005
Rise of Clusters (ASC Red, Blue, White, Q)
Brief History of HPC
The Empire Strikes Back
2002: Japanese Earth Simulator
Computnik?
− 5,120 CPUs (640 8-way nodes)
− 500 MHz NEC CPUs
− 8 GFLOPS per CPU (41 TFLOPS total)
− 2 GB memory per CPU (10 TB total)
− 20 kVA power consumption per node
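The totals quoted for the Earth Simulator follow directly from the per-CPU figures. A minimal sketch of that arithmetic, assuming 1 TB = 1024 GB for the memory total (the variable names are illustrative, not from the talk):

```python
# Aggregate Earth Simulator figures from the per-node and per-CPU numbers above.
NODES = 640
CPUS_PER_NODE = 8
GFLOPS_PER_CPU = 8
MEM_GB_PER_CPU = 2

cpus = NODES * CPUS_PER_NODE                 # total processor count
peak_tflops = cpus * GFLOPS_PER_CPU / 1000   # aggregate peak, TFLOPS
memory_tb = cpus * MEM_GB_PER_CPU / 1024     # aggregate memory, TB (binary)

print(f"{cpus} CPUs, {peak_tflops:.2f} TFLOPS peak, {memory_tb:.0f} TB memory")
```

The peak comes out just under 41 TFLOPS, which the slide rounds up.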
HPC Performance 1993-2005
All is well …
#Proc: 1.3x/yr
Scalar: 1.4x/yr
Total: 1.8x/yr
Or is it?
1 PFLOP by 2009
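The total growth rate is the product of the processor-count and per-processor rates, and a constant annual factor makes the petaflop projection a one-line calculation. A rough sketch, assuming a hypothetical 100 TFLOPS starting point in 2005 (that baseline is not stated on the slide):

```python
import math

PROC_GROWTH = 1.3     # growth in number of processors per year
SCALAR_GROWTH = 1.4   # growth in per-processor performance per year

total_growth = PROC_GROWTH * SCALAR_GROWTH   # combined annual factor, about 1.8

def years_to_reach(target, baseline, annual_factor):
    """Years of compound growth needed to go from baseline to target."""
    return math.log(target / baseline) / math.log(annual_factor)

# Hypothetical baseline: 100 TFLOPS in 2005; target: 1 PFLOP = 1000 TFLOPS.
years = years_to_reach(1000.0, 100.0, 1.8)
print(f"combined growth {total_growth:.2f}x/yr, 1 PFLOP after {years:.1f} years")
```

Under that assumed baseline the model lands at roughly four years, in line with the 2009 figure.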
Current Problems in HPC
The Studies
(2002) DARPA: HPCS
(2003) DoD: IHEC
(2004) NCO/NITRD: HECRTF
(2004) NRC: Future of Supercomputing
(2004) DOE: HEC Revitalization Act
“The Coming Crisis in Computational Science”, Doug Post
Summary of Results
Good News! Only two big problems: Hardware and Software
Hardware: Moore’s law
− Power Dissipation: More difficult to wring out clock speed increase
− Memory wall: Time to access memory in clock cycles is rising
− Divergence problem: sustained performance < 10% of Peak
Software: The Law of More
− Machines more and more complicated to program
− Machines are obsolete by the time software is ready
Moore’s Law
In 1965, Gordon Moore sketched out his prediction of the pace of silicon technology. Decades later, Moore’s Law remains true, driven largely by Intel’s unparalleled silicon expertise. Copyright © 2005 Intel Corporation.
2005: "In terms of size [of transistor] you can see that we're approaching the size of atoms which is a fundamental barrier, but it'll be two or three generations before we get that far - but that's as far out as we've ever been able to see. We have another 10 to 20 years before we reach a fundamental limit. By then they'll be able to make bigger chips and have transistor budgets in the billions."
1965: “The complexity for minimum component costs has increased at a rate of roughly a factor of two per year (see graph on next page). Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000.”
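The 65,000 figure in the 1965 quote is ten annual doublings applied to the component count of the day. A small sketch, assuming a starting point of about 64 components per circuit in 1965 (that starting value is an assumption here, chosen because it reproduces the quoted number):

```python
# Doubling once per year from 1965 to 1975.
COMPONENTS_1965 = 64      # assumed starting count, not given in the quote
DOUBLINGS = 1975 - 1965   # one doubling per year

components_1975 = COMPONENTS_1965 * 2 ** DOUBLINGS
print(components_1975)    # 65536, i.e. roughly 65,000
```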
Power Dissipation
[Figure: power (watts, log scale from 0.1 to 1000) by Intel processor generation: 4004, 8008, 8080, 8085, 8086, 80286, 80386, 80486, Pentium I, Pentium II, Pentium III, Pentium IV.]
The increase in power dissipation must stop. New engineering techniques have to be implemented to cap the rise in power
Source: “IC Power: The Influence and Impact of Semiconductor Technology”, Presentation by Marc Knox (IBM), Burn-in & Test Socket Workshop, March 7-10, 2004. http://www.bitsworkshop.org/archive/archive2004/2004s1.pdf
Memory Wall
[Figure: memory access times (SIA clock estimate; log scale, 1 to 100) versus cache capacity (KB; log scale, 1 to 10,000) for the 180 nm, 130 nm, 100 nm, 70 nm and 50 nm technology nodes.]
Source: Horst Simon, The Divergence Problem, 18th International Supercomputer Conference ISC2003, Heidelberg, Germany, June 2003. http://www.nersc.gov/~simon/Talks/ISC2003_rev5.pdf
Divergence Problem
Requirements of HPC for Science and Engineering and requirements for the commercial market are diverging
− Memory Bandwidth
− Interconnect Latency
− Interconnect Bandwidth
− HP Parallel I/O
Software: The Law of More
− Researchers spending more time on software development
− Problem will get more acute as the number of processors increases
− Software done…Machine obsolete
The Capability Gap
Reliance on commodity solutions has led us on a path that has delivered great capacity computing solutions but has left a gap in capability computing.
Summary
Maturing Field: Scientific HPC is still a very young field. Its final state is as yet unknown.
Commodity Rules: We have had a free ride on Moore’s law that has led commodity solutions to dominate the HPC market.
− Good: Excellent price/performance; Wide dissemination of skills.
− Bad: Low sustained performance ~10% of peak; Difficult programming model
Technology Divergence: Dependence on increased clock speed and increased number of processors is now in jeopardy
− End of the Roadmap?
− Signs of a transition: Multi-core chips; Less attention to clock speed
− Prediction: Intel will solve the power problem, but…
− How do you divide work among 100,000 processors? Good for huge problems but what about doing small to medium sized problems faster? What about capacity computing?
Summary
Market Divergence: Increasingly, the interest of HPC is diverging from the commodity market.
− Less motivation for chip vendors to provide massive FP performance
− Who ordered multi-core chips? Will they share the FP units?
JES: HPC market still dominated by US manufacturers.
− JES slipped to #4
− $7,500/GFLOP
− Limited Access…not the path to widespread use of supercomputing
How do we continue the logical progression of commodity-based supercomputing?
History of HPC: Next Chapter
A New Hope
2005: Reconfigurable Computing; FPGA’s
PART II
Field Programmable Gate Arrays
“The real significance of a great invention ultimately rests in its ability to transcend its original purpose and empower radically new ideas. Reconfigurable hardware holds such an extraordinary potential. Not only will it give us additional flexibility in our current directions, but more importantly, it will create the substrate for emergent capabilities of awesome reach.” Federico Faggin
Field Programmable Gate Arrays
The old way (1984)
− Lots of wire
− Vast quantities of Mountain Dew
− Healthy disregard for personal hygiene
The new way (2005)
− Very little wire
− Vast quantities of Mountain Dew
− Healthy disregard for personal hygiene
Field Programmable Gate Arrays
Mask Programmable Gate Arrays (MPGAs)
Field Programmable Gate Arrays
...a general purpose multi-level programmable device that is customized in the package by the end-user...
[Figure: disconnected gates plus a software description of the circuit (10110111100) yield connected gates.]
Field Programmable Gate Arrays (FPGA’s)
[Figure: cost versus volume for ASIC and FPGA, with NRE marked.]
Field Programmable Gate Arrays
Advantages of FPGAs
− Reduced design cycle
− Increased security
− Commodity parts (better reliability, lower cost)
− Faster upgrades (technology generation, application)
− Design for speed
Disadvantages of FPGAs
− Historically trail by one technology generation
− Switching fabric takes up space
− Slow clock (~200-300 MHz vs. 3 GHz)
Field Programmable Gate Arrays (FPGA’s)
Traditional uses
− Communications
− Data Acquisition
− Signal Processing
− Embedded Processing
− Hardware testing
Industries using FPGAs
− Military Aerospace and Defense
− Automotive
− Consumer (STBs, Broadband, PDAs, HDTV)
− Networking (Routers, Switches, Modems)
Key Features
− Programmability
− Re-Programmability
FPGAs and General Computing
Algorithms implemented in hardware can be many times faster than software implementations
Examples:
− Graphics cards and chips
− Grape; Grape-MD
− QCDOC: Lattice Gauge Theory on a chip
IDEA! Offload bottleneck calculations to a programmable circuit
− 90/10 rule
[Figure: CPU coupled to an FPGA.]
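How much the offload idea pays back can be estimated with Amdahl's law, which is not named on the slide but is the standard way to reason about a 90/10 split. A minimal sketch, assuming 90% of the run time sits in the offloaded kernel and using hypothetical kernel speedups:

```python
def overall_speedup(offloaded_fraction, kernel_speedup):
    """Amdahl's law: whole-program speedup when only part of the run time is accelerated."""
    remaining = 1.0 - offloaded_fraction
    return 1.0 / (remaining + offloaded_fraction / kernel_speedup)

# Hypothetical kernel speedups on the FPGA; 0.9 is the 90/10 rule.
for s in (10, 100, 1000):
    print(f"kernel {s:>4}x faster -> application {overall_speedup(0.9, s):.2f}x faster")
```

Whatever the kernel speedup, the application as a whole cannot exceed 10x while the remaining 10% of the run time stays on the CPU.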
Spatial vs. Temporal Computing
CPUs are temporal processors
− Algorithm translated into set of instructions and data which are interpreted sequentially
FPGAs are spatial processors
− Algorithm is laid out in space in digital hardware
(a) Spatial and (b) Temporal computations for the expression y[i]=w1*x[i]+w2*x[i-1]+w3*x[i-2]+w4*x[i-3]
“The Density Advantage of Configurable Computing”, Andre DeHon, Computer, April 2000
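The two styles can be mimicked in software for the four-tap expression above. This is only an illustrative sketch: Python still runs both versions sequentially, the weights are made-up values, and samples before the start of the input are assumed to be zero.

```python
def fir_temporal(x, w):
    """Temporal style: one multiply-accumulate at a time, reusing a single accumulator."""
    y = []
    for i in range(len(x)):
        acc = 0.0
        for k, wk in enumerate(w):
            if i - k >= 0:
                acc += wk * x[i - k]
        y.append(acc)
    return y

def fir_spatial(x, w):
    """Spatial style: a shift register feeds four multipliers and a two-level adder tree."""
    assert len(w) == 4
    taps = [0.0, 0.0, 0.0, 0.0]          # holds x[i], x[i-1], x[i-2], x[i-3]
    y = []
    for sample in x:
        taps = [sample] + taps[:-1]      # shift the new sample in
        p = [wk * t for wk, t in zip(w, taps)]   # four multipliers laid out side by side
        y.append((p[0] + p[1]) + (p[2] + p[3]))  # adder tree
    return y

weights = [1.0, 2.0, 3.0, 4.0]           # hypothetical w1..w4
signal = [1.0, 2.0, 3.0, 4.0, 5.0]
print(fir_temporal(signal, weights))
print(fir_spatial(signal, weights))
```

In hardware, every line inside the spatial loop body corresponds to logic that exists at the same time, so one output can emerge per clock once the pipeline is full.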
Why FPGA’s are fast?
− Parallelism
− Pipelining
− Instruction efficiency
− Latency
Maximum Filter: 3x3 box on 1024x1024 image
Device | Clock (MHz) | Clock Cycles | Parallelism | Instr/Pixel | CPI
FPGA (XC2V2000E) | 40 | 131,072 | 8 | 8 (Pipeline Depth) | N/A
Pentium III | 1,000 | 236,865,600 | 1 | 210 | 1.08
Time(CPU) = (1/Clock) X N(Pixels) X (Clocks/Instr.) X (Instr/Pixel)
Time(FPGA) = (1/Clock) X N(Pixels) / P
Ratio = Time(CPU) / Time(FPGA) = .04 X 8 X 1.08 X 210 ≈ 72.6
“A Quantitative Analysis of the Speedup Factors of FPGAs over Processors”, Guo, Najjar, Vahid, Vissers, FPGA’04
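The two timing formulas can be evaluated directly from the table entries. A short sketch using only the quoted clock rates, parallelism, CPI and instruction count; the function names are illustrative, and the model ignores pipeline fill and image-border effects, so small differences from measured cycle counts are expected.

```python
def cpu_time(clock_hz, n_pixels, cpi, instr_per_pixel):
    """Time(CPU) = (1/Clock) x N(Pixels) x (Clocks/Instr) x (Instr/Pixel)."""
    return n_pixels * cpi * instr_per_pixel / clock_hz

def fpga_time(clock_hz, n_pixels, parallelism):
    """Time(FPGA) = (1/Clock) x N(Pixels) / P."""
    return n_pixels / parallelism / clock_hz

N_PIXELS = 1024 * 1024
t_cpu = cpu_time(1000e6, N_PIXELS, 1.08, 210)   # Pentium III at 1,000 MHz
t_fpga = fpga_time(40e6, N_PIXELS, 8)           # XC2V2000E at 40 MHz, 8 pixels per clock
ratio = t_cpu / t_fpga

print(f"CPU {t_cpu * 1e3:.1f} ms, FPGA {t_fpga * 1e3:.2f} ms, ratio {ratio:.1f}")
```

The FPGA wins by a large factor despite a clock 25 times slower, because parallelism and the absence of an instruction stream outweigh the clock deficit.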
Scaling Advantage of FPGAs
Moore’s Law: Number of transistors doubles every 2.03 years over last 35 years.
− CPUs: Lower capacitance, faster clock, more FLOPS
− Memory: Smaller feature size, higher density, more capacity
− FPGAs: Faster clock and more capacity, even more FLOPS
The computational capacity of FPGAs exceeds that of CPUs and the gap is increasing.
DeHon’s Law
“The Density Advantage of Configurable Computing” Computer, April 2000 Andre DeHon
1996 DEC Alpha
− .18µ technology
− 208 mm² die size
− 6.8 × 10^9 λ²
− 2.3 ns clock
− 128 ALU bitops/clock
8.6
DeHon’s Law: FP Addition
[Figure: Historical and Projected Scaling of FP Addition on FPGAs and CPUs; 2003 data points: Virtex II Pro 100-6 and Pentium 4 3.2 GHz.]
“FPGA’s vs. CPUs: Trends in Peak Floating–Point Performance”, Keith Underwood, FPGA’04
DeHon’s Law: FP Multiplication
[Figure: Historical and Projected Scaling of FP Multiplication on FPGAs and CPUs; 2003 data points: Virtex II Pro 100-6 and Pentium 4 3.2 GHz.]
“FPGA’s vs. CPUs: Trends in Peak Floating–Point Performance”, Keith Underwood, FPGA’04
Trends in FPGA FP Performance
[Figure: Xilinx products, logic cells (thousands, 0 to 250) versus time (1996 to 2005): XC4000XL, Virtex-E, Virtex-2, Virtex-4. Data from Xilinx 2005 Annual Report.]
CPU: Flops increase = 1.4x/yr
FPGA: Flops increase = 1.59 (Spatial Factor) x 1.39 (Time Factor) = 2.2x/yr
Each year FPGA FP capability increases 50% faster than that of CPU’s!
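The FPGA rate is the product of a spatial factor (more logic per chip) and a time factor (faster clock), and the gap to CPUs compounds year over year. A minimal sketch of that arithmetic; the five-year horizon is an arbitrary choice for illustration.

```python
CPU_GROWTH = 1.4        # CPU FLOPS growth per year
SPATIAL_FACTOR = 1.59   # FPGA capacity growth per year
TIME_FACTOR = 1.39      # FPGA clock growth per year

fpga_growth = SPATIAL_FACTOR * TIME_FACTOR   # about 2.2x per year
relative_growth = fpga_growth / CPU_GROWTH   # FPGA gain over CPUs each year

def gap_after(years):
    """Factor by which the FPGA/CPU performance ratio widens after the given number of years."""
    return relative_growth ** years

print(f"FPGA {fpga_growth:.2f}x/yr, {relative_growth:.2f}x relative to CPUs, "
      f"{gap_after(5):.1f}x wider gap after 5 years")
```

The relative factor of about 1.58 is where the "50% faster" figure comes from.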
Summary
FPGA’s are a candidate to fill the capability gap
− Can provide significant hardware acceleration to a wide variety of problems
− Fits profile for continuation of commodity performance solutions
DeHon’s law: The computational capacity of FPGAs is increasing at a faster rate than that of CPU’s
− Sometime in the last few years FPGA FP performance surpassed that of CPU’s
− FPGA’s and CPU’s benefit equally from coming semiconductor processing technology improvements
PART III
Reconfigurable High Performance Computing
“The most constant difficulty in contriving the engine has arisen from the desire to reduce the time in which the calculations were executed to the shortest which is possible.” --- Charles Babbage
“I feel the need…the need for speed” --- Tom Cruise (Mav)
Pentium Prescott
− 90 nm CMOS
− Clock 3.4 GHz
− Floating Point Operations ~7% of chip area
[Die photo with regions labeled L2 Cache, FP and Other. Source: www.chip-architect.com]
What HPC Users Want
How Computational Physicists see chips
[Figure: chip regions labeled FP and Other.]
How Computational Physicists would design chips
[Figure: chip regions labeled FP, Other, Cache and Bessel Functions.]
The House and the Dishwasher
We have lots of dishes to do
What Computer Scientists have given us
What we want
Cray Quote
"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?" --- Seymour Cray
Xilinx XC4VLX200
− 90 nm CMOS
− 200,448 Logic Cells
− 750 kB BRAM
− 96 18x18 bit Multipliers
− Rocket IO 1.5 GB/s
− Clock <500 MHz
32 bit Integer and Fixed Point
− Thousands of Arithmetic Units
Floating Point
− 600 SP Floating Point Multipliers
− 100 SP Floating Point Dividers
− 100 DP Floating Point Multipliers
− 20 DP Floating Point Dividers
− SP != 2 X DP
Theoretical Peaks
− SP Floating Point 20-120 GFLOPs
− DP Floating Point 4-20 GFLOPs
− Integer .5-1 TOP
Nallatech Floating Point Core data used
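The quoted floating-point peak ranges are consistent with filling the chip with one kind of unit and counting one result per unit per clock. The sketch below assumes a 200 MHz operating clock, which is an inference from the numbers rather than something the slide states; it reproduces both ends of the SP and DP ranges.

```python
CLOCK_MHZ = 200  # assumed operating clock; the slide only says the clock is below 500 MHz

# Number of units that fit on the device, one entry per configuration.
UNITS = {
    "SP multipliers": 600,
    "SP dividers": 100,
    "DP multipliers": 100,
    "DP dividers": 20,
}

def peak_gflops(unit_count, clock_mhz=CLOCK_MHZ):
    """Peak rate assuming every unit delivers one result per clock."""
    return unit_count * clock_mhz / 1000

peaks = {name: peak_gflops(count) for name, count in UNITS.items()}
for name, gflops in peaks.items():
    print(f"{name}: {gflops:.0f} GFLOPS")
```

Dividers set the low end of each range and multipliers the high end, which is why the peak depends so strongly on the operation mix.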
Metric Comparison
Metric | Pentium-4 | Virtex-4 | JES
GFLOPs | 6-8 | 20-120 | 40,000
Density (mm2/GFlops) | 16.9 | 0.14-0.81 | ?
Power (W/GFlops) | 16.5 | 0.05-0.32 | 320
Economics ($/GFlops) | $60 | $15-$99 | $7,500
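The JES power entry can be cross-checked against the Earth Simulator node figures given earlier in the talk (640 nodes at 20 kVA each). A quick sketch, assuming kVA can be read as kW (a power factor of 1) and using the rounded 40,000 GFLOPS peak:

```python
JES_NODES = 640
KVA_PER_NODE = 20
JES_GFLOPS = 40_000

# Treat 1 kVA as 1 kW; this is an assumption, real power draw depends on the power factor.
total_watts = JES_NODES * KVA_PER_NODE * 1000
watts_per_gflop = total_watts / JES_GFLOPS

print(f"{total_watts / 1e6:.1f} MW total, {watts_per_gflop:.0f} W/GFLOP")
```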
HPC Concerns
− Hey! The clock speed is slower!
− Can I do Floating Point?
− Where is the chip?
− What about the memory wall?
− What about the bandwidth?
− How do I program it?
− Where is the Mountain Dew?
Floating Point Cores
Commercial
− Nallatech Single and Double Precision FP Cores
− Dillon Engineering
− Xilinx Coregen
Academic
− Miriam Leeser (Northeastern U.)
− Viktor Prasanna (USC)
− Peter Athanas (VA Tech)
FPGA’s and The Memory Wall
− No instruction fetch
− Larger “Register Set”
− More flexibility in memory configuration and access
− Slower clock
Nallatech BenData-WS
− 6 Banks of ZBT-SRAM
− 8 GB/s throughput
PCI Bandwidth Problem
Overall FP Performance = FP Ops / (Data xferred / PCI BW + FP Ops / FPGA Performance)
Blas 1
− Vector multiplication: X·X
− Assume vector length N
− Data Transferred: 8N bytes
− Number of operations: 2N
− FP Operations per Byte xferred: 1/4
− PCI 64 Bit 64 MHz = 512 MB/s
− Performance 64 MFlops
Blas 3
− Matrix Multiplication: A·A
− Assume NxN matrices
− Data Transferred: 8N^2 bytes
− Number of operations: N^3
− FP Operations per Byte xferred: N/8
− Performance: FPGA performance for N>100-200
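The overall-performance formula adds the transfer time to the compute time, so the operations-per-byte ratio decides which term dominates. A minimal sketch for the Blas 3 case, using the slide's 512 MB/s PCI figure and a hypothetical 4 GFLOPS sustained FPGA rate (the FPGA rate is an assumption for illustration):

```python
PCI_BW = 512e6      # bytes per second, 64-bit PCI at 64 MHz as on the slide
FPGA_FLOPS = 4e9    # hypothetical sustained FPGA rate

def overall_flops(fp_ops, bytes_moved, pci_bw=PCI_BW, fpga_flops=FPGA_FLOPS):
    """FP Ops / (Data transferred / PCI BW + FP Ops / FPGA performance)."""
    return fp_ops / (bytes_moved / pci_bw + fp_ops / fpga_flops)

def blas3_flops(n):
    """N x N matrix product as counted on the slide: N^3 operations, 8 N^2 bytes."""
    return overall_flops(n ** 3, 8 * n ** 2)

for n in (10, 100, 1000):
    print(f"N={n:>4}: {blas3_flops(n) / 1e9:.2f} GFLOPS of {FPGA_FLOPS / 1e9:.0f} peak")
```

As N grows the transfer term shrinks relative to the compute term and the result approaches the FPGA's own rate; for a Blas 1 style kernel the ratio of operations to bytes is fixed, so the PCI term never fades.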
Barriers to Use
Development Environment
− Unfamiliar to application programmers
− Human is least programmable part of development chain
− Biased toward digital designers
Bandwidth
− PCI boards currently present a bottleneck
− PCI-Express will help (1Q06)
Hardware is expensive
− $8K – $250K+
Software is expensive
− $2,500 - $25,000 for development tools
Technology Adoption
− HPC developers are willing to adopt new hardware and make reasonable changes to codes
RHPC Algorithms
− Time stepping
− Monte Carlo
− Integer based (e.g. BioInformatics, Encryption)
− Spatially Local (e.g. FD)
− Pixel processing
− Digital filtering
− Convolutions and Transforms (FFT, WT)
− Data Compression; Encryption
− High Computation to Bandwidth ratio
− Trivially Parallel
− Minimize data transfer
− Repetitive application of kernels
− Long Pipelines of repetitive kernels
Recent FP FPGA Applications
CFD – Schnore and Smith (GE)
Kernel | P4 (GFlops) | RCC (GFlops) | Speedup
Euler | .154 | 6.4 | 41x
Viscous | .077 | 10.3 | 133x
Smoothing | .086 | 2.2 | 25x
138 x 66 x 52 Mesh
2 Nallatech BenNuey
4 Nallatech BenData-WS
“Towards an RCC-based Accelerator for Computational Fluid Dynamics Applications”, Schnore and Smith, GE
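The speedup column is the ratio of the RCC and P4 rates. A short consistency check on the quoted figures; the published speedups are presumably computed from unrounded rates, so the ratios here agree only to the integer part.

```python
# (P4 GFLOPS, RCC GFLOPS, quoted speedup) per kernel, as listed above.
RESULTS = {
    "Euler": (0.154, 6.4, 41),
    "Viscous": (0.077, 10.3, 133),
    "Smoothing": (0.086, 2.2, 25),
}

speedups = {name: rcc / p4 for name, (p4, rcc, _) in RESULTS.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x computed, {RESULTS[name][2]}x quoted")
```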
Future
− Bigger FPGAs (2006-2007): >500K logic cells
− Lots of headroom on the FPGA clock
− FPGAs with more embedded hardware
− Multi-core chips; Less mention of Clock
− Tighter coupling between GPP and RC
Conclusions
The HPC community is confronting a capability gap
− Hardware and software challenges threaten the progression of system performance
Reconfigurable computing and FPGAs are a potential solution
− Intrinsic dimensionally rooted scaling laws favor reconfigurable spatial computing over temporal computing
− FPGA’s have a greater intrinsic computational density and the gap is growing
− Many challenges are ahead (e.g. Development environments, bandwidth, cost)
Merging of hardware and software
− The hardware is the software; The software is the hardware
An Operating System, Software tools and Applications to enable the use of reconfigurable devices by HPC programmers and users
Frontier
− Logical extension of UNIX to manage reconfigurable co-processing.
SCC (Stone Ridge Compiler Collection)
− Compiler tools that target Frontier and conform to HPC development methodologies