Upload
everett-goodwin
View
220
Download
0
Embed Size (px)
Citation preview
Reconfigurable Computing Leveraging FPGA Accelerators in
High-Performance Computing Applications
Craig [email protected]
June 2, 2005
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Craig Ulmer, Ryan Hilles, and David Thompson SNL/CAKeith Underwood and K. Scott Hemmert SNL/NM
What is Center 8900 doing with FPGAs?
• Computer Sciences & Information Technologies 8900– High-Performance Computing
– Viz & Scientific computing (8963)
• FPGA LDRD for accelerating HPC– NM: Computational aspects
– CA: Communication aspects
• FPGA LDRD for network security– NM/CA: Smart routers
Catalyst
Outline
• Reconfigurable Computing– Communication Aspects– Computational Aspects
• New Platforms: Cray XD1
• Summary
In a nutshell..
• Reconfigurable Computing– Use FPGAs as a means of accelerating simulations– Implement key computations in hardware: soft-hardware
double doX( double *a, int n) {int i;double x;
x=0;for(i=0;i<n;i+=3){
x+= a[i] * a[i+1] + a[i+2];…
}…
return x;}
* +
+
a[i] a[i+1]
Z -1
a[i+2]
Software Hardware Synthesized Hardware
FPGA
CPU
RC Platform
Trivial Example: Sorting
• Sort a matrix of 64b values
• Hardware implementation– Sorting element– Array of elements– Data transfer engine
• Exploits parallelism– Do n comparisons each clock– Requires 3n clock cycles
• Limited to n values– Use as building block
Sorting Array
AddressWr Req
Rd ReqDMA
EngineSortingControl
Double-BufferedOutgoing Memory
Reg
>
Sorting Element11.532.994.188.779.434.539.254.2
24.354.14.235.1
14.83.019.791.4
17.581.39.431.394.188.732.911.579.454.239.234.5
54.135.124.34.2
91.419.714.83.0
81.331.317.59.4
unsorted sorted
Sorting Performance
• Xilinx Virtex II/Pro50-7 FPGA– 128 elements– 110 MHz
• Data transfer by FPGA– Read/write concurrency– Pinned memory
• 2-4x Rate of Quicksort(128)– Host is 2.2GHz AMD Opteron
0
5
10
15
20
25
30
35
1 10 100 1,000
Matrix Rows
Sor
ting
Rat
e (M
iWor
ds/s
)
FPGA (No Copy)
FPGA (Pre/Post Copy)
CPU
RC: Communication Aspects
New FPGAs are “Network Ready”
• Multi-gigabit transceivers (MGTs)– V2Pro: twenty-four 3Gbps channels– V2ProX: twenty 10Gbps channels– InfiniBand, FC, GigE, 10GigE
• Multiple embedded processors– PowerPC 405 400MHz
• Direct connection to network– Build NI in hardware– Use PPC core for complex protocols– User-defined computational circuits
Xilinx Virtex II/Pro-7
NetworkInterface
PowerPCCPU
User-DefinedCircuits
TCP/GigE Network Interface for FPGAs
• Network interface (NI) in hardware– TCP Offload Engine (TOE)– Gigabit Ethernet (GigE)
• Refined TCP functionality– Slow start, NACK detection, Throttling…– User data rate to host: 600 Mb/s
Outgoing TCPMessageControl
Rocket I/O TxRx
Align
Decode
ARPCache
Framer
MACHeader
Control
ARP Reply
Ping Reply IP Header
IncomingByte Stream
TimeoutMonitor
CRCOutgoing
Byte Stream
Incoming TCPMessageControl
TOE
GigEUnit Slices V2P20
TOE 2,068 22%
GigE 1,102 12%
Transport Framework for Visualization
• Need a framework for supporting Visualization in FPGAs– Want to perform viz tasks, not rendering– Built Chromium interface to transport OpenGL graphics
• Results– Functional system where visualization in FPGA, Rendering in Host– Saturate GigE link with OpenGL data
VizApp
GigE NetworkGigETCP
OffloadEngine
ChromiumPacketizer
FPGA
Network Intrusion Detection System (NIDS)
• Collaboration with Georgia Tech– GT: Pattern matching cores– SNL: GigE NI cores
• Network Intrusion Detection System – Examine packets & remove malicious– Based on Snort rule set (1,500 rules)
• Filter multiple GigE links transparently– Demonstrated for two links
V2P100V2P70V2P50V2P30V2P20V2P7
16-bit32-bit64-bit
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
16-bit
32-bit
64-bit
NI NI
Malicious PacketPattern Matching
NI NI
OK Drop Scheduler
RC: Computational Aspects
Floating Point Computations
• Floating point has been weakness for FPGAs– Complex to implement, consume many gates
• Keith Underwood and K. Scott Hemmert– Developed FP library at SNL/NM
• FP Library– Highly-optimized and compact
– Single/Double Precision
– Add, Multiply, Divide, Square Root
– V2P50: 10-20 DP cores, 130-210 MHz
Example: FP Distance Computation
• Compute c=sqrt(a2 + b2)– Simple pipeline (88 stages)
• Implemented on Xilinx Virtex II/Pro 50– 159 MHz clock rate
– 39% of V2P50
– 75 MiOperations/s
IncomingSRAM
OutgoingSRAM
DMAEngine
IncomingSRAM
FromHost
ToHost
0
10
20
30
40
50
60
70
80
90
100
100 1,000 10,000 100,000 1,000,000
Distance Computations
Pro
cess
ing
Rat
e (M
iOpe
ratio
ns/s
)
CPU
FPGA
New Platforms: Cray XD1
Cray XD1
• Dense multiprocessor system– 12 AMD Opterons on 6 blades
– 6 Xilinx Virtex-II/Pro FPGAs
– InfiniBand-like interconnect
– 6 SATA hard drives
– 4 PCI-X slots
– 3U Rack
Cray XD1: Placing FPGAs Near Memory
• Blades use HyperTransport– High bandwidth I/O
• FPGA accelerator– Xilinx Virtex II/Pro 50 (-7)– HT connection: 1.6+1.6GB/s– 10x performance of PCI
• First of a new breed– Commodity HT accelerators
CPU CPU MainMemory
MainMemory
NISRAM
Expansion Module
SRAM
SRAM
SRAM
FPGANI
1.6 GB/s HT
Processor Blade
Summary
• FPGAs are getting faster and less expensive– Finally have capacity for interesting problems
– Million gate parts for $5
• Large amount of reconfigurable computing work at SNL– Experience with Virtex II/Pro Platform FPGAs
– Networks & embedded CPUs
– Willing to share cores, work with others
Deliverables
Networking Visualization Computations XD1 General
• OpenGigE• OpenTOE• Raw GigE• GigE Bridge• NIDS Filter
• Isosurfacer• Chromium Packetizer
• Sorting Array• MD5• 2D Distance
• DMA Engine• Computation
Framework
• Asynch. Message FIFOs• Clock domain data pass• PowerPC instantiation
SNL/CA Cores
32b/64b Floating-Point Computations
• Add• Multiply• Divide• Square root
• Vector/Matrix Multiply
SNL/NM Cores
[email protected] http://cdulmer.ran.sandia.gov
0x
5x
10x
15x
20x
25x
30x
0 10,000 20,000 30,000 40,000 50,000
FPGA Price (Relative to V2P7)
FPGA Density (Relative to V2P7)
FPGA Slices
Relative V2P Price & Density
V2P100
V2P70
V2P7
V2P40
Relative Price & Density