Reconfigurable Computing Leveraging FPGA Accelerators in High-Performance Computing Applications Craig Ulmer [email protected] June 2, 2005 Sandia is

Reconfigurable Computing Leveraging FPGA Accelerators in

High-Performance Computing Applications

Craig [email protected]

June 2, 2005

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Craig Ulmer, Ryan Hilles, and David Thompson SNL/CAKeith Underwood and K. Scott Hemmert SNL/NM

What is Center 8900 doing with FPGAs?

• Computer Sciences & Information Technologies 8900– High-Performance Computing

– Viz & Scientific computing (8963)

• FPGA LDRD for accelerating HPC– NM: Computational aspects

– CA: Communication aspects

• FPGA LDRD for network security– NM/CA: Smart routers

Catalyst

Outline

• Reconfigurable Computing– Communication Aspects– Computational Aspects

• New Platforms: Cray XD1

• Summary

In a nutshell..

• Reconfigurable Computing– Use FPGAs as a means of accelerating simulations– Implement key computations in hardware: soft-hardware

double doX( double *a, int n) {int i;double x;

x=0;for(i=0;i<n;i+=3){

x+= a[i] * a[i+1] + a[i+2];…

}…

return x;}

* +

+

a[i] a[i+1]

Z -1

a[i+2]

Software Hardware Synthesized Hardware

FPGA

CPU

RC Platform

Trivial Example: Sorting

• Sort a matrix of 64b values

• Hardware implementation– Sorting element– Array of elements– Data transfer engine

• Exploits parallelism– Do n comparisons each clock– Requires 3n clock cycles

• Limited to n values– Use as building block

Sorting Array

AddressWr Req

Rd ReqDMA

EngineSortingControl

Double-BufferedOutgoing Memory

Reg

>

Sorting Element11.532.994.188.779.434.539.254.2

24.354.14.235.1

14.83.019.791.4

17.581.39.431.394.188.732.911.579.454.239.234.5

54.135.124.34.2

91.419.714.83.0

81.331.317.59.4

unsorted sorted

Sorting Performance

• Xilinx Virtex II/Pro50-7 FPGA– 128 elements– 110 MHz

• Data transfer by FPGA– Read/write concurrency– Pinned memory

• 2-4x Rate of Quicksort(128)– Host is 2.2GHz AMD Opteron

0

5

10

15

20

25

30

35

1 10 100 1,000

Matrix Rows

Sor

ting

Rat

e (M

iWor

ds/s

)

FPGA (No Copy)

FPGA (Pre/Post Copy)

CPU

RC: Communication Aspects

New FPGAs are “Network Ready”

• Multi-gigabit transceivers (MGTs)– V2Pro: twenty-four 3Gbps channels– V2ProX: twenty 10Gbps channels– InfiniBand, FC, GigE, 10GigE

• Multiple embedded processors– PowerPC 405 400MHz

• Direct connection to network– Build NI in hardware– Use PPC core for complex protocols– User-defined computational circuits

Xilinx Virtex II/Pro-7

NetworkInterface

PowerPCCPU

User-DefinedCircuits

TCP/GigE Network Interface for FPGAs

• Network interface (NI) in hardware– TCP Offload Engine (TOE)– Gigabit Ethernet (GigE)

• Refined TCP functionality– Slow start, NACK detection, Throttling…– User data rate to host: 600 Mb/s

Outgoing TCPMessageControl

Rocket I/O TxRx

Align

Decode

ARPCache

Framer

MACHeader

Control

ARP Reply

Ping Reply IP Header

IncomingByte Stream

TimeoutMonitor

CRCOutgoing

Byte Stream

Incoming TCPMessageControl

TOE

GigEUnit Slices V2P20

TOE 2,068 22%

GigE 1,102 12%

Transport Framework for Visualization

• Need a framework for supporting Visualization in FPGAs– Want to perform viz tasks, not rendering– Built Chromium interface to transport OpenGL graphics

• Results– Functional system where visualization in FPGA, Rendering in Host– Saturate GigE link with OpenGL data

VizApp

GigE NetworkGigETCP

OffloadEngine

ChromiumPacketizer

FPGA

Network Intrusion Detection System (NIDS)

• Collaboration with Georgia Tech– GT: Pattern matching cores– SNL: GigE NI cores

• Network Intrusion Detection System – Examine packets & remove malicious– Based on Snort rule set (1,500 rules)

• Filter multiple GigE links transparently– Demonstrated for two links

V2P100V2P70V2P50V2P30V2P20V2P7

16-bit32-bit64-bit

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

16-bit

32-bit

64-bit

NI NI

Malicious PacketPattern Matching

NI NI

OK Drop Scheduler

RC: Computational Aspects

Floating Point Computations

• Floating point has been weakness for FPGAs– Complex to implement, consume many gates

• Keith Underwood and K. Scott Hemmert– Developed FP library at SNL/NM

• FP Library– Highly-optimized and compact

– Single/Double Precision

– Add, Multiply, Divide, Square Root

– V2P50: 10-20 DP cores, 130-210 MHz

Example: FP Distance Computation

• Compute c=sqrt(a2 + b2)– Simple pipeline (88 stages)

• Implemented on Xilinx Virtex II/Pro 50– 159 MHz clock rate

– 39% of V2P50

– 75 MiOperations/s

IncomingSRAM

OutgoingSRAM

DMAEngine

IncomingSRAM

FromHost

ToHost

0

10

20

30

40

50

60

70

80

90

100

100 1,000 10,000 100,000 1,000,000

Distance Computations

Pro

cess

ing

Rat

e (M

iOpe

ratio

ns/s

)

CPU

FPGA

New Platforms: Cray XD1

Cray XD1

• Dense multiprocessor system– 12 AMD Opterons on 6 blades

– 6 Xilinx Virtex-II/Pro FPGAs

– InfiniBand-like interconnect

– 6 SATA hard drives

– 4 PCI-X slots

– 3U Rack

Cray XD1: Placing FPGAs Near Memory

• Blades use HyperTransport– High bandwidth I/O

• FPGA accelerator– Xilinx Virtex II/Pro 50 (-7)– HT connection: 1.6+1.6GB/s– 10x performance of PCI

• First of a new breed– Commodity HT accelerators

CPU CPU MainMemory

MainMemory

NISRAM

Expansion Module

SRAM

SRAM

SRAM

FPGANI

1.6 GB/s HT

Processor Blade

Summary

• FPGAs are getting faster and less expensive– Finally have capacity for interesting problems

– Million gate parts for $5

• Large amount of reconfigurable computing work at SNL– Experience with Virtex II/Pro Platform FPGAs

– Networks & embedded CPUs

– Willing to share cores, work with others

Deliverables

Networking Visualization Computations XD1 General

• OpenGigE• OpenTOE• Raw GigE• GigE Bridge• NIDS Filter

• Isosurfacer• Chromium Packetizer

• Sorting Array• MD5• 2D Distance

• DMA Engine• Computation

Framework

• Asynch. Message FIFOs• Clock domain data pass• PowerPC instantiation

SNL/CA Cores

32b/64b Floating-Point Computations

• Add• Multiply• Divide• Square root

• Vector/Matrix Multiply

SNL/NM Cores

[email protected] http://cdulmer.ran.sandia.gov

0x

5x

10x

15x

20x

25x

30x

0 10,000 20,000 30,000 40,000 50,000

FPGA Price (Relative to V2P7)

FPGA Density (Relative to V2P7)

FPGA Slices

Relative V2P Price & Density

V2P100

V2P70

V2P7

V2P40

Relative Price & Density

Documents

Reconfigurable Computing Leveraging FPGA Accelerators in High-Performance Computing Applications Craig Ulmer [email protected] June 2, 2005 Sandia is