System Interfaces & Caches
RAMP Retreat, Austin, TX, June 2009
Page 2
Disclaimer
All Information Contained in This Presentation Is Based on Publicly Available Material
References:
– Goldhammer & Ayer: “Understanding Performance of PCI Express Systems”, Xilinx WP350, 2008
– John Goodacre: “The Effect and Technique of System Coherence in ARM Multicore Technology”, ARM Developer Conference, 2008
– T. Shanley: “The Unabridged Pentium 4: IA32 Processor Genealogy”, Addison-Wesley, 2004
Page 3
This Talk
You Will Learn:
How an x86 CPU and an FPGA Can Exchange Data
– IO Device Mapping vs Shared Memory
How the Low-Level Coherency Interface Works
– Data and Control Exchange
How Standard Programming Models Are Mapped
– FIFOs, Message Passing, Shared Memory
How Direct Cache Access Can Reduce Latency
– And How We Can Overcome Latency Challenges
Page 4
Context: System Interconnect
[Diagram: system interconnect evolution from 2005 (FSB + PCIe, equal BW) through 2008 (FSB/QPI at 2x PCIe BW) to future platforms, with the FPGA attached either on PCIe or in a CPU socket. Source: Intel, 2008]

PCIe IO Device       vs  In-Socket Accelerator
Device driver call   vs  Shared memory function
Cache flush          vs  Direct cache access
Page 5
CPU-FPGA Communication: Generic, Without DCA (Direct Cache Access)

[Diagram: single-core CPU, interconnect, accelerator, memory, and mailbox; numbered arrows match the steps below]

1. CPU flushes input data from cache
2. CPU writes to mailbox
3. Interrupt serviced by accelerator
4. Accelerator reads input data from memory
5. Accelerator writes result data to memory
6. Accelerator writes to mailbox
7. Interrupt serviced by CPU
8. CPU reads result data
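To make the eight steps concrete, here is a minimal sketch of the software side in C. The mailbox mapping, buffer pointers, and interrupt flag are hypothetical scaffolding (they are not from the talk); only the _mm_clflush intrinsic is real x86.

```c
/* Sketch of the SW side of the generic (non-DCA) flow above.
 * mailbox/in_buf/out_buf mappings and the IRQ flag are hypothetical;
 * _mm_clflush is the real x86 cache-line flush intrinsic. */
#include <emmintrin.h>   /* _mm_clflush */
#include <stddef.h>
#include <stdint.h>

#define LINE 64

static volatile uint32_t *mailbox;   /* MMIO mailbox (steps 2 and 6) */
static uint8_t *in_buf, *out_buf;    /* DRAM buffers shared with HW */
static volatile int result_ready;    /* set by the IRQ handler (step 7) */

void offload(size_t nbytes)
{
    /* Step 1: flush input so the accelerator sees current data in DRAM. */
    for (size_t i = 0; i < nbytes; i += LINE)
        _mm_clflush(in_buf + i);

    /* Step 2: ring the mailbox; steps 3-6 happen in the accelerator. */
    mailbox[0] = 1;

    /* Steps 7-8: wait for the completion interrupt, then read results.
     * Flushing first discards any stale cached copies of out_buf. */
    while (!result_ready)
        ;
    for (size_t i = 0; i < nbytes; i += LINE)
        _mm_clflush(out_buf + i);
}
```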
Page 6
CPU-FPGA Communication: PCI Express Based FPGA

[Diagram: single-core CPU, PCIe + FSB, accelerator, memory, and device registers; numbered arrows match the steps below]

1. CPU flushes input data from cache
2. CPU writes to device registers
3. Write seen by accelerator
4. Accelerator reads input data from memory with DMA
5. Accelerator writes result data to memory with DMA
6. Accelerator writes to device registers
7. Write seen by processor
8. CPU reads result data
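A corresponding sketch for the PCIe case. The BAR0 register layout, doorbell, and status convention below are invented for illustration; a real device defines its own register map, and a real driver would take an MSI interrupt rather than poll.

```c
/* Sketch of steps 2-7 for a PCIe-attached FPGA; register offsets
 * and the doorbell/status convention are hypothetical. */
#include <stdint.h>

static volatile uint32_t *bar0;   /* device registers, mapped uncached */

enum { REG_SRC_LO, REG_SRC_HI, REG_DST_LO, REG_DST_HI,
       REG_LEN, REG_DOORBELL, REG_STATUS };

void start_dma(uint64_t src_phys, uint64_t dst_phys, uint32_t len)
{
    /* Step 2: program DMA source/destination/length registers
     * (step 1, flushing the source buffer, is the same as before). */
    bar0[REG_SRC_LO] = (uint32_t)src_phys;
    bar0[REG_SRC_HI] = (uint32_t)(src_phys >> 32);
    bar0[REG_DST_LO] = (uint32_t)dst_phys;
    bar0[REG_DST_HI] = (uint32_t)(dst_phys >> 32);
    bar0[REG_LEN]    = len;

    /* Step 3: the doorbell write is seen by the accelerator, which
     * then DMAs input and results to/from memory (steps 4-5). */
    bar0[REG_DOORBELL] = 1;

    /* Steps 6-7: completion surfaces in a status register. */
    while (bar0[REG_STATUS] == 0)
        ;
}
```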
Page 7
CPU-FPGA Communication: FSB Based FPGA

[Diagram: Xeon, fabric-hosted accelerator, memory, and mailbox on the FSB snoop interface; numbered arrows match the steps below]

1. CPU leaves data in cache
2. CPU writes to mailbox in cached memory
3. Snoop intercepted by accelerator
4. Accelerator reads input data from cached memory (snoop intercepted by CPU, which supplies the data)
5. Accelerator writes result data to cached memory (snoop intercepted by CPU, which takes the data)
6. Accelerator writes to mailbox in cached memory
7. Snoop intercepted by CPU
8. Result data is already in the CPU's cache
Page 8
Raw FPL Interface (FPL = FSB Protocol Layer)

[Diagram: Xeon and fabric-hosted accelerator exchanging snoops over the FSB]

Synchronization Region (Coherent)
– Coherent memory mailboxes
System Memory (Not Coherent)
– Unguarded shared memory accesses (Convey added guarded access to this)
Page 9
Raw FPL Interface (FPL = FSB Protocol Layer)

Special 2 MB synchronization region used for communication between SW and HW
– Writes by SW immediately result in a notification to the FPGA (snoop control)
– SW can poll locations waiting for a write from the FPGA
SW can also allocate other 2 MB regions
– Simple pinned memory regions
– Use the synchronization region to pass physical addresses to hardware
Use 2 MB regions to move data between domains
Use the synchronization region as start/finished indicators (see the sketch below)
– Hardware uses a snoop for start
– Software uses a poll for finished
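A minimal sketch of the software half of this start/finished protocol, assuming an already-mapped synchronization region. The mailbox offsets and flag values are illustrative, not the actual FPL layout.

```c
/* SW side of the start/finished protocol described above.
 * sync_region mapping, mailbox offsets, and flag values are
 * assumptions for illustration, not the real FPL definitions. */
#include <stdint.h>

enum { MBOX_IN_ADDR = 0, MBOX_OUT_ADDR, MBOX_START, MBOX_DONE };

/* The 2 MB synchronization region, mapped cacheable elsewhere so
 * that CPU writes generate FSB snoops the accelerator intercepts. */
static volatile uint64_t *sync_region;

void run_hw_job(uint64_t in_phys, uint64_t out_phys)
{
    /* Pass physical addresses of pinned 2 MB data regions to HW. */
    sync_region[MBOX_IN_ADDR]  = in_phys;
    sync_region[MBOX_OUT_ADDR] = out_phys;
    sync_region[MBOX_DONE]     = 0;

    /* Start: this cached write is snooped by the accelerator. */
    sync_region[MBOX_START] = 1;

    /* Finished: software polls a location the FPGA will write. */
    while (sync_region[MBOX_DONE] == 0)
        ;
}
```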
Page 10
FIFO Programming Model Over FSB
Synchronization region used to convey full/empty status of buffers
Pinned memories acted as elastic buffers for SW
On-chip memories acted as elastic buffers for HW
AFUs thought they were just reading from and writing to streams
Exactly the kind of setup suitable for running Map/Reduce jobs in hardware
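A sketch of what the SW producer half of such a FIFO could look like: head/tail indices live in the synchronization region (so index updates are snooped by hardware) and a pinned buffer provides the elastic storage. The struct layout, index placement, and one-cache-line slot size are assumptions; only the overall scheme follows the slide.

```c
/* Illustrative SW-side FIFO push over FSB shared memory. */
#include <stdint.h>
#include <string.h>

#define FIFO_SLOTS 256
#define SLOT_BYTES 64            /* one cache line per slot (assumed) */

struct fifo {
    volatile uint64_t *head;     /* in sync region, written by SW */
    volatile uint64_t *tail;     /* in sync region, written by HW */
    uint8_t *slots;              /* pinned elastic buffer */
};

/* Returns 1 on success, 0 if the FIFO is full (HW not draining). */
int fifo_push(struct fifo *f, const void *data)
{
    uint64_t h = *f->head, t = *f->tail;
    if (h - t == FIFO_SLOTS)
        return 0;
    memcpy(f->slots + (h % FIFO_SLOTS) * SLOT_BYTES, data, SLOT_BYTES);
    *f->head = h + 1;            /* snooped: HW sees the new full status */
    return 1;
}
```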
Page 11
Intel: AAL (Accelerator Abstraction Layer) FPGA Co-Processor API

Co-Processor Management API
Streaming API Inside FPGA (FIFOs)
AAL Pins Mailboxes / Manages System Memory
Virtual Memory Support via Workspaces
Accelerator Discovery Services
Accelerator Configuration Management Services

[Diagram: user application, AFU1 proxy, and the Accelerator Abstraction Layer in userspace above the AHM and system protocol layer in kernel space; workspaces WS1-WS3 in system memory; AFU1-AFU3 behind the FSB protocol layer in the FPGA; a package installer and library load AFU configurations]

Liu et al.: “A High Performance, Energy Efficient FPGA Accelerator Platform”, FPGA 2009
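The usage pattern implied by these services might look roughly like the following. To be clear, every identifier below is invented for illustration; these are not the actual Intel AAL types or signatures.

```c
/* Pseudo-API sketch of the AAL usage pattern only (discover an AFU,
 * allocate a pinned workspace, submit work); all names are invented. */
#include <stddef.h>
#include <stdint.h>

typedef struct afu afu_t;                                /* AFU proxy handle */

extern afu_t *aal_discover(const char *afu_id);          /* discovery service */
extern void  *aal_workspace_alloc(afu_t *a, size_t n);   /* pinned, HW-visible */
extern void   aal_submit(afu_t *a, void *ws, size_t n);  /* start the AFU */
extern void   aal_wait(afu_t *a);                        /* block until done */

void example(void)
{
    afu_t   *afu = aal_discover("example-streaming-afu"); /* hypothetical ID */
    uint8_t *ws  = aal_workspace_alloc(afu, 2u << 20);    /* 2 MB workspace */

    /* ... fill ws with input data ... */
    aal_submit(afu, ws, 2u << 20);
    aal_wait(afu);               /* results are now in the workspace */
}
```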
Page 12
Arches: MPI (Message Passing Interface) Symmetrical Peer Processing

Standard MPI Programming Model & API
Lightweight Message Passing Protocol Implementation
Focused on Embedded Systems
Explicit Rank-to-Node Binding Support

[Diagram: x86 MPI SW processes, a PPC MPI SW process, and an MPI HW "process" (HW MPE) communicating through an MPI FSB bridge and an MPI GT/GTX serial I/O bridge, with shared memory. Source: Arches Computing, 2009]
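Because the programming model is standard MPI, host code does not change: the hardware engine is simply another rank. In the sketch below the rank number bound to the HW MPE is an assumption, and that rank is emulated in software (an echo) so the example runs with any MPI implementation.

```c
/* Standard MPI code in which one rank stands in for the HW MPE.
 * HW_RANK and its echo behavior are assumptions; in the Arches
 * system, rank-to-node binding would map it to the hardware
 * "process" behind the MPI FSB bridge. Run with: mpirun -np 2 */
#include <mpi.h>
#include <stdio.h>

#define HW_RANK 1
#define N 256

int main(int argc, char **argv)
{
    int rank, data[N] = {0}, result[N];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Send input to the "hardware" rank and wait for results. */
        MPI_Send(data, N, MPI_INT, HW_RANK, 0, MPI_COMM_WORLD);
        MPI_Recv(result, N, MPI_INT, HW_RANK, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("result[0] = %d\n", result[0]);
    } else if (rank == HW_RANK) {
        /* SW stand-in for the HW MPE: echo the payload back. */
        MPI_Recv(result, N, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(result, N, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```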
Page 13
Convey: Shared Memory (Convey HC-1, 2008)

Socket Filler Module
– Bridge FPGA implements the FSB protocol
– Full snoop support
FPGA-Based Compute Accelerator
– Pre-defined vector instruction set
– Shared memory programming model
– ANSI C support
Accelerator Cache Memory
– 80 GB/s BW
– Snoop coherent with system memory
– Direct cache access CPU <-> FPGA

[Diagram: HC-1 coprocessor with eight MC LX155 memory-controller FPGAs. Source: Convey Computer, 2008]
Page 14
Latency: PCI Express & FSB (The Effects of DCA)

PCIe
– ~400 ns latency
– Gen1 x8 interface, 64-byte payloads
– Includes: PCIe device to chipset
– Does not include: chipset-to-CPU latency (add FSB latency)
FSB
– 110 ns latency
– 64-byte DCA transfers
– 200+ ns latency on cache-miss operations (fetch from memory)
DCA: 6x reduced latency (plausibly: the ~400 ns PCIe figure plus the chipset-to-CPU hop and a 200+ ns memory fetch approaches 600+ ns end to end, versus ~110 ns with DCA)
Results are for minimally loaded systems (i.e. a single master active)
Chipset can defer and/or retry transactions in loaded systems (both FSB and PCIe)
Typically less congestion on the FSB than on the PCIe interface
DCA = Direct Cache Access
Page 15
DCA with ARM ACP (ACP = Accelerator Coherency Port)

Xilinx ACP platform applications are customer confidential, hence ARM is used as the example here

[Diagram: ~8x reduced latency with the Accelerator Coherency Port. Source: ARM, 2008]
Page 16
System Memory Bandwidth (PCI Express on Virtex-5)

Virtex-5 LXT ML555 fitted in a Dell PowerEdge machine (Intel E5000P chipset)
– Goldhammer & Ayer: “Understanding Performance of PCI Express Systems”
Virtex-6 & Gen2 data is available (but not public yet)
– Rough data points: 2x the BW, similar latency
PCIe (Gen 1) x16
– Partner IP (not studied)
Link width   Transfer size   Read BW (GB/s)   Write BW (GB/s)
x1           8 KB            0.163            0.249
x1           16 KB           0.164            0.230
x1           32 KB           0.164            0.223
x4           8 KB            0.644            0.882
x4           16 KB           0.668            0.883
x4           32 KB           0.680            0.864
x8           8 KB            1.257            1.77
x8           16 KB           1.328            1.77
x8           32 KB           1.371            1.77
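For context (a calculation from known PCIe Gen1 parameters, not from the talk): Gen1 signals at 2.5 Gb/s per lane, or 250 MB/s after 8b/10b encoding, so an x8 link peaks at 2 GB/s per direction; the measured 1.371 GB/s read and 1.77 GB/s write at 32 KB therefore correspond to roughly 69% and 88% of the raw link rate.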
Page 17
System Memory Bandwidth (FSB on Virtex-5)

Intel Xeon 7300 chipset
FPL performance (FSB Protocol Layer = raw interface)
– FPL: primitives for data and control exchange
Higher-level protocols may reduce BW or require longer burst sizes to achieve the same BW
– AAL, MPI, other: higher-level protocols built on top of FPL
Block size   Read BW (GB/s)   Write BW (GB/s)
512 B        1.62             1.3
1 KB         2.47             not recorded
2 KB         3.36             not recorded
4 KB         3.97             not recorded
8 KB         4.54             not recorded
16 KB        4.66             3.4
64 KB        4.99             3.4
128 KB       5.03             3.4
Page 18
Bandwidth: PCI Express & FSB
PCIe Gen2 x8 (estimated performance data)
– Double the bandwidth of Gen1
– 2.66 GB/s read
– 3.54 GB/s write
FSB
– 1.7x the read BW of PCIe Gen2 x8
– Half duplex only
– 4.66 GB/s read
– 3.4 GB/s write (not fully optimized yet)
Data is for 16 KB transfers on both PCIe and FSB
PCIe Gen2 BW figures are estimates
Page 19
Summary
FPGA Mapped Into Shared System Memory
Raw FPL Interface Exposes Coherency Engine in FPGA
Multiple Programming Models Supported
– FIFO, Message Passing, Shared Memory
DCA Helps To Reduce Latency
Application Code Must Maximize Issue Rate