System Interfaces & Caches
RAMP Retreat, Austin, TX, June 2009
Page 2
Disclaimer
All Information Contained in This Presentation Is Based on Publicly Available Material
References:
– Goldhammer & Ayer: “Understanding Performance of PCI Express Systems”, Xilinx WP350, 2008
– John Goodacre: “The Effect and Technique of System Coherence in ARM Multicore Technology”, ARM Developer Conference, 2008
– T. Shanley: “The Unabridged Pentium 4: IA32 Processor Genealogy”, Addison-Wesley, 2004
Page 3
This Talk
You Will Learn:
How an x86 CPU and an FPGA Can Exchange Data
– IO Device Mapping vs Shared Memory
How the Low-Level Coherency Interface Works
– Data and Control Exchange
How Standard Programming Models Are Mapped
– FIFOs, Message Passing, Shared Memory
How Direct Cache Access Can Reduce Latency
– And How We Can Overcome Latency Challenges
Page 4
Context: System Interconnect
[Diagram: system interconnect evolution from 2005 (FSB + PCIe, equal BW) through 2008 (FSB/QPI at 2x PCIe BW) to future platforms, with the FPGA attached either on PCIe or in a CPU socket. Source: Intel, 2008]

PCIe IO Device       vs  In-Socket Accelerator
Device driver call   vs  Shared memory function
Cache flush          vs  Direct cache access
Page 5
CPU-FPGA Communication: Generic, Without DCA (Direct Cache Access)

[Diagram: single-core CPU, interconnect, accelerator, memory, and mailbox; numbered arrows match the steps below]

1. CPU flushes input data from cache
2. CPU writes to mailbox
3. Interrupt serviced by accelerator
4. Accelerator reads input data from memory
5. Accelerator writes result data to memory
6. Accelerator writes to mailbox
7. Interrupt serviced by CPU
8. CPU reads result data
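To make the eight steps concrete, here is a minimal sketch of the software side in C. The mailbox mapping, buffer pointers, and interrupt flag are hypothetical scaffolding (they are not from the talk); only the _mm_clflush intrinsic is real x86.

```c
/* Sketch of the SW side of the generic (non-DCA) flow above.
 * mailbox/in_buf/out_buf mappings and the IRQ flag are hypothetical;
 * _mm_clflush is the real x86 cache-line flush intrinsic. */
#include <emmintrin.h>   /* _mm_clflush */
#include <stddef.h>
#include <stdint.h>

#define LINE 64

static volatile uint32_t *mailbox;   /* MMIO mailbox (steps 2 and 6) */
static uint8_t *in_buf, *out_buf;    /* DRAM buffers shared with HW */
static volatile int result_ready;    /* set by the IRQ handler (step 7) */

void offload(size_t nbytes)
{
    /* Step 1: flush input so the accelerator sees current data in DRAM. */
    for (size_t i = 0; i < nbytes; i += LINE)
        _mm_clflush(in_buf + i);

    /* Step 2: ring the mailbox; steps 3-6 happen in the accelerator. */
    mailbox[0] = 1;

    /* Steps 7-8: wait for the completion interrupt, then read results.
     * Flushing first discards any stale cached copies of out_buf. */
    while (!result_ready)
        ;
    for (size_t i = 0; i < nbytes; i += LINE)
        _mm_clflush(out_buf + i);
}
```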
Page 6
CPU-FPGA Communication: PCI Express Based FPGA

[Diagram: single-core CPU, PCIe + FSB, accelerator, memory, and device registers; numbered arrows match the steps below]

1. CPU flushes input data from cache
2. CPU writes to device registers
3. Write seen by accelerator
4. Accelerator reads input data from memory with DMA
5. Accelerator writes result data to memory with DMA
6. Accelerator writes to device registers
7. Write seen by processor
8. CPU reads result data
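A corresponding sketch for the PCIe case. The BAR0 register layout, doorbell, and status convention below are invented for illustration; a real device defines its own register map, and a real driver would take an MSI interrupt rather than poll.

```c
/* Sketch of steps 2-7 for a PCIe-attached FPGA; register offsets
 * and the doorbell/status convention are hypothetical. */
#include <stdint.h>

static volatile uint32_t *bar0;   /* device registers, mapped uncached */

enum { REG_SRC_LO, REG_SRC_HI, REG_DST_LO, REG_DST_HI,
       REG_LEN, REG_DOORBELL, REG_STATUS };

void start_dma(uint64_t src_phys, uint64_t dst_phys, uint32_t len)
{
    /* Step 2: program DMA source/destination/length registers
     * (step 1, flushing the source buffer, is the same as before). */
    bar0[REG_SRC_LO] = (uint32_t)src_phys;
    bar0[REG_SRC_HI] = (uint32_t)(src_phys >> 32);
    bar0[REG_DST_LO] = (uint32_t)dst_phys;
    bar0[REG_DST_HI] = (uint32_t)(dst_phys >> 32);
    bar0[REG_LEN]    = len;

    /* Step 3: the doorbell write is seen by the accelerator, which
     * then DMAs input and results to/from memory (steps 4-5). */
    bar0[REG_DOORBELL] = 1;

    /* Steps 6-7: completion surfaces in a status register. */
    while (bar0[REG_STATUS] == 0)
        ;
}
```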
Page 7
CPU-FPGA Communication: FSB Based FPGA

[Diagram: Xeon, fabric-hosted accelerator, memory, and mailbox on the FSB snoop interface; numbered arrows match the steps below]

1. CPU leaves data in cache
2. CPU writes to mailbox in cached memory
3. Snoop intercepted by accelerator
4. Accelerator reads input data from cached memory (snoop intercepted by CPU, which supplies the data)
5. Accelerator writes result data to cached memory (snoop intercepted by CPU, which takes the data)
6. Accelerator writes to mailbox in cached memory
7. Snoop intercepted by CPU
8. Result data is already in the CPU's cache
Page 8
Raw FPL Interface (FPL = FSB Protocol Layer)

[Diagram: Xeon and fabric-hosted accelerator exchanging snoops over the FSB]

Synchronization Region (Coherent)
– Coherent memory mailboxes
System Memory (Not Coherent)
– Unguarded shared memory accesses (Convey added guarded access to this)
Page 9
Raw FPL Interface (FPL = FSB Protocol Layer)

Special 2 MB synchronization region used for communication between SW and HW
– Writes by SW immediately result in a notification to the FPGA (snoop control)
– SW can poll locations waiting for a write from the FPGA
SW can also allocate other 2 MB regions
– Simple pinned memory regions
– Use the synchronization region to pass physical addresses to hardware
Use 2 MB regions to move data between domains
Use the synchronization region as start/finished indicators (see the sketch below)
– Hardware uses a snoop for start
– Software uses a poll for finished
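A minimal sketch of the software half of this start/finished protocol, assuming an already-mapped synchronization region. The mailbox offsets and flag values are illustrative, not the actual FPL layout.

```c
/* SW side of the start/finished protocol described above.
 * sync_region mapping, mailbox offsets, and flag values are
 * assumptions for illustration, not the real FPL definitions. */
#include <stdint.h>

enum { MBOX_IN_ADDR = 0, MBOX_OUT_ADDR, MBOX_START, MBOX_DONE };

/* The 2 MB synchronization region, mapped cacheable elsewhere so
 * that CPU writes generate FSB snoops the accelerator intercepts. */
static volatile uint64_t *sync_region;

void run_hw_job(uint64_t in_phys, uint64_t out_phys)
{
    /* Pass physical addresses of pinned 2 MB data regions to HW. */
    sync_region[MBOX_IN_ADDR]  = in_phys;
    sync_region[MBOX_OUT_ADDR] = out_phys;
    sync_region[MBOX_DONE]     = 0;

    /* Start: this cached write is snooped by the accelerator. */
    sync_region[MBOX_START] = 1;

    /* Finished: software polls a location the FPGA will write. */
    while (sync_region[MBOX_DONE] == 0)
        ;
}
```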
Page 10
FIFO Programming Model Over FSB
Synchronization region used to convey full/empty status of buffers
Pinned memories acted as elastic buffers for SW
On-chip memories acted as elastic buffers for HW
AFUs thought they were just reading from and writing to streams
Exactly the kind of setup suitable for running Map/Reduce jobs in hardware
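A sketch of what the SW producer half of such a FIFO could look like: head/tail indices live in the synchronization region (so index updates are snooped by hardware) and a pinned buffer provides the elastic storage. The struct layout, index placement, and one-cache-line slot size are assumptions; only the overall scheme follows the slide.

```c
/* Illustrative SW-side FIFO push over FSB shared memory. */
#include <stdint.h>
#include <string.h>

#define FIFO_SLOTS 256
#define SLOT_BYTES 64            /* one cache line per slot (assumed) */

struct fifo {
    volatile uint64_t *head;     /* in sync region, written by SW */
    volatile uint64_t *tail;     /* in sync region, written by HW */
    uint8_t *slots;              /* pinned elastic buffer */
};

/* Returns 1 on success, 0 if the FIFO is full (HW not draining). */
int fifo_push(struct fifo *f, const void *data)
{
    uint64_t h = *f->head, t = *f->tail;
    if (h - t == FIFO_SLOTS)
        return 0;
    memcpy(f->slots + (h % FIFO_SLOTS) * SLOT_BYTES, data, SLOT_BYTES);
    *f->head = h + 1;            /* snooped: HW sees the new full status */
    return 1;
}
```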
Page 11
Intel: AAL (Accelerator Abstraction Layer) FPGA Co-Processor API

Co-Processor Management API
Streaming API Inside FPGA (FIFOs)
AAL Pins Mailboxes / Manages System Memory
Virtual Memory Support via Workspaces
Accelerator Discovery Services
Accelerator Configuration Management Services

[Diagram: user application, AFU1 proxy, and the Accelerator Abstraction Layer in userspace above the AHM and system protocol layer in kernel space; workspaces WS1-WS3 in system memory; AFU1-AFU3 behind the FSB protocol layer in the FPGA; a package installer and library load AFU configurations]

Liu et al.: “A High Performance, Energy Efficient FPGA Accelerator Platform”, FPGA 2009
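The usage pattern implied by these services might look roughly like the following. To be clear, every identifier below is invented for illustration; these are not the actual Intel AAL types or signatures.

```c
/* Pseudo-API sketch of the AAL usage pattern only (discover an AFU,
 * allocate a pinned workspace, submit work); all names are invented. */
#include <stddef.h>
#include <stdint.h>

typedef struct afu afu_t;                                /* AFU proxy handle */

extern afu_t *aal_discover(const char *afu_id);          /* discovery service */
extern void  *aal_workspace_alloc(afu_t *a, size_t n);   /* pinned, HW-visible */
extern void   aal_submit(afu_t *a, void *ws, size_t n);  /* start the AFU */
extern void   aal_wait(afu_t *a);                        /* block until done */

void example(void)
{
    afu_t   *afu = aal_discover("example-streaming-afu"); /* hypothetical ID */
    uint8_t *ws  = aal_workspace_alloc(afu, 2u << 20);    /* 2 MB workspace */

    /* ... fill ws with input data ... */
    aal_submit(afu, ws, 2u << 20);
    aal_wait(afu);               /* results are now in the workspace */
}
```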
Page 12
Arches: MPI (Message Passing Interface) Symmetrical Peer Processing

Standard MPI Programming Model & API
Lightweight Message Passing Protocol Implementation
Focused on Embedded Systems
Explicit Rank-to-Node Binding Support

[Diagram: x86 MPI SW processes, a PPC MPI SW process, and an MPI HW "process" (HW MPE) communicating through an MPI FSB bridge and an MPI GT/GTX serial I/O bridge, with shared memory. Source: Arches Computing, 2009]
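Because the programming model is standard MPI, host code does not change: the hardware engine is simply another rank. In the sketch below the rank number bound to the HW MPE is an assumption, and that rank is emulated in software (an echo) so the example runs with any MPI implementation.

```c
/* Standard MPI code in which one rank stands in for the HW MPE.
 * HW_RANK and its echo behavior are assumptions; in the Arches
 * system, rank-to-node binding would map it to the hardware
 * "process" behind the MPI FSB bridge. Run with: mpirun -np 2 */
#include <mpi.h>
#include <stdio.h>

#define HW_RANK 1
#define N 256

int main(int argc, char **argv)
{
    int rank, data[N] = {0}, result[N];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Send input to the "hardware" rank and wait for results. */
        MPI_Send(data, N, MPI_INT, HW_RANK, 0, MPI_COMM_WORLD);
        MPI_Recv(result, N, MPI_INT, HW_RANK, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("result[0] = %d\n", result[0]);
    } else if (rank == HW_RANK) {
        /* SW stand-in for the HW MPE: echo the payload back. */
        MPI_Recv(result, N, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(result, N, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```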
Page 13
Convey: Shared Memory (Convey HC-1, 2008)

Socket Filler Module
– Bridge FPGA implements the FSB protocol
– Full snoop support
FPGA-Based Compute Accelerator
– Pre-defined vector instruction set
– Shared memory programming model
– ANSI C support
Accelerator Cache Memory
– 80 GB/s BW
– Snoop coherent with system memory
– Direct cache access CPU <-> FPGA

[Diagram: HC-1 coprocessor with eight MC LX155 memory-controller FPGAs. Source: Convey Computer, 2008]
Page 14
Latency: PCI Express & FSB (The Effects of DCA)

PCIe
– ~400 ns latency
– Gen1 x8 interface, 64-byte payloads
– Includes: PCIe device to chipset
– Does not include: chipset-to-CPU latency (add FSB latency)
FSB
– 110 ns latency
– 64-byte DCA transfers
– 200+ ns latency on cache-miss operations (fetch from memory)
DCA: 6x reduced latency (plausibly: the ~400 ns PCIe figure plus the chipset-to-CPU hop and a 200+ ns memory fetch approaches 600+ ns end to end, versus ~110 ns with DCA)
Results are for minimally loaded systems (i.e. a single master active)
Chipset can defer and/or retry transactions in loaded systems (both FSB and PCIe)
Typically less congestion on the FSB than on the PCIe interface
DCA = Direct Cache Access
Page 15
DCA with ARM ACP (ACP = Accelerator Coherency Port)

Xilinx ACP platform applications are customer confidential, hence ARM is used as the example here

[Diagram: ~8x reduced latency with the Accelerator Coherency Port. Source: ARM, 2008]
Page 16
System Memory Bandwidth (PCI Express on Virtex-5)

Virtex-5 LXT ML555 fitted in a Dell PowerEdge machine (Intel E5000P chipset)
– Goldhammer & Ayer: “Understanding Performance of PCI Express Systems”
Virtex-6 & Gen2 data is available (but not public yet)
– Rough data points: 2x the BW, similar latency
PCIe (Gen 1) x16
– Partner IP (not studied)
Link width   Transfer size   Read BW (GB/s)   Write BW (GB/s)
x1           8 KB            0.163            0.249
x1           16 KB           0.164            0.230
x1           32 KB           0.164            0.223
x4           8 KB            0.644            0.882
x4           16 KB           0.668            0.883
x4           32 KB           0.680            0.864
x8           8 KB            1.257            1.77
x8           16 KB           1.328            1.77
x8           32 KB           1.371            1.77
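For context (a calculation from known PCIe Gen1 parameters, not from the talk): Gen1 signals at 2.5 Gb/s per lane, or 250 MB/s after 8b/10b encoding, so an x8 link peaks at 2 GB/s per direction; the measured 1.371 GB/s read and 1.77 GB/s write at 32 KB therefore correspond to roughly 69% and 88% of the raw link rate.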
Page 17
System Memory Bandwidth (FSB on Virtex-5)

Intel Xeon 7300 chipset
FPL performance (FSB Protocol Layer = raw interface)
– FPL: primitives for data and control exchange
Higher-level protocols may reduce BW or require longer burst sizes to achieve the same BW
– AAL, MPI, other: higher-level protocols built on top of FPL
Block size   Read BW (GB/s)   Write BW (GB/s)
512 B        1.62             1.3
1 KB         2.47             not recorded
2 KB         3.36             not recorded
4 KB         3.97             not recorded
8 KB         4.54             not recorded
16 KB        4.66             3.4
64 KB        4.99             3.4
128 KB       5.03             3.4
Page 18
Bandwidth: PCI Express & FSB
PCIe Gen2 x8 (estimated performance data)
– Double the bandwidth of Gen1
– 2.66 GB/s read
– 3.54 GB/s write
FSB
– 1.7x the read BW of PCIe Gen2 x8
– Half duplex only
– 4.66 GB/s read
– 3.4 GB/s write (not fully optimized yet)
Data is for 16 KB transfers on both PCIe and FSB
PCIe Gen2 BW figures are estimates
Page 19
Summary
FPGA Mapped Into Shared System Memory
Raw FPL Interface Exposes Coherency Engine in FPGA
Multiple Programming Models Supported
– FIFO, Message Passing, Shared Memory
DCA Helps To Reduce Latency
Application Code Must Maximize Issue Rate