Author: steen-larsen
Why such complexity?
- Programmed I/O (PIO)
  - Direct CPU interaction with I/O devices
  - Decades-old method
  - Extremely slow relative to CPU frequency
- CPU DMA engine ("DMA channels")
  - Limited by I/O device performance
  - I/O devices need to be compatible with the CPU DMA engine
- These methods are seen in embedded devices, but not in mainstream general-purpose CPUs.
- System security aspect: PCI & PCIe allow I/O devices to access memory
Super-Computers: I/O Processors with I/O Transaction Forwarding
- This approach leads to NGIO + Future I/O => InfiniBand & iWARP RDMA
- Connection context offloading (similar to TOE architectures)
- "30 minutes to print `Hello world`" http://www.mcs.anl.gov/papers/P1594A.pdf
iDMA Latency and Throughput Benefits
Total critical path latency = TxSW + TxHW + fiber + RxHW + RxSW
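The latency model above is just a sum of five path components; a minimal sketch of that budget arithmetic is below. The component values are placeholders for illustration, not measured data from this work.

```python
# Illustrative breakdown of the slide's critical-path latency model:
#   total = TxSW + TxHW + fiber + RxHW + RxSW
# The numbers below are invented placeholders, NOT measured values.
components_ns = {
    "TxSW": 900,   # transmit-side software (driver, descriptor handling)
    "TxHW": 400,   # transmit-side hardware (DMA, serialization)
    "fiber": 50,   # link propagation delay
    "RxHW": 400,   # receive-side hardware
    "RxSW": 900,   # receive-side software (interrupt, stack)
}

total_ns = sum(components_ns.values())
print(f"total critical-path latency: {total_ns} ns")
```

Note that the two software terms dominate such a budget, which is what motivates bypassing descriptor handling.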
Hot Potato
- After a survey of I/O transactions and acceleration functions, we chose to stay with looking at descriptors.
- Treat the payload data as a "hot potato"
Write-Combining Buffers
[Diagram: the CPU core fills 64 B write-combining buffers; each full buffer is flushed as a PCIe packet with a 24 B header and a 64 B payload. Myricom experiments.]
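The 24 B header + 64 B payload figure from the diagram fixes the wire efficiency of a write-combining flush; a minimal sketch of that arithmetic, assuming the slide's packet format:

```python
# PCIe wire efficiency for write-combining flushes, using the slide's
# 24 B header + 64 B payload packet format.
HEADER_BYTES = 24
PAYLOAD_BYTES = 64

def efficiency(payload: int, header: int = HEADER_BYTES) -> float:
    """Fraction of the PCIe packet that carries payload."""
    return payload / (payload + header)

full_flush = efficiency(PAYLOAD_BYTES)   # full 64 B buffer
partial_flush = efficiency(16)           # e.g., a 16 B partial flush
print(f"full: {full_flush:.1%}, partial: {partial_flush:.1%}")
```

A full 64 B flush carries about 73% payload, while partial flushes waste proportionally more of the link on header bytes, which is why filling the buffer before flushing matters.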
Measurements and Conclusions
- 1.5 us latency reduction in benchmark tests
- 8% latency reduction in a real memcached application
Legacy Video Streaming from Storage
[Diagram: the sender device (e.g., SSD) with disk read/write queues connects over PCIe to the CPU and DDR3 system memory, which holds a kernel buffer for disk data and a kernel buffer for network packet data; a second PCIe hop reaches the receiver device (e.g., NIC) with its TX/RX queues. Numbered steps (1)-(3) trace the data path through the kernel buffers.]
Details of a D2D NIC
[Diagram of a D2D-enabled NIC: PCIe PHY and TLP/DLP/LLP processing expose a PCIe BARx memory space; legacy Tx/Rx DMA engines and legacy NIC control registers; D2D additions: D2D Tx/Rx FSMs, D2D Tx/Rx queues with frame headers, D2D UOE control registers, optional D2D flow control registers, UOE / parse control, and UOE / packet-based priority control; transmit and receive packet queues feed the TX and RX PHYs.]
Details of a D2D SSD
[Diagram of a D2D-enabled SSD: PCIe PHY and TLP/DLP/LLP processing expose a PCIe BARx memory space; the SSD controller and buffers sit behind legacy control registers and modified Tx/Rx DMA engines (shown alongside the legacy Tx/Rx DMA); D2D additions: D2D Tx/Rx FSMs, D2D Tx/Rx queues, and optional D2D flow control registers.]
D2D Transmit FSM
- INIT state: sets Tx flow-control registers
  - Tx Address register
  - D2D Transmit Byte Count register
  - Data Rate and Granularity parameters
  - Tx and Rx base credits
- Parse state:
  - Map OS block addresses to SSD physical addresses
  - Enqueue in D2D Tx Queue
- Send state:
  - Fetch and forward SSD block data to the PCIe interface
- Wait state:
  - Waits until the next chunk needs to be sent
  - Depends on Data Rate and Granularity
- Check state:
  - Checks whether bytes sent < D2D Transmit Byte Count
[State diagram: Idle -> Init -> Parse -> Send -> Wait -> Check; True loops back to send more data, False returns to Idle.]
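The transmit FSM above can be sketched as a small state machine. This is a toy model of the slide's states, not the hardware implementation: the chunk size stands in for the Granularity parameter, and the block-address map is invented for illustration.

```python
from enum import Enum, auto

class TxState(Enum):
    IDLE = auto()
    INIT = auto()
    PARSE = auto()
    SEND = auto()
    WAIT = auto()
    CHECK = auto()

class D2DTxFSM:
    """Toy model of the D2D Transmit FSM from the slide."""
    def __init__(self, byte_count, chunk, block_map):
        self.byte_count = byte_count  # D2D Transmit Byte Count register
        self.chunk = chunk            # stands in for the Granularity parameter
        self.block_map = block_map    # OS block -> SSD physical address
        self.sent = 0
        self.state = TxState.IDLE
        self.tx_queue = []

    def run(self, os_blocks):
        self.state = TxState.INIT   # program Tx flow-control registers
        self.state = TxState.PARSE  # map OS blocks to SSD physical addresses
        self.tx_queue = [self.block_map[b] for b in os_blocks]
        while True:
            self.state = TxState.SEND   # forward a chunk to the PCIe interface
            self.sent += self.chunk
            self.state = TxState.WAIT   # pace by Data Rate / Granularity
            self.state = TxState.CHECK  # bytes sent < Transmit Byte Count?
            if not self.sent < self.byte_count:
                break                   # False branch: done
        self.state = TxState.IDLE
        return self.sent
```

For example, with a 256 B transfer and a 64 B chunk, the Send/Wait/Check loop runs four times before the Check state's False branch returns the machine to Idle.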
D2D Receive FSM (with UOE)
- INIT state: set D2D flow & UOE control registers
  - MAC source (SSD) and destination (NIC) addresses
  - Source and destination IP addresses
  - UDP source and destination port, length, and checksum
  - RTP version, sequence #, and timestamp
- Fetch state:
  - Monitor D2D Rx Queue for data
- Frame state:
  - Assign static fields: MAC & IP addresses
- Calc state:
  - Assign Ethernet length & CRC, IP length & checksum, UDP length & checksum, RTP timestamp & sequence #
  - Enqueue in Tx Packet Queue
- Send state:
  - Send to MAC layer for transmission
[State diagram: Idle -> Init -> Fetch -> Frame -> Calc -> Send]
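The Calc state's per-packet work is mostly header arithmetic. A minimal sketch of two of those computations, the RFC 1071 internet checksum and the UDP length field, is below; the port numbers and payload in the example are placeholders, and the UDP checksum is left zero (legal for IPv4) to keep the sketch short.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones'-complement checksum, as used for IP/UDP headers."""
    if len(data) % 2:
        data += b"\x00"                  # pad to an even byte count
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                    # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def udp_header(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """UDP header with the length field the Calc state fills in."""
    length = 8 + len(payload)             # 8-byte UDP header + payload
    return struct.pack("!HHHH", src_port, dst_port, length, 0)

hdr = udp_header(5004, 5004, b"\x00" * 100)  # placeholder ports/payload
```

Computing these fields in the NIC's FSM is what lets the host skip building packet headers in software.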
NetFPGA Logical Architecture
Xilinx Virtex-5 TX240T FPGA, 10GbE, and memories
[Diagram: the software interface layer (driver interfaces nf0-nf3, ioctl, DMA engine registers) sits above a shared PCIe interface layer; an AXI-Lite control & status interface and an AMBA AXI-Stream data-path interface (160 MHz, 64-bit) connect to the NetFPGA internal control and routing, which links four Ethernet MAC Tx/Rx queue pairs, the legacy DMA Tx/Rx queues, and the D2D Tx/Rx queues and FSMs, with reads/writes to the D2D control registers.]
D2D VoD streaming configuration
[Diagram: the Stimulus System (SS), the in-bound device, emulates an SSD stream into the System Under Test (SUT) over its PCIe interface; inside the SUT (PCIe bridge, CPU, system memory), incoming network data is treated as storage data to be written to the D2D-enabled NIC; the out-bound device sends a UDP VoD stream to the System Output Display (SOD).]
[The D2D-enabled NIC block diagram from "Details of a D2D NIC" is repeated here inside the SUT, in a second copy that highlights the UOE / packet-based priority control block.]
D2D Physical Configuration
[Photos of the test setup: the System Under Test (SUT) and the Stimulus System (SS) share a KVM with the System Output Display (SOD); a 1000 W SUT power supply with a spliced CPU supply lead lets a DMM (ammeter) measure the CPU + CPU VRM 12 V current; an FPGA programmer sits beside the SS.]
SUT Configuration
[Photo: SSD Linux boot drive; PCIe x8 bridge; emulated-SSD NetFPGA; D2D NIC NetFPGA with an extra fan added to the FPGA fans for cooling; Xilinx USB programmer for the FPGA and Chipscope; Intel 2500 4-core 3.1 GHz CPU; 2 GB 1333 MHz DDR3.]
Conclusions
- CPU-based descriptor DMA makes sense for off-loading slow I/O devices, where the additional overhead was small relative to overall latency, power, and throughput.
- This work proposes small additional changes in hardware and software that bypass this descriptor overhead.
- Depending on the application I/O transaction profile, the benefits in latency, throughput, and power are significant.
Non-Transparent Bridging (NTB)
[Diagram: Host A and Host B each sit behind a PCIe switch with attached devices; BAR translation in each switch maps a window so that a Host A memory write reaches Host B, and a Host B memory write reaches Host A.]
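The BAR translation in the diagram is an offset remap: a write landing inside the local BAR window is redirected to an address in the peer host's memory. A minimal sketch of that mapping is below; the base addresses and window size are invented for illustration.

```python
# Toy model of NTB BAR translation. All addresses below are invented
# placeholders, not values from the slide's hardware.
LOCAL_BAR_BASE = 0x9000_0000   # aperture exposed on host A
WINDOW_SIZE    = 0x0010_0000   # 1 MiB window
PEER_DRAM_BASE = 0x4000_0000   # translated target in host B's memory

def ntb_translate(addr: int) -> int:
    """Map an address inside host A's BAR window to host B's memory."""
    offset = addr - LOCAL_BAR_BASE
    if not 0 <= offset < WINDOW_SIZE:
        raise ValueError("address outside NTB aperture")
    return PEER_DRAM_BASE + offset

print(hex(ntb_translate(0x9000_1234)))
```

Because the remap is a fixed base-plus-offset, the switch can apply it per-TLP at line rate, with neither host aware of the other's address space.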
Basic video frame buffer format
- 4 bytes defined per pixel: byte offsets [0x0]-[0x2], with [0x3] as transparency
- Frame buffer mapped to linear system memory space
- FPGA writes in a bit-level-compatible format
- Verified with a PCIe trace analyzer
[Diagram: video screen with a target stream space (i.e., 640x480); pixel information is 4 B per pixel.]
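A linear frame buffer with 4 B per pixel means the FPGA can compute each pixel's byte address directly from its coordinates. A minimal sketch of that addressing, assuming the slide's 640x480 target stream space and its byte-[0x3] transparency convention (the color byte order is not specified on the slide):

```python
# Minimal model of the slide's linear frame buffer: 4 bytes per pixel,
# byte [0x3] of each pixel used as transparency, 640x480 stream space.
WIDTH, HEIGHT, BPP = 640, 480, 4
framebuffer = bytearray(WIDTH * HEIGHT * BPP)

def write_pixel(x: int, y: int, color3: bytes, alpha: int) -> int:
    """Write one pixel; returns the linear byte offset a D2D write targets."""
    offset = (y * WIDTH + x) * BPP
    framebuffer[offset:offset + 3] = color3   # bytes [0x0]-[0x2]
    framebuffer[offset + 3] = alpha           # byte [0x3]: transparency
    return offset

print(write_pixel(10, 2, b"\xff\x00\x00", 255))
```

Since the mapping is pure arithmetic on (x, y), the FPGA can stream decoded frames straight into system memory in a bit-level-compatible format with no host-side copy.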
D2D video stream timeline
[Sequence over time between SOD, SUT, and SS:]
1. VoD server configuration: the SOD configures the D2D stream configuration in the SUT
2. The SOD requests a UDP video stream on a specific UDP port from the SS (emulated SSD)
3. The SS begins streaming video to the SUT
4. The SUT UOE strips the packet header and passes the payload to the D2D Tx queue
5. The SUT UOE frames a new packet to the SOD
6. The SOD decodes video frames and displays them
7. Repeated, pipelined VoD packets
8. End of video stream, or SOD termination