Steen Larsen March 5 2015 Offloading of I/O Transactions in Current CPU Architectures





Outline

Introduction and motivation
Background
iDMA
Hot Potato
Device2Device
Conclusions

Growing I/O System Performance Discrepancy

Legacy I/O Transmit Operation

Why such complexity?

Programmed I/O (PIO)

Direct CPU interaction with I/O devices
Decades-old method
Extremely slow relative to CPU frequency

CPU DMA engine
“DMA channels”
Limited by I/O device performance
I/O devices need to be compatible with the CPU DMA engine
These methods are seen in embedded devices, but not in mainstream general-purpose CPUs.
System security aspect: PCI & PCIe allow I/O devices to access memory

Background
Direct integration
Supercomputing I/O forwarding
RDMA (Future I/O and NGIO)

Direct NIC Integration
Sun Niagara CPU [8 cores, 64 threads]
Dual 10GbE on-die
Released in 2007

Super-Computers I/O Processors with I/O Transaction Forwarding

This approach leads to NGIO + Future I/O => InfiniBand & iWARP RDMA
Connection context offloading (similar to TOE architectures)
“30 minutes to print `Hello world`” http://www.mcs.anl.gov/papers/P1594A.pdf

iDMA

iDMA Transaction Operations

iDMA transmit iDMA receive

iDMA Latency and Throughput Benefits

Total critical path latency = TxSW + TxHW + fiber + RxHW + RxSW
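The critical-path model above is a simple sum of five serial stages. A minimal sketch, with purely illustrative (made-up) nanosecond values for each stage:

```python
# Sketch of the slide's latency model:
#   total = TxSW + TxHW + fiber + RxHW + RxSW
# The example values below are placeholders, not measured data.
def critical_path_latency_ns(tx_sw, tx_hw, fiber, rx_hw, rx_sw):
    """Sum the five serial stages of a one-way I/O transaction."""
    return tx_sw + tx_hw + fiber + rx_hw + rx_sw

# Illustrative breakdown: software stages dominate the hardware stages.
total = critical_path_latency_ns(tx_sw=2000, tx_hw=500, fiber=100,
                                 rx_hw=500, rx_sw=2000)
```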

iDMA Summary

Hot Potato

After surveying I/O transactions and acceleration functions, we chose to keep focusing on descriptors.

Treat the payload data as a Hot Potato

Hot Potato Motivation: Legacy NIC Internal Design

Transmit I/O Receive I/O

Write-Combining Buffers

[Figure: a CPU core fills 64B write-combining buffers; each full buffer is posted to the device as a PCIe packet with a 24B header and 64B payload. Based on Myricom experiments.]
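Given the 24B PCIe packet header and 64B payload per posted write-combining buffer shown on the slide, the wire efficiency of this transfer style can be sketched as:

```python
# Sketch: fraction of PCIe wire bytes that carry payload when posting
# full 64B write-combining buffers, each with a 24B packet header
# (figures from the slide; other PCIe overheads are ignored here).
def pcie_wc_efficiency(payload_bytes=64, header_bytes=24):
    return payload_bytes / (payload_bytes + header_bytes)

# 64B payload + 24B header -> roughly 73% of wire bytes are payload.
efficiency = pcie_wc_efficiency()
```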

Hot Potato Device Design

Hot Potato Prototype

PCIe Packet Framing

To discuss this further, we need to dig into PCIe protocol details:

Typical ICMP Ping Sequence

1. Doorbell write
2. Descriptor read
3. ICMP packet (IP address “C0A80001”)
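The IP address in the trace appears in raw hexadecimal form. A small sketch of decoding it to the familiar dotted-quad notation:

```python
# Decode a 32-bit hex IP address, as seen in the PCIe trace,
# into dotted-quad form.
def hex_ip_to_dotted(hex_ip):
    value = int(hex_ip, 16)
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))

# "C0A80001" -> "192.168.0.1"
addr = hex_ip_to_dotted("C0A80001")
```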

Example Hot-Potato Loopback

Hot-Potato Latency and Throughput Benefit

Measurements and Conclusions

1.5 µs latency reduction in benchmark tests
8% latency reduction in a real memcached application

Device2Device (D2D)

Shift gears from CPU I/O transactions to inter-device communication.

Legacy Video Streaming from Storage

[Figure: data path over PCIe between a sender device (e.g., SSD) with disk read/write queues, the CPU, DDR3 system memory holding kernel buffers for disk data and for network packet data, and a receiver device (e.g., NIC) with TX/RX queues; the numbered steps (1–3) show data staged through system memory on its way from storage to network.]

Details of a D2D NIC

[Block diagram: D2D-enabled NIC. Alongside the legacy Tx/Rx DMA engines, transmit/receive packet queue(s), TX/RX PHYs, legacy NIC control registers, PCIe TLP/DLP/LLP processing, PCIe BARx memory space, and PCIe PHY, the design adds D2D Tx/Rx FSMs, D2D Tx/Rx queues with frame headers, UOE/parse and packet-based priority control, and D2D flow-control and UOE control registers (some blocks optional).]

Details of a D2D SSD

[Block diagram: D2D-enabled SSD. The SSD controller and buffers sit behind modified legacy Tx/Rx DMA engines, PCIe TLP/DLP/LLP processing, PCIe BARx memory space, control registers, and the PCIe PHY; the added blocks are the D2D Tx/Rx FSMs, D2D Tx/Rx queues, and D2D flow-control registers (some blocks optional).]

D2D Transmit FSM

INIT state: sets Tx flow-control registers
◦ Tx Address register
◦ D2D Transmit Byte Count register
◦ Data Rate and Granularity parameters
◦ Tx and Rx Base Credits

Parse state:
◦ Map OS block addresses to SSD physical addresses
◦ Enqueue in D2D Tx Queue

Send state:
◦ Fetch and forward SSD block data to the PCIe interface

Wait state:
◦ Waits until the next chunk needs to be sent
◦ Depends on Data Rate and Granularity

Check state:
◦ Checks whether bytes sent < D2D Transmit Byte Count

[State diagram: Idle → Init → Parse → Send → Wait → Check; if the check is true the FSM loops back to Send, if false it returns to Idle.]
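The transmit FSM above can be sketched as a small state machine. This is an illustrative model, not the hardware implementation; the byte count and chunk size stand in for the D2D Transmit Byte Count and Granularity registers:

```python
# Minimal sketch of the D2D transmit FSM:
# Idle -> Init -> Parse -> (Send -> Wait -> Check)* -> Idle,
# looping until the programmed byte count has been sent.
class D2DTxFSM:
    def __init__(self, total_bytes, chunk_bytes):
        self.total_bytes = total_bytes  # D2D Transmit Byte Count register
        self.chunk_bytes = chunk_bytes  # granularity parameter
        self.sent = 0
        self.state = "Idle"

    def run(self):
        trace = []
        self.state = "Init"   # set Tx flow-control registers
        trace.append(self.state)
        self.state = "Parse"  # map OS blocks to SSD physical addresses
        trace.append(self.state)
        while True:
            self.state = "Send"   # forward one chunk to the PCIe interface
            trace.append(self.state)
            self.sent += min(self.chunk_bytes, self.total_bytes - self.sent)
            self.state = "Wait"   # pace to the data-rate setting
            trace.append(self.state)
            self.state = "Check"  # bytes sent < transmit byte count?
            trace.append(self.state)
            if self.sent >= self.total_bytes:
                break
        self.state = "Idle"
        return trace
```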

D2D Receive FSM (with UOE)

INIT state: set D2D flow & UOE control registers
◦ MAC source (SSD) and destination (NIC) addresses
◦ Source and destination IP addresses
◦ UDP source and destination port, length, and checksum
◦ RTP version, sequence #, and timestamp

Fetch state:
◦ Monitor D2D Rx Queue for data

Frame state:
◦ Assign static fields: MAC & IP addresses

Calc state:
◦ Assign Ethernet length & CRC, IP length & checksum, UDP length & checksum, RTP timestamp & sequence #
◦ Enqueue in Tx Packet Queue

Send state:
◦ Send to MAC layer for transmission

[State diagram: Idle → Init → Fetch → Frame → Calc → Send.]
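One of the per-packet fields the Calc state fills in is the UDP length. A minimal sketch of that header computation, with illustrative port numbers and payload size (the checksum is left at zero here, which UDP over IPv4 permits):

```python
import struct

# Sketch of the UDP header fields the receive FSM's Calc state assigns
# for each outgoing packet: source port, destination port, length, checksum.
def build_udp_header(src_port, dst_port, payload_len, checksum=0):
    length = 8 + payload_len  # UDP length field covers header + payload
    return struct.pack("!HHHH", src_port, dst_port, length, checksum)

# Illustrative values: RTP-style port and a typical streaming payload size.
hdr = build_udp_header(src_port=5004, dst_port=5004, payload_len=1316)
```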

NetFPGA Logical Architecture

Xilinx Virtex-5 TX240T FPGA, 10GbE, and memories

[Block diagram: four Ethernet MAC Tx/Rx queue pairs, exposed to software as driver interfaces nf0–nf3 plus ioctl, connect through an AMBA AXI-Stream interface (160 MHz, 64-bit) and AXI-Lite control to the NetFPGA internal control and routing logic. That logic hosts the D2D Tx/Rx queues and FSMs alongside the legacy DMA Tx/Rx queues and DMA engine registers. A shared PCIe interface layer provides the software interface layer, with control & status and data path interfaces and read/write access to the D2D control registers.]

Sample Chipscope trace

NIC-to-Video

D2D VoD streaming configuration

[Figure: the Stimulus System (SS) is the in-bound device, emulating an SSD stream; the System Under Test (SUT) connects its CPU and system memory through a PCIe bridge to the D2D-enabled NIC, where incoming network data is treated as storage data to be written to the NIC; the System Output Display (SOD) receives the out-bound UDP VoD stream.]

[Block diagram: the SUT's D2D-enabled NIC, repeating the block structure from the earlier "Details of a D2D NIC" slide.]

D2D Physical Configuration

System Under Test (SUT)
Shared KVM
System Output Display (SOD)
1000 W SUT power supply
DMM measuring CPU + CPU VRM 12 V current
Stimulus System (SS)
FPGA programmer

SUT Configuration

SSD Linux boot drive
PCIe x8 bridge
Fan to add cooling to FPGA fans
Emulated SSD: NetFPGA
Spliced power supply to CPU for ammeter
D2D NIC: NetFPGA
Xilinx USB programmer for FPGA and Chipscope
Intel 2500 4-core 3.1 GHz CPU
2 GB 1333 MHz DDR3

D2D Latency (1500 byte packet)

D2D Power and Utilization benefit

D2D Measured Throughput (and Limitations)

Conclusions

CPU-based descriptor DMA made sense for off-loading slow I/O devices, when the additional overhead was small relative to overall latency, power, and throughput.

This work proposes small additional changes in hardware and software that bypass this descriptor overhead.

Depending on the application's I/O transaction profile, the benefits in latency, throughput, and power are significant.

BACKUP

Non-Transparent Bridging (NTB)

[Figure: Host A and Host B each connect their devices through a PCIe switch; each switch carries a BAR-translate window, so a memory write from Host A lands in Host B's memory and a memory write from Host B lands in Host A's memory.]
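The BAR translation an NTB performs can be sketched as simple window arithmetic: an address that hits the local BAR window is re-based into the peer's memory window. The window bases and size below are illustrative, not from the dissertation:

```python
# Sketch of NTB BAR address translation: a write hitting the local BAR
# window is redirected into a window of the peer host's memory.
def ntb_translate(addr, bar_base, bar_size, peer_base):
    offset = addr - bar_base
    if not (0 <= offset < bar_size):
        raise ValueError("address outside NTB BAR window")
    return peer_base + offset

# Illustrative windows: a 4 KiB BAR at 0x10000000 mapped to peer memory
# at 0x80000000, so local offset 0x40 lands at peer address 0x80000040.
peer_addr = ntb_translate(0x10000040, 0x10000000, 0x1000, 0x80000000)
```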

Basic video frame buffer format

4 bytes defined per pixel; byte [0x3] of each pixel is transparency.
Frame buffer mapped to linear system memory space.
FPGA writes in a bit-level-compatible format.
Verified with a PCIe trace analyzer.

[Figure: video screen with target stream space (i.e., 640x480); pixel information is 4 B per pixel at byte offsets [0x0]–[0x3].]
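With 4 bytes per pixel in a linear memory space, the byte offset of any pixel follows directly from its coordinates. A small sketch, assuming the 640x480 target stream space from the slide and byte [0x3] as transparency:

```python
# Sketch: byte offset of pixel (x, y) in the linear 4-bytes-per-pixel
# frame buffer described above; byte [0x3] of each pixel is transparency.
def pixel_offset(x, y, width=640, bytes_per_pixel=4):
    return (y * width + x) * bytes_per_pixel

def transparency_offset(x, y, width=640):
    return pixel_offset(x, y, width) + 0x3
```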

D2D video stream timeline

[Timeline across SOD, SUT, and SS:]

1. VoD server configuration: SOD configures the D2D stream configuration in the SUT.
2. SOD requests a UDP video stream on a specific UDP port from the SS (emulated SSD).
3. SS begins streaming video to the SUT.
4. SUT UOE strips the packet header and passes the payload to the D2D Tx queue.
5. SUT UOE frames a new packet to the SOD.
6. SOD decodes the video frames and displays them.
7. Steps 3–6 repeat for pipelined VoD packets.
8. End of video stream, or SOD termination.