RDMAoE collaboration with KISTI Tuesday 6/7/2011 10:00am-11:00am (50B-2222) [email protected]

Rdma presentation-kisti-v2


Page 1: Rdma presentation-kisti-v2

RDMAoE collaboration with KISTI

Tuesday 6/7/2011 10:00am-11:00am (50B-2222)

[email protected]

Page 2: Rdma presentation-kisti-v2

RDMA for High Performance Data Movement

Network I/O operations are costly:

− CPU load

− Context switching

− Memory latency

Zero-copy networking

− NIC copies data directly to/from application memory

IB transport (HPC applications)

iWARP (TCP stack / TOE)

Page 3: Rdma presentation-kisti-v2

RDMA model

One-sided operations

Get/Put semantics

Send/receive

Direct data placement

RDMA Write

RDMA Read

Asynchronous

− Work Queue (send queue – receive queue)

− Completion Queue
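The asynchronous model above can be illustrated with a toy simulation. This is only a sketch of the idea (post a request, return immediately, collect the result later from a completion queue). It does not use any real RDMA library, and the names `ToyQueuePair`, `post_send`, `nic_progress`, and `poll_cq` are hypothetical stand-ins chosen for this illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WorkRequest:
    wr_id: int    # caller-chosen identifier, echoed back in the completion
    opcode: str   # label only, e.g. "RDMA_WRITE", "RDMA_READ", "SEND"
    length: int   # number of bytes the request would move


class ToyQueuePair:
    """Toy stand-in for a send queue plus a completion queue (no real NIC)."""

    def __init__(self):
        self.send_queue = deque()
        self.completion_queue = deque()

    def post_send(self, wr):
        # Posting only enqueues the request; it returns before any data moves.
        self.send_queue.append(wr)

    def nic_progress(self, max_requests=1):
        # Stands in for the NIC draining the send queue in the background.
        for _ in range(min(max_requests, len(self.send_queue))):
            wr = self.send_queue.popleft()
            self.completion_queue.append((wr.wr_id, "SUCCESS"))

    def poll_cq(self, max_entries=16):
        # Non-blocking: returns whatever completions are available right now.
        done = []
        while self.completion_queue and len(done) < max_entries:
            done.append(self.completion_queue.popleft())
        return done


qp = ToyQueuePair()
for i in range(3):
    qp.post_send(WorkRequest(wr_id=i, opcode="RDMA_WRITE", length=4096))

before = qp.poll_cq()            # nothing has completed yet
qp.nic_progress(max_requests=2)  # the simulated NIC processes two requests
after = qp.poll_cq()             # two completions, in posting order
```

The point of the sketch is that the application keeps running between posting and polling, which is why buffer ownership has to be tracked until the completion arrives.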

Page 4: Rdma presentation-kisti-v2

RDMA Programming Model

Objects

Queue Pairs (protection domain)

Send queue (RDMA write, RDMA read)

Receive queue

Modify state

Completion queue (poll)

Memory region (MR)

Functions (verbs)

− IB (libmlx4)

− iWARP (libcxgb3)

Librdmacm (connection setup)

Page 5: Rdma presentation-kisti-v2

RDMA/iWARP

Implicit RDMA support

Explicit RDMA support

iWARP

− Encapsulates RDMA traffic at a high level

− Uses the TCP stack

− Is it beneficial without TOE?

Page 6: Rdma presentation-kisti-v2

Alternative Approaches

RDMA over Converged Ethernet (RoCE)

− Lightweight RDMA transport over Ethernet

Widely deployed technology

Supports kernel bypass

OFED 1.5.1 supports RoCE

SoftRDMAs...

− SoftRoCE (OFED 1.5.1 supports SoftRoCE)

− SoftiWARP (new TCP kernel stack)

Page 7: Rdma presentation-kisti-v2

Hidden Cost

Memory Registration

− RDMA Read/Write

Connection Setup

− Librdmacm

→ Bulk data movement?

Asynchronous Model

− Buffer Management
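One common way to limit the registration cost listed above is to register a fixed set of buffers once and reuse them for every transfer. The sketch below only models that bookkeeping. `RegisteredBufferPool` and its methods are hypothetical names, and `_register` is a placeholder for the real (expensive) memory registration step, not a call into any RDMA library.

```python
class RegisteredBufferPool:
    """Toy pool that pays the registration cost once per buffer, then reuses it."""

    def __init__(self, count, size):
        self.registrations = 0
        self._free = [self._register(size) for _ in range(count)]

    def _register(self, size):
        # Placeholder for the expensive step (pinning pages, registering with the NIC).
        self.registrations += 1
        return bytearray(size)

    def acquire(self):
        # Hand out a buffer that is already registered.
        if not self._free:
            raise RuntimeError("no free buffer; wait for a completion first")
        return self._free.pop()

    def release(self, buf):
        # Called once the transfer using this buffer has completed.
        self._free.append(buf)


pool = RegisteredBufferPool(count=4, size=1 << 20)
for _ in range(100):
    buf = pool.acquire()
    # ... fill buf and post it for transfer ...
    pool.release(buf)
```

In this sketch, one hundred transfers cost four registrations instead of one hundred. The price is the buffer management burden: a buffer must not be released until its transfer has completed.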

Page 8: Rdma presentation-kisti-v2

Challenges in Bulk Transfer

Application Level Adjustments

Request Aggregation

− Small data files

− Is an FTP-like transfer mechanism appropriate for RDMA?

File System Overhead

− Asynchronous Operations

Connection Caching / Multiple Connections?
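Request aggregation for small files could look roughly like the following sketch, which groups files into batches so that each transfer request moves a reasonable amount of data. The function name `aggregate`, the file names, and the sizes are made up for illustration; the slides do not specify an aggregation policy.

```python
def aggregate(files, batch_bytes):
    """Group (name, size) pairs into batches of at most batch_bytes each.

    A single file larger than batch_bytes ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in files:
        # Close the current batch if adding this file would overflow it.
        if current and current_size + size > batch_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical file list: (name, size in bytes)
files = [("a.nc", 300), ("b.nc", 500), ("c.nc", 400), ("d.nc", 2000), ("e.nc", 100)]
batches = aggregate(files, batch_bytes=1000)
```

With these inputs, the first two files share a batch and the remaining three each get their own, because order is preserved and the oversized file cannot be combined with its neighbours.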

Page 9: Rdma presentation-kisti-v2

Local Area / Wide Area

IB RDMA designed for local area

− How does RDMA perform in Wide Area?

iWARP

− No promising results - Over TCP (with TOE?)

− SoftiWARP ???

RoCE

− Isolated traffic? / much less CPU usage

− SoftRoCE?
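One reason wide-area behaviour differs from local-area behaviour is the bandwidth-delay product: the amount of data that must be in flight to keep the link busy grows with round-trip time. The sketch below works through that arithmetic. The round-trip times (0.1 ms local, 100 ms wide area) and the 1 MiB buffer size are assumptions chosen for illustration, not measurements from these experiments.

```python
def bdp_bytes(bandwidth_bps, rtt_us):
    """Bandwidth-delay product in bytes (bandwidth in bit/s, RTT in microseconds)."""
    return bandwidth_bps * rtt_us // (8 * 1_000_000)


def buffers_in_flight(bandwidth_bps, rtt_us, buffer_bytes):
    """Number of buffers of buffer_bytes needed to cover the BDP (rounded up)."""
    bdp = bdp_bytes(bandwidth_bps, rtt_us)
    return -(-bdp // buffer_bytes)  # ceiling division


TEN_GE = 10_000_000_000  # 10 Gbit/s

lan_bdp = bdp_bytes(TEN_GE, rtt_us=100)        # assumed 0.1 ms RTT
wan_bdp = bdp_bytes(TEN_GE, rtt_us=100_000)    # assumed 100 ms RTT
wan_buffers = buffers_in_flight(TEN_GE, 100_000, buffer_bytes=1 << 20)

print(lan_bdp, wan_bdp, wan_buffers)
```

Under these assumptions, the wide-area case needs about a thousand times more outstanding data than the local case, which ties directly into memory registration and buffer management costs.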

Page 10: Rdma presentation-kisti-v2

GridFTP over RDMA

XIO driver for GridFTP

− Experimented using Chelsio cards (cxgb3)

− 10GE

− WAN testing in progress!

− Local area: 910 MBps – 1175 MBps

− Much better than GridFTP over TCP

Much less CPU load (1/2)
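To put the local-area figures in context, the short calculation below compares them with the nominal 10GE line rate. It assumes the reported numbers are megabytes per second (decimal) and ignores Ethernet and protocol overhead, so the percentages are rough upper-level estimates only.

```python
LINK_GBPS = 10                            # nominal 10GE line rate
LINK_MBYTES_PER_S = LINK_GBPS * 1000 / 8  # 1250.0 MB/s, no protocol overhead


def utilization(throughput_mbytes_per_s):
    """Fraction of the nominal line rate used by the given throughput."""
    return throughput_mbytes_per_s / LINK_MBYTES_PER_S


low = utilization(910)
high = utilization(1175)
print(f"{low:.1%} to {high:.1%} of nominal line rate")
```

Under that reading, the reported range corresponds to roughly 73% to 94% of the nominal link capacity.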

Page 11: Rdma presentation-kisti-v2

FTP100 – FTP over RDMA

Experimented with Mellanox cards

− Local area – 10GE

− iWARP did not perform well compared to TCP

− No significant gain

− RoCE tests in progress (have some initial results)

Limited by the disk performance

Mem2mem:

− Can already saturate the 10GE link

Page 12: Rdma presentation-kisti-v2

What is Next?

Experiments

RDMA model over WAN

SoftiWARP from IBM Zurich

− TCP kernel stack implementing/defining RDMA iverbs

SoftRoCE – OFED 1.5.2-rxe distribution

− Multiple connections?

Page 13: Rdma presentation-kisti-v2

Transfer Applications over RDMA

Simple Client/Server:

− Developing a prototype for transferring climate datasets using RDMA protocols

− Asynchronous memory management module

Application level tuning?

− Memory regions (max/min?)

− Multiple QPs
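One way to think about the "multiple QPs" tuning question is as a striping plan: the file is cut into fixed-size chunks and the chunks are assigned to queue pairs in round-robin order. The sketch below only computes such a plan. The function name `stripe`, the chunk size, and the round-robin policy are assumptions for illustration, not the prototype's actual design.

```python
def stripe(total_bytes, chunk_bytes, num_qps):
    """Assign (offset, length) chunks of a transfer to queue pairs round-robin."""
    plan = [[] for _ in range(num_qps)]
    offset, index = 0, 0
    while offset < total_bytes:
        # The last chunk may be shorter than chunk_bytes.
        length = min(chunk_bytes, total_bytes - offset)
        plan[index % num_qps].append((offset, length))
        offset += length
        index += 1
    return plan


# Tiny numbers so the plan is easy to read: 10 bytes, 4-byte chunks, 2 QPs.
plan = stripe(total_bytes=10, chunk_bytes=4, num_qps=2)
```

Here the first queue pair gets the chunks at offsets 0 and 8, and the second gets the chunk at offset 4. The chunk size also bounds the size of the memory regions each queue pair needs.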

Page 14: Rdma presentation-kisti-v2

Climate Analysis

Climate Applications are Data-Intensive

Shared data repository:

− Data files need to be downloaded for further processing and analysis

− Data retrieval is the main bottleneck

− Multiple clients (working as VM instances)

Cannot depend on HW support

SoftRoCE? SoftiWARP

Page 15: Rdma presentation-kisti-v2

What can we do for WAN testing?

Q&A?

→ https://sdm.lbl.gov/climate100/