13th ANNUAL WORKSHOP 2017

BUILDING A BLOCK STORAGE APPLICATION ON OFED - CHALLENGES

Subhojit Roy, Tej Parkash, Jeetendra R Sonar, Storage Engineering
[March 28th, 2017]

OpenFabrics Alliance Workshop 2017

AGENDA

• Introduction
  • Setting the Context (SVC as Storage Virtualizer)
  • SVC Software Architecture overview
• Challenges
  • Queue Pair states
  • RDMA disconnect behavior
  • RDMA connection management
  • Query and modify Queue Pair attributes
  • Large DMA memory allocation
  • Query Device List
• Conclusion


INTRODUCTION


SETTING THE CONTEXT (SVC AS STORAGE VIRTUALIZER)

[Figure: hosts on the host SAN connect through the SVC virtual SAN, where SVC nodes running the SVC storage application present VDisks 1 to 4, to RAID controllers and their controller LUNs on the device SAN]

• SVC pools heterogeneous storage and virtualizes it for the host

• iSER Target for Host

• iSER Initiator for Storage Controller (FLASH or HDD)

• Clustered over iSER for high availability

• Supports both RoCE and iWARP

• Supports 10/25/40/50/100G bandwidths


SVC ARCHITECTURE OVERVIEW

Architecture characteristics
• SVC application runs in user space
• iSER and iSCSI drivers run in kernel space
• Lockless architecture (per-CPU port handling)
• Polled-mode IO handling
• Supports RoCE and iWARP
• Vendor independent (Mellanox, Chelsio, Qlogic, Broadcom, Intel, etc.)
• Dependence on OFED kernel IB Verbs

[Figure: software stack, top to bottom: SVC Storage Virtualization Application; SCSI Initiator and SCSI Target; iSCSI Driver; iSER Initiator and iSER Target; OFED IB Verbs; RoCE Adapter and iWARP Adapter, each with its CQ, RQ, and SQ]


CHALLENGES


QUEUE PAIR STATES

Goal
• Control the number of retries and the retry timeout during a network outage

Actual behavior
• State transitions differ between RoCE and iWARP, e.g. iWARP does not support the SQD state

Expectation
• Transition the QP to the SQD state to modify QP attributes
• ib_modify_qp() must transition QP states as per the state diagram shown
• All state transitions must be supported by both RoCE and iWARP

Workaround
• No workaround found
• Exploring vendor-specific possibilities

[Figure: QP state diagram, referenced from the book "Linux Kernel Networking - Implementation and Theory"]
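The expectation on this slide can be made concrete with a small model. The sketch below is a simplified user-space Python model, not kernel code: it encodes the commonly documented InfiniBand QP state transitions as a table and checks a requested transition against it. Some transitions in the real verbs layer depend on QP type, which this model ignores. The `UNSUPPORTED_STATES` entry for iWARP reflects only the observation on this slide, and every name in the code (`QPState`, `is_allowed`, the transport strings) is hypothetical.

```python
from enum import Enum


class QPState(Enum):
    RESET = "RESET"
    INIT = "INIT"
    RTR = "RTR"   # ready to receive
    RTS = "RTS"   # ready to send
    SQD = "SQD"   # send queue drained
    SQE = "SQE"   # send queue error
    ERR = "ERR"


# Simplified table of transitions an application may request explicitly.
# RESET and ERR can be requested from any state, so is_allowed() handles
# them separately instead of listing them in every row.
EXPLICIT_TRANSITIONS = {
    QPState.RESET: {QPState.INIT},
    QPState.INIT: {QPState.INIT, QPState.RTR},
    QPState.RTR: {QPState.RTS},
    QPState.RTS: {QPState.RTS, QPState.SQD},
    QPState.SQD: {QPState.SQD, QPState.RTS},
    QPState.SQE: {QPState.RTS},
    QPState.ERR: set(),
}

# Assumption taken from the slide: iWARP does not support the SQD state.
UNSUPPORTED_STATES = {
    "roce": set(),
    "iwarp": {QPState.SQD},
}


def is_allowed(transport, current, target):
    """Return True if the model permits moving a QP from current to target."""
    if target in UNSUPPORTED_STATES[transport]:
        return False
    if target in (QPState.RESET, QPState.ERR):
        return True
    return target in EXPLICIT_TRANSITIONS[current]


if __name__ == "__main__":
    for transport in ("roce", "iwarp"):
        ok = is_allowed(transport, QPState.RTS, QPState.SQD)
        print(f"{transport}: RTS -> SQD allowed: {ok}")
```

In this model the RTS to SQD request succeeds for RoCE and is rejected for iWARP, which is the gap that blocks the "drain, modify attributes, resume" sequence the application wants to use on both transports.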


RDMA DISCONNECT BEHAVIOR

Goal/Observation
• A QP cannot be freed before the RDMA_CM_EVENT_DISCONNECTED event is received
• There is no control over the timeout period for this event

Actual behavior
• Link down on the peer system causes the DISCONNECT event to be received after a long delay
  • RoCE: ~100 s
  • iWARP: ~70 s
• There is no standard mechanism (verb) to control these timeouts

Expectation
• The RDMA disconnect event must exhibit a uniform timeout across RoCE and iWARP
• The timeout period for disconnect must be configurable

Workaround
• Evaluating vendor-specific mechanisms to tune the CM timeout

[Figure: SVC application connected to a peer host/target across the fabric]


RDMA CONNECTION MANAGEMENT

Goal
• Polled-mode data path and connection management

Current mechanism
• There is no mechanism to poll for CM events; all RDMA CM events are interrupt driven
• The current implementation defers CM events to Linux workqueues
• The application has no control over which CPU polls for CM events

Expectation
• Queues for CM event handling that the application can poll

Workaround
• Use of locks, which adds to IO latency

[Figure: SVC software stack: SVC Storage Virtualization Application; SCSI Initiator and SCSI Target; iSCSI Driver; iSER Initiator and iSER Target; OFED IB Verbs; RoCE and iWARP adapters with their CQ, RQ, and SQ queues]
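One way to picture the "queues for CM event handling" expectation is a per-CPU event queue that the data-path polling loop drains alongside its completion queues. The sketch below is a user-space Python model under that assumption. The class and method names (`CmEventQueues`, `post`, `poll`) are hypothetical and do not correspond to any existing OFED interface; the event names are used only as labels.

```python
from collections import deque


class CmEventQueues:
    """One FIFO of CM events per CPU, drained by that CPU's poll loop."""

    def __init__(self, num_cpus):
        self._queues = [deque() for _ in range(num_cpus)]

    def post(self, cpu, conn_id, event):
        # Producer side: stands in for the CM layer handing an event to
        # the CPU that owns the connection.
        self._queues[cpu].append((conn_id, event))

    def poll(self, cpu, budget=8):
        # Consumer side: called from the owning CPU's polling loop.
        # Returns at most `budget` events so CM work cannot starve IO.
        queue = self._queues[cpu]
        events = []
        while queue and len(events) < budget:
            events.append(queue.popleft())
        return events


if __name__ == "__main__":
    queues = CmEventQueues(num_cpus=2)
    queues.post(0, conn_id=1, event="RDMA_CM_EVENT_CONNECT_REQUEST")
    queues.post(0, conn_id=1, event="RDMA_CM_EVENT_DISCONNECTED")
    for conn_id, event in queues.poll(0):
        print(f"cpu 0 handles {event} for connection {conn_id}")
```

Because each queue has exactly one consuming CPU in this model, connection state touched by the handler stays private to that CPU, which is the property the per-CPU lockless design depends on. A real implementation would still need a safe hand-off from the event producer to the queue.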


LARGE DMA MEMORY ALLOCATION

Observation
• Allocation of large chunks of DMA-able memory during session establishment fails
• SVC reserves the majority of physical memory for caching during system initialization

Current mechanism
• IB Verbs use kmalloc() to allocate DMA-able memory for all the queues

Expectation
• IB Verbs must provide a means to allocate DMA-able memory from a pre-allocated memory pool, e.g. in the following
  • ib_alloc_cq()
  • ib_create_qp()

Workaround
• Modified the iWARP and RoCE drivers to use pre-allocated memory pools from SVC

Type   Elements   Element size (bytes)   Total size
SQ     2064       88                     ~177 KB
RQ     2064       32                     ~64 KB
CQ     2064       32                     ~64 KB

Single-connection memory requirement in the Linux OFED stack = ~297 KB
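The per-queue figures in the table can be reproduced with simple arithmetic, sketched below in Python. The element counts and sizes come from the table. The 4 KiB page size and the rounding of each request up to a power-of-two block of contiguous pages are assumptions, modeled on how the Linux page allocator hands out physically contiguous memory. Computed this way, the three rows sum to about 306 KiB, slightly more than the ~297 KB total quoted on the slide; the slide does not explain the difference.

```python
import math

PAGE_SIZE = 4096  # assumption: 4 KiB pages

# (elements, element size in bytes), taken from the table on the slide
QUEUES = {
    "SQ": (2064, 88),
    "RQ": (2064, 32),
    "CQ": (2064, 32),
}


def queue_bytes(elements, element_size):
    """Raw size of one queue in bytes."""
    return elements * element_size


def contiguous_pages(nbytes, page_size=PAGE_SIZE):
    """Pages needed if the request is rounded up to a power-of-two block."""
    pages = math.ceil(nbytes / page_size)
    return 1 << (pages - 1).bit_length()


if __name__ == "__main__":
    total = 0
    for name, (elements, size) in QUEUES.items():
        nbytes = queue_bytes(elements, size)
        total += nbytes
        print(f"{name}: {nbytes} bytes ({nbytes / 1024:.1f} KiB), "
              f"{contiguous_pages(nbytes)} contiguous pages")
    print(f"total: {total} bytes ({total / 1024:.1f} KiB)")
```

Under these assumptions the send queue alone needs a 64-page physically contiguous block for every connection. That is hard to find once most of physical memory has been reserved for caching, which is consistent with the allocation failures described above and with the request for pool-backed allocation.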


QUERY AND MODIFY QUEUE PAIR ATTRIBUTES

Goal/Observation
• Query and set QP parameters to control error recovery behavior

Actual behavior
• Unable to get and set QP parameters
• iWARP does not support modify/query of all parameters defined in struct ib_qp_attr, e.g. the rnr_retry field

Expectations
• ib_query_qp() and ib_modify_qp() should behave as documented
• If QP parameters are specific to iWARP or RoCE, they must be documented

Workaround
• Evaluating vendor-specific possibilities

[Figure: code excerpt highlighting retry_cnt, referenced from the Linux kernel]


QUERY DEVICE LIST

Observation
• There is no kernel verb to find the list of RDMA devices on the system until an RDMA session is established
• Per-device resources need to be allocated during kernel module initialization

Current mechanism
• The RDMA device is available only once a connection request has been established by the CM event handler

Expectation
• Need a verb equivalent to ibv_get_device_list() in kernel IB Verbs

Workaround
• Complicates per-port resource allocation during initialization


CONCLUSION

• Initial indications of IO performance compared to FC: excellent!
• iSER presents an opportunity for high-performance, flash-based Ethernet data centers
• Error recovery and handling is troublesome
• Mass adoption by storage vendors requires more work in OFED
  • IB Verbs is not completely protocol independent
  • Proper documentation of RoCE vs iWARP specific differences
  • Definitive resource allocation timeout values (R_A_TOV equivalent in FC)
• The same requirements apply to NVMe over Fabrics (NVMef)
• Seeking the right forum to address these requirements


THANK YOU


BACKUP SLIDES


FC V/S ISER LATENCY PERFORMANCE

IO size     FC latency (ms)   iSER latency (ms)
Read_4k     0.107             0.072
Write_4k    0.185             0.222
Read_32k    0.121             0.100
Write_32k   0.224             0.267
Read_64k    0.183             0.150
Write_64k   0.299             0.342
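The table is easier to read as relative differences. The short Python sketch below uses only the numbers from the table and reports the signed change of iSER latency relative to FC for each IO size; the function name `relative_change` is chosen here for illustration.

```python
# Latencies in milliseconds, copied from the table: (FC, iSER)
LATENCY_MS = {
    "Read_4k": (0.107, 0.072),
    "Write_4k": (0.185, 0.222),
    "Read_32k": (0.121, 0.100),
    "Write_32k": (0.224, 0.267),
    "Read_64k": (0.183, 0.150),
    "Write_64k": (0.299, 0.342),
}


def relative_change(fc, iser):
    """Signed change of iSER relative to FC; negative means iSER is faster."""
    return (iser - fc) / fc


if __name__ == "__main__":
    for io, (fc, iser) in LATENCY_MS.items():
        print(f"{io}: {relative_change(fc, iser) * 100:+.1f}% vs FC")
```

In these measurements iSER has lower latency than FC at every read size and higher latency than FC at every write size.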