UCX enhanced NVMe-over-Fabric Target
in SPDK
Xinle Du, Dai Zhang, Minhu Wang
Tsinghua University Team
August 2020
Background: NVMe and NVMe-over-Fabric (NVMe-oF)
The NVM Express* (NVMe) block protocol is designed to access
local SSD block devices over PCIe in a fast and efficient way.
With a good local protocol in place, the natural next step is to
extend it to remote access. That extension is NVMe-oF.
The NVMe over Fabrics (NVMe-oF) protocol extends
the parallelism and efficiencies of the NVM Express*
(NVMe) block protocol over network fabrics such as
RDMA (iWARP, RoCE), InfiniBand™, Fibre Channel,
TCP and Intel® Omni-Path.
This allows us, for example, to read and write a *remote* SSD with low latency.
Background: SPDK
The Storage Performance Development Kit (SPDK) is a library that:
Implements the NVMe and NVMe-oF protocols.
Allows high-performance, scalable, user-mode storage applications to be written easily.
Achieves high performance through zero-copy, polled-mode, asynchronous and lockless operation.
We focus on the NVMe-oF part, which is the SPDK NVMe-oF target library (lib/nvmf).
The NVMe-oF specification is designed to allow for many different network fabrics, thus the
SPDK NVMe-oF target library implements a plugin system for easily adding new fabrics and
network transports in the future.
The API a new transport must implement is located in lib/nvmf/transport.h.
We can find an example NVMe-oF target application in app/nvmf_tgt.
Competition Task: What we need to do…
Add a new transport (a concept in the NVMe-oF spec), built on UCX, to
the SPDK NVMe-oF target library.
However, this is quite challenging, because we need to:
Understand the concepts in the NVMe-oF spec.
Know the workflow of SPDK, including the logic inside the NVMe-oF target library and
how it interoperates with the other parts of the project.
Be familiar with the tricks and design patterns used in production-level code to improve code reuse, performance and resource efficiency.
Understand the framework and interfaces of UCX well enough to fit it into SPDK.
Dive in… Concepts in NVMe-oF Spec and SPDK Impl
The NVMe-oF specification defines a lot of concepts which help us understand the implementation of the NVMe
NVMe-oF target: the storage server as a whole, i.e. the collection of subsystems and their namespaces
plus the transports and their network connections (struct spdk_nvmf_tgt).
NVMe-oF subsystem and namespace: access-control-related concepts; we don’t care about them
much here (struct spdk_nvmf_subsystem and struct spdk_nvmf_ns).
NVMe-oF transport: an abstraction for a network fabric. The NVMe-oF specification defines multiple
network transports (the "Fabrics" in NVMe over Fabrics), and SPDK has an extensible system for
adding new fabrics in the future (struct spdk_nvmf_transport).
Currently, SPDK implements an RDMA transport and a TCP transport (lib/nvmf/rdma.c, lib/nvmf/tcp.c).
NVMe-oF queue pair: defined by the NVMe-oF specification; queue pairs map 1:1 to network connections.
A queue pair plays a role similar to that of a socket in TCP (struct spdk_nvmf_qpair).
Poll group: an abstraction for a collection of network connections that can be polled as a unit.
SPDK checks for incoming data on groups of connections rather than checking each one individually
(e.g. with epoll), in order to improve efficiency and scale to large numbers of connections; poll groups
provide a generic abstraction for that. For example, the SPDK NVMe-oF RDMA transport allocates a single
RDMA completion queue per poll group. All new qpairs assigned to the poll group are given their own
RDMA send and receive queues, but share this common completion queue (struct spdk_nvmf_poll_group).
NVMe-oF listener: a network address at which the target will accept new connections
(struct spdk_nvmf_listener).
NVMe-oF host: an NVMe-oF NQN representing a host (initiator) system.
This is used for access control (struct spdk_nvmf_host).
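The poll-group idea described above can be illustrated with a small, self-contained C sketch. It does not use SPDK or RDMA: every name in it (demo_poll_group, demo_qpair, demo_cq and so on) is hypothetical and only models the behaviour in the text, namely that each qpair belongs to one group and that all qpairs in a group report into a single shared completion queue that is drained in one poll call.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_CQ_DEPTH 16   /* power of two, so index wrap-around stays consistent */

/* Hypothetical completion entry: which qpair finished which request. */
struct demo_completion {
    int qpair_id;
    int request_id;
};

/* Hypothetical shared completion queue, modelled as a ring buffer. */
struct demo_cq {
    struct demo_completion entries[DEMO_CQ_DEPTH];
    size_t head;   /* next entry to consume */
    size_t tail;   /* next free slot */
};

/* One completion queue per poll group, shared by all of its qpairs. */
struct demo_poll_group {
    struct demo_cq cq;
    int num_qpairs;
};

struct demo_qpair {
    int id;
    struct demo_poll_group *group;   /* set when the qpair joins a group */
    int completed;                   /* completions observed for this qpair */
};

void demo_poll_group_init(struct demo_poll_group *g)
{
    g->cq.head = 0;
    g->cq.tail = 0;
    g->num_qpairs = 0;
}

/* Assign a qpair to a poll group; from now on it reports into the group's CQ. */
void demo_poll_group_add(struct demo_poll_group *g, struct demo_qpair *qp, int id)
{
    qp->id = id;
    qp->group = g;
    qp->completed = 0;
    g->num_qpairs++;
}

/* Simulate the fabric finishing a request on one qpair.
 * Returns 0 on success, -1 if the shared completion queue is full. */
int demo_qpair_complete(struct demo_qpair *qp, int request_id)
{
    struct demo_cq *cq = &qp->group->cq;

    if (cq->tail - cq->head == DEMO_CQ_DEPTH) {
        return -1;
    }
    cq->entries[cq->tail % DEMO_CQ_DEPTH].qpair_id = qp->id;
    cq->entries[cq->tail % DEMO_CQ_DEPTH].request_id = request_id;
    cq->tail++;
    return 0;
}

/* Poll the whole group as a unit: drain the shared queue once and credit
 * each completion to its qpair. `qpairs` is indexed by qpair id.
 * Returns the number of completions processed. */
int demo_poll_group_poll(struct demo_poll_group *g, struct demo_qpair *qpairs, int n)
{
    int count = 0;

    while (g->cq.head != g->cq.tail) {
        struct demo_completion c = g->cq.entries[g->cq.head % DEMO_CQ_DEPTH];

        g->cq.head++;
        assert(c.qpair_id >= 0 && c.qpair_id < n);
        qpairs[c.qpair_id].completed++;
        count++;
    }
    return count;
}
```

In this sketch the cost of one poll depends on the number of completions that are ready, not on the number of connections in the group, which is the scaling argument made for poll groups above.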
Dive in… Workflow of the SPDK NVMe-oF target library (overview)
How NVMe-oF works, in terms of the terminology defined before.
The client and server in the NVMe-oF context are called the initiator and the target, respectively.
The SPDK NVMe-oF target uses the SPDK user-space, polled-mode NVMe driver to submit and
complete I/O requests to NVMe devices.
The host system uses the initiator to establish a connection and submit I/O requests to an NVMe
subsystem within an NVMe-oF target.
The SPDK NVMe-oF target and initiator use the InfiniBand/RDMA verbs API to access an
RDMA-capable NIC.
Dive in… Workflow of the SPDK NVMe-oF target library (interface)
How NVMe-oF works, in terms of the terminology defined before.
A user of the NVMe-oF target library begins by creating a target using spdk_nvmf_tgt_create(),
setting up a set of addresses on which to accept connections by calling spdk_nvmf_tgt_listen(),
then creating a subsystem using spdk_nvmf_subsystem_create(). Namespaces, which
represent bdevs, can be added to the subsystem with spdk_nvmf_subsystem_add_ns().
Once a subsystem exists and the target is listening on an address, new connections may be
accepted by polling spdk_nvmf_tgt_accept().
When spdk_nvmf_tgt_accept() detects a new connection, it constructs a new struct
spdk_nvmf_qpair object and calls the user-provided callback for each new qpair. The user must
assign the qpair to a poll group by calling spdk_nvmf_poll_group_add() so that its I/O requests are processed.
All I/O to a subsystem is driven by a poll group, which polls for incoming network I/O.
Poll groups are created by calling spdk_nvmf_poll_group_create(); they automatically
begin polling on the thread from which they were created.
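The accept-then-assign flow above can be sketched with a self-contained mock in C. This is not SPDK code and does not reproduce the real signatures of the functions named above: all mock_ names are hypothetical stand-ins, and the target is reduced to a few flags and a counter of pending connections. The sketch only shows the control flow: accept does nothing until the target is listening and a subsystem exists, then builds one qpair per new connection and hands it to a user callback, which assigns it to a poll group.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_MAX_QPAIRS 8

/* Hypothetical stand-in for a queue pair (one per network connection). */
struct mock_qpair {
    int id;
    int group_id;   /* -1 until the qpair is assigned to a poll group */
};

/* Hypothetical stand-in for a poll group. */
struct mock_poll_group {
    int id;
    struct mock_qpair *qpairs[MOCK_MAX_QPAIRS];
    int num_qpairs;
};

/* Hypothetical stand-in for a target; real targets hold far more state. */
struct mock_tgt {
    int listening;       /* set once a listen address has been configured */
    int has_subsystem;   /* set once a subsystem has been created */
    int pending;         /* connections waiting to be accepted */
    int next_qpair_id;
    struct mock_qpair qpairs[MOCK_MAX_QPAIRS];
};

/* User-provided callback invoked once for each newly accepted qpair. */
typedef void (*mock_new_qpair_fn)(struct mock_qpair *qpair, void *ctx);

/* Assign a qpair to a poll group. Returns 0 on success, -1 if the group is full. */
int mock_poll_group_add(struct mock_poll_group *g, struct mock_qpair *qp)
{
    if (g->num_qpairs == MOCK_MAX_QPAIRS) {
        return -1;
    }
    g->qpairs[g->num_qpairs++] = qp;
    qp->group_id = g->id;
    return 0;
}

/* Mock accept step: for every pending connection, construct a qpair and
 * call the user callback. Returns the number of qpairs created. */
int mock_tgt_accept(struct mock_tgt *tgt, mock_new_qpair_fn cb, void *ctx)
{
    int accepted = 0;

    /* Simplification: nothing is accepted before listen + subsystem setup. */
    if (!tgt->listening || !tgt->has_subsystem) {
        return 0;
    }
    while (tgt->pending > 0 && tgt->next_qpair_id < MOCK_MAX_QPAIRS) {
        struct mock_qpair *qp = &tgt->qpairs[tgt->next_qpair_id];

        qp->id = tgt->next_qpair_id++;
        qp->group_id = -1;
        tgt->pending--;
        cb(qp, ctx);
        accepted++;
    }
    return accepted;
}

/* Example user callback: put every new qpair into the poll group passed as ctx. */
void mock_on_new_qpair(struct mock_qpair *qpair, void *ctx)
{
    struct mock_poll_group *group = ctx;
    int rc = mock_poll_group_add(group, qpair);

    assert(rc == 0);
    (void)rc;
}
```

A real application would typically spread new qpairs over several poll groups, one per polling thread; the single-group callback here is only the shortest way to show the hand-off.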
Dive in… Transport Abstraction in SPDK
SPDK allows each transport to have its own separate implementation.
The API a new transport must implement is located in lib/nvmf/transport.h:
* Construct / destruct controller
* Get info / statistics
* Create / delete I/O queue
* Submit request
* Process completions
We focus on the transport implementations.
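A common way to build this kind of pluggable transport layer in C is a table of function pointers that each transport fills in and registers under a name. The sketch below is a minimal illustration of that pattern, not the actual interface in lib/nvmf/transport.h: the struct demo_transport_ops, the registry functions and the "ucx-stub" transport are all hypothetical, and the stub completes requests in memory instead of talking to UCX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical I/O request; `done` is set when the request completes. */
struct demo_request {
    int id;
    int done;
};

/* Hypothetical ops table: the operations a transport must provide. */
struct demo_transport_ops {
    const char *name;
    int  (*create_queue)(void **queue_out);
    void (*delete_queue)(void *queue);
    int  (*submit_request)(void *queue, struct demo_request *req);
    int  (*process_completions)(void *queue);   /* returns number completed */
};

#define DEMO_MAX_TRANSPORTS 4
static const struct demo_transport_ops *g_demo_transports[DEMO_MAX_TRANSPORTS];
static int g_demo_num_transports;

/* Look up a registered transport by name; returns NULL if not found. */
const struct demo_transport_ops *demo_transport_find(const char *name)
{
    for (int i = 0; i < g_demo_num_transports; i++) {
        if (strcmp(g_demo_transports[i]->name, name) == 0) {
            return g_demo_transports[i];
        }
    }
    return NULL;
}

/* Register a transport; fails on a duplicate name or a full registry. */
int demo_transport_register(const struct demo_transport_ops *ops)
{
    if (g_demo_num_transports == DEMO_MAX_TRANSPORTS ||
        demo_transport_find(ops->name) != NULL) {
        return -1;
    }
    g_demo_transports[g_demo_num_transports++] = ops;
    return 0;
}

/* --- A stub transport: requests are queued, then completed on poll. --- */
#define DEMO_QUEUE_DEPTH 8
struct demo_stub_queue {
    struct demo_request *pending[DEMO_QUEUE_DEPTH];
    int count;
};

static int stub_create_queue(void **queue_out)
{
    struct demo_stub_queue *q = calloc(1, sizeof(*q));

    if (q == NULL) {
        return -1;
    }
    *queue_out = q;
    return 0;
}

static void stub_delete_queue(void *queue)
{
    free(queue);
}

static int stub_submit_request(void *queue, struct demo_request *req)
{
    struct demo_stub_queue *q = queue;

    if (q->count == DEMO_QUEUE_DEPTH) {
        return -1;
    }
    req->done = 0;
    q->pending[q->count++] = req;
    return 0;
}

static int stub_process_completions(void *queue)
{
    struct demo_stub_queue *q = queue;
    int n = q->count;

    for (int i = 0; i < n; i++) {
        q->pending[i]->done = 1;   /* a real transport would poll the fabric */
    }
    q->count = 0;
    return n;
}

const struct demo_transport_ops demo_ucx_stub_ops = {
    .name = "ucx-stub",
    .create_queue = stub_create_queue,
    .delete_queue = stub_delete_queue,
    .submit_request = stub_submit_request,
    .process_completions = stub_process_completions,
};
```

The generic layer only ever calls through the ops table, so adding a transport means writing one more set of functions and registering it; the callers do not change.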
TBC: Where we are and what future work remains.
We now largely understand the transport part of the project
and its position within the whole project,
and have a preliminary sense of which parts we need to modify.
But there are still many other issues:
* The tricks and design patterns in production-level code
* Interoperation with the other parts of the project
* Resource management
* Multi-connection and multi-processing
* Error handling
* Performance optimizations
Thanks!