
Page 1: Breaking the Barrier of Monolithic SDS Architecture using NVMe-oF to Enable Next-gen Storage Technology Innovation

2019 Storage Developer Conference. © Intel Corporation. All Rights Reserved.


Yi Zou (Research Scientist), Arun Raghunath (Research Scientist), Intel Corp.

Page 2: Monolithic SDS Architecture Challenges


• Cannot handle the demands of modern elastic applications (e.g. serverless)
• Cannot scale up/down IOPS, increase throughput, or reduce latency for subsets of cluster data
• Cannot scale out without cluster-wide data rebalancing
• Storage disaggregation "tax" [SDC2018 talk]: relayed data placement adds latency and bandwidth overheads
• Deep coupling of block-layer storage functions with purpose-built distributed storage capabilities makes it hard to integrate next-gen storage media and protocols

SDS = Software Defined Storage

Page 3: Proposed Architecture Change


Decouple the SDS architecture into stateless and stateful components.

Stateless component: performs cluster-wide operations
• Chooses the data placement destination
• Manages replicas and erasure-coded chunks
• Monitors data integrity
• Performs failure recovery

Stateful component: actually stores data and metadata
• Responsible for durability; provides persistence
• Manages the block layout
• Provides object semantics
• Supports transactions

Standards-based NVMe-oF is used to communicate between the stateful and stateless components.

The split applies to hyper-converged as well as disaggregated deployments.

[Figure: monolithic SDS nodes contrasted with decoupled stateless/stateful component pairs connected over NVMe-oF, shown for both disaggregated and hyper-converged deployments]
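To make the split concrete, the sketch below renders the contract a stateless component might invoke on a stateful component over NVMe-oF. The class name StatefulStore and every signature here are illustrative assumptions, not the interface used in the PoC.

```cpp
// Illustrative sketch only: names and signatures are hypothetical,
// chosen to mirror the responsibilities listed above.
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// What a stateful component exposes: byte-addressed object I/O with
// durability and transactions. In this proposal, these calls travel
// over standards-based NVMe-oF rather than as in-process calls.
class StatefulStore {
public:
    virtual ~StatefulStore() = default;

    // Object read/write: byte-granular, not necessarily block aligned.
    virtual int read(const std::string& object_id, uint64_t offset,
                     uint64_t length, std::vector<uint8_t>& out) = 0;
    virtual int write(const std::string& object_id, uint64_t offset,
                      const std::vector<uint8_t>& data) = 0;

    // Remote asynchronous transaction: commit reported via callback.
    virtual int queue_transaction(uint64_t txn_id,
                                  std::function<void(int)> on_commit) = 0;
};

// The stateless component keeps only cluster-wide logic (placement,
// replica/erasure-code management, integrity monitoring, recovery) and
// holds no persistent state, so instances can come and go freely.
```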

Page 4: Unique Scale-out Vectors


• Spawn the stateless component of an SDS server on new machines to add more CPU resources as needed, or to increase the physical cache size for an SDS server under memory pressure
• Only the bottleneck SDS server needs to be scaled out
• Scaling is possible with no data rebalancing
• As load reduces, the extra stateless instances can be shut down to reduce cost

Elastic and fine-grained scale-out capability.

[Figure: one stateful storage target serving multiple stateless instances, which are spawned to scale up and shut down to scale down]

Page 5: Simplifies Integrating Next-gen Storage


Monolithic stack:
• Logic to leverage storage media is entwined with the remaining code
• Integrating new storage media and protocols gets complicated
• Ripple effects on unrelated code
• Media/protocol-specific optimizations are repeated per storage framework

Decoupled stack:
• New media can be integrated with no modifications to the stateless component
• The industry-standard communication interface between components simplifies integration
• Media-specific optimization can be done once and then re-used by multiple storage frameworks

[Figure: a monolithic stack with media/protocol-specific logic embedded throughout, vs. an unchanged stateless component paired with per-media stateful components (media 1, media 2)]

Page 6: Benefits to Scaling out More Heterogeneous Services


Services focus on their unique value:
• Offload stateful tasks to the remote target side
• Let the target manage what it is good at: blocks
• Drive services to be stateless and container-friendly

Our vision is to be able to scale out heterogeneous services simultaneously.

[Figure: Ceph, Swift, Cassandra, and Kafka, each with its own storage target LUN today, evolving into stateless Ceph', Swift', Cassandra', and Kafka' sharing a stateful storage target]

Page 7: Benefits to Recovery


Stateless recovery:
(1) Create a new stateless instance
(2) Connect the new stateless instance to the stateful component

Recovery of a stateful component:
(1) Temporarily route client requests to a replica's stateful component
(2) Create a new stateful component
(3) Connect the original stateless component to the new stateful component

In the monolithic architecture, stateless and stateful logic share one failure domain. Decoupling shrinks the failure domains, whether the components communicate over NVMe-oF (disaggregated) or PCIe (hyper-converged).

[Figure: monolithic failure domains vs. the numbered recovery steps above for the decoupled architecture]

Page 8: Benefits to Disaggregation


• No "relayed" data placement: latency reduction
• Minimized data transfers: bandwidth savings
• Improved data parallelism

Net effect: reduced bandwidth consumption, improved latency, TCO reduction.

[Figure: a stateless component receives data from the client and sends (ID, data) directly to the stateful components of Storage Target 1 and Storage Target 2; the peer stateless component receives the ID and metadata only]

Page 9: Ceph PoC Details


• Based on Ceph Luminous + SPDK v19.07
• Created a new Ceph ObjectStore backend that acts as an SPDK NVMe-oF initiator
  • The new backend carries ObjectStore APIs over NVMe-oF
  • Uses the SPDK RDMA transport for NVMe-oF
• Uses the SPDK NVMe-oF target
  • Created a new SPDK bdev that runs a standalone Ceph BlueStore
  • The SPDK bdev maps incoming requests to the remote Ceph BlueStore

Metric: Ceph cluster network rx/tx bytes.

[Figures: PoC Ceph architecture change; PoC setup]
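As a rough sketch of how such a backend might look from the Ceph side, the class below marshals each object operation for transport. The class, the ObjOp struct, and all signatures are hypothetical simplifications (real ObjectStore methods take collections, ghobject_t handles, bufferlists, and transactions), not the PoC code.

```cpp
// Hypothetical simplification of an initiator-side ObjectStore backend.
// Nothing here is the actual PoC code; it only illustrates marshalling
// object operations onto NVMe-oF commands.
#include <cstdint>
#include <string>
#include <vector>

struct ObjOp {                       // one serialized object operation
    uint8_t opcode;                  // object READ/WRITE (next slide)
    std::string object_id;
    uint64_t offset;
    uint64_t length;
    std::vector<uint8_t> data;       // payload for writes
};

class NvmeofObjectStoreBackend {
public:
    // Invoked by the ObjectStore layer; ships the op to the target.
    int write(const std::string& oid, uint64_t off,
              const std::vector<uint8_t>& data) {
        ObjOp op{/*opcode=*/0x82, oid, off, data.size(), data};
        return submit(op);           // rides an NVMe-oF write command
    }
private:
    // Serializes op as header + name + data and hands it to the SPDK
    // initiator; the wrapping of spdk_nvme_ns_cmd_writev() is sketched
    // on the next slide.
    int submit(const ObjOp& op);
};
```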

Page 10: NVMe-oF Protocol Modifications


The NVMe-oF protocol currently:
• Is a block transport protocol designed for block I/O queueing/execution
• Expects accesses to be block aligned
• Has no transaction support

The PoC modifies the NVMe-oF protocol in SPDK:
• Adds object awareness
• Enables a minimal set of object operations (native Ceph ObjectStore APIs)
• Enables support for remote asynchronous transactions
• Adds new READ/WRITE opcodes to distinguish object I/O from block-level READ/WRITE
• Wraps around the existing SPDK spdk_nvme_ns_cmd_writev()/spdk_nvme_ns_cmd_readv() calls

On the target side:
• Remaps the new READ/WRITE opcodes
• Decodes the payload header
• Passes the decoded object information to BlueStore

The PoC shows that NVMe-oF can be extended to be more powerful and flexible for next-gen storage architectures. We solicit feedback from industry on our approach and on the extensions required to generalize the NVMe-oF protocol modifications.
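A minimal sketch of what the wire marshalling could look like, assuming an invented payload-header layout and opcode values. spdk_nvme_ns_cmd_writev() and its SGL callback types are real SPDK v19.07 APIs; in the actual PoC, SPDK itself is modified so the NVMe command carries the new opcode, a step not shown here.

```cpp
// Sketch only: header layout and opcode values are assumptions.
// The PoC modifies SPDK so the command itself carries a new opcode;
// this shows only payload marshalling over the vectored submit path.
#include <cstdint>
#include <sys/uio.h>
extern "C" {
#include "spdk/nvme.h"
}

constexpr uint8_t OBJ_NVME_OPC_READ  = 0x81;  // invented value
constexpr uint8_t OBJ_NVME_OPC_WRITE = 0x82;  // invented value

// Header at the start of the payload; the target decodes it and passes
// the object information to BlueStore. Field layout is an assumption.
struct __attribute__((packed)) obj_payload_hdr {
    uint8_t  opcode;     // OBJ_NVME_OPC_READ / OBJ_NVME_OPC_WRITE
    uint8_t  flags;
    uint16_t name_len;   // object name bytes follow this header
    uint64_t txn_id;     // remote asynchronous transaction handle
    uint64_t offset;     // byte offset in the object (not block aligned)
    uint64_t length;     // byte length of the data
};

struct obj_io_ctx { iovec iov[2]; int cur; };  // [hdr+name], [data]

static void reset_sgl(void *arg, uint32_t /*sgl_offset*/) {
    static_cast<obj_io_ctx *>(arg)->cur = 0;
}
static int next_sge(void *arg, void **addr, uint32_t *len) {
    auto *ctx = static_cast<obj_io_ctx *>(arg);
    *addr = ctx->iov[ctx->cur].iov_base;
    *len  = static_cast<uint32_t>(ctx->iov[ctx->cur].iov_len);
    ctx->cur++;
    return 0;
}

// Ship one object write as an NVMe-oF write. payload_blocks must cover
// header + name + data rounded up to the namespace block size; the
// modified target keys off the header rather than the LBA.
static int submit_obj_write(spdk_nvme_ns *ns, spdk_nvme_qpair *qp,
                            obj_io_ctx *ctx, uint32_t payload_blocks,
                            spdk_nvme_cmd_cb done, void *done_arg) {
    return spdk_nvme_ns_cmd_writev(ns, qp, /*lba=*/0, payload_blocks,
                                   done, done_arg, /*io_flags=*/0,
                                   reset_sgl, next_sge);
}
```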

Page 11: Results


Setup: optimized Ceph (modifications to stock Ceph and stock SPDK) with a Chelsio RDMA NIC.
Test: rados put across a range of object sizes, small and large, from 3 KB to 20 MB, 100 iterations each.
Measurement: Ceph network rx/tx bytes.
The results validate the PoC.

* Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Intel is a trademark of Intel Corporation in the U.S. and/or other countries. Other names and brands may be claimed as the property of others. See Trademarks on intel.com for a full list of Intel trademarks, or the Trademarks & Brands Names Database.

Observations:
• Fabric traffic = client traffic × replication factor
• Cluster traffic is greatly reduced
• OSD-to-OSD traffic is nearly constant: metadata only in the PoC, vs. metadata + data in stock Ceph

Page 12: Bandwidth Reduction Results


• Derive the reduction in bandwidth consumption; estimate the 3-way replication case from the 2-way replication measurements
• In stock Ceph, the overhead comes from extra hops and grows with object size and replication factor
• Overall reduction is ~33% (2-way replication) and ~40% (3-way replication)
• Cost = client traffic × replication factor, as expected
• The disaggregation "tax" is reduced
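One consistent way to read these percentages, assuming the measurement counts all transfers beyond the client's initial send: for object size S and replication factor r, stock disaggregated Ceph relays the data over r-1 OSD-to-OSD replication hops plus r OSD-to-target hops, while the PoC places data on each target directly (so fabric traffic = client traffic × replication factor, matching the observation above).

```latex
\text{stock}(r) = \underbrace{(r-1)S}_{\text{OSD}\to\text{OSD}}
                + \underbrace{rS}_{\text{OSD}\to\text{target}} = (2r-1)S,
\qquad \text{PoC}(r) = rS
```
```latex
\text{reduction}(r) = 1 - \frac{rS}{(2r-1)S} = \frac{r-1}{2r-1}
\;\Rightarrow\; \tfrac{1}{3}\approx 33\%\ (r=2),\quad \tfrac{2}{5}=40\%\ (r=3)
```

This simple model reproduces the reported ~33% and ~40% figures, but the hop accounting is our inference, not stated on the slide.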

Page 13: Summary


• SDS stacks are monolithic, with deep coupling of block-layer storage functions and purpose-built distributed storage capabilities
• We propose decoupling SDS architectures into stateless and stateful components to enable independent scalability and create new scaling vectors
• We presented initial results from a hardware-RDMA-based Ceph PoC

Next steps:
• Demonstrate containerization of the SDS stack
• Characterize the CPU cost of the stateless and stateful components
• Quantify the latency reduction in various scenarios

Page 14: Please Rate This Session


Please take a moment to rate this session.

Your feedback matters to us.
