
Steve Miller, Siddhartha Panda, & Sujoy Sen

SPDK, PMDK & Vtune™ Summit 2

Agenda

• Overview of RSD

• Composable Block Storage (CBS)

– Feature Requirements

– Why SPDK

– Performance

SPDK, PMDK & Vtune™ Summit 3

1. Source: Quantifying Datacenter Inefficiency: Making the Case for Composable Infrastructure, IDC, Document #US42318917, 2017.
2. Source: Disaggregated Server Architecture Drives Data Center Efficiency and Innovation, Shesha Krishnapura, Intel Fellow and Intel IT CTO, 2017.

Today’s Data Center Challenges vs. Intel® Rack Scale Design

Current Infrastructure

• Fixed ratio of compute, storage, and accelerator resources

• Expensive refresh & scale out

• Outdated software interface

• Cumbersome hardware provisioning process

Composable

Disaggregated

Interoperable

Data Center Agility, Built on Open Standards

Intel® RSD: “an industry-aligned architecture for composable, disaggregated infrastructure built on modern, open standards.”

Decrease Costs. Increase Agility.

• 50% efficiency in IT operations¹

• 35 people-hours per rack update²

• 45% utilization of equipment¹

SPDK, PMDK & Vtune™ Summit 4

Intel® RSD Key Attributes

• Composable: compose hardware resources “on the fly”

• Disaggregated: buy less up front and save money over time

• Interoperable: choose the best now without vendor lock-in

[Diagram: an Intel® RSD pod. Compute sleds, network, a storage drawer, and an accelerator drawer built from Vendor A, B, C, and D hardware are composed into nodes (Composed Node 1 and 2) by the Intel Pod Manager through an open standard API; orchestration software runs App 1, App 2, and App 3 with single-pane-of-glass management.]

OEMs with solutions based on Intel RSD

SPDK, PMDK & Vtune™ Summit

Benefits of Disaggregation and Composability

• Resource Pooling: maximize utilization of high-value assets and improve agility with dynamic composability

• Modular Refresh: independently scale and upgrade resources with better lifecycle management

• Operational Costs: lower Power Usage Effectiveness (PUE) and streamline operations and hardware management

SPDK, PMDK & Vtune™ Summit

Intel® RSD – Composability

Compose hardware resources “on the fly” for specific workloads.

• Pooled System Management Engine (PSME): firmware located in the Baseboard Management Controller (BMC) of each hardware component (compute, network, storage sled, accelerator sled)

• Pod Manager: dynamically composes resources into server nodes from inventory

[Diagram: PSMEs on each sled report to the Pod Manager, which composes attached resources into nodes consumed by orchestration software running App 1, App 2, and App 3.]

Intel® RSD software functions include:

• Resource Discovery: automatically discover and store hardware characteristics and location for all your resources

• Node Composition: dynamically compose compute, storage, and other resources to meet workload-specific demands

• Telemetry Data: monitor data center efficiency and detect, diagnose, and help predict resource failures

SPDK, PMDK & Vtune™ Summit

Intel® RSD – Storage Disaggregation

Disaggregation: save $$ over time with modular refresh.

[Diagram: a storage sled disaggregated from compute, network, and accelerator sleds, accessed over NVMe over Fabrics (Ethernet).]

Feature set for the storage target:

• High performance

• Low latency

• RAID 0 (striping across SSDs)

• Clones

• Thin provisioning

• Snapshots

• Quality of Service per logical volume (see the QoS sketch below)

• Efficient use of CPU and memory resources

• Standards-based management

• Simplicity of use
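To make “Quality of Service per logical volume” concrete, here is a minimal sketch against SPDK’s public bdev API. This is not CBS source code: the volume name "lvs0/vol0" and the limit values are illustrative, and the enum and function names follow recent SPDK headers, which may differ from the 18.07-era release used later in this deck.

    /* Minimal sketch: capping IOPS and bandwidth on one logical volume. */
    #include "spdk/bdev.h"

    static void
    qos_set_done(void *cb_arg, int status)
    {
        /* status == 0 means the limits were applied. */
    }

    static void
    limit_volume(void)
    {
        /* Entries left at 0 impose no limit of that type. */
        uint64_t limits[SPDK_BDEV_QOS_NUM_RATE_LIMIT_TYPES] = {0};
        struct spdk_bdev *bdev = spdk_bdev_get_by_name("lvs0/vol0");

        if (bdev == NULL) {
            return;
        }
        limits[SPDK_BDEV_QOS_RW_IOPS_RATE_LIMIT] = 100000;              /* 100K IOPS cap */
        limits[SPDK_BDEV_QOS_RW_BPS_RATE_LIMIT] = 500ULL * 1024 * 1024; /* ~500 MB/s cap */
        spdk_bdev_set_qos_rate_limits(bdev, limits, qos_set_done, NULL);
    }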

SPDK, PMDK & Vtune™ Summit

CBS – Key Features

• Disaggregated high-speed NVMe-oF target over RDMA (powered by Intel networking)

• Pooling of Intel SSDs for capacity and performance

• Volume management functionality: thin provisioning, snapshots, clones

• Data-at-rest security and multi-tenant volume-level security

• Smart volume-level Quality of Service (rate limiting, guaranteed bandwidth)

• Centralized storage manager at the pod level; policy-based provisioning and automatic deployment

• Online storage pool capacity expansion (both stripe-width expansion and drive concatenation)

• Proactive detection of drive failure and seamless drive replacement

• Scale and performance (network saturation up to 400Gb/s, 8K volume support)

• Support for industry-standard management interfaces from DMTF/SNIA (Redfish/Swordfish)

• Integration across flavors of cloud OSes through various plugins

• Event and telemetry support to trigger AI for pod-level management

SPDK, PMDK & Vtune™ Summit 10

Why SPDK?

• CBS leverages the modular bdev abstraction framework of SPDK.

– Most CBS functionality is developed as new SPDK bdev modules or uses the SPDK framework extensively (see the module skeleton after the stack diagram below).

• SPDK runs in user space, so a data-path application like CBS is easy to develop, debug, and maintain.

– No system-call overhead in the IO path.

• SPDK is designed as an asynchronous poll-mode driver with no interrupts (see the poller sketch below).

– Most kernel IO-path drivers are still limited to interrupt mode.

– Poll mode enables IOPS scalability and avoids complex MSI-X interrupt overhead across cores.
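A minimal sketch of that poll-mode pattern, assuming SPDK’s public threading API (my_poll and my_ctx are illustrative names, and header layout varies slightly across SPDK releases): rather than sleeping until an interrupt fires, the reactor thread on each core calls every registered poller in a loop.

    #include "spdk/thread.h"

    struct my_ctx {
        struct spdk_poller *poller;
        uint64_t polls;
    };

    /* Called repeatedly by the reactor on this core, with no interrupt
     * or system call in the path. */
    static int
    my_poll(void *arg)
    {
        struct my_ctx *ctx = arg;
        ctx->polls++;
        /* ... reap NVMe completion queues, advance IO state machines ... */
        return 0;
    }

    static void
    start_polling(struct my_ctx *ctx)
    {
        /* Period 0: run on every reactor iteration (busy polling). */
        ctx->poller = spdk_poller_register(my_poll, ctx, 0);
    }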

[Diagram: SPDK stack with the Pooled Volume bdev module. The SPDK NVMe PCIe driver over Intel SSDs exposes nvme dev(s); the SPDK NVMe bdev module wraps them as nvme bdev(s); the SPDK Pooled Volume bdev module aggregates them into pvol(s); SPDK BlobStore and the Logical Volume Store (LVS) carve out lvol(s); the SPDK bdev subsystem exposes them as namespaces Ns-1 through Ns-7 under NVMe-oF target Subsystem-0 and Subsystem-1.]
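Because most CBS functionality is delivered as bdev modules in the stack above, the following skeleton shows how a module slots into the framework. It is illustrative only (the "pvol" names are placeholders, not the actual pooled-volume source), and the registration macro’s argument list has changed across SPDK releases.

    #include "spdk/bdev_module.h"

    static int
    pvol_init(void)
    {
        /* Module startup: claim base NVMe bdevs and assemble pools. */
        return 0;
    }

    /* Every IO to a pooled volume arrives here; the real module would map
     * it onto the underlying NVMe bdevs before completing it. */
    static void
    pvol_submit_request(struct spdk_io_channel *ch, struct spdk_bdev_io *bdev_io)
    {
        spdk_bdev_io_complete(bdev_io, SPDK_BDEV_IO_STATUS_SUCCESS);
    }

    /* Referenced by each spdk_bdev the module registers; the required
     * destruct/get_io_channel/io_type_supported callbacks are omitted here. */
    static const struct spdk_bdev_fn_table pvol_fn_table = {
        .submit_request = pvol_submit_request,
    };

    static struct spdk_bdev_module pvol_module = {
        .name = "pvol",
        .module_init = pvol_init,
    };

    SPDK_BDEV_MODULE_REGISTER(pvol, &pvol_module)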

SPDK, PMDK & Vtune™ Summit 11

Why SPDK (cont.)

• Lockless architecture

– The SPDK framework encourages developers to minimize or eliminate locks, which reduces complexity and increases efficiency.

– Where a lock would otherwise be required, SPDK provides inter-core messaging infrastructure that minimizes processor cache thrashing and avoids traditionally expensive primitives such as mutexes and semaphores (see the message-passing sketch below).

• Core affinity

– Significantly lower synchronization overhead with per-core data structures.

– End-to-end IO request processing on the same core enables faster, more efficient IO processing.

– HPSP uses SPDK for efficient core distribution based on functionality and workload, enabling feature SLAs: for example, separate cores for telemetry collection and analysis, and dedicated cores for expansion, QoS management, and non-IO-path config commands.

• Error handling

– Using SPDK’s error propagation across the bdev hierarchy, error handling in each module is systematic and can be kept module-specific.
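The message-passing sketch referenced above, assuming SPDK’s public spdk_thread API (the counter names are illustrative): shared state is owned by exactly one SPDK thread, and other cores send it a message instead of taking a mutex.

    #include "spdk/thread.h"

    struct counter_ctx {
        struct spdk_thread *owner;  /* the only thread that touches 'value' */
        uint64_t value;
    };

    /* Runs on ctx->owner, so no lock is needed and the cache line for
     * 'value' never bounces between cores. */
    static void
    increment_on_owner(void *arg)
    {
        struct counter_ctx *ctx = arg;
        ctx->value++;
    }

    /* Callable from any core; the update is marshaled to the owner. */
    static void
    counter_increment(struct counter_ctx *ctx)
    {
        spdk_thread_send_msg(ctx->owner, increment_on_owner, ctx);
    }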

SPDK, PMDK & Vtune™ Summit 12

CBS with SPDK (Scale & Performance)

• CBS provides disaggregated storage for large datacenters, where IOPS and scale are critical.

• SPDK is designed for 8-10x the IOPS per core of the kernel NVMe and fabric drivers.

• SPDK scales easily to 8-10M IOPS (both reads and writes) with just a few cores.

• Intel Optane NVMe SSDs require a lightweight, low-latency stack; SPDK provides that infrastructure.

• Optimized tail latencies with per-core IO processing.

• With the SPDK volume management layer, the volume count scales well to 8K, and volume sizes can reach terabytes.

SPDK, PMDK & Vtune™ Summit 13

Performance Test Configuration

Hardware

Wolfpass storage sled

• Skylake dual-socket, 36 cores

• Mellanox ConnectX-5 100Gbps NIC

• 8x Intel P4500 SSDs

Software

• Fedora 26

• SPDK base version 18.07 with the new pooled volume module

• OFED 4.2 with RDMA support

[Diagram: SPDK stack with the new Pooled Volume bdev module, as shown earlier, with generic NVMe PCIe controller(s) in place of Intel SSDs: NVMe PCIe driver, NVMe bdev module, Pooled Volume bdev module, BlobStore/LVS, and bdev subsystem exposing lvols as namespaces under the NVMe-oF target subsystems.]

SPDK, PMDK & Vtune™ Summit 14

Pool Configuration - Performance

4-corner performance over a 100Gb/s RDMA network:

Workload            Measured Performance
4K RND 100% RD      3.14 MIOPS
4K RND 100% WR      2.3 MIOPS
4K RND 70/30 R/W    2.23 MIOPS
128K SEQ 100% RD    12.5 GBps
128K SEQ 100% WR    12.5 GBps
1 QD RND RD         101 us (latency)
1 QD RND WR         31 us (latency)

Note that 12.5 GBps x 8 bits/byte = 100 Gb/s, so the sequential workloads saturate the RDMA link rather than the SSDs.

SPDK, PMDK & Vtune™ Summit 15

CBS NVMe-oF SSD performance data

[Chart: 4K random read IOPS (in K) vs. queue depth (1 to 3072) for a single NVMe drive and pooled volumes of 1x, 2x, 4x, 6x, and 8x drives.]

[Chart: 4K random write IOPS (in K) vs. queue depth (1 to 3072), LVOL on PVOL (1 to 8 drives) vs. a single-drive LVOL.]

SPDK, PMDK & Vtune™ Summit 16

Pool Configuration - Performance

• Throughput scales as pool width scales.

• Saturates the 12.5 GBps (100Gb/s) link bandwidth with 4 Intel NVMe SSDs.

[Chart: 128K sequential read throughput (MB/s) vs. queue depth (1 to 1024) for a single NVMe SSD and pooled volumes of 1x to 8x drives.]

SPDK, PMDK & Vtune™ Summit 17

Pool Configuration - Performance

• Throughput scales as pool width scales.

• Saturates the 12.5 GBps (100Gb/s) link bandwidth with 7 Intel NVMe SSDs; sequential writes per drive are slower than reads, so more drives are needed to fill the link.

[Chart: 128K sequential write throughput (MB/s) vs. queue depth (1 to 64) for a single NVMe SSD and pooled volumes of 1x to 8x drives.]

SPDK, PMDK & Vtune™ Summit 18

Pool Configuration - Performance

• IOPS scale as pool width increases.

• Saturates at 3M IOPS with 5 Intel NVMe SSDs.

[Chart: 4K random read IOPS (in K) vs. queue depth (1 to 3072) for a single NVMe SSD and pooled volumes of 1x to 8x drives.]

SPDK, PMDK & Vtune™ Summit 19

Pool Configuration - Performance

• Latency decreases as pool width increases.

• At higher queue depths, latency drops for pools with a larger number of drives, since the outstanding IOs are spread across more devices.

[Chart: 4K random read latency (usec) for a single NVMe drive and pooled volumes of 1x to 8x drives, at queue depths from 1 to 3072.]

SPDK, PMDK & Vtune™ Summit 20

Pool Configuration - Performance

• IOPS scale as pool width increases.

• Random writes scale with an increasing number of drives.

• The variations seen are due to the underlying Intel NVMe SSDs.

[Chart: 4K random write IOPS (in K) vs. queue depth (1 to 3072) for a single NVMe SSD and pooled volumes of 1x to 8x drives.]

SPDK, PMDK & Vtune™ Summit 22

CBS with SPDK & Intel RSD Architecture

• The Intel CBS solution uses the SPDK stack as its base platform and leverages the Intel RSD architecture for management and cloud OS integrations.

• CBS software delivers the best of Intel components (Xeon processors, SSDs, networking, Optane persistent memory) with optimized smart algorithms implemented on the SPDK stack.

• Apart from its new features, CBS also leverages many existing SPDK features, such as the NVMe-oF target, volume management, and the poll-mode driver.

[Architecture diagram: CBS with SPDK and Intel RSD.

• Storage sled: the CNBS target app runs on Linux (sysfs and VFIO) and contains the SPDK RPC server, volume management (CRUD strategy), pool management (CRUD strategy), expansion manager, QoS manager, telemetry collection, RPC discovery service, and GAMI RPC server on top of an SPDK bdev stack (secure bdev, LVS/LVOL, RAID bdev, NVMe bdev, NVMe driver, OFED driver) over Intel SSDs (RSSDs) and an Intel Smart NIC, exposed through the NVMe-oF target subsystem.

• Management: the PSMS asset manager (storage service, CMDB, Redfish REST API) and ESMA (SPDK RPC client, agent registration, event dispatcher, SPDK target health monitor, AMC server, event server, local database) sit behind a Swordfish controller that offers REST Redfish/Swordfish volume APIs northbound (query APIs) and GAMI SPA lib south-bound APIs; collectd with an ESMA input plugin gathers telemetry, and the BMC provides OOB Redfish.

• Provisioning: a pool-aware storage provisioner (e.g. OpenStack Cinder: Cinder API, pool-aware scheduler, volume service, CNBS Cinder backend driver, os-brick extension) and a storage-class-based provisioner (e.g. K8s CSI: external provisioner/attacher, CNBS CSI control plugin, CSI node plugin, K8s persistent volume management via the K8s API service and gRPC) drive volume creation through the REST API.

• Compute sleds: connect over the fabric through the CNBS initiator config agent and BMC local discovery service; Intel SPA workflows cover volume attach and connection management, QoS management, pool expansion strategy, event management, and telemetry (monitor, analysis, report), with CLI/GUI access via REST Redfish/Swordfish.]

Powered by SPDK