Agenda
• Overview of Intel® RSD
• Composable Block Storage (CBS)
– Feature Requirements
– Why SPDK
– Performance
Today’s Data Center Challenges – Intel® Rack Scale Design
Current Infrastructure
• Fixed ratio of compute, storage, and accelerator resources
• Expensive refresh & scale out
• Outdated software interface
• Cumbersome hardware provisioning process
Composable
Disaggregated
Interoperable
Data Center Agility, Built on Open Standards
“an industry-aligned architecture for composable, disaggregated infrastructure built on modern, open standards.”
Decrease Costs, Increase Agility
• 50% efficiency of IT operations [1]
• 35 people-hours per rack update [2]
• 45% utilization of equipment [1]
[1] Quantifying Datacenter Inefficiency: Making the Case for Composable Infrastructure, IDC, Document #US42318917, 2017.
[2] Disaggregated Server Architecture Drives Data Center Efficiency and Innovation, Shesha Krishnapura, Intel Fellow and Intel IT CTO, 2017.
Intel® RSD Key Attributes
• Composable – compose hardware resources “on the fly”
• Disaggregated – buy less up front and save money over time
• Interoperable – choose the best now, without vendor lock-in
[Diagram: the Intel Pod Manager composes nodes (e.g., Composed Node 1 and 2 running App 1–3 under orchestration software) from compute, network, storage-drawer, and accelerator-drawer resources across Vendor A–D hardware, with single-pane-of-glass management through an open standard API.]
Intel® Rack Scale Design
OEMs with solutions based on Intel RSD
Benefits of Disaggregation and Composability
• Resource Pooling – maximize utilization of high-value assets and improve agility with dynamic composability
• Modular Refresh – independently scale and upgrade resources with better lifecycle management
• Operational Costs – lower Power Usage Effectiveness (PUE) and streamline operations and hardware management
Intel® RSD – Composability
Compose hardware resources “on the fly” for specific workloads.
Intel® RSD software functions include:
• Pod Manager – dynamically composes resources into server nodes from inventory
• Pooled System Management Engine (PSME) – firmware located in the Baseboard Management Controller (BMC) of each hardware component
• Resource Discovery – automatically discover and store hardware characteristics and location for all your resources (see the query sketch below)
• Node Composition – dynamically compose compute, storage, and other resources to meet workload-specific demands
• Telemetry Data – monitor data center efficiency and detect, diagnose, and help predict resource failures
[Diagram: orchestration software runs App 1–3 on composed nodes; the Pod Manager composes them from compute, network, storage, and accelerator sleds, each fronted by its PSME.]
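The Pod Manager’s discovery and composition functions are exposed over its Redfish REST API (the open standard API shown earlier). A minimal hedged sketch of querying the standard Redfish systems collection with libcurl; the host name, credentials, and TLS handling are illustrative placeholders, not part of the RSD material:

#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    /* /redfish/v1/Systems is the standard Redfish computer-system
     * collection; the host name below is a placeholder. */
    curl_easy_setopt(curl, CURLOPT_URL,
        "https://podm.example.com/redfish/v1/Systems");
    curl_easy_setopt(curl, CURLOPT_USERPWD, "admin:password"); /* placeholder */

    /* libcurl writes the JSON response body to stdout by default. */
    CURLcode rc = curl_easy_perform(curl);
    if (rc != CURLE_OK)
        fprintf(stderr, "request failed: %s\n", curl_easy_strerror(rc));

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}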
Intel® RSD – Storage Disaggregation
Save $$ over time with modular refresh.
[Diagram: a storage sled and an accelerator sled disaggregated from the compute and network sleds.]
NVMe over Fabrics (Ethernet) – Feature Set for the Storage Target
• High performance
• Low latency
• RAID 0 (striping across SSDs; see the mapping sketch after this list)
• Clones
• Thin provisioning
• Snapshots
• Quality of service per logical volume
• Efficient use of CPU and memory resources
• Standards-based management
• Simplicity of use
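As a conceptual illustration of the RAID 0 bullet (a sketch of striped address mapping in general, not CBS’s actual code), a logical block address can be mapped to a member SSD and a per-drive offset given a fixed strip size in blocks:

#include <stdint.h>

/* Conceptual RAID 0 mapping: logical blocks are striped across
 * member drives in fixed-size strips. All names are illustrative. */
struct raid0_addr {
    uint32_t drive;      /* which member SSD */
    uint64_t drive_lba;  /* block offset on that SSD */
};

static struct raid0_addr
raid0_map(uint64_t lba, uint32_t num_drives, uint64_t blocks_per_strip)
{
    uint64_t strip = lba / blocks_per_strip;  /* global strip index */
    uint64_t off   = lba % blocks_per_strip;  /* offset within the strip */
    struct raid0_addr a = {
        .drive     = (uint32_t)(strip % num_drives),
        .drive_lba = (strip / num_drives) * blocks_per_strip + off,
    };
    return a;
}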
CBS – Key Features
• Disaggregated, high-speed NVMe-oF target over RDMA (powered by Intel networking)
• Pooling of Intel SSDs for capacity and performance
• Volume management functionality: thin provisioning, snapshots, clones
• Data-at-rest security and multi-tenant volume-level security
• Smart volume-level quality of service: rate limiting and guaranteed bandwidth (see the conceptual sketch after this list)
• Centralized storage manager at the pod level, with policy-based provisioning and automatic deployment
• Online storage-pool capacity expansion (both stripe-width expansion and drive concatenation)
• Proactive detection of drive failures and seamless drive replacement
• Scale and performance (network saturation up to 400 Gb/s; support for 8K volumes)
• Support for the industry-standard DMTF/SNIA management interfaces (Redfish/Swordfish)
• Integration across flavors of cloud OSs through various plugins
• Event and telemetry support to trigger AI for pod-level management
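To illustrate the rate-limiting half of the QoS bullet above, here is a conceptual per-volume token-bucket sketch; the structure and functions are hypothetical and are not CBS’s implementation:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-volume token bucket, refilled once per polling
 * interval; illustrative only. */
struct vol_qos {
    uint64_t max_iops;  /* configured per-volume limit */
    uint64_t tokens;    /* I/Os still admissible this interval */
};

/* Called at the start of each 1 ms polling interval. */
static void qos_refill(struct vol_qos *q)
{
    q->tokens = q->max_iops / 1000;  /* budget per millisecond */
}

/* Called per I/O; if it returns false, queue the I/O for the next interval. */
static bool qos_admit(struct vol_qos *q)
{
    if (q->tokens == 0)
        return false;
    q->tokens--;
    return true;
}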
Why SPDK?
• CBS leverages SPDK’s modular bdev abstraction framework
– Most CBS functionality is developed as new SPDK bdev modules or uses the SPDK framework extensively.
• Because SPDK runs in user space, a data-path application like CBS is easy to develop, debug, and maintain
– No system-call overhead in the I/O path.
• SPDK is designed as an asynchronous poll-mode driver with no interrupts (see the poller sketch below)
– Most kernel I/O-path drivers are still limited to interrupt mode.
– Poll mode enables IOPS scalability and avoids the complexity of MSI-X interrupt overhead across cores.
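A minimal sketch of the poll-mode model: register a poller that the SPDK reactor calls in a tight loop instead of waiting for an interrupt. spdk_poller_register() is SPDK’s real API; the poll-function body is hypothetical, and the SPDK_POLLER_* return values assume a recent SPDK release:

#include "spdk/thread.h"

static struct spdk_poller *g_poller;

/* Hypothetical poll function: reap a completion queue instead of
 * taking an interrupt. Runs on every iteration of the reactor loop. */
static int
my_poll(void *ctx)
{
    int completions = 0;
    /* ... check a hardware completion queue, count what was reaped ... */
    return completions > 0 ? SPDK_POLLER_BUSY : SPDK_POLLER_IDLE;
}

/* Must run on an SPDK thread (e.g., from the app start callback). */
static void
start_polling(void *ctx)
{
    /* A period of 0 microseconds means "poll as fast as the loop spins". */
    g_poller = spdk_poller_register(my_poll, NULL, 0);
}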
[Diagram: SPDK stack – the NVMe-oF target exposes namespaces Ns-1…Ns-7 under Subsystem-0/1, backed by logical volumes (lvols) from the SPDK Logical Volume Store (LVS) on BlobStore; the LVS sits on pooled volumes (pvols) from the SPDK pooled volume bdev module, which aggregates NVMe bdevs from the SPDK NVMe bdev module and the SPDK NVMe PCIe driver over Intel SSDs, all within the SPDK bdev subsystem.]
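Because each layer in the stack above is a bdev, a consumer opens and reads a logical volume exactly as it would a raw NVMe bdev. A hedged sketch against SPDK’s public bdev API, assuming a running SPDK application thread; the bdev name "lvs0/lvol0" and the callbacks are illustrative:

#include "spdk/bdev.h"
#include "spdk/env.h"

static void
read_done(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
{
    /* Completion runs on the same core that submitted the I/O. */
    spdk_bdev_free_io(bdev_io);
}

static void
bdev_event_cb(enum spdk_bdev_event_type type, struct spdk_bdev *bdev,
              void *event_ctx)
{
    /* Handle bdev removal/resize events here. */
}

static void
read_from_lvol(void)
{
    struct spdk_bdev_desc *desc;
    struct spdk_io_channel *ch;
    void *buf;

    /* "lvs0/lvol0" is an illustrative lvol bdev name. */
    if (spdk_bdev_open_ext("lvs0/lvol0", false, bdev_event_cb, NULL, &desc) != 0)
        return;
    ch = spdk_bdev_get_io_channel(desc);
    buf = spdk_dma_zmalloc(4096, 4096, NULL);  /* DMA-safe buffer */
    spdk_bdev_read(desc, ch, buf, 0 /* offset */, 4096 /* length */,
                   read_done, NULL);
}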
Why SPDK? (cont.)
• Lockless architecture
– The SPDK framework pushes the developer to minimize or eliminate locks, which reduces complexity and increases efficiency.
– When coordination is required, SPDK provides good infrastructure for inter-core messaging (see the sketch after this list), which minimizes processor cache thrashing and avoids traditionally expensive locks such as mutexes and semaphores.
• Core affinity
– Per-core data structures mean significantly lower synchronization overhead.
– Processing an I/O request end to end on the same core enables faster, more efficient I/O processing.
– HPSP uses SPDK to distribute cores by functionality and workload, enabling per-feature SLAs: for example, HPSP uses separate cores for telemetry collection and analysis, and dedicated cores for expansion, QoS management, and non-I/O-path configuration commands.
• Error handling
– Using SPDK’s error propagation across the bdev hierarchy, error handling in each module is systematic and can be kept module-specific.
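The inter-core messaging bullet means handing work to the thread that owns a data structure rather than locking it. A hedged sketch using SPDK’s real spdk_thread_send_msg(); the payload type and update logic are hypothetical:

#include <stdlib.h>
#include "spdk/thread.h"

/* Hypothetical message payload. */
struct limit_update {
    uint64_t new_iops_limit;
};

/* Runs on the owning thread, so its per-core state needs no lock. */
static void
apply_limit(void *arg)
{
    struct limit_update *u = arg;
    /* ... update that core's private QoS state with u->new_iops_limit ... */
    free(u);
}

/* Instead of taking a mutex, hand the update to the owner thread. */
static void
send_limit_update(struct spdk_thread *owner, uint64_t limit)
{
    struct limit_update *u = calloc(1, sizeof(*u));
    if (!u)
        return;
    u->new_iops_limit = limit;
    spdk_thread_send_msg(owner, apply_limit, u);
}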
CBS with SPDK (Scale & Performance)
• CBS provides disaggregated storage for large data centers, where IOPS and scale are critical.
• SPDK is designed for 8-10x the IOPS per core of the kernel NVMe and fabric drivers.
• SPDK scales easily to 8-10M IOPS (both reads and writes) with just a few cores.
• Intel Optane NVMe SSDs require a lightweight, low-latency stack; SPDK provides that infrastructure.
• Per-core I/O processing optimizes tail latencies.
• With the SPDK volume management layer, the number of volumes scales well to 8K, and volume sizes can reach terabytes.
Performance Test Configuration
Hardware: Wolf Pass storage sled
• Skylake, dual socket, 36 cores
• Mellanox ConnectX-5 100 Gb/s NIC
• 8x Intel P4500 SSDs
Software
• Fedora 26
• SPDK base version 18.07 with the new pooled volume module
• OFED 4.2 with RDMA support
[Diagram: the SPDK stack with the new pooled volume bdev module – NVMe-oF target subsystems (Ns-1…Ns-7 under Subsystem-0/1) over lvols from the SPDK Logical Volume Store (LVS) on BlobStore, on pvols from the pooled volume bdev module, NVMe bdevs from the SPDK NVMe bdev module, and the SPDK NVMe PCIe driver over the NVMe PCIe controller(s).]
Pool Configuration – Performance
4-corner performance over a 100 Gb/s RDMA network:

Workload           | Measured Performance
-------------------|---------------------
4K RND 100% RD     | 3.14 MIOPS
4K RND 100% WR     | 2.3 MIOPS
4K RND 70/30 R/W   | 2.23 MIOPS
128K SEQ 100% RD   | 12.5 GB/s
128K SEQ 100% WR   | 12.5 GB/s
QD1 RND RD latency | 101 us
QD1 RND WR latency | 31 us
CBS NVMe-oF SSD Performance Data
[Chart: 4K random read IOPS (K) vs. queue depth (1-3072) – single NVMe drive vs. pooled volumes of 1x, 2x, 4x, 6x, and 8x drives.]
[Chart: 4K random write IOPS (K) vs. queue depth – LVOL on PVOL (1-8 drives) vs. LVOL on a single NVMe drive.]
Pool Configuration – Performance
Throughput scales as pool width scales: 128K sequential reads saturate 12.5 GB/s of bandwidth (the line rate of the 100 Gb/s link) with 4 Intel NVMe SSDs.
[Chart: 128K sequential read throughput (MB/s) vs. queue depth (1-1024) – single NVMe drive vs. pooled volumes of 1x-8x drives.]
Pool Configuration – Performance
Throughput scales as pool width scales: 128K sequential writes saturate 12.5 GB/s of bandwidth with 7 Intel NVMe SSDs.
[Chart: 128K sequential write throughput (MB/s) vs. queue depth (1-64) – single NVMe drive vs. pooled volumes of 1x-8x drives.]
Pool Configuration – Performance
IOPS scale as pool width increases: 4K random reads saturate at 3M IOPS with 5 Intel NVMe SSDs.
[Chart: 4K random read IOPS (K) vs. queue depth (1-3072) – single NVMe drive vs. pooled volumes of 1x-8x drives.]
Pool Configuration – Performance
Latency decreases as pool width increases: at higher queue depths, latency drops for pools with a larger number of drives.
[Chart: 4K random read latency (usec) at queue depths QD1-QD3072 – single NVMe drive vs. pooled volumes of 1x-8x drives.]
Pool Configuration – Performance
IOPS scale as pool width increases:
• Random writes scale with an increasing number of drives.
• The variations seen are due to the underlying Intel NVMe SSDs.
[Chart: 4K random write IOPS (K) vs. queue depth (1-3072) – single NVMe drive vs. pooled volumes of 1x-8x drives.]
CBS with SPDK & Intel RSD Architecture
• The Intel CBS solution uses the SPDK stack as its base platform and leverages the Intel RSD architecture for management and cloud OS integration.
• CBS software delivers the best of Intel components (Xeon processors, SSDs, networking, Optane persistent memory) with optimized smart algorithms implemented on the SPDK stack.
• Apart from its new features, CBS also leverages many existing SPDK features, such as the NVMe-oF target, volume management, and the poll-mode driver.
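The management plane in the diagram below exposes volume CRUD over REST Redfish/Swordfish. A hedged libcurl sketch of a volume-create request; the host and the StorageServices path follow the Swordfish model but are illustrative here, as are the credentials and payload:

#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
    struct curl_slist *hdrs = NULL;

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    /* Illustrative Swordfish volume collection; the exact path depends
     * on the service's StorageServices layout. */
    curl_easy_setopt(curl, CURLOPT_URL,
        "https://storage.example.com/redfish/v1/StorageServices/1/Volumes");
    hdrs = curl_slist_append(hdrs, "Content-Type: application/json");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
    /* CapacityBytes is a standard Redfish/Swordfish Volume property;
     * here an example request for a 1 TiB volume. */
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS,
        "{\"CapacityBytes\": 1099511627776}");
    curl_easy_setopt(curl, CURLOPT_USERPWD, "admin:password"); /* placeholder */

    CURLcode rc = curl_easy_perform(curl);
    if (rc != CURLE_OK)
        fprintf(stderr, "volume create failed: %s\n", curl_easy_strerror(rc));

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}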
[Diagram: CBS with SPDK & Intel RSD architecture. Storage sled: the CNBS target app runs the SPDK NVMe-oF target subsystem over an SPDK bdev stack (secure bdev, LVS/LVOL, RAID bdev, NVMe bdev, NVMe and OFED drivers) on Linux sysfs/VFIO, plus volume and pool management (CRUD strategy), an expansion manager, a QoS manager, telemetry collection, an SPDK RPC server, and a GAMI RPC server, fronting Intel SSDs and an Intel SmartNIC. Management: ESMA (PSMS asset manager, storage service, CMDB, SPDK RPC client, agent registration, event dispatcher and event server, SPDK target health monitor, AMC server, local database, Swordfish controller, collectd input plugin) manages the sled over gRPC/GAMI (query and south-bound APIs, RPC discovery service) and exposes REST Redfish/Swordfish volume APIs, CLI/GUI, and OOB Redfish via the BMC. Cloud integration on the compute sleds across the fabric: a CNBS Cinder backend driver for a pool-aware storage provisioner (e.g., OpenStack Cinder API, pool-aware scheduler, volume service, os-brick extension) and a CNBS CSI control plugin for a storage-class-based provisioner (e.g., K8s CSI external provisioner/attacher, CSI node plugin, K8s persistent volume management via the K8s API service), with a CNBS initiator config agent and a BMC local discovery service handling volume attach and connection management. Powered by SPDK.]