© 2018 ABC SYSTEMS AG. All Rights reserved. 11.4.18
“NVMe Takes It All, SCSI Has To Fall”
freely adapted from ABBA
Brave New Storage World
Alexander Ruebensaal
Lugano April 2018
ABC Systems AG
Design, Implementation, Support & Operation of optimized IT Infrastructures
- HA & HP - allowing for fail-safe Transportation of the Applications … since 1981
EDSFF Enterprise & DataCenter SFF
[ Ruler ]
NGSFF Next Generation SFF
[ M.3 ]
Six Years After …
Non-Volatile Memory Express NVMe SSD
NVMe is an innovative host controller interface for using SSDs natively over PCIe. Its main benefit is acceleration through parallelism, which reduces I/O overhead and latency.
M.2
PCIe
U.2
2.5’’
EDSFF Enterprise & DataCenter SFF
[ Ruler ]
Why NGSFF and EDSFF?
NGSFF Next Generation SFF
[ M.3 ]
U.2 [ 2.5" ]
• Less complicated chassis
• Reduced component cost per SSD
• Simple hot swap with high density capabilities
• No costly drive cages with failure points
• No cables to SSDs
• Eliminate the backplane with cooling holes
• Simplified thermal implementation
NVMe SSD against SAS SSD …
10x NVMe, 1U
24x NVMe, 2U
48x NVMe, 2U
NVMe Storage Protocol is designed to take full Advantage of Flash
NVMe supports 64K commands per queue (SAS 256, SATA 32) and up to 64K queues. These queues are designed such that I/O commands and responses to those commands operate on the same processor core and can take advantage of the parallel processing capabilities of multi-core processors. Each application or thread can have its own independent queue, so no I/O locking is required. NVMe also supports MSI-X and interrupt steering, which prevents bottlenecking at the CPU level and enables massive scalability as systems expand.
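The queue figures above can be turned into a rough upper bound on outstanding commands per device. The following is only a sketch: it takes "64K" as 65,536 and assumes a single command queue for SAS and SATA, which the slide does not state explicitly. The names `QUEUE_LIMITS` and `max_outstanding_commands` are made up for illustration.

```python
# Queue limits as quoted on the slide; "64K" is taken as 65,536.
# Assumption: SAS and SATA expose a single command queue per device.
QUEUE_LIMITS = {
    "NVMe": {"queues": 64 * 1024, "commands_per_queue": 64 * 1024},
    "SAS": {"queues": 1, "commands_per_queue": 256},
    "SATA": {"queues": 1, "commands_per_queue": 32},
}


def max_outstanding_commands(interface: str) -> int:
    """Theoretical upper bound: number of queues times queue depth."""
    limits = QUEUE_LIMITS[interface]
    return limits["queues"] * limits["commands_per_queue"]


if __name__ == "__main__":
    for name in QUEUE_LIMITS:
        print(f"{name}: up to {max_outstanding_commands(name):,} outstanding commands")
```

These are protocol limits, not what a given drive or driver actually configures; real devices typically expose far fewer queues.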
NVMe has a streamlined and simple command set that uses less than half the number of CPU instructions to process an I/O request that SAS or SATA does, providing higher IOPS per CPU instruction cycle and lower I/O latency in the host software stack. NVMe also supports enterprise features such as reservations and client features such as power management, extending the improved efficiency beyond just I/O.
Text & Graphics from http://nvmexpress.org
NVMe uses CPU Lanes directly
CPU – Bus – NVMe Flash
or
CPU – Bus – FC-HBA – Switch(es) – FC-HBA – RAID Ctrl – SAS Enclosure – Disk
8x SAS SSDs already saturate the SAS bus …
-> Effect is reduced to Electronics vs Mechanics!
1 NVMe uses 4 CPU Lanes
Broadwell 40 Lanes
Skylake 48 Lanes
EPYC 128 Lanes
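As a quick sanity check, the lane counts above translate into a maximum number of directly attached NVMe drives per CPU. This is a minimal sketch using only the slide's figures (4 lanes per drive); the `reserved_lanes` parameter is a hypothetical addition for lanes taken by NICs or other adapters.

```python
LANES_PER_NVME = 4  # from the slide: 1 NVMe uses 4 CPU lanes

# PCIe lanes per CPU as listed on the slide.
CPU_PCIE_LANES = {"Broadwell": 40, "Skylake": 48, "EPYC": 128}


def max_direct_nvme(cpu: str, reserved_lanes: int = 0) -> int:
    """Number of x4 NVMe drives that fit into the CPU's remaining lanes."""
    usable = max(CPU_PCIE_LANES[cpu] - reserved_lanes, 0)
    return usable // LANES_PER_NVME


if __name__ == "__main__":
    for cpu in CPU_PCIE_LANES:
        print(f"{cpu}: up to {max_direct_nvme(cpu)} NVMe drives without a PCIe switch")
```

With no lanes reserved this gives 10, 12 and 32 drives respectively; every network adapter reduces that number.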
Intel Xeon Scalable Processors –F (With OmniPath)
• Single on-package OmniPath interface
• Incremental to existing 48 PCIe Lanes
• Single cable connection to QSFP I/O module
• Same socket for Skylake & Skylake-F processors
Are the NVMe too strong, are the CPU too weak …
Lanes: 1 CPU 48 – 1 NVMe 4
AMD EPYC CPU
8~32 “Zen” Cores
TDP 120W~180W
8 Memory Channels
Up to 2TB per CPU
Dedicated Security Engine
128 Lanes of High Bandwidth I/O
How to use them?
Device        Lanes                                           Device rate / Lane bandwidth    Utilization
1x NVMe       4 Lanes                                         < 3’500MB/s / 3’938MB/s         89%
1x SATA SSD   1 Lane, max. 32 directly supported by CPU       < 540MB/s / 985MB/s             55%
1x 100GbE     16 Lanes                                        < 12’500MB/s / 15’754MB/s       79%
2x 25GbE      8 Lanes                                         < 6’250MB/s / 7’877MB/s         79%
2x 10GbE      8 Lanes [standard Interface for comparison]     < 2’500MB/s / 7’877MB/s         32%
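The percentages above can be reproduced from the lane counts. The sketch below assumes the lane bandwidths are PCIe 3.0 figures (8 GT/s per lane with 128b/130b encoding, roughly 985 MB/s per lane), which matches the numbers on the slide. The function names are illustrative only.

```python
PCIE3_TRANSFERS_PER_S = 8e9   # assumption: PCIe 3.0, 8 GT/s per lane
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding


def lane_bandwidth_mb_s(lanes: int) -> float:
    """Usable bandwidth of a PCIe 3.0 link in MB/s (1 MB = 10^6 bytes)."""
    return lanes * PCIE3_TRANSFERS_PER_S * ENCODING_EFFICIENCY / 8 / 1e6


def utilization_percent(device_mb_s: float, lanes: int) -> int:
    """Share of the link bandwidth that the device can actually fill."""
    return round(100 * device_mb_s / lane_bandwidth_mb_s(lanes))


# (device rate in MB/s, lanes) as listed on the slide
DEVICES = {
    "1x NVMe": (3500, 4),
    "1x SATA SSD": (540, 1),
    "1x 100GbE": (12500, 16),
    "2x 25GbE": (6250, 8),
    "2x 10GbE": (2500, 8),
}

if __name__ == "__main__":
    for name, (rate, lanes) in DEVICES.items():
        print(f"{name}: {utilization_percent(rate, lanes)}% of {lanes} lane(s)")
```

Low utilization means lanes are occupied but not filled, which is the waste the slide warns about.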
Storage Centric Solution Design – Don’t waste Lanes!
Balanced Design for Multi-Socket Server Solutions, regardless of CPU Vendor, is a huge Optimization Challenge!
Lane budget per design target (Assumption: 112 net Lanes available of 128)

Component parameters (K IOPS Random, Reads / Writes; GB/s Sequential, Reads / Writes):
NVMe SSD    4 Lanes    11TB    800 / 95 K IOPS    3.35 / 2.4 GB/s
SATA SSD    1 Lane     8TB     93 / 74 K IOPS     0.54 / 0.52 GB/s    (max. 32/CPU)
NVDIMM      32GB
100GbE      16 Lanes
2x 25GbE    8 Lanes

Design target    Lanes    TB     Mio. IOPS    GB/s    Gbps
CAPACITY         112      192    2.9          3       200
IOPS             112      176    29.8         53.6    200
THROUGHPUT       112      132    9.6          40.2    300

CAPACITY: 32 Lanes SATA SSD (192TB, 2.9 Mio. IOPS, 3 GB/s), 32 Lanes 100GbE (200 Gbps)
IOPS: 64 Lanes NVMe SSD (176TB, 12.8 Mio. IOPS, 53.6 GB/s), NVDIMM (17 Mio. IOPS), 32 Lanes 100GbE (200 Gbps)
THROUGHPUT: 48 Lanes NVMe SSD (132TB, 9.6 Mio. IOPS, 40.2 GB/s), 48 Lanes 100GbE (300 Gbps)
Remaining Lanes: PCIe Slots, 2x 25GbE (8 Lanes)

Real-world performance is, of course, application, workload and file system dependent.
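The NVMe figures in this table scale linearly with the lanes assigned to drives. A minimal sketch of that arithmetic, using the per-drive values from the NVMe SSD row (4 lanes, 11TB, 800K random read IOPS, 3.35 GB/s sequential read); the `Drive` class and `aggregate` function are illustrative names, and linear scaling is an idealization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    lanes: int
    capacity_tb: float
    read_kiops: float   # random reads, in thousands of IOPS
    read_gb_s: float    # sequential reads, in GB/s


# Per-drive values from the NVMe SSD row of the table.
NVME = Drive(lanes=4, capacity_tb=11, read_kiops=800, read_gb_s=3.35)


def aggregate(drive: Drive, lane_budget: int) -> dict:
    """Ideal totals when lane_budget lanes are filled with identical drives."""
    count = lane_budget // drive.lanes
    return {
        "drives": count,
        "capacity_tb": count * drive.capacity_tb,
        "read_mio_iops": count * drive.read_kiops / 1000,
        "read_gb_s": count * drive.read_gb_s,
    }


if __name__ == "__main__":
    for lanes in (64, 48):
        print(lanes, "lanes:", aggregate(NVME, lanes))
```

For 64 and 48 lanes this reproduces the 176TB / 12.8 Mio. IOPS / 53.6 GB/s and 132TB / 9.6 Mio. IOPS / 40.2 GB/s entries of the table.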
Conceptual NVMe-Server Design
Universal Department Store Servers
• might be over- or wrong-sized for SDS
Purpose-built Servers for Efficiency
• cost, performance, power, space etc. effective
• allow for lean Architecture
• … HPC, 10’000s in biggest DCs
• e.g. choose from > 100 NVMe Servers, AMD and INTEL CPUs
36x NVMe NGSFF
2x Intel Xeon Scalable CPU
3x UPI, <10.4GT/s
24x DIMM up to 3TB
2x PCIe x16, 1x PCIe x8
2x 10GBase-T
NVMe < 576TB < 352TB < 1’080TB
< 10 million IOPS
32x NVMe U.2
2x Intel Xeon Scalable CPU
3x UPI, <10.4GT/s
24x DIMM up to 3TB
2x PCIe x16
2x 10GBase-T
32x NVMe EDSFF
2x Intel Xeon Scalable CPU
3x UPI, <10.4GT/s
24x DIMM up to 3TB
2x PCIe x16
2x 10GBase-T
1U 1U
From Storage-Server to Server-Storage
48x NVMe U.2 Dual Port
2x Nodes:
2x Intel Xeon E5-2600v4 CPU
2x QPI, <9.6GT/s
16x DIMM up to 2TB
1x PCIe x16, 1x PCIe x8
2x 10GBase-T, SIOM
24x NVMe U.2
4x Nodes:
2x Intel Xeon Scalable CPU
3x UPI, <10.4GT/s
24x DIMM up to 3TB
2x PCIe x16
2x 10GBase-T
NVMe < 528TB < 264TB < 528TB
2U 2U
48x NVMe U.2
2x Intel Xeon E5-2600v4 CPU
2x QPI, <9.6GT/s
24x DIMM up to 3TB
2x PCIe x16, 1x PCIe x8, SIOM
SIOM (e.g. 2x 25GbE, 2x 10GbE)
From Storage-Server to Server-Storage
From Storage-Server to Server-Storage
20x NVMe U.2 7mm
2x Intel Xeon Scalable CPU
3x UPI, up to 10.4GT/s
24x DIMM up to 3TB
2x PCIe x8
2x 25GbE
< 80TB
1U All-NVMe & GPU Server on ABC booth
4x V100, P100, P40, M10 ...
Storage changes to
- SERVER-CENTRIC
- SOFTWARE-DEFINED
RAID Protection
JBOF Just a Bunch of Flash
NVMe-oF - NVMe over Fabric
Holistic Data Management
HCI Hyper Converged Infrastructure
IBM Spectrum Scale Acceleration
Achieved with 24x NVMe:
• The only sub-millisecond overall response time at 0.69ms ORT!
• 2.5x more builds than other Spectrum Scale storage options.
• Higher IOPS and throughput than all other SPEC SFS2014_swbuild results.
Solution available as Appliance or Software only.
NVMe-oF - NVMe over Fabric
RDMA, FC
The goal of NVMe over Fabrics is to provide distance connectivity to NVMe devices with no more than 10 micro-seconds (µs) of additional latency over a native NVMe device inside a server.
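The 10 µs goal can be read as a simple latency budget: fabric-attached latency minus local latency should not exceed 10 µs. The sketch below only illustrates that comparison; the latency values in the usage part are hypothetical placeholders, not measurements from the deck.

```python
NVMEOF_LATENCY_BUDGET_US = 10.0  # goal stated above: at most 10 µs added


def added_latency_us(local_us: float, fabric_us: float) -> float:
    """Extra latency of a fabric-attached device over a local NVMe device."""
    return fabric_us - local_us


def within_budget(local_us: float, fabric_us: float,
                  budget_us: float = NVMEOF_LATENCY_BUDGET_US) -> bool:
    """True if the fabric adds no more than the budgeted latency."""
    return added_latency_us(local_us, fabric_us) <= budget_us


if __name__ == "__main__":
    # Hypothetical example values in microseconds, for illustration only.
    local, remote = 90.0, 98.0
    print(f"added: {added_latency_us(local, remote)} µs,"
          f" within budget: {within_budget(local, remote)}")
```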
Use Cases
- A storage system comprised of many NVMe devices, using NVMe over Fabrics with either an RDMA or Fibre Channel interface, making a complete end-to-end NVMe storage solution. This system would provide extremely high performance while maintaining the very low latency available via NVMe.
- Usage of NVMe over Fabrics to achieve the low latency while connected to a storage subsystem that uses more traditional protocols internally to handle I/O to each of the SSDs in that system. This would gain the benefits of the simplified host software stack and lower latency over the wire, while taking advantage of existing storage subsystem technology.
Text & Graphics from http://nvmexpress.org
Low Latency Networking
Storage Accelerations, leveraging hardware offloads for NVMe
- ConnectX adapters support NVMe-oF <100Gbps
- BlueField (SoC) Smart NIC 2x 25GbE combines ConnectX5 with ARM CPU
NVMesh Reference Architecture
- near server-local performance in a linear scale-out remote standard NVMe solution.
NVMesh RA provides the flexibility to create and manage a single, centralized pool of storage, create “right-sized” logical volumes, and even share storage resources with existing compute resources. It also supports existing applications without changes.
NVMe-oF, FC-HBA
All-NVMe JBOF
4x PCIe-Bus Extension PCIe 3.0 x16
Client Network 2x EDR IB 100Gbps / 100/40GbE PCIe 3.0 x16
Cluster Interconnect 2x EDR IB 100Gbps PCIe 3.0 x16
Sync. Mirror
64 GB/s
> 36 Mio. IOPS
• 64, 128TB
• …
• 256TB
• 512TB
• 1’024TB
1U 32-bay JBOF Just a Bunch Of Flash
NVMe SSD U.2 hot-swap
NVMe SSD EDSFF hot-swap
Capacity Cache Drives
10x more Performance with 3D XPoint™ OPTANE Technology than NAND via PCIe* NVMe*
The Supermicro JBOF supports up to 12 direct attached hosts, making this the go-to storage platform for any high-performance computing application.
Alternatively, the dual PCI-E 3.0 x16 slots can support dual NVMe-oF add-on-cards to enable additional deployment scenarios.
4 Mini-SAS HD x16 ports, 2 PCI-E 3.0 x16 Slots, 2 IPMI ports
Supermicro JBOF 32x NVMe SSD U.2
RAID Protection
With a 3.2x lower Annualized Failure Rate (AFR) of SATA SSDs compared to HDDs, IT departments will spend less time and expense replacing or upgrading storage devices.
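To get a feel for what a 3.2x lower AFR means operationally, the expected number of yearly drive replacements is the drive count times the AFR. This is a rough sketch: the HDD failure rate and fleet size used in the example are hypothetical assumptions, since the slide only gives the ratio.

```python
AFR_IMPROVEMENT = 3.2  # from the slide: SSD AFR is 3.2x lower than HDD AFR


def ssd_afr_from_hdd(hdd_afr: float) -> float:
    """SSD annualized failure rate implied by the 3.2x ratio."""
    return hdd_afr / AFR_IMPROVEMENT


def expected_annual_failures(drive_count: int, afr: float) -> float:
    """Expected number of failed drives per year for a fleet."""
    return drive_count * afr


if __name__ == "__main__":
    # Hypothetical inputs: 1,000 drives and a 2% HDD AFR (assumption).
    drives, hdd_afr = 1000, 0.02
    print("HDD failures/year:", expected_annual_failures(drives, hdd_afr))
    print("SSD failures/year:",
          expected_annual_failures(drives, ssd_afr_from_hdd(hdd_afr)))
```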
Flash Technology
• More reliable, less Replacements
• Higher Throughput, faster Rebuilds
RAID Approach
• Hardware-Defined (RAID-Controller)
• Software-Defined (SDS)
• Hybrid: Intel VROC Virtual RAID on CPU
RAID Function in VMD Volume Management Device
Intel Virtual RAID on CPU – VROC
HCI Hyper Converged Infrastructure
8x 2U 4-Node Server. Dual CPU. 3x UPI <10.4GT/s. 24x DIMM. 6x NVMe U.2. 1x PCIe Extension x16. 2x 10GbE…
1U JBOF 32x NVMe. 4x Mini-SAS HD x16 ports. 2x PCI-E 3.0 x16 Slots
Data Management – not only Data Storing
Conceptual Optimization
NVMe
LTO LTFS
Main-stream
Directions of Movements

Storage-Technology / Parameter     up to … TB/drive    TB in 1U Server or JBOF    TB in 4U Server or JBOD
RAM
  Volatile
  Non-Volatile
Flash
  NVMe
    EDSFF [Ruler]                  32                  1080
    NGSFF [M.3]                    16                  576
    U.2 2.5"                       11
    M.2                            2
    AIC Add-in Card                8
  SSD 2.5"
    SAS                            8
    SATA                           11
Disk
  SAS 2.5"
    15K
    10K                            1.8
  NL SAS, SATA
    2.5"                           2
    3.5"                           12                                             1080
Tape
  LTO                              12 - 30
  IBM TS1150 [Jaguar]              10
  …
Gotthardpost, 1873
Johann Rudolf Koller, 1828-1905
https://en.wikipedia.org/wiki/Rudolf_Koller
Change Horses, add Horses
or use the Gotthard Tunnel …
NVMe Takes It All, SCSI Has To Fall
Hot Data: NVMe
Somewhat Hot Data: SAS, FC, SATA SSD
Lukewarm Data: NL SAS, SATA
Cold Data: LTO [ LTFS ]
• The PCIe Bus is in the Server
• NVMe is the Protocol for Flash
• 50-100TB NVMe
• PCIe 4.0
“Flat screens vs Displays”
Headquarters Zurich: Ruetistrasse 28, CH-8952 Schlieren, Tel +41 43 433 6 433
Branch Office Berne: Giessereiweg 9, CH-3007 Bern, Tel +41 31 3 700 600
http://www.ABCsystems.ch [email protected]
Alexander Ruebensaal [email protected]
… simplify and win with us and our Partners
Other names and brands may be claimed as property of others.
Spectrum Scale, Spectrum Protect