Emerging Ethernet Standards and Their Impact on Storage
Anupama B N, NetApp
2019 Storage Developer Conference India
Agenda
Ethernet Technology Landscape
Ethernet Standards and Technology
Connector Standards and Types
RDMA (RoCE and iWARP)
RoCE vs iWARP
RoCE over long distance
References
Ethernet Technology Landscape
FCoE
RDMA: iWARP, RoCE
SDN
Extension into the VM environment: vSphere / Open vSwitch / Nexus
Provisioning and orchestration tools, with a focus on overlays: VXLAN, NVGRE, GENEVE, WAN
NVMe over Fabrics (NVMe-oF)
iSCSI
iSER
Data Center: Old vs. New
Bigger – 1 km cable runs are common
Fill as you go, leave in place
Manage via API (remote); ports set up on demand via API
Leaf/Spine Clos topology (vs. Tree) – see the sizing sketch below
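Why leaf/spine scales: as an illustrative sizing sketch (the port count $p$ and the even host/uplink split are assumptions for this sketch, not figures from the deck), a two-tier Clos built from identical $p$-port switches gives

\[
\text{spines} = \frac{p}{2}, \qquad \text{leaves} \le p, \qquad \text{hosts} \le p \cdot \frac{p}{2} = \frac{p^2}{2}
\]

since each leaf splits its ports evenly between hosts and spine uplinks, and each spine reaches every leaf. With 64-port switches that is up to 2048 hosts at full bisection bandwidth, with every host pair exactly two switch hops apart – unlike a tree, where uplinks are oversubscribed.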
High Speed Interconnects
Low-latency access + high-speed transports: NVMe over Fabrics, Persistent Memory (PMEM) in the server, Storage Class Memory (SCM) as cache
Media evolution across tiers: Hybrid (SAS) → All-Flash (NVMe/SAS) → Flash at all tiers (QLC-SSD/SCM)
Ethernet Technology and Standards
Roadmap chart, CY16–CY22: Technology/MSA [Purple] vs. IEEE Standard [Black]

Technology / MSA:
50G PAM4 SERDES
100G PAM4 SERDES
Silicon Photonics
Silicon Photonics in NIC/Switch ASICs
CWDM4 / PSM4 MSA
OSFP MSA
QSFP-DD
50/100/200/400G early NIC and switch ASICs
100/200/400/800G early NIC and switch ASICs

IEEE Standards:
IEEE 802.3by 25G
IEEE 802.3bq 25/40GBase-T
IEEE 802.3cc 25G SMF
IEEE 802.3cd 50/100/200G
IEEE 802.3bs 200G/400G (4/8 lane)
IEEE 802.3ck 100/200/400G
IEEE 802.3cz 800G
IEEE 802.1AS-Rev time-sensitive Ethernet
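As a quick cross-check of how the technology and standards rows line up (simple arithmetic, not data from the chart): the 8-lane 400G PHYs of 802.3bs need the 50G PAM4 SERDES generation, while 4-lane 400G interfaces in 802.3ck need the 100G PAM4 generation:

\[
\frac{400\ \text{Gb/s}}{8\ \text{lanes}} = 50\ \text{Gb/s per lane}, \qquad \frac{400\ \text{Gb/s}}{4\ \text{lanes}} = 100\ \text{Gb/s per lane}
\]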
802.3bs/cd Signaling
NRZ to PAM4 transition
1, 2, 4, and 8 lanes
Source: Mellanox blog, NeoPhotonics
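For context, a worked example (not on the slide): NRZ carries one bit per symbol while PAM4 carries two, so at the same symbol rate PAM4 doubles the lane bit rate. The 50G lanes of 802.3cd illustrate this:

\[
\text{bit rate} = \log_2(\text{levels}) \times \text{symbol rate} \;\Rightarrow\; 2 \times 26.5625\ \text{GBd} = 53.125\ \text{Gb/s per lane}
\]

which carries a 50 Gb/s MAC rate once FEC overhead is accounted for.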
Connector Types
MSA mainstream: new double-density connectors (SFP-DD, QSFP-DD)
Source: SFP-DD consortium, QSFP-DD consortium
What is RDMA?
RDMA – Remote Direct Memory Access
Benefits:
Very low latency, very high throughput, ≈ zero CPU utilization
Bypasses traditional network stacks (TCP/IP)
Provides a Fibre Channel-equivalent solution at a lower cost
Three hardware technologies: RoCE, iWARP, InfiniBand
Traditional protocols (SMB, NFS, iSCSI) can operate over RDMA – see the sketch below
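On Linux, all three hardware technologies are exposed through the same verbs API (libibverbs), which is why upper-layer protocols can stay largely transport-agnostic. A minimal sketch that enumerates the RDMA-capable devices on a host, assuming libibverbs and its headers are installed (compile with -libverbs):

#include <stdio.h>
#include <infiniband/verbs.h>   /* libibverbs: common API for RoCE, iWARP, InfiniBand */

int main(void)
{
    int num_devices = 0;

    /* Ask the verbs library for every RDMA-capable device it can see. */
    struct ibv_device **devices = ibv_get_device_list(&num_devices);
    if (!devices) {
        perror("ibv_get_device_list");
        return 1;
    }

    for (int i = 0; i < num_devices; i++)
        printf("RDMA device: %s\n", ibv_get_device_name(devices[i]));

    ibv_free_device_list(devices);
    return 0;
}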
Ethernet RDMA Stack (diagram)
iWARP
Delivers RDMA on top of pervasive TCP/IP
Runs over all Ethernet infrastructure
TCP provides flow control and congestion management
Highly routable and scalable implementation
Extensions eliminate TCP/IP stack processing, memory copies, and application context switches
iWARP addresses the network bottlenecks of high-speed Ethernet and provides high throughput and low latency with low CPU utilization for data communication
RDMA over Converged Enhanced Ethernet (RoCE)
Remote Direct Memory Access:
Accelerates data exchange between servers
Bypasses the CPU and the typical network stack
Reduced latency
Converged Enhanced Ethernet:
Priority Flow Control
Enhanced Transmission Selection
Lossless Ethernet fabric
Same RDMA, different L2 transport
RoCE (v2)
Well known from InfiniBand
Works well on a lossless network
Lower latency than alternative transport protocols (TCP)
Significantly lower overhead when offloaded to the adapter
..BUT
Ethernet is not lossless by design
PFC is required to achieve a lossless Ethernet fabric
PFC (part of DCB) has a high configuration and management overhead – VLANs, priorities
PFC is Layer 2 only
RDMA Pros and Cons

Non-RDMA Ethernet
Pros:
• TCP/IP-based protocol
• Works with any Ethernet switch
• Wide variety of vendors and models
• Support for in-box NIC teaming
Cons:
• High CPU utilization under load
• High latency

iWARP
Pros:
• Low CPU utilization under load, low latency
• TCP/IP-based protocol
• Works with any Ethernet switch
• RDMA traffic routable
• Offers up to 100 Gbps per NIC port today*
Cons:
• Requires enabling firewall rules

RoCE
Pros:
• Ethernet-based protocol
• Works with Ethernet switches
• Offers up to 100 Gbps per NIC port today*
• Routable with RoCEv2
Cons:
• Requires DCB switch with Priority Flow Control (PFC)

InfiniBand
Pros:
• Switches typically less expensive per port*
• Switches offer high-speed Ethernet uplinks
• Commonly used in HPC environments
• Offers up to 54 Gbps per NIC port today*
Cons:
• Not an Ethernet-based protocol
• RDMA traffic not routable via IP infrastructure
• Requires InfiniBand switches
• Requires a subnet manager (typically on the switch)
RoCE vs iWARP Differences

                        RoCE                            iWARP
Underlying network      UDP                             TCP
Congestion management   Relies on DCB                   TCP handles it with flow control
Adapter offload         Full DMA                        Full DMA w/ TCP/IP
Routability             Yes                             Yes
Cost                    Needs a DCB-enabled switch      Depends on the deployment;
                        infrastructure                  no switch configuration required
NVMe over Fabrics (diagram)
Transport taxonomy: NVMe native; NVMe-oF over RDMA (InfiniBand, RoCE, iWARP on Ethernet/IP) or FC; alongside NFS, CIFS, SAN
Deployment: Application → NVMe initiator (compute) → RDMA NICs and switches → NVMe target (storage), over NVMe-oF / RDMA-RoCE
NVMe SSDs sit in storage drives and converged shelves; integrated Ethernet RDMA NIC implementation (a connect sketch follows)
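On Linux, an NVMe-oF connection over RDMA is established by writing a parameter string to the kernel's fabrics control device; this is what nvme-cli's nvme connect does underneath. A minimal sketch, where the target address, port, and subsystem NQN are made-up placeholders:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical target: adjust traddr/trsvcid/nqn for a real deployment.
       4420 is the IANA-assigned NVMe-oF port. */
    const char *params =
        "transport=rdma,traddr=192.168.10.20,trsvcid=4420,"
        "nqn=nqn.2019-06.com.example:subsystem1";

    int fd = open("/dev/nvme-fabrics", O_RDWR);
    if (fd < 0) {
        perror("open /dev/nvme-fabrics");
        return 1;
    }

    /* The kernel parses the string and, on success, creates a /dev/nvmeX controller. */
    if (write(fd, params, strlen(params)) < 0)
        perror("nvme-of connect");

    close(fd);
    return 0;
}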
PFC – Priority Flow Control
By nature, Ethernet is a lossy network
Ethernet provides flow control mechanisms that make it lossless – 2 options:
• Flow control applied over the whole port (Global Pause – 802.3x)
• Flow control applied per priority (Priority Flow Control – 802.1Qbb)
PFC negotiation between switch and host can be done by DCB (Data Center Bridging):
• Using Data Center Bridging Exchange (DCBX) negotiation
• End points (switch & host) exchange information about their capabilities
• If PFC is supported, it will be used
• If PFC is not supported, global flow control will be used
• If DCBX is not supported or the PFC capability is not supported, manual configuration is required
Routers rebuild the Layer 2 header:
• Among other fields, they rebuild the PCP field using a DSCP-to-PCP mapping (see the sketch below)
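Because routers strip the L2 priority, traffic that must keep its class across L3 hops needs a DSCP marking at the source. A minimal sketch of how an application can set the DSCP on an ordinary socket via the standard IP_TOS option (the DSCP value 26 / AF31 is only an example choice, not something the slide prescribes); RoCE NICs typically apply an equivalent marking through their own QoS configuration:

#include <stdio.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) {
        perror("socket");
        return 1;
    }

    /* DSCP occupies the upper 6 bits of the IP TOS byte:
       DSCP 26 (AF31) << 2 = 0x68. Routers can map this DSCP
       back to a PCP value when they rebuild the L2 header. */
    int tos = 26 << 2;
    if (setsockopt(sock, IPPROTO_IP, IP_TOS, &tos, sizeof(tos)) < 0)
        perror("setsockopt(IP_TOS)");

    return 0;
}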
PFC (contd.)
Illustration: https://www.darrylvanderpeijl.com/wp-content/uploads/part1_PFC.gif
PFC and ETS (Enhanced Transmission Selection) – diagram
RoCE for Long Distance
Minimize the recovery impact from lost packets
• Causes of loss: congestion, faulty networking components, alpha particles, etc.
Congestion
• Cannot use normal congestion control – PFC and ECN latency is too great because of the distance
• Solution options: packet pacing (NIC and application); prioritizing flows (QPs) through local networks (NIC and switches) via ECN, PFC, and other QoS
Enhance recovery for lost packets
• Resilient RoCE
• Create a lot of small flows (application)
• Minimize the latency of retry
Routable RoCE
Routable RoCE requires a higher-level congestion mechanism
ECN – Explicit Congestion Notification
ECN can slow down traffic to prevent congestion
ECN configuration overhead is lower than PFC's – simple and easy
Source: Mellanox web
Resilient RoCE
Resilient RoCE can cope with packet loss and out-of-order packets
ECN is suggested but not required
Out-of-order packets are held in a buffer to fill the gaps; re-ordered packets are then written to memory
Missing packets are requested from the sender
So..
No loss – everything is fast
Some loss – slows down, but stays in working order
Still significantly better than TCP/IP
References
http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p523.pdf
https://community.mellanox.com/s/article/understanding-qos-configuration-for-roce
https://www.snia.org/sites/default/files/ESF/RoCE-vs.-iWARP-Final.pdf
http://files.gpfsug.org/presentations/2017/Manchester/04_Mellanox.pdf
CONCLUSION
High-speed Ethernet is the new backbone and could replace FC.
Different media/storage accessed over the network require reliable connectivity with high throughput and low latency.
High Availability and Disaster Recovery solutions are in demand, with strong data-relocation capabilities across geographies.
NVMe over Fabrics transports over Ethernet are gaining momentum.
Ethernet NIC Roadmap (CY17–CY22)
NIC ASICs [Red]; SoC/Network processor ASICs [Blue]
Vendors: Broadcom, Cavium (FastLinQ), Chelsio, Intel, Mellanox

Broadcom: Cumulus 2x25G gen3; Stratus 1x100G gen3; Thor 1x200G gen4; Stingray 1x100G gen3; Stingray 2 1x200G gen4
Chelsio: T6 2x100G gen3; T7 2x100G gen4, RoCE + iWARP; T8 2x200G gen4, RoCE + iWARP
Intel: Columbiaville 2x100G gen4, RoCE + iWARP
Mellanox: ConnectX-5 2x100G gen4; ConnectX-6 2x200G gen4; BlueField 2x100G gen4 x32; BlueField 2 Lx 2x100G gen4 x16
Cavium (FastLinQ): Arrowhead 2/4x25G gen3; Elbrus (E5) 2x100G / 1x200G (50G SerDes) gen4, RoCE + iWARP; Liquid I/O III 1x100G gen4; Mt. Stellar 2x200G, 2x gen4 x16; F1 2x200G (50G SerDes) / 1x100G (25G SerDes) gen4, RoCE + iWARP
Questions?