Deepak Pathania, Senior Technical Leader, NEC Corporation and NEC Technologies India Private Limited
Join the Conversation #OpenPOWERSummit
A Resource Disaggregated Platform for Realizing Ultra-Fast Failover Recovery High Availability Systems
So, What Can Be a Resource Disaggregated Platform?
A technology that extends PCI Express beyond the confines of a computer chassis via Ethernet, without any modification of existing hardware or software: in effect, a PCIe switch over Ethernet (ExpEther, or EE).
Figure: a server (CPU, memory, PCI Express, ExpEther NIC) connects through an L2 switch on standard Ethernet to ExpEther Engines, one in front of an IO device and one inside an IO expansion unit with PCIe cards.
Just another implementation of PCIe Switch
● The ExpEther Engine is seen as a PCIe switch from the CPU
● The Ethernet region is invisible from the CPU
Figure: a conventional PCIe switch (an upstream port and two downstream ports, each a PCI bridge, joined by an internal PCI bus between the CPU and the IO devices) shown beside the ExpEther equivalent (ExpEther Engines acting as PCI bridges on the CPU side and in front of each IO device, joined by an Ethernet switch; the Ethernet fabric is invisible).
Broad-Scale Single Computer
Figure: CPUs and IO devices joined by ExpEther Engines and Ethernet switches instead of a PCIe switch. IO devices sit in the same rack, in the next rack, on another floor, and in another building; the link distances shown are 10 m, 100 m, and 2 km.
A PCI Express switch is equivalent to an Ethernet fabric.
ExpEther can build a new type of computing environment without physical constraints.
ExpEther Architecture
• Achieve the “System on Network”
• Merge PCI Express technology into Ethernet technology
• Connect logically in the MAC layer
• No impact on the upper or lower layers of the PCIe and Ethernet standards, leaving room for future expansion
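Figure: the ExpEther and Ethernet stacks side by side.
ExpEther: Application, OS, PCI Driver, EFI/PCI BIOS (software); ExpEther Logic, MAC, PHY at 1G/10G/40G (hardware).
Ethernet: Application, OS, NDIS Driver (software); Ethernet Logic, MAC, PHY at 10M/100M/1G/10G/40G (hardware).
The two stacks are upper compatible, with no modification needed for future expansion of ExpEther or Ethernet.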
Resource Disaggregated Platform or ExpEther features
1. Equivalent to direct connection (Ethernet is invisible from CPU/IO)
2. Low latency (L2 Ethernet without a software stack)
3. No packet loss (adding reliability to Ethernet)
4. I/O dynamic reconfiguration (hot-plug scheme)
Figure: a CPU attached over PCI Express to an ExpEther Engine, which carries PCI Express TLPs inside Ethernet frames across an Ethernet switch fabric to the ExpEther Engines in front of the I/O devices.
Dual Path for Throughput and Reliability
• Two Ethernet connections are established between the Host Chip and the I/O Chip
• Load balancing for performance
• Path redundancy for failure recovery
Failure recovery: quickly detects path failures and switches paths.
Load balancing: round-robin data packet transmission between the two redundant connections.
Figure: a CPU with a dual-port ExpEther Host Chip (40G ExpEther NIC) reaches the I/O devices behind ExpEther IO Chips through two independent fabrics, Ethernet Fabric-I and Ethernet Fabric-II.
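The round-robin and path-switching behavior of the dual-path design can be sketched in a few lines of Python. This is only an illustrative model, not the ExpEther hardware logic: the `Path` and `DualPathSender` names and the `up` flag are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Path:
    """One Ethernet connection between host chip and I/O chip (hypothetical model)."""
    name: str
    up: bool = True
    sent: list = field(default_factory=list)


class DualPathSender:
    """Send frames round-robin over whichever paths are currently up."""

    def __init__(self, paths):
        self.paths = list(paths)
        self._next = 0  # index of the path to try first for the next frame

    def send(self, seq):
        # Try each path once, starting from the round-robin position.
        for offset in range(len(self.paths)):
            idx = (self._next + offset) % len(self.paths)
            path = self.paths[idx]
            if path.up:
                path.sent.append(seq)
                self._next = (idx + 1) % len(self.paths)
                return path.name
        raise RuntimeError("no path available")


fabric_1 = Path("Ethernet Fabric-I")
fabric_2 = Path("Ethernet Fabric-II")
sender = DualPathSender([fabric_1, fabric_2])

for seq in range(1, 5):   # both paths up: frames alternate between fabrics
    sender.send(seq)
fabric_2.up = False       # simulate a path failure
for seq in range(5, 8):   # every frame now uses the surviving fabric
    sender.send(seq)
```

With both fabrics up the frames alternate, which gives the load balancing; once one fabric is marked down the same loop simply skips it, which gives the redundancy.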
Frame Rate Control
TCP/IP: rate control is triggered by packet loss (TCP Reno).
Figure: network bandwidth over time for TCP. Slow start is followed by repeated congestion-avoidance phases, each ended by congestion (packet loss).
Packet loss causes significant performance degradation because of retransmission.
ExpEther: rate control is always done by measuring network latency.
Figure: network bandwidth over time for ExpEther. A probing phase is followed by congestion avoidance.
Packet loss basically does not occur in ExpEther.
The ExpEther engine continuously measures the frame arrival time on the receive side and finely controls the frame rate to avoid packet loss.
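To make the contrast with loss-triggered control concrete, here is a minimal Python sketch of a latency-driven rate controller. It is not the ExpEther algorithm, which the slides do not specify: the class name, the minimum-latency baseline, and every constant (step size, backoff factor, margin) are assumptions chosen for illustration.

```python
class LatencyRateController:
    """Delay-based rate control sketch; all constants are illustrative assumptions."""

    def __init__(self, rate_mbps=1000.0, max_rate_mbps=40000.0,
                 probe_step_mbps=500.0, backoff=0.8, margin_us=2.0):
        self.rate_mbps = rate_mbps
        self.max_rate_mbps = max_rate_mbps
        self.probe_step_mbps = probe_step_mbps
        self.backoff = backoff
        self.margin_us = margin_us
        self.base_latency_us = None  # lowest latency seen so far

    def on_latency_sample(self, latency_us):
        # Track the minimum latency as the uncongested baseline.
        if self.base_latency_us is None or latency_us < self.base_latency_us:
            self.base_latency_us = latency_us
        if latency_us - self.base_latency_us > self.margin_us:
            # Queues are building: slow down before any frame is dropped.
            self.rate_mbps *= self.backoff
        else:
            # Latency is near the baseline: probe for more bandwidth.
            self.rate_mbps = min(self.rate_mbps + self.probe_step_mbps,
                                 self.max_rate_mbps)
        return self.rate_mbps


ctrl = LatencyRateController()
rates = [ctrl.on_latency_sample(s) for s in (5.0, 5.2, 5.1, 9.0, 5.3)]
```

The rate climbs while latency stays near its baseline and drops as soon as one sample shows queueing delay, so the sender reacts before a switch buffer overflows rather than after a loss.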
Sequence of Network Path Failover (1/2)
• Both network paths are used as ACT-ACT (active-active)
Figure: the EE NIC (Tx side) holds a retransfer buffer and an arbiter that sends numbered ExpEther packets over two paths of Ether switches; the EE (Rx side) has one receive buffer per path, feeding an ordering buffer.
If a path fails, ExpEther resends the lost packets. The failover time is about 10 RTT (several microseconds).
Figure: one path fails and the packets in flight on it are lost. The Rx side detects the gap with a sequence number check, the Tx side resends from the retransfer buffer over the surviving path, and the Rx side re-receives the packets after several microseconds.
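The sequence number check, ordering buffer, and retransfer buffer described on this slide can be modeled roughly as follows. This is a hedged Python sketch, not the hardware design: the `OrderingBuffer` class, the packet payload strings, and the assumption that odd and even packets take different paths are all made up for the example.

```python
class OrderingBuffer:
    """Rx side: deliver packets in sequence order and report gaps (illustrative)."""

    def __init__(self, first_seq=1):
        self.expected = first_seq
        self.pending = {}      # seq -> payload, held until it is in order
        self.delivered = []

    def receive(self, seq, payload):
        if seq < self.expected or seq in self.pending:
            return             # duplicate caused by a resend; ignore it
        self.pending[seq] = payload
        # Release every packet that is now contiguous.
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1

    def missing(self):
        """Sequence numbers below the highest held one that have not arrived."""
        if not self.pending:
            return []
        return [s for s in range(self.expected, max(self.pending))
                if s not in self.pending]


# Tx side keeps a copy of every unacknowledged packet (the retransfer buffer).
retransfer_buffer = {seq: f"pkt{seq}" for seq in range(1, 9)}

rx = OrderingBuffer()
# Assumed split: odd packets on path A, even on path B; path B fails after packet 2.
for seq in (1, 2, 3, 5, 7):
    rx.receive(seq, retransfer_buffer[seq])

lost = rx.missing()        # the sequence number check finds the gaps
for seq in lost + [8]:     # resend the lost packets, then continue, on path A
    rx.receive(seq, retransfer_buffer[seq])
```

Because the receiver only releases contiguous packets, the upper PCIe layer never sees a hole or a reordering; it just sees a short delay while the gap is refilled.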
Sequence of Network Path Failover (2/2)
• The network path is recovered by an Ethernet recovery scheme, such as P-Flow, linked with the EE manager.
Figure: while the failed path is down, all packets travel over the surviving path; a DEVINFO packet is shown in transit to the ExpEther device.
When an ExpEther device receives a management packet indicating that the path has recovered, it resumes using both network paths.
Figure: a new path through the Ether switches is enabled in place of the failed one, and packets are again distributed over both paths.
Multi-Path IO with Resource Disaggregation or ExpEther
• Multi-Path IO (MPIO)
  • MPIO is one of the techniques for achieving high reliability. If the target IO device supports MPIO, MPIO also works under ExpEther.
• Multi-Path Ethernet
  • It supports high-speed network path failover.
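Figure: a host with two SAS HBAs (#0 and #1) running active-active MPIO to a SAS JBOD, shown as equivalent to a host with an EE NIC that reaches the two SAS HBAs and the SAS JBOD through EE devices and two Ether switches. The ExpEther version combines MPIO with high-speed network failover.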
Dynamic Reconfiguration and Hot-Plug Capability
Figure: logical view of four hosts, each seeing its own IO devices behind a PCIe switch (Group#1: B, D, G, I; Group#2: A, J; Group#3: C, E, H; Group#4: F). Physically, hosts 1 to 4 and IO devices A to J all attach to one Ethernet fabric, each tagged with a group number, under the control of the EE Manager.
Dynamic Reconfiguration and Hot-Plug Capability
• Group ID (GID: 1 to 4,095)
  • A GID from 1 to 15 is set by the physical DIP switch on the card.
  • Setting the GID to 0 allows the management software to program a soft GID.
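The GID rules above can be expressed as a small Python sketch. It is illustrative only: the function names, the card tuples, and the rule that a nonzero DIP switch value takes precedence over a soft GID are assumptions inferred from the slide, not a documented ExpEther API.

```python
GID_MAX = 4095   # upper bound of the GID range quoted above
DIP_MAX = 15     # largest GID the DIP switch can select


def effective_gid(dip_switch, soft_gid=None):
    """Return the GID a card uses, or None if it has not been assigned yet."""
    if not 0 <= dip_switch <= DIP_MAX:
        raise ValueError("DIP switch value out of range")
    if dip_switch != 0:
        return dip_switch          # assumed: the hardware setting is used as-is
    if soft_gid is None:
        return None                # waiting for the management software
    if not 1 <= soft_gid <= GID_MAX:
        raise ValueError("soft GID out of range")
    return soft_gid


def group_devices(cards):
    """Map each GID to the names of the cards that share it."""
    groups = {}
    for name, dip, soft in cards:
        gid = effective_gid(dip, soft)
        if gid is not None:
            groups.setdefault(gid, []).append(name)
    return groups


cards = [
    ("host-1", 1, None),
    ("io-B", 1, None),
    ("host-2", 0, 200),
    ("io-A", 0, 200),
    ("io-F", 0, None),   # not yet assigned: belongs to no group
]
groups = group_devices(cards)
```

In this model, moving an IO device to another host is just a change of its soft GID followed by regrouping, which is the software-side counterpart of the hot-plug event the host would observe.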
Figure: four hosts (EE1 to EE4) and sixteen IO devices, each behind an EE tagged with a group ID from 1 to 4, share one Ethernet fabric. A management server configures the group IDs and collects various information.
ExpEther Manager
➢ Configuration
  • Group ID configuration
➢ Monitoring
  • ExpEther network status
  • PCIe device status
  • New ExpEther detection
  • Failure detection
Management Frame
➢ Special Ether frame
  • ExpEther hard-wired logic directly receives and sends the frames for configuration and management
ExpEther Technology Architectural Possibilities
▐ Std-EE: Standard PCIe-over-Ethernet
  • Foundation of ExpEther
▐ MR-EE: I/O sharing
  • Multiple hosts can share an IO device by using an SR-IOV compliant device
▐ P2P-EE: I/O direct connection
  • Support for peer-to-peer data transfer between I/O devices
▐ NTB-EE: Remote direct memory access by NTB
  • High-speed data transfer between hosts
Figure: four panels, one per variant.
Std-EE (standard PCIe-over-Ether, the foundation of ExpEther): hosts and I/O devices, each with Std-EE, joined by an Ethernet switch, with partitioning and hot-plug.
MR-EE (resource sharing): several hosts with Std-EE share an SR-IOV device behind MR-EE over Ethernet.
P2P-EE (peer-to-peer): direct data transfer between I/O devices, in place of the current path through the host.
NTB-EE (NTB direct access): hosts with NTB-EE access each other directly through an Ethernet switch.
ExpEther Lineup
• 1G/10G ExpEther
  ExpEther HBA
  ● x1 PCI Express, dual 1000BASE-T
  ● x8 PCIe Gen2, dual 10G SFP+
  ExpEther Client
  ● 2x 1000BASE-T
  ● DVI x1, HDMI x1
  ● USB3.0 x1, USB2.0 x3
  ● Headphone x1, Microphone x1
  ExpEther IO Expansion Unit
  ● x16 PCIe x 1 slot, dual 1000BASE-T
  ● x16 PCIe2 x 2 slots (full height/full length), dual 10G SFP+ per slot
• 40G ExpEther
  ExpEther HBA
  ● IO Interface: x8 PCI Express 3.0
  ● Network I/F: QSFP+ x 2
  ● Form Factor: PCI Low Profile
  IO Expansion Unit
  ● IO Interface: x8 PCI Express 3.0
  ● Slots: x16 Slot x 4
  ● Network I/F: QSFP+ x 4
  ● Support IO: GPGPU (K80, P100, etc.)
IO Expansion Unit chassis: 19” rack size, 3U, 400 mm; 1,000 W PSU for the 2-slot IO Expansion Unit; 800 W PSU for the 4-slot IO Expansion Unit.
Figure: product photos labeled by link speed (1G, 10G, 40G).
Performance of EE vs Local with PCIe-based SSDs
name/depth 1 4 16 64
local 2363699.2 2266555.73 2264849.07 2197777.07
ExpEther 2657348.27 2490436.27 2491665.07 2462105.6
ExpEther/local (%) 112.42 109.88 110.01 112.03
name/depth 1 4 16 64
local 1039667.2 1056358.4 1093597.87 1068544
ExpEther 1053832.53 1060447.57 1072503.47 1054276.27
ExpEther/local (%) 101.36 100.39 98.07 98.66
ExpEther has no impact on bandwidth and can fully support PCIe Gen3 x8 (64 Gbps).
Performance of EE vs Local with PCIe-based SSDs
name/depth 1 4 16 64
local 42467.67 157166.67 370562 394857.33
ExpEther 39113.67 145294.33 359693.67 392591
ExpEther/local (%) 92.10 92.45 97.07 99.43
name/depth 1 4 16 64
local 216616.67 231208.67 226872.67 231524
ExpEther 130586.67 230349.33 232602.67 231147.33
ExpEther/local (%) 60.28 99.63 102.53 99.84
ExpEther can achieve the same IOPS as local by increasing the IO depth parameter to hide the latency of Ethernet.
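Why a deeper queue hides the Ethernet latency can be seen with a rough estimate. The Python sketch below applies Little's law (outstanding requests = throughput x latency) to the depth-1 column of the first IOPS table; treating the benchmark as a closed loop that obeys Little's law is an assumption of this example, not a statement from the measurements.

```python
def mean_latency_us(iops, depth):
    """Little's law: outstanding requests = throughput x latency."""
    return depth / iops * 1e6


# Depth-1 IOPS taken from the first IOPS table above.
local_us = mean_latency_us(42467.67, depth=1)
ee_us = mean_latency_us(39113.67, depth=1)
added_us = ee_us - local_us

print(f"local {local_us:.1f} us, ExpEther {ee_us:.1f} us, added {added_us:.1f} us")
```

Under that assumption the Ethernet hop adds on the order of 2 microseconds per request. With one request in flight the full delay shows up as lost IOPS; with many in flight the device stays busy during the extra delay, which is consistent with the ratio approaching 100% at depth 64.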
Performance of EE vs Local with Tesla GPU (P100)
ExpEther achieves almost the same performance as the local case. In deep learning applications, the data exchanged between host and GPU is very small and most of the processing time is spent in the GPU, so ExpEther has no performance impact.
Figure: chart comparing the Local and via ExpEther cases.
Ultra-Fast Failover Recovery for Database system with EE
High-speed data restore to the Standby Server for an in-memory DB.
✓ NVMe SSD is faster than Fibre Channel.
✓ Use NVMe SSD as the journal for the DB.
✓ When the Active Server fails, the NVMe SSDs’ connection is switched, allowing the DB journal to be restored on the Standby Server.
Figure: the Active Server and Standby Server reach the main DB over an FC SAN and the DB journal (NVMe + EE) over Ethernet. The Active Server logs to the journal; when it fails, failover occurs and the Standby Server restores from the journal.
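The logging and restore flow can be illustrated with a toy in-memory store. This Python sketch is only a conceptual model: the `JournaledStore` class is hypothetical, and a plain list stands in for the NVMe SSD that ExpEther re-attaches from the active to the standby server.

```python
class JournaledStore:
    """In-memory key-value store that logs every write to a journal (illustrative)."""

    def __init__(self, journal):
        self.journal = journal   # stands in for the shared NVMe SSD
        self.data = {}

    def put(self, key, value):
        self.journal.append((key, value))   # log first, then apply
        self.data[key] = value

    @classmethod
    def restore(cls, journal):
        """Rebuild the state on a standby server by replaying the journal."""
        store = cls(journal)
        for key, value in journal:
            store.data[key] = value
        return store


shared_journal = []                  # reachable from both servers
active = JournaledStore(shared_journal)
active.put("balance", 100)
active.put("balance", 75)
active.put("owner", "alice")

# The active server fails; the journal device is re-attached to the standby.
standby = JournaledStore.restore(shared_journal)
```

In this model the recovery time is dominated by how fast the journal can be read back, which is the motivation for placing the journal on NVMe rather than on the FC SAN.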
Ultra-Fast Failover Recovery for Database system with EE and ExpressCluster X
• <Real pictures of the setup to be added here on left hand side>
Figure: failover from the Primary Power Server to the Secondary Power Server.
Ultra-Fast Failover Recovery for Database system with EE and ExpressCluster X
• <Experiments results, observations and conclusions to be added here>
Ultra-Fast Failover Recovery for Database system with EE and ExpressCluster X
• <Comparisons with existing Failover system architectures to be added here>
Service Acceleration Platform with ExpEther
IO devices can be dynamically allocated to the appropriate host according to workload.
Figure: compute nodes (CPU/chipset with ExpEther HBAs) and an EE Client (USB/VGA, KVM) connect through Ether switches, carrying PCIe TLPs over Ethernet, to remote IO devices behind ExpEther Engines: an accelerator node forming an accelerator resource pool (GPGPUs and an FPGA accelerator), high-speed storage nodes (NVMe SSDs), and a USB controller with sensors.
Future Roadmap of ExpEther or Universal Interconnect
Summary
• <To be Added>
Thank you