Transport Layer Enhancements for Unified Ethernet in Data Centers
K. Kant
Raj Ramanujan
Intel Corp
Exploratory work only, not a committed Intel position
*Third party marks and brands are the property of their respective owners

Context
Data center is evolving; fabric should too.
Last talk:
– Enhancements to Ethernet, already on track
This talk:
– Enhancements to Transport Layer
– Exploratory, not in any standards track.

Outline
– Data Center evolution & transport impact
– Transport deficiencies & remedies
  – Many areas of deficiencies …
  – Only Congestion Control and QoS addressed in detail
– Summary & Call to Action

Data Center Today
Tiered structure
Multiple incompatible fabrics
– Ethernet, Fibre Channel, IBA, Myrinet, etc.
– Management complexity
Dedicated servers for applications → inflexible resource usage
[Figure: tiered data center with client req/resp, business transaction, and database query tiers; network fabric, IPC fabric, and storage fabric to SAN storage.]

Future DC: Stage 1 – Fabric Unification
Ethernet dominant, but convergence really on IP.
– New layer 2: PCI-Exp, Optical, WLAN, UWB, …
Most ULPs run over transport over IP → need to comprehend transport implications
[Figure: unified fabric carrying client req/resp, business transactions, database queries, and iSCSI storage.]

Future DC: Stage 2 – Clustering & Virtualization
SMP → Cluster (cost, flexibility, …)
Virtualization
– Nodes, network, storage, …
Virtual clusters (VC)
– Each VC may have multiple traffic types inside
[Figure: sub-clusters 1 to 3 and storage nodes connected by an IP network, grouped into virtual clusters 1 to 3.]

Future DC: New Usage Models
Dynamically provisioned virtual clusters
Distributed storage (per node)
Streaming traffic (VoIP/IPTV + data services)
HPC in DC
– Data mining for focused advertising, pricing, …
Special purpose nodes
– Protocol accelerators (XML, authentication, etc.)
New models → new fabric requirements

Fabric Impact
More types of traffic, more demanding needs.
Protocol impact at all levels
– Ethernet: Previous presentation.
– IP: Change affects entire infrastructure.
– Transport: This talk
Why transport focus?
– Change primarily confined to endpoints.
– Many app needs relate to transport layer
– App. interface (Sockets/RDMA) mostly unchanged.
DC evolution → Transport evolution

Transport Issues & Enhancements
Transport (TCP) enhancement areas
– Better congestion control and QoS
– Support media evolution
– Support for high availability
– Many others
  – Message based & unordered data delivery.
  – Connection migration in virtual clusters.
  – Transport layer multicasting.
How do we enhance transport?
– New TCP compatible protocol?
– Use an existing protocol (SCTP)?
– Evolutionary changes to TCP from DC perspective.

What’s wrong with TCP Congestion Control
TCP congestion control (CC) works independently for each connection
– By default TCP equalizes throughput → undesirable
– Sophisticated QoS can change this, but …
Lower level CC → backpressure on transport
– Transport layer congestion control is crucial
[Figure: two end hosts (App / transport / IP / MAC stacks) connected through switch, router, switch; congestion feedback (ECN/ICMP) reaches the transport layer congestion control.]

What’s wrong with QoS?
Elaborate mechanisms
– Intserv (RSVP), Diffserv, BW broker, …
… but a nightmare to use
– App knowledge, many parameters, sensitivity, …
What do we need?
– Simple/intuitive parameters
  – e.g., streaming or not, normal vs. premium, etc.
– Automatic estimation of BW needs.
– Application focus, not flow focus!
QoS relevant primarily under congestion
Fix TCP congestion control, use IP QoS sparingly.

TCP Congestion Control Enhancements
1) Collective control of all flows of an app
– Applicable to both TCP & UDP
– Ensures proportional fairness of multiple inter-related flows
– Tagging of connections to identify related flows.
2) Packet loss highly undesirable in DC
– Move towards a delay based TCP variant.
3) Multilevel coordination
– Socket vs. RDMA apps, TCP vs. UDP, …
– A layer above transport for coordination
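Item 2 asks for a delay-based TCP variant but does not name one. As a rough illustration only, the sketch below uses a TCP Vegas-style rule, which is an assumption swapped in here and not the authors' proposal. The function name and the `alpha`/`beta` thresholds are hypothetical. The window is adjusted from measured delay, so the sender backs off when queues build rather than after a packet is lost.

```python
def vegas_cwnd_update(cwnd, base_rtt, rtt, alpha=2.0, beta=4.0):
    """Return the next congestion window (in segments) using delay only.

    base_rtt: smallest RTT seen so far (approximates the uncongested path).
    rtt:      most recent RTT sample, in the same unit as base_rtt.
    alpha, beta: hypothetical lower/upper bounds on segments queued in the network.
    """
    if base_rtt <= 0 or rtt < base_rtt:
        raise ValueError("need 0 < base_rtt <= rtt")
    expected = cwnd / base_rtt               # rate if nothing were queued
    actual = cwnd / rtt                      # rate actually achieved
    queued = (expected - actual) * base_rtt  # estimated segments sitting in queues
    if queued < alpha:
        return cwnd + 1.0                    # path underused: grow
    if queued > beta:
        return max(cwnd - 1.0, 1.0)          # queues building: back off before any loss
    return cwnd                              # inside the target band: hold
```

Because the decision depends only on RTT samples, a rule of this kind fits the low-loss goal stated for the data center, at the cost of needing a trustworthy `base_rtt`.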

Collective Congestion Control
Control connections thru a congested device together (control set)
Determining control set is challenging
BW requirement estimated automatically during non-congested periods
[Figure: topology with clients CL1, CL2, servers S11, S13, S21, S23, and switches SW0, SW1, SW2; labeled "Cong. Control".]
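The slide says bandwidth requirements are estimated automatically during non-congested periods but gives no estimator. One plausible reading, shown here as a sketch under assumptions, is an exponentially weighted moving average that is updated only while the path is not congested. The class name, the `gain` value, and the `congested` flag are all hypothetical.

```python
class DemandEstimator:
    """Track an application's bandwidth demand from uncongested samples only."""

    def __init__(self, gain=0.25):
        self.gain = gain        # hypothetical smoothing factor in (0, 1]
        self.estimate = None    # Mb/s; None until the first uncongested sample

    def observe(self, throughput_mbps, congested):
        """Feed one throughput sample; return the current demand estimate."""
        if congested:
            # Throughput under congestion reflects the bottleneck, not the demand.
            return self.estimate
        if self.estimate is None:
            self.estimate = float(throughput_mbps)
        else:
            self.estimate += self.gain * (throughput_mbps - self.estimate)
        return self.estimate
```

Freezing the estimate during congestion keeps the measured demand from shrinking toward whatever share the flow happened to get, which is the property a collective controller needs when it later divides the bottleneck.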

Sample Collective Control
App 1: client1 → server1
– Database queries over a single connection
– Drives ~5.0 Mb/s BW
App 2: client2 → server1
– Similar to App 1
– Drives 2.5 Mb/s BW
App 3: client3 → server2
– FTP, starts at t=30 secs
– 25 conn., 8 Mb/s

Sample Results
[Figure: results plot, labeled "Cong. Control".]
Collective control highly desirable within a DC
Modified TCP can maintain 2:1 throughput ratio
– Also yields lower losses & smaller RTT.
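To see why a 2:1 ratio is the natural target, the sketch below compares two idealized allocations for the three sample applications. It is not the modified TCP evaluated in the talk. The 10 Mb/s bottleneck capacity and both function names are assumptions made for illustration; the demands and connection counts come from the sample setup.

```python
DEMANDS = {"app1": 5.0, "app2": 2.5, "app3": 8.0}   # Mb/s, from the sample setup
CONNECTIONS = {"app1": 1, "app2": 1, "app3": 25}    # connections per application
CAPACITY = 10.0                                     # Mb/s, assumed bottleneck

def per_flow_equal_shares(connections, capacity_mbps):
    """Idealized per-connection fairness: every connection gets the same rate."""
    total = sum(connections.values())
    return {app: capacity_mbps * n / total for app, n in connections.items()}

def collective_shares(demands_mbps, capacity_mbps):
    """Idealized collective control: scale each app's demand by a common factor."""
    total = sum(demands_mbps.values())
    if total <= capacity_mbps:
        return dict(demands_mbps)        # no congestion, everyone gets its demand
    scale = capacity_mbps / total
    return {app: d * scale for app, d in demands_mbps.items()}

equal = per_flow_equal_shares(CONNECTIONS, CAPACITY)
collective = collective_shares(DEMANDS, CAPACITY)
print("per-flow app1:app2 ratio  ", equal["app1"] / equal["app2"])
print("collective app1:app2 ratio", collective["app1"] / collective["app2"])
```

Under per-connection fairness the 25 FTP connections take most of the link and the two database applications end up with equal rates; scaling by application demand keeps App 1 at twice App 2.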

Adaptation to Media
Problem: TCP assumes loss implies congestion, and is designed for WAN (high loss/delay)
Effects:
– Wireless (e.g. UWB) attractive in DC (wiring reduction, mobility, self configuration) …
– … but TCP is not a suitable transport.
– Overkill for communications within a DC.
Solution: A self-adjusting transport
– Support multiple congestion/flow-control regimes.
– Automatically selected during connection setup.

High Availability Issues
Problem: Single failure → broken connection, weak robustness check, …
Effect: Difficult to achieve high availability.
[Figure: endpoints A and B connected by Path 1 and Path 2.]
Solution:
– Multi-homed connections w/ load sharing among paths.
– Ideally, controlled diversity & path management
– Difficult: need topology awareness, spanning tree problem, …
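The multi-homing idea can be pictured with a small sketch: one logical connection owns several paths, spreads traffic across the live ones, and keeps working when one fails. The class and method names are hypothetical, round-robin is only one possible load-sharing policy, and failure detection (heartbeats, timeouts) is assumed to happen elsewhere.

```python
class MultiHomedConnection:
    """One logical connection that shares load over several paths."""

    def __init__(self, paths):
        if not paths:
            raise ValueError("at least one path is required")
        self._paths = list(paths)
        self._alive = {p: True for p in self._paths}
        self._next = 0   # round-robin cursor

    def mark_failed(self, path):
        self._alive[path] = False

    def mark_recovered(self, path):
        self._alive[path] = True

    def pick_path(self):
        """Return the next live path in round-robin order."""
        for _ in range(len(self._paths)):
            path = self._paths[self._next]
            self._next = (self._next + 1) % len(self._paths)
            if self._alive[path]:
                return path
        raise ConnectionError("no usable path")
```

With the two paths between A and B, traffic alternates while both are up; after `mark_failed("Path 1")` every send goes over Path 2 and the connection itself survives the single failure.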

Summary & Call to Action
Data centers are evolving
– Transport must evolve too, but a difficult proposition
– TCP is heavily entrenched, change needs an industry wide effort
Call to Action
– Need to get an industry effort going to define
  – New features & their implementation
  – Deployment & compatibility issues.
– Change will need push from data center administrators & planners.

Additional Resources
Presentation can be downloaded from the IDF web site. When prompted, enter:
– Username: idf
– Password: fall2005
Additional backup slides
Several relevant papers available at http://kkant.ccwebhost.com/download.html
– Analysis of collective bandwidth control.
– SCTP performance in data centers.

Backup

Comparative Fabric Features
Feature                              TCP        SCTP       IBA
Scalability to 100 Gb/s              Difficult  Difficult  Easy?
Message based & ULP support          No         Yes        Yes
QoS friendly transport?              No         No         Yes
Virtual channel support              No         No         Yes
DC centric flow/cong. control        No         No         Yes
Point to multipoint communication    No         No         Yes
High availability features           Poor       Fair       Good
Offload latency (end-pt only)        ~1 us      >1 us      <0.5 us
Compatible w/ TCP/IP base            Yes        Limited
Unordered data delivery              No         Yes        Yes
Protection against DoS attacks       Poor       Good       Poor
Multiple traffic streams             No         Yes        Yes
DC requirements
TCP lacks many desirable features; SCTP has some

Transport Layer QoS
Needed at multiple levels
– Between transport uses
– Conn. of a given transport
– Logical streams
Requirements
– Must be compatible with lower level QoS
  – PCI-Exp, MAC, etc.
– Automatic estimation of bandwidth requirements
– Automatic BW control
[Figure: DB app (cntrl, data) and Web app (page: text, images) over iSCSI, ntwk, and IPC traffic, annotated with inter-app, intra-app, and intra-conn QoS levels.]
• May be on two VMs on same physical machine.
• Best BW subdivision to maximize performance?

Multicasting in DC
Software/patch distribution
– Multicast to all machines w/ same version.
– Characteristics
  – Medium to large file transfer
  – Time to finish matters, BW doesn’t.
  – Scale: 10s to 1000s.
High performance computing
– MPI collectives need multicasting
– Characteristics
  – Small but frequent transfers
  – Latency premium, BW not an issue mostly.
  – Scale: 10s to 100s

Transport layer multicasting
[Figure: two panels, IP multicasting and TL multicasting, each showing node A, subnet1, subnet2, and an outer router.]
DC needs                     IP multicasting                        TL multicasting
Legacy infrast.              Needs specialized routers              Std. routers adequate
Short msgs, dynamic group    Usually designed for long transfers    Appropriate mechanism?
Topology aware?              Yes (routing alg. based)               No (need new mechanisms)
Low overhead                 No (complex mgmt)                      Simpler, done in TL engine
Low latency                  Primarily BW focused                   Need latency centric design
Reliable mcast.              Built on top                           Part of TL

TL multicasting value
Assumptions
– A 16 node cluster w/ 4-node subclusters.
– Mcast group: 2 nodes in each sub-cluster
– Latencies:
  – endpt: 2 us, ack proc: 1 us, switch: 1 us
  – App-TL interface: 5 us
Latency w/o mcast
– send: 7x2 + 3x1 + 2 = 19 us
– ack: 1 + 3x1 + 7x1 = 11 us
– reply: 5 + 2 + 7x2 = 21 us
– Total: 19 + 11 + 21 = 51 us
Latency w/ mcast
– send: 3x2 + 3x1 + 2 + 2x(1+1) + 2 = 17 us
– ack: 1 + 1 + 2x1 + 3x1 + 3x1 = 10 us
– Total: 17 + 10 + 5 = 32 us
Larger savings in full network mcast.
[Figure: subnet1 through subnet4 connected by an outer router, with nodes A, B, C, D.]
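The latency sums on this slide can be re-derived mechanically. The sketch below copies each term from the slide as a (count, unit latency) pair and adds them up. It makes no claim about what each term represents beyond the slide's own labels; in particular, the name `app_tl` for the bare 5 us in the multicast total is an assumption based on the App-TL interface figure, and the function names are hypothetical.

```python
# Each term is (count, unit latency in microseconds), copied from the slide.
WITHOUT_MCAST = {
    "send":  [(7, 2), (3, 1), (1, 2)],
    "ack":   [(1, 1), (3, 1), (7, 1)],
    "reply": [(1, 5), (1, 2), (7, 2)],
}
WITH_MCAST = {
    "send":   [(3, 2), (3, 1), (1, 2), (2, 1 + 1), (1, 2)],
    "ack":    [(1, 1), (1, 1), (2, 1), (3, 1), (3, 1)],
    "app_tl": [(1, 5)],  # assumed label for the bare 5 us in the slide's total
}

def phase_totals_us(phases):
    """Sum count * unit latency for every phase."""
    return {name: sum(n * cost for n, cost in terms)
            for name, terms in phases.items()}

def grand_total_us(phases):
    return sum(phase_totals_us(phases).values())

without = grand_total_us(WITHOUT_MCAST)
with_mcast = grand_total_us(WITH_MCAST)
print(phase_totals_us(WITHOUT_MCAST), without)
print(phase_totals_us(WITH_MCAST), with_mcast)
print("saving (us):", without - with_mcast)
```

The totals agree with the slide (51 us and 32 us), a saving of 19 us for this eight-node group.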

Hierarchical Connections
Choose a “leader” in each subnet.
– Topology directed
Multicast connections to other nodes via leaders
– Ack consolidation at leaders (multicast)
– Msg consolidation at leaders (reverse multicast)
Done by a layer above? (layer 4.5?)
[Figure: sender A and an outer router connecting subnet1 through subnet4; leaders S2, S3, S4 each serve nodes n1 and n2.]
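Ack consolidation at leaders can be sketched as follows. This is an illustration under assumptions and not a protocol definition: the `Leader` class, its fields, and the helper function are hypothetical, and the layout of three leaders with two members each is only loosely modeled on the figure. Each leader collects acks from the members of its subnet and reports upstream once, so the sender handles one ack per subnet instead of one per node.

```python
from dataclasses import dataclass, field

@dataclass
class Leader:
    """Per-subnet leader that consolidates member acks for one multicast message."""
    members: frozenset
    acked: set = field(default_factory=set)

    def on_member_ack(self, node):
        """Record one member ack; return True when the consolidated ack is due."""
        if node not in self.members:
            raise KeyError(node)
        self.acked.add(node)
        return self.acked == set(self.members)

def acks_seen_by_sender(leaders, hierarchical):
    """Count acks the sender must process with or without consolidation."""
    if hierarchical:
        return len(leaders)                       # one consolidated ack per leader
    return sum(len(leader.members) for leader in leaders.values())

leaders = {name: Leader(frozenset({"n1", "n2"})) for name in ("S2", "S3", "S4")}
print(acks_seen_by_sender(leaders, hierarchical=False),
      acks_seen_by_sender(leaders, hierarchical=True))
```

The same structure run in the opposite direction, with leaders merging member messages before forwarding them, would correspond to the reverse multicast mentioned on the slide.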