SDN for the Cloud
Albert Greenberg
Distinguished Engineer
Director of Networking @ Microsoft Azure
Road to SDN
• WAN
  • Motivating scenario: network engineering at scale
  • Innovation: infer traffic, control routing, centralize control to meet network-wide goals
• Cloud
  • Motivating scenario: software-defined data center, requiring per-customer virtual networks
  • Innovation: VL2
    • Scale-out L3 fabric
    • Network virtualization at scale, enabling SDN and NFV
TomoGravity: ACM Sigmetrics 2013 Test of Time Award
RCP: Usenix NSDI 2015 Test of Time Award
4D: ACM Sigcomm 2015 Test of Time Award
VL2: ACM Sigcomm 2009
Cloud provided the killer scenario for SDN
• Cloud has the right scenarios
  • Economic and scale pressure: huge leverage
  • Control: huge degree of control to make changes in the right places
  • Virtualized data center for each customer: prior art fell short
• Cloud had the right developers and the right systems
  • High-scale, fault-tolerant distributed systems and data management
At Azure we changed everything because we had to,
from optics to host to NIC to physical fabric to WAN to
Edge/CDN to ExpressRoute (last mile)
Hyperscale Cloud
Microsoft Cloud Network: a 15 billion dollar bet
Growth from 2010 to 2015:
• Compute instances: 100K -> millions
• Azure Storage: 10's of PB -> exabytes
• Datacenter network: 10's of Tbps -> Pbps
• >85% of the Fortune 500 use the Microsoft Cloud
Scale
• >60 trillion Azure Storage objects
• 425 million Azure Active Directory users
• 1 out of 4 Azure VMs are Linux VMs
• >5 million requests/sec
• 1 trillion Azure Event Hubs events/month
• 1,400,000 SQL databases in Azure
• >18 billion Azure Active Directory authentications/week
• >93,000 new Azure customers a month
Agenda
• Consistent cloud design principles for SDN
• Physical and virtual networks, NFV
• Integration of enterprise & cloud, physical & virtual
• Future: reconfigurable network hardware (Azure SmartNIC)
• Demo
• Career advice
• Acknowledgements

Cloud Design Principles
• Scale-out, N-active data plane: embrace and isolate failures
• Centralized control plane: drive the network to target state
  • Resource managers service requests while meeting system-wide objectives
  • Controllers drive each component relentlessly to the target state
  • Stateless agents plumb the policies dictated by the controllers
These principles are built into every component.
Hyperscale Physical Networks
VL2 Azure Clos Fabrics with 40G NICs
[Clos fabric diagram: Regional Spine (T3) switches connect Data Center Spine (T2) switches, which connect Row Spine (T1-1…T1-8) switches, which connect rack (T0-1…T0-20) switches down to the servers]
Outcome of >10 years of history, with major
revisions every six months
[Diagram: the scale-out, active-active L3 fabric replaces the traditional scale-up, active-passive L2 design built around LB/FW appliance pairs]
Challenges of Scale
• Clos network management problem
  • Huge number of paths, ASICs, and switches to examine, with a dynamic set of gray failure modes, when chasing app latency issues at 99.995% levels
• Solution
  • Infrastructure for graph and state tracking to provide an app platform
  • Monitoring to drive out gray failures
  • Azure Cloud Switch OS to manage the switches as we do servers
Azure State Management System Architecture
[Diagram: applications such as Infrastructure Provisioning and Traffic Engineering propose changes; a Checker reconciles Observed State, Proposed State, and Target State; Monitors feed observed state and Updaters push the target state into the network]
The abstract network graph and state model is the basic programming paradigm (sketched below).
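To make the observed/proposed/target-state pattern concrete, here is a minimal Python sketch of one pass of such a control loop. All class names, state keys, and the safety invariant are invented for illustration; this is not Azure's actual controller API.

```python
# Minimal sketch of the observed/proposed/target-state pattern.
# All names and invariants here are hypothetical, not Azure's actual APIs.
from dataclasses import dataclass, field

@dataclass
class NetworkState:
    # e.g. {"link:t1-3/t2-7": "up", "device:t1-3": "in-service"}
    entries: dict = field(default_factory=dict)

def checker(observed: NetworkState, proposed: NetworkState,
            target: NetworkState) -> NetworkState:
    """Accept the proposed state only if it keeps the network safe
    (here: never drain more than 2 links at once), else keep the old target."""
    drained = [k for k, v in proposed.entries.items()
               if k.startswith("link:") and v == "drained"]
    return proposed if len(drained) <= 2 else target

def updater(observed: NetworkState, target: NetworkState) -> list[str]:
    """Emit the commands needed to drive observed state to target state."""
    return [f"set {key}={want}" for key, want in target.entries.items()
            if observed.entries.get(key) != want]

# One pass of the loop: monitor -> app proposal -> checker -> updater.
observed = NetworkState({"link:t1-3/t2-7": "up", "link:t1-4/t2-7": "up"})
target = NetworkState(dict(observed.entries))
proposed = NetworkState({**observed.entries, "link:t1-3/t2-7": "drained"})
target = checker(observed, proposed, target)
for cmd in updater(observed, target):
    print(cmd)   # stateless agents would apply these on the devices
```

The apps only compute proposed state; the checker enforces system-wide invariants; the updaters and agents stay stateless, which is what makes the pattern safe to run continuously at scale.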
[Diagram of an earlier-generation fabric: podsets of 48-server pods (17 pods per podset) connect through 16 spine switches with 64x10G links; border routers (n7k) each connect to every interconnect switch toward the GNS/WAN network; routing within and out of the fabric runs BGP/eBGP]
High-scale infrastructure is complex: multiple vendors, designs, software versions, failures.
Centralized control plane: drive the network to target state, for example for fault mitigation.
App I: Automatic Failure Mitigation
[Diagram: device and link monitors populate Observed State; the Fault Mitigation app generates Proposed State; the Checker reconciles it with Target State; link and device updaters then repair bad links and bad devices]
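As a concrete instance of the pattern, the failure-mitigation app pairs a monitor that spots lossy links with an updater that drains them, subject to a checker invariant. A hedged sketch follows; the loss threshold, drain limit, and link names are made up for the example and are not Azure's production logic.

```python
# Illustrative sketch of automatic link repair; thresholds and names are
# invented for the example.
LOSS_THRESHOLD = 0.01      # drain links losing more than 1% of probes
MAX_CONCURRENT_DRAINS = 2  # checker invariant: keep enough capacity up

def link_monitor(probe_stats: dict[str, tuple[int, int]]) -> set[str]:
    """Return links whose observed probe loss exceeds the threshold."""
    bad = set()
    for link, (sent, lost) in probe_stats.items():
        if sent and lost / sent > LOSS_THRESHOLD:
            bad.add(link)
    return bad

def propose_drains(bad_links: set[str], already_drained: set[str]) -> set[str]:
    """Fault-mitigation app: propose draining newly bad links."""
    return bad_links - already_drained

def check_and_apply(proposed: set[str], already_drained: set[str]) -> set[str]:
    """Checker + updater: admit drains only while the invariant holds."""
    applied = set()
    for link in sorted(proposed):
        if len(already_drained) + len(applied) >= MAX_CONCURRENT_DRAINS:
            break                      # leave the rest for the next pass
        applied.add(link)
        print(f"drain {link}")         # link updater action
    return applied

stats = {"t0-1/t1-3": (1000, 40), "t0-2/t1-3": (1000, 0), "t1-3/t2-7": (1000, 25)}
drained: set[str] = set()
drained |= check_and_apply(propose_drains(link_monitor(stats), drained), drained)
```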
App II: Traffic Engineering Towards High Utilization
[Diagram: LSP, service, and device & link monitors populate Observed State (routes, topology, link load, service demand); the Traffic Engineering app generates Proposed State; the Checker reconciles it with Target State; routing and service LB updaters add or remove LSPs and change weights]
[SWAN chart: link traffic in Gbps over a week (Sun-Sat), up to roughly 250 Gbps total, split into interactive (20-60 Gbps), elastic (~100 Gbps), and background (~100 Gbps) classes]
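The SWAN-style idea behind the chart is to fill capacity by traffic class: interactive first, then elastic, then background. The toy sketch below allocates a single link's capacity in priority order; a real TE system solves a network-wide optimization across many paths, and the numbers and class names here are illustrative only.

```python
# Toy sketch of class-priority bandwidth allocation on one link, in the
# spirit of SWAN; capacities, demands, and class names are illustrative.
LINK_CAPACITY_GBPS = 250.0

# Demands per class, highest priority first.
demands = [("interactive", 60.0), ("elastic", 100.0), ("background", 150.0)]

def allocate(capacity: float, demands: list[tuple[str, float]]) -> dict[str, float]:
    """Grant each class as much as possible in priority order."""
    grants = {}
    remaining = capacity
    for cls, want in demands:
        got = min(want, remaining)
        grants[cls] = got
        remaining -= got
    return grants

grants = allocate(LINK_CAPACITY_GBPS, demands)
for cls, gbps in grants.items():
    print(f"{cls:>11}: {gbps:6.1f} Gbps")
print(f"utilization: {sum(grants.values()) / LINK_CAPACITY_GBPS:.0%}")
```

Background traffic soaks up whatever interactive and elastic traffic leave behind, which is how the WAN is driven toward high utilization without hurting the latency-sensitive classes.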
Azure Scale Monitoring – Pingmesh
• Problem: Is it the app or the net explaining app latency issues?
• Solution: Measure the network latency between any two servers
• Full coverage, always-on, brute force
• Running in Microsoft DCs for nearly 5 years, generating 200+ billion probes every day
[Pingmesh latency heatmaps: normal operation, podset down, podset failure, and spine failure each produce a distinct visual signature]
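Conceptually, Pingmesh turns server-to-server probes into a latency matrix whose anomalies separate network problems from application problems. The single-process sketch below only illustrates the full-mesh idea; real Pingmesh runs an agent on every server against a centrally generated ping list, and the server names and latencies here are fabricated.

```python
# Minimal sketch of the Pingmesh idea: probe every server pair and
# aggregate latency percentiles; server names and latencies are fabricated.
import random
import statistics
from itertools import permutations

servers = [f"rack{r}-srv{s}" for r in range(3) for s in range(2)]

def probe(src: str, dst: str) -> float:
    """Stand-in for a real TCP/HTTP probe; returns latency in microseconds."""
    base = 50.0 if src.split("-")[0] == dst.split("-")[0] else 250.0
    return random.gauss(base, base * 0.1)

# Full mesh: every server probes every other server.
latency = {(s, d): probe(s, d) for s, d in permutations(servers, 2)}

# Aggregate per source; a shift confined to one rack or podset points at the
# network, not the application, as the cause of a latency regression.
for src in servers:
    samples = sorted(v for (s, _), v in latency.items() if s == src)
    p99 = samples[int(0.99 * (len(samples) - 1))]
    print(f"{src}: median={statistics.median(samples):.0f}us p99={p99:.0f}us")
```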
EverFlow: Packet-level Telemetry + Cloud Analytics
[Diagram: switches "match & mirror" selected packets to a set of collectors; filter & aggregation feeds distributed storage (Azure Storage) and analytics (Azure Analytics); the EverFlow controller, with a guided prober and reshuffler, answers queries about packet drops, loops, latency spikes, and load imbalance. Challenge: cloud-scale data.]
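The core EverFlow primitive is "match and mirror": switches mirror only packets matching rules of interest, and a flow hash keeps all of a flow's mirrored packets at the same collector so its path can be reconstructed. A hedged sketch of that dispatch logic follows; the rule set and collector names are invented.

```python
# Sketch of EverFlow-style match-and-mirror dispatch; the rules and
# collector names are invented for illustration.
import hashlib

COLLECTORS = ["collector-0", "collector-1", "collector-2"]

def matches(pkt: dict) -> bool:
    """Mirror only 'interesting' packets: TCP SYN/RST or a debug-marked packet."""
    return pkt.get("tcp_flags") in ("SYN", "RST") or pkt.get("debug_bit", False)

def pick_collector(pkt: dict) -> str:
    """Hash the 5-tuple so every mirrored packet of a flow reaches the same collector."""
    five_tuple = f"{pkt['src']},{pkt['dst']},{pkt['sport']},{pkt['dport']},{pkt['proto']}"
    digest = hashlib.md5(five_tuple.encode()).digest()
    return COLLECTORS[digest[0] % len(COLLECTORS)]

pkt = {"src": "10.1.1.2", "dst": "10.2.3.5", "sport": 9876, "dport": 80,
       "proto": "tcp", "tcp_flags": "SYN"}
if matches(pkt):
    print(f"mirror {pkt['src']}->{pkt['dst']} to {pick_collector(pkt)}")
```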
Azure Cloud Switch – an Open Way to Build a Switch OS
[Diagram: in user space, apps such as BGP, LLDP, SNMP, ACL, a hardware load balancer, and other SDN apps talk to a Switch State Service; sync agents plumb state through the Switch Abstraction Interface (SAI) to the switch ASIC SDK and driver, which program the switch ASIC, its Ethernet interfaces, and the front panel ports; provisioning and deployment automation manage the box like a server]
• SAI collaboration is industry wide
• SAI simplifies bringing up Azure Cloud Switch (Azure's switch OS) on new ASICs
• Switch control: drive to target state
• SAI is on GitHub
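The value of SAI is the layering: the switch OS programs one ASIC-neutral interface and each vendor supplies the backend. The real SAI is a C API published on GitHub; the Python sketch below only illustrates that layering, and every class and method name in it is invented.

```python
# Illustration of the SAI layering idea only: one ASIC-neutral interface,
# per-vendor backends. Names are invented; the real SAI is a C API.
from abc import ABC, abstractmethod

class SwitchAbstraction(ABC):
    """ASIC-neutral operations the switch OS apps (BGP, ACL, ...) rely on."""
    @abstractmethod
    def add_route(self, prefix: str, next_hop: str) -> None: ...
    @abstractmethod
    def set_port_admin_state(self, port: str, up: bool) -> None: ...

class VendorAAsic(SwitchAbstraction):
    def add_route(self, prefix, next_hop):
        print(f"[vendor-A SDK] program LPM {prefix} -> {next_hop}")
    def set_port_admin_state(self, port, up):
        print(f"[vendor-A SDK] port {port} {'up' if up else 'down'}")

class VendorBAsic(SwitchAbstraction):
    def add_route(self, prefix, next_hop):
        print(f"[vendor-B SDK] route {prefix} via {next_hop}")
    def set_port_admin_state(self, port, up):
        print(f"[vendor-B SDK] {port} admin={'enabled' if up else 'disabled'}")

def sync_agent(asic: SwitchAbstraction, routes: dict[str, str]) -> None:
    """Stateless agent: plumb controller-dictated routes onto whichever ASIC."""
    for prefix, nh in routes.items():
        asic.add_route(prefix, nh)

# The same switch OS code runs unchanged over either vendor's ASIC.
sync_agent(VendorAAsic(), {"10.1.0.0/16": "10.0.0.1"})
sync_agent(VendorBAsic(), {"10.1.0.0/16": "10.0.0.1"})
```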
Hyperscale Virtual Networks
Microsoft Azure Virtual Networks
[Diagram: enterprise and branch-office sites reach Azure over the Internet and ISP/MPLS links with QoS; customer VNets, each with its own address space (e.g. two VNets both using 10.1/16), are stitched together over L3 tunnels]
• Network Virtualization (VNet)
• Azure is the hub of your enterprise; reach branch offices via VPN
• VNet is the right abstraction, the counterpart of the VM for compute
• Efficient and scalable communication within and across VNets
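Under the covers, each VNet's customer addresses (which can overlap across tenants, as in the two 10.1/16 spaces above) are mapped to provider addresses of the hosting servers and carried over L3 tunnels. The toy directory-lookup sketch below illustrates the idea; the addresses, VNet names, and header fields are invented.

```python
# Toy sketch of VNet address virtualization: (vnet, customer address) is
# looked up in a directory to find the hosting server's provider address,
# then the packet is tunneled. Addresses and VNet names are invented.
directory = {
    ("vnet-blue", "10.1.1.5"): "100.64.2.17",   # blue tenant's 10.1/16
    ("vnet-red",  "10.1.1.5"): "100.64.9.41",   # red tenant reuses 10.1/16
}

def encapsulate(vnet: str, dst_ca: str, payload: bytes) -> dict:
    """Wrap a customer packet in an outer header toward the hosting server."""
    pa = directory.get((vnet, dst_ca))
    if pa is None:
        raise LookupError(f"no mapping for {dst_ca} in {vnet}")
    return {"outer_dst": pa, "vnet": vnet, "inner_dst": dst_ca, "payload": payload}

# Two tenants can use the same customer address without conflict.
print(encapsulate("vnet-blue", "10.1.1.5", b"hello")["outer_dst"])  # 100.64.2.17
print(encapsulate("vnet-red", "10.1.1.5", b"hello")["outer_dst"])   # 100.64.9.41
```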
Hyperscale SDN: All Policy is in the Host
[Diagram: the management, control, and data planes bundled inside a proprietary appliance are disaggregated in SDN: Azure Resource Manager forms the management plane, replicated controllers form the control plane, and the switch in each host forms the data plane]
Key Challenges for Hyperscale SDN Controllers
• Must scale up to 500k+ hosts in a region
• Needs to scale down to small deployments too
• Must handle millions of updates per day
• Must support frequent updates without downtime
Microsoft Azure Service Fabric: a platform for reliable, hyperscale, microservice-based applications
[Diagram: applications (App1, App2, …) run on the Service Fabric platform]
http://aka.ms/servicefabric
Regional Network Manager Microservices
[Diagram: a partitioned, federated controller built from microservices: a Gateway exposing the management REST APIs, a Service Manager (VMs, cloud services), a VNET Manager (VNETs, gateways), a Network Resource Manager (ACLs, routes, QoS), VIP and MAC Managers, and a Fabric Task Driver; it drives software load balancers, other appliances, and cluster network managers]
Regional Network Controller Stats
• 10s of millions of API calls per day
• API execution time
  • Read: <50 milliseconds
  • Write: <150 milliseconds
• Varying deployment footprint
  • Smallest: <10 hosts
  • Largest: >100 hosts
Write API transactions (examples): AddOrUpdateVnet, AllocateMacs, AllocateVips, AssociateToAclGroup, CreateOrUpdateNetworkService, DeleteNetworkService, FreeMac
Hyperscale Network Function Virtualization
Azure SLB: Scaling Virtual Network Functions
• Key idea: decompose load balancing into tiers to achieve a scale-out data plane and a centralized control plane
• Tier 1: distribute packets (Layer 3)
  • Routers running ECMP
• Tier 2: distribute connections (Layer 3-4)
  • Multiplexer, or Mux (connection hashing sketched below)
  • Enables high availability and scale-out
• Tier 3: virtualized network functions (Layer 3-7)
  • Examples: Azure VPN, Azure Application Gateway, third-party firewall
[Diagram: load-balanced IP packets flow from routers through muxes, which deliver load-balanced connections to VNF instances]
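Tier 1 spreads a VIP's packets across muxes with ECMP; each mux then hashes the connection 5-tuple to pick a backend DIP, so all packets of a connection land on the same instance. The sketch below is a simplified stand-in for that choice; the VIP, DIPs, and plain SHA-256 hash are illustrative, not the production algorithm.

```python
# Simplified sketch of an SLB mux choosing a backend (DIP) for a VIP
# connection by hashing the 5-tuple; addresses and hash choice are invented.
import hashlib

vip_to_dips = {"79.3.1.2": ["10.1.1.2", "10.1.1.3", "10.1.1.4"]}

def pick_dip(vip: str, src: str, sport: int, dport: int, proto: str) -> str:
    """Hash the connection so every packet of it reaches the same DIP."""
    dips = vip_to_dips[vip]
    key = f"{src},{vip},{sport},{dport},{proto}".encode()
    idx = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % len(dips)
    return dips[idx]

# Same connection always maps to the same DIP; and because every mux replica
# computes the same hash, it does not matter which mux ECMP delivers to.
print(pick_dip("79.3.1.2", "64.3.2.5", 6754, 80, "tcp"))
print(pick_dip("79.3.1.2", "64.3.2.5", 6754, 80, "tcp"))
```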
ExpressRoute: Direct Connectivity to the Cloud
[Diagram: the customer's network connects through a connectivity provider's infrastructure (physical WAN) and the Microsoft WAN to the customer's virtual networks in Azure; the Azure edge (a network virtual function) acts as a data center-scale distributed router with a centralized control plane and a distributed data plane]
Building a Hyperscale Host SDN
Virtual Filtering Platform (VFP)
• Acts as a virtual switch inside the Hyper-V VMSwitch
• Provides core SDN functionality for Azure networking services, including:
  • Address virtualization for VNET
  • VIP -> DIP translation for SLB
  • ACLs, metering, and security guards
• Uses programmable rule/flow tables to perform per-packet actions
• Supports all Azure data plane policy at 40GbE+ with offloads
• Coming to private cloud in Windows Server 2016
[Diagram: on host 10.4.1.5, the VM Switch runs VFP layers (ACLs/metering/security, VNET, SLB NAT) between the VMs' vNICs and the physical NIC]
Flow Tables: the Right Abstraction for the Host
• VMSwitch exposes a typed Match-Action-Table API to the controller
  • Controllers define policy
  • One table per policy
• Key insight: let the controller tell the switch exactly what to do with which packets
  • e.g. encap/decap, rather than trying to use existing abstractions (tunnels, …)
[Diagram: the controller turns the tenant description (VNet description, VNet routing policy, ACLs, NAT endpoints) into per-policy flow tables inside VFP, sitting between VM1 (10.1.1.2) and the NIC]
Example tables for VM1:
• VNET: TO 10.2/16 -> Encap to GW; TO 10.1.1.5 -> Encap to 10.5.1.7; TO !10/8 -> NAT out of VNET
• LB NAT: TO 79.3.1.2 -> DNAT to 10.1.1.2; TO !10/8 -> SNAT to 79.3.1.2
• ACLs: TO 10.1.1/24 -> Allow; TO 10.4/16 -> Block; TO !10/8 -> Allow
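The one-table-per-policy, first-match-wins idea behind those example tables can be sketched in a few lines. The rule syntax and helper below are simplified and are not VFP's actual API; "!" marks a negated prefix as in the slide.

```python
# Small sketch of per-policy match-action tables applied in order, in the
# spirit of the VFP example above; rule syntax is simplified, not VFP's API.
import ipaddress

def match(rule: str, dst: str) -> bool:
    """Rules like '10.2.0.0/16' or '!10.0.0.0/8' (negated prefix)."""
    negated = rule.startswith("!")
    net = ipaddress.ip_network(rule.lstrip("!"))
    inside = ipaddress.ip_address(dst) in net
    return inside != negated

# One table per policy; first matching rule wins within a table.
vnet_table = [("10.2.0.0/16", "encap to GW"),
              ("10.1.1.5/32", "encap to 10.5.1.7"),
              ("!10.0.0.0/8", "NAT out of VNET")]
nat_table = [("!10.0.0.0/8", "SNAT to 79.3.1.2")]
acl_table = [("10.1.1.0/24", "allow"), ("10.4.0.0/16", "block"),
             ("!10.0.0.0/8", "allow")]

def apply_tables(dst: str) -> list[str]:
    actions = []
    for table in (vnet_table, nat_table, acl_table):
        for rule, action in table:
            if match(rule, dst):
                actions.append(action)
                break
    return actions

print(apply_tables("64.3.2.5"))   # ['NAT out of VNET', 'SNAT to 79.3.1.2', 'allow']
print(apply_tables("10.4.1.9"))   # ['block']
```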
Table Typing / Flow Caching are Critical to Performance
• COGS in the cloud is driven by VM density: 40GbE is here
• First-packet actions can be complex
• Established-flow matches must be typed, predictable, and simple hash lookups (sketched below)
[Diagram: for Blue VM1 (10.1.1.2), the first packet of a connection traverses the VNET, LB NAT, and ACL tables; subsequent packets hit a per-connection cache]
Cached connections (5-tuple -> composed action):
• 10.1.1.2, 10.2.3.5, 80, 9876 -> DNAT + Encap to GW
• 10.1.1.2, 10.2.1.5, 80, 9876 -> Encap to 10.5.1.7
• 10.1.1.2, 64.3.2.5, 6754, 80 -> SNAT to 79.3.1.2
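The composed action from the first packet's trip through all the tables is cached against the connection 5-tuple, so established flows need only one hash lookup. A simplified sketch of that cache follows; the slow-path logic is a stand-in, not VFP's implementation.

```python
# Sketch of per-connection flow caching: the first packet of a connection
# pays the cost of walking every policy table, and the composed action is
# cached against the 5-tuple for subsequent packets. Simplified, not VFP.
flow_cache: dict[tuple, str] = {}

def slow_path(conn: tuple) -> str:
    """Stand-in for walking the VNET, LB NAT, and ACL tables."""
    src, dst, sport, dport = conn
    return "encap to 10.5.1.7" if dst.startswith("10.") else "SNAT to 79.3.1.2"

def process_packet(conn: tuple) -> str:
    action = flow_cache.get(conn)
    if action is None:                 # first packet: complex, table-driven
        action = slow_path(conn)
        flow_cache[conn] = action
    return action                      # established flow: single hash lookup

conn = ("10.1.1.2", "10.2.1.5", 9876, 80)
print(process_packet(conn))   # slow path populates the cache
print(process_packet(conn))   # fast path: cache hit
```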
RDMA/RoCEv2 at Scale in Azure
• RDMA addresses the high CPU cost and long latency tail of TCP
  • Zero CPU utilization at 40Gbps
  • μs-level E2E latency
• Running RDMA at scale
  • RoCEv2 for RDMA over commodity IP/Ethernet switches
  • Cluster-level RDMA
  • DCQCN* for end-to-end congestion control (*DCQCN is running on Azure NICs)
[Diagram: an application writes a local memory buffer at address A directly into a remote buffer at address B through the NICs; buffer B is filled without involving the remote CPU]
[Chart: RDMA latency reduction at the 99.9th percentile in Bing; the RDMA 99th and 99.9th percentile latencies sit well below the TCP 99th percentile]
Host SDN Scale Challenges
• The host network is scaling up: 1G -> 10G -> 40G -> 50G -> 100G
  • The driver is VM density (more VMs per host), reducing COGS
• Need the performance of hardware to implement policy without burning CPU
• Need to support new scenarios: BYO IP, BYO topology, BYO appliance
  • We are always pushing richer semantics to virtual networks
• Need the programmability of software to be agile and future-proof
How do we get the performance of hardware with the programmability of software?
Azure SmartNIC
• Use an FPGA for reconfigurable functions
  • FPGAs are already used in Bing (Catapult)
  • Roll out hardware as we do software
• Programmed using Generic Flow Tables (GFT)
  • A language for programming SDN to hardware
  • Uses connections and structured actions as primitives
• SmartNIC can also do crypto, QoS, storage acceleration, and more…
[Diagram: the SmartNIC (NIC ASIC plus FPGA) sits between the host CPU and the ToR switch; VFP in the VMSwitch processes the first packet of each connection and, via the GFT offload API (NDIS), pushes the composed action (e.g. Decap, DNAT, Rewrite, Meter; 1.2.3.1->1.3.4.1, 62362->80) into the GFT table; the 50G SmartNIC's GFT offload engine then handles subsequent packets, alongside crypto, QoS, and RDMA]
[Diagram: controllers program per-policy rule/action tables (SLB decap, SLB NAT, VNET, ACL, metering); the SmartNIC transposition engine composes them (Decap*, DNAT*, Rewrite*, Allow*, Meter*) into a single encap/rewrite action per connection, as sketched below]
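GFT's unit of offload is a connection plus a composed ("transposed") action: the per-layer transformations are folded into one header rewrite the NIC applies to every packet of the flow. The toy sketch below illustrates that composition; the layer functions and field names are invented for the example.

```python
# Toy sketch of composing per-layer actions into one GFT-style header
# transposition applied per packet; layer logic and field names are invented.
from functools import reduce

def slb_dnat(hdr):      # VIP -> DIP translation
    return {**hdr, "dst": "10.1.1.2", "dport": 80}

def vnet_encap(hdr):    # add an outer header toward the hosting server
    return {**hdr, "outer_dst": "100.64.2.17"}

def meter(hdr):         # mark the flow for billing/metering
    return {**hdr, "metered": True}

LAYERS = [slb_dnat, vnet_encap, meter]

def transpose(hdr: dict) -> dict:
    """Compose all per-layer rewrites into the final per-packet action."""
    return reduce(lambda h, layer: layer(h), LAYERS, hdr)

pkt = {"src": "64.3.2.5", "dst": "79.3.1.2", "sport": 6754, "dport": 80}
print(transpose(pkt))   # the hardware replays this one composed rewrite per packet
```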
Demo: SmartNIC Encryption
Closing Thoughts
• Cloud scale and financial pressure unblocked SDN
  • Control and systems developed earlier for compute, storage, and power helped
• Moore's Law helped: on the order of 7B transistors per ASIC
• We did not wait for a moment for standards or for vendor persuasion
• SDN realized through consistent application of the principles of cloud design
  • Embrace and isolate failure
  • Centralize (partition, federate) control and relentlessly drive to target state
• Microsoft Azure re-imagined networking, created SDN, and it paid off
Career Advice
• Cloud
• Software leverage and agility
  • Even for hardware people
• Quantity time
  • With the team, with the project
  • Hard infrastructure problems take >3 years, but it's worth it
• Usage and its measurement are oxygen for ideas
  • Quick wins (3 years is a long time)
  • Foundation and proof that the innovation matters
Shout-Out to Colleagues & Mentors
AT&T & Bell Labs
• Han Nguyen, Brian Freeman, Jennifer Yates, … and the entire AT&T Labs team
• Alumni: Dave Belanger, Rob Calderbank, Debasis Mitra, Andrew Odlyzko, Eric Sumner Jr.
Microsoft Azure
• Reza Baghai, Victor Bahl, Deepak Bansal, Yiqun Cai, Luis Irun-Briz, Alireza Dabagh, Nasser Elaawar, Gopal Kakivaya, Yousef Khalidi, Chuck Lenzmeier, Dave Maltz, Aaron Ogus, Parveen Patel, Mark Russinovich, Murari Sridharan, Marne Staples, Junhua Wang, Jason Zander, and the entire Azure team
• Alumni: Arne Josefsberg, James Hamilton, Randy Kern, Joe Chau, Changhoon Kim, Parantap Lahiri, Clyde Rodriguez, Amitabh Srivastava, Sumeet Singh, Haiyong Wang
Academia
• Nick McKeown, Jennifer Rexford, Hui Zhang
MSFT @ SIGCOMM'15
• DCQCN: congestion control for large RDMA deployments; 2000x reduction in pause frames, 16x better performance
• Everflow: packet-level telemetry for large DC networks; 106x reduction in trace overhead, pinpoint accuracy
• Silo: virtual networks with guaranteed bandwidth and latency; no changes to the host stack!
• Iridium: low-latency geo-distributed analytics; 19x query speedup, 64% reduction in WAN traffic
• Corral: joint data & compute placement for big data jobs; 56% reduction in completion time over Yarn
• Eden: enable network functions at the end host; dynamic, stateful policies, F#-based API
• R2C2: a network stack for rack-scale computing; the rack is the new building block!
• Hopper: speculation-aware scheduling of big-data jobs; 66% speedup of production queries
• PingMesh: DC network latency measurement and analysis; 200 billion probes per day!
http://research.microsoft.com/en-us/um/cambridge/events/sigcomm2015/papers.html