Data Center Networking
CS 6250: Computer Networking, Fall 2011


Cloud Computing

• Elastic resources
  – Expand and contract resources
  – Pay-per-use
  – Infrastructure on demand

• Multi-tenancy
  – Multiple independent users
  – Security and resource isolation
  – Amortize the cost of the (shared) infrastructure

• Flexible service management
  – Resiliency: isolate failures of servers and storage
  – Workload movement: move work to other locations


Data Center Trends

By J. Nicholas Hoover, InformationWeek, June 17, 2008

200 million Euro

Data centers will keep growing in the cloud computing era to benefit from economies of scale, moving from tens of thousands to hundreds of thousands of servers.

Key concerns in data center management:
– Economies of scale
– High utilization of equipment
– Maximizing revenue
– Amortizing administration costs
– Low power consumption (http://www.energystar.gov/ia/partners/prod_development/downloads/EPA_Datacenter_Report_Congress_Final1.pdf)

http://gigaom.com/cloud/google-builds-megadata-center-in-finland/

http://www.datacenterknowledge.com

Cloud Service Models

• Software as a Service
  – Provider licenses applications to users as a service
  – e.g., customer relationship management, email, …
  – Avoids costs of installation, maintenance, patches, …

• Platform as a Service
  – Provider offers a software platform for building applications
  – e.g., Google's App Engine
  – Avoids worrying about scalability of the platform

• Infrastructure as a Service
  – Provider offers raw computing, storage, and network
  – e.g., Amazon's Elastic Compute Cloud (EC2)
  – Avoids buying servers and estimating resource needs


Multi-Tier Applications

• Applications consist of tasks
  – Many separate components
  – Running on different machines

• Commodity computers
  – Many general-purpose computers
  – Not one big mainframe
  – Easier scaling


[Figure: multi-tier application topology. A front-end server fans requests out to a tree of aggregators, which in turn fan out to workers.]
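The front-end/aggregator/worker structure above is often called the partition/aggregate pattern. Below is a minimal Python sketch of the idea; the worker logic, fan-out, and function names are illustrative assumptions, not taken from the slides.

# Partition/aggregate sketch: the front end fans a request out through
# aggregators to workers, then combines the partial results on the way back up.

def worker(shard: int, query: str) -> list:
    # Illustrative: each worker handles its own shard of the data.
    return [f"result-{shard}-{query}"]

def aggregator(shards: list, query: str) -> list:
    partials = [worker(s, query) for s in shards]        # fan out to workers
    return [r for part in partials for r in part]        # merge partial results

def front_end(query: str, num_aggregators: int = 2, shards_per_agg: int = 2) -> list:
    results = []
    for a in range(num_aggregators):                     # fan out to aggregators
        shards = list(range(a * shards_per_agg, (a + 1) * shards_per_agg))
        results.extend(aggregator(shards, query))
    return results

print(front_end("cats"))   # four partial results, one per shard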

Enabling Technology: Virtualization

• Multiple virtual machines on one physical machine
• Applications run unmodified, as on a real machine
• A VM can migrate from one computer to another

Status Quo: Virtual Switch in Server


Top-of-Rack Architecture

• Rack of servers
  – Commodity servers
  – And top-of-rack switch

• Modular design
  – Preconfigured racks
  – Power, network, and storage cabling

• Aggregate to the next level


Modularity

• Containers

• Many containers


Data Center Challenges

• Traffic load balancing
• Support for VM migration
• Achieving bisection bandwidth
• Power savings / cooling
• Network management (provisioning)
• Security (dealing with multiple tenants)

Data Center Costs (Monthly Costs)

• Servers: 45%
  – CPU, memory, disk

• Infrastructure: 25%
  – UPS, cooling, power distribution

• Power draw: 15%
  – Electrical utility costs

• Network: 15%
  – Switches, links, transit


http://perspectives.mvdirona.com/2008/11/28/CostOfPowerInLargeScaleDataCenters.aspx


Common Data Center Topology

[Figure: common data center topology. Layer-3 core routers connect the data center to the Internet, layer-2/3 aggregation switches sit in the middle, and layer-2 access switches connect to the servers.]

Data Center Network Topology

[Figure: data center network topology. Core routers (CR) connect to the Internet, access routers (AR) sit below them, Ethernet switches (S) below those, and racks of application servers (A) at the bottom; roughly 1,000 servers per pod.]

Requirements for Future Data Centers
• To keep up with the trend toward mega data centers, DCN technology should meet the following requirements:
  – High scalability
  – Transparent VM migration (high agility)
  – Easy deployment requiring little human administration
  – Efficient communication
  – Loop-free forwarding
  – Fault tolerance

• Current DCN technology cannot meet these requirements:
  – Layer 3 protocols cannot support transparent VM migration.
  – Current layer 2 protocols do not scale, due to forwarding-table size and native broadcasting for address resolution.


Problems with Common Topologies

• Single point of failure

• Oversubscription of links higher up in the topology

• Tradeoff between cost and provisioning

Capacity Mismatch

[Figure: capacity mismatch. The same tree topology with oversubscription ratios of roughly 5:1, 40:1, and 200:1, increasing toward the core.]

Data-Center Routing

[Figure: data-center routing. The core routers (CR) and access routers (AR) form the layer-3 portion of the topology (DC-Layer 3); the Ethernet switches (S) and racks of application servers (A) form the layer-2 portion (DC-Layer 2); roughly 1,000 servers per pod, and each pod is an IP subnet.]

Reminder: Layer 2 vs. Layer 3
• Ethernet switching (layer 2)
  – Cheaper switch equipment
  – Fixed addresses and auto-configuration
  – Seamless mobility, migration, and failover

• IP routing (layer 3)
  – Scalability through hierarchical addressing
  – Efficiency through shortest-path routing
  – Multipath routing through equal-cost multipath

• So, as in enterprises…
  – Data centers often connect layer-2 islands by IP routers


Need for Layer 2

• Certain monitoring apps require servers with the same role to be on the same VLAN

• Using the same IP address on dual-homed servers

• Allows organic growth of server farms

• Migration is easier


Review of Layer 2 & Layer 3
• Layer 2
  – One spanning tree for the entire network
    • Prevents loops
    • Ignores alternate paths

• Layer 3
  – Shortest-path routing between source and destination
  – Best-effort delivery


Fat-Tree-Based Solution

• Connect end hosts together using a fat-tree topology
  – Infrastructure consists of cheap commodity devices

• Each port supports the same speed as the end hosts
  – All devices can transmit at line speed if packets are distributed along existing paths
  – A fat tree built from k-port switches can support k^3/4 hosts (see the sketch below)
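For concreteness, here is a minimal Python sketch of the standard k-ary fat-tree arithmetic behind the k^3/4 figure; the function name and returned fields are illustrative.

def fat_tree_size(k: int) -> dict:
    """Sizes for a fat tree built from k-port switches (k must be even)."""
    assert k % 2 == 0
    return {
        "pods": k,                               # k pods
        "edge_switches": k * (k // 2),           # k/2 edge switches per pod
        "aggregation_switches": k * (k // 2),    # k/2 aggregation switches per pod
        "core_switches": (k // 2) ** 2,          # (k/2)^2 core switches
        "hosts": k ** 3 // 4,                    # k/2 hosts per edge switch
    }

print(fat_tree_size(4))    # 16 hosts, 4 core switches
print(fat_tree_size(48))   # 27,648 hosts from 48-port switches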


Fat-Tree Topology


Problems with a Vanilla Fat-tree

• Layer 3 will only use one of the existing equal cost paths

• Packet re-ordering occurs if layer 3 blindly takes advantage of path diversity


Modified Fat Tree

• Enforce a special addressing scheme in the data center
  – Allows hosts attached to the same switch to route only through that switch
  – Allows intra-pod traffic to stay within the pod
  – Address format: unused.PodNumber.SwitchNumber.EndHost

• Use two-level lookups to distribute traffic and maintain packet ordering


Two-Level Lookups

• First level is a prefix lookup
  – Used to route down the topology to the end host

• Second level is a suffix lookup
  – Used to route up toward the core
  – Diffuses and spreads out traffic
  – Maintains packet ordering by using the same port for the same end host (see the sketch below)
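A minimal Python sketch of the prefix-then-suffix lookup, using a toy table in the spirit of the fat-tree addressing scheme; the specific addresses and port numbers are illustrative.

# Two-level lookup: prefix entries route "down" toward hosts below this switch;
# anything that misses falls through to suffix entries that route "up", spreading
# different destination hosts over different uplinks while keeping all packets
# for one destination on the same uplink (so ordering is preserved).

prefix_table = [
    ("10.2.0.", 0),   # subnet behind edge switch 10.2.0.1 -> port 0
    ("10.2.1.", 1),   # subnet behind edge switch 10.2.1.1 -> port 1
]
suffix_table = [
    (".2", 2),        # destination hosts ending in .2 -> uplink port 2
    (".3", 3),        # destination hosts ending in .3 -> uplink port 3
]

def lookup(dst_ip: str) -> int:
    for prefix, port in prefix_table:      # first level: prefix match
        if dst_ip.startswith(prefix):
            return port
    for suffix, port in suffix_table:      # second level: suffix match
        if dst_ip.endswith(suffix):
            return port
    raise KeyError("no route for " + dst_ip)

print(lookup("10.2.1.3"))   # intra-pod traffic stays in the pod -> port 1
print(lookup("10.3.0.2"))   # inter-pod traffic spreads over uplinks -> port 2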


Diffusion Optimizations

• Flow classification
  – Eliminates local congestion
  – Assigns traffic to ports on a per-flow basis instead of a per-host basis

• Flow scheduling
  – Eliminates global congestion
  – Prevents long-lived flows from sharing the same links
  – Assigns long-lived flows to different links
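A minimal Python sketch of the two ideas: hash-based per-flow port assignment (classification) and moving long-lived flows onto the least-loaded uplink (scheduling). The threshold, port numbers, and names are illustrative assumptions.

import hashlib

UPLINK_PORTS = [2, 3]                    # uplinks of one switch (illustrative)
ELEPHANT_BYTES = 10 * 1024 * 1024        # toy threshold for a "long-lived" flow

flow_port = {}        # flow 5-tuple -> assigned uplink port
flow_bytes = {}       # flow 5-tuple -> bytes seen so far

def classify(flow) -> int:
    """Per-flow (not per-host) assignment: every packet of a flow uses one
    uplink, so there is no reordering, while different flows spread out."""
    if flow not in flow_port:
        h = int(hashlib.md5(repr(flow).encode()).hexdigest(), 16)
        flow_port[flow] = UPLINK_PORTS[h % len(UPLINK_PORTS)]
    return flow_port[flow]

def account(flow, nbytes: int) -> None:
    """Toy flow scheduling: once a flow looks long-lived, reassign it to the
    currently least-loaded uplink so elephants do not share the same link."""
    flow_bytes[flow] = flow_bytes.get(flow, 0) + nbytes
    if flow_bytes[flow] > ELEPHANT_BYTES:
        load = {p: 0 for p in UPLINK_PORTS}
        for f, p in flow_port.items():
            load[p] += flow_bytes.get(f, 0)
        flow_port[flow] = min(load, key=load.get)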


Drawbacks

• No inherent support for VLAN traffic

• Data center is fixed in size

• Ignores connectivity to the Internet

• Wastes address space
  – Requires NAT at the border

Data Center Traffic Engineering

Challenges and Opportunities


Wide-Area Network

[Figure: wide-area view. Clients reach replicated data centers over the Internet; DNS-based site selection directs each client to one data center's servers.]

Wide-Area Network: Ingress Proxies

[Figure: wide-area view with ingress proxies. Client traffic enters through proxies, which relay requests to servers in the data centers.]

Traffic Engineering Challenges

• Scale
  – Many switches, hosts, and virtual machines

• Churn
  – Large number of component failures
  – Virtual machine (VM) migration

• Traffic characteristics
  – High traffic volume and dense traffic matrix
  – Volatile, unpredictable traffic patterns

• Performance requirements
  – Delay-sensitive applications
  – Resource isolation between tenants

Traffic Engineering Opportunities
• Efficient network
  – Low propagation delay and high capacity

• Specialized topology
  – Fat tree, Clos network, etc.
  – Opportunities for hierarchical addressing

• Control over both network and hosts
  – Joint optimization of routing and server placement
  – Can move network functionality into the end host

• Flexible movement of workload
  – Services replicated at multiple servers and data centers
  – Virtual machine (VM) migration


PortLand: Main Idea
[Figure: PortLand operation. Adding a new host and forwarding a packet.]

Key features:
– Layer 2 protocol based on a tree topology
– PMACs (pseudo MAC addresses) encode position information (see the sketch below)
– Data forwarding proceeds based on PMACs
– Edge switches are responsible for the mapping between PMAC and AMAC (actual MAC)
– The fabric manager is responsible for address resolution
– Edge switches make PMACs invisible to end hosts
– Each switch can identify its own position by itself
– The fabric manager keeps information about the overall topology; when a fault occurs, it notifies the affected nodes.
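In the PortLand paper, the PMAC packs a host's location into the 48 bits of a MAC address in the form pod.position.port.vmid (16/8/8/16 bits). A minimal Python sketch of that encoding; the example values are illustrative.

def encode_pmac(pod: int, position: int, port: int, vmid: int) -> str:
    """Pack location fields into a 48-bit pseudo MAC: pod(16).position(8).port(8).vmid(16)."""
    value = (pod << 32) | (position << 24) | (port << 16) | vmid
    return ":".join(f"{b:02x}" for b in value.to_bytes(6, "big"))

def decode_pmac(pmac: str):
    value = int(pmac.replace(":", ""), 16)
    return ((value >> 32) & 0xFFFF,   # pod
            (value >> 24) & 0xFF,     # position of the edge switch in the pod
            (value >> 16) & 0xFF,     # switch port the host is attached to
            value & 0xFFFF)           # VM id on that port

pmac = encode_pmac(pod=2, position=1, port=0, vmid=1)
print(pmac)                 # 00:02:01:00:00:01
print(decode_pmac(pmac))    # (2, 1, 0, 1)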

Questions from the Discussion Board
• Questions about the fabric manager
  o How do we make the fabric manager robust?
  o How do we build a scalable fabric manager?
  o Redundant deployment or a clustered fabric manager could be a solution.

• Question about the realism of the baseline tree topology
  o Is the tree topology common in the real world?
  o Yes, the multi-rooted tree has been a traditional data center topology.
  [A Scalable, Commodity Data Center Network Architecture, Mohammad Al-Fares et al., SIGCOMM '08]

• Discussion points
  o Is PortLand applicable to other topologies?
  o The idea of central ARP management could be applied elsewhere. To avoid forwarding loops, a TRILL-like header would be necessary. The benefits of PMAC would shrink, since other topologies would require larger forwarding tables.

• Questions about the benefits of VM migration
  o VM migration helps reduce the traffic going through aggregation/core switches.
  o What about changes in user requirements?
  o What about power consumption?

• Etc.
  o Feasibility of mimicking PortLand with a layer 3 protocol; what about using pseudo IP addresses?
  o Delay to boot up the whole data center

Status Quo: Conventional DC Network

Reference – “Data Center: Load balancing Data Center Services”, Cisco 2004

[Figure: conventional data center network. Core routers (CR, L3) connect to the Internet; access routers (AR, L3), Ethernet switches (S, L2), and racks of application servers (A) sit below; roughly 1,000 servers per pod, and each pod is an IP subnet.]

Conventional DC Network Problems

[Figure: the same topology with oversubscription ratios of roughly 5:1, 40:1, and 200:1, increasing toward the core.]

• Dependence on high-cost proprietary routers
• Extremely limited server-to-server capacity

And More Problems …

[Figure: servers partitioned into IP subnets (VLAN #1, VLAN #2) beneath an oversubscribed (~200:1) core.]

• Resource fragmentation, significantly lowering utilization (and cost-efficiency)
• Complicated manual L2/L3 re-configuration

All We Need is Just a Huge L2 Switch, or an Abstraction of One

[Figure: the multi-tier topology of core routers, access routers, and switches collapsed into the abstraction of a single huge layer-2 switch connecting all servers.]

VL2 Approach

• Layer 2 based, using future commodity switches
• Hierarchy has two types of switches:
  – Access switches (top of rack)
  – Load-balancing switches

• Eliminate spanning tree
  – Flat routing
  – Allows the network to take advantage of path diversity

• Prevent MAC address learning
  – 4D-style architecture to distribute data-plane information
  – ToR switches: only need to learn addresses of the intermediate switches
  – Core switches: learn addresses of the ToR switches

• Support efficient grouping of hosts (VLAN replacement)


VL2


VL2 Components

• Top-of-rack (ToR) switch
  – Aggregates traffic from the ~20 end hosts in a rack
  – Performs IP-to-MAC translation

• Intermediate switch
  – Disperses traffic
  – Balances traffic among switches
  – Used for Valiant load balancing

• Decision element
  – Places routes in switches
  – Maintains a directory service of IP-to-MAC mappings

• End host
  – Performs IP-to-MAC lookups


Routing in VL2

• The end host checks its flow cache for the MAC of the flow's destination
  – If not found, it asks the local agent to resolve it
  – The agent returns a list of MACs for the server and MACs for the intermediate switches

• Traffic is sent to the top-of-rack switch
  – Traffic is triple-encapsulated

• Traffic is sent to an intermediate switch

• Traffic is sent to the top-of-rack switch of the destination (see the sketch below)
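A minimal Python sketch of those forwarding steps: agent lookup, then encapsulation toward a random intermediate switch and the destination's ToR. The addresses, the directory contents, and the use of a random choice instead of an ECMP hash are illustrative assumptions, not VL2's actual implementation.

import random

# Illustrative directory: application address (AA) -> locator address (LA) of its ToR.
directory = {"10.0.0.7": "20.0.1.1", "10.0.0.9": "20.0.2.1"}
intermediate_las = ["30.0.0.1", "30.0.0.2", "30.0.0.3"]   # a single anycast LA in real VL2

flow_cache = {}

def send(dst_aa: str, payload: bytes):
    if dst_aa not in flow_cache:              # cache miss: ask the agent/directory
        flow_cache[dst_aa] = directory[dst_aa]
    dst_tor = flow_cache[dst_aa]
    inter = random.choice(intermediate_las)   # bounce traffic off an intermediate switch
    # Triple encapsulation: outer header to the intermediate switch, then to the
    # destination ToR, then the original AA-addressed packet.
    return (inter, (dst_tor, (dst_aa, payload)))

print(send("10.0.0.9", b"hello"))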

The Illusion of a Huge L2 Switch

1. L2 semantics

2. Uniform high capacity

3. Performance isolation


Objectives and Solutions

Objective                                 | Approach                                            | Solution
1. Layer-2 semantics                      | Employ flat addressing                              | Name-location separation & resolution service
2. Uniform high capacity between servers  | Guarantee bandwidth for hose-model traffic          | Flow-based random traffic indirection (Valiant LB)
3. Performance isolation                  | Enforce hose model using existing mechanisms only   | TCP

Name/Location Separation

[Figure: name/location separation. Servers use flat names; switches run link-state routing and maintain only the switch-level topology. A directory service maps each server name to its current ToR switch (x to ToR2, y to ToR3, z to ToR4, then z to ToR3 after migration); the sender's agent performs a lookup and encapsulates packets toward that ToR. This copes with host churn with very little overhead.]

• Allows the use of low-cost switches
• Protects the network and hosts from host-state churn
• Obviates host and switch reconfiguration
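A minimal Python sketch of the directory idea in the figure: flat server names map to ToR locators, and a VM migration changes only that one mapping, with no switch reconfiguration. The names and the migrate helper are illustrative.

# Directory service: flat application name -> locator of the current ToR switch.
directory = {"x": "ToR2", "y": "ToR3", "z": "ToR4"}

def resolve(name: str) -> str:
    """Lookup performed by the sender's agent before encapsulating a packet."""
    return directory[name]

def migrate(name: str, new_tor: str) -> None:
    """When a VM moves, only its directory entry changes; switches keep routing
    on the unchanged switch-level topology."""
    directory[name] = new_tor

print(resolve("z"))      # ToR4
migrate("z", "ToR3")     # z's VM migrates to the rack under ToR3
print(resolve("z"))      # ToR3: new flows are encapsulated to the new ToR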

Clos Network Topology

[Figure: Clos network topology. Intermediate (Int) switches at the top, K aggregation (Aggr) switches with D ports each in the middle, and ToR switches at the bottom, each ToR connecting 20 servers, for 20*(DK/4) servers in total. Offers huge aggregate capacity and multiple paths at modest cost.]

D (# of 10G ports) | Max DC size (# of servers)
48                 | 11,520
96                 | 46,080
144                | 103,680
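The table above follows directly from the 20*(DK/4) formula in the figure when the number of aggregation switches equals the port count (K = D); a quick Python check (the function name is illustrative):

def vl2_max_servers(d_ports: int, k_aggr=None) -> int:
    """Servers supported by the Clos topology: 20 per ToR, D*K/4 ToRs."""
    k = d_ports if k_aggr is None else k_aggr   # the table assumes K = D
    return 20 * (d_ports * k // 4)

for d in (48, 96, 144):
    print(d, vl2_max_servers(d))   # 11520, 46080, 103680: matches the table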

Valiant Load Balancing: Indirection

[Figure: Valiant load balancing via indirection. Each sender encapsulates traffic to an anycast address (IANY) shared by the intermediate switches, which bounce it down to the destination ToR (T1 through T6); some links carry up paths and others carry down paths. This copes with arbitrary traffic matrices with very little overhead.]

[ECMP + IP Anycast]
• Harness huge bisection bandwidth
• Obviate esoteric traffic engineering or optimization
• Ensure robustness to failures
• Work with switch mechanisms available today

Equal-cost multipath forwarding:
1. Must spread traffic
2. Must ensure destination independence (see the sketch below)
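A minimal Python sketch of those two requirements: an ECMP-style hash of the flow's 5-tuple spreads flows across the intermediate switches, and because different flows to the same destination hash differently, the spreading does not depend on the destination alone. Switch names are illustrative.

import hashlib

INTERMEDIATE_SWITCHES = ["I1", "I2", "I3", "I4"]   # all advertise the same anycast LA

def pick_intermediate(flow_5tuple) -> str:
    """Hash the whole 5-tuple so (1) traffic is spread roughly uniformly and
    (2) flows to the same destination still land on different intermediates;
    all packets of one flow stay on one path, avoiding reordering."""
    h = int(hashlib.md5(repr(flow_5tuple).encode()).hexdigest(), 16)
    return INTERMEDIATE_SWITCHES[h % len(INTERMEDIATE_SWITCHES)]

print(pick_intermediate(("10.0.0.7", "10.0.0.9", 51000, 80, "tcp")))
print(pick_intermediate(("10.0.0.8", "10.0.0.9", 51001, 80, "tcp")))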


Properties of Desired Solutions
• Backwards compatible with existing infrastructure
  – No changes to applications
  – Support for layer 2 (Ethernet)

• Cost-effective
  – Low power consumption & heat emission
  – Cheap infrastructure

• Allows host communication at line speed
• Zero configuration
• No loops
• Fault-tolerant

Research Questions

• What topology to use in data centers?
  – Reducing wiring complexity
  – Achieving high bisection bandwidth
  – Exploiting capabilities of optics and wireless

• Routing architecture?
  – Flat layer-2 network vs. hybrid switch/router
  – Flat vs. hierarchical addressing

• How to perform traffic engineering?
  – Over-engineering vs. adapting to load
  – Server selection, VM placement, or optimizing routing

• Virtualization of NICs, servers, switches, …

Research Questions
• Rethinking TCP congestion control?
  – Low propagation delay and high bandwidth
  – "Incast" problem leading to bursty packet loss

• Division of labor for TE, access control, …
  – VM, hypervisor, ToR, and core switches/routers

• Reducing energy consumption
  – Better load balancing vs. selective shutting down

• Wide-area traffic engineering
  – Selecting the least-loaded or closest data center

• Security
  – Preventing information leakage and attacks
