
SCALE-OUT NETWORKING IN THE DATA CENTER


SCALE-OUT ARCHITECTURES SUPPORTING FLEXIBLE, INCREMENTAL SCALABILITY ARE COMMON FOR COMPUTING AND STORAGE. HOWEVER, THE NETWORK REMAINS THE LAST BASTION OF THE TRADITIONAL SCALE-UP APPROACH, MAKING IT THE DATA CENTER'S WEAK LINK. THROUGH THE UCSD TRITON NETWORK ARCHITECTURE, THE AUTHORS EXPLORE ISSUES IN MANAGING THE NETWORK AS A SINGLE PLUG-AND-PLAY VIRTUALIZABLE FABRIC SCALABLE TO HUNDREDS OF THOUSANDS OF PORTS AND PETABITS PER SECOND OF AGGREGATE BANDWIDTH.

Amin Vahdat, Mohammad Al-Fares, Nathan Farrington, Radhika Niranjan Mysore, George Porter, and Sivasankar Radhakrishnan
University of California, San Diego

Over the past decade, the scale-out model has replaced scale up as the basis for delivering large-scale computing and storage. In the traditional scale-up model, system architects replace a server with a higher-end model upon reaching the device's compute or storage capacity. This model requires periodically moving applications and data from one machine to another. Worse, it eliminates leveraging commodities of scale, meaning that the price per compute cycle or byte of storage can be a factor of 10 or more higher than the unit costs available from commodity components.

This additional cost could be justified in smaller deployments—that is, the simpler administration and software architecture of running on fewer, more powerful devices might reduce the absolute cost difference between scale-out and scale-up service deployment. However, planetary-scale infrastructures serving billions of requests daily1 and cloud computing infrastructures centrally hosting millions of individuals' and businesses' computing and storage requirements have led to dense data centers with hundreds of thousands of servers, such as the Microsoft Chicago Data Center Container Bay. In this environment, the cost difference between leveraging commodity versus noncommodity components can amount to billions of dollars for the largest providers.

Such economics have justified the necessary software architecture development to leverage incrementally deployable, horizontally scalable commodity servers and disks.2-3 In the scale-out model, adding compute or storage resources should ideally be a matter of adding more servers or disks to the system. For instance, increasing a service cluster's compute capacity by 10 percent should require adding only 10 percent more servers to the cluster. The scale-out model requires parallelism in the underlying workload, but this parallelism is trivially available for storage and for services handling many small requests from a large user population, such as search, and is increasingly available for various computational models.2

Although the scale-out model provides clear and compelling benefits, the last bastion of the scale-up model in data centers is the network. Delivering the necessary parallelism for both compute and storage requires coordination and communication among many elements spread across the data center. In turn, this cooperation requires a high-performance network interconnect at the scale of tens of thousands of servers. Today, however, delivering bandwidth to ever-larger clusters requires leveraging increasingly specialized, noncommodity switches and routers. For example, servers within a single rack can interconnect using a commodity 48-port nonblocking top-of-rack switch. All hosts directly connected to this first-hop switch can communicate with one another at their local network interface card's (NIC's) speed. However, high-speed communication at the scale of thousands of racks requires a hierarchical network. Although the hierarchy's leaves can use commodity switches, aggregation up the hierarchy requires higher-capacity, specialized switches (with more ports and more backplane bandwidth). Existing datacenter topologies might increase cost and limit performance by a factor of 10 or more while limiting end-host connectivity to 1 gigabit per second (Gbps).4

The resulting performance limitations in the datacenter network limit the data center's flexibility and scalability as a whole. For instance, the number of network ports that can economically deliver sufficient aggregate bandwidth among all hosts limits the size of the largest services that can run in the data center. Oversubscribed networks are a reality in data centers of any scale, meaning that application developers must explicitly understand and account for widely variable cluster bandwidth. For example, 1 Gbps of bandwidth can be available to hosts in the same rack, 200 megabits per second (Mbps) to hosts in the same row, and 40 Mbps to hosts in a different row. Limiting available bandwidth in this manner limits various communication-intensive applications' scalability. At the same time, moving existing applications to a new cluster with a different network interconnect can violate hidden assumptions in application code.

Numerous recent efforts are collectively investigating scale-out networking. The goal is to make expanding the number of network ports or the amount of aggregate bandwidth as simple as adding processing power or storage. Operators should be able to manage the datacenter network as a single plug-and-play network, approaching zero manual configuration and transparently bringing online additional connectivity or bandwidth for servers. We'll take a look at the many challenges to realizing scale-out networking, with a particular focus on the University of California, San Diego's Triton network architecture.

Requirements
An idealized network fabric should be scalable, manageable, flexible, and cost effective.

Scalable
The network should scale to support the largest data centers: 100,000 or more ports and approaching 1 petabit per second of aggregate bandwidth. Furthermore, the scale should be incrementally realizable—that is, the number of ports should be expandable at the granularity of hundreds of ports and aggregate bandwidth scalable at the granularity of terabits per second.

Manageable
Developers should be able to manage the entire network as a single logical layer 2 (L2) domain, realizing all the plug-and-play benefits of deploying additional ports or bandwidth. Expanding the network shouldn't require configuring Dynamic Host Configuration Protocol (DHCP) servers, subnet numbers, subnet masks, or network maps. At the same time, one of the perhaps surprising management challenges with large-scale networks is cabling complexity. Avoiding the cost and complexity of interconnecting tens of thousands of long cables crisscrossing the data center is often reason enough to oversubscribe a network.

Flexible
Any service should be flexible enough to run anywhere in the data center, with full support for both end-host and network virtualization. Virtual machines should be able to migrate to any physical machine while maintaining their IP addresses and reachability. Datacenter architects should be able to allocate larger computations on the basis of resource availability rather than interrack bandwidth limitations.

Cost effective
The price per port should be cost effective—that is, small relative to the cost of compute and storage, in terms of both capital and operational expenditures. The latter requires reducing not only the human configuration and management burden but also the per-port power consumption.5

Triton
Addressing these challenges forms the basis of our Triton network architecture and implementation, which comprises

- a fundamentally scalable network topology as the basis for cost-effective bandwidth scalability,
- a hardware organization that enables incremental deployment of discrete, identical network elements with minimal cabling complexity,
- a set of protocols that network switches can use to self-discover their location in the network topology, as well as the available routes between all end hosts, and
- dynamic forwarding techniques that load-balance communication among all available paths, delivering the underlying fabric's full bandwidth for a range of communication patterns.

Taken together, our work aims to demonstrate one instance of a hardware, software, and protocol organization to deliver scale-out networking.

Topology
Traditional best practices in building datacenter networks, such as the Cisco Data Center Infrastructure 2.5 Design Guide, describe a tree topology with aggregation to denser, higher-speed switches moving up the hierarchy. The densest switches available for the topology's root limit network performance in these environments. Until recently, nonblocking 10-Gbps Ethernet switches were limited to 128 ports. Numerous product announcements claim increased density of 200 to 300 ports. Assuming servers with Gbps NICs, this would limit a single network's size to a few thousand hosts. Expanding further is possible by building a tree with multiple roots and then using techniques such as equal-cost multipath forwarding (ECMP) to balance load among these roots.

Overall, however, this scale-up approach to datacenter networking has many drawbacks. First, it requires aggregation to higher-speed, denser switches as its fundamental scaling mechanism. For example, in transitioning to 10-Gbps NICs, this approach requires 40-Gbps or 100-Gbps Ethernet switches. Unfortunately, the time to transition from one generation of Ethernet technology to the next has been increasing. In 1998, 1-Gbps Ethernet debuted; in 2002, 10-Gbps Ethernet. The widespread transition to 10 Gbps to end hosts is still under way, and 40-Gbps Ethernet technology will see commercial deployment in 2010. During the transition period, the highest-end switches command a significant price premium primarily because of limited volumes, substantially increasing the datacenter network's cost.4 Second, scale-up networking such as this limits the overall bisection bandwidth, in turn impacting the maximum degree of parallelism or overall system performance for large-scale distributed computations.2

Issues of cost lead datacenter architects to oversubscribe their networks. For instance, in a typical data center organized around racks and rows, intrarack communication through a top-of-rack switch might deliver nonblocking performance. Communication to a server in a remote rack but the same row might be oversubscribed by a factor of 5. Finally, interrow communication might be oversubscribed by some larger factor. Oversubscription ratios of up to 240 to 1 have occurred in commercial deployments.6

Our approach to the challenges with current datacenter networking topologies has been to adopt ideas from the area of high-performance computing.7 In particular, we have explored the benefits of topologies built around fat trees,4 a special buffered instance of a folded-Clos topology using identical switching elements and adapted for packet-switched networks. Of course, this general approach to network scaling is not new. However, the need to apply these ideas to large-scale Ethernet and TCP/IP networks running standard Internet services has been limited until the recent advent of planetary-scale services and exploding data sets routinely requiring thousands of servers performing all-to-all communication.

Figure 1 shows a simple example of a fat tree topology. The topology effectively forms a 16-port switch built internally from 20 identical 4-port switches. With appropriate scheduling, this topology can deliver the same throughput as a single-stage crossbar switch. Put another way, it can achieve aggregate bisection bandwidth limited only by the speed of each server's network interface. (The analogous circuit-switched network would be rearrangeably nonblocking.) Achieving this functionality in practice might be computationally challenging, but we have achieved good results in practice, as we'll discuss in the 'Multipath forwarding' section.

The fat tree's scaling properties make it particularly appealing for the data center. In general, a fat tree built from identical k-port switches can support 100 percent throughput among k³/4 servers using 5k²/4 switching elements. We organize the topology into k pods, as Figure 1 shows, each connecting k²/4 end hosts. Edge switches in each pod provide connectivity to end hosts and connect to aggregation switches in the hierarchy's second level. Core switches form the root of the fat tree, facilitating interpod communication.
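To make this arithmetic concrete, the following sketch (our own illustration, not code from the Triton implementation) computes the size of a three-stage fat tree for a given switch radix k; the k = 4 case reproduces the 16-host, 20-switch example of Figure 1.

    def fat_tree_size(k):
        """Sizing of a three-stage fat tree built from identical k-port switches."""
        assert k % 2 == 0, "switch radix must be even"
        return {
            "pods": k,
            "hosts": k ** 3 // 4,                  # k pods x k^2/4 hosts per pod
            "hosts_per_pod": (k // 2) ** 2,        # k/2 edge switches x k/2 host ports each
            "edge_switches": k * (k // 2),
            "aggregation_switches": k * (k // 2),
            "core_switches": (k // 2) ** 2,
            "total_switches": 5 * k ** 2 // 4,     # k^2 in the pods plus k^2/4 in the core
        }

    print(fat_tree_size(4))   # Figure 1: 16 hosts from 20 identical 4-port switches
    print(fat_tree_size(48))  # commodity 48-port silicon already reaches 27,648 hosts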

Although a fat tree can scale out to a nonblocking topology, it can also be oversubscribed depending on the underlying applications' communication requirements. For example, if the fat tree in Figure 1 had only two core switches, the network as a whole would be oversubscribed by a factor of two. Similarly, limiting the number of aggregation switches could oversubscribe each pod.

Figure 1. Sample fat tree topology with four pods (Pod 0 through Pod 3), showing the edge, aggregation, and core layers. With appropriate scheduling, this topology can deliver the same throughput as a single-stage crossbar switch. Edge switches in each pod provide connectivity to end hosts and connect to aggregation switches in the hierarchy's second level. Core switches form the root of the fat tree, facilitating interpod communication.

Critically, the fat tree is built from identical switching elements, so relying on aggregation to higher-speed, more expensive switching elements moving up the hierarchy is unnecessary. This approach demonstrates how to build a cost-effective network fabric from prevailing commodity network components. As higher-speed network elements transition to becoming commodity, fat trees can still support incremental network evolution. For instance, inserting higher-speed switches into the core switching layer can reduce oversubscription. Next, such elements might migrate into pods at the aggregation layer, delivering more bandwidth into each pod. Finally, incorporating these switches into the edge can deliver more bandwidth to individual servers. This type of scale-out networking would require carefully designing the modular switching infrastructure, which we'll discuss in the next section.

Our evaluation of a fat tree-based topology's cost structure relative to existing best practices indicates a potential reduction in costs by a factor of 10 or more for the same network bandwidth.4 Of course, the fat tree topology introduces its own challenges in cabling, management, and routing and forwarding. See the 'Related work in scale-out networking' sidebar for more on topologies in this area.

Merchant silicon
Two challenges of deploying large-scale network fabrics are managing a large network of switching elements as a single, logical switch, and the sheer number of long cables required to interconnect such topologies. For example, many so-called merchant silicon vendors, such as Fulcrum Microsystems, Marvell, and Broadcom, are currently shipping 24-, 48-, or 64-port 10 Gigabit Ethernet (GigE) switches on a chip that could serve as the building block for the fat-tree topologies we've described. Considering the k = 64 case, it becomes possible to build a 65,536-port 10 GigE switch partitioned into 64 pods of 1,024 ports each. Unfortunately, the topology would also require 5,120 individual 64-port switching elements and 131,072 cables running across the data center to interconnect edge switches to aggregation switches and aggregation switches to core switches (see Figure 1). These cables are in addition to the short cables required to connect end hosts to the edge switches.
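The cable counts quoted here follow directly from the topology. The sketch below is our own back-of-the-envelope check (not from the article) for k = 64; it also applies the 8-channels-per-fiber, 128-fibers-per-cable optical packaging described later in this section to show how the long-cable count collapses to one cable per pod.

    k = 64                                         # ports per merchant-silicon switch
    host_ports = k ** 3 // 4                       # 65,536 10 GigE host-facing ports
    switching_elements = 5 * k ** 2 // 4           # 5,120 individual 64-port switches

    edge_to_agg_links = k * (k // 2) * (k // 2)    # per pod: k/2 edge x k/2 aggregation links
    agg_to_core_links = (k // 2) ** 2 * k          # each of the k^2/4 core switches has k links
    long_cables = edge_to_agg_links + agg_to_core_links   # 131,072 discrete cables

    # Modular packaging: edge-aggregation links become pod-switch backplane traces, and
    # the (k/2)^2 = 1,024 uplinks per pod ride 8 to a fiber, 128 fibers to a cable.
    uplinks_per_pod = (k // 2) ** 2
    fibers_per_pod = uplinks_per_pod // 8          # 128 fibers
    cables_per_pod = fibers_per_pod // 128         # a single cable per pod, 64 in total

    print(host_ports, switching_elements, long_cables, k * cables_per_pod)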

Managing 5,120 individual switches and 131,072 individual cables is a tremendous logistical challenge. So, although a fat-tree topology might reduce a large-scale network's capital expenses, this advantage might be lost because of the operational expenses of building and managing the infrastructure.

Alternatively, we believe that a next-generation scale-out network infrastructure will involve constructing larger-scale modular components using available merchant silicon as an internal building block. We developed a modular switch infrastructure along this model.8 (See the 'Related work in scale-out networking' sidebar for additional commercial switch vendors employing merchant silicon.) Our earlier work showed how to construct a 3,456-port 10 GigE switch using then-commodity 24-port 10 GigE chips as the basic switching element. With 64-port switches as the fundamental building block, our exemplar deployment would comprise 64 modular pod switches and one modular core switch array as the basis for scale-out networking (Figure 2). A single cable comprising multiple optical fiber strands would connect each pod switch to the core switch array. Thus, we reduce the previously required 5,120 individual switches to 64 pod switches and one core switch array, and the 131,072 required cables to 64.

Figure 2. Modular switch infrastructure: a 1,024-port 10 Gigabit Ethernet pod switch, built from 16 line cards and eight uplink cards, with 10.24 terabits per second (Tbps) of uplink capacity to a core switch array. We can combine up to 64 such switches to form a fabric with 655 Tbps of bisection bandwidth.

In one possible hardware configuration, each pod switch consists of up to 16 line cards, each delivering 64 ports of 10 GigE host connectivity. The pod switch also supports up to eight fabric/uplink cards. Each uplink card delivers 1.28 terabits per second (Tbps) of bisection bandwidth to the pod switch. Installing fewer uplink cards creates a less costly but internally oversubscribed pod switch. We use identical line cards and identical uplink cards to support our vision for scale-out networking: installing additional commodity components delivers additional capacity, whether ports or aggregate bisection bandwidth.
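As a rough illustration of that trade-off (a sketch using the card capacities stated above, not vendor specifications), a pod switch's internal oversubscription is simply the ratio of host-facing bandwidth to installed uplink bandwidth:

    def pod_oversubscription(line_cards, uplink_cards):
        """Ratio of host-facing to uplink bandwidth for one modular pod switch."""
        host_gbps = line_cards * 64 * 10      # 64 ports of 10 GigE per line card
        uplink_gbps = uplink_cards * 1280     # 1.28 Tbps per fabric/uplink card
        return host_gbps / uplink_gbps

    print(pod_oversubscription(16, 8))   # 1.0 -> fully provisioned, 10.24 Tbps each way
    print(pod_oversubscription(16, 4))   # 2.0 -> half the uplink cards, 2:1 oversubscribed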

Because copper cables are power-limited to approximately 10 meters in length for 10 GigE, the uplink cards use optical transceivers. This removes the 10-meter limitation, but, just as importantly, provides opportunities for reducing the number of cables. Each optical transceiver module takes eight independent 10 GigE channels, encodes them onto different optical wavelengths, and transmits them over a single strand of fiber. Up to 128 such fibers aggregate into a single cable, which then runs to the core switch array. An identical optical module on the core switch array then demultiplexes the original 10 GigE channels and routes them to the appropriate electrical switch in the core array.

In our design, the modular core switch array supports up to 64 line cards, each providing 10.24 Tbps of bandwidth for interpod communication.

Sidebar: Related work in scale-out networking

In developing our approach to scale-out networking, we examined four areas: topology, merchant silicon, layer 2 versus layer 3 routing, and multipath forwarding. Related work in these areas includes the following.

Topology
VL2 employs a folded-Clos topology to overcome the scaling limitations of traditional tree-based hierarchies.1 DCell2 and BCube3 let end hosts also potentially act as switches, allowing the datacenter interconnect to scale using a modest number of low-radix switching elements.

Merchant silicon
Several commercial switches are logically organized internally as a three-stage folded-Clos topology built from merchant silicon. As two examples, Arista has built a 48-port 10 Gigabit Ethernet (GigE) switch (www.aristanetworks.com/en/7100 Series SFPSwitches) and Voltaire a 288-port 10 GigE switch (www.voltaire.com/Products/Ethernet/voltaire_vantage_8500), both employing 24-port 10 GigE switch silicon as the fundamental building block.

Layer 2 versus layer 3 fabrics
Transparent Interconnection of Lots of Links (TRILL) is an IETF standard for routing at layer 2 (L2).4 It requires global knowledge of flat media access control (MAC) addresses and introduces a separate TRILL header to protect against L2 loops and to limit the amount of per-switch forwarding state. Scalable Ethernet Architecture for Large Enterprises (SEATTLE) employs a distributed hash table among switches to ease determining the mapping between end host and egress switch.5 Multilevel Origin-Organized Scalable Ethernet (MOOSE) introduces hierarchical MAC addresses similar to our approach with PortLand.6 Finally, VL2 performs IP-in-IP encapsulation to translate an application IP address (AA) into a host's actual IP address (LA), as determined by its local switch.

Multipath forwarding
VL2 and Monsoon7 propose using Valiant load balancing on a per-flow basis, which suffers from the same hash collision problem as static equal-cost multipath forwarding. Per-packet load balancing would overcome the hash collision problem; however, variable path latencies can result in out-of-order packet delivery, which can substantially reduce TCP performance.

Traffic Engineering Explicit Control Protocol (TeXCP)8 and MPLS Adaptive Traffic Engineering (MATE)9 propose dynamic and distributed traffic engineering techniques to route around congestion in the wide area. FLARE, a flowlet-aware routing engine, proposes the idea of grouping back-to-back TCP segments into flowlets that can independently follow different wide-area paths.10 Several proposed TCP variants11,12 aim to overcome current TCP reordering limitations.

References

1. A. Greenberg et al., "VL2: A Scalable and Flexible Data Center Network," Proc. ACM Special Interest Group on Data Communication Conf. Data Comm. (SIGCOMM 09), ACM Press, 2009, pp. 51-62.
2. C. Guo et al., "DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers," ACM Special Interest Group on Data Communication (SIGCOMM) Computer Comm. Rev., vol. 38, no. 4, 2008, pp. 75-86.
3. C. Guo et al., "BCube: A High Performance, Server-Centric Network Architecture for Modular Data Centers," Proc. ACM Special Interest Group on Data Communication Conf. Data Comm. (SIGCOMM 09), ACM Press, 2009, pp. 63-74.
4. J. Touch and R. Perlman, Transparent Interconnection of Lots of Links (TRILL): Problem and Applicability Statement, IETF RFC 5556, May 2009; www.rfc-editor.org/rfc/rfc5556.txt.
5. C. Kim, M. Caesar, and J. Rexford, "Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises," Proc. ACM Special Interest Group on Data Communication Conf. Data Comm. (SIGCOMM 08), ACM Press, 2008, pp. 3-14.
6. M. Scott and J. Crowcroft, "MOOSE: Addressing the Scalability of Ethernet," poster session, ACM Special Interest Group on Operating Systems (SIGOPS) European Conf. Computer Systems, 2008; www.cl.cam.ac.uk/~mas90/MOOSE/EuroSys2008-Poster+abstract.pdf.
7. A. Greenberg et al., "Towards a Next Generation Data Center Architecture: Scalability and Commoditization," Proc. ACM Workshop Programmable Routers for Extensible Services of Tomorrow (PRESTO 08), ACM Press, 2008, pp. 57-62.
8. S. Kandula et al., "Walking the Tightrope: Responsive Yet Stable Traffic Engineering," ACM Special Interest Group on Data Communication (SIGCOMM) Computer Comm. Rev., vol. 35, no. 4, 2005, pp. 253-264.
9. A. Elwalid et al., "MATE: MPLS Adaptive Traffic Engineering," Proc. IEEE Int'l Conf. Computer Comm. (INFOCOM 01), IEEE Press, vol. 3, 2001, pp. 1300-1309.
10. S. Sinha, S. Kandula, and D. Katabi, "Harnessing TCP's Burstiness Using Flowlet Switching," Proc. ACM Special Interest Group on Data Communication (SIGCOMM) Workshop Hot Topics in Networks (HotNets 04), ACM Press, 2004; http://nms.csail.mit.edu/papers/flare-hotnet04.ps.
11. M. Zhang et al., "RR-TCP: A Reordering-Robust TCP with DSACK," Proc. IEEE Int'l Conf. Network Protocols (ICNP 03), IEEE CS Press, 2003, p. 95.
12. S. Bohacek et al., "A New TCP for Persistent Packet Reordering," IEEE/ACM Trans. Networking (TON), vol. 14, no. 2, 2006, pp. 369-382.

Internally, each core line card comprises 16 identical 64-port switches. Datacenter operators could deploy as many core line cards as required to support the interpod communication requirements for their particular application mix. So, the network as a whole could be oversubscribed by up to a factor of 64 with one core switch array card or deliver full bisection bandwidth with 64 cards. For fault tolerance, operators can physically distribute the core switch array in different portions of the data center and on different electrical circuits.

Our architecture involving pod switches interconnected by a core switch array aligns well with recent trends toward modular data centers. Currently, many large-scale data centers are being built from individual containers with 100 to 1,000 servers.9 Our modular pod switch maps well to the networking requirements of individual containers. The core switch array then provides an incrementally bandwidth-scalable interconnect for a variable number of containers. As we add additional containers to a constantly evolving data center, additional core switch array cards can support the expanding communication requirements.

Layer 2 versus layer 3 fabrics
Once any datacenter network is constructed, independent of its underlying topology or modularity, the next question is whether the network should be managed as an L2 or L3 addressing domain. Each approach has an associated set of trade-offs for manageability, scalability, switch state, and support for end host virtualization (see Table 1). An L2 network is essentially plug-and-play: the datacenter operator deploys individual switches, and the switches learn their position in the extended local area network through a protocol such as Rapid Spanning Tree (www.ieee802.org/1/pages/802.1w.html). With L3, the operator must configure each switch with a subnet mask and appropriately synchronize any DHCP servers such that they can assign hosts suitable IP addresses matching the subnet of the connected switch. A minor configuration error can render portions of the data center inaccessible.

Table 1. Comparing traditional approaches.

Approach   Required configuration   Routing           Switch state   Seamless virtual machine migration
Layer 2    Plug and play            Flooding          Large          Supported
Layer 3    Subnet configuration     Broadcast-based   Small          Not supported

Network scalability is related to the overhead of the required control protocols, primarily the routing protocols. Popular routing protocols, such as Intermediate System-to-Intermediate System (IS-IS) routing for L2 and Open Shortest Path First (OSPF) routing for L3, require broadcast among all network switches to discover the network topology and end host locations, limiting system scalability and stability. L2 forwarding protocols face the additional challenge of distributing information for all end hosts because naming is based on flat media access control (MAC) addresses rather than hierarchical IP addresses.

At L2, forwarding based on flat MAC addresses imposes significant scalability challenges on commodity switch hardware. Modern data centers might host more than 100,000 servers with millions of virtual end hosts. Switch forwarding tables must include an entry for every flat MAC address in the data center. Unfortunately, one million forwarding table entries would require 10 Mbytes of on-chip memory (assuming 10 bytes per entry), which would consume nearly the entire transistor count of a modern commodity switch ASIC. In fact, current network switches, such as the Cisco Nexus 5000 Series (www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/white_paper_c11-462176.html) and Fulcrum Microsystems' FM4000 Series (www.fulcrummicro.com/products/focalpoint/fm4000.htm), are limited to between 16,000 and 64,000 forwarding table entries. An L3 network employs topologically meaningful hierarchical IP addresses, so forwarding tables can be compacted to hold single entries covering multiple hosts.
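The state arithmetic is easy to reproduce. The following back-of-the-envelope sketch is ours, using the article's assumption of roughly 10 bytes per forwarding entry and a hypothetical 64-pod deployment, and it contrasts a flat MAC table with a hierarchical table aggregated to one entry per pod-level prefix:

    ENTRY_BYTES = 10                          # assumed cost of one forwarding-table entry

    virtual_hosts = 1_000_000
    flat_table_bytes = virtual_hosts * ENTRY_BYTES        # ~10 MB of on-chip memory

    pods = 64                                 # hierarchical addressing: aggregate per pod
    hierarchical_table_bytes = pods * ENTRY_BYTES         # ~640 bytes at a core switch

    print(flat_table_bytes / 1e6, "MB versus", hierarchical_table_bytes, "bytes")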

Finally, the prevalent use of virtual machines in the data center imposes novel requirements on the networking infrastructure. An IP address uniquely identifies an end host, virtual or otherwise. An end host can cache another end host's IP address (delivering some service) in application-level state and name TCP connections in part by the IP address of the connection's source and destination. At the same time, in an L3 network, the subnet number of a host's first-hop switch partly determines the host's IP address. A virtual machine migrating from one physical machine to another across the data center must then potentially change its IP address to match the appropriate subnet number. This remapping is a key requirement for the hierarchical forwarding we discussed earlier. Unfortunately, adopting a new IP address can then invalidate significant network-wide state and all existing TCP connections. An L2 network doesn't face this limitation because any physical machine can host any IP address.

In summary, neither L2 nor L3 network protocols are currently a good match to datacenter requirements. However, we believe the general spirit of L2 networks, ease of configuration and management, is in principle better aligned with datacenter network requirements. So, we designed PortLand to combine the benefits of both approaches without the need to modify end hosts and switch hardware.10 (See the 'Related work in scale-out networking' sidebar for additional work in this area.)

We leverage the known technique of separating host identity from host location. In PortLand, the network automatically self-organizes to assign topologically meaningful location labels to every end host. These labels efficiently encode location, with hosts closer to one another in the topology sharing more specific location prefixes. As Figure 3 shows, all packet forwarding proceeds on the basis of these location labels, with hierarchical location labels enabling compact switch forwarding tables.

Figure 3. Hierarchical forwarding in PortLand. PortLand separates host identity and host location.

To maintain transparent compatibility with unmodified end hosts, we insert these labels as L2 pseudo MAC (PMAC) addresses. PortLand switches employ a decentralized Location Discovery Protocol (LDP) to learn their pod membership. They use this information to assign themselves a unique pod number (for aggregation and edge switches) and a unique position number within each pod (for edge switches). Edge switches then assign to each end host a 48-bit PMAC of the form pod.position.port.vmid. They learn pod and position values through LDP; the port field is based on the port to which the end host is connected. The edge switches also assign vmids to each virtual machine resident on the same end host based on the virtual machine's unique MAC address.
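The article doesn't spell out the PMAC field widths, so the 16/8/8/16-bit split below is an illustrative assumption; the sketch (ours, not PortLand's switch software) simply shows that a location-encoding pseudo MAC can be packed and parsed with a few shifts:

    def encode_pmac(pod, position, port, vmid):
        """Pack pod.position.port.vmid into a 48-bit pseudo MAC (assumed 16/8/8/16-bit fields)."""
        value = (pod << 32) | (position << 24) | (port << 16) | vmid
        return ":".join(f"{b:02x}" for b in value.to_bytes(6, "big"))

    def decode_pmac(pmac):
        value = int.from_bytes(bytes.fromhex(pmac.replace(":", "")), "big")
        return value >> 32, (value >> 24) & 0xFF, (value >> 16) & 0xFF, value & 0xFFFF

    pmac = encode_pmac(pod=2, position=0, port=2, vmid=1)
    print(pmac)                # 00:02:00:02:00:01
    print(decode_pmac(pmac))   # (2, 0, 2, 1)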

Edge switches transmit IP-address-to-PMAC mappings to a centralized fabric manager on seeing the first packet from any end host. Our insertion point for backward compatibility is the ubiquitous Ethernet Address Resolution Protocol (ARP). Hosts in the same L2 domain will broadcast ARP requests to determine the MAC address associated with a particular host's IP address. PortLand edge switches intercept these ARP requests and proxy them to the fabric manager, which returns the appropriate PMAC (rather than the actual MAC address) for the IP address. The edge switch returns this PMAC to the querying host. From this point, all packets include the correct destination PMAC in the Ethernet header. Egress edge switches rewrite the PMAC to the actual MAC address on the basis of locally available soft state, maintaining transparency.
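A minimal sketch of that control flow follows (our own illustration in Python, not PortLand's implementation); the FabricManager and EdgeSwitch classes and their dictionaries are assumed stand-ins for the directory state and soft state described above:

    class FabricManager:
        """Central directory of IP-to-PMAC mappings reported by edge switches."""
        def __init__(self):
            self.ip_to_pmac = {}

        def report(self, ip, pmac):
            self.ip_to_pmac[ip] = pmac       # learned on the first packet from a host

        def resolve(self, ip):
            return self.ip_to_pmac.get(ip)   # None -> fall back to broadcast ARP


    class EdgeSwitch:
        def __init__(self, manager):
            self.manager = manager
            self.pmac_to_actual_mac = {}     # locally available soft state

        def handle_arp_request(self, target_ip):
            # Intercept the broadcast ARP and proxy it to the fabric manager,
            # answering with the PMAC rather than the host's actual MAC.
            return self.manager.resolve(target_ip)

        def egress_rewrite(self, pmac):
            # Rewrite the destination PMAC back to the real MAC before delivery.
            return self.pmac_to_actual_mac[pmac]

    fm = FabricManager()
    fm.report("10.2.1.2", "00:02:00:02:00:01")
    print(EdgeSwitch(fm).handle_arp_request("10.2.1.2"))   # returns the PMAC, not the real MAC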

Our implementation requires modifications only to switch software and leverages already available hardware functionality to forward based on PMACs, to intercept ARPs, to set up soft-state mappings on seeing the first packet from end hosts, and so on. We employ OpenFlow to interface with switch software, with the hope of enabling compatibility with various hardware switch platforms.11 Numerous additional protocol details support virtual machine migration, fault tolerance, and multicast.10

Multipath forwarding
The final component we will consider in scale-out networking is packet forwarding in networks with multiple paths. Modern datacenter topologies must contain multiple paths between hosts to deliver the necessary aggregate bandwidth and to tolerate failures. Unfortunately, traditional L3 Internet routing protocols are optimized for delivering baseline connectivity rather than for load balancing among a set of available paths.

ECMP forwarding balances flows among multiple network paths. An ECMP-capable switch matches a packet's destination address against the forwarding table. To choose among possible next hops, the switch computes a hash of a predefined set of fields in the packet header modulo the number of possible paths to choose the particular next hop. By hashing on fields such as source and destination address, source and destination port number, and time to live, ECMP ensures that packets belonging to the same flow follow the same path through the network, mitigating concerns over the packet reordering endemic to multipath forwarding protocols.
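As a concrete, if simplified, illustration of static ECMP hashing (our own sketch; real switches use hardware hash functions over whichever header fields they are configured to select), the code below picks a next hop by hashing header fields modulo the number of equal-cost paths, which is why every packet of a given flow stays on one path:

    import hashlib

    def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, ttl, next_hops):
        """Choose one equal-cost next hop from a static hash of packet header fields."""
        key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{ttl}".encode()
        index = int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % len(next_hops)
        return next_hops[index]

    uplinks = ["core0", "core1", "core2", "core3"]
    # Every packet of this flow hashes to the same uplink; two large flows that
    # happen to hash alike will therefore share, and congest, a single link.
    print(ecmp_next_hop("10.0.1.2", "10.3.1.2", 5001, 80, 64, uplinks))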

Although a step in the right direction, ECMP is oblivious to dynamically changing communication patterns. For any static hash algorithm, multiple flows can collide on a congested link, as Figure 4 depicts. For example, our earlier analysis shows that such collisions can restrict ECMP to less than 48 percent of the available bandwidth for some communication patterns, such as serial data shuffles (for example, common to MapReduce2).12

Figure 4. Equal-cost multipath (ECMP) forwarding collisions, both local and downstream. As in this example, multiple flows can collide on a congested link, resulting in reduced throughput. We omitted unused links for clarity.

To address this limitation, we designed and implemented Hedera to perform dynamic flow scheduling in data centers with significant multipathing (see Figure 5).12

Figure 5. Dynamic flow placement with Hedera avoids long-term flow collisions. A central scheduler detects large flows, estimates flow demands, and schedules flows.

Two observations motivated us in designing Hedera. First, when flows are uniform in size and their individual rates are small, ECMP works well in distributing flows given a reasonable hash function. Problems arise with ECMP's static hashing when flow length varies. In this case, a relatively small number of large flows can account for most bandwidth. So, good load balancing requires efficiently identifying and placing these large flows among the set of available paths between source and destination. Our second observation was that it's insufficient to perform load balancing on the basis of measured communication patterns, because poor scheduling can mask the actual demand of a congestion-responsive flow (such as a TCP flow). Good scheduling then requires a mechanism to estimate a flow's intrinsic bandwidth demand independent of any bottlenecks the network topology or past scheduling decisions might introduce.

Given these observations, Hedera operates as follows: A central fabric manager, akin to NOX,13 periodically queries switches for statistics for flows larger than a threshold size. We employ OpenFlow to perform such queries to leverage hardware flow counters available in a range of modern switch hardware. Hedera then uses information about the set of communicating source-destination pairs to build a traffic demand matrix. This matrix represents the estimated bandwidth requirements of all large flows in the data center, absent any topological or scheduling bottlenecks—that is, the max-min fair share of each flow on an idealized crossbar switch.

This traffic demand matrix, along with the current view of actual network utilization, serves as input to a scheduling/placement algorithm in the fabric manager. Our goal is to maximize aggregate throughput given current communication patterns. In Hedera, new TCP flows default to ECMP forwarding to choose a path to the destination. However, if the fabric manager determines sufficient benefit from remapping a flow to an alternate path, it informs the necessary switches along the path to override the default forwarding decision with a new Hedera-chosen path.
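Putting the pieces together, the control loop can be sketched as below. This is a simplified rendering of the workflow just described, not Hedera's code: poll_flow_stats, estimate_demands, place_flow, and install_path are injected stand-ins for the OpenFlow statistics query, the max-min demand estimator, the placement heuristic, and the switch update path, and the 100-Mbps large-flow threshold is an arbitrary placeholder.

    import time

    def hedera_style_loop(poll_flow_stats, estimate_demands, place_flow, install_path,
                          threshold_bps=100e6, interval_s=1.0):
        """Simplified Hedera-style scheduling loop; the four callables are stand-ins."""
        while True:
            # 1. Detect large flows from per-flow hardware counters (OpenFlow statistics).
            flows = [f for f in poll_flow_stats() if f["rate_bps"] >= threshold_bps]

            # 2. Estimate each flow's intrinsic demand: its max-min fair share on an
            #    idealized crossbar, independent of bottlenecks from past placements.
            demands = estimate_demands(flows)          # {flow_id: demand_bps}

            # 3. Re-place the largest flows; everything else keeps its default ECMP path.
            for flow in sorted(flows, key=lambda f: demands[f["id"]], reverse=True):
                path = place_flow(flow, demands[flow["id"]])
                if path is not None:
                    install_path(flow, path)           # override the default forwarding decision

            time.sleep(interval_s)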

Our evaluation indicates it's possible to dynamically shift traffic at the granularity of one second, and possibly hundreds of milliseconds with additional optimizations. So, as long as the communication patterns at this granularity are stable, an approach such as Hedera can appropriately schedule flows. Overall, Hedera delivers performance within a few percentage points of optimal for a range of communication patterns. (For more on multipath forwarding, see the 'Related work in scale-out networking' sidebar.)

Future challenges
We have focused on some of the key requirements for scale-out networking in the data center, but many important challenges remain. As services become increasingly tightly coupled, many applications would benefit from low-latency communication. Given current network architectures, machine-to-machine latencies of under one microsecond should be achievable even in multihop topologies. However, achieving such latencies will require reexamining hardware components down to the physical layer as well as legacy software optimized for wide-area deployments where millisecond latencies are the norm.

Given the bursty nature of datacenter communication patterns, any static oversubscription for a network might be wasteful along one dimension or another. Although a nonblocking datacenter fabric will support arbitrary communication patterns, its cost is wasted if most links are idle most of the time. More information is required about datacenter communication patterns to design appropriate topologies and technologies that support a range of communication patterns.14 The ideal circumstance is to allocate bandwidth when and where required without having to preallocate for the worst case.

Bursty communication can take place at multiple time scales. Recent work on the incast problem, in which sudden bandwidth demand from multiple senders to a common destination can overwhelm shallow switch buffers, quantifies issues with fine-grained communication bursts to a single node.15 The commonality of these microbursts requires enhancements either at the transport layer or at the link layer to avoid negative interactions between bursts of lost packets and higher-level transport behavior.

On the standardization front, to support migration toward a converged, lossless fabric, Ethernet is undergoing significant enhancements, such as IEEE 802.1Qau: Congestion Notification, IEEE 802.1Qaz: Enhanced Transmission Selection, and IEEE 802.1Qbb: Priority-Based Flow Control. The goal is to employ a single network topology to support application scenarios that traditionally aren't well matched to Ethernet's best-effort nature, such as storage area networks and high-performance computing. Understanding the interactions between emerging L2 standards and higher-level protocols and applications will be important.

Once we achieve a scale-out network architecture, one of the next key requirements will be increased network virtualization. Although we can carve up processing and storage with current technology, the key missing piece for isolating portions of a larger-scale cluster is virtualizing the network. The goal would be to deliver application-level bandwidth and quality-of-service guarantees using virtual switches, with performance largely indistinguishable from the analogous physical switch.

Energy is increasingly important in the data center. The switching fabric might account for only a modest fraction of total datacenter power consumption. However, the absolute total power consumption can still be significant, making it an important target for future architectures aiming to reduce the energy consumed per bit of transmitted data.5,16 Perhaps more importantly, overall datacenter energy consumption might increase when system components such as CPU or DRAM idle unnecessarily waiting on a slow network.

We've discussed the importance of simplifying network fabric management through plug-and-play L2 network protocols. More aggressively, an entire datacenter network comprising thousands of switches should be manageable as a single unified network fabric. Just as datacenter operators can insert and remove individual disks transparently from a petabyte storage array, network switches should act as "bricks" that together make up a larger network infrastructure. Such unification will require both better hardware organization and improved software protocols.

Computing technology periodically undergoes radical shifts in organization followed by relatively sustained periods of more steady improvement in established metrics such as performance. Networking is currently at one of those inflection points, driven by mega data centers and their associated application requirements. Significant work remains to be done to realize the scale-out network model, and we believe that this will be fertile space for both research and industry in the years ahead.


References

1. L.A. Barroso, J. Dean, and U. Hoelzle, "Web Search for a Planet: The Google Cluster Architecture," IEEE Micro, vol. 23, no. 2, Mar.-Apr. 2003, pp. 22-28.
2. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proc. Symp. Operating Systems Design & Implementation (OSDI 04), Usenix Assoc., pp. 137-150.
3. M. Isard et al., "Dryad: Distributed Data-parallel Programs from Sequential Building Blocks," ACM Special Interest Group on Operating Systems (SIGOPS) Operating Systems Rev., vol. 41, no. 3, 2007, pp. 59-72.
4. M. Al-Fares, A. Loukissas, and A. Vahdat, "A Scalable, Commodity, Data Center Network Architecture," ACM Special Interest Group on Data Communication (SIGCOMM) Computer Comm. Rev., vol. 38, no. 4, Oct. 2008, pp. 63-74.
5. B. Heller et al., "ElasticTree: Saving Energy in Data Center Networks," Proc. Usenix Symp. Networked Systems Design and Implementation (NSDI 10), Usenix Assoc., 2010; www.usenix.org/events/nsdi10/tech/full_papers/heller.pdf.
6. A. Greenberg et al., "VL2: A Scalable and Flexible Data Center Network," Proc. ACM Special Interest Group on Data Communication Conf. Data Comm. (SIGCOMM 09), ACM Press, 2009, pp. 51-62.
7. W.D. Hillis and L.W. Tucker, "The CM-5 Connection Machine: A Scalable Supercomputer," Comm. ACM, vol. 36, no. 11, 1993, pp. 31-40.
8. N. Farrington, E. Rubow, and A. Vahdat, "Data Center Switch Architecture in the Age of Merchant Silicon," Proc. IEEE Symp. High Performance Interconnects (HOTI 09), IEEE CS Press, 2009, pp. 93-102.
9. C. Guo et al., "BCube: A High Performance, Server-Centric Network Architecture for Modular Data Centers," Proc. ACM Special Interest Group on Data Communication Conf. Data Comm. (SIGCOMM 09), ACM Press, 2009, pp. 63-74.
10. R.N. Mysore et al., "PortLand: A Scalable, Fault-Tolerant Layer 2 Data Center Network Fabric," Proc. ACM Special Interest Group on Data Communication Conf. Data Comm. (SIGCOMM 09), ACM Press, 2009, pp. 39-50.
11. N. McKeown et al., "OpenFlow: Enabling Innovation in Campus Networks," ACM Special Interest Group on Data Communication (SIGCOMM) Computer Comm. Rev., vol. 38, no. 2, Apr. 2008, pp. 69-74.
12. M. Al-Fares et al., "Hedera: Dynamic Flow Scheduling for Data Center Networks," Proc. Usenix Symp. Networked Systems Design and Implementation (NSDI 10), Usenix Assoc., 2010; www.usenix.org/events/nsdi10/tech/slides/al-fares.pdf.
13. N. Gude et al., "NOX: Towards an Operating System for Networks," ACM Special Interest Group on Data Communication (SIGCOMM) Computer Comm. Rev., vol. 38, no. 3, 2008, pp. 105-110.
14. S. Kandula et al., "The Nature of Data Center Traffic: Measurements & Analysis," Proc. ACM Special Interest Group on Data Communication (SIGCOMM) Internet Measurement Conf. (IMC 09), ACM Press, 2009, pp. 202-208.
15. A. Phanishayee et al., "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems," Proc. Usenix Conf. File and Storage Technologies (FAST 08), Usenix Assoc., 2008, pp. 175-188; www.usenix.org/events/fast08/tech/full_papers/phanishayee/phanishayee.pdf.
16. M. Gupta and S. Singh, "Greening of the Internet," Proc. Conf. Applications, Technologies, Architectures, and Protocols for Computer Comm., ACM Press, 2003, pp. 19-26.

Amin Vahdat is a professor and holds the Science Applications International Corporation chair in the Department of Computer Science and Engineering at the University of California, San Diego. He's also director of UC San Diego's Center for Networked Systems. His research interests include distributed systems, networks, and operating systems. Vahdat has a PhD in computer science from the University of California, Berkeley.

Mohammad Al-Fares is a PhD candidate in computer science at the University of California, San Diego, working in the Systems and Networking group. His research interests include building large datacenter network architectures, especially focusing on topology, routing, and forwarding efficiency. Al-Fares has an MS in computer science from UC San Diego. He's a member of ACM and Usenix.

Nathan Farrington is a PhD candidate in computer science and engineering at the University of California, San Diego. His research interests include improving datacenter networks' cost, performance, power efficiency, scalability, and manageability. Farrington has an MS in computer science from UC San Diego. He's a member of IEEE and ACM.

Radhika Niranjan Mysore is a PhD candidate in computer science at the University of California, San Diego. Her research interests include datacenter networks' scalability, storage, and protocol design and correctness proofs. Mysore has an MS in electrical and computer engineering from the Georgia Institute of Technology. She's a member of ACM and Usenix.

George Porter is a postdoctoral researcher in the Center for Networked Systems at the University of California, San Diego. His research interests include improving the scale, efficiency, and reliability of large-scale distributed systems, with a current focus on the data center. Porter has a PhD in computer science from the University of California, Berkeley.

Sivasankar Radhakrishnan is a graduate student in computer science at the University of California, San Diego. His research interests include designing large-scale datacenter networks that deliver the high performance and efficiency required for planetary-scale services. Radhakrishnan has a BTech in computer science and engineering from the Indian Institute of Technology, Madras. He's a member of ACM and Usenix.

Direct questions and comments to Amin Vahdat, 9500 Gilman Drive, M/C 0404, University of California San Diego, La Jolla, CA 92093-0404; [email protected].
