
Exploring Efficient and Scalable Multicast Routing in Future Data Center Networks

Dan Li, Jiangwei Yu, Junbiao Yu, Jianping Wu

Tsinghua University

Abstract—Multicast benefits group communications in saving network traffic and improving application throughput, both of which are important for data center applications. However, the technical trend of future data center design poses new challenges for efficient and scalable Multicast routing. First, the densely connected networks make traditional receiver-driven Multicast routing protocols inefficient in Multicast tree formation. Second, it is quite difficult for the low-end switches largely used in data centers to hold the routing entries of massive Multicast groups.

In this paper, we address these challenges by exploiting the features of future data center networks. Based on the regular topology of data centers, we use a source-to-receiver expansion approach to build efficient Multicast trees, excluding many unnecessary intermediate switches used in receiver-driven Multicast routing. Our algorithm enjoys significantly lower computation complexity than the general Steiner-tree algorithm, and gracefully embraces online receiver join/leave.

As for scalable Multicast routing, we combine in-packet Bloom Filters and in-switch entries to make the tradeoff between the number of Multicast groups supported and the additional bandwidth overhead. We use a node-based Bloom Filter to encode the tree and design a provably loop-free Bloom Filter forwarding scheme. We implement a software-based combination forwarding engine on the Linux platform, and experimental results on a test bed demonstrate that the overhead of the forwarding engine is lightweight.

I. INTRODUCTION

As the core of cloud services, data centers run not only online cloud applications such as Web search, Web mail, and interactive games, but also back-end infrastructural computations including distributed file systems [1], [2], structured storage systems [3] and distributed execution engines [1], [4]. The traffic pattern of group communication is popular in both types of computations [7]. Examples include redirecting search queries to a set of indexing servers [5], replicating file chunks in distributed file systems [1], [2], distributing executable binaries to a group of servers participating in MapReduce-like cooperative computations [1], [4], [6], and so on.

Multicast benefits group communications by both saving network traffic and improving application throughput. Though network-level Multicast has borne a notorious reputation during the past two decades due to deployment obstacles and many open issues such as congestion control, pricing models and security concerns, recently there has been a noticeable resurgence of

The work in this paper is supported by the National Basic Research Program of China (973 Program) under Grants 2011CB302900 and 2009CB320501, the National Natural Science Foundation of China (No. 61073166), the National High-Tech Research and Development Program of China (863 Program) under Grant 2009AA01Z251, and the National Science & Technology Pillar Program of China under Grant 2008BAH37B03.

it, e.g., the successful application of IP Multicast in IPTV networks [8] and enterprise networks. The managed environment of data centers also provides a good opportunity for Multicast deployment. However, existing Multicast protocols built into data center switches/servers are primarily based on the Multicast design for the Internet. Before the wide application of Multicast in data centers, we need to carefully investigate whether these Internet-oriented Multicast technologies can well accommodate data center networks.

In this paper we explore network-level Multicast routing, which is responsible for building the Multicast delivery tree, in future data center networks. Bandwidth-hungry, large-scale data center applications call for efficient and scalable Multicast routing schemes. However, we find that the technical trend of future data center design poses new challenges to achieving these goals.

First, data center topologies usually expose high link density. For example, in BCube [9], both switches and servers have multiple ports for interconnection; while in PortLand [10] and VL2 [11], several levels of switches with tens of ports are used to connect the large population of servers. There are many equal-cost paths between a pair of servers or switches. In this scenario, Multicast trees formed by traditional independent receiver-driven Multicast routing can result in severe link waste compared with efficient ones.

Second, low-end commodity switches are largely used in most data center designs for economic and scalability considerations [12]. The routing-table memory space in these switches is quite limited. Previous investigation shows that typical access switches can hold no more than 1500 Multicast group states [13]. Besides, it is difficult to aggregate in-switch Multicast routing entries, since a Multicast address embeds no topological information [19]. Hence, it is quite challenging to support a large number of Multicast groups in data center networks, especially considering the massive number of groups generated by file chunk replication.

We address the challenges above by exploiting the features of future data center networks. Leveraging the managed environment and the multistage graph structure of data center topologies, we build the Multicast tree via a source-to-receiver expansion approach on a Multicast Manager. This approach overcomes the problem in receiver-driven Multicast routing, since it excludes many unnecessary intermediate switches from the formed Multicast tree. Compared with the general Steiner-tree algorithm, our approach enjoys significantly lower computation complexity, which is critical for online tree building. Besides, given dynamic receiver join/leave, tree links can be gracefully added/removed without reforming the whole tree, which helps avoid out-of-order packets during Multicast delivery.


Evaluation results show that our algorithm saves 40%∼50% of network traffic compared with receiver-driven Multicast routing, and takes orders of magnitude less computation time than the general Steiner-tree algorithm.

In-packet Bloom Filters eliminate the necessity of in-switch routing entries, which helps achieve scalable Multicast routing upon low-end switches in data centers. However, both the in-packet Bloom Filter field and the traffic leakage during packet forwarding result in bandwidth waste. To make the tradeoff, we combine in-packet Bloom Filters and in-switch entries to realize scalable Multicast routing in data center networks. Specifically, in-packet Bloom Filters are used for small-sized groups to save routing space in switches, while routing entries are installed into switches for large groups to alleviate the bandwidth overhead. Our combination approach aligns well with the fact that small groups are the most common in data center group communications, represented by file chunk replication in distributed file systems. For the in-packet Bloom Filter routing, we use a node-based Bloom Filter to encode the tree and design a provably loop-free forwarding scheme. We have implemented a software-based combination forwarding engine on the Linux platform. Experimental results on a test bed show that the forwarding engine brings less than 4% CPU overhead at high-speed data delivery.

The rest of this paper is organized as follows. Section II introduces the background and related work of data center Multicast routing. Section III presents our approach to building efficient Multicast trees. Section IV discusses the scalable Multicast routing design. Section V concludes the paper.

II. BACKGROUND AND RELATED WORK

In this section we introduce the background and related work on data center Multicast routing, including data center network architecture design, Multicast tree formation, as well as scalable Multicast routing.

A. Data Center Network Architecture

In current practice, data-center servers are connected by a tree hierarchy of Ethernet switches, with commodity ones at the first level and increasingly larger and more expensive ones at higher levels. It is well known that this kind of tree structure suffers from many problems. The top-level switches are the bandwidth bottleneck, and high-end high-speed switches have to be used. Moreover, a high-level switch is a single point of failure for its subtree branch. Using redundant switches does not fundamentally solve the problem but incurs even higher cost. Recently there is a growing interest in the community in designing new data center network architectures with full bisection bandwidth to replace the tree structure, represented by BCube [9], PortLand [10] and VL2 [11].

BCube: BCube is a server-centric interconnection topology, which targets shipping-container-sized data centers, typically with 1K-4K servers. BCube is constructed in a recursive way. A BCube(n,0) is simply n servers connecting to an n-port switch. A BCube(n,1) is constructed from n BCube(n,0)s and n n-port switches.

Fig. 1. A BCube(4,1) architecture.

Fig. 2. A PortLand architecture with 4-port switches.

More generically, a BCube(n,k) (k ≥ 1) is constructed from n BCube(n,k−1)s and n^k n-port switches. Each server in a BCube(n,k) has k+1 ports. Fig. 1 illustrates a BCube(4,1) topology, which is composed of 16 servers and two levels of switches.

PortLand: Fig. 2 illustrates the topology of PortLand, which organizes the switches into a three-level Fat-Tree structure. There are n pods (n = 4 in the example), each containing two levels of n/2 switches, i.e., the edge level and the aggregation level. Each n-port switch at the edge level uses n/2 ports to connect n/2 servers, and uses the remaining n/2 ports to connect the n/2 aggregation-level switches in the pod. At the core level, there are (n/2)^2 n-port switches, and each core switch has one port connecting to each pod. Therefore, the total number of servers supported in PortLand is n^3/4.
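As a quick sanity check of these counts, the following minimal sketch (our own illustration, not from the paper) computes the element counts of a PortLand/Fat-Tree built from n-port switches:

def fattree_counts(n):
    # n-port switches, n assumed even; formulas follow the text above
    pods = n
    edge = agg = n // 2            # switches per pod at each of the two lower levels
    core = (n // 2) ** 2
    servers = n ** 3 // 4          # n/2 servers under each edge switch
    switches = pods * (edge + agg) + core
    return servers, switches

# n = 48 gives (27648, 2880) servers/switches; the server count matches the
# PortLand topology used later in the evaluation of Section III-D.
print(fattree_counts(48))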

VL2: The VL2 architecture is organized in a similar fashion to PortLand. The difference is that in VL2, each edge-level switch uses two higher-speed uplinks (e.g., 10GE links) to connect to the middle-level switches, and the top two layers of switches are connected as a complete bipartite graph. Hence, the wiring cost of VL2 is lower than that of PortLand.

In spite of the different interconnection structures, consistent themes lie behind recently proposed data center network architectures. First, low-end switches are used for server interconnection, instead of high-end expensive ones. Second, high link density exists in these networks, since a large number of switches with tens of ports are used, or both servers and switches use multiple ports for interconnection. Third, the data center structure is built in a hierarchical and regular way on the large population of servers/switches. These technical trends pose new challenges as well as opportunities for Multicast routing design, which we explore later.

B. Multicast Tree Formation

Building a Multicast tree with the lowest cost covering a given set of nodes on a general graph is well known as the Steiner Tree problem [14]. The problem is NP-Hard and there are many approximate solutions [15], [16]. For the Internet, Multicast routing protocols are designed to build the delivery tree, among which PIM is the most widely used.


In PIM-SM [17] or PIM-SSM [18], receivers independently send group join/leave requests to a rendezvous point or the source node, and the Multicast tree is thus formed by reverse Unicast routing at the intermediate routers.

For data center Multicast, BCube [9] proposes an algorithm to build server-based Multicast trees, where switches are used only as dummy crossbars. But obviously, network-level Multicast with switches involved can save much more bandwidth than the server-based one. In VL2 [11], it is suggested that traditional IP Multicast protocols be used for tree building, as in the Internet. In PortLand [10], a simple tree-building algorithm is also designed. In this paper, we target efficient Multicast tree formation in generic data center networks, by exploiting the technical trends in data center design.

C. Scalable Multicast Routing

As one of the obstacles to deploying Multicast in the Internet, scalable Multicast routing has attracted much attention in the research community. For data center networks where low-end switches with limited routing space are used, the problem is even more challenging. One possible solution is to aggregate a number of Multicast routing entries into a single one, as done in Unicast. However, Multicast entry aggregation is difficult because the Multicast group address is a logical address without topological information [19].

Bloom Filters can be used to compress in-switch Multicast routing entries. In FRM [20], Bloom Filter based group information is maintained at border routers to help determine inter-domain Multicast packet forwarding. A similar idea is adopted in BUFFALO [21], though it is primarily designed for scalable Unicast routing. However, this approach usually requires one Bloom Filter for each interface. In data center networks where high-density switches are used (e.g., 48-port switches), the Bloom Filter based Multicast routing entries still occupy considerable memory space. In addition, traffic leakage always exists due to false positives when looking up Bloom Filter based routing entries.

Another solution is to encode the tree information into an in-packet Bloom Filter, so that there is no need to install any Multicast routing entries in network equipment. For instance, LIPSIN [22] adopts this scheme. However, the in-packet Bloom Filter field brings network bandwidth cost. Besides, since the Bloom Filter size is quite limited, when the group size becomes large, the traffic leakage from false-positive forwarding is significant. In this paper, we achieve scalable Multicast routing by making the tradeoff between the number of Multicast groups supported and the additional bandwidth overhead. By exploiting the features of future data center networks, we use a node-based Bloom Filter to encode the tree instead of the link-based one in LIPSIN, and design a loop-free forwarding scheme.

In the recent work of MCMD [7], scalable data center Multicast is realized in the way that only a selection of groups are supported by Multicast according to the hardware capacity, while the other groups are translated into Unicast communications.

Our work differs from MCMD in that we use Multicast to support all group communications instead of turning to Unicast, since Multicast exposes significant gains over Unicast in traffic saving and throughput enhancement.

III. EFFICIENT MULTICAST TREE BUILDING

Efficient Multicast trees are required to save network traffic and accordingly reduce the task finish time of data center applications. In this section we present how to build Multicast trees in data center networks.

A. The Problem

In densely connected data center networks, there are often a large number of tree candidates for a group. Given multiple equal-cost paths between servers/switches, it is undesirable to run traditional receiver-driven Multicast routing protocols such as PIM for tree building, because independent path selection by receivers can result in many unnecessary intermediate links. Without loss of generality, we take an example in the BCube topology shown in Fig. 1. Assume the receiver set is {v5, v6, v9, v10} and the sender is v0. If using receiver-driven Multicast routing, the resultant Multicast tree can have 14 links as follows (we represent the tree as the paths from the sender to each receiver):

v0 → w0 → v1 → w5 → v5,
v0 → w4 → v4 → w1 → v6,
v0 → w4 → v8 → w2 → v9,
v0 → w0 → v2 → w6 → v10.

However, an efficient Multicast tree for this case includes only 9 links if we construct it in the following way:

v0 → w0 → v1 → w5 → v5,
v0 → w0 → v2 → w6 → v6,
v0 → w0 → v1 → w5 → v9,
v0 → w0 → v2 → w6 → v10.

To eliminate the unnecessary switches used in independent receiver-driven Multicast routing, we propose to build the Multicast tree in a managed way. Receivers send join/leave requests to a Multicast Manager. The Multicast Manager then calculates the Multicast tree based on the data center topology and the group membership distribution. Data center topologies are regular graphs, and the Multicast Manager can easily maintain the topology information (with failure management).

The problem then becomes how to calculate an efficient Multicast tree on the Multicast Manager. Noting that building the Steiner tree in general graphs is NP-Hard, we prove that the Steiner Tree problem in typical data center architectures such as BCube is also NP-Hard.

Theorem 1: The Steiner tree problem in a BCube network is NP-Hard.

Proof: A BCube(n,k) network with n = 2, i.e., a BCube network where all switches have two ports, is simply a hypercube network of dimension k+1, if we treat each server as a hypercube node and each switch as an interconnection edge between its two adjacent servers. Since the Steiner tree problem in the hypercube is NP-Hard [25], it is also NP-Hard in BCube(2,k). Therefore, the Steiner tree problem in a generic BCube network is NP-Hard.


Fig. 3. Exemplified group spanning graphs in BCube (a) and PortLand (b). The sender is v0, and the receiver set is {v1, v5, v9, v10, v11, v12, v14}. The corresponding data center topologies are shown in Fig. 1 and Fig. 2, respectively. The bolded links illustrate the tree formed upon the group spanning graph.

In order to meet the requirement of online tree building for large groups, we develop an approximate algorithm by exploiting the topological features of future data center networks. Next we present our tree-building approach.

B. Source-Driven Tree Building

We observe that recently proposed data center architectures (BCube, PortLand, VL2) use several levels of switches for server interconnection, and switches within the same level are not directly connected. Hence, they are multistage graphs [23]. Based on this feature, the possible paths from the Multicast source to all receivers can be expanded into a directed multistage graph with d+1 stages (from stage 0 to stage d), where d is the diameter of the network topology. We call it the group spanning graph. Stage 0 includes the sender only, and stage d is composed of receivers. Note that any node appears at most once in the group spanning graph. For example, if the sender is v0 and the receiver set is {v1, v5, v9, v10, v11, v12, v14}, the group spanning graph in the BCube topology of Fig. 1 is shown in Fig. 3(a). For the same Multicast group in the PortLand topology of Fig. 2, the group spanning graph is shown in Fig. 3(b). Group spanning graphs in VL2 can be obtained in the same way.
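To make the construction concrete, the following sketch (our own illustration; the Multicast Manager's actual code is not specified in the paper) derives a group spanning graph by breadth-first expansion from the sender and pruning of branches that lead to no receiver:

from collections import deque, defaultdict

def group_spanning_graph(adj, sender, receivers):
    """adj: dict mapping each server/switch to its neighbors (the topology)."""
    stage = {sender: 0}
    succ = defaultdict(set)              # directed edges between consecutive stages
    queue = deque([sender])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in stage:           # first visit fixes the node's stage
                stage[v] = stage[u] + 1
                queue.append(v)
            if stage[v] == stage[u] + 1: # keep only forward, stage-to-stage edges
                succ[u].add(v)
    # prune nodes whose downstream sub-graph contains no receiver
    useful = set(receivers) | {sender}
    for u in sorted(stage, key=stage.get, reverse=True):
        if succ[u] & useful:
            useful.add(u)
    return {u: succ[u] & useful for u in useful}, stage

The tree-building procedures below then select, stage by stage, a node set that strictly covers the downstream receivers.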

We make some definitions on the group spanning graph.

A covers B (or B is covered by A): For any two node sets A and B in a group spanning graph, we say that A covers B (or B is covered by A) if and only if for each node j ∈ B, there exists a directed path from some node i ∈ A to j in the group spanning graph.

A strictly covers B (or B is strictly covered by A): If A covers B and no proper subset of A covers B, we say that A strictly covers B (or B is strictly covered by A).

We propose to build the Multicast tree in a source-to-receiver expansion manner upon the group spanning graph, with the tree node set selected from each stage strictly covering the downstream receivers. Multicast trees built by our approach have two merits. First, many unnecessary intermediate switches used in receiver-driven Multicast routing are eliminated. Second, the source-to-receiver latency is bounded by the number of stages of the group spanning graph, i.e., the diameter of the data center topology, which favors delay-sensitive applications such as redirecting search queries to indexing servers.

The complexity of the algorithm primarily comes from how to select the node set from each stage of the group spanning graph, which is covered by the node set in the previous stage and strictly covers the downstream receivers. Generally speaking, this is an NP-Hard problem. However, we can leverage the hierarchically constructed, regular data center topologies to design approximate algorithms. In what follows, we discuss the tree-building algorithm based on the group spanning graph in BCube and PortLand, respectively. The algorithm in VL2 is similar to that in PortLand.

BCube: The tree node selection in a BCube network can be conducted in an iterative way on the group spanning graph. For a BCube(n,k) with sender s, we first select the set of servers from stage 2 of the group spanning graph which are covered by both s and a single switch in stage 1. Assume the server set selected in stage 2 is E, and the switch selected in stage 1 is W. The tree node set for the BCube(n,k) is then the union of the tree node sets for |E| + 1 BCube(n,k−1)s. |E| of the BCube(n,k−1)s each have a server p ∈ E as the source, and the receivers in stage 2(k+1) covered by p as the receiver set. The remaining BCube(n,k−1) has s as the source, and the receivers in stage 2k which are covered by s but not by W as the receiver set. In the same way, we can further get the tree node set in each BCube(n,k−1) by dividing it into several BCube(n,k−2)s. The process iterates until all the BCube(n,0)s are reached. Hence, the computation complexity is O(N), where N is the total number of servers in BCube.

Take Fig. 3(a) as an example. For the BCube(4,1), we first choose the set of servers {v1, v2, v3} from stage 2, which are covered by v0 and a single switch W0 in stage 1. Then, we compute the trees for four BCube(4,0)s. The first is from v1 to v5 and v9, the second is from v2 to v10 and v14, the third is from v3 to v11, and the fourth is from v0 to v12. The bolded links in the graph show the final tree we construct for the BCube(4,1).
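A compact way to realize this expansion is to label each BCube server by its (k+1)-digit base-n address, so that a level-i switch connects the n servers differing only in digit i. The sketch below is our own digit-ordered interpretation of the procedure (correcting digits from level 0 upward, as in the walkthrough above); the Multicast Manager may order the levels differently.

def build_bcube_tree(sender, receivers, level=0, links=None):
    """Source-to-receiver expansion in BCube(n,k); sender/receivers are tuples
    of k+1 base-n digits (digit i is the level-i coordinate). Returns tree links."""
    if links is None:
        links = set()
    receivers = [r for r in receivers if r != sender]
    if not receivers:
        return links
    # Partition the downstream receivers by their digit at the current level.
    groups = {}
    for r in receivers:
        groups.setdefault(r[level], []).append(r)
    # The single stage-1 switch used at this level, identified by the other digits.
    switch = ('switch', level, sender[:level] + sender[level + 1:])
    for digit, subset in groups.items():
        if digit == sender[level]:
            # Same sub-BCube as the sender: keep s as the source, move to the next level.
            build_bcube_tree(sender, subset, level + 1, links)
        else:
            # Relay through the level switch to the neighbor that already has the
            # right digit; it then acts as the source of a smaller BCube.
            relay = sender[:level] + (digit,) + sender[level + 1:]
            links.add((sender, switch))
            links.add((switch, relay))
            build_bcube_tree(relay, subset, level + 1, links)
    return links

# Fig. 1 example: v0 = (0, 0) and receivers {v5, v6, v9, v10} = {(1, 1), (2, 1),
# (1, 2), (2, 2)} with digits ordered (level-0, level-1); the call below yields
# the 9-link tree listed in Section III-A.
tree = build_bcube_tree((0, 0), [(1, 1), (2, 1), (1, 2), (2, 2)])

In this sketch each receiver's path depends only on the sender and the receiver's own digits, which is also what makes the incremental join/leave handling of Section III-C straightforward.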

PortLand: There are at most 6 hops in a group spanning graph of PortLand, since the network diameter is 6. The tree node selection approach for each stage is as follows. From the first stage to the stage of core-level switches, any single path can be chosen, because any single core-level switch can cover the downstream receivers. From the stage of core-level switches to the final stage of receivers, the paths are fixed due to the interconnection rule of PortLand. The computation complexity is also O(N).

For instance, on the group spanning graph of Fig. 3(b), we first choose a single path from v0 to a core-level switch W16, and then W16 covers all downstream receivers via the unique downward paths.


v1 is added as a receiver without traversing core-level switches. The final tree is also labeled by the bolded links in the group spanning graph.
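Under a standard Fat-Tree labeling (our assumption: server = (pod, edge, host), edge switch = ('edge', pod, e), aggregation switch = ('agg', pod, a), and core switch = ('core', a, c) attached to aggregation switch a of every pod), the PortLand selection above reduces to fixing one upward path and reusing the unique downward paths:

def build_fattree_tree(sender, receivers, a=0, c=0):
    """One Multicast tree per group; (a, c) picks the single upward path."""
    s_pod, s_edge, _ = sender
    links = set()
    for r in receivers:
        r_pod, r_edge, _ = r
        if (r_pod, r_edge) == (s_pod, s_edge):      # same edge switch as the sender
            path = [sender, ('edge', s_pod, s_edge), r]
        elif r_pod == s_pod:                        # same pod: turn at one aggregation switch
            path = [sender, ('edge', s_pod, s_edge), ('agg', s_pod, a),
                    ('edge', r_pod, r_edge), r]
        else:                                       # other pod: go through the single core switch
            path = [sender, ('edge', s_pod, s_edge), ('agg', s_pod, a), ('core', a, c),
                    ('agg', r_pod, a), ('edge', r_pod, r_edge), r]
        links.update(zip(path, path[1:]))           # shared path prefixes merge automatically
    return links

Because every out-of-pod path reuses the same aggregation and core switch, the union of the per-receiver paths is itself a tree, matching the construction shown in Fig. 3(b).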

C. Dynamic Receiver Join/Leave

When Multicast receivers dynamically join or leave a group, the Multicast tree should be rebuilt to reflect the group dynamics. Our tree-building algorithm can gracefully embrace this case, since a dynamic receiver join/leave does not change the source-to-end paths of the other receivers in the group. This is important for avoiding out-of-order packets during packet delivery.

BCube: When a new receiver rj joins an existing group in a BCube(n,k), we first recompute the group spanning graph involving rj. Then, in the group spanning graph, we check whether there is a BCube(n,0) used when calculating the previous Multicast tree that can hold rj. If so, we add rj to that BCube(n,0). Otherwise, we try to find a BCube(n,1) used when calculating the previous Multicast tree that can hold rj, and add to it a BCube(n,0) containing rj. If we cannot find such a BCube(n,1), we try to find a BCube(n,2) and add a corresponding BCube(n,1), and so forth until we successfully add rj to the Multicast tree. In this way, the final tree obeys our tree-building algorithm, and we do not need to change the source-to-end paths of the existing receivers in the Multicast group.

When a receiver rl leaves a group in a BCube(n,k), we first regenerate the group spanning graph by eliminating rl. Then, if the deletion of rl leaves zero BCube(n,m−1)s in a BCube(n,m) of the group spanning graph used when calculating the previous Multicast tree, we also eliminate the nodes of that BCube(n,m). Of course, this process does not change the source-to-end paths of the other receivers, either.
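In the digit-ordered sketch shown earlier in Section III-B, each receiver's source-to-end path is determined only by the sender and that receiver, so join/leave can be handled incrementally with per-link reference counts; a small illustration (again our own, not the paper's exact procedure):

from collections import Counter

class BCubeGroup:
    def __init__(self, sender):
        self.sender = sender
        self.refcount = Counter()          # tree link -> number of receivers using it

    def join(self, receiver):
        for link in build_bcube_tree(self.sender, [receiver]):
            self.refcount[link] += 1       # links already in the tree are just re-counted

    def leave(self, receiver):
        for link in build_bcube_tree(self.sender, [receiver]):
            self.refcount[link] -= 1
            if self.refcount[link] == 0:
                del self.refcount[link]    # prune branches no remaining receiver uses

    def tree_links(self):
        return set(self.refcount)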

PortLand: If a receiver rj joins a group in PortLand, it is added at the final stage of the group spanning graph. In the new tree, we need to do nothing except adding the path from the previously selected core-level switch to rj. Note that the path from the core-level switch to rj is fixed, and it may overlap with some links in the previous Multicast tree. Similarly, when a receiver rl leaves a group in PortLand, we delete it from the final stage of the group spanning graph. In addition, we exclude the tree links that have zero downstream receivers after the removal of rl. In this way, receiver join/leave in PortLand does not affect the source-to-end paths of the other receivers in the Multicast session. The conclusion also holds for VL2, since it uses a Fat-Tree-like structure.

D. Evaluation

We evaluate our tree-building approach in two aspects by simulations. First, we compare the number of links in the trees formed by our algorithm, a typical Steiner-Tree algorithm, and receiver-driven Multicast routing, respectively. Second, we compare the computation time of our algorithm and the typical Steiner-Tree algorithm.

We use a BCube(8,3) (4096 servers in total) and a PortLand built from 48-port switches (27,648 servers in total) as the representative data center topologies in the simulation. The speed of all links is 1Gbps.

Fig. 4. CDF of groups occupying different numbers of links by our algorithm, the Steiner-Tree algorithm and receiver-driven Multicast routing. (a) BCube; (b) PortLand.

200 random-sized groups are generated for each network. The sender and receivers of a group are randomly selected from all the data center servers. The different algorithms are then used for tree building on each group.

Number of Tree Links: We first check the number of links occupied by these tree formation algorithms. The Steiner-Tree algorithm we choose is the one described in [15], which has a fast computation speed.¹ The algorithm works as follows. First, a virtual complete graph with costs is generated upon the group members; then, a minimum spanning tree is calculated on the virtual complete graph; and finally, each virtual link in the virtual complete graph is replaced by the shortest path between the two group members in the original topology, with unnecessary links deleted. In receiver-driven Multicast routing, each receiver independently selects a path to join the source-rooted tree. In case of multiple equal-cost paths, a random one is chosen.
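For reference, this baseline can be sketched in a few lines with networkx (our own rendering of the KMB-style procedure of [15]; the paper does not specify its implementation):

import itertools
import networkx as nx

def kmb_steiner_tree(G, terminals):
    # 1. Virtual complete graph over the group members, weighted by shortest-path length.
    K = nx.Graph()
    for u, v in itertools.combinations(terminals, 2):
        K.add_edge(u, v, weight=nx.shortest_path_length(G, u, v))
    # 2. Minimum spanning tree of the virtual graph.
    mst = nx.minimum_spanning_tree(K)
    # 3. Replace each virtual link by a real shortest path, then delete unnecessary
    #    links: take a spanning tree of the resulting subgraph and prune non-member leaves.
    T = nx.Graph()
    for u, v in mst.edges():
        nx.add_path(T, nx.shortest_path(G, u, v))
    T = nx.minimum_spanning_tree(T)
    members = set(terminals)
    leaves = [n for n in T if T.degree(n) == 1 and n not in members]
    while leaves:
        T.remove_nodes_from(leaves)
        leaves = [n for n in T if T.degree(n) == 1 and n not in members]
    return T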

Fig. 4 shows the CDF (Cumulative Distribution Function) of groups occupying different numbers of links. Our algorithm has performance similar to the Steiner-tree algorithm, and both save 40%∼50% of links compared with receiver-driven Multicast routing. This matches our expectation, since independent receiver-driven path selection results in many unnecessary intermediate switches in a densely connected network environment.

Then we compare the number of tree links between our algorithm and the Steiner-tree algorithm. Fig. 4(a) shows that there is a small difference between the two algorithms in the BCube network. When the group size is small, the Steiner-tree algorithm is better.

¹There are algorithms with a better approximation ratio but higher computation complexity, such as [16].


Fig. 5. CDF of groups taking different computation time by our algorithm and the Steiner-Tree algorithm. (a) BCube; (b) PortLand.

This is because, in that case, the group members are diversely distributed within the network, and our tree-building algorithm restricts the tree depth to the number of stages in the group spanning graph, while there is no such restriction in the Steiner-tree algorithm. For larger groups, however, our algorithm better exploits the topological feature of BCube, and the resultant tree has fewer links than that of the Steiner-tree algorithm, even with the limitation on tree depth. In PortLand, the two algorithms perform exactly the same, as shown in Fig. 4(b). This is because in the Fat-Tree structure, both algorithms can form the optimal Multicast tree.

Computation Time: Fig. 5 illustrates the CDF of groups taking different computation time to form the tree under our algorithm and the Steiner-tree algorithm. We run the algorithms on a desktop with an AMD Athlon(tm) II X2 245 2.91GHz CPU and 2GB DRAM. The computation speed of our algorithm is shown to be orders of magnitude faster than that of the Steiner-tree algorithm. The reason is that our algorithm fully leverages the topological features of data center topologies. Under the Steiner-tree algorithm, only about 32% of the groups in BCube and 47% of the groups in PortLand, among all the randomly generated groups, can form the tree within one second. In contrast, our algorithm finishes the tree computation for all groups within 20ms in BCube and 260ms in PortLand.

IV. SCALABLE MULTICAST ROUTING

A desirable Multicast routing scheme not only builds an efficient Multicast tree for each group, but also scales to a large number of Multicast groups. In this section we discuss the scalability issues in data center Multicast routing.

A. Combination Routing Scheme

It is challenging to support massive Multicast groups in the traditional way of installing all the Multicast routing entries into switches, especially on the low-end commodity switches commonly used in data centers. The in-packet Bloom Filter is an alternative choice, since it eliminates the requirement of in-switch entries. However, the in-packet Bloom Filter can cause bandwidth waste during Multicast forwarding.

Bandwidth Overhead Ratio: The bandwidth waste of the in-packet Bloom Filter comes from two aspects. First, the Bloom Filter field in the packet brings network bandwidth cost. Second, false-positive forwarding by the Bloom Filter causes traffic leakage. Moreover, switches receiving packets by false-positive forwarding may further forward the packets to other switches, incurring not only additional traffic leakage but also possible loops. We define the Bandwidth Overhead Ratio of the in-packet Bloom Filter, r, as the ratio of the additional traffic caused by the in-packet Bloom Filter to the actual payload traffic to carry. Assume the packet length (including the Bloom Filter field) is p, the length of the in-packet Bloom Filter field is f, the number of links in the Multicast tree is t, and the number of actual links covered by Bloom Filter based forwarding is c; then r is calculated as Eq. (1).

r = (t ∗ f + (c − t) ∗ p) / (t ∗ (p − f))    (1)

In Eq. (1), t ∗ f is the additional traffic in the Multicast tree resulting from the Bloom Filter field in the packet, (c − t) ∗ p is the total traffic carried by the links beyond the tree, and t ∗ (p − f) is the actual payload traffic on the tree. To reduce the bandwidth overhead of the in-packet Bloom Filter, we need to either control the false-positive ratio during packet forwarding or limit the size of the Bloom Filter field. However, the two goals conflict with each other: for a given group size, reducing the false-positive ratio implies enlarging the Bloom Filter field, and vice versa.
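As a numeric illustration of Eq. (1) (our own helper; the example numbers are chosen for illustration only):

def bandwidth_overhead_ratio(p, f, t, c):
    """p: packet length incl. the Bloom Filter field; f: Bloom Filter field length;
    t: links in the Multicast tree; c: links actually covered by Bloom Filter
    forwarding (c >= t because of false positives)."""
    return (t * f + (c - t) * p) / (t * (p - f))

# A 1500-byte packet with a 32-byte filter on a 9-link tree plus one
# false-positive link: (9*32 + 1*1500) / (9*1468) ≈ 0.135, i.e., about 13.5%.
print(bandwidth_overhead_ratio(p=1500, f=32, t=9, c=10))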

Simulation: We measure the bandwidth overhead ratio of Bloom Filter based routing against the Bloom Filter length and the group size, by simulations in a BCube(8,3) and a PortLand with 48-port switches, respectively. For each group size, the group members are randomly selected. Note that in either BCube or PortLand, there are often multiple equal-cost trees for a given group; we choose the one with the minimum bandwidth overhead ratio. The total packet size is set to 1500 bytes, which is the typical MTU in Ethernet.

The result is shown in Fig. 6, from which we have the following observations. First, for a given group size, there is an “optimal” length of the in-packet Bloom Filter which minimizes the bandwidth overhead ratio. When the Bloom Filter is shorter than the optimal value, false-positive forwarding is the major contributor to the bandwidth overhead ratio. But when the length grows beyond the optimal value, the Bloom Filter field itself dominates the bandwidth overhead. Second, for a given length of the in-packet Bloom Filter, the bandwidth overhead ratio increases with the group size. This is straightforward, because more elements in the Bloom Filter result in more false positives during packet forwarding.


Fig. 6. Bandwidth overhead ratio against in-packet Bloom Filter length and group size (3 to 100 receivers per group). (a) BCube; (b) PortLand.

We find that when the group size is larger than 100, the minimum bandwidth overhead ratio becomes higher than 25%. Hence, the in-packet Bloom Filter causes significant bandwidth overhead for large groups in both BCube and PortLand. Though there are proposed schemes to reduce the traffic leakage during Bloom Filter forwarding, such as using multiple identifiers for a single link or a virtual identifier for several links [22], they cannot fundamentally solve the problem in data center networks, where the link density is high and the server population is huge.

Group Size Pattern: We investigate the group size distribution in typical data center applications that generate group communication patterns. In distributed file systems [1], [2], large files are divided into chunks of 64MB or 100MB. In a data center with Petabytes of files (such as the Hadoop cluster at Facebook [24]), there can be as many as billions of small groups for chunk replication if all these files are put into the distributed file system. But for large groups, such as binary distribution to thousands of servers in Map-Reduce computations, the number is quite limited, because too many concurrent Map-Reduce computations defer the task finish time. As a result, we envision that small groups are the most common in typical data center applications.

Our Approach: Hence, we propose to combine in-packet Bloom Filters with in-switch routing entries for scalable Multicast routing in data center networks. Specifically, in-packet Bloom Filters are used for small-sized groups to save routing space in switches, while routing entries are installed into switches for large groups to alleviate the bandwidth overhead. A predefined group size, z, is used to differentiate the two types of routing schemes.

Fig. 7. An example to show the tree ambiguity problem in node-based Bloom Filters.

The value of z is globally set to accommodate the routing space on data center switches and the actual group size distribution among data center applications. In this way, our combination routing approach aligns well with the group size pattern of data center applications.

Intermediate switches/servers receiving a Multicast packet check a special TAG in the packet to determine whether to forward the packet via the in-packet Bloom Filter or by looking up the in-switch forwarding table. Before implementing the combination forwarding engine, we explore two important design issues of the in-packet Bloom Filter, i.e., the tree encoding scheme and loop avoidance.
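The dispatch rule can be summarized by the following self-contained sketch; the field names and the TAG representation are our assumptions for illustration, not the actual wire format.

from dataclasses import dataclass, field

Z = 50                                   # hypothetical global group-size threshold z

@dataclass
class MulticastPacket:
    group: str
    use_bloom_filter: bool               # the special TAG: True for small groups (size <= Z)
    bloom_filter: frozenset = frozenset()  # stand-in for the real in-packet bit array

@dataclass
class Switch:
    neighbors: dict                      # neighbor identifier -> output port
    mcast_table: dict = field(default_factory=dict)   # group -> output ports (large groups only)

    def output_ports(self, pkt):
        if pkt.use_bloom_filter:
            # Small group: the tree is carried in the packet; test each neighbor.
            return [port for nbr, port in self.neighbors.items() if nbr in pkt.bloom_filter]
        # Large group: fall back to the conventional in-switch Multicast entry.
        return self.mcast_table.get(pkt.group, [])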

B. Tree Encoding Scheme

There are two ways to encode the tree information with an in-packet Bloom Filter, i.e., node-based encoding and link-based encoding. In node-based Bloom Filters, the elements are the tree nodes, including both switches and servers; in link-based Bloom Filters, the elements are the directed physical links.

In general graphs, ambiguity may exist in node-based Bloom Filters when encoding the tree structure. Fig. 7 shows a simple example. Fig. 7(a) shows the topology of the graph. Assume the Multicast sender is node A, and nodes B and C are the two receivers. There are three possible Multicast trees for the group. The first tree includes link A → B and link A → C, shown in Fig. 7(b); the second one includes link A → B and link B → C, shown in Fig. 7(c); and the third one is composed of link A → C and link C → B, shown in Fig. 7(d). However, with node-based encoding, the Bloom Filter is the same for all three trees. The problem is obvious: when node B (or C) receives the packet from node A, it has no idea whether to forward the packet to node C (or B) or not. We call this problem the tree ambiguity problem.

Link-based Bloom Filters do not have the tree ambiguity problem, because each tree link is explicitly identified. Hence, LIPSIN [22] uses link-based encoding instead of node-based encoding. However, the tree ambiguity problem of node-based encoding does not exist if the topology and the Multicast tree satisfy certain requirements.

Theorem 2: Given a tree T built on top of a topology G, the tree ambiguity problem for node-based encoding is avoided provided that, except for the tree links, there are no other links in G directly connecting any two nodes in T.

Proof: When a node in T receives the Multicast packet, it forwards the packet to its physically neighboring nodes in T. Since there is no link in G between any two nodes in T other than the tree links, the forwarding process will not cover any links beyond the tree links in T. Hence, the Bloom Filter can uniquely identify the tree, and there is no tree ambiguity problem.


Interestingly, we find that recently proposed data center networks, including BCube, PortLand and VL2, together with our tree-building algorithm upon them presented in Section III, can well meet the requirement of Theorem 2 (please refer to Appendix A). As a result, both node-based encoding and link-based encoding can be used in our in-packet Bloom Filter. We choose node-based encoding since it can ride on the existing identifiers of servers/switches, such as MAC or IP addresses, for node identification.
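A minimal node-based Bloom Filter might look as follows; the field size, the number of hash functions and the use of SHA-1 are our assumptions, since the paper fixes only the idea of hashing node identifiers (e.g., MAC or IP addresses) into a per-packet bit array.

import hashlib

class NodeBloomFilter:
    def __init__(self, size_bits=256, num_hashes=4):
        self.m, self.k = size_bits, num_hashes
        self.bits = 0                              # the bit array carried in the packet

    def _positions(self, node_id):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{node_id}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.m

    def add(self, node_id):                        # encode one tree node (switch or server)
        for pos in self._positions(node_id):
            self.bits |= 1 << pos

    def __contains__(self, node_id):               # false positives possible, false negatives not
        return all(self.bits >> pos & 1 for pos in self._positions(node_id))

# Encoding a tree: add every node identifier on the Multicast tree and carry
# `bits` in the packet; each hop then tests its neighbors against the filter.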

C. Loop-free Forwarding

A potential problem with the in-packet Bloom Filter is that false-positive forwarding may cause loops, whether node-based or link-based encoding is used. Take Fig. 2 as an example. Assume there is a Multicast group originated from server v0, and switches W8, W16 and W10 are in the Multicast tree. Now W8 forwards the packet to W17 false-positively. W17 checks that W10 is in the Bloom Filter and then forwards the packet to W10. After W10 receives the packet from W17, it finds W16 in the Bloom Filter and further forwards the packet to W16. Then, W16 forwards the packet to W8, since W8 is also in the Bloom Filter. In this way, the packet is forwarded along the path W8 → W17 → W10 → W16 → W8, and a loop forms.

We design a distance-based Bloom Filter forwarding scheme to solve this problem. We define the distance between two nodes as the number of hops on their shortest path in the data center topology. When a node j receives a Multicast packet with an in-packet Bloom Filter originated from server s, j only forwards the packet to those of its neighboring nodes (within the Bloom Filter) whose distances to s are larger than that of j itself.
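The rule at each hop is a one-line filter; the sketch below assumes each node can compute hop-count distances to the source from the regular topology (how it obtains them is not restated here):

def next_hops(node, source, bloom_filter, neighbors, distance):
    """Forward only to in-filter neighbors strictly farther from the source than this node."""
    d_self = distance(node, source)
    return [nbr for nbr in neighbors
            if nbr in bloom_filter                 # Bloom Filter test (may be a false positive)
            and distance(nbr, source) > d_self]    # strictly increasing distance prevents loops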

Theorem 3: The distance-based Bloom Filter forwarding scheme is both correct and loop-free.

Proof: We first prove the correctness. In the Multicast tree built upon the group spanning graph, a child node k of any node i cannot be closer to, or at the same distance from, the source node as i; otherwise, k would appear in a previous stage or the same stage as i in the group spanning graph. Hence, forwarding packets to nodes with larger distances from the source node follows our tree-building approach.

Then we prove the loop-freeness. When a false-positive delivery occurs during the distance-based Bloom Filter forwarding, the falsely-delivered packet is forwarded within at most d hops before being dropped, where d is the network diameter. Consequently, no loop is formed.

In BCube, PortLand or VL2, it is quite easy for each switch or server to determine whether its neighboring nodes are closer to or farther from the Multicast source than itself. Let us check the scenario above again. When W8 falsely forwards the packet to W17 and W17 forwards it to W10, W10 will not send the packet back to W16, because W16 is closer to the source v0 than W10 itself. In this way, the loop is effectively avoided.

D. Implementation and Experiment

Our combination forwarding scheme requires modifying the data plane of switches.

Fig. 8. The experimental test bed.

Fig. 9. The CPU utilization ratio on the forwarder as well as the packet loss ratio against different group data rates.

It has been shown that, if employing the OpenFlow [26] framework, which has already been ported to run on a variety of hardware platforms, such as switches from Cisco, Hewlett Packard, and Juniper, only minor modifications (several lines of code) to the data path of switches are required to support Bloom Filter based forwarding [27]. This supports our belief that our proposed combination forwarding scheme can be well incorporated into existing commodity switches.

At the current stage, we have implemented a software-based combination forwarding engine on the Linux platform to embrace legacy applications using IP Multicast. We insert a 32-byte Bloom Filter field into the IP option field if Bloom Filter based routing is required. For the group communication in chunk replication, where the number of receivers is usually three, a 32-byte Bloom Filter can keep the bandwidth overhead ratio below 7% in BCube(8,3), and 3% in PortLand with 48-port switches (refer to Fig. 6).

We evaluate the performance of our combination forwarding engine on a simple test bed, which is shown in Fig. 8. The test bed is composed of 5 hosts: 1 Multicast sender, 3 Multicast receivers and 1 forwarder. The forwarder has one 2.8GHz Intel(R) Core(TM)2 Duo CPU, 2GB DRAM, and four Realtek RTL8111/8168B PCI Express Gigabit Ethernet NICs, each connecting one of the other four hosts. The forwarder runs Ubuntu Linux 9.10 (karmic) with kernel 2.6.31-14-generic and is equipped with our combination forwarding engine. The sender sends out UDP Multicast packets for a group at different speeds, and the 3 receivers join the group to receive the packets.

Fig. 9 shows the CPU utilization ratio on the forwarder and the packet loss ratio at different packet rates. We can see that our combination forwarding engine keeps the packet loss ratio under 0.1% even at packet speeds close to 1Gbps. Meanwhile, the CPU overhead on the forwarder is less than 4%, demonstrating that the forwarding engine is lightweight.


V. CONCLUSION

In this paper we explored the design of efficient and scalable Multicast routing in future data center networks. Given that receiver-driven Multicast routing does not perform well in densely connected data center networks, we propose an efficient Multicast tree-building algorithm that leverages the multistage-graph feature of data center networks. Our tree-building algorithm not only eliminates many unnecessary intermediate switches in tree formation, but also enjoys much lower computation complexity than the general Steiner-tree algorithm. For scalable Multicast routing on low-end data center switches, we combine in-packet Bloom Filters and in-switch entries to make the tradeoff between the number of Multicast groups supported and the additional bandwidth overhead. Our approach aligns well with the fact that small-sized groups are the most common in data center applications. Node-based encoding and loop-free forwarding are designed for our in-packet Bloom Filter. We have implemented a software-based combination forwarding engine on the Linux platform, which runs at high packet rates with less than 4% CPU overhead.

REFERENCES

[1] Hadoop, http://hadoop.apache.org/
[2] S. Ghemawat, H. Gobioff, and S. Leung, "The Google File System", In Proceedings of ACM SOSP'03, 2003
[3] F. Chang, J. Dean, S. Ghemawat, et al., "Bigtable: A Distributed Storage System for Structured Data", In Proceedings of OSDI'06, 2006
[4] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", In Proceedings of OSDI'04, 2004
[5] T. Hoff, "Google Architecture", http://highscalability.com/google-architecture, Jul 2007
[6] M. Isard, M. Budiu, Y. Yu, et al., "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks", In Proceedings of ACM EuroSys'07, 2007
[7] Y. Vigfusson, H. Abu-Libdeh, M. Balakrishnan, et al., "Dr. Multicast: Rx for Data Center Communication Scalability", In Proceedings of ACM EuroSys'10, Apr 2010
[8] A. Mahimkar, Z. Ge, A. Shaikh, et al., "Towards Automated Performance Diagnosis in a Large IPTV Network", In Proceedings of ACM SIGCOMM'09, 2009
[9] C. Guo, G. Lu, D. Li, et al., "BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers", In Proceedings of ACM SIGCOMM'09, Aug 2009
[10] R. Mysore, A. Pamboris, N. Farrington, et al., "PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric", In Proceedings of ACM SIGCOMM'09, Aug 2009
[11] A. Greenberg, J. Hamilton, N. Jain, et al., "VL2: A Scalable and Flexible Data Center Network", In Proceedings of ACM SIGCOMM'09, Aug 2009
[12] A. Greenberg, J. Hamilton, D. Maltz, et al., "The Cost of a Cloud: Research Problems in Data Center Networks", ACM SIGCOMM CCR, 2009
[13] D. Newman, "10 Gig access switches: Not just packet pushers anymore", Network World, 25(12), Mar 2008
[14] Steiner tree problem, http://en.wikipedia.org/wiki/Steiner_tree_problem
[15] L. Kou, G. Markowsky, and L. Berman, "A Fast Algorithm for Steiner Trees", Acta Informatica, 15:141-145, 1981
[16] G. Robins and A. Zelikovsky, "Tighter Bounds for Graph Steiner Tree Approximation", SIAM Journal on Discrete Mathematics, 19(1):122-134, 2005
[17] D. Estrin, D. Farinacci, A. Helmy, et al., "Protocol Independent Multicast-Sparse Mode (PIM-SM): Protocol Specification", RFC 2362, Jun 1998
[18] S. Bhattacharyya, "An Overview of Source-Specific Multicast (SSM)", RFC 3569, Jul 2003
[19] D. Thaler and M. Handley, "On the Aggregatability of Multicast Forwarding State", In Proceedings of IEEE INFOCOM'00, Mar 2000
[20] S. Ratnasamy, A. Ermolinskiy, and S. Shenker, "Revisiting IP Multicast", In Proceedings of ACM SIGCOMM'06, Aug 2006
[21] M. Yu, A. Fabrikant, and J. Rexford, "BUFFALO: Bloom Filter Forwarding Architecture for Large Organizations", In Proceedings of ACM CoNEXT'09, Dec 2009
[22] P. Jokela, A. Zahemszky, C. Rothenberg, et al., "LIPSIN: Line Speed Publish/Subscribe Inter-Networking", In Proceedings of ACM SIGCOMM'09, Aug 2009
[23] Multistage interconnection networks, http://en.wikipedia.org/wiki/Multistage_interconnection_networks
[24] Hadoop Summit 2010, http://developer.yahoo.com/events/hadoopsummit2010
[25] Y. Lan, A. Esfahanian, and L. Ni, "Multicast in hypercube multiprocessors", Journal of Parallel and Distributed Computing, 8(1):30-41, 1990
[26] OpenFlow, http://www.openflowswitch.org/
[27] C. Rothenberg, C. Macapuna, F. Verdi, et al., "Data center networking with in-packet Bloom filters", In Proceedings of SBRC'10, May 2010

APPENDIX

A. Recently proposed data center networks and our tree-building algorithm upon them presented in Section III can meet the requirement in Theorem 2.

Proof: BCube: In BCube, every link lies between a server and a switch. Based on our algorithm in Section III, during the tree-building process a BCube(n,k) is first divided into BCube(n,k−1)s, then each BCube(n,k−1) is divided into BCube(n,k−2)s, and finally into BCube(n,0)s. When dividing a BCube(n,m) (0 < m ≤ k) into BCube(n,m−1)s, we add into the tree the links which connect the source server of the BCube(n,m) to the source servers of the divided BCube(n,m−1)s through a certain switch. In this way, each switch occurs only once in the dividing process, due to the topological feature of BCube.

Assume there is a link lwv in the topology connecting an in-tree switch w and an in-tree server v. Since switch w is in the tree and used only once in the dividing process, the links from it are either in the tree, connecting lower-level BCubes, or not in the tree because the corresponding lower-level BCube does not contain any receivers. Note that server v is also in the tree, so lwv must be a link in the tree. Consequently, no link in BCube beyond the Multicast tree can connect two tree nodes, and the requirement of Theorem 2 is satisfied.

PortLand: We prove by contradiction. Assume the tree built by our algorithm is T. Note that there are four levels of servers/switches in PortLand, and no link exists within a level or between two non-neighboring levels. If there is a link l beyond the Multicast tree connecting two tree nodes, it can appear only between two neighboring levels.

First, assume l appears between the server level and the edge level. This is impossible, because each group member has a single link connecting it to an edge-level switch, and that link lies in the Multicast tree. Second, assume l appears between the edge level and the aggregation level. This is impossible, because within each pod only one aggregation-level switch is selected in the Multicast tree, and it connects the edge-level switches by tree links. Third, assume l appears between the aggregation level and the core level. This is also impossible, since only one core-level switch is selected in the Multicast tree, and it connects the pods which have group members by tree links.

As a whole, there is no link beyond the Multicast tree in the PortLand topology that connects any two tree nodes. The requirement of Theorem 2 is also met.

VL2: The proof is similar to that for PortLand.
