Presenter: Po-Chun Wu ([email protected])
Outline
• Introduction
• BCube Structure
• BCube Source Routing (BSR)
• Other Design Issues
• Graceful Degradation
• Implementation and Evaluation
• Conclusion
Introduction
• Container-based modular DC
– 1000-2000 servers in a single container
• Core benefits of shipping-container DCs:
– Easy deployment
• High mobility
• Just plug in power, network, & chilled water
– Increased cooling efficiency
– Manufacturing & H/W administration savings
BCube design goals
• High network capacity for:
– One-to-one unicast
– One-to-all and one-to-several reliable groupcast
– All-to-all data shuffling
• Only use low-end, commodity switches
• Graceful performance degradation
– Performance degrades gracefully as server/switch failures increase
BCube Structure
[Figure: A BCube1 (n = 4) built from four BCube0s. Servers 00-03, 10-13, 20-23, and 30-33 each connect to their BCube0's level-0 switch (<0,0>-<0,3>) and to one of the four level-1 switches (<1,0>-<1,3>).]
• Connecting rule
– The i-th server in the j-th BCube0 connects to the j-th port of the i-th level-1 switch
– Server "13" is connected to switches <0,1> and <1,3>
• A BCubek has:
– k+1 levels: 0 through k
– n-port switches, with the same count (n^k) at each level
– n^(k+1) total servers and (k+1)·n^k total switches
– (n=8, k=3: 4 levels connecting 4096 servers, using 512 8-port switches at each level)
• A server is assigned a BCube address (ak, ak-1, ..., a0), where ai ∈ [0, n-1] (see the sketch below)
• Neighboring server addresses differ in only one digit
• Switches only connect to servers
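The connecting rule and addressing scheme map directly to code. Below is a minimal Python sketch (illustrative names, not the paper's code) that derives a server's neighbors and switches from its BCube address:

```python
# Minimal sketch of BCube addressing; illustrative, not from the paper.
# Addresses are tuples stored least-significant digit first: server "13"
# in a BCube1 (a1=1, a0=3) is the tuple (3, 1). Each digit is in [0, n-1].

def neighbors(addr, n):
    """For each level i, the level-i neighbors are the n-1 servers whose
    address differs from `addr` only in digit a_i."""
    result = []
    for i, digit in enumerate(addr):
        for v in range(n):
            if v != digit:
                nb = list(addr)
                nb[i] = v
                result.append(tuple(nb))
    return result

def switch_of(addr, level):
    """A server's level-i switch is labeled by its address with digit a_i
    removed, matching the <level, index> switch labels in the figure."""
    return (level, tuple(d for i, d in enumerate(addr) if i != level))

# Server 13 connects to switches <0,1> and <1,3>, as stated above.
assert switch_of((3, 1), 0) == (0, (1,))
assert switch_of((3, 1), 1) == (1, (3,))
# In a BCube1 with n = 4, each server has (k+1)(n-1) = 6 neighbors.
assert len(neighbors((3, 1), 4)) == 6
```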
Bigger BCube: a 3-level (k = 2) BCube2
[Figure: A BCube2 is built recursively from n BCube1s plus one more level of switches.]
[Figure: The BCube1 annotated with each server's MAC address and BCube address. A switch only needs a small MAC table for its n attached servers, e.g. switch <1,3>: MAC03→port 0, MAC13→port 1, MAC23→port 2, MAC33→port 3; switch <0,2>: MAC20→port 0, MAC21→port 1, MAC22→port 2, MAC23→port 3.]
BCube: Server-centric network
[Figure: Forwarding a packet from server 03 to server 20. The inner BCube header (dst 20, src 03) is fixed end to end, while the outer Ethernet header is rewritten hop by hop: the 03→23 hop carries (dst MAC23, src MAC03) and the 23→20 hop carries (dst MAC20, src MAC23).]
• Server-centric BCube
– Switches never connect to other switches; they only connect to servers
– Servers control routing, load balancing, and fault tolerance
Bandwidth-intensive application support
• One-to-one:
– one server moves data to another server (disk backup)
• One-to-several:
– one server transfers the same copy of data to several receivers (distributed file systems)
• One-to-all:
– a server transfers the same copy of data to all the other servers in the cluster (broadcast)
• All-to-all:
– every server transmits data to all the other servers (MapReduce)
Multi-paths for one-to-one traffic
• THEOREM 1. The diameter (longest shortest path) of a BCubek is k+1
• THEOREM 3. There are k+1 parallel paths between any two servers in a BCubek (see the sketch below the figure)
[Figure: The BCube1 topology, highlighting two parallel paths between a pair of servers: one corrects the level-0 digit first, the other corrects the level-1 digit first.]
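Theorem 3 is constructive: the k+1 paths are obtained by correcting the source's address digits toward the destination in k+1 different orders. A simplified Python sketch follows (it assumes the two addresses differ in every digit; the paper's full BuildPathSet also handles matching digits by detouring through intermediate neighbors):

```python
# Simplified sketch of digit-correcting routing; assumes src and dst
# differ in every digit (the paper's BuildPathSet covers the general case).

def bcube_route(src, dst, order):
    """Correct src's digits toward dst in the given digit order, yielding
    one hop per corrected digit (Theorem 1: at most k+1 hops)."""
    path, cur = [src], list(src)
    for i in order:
        if cur[i] != dst[i]:
            cur[i] = dst[i]
            path.append(tuple(cur))
    return path

def build_path_set(src, dst):
    """k+1 paths from k+1 cyclic digit orders; each path corrects a
    different digit first, which makes the paths parallel."""
    k1 = len(src)  # k+1 digits
    orders = [[(start + j) % k1 for j in range(k1)] for start in range(k1)]
    return [bcube_route(src, dst, o) for o in orders]

# In a BCube1 (k = 1), servers (0,0) and (3,3) get 2 parallel paths:
# (0,0)->(3,0)->(3,3) and (0,0)->(0,3)->(3,3).
for p in build_path_set((0, 0), (3, 3)):
    print(p)
```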
Speedup for one-to-several traffic
• THEOREM 4. Server A and the set of servers {di | di is A's level-i neighbor} form an edge-disjoint complete graph of diameter 2
[Figure: The source splits the data into parts P1 and P2 and pushes each part to a different neighbor over edge-disjoint links.]
• Writing to r servers is r times faster than pipeline replication: the source sends a different 1/r of the data to each receiver in parallel over edge-disjoint links, and the receivers then exchange the parts among themselves over the complete graph.
Speedup for one-to-all traffic
• THEOREM 5. There are k+1 edge-disjoint spanning trees in a BCubek
[Figure: One of the edge-disjoint spanning trees rooted at source server 00, covering servers 01-03, 10-13, 20-23, and 30-33.]
• The one-to-all and one-to-several spanning trees can be implemented by TCP unicast to achieve reliability
Aggregate bottleneck throughput for all-to-all traffic
• The flows that receive the smallest throughput are called the bottleneck flows.
• Aggregate bottleneck throughput (ABT) = (throughput of the bottleneck flow) × (total number of flows in the all-to-all traffic)
• Larger ABT means shorter all-to-all job finish time.
• THEOREM 6. The ABT for a BCube network is n/(n-1) × N, where n is the switch port number and N is the total server number
• In BCube there are no bottleneck links, since all links are used equally
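As a quick worked check (my arithmetic, not a figure from the slides), plugging in the earlier structure example of n = 8, k = 3:

```latex
% Worked example: n = 8, N = n^{k+1} = 8^4 = 4096
\mathrm{ABT} = \frac{n}{n-1}\,N = \frac{8}{7}\times 4096 \approx 4681
```

That is, about 4681 link-capacities of aggregate throughput; ABT approaches N as the port count n grows.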
BCube Source Routing (BSR)
• Server-centric source routing
– The source server decides the best path for a flow by probing a set of parallel paths
– The source server adapts to network conditions by re-probing, periodically or upon failures
– Intermediate servers only forward packets based on the packet header
[Figure: The source sends probe packets over the k+1 parallel paths, through intermediate servers, to the destination.]
BSR Path Selection
• Source server:
– 1. Constructs k+1 parallel paths using BuildPathSet
– 2. Probes all these paths (no link-state broadcasting; see the sketch below)
– 3. If a path is not found, uses BFS to find an alternative (after removing the links of the other paths)
– 4. Uses a metric to select the best path (maximum available bandwidth, or end-to-end delay)
• Intermediate servers:
– Update the probe's bandwidth field: min(PacketBW, InBW, OutBW)
– If the next hop is not found, return a failure to the source
• Destination server:
– Updates bandwidth: min(PacketBW, InBW)
– Sends the probe response to the source on the reverse path
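A minimal Python sketch of this probe/selection logic (names and data structures are mine; the real logic runs inside the BCube kernel driver and carries the bandwidth field in the probe header):

```python
# Illustrative sketch of BSR probing; link_bw maps a directed hop
# (server_a, server_b) to its currently available bandwidth.

def probe(path, link_bw):
    """Walk `path`, narrowing the probe's bandwidth field to the minimum
    over traversed hops (the min(PacketBW, InBW, OutBW) rule above).
    Returns None if some next hop is unreachable (failure to source)."""
    bw = float("inf")
    for u, v in zip(path, path[1:]):
        if (u, v) not in link_bw:
            return None
        bw = min(bw, link_bw[(u, v)])
    return bw

def select_path(paths, link_bw):
    """Source rule: probe all k+1 parallel paths and keep the one with
    the maximum available bandwidth (the paper falls back to BFS when
    none of the parallel paths is alive)."""
    alive = [(probe(p, link_bw), p) for p in paths]
    alive = [(bw, p) for bw, p in alive if bw is not None]
    return max(alive, key=lambda t: t[0])[1] if alive else None

# Example: two candidate paths between servers 00 and 13.
link_bw = {("00", "03"): 0.4, ("03", "13"): 1.0,
           ("00", "10"): 0.8, ("10", "13"): 0.9}
print(select_path([["00", "03", "13"], ["00", "10", "13"]], link_bw))
# ['00', '10', '13']: its bottleneck of 0.8 beats the other path's 0.4
```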
Path Adaptation
• The source performs path selection periodically (every 10 seconds) to adapt to failures and changing network conditions.
• When a failure report arrives, the source switches to another available path right away, but defers the next full selection round until the timer expires.
• Randomness is added to the timer to avoid path oscillation.
Packet Forwarding
• Each server maintains two components:
– A neighbor status table with (k+1)×(n-1) entries
• Maintained by the neighbor maintenance protocol (updated upon probing / packet forwarding)
• Indexed by NHI (next-hop index) encoding [DP:DV] (sketched below)
– DP: which address digit differs (2 bits)
– DV: the value of the differing digit (6 bits)
– NHI array of 8 bytes (maximum diameter = 8)
• Almost static (except the status field)
– Packet forwarding procedure:
• An intermediate server rewrites the packet's next-hop MAC address if the next hop is alive
• Intermediate servers update neighbor status from received packets
• Only one table lookup per hop
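The [DP:DV] layout stated above packs each hop of the source route into a single byte. A sketch of that packing (my code, following the 2-bit/6-bit split):

```python
# Sketch of the one-byte NHI entry: 2 bits select which address digit
# changes (DP), 6 bits give its new value (DV).

def nhi_encode(dp, dv):
    assert 0 <= dp < 4 and 0 <= dv < 64
    return (dp << 6) | dv

def nhi_decode(byte):
    return byte >> 6, byte & 0x3F

def next_hop(cur_addr, nhi_byte):
    """Apply one NHI entry: the next hop differs from the current server
    only in digit DP, whose value becomes DV."""
    dp, dv = nhi_decode(nhi_byte)
    nxt = list(cur_addr)
    nxt[dp] = dv
    return tuple(nxt)

# Path(00,13) = {0:2, 1:2, 0:3, 1:1} from the next slide. Addresses are
# (a0, a1), least-significant digit first, so server 00 is (0, 0).
addr = (0, 0)
for dp, dv in [(0, 2), (1, 2), (0, 3), (1, 1)]:
    addr = next_hop(addr, nhi_encode(dp, dv))
assert addr == (3, 1)   # a0=3, a1=1 -> server 13
```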
Path compression and fast packet forwarding
[Figure: The path from server 00 to server 13 via 02, 22, and 23 in the BCube1.]
• A traditional address array needs 16 bytes: Path(00,13) = {02, 22, 23, 13}
• The Next Hop Index (NHI) array needs only 4 bytes: Path(00,13) = {0:2, 1:2, 0:3, 1:1}
• Forwarding table of server 23 (the forwarding node; its next hop toward 13 uses entry 1:1):

NHI   Output port   MAC
0:0   0             MAC20
0:1   0             MAC21
0:2   0             MAC22
1:0   1             MAC03
1:1   1             MAC13
1:3   1             MAC33
Other Design Issues
Partial BCubek
[Figure: A partial BCube1 with only two of the four BCube0s built, but a full layer of level-1 switches <1,0>-<1,3>.]
• How to build a partial BCubek? (1) Build the needed BCubek-1s; (2) use only a partial layer of level-k switches?
• Solution
– Connect the BCubek-1s using a full layer of level-k switches.
• Advantage
– BCubeRouting performs just as in a complete BCube, and BSR works as before.
• Disadvantage
– The switches in layer k are not fully utilized.
Packing and Wiring (1/2)
• 2048 servers and 1280 8-port switches
– a partial BCube with n = 8 and k = 3
– (2048/8 = 256 switches at each of levels 0-2, plus a full level-3 layer of 512: 3×256 + 512 = 1280)
• 40-foot container (12m × 2.35m × 2.38m)
• 32 racks in a container
Packing and Wiring (2/2)
• One rack = one BCube1
• Each rack has 44 units
– 1U = 2 servers or 4 switches
– 64 servers occupy 32 units
– 40 switches occupy 10 units
• Super-rack (8 racks) = one BCube2
Routing to external networks (1/2)
• Ethernet has a two-level link-rate hierarchy
– 1G for end hosts and 10G for the uplink
[Figure: A 10G aggregator connects over 1G links to gateway servers (01, 11, 21, 31) inside the BCube; those servers act as gateways for the rest of the BCube1.]
Routing to external networks (2/2)
• When an internal server sends a packet to an external IP address:
1) It chooses one of the gateways (see the sketch below).
2) The packet is routed to the gateway using BSR (BCube Source Routing).
3) After the gateway receives the packet, it strips the BCube protocol header and forwards the packet to the external network via the 10G uplink.
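Step 1 leaves the gateway-selection policy open; a minimal sketch follows (hashing the flow's 5-tuple to spread load across gateways is my assumption, not something the slides specify):

```python
# Sketch of gateway selection for external traffic. The hash-based
# choice is an assumption; the slide only says "choose one of the
# gateways". Gateway names are the servers from the figure above.

GATEWAYS = ["01", "11", "21", "31"]

def choose_gateway(flow_5tuple):
    """Pin each flow to one gateway (keeps its packets in order) while
    different flows spread across all gateways."""
    return GATEWAYS[hash(flow_5tuple) % len(GATEWAYS)]

# The packet is then source-routed to this gateway with BSR (step 2);
# the gateway strips the BCube header and sends the payload out the
# 10G uplink (step 3).
print(choose_gateway(("10.0.0.5", 51000, "8.8.8.8", 443, "tcp")))
```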
Graceful degradation
[Figure: Simulated ABT versus failure ratio for BCube, fat-tree, and DCell, under server failures (left) and switch failures (right).]
• Graceful degradation: as server or switch failures increase, ABT decreases slowly and there are no dramatic performance drops. (simulation-based)
Implementation and Evaluation
Implementation
[Figure: The BCube software and hardware stack. In the kernel, a BCube intermediate driver sits between the TCP/IP protocol driver and the Ethernet miniport driver for interfaces IF 0 through IF k. The driver implements BSR path probing & selection, a flow-path cache, neighbor maintenance, available-bandwidth (ava_band) calculation, packet send/receive, and packet forwarding. Neighbor maintenance, ava_band calculation, and packet forwarding can also be offloaded to hardware.]
Testbed
• A BCube1 testbed (n = 4, k = 1)
– 16 servers (Dell Precision 490 workstations with an Intel 2.00GHz dual-core CPU, 4GB DRAM, and a 160GB disk)
– 8 8-port mini-switches (D-Link DGS-1008D 8-port Gigabit switches)
• NICs
– Intel PRO/1000 PT quad-port server adapter
– NetFPGA
Bandwidth-intensive application support
• Per-server throughput
Support for all-to-all traffic
• Total throughput for all-to-all
Related work
Conclusion
• BCube is a novel network architecture for shipping-container-based modular data centers (MDC)
• It forms a server-centric network architecture
• It uses mini-switches instead of 24-port switches
• BSR enables graceful degradation and meets the special requirements of MDC