
Redundant Array of Inexpensive Network Switches



CS294 Project

Virtual and Redundant Switches

IRAM Retreat – Winter 2001

Sam Williams


Outline

• Motivation

• Existing Products

• Arrayed Commodity Switches

• Adding Redundancy

• Optimizing

• Generalization

• Conclusions


Motivation

• Cost of switches grows very quickly:

O(Ports²) for crossbar-based switches

Additionally, address tables and buffers must grow

• Industry-leading MTBF for a single switch is about 50K hours, and typical is perhaps only 25K hours.

• Modular switches provide redundancy for management and power, but not for the data transport fabric.

• MTTR is typically over 1 hour

• Can the money saved by cascading commodity switches be applied towards improved performance or redundancy?

• The goals are to improve the MTBF, improve performance, and simplify the work that must be done to replace a failed switch.
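As a rough illustration of these scaling trends (my own sketch, not from the slides), the snippet below contrasts quadratic crossbar cost growth with the roughly linear cost of an array of small switches, and shows how the MTBF of a non-redundant array shrinks as switches are added. The cost constants and the 24-port switch size are hypothetical placeholders, and the reliability math assumes independent, exponentially distributed failures.

```python
# Illustrative only: hypothetical cost constants, 24-port commodity switches,
# and independent exponential failures assumed.

def crossbar_cost(ports, cost_per_crosspoint=2.0):
    # Crossbar-based switch: cost grows roughly with Ports^2.
    return cost_per_crosspoint * ports ** 2

def array_cost(ports, ports_per_switch=24, cost_per_switch=500.0):
    # Cascaded commodity switches: cost grows roughly linearly with ports.
    n_switches = -(-ports // ports_per_switch)  # ceiling division
    return n_switches * cost_per_switch

def array_mtbf_days(ports, ports_per_switch=24, switch_mtbf_hours=25_000):
    # Without redundancy the array is a series system: a failure of any one
    # switch breaks part of the network, so MTBF divides by the switch count.
    n_switches = -(-ports // ports_per_switch)
    return switch_mtbf_hours / n_switches / 24

for p in (48, 96, 192, 384):
    print(p, crossbar_cost(p), array_cost(p), round(array_mtbf_days(p)))
```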


Existing Products

• Existing modular aggregators can merge several smaller switches (modules) into a single large virtual switch.

• In this case, each 36 port switch module has a pair of gigabit uplinks to the switching fabric, which has either 6 or 24 gigabit ports (full duplex)

• Redundancy is also provided for management modules, fans, and power supplies.

• However, redundancy is not provided for the switching modules or the switching fabric. So if the switching fabric fails, the entire device fails, but if an individual switching module fails, only that sub-network fails.

• Management modules can infer priority to improve performance for critical activity.

[Figure: logical view of a 3Com Switch 4007 – management module, 4 x 36-port switching modules each with 2 gigabit uplinks, a switching fabric with 24 internal gigabit ports, and a 120 Gbps backplane (16 used)]


Existing Products (Analysis)

• The cost analysis here is based on use of either 18 or 48 Gbps switching fabrics, 36-port switching modules, and either a 7 or 13 bay chassis.

• Performance is the slowdown in the time to send from every node to every other node, compared to a true n*36-port switch.

• MTBF is quoted for a failure of any part of the network

• MTTR was at least 1 hour.

• Repair cost is about $4000/failure – modularization helps to keep this low, but yearly maintenance cost will grow with the number of ports

[Charts: cost ($0–$70,000), slowdown compared to an N-port switch (0.5–2.0), and MTBF in days (0–400), each plotted against number of ports (0–500)]
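The slides do not spell out the model behind the slowdown curves; purely as an illustration of the metric, a crude uplink-bottleneck estimate for uniform all-to-all traffic (my assumption, not the authors' method) could be computed as follows.

```python
def slowdown_estimate(total_ports, ports_per_module=36, port_gbps=0.1,
                      uplinks_per_module=2, uplink_gbps=1.0):
    """Crude bottleneck model: under uniform all-to-all traffic the fraction
    of each module's traffic that must cross its uplinks is
    (total_ports - ports_per_module) / total_ports; slowdown is how much the
    uplinks throttle that cross traffic (never below 1x)."""
    if total_ports <= ports_per_module:
        return 1.0
    cross_fraction = (total_ports - ports_per_module) / total_ports
    offered_cross_gbps = ports_per_module * port_gbps * cross_fraction
    uplink_capacity_gbps = uplinks_per_module * uplink_gbps
    return max(1.0, offered_cross_gbps / uplink_capacity_gbps)

for n in (72, 144, 288, 432):
    print(n, round(slowdown_estimate(n), 2))
```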


Examples of failure

• Switching module fails: each of the attached nodes/sub-networks is now disconnected from all other nodes.

• This is the more likely case.

• Switching fabric fails: each of the switch modules is now disconnected from the others, but nodes attached to the same module can still communicate with each other.


Examples of failure (continued)

• Redundancy allows for this failure, with reduced performance.

• These are not commodity switches, and are considerably more expensive.

• However, in this case, the failure does cause a network split.

• This is the more likely case, so why not allow the extra switch to be used to cover any other switch's failure?

• Could be extended to nodes, but then you pay double for NICs and ports.


Virtual switch from commodity switches

• Although without the management functions and with lower performance, cheaper virtual switches can be built – nothing more than just cascading them.

• This is based on 5-, 8-, 16-, and 24-port switches from 5 different companies, each with the last port of MDI type.

• Performance is poor since the uplinks are only 100Mbps

• Adding a second uplink port only moderately alleviates this deficiency

[Charts: cost ($0–$14,000), slowdown (0–25), and MTBF in days (0–200), each plotted against number of ports (0–600)]


Virtual switch from mid-range switches

• By using switches more suited to this design (higher-speed uplinks), we can improve performance.

• These designs use 8- or 24-port switches at the bottom, each with 1 or 2 gigabit uplink modules, and a 4-, 8-, or 12-port gigabit switch at the top.

• The gigabit uplinks and gigabit switches drive cost to at least twice that of the commodity solution, but with 10x better performance.

• Performance is near that of a monolithic switch if 2 uplinks are used (e.g., 24 ports × 100 Mbps ≈ 2.4 Gbps of worst-case cross traffic, which two gigabit uplinks nearly cover).

• Compared to the packaged solution, it is about half the cost with slightly lower performance, but no management functionality.

[Charts: cost ($0–$25,000) and slowdown (0.5–2.5), each plotted against number of ports (0–300)]


Port Virtualization for Redundancy

• The re-mapping stage is much simpler than a full n*m port switch. Essentially, each of the m n-bit busses is mapped to one of the k n-bit internal busses, which are connected directly to the switches.

• For this example, each of the 4 groups of 8 virtual ports is mapped to one of the 5 groups of physical ports. The uplinks of the first-stage switches are sent back into one of the top-level switches.

• An even simpler solution, for single redundancy, would be to map each group either directly or to the spare.

• In this design the single point of failure is the re-mapping block, since the first- and second-level switches have redundancy.

• So for the example below, MTBF is improved by roughly two-thirds (from 208 days to 347 days).

[Figure: port re-mapping block in front of the first- and second-stage switches, with extra switches added for redundancy]
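To make the re-mapping idea concrete, here is a minimal software sketch (my own illustration; the real design is a hardware bus-level multiplexer) of a mapping table that sends each group of virtual ports to a physical switch slot and redirects a failed slot to the spare.

```python
# Minimal model of the re-mapping block, assuming one spare switch and
# group-granularity mapping.  Names and structure are illustrative only.

class PortRemapper:
    def __init__(self, num_groups, spare_slot):
        # Start with the identity mapping: virtual group i -> physical slot i.
        self.mapping = {g: g for g in range(num_groups)}
        self.spare_slot = spare_slot
        self.spare_in_use = False

    def physical_slot(self, virtual_group):
        return self.mapping[virtual_group]

    def handle_failure(self, failed_slot):
        """Redirect whichever virtual group used the failed slot to the spare.
        A second failure before repair cannot be covered."""
        if self.spare_in_use:
            raise RuntimeError("spare already in use; network will be split")
        for group, slot in self.mapping.items():
            if slot == failed_slot:
                self.mapping[group] = self.spare_slot
                self.spare_in_use = True
                return group
        return None  # the failed slot carried no group (e.g. it was the spare)

# Example: 4 groups of 8 virtual ports, physical slots 0-3 active, slot 4 spare.
remapper = PortRemapper(num_groups=4, spare_slot=4)
remapper.handle_failure(2)        # a first-level switch in slot 2 fails
print(remapper.physical_slot(2))  # -> 4: that group's traffic now uses the spare
```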


Operation (Homogeneous switches)

• In this somewhat rigid example, there are 6 bays: 4 map either directly or to the spare, there is a switching-fabric slot, and there is a slot for the redundant switch, which can replace either of the other two classes.

• In this case, the switching fabric switch failed, and the uplink ports were remapped to the spare.

• At this point the admin must replace the failed switch. If any other switch fails before this, the network will be partially split.


Operation (continued)

• In this case, one of the first-level switches failed. Instead of those nodes losing connection to the rest of the network, they are remapped to the spare.

• Once again, the admin must replace the failed switch. If any other switch fails before this, the network will be partially split.

• If instead the spare had gone down, it would need to be replaced to restore redundancy.


Port Virtualization for Higher Performance

• Previous performance analysis was based on “1-to-all” messaging.

• However, it is likely that network access patterns can be broken into groups of high inter-node communication

• Thus monitoring can be performed, and the network can be periodically partitioned into activity groups.

• Create a graph based on the bandwidth used between nodes, then use something like Kernighan–Lin partitioning to separate it into a number of partitions equal to the number of first-stage switches (a power of 2); see the sketch below.

• The re-mapping stage is only slightly simpler than a full n*m port switch (no buffers, never any contention, etc…)

[Figure: logical view of 8 first-stage switches grouped into partitions 1–4, with 3 switches reserved as spares; switch 2 has failed and the network was repartitioned]
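As a software illustration of the partitioning step, recursive bisection of the measured bandwidth graph could look like the sketch below. It assumes the NetworkX library's Kernighan–Lin bisection; the slides only name the algorithm family, and traffic measurement is left out.

```python
# Sketch: split a measured bandwidth graph into 2^k groups by recursive
# Kernighan-Lin bisection, one group per first-stage switch.
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

def partition_nodes(bandwidth_graph, num_partitions):
    """num_partitions should be a power of 2, matching the slide's design."""
    parts = [set(bandwidth_graph.nodes)]
    while len(parts) < num_partitions:
        next_parts = []
        for part in parts:
            sub = bandwidth_graph.subgraph(part).copy()
            a, b = kernighan_lin_bisection(sub, weight="weight")
            next_parts.extend([a, b])
        parts = next_parts
    return parts

# Toy example: edge weights are the observed bandwidth between node pairs.
g = nx.Graph()
g.add_weighted_edges_from([
    ("a", "b", 90), ("c", "d", 80),  # two chatty pairs
    ("a", "c", 5), ("b", "d", 5),    # light cross traffic
])
print(partition_nodes(g, 2))  # keeps the chatty pairs on the same switch
```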


Performance / Availability

• MTTR for aggregators was typically over an hour. This is on top of the time to detect the failure.

• By automating recovery, the downtime can be significantly reduced

• This is dependent on timely detection of a failed switch, which could be handled via packet injection (see the sketch below).

• Once the failing switch is identified, a new mapping can quickly be determined.

• For the performance-optimizing case, restoring connectivity is the top priority; a previously scheduled performance repartitioning can be done later.

[Figure: performance-vs-time timelines – automated case: hard fail, fail detected, switches adapt, then a later repartition for performance; manual case: hard fail, then performance is not restored until the admin notices and fixes the failure]
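The slides propose packet injection for detection but do not specify a mechanism; a hypothetical monitoring loop (all names, timeouts, and helpers here are stand-ins, not a real API) might look like this, reusing the remapper sketched earlier.

```python
# Hypothetical monitor: periodically inject a probe through each switch and
# remap to the spare as soon as one stops answering.  send_probe() is a
# stand-in for whatever injection/loopback mechanism the hardware provides.
import time

PROBE_INTERVAL_S = 1.0  # how often the whole array is probed
PROBE_TIMEOUT_S = 0.1   # how long to wait for a probe to come back

def monitor(switch_slots, send_probe, remapper):
    while True:
        for slot in switch_slots:
            if not send_probe(slot, timeout=PROBE_TIMEOUT_S):
                # Failure detected: restore connectivity first ...
                moved_group = remapper.handle_failure(slot)
                print(f"slot {slot} failed; group {moved_group} moved to spare")
                # ... any scheduled performance repartitioning can wait.
                return slot
        time.sleep(PROBE_INTERVAL_S)
```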


Generalization

• Use homogeneous switches. There is a mapping layer which maps physical to virtual ports. This can range from a simple 1-to-2 mapping to a complex 1-to-n mapping with performance monitoring and repartitioning. Performance can be gained by using some faster switches where needed.

[Figure: 8 switches behind a monitor and port re-mapping block, with extra switches for redundancy or extra performance]

#  Description                                                Fails  Performance  Cost
0  Array of switches                                          0      Low          Nswitches
1  Single redundancy                                          1      Low          Nswitches + 1 + trivial mapper
2  R-way redundancy                                           R      Low          Nswitches + R + general mapper
3  Array of switches with partitioning                        0      Adaptive     Nswitches + expensive mapper
4  R-way redundancy with partitioning                         R      Adaptive     Nswitches + R + expensive mapper
5  R-way redundancy with partitioning and total utilization   R      Adaptive     Nswitches + R + expensive mapper


Conclusion

• It is possible to make a larger virtual switch out of smaller switches, and still get reasonable performance.

• With a little additional hardware and a monitoring agent, it is possible to make it fault tolerant, with several spare switches which can be automatically swapped in – the simple case costs ~ O(Spares * Ports); more complex designs make it O(Ports²).

• With a very simple but large re-mapping switch, it is also possible to optimize for performance by balancing network bandwidth among the switches in the pool. This is a much more costly solution.

• A generalization would provide a pool of switches connected by the port mapper, and some or none reserved as spares.

• Both of these concepts and their functionality could be integrated into a single ASIC, or even implemented with a network processor.


Future Work

• How do switches fail? This determines the failure detection method.

• Implementation of a type 1 or 2 switch would be possible given the relatively simple mapper.

• A type 3, 4, or 5 switch would require a complex ASIC, which should instead be implemented with a network processor and software.