Low-cost survivable Ethernet architecture over fiber


János Farkas

Ericsson Research, Budapest, Hungary


Alberto Paradisi

CPqD Telecom & IT Solutions, Campinas, Brazil


Csaba Antal

Ericsson Research, Budapest, Hungary


RECEIVED 1 FEBRUARY 2006; REVISED 11 MARCH 2006; ACCEPTED 14 MARCH 2006; PUBLISHED 24 APRIL 2006

Ethernet provides a simple and low-cost solution at high bandwidth for metropolitan optical networks. However, native Ethernet still lacks carrier-grade resilience and management schemes. We propose and demonstrate a low-cost robust Ethernet-over-fiber network architecture that can recover from both node and link failures in less than 50 ms and, furthermore, ensures that packet loss is avoided during fault restoration. We implemented and tested the architecture in a prototype network. The proposed scalable architecture works with commodity off-the-shelf Ethernet switches and handles network failures in arbitrary Ethernet-level topologies at the edge nodes of the network. We present the experimental results of the protection protocol implementation, showing that the 50 ms carrier-grade recovery time is achieved. © 2006 Optical Society of America

OCIS codes: 000.1200, 060.0060.

1. Introduction

Ethernet is becoming the leading technology in both access aggregation networks and metro networks due to its simplicity and high capacity provided at low cost. The bandwidth provided by native Ethernet has increased over time. Gigabit Ethernet (GbE) and 10 GbE over fiber, commonly available today, have preserved the frame structure and simplicity of lower-speed Ethernet standards. As a local area network (LAN) technology, Ethernet was optimized for fast data transfer, but it does not support fast failure handling. Therefore, current developments aim to construct carrier-grade Ethernet networks, which provide highly reliable transport networks based on Ethernet directly mapped over optical fiber, to carry bandwidth-intensive real-time applications.

A failover mechanism that fulfills carrier-grade requirements is still missing in Ethernet networks [1]. The Spanning Tree Protocol (STP) [2], which was developed to ensure loop-free topologies, is also responsible for failure handling in a basic Ethernet network. Therefore, the speed of the STP determines the failover time, which is on the order of tens of seconds. The Rapid Spanning Tree Protocol (RSTP) [3] was developed to reduce the convergence time to the order of seconds, which is still not acceptable for carrier-grade networks. The next step in the evolution of the STP family was the introduction of the Multiple Spanning Tree Protocol (MSTP) [4], which does not improve the failover time, as it combines Virtual LAN (VLAN) tagging [5] and RSTP. There are other mechanisms standardized for Ethernet networks, including Resilient Packet Ring (RPR) [6] and Ethernet Automatic Protection Switching (EAPS) [7] for ring topologies, but they cannot be applied in arbitrary network topologies; furthermore, RPR requires a new Medium Access Control (MAC) protocol, which is not supported by most network devices. Failover time is reduced to a subsecond range in the Viking architecture [8], which supports multiple spanning trees through VLANs. Each switch is configured to send Simple Network Management Protocol (SNMP) traps to a central manager in case of failures. The central manager is a server that is responsible for the overall operation of the network, including fault handling. After failure notification, the central server finds out which VLANs are affected and informs the end-nodes about the reconfiguration necessary for using a backup VLAN. Each of the end-nodes runs a client module that is responsible for the appropriate VLAN selection. Although the Viking approach relies on standard Ethernet switches, it requires a failure management center, which slows down the failover procedure. Failure detection preceding fast failover could also be based on the recently developed Bidirectional Forwarding Detection (BFD) protocol [9, 10]. However, BFD has not been developed for Ethernet yet. Furthermore, in terms of scalability, a point-to-point BFD instance would be needed between each edge-node pair at the border of the Ethernet network to detect all possible failures, which would probably overload the network.

© 2006 Optical Society of America. JON 67700, May 2006 / Vol. 5, No. 5 / JOURNAL OF OPTICAL NETWORKING 398

Development of resilience schemes is still the subject of research. A current development is proposed in Ref. [11], which aims at minimizing the cost of guaranteeing full recovery from a specific failure event (e.g., fiber link failure). In this solution, all-optical switches are driven by generalized multiprotocol label switching (GMPLS)-based signaling on an out-of-band Ethernet control plane. The Connectivity Fault Management (CFM) framework [12] also aims at improving fault management in Ethernet networks. CFM provides operation, administration, and management (OAM) tools to monitor and troubleshoot faults in native Ethernet networks by means of fault detection and localization; nevertheless, no failover mechanism is provided.

All of the solutions described above could be applied in an optical Ethernet network; however, none of them provides a low-cost, carrier-grade, scalable in-band solution for all possible network topologies. Therefore, we aimed at developing a framework that makes fast failover possible in a general Ethernet network. We propose a robust, survivable, and scalable Ethernet-over-fiber network architecture that can be easily implemented in arbitrary network topologies by using commodity Ethernet switches, thus providing a low-cost and reliable solution. The spare capacity required by our proposed architecture is far less than that of brute-force service-path duplication or 100% redundant links. As it is very important to handle failures at the Ethernet level in order to achieve fast failover, we have developed a lightweight in-band fault-handling mechanism that transparently handles both link and node failures. We also demonstrate experimentally that our low-cost resilience solution provides fault recovery within the 50 ms carrier-grade requirement met by synchronous optical network/synchronous digital hierarchy networks. A further, significant feature of our approach is that no packets are lost during fault restoration.

2. Survivable Architecture

We propose a new Ethernet-based architecture that provides resilience in a distributed manner, ensuring fast failover. The architecture consists of low-cost, off-the-shelf, standard Ethernet switches that are available on the market. To keep the price advantage of current Ethernet products, we excluded solutions that rely on new functionality in Ethernet switches. The extra functionalities that are needed for providing resiliency are implemented as a software protocol at the edge nodes of the Ethernet network, which are typically IP routers. Figure 1 shows an example of our proposed network architecture, which is also one of the topologies tested in our prototype.

[Figure: four standard Ethernet switches (S1 to S4) interconnect four edge nodes (Linux routers R1 to R4); the legend marks the emitter, the primary notifier, and the three trees VLAN1, VLAN2, and VLAN3.]

Fig. 1. Prototype Ethernet network: physical and logical topologies.

Predefined multiple spanning trees are statically set up across the network to serve as either primary or alternative paths that can be used to route traffic in the network and thus handle possible failures. To achieve protection against any single link or node failure, the topology of the spanning trees must be such that there remains at least one complete functional tree in the event of failure of any single network element. The spanning trees are calculated according to specifications in Ref. [13] and set up before network start-up, remaining unchanged during operation, even in the presence of a failure.

In the event of a failure, each edge node must stop forwarding frames to the affected trees and redirect traffic to unharmed trees. Therefore, a protocol is needed for failure detection and for notifying all the edge nodes about the broken trees. Failover time mainly depends on the time elapsed between the failure event and its detection by the edge routers, because protection switching from one tree to another is done without any reconfiguration of the Ethernet switches. The proposed failure-handling method is described in detail in the next section.

We propose implementing the predefined spanning trees using VLANs, i.e., assigning a unique VLAN Identifier (ID) to each spanning tree; thus traffic forwarding to a tree can be controlled by means of VLAN IDs in the edge nodes. That is, VLANs implement the loop-free spanning tree topologies determined according to Ref. [13]. Therefore each protocol belonging to the STP family is disabled, as it is not needed to provide a loop-free topology. As a consequence of using VLAN-tree topologies, protection switching becomes simple VLAN switching in this network.

In the example network shown in Fig. 1, three spanning trees, i.e., three VLANs, are needed to handle any single failure.

In Ethernet networks, logical network implementation, i.e., virtual private network (VPN) separation, is also solved by VLANs. Since only a subset of the nodes take part in a VPN, redundancy should be provided only for the links and the nodes that play a role in the VPN interconnections. That is, the number of required spanning trees for a given VPN may be less than what is needed for the protection of the whole network. However, multiple spanning trees and multiple VLAN IDs should be used for each VPN, since multiple tree topologies are applied for fault protection in the proposed network architecture.

Note that VPNs are not discussed in the following because they are a straightforward extension of the approach defined here, which includes all the nodes. As a result of this simplification, VLAN and spanning tree are used as synonyms in the description, and they refer to a tree interconnecting all edge nodes. In other words, a VLAN does not refer to a VPN in the following.

Once the trees are configured, they can be used in either primary-backup mode or load-sharing mode. In the former mode a single spanning tree is used as a primary tree, and all the traffic is sent on the corresponding VLAN. If one of its links or nodes fails, then one of the trees that remained complete is used for traffic forwarding. Note that VLAN IDs have to be reserved for backup trees in order to provide fast protection switching, and those VLANs stay idle during normal operation (i.e., no failure). The VLANs are listed in the same priority order in each edge node; the primary VLAN has the highest priority. If a VLAN has to be selected for traffic forwarding, then the operational VLAN that has the highest priority is chosen. Thus all edge nodes send user traffic on the same VLAN, both after a failure event and after its restoration.

In the load-sharing mode, traffic is evenly distributed among all operational trees. In the event of a failure, traffic is redistributed among the remaining trees.

The primary-backup mode is simpler than the load-sharing one, because in the latter the edge routers have to distribute the incoming messages among VLANs. On the other hand, in the primary-backup mode some links are not used and traffic distribution is unbalanced across the network. A further advantage of the load-sharing mode is that a smaller amount of traffic is redirected after a failure; thus fewer new MAC addresses have to be learned on the unharmed VLANs. In the primary-backup mode, however, each MAC address is new on the backup VLAN after a failure, which may cause a significant traffic burst while the switches are learning the new MAC addresses.
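The two operating modes can be illustrated with a minimal sketch. It assumes that each edge node keeps the VLAN IDs in the shared priority order together with the set of VLANs currently reported broken. The function names and the CRC-based flow hashing used for load sharing are illustrative assumptions; the text only states that traffic is distributed evenly among the operational trees.

```python
import zlib
from typing import List, Set


def select_primary_backup(priority_order: List[int], broken: Set[int]) -> int:
    """Primary-backup mode: use the highest-priority VLAN that is not broken."""
    for vlan_id in priority_order:
        if vlan_id not in broken:
            return vlan_id
    raise RuntimeError("no operational VLAN left")


def select_load_sharing(priority_order: List[int], broken: Set[int],
                        flow_key: bytes) -> int:
    """Load-sharing mode: spread flows over all operational VLANs.

    Hashing a flow key (an assumption, not taken from the paper) keeps the
    frames of one flow on one tree.
    """
    operational = [v for v in priority_order if v not in broken]
    if not operational:
        raise RuntimeError("no operational VLAN left")
    return operational[zlib.crc32(flow_key) % len(operational)]


if __name__ == "__main__":
    vlans = [1, 2, 3]  # VLAN1 has the highest priority
    print(select_primary_backup(vlans, set()))            # 1
    print(select_primary_backup(vlans, {1}))              # 2
    print(select_load_sharing(vlans, {1, 2}, b"flow-a"))  # 3, the only tree left
```

Because every edge node evaluates the same priority list against the same set of broken VLANs, all nodes converge on the same VLAN in primary-backup mode without further coordination.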

3. Failure Handling

The most important design goals of a failure-handling mechanism are fast failover, simplicity, robustness, and low protocol-processing and transport overheads. Our further aim was to construct a protocol with a built-in synchronization mechanism; i.e., no other protocol is needed to synchronize the communication among edge nodes.

3.A. Protocol Design

Our proposed Failure Handling Protocol (FHP) is a simple and lightweight distributed protocol, implemented in the edge routers, that relies on a few broadcast messages to provide fast protection against a single link or node failure that occurs in the network.

The protocol basically defines three types of broadcast messages:

• KEEP-ALIVE (KA): message sent out periodically by one or more edge routers, referred to as emitters, over each VLAN according to a predefined time interval TKA;

• FAILURE: message issued by an edge router, named notifier, when a KA message does not arrive over a VLAN within a predefined detection interval TDI, to inform all the other edge routers of a failure in that VLAN;

• REPAIRED: message issued by the same notifier that detected a failure when a KA message arrives over a previously failed VLAN, to inform all the other edge routers about the repair of the failed VLAN.

Two types of notifiers are distinguished on the basis of their timer settings: primary and secondary. A few notifiers are configured as primary; all the other edge nodes that are neither emitters nor primary notifiers are called secondary notifiers. The reason for differentiating primary notifiers and secondary notifiers is to reduce the number of concurrent notification messages during a failure event, as detailed below.


[Figure: time-sequence chart with three timelines (emitter sends, notifier receives, notifier sends) showing the KA messages on VLAN 1, VLAN 2, and VLAN 3, the KA period, the transmission delay, the detection interval, the failure notification, and the repair notification.]

Fig. 2. FHP message time-sequence.

Figure 2 shows a schematic time-sequence chart of the protocol messages and node roles.

KA messages are broadcast periodically by the emitter edge router over each VLAN at the beginning of each TKA time interval. The requirement is that KA messages are received on all VLANs at all the other edge routers (notifiers) within the predefined TDI time interval. Since the transmission delay is, in general, different for each notifier and protocol time intervals are short, the synchronization of notifiers with respect to the emitter is of key importance. Therefore, each notifier starts a timer when the first KA message arrives in order to measure when TDI has elapsed; i.e., the first received KA message synchronizes the notifier to the emitter. Thus, the effect of the difference in transmission delay among different notifiers is eliminated. Subsequent KA messages suffer somewhat different delays as they travel along different paths, which has to be taken into account when configuring TDI. The arrival of all KA messages is registered in each notifier edge node. If there are KA messages that have not arrived within TDI, then the corresponding VLANs are considered down. That is, the loss of a single KA message is interpreted as the breakdown of a VLAN. However, to avoid false alarms owing to a KA frame drop, notifiers can be configured to wait two or three subsequent KA periods and mark a VLAN broken only if a KA message is missing in each of those periods.
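The detection rule can be sketched as a small bookkeeping class. This is an illustration under stated assumptions, not the implementation of Ref. [14]: the class and method names are hypothetical, the timers are abstracted away (the caller is assumed to invoke `close_detection_interval` once per KA period, when TDI expires after the first KA of that period), and `miss_threshold` models the optional wait of two or three KA periods.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class KaMonitor:
    """Tracks KA arrivals per VLAN at one notifier (hypothetical helper)."""
    vlan_ids: List[int]
    miss_threshold: int = 1  # consecutive KA periods without a KA before a VLAN is marked broken
    misses: Dict[int, int] = field(default_factory=dict)
    broken: Set[int] = field(default_factory=set)

    def close_detection_interval(self, received: Iterable[int]) -> Dict[str, List[int]]:
        """Evaluate one KA period.

        `received` holds the VLAN IDs whose KA arrived within TDI.
        Returns the VLANs to announce in FAILURE and REPAIRED messages.
        """
        arrived = set(received)
        newly_failed: List[int] = []
        newly_repaired: List[int] = []
        for vlan in self.vlan_ids:
            if vlan in arrived:
                self.misses[vlan] = 0
                if vlan in self.broken:
                    # A KA arrived on a VLAN this notifier had marked broken.
                    self.broken.discard(vlan)
                    newly_repaired.append(vlan)
            else:
                self.misses[vlan] = self.misses.get(vlan, 0) + 1
                if self.misses[vlan] >= self.miss_threshold and vlan not in self.broken:
                    self.broken.add(vlan)
                    newly_failed.append(vlan)
        return {"FAILURE": newly_failed, "REPAIRED": newly_repaired}


if __name__ == "__main__":
    monitor = KaMonitor([1, 2, 3], miss_threshold=2)
    print(monitor.close_detection_interval([1, 2, 3]))  # nothing to report
    print(monitor.close_detection_interval([2, 3]))     # first miss on VLAN 1, still tolerated
    print(monitor.close_detection_interval([2, 3]))     # second miss: VLAN 1 reported as failed
    print(monitor.close_detection_interval([1, 2, 3]))  # KA is back: VLAN 1 reported as repaired
```

With `miss_threshold=1` the sketch reproduces the default behavior in which a single missing KA message is interpreted as a VLAN breakdown; larger values trade detection speed for fewer false alarms.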

All edge nodes, except the emitter, supervise the reception of KA messages. However, to avoid excessive protocol load after a failure, there are only a few primary notifier edge nodes whose task is to notify other edge nodes about the failure. The detection interval of primary notifiers is shorter than that of secondary notifiers, and it can be adjusted depending on the network size and other parameters. When a notifier edge node detects a failure, it broadcasts a FAILURE message over each operating VLAN that is considered unharmed. The message contains the IDs of the broken VLANs. As each edge node receives the FAILURE messages, all of them become aware of the failed VLANs.

Because the number of primary notifiers is intentionally limited, some failures might remain undetected by them, depending on the network topology. Therefore, if a secondary notifier detects a failure on the basis of a missing KEEP-ALIVE message, then this node broadcasts the FAILURE message to inform all the other edge nodes of the failure in the same way as described above.
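The notification step can be sketched as follows. The paper does not specify the frame layout, so the `FhpMessage` fields and the helper name are purely illustrative assumptions; only the rule itself (one FAILURE broadcast per unharmed VLAN, carrying the IDs of the broken VLANs) is taken from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Set, Tuple


class MessageType(Enum):
    KEEP_ALIVE = "KA"
    FAILURE = "FAILURE"
    REPAIRED = "REPAIRED"


@dataclass(frozen=True)
class FhpMessage:
    """Hypothetical in-memory view of an FHP broadcast (not the wire format)."""
    kind: MessageType
    sender: str                           # edge-router name, e.g. "R4"
    sent_on_vlan: int                     # VLAN over which the broadcast travels
    affected_vlans: Tuple[int, ...] = ()  # IDs of the broken or repaired VLANs


def build_failure_broadcast(sender: str, broken: Set[int],
                            all_vlans: Iterable[int]) -> List[FhpMessage]:
    """Create one FAILURE message per VLAN that is still considered unharmed."""
    affected = tuple(sorted(broken))
    return [
        FhpMessage(MessageType.FAILURE, sender, vlan, affected)
        for vlan in all_vlans
        if vlan not in broken
    ]


if __name__ == "__main__":
    for msg in build_failure_broadcast("R4", {1}, [1, 2, 3]):
        print(msg.sent_on_vlan, msg.affected_vlans)
```

Sending the notification on every unharmed VLAN means that the other edge nodes learn about the failure as long as at least one tree is still intact.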

The restoration procedure after a failure has been fixed is very simple in this protocol. Detecting the repair of a formerly broken VLAN is an easy task because the emitter continuously broadcasts KA messages over all VLANs, even if a failure has been detected before. If the failure is repaired, then the same notifier that detected the failure is able to detect its restoration because it starts receiving KA messages over that VLAN again after the repair. Thus, that notifier can notify the other edge nodes by broadcasting a REPAIRED message containing the ID of the repaired VLAN. Figure 3 shows the operation of the FHP in a flowchart.


Fig. 3. Operation of FHP.

The protocol described above handles failures and their restoration in multiple spanning tree architectures, ensuring fast failover; therefore the user traffic experiences only a very short outage (of the order of 50 milliseconds), which depends on the configuration of the protocol time intervals. However, the FHP should also be protected against the breakdown of edge nodes. As there are multiple notifier nodes, their role is taken over as described before: any notifier that recognizes a failure informs the others if that failure has not already been reported. Nonetheless, the outage of an emitter edge node is a special case, which can also be easily recognized. If the emitter goes down, then no KA message from the emitter will arrive on any VLAN. Therefore, if no KA message arrives within TKA, then the emitter edge node is assumed to be broken (assuming that only a single failure can happen at a time). Then the so-called backup emitter takes over the emitter's role. If the emitter is repaired and comes back, then it will again receive KA messages in each VLAN and thus know that there is already an emitter in the network; after this event occurs the repaired emitter becomes the backup emitter. That is, as opposed to the Viking architecture [8], our protocol has no central entity that is exclusively responsible for a task. Instead, each role is located in a different part of the network, which results in a robust architecture.

A detailed description of a worst-case failure event illustrates the operation of our FHP well. Let us examine a failure in the network shown in Fig. 1, operating in primary-backup mode, where VLAN1 is the primary traffic path, R1 is the emitter, and R4 is the primary notifier. R2 and R3 are neither emitter nor primary notifier; they are secondary-notifier edge nodes. VLAN1 has the highest priority and VLAN3 has the lowest one in the VLAN priority list in each edge node. R1 sends KA messages to R4 through switches S1 and S4 on VLAN3 and VLAN2, and KA messages of VLAN1 reach R4 through S2 and S3. If S2 goes down, then R4 does not receive KA messages on VLAN1. Therefore, it broadcasts a FAILURE message on VLAN3 and VLAN2, which announces that VLAN1 is broken. Each edge node would thus redirect traffic to VLAN2 because it is the next one in the VLAN list. However, R2 does not receive any KA messages on either VLAN2 or VLAN1; therefore, it broadcasts a FAILURE message on VLAN3 saying that VLAN2 and VLAN1 are broken. Thus all traffic is redirected to VLAN3 in each edge node.
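The sequence of this worked example can be replayed with a short sketch of the state kept at one edge node. The class and method names are hypothetical, and the final repair step is an added assumption that extends the example to show the return to the primary VLAN.

```python
from typing import Iterable, List, Optional, Set


class EdgeNodeState:
    """VLAN bookkeeping of one edge node in primary-backup mode (illustrative)."""

    def __init__(self, priority_order: Iterable[int]) -> None:
        self.priority_order: List[int] = list(priority_order)
        self.broken: Set[int] = set()

    def on_failure(self, vlan_ids: Iterable[int]) -> None:
        """Process a received FAILURE message."""
        self.broken.update(vlan_ids)

    def on_repaired(self, vlan_ids: Iterable[int]) -> None:
        """Process a received REPAIRED message."""
        self.broken.difference_update(vlan_ids)

    def active_vlan(self) -> Optional[int]:
        """Highest-priority VLAN that is not broken, or None if all are down."""
        for vlan in self.priority_order:
            if vlan not in self.broken:
                return vlan
        return None


if __name__ == "__main__":
    node = EdgeNodeState([1, 2, 3])  # VLAN1 is the primary tree
    print(node.active_vlan())        # 1: normal operation
    node.on_failure([1])             # FAILURE from primary notifier R4
    print(node.active_vlan())        # 2: intermediate choice
    node.on_failure([1, 2])          # FAILURE from secondary notifier R2
    print(node.active_vlan())        # 3: final choice after S2 went down
    node.on_repaired([1, 2])         # assumed REPAIRED message once S2 is back
    print(node.active_vlan())        # 1: traffic returns to the primary VLAN
```

The intermediate switch to VLAN2 is what makes this the worst case: the final redirection has to wait for the longer detection interval of a secondary notifier.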

Failover time is a key performance indicator of any resiliency approach. Our failure-handling mechanism is fast because it depends only on the end-to-end transmission time of messages and on TKA, which is determined from the transmission time.

The theoretical upper bound of the failover time, considering network transmission and packet processing delays, is given by

Failover time ≤ TKA + TDI + Ttransmission + Tprocessing. (1)

The reason for this is that a failure happens at the beginning of a KA period in the worst case. It is then detected only in the next KA period before the end of the detection interval. In the worst case, a secondary notifier detects the failure; thus its TDI has to be taken into account. Realistic FHP timer settings allow failover times shorter than 50 ms, which is analyzed in detail in Section 5.

3.B. Protocol Implementation

Our proposed FHP has been implemented in the edge routers, which are Linux PCs in our testbed prototype. Despite using a non-real-time operating system, we achieved very short failover times under repetitive and extensive testing, which could of course be improved by integrating the proposed protocol into high-performance router hardware. The implementation is described in more detail in Ref. [14]. The core nodes are commodity off-the-shelf layer-2 Ethernet switches with VLAN support; no additional features are required to support FHP or to perform protection switching. Combinations of switches from two different vendors, D-Link and Extreme Networks, were tested.

In our testbed network shown in Fig. 1, a single emitter, a single primary notifier, and two secondary-notifier nodes were configured; traffic was mapped to the VLANs in load-sharing mode, and FHP was prototyped using 514-byte Ethernet frames, providing enough room to accommodate all of the needed protocol messages and additional parameters.

4. Design and Configuration

There are key elements in the network architecture that have to be configured for proper operation. The necessary configurations are described in the following subsections and can be implemented by an automated process.

4.A. Construction of Spanning Trees

The proposed architecture is based on multiple spanning tree topologies that have to be efficiently designed. The most important criterion for the set of spanning trees is that it has to be fault tolerant while keeping the number of trees small, because the number of VLAN IDs is limited (4096) and a smaller set simplifies network management. The calculation of spanning trees has been examined in the literature; however, minimizing their number has not been a primary goal. These algorithms were typically developed for weighted graphs where the weights represent the cost of the links. Gabow developed a polynomial-time algorithm to find k edge-disjoint spanning trees with the smallest cumulative weight in a directed graph [15]. The algorithm described in Ref. [16] can be used for the same purpose in an undirected graph. Nevertheless, edge-disjoint spanning trees are only needed if each element of a tree can break down at the same time, and completely independent backup is needed. Assuming that only a part of the network can break down at a time, the needed trees are not necessarily edge disjoint, but weaker requirements can be formulated instead.

We have developed an algorithm that determines the number of spanning trees necessary for overcoming failures of single network elements. The algorithm, which has to be invoked at network setup, is described in detail in Ref. [13]. This algorithm constructs spanning trees such that there remains a complete tree providing connectivity despite failure of any single element. The algorithm is divided into two parts according to the two types of failure: the first part determines the trees needed to handle link failures; the second part produces the additional trees needed to overcome node failures. If the network only has to be prepared for link failures, then it is sufficient to configure the trees resulting from the first part of the algorithm. The requirement for the spanning trees to be able to handle link failures is that, for each link, there be at least one tree that does not contain that specific link. Similarly, to overcome node failures, for each node there has to be a tree in which that node is a leaf. If these constraints are fulfilled, then there remains at least one complete tree that is not affected by the failure in case of the breakdown of any network element. The main achievement of our tree-calculation algorithm compared with existing solutions [8, 15, 16] is that it solves the problem with significantly fewer trees than the former methods. Our proposed algorithm results in near-optimal solutions with regard to the minimal number of spanning trees, which is a key issue owing to the limited number of VLAN IDs, and the minimal overhead of fault handling.
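The two constraints can be checked mechanically for a given set of trees. The sketch below only verifies a candidate set; it is not the tree-construction algorithm of Ref. [13]. The function names and the fully meshed four-node example are hypothetical and do not correspond to the topology of Fig. 1.

```python
from typing import FrozenSet, Iterable, List, Set

Edge = FrozenSet[str]


def link(u: str, v: str) -> Edge:
    """Undirected link between two nodes."""
    return frozenset((u, v))


def degree(tree: Set[Edge], node: str) -> int:
    """Number of tree links attached to `node`."""
    return sum(1 for e in tree if node in e)


def survives_any_link_failure(links: Iterable[Edge], trees: List[Set[Edge]]) -> bool:
    """Every link must be absent from at least one tree."""
    return all(any(lk not in tree for tree in trees) for lk in links)


def survives_any_node_failure(nodes: Iterable[str], trees: List[Set[Edge]]) -> bool:
    """Every node must be a leaf (degree 1) in at least one tree."""
    return all(any(degree(tree, n) == 1 for tree in trees) for n in nodes)


# Hypothetical fully meshed four-node example.
nodes = ["a", "b", "c", "d"]
tree_1 = {link("a", "b"), link("b", "c"), link("c", "d")}  # path a-b-c-d
tree_2 = {link("c", "a"), link("a", "d"), link("d", "b")}  # path c-a-d-b
all_links = tree_1 | tree_2  # the six links of the full mesh

if __name__ == "__main__":
    print(survives_any_link_failure(all_links, [tree_1, tree_2]))  # True
    print(survives_any_node_failure(nodes, [tree_1, tree_2]))      # True
    print(survives_any_link_failure(all_links, [tree_1]))          # False
```

In this example the two trees happen to be edge disjoint, but the check itself only requires that each link be missing from some tree, which is the weaker condition discussed above.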

4.B. Selection of Nodes for the Protocol

Having the necessary tree topologies, emitter and primary-notifier nodes have to be selected from among the edge nodes. The proposed default configuration is that each edge node should be set as a secondary-notifier node. Then one of them has to be configured as emitter and another one as backup emitter. Depending on the size of the network, one or more of the remaining edge nodes are configured as primary notifiers.

To achieve the shortest failover time and minimize the number of broadcast messages, we propose the following methods for the node selection:

• Emitter: the edge node that is closest on average to all other edge nodes in each tree, because this minimizes the transmission delay between the emitter and the notifier nodes. If a simpler rule is required, the edge node closest to all other edge nodes in the physical topology should be chosen. This criterion can easily be implemented with an exhaustive search.

• Backup emitter: the edge node that is closest to the emitter, since the backup has to take over the role of the emitter if it breaks down. This choice ensures the smallest change in transmission delay compared with the original setup.

• Primary notifier: the minimal set of edge nodes whose connection paths to the emitter cover each link of each tree. This definition also determines the number of necessary primary notifiers. If the links are categorized as risky and nonrisky links, then it is enough for the primary notifiers to detect the breakdown of risky links; failures of nonrisky links can be detected by secondary notifiers. Then it is sufficient to configure as primary notifiers the minimal set of edge nodes whose connection paths to the emitter cover each risky link of each tree. This configuration ensures that most of the failures are detected by a primary notifier, which makes the failover time shorter.

• Secondary notifier: all the remaining edge nodes.
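The exhaustive search for the emitter can be sketched as follows. The sketch assumes that hop count within each tree is an acceptable stand-in for transmission delay, which the paper does not state; the helper names and the two small example trees are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple

Adjacency = Dict[str, List[str]]


def adjacency(edges: Iterable[Tuple[str, str]]) -> Adjacency:
    """Build an undirected adjacency list from the links of one tree."""
    adj: Adjacency = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def hop_distances(tree: Adjacency, source: str) -> Dict[str, int]:
    """Breadth-first search: hop count from `source` to every node of the tree."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in tree.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def choose_emitter(trees: List[Adjacency], edge_nodes: List[str]) -> str:
    """Pick the edge node with the smallest total hop distance to all other
    edge nodes, summed over all trees (same ranking as the average)."""
    def total_distance(candidate: str) -> int:
        total = 0
        for tree in trees:
            dist = hop_distances(tree, candidate)
            total += sum(dist[n] for n in edge_nodes if n != candidate)
        return total
    return min(edge_nodes, key=total_distance)


# Hypothetical example: three edge nodes (r1..r3) and two switches (s1, s2).
tree_a = adjacency([("r1", "s1"), ("s1", "s2"), ("s2", "r2"), ("s2", "r3")])
tree_b = adjacency([("r1", "s1"), ("r2", "s1"), ("s1", "s2"), ("s2", "r3")])

if __name__ == "__main__":
    print(choose_emitter([tree_a, tree_b], ["r1", "r2", "r3"]))  # r2
```

Ties are broken by the order of `edge_nodes`; a deployment with measured per-link delays could replace the hop count with those values without changing the structure of the search.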

4.C. Configuration of Timers

Protocol timers have to be properly configured, mainly depending on the failover time (TF) to be achieved and the transmission delay of the network. As KA messages of different VLANs follow different paths, they suffer different transmission delays, which has to be considered in the choice of the detection interval (TDI). Therefore, the round-trip time (RTT) has to be measured (or estimated) on each VLAN between the emitter and the farthest primary notifier because they are the edge nodes most remote from each other that play a key role in the protocol. The largest RTT experienced among the VLANs is selected as the RTT between the emitter and primary notifier. The transmission delay is approximately half of the RTT, which includes the packet processing delay of intermediate switching nodes. As a consequence, we propose to set TDI no shorter than the RTT to avoid the effect of its variance. A larger TDI results in an even more robust setting; however, TKA gives an upper bound, since TDI cannot be larger than TKA in order to avoid overlapping of KA periods. Assuming that the TDI of primary notifiers equals the RTT, TKA can easily be calculated from Eq. (1): TKA ≈ TF − TDI − RTT/2 = TF − 3×RTT/2.


The upper bound of TDI at secondary notifiers is also TKA. However, it is better to set a shorter interval in order to achieve faster failover when a secondary notifier has to react to a failure. Because secondary notifiers are distinguished in order to avoid broadcast storms, their TDI has to be set larger than that of primary notifiers.
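The timer rules of this subsection can be collected in a small helper. This is a sketch under stated assumptions: the function name, the dictionary keys, and the `secondary_factor` parameter (how much larger the secondary TDI is than the primary one) are hypothetical, and the RTT value in the example is made up for illustration.

```python
from typing import Dict


def configure_timers(target_failover_ms: float, rtt_ms: float,
                     secondary_factor: float = 2.0) -> Dict[str, float]:
    """Derive FHP timers from the target failover time TF and the largest RTT.

    Primary TDI is set to the RTT, and TKA follows TKA ~ TF - 3*RTT/2.
    Secondary TDI is larger than the primary TDI but capped at TKA.
    """
    if secondary_factor <= 1.0:
        raise ValueError("secondary TDI must exceed primary TDI")
    t_di_primary = rtt_ms
    t_ka = target_failover_ms - 1.5 * rtt_ms
    if t_ka <= t_di_primary:
        raise ValueError("target failover time too short for this RTT")
    t_di_secondary = min(secondary_factor * t_di_primary, t_ka)
    return {
        "t_ka": t_ka,
        "t_di_primary": t_di_primary,
        "t_di_secondary": t_di_secondary,
    }


if __name__ == "__main__":
    # Hypothetical network with a largest measured RTT of 4 ms and TF = 50 ms.
    print(configure_timers(50.0, 4.0))
```

Note that the TKA formula budgets for detection by a primary notifier; when a secondary notifier has to detect the failure, its longer TDI enters Eq. (1) instead, so a tighter TKA may be needed if the target must also hold in that case.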

4.D. Priority Classes

Frames of the FHP have to be configured with the highest priority with respect to all other traffic. It has to be ensured that the highest-priority protocol frames are served via the highest-priority queues in the switches [5]; in this way the effect of bursty traffic that may fill some links for a short time is avoided, which otherwise could cause false alarms in the protocol. The RTT for high-priority packets includes negligible queuing delay in typical deployment scenarios, which allows smaller FHP timer settings and shorter failover time.

5. Performance Measurements

Performance was evaluated in the setup shown in Fig. 1, with the main objective of demonstrating fast failover under high protocol-stability conditions. A tester PC transmitted and received the probe traffic while controlling an optical switch in the middle of the link between S1 and S2 in order to generate failures.

Table 1 shows the measured failover time results collected over more than 1000 protection-switching events for several emitter/primary-notifier node configurations in the network topology shown in Fig. 1. The applied KA transmission period (TKA) was fixed at 15 ms; the detection interval TDI was 5 ms in the primary notifier and 10 ms in the secondary notifiers. The probe traffic enters the network through router R1 and leaves the network through router R3, as shown in Fig. 1.

Table 1. Failover time of FHP

Failover time [ms]    Emitter: R1    Emitter: R2    Emitter: R3    Emitter: R4
                      Primary: R2    Primary: R4    Primary: R4    Primary: R2

Average               19.62          21.83          15.38          20.47
Maximum               29             29             24             28
Minimum               12             11             7              12
Standard deviation    4.69           5.15           4.87           4.31

The results are consistent with theoretical predictions in the sense that the minimum (best-case) failover time is lower-bounded by TDI. The average time equals 0.5×TKA + TDI, and the maximum (worst-case) time is upper-bounded by TKA + TDI. A time interval should be added for all the cases to account for network delay, local packet processing, broadcast notification, and so on.
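These predictions are easy to evaluate for the timer settings used in the measurement (TKA = 15 ms, TDI = 5 ms for the primary notifier and 10 ms for the secondary notifiers). The function name and the dictionary keys below are illustrative; the delay term mentioned above is deliberately left out.

```python
from typing import Dict


def expected_failover_ms(t_ka: float, t_di: float) -> Dict[str, float]:
    """Predicted failover statistics without network and processing delay."""
    return {
        "min": t_di,               # failure occurs just before the next KA is due
        "avg": 0.5 * t_ka + t_di,  # failure instant uniformly spread over the KA period
        "max": t_ka + t_di,        # failure occurs at the start of a KA period
    }


if __name__ == "__main__":
    print(expected_failover_ms(15, 5))   # detection by the primary notifier
    print(expected_failover_ms(15, 10))  # detection by a secondary notifier
```

For detection by a secondary notifier this gives 10, 17.5, and 25 ms, and for detection by the primary notifier 5, 12.5, and 20 ms. The measured values in Table 1 exceed these figures by roughly 2 to 4 ms, which is the additional interval attributed to network delay and packet processing.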

The results show that the third scenario provides the fastest failover time. This is because in this scenario the primary notifier, whose TDI is the shortest, is able to detect the failure and initiates the notification process. In the first, second, and fourth scenarios, only a secondary notifier, configured with a longer TDI, can detect the network failure and initiate the FHP.

Figure 4 shows the results measured over 1000 protection-switching events with the KA period increased from 6 ms to 50 ms, the primary notifier's TDI set to TKA/3, the secondary notifiers' TDI set to 2×TKA/3, R2 as emitter, R4 as primary notifier, and R1 and R3 as secondary notifiers. In this configuration, a failure is always detected by a secondary-notifier node (specifically R3). It can be observed that the measured worst-case failover is consistent with the expected failover from Eq. (1), where the difference between them accounts for network transmission and, mainly, packet processing delays (about 5 ms, independent of TKA). The use of more powerful machines at the edge of the network should bring the measured and expected failover times closer.

[Figure: plot of failover time [ms] (axis from 0 to 100) versus KA period [ms] (6 to 50), with curves for the average, maximum, and minimum failover time.]

Fig. 4. Failover time.

The results indicate that the maximum failover time can be maintained below 50 ms by keeping the KA period below 25 ms. Even low-performance routers such as the Linux PCs used in the testbed (Pentium II and Pentium III machines) can reliably sustain the FHP operating with KA transmission intervals as low as 6 ms, in which case the maximum failover time is 15 ms.
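These figures can be cross-checked against Eq. (1) with the settings of this experiment. The sketch assumes that the secondary notifier's TDI is two thirds of TKA and that the transmission and processing terms are lumped into a single delay of about 5 ms, as reported above; the function names are illustrative.

```python
def worst_case_failover_ms(t_ka: float, t_di: float, t_delay: float) -> float:
    """Eq. (1) with transmission and processing delay lumped into `t_delay`."""
    return t_ka + t_di + t_delay


def max_ka_period_ms(target_ms: float, di_fraction: float, t_delay: float) -> float:
    """Largest TKA that keeps Eq. (1) within `target_ms` when TDI = di_fraction * TKA."""
    return (target_ms - t_delay) / (1.0 + di_fraction)


if __name__ == "__main__":
    for t_ka in (6, 25):
        t_di = 2 * t_ka / 3  # secondary notifier setting used for Fig. 4
        print(t_ka, round(worst_case_failover_ms(t_ka, t_di, 5.0), 1))
    print(round(max_ka_period_ms(50.0, 2 / 3, 5.0), 1))
```

Under these assumptions the bound evaluates to 15 ms for a 6 ms KA period and to about 46.7 ms for a 25 ms KA period, and the 50 ms target is reached at a KA period of about 27 ms, in line with the measured results.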

The traffic overhead generated by the protocol is a relevant parameter, and it varies as a function of the KA transmission period, being calculated as follows:

LFHP = 514 × 8 × NVLAN / TKA, (2)

where 514 is the size in bytes of the tagged Ethernet frame used in our implementation, NVLAN is the number of VLANs in the network, and LFHP is expressed in kbit/s when TKA is given in milliseconds.

Figure 5 shows the protocol load normalized to 1 Gbit/s Ethernet as a function of TKA, with NVLAN as a parameter. As the curves show, the protocol overhead can be kept low (below 1% of the link capacity) even for medium-size networks (with several tens of edge nodes), where between 10 and 20 VLANs would be enough [13], and failover can be achieved within 20 ms with the KA period set to 10 ms.
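Equation (2) and the normalization used in Fig. 5 can be evaluated directly. The following is a minimal sketch; the function and constant names are chosen here for illustration, and TKA is assumed to be given in milliseconds so that the result is in kbit/s.

```python
FRAME_BYTES = 514           # tagged Ethernet frame size from Eq. (2)
GBE_LINK_KBPS = 1_000_000   # 1 Gbit/s expressed in kbit/s


def fhp_load_kbps(n_vlan: int, t_ka_ms: float) -> float:
    """Protocol load per Eq. (2): 514 x 8 x NVLAN / TKA, in kbit/s."""
    if n_vlan < 0 or t_ka_ms <= 0:
        raise ValueError("n_vlan must be non-negative and t_ka_ms positive")
    return FRAME_BYTES * 8 * n_vlan / t_ka_ms


def normalized_load_percent(n_vlan: int, t_ka_ms: float,
                            link_kbps: float = GBE_LINK_KBPS) -> float:
    """Protocol load as a percentage of the link capacity."""
    return 100.0 * fhp_load_kbps(n_vlan, t_ka_ms) / link_kbps


if __name__ == "__main__":
    for n_vlan in (5, 10, 15, 20):
        load = normalized_load_percent(n_vlan, t_ka_ms=10.0)
        print(f"{n_vlan:2d} VLANs, TKA = 10 ms: {load:.3f} % of 1 Gbit/s")
```

For 20 VLANs and a 10 ms KA period this gives 8224 kbit/s, about 0.82% of a 1 Gbit/s link, which agrees with the statement that the overhead stays below 1%.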

In the repeated node-failure tests, the results were in full agreement with Table 1, and the failover time fell without exception between the minimum and maximum values in all cases.

Besides the topology shown in Fig. 1, we also ran repeated link-failure tests in a 12-node grid network topology. The results were in line with Table 1, which demonstrates the good performance and scalability of the protocol.

Fig. 5. Protocol load. [Plot: normalized FHP load (%) versus KA period (ms), from 5 to 25 ms, with curves for 5, 10, 15, and 20 VLANs.]

An additional feature of FHP that has been experimentally verified in our testbed is that no packets are lost at the edge routers during the restoration phase, corresponding to either link or node repair. This is because all edge routers are notified after network restoration by means of the broadcast of REPAIRED messages. Consequently, input packets at each edge router are forwarded again over the original VLAN (i.e., the one used before protection switching) without need for additional synchronization among edge routers.
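The revert behavior described above can be illustrated with a toy model of an edge router's VLAN selection. This is a hypothetical sketch, not the FHP implementation: the `EdgeRouter` class, its method names, and the VLAN numbers are all assumptions; only the REPAIRED message and the return to the original VLAN come from the text.

```python
from typing import Tuple


class EdgeRouter:
    """Toy model of which VLAN an edge router uses for incoming packets."""

    def __init__(self, original_vlan: int) -> None:
        self.original_vlan = original_vlan  # VLAN used before any failure
        self.active_vlan = original_vlan    # VLAN currently used for forwarding

    def on_failure_notification(self, backup_vlan: int) -> None:
        """Protection switching: move traffic to an unaffected VLAN (assumed API)."""
        self.active_vlan = backup_vlan

    def on_repaired(self) -> None:
        """Handle a broadcast REPAIRED message: return to the original VLAN."""
        self.active_vlan = self.original_vlan

    def forward(self, packet: bytes) -> Tuple[int, bytes]:
        """Tag the packet with the active VLAN; every packet is still forwarded."""
        return (self.active_vlan, packet)


if __name__ == "__main__":
    # Each router reacts only to the messages it receives; no router-to-router
    # coordination is modeled.
    routers = [EdgeRouter(original_vlan=10), EdgeRouter(original_vlan=10)]
    for router in routers:
        router.on_failure_notification(backup_vlan=20)
    for router in routers:
        router.on_repaired()
    print([router.forward(b"payload")[0] for router in routers])
```

In this model each router decides locally on receipt of the broadcast message, which is one way to picture why no additional synchronization among edge routers would be required.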

In summary, our approach reduces failover time to a few tens of milliseconds; it is thus faster than the Viking approach [8], whose failover time is slightly below one second.

6. Conclusions

We have presented and experimentally validated a lightweight and efficient protection technique for a low-cost, robust Ethernet-over-fiber network architecture built on low-cost, commodity off-the-shelf Ethernet switches. We have implemented the proposed high-availability architecture in a prototype network. We have shown how to extract the minimal number of spanning trees that need to be configured in the network nodes, and we have developed and experimentally demonstrated a fast, in-band failure-handling protocol that works at the Ethernet level. We have described a method for selecting protocol roles among edge nodes and for properly configuring protocol timers. The performance and robustness of the protocol were validated by means of extensive protection-switching tests. Experimental results showed that the worst-case failover time can be kept below 50 ms with a protocol overhead of 0.5% of link capacity in a GbE network. A further advantage of the proposed architecture is that no packets are lost during fault restoration.

Future work will include developing traffic engineering methods to optimize the use of spanning trees and extending the prototype with a network management system in order to obtain a plug-and-play network.

References and Links

[1] S. Shin, B. Ahn, M. Chung, S. Cho, D. Kim, and Y. Park, “Optics layer protection of Gigabit-Ethernet system by monitoring optical signal quality,” Electron. Lett. 38, 1118–1119 (2002), http://ieeexplore.ieee.org/iel5/2220/22262/01038625.pdf?arnumber=1038625.

[2] IEEE 802.1d, Standard for local and metropolitan area networks: Media access control (MAC) bridges, http://www.ieee802.org/1/pages/802.1D.html.

[3] IEEE 802.1w, Standard for local and metropolitan area networks: Rapid reconfiguration of spanning tree, http://www.ieee802.org/1/pages/802.1w.html.

[4] IEEE 802.1s, Standard for local and metropolitan area networks: Multiple spanning trees, http://www.ieee802.org/1/pages/802.1s.html.

[5] IEEE 802.1q, Standard for local and metropolitan area networks: Virtual bridged local area networks, http://www.ieee802.org/1/pages/802.1Q.html.

[6] IEEE 802.17, Resilient Packet Ring, http://www.ieee802.org/17.

[7] E. Shah, “Ethernet automatic protection switching,” IETF RFC 3619, October 2003, http://www.apps.ietf.org/rfc/rfc3619.html.

[8] S. Sharma, K. Gopalan, S. Nanda, and T. Chiueh, “Viking: a multi-spanning-tree Ethernet architecture for metropolitan area and cluster networks,” in Proceedings of 23rd Conference of the IEEE Communications Society (INFOCOM) (IEEE, 2004), http://www.ieee-infocom.org/2004/Papers/47_3.PDF.

[9] Cisco Systems white paper, “Bidirectional Forwarding Detection for OSPF,” (Cisco Systems, 2005), http://www.cisco.com/application/pdf/en/us/guest/tech/tk480/c1550/cdccont_0900aecd80244005.pdf.

[10] R. Aggarwal, “Application of Bidirectional Forwarding Detection,” (Juniper Networks, 2003), http://www.ripe.net/ripe/meetings/ripe-48/presentations/ripe48-eof-bfd.pdf.

[11] F. Cugini, L. Valcarenghi, P. Castoldi, and M. Guglielmucci, “Low-cost resilience schemes for the Optical Ethernet,” J. Opt. Netw. 4, 829–837 (2005), http://www.osa-jon.org/abstract.cfm?id=86422.

[12] IEEE 802.1ag Connectivity Fault Management, http://www.ieee802.org/1/pages/802.1ag.html.

[13] J. Farkas, C. Antal, G. Tóth, and L. Westberg, “Distributed resilient architecture for Ethernet networks,” in Proceedings of Design of Reliable Communication Networks 2005 (IEEE, 2005), pp. 515–522.

[14] J. Farkas, C. Antal, L. Westberg, A. Paradisi, T. R. Tronco, and V. G. Oliviera are preparing a paper to be called “Fast failure handling in Ethernet networks.”

[15] H. N. Gabow, “A matroid approach to finding edge connectivity and packing arborescences,” Comp. Sys. Sci. 50, 259–273 (1995).

[16] J. Roskind and R. E. Tarjan, “A note on finding minimum-cost edge-disjoint spanning trees,” Math. Op. Res. 10, 701–708 (1985).

