
www.huawei.com

Improve the Fault Management Capability of IP Networks

XiangPing Wu ([email protected])

Sean Xuewen Gong ([email protected])

For IEEE CQR 2009

Content

• Introduction

• Challenges

• Why are IP Networks not Reliable?

• Current State of Fault Management Capability of IP Networks

• The Requirement for Recovery

• The Typical Solutions

• Conclusions

HUAWEI TECHNOLOGIES CO., LTD. Page 3

Introduction

Future networks will be All-IP, converged networks:

• All services will be based on IP technology.

• Service control will be mainly handled by IMS (SIP signaling).

• Packet transport will be used for all network traffic.

• There will be many access types supporting IP connectivity.

[Figure: All-IP detail in each network layer. Layers shown: Services, Terminal, Core Network, Bearer, IP Infrastructure. Labels include TV, Voice, BB, Mobile; Voice, Video, Data, Apps; SIP signaling, HTTP; IP, Ethernet; NG-WDM/OTN.]


Challenges

• When migrating from legacy networks to All-IP, converged networks, operators are concerned about:

  • Can the future networks be as reliable as legacy networks?

  • ... and can they be so at a low cost of ownership?

• Rule of thumb for a carrier-grade telecom network:

  • 50 ms recovery time, and

  • Five 9s of availability
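To make the rule of thumb concrete, the following minimal sketch converts an availability of N nines into a yearly downtime budget and shows how many 50 ms recoveries would fit in that budget. The function names and the 365-day year are assumptions made for illustration only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget, in minutes, for an availability of N nines."""
    unavailability = 10.0 ** (-nines)  # five 9s -> 1e-5
    return MINUTES_PER_YEAR * unavailability


def max_outages_per_year(nines: int, recovery_ms: float) -> float:
    """Number of outages of the given duration that fit in the yearly budget."""
    budget_ms = allowed_downtime_minutes(nines) * 60 * 1000
    return budget_ms / recovery_ms


if __name__ == "__main__":
    print(f"Five 9s budget: {allowed_downtime_minutes(5):.3f} minutes per year")
    print(f"50 ms outages that fit: {max_outages_per_year(5, 50):.1f}")
```

Under these assumptions, five 9s leaves a budget of roughly 5.3 minutes of downtime per year.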


Why are IP Networks not Reliable?

Poor Fault Management Capability

• Fault detection mechanisms in IP are relatively slow, usually taking seconds.

• Fault recovery time depends on the re-convergence time of the IP network, usually minutes, and is worse for larger-scale networks.

• Side effects of network recovery, such as route flapping and black holes, also impact the stability of IP networks.

• Detecting degradation of QoS/QoE is still a challenging problem for IP networks.

New Security Threats

• An IP network has an open and distributed architecture and is prone to attacks from inside and outside, such as DoS/DDoS and viruses.

• Although new security mechanisms make networks more robust to attacks, attack methods are also evolving.

Complexity Makes IP Network Configuration Prone to Errors

• Consistency of neighboring routers

• Complex configuration options

• Rapid changes to the network

• Limited configuration tools


Fault Management Capability of IP Networks - Concept of Fault Management

• Detection. A fault is found, but determination of the failed component is not made.

• Diagnosis. The determination of which component has failed.

• Isolation. Ensuring a fault does not cause a system failure (isolation does not necessarily make the system function correctly).

• Recovery. Restoring the system to expected behavior.

• Repair. Restoring a system to full capability, including all redundancy.

• Notification. Between each step above, there is notification of the fault to the next step or steps in the process.

[Figure: fault management process with the steps Detection, Diagnosis, Isolation, Recovery, and Repair, linked by Notification.]

From the view of Fast Service Recovery, the steps of Detection and Recovery are of most importance.
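The step sequence above can be sketched as a simple ordered pipeline in which each step notifies the next. This is only an illustrative model: the `Stage` enum, the `run_fault_process` function, and the example fault string are hypothetical names, not part of any product or standard.

```python
from enum import Enum
from typing import Callable, List, Tuple


class Stage(Enum):
    """Fault management steps, in the order listed on the slide."""
    DETECTION = "Detection"
    DIAGNOSIS = "Diagnosis"
    ISOLATION = "Isolation"
    RECOVERY = "Recovery"
    REPAIR = "Repair"


def run_fault_process(
    fault: str, notify: Callable[[Stage, Stage, str], None]
) -> List[Stage]:
    """Walk the steps in order, notifying the next step after each one."""
    stages = list(Stage)  # Enum iteration preserves definition order
    for current, following in zip(stages, stages[1:]):
        notify(current, following, fault)
    return stages


# Collect the notifications instead of sending them anywhere.
notifications: List[Tuple[str, str, str]] = []
run_fault_process(
    "link-down (hypothetical fault)",
    lambda src, dst, fault: notifications.append((src.value, dst.value, fault)),
)

if __name__ == "__main__":
    for src, dst, fault in notifications:
        print(f"{src} -> {dst}: {fault}")
```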


Fault Management Capability of IP Networks - Detection

• Longer fault detection time compared to legacy networks

  • SONET/SDH are synchronous technologies, so faults can be detected within 50 ms.

  • IP is an asynchronous technology; it detects faults using keep-alive or hello mechanisms embedded in the protocols.

    • For OSPF and IS-IS, keep-alive or hello packets are sent every 3 seconds, and a fault is declared by default when 3 consecutive packets are lost, which means the default fault detection time is 9 seconds.

    • Reducing the keep-alive or hello transmission interval arbitrarily might result in route flapping.

• Another challenge for IP networks is the detection of degradation of QoS attributes, such as packet loss, delay, and jitter.
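The detection-time arithmetic above can be written out as a small sketch. The 3-second interval and the 3-packet threshold are the values quoted on the slide; actual timer defaults vary by protocol, vendor, and configuration, and the function name is an assumption made for illustration.

```python
SONET_SDH_DETECTION_MS = 50  # detection bound quoted for SONET/SDH


def hello_detection_time_ms(hello_interval_ms: int, missed_hellos: int) -> int:
    """Worst-case detection time when a fault is declared after N missed hellos."""
    return hello_interval_ms * missed_hellos


# Values quoted on the slide: hello every 3 s, fault after 3 lost packets.
slide_detection_ms = hello_detection_time_ms(3000, 3)
ratio_to_sonet = slide_detection_ms / SONET_SDH_DETECTION_MS

if __name__ == "__main__":
    print(f"Hello-based detection: {slide_detection_ms / 1000:.0f} s")
    print(f"Slower than the 50 ms bound by a factor of {ratio_to_sonet:.0f}")
```

With the slide's values, hello-based detection is 180 times slower than the 50 ms bound, which is why shortening timers is tempting despite the route-flapping risk.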


Fault Management Capability of IP Networks - Detection (Cont.)

• The separation of control and data planes also makes the correlation of fault detection across these two planes a problem.

[Figure: router architecture showing the control plane (primary RP and backup RP, with RIB) and the data plane (LC/NP, with FIB), connected through the BP, with the control signaling flow and the data traffic flow marked.]

LC: Line Card; NP: Network Processor; BP: Backplane; RP: Routing Processor; RIB: Routing Information Base; FIB: Forwarding Information Base


Fault Management Capability of IP Networks - Recovery

• The fault recovery time of IP networks depends on the re-convergence time of the network, which in turn depends on the size and topology of the network.

  • Basically, this time can be relatively LONG.

• In some networks, link and node faults can be recovered from very quickly by combining the recovery capabilities of the IP layer and the transmission layer.

  • But the fact is, this still CANNOT meet the rule-of-thumb requirements of carrier-grade reliability.


The Requirement for Recovery

• Is 50 ms of recovery NECESSARY for IP networks?

• If NOT, how long is required?

IP over Optical

• The protection time for IP carrier networks should not be faster than that of optical transmission networks.

• From the viewpoints of operation and billing, the protection time for IP carrier networks has to be equivalent to TDM.

Signaling vs. Voice Flow

• According to testing results, a signaling interruption of 1 ~ 2 seconds has no impact on dialogues.

• The story is totally different for voice flows; see the table below.

Recovery Time                        Impact on Voice Service
< 50 ms                              No impact
50 ~ 200 ms                          Connection loss probability is less than 5%, which has no impact on signaling
> 2 s (connection loss threshold)    Voice sessions and dedicated line connections are interrupted

From the above, the recovery time for IP networks should be within 50 ~ 500 ms.
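The table above can be expressed as a small lookup, sketched here under stated assumptions. The table gives no row between 200 ms and 2 s, so the sketch reports that range as unspecified rather than guessing. Treating the 50 ~ 500 ms range as an upper bound of 500 ms is also an assumption, as are the function names.

```python
TARGET_UPPER_BOUND_MS = 500  # assumed reading of "within 50 ~ 500 ms"


def voice_impact(recovery_ms: float) -> str:
    """Map a recovery time to the impact on voice service listed in the table."""
    if recovery_ms < 50:
        return "No impact"
    if recovery_ms <= 200:
        return "Connection loss probability < 5%, no impact on signaling"
    if recovery_ms > 2000:
        return "Voice session and dedicated line connection interrupted"
    return "Not specified in the table"  # 200 ms to 2 s has no row


def meets_target(recovery_ms: float) -> bool:
    """True if the recovery time is within the assumed 500 ms upper bound."""
    return recovery_ms <= TARGET_UPPER_BOUND_MS


if __name__ == "__main__":
    for ms in (30, 150, 1000, 5000):
        print(f"{ms} ms: {voice_impact(ms)} (meets target: {meets_target(ms)})")
```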


Case Example: Reliability of VPN with Enhanced VPN FRR

• The protection techniques commonly used today focus on protecting nodes and links in the core layer, and do not involve the PE.

• BGP keep-alive is used to detect faults on PE nodes; the detection time is above 3 seconds.

• After that, end-to-end route/LSP re-convergence is needed for service recovery; its duration depends on the number of routes and hops in the carrier network.

• The whole protection process lasts about 5 seconds.

The new VPN FRR protection:

• The fault detection time for typical networks is less than 200 ms.

• The re-convergence time for PE nodes depends only on the fault detection time of remote PE nodes and the time needed for the state update in the forwarding engine; it is independent of the number of VPN routes.

• With an end-to-end view, based on service-impact centralized analysis, identify WHAT, and find out HOW.

• Innovation and/or optimization for solutions
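One way to see why recovery can be independent of the number of VPN routes is a forwarding table in which every route holds a preinstalled primary and backup remote PE and consults a shared PE state at lookup time. The sketch below illustrates that general idea only; it is not a description of the actual VPN FRR implementation, and all class names, PE names, and prefixes are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PeState:
    """Reachability of a remote PE, shared by every route that uses it."""
    name: str
    up: bool = True


@dataclass
class VpnRoute:
    """A VPN route with a preinstalled primary and backup remote PE."""
    primary: PeState
    backup: PeState

    def next_hop(self) -> str:
        # The forwarding decision reads the shared PE state at lookup time.
        return self.primary.name if self.primary.up else self.backup.name


def build_table(n_routes: int, primary: PeState, backup: PeState) -> Dict[str, VpnRoute]:
    """Create n_routes hypothetical prefixes that all point at the same PE pair."""
    return {
        f"10.{i // 256}.{i % 256}.0/24": VpnRoute(primary, backup)
        for i in range(n_routes)
    }


pe1, pe2 = PeState("PE1"), PeState("PE2")
table = build_table(1000, pe1, pe2)

# One state update when the remote PE fault is detected; no per-route work.
pe1.up = False

if __name__ == "__main__":
    switched = sum(route.next_hop() == "PE2" for route in table.values())
    print(f"{switched} of {len(table)} routes now use the backup PE")
```

In this model the failover cost is a single state change, so total recovery time is dominated by how quickly the remote PE fault is detected.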


Conclusions

• All-IP is the inevitable trend for future networks and services.

• Reliability will be the foundation for All-IP to become a reality.

• Fault management capability is one of the most important features for achieving acceptable IP network reliability. From the view of service protection and recovery in particular, the capabilities of fault Detection and Recovery are the keys to success.

• Current IP networks have some inherent deficiencies in their fault detection and recovery capabilities, and so cannot meet the requirements of carrier-grade reliability.

• We have developed some solutions to improve the fault management capability of IP networks. These solutions can make IP networks truly carrier-grade and meet the requirements of operators around the world.


Questions?
