Fault Management*

*Mani Subramanian, “Network Management: Principles and Practice”, Addison-Wesley, 2000.


Fault Management

The process of locating and correcting network problems and faults
o A fault is a failure of a network component, which
o results in loss of connectivity

It is the most important functional management area

Process, 5 steps:

o Identify faults
  o Gather information via traps (linkDown, egpNeighborLoss) and polling
  o Traps may not be sufficient
  o Is a received trap an important one?

o Locate fault
  o Detect all failed components and trace down the tree topology to the source (e.g., after an interface card failure on a router, all connected components will indicate a failure)
  o Fault isolation by network and SNMP tools
  o Use artificial intelligence / correlation techniques

o Restore service (high priority)

o Identify the root cause of the problem (trouble ticket)

o Resolve problem
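The "Locate fault" step can be sketched in a few lines. This is only an illustration, assuming the topology is a tree stored as a child-to-parent map and that every component below the failed one indicates a failure; the function name `locate_fault_sources` and the device names are hypothetical.

```python
def locate_fault_sources(parent, failed):
    """Return the failed components whose upstream parent is not failed.

    parent: dict mapping each component to its upstream parent (None for the root).
    failed: set of components currently indicating a failure.
    """
    return sorted(node for node in failed if parent.get(node) not in failed)


# Hypothetical topology: a router with two interface cards, each feeding hosts.
parent = {
    "router": None,
    "card1": "router",
    "card2": "router",
    "hostA": "card1",
    "hostB": "card1",
    "hostC": "card2",
}

# card1 fails, so everything connected through it also indicates a failure.
failed = {"card1", "hostA", "hostB"}
sources = locate_fault_sources(parent, failed)
print(sources)
```

Tracing each failed component toward the root and keeping only those with a healthy parent leaves the candidate source of the fault; here that is the interface card rather than the hosts behind it.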

Network Restoration - example

[Figure: IP data layer (IP/MPLS, DiffServ packet QoS) above an intelligent transport layer (routing/protection), with switch, backbone and virtual router topology. Collapsed hierarchy, improved efficiency.]

Traffic is successfully restored only after failure notification and a round trip configuration/confirmation.

o Failure detected, source notified

o Message received and resources configured

o SEND ACK

o Resources successfully set up, restore traffic

Preliminaries

An event is an exceptional condition in the operation of the network
o Software failure
o Performance bottleneck
o Configuration inconsistencies
o Intrusion attempts

Network management operations
o Monitoring events
o Interpreting events
o Handling events

A single problem event may cause many symptom events
o Correlating symptom events to identify and localize the underlying problems

Illustrative scenario

A client application exchanges data over a TCP connection with a DB server

Distinct domains each administered by a different organization


Problem scenario

A clock at an interface in WAN2 that supports a T3 link loses SYNC 4 times a second for 0.25 ms

Intermittent noise causing loss of 0.1% of T3 capacity

This small noise causes bit errors in a large number of packets routed over C-D

Bit errors cause packet losses, either at routers (if the IP header is corrupted) or at destinations

Performance of the TCP connection degrades due to packet loss

The TCP sender interprets this as congestion and hence reduces its window

TCP increases its window gradually until new packet loss. However, due to the noise, the TCP window will not increase

DB transactions by the client will last longer

DB server performance will degrade due to record lock-outs, causing frequent aborts for remote transactions
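The 0.1% figure in the scenario follows directly from the sync-loss numbers. The short calculation below reproduces it; the nominal T3 line rate of 44.736 Mbit/s is an assumption added here for illustration and is not stated in the scenario.

```python
# Figures from the scenario: 4 sync losses per second, 0.25 ms each.
losses_per_second = 4
loss_duration_s = 0.25e-3

# Fraction of each second during which the interface is out of sync.
lost_fraction = losses_per_second * loss_duration_s

# Assumption (not from the scenario): nominal T3 line rate of 44.736 Mbit/s.
T3_RATE_BPS = 44.736e6
affected_bits_per_second = lost_fraction * T3_RATE_BPS

print(f"lost fraction of capacity: {lost_fraction:.4%}")
print(f"bits affected per second:  {affected_bits_per_second:.0f}")
```

Under that assumed line rate, one millisecond of lost sync per second touches tens of thousands of bits, which is why such a small noise level can still corrupt a large number of packets.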


Three important points

Problems propagate among related objects, and are possibly amplified by various protocol mechanisms

A single problem can cause numerous observable events in multiple domains

Some problems are not observable where they originate:
o The WAN2 domain may observe minor error events at the T3 interface, but these events may be indistinguishable from normal operating noise
o WAN2 may be unaware that there is a problem

Challenges

Determine events to monitor and ways to analyze them

Operations staff must have knowledge of the operational parameters of managed objects and the significance of their events

Correlation of events and coordination among different domains

Automating the management activities (manual processing does not scale)

Modeling the Scenario

Partition the system into multiple management domains (e.g., enterprise domain, ED, and router domain, RD)

Each domain has a domain manager (DM) to monitor, correlate and handle its events

A DM may subscribe to receive notifications from other domains

ED sees the RD as a single entity connecting LAN1 and LAN2

Modeling the Scenario

Any problem in the connection is seen as an RD problem

Inside each domain, finer-grained correlation can determine the particular problem using symptoms from other domains

Example: degraded TCP performance caused by packet loss is detected by the ED, not by the RD. This symptom is received by the RD and can be correlated along with other observable symptoms to isolate the “clock problem”.
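The subscription between domain managers can be pictured with a small sketch. It assumes a simple in-memory publish/subscribe arrangement; the class `DomainManager`, its methods and the symptom strings are hypothetical and only illustrate the flow of notifications from the ED to the RD.

```python
class DomainManager:
    """Hypothetical domain manager that collects local and imported symptoms."""

    def __init__(self, name):
        self.name = name
        self.subscribers = []   # other DMs that asked for our notifications
        self.received = []      # (source domain, symptom) pairs seen so far

    def subscribe_to(self, other):
        """Ask another domain manager to forward its exported symptoms."""
        other.subscribers.append(self)

    def observe(self, symptom):
        """Record a symptom observed inside this domain."""
        self.received.append((self.name, symptom))

    def notify(self, symptom):
        """Export a symptom observed in this domain to all subscribers."""
        for dm in self.subscribers:
            dm.received.append((self.name, symptom))


ed = DomainManager("ED")
rd = DomainManager("RD")
rd.subscribe_to(ed)                      # RD wants symptoms seen by the ED

rd.observe("minor T3 interface errors")  # looks like normal noise on its own
ed.notify("degraded TCP performance")    # symptom only the ED can see

print(rd.received)
```

After the notification, the RD holds both its own minor error events and the symptom exported by the ED, which is the combined input it needs for finer-grained correlation.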

Detects only IP header corruption

Automating Event Management

An automated event management system (AEMS) must accurately model and store knowledge of the underlying system and its associated events.
o Static information associated with managed objects, such as SNMP traps, thresholds for MIB variables, etc.
o Dynamic information: reflects addition, removal, upgrades of network devices, etc.

The process of automation is that of developing correlation algorithms to analyze observable events

Correlation algorithms must be
o Scalable to large networks involving complex systems
o Able to handle a large number of symptoms caused by a single problem
o Fast: real-time correlation
o Robust: loss of a single alarm or generation of a spurious event should not affect the decision (insensitive or resilient to noise)

Problems and Symptoms

A problem is an event that can be handled directly; e.g., a faulty interface

Some problems are directly observable; others are observable only indirectly, through their symptoms

Symptoms are observable events
o Degraded application performance is a symptom of a faulty interface

Symptoms cannot be handled; symptoms persist unless the problem is resolved

Problems and symptoms propagate from one object to another
o Noise in WAN → bit errors in link C-D → loss of packets at routers → poor TCP performance → frequent transaction aborts in the DB server

Event Correlation System

Monitors typically collect managed data at network elements and detect out of tolerance conditions, generating appropriate alarms.

The correlator uses an event model to analyze these alarms.

The event model represents knowledge of various events and their causal relationships

The event model depends on the knowledge of human experts

The correlator determines the common problems that caused the observed alarms.
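The monitoring half of this pipeline can be sketched as a threshold check. This is a minimal illustration, assuming alarms are raised when a sampled value exceeds a per-variable threshold; the function `detect_alarms`, the element names and the threshold value are hypothetical.

```python
def detect_alarms(samples, thresholds):
    """Return (element, variable, value) for every out-of-tolerance sample.

    samples: dict mapping (element, variable) to the latest collected value.
    thresholds: dict mapping a variable name to its upper tolerance limit.
    """
    alarms = []
    for (element, variable), value in sorted(samples.items()):
        limit = thresholds.get(variable)
        if limit is not None and value > limit:
            alarms.append((element, variable, value))
    return alarms


# Hypothetical tolerance and collected data.
thresholds = {"ifInErrors": 100}
samples = {
    ("routerC", "ifInErrors"): 250,
    ("routerD", "ifInErrors"): 12,
    ("routerC", "ifInOctets"): 1_000_000,   # no threshold defined, never alarms
}

alarms = detect_alarms(samples, thresholds)
print(alarms)
```

The resulting alarm list is what a correlator would then analyze against its event model.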

Event Knowledge

The Modeler’s event knowledge contains the following information for each class of managed objects:

The data attributes of objects of this class (e.g., MIB variables).

The set of events that are observable within instances of this class (e.g., a particular MIB variable is above threshold), or by asynchronous event notifications.

The set of events caused by each problem. This set can include events within the object, as well as events in other objects to which the object is related.

The problems that can originate within instances of this class.

The relationships in which an instance of the class can be involved.

The events and/or problems that are exported by instances of the class.
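One way to picture this per-class knowledge is as a record with one field per item in the list above. The sketch below is an assumption about representation only; the class `EventKnowledge`, its field names and the sample T3 interface entries are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EventKnowledge:
    """Hypothetical container for the event knowledge of one object class."""
    class_name: str
    attributes: list = field(default_factory=list)         # e.g., MIB variables
    observable_events: list = field(default_factory=list)  # thresholds, notifications
    problems: list = field(default_factory=list)           # problems that originate here
    caused_events: dict = field(default_factory=dict)      # problem -> events it causes
    relationships: list = field(default_factory=list)      # links to other objects
    exported: list = field(default_factory=list)           # events visible to other classes


t3_interface = EventKnowledge(
    class_name="T3Interface",
    attributes=["ifInErrors"],
    observable_events=["ifInErrors above threshold"],
    problems=["clock loses sync"],
    caused_events={
        "clock loses sync": ["ifInErrors above threshold", "bit errors on connected link"],
    },
    relationships=["connected-to link"],
    exported=["bit errors on connected link"],
)

print(t3_interface.caused_events["clock loses sync"])
```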

Coding Approach for Event Correlation

Treat the complete set of events caused by a problem as a “code” that identifies the problem

Correlation is the process of decoding the set of observed symptoms
o Determine which problem has these symptoms as its code
o Note: traditionally, alarms are correlated through searches over the event model knowledge base

Complexity of search limits scalability
o The event model is a large database, and the set of received alarms or symptoms may also be quite large

Coding Approach for Event Correlation

Two phases:

Codebook selection phase:
o Select a subset of events for monitoring: the codebook
o The codebook is an optimal subset of events that must be monitored to distinguish the problems of interest from one another
o Ensure a desired level of noise tolerance
o Algorithms must decode or infer the problem in the presence of lost alarms or spurious alarms

Decoding phase:
o Find the problem whose associated symptoms (i.e., code) match the observed symptoms most closely
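The codebook selection phase can be illustrated with a brute-force search. This is a sketch under stated assumptions, not the selection algorithm of the source: it tries symptom subsets in order of size and returns the first one that keeps every pair of problem codes at least `min_distance` apart. The problem codes and the names `select_codebook`, `project` and `hamming` are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions in which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def project(code, columns):
    """Keep only the symptoms (columns) that belong to the codebook."""
    return tuple(code[i] for i in columns)


def select_codebook(codes, min_distance=1):
    """Smallest set of symptom indices that keeps all problems distinguishable.

    Returns a tuple of column indices, or None if no subset achieves
    the requested minimum pairwise distance.
    """
    n = len(next(iter(codes.values())))
    for size in range(1, n + 1):
        for columns in combinations(range(n), size):
            projected = [project(c, columns) for c in codes.values()]
            if all(hamming(a, b) >= min_distance
                   for a, b in combinations(projected, 2)):
                return columns
    return None


# Hypothetical problem codes over four symptoms.
codes = {
    "P1": (1, 1, 0, 0),
    "P2": (1, 0, 1, 0),
    "P3": (0, 1, 1, 1),
}

codebook = select_codebook(codes, min_distance=1)
robust_codebook = select_codebook(codes, min_distance=2)
print(codebook, robust_codebook)
```

Asking for a larger minimum distance forces more symptoms into the codebook, which is the trade-off between monitoring cost and noise tolerance. Exhaustive search is exponential in the number of symptoms, so it is only suitable for small illustrations like this one.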

Causality Graph Models

Correlation is concerned with analysis of causality relations among events
o e → f denotes causality of event f by event e
o Causality is a partial order relation between events
o The relation can be described by a graph whose nodes represent the events and whose edges represent causality

Causality Graph Models

[Figure: causality graph. Callouts: event that is neither a symptom nor a problem; causal equivalence.]

Symptoms caused by other symptoms do not contribute any information about the problem

All these indirect symptoms can be eliminated without loss of information

Correlation graph

Correlation

Information contained in the correlation graph must be converted into codes, one for each problem in the graph.

A code for a problem p is a vector p of 0s and 1s. Each bit corresponds to a symptom in the graph

example: code is of length 3 (3 symptoms) – after ordering of the symptoms (e.g., <S3, S6, S9>):

code for p1 is p1 = (1,0,1)

This means p1 causes symptoms S3 and S9

p2 = (1, 1, 0) and p11 = (1, 0, 1)
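Deriving codes from a graph can be sketched as a reachability computation: a bit is 1 when the symptom can be reached from the problem along causality edges. The edges below are hypothetical, chosen only so that the result reproduces the three codes quoted above; the intermediate event `e5` and the function names are likewise assumptions.

```python
def reachable(graph, start):
    """All events reachable from start by following causality edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def build_codes(graph, problems, symptoms):
    """Map each problem to a 0/1 vector over the ordered symptom list."""
    codes = {}
    for problem in problems:
        caused = reachable(graph, problem)
        codes[problem] = tuple(1 if s in caused else 0 for s in symptoms)
    return codes


# Hypothetical causality edges (event -> events it causes).
graph = {
    "p1": ["e5", "S9"],
    "e5": ["S3"],            # e5 is neither a problem nor a symptom
    "p2": ["S3", "S6"],
    "p11": ["e5", "S9"],
}
symptoms = ["S3", "S6", "S9"]   # ordering <S3, S6, S9>

codes = build_codes(graph, ["p1", "p2", "p11"], symptoms)
print(codes)
```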

Correlation graph

Event correlation is finding problems whose codes optimally match an observed symptom vector

Correlation

What happens when we observe symptoms S3 and S9?

Both P1 and P11 match the observed vector!

Clearly we know there is a problem, but we cannot identify it since both problems have identical codes.
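The ambiguity can be reproduced with a small nearest-code decoder. This is a minimal sketch using the three codes given in the text over the ordering <S3, S6, S9>; the function names `hamming` and `decode` are hypothetical, and returning every problem at the minimum distance is a design choice made here to expose ties.

```python
def hamming(a, b):
    """Number of positions in which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(codes, observed):
    """Return all problems whose code is closest to the observed vector,
    together with that minimum distance."""
    distances = {p: hamming(code, observed) for p, code in codes.items()}
    best = min(distances.values())
    return sorted(p for p, d in distances.items() if d == best), best


# Codes from the text, ordered as <S3, S6, S9>.
codes = {
    "p1": (1, 0, 1),
    "p2": (1, 1, 0),
    "p11": (1, 0, 1),
}

matches, distance = decode(codes, (1, 0, 1))   # S3 and S9 observed
print(matches, distance)
```

A result with more than one match at distance 0 means the observation is consistent with several problems, so additional symptoms would be needed to separate them.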

What happens when we observe symptoms (0, 1, 0)?

two possibilities: (1) a false event or (2) P3 occurred but one symptom was lost.

Correlation graph

Interpretation depends on whether loss is more likely than false alarm generation

In case spurious or lost symptoms are unlikely, the information provided by S9 is redundant: (1, 0) and (1, 1) are sufficient to correlate event vectors.

The subset of symptoms required to provide the desired level of distinction between problems is called the codebook

Correlation - example

The codebook contains only three symptoms

The codebook distinguishes among all problems; however, it guarantees distinction by only a single symptom

A loss or spurious generation of S4 will result in a decoding error

Distinction between problems is measured by the Hamming distance between their codes

The radius is ½ the Hamming distance

Codebook not resilient to noise
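The distance and radius of a codebook can be computed directly. The sketch below takes the radius as half the minimum pairwise Hamming distance, as stated above; both codebooks are hypothetical and were made up to contrast a weak case with a more robust one.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions in which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def codebook_distance(codes):
    """Minimum Hamming distance over all pairs of problem codes."""
    return min(hamming(a, b) for a, b in combinations(codes.values(), 2))


def codebook_radius(codes):
    """Radius taken as half the minimum pairwise distance."""
    return codebook_distance(codes) / 2


# Hypothetical codebook with three symptoms: some problems differ in one bit only.
weak = {
    "P1": (1, 0, 0),
    "P2": (1, 1, 0),
    "P3": (0, 1, 1),
}

# Hypothetical codebook with six symptoms: every pair differs in at least three bits.
strong = {
    "P1": (1, 1, 1, 0, 0, 0),
    "P2": (0, 0, 0, 1, 1, 1),
    "P3": (1, 1, 0, 1, 1, 0),
}

print(codebook_distance(weak), codebook_radius(weak))
print(codebook_distance(strong), codebook_radius(strong))
```

With a distance of 1, a single lost or spurious symptom can turn one problem's code into another's; with a distance of 3, a vector with one wrong bit is still closer to the correct code than to any other.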

Correlation - example

Event vectors {011100, 101100, 110100, 111000} will be decoded as P1 with a single symptom loss, and {111110, 111101} will be interpreted as P1 with a single spurious symptom

When two symptom errors occur, the decoder will detect the error but cannot correctly (uniquely) decode the event (e.g., P1 and P4)
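The single-error vectors listed above can be generated mechanically. The sketch assumes P1's code is 111100, which is inferred here from the listed vectors (each differs from 111100 in exactly one bit) rather than stated in the text; the function name `single_error_neighbours` is hypothetical.

```python
def single_error_neighbours(code):
    """Split all vectors at Hamming distance 1 from code into
    symptom losses (a 1 became 0) and spurious symptoms (a 0 became 1)."""
    losses, spurious = [], []
    for i, bit in enumerate(code):
        flipped = code[:i] + ("0" if bit == "1" else "1") + code[i + 1:]
        if bit == "1":
            losses.append(flipped)
        else:
            spurious.append(flipped)
    return losses, spurious


# Assumption: P1's code, inferred from the event vectors listed in the text.
P1_CODE = "111100"

losses, spurious = single_error_neighbours(P1_CODE)
print("single symptom loss:    ", losses)
print("single spurious symptom:", spurious)
```

A decoder that maps every vector in these two lists back to P1 tolerates any one lost or spurious symptom, provided no other problem's code lies within distance 1 of the same vectors.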

Correlation - Advantages
