Fault Management*
*Mani Subramanian, “Network Management: Principles and Practice”, Addison-Wesley, 2000.
Fault Management
The process of locating and correcting network problems and faults
o A fault is a failure of a network component, which results in loss of connectivity
It is the most important functional management area
Process, 5 steps:
o Identify faults
  o Gathering information via traps (linkDown, egpNeighborLoss) and polling
  o Traps may not be sufficient
  o Is a received trap an important one?
o Locate fault
  o Detect all failed components and trace down the tree topology to the source (e.g., with an interface card failure on a router, all connected components will indicate a failure)
  o Fault isolation by network and SNMP tools
  o Use artificial intelligence / correlation techniques
o Restore service (high priority)
o Identify the root cause of the problem (trouble ticket)
o Resolve problem
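The "trace down the tree topology" step can be sketched as follows. This is only an illustration under stated assumptions: the topology is a tree given as a child-to-parent map, the set of failed components is already known, and the node names (`core`, `router1`, `card1`, ...) are hypothetical. A failed node whose parent is still healthy is reported as a candidate fault source.

```python
from typing import Dict, List, Optional, Set


def locate_fault_sources(parent: Dict[str, Optional[str]], failed: Set[str]) -> List[str]:
    """Return failed nodes whose parent is not failed (candidate fault sources)."""
    # A failed node below another failed node is treated as a consequence,
    # not as an independent fault.
    return sorted(n for n in failed if parent.get(n) not in failed)


# Hypothetical topology: child -> parent (None marks the root).
parent = {
    "core": None,
    "router1": "core",
    "card1": "router1",   # interface card on router1
    "hostA": "card1",
    "hostB": "card1",
    "router2": "core",
    "hostC": "router2",
}

# Everything behind the failed interface card reports a failure.
failed = {"card1", "hostA", "hostB"}

print(locate_fault_sources(parent, failed))  # ['card1']
```

The sketch ignores redundant paths; in a meshed network a component can stay reachable through another parent, so a real tool would need more than a parent lookup.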
Network Restoration - example
[Figure: IP data layer (IP/MPLS, DiffServ packet QoS) over an intelligent transport layer (routing/protection, switch, backbone); labels: "Virtual router topology", "Collapsed Hierarchy, Improved Efficiency"]
Traffic is successfully restored only after failure notification and a round-trip configuration/confirmation:
o Failure detected, source notified
o Message received and resources configured
o Send ACK
o Resources successfully set up, restore traffic
Preliminaries
An event is an exceptional condition in the operation of the network
o Software failure
o Performance bottleneck
o Configuration inconsistencies
o Intrusion attempts
Network management operations
o Monitoring events
o Interpreting events
o Handling events
A single problem event may cause many symptom events
o Correlating symptom events to identify and localize the underlying problems
Illustrative scenario
A client application exchanges data over a TCP connection with a DB server
Distinct domains each administered by a different organization
Illustrative scenario
Problem scenario
A clock at an interface in WAN2 that supports a T3 link loses SYNC 4 times a second for 0.25 ms
This intermittent noise causes loss of 0.1% of T3 capacity
This small noise causes bit errors in a large number of packets routed over C-D
Bit errors cause packet losses, either at routers (if the IP header is corrupted) or at destinations
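The 0.1% figure follows directly from the numbers on the slide: 4 outages of 0.25 ms each add up to 1 ms of every second. The short calculation below checks this. The T3 line rate of 44.736 Mbit/s is an assumption added here (the nominal DS3 rate), used only to show how many bit times each outage spans.

```python
# Figures taken from the slide: 4 sync losses per second, 0.25 ms each.
LOSSES_PER_SECOND = 4
LOSS_DURATION_US = 250  # 0.25 ms expressed in microseconds

# Assumption (not on the slide): nominal T3/DS3 line rate of 44.736 Mbit/s.
T3_BITS_PER_SECOND = 44_736_000

# Total outage time within one second, in microseconds.
outage_us_per_second = LOSSES_PER_SECOND * LOSS_DURATION_US

# Fraction of link time (and hence capacity) affected.
lost_fraction = outage_us_per_second / 1_000_000

# Bit times spanned by a single outage and by all outages in one second.
bits_per_outage = T3_BITS_PER_SECOND * LOSS_DURATION_US // 1_000_000
bits_per_second_affected = bits_per_outage * LOSSES_PER_SECOND

print(f"affected capacity: {lost_fraction:.1%}")
print(f"bit times per outage: {bits_per_outage}")
print(f"bit times per second: {bits_per_second_affected}")
```

Because the outages are spread over the second, each one can hit a different packet, which is why a 0.1% capacity loss can still damage many packets.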
Illustrative scenario
Performance of the TCP connection degrades due to packet loss
TCP sender interprets this as congestion and hence reduces its window
TCP increases its window gradually until a new packet loss occurs
However, due to the noise, the TCP window will not increase
DB transactions by the client will last longer
DB server performance will degrade due to record lock-outs, causing frequent aborts for remote transactions
Illustrative scenario
Three important points
Problems propagate among related objects, and are possibly amplified by various protocol mechanisms
A single problem can cause numerous observable events in multiple domains
Some problems are not observable where they originate:
WAN2 domain may observe minor error events at the T3 interface, but these events may be indistinguishable from normal operating noise
WAN2 may be unaware that there is a problem
Challenges
Determine events to monitor and ways to analyze them
Operations staff must have knowledge of the operational parameters of managed objects and the significance of their events
Correlation of events and coordination among different domains
Automating the management activities (manual processing does not scale)
Modeling the Scenario
Partition the system into multiple management domains (e.g., enterprise domain, ED, and router domain, RD)
Each domain has a domain manager (DM) to monitor, correlate and handle its events
A DM may subscribe to receive notifications from other domains
ED sees the RD as a single entity connecting LAN1 and LAN2
Modeling the Scenario
Any problem in the connection is seen as an RD problem
Inside each domain, finer-grained correlation can determine the particular problem using symptoms from other domains
Example: packet loss, seen as degraded TCP performance, is detected by the ED, not by the RD. This symptom is received by the RD and can be correlated along with other observable symptoms to isolate the “clock problem”.
Detects only IP header corruption
Automating Event Management
An automated event management system (AEMS) must accurately model and store knowledge of the underlying system and its associated events.
Static information: associated with managed objects, such as SNMP traps, thresholds for MIB variables, etc.
Dynamic information: reflects addition, removal, upgrades of network devices, etc.
The process of automation is that of developing correlation algorithms to analyze observable events
Correlation algorithms must
Be scalable to large networks involving complex systems
Handle a large number of symptoms caused by a single problem
Be fast (real-time correlation)
Be robust, i.e., insensitive or resilient to noise (loss of a single alarm or generation of a spurious event should not affect the decision)
Problems and Symptoms
A problem is an event that can be handled directly; e.g., a faulty interface
Some problems are directly observable; others can be observed only indirectly, through their symptoms
Symptoms are observable events
Degraded application performance is a symptom of a faulty interface
Symptoms cannot be handled; symptoms persist unless the problem is resolved
Problems and symptoms propagate from one object to another
Noise in WAN → bit errors in link C-D → loss of packets at routers → poor TCP performance → frequent transaction aborts in the DB server
Event Correlation System
Monitors typically collect managed data at network elements and detect out of tolerance conditions, generating appropriate alarms.
The correlator uses an event model to analyze these alarms.
The event model represents knowledge of various events and their causal relationships
The event model depends on the knowledge of human experts
The correlator determines the common problems that caused the observed alarms.
Event Knowledge
The Modeler’s event knowledge contains the following information for each class of managed objects:
The data attributes of objects of this class (e.g., MIB variables).
The set of events that are observable within instances of this class (e.g., a particular MIB variable is above threshold), or by asynchronous event notifications.
The set of events caused by each problem. This set can include events within the object, as well as events in other objects to which the object is related.
The problems that can originate within instances of this class.
The relationships in which an instance of the class can be involved.
The events and/or problems that are exported by instances of the class.
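One possible way to hold the per-class event knowledge listed above is a small record type. The sketch below is an assumption about layout, not the modeler's actual format: the class name `ClassEventKnowledge`, its field names, and the sample T3 interface entries are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ClassEventKnowledge:
    """Event knowledge for one class of managed objects (hypothetical layout)."""
    name: str
    attributes: List[str] = field(default_factory=list)            # e.g. MIB variables
    observable_events: Set[str] = field(default_factory=set)       # thresholds, notifications
    problems: Set[str] = field(default_factory=set)                # problems originating here
    caused_events: Dict[str, Set[str]] = field(default_factory=dict)  # problem -> events it causes
    relationships: Set[str] = field(default_factory=set)           # relations to other classes
    exported: Set[str] = field(default_factory=set)                # events/problems visible outside

    def undeclared_causes(self) -> Set[str]:
        """Problems that have a cause list but are missing from `problems`."""
        return set(self.caused_events) - self.problems


# Hypothetical entry for the T3 interface of the illustrative scenario.
t3_interface = ClassEventKnowledge(
    name="T3Interface",
    attributes=["ifInErrors"],
    observable_events={"error_count_above_threshold"},
    problems={"clock_sync_loss"},
    caused_events={"clock_sync_loss": {"error_count_above_threshold", "bit_errors_on_link"}},
    relationships={"terminates_link"},
    exported={"bit_errors_on_link"},
)

print(t3_interface.undeclared_causes())  # set()
```

Keeping `caused_events` as a map from problem to event set makes it straightforward to read the set of events caused by each problem directly from the record.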
Coding Approach for Event Correlation
Treat the complete set of events caused by a problem as a “code” that identifies the problem
Correlation is the process of decoding the set of observed symptoms
o Determine which problem has these symptoms as its code
o Note: traditionally, alarms are correlated through searches over the event model knowledge base
Complexity of search limits scalability
o The event model is a large database, and the set of received alarms or symptoms may also be quite large
Coding Approach for Event Correlation
Two phases:
Codebook selection phase:
o Select a subset of events for monitoring – codebook
o Codebook is an optimal subset of events that must be monitored to distinguish the problems of interest from one another
o Ensure a desired level of noise tolerance
o Algorithms must decode or infer the problem in the presence of lost alarms or the existence of spurious alarms
Decoding phase:
o Find the problem whose associated symptoms (i.e., code) match the observed symptoms most closely
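The decoding phase can be sketched as a nearest-code search. The version below assumes that "most closely" means minimum Hamming distance and that codes are equal-length 0/1 tuples; the three-problem codebook is hypothetical and exists only to exercise the function.

```python
from typing import Dict, List, Tuple

Code = Tuple[int, ...]


def hamming(a: Code, b: Code) -> int:
    """Number of positions in which two equal-length codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def decode(observed: Code, codebook: Dict[str, Code]) -> Tuple[int, List[str]]:
    """Return the smallest distance and every problem at that distance."""
    best = min(hamming(observed, code) for code in codebook.values())
    matches = sorted(p for p, code in codebook.items() if hamming(observed, code) == best)
    return best, matches


# Hypothetical codebook over four monitored symptoms.
codebook = {
    "P1": (1, 1, 0, 0),
    "P2": (0, 0, 1, 1),
    "P3": (1, 0, 1, 0),
}

print(decode((1, 1, 0, 1), codebook))  # (1, ['P1'])        one spurious symptom
print(decode((1, 0, 0, 0), codebook))  # (1, ['P1', 'P3'])  tie: cannot decode uniquely
```

Returning all problems at the minimum distance, rather than a single winner, makes ties visible so that an ambiguous observation is not silently assigned to one problem.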
Causality Graph Models
Correlation is concerned with analysis of causality relations among events
o e → f denotes causality of event f by event e
o Causality is a partial order relation between events
o Relation can be described by a graph whose nodes represent the events and edges represent causality
Causality Graph Models
Event that is neither a symptom nor a problem.
Causal equivalence
A symptom caused by another symptom does not contribute any information about the problem
All these indirect symptoms can be eliminated without loss of information
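One reading of this pruning step is sketched below, under assumptions: the causality graph is an adjacency map, intermediate events (neither problem nor symptom) are skipped over, and traversal stops at the first symptom on each path so that symptoms reached only through another symptom are dropped. The sample graph, including the intermediate event `e1` and the indirect symptom `S5`, is hypothetical.

```python
from typing import Dict, Iterable, List, Set


def correlation_graph(
    causes: Dict[str, List[str]],
    problems: Iterable[str],
    symptoms: Set[str],
) -> Dict[str, Set[str]]:
    """Map each problem to the symptoms it reaches without passing through a symptom."""
    result: Dict[str, Set[str]] = {}
    for problem in problems:
        direct: Set[str] = set()
        seen = {problem}
        stack = [problem]
        while stack:
            node = stack.pop()
            for nxt in causes.get(node, []):
                if nxt in seen:
                    continue
                seen.add(nxt)
                if nxt in symptoms:
                    direct.add(nxt)    # keep the symptom, do not expand past it
                else:
                    stack.append(nxt)  # intermediate event: look through it
        result[problem] = direct
    return result


# Hypothetical causality graph: e1 is an intermediate event, S5 is caused only by S3.
causes = {
    "p1": ["e1"],
    "e1": ["S3", "S9"],
    "S3": ["S5"],
    "p2": ["S3", "S6"],
}
graph = correlation_graph(causes, ["p1", "p2"], {"S3", "S5", "S6", "S9"})
print(graph)
```

With this input, `p1` maps to S3 and S9 and `p2` maps to S3 and S6, while S5 disappears because it is caused only by S3.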
Correlation graph
Correlation
Information contained in the correlation graph must be converted into codes, one for each problem in the graph.
A code for a problem p is a vector p of 0s and 1s. Each bit corresponds to a symptom in the graph
example: code is of length 3 (3 symptoms) – after ordering of the symptoms (e.g., <S3, S6, S9>):
code for p1 is p1 = (1,0,1)
This means p1 causes symptoms S3 and S9
p2 = (1, 1, 0) and p11 = (1, 0, 1)
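The conversion from symptom sets to codes can be written out in a few lines. The symptom sets below are read off the codes given on the slide for the ordering <S3, S6, S9>; the helper names `encode` and `ambiguous_groups` are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

SYMPTOM_ORDER = ["S3", "S6", "S9"]

# Symptoms caused by each problem, as implied by the codes on the slide.
caused: Dict[str, Set[str]] = {
    "p1": {"S3", "S9"},
    "p2": {"S3", "S6"},
    "p11": {"S3", "S9"},
}


def encode(symptoms: Set[str], order: Iterable[str]) -> Tuple[int, ...]:
    """One bit per symptom, in the agreed order: 1 if the problem causes it."""
    return tuple(1 if s in symptoms else 0 for s in order)


def ambiguous_groups(codes: Dict[str, Tuple[int, ...]]) -> List[List[str]]:
    """Groups of problems that share exactly the same code."""
    by_code = defaultdict(list)
    for problem, code in codes.items():
        by_code[code].append(problem)
    return [sorted(group) for group in by_code.values() if len(group) > 1]


codes = {p: encode(s, SYMPTOM_ORDER) for p, s in caused.items()}
print(codes)                    # p1 -> (1, 0, 1), p2 -> (1, 1, 0), p11 -> (1, 0, 1)
print(ambiguous_groups(codes))  # [['p1', 'p11']]
```

The second call shows that p1 and p11 receive the same code, so no decoder can tell them apart from these three symptoms alone.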
Correlation graph
Event correlation is finding problems whose codes optimally match an observed symptom vector
Correlation
What happens when we observe symptoms S3 and S9?
Both P1 and P11 match the observed vector!
Clearly we know there is a problem, but we cannot identify it since both problems have identical codes.
What happens when we observe the symptom vector (0, 1, 0)?
Two possibilities: (1) a false event or (2) P3 occurred but one symptom was lost.
Correlation graph
Interpretation depends on whether loss is more likely than false alarm generation
If spurious or lost symptoms are unlikely, the information provided by S9 is redundant: the codes (1, 0) and (1, 1) over <S3, S6> are sufficient to correlate event vectors.
The subset of symptoms required to provide the desired level of distinction between problems is called the codebook
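Codebook selection in the noise-free case can be illustrated with a brute-force search over symptom subsets. This is a sketch, not the selection algorithm of the coding approach. It assumes two requirements: problems that are distinguishable with all symptoms must stay distinguishable, and every problem must keep at least one monitored symptom so that it differs from "no problem". The codes are the ones given earlier for <S3, S6, S9>.

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple

Code = Tuple[int, ...]


def project(code: Code, keep: Sequence[int]) -> Code:
    """Restrict a code to the symptom positions listed in `keep`."""
    return tuple(code[i] for i in keep)


def is_sufficient(codes: Dict[str, Code], keep: Sequence[int]) -> bool:
    """True if the reduced codes keep every distinction and none is all-zero."""
    reduced = [project(c, keep) for c in codes.values()]
    keeps_distinctions = len(set(reduced)) == len(set(codes.values()))
    all_nonzero = all(any(c) for c in reduced)
    return keeps_distinctions and all_nonzero


def smallest_codebook(codes: Dict[str, Code], n_symptoms: int) -> Tuple[int, ...]:
    """First smallest subset of symptom positions that is sufficient."""
    for size in range(1, n_symptoms + 1):
        for keep in combinations(range(n_symptoms), size):
            if is_sufficient(codes, keep):
                return keep
    return tuple(range(n_symptoms))


SYMPTOM_ORDER = ["S3", "S6", "S9"]
codes = {"p1": (1, 0, 1), "p2": (1, 1, 0), "p11": (1, 0, 1)}

keep = smallest_codebook(codes, len(SYMPTOM_ORDER))
print([SYMPTOM_ORDER[i] for i in keep])           # ['S3', 'S6']
print({p: project(c, keep) for p, c in codes.items()})
```

The search returns S3 and S6, giving p1 the code (1, 0) and p2 the code (1, 1). Other two-symptom subsets would also pass the check; the function simply returns the first one it finds, and exhaustive search is only practical for small symptom sets.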
Correlation - example
Codebook contains only three symptoms
The codebook distinguishes among all problems; however, it guarantees distinction by only a single symptom
A loss or spurious generation of S4 will result in a decoding error
Distinction between problems is measured by the “Hamming distance” between their codes
Radius of the codebook is ½ the minimum Hamming distance between its codes
Codebook not resilient to noise
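The radius can be computed directly from the pairwise Hamming distances. Both codebooks below are hypothetical: `weak` has two codes that differ in a single symptom, while `strong` keeps every pair at least four symptoms apart. The count of correctable errors uses the standard coding-theory bound of (d − 1) // 2 for minimum distance d.

```python
from itertools import combinations
from typing import Dict, Tuple

Code = Tuple[int, ...]


def hamming(a: Code, b: Code) -> int:
    """Number of positions in which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(codebook: Dict[str, Code]) -> int:
    """Smallest Hamming distance between any two codes in the codebook."""
    return min(hamming(a, b) for a, b in combinations(codebook.values(), 2))


def radius(codebook: Dict[str, Code]) -> float:
    """Half the minimum Hamming distance."""
    return min_distance(codebook) / 2


def correctable_errors(codebook: Dict[str, Code]) -> int:
    """Lost or spurious symptoms that can always be decoded uniquely."""
    return (min_distance(codebook) - 1) // 2


# Hypothetical codebooks.
weak = {"P1": (1, 0, 1), "P2": (1, 1, 1), "P3": (0, 1, 0)}
strong = {
    "P1": (1, 1, 1, 1, 0, 0),
    "P2": (0, 0, 1, 1, 1, 1),
    "P3": (1, 1, 0, 0, 1, 1),
}

print(radius(weak), correctable_errors(weak))      # 0.5 0
print(radius(strong), correctable_errors(strong))  # 2.0 1
```

A radius of 0.5 means one lost or spurious symptom can already turn one problem's code into another's, which is what "not resilient to noise" describes.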
Correlation- example
Event vectors {011100, 101100, 110100, 111000} will be decoded as P1 with a single symptom loss, and {111110, 111101} are interpreted as P1 with a single spurious symptom
When two error symptoms occur, the decoder will detect the error but cannot correctly (uniquely) decode the event (e.g., P1 and P4)
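The event vectors on this slide can be checked against P1. The code 111100 for P1 is inferred here from the vectors themselves (each "loss" vector is 111100 with one 1 cleared, each "spurious" vector is 111100 with one 0 set); the codes of the other problems are not given in the text, so the sketch only compares observations with P1.

```python
from typing import Tuple

# Inferred from the slide's event vectors, not stated explicitly there.
P1_CODE = "111100"


def compare(observed: str, code: str) -> Tuple[int, int]:
    """Return (lost, spurious) symptom counts of `observed` relative to `code`."""
    if len(observed) != len(code):
        raise ValueError("vectors must have the same length")
    lost = sum(c == "1" and o == "0" for c, o in zip(code, observed))
    spurious = sum(c == "0" and o == "1" for c, o in zip(code, observed))
    return lost, spurious


loss_vectors = ["011100", "101100", "110100", "111000"]
spurious_vectors = ["111110", "111101"]

for vector in loss_vectors + spurious_vectors:
    lost, spurious = compare(vector, P1_CODE)
    print(f"{vector}: {lost} lost, {spurious} spurious")
```

Every vector in the first group differs from P1 by exactly one lost symptom and every vector in the second group by exactly one spurious symptom, matching the interpretation given on the slide.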
Correlation- Advantages