
Spatio-temporal Patterns in Network Events

Ting Wang† Mudhakar Srivatsa‡ Dakshi Agrawal‡ Ling Liu†

†School of Computer Science, Georgia Institute of Technology — {twang, lingliu}@cc.gatech.edu
‡IBM T.J. Watson Research Center — {msrivats, agrawal}@us.ibm.com

ABSTRACT

Operational networks typically generate massive monitoring data that consist of local (in both space and time) observations of the status of the networks. It is often hypothesized that such data exhibit both spatial and temporal correlation based on the underlying network topology and time of occurrence; identifying such correlation patterns offers valuable insights into global network phenomena (e.g., fault cascading in communication networks). In this paper we introduce a new class of models suitable for learning, indexing, and identifying spatio-temporal patterns in network monitoring data. We exemplify our techniques with the application of fault diagnosis in enterprise networks. We show how they can help network management systems (NMSes) to efficiently detect and localize potential faults (e.g., failures of routing protocols or network equipment) by analyzing massive operational event streams (e.g., alerts, alarms, and metrics). We provide results from extensive experimental studies over real network event and topology datasets to explore the efficacy of our solution.

1. INTRODUCTION

A network, in its simplest form, can be modeled as a graph wherein nodes represent network entities and edges represent their pairwise interactions. It is known that simple, local interactions between network entities can give rise to complex, global network phenomena [5] (e.g., fault cascading in communication networks). Nevertheless, understanding and modeling such global phenomena based on local observations remains a key challenge in network science.

An operational network typically generates various monitoring data that essentially consist of local (in both space and time) observations on the state of dispersed network entities. Examples include SNMP (Simple Network Management Protocol) and syslog messages (e.g., ping failure, interface down, high CPU usage) in communication networks, and resource utilization alarms, SLA (Service Level Agreement) violations, and threshold-crossing alerts in datacenter networks.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ACM CoNEXT 2010, November 30 – December 3 2010, Philadelphia, USA.
Copyright 2010 ACM 1-4503-0448-1/10/11 ...$5.00.

[Figure 1: bar chart omitted; per-network breakdown of event percentages across SE, NE, DS, US, and TN]

Figure 1: Percentage of fault-triggered events reported at faulty nodes (SE) and nodes with specific relationships to faulty ones (NE, DS, US, TN) in four real enterprise networks, with sizes of 2,514 nodes, 3,200 nodes, 141 nodes, and 12,444 nodes, respectively.

Such local observations (henceforth called events) provide a window into understanding global network phenomena. However, the analysis of the raw monitoring data is inherently difficult due to its (1) incompleteness, e.g., important symptom events of a network fault may be missing due to incomplete monitoring or loss in transit; (2) imprecision, e.g., the recorded timestamps of individual events may be erroneous with respect to systemic causality relationships; and (3) massive volume, e.g., a single network fault often results in a burst of messages sent to the network management system (NMS) from the affected entities. It can therefore offer significant performance benefits to succinctly and scalably identify relevant events occurring across the entire network. Figure 1 motivates this with a concrete example from fault diagnosis in enterprise networks, using event data collected from four real enterprise networks. The figure shows the fractions of relevant events (those corresponding to the same fault) reported at the faulty node itself (SE), and at nodes with specific relationships (neighboring - NE, down-streaming - DS, up-streaming - US, and tunneling - TN) to the faulty node. Observe that those topologically related nodes account for an overwhelming 71-82% of the events.

It is often hypothesized that relevant network events exhibit both spatial and temporal correlation based on the underlying network topology and their time of occurrence, although the extent of the correlation may depend on the accuracy of the monitoring system. It is worth emphasizing that in complex, multi-layer networks (e.g., enterprise networks as targeted in this work), the network topology includes both horizontal (e.g., between BGP peers) and vertical (e.g., between layer L1 and L3 counterparts) relationships. We argue that identifying and leveraging such spatio-temporal patterns to correlate relevant events may offer both scalable (efficient in the face of massive event volumes) and robust (resilient to incomplete and imprecise monitoring data) primitives for network-event analysis.

In this paper, we propose a new class of models suitable for learning, indexing, and matching spatio-temporal patterns in network-event data. We exemplify our techniques with the application of detecting and localizing potential faults in enterprise networks. Conventional solutions, as adopted by most widely deployed NMSes in such networks, maintain a cache of "unresolved" events, and use rule-based or codebook-based mechanisms (e.g., [23, 25]) to correlate each new event with all cached events, so as to suppress dependent events and retain only the (unfixed) root-cause events in the cache. These approaches, however, suffer from issues of both scalability (e.g., the computational complexity is quadratic in the network size) and robustness (e.g., missing important symptom or root-cause events may result in a large number of unresolved events in the cache).

We observe: (i) Events triggered by a fault are typically generated by a small, approximately constant subset1 of network entities that are topologically related to the faulty entity within a limited time window; thus, for each new event arriving at the NMS, only the (potentially) topologically and temporally relevant events need to be considered. (ii) By aggregating a set of correlated events, one may be able to infer the root cause with high confidence, despite the possible loss of important symptom or root-cause events, and imprecision in individual events. To exploit these observations, we propose a new class of indexable network signatures that encode the temporal patterns of events as well as the topological relationships among the entities where these events occur. We present efficient learning algorithms to extract such signatures from noisy historical event data. With the help of novel space-time indexing structures, we show how to perform efficient online signature matching. We call the complete framework Tar (topologically-aware reasoning); to the best of our knowledge, it is the first proposal to utilize topologically-aware event patterns to perform scalable, yet robust network root-cause reasoning.

We implemented our solution on a large-scale testbed NMS, and empirically evaluated its efficacy in terms of diagnosis accuracy, scalability, predictive power, and error tolerance, demonstrating significant improvements over previous counterpart techniques (e.g., reducing fault diagnosis time from 45 minutes to 37 seconds for an event burst of size 15K).

1 The size of this subset depends on factors such as the degree distribution of the network, and is independent of the size of the entire network.

2. PROBLEM CONTEXT AND SCOPE

This section describes a fundamental yet challenging task facing network operators when analyzing network monitoring data in enterprise networks, namely, detecting and localizing potential network faults, which motivates our study of spatio-temporal patterns.

2.1 Fault Diagnosis in Enterprise Networks

We target large-scale enterprise networks, which usually involve hundreds of thousands of network entities (e.g., routers), and are typically managed by a centralized management system that collects (local) observations from network entities at different layers of the protocol stack. The aim of fault diagnosis is to quickly detect potential faults (e.g., failures of routing protocols or network equipment) for a given set of symptom events, and to localize the network entities possibly responsible for the faults, such that corrective measures can be directed at the root cause, as opposed to merely addressing the immediate symptoms.

Even though operators today have a myriad of network monitoring data at their disposal, it is still non-trivial to efficiently identify and extract root-cause events from massive event storms, owing to the inherent incompleteness and imprecision of the monitoring system, and to the cascading nature of network failures: the failure of a single network entity triggers a large burst of events from affected entities all over the network (see the example shown in Figure 1). For instance, as of 2007, large networks faced the challenge of monitoring over 100,000 network entities while coping with event bursts of over 10,000 events per second [17]. Evidently, scalability and robustness have become two key issues. The pairwise comparison approaches adopted by most widely deployed NMSes in enterprise networks, as discussed in Section 1, suffer on both counts.

2.2 TAR: Topologically-Aware Reasoning

It is observed, however, that two events may be correlated only if the underlying nodes (at which these events were triggered) are topologically related.

Example 1. In the network shown in Figure 2, the failure of router n6 may trigger events at both n4 and n7, since they establish a tunnel containing n6.

Hence, an incoming event needs to be correlated with only a subset of events that occurred at topologically related network entities, rather than on the scale of the entire network.

[Figure 2: example network topology with nodes n1–n8; omitted]

Figure 2: Correlation of topologically relevant events for root-cause analysis.

Also, the times of their occurrences tend to follow the order of fault cascading, within a limited time window. We hypothesize that a network fault may be characterized by the spatio-temporal patterns of its triggered events; leveraging such patterns facilitates efficient diagnosis. Further, the imprecision and incompleteness of individual events may be overcome by aggregating correlated events, which enables us to infer or even predict network faults with high confidence in the face of incomplete or imprecise information.

Exploiting such topological and temporal correlation, however, requires addressing a set of non-trivial challenges. First, how to select a set of patterns that best capture the essential spatio-temporal correlation in fault-triggered network events, from the myriad semantic structures, especially for complex, multi-layer enterprise networks? Second, how to concisely represent such spatio-temporal patterns? Third, how to efficiently extract (parameterize) the patterns from noisy historical data? Finally, how to support scalable online pattern matching against high-volume event streams?

This paper presents the design, implementation, and evaluation of Tar, the first known complete framework for extracting, indexing, and identifying spatio-temporal patterns in network monitoring data. The cornerstone of Tar is a class of indexable network signatures that concisely encode the temporal and topological patterns of events triggered by a root cause. Tar includes solutions to automatically extract such signatures from historical datasets and to efficiently match such signatures against high-volume network-event streams.

3. DESIGN OF TAR

This section introduces the basic concepts and notations used throughout the paper, then describes the design of network signatures, fundamental to Tar, and finally gives an overview of the architecture of Tar.

For simplicity of exposition, we start with single-layer networks (e.g., an L3 network), and discuss the extension to multi-layer networks in Section 4.4. We make the following assumptions. The network is modeled as a graph with each node representing a manageable logical entity ("entity" for short) uniting one or more physical devices (e.g., routers), and each edge corresponding to a network link (e.g., BGP peering). We also assume the network configuration to be static. The network management system (NMS) consists of a set of monitoring agents and a sink. Deployed at dispersed entities, the agents collect and send monitored events (alarms, performance metrics, alerts) to the sink, which is responsible for root-cause analysis. Each event is a tuple of the form 〈v, e, t〉, where v represents the network node generating the event, e the event type, and t the timestamp of the event. Events come in as an online stream. The goal of Tar can be summarized as: by analyzing the event stream and exploiting the network topology information, detect potential faults and localize their root-cause entities in real time.

relationship      description
selfing           u and itself
neighboring       u and v are directly connected
containing        u contains v as a sub-component (e.g., a router and its interfaces)
down-streaming    u is at v's down-stream side (the route from the sink to u contains v)
tunneling         u is on a tunnel (a special type of network connection) with v as one end

Table 1: Topological relationships and descriptions (u and v are the two network entities under consideration).
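As a concrete (hypothetical) illustration of these definitions, the event tuple 〈v, e, t〉 and a topo-set lookup might be sketched as follows; the dict-based topology encoding and all names here are our own assumptions, not the paper's implementation:

```python
from collections import namedtuple

# An event <v, e, t>: source entity, event type, timestamp.
Event = namedtuple("Event", ["v", "e", "t"])

def topo_set(topology, v, r):
    """N(v, r): the set of entities with relationship r to entity v.
    `topology` is assumed to be a dict mapping (v, r) -> set of entities."""
    return topology.get((v, r), set())

# Toy topology mirroring Figure 2: n6 is a neighbor of n3 and lies on
# a tunnel whose endpoints are n4 and n7.
topology = {
    ("n3", "neighboring"): {"n2", "n6"},
    ("n4", "tunneling"): {"n6", "n7"},
}

ev = Event("n3", "OSPF neighbor down", 102)
print(topo_set(topology, ev.v, "neighboring"))  # the set {'n2', 'n6'}
```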

3.1 Network Signature

The concept of a network signature is central to Tar. In designing network signatures, we strive to achieve a set of objectives: expressive - a signature should be semantically rich, capturing both topological and temporal features of correlated events; compact - yet it should be structurally simple, and thus easy to match against incoming events; and indexable - it should be amenable to indexing, and thus encodable in space-efficient structures for online matching.

Intuitively, we construct network signatures based on the following two fundamental observations. First, when a fault occurs at an entity u, correlated events are typically triggered at affected entities that are topologically related to u. Second, the event triggered at an affected entity v depends, in terms of its event type and time delay, on the topological relationship between u and v.

Example 2. In Figure 2, the failure of n6 may lead to the event "OSPF neighbor down" at n3, since n3 is a direct neighbor of n6, while n4 may observe the event "failed connection attempt", since the tunnel between n4 and n7 involves n6.

We consider a set of relationships R = {selfing, neighboring, down/up-streaming, containing/contained, tunneling}, with brief descriptions listed in Table 1. Note that the relationship down/up-streaming is defined from the viewpoint of the sink, i.e., u is at v's down-stream side if the route from the sink to u contains v. We refer to the set of entities with a specific relationship r to v as a topo-set, denoted by N(v, r). Each r ∈ R (except selfing) is also associated with a reverse relationship, denoted by r̄, e.g., down-streaming to up-streaming.
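The relationship set R and its reverse map can be written down directly. The following sketch is ours; treating the symmetric relationships (neighboring, tunneling) as their own reverses is our assumption, since the text explicitly names only pairs such as down-streaming/up-streaming:

```python
# The relationship set R (Table 1) and the reverse-relationship map.
R = ["selfing", "neighboring", "containing", "contained",
     "down-streaming", "up-streaming", "tunneling"]

REVERSE = {
    "selfing": "selfing",          # selfing has no distinct reverse
    "neighboring": "neighboring",  # symmetric (assumed)
    "containing": "contained",
    "contained": "containing",
    "down-streaming": "up-streaming",
    "up-streaming": "down-streaming",
    "tunneling": "tunneling",      # symmetric (assumed)
}

# Reversing twice must be the identity.
print(all(REVERSE[REVERSE[r]] == r for r in R))  # True
```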

[Figure 3: main architecture of Tar — (1) Signature Extraction from historical event data and historical network topology, (2) indexing of signatures and topology, (3) Signature Matching of the network-event stream against the signature and indexed topology, outputting the root cause]

Figure 3: Main architecture of Tar.

Note that, in general, when a fault occurs, a sequence of events could be triggered at a specific entity. Such sequences tend to follow certain statistical models (e.g., Markov chains [18, 19]), which could be manually encoded based on expert domain knowledge, or learned using temporal data mining techniques (e.g., [21]); the concrete techniques are orthogonal to the scope of this work. In this paper, we treat such event sequences as meta-events, encoded in the set E of known event types.

We consider finite sets of faults F (which may be unknown; the concrete faults considered in our implementation are discussed in detail in Section 5), event types E, and topological relationships R. We define a network signature as below.

Definition 1 (Network Signature). For a specific fault type f ∈ F, its signature sig(f) is a series of tuples 〈e, r, t, prob(e, r, t|f)〉, where e ∈ E, r ∈ R, t is a discretized time measure, and prob(e, r, t|f) denotes the probability of observing an event e, with time delay t, at an affected node with topological relationship r to the faulty node where f occurs.

fault type        relationship    event
interface down    tunneling       failed connection attempt
                  neighboring     OSPF neighbor down
...               ...             ...

Table 2: Signature example.

Example 3. Recall Example 2. We may have the signature for the fault "interface down" as in Table 2.

This definition is structurally intuitive to interpret, in that it simply encodes the association between a fault and a set of symptoms; it is nevertheless semantically rich, in that the encoded association includes both the temporal correlations of symptom events and the topological relationships among the entities that generate them.
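One natural concretization of Definition 1 is a sparse mapping from (e, r, t) to prob(e, r, t|f). This sketch is our own, with invented probabilities loosely following Table 2:

```python
# A network signature sig(f): sparse map (e, r, t) -> prob(e, r, t | f).
# Probabilities below are invented for illustration only.
sig_interface_down = {
    ("failed connection attempt", "tunneling", 2): 0.6,
    ("OSPF neighbor down", "neighboring", 1): 0.8,
}

def prob(sig, e, r, t):
    """prob(e, r, t | f); combinations absent from the map default to 0."""
    return sig.get((e, r, t), 0.0)

print(prob(sig_interface_down, "OSPF neighbor down", "neighboring", 1))  # 0.8
```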

3.2 High-Level Design

Conceptually, Tar entails three major phases (see Figure 3): (1) signature extraction, abstracting network signatures from the (possibly noisy) historical event data and the associated underlying topological information; (2) indexing, organizing the network signatures and the topological information of the targeted network into space-efficient indexing structures; and (3) signature matching, performing root-cause analysis by matching indexed signatures against incoming event streams.

Signature extraction. The main input to this phase is the (typically noisy) historical network-event data and the associated network topology information, which is expected to be comprehensive enough to cover a large number of combinations of the variables (e.g., event type, topological relationship, time delay). Most network monitoring infrastructures today can provide such datasets [1, 2]. By applying unsupervised learning techniques (details in Section 4.1), Tar extracts the statistical relationships among the set of relevant variables, and encodes them into a set of network signatures.

Indexing. Leveraging the learned network signatures requires "compressing" them into indexing structures so as to support fast lookup queries over spatio-temporal patterns. Specifically, in the application of fault diagnosis, one needs to efficiently correlate symptom events possibly caused by a common fault; a symptom-event-to-root-cause lookup structure that encodes inverted signatures is therefore desired. Meanwhile, to determine the topological correlation among a set of events, the underlying network topology needs to be frequently queried. Tar employs a set of novel topology indexing structures that support efficient spatial intersection queries (details in Section 4.2).

Signature matching. Equipped with network signatures and topology indices, one is now able to perform scalable signature matching against high-volume monitoring event streams. Intuitively, the matching process correlates topologically and temporally relevant events, and reasons about the underlying root cause. We propose a novel signature matching tree structure to enable fast fault localization (details in Section 4.3).

4. IMPLEMENTATION

In this section, we describe in detail the implementation of our signature-based fault diagnosis solution. Further, we discuss how to extend our model to the case of multi-layer networks.

4.1 Signature Extraction

Before performing signature extraction, it is necessary to prepare the training data from the typically noisy archived data. Here "noise" refers to events irrelevant to faults (e.g., heartbeat messages) or events with imprecise timestamps. We achieve this in two steps: (1) aggregate events into meta-events (a sequence of events at a specific network entity caused by a single root cause) by applying the temporal mining algorithm in [21], and (2) separate meta-events caused by faults from those triggered by regular network operations, based on their occurrence frequency and periodicity. Intuitively, events with extremely low frequency are usually noise, while over-frequent or periodic events typically correspond to regular network operations.

Let B denote the resulting training data. For clarity of presentation, we assume that each event2 in B is stored in the form (v, ev, tv), where v is the network entity generating the event, ev ∈ E the event type, and tv the timestamp of its occurrence.

The process of signature learning is divided into two main phases, partitioning and extracting. In partitioning, we select a subset B′ of the archive B, such that the events in B′ are organized into disjoint partitions {B}, each corresponding to a common root cause with high probability. We discard the remaining events B \ B′ as noise. In extracting, we summarize each partition B into a candidate signature P, and cluster the set of candidate signatures {P} into the final set of signatures. For clarity, we conceive of a candidate signature P as a 3-dimensional matrix wherein each element P(e, r, t) represents the likelihood of observing event e, with time delay t, at a node with relationship r to the faulty node.

We assume a time window ω that specifies the maximum delay between observing the first and last events triggered by a single fault. Also, all time measures have been discretized, and all timestamps have been normalized relative to the corresponding time windows.

Algorithm 1: Signature Extraction
Input: event archive B, maximum window size ω
Output: set of network signatures

// partitioning
1   B′ ← ∅
2   for each subset B of B within ω do
3       if cohe(B) > λ then continue
4       if B overlaps with some B′ ∈ B′ then
5           if cohe(B) ≥ cohe(B′) then continue
6       B′ ← B′ \ {B′} ∪ {B}
// extracting
7   for each partition B ∈ B′ do
8       {r*v}(v,ev)∈B ← argmin{rv} |∩(v,ev)∈B N(v, rv)|
9       for each r ∈ R, e ∈ E, and t ∈ ω do
10          P(e, r, t) ← Σ(v,ev,tv)∈B 1(r*v = r, ev = e, tv = t) / Σ(v′,ev′,tv′)∈B 1(r*v′ = r)
// determining the number of faults
11  |F| ← argmin|F| 2|θ| − 2 log[like({P}|θ)]
12  apply K-means (K = |F|) clustering to {P}
13  set the cluster centers as the fault signatures

Algorithm 1 sketches our learning algorithm.

(i) In the event archive B, we examine the events within each time window ω as a partition B, i.e., the events in the same partition are possibly triggered by a single fault. We intend to narrow down the set of candidate faulty nodes. We achieve this by leveraging the following observation: for each node v appearing in an event (v, ev) ∈ B, the faulty node must lie in one topo-set of v, N(v, rv) (with topological relationship rv); hence, it must also appear in the non-empty intersection of such topo-sets {N(v, rv)}. We consider all such non-empty intersections, and pick the minimum one as the candidate set. The intuition behind this is the minimum-explanation principle: the smallest candidate set is considered to provide the best explanation of the fault. We can then measure the quality of B using the size of this minimum candidate set, as formalized in the metric of coherence:

    cohe(B) = min{rv} |∩(v,ev)∈B N(v, rv)|   s.t.   ∩(v,ev)∈B N(v, rv) ≠ ∅

2 In the following, we assume that meta-events are encoded in the set of known event types E, and we do not further distinguish the terms "event" and "meta-event".

We discard those partitions with coherence above a threshold λ. If two selected partitions overlap, we pick the one with better quality (lines 2-6).
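The coherence metric can be sketched as follows; the brute-force search over per-node relationship choices and the dict-based topo-set encoding are our own simplifications for illustration:

```python
from itertools import product

R = ["selfing", "neighboring", "down-streaming", "up-streaming", "tunneling"]

def cohe(partition, topo, relationships=R):
    """cohe(B): the size of the smallest non-empty intersection of topo-sets
    N(v, r_v), minimized over per-node relationship choices r_v; float('inf')
    if no choice yields a non-empty intersection. `partition` is a list of
    event-source nodes v; `topo` maps (v, r) -> set of entities (assumed)."""
    best = float("inf")
    nodes = sorted(set(partition))
    for choice in product(relationships, repeat=len(nodes)):
        sets = [topo.get((v, r), set()) for v, r in zip(nodes, choice)]
        inter = set.intersection(*sets)
        if inter:
            best = min(best, len(inter))
    return best

# Figure 2 style example: n3's neighbors and n4's tunnel both contain n6,
# so the best explanation is the singleton {n6}.
topo = {
    ("n3", "neighboring"): {"n2", "n6"},
    ("n4", "tunneling"): {"n6", "n7"},
}
print(cohe(["n3", "n4"], topo))  # 1
```

Note that smaller coherence means a tighter candidate set, which is why Algorithm 1 discards partitions with cohe(B) above the threshold λ.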

(ii) In each selected partition B, for each involved node v, one identifies the topological relationship r*v that leads to the minimum non-empty candidate set. All the events {〈v, r*v, ev, tv〉} in B are then used to compute a potential signature P (lines 8-10). Note that here we simply use the frequency of tuples to compute P, while a prior distribution can be readily incorporated.

(iii) The number of fault types |F| (which is assumed unknown) essentially controls the complexity of the statistical model. We apply Akaike's information criterion [3] to select |F|.

Specifically, we assume that the candidate signatures corresponding to a common fault follow a Gaussian distribution. The information criterion of a model is given by aic(θ) = 2|θ| − 2 log[like({P}|θ)], where |θ| is the number of parameters to be estimated, which captures the model's complexity (i.e., expressiveness), and like({P}|θ) is the likelihood of observing the set of candidate signatures under the model (i.e., its fit to the data). A trade-off is made between these two terms. The setting of |F| leading to the minimum aic(θ) is considered optimal.

We apply the K-means (with K = |F|) clustering algorithm to the set {P}; the centers of the clusters are regarded as the signatures for the |F| faults (lines 11-13). Due to its sensitivity to the initial cluster centers, we run the clustering algorithm multiple times with randomly selected centers, and average over the results.
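The model-selection step can be sketched as follows, assuming (as one possible concretization of the Gaussian model above) that candidate signatures are flattened into vectors and each cluster is a spherical Gaussian around its K-means center; the helper names and the shared-variance estimate are ours, not the paper's:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's K-means (a minimal stand-in for the clustering step)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def aic(X, k):
    """aic = 2|theta| - 2 log-likelihood under a spherical Gaussian per
    cluster with one shared variance (an assumed concretization)."""
    centers, labels = kmeans(X, k)
    resid = X - centers[labels]
    sigma2 = max(resid.var(), 1e-6)          # shared variance estimate
    n, d = X.shape
    loglik = -0.5 * n * d * (np.log(2 * np.pi * sigma2) + 1)
    n_params = k * d + 1                     # k centers plus the variance
    return 2 * n_params - 2 * loglik

# Two well-separated groups of flattened candidate signatures.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 3)), rng.normal(5, 0.1, (20, 3))])
best_k = min(range(1, 5), key=lambda k: aic(X, k))
print(best_k)
```

In practice one would rerun each K-means with several random initializations, as the text describes, before comparing AIC values.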

4.2 Indexing

Tar attempts to detect and localize faults as follows. At each affected entity v observing event e at time t, network signatures are used to compute the probability prob(f, r̄|e) (r̄ is the reverse relationship of r) that the faulty entity incurred the fault f and has topological relationship r̄ to v. If this probability is greater than a system threshold, an evidence 〈f, v, r̄, t〉 is formed, indicating with high confidence that a faulty entity exists among the set of entities with relationship r̄ to v. All collected evidences within a time window are correlated to gradually narrow down the candidate faulty-entity set and the fault f.

Example 4. Recall Example 2. Assume that in Figure 2, n3 observes event "OSPF neighbor down" while n4 observes event "failed connection attempt". According to the signature in Table 2, one can infer that the faulty node lies in the topo-set with the neighboring relationship to n3 and in the topo-set with the tunneling relationship to n4, which uncovers n6 as the faulty node.

To facilitate signature-based fault diagnosis, Tar employs two indexing structures: an index of signatures for evidence computation, and an index of network topological dependency for evidence correlation.

Signature Index. We propose an inverted signature structure Is to support fast symptom-to-root-cause lookup, by maintaining the association between symptom events and possible faults. Recall that the signature of a fault f is a series of tuples 〈e, r, t, prob(e, r, t|f)〉, where e, r, and t represent a symptom event, a topological relationship, and a time delay, respectively, and prob(e, r, t|f) is the probability of observing e at an entity with relationship r to the faulty entity with delay t. Here we temporarily ignore t and adopt a marginal version prob(e, r|f), because the underlying root cause is unknown and the information on absolute time delay is missing. We will later use the relative time differences between symptom events to infer the potential fault.

Corresponding to each signature, we create an entry in the index Is: 〈f, r̄, prob(f, r̄|e)〉, where r̄ is the inverse relationship of r, and prob(f, r̄|e) is the posterior probability that f occurs at an entity with topological relationship r̄ to a given entity observing e. It is computed as:

prob(f, r̄|e) = prob(e, r|f) · prob(f) / Σ_{f′∈F} Σ_{r′∈R} prob(e, r′|f′) · prob(f′)

where the prior probability of the occurrence of fault f, prob(f), can be derived from the overall statistics of historical event data. At an entity v observing event e, for each fault f ∈ F, we select the set Rv of topological relationships r̄ that satisfy prob(f, r̄|e) above the system threshold. We term such a tuple 〈f, v, t, Rv〉 an evidence.
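A minimal sketch of this index construction, assuming an in-memory signature table; the data layout (`signatures`, `priors`) and helper names are our own illustration, not the paper's implementation:

```python
from collections import defaultdict

def invert(r):
    """Inverse topological relationship, e.g., up-stream <-> down-stream
    (relationship labels follow Figure 1; a sketch)."""
    return {"US": "DS", "DS": "US", "NE": "NE", "TN": "TN"}.get(r, r)

def build_signature_index(signatures, priors, relationships, threshold):
    """Build the inverted index Is: for each symptom event e, store entries
    (f, r_inv, prob(f, r_inv | e)) whose posterior clears the threshold.
    `signatures` maps fault f -> {(e, r): prob(e, r | f)} (marginalized over
    t); `priors` maps f -> prob(f)."""
    index = defaultdict(list)
    events = {e for sig in signatures.values() for (e, _) in sig}
    for e in events:
        # Denominator of Bayes' rule: sum over all faults f' and relations r'.
        z = sum(signatures[fp].get((e, rp), 0.0) * priors[fp]
                for fp in signatures for rp in relationships)
        if z == 0.0:
            continue
        for f, sig in signatures.items():
            for r in relationships:
                post = sig.get((e, r), 0.0) * priors[f] / z
                if post > threshold:
                    index[e].append((f, invert(r), post))
    return index
```

At diagnosis time, a lookup of `index[e]` at the entity observing e directly yields the candidate evidences.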

Topology Index Incorporating network topological information significantly boosts the precision of fault diagnosis, by correlating events according to their underlying topological relationships. Such improvement, however, requires space-efficient indexing structures that support fast retrieval of topological relationships among network entities.

As will be shown shortly, a key operation heavily involved in fault localization is computing the intersection of two topo-sets, e.g., joining the down-streaming neighbors of one entity and the direct neighbors of another; hence, for each indexing structure, we are particularly interested in minimizing the cost of retrieving (constructing) a topo-set from it. Due to space limitations, we focus our discussion on building indices for up/down-streaming relationships.

Figure 4: Signature matching tree Ts.

We construct our index based on the following two observations: (1) the shortest-path routes from the sink to all the entities form a spanning tree rooted at the sink, i.e., a tree cover of the network; (2) the diameter φ of a typical management domain (as observed in the four large enterprise networks shown in Figure 1) is about 3-7 hops. Therefore, the set of up-streaming neighbors (at most φ) of an entity can be directly cached. We then traverse the routing tree in level order (breadth-first), assigning each entity a traversal-order number. The down-streaming neighbors of a given entity u can then be summarized as φ intervals {[li, ri]}, i = 1, ..., φ, where li (ri) denotes the order number of its left-most (right-most) descendant on the ith level below u. Clearly, this structure achieves a retrieval cost of O(φ), since the neighbors on the same level can be retrieved in one consecutive chunk, at a storage cost of O(φ) per entity.

4.3 Signature Matching
With the help of topology indices, in each evidence 〈f, v, t, Rv〉, (v, Rv) can be replaced with the union of the corresponding topo-sets ⋃_{r∈Rv} N(v, r) (N(v, Rv) for short). We consider two evidences 〈f, t, N(u, Ru)〉 and 〈f′, t′, N(v, Rv)〉 (possibly) correlated if (1) f = f′, (2) the time of their occurrences is within a short window, |t − t′| ≤ ω, and (3) N(u, Ru) ∩ N(v, Rv) ≠ ∅. This concept generalizes naturally to multiple evidences.

While checking conditions (1) and (2) is fairly straightforward, computing the intersection of N(u, Ru) and N(v, Rv) is costly: the complexity is O(min{|N(u, Ru)|, |N(v, Rv)|}) even if both sets are stored in hash tables. Moreover, under a naïve pairwise-comparison paradigm, each incoming evidence would be compared with all existing ones to detect relevance, which scales poorly with the event-stream rate.

Signature Matching Tree We devise a novel signature matching tree Ts that enables efficient correlation of evidences. Our design follows the one-pass clustering philosophy [11], which endows Ts with high throughput and scalability. Figure 4 shows the chief structure of Ts. It is a hierarchical structure, with the highest level containing |F| buckets, each corresponding to one fault type f ∈ F. Within each bucket is a height-balanced tree T_s^f, into which evidences of the form 〈f, t, N(v, Rv)〉 are inserted. Each leaf of T_s^f corresponds to a cluster of relevant evidences; each non-leaf node represents the union of all the clusters in its subtree.

For each leaf node (cluster) L containing a set of evidences, we maintain the intersection of their topo-sets, called its aggregation, ρ(L) = ⋂_{〈f,t,N(v,Rv)〉∈L} N(v, Rv), which corresponds to the candidate faulty entity set; for each non-leaf node (super-cluster) I, we maintain the union of the aggregations of all the leaves in its subtree, ρ(I) = ⋃_{L∈I} ρ(L).

The signature matching tree Ts supports two basic operations, insertion and deletion. In insertion, an arriving evidence 〈f, t, N(v, Rv)〉 recursively descends down T_s^f by testing N(v, Rv) ∩ ρ(I) for each non-leaf node I encountered, until it is clustered into an appropriate leaf L that can absorb it. If no such leaf exists, a new one is created that solely contains this evidence; the aggregations of the nodes on the path from the leaf to the root of T_s^f are then updated. Evidences with timestamps outside the current time window are considered expired. In deletion, expired evidences are removed from the tree, and the aggregations of the nodes on the paths from the affected leaves to the root are updated in a bottom-up manner.

4.4 Extension
We now briefly discuss how to extend our model to support multi-layer networks. Typically, in a complex network (e.g., an enterprise network), monitoring data is collected at different layers within the protocol stack; e.g., data may be available from network entities at both the L1 layer and the L3 layer. Different layers tend to expose fairly different connectivity structures and relationships, and it is challenging to line up such heterogeneous topologies in a consistent manner.

In our current implementation, we employ the concept of composite relationship. Specifically, for two network entities u and v at two different layers, we may consider u × v as a composite entity if there exists a mapping u −m→ v between them, e.g., a layer-L3 entity and its layer-L1 counterpart. Now consider two relationships u′ −r1→ u and v −r2→ v′, with u′ at the same layer as u and v′ at the same layer as v. We define the composite relationship between u′ and v′ as u′ −(r1, m, r2)→ v′. We can then apply the concept of network signature as described above.

The drawback of this approach is the increased variable space, which may impose prohibitive overhead on both learning and applying the network signature model. For two layers with relationship types R1 and R2, the cardinality of the set of composite relationships is typically |R1||R2|. Note, however, that the combinatorial complexity may be largely reduced by domain expertise, since many composite relationships may be invalid.

attribute              description
IPAddress              address of the network entity generating the event
PollerIPAddress        address of the poller/monitor
Event-count            sequence number of the event
generic-trap           SNMP trap ID
specific-trap          enterprise-specific SNMP trap ID
Raw Capture Timestamp  timestamp of the trap message

Table 3: Attributes of a network event.

5. EMPIRICAL EVALUATION
This section presents an empirical evaluation of the efficacy of Tar in network fault diagnosis. The experiments are specifically designed to center around the following metrics: (1) the descriptive power of the network signature model in capturing real network faults; (2) the effectiveness of the learning algorithm in extracting network signatures from historical data; (3) the scalability of Tar in detecting and localizing network faults when facing high-volume monitoring data streams; (4) its fault-predictive power obtained by exploiting partial/incremental fault signature matching; and (5) its robustness against missing symptom events (e.g., due to packet losses in SNMP messages transported over UDP, or incomplete monitoring caused by configuration errors) and errors in the information regarding the underlying network topology (e.g., due to staleness in the discovered network topology). We start by describing the datasets.

5.1 Dataset
Our experiments mainly used two datasets from real-life network management systems. The first dataset is an archive of 2.4 million SNMP trap messages collected from a large enterprise network (spanning 7 ASes, 32 IGP networks, 871 subnets, 1,268 VPN tunnels, 2,068 main nodes, 18,747 interfaces, and 192,000 network entities) over several weeks in 2007. Event attributes of interest to us are listed in Table 3. The second dataset comes from a real European backbone network consisting of 2,383 network nodes (spanning 7 countries, 11 ASes, and over 100,000 entities). Based on its topology, we generate a synthetic monitoring data stream (with tunable failure rates) to quantify the efficacy and scalability of Tar. While the real event dataset collected in 2007 exhibits event bursts (events arriving within an extremely short time window) of sizes up to 12,000 events, the synthetic dataset (generated by artificially increasing failure rates) includes event bursts of sizes up to 36,000 events. Our core libraries were implemented in Java. All experiments were conducted on a Linux workstation with modest computational power: a 1.6 GHz Pentium IV with 1 GB of memory.

5.2 Descriptive Power


trap id  description
28       bsnIpsecEspAuthFailureTrap
79       bsnAPRegulatoryDomainMismatch
102      bsnAPBigNavDosAttack
104      bsnAPContainedAsARogue
124      bsnAPIfDown
130      bsnAPInterferenceProfileFailed
230      bsnTemperatureSensorFailure
378      csiErrorTrap

Table 4: SNMP trap ids (bsn - BackwardSequenceNumber, csi - Cisco SSL VPN Client Interface).

(network entity, event) set                                         support
(x.y.11.3, 124) (x.y.1.10, 124)                                     0.87
(x.y.8.29, 124) (x.y.9.163, 230)                                    0.86
(x.y.8.29, 124) (x.y.9.163, 230) (x.y.15.1, 378)                    0.78
(x.y.9.33, 103) (x.y.15.1, 378) (x.y.1.10, 230)                     0.75
(x.y.9.163, 104) (x.y.9.163, 230) (x.y.15.1, 378) (x.y.107.1, 102)  0.61

Table 5: Topology-agnostic signatures (only masked entity IP addresses and trap message IDs are shown).

We contrast our fault signature model against a codebook-based model [25], which we refer to as the topology-agnostic model. The topology-agnostic signature for a network fault is defined as a set of tuples of the form (network entity, symptom event), where the network entity is the concrete network node generating the event and the symptom events are drawn from those shown in Table 4. A sample of topology-agnostic signatures is shown in Table 5. The co-occurrence of the tuples of a signature in the event stream indicates the potential occurrence of the corresponding fault. Note that such signatures are coupled with concrete network entities, and are thus inherently deterministic in that the correlation of symptom events is explicitly encoded.

To make the comparison fair, we apply frequent pattern mining techniques to learn both topology-aware and topology-agnostic signatures. From both synthetic and real event datasets, we extracted a trace of 40,000 annotated events (root-cause events are marked with special tags), with 20,000 events used for extracting network signatures and the rest for measuring the efficacy of both signature models.

In signature mining, we examine events in a narrow time window around an annotated fault event, and use the well-known apriori algorithm to identify event sets with high co-occurrence frequency across such time windows. This approach extracts event sets that are observed with high support when a fault occurs in the network.

dataset    # signatures  time (mins)
real       44 (4,518)    3 (10)
synthetic  92 (12,645)   6 (21)

dataset    precision    recall
real       0.92 (0.82)  0.91 (0.83)
synthetic  0.88 (0.68)  0.87 (0.70)

Table 6: Comparison of topology-aware and -agnostic signature models (numbers for the topology-agnostic model are in parentheses).

Figure 5: Distribution of the frequency of meta-events, and normalized histogram of the periodicity of meta-events.

Figure 6: Occurrences of two sample meta-events, {28, 102, 104, 130} and {79, 124}.

The topology-agnostic signature directly encodes the association between frequent symptom event sets and the concrete network entities wherein these events occur; in contrast, our network signature abstracts such associations based on topological relationships among network entities, rather than concrete entities.
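The apriori-style mining step described above, over per-fault-window event sets, can be sketched as follows; the function and argument names are illustrative, not the paper's code:

```python
def frequent_event_sets(windows, min_support, max_size=3):
    """Apriori-style sketch of the signature-mining step: `windows` is a
    list of event sets, each collected in a narrow time window around an
    annotated fault; returns the event sets co-occurring in at least a
    `min_support` fraction of the windows, with their support."""
    n = len(windows)
    freq = {}
    frontier = {frozenset([e]) for w in windows for e in w}  # 1-item candidates
    size = 1
    while frontier and size <= max_size:
        counts = {c: sum(1 for w in windows if c <= w) for c in frontier}
        kept = {c for c, k in counts.items() if k / n >= min_support}
        freq.update({c: counts[c] / n for c in kept})
        # Apriori join: unions of frequent k-sets that have exactly k+1 items.
        frontier = {a | b for a in kept for b in kept if len(a | b) == size + 1}
        size += 1
    return freq
```

The apriori pruning step (only frequent k-sets generate (k+1)-set candidates) is what keeps the search space manageable.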

We compare the two models in terms of signature size and extraction time, with results shown in the upper part of Table 6. Evidently, the topology-aware model generates much more concise signatures than the topology-agnostic model, with a signature set two orders of magnitude smaller on both the real and synthetic datasets. This is mainly attributed to the nature of topology-aware signatures: one signature may encode multiple topology-agnostic signature instances, given that they reflect the same correlation at the level of topological relationships. It is also noticeable that topology-aware signatures lead to much more efficient signature extraction; this is explained by the fact that introducing topology-awareness significantly reduces the search space.

We further apply the extracted signatures to diagnose (classify) the faulty events in the test dataset, with accuracy (precision and recall) shown in the lower part of Table 6. Evidently, topology-aware signatures demonstrate better descriptive power in capturing the essential characteristics of network faults. Note, however, that because of its strong dependency on topological information, the topology-aware signature inherently faces the challenge of dealing with errors in that information (e.g., missing or stale topological information). We discuss the robustness of Tar against such errors shortly.

5.3 Learning Effectiveness
We evaluate the effectiveness of the data preparation phase of our learning algorithm. In data preparation, we use frequency and periodicity as two criteria to distinguish fault-caused meta-events from the rest. The normalized histogram of meta-events with respect to frequency (in logarithmic scale) is depicted in the left plot of Figure 5; it approximately follows a power-law distribution. It is observed that more than 60% of meta-events have fairly low frequency (e.g., below 5); as we confirmed by examining the definitions of the trap ids, these are mainly caused either by infrequent network operations (e.g., the event set {3} represents "the Cisco NetReg server has started on the host from which this notification is sent") or by certain non-recurrent faults, which are of modest interest for our purpose of leveraging existing diagnosis efforts. Meanwhile, the meta-events with significantly higher support than others are typically due to regular network operations; e.g., the event set {102, 104}, which appears with support 348, indicates "data from the remote side is available for the TDM channel".

The distribution of the periodicity of meta-events is illustrated in the right plot of Figure 5. Observe that most of the meta-events demonstrate low deviation of occurrence intervals, i.e., they result from normal network operations. We randomly selected two meta-events, {28, 102, 104, 130} and {79, 124}, with periodicity 0.43 and 1.09 (lower periodicity ⇒ more regular), respectively, and examined their implications. Figure 6 compares their occurrences. From the descriptions of the traps, it is confirmed that the meta-event {79, 124} indicates potential network faults, while the meta-event {28, 102, 104, 130} is caused by regular network operations, e.g., link-mode sniffing.

We then evaluate the effectiveness of the signature extraction component using the meta-event size histogram. More specifically, after applying the unsupervised learning algorithm over the event dataset and running a Monte Carlo simulation, we derived the histogram of meta-event size from the learned signatures, and compared it against that extracted from the real data.

The upper plot of Figure 7 illustrates the comparison of these two histograms (normalization is applied). It is clear that the distribution of the model-generated data fits that of the underlying data fairly tightly. Furthermore, we analyzed the distribution of individual events for real data and model-generated data, respectively. As shown in the lower plot of Figure 7, the two distributions demonstrate strong consistency, which empirically shows that our learning model captures the essential features of the real data.

5.4 Scalability
Next, we proceed to evaluate the scalability of Tar against the state-of-the-art approaches used by fault diagnosis engines in widely deployed NMSes. These approaches follow a pairwise event correlation paradigm:

Figure 7: Histograms of the size of meta-events (upper) and of individual events (lower) in real and model-generated data.

the engine maintains a cache of events seen in the recent past. A new event is compared (pairwise) against all the events in the cache to determine whether it can suppress one or more dependent events in the cache, or whether it can be suppressed by one of them; pairwise comparisons terminate when the new event gets suppressed. In such approaches, all symptom events will eventually be suppressed by the root-cause event (also known as the failure event); indeed, after processing all the events in an event burst, the only unsuppressed event in the cache has to be the root-cause event. Hence, one can identify root-cause events (and thus diagnose faults) by filtering unsuppressed events in the cache. Diagnosed faults are ultimately funneled to network operators for corrective actions.
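This baseline suppression loop can be sketched as follows; `suppresses(a, b)` stands in for the engine's rule base (meaning "a explains away b") and is a hypothetical predicate:

```python
def pairwise_root_cause(events, suppresses):
    """Sketch of the baseline pairwise-correlation engine: each arriving
    event is compared against every cached event; unsuppressed survivors
    after a burst are reported as root-cause candidates. This costs O(n^2)
    comparisons per burst, which is what Tar's indices avoid."""
    cache = []
    for e in events:
        if any(suppresses(c, e) for c in cache):
            continue                       # the new event is explained away
        # Otherwise, evict cached events the new event explains away.
        cache = [c for c in cache if not suppresses(e, c)]
        cache.append(e)
    return cache
```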

However, the pairwise correlation paradigm suffers from the following drawbacks: (1) the complexity of comparison grows quadratically in the size of the event burst; (2) caching all the unsuppressed events results in memory-bound operation (especially since event burst sizes of over 10,000 events per second are not uncommon); and (3) it lacks sufficient predictive capability and robustness to missing events (especially if the event corresponding to the root cause is missing). More recently, [4] proposed a divide-and-conquer approach to enhance the scalability of the pairwise event correlation approach by partitioning a large network into multiple management domains, each of which is independently monitored. They proposed a hierarchical event correlation architecture wherein event correlation is first performed within each management domain; then a selected subset of events is sent for cross-domain correlation to a super root-cause-analysis engine. In the comparison, we use CRCA to denote (pairwise) centralized root-cause analysis, and DRCA its distributed version. Furthermore, we consider a simplified version of CRCA that ignores down-stream correlation (which captures the most common types of cascading faults, and thus requires network-wide event correlation in the absence of down-stream topology indices).


Figure 8: Scalability of multiple fault diagnosis approaches (CRCA, CRCA without down-stream suppression, DRCA with 6 engines, and Tar) with respect to event burst size, on real and synthetic data.

Figure 9: Prediction accuracy of Tar with respect to the number of observed symptoms (synthetic dataset) and the time to fault occurrence (real dataset), with fitted curves.

Figure 8 presents the fault diagnosis response time of the multiple correlation approaches as a function of event burst size, using both real and synthetic event datasets. The figures show the ability of the signature-based correlation approach to scale linearly with burst size; on the other hand, the pairwise event correlation approach is of quadratic complexity, and does not scale well to large networks featuring high-volume event streams. It is noticeable that, among the baselines, DRCA demonstrates better scalability than its centralized counterparts; however, it tends to pose much stricter computational requirements than Tar, e.g., the memory required by cross-domain correlation easily hits the bound of the test platform (which causes the missing numbers for DRCA on the synthetic dataset).

Nevertheless, we note that the pairwise event correlation approach supports more generic event correlation than Tar, which only supports topological event correlation. In the case of network events, however, an event e1 occurring on node n1 and an event e2 occurring on node n2 are correlated only if the nodes n1 and n2 are topologically related. Hence, all event correlation rules used in the context of IP networks can be readily translated into network signatures of Tar.

5.5 Predictive Power
For large communication networks, it is a desirable feature of a fault diagnosis engine to raise a warning ahead of the actual occurrence of the fault, i.e., before fully observing the symptom events, so that network operators can quickly direct corrective efforts to the fault before it cascades at full scale.

Figure 9 shows the predictive capability of the signature-based diagnosis approach. These results have been averaged over 121 real faults and 1,000 synthetic faults, respectively. A signature-matching-based approach allows us to partially match a network-event stream with a fault signature and predict the fault even before the failure event is actually observed (or received) by the NMS. In our experiments with both synthetic and real event datasets, we ensured that the root-cause event (or the failure event) was withheld from the fault diagnosis engine until after the engine predicted the fault.

Using the synthetic event dataset, the left plot of Figure 9 shows the accuracy of fault prediction with respect to the number of observed symptom events. For a total set of more than 128 symptom events, observing only 15 symptom events gives a prediction accuracy close to 1. Using the real event dataset, the right plot of Figure 9 shows the fault prediction accuracy as a function of the time ahead of receiving (or observing) the failure event. The fitted curve shows that the predictive power grows exponentially as the failure occurrence approaches. In most cases (99.99%), it is possible to diagnose the fault accurately after the failure event is received; moreover, it is possible to predict the failure about 30 seconds and one minute before the actual failure event is observed, with accuracy 84% and 66%, respectively.

Predicting failures enables swifter recovery actions, and thus reduces SLA (service level agreement) violation costs (e.g., for customer VPN tunnel failures). In such cases, predicting a failure event a couple of seconds in advance can help us reconfigure MPLS paths between the two customer edge routers that are the endpoints of a (probably) faulty VPN tunnel. We also note that predicting a failure a minute before the actual failure is observed may be insufficient for other classes of failures (e.g., BGP route failures due to misconfiguration of BGP policies, whose ripple effects may propagate all over the network, creating route stabilization problems).

The ability to withhold the failure event and yet diagnose the fault clearly shows that Tar can tolerate missing events (even if the missing event is the failure event itself). A pairwise event correlation approach that suppresses symptom events only after observing the suppressing event cannot tolerate missing root-cause events: in the absence of the root-cause event, several symptom events will be left unsuppressed, and eventually such symptom events will (incorrectly) be funneled as failure events to network operators.

Figure 10: Fault diagnosis accuracy of Tar with respect to a varying amount of missing symptom events (synthetic and real datasets).

5.6 Error Tolerance
Above we have partially shown the robustness of Tar against missing symptom events. In this set of controlled experiments, we explore this direction further. By varying the amount of withheld symptom events, we simulate random packet losses. The efficacy of fault diagnosis by Tar with respect to the amount of missing symptoms (averaged over 121 real faults and 1,000 synthetic faults) is shown in Figure 10. On both synthetic and real event datasets, Tar achieves an accuracy above 0.75 even when 40% of the symptom events are missing. This result may at first glance seem in conflict with that in Figure 9, where Tar demonstrates prediction accuracy close to 1 after observing only 11.7% of the symptoms. Keep in mind, however, that the symptom events arriving early at the NMS tend to be topologically close to the faulty entity, and thus bear more information for fault diagnosis; in this set of experiments, by contrast, we simulate random packet losses irrespective of the events' arrival order (i.e., of the information they carry), which explains the seeming inconsistency.

Figure 11 shows the efficacy of fault diagnosis in the face of errors in the network topology information. We consider two models of topological errors: one randomly modifies (adds or deletes) links in the network, while the other modifies links according to a preferential attachment model [6]. As per the preferential attachment model, new links are likely to be added/deleted between a high-degree node and a low-degree node (the degree of a node denotes the number of its neighbors). In such a model (which more accurately reflects the growth of a network and thus the sources of errors in network topology information [24]), updated links typically affect connectivity to nodes in the periphery of the network, rather than the core nodes. Such added/deleted network links tend to affect network relationships only locally; meanwhile, random addition/deletion of links may affect network relationships at the scale of the entire network. Consequently, Tar performs much better when topology errors are introduced as per the preferential attachment model than when they are selected randomly.

Figure 11: Diagnosis accuracy versus errors in topological information (introduced by changing links according to a random or preferential attachment model; synthetic and real datasets).

6. RELATED WORK
Network management techniques have been evolving rapidly with advances in monitoring infrastructures and the gradual diversification of monitoring contexts.

Due to its widespread availability, low-level metric data (e.g., network traffic or routing data) has long provided valuable information for network operators to monitor underlying network status. A plethora of work has focused on static analysis of low-level metric data for disruption detection and troubleshooting in communication networks. Examples include analyzing BGP update messages using clustering to detect network disruptions [10, 22], applying multivariate analysis to routing data to model normal network traffic and detect deviations [15, 12], and using wavelet-based clustering to detect abnormal routing behavior in routing data [27]. Moreover, disruption detection using historical metric data has also been an important topic for computing systems in general [8].

Many of today's network monitoring infrastructures can provide high-level, descriptive observations or symptoms (events). A line of research has been dedicated to fault diagnosis from the set of observed symptoms. Existing solutions can be roughly categorized as expert-system or graph-theoretic techniques. The first category attempts to imitate the knowledge of domain experts, with examples including rule-based [23], case-based [16], and model-based systems [19]; the second category relies on a graphical model of the system that describes the propagation of each specific fault, with examples such as dependency graphs [14], codebooks [25], and belief networks [20]. The drawbacks of these techniques lie in the requirement for accurate dependency information among network entities (usually not available for large-scale enterprise networks), and in the cost of fault inference (which scales poorly with network size and complexity). In contrast, Tar requires only elementary topological information and network signatures to support online fault diagnosis over high-volume event streams.

With the emergence of more complex network contexts (e.g., information networks, social networks, etc.), network management is exposed to monitoring data with increasingly richer semantics (e.g., email, social media, etc.). Understanding the information conveyed by such monitoring data within the context of the underlying network poses a great challenge and yet offers valuable insight into global network phenomena. For example, SybilGuard [26] was proposed to leverage underlying social network structures in interpreting nodes' voting messages, thus defending against Sybil attacks. We consider Tar an initial effort towards understanding and modeling the interplay between monitoring data and the underlying network context. Nonetheless, the network context setting in Tar is still fairly limited; e.g., we only consider topological relationships among network entities, and focus on pairwise interactions among them. It may serve as a temporary solution until more comprehensive models are proposed, and it may inform the design of such models.

7. CONCLUSION
This work advances the state-of-the-art in network monitoring data analysis by presenting Tar, a general framework for learning, indexing, and identifying topological and temporal correlations in network-event data, based on a novel class of network signatures. We present efficient learning algorithms to extract such signatures from noisy historical event data, and, with the help of novel indexing structures, we show how to perform efficient, online signature matching against high-volume event streams. While focusing on topological-temporal patterns alone is unlikely to capture the myriad semantic structures existing in network-event data, we show that it is a powerful primitive supporting a range of applications. Our experiments deploying Tar with a large-scale testbed NMS to perform fault diagnosis show that Tar is able to perform scalable yet robust root-cause analysis.

Acknowledgement
This work was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. The first and last authors are partially supported by grants from the NSF CyberTrust program, the NSF NetSE program, an IBM SUR grant, and a grant from the Intel research council.

8. REFERENCES

[1] HP OpenView. http://www.openview.hp.com.
[2] IBM Tivoli Monitoring. http://www-01.ibm.com/software/tivoli/products/monitor/.
[3] H. Akaike. A new look at the statistical model identification. IEEE Trans. Auto. Cont., 19(6), 1974.
[4] D. Banerjee, V. Madduri, and M. Srivatsa. A framework for distributed monitoring and root cause analysis for large IP networks. In SRDS, 2009.
[5] A.-L. Barabasi. Linked: The New Science of Networks. Perseus Publishing, 2002.
[6] A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
[7] E. Cohen, E. Halperin, H. Kaplan, and U. Zwick. Reachability and distance queries via 2-hop labels. SIAM J. Comput., 32(5), 2003.
[8] I. Cohen, S. Zhang, M. Goldszmidt, J. Symons, T. Kelly, and A. Fox. Capturing, indexing, clustering, and retrieving system history. In SOSP, 2005.
[9] L. Fan, P. Cao, J. Almeida, and A. Broder. Summary cache: A scalable wide-area web cache sharing protocol. IEEE/ACM Trans. Netw., 1998.
[10] A. Feldmann, O. Maennel, Z. Mao, A. Berger, and B. Maggs. Locating internet routing instabilities. SIGCOMM Comput. Commun. Rev., 34(4), 2004.
[11] A. Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD, 1984.
[12] Y. Huang, N. Feamster, A. Lakhina, and J. Xu. Diagnosing network disruptions with network-wide analysis. SIGMETRICS Perform. Eval. Rev., 35(1), 2007.
[13] Internet Engineering Task Force. OSPF version 2. http://www.ietf.org/rfc.
[14] I. Katzela and M. Schwartz. Schemes for fault identification in communication networks. IEEE/ACM Trans. Netw., 3(6), 1995.
[15] A. Lakhina, M. Crovella, and C. Diot. Mining anomalies using traffic feature distributions. SIGCOMM Comput. Commun. Rev., 35(4), 2005.
[16] L. Lewis. A case-based reasoning approach to the resolution of faults in communication networks. In IM, 1993.
[17] J. P. Martin-Flatin, G. Jakobson, and L. Lewis. Event correlation in integrated management: Lessons learned and outlook. J. Netw. Syst. Manage., 15(4):481–502, 2007.
[18] X. Meng, G. Jiang, H. Zhang, H. Chen, and K. Yoshihira. Automatic profiling of network event sequences: Algorithm and application. In IEEE INFOCOM, 2008.
[19] Y. Nygate. Event correlation using rule and object based techniques. In IM, 1995.
[20] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., 1988.
[21] T. Wang, M. Srivatsa, D. Agrawal, and L. Liu. Learning, indexing, and diagnosing network faults. In KDD, 2009.
[22] J. Wu, Z. Mao, J. Rexford, and J. Wang. Finding a needle in a haystack: Pinpointing significant BGP routing changes in an IP network. In NSDI, 2005.
[23] P. Wu, R. Bhatnagar, L. Epshtein, M. Bhandaru, and S. Zhongwen. Alarm correlation engine. In NOMS, 1998.
[24] K. Yamasaki, K. Matia, S. V. Buldyrev, D. Fu, F. Pammolli, M. Riccaboni, and H. E. Stanley. Preferential attachment and growth dynamics in complex systems. Phys. Rev. E, 74(3), 2006.
[25] S. Yemini, S. Kliger, E. Mozes, Y. Yemini, and D. Ohsie. High speed and robust event correlation. IEEE Communications Magazine, 34(5), 1996.
[26] H. Yu, M. Kaminsky, P. B. Gibbons, and A. Flaxman. SybilGuard: Defending against Sybil attacks via social networks. In SIGCOMM, 2006.
[27] J. Zhang, J. Rexford, and J. Feigenbaum. Learning-based anomaly detection in BGP updates. In MineNet, 2005.