
Dynamic Anomalography: Tracking Network Anomalies via Sparsity and Low Rank†

Morteza Mardani, Student Member, IEEE, Gonzalo Mateos, Member, IEEE, and Georgios B. Giannakis, Fellow, IEEE∗

Abstract—In the backbone of large-scale networks, origin-to-destination (OD) traffic flows experience abrupt unusual changes known as traffic volume anomalies, which can result in congestion and limit the extent to which end-user quality of service requirements are met. As a means of maintaining seamless end-user experience in dynamic environments, as well as for ensuring network security, this paper deals with a crucial network monitoring task termed dynamic anomalography. Given link traffic measurements (noisy superpositions of unobserved OD flows) periodically acquired by backbone routers, the goal is to construct an estimated map of anomalies in real time, and thus summarize the network ‘health state’ along both the flow and time dimensions. Leveraging the low intrinsic-dimensionality of OD flows and the sparse nature of anomalies, a novel online estimator is proposed based on an exponentially-weighted least-squares criterion regularized with the sparsity-promoting ℓ1-norm of the anomalies, and the nuclear norm of the nominal traffic matrix. After recasting the non-separable nuclear norm into a form amenable to online optimization, a real-time algorithm for dynamic anomalography is developed and its convergence established under simplifying technical assumptions. For operational conditions where computational complexity reductions are at a premium, a lightweight stochastic gradient algorithm based on Nesterov’s acceleration technique is developed as well. Comprehensive numerical tests with both synthetic and real network data corroborate the effectiveness of the proposed online algorithms and their tracking capabilities, and demonstrate that they outperform state-of-the-art approaches developed to diagnose traffic anomalies.

Index Terms—Traffic volume anomalies, online optimization, sparsity, network cartography, low rank.

I. INTRODUCTION

In the backbone of large-scale networks, origin-to-destination (OD) traffic flows experience abrupt unusual changes which can result in congestion, and limit QoS provisioning of the end users. These so-termed traffic volume anomalies could be due to unexpected failures in networking equipment, cyberattacks (e.g., denial of service (DoS) attacks), or intruders which hijack the network services [37]. Unveiling such anomalies in a timely manner is a crucial monitoring task towards engineering network traffic. This is a challenging task however, since the available data are usually high-dimensional, noisy, and possibly incomplete link-load measurements, which are the superposition of unobservable OD

† Manuscript received July 31, 2012; accepted December 1, 2012. First published XXXXXX XX, 20XX; current version published XXXXXX XX, 20XX. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Ery Arias-Castro. Work in this paper was supported by the MURI Grant No. AFOSR FA9550-10-1-0567.

∗ The authors are with the Dept. of ECE and the Digital Technology Center, University of Minnesota, 200 Union Street SE, Minneapolis, MN 55455. Tel/fax: (612)626-7781/625-4583; Emails: {morteza,mate0058,georgios}@umn.edu

flows. Several studies have experimentally demonstrated the low intrinsic dimensionality of the nominal traffic subspace, that is, the intuitive low-rank property of the traffic matrix in the absence of anomalies, which is mainly due to common temporal patterns across OD flows, and periodic behavior across time [22], [43]. Exploiting the low-rank structure of the anomaly-free traffic matrix, a landmark principal component analysis (PCA)-based method was put forth in [21] to identify network anomalies; see also [28] for a distributed implementation. A limitation of the algorithm in [21] is that it cannot identify multiple anomalous flows. Most importantly, [21] has not exploited the sparsity of anomalies across flows and time – anomalous traffic spikes are rare, and tend to last for short periods of time relative to the measurement horizon.

Capitalizing on the low-rank property of the traffic matrix and the sparsity of the anomalies, the fresh look advocated here permeates benefits from rank minimization [9]–[11], and compressive sampling [12], [13], to perform dynamic anomalography. The aim is to construct a map of network anomalies in real time, that offers a succinct depiction of the network ‘health state’ across both the flow and time dimensions (Section II). Different from the batch centralized and distributed anomalography algorithms in [26] and [25], the focus here is on devising online (adaptive) algorithms that are capable of efficiently processing link measurements and tracking network anomalies ‘on the fly’; see also [4] for a ‘model-free’ approach that relies on the kernel recursive LS (RLS) algorithm. Online monitoring algorithms are attractive for operation in dynamic network environments, since they can cope with traffic nonstationarities arising due to routing changes and missing data. Accordingly, the novel online estimator entails an exponentially-weighted least-squares (LS) cost regularized with the sparsity-promoting ℓ1-norm of the anomalies, and the nuclear norm of the nominal traffic matrix. After recasting the non-separable nuclear norm into a form amenable to online optimization (Section III-A), a real-time algorithm for dynamic anomalography is developed in Section IV based on alternating minimization. Each time a new datum is acquired, anomaly estimates are formed via the least-absolute shrinkage and selection operator (Lasso), e.g., [18, p. 68], and the low-rank nominal traffic subspace is refined using RLS [36]. Convergence analysis is provided under simplifying technical assumptions in Section IV-B. For situations where reducing computational complexity is critical, an online stochastic gradient algorithm based on Nesterov’s acceleration technique [6], [30] is developed as well (Section V-A).
The possibility of implementing the anomaly trackers in a distributed fashion is further outlined in Section V-B, where several directions for future research are also delineated.

Extensive numerical tests involving both synthetic and real

IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING (TO APPEAR) 1

network data corroborate the effectiveness of the proposed algorithms in unveiling network anomalies, as well as their tracking capabilities when traffic routes are slowly time-varying, and the network monitoring station acquires incomplete link traffic measurements (Section VI). Different from [42] which employs a two-step batch procedure to learn the nominal traffic subspace first, and then unveil anomalies via ℓ1-norm minimization, the approach here estimates both quantities jointly and attains better performance as illustrated in Section VI-B. Concluding remarks are given in Section VII, while most technical details relevant to the convergence proof in Section IV-C are deferred to the Appendix.

Notation: Bold uppercase (lowercase) letters will denote matrices (column vectors), and calligraphic letters will be used for sets. Operators (·)′, tr(·), λmin(·), σmin(·), [·]_+, and E[·] will denote transposition, matrix trace, minimum eigenvalue, minimum singular value, projection onto the nonnegative orthant, and expectation, respectively; |·| will be used for the cardinality of a set, and the magnitude of a scalar. The positive semidefinite matrix M will be denoted by M ⪰ 0. The ℓp-norm of x ∈ R^n is ‖x‖_p := (∑_{i=1}^n |x_i|^p)^{1/p} for p ≥ 1. For two matrices M, U ∈ R^{n×n}, ⟨M, U⟩ := tr(M′U) denotes their trace inner product. The Frobenius norm of matrix M = [m_{i,j}] ∈ R^{n×p} is ‖M‖_F := √tr(MM′), ‖M‖ := max_{‖x‖_2=1} ‖Mx‖_2 is the spectral norm, ‖M‖_1 := ∑_{i,j} |m_{i,j}| is the ℓ1-norm, and ‖M‖_∗ := ∑_i σ_i(M) is the nuclear norm, where σ_i(M) denotes the i-th singular value of M. The n×n identity matrix will be represented by I_n and its i-th column by e_i, while 0_n will stand for the n×1 vector of all zeros, 0_{n×p} := 0_n 0′_p, and the support set supp(x) := {i : x_i ≠ 0}.
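As a quick sanity check of the norms just defined, the following NumPy snippet evaluates the ℓ1, Frobenius, spectral, and nuclear norms of an arbitrary illustrative matrix (not from the paper), together with the standard orderings among them; it is a sketch for the reader, not part of the paper's algorithms.

```python
import numpy as np

# Hypothetical 4x3 matrix used only to illustrate the norms defined above.
M = np.array([[1.0, 0.0, 2.0],
              [0.0, -3.0, 0.0],
              [4.0, 0.0, 0.0],
              [0.0, 0.0, 5.0]])

ell1 = np.abs(M).sum()                          # ||M||_1: sum of entry magnitudes
fro = np.sqrt(np.trace(M @ M.T))                # ||M||_F = sqrt(tr(M M'))
spec = np.linalg.norm(M, 2)                     # ||M||: largest singular value
nuc = np.linalg.svd(M, compute_uv=False).sum()  # ||M||_*: sum of singular values

assert np.isclose(fro, np.linalg.norm(M, 'fro'))
# Standard orderings: ||M|| <= ||M||_F <= ||M||_* for any matrix.
assert spec <= fro <= nuc
```

Note that the nuclear and Frobenius norms coincide only for rank-one matrices, which is why the nuclear norm serves as a convex surrogate for rank later in the paper.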

II. MODELING PRELIMINARIES AND PROBLEM STATEMENT

Consider a backbone Internet protocol (IP) network naturally modeled as a directed graph G(N, L), where N and L denote the sets of nodes (routers) and physical links of cardinality |N| = N and |L| = L, respectively. The operational goal of the network is to transport a set of OD traffic flows F (with |F| = F) associated with specific source-destination pairs. For backbone networks, the number of network layer flows is much larger than the number of physical links (F ≫ L). Single-path routing is adopted here, that is, a given flow's traffic is carried through multiple links connecting the corresponding source-destination pair along a single path. Let r_{l,f}, l ∈ L, f ∈ F, denote the flow to link assignments (routing), which take the value one whenever flow f is carried over link l, and zero otherwise. Unless otherwise stated, the routing matrix R := [r_{l,f}] ∈ {0, 1}^{L×F} is assumed fixed and given. Likewise, let z_{f,t} denote the unknown traffic rate of flow f at time t, measured in e.g., Mbps. At any given time instant t, the traffic carried over link l is then the superposition of the flow rates routed through link l, i.e., ∑_{f∈F} r_{l,f} z_{f,t}.

It is not uncommon for some of the flow rates to experience unusual abrupt changes. These so-termed traffic volume anomalies are typically due to unexpected network failures, or cyberattacks (e.g., DoS attacks) which aim at compromising the services offered by the network [37]. Let a_{f,t} denote the unknown traffic volume anomaly of flow f at time t. In the presence of anomalous flows, the measured traffic carried by link l over a time horizon t ∈ [1, T] is then given by

y_{l,t} = ∑_{f∈F} r_{l,f} (z_{f,t} + a_{f,t}) + v_{l,t},   t = 1, ..., T   (1)

where the noise variables v_{l,t} account for measurement errors and unmodeled dynamics.

In IP networks, traffic volume can be readily measured on a per-link basis using off-the-shelf tools such as the simple network management protocol (SNMP) supported by most routers. Missing entries in the link-level measurements y_{l,t} may however skew the network operator's perspective. SNMP packets may be dropped for instance, if some links become congested, rendering link count information for those links more important, as well as less available [33]. To model missing link measurements, collect the tuples (l, t) associated with the available observations y_{l,t} in the set Ω ⊆ {1, 2, ..., L} × {1, 2, ..., T}. Introducing the matrices Y := [y_{l,t}], V := [v_{l,t}] ∈ R^{L×T}, and Z := [z_{f,t}], A := [a_{f,t}] ∈ R^{F×T}, the (possibly incomplete) set of measurements in (1) can be expressed in compact matrix form as

P_Ω(Y) = P_Ω(R(Z + A) + V)   (2)

where the sampling operator P_Ω(·) sets the entries of its matrix argument not in Ω to zero, and keeps the rest unchanged. Matrix Z contains the nominal traffic flows over the time horizon of interest. Common temporal patterns among the traffic flows in addition to their periodic behavior, render most rows (respectively columns) of Z linearly dependent, and thus Z typically has low rank. This intuitive property has been extensively validated with real network data; see e.g., [22]. Anomalies in A are expected to occur sporadically over time, and last shortly relative to the (possibly long) measurement interval [1, T]. In addition, only a small fraction of the flows is supposed to be anomalous at any given time instant. This renders the anomaly traffic matrix A sparse across both rows (flows) and columns (time).

Given measurements P_Ω(Y) adhering to (2) and the binary-valued routing matrix R, the main goal of this paper is to accurately estimate the anomaly matrix A, by capitalizing on the sparsity of A and the low-rank property of Z. Special focus will be placed on devising online (adaptive) algorithms that are capable of efficiently processing link measurements and tracking network anomalies in real time. This critical monitoring task is termed dynamic anomalography, and the resultant estimated map Â offers a depiction of the network's ‘health state’ along both the flow and time dimensions. If |â_{f,t}| > 0, the f-th flow at time t is deemed anomalous, otherwise it is healthy. By examining R the network operator can immediately determine the links carrying the anomalous flows. Subsequently, planned contingency measures involving traffic-engineering algorithms can be implemented to address network congestion.
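To make the data model concrete, here is a small synthetic sketch of (2) in NumPy. All dimensions, densities, and magnitudes below are illustrative choices, not values from the paper: Z is built with rank ρ = 3, A has a handful of large spikes, and roughly 10% of the link counts are treated as missing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: L links, F flows, T time slots, intrinsic rank rho.
L, F, T, rho = 20, 50, 100, 3

R = (rng.random((L, F)) < 0.2).astype(float)   # binary routing matrix R in {0,1}^{L x F}
Z = rng.standard_normal((F, rho)) @ rng.standard_normal((rho, T))  # low-rank nominal traffic
A = np.zeros((F, T))                            # anomaly matrix: sparse across flows and time
spikes = rng.random((F, T)) < 0.005             # anomalies are rare...
A[spikes] = 10.0                                # ...but large when present
V = 0.01 * rng.standard_normal((L, T))          # measurement noise

Y = R @ (Z + A) + V                             # link loads, model (2)
Omega = rng.random((L, T)) < 0.9                # ~90% of link counts observed
P_Omega_Y = np.where(Omega, Y, 0.0)             # sampling operator zeroes unobserved entries

assert np.linalg.matrix_rank(Z) <= rho
```

The estimators discussed next only ever see `P_Omega_Y`, `Omega`, and `R`; the factors `Z` and `A` play the role of the unknown ground truth.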

III. UNVEILING ANOMALIES VIA SPARSITY AND LOW RANK

Consider the nominal link-count traffic matrix X := RZ, which inherits the low-rank property from Z. Since the primary goal is to recover A, the following observation model

P_Ω(Y) = P_Ω(X + RA + V)   (3)


can be adopted instead of (2). A natural estimator leveraging the low rank property of X and the sparsity of A will be sought next. The idea is to fit the incomplete data P_Ω(Y) to the model X + RA in the least-squares (LS) error sense, as well as minimize the rank of X, and the number of nonzero entries of A measured by its ℓ0-(pseudo) norm. Unfortunately, albeit natural both rank and ℓ0-norm criteria are in general NP-hard to optimize [17], [29]. Typically, the nuclear norm ‖X‖_∗ and the ℓ1-norm ‖A‖_1 are adopted as surrogates, since they are the closest convex approximants to rank(X) and ‖A‖_0, respectively [12], [31], [38]. Accordingly, one solves

(P1)   min_{X,A}  (1/2)‖P_Ω(Y − X − RA)‖²_F + λ_∗‖X‖_∗ + λ_1‖A‖_1   (4)

where λ_∗, λ_1 ≥ 0 are rank- and sparsity-controlling parameters. When an estimate σ²_v of the noise variance is available, guidelines for selecting λ_∗ and λ_1 have been proposed in [44].

Being convex (P1) is appealing, and it yields reliable performance when full data are available, i.e., when Ω contains all (l, t) pairs [26]. In the presence of missing data, one has to ensure that the sampled subset of links provides sufficient information to identify anomalous flows. Intuitively, for high estimation accuracy each flow must traverse sufficiently many links, whereas network links should not be overloaded by too many flows. These properties typically hold for large-scale networks with distant OD node pairs, and routing paths that are sufficiently ‘spread-out.’ Developing identifiability conditions when link measurements are incomplete is an open problem, and constitutes an interesting future research direction.

Model (3) and its estimator (P1) are quite general, as discussed in the ensuing remark.

Remark 1 (Subsumed paradigms): When there is no missing data and X = 0_{L×T}, one is left with an under-determined sparse signal recovery problem typically encountered with compressive sampling (CS); see e.g., [12] and the tutorial account [13]. The decomposition Y = X + A corresponds to principal component pursuit (PCP), also referred to as robust principal component analysis (PCA) [9], [14]. PCP was adopted for network anomaly detection using flow (not link traffic) measurements in [2]. For the idealized noise-free setting (V = 0_{L×T}), sufficient conditions for exact recovery are available for both of the aforementioned special cases [9], [12], [14]. However, the superposition of a low-rank plus a compressed sparse matrix in (3) further challenges identifiability of {X, A}; see [26] for early results. Going back to the CS paradigm, even when X is nonzero one could envision a variant where the measurements are corrupted with correlated (low-rank) noise [15]. Last but not least, when A = 0_{F×T} and Y is noisy, the recovery of X subject to a rank constraint is nothing but PCA – arguably, the workhorse of high-dimensional data analytics. This same formulation is adopted for low-rank matrix completion, to impute the missing entries of a low-rank matrix observed in noise, i.e., P_Ω(Y) = P_Ω(X + V) [10].

Albeit convex, (P1) is a non-smooth optimization problem (both the nuclear and ℓ1-norms are not differentiable at the origin). In addition, scalable algorithms to unveil anomalies in large-scale networks should effectively overcome the following challenges: (c1) the problem size can easily become quite large, since the number of optimization variables is (L + F)T; (c2) existing iterative solvers for (P1) typically rely on costly SVD computations per iteration; see e.g., [26]; and (c3) different from the Frobenius and ℓ1-norms, (columnwise) nonseparability of the nuclear-norm challenges online processing when new columns of P_Ω(Y) arrive sequentially in time. In the remainder of this section, the ‘big data’ challenges (c1) and (c2) are dealt with to arrive at an efficient batch algorithm for anomalography. Tracking network anomalies is the main subject of Section IV.
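Challenge (c2) can be made concrete with a minimal proximal-gradient sketch for (P1). This is a generic baseline, not the paper's algorithm: the ℓ1 term is handled by entrywise soft-thresholding, while the nuclear-norm term requires singular-value thresholding, i.e., one full SVD at every iteration. All function names, the iteration count, and the step-size rule are illustrative assumptions.

```python
import numpy as np

def svt(M, tau):
    """Singular-value thresholding: proximal operator of tau*||.||_* (one full SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    """Entrywise soft-thresholding: proximal operator of tau*||.||_1."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def solve_p1(Y, mask, R, lam_star, lam1, n_iter=200):
    """Generic proximal-gradient sketch for (P1); hyperparameters are illustrative."""
    L_dim, T = Y.shape
    F = R.shape[1]
    X = np.zeros((L_dim, T))
    A = np.zeros((F, T))
    eta = 1.0 / (1.0 + np.linalg.norm(R, 2) ** 2)  # step <= 1/Lipschitz of the smooth part
    for _ in range(n_iter):
        resid = mask * (Y - X - R @ A)   # both partial gradients share this residual
        X = svt(X + eta * resid, eta * lam_star)        # the costly SVD per iteration, cf. (c2)
        A = soft(A + eta * (R.T @ resid), eta * lam1)   # cheap entrywise update
    return X, A
```

Every call to `svt` costs a full SVD of an L×T matrix, which is exactly the per-iteration burden that the factorized reformulation developed next is designed to avoid.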

To address (c1) and reduce the computational complexity and memory storage requirements of the algorithms sought, it is henceforth assumed that an upper bound ρ ≥ rank(X̂) is a priori available [X̂ is the estimate obtained via (P1)]. As argued next, the smaller the value of ρ, the more efficient the algorithm becomes. Small values of ρ are well motivated due to the low intrinsic dimensionality of network flows. For instance, experiments with Internet-2 network data [1] show that ρ = 5 suffices [21]; see also [22]. Because rank(X̂) ≤ ρ, (P1)'s search space is effectively reduced and one can factorize the decision variable as X = PQ′, where P and Q are L × ρ and T × ρ matrices, respectively. It is possible to interpret the columns of X (viewed as points in R^L) as belonging to a low-rank ‘nominal traffic subspace’, spanned by the columns of P. The rows of Q are thus the projections of the columns of X onto the traffic subspace.

Adopting this reparametrization of X in (P1), and defining r(P, Q, A) := (1/2)‖P_Ω(Y − PQ′ − RA)‖²_F, one arrives at an equivalent optimization problem

(P2)   min_{P,Q,A}  r(P, Q, A) + λ_∗‖PQ′‖_∗ + λ_1‖A‖_1

which is non-convex due to the bilinear terms PQ′. The number of variables is reduced from (L + F)T in (P1), to ρ(L + T) + FT in (P2). The savings can be significant when ρ is small, and both L and T are large. Note that the dominant FT-term in the variable count of (P2) is due to A, which is sparse and can be efficiently handled even when both F and T are large.

A. A separable low-rank regularization

To address (c2) [along with (c3) as it will become clear in Section IV], consider the following alternative characterization of the nuclear norm [31], [32]

‖X‖_∗ := min_{P,Q} (1/2){‖P‖²_F + ‖Q‖²_F},  s. t.  X = PQ′.   (5)

The optimization (5) is over all possible bilinear factorizations of X, so that the number of columns ρ of P and Q is also a variable. Leveraging (5), the following reformulation of (P2) provides an important first step towards obtaining an online algorithm:

(P3)   min_{P,Q,A}  r(P, Q, A) + (λ_∗/2){‖P‖²_F + ‖Q‖²_F} + λ_1‖A‖_1.

As asserted in [25, Lemma 1], adopting the separable Frobenius-norm regularization in (P3) comes with no loss of optimality relative to (P1), provided ρ ≥ rank(X̂). By finding the global minimum of (P3) [which could have considerably fewer variables than (P1)], one can recover the optimal solution of (P1). However, since (P3) is non-convex, it may have stationary points which need not be globally optimum. Interestingly, the next proposition shows that under relatively mild assumptions on rank(X̂) and the noise variance, every stationary point of (P3) is globally optimum for (P1). For a proof, see [25, App. A].

Proposition 1: Let {P̄, Q̄, Ā} be a stationary point of (P3). If ‖P_Ω(Y − P̄Q̄′ − RĀ)‖ ≤ λ_∗, then {X̂ := P̄Q̄′, Â = Ā} is the globally optimal solution of (P1).

The qualification condition ‖P_Ω(Y − P̄Q̄′ − RĀ)‖ ≤ λ_∗ captures tacitly the role of ρ. In particular, for sufficiently small ρ the residual ‖P_Ω(Y − P̄Q̄′ − RĀ)‖ becomes large and consequently the condition is violated [unless λ_∗ is large enough, in which case a sufficiently low-rank solution to (P1) is expected]. The condition on the residual also implicitly enforces rank(X̂) ≤ ρ, which is necessary for the equivalence between (P1) and (P3). Note also that selecting a large value of ρ does not ensure satisfaction of the condition in Proposition 1. In fact, other factors such as the noise variance and routing matrix structure are critical as well.

B. Batch block coordinate-descent algorithm

The block coordinate-descent (BCD) algorithm is adopted here to solve the batch non-convex optimization problem (P3). BCD is an iterative method which has been shown efficient in tackling large-scale optimization problems encountered with various statistical inference tasks, see e.g., [7]. The proposed solver entails an iterative procedure comprising three steps per iteration k = 1, 2, . . .

[S1] Update the anomaly map:

A[k + 1] = arg min_A [ r(P[k], Q[k], A) + λ_1‖A‖_1 ].

[S2] Update the nominal traffic subspace:

P[k + 1] = arg min_P [ r(P, Q[k], A[k + 1]) + (λ_∗/2)‖P‖²_F ].

[S3] Update the projection coefficients:

Q[k + 1] = arg min_Q [ r(P[k + 1], Q, A[k + 1]) + (λ_∗/2)‖Q‖²_F ].

To update each of the variable groups, the cost of (P3) is minimized while fixing the rest of the variables to their most up-to-date values. The minimization in [S1] decomposes over the columns of A := [a_1, . . . , a_T]. At iteration k, these columns are updated in parallel via Lasso

a_t[k + 1] = arg min_a [ (1/2)‖Ω_t(y_t − P[k]q_t[k] − Ra)‖²_2 + λ_1‖a‖_1 ],   t = 1, . . . , T   (6)

where y_t and q_t[k] respectively denote the t-th column of Y and Q′[k], while the diagonal matrix Ω_t ∈ R^{L×L} contains a one on its l-th diagonal entry if y_{l,t} is observed, and a zero otherwise. To keep computational complexity at a minimum, in practice each iteration of the proposed algorithm minimizes (6) inexactly. This is achieved for each t = 1, . . . , T, by performing a single pass of the cyclic coordinate-descent

Algorithm 1: Batch BCD algorithm for unveiling network anomalies

input P_Ω(Y), Ω, R, λ_∗, and λ_1.
initialize P[1] and Q[1] at random.
for k = 1, 2, . . . do
  [S1] Update the anomaly map:
  for f = 1, . . . , F do
    y_t^{(−f)}[k + 1] = Ω_t(y_t − P[k]q_t[k] − ∑_{f′=1}^{f−1} r_{f′} a_{f′,t}[k + 1] − ∑_{f′=f+1}^{F} r_{f′} a_{f′,t}[k]),  t = 1, . . . , T.
    a_{f,t}[k + 1] = sign(r′_f y_t^{(−f)}[k + 1]) [ |r′_f y_t^{(−f)}[k + 1]| − λ_1 ]_+ × ‖Ω_t r_f‖_2^{−2},  t = 1, . . . , T.
  end for
  A[k + 1] = [[a_{1,1}[k + 1], . . . , a_{F,1}[k + 1]]′, . . . , [a_{1,T}[k + 1], . . . , a_{F,T}[k + 1]]′].
  [S2] Update the nominal traffic subspace:
  p_l[k + 1] = (λ_∗ I_ρ + Q′[k] Ω^r_l Q[k])^{−1} Q′[k] Ω^r_l (y^r_l − A′[k + 1] r^r_l),  l = 1, . . . , L.
  P[k + 1] = [p_1[k + 1], . . . , p_L[k + 1]]′.
  [S3] Update the projection coefficients:
  q_t[k + 1] = (λ_∗ I_ρ + P′[k + 1] Ω_t P[k + 1])^{−1} P′[k + 1] Ω_t (y_t − Ra_t[k + 1]),  t = 1, . . . , T.
  Q[k + 1] = [q_1[k + 1], . . . , q_T[k + 1]]′.
end for
return Â := A[∞] and X̂ := P[∞]Q′[∞].

algorithm in [18, p. 92] over each one of the F scalar entries in a_t[k + 1]; see Algorithm 1 for the resulting iterations, and Appendix A for further details. As shown at the end of this section, this inexact minimization suffices to claim convergence to a stationary point of (P3).

Similarly, in [S2] and [S3] the minimizations that give rise to P[k + 1] and Q[k + 1] are separable over their respective rows. For instance, the l-th row p′_l of the traffic subspace matrix P := [p_1, . . . , p_L]′ is updated as the solution of the following ridge-regression problem

p_l[k + 1] = arg min_p [ (1/2)‖((y^r_l)′ − p′Q′[k] − (r^r_l)′A[k + 1]) Ω^r_l‖²_2 + (λ_∗/2)‖p‖²_2 ]   (7)

where (y^r_l)′ and (r^r_l)′ represent the l-th row of Y and R, respectively. The t-th diagonal entry of the diagonal matrix Ω^r_l ∈ R^{T×T} is an indicator variable testing whether measurement y_{l,t} is available. Because (7) is an unconstrained convex quadratic program, the first-order optimality condition yields the closed-form solution tabulated under Algorithm 1. A similar regularized LS problem yields q_t[k + 1], t = 1, . . . , T; see Algorithm 1 for the details and a description of the overall BCD solver. The novel batch scheme for unveiling network anomalies is less complex computationally than the accelerated proximal gradient algorithm in [26], since Algorithm 1's iterations are devoid of SVD computations. Different from [26], Algorithm 1 can also accommodate missing link measurements.
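For concreteness, one BCD pass of the three steps above can be sketched in NumPy as follows. For brevity the sketch assumes all link counts are observed (every Ω_t is the identity), so the per-row and per-column closed forms collapse into single matrix updates; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def soft(x, tau):
    """Entrywise soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def bcd_iteration(Y, R, P, Q, A, lam_star, lam1):
    """One pass of steps [S1]-[S3], assuming all link counts are observed."""
    A = A.copy()
    F = R.shape[1]
    rho = P.shape[1]
    # [S1] single cyclic coordinate-descent pass over the rows (flows) of A.
    E = Y - P @ Q.T - R @ A              # current fitting residual
    for f in range(F):
        rf = R[:, f]
        nf = rf @ rf                     # ||r_f||_2^2
        if nf == 0:
            continue                     # flow f traverses no links; nothing to update
        E += np.outer(rf, A[f])          # remove flow f's contribution from the residual
        A[f] = soft(rf @ E, lam1) / nf   # scalar Lasso update for each time slot
        E -= np.outer(rf, A[f])          # restore with the refreshed coefficients
    # [S2] ridge-regression update of the subspace P [closed form of (7)].
    B = Y - R @ A
    P = B @ Q @ np.linalg.inv(lam_star * np.eye(rho) + Q.T @ Q)
    # [S3] ridge-regression update of the projection coefficients Q.
    Q = B.T @ P @ np.linalg.inv(lam_star * np.eye(rho) + P.T @ P)
    return P, Q, A
```

Since each scalar coordinate update in [S1] exactly minimizes the (P3) cost over that entry, and [S2]–[S3] are exact ridge solutions, the objective is non-increasing across such passes, which is the essential ingredient behind the convergence claim below.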

Despite being non-convex and non-differentiable, (P3) has favorable structure which facilitates convergence of the iterates generated by Algorithm 1. Specifically, the resulting cost is convex in each block variable when the rest are fixed. The non-smooth ℓ1-norm is also separable over the entries of its matrix argument. Accordingly, [39, Theorem 5.1] guarantees convergence of the BCD algorithm to a stationary point of (P3). This result together with Proposition 1 establishes the next claim.

Proposition 2: If a subsequence {X[k] := P[k]Q′[k], A[k]} of iterates generated by Algorithm 1 satisfies ‖P_Ω(Y − X[k] − RA[k])‖ ≤ λ_∗, then it converges to the optimal solution set of (P1) as k → ∞.

In practice, it is desirable to monitor anomalies in real time and accommodate time-varying traffic routes. These reasons motivate devising algorithms for dynamic anomalography, the subject dealt with next.

IV. DYNAMIC ANOMALOGRAPHY

Monitoring of large-scale IP networks necessitates collecting massive amounts of data which far outweigh the ability of modern computers to store and analyze them in real time. In addition, nonstationarities due to routing changes and missing data further challenge identification of anomalies. In dynamic networks, routing tables are constantly readjusted to effect traffic load balancing and avoid congestion caused by, e.g., traffic anomalies or network infrastructure failures. To account for slowly time-varying routing tables, let R_t ∈ R^{L×F} denote the routing matrix at time t.¹ In this dynamic setting, the partially observed link counts at time t adhere to [cf. (3)]

$$\mathcal{P}_{\Omega_t}(\mathbf{y}_t)=\mathcal{P}_{\Omega_t}(\mathbf{x}_t+\mathbf{R}_t\mathbf{a}_t+\mathbf{v}_t),\quad t=1,2,\ldots\quad(8)$$

where the link-level traffic x_t := R_t z_t, for z_t from the (low-dimensional) traffic subspace. In general, routing changes may alter a link load considerably by, e.g., routing traffic completely away from a specific link. Therefore, even though the network-level traffic vectors z_t live in a low-dimensional subspace, the same may not be true for the link-level traffic x_t when the routing updates are major and frequent. In backbone networks however, routing changes are sporadic relative to the time-scale of data acquisition used for network monitoring tasks. For instance, data collected from the operation of the Internet-2 network reveals that only a few rows of R_t change per week [1]. It is thus safe to assume that x_t still lies in a low-dimensional subspace, and exploit the temporal correlations of the observations to identify the anomalies.

On top of the previous arguments, in practice link measurements are acquired sequentially in time, which motivates updating previously obtained estimates rather than re-computing new ones from scratch each time a new datum becomes available. The goal is then to recursively estimate {x_t, a_t} at time t from the historical observations {P_{Ω_τ}(y_τ), Ω_τ}_{τ=1}^t, naturally placing more importance on recent measurements. To this end, one possible adaptive counterpart to (P3) is the exponentially-weighted LS estimator found by minimizing the empirical cost

$$\min_{\{\mathbf{P},\mathbf{Q},\mathbf{A}\}}\sum_{\tau=1}^{t}\beta^{t-\tau}\left[\frac{1}{2}\left\|\mathcal{P}_{\Omega_\tau}(\mathbf{y}_\tau-\mathbf{P}\mathbf{q}_\tau-\mathbf{R}_\tau\mathbf{a}_\tau)\right\|_2^2+\frac{\lambda_*}{2\sum_{u=1}^{t}\beta^{t-u}}\|\mathbf{P}\|_F^2+\frac{\lambda_*}{2}\|\mathbf{q}_\tau\|_2^2+\lambda_1\|\mathbf{a}_\tau\|_1\right]\quad(9)$$

¹Fixed-size routing matrices R_t are considered here for convenience, where L and F correspond to upper bounds on the number of physical links and flows transported by the network, respectively. If at time t some links are not used at all, or fewer than F flows are present, the corresponding rows and columns of R_t will be identically zero.

Fig. 1. Internet-2 network topology graph.

in which 0 < β ≤ 1 is the so-termed forgetting factor. When β < 1, data in the distant past are exponentially downweighted, which facilitates tracking network anomalies in nonstationary environments. In the case of static routing (R_t = R, t = 1, 2, ...) and infinite memory (β = 1), the formulation (9) coincides with the batch estimator (P3). This is the reason for the time-varying factor weighting ‖P‖_F².

A. Tracking network anomalies

Towards deriving a real-time, computationally efficient, and recursive solver of (9), an alternating minimization method is adopted in which iteration k coincides with the time scale t of data acquisition. A justification in terms of minimizing a suitable approximate cost function is discussed in detail in Section IV-B. Per time instant t, a new datum {P_{Ω_t}(y_t), Ω_t} is drawn and q_t, a_t are jointly estimated via

$$\{\mathbf{q}[t],\mathbf{a}[t]\}=\arg\min_{\mathbf{q},\mathbf{a}}\left[\frac{1}{2}\|\mathcal{P}_{\Omega_t}(\mathbf{y}_t-\mathbf{P}[t-1]\mathbf{q}-\mathbf{R}_t\mathbf{a})\|_2^2+\frac{\lambda_*}{2}\|\mathbf{q}\|_2^2+\lambda_1\|\mathbf{a}\|_1\right].\quad(10)$$

It turns out that (10) can be efficiently solved. Fixing a to carry out the minimization with respect to q first, one is left with an ℓ2-norm regularized LS (ridge-regression) problem

$$\mathbf{q}[t]=\arg\min_{\mathbf{q}}\left[\frac{1}{2}\|\mathcal{P}_{\Omega_t}(\mathbf{y}_t-\mathbf{P}[t-1]\mathbf{q}-\mathbf{R}_t\mathbf{a})\|_2^2+\frac{\lambda_*}{2}\|\mathbf{q}\|_2^2\right]$$
$$=(\lambda_*\mathbf{I}_\rho+\mathbf{P}'[t-1]\boldsymbol{\Omega}_t\mathbf{P}[t-1])^{-1}\mathbf{P}'[t-1]\,\mathcal{P}_{\Omega_t}(\mathbf{y}_t-\mathbf{R}_t\mathbf{a}).\quad(11)$$

Note that q[t] is an affine function of a, and the update rule for q[t] is not well defined until a is replaced with a[t]. Towards obtaining an expression for a[t], define D[t] := (λ* I_ρ + P'[t−1]Ω_t P[t−1])^{−1} P'[t−1] for notational convenience, and substitute (11) back into (10) to arrive at the


Lasso estimator

$$\mathbf{a}[t]=\arg\min_{\mathbf{a}}\left[\frac{1}{2}\|\mathbf{F}[t](\mathbf{y}_t-\mathbf{R}_t\mathbf{a})\|_2^2+\lambda_1\|\mathbf{a}\|_1\right]\quad(12)$$

where F[t] := [Ω_t − Ω_t P[t−1]D[t]Ω_t, √λ* Ω_t D'[t]]'. The diagonal matrix Ω_t was defined in Section III-B; see the discussion after (6).
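As a concrete illustration, the next Python sketch (helper names are assumptions, not from the paper) forms D[t] and F[t] as defined above and solves the Lasso (12) with plain iterative soft-thresholding (ISTA), a simple stand-in for whatever Lasso solver one prefers; the closed-form q[t] of (11) is then recovered:

```python
import numpy as np

def soft(z, tau):
    """Entrywise soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def anomaly_update(y, R, P, omega, lam_star, lam1, n_iter=500):
    """Sketch of one time step: build D[t], F[t], solve (12) by ISTA, then (11)."""
    L, rho = P.shape
    Om = np.diag(omega)                     # Omega_t as an L x L matrix
    D = np.linalg.solve(lam_star * np.eye(rho) + P.T @ Om @ P, P.T)  # D[t]
    F = np.vstack([Om - Om @ P @ D @ Om,    # top block of F[t]
                   np.sqrt(lam_star) * D @ Om])
    G = F @ R
    step = 1.0 / np.linalg.norm(G, 2)**2    # 1 / Lipschitz constant of the gradient
    a = np.zeros(R.shape[1])
    for _ in range(n_iter):
        grad = G.T @ (G @ a - F @ y)        # gradient of the smooth part of (12)
        a = soft(a - step * grad, step * lam1)
    q = D @ Om @ (y - R @ a)                # closed-form q[t] from (11)
    return a, q
```

Any off-the-shelf Lasso routine could replace the ISTA loop; the point is that the whole per-time-step update reduces to one small ρ×ρ inversion plus a sparse regression.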

In the second step of the alternating-minimization scheme, the updated subspace matrix P[t] is obtained by minimizing (9) with respect to P, while the optimization variables {q_τ, a_τ}_{τ=1}^t are fixed and take the values {q[τ], a[τ]}_{τ=1}^t. This yields

$$\mathbf{P}[t]=\arg\min_{\mathbf{P}}\left[\sum_{\tau=1}^{t}\beta^{t-\tau}\frac{1}{2}\|\mathcal{P}_{\Omega_\tau}(\mathbf{y}_\tau-\mathbf{P}\mathbf{q}[\tau]-\mathbf{R}_\tau\mathbf{a}[\tau])\|_2^2+\frac{\lambda_*}{2}\|\mathbf{P}\|_F^2\right].\quad(13)$$

Similar to the batch case, (13) decouples over the rows of P, which are obtained in parallel via

$$\mathbf{p}_l[t]=\arg\min_{\mathbf{p}}\left[\sum_{\tau=1}^{t}\beta^{t-\tau}\omega_{l,\tau}(y_{l,\tau}-\mathbf{p}'\mathbf{q}[\tau]-\mathbf{r}'_{l,\tau}\mathbf{a}[\tau])^2+\frac{\lambda_*}{2}\|\mathbf{p}\|_2^2\right],\quad l=1,\ldots,L\quad(14)$$

where ω_{l,τ} denotes the l-th diagonal entry of Ω_τ. For β = 1, subproblems (14) can be efficiently solved using the RLS algorithm [36]. Upon defining s_l[t] := Σ_{τ=1}^t β^{t−τ} ω_{l,τ}(y_{l,τ} − r'_{l,τ}a[τ])q[τ], H_l[t] := Σ_{τ=1}^t β^{t−τ} ω_{l,τ} q[τ]q'[τ] + λ* I_ρ, and M_l[t] := H_l^{−1}[t], with β = 1 one simply updates

$$\mathbf{s}_l[t]=\mathbf{s}_l[t-1]+\omega_{l,t}(y_{l,t}-\mathbf{r}'_{l,t}\mathbf{a}[t])\mathbf{q}[t]$$
$$\mathbf{M}_l[t]=\mathbf{M}_l[t-1]-\frac{\omega_{l,t}\mathbf{M}_l[t-1]\mathbf{q}[t]\mathbf{q}'[t]\mathbf{M}_l[t-1]}{1+\mathbf{q}'[t]\mathbf{M}_l[t-1]\mathbf{q}[t]}$$

and forms p_l[t] = M_l[t]s_l[t], for l = 1, ..., L.

However, for 0 < β < 1 the regularization term (λ*/2)‖p‖² in (14) makes it impossible to express H_l[t] in terms of H_l[t−1] plus a rank-one correction. Hence, one cannot resort to the matrix inversion lemma and update M_l[t] with quadratic complexity only. Based on direct inversion of H_l[t], l = 1, ..., L, the overall recursive algorithm for tracking network anomalies is tabulated under Algorithm 2. The per-iteration cost of the L inversions (each O(ρ³), which could be further reduced by also leveraging the symmetry of H_l[t]) is affordable for a moderate number of links, because ρ is small when estimating low-rank traffic matrices. Still, for those settings where computational complexity reductions are at a premium, an online stochastic gradient descent algorithm is described in Section V-A.

Remark 2 (Robust subspace trackers): Algorithm 2 is closely related to timely robust subspace trackers, which aim at estimating a low-rank subspace P from grossly corrupted and possibly incomplete data, namely P_{Ω_t}(y_t) = P_{Ω_t}(Pq_t + a_t + v_t), t = 1, 2, .... In the absence of sparse 'outliers' {a_t}_{t=1}^∞, an online algorithm based on incremental gradient descent on the Grassmannian manifold of subspaces was put forth in [5]. The second-order

Algorithm 2: Online algorithm for tracking network anomalies

input {P_{Ω_t}(y_t), Ω_t, R_t}_{t=1}^∞, β, λ*, and λ1.
initialize G_l[0] = 0_{ρ×ρ}, s_l[0] = 0_ρ, l = 1, ..., L, and P[0] at random.
for t = 1, 2, ... do
  D[t] = (λ* I_ρ + P'[t−1]Ω_t P[t−1])^{−1} P'[t−1].
  F[t] = [Ω_t − Ω_t P[t−1]D[t]Ω_t, √λ* Ω_t D'[t]]'.
  a[t] = argmin_a [(1/2)‖F[t](y_t − R_t a)‖_2² + λ1‖a‖_1].
  q[t] = D[t]Ω_t(y_t − R_t a[t]).
  G_l[t] = β G_l[t−1] + ω_{l,t} q[t]q'[t], l = 1, ..., L.
  s_l[t] = β s_l[t−1] + ω_{l,t}(y_{l,t} − r'_{l,t}a[t])q[t], l = 1, ..., L.
  p_l[t] = (G_l[t] + λ* I_ρ)^{−1} s_l[t], l = 1, ..., L.
  return a_t := a[t] and x_t := P[t]q[t].
end for
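The per-link recursions of Algorithm 2 (the G_l, s_l, and p_l steps) can be sketched in Python as follows; array names are illustrative, and the subspace refresh is written as an explicit loop over links for clarity:

```python
import numpy as np

def subspace_update(G, s, P, y, a, q, R, omega, beta, lam_star):
    """Per-link recursions of Algorithm 2 (illustrative names, not from the paper).

    G : (L, rho, rho) running matrices G_l[t-1]; updated in place.
    s : (L, rho)      running vectors  s_l[t-1]; updated in place.
    P : (L, rho)      subspace estimate whose rows p_l[t] are refreshed.
    omega : (L,) 0/1 indicators of which link counts were observed at time t.
    """
    L, rho = P.shape
    resid = y - R @ a                 # y_{l,t} - r_l' a[t], all links at once
    for l in range(L):
        G[l] = beta * G[l] + omega[l] * np.outer(q, q)
        s[l] = beta * s[l] + omega[l] * resid[l] * q
        P[l] = np.linalg.solve(G[l] + lam_star * np.eye(rho), s[l])
    return P
```

Each row update is a ρ×ρ solve, which is the O(ρ³)-per-link cost quoted above.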

RLS-type algorithm in [16] extends the seminal projection approximation subspace tracking algorithm [41] to handle missing data. When outliers are present, robust counterparts can be found in [15], [19], [27]. Relative to all aforementioned works, the estimation problem here is more challenging due to the presence of the fat (compression) matrix R_t; see [26] for fundamental identifiability issues related to the model (3).

B. Convergence Analysis

This section studies the convergence of the iterates generated by Algorithm 2 for the infinite-memory special case, i.e., when β = 1. Upon defining the function

$$g_t(\mathbf{P},\mathbf{q},\mathbf{a}):=\frac{1}{2}\|\mathcal{P}_{\Omega_t}(\mathbf{y}_t-\mathbf{P}\mathbf{q}-\mathbf{R}_t\mathbf{a})\|_2^2+\frac{\lambda_*}{2}\|\mathbf{q}\|_2^2+\lambda_1\|\mathbf{a}\|_1\quad(15)$$

in addition to ℓ_t(P) := min_{q,a} g_t(P, q, a), when β = 1 Algorithm 2 aims at minimizing the following average cost function at time t

$$C_t(\mathbf{P}):=\frac{1}{t}\sum_{\tau=1}^{t}\ell_\tau(\mathbf{P})+\frac{\lambda_*}{2t}\|\mathbf{P}\|_F^2.\quad(16)$$

Normalization (by t) ensures that the cost function does not grow unbounded as time evolves. For fixed routing {R_τ = R}_{τ=1}^t, (16) is essentially identical to the batch estimator (P3) up to a scaling, which does not affect the value of the minimizers. Note that as time evolves, minimization of C_t becomes increasingly complex computationally. Even evaluating C_t is challenging for large t, since it entails solving t Lasso problems to minimize all g_τ and define the functions ℓ_τ, τ = 1, ..., t. Hence, at time t the subspace estimate P[t] is obtained by minimizing the approximate cost function [cf. (13) when β = 1]

$$\hat{C}_t(\mathbf{P})=\frac{1}{t}\sum_{\tau=1}^{t}g_\tau(\mathbf{P},\mathbf{q}[\tau],\mathbf{a}[\tau])+\frac{\lambda_*}{2t}\|\mathbf{P}\|_F^2\quad(17)$$

in which q[t], a[t] are obtained based on the prior subspace estimate P[t−1] after solving [cf. (10)]

$$\{\mathbf{q}[t],\mathbf{a}[t]\}=\arg\min_{\mathbf{q},\mathbf{a}}\;g_t(\mathbf{P}[t-1],\mathbf{q},\mathbf{a}).\quad(18)$$

Obtaining q[t] this way resembles the projection approximation adopted in [41], and can only be evaluated after a[t] is obtained


[cf. (11)]. Since Ĉ_t(P) is a smooth convex function, the minimizer P[t] = argmin_P Ĉ_t(P) is the solution of the quadratic equation ∇Ĉ_t(P[t]) = 0_{L×ρ}.

So far, it is apparent that the approximate cost function Ĉ_t(P[t]) overestimates the target cost C_t(P[t]), for t = 1, 2, .... However, it is not clear whether the iterates {P[t]}_{t=1}^∞ converge, and most importantly, how well they can optimize the target cost function C_t. The good news is that Ĉ_t(P[t]) asymptotically approaches C_t(P[t]), and the subspace iterates null ∇C_t(P[t]) as well, both as t → ∞. The latter result is summarized in the next proposition, which is proved in the next section.

Proposition 3: Assume that: a1) {Ω_t}_{t=1}^∞ and {y_t}_{t=1}^∞ are independent and identically distributed (i.i.d.) random processes; a2) ‖P_{Ω_t}(y_t)‖_∞ is uniformly bounded; a3) the iterates {P[t]}_{t=1}^∞ are in a compact set L ⊂ R^{L×ρ}; a4) Ĉ_t(P) is positive definite, namely λ_min[∇²Ĉ_t(P)] ≥ c for some c > 0; and a5) σ_min(S[t]) ≥ c_0, where the matrix S[t] ∈ R^{(L+ρ)×|supp(a[t])|} contains the columns of F[t]R_t associated with the elements in supp(a[t]), and c_0 is a positive constant. Then lim_{t→∞} ∇C_t(P[t]) = 0_{L×ρ} almost surely (a.s.), which implies that the subspace iterates {P[t]}_{t=1}^∞ asymptotically coincide with the stationary points of (P3) when the routing remains invariant, i.e., when R_t = R, t = 1, 2, ....

To clearly delineate the scope of the analysis, it is worth commenting on the assumptions a1)-a5) and the factors that influence their satisfaction. Regarding a1), the acquired data is assumed statistically independent across time, as is customary when studying the stability and performance of online (adaptive) algorithms [35], [36]. While independence is required for tractability, a1) may be grossly violated because OD flows are correlated across time (cf. the low-rank property of Z and X). Still, in accordance with the adaptive filtering folklore, e.g., [35, p. 321], as β → 1 the upshot of the analysis based on i.i.d. data extends accurately to the pragmatic setting whereby the link counts and missing data patterns exhibit spatiotemporal correlations. Uniform boundedness of P_{Ω_t}(y_t) [cf. a2)] is satisfied in practice, since the traffic is always limited by the (finite) capacity of the physical links. The bounded subspace requirement in a3) is a technical assumption that simplifies the arguments of the ensuing proof, and has been corroborated via extensive computer simulations including those in Section VI. It is apparent that the sampling set Ω_t plays a key role towards ensuring that a4) and a5) are satisfied. Intuitively, if the missing entries tend to be only few and somehow uniformly distributed across links and time, they will not markedly increase the coherence of the regression matrices F[t]R_t, and thus compromise the uniqueness of the Lasso solutions. This also increases the likelihood that ∇²Ĉ_t(P) = (λ*/t) I_{Lρ} + (1/t)Σ_{τ=1}^t (q[τ]q'[τ]) ⊗ Ω_τ ⪰ c I_{Lρ} holds. As argued in [24], if needed one could incorporate additional regularization terms in the cost function to enforce a4) and a5). Before moving on to the proof, a remark is in order.

Remark 3 (Performance guarantees): In line with Proposition 2, one may be prompted to ponder whether the online estimator offers the performance guarantees of the nuclear-norm regularized estimator (P1), for which stable/exact recovery has been well documented, e.g., in [9], [26], [44]. Specifically,

given the learned traffic subspace P and the corresponding Q and A [obtained via (10)] over a time window of size T, is {X := PQ', A} an optimal solution of (P1) when T → ∞? This in turn requires asymptotic analysis of the optimality conditions for (P1), and is left for future research. Nevertheless, empirically the online estimator attains the performance of (P1), as evidenced by the numerical tests in Section VI.

C. Proof of Proposition 3

The main steps of the proof are inspired by [24], which studies convergence of an online dictionary learning algorithm using the theory of martingale sequences; see e.g., [23]. However, relative to [24] the problem here introduces several distinct elements including: i) missing data with a time-varying pattern Ω_t; ii) a non-convex bilinear term where the tall subspace matrix P plays a role similar to the fat dictionary in [24], but the multiplicative projection coefficients here are not sparse; and iii) the additional bilinear terms R_t a_t which entail sparse coding of a_t as in [24], but with a known regression (routing) matrix. Hence, convergence analysis becomes more challenging and demands, in part, a new treatment. Accordingly, in the sequel emphasis will be placed on the novel aspects specific to the problem at hand.

The basic structure of the proof consists of three preliminary lemmata, which are subsequently used to establish that lim_{t→∞} ∇C_t(P[t]) = 0_{L×ρ} a.s. through a simple argument. The first lemma deals with regularity properties of the functions Ĉ_t and C_t, which will come in handy later on; see Appendix B for a proof.

Lemma 1: If a2) and a5) hold, then the functions: i) {a_t(P), q_t(P)} = argmin_{q,a} g_t(P, q, a); ii) g_t(P, q[t], a[t]); iii) ℓ_t(P); and iv) ∇ℓ_t(P) are Lipschitz continuous for P ∈ L (L is a compact set), with constants independent of t.

The next lemma (proved in Appendix C) asserts that the distance between two subsequent traffic subspace estimates vanishes as t → ∞, a property that will be instrumental later on when establishing that Ĉ_t(P[t]) − C_t(P[t]) → 0 a.s.

Lemma 2: If a2)-a5) hold, then ‖P[t+1] − P[t]‖_F = O(1/t).

The previous lemma by no means implies that the subspace iterates converge, which is a much more ambitious objective that may not even hold under the current assumptions. The final lemma however asserts that the cost sequence indeed converges with probability one; see Appendix D for a proof.

Lemma 3: If a1)-a5) hold, then Ĉ_t(P[t]) converges a.s. Moreover, Ĉ_t(P[t]) − C_t(P[t]) → 0 a.s.

Putting the pieces together, in the sequel it is shown that the sequence {∇Ĉ_t(P[t]) − ∇C_t(P[t])}_{t=1}^∞ converges a.s. to zero, and since ∇Ĉ_t(P[t]) = 0_{L×ρ} by algorithmic construction, the subspace iterates {P[t]}_{t=1}^∞ coincide with the stationary points of the target cost function C_t. To this end, it suffices to prove that every convergent subsequence nulls the gradient ∇C_t asymptotically, which in turn implies that the entire sequence converges to the set of stationary points of the batch problem (P3).


Since L is compact by virtue of a3), one can always pick a convergent subsequence {P[t]}_{t=1}^∞ whose limit point is P*, say². Consider the positive-valued decreasing sequence {α_t}_{t=1}^∞ that converges to zero slower than Ĉ_t(P[t]) − C_t(P[t]) does, and recall that Ĉ_t(P[t] + α_t U) ≥ C_t(P[t] + α_t U) for any U ∈ R^{L×ρ}. From the mean-value theorem and for arbitrary U, expanding both sides of the inequality around the point P[t] one arrives at

$$\hat{C}_t(\mathbf{P}[t])+\alpha_t\,\mathrm{tr}\{\mathbf{U}'\nabla\hat{C}_t(\mathbf{P}[t])\}+\alpha_t\,\mathrm{tr}\{\mathbf{U}'(\nabla\hat{C}_t(\boldsymbol{\Theta}_1[t])-\nabla\hat{C}_t(\mathbf{P}[t]))\}$$
$$\geq C_t(\mathbf{P}[t])+\alpha_t\,\mathrm{tr}\{\mathbf{U}'\nabla C_t(\mathbf{P}[t])\}+\alpha_t\,\mathrm{tr}\{\mathbf{U}'(\nabla C_t(\boldsymbol{\Theta}_2[t])-\nabla C_t(\mathbf{P}[t]))\}$$

for some Θ_1[t], Θ_2[t] ∈ R^{L×ρ} on the line segment connecting P[t] and P[t] + α_t U. Taking the limit as t → ∞ and applying Lemma 3, it follows that

$$\lim_{t\to\infty}\mathrm{tr}\{\mathbf{U}'(\nabla\hat{C}_t(\mathbf{P}[t])-\nabla C_t(\mathbf{P}[t]))\}+\lim_{t\to\infty}\mathrm{tr}\{\mathbf{U}'(\nabla\hat{C}_t(\boldsymbol{\Theta}_1[t])-\nabla\hat{C}_t(\mathbf{P}[t]))\}+\lim_{t\to\infty}\mathrm{tr}\{\mathbf{U}'(\nabla C_t(\mathbf{P}[t])-\nabla C_t(\boldsymbol{\Theta}_2[t]))\}\geq 0,\quad\text{a.s.}\quad(19)$$

For the quadratic function Ĉ_t, uniform boundedness of the Hessian ∇²Ĉ_t(P) = (λ*/t) I_{Lρ} + (1/t)Σ_{τ=1}^t (q[τ]q'[τ]) ⊗ Ω_τ implies that ∇Ĉ_t is Lipschitz. Furthermore, since ∇ℓ_τ is Lipschitz as per Lemma 1, ∇C_t is Lipschitz as well. Consequently, according to the Cauchy-Schwarz inequality

$$|\mathrm{tr}\{\mathbf{U}'(\nabla C_t(\mathbf{P}[t])-\nabla C_t(\boldsymbol{\Theta}_2[t]))\}|\leq\|\mathbf{U}\|_F\,\|\nabla C_t(\mathbf{P}[t])-\nabla C_t(\boldsymbol{\Theta}_2[t])\|_F\leq c\,\|\mathbf{U}\|_F\,\|\mathbf{P}[t]-\boldsymbol{\Theta}_2[t]\|_F\overset{(a)}{\leq}c\,\alpha_t\|\mathbf{U}\|_F^2\quad(20)$$

for some constant c > 0, where (a) holds since Θ_2[t] is a convex combination of P[t] and P[t] + α_t U. Likewise, one can bound the second term on the left-hand side of (19). Accordingly, it holds that

$$\lim_{t\to\infty}\mathrm{tr}\{\mathbf{U}'(\nabla C_t(\mathbf{P}[t])-\nabla C_t(\boldsymbol{\Theta}_2[t]))\}=\lim_{t\to\infty}\mathrm{tr}\{\mathbf{U}'(\nabla\hat{C}_t(\mathbf{P}[t])-\nabla\hat{C}_t(\boldsymbol{\Theta}_1[t]))\}=0.$$

All in all, the second and third terms in (19) vanish and one is left with

$$\lim_{t\to\infty}\mathrm{tr}\{\mathbf{U}'(\nabla\hat{C}_t(\mathbf{P}[t])-\nabla C_t(\mathbf{P}[t]))\}\geq 0.\quad(21)$$

Because U ∈ R^{L×ρ} is arbitrary, (21) can only hold if lim_{t→∞}(∇Ĉ_t(P[t]) − ∇C_t(P[t])) = 0_{L×ρ} a.s., which completes the proof.

V. FURTHER ALGORITHMIC ISSUES

For completeness, this section outlines a couple of additional algorithmic aspects relevant to anomaly detection in large-scale networks. Firstly, a lightweight first-order algorithm is developed as an alternative to Algorithm 2, which relies on fast Nesterov-type gradient updates for the traffic subspace. Secondly, the possibility of developing distributed algorithms for dynamic anomalography is discussed.

²Formally, the subsequence should be denoted as {P[t(i)]}_{i=1}^∞, but a slight abuse of notation is allowed for simplicity.

A. Fast stochastic-gradient algorithm

Reduction of the computational complexity in updating the traffic subspace P is the subject of this section. The basic alternating minimization framework in Section IV-A will be retained, and the updates for q[t], a[t] will be identical to those tabulated under Algorithm 2. However, instead of solving an unconstrained quadratic program per iteration to obtain P[t] [cf. (13)], the refinements to the subspace estimate will be given by a (stochastic) gradient algorithm.

As discussed in Section IV-B, in Algorithm 2 the subspace estimate P[t] is obtained by minimizing the empirical cost function Ĉ_t(P) = (1/t)Σ_{τ=1}^t f_τ(P), where

$$f_t(\mathbf{P}):=\frac{1}{2}\|\boldsymbol{\Omega}_t(\mathbf{y}_t-\mathbf{P}\mathbf{q}[t]-\mathbf{R}_t\mathbf{a}[t])\|_2^2+\frac{\lambda_*}{2t}\|\mathbf{P}\|_F^2+\frac{\lambda_*}{2}\|\mathbf{q}[t]\|_2^2+\lambda_1\|\mathbf{a}[t]\|_1,\quad t=1,2,\ldots\quad(22)$$

By the law of large numbers, if the data {P_{Ω_t}(y_t)}_{t=1}^∞ are stationary, solving min_P lim_{t→∞} Ĉ_t(P) yields the desired minimizer of the expected cost E[Ĉ_t(P)], where the expectation is taken with respect to the unknown probability distribution of the data. A standard approach to achieve this same goal, typically with reduced computational complexity, is to drop the expectation (or the sample averaging operator for that matter), and update the nominal traffic subspace via the stochastic gradient iteration [36]

$$\mathbf{P}[t]=\arg\min_{\mathbf{P}}\;\mathcal{Q}_{1/\mu[t],t}(\mathbf{P},\mathbf{P}[t-1])=\mathbf{P}[t-1]-\mu[t]\nabla f_t(\mathbf{P}[t-1])\quad(23)$$

where μ[t] is a stepsize, Q_{μ,t}(P_1, P_2) := f_t(P_2) + ⟨P_1 − P_2, ∇f_t(P_2)⟩ + (μ/2)‖P_1 − P_2‖_F², and ∇f_t(P) = −Ω_t(y_t − Pq[t] − R_t a[t])q'[t] + (λ*/t)P. In the context of adaptive filtering, stochastic gradient algorithms such as (23) are known to converge typically slower than RLS. This is expected since RLS can be shown to be an instance of Newton's (second-order) optimization method [36].

Building on the increasingly popular accelerated gradient methods for (batch) smooth optimization [6], [30], the idea here is to speed up the learning rate of the estimated traffic subspace (23), without paying a penalty in terms of computational complexity per iteration. The critical difference between standard gradient algorithms and the so-termed Nesterov's variant is that the accelerated updates take the form P[t] = P̃[t] − μ[t]∇f_t(P̃[t]), which relies on a judicious linear combination P̃[t] of the previous pair of iterates {P[t−1], P[t−2]}. Specifically, the choice P̃[t] = P[t−1] + ((k[t−1]−1)/k[t])(P[t−1] − P[t−2]), where k[t] = [1 + √(4k²[t−1]+1)]/2, has been shown to significantly accelerate batch gradient algorithms, resulting in a convergence rate no worse than O(1/k²); see e.g., [6] and references therein. Using this acceleration technique in conjunction with a backtracking stepsize rule [7], a fast online stochastic gradient algorithm for unveiling network anomalies is tabulated under Algorithm 3. Different from Algorithm 2, no matrix inversions are involved in the update of the traffic subspace P[t]. Clearly, a standard (non-accelerated) stochastic gradient descent algorithm with backtracking stepsize rule is subsumed as a special case when k[t] = 1, t = 0, 1, 2, ....


Algorithm 3: Online stochastic gradient algorithm for unveiling network anomalies

input {y_t, R_t, Ω_t}_{t=1}^∞, ρ, λ*, λ1, η > 1.
initialize P[0] at random, μ[0] > 0, P̃[1] := P[0], and k[1] := 1.
for t = 1, 2, ... do
  D[t] = (λ* I_ρ + P'[t−1]Ω_t P[t−1])^{−1} P'[t−1].
  F[t] := [Ω_t − Ω_t P[t−1]D[t]Ω_t, √λ* Ω_t D'[t]]'.
  a[t] = argmin_a [(1/2)‖F[t](y_t − R_t a)‖_2² + λ1‖a‖_1].
  q[t] = D[t]Ω_t(y_t − R_t a[t]).
  Find the smallest nonnegative integer i[t] such that with μ̄ := η^{i[t]}μ[t−1]
    f_t(P̃[t] − (1/μ̄)∇f_t(P̃[t])) ≤ Q_{μ̄,t}(P̃[t] − (1/μ̄)∇f_t(P̃[t]), P̃[t])
  holds, and set μ[t] = η^{i[t]}μ[t−1].
  P[t] = P̃[t] − (1/μ[t])∇f_t(P̃[t]).
  k[t+1] = [1 + √(1 + 4k²[t])]/2.
  P̃[t+1] = P[t] + ((k[t]−1)/k[t+1])(P[t] − P[t−1]).
end for
return x[t] := P[t]q[t] and a[t].
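A minimal Python sketch of the accelerated subspace step follows (the backtracking search plus the momentum combination; function names are illustrative, and the per-step constants (λ*/2)‖q[t]‖² + λ1‖a[t]‖1 of (22) are omitted since they appear on both sides of the backtracking test and cancel):

```python
import numpy as np

def f_val(P, y, q, Ra, omega, lam_star, t):
    """P-dependent part of f_t in (22); omega is the 0/1 diagonal of Omega_t."""
    return 0.5 * np.sum((omega * (y - P @ q - Ra))**2) \
        + (lam_star / (2.0 * t)) * np.sum(P**2)

def grad_f(P, y, q, Ra, omega, lam_star, t):
    """Gradient of f_t with respect to P, as given in the text."""
    return -np.outer(omega * (y - P @ q - Ra), q) + (lam_star / t) * P

def accelerated_step(P_prev, P_tilde, y, q, Ra, omega, lam_star, t,
                     mu, k, eta=2.0):
    """One Nesterov step with the backtracking stepsize rule of Algorithm 3."""
    g = grad_f(P_tilde, y, q, Ra, omega, lam_star, t)
    f_tilde = f_val(P_tilde, y, q, Ra, omega, lam_star, t)
    while True:
        P_new = P_tilde - g / mu
        # Q_{mu,t} evaluated at the gradient step equals f_tilde - ||g||^2/(2 mu)
        if f_val(P_new, y, q, Ra, omega, lam_star, t) \
                <= f_tilde - 0.5 * np.sum(g**2) / mu + 1e-12:
            break
        mu *= eta                    # shrink the stepsize 1/mu and retry
    k_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * k**2))
    P_tilde_new = P_new + ((k - 1.0) / k_new) * (P_new - P_prev)
    return P_new, P_tilde_new, mu, k_new
```

Note the SVD-free and inversion-free character of the update: only a matrix-vector residual and an outer product are needed per step.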

Convergence analysis of Algorithm 3 is beyond the scope of this paper, and will only be corroborated using computer simulations in Section VI. It is worth pointing out that since a non-diminishing stepsize is adopted, asymptotically the iterates generated by Algorithm 3 will hover inside a ball centered at the minimizer of the expected cost, with radius proportional to the noise variance.

B. In-network anomaly trackers

Implementing Algorithms 1-3 presumes that network nodes continuously communicate their local link traffic measurements to a central monitoring station, which uses their aggregation in {P_{Ω_t}(y_t)}_{t=1}^∞ to unveil network anomalies. While for the most part this is the prevailing operational paradigm adopted in current network technologies, it is fair to say there are limitations associated with this architecture. For instance, collecting all this information centrally may lead to excessive protocol overhead, especially when the rate of data acquisition is high at the routers. Moreover, minimizing the exchanges of raw measurements may be desirable to reduce unavoidable communication errors that translate to missing data. Performing the optimization in a centralized fashion raises robustness concerns as well, since the central monitoring station represents an isolated point of failure.

These reasons motivate devising fully-distributed iterative algorithms for dynamic anomalography in large-scale networks, embedding the network anomaly detection functionality in the routers. In a nutshell, per iteration nodes carry out simple computational tasks locally, relying on their own link count measurements (a few entries of the network-wide vector y_t corresponding to the router links). Subsequently, local estimates are refined after exchanging messages only with directly connected neighbors, which facilitates percolation of local information to the whole network. The end goal is for network nodes to consent on a global map of network anomalies, and attain (or at least come close to) the estimation performance of the centralized counterpart which has all data {P_{Ω_t}(y_t)}_{t=1}^∞ available.

Fig. 2. Synthetic network topology graph, and the paths used for routing three flows.

Relying on the alternating-directions method of multipliers (AD-MoM) as the basic tool to carry out distributed optimization, a general framework for in-network sparsity-regularized rank minimization was put forth in a companion paper [25]. In the context of network anomaly detection, results therein are encouraging, yet there is ample room for improvement and immediate venues for future research open up. For instance, the distributed algorithms of [25] can only tackle the batch formulation (P3), so extensions to a dynamic network setting, e.g., building on the ideas here to devise distributed anomaly trackers, seem natural. To obtain desirable tradeoffs in terms of computational complexity and speed of convergence, developing and studying algorithms for distributed optimization based on Nesterov's acceleration techniques emerges as an exciting and rather pristine research direction; see [20] for early work dealing with separable batch optimization.

VI. PERFORMANCE TESTS

Performance of the proposed batch and online estimators is assessed in this section via computer simulations using both synthetic and real network data.

Selection of tuning parameters. In the batch case, λ1 and λ* are tuned to optimize the relative error ‖Â − A_0‖_F/‖A_0‖_F, with A_0 and Â denoting the true and estimated anomaly matrices, respectively. In particular, one needs to perform a grid search over the bounded two-dimensional region R := {(λ1, λ*) : λ1 ∈ (0, ‖R'P_Ω(Y)‖_∞], λ* ∈ (0, ‖P_Ω(Y)‖]}. The corresponding bounds are derived from the optimality conditions for (P1), which indicate that for (λ1, λ*) ∈ R^c the optimal solution is {0_{L×T}, 0_{F×T}}. Practical rules that do not require knowledge of A_0 can be devised along the lines of [3] and [10]. Supposing that the true values are zero, choosing λ1 > ‖R'P_Ω(V)‖_∞ and λ* > ‖P_Ω(V)‖ the estimator (P1) outputs X̂ = 0_{L×T}, Â = 0_{F×T}. This mitigates noise, but it may overshrink the true values. To avoid overshrinking, these parameters can be chosen close to their corresponding lower bounds, e.g., pick λ* = ‖P_Ω(V)‖ and λ1 = ‖R'P_Ω(V)‖_∞. One can further simplify the candidate parameters by making the following reasonable assumptions: i) Gaussian noise v_{l,t} ~ N(0, σ²); ii) uniform sampling with each entry of Ω chosen independently with probability


Fig. 3. Performance of the batch estimator (P3) for p = 0.005 and different amounts of missing data. (a) Cost of the estimators (P1) and (P3) versus iteration index when σ = 10^{-2}. (b) ROC curves when σ = 10^{-1}.

π, and iii) large dimensions F, T → ∞. It is then known that (√F + √T)^{-1}‖P_Ω(V)‖ → √π σ almost surely; see e.g., [10]. Thus one can pick λ* = (√F + √T)√π σ. Also, large-deviation tail bounding implies that ‖R'P_Ω(V)‖_∞ ≤ 4σ max_i ‖Re_i‖_2 log(FT) with high probability, which suggests selecting λ1 = σ max_i ‖Re_i‖_2 log(FT). The said regularization parameters can also be used for online processing (upon setting T = t), where they naturally increase as time evolves.
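These two closed-form rules are simple enough to code directly; the sketch below (hypothetical function name) computes them from the routing matrix, the noise level σ, and the sampling probability π:

```python
import numpy as np

def tuning_params(R, sigma, pi, F, T):
    """Closed-form regularization choices from the rules derived above.

    sigma : noise standard deviation, pi : sampling probability.
    Returns (lambda_*, lambda_1); the factor-of-4 slack of the tail bound
    is dropped, as in the text.
    """
    lam_star = (np.sqrt(F) + np.sqrt(T)) * np.sqrt(pi) * sigma
    lam_1 = sigma * np.max(np.linalg.norm(R, axis=0)) * np.log(F * T)
    return lam_star, lam_1
```

For online processing one would call this with T = t at every time step, so both parameters grow slowly as data accumulate.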

A. Synthetic network data tests

Synthetic network example. A network of N = 15 nodes is considered as a realization of the random geometric graph model with agents randomly placed on the unit square, and two agents linked if their Euclidean distance is less than a prescribed communication range of d_c = 0.35; see Fig. 2. The network graph is bidirectional and comprises L = 52 links, and F = N(N−1) = 210 OD flows. For each candidate OD pair, minimum hop-count routing is considered to form the routing matrix R. Entries of v_t are i.i.d., zero-mean, Gaussian with variance σ²; i.e., v_{l,t} ~ N(0, σ²). Flow-traffic vectors z_t are generated from the low-dimensional subspace U ∈ R^{F×r} with i.i.d. entries u_{f,i} ~ N(0, 1/F), and projection coefficients w_{i,t} ~ N(0, 1), such that z_t = Uw_t. Every entry of a_t is randomly drawn from the set {−1, 0, 1}, with Pr(a_{f,t} = −1) = Pr(a_{f,t} = 1) = p/2. Entries of Y are sampled uniformly at random with probability π to form the diagonal sampling matrix Ω_t. The observations at time instant t are generated according to P_{Ω_t}(y_t) = Ω_t(Rz_t + Ra_t + v_t). Unless otherwise stated, r = 2, ρ = 5, and β = 0.99 are used throughout. Different values of σ, p and π are tested.

Performance of the batch estimator. To demonstrate the merits of the batch BCD algorithm for unveiling network anomalies (Algorithm 1), simulated data are generated for a time interval of size T = 100. For validation purposes, the benchmark estimator (P1) is iteratively solved by alternating minimization over A (which corresponds to Lasso) and X. The minimizations with respect to X can be carried out using the iterative singular-value thresholding (SVT) algorithm [8]. Note that with full data, SVT requires only a single SVD computation. In the presence of missing data however, the SVT algorithm may require several SVD computations until convergence, rendering the said algorithm prohibitively complex for large-scale problems. In contrast, Algorithm 1 only requires simple ρ×ρ inversions. Fig. 3(a) depicts the convergence of the respective algorithms used to solve (P1) and (P3), for different amounts of missing data (controlled by π). It is apparent that both estimators attain identical performance after a few tens of iterations, as asserted by Proposition 1. To corroborate the effectiveness of Algorithm 1 in unveiling network anomalies across flows and time, the ROC curves are plotted for various percentages of missing link observations in Fig. 3(b) when σ = 10^{-1}. To discard spurious estimates, the hypothesis test â_{f,t} ≷_{H_0}^{H_1} 0.1 is considered, with anomalous and anomaly-free hypotheses H_1 and H_0, respectively. Apparently, an inferior detection performance is expected as the percentage of missing data increases. Note that when link observations are missing (π < 1), some flows may not be identifiable because they may traverse none of the observed links. For such flows, the anomalous traffic is assumed zero. Hence, as seen in Fig. 3(b), the maximum achievable detection probability equals the fraction of (partially) observed flows. For the instances of (P_FA = 0.021, P_D = 0.96) and (P_FA = 0.016, P_D = 0.69) corresponding to π = 1 and π = 0.75, respectively, Fig. 4 depicts the magnitude of the true and estimated anomalies.

Fig. 4. Amplitude of the true (blue) and estimated (red) anomalies for σ = 10^{-1}. (a) π = 1 (no missing data), P_FA = 0.021 and P_D = 0.96. (b) π = 0.75, P_FA = 0.016 and P_D = 0.69.

Fig. 5. Performance of the online estimator for σ = 10^{-2}, p = 0.005, λ1 = 0.11, and λ* = 0.36. (a) Evolution of the average cost C_t(P[t]) of the online algorithms versus the batch counterpart (P3). (b) Amplitude of true (solid) and estimated (circle markers) anomalies via the online Algorithm 2, for three representative flows when π = 1 (no missing data).

Performance of the online algorithms. To confirm the convergence and effectiveness of the online Algorithms 2 and 3,


IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING (TO APPEAR) 10

simulation tests are carried out for infinite memory β = 1 and a time-invariant routing matrix R. Fig. 5 (a) depicts the evolution of the average cost C_t(P[t]) in (16) for different amounts of missing data π = 0.75, 1 when the noise level is σ = 10^{-2}. It is evident that for both online algorithms the average cost converges (possibly within a ball) to its batch counterpart in (P3) normalized by the window size T = t. Impressively, this observation together with the one in Fig. 3 (a) corroborates that the online estimators can attain the performance of the benchmark estimator, whose stable/exact recovery performance is well documented, e.g., in [11], [26], [44]. It is further observed that the more data are missing, the longer it takes to learn the low-rank nominal traffic subspace, which in turn slows down convergence.

To examine the tracking capability of the online estimators, Fig. 5 (b) depicts the estimated versus true anomalies over time as Algorithm 2 evolves, for three representative flows indicated on Fig. 2, namely f2, f6, f9, corresponding to the f = 2, 6, 9-th rows of A_0. Setting the detection threshold to the value 0.1 as before, for the flows f2, f6, f9 Algorithm 2 attains detection rates P_D = 0.83, 1, 1 at false alarm rates P_FA = 0.0171, 0.0040, 0.0081, respectively. The quantification error per flow is also around P_Q = 0.7606, 0.5863, 0.4028, respectively. As expected, more false alarms are declared at early iterations, when the low-rank subspace has not yet been learned accurately. Upon learning the subspace, performance improves and almost all anomalies are identified. Careful inspection of Fig. 5 (b) reveals that the anomalies for f9 are better identified visually than those for f2. As shown in Fig. 2, f2 is carried over links (1, 2), (2, 4), (4, 14), (14, 3), each one carrying 33, 31, 35, 22 additional flows, respectively, whereas f9 is aggregated over link (1, 3) with only 2 additional flows. Hence, identifying f2's anomalies from the highly-superimposed load of links (1, 2), (2, 4), (4, 14), (14, 3) is a more challenging task relative to link (1, 3). This simple example manifests the fact that the detection performance strongly depends on the network topology and the routing policy implemented, which determine the routing matrix. In accordance with [26], the coherence of sparse column subsets of the routing matrix plays an important role in identifying the anomalies. In essence, the more incoherent the column subsets of R are, the better recovery performance one can attain. An intriguing question left to address in future research pertains to desirable network topologies giving rise to incoherent routing matrices.

Tracking routing changes. The measurement model in (8) has two time-varying attributes which challenge the identification of anomalies. The first one is missing measurement data, arising from e.g., packet losses during the data collection process; the second pertains to routing changes due to e.g., network congestion or link failures. It is thus important to test whether the proposed online algorithm succeeds in tracking these changes. As discussed earlier, missing data are sampled uniformly at random. To assess the impact of routing changes on the recovery performance, a simple probabilistic model is adopted where at each time instant a single link fails, or returns to the operational state. Let Φ denote the adjacency matrix of the network graph G, where [Φ]_{i,j} = 1 if there exists a physical link joining nodes i and j, and zero otherwise. Similarly, the


Fig. 6. Tracking routing changes for p = 0.005. (a) Evolution of average anomaly (dotted) and traffic (solid) estimation errors. (b) Evolution of average detection (solid) and false alarm (dotted) rates. (c) Estimated (red) versus true (blue) link traffic for three representative links. (d) Estimated (circle markers) versus true (solid) anomalies for three representative flows when π = 0.8, σ = 10^{-5}, and α = 0.01.

active links involved in routing the data at time t are represented by the effective adjacency matrix Φ^eff_t. At time instant t + 1, a biased coin is tossed with small success probability α, and one of the links, say (i, j) ∈ Φ^eff_t, is chosen uniformly at random and removed from G while ensuring that the network remains connected. Likewise, an edge (ℓ, k) ∈ Φ \ Φ^eff_t is added with the same probability α. The resulting adjacency matrix is then

Φ^eff_{t+1} = Φ^eff_t + 1_{b_{1,t}=1} e_ℓ e'_k − 1_{b_{2,t}=1} e_i e'_j

where the indicator function 1_{x ∈ X} equals one when x ∈ X, and zero otherwise; and b_{1,t}, b_{2,t} ∼ Ber(α) are i.i.d. Bernoulli random variables. The minimum hop-count algorithm is then applied to Φ^eff_{t+1} to update the routing matrix R_{t+1}. Note that R_{t+1} = R_t with probability (1 − α)^2.
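The link failure/recovery model above can be sketched as follows; this is a minimal illustration, not the authors' simulation code, and the function names (`connected`, `evolve_links`) are hypothetical. The sketch drops a random active link only if the graph stays connected, and restores a random dormant physical link, each with probability α per step.

```python
import numpy as np

def connected(adj):
    """Breadth-first check that the graph with 0/1 adjacency matrix `adj` is connected."""
    n = adj.shape[0]
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in np.flatnonzero(adj[u]):
            if v not in seen:
                seen.add(int(v))
                stack.append(int(v))
    return len(seen) == n

def evolve_links(phi, phi_eff, alpha, rng):
    """One step of the probabilistic link-failure/recovery model: with probability
    alpha drop a random active link (keeping the graph connected), and with
    probability alpha restore a random dormant physical link from phi."""
    out = phi_eff.copy()
    if rng.random() < alpha:  # a link fails (role of the indicator on b_{2,t})
        active = np.argwhere(np.triu(out) > 0)
        for idx in rng.permutation(len(active)):
            i, j = active[idx]
            trial = out.copy()
            trial[i, j] = trial[j, i] = 0
            if connected(trial):  # commit the failure only if connectivity survives
                out = trial
                break
    if rng.random() < alpha:  # a dormant link returns (role of the indicator on b_{1,t})
        dormant = np.argwhere(np.triu(phi - out) > 0)
        if len(dormant) > 0:
            i, j = dormant[rng.integers(len(dormant))]
            out[i, j] = out[j, i] = 1
    return out
```

The routing matrix would then be refreshed from the returned adjacency matrix (e.g., by a minimum hop-count routine), which is omitted here.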

The performance is tested here for fast and slowly-varying routing, corresponding to α = 0.1 and α = 0.01, respectively, when β = 0.9. The metrics of interest are the average square error in estimating the anomalies, namely e^a_t := (1/t) Σ_{i=1}^t ‖â_i − a_i‖²_2, and the link traffic, namely e^x_t := (1/t) Σ_{i=1}^t ‖x̂_i − x_i‖²_2. Fig. 6 (a) plots the average estimation error for various noise variances and amounts of missing data. The estimation error decreases quickly, and after the subspace is learned it becomes almost invariant. To evaluate the support recovery performance of the online estimator, define the average detection and false alarm rates

P_D := [ Σ_{τ=1}^{t} Σ_{f=1}^{F} 1_{â_{f,τ} ≥ 0.1, a_{f,τ} ≥ 0.1} ] / [ Σ_{τ=1}^{t} Σ_{f=1}^{F} 1_{a_{f,τ} ≥ 0.1} ],


P_FA := [ Σ_{τ=1}^{t} Σ_{f=1}^{F} 1_{â_{f,τ} ≥ 0.1, a_{f,τ} ≤ 0.1} ] / [ Σ_{τ=1}^{t} Σ_{f=1}^{F} 1_{a_{f,τ} ≤ 0.1} ].
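The indicator-sum definitions of P_D and P_FA above translate directly into code. A minimal sketch (the function name `detection_rates` is illustrative; anomaly amplitudes are assumed nonnegative, as in these simulations):

```python
import numpy as np

def detection_rates(A_hat, A_true, thresh=0.1):
    """Average detection and false-alarm rates over all (flow, time) pairs,
    mirroring the indicator sums defining P_D and P_FA."""
    truly_anom = A_true >= thresh          # true anomaly support
    declared = A_hat >= thresh             # entries declared anomalous
    p_d = (declared & truly_anom).sum() / max(truly_anom.sum(), 1)
    p_fa = (declared & ~truly_anom).sum() / max((~truly_anom).sum(), 1)
    return p_d, p_fa
```

Sweeping `thresh` over a grid of values would trace out the ROC curves reported in the figures.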

Inspecting Fig. 6 (b), one observes that for α = 0.01 and π = 0.8, increasing the noise variance from 10^{-5} to 10^{-2} lowers the detection probability by 10%. Moreover, when σ = 10^{-5} and α = 0.01, dropping 20% of the observations renders the estimator prone to misdetecting 11% more anomalies. Speeding up the routing changes from α = 0.01 to α = 0.1 when σ = 10^{-5} and π = 0.8 comes with an adverse effect of about a 6% detection-rate decrease. For a few representative network links and flows, Fig. 6 (c) and (d) illustrate how Algorithm 2 tracks the anomalies and link-level traffic. Note that in Fig. 6 (c), link 12 is dropped for the time period t ∈ [220, 420], and thus its traffic level becomes zero. The flows carried over link 31 also vary due to routing changes, which occur at time instants t = 220, 940, when the traffic is not tracked accurately.

B. Real network data tests

Internet-2 network example. Real data including OD flow traffic levels are collected from the operation of the Internet-2 network (the Internet backbone network across the USA) [1], shown in Fig. 1. Flow traffic levels are recorded at 5-minute intervals, for a three-week operational period of Internet-2 during Dec. 8-28, 2008 [1]. Internet-2 comprises N = 11 nodes, L = 41 links, and F = 121 flows. Given the OD flow traffic measurements, the link loads in Y are obtained through multiplication with the Internet-2 routing matrix, which in this case remains invariant during the three weeks of data acquisition [1]. Even though Y is "constructed" here from flow measurements, link loads can be typically acquired from SNMP traces [37].

The available OD flows are incomplete due to problems in the data collection process. In addition, flows can be modeled as the superposition of "clean" plus anomalous traffic, i.e., the sum of some unknown "ground-truth" low-rank and sparse matrices PΩ(X_0 + A_0). Therefore, setting R = I_F in (P1), one can first run the batch Algorithm 1 to estimate the "ground-truth" components X_0, A_0. The estimated X_0 exhibits three dominant singular values, confirming the low-rank property of the nominal traffic matrix. To be on the conservative side, only important spikes with magnitude greater than the threshold level 50‖Y‖_F / LT are retained as benchmark anomalies (nonzero entries in A_0).

Comparison with PCA-based batch estimators [21], [42]. To highlight the merits of the batch estimator (P3), its performance is compared with the spatial PCA-based schemes reported in [21] and [42]. These methods capitalize on the fact that the anomaly-free traffic matrix has low rank, while the presence of anomalies considerably increases the rank of Y. Both algorithms rely on a two-step estimation procedure: (s1) perform PCA on the data Y to extract the (low-rank) anomaly-free link traffic matrix X̂; and (s2) declare anomalies based on the residual traffic Ỹ := Y − X̂. The algorithms in [42] and [21] differ in the way (s2) is performed. In its operational phase, the algorithm in [21] declares the presence of an anomaly at time t when the projection of y_t onto the anomalous subspace exceeds a prescribed threshold. It is clear


Fig. 7. Performance of the batch estimator for Internet-2 network data. (a) ROC curves of the proposed versus the PCA-based methods. (b) Amplitude of the true (blue) and estimated (red) anomalies for P_FA = 0.04 and P_D = 0.93.

that the aforementioned method is unable to identify anomalous flows. On the other hand, the network anomography approach of [42] capitalizes on the sparsity of anomalies, and recovers the anomaly matrix by minimizing ‖A‖_1 subject to the linear constraints Ỹ = RA.

The aforementioned methods require a priori knowledge of the rank of the anomaly-free traffic matrix, and assume there are no missing data. To carry out performance comparisons, the detection rate is adopted as the figure of merit, which measures the algorithm's success in identifying anomalies across both flows and time instants. ROC curves are depicted in Fig. 7 (a) for different values of the rank required to run the PCA-based methods. It is apparent that the estimator (P3) obtained via Algorithm 1 markedly outperforms both PCA-based methods in terms of detection performance. This is somewhat expected, since (P3) advocates joint estimation of the anomalies and the nominal traffic matrix. For an instance of P_FA = 0.04 and P_D = 0.93, Fig. 7 (b) illustrates the effectiveness of the proposed algorithm in unveiling the anomalous flows and time instants.

Online operation. Algorithm 2 is tested here with the Internet-2 network data under two scenarios: with and without missing data. For the incomplete data case, a randomly chosen subset of link counts with cardinality 0.15 × LT is discarded. The penalty parameters are tuned as λ_1 = 0.7 and λ_* = 1.4. The evolution of the average anomaly and traffic estimation errors, and the average detection and false alarm rates, are depicted in Fig. 8 (a) and (b), respectively. Note how in the full-data case, after about a week the traffic subspace is accurately learned and the detection (false alarm) rates approach the values 0.72 (0.011). It is further observed that even with 15% missing data, the detection performance degrades gracefully. Finally, Fig. 8 (c) [(d)] depicts how three representative link traffic levels [OD flow anomalies] are accurately tracked over time.

VII. C ONCLUDING REMARKS

An online algorithm is developed in this paper to perform a critical network monitoring task termed dynamic anomalography, meaning to unveil traffic volume anomalies in backbone networks adaptively. Given link-level traffic measurements (noisy superpositions of OD flows) acquired sequentially in time, the goal is to construct a map of anomalies in real time, which summarizes the network 'health state' along both the flow and time dimensions. Online algorithms enable tracking of anomalies in nonstationary environments, typically arising due to e.g., routing changes and missing data. The resultant online schemes offer an attractive alternative to batch algorithms, since they scale gracefully as the number of flows in the network grows, or the time window of data acquisition increases. Comprehensive numerical tests with both synthetic and real network data corroborate the effectiveness of the proposed algorithms and their tracking capabilities, and show that they outperform existing workhorse approaches for network anomaly detection.

Fig. 8. Performance of the online estimator for Internet-2 network data. (a) Evolution of average anomaly (dotted) and traffic (solid) estimation errors. (b) Evolution of average detection (solid) and false alarm (dotted) rates. (c) Estimated (red) versus true (blue) link traffic for three representative links. (d) Estimated (circle markers) versus true (solid) anomalies for three representative flows when π = 0.85.

APPENDIX

A. Update of the anomaly map in Algorithm 1: As argued in Section III-B, the matrix Lasso problem under [S1] decomposes over the columns of A := [a_1, . . . , a_T]. Hence, it suffices to focus on the update of a single column, say a_t := [a_{1,t}, . . . , a_{F,t}]', which boils down to solving [cf. (6)]

a_t[k + 1] = argmin_a [ (1/2) ‖Ω_t( y_t − P[k] q_t[k] − Σ_{f=1}^{F} r_f a_{f,t} )‖²_2 + λ_1 Σ_{f=1}^{F} |a_{f,t}| ]   (24)

where r_f denotes the f-th column of R. Let n = 0, 1, . . . denote the (inner) iteration index for the cyclic coordinate descent algorithm adopted to solve (24) [18, p. 92]. For the minimization at step k of the (outer) BCD iterations in Algorithm 1, the sequence of iterates a_t[k; n] is initialized as a_t[k; 0] := a_t[k]. At each step n, the scalar coordinates a_{f,t} of vector a_t are updated cyclically, by solving sequentially for f = 1, 2, . . . , F

a_{f,t}[k; n + 1] = argmin_a [ (1/2) ‖y_t^{(−f)}[k; n + 1] − Ω_t r_f a‖²_2 + λ_1 |a| ]   (25)

y_t^{(−f)}[k; n + 1] := Ω_t( y_t − P[k] q_t[k] − Σ_{f'=1}^{f−1} r_{f'} a_{f',t}[k; n + 1] − Σ_{f'=f+1}^{F} r_{f'} a_{f',t}[k; n] ).   (26)

Vector y_t^{(−f)} corresponds to the partial residual error without the contribution of the predictor Ω_t r_f. The usefulness of a coordinate descent approach stems from the fact that the coordinate updates (25) amount to scalar Lasso-type optimizations. Skipping details that can be found in, e.g., [18, p. 93], the solutions are thus expressible in the closed form

a_{f,t}[k; n + 1] = sign(r'_f y_t^{(−f)}[k; n + 1]) [ |r'_f y_t^{(−f)}[k; n + 1]| − λ_1 ]_+ / ‖Ω_t r_f‖²_2   (27)

which is oftentimes referred to as soft-thresholding of the partial residual y_t^{(−f)}. Separability of the nondifferentiable ℓ_1-norm term in (24) is sufficient to guarantee the convergence of (27) to a minimizer of (24) as n → ∞ [39]. Hence, the update a_t[k + 1] := lim_{n→∞} [a_{1,t}[k; n], . . . , a_{F,t}[k; n]]' is well defined, and identical to the one in (24).
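One inner pass of the cyclic soft-thresholding updates can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the names `soft` and `anomaly_pass` are hypothetical, `omega` is a 0/1 observation mask playing the role of Ω_t, and `low_rank_pred` stands in for the current low-rank prediction P[k] q_t[k].

```python
import numpy as np

def soft(z, lam):
    """Scalar soft-thresholding operator sign(z)[|z| - lam]_+."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def anomaly_pass(y, omega, low_rank_pred, R, a, lam1):
    """One cyclic pass over the coordinates of a_t: each a_f is refreshed by
    soft-thresholding the partial residual, cf. the scalar Lasso updates above."""
    a = a.copy()
    for f in range(len(a)):
        rf = omega * R[:, f]                                       # masked f-th column of R
        partial = omega * (y - low_rank_pred - R @ a) + rf * a[f]  # residual excluding flow f
        denom = rf @ rf                                            # = ||Omega_t r_f||^2
        if denom > 0:
            a[f] = soft(rf @ partial, lam1) / denom
    return a
```

Because `a` is updated in place as the loop sweeps over f, coordinates with f' < f already use their refreshed values, matching the cyclic scheme of (26)-(27).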

The rationale behind the actual anomaly map updates in Algorithm 1 hinges on the fact that the solution of (24) does not need to be super accurate, since it is just an intermediate step in the outer loop defined by the BCD solver. In the relaxation pursued here, the inner iteration is halted after a single step (i.e., when n = 1) to yield an inexact minimizer of (24). In this case, the index n can be dropped and (26)-(27) simplify to the sequential updates for f = 1, 2, . . . , F

y_t^{(−f)}[k + 1] := Ω_t( y_t − P[k] q_t[k] − Σ_{f'=1}^{f−1} r_{f'} a_{f',t}[k + 1] − Σ_{f'=f+1}^{F} r_{f'} a_{f',t}[k] )   (28)

a_{f,t}[k + 1] = sign(r'_f y_t^{(−f)}[k + 1]) [ |r'_f y_t^{(−f)}[k + 1]| − λ_1 ]_+ / ‖Ω_t r_f‖²_2   (29)

as tabulated under Algorithm 1.

B. Proof of Lemma 1: With P_1, P_2 ∈ L, consider the function

u_t(a, P_1, P_2) := (1/2) ‖F_t(P_1)(y_t − R_t a)‖²_2 − (1/2) ‖F_t(P_2)(y_t − R_t a)‖²_2   (30)

where F_t(P) := [ Ω_t [I_L − P D_t(P)] Ω_t, √λ_* Ω_t D'_t(P) ]', and D_t(P) := (λ_* I_ρ + P' Ω_t P)^{−1} P'. From the convexity


of the Lasso problem in (12) together with the mean-value theorem and a5), it can be readily inferred that

u_t(a_t(P_2), P_1, P_2) − u_t(a_t(P_1), P_1, P_2) ≥ c_0 ‖a_t(P_2) − a_t(P_1)‖²_2   (31)

for some positive constant c_0. The rest of the proof deals with the Lipschitz continuity of u_t(·, P_1, P_2). For a_1 and a_2 from a compact set A, consider

2 |u_t(a_1, P_1, P_2) − u_t(a_2, P_1, P_2)| = 2 ⟨R'_t [F'_t(P_2) F_t(P_2) − F'_t(P_1) F_t(P_1)], (a_2 − a_1) y'_t⟩ + ( ‖F_t(P_1) R_t a_1‖²_2 − ‖F_t(P_1) R_t a_2‖²_2 ) − ( ‖F_t(P_2) R_t a_1‖²_2 − ‖F_t(P_2) R_t a_2‖²_2 ).   (32)

Introducing the auxiliary variable Δ_a := a_2 − a_1, the last two summands in (32) can be bounded as

‖F_t(P_1) R_t a_1‖²_2 − ‖F_t(P_1) R_t a_2‖²_2 − ‖F_t(P_2) R_t a_1‖²_2 + ‖F_t(P_2) R_t a_2‖²_2
= ( ‖F_t(P_1) R_t Δ_a‖²_2 − ‖F_t(P_2) R_t Δ_a‖²_2 ) + 2 ⟨R'_t [F'_t(P_2) F_t(P_2) − F'_t(P_1) F_t(P_1)], a_2 Δ'_a⟩
≤ c_1 ‖F_t(P_2) − F_t(P_1)‖ ‖Δ_a‖²_2 + c_2 ‖F'_t(P_2) F_t(P_2) − F'_t(P_1) F_t(P_1)‖ ‖Δ_a‖_2
≤ c_3 ‖F_t(P_2) − F_t(P_1)‖ ‖Δ_a‖_2   (33)

for some constants c_1, c_2, c_3 > 0, since ‖F_t(P)‖ for P ∈ L, ‖Δ_a‖_2 and ‖a_2‖_2 for a_1, a_2 ∈ A, and ‖R_t‖ are all uniformly bounded. The first summand on the right-hand side of (32) is similarly bounded (details omitted here). Next, to establish that F_t(P) is Lipschitz, one can derive the following bound (Δ_P := P_2 − P_1)

‖F_t(P_2) − F_t(P_1)‖ ≤ ‖Ω_t [P_2 D_t(P_2) − P_1 D_t(P_1)] Ω_t‖ + √λ_* ‖Ω_t (D'_t(P_2) − D'_t(P_1))‖
≤ ‖P_1‖ (‖P_1‖ + √λ_*) ‖(λ_* I_ρ + P'_2 Ω_t P_2)^{−1} − (λ_* I_ρ + P'_1 Ω_t P_1)^{−1}‖ + ‖Δ_P‖ (‖P_1‖ + ‖P_2‖ + √λ_*) ‖(λ_* I_ρ + P'_2 Ω_t P_2)^{−1}‖.   (34)

Define G_t := Δ'_P Ω_t P_1 + Δ'_P Ω_t Δ_P + P'_1 Ω_t Δ_P and H_{t,i} := λ_* I_ρ + P'_i Ω_t P_i, i = 1, 2, and consider the identity

H^{−1}_{t,1} = ( H_{t,1} + G_t )^{−1} + H^{−1}_{t,1} G_t ( H_{t,1} + G_t )^{−1}.
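Note that with these definitions H_{t,2} = H_{t,1} + G_t, so the identity relates the two resolvents directly. The relation can be checked numerically on randomly drawn matrices; the instance below is purely illustrative (all names and dimensions are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)
rho, lam = 4, 0.5

# Illustrative P1, P2 and a diagonal 0/1 mask standing in for Omega_t.
P1 = rng.standard_normal((8, rho))
P2 = P1 + 0.1 * rng.standard_normal((8, rho))
Om = np.diag(rng.integers(0, 2, size=8).astype(float))

dP = P2 - P1
G = dP.T @ Om @ P1 + dP.T @ Om @ dP + P1.T @ Om @ dP   # G_t as defined above
H1 = lam * np.eye(rho) + P1.T @ Om @ P1                # H_{t,1}
H2 = lam * np.eye(rho) + P2.T @ Om @ P2                # H_{t,2} = H_{t,1} + G_t

lhs = np.linalg.inv(H1)
rhs = np.linalg.inv(H1 + G) + lhs @ G @ np.linalg.inv(H1 + G)
```

The identity follows by writing (I + H^{−1}_{t,1} G_t)(H_{t,1} + G_t)^{−1} = H^{−1}_{t,1}(H_{t,1} + G_t)(H_{t,1} + G_t)^{−1} = H^{−1}_{t,1}.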

The first term on the right-hand side of (34) is then bounded as follows

‖(λ_* I_ρ + P'_2 Ω_t P_2)^{−1} − (λ_* I_ρ + P'_1 Ω_t P_1)^{−1}‖ = ‖( H_{t,1} + G_t )^{−1} − H^{−1}_{t,1}‖
≤ ‖H^{−1}_{t,1}‖ ‖G_t‖ ‖( H_{t,1} + G_t )^{−1}‖ ≤ (1/λ_*)² ‖G_t‖ ≤ c_4 ‖Δ_P‖.   (35)

Putting the pieces together, F_t(·) is found to be Lipschitz, and subsequently (32) is bounded by a constant factor of ‖Δ_P‖ ‖Δ_a‖_2. Substituting a_1 = a_t(P_1) and a_2 = a_t(P_2) along with the bound in (31) yields the desired result ‖a_t(P_2) − a_t(P_1)‖_2 ≤ c_5 ‖P_2 − P_1‖. Furthermore, from the relationship q_t = D_t(P) Ω_t (y_t − R_t a_t), Lipschitz continuity of q_t(P) readily follows.

Moreover, g_t(P, q[t], a[t]) is a quadratic function on a compact set, and thus clearly Lipschitz continuous. To prove Lipschitz continuity of ℓ_t(P), recall the definition (q_t(P), a_t(P)) = argmin_{q,a} g_t(P, q, a) to obtain, after some algebra,

ℓ_t(P_2) − ℓ_t(P_1) = (1/2) [ ‖P_{Ω_t}(P_2 q_t(P_2) + R_t a_t(P_2))‖²_2 − ‖P_{Ω_t}(P_1 q_t(P_1) + R_t a_t(P_1))‖²_2 ]
− ⟨P_{Ω_t}(y_t), P_2 q_t(P_2) + R_t a_t(P_2) − P_1 q_t(P_1) − R_t a_t(P_1)⟩
+ (λ_*/2) ( ‖q_t(P_2)‖²_2 − ‖q_t(P_1)‖²_2 ) + λ_1 ( ‖a_t(P_2)‖_1 − ‖a_t(P_1)‖_1 ).   (36)

The first term on the right-hand side of (36) is bounded as

‖P_{Ω_t}(P_2 q_t(P_2) + R_t a_t(P_2))‖²_2 − ‖P_{Ω_t}(P_1 q_t(P_1) + R_t a_t(P_1))‖²_2
≤ ( ‖P_{Ω_t}(P_2 q_t(P_2) − P_1 q_t(P_1))‖_2 + ‖P_{Ω_t}(R_t a_t(P_2) − R_t a_t(P_1))‖_2 ) × ( ‖P_{Ω_t}(P_2 q_t(P_2) + R_t a_t(P_2))‖_2 + ‖P_{Ω_t}(P_1 q_t(P_1) + R_t a_t(P_1))‖_2 )
≤ c_6 ( ‖P_2 − P_1‖ ‖q_t(P_2)‖_2 + ‖P_1‖ ‖q_t(P_2) − q_t(P_1)‖_2 + ‖R_t‖ ‖a_t(P_2) − a_t(P_1)‖_2 )   (37)

for some constant c_6 > 0. The second one is bounded as

⟨P_{Ω_t}(y_t), P_2 q_t(P_2) + R_t a_t(P_2) − P_1 q_t(P_1) − R_t a_t(P_1)⟩
≤ ‖P_{Ω_t}(y_t)‖_2 ( ‖P_{Ω_t}(P_2 q_t(P_2) − P_1 q_t(P_1))‖_2 + ‖P_{Ω_t}(R_t a_t(P_2) − R_t a_t(P_1))‖_2 )
≤ ‖P_{Ω_t}(y_t)‖_2 ( ‖P_2 − P_1‖ ‖q_t(P_2)‖_2 + ‖P_1‖ ‖q_t(P_2) − q_t(P_1)‖_2 + ‖R_t‖ ‖a_t(P_2) − a_t(P_1)‖_2 ).   (38)

Finally, one can bound the third term in (36) as

(λ_*/2) ( ‖q_t(P_2)‖²_2 − ‖q_t(P_1)‖²_2 ) + λ_1 ( ‖a_t(P_2)‖_1 − ‖a_t(P_1)‖_1 )
≤ (λ_*/2) ‖q_t(P_2) − q_t(P_1)‖_2 ( ‖q_t(P_2)‖_2 + ‖q_t(P_1)‖_2 ) + λ_1 √F ‖a_t(P_2) − a_t(P_1)‖_2.   (39)

Since q_t(P) and a_t(P) are Lipschitz as proved earlier, and P_1, P_2 ∈ L are uniformly bounded, the expressions on the right-hand side of (37)-(39) are upper bounded by a constant factor of ‖P_2 − P_1‖, and so is |ℓ_t(P_2) − ℓ_t(P_1)| after applying the triangle inequality to (36).

Regarding ∇ℓ_t(P), notice first that since (q_t(P), a_t(P)) is the unique minimizer of g_t(P, q, a) [cf. a5)], Danskin's theorem [7, Prop. B.25(a)] implies that ∇ℓ_t(P) = P_{Ω_t}(y_t − P q_t(P) − R_t a_t(P)) q'_t(P). In the sequel, the triangle inequality will be used to split the norm on the right-hand side of

‖∇ℓ_t(P_2) − ∇ℓ_t(P_1)‖_F = ‖ P_{Ω_t}(y_t) [q_t(P_2) − q_t(P_1)]' − [ P_{Ω_t}(P_2 q_t(P_2)) q'_t(P_2) − P_{Ω_t}(P_1 q_t(P_1)) q'_t(P_1) ] − [ P_{Ω_t}(R_t a_t(P_2)) q'_t(P_2) − P_{Ω_t}(R_t a_t(P_1)) q'_t(P_1) ] ‖_F.   (40)

The first term inside the norm is bounded as

‖P_{Ω_t}(y_t) [q_t(P_2) − q_t(P_1)]'‖_F ≤ ‖P_{Ω_t}(y_t)‖_2 ‖q_t(P_2) − q_t(P_1)‖_2.   (41)

After some algebraic manipulations, the second term is also bounded as

‖P_{Ω_t}(P_2 q_t(P_2)) q'_t(P_2) − P_{Ω_t}(P_1 q_t(P_1)) q'_t(P_1)‖_F ≤ ‖P_2 − P_1‖_F ‖q_t(P_2)‖²_2 + ‖q_t(P_2) − q_t(P_1)‖_2 ( ‖q_t(P_2)‖_2 + ‖q_t(P_1)‖_2 )   (42)

and finally one can simply bound the third term as

‖P_{Ω_t}(R_t a_t(P_2)) q'_t(P_2) − P_{Ω_t}(R_t a_t(P_1)) q'_t(P_1)‖_F ≤ ‖R_t‖ ( ‖a_t(P_2) − a_t(P_1)‖_2 ‖q_t(P_1)‖_2 + ‖q_t(P_2) − q_t(P_1)‖_2 ‖a_t(P_1)‖_2 ).   (43)

Since a_t(P) and q_t(P) are Lipschitz and uniformly bounded, from (41)-(43) one can easily deduce that ∇ℓ_t(·) is indeed Lipschitz continuous.

C. Proof of Lemma 2: Exploiting that ∇Ĉ_t(P[t]) = ∇Ĉ_{t+1}(P[t + 1]) = 0_{L×ρ} by algorithmic construction, and the strong convexity assumption on Ĉ_t [cf. a4)], application of the mean-value theorem readily yields

Ĉ_t(P[t + 1]) ≥ Ĉ_t(P[t]) + (c/2) ‖P[t + 1] − P[t]‖²_F
Ĉ_{t+1}(P[t]) ≥ Ĉ_{t+1}(P[t + 1]) + (c/2) ‖P[t + 1] − P[t]‖²_F.

Upon defining the function h_t(P) := Ĉ_t(P) − Ĉ_{t+1}(P), one arrives at

c ‖P[t + 1] − P[t]‖²_F ≤ h_t(P[t + 1]) − h_t(P[t]).   (44)

To complete the proof, it suffices to show that h_t is Lipschitz with constant O(1/t), and to upper bound the right-hand side of (44) accordingly. Since [cf. (17)]

h_t(P) = [1/(t(t + 1))] Σ_{τ=1}^{t} g_τ(P, q[τ], a[τ]) − [1/(t + 1)] g_{t+1}(P, q[t + 1], a[t + 1]) + [λ_*/(2t(t + 1))] ‖P‖²_F   (45)

and each g_τ(P, q[τ], a[τ]) is Lipschitz in P according to Lemma 1, it follows that h_t is Lipschitz with constant O(1/t).

D. Proof of Lemma 3: The first step of the proof is to show that {Ĉ_t(P[t])}_{t=1}^∞ is a quasi-martingale sequence, and hence convergent a.s. [23]. Building on the variations of Ĉ_t(P[t]),

one can write

Ĉ_{t+1}(P[t + 1]) − Ĉ_t(P[t]) = Ĉ_{t+1}(P[t + 1]) − Ĉ_{t+1}(P[t]) + Ĉ_{t+1}(P[t]) − Ĉ_t(P[t])
(a)≤ Ĉ_{t+1}(P[t]) − Ĉ_t(P[t])
= [1/(t + 1)] [ g_{t+1}(P[t], q[t + 1], a[t + 1]) − (1/t) Σ_{τ=1}^{t} g_τ(P[t], q[τ], a[τ]) ]
(b)≤ [1/(t + 1)] [ g_{t+1}(P[t], q[t + 1], a[t + 1]) − (1/t) Σ_{τ=1}^{t} ℓ_τ(P[t]) ]   (46)

where (a) uses that Ĉ_{t+1}(P[t + 1]) ≤ Ĉ_{t+1}(P[t]), and (b) follows from C_t(P[t]) ≤ Ĉ_t(P[t]).

Collect all past data in F_t := {(Ω_τ, y_τ) : τ ≤ t}, and recall that under a1) the random processes {Ω_t, y_t} are i.i.d. over time. Then, the expected variations of the approximate cost function are bounded as

E[ Ĉ_{t+1}(P[t + 1]) − Ĉ_t(P[t]) | F_t ]
≤ [1/(t + 1)] ( E[g_{t+1}(P[t], q[t + 1], a[t + 1]) | F_t] − (1/t) Σ_{τ=1}^{t} ℓ_τ(P[t]) )
(a)= [1/(t + 1)] ( E[ℓ_1(P[t])] − (1/t) Σ_{τ=1}^{t} ℓ_τ(P[t]) )
≤ [1/(t + 1)] sup_{P[t] ∈ L} ( E[ℓ_1(P[t])] − (1/t) Σ_{τ=1}^{t} ℓ_τ(P[t]) )   (47)

where (a) follows from a1). Using the fact that ℓ_τ(P[t]) is Lipschitz from Lemma 1, and uniformly bounded due to a2), Donsker's theorem [40, Ch. 19.2] yields

E[ sup_{P[t]} | E[ℓ_1(P[t])] − (1/t) Σ_{τ=1}^{t} ℓ_τ(P[t]) | ] = O(1/√t).   (48)

From (47) and (48), the expected non-negative variations can be readily bounded as

E[ [ E[ Ĉ_{t+1}(P[t + 1]) − Ĉ_t(P[t]) | F_t ] ]_+ ] = O(1/t^{3/2})   (49)

and consequently

Σ_{t=1}^∞ E[ [ E[ Ĉ_{t+1}(P[t + 1]) − Ĉ_t(P[t]) | F_t ] ]_+ ] < ∞   (50)

which indeed proves that {Ĉ_t(P[t])}_{t=1}^∞ is a quasi-martingale sequence.

To prove the second part, first define U_t(P[t]) := C_t(P[t]) − (λ_*/(2t)) ‖P[t]‖²_F and Û_t(P[t]) := Ĉ_t(P[t]) − (λ_*/(2t)) ‖P[t]‖²_F, for which Û_t(P[t]) − U_t(P[t]) = Ĉ_t(P[t]) − C_t(P[t]) holds. Following similar arguments as with Ĉ_t(P[t]), one can show that (50) holds for Û_t(P[t]) as well. It is also useful to expand the variations

Û_{t+1}(P[t + 1]) − Û_t(P[t]) = Û_{t+1}(P[t + 1]) − Û_{t+1}(P[t]) + [ℓ_{t+1}(P[t]) − U_t(P[t])]/(t + 1) + [U_t(P[t]) − Û_t(P[t])]/(t + 1)


and bound their expectation conditioned on F_t, to arrive at

[Û_t(P[t]) − U_t(P[t])]/(t + 1) ≤ | E[ Û_{t+1}(P[t + 1]) − Û_{t+1}(P[t]) | F_t ] | + | E[ Û_{t+1}(P[t + 1]) − Û_t(P[t]) | F_t ] | + [1/(t + 1)] | E[ℓ_1(P[t])] − (1/t) Σ_{τ=1}^{t} ℓ_τ(P[t]) |.   (51)

Focusing on the right-hand side of (51), the second and third terms are both O(1/t^{3/2}), since counterparts of (48) and (49) also hold for Û_t(P[t]). Regarding the first term, using the fact that Ĉ_{t+1}(P[t + 1]) ≤ Ĉ_{t+1}(P[t]), from Lemma 1 and a4) it follows that Û_{t+1}(P[t + 1]) − Û_{t+1}(P[t]) = o(1/t). All in all,

Σ_{t=1}^∞ [Û_t(P[t]) − U_t(P[t])]/(t + 1) < ∞  a.s.   (52)

Defining d_t(P[t]) := Û_t(P[t]) − U_t(P[t]), due to the Lipschitz continuity of ℓ_t and g_t (cf. Lemma 1), and the uniform boundedness of {P_t}_{t=1}^∞ [cf. a3)], invoking Lemma 2 one can establish that d_{t+1}(P[t + 1]) − d_t(P[t]) = O(1/t). Hence, Dirichlet's theorem [34] applied to the sum (52) asserts that lim_{t→∞} d_t(P[t]) = 0 a.s., and consequently lim_{t→∞} (Ĉ_t(P[t]) − C_t(P[t])) = 0 a.s.

REFERENCES

[1] [Online]. Available: http://internet2.edu/observatory/archive/data-collections.html

[2] A. Abdelkefi, Y. Jiang, W. Wang, A. Aslebo, and O. Kvittem,“Robusttraffic anomaly detection with principal component pursuit,” in Proc. ofthe ACM CoNEXT Student Workshop, Philadelphia, USA, Nov. 2010.

[3] A. Agarwal, S. Negahban, and M. J. Wainright, “Noisy matrix decom-position via convex relaxation: optimal rates in high dimensions,” TheAnnals of Statistics, vol. 40, p. 11711197, 2012.

[4] T. Ahmed, M. Coates, and A. Lakhina, “Multivariate online anomalydetection using kernel recursive least squares,” inProc. of IEEE/ACMInternational Conference on Computer Communications, Anchorage,Alaska, May 2007.

[5] L. Balzano, R. Nowak, and B. Recht, “Online identification and trackingof subspaces from highly incomplete information,” inProc. of AllertonConference on Communication, Control, and Computing, Monticello,USA, Jun. 2010.

[6] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algo-rithm for linear inverse problems,”SIAM J. Imag. Sci., vol. 2, p. 183202,Jan. 2009.

[7] D. P. Bertsekas,Nonlinear Programming, 2nd ed. Athena-Scientific,1999.

[8] J. F. Cai, E. J. Candes, and Z. Shen, “A singular value thresholdingalgorithm for matrix completion,”SIAM J. Optim., vol. 20, no. 4, pp.1956–1982, 2008.

[9] E. J. Candes, X. Li, Y. Ma, and J. Wright, “Robust principal componentanalysis?”Journal of the ACM, vol. 58, no. 1, pp. 1–37, 2011.

[10] E. J. Candes and Y. Plan, “Matrix completion with noise,” Proceedingsof the IEEE, vol. 98, pp. 925–936, 2009.

[11] E. J. Candes and B. Recht, “Exact matrix completion viaconvexoptimization,” Found. Comput. Math., vol. 9, no. 6, pp. 717–722, 2009.

[12] E. J. Candes and T. Tao, “Decoding by linear programming,” IEEE Trans.Info. Theory, vol. 51, no. 12, pp. 4203–4215, 2005.

[13] E. J. Candes and M. Wakin, “An introduction to compressive sampling,”IEEE Signal Processing Magazine, vol. 25, p. 1420, 2008.

[14] V. Chandrasekaran, S. Sanghavi, P. R. Parrilo, and A. S.Willsky, “Rank-sparsity incoherence for matrix decomposition,”SIAM J. Optim., vol. 21,no. 2, pp. 572–596, 2011.

[15] Q. Chenlu and N. Vaswani, “Recursive sparse recovery inlarge butcorrelated noise,” inProc. of 49th Allerton Conf. on Communication,Control, and Computing, Sep. 2011, pp. 752 –759.

[16] Y. Chi, Y. C. Eldar, and R. Calderbank, “PETRELS: Subspace estimationand tracking from partial observations,” inProc. of IEEE InternationalConference on Acoustics, Speech and Signal Processing, Kyoto, Japan,Mar. 2012.

[17] A. Chistov and D. Grigorev, “Complexity of quantifier elimination inthe theory of algebraically closed fields,” inMath. Found. of ComputerScience, ser. Lecture Notes in Computer Science. Springer Berlin /Heidelberg, 1984, vol. 176, pp. 17–31.

[18] T. Hastie, R. Tibshirani, and J. Friedman,The Elements of StatisticalLearning, 2nd ed. Springer, 2009.

[19] J. He, L. Balzano, and A. Szlam, “Incremental gradient on the Grass-mannian for online foreground and background separation insubsampledvideo,” in Proc. of IEEE Conference on Computer Vision and PatternRecognition, Providence, Rhode Island, Jun. 2012.

[20] D. Jakovetic, J. Xavier, and J. M. F. Moura, “Fast distributed gradientmethods,” arXiv:1112.2972v1 [cs.IT].

[21] A. Lakhina, M. Crovella, and C. Diot, “Diagnosing network-wide trafficanomalies,” inProc. of ACM SIGCOMM, Portland, OR, Aug. 2004.

[22] A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. D.Kolaczyk, andN. Taft, “Structural analysis of network traffic flows,” inProc. of ACMSIGMETRICS, New York, NY, Jul. 2004.

[23] L. Ljung and T. Soderstrom, Theory and Practice of Recursive Identifi-cation, 2nd ed. MIT Press, 1983.

[24] J. Mairal, J. Bach, J. Ponce, and G. Sapiro, “Online learning for matrixfactorization and sparse coding,”J. of Machine Learning Research,vol. 11, pp. 19–60, Jan. 2010.

[25] M. Mardani, G. Mateos, and G. B. Giannakis, “In-networksparsityregularized rank minimization: Applications and algorithms,” IEEE Tran-s. Signal Process., Feb. 2012 (submitted), see also arXiv:1203.1570v1[cs.MA].

[26] ——, “Recovery of low-rank plus compressed sparse matrices with application to unveiling traffic anomalies,” IEEE Trans. Info. Theory, Apr. 2012 (submitted), see also arXiv:1204.6537v1 [cs.IT].

[27] G. Mateos and G. B. Giannakis, “Robust PCA as bilinear decomposition with outlier-sparsity regularization,” IEEE Trans. Signal Process., Sep. 2012, see also arXiv:1111.1788v1 [stat.ML].

[28] Z. Meng, A. Wiesel, and A. Hero, “Distributed principal component analysis on networks via directed graphical models,” in Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing, Kyoto, Japan, Mar. 2012.

[29] B. K. Natarajan, “Sparse approximate solutions to linear systems,” SIAM J. Comput., vol. 24, pp. 227–234, 1995.

[30] Y. Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2),” Soviet Mathematics Doklady, vol. 27, pp. 372–376, 1983.

[31] B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization,” SIAM Rev., vol. 52, no. 3, pp. 471–501, 2010.

[32] B. Recht and C. Re, “Parallel stochastic gradient algorithms for large-scale matrix completion,” 2011 (submitted).

[33] M. Roughan, “A case study of the accuracy of SNMP measurements,” Journal of Electrical and Computer Engineering, Dec. 2010, article ID 812979.

[34] W. Rudin, Principles of Mathematical Analysis, 3rd ed. McGraw-Hill, 1976.

[35] A. H. Sayed, Fundamentals of Adaptive Filtering. John Wiley & Sons, 2003.

[36] V. Solo and X. Kong, Adaptive Signal Processing Algorithms: Stability and Performance. Prentice Hall, 1995.

[37] M. Thottan and C. Ji, “Anomaly detection in IP networks,” IEEE Trans. Signal Process., vol. 51, pp. 2191–2204, Aug. 2003.

[38] J. Tropp, “Just relax: Convex programming methods for identifying sparse signals,” IEEE Trans. Info. Theory, vol. 51, pp. 1030–1051, Mar. 2006.

[39] P. Tseng, “Convergence of a block coordinate descent method for nondifferentiable minimization,” Journal of Optimization Theory and Applications, vol. 109, pp. 475–494, 2001.

[40] A. W. van der Vaart, Asymptotic Statistics. Cambridge University Press, 2000.

[41] B. Yang, “Projection approximation subspace tracking,” IEEE Trans. Signal Process., vol. 43, pp. 95–107, Jan. 1995.

[42] Y. Zhang, Z. Ge, A. Greenberg, and M. Roughan, “Network anomography,” in Proc. of ACM SIGCOMM Conf. on Internet Measurements, Berkeley, CA, USA, Oct. 2005.

[43] Y. Zhang, M. Roughan, W. Willinger, and L. Qiu, “Spatio-temporal compressive sensing and internet traffic matrices,” in Proc. of ACM SIGCOMM Conf. on Data Commun., New York, USA, Oct. 2009.

IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING (TO APPEAR) 16

[44] Z. Zhou, X. Li, J. Wright, E. Candes, and Y. Ma, “Stable principal component pursuit,” in Proc. of Intl. Symp. on Information Theory, Austin, TX, Jun. 2010, pp. 1518–1522.

Morteza Mardani (S’06) received his B.Sc. degree in Electrical Engineering from the Shahid Bahonar University of Kerman, Kerman, Iran, in 2006 and the M.Sc. degree in Electrical Engineering from the University of Tehran, Tehran, Iran, in 2009. Since September 2009, he has been working toward his Ph.D. degree with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis. His research interests include network inference and optimization, sparse and low-rank recovery, and cross-layer design of wireless networks.

Mr. Mardani is the recipient of the Best Student Paper Award from the 13th IEEE Workshop on Signal Processing Advances in Wireless Communications in June 2012. He also received the ADC Fellowship Award from the Digital Technology Center at the University of Minnesota for the academic years 2009–2010 and 2010–2011.

Gonzalo Mateos (M’12) received his B.Sc. degree in Electrical Engineering from Universidad de la Republica (UdelaR), Montevideo, Uruguay, in 2005, and the M.Sc. and Ph.D. degrees in Electrical and Computer Engineering from the University of Minnesota, Minneapolis, in 2009 and 2011.

Since 2012, he has been a postdoctoral associate with the Department of Electrical and Computer Engineering and the Digital Technology Center, University of Minnesota. Since 2003, he has been an assistant with the Department of Electrical Engineering, UdelaR.

From 2004 to 2006, he worked as a Systems Engineer at Asea Brown Boveri (ABB), Uruguay. His research interests lie in the areas of communication theory, signal processing, and networking. His current research focuses on distributed signal processing, sparse linear regression, and statistical learning for social data analysis and network health monitoring.

Georgios B. Giannakis (Fellow’97) received his Diploma in Electrical Engr. from the Ntl. Tech. Univ. of Athens, Greece, 1981. From 1982 to 1986 he was with the Univ. of Southern California (USC), where he received his M.Sc. in Electrical Engineering, 1983, M.Sc. in Mathematics, 1986, and Ph.D. in Electrical Engr., 1986. Since 1999 he has been a professor with the Univ. of Minnesota, where he now holds an ADC Chair in Wireless Telecommunications in the ECE Department, and serves as director of the Digital Technology Center.

His general interests span the areas of communications, networking, and statistical signal processing, subjects on which he has published more than 325 journal papers, 525 conference papers, 20 book chapters, two edited books, and two research monographs. Current research focuses on compressive sensing, cognitive radios, cross-layer designs, wireless sensors, social and power grid networks. He is the (co-)inventor of 21 patents issued, and the (co-)recipient of 8 best paper awards from the IEEE Signal Processing (SP) and Communications Societies, including the G. Marconi Prize Paper Award in Wireless Communications. He also received Technical Achievement Awards from the SP Society (2000) and from EURASIP (2005), a Young Faculty Teaching Award, and the G. W. Taylor Award for Distinguished Research from the University of Minnesota. He is a Fellow of EURASIP, and has served the IEEE in a number of posts, including that of a Distinguished Lecturer for the IEEE-SP Society.