Multi-Low-Rank Approximation For Traffic Matrices

Saurabh Verma, Arvind Narayanan, Zhi-Li Zhang
Department of Computer Science and Engineering, University of Minnesota, USA

Email: {verma,arvind,zhzhang}@cs.umn.edu

Abstract—As Internet applications become more complex and diverse, simple network traffic matrix estimation or approximation methods such as the gravity model are no longer adequate. In this paper, we advocate a novel approach of approximating traffic matrices with multiple low-rank matrices. We build the theory behind the MULTI-LOW-RANK approximation and discuss the conditions under which it is better than Low-Rank SVD in terms of both matrix approximation and preserving the local (and global) structure of the traffic matrices. Further, we develop an effective technique based on spectral clustering of column/row feature vectors for decomposing traffic matrices into multiple low-rank matrices. We perform a series of experiments on traffic matrices extracted from a synthetic dataset and two real-world datasets, one representing nationwide cellular traffic and another taken from a tier-1 ISP. The results show that: 1) MULTI-LOW-RANK approximation is superior for traffic classification; 2) it can be used to predict complete or missing entries of traffic matrices over time; 3) it is robust against noise; and 4) it closely follows the optimal solution (i.e., the low-rank SVD solution).

Index Terms—Traffic matrix approximation, Low-rank SVD, Traffic classification.

I. INTRODUCTION

The wide proliferation of various kinds of sensors in the physical and/or cyber worlds has enabled us to collect a whole gamut of (spatial-temporal) data, e.g., voice calls between users at various locations in a cellular network, traffic between different points-of-presence (PoPs) in an ISP (Internet service provider) network, or human mobility and commuting behaviors across different locations in a transport network. Traffic matrices such as origin-destination (OD) matrices are a natural way to represent many of the datasets arising in these application domains. Here, origins and destinations may refer to a person, a location, a physical object, or an abstract entity. Each cell of an OD matrix quantifies the relation between a pair of origin and destination using a certain metric, e.g., traffic volume or activity counts, observed during a given time interval. We will use the terms traffic matrices and OD matrices interchangeably. With the abundance of such data, extracting meaningful patterns from OD matrices is an important data analysis task with wide applications, from network traffic engineering to urban transportation management, smart city planning, social behavior analysis, and cyber-physical world security.

Perhaps the most prominent area where OD matrices have been widely studied in the past is Internet traffic analysis, where the gravity model, originally proposed for traffic analysis in transportation networks [1], and its extensions have been developed for PoP-level IP traffic matrix estimation [2], [3], [4]. These approaches essentially assume that a (PoP-level) OD traffic matrix can be approximated by a low-rank matrix, and in the case of the standard gravity model, a rank-1 matrix. Such an assumption may be justified if traffic flows on different links of the network are roughly independent [5]. The goal is to characterize the entire OD matrix for the purpose of IP traffic matrix estimation. However, the emergence of large cloud-based application service providers such as Google, Facebook, Amazon, and Netflix, coupled with the dominant role of content distribution networks (CDNs) in content delivery, has altered Internet traffic dynamics. It is shown in [6] that the simple gravity model is no longer sufficient to capture and model the IP traffic matrices of a large ISP network.

In this paper, we start with a brief discussion of related work on traffic matrix approximation based on conventional methods such as PCA/SVD and NMF (non-negative matrix factorization). All these approaches implicitly assume that the observed data points come from a latent linear subspace with "noise", and thus can be approximated using a low-rank matrix. However, as we have observed from many real-world network traffic matrices, such as those from voice communications in cellular networks or data communications in large ISP backbone networks, this assumption is no longer valid. Instead, we postulate that the observed global traffic matrix in a network is likely an aggregate of many diverse traffic patterns, each of which reflects a distinct class of application/user communication structures or behaviors, and thus can be approximated by a low-rank matrix. In other words, the observed data points represent a mixture of several (latent) low-dimensional linear sub-manifolds. In such a setting, we argue that using the standard SVD/PCA to approximate the entire traffic matrix is not adequate; to preserve the local structures inherent in the data, it is best to approximate it using multiple low-rank matrices, each capturing one latent sub-manifold (see Section III).

Based on the above intuition, we develop a theory behind our proposed MULTI-LOW-RANK approximation method and discuss the conditions under which it is better than a Low-Rank SVD in terms of both matrix approximation and preserving the local (and global) structure of the traffic matrices (see Section IV). In order to identify and extract sub-matrices corresponding to clusters of data points lying in various linear sub-manifolds, we develop an effective technique based on spectral clustering of row/column feature vectors. We first


convert the original OD matrix into a new (feature) similarity graph using a Gaussian kernel, treating each row (or column) as a feature vector. We then apply spectral clustering to project the feature space into lower-dimensional manifolds for traffic matrix decomposition and MULTI-LOW-RANK matrix approximation (see Section VI). In Sections VII–XI, we perform a series of experiments on traffic matrices extracted from a synthetic dataset and two real-world datasets, one representing nationwide cellular traffic and another taken from a tier-1 ISP. The results show that MULTI-LOW-RANK approximation is superior for traffic classification. It can be used to predict complete or missing entries of traffic matrices over time; it is also robust against noise and closely follows the optimal solution (via SVD).

II. BACKGROUND AND RELATED WORK

One of the earlier attempts at extracting meaningful patterns from traffic matrices was made in [7], where principal component analysis (PCA) was employed to approximate a low-rank matrix by performing either an eigenvalue decomposition of the covariance or a singular value decomposition (SVD) of the data matrix. Since then, many variants of the PCA-based approach (e.g., robust PCA for handling noise and outliers) and other related techniques such as latent semantic indexing (LSI) have been developed for anomaly detection and network tomography [8], [9], [10]. The basic premise of PCA (or SVD or LSI) is that the data lies in a linear subspace of a high-dimensional space. This premise is invalidated when there are non-linear relations or patterns in the data, even though such patterns may lie in multiple linear sub-manifolds of a low-dimensional space and have low intrinsic dimension. In such cases, PCA (or SVD) fails to capture the local structure of the data and cannot approximate the low-rank matrix accurately. Another matrix factorization approach is NMF [11], developed to address the interpretability issue associated with low-rank matrix approximations. However, NMF also assumes that the entities lie in a lower linear subspace of the original high-dimensional data space.

Besides analysis through matrix decomposition, much of the previous work on traffic matrix prediction focuses on filling missing entries in a partially observed matrix under various assumptions such as sparsity or spatio-temporal constraints [12], [13], [14]. Most of these problems are solved using techniques from compressive sensing or matrix completion [15]. In our case, however, we are dealing with partially observed matrices and are interested in predicting the full matrix at future times. This is possible if the structure of the traffic matrices remains intact over time, in the sense that only a few local structures of the matrix change at a particular time across different time domains. In [16], such a problem is addressed by first constructing a partially observed matrix from important OD flow links identified in past data, and then converting the main problem into the popular traffic matrix prediction problem with missing entries. This approach only takes the global structure of the data into account and makes no attempt to preserve local, but important, structures of the data.

III. MULTI-LOW-RANK INTRODUCTION AND MOTIVATION

Fig. 1: Data consisting of three almost linear sub-manifolds M1, M2, M3.

In this paper, we are interested in approximating a class of Internet traffic matrices composed of multiple linear sub-manifolds, as shown in Figure 1. This special structure allows us to predict such traffic matrices over the time domain based on the information available from present-day traffic matrices. This is possible because, over the time domain, most local structures (or clusters) in traffic matrices are preserved while only a few of them change, and it can be accomplished by approximating matrices with multiple low-rank sub-matrices. Through a series of experiments, we demonstrate and argue that it is more appropriate to account for these clusters while approximating the traffic matrices.

Consider a low-rank approximation $\hat{A} \in \mathbb{R}^{n\times m}$ of a matrix $A \in \mathbb{R}^{n\times m}$. In general, $\hat{A}$ is obtained by minimizing the following objective function, where $\|\cdot\|_F$ is the Frobenius norm of a matrix:

$$\underset{\hat{A}}{\text{minimize}} \;\; \|A - \hat{A}\|_F^2 \quad \text{subject to} \;\; \mathrm{rank}(\hat{A}) \le r \tag{1}$$

The optimal solution of the above problem is obtained through the Low-Rank SVD of A, i.e., $\hat{A} = U\Sigma V^T$, where $\Sigma \in \mathbb{R}^{r\times r}$ contains the top r singular values and $U \in \mathbb{R}^{n\times r}$, $V \in \mathbb{R}^{m\times r}$ are orthogonal matrices containing the corresponding r left and r right singular vectors, respectively.
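As a concrete illustration, the rank-r solution of Eq. (1) can be computed directly from the truncated SVD. Below is a minimal NumPy sketch (ours, not part of the paper; the helper name low_rank_svd is our own):

```python
import numpy as np

def low_rank_svd(A, r):
    """Best rank-r approximation of A in Frobenius norm (Eq. 1), via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
```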

SVD provides the best approximation of a matrix under a rank constraint, but it does not guarantee preserving the important local structure (i.e., clusters) of the data. For example, consider the special case of a block diagonal matrix A, commonly seen in origin-destination (OD) matrices representing Internet traffic data. Here, we are interested in a low-rank matrix approximation that also preserves the useful local structure of this block diagonal matrix.

Let

$$A = \begin{bmatrix} A_1 & 0 & 0 \\ 0 & A_2 & 0 \\ 0 & 0 & A_3 \end{bmatrix} = U\Sigma V^T, \quad \text{where} \quad U\Sigma V^T = \begin{bmatrix} U_1 & 0 & 0 \\ 0 & U_2 & 0 \\ 0 & 0 & U_3 \end{bmatrix} \begin{bmatrix} \Sigma_1 & 0 & 0 \\ 0 & \Sigma_2 & 0 \\ 0 & 0 & \Sigma_3 \end{bmatrix} \begin{bmatrix} V_1 & 0 & 0 \\ 0 & V_2 & 0 \\ 0 & 0 & V_3 \end{bmatrix}^T$$


Fig. 2: Visualization of the local and global structure of a real traffic dataset in R² using t-SNE. (a) Structure of the real traffic dataset, depicting a total of 16 local structures/clusters. (b) Structure of the data after Low-Rank SVD approximation, with error rate = 41.28%. (c) Structure of the data after MULTI-LOW-RANK SVD approximation, with error rate = 46.66%. Comparing (b) and (c) reveals that MULTI-LOW-RANK SVD approximation (with each sub-matrix of rank = 1) preserves the structure of the data much better than Low-Rank SVD approximation (a single matrix of rank = 16) by sacrificing a small amount of approximation error.

Let $\hat{A}$ be the Low-Rank SVD approximation of A with rank = r, and let $\sigma_r(A_1) \ge \sigma_1(A_2)$, where $\sigma_j(A)$ is the top j-th singular value of matrix A. Then $\hat{A}$ is obtained as:

$$\hat{A} = \begin{bmatrix} U_1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} \Sigma_1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} V_1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}^T$$

As a result, no information about the $A_2$ and $A_3$ clusters is extracted from the matrix. Even if we increase the rank r, it is quite possible that a dominant cluster with higher singular values masks the extraction of less dominant but important local clusters of the data. To solve this issue, we propose the MULTI-LOW-RANK approach to preserve the useful cluster information in the matrix approximation. Specifically, we consider approximating the matrix A via multiple low-rank sub-matrices $\{\hat{A}_s\}_{s=1}^{C}$, where each $\hat{A}_s$ contains some local structure information and C is the total number of sub-matrices.
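This masking effect can be reproduced numerically. The sketch below is our illustration (reusing low_rank_svd from the earlier snippet): it builds a block diagonal matrix whose first block dominates in singular values, so a global rank-3 SVD spends its entire rank budget on $A_1$ and essentially zeroes out $A_2$ and $A_3$, while a rank-1 SVD per block retains all three clusters:

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(0)
# A1 is scaled up so its top singular values dominate those of A2 and A3.
blocks = [10.0 * rng.random((20, 20)), rng.random((20, 20)), rng.random((20, 20))]
A = block_diag(*blocks)

A_hat = low_rank_svd(A, 3)                              # global rank-3 SVD
S = block_diag(*[low_rank_svd(B, 1) for B in blocks])   # rank-1 SVD per block

# Energy recovered in the A2 block: near zero for the global SVD.
print(np.linalg.norm(A_hat[20:40, 20:40]), np.linalg.norm(S[20:40, 20:40]))
```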

To motivate this further, consider a real traffic dataset (see the dataset description and the definition of the approximation error rate in Section VIII). We apply the state-of-the-art visualization technique t-SNE [17], in conjunction with spectral clustering, to reveal the local and global structure of the data in the R² plane, as shown in Figure 2a. Note that we found a total of 16 local structures/clusters in the data. We then apply t-SNE to the approximated data obtained through Low-Rank SVD and through our proposed MULTI-LOW-RANK SVD approximation method. For MULTI-LOW-RANK SVD approximation, we set rank = 1 for each sub-matrix/cluster. To make a fair comparison, we approximate the (single) data matrix via Low-Rank SVD with rank = 16. The results, shown in Figures 2b & 2c, clearly show that MULTI-LOW-RANK SVD preserves the overall structure of the data much better than Low-Rank SVD without significantly compromising the quality of approximation (only ∼5.38% of accuracy is lost).

IV. MULTI-LOW-RANK APPROXIMATION THEORY

In this section, we provide some theoretical results to help justify the approach taken in this paper and the condition under which MULTI-LOW-RANK SVD is superior to Low-Rank SVD, particularly for approximation purposes. First, we state the following proposition:

Proposition 1. Let $A = \begin{bmatrix} A_1 \\ A_2 \end{bmatrix}$ be a matrix and $S = \begin{bmatrix} S_1 \\ S_2 \end{bmatrix}$ be the approximation of A under the rank constraint (rank ≤ r) on the sub-matrices $S_1$, $S_2$. Then the optimal approximation is given by $S = \begin{bmatrix} \hat{A}_1 \\ \hat{A}_2 \end{bmatrix}$, where $\hat{A}_1$, $\hat{A}_2$ are rank ≤ r matrices obtained via Low-Rank SVD.

Proof. Formally, the problem can be stated as:

$$\underset{S_1, S_2}{\text{minimize}} \;\; \|A - S\|_F^2 \quad \text{subject to} \;\; \mathrm{rank}(S_1) \le r, \; \mathrm{rank}(S_2) \le r \tag{2}$$

The proof is based on the following simple observation:

$$\|A - S\|_F^2 = \underbrace{\|A_1 - S_1\|_F^2}_{f_1(S_1)} + \underbrace{\|A_2 - S_2\|_F^2}_{f_2(S_2)} \tag{3}$$

Since $f_1(S_1)$, $f_2(S_2)$ and the corresponding constraint variables are independent, Eq. (2) can be written as:

$$\min_{S_1} \|A_1 - S_1\|_F^2 + \min_{S_2} \|A_2 - S_2\|_F^2 \quad \text{s.t.} \;\; \mathrm{rank}(S_1) \le r, \; \mathrm{rank}(S_2) \le r \tag{4}$$

Thus, the optimal solution is straightforwardly obtained via the SVD of $A_1$, $A_2$, as shown in Eq. (1).

Remarks. Proposition 1 shows that under a rank constraint on the sub-matrices of a matrix, Low-Rank SVD still provides the best approximation when performed on the corresponding sub-matrices. The above result is valid only for the Frobenius norm and does not necessarily hold for the spectral norm.

Next, we derive the condition under which approximating a matrix with multiple low-rank sub-matrices is better than approximating it with a single (global) rank for the overall matrix.

Theorem 1. Let $A \in \mathbb{R}^{n\times m}$ be a matrix of the given form:

$$A = \begin{bmatrix} A_1 & A_2 \\ A_3 & A_4 \end{bmatrix}, \quad S = \begin{bmatrix} \hat{A}_1 & \hat{A}_2 \\ \hat{A}_3 & \hat{A}_4 \end{bmatrix}$$

where the $A_s$ are sub-matrices. Let $\hat{A}$ be the Low-Rank SVD of A with rank $r \le \min(n, m)$. Similarly, let S contain sub-matrices $\hat{A}_s$ of rank $r_s \le \min(n_s, m_s)$.


Then $\|A - S\|_F^2 \le \|A - \hat{A}\|_F^2$ holds if the condition $\Big(\sum_{s=1}^{4}\sum_{j=1}^{r_s}\sigma_{s_j}^2 - \sum_{j=1}^{r}\sigma_j^2\Big) \ge 0$ is satisfied, where $\sigma_j$ is the top j-th singular value of A and $\sigma_{s_j}$ is the top j-th singular value of the matrix $A_s$.

Proof: The Low-Rank SVD matrix approximation error for rank r is given by (here $q = \min(n, m)$ and $q_s = \min(n_s, m_s)$):

$$\|A - \hat{A}\|_F^2 = \sum_{j=r+1}^{q} \sigma_j^2, \qquad \|A - S\|_F^2 = \|A_1 - \hat{A}_1\|_F^2 + \cdots + \|A_4 - \hat{A}_4\|_F^2 = \sum_{s=1}^{4}\sum_{j=r_s+1}^{q_s} \sigma_{s_j}^2 \tag{5}$$

Further, we can rewrite the trace of $A^TA$ in terms of the sub-matrices:

$$\mathrm{trace}(A^TA) = \mathrm{trace}(A_1^TA_1) + \cdots + \mathrm{trace}(A_4^TA_4) \;\;\Longrightarrow\;\; \sum_{j=1}^{q} \sigma_j^2 = \sum_{s=1}^{4}\sum_{j=1}^{q_s} \sigma_{s_j}^2 \tag{6}$$

Here we used the facts that the eigenvalues of $A^TA$ are the squares of the singular values of A, and that $\mathrm{trace}(A^TA)$ equals the sum of the eigenvalues of the matrix. Finally, using Eqs. (5) & (6):

$$\|A - \hat{A}\|_F^2 - \|A - S\|_F^2 = \sum_{s=1}^{4}\sum_{j=1}^{r_s} \sigma_{s_j}^2 - \sum_{j=1}^{r} \sigma_j^2 \tag{7}$$

It then follows that $\|A - \hat{A}\|_F^2 \ge \|A - S\|_F^2$ if $\Big(\sum_{s=1}^{4}\sum_{j=1}^{r_s}\sigma_{s_j}^2 - \sum_{j=1}^{r}\sigma_j^2\Big) \ge 0$.

Remarks. The above result holds for the more general case of block matrices, and the proof can be derived in a similar manner. The theorem states that, in order to approximate the matrix better than a Low-Rank SVD of rank r, the sum of the squared singular values of the low-rank sub-matrices must be greater than the sum of the squared singular values of the rank-r Low-Rank SVD. The result is intuitive in the sense that we seek the dominant directions in the data, as measured by singular values; larger retained singular values under the constraint of preserving local structure indicate a better approximation of the data.

Corollary 1.1. Let $A \in \mathbb{R}^{n\times m}$ be a matrix of the given form, and let S contain sub-matrices $\hat{A}_s$ of rank r:

$$A = \begin{bmatrix} A_1 \\ \vdots \\ A_c \end{bmatrix}, \quad S = \begin{bmatrix} \hat{A}_1 \\ \vdots \\ \hat{A}_c \end{bmatrix}$$

Then $\|A - S\|_F^2 \le \|A - \hat{A}\|_F^2$ if $\Big(\sum_{s=1}^{c}\sum_{j=1}^{r}\sigma_{s_j}^2 - \sum_{j=1}^{r}\sigma_j^2\Big) \ge 0$, where $\sigma_j$ is the top j-th singular value of A and $\sigma_{s_j}$ is the top j-th singular value of $A_s$.

Proof. Follows directly from Theorem 1.

Remarks. Here, the sum of the squared singular values of each sub-matrix contributes to the quality of the matrix approximation. Therefore, identifying dominant (local) principal components results in a reduced matrix approximation error while capturing the local structure of the data.

Corollary 1.2. Let $A \in \mathbb{R}^{n\times n}$ be a block diagonal matrix of the given form, and let S contain sub-matrices $\hat{A}_s$ of rank r:

$$A = \begin{bmatrix} A_1 & & 0 \\ & \ddots & \\ 0 & & A_c \end{bmatrix}, \quad S = \begin{bmatrix} \hat{A}_1 & & 0 \\ & \ddots & \\ 0 & & \hat{A}_c \end{bmatrix}$$

Then $\|A - S\|_F^2 \le \|A - \hat{A}\|_F^2$ is always satisfied.

Proof. For a block diagonal matrix, it is easy to see that the $\sigma_j$ are the top r singular values of the set $\{\sigma_{s_j} : \sigma_{s_j} \in \sigma(A_s), \forall s\}$, where $\sigma(A_s)$ denotes the singular values of $A_s$. From there, it directly follows that $\Big(\sum_{s=1}^{c}\sum_{j=1}^{r}\sigma_{s_j}^2 - \sum_{j=1}^{r}\sigma_j^2\Big) \ge 0$.

Remarks. Thus, if the data has a block diagonal structure, it is always better to approximate the matrix with the MULTI-LOW-RANK method, which is guaranteed to provide a lower approximation error than Low-Rank SVD. Moreover, it also helps preserve both the local and global structure of the data.
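Corollary 1.2 is easy to check numerically. A hedged sketch (ours), using the blocks and A defined in the earlier block diagonal snippet: the per-block singular value energy at rank r always dominates the global rank-r energy.

```python
import numpy as np

def svd_vals(M):
    """Singular values of M, in descending order."""
    return np.linalg.svd(M, compute_uv=False)

r = 1  # rank used both globally and for each sub-matrix
lhs = sum((svd_vals(B)[:r] ** 2).sum() for B in blocks)  # per-block energy kept
rhs = (svd_vals(A)[:r] ** 2).sum()                       # global rank-r energy
print(lhs >= rhs)  # always True for block diagonal A (Corollary 1.2)
```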

A. Computational Complexity

The time complexity of the exact SVD of a matrix $A \in \mathbb{R}^{n\times d}$ is bounded by $O(\min\{n^2 d, n d^2\})$ [18]. As a result, in the case of MULTI-LOW-RANK approximation, the complexity is $\sum_{s=1}^{c} O(\min\{n_s^2 d, n_s d^2\})$, where $\sum_{s=1}^{c} n_s = n$ and c is the number of clusters. If $n > d$, then the time complexities of SVD and MULTI-LOW-RANK approximation are the same, i.e., $O(n d^2)$. But for high-dimensional data, where $n < d$, the complexity of MULTI-LOW-RANK is less than that of SVD, since $\sum_{s=1}^{c} O(n_s^2 d) < O(n^2 d)$.

V. MULTI-LOW-RANK BASED PROBABILISTIC MATRIX FACTORIZATION

We extend our MULTI-LOW-RANK approach to matrices with missing entries. Since the SVD objective for a matrix with missing entries (where eigenvalue decomposition no longer yields the optimal solution, so the optimization must be performed with iterative methods) is prone to over-fitting, we seek a more robust method, specifically the probabilistic matrix factorization (PMF) technique [19]. Let $A \in \mathbb{R}^{n\times m}$ be a matrix composed of multiple sub-matrices $\{A_c \in \mathbb{R}^{n_c\times m}\}_{c=1}^{C}$, each derived from one of C different distributions $\{\mathcal{D}_c\}_{c=1}^{C}$ representing the C clusters in the data. Then probabilistic matrix factorization can be generalized to multiple distributions as follows:

$$p(A \mid \{U_c, V_c, \sigma_c\}_{1}^{C}) = \prod_{c=1}^{C}\prod_{i=1}^{n_c}\prod_{j=1}^{m}\Big[\mathcal{N}(A_{c_{ij}} \mid U_{c_i}^T V_{c_j}, \sigma_c^2)\Big]^{I_{c_{ij}}} \tag{8}$$


where $U_c \in \mathbb{R}^{r\times n_c}$, $V_c \in \mathbb{R}^{r\times m}$ are latent matrices of rank r, with $U_{c_k}$, $V_{c_k}$ denoting the k-th columns of the respective matrices. Also, $\mathcal{N}(x \mid \mu, \sigma^2)$ is a Gaussian distribution with mean µ and variance σ², and $I_{c_{ij}}$ is an indicator with value 0 for missing entries of $A_c$ and 1 otherwise. Assuming zero-mean spherical Gaussian priors on $U_c$ & $V_c$, taking the log of the posterior distribution of Eq. (8), and applying MAP inference results in the minimization of the following objective function:

$$E = \sum_{c=1}^{C}\sum_{i=1}^{n_c}\sum_{j=1}^{m} I_{c_{ij}}\big(A_{c_{ij}} - U_{c_i}^T V_{c_j}\big)^2 + \sum_{c=1}^{C}\lambda_{c_u}\sum_{i=1}^{n_c}\|U_{c_i}\|_F^2 + \sum_{c=1}^{C}\lambda_{c_v}\sum_{j=1}^{m}\|V_{c_j}\|_F^2 \tag{9}$$

where $\lambda_{c_u} = \sigma_c^2/\sigma_{c_u}^2$, $\lambda_{c_v} = \sigma_c^2/\sigma_{c_v}^2$, and $\sigma_{c_u}$ & $\sigma_{c_v}$ are the prior variances on $U_c$ & $V_c$, respectively. Directly optimizing Eq. (9) is NP-hard, since the distribution assignment for each entry of A is unknown. But we can work around this problem by first performing clustering, which provides the distribution assignment for each entry. Second, assuming the distributions are independent enables us to decouple Eq. (9) into $E = \sum_{c=1}^{C} E_c$, where each $E_c$ corresponds to the PMF of one sub-matrix and can be solved independently. Also note that Eq. (9) reduces exactly to MULTI-LOW-RANK SVD under the condition that all prior variances $\sigma_{c_u}, \sigma_{c_v} \to \infty$.
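To make the decoupling concrete, here is a minimal gradient-descent sketch of a single term $E_c$ of Eq. (9). This is our illustration, not the paper's implementation; pmf_fit is a hypothetical helper, and missing entries are assumed to be marked with NaN:

```python
import numpy as np

def pmf_fit(Ac, r, lam=0.1, lr=0.01, iters=2000, seed=0):
    """Minimize one term E_c of Eq. (9) by gradient descent on U_c, V_c."""
    rng = np.random.default_rng(seed)
    n, m = Ac.shape
    I = ~np.isnan(Ac)                       # indicator I_c of observed entries
    R = np.where(I, Ac, 0.0)
    U = 0.1 * rng.standard_normal((r, n))   # latent factors, as in Section V
    V = 0.1 * rng.standard_normal((r, m))
    for _ in range(iters):
        E = I * (R - U.T @ V)               # residual on observed entries only
        U += lr * (V @ E.T - lam * U)       # step along -dE/dU (factor 2 folded into lr)
        V += lr * (U @ E - lam * V)         # step along -dE/dV
    return U, V                             # predicted sub-matrix: U.T @ V
```

Each cluster's sub-matrix is fitted independently in this way, and the products $U_c^T V_c$ are stacked to reconstruct the full matrix, mirroring the decoupling $E = \sum_c E_c$.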

VI. DECOMPOSING TRAFFIC MATRICES INTO MULTI-LOW-RANK MATRICES

For decomposing heterogeneous traffic matrices into multiple sub-matrices of lower rank, we can adopt two approaches: a) in many practical applications, such as labeled traffic classification, we can group data points with the same label into sub-matrices; b) in the absence of labeled datasets, we can adopt a clustering approach to decompose the traffic matrices. Specifically, we advocate spectral clustering due to its sound theoretical foundation and its ability to handle high dimensionality as well as non-linearity in the data. Another reason for not directly applying a standard clustering algorithm such as k-means is its reliance on clusters forming convex regions, which may not hold for data containing multiple linear sub-manifolds. In this section, we provide the details of applying our version of spectral clustering to traffic matrices, which requires constructing a similarity matrix and a graph Laplacian.

Constructing the Similarity Matrix: Computing an appropriate similarity matrix W for spectral clustering from the data matrix X is a crucial step and needs careful consideration. For high-dimensional data, the Gaussian (or heat) kernel is a suitable choice, for which theoretical motivation can be found in [20]:

$$W_{ij} = e^{-\frac{\|x_i - x_j\|^2}{2\sigma_i^2}} \tag{10}$$

The Gaussian kernel can be interpreted in many different ways, from kernel density estimation (KDE) to representing the conditional probability $p_{j|i}$ of picking $x_j$ as the neighbor of data point $x_i$. As the density can differ across regions, choosing σ appropriately in the Gaussian kernel is important and can greatly affect the mapping of embedded data points in the low-dimensional space. Instead of setting a constant σ for all data points, we propose to compute $\sigma_i$ at each data point $x_i$ such that the entropy of the distribution satisfies:

$$-\sum_j p_{j|i} \log p_{j|i} = \log k \tag{11}$$

where k is a user-defined perplexity parameter that can be interpreted as a smooth measure of the effective number of neighbors. To calculate $\sigma_i$, we perform a binary search over its value so that each data point attains entropy log k. It turns out that the similarity matrix is robust across different values of k; typically, the values lie in the range 5–50.

Constructing the Graph Laplacian: Different versions of graph Laplacians exist in the literature, but we adopt the symmetric normalized graph Laplacian (shown below) proposed by [21], as it is less susceptible to bad clustering when different clusters are connected with varied degrees:

$$L = D^{-1/2} W D^{-1/2} \tag{12}$$

where D is the diagonal degree matrix whose elements are the row sums of the similarity matrix. From the eigendecomposition of L, the d largest eigenvectors are stacked as columns of a matrix Y, which is renormalized to yield a low-dimensional representation of the data in $\mathbb{R}^d$. There are several ways to estimate the intrinsic dimension d of the data (e.g., kernel PCA), but the graph Laplacian implicitly provides a way to estimate d by examining the drop in the eigenvalues of L. A better approach to approximating the intrinsic dimension can be found in [22]. For our datasets, the spectral clustering approach was sufficient to yield faithful results. After obtaining the low-dimensional data Y, we apply a traditional clustering algorithm to obtain the clusters. In this paper, we apply DBSCAN for clustering due to its robustness against outliers. Finally, using the cluster labels, we construct sub-matrices and approximate them with low rank via SVD. Algorithm 1 shows the complete MULTI-LOW-RANK approximation method.
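The similarity matrix and Laplacian construction of Eqs. (10)-(12), with the per-point $\sigma_i$ chosen by binary search on the entropy of Eq. (11), can be sketched as follows (our illustration; the function name and tolerances are ours):

```python
import numpy as np

def similarity_and_laplacian(X, k=30, tol=1e-4):
    """Gaussian kernel W with per-point sigma_i (Eqs. 10-11) and L (Eq. 12)."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        lo, hi = 1e-10, 1e10
        for _ in range(100):                    # binary search for sigma_i
            sigma = 0.5 * (lo + hi)
            p = np.exp(-D2[i] / (2 * sigma ** 2))
            p[i] = 0.0                          # exclude self from p_{j|i}
            p /= p.sum()
            H = -(p[p > 0] * np.log(p[p > 0])).sum()  # entropy of p_{j|i}
            if abs(H - np.log(k)) < tol:
                break
            lo, hi = (sigma, hi) if H < np.log(k) else (lo, sigma)
        W[i] = np.exp(-D2[i] / (2 * sigma ** 2))
    W = 0.5 * (W + W.T)                         # symmetrize
    d = W.sum(axis=1)
    L = W / np.sqrt(np.outer(d, d))             # D^{-1/2} W D^{-1/2}
    return W, L
```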

Algorithm 1 MULTI-LOW-RANK Approximation Algorithm

1: Input: traffic matrix X ∈ R^{N×P} and perplexity k;
2: Compute matrix W:
3: for each 1 ≤ i, j ≤ N do
4:   compute σ_i for the given k using Eq. (11);
5:   compute W_ij using Eq. (10);
6: Compute D_ii = Σ_j W_ij and L = D^{−1/2} W D^{−1/2};
7: Compute the d largest eigenvectors {v_1, v_2, ..., v_d} of L and stack them to form the matrix Y ∈ R^{N×d};
8: Normalize Y to have unit-length rows;
9: Apply the DBSCAN algorithm to cluster the points in Y and obtain cluster labels;
10: Output: construct the sub-matrices {X_c}_{c=1}^{C} using the cluster labels and approximate them via PMF/SVD.
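Putting the pieces together, an end-to-end sketch of Algorithm 1 (ours, reusing similarity_and_laplacian and low_rank_svd from the earlier snippets; the DBSCAN parameters are illustrative, and noise points labeled -1 are simply treated as one more group):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def multi_low_rank(X, k=30, d=16, r=1, eps=0.5, min_samples=5):
    """Algorithm 1: spectral embedding, DBSCAN, then rank-r SVD per cluster."""
    W, L = similarity_and_laplacian(X, k)             # steps 2-6
    vals, vecs = np.linalg.eigh(L)                    # eigenvalues in ascending order
    Y = vecs[:, -d:]                                  # d largest eigenvectors (step 7)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)  # unit-length rows (step 8)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(Y)  # step 9
    X_hat = np.zeros_like(X, dtype=float)
    for c in np.unique(labels):                       # step 10: per-cluster SVD
        rows = labels == c
        X_hat[rows] = low_rank_svd(X[rows], r)
    return X_hat, labels
```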


VII. APPLICATION OF MULTI-LOW-RANK APPROXIMATION IN TRAFFIC CLASSIFICATION

To demonstrate the importance of preserving local structures while approximating the data, we perform traffic classification on the approximated data obtained via MULTI-LOW-RANK SVD and compare the performance with Low-Rank SVD. For this purpose, we employ the widely used NSL-KDD [23] benchmark dataset, which contains 15 different types of traffic (labels). Due to memory constraints, we constructed a partial data matrix X ∈ R^{1200×2058} from a subset of the data and performed one-hot encoding for each categorical feature. Next, we follow the standard procedure of 10-fold cross validation with the LIBSVM [24] library to test classification performance. The following methods were considered for comparison: 1) MULTI-LOW-RANK SVD via traffic labels, approximating X with 15 sub-matrices and setting rank = 1 for each of them; 2) MULTI-LOW-RANK SVD via spectral clustering to obtain labels and create sub-matrices; in this case we obtain 11 such sub-matrices and set rank = 1 to approximate each of them; 3) the Low-Rank SVD method, with rank = 15 to make a fair comparison; 4) the Full-Rank method (no approximation of the data).
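For reference, the label-wise variant (method 1 above) can be sketched as follows. This is our own sketch, not the paper's code: label_wise_low_rank is a hypothetical helper reusing low_rank_svd from Section III, and scikit-learn's SVC stands in for LIBSVM:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def label_wise_low_rank(X, y, r=1):
    """Approximate each label's sub-matrix with a rank-r SVD (method 1)."""
    X_hat = np.zeros_like(X, dtype=float)
    for label in np.unique(y):
        rows = y == label
        X_hat[rows] = low_rank_svd(X[rows], r)
    return X_hat

# X, y: one-hot-encoded NSL-KDD features and traffic labels (assumed given).
# scores = cross_val_score(SVC(), label_wise_low_rank(X, y), y, cv=10)
```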

TABLE I: Traffic classification accuracy on NSL-KDD data.

Method:   Full-Rank | Low-Rank SVD | MULTI-LOW-RANK SVD via Spectral | MULTI-LOW-RANK via Label
Accuracy: 95.28%    | 90.10%       | 94.60%                          | 96.58%

Table I shows the traffic classification accuracy of each method. It is clear from the results that MULTI-LOW-RANK SVD approximation outperforms Low-Rank SVD on the traffic classification task. The results conform with the reasoning about preserving the local and global structure of the data. Furthermore, MULTI-LOW-RANK SVD approximation via traffic labels surprisingly outperforms even the Full-Rank method. This is because all the local structures are well approximated by only a few dominant features, which boosts the performance of the SVM compared to just using the raw features.

VIII. COMPARISON AND EVALUATION OF MULTI-LOW-RANK APPROXIMATION ON REAL DATASETS

Our first real dataset (RD1) represents nationwide cellular traffic extracted from a call detail record (CDR) dataset. This dataset consists of millions of (voice and text) call records captured by over 1000 base stations (or towers) spanning an entire African nation for over a month. Two features make this dataset a good candidate for our traffic matrix analysis: 1) we have information about both the origin and destination towers associated with every call, thereby capturing the amount of traffic transmitted or received by towers (and even the traffic volume between towers); and 2) the dataset represents an entire nation, which makes the traffic matrix rich with diverse patterns. For this dataset, we consider the traffic matrix to be in the form of an origin-destination matrix, where origins and destinations correspond to the cellular towers. Each value in the matrix denotes the number of calls made from the corresponding origin (row) to some destination (column). As a result, we obtain a data matrix A ∈ R^{1214×1214}.

To evaluate the MULTI-LOW-RANK approximation method, we use the following relative Frobenius norm difference metric for the matrix approximation error rate: $\|A - \hat{A}\|_F^2 / (\|A\|_F^2 + \|\hat{A}\|_F^2)$, which bounds its range in [0, 1]. We compare our approach with existing algorithms, specifically Bi-clustering, non-negative matrix factorization (NMF), and Low-Rank SVD. Bi-clustering (or Spectral Co-Clustering) clusters both rows and columns simultaneously, while NMF is popular for factorizing into non-negative matrices in order to obtain a better interpretation.
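The error metric is straightforward to implement; a minimal helper (ours) matching the definition above:

```python
import numpy as np

def approx_error(A, A_hat):
    """Relative Frobenius error of Section VIII, bounded in [0, 1]."""
    num = np.linalg.norm(A - A_hat, 'fro') ** 2
    den = np.linalg.norm(A, 'fro') ** 2 + np.linalg.norm(A_hat, 'fro') ** 2
    return num / den
```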

Fig. 3: (a) Matrix approximation error rate of Multi-Low-Rank, Bi-Clustering, NMF, and Low-Rank SVD as a function of rank. (b) Approximation error rate of Multi-Low-Rank vs. Low-Rank SVD with respect to the overall rank.

Figure 3a shows the matrix approximation error rate of the different algorithms with respect to the rank. It is important to distinguish the meaning of "rank" in the context of the different algorithms. For the spectral and bi-clustering algorithms, the rank applies to each sub-matrix representing a cluster; in our case, we found 16 clusters for this dataset, and the matrix approximation error rate is the sum of the approximation errors of the sub-matrices. The rank for Low-Rank SVD and NMF represents the single (global) rank of the full matrix. Figure 3a highlights that Low-Rank SVD or NMF requires a higher rank to approximate the traffic matrix with low error; in contrast, it is better to approximate the traffic matrix with multiple ranks corresponding to a lower error rate, which tends to preserve the local as well as global structure of the data. However, it is important to point out that we are no longer approximating the matrix A via a single rank-r matrix, but rather via 16 small sub-matrices of rank r (i.e., a total rank of up to 16r), so a direct comparison may seem misleading; we therefore provide Figure 3b to be more thorough and discuss its implications below. The larger point is that approximating a matrix via multiple sub-matrices can be made computationally cheaper and can be parallelized for faster processing. Also, our proposed MULTI-LOW-RANK via spectral clustering outperforms the bi-clustering algorithm in approximating the overall matrix.

We provide Figure 3b to understand how MULTI-LOW-RANK performs with respect to the optimal solution, and to make a fairer comparison with Low-Rank SVD. For this case, we evaluate the performance of Low-Rank SVD with the (global) rank set equal to the sum of the ranks of the sub-matrices (the sub-matrices are obtained from the multi-low-rank method), while varying the individual (shared) rank of the sub-matrices. Figure 3b shows that, as we increase the rank of each sub-matrix, approximating the matrix with multiple ranks closely follows the optimal solution obtained via SVD with a single rank (equal to the sum of the multiple ranks), while also tending to preserve local structure information (by design), in contrast with Low-Rank SVD.

Fig. 4: Variation of the MULTI-LOW-RANK prediction error rate over different time domains on the RD1 dataset. (a) Matrix prediction error rate over consecutive days in the 10AM-1PM period. (b) Matrix prediction error rate over consecutive days in the 8-9AM period. (c) Matrix prediction error over consecutive weeks in the 10AM-1PM period.

IX. MULTI-LOW-RANK PREDICTION ON REAL DATASETS

In this section, we investigate matrix prediction over different time domains, first evaluating the results over different days of a week. Here, we use the following matrix prediction error rate: $\|A_i - A_M\|_F^2 / (\|A_i\|_F^2 + \|A_M\|_F^2)$, where $A_i$ is the i-th day's matrix (e.g., Tuesday), which we predict using Monday's matrix $A_M$ approximated with different ranks. Figure 4a shows the prediction error over a week with respect to different ranks on the RD1 dataset. We can observe that a rank in the range 10–15 for each cluster is quite sufficient to approximate the overall matrix for prediction purposes. The prediction error is also reasonable, around 0.25, meaning the prediction is close to the reference matrix. This trend is consistent over different time periods, as shown for the morning period in Figure 4b. In fact, the same trend can be observed across the four weeks of a month, as shown in Figure 4c. The fifth week (i.e., the first week of the next month) is closest in approximation to the first week of the previous month, suggesting the existence of a repetitive pattern each month.

We also compute the per-cluster approximation error rate on the RD1 dataset, using the same (prediction error rate) metric as before. Figure 6a shows the error over different days. A few clusters, such as 3, 5, and 7, exhibit low approximation error on all days, suggesting that their behavior does not change over time; they may form the core of the traffic matrix. Some clusters show an increase in approximation error, suggesting that they are responsible for a pattern specific to a particular day. For instance, clusters 9, 10, and 13 have relatively higher approximation error on Wednesday compared to the other days, while on Friday most clusters have relatively low approximation error except cluster 12, which suggests the existence of a unique pattern (perhaps because Friday is the last working day of the week). Similar trends are also observed over weeks, as shown in Figure 6b.

To get a deeper understanding of how the local structures of these MULTI-LOW-RANK matrices change over the time domain, we adopt the t-distributed stochastic neighbor embedding (t-SNE) [17] visualization technique for projecting the data points into low-dimensional space (here, $\mathbb{R}^2$). t-SNE projects the data into a lower-dimensional space by minimizing the KL-divergence between the distributions in the higher and lower dimensions. Once we have the data points in 2D space, we use the cluster labels obtained from spectral clustering to label the points belonging to different clusters. Figure 5 shows the structure of the clusters over different days for the same time period of the day. Comparing Figure 5a (Monday) with Figure 5b (the next day), we observe that most cluster structures remain intact (e.g., cluster labels 1, 3, 4, 6, etc.) over all the considered days. Only a few clusters, such as 9 and 14, seem to merge into a single cluster, thereby dissolving a certain hidden pattern present on Monday and giving rise to a new one on Tuesday. Similarly, cluster 12 seems to split into two parts, where one part merges with cluster 10 and the other remains separate. Based on Figures 5a & 5b, it is evident that the traffic matrix does not change much in the immediate future. Looking further along the time domain, specifically two days ahead (i.e., on Thursday), most of the cluster (or partial cluster) structures again remain intact and stable. However, parts of some clusters break down and form new smaller clusters, while a few other clusters merge to form a single new cluster. The semantics behind the formation of these clusters is beyond the scope of this paper.

Fig. 5: Visualizing the variation in the local (and global) structure of traffic matrices over different time periods using t-SNE in conjunction with spectral clustering on the RD1 dataset. (a) Structure of clusters on Monday. (b) Structure of clusters on Tuesday. (c) Structure of clusters on Thursday.

Fig. 6: (a) Relative cluster approximation error rate over days on the RD1 dataset. (b) Relative cluster approximation error rate over weeks on the RD1 dataset.

The second real-world dataset (RD2) consists of (sampled) netflow records collected by a tier-1 ISP at various PoP locations in the US and Europe. For every netflow record, we have information such as the IP address, port, and autonomous system number (ASN) for both the source and destination ends of the flow. To extract a traffic matrix from this dataset, we consider only web traffic (i.e., either the source port or the destination port equals 80) and construct an ASN-to-ASN matrix, where every cell represents the number of netflow records. In other words, this matrix represents the traffic between different ASNs.

To show the efficacy of MULTI-LOW-RANK approximation on the RD2 dataset, we predict the traffic matrices of the subsequent hour. For instance, we use the ASN-to-ASN traffic matrix from 8PM to 9PM to predict that of 9PM to 10PM. Similarly, we use the ASN-to-ASN matrix capturing traffic between 9PM and 10PM to predict that from 10PM to 11PM, and so on. Figure 7 shows that the matrix approximation error for all the predictions is about 0.18–0.20, which is quite acceptable. Such a task is especially useful for actively processing streaming traffic data and predicting the subsequent hour's requirement. Such a system could provide valuable insights to ISPs for dynamically provisioning resources according to the changing requirements seen during different times of the day.

A. Multi-Low-Rank PMF Prediction on Real Datasets

In this subsection, we predict the missing entries of a matrix through MULTI-LOW-RANK PMF, discussed in Section V, and compare the performance with Low-Rank PMF on the RD1 and RD2 datasets. In our case, RD1 and RD2 contain a total of 1,473,796 and 23,668,225 records (or entries), respectively. For evaluation, we use 90% of the records as training data and set aside 10% of the records for testing. The following standard root mean square error (RMSE) is used to measure the performance on the test data: $\sqrt{\frac{1}{N}\sum_{i=1}^{N}(a_i - \hat{a}_i)^2} \in [0, \infty)$, where N is the number of records and $a_i$, $\hat{a}_i$ are the actual and predicted values of record i. For a fair comparison, we set the number of latent features of Low-Rank PMF equal to the sum of the latent features of the sub-matrices in MULTI-LOW-RANK PMF.

TABLE II: Prediction error on the RD1 & RD2 datasets.

Dataset / Method:  LOW-RANK PMF | MULTI-LOW-RANK PMF
RD1 RMSE:          1.53         | 1.49
RD2 RMSE:          2.34         | 2.11

Table II shows that MULTI-LOW-RANK PMF is slightly better at predicting the missing entries of a matrix than Low-Rank PMF. This is due to the ability of MULTI-LOW-RANK PMF to preserve local structures, and to the ease of reaching better local minima when optimizing over smaller sub-matrices.

X. ROBUSTNESS OF MULTI-LOW-RANK APPROXIMATION

Finally, we evaluate the ability of MULTI-LOW-RANK to retrieve low-rank matrices from corrupted data with the help of the proposed spectral clustering method. For this purpose, we generate a synthetic dataset as follows. Let $X_i \in \mathbb{R}^{n_i\times d_i}$ be a low-rank matrix of rank $r_i$, corrupted by Gaussian noise $\mathcal{N}(0, \sigma_i)$ with zero mean and variance $\sigma_i$, so that $\tilde{X}_i = X_i + \mathcal{N}(0, \sigma_i)$. Let $D \in \mathbb{R}^{n\times d}$ be the block diagonal matrix $D = \mathrm{diag}(\tilde{X}_1, \ldots, \tilde{X}_c)$, so that each $\tilde{X}_i$ lies in a separate linear subspace of the high-dimensional space. Finally, we obtain our noisy data matrix as $\tilde{D} = P_1 D P_2 + \mathcal{N}(0, \sigma_D)$, where $P_1$, $P_2$ are random permutation matrices and $\sigma_D$ is the variance of the Gaussian noise added at the final stage. We normalize each entry of $X_i$ to [0, 1] so that the effect of the noise variance is observable. The following parameters are set: n = 800, d = 400, c = 8, $\sigma_D = 0.5$, r = 160 (Low-Rank SVD), and $n_i = 100$, $d_i = 50$, $r_i = 20$, $\sigma_i = 0.5$ for all i.
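Under the stated parameters, the synthetic dataset can be generated with a few lines of NumPy/SciPy. A sketch of the construction (ours, with noisy_low_rank as a hypothetical helper):

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(1)

def noisy_low_rank(n, d, r, sigma):
    """One block: a rank-r matrix normalized to [0, 1], plus Gaussian noise."""
    X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
    X = (X - X.min()) / (X.max() - X.min())
    return X + rng.normal(0.0, sigma, (n, d))

# Section X parameters: c=8 blocks of size 100x50, rank 20, sigma_i = 0.5.
D = block_diag(*[noisy_low_rank(100, 50, 20, 0.5) for _ in range(8)])
P1 = np.eye(800)[rng.permutation(800)]                 # random row permutation
P2 = np.eye(400)[rng.permutation(400)]                 # random column permutation
D_noisy = P1 @ D @ P2 + rng.normal(0.0, 0.5, D.shape)  # add final noise (sigma_D)
```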

Figure 8a shows the MULTI-LOW-RANK decomposition of the data, where each cluster is obtained through the proposed spectral clustering. This approach correctly retrieves all the low-rank sub-matrices with the appropriate rank = 20 set in the synthetic dataset. Figure 8b shows the effect of varying the noise variance σ = σ_i on MULTI-LOW-RANK matrix approximation. It reveals that the MULTI-LOW-RANK matrix approximation error rate declines linearly with increasing noise variance up to a certain level and then varies slowly with σ (here, after σ = 2). Low-Rank SVD follows a similar trend, but MULTI-LOW-RANK approximation is much better at resisting noise than Low-Rank SVD at any noise level.

Fig. 7: Prediction error for the next-hour matrix on the RD2 dataset. (a) Prediction error for the 9-10PM matrix (predicted from 8-9PM). (b) Prediction error for the 10-11PM matrix (predicted from 9-10PM). (c) Prediction error for the 11-12PM matrix (predicted from 10-11PM).

Fig. 8: (a) MULTI-LOW-RANK decomposition (per-cluster singular value spectra) on the synthetic dataset. (b) Approximation error rate with respect to the noise variance on the synthetic dataset.

XI. CONCLUSION

In this paper, we have advanced a novel approach of approximating traffic matrices with multiple ranks, as opposed to the popular single (global) rank approximation approach. We established the theory behind the MULTI-LOW-RANK approximation and identified the conditions under which the MULTI-LOW-RANK method is better than Low-Rank SVD in both matrix approximation and preserving the local (and global) structure of the traffic matrices. We developed an effective approach based on spectral clustering for MULTI-LOW-RANK matrix decomposition and approximation. Finally, with a series of experiments on synthetic and real datasets, we demonstrated that MULTI-LOW-RANK approximation yields better results in traffic classification than Low-Rank SVD, can be used to predict traffic matrices in the immediate future (especially for streaming traffic data over different time domains), and is robust against noise.

ACKNOWLEDGMENT

This research was supported in part by DTRA grant HDTRA1-14-1-0040, DoD ARO MURI Award W911NF-12-1-0385, and NSF grant CNS-1411636.

REFERENCES

[1] Y. Vardi, "Network tomography," Journal of the American Statistical Association, 1996.
[2] A. Medina, N. Taft, K. Salamatian, S. Bhattacharyya, and C. Diot, "Traffic matrix estimation: Existing techniques and new directions," ACM SIGCOMM, 2002.
[3] Y. Zhang, M. Roughan, N. Duffield, and A. Greenberg, "Fast accurate computation of large-scale IP traffic matrices from link loads," ACM SIGMETRICS, 2003.
[4] A. Soule, A. Lakhina, N. Taft, K. Papagiannaki, K. Salamatian, A. Nucci, M. Crovella, and C. Diot, "Traffic matrices: Balancing measurements, inference and modeling," ACM SIGMETRICS, 2005.
[5] V. Erramilli, M. Crovella, and N. Taft, "An independent-connection model for traffic matrices," Internet Measurement Conference, 2006.
[6] V. K. Adhikari, S. Jain, and Z.-L. Zhang, "From traffic matrix to routing matrix: PoP level traffic characteristics for a tier-1 ISP," in Proc. of ACM SIGMETRICS HotMetrics Workshop (in conjunction with the ACM SIGMETRICS 2010 Conference), June 2010.
[7] A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. Kolaczyk, and N. Taft, "Structural analysis of network traffic flows," ACM SIGMETRICS, 2004.
[8] A. Lakhina, M. Crovella, and C. Diot, "Diagnosing network-wide traffic anomalies," ACM SIGCOMM, 2004.
[9] Z. Wang, K. Hu, K. Xu, B. Yin, and X. Dong, "Structural analysis of network traffic matrix via relaxed principal component pursuit," Computer Networks, 2012.
[10] Y. Zhang, Z. Ge, A. Greenberg, and M. Roughan, "Network anomography," ACM SIGCOMM Conference on Internet Measurement, 2005.
[11] Y. Han and F. Moutarde, "Analysis of network-level traffic states using locality preservative non-negative matrix factorization," in Intelligent Transportation Systems. IEEE, 2011.
[12] Y. Zhang, M. Roughan, W. Willinger, and L. Qiu, "Spatio-temporal compressive sensing and internet traffic matrices," in ACM SIGCOMM Computer Communication Review. ACM, 2009.
[13] A. Medina, N. Taft, K. Salamatian, S. Bhattacharyya, and C. Diot, "Traffic matrix estimation: Existing techniques and new directions," ACM SIGCOMM Computer Communication Review, vol. 32, no. 4, pp. 161–174, 2002.
[14] A. Soule, A. Lakhina, N. Taft, K. Papagiannaki, K. Salamatian, A. Nucci, M. Crovella, and C. Diot, "Traffic matrices: Balancing measurements, inference and modeling," in ACM SIGMETRICS Performance Evaluation Review, vol. 33, no. 1. ACM, 2005, pp. 362–373.
[15] E. J. Candes and T. Tao, "The power of convex relaxation: Near-optimal matrix completion," IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2053–2080, 2010.
[16] Y. Song, M. Liu, S. Tang, and X. Mao, "Time series matrix factorization prediction of internet traffic matrices," in Local Computer Networks (LCN), 2012 IEEE 37th Conference on. IEEE, 2012, pp. 284–287.
[17] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research (JMLR), 2008.
[18] M. Holmes, A. Gray, and C. Isbell, "Fast SVD for large-scale matrices," in Workshop on Efficient Machine Learning at NIPS, 2007.
[19] R. Salakhutdinov and A. Mnih, "Probabilistic matrix factorization," in NIPS, 2007.
[20] M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," in Neural Information Processing Systems (NIPS), 2002.
[21] A. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," Neural Information Processing Systems (NIPS), 2002.
[22] L. Zelnik-Manor and P. Perona, "Self-tuning spectral clustering," Neural Information Processing Systems (NIPS), 2005.
[23] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, "A detailed analysis of the KDD CUP 99 data set," in Computational Intelligence for Security and Defense Applications. IEEE, 2009.
[24] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.