
MStream: Fast Anomaly Detection in Multi-Aspect Streams



MSTREAM: Fast Anomaly Detection in Multi-Aspect Streams

Siddharth Bhatia, National University of Singapore ([email protected])
Arjit Jain, IIT Bombay ([email protected])
Pan Li, Purdue University ([email protected])
Ritesh Kumar, IIT Kanpur ([email protected])
Bryan Hooi, National University of Singapore ([email protected])

ABSTRACT

Given a stream of entries in a multi-aspect data setting, i.e., entries having multiple dimensions, how can we detect anomalous activities in an unsupervised manner? For example, in the intrusion detection setting, existing work seeks to detect anomalous events or edges in dynamic graph streams, but this does not allow us to take into account additional attributes of each entry. Our work aims to define a streaming multi-aspect data anomaly detection framework, termed MSTREAM, which can detect unusual group anomalies as they occur, in a dynamic manner. MSTREAM has the following properties: (a) it detects anomalies in multi-aspect data including both categorical and numeric attributes; (b) it is online, thus processing each record in constant time and constant memory; (c) it can capture the correlation between multiple aspects of the data. MSTREAM is evaluated over the KDDCUP99, CICIDS-DoS, UNSW-NB 15 and CICIDS-DDoS datasets, and outperforms state-of-the-art baselines.

CCS CONCEPTS

• Computing methodologies → Anomaly detection; Online learning settings; • Security and privacy → Intrusion detection systems.

KEYWORDS

Anomaly Detection, Multi-Aspect Data, Streaming, Intrusion Detection

1 INTRODUCTION

Given a stream of entries (i.e. records) in multi-aspect data (i.e. data having multiple features or dimensions), how can we detect anomalous behavior, including group anomalies involving the sudden appearance of large groups of suspicious activity, in an unsupervised manner?

In particular, we focus on an important application of anomaly detection to intrusion detection in networks. In this application, we want to design algorithms that monitor a stream of records, each of which represents a single connection (or ‘flow’) over the network. We aim to detect multiple types of suspicious activities, such as denial of service or port scanning attacks, whereby attackers make a large number of connections to a target server to make it inaccessible or to look for vulnerabilities.

Recent intrusion detection datasets typically report tens of features for each individual flow, such as its source and destination IP, port, protocol, average packet size, etc. This makes it important to design approaches that can handle multi-aspect data. In addition, to effectively guard against attacks, it is important for our algorithm to process the data in a streaming manner, so that we can quickly report any attack in real-time, as soon as it arrives.

Some existing approaches for this problem aim to detect point anomalies, or individually unusual connections. However, as this ignores the relationships between records, it does not effectively detect large and suddenly appearing groups of connections, as is the case in denial of service and other attacks. For detecting such groups, there are also existing methods based on dense subgraph detection [1] as well as dense subtensor detection [2, 3]. However, these approaches are generally designed for datasets with a smaller number of dimensions, thus facing significant difficulties scaling to our dataset sizes. Moreover, they treat all variables of the dataset as categorical variables, whereas our approach can handle arbitrary mixtures of categorical variables (e.g. source IP address) and numerical variables (e.g. average packet size).

Hence, in this work, we propose MSTREAM, a method for processing a stream of multi-aspect data that detects group anomalies, i.e. the sudden appearance of large amounts of suspiciously similar activity. Our approach naturally allows for similarity both in terms of categorical variables (e.g. a small group of repeated IP addresses creating a large number of connections), as well as in numerical variables (e.g. numerically similar values for average packet size).

MSTREAM is a streaming approach that performs each update in constant memory and time. This is constant both with respect to the stream length as well as in the number of attribute values for each attribute: this contrasts with tensor decomposition-based approaches such as STA and dense subtensor-based approaches such as DENSEALERT, where memory usage grows in the number of possible attribute values. To do this, our approach makes use of locality-sensitive hash functions (LSH), which process the data in a streaming manner while allowing connections which form group anomalies to be jointly detected, as they consist of similar attribute values and hence are mapped into similar buckets by the hash functions. Finally, we demonstrate that the anomalies detected by MSTREAM are explainable.

To incorporate correlation between features, we further propose MSTREAM-PCA, MSTREAM-IB and MSTREAM-AE, which leverage Principal Component Analysis (PCA), Information Bottleneck (IB), and Autoencoder (AE) methods respectively to map the original features into a lower-dimensional space and then execute MSTREAM in this lower-dimensional space. MSTREAM-AE is shown to provide better anomaly detection performance while also improving speed compared to MSTREAM, due to its lower number of dimensions.

In summary, the main contributions of our approach are:


(1) Multi-Aspect Group Anomaly Detection: We propose a novel approach for detecting group anomalies in multi-aspect data, including both categorical and numeric attributes.

(2) Streaming Approach: Our approach processes the data in a fast and streaming fashion, performing each update in constant time and memory.

(3) Effectiveness: Our experimental results show that MSTREAM outperforms baseline approaches.

Reproducibility: Our code and datasets are publicly available athttps://github.com/Stream-AD/MStream.

2 RELATED WORK

Our work is closely related to areas like anomaly detection on graphs [4–17], graph and stream classification [18–22], and outlier detection on streams [23–27]. In this section, we limit our review to previous approaches for detecting anomalies on edge streams, tensors and multi-aspect data. See [28] for an extensive survey on tensor-based anomaly detection.

Anomaly Detection in Edge Streams uses as input a stream of edges over time. We categorize these methods according to the type of anomaly detected.

• Anomalous node detection: Given an edge stream, [29] detects nodes whose egonets suddenly and significantly change.

• Anomalous subgraph detection: Given an edge stream, DENSEALERT [2] identifies dense subtensors created within a short time.

• Anomalous edge detection: RHSS [30] focuses on sparsely-connected parts of a graph, while SEDANSPOT [31] identifies edge anomalies based on edge occurrence, preferential attachment, and mutual neighbors. MIDAS [1] identifies microcluster-based anomalies, or suddenly arriving groups of suspiciously similar edges.

Anomaly Detection in Multi-Aspect Data Streams uses as input a stream of multi-aspect data records over time. Each multi-aspect data record can also be considered as an edge of an attributed graph having multiple attributes. Therefore, in addition to detecting anomalies in multi-aspect data streams, the following approaches can also detect anomalies in edge streams.

• Score plot based: Score plots are obtained from tensor decomposition and are then analyzed manually or automatically for anomaly detection. These score plots can be one-dimensional [32], multi-dimensional (MALSPOT [33]) or time-series [34].

• Histogram based: MASTA [35] uses histogram approximation to analyze tensors. It vectorizes the whole tensor and simultaneously segments it into slices in each mode. The distribution of each slice is compared against the vectorized tensor to identify anomalous slices.

• Tensor decomposition based: Tensor decomposition methods such as [36] can be used to find anomalies. [37] and STA [3] are streaming algorithms for CP and Tucker decompositions. STenSr [38] models the tensor stream as a single incremental tensor representing the entire network, instead of dealing with each tensor in the stream separately. [39] uses subspace learning in tensors to find anomalies. STA monitors the streaming decomposition reconstruction error for each tensor at each time instant, and anomalies occur when this error goes beyond a pre-defined threshold. However, [2] shows limited accuracy for dense-subtensor detection based on tensor decomposition.

• Dense subtensor detection based: Dense-subtensor detection has been used to detect anomalies in M-ZOOM [40], D-CUBE [41], [42] and CROSSSPOT [43], but these approaches consider the data as a static tensor. DENSEALERT [2] is a streaming algorithm that identifies dense subtensors created within a short time.

Other Approaches for Anomaly Detection can typically be used in multi-aspect settings by converting categorical attributes to numerical ones, e.g. using one-hot encoding. Elliptic Envelope [44] fits an ellipse to the normal data points by fitting a robust covariance estimate to the data. Local Outlier Factor (LOF) [45] estimates the local density at each point, then identifies anomalies as points with much lower local density than their neighbors. Isolation Forest (I-Forest) [46] constructs trees by randomly selecting features and splitting them at random split points, and then defines anomalies as points which are separated from the rest of the data at low depth values. Random Cut Forest (RCF) [47] improves upon Isolation Forest by creating multiple random cuts (trees) of the data and constructs a forest of such trees to determine whether a point is anomalous.

Recently, deep learning approaches for anomaly detection in multi-aspect data have also been proposed. DAGMM [48] learns a Gaussian Mixture Model (GMM) density over a low-dimensional latent space produced by a deep autoencoder. [49] uses metric learning for anomaly detection. Deep Structured Energy-based Model for Anomaly Detection (DSEBM) [50] trains deep energy models such as convolutional and recurrent EBMs using denoising score matching instead of maximum likelihood, for performing anomaly detection. More recently, methods like MEG [51] and Fence GAN [52] have been successfully used to detect anomalies.

For the task of intrusion detection [53–58], a variety of approaches have been used in the literature, including ensemble methods [59], feature selection [60], fuzzy neural networks [61], kernel methods [62], random forests [63], and deep learning based methods [64, 65]. However, we refrain from comparing with these approaches, as they do not process the data in a streaming manner and typically require large amounts of labelled training data, whereas we process the data in an unsupervised and online manner.

Note that Local Outlier Factor, Isolation Forest, Elliptic Envelope, STA, MASTA, STenSr, DENSEALERT and Random Cut Forest are unsupervised algorithms. Of these, only DENSEALERT performs group anomaly detection (by detecting dense subtensors); however, as shown in Table 1, it cannot effectively handle real-valued features (as it treats all features as discrete-valued).

3 PROBLEM

Let R = {𝑟1, 𝑟2, . . .} be a stream of records, arriving in a streaming manner. Each record 𝑟𝑖 = (𝑟𝑖1, . . . , 𝑟𝑖𝑑) consists of 𝑑 attributes or dimensions, in which each dimension can either be categorical (e.g. IP address) or real-valued (e.g. average packet length). Note that since the data is arriving over time as a stream, we do not assume that the set of possible feature values is known beforehand; for example,


Table 1: Comparison of relevant multi-aspect anomaly detection approaches (✓ = property satisfied).

| | Elliptic (1999) | LOF (2000) | I-Forest (2008) | STA (2006) | MASTA (2015) | STenSr (2015) | Random Cut Forest (2016) | DENSEALERT (2017) | MSTREAM (2020) |
|---|---|---|---|---|---|---|---|---|---|
| Group Anomalies | | | | | | | | ✓ | ✓ |
| Real-valued Features | ✓ | ✓ | ✓ | | | | ✓ | | ✓ |
| Constant Memory | | | ✓ | | | | ✓ | | ✓ |
| Const. Update Time | | | ✓ | ✓ | ✓ | ✓ | ✓ | | ✓ |

in network traffic settings, it is common for new IP addresses to beseen for the first time at some point in the middle of the stream.

Our goal is to detect group anomalies. Intuitively, group anomalies should have the following properties:

(1) Similarity in Categorical Attributes: for categorical attributes, the group anomalies consist of a relatively small number of attribute values, repeated a suspiciously large number of times.

(2) Similarity in Real-Valued Attributes: for real-valued attributes, the group anomalies consist of clusters of numerically similar attribute values.

(3) Temporally Sudden: the group anomalies arrive suddenly, over a suspiciously short amount of time. In addition, their behavior (in terms of attribute values) should clearly differ from what we have observed previously, over the course of the stream.

4 PROPOSED ALGORITHM

4.1 Motivation

Consider the toy example in Table 2, comprising a stream of connections over time. This dataset shows a clear block of suspicious activity from time 4 to 5, consisting of several IP addresses repeated a large number of times, as well as packet sizes which seem anomalously large compared to the usual distribution of packet sizes.

Table 2: Simple toy example, consisting of a stream of multi-aspect connections over time.

| Time | Source IP | Dest. IP | Pkt. Size | · · · |
|---|---|---|---|---|
| 1 | 194.027.251.021 | 194.027.251.021 | 100 | · · · |
| 2 | 172.016.113.105 | 207.230.054.203 | 80 | · · · |
| 4 | 194.027.251.021 | 192.168.001.001 | 1000 | · · · |
| 4 | 194.027.251.021 | 192.168.001.001 | 995 | · · · |
| 4 | 194.027.251.021 | 192.168.001.001 | 1000 | · · · |
| 5 | 194.027.251.021 | 192.168.001.001 | 990 | · · · |
| 5 | 194.027.251.021 | 194.027.251.021 | 1000 | · · · |
| 5 | 194.027.251.021 | 194.027.251.021 | 995 | · · · |
| 6 | 194.027.251.021 | 194.027.251.021 | 100 | · · · |
| 7 | 172.016.113.105 | 207.230.054.203 | 80 | · · · |

The main challenge, however, is to detect this type of pattern in a streaming manner, considering that we do not want to set any limits a priori on the duration of the anomalous activity we want to detect, or on the number of IP addresses (or other attribute values) which may be involved in this activity.

As shown in Figure 1, our approach addresses these problems through the use of a number of locality-sensitive hash functions [66] which hash each incoming tuple into a fixed number of buckets. Intuitively, we do this such that tuples with many similar entries tend to be hashed into similar buckets. These hash functions are combined with a temporal scoring approach, which takes into account how much overlap we observe between the buckets at any time: high amounts of overlap arriving in a short period of time suggest the presence of anomalous activity.

In Sections 4.2 and 4.3, we describe our MSTREAM approach, and in Section 4.4, we describe our MSTREAM-PCA, MSTREAM-IB and MSTREAM-AE approaches, which incorporate correlation between features in an unsupervised manner. MSTREAM-PCA uses principal component analysis, MSTREAM-IB uses information bottleneck, and MSTREAM-AE uses an autoencoder to first compress the original features and then apply MSTREAM in the compressed feature space.

4.2 Hash Functions

Our approach uses two types of hash functions: FEATUREHASH, which hashes each feature individually, and RECORDHASH, which hashes an entire record jointly. We use multiple independent copies of each type of hash function, and explain how to combine these to produce a single anomalousness score.

4.2.1 FeatureHash. As shown in Algorithm 1, FEATUREHASH consists of hash functions independently applied to a single feature. There are two cases, corresponding to whether the feature is categorical (e.g. IP address) or real-valued (e.g. average packet length):

For categorical data, we use standard linear hash functions [67] which map integer-valued data randomly into 𝑏 buckets, i.e. {0, . . . , 𝑏 − 1}, where 𝑏 is a fixed number.

For real-valued data, however, we find that randomized hash functions tend to lead to highly uneven bucket distributions for certain input datasets. Instead, we use a streaming log-bucketization approach. We first apply a log-transform to the data value (line 5), then perform min-max normalization, where the min and max are maintained in a streaming manner (line 7), and finally map the value such that the range of feature values is evenly divided into 𝑏 buckets, i.e. {0, . . . , 𝑏 − 1} (line 8).
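The two cases above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name, the use of Python's built-in `hash` to integerize categories, and the specific linear-hash parameters are all assumptions made for the sketch.

```python
import math
import random

class FeatureHash:
    """Sketch of FEATUREHASH: maps one feature into one of b buckets.

    Categorical values go through a linear hash h(x) = (p*x + q) mod b;
    real values go through streaming log-bucketization
    (log-transform, running min-max normalization, bucketize).
    """

    def __init__(self, b=1024, seed=0):
        rng = random.Random(seed)
        self.b = b
        self.p = rng.randrange(1, b)   # linear hash coefficients (illustrative)
        self.q = rng.randrange(b)
        self.lo = math.inf             # running min of log-transformed values
        self.hi = -math.inf            # running max of log-transformed values

    def hash_categorical(self, value):
        x = hash(value)                # map the category to an integer first
        return (self.p * x + self.q) % self.b

    def hash_real(self, value):
        v = math.log1p(value)          # log-transform: log(1 + value)
        self.lo = min(self.lo, v)      # streaming min-max update
        self.hi = max(self.hi, v)
        if self.hi == self.lo:
            return 0                   # degenerate range: single bucket
        norm = (v - self.lo) / (self.hi - self.lo)   # normalized to [0, 1]
        return int(norm * self.b) % self.b           # bucketize into b buckets
```

Note that each copy of the hash keeps its own running min/max, so early records may land in buckets that would differ once the observed range widens; this matches the streaming setting, where the feature range is not known in advance.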

4.2.2 RecordHash. As shown in Algorithm 2, in RECORDHASH,we operate on all features of a record simultaneously. We first dividethe entire record 𝑟𝑖 into two parts, one consisting of the categorical



Figure 1: Diagram of the proposed MSTREAM. The dimensionality reduction unit (Section 4.4) takes in a record and outputs a lower-dimensional embedding. Two types of locality-sensitive hash functions are then applied. FEATUREHASH (Algorithm 1) hashes each individual feature and RECORDHASH (Algorithm 2) hashes the entire record jointly. These are then combined using a temporal scoring approach to calculate the anomaly score for the record (Algorithm 3).

Algorithm 1: FEATUREHASH: Hashing an Individual Feature
Input: 𝑟𝑖𝑗 (feature 𝑗 of record 𝑟𝑖)
Output: bucket index in {0, . . . , 𝑏 − 1} to map 𝑟𝑖𝑗 into
1  if 𝑟𝑖𝑗 is categorical:
2      output HASH(𝑟𝑖𝑗)                                  // Linear hash [67]
3  else if 𝑟𝑖𝑗 is real-valued:
4      ⊲ Log-transform
5      𝑟𝑖𝑗 = log(1 + 𝑟𝑖𝑗)
6      ⊲ Normalize
7      𝑟𝑖𝑗 ← (𝑟𝑖𝑗 − 𝑚𝑖𝑛𝑗) / (𝑚𝑎𝑥𝑗 − 𝑚𝑖𝑛𝑗)               // Streaming min-max
8      output ⌊𝑟𝑖𝑗 · 𝑏⌋ (mod 𝑏)                          // Bucketize into 𝑏 buckets

features C, say 𝑟𝑖ᶜᵃᵗ, and the other consisting of the real-valued features R, say 𝑟𝑖ⁿᵘᵐ. We then separately hash 𝑟𝑖ᶜᵃᵗ to get 𝑏𝑢𝑐𝑘𝑒𝑡𝑐𝑎𝑡, and 𝑟𝑖ⁿᵘᵐ to get 𝑏𝑢𝑐𝑘𝑒𝑡𝑛𝑢𝑚. Finally, we take the sum modulo 𝑏 of 𝑏𝑢𝑐𝑘𝑒𝑡𝑐𝑎𝑡 and 𝑏𝑢𝑐𝑘𝑒𝑡𝑛𝑢𝑚 to get a bucket for 𝑟𝑖. We hash 𝑟𝑖ᶜᵃᵗ and 𝑟𝑖ⁿᵘᵐ as follows:

(1) 𝑟𝑖ᶜᵃᵗ: We use standard linear hash functions [67] to map each of the individual features 𝑟𝑖𝑗, ∀𝑗 ∈ C, into 𝑏 buckets, and then combine them by summing modulo 𝑏 to compute the bucket index 𝑏𝑢𝑐𝑘𝑒𝑡𝑐𝑎𝑡 for 𝑟𝑖ᶜᵃᵗ (line 3).

(2) 𝑟𝑖ⁿᵘᵐ: To compute the hash of a real-valued record 𝑟𝑖ⁿᵘᵐ of dimension 𝑝 = |R|, we choose 𝑘 random vectors a1, a2, . . . , a𝑘, each having 𝑝 dimensions and independently sampled from a Gaussian distribution N𝑝(0, I𝑝), where 𝑘 = ⌈log₂(𝑏)⌉. We compute the scalar product of 𝑟𝑖ⁿᵘᵐ with each of these vectors (line 6). We then map the positive scalar products to 1 and the non-positive scalar products to 0, concatenate these mapped values to get a 𝑘-bit string, and convert it from a bitset into an integer 𝑏𝑢𝑐𝑘𝑒𝑡𝑛𝑢𝑚 between 0 and 2ᵏ − 1 (line 10).

Algorithm 2: RECORDHASH: Hashing an Entire Record
Input: record 𝑟𝑖
Output: bucket index in {0, . . . , 𝑏 − 1} to map 𝑟𝑖 into
1  ⊲ Divide 𝑟𝑖 into its categorical part, 𝑟𝑖ᶜᵃᵗ, and its numerical part, 𝑟𝑖ⁿᵘᵐ
2  ⊲ Hash 𝑟𝑖ᶜᵃᵗ
3  𝑏𝑢𝑐𝑘𝑒𝑡𝑐𝑎𝑡 = (∑𝑗∈C HASH(𝑟𝑖𝑗)) (mod 𝑏)                 // Linear hash [67]
4  ⊲ Hash 𝑟𝑖ⁿᵘᵐ
5  for 𝑖𝑑 ← 1 to 𝑘:
6      if ⟨𝑟𝑖ⁿᵘᵐ, a𝑖𝑑⟩ > 0:
7          𝑏𝑖𝑡𝑠𝑒𝑡[𝑖𝑑] = 1
8      else:
9          𝑏𝑖𝑡𝑠𝑒𝑡[𝑖𝑑] = 0
10 𝑏𝑢𝑐𝑘𝑒𝑡𝑛𝑢𝑚 = INT(𝑏𝑖𝑡𝑠𝑒𝑡)                              // Convert bitset to integer
11 output (𝑏𝑢𝑐𝑘𝑒𝑡𝑐𝑎𝑡 + 𝑏𝑢𝑐𝑘𝑒𝑡𝑛𝑢𝑚) (mod 𝑏)
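Algorithm 2 can be sketched in a few lines. This is an illustrative version under stated assumptions: the class name and constructor parameters (`p_dims`, `seed`) are invented for the sketch, Python's built-in `hash` stands in for the linear hash of the categorical features, and a single hash copy is shown rather than the multiple independent copies the method uses.

```python
import math
import random

class RecordHash:
    """Sketch of RECORDHASH: hashes a whole record into one of b buckets.

    Categorical features are linearly hashed and summed mod b;
    numerical features are projected onto k = ceil(log2(b)) random
    Gaussian vectors, whose signs form a k-bit bucket index.
    """

    def __init__(self, p_dims, b=1024, seed=0):
        rng = random.Random(seed)
        self.b = b
        self.p = rng.randrange(1, b)       # linear hash coefficients
        self.q = rng.randrange(b)
        self.k = math.ceil(math.log2(b))
        # k random vectors a_1 .. a_k, each ~ N(0, I_p)
        self.vecs = [[rng.gauss(0.0, 1.0) for _ in range(p_dims)]
                     for _ in range(self.k)]

    def __call__(self, cat_features, num_features):
        # bucket_cat: sum of linear hashes of the categorical features, mod b
        bucket_cat = sum((self.p * hash(v) + self.q) % self.b
                         for v in cat_features) % self.b
        # bucket_num: signs of the k scalar products, read as a k-bit integer
        bits = 0
        for a in self.vecs:
            dot = sum(x * y for x, y in zip(num_features, a))
            bits = (bits << 1) | (1 if dot > 0 else 0)
        return (bucket_cat + bits) % self.b
```

The sign-pattern step is the classic random-hyperplane LSH: records whose numerical parts point in similar directions tend to produce the same bit pattern, and hence land in the same bucket.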

4.3 Temporal Scoring

A recent algorithm, MIDAS [1], finds anomalous edges of a dynamic graph in a streaming manner. It combines a chi-squared goodness-of-fit test with the Count-Min Sketch (CMS) [68] streaming data structures to get an anomaly score for each edge. In MIDAS, 𝑠𝑢𝑣


is defined to be the total number of edges from node 𝑢 to 𝑣 up to the current time 𝑡, while 𝑎𝑢𝑣 is the number of edges from node 𝑢 to 𝑣 in the current time tick 𝑡 (but not including past time ticks). MIDAS then divides the edges into two classes: edges at the current time tick 𝑡 (= 𝑎𝑢𝑣), and edges in past time ticks (= 𝑠𝑢𝑣 − 𝑎𝑢𝑣), and computes the chi-squared statistic, i.e. the sum over categories of (observed − expected)² / expected:

𝑋² = (observed_cur − expected_cur)² / expected_cur + (observed_past − expected_past)² / expected_past
   = (𝑎𝑢𝑣 − 𝑠𝑢𝑣/𝑡)² / (𝑠𝑢𝑣/𝑡) + ((𝑠𝑢𝑣 − 𝑎𝑢𝑣) − ((𝑡 − 1)/𝑡) 𝑠𝑢𝑣)² / (((𝑡 − 1)/𝑡) 𝑠𝑢𝑣)
   = (𝑎𝑢𝑣 − 𝑠𝑢𝑣/𝑡)² · 𝑡² / (𝑠𝑢𝑣 (𝑡 − 1))

MIDAS uses two types of CMS data structures to maintain approximate counts ŝ𝑢𝑣 and â𝑢𝑣, which estimate 𝑠𝑢𝑣 and 𝑎𝑢𝑣 respectively. The anomaly score for an edge in MIDAS is then defined as:

score(𝑢, 𝑣, 𝑡) = (â𝑢𝑣 − ŝ𝑢𝑣/𝑡)² · 𝑡² / (ŝ𝑢𝑣 (𝑡 − 1))    (1)

However, MIDAS is designed to detect anomalous edges, which are two-dimensional records (consisting of source and destination node indices). Therefore, it cannot be applied in the high-dimensional setting of multi-aspect data. Moreover, MIDAS treats all variables of the dataset as categorical variables, whereas multi-aspect data can contain arbitrary mixtures of categorical variables (e.g. source IP address) and numerical variables (e.g. average packet size).

We extend MIDAS to define an anomalousness score for each record and detect anomalous records in a streaming manner. Given each incoming record 𝑟𝑖 with 𝑑 features, we can compute 𝑑 + 1 anomalousness scores: one for the entire record 𝑟𝑖 and one for each individual feature 𝑟𝑖𝑗. We compute each score via the chi-squared statistic over the two categories: current time tick and past time ticks. Anomaly scores for individual attributes are useful for interpretability, as they help explain which features are most responsible for the anomalousness of the record. Finally, we combine these scores by taking their sum.

DEFINITION 1 (ANOMALY SCORE). Given a newly arriving record (𝑟𝑖, 𝑡), our anomalousness score is computed as:

score(𝑟𝑖, 𝑡) = (â𝑟𝑖 − ŝ𝑟𝑖/𝑡)² · 𝑡² / (ŝ𝑟𝑖 (𝑡 − 1)) + ∑_{𝑗=1}^{𝑑} score(𝑟𝑖𝑗, 𝑡)    (2)

where,

score(𝑟𝑖𝑗, 𝑡) = (â𝑟𝑖𝑗 − ŝ𝑟𝑖𝑗/𝑡)² · 𝑡² / (ŝ𝑟𝑖𝑗 (𝑡 − 1))    (3)

and â𝑟𝑖 (or â𝑟𝑖𝑗) is an approximate count of 𝑟𝑖 (or 𝑟𝑖𝑗) at the current time 𝑡, and ŝ𝑟𝑖 (or ŝ𝑟𝑖𝑗) is an approximate count of 𝑟𝑖 (or 𝑟𝑖𝑗) up to time 𝑡.

We also allow temporal flexibility of records, i.e. records in the recent past count towards the current anomalousness score. This is achieved by reducing the counts â𝑟𝑖 and â𝑟𝑖𝑗 ∀𝑗 ∈ {1, . . . , 𝑑} by a factor of 𝛼 ∈ (0, 1) rather than resetting them at the end of each time tick. This results in past records counting towards the current time tick, with a diminishing weight.

MSTREAM is summarised in Algorithm 3.

Algorithm 3: MSTREAM: Streaming Anomaly Scoring
Input: stream of records over time
Output: anomaly score for each record
1  ⊲ Initialize data structures:
2  Total record count ŝ𝑟𝑖 and total attribute counts ŝ𝑟𝑖𝑗 ∀𝑗 ∈ {1, . . . , 𝑑}
3  Current record count â𝑟𝑖 and current attribute counts â𝑟𝑖𝑗 ∀𝑗 ∈ {1, . . . , 𝑑}
4  while a new record (𝑟𝑖, 𝑡) = (𝑟𝑖1, . . . , 𝑟𝑖𝑑, 𝑡) is received: do
5      ⊲ Hash and update counts:
6      for 𝑗 ← 1 to 𝑑:
7          𝑏𝑢𝑐𝑘𝑒𝑡𝑗 = FEATUREHASH(𝑟𝑖𝑗)
8          update count of 𝑏𝑢𝑐𝑘𝑒𝑡𝑗
9      𝑏𝑢𝑐𝑘𝑒𝑡 = RECORDHASH(𝑟𝑖)
10     update count of 𝑏𝑢𝑐𝑘𝑒𝑡
11     ⊲ Query counts:
12     retrieve updated counts ŝ𝑟𝑖, â𝑟𝑖, ŝ𝑟𝑖𝑗 and â𝑟𝑖𝑗 ∀𝑗 ∈ {1, . . . , 𝑑}
13     ⊲ Anomaly score:
14     output score(𝑟𝑖, 𝑡) = (â𝑟𝑖 − ŝ𝑟𝑖/𝑡)² · 𝑡² / (ŝ𝑟𝑖 (𝑡 − 1)) + ∑_{𝑗=1}^{𝑑} score(𝑟𝑖𝑗, 𝑡)
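The scoring loop of Algorithm 3, together with the 𝛼-decay described above, can be sketched as follows. Two simplifications are assumed for brevity: plain dictionaries keyed by (feature index, bucket) stand in for the paper's Count-Min Sketches, and a single hash copy is used per count; the class and parameter names are invented for the sketch.

```python
from collections import defaultdict

class TemporalScorer:
    """Sketch of MSTREAM's temporal scoring (Algorithm 3)."""

    def __init__(self, alpha=0.6):
        self.alpha = alpha               # decay factor in (0, 1)
        self.s = defaultdict(float)      # counts up to time t
        self.a = defaultdict(float)      # decayed current-tick counts
        self.t = 1

    @staticmethod
    def _chi2(a, s, t):
        # Chi-squared statistic of Equations (2) and (3).
        if t <= 1 or s == 0:
            return 0.0
        return (a - s / t) ** 2 * t ** 2 / (s * (t - 1))

    def score(self, t, record_bucket, feature_buckets):
        if t > self.t:                   # new tick: decay instead of resetting
            for key in self.a:
                self.a[key] *= self.alpha
            self.t = t
        total = 0.0
        # Key index 0 holds the whole-record count; 1..d hold the features.
        for j, bucket in enumerate([record_bucket] + list(feature_buckets)):
            key = (j, bucket)
            self.s[key] += 1
            self.a[key] += 1
            total += self._chi2(self.a[key], self.s[key], t)
        return total

# A burst of identical records at t = 2 scores increasingly high.
ts = TemporalScorer(alpha=0.6)
ts.score(1, 7, [1, 2])                          # ordinary warm-up record
burst = [ts.score(2, 3, [4, 5]) for _ in range(20)]
```

In the sketch, `record_bucket` and `feature_buckets` would come from RECORDHASH and FEATUREHASH respectively; repeated similar records keep hitting the same buckets, so their chi-squared terms grow within the tick.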

4.4 Incorporating Correlation Between Features

In this section, we describe our MSTREAM-PCA, MSTREAM-IB and MSTREAM-AE approaches, where we run the MSTREAM algorithm on a lower-dimensional embedding of the original data, obtained using Principal Component Analysis (PCA) [69], Information Bottleneck (IB) [70] and Autoencoder (AE) [71] methods in a streaming manner.

Our motivation for combining PCA, IB and AE methods with MSTREAM is two-fold. Firstly, the low-dimensional representations learned by these algorithms incorporate correlation between different attributes of the record, making anomaly detection more effective. Secondly, a reduction in the number of dimensions results in faster processing per record.

For all three methods, we first learn the dimensionality reduction transformation using a very small initial subset of 256 records from the incoming stream. We then compute the embeddings for the subsequent records and pass them to MSTREAM to detect anomalies in an online manner.

Principal Component Analysis. We choose PCA because it only requires one major parameter to tune: namely, the dimension of the projection space. Moreover, this parameter can be set easily by analyzing the explained variance ratios of the principal components.


Hence, MSTREAM-PCA can be used as an off-the-shelf algorithm for streaming anomaly detection with dimensionality reduction.
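The warm-up-then-project scheme can be sketched with a plain SVD-based PCA. This is an illustration, not the paper's code: the function name, the synthetic data, and the choice of 40 input and 8 output dimensions are assumptions; only the 256-record warm-up size comes from the text above.

```python
import numpy as np

def fit_pca(warmup, out_dim):
    """Fit PCA on a small warm-up batch (here, the first 256 records
    of the stream) and return a function projecting later records
    into the out_dim-dimensional principal subspace."""
    mean = warmup.mean(axis=0)
    # Principal directions: right singular vectors of the centered batch.
    _, _, vt = np.linalg.svd(warmup - mean, full_matrices=False)
    components = vt[:out_dim]
    return lambda records: (records - mean) @ components.T

rng = np.random.default_rng(0)
warmup = rng.normal(size=(256, 40))            # first 256 records, 40 features
project = fit_pca(warmup, out_dim=8)
embedding = project(rng.normal(size=(1, 40)))  # later record -> 8-dim row
```

Each subsequent record is projected in O(out_dim · d) time and handed to MSTREAM, so the per-record cost drops with the embedding dimension, as noted above.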

Information Bottleneck. Information bottleneck for dimensionality reduction can be posed as the following optimization problem:

min_{𝑝(𝑡|𝑥)} 𝐼(𝑋;𝑇) − 𝛽𝐼(𝑇;𝑌)

where 𝑋, 𝑌, and 𝑇 are random variables. 𝑇 is the compressed representation of 𝑋; 𝐼(𝑋;𝑇) and 𝐼(𝑇;𝑌) are the mutual information of 𝑋 and 𝑇, and of 𝑇 and 𝑌, respectively; and 𝛽 is a Lagrange multiplier. In our setting, 𝑋 denotes the multi-aspect data, 𝑌 denotes whether the data is anomalous, and 𝑇 denotes the dimensionally reduced features that we wish to find. Our implementation is based on the neural network approach for Nonlinear Information Bottleneck [72].

Autoencoder. An autoencoder is a neural network based approach for dimensionality reduction. An autoencoder network consists of an encoder and a decoder. The encoder compresses the input into a lower-dimensional space, while the decoder reconstructs the input from the low-dimensional representation. Our experimental results in Section 5 show that even with a simple 3-layer autoencoder, MSTREAM-AE outperforms both MSTREAM-PCA and MSTREAM-IB.

4.5 Time and Memory Complexity

In terms of memory, MSTREAM only needs to maintain its data structures over time, which require 𝑂(𝑤𝑏𝑑) memory, where 𝑤, 𝑏 and 𝑑 are the number of hash functions, the number of buckets in the data structures, and the total number of dimensions; this is bounded with respect to the stream size.

For time complexity, the only relevant steps in Algorithm 3 are those that either update or query the data structures, which take 𝑂(𝑤𝑑) time (all other operations run in constant time). Thus, the time complexity per update step is 𝑂(𝑤𝑑).

5 EXPERIMENTS

In this section, we evaluate the performance of MSTREAM and MSTREAM-AE compared to Elliptic Envelope, LOF, I-Forest, Random Cut Forest and DENSEALERT on multi-aspect data streams. We aim to answer the following questions:

Q1. Anomaly Detection Performance: How accurately does MSTREAM detect real-world anomalies compared to baselines, as evaluated using the ground truth labels?

Q2. Scalability: How does it scale with input stream length and number of dimensions? How does the time needed to process each input compare to baseline approaches?

Q3. Real-World Effectiveness: Does it detect meaningful anomalies? Does it detect group anomalies?

Datasets. The KDDCUP99 dataset [73] is based on the DARPA dataset and is among the most extensively used datasets for intrusion detection. Since the proportion of data belonging to the 'attack' class is much larger than the proportion of data belonging to the 'non-attack' class, we downsample the 'attack' class to a proportion of 20%. KDDCUP99 has 42 dimensions and 1.21 million records.

[74] surveys different intrusion detection datasets and recommends using the newer CICIDS [75] and UNSW-NB15 [76] datasets.

These contain modern-day attacks and follow the established guidelines for reliable intrusion detection datasets (in terms of realism, evaluation capabilities, total capture, completeness, and malicious activity) [75].

The CICIDS 2018 dataset was generated at the Canadian Institute of Cybersecurity. Each record is a flow containing features such as Source IP Address, Source Port, Destination IP Address, Bytes, and Packets. These flows were captured from a real-time simulation of normal network traffic and synthetic attack simulators. This consists of the CICIDS-DoS dataset (1.05 million records, 80 features) and the CICIDS-DDoS dataset (7.9 million records, 83 features). CICIDS-DoS has 5% anomalies, whereas CICIDS-DDoS has 7% anomalies.

The UNSW-NB 15 dataset was created by the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) to generate a hybrid of real modern normal activities and synthetic contemporary attack behaviors. This dataset has nine types of attacks, namely Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms. It has 49 features and 2.5 million records, including 13% anomalies.

Baselines. As described in Section 2, we are in the streaming unsupervised learning regime, and hence do not compare with supervised or offline algorithms.

We consider Elliptic Envelope, Local Outlier Factor, Isolation Forest, STA, MASTA, STenSr, DENSEALERT and Random Cut Forest since they operate on multi-aspect data; however, due to the large number of dimensions, even the sparse tensor versions of STA/MASTA/STenSr run out of memory on these datasets. So, we compare with Elliptic Envelope, Local Outlier Factor, Isolation Forest, DENSEALERT and Random Cut Forest.

Evaluation Metrics. All the methods output an anomaly score per record (higher is more anomalous). We plot the ROC curve, which compares the True Positive Rate (TPR) and False Positive Rate (FPR), without needing to fix any threshold. We also report the ROC-AUC (Area under the ROC curve).
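ROC-AUC admits a threshold-free reading as the probability that a randomly chosen anomaly outscores a randomly chosen normal record. A small self-contained sketch of this rank-statistic form (ties count one half; in practice one would use a library implementation):

```python
def roc_auc(scores, labels):
    """ROC-AUC via the rank statistic: P(anomaly score > normal score),
    counting ties as 1/2. labels are 1 (anomaly) or 0 (normal)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect scorer gives 1.0, a reversed scorer 0.0, and a constant scorer 0.5, matching the usual ROC-AUC interpretation.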

Experimental Setup. All experiments are carried out on a 2.4 GHz Intel Core i9 processor with 32 GB RAM, running OS X 10.15.2. We implement MSTREAM in C++. We use 2 independent copies of each hash function, and we set the number of buckets to 1024. We set the temporal decay factor α to 0.85 for KDDCUP99, 0.95 for CICIDS-DoS and CICIDS-DDoS, and 0.4 for UNSW-NB 15 due to its higher time granularity. Note that MSTREAM is not sensitive to variation of the α parameter, as shown in Table 5. Since the KDDCUP99 dataset does not have timestamps, we apply the temporal decay factor once every 1000 records. We discuss the influence of the temporal decay factor α on the ROC-AUC in Appendix 6.1.

To demonstrate the robustness of our proposed approach, we set the output dimension of MSTREAM-PCA, MSTREAM-IB and MSTREAM-AE for all datasets to a common value of 12, instead of searching individually on each method and dataset. We reduce the real-valued columns to 12 dimensions and then pass these along with the categorical columns to MSTREAM. Results on varying the number of output dimensions can be found in the Appendix. For MSTREAM-PCA we use the open-source implementation of PCA available in the scikit-learn [77] library. Parameters for MSTREAM-AE and MSTREAM-IB are described in Appendix 6.3.
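The preprocessing step just described (reduce the real-valued columns to 12 dimensions, then rejoin the untouched categorical columns) can be sketched as follows; the SVD-based PCA and the toy shapes are assumptions, the paper uses scikit-learn's PCA:

```python
import numpy as np

def preprocess(X_num, X_cat, k=12):
    """Reduce real-valued columns to k PCA dimensions, then append the
    categorical columns unchanged."""
    Xc = X_num - X_num.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return np.hstack([Xc @ Vt[:k].T, X_cat])

rng = np.random.default_rng(0)
X_num = rng.normal(size=(100, 40))          # numeric features (e.g. packet sizes)
X_cat = rng.integers(0, 5, size=(100, 3))   # categorical codes (e.g. protocol)
X = preprocess(X_num, X_cat)
```

The result has 12 reduced numeric columns plus the original categorical columns, which is what gets fed to MSTREAM.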


Table 3: AUC of each method on different datasets.

Dataset   Elliptic       LOF    I-Forest       DAlert  RCF   MSTREAM       MSTREAM-PCA   MSTREAM-IB    MSTREAM-AE

KDD       0.34 ± 0.025   0.34   0.81 ± 0.018   0.92    0.63  0.91 ± 0.016  0.92 ± 0.000  0.96 ± 0.002  0.96 ± 0.005
DoS       0.75 ± 0.021   0.50   0.73 ± 0.008   0.61    0.83  0.93 ± 0.001  0.92 ± 0.001  0.95 ± 0.003  0.94 ± 0.001
UNSW      0.25 ± 0.003   0.49   0.84 ± 0.023   0.80    0.45  0.86 ± 0.001  0.81 ± 0.001  0.82 ± 0.001  0.90 ± 0.001
DDoS      0.57 ± 0.106   0.46   0.56 ± 0.021   ——      0.63  0.91 ± 0.000  0.94 ± 0.000  0.82 ± 0.000  0.93 ± 0.000

We use open-sourced implementations of DENSEALERT and Random Cut Forest, provided by the authors, following parameter settings as suggested in the original papers. For Elliptic Envelope, Local Outlier Factor and Isolation Forest we use the open-source implementations available in the scikit-learn [77] library. We also pass the true anomaly percentage to the Elliptic Envelope, Local Outlier Factor and Isolation Forest methods, while the remainder of the methods do not require the anomaly percentage.

All the experiments, unless explicitly specified, are performed 5 times for each parameter group, and the mean and standard deviation values are reported.

5.1 Anomaly Detection Performance

Figure 2 plots the ROC curve for MSTREAM, MSTREAM-PCA, MSTREAM-IB and MSTREAM-AE along with the baselines Elliptic Envelope, Local Outlier Factor, Isolation Forest, DENSEALERT and Random Cut Forest on the CICIDS-DoS dataset. We see that MSTREAM, MSTREAM-PCA, MSTREAM-IB and MSTREAM-AE achieve a much higher ROC-AUC (0.92–0.95) compared to the baselines. MSTREAM and its variants achieve at least 50% higher AUC than DENSEALERT, 11% higher than Random Cut Forest, 26% higher than Isolation Forest, 23% higher than Elliptic Envelope and 84% higher than Local Outlier Factor.

Figure 2: ROC on CICIDS-DoS dataset.

Table 3 shows the AUC of Elliptic Envelope, Local Outlier Factor, Isolation Forest, DENSEALERT, Random Cut Forest and MSTREAM on the KDDCUP99, CICIDS-DoS, UNSW-NB 15 and CICIDS-DDoS datasets. We report a single value for Local Outlier Factor and DENSEALERT because these are non-randomized methods. We also report a single value for Random Cut Forest because we use the parameters and random seed of the original implementation. DENSEALERT performs well on small-sized datasets such as KDDCUP99, but as the dimensions increase, its performance decreases. On the large CICIDS-DDoS dataset, DENSEALERT runs out of memory. We observe that MSTREAM outperforms all baselines on all datasets. By learning the correlation between features, MSTREAM-AE achieves higher ROC-AUC than MSTREAM, and performs comparably or better than MSTREAM-PCA and MSTREAM-IB. We also discuss evaluating the ROC-AUC in a streaming manner in Appendix 6.4.

Figure 3 plots ROC-AUC vs. running time (log-scale, in seconds, excluding I/O) for the different methods on the CICIDS-DoS dataset. We see that MSTREAM, MSTREAM-PCA, MSTREAM-IB and MSTREAM-AE achieve 11% to 90% higher AUC compared to the baselines, while also running almost two orders of magnitude faster.


Figure 3: ROC-AUC vs time on CICIDS-DoS dataset.

5.2 Scalability

Figure 4 shows the scalability of MSTREAM with respect to the number of records in the stream (log-scale). We plot the time needed to run on the (chronologically) first 2^12, 2^13, 2^14, ..., 2^20 records of the CICIDS-DoS dataset. Each record has 80 dimensions. This confirms the linear scalability of MSTREAM with respect to the number of records in the input stream, due to its constant processing time per record.

Figure 5 shows the scalability of MSTREAM with respect to the number of dimensions (linear-scale). We plot the time needed to run on the first 10, 20, 30, ..., 80 dimensions of the CICIDS-DoS dataset.


Figure 4: MSTREAM scales linearly with the number of records in CICIDS-DoS.

This confirms the linear scalability of MSTREAM with respect to the number of dimensions in the input data.


Figure 5: MSTREAM scales linearly with the number of dimensions in CICIDS-DoS.

Figure 6 shows the scalability of MSTREAM with respect to the number of hash functions (linear-scale). We plot the time taken to run on the CICIDS-DoS dataset with 2, 3, and 4 hash functions. This confirms the linear scalability of MSTREAM with respect to the number of hash functions.


Figure 6: MSTREAM scales linearly with the number of hash functions in CICIDS-DoS.

Since MSTREAM-PCA, MSTREAM-IB and MSTREAM-AE apply MSTREAM on the lower-dimensional features obtained through dimensionality reduction, they are also scalable.

Figure 7 plots a frequency distribution of the time taken (in microseconds) to process each record in the CICIDS-DoS dataset. MSTREAM processes 957K records within 10 µs each, 60K records within 100 µs each, and the remaining 30K records within 1000 µs each.


Figure 7: Distribution of processing times for ∼1.05M records of the CICIDS-DoS dataset.

Table 4 shows the time it takes Elliptic Envelope, Local Outlier Factor, Isolation Forest, DENSEALERT, Random Cut Forest, MSTREAM and MSTREAM-AE to run on the KDDCUP99, CICIDS-DoS, UNSW-NB 15 and CICIDS-DDoS datasets. We see that MSTREAM runs much faster than the baselines: for example, MSTREAM is 79 times faster than DENSEALERT on the KDDCUP99 dataset. MSTREAM-PCA, MSTREAM-IB and MSTREAM-AE incorporate dimensionality reduction and are therefore faster than MSTREAM: for example, MSTREAM-AE is 1.38 times faster than MSTREAM and 110 times faster than DENSEALERT on the KDDCUP99 dataset.

5.3 Discoveries

We plot normalized anomaly scores over time using Elliptic Envelope, Local Outlier Factor, Isolation Forest, DENSEALERT, Random Cut Forest and MSTREAM on the CICIDS-DoS dataset in Figure 8. To visualize, we aggregate records occurring in each minute by taking the max anomaly score per minute, for a total of 565 minutes. Ground truth values are indicated by points plotted at y = 0 (i.e. normal) or y = 1 (anomaly).
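The per-minute aggregation used for this plot can be sketched as follows (illustrative Python, not the plotting code used in the paper; timestamps are assumed to be in seconds):

```python
def max_score_per_minute(timestamps, scores):
    """One value per minute: the max anomaly score among that minute's records."""
    per_min = {}
    for t, s in zip(timestamps, scores):
        m = int(t // 60)
        if m not in per_min or s > per_min[m]:
            per_min[m] = s
    return dict(sorted(per_min.items()))
```

Taking the max (rather than the mean) preserves short spikes, which is what makes attack bursts visible in the plot.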

Local Outlier Factor and DENSEALERT miss many anomalies, whereas Elliptic Envelope, Isolation Forest and Random Cut Forest output many high scores unrelated to any attacks. This is also reflected in Table 3 and shows that MSTREAM is effective in catching real-world anomalies.

Group anomaly detection: In Figure 8, G is a group anomaly which MSTREAM is able to detect, whereas Elliptic Envelope, Local Outlier Factor and Isolation Forest completely miss it. DENSEALERT and Random Cut Forest partially catch it, but are also not fully effective in such high-dimensional datasets. This shows the effectiveness of MSTREAM in catching group anomalies such as DoS and DDoS attacks.

Explainability: As MSTREAM estimates feature-specific anomaly scores before aggregating them, it is interpretable. For a given


Table 4: Running time of each method on different datasets in seconds.

Dataset   Elliptic  LOF      I-Forest  DAlert  RCF     MSTREAM  MSTREAM-PCA  MSTREAM-IB  MSTREAM-AE

KDD       216.3     1478.8   230.4     341.8   181.6   4.3      2.5          3.1         3.1
DoS       455.8     398.8    384.8     333.4   459.4   10.4     2.1          3.7         5.1
UNSW      654.6     2091.1   627.4     329.6   683.8   12.8     6.6          8           8
DDoS      3371.4    15577    3295.8    ——      4168.8  61.6     16.9         25.6        27.7

Figure 8: Plots of anomaly scores over time; spikes for MSTREAM correspond to the ground truth events in CICIDS-DoS, but not for the baselines.

anomaly, we can rank the features according to their anomaly scores. We can then explain which features were most responsible for the anomalousness of a record in an unsupervised setting.

For example, in Figure 8, MSTREAM finds that e is an anomaly that occurs due to the Flow IAT Min feature. This agrees with [75], which finds that the best feature set for DoS detection using a Random Forest approach (supervised learning; in contrast, our approach does not require labels) comprises B.Packet Len Std, Flow IAT Min, Fwd IAT Min, and Flow IAT Mean.
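Given per-feature anomaly scores, the ranking step itself is straightforward; a sketch with hypothetical feature names and scores (not values from the paper):

```python
def explain(feature_scores):
    """Rank features from most to least responsible for a record's anomaly score."""
    return sorted(feature_scores, key=feature_scores.get, reverse=True)
```

The top-ranked feature is then reported as the main driver of the record's anomalousness.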

6 CONCLUSION

In this paper, we proposed MSTREAM for detecting group anomalies in multi-aspect streams, and MSTREAM-PCA, MSTREAM-IB and MSTREAM-AE, which incorporate dimensionality reduction to improve accuracy and speed. Future work could consider more complex combinations (e.g. weighted sums) of anomaly scores for individual attributes. Our contributions are:

(1) Multi-Aspect Group Anomaly Detection: We propose a novel approach for detecting group anomalies in multi-aspect data, including both categorical and numeric attributes.

(2) Streaming Approach: Our approach processes the data in a fast and streaming fashion, performing each update in constant time and memory.

(3) Effectiveness: Our experimental results show that MSTREAM outperforms baseline approaches.

ACKNOWLEDGMENTS

This work was supported in part by NUS ODPRT Grant R-252-000-A81-133.

REFERENCES

[1] Siddharth Bhatia, Bryan Hooi, Minji Yoon, Kijung Shin, and Christos Faloutsos. MIDAS: Microcluster-based detector of anomalies in edge streams. In AAAI, 2020.

[2] Kijung Shin, Bryan Hooi, Jisu Kim, and Christos Faloutsos. DenseAlert: Incremental dense-subtensor detection in tensor streams. In KDD, 2017.

[3] Jimeng Sun, Dacheng Tao, and Christos Faloutsos. Beyond streams and graphs: dynamic tensor analysis. In KDD, 2006.

[4] Leman Akoglu, Hanghang Tong, and Danai Koutra. Graph based anomaly detec-tion and description: A survey. Data Min. Knowl. Discov., 29(3):626–688, May2015.

[5] Jiabao Zhang, Shenghua Liu, Wenjian Yu, Wenjie Feng, and Xueqi Cheng. Eigen-pulse: Detecting surges in large streaming graphs with row augmentation. In QiangYang, Zhi-Hua Zhou, Zhiguo Gong, Min-Ling Zhang, and Sheng-Jun Huang, ed-itors, Advances in Knowledge Discovery and Data Mining - 23rd Pacific-AsiaConference, PAKDD 2019, Macau, China, April 14-17, 2019, Proceedings, PartII, volume 11440 of Lecture Notes in Computer Science, pages 501–513. Springer,2019.

[6] Fragkiskos D. Malliaros, V. Megalooikonomou, and C. Faloutsos. Fast robustnessestimation in large social graphs: Communities and anomaly detection. In SDM,2012.

[7] Petko Bogdanov, Christos Faloutsos, Misael Mongiovì, Evangelos E. Papalexakis,Razvan Ranca, and Ambuj K. Singh. Netspot: Spotting significant anomalousregions on dynamic networks. In Proceedings of the 13th SIAM InternationalConference on Data Mining, May 2-4, 2013. Austin, Texas, USA, pages 28–36.SIAM, 2013.

[8] Misael Mongiovì, Petko Bogdanov, Razvan Ranca, Ambuj K. Singh, Evangelos E.Papalexakis, and Christos Faloutsos. Sigspot: Mining significant anomalousregions from time-evolving networks (abstract only). In Proceedings of the 2012ACM SIGMOD International Conference on Management of Data, SIGMOD ’12,page 865, New York, NY, USA, 2012. Association for Computing Machinery.

[9] Nikhil Gupta, D. Eswaran, N. Shah, Leman Akoglu, and C. Faloutsos. Lookout ontime-evolving graphs: Succinctly explaining anomalies from any detector. ArXiv,abs/1710.05333, 2017.

[10] N. Shah, A. Beutel, B. Hooi, L. Akoglu, S. Gunnemann, D. Makhija, M. Kumar,and C. Faloutsos. Edgecentric: Anomaly detection in edge-attributed networks. In


2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW),pages 327–334, 2016.

[11] Chen Luo and Anshumali Shrivastava. Arrays of (locality-sensitive) count estima-tors (ace): Anomaly detection on the edge. In Proceedings of the 2018 World WideWeb Conference, WWW ’18, page 1439–1448, Republic and Canton of Geneva,CHE, 2018. International World Wide Web Conferences Steering Committee.

[12] Bryan Perozzi and Leman Akoglu. Discovering communities and anomalies inattributed graphs: Interactive visual exploration and summarization. ACM Trans.Knowl. Discov. Data, 12(2), January 2018.

[13] Bryan Perozzi, Michael Schueppert, Jack Saalweachter, and Mayur Thakur. Whenrecommendation goes wrong: Anomalous link discovery in recommendationnetworks. In Proceedings of the 22nd ACM SIGKDD International Conference onKnowledge Discovery and Data Mining, KDD ’16, page 569–578, New York, NY,USA, 2016. Association for Computing Machinery.

[14] Bryan Perozzi and Leman Akoglu. Scalable anomaly ranking of attributed neigh-borhoods, 2016.

[15] Francesco Bonchi, Ilaria Bordino, Francesco Gullo, and Giovanni Stilo. Theimportance of unexpectedness: Discovering buzzing stories in anomalous temporalgraphs. Web Intell., 17(3):177–198, 2019.

[16] F. Bonchi, I. Bordino, F. Gullo, and G. Stilo. Identifying buzzing stories viaanomalous temporal subgraph discovery. In 2016 IEEE/WIC/ACM InternationalConference on Web Intelligence (WI), pages 161–168, 2016.

[17] Hanghang Tong and Ching-Yung Lin. Non-Negative Residual Matrix Factorizationwith Application to Graph Anomaly Detection, pages 143–153.

[18] L. Chi, B. Li, X. Zhu, S. Pan, and L. Chen. Hashing for adaptive real-time graphstream classification with concept drifts. IEEE Transactions on Cybernetics,48(5):1591–1604, 2018.

[19] Shirui Pan, Kuan Wu, Yang Zhang, and Xue Li. Classifier ensemble for uncertaindata stream classification. In Mohammed J. Zaki, Jeffrey Xu Yu, B. Ravindran,and Vikram Pudi, editors, Advances in Knowledge Discovery and Data Mining,pages 488–495, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[20] Xiangnan Kong and P. S. Yu. An ensemble-based approach to fast classificationof multi-label data streams. In 7th International Conference on CollaborativeComputing: Networking, Applications and Worksharing (CollaborateCom), pages95–104, 2011.

[21] S. Pan, J. Wu, X. Zhu, and C. Zhang. Graph ensemble boosting for imbalancednoisy graph stream classification. IEEE Transactions on Cybernetics, 45(5):954–968, 2015.

[22] Shirui Pan, Xingquan Zhu, Chengqi Zhang, and P. S. Yu. Graph stream classi-fication using labeled and unlabeled graphs. In 2013 29th IEEE InternationalConference on Data Engineering (ICDE 2013), pages 398–409, Los Alamitos,CA, USA, apr 2013. IEEE Computer Society.

[23] Gyoung S. Na, Donghyun Kim, and Hwanjo Yu. Dilof: Effective and memoryefficient local outlier detection in data streams. In Proceedings of the 24th ACMSIGKDD International Conference on Knowledge Discovery &; Data Mining,KDD ’18, page 1993–2002, New York, NY, USA, 2018. Association for Comput-ing Machinery.

[24] Audrey Wilmet, Tiphaine Viard, Matthieu Latapy, and Robin Lamarche-Perrin.Outlier detection in ip traffic modelled as a link stream using the stability of degreedistributions over time. Computer Networks, 161:197 – 209, 2019.

[25] Audrey Wilmet, Tiphaine Viard, Matthieu Latapy, and Robin Lamarche-Perrin.Degree-based outlier detection within IP traffic modelled as a link stream. CoRR,abs/1906.02524, 2019.

[26] Emaad A. Manzoor, Hemank Lamba, and Leman Akoglu. xstream: Outlierdetection in feature-evolving data streams. In Yike Guo and Faisal Farooq, editors,Proceedings of the 24th ACM SIGKDD International Conference on KnowledgeDiscovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, pages1963–1972. ACM, 2018.

[27] S. Sathe and C. C. Aggarwal. Subspace outlier detection in linear time withrandomized hashing. In 2016 IEEE 16th International Conference on DataMining (ICDM), pages 459–468, 2016.

[28] Hadi Fanaee-T and João Gama. Tensor-based anomaly detection: An interdisci-plinary survey. Knowledge-Based Systems, 98:130–147, 2016.

[29] Weiren Yu, Charu C Aggarwal, Shuai Ma, and Haixun Wang. On anomaloushotspot discovery in graph streams. In ICDM, 2013.

[30] Stephen Ranshous, Steve Harenberg, Kshitij Sharma, and Nagiza F. Samatova.A Scalable Approach for Outlier Detection in Edge Streams Using Sketch-basedApproximations, pages 189–197. SDM, 2016.

[31] Dhivya Eswaran and Christos Faloutsos. Sedanspot: Detecting anomalies in edgestreams. In ICDM, 2018.

[32] Evangelos Papalexakis, Konstantinos Pelechrinis, and Christos Faloutsos. Spottingmisbehaviors in location-based social networks using tensors. In WWW, 2014.

[33] Hing-Hao Mao, Chung-Jung Wu, Evangelos E Papalexakis, Christos Faloutsos,Kuo-Chen Lee, and Tien-Cheu Kao. Malspot: Multi 2 malicious network behaviorpatterns analysis. In PAKDD, pages 1–14. Springer, 2014.

[34] Evangelos E Papalexakis, Christos Faloutsos, and Nicholas D Sidiropoulos. Par-cube: Sparse parallelizable tensor decompositions. In ECMLPKDD, 2012.

[35] Hadi Fanaee-T and João Gama. Multi-aspect-streaming tensor analysis.Knowledge-Based Systems, 89:332–345, 2015.

[36] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications.SIAM review, 51(3):455–500, 2009.

[37] Shuo Zhou, Nguyen Xuan Vinh, James Bailey, Yunzhe Jia, and Ian Davidson.Accelerating online cp decompositions for higher order tensors. In KDD, pages1375–1384. ACM, 2016.

[38] Lei Shi, Aryya Gangopadhyay, and Vandana P Janeja. Stensr: Spatio-temporaltensor streams for anomaly detection and pattern discovery. Knowledge andInformation Systems, 43(2):333–353, 2015.

[39] Jie Li, Guan Han, Jing Wen, and Xinbo Gao. Robust tensor subspace learning foranomaly detection. International Journal of Machine Learning and Cybernetics,2(2):89–98, 2011.

[40] Kijung Shin, Bryan Hooi, and Christos Faloutsos. M-zoom: Fast dense-blockdetection in tensors with quality guarantees. In ECMLPKDD, 2016.

[41] Kijung Shin, Bryan Hooi, Jisu Kim, and Christos Faloutsos. D-cube: Dense-blockdetection in terabyte-scale tensors. In WSDM, pages 681–689. ACM, 2017.

[42] Koji Maruhashi, Fan Guo, and Christos Faloutsos. Multiaspectforensics: Patternmining on large-scale heterogeneous networks with tensor analysis. In 2011International Conference on Advances in Social Networks Analysis and Mining,pages 203–210. IEEE, 2011.

[43] Meng Jiang, Alex Beutel, Peng Cui, Bryan Hooi, Shiqiang Yang, and ChristosFaloutsos. A general suspiciousness metric for dense blocks in multimodal data.In ICDM, 2015.

[44] Peter J Rousseeuw and Katrien Van Driessen. A fast algorithm for the minimumcovariance determinant estimator. Technometrics, 41(3):212–223, 1999.

[45] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. Lof:Identifying density-based local outliers. In SIGMOD, 2000.

[46] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In 2008 EighthIEEE International Conference on Data Mining, pages 413–422. IEEE, 2008.

[47] Sudipto Guha, Nina Mishra, Gourav Roy, and Okke Schrijvers. Robust randomcut forest based anomaly detection on streams. In International conference onmachine learning, pages 2712–2721, 2016.

[48] Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Dae-ki Cho, and Haifeng Chen. Deep autoencoding gaussian mixture model forunsupervised anomaly detection. In ICLR, 2018.

[49] Hyunjun Ju, Dongha Lee, Junyoung Hwang, Junghyun Namkung, and Hwanjo Yu.Pumad: Pu metric learning for anomaly detection. Information Sciences, 523:167– 183, 2020.

[50] Shuangfei Zhai, Yu Cheng, Weining Lu, and Zhongfei Zhang. Deep structuredenergy based models for anomaly detection. arXiv preprint arXiv:1605.07717,2016.

[51] Rithesh Kumar, Anirudh Goyal, Aaron Courville, and Yoshua Bengio. Maximumentropy generators for energy-based models. arXiv preprint arXiv:1901.08508,2019.

[52] P. C. Ngo, A. A. Winarto, C. K. L. Kou, S. Park, F. Akram, and H. K. Lee.Fence gan: Towards better anomaly detection. In 2019 IEEE 31st InternationalConference on Tools with Artificial Intelligence (ICTAI), pages 141–148, 2019.

[53] Tyrone Gradison and Evimaria Terzi. Intrusion Detection Technology, pages2031–2036. Springer New York, New York, NY, 2018.

[54] E. Bertino, E. Terzi, A. Kamra, and A. Vakali. Intrusion detection in rbac-administered databases. In 21st Annual Computer Security Applications Confer-ence (ACSAC’05), pages 10 pp.–182, 2005.

[55] Azeem Aqil, Karim Khalil, Ahmed O.F. Atya, Evangelos E. Papalexakis,Srikanth V. Krishnamurthy, Trent Jaeger, K. K. Ramakrishnan, Paul Yu, andAnanthram Swami. Jaal: Towards network intrusion detection at isp scale. InProceedings of the 13th International Conference on Emerging Networking EX-periments and Technologies, CoNEXT ’17, page 134–146, New York, NY, USA,2017. Association for Computing Machinery.

[56] Wei Wang, Xiaohong Guan, Xiangliang Zhang, and Liwei Yang. Profiling programbehavior for anomaly intrusion detection based on the transition and frequencyproperty of computer audit data. Comput. Secur., 25(7):539–550, October 2006.

[57] Wei Wang, Thomas Guyet, René Quiniou, Marie-Odile Cordier, Florent Masseglia,and Xiangliang Zhang. Autonomic intrusion detection: Adaptively detectinganomalies over unlabeled audit data streams in computer networks. Knowledge-Based Systems, 70:103 – 117, 2014.

[58] A. Ziviani, A. T. A. Gomes, M. L. Monsores, and P. S. S. Rodrigues. Networkanomaly detection using nonextensive entropy. IEEE Communications Letters,11(12):1034–1036, 2007.

[59] Smitha Rajagopal, Poornima Panduranga Kundapur, and Katiganere SiddaramappaHareesha. A stacking ensemble for network intrusion detection using heteroge-neous datasets. Secur. Commun. Networks, 2020:4586875:1–4586875:9, 2020.

[60] Smitha Rajagopal, Katiganere Siddaramappa Hareesha, and Poornima PandurangaKundapur. Feature relevance analysis and feature reduction of unsw nb-15 usingneural networks on mamls. In Bibudhendu Pati, Chhabi Rani Panigrahi, Rajku-mar Buyya, and Kuan-Ching Li, editors, Advanced Computing and IntelligentEngineering - Proceedings of ICACIE 2018, Advances in Intelligent Systems and


Computing, pages 321–332, France, January 2020. Springer Paris. 3rd Interna-tional Conference on Advanced Computing and Intelligent Engineering, ICACIE2018 ; Conference date: 22-12-2018 Through 24-12-2018.

[61] Paulo Campos Souza, Augusto Guimarães, Thiago Rezende, Vinicius Araujo, andVanessa Araujo. Detection of anomalies in large-scale cyberattacks using fuzzyneural networks. AI, 1:92–116, 02 2020.

[62] Hanghang Tong, Chongrong Li, Jingrui He, Jiajian Chen, Quang-Anh Tran, HaixinDuan, and Xing Li. Anomaly internet network traffic detection by kernel principlecomponent classifier. In Proceedings of the Second International Conference onAdvances in Neural Networks - Volume Part III, ISNN’05, page 476–481, Berlin,Heidelberg, 2005. Springer-Verlag.

[63] K. Hassine, A. Erbad, and R. Hamila. Important complexity reduction of ran-dom forest in multi-classification problem. In 2019 15th International WirelessCommunications Mobile Computing Conference (IWCMC), pages 226–231, 2019.

[64] R. Vinayakumar, M. Alazab, K. P. Soman, P. Poornachandran, A. Al-Nemrat, andS. Venkatraman. Deep learning approach for intelligent intrusion detection system.IEEE Access, 7:41525–41550, 2019.

[65] F. A. Khan, A. Gumaei, A. Derhab, and A. Hussain. A novel two-stage deeplearning model for efficient network intrusion detection. IEEE Access, 7:30373–30385, 2019.

[66] Moses S Charikar. Similarity estimation techniques from rounding algorithms. InSTOC, 2002.

[67] Witold Litwin. Linear hashing: a new tool for file and table addressing. In VLDB,volume 80, pages 1–3, 1980.

[68] Graham Cormode and Shan Muthukrishnan. An improved data stream summary:the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75,2005.

[69] Karl Pearson. Liii. on lines and planes of closest fit to systems of points in space.The London, Edinburgh, and Dublin Philosophical Magazine and Journal ofScience, 2(11), 1901.

[70] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottle-neck method. arXiv preprint physics/0004057, 2000.

[71] Geoffrey E Hinton and Richard S Zemel. Autoencoders, minimum descriptionlength and helmholtz free energy. In NIPS, 1994.

[72] Artemy Kolchinsky, Brendan D. Tracey, and David H. Wolpert. Nonlinear information bottleneck, 2017.

[73] KDD Cup 1999. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.

[74] Markus Ring, Sarah Wunderlich, Deniz Scheuring, Dieter Landes, and Andreas Hotho. A survey of network-based intrusion detection data sets. Computers & Security, 2019.

[75] Iman Sharafaldin, Arash Habibi Lashkari, and Ali A. Ghorbani. Toward generating a new intrusion detection dataset and intrusion traffic characterization. In ICISSP, 2018.

[76] Nour Moustafa and Jill Slay. UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In MilCIS, pages 1–6. IEEE, 2015.

[77] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. JMLR, 2011.


APPENDIX

6.1 Influence of temporal decay factor

Table 5 shows the influence of the temporal decay factor α on the ROC-AUC for MSTREAM on the CICIDS-DoS dataset. We see that α = 0.95 gives the maximum ROC-AUC for MSTREAM (0.9326 ± 0.0006), as also shown in Table 3.

Table 5: Influence of temporal decay factor α on the ROC-AUC in MSTREAM on the CICIDS-DoS dataset.

α      ROC-AUC

0.1    0.9129 ± 0.0004
0.2    0.9142 ± 0.0009
0.3    0.9156 ± 0.0006
0.4    0.9164 ± 0.0014
0.5    0.9163 ± 0.0005
0.6    0.9170 ± 0.0005
0.7    0.9196 ± 0.0015
0.8    0.9235 ± 0.0003
0.9    0.9290 ± 0.0003
0.95   0.9326 ± 0.0006

6.2 Influence of dimensions

Table 6 shows the influence of the output dimensions on the ROC-AUC for MSTREAM-PCA, MSTREAM-IB, and MSTREAM-AE on the KDDCUP99 dataset. We see that all methods are robust to the variation in output dimensions.

Table 6: Influence of Output Dimensions on the ROC-AUC of MSTREAM-PCA, MSTREAM-IB, and MSTREAM-AE on the KDDCUP99 dataset.

Dimensions  MSTREAM-PCA  MSTREAM-IB  MSTREAM-AE

4           0.93         0.95        0.95
8           0.94         0.95        0.93
12          0.92         0.96        0.96
16          0.87         0.96        0.96

6.3 Dimensionality Reduction

For MSTREAM-IB, we used an online implementation (https://github.com/burklight/nonlinear-IB-PyTorch) of the underlying Information Bottleneck algorithm, with β = 0.5 and the variance parameter set to a constant value of 1. The network was implemented as a 2-layer binary classifier. For MSTREAM-AE, the encoder and decoder were implemented as single layers with ReLU activation. We used the Adam optimizer to train both these networks, with β1 = 0.9 and β2 = 0.999. Grid search was used for hyperparameter tuning: the learning rate was searched over [1e-2, 1e-3, 1e-4, 1e-5], and the number of epochs was searched over [100, 200, 500, 1000]. The final values can be found in Table 7.
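The grid search amounts to evaluating every (learning rate, epochs) pair and keeping the best-scoring one; a sketch with a synthetic scoring function (in the actual tuning, the scorer would train the network and return a validation metric):

```python
import itertools

def grid_search(train_and_score, lrs, epoch_counts):
    """Try every (learning rate, epochs) pair; return (score, lr, epochs) of the best."""
    best = None
    for lr, ep in itertools.product(lrs, epoch_counts):
        score = train_and_score(lr, ep)
        if best is None or score > best[0]:
            best = (score, lr, ep)
    return best

# Synthetic scorer that peaks at lr=1e-3, epochs=200 (illustrative only).
best = grid_search(lambda lr, ep: -abs(lr - 1e-3) - abs(ep - 200) / 1e6,
                   [1e-2, 1e-3, 1e-4, 1e-5], [100, 200, 500, 1000])
```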

Table 7: MSTREAM-IB and MSTREAM-AE parameters for different datasets.

          MSTREAM-IB               MSTREAM-AE
Dataset   Learning Rate  Epochs    Learning Rate  Epochs

KDD       1e-2           100       1e-2           100
DoS       1e-5           200       1e-2           1000
UNSW      1e-2           100       1e-2           100
DDoS      1e-3           200       1e-3           100

6.4 Evaluating ROC-AUC in a streaming manner

Table 8 shows the ROC-AUC for MSTREAM-AE on KDDCUP99 when evaluated over the stream. The evaluation is done on all records seen so far and is performed after every 100K records. We see that as the stream length increases, the ROC-AUC for MSTREAM-AE converges to 0.96, as also shown in Table 3.
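Streaming evaluation of this kind (re-scoring the prefix of the stream every fixed number of records) can be sketched as follows; the inline AUC helper uses the rank-statistic form and assumes each prefix contains both classes:

```python
def streaming_auc(scores, labels, every):
    """ROC-AUC on all records seen so far, recomputed every `every` records.
    Assumes each evaluated prefix contains at least one anomaly and one normal."""
    def auc(s, y):
        pos = [a for a, b in zip(s, y) if b == 1]
        neg = [a for a, b in zip(s, y) if b == 0]
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return wins / (len(pos) * len(neg))
    return [(n, auc(scores[:n], labels[:n]))
            for n in range(every, len(scores) + 1, every)]
```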

Table 8: Evaluating ROC-AUC of MSTREAM-AE in a streaming manner on the KDDCUP99 dataset.

Stream Size  ROC-AUC

100K         0.912488
200K         0.895391
300K         0.855598
400K         0.934532
500K         0.965250
600K         0.953906
700K         0.947531
800K         0.961340
900K         0.973217
1000K        0.970212
1100K        0.967215
1200K        0.959664