NETWORK CENTRIC TRAFFIC ANALYSIS
By
JIEYAN FAN
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2007
To those who sparked my interest in science, opening for me the door to discovering
nature and letting me walk through it in my own way.
ACKNOWLEDGMENTS
First of all, I thank my advisor, Professor Dapeng Wu, for his great inspiration, excellent
guidance, deep thoughts, and friendship. I also thank my supervisory committee members,
Professors Shigang Chen, Liuqing Yang, and Tao Li, for their interest in my work.
I also express my appreciation to all of the faculty, staff, and my fellow students
in the Department of Electrical and Computer Engineering. In particular, I extend my
thanks to Dr. Kejie Lu for his helpful discussions.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.1 Introduction to Network Anomaly Detection . . . . . . . . . . . . . . . 14
1.2 Introduction to Network Centric Traffic Classification . . . . . . . . . . 16
2 NETWORK ANOMALY DETECTION FRAMEWORK . . . . . . . . . . . . . 18
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Edge-Router Based Network Anomaly Detection Framework . . . . . . 18
2.2.1 Traffic Monitor . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.2 Local Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.3 Global Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 FEATURES FOR NETWORK ANOMALY DETECTION . . . . . . . . . . . . 23
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Hierarchical Feature Extraction Architecture . . . . . . . . . . . . . . . 24
3.2.1 Three-Level Design . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.2 Feature Extraction in a Traffic Monitor . . . . . . . . . . . . . 26
3.2.3 Feature Extraction in a Local Analyzer or a Global Analyzer . . 27
3.3 Two-Way Matching Features . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.2 Definition of Two-Way Matching Features . . . . . . . . . . . . 30
3.4 Basic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.1 Hash Table Algorithm . . . . . . . . . . . . . . . . . . . . . . 32
3.4.2 Bloom Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Bloom Filter Array (BFA) . . . . . . . . . . . . . . . . . . . . . . . . 35
3.5.1 Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.5.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.5.3 Round Robin Sliding Window . . . . . . . . . . . . . . . . . . 38
3.5.4 Random-Keyed Hash Functions . . . . . . . . . . . . . . . . . 39
3.6 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.6.1 Space/Time Trade-off . . . . . . . . . . . . . . . . . . . . . . . 41
3.6.2 Optimal Parameter Setting for Bloom Filter Array . . . . . . . 50
3.7 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.7.1 The BFA Algorithm vs. the Hash Table Algorithm . . . . . . . 51
3.7.2 Experiment on Feature Extraction System . . . . . . . . . . . . 55
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4 MACHINE LEARNING ALGORITHM FOR NETWORK ANOMALY DETECTION . . 59
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1.1 Receiver Operating Characteristics Curve . . . . . . . . . . . . 59
4.1.2 Threshold-Based Algorithm . . . . . . . . . . . . . . . . . . . 60
4.1.3 Change-Point Algorithm . . . . . . . . . . . . . . . . . . . . . 60
4.1.4 Bayesian Decision Theory . . . . . . . . . . . . . . . . . . . . 62
4.2 Bayesian Model for Network Anomaly Detection . . . . . . . . . . . . . 64
4.2.1 Bayesian Model for Traffic Monitors and Local Analyzers . . . 64
4.2.2 Bayesian Model for Global Analyzers . . . . . . . . . . . . . . 66
4.2.3 Hidden Markov Tree (HMT) Model for Global Analyzer . . . . 68
4.3 Estimation of HMT Parameters . . . . . . . . . . . . . . . . . . . . . . 72
4.3.1 Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . 72
4.3.2 Transition Probability Estimation . . . . . . . . . . . . . . . . 76
4.4 Network Anomaly Detection Using HMT . . . . . . . . . . . . . . . . . 81
4.5 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.5.1 Experiment Setting . . . . . . . . . . . . . . . . . . . . . . . . 84
4.5.2 Performance Comparison . . . . . . . . . . . . . . . . . . . . . 86
4.5.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5 NETWORK CENTRIC TRAFFIC CLASSIFICATION: AN OVERVIEW . . . . 90
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.3 Intuitions Behind a Proper Detection of Voice and Video Streams . . . 95
5.3.1 Packet Inter-Arrival Time and Packet Size in Time Domain . . 97
5.3.2 Packet Inter-Arrival Time in Frequency Domain . . . . . . . . 99
5.3.3 Packet Size in Frequency Domain . . . . . . . . . . . . . . . . 99
5.3.4 Combining Packet Inter-Arrival Time and Packet Size in Frequency Domain . . 100
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6 NETWORK CENTRIC TRAFFIC CLASSIFICATION SYSTEM . . . . . . . . 104
6.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.1.1 Flow Summary Generator (FSG) . . . . . . . . . . . . . . . . . 105
6.1.2 Feature Extractor (FE) and Voice/Video Subspace Generator (SG) . . 105
6.1.3 Voice/Video Classifier (CL) . . . . . . . . . . . . . . . . . . . 106
6.2 Feature Extractor (FE) Module via Power Spectral Density (PSD) . . . 107
6.2.1 Modeling the Network Flow as a Stochastic Digital Process . . 107
6.2.2 Power Spectral Density (PSD) Computation . . . . . . . . . . 108
6.3 Subspace Decomposition and Bases Identification on PSD Features . . . 115
6.3.1 Subspace Decomposition Based on Minimum Coding Length . . 117
6.3.2 Subspace Bases Identification . . . . . . . . . . . . . . . . . . 120
6.4 Voice/Video Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.5 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.5.1 Experiment Settings . . . . . . . . . . . . . . . . . . . . . . . 123
6.5.2 Skype Flow Classification . . . . . . . . . . . . . . . . . . . . 124
6.5.3 General Flow Classification . . . . . . . . . . . . . . . . . . . 124
6.5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7 CONCLUSION AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . 129
7.1 Summary of Network Centric Anomaly Detection . . . . . . . . . . . . 129
7.2 Summary of Network Centric Traffic Classification . . . . . . . . . . . . 131
APPENDIX
A PROOFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
A.1 Equation (4–31) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
A.2 Equation (4–32) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
A.3 Equation (4–33) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
A.4 Equation (4–34) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
LIST OF TABLES
Table page
3-1 Notations for two-way matching features . . . . . . . . . . . . . . . . . . . . . . 31
3-2 Notations for complexity analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3-3 Space/time complexity for hash table, Bloom filter, and BFA . . . . . . . . . . . 47
4-1 Parameters used in CUSUM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4-2 Notations for hidden Markov tree model . . . . . . . . . . . . . . . . . . . . . . 70
4-3 Parameter setting of feature extraction for network anomaly detection . . . . . . 86
4-4 Performance of different schemes. . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5-1 Commonly used speech codecs and their specifications . . . . . . . . . . . . . . . 96
6-1 Typical PD and PFA values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
LIST OF FIGURES
Figure page
2-1 An ISP network architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2-2 Network anomaly detection framework. . . . . . . . . . . . . . . . . . . . . . . . 19
2-3 Responsibilities of and interactions among the traffic monitor, local analyzer, and global analyzer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2-4 Example of asymmetric traffic whose feature extraction is done by the global analyzer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3-1 Hierarchical structure for feature extraction. . . . . . . . . . . . . . . . . . . . . 24
3-2 Network in normal condition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3-3 Source-address-spoofed packets. . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3-4 Reroute. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3-5 Hash Table Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3-6 Bloom Filter Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3-7 Scenarios of the problems caused by the Bloom filter. (a) Boundary problem. (b) An outbound packet arrives before its matched inbound packet with t2 − t1 < Γ. . . 34
3-8 Bloom Filter Array Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3-9 Bloom Filter Array Algorithm using sliding window . . . . . . . . . . . . . . . . 38
3-10 Space/time trade-off for the hash table, BFA with η = 0.1%, and BFA with η = 1% . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3-11 Relation among space complexity, time complexity, and collision probability. (a) M∗a vs. η. (b) E[Ta]∗ vs. η. . . . . . . . . . . . . . . . . . . . . . . . 50
3-12 Space complexity vs. collision probability for fixed time complexity. . . . . . . . 52
3-13 Memory size (in bits) vs. average processing time per query (in µs) . . . . . . . 53
3-14 Average processing time per query (in µs) vs. average number of hash functioncalculations per query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3-15 Comparison of numerical and simulation results. (a) Hash table algorithm. (b)BFA algorithm with η=1%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3-16 Feature data: (a) Number of SYN packets (link 1), (b) Number of unmatched SYN packets (link 1), (c) Number of SYN packets (link 2), and (d) Number of unmatched SYN packets (link 2). . . . . . . . . . . . . . . . . . . . . . . . 58
4-1 Generative process in graphical representation, in which the traffic state generatesthe stochastic process of traffic. . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4-2 Extended generative model including traffic feature vectors: (a) original modeland (b) simplified model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4-3 Generative independent model that describes dependencies among traffic statesand traffic feature vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4-4 Generative dependent model that describes dependencies among edge routers. . 67
4-5 Hidden Markov tree model. For a node i, ρ(i) denotes its parent node and ν(i) denotes the set of its child nodes. . . . . . . . . . . . . . . . . . . . . . . . 69
4-6 Probability density function of the univariate Gaussian distribution N (x; 0, 1). . 73
4-7 Histogram of the two-way matching features measured at a real network duringnetwork anomalies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4-8 The EM algorithm for estimating p(φi|Ωi = u), i ∈ Ξ, u ∈ {0, 1}. . . . . . . . . . 75
4-9 Iteratively estimate transition probabilities. . . . . . . . . . . . . . . . . . . . . 77
4-10 Belief propagation algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4-11 Viterbi algorithm for HMT decoding. . . . . . . . . . . . . . . . . . . . . . . . . 82
4-12 Experiment Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4-13 Performance of threshold-based and machine learning algorithms with differentfeature data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4-14 Performance of four detection algorithms . . . . . . . . . . . . . . . . . . . . . . 88
5-1 Average packet size versus inter-arrival variability metric for five applications: voice, video, file transfer, file transfer mixed with voice, and file transfer mixed with video. . . 96
5-2 Inter-arrival time distribution for voice and video traffic . . . . . . . . . . . . . . 97
5-3 Packet size distribution for voice and video traffic . . . . . . . . . . . . . . . . . 98
5-4 Power spectral density of two sequences/traces of time-varying inter-arrival timesfor voice traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5-5 Power spectral density of two sequences of time-varying inter-arrival times forvideo traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5-6 Power spectral density of two sequences of discrete-time packet sizes for voicetraffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5-7 Power spectral density of two sequences of discrete-time packet sizes for videotraffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5-8 Power spectral density of two sequences of continuous-time packet sizes for voicetraffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5-9 Power spectral density of two sequences of continuous-time packet sizes for videotraffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6-1 VOVClassifier System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 104
6-2 Power spectral density feature extraction module: a cascade of processing steps. . 107
6-3 Levinson-Durbin Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6-4 Parametric PSD Estimate using Levinson-Durbin Algorithm. . . . . . . . . . . . 114
6-5 Pairwise steepest descent method to achieve minimal coding length. . . . . . . . 119
6-6 Function IdentifyBases identifies bases of subspace. . . . . . . . . . . . . . . . . 120
6-7 Function VoiceVideoClassify determines whether a flow with PSD feature vector ~ψ is of type voice, video, or neither. θ1 and θ2 are two user-specified threshold arguments. Function VoiceVideoClassify uses Function NormalizedDistance to calculate the normalized distance between a feature vector and a subspace. . . . . 122
6-8 The ROC curves of single-typed flows generated by Skype, (a) VOICE and (b)VIDEO. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6-9 The ROC curves of hybrid flows generated by Skype, (a) VOICE, (b) VIDEO,(c) FILE+VOICE, and (d) FILE+VIDEO. . . . . . . . . . . . . . . . . . . . . . 125
6-10 The ROC curves of single-typed flows generated by Skype, MSN, and GTalk:(a) VOICE and (b) VIDEO. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6-11 The ROC curves of hybrid flows generated by Skype, MSN, and GTalk: (a) VOICE,(b) VIDEO, (c) FILE+VOICE, and (d) FILE+VIDEO. . . . . . . . . . . . . . . 127
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
NETWORK CENTRIC TRAFFIC ANALYSIS
By
Jieyan Fan
December 2007
Chair: Dapeng Oliver Wu
Major: Electrical and Computer Engineering
Over the past few years, the Internet infrastructure has become a critical part of the
global communications fabric. Emergence of new applications and protocols (such as voice
over Internet Protocol, peer-to-peer, and video on demand) also increases the complexity
of the Internet. All these trends increase the demand for more reliable and secure service.
This has heightened the interest of Internet service providers (ISPs) in network centric
traffic analysis.
Our study considers network centric traffic analysis from the two perspectives
that most interest ISPs: network centric anomaly detection, and network centric traffic
classification.
In the first part of our research, we focus on network centric anomaly detection.
Despite the rapid advance in networking technologies, detection of network anomalies
at high-speed switches/routers is still far from maturity. To push the frontier, two
major technologies need to be addressed. The first is efficient feature-extraction
algorithms/hardware that can match a line rate on the order of Gb/s. The second is fast
and effective anomaly detection schemes. Our study addresses both issues. The novelties
of our scheme are the following. First, we design an edge-router based framework that
detects network anomalies as they first enter an ISP’s network. Second, we propose the
so-called two-way matching features, which are effective indicators of network anomalies.
We also design a data structure to extract the features efficiently. Our detection scheme
exploits both temporal and spatial correlations among network traffic. Simulation results
show that our scheme can detect network anomalies with high accuracy, even if the volume
of abnormal traffic on each link is extremely small.
In the second part, we focus on network centric traffic classification. Nowadays,
VoIP and IPTV are becoming increasingly popular. To tap the potential profits that VoIP
and IPTV offer, carrier networks must efficiently and accurately manage and track the
delivery of IP services. Yet, the emergence of a wave of new zero-day voice and video
applications such as Skype, Google Talk, and MSN poses tremendous challenges for ISPs.
The traditional approach of classifying traffic by port number is infeasible because these
applications use dynamic port numbers, and the proliferation of proprietary protocols and
the use of encryption make application-level analysis infeasible as well. Our study focuses
on a statistical pattern classification technique to identify multimedia traffic. In particular,
we focus on detecting and classifying voice and video traffic. We propose a system
(VOVClassifier ) for voice and video traffic classification that uses the regularities residing
in multimedia streams. Experimental results demonstrate the effectiveness and robustness
of our approach.
CHAPTER 1
INTRODUCTION
Over the past few years, the Internet infrastructure has become a critical part of the
global communications fabric. A survey by the Internet Systems Consortium (ISC) shows
that the number of hosts advertised in the domain name system (DNS) [1, 2] has risen from
approximately 9,472,000 in January 1996 to 394,991,609 in January 2006. In addition, the
emergence of new applications and protocols, such as voice over Internet Protocol (VoIP),
peer-to-peer (P2P), and video on demand (VoD) [3], also increases the complexity of the
Internet. Accompanying this trend is an increasing demand for more reliable and secure
service. A major challenge for Internet service providers (ISP) is to better understand the
network state by analyzing network traffic in real time. Thus ISPs are very interested in
the problem of network centric traffic analysis.
We consider the network centric traffic analysis problem from two perspectives: 1)
network anomaly detection and 2) network centric traffic classification. We introduce the
two perspectives in the next two sections.
1.1 Introduction to Network Anomaly Detection
With the rapid growth of the Internet, detecting network anomalies has become a major
concern in both industry and academia, since it is critical to maintaining the availability
of network services. Abnormal network behavior is usually a symptom of potential
unavailability, in that:
• Network anomalies are usually caused by malicious behavior, such as denial-of-service (DoS) attacks, distributed denial-of-service (DDoS) attacks, worm propagation, network scans, or email spam;
• Even when caused unintentionally, a network anomaly is often accompanied by network congestion or router failures.
However, detecting network anomalies is not an easy task, especially at high-speed
routers. One of the main difficulties arises from the fact that the data rate is too high to
afford complicated data processing. An anomaly detection algorithm usually works with
traffic features instead of the original traffic data itself. Traffic features can be regarded as
succinct summaries of the voluminous traffic (e.g., the traffic data rate is a feature of the
traffic). We study two major issues in feature extraction for network anomaly detection:
• what features to extract (i.e., which features best distinguish normal from abnormal network states);
• how to extract features efficiently enough to keep up with the line rate of high-speed routers (e.g., on the order of Gb/s).
Our research addresses both issues.
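To make the notion of a traffic feature concrete, the sketch below reduces a packet trace to compact per-window summaries (packet rate, byte rate, number of distinct sources). The trace format and the particular features are illustrative assumptions, not the feature set developed later in this dissertation:

```python
def extract_features(packets, window=1.0):
    """Summarize a packet trace into per-window feature vectors.

    `packets` is a list of (timestamp, src_ip, size) tuples -- a
    hypothetical trace format chosen purely for illustration.
    """
    stats = {}
    for ts, src, size in packets:
        w = int(ts // window)  # index of the time window this packet falls in
        s = stats.setdefault(w, {"pkts": 0, "bytes": 0, "srcs": set()})
        s["pkts"] += 1
        s["bytes"] += size
        s["srcs"].add(src)
    # Reduce each window to (packet rate, byte rate, distinct sources).
    return {w: (s["pkts"] / window, s["bytes"] / window, len(s["srcs"]))
            for w, s in stats.items()}

trace = [(0.1, "10.0.0.1", 60), (0.4, "10.0.0.2", 1500), (1.2, "10.0.0.1", 60)]
print(extract_features(trace))  # {0: (2.0, 1560.0, 2), 1: (1.0, 60.0, 1)}
```

The point of such a summary is exactly the one made above: a detector works on a few numbers per window rather than on the voluminous raw traffic.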
In addition to traffic feature extraction, another difficulty lies in classification
of network state based on extracted features. Given the same feature set, different
classification schemes have different performance. The difficulty lies in how to efficiently
but accurately make decisions on network state. In this study, we address this problem by
designing a machine learning algorithm that exploits spatial correlations among edge
routers.
Specifically, our major contributions in network anomaly detection include, but are not
limited to,
• designing a framework, deployed on edge routers, that detects network anomalies based on both local and global information;
• proposing the so-called two-way matching features, which sharply distinguish normal from abnormal network states, and designing a data structure, the Bloom filter array, to extract the two-way matching features efficiently;
• designing a machine learning algorithm that detects network anomalies accurately, by exploiting spatial correlations among edge routers, and efficiently, by employing the hidden Markov tree data structure.
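As background for the Bloom filter array mentioned above, the standard Bloom filter it builds on can be sketched as follows. This is the generic textbook structure, not the BFA variant developed in Chapter 3, and the flow-key string format is purely illustrative:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions in an m-slot bit array."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive k independent positions by salting a cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # May return a false positive, but never a false negative.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("10.0.0.1:80->10.0.0.2:1234/SYN")
print("10.0.0.1:80->10.0.0.2:1234/SYN" in bf)   # True
print("10.9.9.9:80->10.0.0.2:1234/SYN" in bf)   # almost surely False
```

The appeal for line-rate feature extraction is that both `add` and the membership test touch a constant number of memory locations, at the cost of a tunable false-positive probability.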
Analysis and simulation results show that our framework is capable of detecting
network anomalies accompanied by low-volume traffic, which is of much importance for
detecting network anomalies early. For example, for low-volume DDoS attacks,
given the same false alarm probability, our scheme has a detection probability of 0.97,
whereas the existing scheme has a detection probability of 0.17, which demonstrates the
superior performance of our scheme.
1.2 Introduction to Network Centric Traffic Classification
Besides network anomaly detection, classification of normal network traffic is also
of practical significance to both enterprise network administrators and ISPs. Along
with the rapid emergence of new types of network applications such as VoIP, VoD, and
P2P file exchange, quality of service (QoS) becomes a more and more important issue.
For example, transmission of real-time voice and video has bandwidth, delay, and loss
requirements. However, there is no QoS guarantee for these real-time applications over the
current best-effort network. Many schemes are proposed to address this problem. On the
other hand, enterprise network administrators may want to restrict network bandwidth
used by disallowed VoIP, VoD, or P2P applications, if not totally block, which might be
too rude. That is, they want to limit the QoS of specific network traffic.
Wu et al.[4] summarized techniques for QoS provision for real-time streams from
the point of view of end hosts. These techniques include coding methods, protocols,
and requirements on stream servers. Another effective solution is from the point of view
of network carriers or ISPs. For example, ISPs can assign different forwarding priority
to different types of network traffic on routers. This is the motivation of differentiated
services (DiffServ)[5, 6].
DiffServ is a method designed to guarantee different levels of QoS for different classes
of network traffic. It is achieved by setting the “type of service” (TOS) [7] field, which
in this role is also called the DiffServ code point (DSCP) [5], in the IP header according
to the class of the network data, so that higher-priority classes get higher values.
Unfortunately, such a design depends heavily on network protocols, especially proprietary
protocols, observing the DiffServ regulations. In the worst case, if all protocols set the
TOS field to the highest value, employing DiffServ is worse than not employing it at all.
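For illustration, on platforms that expose the `IP_TOS` socket option (e.g., Linux), an application can mark its own packets with any DSCP value it likes; this sketch uses the standard Expedited Forwarding code point and assumes a Berkeley-sockets API:

```python
import socket

# DSCP occupies the upper six bits of the former TOS byte; the standard
# "Expedited Forwarding" code point is 46, so the byte value is 46 << 2.
EF_TOS = 46 << 2  # 0xb8 = 184

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, EF_TOS)
print(sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS))  # 184
sock.close()
```

This is precisely the weakness noted above: nothing prevents every application from requesting the highest class, which is why a classifier that ignores the marking is needed.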
For this reason, we believe a proper DiffServ scheme should be able to classify
network traffic on the fly, instead of relying on any tags in the packet header. Thus, the
difficulty lies in accurate classification of network traffic in real-time.
Yet, the emergence of a wave of new zero-day voice and video applications such as
Skype, Google Talk, and MSN poses tremendous challenges for ISPs. The traditional
approach of using port numbers to classify traffic is infeasible due to these applications'
use of dynamic port numbers. In the second part of our research, we focus on a statistical
pattern classification technique to identify multimedia traffic. Based on the intuitions that
voice and video data streams show strong regularities in the packet inter-arrival times
and the associated packet sizes when combined together in one single stochastic process,
we propose a system, called VOVClassifier , for voice and video traffic classification.
VOVClassifier is an automated self-learning system that classifies traffic data by extracting
features from frequency domain using Power Spectral Density analysis and grouping
features using Subspace Decomposition. We applied VOVClassifier to real packet traces
collected from different network scenarios. Results demonstrate the effectiveness and
robustness of our approach.
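The frequency-domain intuition behind VOVClassifier can be illustrated with a non-parametric estimate: a periodogram of a flow's inter-arrival-time sequence exposes the strong periodicity of voice traffic. This FFT-based sketch is only an illustration with a synthetic 20 ms voice-like trace; the system's feature extractor described in Chapter 6 uses a parametric (Levinson-Durbin) PSD estimate instead:

```python
import numpy as np

def psd_of_interarrivals(timestamps):
    """Periodogram of the inter-arrival-time sequence of one flow."""
    iat = np.diff(np.asarray(timestamps, dtype=float))
    iat = iat - iat.mean()  # remove the DC component
    # Squared magnitude of the real FFT, normalized by sequence length.
    return np.abs(np.fft.rfft(iat)) ** 2 / len(iat)

# A voice-like flow: one packet every 20 ms with small Gaussian jitter.
rng = np.random.default_rng(0)
ts = np.cumsum(0.020 + 0.001 * rng.standard_normal(512))
spectrum = psd_of_interarrivals(ts)
print(spectrum.shape)  # (256,)
```

A regular codec produces energy concentrated at its packetization rate, whereas bursty file-transfer traffic spreads energy across the spectrum, which is what makes the PSD a discriminative feature.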
CHAPTER 2
NETWORK ANOMALY DETECTION FRAMEWORK
2.1 Introduction
The first issue of network anomaly detection is to design a framework. There are
two types of network anomaly detection frameworks, i.e., host-based frameworks and
network-based frameworks. Host-based frameworks are deployed on end-hosts. These
frameworks typically use firewall and intrusion detection systems (IDS), and/or balance
the load among multiple (geographically dispersed) servers to defend against network
anomalies. Host-based approaches can help protect the server system, but they may not
be able to protect legitimate access to the server, because high-volume abnormal traffic
may congest the incoming link to the server.
On the other hand, network-based frameworks are deployed inside networks, e.g.,
on routers. These frameworks are responsible for detecting network anomalies and
identifying abnormal packets/flows or anomaly sources. To detect network anomalies,
signal processing techniques (e.g., wavelet [8], spectral analysis [9, 10], statistical methods
[11–13]), and machine learning techniques [14] can be used. To identify network anomaly
sources, IP traceback [15] is typically used. IP traceback techniques can help contain
the attack sources, but they require large-scale deployment of the same IP traceback
technique and modification of existing IP forwarding mechanisms (e.g., IP header
processing).
This chapter presents our network anomaly detection framework, which is of the
network-based category. We present our framework design in Section 2.2 and summarize
this chapter in Section 2.3.
2.2 Edge-Router Based Network Anomaly Detection Framework
To detect network anomalies in an ISP network, we designed an edge-router based
network anomaly detection framework. The motivation results from an ISP network
architecture (Figure 2-1). It consists of two types of IP routers, i.e., core routers and edge
Figure 2-1. An ISP network architecture.
routers. Core routers interconnect with one another to form a high-speed autonomous
system (AS). In contrast, edge routers are responsible for connecting subnets (i.e.,
customer networks or other ISP networks) with the AS. In this study, a subnet can be
either a customer network or another ISP network.
Figure 2-2. Network anomaly detection framework.
Figure 2-3. Responsibilities of and interactions among the traffic monitor, local analyzer,and global analyzer.
Given such ISP network architecture, we design a framework to detect network
anomalies. Our framework (Figure 2-2) consists of three types of components: traffic
monitors, local analyzers, and a global analyzer. Figure 2-3 summarizes the functionalities
of each type of component and their interactions. Next, we discuss the functionalities of
traffic monitors, local analyzers, and global analyzer in Sections 2.2.1, 2.2.2, and 2.2.3,
respectively.
2.2.1 Traffic Monitor
A traffic monitor (represented by a filled oval in Figure 2-2) is responsible for:
• scanning partial or all packets of a single unidirectional link;
• summarizing traffic characteristics;
• extracting simple features from the traffic characteristics;
• making decisions (e.g., declaring a network anomaly or classifying the type of normal traffic) on one single unidirectional link; and
• reporting the summary of traffic information, simple feature data, and decisions to a local analyzer.
2.2.2 Local Analyzer
A local analyzer is responsible for:
• extracting complicated features from traffic information obtained at a single edge router;
• making decisions based on local traffic information (i.e., one edge router);
• reporting decisions, feature data, and summary of traffic information (if necessary) to a global analyzer.
The local analyzer can utilize temporal correlation of traffic to generate feature data.
2.2.3 Global Analyzer
A global analyzer is responsible for:
• extracting complicated features that require global information, such as routing information, from traffic;
• analyzing feature data obtained from multiple local analyzers; and
• making decisions with global information obtained from multiple edge routers.
Figure 2-4. Example of asymmetric traffic whose feature extraction is done by the globalanalyzer.
The global analyzer has a global view of the whole network. Hence, it exploits
both temporal correlation and spatial correlation of traffic. Here it is important to note
that some feature data must be obtained at the global analyzer if global information
is required. For example, in Figure 2-4, if the traffic from subnet A to server B passes
through edge router X, and the traffic from server B to subnet A passes through edge
router Y, then the so-called two-way matching features between subnet A and server B
must be obtained at the global analyzer, which has the routing information of the ISP
network.
The advantages of our framework design are that:

1. it is deployed on edge routers instead of end-user systems, so that it can detect network anomalies at the first place they enter an AS;

2. it imposes no burden on core routers;

3. it is flexible in that detection of network anomalies can be made both locally and globally; and

4. it is capable of detecting low-volume network anomalies accurately by exploiting spatial correlations among edge routers.
The framework is designed to be an add-on service provided by an ISP to protect end
users from network anomalies.
2.3 Summary
This chapter is concerned with the design of network anomaly detection frameworks.
There are two types of frameworks, host-based and network-based; our design is of the
second type. Specifically, we designed a framework deployed on edge routers. It is
composed of three components: traffic monitors, local analyzers, and a global analyzer.
This framework is flexible in that it can detect network anomalies from both a local view
and a global view of the network. By exploiting spatial correlations among edge routers,
our framework is capable of detecting low-volume network anomalies.
CHAPTER 3
FEATURES FOR NETWORK ANOMALY DETECTION
3.1 Introduction
Given the network anomaly detection framework we have established, the second
issue of network anomaly detection is feature extraction. Features for network anomaly
detection have been studied extensively in recent years. For example, Peng et al. [12]
proposed the number of new source IP addresses to detect DDoS attacks, under the
assumption that the source addresses of IP packets observed at an edge router are more
static under normal conditions than during DDoS attacks. Peng further pointed out
that this feature can differentiate DDoS attacks from a flash crowd, which represents
the situation in which many legitimate users start to access one service at the same
time, e.g., when many people watch a live sports broadcast over the Internet. In both
cases (DDoS attacks and a flash crowd), the traffic rate is high. But during DDoS
attacks, the edge routers will observe many new source IP addresses, because attackers
usually spoof the source IP addresses of attacking packets to hide their identities.
Therefore, this feature improves on DDoS detection schemes that rely on the traffic rate
only. However, Peng et al. [12] focused on the detection of DDoS attacks and did not
address other types of network anomalies. For example, when malicious users are scanning the
network, we can also observe a high traffic rate but few new source IP addresses. It is
very important to differentiate network scanning from a flash crowd because the former is
malicious but the latter is not. The two-way matching features on different network layers
(Section 3.3.1) can reveal not only the presence of network anomalies but also their cause.
Lakhina et al. [16] summarized the characteristics of network anomalies with different
causes. Their contribution helps identify the causes of network anomalies. For example,
during DDoS attacks, we can observe a high bit rate, high packet rate, and high flow rate,
and the source addresses are distributed over the whole IP address space. On the other hand,
during network scanning, all three rates are also high, but the destination addresses,
rather than the source addresses, are distributed. However, that work did not resolve an
important problem, i.e., how to extract features efficiently enough to match a high line rate
on the order of Gb/s. We propose a data structure called the Bloom filter array to address this
problem.
3.2 Hierarchical Feature Extraction Architecture
Network anomaly detection is not an easy task, especially at high-speed routers.
One of the main difficulties arises from the fact that the data rate is too high to afford
complicated data processing. An anomaly detection algorithm usually works with traffic
features instead of the original traffic data itself. Traffic features can be regarded as
succinct representations of the voluminous traffic, e.g., the traffic data rate is a feature of
the traffic.
We focus on presenting our feature extraction architecture for network anomaly
detection. We also cover extraction schemes for some simple features, such as data rate
and SYN/FIN(RST) ratio. The more advanced features, the so-called two-way matching
features, are discussed later.
3.2.1 Three-Level Design
Figure 3-1. Hierarchical structure for feature extraction.
To efficiently extract features from traffic, we design a three-level hierarchical
structure (Figure 3-1), where incoming packets are processed by level-one filters, then
by level-two filters, and finally by (level-three) feature extraction modules. Level-one filters
and level-two filters are placed in traffic monitors. A feature extraction module can be
placed in either a traffic monitor or a local analyzer, depending on the type of the feature.
Level-one filters select packets based on their source-destination pair, which is defined
by the source IP address (SA), the source network mask (SNM), the destination IP
address (DA), and the destination network mask (DNM). For example, if we are interested in
packets from 172.10.5.28 to 210.33.68.102, we can choose 255.255.255.255 as both the SNM
and the DNM; if we are interested in packets from 172.10.x.x to 208.33.1.x, we can use
255.255.0.0 as the SNM and 255.255.255.0 as the DNM. In this way, we can selectively monitor
an end host or a subnet, which gives much flexibility in framework configuration. The output of
a level-one filter is packets with the same source-destination pair, which are conveyed to
level-two filters.
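As a rough illustration, the mask-based selection that a level-one filter performs can be sketched in Python. The function and field names are hypothetical, and the standard ipaddress module stands in for the router's packet-header parsing:

```python
import ipaddress

def make_level_one_filter(sa, snm, da, dnm):
    """Build a level-one filter for one source-destination pair.

    sa/da are IP addresses; snm/dnm are network masks, e.g., "255.255.0.0".
    All names here are illustrative, not the dissertation's implementation.
    """
    sa_int = int(ipaddress.IPv4Address(sa))
    da_int = int(ipaddress.IPv4Address(da))
    snm_int = int(ipaddress.IPv4Address(snm))
    dnm_int = int(ipaddress.IPv4Address(dnm))

    def matches(pkt_sa, pkt_da):
        # A packet passes if its masked source and destination addresses
        # equal the masked SA and DA configured for the filter.
        return ((int(ipaddress.IPv4Address(pkt_sa)) & snm_int) == (sa_int & snm_int)
                and (int(ipaddress.IPv4Address(pkt_da)) & dnm_int) == (da_int & dnm_int))

    return matches

# Monitor the subnet example from the text: 172.10.x.x -> 208.33.1.x
f = make_level_one_filter("172.10.0.0", "255.255.0.0", "208.33.1.0", "255.255.255.0")
```

With a /32 mask on both sides the same construction selects a single end-to-end host pair, which is how the first example in the text would be configured.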
A level-two filter classifies the packets coming from level-one filters, based on
the upper-layer1 data fields, e.g., TCP SYN or FIN. The packets of interest will be
forwarded to one or multiple feature extraction modules. For example, the number of
TCP SYN packets can be used to generate both the TCP SYN rate feature and the TCP
SYN/FIN(RST) ratio feature; hence, TCP SYN packets are conveyed to both the TCP
SYN rate module and the TCP SYN/FIN(RST) ratio module (Figure 3-1). On the other
hand, a feature module may need packets from multiple level-two filters. For example, the
SYN/FIN(RST) ratio feature extraction requires packets from three filters (Figure 3-1).
Compared to the packet classification schemes developed by Wang et al. [11] and
Peng et al. [12], our hierarchical structure for feature extraction is more general and
efficient.
Next, we describe the most important module in the three-level hierarchical structure,
the feature extraction module.
1 Here, the upper layer can be either Layer 4 or Layer 7.
Similar to previous studies [11, 12], we generate features in a discrete manner, i.e.,
our feature extraction module will generate a (feature) value or a vector at the end of
each time slot. Intuitively, a shorter slot duration may reduce the detection delay, which
is defined as the interval from the epoch when the anomaly starts to the epoch when the
anomaly is detected; but a smaller duration may increase the computational complexity,
since the detection algorithm needs to analyze more feature data for the same time
interval. On the other hand, if a feature is represented by a ratio, the slot duration
must be sufficiently large to avoid division by zero. For example, if we want to use the
SYN/FIN(RST) ratio as in Ref. [11] to detect TCP SYN flood, then the slot duration
cannot be too small, because the number of FIN packets in a short period can be 0, which
will result in a false alarm even if the number of SYN packets is not large.
Feature extraction can be done in a traffic monitor, or in a local or global analyzer,
as described in Sections 3.2.2 and 3.2.3, respectively.
3.2.2 Feature Extraction in a Traffic Monitor
As mentioned earlier, some features are generated within a traffic monitor. These
features are typically simple and reside in the traffic of a single unidirectional link.
In our framework, a traffic monitor can generate the following features:
• Packet rate: defined as the number of packet arrivals in one time slot. This feature is simple but useful for detecting high-volume DoS and DDoS attacks. However, it can hardly help detect low-volume attacks and other types of network anomalies. Furthermore, normal network behaviors may also be accompanied by a high packet rate, e.g., a flash crowd [12]. The same holds for the data rate.

• Data rate: defined as the total number of bits of all packets that arrive in one time slot.

• SYN/FIN(RST) ratio2: defined as the ratio of the number of TCP SYN packets in one time slot to the number of FIN (and a portion of RST) packets in the same time slot.
2 How to obtain this ratio can be found in Ref. [17].
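The three simple features above can be sketched as one per-slot computation. The packet representation, a list of (size-in-bits, TCP-flag) pairs, is an assumption for illustration, and the refinement of counting only a portion of RST packets [17] is omitted:

```python
def slot_features(packets):
    """Per-slot simple features from a list of (size_bits, tcp_flag) tuples,
    where tcp_flag is "SYN", "FIN", "RST", or None for non-flagged packets.
    A hypothetical sketch; the field layout is an assumption."""
    packet_rate = len(packets)                    # packets per slot
    data_rate = sum(bits for bits, _ in packets)  # bits per slot
    syn = sum(1 for _, flag in packets if flag == "SYN")
    fin_rst = sum(1 for _, flag in packets if flag in ("FIN", "RST"))
    # Guard against division by zero: with a short slot the FIN/RST count
    # can be 0, which, as the text warns, would otherwise cause a false alarm.
    ratio = syn / fin_rst if fin_rst else float("inf")
    return packet_rate, data_rate, ratio
```

The infinite ratio for an empty denominator makes the need for a sufficiently long slot duration (Section 3.2.1) explicit.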
3.2.3 Feature Extraction in a Local Analyzer or a Global Analyzer
Although a traffic monitor can generate simple features efficiently, these features may
not be sufficient to detect network anomalies. In particular, the packet rate and data
rate features may only be useful for detecting network anomalies accompanied by high-volume
traffic, and the SYN/FIN(RST) ratio has a large variation even for normal traffic
and hence cannot help accurately distinguish normal network conditions from network
anomalies. To improve detection accuracy, one can use a local analyzer to generate more
sophisticated features, for example, the SYN/SYN-ACK ratio proposed in Ref. [17] and
the percentage of new IP addresses proposed in Ref. [12].

However, existing features such as the SYN/SYN-ACK ratio [17] and the
percentage of new IP addresses [12] either do not lead to good detector performance
or require high storage/time complexity (Section 3.1). To address these deficiencies, we
propose a new type of feature called two-way matching features, which clearly distinguish
normal traffic from attack traffic, thereby improving the accuracy of attack detection.
Next, we discuss the two-way matching features and the extraction scheme.
3.3 Two-Way Matching Features
3.3.1 Motivation
The motivation for using two-way matching features arises from the fact that, for
most Internet applications, packets are generated by both end hosts engaged
in communication. Information carried by packets in one direction must match the
corresponding information carried by packets in the other direction. By monitoring the
degree of mismatch between the flows in the two directions, we can detect network anomalies.
To illustrate this, let us consider the behavior of two-way traffic in three scenarios,
namely, 1) normal conditions, 2) DDoS attacks, and 3) rerouting.
In the first scenario, when the network of an ISP works normally, the information
carried in the two directions of a communication matches (Figure 3-2). Host a and host v are two
Figure 3-2. Network in normal condition.
ends of communication (assume that host v is within the autonomous system of the ISP
while host a is not). Host a sends a packet to host v, and v responds with a packet back to
host a. Both packets pass through edge router A. From the point of view of local analyzer
1, attached to edge router A, we define the first packet as an inbound packet and the
second packet as an outbound packet. The source IP address (SA) and destination IP
address (DA) of the inbound packet match the DA and SA of the outbound packet. If the
communication is based on UDP or TCP, we can further observe that the source port (SP)
and destination port (DP) of the inbound packet match the DP and SP of the outbound
packet. Therefore, local analyzer 1 observes matched inbound and outbound
packets under normal conditions. In the example of Figure 3-2, it is assumed that border
gateway protocol (BGP) routing makes the inbound packets and the corresponding
outbound packets pass through the same edge router. If BGP routing makes the
inbound packets and the corresponding outbound packets go through different edge routers
(Figure 2-4), the matching can still be achieved by a global analyzer (Section 2.2.3), i.e.,
multiple local analyzers convey the unmatched inbound packets and the corresponding
outbound packets to the global analyzer, which has the routing information of the whole
autonomous system.
Figure 3-3. Source-address-spoofed packets.
In the second scenario, when attackers launch DDoS attacks with spoofed source IP
addresses [18], local analyzer 1 observes many unmatched inbound packets (Figure 3-3).
Since the source addresses of the inbound packets are spoofed, the outbound packets are routed to
the nominal destinations, i.e., b and c in Figure 3-3, and no longer pass through edge router
A. As a result, local analyzer 1 observes many unmatched inbound packets.
Figure 3-4. Reroute.
In the third scenario (Figure 3-4), the number of unmatched inbound packets
observed by local analyzer 1 increases because the original route fails and the outbound
packets are rerouted to another edge router. A global analyzer can address this problem in the
same way as the asymmetric case in the first scenario.
All the above scenarios seem to suggest that the number of unmatched inbound
packets observed by an edge router is a good feature for network anomaly detection.
However, this is usually not true, because the traffic volumes in the two directions are
typically asymmetric. In Figure 3-2, if host a is a client uploading a large file to host v
using the File Transfer Protocol (FTP) [19], there will be many more packets from a to v than
from v to a. Uploading a file to an FTP server is normal behavior, yet the number of
unmatched inbound packets is very high in this case.
Therefore, it is more appropriate to use flow-level quantities (instead of packet-level
quantities) as features for network anomaly detection. In the above FTP case, when a
TCP connection is established, all packets in one direction constitute one flow and the packets
in the reverse direction constitute another flow. No matter how many packets are sent in
each direction, there is only one inbound flow and only one outbound flow, and they match
in IP addresses and port numbers. Therefore, we call the number of unmatched inbound
flows a two-way matching feature.
Two-way matching features are shown to be effective indicators of network anomalies
[20].3 However, extraction of two-way matching features at high-speed edge routers is not
an easy task. We will address this issue in Sections 3.4 and 3.5.
Next, we define the two-way matching features.
3.3.2 Definition of Two-Way Matching Features
We first define three terms.
Definition 1. A signature is the information of interest carried in traffic.
The exact definition of signature depends on the specific application targeted. For
example, to detect SYN flood DDoS attacks, we may use a 5-tuple signature <SA, SP,
3 Two-way matching features are good indicators of DDoS attacks with spoofed sourceIP addresses but are not good indicators of DDoS attacks with non-spoofed source IPaddresses.
DA, DP, sequence number> for inbound packets and <DA, DP, SA, SP, ACK number −
1> for outbound packets. We further define the inbound signature as the signature extracted
from inbound packets, and the outbound signature as that extracted from outbound packets.
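A minimal sketch of these 5-tuple signatures, with an assumed dictionary layout for the packet fields, shows why the "ACK number − 1" term makes a SYN and its SYN-ACK produce identical signatures:

```python
def inbound_signature(pkt):
    """Signature <SA, SP, DA, DP, sequence number> of an inbound packet.
    pkt is a dict with assumed field names, not the dissertation's format."""
    return (pkt["sa"], pkt["sp"], pkt["da"], pkt["dp"], pkt["seq"])

def outbound_signature(pkt):
    """Signature <DA, DP, SA, SP, ACK number - 1> of an outbound packet.
    A SYN-ACK acknowledges seq + 1, so subtracting 1 recovers the SYN's
    sequence number and the two signatures coincide."""
    return (pkt["da"], pkt["dp"], pkt["sa"], pkt["sp"], pkt["ack"] - 1)

# A SYN from a client and the server's SYN-ACK yield matching signatures.
syn = {"sa": "1.2.3.4", "sp": 1234, "da": "5.6.7.8", "dp": 80,
       "seq": 1000, "ack": 0}
syn_ack = {"sa": "5.6.7.8", "sp": 80, "da": "1.2.3.4", "dp": 1234,
           "seq": 7777, "ack": 1001}
```

A spoofed-source SYN flood, by contrast, never elicits a SYN-ACK through the same edge router, so its inbound signatures remain unmatched.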
Definition 2. A flow is a set of the packets with the same signature and the same
direction.
For example, a TCP connection between two ends generates two flows with different
directions.
Definition 3. An unmatched inbound flow (UIF) is an inbound flow that has no
corresponding outbound packet arriving at the intended edge router within a time period
Γ.
Note that we use a time constraint Γ in the definition of a UIF because it takes
time for an outbound packet to arrive. If Γ is too short, then some returning outbound
packets might be ignored, which increases the false alarm probability of network anomaly
detection. If Γ is too large, then the detection delay is long. The suitable choice of Γ
depends on the round-trip time (RTT) of the connection. For example, we can choose Γ
to be the 99th percentile of the RTT, i.e., more than 99% of corresponding outbound packets
return within time Γ.
Table 3-1: Notations for two-way matching features

  Notation   Description
  ti         The ith sampling time epoch, where ti+1 = ti + Γ and i ∈ Z+.
  s(p)       Inbound signature of an inbound packet p.
  s′(p′)     Outbound signature of an outbound packet p′.
  D(ti)      The number of UIFs during the ith period.
Based on the above definitions, we define the two-way matching feature to be the
number of UIFs. Table 3-1 lists the notation used in the remainder of this work, where Z+
denotes the set of nonnegative integers.
In the following sections, we present algorithms to extract two-way matching features
from the traffic at local analyzers. Note that two-way matching features should be
extracted by global analyzers when the traffic of an AS is asymmetric. However, the feature
extraction approaches used by local analyzers and global analyzers are the same.
3.4 Basic Algorithms
This section presents two basic algorithms to process and store the two-way matching
features, namely, the hash table algorithm and the Bloom filter algorithm.
3.4.1 Hash Table Algorithm
The general procedure to extract the two-way matching features from traffic at a
local analyzer is:

1. The local analyzer maintains a buffer in memory.

2. When the traffic monitor captures an inbound packet, if its inbound signature is not in the buffer, the local analyzer creates an entry for the signature and sets the state of that entry to "UNMATCHED".

3. When the traffic monitor captures an outbound packet, if its outbound signature is in the buffer, the local analyzer sets the state of that entry to "MATCHED".

4. At time ti+1, the local analyzer assigns the number of entries with state "UNMATCHED" to D(ti).

Thus, we typically need three operations: insertion, search, and removal4.
A basic algorithm to do this is to use a hash table. Suppose the signature extracted
from a packet is b bits long. We organize the buffer into a table, V, with ℓ cells of b + 1
bits each; the extra bit is the state bit. We also have K hash functions hi : S → Zℓ,
where i ∈ ZK = {0, 1, . . . , K − 1} and S is the data set of interest, e.g., the signature domain.
The symbol Zℓ stands for the set {0, . . . , ℓ − 1}, where ℓ is an integer.
The operations of the hash table algorithm are listed in Figure 3-5, where the argument s
is the signature extracted from a packet.
4 Setting the state to “MATCHED” is actually the removal operation.
function HashTableInsert(V, s)
    for i ← 0 to K − 1
        if V[hi(s)] is empty
            insert s into V[hi(s)]; set the state bit of V[hi(s)] to "UNMATCHED"
            return
        end if
    end for
    report an insertion error
end function

function HashTableSearch(V, s)
    for i ← 0 to K − 1
        if V[hi(s)] is empty
            return false
        if V[hi(s)] holds s
            return true
    end for
    return false
end function

function HashTableRemove(V, s)
    for i ← 0 to K − 1
        if V[hi(s)] is empty
            return
        if V[hi(s)] holds s
            set the state bit of V[hi(s)] to "MATCHED"
            return
        end if
    end for
end function
Figure 3-5. Hash Table Algorithm
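A toy Python version of the hash table algorithm may clarify the probe sequence and the state bit. The table size, the number of hash functions, the helper names, and the use of Python's built-in hash are all illustrative assumptions:

```python
K = 4          # number of hash functions (illustrative value)
ELL = 1024     # table size, the ell of the text (illustrative value)

def h(i, s):
    # Stand-in for the K hash functions h_i : S -> Z_ell.
    return hash((i, s)) % ELL

def ht_insert(V, s):
    """Probe the K candidate cells; store s with an UNMATCHED state bit."""
    for i in range(K):
        if V[h(i, s)] is None:
            V[h(i, s)] = [s, "UNMATCHED"]
            return
    raise RuntimeError("insertion error: all K candidate cells occupied")

def ht_search(V, s):
    for i in range(K):
        cell = V[h(i, s)]
        if cell is None:
            return False
        if cell[0] == s:
            return True
    return False

def ht_remove(V, s):
    # As footnote 4 notes, "removal" just flips the state bit to MATCHED.
    for i in range(K):
        cell = V[h(i, s)]
        if cell is None:
            return
        if cell[0] == s:
            cell[1] = "MATCHED"
            return

def count_unmatched(V):
    # D(t_i): the number of entries still in state UNMATCHED.
    return sum(1 for cell in V if cell is not None and cell[1] == "UNMATCHED")
```

Because probing stops at the first empty cell, an occupied-but-different cell simply forwards the probe to the next hash function, mirroring the loop in Figure 3-5.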
3.4.2 Bloom Filter
The hash table algorithm can be used for offline traffic analysis or analysis of low
data-rate traffic, but it cannot keep up with the high data rates at edge routers. To address
this limitation, one can use the Bloom filter algorithm [21]. Compared to the hash table
algorithm, the Bloom filter algorithm reduces space/time complexity by allowing a small degree
of inaccuracy in membership representation, i.e., a packet signature that has not
appeared before may be falsely identified as present.
A Bloom filter stores data in a vector V of M elements, each of which consists of one
bit. The Bloom filter also uses K hash functions hi : S → ZM, where i ∈ ZK. Figure 3-6
describes the insertion and search operations of the Bloom filter.
function BloomFilterInsert(V, s)
    for all i ∈ ZK do
        V[hi(s)] ← 1
end function

function BloomFilterSearch(V, s)
    for all i ∈ ZK do
        if V[hi(s)] ≠ 1 then
            return false
    end for
    return true
end function
Figure 3-6. Bloom Filter Operations
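The two operations can be sketched in a few lines of Python. The vector length, the number of hash functions, and the hash construction are illustrative; as with any Bloom filter, the search may return false positives but never false negatives:

```python
M = 1 << 16    # bit-vector length (illustrative)
K = 4          # number of hash functions (illustrative)

def bf_hashes(s):
    # Stand-in for the K hash functions h_i : S -> Z_M.
    return [hash((i, s)) % M for i in range(K)]

def bf_insert(V, s):
    # Set all K bits indexed by the hashes of s.
    for idx in bf_hashes(s):
        V[idx] = 1

def bf_search(V, s):
    # s is reported present only if all K of its bits are set; a different
    # item whose bits happen to all be set yields a false positive.
    return all(V[idx] == 1 for idx in bf_hashes(s))
```

Note that there is no removal: clearing the bits of one item could erase bits shared with another, which is exactly problem 1 listed below.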
Figure 3-7. Scenarios of the problems caused by Bloom filter. (a) Boundary problem. (b)An outbound packet arrives before its matched inbound packet with t2 − t1 < Γ.
Although the Bloom filter has better performance in terms of the space/time trade-off, it
cannot be directly applied to our application, because of the following problems:
1. The Bloom filter does not provide removal functionality. Since one bit in the vector may be mapped to by more than one item, it is unsafe to remove an item by setting all bits indexed by its hash results to 0.

2. The Bloom filter does not have counting functionality. Although the counting Bloom filter [22] can be used for counting, it replaces each bit with a counter, which significantly increases the space complexity.
3. Sampling two-way matching features in discrete time results in a boundary effect (Figure 3-7(a)): an inbound packet arrives at time t′1 ∈ [ti, ti+1) whereas its matched outbound packet arrives within the next period. The inbound packet is counted as an unmatched inbound packet even though t′2 − t′1 < Γ. Therefore, the boundary effect increases the false alarm rate.

4. In the previous discussion, we did not consider the scenario in which an outbound packet arrives before its matched inbound packet (Figure 3-7(b)). When the outbound packet arrives at time t′1, its signature is not in the buffer, so we do nothing. At time t′2, its matched inbound packet arrives, and its inbound signature is recorded. As a result, the latter is regarded as an unmatched inbound packet during the period [ti, ti+1). This early-arrival problem also increases the false alarm rate.
Next, we propose a Bloom filter array algorithm to address the above problems.
3.5 Bloom Filter Array (BFA)
The good space/time trade-off motivates us to apply the Bloom filter to two-way
matching feature extraction. But we need to address the limitations of the Bloom filter
mentioned in Section 3.4.2. Our idea is to design a Bloom filter array (BFA) with the
following functionalities, not available in the original Bloom filter [21, 23]:

1. Removal functionality: We implement the insertion and removal operations synergistically by using insertion-removal vector pairs. The trick is that, rather than removing an outbound signature from the insertion vector, we create a removal vector and insert the outbound signature into the removal vector.

2. Counting functionality: We implement this by introducing counters into the Bloom filter array. The value of a counter is changed based on the query result of an insertion/removal operation.

3. Boundary effect abatement: We use multiple time slots and a sliding window to mitigate the boundary effect.

4. Resolving the early-arrival problem: This is achieved by storing the signatures of not only inbound packets but also outbound packets. In this way, when an inbound packet arrives and the signature of its matched outbound packet is present, we do not count the inbound packet as unmatched.
3.5.1 Data Structure
To address the boundary effect, we partition the time constraint Γ into w time slots,
where w is a number of slots large enough to mitigate the boundary effect (see Section 3.5.3).
Assume the length of a slot is γ. Then, we have Γ = w × γ. The data structure of BFA is
as follows:
• An array of bit vectors IVj (j ∈ Z+), where IVj is the jth insertion vector, holding the inbound signatures of slot [τj, τj+1), where τj+1 = τj + γ.

• An array of bit vectors RVj (j ∈ Z+), where RVj is the jth removal vector, holding the outbound signatures of slot [τj, τj+1).

• An array of counters Cj (j ∈ Z+), where Cj counts the number of UIFs in slot [τj, τj+1).
Since the two-way flows need to be matched within a time interval of length Γ, we
only need to keep information within a time window of length Γ. That is, if the current
slot is [τj, τj+1), only IVj−w+1, . . . , IVj, RVj−w+1, . . . , RVj, and Cj−w+1, . . . , Cj are
kept in memory.
3.5.2 Algorithm
Our algorithm for the BFA (Figure 3-8) consists of three functions, namely, ProcInbound,
ProcOutbound, and Sample, which are described below.
Function ProcInbound processes inbound packets. It works as follows. When
an inbound packet arrives during [τj, τj+1), we increase Cj by 1 and insert its inbound
signature s into IVj if neither of the following conditions is satisfied:

1. s is stored in at least one RVj′, where j − w + 1 ≤ j′ ≤ j;

2. s is stored in IVj.

Condition 1 being true means that the corresponding outbound flow of this inbound
packet has been observed previously, so we should not count it as an unmatched inbound
packet. Condition 2 being true means that the inbound flow to which this inbound packet
belongs has already been observed during the current slot j, so we should not count the same
inbound flow again. If both conditions are false, we increase Cj by one to indicate a new
potential UIF (lines 7 to 10).
1.  function ProcInbound(s)
2.      a ← false, b ← false
3.      if ∃j′, j − w + 1 ≤ j′ ≤ j, such that BloomFilterSearch(RVj′, s) returns true then
4.          a ← true
5.      if BloomFilterSearch(IVj, s) returns true then
6.          b ← true
7.      if a and b are both false
8.          Cj ← Cj + 1
9.          BloomFilterInsert(IVj, s)
10.     end if
11. end function
12. function ProcOutbound(s′)
13.     for j′ ← j to j − w + 1
14.         if BloomFilterSearch(RVj′, s′) returns true
15.             break
16.         if BloomFilterSearch(IVj′, s′) returns true
17.             Cj′ ← Cj′ − 1
18.     end for
19.     BloomFilterInsert(RVj, s′)
20. end function
21. function Sample(j)
22.     return Cj−w+1
23. end function

Figure 3-8. Bloom Filter Array Algorithm
Function ProcOutbound processes outbound packets. It works as follows. When an
outbound packet arrives during [τj, τj+1), we check whether we need to update the counter Cj′
for each j′ (j − w + 1 ≤ j′ ≤ j). Specifically, for each such j′, we decrease Cj′
by one if the outbound signature s′ satisfies both of the following conditions:

1. s′ is not contained in RVj′;

2. s′ is contained in IVj′.

Condition 1 being true means that no packet from the outbound flow to which this
outbound packet belongs arrived during the j′th time slot. Condition 2 being true means
that the matched inbound flow of this outbound packet was observed in the j′th slot.
Satisfying both conditions means that the matched inbound flow has been counted as a
potential UIF; hence, upon the arrival of the outbound packet, we need to decrease Cj′ by
one to uncount it. In Function ProcOutbound, line 13 starts a loop that iterates j′ from j down to
j − w + 1. Condition 1 is checked in lines 14 to 15, and Condition 2 is checked in lines 16
to 17. Note that the loop exits (line 15) if RVj′ contains s′; this is because an outbound
packet of the same flow arrived in the j′th slot, and hence the buffers of all earlier slots in
the window had already been checked at that time.
Function Sample extracts the two-way matching features. When we execute
Function Sample at the end of the jth slot (i.e., at time τj+1), the output is D(τj−w+1)
instead of D(τj), since a time lag of Γ (w slots) is needed for two-way matching.
3.5.3 Round Robin Sliding Window
1.  function ProcInbound(s)
2.      a ← false, b ← false
3.      if ∃j′ ∈ {(I − w + 1)%w, (I − w + 2)%w, . . . , I%w} such that BloomFilterSearch(RVj′, s) returns true then
4.          a ← true
5.      if BloomFilterSearch(IVI, s) returns true then
6.          b ← true
7.      if a and b are both false then
8.          CI ← CI + 1
9.          BloomFilterInsert(IVI, s)
10.     end if
11. end function
12. function ProcOutbound(s′)
13.     for j′ ← I to (I − w + 1)%w
14.         if BloomFilterSearch(RVj′, s′) returns true then
15.             break
16.         if BloomFilterSearch(IVj′, s′) returns true then
17.             Cj′ ← Cj′ − 1
18.     end for
19.     BloomFilterInsert(RVI, s′)
20. end function
21. function Sample()
22.     I ← (I + 1)%w
23.     return CI
24. end function
Figure 3-9. Bloom Filter Array Algorithm using sliding window
The algorithm presented in Section 3.5.2 has a drawback in memory allocation.
Specifically, at epoch τj+1, we sample D(τj−w+1), after which we need to discard the
buffer for the (j − w + 1)th slot and create a new buffer for the (j + 1)th slot. This is
inefficient in most operating systems. A better memory allocation strategy is to reuse the
obsolete buffer of the (j − w + 1)th slot for the new (j + 1)th slot, saving the cost of memory
allocation. This is the idea of our round-robin sliding window.
Our new memory allocation scheme is the following. We allocate a memory area
of fixed size for w insertion vectors IVj, w removal vectors RVj, and w counters
Cj, where j ∈ Zw. The insertion vector, removal vector, and counter for the jth slot
are IVj%w, RVj%w, and Cj%w, respectively, where % denotes the modulo operation. We
also define a pointer I to the current slot. Then, rather than deleting an obsolete
buffer and acquiring a new buffer for the new slot, we simply update the pointer by
I = (I + 1)%w. Figure 3-9 shows the improved version of the BFA, based on the round-robin
sliding window.
3.5.4 Random-Keyed Hash Functions
In the previous sections, we assumed that the K hash functions are given a priori. However,
choosing hash functions appropriately is not trivial, due to the following two concerns.

First, K is a user-specified parameter, subject to change. But for whatever value of K a
user5 chooses, it is not desirable to require the user to manually select K hash functions
from a large pool of hash functions provided by the manufacturer. It also wastes memory
to store a large pool of hash functions.
Second, to improve security, the K hash functions need to be changed over time.
Otherwise, if an attacker knows the hash functions, he can generate attack packets
such that, for the signatures s1 and s2 of any two packets, s1 ≠ s2 but hi(s1) = hi(s2) for all i ∈ ZK. The
consequence is that even if there are many attack packets with different signatures, the
5 A user here is a network operator who wants to use our BFA and detection techniqueto detect network anomalies.
BFA algorithm will regard them as belonging to the same flow, so the number of UIFs for
these packets is only one. This is a security vulnerability.
We address the aforementioned two problems by using keyed hash functions, i.e.,
we only need one kernel hash function and K randomly generated keys. Specifically, the
ith hash function hi(x) is simply h(keyi, x), where h is a predefined kernel hash function
and keyi (i ∈ ZK) are randomly generated keys. For example, we can use the MD5 digest
algorithm [24] as the kernel hash function. Since MD5 takes any number of bits as input, we can
organize keyi and x into a single bit vector and apply MD5 to it.
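A sketch of this construction using Python's hashlib; the key length, the byte-level concatenation of keyi and x, and the modulo reduction are illustrative choices:

```python
import hashlib
import os

def keyed_hash(key, x, m):
    """h_i(x) = h(key_i, x): organize the key and the input into one bit
    string, apply the MD5 kernel, and reduce modulo m to index m cells."""
    digest = hashlib.md5(key + x).digest()
    return int.from_bytes(digest, "big") % m

# K hash functions materialize as K random keys; regenerating the keys
# (periodically, or whenever K changes) changes the hash functions.
K = 4
keys = [os.urandom(16) for _ in range(K)]
indices = [keyed_hash(k, b"packet-signature", 1 << 20) for k in keys]
```

Only the kernel function and the current keys need to be stored, so changing K or rotating the keys costs a few random bytes rather than a pool of precomputed functions.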
Using keyed hash functions, the first concern (varying K) can be addressed straightforwardly.
Specifically, when K is changed, we simply generate a corresponding number of random
keys. Applying these K keys to the same kernel hash function, we obtain K hash
functions. Hence, our method has two advantages: 1) the number of hash functions
can be specified on the fly; 2) hash functions are determined on the fly, instead of being
stored a priori, resulting in storage saving.
The second concern (changing hash functions) can also be addressed if the keys are
periodically changed. Even if the kernel hash function is disclosed, it is still very difficult,
if not impossible, for an attacker to guess the changing random keys.
Note that the collision probability of the hash functions is not affected by the
use of keyed hash functions. With random-keyed hash functions, the collision
probability of hi(x) depends not only on the collision probability of h but also on the
correlation between keyi and x. Since random number generation techniques are mature
enough that we can assume independence between keyi and x, the introduction of random
keys has no effect on the collision probability.
3.6 Complexity Analysis
This section compares the hash table algorithm, the Bloom filter algorithm, and our BFA. It
is organized as follows. In Section 3.6.1, we analyze the space/time trade-offs of the three
algorithms. Section 3.6.2 addresses how to optimally choose the parameters of the BFA.
3.6.1 Space/Time Trade-off
The space/time trade-off of both the hash table and Bloom filter algorithms was analyzed by Bloom [21]. However, that analysis is not directly applicable to our setting, for the following reasons:
1. A static data set was assumed by Bloom [21]. However, our feature extraction deals with a dynamic data set, i.e., the number of elements in the data set changes over time. Hence, new analysis for a dynamic data set is needed. In addition, Bloom [21] only considered the search operation due to the assumption of static data sets. Our feature extraction, on the other hand, requires three operations on dynamic data sets: insertion, search, and removal.
2. Bloom [21] assumed bit-comparison hardware in the time complexity analysis. However, current computers usually use word (multiple-bit) comparison, which is more efficient than bit-comparison hardware. Hence, it is necessary to analyze the complexity based on word comparison.
3. The time complexity obtained by Bloom [21] did not include hash function calculations. However, hash function calculation dominates the overall time complexity; e.g., calculating one hash function based on MD5 takes 64 clock cycles [25], while one word comparison usually takes less than 8 clock cycles [26].
For the above reasons, we develop new analyses for the hash table and the Bloom filter. In addition, we analyze the performance of BFA and use numerical results to compare the three algorithms. Table 3-2 lists the notation used in the analysis.
Table 3-2: Notations for complexity analysis
  Notation   Description
  N          Random variable representing the number of different flows recorded.
  φ          Empty ratio.
  η          Collision probability, i.e., the probability that an item is falsely identified to be in the buffer.
  R          Flow arrival rate, which is assumed to be constant.
Analysis for hash table. Denote by Mh the size of a hash table in bits (i.e., space
complexity) and by Th the random variable representing the number of hash function
calculations for an unsuccessful search (i.e., time complexity).
Let us consider the search operation first. Upon the arrival of an inbound packet, HashTableSearch (see Figure 3-5) checks whether its inbound signature s is in the table. Because
an unsuccessful search will continue the loop until an empty cell is found, it consumes
more time than a successful one does. In addition, it is very difficult to analyze the time
complexity of a successful search since the complexity depends on the distribution of
flow signatures and the data rate of each flow. For this reason, we only consider the time
consumed for an unsuccessful search, which is a conservative estimate of the average time
complexity of a search. Recall that, as mentioned in Section 3.4.1, the hash table has ℓ cells of b + 1 bits each, such that Mh = ℓ(b + 1). Given the condition that N flows have been recorded by the hash table, the empty ratio is

φ = (ℓ − N)/ℓ = (Mh − N(b + 1))/Mh. (3–1)
In each loop, HashTableSearch calculates one hash function and checks the addressed entry. If the entry is not empty, the next loop iteration is executed. Given N = n, the number of iterations follows a geometric distribution:

Pr[Th = x | N = n] = φ(1 − φ)^(x−1). (3–2)
Therefore the conditional expectation of Th is

E[Th | N = n] = Σ_{x=1}^{∞} x φ(1 − φ)^(x−1) = 1/φ = Mh/(Mh − n(b + 1)). (3–3)
Since the table records data for the duration of Γ, the maximum number of different
flows that we need to store in the buffer is RΓ. Then the expectation of Th is
E[Th] = Σ_{n=0}^{RΓ} Pr[N = n] E[Th | N = n]. (3–4)
Assume N has a uniform distribution:

Pr[N = n] = 1/(RΓ + 1). (3–5)
Applying Equation (3–5) to Equation (3–4), we obtain the expectation of Th:

E[Th] = (1/(RΓ + 1)) Σ_{n=0}^{RΓ} Mh/(Mh − n(b + 1)). (3–6)
Since the time to insert a signature into or remove a signature from a given entry is
much shorter than that to find the proper entry, the time complexities of insertion and
removal operations are almost the same as that of the search operation. Equation (3–6)
gives the space/time trade-off (i.e., Mh vs. Th) of the hash table method.
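The trade-off in Equation (3–6) is easy to evaluate numerically. The following Python sketch is our own illustration (the function name and the scaled-down parameters are assumptions chosen to keep the sums small); it computes E[Th] for a given table size:

```python
def expected_hash_calcs(M_h, R, Gamma, b):
    """E[T_h] per Equation (3-6): average of M_h / (M_h - n*(b + 1))
    over n uniform on {0, ..., R*Gamma}. Requires M_h > R*Gamma*(b + 1)."""
    n_max = R * Gamma
    return sum(M_h / (M_h - n * (b + 1)) for n in range(n_max + 1)) / (n_max + 1)

# Growing the table from 2x to 4x the minimum footprint lowers E[T_h].
e_2x = expected_hash_calcs(2 * 8000 * 97, R=1000, Gamma=8, b=96)
e_4x = expected_hash_calcs(4 * 8000 * 97, R=1000, Gamma=8, b=96)
```

As expected from the formula, doubling the table size reduces the expected number of hash function calculations per unsuccessful search.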
Analysis for Bloom filter. First of all, we consider the space complexity of Bloom
filter. Denote by Mb the length of the vector V used by Bloom filter (see Section 3.4.2).
The choice of Mb will affect the accuracy of the search function, BloomFilterSearch (see
Figure 3-6). The reason is the following.
When signatures of N flows are stored in V, φ, the fraction of entries of V with value 0, is

φ = (1 − K/Mb)^N, (3–7)
where K is the number of hash functions. Assuming K ≪ Mb, as is certainly the case, we can approximate φ as

φ ≈ exp(−KN/Mb). (3–8)
Function BloomFilterSearch(V , s) falsely identifies s to be stored in V if and only if
results of all K hash functions point to bits with value 1, which is known as a collision.
Denote by ηN the collision probability under the condition that N flows have been recorded. Then

ηN = (1 − φ)^K = [1 − exp(−KN/Mb)]^K. (3–9)
Therefore, the average collision probability is

η = Σ_{n=0}^{RΓ} ηn Pr[N = n] = (1/(RΓ + 1)) Σ_{n=0}^{RΓ} [1 − exp(−Kn/Mb)]^K, (3–10)
where N is assumed to be uniformly distributed as in Equation (3–5). From Equation (3–10),
it can be observed that η decreases with Mb if K is fixed. Based on Equation (3–10), we
can express Mb as a function of η and K:

Mb = α_{RΓ}(η, K). (3–11)
Equation (3–11) gives the space complexity of Bloom filter as a function of collision
probability and the number of hash functions.
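Equation (3–11) has no closed form, but since η in Equation (3–10) decreases monotonically as Mb grows (for fixed K), α_{RΓ}(η, K) can be found numerically. The following Python sketch is our own (function names and the doubling-plus-bisection strategy are assumptions, not from the dissertation):

```python
import math

def collision_prob(M_b, K, R, Gamma):
    """Average collision probability, Equation (3-10)."""
    n_max = R * Gamma
    return sum((1.0 - math.exp(-K * n / M_b)) ** K
               for n in range(n_max + 1)) / (n_max + 1)

def size_for(eta_target, K, R, Gamma):
    """alpha_{R Gamma}(eta, K): smallest M_b (in bits) whose average
    collision probability does not exceed eta_target. Doubling plus
    bisection is valid because eta decreases monotonically in M_b."""
    lo, hi = 1, 1
    while collision_prob(hi, K, R, Gamma) > eta_target:
        lo, hi = hi, hi * 2          # grow until the target is met
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if collision_prob(mid, K, R, Gamma) > eta_target:
            lo = mid
        else:
            hi = mid
    return hi
```

For the chapter's full-scale parameters the inner sum is large, so a practical implementation would approximate it by an integral; the structure of the search is unchanged.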
Now, let us consider the time complexity of Bloom filter. Denote by Tb the random
variable representing the number of hash function calculations.
Function BloomFilterInsert always calculates all the K hash functions, that is,
Tb|BloomFilterInsert is executed ≡ K, (3–12)
where “|” followed by an event means a condition and “≡” means equality with
probability 1.
For function BloomFilterSearch, we first consider the special case in which BloomFilterSearch returns true. In this case, all K hash functions need to be calculated. So

Tb | BloomFilterSearch returns true ≡ K. (3–13)
This fact will be used in the analysis for BFA (see Section 3.6.1).
In general,

Pr[Tb = x | N = n and BloomFilterSearch is executed]
  = φ(1 − φ)^(x−1),  x < K;
  = (1 − φ)^(K−1),   x = K. (3–14)
Hence, the conditional expectation of Tb is

E[Tb | N = n and BloomFilterSearch is executed]
  = Σ_{x=1}^{K−1} x φ(1 − φ)^(x−1) + K(1 − φ)^(K−1)
  = (1 − [1 − exp(−Kn/α_{RΓ}(η, K))]^K) / exp(−Kn/α_{RΓ}(η, K))
  ≜ βn(η, K). (3–15)
Averaging over N on both sides of Equation (3–15), we get the expectation of Tb under the condition that BloomFilterSearch is executed, i.e.,

E[Tb | BloomFilterSearch is executed] = (1/(RΓ + 1)) Σ_{n=0}^{RΓ} βn(η, K). (3–16)
If we know the two prior probabilities, i.e., the probability that BloomFilterSearch is
executed, denoted by Ps, and the probability that BloomFilterInsert is executed, denoted
by Pi, then we can get
E[Tb] = (Ps/(RΓ + 1)) Σ_{n=0}^{RΓ} βn(η, K) + Pi K. (3–17)
Equation (3–17) gives the time complexity of Bloom filter in terms of number of hash
function calculations.
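The closed form in Equation (3–15) is simply the mean of a geometric random variable truncated at K. The following Python sketch (our own check, with hypothetical function names) verifies the truncated sum against the closed form:

```python
def mean_direct(phi, K):
    # sum_{x=1}^{K-1} x*phi*(1-phi)^(x-1) + K*(1-phi)^(K-1)
    return (sum(x * phi * (1 - phi) ** (x - 1) for x in range(1, K))
            + K * (1 - phi) ** (K - 1))

def mean_closed(phi, K):
    # The closed form appearing in Equation (3-15): (1 - (1-phi)^K) / phi.
    return (1 - (1 - phi) ** K) / phi
```

The two agree for every empty ratio φ and every number of hash functions K, which confirms the algebra behind βn(η, K).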
Analysis for Bloom filter array. Once again, we analyze the space complexity of BFA first. The techniques in Section 3.6.1 can be applied here since BFA originates from the standard Bloom filter. However, there are some differences between the two schemes. As described in Section 3.5, BFA has multiple buffers, namely IVj, RVj, and Cj, j ∈ Zw. Therefore, the storage size of BFA, denoted by Ma (in bits), is w(2 × Mv + L), where Mv is the size of each insertion or removal vector, and L is the size of each counter in bits.
Similar to Equation (3–10), the collision probability is

η = (1/(Rγ + 1)) Σ_{n=0}^{Rγ} [1 − exp(−Kn/Mv)]^K. (3–18)
Note that the length of each time slot of BFA is γ, so the upper limit of the summation is Rγ rather than RΓ. Similar to Equation (3–11), Mv is a function of η and K. We define

Mv = α_{Rγ}(η, K). (3–19)

Then

Ma = w(2α_{Rγ}(η, K) + L). (3–20)
Equation (3–20) gives the space complexity of BFA.
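Putting Equations (3–18) through (3–20) together, the storage of BFA for a target collision probability can be computed numerically. The Python sketch below is our own; it solves Equation (3–18) only to power-of-two granularity, so it is a coarse upper estimate of α_{Rγ}(η, K):

```python
import math

def bfa_collision(M_v, K, R, gamma):
    """Equation (3-18): average collision probability of one slot vector."""
    n_max = R * gamma
    return sum((1.0 - math.exp(-K * n / M_v)) ** K
               for n in range(n_max + 1)) / (n_max + 1)

def bfa_storage(eta_target, K, R, gamma, w, L):
    """M_a = w * (2 * M_v + L) with M_v = alpha_{R gamma}(eta, K),
    per Equations (3-19)-(3-20); M_v found by power-of-two doubling."""
    M_v = 1
    while bfa_collision(M_v, K, R, gamma) > eta_target:
        M_v *= 2
    return w * (2 * M_v + L)
```

Replacing the doubling with the bisection used for the plain Bloom filter would tighten the estimate; the w-fold replication and the counters are what distinguish Ma from Mb.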
Now, let us consider the time complexity of BFA. Denote by Ta the random
variable representing the number of hash function calculations for BFA. Recall that BFA
(Figure 3-9) defines three functions, ProcInbound, ProcOutbound, and Sample. Obviously,
Ta|Sample is executed ≡ 0. (3–21)
When executing Function ProcInbound, all the K hash functions need to be
calculated. The reason is the following.
1. If variables a and b are both false, Function BloomFilterInsert is executed, which calculates K hash functions (see Equation (3–12)).
2. Otherwise, at least one of a and b is true; then at least one of the search operations, i.e., BloomFilterSearch(RVj′, s), j′ = (I − w + 1)%w, (I − w + 2)%w, …, I%w, and BloomFilterSearch(IVI, s), returns true. This also means that K hash functions have been calculated (see Equation (3–13)).
Therefore, in any case, ProcInbound calculates all K hash functions. Further note that, although ProcInbound executes up to w + 1 search operations and at most one insertion operation, the total number of hash function calculations in these operations
is the same as that in one search operation. This is because the results of hash function
calculation in one search operation can be used again by all the other search operations
and insertion operation. Therefore,
Ta|ProcInbound is executed ≡ K. (3–22)
Similarly,
Ta|ProcOutbound is executed ≡ K. (3–23)
In each time slot, we execute Sample once, ProcInbound Rpiγ times, and ProcOutbound Rpoγ times, where Rpi and Rpo are the inbound and outbound packet arrival rates, respectively. Combining Equations (3–21), (3–22), and (3–23) and assuming (Rpi + Rpo)γ ≫ 1, which always holds in our design of BFA, we have

E[Ta] = 0 × 1/((Rpi + Rpo)γ + 1) + K × (Rpi + Rpo)γ/((Rpi + Rpo)γ + 1) ≈ K. (3–24)
Combining Equations (3–24) and (3–20), we obtain the relationship between Ma and Ta:

Ma = w[2α_{Rγ}(η, E[Ta]) + L]. (3–25)
Table 3-3: Space/time complexity for hash table, Bloom filter, and BFA
  Algorithm     Space complexity                       Time complexity
  Hash table    Mh (free variable)                     Equation (3–6)
  Bloom filter  Equations (3–10) and (3–11)            Equations (3–15), (3–16), and (3–17)
  BFA           Equations (3–18), (3–19), and (3–20)   Equation (3–24)
Table 3-3 lists the space complexity and time complexity for hash table, Bloom filter,
and BFA algorithms.
Numerical Results.
In this section, we use the formulae derived above to compare the hash table scheme with the BFA algorithm through numerical calculations. The setting of our numerical study is the following:

1. Traces captured from an ISP's edge router show that the average number of flows during one second is around 250,000. So, we let R = 250,000. To reduce the probability of false alarms caused by normal packets with long RTT, we choose Γ large enough that more than 99% of packets have RTT less than Γ. For the same traces, Γ = 80 seconds.
2. Suppose we want to detect TCP traffic anomaly. Thus the signature captured from each packet is composed of a 32-bit SA, 32-bit DA, 16-bit SP, and 16-bit DP. So b = 96 bits.
3. In the BFA algorithm, we use 40 time slots (i.e., w = 40), each of which is 2 seconds (i.e., γ = 2). Also suppose each counter is a 32-bit integer (i.e., L = 32).
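Under these settings, a few quantities used in the discussion below follow by direct arithmetic (a small Python sketch of our own; variable names are ours):

```python
# Settings of the numerical study: R in flows/s, Gamma in s, b in bits,
# w slots of gamma seconds each, L-bit counters.
R, Gamma, b = 250_000, 80, 96
w, gamma, L = 40, 2, 32

max_flows = R * Gamma                       # distinct flows per Gamma window
min_hash_table_bits = max_flows * (b + 1)   # R*Gamma*(b + 1): the floor that
                                            # M_h approaches as E[T_h] grows
flows_per_slot = R * gamma                  # flows one BFA slot vector absorbs
```

So the hash table can never use fewer than roughly 1.94 Gbits here, whereas each BFA slot vector only has to absorb the flows of one 2-second slot.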
Figure 3-10. Space/time trade-off (space M vs. time E[T]) for the hash table, BFA with η = 0.1%, and BFA with η = 1%.
Figure 3-10 shows M vs. E[T] for the hash table scheme, BFA with collision probability 1%, and BFA with collision probability 0.1%. In Figure 3-10, the x-axis represents the time complexity (i.e., the expected number of hash function calculations) and the y-axis represents the space complexity (i.e., the number of bits needed for storage). From Figure 3-10, we can see that the curve of BFA is below the curve of the hash table. This means that BFA uses less space for a given time complexity. Therefore, BFA achieves a better
space/time trade-off than the hash table. We also see that the curve of BFA with η = 1%
is below the curve of BFA with η = 0.1%. This shows the relationship between space/time
and collision probability. Specifically, to reach a lower collision probability or more
accurate detection, we need to either calculate more hash functions or use more storage
space.
To see the gain of using BFA, let us look at an example. Suppose E[T] = 5, i.e., in each slot, 5 hash function calculations are needed on average. Then, the memory required by the hash table scheme, BFA with η = 0.1%, and BFA with η = 1% is 1.01G bits, 115.3M bits, and 62.9M bits, respectively. It can be seen that BFA with η = 1% saves storage by a factor of 16 compared to the hash table scheme.
Figure 3-10 shows that for the hash table scheme, Mh is a monotonically decreasing function of E[Th]. This matches our intuition that the larger the table, the smaller the collision probability, resulting in fewer hash function calculations. Further note that Mh approaches RΓ(b + 1) as E[Th] increases; this is the minimum space required to accommodate up to RΓ flows.
For BFA, Ma is not a monotonic function of E[Ta], which approximately equals K.
We have the following observations.
• Case A: For a fixed storage size, the smaller K, the larger the probability that all K hash functions of two different inputs return the same outputs, i.e., the collision probability. In other words, the smaller K, the larger the storage size required to achieve a fixed collision probability. That is, K ↓ ⇒ Ma ↑.

• Case B: Since an input to BFA may set K bits to “1” in a vector V, the larger K, the more bits in V will be set to “1” (nonempty), which translates into a larger collision probability. In other words, the larger K, the larger the storage size required to achieve a fixed collision probability. That is, K ↑ ⇒ Ma ↑.

Combining Cases A and B, it can be argued that there exists a value of K or E[Ta]
that achieves the minimum value of Ma, given a fixed collision probability. This minimum
property can be used to guide the parameter setting for BFA, which will be addressed in
Section 3.6.2.
3.6.2 Optimal Parameter Setting for Bloom Filter Array
This section addresses how to determine parameters of BFA under two criteria,
namely, minimum space criterion and competitive optimality criterion.
Minimum space criterion. According to Equation (3–25), the three parameters Ma, E[Ta], and η are coupled. Since the collision probability η critically affects the detection error rate in our network anomaly detection, a network operator may want to choose an upper bound η̄ on the acceptable collision probability η and then minimize the storage required, i.e.,

min_{E[Ta]} Ma, subject to η ≤ η̄. (3–26)
According to Equation (3–25), the solution of (3–26) is

Ma* = min_{E[Ta]} Ma = min_{E[Ta]} w[2α_{Rγ}(η̄, E[Ta]) + L], (3–27)

E[Ta]* = arg min_{E[Ta]} Ma = arg min_{E[Ta]} α_{Rγ}(η̄, E[Ta]). (3–28)
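Since E[Ta] ≈ K takes small integer values, the minimization in Equations (3–27) and (3–28) can be carried out by brute force over K. The Python sketch below is our own (function names, default parameters, and the power-of-two granularity of α are all assumptions):

```python
import math

def collision(M_v, K, n_max):
    # Equation (3-18), with n_max = R * gamma.
    return sum((1.0 - math.exp(-K * n / M_v)) ** K
               for n in range(n_max + 1)) / (n_max + 1)

def min_vector_bits(eta_bar, K, n_max):
    """Smallest power-of-two M_v with average collision <= eta_bar."""
    M_v = 1
    while collision(M_v, K, n_max) > eta_bar:
        M_v *= 2
    return M_v

def optimal_K(eta_bar, n_max, w=8, L=32, K_range=range(1, 13)):
    """Brute-force solution of (3-27)/(3-28): since E[T_a] ~ K, scan K."""
    def cost(K):
        return w * (2 * min_vector_bits(eta_bar, K, n_max) + L)
    best = min(K_range, key=cost)
    return best, cost(best)
```

Scanning K directly exploits the minimum property argued in Cases A and B above: the cost first falls and then rises as K grows, so a small scan suffices.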
Figure 3-11. Relation among space complexity, time complexity, and collision probability. (a) Ma* vs. η̄. (b) E[Ta]* vs. η̄.
Figure 3-11 shows Ma* vs. η̄ and E[Ta]* vs. η̄ under the same setting as in Section 3.6.1. From Figure 3-11(a), it can be observed that Ma* decreases as η̄ increases. This is because the larger the collision probability we can tolerate, the less space is required.
From Figure 3-11(b), one observes that, generally, E[Ta]* decreases as η̄ increases. This may be because the smaller E[Ta]* (or K), the larger the probability that all K hash functions of two different inputs return the same outputs, i.e., the larger the collision probability.
Competitive optimality criterion. From Equation (3–18), it can be observed that η decreases as Mv increases if K is fixed; in other words, Mv decreases as η increases if K is fixed. Further, from Equations (3–19) and (3–25), it can be inferred that Ma decreases as η increases if E[Ta] is fixed (note that E[Ta] ≈ K). This is shown in Figure 3-12. From the figure, it can be observed that the two curves intersect at a collision probability value, denoted by ηc. This value is critical for the parameter setting of BFA. If a network operator has a desired collision probability η̄ greater than ηc, then it should choose E[Ta] = 4, since this setting gives both smaller time complexity and smaller space complexity. We call this property ‘competitive optimality’ since there is no trade-off between time complexity and space complexity in this case. On the other hand, if a network operator has a desired collision probability η̄ smaller than ηc, then it needs to make a trade-off between space complexity and time complexity.
3.7 Simulation Results
In this section, we conduct two sets of experiments to show the performance of BFA
for feature extraction in high-speed networks. Section 3.7.1 compares the performance of
the BFA algorithm with that of the hash table algorithm. In Section 3.7.2, we show the
performance of the complete feature extraction system, which uses the BFA algorithm.
3.7.1 The BFA Algorithm vs. the Hash Table Algorithm
Simulation settings. We apply the hash table algorithm and the BFA algorithm to the time series of signatures extracted from real traffic traces collected by Auckland University [27]. To make a fair comparison with respect to the numerical
Figure 3-12. Space complexity Ma vs. collision probability η for fixed time complexity (curves for E[Ta] = 4 and E[Ta] = 6 intersect at ηc = 0.0156).
results in Section 3.6.1, we use the same 96-bit signature, i.e., SA, DA, SP, and DP, and let R = 250,000 packets/second and Γ = 80 seconds, which translates to 250,000 × 80 = 20M input signatures per simulation. These signatures are preloaded into memory before each simulation begins, so that the I/O speed of the hard drive does not affect the measured execution time.
For each simulation run of the hash table algorithm, we specify the memory size Mh and measure the algorithm performance in terms of the average number of hash function calculations per signature query, denoted by T̄h, and the execution time. By the Law of Large Numbers, T̄h approaches the expected number of hash function calculations per query, i.e., E[Th] in Equation (3–6), if we run the simulation many times with the same Mh. In our simulations, we run the hash table algorithm ten times, each time with a different set of input signatures but with the same Mh.
For each simulation run of the BFA algorithm, we specify the memory size Ma and the number of hash functions K, and measure the algorithm performance in terms of the collision frequency, denoted by η̂, and the execution time. The collision frequency is defined as the ratio of the number of collision occurrences in BloomFilterSearch to the total number of BloomFilterSearch executions. By the Law of Large Numbers, η̂ is a good estimate of the collision probability η.
Performance comparison between hash table and BFA. Figure 3-13 shows
average processing time per query vs. memory size for the hash table algorithm, BFA
algorithm with η=0.1%, and BFA algorithm with η=1%.
Figure 3-13. Memory size (in bits) vs. average processing time per query (in µs)
From Figure 3-13, we observe that 1) compared to the hash table algorithm, the BFA
algorithm requires less memory space for the same time complexity (average processing
time per query), as predicted in Section 3.6, and 2) the BFA algorithm with η = 1% has a better space/time trade-off than the BFA algorithm with η = 0.1%, but at the cost of a higher collision probability, as predicted by the numerical results in Figure 3-10.
Figure 3-14 shows average processing time per query vs. average number of hash function calculations per query. It can be observed that the average processing time per query increases linearly with the average number of hash function calculations per query. For this reason, instead of running simulations to obtain the time complexity (i.e., the average
Figure 3-14. Average processing time per query (in µs) vs. average number of hash function calculations per query.
processing time per query), in Section 3.6.1, we used the average number of hash function
calculations per query to represent the time complexity of the hash table algorithm and
the BFA algorithm.
Performance comparison between numerical and simulation results. Figure 3-15 compares the simulation results with the numerical results obtained from the analysis in Section 3.6, for both the hash table and BFA algorithms, in terms of space complexity vs. time complexity.
In Figure 3-15(a), the numerical result agrees well with the simulation result, except
when the average number of hash function calculations per query is close to 1. From
Equation (3–6), if the expected number of hash function calculations approaches 1, the required memory size approaches infinity; in contrast, simulations with a very large Mh may not give accurate results, due to the limited memory of a computer. This causes the large discrepancy between the numerical and simulation results when the average number of hash function calculations per query is close to 1. When the average number of hash function calculations per query is greater than or equal to two, the simulation always requires more memory than the numerical result predicts. This is because practical hash functions are not perfect: entries in the hash table are not
Figure 3-15. Comparison of numerical and simulation results. (a) Hash table algorithm.(b) BFA algorithm with η=1%.
equally likely to be accessed. Hence, Equation (3–2) does not hold exactly, nor does Equation (3–3). As a result, the average number of hash function calculations per query in simulation is larger than that predicted by Equation (3–6).
Figure 3-15(b) shows that the numerical result agrees well with the simulation result
for all the values of the average number of hash function calculations per query under our
study.
3.7.2 Experiment on Feature Extraction System
In this section, we show the performance of the complete feature extraction system
implemented on traffic monitors and local analyzers, which uses the BFA algorithm.
We conduct this experiment because we would like to know the performance of the whole hierarchical feature extraction architecture presented in Section 3.2; in contrast, the experiment in Section 3.7.1 does not involve the interaction among the three levels.
Experiment settings. We use the trace data provided by Auckland University [27]
as the background traffic. This data set consists of packet header information of traffic
between the Internet and Auckland University. The connection is OC-3 (155 Mb/s) for
both directions.
In our experiment, we use two 24-hour traces as the background traffic. We simulate
network anomalies caused by TCP SYN flood attacks [18] by randomly inserting TCP
SYN packets with random source IP addresses into the background trace during specified
time periods. Specifically, synchronized attacks are simulated during seconds 14000–16000 and 28000–32000 in both traces. In addition, asynchronous attacks are launched during seconds 50000–52000 in trace 1 and seconds 57000–59000 in trace 2. The
average attack rate is 1% of the packet rate of the background traffic during the same
period.
To detect TCP SYN flood attacks, we choose <SA, DA, SP, DP> as the signature of inbound packets and <DA, SA, DP, SP> for outbound ones. Thus the two-way matching feature is the number of unmatched inbound TCP SYN packets in one time slot.
The average flow rate is 2480 flows/second; therefore, we set R = 2480. We further set η = 0.1%, K = 8, w = 8, and γ = 10. Then, by solving Equation (3–18) for Mv and requiring Mv to be a power of 2, we obtain Mv = 2^15 bits. The computer used for our experiments has one 2.4 GHz CPU and 1 GB of memory. For comparison, we also extract the number of inbound SYN packets in a slot.
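Solving Equation (3–18) for Mv can be sketched in Python as below. This is our own coarse power-of-two search, and it assumes each slot vector must absorb all Rγ flow arrivals of a slot; the dissertation's computation may count flows differently, so the value this sketch returns need not match the one quoted above.

```python
import math

R, gamma, K, eta_bar = 2480, 10, 8, 0.001
n_max = R * gamma   # flow arrivals one slot vector may have to hold (assumption)

def collision(M_v):
    """Equation (3-18) evaluated for these parameters."""
    return sum((1.0 - math.exp(-K * n / M_v)) ** K
               for n in range(n_max + 1)) / (n_max + 1)

M_v = 1
while collision(M_v) > eta_bar:   # grow to the next power of two
    M_v *= 2
```

By construction the result is the smallest power of two meeting the target, which matches the power-of-two requirement stated in the text.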
Performance.
The average processing rate is measured to be 265,000 packets/second. Hence, the algorithm can handle a line rate of 1 Gbps, since the average Internet packet
size is about 500 bytes. Note that our test is offline and data is read from hard disk,
whose access speed is much lower than that of memory. In a real implementation,
data is captured by a high-speed network interface and maintained in the memory;
so the processing speed can be increased. Furthermore, in our test, the hash functions are implemented in software, which is much slower than dedicated hardware. Therefore, it is reasonable to anticipate a higher processing rate if dedicated hardware is used.
We show the features extracted from the two traces, specifically, the number of SYN
packet arrivals and the number of unmatched inbound SYN packet arrivals during a slot
(Figure 3-16).
From Figure 3-16, it can be observed that the features are rather noisy, especially for
the feature of the number of SYN packets. From Figs. 3-16(a) and 3-16(c), we can hardly
distinguish the slots under the low-volume synchronized attacks from the slots without attacks (by visual inspection). In comparison, it is much easier to identify the slots under the synchronized attacks when the number of unmatched SYN packets is used as the feature (see slots 1400–1600 and 2800–3200 in Figs. 3-16(b) and 3-16(d)).
3.8 Summary
This chapter is concerned with the design of data structures and algorithms for network anomaly detection, more specifically, feature extraction for network anomaly detection. Our objective is to design efficient data structures and algorithms for feature extraction that can cope with links whose line rates are on the order of Gbps. We proposed a novel
data structure, namely the Bloom filter array, to extract the so-called two-way matching
features, which are shown to be effective indicators of network anomalies. Our key
technique is to use a Bloom filter array to trade off a small amount of accuracy in feature
extraction, for much less space and time complexity. Different from the existing work, our
data structure has the following properties: 1) dynamic Bloom filter, 2) combination of a
sliding window with the Bloom filter, and 3) using an insertion-removal pair to enhance
the Bloom filter with a removal operation. Our analysis and simulation demonstrate
that the proposed data structure has a better space/time trade-off than conventional
algorithms.
Next, we discuss classification algorithms based on the extracted features.
Figure 3-16. Feature data: (a) Number of SYN packets (link 1), (b) Number of unmatchedSYN packets (link 1), (c) Number of SYN packets (link 2), and (d) Number of unmatchedSYN packets (link 2).
CHAPTER 4
MACHINE LEARNING ALGORITHM FOR NETWORK ANOMALY DETECTION
4.1 Introduction
The third issue in network anomaly detection is the classification algorithm. In this section, we introduce three basic detection approaches: the threshold-based algorithm, the change-point algorithm, and Bayesian decision theory. Our machine learning algorithm derives from Bayesian decision theory.
This section is organized as follows. Section 4.1.1 introduces the Receiver Operating Characteristics curve, which is used in Section 4.5 as the metric to compare the performance of different classification methods. We describe the threshold-based algorithm, the change-point algorithm, and Bayesian decision theory in Sections 4.1.2, 4.1.3, and 4.1.4, respectively.
4.1.1 Receiver Operating Characteristics Curve
The Receiver Operating Characteristics (ROC) curve [28] is a standard method to quantify the performance of detection algorithms. It is a plot of detection probability vs. false alarm probability. In practice, we estimate the detection probability by the fraction of true positives and the false alarm probability by the fraction of false positives. Hence, to obtain an ROC curve, one needs to measure the following quantities:
• ∆f: the number of false alarms, i.e., the number of slots in which the detection algorithm declares ‘abnormal’ given that no anomaly actually happens in these slots;
• ∆n: the number of slots in which no anomaly happens;
• ∆d: the number of slots in which the detection algorithm declares ‘abnormal’ given that network anomalies actually happen in these slots;
• ∆a: the number of slots in which network anomalies happen.
The false alarm probability and the detection probability of the detection algorithm
can be estimated by ∆f/∆n and ∆d/∆a, respectively. By varying parameters of detection
algorithms, we can obtain different pairs of false alarm probability and detection
probability, which give the ROC curve [28]. In this paper, we will use the ROC curve
to compare the performance of different detection algorithms.
Next, we introduce some basic classification algorithms.
4.1.2 Threshold-Based Algorithm
The idea of the threshold-based algorithm is that if the feature value exceeds a
preset threshold, declare ‘abnormal’; otherwise, declare ‘normal’. Note that the detection
operation is conducted in each slot. By tuning the threshold for the feature value, we
can obtain different pairs of false alarm probability and detection probability, resulting
in the ROC curve. Given the ROC curve and the desired false alarm probability, one can
determine the value of the threshold for detection operation.
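The threshold sweep described above can be sketched in Python as follows (the helper name `roc_points` and the interface are ours; `labels` marks which slots actually contain anomalies):

```python
def roc_points(feature, labels, thresholds):
    """Sweep a threshold detector over per-slot feature values.

    labels[i] is True when slot i actually contains an anomaly. For each
    threshold, returns (Delta_f / Delta_n, Delta_d / Delta_a), i.e., one
    (false alarm probability, detection probability) point of the ROC curve.
    """
    n_normal = sum(1 for a in labels if not a)   # Delta_n
    n_attack = sum(1 for a in labels if a)       # Delta_a
    points = []
    for th in thresholds:
        false_alarms = sum(1 for v, a in zip(feature, labels) if v > th and not a)
        detections = sum(1 for v, a in zip(feature, labels) if v > th and a)
        points.append((false_alarms / n_normal, detections / n_attack))
    return points
```

Sweeping the threshold from low to high moves the operating point from (1, 1) toward (0, 0), tracing out the ROC curve.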
Although it is the simplest method, the threshold-based algorithm can only be used when the feature values differ significantly between normal and abnormal conditions. Therefore, the threshold-based algorithm is not suitable for detecting low-volume network anomalies.
4.1.3 Change-Point Algorithm
In the literature, a simple change-point algorithm, the non-parametric Cumulative Sum (CUSUM) algorithm, has been widely used [11–13, 17]. However, existing studies only consider the change from the normal state to the abnormal state, which means that the number of false alarms can be very large after network anomalies end. To facilitate the discussion, Table 4-1 defines the parameters used in CUSUM.

Table 4-1: Parameters used in CUSUM
  Parameter   Description
  Φ(ti)       The observed traffic feature at the end of time slot i.
  Φn          The expectation of Φ(ti) in the normal state.
  Φa          The expectation of Φ(ti) in the abnormal state. Without loss of generality, we assume Φn < Φa.
  Φ̃(ti)       The adjusted variable, defined as Φ̃(ti) = Φ(ti) − a, where a is a parameter such that Φn < a < Φa.
Now define the variable S(ti) by

S(ti) ≜ 0, i = 0;
S(ti) ≜ max(0, S(ti−1) + Φ̃(ti)), i > 0. (4–1)
In the CUSUM algorithm, if S(ti) is smaller than a threshold HCUSUM , declare that
the network state is normal; otherwise, declare that the state is abnormal.
From the discussion above, we note that two parameters, i.e., a and HCUSUM , need
to be determined. However, we cannot uniquely determine these two parameters. To
overcome this problem, we shall introduce another parameter, i.e., the detection delay,
denoted as D. According to the change-point theory, we have
    D / HCUSUM → 1 / ((Φa − Φn) − |Φn − a|) = 1 / (Φa − a).      (4–2)

From Equation (4–2), we obtain

    HCUSUM = D × (Φa − a).      (4–3)
Hence, once D and a are given, we can determine HCUSUM through Equation (4–3).¹
Given a and HCUSUM, we can use the CUSUM algorithm to detect network anomalies.
We notice that the existing CUSUM algorithms [11–13] only consider one change,
i.e., from the normal state to the abnormal state. In practice, this approach may lead to
a large number of false alarms after the end of attacks. To mitigate the high false alarm
issue of the existing algorithms, which we call single-CUSUM algorithms, we develop a
dual-CUSUM algorithm. In this algorithm, one CUSUM will be used to detect the change
from the normal to the abnormal state, while another CUSUM is responsible for detecting
the change from the abnormal to the normal state. The method of setting parameters for
¹ In Refs. [11] and [12], a = (Φa − Φn)/2; thus only the detection delay is needed.
dual-CUSUM is similar to the method described in this section. Tuning HCUSUM results
in the ROC curves of both the CUSUM and dual-CUSUM methods.
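To make the recursion concrete, the single-CUSUM detector of Equation (4–1), with the threshold set via Equation (4–3), can be sketched as follows (Python; the traffic values and the setting Φn = 1, Φa = 4, a = 2, D = 2 are hypothetical):

```python
def cusum_detect(features, a, h):
    """Non-parametric CUSUM (Eq. 4-1): S(ti) = max(0, S(ti-1) + (Phi(ti) - a));
    declare 'abnormal' (1) in every slot where S(ti) reaches the threshold h."""
    s, decisions = 0.0, []
    for phi in features:
        s = max(0.0, s + (phi - a))
        decisions.append(1 if s >= h else 0)
    return decisions

# Hypothetical setting: Phi_n = 1, Phi_a = 4, pick a = 2 (Phi_n < a < Phi_a);
# a target detection delay D = 2 slots gives h = D * (Phi_a - a) = 4 by Eq. (4-3).
h = 2 * (4 - 2)
print(cusum_detect([1, 1, 4, 4, 4, 1], a=2, h=h))  # -> [0, 0, 0, 1, 1, 1]
```

A dual-CUSUM variant would run a second, mirrored accumulator to detect the change back from the abnormal to the normal state, clearing alarms once the attack ends.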
Although CUSUM performs better than the threshold-based method, its detection
accuracy is still unsatisfactory. We developed a machine learning algorithm that
dramatically outperforms both the single- and dual-CUSUM algorithms (Section 4.5.2).
Our machine learning algorithm is based on Bayesian decision theory, which is introduced
in the next section.
4.1.4 Bayesian Decision Theory
Bayesian decision theory is a “fundamental statistical approach to the problem of
pattern classification”[29]. It is composed of
• the feature space, D, which might be a multi-dimensional Euclidean space;

• U states of nature, ~H = {Hu; u ∈ ZU};

• the prior probability distribution, P(H), H ∈ ~H;

• the likelihood probabilities, p(φ|H), φ ∈ D, H ∈ ~H;

• the loss function, χ(H∗, H), H∗, H ∈ ~H, which describes the loss incurred by classifying an object as class H∗ when its true state of nature is class H.
Note that, in this dissertation, P(·) represents a probability mass function (PMF) [30] and
p(·) a probability density function (pdf) [30].
Given the observed feature, φ, of an object, the Bayesian decision theory classifies it
to be of class H such that
    Ĥ = arg min_{H∗} Σ_{H∈~H} χ(H∗, H) P(H|φ).      (4–4)

Due to the Bayes formula [30],

    P(H|φ) = p(φ|H)P(H) / Σ_{H′∈~H} p(φ|H′)P(H′) = p(φ|H)P(H) / p(φ),      (4–5)

Equation (4–4) is equivalent to

    Ĥ = arg min_{H∗} Σ_{H∈~H} χ(H∗, H) p(φ|H) P(H).      (4–6)
Equation (4–6) gives the Bayesian criterion for pattern classification.
A simple loss function is defined as

    χ(Hu∗, Hu) = −χ(u) I(u∗ = u),  for all u ∈ ZU,      (4–7)

where χ(u) is the gain factor, typically positive, representing the gain obtained by
correctly detecting Hu, and I(·) is the indicator function,

    I(x) = 1 if x is true; 0 if x is false.      (4–8)

Equation (4–7) specifies that misclassification induces zero loss and correct classification
induces negative loss, i.e., a gain. Applying Equation (4–7) to Equation (4–6),
the Bayesian criterion simplifies to

    û = arg max_{u∈ZU} χ(u) p(φ|Hu) P(Hu).      (4–9)
We call Equation (4–9) the maximum gain criterion. Further note that scaling all gain
factors by the same constant does not change the criterion specified by Equation (4–9). Hence,
we can always set χ(0) = 1. By tuning the other gain factors, we can generate the ROC curve for
Bayesian decision theory.
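For a two-state instance of Equation (4–9), the criterion reduces to comparing two weighted scores. A minimal Python sketch follows; the Gaussian likelihoods, priors, and gain values are hypothetical placeholders, not the trained models of this chapter:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def max_gain_classify(phi, priors, likelihoods, gains):
    """Maximum gain criterion (Eq. 4-9): choose u maximizing chi(u) p(phi|H_u) P(H_u)."""
    scores = [gains[u] * likelihoods[u](phi) * priors[u] for u in range(len(priors))]
    return scores.index(max(scores))

# Hypothetical two-state setup: normal feature ~ N(1, 1), abnormal ~ N(5, 1).
priors = [0.9, 0.1]
liks = [lambda x: gaussian_pdf(x, 1.0, 1.0), lambda x: gaussian_pdf(x, 5.0, 1.0)]
gains = [1.0, 1.0]  # chi(0) fixed to 1; raising chi(1) trades false alarms for detections
print(max_gain_classify(4.5, priors, liks, gains))  # -> 1 (abnormal)
print(max_gain_classify(1.0, priors, liks, gains))  # -> 0 (normal)
```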
In this chapter, we extend Bayesian decision theory to network anomaly
detection. The remainder of this chapter is organized as follows. In Section 4.2, we
establish Bayesian models for network anomaly detection. Sections 4.3 and 4.4 solve
two fundamental problems of the Bayesian model, i.e., the training problem and the classification
problem, respectively. Section 4.5 presents our simulation results, and Section 4.6 concludes
this chapter.
4.2 Bayesian Model for Network Anomaly Detection
In this section, we model the network anomaly detection issue in terms of Bayesian
decision theory. This section is organized as follows. Section 4.2.1 generalizes the
Bayesian model for network anomaly detection on traffic monitors and local analyzers.
In Section 4.2.2, we extend this model to the whole autonomous system. Section 4.2.3
introduces the hidden Markov tree model, which reduces the computational complexity of the
general model defined in Section 4.2.2.
4.2.1 Bayesian Model for Traffic Monitors and Local Analyzers
As described earlier, both the traffic monitor and the local analyzer have only local
information about one edge router. They are able to detect network anomalies through a
single edge router. That is, a traffic monitor declares an anomaly when it observes
abnormal features extracted from one link of an edge router, such as a large data rate or a large
SYN/FIN(RST) ratio. Similarly, a local analyzer detects network anomalies
by observing the two-way matching features on one edge router. Next, we formulate the
detection problem in terms of the Bayesian decision theory introduced in Section 4.1.4.
Figure 4-1. Generative process in graphical representation, in which the traffic state generates the stochastic process of traffic.
In the context of network anomaly detection, an edge router has two states of nature, i.e., ~H = {H0, H1}, where

• H0 represents the normal state, in which no abnormal network traffic enters the AS through that edge router;

• H1 represents the abnormal state, in which abnormal network traffic enters the AS through the edge router.
To formulate the model, we define a random variable Ω : ~H → Z2, such that
Ω(Hu) = u, u ∈ Z2. (4–10)
Furthermore, denote by Λ the traffic observed by traffic monitors. Since the network
state induces the stochastic process of traffic, we employ the widely used graphical model
representation [31] to depict this cause-effect relationship in Figure 4-1.
Figure 4-2. Extended generative model including traffic feature vectors: (a) original model and (b) simplified model.
Denote by Φ, Φ ∈ D, the feature extracted from the traffic, where D is the feature space.
Most importantly, in selecting the optimal features, we seek the most discriminative
statistical properties of the traffic. Also note that it is possible to employ multiple features
in the detection procedure, in which case Φ is a vector. Since features are succinct
representations of the voluminous traffic, we extend the model in Figure 4-1 to the
one illustrated in Figure 4-2(a). Once Φ is extracted from Λ, we assume that Φ represents
Λ well. This means that we may operate only over the lower-dimensional Φ, which reduces
the computational complexity. Therefore, we simplify the model in Figure 4-2(a) to that
illustrated in Figure 4-2(b), where Λ is omitted.
Since the feature is measurable, it is called an observable random vector and is depicted
by a rectangular node in Figure 4-2(b). The network state that generates the traffic feature
is to be estimated. We call it a hidden random variable and depict it by a round node in
Figure 4-2(b). Now, the goal becomes estimating the hidden state Ω given the observable
Φ. The maximum gain criterion (see Equation (4–9)) specifies the estimate, û, to be

    û = arg max_{u∈Z2} χ(u) p(φ|Ω = u) P(Ω = u).      (4–11)
Since p(φ|Ω) and P (Ω = u) are unknown, we need to estimate them. This is the goal
of training (Section 4.3).
Figure 4-3. Generative independent model that describes dependencies among traffic states and traffic feature vectors.
An AS has many edge routers, each of which has multiple links. Traffic monitors
deployed on links and local analyzers on edge routers extract features and make decisions
independently. Therefore, the one-link model in Figure 4-2(b) is further extended to the
more general model for the AS illustrated in Figure 4-3, where κ denotes the number
of edge routers.
The limitation of the detection model in Figure 4-3 is that it assumes edge routers
are mutually independent. This is because traffic monitors and local analyzers
have only local information about the whole AS. Although the model is suitable for detecting
network anomalies accompanied by high traffic volume on a single link, it is not suitable for
detecting low-volume network anomalies. We address this limitation by introducing spatial
correlation in the next section.
4.2.2 Bayesian Model for Global Analyzers
The novelty of our detection approach lies in introducing spatial correlation among
edge routers into the network anomaly detection. This section introduces the spatial
correlation and its contribution to network anomaly detection. Since only global analyzers
have global information about the whole AS, a detection approach employing spatial correlation
can be deployed only in global analyzers.
When a network anomaly occurs, usually more than one edge router exhibits
abnormal symptoms. For example, when DDoS attacks are launched toward a victim
in an AS, the attack traffic enters the AS through multiple edge routers because the attack
sources are distributed. At each of those edge routers, the monitored traffic volume may be
low. That is, each traffic monitor or local analyzer observes only a small deviation of the traffic
features from the normal distribution. However, the global analyzer, upon obtaining reports
from local analyzers, observes small deviations of features at multiple edge routers
simultaneously. Employing spatial correlation thus contributes to detecting low-volume
network anomalies.
Figure 4-4. Generative dependent model that describes dependencies among edge routers.
Introducing spatial correlation into the independent model in Figure 4-3 results
in the dependent model illustrated in Figure 4-4. The difference between the two models
is that, from the viewpoint of a global analyzer, edge routers are no longer independent.
As a result, statistical dependence among the states of edge routers is represented by the
non-directional connections. Note that the independent model can be regarded as a special
case of the dependent one. Also note that we still assume that features extracted from one
edge router are independent of the states of other edge routers.
Let

    ~Ω = (Ω1, Ω2, · · · , Ωκ),      (4–12)
    ~u = (u1, u2, · · · , uκ),      (4–13)
    ~φ = (φ1, φ2, · · · , φκ),      (4–14)
where Ωi is the random variable representing the state of edge router i, i ∈ {1, . . . , κ},
defined in the same way as in Equation (4–10). We further assume the gain factors are
independent of the node index, i.e., χ(ui) = χ(ui′) whenever ui = ui′ (0 or 1), regardless of
whether i equals i′. Then the maximum gain criterion (see Equation (4–9)) for
the dependent model is

    ~u = arg max_~u χ(~u) p(~φ|~Ω = ~u) P(~Ω = ~u)
       = arg max_~u [ ∏_{i=1}^{κ} χ(ui) p(φi|Ωi = ui) ] P(~Ω = ~u).      (4–15)
As the dependent model takes spatial correlation into consideration, it can make more
accurate detections, especially when the traffic volume is low. However, it is computationally
intractable: to solve Equation (4–15) directly, we need to exhaustively
compute p(~φ|~Ω)P(~Ω) for each possible combination of ~Ω, which results in O(2^κ)
complexity. For a large AS, this is intractable.

We introduce a hierarchical structure to reduce the computational complexity, which is the
topic of the next section.
4.2.3 Hidden Markov Tree (HMT) Model for Global Analyzer
The reason that the dependent model illustrated in Figure 4-4 is computationally
intractable is that we assume edge routers are fully dependent. Roughly speaking,
if we break some dependencies in Figure 4-4, we can reduce the computational
complexity. On the other hand, we would like to account for dependencies among as
many nodes as possible to provide accurate detection. To balance these two conflicting
goals, we propose to use a hierarchical model, the hidden Markov tree (HMT) model, as
depicted in Figure 4-5.
Figure 4-5. Hidden Markov tree model. For a node i, ρ(i) denotes its parent node and ν(i) denotes the set of its children nodes.
The motivation for applying the HMT model is the assumption that edge routers are not
equally correlated. Instead, edge routers topologically close to each other have high mutual
correlations. Based on this assumption, we cluster edge routers according to the topology
of the AS to form a tree structure, as depicted in Figure 4-5. Without loss of generality,
Figure 4-5 shows a quad-tree structure, i.e., each node, except the leaf nodes, has four children.

To facilitate further discussion, each node in the HMT is assigned an integer number,
beginning with 0, from top to bottom. That is, node 0 is always a root node.² Table 4-2
lists the notation used in the rest of this chapter for the HMT.
In the HMT, each leaf node stands for an edge router. Zero-padded virtual edge
routers are introduced when the number of edge routers is not a power of B. The states of
these virtual nodes are always normal, and their features are always 0. Non-leaf

² An HMT might have multiple roots, depending on the number of edge routers and the number of levels.
Table 4-2: Notation for the hidden Markov tree model

  Notation  Description
  Ωi        The random variable representing the state of node i.
  Φi        The random variable/vector representing the feature(s) measured at node i.
  ~Φ_T      {Φi; i ∈ T}, where T is a subtree of the HMT.
  L         The number of levels of the HMT.
  Ξ         The set of all nodes in the HMT.
  Ξl        The set of nodes at level l, l ∈ ZL. In particular, Ξ0 is the set of
            root nodes and Ξ_{L−1} the set of leaf nodes.
  B         The number of children of each node, except the leaves; e.g., B = 4
            for the quad-HMT illustrated in Figure 4-5.
  ρ(i)      The parent node of node i, where i ∉ Ξ0.
  ν(i)      The set of children nodes of node i, where i ∉ Ξ_{L−1}.
  T^i       The set of ancestor nodes of node i (including node i), where i ∉ Ξ0.
  R(i)      The root node of the subtree containing node i, where i ∈ Ξ.
  Ti        The subtree whose root is node i, where i ∈ Ξ.
  Ti\j      Ti \ Tj.
  T\i       T_{R(i)} \ Ti.
nodes represent clusters of edge routers. The features of the nodes are defined in Equation (4–16):

    Φi = the features measured at the corresponding edge router,  i ∈ Ξ_{L−1} (i.e., leaf nodes);
    Φi = (1/B) Σ_{j∈ν(i)} Φj,                                     i ∉ Ξ_{L−1} (i.e., non-leaf nodes).      (4–16)

Note that only the features of leaf nodes have physical meaning, i.e., the features measured
at the corresponding edge routers. The features of a non-leaf node are defined as the average of the
features of its children.
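The feature construction of Equation (4–16) can be sketched as a bottom-up pass over the padded tree (Python; a binary tree with three hypothetical leaf features is used for brevity, whereas Figure 4-5 shows B = 4):

```python
def build_tree_features(leaf_features, B=2):
    """Zero-pad the leaves (virtual edge routers) up to a power of B, then compute
    each non-leaf feature as the average of its B children (Eq. 4-16).
    Returns the levels from the root (index 0) down to the leaves."""
    n = max(1, len(leaf_features))
    size = 1
    while size < n:          # smallest power of B that holds all real leaves
        size *= B
    level = list(leaf_features) + [0.0] * (size - len(leaf_features))
    levels = [level]
    while len(level) > 1:
        level = [sum(level[j:j + B]) / B for j in range(0, len(level), B)]
        levels.append(level)
    return levels[::-1]

# Three edge routers; one zero-padded virtual router completes the binary tree.
print(build_tree_features([4.0, 2.0, 6.0], B=2))  # -> [[3.0], [3.0, 3.0], [4.0, 2.0, 6.0, 0.0]]
```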
We make two assumptions for the HMT:

1. The state of a node depends only on the state of its parent, i.e.,

       P(Ωi | Ωj, j ∈ Ξ, j ≠ i) = P(Ωi | Ωρ(i)),  i ∈ Ξ \ Ξ0.      (4–17)

2. The features measured at a node depend only on the state of that node, i.e.,

       p(φi | ~Ω) = p(φi | Ωi),  i ∈ Ξ.      (4–18)
Similar to Sections 4.2.1 and 4.2.2, we employ the maximum gain criterion (see
Equation (4–9)) to estimate the node states, i.e.,

    ûi = arg max_~u χ(~u) P(~Ω = ~u | ~φ)
       = arg max_{ui′; i′∈T^i} [ ∏_{i′∈T^i\R(i)} χ(ui′) P(Ωi′ = ui′ | Ωρ(i′) = uρ(i′), ~φ) ]
         · χ(u_{R(i)}) P(Ω_{R(i)} = u_{R(i)} | ~φ),      (4–19)

for all i ∈ Ξ_{L−1}. Applying the Viterbi algorithm to Equation (4–19) reduces the
computational complexity from O(2^κ) (see Section 4.2.2) to O(Bκ). This is the major
advantage of introducing the HMT model. The details are given in Section 4.4.
Solving Equation (4–19) requires knowledge of P(Ωi | Ωρ(i), ~φ) for all i ∈ Ξ \ Ξ0 and
P(Ωi = ui | ~φ) for all i ∈ Ξ0. By the Bayes formula [30],

    P(Ωi = ui | Ωρ(i) = uρ(i), ~φ) = P(Ωi = ui, Ωρ(i) = uρ(i) | ~φ) / P(Ωρ(i) = uρ(i) | ~φ),      (4–20)

for all i ∈ Ξ \ Ξ0. Therefore, solving Equation (4–19) translates to estimating

    P(Ωi, Ωρ(i) | ~φ),  for all i ∈ Ξ \ Ξ0,      (4–21)

and

    P(Ωρ(i) | ~φ),  for all i ∈ Ξ.      (4–22)

Estimating Equations (4–21) and (4–22) in closed form is difficult. We propose a
belief propagation (BP) algorithm, described in Section 4.3, to estimate them efficiently,
given knowledge of

• the prior probabilities, P(Ωi = 0), i ∈ Ξ0;

• the likelihood, p(φi|Ωi), i ∈ Ξ;

• the transition probabilities, P(Ωi|Ωρ(i)), i ∈ Ξ \ Ξ0.
Usually it is difficult to estimate the prior probabilities. In this dissertation, we simply assume
P(Ωi = 0) = P(Ωi = 1) = 1/2, i ∈ Ξ0. That is, the state of a root node is equally likely to
be normal or abnormal. Other parameters, such as the likelihood and transition probabilities,
are estimated from training data. This is covered in Section 4.3. Classification
using the maximum gain criterion is then described in Section 4.4.
4.3 Estimation of HMT Parameters
In this section, we describe the estimation of the HMT parameters. It is organized as follows.
Section 4.3.1 describes the estimation of the likelihood p(φi|Ωi), i ∈ Ξ. The estimation of the
transition probabilities P(Ωi|Ωρ(i)), i ∈ Ξ \ Ξ0, is presented in Section 4.3.2.
4.3.1 Likelihood Estimation
For the purpose of likelihood estimation, we collect two sets of training data:

• the set of features sampled in normal states, {~φ_0^(k); k ∈ {1, . . . , K0}}, where K0 is the number of normal samples, ~φ_0^(k) = {φ_{i,0}^(k); i ∈ Ξ}, and φ_{i,0}^(k) denotes the kth feature measured at node i;

• the set of features sampled in abnormal states, {~φ_1^(k); k ∈ {1, . . . , K1}}, where K1 is the number of abnormal samples, ~φ_1^(k) = {φ_{i,1}^(k); i ∈ Ξ}, and φ_{i,1}^(k) denotes the kth feature measured at node i.
Gaussian mixture model. In order to estimate the likelihood effectively, we assume
that the random variables/vectors Φi|Ωi, i ∈ Ξ, follow a statistical distribution model.
Likelihood estimation then translates to model parameter estimation. We establish the
statistical model in the following.
Because of its good properties, the Gaussian (normal) distribution [30] is widely employed
in many applications. The pdf of a d-dimensional multivariate Gaussian distribution with
mean vector µ and variance matrix Σ is

    N(x; µ, Σ) ≜ 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp[ −(1/2) (x − µ)^T Σ^{−1} (x − µ) ].      (4–23)
Figure 4-6 plots the pdf of the univariate Gaussian distribution N (x; 0, 1). It is observed
that Gaussian distribution is a unimodal distribution[30], i.e., its pdf only has one peak.
Figure 4-6. Probability density function of the univariate Gaussian distribution N(x; 0, 1).
Figure 4-7. Histogram of the two-way matching features measured at a real network during network anomalies.
However, the empirical distribution of Φi|Ωi may have multiple peaks. For
example, Figure 4-7 shows the histogram of the two-way matching features measured
in a real network during DDoS attacks; it has two peaks. Hence, the unimodal Gaussian
distribution is not suitable. In this dissertation, we adopt the Gaussian mixture model (GMM)
to model the likelihood distribution.
The motivation for the GMM is the following. Suppose we first randomly pick a
number g from the set {1, 2, . . . , G} with probability P(G = g) = π(g), where

    Σ_{g=1}^{G} π(g) = 1.      (4–24)

Next, we generate a random variable X from the Gaussian distribution with pdf
N(x; µ(g), Σ(g)). Then the random variable X follows the G-state GMM, whose pdf is

    p_X(x) = Σ_{g=1}^{G} N(x; µ(g), Σ(g)) π(g),      (4–25)

where π(g), µ(g), and Σ(g) are known as the prior probability, mean vector, and variance
matrix of the gth Gaussian distribution in the GMM, respectively. A G-state GMM has G
modes. Therefore, it is suitable for modeling distributions with multiple modes.
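Equation (4–25) in the univariate case can be evaluated directly (Python sketch; the mixture weights, means, and variances are hypothetical, loosely mimicking a two-peak histogram like Figure 4-7):

```python
import math

def normal_pdf(x, mu, var):
    """Univariate Gaussian density with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def gmm_pdf(x, pis, mus, vars_):
    """pdf of a G-state univariate GMM (Eq. 4-25): sum_g pi(g) N(x; mu(g), var(g))."""
    assert abs(sum(pis) - 1.0) < 1e-9  # Eq. (4-24): mixture weights sum to 1
    return sum(p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, vars_))

# A bimodal mixture with peaks near 50 and 400 (all values hypothetical).
pis, mus, vars_ = [0.6, 0.4], [50.0, 400.0], [900.0, 2500.0]
print(round(gmm_pdf(50.0, pis, mus, vars_), 6))
```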
In this dissertation, we assume that Φi|Ωi, i ∈ Ξ, follows a G-state GMM. That is, the pdf of the
likelihood of node i is

    p(φi | Ωi = u) = Σ_{g=1}^{G} π_{i,u}(g) N(φi; µ_{i,u}(g), Σ_{i,u}(g)),      (4–26)

for all i ∈ Ξ and u ∈ {0, 1}, where π_{i,u}(g), µ_{i,u}(g), and Σ_{i,u}(g) are the prior probability, mean
vector, and variance matrix of the gth Gaussian distribution at node i in state u,
respectively.
Next, we present schemes to estimate the GMM parameters.
GMM parameter estimation. Based on Equation (4–26), likelihood estimation
translates to estimating the GMM parameters π_{i,u}(g), µ_{i,u}(g), and Σ_{i,u}(g) for all i ∈ Ξ,
u ∈ {0, 1}, and g ∈ {1, . . . , G}, subject to the constraint

    Σ_{g=1}^{G} π_{i,u}(g) = 1.      (4–27)

Denote by π̂_{i,u}(g), µ̂_{i,u}(g), and Σ̂_{i,u}(g) the estimates of π_{i,u}(g), µ_{i,u}(g), and Σ_{i,u}(g),
respectively.
The most commonly used approach to estimating model parameters is the maximum-likelihood
(ML) method. Given the training features at node i with network state u, φ_{i,u}^(1), . . . , φ_{i,u}^(Ku),
the ML method chooses the parameters that maximize

    ∏_{k=1}^{Ku} p(φ_{i,u}^(k) | Ωi = u),      (4–28)

where p(·|Ωi = u) is given in Equation (4–26). Unfortunately, Nechyba [32] showed
that the ML method for a G-state GMM with G > 1 has no closed-form solution. In
addition, a G-state GMM has a 3G-dimensional continuous parameter space; an exhaustive
numerical search for the ML estimate in such a parameter space is computationally
intractable.
1. Input: features φ_{i,u}^(k), k ∈ {1, . . . , Ku}; initial values π_{i,u}^(0)(g), µ_{i,u}^(0)(g), and Σ_{i,u}^(0)(g), g ∈ {1, . . . , G}.
2. Output: π̂_{i,u}(g), µ̂_{i,u}(g), and Σ̂_{i,u}(g), g ∈ {1, . . . , G}.
3. j = 0.
4. repeat
5.   ϑ_{i,u}^(j)(g, φ) ≜ N(φ; µ_{i,u}^(j)(g), Σ_{i,u}^(j)(g)) π_{i,u}^(j)(g) / Σ_{g′=1}^{G} N(φ; µ_{i,u}^(j)(g′), Σ_{i,u}^(j)(g′)) π_{i,u}^(j)(g′)
6.   π_{i,u}^(j+1)(g) = (1/Ku) Σ_{k=1}^{Ku} ϑ_{i,u}^(j)(g, φ_{i,u}^(k))
7.   µ_{i,u}^(j+1)(g) = Σ_{k=1}^{Ku} ϑ_{i,u}^(j)(g, φ_{i,u}^(k)) φ_{i,u}^(k) / Σ_{k=1}^{Ku} ϑ_{i,u}^(j)(g, φ_{i,u}^(k))
8.   Σ_{i,u}^(j+1)(g) = Σ_{k=1}^{Ku} ϑ_{i,u}^(j)(g, φ_{i,u}^(k)) ‖φ_{i,u}^(k) − µ_{i,u}^(j+1)(g)‖² / Σ_{k=1}^{Ku} ϑ_{i,u}^(j)(g, φ_{i,u}^(k))
9.   j ← j + 1
10. until convergence
11. π̂_{i,u}(g) = π_{i,u}^(j)(g), µ̂_{i,u}(g) = µ_{i,u}^(j)(g), Σ̂_{i,u}(g) = Σ_{i,u}^(j)(g), for all g ∈ {1, . . . , G}.

Figure 4-8. The EM algorithm for estimating p(φi|Ωi = u), i ∈ Ξ, u ∈ {0, 1}.
A practical solution to this issue is the expectation-maximization (EM) algorithm [29, 30].
Nechyba [32] derived the EM algorithm for the GMM in detail; Figure 4-8 lists the
algorithm.
The EM algorithm requires initial values for the parameters, denoted by π_{i,u}^(0)(g),
µ_{i,u}^(0)(g), and Σ_{i,u}^(0)(g) in Figure 4-8. At each iteration j, the EM algorithm uses the parameters
estimated at iteration j − 1 to calculate new estimates. Although both the EM and ML
methods search the parameter space, EM does so more effectively: it is proven that after each
iteration, the EM algorithm is guaranteed to produce parameter estimates that increase
Equation (4–28). As a result, the EM algorithm converges much faster than a numerical ML
method.
However, the disadvantage of the EM algorithm is that it converges to a local maximum
rather than the global one. Specifically, the initial parameter values determine the local
maximum to which the EM algorithm converges. In practice, we have prior knowledge of the
network features, which helps in choosing the initial parameter values.
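The updates in Figure 4-8 can be sketched for a univariate two-state GMM as follows (Python; the data set and initial values are hypothetical, chosen so that the two components are well separated):

```python
import math

def em_gmm(data, pis, mus, vars_, iters=50):
    """EM updates of Figure 4-8 for a univariate G-state GMM."""
    G = len(pis)
    for _ in range(iters):
        # E-step (line 5): responsibilities theta(g, x) for every sample
        resp = []
        for x in data:
            w = [pis[g] * math.exp(-0.5 * (x - mus[g]) ** 2 / vars_[g])
                 / math.sqrt(2 * math.pi * vars_[g]) for g in range(G)]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M-step (lines 6-8): re-estimate weights, means, and variances
        for g in range(G):
            r = [resp[k][g] for k in range(len(data))]
            pis[g] = sum(r) / len(data)
            mus[g] = sum(rk * x for rk, x in zip(r, data)) / sum(r)
            vars_[g] = sum(rk * (x - mus[g]) ** 2 for rk, x in zip(r, data)) / sum(r)
    return pis, mus, vars_

# Two hypothetical clusters of feature samples, near 0 and near 10.
data = [-0.5, 0.0, 0.5, 9.5, 10.0, 10.5]
pis, mus, vars_ = em_gmm(data, [0.5, 0.5], [1.0, 8.0], [4.0, 4.0])
print([round(m, 2) for m in mus])  # means converge near 0 and 10
```

A fixed iteration count is used for brevity; in practice the loop would stop when the parameter change falls below a tolerance, matching the "until converge" test in Figure 4-8.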
So far, we have presented schemes to estimate the likelihood of the HMT. In the next section, we
estimate the transition probabilities.
4.3.2 Transition Probability Estimation
In this section, we estimate P(Ωi|Ωρ(i)), i ∈ Ξ \ Ξ0. Since a closed-form representation of
the transition probabilities is not available, we estimate them iteratively as well.
Denote by {~φ^(k); k ∈ {1, . . . , K}} the set of training features for transition probability
estimation, where ~φ^(k) = {φ_i^(k); i ∈ Ξ}. Figures 4-9 and 4-10 show the pseudo-code for
transition probability estimation; we explain them in the next two subsections.
Iteratively estimate transition probabilities. Figure 4-9 shows the pseudo-code
for estimating the transition probabilities. The function TransProbEstimate takes two sets of
arguments:

1. the likelihood estimated in Section 4.3.1, {p(φi|Ωi); i ∈ Ξ};
2. the training features, {~φ^(k); k ∈ {1, . . . , K}}.

It returns the estimate of the transition probabilities, i.e.,

    P̂(Ωi|Ωρ(i)), for all i ∈ Ξ \ Ξ0.
Before the iterations, we set the initial transition probabilities to 1/2 at line 5 of
Figure 4-9. This is equivalent to assuming the normal state and abnormal state to be initially
1. function TransProbEstimate(. . .)
2. Argument 1: likelihood, {p(φi|Ωi); i ∈ Ξ}.
3. Argument 2: training data, {~φ^(k); k ∈ {1, . . . , K}}.
4. Return: transition probability estimates, P̂(Ωi|Ωρ(i)), i ∈ Ξ \ Ξ0.
5. P^(0)(Ωi = u|Ωρ(i) = u′) = 1/2, for all i ∈ Ξ \ Ξ0 and u, u′ ∈ {0, 1}.
6. j = 0.
7. repeat
8.   for k ← 1 to K
9.     {P^(j+1)(Ωρ(i)|~φ^(k)), P^(j+1)(Ωi, Ωρ(i)|~φ^(k)); i ∈ Ξ \ Ξ0}
         = BP({P^(j)(Ωi|Ωρ(i)); i ∈ Ξ \ Ξ0}, {p(φi|Ωi); i ∈ Ξ}, ~φ^(k))      (4–29)
10.  end for
11.  for all i ∈ Ξ \ Ξ0
12.    P^(j+1)(Ωi|Ωρ(i)) = (1/K) Σ_{k=1}^{K} P^(j+1)(Ωi, Ωρ(i)|~φ^(k)) / P^(j+1)(Ωρ(i)|~φ^(k))
                        = (1/K) Σ_{k=1}^{K} P^(j+1)(Ωi|Ωρ(i), ~φ^(k))      (4–30)
13.  end for
14.  j ← j + 1
15. until convergence
16. P̂(Ωi|Ωρ(i)) = P^(j)(Ωi|Ωρ(i)).

Figure 4-9. Iteratively estimate transition probabilities.
equally likely. Then, at each iteration, we update the estimate of the transition probabilities
until it converges. The update procedure is as follows.

First, we iterate over the training feature set. For each feature, we use the BP algorithm (see
Figure 4-10) to estimate the posterior probabilities given that feature. The details of the
BP algorithm are discussed in the next subsection.
Three sets of arguments are passed to the BP algorithm:

1. the estimate of the transition probabilities obtained at the previous iteration, {P^(j)(Ωi|Ωρ(i)); i ∈ Ξ \ Ξ0};
2. the likelihood, {p(φi|Ωi); i ∈ Ξ}, which is an argument passed to function TransProbEstimate;
1. function BP(. . .)
2. Argument 1: transition probabilities, {P(Ωi|Ωρ(i)); i ∈ Ξ \ Ξ0}.
3. Argument 2: likelihood, {p(φi|Ωi); i ∈ Ξ}.
4. Argument 3: training feature, ~φ.
5. Return: posterior probabilities, {P(Ωρ(i)|~φ), P(Ωi, Ωρ(i)|~φ); i ∈ Ξ \ Ξ0}.
6. Υi(0) = Υi(1) = 1/2, for all i ∈ Ξ0 (i.e., roots).
7. υi(u) = p(φi|Ωi = u), for all i ∈ Ξ_{L−1} (i.e., leaves) and u ∈ {0, 1}.
8. Top-down pass, i.e., from root to leaf:
9. for l ← 1, . . . , L − 1
10.   for all i ∈ Ξl and u ∈ {0, 1}, let
11.     Υi(u) = Σ_{u′∈{0,1}} P(Ωi = u|Ωρ(i) = u′) Υρ(i)(u′) p(φρ(i)|Ωρ(i) = u′)      (4–31)
12.   end for
13. end for
14. Bottom-up pass, i.e., from leaf to root:
15. for l ← L − 2, . . . , 0
16.   for all i ∈ Ξl and u ∈ {0, 1}, let
17.     υi(u) = p(φi|Ωi = u) ∏_{j∈ν(i)} [ Σ_{u′∈{0,1}} P(Ωj = u′|Ωi = u) υj(u′) ]      (4–32)
18.   end for
19. end for
20.   P(Ωρ(i) = u|~φ) = Υi(u) υi(u) / Σ_{u′} Υi(u′) υi(u′)      (4–33)
21.   P(Ωi = u, Ωρ(i) = u′|~φ)
        = υi(u) υρ(i)(u′) P(Ωi = u|Ωρ(i) = u′) Υρ(i)(u′)
          / ( [Σ_{u″} Υi(u″) υi(u″)] [Σ_{u″} P(Ωi = u″|Ωρ(i) = u′) υi(u″)] )      (4–34)

Figure 4-10. Belief propagation algorithm.
3. the current training feature, ~φ^(k).

It returns the estimates of the posterior probabilities,

    {P^(j+1)(Ωρ(i)|~φ^(k)), P^(j+1)(Ωi, Ωρ(i)|~φ^(k)); i ∈ Ξ \ Ξ0}.
With the posterior probabilities obtained through the BP algorithm, we update the
estimates of the transition probabilities by Equation (4–30) for all i ∈ Ξ \ Ξ0 and proceed to the
next iteration. When the estimates converge, the iteration stops and function TransProbEsti-
mate returns the estimates obtained at the last iteration.
The validity of Equation (4–30) is shown as follows. For all i ∈ Ξ \ Ξ0,

    P(Ωi|Ωρ(i)) = ∫ P(Ωi|Ωρ(i), ~Φ = ~φ) p(~φ) d~φ = E_~Φ[ P(Ωi|Ωρ(i), ~Φ) ],      (4–35)

where E[·] denotes statistical expectation and the subscript indicates the random
variable over which the expectation is taken. Because the sample average is an
unbiased estimate of the statistical expectation [30], we estimate E_~Φ[ P(Ωi|Ωρ(i), ~Φ) ] by

    (1/K) Σ_{k=1}^{K} P(Ωi|Ωρ(i), ~φ^(k)) = (1/K) Σ_{k=1}^{K} P(Ωi, Ωρ(i)|~φ^(k)) / P(Ωρ(i)|~φ^(k)).      (4–36)

Combining Equations (4–35) and (4–36), we obtain Equation (4–30).
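The sample-average estimate of Equation (4–36) is a one-liner; for a single (Ωi, Ωρ(i)) table entry it might look like this (Python; the per-sample posterior values are hypothetical):

```python
def transition_estimate(joint_posts, parent_posts):
    """Eq. (4-36): average over training samples of
    P(child, parent | features) / P(parent | features)."""
    return sum(j / p for j, p in zip(joint_posts, parent_posts)) / len(joint_posts)

# Hypothetical BP outputs for three training samples, for one table entry:
est = transition_estimate([0.40, 0.36, 0.44], [0.50, 0.45, 0.55])
print(round(est, 3))  # -> 0.8
```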
Next, we discuss the belief propagation algorithm, which is called by function
TransProbEstimate.
Belief propagation algorithm. The BP algorithm [33–35], also known as the
sum-product algorithm, e.g., [36–39], is an important method for computing approximate
marginal distributions. In this dissertation, we apply the BP algorithm to estimating the posterior
probabilities (Figure 4-10).
Function BP takes three sets of arguments:

1. the transition probabilities, {P(Ωi|Ωρ(i)); i ∈ Ξ \ Ξ0};
2. the likelihood, {p(φi|Ωi); i ∈ Ξ};
3. the training feature, ~φ,

and returns the estimates of the posterior probabilities,

    {P(Ωρ(i)|~φ), P(Ωi, Ωρ(i)|~φ); i ∈ Ξ \ Ξ0}.
In function BP, we define two sets of transitory variables for convenience:

    Υi(u) ≜ p(Ωi = u, ~φ_{T\i}),      (4–37)
    υi(u) ≜ p(~φ_{Ti} | Ωi = u),      (4–38)

where u ∈ {0, 1} and i ∈ Ξ.
Function BP first initializes the variables Υi(u) for the root nodes and υi(u) for the leaves.

• When i ∈ Ξ0, i.e., for root nodes, T\i = ∅, so that Υi(u) = P(Ωi = u). Hence, we let Υi(0) = Υi(1) = 1/2 for all root nodes i (see line 6 of Figure 4-10), because we assume root nodes are equally likely to be in the normal or abnormal state.

• When i ∈ Ξ_{L−1}, i.e., for leaf nodes, ~φ_{Ti} = φi, therefore υi(u) = p(φi|Ωi = u), u ∈ {0, 1} (see line 7 of Figure 4-10).

It then propagates the belief from the roots and leaves to the other nodes in a top-down pass
and a bottom-up pass, respectively.

• During the top-down pass, function BP iterates from root to leaf. At each level l, we update the transitory variables Υi(u) by Equation (4–31), which is proven in Appendix A.1. Note that Υρ(i)(u′) in Equation (4–31) is obtained in the previous iteration, i.e., the iteration at level l − 1.

• During the bottom-up pass, function BP iterates from leaf to root. At each level l, we update the transitory variables υi(u) by Equation (4–32), which is proven in Appendix A.2. Also note that υj(u′) is obtained in the previous iteration, i.e., the iteration at level l + 1.
Finally, we obtain the posterior probabilities by Equations (4–33) and (4–34), which
are proven in Appendices A.3 and A.4, respectively. The estimated posterior
probabilities are used in function TransProbEstimate to update the estimates of the transition
probabilities (see Figure 4-9).
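To see the two passes and Equation (4–33) in action, here is a direct transcription for a minimal two-level tree (one root, two leaves, binary states) in Python. All numbers (transition table, likelihood values) are hypothetical, and the sketch follows the message updates of Figure 4-10 literally rather than implementing general belief propagation:

```python
from math import prod

def bp_two_level(prior_root, trans, lik_root, lik_leaves):
    """Belief propagation on a two-level HMT (one root, several leaves, binary states).
    trans[up][u] = P(child = u | parent = up); lik[u] = p(feature | state = u).
    Returns posterior state distributions: root first, then one per leaf."""
    # Top-down pass (Eq. 4-31): Upsilon_leaf(u) = sum_up trans[up][u]*Upsilon_root(up)*lik_root(up)
    ups_root = list(prior_root)                    # line 6: root Upsilon starts at the prior
    ups_leaf = [sum(trans[up][u] * ups_root[up] * lik_root[up] for up in (0, 1))
                for u in (0, 1)]                   # identical for every leaf in this tree
    # Bottom-up pass (Eq. 4-32): upsilon_root(u) = lik_root(u)*prod_j sum_up trans[u][up]*lik_j(up)
    ups_small_root = [lik_root[u] * prod(sum(trans[u][up] * lj[up] for up in (0, 1))
                                         for lj in lik_leaves) for u in (0, 1)]
    posteriors = []
    for big, small in [(ups_root, ups_small_root)] + [(ups_leaf, lj) for lj in lik_leaves]:
        z = sum(big[u] * small[u] for u in (0, 1))  # Eq. (4-33) normalization
        posteriors.append([big[u] * small[u] / z for u in (0, 1)])
    return posteriors

# Both leaves report features far more likely under the abnormal state (1).
trans = [[0.9, 0.1], [0.2, 0.8]]                   # sticky parent-to-child transitions
post = bp_two_level([0.5, 0.5], trans, [0.5, 0.5], [[0.1, 0.9], [0.2, 0.8]])
print(round(post[0][1], 3))                        # root's posterior P(abnormal | all features)
```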
So far, we have established the HMT model and described approaches to estimating
its parameters from training data. Next, we present network anomaly detection
using the fully determined HMT model.
4.4 Network Anomaly Detection Using HMT
In this section, we present network anomaly detection using the HMT model. In terms
of pattern classification, this is a decoding problem. That is, given an
observation sequence, i.e., the extracted features ~φ = {φi; i ∈ Ξ}, and an HMT model defined by

• the prior probabilities, P(Ωi), i ∈ Ξ0;

• the likelihood, p(φi|Ωi), i ∈ Ξ;

• the transition probabilities, P(Ωi|Ωρ(i)), i ∈ Ξ \ Ξ0,

we need to compute the “best” state combination ~Ω = {Ωi; i ∈ Ξ}. Here, “best”
is in the sense of the maximum gain criterion, as illustrated in Equation (4–19). We
rewrite the criterion as Equation (4–39):

    ûi = arg max_{ui′; i′∈T^i} [ ∏_{i′∈T^i\R(i)} χ(ui′) P(Ωi′ = ui′ | Ωρ(i′) = uρ(i′), ~φ) ]
         · χ(u_{R(i)}) P(Ω_{R(i)} = u_{R(i)} | ~φ),      (4–39)
for all i ∈ Ξ. Function ViterbiDecodeHMT, illustrated in Figure 4-11, shows the
pseudo-code for the classification algorithm.

Function ViterbiDecodeHMT takes three arguments. The first two, the
transition probabilities and the likelihood, are estimated during the training phase. The
last is the extracted features, based on which we perform anomaly detection. The function
returns the estimates of the node states. Among them, we are interested only in the states of the leaf
nodes, which indicate whether an edge router is in the abnormal state. Next, we explain the
algorithm in detail.
1. function ViterbiDecodeHMT(. . .)
2. Argument 1: transition probabilities, {P(Ωi|Ωρ(i)); i ∈ Ξ \ Ξ0}.
3. Argument 2: likelihood, {p(φi|Ωi); i ∈ Ξ}.
4. Argument 3: feature, ~φ.
5. Return: estimated node states, {ûi; i ∈ Ξ}.
6.   {P(Ωρ(i)|~φ), P(Ωi, Ωρ(i)|~φ); i ∈ Ξ \ Ξ0}
       = BP({P(Ωi|Ωρ(i)); i ∈ Ξ \ Ξ0}, {p(φi|Ωi); i ∈ Ξ}, ~φ)      (4–40)
7.   P(Ωi|Ωρ(i), ~φ) = P(Ωi, Ωρ(i)|~φ) / P(Ωρ(i)|~φ)      (4–41)
8. for all i ∈ Ξ0
9.   ûi = arg max_{ui} P(Ωi = ui|~φ)
10. end for
11. for l ← 1, . . . , L − 1, for all i ∈ Ξl
12.   ûi = arg max_{ui} χ(ui) P(Ωi = ui|Ωρ(i) = ûρ(i), ~φ)
         · [ ∏_{i′∈T^{ρ(i)}\R(i)} χ(ûi′) P(Ωi′ = ûi′|Ωρ(i′) = ûρ(i′), ~φ) ]
         · χ(û_{R(i)}) P(Ω_{R(i)} = û_{R(i)}|~φ)      (4–42)
13. end for

Figure 4-11. Viterbi algorithm for HMT decoding.
By the Bayes formula, the terms in Equation (4–39) can be computed as

    P(Ωi′ = ui′ | Ωρ(i′) = uρ(i′), ~φ) = P(Ωi′ = ui′, Ωρ(i′) = uρ(i′) | ~φ) / P(Ωρ(i′) = uρ(i′) | ~φ),      (4–43)

where i′ ∈ Ξ \ Ξ0. Therefore, the posterior probabilities are again required. We employ the
BP algorithm (see Equations (4–40) and (4–41) in Figure 4-11) to solve Equation (4–43) in
a way similar to the HMT transition probability estimation (see Figure 4-10).
Obtaining solutions to Equation (4–43) for all nodes, we can solve Equation (4–39). A brute-force solution to Equation (4–39) is to exhaustively compute
$$\Big[\prod_{i' \in T^i \setminus R(i)} \chi(u_{i'})\, P(\Omega_{i'} = u_{i'} \mid \Omega_{\rho(i')} = u_{\rho(i')}, \vec\phi)\Big] \chi(u_{R(i)})\, P\big(\Omega_{R(i)} = u_{R(i)} \mid \vec\phi\big), \qquad (4\text{--}44)$$
for all $i \in \Xi \setminus \Xi_0$ and all $\vec u$ in the space of $\vec\Omega$, and select the "best" one. An HMT with $L$ levels modeling an AS with $\kappa$ edge routers has
$$B^{\lceil \log_B \kappa \rceil}\, \frac{1 - B^{-L}}{1 - B^{-1}}$$
nodes, each of which has two states, normal and abnormal. The $\vec\Omega$ space then has
$$2^{\,B^{\lceil \log_B \kappa \rceil} \frac{1 - B^{-L}}{1 - B^{-1}}}$$
possible values. The computational complexity of the brute-force method is $O\big(2^{B^{\lceil \log_B \kappa \rceil}}\big)$, even worse than that of the dependent model (see Figure 4-4), whose complexity is $O(2^\kappa)$.
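As a quick sanity check, the node-count formula above can be evaluated with the ceiling computed by integer arithmetic; the function name and the binary example below are ours, not the dissertation's.

```python
def hmt_node_count(kappa, B, L):
    """Nodes in a B-ary HMT with L levels whose leaf level covers kappa
    edge routers: B^ceil(log_B kappa) * (1 - B^-L) / (1 - B^-1)."""
    width = 1
    while width < kappa:      # width becomes B^ceil(log_B kappa)
        width *= B
    return width * (1 - B ** (-L)) / (1 - 1 / B)

# A binary HMT (B = 2) with L = 5 levels over kappa = 16 edge routers
# has 16 + 8 + 4 + 2 + 1 = 31 nodes.
```

This matches the experimental setup later in this chapter (16 edge routers, a binary tree).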
In this work, we apply the Viterbi algorithm [40–42] to solve Equation (4–39) in an iterative manner, reducing the computational complexity to $O(B\kappa)$.

The idea behind the Viterbi algorithm is the following. It iterates from the top level of an HMT to the bottom level. At each iteration, it estimates the node states at that level in a way that, when combined with the states estimated at the upper levels, "best" explains the observed features. That is, at each iteration, the Viterbi algorithm always selects the local maximum. Although it is not guaranteed to find the globally optimal solution to Equation (4–39), the Viterbi algorithm is efficient and performs well empirically.
The computational complexity of the Viterbi algorithm is $O(B\kappa)$, much better than that of the dependent model illustrated in Figure 4-4, whose complexity is $O(2^\kappa)$. The improvement results from the fact that the Viterbi algorithm does not exhaustively test all possible node-state combinations. Instead, we decompose the decoding problem into multiple stages, each of which decodes the node states at one level of the HMT. At level $l$, we estimate the node states of level $l$, i.e., $\{\Omega_i;\, i \in \Xi_l\}$, based on the results obtained during the previous stages, i.e., $\{\Omega_i;\, i \in \Xi_0, \ldots, \Xi_{l-1}\}$. In this procedure, each node is accessed only twice. Hence the complexity is linear in the number of nodes, which is
$$B^{\lceil \log_B \kappa \rceil}\, \frac{1 - B^{-L}}{1 - B^{-1}} < B\kappa\, \frac{1 - B^{-L}}{1 - B^{-1}} \approx B\kappa.$$
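The top-down pass can be sketched as follows. This is a simplified greedy variant in which each node's state is chosen given its already-decoded parent; the data layout (`levels`, `post`, `cond`) is our assumption, not the dissertation's exact pseudo-code.

```python
import numpy as np

def viterbi_decode_hmt(levels, post, cond):
    """Greedy top-down HMT decoding sketch.

    levels : list of lists of node ids; levels[0] is the top (root) level
    post   : node -> array of posteriors P(state | features), root level only
    cond   : node -> (parent, M) with M[u_parent, u_child] =
             P(child state | parent state, features), for non-root nodes
    Returns node -> decoded state (0 = normal, 1 = abnormal).
    """
    u = {}
    for i in levels[0]:                      # root level: maximum posterior
        u[i] = int(np.argmax(post[i]))
    for level in levels[1:]:                 # lower levels, top to bottom
        for i in level:
            parent, m = cond[i]
            u[i] = int(np.argmax(m[u[parent]]))  # greedy given decoded parent
    return u
```

Each node is touched once during decoding (plus once during the BP pass that produced the probabilities), consistent with the linear complexity argued above.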
So far, we have described the HMT model, including its parameter training and classification approaches. Next, we show the simulation results of applying the HMT model to network anomaly detection.
4.5 Simulation Results
In this section, we evaluate the performance of the proposed schemes through
simulation.
4.5.1 Experiment Setting
In our study, we develop a testbed to 1) extract various feature information, and 2) analyze the feature data with our machine learning and CUSUM algorithms. Next,
we describe the setting for networks, traffic traces, and feature extraction used in our
experiments.
Network.
In our experiment, we assume that the ISP network consists of a core AS, a victim
subnet, and 16 edge routers that connect to 16 subnets, as illustrated in Figure 4-12. At
each edge router, two monitors are placed to measure the inbound and outbound traffic
Figure 4-12. Experiment Network
between a subnet and the victim network, respectively. For convenience, we denote a link
as the route between an edge router and the victim subnet.
Traffic. A link may carry normal traffic (called background traffic) or abnormal traffic. For the background traffic, we use the same data set as in Section 3.7.2 [27]. Since we do not have real data traces obtained from 16 different links, we use the real traffic trace measured on one link (between the Internet and Auckland University) on 16 different days to create traffic traces for the 16 links.
For the abnormal traffic, we randomly inject TCP SYN flood attacks into the background trace. Specifically, we generate several attack scenarios. For each scenario,
we randomly select the abnormal links and attack durations. Attack traffic on each link
is generated in the same way as in Section 3.7.2. That is, we randomly insert TCP SYN
packets with random source IP addresses into the background traffic of that link. The
average packet rate of TCP SYN attack traffic on each selected link is 1% of the total
packet rate on the link. For each attack scenario, attacks on each of the selected links are
launched during almost the same period to simulate the synchronized DDoS attacks. Since
the attack traffic on each link is low (just 1%), we effectively simulate low volume attack
traffic.
Features.

To detect distributed TCP SYN attacks, we use the two-way matching features described in Chapter 3, i.e., the number of unmatched inbound SYN packets in one time slot. The parameter setting of the two-way matching feature extraction is the same as in Section 3.7.2. For convenience, we summarize the parameters in Table 4-3.
Table 4-3. Parameter setting of feature extraction for network anomaly detection

    Notation   Value
    R          2480
    η          0.1%
    K          8
    w          8
    γ          10 seconds
    M_v        2^15 bits
For comparison purposes, we also measure the number of SYN packets and the SYN/FIN ratio [11] in a slot.
4.5.2 Performance Comparison
Table 4-4. Performance of different schemes

    Feature          Detection algorithm   Detection probability   False alarm probability
    SYN/FIN ratio    CUSUM                 0.174                   0.129
    SYN              CUSUM                 0.52                    0.129
    SYN              Machine learning      0.656                   0.123
    Unmatched SYN    CUSUM                 0.690                   0.130
    Unmatched SYN    Machine learning      0.973                   0.115
Table 4-4 compares the performance of different schemes. The benchmark is the scheme in [11], i.e., the CUSUM scheme with the SYN/FIN ratio as the feature; for the benchmark scheme, we use the same parameter setting as in [11]. We compare the benchmark with CUSUM and our machine learning algorithm under different features. To make a fair comparison, we make the false alarm probability of each scheme almost the same and compare the detection probabilities. From Table 4-4, it can be seen that the benchmark scheme ('SYN/FIN ratio' + CUSUM) performs very poorly in detecting low-volume DDoS attacks. In contrast, a CUSUM algorithm with the number of SYN packets or the number of unmatched SYN packets as the feature achieves a much higher detection probability. More importantly, our machine learning algorithm significantly outperforms CUSUM given the same feature data, no matter whether the feature is the number of SYN packets or the number of unmatched SYN packets.
[Figure: ROC curves, detection probability vs. false alarm probability, for Threshold (SYN), Machine Learning (SYN), Threshold (UM-SYN), and Machine Learning (UM-SYN).]

Figure 4-13. Performance of threshold-based and machine learning algorithms with different feature data
Figure 4-13 compares the ROC curve of the threshold-based scheme described in
Section 4.1.2 and our machine learning algorithm under two different features, i.e., the
number of SYN packets (denoted by ‘SYN’) and the number of unmatched SYN packets
(denoted by ‘UM-SYN’). We observe that, for the same detection algorithm, using the
number of unmatched SYN packets can significantly improve the ROC performance,
compared to using the number of SYN packets. In other words, given the same false alarm probability, the detection probability is much higher when using the number of unmatched SYN packets as the feature.
Another important observation from Figure 4-13 is that given the same feature data,
our machine learning algorithm can (significantly) improve the ROC, compared to the
threshold-based scheme; e.g., for the same false alarm probability of 0.05, our machine
learning algorithm achieves a detection probability of 0.93, while the threshold-based
scheme only achieves a detection probability of 0.72. This is due to the fact that our
machine learning algorithm exploits the spatial correlation among traffic on multiple links,
while the threshold-based scheme only uses the traffic on one link.
[Figure: ROC curves, detection probability vs. false alarm probability (log scale), for Single CUSUM, Dual CUSUM, Threshold, and Machine Learning.]

Figure 4-14. Performance of four detection algorithms
In Figure 4-14, we compare the ROC performance of four detection algorithms (the
threshold-based, the single-CUSUM, the dual-CUSUM described in Section 4.1.3, and our
machine learning algorithm) under the same feature, i.e., the number of unmatched SYN
packets. For the single-CUSUM and the dual-CUSUM algorithm, the detection delay D is
chosen from 1 to 10 slots and the parameter $a_i$ of link $i$ is determined by
$$a_i = \big(\bar D_{\text{attack}} - \bar D_{\text{normal}}\big) \times \frac{i}{17}, \qquad \forall\, 1 \le i \le 16,$$
where $\bar D_{\text{attack}}$ and $\bar D_{\text{normal}}$ are the average numbers of unmatched inbound SYN packets under attack and normal conditions, respectively.
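For reference, a single-link, one-sided CUSUM test of the kind compared here can be sketched as follows; the reference value `a` and decision threshold `h` are the tuning knobs, and the values in the usage example are hypothetical.

```python
def cusum_alarms(x, a, h):
    """One-sided CUSUM over a feature sequence x (e.g., unmatched inbound
    SYN counts per slot): accumulate deviations above the reference value
    a and raise an alarm whenever the statistic exceeds the threshold h."""
    s, alarms = 0.0, []
    for xt in x:
        s = max(0.0, s + xt - a)   # statistic resets to 0 under normal traffic
        alarms.append(s > h)
    return alarms
```

For example, `cusum_alarms([0, 0, 5, 5, 5], a=1, h=5)` flags only the last two slots, after the accumulated deviation crosses the threshold.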
The ROC performance of our machine learning algorithm (Figure 4-14) is the best among all the algorithms. We also see that the dual-CUSUM outperforms the simple threshold-based algorithm, and that the single-CUSUM algorithm has the worst ROC performance.
4.5.3 Discussion
We would like to point out that, besides detecting network anomalies with low-volume traffic, our machine learning algorithm is also able to detect high-volume anomalies; those results are not shown here due to space limits. The machine learning algorithm is shown to be robust under realistic time-varying traffic patterns such as the Auckland data traffic [27]. We also tested our machine learning algorithms for a large IP address space, i.e., the IP address space can be the whole IP address space of the Internet.
4.6 Summary
In this chapter, we propose a novel machine learning detection algorithm based on Bayesian decision theory and the hidden Markov tree model. The key idea of our algorithm is to exploit the spatial correlation of network anomalies. Our detection scheme has the following nice properties:
• In addition to detecting network anomalies having a high data rate on a single link, our scheme is also capable of accurately detecting attacks having a low data rate on multiple links. This is due to the exploitation of the spatial correlation of network anomalies.

• Our scheme is robust against time-varying traffic patterns, owing to powerful machine learning techniques.

• Our scheme can be deployed in large-scale high-speed networks, thanks to the use of a Bloom filter array to efficiently extract features.
With the proposed techniques, our scheme can effectively detect network anomalies without modifying the existing IP forwarding mechanisms at routers. Our simulation results show that the proposed framework can detect DDoS attacks even if the volume of attack traffic on each link is extremely small (i.e., 1%). In particular, for the same false alarm probability, our scheme achieves a detection probability of 0.97, whereas the existing scheme achieves a detection probability of 0.17, which demonstrates the superior performance of our scheme.
CHAPTER 5
NETWORK CENTRIC TRAFFIC CLASSIFICATION: AN OVERVIEW
In Chapters 5 and 6, we focus on the second part of our research, i.e., network centric traffic classification. This chapter motivates the significance of this problem, points out its challenges, and shows the weaknesses of existing solutions.
5.1 Introduction
The telecom business is rapidly changing. Commoditized below profitable levels, traditional circuit-switched voice service is simply not lucrative anymore. Since 2000, the drop in traditional voice revenue has prompted the large telcos to explore new business opportunities. Services over IP (SoIP) have been identified as the new revenue streams with which to continue growing. Among all SoIP services, VoIP and IPTV are the most attractive, as they are expected to represent the largest source of profits as consumer interest in online voice and video services increases and as broadband deployments proliferate. According to the research firm Point Topic, there were 209.3 million global broadband users at the end of 2005, up 56.2 million from 153.3 million lines on 31 December 2004. As a consequence, the VoIP and IPTV user population is expected to grow dramatically in the next few months. For example, France Telecom announced in July 2006 that the number of its VoIP users had grown 80% in the previous 6 months, to a total of 1.73 million as of June 30th, 2006. Similarly, its IPTV users are expected to grow from 300,000 today to 5 million in the next 2 years.
But to tap the potential profits that SoIP offers, the infrastructure of carrier networks
needs to evolve. Next-generation Networks (NGN) feature the convergence of access
technologies (wireline, wireless, cellular), information services (voice, broadband, data,
content), and devices (consumer electronics, traditional telecom equipment). Such
multi-layered convergence promises reduced costs, greater workforce and consumer
mobility, and exciting new business models. However, the trend toward convergence
creates a strong need for fair methods of efficiently and accurately managing and tracking
the delivery of IP services. As carriers transition to becoming service providers, they begin
to sell and deliver IP services to their customers. Unfortunately, the emergence of a wave of new zero-day voice and video applications over IP (such as Skype, Google Talk (Gtalk), and MSN), the proliferation of new peer-to-peer protocols that now carry voice and video among other applications, and the continuing growth of encryption techniques used to protect data confidentiality lead to tremendous revenue leakage for ISPs, which cannot efficiently detect these new applications and thus cannot take proper actions. The resulting unmanaged commercial traffic adds up to losses of hundreds of millions of dollars annually and poses a solid roadblock to the profitability of ISPs' VoIP and IPTV services.
As a consequence, it is imperative for ISPs to identify robust solutions for detecting
voice and video over IP data-streams. The most common approach for identifying
applications on an IP network is to associate the observed traffic with an application
based on TCP or UDP port numbers [43, 44]. In principle, the TCP and UDP server port numbers can be used to identify the higher-layer application, by simply identifying the server port and mapping this port to an application using the Internet Assigned Numbers Authority (IANA) list of registered ports [45]. However, port-based application classification has limitations due to the emergence of new applications that no longer use fixed, predictable port numbers. For example, non-privileged users often have to use ports above 1024 to circumvent operating system access control restrictions; common applications like FTP allow the negotiation of unknown and unpredictable server ports for the data transfer; and proprietary applications may deliberately try to hide their existence or bypass port-based filters by using standard ports. For example, server port 80 is used by a large variety of non-web applications to circumvent firewalls that do not filter port-80 traffic, and other applications (e.g., Skype) tend to use dynamic ports.
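The port-based approach amounts to a table lookup; a toy sketch follows, where the table is a tiny illustrative subset, not the actual IANA registry.

```python
# Hypothetical mini port-to-application table in the spirit of the IANA list.
KNOWN_PORTS = {21: "ftp-control", 53: "dns", 80: "http", 443: "https"}

def classify_by_port(server_port):
    """Map a flow's server port to an application label, or 'unknown'."""
    return KNOWN_PORTS.get(server_port, "unknown")
```

A Skype flow on a random high port, or a non-web application tunneling over port 80, defeats exactly this lookup, which is the limitation discussed above.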
A more reliable technique involves stateful reconstruction of session and application
information from packet contents [46–48]. Although this avoids reliance on fixed port
numbers, it imposes significant complexity and processing load on the classification device,
which must be kept powerful enough to perform concurrent analysis of a large number
of flows while applying techniques to search very complex protocol signatures that might
require processing of a large chunk of packet payload. The proliferation of proprietary
protocols, coupled with the growing trend in the usage of encryption techniques to ensure
data confidentiality, makes this approach infeasible. For example, Skype does not run on any standard port but randomly selects ports for its communication, and it uses TCP, UDP, or both for the data transfer. Furthermore, its use of a 256-bit encryption algorithm, with no visibility into either the algorithm or its keys, makes its detection even harder. All of the above makes the general problem of detecting VoIP and video data streams over IP challenging, and yet it is of huge business interest.
A new emerging tendency in the research community for approaching this problem is to rely on pattern classification techniques. This family of techniques formulates the application detection problem as a statistical problem, developing discriminating criteria based on statistical observations and distributions of various flow properties in the packet traces. A few papers [49, 50] have taken this statistical approach to classify traffic into p2p, multimedia streaming, interactive applications, and bulk transfer applications.
Unfortunately, although these papers addressed the problem of distinguishing multimedia
traffic from other applications, they have not addressed the problem of distinguishing
voice traffic from video traffic. One problem is to separate streaming traffic from other
applications, and a different problem is to detect and correctly classify voice and video and
clearly separate the two applications from each other. In the extreme case, voice and/or
video data streams might even be bundled together in the same exact flow with other
applications. These problems are common for many applications like Skype, Gtalk and
MSN that allow users to mix voice and/or video streams with chat and/or file transfer
traffic in the same exact 5-tuple flow, defined as ⟨source IP address, destination IP address, source port, destination port, protocol number⟩. In such cases, one flow may carry
traffic from multiple types of applications (such as voice, video, chat, and file transfer),
referred to as hybrid flow in the remainder of this paper.
Our research focuses on detecting and classifying voice and video traffic, and it further deals with the more general formulation that considers the presence of hybrid flows. Based on the intuition that voice and video data streams show strong regularities in the packet inter-arrival times within a flow and in the associated packet sizes, when the two are combined in one single stochastic process and analyzed in the frequency domain, we propose a system, called VOVClassifier, for voice and video traffic classification.
VOVClassifier is an automated self-learning system composed of four major modules operating in cascade. The system is first trained with voice and video data streams and afterwards enters the classification phase. During the training period, all packets belonging to the same flow are extracted and used to generate a stochastic model that captures the features of interest. All flows are then processed in the frequency domain using Power Spectral Density (PSD) analysis in order to extract a high-dimensional space of frequencies that carry the majority of the energy of the signal. All features extracted from each flow are grouped into a "feature vector". Because many different codecs are used for voice and video, a second module clusters the feature vectors into several groups using Subspace Decomposition (SD) and then identifies each group's subspace structure, e.g., the bases of the subspace, using Principal Component Analysis (PCA). These two steps are applied to all flows during the training period and produce low-dimensional spaces, referred to here as the voice subspace and the video subspace.

After training, each flow is processed by the PSD module and the associated feature vector is compared with the voice and video subspaces obtained during training. The subspace at minimum normalized distance from the feature vector is selected as the candidate, and it is chosen if and only if its distance is below a predetermined threshold.
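The classification step can be sketched as a minimum normalized-distance test against the trained subspaces; the `(mean, basis)` representation and the names below are our assumptions about the training output, not the dissertation's exact interfaces.

```python
import numpy as np

def classify_flow(feature, subspaces, threshold):
    """Assign `feature` to the subspace (e.g., 'voice' or 'video') with the
    smallest normalized residual distance; reject if it exceeds threshold.

    subspaces: label -> (mean vector, matrix U of orthonormal PCA bases).
    """
    best_label, best_dist = None, np.inf
    for label, (mu, U) in subspaces.items():
        x = feature - mu
        residual = x - U @ (U.T @ x)          # component outside the subspace
        dist = np.linalg.norm(residual) / (np.linalg.norm(x) + 1e-12)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist < threshold else None
```

Returning `None` when no subspace is close enough implements the "chosen if and only if its distance is below a predetermined threshold" rule.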
We applied VOVClassifier to real packet traces collected in two different network scenarios. The results demonstrate the effectiveness and robustness of our approach, which achieves a 100% detection rate for both voice and video in the case of single-typed flows, i.e., one application per 5-tuple, and detection rates of 98.6% and 94.8% for voice and video, respectively, in the more complex scenario of hybrid flows, i.e., voice, video, and file transfer bundled together in the same 5-tuple flow.
The rest of the chapter is organized as follows. In Section 5.2, we introduce the
related work in the area of pattern classification methodologies. Section 5.3 describes
the weaknesses of metrics previously used by other works when applied in our context
and highlights the new traffic features that constitute the foundation of our approach.
Section 5.4 summarizes this chapter.
5.2 Related Work
Existing work on traffic classification uses discriminating criteria such as the packet size distribution per flow, the inter-arrival times between packets within the same flow, and other statistics captured across multiple flows. For example, in Ref. [49], the authors proposed the combination of the average packet size within a flow and the inter-arrival variability metric, defined as the ratio of the variance to the average of the inter-arrival times of packets within a flow, as a powerful metric to define fairly distinct boundaries for three groups of applications: (i) bulk data transfer like FTP, (ii) interactive like HTTP, and (iii) streaming like voice, video, gaming, etc. Several classification techniques, like nearest-neighbor and K-nearest-neighbor, were then tested using the above traffic features. Although this preliminary study proved that the pattern classification approach has great potential for proper application classification, it also showed that much more work remains, e.g., exploring alternative traffic features and classification techniques. Moreover, although the extracted features are simple and feasible to implement on-the-fly, the learning algorithm is complex and the resulting boundaries among the three families of applications are heavily non-linear and time-dependent.
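The two per-flow features from Ref. [49] can be computed as follows; this is a sketch, and the function and argument names are ours.

```python
import statistics

def flow_features(arrival_times, packet_sizes):
    """Average packet size and the inter-arrival variability metric of
    Roughan et al. [49]: variance / mean of the inter-arrival times."""
    iats = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    variability = statistics.pvariance(iats) / statistics.mean(iats)
    return statistics.mean(packet_sizes), variability
```

A perfectly paced flow (constant inter-arrival time) has variability 0, which is why streaming applications separate well from bursty bulk transfers under this metric.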
Similar to Ref. [49], Karagiannis et al. [50] proposed a novel approach, called BLINC, that exploits network-related properties and characteristics. The novelty of this approach is twofold. First, the authors shift the focus from classifying individual flows to associating Internet hosts with applications, and then classifying the hosts' flows accordingly. Second, BLINC follows a different philosophy from previous methods, attempting to capture the inherent behavior of a host at three levels: (i) the social level, e.g., how each host interacts with other hosts; (ii) the functional level, e.g., the role played by each host in the network as a provider or a consumer of an application; and (iii) the application level, e.g., the ports used by each host during its communication with other hosts. Although the approach proposed in [50] is interesting from a conceptual perspective and proved to perform reasonably well for a variety of different applications, it is still prone to large estimation errors for streaming applications. Moreover, its high complexity and large memory consumption remain open issues for high-speed application classification.
Other papers using pattern classification have appeared lately in the literature, but they focus on specific application detection such as peer-to-peer [46] and chat [51]. More importantly, to the best of our knowledge, none of the existing work has been able to separate voice traffic from video traffic, or to indicate the presence of voice or video traffic in a hybrid flow that contains traffic from both voice/video and other applications such as file transfer.
5.3 Intuitions Behind a Proper Detection of Voice and Video Streams
Generally speaking, the problem of voice and video detection can be formulated as a complex pattern classification problem that has to deal with the curse of dimensionality, i.e., discriminating voice and video data streams in the presence of hidden traffic patterns and many interrelated features. A critical step toward the solution is to identify traffic features that correctly represent the characteristics of the data streams of interest and uniquely isolate them from other applications. To achieve this, in this section we start by showing how simple metrics presented in the past are not applicable in our context, and we conclude with some observations that constitute the essence of
our approach. In Figure 5-1 we show the results obtained when using the combination
of average packet size and the inter-arrival variability metric proposed by Roughan et
95
al. [49]. Although this metric performed very well in separating streaming, file transfer,
transactional and interactive applications, it performs poorly when used to further
separate applications within the same family, as voice, video or voice and video mixed
with other applications like file transfer, e.g. hybrid flows. Figure 5-1 clearly highlights the
complete absence of any distinct boundary and heavy overlapping between voice and video
traffic. The reasons why the pair (average packet size, inter-arrival variability metric)
cannot separate video from voice are as below. First, the packet size for video/voice is
controlled by the packetization strategy of the video/voice application designer [52]; hence,
a video application may produce similar average packet size to that for voice (Figure 5-1).
Second, random end-to-end delay in the Internet causes large variations in the inter-arrival
variability metric for different video/voice flows.
[Figure: scatter plot of average packet size (bytes) vs. inter-arrival variability metric for audio, file, file+audio, file+video, and video flows.]

Figure 5-1. Average packet size versus inter-arrival variability metric for 5 applications: voice, video, file transfer, mix of file transfer with voice and video.
Table 5-1. Commonly used speech codecs and their specifications

    Standard      Codec Method   Inter-Packet Delay (ms)
    G.711 [53]    PCM            0.125
    G.726 [54]    ADPCM          0.125
    G.728 [55]    LD-CELP        0.625
    G.729 [56]    CS-ACELP       10
    G.729A [56]   CS-ACELP       10
    G.723.1 [57]  MP-MLQ         30
    G.723.1 [57]  ACELP          30
To overcome the above problem, in this section we exploit different metrics that have great potential to serve our purpose: the strong regularities in the inter-arrival times between packets within the same flow and in the packet sizes of voice and video data streams. Specifically, we consider four types of metrics, i.e.,
1. packet inter-arrival time and packet size in time domain;
2. packet inter-arrival time in frequency domain;
3. packet size in frequency domain;
4. combining packet inter-arrival time and packet size in frequency domain.
These metrics are discussed later.
[Figure: distribution of packet inter-arrival times (seconds) for audio and video traffic.]

Figure 5-2. Inter-arrival time distribution for voice and video traffic
5.3.1 Packet Inter-Arrival Time and Packet Size in Time Domain
The intuition behind such metrics resides in the observation that any protocol used for voice and video applications specifies a constant time between two consecutive packets at the transmitter side, known as the Inter-Packet Delay (IPD). For example, Table 5-1 lists some speech codec standards and the associated IPDs that are required for a correct implementation of those protocols. Packets leaving the transmitter might traverse a
large number of links in the Internet before reaching the proper destination.

[Figure: distribution of packet sizes (bytes) for audio and video traffic.]

Figure 5-3. Packet size distribution for voice and video traffic

Along this path, packets might experience random delay due to congestion at router interfaces. As a consequence, the inter-arrival times between packets at the receiver might be severely
affected by random noise, e.g., jitter; thus this metric might not represent a reliable candidate feature for a robust classification methodology. Although this problem does exist, we note that the inter-arrival times between packets within the same flow still show a strong regularity when studied in the frequency domain at the receiver side. As an example, in Figure 5-2 we show the distributions of the packet inter-arrival times at the receiver side when using Skype to transmit voice only and video only, respectively, between two hosts, one located at University A on the east coast and the other at University B on the west coast of the USA. As we can see, the distributions for both video and voice are centered around 0.03 seconds. On the other hand, Figure 5-3 shows the distributions of the packet sizes for both voice and video. As we can see, voice and video are characterized by similar distributions for packet sizes below 200 bytes. Although video traffic also generates packets larger than 200 bytes, these larger packets cannot reliably be used to separate video from voice, since other applications such as chat or file transfer might also generate them. As a consequence, packet inter-arrival time or packet size is a weak feature when considered in the temporal domain.
5.3.2 Packet Inter-Arrival Time in Frequency Domain
We now show how the same feature becomes a key reliable feature when observed in the frequency domain. In this domain, we are interested in whether there exists any frequency component that captures the majority of the energy of this stochastic process at the receiver side. We explore this by computing the power spectral density (PSD) of the packet inter-arrival-time process of two traces, each of length 10 seconds, in Figures 5-4 and 5-5 for voice and video, respectively. We can see that some regularity exists for both voice and video across different traces, although the regularity is not very strong. This result holds true for all experiments conducted when transmitting Skype voice and video packets over the Internet from University A to University B.
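A minimal sketch of this analysis is a periodogram of a flow's inter-arrival-time sequence; we use a plain FFT periodogram here, since the dissertation's exact PSD estimator is not specified in this chapter.

```python
import numpy as np

def iat_psd(arrival_times):
    """Periodogram (PSD estimate, in dB) of a flow's packet
    inter-arrival-time sequence, as in Figures 5-4 and 5-5."""
    iats = np.diff(np.asarray(arrival_times, dtype=float))
    x = iats - iats.mean()                # remove the DC component
    psd = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    freqs = np.fft.rfftfreq(len(x))       # cycles per sample
    return freqs, 10 * np.log10(psd + 1e-20)
```

A codec that varies its inter-arrival times periodically produces a sharp spectral peak at the corresponding normalized frequency, which is the regularity exploited here.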
[Figure: PSD (dB) vs. normalized frequency (×π rad/sample) for two traces.]

Figure 5-4. Power spectral density of two sequences/traces of time-varying inter-arrival times for voice traffic
[Figure: PSD (dB) vs. normalized frequency (×π rad/sample) for two traces.]

Figure 5-5. Power spectral density of two sequences of time-varying inter-arrival times for video traffic

5.3.3 Packet Size in Frequency Domain

Somewhat stronger regularity is visible in voice and video packet sizes. Indeed, most video coding schemes use two types of frames [58], i.e., Intra frames (I-frames) and Predicted frames (P-frames). An I-frame is a frame coded without reference to any frame except itself. It serves as the starting point for a decoder to reconstruct the video stream. A P-frame may contain image data, motion vector displacements, and/or combinations of the two. Its decoding needs reference to previously decoded
frames. Packets containing I-frames are larger than those containing P-frames. Usually, the number of P-frames between two consecutive I-frames is constant. Hence, one can observe a strong periodic variation of packet size due to the interleaving of I-frames and P-frames composing video data streams. Voice streams exhibit a similar phenomenon if linear predictive coding (LPC), e.g., a code-excited linear prediction (CELP) voice coder, is employed. As an example, Figures 5-6 and 5-7 show the power spectral density of voice and video packet sizes, respectively.
5.3.4 Combining Packet Inter-Arrival Time and Packet Size in FrequencyDomain
Figs 5-8 and 5-9 show how the regularities hidden in voice and video data streams
can be amplified when combining the two features together in one single stochastic process
that will be described later in the paper. Note how the two important frequencies are
amplified and clearly visible in the PSD plots. The reason why there is a peak in the
PSD for voice (see Figure 5-8) is that voice applications usually produce close-to-constant
packet rate due to constant inter-packet delay of the widely used speech codecs listed
in Table 5-1; e.g., the peak of 33 Hz in Figure 5-8 corresponds to 30 ms inter-packet
100
0 0.2 0.4 0.6 0.8 110
20
30
40
50
60
70
80
Normalized Frequency (×π rad/sample)
Pow
er S
pect
ral D
ensi
ty (
dB)
Trace 1Trace 2
Figure 5-6. Power spectral density of two sequences of discrete-time packet sizes for voicetraffic
Figure 5-7. Power spectral density of two sequences of discrete-time packet sizes for video traffic (Trace 1 and Trace 2; x-axis: normalized frequency, ×π rad/sample; y-axis: power spectral density, dB).
delay. Compared to voice, video applications have a flatter PSD, for the following reason.
The number of bits in an I-frame depends on the texture of the image (e.g., the I-frame of a blackboard image produces far fewer bits than that of a complicated flower image), resulting in a large range in the number of packets per I-frame, e.g., from one packet to a few hundred packets. The frame rate is usually constant (e.g., 30 frames/s is a standard rate in the USA), i.e., a frame is generated every 33 milliseconds at 30 frames/s; hence, the inter-arrival time between two packets in an I-frame may span a large range, resulting in a flat PSD.
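This effect can be reproduced numerically. The following sketch deposits packet sizes on a sampling grid and takes a periodogram; the constant 30 ms inter-packet delay of a hypothetical voice-like flow shows up as a spectral peak near 33 Hz. All parameters (packet size, rate, duration) are illustrative, not taken from the measured traces.

```python
import numpy as np

def packet_train_psd(arrival_times, sizes, ts=0.0005):
    """Sample a packet impulse train at interval ts (seconds) and
    return (frequencies, periodogram)."""
    n = int(round(arrival_times[-1] / ts)) + 1
    signal = np.zeros(n)
    idx = np.minimum(np.round(arrival_times / ts).astype(int), n - 1)
    np.add.at(signal, idx, sizes)          # packet size at each arrival slot
    psd = np.abs(np.fft.rfft(signal - signal.mean())) ** 2 / n
    return np.fft.rfftfreq(n, d=ts), psd

# hypothetical voice-like flow: one 160-byte packet every 30 ms for ~10 s
arrivals = 0.030 * np.arange(333)
sizes = np.full(arrivals.shape, 160.0)
freqs, psd = packet_train_psd(arrivals, sizes)
band = (freqs > 20) & (freqs < 50)          # look around the packet rate
f_peak = freqs[band][np.argmax(psd[band])]
print(round(f_peak, 1))                      # peak near 33.3 Hz = 1 / 30 ms
```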
Figure 5-8. Power spectral density of two sequences of continuous-time packet sizes for voice traffic (Trace 1 and Trace 2; x-axis: frequency, Hz; y-axis: power spectral density, dB).
Figure 5-9. Power spectral density of two sequences of continuous-time packet sizes for video traffic (Trace 1 and Trace 2; x-axis: frequency, Hz; y-axis: power spectral density, dB).
5.4 Summary
In this chapter, we motivated the importance and presented challenges faced by
network traffic classification, specifically, detecting and classifying voice and video traffic.
Nowadays, VoIP and IPTV have become increasingly popular and represent a major source of profits as consumer interest in online voice and video services increases and as broadband deployments proliferate. To tap the potential profits that VoIP and IPTV offer, carrier networks have to efficiently and accurately manage and track the delivery of IP services. Yet the emergence of new zero-day voice and video applications such as Skype, Gtalk, and MSN poses tremendous challenges for ISPs. The traditional approach of using port numbers to classify traffic is infeasible due to the use of dynamic port numbers. The proliferation of proprietary protocols, coupled with the growing use of encryption to ensure data confidentiality, makes application-level analysis infeasible. We also identified a novel problem in which multiple sessions reuse the same transport-layer connection. To the best of our knowledge, this problem has never been considered in the existing literature.
We showed that existing technologies (Section 5.2) are not able to accurately
distinguish between voice and video flows. By analyzing the properties of voice and
video data streams, our intuition is to exploit the strong regularities residing in packet
inter-arrival times and the associated packet sizes. In this chapter, we analyze four types
of metrics that exploit these regularities:
1. packet inter-arrival time and packet size in time domain;
2. packet inter-arrival time in frequency domain;
3. packet size in frequency domain;
4. combining packet inter-arrival time and packet size in frequency domain.
By analyzing the properties of the four types of metrics and illustrating them with figures, we showed that combining packet inter-arrival time and packet size into a single stochastic process generates a distinctive feature for classifying voice and video streams.
CHAPTER 6NETWORK CENTRIC TRAFFIC CLASSIFICATION SYSTEM
6.1 System Architecture
Figure 6-1. VOVClassifier System Architecture
We first present the overall architecture of our system (VOVClassifier, Figure 6-1) and provide a high-level description of the functionality of each of its modules. Generally speaking, VOVClassifier is an automated learning system that takes packet headers from raw packets collected off the wire, organizes them into transport-layer flows, and processes them in real time to search for voice and video applications. VOVClassifier is first trained on voice and video data streams separately before being used for online classification. During the training phase, VOVClassifier extracts feature vectors, each a summary (also known as a statistic) of the raw traffic bit stream, and maintains their statistics in memory. During the online classification phase, a classifier makes decisions by measuring similarity metrics between the feature vector extracted from on-the-fly network traffic and the feature vectors extracted from training data. Flows with high similarity to the voice (or video) features are classified as voice (or video); data streams with low similarity to both voice and video are classified as other applications.
In general, VOVClassifier is composed of four major modules that operate in cascade: (i) Flow Summary Generator (FSG), (ii) Feature Extractor (FE) via power spectral density analysis, (iii) Voice/Video Subspace Generator (SG), and (iv) Voice/Video Classifier (CL).
Next, we briefly summarize the functionalities of each component.
6.1.1 Flow Summary Generator (FSG)
All packets collected off the wire are processed by the Flow Summary Generator module, which removes duplicate packets and organizes the rest into transport-layer flows according to their 5-tuple, i.e., source IP, destination IP, source port, destination port, and transport protocol. In Section 5.3 we showed that voice and video data streams are characterized by packets that are very small in size. As a consequence, this module filters out all packets whose size exceeds a pre-specified threshold θP. The processed flow is then internally described in terms of the packet sizes and arrival times of the packets within any generic flow FS:
FS = {〈P_i, A_i〉 ; i = 1, . . . , I} , (6–1)
where Pi and Ai denote the packet size and relative packet arrival time of the ith packet
in the flow with I packets, respectively. As we only consider relative arrival time, A1 is
always 0.
6.1.2 Feature Extractor (FE) and Voice/Video Subspace Generator (SG)
The FSG output is forwarded to the Feature Extractor module, which computes a feature vector for each processed flow by analyzing its power spectral density (PSD) in order to exploit the regularities residing in voice and video traffic. The Voice/Video Subspace Generator processes the high-dimensional feature vectors it receives and projects them into a low-dimensional space that embeds the fine-granularity properties of the data stream being processed. This is achieved by first partitioning the feature vector space into a few non-overlapping clusters and then extracting the characteristics of each cluster using principal component analysis (PCA). The FSG and FE modules (Figure 6-1) are used during both the training and the classification phases.
6.1.3 Voice/Video Classifier (CL)
During the classification phase, the data are processed by one extra module, the Voice/Video Classifier, which compares the feature vectors extracted from the data streams entering the system to the voice and video subspaces generated during training in order to classify each stream as voice, video, or other. The problem of data stream classification requires the choice of a similarity metric. Many similarity metrics appear in the literature; for example, the Bayes classifier uses a cost function, while nearest-neighbor (1-NN) and K-nearest-neighbor (KNN) classifiers use Euclidean distance. In general, no similarity metric is guaranteed to be the best for all applications. For example, the Bayes classifier is applicable only when the likelihood probabilities are well estimated, which requires the number of training samples to be much larger than the number of feature dimensions. As a consequence, it is not suitable for classification based on a high-dimensional feature vector such as the PSD feature vector. Furthermore, both 1-NN and K-NN are proved to be optimal only under the assumption that data of the same category are clustered together, which is not always the case. We overcome these problems by employing a similarity metric based on the normalized distance from the feature vector representing the ongoing flow to the two subspaces obtained during the training phase. The subspace at minimum distance is elected as the candidate only if the distance is below a specific threshold.
We conclude this section by highlighting one minor limitation of our approach. Our
system is unable to distinguish a flow containing video only from a flow containing video
packets piggybacked by voice data (when video and voice applications are simultaneously
launched in Skype, voice data is piggybacked on video packets). This is because the
feature for video packets piggybacked by voice data is very similar to that for video only.
Hence, our traffic classifier declares a flow containing video packets piggybacked by voice data as “video”.
The rest of this chapter is organized as follows. Sections 6.2, 6.3, and 6.4 describe the Feature Extractor, Voice/Video Subspace Generator, and Voice/Video Classifier components, respectively. In Section 6.5, we conduct experiments on traffic collected between two universities using Skype, MSN, and GTalk. Section 6.6 summarizes this chapter.
6.2 Feature Extractor (FE) Module via Power Spectral Density (PSD)
As explained in Section 5.3, the extraction and processing of simple traffic features does not solve the problem of detecting and separating voice and video data streams from other applications. In this section, we first introduce the preliminary steps that transform each generic flow FS obtained from the FSG into a stochastic process combining the inter-arrival times and packet sizes. Then we describe how power spectral density (PSD) analysis serves as a powerful methodology to extract the hidden key regularities residing in real-time multimedia network traffic.
6.2.1 Modeling the Network Flow as a Stochastic Digital Process
Figure 6-2. Power spectral density feature extraction module: cascade of processing steps.
Each flow FS extracted from the FSG is forwarded to the FE module, which applies several steps in cascade (Figure 6-2). First, any extracted FS (see Equation (6–1)) is modeled as a continuous stochastic process as illustrated in Equation (6–2):

P(t) = Σ_{〈P,A〉∈FS} P δ(t − A), (6–2)
where δ(·) denotes the delta function. As the reader can notice, our model combines packet arrival times and packet sizes into a single stochastic process. Because digital computers are better suited to discrete-time sequences than to continuous-time processes, we transform P(t) into a discrete-time sequence by sampling at frequency F_s = 1/T_s.
Because the signal defined in Equation (6–2) is a summation of delta functions, its spectrum spans the whole frequency domain. To avoid aliasing when the signal is sampled at interval T_s, we apply a low-pass filter (LPF) with impulse response h_LPF(t). The filtered signal P_h(t) can then be written as

P_h(t) = P(t) ∗ h_LPF(t) = Σ_{〈P,A〉∈FS} P h_LPF(t − A). (6–3)
After sampling at interval T_s we obtain the discrete-time sequence

P_d(i) = P_h(iT_s) = Σ_{〈P,A〉∈FS} P h_LPF(iT_s − A), (6–4)

where i = 1, . . . , I_d, I_d = A_max/T_s + 1, and A_max is the arrival time of the last packet in the flow.
We note that the sampling interval T_s cannot be chosen arbitrarily. If T_s is too large, then the spectrum of the flow FS contains only information related to low frequencies and lacks information about the high-frequency spectrum. On the other hand, if T_s is too small, then the length I_d of the resulting discrete-time sequence will be very large, resulting in very high complexity in computing the PSD of FS. After an extensive analysis of widely used voice and video applications such as Skype, MSN, and GTalk, we observed that choosing T_s = 0.5 milliseconds is sufficient to extract all useful information for our purpose.
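The construction of P_d(i) in Equation (6–4) can be sketched as follows. The Gaussian low-pass kernel is our assumption (the chapter leaves the exact h_LPF unspecified), and the three-packet flow is a hypothetical example.

```python
import numpy as np

def discretize_flow(sizes, arrivals, ts=0.0005, sigma=0.001):
    """Equation (6-4) with an assumed Gaussian low-pass kernel h_LPF.
    Returns the discrete-time sequence P_d(i), i = 0..I_d-1."""
    i_d = int(round(arrivals[-1] / ts)) + 1        # I_d = A_max/T_s + 1
    t = np.arange(i_d) * ts                        # sampling instants i*T_s
    # P_d(i) = sum over packets of P * h_LPF(i*T_s - A)
    h = np.exp(-((t[:, None] - arrivals[None, :]) ** 2) / (2 * sigma ** 2))
    return (h * sizes[None, :]).sum(axis=1)

arrivals = np.array([0.0, 0.02, 0.04])     # three packets, 20 ms apart
sizes = np.array([200.0, 200.0, 200.0])    # hypothetical packet sizes (bytes)
pd_seq = discretize_flow(sizes, arrivals)
print(len(pd_seq))   # I_d = 0.04/0.0005 + 1 = 81 samples
```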
Next, we provide a methodology to extract the regularities residing in the signal P(t). We achieve this by studying the extracted digital signal P_d(i) in the frequency domain, applying power spectral density analysis.
6.2.2 Power Spectral Density (PSD) Computation
Power spectral density definition.
The power spectral density of a digital signal represents its energy distribution in the frequency domain. Regularities in the time domain translate into dominant periodic components in its autocorrelation function and, in turn, into peaks in its power spectral density.
For a general second-order stationary sequence {y(i)}_{i∈Z}, the power spectral density (defined in [59]) can be computed as

ψ(ϖ; y) = Σ_{k=−∞}^{∞} r(k; y) e^{−jϖk}, ϖ ∈ [−π, π), (6–5)

where {r(k; y); k ∈ Z} is the autocovariance sequence of the signal {y(i)}_{i∈Z}, i.e.,

r(k; y) = E [y(i) y*(i − k)] . (6–6)

Although ϖ in Equation (6–5) can take any value, we restrict its domain to [−π, π) because ψ(ϖ; y) = ψ(ϖ + 2π; y).
According to Equations (6–5) and (6–6), computing the PSD of a digital signal theoretically requires access to an infinitely long time sequence. Since in reality we cannot assume infinite digital sequences at our disposal, we must consider which technique can estimate the power spectral density with admissible accuracy in our context. In the literature, two families of PSD estimation methods are available: parametric and non-parametric. Parametric methods have been shown to perform better under the assumption that the underlying model is correct and accurate. Furthermore, these methods are attractive from a computational-complexity perspective, as they require the estimation of fewer variables than non-parametric methods. In our research, we employ a parametric method to estimate the PSD. The details are presented in the next section.
PSD estimation based on parametric method.
109
Now, we briefly present the parametric method used to estimate the PSD. According to the Weierstrass theorem, any continuous PSD can be approximated arbitrarily closely by a rational PSD of the form

ψ(ϖ) = |B(ϖ)/A(ϖ)|² ε², (6–7)

where ε² is a positive scalar and A(ϖ) and B(ϖ) are the polynomials

A(ϖ) = 1 + a_1 e^{−iϖ} + · · · + a_p e^{−ipϖ}, (6–8)

B(ϖ) = 1 + b_1 e^{−iϖ} + · · · + b_q e^{−iqϖ}. (6–9)

Equation (6–7) can be regarded as obtaining a signal by filtering white noise of power ε² through a filter with transfer function B(ϖ)/A(ϖ), i.e.,

y(i) + Σ_{t=1}^{p} a_t y(i − t) = ε(i) + Σ_{t=1}^{q} b_t ε(i − t). (6–10)
Starting from Equation (6–7), three types of models are derived:

1. if p > 0 and q = 0, one models {y(i)}_{i∈Z} as an autoregressive (AR(p)) signal;

2. if p = 0 and q > 0, one models {y(i)}_{i∈Z} as a moving average (MA(q)) signal;

3. otherwise, it is modeled as an autoregressive moving average (ARMA(p, q)) signal.
Based on the AR, MA, or ARMA assumption, one can estimate the coefficients in Equation (6–7) and hence the PSD. In general, none of these three models outperforms the other two; rather, their performance is tied to the specific shape of the signal under consideration. Because the signals we process are characterized by strong regularities in the time domain, we adopt the AR model: the AR equation can model spectra with narrow peaks by placing the zeros of A(ϖ) close to the unit circle.
Yule-Walker method. Now, we describe methods to estimate the coefficients of an AR signal. Given that q = 0, Equation (6–10) can be written as

y(i) + Σ_{t=1}^{p} a_t y(i − t) = ε(i). (6–11)

Multiplying Equation (6–11) by y*(i − k) and taking expectations on both sides, one obtains

r(k; y) + Σ_{t=1}^{p} a_t r(k − t; y) = E [ε(i) y*(i − k)] . (6–12)

Noting that

E [ε(i) y*(i − k)] = 0 if k ≠ 0, and E [ε(i) y*(i − k)] = ε² if k = 0, (6–13)

one obtains the equation system

r(0; y) + Σ_{t=1}^{p} a_t r(−t; y) = ε²,
r(k; y) + Σ_{t=1}^{p} a_t r(k − t; y) = 0, k = 1, . . . , p. (6–14)
Equation (6–14) can be rewritten in matrix form, i.e.,

R [1, a_1, . . . , a_p]^T = [ε², 0, . . . , 0]^T, (6–15)

where R is the (p + 1) × (p + 1) Toeplitz matrix whose (j, t) entry is r(j − t; y), for j, t = 0, . . . , p (so its first row is [r(0; y), r(−1; y), . . . , r(−p; y)] and its last row is [r(p; y), . . . , r(1; y), r(0; y)]). Equation (6–15) is known as the Yule-Walker method for AR spectral estimation [59].
Given data {y(i)}_{i=1}^{I}, one first estimates the autocovariance sequence

r̂(k; y) = (1/I) Σ_{i=k+1}^{I} y(i) y*(i − k), r̂(−k; y) = r̂*(k; y), k = 0, . . . , p. (6–16)

When {r(k; y); k = −p, . . . , p} is replaced by its estimate {r̂(k; y); k = −p, . . . , p}, Equation (6–15) becomes a system of p + 1 linear equations in p + 1 unknown variables,
i.e., ε², a_1, . . . , a_p. Its solution is

[â_1, . . . , â_p]^T = −R̂_p^{−1} [r̂(1; y), . . . , r̂(p; y)]^T, (6–17)

where R̂_p is the p × p Toeplitz matrix with (j, t) entry r̂(j − t; y), j, t = 0, . . . , p − 1, and

ε̂² = r̂(0; y) + Σ_{t=1}^{p} r̂(−t; y) â_t. (6–18)
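Equations (6–16)–(6–18) can be sketched directly as a linear solve. The AR(1) trace below (coefficient 0.8, unit-variance noise) is hypothetical test data, not a measured flow.

```python
import numpy as np

def yule_walker(y, p):
    """Direct Yule-Walker solution, Equations (6-16)-(6-18):
    estimate AR(p) coefficients a and noise power eps2."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # autocovariance estimates, Eq. (6-16); real data, so r(-k) = r(k)
    r = np.array([np.dot(y[k:], y[:n - k]) / n for k in range(p + 1)])
    # Toeplitz matrix with (j, t) entry r(|j - t|)
    R = np.array([[r[abs(j - t)] for t in range(p)] for j in range(p)])
    a = -np.linalg.solve(R, r[1:])        # Eq. (6-17)
    eps2 = r[0] + np.dot(r[1:], a)        # Eq. (6-18)
    return a, eps2

# toy AR(1) data: y(i) = 0.8*y(i-1) + noise
rng = np.random.default_rng(0)
y = np.zeros(5000)
for i in range(1, len(y)):
    y[i] = 0.8 * y[i - 1] + rng.standard_normal()
a, eps2 = yule_walker(y, 1)
print(round(a[0], 1))   # ≈ -0.8 (the model stores a_1 with opposite sign)
```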
Levinson-Durbin algorithm (LDA). The direct solution of the Yule-Walker method, i.e., Equations (6–17) and (6–18), is computationally expensive. Equation (6–17) requires inverting the covariance matrix, whose time complexity is O(p³) [60, page 755]. In addition, in most applications there is no a priori information about the true order p. To cope with this, the Yule-Walker system of equations, Equation (6–15), has to be solved for p = 1 up to p = p_max, where p_max is some prespecified maximum order. The total time complexity is then O(p⁴_max).
In this work, we use the Levinson-Durbin algorithm (LDA) [59] to reduce the time complexity. It estimates the AR coefficients recursively in the order p. To facilitate further discussion and to emphasize the order p, we denote
R_{p+1} ≜ the (p + 1) × (p + 1) Toeplitz matrix with (j, t) entry r(j − t; y), j, t = 0, . . . , p, (6–19)

a_p ≜ [a_1, . . . , a_p]^T, (6–20)

and ε²_p the noise power of the AR(p) signal. Thus, one can rewrite Equation (6–15) as

R_{p+1} [1, a_p^T]^T = [ε²_p, 0, . . . , 0]^T. (6–21)
1. function LDA(. . .)
2.   Argument 1: data, {y(i)}_{i=1}^{I}.
3.   Argument 2: order, p.
4.   Return: parameters of the AR(p) model, a_p and ε²_p.
5.   r̂(k; y) = (1/I) Σ_{i=k+1}^{I} y(i) y*(i − k), for ∀k = 0, 1, . . . , p (6–22)
6.   r′_1 = a_1 = −r̂(1; y) / r̂(0; y) (6–23)
7.   ε²_1 = r̂(0; y) − |r̂(1; y)|² / r̂(0; y) (6–24)
8.   for t ← 1, . . . , p − 1
9.     r̃_t ≜ [r̂*(t; y), r̂*(t − 1; y), · · · , r̂*(1; y)]^T (6–25)
       ã_t ≜ [a_t, a_{t−1}, · · · , a_1]^T (6–26)
       r′_{t+1} = −(r̂(t + 1; y) + r̃_t^T ã_t) / ε²_t (6–27)
       ε²_{t+1} = ε²_t (1 − |r′_{t+1}|²) (6–28)
       a_{t+1} = [a_t^T, 0]^T + r′_{t+1} [ã_t^T, 1]^T (6–29)
10.  end for

Figure 6-3. Levinson-Durbin algorithm.

Figure 6-3 gives the LDA procedure to estimate the coefficients of an AR(p) model given data {y(i)}_{i=1}^{I}.
For the same scenario, where one needs to estimate the AR model from order 1 up to p_max, the time complexity of the LDA is O(p²_max), much better than that of the direct solution given by Equations (6–17) and (6–18).
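A real-valued sketch of the recursion in Figure 6-3 (assuming real data, so the conjugates drop out). The exact AR(1) autocovariance used to exercise it is a hypothetical example, not traffic data.

```python
import numpy as np

def levinson_durbin(r, p):
    """Levinson-Durbin recursion (cf. Figure 6-3) on a real
    autocovariance sequence r(0..p); O(p^2) instead of the O(p^3)
    matrix inversion. Returns AR coefficients a_1..a_p and eps2."""
    a = np.array([-r[1] / r[0]])          # order-1 solution, Eq. (6-23)
    eps2 = r[0] - r[1] ** 2 / r[0]        # Eq. (6-24)
    for t in range(1, p):
        # reflection coefficient, Eq. (6-27); r[1:t+1][::-1] is r-tilde
        k = -(r[t + 1] + np.dot(r[1:t + 1][::-1], a)) / eps2
        a = np.concatenate([a + k * a[::-1], [k]])   # Eq. (6-29)
        eps2 *= 1.0 - k ** 2                         # Eq. (6-28)
    return a, eps2

# exact autocovariance of an AR(1) process with a_1 = -0.8, eps2 = 1:
# r(k) = 0.8**k / (1 - 0.64)
r = np.array([0.8 ** k for k in range(4)]) / 0.36
a, eps2 = levinson_durbin(r, 3)
print(np.round(a, 3), round(eps2, 3))   # recovers a ≈ [-0.8, 0, 0], eps2 ≈ 1
```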
1. function PSDEstimate(. . .)
2.   Argument 1: data, {y(i)}_{i=1}^{I}.
3.   Argument 2: order, p.
4.   Return: PSD, ψ̂(ϖ; y).
5.   [â_p, ε̂²_p] = LDA({y(i)}_{i=1}^{I}, p) (6–30)
6.   ψ̂(ϖ; y) = ε̂²_p / |1 + Σ_{t=1}^{p} â_t e^{−itϖ}|² (6–31)

Figure 6-4. Parametric PSD estimate using the Levinson-Durbin algorithm.

Once the AR model is estimated, one can estimate the PSD of the signal {y(i)}_{i=1}^{I}. The procedure is given in Figure 6-4.
PSD feature vector.

According to the above discussion, we now define the PSD feature vector of a flow as follows. Let us assume {P_d(i)}_{i=1}^{I_d} (see Equation (6–4)) to be second-order stationary. Then its PSD can be estimated as

ψ(ϖ; P_d) = PSDEstimate({P_d(i)}_{i=1}^{I_d}, p), (6–32)

where ϖ ∈ [−π, π) and p is the pre-specified order.
Recall that {P_d(i)}_{i=1}^{I_d} is obtained by sampling the continuous-time signal P_h(t) at interval T_s (see Figure 6-2). Thus, one can further formulate the PSD in terms of the real frequency f as

ψ_f(f; P_d) = ψ(2πf / F_s; P_d), f ∈ (−F_s/2, F_s/2), (6–33)

where F_s = 1/T_s. Equation (6–33) shows the relationship between the periodic components of a stochastic process in the continuous-time domain and the shape of its PSD in the frequency domain.
ψ_f(f; P_d) is a continuous function of frequency. To handle it in a computer, we sample it in the frequency domain. In other words, we select a series of frequencies

0 ≤ f_1 < f_2 < · · · < f_M ≤ F_s/2, (6–34)

and define the PSD feature vector as

ψ⃗ = [ψ_f(f_1; P_d), ψ_f(f_2; P_d), . . . , ψ_f(f_M; P_d)]^T. (6–35)

ψ⃗ ∈ R^M is the feature vector we use to perform classification.
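Assembling the feature vector of Equations (6–31)–(6–35) for a given AR model can be sketched as follows. The AR(1) coefficients and the four sampling frequencies are illustrative choices, not the values used in our experiments.

```python
import numpy as np

def psd_feature_vector(a, eps2, freqs_hz, fs=2000.0):
    """Sample the AR-model PSD (Eq. (6-31)) at chosen real frequencies
    (Eqs. (6-33)-(6-35)) to form the feature vector psi.
    fs = 1/Ts; with Ts = 0.5 ms, fs = 2000 Hz."""
    w = 2 * np.pi * np.asarray(freqs_hz, dtype=float) / fs   # f -> varpi
    t = np.arange(1, len(a) + 1)
    # A(varpi) = 1 + sum_t a_t e^{-i t varpi}, evaluated per frequency
    A = 1 + (a[None, :] * np.exp(-1j * w[:, None] * t[None, :])).sum(axis=1)
    return eps2 / np.abs(A) ** 2

# hypothetical AR(1) model and M = 4 sampling frequencies
a = np.array([-0.8])
psi = psd_feature_vector(a, 1.0, [0.0, 100.0, 500.0, 1000.0])
print(psi.shape)   # (4,) -- one PSD value per selected frequency
```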
In the next section, we introduce a technique to translate these high-dimensional feature vectors into a more tractable low-dimensional space.
6.3 Subspace Decomposition and Bases Identification on PSD Features
In many scientific and engineering problems, the data of interest can be viewed
as drawn from a mixture of geometric or statistical models instead of a single one.
Such data are often referred to in different contexts as “mixed,” or “multi-modal,” or
“multi-model,” or “heterogeneous,” or “hybrid.” Subspace decomposition is a general
method for modeling and segmenting such mixed data using a collection of subspaces,
also known in mathematics as a subspace arrangement. By introducing certain new
algebraic models and techniques into data clustering, traditionally a statistical problem,
the subspace decomposition methodology offers a new spectrum of algorithms for data
modeling and clustering that are in many aspects more efficient and effective than (or
complementary to) traditional methods, e.g., principle component analysis (PCA),
Expectation Maximization (EM), and K-Means clustering.
As illustrated in Figure 6-1, we collect voice and video training flows during the training phase. After processing the raw packet data through the PSD feature extraction module, one obtains two sets of feature vectors:

Ψ^(1) ≜ {ψ⃗_1(1), ψ⃗_1(2), . . . , ψ⃗_1(N_1)}, (6–36)

obtained from the voice training data, where N_1 is the number of voice flows; and

Ψ^(2) ≜ {ψ⃗_2(1), ψ⃗_2(2), . . . , ψ⃗_2(N_2)}, (6–37)

obtained from the video training data, where N_2 is the number of video flows. To facilitate further discussion, let us also regard Ψ^(i) as an M × N_i matrix, for i = 1, 2, where each column is a feature vector; in other words, Ψ^(i) ∈ R^{M×N_i}.
In this section, we present techniques to identify the low-dimensional subspaces embedded in R^M for both Ψ^(1) and Ψ^(2).
Many low-dimensional subspace identification schemes exist, such as principal component analysis (PCA) [61] and metric multidimensional scaling (MDS) [62], which identify linear structure, and ISOMAP [63] and locally linear embedding (LLE) [64], which identify non-linear structure.
Unfortunately, all these methods assume that the data are embedded in one single low-dimensional subspace. This assumption is not always true. For example, since different software uses different voice codecs, it is more reasonable to assume that the PSD feature vector of voice traffic is a random vector generated from a mixture model rather than a single model. In such a case, it is more likely that there are several subspaces in which the feature vectors are embedded. The same holds for video feature vectors.
As a result, a better scheme is to first cluster the training feature vectors into several groups, known as subspace decomposition, and then identify the subspace structure of each group, known as subspace bases identification. We describe the two steps in the following sections.
6.3.1 Subspace Decomposition Based on Minimum Coding Length
The purpose of subspace decomposition is to partition the data set

Ψ = {ψ⃗(1), ψ⃗(2), . . . , ψ⃗(N)} (6–38)

into K non-overlapping subsets such that

Ψ = Ψ_1 ∪ Ψ_2 ∪ · · · ∪ Ψ_K. (6–39)
Hong [65] proposed a method to decompose subspaces according to the minimum coding length criterion. The idea is to view the data segmentation problem from the perspective of data coding/compression.
Suppose one wants to find a coding scheme, C, which maps the data in Ψ ∈ R^{M×N} to a bit sequence. As all elements are real numbers, an infinitely long bit sequence would be needed to decode without error. Hence, one has to specify a tolerable decoding error, ε, to obtain a mapping with finite coding length, i.e.,

‖ψ⃗_n − C^{−1}(C(ψ⃗_n))‖² ≤ ε², for ∀n = 1, . . . , N. (6–40)

The coding length of the coding scheme C is then a function

L_C : R^{M×N} → Z^+. (6–41)
It is proven [65] that the coding length is upper bounded by

L_C(Ψ) ≤ L̄(Ψ) = ((N + K)/2) log_2 det(I + (K/(Nε²)) Ψ̄Ψ̄^T) + (K/2) log_2(1 + μ_Ψ^T μ_Ψ / ε²), (6–42)

where

μ_Ψ = (1/N) Σ_{i=1}^{N} ψ⃗(i), (6–43)

Ψ̄ = [ψ⃗(1) − μ_Ψ, . . . , ψ⃗(N) − μ_Ψ]. (6–44)
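A sketch of the bound in Equation (6–42). The two well-separated Gaussian clusters are synthetic data chosen to show that coding two tight groups separately is cheaper than coding their union; the value of ε is an illustrative assumption.

```python
import numpy as np

def coding_length(Psi, K, eps=0.1):
    """Coding-length upper bound of Equation (6-42) for a data matrix
    Psi in R^{M x N} (columns are feature vectors); K is the number of
    groups the bound is written for, eps the tolerable decoding error."""
    M, N = Psi.shape
    mu = Psi.mean(axis=1, keepdims=True)
    Xc = Psi - mu                           # centered data, Eq. (6-44)
    term1 = (N + K) / 2 * np.linalg.slogdet(
        np.eye(M) + K / (N * eps ** 2) * Xc @ Xc.T)[1] / np.log(2)
    term2 = K / 2 * np.log2(1 + (mu.ravel() @ mu.ravel()) / eps ** 2)
    return term1 + term2

# two tight hypothetical clusters: coding them separately is cheaper
rng = np.random.default_rng(1)
A = rng.normal(0.0, 0.01, size=(5, 20))
B = rng.normal(5.0, 0.01, size=(5, 20))
together = coding_length(np.hstack([A, B]), K=1)
separate = coding_length(A, K=1) + coding_length(B, K=1)
print(separate < together)   # True: splitting reduces coding length
```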
The optimal partition (see Equation (6–39)), in terms of the minimum coding length criterion, should minimize the coding length of the segmented data, i.e.,

min_Π L_C(Ψ; Π) = min_Π { Σ_{k=1}^{K} L_C(Ψ_k) + Σ_{k=1}^{K} |Ψ_k| [−log_2(|Ψ_k| / N)] }, (6–45)

where Π denotes the partition scheme. The first term in Equation (6–45) is the sum of the coding lengths of the individual groups, and the second is the number of bits needed to encode the membership of each item of Ψ in the K groups.
The optimal partition is achieved in the following way. Let the segmentation scheme be represented by the membership matrices

Π_k ≜ diag([π_{1k}, π_{2k}, . . . , π_{Nk}]) ∈ R^{N×N}, (6–46)

where π_{nk} denotes the probability that vector ψ⃗(n) belongs to subset k, such that

Σ_{k=1}^{K} π_{nk} = 1, for ∀n = 1, . . . , N, (6–47)

and diag(·) denotes converting a vector to a diagonal matrix.
Hong [65, page 34] proved that the coding length is bounded as follows:

L_C(Ψ; Π) ≤ Σ_{k=1}^{K} [((tr(Π_k) + K)/2) log_2 det(I + (K/(tr(Π_k) ε²)) Ψ̄Π_kΨ̄^T)] + Σ_{k=1}^{K} [tr(Π_k) (−log_2(tr(Π_k)/N))] ≜ L̄(Ψ; Π), (6–48)

where tr(·) denotes the trace of a matrix and det(·) the matrix determinant. Combining Equations (6–45) and (6–48), one arrives at the minimax criterion

Π̂ = arg min_Π [max_C L_C(Ψ; Π)] = arg min_Π L̄(Ψ; Π). (6–49)
Then, for ∀ψ⃗_n ∈ Ψ, we have ψ⃗_n ∈ Ψ_k̂ after segmentation if and only if

k̂ = arg max_k π_{nk}. (6–50)
1. function MCLPartition(. . .)
2.   Argument 1: set of feature vectors, Ψ = {ψ⃗(1), . . . , ψ⃗(N)}.
3.   Return: partition of Ψ, Π = {Ψ_1, . . . , Ψ_K}; Ψ_1 ∪ . . . ∪ Ψ_K = Ψ, Ψ_i ∩ Ψ_j = ∅ for ∀i ≠ j.
4.   Initialization: Π = {{ψ⃗(1)}, {ψ⃗(2)}, . . . , {ψ⃗(N)}}.
5.   while true do
6.     〈π_1, π_2〉 = arg min_{π*_1 ∈ Π, π*_2 ∈ Π} L̄(π*_1 ∪ π*_2) − L̄(π*_1) − L̄(π*_2) (6–51)
7.     if L̄(π_1 ∪ π_2) − L̄(π_1) − L̄(π_2) ≥ 0 then
8.       break
9.     else
10.      Π = (Π \ {π_1, π_2}) ∪ {π_1 ∪ π_2}
11.    end if
12.  end while
13.  return Π

Figure 6-5. Pairwise steepest descent method to achieve minimal coding length.
There is no closed-form solution to Equation (6–49). Hong [65, page 41] proposed a pairwise steepest descent method to solve it (Figure 6-5). It works in a bottom-up way: it starts with a partition scheme that assigns each element of Ψ to its own subset. Then, at each iteration, the algorithm finds the two subsets of feature vectors whose merger decreases the coding length the most (Equation (6–51)). The procedure stops when no further decrease in coding length can be achieved by merging any two subsets.
Using the above method, we obtain a partition of the voice feature vector set Ψ^(1),

Ψ^(1) = Ψ^(1)_1 ∪ · · · ∪ Ψ^(1)_{K_1}, (6–52)

and a partition of the video feature vector set Ψ^(2),

Ψ^(2) = Ψ^(2)_1 ∪ · · · ∪ Ψ^(2)_{K_2}. (6–53)
Next, we describe the method to identify subspace bases in each of the segmentations.
6.3.2 Subspace Bases Identification
In this section, we use the PCA [61] algorithm to identify subspace bases for each segment

{Ψ^(i)_k ; k = 1, . . . , K_i, i = 1, 2}, (6–54)

obtained in the previous section. The basic idea is to identify uncorrelated bases and choose those with dominant energy. Figure 6-6 shows the algorithm.
1. function [μ⃗, U, Σ, Ū, Σ̄] = IdentifyBases(Ψ ∈ R^{M×N}, δ)
2.   μ⃗ = (1/|Ψ|) Σ_{ψ⃗∈Ψ} ψ⃗
3.   Ψ̄ = [ψ⃗_1 − μ⃗, ψ⃗_2 − μ⃗, . . . , ψ⃗_{|Ψ|} − μ⃗]
4.   Perform an eigenvalue decomposition of Ψ̄Ψ̄^T such that
       Ψ̄Ψ̄^T = UΣU^T, (6–55)
     where U ≜ [u⃗_1, · · · , u⃗_M], Σ ≜ diag([σ²_1, . . . , σ²_M]), and σ²_1 ≥ σ²_2 ≥ · · · ≥ σ²_M.
5.   J = arg min_J { Σ_{m=1}^{J} σ²_m ≥ δ Σ_{m=1}^{M} σ²_m }
6.   U = [u⃗_1, u⃗_2, . . . , u⃗_{J−1}]
7.   Ū = [u⃗_J, u⃗_{J+1}, . . . , u⃗_M]
8.   Σ = diag([σ²_1, . . . , σ²_{J−1}])
9.   Σ̄ = diag([σ²_J, . . . , σ²_M])
10. end function

Figure 6-6. Function IdentifyBases identifies the bases of a subspace.
In Figure 6-6, the argument Ψ represents the feature vector set of one segment, and δ is a user-defined parameter that specifies the percentage of energy retained, e.g., 90% or 95%. The algorithm returns five variables. μ⃗ is the sample mean of all feature vectors; it is the origin of the identified subspace. The columns of U are the bases with dominant energy (i.e., variance), whose corresponding variances are the entries of Σ. These bases span the identified low-dimensional subspace of Ψ. The columns of Ū span the orthogonal complement (null space) of that subspace, with corresponding variances Σ̄. The last two outputs are required to calculate the distance of an incoming feature vector to the subspace, as described in Section 6.4.
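A compact sketch of Figure 6-6 using an eigendecomposition. For simplicity this sketch keeps the first J dominant eigenvectors (rather than J − 1 as written in the listing), and the toy data lying near a line in R³ are hypothetical.

```python
import numpy as np

def identify_bases(Psi, delta=0.9):
    """PCA bases identification (cf. Figure 6-6): split eigenvectors of
    the centered scatter matrix into a dominant subspace (energy >= delta)
    and its orthogonal complement. Psi is M x N, columns = features."""
    mu = Psi.mean(axis=1, keepdims=True)
    Xc = Psi - mu
    # eigen-decomposition of Xc Xc^T, eigenvalues sorted descending
    vals, vecs = np.linalg.eigh(Xc @ Xc.T)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    J = int(np.searchsorted(np.cumsum(vals), delta * vals.sum())) + 1
    U, U_bar = vecs[:, :J], vecs[:, J:]      # dominant / complement bases
    return mu.ravel(), U, vals[:J], U_bar, vals[J:]

# hypothetical data lying (almost) on one line in R^3
rng = np.random.default_rng(2)
t = rng.normal(size=100)
Psi = np.outer([1.0, 2.0, 0.0], t) + 0.001 * rng.normal(size=(3, 100))
mu, U, s, U_bar, s_bar = identify_bases(Psi)
print(U.shape[1], U_bar.shape[1])   # 1 dominant direction, 2 in complement
```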
Applying the function IdentifyBases to all segments, we obtain

[μ⃗^(i)_k, U^(i)_k, Σ^(i)_k, Ū^(i)_k, Σ̄^(i)_k] = IdentifyBases(Ψ^(i)_k, δ) (6–56)

for ∀k = 1, . . . , K_i and ∀i = 1, 2. These are the outputs of the subspace identification module, and hence the results of the training phase, in Figure 6-1.

During the classification phase, these outputs are used as system parameters, as presented in the next section.
6.4 Voice/Video Classifier
In Section 6.3, we presented an approach to identify the subspaces spanned by the PSD feature vectors of the training voice and video flows. Specifically, one obtains the parameters

[μ⃗^(i)_k, U^(i)_k, Σ^(i)_k, Ū^(i)_k, Σ̄^(i)_k] (6–57)

for ∀k = 1, . . . , K_i and ∀i = 1, 2. In this section, we use these parameters to perform classification.

During the classification phase, for each ongoing flow F, one composes a sub-flow, FS, by extracting the small packets, i.e., packets smaller than θP, and passes it through the PSD feature extraction module to generate the PSD feature vector ψ⃗. This is the input to the voice/video classifier.
The voice/video classifier works in the following way. It first calculates the normalized distances between ψ⃗ and all subspaces of both categories. Then it takes the minimum distance to each category. The decision is made by comparing the two distance values to two thresholds, θA and θV, for voice and video respectively. Figure 6-7 shows the procedure of the voice/video classifier.
1. function type = VoiceVideoClassify(ψ⃗, θA, θV)
2.   For ∀i = 1, 2, ∀k = 1, . . . , K_i: d^(i)_k = NormalizedDistance(ψ⃗, μ⃗^(i)_k, Ū^(i)_k, Σ̄^(i)_k).
3.   For ∀i = 1, 2: d_i = min_k d^(i)_k.
4.   if d_1 < θA and d_2 > θV
5.     type = VOICE.
6.   else if d_1 > θA and d_2 < θV
7.     type = VIDEO.
8.   else
9.     type = “DON’T KNOW”, i.e., neither voice nor video.
10.  end if
11. end function

12. function d = NormalizedDistance(ψ⃗, μ⃗, Ū, Σ̄)
13.   d = (ψ⃗ − μ⃗)^T Ū Σ̄^{−1} Ū^T (ψ⃗ − μ⃗)
14. end function

Figure 6-7. Function VoiceVideoClassify determines whether a flow with PSD feature vector ψ⃗ is of type voice, video, or neither. θA and θV are two user-specified threshold arguments. Function VoiceVideoClassify uses function NormalizedDistance to calculate the normalized distance between a feature vector and a subspace.
Note that in function VoiceVideoClassify, line 7, when the flow type is detected as video, the flow may also carry voice traffic. The reason is discussed in Section 6.1.3.
From lines 2 and 13 in Figure 6-7, the time complexity of function VoiceVideoClassify is

O((K_1 + K_2)M²). (6–58)
6.5 Experiment Results
In this section, we demonstrate the experiment results of applying the system
presented in Figure 6-1 to network traffic classification. Before that, we first describe
experiment settings in Section 6.5.1.
6.5.1 Experiment Settings
We perform four sets of experiments. In Section 6.5.2, two sets of experiments are conducted on traffic generated by Skype. In Section 6.5.3, the other two sets are conducted on traffic generated by Skype, MSN, and GTalk.
For each set of experiments, we use Receiver Operating Characteristic (ROC) curves [28, page 107] as the performance metric. An ROC curve plots detection probability PD versus false alarm probability PFA, where

PD|H ≜ P(the estimated state of nature is H | the true state of nature is H), (6–59)

PFA|H ≜ P(the estimated state of nature is H | the true state of nature is not H), (6–60)

and H can be voice, video, file+voice, or file+video. By tuning the parameters θσ, θA, and θV (see Figure 6-7), one can generate the ROC curves.
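Concretely, each (PFA, PD) point comes from fixing a threshold, counting how often H is declared under each true state of nature, and then sweeping the threshold. A one-dimensional sketch (the scores and labels here are hypothetical, not our actual decision statistic):

```python
def roc_point(scores, labels, threshold):
    # Declare "H" when score >= threshold; labels[i] is True when the true
    # state of nature is H. Returns (PFA, PD) per Eqs. (6-59) and (6-60).
    tp = sum(s >= threshold for s, l in zip(scores, labels) if l)
    fp = sum(s >= threshold for s, l in zip(scores, labels) if not l)
    pd = tp / sum(labels)
    pfa = fp / (len(labels) - sum(labels))
    return pfa, pd

# Sweeping the threshold traces out the ROC curve:
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]
curve = [roc_point(scores, labels, t) for t in (0.95, 0.5, 0.0)]
```

The extreme thresholds pin the curve at (0, 0) and (1, 1); intermediate thresholds trade PFA against PD.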
During the experiments, we collected network traffic from three applications: Skype, MSN, and GTalk. For each application, traffic was collected in two scenarios. In the first scenario, two lab computers, located at University A and University B respectively, communicated with each other over a direct connection. In the second, we used a firewall to block the direct connection between the two peers, so that the application was forced to use relay nodes.
For classification, we used the first 10 seconds of each flow, i.e., Amax ≤ 10 seconds. We set Ts = 0.5 milliseconds; hence, Id = 20,000.
6.5.2 Skype Flow Classification
In this section, we conduct experiments on Skype traffic. We first consider the scenario in which each Skype flow carries one type of traffic; that is, in this set of experiments, a flow is of type VOICE, VIDEO, or neither.
Figure 6-8 shows the ROC curves of classifying voice and video flows.
Figure 6-8. The ROC curves of single-typed flows generated by Skype: (a) VOICE and (b) VIDEO.
We then conduct experiments on hybrid Skype flows. In other words, each flow may be of type VOICE, VIDEO, FILE+VOICE, FILE+VIDEO, or none of the above. Figure 6-9 plots the ROC curves for these flow types.
6.5.3 General Flow Classification
Now, let us run the same experiments on network traffic generated by Skype, MSN, and GTalk, as these are very common applications at present. In other words, a voice flow can now be a VoIP flow generated by Skype, MSN, or GTalk; the same holds for video flows. Note that GTalk does not support video conferencing. As before, two sets of experiments are conducted: one on single-typed flows and the other on hybrid flows.
Figure 6-9. The ROC curves of hybrid flows generated by Skype: (a) VOICE, (b) VIDEO, (c) FILE+VOICE, and (d) FILE+VIDEO.
Similar to Section 6.5.2, we first consider the scenario in which each flow carries one type of traffic. By tuning the thresholds θσ, θA, and θV, we generate the ROC curves for classifying voice and video flows (Figure 6-10).
We then conduct experiments on hybrid flows. Figure 6-11 shows the ROC curves of
classifying VOICE, VIDEO, FILE+VOICE, and FILE+VIDEO flows.
6.5.4 Discussion
To better understand Figures 6-8, 6-9, 6-10, and 6-11, we show some typical (PD, PFA) pairs in Table 6-1. One can observe the following phenomena from Table 6-1.
Figure 6-10. The ROC curves of single-typed flows generated by Skype, MSN, and GTalk: (a) VOICE and (b) VIDEO.
Table 6-1. Typical PD and PFA values, shown as PFA (PD).

                    Skype                    Skype+MSN+GTalk
          Single       Hybrid         Single        Hybrid
VOICE     0 (1)        0 (1)          0 (.995)      .002 (.986)
VIDEO     0 (.993)     0 (.965)       0 (.952)      0 (.948)
Voice flows vs. video flows. From Table 6-1, one notes that classification of VOICE traffic is more accurate than that of VIDEO traffic. Specifically, we achieve 100% accurate classification of Skype voice flows. This is because voice traffic has higher regularity than video traffic (Figure 5-8 and Figure 5-9). One can immediately spot the dominant periodic component at 33 Hz in the voice flows; this frequency corresponds to the 30-millisecond IPD of the employed voice coding. Video PSDs, on the other hand, peak at 0 Hz, meaning that the non-periodic component dominates in video flows. One can also see that the PSDs of the two video flows are close to each other. This is why our approach achieves high classification accuracy using PSD features.
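The 33 Hz peak is easy to reproduce. The sketch below builds a synthetic, perfectly periodic packet train (one packet every 30 ms, sampled at Ts = 0.5 ms as in Section 6.5.1) and evaluates its energy spectrum directly at two frequencies; the idealized train and the direct DFT sum are illustrative simplifications, not our feature extractor.

```python
import cmath

Ts = 5e-4                      # 0.5 ms sampling period
n = 19980                      # about 10 seconds of samples
arrivals = range(0, n, 60)     # one packet every 60 samples = 30 ms

def power_at(freq_hz):
    # |DFT|^2 of the impulse train at the given frequency; the sum runs
    # only over the impulses, since all other samples are zero.
    w = -2j * cmath.pi * freq_hz * Ts
    return abs(sum(cmath.exp(w * m) for m in arrivals)) ** 2

p_fund = power_at(1 / 0.03)    # the 33.3 Hz fundamental
p_off = power_at(25.0)         # an off-harmonic frequency
```

For this idealized train, the energy at the 33.3 Hz fundamental is several orders of magnitude above off-harmonic frequencies; real voice flows show the same dominant peak, only less sharply.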
Single-typed flows vs. hybrid flows. From Table 6-1, one can see that classification of single-typed flows is more accurate than that of hybrid flows. Mixing multiple types of traffic together effectively increases noise, so it is not surprising that classification accuracy is reduced.
Figure 6-11. The ROC curves of hybrid flows generated by Skype, MSN, and GTalk: (a) VOICE, (b) VIDEO, (c) FILE+VOICE, and (d) FILE+VIDEO.
One application vs. multiple applications. One further notes that classification of Skype flows is more accurate than that of flows generated by general applications, i.e., Skype, MSN, and GTalk.

Empirically, we found that Skype flows are similar to GTalk flows but quite different from MSN flows. For example, both Skype and GTalk voice flows have inter-arrival times of about 33 milliseconds, whereas MSN voice flows have inter-arrival times of approximately 25 milliseconds.
When these flows are mixed together, classification accuracy is reduced, but the reduction is acceptable. Specifically, for hybrid voice traffic at PFA ≈ 0, PD drops from 1 to 0.986; for hybrid video traffic, it drops from 0.965 to 0.948.

This shows that our approach is robust. The robustness results from the fact that the subspace identification module, presented in Section 6.3, decomposes the original high-dimensional feature space into multiple subspaces. As a result, the PSD feature vectors of Skype and GTalk are likely to lie in different subspaces from those of MSN, so we can still classify traffic accurately.
6.6 Summary
In this chapter, we described the VOVClassifier system for network traffic classification. VOVClassifier is composed of four components: the flow summary generator, feature extractor, subspace generator, and voice/video classifier. The novelty of VOVClassifier lies in

• modeling a network flow by a stochastic process;

• estimating the PSD feature vector to extract the regularities residing in voice and video traffic;

• decomposing the training feature vectors into subspaces, followed by basis identification;

• using the minimum distance to a subspace as the similarity metric to perform classification.
Experimental results demonstrate the effectiveness and robustness of our approach. Specifically, we show that classification of voice traffic is more accurate than that of video traffic; classification of single-typed flows is more accurate than that of hybrid flows; and classification of pure Skype flows is more accurate than that of flows generated by multiple applications (e.g., Skype, MSN, and GTalk).
CHAPTER 7
CONCLUSION AND FUTURE WORK
7.1 Summary of Network Centric Anomaly Detection
In the first part of our study, we presented our work on network centric anomaly detection. We first proposed a novel edge-router based framework to robustly and efficiently detect network anomalies at the first place they appear. The key idea, and the motivation of our framework design, is to exploit both the spatial and temporal correlation of abnormal traffic among edge routers. The framework consists of three types of components: traffic monitors, local analyzers, and a global analyzer. Traffic monitors summarize traffic information on each link between an edge router and a user subnet. Local analyzers collect the information provided by the traffic monitors on each edge router and report to the global analyzer. The global analyzer has a global view of the whole autonomous system and makes the final decision. The advantages of our framework design are the following.
1. It is deployed on edge routers instead of end-user systems, so it can detect network anomalies at the first place they enter an AS.

2. It places no burden on core routers.

3. It is flexible, in that detection of network anomalies can be made both locally and globally.

4. It is capable of accurately detecting low-volume network anomalies by exploiting spatial correlations among edge routers.
We then presented feature extraction for network anomaly detection. Based on the framework, we designed a hierarchical feature extraction architecture in which different components extract different features. For example, traffic monitors can extract features such as packet rate, data rate, and SYN/FIN ratio. Local analyzers can extract features such as SYN/SYN-ACK ratio, round-trip time, and two-way matching features on one edge router. The global analyzer can extract two-way matching features for the whole autonomous system. Specifically, we focused on a novel type of feature,
the two-way matching features, which we proposed. This type of feature uses both the temporal and spatial information carried in network traffic and is a very effective indicator of network anomalies associated with spoofed source IP addresses. We designed a novel data structure, referred to as the Bloom filter array, to efficiently extract two-way matching features. Unlike existing work, our data structure has the following properties: 1) a dynamic Bloom filter, 2) the combination of a sliding window with the Bloom filter, and 3) the use of insertion-removal pairs to add a removal operation to the Bloom filter.
Our analysis and simulation demonstrate that the proposed data structure has a better
space/time trade-off than conventional algorithms.
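As a rough illustration of the insertion-removal idea, the sketch below pairs two plain Bloom filters, one recording insertions and one recording removals; the pairing scheme, sizes, and MD5-based hashing here are assumptions made for the sketch, not the dissertation's Bloom filter array.

```python
import hashlib

class PairedBloomFilter:
    # Illustrative sketch only: an item counts as present when it appears in
    # the insertion filter but not in the removal filter.
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.ins = bytearray(m)
        self.rem = bytearray(m)

    def _hashes(self, item):
        # k bit positions derived from MD5 of a salted key.
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def insert(self, item):
        for h in self._hashes(item):
            self.ins[h] = 1

    def remove(self, item):
        for h in self._hashes(item):
            self.rem[h] = 1

    def __contains__(self, item):
        hs = list(self._hashes(item))
        return all(self.ins[h] for h in hs) and not all(self.rem[h] for h in hs)
```

In this simplified pairing, a removal can over-delete (an item whose bits were all marked removed by other items becomes a false negative), which is the usual trade-off when a plain Bloom filter is extended with deletion.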
Finally, we applied machine learning techniques to network anomaly detection. Specifically, we used a Bayesian model to determine the state of each edge router, normal or abnormal. Traditionally, edge routers are regarded as independent, which makes it impossible to detect low-traffic network anomalies. A straightforward improvement over this independent model is to regard the edge routers as mutually dependent. However, this method requires exponential time to determine the edge router states, i.e., O(2^κ), where κ is the number of edge routers. We proposed a hidden Markov tree (HMT) to model the correlations among edge routers. It retains the advantages of modeling dependence among edge routers while having almost linear time complexity, i.e., O(Bκ), where B is the number of child nodes of each non-leaf node in the HMT. Our machine learning scheme has the following nice properties:
• In addition to detecting network anomalies with a high data rate on a single link, our scheme can also accurately detect attacks with a low data rate spread over multiple links, thanks to its exploitation of the spatial correlation of network anomalies.

• Our scheme is robust against time-varying traffic patterns, owing to powerful machine learning techniques.

• Our scheme can be deployed in large-scale, high-speed networks, thanks to the use of the Bloom filter array to efficiently extract features.
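For concreteness, the upward (leaf-to-root) half of the HMT computation can be sketched as follows; the recursion follows Eq. (4-32), while the node layout and the emission/transition tables are hypothetical illustrations, not our deployed code.

```python
def upward_pass(children, emission, transition, root, states=(0, 1)):
    # Computes upsilon_i(u) = p(observations in subtree T_i | Omega_i = u).
    # children[i] lists i's children, emission[i][u] = p(phi_i | Omega_i = u),
    # and transition[j][uc][up] = P(Omega_j = uc | Omega_parent = up).
    upsilon = {}

    def visit(i):
        for j in children.get(i, []):
            visit(j)
        for u in states:
            val = emission[i][u]
            for j in children.get(i, []):
                # Marginalize each child's state given the parent state u.
                val *= sum(transition[j][uc][u] * upsilon[(j, uc)]
                           for uc in states)
            upsilon[(i, u)] = val

    visit(root)
    return upsilon
```

Each node touches each of its B children once per pair of states, so a full pass over κ edge routers costs O(Bκ) for a fixed number of states, matching the complexity cited above.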
Our simulation results show that the proposed framework can detect DDoS attacks even if the volume of attack traffic on each link is extremely small (i.e., 1%). In particular, for the same false alarm probability, our scheme achieves a detection probability of 0.97, whereas the existing scheme achieves 0.17, which demonstrates the superior performance of our scheme.
7.2 Summary of Network Centric Traffic Classification
We then presented our research on network centric traffic classification, specifically the detection and classification of voice and video data streams.
We first motivated the significance of this problem, pointed out its challenges, and showed the weaknesses of existing solutions. With the emergence of software using user-specified or dynamic ports, traffic classification based on TCP and UDP port numbers is no longer valid. Other methods, based on reconstructing session and application information from packet contents, impose significant complexity and processing load on the classification device; in addition, they are incapable of classifying encrypted traffic. An emerging tendency in the research community is to approach this problem with pattern classification techniques. However, existing machine learning technologies are not able to distinguish between voice and video traffic. In this research, we also raised a novel problem: one network flow may carry multiple types of sessions; for example, Skype uses one connection to carry voice, video, chat, and file transfer at the same time. This increases the difficulty of traffic classification. To the best of our knowledge, no existing literature has considered this problem. Our intuition for approaching this problem is to exploit the regularities residing in multimedia traffic. We illustrated four types of metrics to measure these regularities:
1. packet inter-arrival time and packet size in time domain;
2. packet inter-arrival time in frequency domain;
3. packet size in frequency domain;
4. combining packet inter-arrival time and packet size in frequency domain.
It turns out that the last one is the most distinctive feature for classifying voice and
video traffic.
We then presented the VOVClassifier system to classify voice and video traffic. VOVClassifier is composed of four major modules that operate in cascade: the flow summary generator, feature extractor, voice/video subspace generator, and voice/video classifier. The novelty of VOVClassifier is that

• we combine the packet inter-arrival times and packet sizes of a network flow and model them by a stochastic process;

• we estimate the PSD feature vector to extract the regularities residing in voice and video traffic;

• we use minimum coding length to decompose the training feature vectors into subspaces, and principal component analysis to identify the bases of each subspace;

• we use the minimum distance to a subspace as the similarity metric to perform classification.
The experimental results demonstrate the effectiveness and robustness of our approach.
APPENDIX A
PROOFS
A.1 Equation (4–31)
Proof:
∑_{u′∈{0,1}} P(Ωi = u | Ωρ(i) = u′) Υρ(i)(u′) p(φρ(i) | Ωρ(i) = u′)

= ∑_{u′∈{0,1}} P(Ωi = u | Ωρ(i) = u′) p(Ωρ(i) = u′, ~φT\ρ(i)) p(φρ(i) | Ωρ(i) = u′)

= ∑_{u′∈{0,1}} p(Ωi = u, Ωρ(i) = u′, ~φT\i)

= p(Ωi = u, ~φT\i)

= Υi(u). (A–1)
A.2 Equation (4–32)
Proof:
p(φi | Ωi = u) ∏_{j∈ν(i)} ∑_{u′∈{0,1}} P(Ωj = u′ | Ωi = u) υj(u′)

= p(φi | Ωi = u) ∏_{j∈ν(i)} ∑_{u′∈{0,1}} P(Ωj = u′ | Ωi = u) p(~φTj | Ωj = u′)

= p(φi | Ωi = u) ∏_{j∈ν(i)} p(~φTj | Ωi = u)

= p(~φTi | Ωi = u) = υi(u). (A–2)
A.3 Equation (4–33)
Proof:
Υi(u) υi(u) / ∑_{u′} Υi(u′) υi(u′)

= p(Ωi = u, ~φT\i) p(~φTi | Ωi = u) / ∑_{u′} p(Ωi = u′, ~φT\i) p(~φTi | Ωi = u′)

= p(Ωi = u, ~φ) / ∑_{u′} p(Ωi = u′, ~φ)

= p(Ωi = u, ~φ) / p(~φ)

= P(Ωi = u | ~φ). (A–3)
A.4 Equation (4–34)
Proof:
υi(u) υρ(i)(u′) P(Ωi = u | Ωρ(i) = u′) Υρ(i)(u′) / { [∑_{u″} Υi(u″) υi(u″)] [∑_{u″} P(Ωi = u″ | Ωρ(i) = u′) υi(u″)] }

= p(~φTi | Ωi = u) p(~φTρ(i) | Ωρ(i) = u′) P(Ωi = u | Ωρ(i) = u′) p(Ωρ(i) = u′, ~φT\ρ(i)) / { [∑_{u″} p(Ωi = u″, ~φT\i) p(~φTi | Ωi = u″)] [∑_{u″} P(Ωi = u″ | Ωρ(i) = u′) p(~φTi | Ωi = u″)] }

= p(Ωi = u, ~φTi | Ωρ(i) = u′) p(Ωρ(i) = u′, ~φT\ρ(i)) p(~φTρ(i) | Ωρ(i) = u′) / { [∑_{u″} p(Ωi = u″, ~φ)] [∑_{u″} p(Ωi = u″, ~φTi | Ωρ(i) = u′)] }

= p(Ωi = u, Ωρ(i) = u′, ~φTi∪T\ρ(i)) p(~φTρ(i) | Ωρ(i) = u′) / [ p(~φ) p(~φTi | Ωρ(i) = u′) ]

= p(Ωi = u, Ωρ(i) = u′, ~φTi∪T\ρ(i)) p(~φρ(i) | ~φTi, Ωρ(i) = u′) / p(~φ)

= p(Ωi = u, Ωρ(i) = u′, ~φ) / p(~φ)

= P(Ωi = u, Ωρ(i) = u′ | ~φ). (A–4)
REFERENCES
[1] P. Mockapetris, “Domain names - concepts and facilities,” RFC 1034.
[2] P. Mockapetris, “Domain names - implementation and specification,” RFC 1035.
[3] “Video on demand,” Wikipedia. [Online]. Available: http://en.wikipedia.org/wiki/Video on demand
[4] D. Wu, Y. T. Hou, W. Zhu, Y.-Q. Zhang, and J. M. Peha, “Streaming video overthe internet: Approaches and directions,” IEEE Trans. Circuits Syst. Video Technol.,vol. 11, pp. 282–300, Mar. 2001.
[5] K. Nichols, S. Blake, F. Baker, and D. Black, “Definition of the differentiated servicesfield (ds field) in the ipv4 and ipv6 headers,” RFC 2474.
[6] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss, “An architecturefor differentiated services,” RFC 2475.
[7] P. Almquist, “Type of service in the internet protocol suite,” RFC 1349.
[8] S. S. Kim, A. L. N. Reddy, and M. Vannucci, “Detecting traffic anomalies usingdiscrete wavelet transform,” in Proceedings of International Conference on InformationNetworking (ICOIN), vol. III, Busan, Korea, Feb. 2004, pp. 1375–1384.
[9] C.-M. Cheng, H. T. Kung, and K.-S. Tan, “Use of spectral analysis in defense againstdos attacks,” in Proceedings of IEEE Globecom 2002, vol. 3, Taipei, Taiwan, Nov.2002, pp. 2143–2148.
[10] A. Hussain, J. Heidemann, and C. Papadopoulos, “A framework for classifying denialof service attacks,” in Proceedings of ACM SIGCOMM, Karlsruhe, Germany, Aug.2003.
[11] H. Wang, D. Zhang, and K. G. Shin, “Detecting SYN flooding attacks,” in Proc. IEEEINFOCOM’02, New York City, NY, June 2002, pp. 1530–1539.
[12] T. Peng, C. Leckie, and K. Ramamohanarao, “Detecting distributed denial of serviceattacks using source IP address monitoring,” Department of Computer Science andSoftware Engineering, The University of Melbourne, Tech. Rep., 2002. [Online].Available: http://www.cs.mu.oz.au/∼tpeng
[13] R. B. Blazek, H. Kim, B. Rozovskii, and A. Tartakovsky, “A novel approach todetection of “denial-of-service” attacks via adaptive sequential and batch-sequentialchange-point detection methods,” in Proc. IEEE Workshop on Information Assuranceand Security, West Point, NY, June 2001, pp. 220–226.
[14] S. Mukkamala and A. H. Sung, “Detecting denial of service attacks using supportvector machines,” in Proceedings of IEEE International Conference on Fuzzy Systems,May 2003.
[15] S. Savage, D. Wetherall, A. Karlin, and T. Anderson, “Practical network support forip traceback,” in Proc. of ACM SIGCOMM’2000, Aug. 2000.
[16] A. Lakhina, M. Crovella, and C. Diot, “Characterization of network-wide anomaliesin traffic flows,” in Proc. ACM SIGCOMM Conference on Internet Measurement ’04,Oct. 2004.
[17] H. Wang, D. Zhang, and K. G. Shin, “Change-point monitoring for the detectionof dos attacks,” IEEE Transactions on Dependable and Secure Computing, no. 4, pp.193–208, Oct. 2004.
[18] J. Mirkovic and P. Reiher, “A taxonomy of ddos attacks and ddos defense mechanisms,”in Proc. ACM SIGCOMM Computer Communications Review ’04, vol. 34, Apr. 2004,pp. 39–53.
[19] J. B. Postel and J. Reynolds, “File transfer protocol,” RFC 959, Oct. 1985. [Online].Available: http://www.faqs.org/rfcs/rfc959.html
[20] K. Lu, J. Fan, J. Greco, D. Wu, S. Todorovic, and A. Nucci, “A novel anti-ddos systemfor large-scale internet,” in ACM SIGCOMM 2005, Philadelphia, PA, Aug. 2005.
[21] B. H. Bloom, “Space/time trade-offs in hash coding with allowable errors,” Commun.ACM, vol. 13, no. 7, pp. 422–426, July 1970.
[22] L. Fan, P. Cao, J. Almeida, and A. Z. Broder, “Summary cache: A scalable wide-areaweb cache sharing protocol.” IEEE/ACM Trans. Netw., vol. 8, no. 3, June 2000.
[23] F. Chang, W. chang Feng, and K. Li, “Approximate caches for packet classification,”in IEEE INFOCOM 2004, vol. 4, Mar. 2004, pp. 2196–2207.
[24] R. Rivest, “The md5 message-digest algorithm,” RFC 1321, Apr. 1992. [Online].Available: http://www.faqs.org/rfcs/rfc1321.html
[25] MD5 CRYPTO CORE FAMILY, HDL Design House, 2002. [Online]. Available:http://www.hdl-dh.com/pdf/hcr 7910.pdf
[26] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: TheHardware/Software Interface. San Francisco, CA: Morgan Kaufmann, 1998, ch. 5,6.
[27] “Auckland-IV trace data,” 2001. [Online]. Available: http://wand.cs.waikato.ac.nz/wand/wits/auck/4/
[28] L. L. Scharf, Statistical signal processing: detection, estimation, and time seriesanalysis. Addison Wesley, 1991.
[29] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed.Wiley-Interscience, Oct. 2000.
[30] G. Casella and R. L. Berger, Statistical Inference, 2nd ed. Duxbury Press, June 2001.
[31] B. J. Frey, Graphical Models for Machine Learning and Digital Communication.Cambridge, MA: MIT Press, 1998.
[32] M. C. Nechyba, “Maximum-likelihood estimation for mixture models: the EM algorithm,” 2003, course notes. [Online]. Available: http://mil.ufl.edu/∼nechyba/eel6825.f2003/course materials/t4.em theory/em notes.pdf
[33] Y. Weiss, “Correctness of local probability propagation in graphical models withloops,” Neural Computation, vol. 12, pp. 1–4, 2000.
[34] J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Generalized belief propagation,”Advances in Neural Information Processing Systems, vol. 13, pp. 689–695, Dec. 2000.
[35] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of PlausibleInference. Morgan Kaufmann, Sept. 1988.
[36] S. M. Aji and R. J. McEliece, “The generalized distributive law,” IEEE Trans. Inform.Theory, vol. 46, pp. 325–343, Mar. 2000.
[37] T. Richardson, “The geometry of turbo-decoding dynamics,” IEEE Trans. Inform.Theory, vol. 46, pp. 9–23, Jan. 2000.
[38] R. J. McEliece, D. J. C. McKay, and J. F. Cheng, “Turbo decoding as an instance of Pearl's belief propagation algorithm,” IEEE J. Select. Areas Commun., vol. 16, pp. 140–152, Feb. 1998.
[39] F. Kschischang and B. Frey, “Iterative decoding of compound codes by probabilitypropagation in graphical models,” IEEE J. Select. Areas Commun., vol. 16, pp.219–230, Feb. 1998.
[40] A. J. Viterbi, “Error bounds for convolutional codes and an asymptotically optimumdecoding algorithm,” IEEE Trans. Inform. Theory, vol. 13, pp. 260–269, Apr. 1967.
[41] G. D. Forney, Jr., “The Viterbi algorithm,” in Proceedings of the IEEE, vol. 61, Mar. 1973, pp. 268–278.
[42] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” in Proceedings of the IEEE, vol. 77, Feb. 1989, pp. 257–286.
[43] D. Moore, K. Keys, R. Koga, E. Lagache, and k claffy, “The CoralReef software suite as a tool for system and network administrators,” in Usenix LISA, Dec. 2001. [Online]. Available: citeseer.ist.psu.edu/moore01coralreef.html
[44] C. Logg, “Characterization of the traffic between slac and the internet,” July2003. [Online]. Available: http://www.slac.stanford.edu/comp/net/slac-netflow/html/SLAC-netflow.html
[45] Internet Assigned Numbers Authority, “Port numbers,” Aug. 2006. [Online]. Available: http://www.iana.org/assignments/port-numbers
[46] T. Karagiannis, A. Broido, N. Brownlee, kc claffy, and M. Faloutsos, “Is p2p dying orjust hiding?” in IEEE Globecom 2004, 2004.
[47] S. Sen, O. Spatscheck, and D. Wang, “Accurate, scalable in-network identification of P2P traffic using application signatures,” in WWW, 2004.
[48] K. Wang, G. Cretu, and S. J. Stolfo, “Anomalous payload-based network intrusion detection,” in 7th International Symposium on Recent Advances in Intrusion Detection, Sept. 2004, pp. 201–222.
[49] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, “Class-of-service mapping for qos:A statistical signature-based approach to ip traffic classification,” in ACM InternetMeasurement Conference, Taormina, Italy, 2004.
[50] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, “Blinc: multilevel trafficclassification in the dark,” in SIGCOMM ’05: Proceedings of the 2005 conference onApplications, technologies, architectures, and protocols for computer communications.New York, NY, USA: ACM Press, 2005, pp. 229–240.
[51] C. Dewes, A. Wichmann, and A. Feldmann, “An analysis of internet chat systems,” in IMC ’03: Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement. New York, NY, USA: ACM Press, 2003, pp. 51–64.
[52] D. Wu, T. Hou, and Y.-Q. Zhang, “Transporting real-time video over the internet:Challenges and approaches,” Proceedings of the IEEE, vol. 88, no. 12, pp. 1855–1875,December 2000.
[53] ITU-T, “G.711: Pulse code modulation (pcm) of voice frequencies,” ITU-TRecommendation G.711, 1989. [Online]. Available: http://www.itu.int/rec/T-REC-G.711/e
[54] ITU-T, “G.726: 40, 32, 24, 16 kbit/s adaptive differential pulse code modulation (adpcm),” ITU-T Recommendation G.726, 1990. [Online]. Available: http://www.itu.int/rec/T-REC-G.726/e
[55] ITU-T, “G.728: Coding of speech at 16 kbit/s using low-delay code excitedlinear prediction,” ITU-T Recommendation G.728, 1992. [Online]. Available:http://www.itu.int/rec/T-REC-G.728/e
[56] ITU-T, “G.729: Coding of speech at 8 kbit/s using conjugate-structurealgebraic-code-excited linear prediction (cs-acelp),” ITU-T Recommendation G.729,1996. [Online]. Available: http://www.itu.int/rec/T-REC-G.729/e
[57] ITU-T, “G.723.1: Dual rate speech coder for multimedia communications transmittingat 5.3 and 6.3 kbit/s,” ITU-T Recommendation G.723.1, 2006. [Online]. Available:http://www.itu.int/rec/T-REC-G.723.1/en
[58] Y. Wang, J. Ostermann, and Y.-Q. Zhang, Video Processing and Communications,1st ed. Prentice Hall, 2002.
[59] P. Stoica and R. Moses, Spectral Analysis of Signals, 1st ed. Upper Saddle River, NJ:Prentice Hall, 2005.
[60] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed. Cambridge, MA: MIT Press, Sept. 2001.
[61] L. I. Smith, “A tutorial on principal components analysis,” Feb. 2002.[Online]. Available: http://www.cs.otago.ac.nz/cosc453/student tutorials/principal components.pdf
[62] K. V. Deun and L. Delbeke, “Multidimensional scaling,” University of Leuven.[Online]. Available: http://www.mathpsyc.uni-bonn.de/doc/delbeke/delbeke.htm
[63] J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework fornonlinear dimensionality reduction,” Science, vol. 290, pp. 2319–2323, Dec. 2000.
[64] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linearembedding,” Science, vol. 290, pp. 2323–2326, Dec. 2000.
[65] W. Hong, “Hybrid models for representation of imagery data,” Ph.D. dissertation,University of Illinois at Urbana-Champaign, Aug. 2006.
BIOGRAPHICAL SKETCH
Jieyan Fan was born on July 26, 1979, in Shanghai, China. The only child in the family, he grew up mostly in his hometown, graduating from the High School Affiliated to Fudan University in 1997. He earned his B.S. and M.S. degrees in electrical engineering from Shanghai Jiao Tong University, Shanghai, China, in 2001 and 2004, respectively. He is currently a Ph.D. candidate in electrical and computer engineering at the University of Florida, Gainesville, FL. His research interests are network security and pattern classification.

Upon completion of his Ph.D. program, Jieyan will be working at Yahoo! Inc., Sunnyvale, CA.