
Source: faculty.business.utsa.edu/Elias-BouHarb/pubs/confs/ares2013.pdf


A Statistical Approach for Fingerprinting Probing Activities

Elias Bou-Harb
NCFTA & Concordia University
Montreal, Quebec, Canada
e [email protected]

Mourad Debbabi
NCFTA & Concordia University
Montreal, Quebec, Canada
[email protected]

Chadi Assi
NCFTA & Concordia University
Montreal, Quebec, Canada
[email protected]

Abstract—Probing is often the primary stage of an intrusion attempt that enables an attacker to remotely locate, target, and subsequently exploit vulnerable systems. This paper attempts to investigate whether the perceived traffic refers to probing activities and which exact scanning technique is being employed to perform the probing. Further, this work strives to examine probing traffic dimensions to infer the ‘machinery’ of the scan; whether the probing activity is generated from a software tool or from a worm/botnet and whether the probing is random or follows a certain predefined pattern.
Motivated by recent cyber attacks that were facilitated through probing, limited cyber security intelligence related to the mentioned inferences and the lack of accuracy that is provided by scanning detection systems, this paper presents a new approach to fingerprint probing activity. The approach leverages a number of statistical techniques, probabilistic distribution methods and observations in an attempt to understand and analyze probing activities. To prevent evasion, the approach formulates this matter as a change point detection problem that yielded motivating results. Evaluations performed using 55 GB of real darknet traffic show that the extracted inferences exhibit promising accuracy and can generate significant insights that could be used for mitigation purposes.

I. INTRODUCTION

Recent events demonstrated that cyberspace could be subjected to amplified, debilitating and disrupting cyber attacks leading to drastic impacts on provided network and Internet wide services. For instance, it was disclosed that hackers had orchestrated multiple breaches of Sony’s PlayStation Network, taking it offline for 24 days and costing the company an estimated $171 million [1]. Citibank revealed that it had detected a data breach that publicly exposed 360,000 North American credit card holders’ account details, including their account numbers, names, and email addresses [2]. Moreover, hackers have targeted Twitter, gaining substantial access to around 250,000 user accounts [3]. The latter cyber attack came days after US newspapers, the New York Times and the Wall Street Journal, revealed that their respective websites had been the target of a well coordinated hacking effort [4]. Further, Google became the target of a phishing campaign that aimed at gaining access to the accounts of senior officials in the U.S., Korea and other governments [5]. Moreover, a hacker team dubbed LulzSec took down CIA.gov with a targeted denial of service attack [6]. Last but not least, Facebook officials announced that they had discovered that computers belonging to several of its engineers had been hacked using a zero-day Java attack [7]. In general, cyberspace could facilitate advanced persistent threats, distributed denial of service attacks, zero day exploits and cyber terrorism/warfare. Despite efforts to protect the cyberspace, the latest reports from senior government officials [8] highlighted that only limited progress has been made in improving the cyber security of crucial networks.

Probing, the task of scanning enterprise networks or Internet wide services, searching for vulnerabilities or ways to infiltrate IT assets, is a significant cyber security concern. The latter is due to the fact that probing is commonly the primary stage of an intrusion attempt that enables an attacker to remotely locate, target, and subsequently exploit vulnerable systems. It is basically a core technique and a facilitating factor of the above mentioned cyber attacks. Additionally, it was concluded that a momentous 50% of attacks against cyber systems are preceded by some form of network scanning activity [9]. Furthermore, there has recently been a flourishing of a new cyber phenomenon dubbed cyber scanning campaigns [10, 11]. These are scanning approaches that are extremely distributed, possess composite stealth capabilities and exhibit high coordination and orchestration. Rather than focusing on specific hosts or networks, these campaigns aim at probing and consequently exploiting the Internet’s wide services and infrastructures. Motivated by the above, in addition to the lack of accuracy that is provided by current scanning detection systems [12], this paper attempts to answer the following questions:

• Does the perceived traffic refer to probing activities?

• Which specific scanning technique is being employed to perform the probing?

• Who is generating such activity?

• Is the probing random or does it follow a certain predefined pattern?

Stimulated by the requirement to answer the above and hence generate insights that could be used for mitigation purposes, we frame the paper’s contributions as follows:

• Proposing a new method to fingerprint probing activity (i.e., detect activity and identify the technique). The method does not rely on identifying the scanning source and is independent from the scanning strategy (remote to local, local to remote, local to local), the scanning aim (wide range or target specific) and the scanning method (single source or distributed). Further, the proposed method does not rely on a certain predefined alert threshold, the transport protocol used (TCP or UDP) or the number of probed destinations and ports. The method uniquely leverages the Detrended Fluctuation Analysis (DFA) technique.


• Validating our method by comparing its clustering capabilities with the well established machine learning clustering approaches, namely, the k-means and Expectation Maximization (EM) methods.

• Employing various statistical test techniques, including the Bhattacharyya distance, Mann-Kendall and Wald-Wolfowitz, to analyze probing traffic dimensions, namely, the employed technique, monotonicity and randomness.

• Adopting a change point detection mechanism based on a time series Bayesian inference to prevent approach evasion that could be attempted by a malicious attacker or due to network traffic fluctuations.

• Evaluating the proposed approach using 55 GB of real darknet traffic.

The remainder of this paper is organized as follows. Section II elaborates on our proposed approach by attempting to answer the above mentioned questions. Specifically, it describes the rationale and methodology, presents the method's validation techniques, and pinpoints the possible obtained cyber security inferences. Section III presents the empirical evaluation and discusses the results. The evasion prevention approach is highlighted in Section IV. A discussion of the limitations of the proposed approach is stated in Section V. Related work is reviewed in Section VI. Finally, Section VII summarizes the paper and pinpoints the future work.

II. PROPOSED APPROACH

A. Does the perceived traffic refer to probing activities?

Organizations are interested in generating insights and inferences concerning any probing activities that they might receive. It is significant for them to have the capability to fingerprint probing events. Thus, they would benefit from knowing if the perceived traffic is related to scanning or not and, if it is, exactly which scanning technique has been employed. This section aims at tackling these two issues.

The rationale of the method states that regardless of the source, strategy and aim of the probing, the reconnaissance activity should have been generated using a certain literature-known scanning technique (i.e., TCP SYN, UDP, ACK, etc. [13]). We observe that a number of those probing techniques demonstrate a similar temporal correlation and similarity when generating their corresponding probing traffic. In other words, the observation states that we can cluster the scanning techniques based on their traffic correlation statuses. Subsequently, we can differentiate between probing and other malicious traffic (i.e., Denial of Service (DoS)) based on the possessed traffic correlation status. We can as well attribute the probing traffic to a certain cluster of scanning techniques (i.e., the probing activity, after being confirmed as probing, can be identified as being generated by a certain cluster of techniques that possess a similar traffic correlation status). To identify exactly which scanning technique has been employed in the probing, we statistically estimate the relative closeness of the probing traffic in comparison with the techniques found in that cluster. To enable the capturing of traffic signals' correlation statuses, the proposed method employs the DFA technique. DFA was first proposed in [14] and has since been used in many research areas to study signal correlation. Very limited work in the areas of cyber security and malicious traffic detection has utilized DFA [15, 16], and to the best of our knowledge, no work has leveraged the DFA technique to tackle the problem of fingerprinting probing traffic. For detailed information about the DFA technique, including its inner workings, we refer the reader to [14]. The output of the DFA procedure is a relationship between F(n), the average fluctuation as a function of box size, and the box size n. Typically, F(n) will increase with box size n. A linear relationship on a log-log graph indicates the presence of scaling; statistical self-affinity expressed as F(n) ∼ n^α. Under such conditions, the fluctuations can be characterized by a scaling exponent α, which is the slope of the line relating log F(n) to log(n). The scaling exponent α can take the following values, disclosing the correlation status of the traffic time series.

• α < 0.5: anti-correlated.

• α ≈ 0.5: uncorrelated or white noise.

• α > 0.5: correlated.

• α ≈ 1: 1/f -noise or pink noise.

• α > 1: non-stationary, random walk like, unbounded.

• α ≈ 1.5: Brownian noise.
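To make the procedure concrete, the DFA computation described above can be sketched in a few lines of Python. This is a minimal first-order DFA (integrate the mean-subtracted signal, linearly detrend it in non-overlapping boxes, fit log F(n) against log n); it is an illustration written for this text, not the MATLAB code used in the paper, and the name `dfa_alpha` and the box sizes are our own choices.

```python
import numpy as np

def dfa_alpha(x, box_sizes):
    """Estimate the DFA scaling exponent alpha of a 1-D signal (sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.cumsum(x - x.mean())              # integrated profile of the signal
    fluctuations = []
    for n in box_sizes:
        n_boxes = len(y) // n
        f2 = 0.0
        for b in range(n_boxes):
            seg = y[b * n:(b + 1) * n]
            t = np.arange(n)
            coef = np.polyfit(t, seg, 1)      # local linear trend in this box
            f2 += np.mean((seg - np.polyval(coef, t)) ** 2)
        fluctuations.append(np.sqrt(f2 / n_boxes))   # F(n)
    # alpha is the slope of log10 F(n) against log10 n
    slope, _ = np.polyfit(np.log10(box_sizes), np.log10(fluctuations), 1)
    return slope
```

On a white-noise signal this yields α close to 0.5, while on a random-walk signal it yields α close to 1.5, matching the interpretation of the exponent values listed above.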

We proceed by fingerprinting the scanning techniques. For the scope of the current work, we have selected 10 cyber scanning techniques. To accomplish the task, we created an experimental environment that includes two virtual machines, a scanning and a target machine. The machines are isolated from any external networks to prevent any noise in the generated signal. The target machine does not operate any special service. We have also set up a TCPDUMP sink on the target to collect the network traffic data originating from the scanning machine. It is worthy to note that we do not record the traffic response from the target as our intention is to capture the scanning techniques’ traffic regardless of the services offered by the probed machine. To execute the scanning activity, we have utilized Nmap [17], an open source utility for network scanning and discovery. We ran Nmap 10 times, once for each scanning technique, and collected the generated traffic in 10 pcap traffic files. The selected scanning techniques and the corresponding Nmap command flags required to execute each of the techniques are summarized in Table I. For detailed information about the concerned scanning techniques, we refer the reader to [13, 17].

The generated packets’ distribution of the 10 scanning techniques is illustrated in Figure 1. Subsequently, we applied the DFA technique on each of the scanning techniques’ traffic signals. To achieve that, we have utilized the DFA MATLAB code found in [18] and used 1 ms as the bin size for all the 10 scanning techniques. The outcome of applying the DFA on the previous scanning traffic time series distributions is shown in Figure 2 and the output of the scaling exponents α is summarized in Table II.
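The binning step that turns captured packet timestamps into a per-bin packet-count time series (here, 1 ms bins) can be sketched as follows. `packets_to_series` is a hypothetical helper written for illustration, not part of the authors' tooling; the timestamps would come from parsing the collected pcap files.

```python
import numpy as np

def packets_to_series(timestamps, bin_size=0.001):
    """Bin packet arrival timestamps (in seconds) into a packet-count
    time series with the given bin size (default 1 ms). Sketch only."""
    ts = np.asarray(timestamps, dtype=float)
    edges = np.arange(ts.min(), ts.max() + bin_size, bin_size)
    counts, _ = np.histogram(ts, bins=edges)
    return counts
```

The resulting count series is what the DFA procedure is applied to.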

From Table II and the information relating the scaling exponent α to the correlation status, we can produce Table III, which discloses that a number of scanning techniques in fact demonstrated a similar temporal correlation and similarity when generating their corresponding probing traffic. It is


[Figure: ten panels plotting the number of packets against time for each technique — (a) TCP SYN Scan (1 ms intervals), (b) TCP connect() Scan (0.1 s intervals), (c) FIN Scan (1 ms), (d) Xmas Scan (1 ms), (e) Null Scan (1 ms), (f) UDP Scan (1 s), (g) IP Protocol Scan (0.01 s), (h) ACK Scan (1 ms), (i) Window Scan (0.01 s), (j) RPC Scan (1 ms).]

Fig. 1: Packets’ Distribution generated by the Scanning Techniques

[Figure: three log-log plots of Log10(F(n)) against Log10(n) — (a) TCP SYN, TCP Connect() and UDP Scans; (b) FIN, XMAS, NULL and ACK Scans; (c) Window, IP Protocol and RPC Scans.]

Fig. 2: Applying DFA on the Scanning Techniques Traffic Signals

Cyber Scanning Technique | Nmap Command Flags
TCP SYN Scan             | -sS
TCP connect() Scan       | -sT
FIN Scan                 | -sF
Xmas Scan                | -sX
Null Scan                | -sN
UDP Scan                 | -sU
IP Protocol Scan         | -sO
ACK Scan                 | -sA
Window Scan              | -sW
RPC Scan                 | -sR

TABLE I: Selected Cyber Scanning Techniques and Nmap Command Flags

significant to pinpoint that such results are independent from the used scanning tool. In this work, we have used Nmap since it is the most widely adopted and well established scanning

Cyber Scanning Technique | Scaling Exponent
TCP SYN Scan             | 0.57
TCP connect() Scan       | 0.87
FIN Scan                 | 0.31
Xmas Scan                | 0.30
Null Scan                | 0.37
UDP Scan                 | 0.66
IP Protocol Scan         | 1.13
ACK Scan                 | 0.44
Window Scan              | 1.24
RPC Scan                 | 1.31

TABLE II: Summary of the DFA Scaling Exponent α

tool. Moreover, it provided a simple mechanism to generate the scanning traffic. We argue that the same scanning techniques will generate a somewhat similar traffic distribution regardless of the tool used and hence will output similar DFA results. To support


Correlation Status | Cyber Scanning Techniques
Anti-Correlated    | FIN Scan, Xmas Scan, Null Scan, ACK Scan
Correlated         | TCP SYN Scan, TCP connect() Scan, UDP Scan
Non-Stationary     | IP Protocol Scan, Window Scan, RPC Scan

TABLE III: Cyber Scanning Techniques and Corresponding Correlation Statuses

this statement, we executed an experiment using Fing [19], another network scanning and discovery tool. We also selected the TCP SYN Scan since it is the most popular scanning technique [20]. We repeated the same experimentation as we did with the other scanning techniques. The output of the DFA scaling exponent α was 0.55. Therefore, the correlation status was shown to be ‘correlated’, which is coherent with the DFA results that we have previously obtained with the TCP SYN Scan when we used Nmap. We can generalize such a result to other techniques as well; since DFA operates on packet distributions in a time series, where similar scanning techniques, when following the protocol and technique standards, will possess similar distributions when probing their target, we can expect similar DFA results regardless of the tool used.

One of the issues that arises is where to apply DFA given a traffic time series that needs to be tested; this problem could be re-stated as follows: given a traffic time series St, find a starting position X and an ending position X + σ in St on which we can apply DFA. Assuming a ‘random’ or a ‘predefined’ window location in the time series St to apply DFA on would be erroneous as this will result in wrong inferences. For example, if the traffic time series that needs to be tested is of length 5 minutes, applying DFA on the entire distribution could indicate a result (suppose it was inferred that it is not scanning) while the actual scanning starts on the 3rd minute; the entire distribution’s correlation status appears to be close to noise (i.e., α value ≈ 1 and hence not scanning) while from the 3rd minute up to the 5th the signal is correlated (i.e., scanning). To tackle this, we present Algorithm 1, which discloses an approach to approximate when to apply DFA, to correctly infer whether or not the traffic refers to probing activities.

Algorithm 1 takes as input the time series St of the distribution under testing and all the other distributions of the scanning techniques Sci of Figure 1. For every distribution related to the scanning techniques, it calculates the Euclidean distance E between the points (p, q) in Sci and St, respectively. Subsequently, the scanning technique distribution is moved one point in a sliding window fashion against St. For each sliding window, it records the distance between Sci and St, s(i). After finishing all the sliding window procedures, the algorithm stores the minimum distance between both sliding windows in all the iterations, mi. The algorithm finally selects X as the minimum of all the distances in all sliding windows after all the techniques Sci have passed on St. This will approximate the starting position on where to apply DFA in St. Note that σ in the ending position X + σ, which was previously mentioned, is the length of the scanning technique Sci where X was derived from. It is significant to note that we use the scanning techniques Sci of Figure 1 as a way to infer, in an a priori fashion, where to start applying DFA and hence where we estimate that the scanning is initiated; we are not completely matching the techniques with the input time series St. In fact, this is not the intention of Algorithm 1 and hence we do not expect the techniques to completely overlap the input time series St. Thus, any variation of the scanning techniques’ distributions is tolerated and manageable, as long as their correlation statuses, as we expect and observe, are kept stable.

input : A time series St of the distribution under testing; the set of time series Sci of the distributions of the scanning techniques.
output: X, reflecting the starting location on where to apply DFA in St.

lt = length(St);
for i ∈ Sci do
    li = length(Sci);
    for i = 1 → lt do
        for j = 1 → li do
            k(i, j) = ‖E‖ = d(pj, qi)
        end
    end
    for i, j ∈ k(i, j) do
        s(i) = Sum(k(i, j));
    end
    mi = Min(s);
end
for i ∈ mi do
    X = Min(mi);
end
return (X);

Algorithm 1: Approximating the starting location on where to apply DFA in St
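For readability, the logic of Algorithm 1 can be restated in Python. The sketch below is our interpretation of the pseudocode: it slides every technique distribution over the input series, tracks the minimum Euclidean distance over all offsets and techniques, and returns the best starting offset X together with the matching window length σ. The function and variable names are ours, not the authors'.

```python
import numpy as np

def dfa_window_start(st, technique_series):
    """Approximate where to apply DFA in time series `st` (Algorithm 1 sketch).

    Returns (X, sigma): the starting offset of the best-matching window and
    the length of the technique distribution that produced it.
    """
    st = np.asarray(st, dtype=float)
    best = (None, None, np.inf)           # (start X, window length sigma, distance)
    for sc in technique_series:
        sc = np.asarray(sc, dtype=float)
        li = len(sc)
        # slide the technique distribution one point at a time over st
        for x in range(len(st) - li + 1):
            d = np.linalg.norm(st[x:x + li] - sc)   # Euclidean distance
            if d < best[2]:
                best = (x, li, d)
    return best[0], best[1]
```

As in the paper, the intent is only to approximate the window, not to exactly match a technique against the input series.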

After applying DFA, given the output information of Algorithm 1, we end up with a certain correlation status. We expect that the correlation status indicates a probing activity (recall Table III). However, we may encounter the case where the correlation status does not indicate probing (i.e., uncorrelated, 1/f-noise or Brownian noise). If the activity refers to probing, then the output correlation status will lead us to a certain cluster of scanning techniques (of Table III) that this probing activity has been generated from. To exactly identify which scanning technique has been used to probe, we present Algorithm 2, which discloses an approach to achieve that.

Algorithm 2 takes as input a time series Sb of the probing distribution that DFA was previously applied on; Sb refers to the time series extracted from X to X + σ. For each of the scanning techniques Sci in the cluster Scb that is related to the previous output correlation status, we statistically measure


the relative closeness of Sb to each scanning technique in that cluster Scbi using the Bhattacharyya distance [21], Bha. The latter statistical test is an established and effective metric to determine the overlap of two sample distributions [22]. Intuitively, the algorithm selects the technique’s distribution that is closest to Sb, Scbi. Scbi will be identified as the scanning technique that Sb was employing.

input : A time series Sb of the probing distribution that DFA was previously applied on; a cluster of time series Scb of the distributions of the scanning techniques related to the correlation status.
output: Scbi, reflecting one scanning technique that is estimated to be generating the probing activity found in Sb.

for i ∈ Scbi do
    Bhabi = ‖Bha‖ = d(Scbi, Sb)
end
for i ∈ Bhabi do
    di = Min(Bhabi);
end
return (Scbi | i of di);

Algorithm 2: Identifying the scanning technique
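Algorithm 2 amounts to picking the technique whose distribution minimizes the Bhattacharyya distance to the observed probing distribution. A minimal sketch, under the assumption that the distributions are given as normalized histograms over the same bins (the helper names and the example technique labels are illustrative, not the authors' code):

```python
import math

def bhattacharyya(p, q):
    """Bhattacharyya distance between two discrete probability distributions."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))  # Bhattacharyya coefficient
    return -math.log(bc)

def identify_technique(sb, cluster):
    """Return the technique in `cluster` (name -> distribution) whose
    distribution is closest to `sb` (Algorithm 2 sketch)."""
    return min(cluster, key=lambda name: bhattacharyya(sb, cluster[name]))
```

Identical distributions give a distance of zero, so the minimum over the cluster selects the technique whose distribution most overlaps the probing traffic.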

B. Observation Validation

From what has been presented in Table III, we deduced that the techniques could be clustered into 3 groups based on their probing traffic correlation statuses. Thus, the null hypothesis states that the scanning techniques’ traffic originates from 3 independent clusters. However, by default, the DFA approach, especially in the context of probing activity, is not an established method to identify clusters. We now aim to validate the proposed method by comparing it with the well established machine learning clustering approaches, namely, the k-means and the Expectation Maximization (EM) techniques. These techniques and the validation results are consequently discussed.

When the data observations are not pre-labeled into defined numerical or categorical classes, as in our case, two standard widely deployed algorithms for data clustering using unsupervised learning could be employed. These are the k-means [23] and the EM [24] algorithms. The k-means algorithm finds k clusters by choosing n data points at random as initial cluster centers. Each data point is then assigned to the cluster with the center that is closest to that point. Each cluster center is then replaced by the mean of all the data points that have been assigned to that cluster. Note that the k-means algorithm operates by minimizing the sum of squared Euclidean distances between data records in a cluster and the cluster's mean vector. This process is iterated until no data point is reassigned to a different cluster. On the other hand, the EM algorithm views the data clustering problem as a framework for data density estimation using a probability density function. An effective representation of the probability density function is the mixture model, which asserts that the data is a combination of k individual component densities corresponding to k clusters. The EM problem can be summarized as follows: given a set of data observations, identify a set of k populations in the data and provide a density distribution model of each of the populations. Readers who are interested in the details of the EM are referred to [24].
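The k-means procedure described above (random initial centers, nearest-center assignment, recompute means until assignments settle) can be sketched as follows. This is a textbook implementation written for illustration, not the MATLAB clustering routine used in the evaluation:

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means: assign each point to its nearest center, replace each
    center by the mean of its assigned points, repeat until stable."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    centers = pts[rng.choice(len(pts), k, replace=False)]   # random initial centers
    for _ in range(iters):
        # squared-Euclidean assignment of every point to the nearest center
        d = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([pts[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):   # assignments settled
            break
        centers = new
    return labels, centers
```

Run on well-separated data, the algorithm recovers the underlying groups, which is the property the validation in this section relies on.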

We proceed by going back to the scanning traffic pcap files that we have previously collected. Subsequently, we extracted from them a total of 29 data link, network and transport layer packet features as summarized in Table IV. This feature

Features

Data Link Features
1  | Delta time with previous captured packet
2  | Packet Length
3  | Frame Length
4  | Capture Length
5  | The flag ‘frame’ is marked

Network Layer Features
6  | IP Header length
7  | IP Flags
8  | IP Flags: reserved bit
9  | IP Flags: do not fragment bit
10 | IP Flags: more fragments bit
11 | IP Fragment offset
12 | IP Time to live
13 | IP Protocol

Transport Layer Features
14 | TCP Segment length
15 | TCP Sequence number
16 | TCP Next sequence number
17 | TCP Acknowledgement number
18 | TCP Header length
19 | TCP Flags
20 | TCP Flags: congestion window reduced
21 | TCP Flags: ECN-Echo
22 | TCP Flags: Urgent
23 | TCP Flags: Acknowledgement
24 | TCP Flags: Push
25 | TCP Flags: Reset
26 | TCP Flags: Syn
27 | TCP Flags: Fin
28 | TCP Window size
29 | UDP Length

TABLE IV: Features Description

extraction procedure was achieved using the open source jNetPcap API [25]. We consequently compiled the extracted features into a unified data file of 7138 data instances. To apply the k-means and the EM algorithms on our data file, we have respectively used MATLAB's default clustering functionality and the WEKA data mining tool [26]. The output of those procedures is depicted in Figure 3. Figure 3a clearly shows the formation of 3 clusters. This result provides evidence that the traffic originates from 3 different classes. To further test the validity of this result, we produced a silhouette graph of the k-means clusters, as shown in Figure 3b. Typically, a silhouette plot displays a measure of how close each point in one cluster is to the points in the neighboring clusters. A value of 1 indicates that the points are very distant from neighboring clusters, a value of 0 indicates that the points are not distant from other clusters, while a negative value indicates that the points are erroneously placed in that cluster. From Figure 3b, it is shown that a significant number of points in all 3 classes have a large silhouette value, greater than 0.6 [27], indicating that the clusters are well separated from neighboring clusters. This supports the quality of the formed k-means clusters. Further, the output of the EM clustering procedure on the same data file is shown

Page 6: A Statistical Approach for Fingerprinting Probing Activities. faculty.business.utsa.edu/Elias-BouHarb/pubs/confs/ares2013.pdf

(a) K-means Clusters (b) K-means Silhouette (c) EM Clusters

Fig. 3: Method Validation through Unsupervised Learning

in Figure 3c. Similar to the k-means results, we can notice the formation of 3 distinct classes. These results validate the proposed method by revealing that the scanning traffic originates from 3 distinct classes; we accept the null hypothesis.
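The silhouette value used in Figure 3b can be computed per point as (b - a)/max(a, b), where a is the mean distance to the point's own cluster and b is the mean distance to the nearest neighboring cluster. The sketch below assumes Euclidean distance and uses invented sample clusters; it is an illustration of the measure, not MATLAB's implementation:

```python
import math

def silhouette(point, own_cluster, other_clusters):
    """Silhouette value of one point: (b - a) / max(a, b). Values near 1
    mean the point is far from neighboring clusters; near 0 means it sits
    between clusters; negative means it is likely misassigned."""
    dist = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
    others = [q for q in own_cluster if q != point]
    a = sum(dist(point, q) for q in others) / len(others)
    b = min(sum(dist(point, q) for q in cl) / len(cl)
            for cl in other_clusters)
    return (b - a) / max(a, b)

c1 = [(0, 0), (0, 1), (1, 0)]        # invented, tightly packed cluster
c2 = [(10, 10), (10, 11), (11, 10)]  # invented, distant cluster
s = silhouette((0, 0), c1, [c2])
print(round(s, 2))  # prints 0.93: the point is far from the neighboring cluster
```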

C. Who is generating the probing activity?

When an enterprise fingerprints probing (i.e., detects the activity and identifies the technique), it is also valuable to infer the 'machinery' behind the source of the probing. Specifically, it would be significant to pinpoint whether the scanning is generated by a scanning tool or by a worm/botnet. One use case for this is, for instance, when an Internet Service Provider (ISP) notices that one of its customers is generating probing traffic and that probing is generated by a worm/botnet. Keeping in mind that most probing activity is generated from non-spoofed Internet Protocol (IP) addresses (so the actual attacker/scanner can essentially receive the probing results), the ISP can act on that and provide a suitable anti-malware solution to that specific customer.

It is known that monotonic scanning is a behavior of scanning tools, in which the latter sequentially probe their targets (IPs and ports) [28]. Hence, testing for monotonicity in the probing traffic could differentiate between scanning generated by tools and that generated by worms/bots. In the current work, we apply the Mann-Kendall statistical test [29], a non-parametric hypothesis testing approach, to check for monotonicity. In our study, we set the significance level to 0.5%, since a higher value could introduce false positives. Although in the current work we do not directly differentiate between scanning generated by worms and that generated by bots, future work will couple such information with real malware data (which we also possess) to provide more accurate inferences, including the type of malware and the botnet orchestration pattern.
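The Mann-Kendall statistic itself is straightforward: S sums the signs of all pairwise differences, and a normal approximation yields a z-score that is compared against the critical value for the chosen significance level. The paper uses a MATLAB implementation [36]; the Python below is an illustrative restatement (ignoring tie corrections) on an invented, strictly increasing sequence such as sequentially probed ports:

```python
import math

def mann_kendall(xs):
    """Mann-Kendall trend test: S = sum of sign(x_j - x_i) over i < j,
    with the standard normal approximation (no tie correction). Large
    positive z indicates a monotonically increasing sequence."""
    n = len(xs)
    s = sum((xs[j] > xs[i]) - (xs[j] < xs[i])
            for i in range(n) for j in range(i + 1, n))
    var = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var)
    elif s < 0:
        z = (s + 1) / math.sqrt(var)
    else:
        z = 0.0
    return s, z

monotone = list(range(20))       # e.g., sequentially probed ports
s, z = mann_kendall(monotone)
# S = 190, z ≈ 6.13, well above the two-sided 0.5%-level critical value (~2.81)
print(s, round(z, 2))
```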

D. Are we being scanned in a random manner?

When probing activity occurs, it would also be interesting to possess insights related to how the targets are being probed. Does the scanning occur in a random manner, or does it follow a certain predefined pattern? The existence of patterns provides cyber security intelligence about bot orchestration behavior. The latter is specifically important when trying to obtain

inferences related to cyber scanning campaigns, such as the one reported in [11]. In the current work, we apply the Wald-Wolfowitz (also known as the runs) statistical test [30] to check for randomness.
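The runs test can be sketched on a binary sequence (e.g., observations dichotomized around their median): count the runs of identical symbols and compare against the expected count under randomness. The paper uses a MATLAB implementation [37]; the Python below is an illustrative restatement on invented sequences:

```python
import math

def runs_test(bits):
    """Wald-Wolfowitz runs test z-score on a binary sequence. A z near 0
    is consistent with randomness; too few runs (strongly negative z) or
    too many runs (strongly positive z) both indicate a pattern."""
    n1 = sum(bits)
    n2 = len(bits) - n1
    n = n1 + n2
    runs = 1 + sum(bits[i] != bits[i - 1] for i in range(1, len(bits)))
    mu = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    return (runs - mu) / math.sqrt(var)

patterned = [0] * 10 + [1] * 10           # 2 runs: far too few
alternating = [i % 2 for i in range(20)]  # 20 runs: far too many
print(round(runs_test(patterned), 2),
      round(runs_test(alternating), 2))   # prints -4.14 4.14
```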

III. EMPIRICAL EVALUATION

We possess real darknet data that we are receiving on a daily basis from a trusted third party. Such traffic originates from the Internet and is destined to numerous /16 network sensors. In a nutshell, darknet traffic is Internet traffic destined to routable but unused Internet addresses (i.e., dark sensors). Since these addresses are unallocated, any traffic targeting them is suspicious. Darknet analysis has been shown to be an effective method to generate cyber threat intelligence [31, 32]. We use one week of data (55 GB), extracted from multiple /24 networks and collected during the period of February 24th to March 3rd, 2013, to empirically evaluate our approach.

Darknet traffic is typically composed of three types of traffic, namely scanning, backscattered and misconfiguration traffic [33]. Scanning arises from bots and worms, while backscattered traffic commonly refers to unsolicited traffic that results from responses to denial of service attacks with spoofed source IP addresses. On the other hand, misconfiguration traffic is due to network/routing or hardware/software faults causing such traffic to be sent to the darknet sensors.

The first aim is to filter out misconfiguration data. We use a simple metric that records the average number of sources per destination darknet address. This metric should be significantly larger for misconfiguration than for scanning traffic. However, although it differentiates misconfiguration from scanning traffic, it could include backscattered traffic as well, since backscattered flows can also possess a large average number of sources per destination (i.e., in the case of a DoS). To cope with this issue, we observe, per the technique in [33], flags in packet headers, such as TCP SYN+ACK, RST, RST+ACK and ACK, that resemble backscattered traffic. Subsequently, we filter out flows that lack such flags, deeming them misconfiguration.
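The filtering logic above can be sketched as follows. The flow-record shape and the threshold of 10 sources per destination are illustrative assumptions for the example, not values taken from the paper:

```python
from collections import defaultdict

# Flags that resemble responses, i.e., backscatter (per the text above)
BACKSCATTER_FLAGS = {"SYN+ACK", "RST", "RST+ACK", "ACK"}

def classify(flows, avg_src_threshold=10):
    """Label a batch of darknet flows (src_ip, dst_ip, tcp_flags) as
    scanning, backscatter or misconfiguration: a low average number of
    sources per destination suggests scanning; a high average is
    backscatter if response-like flags are present, else misconfiguration."""
    srcs_per_dst = defaultdict(set)
    for src, dst, flags in flows:
        srcs_per_dst[dst].add(src)
    avg = sum(len(s) for s in srcs_per_dst.values()) / len(srcs_per_dst)
    if avg < avg_src_threshold:
        return "scanning"
    if any(flags in BACKSCATTER_FLAGS for _, _, flags in flows):
        return "backscatter"
    return "misconfiguration"

# one source sweeping many dark addresses: a scan
scan = [("1.2.3.4", f"10.0.0.{i}", "SYN") for i in range(50)]
# many sources answering one spoofed address with SYN+ACK: backscatter
dos = [(f"5.6.7.{i}", "10.0.0.1", "SYN+ACK") for i in range(50)]
print(classify(scan), classify(dos))  # prints scanning backscatter
```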

Second, we aggregate the connections into sessions using an approach similar to the first-step algorithm by Kannan et al. [34]. We consider all connections within Taggreg of each other as part of the same session for a given pair of hosts. We used the same threshold, Taggreg = 100 seconds, and found


that this seems to correctly group the majority of connections between any given pair of hosts. For each day in our data set, we extracted 100 sessions, for a total of 700 sessions.
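The first-step aggregation can be sketched as below. Only the threshold Taggreg = 100 seconds comes from the paper; the connection tuples are invented for the example:

```python
from collections import defaultdict

T_AGGREG = 100  # seconds, the threshold used in the paper

def aggregate_sessions(connections):
    """Group (src, dst, timestamp) connections into sessions: for each
    host pair, successive connections within T_AGGREG seconds of the
    previous one belong to the same session."""
    by_pair = defaultdict(list)
    for src, dst, ts in connections:
        by_pair[(src, dst)].append(ts)
    sessions = []
    for pair, stamps in by_pair.items():
        stamps.sort()
        current = [stamps[0]]
        for ts in stamps[1:]:
            if ts - current[-1] <= T_AGGREG:
                current.append(ts)           # same session
            else:
                sessions.append((pair, current))
                current = [ts]               # gap too large: new session
        sessions.append((pair, current))
    return sessions

conns = [("a", "b", 0), ("a", "b", 50), ("a", "b", 400), ("c", "b", 10)]
print(len(aggregate_sessions(conns)))  # prints 3
```

The pair (a, b) yields two sessions (timestamps 0 and 50 group together; 400 is beyond the threshold), and (c, b) yields one.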

We set up an experimental environment using Java and implemented Algorithms 1 and 2. For all the statistical test techniques, including DFA, the Bhattacharyya distance, Mann-Kendall and Wald-Wolfowitz, we employed their MATLAB implementations [18, 35-37].

We first applied the approach to attempt to differentiate between scanning and backscattered traffic (i.e., DoS related activity). Recall that we have identified scanning as having the correlation statuses of Table III. Figure 4 represents how the 700 sessions were distributed and fingerprinted.

Fig. 4: Sessions Distribution (probing activity: 87%; DoS related activity: 13%)

It is shown that probing activity corresponds to 87% (612) of all the sessions. This scanning-to-backscattered traffic ratio is broadly consistent with other darknet studies [33]. Note that in Figure 4, DoS related activity was fingerprinted as such since its DFA results showed that 13% (88) of the sessions possessed correlation statuses corresponding to either uncorrelated, 1/f noise or Brownian noise. The fact that DoS related traffic demonstrated noise or Brownian noise is consistent with what was found in [15], where the authors performed DFA analysis on DoS traffic. To further validate this inference, we implemented the DoS detection algorithm by Moore et al. [38] and applied it to the 88 sessions. 77 of the 88 sessions were detected as DoS related. Thus, with our approach, we erroneously fingerprinted 11 sessions as DoS related (assuming the mentioned DoS detection algorithm did not produce any false positives). To understand why that occurred, we inspected the 11 sessions. 7 of the 11 sessions possessed a DFA scaling exponent α ranging from 1.51 to 1.59, and accordingly were fingerprinted as Brownian noise (i.e., DoS related). However, after inspecting their traffic packets, they were shown to be a rare type of RPC scanning traffic. This suggests that one should not consider large α values as valid results, or should at least keep those incidents in a 'quarantine' for further automated post-processing. The remaining 4 erroneously fingerprinted sessions seem to be misconfiguration traffic that, apparently, was not previously filtered as expected.
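For illustration, the mapping from a DFA scaling exponent α to a correlation status can be sketched using the textbook interpretations (α ≈ 0.5 uncorrelated white noise, α ≈ 1 1/f noise, α ≈ 1.5 Brownian noise). The tolerance of 0.15 is an assumption for the example; the exact boundaries of Table III are not reproduced here:

```python
def dfa_class(alpha, tol=0.15):
    """Map a DFA scaling exponent to a standard correlation status
    (textbook thresholds, not the paper's Table III boundaries)."""
    if abs(alpha - 0.5) <= tol:
        return "uncorrelated (white noise)"
    if abs(alpha - 1.0) <= tol:
        return "1/f noise"
    if abs(alpha - 1.5) <= tol:
        return "Brownian noise"
    if 0.5 < alpha < 1.0:
        return "long-range correlated"
    return "other"

# The 7 misclassified RPC-scan sessions had alpha in [1.51, 1.59];
# thresholds like these would likewise label them Brownian noise.
print(dfa_class(1.55))  # prints Brownian noise
```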

To evaluate the scanning fingerprinting capabilities of our approach, we experimented with Snort's sfPortscan preprocessor using the same 612 sessions that were previously

fingerprinted as probing. sfPortscan [39], a preprocessor plugin for the open source network intrusion detection system Snort [40], provides the capability to detect TCP, UDP and ICMP scanning. The sfPortscan preprocessor detects scans by counting RST packets from each perceived target during a predetermined timeout interval. Before declaring a scan, 5 events (i.e., RST packets) are required from a given target within a window. The sliding timeout window varies from 60 to 600 seconds depending on the sensitivity level; at the highest level, an alert will be generated if the 5 events are observed within 600 seconds. We have chosen to compare our approach with Snort's sfPortscan preprocessor since Snort is one of the most broadly deployed intrusion detection/prevention technologies worldwide and has become a de facto standard.
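The counting logic described above can be sketched as follows. This is a simplification of sfPortscan's actual state machine, and the event timestamps are invented for the example:

```python
def detect_scan(event_times, threshold=5, window=600):
    """sfPortscan-style logic: flag a scan when `threshold` negative
    responses (e.g., RST packets) from one target fall within a sliding
    `window` of seconds (600 s corresponds to the highest sensitivity
    level described above)."""
    event_times = sorted(event_times)
    for i in range(len(event_times) - threshold + 1):
        # check whether 5 consecutive events fit inside one window
        if event_times[i + threshold - 1] - event_times[i] <= window:
            return True
    return False

print(detect_scan([0, 100, 200, 300, 400]))     # 5 events in 400 s: prints True
print(detect_scan([0, 700, 1400, 2100, 2800]))  # too spread out: prints False
```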

We relied on sfPortscan's output as a baseline for our comparison. Snort's sfPortscan detected 590 scans. After a semi-automated analysis and comparison based on the logged scanning traffic flows (i.e., source and destination IP and port, protocol, and timestamp), we identified that the 612 scans that our approach fingerprinted as probing activity include all of sfPortscan's 590 scans. Therefore, relative to this technique and experimenting with this specific data set, we confirm that our approach yielded no false negatives. Moreover, according to the results, our proposed approach generated 22 sessions that could be considered false positives. It is worth noting that our approach can detect certain types of scans that were not included, at the time of the experiment and by default, in Snort's sfPortscan definitions. These include scans from a single host to a single port on a single host, slow scans, and a specific host scanning multiple ports on multiple hosts. In general, we claim that a certain limited, acceptable and manageable number of false positives might occur (taking into consideration the system that we compare our approach with). We also need to consider Snort's sfPortscan false negatives and the different types of probing that our approach is able to fingerprint.

We next applied the proposed approach to identify which techniques were leveraged in the previously fingerprinted probing activity. Figure 5 reveals that TCP SYN scanning leads with 35% (212) of all the sessions, followed by UDP, TCP connect() and ACK scanning. FIN, Xmas Tree and Null scanning are typically considered members of the 'stealth' scans because they send a single frame to a TCP port without any TCP handshaking or any additional packet transfers. They are relatively effective in evading firewall detection and are often employed. The fact that the latter techniques were found to be among the least leveraged in the fingerprinted probing activity in our data set is quite puzzling.

We proceed by attempting to answer the questions that were raised in Sections II-C and II-D. We applied the proposed approach, which yielded the output of Figure 6. The results disclose that around 81% of the probing activity is being generated by worms or bots; only 19% is being generated by probing tools. These percentages suggest that leveraging real malware data, in a future study, could reveal substantial cyber security insights.

The results also elaborate on the manner in which worms/bots probe their targets. It is demonstrated that 66% of that probing traffic follows a random approach, while the remaining 34% follows a certain pattern when scanning its


Fig. 5: Probing Techniques Distribution (TCP SYN leads with 35%; the remaining sessions are distributed among TCP Connect(), UDP, FIN, XMAS, Null, ACK, IP Protocol, Window and RPC scanning)

targets. The latter insight allows us to 1) track probing activity that possesses similar scanning patterns and perhaps attribute it to the same campaign, 2) apply similar mitigation steps to probing activity with similar patterns and techniques, and 3) provide inferences, although not decisive, that the probing is being generated by an orchestrated botnet.

Fig. 6: Probing Activity Dimensions Analysis (612 probing sessions: 114 generated by probing tools and 498 by worms/bots; of the latter, 329 probe randomly and 169 follow a pattern)

Note that we also generate supplementary material related to the above mentioned probing activity (i.e., worms, bots, probing patterns), including geo-location information per real source, organization, ISP, city, region and country. However, we refrain from publishing those due to sensitivity/legal issues.

IV. EVASION PREVENTION

To fingerprint probing activity (i.e., detect the activity and identify the technique), our approach, as stated in Section II and evaluated in Section III, leverages the DFA time series technique and operates on traffic distributions. However, it is realistic to pinpoint that the approach could be evaded in two ways. First, a malicious attacker could deliberately inject a number of packets while performing his probing activity. Second, due to network fluctuations (i.e., delay), the distribution could be distorted. In both cases, our approach might erroneously miss the probing, attribute the probing to the wrong technique cluster or fail to identify the exact

Fig. 7: Approach Evasion Prevention using a Change Point Detection Technique

technique. We project that by formulating this problem as a time series change point detection problem, we can detect if and where the distribution has been subject to any sudden modifications. The time series change point detection problem has been extensively reviewed in the literature [41, 42]. For the sake of this work, as a proof of concept, we selected the work by Adams et al. [42] to test the effectiveness of such an approach. We decided to leverage this specific work since it adopts a statistical approach, which is coherent with the theme of our approach, is highly respected in its domain, and is directly applicable to our work on time series distributions. We employed the authors' MATLAB implementation, used a sample of TCP SYN scanning traffic, and emulated a malicious attacker by injecting packets into the probing distribution using 'packit' [43], a packet analysis and injection tool.

Figure 7 shows how the TCP SYN scan distribution is being generated with a low frequency distribution. Suddenly, the malicious attacker injects random packets. The marked 'X's in the figure demonstrate how the change point detection technique was able to successfully detect the change. In the context of our work, to revert distribution modifications, we can simply remove the distribution between the first marked 'X' and the second marked 'X' to retrieve the original distribution, namely the probing distribution. Although we admit that further experimentation should be undertaken to thoroughly validate the effectiveness of such change point detection techniques in providing evasion prevention for our approach, the obtained preliminary results seem promising and thus motivating.
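The detect-then-excise idea can be illustrated with a much simpler stand-in than the Bayesian online changepoint detector of Adams et al. [42]: a two-sided CUSUM mean-shift detector. All parameters and the synthetic signal below are invented for the example; this sketches the concept, not the paper's implementation:

```python
def cusum_changes(xs, drift=0.5, threshold=5.0):
    """Two-sided CUSUM mean-shift detector: accumulate positive and
    negative deviations from a running mean and report an index whenever
    either accumulator exceeds `threshold`, then restart."""
    mean = xs[0]
    pos = neg = 0.0
    changes = []
    for i, x in enumerate(xs[1:], start=1):
        pos = max(0.0, pos + x - mean - drift)
        neg = max(0.0, neg - x + mean - drift)
        if pos > threshold or neg > threshold:
            changes.append(i)
            pos = neg = 0.0
            mean = x                      # restart after a detected change
        else:
            mean += (x - mean) / (i + 1)  # update the running mean
    return changes

# low-rate probing distribution with a burst of injected packets
signal = [1.0] * 30 + [8.0] * 10 + [1.0] * 30
cps = cusum_changes(signal)
# removing the samples between the first and last detected change
# recovers the original (probing) distribution
cleaned = signal[:cps[0]] + signal[cps[-1]:]
print(cps, len(cleaned))  # prints [30, 40] 60
```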

V. APPROACH LIMITATIONS

We acknowledge a number of limitations in our proposed approach. First, although in this work the approach exhibited promising accuracy when evaluated using darknet (i.e., malicious) data, we have not tested the approach using normal two-way traffic. Normal network traffic (i.e., benign HTTP and FTP traffic, for example) is known to be self-similar (i.e., it possesses long-range traffic correlations). This directly affects the accuracy of our approach. We believe this point is manageable by applying pre-processing filtering mechanisms to filter out the benign traffic before applying our proposed approach. The payload


analysis techniques in [44] seem viable to accomplish that. Second, in its current state, our approach does not fingerprint ICMP scanning. The latter constitutes a significant portion of today's probing activity. We can overcome this limitation by analyzing/fingerprinting the distributions related to such traffic and performing experimental validation as we did for the other scanning techniques. Third, the approach does not differentiate between worm and bot probing. If accomplished, this would yield significant cyber security intelligence for attribution and mitigation purposes. This task is currently work in progress. Finally, bot probing orchestration is not yet confirmed. Coupled with the probing patterns that we have generated in this work, it could be used for tracking cyber scanning campaigns and for effective mitigation.

VI. RELATED WORK

In this section, we discuss a few works related to probing detection/analysis using statistical approaches. Zhang et al. [45] proposed a scan detection method based on a distributed cooperative model. Their technique is composed of feature-based detection, scenario-based detection and statistics-based detection. Their proposed architecture is decomposed into 5 layers (sensors, event generators, event detection agents, a fusion center and a control center) that collaborate to achieve the intended task. The technique's statistics-based detection employs predefined thresholds that allow the detection of both scan and denial of service attacks. A positive aspect of this work is that the proposed technique is well suited to distributed large-scale environments. However, the presented work was based on an illustrative scenario, and the authors did not discuss its applicability to real data samples. In [46], Bhuyan et al. presented the adaptive outlier based approach for coordinated scan detection (AOCD). First, the authors used the principal component analysis feature reduction technique to identify the relevant feature set. Second, they employed a variant of the fuzzy c-means clustering algorithm to cluster information. The authors tested their algorithm using different real-life datasets and compared the results against other techniques available in the literature. Their approach assumes that the target of the scanning is a set of contiguous addresses, which is not always the case. In another work, Baldoni et al. [47] proposed a collaborative architecture where each target network deploys local sensors that send alarms to a collaborative layer. This, in turn, correlates this data with the aim of (1) identifying coordinated cyber scanning activity, (2) reducing false positive alarms and (3) correctly separating groups of attackers that act concurrently on overlapping targets. The soundness of the proposed approach was tested on real network traces.
Their proposed system is designed to leverage information coming from various network domains to detect distributed scanning. Hence, the collaborative layer appears to be ineffective when the adversary is acting only against one network domain. In a more general work, Dainotti et al. [10] presented the measurement and analysis of a 12-day worldwide cyber scanning campaign targeting VoIP (SIP) servers. The authors used darknet/telescope data collected at the UCSD network telescope to exclusively focus on the analysis and reporting of that SIP scanning incident.

Our work is different from the above as it does not rely on identifying the scanning source and is independent of the scanning strategy. Further, the proposed approach does not rely

on a certain predefined alert threshold, the transport protocol used, or the number of probed destinations and ports. Moreover, we attempted to go further than detection by analyzing probing traffic dimensions, namely the employed technique, monotonicity and randomness.

VII. CONCLUSION

This paper presents a new method to fingerprint probing activity. It aims at detecting the cyber scanning activity and identifying the exact technique that was employed in the activity. Further, it analyzes certain probing traffic dimensions, such as monotonicity and randomness, to generate inferences related to the 'machinery' of the scan (i.e., probing tools vs. worms/bots) and the approach of the scanning (i.e., randomness vs. probing patterns). The paper leverages and employs several statistical techniques to achieve the required tasks. Empirical evaluations performed using real darknet traffic showed that the extracted inferences exhibit promising accuracy. As for future work, we are in the process of coupling part of the inferences that we have generated in this work with real malware data to provide more accurate and impactful inferences, including the type of probing malware and the botnet orchestration pattern.

ACKNOWLEDGMENT

The authors are grateful to the members of the National Cyber-Forensics and Training Alliance (NCFTA) Canada lab for their aid in this work. The first author is supported by a Post Graduate Scholarship (PGS) from the Natural Sciences and Engineering Research Council of Canada (NSERC).

REFERENCES

[1] Techcrunch. Hack Attack: Sony Confirms PlayStation Network Outage Caused By 'External Intrusion'. http://tinyurl.com/6cbcldv.

[2] Forbes. Citibank Reveals One Percent Of Credit Card Accounts Exposed In Hacker Intrusion. http://tinyurl.com/7jxgxqz.

[3] The Wall Street Journal. Hackers Targeted Twitter User Data. http://tinyurl.com/a9tkbn5.

[4] ABC News. Major US Newspapers Allege Chinese Hack Attack. http://tinyurl.com/bekao8k.

[5] Computer World. Google: Phishers stole e-mail from U.S. officials, others. http://tinyurl.com/cav4rwj.

[6] The Next Web. Senate website, CIA.gov reportedly hacked. http://tinyurl.com/7y8dv5b.

[7] Ars Technica. Facebook Computers Compromised. http://tinyurl.com/cwmvxrv.

[8] National Security Council. http://www.whitehouse.gov/cybersecurity.

[9] S. Panjwani, S. Tan, K. M. Jarrin, and M. Cukier. An experimental evaluation to determine if port scans are precursors to an attack. In Dependable Systems and Networks (DSN 2005), Proceedings, International Conference on, pages 602-611, June 2005.

[10] A. Dainotti, A. King, K. Claffy, F. Papale, and A. Pescapè. Analysis of a "/0" stealth scan from a botnet. In Internet Measurement Conference (IMC), November 2012.

[11] Internet Storm Center. Reports of a Distributed Injection Scan. http://isc.sans.edu/diary.html?storyid=14251.

[12] Emmanuele Zambon and Damiano Bolzoni. Network Intrusion Detection Systems. http://www.blackhat.com/presentations/bh-usa-06/BH-US-06-Zambon.pdf.

[13] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita. Surveying port scans and their detection methodologies. The Computer Journal, 54(10):1565-1581, 2010.

[14] C.-K. Peng, S. V. Buldyrev, S. Havlin, M. Simons, H. E. Stanley, and A. L. Goldberger. Mosaic organization of DNA nucleotides. Phys. Rev. E, 49:1685-1689, February 1994.

[15] U. Harder, M. W. Johnson, J. T. Bradley, and W. J. Knottenbelt. Observing internet worm and virus attacks with a small network telescope. Electronic Notes in Theoretical Computer Science, 151(3):47-59, 2006.

[16] K. Fukuda, T. Hirotsu, O. Akashi, and T. Sugawara. Correlation among piecewise unwanted traffic time series. In Global Telecommunications Conference (IEEE GLOBECOM 2008), pages 1-5. IEEE, 2008.

[17] G. F. Lyon. Nmap Network Scanning: The Official Nmap Project Guide to Network Discovery and Security Scanning. 2009.

[18] M. Little, P. McSharry, I. Moroz, and S. Roberts. Nonlinear, biophysically-informed speech pathology detection. In Acoustics, Speech and Signal Processing (ICASSP 2006) Proceedings, volume 2, page II. 2006.

[19] Fing, the ultimate network toolkit. http://www.overlooksoft.com/fing.

[20] S. Staniford, J. A. Hoagland, and J. M. McAlerney. Practical automated detection of stealthy portscans. Journal of Computer Security, 10(1/2):105-136, 2002.

[21] Thomas Kailath. The divergence and Bhattacharyya distance measures in signal selection. IEEE Transactions on Communication Technology, 15(1):52-60, 1967.

[22] Thomas Kailath. The divergence and Bhattacharyya distance measures in signal selection. IEEE Transactions on Communication Technology, 15(1):52-60, 1967.

[23] J. MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, page 14. California, USA, 1967.

[24] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1-38, 1977.

[25] jNetPcap. Sly Technologies. http://jnetpcap.com/userguide.

[26] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10-18, 2009.

[27] k-Means Clustering. http://www.mathworks.com/help/stats/k-means-clustering.html.

[28] Sushil Jajodia. Cyber Situational Awareness: Issues and Research. 2012.

[29] Maurice George Kendall. Rank Correlation Methods. 1948.

[30] Jerome H. Friedman and Lawrence C. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, pages 697-717, 1979.

[31] M. Bailey, E. Cooke, F. Jahanian, J. Nazario, D. Watson, et al. The Internet Motion Sensor: A distributed blackhole monitoring system. In Proceedings of the 12th ISOC Symposium on Network and Distributed Systems Security (SNDSS), pages 167-179, 2005.

[32] V. Yegneswaran et al. On the design and use of internet sinks for network abuse monitoring. In Recent Advances in Intrusion Detection. 2004.

[33] Eric Wustrow, Manish Karir, Michael Bailey, Farnam Jahanian, and Geoff Huston. Internet background radiation revisited. In Proceedings of the 10th Annual Conference on Internet Measurement, pages 62-74. ACM, 2010.

[34] Jayanthkumar Kannan, Jaeyeon Jung, Vern Paxson, and Can Emre Koksal. Semi-automated discovery of application session structure. In Proceedings of the 6th ACM SIGCOMM Conference on Internet Measurement, pages 119-132. ACM, 2006.

[35] Yi Cao. Bhattacharyya Distance Measure for Pattern Recognition. http://tinyurl.com/bveualz.

[36] Simone Fatichi. Mann-Kendall Test. http://tinyurl.com/cstvpwa.

[37] MathWorks. Run test for randomness. http://tinyurl.com/d6gtykz.

[38] David Moore, Colleen Shannon, Douglas J. Brown, Geoffrey M. Voelker, and Stefan Savage. Inferring internet denial-of-service activity. ACM Transactions on Computer Systems (TOCS), 24(2):115-139, 2006.

[39] Daniel Roelker, Marc Norton, and Jeremy Hewlett. sfPortscan, 2004. http://projects.cs.luc.edu/comp412/dredd/docs/software/readmes/sfportscan.

[40] Snort. Available at: http://www.snort.org.

[41] Valery Guralnik and Jaideep Srivastava. Event detection from time series data. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 33-42. ACM, 1999.

[42] Ryan Prescott Adams and David J. C. MacKay. Bayesian online changepoint detection. Cambridge, UK, 2007.

[43] Darren Bounds. Packit: packet analysis and injection tool. http://linux.die.net/man/8/packit.

[44] Vinod Yegneswaran, Paul Barford, and Vern Paxson. Using honeynets for internet situational awareness. In Proceedings of the Fourth Workshop on Hot Topics in Networks (HotNets IV), pages 17-22, 2005.

[45] W. Zhang, S. Teng, and X. Fu. Scan attack detection based on distributed cooperative model. In Computer Supported Cooperative Work in Design (CSCWD 2008), 12th International Conference on, pages 743-748. IEEE, 2008.

[46] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita. AOCD: An adaptive outlier based coordinated scan detection approach. International Journal of Network Security, 14(6):339-351, 2012.

[47] R. Baldoni, G. Di Luna, and L. Querzoni. Collaborative Detection of Coordinated Port Scans. Technical report, 2012. http://www.dis.uniroma1.it/~midlab; last accessed: 20/2/2013.