
[IEEE 2010 2nd Computer Science and Electronic Engineering Conference (CEEC), Colchester, United Kingdom, 8-9 September 2010]




Optimised Clustering Method for Reducing Challenges of Network Forensics

Joshua Ojo Nehinbe

School of Computer Science and Electronic Engineering Systems University of Essex, Colchester, UK

[email protected]

Abstract- Network forensics is challenging because of the large quantities of low level alerts that network intrusion detectors generate in order to achieve high detection rates. However, clustering analyses are insufficient to establish the overall patterns, sequential dependencies and precise classifications of the attacks embedded in low level alerts. This is because there are several ways to cluster a set of alerts, especially if the alerts contain clustering criteria that have several values. Consequently, it is difficult to promptly select an appropriate clustering technique for investigating computer attacks and to concurrently handle the tradeoffs between the interpretation and the clustering of low level alerts effectively. Accordingly, alerts, attacks and the corresponding countermeasures are frequently mismatched. Hence, several realistic attacks easily circumvent early detection. Therefore, in this paper, intrusive alerts were clustered and the quality of each cluster was evaluated. The results demonstrate how a measure of entropy can be used to establish a suitable clustering technique for investigating computer attacks.

Keywords: Intrusion detectors; entropy; clustering criteria; good clusters; bad clusters.

I. INTRODUCTION

There has been growing interest in network forensics in recent times because computer activities generate normal, intrusive or suspicious packets that are difficult to accurately anticipate. Hence, network detectors are installed as a second line of defence to monitor and report suspicious packets on computer networks. Nonetheless, the volume of packets that a detector can sniff essentially depends on the hardware of the host machine, the location of the detector on the network and the capacity of the detector [9, 11]. The location of the detector is usually influenced by the purpose of the toolkit and the policy of the organization.

There are seven critical issues associated with network forensics [11]. Firstly, an Intrusion Detection System (IDS) receives tons of packets per second in spanning mode when compared with the operation of an IDS in port mirroring mode.

Secondly, every signature-based detector has a catalogue of detection rules that signify known and anticipated attacks [9, 11]. Therefore, the patterns in each incoming packet have to be promptly matched against each of the signatures at the speed of each packet. Conventionally, each detection rule encodes only one pattern of attack, and a match or a close match is flagged as a potential intrusion [10, 11]. Thus, the detector raises a low level alert for a match and concurrently logs the event for further analysis to complete the intrusion detection phase. Thereafter, the detector selects another packet and the above processes are repeated while the detector is in operation, until there are no more packets to be sniffed.

Thirdly, there are fundamental challenges associated with low level alerts that inhibit prompt intrusion review and countermeasures. For instance, a network intrusion detector generates numerous duplicated and unordered low level alerts in its default operation [1, 3]. These have been longstanding challenges that confront the continuous usage of the detector and the analysts who are designated to review and take suitable actions on the audit logs generated by the detector [9-11].

Fourthly, network forensics shows that an alert usually contains several attributes of an attack. Some of these attributes are the source and destination addresses of the offending packet, protocol, type of service, time to live, IP flag and IP length of the datagram. In addition, different attacks tend to have different attributes with several values. Consequently, regrouping low level alerts promptly enough to forestall attacks in progress is complex [9].

Another central problem in intrusion analysis is how to make meaning from massive numbers of alerts without underestimating the proportions of the attacks that have been logged. Thus, clustering has been suggested as a data mining method for grouping related alerts and making meaning from them [8, 12, 15]. Nevertheless, clustering has three critical weaknesses [8].

978-1-4244-9030-1/10/$26.00 ©2010 Crown


Firstly, there are several ways to cluster a set of low level alerts. Therefore, clusters can be formed accidentally by matching dissimilar attributes or attacks together. Secondly, deciding which attributes will produce the best clusters among overwhelming numbers of alerts is another challenge of clustering analyses [8]. Thirdly, alerts cannot ordinarily be previewed by human experts to determine the distributions of the attacks on the systems without carrying out in-depth data mining processes.

Furthermore, network forensics shows that low level alerts can signify categories of attacks with a common attribute, such as a repeated protocol, type of service and so on. For instance, experience shows that flood attacks like ping of death attacks and ping attacks can adopt a similar protocol and ICMP message code. Similarly, Distributed Denial of Service (DDoS) attacks that originate from several sources against a destination machine will inevitably generate a cluster of attacks if the attacks are clustered on destination address. Consequently, network forensics experts frequently underestimate network security violations despite the massive numbers of alerts that network detectors frequently generate.

Experience shows that the criticality of the dangers of underestimated network security violations depends on the organisation and the configuration of the networks. Unfortunately, appropriate preventive measures will not be implemented on time and the damage incurred will not be immediately ascertained in an underestimated audit review. Consequently, this paper shows how a measure of entropy can help a user to decide an appropriate clustering technique for network forensics in order to lessen the aforementioned challenges.

One of the notable contributions of this paper is our ability to demonstrate how a measure of entropy can be used to select an appropriate clustering technique for reviewing audit logs without compromising the original information necessary to further investigate the original alerts. Secondly, we have been able to propose an unsupervised machine learning model that works with both aged and recent datasets. The approach reported in this paper has helped network forensics experts to effectively evaluate the degree of disorderliness of alerts and the corresponding attacks.

The remainder of this paper is as follows. Section 2 discusses closely related works. Section 3 discusses the clustering technique, how to measure the quality of a clustering analysis using entropy, and an overview of the evaluative datasets. Section 4 discusses the experiments performed and gives an overview of the results obtained, while section 5 gives the discussion of results and section 6 summarises the paper and areas for future research.

II. RELATED WORKS

Entropy, as proposed by Rudolf Clausius in the 1860s, was initially meant to account for the energy lost in the conversion of energy to work and vice versa [13]. Nowadays, the entropy paradigm is used in many domains. Claude Shannon in 1948, cited in [6], adopted entropy to explain the discrepancy in encoding and decoding a message from a sender to a receiver. In transportation, for instance, [13] used entropy to optimally analyse transportation planning, while [14] applied entropy to recognise image patterns.

Similarly, in machine learning, [7] proposed interval-based recursive entropy to improve the performance of learning algorithms. The algorithm worked in two layers. The first layer kept statistics of the discrete data that were input to the learning algorithms. Thereafter, the data were formatted into equal widths as required by the second layer. Subsequently, the data were processed to conform to the format accepted by the complex recursive algorithms. The authors reported that the model increased the speed of the learning algorithms because the algorithms were applied directly to the high level dataset. However, the dynamism of the input and output requirements of learning algorithms often restricts the broad applicability of this method.

In [2], an entropy optimisation method was applied to model synthetic, iris, wine and heart disease datasets. Multiple agents with separate target missions and a set of thresholds were incorporated to improve feature selection and the detection of the distributions of the datasets. Similarly, an incremental predictive algorithm that only analysed the internal relationships of a dataset to correct the structure of the original data sets was proposed by [6]. Nevertheless, the efficacy of this model is limited by attributes that are external to the algorithm.

To the best of our knowledge, existing models are substantially different from the model proposed in this paper.

III. PARTITIONING CLUSTERING

Each alert from a network intrusion detection system has distinguishing attributes such as IP addresses, type of service (TOS) and time to live (TTL). An attribute can be homogeneous, with a single value across the entire dataset (figure (1) below), or heterogeneous, with multiple values (figure (2)). The patterns of attacks are synonymous with the randomness of the alerts. Hence, a method that determines the randomness of low level alerts indirectly measures the corresponding patterns or distributions of attacks in the dataset.

One of the simplest ways to determine the patterns of attacks is to split the original dataset into two categories that will give minimum entropy using the attributes of the dataset.

Fig 1. Homogeneous attribute

This process is called partitioning clustering. In other words, partitioning clustering is the process that divides a dataset into succinct, disjoint groups. Each group or cluster has at least one attribute that distinguishes its members from members of other clusters, so that a cluster will not be a subset of another cluster [12]. Essentially, alerts that constitute a good cluster are more similar to other members of the same cluster and dissimilar to members of other clusters, while a bad cluster contains mismatched members that originate from different clusters [8].

Fig 2. Heterogeneous attribute

The partitioning clustering technique is widely used in biology, clinical studies, pattern recognition, the financial sector and so on, to understand data and to properly classify or represent data in meaningful groups [8]. So, this method is adopted in this paper to achieve data compression by eliminating redundancies from the original data [12].
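As a minimal sketch of this idea (hypothetical alert records in Python; the paper's own classifier was written in C++), partitioning on one clustering criterion amounts to grouping alerts by the value of that attribute:

```python
from collections import defaultdict

def partition(alerts, criterion):
    """Group alerts into clusters keyed by one clustering criterion.

    Alerts sharing the same value of the chosen attribute fall into the
    same cluster, so no cluster is a subset of another.
    """
    clusters = defaultdict(list)
    for alert in alerts:
        clusters[alert[criterion]].append(alert)
    return dict(clusters)

# Hypothetical low level alerts described only by their TTL attribute
alerts = [
    {"id": 1, "TTL": 64},
    {"id": 2, "TTL": 64},
    {"id": 3, "TTL": 128},
]
clusters = partition(alerts, "TTL")  # two clusters: TTL=64 and TTL=128
```

Clustering the same alerts on a different criterion (say, source address) can yield a completely different grouping, which is exactly the selection problem that motivates measuring cluster quality.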

A. CLUSTERING CRITERIA

A set of low level alerts can be grouped into different schemes using its attributes, which are otherwise known as clustering criteria. Several clustering criteria were investigated in the course of this research. However, with respect to the aforementioned problems, we report the evaluations of (SI), the source IP address of an attack; (DI), the destination address of the machine that was attacked; (TOS), the type of service or order of precedence of each intrusive datagram; (TTL), the time to live that measured the hop count or lifespan of each intrusive datagram; (IPP), the IP protocol that transmitted each intrusive datagram; (IPF), the IP flags of each intrusive datagram; and (IPL), the length of each intrusive datagram [16].

Port numbers were not reported because most of the attacks that were investigated did not use port numbers. Similarly, timestamp was not reported because it generated numerous redundancies. In this paper, duplicate alerts were repeated alerts that share the same value of at least one clustering criterion.
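For concreteness, the seven criteria could be carried as fields of an alert record (a sketch with hypothetical field values; the abbreviations mirror those above):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alert:
    """One low level alert carrying the seven clustering criteria."""
    SI: str   # source IP address of the attack
    DI: str   # destination IP address of the attacked machine
    TOS: int  # type of service / order of precedence of the datagram
    TTL: int  # time to live (hop count / lifespan of the datagram)
    IPP: str  # IP protocol that transmitted the datagram
    IPF: int  # IP flags of the datagram
    IPL: int  # IP length of the datagram

# Two alerts are duplicates under a criterion when that field matches
a = Alert("10.0.0.5", "10.0.0.9", 0, 64, "icmp", 0, 84)
b = Alert("10.0.0.6", "10.0.0.9", 0, 64, "icmp", 0, 84)
assert a.DI == b.DI  # duplicates when clustering on destination address
```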

B. QUALITY OF CLUSTERS

Claude Shannon introduced entropy in 1948 to explain information theory [15]. He suggested that entropy is the expected number of information bits essential for the transmission of data from a sender to a receiver. Since then, entropy has generated wide interest for deciding an appropriate clustering technique for a particular case. For instance, entropy in data mining can be used to measure the quality of a clustering scheme in two ways: to establish the disorderliness of a dataset, and to determine the additional information or gain that is necessary to classify dependent variables [8]. Thus, in this paper, entropy has been used to establish the distributive quality of each clustering scheme and to show that a measure of entropy can help decide an appropriate clustering technique for investigating computer attacks. Suppose a dataset has a set of clustering criteria, and let a clustering criterion (A) have V1, V2, V3, ..., Vn possible outcomes or values, with respective probabilities of occurrence P1, P2, P3, ..., Pn. Then the minimum amount of information necessary to evaluate the quality of each clustering criterion is called entropy [6, 7, 13, 14]. In general, entropy ξ(A) in bits is expressed as

ξ(A) = − Σᵢ Pᵢ log₂ Pᵢ, for i = 1, 2, ..., n    (1)
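Equation (1) translates directly into code. The sketch below (Python, not the authors' implementation) computes ξ(A) from the observed values of one clustering criterion:

```python
import math
from collections import Counter

def entropy(values):
    """Entropy (in bits) of one clustering criterion A, per equation (1).

    `values` holds the observed value of attribute A in each alert;
    the probabilities P_i are estimated from the value frequencies.
    """
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

# A homogeneous attribute carries no information; a uniform spread over
# k distinct values attains the maximum of log2(k) bits.
assert entropy([64, 64, 64, 64]) == 0.0
assert entropy([64, 65, 66, 67]) == 2.0
```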

In practice, high entropy indicates that the clustering criterion is less predictable: its values are spread almost evenly across the dataset. Conversely, low entropy indicates an attribute that is more predictable among the clustering criteria, because its values are concentrated on a few outcomes.

C. EVALUATIVE DATASETS

We report five datasets that were used to evaluate the efficacy of the algorithm proposed in this paper. TEST-DATA, labeled as (1), was extracted from simulated networks. Linux and Windows machines on the networks were connected together with the aid of an intelligent hub. A series of attacks that included ping of death, operating system detection and version detection attacks was launched from two sources against three destination machines, and the traces of the attacks were recorded using a string length of 55535.

[Figs. 1 and 2 above depict the TTL attribute: Fig. 1 a homogeneous attribute with the single value TTL1 (Val1); Fig. 2 a heterogeneous attribute with values TTL1-Val1, TTL2-Val2, ..., TTLn-Valn.]

Furthermore, UNI-DATA was the second dataset used to evaluate our algorithm. It was labeled as (2) for simple reference. The dataset was extracted from realistic networks and its traces were similarly recorded using a string length of 55535.

We also used DARPA-1, labeled as (3), and DARPA-2, labeled as (4). Both datasets were internet traces that signified Distributed Denial of Service attacks launched by novice and experienced attackers respectively [5]. The datasets represented the 2000 DARPA Intrusion Detection Scenario-Specific dataset and were extracted from the MIT Lincoln Laboratory repository.

DEFCON-10 was another standard dataset that was used. It was labeled as (5) and its trace file was extracted from the repository that is maintained by the Shmoo Group [4].

IV. EXPERIMENTS AND RESULTS

Each of the evaluative datasets was sniffed with Snort in intrusion detection mode with default settings. Thereafter, a partition classifier was applied to reinvent the alerts generated by the detector.

Fig 3. Architecture of Partition classifier

The classifier used the partitioning clustering method and entropy as the underpinning classification paradigm to interpret low level alerts and to demonstrate how a measure of entropy can help a network forensic expert promptly decide an appropriate clustering technique for investigating computer attacks.

The architectural design of this model is shown in figure (3) above and it was implemented in the C++ programming language. The input to the classifier and the quality analyzer was the series of alerts of each dataset generated by the sensor (Snort), while the outputs were corresponding high level alerts that were partitioned into Output-A and Output-B respectively. The seven clustering criteria discussed above were built into the classifier and the quality analyzer. The partitioning classifier and the quality analyzer independently received low level alerts from the sensor (Snort). The alerts were counted and pre-processed.

Thus, for each clustering criterion and dataset, the classifier partitioned the dataset into duplicate and unique alerts. The maximum number of clusters that each clustering criterion could generate was determined. In essence, all duplicate alerts were clustered together to form one alert, while unique alerts were clustered into their respective clusters. Concurrently, the entropy of each cluster and clustering criterion was determined based on equation (1) above, and the respective results were converted into high level alerts in two different output formats. The results obtained in all the experiments are shown in the subsequent section.
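Assuming "duplicate" means alerts that repeat a value of the chosen criterion (as defined in section III.A), the partition step described above might be sketched as:

```python
from collections import Counter

def split_alerts(alerts, criterion):
    """Split alerts into duplicate and unique sets on one clustering
    criterion, then count the resulting high level alerts: one per
    duplicated value plus one per unique alert."""
    counts = Counter(a[criterion] for a in alerts)
    duplicates = [a for a in alerts if counts[a[criterion]] > 1]
    unique = [a for a in alerts if counts[a[criterion]] == 1]
    high_level = len({a[criterion] for a in duplicates}) + len(unique)
    return duplicates, unique, high_level

# Hypothetical alerts clustered on IP protocol (IPP)
alerts = [{"IPP": "icmp"}, {"IPP": "icmp"}, {"IPP": "tcp"}]
dups, uniq, hl = split_alerts(alerts, "IPP")  # 2 duplicates, 1 unique
```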

A. RESULTS

The results obtained before the alerts were reclassified are presented in figure (4) below.

[Fig 4. Original cluster per dataset: raw alerts before clustering were 70,475 for TEST-DATA; 56 for UNI-DATA; 834 for DARPA-1; 816 for DARPA-2; and 5,372 for DEFCON-10.]

The outputs of the classifier before the quality analyzer was applied are presented in figure (5), while the results obtained after the quality analyzer and the classifier were applied to the input data are respectively given in tables (I) and (II) below. Table (I) gives the entropy per dataset when the clustering criteria were SI, DI and IPP, while table (II) gives the measure of entropy in terms of TOS, TTL, IPF and IPL.

[Fig 3 depicts the architecture: the sensor (low level alerts) feeds the classifier and the quality analyzer, which respectively produce Output-A and Output-B (high level alerts).]


[Fig 5. Cluster per dataset: quantity of clusters per clustering criterion (SI, DI, IPP, TOS, TTL, IPF, IPL) for TEST-DATA, UNI-DATA, DARPA-1, DARPA-2 and DEFCON-10; notably 265 and 408 source-address clusters for DARPA-1 and DARPA-2 respectively, and 547 IPL clusters for DEFCON-10.]

Although the unit of entropy is bits, the results obtained do not denote binary digits. Instead, they denote the degree of disorderliness in each dataset.

TABLE I
Entropy per clustering criteria-1

Data  SI /bits  DI /bits  IPP /bits
1     0.0007    0.0007    0.2840
2     4.3230    1.6990    0.8730
3     8.0230    0.0000    0.0000
4     8.6720    0.0000    0.0000
5     2.5400    2.6130    0.4760

TABLE II
Entropy per clustering criteria-2

Data  TOS /bits  TTL /bits  IPF /bits  IPL /bits
1     0.0002     0.5080     0.0000     0.2880
2     0.3710     1.5690     0.0000     2.6130
3     0.0000     0.0000     0.0000     0.0000
4     0.0000     0.0000     0.0000     0.0000
5     0.5500     1.9380     0.0000     5.9260

B. DISCUSSIONS OF RESULTS

The results obtained above showed that entropy can be used to reinvent intrusive alerts. Comparison of the datasets in figure (4) showed that TEST-DATA has 70,475 low level alerts, while UNI-DATA has 56. In addition, DARPA-1 has 834 low level alerts, DARPA-2 has 816 and DEFCON-10 has 5,372.

The results of our model, shown in figure (5) above, reclassified TEST-DATA into 2 sources of attacks against 3 destination machines. Also, 3 different IP protocols were used to launch the attacks, while the attacks had only 2 values of type of service. The results also showed 25 different values of time to live and 7 different IP lengths, and that the attacks were launched with a homogeneous IP flag.

Similarly, attacks in UNI-DATA were launched from 26 source addresses against 5 destinations, using 5 IP protocols, 2 different values of type of service, 12 different values of TTL, one value of IP flag and 11 different IP lengths. Furthermore, DARPA-1 and DARPA-2 were respectively clustered into 265 and 408 sources against 1 destination address. The results further showed that the attacks repeatedly used each of the other five clustering criteria. Figure (5) also indicated that DEFCON-10 was clustered into 21 sources and 27 target addresses. The dataset mainly contained attacks from 3 different IP protocols, 2 values of type of service, 7 different values of TTL, 1 IP flag and 547 different lengths of suspicious datagrams.

Table (I) and table (II) above show the message bits after the alerts of each dataset were reinvented. The entropy of each clustering criterion per dataset similarly corroborated the results obtained in figure (5) above. For example, the entropy of each dataset was zero for clustering criteria, such as IPF, that have a single value; this implied that all the alerts were grouped into only 1 cluster. Meanwhile, the extremely high values of entropy shown by DARPA-1 and DARPA-2 with source address as the clustering criterion implied that the attacks were repeatedly launched from varied sources and that the patterns formed uniform distributions. The results also implied that about 8 and 9 bits of information were respectively needed to describe the sources of the attacks in the two datasets.
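These magnitudes are consistent with near-uniform source distributions: for k equally likely values, entropy attains its maximum of log2(k), and log2(265) ≈ 8.05 and log2(408) ≈ 8.67 bits sit close to the 8.023 and 8.672 bits measured for DARPA-1 and DARPA-2 in table (I). A quick arithmetic check:

```python
import math

# Maximum possible entropy for k equally likely source addresses is log2(k)
print(round(math.log2(265), 3))  # 8.05  (DARPA-1: 265 sources, measured 8.023 bits)
print(round(math.log2(408), 3))  # 8.672 (DARPA-2: 408 sources, measured 8.672 bits)
```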

Above all, the results have demonstrated that a clustering technique that employs source address to investigate DDoS attacks was generally "good" when compared with other clustering criteria. Conversely, the extremely low entropy indicated by TEST-DATA for source and destination addresses implied a chaotic distribution. A chaotic distribution is a pattern of attacks that occurs at an unpredictable rate and is often confusing. Hence, such attacks can better be studied by clustering them on the basis of TTL, which has the highest information bits. Furthermore, table (I) showed that UNI-DATA has an entropy of 1.699 bits when the clustering criterion was destination address, and table (II) showed 1.569 bits when the clustering criterion was TTL. There are several explanations for these results. They implied that the clusters in each case had nearly equal distributions or degrees of disorderliness. Secondly, the entropy values implied that there were about 2 bits of information in each intrusive packet. The results further demonstrated that the clustering technique that employed source address was generally "good" while IPP was "bad" for forensic investigation of UNI-DATA. Similarly, the entropy of the DEFCON-10 dataset with TTL as the clustering criterion showed about 2 bits of information in each of the intrusive packets.


V. CONCLUSION

Network forensics is challenging because every network intrusion detector conventionally generates tons of low level alerts in its default mode in order to detect numerous attacks. Fundamentally, each alert has clustering criteria that describe each attack. The advantage of these attributes is that they help to further investigate each attack. However, low level alerts are characterized by hidden patterns and redundancies. Hence, clustering techniques are often used to succinctly generate high level alerts.

Nevertheless, it is not sufficient to succinctly reinvent alerts, because there are multiple ways to cluster an intrusive dataset. Hence, a poor clustering method usually generates low quality clusters. Further, the efficacy of each clustering criterion and the quality of each cluster need to be quantified to isolate good clusters from bad clusters. Unfortunately, most existing models are flawed in this respect. Therefore, this paper implemented entropy to determine the quality and disorderliness of clusters of intrusive alerts. Specifically, we have demonstrated how a measure of entropy can help decide an appropriate clustering technique for network forensics.

The results showed that lower entropy was an indication of a bad clustering method. The results obtained also indicated the distributions and characteristics of each class and of the attacks per clustering criterion. Higher entropy was recorded by Distributed Denial of Service (DDoS) attacks than by the other categories of attacks that were explored. This was because DDoS attacks were bundled together within very short time intervals. Besides, clustering criteria that generated zero entropy were indications of poor clusters and underestimations of security violations. Also, interpretations of the results showed distinctions between good and bad clusters per dataset, and the best clustering criteria were established. However, this paper has not investigated other methods, such as information gain, Gini index ratio, classification error and category utility, that can also be used to discriminate clustering criteria and the quality of clustering algorithms. Therefore, one of the potential areas of future research that we are currently exploring is the application of these methods to evaluate our algorithms.

ACKNOWLEDGMENT

Special thanks to my supervisor, Dr. Paul Scott, for the feedback I constantly receive from him on my research. His lecture notes and comments have been tremendously useful in writing all my publications.

REFERENCES
[1] A. Lazarevic, J. Srivastava and V. Kumar, "Intrusion detection: A survey", Computer Science Department, University of Minnesota, 2005.
[2] A. Okafor, "Entropy Based Techniques with Applications in Data Mining", PhD thesis, University of Florida, USA, 2005.
[3] B. Morin, L. Me, H. Debar and M. Ducass, "M2D2: A formal data model for IDS alerts correlation", in Recent Advances in Intrusion Detection (RAID 2002), Lecture Notes in Computer Science, vol. 2516, Springer-Verlag, 2002, pp. 115-137.
[4] CTFC (Capture the flag contest) defcon datasets, http://ccstf.shmoo.com/data/, 2009. Accessed 25 April 2009.
[5] 2000 DARPA Intrusion Detection Scenario Specific Datasets, http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/data/2000data.html, 2009. Accessed 25 April 2009.
[6] D. Wan, X. Ren and Y. Hu, "Data Mining Algorithmic Research and Application Based on Information Entropy", International Conference on Computer Science and Software Engineering, 2008.
[7] J. Gama and C. Pinto, "Discretization from Data Streams: Applications to Histograms and Data Mining", Proceedings of the 2006 ACM Symposium on Applied Computing, pp. 662-667, 2006.
[8] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", 2nd edition, Morgan Kaufmann, US.
[9] K. Scarfone and P. Mell, "Guide to Intrusion Detection and Prevention Systems (IDPS)", Recommendations of the National Institute of Standards and Technology, Special Publication 800-94, Technology Administration, Department of Commerce, USA, 2007.
[10] R. Rehman, "Intrusion Detection Systems with Snort: Advanced IDS Techniques Using Snort, Apache, MySQL, PHP and ACID", Prentice Hall PTR, Upper Saddle River, New Jersey, 2003.
[11] R. Alder, A.R. Baker, E.F. Carter, J. Esler, J.C. Foster, M. Jonkman, C. Keefer, R. Marty and E.S. Seagren, "Snort: IDS and IPS Toolkit", Syngress Publishing, Burlington, Canada, 2007.
[12] P. Tan, M. Steinbach and V. Kumar, "Introduction to Data Mining", Pearson International Edition, NY, 2006.
[13] S. Fang, J.R. Rajasekera and H.J. Tsao, "Entropy Optimization and Mathematical Programming", Kluwer Academic Publishers, Norwell, 1997.
[14] S.F. Gull, J. Skilling and J.A. Roberts (ed.), "The Entropy of an Image: Indirect Imaging", Cambridge University Press, UK, pp. 267-279, 1984.
[15] M. Mitra and T. Acharya, "Data Mining: Multimedia, Soft Computing and Bioinformatics", John Wiley and Sons, NJ, 2003.
[16] W.A. Shay, "Understanding Communications and Networks", 3rd edition, Brooks/Cole, Belmont, CA, 2004.