
International Journal of Computational Intelligence and Information Security, June 2013 Vol. 4, No. 6

ISSN: 1837-7823


Evaluation of Rule based Machine Learning Algorithms on NSL-KDD Dataset

Tejvir Kaur¹ and Sanmeet Kaur²
¹M.Tech Student, ²Asst. Professor
Thapar University, Patiala - 147001, India

Abstract

Intrusion detection can be defined as the act of detecting actions that attempt to compromise the confidentiality, integrity or availability of a resource. There are many approaches to intrusion detection, such as anomaly based, signature based and machine learning based approaches. The machine learning approach can prove to be very useful for developing intrusion detection systems. This paper presents a comparison of various rule based machine learning algorithms, all of which fall under supervised learning. The NSL-KDD dataset [7] and the Waikato Environment for Knowledge Analysis (WEKA) [3] are used to evaluate the performance of the machine learning algorithms.

Keywords: Intrusion detection, NSL-KDD dataset, WEKA

1. Introduction

The purpose of network security is to protect the network from unauthorized access, destruction and disclosure. Many techniques have emerged in the field of network security that help in the protection of computer systems and computer networks. One of the techniques used for making the network secure and detecting intrusions is the Intrusion Detection System (IDS), a mechanism that detects unauthorized and malicious activity in computer systems. There are mainly two approaches to intrusion detection: signature detection and anomaly detection. Machine learning techniques have also been applied to intrusion detection in many ways. This paper presents the evaluation and results of applying rule based machine learning algorithms to the NSL-KDD dataset [12].

2. Review of Literature

An Intrusion Detection System helps information systems to deal with attacks. An IDS gathers and analyzes information from various areas within a computer or a network to identify intrusions, which include attacks from outside the organization as well as attacks from within it. There are mainly two approaches to intrusion detection: signature based detection and anomaly based detection. The signature based approach looks for the signatures of known attacks, which exploit weaknesses in system and application software [5]. It uses pattern matching techniques against a frequently updated database of attack signatures. It is useful for detecting already known attacks but not new ones. Many attacks can be detected by this approach because many attacks have clear and distinct signatures.

An intrusion detection approach that looks at network traffic and detects data that is incorrect, invalid or generally abnormal is called anomaly based detection. This method is useful for detecting unwanted traffic that is not specifically known [5]. Various methods, such as machine learning and statistical methods, can be used in the anomaly detection approach to distinguish anomalous behavior from normal behavior.

Machine learning algorithms can be used in intrusion detection problems to find interesting intrusion patterns in data [10]. This requires the data to be in labelled form. For this purpose the NSL-KDD dataset, which has the required characteristics, is used, and the rule based learning algorithms are applied to it.

K. Shafi et al. [11] propose a methodology to create a fully labelled network intrusion detection dataset which is suitable for machine learning algorithms. The dataset is created using real background traffic and simulated attacks, and is tested on supervised machine learning algorithms in WEKA.

F. Gharibian and A. Ghorbani (2007) [2] compare supervised machine learning techniques. The algorithms used are Naive Bayes, Gaussian, Decision Tree and Random Forests. The ability of each technique to detect the attack categories in the KDD dataset is compared and, from the results, the most suitable technique for identifying each attack category is proposed.

According to M. Panda and M. Patra (2007) [9], the use of naive Bayes for anomaly based network intrusion detection produces better results in terms of false positive rate, cost and computational time when applied to the KDD99 data sets, as compared to a back propagation neural network based approach. The experimentation is done in WEKA.

J. Zhang et al. (2008) [13] focus on a framework that applies a data mining algorithm called random forests in misuse, anomaly and hybrid network based IDSs. In misuse detection, patterns of intrusions are built automatically by the random forests algorithm over training data. In anomaly detection, novel intrusions are detected by the outlier detection mechanism of the random forests algorithm.

G. Oreku and F. Mtenzi (2009) [8] present the use of data mining techniques to discover consistent and useful patterns of system features that describe program and user behavior. According to them, a useful set of relevant system features can recognize anomalies and known intrusions.

D. Zhao et al. (2010) [14] propose a hybrid IDS which combines network and host IDS, with anomaly and misuse detection modes. Data mining programs are applied to learn rules that can capture the behavior of intrusions and normal activities.

K. Qazanfari et al. (2012) [10] have proposed an intrusion detection system which uses the Support Vector Machine (SVM) and Multi Layer Perceptron (MLP) machine learning algorithms to distinguish normal from abnormal behavior.

S. M. Hussein et al. (2012) [4] discuss an anomaly detection engine based on the Naive Bayes algorithm, the J48graft Decision Tree algorithm and the Bayes Net algorithm in WEKA.

3. Methodology

This section presents the methodology used to carry out the work. The workflow is described in Figure 1. The main aim is to analyze the performance of the rule based classifiers available in WEKA. For this purpose, the NSL-KDD dataset is supplied to each rule based classifier and the outputs of the classifiers are compared with one another. The measures used for comparison are accuracy, false alarm rate and the number of instances that are correctly and incorrectly classified. The best-performing rule based classifier is identified by analyzing the results against these metrics.



Figure 1: Design of Research

4. Experimentation

4.1 Description of dataset

The KDD data set is built from the data captured in the DARPA98 IDS evaluation program. DARPA98 comprises about 4 gigabytes of tcpdump data covering 7 weeks of network traffic, which can be processed into about 5 million connection records. The two weeks of test data contain around 2 million connection records. The KDD training dataset consists of approximately 4,900,000 single connection vectors, each of which contains 41 features and is labelled as either normal or an attack. Analysis has shown that there are two important issues in the data set which affect the performance of evaluated systems and result in a very poor evaluation of anomaly detection approaches. To solve these issues, a new dataset called NSL-KDD [7] has been proposed, which consists of selected records of the complete KDD data set [12]. The NSL-KDD data set has the following advantages over the original KDD data set.

1) It does not include redundant records in the train set.
2) There are no duplicate records in the proposed test set.
3) The number of records in the train and test sets is reasonable, which makes it affordable to run the experiments on the complete set without the need to randomly select a small portion [7].



For the purpose of experimentation, KDDTest-21.arff, a subset of the KDDTest+.arff file, is used. It contains 11850 instances [7]. The attribute names and types are listed in Table 1.

Table 1: Attribute name and type in KDDTest-21.arff file

Attribute Name      Attribute Type  Attribute Name               Attribute Type
duration            real            is_guest_login               nominal
protocol_type       nominal         count                        real
service             nominal         srv_count                    real
flag                nominal         serror_rate                  real
src_bytes           real            srv_serror_rate              real
dst_bytes           real            rerror_rate                  real
land                nominal         srv_rerror_rate              real
wrong_fragment      real            same_srv_rate                real
urgent              real            diff_srv_rate                real
hot                 real            srv_diff_host_rate           real
num_failed_logins   real            dst_host_count               real
logged_in           nominal         dst_host_srv_count           real
num_compromised     real            dst_host_same_srv_rate       real
root_shell          real            dst_host_diff_srv_rate       real
su_attempted        real            dst_host_same_src_port_rate  real
num_root            real            dst_host_srv_diff_host_rate  real
num_file_creations  real            dst_host_serror_rate         real
num_shells          real            dst_host_srv_serror_rate     real
num_access_files    real            dst_host_rerror_rate         real
num_outbound_cmds   real            dst_host_srv_rerror_rate     real
is_host_login       nominal         class                        nominal
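
As a rough illustration of how attribute names and types such as those in Table 1 are declared in an ARFF file, the following Python sketch reads them from an ARFF header. It is only a sketch under stated assumptions: `SAMPLE_HEADER` is a hypothetical, shortened excerpt rather than the real KDDTest-21.arff header, the function name `parse_arff_attributes` is invented for the example, and attribute names containing spaces are not handled.

```python
# Hypothetical, shortened ARFF header used only for illustration;
# it is not copied from the real KDDTest-21.arff file.
SAMPLE_HEADER = """\
@relation 'KDDTest-21'
@attribute 'duration' real
@attribute 'protocol_type' {'tcp','udp','icmp'}
@attribute 'src_bytes' real
@attribute 'class' {'normal','anomaly'}
@data
"""


def parse_arff_attributes(text):
    """Return a list of (name, kind) pairs, where kind is 'real' or 'nominal'."""
    attributes = []
    for line in text.splitlines():
        line = line.strip()
        if line.lower().startswith("@data"):
            break  # the header ends where the data section begins
        if not line.lower().startswith("@attribute"):
            continue
        # Split into: keyword, attribute name, type declaration.
        # Assumption: attribute names contain no spaces.
        _, name, decl = line.split(None, 2)
        # A declaration in braces lists the values of a nominal attribute.
        kind = "nominal" if decl.startswith("{") else decl.lower()
        attributes.append((name.strip("'\""), kind))
    return attributes


attrs = parse_arff_attributes(SAMPLE_HEADER)
for name, kind in attrs:
    print(f"{name:15s} {kind}")
```

Applied to the full file, the same loop would be expected to list the attributes of Table 1, with the class label as the last nominal attribute.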

4.2 Classifiers used

The following rule based classifiers from WEKA are evaluated on the NSL-KDD dataset.

1) ConjunctiveRule: This classifier implements a single conjunctive rule learner that can predict numeric and nominal class labels.
2) DecisionTable: This classifier builds and uses a simple decision table majority classifier.
3) DTNB: This classifier builds and uses a decision table/naive Bayes hybrid classifier.
4) JRip: This classifier implements a propositional rule learner, Repeated Incremental Pruning to Produce Error Reduction (RIPPER).



5) OneR: This classifier generates a set of rules that all test one particular attribute, which amounts to learning a one-level decision tree.
6) Part: This classifier generates a PART decision list.
7) NNge: This is a nearest-neighbour-like algorithm using non-nested generalized exemplars.
8) Ridor: This is an implementation of a Ripple-Down Rule learner.
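
To make the idea behind the simplest of these learners concrete, the following Python sketch shows a one-attribute rule learner in the spirit of OneR. It is not WEKA's implementation: it handles nominal attributes only (no discretization of numeric attributes), the function name `one_r` is an assumption, and the toy records are hypothetical values chosen for illustration.

```python
from collections import Counter, defaultdict


def one_r(rows, labels):
    """Return (attribute_index, rule, errors) for the best single-attribute rule.

    rows   -- list of tuples of nominal attribute values
    labels -- list of class labels, one per row
    """
    best = None
    for attr in range(len(rows[0])):
        # Count the class labels seen for every value of this attribute.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[attr]][label] += 1
        # Each attribute value predicts its majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Training errors made by this one-attribute rule set.
        errors = sum(rule[row[attr]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


# Hypothetical toy records: (protocol_type, flag)
rows = [
    ("tcp", "SF"), ("tcp", "S0"), ("udp", "SF"),
    ("tcp", "S0"), ("icmp", "SF"), ("tcp", "SF"),
]
labels = ["normal", "anomaly", "normal", "anomaly", "normal", "normal"]

attr, rule, errors = one_r(rows, labels)
print(attr, rule, errors)
```

On this toy data the second attribute separates the two classes without error, so the learner selects it; the resulting rule set is exactly a one-level decision tree on that attribute.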

4.3 Test Options in WEKA

The result of applying the chosen classifier is tested according to the selected test option. WEKA provides the following four test modes [3].

1. Use training set. The classifier is evaluated on how well it predicts the class of the instances it was trained on.
2. Supplied test set. The classifier is built using the training set and evaluated on how well it predicts the class of a set of instances loaded from a file.
3. Cross-validation. The classifier is evaluated by cross-validation, using the number of folds entered in the Folds text field. The dataset is randomly divided into k subsamples; k-1 subsamples are used as training data and one subsample as test data. This process is repeated k times, so that each subsample serves as test data exactly once.
4. Percentage split. The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field [3].
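
The cross-validation procedure of option 3 can be sketched in a few lines of Python. This is a minimal illustration rather than WEKA's own implementation: the function name `k_fold_indices` and the `seed` parameter are assumptions made for the example, and the sketch does not stratify the folds by class, which a real tool may do.

```python
import random


def k_fold_indices(n_instances, k, seed=1):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # random division into subsamples
    folds = [indices[i::k] for i in range(k)]  # k roughly equal subsamples
    for i in range(k):
        test = folds[i]  # one subsample is held out as test data
        # The remaining k-1 subsamples form the training data.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 11850 instances and 10 folds, as in the experiment described in this paper.
splits = list(k_fold_indices(n_instances=11850, k=10))
for fold_no, (train, test) in enumerate(splits, start=1):
    print(fold_no, len(train), len(test))
```

Each instance appears in exactly one test fold, so every instance is used for testing once and for training k-1 times.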

4.4 Results and Discussion

The measures used for evaluating the performance are as follows.

1) Correctly classified instances and incorrectly classified instances: the number and percentage of instances that are correctly and incorrectly classified.
2) Overall accuracy: the percentage of correctly classified instances.
3) False alarm rate: the proportion of examples which were classified as class x but belong to a different class, among all examples which are not of class x.
4) Class wise accuracy: the proportion of examples which were classified as class x, among all examples which truly have class x [3].
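
For a two-class problem these measures can be written directly in terms of a 2x2 confusion matrix. The following Python sketch assumes that reading of the definitions above; the function name `binary_metrics` and the counts passed to it are hypothetical and are not taken from the experiments.

```python
def binary_metrics(tp, fn, fp, tn):
    """Measures for class x from a 2x2 confusion matrix.

    tp -- class-x instances classified as x
    fn -- class-x instances classified as not x
    fp -- other instances classified as x (false alarms)
    tn -- other instances classified as not x
    """
    total = tp + fn + fp + tn
    return {
        # Share of all instances that were classified correctly.
        "overall_accuracy": (tp + tn) / total,
        # Classified as x but not x, among all instances that are not x.
        "false_alarm_rate": fp / (fp + tn),
        # Classified as x, among all instances that truly are x.
        "class_accuracy": tp / (tp + fn),
    }


# Hypothetical counts, chosen only for illustration.
m = binary_metrics(tp=90, fn=10, fp=20, tn=380)
for name, value in m.items():
    print(f"{name}: {value:.2%}")
```

With these illustrative counts the overall accuracy is 94%, the false alarm rate 5% and the class wise accuracy 90%.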

The results of evaluating the rule based classifiers are shown in Table 2 and Table 3. The test option used is 10-fold cross-validation. The Part classifier has the highest overall accuracy (97.384%), correctly classifying 11540 of the 11850 instances. Part and JRip have the lowest false alarm rates (1.7% and 1.8% respectively). The ConjunctiveRule classifier has the highest false alarm rate (8.7%) and the lowest accuracy (86.5654%). The overall performance of Part is therefore the best and that of ConjunctiveRule the worst, while the performance of Part and JRip is almost identical. Table 3 shows the class wise accuracy. All the classifiers show above 90% accuracy for the Anomaly class, and Part gives the highest accuracy for both the Normal class and the Anomaly class.



Table 2: Comparison of various measures for rule based classifiers

Rule based        Correctly classified  Incorrectly classified  Overall    False Alarm rate
Classifier        Instances             Instances               Accuracy   (class=Normal)
ConjunctiveRule   10258                 1592                    86.5654%   8.7%
DecisionTable     11268                 582                     95.0886%   3.5%
DTNB              11314                 536                     95.4768%   3.4%
JRip              11492                 358                     96.9789%   1.8%
OneR              10867                 983                     91.7046%   4.3%
Part              11540                 310                     97.384%    1.7%
NNge              11401                 449                     96.211%    2%
Ridor             11386                 464                     96.0844%   3.2%
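
The overall accuracy column of Table 2 follows from the two count columns, which can be checked with a short Python sketch. The counts below are copied from Table 2; the names `table2` and `overall_accuracy` are introduced only for this example.

```python
# (correctly classified, incorrectly classified) instances, copied from Table 2.
table2 = {
    "ConjunctiveRule": (10258, 1592),
    "DecisionTable": (11268, 582),
    "DTNB": (11314, 536),
    "JRip": (11492, 358),
    "OneR": (10867, 983),
    "Part": (11540, 310),
    "NNge": (11401, 449),
    "Ridor": (11386, 464),
}


def overall_accuracy(correct, incorrect):
    """Percentage of correctly classified instances."""
    return 100.0 * correct / (correct + incorrect)


accuracy = {name: overall_accuracy(*counts) for name, counts in table2.items()}
# Classifiers ordered from highest to lowest overall accuracy.
ranking = sorted(accuracy, key=accuracy.get, reverse=True)

for name in ranking:
    print(f"{name:16s} {accuracy[name]:.4f} %")
```

Every row sums to the 11850 instances of KDDTest-21.arff, and the ordering places Part first and ConjunctiveRule last.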

Table 3: Class wise accuracy

Rule based classifier   Normal   Anomaly
ConjunctiveRule         65%      91.3%
DecisionTable           88.8%    96.5%
DTNB                    90.5%    96.6%
JRip                    91.7%    98.2%
OneR                    73.9%    95.7%
Part                    93.4%    98.3%
NNge                    88%      98%
Ridor                   92.8%    96.8%
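
With only two classes, an anomaly that is not classified as Anomaly must have been classified as Normal, so the false alarm rate for class Normal should equal 100% minus the Anomaly class wise accuracy. The following Python sketch checks this relation against the reported figures. The values are copied from Table 2 and Table 3; the variable names and the 0.05 percentage-point rounding tolerance are assumptions made for the example.

```python
# Anomaly class wise accuracy in percent, copied from Table 3.
anomaly_accuracy = {
    "ConjunctiveRule": 91.3, "DecisionTable": 96.5, "DTNB": 96.6, "JRip": 98.2,
    "OneR": 95.7, "Part": 98.3, "NNge": 98.0, "Ridor": 96.8,
}
# False alarm rate (class=Normal) in percent, copied from Table 2.
false_alarm_normal = {
    "ConjunctiveRule": 8.7, "DecisionTable": 3.5, "DTNB": 3.4, "JRip": 1.8,
    "OneR": 4.3, "Part": 1.7, "NNge": 2.0, "Ridor": 3.2,
}

TOLERANCE = 0.05  # allowance for rounding to one decimal place (assumption)

# For each classifier: does FAR(Normal) match 100% - accuracy(Anomaly)?
consistent = {
    name: abs((100.0 - anomaly_accuracy[name]) - false_alarm_normal[name]) < TOLERANCE
    for name in anomaly_accuracy
}
print(consistent)
```

The relation appears to hold for all eight classifiers, which suggests that the two tables are mutually consistent.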

Figure 2: Classification of Instances by rule based classifiers
[Chart. Y-axis: No. of instances (0 to 12000). Series: Correctly classified Instances, Incorrectly classified Instances.]



Figure 3: Overall Accuracy of various rule based classifiers
[Chart. Y-axis: Overall Accuracy (80.00% to 100.00%).]

Figure 4: False alarm rate of rule based classifiers
[Chart. Y-axis: False alarm rate (0.00% to 10.00%).]

Figure 5: Accuracy of rule based classifiers for class Normal and class Anomaly
[Chart. Y-axis: Class wise Accuracy (0% to 120%). Series: Normal, Anomaly.]



5. Conclusion and Future Scope

Eight rule based classifiers were evaluated on the NSL-KDD dataset. The measures used for comparison were overall accuracy, class wise accuracy and false alarm rate. The Part and JRip classifiers performed best among all the algorithms, and ConjunctiveRule performed worst. The rules generated by these classifiers can be incorporated into signature based intrusion detection systems to enhance their performance.

References

[1] Chandolikar N. S. and Nandavadekar V. D., (2012), Comparative Analysis of Two Algorithms for Intrusion Attack Classification Using KDD Cup Dataset, International Journal of Computer Science and Engineering (IJCSE), Vol. 1, Issue 1, pp. 81-88.
[2] Gharibian F. and Ghorbani A. A., (2007), Comparative Study of Supervised Machine Learning Techniques for Intrusion Detection, Fifth Annual Conference on Communication Networks and Services Research (CNSR'07), pp. 350-358, IEEE.
[3] Hall M., Frank E., Holmes G., Pfahringer B., Reutemann P. and Witten I. H., (2009), The WEKA Data Mining Software: An Update, SIGKDD Explorations, Volume 11, Issue 1.
[4] Hussein S. M., Ali F. and Kasiran Z., (2012), Evaluation Effectiveness of Hybrid IDS Using Snort with Naive Bayes to Detect Attacks, IEEE.
[5] Information Assurance Tools Report: Intrusion Detection Systems, (2009), IATAC, Herndon, VA.
[6] Kurundkar G. D., Naik N. A. and Khamitkar S. D., (2012), Network Intrusion Detection using SNORT, International Journal of Engineering Research and Applications (IJERA), Vol. 2, Issue 2, pp. 1288-1296.
[7] NSL-KDD data set for network-based intrusion detection systems, (2009), Available on: http://nsl.cs.unb.ca/NSL-KDD/.
[8] Oreku G. S. and Mtenzi F. J., (2009), Intrusion Detection Based on Data Mining, Eighth IEEE International Conference on Dependable, Autonomic and Secure Computing (DASC'09), pp. 696-701, IEEE.
[9] Panda M. and Patra M. R., (2007), Network Intrusion Detection Using Naive Bayes, International Journal of Computer Science and Network Security, Vol. 7, No. 12.
[10] Qazanfari K., Mirpouryan S. and Gharaee H., (2012), A Novel Hybrid Anomaly Based Intrusion Detection Method, 6th International Symposium on Telecommunications (IST'2012).
[11] Shafi K., Abbass H. A. and Zhu W. A., Methodology to Evaluate Supervised Learning Algorithms for Intrusion Detection.
[12] Tavallaee M., Bagheri E., Lu W. and Ghorbani A. A., (2009), A Detailed Analysis of the KDD CUP 99 Data Set, Proceedings of the Second IEEE Symposium on Computational Intelligence for Security and Defence Applications.
[13] Zhang J., Zulkernine M. and Haque A., (2008), Random-Forests-Based Network Intrusion Detection Systems, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, Vol. 38, No. 5.
[14] Zhao D., Xu Q. and Feng Z., (2010), Analysis and Design for Intrusion Detection System Based on Data Mining, Second International Workshop on Education Technology and Computer Science.
