International Journal of Computational Intelligence and Information Security, June 2013 Vol. 4, No. 6
ISSN: 1837-7823
Evaluation of Rule based Machine Learning Algorithms on NSL-KDD Dataset
Tejvir Kaur (1) and Sanmeet Kaur (2)
(1) M.Tech Student, (2) Asst. Professor
Thapar University, Patiala-147001, India
Abstract
Intrusion detection can be defined as the act of detecting actions that attempt to compromise the confidentiality,
integrity or availability of a resource. There are many approaches to intrusion detection, such as anomaly based,
signature based and machine learning based approaches. The machine learning approach can prove very useful for
developing intrusion detection systems. This paper presents a comparison of various rule based machine
learning algorithms, all of which fall under supervised learning. The NSL-KDD
dataset [9] and the Waikato Environment for Knowledge Analysis (WEKA) [3] are used to evaluate the performance
of the machine learning algorithms.
Keywords: Intrusion detection, NSL-KDD dataset, WEKA
1. Introduction
The purpose of network security is to protect the network from unauthorized access, destruction and disclosure.
Many techniques have emerged in the field of network security that help in the protection of computer systems
and computer networks. One of the techniques used for making the network secure and detecting intrusions is
the Intrusion Detection System, a mechanism that detects unauthorized and malicious
activity present in computer systems. There are mainly two approaches to intrusion detection: signature
detection and anomaly detection. Machine learning techniques have also been applied to intrusion detection in
many ways. This paper presents the evaluation and results of applying rule based machine learning algorithms to the
NSL-KDD dataset [12].
2. Review of Literature
An Intrusion Detection System (IDS) helps information systems to deal with attacks. An IDS gathers and analyzes
information from various areas within a computer or a network to identify intrusions, which include attacks
from outside the organization as well as attacks from within the organization. There are mainly two
approaches to intrusion detection: signature based detection and anomaly based detection. The signature-based
approach looks for the signatures of known attacks, which exploit weaknesses in system and application
software [5]. It uses pattern matching techniques against a frequently updated database of attack signatures. It is
useful for detecting already known attacks but not new ones. Many attacks can be detected by this approach
because many attacks have clear and distinct signatures.
Anomaly-based detection looks at network traffic and detects data that is incorrect, not valid or
generally abnormal. This method is useful for detecting unwanted traffic that
is not specifically known [5]. Various methods, such as machine learning and statistical methods, can be used in
the anomaly detection approach to distinguish anomalous behavior from normal behavior.
Machine learning algorithms can be used in intrusion detection problems to find interesting intrusion patterns in
data [10]. This requires the data to be in labelled form. For this purpose the NSL-KDD dataset, which has
the required characteristics, is taken. The rule based learning algorithms are applied on this dataset.
K. Shafi et al. [11] propose a methodology to create a fully labelled network intrusion detection dataset which
is suitable for machine learning algorithms. The dataset is created using real background traffic and simulated
attacks. This dataset is tested on supervised machine learning algorithms in WEKA.
F. Gharibian and A. Ghorbani (2007) [2] compare supervised machine learning techniques. The algorithms
used are Naive Bayes, Gaussian, Decision Tree and Random Forests. The ability of each technique to detect
the attack categories in the KDD dataset is compared. From the results, the proper technique for
identifying each attack category is also proposed.
According to M. Panda and M. Patra (2007) [9], the use of naive Bayes for anomaly based network intrusion
detection produces better results in terms of false positive rate, cost, and computational time when
applied to the KDD99 data sets as compared to a back propagation neural network based approach. The
experimentation is done in the WEKA program.
J. Zhang et al. (2008) [13] focus on a framework that applies a data mining algorithm called random forests in
misuse, anomaly, and hybrid network-based IDSs. In misuse detection, patterns of intrusions are built
automatically by the random forests algorithm over training data. In anomaly detection, novel intrusions are
detected by the outlier detection mechanism of the random forests algorithm.
G. Oreku and F. Mtenzi (2009) [8] present the use of data mining techniques to discover consistent and useful
patterns of system features that describe program and user behavior. According to them, a useful set of relevant
system features can recognize anomalies and known intrusions.
D. Zhao et al. (2010) [14] propose a hybrid IDS which combines network and host IDS with anomaly and
misuse detection modes. Data mining programs are applied to learn rules that can capture the behavior of
intrusions and normal activities.
K. Qazanfari et al. (2012) [10] propose an intrusion detection system which uses the Support Vector Machine
(SVM) and Multi Layer Perceptron (MLP) machine learning algorithms to separate normal from abnormal
behaviors.
S. M. Hussein et al. (2012) [4] discuss an anomaly detection engine based on the Naive Bayes
algorithm, the J48graft Decision Tree algorithm and the Bayes Net algorithm in the WEKA program.
3. Methodology
This section presents the methodology used to carry out the work. The workflow is described in Figure 1. The
main aim is to analyze the performance of the rule based classifiers present in WEKA. For this purpose the NSL-KDD
dataset is used. The NSL-KDD dataset is applied to the rule based classifiers, and the outputs of the classifiers are
compared with each other. The measures used to compare the results are accuracy, false alarm rate and the number
of instances that are correctly and incorrectly classified. The rule based classifier giving the best performance is
identified by analyzing the results against these metrics.
Figure 1: Design of Research
4. Experimentation
4.1 Description of dataset
The KDD data set is built from the data captured in the DARPA98 IDS evaluation program. DARPA98 is
about 4 gigabytes of tcpdump data covering 7 weeks of network traffic. This data can be processed into about
5 million connection records. The two weeks of test data have around 2 million connection records. The KDD
training dataset consists of approximately 4,900,000 single connection vectors, each of which contains 41
features and is labelled as either normal or an attack. Analysis has shown that there are two important issues
in the data set which affect the performance of evaluated systems and result in a very poor evaluation of
anomaly detection approaches. To solve these issues, a new dataset called NSL-KDD [7] has been proposed,
which consists of selected records of the complete KDD data set [12]. The NSL-KDD data set has the following
advantages over the original KDD data set.
1) It does not include redundant records in the train set.
2) There are no duplicate records in the proposed test sets.
3) The number of records in the train and test sets is reasonable, which makes it affordable to run the
experiments on the complete set without the need to randomly select a small portion [7].
For the purpose of experimentation, KDDTest-21.arff, a subset of the KDDTest+.arff file, is taken. It contains
11850 instances [7]. The attribute names and types are listed in Table 1.
Table 1: Attribute name and type in KDDTest-21.arff file
Attribute Name        Attribute Type    Attribute Name                 Attribute Type
duration              real              is_guest_login                 nominal
protocol_type         nominal           count                          real
service               nominal           srv_count                      real
flag                  nominal           serror_rate                    real
src_bytes             real              srv_serror_rate                real
dst_bytes             real              rerror_rate                    real
land                  nominal           srv_rerror_rate                real
wrong_fragment        real              same_srv_rate                  real
urgent                real              diff_srv_rate                  real
hot                   real              srv_diff_host_rate             real
num_failed_logins     real              dst_host_count                 real
logged_in             nominal           dst_host_srv_count             real
num_compromised       real              dst_host_same_srv_rate         real
root_shell            real              dst_host_diff_srv_rate         real
su_attempted          real              dst_host_same_src_port_rate    real
num_root              real              dst_host_srv_diff_host_rate    real
num_file_creations    real              dst_host_serror_rate           real
num_shells            real              dst_host_srv_serror_rate       real
num_access_files      real              dst_host_rerror_rate           real
num_outbound_cmds     real              dst_host_srv_rerror_rate       real
is_host_login         nominal           class                          nominal
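As a rough illustration of how attribute names and types such as those in Table 1 can be read from an ARFF header, the following Python sketch parses the @attribute declarations. The sample header is a shortened, hypothetical excerpt written for this example (its nominal value lists are assumptions), not the full KDDTest-21.arff file, and the function name read_attribute_types is illustrative only.

```python
import re

# Matches declarations such as:  @attribute 'duration' real
#                                @attribute 'land' {'0','1'}
ATTRIBUTE_RE = re.compile(r"^@attribute\s+('[^']*'|\S+)\s+(.+)$", re.IGNORECASE)


def read_attribute_types(arff_text):
    """Return a list of (name, type) pairs from the header of an ARFF file.

    A declaration with a value list in braces is reported as 'nominal';
    any other declaration is reported by its lower-cased type keyword.
    """
    attributes = []
    for raw_line in arff_text.splitlines():
        line = raw_line.strip()
        if line.lower().startswith("@data"):
            break  # the header ends where the data section begins
        match = ATTRIBUTE_RE.match(line)
        if not match:
            continue  # skip comments, blank lines and @relation
        name = match.group(1).strip("'")
        declared = match.group(2).strip()
        kind = "nominal" if declared.startswith("{") else declared.lower()
        attributes.append((name, kind))
    return attributes


# Hypothetical, shortened header used only for illustration.
SAMPLE_HEADER = """\
@relation 'KDDTest'
@attribute 'duration' real
@attribute 'protocol_type' {'tcp','udp','icmp'}
@attribute 'src_bytes' real
@attribute 'land' {'0','1'}
@attribute 'class' {'normal','anomaly'}
@data
0,tcp,0,0,normal
"""

if __name__ == "__main__":
    for name, kind in read_attribute_types(SAMPLE_HEADER):
        print(name, kind)
```

Running the same function over the header of the real file would be expected to yield the 42 name and type pairs of Table 1, provided the file follows the quoting style assumed here.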
4.2 Classifiers used
The following are the machine learning algorithms or classifiers that are evaluated on the NSL-KDD dataset.
1) ConjunctiveRule: This classifier implements a single conjunctive rule learner that can predict numeric and
nominal class labels.
2) DecisionTable: This classifier builds and uses a simple decision table majority classifier.
3) DTNB: This classifier builds and uses a decision table/naive Bayes hybrid classifier.
4) JRip: This class implements a propositional rule learner, Repeated Incremental Pruning to Produce Error
Reduction (RIPPER).
5) OneR: It generates a set of rules that test one particular attribute, which amounts to learning a one-level decision tree.
6) Part: Class for generating a PART decision list.
7) NNge: It is a nearest-neighbour-like algorithm using non-nested generalized exemplars.
8) Ridor: This is an implementation of a Ripple-Down Rule learner.
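To make the idea of a rule set that tests a single attribute more concrete, the sketch below gives a minimal OneR-style learner in Python. It is an illustration only and not the WEKA implementation: it handles nominal attributes only (no discretization of numeric attributes), and the function names and the toy records with their labels are invented for this example.

```python
from collections import Counter, defaultdict


def train_one_r(rows, labels):
    """Learn a OneR-style rule set from nominal data.

    rows   : list of tuples of nominal attribute values
    labels : list of class labels, one per row
    Returns (attribute_index, rules, default_label), where rules maps each
    value of the chosen attribute to the majority class seen with it.
    """
    default_label = Counter(labels).most_common(1)[0][0]
    best = None  # (training errors, attribute index, rules)
    for index in range(len(rows[0])):
        # Count class labels separately for each value of this attribute.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[index]][label] += 1
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the instances that do not belong to the majority class.
        errors = sum(
            sum(c.values()) - c[rules[value]] for value, c in counts.items()
        )
        if best is None or errors < best[0]:
            best = (errors, index, rules)
    _, attribute_index, rules = best
    return attribute_index, rules, default_label


def predict_one_r(model, row):
    """Classify one row; unseen attribute values get the default label."""
    attribute_index, rules, default_label = model
    return rules.get(row[attribute_index], default_label)


# Toy records (protocol_type, flag) with invented labels, for illustration only.
ROWS = [
    ("tcp", "SF"), ("tcp", "S0"), ("udp", "SF"),
    ("tcp", "SF"), ("icmp", "S0"), ("udp", "S0"),
]
LABELS = ["normal", "anomaly", "normal", "normal", "anomaly", "anomaly"]

if __name__ == "__main__":
    model = train_one_r(ROWS, LABELS)
    print(model[0], model[1])
    print(predict_one_r(model, ("tcp", "S0")))
```

On the toy data the second attribute separates the two classes without error, so it is the one selected; on real data the chosen attribute is simply the one with the fewest training errors.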
4.3 Test Options in WEKA
The result of applying the chosen classifier is tested according to the selected test option. The following are the four
test modes that are provided in WEKA [3].
1. Use training set. The classifier is evaluated on how well it predicts the class of the instances it was trained on.
2. Supplied test set. The classifier is evaluated on how well it predicts the class of a set of instances loaded from
a file. The classifier is built using the training set and evaluated using the test set.
3. Cross-validation. The classifier is evaluated by cross-validation, using the number of folds entered in the Folds text field. The dataset is randomly divided into k subsamples. Of these, k-1 subsamples are used as training data and one subsample as test data. This process is repeated k times, so that each subsample is used once as test data.
4. Percentage split. The classifier is evaluated on how well it predicts a certain percentage of the data which is
held out for testing. The amount of data held out depends on the value entered in the % field [3].
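The cross-validation procedure in option 3 can be sketched in a few lines of Python. This is a plain, unstratified split written for illustration; the function name k_fold_indices and the seed value are assumptions, and WEKA's own fold construction may differ in detail.

```python
import random


def k_fold_indices(n_instances, k, seed=1):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_instances:
        raise ValueError("k must be between 2 and the number of instances")
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # random division into subsamples
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    # 11850 instances and 10 folds, matching the experiment in this paper.
    for fold_number, (train, test) in enumerate(k_fold_indices(11850, 10), start=1):
        print(fold_number, len(train), len(test))
```

Each instance appears in exactly one test fold, so every instance is used for testing once and for training k-1 times.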
4.4 Results and Discussion
The measures used for evaluating the performance are as follows.
1) Correctly classified instances and incorrectly classified instances: These show the number of instances that are
correctly and incorrectly classified.
2) Overall accuracy: The percentage of correctly classified instances is called accuracy.
3) False alarm rate: The false alarm rate is the proportion of examples which were classified as class x, but belong to
a different class, among all examples which are not of class x.
4) Class wise accuracy: It is the proportion of examples which were classified as class x, among all examples
which truly have class x [3].
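For a two-class problem these measures follow directly from the four cells of a confusion matrix. The Python sketch below works them out for one class x; the function name binary_metrics and the counts in the usage example are invented for illustration and are not results from this paper.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute the measures listed above for one class x.

    tp: instances of class x classified as x
    fn: instances of class x classified as another class
    fp: instances of other classes classified as x
    tn: instances of other classes not classified as x
    """
    total = tp + fn + fp + tn
    return {
        "correct": tp + tn,
        "incorrect": fp + fn,
        "overall_accuracy": (tp + tn) / total,
        "false_alarm_rate": fp / (fp + tn),
        "class_accuracy": tp / (tp + fn),
    }


if __name__ == "__main__":
    # Invented counts, for illustration only; x is taken to be the Normal class.
    metrics = binary_metrics(tp=180, fn=20, fp=40, tn=760)
    for name, value in metrics.items():
        print(name, value)
```

With only two classes, the false alarm rate for one class is the complement of the class wise accuracy of the other class, since both are computed over the same set of instances.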
The results of evaluating the rule based classifiers are shown in Table 2 and Table 3. The test option used is 10-fold
cross-validation. The Part classifier has the highest overall accuracy (97.384%), correctly classifying
11540 of the 11850 instances. The false alarm rates of Part (1.7%) and JRip (1.8%) are the lowest. The ConjunctiveRule classifier
has the highest false alarm rate (8.7%) and the lowest accuracy (86.5654%). So the overall performance of Part is the best and
that of ConjunctiveRule is the worst. The performance of Part and JRip is almost identical. Table 3 shows class wise
accuracy. All the classifiers show above 90% accuracy for the Anomaly class. Part gives the highest accuracy for
both the Normal class and the Anomaly class.
Table 2: Comparison of various measures for rule based classifiers
Rule based        Correctly classified   Incorrectly classified   Overall      False alarm rate
classifier        instances              instances                accuracy     (class=Normal)
ConjunctiveRule   10258                  1592                     86.5654 %    8.7%
DecisionTable     11268                  582                      95.0886 %    3.5%
DTNB              11314                  536                      95.4768 %    3.4%
JRip              11492                  358                      96.9789 %    1.8%
OneR              10867                  983                      91.7046 %    4.3%
Part              11540                  310                      97.384 %     1.7%
NNge              11401                  449                      96.211 %     2%
Ridor             11386                  464                      96.0844 %    3.2%
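The overall accuracy column of Table 2 can be recomputed from the two count columns, as the short Python check below shows. The counts are copied from Table 2; the dictionary and function names are illustrative only.

```python
# (correctly classified, incorrectly classified) counts copied from Table 2.
TABLE_2 = {
    "ConjunctiveRule": (10258, 1592),
    "DecisionTable": (11268, 582),
    "DTNB": (11314, 536),
    "JRip": (11492, 358),
    "OneR": (10867, 983),
    "Part": (11540, 310),
    "NNge": (11401, 449),
    "Ridor": (11386, 464),
}


def overall_accuracy(correct, incorrect):
    """Percentage of correctly classified instances."""
    return 100.0 * correct / (correct + incorrect)


if __name__ == "__main__":
    # List the classifiers from highest to lowest overall accuracy.
    ranked = sorted(TABLE_2.items(), key=lambda item: -overall_accuracy(*item[1]))
    for name, (correct, incorrect) in ranked:
        total = correct + incorrect
        print(f"{name:16s}{total:6d}{overall_accuracy(correct, incorrect):10.4f} %")
```

Every row sums to the 11850 instances of KDDTest-21.arff, and the recomputed percentages agree with the reported ones to the precision shown in the table.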
Table 3: Class wise accuracy
Rule based classifier Normal Anomaly
ConjunctiveRule 65% 91.3%
DecisionTable 88.8% 96.5%
DTNB 90.5% 96.6%
JRip 91.7% 98.2%
OneR 73.9% 95.7%
Part 93.4% 98.3%
NNge 88% 98%
Ridor 92.8% 96.8%
Figure 2: Classification of Instances by rule based classifiers
[Chart; vertical axis: No. of instances, 0 to 12000; series: Correctly classified Instances, Incorrectly classified Instances]
Figure 3: Overall Accuracy of various rule based classifiers
[Chart; vertical axis: Overall Accuracy, 80.00% to 100.00%]
Figure 4: False alarm rate of rule based classifiers
[Chart; vertical axis: False alarm rate, 0.00% to 10.00%]
Figure 5: Accuracy of rule based classifiers for class Normal and class Anomaly
[Chart; vertical axis: Class wise Accuracy, 0% to 120%; series: Normal, Anomaly]
5. Conclusion and Future Scope
Eight rule based classifiers are evaluated on the NSL-KDD dataset. The measures taken for
comparison are overall accuracy, class wise accuracy and false alarm rate. The Part and JRip classifiers performed best
among all the algorithms, and the performance of ConjunctiveRule is the worst. The rules generated by these
classifiers can be incorporated into signature based intrusion detection systems to enhance their
performance.
References
[1] Chandolikar N. S., and Nandavadekar V. D., (2012), Comparative analysis of two algorithms for intrusion
attack classification using KDD Cup dataset, International Journal of Computer Science and Engineering
(IJCSE), Vol. 1, Issue 1, pp. 81-88.
[2] Gharibian, F., & Ghorbani, A. A., (2007), Comparative study of supervised machine learning techniques for
intrusion detection, In Communication Networks and Services Research, CNSR'07, Fifth Annual
Conference on, pp. 350-358, IEEE.
[3] Hall M., Frank E., Holmes G., Pfahringer B., Reutemann P., Witten I. H., (2009), The WEKA Data Mining
Software: An Update, SIGKDD Explorations, Volume 11, Issue 1.
[4] Hussein S. M., Ali F., and Kasiran Z., (2012), Evaluation Effectiveness of Hybrid IDS Using Snort with
Naive Bayes to Detect Attacks, IEEE.
[5] Information Assurance Tools Report: Intrusion Detection Systems, (2009), IATAC, Herndon, VA.
[6] Kurundkar G. D., Naik N. A., and Dr. Khamitkar S. D., (2012), Network Intrusion Detection using SNORT,
International Journal of Engineering Research and Applications (IJERA), Vol. 2, Issue 2, pp. 1288-1296.
[7] NSL-KDD data set for network-based intrusion detection systems, (2009), Available on:
http://nsl.cs.unb.ca/NSL-KDD/.
[8] Oreku, G. S., & Mtenzi, F. J., (2009), Intrusion Detection Based on Data Mining, In Dependable,
Autonomic and Secure Computing, DASC'09, Eighth IEEE International Conference on Dependable,
Autonomic and Secure Computing, pp. 696-701, IEEE.
[9] Panda, M., and Patra, M. R., (2007), Network intrusion detection using naive Bayes, International
Journal of Computer Science and Network Security, Vol. 7, No. 12.
[10] Qazanfari K., Mirpouryan . S., and Gharaee H., (2012), A Novel Hybrid Anomaly Based Intrusion
Detection Method, 6th International Symposium on Telecommunications (IST'2012).
[11] Shafi, K., Abbass, H. A., and Zhu, W. A., Methodology to Evaluate Supervised Learning Algorithms for
Intrusion Detection.
[12] Tavallaee M., Bagheri E., Lu W., and Ghorbani A. A., (2009), A Detailed Analysis of the KDD CUP 99
Data Set, Proceedings of the Second IEEE Symposium on Computational Intelligence for Security and
Defence Applications.
[13] Zhang J., Zulkernine M., and Haque A., (2008), Random-Forests-Based Network Intrusion
Detection Systems, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and
Reviews, Vol. 38, No. 5.
[14] Zhao D., Xu Q., Feng Z., (2010), Analysis and Design for Intrusion Detection System Based on Data
Mining, Second International Workshop on Education Technology and Computer Science.