Paper-5 Evaluation of Rule Based Machine Learning Algorithms on NSL-KDD Dataset

    International Journal of Computational Intelligence and Information Security, June 2013 Vol. 4, No. 6

    ISSN: 1837-7823

    Evaluation of Rule based Machine Learning Algorithms on NSL-KDD Dataset

    Tejvir Kaur (1) and Sanmeet Kaur (2)

    (1) M.Tech Student, (2) Asst. Professor

    Thapar University, Patiala - 147001, India

    Abstract

    Intrusion detection can be defined as the act of detecting actions that attempt to compromise the confidentiality, integrity or availability of a resource. There are many approaches to intrusion detection, such as anomaly based, signature based and machine learning based. The machine learning approach can prove very useful for developing intrusion detection systems. This paper presents a comparison of various rule based machine learning algorithms, all of which fall under supervised learning. The NSL-KDD dataset [9] and the Waikato Environment for Knowledge Analysis (WEKA) [3] are used to evaluate the performance of the machine learning algorithms.

    Keywords: Intrusion detection, NSL-KDD dataset, WEKA

    1. Introduction

    The purpose of network security is to protect the network from unauthorized access, destruction and disclosure. Many techniques have emerged in the field of network security that help protect computer systems and computer networks. One technique used for securing the network and detecting intrusions is the Intrusion Detection System (IDS), a mechanism that detects unauthorized and malicious activity in computer systems. There are two main approaches to intrusion detection: signature detection and anomaly detection. Machine learning techniques have also been applied to intrusion detection in many ways. This paper presents the evaluation and results of applying rule based machine learning algorithms to the NSL-KDD dataset [12].

    2. Review of Literature

    An Intrusion Detection System helps information systems deal with attacks. An IDS gathers and analyzes information from various areas within a computer or a network to identify intrusions, which include attacks from outside the organization as well as attacks from within it. There are two main approaches to intrusion detection: signature based detection and anomaly based detection. The signature based approach looks for the signatures of known attacks, which exploit weaknesses in system and application software [5]. It uses pattern matching techniques against a frequently updated database of attack signatures. It is useful for detecting already known attacks but not new ones. Many attacks can be detected by this approach because many attacks have clear and distinct signatures.

    An intrusion detection system that looks at network traffic and detects data that is incorrect, invalid or generally abnormal is called anomaly based. This method is useful for detecting unwanted traffic that is not specifically known [5]. Various methods, such as machine learning and statistical methods, can be used in the anomaly detection approach to distinguish anomalous behavior from normal behavior.

    Machine learning algorithms can be used in intrusion detection problems to find interesting intrusion patterns in data [10]. This requires the data to be labelled. For this purpose the NSL-KDD dataset, which has the required characteristics, is used, and the rule based learning algorithms are applied to it.

    K. Shafi et al. [11] propose a methodology to create a fully labelled network intrusion detection dataset suitable for machine learning algorithms. The dataset is created using real background traffic and simulated attacks, and is tested on supervised machine learning algorithms in WEKA.

    F. Gharibian and A. Ghorbani (2007) [2] compare the supervised machine learning techniques. The algorithms

    used are Naive Bayes, Gaussian, Decision Tree and Random Forests. The ability of each technique for detecting

    the attack categories in the KDD dataset has been compared. From the results, the proper technique for

    identifying an attack category is also proposed.

    According to M. Panda and M. Patra (2007) [9], using naive Bayes for anomaly based network intrusion detection produces better results in terms of false positive rate, cost and computational time on the KDD99 data sets than a back propagation neural network based approach. The experimentation is done in WEKA.

    J. Zhang et al. (2008) [13] focus on a framework that applies a data mining algorithm called random forests in misuse, anomaly and hybrid network based IDSs. In misuse detection, patterns of intrusions are built automatically by the random forests algorithm over training data. In anomaly detection, novel intrusions are detected by the outlier detection mechanism of the random forests algorithm.

    G. Oreku and F. Mtenzi (2009) [8] present the use of data mining techniques to discover consistent and useful patterns of system features that describe program and user behavior. According to them, a useful set of relevant system features can recognize anomalies and known intrusions.

    D. Zhao et al. (2010) [14] propose a hybrid IDS which combines network and host IDS with anomaly and misuse detection modes. Data mining programs are applied to learn rules that can capture the behavior of intrusions and normal activities.

    K. Qazanfari et al. (2012) [10] have proposed an Intrusion detection system which uses Support Vector Machine

    (SVM) and Multi Layer Perceptron (MLP) machine learning algorithms to classify normal from abnormal

    behaviors.

    S. M. Hussein et al. (2012) [4] discuss an anomaly detection engine based on the Naive Bayes, J48graft decision tree and Bayes Net algorithms in WEKA.

    3. Methodology

    This section presents the methodology used to carry out the work. The workflow is described in Figure 1. The main aim is to analyze the performance of the rule based classifiers present in WEKA. For this purpose the NSL-KDD dataset is applied to each rule based classifier and the outputs of the classifiers are compared with each other. The measures used to compare the results are accuracy, false alarm rate and the number of instances that are correctly and incorrectly classified. The rule based classifier giving the best performance is identified by analyzing the results on these metrics.

    Figure 1: Design of Research

    4. Experimentation

    4.1 Description of dataset

    The KDD data set is built from the data captured in the DARPA98 IDS evaluation program. DARPA98 is about 4 gigabytes in size and contains tcpdump data of 7 weeks of network traffic, which can be processed into about 5 million connection records. The two weeks of test data have around 2 million connection records. The KDD training dataset consists of approximately 4,900,000 single connection vectors, each of which contains 41 features and is labelled as either normal or an attack. Analysis has shown that there are two important issues in the data set which affect the performance of evaluated systems and result in a very poor evaluation of anomaly detection approaches. To solve these issues, a new dataset called NSL-KDD [7] has been proposed, which consists of selected records of the complete KDD data set [12]. The NSL-KDD data set has the following advantages over the original KDD data set.

    1) It does not include redundant records in the train set.

    2) There are no duplicate records in the proposed test sets.

    3) The number of records in the train and test sets is reasonable, which makes it affordable to run the experiments on the complete set without the need to randomly select a small portion [7].

    For the purpose of experimentation, KDDTest-21.arff, a subset of the KDDTest+.arff file, is used. It contains 11850 instances [7]. The attribute names and types are listed in Table 1.

    Table 1: Attribute name and type in KDDTest-21.arff file

    Attribute Name        Attribute Type    Attribute Name                 Attribute Type
    duration              real              is_guest_login                 nominal
    protocol_type         nominal           count                          real
    service               nominal           srv_count                      real
    flag                  nominal           serror_rate                    real
    src_bytes             real              srv_serror_rate                real
    dst_bytes             real              rerror_rate                    real
    land                  nominal           srv_rerror_rate                real
    wrong_fragment        real              same_srv_rate                  real
    urgent                real              diff_srv_rate                  real
    hot                   real              srv_diff_host_rate             real
    num_failed_logins     real              dst_host_count                 real
    logged_in             nominal           dst_host_srv_count             real
    num_compromised       real              dst_host_same_srv_rate         real
    root_shell            real              dst_host_diff_srv_rate         real
    su_attempted          real              dst_host_same_src_port_rate    real
    num_root              real              dst_host_srv_diff_host_rate    real
    num_file_creations    real              dst_host_serror_rate           real
    num_shells            real              dst_host_srv_serror_rate       real
    num_access_files      real              dst_host_rerror_rate           real
    num_outbound_cmds     real              dst_host_srv_rerror_rate       real
    is_host_login         nominal           class                          nominal

    4.2 Classifiers used

    The following machine learning algorithms, or classifiers, are evaluated on the NSL-KDD dataset.

    1) ConjunctiveRule: This classifier implements a single conjunctive rule learner that can predict numeric and nominal class labels.

    2) DecisionTable: This classifier builds and uses a simple decision table majority classifier.

    3) DTNB: This classifier builds and uses a decision table/naive Bayes hybrid classifier.

    4) JRip: This classifier implements a propositional rule learner, Repeated Incremental Pruning to Produce Error Reduction (RIPPER).

    5) OneR: This classifier generates a set of rules that all test one particular attribute, which amounts to learning a one-level decision tree.

    6) Part: This classifier generates a PART decision list.

    7) NNge: This is a nearest neighbour like algorithm using non-nested generalized exemplars.

    8) Ridor: This is an implementation of a Ripple-Down Rule learner.
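
    To make the idea of a rule based classifier concrete, the following is a minimal Python sketch of the OneR idea from item 5: for every attribute, map each attribute value to its majority class, then keep the attribute whose rule set makes the fewest training errors. It is an illustration only, not WEKA's implementation; it handles nominal attributes only, and the function name `one_r` and the toy connection records are assumptions made up for this example.

```python
from collections import Counter, defaultdict


def one_r(rows, labels):
    """Return (attribute_index, rule, training_errors) for the best single-attribute rule.

    rows   : list of tuples of nominal attribute values
    labels : list of class labels, one per row
    """
    best = None
    for attr in range(len(rows[0])):
        # Count the class labels seen for every value of this attribute.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[attr]][label] += 1
        # Each attribute value predicts its majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Training errors made by this one-attribute rule set.
        errors = sum(rule[row[attr]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


# Toy, made-up records with two nominal attributes: (protocol_type, flag).
rows = [("tcp", "SF"), ("tcp", "S0"), ("udp", "SF"),
        ("tcp", "S0"), ("icmp", "SF"), ("tcp", "REJ")]
labels = ["normal", "anomaly", "normal", "anomaly", "normal", "anomaly"]

attr, rule, errors = one_r(rows, labels)
print("best attribute index:", attr)
print("rule:", rule)
print("training errors:", errors)
```

    On this toy data the second attribute (index 1) separates the two classes without error, so it is the one selected.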

    4.3 Test Options in WEKA

    The chosen classifier is tested according to the selected test option. WEKA provides the following four test modes [3].

    1. Use training set. The classifier is evaluated on how well it predicts the class of the instances it was trained on.

    2. Supplied test set. The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. The classifier is built using the training set and evaluated using the test set.

    3. Cross-validation. The classifier is evaluated by cross-validation, using the number of folds entered in the Folds text field. The dataset is randomly divided into k subsamples; k-1 subsamples are used as training data and one subsample as test data. This process is repeated k times, so that each subsample is used once as test data.

    4. Percentage split. The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field [3].
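
    The cross-validation procedure in option 3 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not WEKA's code: the helper name `k_fold_indices` and the seed value are hypothetical, and WEKA's own implementation may differ in detail (for example in how it assigns instances to folds).

```python
import random


def k_fold_indices(n_instances, k, seed=1):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)         # random division of the dataset
    folds = [indices[i::k] for i in range(k)]    # k subsamples of near-equal size
    for i in range(k):
        test = folds[i]                          # one subsample held out for testing
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 11850 instances and 10 folds, matching the setup described in this paper.
splits = list(k_fold_indices(11850, 10))
for train, test in splits:
    print(len(train), len(test))
```

    Each instance lands in exactly one test fold, so every instance is used for testing once and for training k-1 times.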

    4.4 Results and Discussion

    The measures used for evaluating the performance are as follows.

    1) Correctly classified instances and incorrectly classified instances: the number of instances that are correctly and incorrectly classified.

    2) Overall accuracy: the percentage of correctly classified instances.

    3) False alarm rate: the proportion of examples which were classified as class x but belong to a different class, among all examples which are not of class x.

    4) Class wise accuracy: the proportion of examples which were classified as class x, among all examples which truly have class x [3].
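
    As a worked illustration of measures 2) to 4), the sketch below computes them from a confusion matrix. The counts are hypothetical and chosen only to keep the arithmetic easy to follow; they are not results from this paper, and the function names are assumptions made for this example.

```python
def overall_accuracy(confusion):
    """Fraction of all instances whose predicted class equals the actual class."""
    correct = sum(confusion[c][c] for c in confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total


def rates(confusion, cls):
    """Return (false_alarm_rate, class_wise_accuracy) for class `cls`.

    confusion[actual][predicted] holds instance counts.
    """
    others = [a for a in confusion if a != cls]
    # Instances of other classes that were wrongly classified as cls.
    false_alarms = sum(confusion[a][cls] for a in others)
    n_others = sum(sum(confusion[a].values()) for a in others)
    # Instances of cls that were correctly classified as cls.
    hits = confusion[cls][cls]
    n_cls = sum(confusion[cls].values())
    return false_alarms / n_others, hits / n_cls


# Hypothetical counts (not taken from the experiments).
confusion = {
    "normal":  {"normal": 90, "anomaly": 10},
    "anomaly": {"normal": 20, "anomaly": 380},
}

far_normal, acc_normal = rates(confusion, "normal")
far_anomaly, acc_anomaly = rates(confusion, "anomaly")
print(overall_accuracy(confusion))   # 470 correct out of 500
print(far_normal, acc_normal)        # 20/400 and 90/100
print(far_anomaly, acc_anomaly)      # 10/100 and 380/400
```

    With only two classes, the false alarm rate for one class and the class wise accuracy of the other class share the same denominator and therefore sum to 1.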

    The results of evaluating the rule based classifiers are shown in Table 2 and Table 3. The test option used is 10-fold cross-validation. The Part classifier has the highest overall accuracy (97.384%), correctly classifying 11540 of the 11850 instances. The false alarm rates of Part (1.7%) and JRip (1.8%) are the lowest. The ConjunctiveRule classifier has the highest false alarm rate (8.7%) and the lowest accuracy (86.5654%). So the overall performance of Part is the best and that of ConjunctiveRule is the worst. The performance of Part and JRip is almost identical. Table 3 shows class wise accuracy. All the classifiers show above 90% accuracy for the Anomaly class. For both the Normal class and the Anomaly class, Part gives the highest accuracy.

    Table 2: Comparison of various measures for rule based classifiers

    Rule based Classifier   Correctly classified Instances   Incorrectly classified Instances   Overall Accuracy   False Alarm rate (class=Normal)
    ConjunctiveRule         10258                            1592                               86.5654 %          8.7%
    DecisionTable           11268                            582                                95.0886 %          3.5%
    DTNB                    11314                            536                                95.4768 %          3.4%
    JRip                    11492                            358                                96.9789 %          1.8%
    OneR                    10867                            983                                91.7046 %          4.3%
    Part                    11540                            310                                97.384 %           1.7%
    NNge                    11401                            449                                96.211 %           2%
    Ridor                   11386                            464                                96.0844 %          3.2%

    Table 3: Class wise accuracy

    Rule based classifier   Normal    Anomaly
    ConjunctiveRule         65%       91.3%
    DecisionTable           88.8%     96.5%
    DTNB                    90.5%     96.6%
    JRip                    91.7%     98.2%
    OneR                    73.9%     95.7%
    Part                    93.4%     98.3%
    NNge                    88%       98%
    Ridor                   92.8%     96.8%
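
    The figures in Table 2 and Table 3 can be cross-checked with a short script. The sketch below is only a consistency check on the reported numbers; the variable names are assumptions made for this example. It recomputes overall accuracy as correct / (correct + incorrect), ranks the classifiers, and checks that, for this two-class problem, the false alarm rate for class Normal and the class wise accuracy for class Anomaly add up to 100%, as the definitions of the measures imply.

```python
# (classifier, correct, incorrect, reported accuracy %, false alarm % for class Normal)
table2 = [
    ("ConjunctiveRule", 10258, 1592, 86.5654, 8.7),
    ("DecisionTable",   11268,  582, 95.0886, 3.5),
    ("DTNB",            11314,  536, 95.4768, 3.4),
    ("JRip",            11492,  358, 96.9789, 1.8),
    ("OneR",            10867,  983, 91.7046, 4.3),
    ("Part",            11540,  310, 97.384,  1.7),
    ("NNge",            11401,  449, 96.211,  2.0),
    ("Ridor",           11386,  464, 96.0844, 3.2),
]

# Class wise accuracy (%) for class Anomaly, from Table 3.
table3_anomaly = {
    "ConjunctiveRule": 91.3, "DecisionTable": 96.5, "DTNB": 96.6, "JRip": 98.2,
    "OneR": 95.7, "Part": 98.3, "NNge": 98.0, "Ridor": 96.8,
}

# Recompute overall accuracy from the instance counts.
recomputed = {name: 100 * correct / (correct + incorrect)
              for name, correct, incorrect, _, _ in table2}

# Rank classifiers from highest to lowest reported accuracy.
ranking = [row[0] for row in sorted(table2, key=lambda r: r[3], reverse=True)]

# The classifier with the lowest false alarm rate.
lowest_far = min(table2, key=lambda r: r[4])[0]

for name in ranking:
    print(f"{name:16s} {recomputed[name]:.4f}")
print("lowest false alarm rate:", lowest_far)
```

    Every row sums to 11850 instances, and the recomputed accuracies agree with the reported ones to the precision shown in Table 2.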

    Figure 2: Classification of Instances by rule based classifiers

    [Bar chart. Y-axis: No. of instances (0 to 12000). Series: Correctly classified Instances, Incorrectly classified Instances.]

    Figure 3: Overall Accuracy of various rule based classifiers

    [Chart. Y-axis: Overall Accuracy (80.00% to 100.00%).]

    Figure 4: False alarm rate of rule based classifiers

    [Chart. Y-axis: False alarm rate (0.00% to 10.00%).]

    Figure 5: Accuracy of rule based classifiers for class Normal and class Anomaly

    [Chart. Y-axis: Class wise Accuracy (0% to 120%). Series: Normal, Anomaly.]

    5. Conclusion and Future Scope

    The eight rule based classifiers are evaluated on the NSL-KDD dataset. The measures used for comparison are overall accuracy, class wise accuracy and false alarm rate. The Part and JRip classifiers performed best among all the algorithms, and ConjunctiveRule performed worst. The rules generated by these classifiers can be incorporated into signature based intrusion detection systems to enhance their performance.

    References

    [1] Chandolikar N. S., and Nandavadekar V. D., (2012), Comparative analysis of two algorithms for intrusion attack classification using KDD Cup dataset, International Journal of Computer Science and Engineering (IJCSE), Vol. 1, Issue 1, pp. 81-88.

    [2] Gharibian F., and Ghorbani A. A., (2007), Comparative study of supervised machine learning techniques for intrusion detection, In Communication Networks and Services Research, CNSR'07, Fifth Annual Conference on, pp. 350-358, IEEE.

    [3] Hall M., Frank E., Holmes G., Pfahringer B., Reutemann P., Witten I. H., (2009), The WEKA Data Mining Software: An Update, SIGKDD Explorations, Volume 11, Issue 1.

    [4] Hussein S. M., Ali F., and Kasiran Z., (2012), Evaluation Effectiveness of Hybrid IDS Using Snort with Naive Bayes to Detect Attacks, IEEE.

    [5] Information Assurance Tools Report: Intrusion Detection Systems, (2009), IATAC, Herndon, VA.

    [6] Kurundkar G. D., Naik N. A., and Khamitkar S. D., (2012), Network Intrusion Detection using SNORT, International Journal of Engineering Research and Applications (IJERA), Vol. 2, Issue 2, pp. 1288-1296.

    [7] NSL-KDD data set for network-based intrusion detection systems, (2009), Available on: http://nsl.cs.unb.ca/NSL-KDD/.

    [8] Oreku G. S., and Mtenzi F. J., (2009), Intrusion Detection Based on Data Mining, In Dependable, Autonomic and Secure Computing, DASC'09, Eighth IEEE International Conference on, pp. 696-701, IEEE.

    [9] Panda M., and Patra M. R., (2007), Network intrusion detection using naive Bayes, International Journal of Computer Science and Network Security, Vol. 7, No. 12.

    [10] Qazanfari K., Mirpouryan S., and Gharaee H., (2012), A Novel Hybrid Anomaly Based Intrusion Detection Method, 6th International Symposium on Telecommunications (IST'2012).

    [11] Shafi K., Abbass H. A., and Zhu W., A Methodology to Evaluate Supervised Learning Algorithms for Intrusion Detection.

    [12] Tavallaee M., Bagheri E., Lu W., and Ghorbani A. A., (2009), A Detailed Analysis of the KDD CUP 99 Data Set, Proceedings of the Second IEEE Symposium on Computational Intelligence for Security and Defence Applications.

    [13] Zhang J., Zulkernine M., and Haque A., (2008), Random-Forests-Based Network Intrusion Detection Systems, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, Vol. 38, No. 5.

    [14] Zhao D., Xu Q., Feng Z., (2010), Analysis and Design for Intrusion Detection System Based on Data Mining, Second International Workshop on Education Technology and Computer Science.