Machine Learning Algorithms for Detection of Network Intrusions
Ljiljana Trajković, [email protected]
Communication Networks Laboratory, http://www.ensc.sfu.ca/cnl
School of Engineering Science, Simon Fraser University, Vancouver, British Columbia, Canada
[Photo] Simon Fraser University, Burnaby Campus
IWCSN 2018, Nanjing, China, October 22, 2018
Collaborators
§ Zhida Li
§ Soroush Haeri
§ Qingye Ding
§ Prerna Batta
§ Ana Laura Gonzales
§ Nabil Al Rousan
[Figure] lhr: 535,102 nodes and 601,678 links (source: http://www.caida.org/home/)
Roadmap
§ Introduction and survey of past work
§ Intrusion detection
§ Data processing
§ Classification algorithms
§ Experimental procedure
§ Performance evaluation
§ Conclusion
§ References
Motivation
§ Detecting and analyzing network anomalies and intrusions are important topics in cyber security.
§ Network intrusions may be classified using machine learning algorithms such as Recurrent Neural Networks (RNNs) and the Broad Learning System (BLS).
§ Classification models are trained and tested using the NSL-KDD dataset, which contains records of intrusion and regular network connections.
§ Performance results indicate that the BLS algorithm achieves comparable accuracy with a shorter training time.
Border Gateway Protocol
§ BGP's main function is to optimally route data between Autonomous Systems (ASes)
§ AS: a collection of BGP routers (peers) within a single administrative domain
§ Four types of BGP messages: open, keepalive, update, and notification
§ BGP anomalies: Slammer, Nimda, Code Red I, routing misconfigurations
Machine learning
§ Machine learning models classify data using a feature matrix
§ Feature matrix: rows are data points, columns are feature values
§ Algorithms: Logistic Regression, Naïve Bayes, Support Vector Machine (SVM)
§ SVM places the decision boundary geometrically midway between the support vectors (see the sketch below)
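To make the feature-matrix view concrete, here is a minimal sketch (not part of the talk) using scikit-learn; the feature values and labels are hypothetical:

```python
# Rows of the feature matrix are data points, columns are feature values;
# an SVM separates the two classes. All numbers below are made up.
import numpy as np
from sklearn.svm import SVC

X = np.array([
    [120, 4, 0],   # regular
    [130, 5, 1],   # regular
    [115, 4, 0],   # regular
    [900, 9, 7],   # anomalous
    [850, 8, 6],   # anomalous
    [940, 9, 8],   # anomalous
])
y = np.array([0, 0, 0, 1, 1, 1])     # 0 = regular, 1 = anomaly

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)          # support vectors that define the margin
print(clf.predict([[880, 9, 7]]))    # classify a new data point -> [1]
```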
Feature extraction: BGP messages
§ Extract 37 features
§ Sample every minute during a five-day period:
  § the peak day of an anomaly
  § two days prior and two days after the peak day
§ 7,200 samples for each anomalous event (5 days × 1,440 minutes per day):
  § 5,760 regular (non-anomalous) samples from the four surrounding days
  § 1,440 anomalous samples from the peak day
  § an imbalanced dataset
BGP features

| Feature | Definition | Category |
|---|---|---|
| 1 | Number of announcements | Volume |
| 2 | Number of withdrawals | Volume |
| 3 | Number of announced NLRI prefixes | Volume |
| 4 | Number of withdrawn NLRI prefixes | Volume |
| 5 | Average AS-path length | AS-path |
| 6 | Maximum AS-path length | AS-path |
| 7 | Average unique AS-path length | AS-path |
| 8 | Number of duplicate announcements | Volume |
| 9 | Number of duplicate withdrawals | Volume |
| 10 | Number of implicit withdrawals | Volume |
BGP features (cont.)

| Feature | Definition | Category |
|---|---|---|
| 11 | Average edit distance | AS-path |
| 12 | Maximum edit distance | AS-path |
| 13 | Inter-arrival time | Volume |
| 14–24 | Maximum edit distance = n, where n = 7, ..., 17 | AS-path |
| 25–33 | Maximum AS-path length = n, where n = 7, ..., 15 | AS-path |
| 34 | Number of IGP packets | Volume |
| 35 | Number of EGP packets | Volume |
| 36 | Number of incomplete packets | Volume |
| 37 | Packet size (B) | Volume |
Feature extraction: BGP messages
§ BGP enables the exchange of routing information between gateway routers using update messages
§ Collections of BGP update messages:
  § Réseaux IP Européens (RIPE) Routing Information Service (RIS) project
  § Route Views
§ Available in Multi-threaded Routing Toolkit (MRT) binary format (a parsing sketch follows)
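One possible way to read these collections in code, assuming the pybgpstream package (the CAIDA BGPStream Python bindings, which fetch RIPE RIS MRT archives); the collector name and time window are illustrative assumptions, not the talk's exact setup:

```python
# Count BGP announcements per one-minute bin around the Slammer event.
from collections import Counter
import pybgpstream

stream = pybgpstream.BGPStream(
    from_time="2003-01-23 00:00:00", until_time="2003-01-28 00:00:00",
    collectors=["rrc04"],          # one RIPE RIS collector, chosen for illustration
    record_type="updates")

per_minute = Counter()
for elem in stream:
    if elem.type == "A":           # "A" = announcement, "W" = withdrawal
        per_minute[int(elem.time) // 60] += 1   # feature 1: announcement volume
print(len(per_minute), "one-minute samples")
```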
BGP anomalies
§ Slammer: infected Microsoft SQL servers through a small piece of code that generated IP addresses at random
§ Nimda: exploited vulnerabilities in Microsoft Internet Information Services (IIS) web servers and Internet Explorer 5
§ Code Red I: attacked Microsoft IIS web servers by replicating itself through IIS server weaknesses
Duration of BGP events

| Anomaly | Date | Anomaly (min) | Regular (min) |
|---|---|---|---|
| Slammer | January 25, 2003 | 869 | 6,331 |
| Nimda | September 18-20, 2001 | 3,521 | 3,679 |
| Code Red I | July 19, 2001 | 600 | 6,600 |
[Figure] Number of BGP announcements: Slammer
[Figure] Number of BGP announcements: Nimda
[Figure] Number of BGP announcements: Code Red I
[Figure] Number of BGP announcements: Regular
Feature selection
§ Reduces redundancy among features and improves the classification accuracy
§ The decision tree algorithm was used for feature selection:
  § one of the most successful techniques for supervised classification learning
  § handles both numerical and categorical features
  § publicly available software tool: C5.0 (a sketch with a stand-in tree follows)
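A sketch of tree-based feature selection; the talk used C5.0, and scikit-learn's CART decision tree is used here as a freely available stand-in, on placeholder data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(7200, 37))      # placeholder for the 37 BGP features
y = rng.integers(0, 2, size=7200)    # placeholder labels (regular / anomaly)

tree = DecisionTreeClassifier(max_depth=8).fit(X, y)
# Features with non-zero importance appear in the constructed tree; the
# rest (features 30-33, and sometimes 22, in the talk) would be removed.
selected = np.flatnonzero(tree.feature_importances_ > 0) + 1   # 1-based IDs
print("Selected features:", selected)
```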
Feature selection: decision tree

| Dataset | Training data | Selected features |
|---|---|---|
| Dataset 1 | Slammer + Nimda | 1–21, 23–29, 34–37 |
| Dataset 2 | Slammer + Code Red I | 1–22, 24–29, 34–37 |
| Dataset 3 | Code Red I + Nimda | 1–29, 34–37 |

§ Either four (30, 31, 32, 33) or five (22, 30, 31, 32, 33) features are removed in the constructed trees, mainly because the features are numerical and some are used repeatedly.
Support Vector Machine
§ SVM defines a separating hyperplane to assign the target variables into distinct categories
§ It is a non-probabilistic binary classifier
§ Used for classification problems and in pattern recognition applications
§ A modified version of logistic regression
Support Vector Machine
§ For a given dataset with $n$ training data points, SVM finds the maximum-margin hyperplane separating different classes of data:

$$D = \{(x_i, y_i)\}, \quad x_i \in \mathbb{R}^p, \; y_i \in \{1, -1\}, \; \forall i = 1, 2, \dots, n,$$

where $x_i$ is the $p$-dimensional input vector and $y_i$ is the output value (1 or $-1$).
§ The decision boundary separating the two classes is given by:

$$w_0^T x + b = 0,$$

where $w_0$ is the optimal weight vector and $b$ is the bias.
§ For linearly separable training data, the margins are defined as $w_0^T x + b = 1$ and $w_0^T x + b = -1$.
Support Vector Machine

[Figure] SVM with linear kernel: correctly classified regular (circles) and anomalous (stars) data points, as well as one incorrectly classified regular (circle) data point
Support Vector Machine
§ The distance between the margins is $2/\|w_0\|$
§ The objective is to minimize $\|w_0\|$
§ The maximum-margin hyperplane is obtained by solving:

$$\min_{w, \xi} \; C \sum_{i=1}^{n} \xi_i + \frac{1}{2} \|w\|^2,$$

with constraints $y_i f(x_i) \geq 1 - \xi_i, \; i = 1, \dots, n$, where $C$ is the regularization parameter that trades off the separation of the two classes against the error on the training dataset, $y_i$ is the target value, and $\xi_i$ are the slack variables (see the sketch below).
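A small sketch (assumed, not from the talk) of how the regularization parameter $C$ shapes the soft margin, using scikit-learn on synthetic data:

```python
# Larger C penalizes slack more heavily: fewer margin violations are
# tolerated and typically fewer points remain support vectors.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors")
```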
Support Vector Machine: kernel trick
§ Instead of calculating each mapping, the "kernel trick" is used to directly calculate the inner product in the input space
§ The mapping defines the feature space and generates a decision boundary for the input data points
§ Using the "kernel trick" reduces the complexity of the optimization problem, which now depends only on the input space instead of the feature space (a worked check follows)
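A worked check of the kernel trick for the homogeneous quadratic kernel $K(x, z) = (x \cdot z)^2$, which equals the inner product under the explicit map $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$; the vectors are arbitrary examples:

```python
import numpy as np

def phi(v):                       # explicit map into the 3-D feature space
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
print(np.dot(x, z) ** 2)          # kernel in input space: (1*3 + 2*4)^2 = 121
print(np.dot(phi(x), phi(z)))     # same value via feature space: 121.0
```

The kernel evaluates the feature-space inner product without ever constructing $\phi(x)$, which is what keeps the optimization in the input space.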
Support Vector Machine
§ Instead of employing the minimization model directly, the problem may be formulated using the Lagrangian dual with multipliers $\alpha$:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle,$$

subject to:

$$0 \leq \alpha_i \leq C \quad \forall i = 1, 2, \dots, n \qquad \text{and} \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
Support Vector Machine

[Figure] SVM with a nonlinear kernel function: the three-dimensional space shows a hyperplane dividing regular (circles) and anomalous (stars) data points
Experimental procedure
§ Step 1: Use all 37 features, or select the most relevant features using the decision tree algorithm
§ Step 2: Train the SVM with linear, quadratic, or cubic kernels
§ Step 3: Test the models using various datasets
§ Step 4: Evaluate the SVM kernels based on accuracy and F-Score (a pipeline sketch follows this list)
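A sketch of Steps 1–4 as one pipeline (an assumed implementation: the talk used MATLAB, and the data loading below is a synthetic placeholder for the BGP feature matrices):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in: 7,200 samples, 37 features, imbalanced like the BGP data.
X, y = make_classification(n_samples=7200, n_features=37,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, params in [("linear", dict(kernel="linear")),
                     ("quadratic", dict(kernel="poly", degree=2)),
                     ("cubic", dict(kernel="poly", degree=3))]:
    clf = SVC(**params).fit(X_train, y_train)         # Step 2
    pred = clf.predict(X_test)                        # Step 3
    print(name, accuracy_score(y_test, pred),         # Step 4
          f1_score(y_test, pred))
```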
Training and test datasets

| | Training dataset | Test dataset |
|---|---|---|
| Dataset 1 | Slammer and Nimda | Code Red I |
| Dataset 2 | Nimda and Code Red I | Slammer |
Experimental procedure
§ MATLAB 2017b Statistics and Machine Learning Toolbox
§ The performance of SVM with various kernels is evaluated using combinations of datasets
§ SVM performance is measured based on accuracy and F-Score
§ The confusion matrix is used to evaluate the performance of classification algorithms
§ True positives (TP) and false negatives (FN) are the numbers of anomalous data points classified as anomalous and regular, respectively
Performance measures
§ Accuracy: (TP + TN)/(TP + TN + FP + FN)
§ F-Score is the harmonic mean of precision and sensitivity: 2 × (precision × sensitivity)/(precision + sensitivity)
  § precision: TP/(TP + FP)
  § sensitivity: TP/(TP + FN)
§ A worked example with hypothetical counts follows this list
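A worked example; the confusion-matrix counts below are hypothetical, chosen only to illustrate the formulas:

```python
TP, TN, FP, FN = 900, 5200, 300, 540          # assumed counts, not the talk's results

accuracy = (TP + TN) / (TP + TN + FP + FN)    # 6100 / 6940 ~= 0.879
precision = TP / (TP + FP)                    # 900 / 1200 = 0.75
sensitivity = TP / (TP + FN)                  # 900 / 1440 = 0.625
f_score = 2 * precision * sensitivity / (precision + sensitivity)   # ~= 0.682
print(accuracy, precision, sensitivity, f_score)
```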
SVM with linear kernel

| Selected features | Training dataset | Accuracy (%) Test | Accuracy (%) RIPE | Accuracy (%) BCNET | F-Score (%) Test |
|---|---|---|---|---|---|
| 1–37 | Dataset 1 | 72.76 | 61.34 | 54.21 | 73.60 |
| 1–37 | Dataset 2 | 70.81 | 52.89 | 45.36 | 73.19 |
| 1–21, 23–29, 34–37 | Dataset 1 | 74.71 | 63.26 | 55.39 | 76.29 |
| 1–21, 23–29, 34–37 | Dataset 2 | 73.27 | 54.12 | 49.38 | 74.48 |
SVM with quadratic kernel

| Selected features | Training dataset | Accuracy (%) Test | Accuracy (%) RIPE | Accuracy (%) BCNET | F-Score (%) Test |
|---|---|---|---|---|---|
| 1–37 | Dataset 1 | 58.55 | 52.73 | 43.68 | 58.85 |
| 1–37 | Dataset 2 | 61.27 | 42.87 | 35.52 | 63.15 |
| 1–21, 23–29, 34–37 | Dataset 1 | 63.84 | 58.51 | 46.39 | 67.24 |
| 1–21, 23–29, 34–37 | Dataset 2 | 63.36 | 46.55 | 38.73 | 64.68 |
SVM with cubic kernel

| Selected features | Training dataset | Accuracy (%) Test | Accuracy (%) RIPE | Accuracy (%) BCNET | F-Score (%) Test |
|---|---|---|---|---|---|
| 1–37 | Dataset 1 | 65.33 | 54.31 | 45.53 | 68.55 |
| 1–37 | Dataset 2 | 63.15 | 46.23 | 40.17 | 65.49 |
| 1–21, 23–29, 34–37 | Dataset 1 | 69.21 | 58.12 | 49.26 | 70.14 |
| 1–21, 23–29, 34–37 | Dataset 2 | 67.79 | 49.78 | 42.36 | 69.55 |
SVM algorithm and kernels
§ The SVM algorithm is one of the most efficient ML tools
§ Kernels are used to transform the input data into a high-dimensional space
§ Their performance depends on both the feature selection and the type of dataset
§ The analyzed BGP anomaly datasets are linearly separable
§ SVM with the linear kernel outperforms SVMs with quadratic and cubic kernels
Intrusion detection
Various detection systems have been designed using machine learning techniques that help detect malicious intentions of network users.
§ Classification algorithms: J48, Naïve Bayes (NB), NB Tree, Random Forests (RF), Random Tree (RT), Multilayer Perceptron (MP), Support Vector Machine (SVM)
§ Deep learning algorithms: Network Intrusion Detection System (NIDS) and Recurrent Neural Network Intrusion Detection System (RNN-IDS)
Intrusion detection (cont.)
§ Hybrid framework: Binary Classifier (BC) modules based on the C4.5 algorithm, an aggregation module, and a k-NN module
§ Recurrent Neural Networks (RNNs): Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Bidirectional LSTM (Bi-LSTM)
§ Broad Learning System (BLS): an alternative to deep learning networks, with an increased number of mapped features and enhancement nodes (a minimal sketch of the idea follows)
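A minimal numpy sketch of the BLS idea (after Chen and Liu, 2018): random mapped-feature nodes and enhancement nodes are concatenated, and only the output weights are learned, via ridge regression. Node counts, the activation, and the regularization constant are illustrative assumptions, not the talk's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 122))           # placeholder for encoded NSL-KDD features
Y = np.eye(2)[rng.integers(0, 2, 1000)]    # one-hot labels (binary case)

Wf = rng.normal(size=(X.shape[1], 100))    # random mapped-feature weights
Z = np.tanh(X @ Wf)                        # mapped feature nodes
We = rng.normal(size=(Z.shape[1], 300))    # random enhancement weights
H = np.tanh(Z @ We)                        # enhancement nodes
A = np.hstack([Z, H])                      # broad layer: [Z | H]

lam = 1e-3                                 # ridge regularization
W = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Y)
pred = (A @ W).argmax(axis=1)
print("train accuracy:", (pred == Y.argmax(axis=1)).mean())
```

Because only the output weights W are solved for in closed form, training is fast, which matches the short BLS training times reported later in the talk.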
Data Processing
§ NSL-KDD dataset: types of intrusion attacks
https://web.archive.org/web/20150205070216/http://nsl.cs.unb.ca/NSL-KDD/

| Type | Intrusion attacks |
|---|---|
| DoS (denial-of-service) | back, land, neptune, pod, smurf, teardrop, mailbomb, processtable, udpstorm, apache2, worm |
| U2R (user-to-root) | buffer-overflow, loadmodule, perl, rootkit, sqlattack, xterm, ps |
| R2L (remote-to-local) | ftp-write, guess-passwd, imap, multihop, phf, spy, warezmaster, xlock, xsnoop, snmpguess, snmpgetattack, httptunnel, sendmail, named |
| Probe | ipsweep, nmap, portsweep, satan, mscan, saint |
Data Processing
§ NSL-KDD dataset: number of data points

| | Regular | DoS | U2R | R2L | Probe | Total |
|---|---|---|---|---|---|---|
| KDDTrain+ | 67,343 | 45,927 | 52 | 995 | 11,656 | 125,973 |
| KDDTest+ | 9,711 | 7,458 | 200 | 2,754 | 2,421 | 22,544 |
| KDDTest-21 | 2,152 | 4,342 | 200 | 2,754 | 2,402 | 11,850 |
Classification Algorithms
§ [Figure] Repeating module for the Long Short-Term Memory (LSTM) neural network
§ [Figure] Repeating module for the Gated Recurrent Unit (GRU) neural network
§ [Figure] Deep learning neural network model (a sketch of the three RNN variants follows this list)
§ [Figure] Module of the Broad Learning System algorithm with increments of mapped features, enhancement nodes, and new input data
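A sketch of the three RNN variants as classifiers (assuming TensorFlow/Keras; the talk does not name the framework, and the input shape and layer sizes are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

def make_model(cell):
    return tf.keras.Sequential([
        tf.keras.Input(shape=(1, 122)),          # one time step of 122 encoded features
        cell,                                    # the recurrent "repeating module"
        layers.Dense(5, activation="softmax"),   # five-way: Regular/DoS/U2R/R2L/Probe
    ])

models = {
    "LSTM": make_model(layers.LSTM(80)),
    "GRU": make_model(layers.GRU(80)),
    "Bi-LSTM": make_model(layers.Bidirectional(layers.LSTM(80))),
}
for name, m in models.items():
    m.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
    print(name, m.count_params())
```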
Experimental Procedure
• Step 1: Convert categorical features into numerical features using dummy coding for the training and test datasets
• Step 2: Normalize the training and test datasets (see the sketch after this list)
• Step 3: Tune the model parameters using 10-fold cross-validation
• Step 4: Test the LSTM, GRU, Bi-LSTM, and BLS models using the KDDTest+ and KDDTest-21 datasets
• Step 5: Evaluate the derived models based on accuracy and F-Score for binary and multiple classes
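A sketch of Steps 1 and 2 (an assumed implementation; the toy frame stands in for NSL-KDD rows, which mix numerical fields with categorical ones such as protocol_type):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train = pd.DataFrame({"duration": [0, 2, 151],
                      "protocol_type": ["tcp", "udp", "tcp"]})
test = pd.DataFrame({"duration": [5, 0],
                     "protocol_type": ["icmp", "tcp"]})

# Step 1: dummy coding; align test columns to the training columns
# (categories unseen in training, e.g. "icmp" here, are dropped).
train_enc = pd.get_dummies(train, columns=["protocol_type"])
test_enc = pd.get_dummies(test, columns=["protocol_type"]).reindex(
    columns=train_enc.columns, fill_value=0)

# Step 2: normalization; the scaler is fit on the training data only.
scaler = MinMaxScaler().fit(train_enc)
X_train = scaler.transform(train_enc)
X_test = scaler.transform(test_enc)
```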
Performance Evaluation

Two-way classification:

| | LSTM | GRU | Bi-LSTM | BLS |
|---|---|---|---|---|
| Accuracy (%), KDDTest+ | 82.68 | 82.87 | 81.03 | 84.14 |
| Accuracy (%), KDDTest-21 | 64.32 | 65.42 | 64.31 | 72.64 |
| F-Score (%), KDDTest+ | 82.76 | 83.05 | 81.23 | 84.68 |
| F-Score (%), KDDTest-21 | 73.18 | 74.60 | 73.49 | 80.61 |

Five-way classification:

| | LSTM | GRU | Bi-LSTM | BLS |
|---|---|---|---|---|
| Accuracy (%), KDDTest+ | 79.56 | 80.17 | 79.44 | 82.47 |
| Accuracy (%), KDDTest-21 | 60.51 | 60.75 | 60.80 | 70.30 |
Performance Evaluation

Two-way classification:

| | NB Tree [1] | RT [1] | NIDS [2] | RNN-IDS [3] |
|---|---|---|---|---|
| Accuracy (%), KDDTest+ | 82.02 | 81.59 | 75.75 | 83.28 |
| Accuracy (%), KDDTest-21 | 66.16 | 58.51 | N/A | 68.55 |
Performance Evaluation

Five-way classification:

| | J48 [3] | NB [3] | NB Tree [3] | MP [3] |
|---|---|---|---|---|
| Accuracy (%), KDDTest+ | 74.60 | 74.40 | 75.40 | 78.10 |
| Accuracy (%), KDDTest-21 | 51.90 | 55.77 | 55.40 | 58.40 |

| | LSTM | GRU | Bi-LSTM | BLS | RNN-IDS [3] |
|---|---|---|---|---|---|
| Training time (s) | 355.86 | 345.04 | 497.66 | 21.92 | 5,516.00 |
Conclusion
§ Three types of RNNs and a BLS have been employed to detect network intrusions.
§ On the KDDTest+ and KDDTest-21 datasets, BLS shows better performance than LSTM, GRU, Bi-LSTM, and most reported results.
§ BLS performance depends on the number of mapped features and enhancement nodes.
§ While additional mapped features and enhancement nodes improve BLS performance, they require more memory and longer training time.
§ An advantage of the BLS model is that it requires considerably shorter training time than conventional deep learning networks.
References
[1] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, "A detailed analysis of the KDD CUP 99 data set," in Proc. IEEE Symp. Comput. Intell. in Security and Defense Appl. (CISDA), Ottawa, ON, Canada, July 2009, pp. 1–6.
[2] T. A. Tang, L. Mhamdi, D. McLernon, S. A. R. Zaidi, and M. Ghogho, "Deep learning approach for network intrusion detection in software defined networking," in Proc. Wireless Netw. Mobile Commun. (WINCOM), Fez, Morocco, Oct. 2016, pp. 258–263.
[3] C. Yin, Y. Zhu, J. Fei, and X. He, "A deep learning approach for intrusion detection using recurrent neural networks," IEEE Access, vol. 5, pp. 21954–21961, Nov. 2017.
[4] L. Li, Y. Yu, S. Bai, Y. Hou, and X. Chen, "An effective two-step intrusion detection approach based on binary classification and k-NN," IEEE Access, vol. 6, pp. 12060–12073, Mar. 2018.
[5] C. L. P. Chen and Z. Liu, "Broad learning system: an effective and efficient incremental learning system without the need for deep architecture," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 1, pp. 10–24, Jan. 2018.
[6] NSL-KDD Data Set [Online]. Available: https://web.archive.org/web/20150205070216/http://nsl.cs.unb.ca/NSL-KDD/. Accessed: July 18, 2018.
References: sources of data
§ RIPE RIS raw data: http://www.ripe.net/data-tools/
§ University of Oregon Route Views project: http://www.routeviews.org/
§ BCNET: http://www.bc.net/
§ NSL-KDD Data Set: https://web.archive.org/web/20150205070216/http://nsl.cs.unb.ca/NSL-KDD/
Publications: http://www.sfu.ca/~ljilja/cnl
§ Z. Li, P. Batta, and Lj. Trajković, "Comparison of machine learning algorithms for detection of network intrusions," in Proc. IEEE International Conference on Systems, Man, and Cybernetics (SMC 2018), Miyazaki, Japan, Oct. 2018, pp. 4238–4243.
§ P. Batta, M. Singh, Z. Li, Q. Ding, and Lj. Trajković, "Evaluation of support vector machine kernels for detecting network anomalies," in Proc. IEEE Int. Symp. Circuits and Systems, Florence, Italy, May 2018, pp. 1–4.
§ Q. Ding, Z. Li, S. Haeri, and Lj. Trajković, "Application of machine learning techniques to detecting anomalies in communication networks: datasets and feature selection algorithms," in Cyber Threat Intelligence, M. Conti, A. Dehghantanha, and T. Dargahi, Eds., Berlin: Springer, 2018, pp. 47–70.
§ Q. Ding, Z. Li, S. Haeri, and Lj. Trajković, "Application of machine learning techniques to detecting anomalies in communication networks: classification algorithms," in Cyber Threat Intelligence, M. Conti, A. Dehghantanha, and T. Dargahi, Eds., Berlin: Springer, 2018, pp. 71–92.
§ Q. Ding, Z. Li, P. Batta, and Lj. Trajković, "Detecting BGP anomalies using machine learning techniques," in Proc. IEEE International Conference on Systems, Man, and Cybernetics (SMC 2016), Budapest, Hungary, Oct. 2016, pp. 3352–3355.
Publications: http://www.sfu.ca/~ljilja/cnl
§ Y. Li, H. J. Xing, Q. Hua, X.-Z. Wang, P. Batta, S. Haeri, and Lj. Trajković, "Classification of BGP anomalies using decision trees and fuzzy rough sets," in Proc. IEEE International Conference on Systems, Man, and Cybernetics (SMC 2014), San Diego, CA, USA, Oct. 2014, pp. 1312–1317.
§ T. Farah and Lj. Trajković, "Anonym: a tool for anonymization of the Internet traffic," in Proc. 2013 IEEE International Conference on Cybernetics (CYBCONF 2013), Lausanne, Switzerland, June 2013, pp. 261–266.
§ N. Al-Rousan, S. Haeri, and Lj. Trajković, "Feature selection for classification of BGP anomalies using Bayesian models," in Proc. International Conference on Machine Learning and Cybernetics (ICMLC 2012), Xi'an, China, July 2012, pp. 140–147.
§ N. Al-Rousan and Lj. Trajković, "Machine learning models for classification of BGP anomalies," in Proc. IEEE Conf. High Performance Switching and Routing (HPSR 2012), Belgrade, Serbia, June 2012, pp. 103–108.
§ T. Farah, S. Lally, R. Gill, N. Al-Rousan, R. Paul, D. Xu, and Lj. Trajković, "Collection of BCNET BGP traffic," in Proc. 23rd ITC, San Francisco, CA, USA, Sept. 2011, pp. 322–323.
§ S. Lally, T. Farah, R. Gill, R. Paul, N. Al-Rousan, and Lj. Trajković, "Collection and characterization of BCNET BGP traffic," in Proc. 2011 IEEE Pacific Rim Conf. Communications, Computers and Signal Processing, Victoria, BC, Canada, Aug. 2011, pp. 830–835.