Intrusion Detection Using Neural Networks and Support Vector Machine


1

INTRUSION DETECTION USING NEURAL NETWORKS AND SUPPORT VECTOR MACHINE

Srinivas Mukkamala, Guadalupe Janoski, Andrew Sung
Dept. of CS, New Mexico Institute of Mining and Technology

IEEE WCCI IJCNN 2002, World Congress on Computational Intelligence, International Joint Conference on Neural Networks

2

Outline

Approaches to intrusion detection using neural networks and support vector machines
DARPA dataset
Neural Networks
Support Vector Machines
Experiments
Conclusion and Comments

3

Approaches

Key ideas are to discover useful patterns or features that describe user behavior on a system, and to use the set of relevant features to build classifiers that can recognize anomalies and known intrusions.

Neural networks and support vector machines are trained with normal user activity and attack patterns; significant deviations from normal behavior are flagged as attacks.

4

DARPA Data for Intrusion Detection

DARPA (Defense Advanced Research Projects Agency): an agency of the US Department of Defense responsible for the development of new technology for use by the military.

The benchmark comes from a KDD (Knowledge Discovery and Data Mining) competition designed by DARPA.

Attacks fall into four main categories:
DOS: denial of service
R2L: unauthorized access from a remote machine
U2R: unauthorized access to local super user (root) privileges
Probing: surveillance and other probing

5

Features

http://kdd.ics.uci.edu/databases/kddcup99/task.html

6

Neural Networks

[Figure: a biological neuron. The dendrites gather incoming signals, the soma (cell body) combines the signals and decides whether to trigger, and the axon sends the output signal.]

7

Divide and Conquer

[Figure: a single perceptron. INPUTs x1 and x2 are combined with WEIGHTs w1 and w2 and threshold θ; the ACTIVATION tests which side of the line in the plane w1x1 + w2x2 − θ = 0 a point falls on, separating points A, B, C, D.]

[Figure: one line cannot separate all four points, so three perceptrons N1, N2, N3 are combined. A table lists the ±1 outputs of N1, N2, N3 for each of the points A, B, C, D, and a final Σ unit combines out1, out2, out3 into the overall classification.]
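A minimal sketch of this divide-and-conquer combination (the weights and thresholds below are illustrative assumptions, not the slide's exact values):

```python
import numpy as np

def perceptron(x, w, theta):
    """One unit: the sign of w1*x1 + w2*x2 - theta, i.e. which side of the line."""
    return 1 if np.dot(w, x) - theta > 0 else -1

def network(x):
    # Three first-layer perceptrons, each drawing one line in the plane
    # (weights and thresholds here are illustrative only).
    out1 = perceptron(x, np.array([1.0, 0.0]), 0.0)    # N1
    out2 = perceptron(x, np.array([0.0, 1.0]), 0.0)    # N2
    out3 = perceptron(x, np.array([-1.0, -1.0]), 0.0)  # N3
    # A final unit combines the three +/-1 outputs into one decision.
    return perceptron(np.array([out1, out2, out3]), np.ones(3), 0.5)

print(network(np.array([1.0, 2.0])))   # -> 1
```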

8

Feed Forward Neural Network (FFNN)

[Figure: a network with Layer 1, Layer 2, Layer 3, Layer 4. In Layer 1, neuron N1 combines inputs x0(0), x1(0), x2(0) through weights w01(1), w11(1), w21(1) into the sum S1(1) and emits x1(1).]

In general, neuron Nj in layer l computes the cumulated signal

Sj(l) = Σ_{i=0}^{d(l-1)} wij(l) xi(l-1)

and passes it through the hyperbolic function

tanh(S) = (e^S − e^(−S)) / (e^S + e^(−S))

to get the activated output

xj(l) = tanh(Sj(l))

We decide the architecture; the weights are determined automatically.
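A minimal sketch of this forward computation, assuming the usual convention of a constant bias input x0 = 1 in every layer:

```python
import numpy as np

def forward(x, weights):
    """Forward pass of an FFNN. weights[l] has shape (1 + d_{l-1}, d_l);
    its row 0 holds the bias weights w_0j, and x_0 is fixed to 1."""
    xs = [np.asarray(x, dtype=float)]
    for W in weights:
        S = np.concatenate(([1.0], xs[-1])) @ W   # cumulated signals S_j(l)
        xs.append(np.tanh(S))                     # activated outputs x_j(l)
    return xs                                     # activations of every layer
```

For a 41-40-40-1 network like the one in the experiments, the weight matrices would have shapes (42, 40), (41, 40), and (41, 1).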

9

[Figure: input flows through layers of Σ units to the output; g(x) is the classifier composed of the weights w.]

Training data: {(xn, yn)}, n = 1, …, N

Error function:

E(w) = (1/N) Σ_{n=1}^{N} (g(xn) − yn)²

How to minimize E(w)? Stochastic Gradient Descent (SGD):

w is a random small value at the beginning
for T iterations:
    wnew ← wold − η · ∇w(En)   (η is the learning rate)
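A minimal SGD loop matching this update rule (grad_En is a placeholder assumption for the per-example gradient):

```python
import numpy as np

def sgd(grad_En, dim, N, eta=0.1, T=1000):
    """grad_En(w, n): gradient of the error on training example n.
    eta is the learning rate, T the number of iterations."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=dim)   # w starts as random small values
    for _ in range(T):
        n = rng.integers(N)                # visit one stochastic example
        w = w - eta * grad_En(w, n)        # w_new <- w_old - eta * grad(E_n)
    return w
```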

10

Back Propagation Algorithm

[Figure: layers 1, 2, …, L−1, L; neuron Nj in layer l computes Sj(l) from inputs xi(l-1) through weights wij(l) and outputs xj(l).]

Forward: for l = 1, 2, …, L compute Sj(l) and xj(l):

Sj(l) = Σ_{i=0}^{d(l-1)} wij(l) xi(l-1),   xj(l) = tanh(Sj(l))

For one example, the error of the single-output network is

E = (x1(L) − y)² = (tanh(S1(L)) − y)² = (tanh(Σ_i wi1(L) xi(L-1)) − y)²

By the chain rule,

∂E/∂wi1(L) = ∂E/∂S1(L) · ∂S1(L)/∂wi1(L) = 2(tanh(S1(L)) − y)(1 − tanh²(S1(L))) · xi(L-1) = δ1(L) xi(L-1)

In general,

∂E/∂wij(l) = δj(l) xi(l-1),   where δi(l-1) = Σ_j δj(l) wij(l) (1 − tanh²(Si(l-1)))

Backward: for l = L, L−1, …, 1 compute δi(l).
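A minimal sketch of one backprop training step on a single example, under the same bias convention as the forward pass above (single-output tanh network, squared error):

```python
import numpy as np

def backprop_step(x, y, weights, eta=0.1):
    """One SGD step on example (x, y). weights[l] has shape
    (1 + d_{l-1}, d_l); row 0 is the bias weight."""
    # forward: store the x_j(l) of every layer
    xs = [np.asarray(x, dtype=float)]
    for W in weights:
        xs.append(np.tanh(np.concatenate(([1.0], xs[-1])) @ W))
    # delta at the output layer: dE/dS_1(L) for E = (x_1(L) - y)^2
    delta = 2.0 * (xs[-1] - y) * (1.0 - xs[-1] ** 2)
    # backward: dE/dw_ij(l) = delta_j(l) * x_i(l-1)
    for l in range(len(weights) - 1, -1, -1):
        x_prev = np.concatenate(([1.0], xs[l]))
        grad = np.outer(x_prev, delta)
        # propagate delta one layer back, dropping the bias row (no delta for x_0)
        delta = (weights[l][1:] @ delta) * (1.0 - xs[l] ** 2)
        weights[l] = weights[l] - eta * grad
    return weights
```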

11

Feed Forward NNet

[Figure: a feed-forward network of Σ units connected by weights, layers 1 … L; neuron Nj in layer l computes Sj(l) from inputs xi(l-1) through weights wij(l).]

Consists of layers 1, 2, …, L
wij(l) connects neuron i in layer (l−1) to neuron j in layer l
Cumulated signal: Sj(l) = Σ_{i=0}^{d(l-1)} wij(l) xi(l-1)
Activated output: xj(l) = θ(Sj(l)), where θ is often tanh

Minimize E(w) and determine the weights automatically:
SGD (Stochastic Gradient Descent):
    w is a random small value at the beginning
    for T iterations: wnew ← wold − η · ∇w(En)
Gradient by back propagation: ∂E/∂wij(l) = δj(l) xi(l-1)
    Forward: compute Sj(l) and xj(l)
    Backward: compute δi(l)
Stop when the desired error rate is met.

12

Support Vector Machine

A supervised learning method
Known as the maximum margin classifier
Finds the max-margin separating hyperplane

13

SVM – hard margin

[Figure: in the (x1, x2) plane, the hyperplane <w, x> − θ = 0 separates the two classes; the margin boundaries <w, x> − θ = +1 and <w, x> − θ = −1 lie a distance 2/∥w∥ apart.]

max_{w, θ} 2/∥w∥   subject to   yn(<w, xn> − θ) ≥ 1

equivalently

argmin_{w, θ} (1/2)<w, w>   subject to   yn(<w, xn> − θ) ≥ 1

14

Quadratic programming

The standard QP form:

V* = argmin_v (1/2) Σ_i Σ_j aij vi vj + Σ_i bi vi   subject to   Σ_i rki vi ≥ qk

V* ← quadprog(A, b, R, q)

The SVM problem:

argmin_{w, θ} (1/2) Σ_{d=1}^{D} wd²   subject to   (−yn) θ + Σ_{d=1}^{D} yn (xn)d wd ≥ 1

Let V = [θ, w1, w2, …, wD]. Adapt the problem for quadratic programming: find A, b, R, q and put them into the quadratic solver.

15

Adaptation

V = [θ, w1, w2, …, wD], i.e. v0, v1, v2, …, vD

The objective (1/2) Σ_{d=1}^{D} wd² and the constraints (−yn) θ + Σ_{d=1}^{D} yn (xn)d wd ≥ 1 give:

A of size (1+D)×(1+D): a00 = 0, a0j = 0, ai0 = 0; for i, j ≠ 0: aij = 1 if i = j, 0 if i ≠ j
b of size (1+D)×1: b0 = 0 and bi = 0 for i ≠ 0
R of size N×(1+D): rn0 = −yn and rnd = yn (xn)d for d > 0
q of size N×1: qn = 1
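A minimal sketch of this adaptation, assuming the cvxopt QP solver in place of the slide's generic quadprog(A, b, R, q) interface (cvxopt minimizes (1/2)v'Pv + p'v subject to Gv ≤ h, so R v ≥ q is passed as −R v ≤ −q):

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """X: (N, D) inputs, y: (N,) labels in {-1, +1}. Returns (theta, w)."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    N, D = X.shape
    P = np.zeros((1 + D, 1 + D))
    P[1:, 1:] = np.eye(D)       # objective (1/2) sum_d w_d^2; theta unpenalized
    P[0, 0] = 1e-8              # tiny ridge so the solver sees a definite P
    p = np.zeros(1 + D)
    R = np.hstack([-y[:, None], y[:, None] * X])  # r_n0 = -y_n, r_nd = y_n (x_n)_d
    q = np.ones(N)
    sol = solvers.qp(matrix(P), matrix(p), matrix(-R), matrix(-q))
    v = np.array(sol['x']).ravel()   # v = [theta, w_1, ..., w_D]
    return v[0], v[1:]
```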

16

SVM – soft margin

Allow possible training errors.

Tradeoff c:
Large c: thinner margin, cares more about errors
Small c: thicker margin, cares less about errors

argmin_{w, θ} (1/2)<w, w> + c Σ_n ξn   subject to   yn(<w, xn> − θ) ≥ 1 − ξn,  ξn ≥ 0

(the ξn are the errors; c is the tradeoff)

17

Adaptation

V = [θ, w1, w2, …, wD, ξ1, ξ2, …, ξN]

A: (1+D+N)×(1+D+N)
b: (1+D+N)×1
R: (2N)×(1+D+N)
q: (2N)×1

18

Primal form and Dual form

Primal form:

argmin_{w, θ} (1/2)<w, w> + c Σ_n ξn   subject to   yn(<w, xn> − θ) ≥ 1 − ξn,  ξn ≥ 0

Variables: 1+D+N; Constraints: 2N

Dual form:

argmin_{0 ≤ αn ≤ C} (1/2) Σ_n Σ_m αn yn αm ym <xn, xm> − Σ_n αn   subject to   Σ_n yn αn = 0

Variables: N; Constraints: 2N+1

19

Dual form SVM

Find the optimal α*, then use α* to solve for w* and θ.

αn = 0: correctly classified, or on the margin
0 < αn < C: exactly on the margin (free support vector)
αn = C: misclassified, or on the margin

[Figure: the margin boundaries with support vectors highlighted: free SVs (0 < αn < C) lie on the margin, αn = C points lie on or inside it, αn = 0 points lie outside.]
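A minimal sketch of this recovery step, assuming alpha is the output of a dual QP solver (array names are illustrative):

```python
import numpy as np

def recover_w_theta(alpha, X, y, C, tol=1e-6):
    """Recover the primal solution from the dual optimum alpha.
    w* = sum_n alpha_n y_n x_n; theta comes from any free SV
    (0 < alpha_s < C), since y_s(<w, x_s> - theta) = 1 there
    implies theta = <w, x_s> - y_s."""
    w = (alpha * y) @ X
    free = np.where((alpha > tol) & (alpha < C - tol))[0]
    s = free[0]                    # any free support vector
    theta = X[s] @ w - y[s]
    return w, theta
```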

20

Nonlinear SVM

Nonlinear mapping X → Φ(X), e.g. {(x)1, (x)2} ∈ R² → {1, (x)1, (x)2, (x)1², (x)2², (x)1(x)2} ∈ R⁶

This needs the kernel trick: the dual only uses inner products,

argmin_{0 ≤ αn ≤ C} (1/2) Σ_n Σ_m αn yn αm ym <Φ(xn), Φ(xm)> − Σ_n αn   subject to   Σ_n yn αn = 0

and for this Φ the inner product can be computed directly as the polynomial kernel (1 + <xn, xm>)².
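A minimal sketch of the kernelized dual as a QP, again assuming the cvxopt solver:

```python
import numpy as np
from cvxopt import matrix, solvers

def dual_svm(X, y, C, kernel=lambda a, b: (1.0 + a @ b) ** 2):
    """Dual soft-margin SVM with a kernel; the default is the slide's
    polynomial kernel (1 + <xn, xm>)^2. Returns the optimal alpha."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    N = len(y)
    K = np.array([[kernel(X[n], X[m]) for m in range(N)] for n in range(N)])
    P = matrix(np.outer(y, y) * K)                  # quadratic term
    p = matrix(-np.ones(N))                         # linear term: - sum_n alpha_n
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))  # box 0 <= alpha_n <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A = matrix(y.reshape(1, N))                     # equality: sum_n y_n alpha_n = 0
    b = matrix(0.0)
    return np.array(solvers.qp(P, p, G, h, A, b)['x']).ravel()
```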

21

Experiments

Automated parsers process the raw TCP/IP dump data into machine-readable form.
7312 training records (different types of attacks and normal data), each with 41 features.
6980 testing records to evaluate the classifiers.

Pipeline: Pre-processing → Training → Testing

              Support Vector Machine      Neural Network
Details       RBF kernel, C = 1000,       3-layer 41-40-40-1 FFNN,
              204 support vectors         scaled conjugate gradient descent,
              (29 free)                   desired error rate = 0.001
Accuracy      99.5%                       99.25%
Time spent    17.77 sec                   18 min
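For comparison, a rough modern equivalent of this setup could be sketched with scikit-learn (an assumption, not the paper's implementation; scikit-learn's MLP does not offer scaled conjugate gradient, so a different optimizer stands in):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Stand-in random data with the experiment's shapes (7312 x 41 train,
# 6980 x 41 test); the real study used preprocessed KDD records.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(7312, 41)), rng.integers(0, 2, 7312)
X_test, y_test = rng.normal(size=(6980, 41)), rng.integers(0, 2, 6980)

svm = SVC(kernel='rbf', C=1000)                      # RBF kernel, C = 1000
nn = MLPClassifier(hidden_layer_sizes=(40, 40),      # 41-40-40-1 layout
                   activation='tanh', max_iter=500)  # SCG unavailable; Adam used
svm.fit(X_train, y_train)
nn.fit(X_train, y_train)
print(svm.score(X_test, y_test), nn.score(X_test, y_test))
```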

22

Conclusion and Comments

Speed: SVM training time is significantly shorter.
SVMs avoid the "curse of dimensionality" through the max-margin objective.
Accuracy: both achieve high accuracy.
SVMs can only make binary classifications, while IDS requires multiple-class identification.
How should the features be determined?
