Knowledge Extraction from Support Vector Machines: A Fuzzy Logic Approach

“A certain class of SVMs is mathematically equivalent to a FARB.”

What is an SVM?

What does it do?

It learns a hyperplane to classify data into two classes.

What is a hyperplane?

A hyperplane is a function like the equation for a line,

𝑦 = 𝑚𝑥 + 𝑏

In fact, for a simple classification task with just two features, the hyperplane can be a line.

Among all the hyperplanes that separate the two classes, the SVM finds the optimal one.

Support Vector Machine

SVM attempts to maximize the margin, so that the hyperplane is just as far from the nearest red ball as from the nearest blue ball. In this way, it decreases the chance of misclassification.

More Formally

Input:

set of (input, output) training pair samples.

Output:

a set of weights 𝑤 (or 𝑤𝑖), one for each feature, whose linear combination predicts the value of 𝑦.

We use the optimization of maximizing the margin (the ‘street width’) to reduce the number of nonzero weights to just a few, corresponding to the important features that ‘matter’ in deciding the separating line (hyperplane). These nonzero weights correspond to the support vectors (because they ‘support’ the separating hyperplane).
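A minimal sketch of this input/output view, using scikit-learn (not part of the original slides; the toy data and the C value are made up for illustration):

```python
# Sketch only: a linear SVM fit with scikit-learn, reading off the learned
# weights, bias, and support vectors.
import numpy as np
from sklearn.svm import SVC

# Made-up (input, output) training pairs with 2 features and 2 classes
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.5],
              [6.0, 1.0], [7.0, 2.5], [8.0, 2.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)  # large C ~ hard margin

print("weights w:", clf.coef_[0])                 # one weight per feature
print("bias b:", clf.intercept_[0])
print("support vectors:", clf.support_vectors_)   # the few points that 'matter'
```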

The optimization problem

$$\min_{\mathbf{w}} \; f(\mathbf{w}) \equiv \frac{1}{2}\lVert\mathbf{w}\rVert^2$$

subject to

$$g_i(\mathbf{w}, b) \equiv -y_i(\mathbf{w}\cdot\mathbf{x}_i + b) + 1 \le 0, \qquad i = 1, \dots, m$$

We use Lagrange multipliers to get this problem into a form that can be solved analytically.
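For the curious reader, this is the standard Lagrangian/dual derivation the slide alludes to (textbook material, not spelled out on the slides):

```latex
% Hard-margin SVM: Lagrangian and dual (standard textbook derivation).
\begin{align*}
L(\mathbf{w}, b, \boldsymbol{\alpha})
  = \tfrac{1}{2}\lVert\mathbf{w}\rVert^2
  - \sum_{i=1}^{m} \alpha_i \left[ y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 \right],
  \qquad \alpha_i \ge 0.
\end{align*}
% Stationarity (dL/dw = 0 and dL/db = 0) gives
\begin{align*}
\mathbf{w} = \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i,
\qquad
\sum_{i=1}^{m} \alpha_i y_i = 0,
\end{align*}
% and substituting back yields the dual problem:
\begin{align*}
\max_{\boldsymbol{\alpha} \ge 0}
  \sum_{i=1}^{m} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m}
    \alpha_i \alpha_j y_i y_j \, (\mathbf{x}_i \cdot \mathbf{x}_j)
\quad \text{s.t.} \quad \sum_{i=1}^{m} \alpha_i y_i = 0.
\end{align*}
% Only a few alpha_i are nonzero at the optimum; those index the support vectors.
```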

What if things get more complicated?

Throw the balls in the air. While the balls are in the air, thrown up in just the right way, you use a large sheet of paper to divide them.

This is the idea of mapping the data to a high-dimensional space.

Kernel

polynomial: $K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + c)^p$

Gaussian radial basis function: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2 / (2\sigma^2)\right)$

SVM does its thing, maps them into a higher dimension and then finds the hyperplane to separate the classes.
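A sketch of this in action (illustrative only; the dataset and gamma value are made up): points in two concentric rings have no separating line in 2-D, but an RBF-kernel SVM separates them after the implicit high-dimensional mapping.

```python
# Kernel trick demo: linearly inseparable rings become separable with RBF.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)  # gamma plays the role of 1/(2*sigma^2)

print("linear kernel accuracy:", linear.score(X, y))  # roughly chance level
print("RBF kernel accuracy:", rbf.score(X, y))        # close to 1.0
```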

Where does SVM get its name from?

• The decision function is fully specified by a (usually very small) subset of training samples, the support vectors.

• Support vectors are the data points that lie closest to the decision surface (or hyperplane)

• They are the data points most difficult to classify

• They have direct bearing on the optimum location of the decision surface

• They ‘support’ the separating hyperplane.

Knowledge Extraction

Extracting the knowledge learned by a black-box classifier and representing it in a comprehensible form.

Knowledge Extraction

Benefits :

• Validation

• Feature extraction

• Knowledge refinement and improvement

• Knowledge acquisition for symbolic AI systems

• Scientific discovery

Knowledge Extraction: Rule Extraction

• Methods for rule extraction (RE) from ANNs have been classified into three categories:

Decompositional

Pedagogical

Eclectic

decompositional approach for KE

SVM: the IO mapping of the trained SVM, f : ℝⁿ → {−1, 1}, is given by f(x) = sign(h(x)), with h(x) as defined in (2) below.

decompositional approach for KE

“A certain class of SVMs is mathematically equivalent to a FARB.”

What is a FARB (Fuzzy All-permutations Rule-Base)!?

Let’s take an example first !

Example:

Input: q ∈ ℝ

Output: O ∈ ℝ

Parameters: a0, a1, k ∈ ℝ, with k > 0.

Rules:

R1: If q is equal to k Then O = a0 + a1,

R2: If q is equal to −k Then O = a0 − a1,

- Linguistic terms: equal to k, equal to −k

- To express fuzziness, Gaussian membership functions are used:

$$\mu_{=k}(q) = \exp\left(-\frac{(q-k)^2}{2\sigma^2}\right), \qquad \mu_{=-k}(q) = \exp\left(-\frac{(q+k)^2}{2\sigma^2}\right)$$

These functions satisfy:

$$\frac{\mu_{=k}(q) - \mu_{=-k}(q)}{\mu_{=k}(q) + \mu_{=-k}(q)} = \tanh\left(\frac{kq}{\sigma^2}\right)$$

Applying the singleton fuzzifier and center-of-gravity defuzzifier yields:

$$O = \frac{(a_0 + a_1)\,\mu_{=k}(q) + (a_0 - a_1)\,\mu_{=-k}(q)}{\mu_{=k}(q) + \mu_{=-k}(q)} = a_0 + a_1 \tanh\left(\frac{kq}{\sigma^2}\right)$$

But what does this output mean!?

Take a deeper look!

It is a feedforward ANN with a single neuron, employing the activation function tanh()!

So: this FRB is equivalent to an ANN.
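A quick numeric check of this claim (a sketch; the values of a0, a1, k and the Gaussian width σ are arbitrary choices, not from the slides):

```python
# Verify that the two-rule FRB above collapses to a single tanh neuron.
import numpy as np

def mu(q, center, sigma):
    """Gaussian membership function for the term 'equal to center'."""
    return np.exp(-(q - center) ** 2 / (2 * sigma ** 2))

a0, a1, k, sigma = 0.3, 1.7, 2.0, 1.5
q = np.linspace(-5.0, 5.0, 201)

# Center-of-gravity defuzzification over the two rule consequents
w_pos, w_neg = mu(q, k, sigma), mu(q, -k, sigma)
cog = ((a0 + a1) * w_pos + (a0 - a1) * w_neg) / (w_pos + w_neg)

# The single-neuron feedforward ANN the slides describe
neuron = a0 + a1 * np.tanh(k * q / sigma ** 2)

print(np.allclose(cog, neuron))  # True
```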

This FRB, in particular, satisfies the definition of a FARB, whose rules have the form (7) and whose output has the closed form (8) below.

To get the output, apply the same steps as in the example.

But how is this output anything like the one in the example?!

Many other MFs satisfy this output form, given specific values of z, u, v, r and g, such as the Logistic function and others.

Apply: zᵢ = uᵢ = 1, vᵢ = rᵢ = 0, and gᵢ(x) = tanh(x), and the example's output is recovered.

Result

Kolman and Margaliot: every standard ANN has a corresponding FARB. There is a transformation T between them.

This work extends that result: a certain class of SVMs satisfies an analogous transformation P.

“A certain class of SVMs is mathematically equivalent to a FARB.”

The SVM-FARB Equivalence

The trained SVM computes $f(x) = \operatorname{sign}(h(x))$, where

$$h(x) = b^* + \sum_{i=1}^{N_{sv}} \alpha_i^* y_i K(x, s^i) \qquad (2)$$

The FARB output has the closed form

$$O(q) = a_0 + \sum_{i=1}^{m} r_i a_i + \sum_{i=1}^{m} z_i a_i \, g_i(u_i q_i + v_i) \qquad (8)$$

Theorem 2 (SVM-FARB equivalence). Find a FARB with $m = N_{sv}$ and parameters $q_i, a_0, a_i, z_i, r_i, u_i, v_i, g_i$ such that the following conditions hold:

$$a_0 + \sum_{i=1}^{m} r_i a_i = b^*, \qquad z_i a_i = \alpha_i^* y_i, \qquad g_i(u_i q_i + v_i) = K(x, s^i) \qquad (15)$$

Then $O(q) = h(x)$, i.e. the FARB and the SVM are mathematically equivalent.

Pause and Think

• Let's say we have a FARB over inputs q₁, …, q_m.

• How many rules have we got?

Each rule has the form

$$\text{If } q_1 \text{ is } term_1^{\pm} \text{ and } \dots \text{ and } q_m \text{ is } term_m^{\pm} \text{ Then } O = a_0 \pm a_1 \pm \dots \pm a_m \qquad (7)$$

with one rule for every permutation of the signs, so a FARB with m inputs has $2^m$ rules.

Famous SVM Kernels

$$K(x, y) = x^T y \qquad \text{(linear kernel)}$$

$$K(x, y) = (1 + x^T y / c)^d, \quad c > 0, \; d \in \mathbb{N} \qquad \text{(polynomial kernel)}$$

$$K(x, y) = \tanh(\rho \, x^T y - \delta), \quad \rho > 0, \; \delta \ge 0 \qquad \text{(MLP kernel)}$$

$$K(x, y) = \exp\left(-\lVert x - y \rVert^2 / (2\hat{\sigma}^2)\right) \qquad \text{(RBF or Gaussian kernel)}$$
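The same four kernels written out in NumPy (a sketch; the parameter names rho, delta, c, d, sigma_hat follow the slide notation, and the default values are arbitrary):

```python
# The four classic SVM kernels as plain NumPy functions.
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, c=1.0, d=3):
    return (1 + x @ y / c) ** d

def mlp_kernel(x, y, rho=0.5, delta=0.0):
    return np.tanh(rho * (x @ y) - delta)

def rbf_kernel(x, y, sigma_hat=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma_hat ** 2))

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear_kernel, polynomial_kernel, mlp_kernel, rbf_kernel):
    print(k.__name__, k(x, y))
```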

Corollary 1 {MLP kernel}

For an SVM with the MLP kernel, the output (2) becomes

$$h(x) = \sum_{i=1}^{N_{sv}} \alpha_i^* y_i \tanh(\rho \, x^T s^i - \delta) + b^* \qquad (17)$$

Choose membership functions satisfying

$$\frac{\mu_{k}(q) - \mu_{-k}(q)}{\mu_{k}(q) + \mu_{-k}(q)} = \tanh\left(\frac{q - k}{2}\right)$$

and set $q_i = 2\rho \, x^T s^i$, $k_i = 2\delta$.

Then these parameters will satisfy the conditions in (15):

$$z_i = 1, \quad r_i = 0, \quad u_i = 1/2, \quad v_i = -k_i/2, \quad g_i(x) = \tanh(x),$$
$$a_0 = b^*, \quad a_i = \alpha_i^* y_i$$

and the FARB output (8) equals (17).
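A numeric sanity check of Corollary 1, built on the parameter choices above (the alpha, y, S, b_star and kernel parameters are made-up illustration values, not from the slides):

```python
# Check that the FARB form (8) with the Corollary 1 parameters reproduces (17).
import numpy as np

rng = np.random.default_rng(0)
Nsv, n = 4, 3
alpha = rng.uniform(0.1, 1.0, Nsv)       # alpha_i^*
y = np.array([1.0, -1.0, 1.0, -1.0])     # support-vector labels y_i
S = rng.normal(size=(Nsv, n))            # support vectors s^i
b_star, rho, delta = 0.2, 0.7, 0.1       # bias and MLP-kernel parameters
x = rng.normal(size=n)                   # a test input

# SVM side, eq. (17)
h = np.sum(alpha * y * np.tanh(rho * (S @ x) - delta)) + b_star

# FARB side, eq. (8) with z_i=1, r_i=0, u_i=1/2, v_i=-k_i/2, g_i=tanh,
# q_i = 2*rho*x.s^i, k_i = 2*delta, a_0 = b*, a_i = alpha_i^* y_i
q = 2 * rho * (S @ x)
k = 2 * delta
O = b_star + np.sum(alpha * y * np.tanh(0.5 * q - k / 2))

print(np.allclose(h, O))  # True: SVM and FARB outputs coincide
```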

Pause and Think

$q_i = 2\rho \, x^T s^i$ appears in the FARB if-part. What could this mean?

$$x^T s^i = \lVert x \rVert \, \lVert s^i \rVert \cos\theta_i$$

so $q_i \propto \cos\theta_i$ when $x$ and $s^i$ are normalized: each $q_i$ measures the angle between the input $x$ and the support vector $s^i$.

Corollary 2 {RBF kernel}

For an SVM with the RBF kernel, the output (2) becomes

$$h(x) = \sum_{i=1}^{N_{sv}} \alpha_i^* y_i \exp\left(-\frac{\lVert x - s^i \rVert^2}{2\hat{\sigma}^2}\right) + b^* \qquad (18)$$

Choose membership functions satisfying

$$\frac{\mu_{k}(q) - \mu_{-k}(q)}{\mu_{k}(q) + \mu_{-k}(q)} = 2\exp\left(-\frac{(q - k)^2}{2}\right) - 1$$

and set $q_i = \lVert x - s^i \rVert / \hat{\sigma}$, $k_i = 0$.

Then these parameters will satisfy the conditions in (15):

$$z_i = 2, \quad r_i = -1, \quad u_i = 1/\sqrt{2}, \quad v_i = 0, \quad g_i(x) = \exp(-x^2),$$
$$a_0 = b^* + \frac{1}{2}\sum_{i=1}^{N_{sv}} \alpha_i^* y_i, \quad a_i = \alpha_i^* y_i / 2$$

and the FARB output (8) equals (18).
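The analogous check for Corollary 2 (again a sketch with made-up numeric values, using the parameter choices above):

```python
# Check that the FARB form (8) with the Corollary 2 parameters reproduces (18).
import numpy as np

rng = np.random.default_rng(1)
Nsv, n = 4, 3
alpha = rng.uniform(0.1, 1.0, Nsv)
y = np.array([1.0, -1.0, -1.0, 1.0])
S = rng.normal(size=(Nsv, n))
b_star, sigma_hat = -0.3, 1.2
x = rng.normal(size=n)

# SVM side, eq. (18)
dist2 = np.sum((S - x) ** 2, axis=1)
h = np.sum(alpha * y * np.exp(-dist2 / (2 * sigma_hat ** 2))) + b_star

# FARB side, eq. (8) with z_i=2, r_i=-1, u_i=1/sqrt(2), v_i=0, g_i(x)=exp(-x^2),
# q_i = ||x - s^i|| / sigma_hat, a_0 = b* + sum_i a_i, a_i = alpha_i^* y_i / 2
q = np.sqrt(dist2) / sigma_hat
a = alpha * y / 2
a0 = b_star + np.sum(a)
O = a0 - np.sum(a) + np.sum(2 * a * np.exp(-(q / np.sqrt(2)) ** 2))

print(np.allclose(h, O))  # True
```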

Experimentation {Iris data set}

• 150 examples

• 4 features: sepal length, sepal width, petal length, petal width

• 3 classes: Setosa, Versicolor, Virginica

SVM

Results {SVM1}

Results {SVM2}
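A sketch of how such an experiment can be reproduced with scikit-learn (the slides do not state the kernel, multi-class scheme, or data split, so those choices here are assumptions):

```python
# Train an SVM on Iris and inspect accuracy and support-vector counts.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 150 examples, 4 features, 3 classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
print("support vectors per class:", clf.n_support_)
```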