
Neural Networks: Support Vector Machines


Page 1: Neural Networks: Support Vector machines

CHAPTER 06

SUPPORT VECTOR MACHINES

CSC445: Neural Networks

Prof. Dr. Mostafa Gadal-Haqq M. Mostafa

Computer Science Department

Faculty of Computer & Information Sciences

AIN SHAMS UNIVERSITY

(some of the figures in this presentation are copyrighted to Pearson Education, Inc.)

Page 2: Neural Networks: Support Vector machines


Outline

Introduction

Optimal Hyperplane for Linearly Separable Patterns

Quadratic Optimization for Finding the Optimal Hyperplane

Optimal Hyperplane for Nonseparable Patterns

Underlying Philosophy of SVM for Pattern Classification

SVM viewed as Kernel Machine

The XOR problem

Computer Experiment


Page 3: Neural Networks: Support Vector machines


Introduction

The main idea of the SVM may be summed up as follows:

“Given a training sample, the SVM constructs a hyperplane as the decision surface in such a way that the margin of separation between positive and negative examples is maximized.”

Page 4: Neural Networks: Support Vector machines


Linearly Separable Patterns

SVM is a binary learning machine.

Binary classification is the task of separating classes in feature space.

The separating hyperplane is w^T x + b = 0; points on one side satisfy w^T x + b > 0, and points on the other side satisfy w^T x + b < 0.

The decision function is g(x) = w^T x + b.
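As a quick illustration of this decision rule (the weight vector, bias, and test points below are made-up values, not taken from the lecture), the sign of g(x) decides the class:

    import numpy as np

    # Minimal sketch of the linear decision rule: classify a point by the
    # sign of g(x) = w^T x + b.  w, b, and the test points are made-up values.
    w = np.array([1.0, -2.0])
    b = 0.5

    def classify(x):
        g = w @ x + b
        return +1 if g > 0 else -1    # w^T x + b > 0 -> class +1, otherwise class -1

    for x in (np.array([2.0, 0.0]), np.array([0.0, 1.0])):
        print(x, "->", classify(x))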

Page 5: Neural Networks: Support Vector machines


Linearly Separable Patterns

Which of the linear separators is optimal?

Page 6: Neural Networks: Support Vector machines


Optimal Decision Boundary

The optimal decision boundary is the one that maximizes the margin of separation, ρ.

Page 7: Neural Networks: Support Vector machines


The Margin

Any point x can be expressed as x = x_P + r (w / ||w||), where x_P is the normal projection of x onto the separating hyperplane and r is the algebraic (signed) distance from x to the hyperplane.

Page 8: Neural Networks: Support Vector machines


The Margin

Substituting x = x_P + r (w / ||w||) into g(x) = w^T x + b gives

    g(x) = w^T x_P + b + r (w^T w) / ||w||.

Since w^T x_P + b = 0 (x_P lies on the hyperplane), this reduces to

    g(x) = r ||w||,   i.e.,   r = g(x) / ||w||.

Page 9: Neural Networks: Support Vector machines


The Margin

For a support vector x^(s) we have g(x^(s)) = w^T x^(s) + b = ±1, according to whether d^(s) = +1 or −1. The distance from a support vector to the optimal hyperplane is therefore

    r = g(x^(s)) / ||w|| =  +1 / ||w||   if d^(s) = +1
                            −1 / ||w||   if d^(s) = −1.

The two margin hyperplanes are w^T x + b = +1 and w^T x + b = −1, with the decision boundary w^T x + b = 0 midway between them. Then the margin of separation is given as:

    ρ = 2r = 2 / ||w||.
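As a small numeric illustration (the weight vector, bias, and point below are made-up values), the signed distance r and the margin ρ can be computed directly:

    import numpy as np

    # Illustrative values, not from the lecture.
    w = np.array([2.0, 1.0])
    b = -3.0
    x = np.array([3.0, 1.0])

    g = w @ x + b                      # g(x) = w^T x + b
    r = g / np.linalg.norm(w)          # signed distance from x to the hyperplane
    rho = 2.0 / np.linalg.norm(w)      # margin, assuming |g| = 1 at the support vectors

    print(f"g(x) = {g:.3f}, r = {r:.3f}, margin rho = {rho:.3f}")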

Page 10: Neural Networks: Support Vector machines


Optimal Decision Boundary

Let {x_1, ..., x_N} be our data set and let d_i ∈ {+1, −1} be the class label of x_i.

The decision boundary should classify all points correctly.

That is, we have a constrained optimization problem

Maximize ρ = 2r = 2 / ||w||, or equivalently minimize ||w||,

subject to d_i (w^T x_i + b) ≥ 1 for all i.


Page 11: Neural Networks: Support Vector machines


The Optimization Problem

Introduce Lagrange multipliers α_i ≥ 0, one per constraint. That is, form the Lagrangian function

    J(w, b, α) = ½ ||w||² − Σ_{i=1}^{N} α_i [ d_i (w^T x_i + b) − 1 ],

which is to be minimized with respect to w and b, i.e.,

    ∂J(w, b, α)/∂w = 0   and   ∂J(w, b, α)/∂b = 0.
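Carrying out these two differentiations (a standard intermediate step, shown here for completeness) gives the conditions that lead to the dual problem on the next slide:

    ∂J/∂w = w − Σ_{i=1}^{N} α_i d_i x_i = 0   ⇒   w = Σ_{i=1}^{N} α_i d_i x_i

    ∂J/∂b = − Σ_{i=1}^{N} α_i d_i = 0   ⇒   Σ_{i=1}^{N} α_i d_i = 0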


Page 12: Neural Networks: Support Vector machines


Solving the Optimization Problem

Need to optimize a quadratic function subject to linear constraints.

The solution involves constructing a dual problem in which a Lagrange multiplier α_i is associated with every constraint in the primal problem:

Find α_1, ..., α_N such that

    Q(α) = Σ_{i=1}^{N} α_i − ½ Σ_i Σ_j α_i α_j d_i d_j x_i^T x_j

is maximized, subject to

    (1) Σ_{i=1}^{N} α_i d_i = 0
    (2) α_i ≥ 0 for all i.
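As a rough sketch of how this dual could be solved numerically (the six data points below are made up, and a general-purpose optimizer is used instead of a dedicated QP solver), one might write:

    import numpy as np
    from scipy.optimize import minimize

    # Made-up, linearly separable toy data (three points per class).
    X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],      # class +1
                  [0.0, 0.5], [0.5, 0.0], [1.0, 0.5]])     # class -1
    d = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
    N = len(d)

    # H[i, j] = d_i d_j x_i^T x_j
    H = (d[:, None] * d[None, :]) * (X @ X.T)

    def neg_Q(alpha):
        # Negative of Q(alpha); minimizing it maximizes the dual objective.
        return -(alpha.sum() - 0.5 * alpha @ H @ alpha)

    res = minimize(neg_Q, np.zeros(N),
                   bounds=[(0.0, None)] * N,                            # alpha_i >= 0
                   constraints={"type": "eq", "fun": lambda a: a @ d})  # sum_i alpha_i d_i = 0
    alpha = res.x

    # Recover w and b from the dual solution (see Page 13).
    w = (alpha * d) @ X                     # w = sum_i alpha_i d_i x_i
    k = int(np.argmax(alpha))               # index of a support vector (alpha_k > 0)
    b = d[k] - w @ X[k]                     # from d_k (w^T x_k + b) = 1
    print("alpha =", np.round(alpha, 3), " w =", np.round(w, 3), " b =", round(b, 3))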


Page 13: Neural Networks: Support Vector machines


The Optimization Problem

The solution has the form:

    w = Σ_{i=1}^{N} α_i d_i x_i   and   b = d_k − Σ_{i=1}^{N} α_i d_i x_i^T x_k   for any x_k such that α_k ≠ 0.

Each non-zero α_i indicates that the corresponding x_i is a support vector.

The classifying function will then have the form:

    g(x) = Σ_{i=1}^{N} α_i d_i x_i^T x + b

Notice that it relies on an inner product between the test point x and the support vectors x_i. Also keep in mind that solving the optimization problem involved computing the inner products x_i^T x_j between all training points!

Page 14: Neural Networks: Support Vector machines


The Optimization Problem

Support vectors are the samples that have non-zero α_i.

[Figure: ten training points from Class 1 and Class 2; only α_1 = 0.8, α_6 = 1.4, and α_8 = 0.6 are non-zero, so those three points are the support vectors; all other α_i = 0.]

Page 15: Neural Networks: Support Vector machines


Optimal Hyperplane for Nonseparable Patterns

Figure 6.3 Soft margin hyperplane. (a) Data point x_i (belonging to class C1, represented by a small square) falls inside the region of separation, but on the correct side of the decision surface. (b) Data point x_i (belonging to class C2, represented by a small circle) falls on the wrong side of the decision surface.


Page 16: Neural Networks: Support Vector machines


Optimal Hyperplane for Nonseparable Patterns

We allow “errors” ξ_i (slack variables) in the classification.

Page 17: Neural Networks: Support Vector machines


Soft Margin Hyperplane

The old formulation:

    Find w and b such that Φ(w) = ½ w^T w is minimized,
    subject to d_i (w^T x_i + b) ≥ 1 for all (x_i, d_i).

The new formulation, incorporating the slack variables ξ_i:

    Find w and b such that Φ(w, ξ) = ½ w^T w + C Σ_i ξ_i is minimized,
    subject to d_i (w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i.

The parameter C can be viewed as a way to control overfitting.
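A minimal sketch of the soft-margin behaviour using an off-the-shelf implementation (scikit-learn's SVC; the overlapping two-class data are synthetic and the C values are arbitrary):

    import numpy as np
    from sklearn.svm import SVC

    # Synthetic, overlapping two-class data (made up for illustration).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2)),
                   rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, 2))])
    d = np.array([-1] * 50 + [1] * 50)

    # Larger C penalizes slack more heavily (fewer margin violations);
    # smaller C tolerates more slack (wider margin, more training errors).
    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, d)
        print(f"C={C:>6}: {clf.n_support_.sum()} support vectors, "
              f"training accuracy = {clf.score(X, d):.2f}")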

Page 18: Neural Networks: Support Vector machines


Soft Margin Hyperplane

Again, xi with non-zero αi will be support vectors.

Solution to the dual problem is:

w = Σ_i α_i d_i x_i

and

b = d_i (1 − ξ_i) − w^T x_i


Page 19: Neural Networks: Support Vector machines


Extension to Non-linear Decision Boundary

Key idea: transform x_i to a higher-dimensional space.

Input space: the space the points x_i live in.

Feature space: the space of the transformed points φ(x_i).

[Figure: points x in the input space are mapped by φ(·) to points φ(x) in the feature space.]

Page 20: Neural Networks: Support Vector machines


Kernel Trick

The linear classifier relies on inner product between vectors:

K(x_i, x_j) = x_i^T x_j

If every datapoint is mapped into high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes:

K(x_i, x_j) = φ(x_i)^T φ(x_j)

A kernel function is a function that corresponds to an inner product in some feature space.

K(x_i, x_j) needs to satisfy a technical condition (Mercer's condition) in order for the mapping φ(·) to exist.
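A small sketch of the trick in code (the kernel choices and parameter values are illustrative): the Gram matrix of kernel values is computed directly from the data, without ever forming φ(x):

    import numpy as np

    def polynomial_kernel(X, Y, degree=2, c=1.0):
        # K(x, y) = (x^T y + c)^degree
        return (X @ Y.T + c) ** degree

    def rbf_kernel(X, Y, sigma=1.0):
        # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
        sq = (np.sum(X**2, axis=1)[:, None]
              + np.sum(Y**2, axis=1)[None, :]
              - 2.0 * X @ Y.T)
        return np.exp(-sq / (2.0 * sigma**2))

    X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])   # made-up points
    print(polynomial_kernel(X, X))
    print(rbf_kernel(X, X))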


Page 21: Neural Networks: Support Vector machines


Mercer’s Theorem

The matrix K = [k(x_i, x_j)], for all i and j, has to be non-negative definite (positive semidefinite); that is, it must satisfy

    a^T K a ≥ 0   for every vector a.
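A rough numerical check of this condition (the data points below are arbitrary): the Gram matrix of a valid kernel should have no negative eigenvalues, up to round-off.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 3))                  # arbitrary points

    K = (X @ X.T + 1.0) ** 2                      # Gram matrix of the polynomial kernel (x^T y + 1)^2
    eigvals = np.linalg.eigvalsh(K)               # K is symmetric, so eigvalsh applies
    print("smallest eigenvalue:", eigvals.min())  # >= ~0 indicates positive semidefiniteness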

Kernel functions that satisfy Mercer's condition include, for example, the polynomial kernel and the radial-basis-function (RBF) kernel.


Page 22: Neural Networks: Support Vector machines


The SVM Viewed as a Kernel Machine

Figure 6.5 Architecture of support vector machine, using a radial-basis function network.
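A minimal sketch of the decision function computed by this architecture, g(x) = Σ_i α_i d_i K(x_i, x) + b, with an RBF kernel (the support vectors, multipliers, bias, and kernel width below are made-up values):

    import numpy as np

    sv = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])   # support vectors x_i (made up)
    alpha = np.array([0.7, 0.7, 1.4])                      # multipliers alpha_i (made up)
    d = np.array([1.0, 1.0, -1.0])                         # labels d_i (made up)
    b = 0.1                                                # bias (made up)
    sigma = 1.0                                            # RBF width (made up)

    def K(xi, x):
        # Radial-basis-function kernel, matching the RBF network of Figure 6.5.
        return np.exp(-np.sum((xi - x) ** 2) / (2.0 * sigma**2))

    def g(x):
        # g(x) = sum_i alpha_i d_i K(x_i, x) + b
        return sum(a * di * K(xi, x) for a, di, xi in zip(alpha, d, sv)) + b

    print(g(np.array([0.5, 0.5])))   # the sign of g(x) gives the predicted class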


Page 23: Neural Networks: Support Vector machines


The XOR Problem

For two-dimensional vectors x = [x1, x2], define the following kernel:

    k(x, x_i) = (1 + x^T x_i)^2

We need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j):

    K(x_i, x_j) = (1 + x_i^T x_j)^2
                = 1 + x_i1^2 x_j1^2 + 2 x_i1 x_j1 x_i2 x_j2 + x_i2^2 x_j2^2 + 2 x_i1 x_j1 + 2 x_i2 x_j2
                = [1, x_i1^2, √2 x_i1 x_i2, x_i2^2, √2 x_i1, √2 x_i2]^T [1, x_j1^2, √2 x_j1 x_j2, x_j2^2, √2 x_j1, √2 x_j2]
                = φ(x_i)^T φ(x_j),

where φ(x) = [1, x1^2, √2 x1 x2, x2^2, √2 x1, √2 x2]^T.
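A quick numerical sanity check of this identity on the four XOR input points (a small illustration, not part of the original derivation):

    import numpy as np

    def phi(x):
        x1, x2 = x
        return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                         np.sqrt(2) * x1, np.sqrt(2) * x2])

    points = [np.array(p, dtype=float) for p in [(-1, -1), (-1, 1), (1, -1), (1, 1)]]
    for xi in points:
        for xj in points:
            assert np.isclose((1.0 + xi @ xj) ** 2, phi(xi) @ phi(xj))
    print("K(x_i, x_j) = phi(x_i)^T phi(x_j) holds on all XOR points.")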


Page 24: Neural Networks: Support Vector machines


The XOR Problem

This gives the optimal hyperplane as:

    −x1 x2 = 0

This yields:

Figure 6.6 (a) Polynomial machine for solving the XOR problem. (b) Induced images in the feature space due to the four data points of the XOR problem.

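Evaluating the resulting decision function g(x) = −x1 x2 on the four XOR inputs (a small sanity check; the class assignment follows the usual XOR convention that mixed-sign inputs form one class and equal-sign inputs the other):

    # g(x) = -x1 * x2 separates the XOR classes: it is +1 when the inputs
    # differ in sign and -1 when they agree.
    for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
        print(f"x = ({x1:+d}, {x2:+d})  ->  g(x) = {-x1 * x2:+d}")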

Page 25: Neural Networks: Support Vector machines


Conclusion

SVMs are a useful alternative to neural networks.

Two key concepts of the SVM: maximizing the margin and the kernel trick.

Much active research is taking place in areas related to SVMs.

Many SVM implementations are available on the web for you to try on your data set!


Page 26: Neural Networks: Support Vector machines


Computer Experiment

Figure 6.7 Experiment on SVM for the double-moon of Fig. 1.8 with distance d = –6.


Page 27: Neural Networks: Support Vector machines


Computer Experiment

Figure 6.8 Experiment on SVM for the double-moon of Fig. 1.8 with distance d = –6.5.


Page 28: Neural Networks: Support Vector machines

Next Time: Principal Component Analysis (PCA)