CE-717: Machine Learning, Sharif University of Technology
Fall 2015
Soleymani
Support Vector Machine (SVM) and Kernel Methods
Outline
Margin concept
Hard-Margin SVM
Soft-Margin SVM
Dual Problems of Hard-Margin SVM and Soft-Margin SVM
Nonlinear SVM
Kernel trick
Kernel methods
Margin
Which line is better to select as the boundary to provide more generalization capability?
The margin for a hyperplane that separates samples of two linearly separable classes is:
The smallest distance between the decision boundary and any of the training samples
A larger margin provides better generalization to unseen data.
Maximum margin
SVM finds the solution with maximum margin
Solution: a hyperplane that is farthest from all training samples
The hyperplane with the largest margin has equal distances to the nearest sample of both classes
(Figure: two separating hyperplanes, the second with a larger margin.)
Hard-margin SVM: Optimization problem
$$\max_{M,\mathbf{w},w_0} \; \frac{2M}{\|\mathbf{w}\|}$$
$$\text{s.t.} \quad \mathbf{w}^T\mathbf{x}^{(i)} + w_0 \ge M \quad \forall \mathbf{x}^{(i)} \in C_1$$
$$\qquad\;\; \mathbf{w}^T\mathbf{x}^{(i)} + w_0 \le -M \quad \forall \mathbf{x}^{(i)} \in C_2$$
(Figure: decision boundary $\mathbf{w}^T\mathbf{x} + w_0 = 0$ with margin hyperplanes $\mathbf{w}^T\mathbf{x} + w_0 = M$ (samples with $y^{(i)} = 1$) and $\mathbf{w}^T\mathbf{x} + w_0 = -M$ (samples with $y^{(i)} = -1$); margin width $2M/\|\mathbf{w}\|$.)
Hard-margin SVM: Optimization problem
$$\max_{M,\mathbf{w},w_0} \; \frac{2M}{\|\mathbf{w}\|} \qquad \text{s.t.} \quad y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right) \ge M, \quad i = 1,\dots,N$$
where
$$M = \min_{i=1,\dots,N} \; y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right)$$
(Figure: same geometry as on the previous slide.)
Hard-margin SVM: Optimization problem
We can set $\mathbf{w}' = \dfrac{\mathbf{w}}{M}$ and $w_0' = \dfrac{w_0}{M}$:
$$\max_{\mathbf{w}',w_0'} \; \frac{2}{\|\mathbf{w}'\|} \qquad \text{s.t.} \quad y^{(i)}\left(\mathbf{w}'^T\mathbf{x}^{(i)} + w_0'\right) \ge 1, \quad i = 1,\dots,N$$
The places of the boundary $\mathbf{w}'^T\mathbf{x} + w_0' = 0$ and of the margin hyperplanes $\mathbf{w}'^T\mathbf{x} + w_0' = \pm 1$ do not change; the margin is now $2/\|\mathbf{w}'\|$.
Hard-margin SVM: Optimization problem
We can equivalently optimize:
$$\min_{\mathbf{w},w_0} \; \frac{1}{2}\|\mathbf{w}\|^2 \qquad \text{s.t.} \quad y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right) \ge 1, \quad i = 1,\dots,N$$
It is a convex Quadratic Programming (QP) problem:
There are computationally efficient packages to solve it.
It has a global minimum (if a solution exists).
When the training samples are not linearly separable, it has no solution.
How can we extend it to find a solution even when the classes are not linearly separable?
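As a small illustration (not the course's code), the sketch below approximates the hard-margin QP by running scikit-learn's soft-margin solver with a very large $C$; the toy data are hypothetical.

```python
# A minimal sketch, assuming scikit-learn is available: approximating the
# hard-margin SVM by a soft-margin solver with a very large C.
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data.
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.0],
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e10)   # C -> infinity ~ hard margin
clf.fit(X, y)

w, w0 = clf.coef_[0], clf.intercept_[0]
print("w =", w, "w0 =", w0)
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
```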
Beyond linear separability
How can we extend the hard-margin SVM to allow classification errors?
Overlapping classes that can be approximately separated by a linear boundary
Noise in otherwise linearly separable classes
(Figure: two such cases, one nearly separable with noise and one with overlapping classes.)
Beyond linear separability: Soft-margin SVM
Minimizing the number of misclassified points?!
This is NP-complete.
Soft margin:
Maximize the margin while trying to minimize the distance between misclassified points and their correct margin hyperplane.
Soft-margin SVM
SVM with slack variables: allows samples to fall within the margin, but penalizes them:
$$\min_{\mathbf{w},w_0,\{\xi_i\}_{i=1}^N} \; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^N \xi_i$$
$$\text{s.t.} \quad y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1,\dots,N$$
$\xi_i$: slack variables
$\xi_i > 1$: $\mathbf{x}^{(i)}$ is misclassified
$0 < \xi_i < 1$: $\mathbf{x}^{(i)}$ is correctly classified but lies inside the margin
Soft-margin SVM
A linear penalty (hinge loss) is paid for a sample if it is misclassified or lies inside the margin.
The objective tries to keep the $\xi_i$ small while maximizing the margin.
It always finds a solution (as opposed to the hard-margin SVM).
It is more robust to outliers.
The soft-margin problem is still a convex QP.
Soft-margin SVM: Parameter 𝐶
$C$ is a tradeoff parameter:
A small $C$ allows margin constraints to be easily ignored → large margin.
A large $C$ makes the constraints hard to ignore → narrow margin.
$C \to \infty$ enforces all constraints: hard margin.
$C$ can be determined using a technique such as cross-validation.
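As a small sketch of the cross-validation suggestion above (assuming scikit-learn and training arrays `X`, `y` with enough samples per class), $C$ can be chosen by grid search; the candidate values are arbitrary:

```python
# A minimal sketch: choosing the trade-off parameter C by 3-fold cross-validation.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}    # illustrative candidates
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=3)
search.fit(X, y)                               # X, y assumed to be defined
print("best C:", search.best_params_["C"])
```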
Soft-margin SVM: Cost function
$$\min_{\mathbf{w},w_0,\{\xi_i\}_{i=1}^N} \; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^N \xi_i
\qquad \text{s.t.} \quad y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1,\dots,N$$
It is equivalent to the unconstrained optimization problem:
$$\min_{\mathbf{w},w_0} \; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^N \max\!\left(0,\, 1 - y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right)\right)$$
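The unconstrained form above can be minimized directly. The sketch below is an illustration, not the lecture's algorithm: plain subgradient descent on that objective for given arrays `X` (N x d) and `y` in {-1, +1}:

```python
# A minimal sketch: subgradient descent on 1/2||w||^2 + C * sum_i max(0, 1 - y_i(w^T x_i + w0)).
import numpy as np

def svm_subgradient_descent(X, y, C=1.0, lr=1e-3, epochs=1000):
    N, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + w0)
        active = margins < 1                                          # samples with non-zero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)    # subgradient w.r.t. w
        grad_w0 = -C * y[active].sum()                                # subgradient w.r.t. w0
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0
```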
SVM loss function
Hinge loss vs. 0-1 loss
(Figure: the 0-1 loss and the hinge loss $\max\!\left(0,\, 1 - y(\mathbf{w}^T\mathbf{x} + w_0)\right)$ plotted against $\mathbf{w}^T\mathbf{x} + w_0$ for $y = 1$.)
Dual formulation of the SVM
We are going to introduce the dual SVM problem, which is equivalent to the original primal problem. The dual problem:
is often easier to solve
enables us to exploit the kernel trick
gives us further insights into the optimal hyperplane
Optimization: Lagrangian multipliers
$$p^* = \min_{\mathbf{x}} f(\mathbf{x}) \qquad \text{s.t.} \quad g_i(\mathbf{x}) \le 0, \; i = 1,\dots,m; \qquad h_i(\mathbf{x}) = 0, \; i = 1,\dots,p$$
Lagrangian (with Lagrange multipliers $\boldsymbol{\alpha} = [\alpha_1,\dots,\alpha_m]$ and $\boldsymbol{\lambda} = [\lambda_1,\dots,\lambda_p]$):
$$\mathcal{L}(\mathbf{x},\boldsymbol{\alpha},\boldsymbol{\lambda}) = f(\mathbf{x}) + \sum_{i=1}^{m} \alpha_i g_i(\mathbf{x}) + \sum_{i=1}^{p} \lambda_i h_i(\mathbf{x})$$
$$\max_{\alpha_i \ge 0,\, \lambda_i} \mathcal{L}(\mathbf{x},\boldsymbol{\alpha},\boldsymbol{\lambda}) =
\begin{cases} \infty & \text{if any } g_i(\mathbf{x}) > 0 \text{ or any } h_i(\mathbf{x}) \ne 0 \\ f(\mathbf{x}) & \text{otherwise} \end{cases}$$
$$p^* = \min_{\mathbf{x}} \max_{\alpha_i \ge 0,\, \lambda_i} \mathcal{L}(\mathbf{x},\boldsymbol{\alpha},\boldsymbol{\lambda})$$
Optimization: Dual problem
In general, we have:
$$\max_{x} \min_{y} f(x,y) \le \min_{y} \max_{x} f(x,y)$$
Primal problem: $p^* = \min_{\mathbf{x}} \max_{\alpha_i \ge 0,\, \lambda_i} \mathcal{L}(\mathbf{x},\boldsymbol{\alpha},\boldsymbol{\lambda})$
Dual problem: $d^* = \max_{\alpha_i \ge 0,\, \lambda_i} \min_{\mathbf{x}} \mathcal{L}(\mathbf{x},\boldsymbol{\alpha},\boldsymbol{\lambda})$
Obtained by swapping the order of min and max
$$d^* \le p^*$$
When the original problem is convex (𝑓 and 𝑔 are convex
functions and ℎ is affine), we have strong duality 𝑑∗ = 𝑝∗
Hard-margin SVM: Dual problem
$$\min_{\mathbf{w},w_0} \; \frac{1}{2}\|\mathbf{w}\|^2 \qquad \text{s.t.} \quad y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right) \ge 1, \quad i = 1,\dots,N$$
By incorporating the constraints through Lagrange multipliers, we will have:
$$\min_{\mathbf{w},w_0} \max_{\{\alpha_i \ge 0\}} \; \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{N} \alpha_i\left(1 - y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right)\right)$$
Dual problem (changing the order of min and max in the above problem):
$$\max_{\{\alpha_i \ge 0\}} \min_{\mathbf{w},w_0} \; \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{N} \alpha_i\left(1 - y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right)\right)$$
Hard-margin SVM: Dual problem
$$\max_{\{\alpha_i \ge 0\}} \min_{\mathbf{w},w_0} \mathcal{L}(\mathbf{w},w_0,\boldsymbol{\alpha}),
\qquad \mathcal{L}(\mathbf{w},w_0,\boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{N} \alpha_i\left(1 - y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right)\right)$$
$$\nabla_{\mathbf{w}}\mathcal{L}(\mathbf{w},w_0,\boldsymbol{\alpha}) = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{N} \alpha_i y^{(i)}\mathbf{x}^{(i)}$$
$$\frac{\partial \mathcal{L}(\mathbf{w},w_0,\boldsymbol{\alpha})}{\partial w_0} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y^{(i)} = 0$$
$w_0$ does not appear in the inner minimization; instead, a "global" constraint on $\boldsymbol{\alpha}$ is created.
Hard-margin SVM: Dual problem
$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \, \mathbf{x}^{(i)T}\mathbf{x}^{(j)}$$
$$\text{subject to} \quad \sum_{i=1}^{N} \alpha_i y^{(i)} = 0, \qquad \alpha_i \ge 0, \; i = 1,\dots,N$$
It is a convex QP.
By solving the above problem we first find $\boldsymbol{\alpha}$ and then $\mathbf{w} = \sum_{i=1}^{N} \alpha_i y^{(i)}\mathbf{x}^{(i)}$.
$w_0$ can be set by making the margin equidistant to the classes (we discuss this further after defining support vectors).
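A hedged sketch of solving this dual QP numerically with the cvxopt package (an assumption; any QP solver would do). cvxopt minimizes $\frac{1}{2}\boldsymbol{\alpha}^T P \boldsymbol{\alpha} + q^T\boldsymbol{\alpha}$ subject to $G\boldsymbol{\alpha} \le h$ and $A\boldsymbol{\alpha} = b$, so the maximization above is negated:

```python
# A minimal sketch, assuming the cvxopt package: the hard-margin dual QP.
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_dual(X, y):
    N = X.shape[0]
    K = X @ X.T                                     # dot products x_i^T x_j
    P = matrix((np.outer(y, y) * K).astype(float))  # P_ij = y_i y_j x_i^T x_j
    q = matrix(-np.ones(N))                         # maximize sum(alpha) -> minimize -sum(alpha)
    G = matrix(-np.eye(N))                          # -alpha_i <= 0, i.e. alpha_i >= 0
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1).astype(float))      # sum_i alpha_i y_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    alpha = np.ravel(sol["x"])
    w = ((alpha * y)[:, None] * X).sum(axis=0)      # w = sum_i alpha_i y_i x_i
    return alpha, w
```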
Karush-Kuhn-Tucker (KKT) conditions
Necessary conditions for the solution $[\mathbf{w}^*, w_0^*, \boldsymbol{\alpha}^*]$:
$$\nabla_{\mathbf{w}}\mathcal{L}(\mathbf{w},w_0,\boldsymbol{\alpha})\big|_{\mathbf{w}^*,w_0^*,\boldsymbol{\alpha}^*} = 0, \qquad
\frac{\partial \mathcal{L}(\mathbf{w},w_0,\boldsymbol{\alpha})}{\partial w_0}\bigg|_{\mathbf{w}^*,w_0^*,\boldsymbol{\alpha}^*} = 0$$
$$\alpha_i^* \ge 0, \qquad y^{(i)}\left(\mathbf{w}^{*T}\mathbf{x}^{(i)} + w_0^*\right) \ge 1, \qquad
\alpha_i^*\left(1 - y^{(i)}\left(\mathbf{w}^{*T}\mathbf{x}^{(i)} + w_0^*\right)\right) = 0, \quad i = 1,\dots,N$$
In general, the optimal $(\mathbf{x}^*, \boldsymbol{\alpha}^*)$ satisfies the KKT conditions:
$$\nabla_{\mathbf{x}}\mathcal{L}(\mathbf{x},\boldsymbol{\alpha})\big|_{\mathbf{x}^*,\boldsymbol{\alpha}^*} = 0, \qquad
\alpha_i^* \ge 0, \qquad g_i(\mathbf{x}^*) \le 0, \qquad \alpha_i^* g_i(\mathbf{x}^*) = 0, \quad i = 1,\dots,m$$
Hard-margin SVM: Support vectors
Inactive constraint, $y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right) > 1$:
$\Rightarrow \alpha_i = 0$, and thus $\mathbf{x}^{(i)}$ is not a support vector.
Active constraint, $y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right) = 1$:
$\Rightarrow \alpha_i$ can be greater than 0, and thus $\mathbf{x}^{(i)}$ can be a support vector.
(Figure: samples lying on the margin hyperplanes have $\alpha > 0$; the margin is $1/\|\mathbf{w}\|$ on each side.)
A sample with $\alpha_i = 0$ can also lie on one of the margin hyperplanes.
Hard-margin SVM: Support vectors
Support vectors (SVs) $= \{\mathbf{x}^{(i)} \mid \alpha_i > 0\}$
The direction of the hyperplane can be found based only on the support vectors:
$$\mathbf{w} = \sum_{\alpha_i > 0} \alpha_i y^{(i)}\mathbf{x}^{(i)}$$
Hard-margin SVM: Dual problem
Classifying new samples using only SVs
Classification of a new sample $\mathbf{x}$:
$$y = \mathrm{sign}\left(w_0 + \mathbf{w}^T\mathbf{x}\right) = \mathrm{sign}\!\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)}\,\mathbf{x}^{(i)T}\mathbf{x}\right)$$
The classifier is based on an expansion in terms of dot products of $\mathbf{x}$ with the support vectors.
Support vectors are sufficient to predict the labels of new samples.
How to find 𝑤0
At least one of the $\alpha$'s is strictly positive (say $\alpha_s > 0$, for a support vector $\mathbf{x}_s$).
The KKT complementary slackness condition tells us that the constraint corresponding to this non-zero $\alpha_s$ holds with equality:
$$y_s\left(w_0 + \mathbf{w}^T\mathbf{x}_s\right) = 1$$
$w_0$ is found such that this constraint is satisfied, using any one of the support vectors:
$$w_0 = y_s - \mathbf{w}^T\mathbf{x}_s = y_s - \sum_{\alpha_n > 0} \alpha_n y^{(n)}\,\mathbf{x}^{(n)T}\mathbf{x}_s$$
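Continuing the hypothetical cvxopt sketch from above, $w_0$ can be recovered from any support vector and new samples classified through the SV expansion; the threshold used to decide $\alpha_i > 0$ is an arbitrary numerical tolerance.

```python
# A minimal sketch, reusing alpha, X, y, and numpy from the dual-QP sketch above.
def recover_w0(alpha, X, y, tol=1e-6):
    sv = alpha > tol                       # support vectors: alpha_i > 0
    s = np.argmax(sv)                      # index of one support vector
    # y_s (w^T x_s + w0) = 1  =>  w0 = y_s - sum_n alpha_n y_n x_n^T x_s
    return y[s] - ((alpha[sv] * y[sv]) * (X[sv] @ X[s])).sum()

def predict(x_new, alpha, X, y, w0, tol=1e-6):
    sv = alpha > tol
    return np.sign(w0 + ((alpha[sv] * y[sv]) * (X[sv] @ x_new)).sum())
```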
Hard-margin SVM: Dual problem
$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \, \mathbf{x}^{(i)T}\mathbf{x}^{(j)}
\qquad \text{subject to} \quad \sum_{i=1}^{N} \alpha_i y^{(i)} = 0, \quad \alpha_i \ge 0, \; i = 1,\dots,N$$
Only the dot product of each pair of training data appears
in the optimization problem
This is an important property that is helpful to extend to non-linear
SVM (the cost function does not depend explicitly on the dimensionality
of the feature space).
Soft-margin SVM: Dual problem
$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \, \mathbf{x}^{(i)T}\mathbf{x}^{(j)}
\qquad \text{subject to} \quad \sum_{i=1}^{N} \alpha_i y^{(i)} = 0, \quad 0 \le \alpha_i \le C, \; i = 1,\dots,N$$
By solving the above quadratic problem we first find $\boldsymbol{\alpha}$, then $\mathbf{w} = \sum_{i=1}^{N} \alpha_i y^{(i)}\mathbf{x}^{(i)}$, and $w_0$ is computed from the SVs.
For a test sample $\mathbf{x}$ (as before):
$$y = \mathrm{sign}\left(w_0 + \mathbf{w}^T\mathbf{x}\right) = \mathrm{sign}\!\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)}\,\mathbf{x}^{(i)T}\mathbf{x}\right)$$
Soft-margin SVM: Support vectors
Support vectors: samples with $\alpha > 0$
If $0 < \alpha < C$: SVs on the margin, $\xi = 0$.
If $\alpha = C$: SVs on or over the margin.
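A small sketch of inspecting these two kinds of support vectors with scikit-learn (assuming some training arrays `X`, `y`); note that sklearn stores `dual_coef_` $= \alpha_i y^{(i)}$ for the support vectors only, so $\alpha_i$ is its absolute value.

```python
# A minimal sketch: margin SVs (0 < alpha < C) vs. bound SVs (alpha = C).
import numpy as np
from sklearn.svm import SVC

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)          # X, y assumed to be defined

alpha = np.abs(clf.dual_coef_[0])                  # alpha_i of each support vector
on_margin = alpha < C - 1e-8                       # 0 < alpha < C: exactly on the margin
print("SVs on the margin:      ", clf.support_[on_margin])
print("SVs on/over the margin: ", clf.support_[~on_margin])   # alpha = C
```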
Primal vs. dual hard-margin SVM problem
Primal problem of hard-margin SVM:
$N$ inequality constraints
$d + 1$ variables
Dual problem of hard-margin SVM:
one equality constraint
$N$ positivity constraints
$N$ variables (Lagrange multipliers)
objective function more complicated
The dual problem is helpful and instrumental for using the kernel trick.
Primal vs. dual soft-margin SVM problem
Primal problem of soft-margin SVM:
$N$ inequality constraints
$N$ positivity constraints
$N + d + 1$ variables
Dual problem of soft-margin SVM:
one equality constraint
$2N$ inequality (box) constraints $0 \le \alpha_i \le C$
$N$ variables (Lagrange multipliers)
objective function more complicated
The dual problem is helpful and instrumental for using the kernel trick.
Not linearly separable data
Noisy data or overlapping classes (we already discussed this case: soft margin)
Near linearly separable
Non-linear decision surface
Transform to a new feature space
(Figure: a nearly linearly separable case vs. a case requiring a non-linear decision surface.)
Nonlinear SVM
Assume a transformation $\boldsymbol{\phi}: \mathbb{R}^d \to \mathbb{R}^m$ on the feature space: $\mathbf{x} \to \boldsymbol{\phi}(\mathbf{x})$
Find a hyperplane in the transformed feature space: $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) + w_0 = 0$
$\{\phi_1(\mathbf{x}),\dots,\phi_m(\mathbf{x})\}$: set of basis functions (or features), with $\phi_i(\mathbf{x}): \mathbb{R}^d \to \mathbb{R}$
$\boldsymbol{\phi}(\mathbf{x}) = [\phi_1(\mathbf{x}),\dots,\phi_m(\mathbf{x})]$
Soft-margin SVM in a transformed space:
Primal problem
Primal problem:
$$\min_{\mathbf{w},w_0} \; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N} \xi_i
\qquad \text{s.t.} \quad y^{(i)}\left(\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}^{(i)}) + w_0\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1,\dots,N$$
$\mathbf{w} \in \mathbb{R}^m$: the weights that must be found
If $m \gg d$ (a very high dimensional feature space), then there are many more parameters to learn.
Soft-margin SVM in a transformed space:
Dual problem
Optimization problem:
$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \, \boldsymbol{\phi}(\mathbf{x}^{(i)})^T\boldsymbol{\phi}(\mathbf{x}^{(j)})
\qquad \text{subject to} \quad \sum_{i=1}^{N} \alpha_i y^{(i)} = 0, \quad 0 \le \alpha_i \le C, \; i = 1,\dots,N$$
If we have the inner products $\boldsymbol{\phi}(\mathbf{x}^{(i)})^T\boldsymbol{\phi}(\mathbf{x}^{(j)})$, only $\boldsymbol{\alpha} = [\alpha_1,\dots,\alpha_N]$ needs to be learned.
It is not necessary to learn the $m$ parameters, as opposed to the primal problem.
Kernelized soft-margin SVM
Optimization problem, with $k(\mathbf{x},\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')$:
$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \, k\!\left(\mathbf{x}^{(i)},\mathbf{x}^{(j)}\right)
\qquad \text{subject to} \quad \sum_{i=1}^{N} \alpha_i y^{(i)} = 0, \quad 0 \le \alpha_i \le C, \; i = 1,\dots,N$$
Classifying a new sample $\mathbf{x}$:
$$y = \mathrm{sign}\left(w_0 + \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x})\right) = \mathrm{sign}\!\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)}\, k\!\left(\mathbf{x}^{(i)},\mathbf{x}\right)\right)$$
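As a small, hedged illustration of using only kernel values in practice (assuming scikit-learn and hypothetical arrays `X_train`, `y_train`, `X_test`), the sketch below trains an SVM on a precomputed Gram matrix, so the learner never sees $\boldsymbol{\phi}(\mathbf{x})$:

```python
# A minimal sketch: kernelized soft-margin SVM via a precomputed Gram matrix.
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(A, B, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows of A and B
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

K_train = rbf_kernel(X_train, X_train)        # N x N matrix of k(x_i, x_j)
clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y_train)

K_test = rbf_kernel(X_test, X_train)          # kernels between test and training points
y_pred = clf.predict(K_test)
```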
Kernel trick
Kernel: the dot product in the mapped feature space $\mathcal{F}$:
$$k(\mathbf{x},\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}'), \qquad \boldsymbol{\phi}(\mathbf{x}) = \left[\phi_1(\mathbf{x}),\dots,\phi_m(\mathbf{x})\right]^T$$
$k(\mathbf{x},\mathbf{x}')$ shows the similarity of $\mathbf{x}$ and $\mathbf{x}'$; the distance in $\mathcal{F}$ can be found accordingly.
Kernel trick → extension of many well-known algorithms to kernel-based ones, by substituting the dot product with the kernel function.
Kernel trick: Idea
Idea: when the input vectors appear only in the form of dot products, we can use other choices of kernel function instead of the inner product.
Solving the problem without explicitly mapping the data
Explicit mapping is expensive if $\boldsymbol{\phi}(\mathbf{x})$ is very high dimensional
Instead of using a mapping $\boldsymbol{\phi}: \mathcal{X} \to \mathcal{F}$ to represent $\mathbf{x} \in \mathcal{X}$ by $\boldsymbol{\phi}(\mathbf{x}) \in \mathcal{F}$, a similarity function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is used.
The data set can then be represented by an $N \times N$ matrix $\mathbf{K}$.
We specify only an inner product function between points in the transformed space (not their coordinates).
In many cases, the inner product in the embedding space can be computed efficiently.
Why kernel?
Kernel functions $k$ can often be computed efficiently, with a cost proportional to $d$ instead of $m$.
Example: consider the second-order polynomial transform
$$\boldsymbol{\phi}(\mathbf{x}) = \left[1,\, x_1,\dots,x_d,\, x_1 x_1,\, x_1 x_2,\dots,\, x_d x_d\right]^T, \qquad m = 1 + d + d^2$$
$$\boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}') = 1 + \sum_{i=1}^{d} x_i x_i' + \sum_{i=1}^{d}\sum_{j=1}^{d} x_i x_j x_i' x_j'
= 1 + \mathbf{x}^T\mathbf{x}' + \left(\mathbf{x}^T\mathbf{x}'\right)^2$$
Computing $\boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')$ directly costs $O(m)$, while evaluating $1 + \mathbf{x}^T\mathbf{x}' + \left(\mathbf{x}^T\mathbf{x}'\right)^2$ costs only $O(d)$.
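A short numerical check of the identity above (an illustration, not part of the lecture); the dimension and random inputs are arbitrary:

```python
# A minimal sketch: phi(x)^T phi(x') computed in O(m) equals 1 + x^T x' + (x^T x')^2 in O(d).
import numpy as np

def phi(x):
    # [1, x_1, ..., x_d, x_1 x_1, x_1 x_2, ..., x_d x_d]: m = 1 + d + d^2 features
    return np.concatenate(([1.0], x, np.outer(x, x).ravel()))

rng = np.random.default_rng(0)
x, z = rng.standard_normal(5), rng.standard_normal(5)
print(np.isclose(phi(x) @ phi(z), 1 + x @ z + (x @ z) ** 2))   # True
```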
Polynomial kernel: Efficient kernel function
An efficient kernel function relies on a carefully constructed transform that allows fast computation.
Example: for polynomial kernels, certain combinations of coefficients still make $k$ easy to compute ($\gamma > 0$ and $c > 0$); the kernel $k(\mathbf{x},\mathbf{x}') = \left(c + \gamma\,\mathbf{x}^T\mathbf{x}'\right)^2$ corresponds to
$$\boldsymbol{\phi}(\mathbf{x}) = \left[c,\; \sqrt{2c\gamma}\,x_1,\dots,\sqrt{2c\gamma}\,x_d,\; \gamma\,x_1 x_1,\; \gamma\,x_1 x_2,\dots,\, \gamma\,x_d x_d\right]^T$$
Polynomial kernel: Degree two
In a $d$-dimensional feature space, $\mathbf{x} = \left[x_1,\dots,x_d\right]^T$, we instead use $k(\mathbf{x},\mathbf{x}') = \left(\mathbf{x}^T\mathbf{x}' + 1\right)^2$, which corresponds to:
$$\boldsymbol{\phi}(\mathbf{x}) = \left[1,\, \sqrt{2}x_1,\dots,\sqrt{2}x_d,\, x_1^2,\dots,x_d^2,\, \sqrt{2}x_1x_2,\dots,\sqrt{2}x_1x_d,\, \sqrt{2}x_2x_3,\dots,\sqrt{2}x_{d-1}x_d\right]^T$$
or, equivalently,
$$\boldsymbol{\phi}(\mathbf{x}) = \left[1,\, \sqrt{2}x_1,\dots,\sqrt{2}x_d,\, x_1^2,\dots,x_d^2,\, x_1x_2,\dots,x_1x_d,\, x_2x_1,\dots,x_{d-1}x_d\right]^T$$
Polynomial kernel
In general, $k(\mathbf{x},\mathbf{z}) = \left(\mathbf{x}^T\mathbf{z}\right)^M$ contains all monomials of order $M$.
This can similarly be generalized to include all terms up to degree $M$: $k(\mathbf{x},\mathbf{z}) = \left(\mathbf{x}^T\mathbf{z} + c\right)^M$ with $c > 0$.
Example: SVM boundary for a polynomial kernel:
$$w_0 + \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) = 0
\;\Rightarrow\; w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)}\boldsymbol{\phi}(\mathbf{x}^{(i)})^T\boldsymbol{\phi}(\mathbf{x}) = 0
\;\Rightarrow\; w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} k\!\left(\mathbf{x}^{(i)},\mathbf{x}\right) = 0
\;\Rightarrow\; w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \left(\mathbf{x}^{(i)T}\mathbf{x} + c\right)^M = 0$$
The boundary is a polynomial of order $M$.
Some common kernel functions
The kernel function is a mathematical and computational shortcut that allows us to combine the transform and the inner product into a single, more efficient function.
Linear: $k(\mathbf{x},\mathbf{x}') = \mathbf{x}^T\mathbf{x}'$
Polynomial: $k(\mathbf{x},\mathbf{x}') = \left(\mathbf{x}^T\mathbf{x}' + c\right)^M$, with $c \ge 0$
Gaussian: $k(\mathbf{x},\mathbf{x}') = \exp\!\left(-\dfrac{\|\mathbf{x}-\mathbf{x}'\|^2}{2\sigma^2}\right)$
Sigmoid: $k(\mathbf{x},\mathbf{x}') = \tanh\!\left(a\,\mathbf{x}^T\mathbf{x}' + b\right)$
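The kernels listed above can be written as small functions of two vectors; the sketch below (with illustrative defaults for $c$, $M$, $\sigma$, $a$, $b$) is one such plain numpy version:

```python
# A minimal sketch: the common kernel functions as plain numpy code.
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, c=1.0, M=2):
    return (x @ z + c) ** M

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, z, a=1.0, b=0.0):
    return np.tanh(a * (x @ z) + b)
```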
Gaussian kernel
Gaussian: $k(\mathbf{x},\mathbf{x}') = \exp\!\left(-\dfrac{\|\mathbf{x}-\mathbf{x}'\|^2}{2\sigma^2}\right)$
Infinite-dimensional feature space
Example: SVM boundary for a Gaussian kernel
Considers a Gaussian function around each data point:
$$w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)}\exp\!\left(-\frac{\|\mathbf{x}-\mathbf{x}^{(i)}\|^2}{2\sigma^2}\right) = 0$$
SVM + Gaussian kernel can classify any arbitrary training set:
Training error is zero when $\sigma \to 0$; all samples become support vectors (likely overfitting).
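A small sketch of this overfitting effect (assuming scikit-learn and some arrays `X`, `y`); note that sklearn parameterizes the Gaussian kernel by $\gamma = 1/(2\sigma^2)$, so shrinking $\sigma$ means growing `gamma`:

```python
# A minimal sketch: smaller sigma -> training accuracy approaches 1 and almost
# every sample becomes a support vector.
from sklearn.svm import SVC

for sigma in [2.0, 0.5, 0.05]:
    gamma = 1.0 / (2 * sigma ** 2)
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)   # X, y assumed to be defined
    print(f"sigma={sigma}: train acc={clf.score(X, y):.2f}, "
          f"#SVs={len(clf.support_)} of {len(X)}")
```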
SVM: Linear and Gaussian kernels example
(Figure: linear kernel with $C = 0.1$ vs. Gaussian kernel with $C = 1$, $\sigma = 0.25$.) [Zisserman's slides]
Why do some SVs appear to be far from the boundary?
Hard-margin example
For a narrow Gaussian (small $\sigma$), even the protection of a large margin cannot suppress overfitting.
Y. S. Abu-Mostafa et al., "Learning From Data", 2012
SVM Gaussian kernel: Example
Source: Zisserman's slides
$$f(\mathbf{x}) = w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)}\exp\!\left(-\frac{\|\mathbf{x}-\mathbf{x}^{(i)}\|^2}{2\sigma^2}\right)$$
SVM Gaussian kernel: Example
(A sequence of figure-only example slides follows here. Source: Zisserman's slides.)
Constructing kernels
Construct kernel functions directly:
Ensure that it is a valid kernel, i.e. that it corresponds to an inner product in some feature space.
Example: $k(\mathbf{x},\mathbf{x}') = \left(\mathbf{x}^T\mathbf{x}'\right)^2$
Corresponding mapping for $\mathbf{x} = [x_1, x_2]^T$: $\boldsymbol{\phi}(\mathbf{x}) = \left[x_1^2,\, \sqrt{2}\,x_1x_2,\, x_2^2\right]^T$
We need a way to test whether a kernel is valid without having to construct $\boldsymbol{\phi}(\mathbf{x})$.
Valid kernel: Necessary & sufficient conditions
Gram matrix $\mathbf{K}_{N\times N}$ with $K_{ij} = k(\mathbf{x}^{(i)},\mathbf{x}^{(j)})$:
Restricting the kernel function to a set of points $\{\mathbf{x}^{(1)},\mathbf{x}^{(2)},\dots,\mathbf{x}^{(N)}\}$,
$$\mathbf{K} = \begin{bmatrix} k(\mathbf{x}^{(1)},\mathbf{x}^{(1)}) & \cdots & k(\mathbf{x}^{(1)},\mathbf{x}^{(N)}) \\ \vdots & \ddots & \vdots \\ k(\mathbf{x}^{(N)},\mathbf{x}^{(1)}) & \cdots & k(\mathbf{x}^{(N)},\mathbf{x}^{(N)}) \end{bmatrix}$$
Mercer theorem: the kernel matrix is symmetric positive semi-definite (for any choice of data points).
Any symmetric positive definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space. [Shawe-Taylor & Cristianini 2004]
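As a small, hedged check in the spirit of this condition (illustrative only; verifying PSD on one sample of points is necessary but not sufficient for validity), the Gram matrix of a candidate kernel can be tested numerically:

```python
# A minimal sketch: build the Gram matrix on some points and check symmetric PSD.
import numpy as np

def gram_matrix(kernel, X):
    N = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

def is_symmetric_psd(K, tol=1e-10):
    return np.allclose(K, K.T) and np.all(np.linalg.eigvalsh(K) >= -tol)

X = np.random.default_rng(0).standard_normal((20, 3))        # arbitrary test points
K = gram_matrix(lambda x, z: (x @ z + 1.0) ** 2, X)           # degree-2 polynomial kernel
print(is_symmetric_psd(K))                                    # True (up to tolerance)
```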
Construct Valid Kernels
Given valid kernels $k_1$ and $k_2$, the following constructions also yield valid kernels (rules collected in [Bishop]):
$k(\mathbf{x},\mathbf{x}') = c\,k_1(\mathbf{x},\mathbf{x}')$, where $c > 0$
$k(\mathbf{x},\mathbf{x}') = f(\mathbf{x})\,k_1(\mathbf{x},\mathbf{x}')\,f(\mathbf{x}')$, where $f(\cdot)$ is any function
$k(\mathbf{x},\mathbf{x}') = q\!\left(k_1(\mathbf{x},\mathbf{x}')\right)$, where $q(\cdot)$ is a polynomial with coefficients $\ge 0$
$k(\mathbf{x},\mathbf{x}') = \exp\!\left(k_1(\mathbf{x},\mathbf{x}')\right)$
$k(\mathbf{x},\mathbf{x}') = k_1(\mathbf{x},\mathbf{x}') + k_2(\mathbf{x},\mathbf{x}')$ and $k(\mathbf{x},\mathbf{x}') = k_1(\mathbf{x},\mathbf{x}')\,k_2(\mathbf{x},\mathbf{x}')$
$k(\mathbf{x},\mathbf{x}') = k_3\!\left(\boldsymbol{\phi}(\mathbf{x}),\boldsymbol{\phi}(\mathbf{x}')\right)$, where $\boldsymbol{\phi}(\mathbf{x})$ maps $\mathbf{x}$ to $\mathbb{R}^M$ and $k_3(\cdot,\cdot)$ is a valid kernel in $\mathbb{R}^M$
$k(\mathbf{x},\mathbf{x}') = \mathbf{x}^T\mathbf{A}\,\mathbf{x}'$, where $\mathbf{A}$ is a symmetric positive semi-definite matrix
$k(\mathbf{x},\mathbf{x}') = k_a(\mathbf{x}_a,\mathbf{x}_a') + k_b(\mathbf{x}_b,\mathbf{x}_b')$ and $k(\mathbf{x},\mathbf{x}') = k_a(\mathbf{x}_a,\mathbf{x}_a')\,k_b(\mathbf{x}_b,\mathbf{x}_b')$, where $\mathbf{x}_a$ and $\mathbf{x}_b$ are variables (not necessarily disjoint) with $\mathbf{x} = (\mathbf{x}_a,\mathbf{x}_b)$, and $k_a$ and $k_b$ are valid kernels over their respective spaces
[Bishop]
Extending linear methods to kernelized ones
Kernelized versions of linear methods:
Linear methods are popular: unique optimal solutions, faster learning algorithms, and better analysis.
However, we often require nonlinear methods in real-world problems, so we can use kernel-based versions of these linear algorithms.
If a problem can be formulated such that feature vectors only
appear in inner products, we can extend it to a kernelized one
When we only need to know the dot product of all pairs of data (about
the geometry of the data in the feature space)
By replacing inner products with kernels in linear algorithms,
we obtain very flexible methods
We can operate in the mapped space without ever computing the
coordinates of the data in that space
Which information can be obtained from kernel?
Example: we know all pairwise distances in the feature space:
$$d\!\left(\boldsymbol{\phi}(\mathbf{x}),\boldsymbol{\phi}(\mathbf{z})\right)^2 = \left\|\boldsymbol{\phi}(\mathbf{x}) - \boldsymbol{\phi}(\mathbf{z})\right\|^2 = k(\mathbf{x},\mathbf{x}) + k(\mathbf{z},\mathbf{z}) - 2k(\mathbf{x},\mathbf{z})$$
Therefore, we also know the distance of points from the center of mass of a set.
Many dimensionality reduction, clustering, and classification methods can be described in terms of pairwise distances.
This allows us to introduce kernelized versions of them (a small sketch of these distance identities follows this slide).
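A hedged numpy sketch of the identities above, working only from a Gram matrix `K` with `K[i, j]` $= k(\mathbf{x}^{(i)},\mathbf{x}^{(j)})$; the centroid formula expands $\|\boldsymbol{\phi}(\mathbf{x}^{(i)}) - \frac{1}{N}\sum_j \boldsymbol{\phi}(\mathbf{x}^{(j)})\|^2$ in kernel values:

```python
# A minimal sketch: feature-space distances computed from kernel values only.
import numpy as np

def pairwise_sq_distances(K):
    # d(phi(x_i), phi(x_j))^2 = K_ii + K_jj - 2 K_ij
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2 * K

def sq_distances_to_centroid(K):
    # K_ii - (2/N) sum_j K_ij + (1/N^2) sum_{j,l} K_jl
    return np.diag(K) - 2 * K.mean(axis=1) + K.mean()
```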
Kernels for structured data
Kernels can also be defined on general types of data:
Kernel functions do not need to be defined over vectors; we just need a symmetric positive definite kernel matrix.
Thus, many algorithms can work with general (non-vectorial) data.
Kernels exist to embed strings, trees, graphs, ...
This may be even more important than nonlinearity: we can use kernel-based versions of the classical learning algorithms for recognition of structured data.
Kernel function for objects
Sets: an example of a kernel function for sets (a small sketch follows this slide):
$$k(A,B) = 2^{|A \cap B|}$$
Strings: the inner product of the feature vectors for two strings can be defined, e.g., as a sum over all common subsequences, weighted according to their frequency of occurrence and their lengths.
(Example strings from the slide: A E G A T E A G G and E G T E A G A E G A T G.)
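The set kernel mentioned above is a one-line function on non-vectorial inputs; a minimal sketch (the example sets are arbitrary):

```python
# A minimal sketch: the set kernel k(A, B) = 2^{|A ∩ B|}.
def set_kernel(A, B):
    return 2 ** len(set(A) & set(B))

print(set_kernel({"a", "b", "c"}, {"b", "c", "d"}))   # 2^2 = 4
```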
Kernel trick advantages: summary
Operating in the mapped space without ever computing the
coordinates of the data in that space
Besides vectors, we can introduce kernel functions for
structured data (graphs, strings, etc.)
Much of the geometry of the data in the embedding space is
contained in all pairwise dot products
In many cases, inner product in the embedding space can be
computed efficiently.
SVM: Summary
Hard margin: maximizing margin
Soft margin: handling noisy data and overlapping classes
Slack variables in the problem
Dual problems of hard-margin and soft-margin SVM:
They lead us easily to the non-linear SVM method via kernel substitution.
They also express the classifier decision in terms of the support vectors.
Kernel SVMs:
Learn a linear decision boundary in a high-dimensional space without explicitly working on the mapped data.