CE-717: Machine Learning, Sharif University of Technology, Fall 2015, Soleymani

Support Vector Machine (SVM) and Kernel Methods


Page 1: Support Vector Machine (SVM) and Kernel Methods

CE-717: Machine Learning, Sharif University of Technology

Fall 2015

Soleymani

Support Vector Machine (SVM) and Kernel Methods

Page 2: Support Vector Machine (SVM) and Kernel Methods

Outline

Margin concept

Hard-Margin SVM

Soft-Margin SVM

Dual Problems of Hard-Margin SVM and Soft-Margin SVM

Nonlinear SVM

Kernel trick

Kernel methods

2

Page 3: Support Vector Machine (SVM) and Kernel Methods

Margin

3

Which line is better to select as the boundary to provide more generalization capability?

Margin for a hyperplane that separates samples of two linearly separable classes is:

The smallest distance between the decision boundary and any of the training samples


Larger margin provides better generalization to unseen data

Page 4: Support Vector Machine (SVM) and Kernel Methods

Maximum margin

4

SVM finds the solution with maximum margin

Solution: a hyperplane that is farthest from all training samples

The hyperplane with the largest margin has equal distances to the nearest sample of both classes

[Figure: two separating boundaries in the (𝑥1, 𝑥2) plane are compared; the second one has a larger margin]

Page 5: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Optimization problem

5

max_{𝑀,𝒘,𝑤0}  2𝑀 / ‖𝒘‖

s.t.  𝒘ᵀ𝒙^(𝑖) + 𝑤0 ≥ 𝑀   ∀𝒙^(𝑖) ∈ 𝐶1
      𝒘ᵀ𝒙^(𝑖) + 𝑤0 ≤ −𝑀   ∀𝒙^(𝑖) ∈ 𝐶2

[Figure: decision boundary 𝒘ᵀ𝒙 + 𝑤0 = 0 with margin hyperplanes 𝒘ᵀ𝒙 + 𝑤0 = 𝑀 (side with 𝑦^(𝑖) = 1) and 𝒘ᵀ𝒙 + 𝑤0 = −𝑀 (side with 𝑦^(𝑖) = −1); 𝒘 is normal to the boundary and the margin is 2𝑀/‖𝒘‖]

Page 6: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Optimization problem

6

max_{𝑀,𝒘,𝑤0}  2𝑀 / ‖𝒘‖

s.t.  𝑦^(𝑖)(𝒘ᵀ𝒙^(𝑖) + 𝑤0) ≥ 𝑀   𝑖 = 1, …, 𝑁

[Figure: the same geometry as before, with 𝒘ᵀ𝒙 + 𝑤0 = 0, 𝑀, −𝑀]

𝑀 = min_{𝑖=1,…,𝑁} 𝑦^(𝑖)(𝒘ᵀ𝒙^(𝑖) + 𝑤0)

Page 7: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Optimization problem

7

We can set 𝒘′ = 𝒘/𝑀 and 𝑤0′ = 𝑤0/𝑀:

max_{𝒘′,𝑤0′}  2 / ‖𝒘′‖

s.t.  𝑦^(𝑖)(𝒘′ᵀ𝒙^(𝑖) + 𝑤0′) ≥ 1   𝑖 = 1, …, 𝑁

[Figure: boundary 𝒘′ᵀ𝒙 + 𝑤0′ = 0 with margin hyperplanes 𝒘′ᵀ𝒙 + 𝑤0′ = ±1; the distance from the boundary to each margin hyperplane is 1/‖𝒘′‖]

The positions of the boundary and the margin hyperplanes do not change.

Page 8: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Optimization problem

8

We can equivalently optimize:

min_{𝒘,𝑤0}  (1/2)‖𝒘‖²

s.t.  𝑦^(𝑖)(𝒘ᵀ𝒙^(𝑖) + 𝑤0) ≥ 1   𝑖 = 1, …, 𝑁

It is a convex Quadratic Programming (QP) problem

There are computationally efficient packages to solve it.

It has a global minimum (if a solution exists).

When the training samples are not linearly separable, it has no solution.

How can it be extended to find a solution even when the classes are not linearly separable?
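As a concreteness check, this primal QP can be handed directly to a generic convex/QP solver. Below is a minimal sketch using cvxpy on a small synthetic, linearly separable data set; the library choice, the toy data, and all variable names are assumptions for illustration, not part of the slides.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(+2, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

d = X.shape[1]
w = cp.Variable(d)
w0 = cp.Variable()

# min (1/2)||w||^2  s.t.  y_i (w^T x_i + w0) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + w0) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " w0 =", w0.value)
```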

Page 9: Support Vector Machine (SVM) and Kernel Methods

Beyond linear separability

9

How to extend the hard-margin SVM to allow classification error?

Overlapping classes that can be approximately separated by a linear boundary

Noise in the linearly separable classes


Page 10: Support Vector Machine (SVM) and Kernel Methods

Beyond linear separability: Soft-margin SVM

10

Minimizing the number of misclassified points?!

NP-complete

Soft margin:

Maximizing a margin while trying to minimize the distance between misclassified points and their correct margin plane

Page 11: Support Vector Machine (SVM) and Kernel Methods

Soft-margin SVM

11

SVM with slack variables: allows samples to fall within the margin, but penalizes them

min_{𝒘,𝑤0,{𝜉𝑖}}  (1/2)‖𝒘‖² + 𝐶 Σ_{𝑖=1}^{𝑁} 𝜉𝑖

s.t.  𝑦^(𝑖)(𝒘ᵀ𝒙^(𝑖) + 𝑤0) ≥ 1 − 𝜉𝑖   𝑖 = 1, …, 𝑁
      𝜉𝑖 ≥ 0

𝜉𝑖: slack variables

𝜉𝑖 > 1: 𝒙^(𝑖) is misclassified

0 < 𝜉𝑖 < 1: 𝒙^(𝑖) is correctly classified but lies inside the margin


Page 12: Support Vector Machine (SVM) and Kernel Methods

Soft-margin SVM

12

Linear penalty (hinge loss) for a sample if it is misclassified or lies in the margin

tries to maintain 𝜉𝑖 small while maximizing the margin.

always finds a solution (as opposed to hard-margin SVM)

more robust to the outliers

Soft margin problem is still a convex QP


Page 13: Support Vector Machine (SVM) and Kernel Methods

Soft-margin SVM: Parameter 𝐶

13

𝐶 is a tradeoff parameter:

small 𝐶 allows margin constraints to be easily ignored → large margin

large 𝐶 makes constraints hard to ignore → narrow margin

𝐶 → ∞ enforces all constraints: hard margin

𝐶 can be determined using a technique like cross-validation
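In practice this tuning is often done by a grid search with cross-validation; the sketch below uses scikit-learn's GridSearchCV for a linear soft-margin SVM (the library, the toy data, and the grid of C values are illustrative assumptions).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Illustrative data: two noisy, overlapping classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(+1, 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

# 5-fold cross-validation over a grid of C values for a linear soft-margin SVM
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print("best C:", grid.best_params_["C"])
```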

Page 14: Support Vector Machine (SVM) and Kernel Methods

Soft-margin SVM: Cost function

14

min_{𝒘,𝑤0,{𝜉𝑖}}  (1/2)‖𝒘‖² + 𝐶 Σ_{𝑖=1}^{𝑁} 𝜉𝑖

s.t.  𝑦^(𝑖)(𝒘ᵀ𝒙^(𝑖) + 𝑤0) ≥ 1 − 𝜉𝑖   𝑖 = 1, …, 𝑁
      𝜉𝑖 ≥ 0

It is equivalent to the unconstrained optimization problem:

min_{𝒘,𝑤0}  (1/2)‖𝒘‖² + 𝐶 Σ_{𝑖=1}^{𝑁} max(0, 1 − 𝑦^(𝑖)(𝒘ᵀ𝒙^(𝑖) + 𝑤0))
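For concreteness, the unconstrained hinge-loss objective above can be evaluated in a few lines of NumPy; this is only a sketch of the loss computation (X, y, w, w0, C are placeholder names), not a training procedure.

```python
import numpy as np

def soft_margin_objective(w, w0, X, y, C):
    """Evaluate (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + w0))."""
    margins = y * (X @ w + w0)              # y_i (w^T x_i + w0) for every sample
    hinge = np.maximum(0.0, 1.0 - margins)  # hinge loss per sample
    return 0.5 * np.dot(w, w) + C * hinge.sum()

# Example call with placeholder values
w = np.array([1.0, -0.5]); w0 = 0.2
X = np.array([[1.0, 2.0], [-1.0, -1.0]]); y = np.array([1, -1])
print(soft_margin_objective(w, w0, X, y, C=1.0))
```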

Page 15: Support Vector Machine (SVM) and Kernel Methods

SVM loss function

15

Hinge loss vs. 0-1 loss

[Figure: the 0-1 loss and the hinge loss max(0, 1 − 𝑦(𝒘ᵀ𝒙 + 𝑤0)) plotted against 𝒘ᵀ𝒙 + 𝑤0 for 𝑦 = 1]

Page 16: Support Vector Machine (SVM) and Kernel Methods

Dual formulation of the SVM

16

We are going to introduce the dual SVM problem, which is equivalent to the original primal problem. The dual problem:

is often easier to solve

enables us to exploit the kernel trick

gives us further insights into the optimal hyperplane

Page 17: Support Vector Machine (SVM) and Kernel Methods

Optimization: Lagrangian multipliers

17

𝑝∗ = min_𝒙 𝑓(𝒙)
s.t.  𝑔𝑖(𝒙) ≤ 0   𝑖 = 1, …, 𝑚
      ℎ𝑖(𝒙) = 0   𝑖 = 1, …, 𝑝

Lagrangian, with Lagrange multipliers 𝜶 = [𝛼1, …, 𝛼𝑚] and 𝝀 = [𝜆1, …, 𝜆𝑝]:

ℒ(𝒙, 𝜶, 𝝀) = 𝑓(𝒙) + Σ_{𝑖=1}^{𝑚} 𝛼𝑖 𝑔𝑖(𝒙) + Σ_{𝑖=1}^{𝑝} 𝜆𝑖 ℎ𝑖(𝒙)

max_{𝛼𝑖≥0, 𝜆𝑖} ℒ(𝒙, 𝜶, 𝝀) = ∞ if any 𝑔𝑖(𝒙) > 0, ∞ if any ℎ𝑖(𝒙) ≠ 0, and 𝑓(𝒙) otherwise

𝑝∗ = min_𝒙 max_{𝛼𝑖≥0, 𝜆𝑖} ℒ(𝒙, 𝜶, 𝝀)

Page 18: Support Vector Machine (SVM) and Kernel Methods

Optimization: Dual problem

18

In general, we have:

max_𝑥 min_𝑦 𝑓(𝑥, 𝑦) ≤ min_𝑦 max_𝑥 𝑓(𝑥, 𝑦)

Primal problem: 𝑝∗ = min_𝒙 max_{𝛼𝑖≥0, 𝜆𝑖} ℒ(𝒙, 𝜶, 𝝀)

Dual problem: 𝑑∗ = max_{𝛼𝑖≥0, 𝜆𝑖} min_𝒙 ℒ(𝒙, 𝜶, 𝝀)

Obtained by swapping the order of min and max

𝑑∗ ≤ 𝑝∗

When the original problem is convex (𝑓 and the 𝑔𝑖 are convex functions and the ℎ𝑖 are affine) and a mild constraint qualification holds (e.g., Slater's condition), we have strong duality: 𝑑∗ = 𝑝∗

Page 19: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Dual problem

19

min_{𝒘,𝑤0}  (1/2)‖𝒘‖²

s.t.  𝑦^(𝑖)(𝒘ᵀ𝒙^(𝑖) + 𝑤0) ≥ 1   𝑖 = 1, …, 𝑁

By incorporating the constraints through Lagrange multipliers, we will have:

min_{𝒘,𝑤0} max_{𝛼𝑖≥0}  (1/2)‖𝒘‖² + Σ_{𝑖=1}^{𝑁} 𝛼𝑖 (1 − 𝑦^(𝑖)(𝒘ᵀ𝒙^(𝑖) + 𝑤0))

Dual problem (changing the order of min and max in the above problem):

max_{𝛼𝑖≥0} min_{𝒘,𝑤0}  (1/2)‖𝒘‖² + Σ_{𝑖=1}^{𝑁} 𝛼𝑖 (1 − 𝑦^(𝑖)(𝒘ᵀ𝒙^(𝑖) + 𝑤0))

Page 20: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Dual problem

20

max_{𝛼𝑖≥0} min_{𝒘,𝑤0} ℒ(𝒘, 𝑤0, 𝜶)

ℒ(𝒘, 𝑤0, 𝜶) = (1/2)‖𝒘‖² + Σ_{𝑖=1}^{𝑁} 𝛼𝑖 (1 − 𝑦^(𝑖)(𝒘ᵀ𝒙^(𝑖) + 𝑤0))

∇_𝒘 ℒ(𝒘, 𝑤0, 𝜶) = 0  ⇒  𝒘 = Σ_{𝑖=1}^{𝑁} 𝛼𝑖 𝑦^(𝑖) 𝒙^(𝑖)

∂ℒ(𝒘, 𝑤0, 𝜶)/∂𝑤0 = 0  ⇒  Σ_{𝑖=1}^{𝑁} 𝛼𝑖 𝑦^(𝑖) = 0

𝑤0 does not appear in the resulting problem; instead, a "global" constraint on 𝜶 is created.

Page 21: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Dual problem

21

max_𝜶  Σ_{𝑖=1}^{𝑁} 𝛼𝑖 − (1/2) Σ_{𝑖=1}^{𝑁} Σ_{𝑗=1}^{𝑁} 𝛼𝑖 𝛼𝑗 𝑦^(𝑖) 𝑦^(𝑗) 𝒙^(𝑖)ᵀ𝒙^(𝑗)

s.t.  Σ_{𝑖=1}^{𝑁} 𝛼𝑖 𝑦^(𝑖) = 0
      𝛼𝑖 ≥ 0   𝑖 = 1, …, 𝑁

It is a convex QP

By solving the above problem, we first find 𝜶 and then 𝒘 = Σ_{𝑖=1}^{𝑁} 𝛼𝑖 𝑦^(𝑖) 𝒙^(𝑖).

𝑤0 can be set by making the margin equidistant to the classes (we discuss it more after defining support vectors).
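A minimal numerical sketch of this dual, again with cvxpy (an assumed tool; the toy data and names are illustrative). The quadratic term is written as ‖Σᵢ 𝛼𝑖 𝑦^(𝑖) 𝒙^(𝑖)‖², which equals the double sum above and keeps the model in a form the solver accepts.

```python
import numpy as np
import cvxpy as cp

# Placeholder training set: X is (N, d), y is (N,) with labels in {-1, +1}
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (15, 2)), rng.normal(+2, 0.5, (15, 2))])
y = np.hstack([-np.ones(15), np.ones(15)])
N = len(y)

alpha = cp.Variable(N)
Yx = y[:, None] * X                        # rows are y_i * x_i

# max  sum_i alpha_i - (1/2) || sum_i alpha_i y_i x_i ||^2
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(Yx.T @ alpha))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

w = Yx.T @ alpha.value                     # w = sum_i alpha_i y_i x_i
print("number of support vectors:", int(np.sum(alpha.value > 1e-6)))
```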

Page 22: Support Vector Machine (SVM) and Kernel Methods

Karush-Kuhn-Tucker (KKT) conditions

22

Necessary conditions for the solution (𝒘∗, 𝑤0∗, 𝜶∗):

∇_𝒘 ℒ(𝒘, 𝑤0, 𝜶)|_{𝒘∗,𝑤0∗,𝜶∗} = 0
∂ℒ(𝒘, 𝑤0, 𝜶)/∂𝑤0 |_{𝒘∗,𝑤0∗,𝜶∗} = 0
𝛼𝑖∗ ≥ 0   𝑖 = 1, …, 𝑁
𝑦^(𝑖)(𝒘∗ᵀ𝒙^(𝑖) + 𝑤0∗) ≥ 1   𝑖 = 1, …, 𝑁
𝛼𝑖∗ (1 − 𝑦^(𝑖)(𝒘∗ᵀ𝒙^(𝑖) + 𝑤0∗)) = 0   𝑖 = 1, …, 𝑁

In general, the optimal (𝒙∗, 𝜶∗) satisfies the KKT conditions:

∇_𝒙 ℒ(𝒙, 𝜶)|_{𝒙∗,𝜶∗} = 0
𝛼𝑖∗ ≥ 0   𝑖 = 1, …, 𝑚
𝑔𝑖(𝒙∗) ≤ 0   𝑖 = 1, …, 𝑚
𝛼𝑖∗ 𝑔𝑖(𝒙∗) = 0   𝑖 = 1, …, 𝑚

Page 23: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Support vectors

23

Inactive constraint: 𝑦^(𝑖)(𝒘ᵀ𝒙^(𝑖) + 𝑤0) > 1  ⇒  𝛼𝑖 = 0, and thus 𝒙^(𝑖) is not a support vector.

Active constraint: 𝑦^(𝑖)(𝒘ᵀ𝒙^(𝑖) + 𝑤0) = 1  ⇒  𝛼𝑖 can be greater than 0, and thus 𝒙^(𝑖) can be a support vector.

[Figure: the samples lying on the margin hyperplanes are marked 𝛼 > 0; the distance from the boundary to each margin hyperplane is 1/‖𝒘‖]

Page 24: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Support vectors

24

Inactive constraint: 𝑦^(𝑖)(𝒘ᵀ𝒙^(𝑖) + 𝑤0) > 1  ⇒  𝛼𝑖 = 0, and thus 𝒙^(𝑖) is not a support vector.

Active constraint: 𝑦^(𝑖)(𝒘ᵀ𝒙^(𝑖) + 𝑤0) = 1  ⇒  𝛼𝑖 can be greater than 0, and thus 𝒙^(𝑖) can be a support vector.

[Figure: as above, but now some samples lying on the margin hyperplanes are marked 𝛼 > 0 and others 𝛼 = 0]

A sample with 𝛼𝑖 = 0 can also lie on one of the margin hyperplanes.

Page 25: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Support vectors

25

Support Vectors (SVs) = {𝒙^(𝑖) : 𝛼𝑖 > 0}

The direction of the hyperplane can be found based only on the support vectors:

𝒘 = Σ_{𝛼𝑖>0} 𝛼𝑖 𝑦^(𝑖) 𝒙^(𝑖)

Page 26: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Dual problem

Classifying new samples using only SVs

26

Classification of a new sample 𝒙:

𝑦 = sign(𝑤0 + 𝒘ᵀ𝒙) = sign(𝑤0 + Σ_{𝛼𝑖>0} 𝛼𝑖 𝑦^(𝑖) 𝒙^(𝑖)ᵀ𝒙)

The classifier is based on an expansion in terms of dot products of 𝒙 with the support vectors.

Support vectors are sufficient to predict the labels of new samples.

Page 27: Support Vector Machine (SVM) and Kernel Methods

How to find 𝑤0

27

At least one of the 𝛼's is strictly positive (𝛼𝑠 > 0 for a support vector).

The KKT complementary slackness condition tells us that the constraint corresponding to this non-zero 𝛼𝑠 holds with equality.

𝑤0 is found such that the following constraint is satisfied:

𝑦𝑠(𝑤0 + 𝒘ᵀ𝒙𝑠) = 1

We can find 𝑤0 using one of the support vectors:

⇒ 𝑤0 = 𝑦𝑠 − 𝒘ᵀ𝒙𝑠 = 𝑦𝑠 − Σ_{𝛼𝑛>0} 𝛼𝑛 𝑦^(𝑛) 𝒙^(𝑛)ᵀ𝒙𝑠
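Continuing the illustrative dual sketch from earlier, 𝑤0 and predictions can be recovered from the support vectors roughly as follows (a NumPy sketch with placeholder names alpha, X, y, X_new; any support vector could be used for the bias).

```python
import numpy as np

def recover_bias_and_predict(alpha, X, y, X_new, tol=1e-6):
    """Recover w and w0 from the dual variables, then classify new points."""
    w = (alpha * y) @ X                    # w = sum_i alpha_i y_i x_i
    support = np.flatnonzero(alpha > tol)  # indices with alpha_i > 0
    s = support[0]                         # any support vector x_s works
    w0 = y[s] - w @ X[s]                   # from y_s (w^T x_s + w0) = 1
    return w0, np.sign(w0 + X_new @ w)

# Usage (placeholder arrays): w0, labels = recover_bias_and_predict(alpha, X, y, X_test)
```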

Page 28: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Dual problem

28

max_𝜶  Σ_{𝑖=1}^{𝑁} 𝛼𝑖 − (1/2) Σ_{𝑖=1}^{𝑁} Σ_{𝑗=1}^{𝑁} 𝛼𝑖 𝛼𝑗 𝑦^(𝑖) 𝑦^(𝑗) 𝒙^(𝑖)ᵀ𝒙^(𝑗)

s.t.  Σ_{𝑖=1}^{𝑁} 𝛼𝑖 𝑦^(𝑖) = 0
      𝛼𝑖 ≥ 0   𝑖 = 1, …, 𝑁

Only the dot product of each pair of training samples appears in the optimization problem.

This is an important property that helps extend SVM to the nonlinear case (the cost function does not depend explicitly on the dimensionality of the feature space).

Page 29: Support Vector Machine (SVM) and Kernel Methods

Soft-margin SVM: Dual problem

29

max_𝜶  Σ_{𝑖=1}^{𝑁} 𝛼𝑖 − (1/2) Σ_{𝑖=1}^{𝑁} Σ_{𝑗=1}^{𝑁} 𝛼𝑖 𝛼𝑗 𝑦^(𝑖) 𝑦^(𝑗) 𝒙^(𝑖)ᵀ𝒙^(𝑗)

s.t.  Σ_{𝑖=1}^{𝑁} 𝛼𝑖 𝑦^(𝑖) = 0
      0 ≤ 𝛼𝑖 ≤ 𝐶   𝑖 = 1, …, 𝑁

By solving the above quadratic problem, we first find 𝜶, then find 𝒘 = Σ_{𝑖=1}^{𝑁} 𝛼𝑖 𝑦^(𝑖) 𝒙^(𝑖), and 𝑤0 is computed from the SVs.

For a test sample 𝒙 (as before):

𝑦 = sign(𝑤0 + 𝒘ᵀ𝒙) = sign(𝑤0 + Σ_{𝛼𝑖>0} 𝛼𝑖 𝑦^(𝑖) 𝒙^(𝑖)ᵀ𝒙)

Page 30: Support Vector Machine (SVM) and Kernel Methods

Soft-margin SVM: Support vectors

30

Support Vectors: 𝛼 > 0

If 0 < 𝛼 < 𝐶: SVs on the margin, 𝜉 = 0.

If 𝛼 = 𝐶: SVs on or over the margin.

Page 31: Support Vector Machine (SVM) and Kernel Methods

Primal vs. dual hard-margin SVM problem

31

Primal problem of hard-margin SVM

𝑁 inequality constraints

𝑑 + 1 number of variables

Dual problem of hard-margin SVM

one equality constraint

𝑁 positivity constraints

𝑁 number of variables (Lagrange multipliers)

Objective function more complicated

The dual problem is helpful and instrumental for using the kernel trick

Page 32: Support Vector Machine (SVM) and Kernel Methods

Primal vs. dual soft-margin SVM problem

32

Primal problem of soft-margin SVM

𝑁 inequality constraints

𝑁 positivity constraints

𝑁 + 𝑑 + 1 number of variables

Dual problem of soft-margin SVM

one equality constraint

2𝑁 positivity constraints

𝑁 number of variables (Lagrange multipliers)

Objective function more complicated

The dual problem is helpful and instrumental for using the kernel trick

Page 33: Support Vector Machine (SVM) and Kernel Methods

Not linearly separable data

33

Noisy data or overlapping classes (we discussed this: soft margin)

Near linearly separable

Non-linear decision surface

Transform to a new feature space


Page 34: Support Vector Machine (SVM) and Kernel Methods

Nonlinear SVM

34

Assume a transformation 𝜙: ℝ^𝑑 → ℝ^𝑚 on the feature space:

𝒙 → 𝝓(𝒙)

Find a hyper-plane in the transformed feature space:

[Figure: data that are not linearly separable in the (𝑥1, 𝑥2) space are mapped by 𝜙: 𝒙 → 𝝓(𝒙) to the (𝜙1(𝒙), 𝜙2(𝒙)) space, where the separating hyperplane is 𝒘ᵀ𝝓(𝒙) + 𝑤0 = 0]

{𝜙1(𝒙), …, 𝜙𝑚(𝒙)}: set of basis functions (or features), with 𝜙𝑖(𝒙): ℝ^𝑑 → ℝ

𝝓(𝒙) = [𝜙1(𝒙), …, 𝜙𝑚(𝒙)]

Page 35: Support Vector Machine (SVM) and Kernel Methods

Soft-margin SVM in a transformed space:

Primal problem

35

Primal problem:

min_{𝒘,𝑤0}  (1/2)‖𝒘‖² + 𝐶 Σ_{𝑖=1}^{𝑁} 𝜉𝑖

s.t.  𝑦^(𝑖)(𝒘ᵀ𝝓(𝒙^(𝑖)) + 𝑤0) ≥ 1 − 𝜉𝑖   𝑖 = 1, …, 𝑁
      𝜉𝑖 ≥ 0

𝒘 ∈ ℝ^𝑚: the weights that must be found

If 𝑚 ≫ 𝑑 (a very high-dimensional feature space), there are many more parameters to learn

Page 36: Support Vector Machine (SVM) and Kernel Methods

Soft-margin SVM in a transformed space:

Dual problem

36

Optimization problem:

max𝜶

𝑖=1

𝑁

𝛼𝑖 −1

2

𝑖=1

𝑁

𝑗=1

𝑁

𝛼𝑖𝛼𝑗𝑦(𝑖)𝑦(𝑗)𝝓 𝒙(𝑖) 𝑇

𝝓 𝒙(𝑖)

Subject to 𝑖=1𝑁 𝛼𝑖𝑦

(𝑖) = 0

0 ≤ 𝛼𝑖 ≤ 𝐶 𝑖 = 1, … , 𝑛

If we have the inner products 𝝓(𝒙^(𝑖))ᵀ𝝓(𝒙^(𝑗)), only 𝜶 = [𝛼1, …, 𝛼𝑁] needs to be learnt.

It is not necessary to learn 𝑚 parameters, as opposed to the primal problem.

Page 37: Support Vector Machine (SVM) and Kernel Methods

Kernelized soft-margin SVM

37

Optimization problem:

max_𝜶  Σ_{𝑖=1}^{𝑁} 𝛼𝑖 − (1/2) Σ_{𝑖=1}^{𝑁} Σ_{𝑗=1}^{𝑁} 𝛼𝑖 𝛼𝑗 𝑦^(𝑖) 𝑦^(𝑗) 𝝓(𝒙^(𝑖))ᵀ𝝓(𝒙^(𝑗))

s.t.  Σ_{𝑖=1}^{𝑁} 𝛼𝑖 𝑦^(𝑖) = 0
      0 ≤ 𝛼𝑖 ≤ 𝐶   𝑖 = 1, …, 𝑁

Classifying a new sample:

𝑦 = sign(𝑤0 + 𝒘ᵀ𝝓(𝒙)) = sign(𝑤0 + Σ_{𝛼𝑖>0} 𝛼𝑖 𝑦^(𝑖) 𝝓(𝒙^(𝑖))ᵀ𝝓(𝒙))

The inner products 𝝓(𝒙^(𝑖))ᵀ𝝓(𝒙^(𝑗)) and 𝝓(𝒙^(𝑖))ᵀ𝝓(𝒙) are replaced by 𝑘(𝒙^(𝑖), 𝒙^(𝑗)) and 𝑘(𝒙^(𝑖), 𝒙), where

𝑘(𝒙, 𝒙′) = 𝝓(𝒙)ᵀ𝝓(𝒙′)
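To illustrate that only inner products are needed, the sketch below trains scikit-learn's SVC on a precomputed Gram matrix; the degree-2 polynomial kernel, the toy data, and all names are assumptions for the example.

```python
import numpy as np
from sklearn.svm import SVC

def poly2_kernel(A, B):
    """k(x, x') = (x^T x' + 1)^2, evaluated for all pairs of rows of A and B."""
    return (A @ B.T + 1.0) ** 2

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 2))
y_train = np.sign(X_train[:, 0] * X_train[:, 1])     # a nonlinear toy labeling
X_test = rng.normal(size=(5, 2))

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(poly2_kernel(X_train, X_train), y_train)     # only the N x N Gram matrix is used
print(clf.predict(poly2_kernel(X_test, X_train)))    # test-vs-train kernel values
```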

Page 38: Support Vector Machine (SVM) and Kernel Methods

Kernel trick

38

Kernel: the dot product in the mapped feature space ℱ: 𝑘(𝒙, 𝒙′) = 𝝓(𝒙)ᵀ𝝓(𝒙′), where 𝝓(𝒙) = [𝜙1(𝒙), …, 𝜙𝑚(𝒙)]ᵀ

𝑘(𝒙, 𝒙′) shows the similarity of 𝒙 and 𝒙′.

Distances in the feature space can be found from the kernel accordingly.

Kernel trick → extension of many well-known algorithms to kernel-based ones, by substituting the dot product with the kernel function

Page 39: Support Vector Machine (SVM) and Kernel Methods

Kernel trick: Idea

39

Idea: when the input vectors appear only in the form of dot products, we can use other choices of kernel function instead of the inner product

Solving the problem without explicitly mapping the data

Explicit mapping is expensive if 𝝓(𝒙) is very high dimensional

Instead of using a mapping 𝝓: 𝒳 → ℱ to represent 𝒙 ∈ 𝒳 by 𝝓(𝒙) ∈ ℱ, a similarity function 𝑘: 𝒳 × 𝒳 → ℝ is used.

The data set can then be represented by an 𝑁 × 𝑁 matrix 𝑲.

We specify only an inner product function between points in the transformed space (not their coordinates).

In many cases, the inner product in the embedding space can be computed efficiently.

Page 40: Support Vector Machine (SVM) and Kernel Methods

Why kernel?

40

Kernel functions 𝐾 can indeed be efficiently computed, with a cost proportional to 𝑑 instead of 𝑚.

Example: consider the second-order polynomial transform:

𝝓(𝒙) = [1, 𝑥1, …, 𝑥𝑑, 𝑥1², 𝑥1𝑥2, …, 𝑥𝑑𝑥𝑑]ᵀ   (𝑚 = 1 + 𝑑 + 𝑑² features)

𝝓(𝒙)ᵀ𝝓(𝒙′) = 1 + Σ_{𝑖=1}^{𝑑} 𝑥𝑖𝑥𝑖′ + Σ_{𝑖=1}^{𝑑} Σ_{𝑗=1}^{𝑑} 𝑥𝑖𝑥𝑗𝑥𝑖′𝑥𝑗′

Since Σ_{𝑖=1}^{𝑑} Σ_{𝑗=1}^{𝑑} 𝑥𝑖𝑥𝑗𝑥𝑖′𝑥𝑗′ = (Σ_{𝑖=1}^{𝑑} 𝑥𝑖𝑥𝑖′) × (Σ_{𝑗=1}^{𝑑} 𝑥𝑗𝑥𝑗′), we get

𝝓(𝒙)ᵀ𝝓(𝒙′) = 1 + 𝒙ᵀ𝒙′ + (𝒙ᵀ𝒙′)²

Evaluating 𝝓(𝒙)ᵀ𝝓(𝒙′) directly costs 𝑂(𝑚), while evaluating 1 + 𝒙ᵀ𝒙′ + (𝒙ᵀ𝒙′)² costs only 𝑂(𝑑).
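A quick numerical sanity check of this identity (purely illustrative; the explicit map below lists 1, the coordinates, and all ordered products x_i x_j, matching the transform above).

```python
import numpy as np

def phi(x):
    """Explicit second-order transform: [1, x_1..x_d, x_i * x_j for all i, j]."""
    return np.concatenate(([1.0], x, np.outer(x, x).ravel()))

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)

lhs = phi(x) @ phi(z)                 # O(m) dot product, m = 1 + d + d^2
rhs = 1.0 + x @ z + (x @ z) ** 2      # O(d) kernel evaluation
print(np.isclose(lhs, rhs))           # True
```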

Page 41: Support Vector Machine (SVM) and Kernel Methods

Polynomial kernel: Efficient kernel function

41

An efficient kernel function relies on a carefully constructed specific transform to allow fast computation.

Example: for polynomial kernels, certain combinations of coefficients still make 𝐾 easy to compute (𝛾 > 0 and 𝑐 > 0):

𝝓(𝒙) = [𝑐, √(2𝑐𝛾) 𝑥1, …, √(2𝑐𝛾) 𝑥𝑑, 𝛾𝑥1², 𝛾𝑥1𝑥2, …, 𝛾𝑥𝑑𝑥𝑑]ᵀ, so that 𝝓(𝒙)ᵀ𝝓(𝒙′) = (𝛾𝒙ᵀ𝒙′ + 𝑐)²

Page 42: Support Vector Machine (SVM) and Kernel Methods

Polynomial kernel: Degree two

42

For a 𝑑-dimensional feature space 𝒙 = [𝑥1, …, 𝑥𝑑]ᵀ, we instead use 𝑘(𝒙, 𝒙′) = (𝒙ᵀ𝒙′ + 1)², which corresponds to:

𝝓(𝒙) = [1, √2 𝑥1, …, √2 𝑥𝑑, 𝑥1², …, 𝑥𝑑², √2 𝑥1𝑥2, …, √2 𝑥1𝑥𝑑, √2 𝑥2𝑥3, …, √2 𝑥𝑑−1𝑥𝑑]ᵀ

or, equivalently (keeping every ordered pair of coordinates instead of scaling the cross terms),

𝝓(𝒙) = [1, √2 𝑥1, …, √2 𝑥𝑑, 𝑥1², …, 𝑥𝑑², 𝑥1𝑥2, …, 𝑥1𝑥𝑑, 𝑥2𝑥1, …, 𝑥𝑑−1𝑥𝑑]ᵀ

Page 43: Support Vector Machine (SVM) and Kernel Methods

Polynomial kernel

44

In general, 𝑘(𝒙, 𝒛) = (𝒙ᵀ𝒛)^𝑀 contains all monomials of order 𝑀.

This can similarly be generalized to include all terms up to degree 𝑀: 𝑘(𝒙, 𝒛) = (𝒙ᵀ𝒛 + 𝑐)^𝑀 with 𝑐 > 0.

Example: SVM boundary for a polynomial kernel

𝑤0 + 𝒘ᵀ𝝓(𝒙) = 0
⇒ 𝑤0 + Σ_{𝛼𝑖>0} 𝛼𝑖 𝑦^(𝑖) 𝝓(𝒙^(𝑖))ᵀ𝝓(𝒙) = 0
⇒ 𝑤0 + Σ_{𝛼𝑖>0} 𝛼𝑖 𝑦^(𝑖) 𝑘(𝒙^(𝑖), 𝒙) = 0
⇒ 𝑤0 + Σ_{𝛼𝑖>0} 𝛼𝑖 𝑦^(𝑖) (𝒙^(𝑖)ᵀ𝒙 + 𝑐)^𝑀 = 0

The boundary is a polynomial of order 𝑀.

Page 44: Support Vector Machine (SVM) and Kernel Methods

Some common kernel functions

45

𝐾 is a mathematical and computational shortcut that allows us to combine the transform and the inner product into a single, more efficient function.

Linear: 𝑘(𝒙, 𝒙′) = 𝒙ᵀ𝒙′

Polynomial: 𝑘(𝒙, 𝒙′) = (𝒙ᵀ𝒙′ + 𝑐)^𝑀,  𝑐 ≥ 0

Gaussian: 𝑘(𝒙, 𝒙′) = exp(−‖𝒙 − 𝒙′‖² / 2𝜎²)

Sigmoid: 𝑘(𝒙, 𝒙′) = tanh(𝑎𝒙ᵀ𝒙′ + 𝑏)
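These four kernels are straightforward to write directly; a small NumPy sketch, vectorized over the rows of X and Z, with illustrative default parameter values.

```python
import numpy as np

def linear_kernel(X, Z):
    return X @ Z.T

def polynomial_kernel(X, Z, c=1.0, M=3):
    return (X @ Z.T + c) ** M

def gaussian_kernel(X, Z, sigma=1.0):
    # ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x^T z, computed for all pairs of rows
    sq_dists = (np.sum(X ** 2, axis=1)[:, None]
                + np.sum(Z ** 2, axis=1)[None, :]
                - 2.0 * X @ Z.T)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def sigmoid_kernel(X, Z, a=1.0, b=0.0):
    return np.tanh(a * (X @ Z.T) + b)

# Usage: K = gaussian_kernel(X_train, X_train, sigma=0.5) gives the N x N Gram matrix
```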

Page 45: Support Vector Machine (SVM) and Kernel Methods

Gaussian kernel

46

Gaussian: 𝑘(𝒙, 𝒙′) = exp(−‖𝒙 − 𝒙′‖² / 2𝜎²)

Infinite-dimensional feature space

Example: SVM boundary for a Gaussian kernel

Considers a Gaussian function centered at each data point:

𝑤0 + Σ_{𝛼𝑖>0} 𝛼𝑖 𝑦^(𝑖) exp(−‖𝒙 − 𝒙^(𝑖)‖² / 2𝜎²) = 0

SVM + Gaussian Kernel can classify any arbitrary training set

Training error is zero when 𝜎 → 0: all samples become support vectors (likely overfitting)
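To see this effect experimentally, one can sweep the kernel width with scikit-learn's SVC, which parameterizes the Gaussian kernel as exp(−γ‖𝒙 − 𝒙′‖²), i.e., γ = 1/(2σ²); the data set and parameter values below are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0)   # nonlinear toy labels

for sigma in [2.0, 0.5, 0.05]:
    clf = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2), C=10.0).fit(X, y)
    # Shrinking sigma typically drives training error toward zero and turns
    # more samples into support vectors, a symptom of overfitting.
    print(f"sigma={sigma}: train acc={clf.score(X, y):.2f}, #SV={clf.n_support_.sum()}")
```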

Page 46: Support Vector Machine (SVM) and Kernel Methods

SVM: Linear and Gaussian kernels example

47

Gaussian kernel: 𝐶 = 1, 𝜎 = 0.25.  Linear kernel: 𝐶 = 0.1

[Zisserman’s slides]

Some SVs appear to be far from the boundary?

Page 47: Support Vector Machine (SVM) and Kernel Methods

Hard margin Example

48

For a narrow Gaussian (small 𝜎), even the protection of a large margin cannot suppress overfitting.

Y. S. Abu-Mostafa et al., "Learning From Data", 2012

Page 48: Support Vector Machine (SVM) and Kernel Methods

SVM Gaussian kernel: Example

49 Source: Zisserman’s slides

𝑓(𝒙) = 𝑤0 + Σ_{𝛼𝑖>0} 𝛼𝑖 𝑦^(𝑖) exp(−‖𝒙 − 𝒙^(𝑖)‖² / 2𝜎²)

Page 49: Support Vector Machine (SVM) and Kernel Methods

SVM Gaussian kernel: Example

50 Source: Zisserman’s slides

Page 50: Support Vector Machine (SVM) and Kernel Methods

SVM Gaussian kernel: Example

51 Source: Zisserman’s slides

Page 51: Support Vector Machine (SVM) and Kernel Methods

SVM Gaussian kernel: Example

52 Source: Zisserman’s slides

Page 52: Support Vector Machine (SVM) and Kernel Methods

SVM Gaussian kernel: Example

53 Source: Zisserman’s slides

Page 53: Support Vector Machine (SVM) and Kernel Methods

SVM Gaussian kernel: Example

54 Source: Zisserman’s slides

Page 54: Support Vector Machine (SVM) and Kernel Methods

SVM Gaussian kernel: Example

55 Source: Zisserman’s slides

Page 55: Support Vector Machine (SVM) and Kernel Methods

Constructing kernels

56

Construct kernel functions directly

Ensure that it is a valid kernel

Corresponds to an inner product in some feature space.

Example: 𝑘(𝒙, 𝒙′) = (𝒙ᵀ𝒙′)²

Corresponding mapping: 𝝓(𝒙) = [𝑥1², √2 𝑥1𝑥2, 𝑥2²]ᵀ for 𝒙 = [𝑥1, 𝑥2]ᵀ

We need a way to test whether a kernel is valid without having to construct 𝝓(𝒙).

Page 56: Support Vector Machine (SVM) and Kernel Methods

Valid kernel: Necessary & sufficient conditions

57

Gram matrix 𝑲_{𝑁×𝑁}: 𝐾𝑖𝑗 = 𝑘(𝒙^(𝑖), 𝒙^(𝑗))

Restricting the kernel function to a set of points {𝒙^(1), 𝒙^(2), …, 𝒙^(𝑁)}:

𝑲 = [ 𝑘(𝒙^(1), 𝒙^(1))  ⋯  𝑘(𝒙^(1), 𝒙^(𝑁)) ;  ⋮  ⋱  ⋮ ;  𝑘(𝒙^(𝑁), 𝒙^(1))  ⋯  𝑘(𝒙^(𝑁), 𝒙^(𝑁)) ]

Mercer theorem: 𝑘 is a valid kernel if and only if the kernel matrix is symmetric positive semi-definite (for any choice of data points)

Any symmetric positive semi-definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space [Shawe-Taylor & Cristianini 2004]
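A practical (necessary-condition) check on a sample of points is to verify that the Gram matrix is symmetric and has no significantly negative eigenvalues; a small NumPy sketch under that assumption (the example kernels are illustrative).

```python
import numpy as np

def is_psd_gram(K, tol=1e-8):
    """Check symmetry and positive semi-definiteness of a Gram matrix."""
    symmetric = np.allclose(K, K.T)
    eigvals = np.linalg.eigvalsh((K + K.T) / 2.0)   # eigvalsh expects a symmetric matrix
    return symmetric and eigvals.min() >= -tol

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

K_rbf = np.exp(-sq / 2.0)      # Gaussian kernel Gram matrix: valid (PSD)
K_bad = -sq                    # negated squared distances: not a valid kernel
print(is_psd_gram(K_rbf), is_psd_gram(K_bad))   # expect: True False
```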

Page 57: Support Vector Machine (SVM) and Kernel Methods

Construct Valid Kernels

58

Rules for constructing valid kernels from existing ones (from [Bishop]): given valid kernels 𝑘1 and 𝑘2, the following are also valid kernels:

• 𝑐 𝑘1(𝒙, 𝒙′), for 𝑐 > 0

• 𝑓(𝒙) 𝑘1(𝒙, 𝒙′) 𝑓(𝒙′), for any function 𝑓(⋅)

• 𝑞(𝑘1(𝒙, 𝒙′)), for a polynomial 𝑞(⋅) with coefficients ≥ 0

• exp(𝑘1(𝒙, 𝒙′))

• 𝑘1(𝒙, 𝒙′) + 𝑘2(𝒙, 𝒙′) and 𝑘1(𝒙, 𝒙′) 𝑘2(𝒙, 𝒙′)

• 𝑘3(𝝓(𝒙), 𝝓(𝒙′)), where 𝝓(𝒙) is a function from 𝒙 to ℝ^𝑀 and 𝑘3(⋅, ⋅) is a valid kernel on ℝ^𝑀

• 𝒙ᵀ𝑨𝒙′, where 𝑨 is a symmetric positive semi-definite matrix

• 𝑘𝑎(𝒙𝑎, 𝒙𝑎′) + 𝑘𝑏(𝒙𝑏, 𝒙𝑏′) and 𝑘𝑎(𝒙𝑎, 𝒙𝑎′) 𝑘𝑏(𝒙𝑏, 𝒙𝑏′), where 𝒙𝑎 and 𝒙𝑏 are variables (not necessarily disjoint) with 𝒙 = (𝒙𝑎, 𝒙𝑏), and 𝑘𝑎 and 𝑘𝑏 are valid kernel functions over their respective spaces

[Bishop]

Page 58: Support Vector Machine (SVM) and Kernel Methods

Extending linear methods to kernelized ones

59

Kernelized version of linear methods

Linear methods are popular: unique optimal solutions, faster learning algorithms, and better analysis

However, we often require nonlinear methods in real-world problems, so we can use kernel-based versions of these linear algorithms

If a problem can be formulated such that feature vectors appear only in inner products, we can extend it to a kernelized one

This is the case when we only need the dot products of all pairs of data (i.e., the geometry of the data in the feature space)

By replacing inner products with kernels in linear algorithms, we obtain very flexible methods

We can operate in the mapped space without ever computing the coordinates of the data in that space

Page 59: Support Vector Machine (SVM) and Kernel Methods

Which information can be obtained from kernel?

60

Example: we know all pairwise distances:

𝑑(𝝓(𝒙), 𝝓(𝒛))² = ‖𝝓(𝒙) − 𝝓(𝒛)‖² = 𝑘(𝒙, 𝒙) + 𝑘(𝒛, 𝒛) − 2𝑘(𝒙, 𝒛)

Therefore, we also know the distance of points from the center of mass of a set

Many dimensionality reduction, clustering, and classification methods can be described in terms of pairwise distances.

This allows us to introduce kernelized versions of them
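A short NumPy sketch of this identity: all pairwise feature-space distances computed from a Gram matrix alone (the Gaussian Gram matrix here is just an illustrative stand-in for any valid kernel).

```python
import numpy as np

def feature_space_sq_distances(K):
    """d(phi(x_i), phi(x_j))^2 = K_ii + K_jj - 2 K_ij, for all pairs i, j."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2.0 * K

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)                          # Gaussian kernel Gram matrix

D2 = feature_space_sq_distances(K)             # squared distances in feature space
print(D2.shape, np.allclose(np.diag(D2), 0.0)) # (20, 20) True
```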

Page 60: Support Vector Machine (SVM) and Kernel Methods

Kernels for structured data

61

Kernels can also be defined on general types of data

Kernel functions do not need to be defined over vectors; we just need a symmetric positive semi-definite kernel matrix

Thus, many algorithms can work with general (non-vectorial) data

Kernels exist to embed strings, trees, graphs, …

This may be even more important than nonlinearity: we can use kernel-based versions of the classical learning algorithms for recognition of structured data

Page 61: Support Vector Machine (SVM) and Kernel Methods

Kernel function for objects

62

Sets: an example of a kernel function for sets:

𝑘(𝐴, 𝐵) = 2^|𝐴∩𝐵|

Strings: the inner product of the feature vectors for two strings can be defined, e.g., as a sum over all common subsequences, weighted according to their frequency of occurrence and lengths

A E G A T E A G G

E G T E A G A E G A T G
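The set kernel above is simple to implement directly; a tiny sketch, with the resulting Gram matrix usable through any precomputed-kernel interface (the example sets are made up).

```python
import numpy as np

def set_kernel(A, B):
    """k(A, B) = 2^|A intersect B|: the number of subsets common to A and B."""
    return float(2 ** len(A & B))

sets = [{"a", "b", "c"}, {"b", "c"}, {"d"}]
K = np.array([[set_kernel(A, B) for B in sets] for A in sets])
print(K)   # Gram matrix over the three sets
```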

Page 62: Support Vector Machine (SVM) and Kernel Methods

Kernel trick advantages: summary

63

Operating in the mapped space without ever computing the coordinates of the data in that space

Besides vectors, we can introduce kernel functions for structured data (graphs, strings, etc.)

Much of the geometry of the data in the embedding space is contained in all pairwise dot products

In many cases, the inner product in the embedding space can be computed efficiently.

Page 63: Support Vector Machine (SVM) and Kernel Methods

SVM: Summary

64

Hard margin: maximizing the margin

Soft margin: handling noisy data and overlapping classes

Slack variables in the problem

Dual problems of hard-margin and soft-margin SVM

Lead us easily to the nonlinear SVM method by kernel substitution

Also express the classifier decision in terms of support vectors

Kernel SVMs

Learn a linear decision boundary in a high-dimensional space without explicitly working on the mapped data