CE-717: Machine Learning, Sharif University of Technology
Fall 2015
Soleymani
Support Vector Machine (SVM) and Kernel Methods
Outline
Margin concept
Hard-Margin SVM
Soft-Margin SVM
Dual Problems of Hard-Margin SVM and Soft-Margin SVM
Nonlinear SVM
Kernel trick
Kernel methods
Margin
Which line is better to select as the boundary to provide more generalization capability?
The margin for a hyperplane that separates samples of two linearly separable classes is:
The smallest distance between the decision boundary and any of the training samples
A larger margin provides better generalization to unseen data.
Maximum margin
SVM finds the solution with maximum margin
Solution: a hyperplane that is farthest from all training samples
The hyperplane with the largest margin has equal distances to the nearest sample of both classes
(Figure: two separating hyperplanes, the second with a larger margin.)
Hard-margin SVM: Optimization problem
$$\max_{M,\mathbf{w},w_0} \; \frac{2M}{\|\mathbf{w}\|}$$
$$\text{s.t.} \quad \mathbf{w}^T\mathbf{x}^{(i)} + w_0 \ge M \quad \forall \mathbf{x}^{(i)} \in C_1$$
$$\qquad\;\; \mathbf{w}^T\mathbf{x}^{(i)} + w_0 \le -M \quad \forall \mathbf{x}^{(i)} \in C_2$$
(Figure: decision boundary $\mathbf{w}^T\mathbf{x} + w_0 = 0$ with margin hyperplanes $\mathbf{w}^T\mathbf{x} + w_0 = M$ (samples with $y^{(i)} = 1$) and $\mathbf{w}^T\mathbf{x} + w_0 = -M$ (samples with $y^{(i)} = -1$); margin width $2M/\|\mathbf{w}\|$.)
Hard-margin SVM: Optimization problem
$$\max_{M,\mathbf{w},w_0} \; \frac{2M}{\|\mathbf{w}\|} \qquad \text{s.t.} \quad y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right) \ge M, \quad i = 1,\dots,N$$
where
$$M = \min_{i=1,\dots,N} \; y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right)$$
(Figure: same geometry as on the previous slide.)
Hard-margin SVM: Optimization problem
We can set $\mathbf{w}' = \dfrac{\mathbf{w}}{M}$ and $w_0' = \dfrac{w_0}{M}$:
$$\max_{\mathbf{w}',w_0'} \; \frac{2}{\|\mathbf{w}'\|} \qquad \text{s.t.} \quad y^{(i)}\left(\mathbf{w}'^T\mathbf{x}^{(i)} + w_0'\right) \ge 1, \quad i = 1,\dots,N$$
The places of the boundary $\mathbf{w}'^T\mathbf{x} + w_0' = 0$ and of the margin hyperplanes $\mathbf{w}'^T\mathbf{x} + w_0' = \pm 1$ do not change; the margin is now $2/\|\mathbf{w}'\|$.
Hard-margin SVM: Optimization problem
We can equivalently optimize:
$$\min_{\mathbf{w},w_0} \; \frac{1}{2}\|\mathbf{w}\|^2 \qquad \text{s.t.} \quad y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right) \ge 1, \quad i = 1,\dots,N$$
It is a convex Quadratic Programming (QP) problem:
There are computationally efficient packages to solve it.
It has a global minimum (if a solution exists).
When the training samples are not linearly separable, it has no solution.
How can we extend it to find a solution even when the classes are not linearly separable?
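As a small illustration (not the course's code), the sketch below approximates the hard-margin QP by running scikit-learn's soft-margin solver with a very large $C$; the toy data are hypothetical.

```python
# A minimal sketch, assuming scikit-learn is available: approximating the
# hard-margin SVM by a soft-margin solver with a very large C.
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data.
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.0],
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e10)   # C -> infinity ~ hard margin
clf.fit(X, y)

w, w0 = clf.coef_[0], clf.intercept_[0]
print("w =", w, "w0 =", w0)
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
```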
Beyond linear separability
How can we extend the hard-margin SVM to allow classification errors?
Overlapping classes that can be approximately separated by a linear boundary
Noise in otherwise linearly separable classes
(Figure: two such cases, one nearly separable with noise and one with overlapping classes.)
Beyond linear separability: Soft-margin SVM
Minimizing the number of misclassified points?!
This is NP-complete.
Soft margin:
Maximize the margin while trying to minimize the distance between misclassified points and their correct margin hyperplane.
Soft-margin SVM
SVM with slack variables: allows samples to fall within the margin, but penalizes them:
$$\min_{\mathbf{w},w_0,\{\xi_i\}_{i=1}^N} \; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^N \xi_i$$
$$\text{s.t.} \quad y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1,\dots,N$$
$\xi_i$: slack variables
$\xi_i > 1$: $\mathbf{x}^{(i)}$ is misclassified
$0 < \xi_i < 1$: $\mathbf{x}^{(i)}$ is correctly classified but lies inside the margin
Soft-margin SVM
A linear penalty (hinge loss) is paid for a sample if it is misclassified or lies inside the margin.
The objective tries to keep the $\xi_i$ small while maximizing the margin.
It always finds a solution (as opposed to the hard-margin SVM).
It is more robust to outliers.
The soft-margin problem is still a convex QP.
Soft-margin SVM: Parameter 𝐶
$C$ is a tradeoff parameter:
A small $C$ allows margin constraints to be easily ignored → large margin.
A large $C$ makes the constraints hard to ignore → narrow margin.
$C \to \infty$ enforces all constraints: hard margin.
$C$ can be determined using a technique such as cross-validation.
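As a small sketch of the cross-validation suggestion above (assuming scikit-learn and training arrays `X`, `y` with enough samples per class), $C$ can be chosen by grid search; the candidate values are arbitrary:

```python
# A minimal sketch: choosing the trade-off parameter C by 3-fold cross-validation.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}    # illustrative candidates
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=3)
search.fit(X, y)                               # X, y assumed to be defined
print("best C:", search.best_params_["C"])
```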
Soft-margin SVM: Cost function
$$\min_{\mathbf{w},w_0,\{\xi_i\}_{i=1}^N} \; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^N \xi_i
\qquad \text{s.t.} \quad y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1,\dots,N$$
It is equivalent to the unconstrained optimization problem:
$$\min_{\mathbf{w},w_0} \; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^N \max\!\left(0,\, 1 - y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right)\right)$$
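The unconstrained form above can be minimized directly. The sketch below is an illustration, not the lecture's algorithm: plain subgradient descent on that objective for given arrays `X` (N x d) and `y` in {-1, +1}:

```python
# A minimal sketch: subgradient descent on 1/2||w||^2 + C * sum_i max(0, 1 - y_i(w^T x_i + w0)).
import numpy as np

def svm_subgradient_descent(X, y, C=1.0, lr=1e-3, epochs=1000):
    N, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + w0)
        active = margins < 1                                          # samples with non-zero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)    # subgradient w.r.t. w
        grad_w0 = -C * y[active].sum()                                # subgradient w.r.t. w0
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0
```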
SVM loss function
Hinge loss vs. 0-1 loss
(Figure: the 0-1 loss and the hinge loss $\max\!\left(0,\, 1 - y(\mathbf{w}^T\mathbf{x} + w_0)\right)$ plotted against $\mathbf{w}^T\mathbf{x} + w_0$ for $y = 1$.)
Dual formulation of the SVM
We are going to introduce the dual SVM problem, which is equivalent to the original primal problem. The dual problem:
is often easier to solve
enables us to exploit the kernel trick
gives us further insights into the optimal hyperplane
Optimization: Lagrangian multipliers
$$p^* = \min_{\mathbf{x}} f(\mathbf{x}) \qquad \text{s.t.} \quad g_i(\mathbf{x}) \le 0, \; i = 1,\dots,m; \qquad h_i(\mathbf{x}) = 0, \; i = 1,\dots,p$$
Lagrangian (with Lagrange multipliers $\boldsymbol{\alpha} = [\alpha_1,\dots,\alpha_m]$ and $\boldsymbol{\lambda} = [\lambda_1,\dots,\lambda_p]$):
$$\mathcal{L}(\mathbf{x},\boldsymbol{\alpha},\boldsymbol{\lambda}) = f(\mathbf{x}) + \sum_{i=1}^{m} \alpha_i g_i(\mathbf{x}) + \sum_{i=1}^{p} \lambda_i h_i(\mathbf{x})$$
$$\max_{\alpha_i \ge 0,\, \lambda_i} \mathcal{L}(\mathbf{x},\boldsymbol{\alpha},\boldsymbol{\lambda}) =
\begin{cases} \infty & \text{if any } g_i(\mathbf{x}) > 0 \text{ or any } h_i(\mathbf{x}) \ne 0 \\ f(\mathbf{x}) & \text{otherwise} \end{cases}$$
$$p^* = \min_{\mathbf{x}} \max_{\alpha_i \ge 0,\, \lambda_i} \mathcal{L}(\mathbf{x},\boldsymbol{\alpha},\boldsymbol{\lambda})$$
Optimization: Dual problem
In general, we have:
$$\max_{x} \min_{y} f(x,y) \le \min_{y} \max_{x} f(x,y)$$
Primal problem: $p^* = \min_{\mathbf{x}} \max_{\alpha_i \ge 0,\, \lambda_i} \mathcal{L}(\mathbf{x},\boldsymbol{\alpha},\boldsymbol{\lambda})$
Dual problem: $d^* = \max_{\alpha_i \ge 0,\, \lambda_i} \min_{\mathbf{x}} \mathcal{L}(\mathbf{x},\boldsymbol{\alpha},\boldsymbol{\lambda})$
Obtained by swapping the order of min and max
$$d^* \le p^*$$
When the original problem is convex (𝑓 and 𝑔 are convex
functions and ℎ is affine), we have strong duality 𝑑∗ = 𝑝∗
Hard-margin SVM: Dual problem
$$\min_{\mathbf{w},w_0} \; \frac{1}{2}\|\mathbf{w}\|^2 \qquad \text{s.t.} \quad y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right) \ge 1, \quad i = 1,\dots,N$$
By incorporating the constraints through Lagrange multipliers, we will have:
$$\min_{\mathbf{w},w_0} \max_{\{\alpha_i \ge 0\}} \; \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{N} \alpha_i\left(1 - y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right)\right)$$
Dual problem (changing the order of min and max in the above problem):
$$\max_{\{\alpha_i \ge 0\}} \min_{\mathbf{w},w_0} \; \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{N} \alpha_i\left(1 - y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right)\right)$$
Hard-margin SVM: Dual problem
$$\max_{\{\alpha_i \ge 0\}} \min_{\mathbf{w},w_0} \mathcal{L}(\mathbf{w},w_0,\boldsymbol{\alpha}),
\qquad \mathcal{L}(\mathbf{w},w_0,\boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{N} \alpha_i\left(1 - y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right)\right)$$
$$\nabla_{\mathbf{w}}\mathcal{L}(\mathbf{w},w_0,\boldsymbol{\alpha}) = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{N} \alpha_i y^{(i)}\mathbf{x}^{(i)}$$
$$\frac{\partial \mathcal{L}(\mathbf{w},w_0,\boldsymbol{\alpha})}{\partial w_0} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y^{(i)} = 0$$
$w_0$ does not appear in the inner minimization; instead, a "global" constraint on $\boldsymbol{\alpha}$ is created.
Hard-margin SVM: Dual problem
$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \, \mathbf{x}^{(i)T}\mathbf{x}^{(j)}$$
$$\text{subject to} \quad \sum_{i=1}^{N} \alpha_i y^{(i)} = 0, \qquad \alpha_i \ge 0, \; i = 1,\dots,N$$
It is a convex QP.
By solving the above problem we first find $\boldsymbol{\alpha}$ and then $\mathbf{w} = \sum_{i=1}^{N} \alpha_i y^{(i)}\mathbf{x}^{(i)}$.
$w_0$ can be set by making the margin equidistant to the classes (we discuss this further after defining support vectors).
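A hedged sketch of solving this dual QP numerically with the cvxopt package (an assumption; any QP solver would do). cvxopt minimizes $\frac{1}{2}\boldsymbol{\alpha}^T P \boldsymbol{\alpha} + q^T\boldsymbol{\alpha}$ subject to $G\boldsymbol{\alpha} \le h$ and $A\boldsymbol{\alpha} = b$, so the maximization above is negated:

```python
# A minimal sketch, assuming the cvxopt package: the hard-margin dual QP.
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_dual(X, y):
    N = X.shape[0]
    K = X @ X.T                                     # dot products x_i^T x_j
    P = matrix((np.outer(y, y) * K).astype(float))  # P_ij = y_i y_j x_i^T x_j
    q = matrix(-np.ones(N))                         # maximize sum(alpha) -> minimize -sum(alpha)
    G = matrix(-np.eye(N))                          # -alpha_i <= 0, i.e. alpha_i >= 0
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1).astype(float))      # sum_i alpha_i y_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    alpha = np.ravel(sol["x"])
    w = ((alpha * y)[:, None] * X).sum(axis=0)      # w = sum_i alpha_i y_i x_i
    return alpha, w
```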
Karush-Kuhn-Tucker (KKT) conditions
Necessary conditions for the solution $[\mathbf{w}^*, w_0^*, \boldsymbol{\alpha}^*]$:
$$\nabla_{\mathbf{w}}\mathcal{L}(\mathbf{w},w_0,\boldsymbol{\alpha})\big|_{\mathbf{w}^*,w_0^*,\boldsymbol{\alpha}^*} = 0, \qquad
\frac{\partial \mathcal{L}(\mathbf{w},w_0,\boldsymbol{\alpha})}{\partial w_0}\bigg|_{\mathbf{w}^*,w_0^*,\boldsymbol{\alpha}^*} = 0$$
$$\alpha_i^* \ge 0, \qquad y^{(i)}\left(\mathbf{w}^{*T}\mathbf{x}^{(i)} + w_0^*\right) \ge 1, \qquad
\alpha_i^*\left(1 - y^{(i)}\left(\mathbf{w}^{*T}\mathbf{x}^{(i)} + w_0^*\right)\right) = 0, \quad i = 1,\dots,N$$
In general, the optimal $(\mathbf{x}^*, \boldsymbol{\alpha}^*)$ satisfies the KKT conditions:
$$\nabla_{\mathbf{x}}\mathcal{L}(\mathbf{x},\boldsymbol{\alpha})\big|_{\mathbf{x}^*,\boldsymbol{\alpha}^*} = 0, \qquad
\alpha_i^* \ge 0, \qquad g_i(\mathbf{x}^*) \le 0, \qquad \alpha_i^* g_i(\mathbf{x}^*) = 0, \quad i = 1,\dots,m$$
Hard-margin SVM: Support vectors
Inactive constraint, $y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right) > 1$:
$\Rightarrow \alpha_i = 0$, and thus $\mathbf{x}^{(i)}$ is not a support vector.
Active constraint, $y^{(i)}\left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\right) = 1$:
$\Rightarrow \alpha_i$ can be greater than 0, and thus $\mathbf{x}^{(i)}$ can be a support vector.
(Figure: samples lying on the margin hyperplanes have $\alpha > 0$; the margin is $1/\|\mathbf{w}\|$ on each side.)
A sample with $\alpha_i = 0$ can also lie on one of the margin hyperplanes.
Hard-margin SVM: Support vectors
Support vectors (SVs) $= \{\mathbf{x}^{(i)} \mid \alpha_i > 0\}$
The direction of the hyperplane can be found based only on the support vectors:
$$\mathbf{w} = \sum_{\alpha_i > 0} \alpha_i y^{(i)}\mathbf{x}^{(i)}$$
Hard-margin SVM: Dual problem
Classifying new samples using only SVs
Classification of a new sample $\mathbf{x}$:
$$y = \mathrm{sign}\left(w_0 + \mathbf{w}^T\mathbf{x}\right) = \mathrm{sign}\!\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)}\,\mathbf{x}^{(i)T}\mathbf{x}\right)$$
The classifier is based on an expansion in terms of dot products of $\mathbf{x}$ with the support vectors.
Support vectors are sufficient to predict the labels of new samples.
How to find 𝑤0
At least one of the $\alpha$'s is strictly positive (say $\alpha_s > 0$, for a support vector $\mathbf{x}_s$).
The KKT complementary slackness condition tells us that the constraint corresponding to this non-zero $\alpha_s$ holds with equality:
$$y_s\left(w_0 + \mathbf{w}^T\mathbf{x}_s\right) = 1$$
$w_0$ is found such that this constraint is satisfied, using any one of the support vectors:
$$w_0 = y_s - \mathbf{w}^T\mathbf{x}_s = y_s - \sum_{\alpha_n > 0} \alpha_n y^{(n)}\,\mathbf{x}^{(n)T}\mathbf{x}_s$$
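Continuing the hypothetical cvxopt sketch from above, $w_0$ can be recovered from any support vector and new samples classified through the SV expansion; the threshold used to decide $\alpha_i > 0$ is an arbitrary numerical tolerance.

```python
# A minimal sketch, reusing alpha, X, y, and numpy from the dual-QP sketch above.
def recover_w0(alpha, X, y, tol=1e-6):
    sv = alpha > tol                       # support vectors: alpha_i > 0
    s = np.argmax(sv)                      # index of one support vector
    # y_s (w^T x_s + w0) = 1  =>  w0 = y_s - sum_n alpha_n y_n x_n^T x_s
    return y[s] - ((alpha[sv] * y[sv]) * (X[sv] @ X[s])).sum()

def predict(x_new, alpha, X, y, w0, tol=1e-6):
    sv = alpha > tol
    return np.sign(w0 + ((alpha[sv] * y[sv]) * (X[sv] @ x_new)).sum())
```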
Hard-margin SVM: Dual problem
$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \, \mathbf{x}^{(i)T}\mathbf{x}^{(j)}
\qquad \text{subject to} \quad \sum_{i=1}^{N} \alpha_i y^{(i)} = 0, \quad \alpha_i \ge 0, \; i = 1,\dots,N$$
Only the dot product of each pair of training data appears
in the optimization problem
This is an important property that is helpful to extend to non-linear
SVM (the cost function does not depend explicitly on the dimensionality
of the feature space).
Soft-margin SVM: Dual problem
$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \, \mathbf{x}^{(i)T}\mathbf{x}^{(j)}
\qquad \text{subject to} \quad \sum_{i=1}^{N} \alpha_i y^{(i)} = 0, \quad 0 \le \alpha_i \le C, \; i = 1,\dots,N$$
By solving the above quadratic problem we first find $\boldsymbol{\alpha}$, then $\mathbf{w} = \sum_{i=1}^{N} \alpha_i y^{(i)}\mathbf{x}^{(i)}$, and $w_0$ is computed from the SVs.
For a test sample $\mathbf{x}$ (as before):
$$y = \mathrm{sign}\left(w_0 + \mathbf{w}^T\mathbf{x}\right) = \mathrm{sign}\!\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)}\,\mathbf{x}^{(i)T}\mathbf{x}\right)$$
Soft-margin SVM: Support vectors
Support vectors: samples with $\alpha > 0$
If $0 < \alpha < C$: SVs on the margin, $\xi = 0$.
If $\alpha = C$: SVs on or over the margin.
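A small sketch of inspecting these two kinds of support vectors with scikit-learn (assuming some training arrays `X`, `y`); note that sklearn stores `dual_coef_` $= \alpha_i y^{(i)}$ for the support vectors only, so $\alpha_i$ is its absolute value.

```python
# A minimal sketch: margin SVs (0 < alpha < C) vs. bound SVs (alpha = C).
import numpy as np
from sklearn.svm import SVC

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)          # X, y assumed to be defined

alpha = np.abs(clf.dual_coef_[0])                  # alpha_i of each support vector
on_margin = alpha < C - 1e-8                       # 0 < alpha < C: exactly on the margin
print("SVs on the margin:      ", clf.support_[on_margin])
print("SVs on/over the margin: ", clf.support_[~on_margin])   # alpha = C
```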
Primal vs. dual hard-margin SVM problem
Primal problem of hard-margin SVM:
$N$ inequality constraints
$d + 1$ variables
Dual problem of hard-margin SVM:
one equality constraint
$N$ positivity constraints
$N$ variables (Lagrange multipliers)
objective function more complicated
The dual problem is helpful and instrumental for using the kernel trick.
Primal vs. dual soft-margin SVM problem
Primal problem of soft-margin SVM:
$N$ inequality constraints
$N$ positivity constraints
$N + d + 1$ variables
Dual problem of soft-margin SVM:
one equality constraint
$2N$ inequality (box) constraints $0 \le \alpha_i \le C$
$N$ variables (Lagrange multipliers)
objective function more complicated
The dual problem is helpful and instrumental for using the kernel trick.
Not linearly separable data
Noisy data or overlapping classes (we already discussed this case: soft margin)
Near linearly separable
Non-linear decision surface
Transform to a new feature space
(Figure: a nearly linearly separable case vs. a case requiring a non-linear decision surface.)
Nonlinear SVM
Assume a transformation $\boldsymbol{\phi}: \mathbb{R}^d \to \mathbb{R}^m$ on the feature space: $\mathbf{x} \to \boldsymbol{\phi}(\mathbf{x})$
Find a hyperplane in the transformed feature space: $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) + w_0 = 0$
$\{\phi_1(\mathbf{x}),\dots,\phi_m(\mathbf{x})\}$: set of basis functions (or features), with $\phi_i(\mathbf{x}): \mathbb{R}^d \to \mathbb{R}$
$\boldsymbol{\phi}(\mathbf{x}) = [\phi_1(\mathbf{x}),\dots,\phi_m(\mathbf{x})]$
Soft-margin SVM in a transformed space:
Primal problem
Primal problem:
$$\min_{\mathbf{w},w_0} \; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N} \xi_i
\qquad \text{s.t.} \quad y^{(i)}\left(\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}^{(i)}) + w_0\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1,\dots,N$$
$\mathbf{w} \in \mathbb{R}^m$: the weights that must be found
If $m \gg d$ (a very high dimensional feature space), then there are many more parameters to learn.
Soft-margin SVM in a transformed space:
Dual problem
Optimization problem:
$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \, \boldsymbol{\phi}(\mathbf{x}^{(i)})^T\boldsymbol{\phi}(\mathbf{x}^{(j)})
\qquad \text{subject to} \quad \sum_{i=1}^{N} \alpha_i y^{(i)} = 0, \quad 0 \le \alpha_i \le C, \; i = 1,\dots,N$$
If we have the inner products $\boldsymbol{\phi}(\mathbf{x}^{(i)})^T\boldsymbol{\phi}(\mathbf{x}^{(j)})$, only $\boldsymbol{\alpha} = [\alpha_1,\dots,\alpha_N]$ needs to be learned.
It is not necessary to learn the $m$ parameters, as opposed to the primal problem.
Kernelized soft-margin SVM
Optimization problem, with $k(\mathbf{x},\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')$:
$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \, k\!\left(\mathbf{x}^{(i)},\mathbf{x}^{(j)}\right)
\qquad \text{subject to} \quad \sum_{i=1}^{N} \alpha_i y^{(i)} = 0, \quad 0 \le \alpha_i \le C, \; i = 1,\dots,N$$
Classifying a new sample $\mathbf{x}$:
$$y = \mathrm{sign}\left(w_0 + \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x})\right) = \mathrm{sign}\!\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)}\, k\!\left(\mathbf{x}^{(i)},\mathbf{x}\right)\right)$$
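As a small, hedged illustration of using only kernel values in practice (assuming scikit-learn and hypothetical arrays `X_train`, `y_train`, `X_test`), the sketch below trains an SVM on a precomputed Gram matrix, so the learner never sees $\boldsymbol{\phi}(\mathbf{x})$:

```python
# A minimal sketch: kernelized soft-margin SVM via a precomputed Gram matrix.
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(A, B, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows of A and B
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

K_train = rbf_kernel(X_train, X_train)        # N x N matrix of k(x_i, x_j)
clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y_train)

K_test = rbf_kernel(X_test, X_train)          # kernels between test and training points
y_pred = clf.predict(K_test)
```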
Kernel trick
Kernel: the dot product in the mapped feature space $\mathcal{F}$:
$$k(\mathbf{x},\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}'), \qquad \boldsymbol{\phi}(\mathbf{x}) = \left[\phi_1(\mathbf{x}),\dots,\phi_m(\mathbf{x})\right]^T$$
$k(\mathbf{x},\mathbf{x}')$ shows the similarity of $\mathbf{x}$ and $\mathbf{x}'$; the distance in $\mathcal{F}$ can be found accordingly.
Kernel trick → extension of many well-known algorithms to kernel-based ones, by substituting the dot product with the kernel function.
Kernel trick: Idea
Idea: when the input vectors appear only in the form of dot products, we can use other choices of kernel function instead of the inner product.
Solving the problem without explicitly mapping the data
Explicit mapping is expensive if $\boldsymbol{\phi}(\mathbf{x})$ is very high dimensional
Instead of using a mapping $\boldsymbol{\phi}: \mathcal{X} \to \mathcal{F}$ to represent $\mathbf{x} \in \mathcal{X}$ by $\boldsymbol{\phi}(\mathbf{x}) \in \mathcal{F}$, a similarity function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is used.
The data set can then be represented by an $N \times N$ matrix $\mathbf{K}$.
We specify only an inner product function between points in the transformed space (not their coordinates).
In many cases, the inner product in the embedding space can be computed efficiently.
Why kernel?
Kernel functions $k$ can often be computed efficiently, with a cost proportional to $d$ instead of $m$.
Example: consider the second-order polynomial transform
$$\boldsymbol{\phi}(\mathbf{x}) = \left[1,\, x_1,\dots,x_d,\, x_1 x_1,\, x_1 x_2,\dots,\, x_d x_d\right]^T, \qquad m = 1 + d + d^2$$
$$\boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}') = 1 + \sum_{i=1}^{d} x_i x_i' + \sum_{i=1}^{d}\sum_{j=1}^{d} x_i x_j x_i' x_j'
= 1 + \mathbf{x}^T\mathbf{x}' + \left(\mathbf{x}^T\mathbf{x}'\right)^2$$
Computing $\boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')$ directly costs $O(m)$, while evaluating $1 + \mathbf{x}^T\mathbf{x}' + \left(\mathbf{x}^T\mathbf{x}'\right)^2$ costs only $O(d)$.
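A short numerical check of the identity above (an illustration, not part of the lecture); the dimension and random inputs are arbitrary:

```python
# A minimal sketch: phi(x)^T phi(x') computed in O(m) equals 1 + x^T x' + (x^T x')^2 in O(d).
import numpy as np

def phi(x):
    # [1, x_1, ..., x_d, x_1 x_1, x_1 x_2, ..., x_d x_d]: m = 1 + d + d^2 features
    return np.concatenate(([1.0], x, np.outer(x, x).ravel()))

rng = np.random.default_rng(0)
x, z = rng.standard_normal(5), rng.standard_normal(5)
print(np.isclose(phi(x) @ phi(z), 1 + x @ z + (x @ z) ** 2))   # True
```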
Polynomial kernel: Efficient kernel function
An efficient kernel function relies on a carefully constructed transform that allows fast computation.
Example: for polynomial kernels, certain combinations of coefficients still make $k$ easy to compute ($\gamma > 0$ and $c > 0$); the kernel $k(\mathbf{x},\mathbf{x}') = \left(c + \gamma\,\mathbf{x}^T\mathbf{x}'\right)^2$ corresponds to
$$\boldsymbol{\phi}(\mathbf{x}) = \left[c,\; \sqrt{2c\gamma}\,x_1,\dots,\sqrt{2c\gamma}\,x_d,\; \gamma\,x_1 x_1,\; \gamma\,x_1 x_2,\dots,\, \gamma\,x_d x_d\right]^T$$
Polynomial kernel: Degree two
In a $d$-dimensional feature space, $\mathbf{x} = \left[x_1,\dots,x_d\right]^T$, we instead use $k(\mathbf{x},\mathbf{x}') = \left(\mathbf{x}^T\mathbf{x}' + 1\right)^2$, which corresponds to:
$$\boldsymbol{\phi}(\mathbf{x}) = \left[1,\, \sqrt{2}x_1,\dots,\sqrt{2}x_d,\, x_1^2,\dots,x_d^2,\, \sqrt{2}x_1x_2,\dots,\sqrt{2}x_1x_d,\, \sqrt{2}x_2x_3,\dots,\sqrt{2}x_{d-1}x_d\right]^T$$
or, equivalently,
$$\boldsymbol{\phi}(\mathbf{x}) = \left[1,\, \sqrt{2}x_1,\dots,\sqrt{2}x_d,\, x_1^2,\dots,x_d^2,\, x_1x_2,\dots,x_1x_d,\, x_2x_1,\dots,x_{d-1}x_d\right]^T$$
Polynomial kernel
In general, $k(\mathbf{x},\mathbf{z}) = \left(\mathbf{x}^T\mathbf{z}\right)^M$ contains all monomials of order $M$.
This can similarly be generalized to include all terms up to degree $M$: $k(\mathbf{x},\mathbf{z}) = \left(\mathbf{x}^T\mathbf{z} + c\right)^M$ with $c > 0$.
Example: SVM boundary for a polynomial kernel:
$$w_0 + \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) = 0
\;\Rightarrow\; w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)}\boldsymbol{\phi}(\mathbf{x}^{(i)})^T\boldsymbol{\phi}(\mathbf{x}) = 0
\;\Rightarrow\; w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} k\!\left(\mathbf{x}^{(i)},\mathbf{x}\right) = 0
\;\Rightarrow\; w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \left(\mathbf{x}^{(i)T}\mathbf{x} + c\right)^M = 0$$
The boundary is a polynomial of order $M$.
Some common kernel functions
The kernel function is a mathematical and computational shortcut that allows us to combine the transform and the inner product into a single, more efficient function.
Linear: $k(\mathbf{x},\mathbf{x}') = \mathbf{x}^T\mathbf{x}'$
Polynomial: $k(\mathbf{x},\mathbf{x}') = \left(\mathbf{x}^T\mathbf{x}' + c\right)^M$, with $c \ge 0$
Gaussian: $k(\mathbf{x},\mathbf{x}') = \exp\!\left(-\dfrac{\|\mathbf{x}-\mathbf{x}'\|^2}{2\sigma^2}\right)$
Sigmoid: $k(\mathbf{x},\mathbf{x}') = \tanh\!\left(a\,\mathbf{x}^T\mathbf{x}' + b\right)$
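The kernels listed above can be written as small functions of two vectors; the sketch below (with illustrative defaults for $c$, $M$, $\sigma$, $a$, $b$) is one such plain numpy version:

```python
# A minimal sketch: the common kernel functions as plain numpy code.
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, c=1.0, M=2):
    return (x @ z + c) ** M

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, z, a=1.0, b=0.0):
    return np.tanh(a * (x @ z) + b)
```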
Gaussian kernel
Gaussian: $k(\mathbf{x},\mathbf{x}') = \exp\!\left(-\dfrac{\|\mathbf{x}-\mathbf{x}'\|^2}{2\sigma^2}\right)$
Infinite-dimensional feature space
Example: SVM boundary for a Gaussian kernel
Considers a Gaussian function around each data point:
$$w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)}\exp\!\left(-\frac{\|\mathbf{x}-\mathbf{x}^{(i)}\|^2}{2\sigma^2}\right) = 0$$
SVM + Gaussian kernel can classify any arbitrary training set:
Training error is zero when $\sigma \to 0$; all samples become support vectors (likely overfitting).
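A small sketch of this overfitting effect (assuming scikit-learn and some arrays `X`, `y`); note that sklearn parameterizes the Gaussian kernel by $\gamma = 1/(2\sigma^2)$, so shrinking $\sigma$ means growing `gamma`:

```python
# A minimal sketch: smaller sigma -> training accuracy approaches 1 and almost
# every sample becomes a support vector.
from sklearn.svm import SVC

for sigma in [2.0, 0.5, 0.05]:
    gamma = 1.0 / (2 * sigma ** 2)
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)   # X, y assumed to be defined
    print(f"sigma={sigma}: train acc={clf.score(X, y):.2f}, "
          f"#SVs={len(clf.support_)} of {len(X)}")
```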
SVM: Linear and Gaussian kernels example
(Figure: linear kernel with $C = 0.1$ vs. Gaussian kernel with $C = 1$, $\sigma = 0.25$.) [Zisserman's slides]
Why do some SVs appear to be far from the boundary?
Hard-margin example
For a narrow Gaussian (small $\sigma$), even the protection of a large margin cannot suppress overfitting.
Y. S. Abu-Mostafa et al., "Learning From Data", 2012
SVM Gaussian kernel: Example
Source: Zisserman's slides
$$f(\mathbf{x}) = w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)}\exp\!\left(-\frac{\|\mathbf{x}-\mathbf{x}^{(i)}\|^2}{2\sigma^2}\right)$$
SVM Gaussian kernel: Example
(A sequence of figure-only example slides follows here. Source: Zisserman's slides.)
Constructing kernels
Construct kernel functions directly:
Ensure that it is a valid kernel, i.e. that it corresponds to an inner product in some feature space.
Example: $k(\mathbf{x},\mathbf{x}') = \left(\mathbf{x}^T\mathbf{x}'\right)^2$
Corresponding mapping for $\mathbf{x} = [x_1, x_2]^T$: $\boldsymbol{\phi}(\mathbf{x}) = \left[x_1^2,\, \sqrt{2}\,x_1x_2,\, x_2^2\right]^T$
We need a way to test whether a kernel is valid without having to construct $\boldsymbol{\phi}(\mathbf{x})$.
Valid kernel: Necessary & sufficient conditions
Gram matrix $\mathbf{K}_{N\times N}$ with $K_{ij} = k(\mathbf{x}^{(i)},\mathbf{x}^{(j)})$:
Restricting the kernel function to a set of points $\{\mathbf{x}^{(1)},\mathbf{x}^{(2)},\dots,\mathbf{x}^{(N)}\}$,
$$\mathbf{K} = \begin{bmatrix} k(\mathbf{x}^{(1)},\mathbf{x}^{(1)}) & \cdots & k(\mathbf{x}^{(1)},\mathbf{x}^{(N)}) \\ \vdots & \ddots & \vdots \\ k(\mathbf{x}^{(N)},\mathbf{x}^{(1)}) & \cdots & k(\mathbf{x}^{(N)},\mathbf{x}^{(N)}) \end{bmatrix}$$
Mercer theorem: the kernel matrix is symmetric positive semi-definite (for any choice of data points).
Any symmetric positive definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space. [Shawe-Taylor & Cristianini 2004]
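As a small, hedged check in the spirit of this condition (illustrative only; verifying PSD on one sample of points is necessary but not sufficient for validity), the Gram matrix of a candidate kernel can be tested numerically:

```python
# A minimal sketch: build the Gram matrix on some points and check symmetric PSD.
import numpy as np

def gram_matrix(kernel, X):
    N = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

def is_symmetric_psd(K, tol=1e-10):
    return np.allclose(K, K.T) and np.all(np.linalg.eigvalsh(K) >= -tol)

X = np.random.default_rng(0).standard_normal((20, 3))        # arbitrary test points
K = gram_matrix(lambda x, z: (x @ z + 1.0) ** 2, X)           # degree-2 polynomial kernel
print(is_symmetric_psd(K))                                    # True (up to tolerance)
```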
Construct Valid Kernels
Given valid kernels $k_1$ and $k_2$, the following constructions also yield valid kernels (rules collected in [Bishop]):
$k(\mathbf{x},\mathbf{x}') = c\,k_1(\mathbf{x},\mathbf{x}')$, where $c > 0$
$k(\mathbf{x},\mathbf{x}') = f(\mathbf{x})\,k_1(\mathbf{x},\mathbf{x}')\,f(\mathbf{x}')$, where $f(\cdot)$ is any function
$k(\mathbf{x},\mathbf{x}') = q\!\left(k_1(\mathbf{x},\mathbf{x}')\right)$, where $q(\cdot)$ is a polynomial with coefficients $\ge 0$
$k(\mathbf{x},\mathbf{x}') = \exp\!\left(k_1(\mathbf{x},\mathbf{x}')\right)$
$k(\mathbf{x},\mathbf{x}') = k_1(\mathbf{x},\mathbf{x}') + k_2(\mathbf{x},\mathbf{x}')$ and $k(\mathbf{x},\mathbf{x}') = k_1(\mathbf{x},\mathbf{x}')\,k_2(\mathbf{x},\mathbf{x}')$
$k(\mathbf{x},\mathbf{x}') = k_3\!\left(\boldsymbol{\phi}(\mathbf{x}),\boldsymbol{\phi}(\mathbf{x}')\right)$, where $\boldsymbol{\phi}(\mathbf{x})$ maps $\mathbf{x}$ to $\mathbb{R}^M$ and $k_3(\cdot,\cdot)$ is a valid kernel in $\mathbb{R}^M$
$k(\mathbf{x},\mathbf{x}') = \mathbf{x}^T\mathbf{A}\,\mathbf{x}'$, where $\mathbf{A}$ is a symmetric positive semi-definite matrix
$k(\mathbf{x},\mathbf{x}') = k_a(\mathbf{x}_a,\mathbf{x}_a') + k_b(\mathbf{x}_b,\mathbf{x}_b')$ and $k(\mathbf{x},\mathbf{x}') = k_a(\mathbf{x}_a,\mathbf{x}_a')\,k_b(\mathbf{x}_b,\mathbf{x}_b')$, where $\mathbf{x}_a$ and $\mathbf{x}_b$ are variables (not necessarily disjoint) with $\mathbf{x} = (\mathbf{x}_a,\mathbf{x}_b)$, and $k_a$ and $k_b$ are valid kernels over their respective spaces
[Bishop]
Extending linear methods to kernelized ones
Kernelized versions of linear methods:
Linear methods are popular: unique optimal solutions, faster learning algorithms, and better analysis.
However, we often require nonlinear methods in real-world problems, so we can use kernel-based versions of these linear algorithms.
If a problem can be formulated such that feature vectors only
appear in inner products, we can extend it to a kernelized one
When we only need to know the dot product of all pairs of data (about
the geometry of the data in the feature space)
By replacing inner products with kernels in linear algorithms,
we obtain very flexible methods
We can operate in the mapped space without ever computing the
coordinates of the data in that space
Which information can be obtained from kernel?
Example: we know all pairwise distances in the feature space:
$$d\!\left(\boldsymbol{\phi}(\mathbf{x}),\boldsymbol{\phi}(\mathbf{z})\right)^2 = \left\|\boldsymbol{\phi}(\mathbf{x}) - \boldsymbol{\phi}(\mathbf{z})\right\|^2 = k(\mathbf{x},\mathbf{x}) + k(\mathbf{z},\mathbf{z}) - 2k(\mathbf{x},\mathbf{z})$$
Therefore, we also know the distance of points from the center of mass of a set.
Many dimensionality reduction, clustering, and classification methods can be described in terms of pairwise distances.
This allows us to introduce kernelized versions of them (a small sketch of these distance identities follows this slide).
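A hedged numpy sketch of the identities above, working only from a Gram matrix `K` with `K[i, j]` $= k(\mathbf{x}^{(i)},\mathbf{x}^{(j)})$; the centroid formula expands $\|\boldsymbol{\phi}(\mathbf{x}^{(i)}) - \frac{1}{N}\sum_j \boldsymbol{\phi}(\mathbf{x}^{(j)})\|^2$ in kernel values:

```python
# A minimal sketch: feature-space distances computed from kernel values only.
import numpy as np

def pairwise_sq_distances(K):
    # d(phi(x_i), phi(x_j))^2 = K_ii + K_jj - 2 K_ij
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2 * K

def sq_distances_to_centroid(K):
    # K_ii - (2/N) sum_j K_ij + (1/N^2) sum_{j,l} K_jl
    return np.diag(K) - 2 * K.mean(axis=1) + K.mean()
```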
Kernels for structured data
Kernels can also be defined on general types of data:
Kernel functions do not need to be defined over vectors; we just need a symmetric positive definite kernel matrix.
Thus, many algorithms can work with general (non-vectorial) data.
Kernels exist to embed strings, trees, graphs, ...
This may be even more important than nonlinearity: we can use kernel-based versions of the classical learning algorithms for recognition of structured data.
Kernel function for objects
Sets: an example of a kernel function for sets (a small sketch follows this slide):
$$k(A,B) = 2^{|A \cap B|}$$
Strings: the inner product of the feature vectors for two strings can be defined, e.g., as a sum over all common subsequences, weighted according to their frequency of occurrence and their lengths.
(Example strings from the slide: A E G A T E A G G and E G T E A G A E G A T G.)
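The set kernel mentioned above is a one-line function on non-vectorial inputs; a minimal sketch (the example sets are arbitrary):

```python
# A minimal sketch: the set kernel k(A, B) = 2^{|A ∩ B|}.
def set_kernel(A, B):
    return 2 ** len(set(A) & set(B))

print(set_kernel({"a", "b", "c"}, {"b", "c", "d"}))   # 2^2 = 4
```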
Kernel trick advantages: summary
Operating in the mapped space without ever computing the
coordinates of the data in that space
Besides vectors, we can introduce kernel functions for
structured data (graphs, strings, etc.)
Much of the geometry of the data in the embedding space is
contained in all pairwise dot products
In many cases, inner product in the embedding space can be
computed efficiently.
SVM: Summary
Hard margin: maximizing margin
Soft margin: handling noisy data and overlapping classes
Slack variables in the problem
Dual problems of hard-margin and soft-margin SVM:
They lead us easily to the non-linear SVM method via kernel substitution.
They also express the classifier decision in terms of the support vectors.
Kernel SVMs:
Learn a linear decision boundary in a high-dimensional space without explicitly working on the mapped data.