Machine Learning for Data Mining
Introduction to Support Vector Machines

Andres Mendez-Vazquez

June 22, 2016

Outline

1 History
    The Beginning
2 Separable Classes
    Separable Classes
    Hyperplanes
3 Support Vectors
    Support Vectors
    Quadratic Optimization
    Lagrange Multipliers
    Method
    Karush-Kuhn-Tucker Conditions
    Primal-Dual Problem for Lagrangian
    Properties
4 Kernel
    Kernel Idea
    Higher Dimensional Space
    Examples
    Now, How to select a Kernel?
5 Soft Margins
    Introduction
    The Soft Margin Solution
6 More About Kernels
    Basic Idea
    From Inner products to Kernels

History

Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963
  At the Institute of Control Sciences, Moscow.
  In the paper "Estimation of dependencies based on empirical data".

Corinna Cortes and Vladimir Vapnik in 1995
  They invented the current incarnation, soft margins, at AT&T Labs.

BTW, Corinna Cortes
  Danish computer scientist known for her contributions to the field of machine learning.
  She is currently the Head of Google Research, New York.
  Cortes is a recipient of the ACM Paris Kanellakis Theory and Practice Award for her work on the theoretical foundations of support vector machines.

In addition

Alexey Yakovlevich Chervonenkis
  He was a Soviet and Russian mathematician and, with Vladimir Vapnik, one of the main developers of the Vapnik-Chervonenkis theory, also known as the "fundamental theory of learning", an important part of computational learning theory.

He died on 22 September 2014
  At Losiny Ostrov National Park.

Applications
Partial List

1 Predictive Control
    Control of chaotic systems.
2 Inverse Geosounding Problem
    Used to understand the internal structure of our planet.
3 Environmental Sciences
    Spatio-temporal environmental data analysis and modeling.
4 Protein Fold and Remote Homology Detection
    Recognizing whether two different species contain similar genes.
5 Facial expression classification
6 Texture classification
7 E-Learning
8 Handwritten recognition
9 And counting...

Separable Classes

Given
  Samples $x_i$, $i = 1, \cdots, N$, belonging to two classes $\omega_1, \omega_2$.

Objective
  We want to obtain a decision function as simple as
$$g(x) = w^T x + w_0$$

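To make the decision function concrete, here is a minimal sketch in Python (the weight vector w and bias w0 are hypothetical values chosen only for illustration) of how $g(x) = w^T x + w_0$ assigns a sample to $\omega_1$ or $\omega_2$ by its sign.

```python
import numpy as np

# Hypothetical weight vector and bias, chosen only for illustration
w = np.array([1.0, -2.0])
w0 = 0.5

def g(x):
    # Linear decision function g(x) = w^T x + w_0
    return w @ x + w0

def classify(x):
    # Assign class omega_1 when g(x) >= 0, omega_2 otherwise
    return "omega_1" if g(x) >= 0 else "omega_2"

print(classify(np.array([3.0, 1.0])))   # g = 1.5  -> omega_1
print(classify(np.array([0.0, 1.0])))   # g = -1.5 -> omega_2
```
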
In other words ...

We have the following samples
  $x_1, \cdots, x_m \in C_1$
  $x_1, \cdots, x_n \in C_2$

We want the following decision surfaces
  $w^T x_i + w_0 \geq 0$ for $d_i = +1$ if $x_i \in C_1$
  $w^T x_j + w_0 \leq 0$ for $d_j = -1$ if $x_j \in C_2$

What do we want?
Our goal is to search for a direction w that gives the maximum possible margin.

(Figure: two candidate separating directions, direction 1 and direction 2, with their margins.)

First, $d = \frac{|w_0|}{\sqrt{w_1^2 + w_2^2}}$

We can use the following rule in a triangle with a $90°$ angle
$$\text{Area} = \frac{1}{2} C d \quad (2)$$

In addition, the area can also be calculated as
$$\text{Area} = \frac{1}{2} A B \quad (3)$$

Thus
$$d = \frac{AB}{C}$$

Remark: Can you get the rest of the values?

What about $r = \frac{|g(x)|}{\sqrt{w_1^2 + w_2^2}}$?

First, remember
$$g(x_p) = 0 \text{ and } x = x_p + r \frac{w}{\|w\|} \quad (4)$$

Thus, we have
$$g(x) = w^T \left[ x_p + r \frac{w}{\|w\|} \right] + w_0 = w^T x_p + w_0 + r \frac{w^T w}{\|w\|} = w^T x_p + w_0 + r \frac{\|w\|^2}{\|w\|} = g(x_p) + r \|w\|$$

Then
$$r = \frac{g(x)}{\|w\|}$$

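The identity $r = g(x)/\|w\|$ is easy to check numerically. Below is a small sketch (with a hypothetical hyperplane and test point, for illustration only) that evaluates $g(x)$ and the signed distance to the hyperplane.

```python
import numpy as np

# Hypothetical hyperplane g(x) = w^T x + w_0 in R^2 (illustration only)
w = np.array([3.0, 4.0])
w0 = -5.0

def g(x):
    return w @ x + w0

def signed_distance(x):
    # r = g(x) / ||w||: positive on the side w points to, negative on the other
    return g(x) / np.linalg.norm(w)

x = np.array([2.0, 1.0])
print(g(x))                 # 5.0
print(signed_distance(x))   # 5.0 / 5.0 = 1.0
```
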
Now

We know that the straight line we are looking for looks like
$$w^T x + w_0 = 0 \quad (5)$$

What about something like this?
$$w^T x + w_0 = \delta \quad (6)$$

Clearly
  This will be above or below the initial line $w^T x + w_0 = 0$.

Come back to the hyperplanes
We then have a specific bias for each border support line!!!

(Figure: the separating hyperplane with its two border support lines and their support vectors.)

Then, normalize by δ

The new margin functions
$$w'^T x + w_{10} = 1$$
$$w'^T x + w_{01} = -1$$
where $w' = \frac{w}{\delta}$, $w_{10} = \frac{w'_0}{\delta}$, and $w_{01} = \frac{w''_0}{\delta}$.

Now, we come back to the middle separator hyperplane, but with the normalized term
$$w^T x_i + w_0 \geq w'^T x + w_{10} \text{ for } d_i = +1$$
$$w^T x_i + w_0 \leq w'^T x + w_{01} \text{ for } d_i = -1$$
  Where $w_0$ is the bias of that central hyperplane, and $w$ is the normalized direction of $w'$.

A little about Support Vectors

They are the vectors (here, we assume the normalized w)
  $x_i$ such that $w^T x_i + w_0 = 1$ or $w^T x_i + w_0 = -1$

Properties
  They are the vectors nearest to the decision surface and the most difficult to classify.
  Because of that, we have the name "Support Vector Machines".

Now, we can summarize the decision rule for the hyperplane

For the support vectors
$$g(x_i) = w^T x_i + w_0 = \pm 1 \text{ for } d_i = \pm 1 \quad (7)$$

This implies
  The distance to the support vectors is:
$$r = \frac{g(x_i)}{\|w\|} = \begin{cases} \frac{1}{\|w\|} & \text{if } d_i = +1 \\ -\frac{1}{\|w\|} & \text{if } d_i = -1 \end{cases}$$

Therefore ...

We want the optimum value of the margin of separation
$$\rho = \frac{1}{\|w\|} + \frac{1}{\|w\|} = \frac{2}{\|w\|} \quad (8)$$

And the support vectors define the value of ρ.

Quadratic Optimization

Then, we have the samples with labels
$$T = \{(x_i, d_i)\}_{i=1}^{N}$$

Then we can put the decision rule as
$$d_i(w^T x_i + w_0) \geq 1, \quad i = 1, \cdots, N$$

Then, we have the optimization problem

The optimization problem
$$\min_w \Phi(w) = \frac{1}{2} w^T w$$
$$\text{s.t. } d_i(w^T x_i + w_0) \geq 1, \quad i = 1, \cdots, N$$

Observations
  The cost function Φ(w) is convex.
  The constraints are linear with respect to w.

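As a sanity check, this convex quadratic program can be handed directly to an off-the-shelf solver. The sketch below uses cvxpy on a tiny, made-up linearly separable data set (both the data and the choice of solver library are illustrative assumptions, not part of the original slides).

```python
import numpy as np
import cvxpy as cp

# Tiny, made-up linearly separable data set; labels d_i in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
d = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

w = cp.Variable(2)
w0 = cp.Variable()

# minimize (1/2) w^T w   subject to   d_i (w^T x_i + w_0) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(d, X @ w + w0) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, w0.value)
print(2 / np.linalg.norm(w.value))   # margin of separation rho = 2 / ||w||
```
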
Lagrange Multipliers

The method of Lagrange multipliers
  Gives a set of necessary conditions to identify optimal points of equality constrained optimization problems.
  This is done by converting a constrained problem into an equivalent unconstrained problem with the help of certain unspecified parameters known as Lagrange multipliers.

Lagrange Multipliers
The classical problem formulation
$$\min f(x_1, x_2, ..., x_n)$$
$$\text{s.t. } h_1(x_1, x_2, ..., x_n) = 0$$

It can be converted into
$$\min L(x_1, x_2, ..., x_n, \lambda) = \min \{f(x_1, x_2, ..., x_n) - \lambda h_1(x_1, x_2, ..., x_n)\} \quad (9)$$

where
  $L(x, \lambda)$ is the Lagrangian function.
  λ is an unspecified positive or negative constant called the Lagrange Multiplier.

Finding an Optimum using Lagrange Multipliers

New problem
$$\min L(x_1, x_2, ..., x_n, \lambda) = \min \{f(x_1, x_2, ..., x_n) - \lambda h_1(x_1, x_2, ..., x_n)\}$$

We want an optimal λ = λ*
  If the minimum of $L(x_1, x_2, ..., x_n, \lambda^*)$ occurs at $(x_1, x_2, ..., x_n)^T = (x_1, x_2, ..., x_n)^{T*}$, and $(x_1, x_2, ..., x_n)^{T*}$ satisfies $h_1(x_1, x_2, ..., x_n) = 0$, then $(x_1, x_2, ..., x_n)^{T*}$ minimizes:
$$\min f(x_1, x_2, ..., x_n) \quad \text{s.t. } h_1(x_1, x_2, ..., x_n) = 0$$

Trick
  The trick is to find an appropriate value for the Lagrange multiplier λ.

Remember

Think about this
  Remember the First Law of Newton!!!

Yes!!!
  A system in equilibrium does not move.

(Figure: a static body with balanced forces.)

Lagrange Multipliers

Definition
  Gives a set of necessary conditions to identify optimal points of an equality constrained optimization problem.

Lagrange was a Physicist

He was thinking of the following formula
  A system in equilibrium satisfies:
$$F_1 + F_2 + ... + F_K = 0 \quad (10)$$

But functions do not have forces?
  Are you sure?

Think about the following
  The gradient of a surface.

Gradient to a Surface

After all, a gradient is a measure of the maximal change
  For example, the gradient of a function of three variables:
$$\nabla f(x) = i \frac{\partial f(x)}{\partial x} + j \frac{\partial f(x)}{\partial y} + k \frac{\partial f(x)}{\partial z} \quad (11)$$
  where i, j and k are unit vectors in the directions x, y and z.

Now, Think about this

Yes, we can use the gradient
  However, we need to do some scaling of the forces by using parameters λ.

Thus, we have
$$F_0 + \lambda_1 F_1 + ... + \lambda_K F_K = 0 \quad (12)$$
  where $F_0$ is the gradient of the principal cost function and $F_i$, for $i = 1, 2, ..., K$, are the gradients of the constraints.

Geometric interpretation in the case of minimization

What is wrong? The gradients are going in the other direction; we can fix this by simply multiplying by -1.
Here the cost function we want to minimize is $f(x, y) = x \exp\{-x^2 - y^2\}$.

(Figure: level sets of $f(\vec{x})$ together with the constraints $g_1(\vec{x})$ and $g_2(\vec{x})$.)

$$-\nabla f(\vec{x}) + \lambda_1 \nabla g_1(\vec{x}) + \lambda_2 \nabla g_2(\vec{x}) = 0$$

Nevertheless: it is equivalent to $\nabla f(\vec{x}) - \lambda_1 \nabla g_1(\vec{x}) - \lambda_2 \nabla g_2(\vec{x}) = 0$.

Method

Steps
  1 The original problem is rewritten as: minimize $L(x, \lambda) = f(x) - \lambda h_1(x)$.
  2 Take derivatives of $L(x, \lambda)$ with respect to $x_i$ and set them equal to zero.
  3 Express all $x_i$ in terms of the Lagrange multiplier λ.
  4 Plug x in terms of λ into the constraint $h_1(x) = 0$ and solve for λ.
  5 Calculate x by using the value just found for λ.

From step 2
  If there are n variables (i.e., $x_1, \cdots, x_n$) then you will get n equations with n + 1 unknowns (i.e., n variables $x_i$ and one Lagrange multiplier λ).

Example

We can apply this to the following problem
$$\min f(x, y) = x^2 - 8x + y^2 - 12y + 48$$
$$\text{s.t. } x + y = 8$$

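A minimal sketch of the five steps above, applied to this example with sympy (the symbolic approach and variable names are my own choices; the slide itself does not prescribe any tool). Solving the stationarity equations together with the constraint gives x = 3, y = 5, λ = -2.

```python
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
f = x**2 - 8*x + y**2 - 12*y + 48
h = x + y - 8                       # constraint h(x, y) = 0
L = f - lam * h                     # Lagrangian L = f - lambda * h

# Stationarity (dL/dx = dL/dy = 0) together with feasibility (h = 0)
sol = sp.solve([sp.diff(L, x), sp.diff(L, y), h], [x, y, lam], dict=True)[0]
print(sol)              # {x: 3, y: 5, lam: -2}
print(f.subs(sol))      # constrained minimum value: -2
```
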
Then, Rewriting The Optimization Problem

The optimization problem with its inequality constraints
$$\min_w \Phi(w) = \frac{1}{2} w^T w$$
$$\text{s.t. } d_i(w^T x_i + w_0) \geq 1, \quad i = 1, \cdots, N$$

Then, for our problem

Using the Lagrange Multipliers (we will call them $\alpha_i$)
  We obtain the following cost function that we want to minimize
$$J(w, w_0, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i [d_i(w^T x_i + w_0) - 1]$$

Observation
  Minimize with respect to w and $w_0$.
  Maximize with respect to α because it dominates
$$-\sum_{i=1}^{N} \alpha_i [d_i(w^T x_i + w_0) - 1]. \quad (13)$$

Saddle Point?

At the left the original problem, at the right the Lagrangian!!!

(Figure: the constrained problem $f(\vec{x})$ with constraints $g_1(\vec{x})$, $g_2(\vec{x})$ next to the saddle-shaped Lagrangian surface.)

Karush-Kuhn-Tucker Conditions

First, an Inequality Constrained Problem P
$$\min f(x)$$
$$\text{s.t. } g_1(x) = 0, \; \ldots, \; g_N(x) = 0$$

A really minimal version!!! Hey, it is a patch work!!!
  A point x is a local minimum of an equality constrained problem P only if a set of non-negative $\alpha_j$'s may be found such that:
$$\nabla L(x, \alpha) = \nabla f(x) - \sum_{i=1}^{N} \alpha_i \nabla g_i(x) = 0$$

Karush-Kuhn-Tucker Conditions

Important
  Think about this: each constraint corresponds to a sample in one of the two classes, thus
  The corresponding $\alpha_i$'s are going to be zero after optimization if a constraint is not active, i.e. $d_i(w^T x_i + w_0) - 1 \neq 0$ (remember the maximization).

Again the Support Vectors
  This actually defines the idea of support vectors!!!

Thus
  Only the $\alpha_i$'s with active constraints (support vectors) will be different from zero, when $d_i(w^T x_i + w_0) - 1 = 0$.

A small deviation from the SVM's for the sake of Vox Populi

Theorem (Karush-Kuhn-Tucker Necessary Conditions)
  Let X be a non-empty open set in $\mathbb{R}^n$, and let $f : \mathbb{R}^n \rightarrow \mathbb{R}$ and $g_i : \mathbb{R}^n \rightarrow \mathbb{R}$ for $i = 1, ..., m$. Consider the problem P to minimize $f(x)$ subject to $x \in X$ and $g_i(x) \leq 0$, $i = 1, ..., m$. Let x be a feasible solution, and denote $I = \{i \mid g_i(x) = 0\}$. Suppose that f and $g_i$ for $i \in I$ are differentiable at x and that $g_i$, $i \notin I$, are continuous at x. Furthermore, suppose that $\nabla g_i(x)$ for $i \in I$ are linearly independent. If x solves problem P locally, there exist scalars $u_i$ for $i \in I$ such that
$$\nabla f(x) + \sum_{i \in I} u_i \nabla g_i(x) = 0$$
$$u_i \geq 0 \text{ for } i \in I$$

It is more...

In addition to the above assumptions
  If $g_i$ for each $i \notin I$ is also differentiable at x, the previous conditions can be written in the following equivalent form:
$$\nabla f(x) + \sum_{i=1}^{m} u_i \nabla g_i(x) = 0$$
$$u_i g_i(x) = 0 \text{ for } i = 1, ..., m$$
$$u_i \geq 0 \text{ for } i = 1, ..., m$$

The necessary conditions for optimality

We use the previous theorem
$$\nabla \left( \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i [d_i(w^T x_i + w_0) - 1] \right) \quad (14)$$

Condition 1
$$\frac{\partial J(w, w_0, \alpha)}{\partial w} = 0$$

Condition 2
$$\frac{\partial J(w, w_0, \alpha)}{\partial w_0} = 0$$

Using the conditions

We have the first condition
$$\frac{\partial J(w, w_0, \alpha)}{\partial w} = \frac{\partial \frac{1}{2} w^T w}{\partial w} - \frac{\partial \sum_{i=1}^{N} \alpha_i [d_i(w^T x_i + w_0) - 1]}{\partial w} = 0$$
$$\frac{\partial J(w, w_0, \alpha)}{\partial w} = \frac{1}{2}(w + w) - \sum_{i=1}^{N} \alpha_i d_i x_i$$

Thus
$$w = \sum_{i=1}^{N} \alpha_i d_i x_i \quad (15)$$

In a similar way ...

We have, by the second optimality condition,
$$\sum_{i=1}^{N} \alpha_i d_i = 0$$

Note
$$\alpha_i \left[ d_i(w^T x_i + w_0) - 1 \right] = 0$$
  Because the constraint term vanishes in the optimal solution, i.e. $\alpha_i = 0$ or $d_i(w^T x_i + w_0) - 1 = 0$.

Thus

We need something extra
  Our classic trick of transforming a problem into another problem.

In this case
  We use the Primal-Dual Problem for the Lagrangian.

Where
  We move from a minimization to a maximization!!!

Lagrangian Dual Problem

Consider the following nonlinear programming problem
Primal Problem P
$$\min f(x)$$
$$\text{s.t. } g_i(x) \leq 0 \text{ for } i = 1, ..., m$$
$$h_i(x) = 0 \text{ for } i = 1, ..., l$$
$$x \in X$$

Lagrange Dual Problem D
$$\max \Theta(u, v)$$
$$\text{s.t. } u \geq 0$$
where $\Theta(u, v) = \inf_x \left\{ f(x) + \sum_{i=1}^{m} u_i g_i(x) + \sum_{i=1}^{l} v_i h_i(x) \mid x \in X \right\}$

What does this mean?

Assume that the equality constraints do not exist
We have then
$$\min f(x)$$
$$\text{s.t. } g_i(x) \leq 0 \text{ for } i = 1, ..., m$$
$$x \in X$$

Now assume that we finish with only one constraint
We have then
$$\min f(x)$$
$$\text{s.t. } g(x) \leq 0$$
$$x \in X$$

What does this mean?

Thus, in the y-z plane you have
$$G = \{(y, z) \mid y = g(x), \; z = f(x) \text{ for some } x \in X\} \quad (16)$$

Thus
  Given $u \geq 0$, we need to minimize $f(x) + u g(x)$ to find $\theta(u)$, which is equivalent to $\nabla f(x) + u \nabla g(x) = 0$.

What does this mean?

Thus, in the y-z plane, we have
$$z + uy = \alpha \quad (17)$$
a line with slope -u.

Then, to minimize $z + uy = \alpha$
  We need to move the line $z + uy = \alpha$ parallel to itself as far down as possible, along its negative gradient, while remaining in contact with G.

In other words

Move the line parallel to itself until it supports G.

(Figure: the set G in the y-z plane with a supporting line of slope -u; points A and B mark its intercepts.)

Note: the set G lies above the line and touches it.

Thus

The problem is then to find the slope of the supporting hyperplane for G.

The intersection with the z-axis
  Gives θ(u).

Thus

The dual problem is equivalent to
  Finding the slope of the supporting hyperplane such that its intercept on the z-axis is maximal.

Or

Such a hyperplane has slope -u and supports G at (y, z).

(Figure: the supporting line of slope -u touching G at the point (y, z).)

Remark: The optimal solution is u and the optimal dual objective is z.

For more on this, Please!!!

Look at this book
  "Nonlinear Programming: Theory and Algorithms" by Mokhtar S. Bazaraa and C. M. Shetty, Wiley, New York (2006).
    At page 260.

Example (Lagrange Dual)

Primal
$$\min x_1^2 + x_2^2$$
$$\text{s.t. } -x_1 - x_2 + 4 \leq 0$$
$$x_1, x_2 \geq 0$$

Lagrange Dual
$$\Theta(u) = \inf \{x_1^2 + x_2^2 + u(-x_1 - x_2 + 4) \mid x_1, x_2 \geq 0\}$$

Solution

Differentiate with respect to $x_1$ and $x_2$
  We have two cases to take into account: $u \geq 0$ and $u < 0$.

The first case is clear
  What about when $u < 0$?

We have that
$$\theta(u) = \begin{cases} -\frac{1}{2}u^2 + 4u & \text{if } u \geq 0 \\ 4u & \text{if } u < 0 \end{cases} \quad (18)$$

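A quick numerical check of this example (an illustrative sketch, not part of the slides): maximizing θ(u) over u ≥ 0 gives u* = 4 with θ(u*) = 8, which matches the primal optimum x1 = x2 = 2 with objective value 8, so there is no duality gap here.

```python
import numpy as np

def theta(u):
    # Dual function derived above
    return -0.5 * u**2 + 4.0 * u if u >= 0 else 4.0 * u

us = np.linspace(-2.0, 8.0, 1001)
vals = np.array([theta(u) for u in us])
print(us[vals.argmax()], vals.max())   # approximately u* = 4, theta(u*) = 8

# Primal optimum for comparison: x1 = x2 = 2 gives x1^2 + x2^2 = 8
print(2.0**2 + 2.0**2)
```
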
Duality Theorem

First Property
  If the Primal has an optimal solution, so does the Dual.

Thus
  In order for $w^*$ and $\alpha^*$ to be optimal solutions for the primal and dual problem respectively, it is necessary and sufficient that $w^*$:
  Is feasible for the primal problem, and
$$\Phi(w^*) = J(w^*, w_0^*, \alpha^*) = \min_w J(w, w_0, \alpha^*)$$

Reformulate our Equations

We have then
$$J(w, w_0, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i d_i w^T x_i - w_0 \sum_{i=1}^{N} \alpha_i d_i + \sum_{i=1}^{N} \alpha_i$$

Now, by our 2nd optimality condition
$$J(w, w_0, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i d_i w^T x_i + \sum_{i=1}^{N} \alpha_i$$

We have finally, for the 1st Optimality Condition:

First
$$w^T w = \sum_{i=1}^{N} \alpha_i d_i w^T x_i = \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j x_j^T x_i$$

Second, setting $J(w, w_0, \alpha) = Q(\alpha)$
$$Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j x_j^T x_i$$

From here, we have the problem

This is the problem that we really solve
  Given the training sample $\{(x_i, d_i)\}_{i=1}^{N}$, find the Lagrange multipliers $\{\alpha_i\}_{i=1}^{N}$ that maximize the objective function
$$Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j x_j^T x_i$$
subject to the constraints
$$\sum_{i=1}^{N} \alpha_i d_i = 0 \quad (19)$$
$$\alpha_i \geq 0 \text{ for } i = 1, \cdots, N \quad (20)$$

Note
  In the Primal, we were trying to minimize the cost function; for this it is necessary to maximize α. That is the reason why we are maximizing Q(α).

Solving for α

We can compute $w^*$ once we get the optimal $\alpha_i^*$ by using (Eq. 15)
$$w^* = \sum_{i=1}^{N} \alpha_i^* d_i x_i$$

In addition, we can compute the optimal bias $w_0^*$ using the optimal weight $w^*$
  For this, we use the positive margin equation:
$$g\left(x^{(s)}\right) = w^T x^{(s)} + w_0 = 1$$
  corresponding to a positive support vector.

Then
$$w_0 = 1 - (w^*)^T x^{(s)} \text{ for } d^{(s)} = 1 \quad (21)$$

What do we need?

Until now, we have only a maximal margin algorithm
  All this works fine when the classes are separable.
  Problem: what happens when they are not separable?
  What can we do?

Map to a higher Dimensional Space

Assume that there exists a mapping
$$x \in \mathbb{R}^l \rightarrow y \in \mathbb{R}^k$$

Then, it is possible to define the following mapping.

Define a map to a higher Dimension

Nonlinear transformations
  Given a series of nonlinear transformations
$$\{\phi_i(x)\}_{i=1}^{m}$$
  from the input space to the feature space.

We can define the decision surface as
$$\sum_{i=1}^{m} w_i \phi_i(x) + w_0 = 0$$

This allows us to define

The following vector
$$\phi(x) = (\phi_0(x), \phi_1(x), \cdots, \phi_m(x))^T$$
that represents the mapping.

From this mapping
  We can define the following kernel function
$$K : X \times X \rightarrow \mathbb{R}$$
$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$

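The point of $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is that the inner product in the feature space can be evaluated without building φ explicitly. A small sketch (my own illustrative feature map for 2-D inputs, not taken from the slides): the degree-2 polynomial kernel $(x^T z + 1)^2$ agrees with the inner product of an explicit 6-dimensional map.

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel on R^2
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

def k(x, z):
    # Degree-2 polynomial kernel
    return (x @ z + 1.0) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(k(x, z), phi(x) @ phi(z))   # both numbers agree up to rounding
```
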
Example of Kernels

Polynomials
$$k(x, z) = (x^T z + 1)^q, \quad q > 0$$

Radial Basis Functions
$$k(x, z) = \exp\left(-\frac{\|x - z\|^2}{\sigma^2}\right)$$

Hyperbolic Tangents
$$k(x, z) = \tanh\left(\beta x^T z + \gamma\right)$$

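A minimal sketch of the three kernels above as plain functions (the parameter values q, σ, β, γ used here are arbitrary illustrative choices; they are hyperparameters the user must tune).

```python
import numpy as np

def poly_kernel(x, z, q=2):
    # k(x, z) = (x^T z + 1)^q, q > 0
    return (x @ z + 1.0) ** q

def rbf_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / sigma^2)
    return np.exp(-np.sum((x - z) ** 2) / sigma**2)

def tanh_kernel(x, z, beta=1.0, gamma=0.0):
    # k(x, z) = tanh(beta * x^T z + gamma)
    return np.tanh(beta * (x @ z) + gamma)

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(poly_kernel(x, z), rbf_kernel(x, z), tanh_kernel(x, z))
```
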
Now, How to select a Kernel?

We have a problem
  Selecting a specific kernel and its parameters is usually done in a try-and-see manner.

Thus
  In general, the Radial Basis Function kernel is a reasonable first choice.

Then
  If this fails, we can try the other possible kernels.

Thus, we have something like this

Step 1
  Normalize the data.

Step 2
  Use cross-validation to adjust the parameters of the selected kernel.

Step 3
  Train against the entire dataset.

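A sketch of these three steps with scikit-learn (the library, the synthetic data and the parameter grid are illustrative assumptions, not prescribed by the slides): the pipeline normalizes the data, cross-validation tunes C and γ of the RBF kernel, and the best model is refit on the entire data set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Step 1: normalization is part of the pipeline
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Step 2: cross-validation over the RBF kernel parameters
grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(pipe, grid, cv=5)

# Step 3: GridSearchCV refits the best model on the entire data set by default
search.fit(X, y)
print(search.best_params_, search.best_score_)
```
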
Optimal Hyperplane for non-separable patterns

Important
  We have been considering only problems where the classes are linearly separable.

Now
  What happens when the patterns are not separable?

Thus, we can still build a separating hyperplane
  But errors will happen in the classification... We need to minimize them...

What if the following happens?
Some data points invade the "margin" space.

(Figure: the optimal hyperplane with a data point violating the margin.)

Fixing the Problem - Corinna's Style

The margin of separation between classes is said to be soft if a data point $(x_i, d_i)$ violates the following condition
$$d_i\left(w^T x_i + b\right) \geq +1, \quad i = 1, 2, ..., N \quad (22)$$

This violation can arise in one of two ways
  The data point $(x_i, d_i)$ falls inside the region of separation but on the right side of the decision surface: still a correct classification.

Or...

This violation can arise in one of two ways
  The data point $(x_i, d_i)$ falls on the wrong side of the decision surface: an incorrect classification.

(Figure: the optimal hyperplane with a data point on the wrong side of it.)

Solving the problem

What to do?
  We introduce a set of nonnegative scalar values $\{\xi_i\}_{i=1}^{N}$.

Introduce this into the decision rule
$$d_i\left(w^T x_i + b\right) \geq 1 - \xi_i, \quad i = 1, 2, ..., N \quad (23)$$

The ξi are called slack variables

What?
  In 1995, Corinna Cortes and Vladimir N. Vapnik suggested a modified maximum margin idea that allows for mislabeled examples.

Ok!!!
  Instead of expecting a constant margin for all the samples, the margin can change depending on the sample.

What do we have?
  $\xi_i$ measures the deviation of a data point from the ideal condition of pattern separability.

Properties of ξi

What if you have $0 \leq \xi_i \leq 1$?

(Figure: the optimal hyperplane with a data point inside the margin but on the correct side of the decision surface.)

Properties of ξi

What if you have $\xi_i > 1$?

(Figure: the optimal hyperplane with a data point on the wrong side of the decision surface.)

Support Vectors

We want
  Support vectors that satisfy equation (Eq. 23) even when $\xi_i > 0$:
$$d_i\left(w^T x_i + b\right) \geq 1 - \xi_i, \quad i = 1, 2, ..., N$$

We want the following

We want to find a hyperplane
  Such that the average misclassification error over all the samples is minimized:
$$\frac{1}{N} \sum_{i=1}^{N} e_i^2 \quad (24)$$

First Attempt Into Minimization

We can try the following
  Given
$$I(x) = \begin{cases} 0 & \text{if } x \leq 0 \\ 1 & \text{if } x > 0 \end{cases} \quad (25)$$

Minimize the following
$$\Phi(\xi) = \sum_{i=1}^{N} I(\xi_i - 1) \quad (26)$$
with respect to the weight vector w, subject to
  1 $d_i\left(w^T x_i + b\right) \geq 1 - \xi_i$, $i = 1, 2, ..., N$
  2 $\|w\|^2 \leq C$ for a given C.

Problem

Using this first attempt
  Minimization of Φ(ξ) with respect to w is a non-convex optimization problem that is NP-complete.

Thus, we need to use an approximation, maybe
$$\Phi(\xi) = \sum_{i=1}^{N} \xi_i \quad (27)$$

Now, we simplify the computations by integrating the vector w
$$\Phi(w, \xi) = \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i \quad (28)$$

Important

First
  Minimizing the first term in (Eq. 28) is related to minimizing the Vapnik-Chervonenkis dimension.
  Which is a measure of the capacity (complexity, expressive power, richness, or flexibility) of a statistical classification algorithm.

Second
  The second term $\sum_{i=1}^{N} \xi_i$ is an upper bound on the number of test errors.

Some problems for the Parameter C

Little Problem
The parameter C has to be selected by the user.

This can be done in two ways
1. The parameter C is determined experimentally via the standard use of a training/validation test set.
2. It is determined analytically by estimating the Vapnik–Chervonenkis dimension.

103 / 124
Primal Problem

Problem, given samples \{(\mathbf{x}_i, d_i)\}_{i=1}^{N}

\min_{\mathbf{w}, \xi} \Phi(\mathbf{w}, \xi) = \min_{\mathbf{w}, \xi} \left\{ \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{N} \xi_i \right\}
\text{s.t. } d_i\left(\mathbf{w}^T\mathbf{x}_i + w_0\right) \geq 1 - \xi_i \text{ for } i = 1, \ldots, N
\qquad \xi_i \geq 0 \text{ for all } i

With C a user-specified positive parameter.

104 / 124
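This primal problem is what off-the-shelf SVM libraries solve. As a minimal sketch (assuming scikit-learn is available; the toy data and the value C = 1.0 are arbitrary choices), the C argument of SVC plays exactly the role of the user-specified parameter above.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data with labels d_i in {-1, +1}; the classes overlap,
# so some slack variables will be strictly positive.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
d = np.hstack([-np.ones(50), np.ones(50)])

# Soft-margin linear SVM: C trades margin width against the slack penalty.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, d)

w = clf.coef_[0]         # the weight vector w
w0 = clf.intercept_[0]   # the bias w_0
print(w, w0, len(clf.support_))  # len(clf.support_) is the number of support vectors
```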
Final Setup

Using Lagrange multipliers and the primal-dual method, it is possible to obtain the following setup
Given the training sample \{(\mathbf{x}_i, d_i)\}_{i=1}^{N}, find the Lagrange multipliers \{\alpha_i\}_{i=1}^{N} that maximize the objective function

\max_{\alpha} Q(\alpha) = \max_{\alpha} \left\{ \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j \mathbf{x}_j^T\mathbf{x}_i \right\}

subject to the constraints

\sum_{i=1}^{N} \alpha_i d_i = 0 \quad (29)
0 \leq \alpha_i \leq C \text{ for } i = 1, \ldots, N \quad (30)

where C is a user-specified positive parameter.

106 / 124
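The following numpy/scipy sketch sets up exactly this dual as a box- and equality-constrained quadratic program and hands it to a generic solver. This is not the specialized SMO algorithm used by production SVM libraries; it is only a small-N illustration, and X, d and C are assumed inputs.

```python
import numpy as np
from scipy.optimize import minimize

def solve_dual(X, d, C):
    """Sketch: maximize Q(alpha) of Eqs. (29)-(30) by minimizing -Q(alpha) with SLSQP."""
    N = X.shape[0]
    H = (d[:, None] * d[None, :]) * (X @ X.T)   # H_ij = d_i d_j x_i^T x_j

    fun = lambda a: 0.5 * a @ H @ a - a.sum()   # -Q(alpha)
    jac = lambda a: H @ a - np.ones(N)

    cons = [{"type": "eq", "fun": lambda a: a @ d, "jac": lambda a: d}]  # sum_i alpha_i d_i = 0
    bounds = [(0.0, C)] * N                                              # 0 <= alpha_i <= C

    res = minimize(fun, x0=np.zeros(N), jac=jac, bounds=bounds,
                   constraints=cons, method="SLSQP")
    return res.x   # the optimal alpha_i
```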
Remarks

Something Notable
Note that neither the slack variables nor their Lagrange multipliers appear in the dual problem.
The dual problem for the case of non-separable patterns is thus similar to that for the simple case of linearly separable patterns.

The only big difference
Instead of using the constraint αi ≥ 0, the new problem uses the more stringent constraint 0 ≤ αi ≤ C.

Note the following

\xi_i = 0 \text{ if } \alpha_i < C \quad (31)

107 / 124
Finally

The optimal solution for the weight vector w∗

\mathbf{w}^* = \sum_{i=1}^{N_s} \alpha_i^* d_i \mathbf{x}_i

Where Ns is the number of support vectors.

In addition
The determination of the optimal bias proceeds in a manner similar to that described before, by exploiting the KKT conditions.

The KKT conditions are as follows
\alpha_i\left[d_i\left(\mathbf{w}^T\mathbf{x}_i + w_0\right) - 1 + \xi_i\right] = 0 for i = 1, 2, ..., N.
\mu_i \xi_i = 0 for i = 1, 2, ..., N.

108 / 124
Where...

The µi are Lagrange multipliers
They are used to enforce the non-negativity of the slack variables ξi for all i.

Something Notable
At the saddle point, the derivative of the Lagrangian function for the primal problem,

\frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\left[d_i\left(\mathbf{w}^T\mathbf{x}_i + w_0\right) - 1 + \xi_i\right] - \sum_{i=1}^{N}\mu_i\xi_i \quad (32)

with respect to the slack variable ξi is zero.

109 / 124
Thus

We get

\alpha_i + \mu_i = C \quad (33)

Thus, if αi < C

Then µi > 0, which implies ξi = 0

We may determine w0
Using any data point (xi, di) in the training set such that 0 < α∗i < C. Then, given ξi = 0,

w_0^* = \frac{1}{d_i} - \left(\mathbf{w}^*\right)^T\mathbf{x}_i \quad (34)

110 / 124
Nevertheless

It is better
To take the mean value of w∗0 obtained from all such data points in the training sample (Burges, 1998).

BTW, Burges wrote a great tutorial on SVMs: "A Tutorial on Support Vector Machines for Pattern Recognition" (1998).

111 / 124
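Putting the last few slides together, here is a small numpy sketch that recovers w∗ and w∗0 from a dual solution, averaging the bias over all margin support vectors as recommended above. The tolerance and the assumption that a dual solution alpha is already available are illustrative.

```python
import numpy as np

def recover_hyperplane(alpha, X, d, C, tol=1e-8):
    """Sketch: recover w* and w0* from the dual solution, using Eqs. (33)-(34)."""
    sv = alpha > tol                      # support vectors: alpha_i > 0
    margin = sv & (alpha < C - tol)       # 0 < alpha_i < C  =>  xi_i = 0, point on the margin

    # w* = sum over support vectors of alpha_i d_i x_i
    w = (alpha[sv] * d[sv]) @ X[sv]

    # w0* = 1/d_i - w*^T x_i, averaged over all margin support vectors (Burges, 1998)
    w0 = np.mean(1.0 / d[margin] - X[margin] @ w)
    return w, w0
```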
Basic Idea

Something Notable
The SVM uses the scalar product 〈xi, xj〉 as a measure of similarity between xi and xj, and of distance to the hyperplane.
Since the scalar product is linear, the SVM is a linear method.

But
Using a nonlinear function instead, we can make the classifier nonlinear.

113 / 124
We do this by defining the following map

Nonlinear transformations
Given a series of nonlinear transformations

\{\phi_i(\mathbf{x})\}_{i=1}^{m}

from the input space to the feature space.

We can define the decision surface as

\sum_{i=1}^{m} w_i \phi_i(\mathbf{x}) + w_0 = 0

114 / 124
This allows us to define

The following vector

\boldsymbol{\phi}(\mathbf{x}) = \left(\phi_0(\mathbf{x}), \phi_1(\mathbf{x}), \ldots, \phi_m(\mathbf{x})\right)^T

That represents the mapping.

115 / 124
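As a concrete, hypothetical instance of such a map, the numpy sketch below sends a two-dimensional input into a six-dimensional feature space. The particular φi and the weight vector are assumptions chosen for illustration only; the point is that a surface that is linear in the features is nonlinear in the original input space.

```python
import numpy as np

def phi(x):
    """An illustrative nonlinear map {phi_i(x)} from 2-D input space to a
    6-D feature space; phi_0 = 1 plays the role of the bias coordinate."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2.0) * x1, np.sqrt(2.0) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2.0) * x1 * x2])

# A decision surface that is linear in feature space, sum_i w_i phi_i(x) = 0
# (the first weight acts as w_0), is quadratic in the input space:
# here it is the circle x1^2 + x2^2 = 1.
w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
x = np.array([0.6, 0.8])
print(np.isclose(w @ phi(x), 0.0))   # True: x lies on the decision surface
```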
Finally

We define the decision surface as

\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) = 0 \quad (35)

We now seek "linear" separability of features, so we may write

\mathbf{w} = \sum_{i=1}^{N} \alpha_i d_i \boldsymbol{\phi}(\mathbf{x}_i) \quad (36)

Thus, we finish with the following decision surface

\sum_{i=1}^{N} \alpha_i d_i \boldsymbol{\phi}^T(\mathbf{x}_i)\boldsymbol{\phi}(\mathbf{x}) = 0 \quad (37)

117 / 124
Thus

The term φT(xi)φ(x)
It represents the inner product of two vectors in the feature space induced by the input patterns.

We can introduce the inner-product kernel

K(\mathbf{x}_i, \mathbf{x}) = \boldsymbol{\phi}^T(\mathbf{x}_i)\boldsymbol{\phi}(\mathbf{x}) = \sum_{j=0}^{m} \phi_j(\mathbf{x}_i)\phi_j(\mathbf{x}) \quad (38)

Property: Symmetry

K(\mathbf{x}_i, \mathbf{x}) = K(\mathbf{x}, \mathbf{x}_i) \quad (39)

118 / 124
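Equation (38) can be checked numerically. For the illustrative quadratic map sketched earlier (repeated inline so the snippet is self-contained), the feature-space inner product coincides with the polynomial kernel (1 + x^T z)^2 evaluated directly in input space.

```python
import numpy as np

# The same illustrative quadratic feature map as before (an assumption, not
# a map prescribed by the slides).
phi = lambda x: np.array([1.0,
                          np.sqrt(2.0) * x[0], np.sqrt(2.0) * x[1],
                          x[0] ** 2, x[1] ** 2,
                          np.sqrt(2.0) * x[0] * x[1]])

rng = np.random.default_rng(1)
x, z = rng.normal(size=2), rng.normal(size=2)

# Eq. (38): phi(x)^T phi(z) equals the kernel K(x, z) = (1 + x^T z)^2.
assert np.isclose(phi(x) @ phi(z), (1.0 + x @ z) ** 2)
```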
This allows us to redefine the optimal hyperplane

We get

\sum_{i=1}^{N} \alpha_i d_i K(\mathbf{x}_i, \mathbf{x}) = 0 \quad (40)

Something Notable
Using kernels, we can avoid going from:

Input Space =⇒ Mapping Space =⇒ Inner Product (41)

By directly going from

Input Space =⇒ Inner Product (42)

119 / 124
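In code, Eq. (40) becomes a decision function that only ever evaluates K(xi, x) on the stored support vectors, never the map φ itself. The sketch below adds an explicit bias term and uses a Gaussian (RBF) kernel as one possible choice; the kernel, its width and the bias handling are assumptions for illustration.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel, one common choice of K; gamma is an assumed hyperparameter."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def decision(x, support_X, support_d, support_alpha, w0, kernel=rbf_kernel):
    """Sketch of the kernelized classifier: sign of sum_i alpha_i d_i K(x_i, x) + w0.

    Only the support vectors (alpha_i > 0) need to be stored and evaluated;
    phi(x) is never formed explicitly."""
    s = sum(a * d * kernel(xi, x)
            for a, d, xi in zip(support_alpha, support_d, support_X))
    return np.sign(s + w0)
```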
Important

Something Notable
The expansion of (Eq. 38) for the inner-product kernel K(xi, x) is an important special case of a series expansion that arises in functional analysis, known as Mercer's theorem.

120 / 124
Mercer’s Theorem

Mercer’s Theorem
Let K(x, x′) be a continuous symmetric kernel that is defined in the closed interval a ≤ x ≤ b, and likewise for x′. The kernel K(x, x′) can be expanded in the series

K(\mathbf{x}, \mathbf{x}') = \sum_{i=1}^{\infty} \lambda_i \phi_i(\mathbf{x})\phi_i(\mathbf{x}') \quad (43)

With
Positive coefficients, λi > 0 for all i.

121 / 124
Mercer’s Theorem

For this expansion to be valid and for it to converge absolutely and uniformly
It is necessary and sufficient that the condition

\int_a^b \int_a^b K(\mathbf{x}, \mathbf{x}')\,\psi(\mathbf{x})\,\psi(\mathbf{x}')\, d\mathbf{x}\, d\mathbf{x}' \geq 0 \quad (44)

holds for all ψ such that \int_a^b \psi^2(\mathbf{x})\, d\mathbf{x} < \infty (an example of a quadratic norm for functions).

122 / 124
Remarks

First
The functions φi(x) are called eigenfunctions of the expansion, and the numbers λi are called eigenvalues.

Second
The fact that all of the eigenvalues are positive means that the kernel K(x, x′) is positive definite.

123 / 124
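A useful finite-sample analogue of the Mercer condition (Eq. 44) is that the Gram matrix Kij = K(xi, xj) built from any data set must be symmetric positive semidefinite, i.e., have no negative eigenvalues. The numpy sketch below checks this for an RBF kernel on random data; the kernel choice, the width gamma and the data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))

gamma = 0.5  # assumed RBF width
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-gamma * sq_dists)       # RBF Gram matrix K_ij = K(x_i, x_j)

eigvals = np.linalg.eigvalsh(K)     # real eigenvalues of the symmetric matrix
print(eigvals.min() >= -1e-10)      # True: the kernel passes the PSD check
```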
Not only that

We have that
For λi ≠ 1, the ith coordinate of the image of the input vector x in the feature space is √λi φi(x), built from the ith eigenfunction of the expansion.

In theory
The dimensionality of the feature space (i.e., the number of eigenvalues/eigenfunctions) can be infinitely large.

124 / 124