Machine Learning for Data Mining
Introduction to Support Vector Machines

Andres Mendez-Vazquez

June 21, 2015


Outline

1 History

2 Separable Classes
   Separable Classes
   Hyperplanes

3 Support Vectors
   Support Vectors
   Quadratic Optimization
   Lagrange Multipliers
   Method
   Karush-Kuhn-Tucker Conditions
   Primal-Dual Problem for Lagrangian
   Properties

4 Kernel
   Kernel Idea
   Higher Dimensional Space
   Examples
   Now, How to select a Kernel?

History

Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963
At the Institute of Control Sciences, Moscow, in the paper "Estimation of dependencies based on empirical data."

Corinna Cortes and Vladimir Vapnik in 1995
They invented its current incarnation, soft margins, at AT&T Labs.

BTW Corinna Cortes
A Danish computer scientist known for her contributions to the field of machine learning. She is currently the Head of Google Research, New York. Cortes is a recipient of the Paris Kanellakis Theory and Practice Award (ACM) for her work on the theoretical foundations of support vector machines.


In addition

Alexey Yakovlevich Chervonenkis
He was a Soviet and Russian mathematician and, with Vladimir Vapnik, one of the main developers of the Vapnik–Chervonenkis theory, also known as the "fundamental theory of learning," an important part of computational learning theory.

He died on September 22nd, 2014
At Losiny Ostrov National Park.


Applications (Partial List)

1 Predictive Control
   Control of chaotic systems.

2 Inverse Geosounding Problem
   Used to understand the internal structure of our planet.

3 Environmental Sciences
   Spatio-temporal environmental data analysis and modeling.

4 Protein Fold and Remote Homology Detection
   Recognizing whether two different species contain similar genes.

5 Facial expression classification
6 Texture classification
7 E-Learning
8 Handwritten recognition
9 And counting...



Separable Classes

Given
A set of samples $x_i$, $i = 1, \dots, N$, belonging to two classes $\omega_1$, $\omega_2$.

Objective
We want to obtain decision functions

$$g(x) = w^T x + w_0$$


Such that we can do the following

[Figure: a linear separation function $g(x) = w^T x + w_0$ separating the two classes]


In other words ...

We have the following samples
$x_1, \dots, x_m \in C_1$
$x_1, \dots, x_n \in C_2$

We want the following decision surfaces
$w^T x_i + w_0 \geq 0$ for $d_i = +1$ if $x_i \in C_1$
$w^T x_i + w_0 \leq 0$ for $d_i = -1$ if $x_i \in C_2$
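To make the decision rule concrete, here is a minimal NumPy sketch (the weight vector, bias, and sample points are made-up values for illustration) that labels points by the sign of $g(x) = w^T x + w_0$:

```python
import numpy as np

# Hypothetical weight vector and bias defining the hyperplane w^T x + w_0 = 0
w = np.array([1.0, -2.0])
w_0 = 0.5

def g(x):
    # Linear decision function g(x) = w^T x + w_0
    return w @ x + w_0

# Assign d_i = +1 (class C_1) when g(x) >= 0, and d_i = -1 (class C_2) otherwise
for x in [np.array([2.0, 0.0]), np.array([-1.0, 1.5])]:
    d = 1 if g(x) >= 0 else -1
    print(x, "->", d)
```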


What do we want?
Our goal is to search for a direction w that gives the maximum possible margin.

[Figure: margins obtained along two candidate directions, direction 1 and direction 2]

Remember

We have the following

[Figure: a point x, its projection onto the hyperplane, the distance r from x to the hyperplane, and the distance d from the origin 0 to the hyperplane]

A Little of Geometry

Thus

[Figure: a right triangle with legs A and B, hypotenuse C, and the distances d and r]

Then

$$d = \frac{|w_0|}{\sqrt{w_1^2 + w_2^2}}, \qquad r = \frac{|g(x)|}{\sqrt{w_1^2 + w_2^2}} \qquad (1)$$


First, $d = \frac{|w_0|}{\sqrt{w_1^2 + w_2^2}}$

We can use the following rule in a triangle with a 90° angle

$$\text{Area} = \frac{1}{2} C d \qquad (2)$$

In addition, the area can also be calculated as

$$\text{Area} = \frac{1}{2} A B \qquad (3)$$

Thus

$$d = \frac{AB}{C}$$

Remark: Can you get the rest of the values?


What about $r = \frac{|g(x)|}{\sqrt{w_1^2 + w_2^2}}$?

First, remember

$$g(x_p) = 0 \quad \text{and} \quad x = x_p + r\,\frac{w}{\|w\|} \qquad (4)$$

Thus, we have

$$g(x) = w^T\left[x_p + r\,\frac{w}{\|w\|}\right] + w_0 = w^T x_p + w_0 + r\,\frac{w^T w}{\|w\|} = g(x_p) + r\,\|w\|$$

Then

$$r = \frac{g(x)}{\|w\|}$$
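As a quick sanity check of Equations (1) and (4), here is a small NumPy sketch (the hyperplane and the test point are made up for illustration) comparing the formula r = g(x)/||w|| against the geometric construction:

```python
import numpy as np

# Hypothetical hyperplane w^T x + w_0 = 0 and a test point x
w = np.array([3.0, 4.0])                # so ||w|| = 5
w_0 = -10.0
x = np.array([4.0, 5.0])

g_x = w @ x + w_0                       # g(x) = w^T x + w_0
r = g_x / np.linalg.norm(w)             # signed distance from x to the hyperplane
d = abs(w_0) / np.linalg.norm(w)        # distance from the origin to the hyperplane

# Geometric check: x = x_p + r * w/||w||  =>  x_p must lie on the hyperplane
x_p = x - r * w / np.linalg.norm(w)
print("g(x_p) ~ 0:", w @ x_p + w_0)     # should print ~0
print("r =", r, " d =", d)              # r = 22/5 = 4.4, d = 10/5 = 2.0
```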


This has the following interpretation

[Figure: the distance r measured from the projection of x onto the hyperplane, with the origin 0 shown]

Now

We know that the straight line that we are looking for looks like

$$w^T x + w_0 = 0 \qquad (5)$$

What about something like this?

$$w^T x + w_0 = \delta \qquad (6)$$

Clearly
This will be above or below the initial line $w^T x + w_0 = 0$.


Come back to the hyperplanes
We then have a specific bias for each border support line!!!

[Figure: the two border support lines and their support vectors]

Then, normalize by δ

The new margin functions
$$w'^T x + w_{10} = 1, \qquad w'^T x + w_{01} = -1$$
where $w' = \frac{w}{\delta}$, $w_{10} = \frac{w'_0}{\delta}$, and $w_{01} = \frac{w''_0}{\delta}$.

Now, we come back to the middle separator hyperplane, but with the normalized term
$$w^T x_i + w_0 \geq w'^T x + w_{10} \text{ for } d_i = +1$$
$$w^T x_i + w_0 \leq w'^T x + w_{01} \text{ for } d_i = -1$$
Where $w_0$ is the bias of that central hyperplane, and $w$ is the normalized direction of $w'$.


Come back to the hyperplanes
The meaning of what I am saying!!!

[Figure: the central separating hyperplane and the two normalized margin hyperplanes]


A little about Support Vectors

They are the vectors
$x_i$ such that $w^T x_i + w_0 = 1$ or $w^T x_i + w_0 = -1$

Properties
They are the vectors nearest to the decision surface and the most difficult to classify.
Because of that, we have the name "Support Vector Machines".


Now, we can summarize the decision rule for the hyperplane

For the support vectors
$$g(x_i) = w^T x_i + w_0 = \mp 1 \quad \text{for } d_i = \mp 1 \qquad (7)$$

This implies
The distance to the support vectors is:
$$r = \frac{g(x_i)}{\|w\|} = \begin{cases} \frac{1}{\|w\|} & \text{if } d_i = +1 \\ -\frac{1}{\|w\|} & \text{if } d_i = -1 \end{cases}$$


Therefore ...

We want the optimum value of the margin of separation as

$$\rho = \frac{1}{\|w\|} + \frac{1}{\|w\|} = \frac{2}{\|w\|} \qquad (8)$$

And the support vectors define the value of ρ.

[Figure: the separating hyperplane, the two margin hyperplanes, and the support vectors lying on them]
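As a tiny illustration of Equation (8), with a made-up weight vector the margin is just twice the reciprocal norm of w:

```python
import numpy as np

w = np.array([3.0, 4.0])          # hypothetical weight vector, ||w|| = 5
rho = 2.0 / np.linalg.norm(w)     # margin of separation, Eq. (8)
print(rho)                        # 0.4
```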



Quadratic Optimization

Then, we have the samples with labels
$$T = \{(x_i, d_i)\}_{i=1}^{N}$$

Then we can put the decision rule as
$$d_i(w^T x_i + w_0) \geq 1, \quad i = 1, \dots, N$$


Then, we have the optimization problem

The optimization problem
$$\min_{w} \Phi(w) = \frac{1}{2} w^T w$$
$$\text{s.t. } d_i(w^T x_i + w_0) \geq 1, \quad i = 1, \dots, N$$

Observations
The cost function $\Phi(w)$ is convex.
The constraints are linear with respect to $w$.
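Since this primal is a standard quadratic program, it can be handed to an off-the-shelf convex solver. The sketch below is only an illustration: the toy data and the use of the cvxpy library are assumptions, not part of the original slides.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data: two points per class, labels d_i in {+1, -1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
w0 = cp.Variable()

# min (1/2) w^T w   s.t.   d_i (w^T x_i + w_0) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(d, X @ w + w0) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, w0.value)   # maximum-margin hyperplane for this toy set
```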



Lagrange Multipliers

The method of Lagrange multipliers
Gives a set of necessary conditions to identify optimal points of equality-constrained optimization problems.

This is done by converting a constrained problem to an equivalent unconstrained problem with the help of certain unspecified parameters known as Lagrange multipliers.


Lagrange Multipliers

The classical problem formulation
$$\min f(x_1, x_2, \dots, x_n)$$
$$\text{s.t. } h_1(x_1, x_2, \dots, x_n) = 0$$

It can be converted into
$$\min L(x_1, x_2, \dots, x_n, \lambda) = \min \{f(x_1, x_2, \dots, x_n) - \lambda h_1(x_1, x_2, \dots, x_n)\} \qquad (9)$$

where
$L(x, \lambda)$ is the Lagrangian function.
$\lambda$ is an unspecified positive or negative constant called the Lagrange Multiplier.


Finding an Optimum using Lagrange Multipliers

New problem
$$\min L(x_1, x_2, \dots, x_n, \lambda) = \min \{f(x_1, x_2, \dots, x_n) - \lambda h_1(x_1, x_2, \dots, x_n)\}$$

We want an optimal $\lambda = \lambda^*$
If the minimum of $L(x_1, x_2, \dots, x_n, \lambda^*)$ occurs at $(x_1, x_2, \dots, x_n)^T = (x_1, x_2, \dots, x_n)^{T*}$, and $(x_1, x_2, \dots, x_n)^{T*}$ satisfies $h_1(x_1, x_2, \dots, x_n) = 0$, then $(x_1, x_2, \dots, x_n)^{T*}$ minimizes:
$$\min f(x_1, x_2, \dots, x_n) \quad \text{s.t. } h_1(x_1, x_2, \dots, x_n) = 0$$

Trick
The trick is to find an appropriate value for the Lagrange multiplier $\lambda$.



Remember

Think about this
Remember the First Law of Newton!!!

Yes!!!
A system in equilibrium does not move.

[Figure: a static body with balanced forces]

Lagrange Multipliers

Definition
Gives a set of necessary conditions to identify optimal points of equality-constrained optimization problems.

Lagrange was a Physicist

He was thinking of the following formula
A system in equilibrium satisfies the following equation:
$$F_1 + F_2 + \dots + F_K = 0 \qquad (10)$$

But functions do not have forces, do they?
Are you sure?

Think about the following
The gradient of a surface.


Gradient to a Surface

After all, a gradient is a measure of the maximal change.
For example, the gradient of a function of three variables:
$$\nabla f(x) = \mathbf{i}\,\frac{\partial f(x)}{\partial x} + \mathbf{j}\,\frac{\partial f(x)}{\partial y} + \mathbf{k}\,\frac{\partial f(x)}{\partial z} \qquad (11)$$
where $\mathbf{i}$, $\mathbf{j}$ and $\mathbf{k}$ are unit vectors in the directions $x$, $y$ and $z$.

Example
We have $f(x, y) = x \exp\{-x^2 - y^2\}$

[Figure: surface plot of $f(x, y)$]

Example

[Figure: the gradient field at the contours of $f(x, y)$ when projected onto the 2D plane]

Now, Think about this

Yes, we can use the gradient
However, we need to do some scaling of the forces by using parameters λ.

Thus, we have
$$F_0 + \lambda_1 F_1 + \dots + \lambda_K F_K = 0 \qquad (12)$$
where $F_0$ is the gradient of the principal cost function and $F_i$, for $i = 1, 2, \dots, K$, are the gradients of the constraints.


Thus
If we have the following optimization:
$$\min f(x) \quad \text{s.t. } g_1(x) = 0, \; g_2(x) = 0$$

Geometric interpretation in the case of minimization
What is wrong? The gradients are going in the other direction; we can fix this by simply multiplying by -1. Here the cost function we want to minimize is $f(x, y) = x \exp\{-x^2 - y^2\}$.

[Figure: contours of $f(\vec{x})$ with the constraint curves $g_1(\vec{x})$ and $g_2(\vec{x})$]

$$-\nabla f(\vec{x}) + \lambda_1 \nabla g_1(\vec{x}) + \lambda_2 \nabla g_2(\vec{x}) = 0$$

Nevertheless, it is equivalent to $\nabla f(\vec{x}) - \lambda_1 \nabla g_1(\vec{x}) - \lambda_2 \nabla g_2(\vec{x}) = 0$.



Method

Steps
1 The original problem is rewritten as: minimize $L(x, \lambda) = f(x) - \lambda h_1(x)$.
2 Take the derivatives of $L(x, \lambda)$ with respect to $x_i$ and set them equal to zero.
3 Express all $x_i$ in terms of the Lagrange multiplier $\lambda$.
4 Plug $x$, expressed in terms of $\lambda$, into the constraint $h_1(x) = 0$ and solve for $\lambda$.
5 Calculate $x$ by using the value just found for $\lambda$.

From step 2
If there are $n$ variables (i.e., $x_1, \dots, x_n$) then you will get $n$ equations with $n + 1$ unknowns (i.e., $n$ variables $x_i$ and one Lagrange multiplier $\lambda$).


Example

We can apply this to the following problem (see the worked solution below)
$$\min f(x, y) = x^2 - 8x + y^2 - 12y + 48 \quad \text{s.t. } x + y = 8$$
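Following the five steps above, a short worked solution (not shown on the original slide) runs as follows:

$$L(x, y, \lambda) = x^2 - 8x + y^2 - 12y + 48 - \lambda(x + y - 8)$$
$$\frac{\partial L}{\partial x} = 2x - 8 - \lambda = 0 \;\Rightarrow\; x = \frac{8 + \lambda}{2}, \qquad \frac{\partial L}{\partial y} = 2y - 12 - \lambda = 0 \;\Rightarrow\; y = \frac{12 + \lambda}{2}$$
$$x + y = 8 \;\Rightarrow\; \frac{8 + \lambda}{2} + \frac{12 + \lambda}{2} = 8 \;\Rightarrow\; \lambda = -2, \quad x = 3, \quad y = 5, \quad f(3, 5) = -2.$$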

Then, Rewriting The Optimization Problem

The optimization with equality constraints
$$\min_{w} \Phi(w) = \frac{1}{2} w^T w$$
$$\text{s.t. } d_i(w^T x_i + w_0) - 1 = 0, \quad i = 1, \dots, N$$

Then, for our problem

Using the Lagrange Multipliers (we will call them $\alpha_i$)
We obtain the following cost function
$$J(w, w_0, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i \left[d_i(w^T x_i + w_0) - 1\right]$$

Observation
Minimize with respect to $w$ and $w_0$.
Maximize with respect to $\alpha$ because it dominates
$$-\sum_{i=1}^{N} \alpha_i \left[d_i(w^T x_i + w_0) - 1\right]. \qquad (13)$$



Karush-Kuhn-Tucker Conditions

First, An Inequality Constrained Problem P
$$\min f(x) \quad \text{s.t. } g_1(x) = 0, \; \dots, \; g_N(x) = 0$$

A really minimal version!!! Hey, it is a patchwork!!!
A point x is a local minimum of an equality constrained problem P only if a set of non-negative $\alpha_j$'s may be found such that:
$$\nabla L(x, \alpha) = \nabla f(x) - \sum_{i=1}^{N} \alpha_i \nabla g_i(x) = 0$$


Karush-Kuhn-Tucker Conditions

Important
Think about this: each constraint corresponds to a sample in one of the two classes, thus
the corresponding $\alpha_i$'s are going to be zero after optimization if a constraint is not active, i.e. $d_i(w^T x_i + w_0) - 1 \neq 0$ (remember the maximization).

Again the Support Vectors
This actually defines the idea of support vectors!!!

Thus
Only the $\alpha_i$'s with active constraints (support vectors) will be different from zero, i.e. those with $d_i(w^T x_i + w_0) - 1 = 0$.


A small deviation from the SVMs for the sake of Vox Populi

Theorem (Karush-Kuhn-Tucker Necessary Conditions)
Let X be a non-empty open set in $\mathbb{R}^n$, and let $f : \mathbb{R}^n \rightarrow \mathbb{R}$ and $g_i : \mathbb{R}^n \rightarrow \mathbb{R}$ for $i = 1, \dots, m$. Consider the problem P to minimize $f(x)$ subject to $x \in X$ and $g_i(x) \leq 0$, $i = 1, \dots, m$. Let x be a feasible solution, and denote $I = \{i \mid g_i(x) = 0\}$. Suppose that $f$ and $g_i$ for $i \in I$ are differentiable at x and that the $g_i$ for $i \notin I$ are continuous at x. Furthermore, suppose that the $\nabla g_i(x)$ for $i \in I$ are linearly independent. If x solves problem P locally, there exist scalars $u_i$ for $i \in I$ such that
$$\nabla f(x) + \sum_{i \in I} u_i \nabla g_i(x) = 0, \qquad u_i \geq 0 \text{ for } i \in I$$

There is more...

In addition to the above assumptions
If $g_i$ for each $i \notin I$ is also differentiable at x, the previous conditions can be written in the following equivalent form:
$$\nabla f(x) + \sum_{i=1}^{m} u_i \nabla g_i(x) = 0$$
$$u_i g_i(x) = 0 \quad \text{for } i = 1, \dots, m$$
$$u_i \geq 0 \quad \text{for } i = 1, \dots, m$$

The necessary conditions for optimality

We use the previous theorem
$$\nabla\left(\frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i \left[d_i(w^T x_i + w_0) - 1\right]\right) \qquad (14)$$

Condition 1
$$\frac{\partial J(w, w_0, \alpha)}{\partial w} = 0$$

Condition 2
$$\frac{\partial J(w, w_0, \alpha)}{\partial w_0} = 0$$


Using the conditions

We have the first condition
$$\frac{\partial J(w, w_0, \alpha)}{\partial w} = \frac{\partial \frac{1}{2} w^T w}{\partial w} - \frac{\partial \sum_{i=1}^{N} \alpha_i \left[d_i(w^T x_i + w_0) - 1\right]}{\partial w} = 0$$
$$\frac{\partial J(w, w_0, \alpha)}{\partial w} = \frac{1}{2}(w + w) - \sum_{i=1}^{N} \alpha_i d_i x_i$$

Thus
$$w = \sum_{i=1}^{N} \alpha_i d_i x_i \qquad (15)$$


In a similar way ...

We have, by the second optimality condition,
$$\sum_{i=1}^{N} \alpha_i d_i = 0$$

Note
$$\alpha_i \left[d_i(w^T x_i + w_0) - 1\right] = 0$$
because the constraint term vanishes in the optimal solution, i.e. $\alpha_i = 0$ or $d_i(w^T x_i + w_0) - 1 = 0$.


Thus

We need something extra
Our classic trick of transforming a problem into another problem.

In this case
We use the Primal-Dual Problem for the Lagrangian.

Where
We move from a minimization to a maximization!!!



Lagrangian Dual Problem

Consider the following nonlinear programming problem
Primal Problem P
$$\min f(x)$$
$$\text{s.t. } g_i(x) \leq 0 \text{ for } i = 1, \dots, m$$
$$h_i(x) = 0 \text{ for } i = 1, \dots, l$$
$$x \in X$$

Lagrange Dual Problem D
$$\max \Theta(u, v) \quad \text{s.t. } u \geq 0$$
where $\Theta(u, v) = \inf_x \left\{ f(x) + \sum_{i=1}^{m} u_i g_i(x) + \sum_{i=1}^{l} v_i h_i(x) \mid x \in X \right\}$


What does this mean?

Assume that the equality constraints do not exist
We then have
$$\min f(x) \quad \text{s.t. } g_i(x) \leq 0 \text{ for } i = 1, \dots, m, \quad x \in X$$

Now assume that we finish with only one constraint
We then have
$$\min f(x) \quad \text{s.t. } g(x) \leq 0, \quad x \in X$$


What does this mean?

First, we have the following figure

[Figure: the set G in the (y, z) plane, with points A and B and two supporting lines of different slopes]

What does this mean?

Thus, in the y-z plane you have
$$G = \{(y, z) \mid y = g(x), \, z = f(x) \text{ for some } x \in X\} \qquad (16)$$

Thus
Given $u \geq 0$, we need to minimize $f(x) + u g(x)$ to find $\theta(u)$, which is equivalent to solving $\nabla f(x) + u \nabla g(x) = 0$.


What does this mean?

Thus, in the y-z plane, we have
$$z + uy = \alpha \qquad (17)$$
a line with slope $-u$.

Then, to minimize $z + uy = \alpha$
We need to move the line $z + uy = \alpha$ parallel to itself as far down as possible, along its negative gradient, while remaining in contact with G.


In other words

Move the line parallel to itself until it supports G

[Figure: the set G and the supporting line with slope -u]

Note: The set G lies above the line and touches it.

Thus

Then, the problem is to find the slope of the supporting hyperplane for G.

The intersection with the z-axis
Gives $\theta(u)$.


Again

We can see θ(u)

[Figure: the set G, the supporting line, and its intercept θ(u) on the z-axis]

Thus

The dual problem is equivalent to
Finding the slope of the supporting hyperplane such that its intercept on the z-axis is maximal.

Or

Such a hyperplane has slope $-u$ and supports G at $(y, z)$

[Figure: the supporting hyperplane of G with slope -u touching G at (y, z)]

Remark: The optimal solution is u and the optimal dual objective is z.


For more on this, please!!!

Look at this book
"Nonlinear Programming: Theory and Algorithms" by Mokhtar S. Bazaraa and C. M. Shetty, Wiley, New York (2006), at page 260.

Example (Lagrange Dual)

Primal
$$\min x_1^2 + x_2^2$$
$$\text{s.t. } -x_1 - x_2 + 4 \leq 0, \quad x_1, x_2 \geq 0$$

Lagrange Dual
$$\Theta(u) = \inf\{x_1^2 + x_2^2 + u(-x_1 - x_2 + 4) \mid x_1, x_2 \geq 0\}$$


Solution

Differentiate with respect to $x_1$ and $x_2$
We have two cases to take into account: $u \geq 0$ and $u < 0$.

The first case is clear
What about when $u < 0$?

We have that
$$\theta(u) = \begin{cases} -\frac{1}{2}u^2 + 4u & \text{if } u \geq 0 \\ 4u & \text{if } u < 0 \end{cases} \qquad (18)$$
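A short verification of Equation (18), filling in the algebra that the slide leaves implicit: for $u \geq 0$, setting the partial derivatives of $x_1^2 + x_2^2 + u(-x_1 - x_2 + 4)$ to zero gives $x_1 = x_2 = \frac{u}{2} \geq 0$, so

$$\theta(u) = 2\left(\frac{u}{2}\right)^2 + u\left(-\frac{u}{2} - \frac{u}{2} + 4\right) = \frac{u^2}{2} - u^2 + 4u = -\frac{1}{2}u^2 + 4u.$$

For $u < 0$, each term $x_i^2 - u x_i$ is non-negative on $x_i \geq 0$ and is minimized at $x_i = 0$, so $\theta(u) = 4u$.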



Duality Theorem

First Property
If the primal has an optimal solution, so does the dual.

Thus
For $w^*$ and $\alpha^*$ to be optimal solutions of the primal and dual problems respectively, it is necessary and sufficient that $w^*$:
is feasible for the primal problem, and
$$\Phi(w^*) = J(w^*, w_0^*, \alpha^*) = \min_{w} J(w, w_0^*, \alpha^*)$$


Reformulate our Equations

We then have
$$J(w, w_0, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i d_i w^T x_i - w_0 \sum_{i=1}^{N} \alpha_i d_i + \sum_{i=1}^{N} \alpha_i$$

Now, using our 2nd optimality condition,
$$J(w, w_0, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i d_i w^T x_i + \sum_{i=1}^{N} \alpha_i$$


We finally have, for the 1st optimality condition:

First
$$w^T w = \sum_{i=1}^{N} \alpha_i d_i w^T x_i = \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j x_j^T x_i$$

Second, setting $J(w, w_0, \alpha) = Q(\alpha)$
$$Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j x_j^T x_i$$


From here, we have the problem

This is the problem that we really solve
Given the training sample $\{(x_i, d_i)\}_{i=1}^{N}$, find the Lagrange multipliers $\{\alpha_i\}_{i=1}^{N}$ that maximize the objective function
$$Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j x_j^T x_i$$
subject to the constraints
$$\sum_{i=1}^{N} \alpha_i d_i = 0 \qquad (19)$$
$$\alpha_i \geq 0 \quad \text{for } i = 1, \dots, N \qquad (20)$$

Note
In the primal, we were trying to minimize the cost function; for this it is necessary to maximize with respect to α. That is the reason why we are maximizing Q(α).


Solving for α

We can compute $w^*$ once we get the optimal $\alpha_i^*$ by using (Eq. 15)
$$w^* = \sum_{i=1}^{N} \alpha_i^* d_i x_i$$

In addition, we can compute the optimal bias $w_0^*$ using the optimal weight $w^*$
For this, we use the positive margin equation:
$$g(x^{(s)}) = w^T x^{(s)} + w_0 = 1$$
corresponding to a positive support vector.

Then
$$w_0 = 1 - (w^*)^T x^{(s)} \quad \text{for } d^{(s)} = 1 \qquad (21)$$

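Putting the pieces together, here is a hedged end-to-end sketch; the toy data, the use of SciPy's general-purpose SLSQP solver, and all variable names are illustrative assumptions, not part of the original slides. It maximizes Q(α) under constraints (19)-(20) by minimizing -Q(α), then recovers w* from Eq. (15) and w0 from Eq. (21):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable training set {(x_i, d_i)}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
N = len(d)

# H_ij = d_i d_j x_i^T x_j, so Q(a) = sum(a) - 0.5 a^T H a
H = (d[:, None] * X) @ (d[:, None] * X).T

neg_Q = lambda a: 0.5 * a @ H @ a - np.sum(a)          # minimize -Q(alpha)
constraints = {"type": "eq", "fun": lambda a: a @ d}    # Eq. (19): sum_i alpha_i d_i = 0
bounds = [(0.0, None)] * N                              # Eq. (20): alpha_i >= 0

res = minimize(neg_Q, np.zeros(N), bounds=bounds, constraints=constraints)
alpha = res.x

w = (alpha * d) @ X                                     # Eq. (15): w = sum_i alpha_i d_i x_i
s = np.argmax(alpha * (d > 0))                          # index of a positive support vector
w0 = 1.0 - w @ X[s]                                     # Eq. (21)

print("alpha =", np.round(alpha, 4))                    # only support vectors get alpha_i > 0
print("w =", w, " w0 =", w0, " margin =", 2 / np.linalg.norm(w))
```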


What do we need?

Until now, we have only a maximal margin algorithm
All this works fine when the classes are separable.
Problem: What happens when they are not separable?
What can we do?



Map to a higher Dimensional Space

Assume that there exists a mapping
$$x \in \mathbb{R}^l \rightarrow y \in \mathbb{R}^k$$

Then, it is possible to define the following mapping.


Define a map to a higher Dimension

Nonlinear transformations
Given a series of nonlinear transformations
$$\{\phi_i(x)\}_{i=1}^{m}$$
from the input space to the feature space.

We can define the decision surface as
$$\sum_{i=1}^{m} w_i \phi_i(x) + w_0 = 0.$$


This allows us to define

The following vector
$$\phi(x) = (\phi_0(x), \phi_1(x), \dots, \phi_m(x))^T$$
that represents the mapping.

From this mapping
We can define the following kernel function
$$K : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$$
$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$



Example

Assume
$$x \in \mathbb{R}^2 \rightarrow y = \begin{pmatrix} x_1^2 \\ \sqrt{2}\,x_1 x_2 \\ x_2^2 \end{pmatrix}$$

We can show that
$$y_i^T y_j = \left(x_i^T x_j\right)^2$$
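A quick numeric check of this identity (the two sample points are made up for illustration):

```python
import numpy as np

def phi(x):
    # Explicit feature map x = (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])

print(phi(xi) @ phi(xj))    # explicit inner product in feature space: 1.0
print((xi @ xj) ** 2)       # kernel trick (x_i^T x_j)^2: also 1.0
```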


Example of Kernels

Polynomials
$$k(x, z) = (x^T z + 1)^q, \quad q > 0$$

Radial Basis Functions
$$k(x, z) = \exp\left(-\frac{\|x - z\|^2}{\sigma^2}\right)$$

Hyperbolic Tangents
$$k(x, z) = \tanh(\beta x^T z + \gamma)$$
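For concreteness, the three kernels above translate directly into code; this is a minimal sketch with hypothetical parameter values (q, sigma, beta, gamma are not specified on the slide):

```python
import numpy as np

def polynomial_kernel(x, z, q=2):
    # k(x, z) = (x^T z + 1)^q, q > 0
    return (x @ z + 1.0) ** q

def rbf_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / sigma^2)
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def tanh_kernel(x, z, beta=1.0, gamma=0.0):
    # k(x, z) = tanh(beta x^T z + gamma)
    return np.tanh(beta * (x @ z) + gamma)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, z), rbf_kernel(x, z), tanh_kernel(x, z))
```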



Now, How to select a Kernel?

We have a problem
Selecting a specific kernel and its parameters is usually done in a trial-and-error manner.

Thus
In general, the Radial Basis Function kernel is a reasonable first choice.

Then
If this fails, we can try the other possible kernels.


Thus, we have something like this

Step 1
Normalize the data.

Step 2
Use cross-validation to adjust the parameters of the selected kernel.

Step 3
Train against the entire dataset.

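As one way to put these three steps into practice, here is a hedged scikit-learn sketch; the library, the toy dataset loader, and the parameter grid are illustrative assumptions, not part of the slides. It normalizes the data, cross-validates the RBF kernel parameters, and then fits the best model on the entire dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Step 1: normalize; Step 2: cross-validate C and the RBF width gamma
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(pipeline, param_grid, cv=5)

# Step 3: GridSearchCV refits the best model on the entire dataset by default
search.fit(X, y)
print(search.best_params_, search.best_score_)
```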