
Machine Learning for Data Mining: Introduction to Support Vector Machines

Andres Mendez-Vazquez

June 22, 2016


Outline
1 History
   The Beginning
2 Separable Classes
   Separable Classes
   Hyperplanes
3 Support Vectors
   Support Vectors
   Quadratic Optimization
   Lagrange Multipliers
   Method
   Karush-Kuhn-Tucker Conditions
   Primal-Dual Problem for Lagrangian
   Properties
4 Kernel
   Kernel Idea
   Higher Dimensional Space
   Examples
   Now, How to select a Kernel?
5 Soft Margins
   Introduction
   The Soft Margin Solution
6 More About Kernels
   Basic Idea
   From Inner products to Kernels


History

Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963
At the Institute of Control Sciences, Moscow
In the paper "Estimation of dependencies based on empirical data"

Corinna Cortes and Vladimir Vapnik in 1995
They invented the current incarnation, soft margins, at the AT&T Labs.

BTW, Corinna Cortes
Danish computer scientist known for her contributions to the field of machine learning.
She is currently the Head of Google Research, New York.
Cortes is a recipient of the ACM Paris Kanellakis Theory and Practice Award for her work on the theoretical foundations of support vector machines.


In addition

Alexey Yakovlevich Chervonenkis
He was a Soviet and Russian mathematician and, with Vladimir Vapnik, one of the main developers of the Vapnik–Chervonenkis theory, also known as the "fundamental theory of learning", an important part of computational learning theory.

He died on 22 September 2014
At Losiny Ostrov National Park.


Applications
A partial list

1 Predictive Control
   Control of chaotic systems.
2 Inverse Geosounding Problem
   Used to understand the internal structure of our planet.
3 Environmental Sciences
   Spatio-temporal environmental data analysis and modeling.
4 Protein Fold and Remote Homology Detection
   Recognizing whether two different species contain similar genes.
5 Facial expression classification
6 Texture classification
7 E-Learning
8 Handwritten recognition
9 And counting...



Separable Classes

Given
x_i, i = 1, ..., N
a set of samples belonging to two classes ω_1, ω_2.

Objective
We want to obtain a decision function as simple as
g(x) = w^T x + w_0


Such that we can do the following

Figure: a linear separation function g(x) = w^T x + w_0 separating the two classes.


In other words ...

We have the following samples
x_1, ..., x_m ∈ C_1
x_1, ..., x_n ∈ C_2

We want the following decision surfaces
w^T x_i + w_0 ≥ 0 for d_i = +1 if x_i ∈ C_1
w^T x_j + w_0 ≤ 0 for d_j = −1 if x_j ∈ C_2


What do we want?

Our goal is to search for a direction w that gives the maximum possible margin.

Figure: two candidate separating directions (direction 1 and direction 2) and their margins.

Remember

We have the following.

Figure: a point x, its projection onto the hyperplane, the distance r between them, and the distance d from the origin 0 to the hyperplane.

A Little of Geometry

Thus

Figure: a right triangle with legs A and B and hypotenuse C, with the distances d and r marked.

Then
d = |w_0| / √(w_1^2 + w_2^2),   r = |g(x)| / √(w_1^2 + w_2^2)    (1)


First, d = |w_0| / √(w_1^2 + w_2^2)

We can use the following rule in a triangle with a 90° angle:
Area = (1/2) C d    (2)

In addition, the area can also be calculated as
Area = (1/2) A B    (3)

Thus
d = AB / C

Remark: Can you get the rest of the values?

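One way to answer the remark (a sketch that assumes A and B are the intercept lengths cut by the line w_1 x_1 + w_2 x_2 + w_0 = 0 on the two axes, and C the segment of the line between them):

```latex
A = \frac{|w_0|}{|w_1|},\qquad B = \frac{|w_0|}{|w_2|},\qquad
C = \sqrt{A^2 + B^2} = \frac{|w_0|\sqrt{w_1^2 + w_2^2}}{|w_1|\,|w_2|},
\qquad\text{so}\qquad
d = \frac{AB}{C} = \frac{|w_0|}{\sqrt{w_1^2 + w_2^2}}.
```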

What about r = |g(x)| / √(w_1^2 + w_2^2)?

First, remember
g(x_p) = 0 and x = x_p + r (w / ‖w‖)    (4)

Thus, we have
g(x) = w^T [x_p + r (w / ‖w‖)] + w_0
     = w^T x_p + w_0 + r (w^T w / ‖w‖)
     = w^T x_p + w_0 + r (‖w‖^2 / ‖w‖)
     = g(x_p) + r ‖w‖

Then
r = g(x) / ‖w‖


This has the following interpretation

Figure: r is the distance from the point x to its projection x_p on the hyperplane, and 0 marks the origin.
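A tiny numeric sketch of these quantities (the hyperplane and point are made-up values): it evaluates g(x), classifies by its sign, and computes r = g(x)/‖w‖ and d = |w_0|/‖w‖.

```python
# Sketch with made-up numbers: evaluate g(x) = w^T x + w_0, classify by its
# sign, and compute the distances r = g(x)/||w|| and d = |w_0|/||w||.
import numpy as np

w, w0 = np.array([3.0, 4.0]), -5.0      # hypothetical hyperplane 3x_1 + 4x_2 - 5 = 0
x = np.array([2.0, 1.0])

g = w @ x + w0                          # g(x) = 5.0
label = 1 if g >= 0 else -1             # decision rule: sign of g(x)
r = g / np.linalg.norm(w)               # distance from x to the hyperplane: 1.0
d = abs(w0) / np.linalg.norm(w)         # distance from the origin: 1.0
print(label, r, d)
```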

Now

We know that the straight line that we are looking for looks like
w^T x + w_0 = 0    (5)

What about something like this?
w^T x + w_0 = δ    (6)

Clearly
This will be above or below the initial line w^T x + w_0 = 0.


Come back to the hyperplanes

We have then, for each border support line, a specific bias!

Figure: the two border support lines and their support vectors.

Then, normalize by δ

The new margin functions
w'^T x + w_10 = 1
w'^T x + w_01 = −1
where w' = w/δ, w_10 = w'_0/δ, and w_01 = w''_0/δ.

Now, we come back to the middle separator hyperplane, but with the normalized terms
w^T x_i + w_0 ≥ w'^T x + w_10 for d_i = +1
w^T x_i + w_0 ≤ w'^T x + w_01 for d_i = −1
where w_0 is the bias of that central hyperplane, and w is the normalized direction of w'.


Come back to the hyperplanes

The meaning of what I am saying (figure: the central hyperplane between the two normalized margin hyperplanes).


A little about Support Vectors

They are the vectors (here, we assume the normalized w)
x_i such that w^T x_i + w_0 = 1 or w^T x_i + w_0 = −1

Properties
They are the vectors nearest to the decision surface and the most difficult to classify.
Because of that, we have the name "Support Vector Machines".


Now, we can summarize the decision rule for the hyperplane

For the support vectors
g(x_i) = w^T x_i + w_0 = ±1 for d_i = ±1    (7)

This implies
The distance to the support vectors is:
r = g(x_i) / ‖w‖ = { 1/‖w‖ if d_i = +1;  −1/‖w‖ if d_i = −1 }


Therefore ...

We want the optimum value of the margin of separation,
ρ = 1/‖w‖ + 1/‖w‖ = 2/‖w‖    (8)
and the support vectors define the value of ρ.

Figure: the margin ρ between the two support lines, with the support vectors on them.



Quadratic Optimization

Then, we have the samples with labels
T = {(x_i, d_i)}_{i=1}^N

Then we can put the decision rule as
d_i(w^T x_i + w_0) ≥ 1, i = 1, ..., N


Then, we have the optimization problem

The optimization problem
min_w Φ(w) = (1/2) w^T w
s.t. d_i(w^T x_i + w_0) ≥ 1, i = 1, ..., N

Observations
The cost function Φ(w) is convex.
The constraints are linear with respect to w.

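Since the cost is quadratic and the constraints are linear, this hard-margin primal can be handed directly to a constrained optimizer. A minimal sketch, assuming a toy separable dataset and scipy (neither comes from the slides):

```python
# Sketch: solve  min_w (1/2) w^T w  s.t.  d_i (w^T x_i + w_0) >= 1
# with scipy's SLSQP on made-up, linearly separable 2-D data.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])

def objective(z):                       # z = (w_1, w_2, w_0)
    w = z[:2]
    return 0.5 * w @ w

cons = {'type': 'ineq',                 # d_i (w^T x_i + w_0) - 1 >= 0
        'fun': lambda z: d * (X @ z[:2] + z[2]) - 1.0}

res = minimize(objective, np.zeros(3), method='SLSQP', constraints=[cons])
w, w0 = res.x[:2], res.x[2]
print("w:", w, "w0:", w0, "margins:", d * (X @ w + w0))
```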


Lagrange Multipliers

The method of Lagrange multipliers
Gives a set of necessary conditions to identify optimal points of equality constrained optimization problems.

This is done by converting a constrained problem to an equivalent unconstrained problem with the help of certain unspecified parameters known as Lagrange multipliers.


Lagrange Multipliers

The classical problem formulation
min f(x_1, x_2, ..., x_n)
s.t. h_1(x_1, x_2, ..., x_n) = 0

It can be converted into
min L(x_1, x_2, ..., x_n, λ) = min {f(x_1, x_2, ..., x_n) − λ h_1(x_1, x_2, ..., x_n)}    (9)

where
L(x, λ) is the Lagrangian function.
λ is an unspecified positive or negative constant called the Lagrange multiplier.


Finding an Optimum using Lagrange Multipliers

New problem
min L(x_1, x_2, ..., x_n, λ) = min {f(x_1, x_2, ..., x_n) − λ h_1(x_1, x_2, ..., x_n)}

We want an optimal λ = λ*
If the minimum of L(x_1, x_2, ..., x_n, λ*) occurs at (x_1, x_2, ..., x_n)^T = (x_1, x_2, ..., x_n)^T*, and (x_1, x_2, ..., x_n)^T* satisfies h_1(x_1, x_2, ..., x_n) = 0, then (x_1, x_2, ..., x_n)^T* minimizes
min f(x_1, x_2, ..., x_n)
s.t. h_1(x_1, x_2, ..., x_n) = 0

Trick
The trick is to find an appropriate value for the Lagrange multiplier λ.



Remember

Think about this: remember the First Law of Newton!

Yes! A system in equilibrium does not move.

Figure: a static body with balanced forces.


Lagrange was a Physicist

He was thinking of the following formula
A system in equilibrium satisfies the equation
F_1 + F_2 + ... + F_K = 0    (10)

But functions do not have forces?
Are you sure?

Think about the following
The gradient of a surface.


Gradient to a Surface

After all, a gradient is a measure of the maximal change. For example, the gradient of a function of three variables:
∇f(x) = i ∂f(x)/∂x + j ∂f(x)/∂y + k ∂f(x)/∂z    (11)
where i, j and k are unit vectors in the directions x, y and z.

Example

We have f(x, y) = x exp{−x^2 − y^2} (surface plot).

Example

With the gradient at the contours when projecting onto the 2D plane (contour plot with gradient arrows).
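For this example, the gradient defined in (Eq. 11) can be written out explicitly (a quick computation added here, not on the slide):

```latex
\nabla f(x, y) = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right)
              = \left( (1 - 2x^2)\, e^{-x^2 - y^2},\; -2xy\, e^{-x^2 - y^2} \right)
```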

Now, Think about this

Yes, we can use the gradient
However, we need to do some scaling of the forces by using parameters λ.

Thus, we have
F_0 + λ_1 F_1 + ... + λ_K F_K = 0    (12)
where F_0 is the gradient of the principal cost function and the F_i, for i = 1, 2, ..., K, are the gradients of the constraints.


Thus

If we have the following optimization:
min f(x)
s.t. g_1(x) = 0
     g_2(x) = 0

Geometric interpretation in the case of minimization

What is wrong? The gradients are going in the other direction; we can fix this by simply multiplying by −1. Here the cost function is f(x, y) = x exp{−x^2 − y^2}, which we want to minimize.

Figure: the level curves of f(x) and the constraint curves g_1(x) and g_2(x), with
−∇f(x) + λ_1 ∇g_1(x) + λ_2 ∇g_2(x) = 0
Nevertheless, it is equivalent to ∇f(x) − λ_1 ∇g_1(x) − λ_2 ∇g_2(x) = 0.


Method

Steps
1 The original problem is rewritten as: minimize L(x, λ) = f(x) − λ h_1(x).
2 Take derivatives of L(x, λ) with respect to x_i and set them equal to zero.
3 Express all x_i in terms of the Lagrange multiplier λ.
4 Plug x, in terms of λ, into the constraint h_1(x) = 0 and solve for λ.
5 Calculate x by using the value just found for λ.

From step 2
If there are n variables (i.e., x_1, ..., x_n) then you will get n equations with n + 1 unknowns (i.e., n variables x_i and one Lagrange multiplier λ).


Example

We can apply that to the following problem:
min f(x, y) = x^2 − 8x + y^2 − 12y + 48
s.t. x + y = 8
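Applying the five steps to this example (a worked solution added for completeness; it is not spelled out on the slide):

```latex
\begin{aligned}
L(x, y, \lambda) &= x^2 - 8x + y^2 - 12y + 48 - \lambda (x + y - 8)\\
\frac{\partial L}{\partial x} &= 2x - 8 - \lambda = 0 \Rightarrow x = \frac{8 + \lambda}{2},
\qquad
\frac{\partial L}{\partial y} = 2y - 12 - \lambda = 0 \Rightarrow y = \frac{12 + \lambda}{2}\\
x + y &= 8 \Rightarrow \frac{20 + 2\lambda}{2} = 8 \Rightarrow \lambda = -2
\Rightarrow x = 3,\; y = 5,\; f(3, 5) = -2.
\end{aligned}
```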

Then, Rewriting the Optimization Problem

The optimization with inequality constraints
min_w Φ(w) = (1/2) w^T w
s.t. d_i(w^T x_i + w_0) ≥ 1, i = 1, ..., N

Then, for our problem

Using the Lagrange multipliers (we will call them α_i)
We obtain the following cost function that we want to minimize:
J(w, w_0, α) = (1/2) w^T w − ∑_{i=1}^N α_i [d_i(w^T x_i + w_0) − 1]

Observation
Minimize with respect to w and w_0.
Maximize with respect to α because it dominates
−∑_{i=1}^N α_i [d_i(w^T x_i + w_0) − 1].    (13)


Saddle Point?

At the left the original problem, at the right the Lagrangian!

Figure: the cost surface f(x) with the constraints g_1(x) and g_2(x), next to the corresponding Lagrangian surface.


Karush-Kuhn-Tucker Conditions

First, an inequality constrained problem P
min f(x)
s.t. g_1(x) ≤ 0
     ...
     g_N(x) ≤ 0

A really minimal version (a patchwork!)
A point x is a local minimum of the constrained problem P only if a set of non-negative α_j's may be found such that:
∇L(x, α) = ∇f(x) − ∑_{i=1}^N α_i ∇g_i(x) = 0


Karush-Kuhn-Tucker Conditions

Important
Think about this: each constraint corresponds to a sample in one of the two classes, thus
The corresponding α_i's are going to be zero after optimization if a constraint is not active, i.e. d_i(w^T x_i + w_0) − 1 ≠ 0 (remember the maximization).

Again the Support Vectors
This actually defines the idea of support vectors!

Thus
Only the α_i's with active constraints (support vectors) will be different from zero, when d_i(w^T x_i + w_0) − 1 = 0.


A small deviation from the SVMs for the sake of Vox Populi

Theorem (Karush-Kuhn-Tucker Necessary Conditions)
Let X be a non-empty open set in R^n, and let f: R^n → R and g_i: R^n → R for i = 1, ..., m. Consider the problem P to minimize f(x) subject to x ∈ X and g_i(x) ≤ 0, i = 1, ..., m. Let x be a feasible solution, and denote I = {i | g_i(x) = 0}. Suppose that f and the g_i for i ∈ I are differentiable at x and that the g_i, i ∉ I, are continuous at x. Furthermore, suppose that the ∇g_i(x) for i ∈ I are linearly independent. If x solves problem P locally, there exist scalars u_i for i ∈ I such that
∇f(x) + ∑_{i∈I} u_i ∇g_i(x) = 0
u_i ≥ 0 for i ∈ I

It is more...

In addition to the above assumptions
If g_i for each i ∉ I is also differentiable at x, the previous conditions can be written in the following equivalent form:
∇f(x) + ∑_{i=1}^m u_i ∇g_i(x) = 0
u_i g_i(x) = 0 for i = 1, ..., m
u_i ≥ 0 for i = 1, ..., m

The necessary conditions for optimality

We use the previous theorem on
∇( (1/2) w^T w − ∑_{i=1}^N α_i [d_i(w^T x_i + w_0) − 1] )    (14)

Condition 1
∂J(w, w_0, α)/∂w = 0

Condition 2
∂J(w, w_0, α)/∂w_0 = 0


Using the conditions

We have the first condition:
∂J(w, w_0, α)/∂w = ∂[(1/2) w^T w]/∂w − ∂[∑_{i=1}^N α_i (d_i(w^T x_i + w_0) − 1)]/∂w = 0
∂J(w, w_0, α)/∂w = (1/2)(w + w) − ∑_{i=1}^N α_i d_i x_i

Thus
w = ∑_{i=1}^N α_i d_i x_i    (15)


In a similar way ...

We have, by the second optimality condition,
∑_{i=1}^N α_i d_i = 0

Note
α_i [d_i(w^T x_i + w_0) − 1] = 0
because the constraint term vanishes at the optimal solution, i.e. α_i = 0 or d_i(w^T x_i + w_0) − 1 = 0.


Thus

We need something extra
Our classic trick of transforming a problem into another problem.

In this case
We use the Primal-Dual Problem for the Lagrangian.

Where
We move from a minimization to a maximization!


Lagrangian Dual Problem

Consider the following nonlinear programming problem.

Primal Problem P
min f(x)
s.t. g_i(x) ≤ 0 for i = 1, ..., m
     h_i(x) = 0 for i = 1, ..., l
     x ∈ X

Lagrange Dual Problem D
max Θ(u, v)
s.t. u ≥ 0
where Θ(u, v) = inf_x { f(x) + ∑_{i=1}^m u_i g_i(x) + ∑_{i=1}^l v_i h_i(x) | x ∈ X }


What does this mean?

Assume that the equality constraints do not exist. We have then
min f(x)
s.t. g_i(x) ≤ 0 for i = 1, ..., m
     x ∈ X

Now assume that we finish with only one constraint. We have then
min f(x)
s.t. g(x) ≤ 0
     x ∈ X


What does this mean?

First, we have the following figure.

Figure: the set G, the image of X in the (y, z) plane, together with a supporting line of slope −u and two marked points A and B.

What does this mean?

Thus, in the y-z plane you have
G = {(y, z) | y = g(x), z = f(x) for some x ∈ X}    (16)

Thus
Given u ≥ 0, we need to minimize f(x) + u g(x) to find θ(u), which is equivalent to ∇f(x) + u ∇g(x) = 0.


What does this mean?

Thus, in the y-z plane, we have
z + u y = α    (17)
a line with slope −u.

Then, to minimize z + u y = α
We need to move the line z + u y = α parallel to itself as far down as possible, along its negative gradient, while remaining in contact with G.


In other words

Move the line parallel to itself until it supports G.

Figure: the line of slope −u moved down until it supports the set G.

Note: the set G lies above the line and touches it.

Thus

The problem is to find the slope of the supporting hyperplane for G.

Then the intersection with the z-axis gives θ(u).


Again

We can see θ(u) in the figure.

Figure: the supporting line for G and its intercept θ(u) on the z-axis.

Thus

The dual problem is equivalent to finding the slope of the supporting hyperplane such that its intercept on the z-axis is maximal.

Or

Such a hyperplane has slope −u and supports G at (y, z).

Figure: the optimal supporting line for G.

Remark: the optimal solution is u and the optimal dual objective is z.


For more on this, please

Look at this book: "Nonlinear Programming: Theory and Algorithms" by Mokhtar S. Bazaraa and C. M. Shetty, Wiley, New York (2006), page 260.

Example (Lagrange Dual)

Primal
min x_1^2 + x_2^2
s.t. −x_1 − x_2 + 4 ≤ 0
     x_1, x_2 ≥ 0

Lagrange Dual
Θ(u) = inf { x_1^2 + x_2^2 + u(−x_1 − x_2 + 4) | x_1, x_2 ≥ 0 }


Solution

Differentiate with respect to x_1 and x_2
We have two cases to take into account: u ≥ 0 and u < 0.

The first case is clear
What about when u < 0?

We have that
θ(u) = { −(1/2)u^2 + 4u if u ≥ 0;  4u if u < 0 }    (18)

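Filling in the derivation behind (Eq. 18) (added here; the slide leaves it as an exercise):

```latex
\begin{aligned}
u \ge 0:&\quad \nabla_x\!\left[x_1^2 + x_2^2 + u(-x_1 - x_2 + 4)\right] = 0
  \;\Rightarrow\; x_1 = x_2 = \tfrac{u}{2} \ge 0,\\
&\quad \theta(u) = \tfrac{u^2}{4} + \tfrac{u^2}{4} + u\,(4 - u) = -\tfrac{1}{2}u^2 + 4u,\\
u < 0:&\quad x_i^2 - u x_i \text{ is increasing for } x_i \ge 0,
  \text{ so the infimum is at } x_1 = x_2 = 0 \text{ and } \theta(u) = 4u,\\
\max_u \theta(u):&\quad \theta'(u) = 4 - u = 0 \;\Rightarrow\; u = 4,\ \theta(4) = 8,
  \text{ matching the primal optimum } x_1 = x_2 = 2,\ f = 8.
\end{aligned}
```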


Duality Theorem

First Property
If the primal has an optimal solution, so does the dual.

Thus
For w* and α* to be optimal solutions of the primal and dual problems, respectively, it is necessary and sufficient that w*:
is feasible for the primal problem, and
Φ(w*) = J(w*, w_0*, α*) = min_w J(w, w_0*, α*)


Reformulate our Equations

We have then
J(w, w_0, α) = (1/2) w^T w − ∑_{i=1}^N α_i d_i w^T x_i − w_0 ∑_{i=1}^N α_i d_i + ∑_{i=1}^N α_i

Now, using our 2nd optimality condition,
J(w, w_0, α) = (1/2) w^T w − ∑_{i=1}^N α_i d_i w^T x_i + ∑_{i=1}^N α_i


We have finally, for the 1st optimality condition:

First
w^T w = ∑_{i=1}^N α_i d_i w^T x_i = ∑_{i=1}^N ∑_{j=1}^N α_i α_j d_i d_j x_j^T x_i

Second, setting J(w, w_0, α) = Q(α),
Q(α) = ∑_{i=1}^N α_i − (1/2) ∑_{i=1}^N ∑_{j=1}^N α_i α_j d_i d_j x_j^T x_i


From here, we have the problem

This is the problem that we really solve
Given the training sample {(x_i, d_i)}_{i=1}^N, find the Lagrange multipliers {α_i}_{i=1}^N that maximize the objective function
Q(α) = ∑_{i=1}^N α_i − (1/2) ∑_{i=1}^N ∑_{j=1}^N α_i α_j d_i d_j x_j^T x_i
subject to the constraints
∑_{i=1}^N α_i d_i = 0    (19)
α_i ≥ 0 for i = 1, ..., N    (20)

Note
In the primal we were trying to minimize the cost function; for this it is necessary to maximize over α. That is the reason why we are maximizing Q(α).


Solving for α

We can compute w* once we get the optimal α_i*, by using (Eq. 15):
w* = ∑_{i=1}^N α_i* d_i x_i

In addition, we can compute the optimal bias w_0* using the optimal weight w*.
For this, we use the positive margin equation
g(x^(s)) = w^T x^(s) + w_0 = 1
corresponding to a positive support vector.

Then
w_0 = 1 − (w*)^T x^(s) for d^(s) = 1    (21)

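A minimal end-to-end sketch (illustrative, not the lecture's code; the toy data, scipy, and the chosen solver are all assumptions): it maximizes Q(α) under constraints (19)-(20), then recovers w* from (Eq. 15) and w_0 from (Eq. 21). Note how the α_i of non-support vectors come out essentially zero, as the KKT discussion predicted.

```python
# Solve the dual: max Q(alpha) s.t. sum_i alpha_i d_i = 0, alpha_i >= 0,
# on made-up, linearly separable 2-D data.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
d = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
N = len(d)

H = (d[:, None] * X) @ (d[:, None] * X).T        # H_ij = d_i d_j x_i^T x_j

def neg_Q(alpha):                                # we minimize -Q(alpha)
    return 0.5 * alpha @ H @ alpha - alpha.sum()

cons = {'type': 'eq', 'fun': lambda a: a @ d}    # sum_i alpha_i d_i = 0
bnds = [(0.0, None)] * N                         # alpha_i >= 0

alpha = minimize(neg_Q, np.zeros(N), method='SLSQP',
                 bounds=bnds, constraints=[cons]).x

w = (alpha * d) @ X                              # w* = sum_i alpha_i d_i x_i
s = np.argmax(alpha * (d > 0))                   # index of a positive support vector
w0 = 1.0 - w @ X[s]                              # w_0 = 1 - (w*)^T x^(s), d^(s) = +1
print("alpha:", np.round(alpha, 3), "w:", w, "w0:", w0)
```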

What do we need?

Until now, we have only a maximal margin algorithm
All this works fine when the classes are separable.
Problem: what happens when they are not separable?
What can we do?


Map to a higher Dimensional Space

Assume that there exists a mapping
x ∈ R^l → y ∈ R^k

Then, it is possible to define the following mapping.


Define a map to a higher Dimension

Nonlinear transformations
Given a series of nonlinear transformations {φ_i(x)}_{i=1}^m from the input space to the feature space,
we can define the decision surface as
∑_{i=1}^m w_i φ_i(x) + w_0 = 0


This allows us to define

The following vector
φ(x) = (φ_0(x), φ_1(x), ..., φ_m(x))^T
that represents the mapping.

From this mapping
We can define the following kernel function
K : X × X → R
K(x_i, x_j) = φ(x_i)^T φ(x_j)


Example

Assume
x ∈ R^2 → y = (x_1^2, √2 x_1 x_2, x_2^2)^T

We can show that
y_i^T y_j = (x_i^T x_j)^2

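A quick numerical check of this identity (illustrative values, added here):

```python
# Check that phi(x) = (x_1^2, sqrt(2) x_1 x_2, x_2^2) satisfies
# phi(x_i)^T phi(x_j) = (x_i^T x_j)^2 on a couple of made-up points.
import numpy as np

def phi(x):
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(xi) @ phi(xj), (xi @ xj) ** 2)   # both print 1.0
```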

Example of Kernels

Polynomials
k(x, z) = (x^T z + 1)^q, q > 0

Radial Basis Functions
k(x, z) = exp(−‖x − z‖^2 / σ^2)

Hyperbolic Tangents
k(x, z) = tanh(β x^T z + γ)

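For concreteness, a small sketch computing the Gram matrices of these three kernels on toy data (the values of q, σ, β and γ are illustrative, not prescribed by the slides):

```python
# Gram matrices K_ij = k(x_i, x_j) for the polynomial, RBF and tanh kernels.
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])        # rows are samples
q, sigma, beta, gamma = 2, 1.0, 0.5, -1.0

lin = X @ X.T                                              # x_i^T x_j
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # ||x_i - x_j||^2

K_poly = (lin + 1.0) ** q
K_rbf = np.exp(-sq_dist / sigma**2)
K_tanh = np.tanh(beta * lin + gamma)
print(K_poly, K_rbf, K_tanh, sep="\n")
```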

Now, How to select a Kernel?

We have a problem
Selecting a specific kernel and its parameters is usually done in a try-and-see manner.

Thus
In general, the Radial Basis Function kernel is a reasonable first choice.

Then
If this fails, we can try the other possible kernels.


Thus, we have something like this

Step 1
Normalize the data.

Step 2
Use cross-validation to adjust the parameters of the selected kernel.

Step 3
Train against the entire dataset.

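A sketch of this three-step workflow, using scikit-learn purely as an illustration (the slides do not prescribe a library; the dataset and parameter grid are placeholders):

```python
# Step 1: normalize, Step 2: cross-validate kernel parameters, Step 3: retrain on all data.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Normalization is folded into the pipeline so it is re-fit on each fold.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Cross-validate over the RBF parameters C and gamma.
grid = GridSearchCV(pipe,
                    param_grid={"svc__C": [0.1, 1, 10, 100],
                                "svc__gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)          # refit=True (the default) retrains the best model on the full dataset
print(grid.best_params_, grid.best_score_)
```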

Optimal Hyperplane for non-separable patterns

Important
We have been considering only problems where the classes are linearly separable.

Now
What happens when the patterns are not separable?

Thus, we can still build a separating hyperplane
But errors will happen in the classification... We need to minimize them...


What if the following happens
Some data points invade the "margin" space.

Figure: the optimal hyperplane with a data point violating the margin.

Fixing the Problem, Corinna's Style

The margin of separation between classes is said to be soft if a data point (x_i, d_i) violates the following condition:
d_i(w^T x_i + b) ≥ +1, i = 1, 2, ..., N    (22)

This violation can arise in one of two ways
The data point (x_i, d_i) falls inside the region of separation but on the right side of the decision surface: still a correct classification.


We have then

Example (figure): a data point inside the region of separation but on the correct side of the optimal hyperplane.

Or...

This violation can arise in one of two ways
The data point (x_i, d_i) falls on the wrong side of the decision surface: an incorrect classification.

Example (figure): a data point on the wrong side of the optimal hyperplane.


Solving the problem

What to do?
We introduce a set of nonnegative scalar values {ξ_i}_{i=1}^N.

Introduce this into the decision rule
d_i(w^T x_i + b) ≥ 1 − ξ_i, i = 1, 2, ..., N    (23)


The ξ_i are called slack variables

What?
In 1995, Corinna Cortes and Vladimir N. Vapnik suggested a modified maximum margin idea that allows for mislabeled examples.

Ok!
Instead of expecting to have a constant margin for all the samples, the margin can change depending on the sample.

What do we have?
ξ_i measures the deviation of a data point from the ideal condition of pattern separability.


Properties of ξ_i

What if you have 0 ≤ ξ_i ≤ 1?

We have (figure): the data point falls inside the region of separation, but still on the correct side of the decision surface.


Properties of $\xi_i$

What if?
You have $\xi_i > 1$

We have

[Figure: the data point falls on the wrong side of the optimal hyperplane and is misclassified]

97 / 124

Support Vectors

We want
Support vectors that satisfy equation (Eq. 23) even when $\xi_i > 0$

$$d_i\left(\mathbf{w}^T\mathbf{x}_i + b\right) \geq 1 - \xi_i, \quad i = 1, 2, \ldots, N$$

98 / 124

We want the following

We want to find a hyperplane
Such that the average misclassification error over all the samples is minimized

$$\frac{1}{N}\sum_{i=1}^{N} e_i^2 \qquad (24)$$

99 / 124

First Attempt Into Minimization

We can try the following
Given

$$I(x) = \begin{cases} 0 & \text{if } x \leq 0 \\ 1 & \text{if } x > 0 \end{cases} \qquad (25)$$

Minimize the following

$$\Phi(\xi) = \sum_{i=1}^{N} I(\xi_i - 1) \qquad (26)$$

with respect to the weight vector $\mathbf{w}$, subject to
1. $d_i\left(\mathbf{w}^T\mathbf{x}_i + b\right) \geq 1 - \xi_i, \quad i = 1, 2, \ldots, N$
2. $\|\mathbf{w}\|^2 \leq C$ for a given $C$.

100 / 124
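As a quick illustration of what (Eq. 26) counts, here is a minimal Python sketch (the helper name count_margin_errors is ours, not from the slides): since $I(\xi_i - 1) = 1$ exactly when $\xi_i > 1$, the sum equals the number of misclassified training points.

```python
import numpy as np

def count_margin_errors(xi):
    """Evaluate Phi(xi) from Eq. (26): I(xi_i - 1) equals 1 exactly when
    xi_i > 1, i.e., when the point falls on the wrong side of the decision
    surface, so the sum counts the misclassified training points."""
    xi = np.asarray(xi, dtype=float)
    return int(np.sum(xi > 1.0))

print(count_margin_errors([0.0, 0.3, 1.5, 2.2]))  # 2 points with xi > 1
```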

Problem

Using this first attempt
Minimization of $\Phi(\xi)$ with respect to $\mathbf{w}$ is a non-convex optimization problem that is NP-complete.

Thus, we need to use an approximation, maybe

$$\Phi(\xi) = \sum_{i=1}^{N} \xi_i \qquad (27)$$

Now, we simplify the computations by incorporating the weight vector $\mathbf{w}$

$$\Phi(\mathbf{w}, \xi) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{N} \xi_i \qquad (28)$$

101 / 124

Important

First
Minimizing the first term in (Eq. 28) is related to minimizing the Vapnik–Chervonenkis dimension.

Which is a measure of the capacity (complexity, expressive power, richness, or flexibility) of a statistical classification algorithm.

Second
The second term $\sum_{i=1}^{N} \xi_i$ is an upper bound on the number of training errors.

102 / 124

Some problems for the Parameter C

Little Problem
The parameter C has to be selected by the user.

This can be done in two ways
1. The parameter C is determined experimentally via the standard use of a training/validation (test) set.
2. It is determined analytically by estimating the Vapnik–Chervonenkis dimension.

103 / 124
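For option 1, a hedged sketch of the usual experimental procedure, assuming scikit-learn is available (the toy data and the grid of C values below are arbitrary choices, not from the slides): a cross-validated grid search over C.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy data standing in for a real training/validation set.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Cross-validated grid search over an arbitrary set of C values.
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, y)
print(search.best_params_)   # e.g. {'C': 1}
```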

Primal Problem

Problem, given samples $\{(\mathbf{x}_i, d_i)\}_{i=1}^{N}$

$$\min_{\mathbf{w}, \xi} \Phi(\mathbf{w}, \xi) = \min_{\mathbf{w}, \xi} \left\{\frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{N} \xi_i\right\}$$
$$\text{s.t. } d_i\left(\mathbf{w}^T\mathbf{x}_i + w_0\right) \geq 1 - \xi_i \text{ for } i = 1, \ldots, N$$
$$\xi_i \geq 0 \text{ for all } i$$

With C a user-specified positive parameter.

104 / 124
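To make the role of the slack variables concrete, here is a minimal numpy sketch (function and variable names are illustrative): for a fixed $(\mathbf{w}, w_0)$ it computes the smallest slacks $\xi_i = \max(0, 1 - d_i(\mathbf{w}^T\mathbf{x}_i + w_0))$ that satisfy the constraints and evaluates the primal objective $\Phi(\mathbf{w}, \xi)$ of (Eq. 28).

```python
import numpy as np

def primal_objective(w, w0, X, d, C):
    """Slack variables and soft-margin primal objective for a given hyperplane.
    X: (N, m) samples, d: (N,) labels in {-1, +1}, C: penalty parameter."""
    margins = d * (X @ w + w0)              # d_i (w^T x_i + w_0)
    xi = np.maximum(0.0, 1.0 - margins)     # smallest slacks satisfying the constraints
    objective = 0.5 * w @ w + C * np.sum(xi)
    return objective, xi
```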

Final Setup

Using Lagrange multipliers and the primal-dual method, it is possible to obtain the following setup
Given the training sample $\{(\mathbf{x}_i, d_i)\}_{i=1}^{N}$, find the Lagrange multipliers $\{\alpha_i\}_{i=1}^{N}$ that maximize the objective function

$$\max_{\alpha} Q(\alpha) = \max_{\alpha} \left\{\sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j \mathbf{x}_j^T\mathbf{x}_i\right\}$$

subject to the constraints

$$\sum_{i=1}^{N} \alpha_i d_i = 0 \qquad (29)$$

$$0 \leq \alpha_i \leq C \quad \text{for } i = 1, \ldots, N \qquad (30)$$

where C is a user-specified positive parameter.

106 / 124
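A minimal sketch of solving this dual with a generic QP solver, assuming cvxopt is installed (the helper name soft_margin_dual is ours): the maximization is rewritten as minimizing $\frac{1}{2}\alpha^T P \alpha - \mathbf{1}^T\alpha$ with $P_{ij} = d_i d_j \mathbf{x}_i^T\mathbf{x}_j$, subject to $0 \leq \alpha_i \leq C$ and $\mathbf{d}^T\alpha = 0$.

```python
import numpy as np
from cvxopt import matrix, solvers

def soft_margin_dual(X, d, C=1.0):
    """Solve Eqs. (29)-(30): maximize Q(alpha) subject to 0 <= alpha_i <= C
    and sum_i alpha_i d_i = 0, by handing the equivalent minimization to a QP solver.
    X: (N, m) float data matrix, d: (N,) labels in {-1, +1}."""
    N = X.shape[0]
    K = X @ X.T                                             # inner products x_i^T x_j
    P = matrix((np.outer(d, d) * K).astype(float))          # P_ij = d_i d_j x_i^T x_j
    q = matrix(-np.ones(N))                                 # minimizing -sum(alpha)
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))          # -alpha_i <= 0 and alpha_i <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A = matrix(d.reshape(1, -1).astype(float))              # equality: d^T alpha = 0
    b = matrix(0.0)
    solvers.options['show_progress'] = False
    return np.ravel(solvers.qp(P, q, G, h, A, b)['x'])      # the multipliers alpha
```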

Remarks

Something Notable
Note that neither the slack variables nor their Lagrange multipliers appear in the dual problem.
The dual problem for the case of non-separable patterns is thus similar to that for the simple case of linearly separable patterns.

The only big difference
Instead of using the constraint $\alpha_i \geq 0$, the new problem uses the more stringent constraint $0 \leq \alpha_i \leq C$.

Note the following

$$\xi_i = 0 \text{ if } \alpha_i < C \qquad (31)$$

107 / 124

Finally

The optimal solution for the weight vector $\mathbf{w}^*$

$$\mathbf{w}^* = \sum_{i=1}^{N_s} \alpha_i^* d_i \mathbf{x}_i$$

Where $N_s$ is the number of support vectors.

In addition
The determination of the optimum value of the bias follows a procedure similar to that described before.

The KKT conditions are as follows
$\alpha_i\left[d_i\left(\mathbf{w}^T\mathbf{x}_i + w_0\right) - 1 + \xi_i\right] = 0$ for $i = 1, 2, \ldots, N$.
$\mu_i \xi_i = 0$ for $i = 1, 2, \ldots, N$.

108 / 124

Where...

The $\mu_i$ are Lagrange multipliers
They are used to enforce the non-negativity of the slack variables $\xi_i$ for all $i$.

Something Notable
At the saddle point, the derivative with respect to $\xi_i$ of the following Lagrangian function for the primal problem vanishes:

$$\frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\left[d_i\left(\mathbf{w}^T\mathbf{x}_i + w_0\right) - 1 + \xi_i\right] - \sum_{i=1}^{N}\mu_i\xi_i \qquad (32)$$

109 / 124

Thus

We get

$$\alpha_i + \mu_i = C \qquad (33)$$

Thus, if $\alpha_i < C$

Then $\mu_i > 0 \Rightarrow \xi_i = 0$

We may determine $w_0$
Using any data point $(\mathbf{x}_i, d_i)$ in the training set such that $0 < \alpha_i^* < C$. Then, given $\xi_i = 0$,

$$w_0^* = \frac{1}{d_i} - \left(\mathbf{w}^*\right)^T\mathbf{x}_i \qquad (34)$$

110 / 124

Nevertheless

It is better
To take the mean value of $w_0^*$ over all such data points in the training sample (Burges, 1998).

BTW his 1998 tutorial, "A Tutorial on Support Vector Machines for Pattern Recognition," is a great SVM reference.

111 / 124
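Putting the last few slides together, a sketch (names illustrative) of recovering $\mathbf{w}^*$ from the support vectors and taking the mean value of $w_0^*$ over the points with $0 < \alpha_i^* < C$, as suggested above:

```python
import numpy as np

def recover_hyperplane(alpha, X, d, C, tol=1e-6):
    """Recover w* = sum_i alpha_i* d_i x_i over the support vectors, and
    estimate w_0* by averaging d_i - w*^T x_i over the margin support vectors
    (those with 0 < alpha_i < C, hence xi_i = 0). Note 1/d_i = d_i for
    labels in {-1, +1}, matching Eq. (34)."""
    sv = alpha > tol                                   # support vectors: alpha_i > 0
    w = (alpha[sv] * d[sv]) @ X[sv]                    # optimal weight vector w*
    margin = sv & (alpha < C - tol)                    # 0 < alpha_i < C  =>  xi_i = 0
    w0 = np.mean(d[margin] - X[margin] @ w)            # averaged, as per Burges (1998)
    return w, w0
```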

Basic Idea

Something Notable
The SVM uses the scalar product $\langle \mathbf{x}_i, \mathbf{x}_j \rangle$ as a measure of similarity between $\mathbf{x}_i$ and $\mathbf{x}_j$, and of distance to the hyperplane.
Since the scalar product is linear, the SVM is a linear method.

But
Using a nonlinear function instead, we can make the classifier nonlinear.

113 / 124

We do this by defining the following map

Nonlinear transformations
Given a series of nonlinear transformations

$$\{\phi_i(\mathbf{x})\}_{i=1}^{m}$$

from the input space to the feature space.

We can define the decision surface as

$$\sum_{i=1}^{m} w_i \phi_i(\mathbf{x}) + w_0 = 0$$

114 / 124

This allows us to define

The following vector

$$\boldsymbol{\phi}(\mathbf{x}) = \left(\phi_0(\mathbf{x}), \phi_1(\mathbf{x}), \ldots, \phi_m(\mathbf{x})\right)^T$$

That represents the mapping.

115 / 124
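For concreteness, a small sketch of one such mapping (this particular map is our example, not from the slides): the degree-2 polynomial feature map for inputs in $\mathbb{R}^2$, with $\phi_0(\mathbf{x}) = 1$ playing the role of the bias feature.

```python
import numpy as np

def phi(x):
    """Degree-2 polynomial feature map for x = (x1, x2) in R^2.
    The sqrt(2) factors are chosen so that phi(x)^T phi(z) = (1 + x^T z)^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])
```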

Finally

We define the decision surface as

$$\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) = 0 \qquad (35)$$

We now seek "linear" separability of the features, so we may write

$$\mathbf{w} = \sum_{i=1}^{N} \alpha_i d_i \boldsymbol{\phi}(\mathbf{x}_i) \qquad (36)$$

Thus, we finish with the following decision surface

$$\sum_{i=1}^{N} \alpha_i d_i \boldsymbol{\phi}^T(\mathbf{x}_i)\boldsymbol{\phi}(\mathbf{x}) = 0 \qquad (37)$$

117 / 124

Thus

The term $\boldsymbol{\phi}^T(\mathbf{x}_i)\boldsymbol{\phi}(\mathbf{x})$
It represents the inner product of two vectors in the feature space induced by the input patterns.

We can introduce the inner-product kernel

$$K(\mathbf{x}_i, \mathbf{x}) = \boldsymbol{\phi}^T(\mathbf{x}_i)\boldsymbol{\phi}(\mathbf{x}) = \sum_{j=0}^{m} \phi_j(\mathbf{x}_i)\phi_j(\mathbf{x}) \qquad (38)$$

Property: Symmetry

$$K(\mathbf{x}_i, \mathbf{x}) = K(\mathbf{x}, \mathbf{x}_i) \qquad (39)$$

118 / 124
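A minimal numerical check of (Eq. 38) under the same assumed degree-2 map as in the earlier sketch: the inner product in the feature space coincides with a polynomial kernel evaluated directly in the input space.

```python
import numpy as np

def phi2(x):
    """Same degree-2 polynomial feature map as in the earlier sketch
    (repeated here so this block is self-contained)."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def poly_kernel(xi, x):
    """Inhomogeneous degree-2 polynomial kernel K(xi, x) = (1 + xi^T x)^2."""
    return (1.0 + xi @ x) ** 2

xi, x = np.array([0.5, -1.0]), np.array([2.0, 0.3])
print(np.isclose(poly_kernel(xi, x), phi2(xi) @ phi2(x)))   # True: Eq. (38) holds
```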

This allows us to redefine the optimal hyperplane

We get

$$\sum_{i=1}^{N} \alpha_i d_i K(\mathbf{x}_i, \mathbf{x}) = 0 \qquad (40)$$

Something Notable
Using kernels, we can avoid going from:

$$\text{Input Space} \Longrightarrow \text{Mapping Space} \Longrightarrow \text{Inner Product} \qquad (41)$$

By going directly from

$$\text{Input Space} \Longrightarrow \text{Inner Product} \qquad (42)$$

119 / 124
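A sketch of evaluating the kernelized decision surface of (Eq. 40) at a new point; the explicit bias term b is an assumption of this sketch (the slides fold the bias into $\phi_0$), and all names are illustrative.

```python
def decision_value(x, sv_x, sv_d, sv_alpha, kernel, b=0.0):
    """f(x) = sum_i alpha_i d_i K(x_i, x) + b, summed over the support vectors
    only (the remaining alpha_i vanish). Classify by the sign of the result."""
    return sum(a * d * kernel(xi, x) for a, d, xi in zip(sv_alpha, sv_d, sv_x)) + b
```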

Important

Something Notable
The expansion of (Eq. 38) for the inner-product kernel $K(\mathbf{x}_i, \mathbf{x})$ is an important special case of a result that arises in functional analysis: Mercer's Theorem.

120 / 124

Mercer’s Theorem

Mercer's Theorem
Let $K(\mathbf{x}, \mathbf{x}')$ be a continuous symmetric kernel that is defined in the closed interval $a \leq \mathbf{x} \leq b$ and likewise for $\mathbf{x}'$. The kernel $K(\mathbf{x}, \mathbf{x}')$ can be expanded in the series

$$K(\mathbf{x}, \mathbf{x}') = \sum_{i=1}^{\infty} \lambda_i \phi_i(\mathbf{x})\phi_i(\mathbf{x}') \qquad (43)$$

With
Positive coefficients, $\lambda_i > 0$ for all $i$.

121 / 124

Mercer’s Theorem

For this expansion to be valid, and for it to converge absolutely and uniformly
It is necessary and sufficient that the condition

$$\int_a^b \int_a^b K(\mathbf{x}, \mathbf{x}')\,\psi(\mathbf{x})\,\psi(\mathbf{x}')\,d\mathbf{x}\,d\mathbf{x}' \geq 0 \qquad (44)$$

holds for all $\psi$ such that $\int_a^b \psi^2(\mathbf{x})\,d\mathbf{x} < \infty$ (an example of a quadratic norm for functions).

122 / 124
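As a finite-sample sanity check inspired by (Eq. 44), one can verify that the Gram matrix of a candidate kernel on sampled points is symmetric with no negative eigenvalues; this is only a necessary condition, not a proof of Mercer's condition. A minimal sketch (names illustrative):

```python
import numpy as np

def looks_like_mercer_kernel(kernel, X, tol=1e-10):
    """Finite-sample check: the Gram matrix of a valid kernel on any finite
    sample must be symmetric positive semidefinite."""
    N = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh(K) >= -tol)   # eigenvalues of a symmetric matrix
    return symmetric and psd

# Example: the RBF kernel passes the check on random points.
rng = np.random.default_rng(0)
X = list(rng.normal(size=(20, 2)))
rbf = lambda x, z, gamma=0.5: np.exp(-gamma * np.sum((x - z) ** 2))
print(looks_like_mercer_kernel(rbf, X))   # expected: True
```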

Remarks

First
The functions $\phi_i(\mathbf{x})$ are called eigenfunctions of the expansion and the numbers $\lambda_i$ are called eigenvalues.

Second
The fact that all of the eigenvalues are positive means that the kernel $K(\mathbf{x}, \mathbf{x}')$ is positive definite.

123 / 124

Not only that

We have that
For $\lambda_i \neq 1$, the $i$th image $\sqrt{\lambda_i}\,\phi_i(\mathbf{x})$ induced in the feature space by the input vector $\mathbf{x}$ is an eigenfunction of the expansion.

In theory
The dimensionality of the feature space (i.e., the number of eigenvalues/eigenfunctions) can be infinitely large.

124 / 124
