ECE595 / STAT598: Machine Learning I Lecture 20 Support ...Outline Support Vector Machine Lecture 19...

ECE595 / STAT598: Machine Learning ILecture 20 Support Vector Machine: Duality

Spring 2020

Stanley Chan

School of Electrical and Computer EngineeringPurdue University

1 / 31

Outline

Support Vector Machine

Lecture 19 SVM 1: The Concept of Max-Margin

Lecture 20 SVM 2: Dual SVM

Lecture 21 SVM 3: Kernel SVM

This lecture: Support Vector Machine: Duality

Lagrange Duality

Maximize the dual variableMinimax ProblemToy Example

Dual SVM

FormulationInterpretation

2 / 31

SVM is the solution of this optimization

minimizew ,w0

2‖w‖22

subject to yj(wTx j + w0) ≥ 1, j = 1, . . . ,N.

3 / 31

Lagrange Function

Goal: Construct the dual problem of

minimizew ,w0

2‖w‖22

Approach: Consider the Lagrangian function

L(w ,w0,λ)def=

2‖w‖22︸︷︷︸

objective

+N∑j=1

[1− yj(w

Tx j + w0)]

︸︷︷︸constraint

Solution happens at the saddle point of L:

∇(w ,w0)L(w ,w0,λ) = 0, and ∇λL(w ,w0,λ) = 0.

4 / 31

Lagrangian function

The Lagrangian function is

L(w ,w0,λ)def=

2‖w‖22 +

N∑j=1

[1− yj(w

Tx j + w0)]

︸︷︷︸≤0

Complementarity Condition saysλj > 0 and

[1− yj(w

Tx j + w0)

λj = 0 and[1− yj(w

Tx j + w0)

So, if we want ∇λL(w ,w0,λ) = 0, then must be one of the twocases:

N∑j=1

[1− yj(w

Tx j + w0)]→ max or min

No saddle point because linear in λ.But 1− yj(w

Tx j + w0) ≤ 0. So unbounded minimum. So must gowith maximum.

5 / 31

Primal Problem

Let λ∗ as the maximizer

λ∗def= argmax

λ≥0

N∑j=1

[1− yj(w

Tx j + w0)]

Then the primal problem is

minimizew ,w0

L(w ,w0,λ∗)

= minimizew ,w0

2‖w‖22 + max

λ≥0

N∑j=1

[1− yj(w

Tx j + w0)]

= minimizew ,w0

{maxλ≥0

L(w ,w0,λ)}

This is a min-max problem:

minw ,w0

maxλ≥0

L(w ,w0,λ)

6 / 31

Strong Duality

Recall that our problem is quadratic programming (QP).Strong Duality holds for QP:

minw ,w0

maxλ≥0

L(w ,w0,λ)︸︷︷︸primal

= maxλ≥0

minw ,w0

L(w ,w0,λ)︸︷︷︸dual

http://www.onmyphd.com/?p=duality.theory7 / 31

Toy Example

The SVM problem

minimizew ,w0

2‖w‖22

is in the form of

minimizeu

‖u‖2, subject to aTj u ≥ bj , j = 1, 2, . . . ,N

Example:

minimizeu1,u2

u21 + u22 , subject to

1 21 00 1

Can we write down its dual problem?

8 / 31

Toy Example

Lagrangian function is

L(u,λ)def= u21 + u22 + λ1(2− u1 − 2u2)− λ2u1 − λ3u3

Minimize over u:

∂L∂u1

= 0 ⇒ u1 =λ1 + λ2

2∂L∂u2

= 0 ⇒ u2 =2λ1 + λ3

Plugging into the Lagrangian function yields

maximizeλ

L(λ) = −5

4λ21 −

4λ22 −

4λ23 −

2λ1λ2 − λ1λ3 + 2λ1

subject to λ ≥ 0.

Primal is QP. Dual is also QP.

9 / 31

Have We Gained Anything?

Here is the dual problem:

maximizeλ

L(λ) = −5

4λ21−

4λ22−

4λ23−

2λ1λ2−λ1λ3 + 2λ1

subject to λ ≥ 0.

These terms are all negative! So we must have λ2 = λ3 = 0.

This gives

maximizeλ1≥0

4λ21 + 2λ1.

which is maximized at λ1 = 45 .

Plugging into the primal yields

u1 =λ1 + λ2

5, and u2 =

2λ1 + λ32

10 / 31

Outline

Lecture 19 SVM 1: The Concept of Max-Margin

Lecture 20 SVM 2: Dual SVM

Lecture 21 SVM 3: Kernel SVM

This lecture: Support Vector Machine: Duality

Lagrange Duality

Maximize the dual variableMinimax ProblemToy Example

Dual SVM

FormulationInterpretation

11 / 31

Dual of SVM

We want to find the dual problem of

minimizew ,w0

2‖w‖22

We start with the Lagrangian function

L(w ,w0,λ)def=

2‖w‖22 +

N∑j=1

[1− yj(w

Tx j + w0)].

Let us minimize over (w ,w0):

∇wL(w ,w0,λ) = w −N∑j=1

λjyjx j = 0

∇w0L(w ,w0,λ) =N∑j=1

λjyj = 0.

12 / 31

Interpreting ∇wL(w ,w0,λ) = 0

The first result suggests that

w =N∑j=1

λjyjx j .

This is support vector: λj is either λj = 0 or λj > 0.

13 / 31

Interpreting ∇wL(w ,w0,λ) = 0

The complementarity condition states that

λ∗j

[1− yj(w

∗Tx j + w∗0 )]

= 0, for j = 1, . . . ,N.

If 1− yj(w∗Tx j + w∗0 ) > 0, then λ∗j = 0

If λ∗j > 0, then 1− yj(w∗Tx j + w∗0 ) = 0

So you can define the support vector set:

V def= {j | λ∗j > 0}.

So the optimal weight is

w∗ =∑j∈V

λ∗j yjx j .

14 / 31

Back to Duality

The Lagrangian function is

L(w∗,w∗0 ,λ) =1

2‖w∗‖22 +

N∑j=1

[1− yj((w∗)Tx j + w0)

∥∥∥∥∥∥N∑j=1

λjyjx j

∥∥∥∥∥∥2

2︸︷︷︸A

+N∑j=1

1− yj

( n∑i=1

λiyix i

x j + w0

︸︷︷︸

15 / 31

Back to Duality

We can show that

N∑i=1

N∑j=1

λiλjyiyjxTi x j

B =N∑j=1

λj −N∑i=1

N∑j=1

λiλjyiyjxTi x j −

N∑j=1

︸︷︷︸

and we can show that

A + B =N∑j=1

λj +1

N∑i=1

N∑j=1

λiλjyiyjxTi x j

16 / 31

Back to Duality

Therefore, the dual problem is

maximizeλ≥0

N∑i=1

N∑j=1

λiλjyiyjxTi x j +

N∑j=1

subject toN∑j=1

λjyj = 0.

If you prefer matrix-vector:

maximizeλ≥0

2λTQλ + 1Tλ

subject to yTλ = 0.

We can combine the constraints λ ≥ 0 and yTλ = 0 as

Aλ ≥ 0.

17 / 31

Back to Duality

yTλ = 0 means

yTλ ≥ 0, and yTλ ≤ 0.

Thus, we can write yTλ = 0 as[yT

]λ ≥

Therefore, the matrices Q and A are

T1 x1 . . . y1yNx

y2y1xT2 x1 . . . y2yNx

......

...yNy1x

TNx1 . . . yNyNx

and A =

18 / 31

So How to Solve the SVM Problem?

You look at the dual problem

maximizeλ

2λTQλ + 1Tλ

subject to Aλ ≥ 0.

You get the solution λ∗.

Then compute w∗:

w∗ =∑j∈V

λ∗j yjx j .

V is the set of support vectors: λj > 0.

19 / 31

Are We Done Yet?

Not quite.

We still need to find out w∗0 .

Pick any support vector x+ ∈ C+ and x− ∈ C−.

Then we have

wTx+ + w0 = +1, and wTx− + w0 = −1.

Sum them, we have wT (x+ + x−) + 2w0 = 0, which means

w∗0 = −(x+ + x−)Tw∗

20 / 31

Summary of Dual SVM

Primal

minimizew ,w0

2‖w‖22

Strong Duality

minw ,w0

maxλ≥0

L(w ,w0,λ)︸︷︷︸primal

= maxλ≥0

minw ,w0

L(w ,w0,λ)︸︷︷︸dual

maximizeλ≥0

N∑i=1

N∑j=1

λiλjyiyjxTi x j +

N∑j=1

subject toN∑j=1

λjyj = 0.

21 / 31

Summary of Dual SVM

The weights are computed as

w∗ =N∑j=1

λ∗j yjx j .

This is support vector: λj is either λj = 0 or λj > 0.

Pick any support vector x+ ∈ C+ and x− ∈ C−.

Then we have

wTx+ + w0 = +1, and wTx− + w0 = −1.

Sum them, we have wT (x+ + x−) + 2w0 = 0, which means

w∗0 = −(x+ + x−)Tw∗

22 / 31

Summary of Dual SVM

23 / 31

Reading List

Mustafa, Learning from Data, e-Chapter

Duda-Hart-Stork, Pattern Classification, Chapter 5.5

Chris Bishop, Pattern Recognition, Chapter 7.1

UCSD Statistical Learninghttp://www.svcl.ucsd.edu/courses/ece271B-F09/

24 / 31

Appendix

25 / 31

Inequality Constrained Optimization

Inequality constrained optimization:

minimizex∈Rn

subject to gi (x) ≥ 0, i = 1, . . . ,m

hj(x) = 0, j = 1, . . . , k.

Requires a function: Lagrangian function

L(x ,µ,ν)def= f (x)−

m∑i=1

µigi (x)−k∑

νjhj(x).

µ ∈ Rm and ν ∈ Rk are called the Lagrange multipliers or the dualvariables.

26 / 31

Karush-Kahn-Tucker Conditions

If (x∗,µ∗,ν∗) is the solution to the constrained optimization, then all thefollowing conditions should hold:

(i) ∇xL(x∗,µ∗,ν∗) = 0.

Stationarity.The primal variables should be stationary.

(ii) gi (x∗) ≥ 0 and hj(x

∗) = 0 for all i and j .

Primal Feasibility.Ensures that constraints are satisfied.

(iii) µ∗i ≥ 0 for all i and j .

Dual Feasibility.Require µ∗

i ≥ 0; but no constraint on ν∗i .

(iv) µ∗i gi (x∗) = 0 for all i and j .

Complementary SlacknessEither µ∗

i = 0 or gi (x∗) = 0 (or both).

KKT Condition is a first order necessary condition.

27 / 31

Example: `2-minimization with two constraints

Solve the following least squares over positive quadrant problem.

minimizex∈Rn

2‖x − b‖2,

subject to xT1 = 1, and x ≥ 0.(1)

%MATLAB code: Use CVX to solve min ||x-b|| s.t. sum(x) = 1, x >= 0.

cvx_begin

variable x(n)

minimize( norm(x-b, 2) )

subject to

sum(x) == 1;

x >= 0;

cvx_end

28 / 31

Analytic Solution

L(x ,λ, γ) =1

2‖x − y‖2 − λTx − γ(1− xT1).

Stationarity suggests that:

∇xL(x ,λ, γ) = x − b − λ + γ1 = 0

This meansx = b + λ− γ1 or xi = bi + λi − γ.

The complementary slackness implies λixi = 0.Case 1: If λi = 0, then

xi = bi +��0

λi − γ = bi − γ.Since constraint requires xi ≥ 0, this means bi ≥ γ.

Case 2: If λi > 0, then xi = 0.

��>0

xi = bi + λi − γ.This implies bi + λi = γ.Since λi > 0, this implies bi < γ.

29 / 31

These three cases can be re-written as:

If bi > γ, then xi = bi − γ;If bi = γ, then xi = 0;If bi < γ, then xi = 0.

Compactly written asx = max(b − γ1, 0).

Primal feasibility implies that

xT1 = 1, ⇔n∑

xi = 1.

Therefore, γ needs to satisfy the equationn∑

max(bi − γ, 0) = 1,

which can be found by doing a root-finding of

g(γ) =n∑

max(bi − γ, 0)− 1.30 / 31

Non-CVX Implementation

%MATLAB code: solve min ||x-b|| s.t. sum(x) = 1, x >= 0.

n = 10;

b = randn(n,1);

fun = @(gamma) myfun(gamma,b);

gamma = fzero(fun,0);

x = max(b-gamma,0);

where the function myfun is defined as

function y = myfun(gamma,b)

y = sum(max(b-gamma,0))-1;

31 / 31

ECE595 / STAT598: Machine Learning I Lecture 20 Support ...Outline Support Vector Machine Lecture 19...

Documents

Machine Learning Lecture 9 Bayes Decision Theory ng · Extensions 3 B. Leibe ng ‘18 Recap: Support Vector Machine (SVM) • Basic idea The SVM tries to find a classifier which maximizes

Daniele Loiacono - Politecnico di Milanohome.deib.polimi.it/loiacono/uploads/Teaching/DMTM/DMTM Suppor… · Training SVM ν-SVM Multi-class SVM Regression. Daniele Loiacono SVM Training

ECE595 / STAT598: Machine Learning I Lecture 02: Regularized … · 2020. 1. 17. · 12 14 lambda feature attribute funding % high % no high % college % graduate 10-2 100 102 104

Credit Risk - Lecture 2defaultrisk.free.fr/pdf/Lecture2.pdfDefault and Ratings Logistic regression Tree-based algorithms SVM Credit Risk Lecture 2 { Statistical tools for scoring and

ECE595 / STAT598: Machine Learning I Lecture 19 Support ... · Outline Support Vector Machine Lecture 19 SVM 1: The Concept of Max-Margin Lecture 20 SVM 2: Dual SVM Lecture 21 SVM

Lecture 13: SVM Again

The SVM classifier zisserman lecture note.pdf

Lecture 3: SVM dual, kernels and regression

ECE595 / STAT598: Machine Learning I Lecture 37 Robustness … · Lecture 37 Trade-o between Accuracy and Robustness Today’s Lecture Adversarial robustness of any classi er Can

Lecture 2: The SVM classifier

ECE595 / STAT598: Machine Learning I Lecture 01: Linear ...Lecture 1: Linear regression: A basic data analytic tool Lecture 2: Regularization: Constraining the solution Lecture 3:

AI Study · 2004. 11. 16. · SVM -O 155- 10 ETRIOII'/ 4.1 SVM 2007119} non-eye non-eye —L 719-1 SVM 1 Ù)A5L, Radial Basis Function(RBF) SVM non-eye SVM* 17 non-eye SVM 71 gl EL

ECE595 / STAT598: Machine Learning I Lecture 26 Growth ...€¦ · Outline Lecture 25 Generalization Lecture 26 Growth Function Lecture 27 VC Dimension Today’s Lecture: Overcoming

Lecture 3: Linear SVM with slack variables · Lecture 3: Linear SVM with slack variables Stéphane Canu stephane.canu@litislab.eu Sao Paulo 2014 March 23, 2014 ... it works for CONVEX

SVM and Kernel machine Lecture 1: Linear SVM...A support vector machine (SVM) is a linear classiﬁer associated with the following decision function: D(x) = sign w ⊤x+b where w

SupportVectorMachine(SVM) · 2020. 10. 13. · SupportVectorMachine(SVM) Whataresupportvectormachines(SVM)? LikeLDAandLogisticregression,anSVMisalsoalinearclassiﬁerbut seekstoﬁndamaximum

ECE595 / STAT598: Machine Learning I Lecture 10 Minimum ...Lecture 9 Bayesian Decision Rules Lecture 10 Evaluating Performance Lecture 11 Parameter Estimation Lecture 12 Bayesian Prior

ECE595 / STAT598: Machine Learning I Lecture 11 Maximum ...Lecture 11 Parameter Estimation Lecture 12 Bayesian Prior Lecture 13 Connecting Bayesian and Linear Regression Today’s

Svm Tricia

ECE595 / STAT598: Machine Learning I Lecture 36 Defending ...Lecture 33-35 Adversarial Attack Strategies Lecture 36 Adversarial Defense Strategies Lecture 37 Trade-o between Accuracy