10-701 Recitation 5: Duality and SVM
Ahmed Hefny
Outline
• Lagrangian and Duality
– The Lagrangian
– Duality
– Examples
• Support Vector Machines
– Primal Formulation
– Dual Formulation
– Soft Margin and Hinge Loss
Lagrangian
• Consider the problem
  $\min_x f(x)$  s.t.  $g_i(x) = 0$
• Add a Lagrange multiplier for each constraint:
  $L(x, u) = f(x) + \sum_i u_i g_i(x)$
Lagrangian
• Lagrangian
  $L(x, u) = f(x) + \sum_i u_i g_i(x)$
• Setting the gradient to 0 gives
  – $g_i(x) = 0$  [feasible point]
  – $\nabla f(x) + \sum_i u_i \nabla g_i(x) = 0$  [cannot decrease $f$ except by violating constraints]
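As a quick sanity check, here is a minimal sketch (assuming sympy; the toy problem and variable names are made up) that applies these two conditions to minimize $x_1^2 + x_2^2$ subject to $x_1 + x_2 = 1$:

```python
import sympy as sp

# Toy problem: min x1^2 + x2^2  s.t.  x1 + x2 - 1 = 0
x1, x2, u = sp.symbols('x1 x2 u', real=True)
L = x1**2 + x2**2 + u * (x1 + x2 - 1)   # Lagrangian L(x, u)

# Stationarity in (x1, x2) gives grad f + u * grad g = 0;
# stationarity in u recovers the constraint g(x) = 0.
sol = sp.solve([sp.diff(L, v) for v in (x1, x2, u)], (x1, x2, u), dict=True)
print(sol)  # [{x1: 1/2, x2: 1/2, u: -1}]
```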
Lagrangian
• Consider the problem
  $\min_x f(x)$  s.t.  $g_i(x) = 0$,  $h_j(x) \le 0$
• Add a Lagrange multiplier for each constraint:
  $L(x, u, \lambda) = f(x) + \sum_i u_i g_i(x) + \sum_j \lambda_j h_j(x)$
Duality
• Primal problem
  $\min_x f(x)$  s.t.  $g_i(x) = 0$,  $h_j(x) \le 0$
• Equivalent to
  $\min_x \max_{\lambda \ge 0,\, u} f(x) + \sum_i u_i g_i(x) + \sum_j \lambda_j h_j(x)$
• Also equivalent to
  $\min_x \begin{cases} f(x) & x \text{ is feasible} \\ \infty & \text{otherwise} \end{cases}$
  (the inner max equals $f(x)$ when $x$ is feasible and $+\infty$ otherwise, since a violated constraint lets the multipliers drive the value arbitrarily high)
Duality
• Dual problem
  $\max_{\lambda \ge 0,\, u} \min_x f(x) + \sum_i u_i g_i(x) + \sum_j \lambda_j h_j(x)$
• The Lagrangian dual function $L(\lambda, u) = \min_x L(x, u, \lambda)$ is:
  – Concave, regardless of the convexity of the primal
  – A lower bound on the primal
Duality
[Table: rows indexed by $x$, columns indexed by $\lambda$, entries $L(x, \lambda)$]
Primal problem: $\min_x \max_{\lambda \ge 0} L(x, \lambda)$
For each row (choice of $x$), pick the largest element, then select the minimum over rows.
Duality
[Same table of $L(x, \lambda)$ values]
Dual problem: $\max_{\lambda \ge 0} \min_x L(x, \lambda)$
For each column (choice of $\lambda$), pick the smallest element, then select the maximum over columns.
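This row/column picture makes weak duality easy to check numerically. A minimal numpy sketch, on a made-up table of $L(x, \lambda)$ values:

```python
import numpy as np

# Hypothetical table of L(x, lambda): rows index x, columns index lambda.
L = np.array([[3.0, 1.0, 4.0],
              [2.0, 5.0, 0.0],
              [6.0, 2.0, 3.0]])

primal = L.max(axis=1).min()  # min over rows of the row-wise max
dual   = L.min(axis=0).max()  # max over columns of the column-wise min
print(primal, dual)           # 4.0 2.0 -- min max >= max min (weak duality)
```

Here the gap is positive because this arbitrary table has no saddle point.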
Duality
Claim (weak duality):
$\min_x \max_{\lambda \ge 0} L(x, \lambda) \ge \max_{\lambda \ge 0} \min_x L(x, \lambda)$
Proof: let $(x^*, \lambda^*)$ attain the left-hand side. For any $\lambda \ge 0$,
$\min_x L(x, \lambda) \le L(x^*, \lambda) \le L(x^*, \lambda^*)$
so taking the max over $\lambda$ on the left preserves the bound.
The difference between the primal minimum and the dual maximum is called the duality gap.
Duality gap = 0 ⟺ strong duality.
Duality
When does $\min_x \max_{\lambda \ge 0} L(x, \lambda) = \max_{\lambda \ge 0} \min_x L(x, \lambda)$ hold?
Exactly when $(x^*, \lambda^*)$ is a saddle point:
$L(x^*, \lambda) \le L(x^*, \lambda^*) \le L(x, \lambda^*)$ for all $x$ and all $\lambda \ge 0$
• Necessity: by the definition of the dual.
• Sufficiency: for any $\lambda \ge 0$,
  $L(\lambda) = \min_x L(x, \lambda) \le L(x^*, \lambda) \le L(x^*, \lambda^*)$, and $L(\lambda^*) = L(x^*, \lambda^*)$,
  so the dual attains its upper bound $L(x^*, \lambda^*)$ at $\lambda^*$.
Duality
• If strong duality holds, the KKT conditions hold at the optimal point:
  – Stationarity: $\nabla L(x, u, \lambda) = 0$
  – Primal feasibility
  – Dual feasibility ($\lambda \ge 0$)
  – Complementary slackness ($\lambda_i h_i(x) = 0$)
• The KKT conditions are:
  – Sufficient (for convex problems)
  – Necessary under strong duality
Example: LP
• Primal
  $\min_x c^T x$  s.t.  $Ax \ge b$
• Lagrangian
  $L(x, \lambda) = c^T x - \lambda^T (Ax - b)$
Example: LP
• Dual function
  $L(\lambda) = \min_x\, c^T x - \lambda^T (Ax - b)$
• Set the gradient w.r.t. $x$ to 0: $c - A^T \lambda = 0$
• Dual problem
  $\max_{\lambda \ge 0} \lambda^T b$  s.t.  $c - A^T \lambda = 0$
  Why keep this as a constraint? Because whenever $c - A^T \lambda \ne 0$, the inner minimization over $x$ is unbounded below, so $L(\lambda) = -\infty$ there.
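A small numerical check of strong duality for an LP. This is a sketch using scipy, with made-up data for $A$, $b$, $c$:

```python
import numpy as np
from scipy.optimize import linprog

# Made-up instance of  min c^T x  s.t.  Ax >= b  (x free).
c = np.array([3.0, 4.0])
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([4.0, 6.0])

# Primal: rewrite Ax >= b as -Ax <= -b for linprog.
primal = linprog(c, A_ub=-A, b_ub=-b, bounds=[(None, None)] * 2)

# Dual:  max b^T lambda  s.t.  A^T lambda = c,  lambda >= 0,
# phrased as minimizing -b^T lambda.
dual = linprog(-b, A_eq=A.T, b_eq=c, bounds=[(0, None)] * 2)

print(primal.fun, -dual.fun)  # both 10.0: zero duality gap
```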
Example: LASSO
• We will use duality to transform LASSO into a QP
Example: LASSO
Primal
$\min_w \frac{1}{2}\|y - Xw\|^2 + \gamma \|w\|_1$
What is the dual function in this case?
Example: LASSO
Reformulated primal
$\min_{w,z} \frac{1}{2}\|y - z\|^2 + \gamma \|w\|_1$  s.t.  $z = Xw$
Dual function
$L(\lambda) = \min_{z,w} \frac{1}{2}\|y - z\|^2 + \gamma \|w\|_1 + \lambda^T (z - Xw)$
Setting the gradient w.r.t. $z$ to zero gives $z = y - \lambda$.
The minimization over $w$ is finite only when $\|X^T \lambda\|_\infty \le \gamma$, which becomes a constraint.
Example: LASSO
• Dual problem
  $\max_\lambda -\frac{1}{2}\|\lambda\|^2 + \lambda^T y$  s.t.  $\|X^T \lambda\|_\infty \le \gamma$
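A sketch (assuming cvxpy, with random made-up data) verifying that the LASSO primal and this dual QP reach the same value:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, gamma = 20, 5, 0.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Primal: (1/2)||y - Xw||^2 + gamma * ||w||_1
w = cp.Variable(d)
primal = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(y - X @ w)
                                + gamma * cp.norm1(w)))
primal.solve()

# Dual: max -(1/2)||lam||^2 + y^T lam  s.t.  ||X^T lam||_inf <= gamma
lam = cp.Variable(n)
dual = cp.Problem(cp.Maximize(-0.5 * cp.sum_squares(lam) + y @ lam),
                  [cp.norm(X.T @ lam, 'inf') <= gamma])
dual.solve()

print(primal.value, dual.value)  # equal up to solver tolerance
```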
Support Vector Machines
[Figure: separating hyperplane with margin; image from docs.opencv.org]
Support Vector Machines
• Find the maximum-margin hyperplane.
• The signed "distance" from a point $x_i$ to the hyperplane $\langle w, x \rangle + b = 0$ is
  $d_i = (\langle w, x_i \rangle + b) / \|w\|$
• $\mathrm{Margin} = \min_i y_i d_i = \frac{1}{\|w\|} \min_i\, [\langle w, x_i \rangle + b]\, y_i$
• Max margin: $\max_{w,b} \frac{1}{\|w\|} \min_i\, [\langle w, x_i \rangle + b]\, y_i$
Support Vector Machines
• Max margin
  $\max_{w,b} \frac{1}{\|w\|} \min_i\, [\langle w, x_i \rangle + b]\, y_i$
• Unpleasant (a max over a min?)
• No unique solution: scaling $(w, b)$ by any $c > 0$ leaves the objective unchanged.
Support Vector Machines
• Max margin: fix the scale by adding a constraint
  $\max_{w,b} \frac{1}{\|w\|} \min_i\, [\langle w, x_i \rangle + b]\, y_i$
  s.t.  $\min_i\, [\langle w, x_i \rangle + b]\, y_i = 1$
• Under this normalization the objective is just $1/\|w\|$, so the problem becomes
  $\min_{w,b} \frac{1}{2}\|w\|^2$  s.t.  $\min_i\, [\langle w, x_i \rangle + b]\, y_i = 1$
Support Vector Machines
• Max margin (canonical representation)
  $\min_{w,b} \frac{1}{2}\|w\|^2$  s.t.  $[\langle w, x_i \rangle + b]\, y_i \ge 1 \quad \forall i$
• A QP, much better than
  $\max_{w,b} \frac{1}{\|w\|} \min_i\, [\langle w, x_i \rangle + b]\, y_i$
SVM Dual Problem
Recall that the Lagrangian is formed by adding a Lagrange multiplier for each constraint:
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i\, [(\langle w, x_i \rangle + b)\, y_i - 1]$
SVM Dual Problem
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i\, [(\langle w, x_i \rangle + b)\, y_i - 1]$
Fix $\alpha$ and minimize w.r.t. $w, b$:
• $w - \sum_i \alpha_i y_i x_i = 0$ → plug back into $L$
• $\sum_i \alpha_i y_i = 0$ → keep as a constraint (why? $L$ is linear in $b$, so unless this holds, the minimum over $b$ is $-\infty$)
SVM Dual Problem
Dual problem
$\max_\alpha -\frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$
s.t.  $\sum_i \alpha_i y_i = 0$,  $\alpha_i \ge 0$
Another QP. So what?
SVM Dual Problem
• Only inner products appear → kernel trick
• Complementary slackness → support vectors
• The KKT conditions lead to efficient optimization algorithms (compared to a general QP solver)
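A sketch of solving this dual directly (assuming cvxpy; the 2-D data are made up). Complementary slackness shows up as most $\alpha_i$ being zero:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
# Made-up linearly separable 2-D data.
X = np.vstack([rng.standard_normal((10, 2)) + 2.0,
               rng.standard_normal((10, 2)) - 2.0])
y = np.array([1.0] * 10 + [-1.0] * 10)

# Hard-margin dual; note ||w||^2 = ||X^T (alpha * y)||^2 rewrites the
# double sum over <x_i, x_j> as a single sum of squares.
alpha = cp.Variable(20)
prob = cp.Problem(
    cp.Maximize(cp.sum(alpha)
                - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y))),
    [alpha >= 0, y @ alpha == 0])
prob.solve()

sv = np.where(alpha.value > 1e-5)[0]   # support vectors: alpha_i > 0
w = X.T @ (alpha.value * y)            # w = sum_i alpha_i y_i x_i
print(sv, w)
```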
SVM Dual Problem
• Classification of a test point:
  $f(x) = \langle w, x \rangle + b = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$
• To get $b$, use the fact that $y_i f(x_i) = 1$ for any support vector.
• For numerical stability, average over all support vectors.
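Continuing the cvxpy sketch above, recovering $b$ and classifying a hypothetical test point using only inner products with the support vectors:

```python
# Recover b from y_i * f(x_i) = 1, averaged over all support vectors.
b = np.mean(y[sv] - X[sv] @ w)

# Decision value for a test point, as sum_i alpha_i y_i <x_i, x> + b.
x_test = np.array([1.0, 1.0])
f = sum(alpha.value[i] * y[i] * (X[i] @ x_test) for i in sv) + b
print(np.sign(f))
```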
Soft Margin SVM
Hard-margin SVM can be written as loss + regularization:
$\min_{w,b} \sum_i E_\infty(1 - [\langle w, x_i \rangle + b]\, y_i) + \frac{1}{2}\|w\|^2$
where
$E_\infty(x) = \begin{cases} \infty & x > 0 \\ 0 & x \le 0 \end{cases}$
[Plot: $E_\infty$ as a function of $y_i f(x_i)$; the first term is the loss, the second the regularization]
Soft Margin SVM
Relax it a little bit:
$\min_{w,b} \sum_i E_C(1 - [\langle w, x_i \rangle + b]\, y_i) + \frac{1}{2}\|w\|^2$
where
$E_C(x) = \begin{cases} Cx & x \ge 0 \\ 0 & x < 0 \end{cases}$
[Plot: $E_C$ as a function of $y_i f(x_i)$]
Soft Margin SVM
Equivalently, writing $[t]_+ = \max(0, t)$ (the hinge loss):
$\min_{w,b} C \sum_i [1 - (\langle w, x_i \rangle + b)\, y_i]_+ + \frac{1}{2}\|w\|^2$
[Plot: hinge loss as a function of $y_i f(x_i)$]
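A minimal numpy sketch of this hinge-loss objective and one of its subgradients (the function names and data layout are made up), e.g. for use with a subgradient method:

```python
import numpy as np

def hinge_objective(w, b, X, y, C):
    # C * sum_i [1 - y_i(<w, x_i> + b)]_+  +  0.5 * ||w||^2
    margins = 1.0 - y * (X @ w + b)
    return C * np.maximum(0.0, margins).sum() + 0.5 * w @ w

def hinge_subgradient(w, b, X, y, C):
    # A subgradient w.r.t. (w, b); only margin-violating points contribute.
    active = (1.0 - y * (X @ w + b)) > 0
    gw = w - C * (y[active, None] * X[active]).sum(axis=0)
    gb = -C * y[active].sum()
    return gw, gb
```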
Soft Margin SVM
Equivalent formulation with slack variables:
$\min_{w,b,\zeta} C \sum_i \zeta_i + \frac{1}{2}\|w\|^2$
s.t.  $\zeta_i \ge 0$,  $[\langle w, x_i \rangle + b]\, y_i \ge 1 - \zeta_i$
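As a check on the equivalence, a sketch (assuming cvxpy, with made-up data) solving the slack formulation directly; at the optimum $\zeta_i = [1 - y_i f(x_i)]_+$, so the value matches the hinge objective above:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 2))
y = np.where(X[:, 0] + 0.3 * rng.standard_normal(30) > 0, 1.0, -1.0)
C = 1.0

w, b, zeta = cp.Variable(2), cp.Variable(), cp.Variable(30)
prob = cp.Problem(
    cp.Minimize(C * cp.sum(zeta) + 0.5 * cp.sum_squares(w)),
    [zeta >= 0, cp.multiply(y, X @ w + b) >= 1 - zeta])
prob.solve()

# Evaluate the hinge-loss objective at the QP solution: same value.
h = C * np.maximum(0.0, 1 - y * (X @ w.value + b.value)).sum() \
    + 0.5 * w.value @ w.value
print(prob.value, h)  # equal up to solver tolerance
```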
Conclusions
• Duality allows establishing a lower bound on a minimization problem.
• Key idea: "min max" upper-bounds "max min".
• Strong duality → necessity of the KKT conditions.
• Duality applied to SVMs gives:
  – The kernel trick
  – Support vectors
• Soft-margin SVM = hinge loss + ℓ2 regularization.
Resources
• Bishop, "Pattern Recognition and Machine Learning", Ch. 7
• Gordon & Tibshirani, 10-725 Optimization (Fall 2012) lecture slides: http://www.cs.cmu.edu/~ggordon/10725-F12/schedule.html
• Fiterau, "Kernels and SVM": http://alex.smola.org/teaching/cmu2013-10-701/slides/6_Recitation_Kernels.pdf