10-701 Recitation 5: Duality and SVM
Ahmed Hefny
Outline
• Lagrangian and Duality
– The Lagrangian
– Duality
– Examples
• Support Vector Machines
– Primal Formulation
– Dual Formulation
– Soft Margin and Hinge Loss
Lagrangian
• Consider the problem
  $\min_x f(x)$  s.t.  $g_i(x) = 0$
• Add a Lagrange multiplier for each constraint:
  $L(x, u) = f(x) + \sum_i u_i g_i(x)$
Lagrangian
• Lagrangian
  $L(x, u) = f(x) + \sum_i u_i g_i(x)$
• Setting the gradient to 0 gives
  – $g_i(x) = 0$  [feasible point]
  – $\nabla f(x) + \sum_i u_i \nabla g_i(x) = 0$  [cannot decrease $f$ except by violating constraints]
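As a quick sanity check, here is a minimal sketch (assuming sympy; the toy problem and variable names are made up) that applies these two conditions to minimize $x_1^2 + x_2^2$ subject to $x_1 + x_2 = 1$:

```python
import sympy as sp

# Toy problem: min x1^2 + x2^2  s.t.  x1 + x2 - 1 = 0
x1, x2, u = sp.symbols('x1 x2 u', real=True)
L = x1**2 + x2**2 + u * (x1 + x2 - 1)   # Lagrangian L(x, u)

# Stationarity in (x1, x2) gives grad f + u * grad g = 0;
# stationarity in u recovers the constraint g(x) = 0.
sol = sp.solve([sp.diff(L, v) for v in (x1, x2, u)], (x1, x2, u), dict=True)
print(sol)  # [{x1: 1/2, x2: 1/2, u: -1}]
```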
Lagrangian
• Consider the problem
  $\min_x f(x)$  s.t.  $g_i(x) = 0$,  $h_j(x) \le 0$
• Add a Lagrange multiplier for each constraint:
  $L(x, u, \lambda) = f(x) + \sum_i u_i g_i(x) + \sum_j \lambda_j h_j(x)$
Duality
• Primal problem
  $\min_x f(x)$  s.t.  $g_i(x) = 0$,  $h_j(x) \le 0$
• Equivalent to
  $\min_x \max_{\lambda \ge 0,\, u} f(x) + \sum_i u_i g_i(x) + \sum_j \lambda_j h_j(x)$
• Also equivalent to
  $\min_x \begin{cases} f(x) & x \text{ is feasible} \\ \infty & \text{otherwise} \end{cases}$
  (the inner max equals $f(x)$ when $x$ is feasible and $+\infty$ otherwise, since a violated constraint lets the multipliers drive the value arbitrarily high)
Duality
• Dual problem
  $\max_{\lambda \ge 0,\, u} \min_x f(x) + \sum_i u_i g_i(x) + \sum_j \lambda_j h_j(x)$
• The Lagrangian dual function $L(\lambda, u) = \min_x L(x, u, \lambda)$ is:
  – Concave, regardless of the convexity of the primal
  – A lower bound on the primal
Duality
[Table: rows indexed by $x$, columns indexed by $\lambda$, entries $L(x, \lambda)$]
Primal problem: $\min_x \max_{\lambda \ge 0} L(x, \lambda)$
For each row (choice of $x$), pick the largest element, then select the minimum over rows.
Duality
[Same table of $L(x, \lambda)$ values]
Dual problem: $\max_{\lambda \ge 0} \min_x L(x, \lambda)$
For each column (choice of $\lambda$), pick the smallest element, then select the maximum over columns.
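This row/column picture makes weak duality easy to check numerically. A minimal numpy sketch, on a made-up table of $L(x, \lambda)$ values:

```python
import numpy as np

# Hypothetical table of L(x, lambda): rows index x, columns index lambda.
L = np.array([[3.0, 1.0, 4.0],
              [2.0, 5.0, 0.0],
              [6.0, 2.0, 3.0]])

primal = L.max(axis=1).min()  # min over rows of the row-wise max
dual   = L.min(axis=0).max()  # max over columns of the column-wise min
print(primal, dual)           # 4.0 2.0 -- min max >= max min (weak duality)
```

Here the gap is positive because this arbitrary table has no saddle point.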
Duality
Claim (weak duality):
$\min_x \max_{\lambda \ge 0} L(x, \lambda) \ge \max_{\lambda \ge 0} \min_x L(x, \lambda)$
Proof: let $(x^*, \lambda^*)$ attain the left-hand side. For any $\lambda \ge 0$,
$\min_x L(x, \lambda) \le L(x^*, \lambda) \le L(x^*, \lambda^*)$
so taking the max over $\lambda$ on the left preserves the bound.
The difference between the primal minimum and the dual maximum is called the duality gap.
Duality gap = 0 ⟺ strong duality.
Duality
When does $\min_x \max_{\lambda \ge 0} L(x, \lambda) = \max_{\lambda \ge 0} \min_x L(x, \lambda)$ hold?
Exactly when $(x^*, \lambda^*)$ is a saddle point:
$L(x^*, \lambda) \le L(x^*, \lambda^*) \le L(x, \lambda^*)$ for all $x$ and all $\lambda \ge 0$
• Necessity: by the definition of the dual.
• Sufficiency: for any $\lambda \ge 0$,
  $L(\lambda) = \min_x L(x, \lambda) \le L(x^*, \lambda) \le L(x^*, \lambda^*)$, and $L(\lambda^*) = L(x^*, \lambda^*)$,
  so the dual attains its upper bound $L(x^*, \lambda^*)$ at $\lambda^*$.
Duality
• If strong duality holds, the KKT conditions hold at the optimal point:
  – Stationarity: $\nabla L(x, u, \lambda) = 0$
  – Primal feasibility
  – Dual feasibility ($\lambda \ge 0$)
  – Complementary slackness ($\lambda_i h_i(x) = 0$)
• The KKT conditions are:
  – Sufficient (for convex problems)
  – Necessary under strong duality
Example: LP
• Primal
  $\min_x c^T x$  s.t.  $Ax \ge b$
• Lagrangian
  $L(x, \lambda) = c^T x - \lambda^T (Ax - b)$
Example: LP
• Dual function
  $L(\lambda) = \min_x\, c^T x - \lambda^T (Ax - b)$
• Set the gradient w.r.t. $x$ to 0: $c - A^T \lambda = 0$
• Dual problem
  $\max_{\lambda \ge 0} \lambda^T b$  s.t.  $c - A^T \lambda = 0$
  Why keep this as a constraint? Because whenever $c - A^T \lambda \ne 0$, the inner minimization over $x$ is unbounded below, so $L(\lambda) = -\infty$ there.
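A small numerical check of strong duality for an LP. This is a sketch using scipy, with made-up data for $A$, $b$, $c$:

```python
import numpy as np
from scipy.optimize import linprog

# Made-up instance of  min c^T x  s.t.  Ax >= b  (x free).
c = np.array([3.0, 4.0])
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([4.0, 6.0])

# Primal: rewrite Ax >= b as -Ax <= -b for linprog.
primal = linprog(c, A_ub=-A, b_ub=-b, bounds=[(None, None)] * 2)

# Dual:  max b^T lambda  s.t.  A^T lambda = c,  lambda >= 0,
# phrased as minimizing -b^T lambda.
dual = linprog(-b, A_eq=A.T, b_eq=c, bounds=[(0, None)] * 2)

print(primal.fun, -dual.fun)  # both 10.0: zero duality gap
```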
Example: LASSO
• We will use duality to transform LASSO into a QP
Example: LASSO
Primal
$\min_w \frac{1}{2}\|y - Xw\|^2 + \gamma \|w\|_1$
What is the dual function in this case?
Example: LASSO
Reformulated primal
$\min_{w,z} \frac{1}{2}\|y - z\|^2 + \gamma \|w\|_1$  s.t.  $z = Xw$
Dual function
$L(\lambda) = \min_{z,w} \frac{1}{2}\|y - z\|^2 + \gamma \|w\|_1 + \lambda^T (z - Xw)$
Setting the gradient w.r.t. $z$ to zero gives $z = y - \lambda$.
The minimization over $w$ is finite only when $\|X^T \lambda\|_\infty \le \gamma$, which becomes a constraint.
Example: LASSO
• Dual problem
  $\max_\lambda -\frac{1}{2}\|\lambda\|^2 + \lambda^T y$  s.t.  $\|X^T \lambda\|_\infty \le \gamma$
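A sketch (assuming cvxpy, with random made-up data) verifying that the LASSO primal and this dual QP reach the same value:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, gamma = 20, 5, 0.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Primal: (1/2)||y - Xw||^2 + gamma * ||w||_1
w = cp.Variable(d)
primal = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(y - X @ w)
                                + gamma * cp.norm1(w)))
primal.solve()

# Dual: max -(1/2)||lam||^2 + y^T lam  s.t.  ||X^T lam||_inf <= gamma
lam = cp.Variable(n)
dual = cp.Problem(cp.Maximize(-0.5 * cp.sum_squares(lam) + y @ lam),
                  [cp.norm(X.T @ lam, 'inf') <= gamma])
dual.solve()

print(primal.value, dual.value)  # equal up to solver tolerance
```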
Support Vector Machines
[Figure: separating hyperplane with margin; image from docs.opencv.org]
Support Vector Machines
• Find the maximum-margin hyperplane.
• The signed "distance" from a point $x_i$ to the hyperplane $\langle w, x \rangle + b = 0$ is
  $d_i = (\langle w, x_i \rangle + b) / \|w\|$
• $\mathrm{Margin} = \min_i y_i d_i = \frac{1}{\|w\|} \min_i\, [\langle w, x_i \rangle + b]\, y_i$
• Max margin: $\max_{w,b} \frac{1}{\|w\|} \min_i\, [\langle w, x_i \rangle + b]\, y_i$
Support Vector Machines
• Max margin
  $\max_{w,b} \frac{1}{\|w\|} \min_i\, [\langle w, x_i \rangle + b]\, y_i$
• Unpleasant (a max over a min?)
• No unique solution: scaling $(w, b)$ by any $c > 0$ leaves the objective unchanged.
Support Vector Machines
• Max margin: fix the scale by adding a constraint
  $\max_{w,b} \frac{1}{\|w\|} \min_i\, [\langle w, x_i \rangle + b]\, y_i$
  s.t.  $\min_i\, [\langle w, x_i \rangle + b]\, y_i = 1$
• Under this normalization the objective is just $1/\|w\|$, so the problem becomes
  $\min_{w,b} \frac{1}{2}\|w\|^2$  s.t.  $\min_i\, [\langle w, x_i \rangle + b]\, y_i = 1$
Support Vector Machines
• Max margin (canonical representation)
  $\min_{w,b} \frac{1}{2}\|w\|^2$  s.t.  $[\langle w, x_i \rangle + b]\, y_i \ge 1 \quad \forall i$
• A QP, much better than
  $\max_{w,b} \frac{1}{\|w\|} \min_i\, [\langle w, x_i \rangle + b]\, y_i$
SVM Dual Problem
Recall that the Lagrangian is formed by adding a Lagrange multiplier for each constraint:
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i\, [(\langle w, x_i \rangle + b)\, y_i - 1]$
SVM Dual Problem
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i\, [(\langle w, x_i \rangle + b)\, y_i - 1]$
Fix $\alpha$ and minimize w.r.t. $w, b$:
• $w - \sum_i \alpha_i y_i x_i = 0$ → plug back into $L$
• $\sum_i \alpha_i y_i = 0$ → keep as a constraint (why? $L$ is linear in $b$, so unless this holds, the minimum over $b$ is $-\infty$)
SVM Dual Problem
Dual problem
$\max_\alpha -\frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$
s.t.  $\sum_i \alpha_i y_i = 0$,  $\alpha_i \ge 0$
Another QP. So what?
SVM Dual Problem
• Only inner products appear → kernel trick
• Complementary slackness → support vectors
• The KKT conditions lead to efficient optimization algorithms (compared to a general QP solver)
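A sketch of solving this dual directly (assuming cvxpy; the 2-D data are made up). Complementary slackness shows up as most $\alpha_i$ being zero:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
# Made-up linearly separable 2-D data.
X = np.vstack([rng.standard_normal((10, 2)) + 2.0,
               rng.standard_normal((10, 2)) - 2.0])
y = np.array([1.0] * 10 + [-1.0] * 10)

# Hard-margin dual; note ||w||^2 = ||X^T (alpha * y)||^2 rewrites the
# double sum over <x_i, x_j> as a single sum of squares.
alpha = cp.Variable(20)
prob = cp.Problem(
    cp.Maximize(cp.sum(alpha)
                - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y))),
    [alpha >= 0, y @ alpha == 0])
prob.solve()

sv = np.where(alpha.value > 1e-5)[0]   # support vectors: alpha_i > 0
w = X.T @ (alpha.value * y)            # w = sum_i alpha_i y_i x_i
print(sv, w)
```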
SVM Dual Problem
• Classification of a test point:
  $f(x) = \langle w, x \rangle + b = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$
• To get $b$, use the fact that $y_i f(x_i) = 1$ for any support vector.
• For numerical stability, average over all support vectors.
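Continuing the cvxpy sketch above, recovering $b$ and classifying a hypothetical test point using only inner products with the support vectors:

```python
# Recover b from y_i * f(x_i) = 1, averaged over all support vectors.
b = np.mean(y[sv] - X[sv] @ w)

# Decision value for a test point, as sum_i alpha_i y_i <x_i, x> + b.
x_test = np.array([1.0, 1.0])
f = sum(alpha.value[i] * y[i] * (X[i] @ x_test) for i in sv) + b
print(np.sign(f))
```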
Soft Margin SVM
Hard-margin SVM can be written as loss + regularization:
$\min_{w,b} \sum_i E_\infty(1 - [\langle w, x_i \rangle + b]\, y_i) + \frac{1}{2}\|w\|^2$
where
$E_\infty(x) = \begin{cases} \infty & x > 0 \\ 0 & x \le 0 \end{cases}$
[Plot: $E_\infty$ as a function of $y_i f(x_i)$; the first term is the loss, the second the regularization]
Soft Margin SVM
Relax it a little bit:
$\min_{w,b} \sum_i E_C(1 - [\langle w, x_i \rangle + b]\, y_i) + \frac{1}{2}\|w\|^2$
where
$E_C(x) = \begin{cases} Cx & x \ge 0 \\ 0 & x < 0 \end{cases}$
[Plot: $E_C$ as a function of $y_i f(x_i)$]
Soft Margin SVM
Equivalently, writing $[t]_+ = \max(0, t)$ (the hinge loss):
$\min_{w,b} C \sum_i [1 - (\langle w, x_i \rangle + b)\, y_i]_+ + \frac{1}{2}\|w\|^2$
[Plot: hinge loss as a function of $y_i f(x_i)$]
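A minimal numpy sketch of this hinge-loss objective and one of its subgradients (the function names and data layout are made up), e.g. for use with a subgradient method:

```python
import numpy as np

def hinge_objective(w, b, X, y, C):
    # C * sum_i [1 - y_i(<w, x_i> + b)]_+  +  0.5 * ||w||^2
    margins = 1.0 - y * (X @ w + b)
    return C * np.maximum(0.0, margins).sum() + 0.5 * w @ w

def hinge_subgradient(w, b, X, y, C):
    # A subgradient w.r.t. (w, b); only margin-violating points contribute.
    active = (1.0 - y * (X @ w + b)) > 0
    gw = w - C * (y[active, None] * X[active]).sum(axis=0)
    gb = -C * y[active].sum()
    return gw, gb
```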
Soft Margin SVM
Equivalent formulation with slack variables:
$\min_{w,b,\zeta} C \sum_i \zeta_i + \frac{1}{2}\|w\|^2$
s.t.  $\zeta_i \ge 0$,  $[\langle w, x_i \rangle + b]\, y_i \ge 1 - \zeta_i$
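As a check on the equivalence, a sketch (assuming cvxpy, with made-up data) solving the slack formulation directly; at the optimum $\zeta_i = [1 - y_i f(x_i)]_+$, so the value matches the hinge objective above:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 2))
y = np.where(X[:, 0] + 0.3 * rng.standard_normal(30) > 0, 1.0, -1.0)
C = 1.0

w, b, zeta = cp.Variable(2), cp.Variable(), cp.Variable(30)
prob = cp.Problem(
    cp.Minimize(C * cp.sum(zeta) + 0.5 * cp.sum_squares(w)),
    [zeta >= 0, cp.multiply(y, X @ w + b) >= 1 - zeta])
prob.solve()

# Evaluate the hinge-loss objective at the QP solution: same value.
h = C * np.maximum(0.0, 1 - y * (X @ w.value + b.value)).sum() \
    + 0.5 * w.value @ w.value
print(prob.value, h)  # equal up to solver tolerance
```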
Conclusions
• Duality allows establishing a lower bound on a minimization problem.
• Key idea: "min max" upper-bounds "max min".
• Strong duality → necessity of the KKT conditions.
• Duality applied to SVMs gives:
  – The kernel trick
  – Support vectors
• Soft-margin SVM = hinge loss + ℓ2 regularization.
Resources
• Bishop, "Pattern Recognition and Machine Learning", Ch. 7
• Gordon & Tibshirani, 10-725 Optimization (Fall 2012) lecture slides: http://www.cs.cmu.edu/~ggordon/10725-F12/schedule.html
• Fiterau, "Kernels and SVM": http://alex.smola.org/teaching/cmu2013-10-701/slides/6_Recitation_Kernels.pdf