
Machine Learning

Support Vector Machines: Training with Stochastic Gradient Descent

Support vector machines

• Training by maximizing margin
• The SVM objective
• Solving the SVM optimization problem
• Support vectors, duals and kernels

SVM objective function

$$\min_{\mathbf{w}} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \sum_i \max\left(0, 1 - y_i \mathbf{w}^T\mathbf{x}_i\right)$$

Regularization term:
• Maximizes the margin
• Imposes a preference over the hypothesis space and pushes for better generalization
• Can be replaced with other regularization terms which impose other preferences

Empirical loss:
• Hinge loss
• Penalizes weight vectors that make mistakes
• Can be replaced with other loss functions which impose other preferences

The hyper-parameter $C$ controls the tradeoff between a large margin and a small hinge loss.
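To make the pieces concrete, here is a minimal numpy sketch of this objective. The function name svm_objective and the data layout (a matrix X with one row per example, labels y in {−1, 1}) are illustrative assumptions, not part of the slides.

```python
import numpy as np

def svm_objective(w, X, y, C):
    """Regularized hinge loss: 0.5 * w^T w + C * sum_i max(0, 1 - y_i w^T x_i)."""
    margins = y * (X @ w)                    # y_i * w^T x_i for every example
    hinge = np.maximum(0.0, 1.0 - margins)   # hinge loss, zero beyond margin 1
    return 0.5 * w @ w + C * hinge.sum()     # regularizer + empirical loss
```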

Outline: Training SVM by optimization

1. Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron


Solving the SVM optimization problem

$$\min_{\mathbf{w}} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \sum_i \max\left(0, 1 - y_i \mathbf{w}^T\mathbf{x}_i\right)$$

This function is convex in $\mathbf{w}$.

Recall: Convex functions

A function $f$ is convex if for every $\mathbf{u}, \mathbf{v}$ in the domain, and for every $\lambda \in [0,1]$, we have

$$f(\lambda \mathbf{u} + (1-\lambda)\mathbf{v}) \le \lambda f(\mathbf{u}) + (1-\lambda) f(\mathbf{v})$$

From a geometric perspective: the chord between $(\mathbf{u}, f(\mathbf{u}))$ and $(\mathbf{v}, f(\mathbf{v}))$ lies above the function, and every tangent plane lies below the function.

Convex functions

Examples: linear functions are convex; the max of convex functions is convex.

Some ways to show that a function is convex:

1. Using the definition of convexity
2. Showing that the second derivative is non-negative (for one-dimensional functions)
3. Showing that the Hessian is positive semi-definite (for vector functions)

Not all functions are convex. Some are concave, satisfying

$$f(\lambda \mathbf{u} + (1-\lambda)\mathbf{v}) \ge \lambda f(\mathbf{u}) + (1-\lambda) f(\mathbf{v}),$$

and some are neither convex nor concave.
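As a quick sanity check of the definition, this sketch numerically verifies the convexity inequality for the SVM objective on random data. The setup (random X and y, C = 1, dimension 5) is assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.choice([-1.0, 1.0], size=20)
C = 1.0
J = lambda w: 0.5 * w @ w + C * np.maximum(0.0, 1.0 - y * (X @ w)).sum()

u, v = rng.normal(size=5), rng.normal(size=5)
for lam in np.linspace(0.0, 1.0, 11):
    lhs = J(lam * u + (1 - lam) * v)
    rhs = lam * J(u) + (1 - lam) * J(v)
    assert lhs <= rhs + 1e-9   # J(lam*u + (1-lam)*v) <= lam*J(u) + (1-lam)*J(v)
```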

Convex functions are convenient

A function $f$ is convex if for every $\mathbf{u}, \mathbf{v}$ in the domain, and for every $\lambda \in [0,1]$, we have

$$f(\lambda \mathbf{u} + (1-\lambda)\mathbf{v}) \le \lambda f(\mathbf{u}) + (1-\lambda) f(\mathbf{v})$$

In general, a necessary condition for $\mathbf{x}$ to be a minimum of the function $f$ is that the gradient $\nabla f(\mathbf{x}) = 0$. For convex functions, this condition is both necessary and sufficient.

Solving the SVM optimization problem

$$\min_{\mathbf{w}} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \sum_i \max\left(0, 1 - y_i \mathbf{w}^T\mathbf{x}_i\right)$$

This function is convex in $\mathbf{w}$.

• This is a quadratic optimization problem because the objective is quadratic
• Older methods used techniques from quadratic programming – very slow
• There are no constraints, so we can use gradient descent – still very slow!

Gradient descent

We are trying to minimize

$$J(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \sum_i \max\left(0, 1 - y_i \mathbf{w}^T\mathbf{x}_i\right)$$

General strategy for minimizing a function $J(\mathbf{w})$:

• Start with an initial guess for $\mathbf{w}$, say $\mathbf{w}_0$
• Iterate until convergence:
  – Compute the gradient of $J$ at $\mathbf{w}_t$
  – Update $\mathbf{w}_t$ to get $\mathbf{w}_{t+1}$ by taking a step in the opposite direction of the gradient

Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction.

(Figure: successive iterates $\mathbf{w}_0, \mathbf{w}_1, \mathbf{w}_2, \mathbf{w}_3, \ldots$ descending the curve of $J(\mathbf{w})$.)
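A minimal Python sketch of this strategy. The helper name gradient_descent, the fixed learning rate r, and the stopping test on the step size are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def gradient_descent(grad_J, w0, r=0.01, n_steps=1000, tol=1e-6):
    """Minimize J by repeatedly stepping against its gradient grad_J."""
    w = w0.copy()
    for _ in range(n_steps):
        g = grad_J(w)                            # gradient of J at the current w
        w_next = w - r * g                       # step in the opposite direction
        if np.linalg.norm(w_next - w) < tol:     # stop when updates stall
            return w_next
        w = w_next
    return w
```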

Gradient descent for SVM

We are trying to minimize

$$J(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \sum_i \max\left(0, 1 - y_i \mathbf{w}^T\mathbf{x}_i\right)$$

1. Initialize $\mathbf{w}_0$
2. For $t = 0, 1, 2, \ldots$:
   1. Compute the gradient of $J(\mathbf{w})$ at $\mathbf{w}_t$. Call it $\nabla J(\mathbf{w}_t)$
   2. Update: $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - r \nabla J(\mathbf{w}_t)$

$r$: the learning rate.

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

Gradient descent for SVM

The gradient of the SVM objective requires summing over the entire training set. Each step is therefore slow, and the approach does not really scale, as the sketch below makes concrete.
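A sketch of the full-batch (sub-)gradient, showing why each step touches every example. The function name and data layout are assumptions; the kink of the hinge term is handled by the sub-gradient discussed later in these slides.

```python
import numpy as np

def svm_full_gradient(w, X, y, C):
    """(Sub-)gradient of J(w) = 0.5 w^T w + C sum_i max(0, 1 - y_i w^T x_i).
    Note the sum over the entire training set: every step costs O(N d)."""
    margins = y * (X @ w)                    # y_i * w^T x_i for all N examples
    active = margins <= 1                    # examples with nonzero hinge loss
    return w - C * (y[active][:, None] * X[active]).sum(axis=0)
```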

Stochastic gradient descent for SVM

$$J(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \sum_i \max\left(0, 1 - y_i \mathbf{w}^T\mathbf{x}_i\right)$$

Given a training set $S = \{(\mathbf{x}_i, y_i)\}$, $\mathbf{x} \in \Re^d$, $y \in \{-1, 1\}$:

1. Initialize $\mathbf{w}_0 = 0 \in \Re^d$
2. For epoch = 1 … T:
   1. Pick a random example $(\mathbf{x}_i, y_i)$ from the training set $S$
   2. Treat $(\mathbf{x}_i, y_i)$ as a full dataset and take the derivative of the SVM objective $\tilde{J}$ at the current $\mathbf{w}_{t-1}$. Call it $\nabla \tilde{J}(\mathbf{w}_{t-1})$, where
      $$\tilde{J}(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \max\left(0, 1 - y_i \mathbf{w}^T\mathbf{x}_i\right)$$
   3. Update: $\mathbf{w}_t \leftarrow \mathbf{w}_{t-1} - \gamma_t \nabla \tilde{J}(\mathbf{w}_{t-1})$
3. Return final $\mathbf{w}$

This algorithm is guaranteed to converge to the minimum of $J$ if $\gamma_t$ is small enough. Why? The objective $J(\mathbf{w})$ is a convex function.
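A generic sketch of this loop, with the per-example gradient passed in as a callable. The function names, the callable step-size schedule gamma, and the epoch-times-N iteration count are illustrative assumptions.

```python
import numpy as np

def sgd(grad_example, X, y, gamma, T, seed=0):
    """Generic SGD: each update uses a gradient estimated from one random example."""
    w = np.zeros(X.shape[1])
    rng = np.random.default_rng(seed)
    for t in range(1, T * len(y) + 1):
        i = rng.integers(len(y))                           # step 2.1: pick an example
        w = w - gamma(t) * grad_example(w, X[i], y[i])     # steps 2.2 and 2.3
        # each update costs O(d), independent of the dataset size N
    return w
```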

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

Gradient Descent vs SGD

(Figure sequence: one panel labeled "Gradient descent" followed by a series of panels labeled "Stochastic gradient descent", showing the iterates update by update.)

Many more updates than gradient descent, but each individual update is less computationally expensive.

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

Stochastic gradient descent for SVM

Recall step 2.2 of the algorithm: we treat $(\mathbf{x}_i, y_i)$ as a full dataset and take the derivative of

$$\tilde{J}(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \max\left(0, 1 - y_i \mathbf{w}^T\mathbf{x}_i\right)$$

at the current weight vector. But the hinge loss is not differentiable!

What is the derivative of the hinge loss with respect to $\mathbf{w}$?

Detour: Sub-gradients

A generalization of gradients to non-differentiable functions. (Remember that for a convex function, every tangent is a hyperplane that lies below the function.)

Informally, a sub-tangent at a point is any hyperplane that lies below the function at that point. A sub-gradient is the slope of such a hyperplane.

Sub-gradients

Formally, a vector $\mathbf{g}$ is a sub-gradient of $f$ at a point $\mathbf{x}$ if, for every $\mathbf{z}$,

$$f(\mathbf{z}) \ge f(\mathbf{x}) + \mathbf{g}^T(\mathbf{z} - \mathbf{x})$$

[Example from Boyd] (Figure: $f$ is differentiable at $x_1$, where $g_1$, the slope of the tangent at that point, is the unique sub-gradient; at the kink $x_2$, both $g_2$ and $g_3$ are sub-gradients.)
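A small numeric check of this condition for the one-dimensional hinge loss, whose kink at x = 1 admits many sub-gradients. The particular candidate slopes tested are assumptions for illustration; any slope in [−1, 0] works at the kink.

```python
import numpy as np

# f(x) = max(0, 1 - x) is the hinge loss in one dimension, with a kink at x = 1.
f = lambda x: np.maximum(0.0, 1.0 - x)

x0 = 1.0
zs = np.linspace(-3.0, 3.0, 61)
for g in (-1.0, -0.5, 0.0):   # candidate sub-gradients at the kink
    # sub-gradient condition: f(z) >= f(x0) + g * (z - x0) for all z
    assert np.all(f(zs) >= f(x0) + g * (zs - x0) - 1e-12)
```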

Sub-gradient of the SVM objective

$$\tilde{J}(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \max\left(0, 1 - y_i \mathbf{w}^T\mathbf{x}_i\right)$$

General strategy: first solve the max to identify the two cases, then compute the gradient for each case:

$$\nabla \tilde{J}(\mathbf{w}) = \begin{cases} \mathbf{w} - C y_i \mathbf{x}_i & \text{if } y_i \mathbf{w}^T\mathbf{x}_i \le 1 \\ \mathbf{w} & \text{otherwise} \end{cases}$$

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

Stochastic sub-gradient descent for SVM

Given a training set $S = \{(\mathbf{x}_i, y_i)\}$, $\mathbf{x} \in \Re^d$, $y \in \{-1, 1\}$:

1. Initialize $\mathbf{w} = 0 \in \Re^d$
2. For epoch = 1 … T:
   For each training example $(\mathbf{x}_i, y_i) \in S$:
      If $y_i \mathbf{w}^T\mathbf{x}_i \le 1$: $\mathbf{w} \leftarrow (1 - \gamma_t)\mathbf{w} + \gamma_t C y_i \mathbf{x}_i$
      else: $\mathbf{w} \leftarrow (1 - \gamma_t)\mathbf{w}$
3. Return $\mathbf{w}$

Both branches are instances of the update $\mathbf{w} \leftarrow \mathbf{w} - \gamma_t \nabla \tilde{J}(\mathbf{w})$, specialized to the two cases of the sub-gradient above.

$\gamma_t$: the learning rate; many tweaks are possible.

It is important to shuffle the examples at the start of each epoch.
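Putting the pieces together, here is a runnable sketch of the algorithm above. The function name svm_sgd and the specific decaying step size are illustrative assumptions, since the slides leave the schedule open.

```python
import numpy as np

def svm_sgd(X, y, C=1.0, gamma0=0.01, T=50, seed=0):
    """Stochastic sub-gradient descent for the linear SVM (a sketch of the
    algorithm above; the step-size schedule is one of many possibilities)."""
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(seed)
    t = 0
    for epoch in range(T):
        order = rng.permutation(n)           # shuffle at the start of each epoch
        for i in order:
            t += 1
            gamma = gamma0 / (1 + t)         # a simple decaying learning rate (assumption)
            if y[i] * (w @ X[i]) <= 1:       # hinge loss is active: margin violated
                w = (1 - gamma) * w + gamma * C * y[i] * X[i]
            else:                            # only the regularizer contributes
                w = (1 - gamma) * w
    return w
```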

Convergence and learning rates

With enough iterations, the algorithm will converge in expectation, provided the step sizes are "square summable, but not summable":

• The step sizes $\gamma_t$ are positive
• The sum of the squares of the step sizes over $t = 1$ to $\infty$ is finite
• The sum of the step sizes over $t = 1$ to $\infty$ is infinite

Some examples: $\gamma_t = \dfrac{\gamma_0}{1 + \gamma_0 t / C}$ or $\gamma_t = \dfrac{\gamma_0}{1 + t}$
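A quick numeric illustration of the two conditions for the schedule $\gamma_t = \gamma_0 / (1 + t)$; the horizon of one million steps is an arbitrary choice for the demonstration.

```python
import numpy as np

# gamma_t = gamma0 / (1 + t): the sum of the steps diverges (harmonic series),
# while the sum of the squared steps converges, exactly as the conditions require.
gamma0 = 1.0
t = np.arange(1, 1_000_001)
g = gamma0 / (1 + t)
print(g.sum())         # keeps growing as the horizon increases (not summable)
print((g ** 2).sum())  # approaches a finite limit (square summable)
```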

Convergence and learning rates

Number of iterations to get to accuracy within $\epsilon$, for strongly convex functions, with $N$ examples in $d$ dimensions:

– Gradient descent: $O\!\left(Nd \ln \frac{1}{\epsilon}\right)$
– Stochastic gradient descent: $O\!\left(\frac{d}{\epsilon}\right)$

There are more subtleties involved, but SGD is generally preferable when the data size is huge.

More recently, many variants based on this general strategy have appeared: Adagrad, momentum, Nesterov's accelerated gradient, Adam, RMSProp, etc.

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
✓ Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

Stochastic sub-gradient descent for SVM

Recall the SVM update: for each training example $(\mathbf{x}_i, y_i) \in S$,

If $y_i \mathbf{w}^T\mathbf{x}_i \le 1$: $\mathbf{w} \leftarrow (1 - \gamma_t)\mathbf{w} + \gamma_t C y_i \mathbf{x}_i$, else: $\mathbf{w} \leftarrow (1 - \gamma_t)\mathbf{w}$

Compare with the Perceptron update:

If $y_i \mathbf{w}^T\mathbf{x}_i \le 0$: $\mathbf{w} \leftarrow \mathbf{w} + \gamma_t y_i \mathbf{x}_i$
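The two updates side by side as code. The function names are illustrative, and the Perceptron's rate is written gamma to parallel the SVM update, as the slide does.

```python
import numpy as np

def perceptron_update(w, x, y, gamma):
    """Perceptron: update only on a mistake (margin <= 0); no regularization."""
    if y * (w @ x) <= 0:
        return w + gamma * y * x
    return w

def svm_update(w, x, y, gamma, C):
    """SVM sub-gradient step: update whenever the margin is <= 1, and always shrink w."""
    if y * (w @ x) <= 1:
        return (1 - gamma) * w + gamma * C * y * x
    return (1 - gamma) * w
```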

Perceptron vs. SVM

• Perceptron: stochastic sub-gradient descent for a different loss, with no regularization
• SVM optimizes the hinge loss, with regularization

SVM summary from optimization perspective

• Minimize regularized hinge loss
• Solve using stochastic gradient descent
  – Very fast: the cost of each update does not depend on the number of examples
  – Compare with the Perceptron algorithm: the Perceptron does not maximize margin width, though Perceptron variants can force a margin
  – The convergence criterion is an issue: SGD can be aggressive in the beginning and get to a reasonably good solution fast, but convergence is slow if a very accurate weight vector is needed
• Other successful optimization algorithms exist
  – E.g., dual coordinate descent, implemented in LIBLINEAR

Questions?
