Support Vector Machines: Machine Learning Series. Jerry Jeychandra, Blohm Lab

Source: compneurosci.com/wiki/images/4/4f/Support_Vector_Machines_(SVM).pdf

  • Support Vector Machines: Machine Learning Series

    Jerry Jeychandra

    Blohm Lab

  • Outline

    • Main goal: to understand how support vector machines (SVMs) perform optimal classification for labelled data sets, with a quick look at the implementation level.

    1. What is a support vector machine?

    2. Formulating the SVM problem

    3. Optimization using Sequential Minimal Optimization

    4. Use of Kernels for non-linear classification

    5. Implementation of SVM in MATLAB environment

    6. Stochastic Gradient Descent (extra)

  • Support Vector Machines (SVM)

    [Figure: two classes of labelled data points plotted against features x1 and x2]

  • Support Vector Machines (SVM)

    Not a very good decision boundary!

    [Figure: a candidate decision boundary on the x1-x2 plane]

  • Support Vector Machines (SVM)

    Great decision boundary!

    [Figure: a separating line on the x1-x2 plane]

    One side of the line satisfies $ax_1 + bx_2 + c > 0$, the other satisfies $ax_1 + bx_2 + c < 0$, and the line itself is $ax_1 + bx_2 + c = 0$.

    In machine learning notation: $w^T x + b = 0$

  • Support Vector Machines (SVM)

    Great decision boundary!

    [Figure: the same separating line, now labelled in vector notation]

    $w^T x + b > 0$ on one side, $w^T x + b < 0$ on the other, and $w^T x + b = 0$ on the boundary.

  • Support Vector Machines (SVM)

    [Figure: the decision boundary $w^T x + b = 0$ together with the margin lines $w^T x + b = 1$ and $w^T x + b = -1$ on the x1-x2 plane]

  • Support Vector Machines (SVM)

    [Figure: the margin lines $w^T x + b = \pm 1$ sit at distances $-d$ and $+d$ from the boundary $w^T x + b = 0$]

    How do we maximize d?

  • Support Vector Machines (SVM)

    How do we maximize d? For a point on the margin line,

    $d = \frac{w^T x + b}{\lVert w \rVert} = \frac{1}{\lVert w \rVert}$

    so to maximize d we minimize $\lVert w \rVert$!

  • Support Vector Machines (SVM)

    $d = \frac{w^T x + b}{\lVert w \rVert} = \frac{1}{\lVert w \rVert}$

    Maximizing d therefore means minimizing $\lVert w \rVert$, or equivalently minimizing $\frac{1}{2}\lVert w \rVert^2$, which is easier to differentiate.

  • SVM Initial Constraints

    If our data point is labelled y = 1: $w^T x^{(i)} + b \ge 1$

    If y = -1: $w^T x^{(i)} + b \le -1$

    This can be succinctly summarized as: $y^{(i)}\left(w^T x^{(i)} + b\right) \ge 1$   (Hard-Margin SVM)

    We are going to allow for some margin violation: $y^{(i)}\left(w^T x^{(i)} + b\right) \ge 1 - \xi_i$, i.e. $-y^{(i)}\left(w^T x^{(i)} + b\right) + 1 - \xi_i \le 0$   (Soft-Margin SVM)

  • Support Vector Machines (SVM)

    [Figure: the margin geometry with a slack distance of $\xi / \lVert w \rVert$ marked for a point that violates the margin]

  • Optimization Objective

    Minimize: $\frac{1}{2}\lVert w \rVert^2 + C \sum_i^m \xi_i$

    Subject to: $-y^{(i)}\left(w^T x^{(i)} + b\right) + 1 - \xi_i \le 0$ and $\xi_i \ge 0$

    The term $C \sum_i \xi_i$ is the penalizing term for margin violations. (A sketch of this objective as a quadratic program follows below.)
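
    This primal problem is a standard quadratic program, so as an illustration it can be handed to a generic QP solver. The sketch below uses MATLAB's quadprog (Optimization Toolbox); the variable names X, y, C and the stacking of the decision variable are assumptions made for illustration, not part of the slides.

        % Soft-margin primal SVM as a quadratic program (minimal sketch).
        % X is m-by-n (one example per row), y is m-by-1 with labels in {-1,+1}.
        [m, n] = size(X);
        % Decision variable z = [w; b; xi] of length n + 1 + m
        H = blkdiag(eye(n), 0, zeros(m));         % quadratic term: (1/2)*||w||^2
        f = [zeros(n + 1, 1); C * ones(m, 1)];    % linear term: C * sum(xi)
        % Margin constraints: -y_i*(w'*x_i + b) + 1 - xi_i <= 0
        A    = [-(y .* X), -y, -eye(m)];
        bvec = -ones(m, 1);
        lb   = [-inf(n + 1, 1); zeros(m, 1)];     % xi >= 0; w and b unbounded
        z  = quadprog(H, f, A, bvec, [], [], lb, []);
        w  = z(1:n);  b = z(n + 1);  xi = z(n + 2:end);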

  • Outline

    • Main goal: to understand how support vector machines (SVMs) perform optimal classification for labelled data sets, with a quick look at the implementation level.

    1. What is a support vector machine?

    2. Formulating the SVM problem

    3. Optimization using Sequential Minimal Optimization

    4. Use of Kernels for non-linear classification

    5. Implementation of SVM in MATLAB environment

  • Optimization Objective

    Minimize $\frac{1}{2}\lVert w \rVert^2 + C \sum_i^m \xi_i$ s.t. $-y^{(i)}\left(w^T x^{(i)} + b\right) + 1 - \xi_i \le 0$ and $\xi_i \ge 0$

    Need to use the Lagrangian.

    Using the Lagrangian with inequality constraints requires us to fulfill the Karush-Kuhn-Tucker conditions...

  • Karush-Kuhn-Tucker Conditions

    There must exist a w* that solves the primal problem, while α* and b* are solutions to the dual problem. All three parameters must satisfy the following conditions:

    1. $\frac{\partial}{\partial w_i} \mathcal{L}(w^*, \alpha^*, b^*) = 0$

    2. $\frac{\partial}{\partial b} \mathcal{L}(w^*, \alpha^*, b^*) = 0$

    3. $\alpha_i^* g_i(w^*) = 0$

    4. $g_i(w^*) \le 0$

    5. $\alpha_i^* \ge 0$

    6. $\xi_i \ge 0$

    7. $r_i \xi_i = 0$

    Constraint 3: if $\alpha_i^*$ is non-zero, $g_i$ must be 0!

    Recall: $g_i(w) = -y^{(i)}\left(w^T x^{(i)} + b\right) + 1 - \xi_i \le 0$

    An example x must lie directly on the functional margin in order for $g_i(w)$ to equal 0. For those points $\alpha > 0$: these are our support vectors!

  • Optimization Objective

    Minimize $\frac{1}{2}\lVert w \rVert^2 + C \sum_i^m \xi_i$ subject to $-y^{(i)}\left(w^T x^{(i)} + b\right) + 1 - \xi_i \le 0$ and $\xi_i \ge 0$

    Need to use the Lagrangian:

    $\mathcal{L}(x, \alpha) = f(x) + \sum_i \lambda_i \cdot h_i(x) + \ldots$

    $\mathcal{L}_\mathcal{P}(w, b, \alpha, \xi, r) = \frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i - \sum_i \alpha_i \left[ y^{(i)}\left(w^T x^{(i)} + b\right) - 1 + \xi_i \right] - \sum_i r_i \xi_i$

    $\min_{w,b} \mathcal{L}_\mathcal{P} = \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^m \xi_i + \sum_{i=1}^m \alpha_i - \sum_{i=1}^m \alpha_i \left[ y^{(i)}\left(w^T x^{(i)} + b\right) + \xi_i \right] - \sum_{i=1}^m r_i \xi_i$

  • Optimization Objective

    The same expanded primal Lagrangian, with its terms labelled:

    $\min_{w,b} \mathcal{L}_\mathcal{P} = \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^m \xi_i + \sum_{i=1}^m \alpha_i - \sum_{i=1}^m \alpha_i \left[ y^{(i)}\left(w^T x^{(i)} + b\right) + \xi_i \right] - \sum_{i=1}^m r_i \xi_i$

    Here $\frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i$ is the minimization function, the $\alpha_i$ terms enforce the margin constraint, and the $r_i \xi_i$ term enforces the error constraint ($\xi_i \ge 0$).

  • The problem with the Primal

    $\mathcal{L}_\mathcal{P} = \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^m \xi_i + \sum_{i=1}^m \alpha_i - \sum_{i=1}^m \alpha_i \left[ y^{(i)}\left(w^T x^{(i)} + b\right) + \xi_i \right] - \sum_{i=1}^m r_i \xi_i$

    We could just optimize the primal equation and be finished with SVM optimization... BUT we would miss out on one of the most important aspects of the SVM: the kernel trick.

    We need the Lagrangian dual!

  • http://stats.stackexchange.com/questions/19181/why-bother-with-the-dual-problem-when-fitting-svm

  • Finding the Dual

    $\mathcal{L}_\mathcal{P} = \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^m \xi_i + \sum_{i=1}^m \alpha_i - \sum_{i=1}^m \alpha_i \left[ y^{(i)}\left(w^T x^{(i)} + b\right) + \xi_i \right] - \sum_{i=1}^m r_i \xi_i$

    First, compute the minimizing w, b and ξ:

    $\nabla_w \mathcal{L}(w, b, \alpha) = w - \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)} = 0 \;\Rightarrow\; w = \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)}$   (can solve for w and substitute it back in)

    $\frac{\partial}{\partial b} \mathcal{L}(w, b, \alpha) = \sum_{i=1}^m \alpha_i y^{(i)} = 0$   (new constraint)

    $\frac{\partial}{\partial \xi} \mathcal{L}(w, b, \alpha) = C - \alpha_i - r_i = 0 \;\Rightarrow\; \alpha_i = C - r_i \;\Rightarrow\; 0 \le \alpha_i \le C$   (new constraint)

  • Lagrangian Dual

    Substituting $w = \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)}$ into the primal Lagrangian gives the dual:

    $\mathcal{L}_\mathcal{D} = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y^{(i)} y^{(j)} \left( x^{(i)} \cdot x^{(j)} \right)$

    s.t. $\forall i: \sum_{i=1}^m \alpha_i y^{(i)} = 0$, $\xi \ge 0$ and $0 \le \alpha_i \le C$ (from the KKT conditions and the previous derivation)

  • Lagrangian Dual

    $\max_\alpha \mathcal{L}_\mathcal{D} = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y^{(i)} y^{(j)} \left( x^{(i)} \cdot x^{(j)} \right)$

    • S.T.: $\forall i: \sum_{i=1}^m \alpha_i y^{(i)} = 0$, $\xi \ge 0$, $0 \le \alpha_i \le C$ and the KKT conditions

    • Only the inner product of x(i) and x(j) is needed

    • No longer rely on w or b; the inner product can be replaced with a kernel K(x, z)

    • Can solve for α explicitly; all points not on the functional margin have α = 0, which makes this more efficient!

    (A quadprog sketch of this dual problem follows below.)
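
    As an illustration, the dual is itself a quadratic program in α and can be handed to a generic QP solver. The sketch below uses MATLAB's quadprog with a linear kernel; X, y, C, the tolerance and the variable names are assumptions for illustration, not part of the slides.

        % Dual soft-margin SVM as a quadratic program (minimal sketch).
        % quadprog minimizes, so we negate the dual objective.
        m    = size(X, 1);
        Kmat = X * X';                            % Gram matrix of inner products x(i).x(j)
        H    = (y * y') .* Kmat;                  % quadratic term in alpha
        f    = -ones(m, 1);                       % linear term: minimize -sum(alpha)
        Aeq  = y';  beq = 0;                      % equality constraint: sum_i alpha_i*y_i = 0
        lb   = zeros(m, 1);  ub = C * ones(m, 1); % box constraint: 0 <= alpha <= C
        alpha = quadprog(H, f, [], [], Aeq, beq, lb, ub);

        % Recover w and b from the support vectors (alpha > 0)
        w  = X' * (alpha .* y);
        sv = alpha > 1e-6 & alpha < C - 1e-6;     % multipliers strictly inside the box
        b  = mean(y(sv) - X(sv,:) * w);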

  • Outline

    • Main goal: to understand how support vector machines (SVMs) perform optimal classification for labelled data sets, with a quick look at the implementation level.

    1. What is a support vector machine?

    2. Formulating the SVM problem

    3. Optimization using Sequential Minimal Optimization

    4. Use of Kernels for non-linear classification

    5. Implementation of SVM in MATLAB environment

    6. Stochastic Gradient Descent (extra)

  • Sequential Minimal Optimization

    $\max_\alpha \mathcal{L}_\mathcal{D} = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y^{(i)} y^{(j)} \left( x^{(i)} \cdot x^{(j)} \right)$

    • Ideally, we would hold all α but one (αj) constant and optimize over that one, but the constraint $\sum_{i=1}^m \alpha_i y^{(i)} = 0$ already determines it from the others: $\alpha_1 = -y^{(1)} \sum_{i=2}^m \alpha_i y^{(i)}$

    • Solution: change two α at a time!

    Platt 1998

  • Sequential Minimal Optimization Algorithm

    Initialize all α to 0

    Repeat {

    1. Select an αj that violates the KKT conditions and an additional αk to update

    2. Re-optimize $\mathcal{L}_\mathcal{D}(\alpha)$ with respect to the two α's while holding all other α constant, and compute w, b

    }

    In order to keep $\sum_{i=1}^m \alpha_i y^{(i)} = 0$ true: $\alpha_j y^{(j)} + \alpha_k y^{(k)} = \zeta$

    Even after the update, this linear combination must remain constant!

    Platt 1998

  • Sequential Minimal Optimization

    • $0 \le \alpha \le C$ from the box constraint

    • The pair (αj, αk) is confined to a line by the linear constraint $\alpha_j y^{(j)} + \alpha_k y^{(k)} = \zeta$

    • Clip values if they exceed the bounds L or H (which lie within [0, C])

    [Figure: two C-by-C boxes in the (α1, α2) plane, one for each label case, showing the constraint line and the clipping bounds L and H]

    If $y^{(j)} \ne y^{(k)}$ then $\alpha_j - \alpha_k = \zeta$. If $y^{(j)} = y^{(k)}$ then $\alpha_j + \alpha_k = \zeta$.

  • Sequential Minimal Optimization

    Can analytically solve for the new α, w and b at each update step (closed-form computation):

    $\alpha_k^{new} = \alpha_k^{old} + \frac{y^{(k)} \left( E_j^{old} - E_k^{old} \right)}{\eta}$, where η is the second derivative of the dual with respect to the second multiplier

    $\alpha_j^{new} = \alpha_j^{old} + y^{(k)} y^{(j)} \left( \alpha_k^{old} - \alpha_k^{new,clipped} \right)$

    $b_1 = E_1 + y^{(j)} \left( \alpha_j^{new} - \alpha_j^{old} \right) K\left( x^{(j)}, x^{(j)} \right) + y^{(k)} \left( \alpha_k^{new,clipped} - \alpha_k^{old} \right) K\left( x^{(j)}, x^{(k)} \right)$

    $b_2 = E_2 + y^{(j)} \left( \alpha_j^{new} - \alpha_j^{old} \right) K\left( x^{(j)}, x^{(k)} \right) + y^{(k)} \left( \alpha_k^{new,clipped} - \alpha_k^{old} \right) K\left( x^{(k)}, x^{(k)} \right)$

    $b = \frac{b_1 + b_2}{2}$

    $w = \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)}$

    Platt 1998

    (A MATLAB sketch of one such update step follows below.)
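
    As an illustration, here is a minimal sketch of one SMO pair update in MATLAB, following the simplified SMO of Platt 1998. It assumes a precomputed kernel matrix K (m-by-m), labels y in {-1,+1}, multipliers alpha, threshold b, box constraint C and a chosen index pair (j, k); the sign conventions follow Platt's pseudocode and may differ in detail from the slide above.

        % One SMO pair update (minimal sketch)
        f  = (alpha .* y)' * K + b;              % current decision values f(x^(i))
        E  = f' - y;                             % prediction errors E_i
        Ej = E(j);  Ek = E(k);

        % Bounds L and H so that 0 <= alpha <= C still holds after the update
        if y(j) ~= y(k)
            L = max(0, alpha(k) - alpha(j));      H = min(C, C + alpha(k) - alpha(j));
        else
            L = max(0, alpha(k) + alpha(j) - C);  H = min(C, alpha(k) + alpha(j));
        end

        eta = 2 * K(j,k) - K(j,j) - K(k,k);      % (negative) second derivative of the dual
        if eta < 0
            alphaJold = alpha(j);  alphaKold = alpha(k);

            % Unclipped update for alpha_k, then clip to [L, H]
            alpha(k) = alpha(k) - y(k) * (Ej - Ek) / eta;
            alpha(k) = min(H, max(L, alpha(k)));

            % Move alpha_j so the linear constraint stays satisfied
            alpha(j) = alpha(j) + y(j) * y(k) * (alphaKold - alpha(k));

            % Recompute the threshold b from both updated multipliers
            b1 = b - Ej - y(j) * (alpha(j) - alphaJold) * K(j,j) ...
                       - y(k) * (alpha(k) - alphaKold) * K(j,k);
            b2 = b - Ek - y(j) * (alpha(j) - alphaJold) * K(j,k) ...
                       - y(k) * (alpha(k) - alphaKold) * K(k,k);
            b  = (b1 + b2) / 2;
        end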

  • Outline

    • Main goal: to understand how support vector machines (SVMs) perform optimal classification for labelled data sets, with a quick look at the implementation level.

    1. What is a support vector machine?

    2. Formulating the SVM problem

    3. Optimization using Sequential Minimal Optimization

    4. Use of Kernels for non-linear classification

    5. Implementation of SVM in MATLAB environment

    6. Stochastic Gradient Descent (extra)

  • Kernels

    Recall: $\max_\alpha \mathcal{L}_\mathcal{D} = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y^{(i)} y^{(j)} \left( x^{(i)} \cdot x^{(j)} \right)$

    We can generalize this using a kernel function $K\left( x^{(i)}, x^{(j)} \right)$:

    $K\left( x^{(i)}, x^{(j)} \right) = \phi\left( x^{(i)} \right)^T \phi\left( x^{(j)} \right)$

    where φ is a feature-mapping function.

    It turns out that we don't need to explicitly compute φ(x), thanks to the kernel trick!

  • Kernel Trick

    Suppose $K(x, z) = \left( x^T z \right)^2$. Then

    $K(x, z) = \left( \sum_{i=1}^n x_i z_i \right) \left( \sum_{j=1}^n x_j z_j \right) = \sum_{i=1}^n \sum_{j=1}^n x_i z_i x_j z_j = \sum_{i,j=1}^n \left( x_i x_j \right) \left( z_i z_j \right)$

    Representing each feature of φ(x) explicitly: O(n²). Kernel trick: O(n). (A quick numerical check follows below.)
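
    A quick numerical check of this identity in MATLAB; the explicit feature map phi and the variable names are assumptions used purely for illustration.

        % Verify that K(x,z) = (x'*z)^2 equals the explicit feature mapping
        n = 4;
        x = randn(n, 1);
        z = randn(n, 1);

        k_fast = (x' * z)^2;                  % O(n): evaluate the kernel directly

        phi    = @(v) reshape(v * v', [], 1); % O(n^2): all pairwise products v_i*v_j
        k_slow = phi(x)' * phi(z);

        fprintf('kernel: %.6f   explicit features: %.6f\n', k_fast, k_slow);
        % The two values agree, without ever forming phi in the fast route.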

  • Types of Kernels

    Linear kernel: $K(x, z) = x^T z$

    Gaussian kernel (radial basis function): $K(x, z) = \exp\left( -\frac{\lVert x - z \rVert^2}{2\sigma^2} \right)$

    Polynomial kernel: $K(x, z) = \left( x^T z + c \right)^d$

    Any of these can be dropped into the dual (one-liner versions are sketched below):

    $\max_\alpha \mathcal{L}_\mathcal{D} = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y^{(i)} y^{(j)} K\left( x^{(i)}, x^{(j)} \right)$
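
    The three kernels written out as MATLAB anonymous functions, as a sketch; sigma, c, d and the names are illustrative choices rather than anything from the slides.

        linearK   = @(x, z) x' * z;
        gaussianK = @(x, z, sigma) exp(-((x - z)' * (x - z)) / (2 * sigma^2));
        polyK     = @(x, z, c, d) (x' * z + c)^d;

        % Example: evaluate each kernel on two column vectors
        x = [1; 2];  z = [0.5; -1];
        [linearK(x, z), gaussianK(x, z, 1.0), polyK(x, z, 1, 3)]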

  • Outline

    • Main goal: to understand how support vector machines (SVMs) perform optimal classification for labelled data sets, with a quick look at the implementation level.

    1. What is a support vector machine?

    2. Formulating the SVM problem

    3. Optimization using Sequential Minimal Optimization

    4. Use of Kernels for non-linear classification

    5. Implementation of SVM in MATLAB environment

    6. Stochastic Gradient Descent (extra)

  • Example 1: Linear Kernel

  • Example 1: Linear Kernel

    In MATLAB, I used svmTrain:

    model = svmTrain(X, y, C, @linearKernel, 1e-3, 500);

    • 500 max iterations
    • C = 1000
    • KKT tol = 1e-3

    (A plausible linearKernel definition is sketched below.)
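
    For reference, a plausible definition of the linearKernel function passed to svmTrain above (svmTrain and linearKernel come from the Coursera machine-learning exercise code listed in the resources; this sketch is an assumption about their contents, not a copy):

        function sim = linearKernel(x1, x2)
        % LINEARKERNEL returns the inner product between column vectors x1 and x2
            sim = x1' * x2;
        end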

  • Example 2: Gaussian Kernel (RBF)

  • Example 2: Gaussian Kernel (RBF)

    In MATLAB, the Gaussian kernel is:

    gSim = exp(-1*((x1-x2)'*(x1-x2))/(2*sigma^2));

    model = svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma), 0.1, 300);

    • 300 max iterations
    • C = 1
    • KKT tol = 0.1

  • Example 2: Gaussian Kernel (RBF)

    In MATLAB, with the same Gaussian kernel:

    gSim = exp(-1*((x1-x2)'*(x1-x2))/(2*sigma^2));

    model = svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma), 0.001, 300);

    • 300 max iterations
    • C = 1
    • KKT tol = 0.001

  • Example 2: Gaussian Kernel (RBF)

    In MATLAB, with the same Gaussian kernel:

    gSim = exp(-1*((x1-x2)'*(x1-x2))/(2*sigma^2));

    model = svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma), 0.001, 300);

    • 300 max iterations
    • C = 10
    • KKT tol = 0.001

  • Resources

    • http://research.microsoft.com/pubs/69644/tr-98-14.pdf

    • ftp://www.ai.mit.edu/pub/users/tlp/projects/svm/svm-smo/smo.pdf

    • http://www.cs.toronto.edu/~urtasun/courses/CSC2515/09svm-2515.pdf

    • http://www.pstat.ucsb.edu/student%20seminar%20doc/svm2.pdf

    • http://www.ics.uci.edu/~dramanan/teaching/ics273a_winter08/lectures/lecture11.pdf

    • http://stats.stackexchange.com/questions/19181/why-bother-with-the-dual-problem-when-fitting-svm

    • http://cs229.stanford.edu/notes/cs229-notes3.pdf

    • http://www.svms.org/tutorials/Berwick2003.pdf

    • http://www.med.nyu.edu/chibi/sites/default/files/chibi/Final.pdf

    • https://share.coursera.org/wiki/index.php/ML:Main

    • https://www.youtube.com/watch?v=vqoVIchkM7I


  • Outline

    • Main goal: to understand how support vector machines (SVMs) perform optimal classification for labelled data sets, with a quick look at the implementation level.

    1. What is a support vector machine?

    2. Formulating the SVM problem

    3. Optimization using Sequential Minimal Optimization

    4. Use of Kernels for non-linear classification

    5. Implementation of SVM in MATLAB environment

    6. Stochastic Gradient Descent (extra)

  • Hinge-Loss Primal SVM

    • Recall the original problem: minimize $\frac{1}{2} w^T w + C \sum_{i=1}^m \xi_i$ s.t. $y^{(i)}\left(w^T x^{(i)} + b\right) - 1 + \xi_i \ge 0$

    Re-write it in terms of the hinge-loss function h:

    $\mathcal{L}_\mathcal{P} = \frac{\lambda}{2} w^T w + \frac{1}{n} \sum_{i=1}^n h\left( 1 - y^{(i)}\left( w^T x^{(i)} + b \right) \right)$

    The first term is the regularization; the sum is the hinge-loss cost function. (A one-line MATLAB version follows below.)
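
    A one-line MATLAB version of this objective, as a sketch; it assumes X is m-by-n with one example per row, y is m-by-1 in {-1,+1}, and h(z) = max(0, z):

        % Hinge-loss primal objective (regularizer + average hinge loss)
        hingePrimal = @(w, b, X, y, lambda) ...
            (lambda / 2) * (w' * w) + mean(max(0, 1 - y .* (X * w + b)));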

  • Hinge Loss Function

    $h\left( 1 - y^{(i)}\left( w^T x^{(i)} + b \right) \right)$

    [Figure: the hinge loss plotted against $y^{(i)} f(x^{(i)})$ over the range -3 to 3; the loss is zero above 1 and grows linearly below it]

    It penalizes the cases where the signs of f(x^(i)) and y^(i) don't match!

  • Hinge-Loss Primal SVM

    Hinge-loss primal:

    $\mathcal{L}_\mathcal{P} = \frac{\lambda}{2} w^T w + \frac{1}{n} \sum_{i=1}^n h\left( 1 - y^{(i)}\left( w^T x^{(i)} + b \right) \right)$

    Compute the gradient (sub-gradient):

    $\nabla_w = \lambda w - \frac{1}{n} \sum_{i=1}^n y^{(i)} x^{(i)} \, L\left[ y^{(i)}\left( w^T x^{(i)} + b \right) < 1 \right]$

    where L is the indicator function: L(y^(i) f(x^(i))) = 0 if the example is classified correctly (outside the margin), and equal to the hinge slope if it is not.

  • Stochastic Gradient Descent

    We use random sampling here to pick out data points:

    $\lambda w - \mathbb{E}_{i \in U}\left[ y^{(i)} x^{(i)} \, L\left[ y^{(i)}\left( w^T x^{(i)} + b \right) < 1 \right] \right]$

    Use this to compute the update rule:

    $w_t := w_{t-1} + \frac{1}{t} \left( y^{(i)} x^{(i)} \, L\left[ y^{(i)}\left( x^{(i)T} w_{t-1} + b \right) < 1 \right] - \lambda w_{t-1} \right)$

    Here i is sampled from a uniform distribution over the training examples.

    Shalev-Shwartz et al. 2007

    (A MATLAB sketch of this update loop follows below.)
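
    To close, a minimal MATLAB sketch of this stochastic sub-gradient loop (Pegasos-style, after Shalev-Shwartz et al. 2007). X, y, lambda, T and the bias update are illustrative assumptions; the slide itself only specifies the update for w.

        function [w, b] = svmSGD(X, y, lambda, T)
        % X is m-by-n (one example per row), y is m-by-1 with labels in {-1,+1}
            [m, n] = size(X);
            w = zeros(n, 1);
            b = 0;
            for t = 1:T
                i = randi(m);                        % sample i uniformly at random
                step = 1 / t;                        % decaying step size, as in the slide
                if y(i) * (X(i,:) * w + b) < 1       % hinge active: include the data term
                    w = w + step * (y(i) * X(i,:)' - lambda * w);
                    b = b + step * y(i);             % bias update is an assumption
                else                                 % hinge inactive: only the regularizer
                    w = w - step * lambda * w;
                end
            end
        end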