Support Vector Machines
Machine Learning Series
Jerry Jeychandra
Blohm Lab
Outline
• Main goal: To understand how support vector machines (SVMs) perform optimal classification on labelled data sets, with a quick look at the implementation level.
1. What is a support vector machine?
2. Formulating the SVM problem
3. Optimization using Sequential Minimal Optimization
4. Use of Kernels for non-linear classification
5. Implementation of SVM in MATLAB environment
6. Stochastic Gradient Descent (extra)
Support Vector Machines (SVM)
[Figure: labelled data points plotted in the x1–x2 plane]
Support Vector Machines (SVM)
Not very good decision boundary!
[Figure: a candidate separating line in the x1–x2 plane]
Support Vector Machines (SVM)
Great decision boundary!
[Figure: a separating line in the x1–x2 plane]
One side: $ax_1 + bx_2 + c > 0$; the other: $ax_1 + bx_2 + c < 0$
Decision boundary: $ax_1 + bx_2 + c = 0$
In machine learning… $w^T x + b = 0$, with $w^T x + b > 0$ on one side and $w^T x + b < 0$ on the other
Support Vector Machines (SVM)
[Figure: decision boundary with margin lines at distances $-d$ and $+d$ in the x1–x2 plane]
$w^T x + b = 0$
$w^T x + b = -1$
$w^T x + b = 1$
How do we maximize d?
Support Vector Machines (SVM)
[Figure: decision boundary $w^T x + b = 0$ with margin lines $w^T x + b = \pm 1$ at distances $-d$ and $+d$]
How do we maximize d?
$d = \frac{w^T x + b}{\|w\|} = \frac{1}{\|w\|}$ for a point on the margin, so maximizing d means minimizing $\|w\|$ — or, equivalently, minimizing $\frac{1}{2}\|w\|^2$.
SVM Initial Constraints
If our data point is labelled as $y^{(i)} = 1$: $w^T x^{(i)} + b \ge 1$
If $y^{(i)} = -1$: $w^T x^{(i)} + b \le -1$
This can be succinctly summarized as: $y^{(i)}\left(w^T x^{(i)} + b\right) \ge 1$ (Hard Margin SVM)
We're going to allow for some margin violation: $y^{(i)}\left(w^T x^{(i)} + b\right) \ge 1 - \xi_i$, i.e. $-y^{(i)}\left(w^T x^{(i)} + b\right) + 1 - \xi_i \le 0$ (Soft Margin SVM)
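To make the constraints concrete, here is a minimal MATLAB sketch (the toy data and the candidate w, b are assumptions for illustration, not values from the slides) that checks the hard-margin constraint and computes the slack each point would need under the soft margin:

% Hypothetical toy data: rows of X are points, y holds the +1/-1 labels
X = [2 2; 3 1; -1 -2; -2 -1];
y = [1; 1; -1; -1];
w = [1; 1];  b = 0;                    % some candidate hyperplane (assumed)
margins = y .* (X * w + b);            % y^(i) * (w' * x^(i) + b) for every point
hardOK  = margins >= 1;                % hard-margin constraint satisfied?
xi      = max(0, 1 - margins);         % slack needed to satisfy the soft constraint
disp([margins, hardOK, xi]);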
Support Vector Machines (SVM)
[Figure: margin lines $w^T x + b = \pm 1$ around the boundary $w^T x + b = 0$; a point violating the margin sits a distance $\xi/\|w\|$ past its margin line]
Optimization Objective
Minimize: $\frac{1}{2}\|w\|^2 + C\sum_i \xi_i$
Subject to: $-y^{(i)}\left(w^T x^{(i)} + b\right) + 1 - \xi_i \le 0$ and $\xi_i \ge 0$
$C\sum_i \xi_i$ is the penalizing term.
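As a quick illustration of the penalizing term (again with hypothetical data and a hypothetical w, b, not values from the slides), the objective can be evaluated directly; increasing C makes margin violations more expensive relative to the margin-width term:

% Soft-margin primal objective: 0.5*||w||^2 + C*sum(xi)
X = [2 2; 3 1; -1 -2; 0.5 0.5];  y = [1; 1; -1; -1];   % last point violates its margin
w = [1; 1];  b = 0;
xi = max(0, 1 - y .* (X * w + b));     % slack implied by this (w, b)
for C = [0.1 1 100]
    fprintf('C = %6.1f  ->  J = %.3f\n', C, 0.5*(w'*w) + C*sum(xi));
end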
Outline
• Main goal: To understand how support vector machines (SVMs) perform optimal classification on labelled data sets, with a quick look at the implementation level.
1. What is a support vector machine?
2. Formulating the SVM problem
3. Optimization using Sequential Minimal Optimization
4. Use of Kernels for non-linear classification
5. Implementation of SVM in MATLAB environment
Optimization Objective
Minimize $\frac{1}{2}\|w\|^2 + C\sum_i \xi_i$ subject to $-y^{(i)}\left(w^T x^{(i)} + b\right) + 1 - \xi_i \le 0$ and $\xi_i \ge 0$.
Need to use the Lagrangian.
Using the Lagrangian with inequality constraints requires us to fulfill the Karush-Kuhn-Tucker conditions…
Karush-Kuhn-Tucker Conditions
There must exist a w* that solves the primal problem, while α* and b* are solutions to the dual problem. These parameters must satisfy the following conditions:
1. $\frac{\partial}{\partial w_i}\mathcal{L}(w^*, \alpha^*, \beta^*) = 0$
2. $\frac{\partial}{\partial \beta_i}\mathcal{L}(w^*, \alpha^*, \beta^*) = 0$
3. $\alpha_i^* g_i(w^*) = 0$
4. $g_i(w^*) \le 0$
5. $\alpha_i^* \ge 0$
6. $r_i \ge 0$
7. $r_i \xi_i = 0$
Constraint 3: if $\alpha_i^*$ is a non-zero number, $g_i$ must be 0!
Recall: $g_i(w) = -y^{(i)}\left(w^T x^{(i)} + b\right) + 1 - \xi_i \le 0$
An example x must lie directly on the functional margin in order for $g_i(w)$ to equal 0. Here $\alpha > 0$ — these are our support vectors!
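In code, this condition is what makes prediction cheap: after solving for the α's, only the examples with non-zero α matter. A minimal sketch (alphas and the tolerance tol are assumed variable names, not from the slides):

% Complementary slackness: alpha_i > 0 only for points on or inside the margin
tol   = 1e-5;                          % numerical tolerance (assumed)
svIdx = find(alphas > tol);            % indices of the support vectors
fprintf('%d of %d training points are support vectors\n', numel(svIdx), numel(alphas));
% Prediction only needs these points:
% f(x) = sum over i in svIdx of alphas(i) * y(i) * K(x^(i), x) + b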
Optimization Objective
Minimize $\frac{1}{2}\|w\|^2 + C\sum_i \xi_i$ subject to $-y^{(i)}\left(w^T x^{(i)} + b\right) + 1 - \xi_i \le 0$ and $\xi_i \ge 0$.
Need to use the Lagrangian. Recall the general form $\mathcal{L}(x, \alpha, \beta) = f(x) + \sum_i \alpha_i g_i(x) + \sum_i \beta_i h_i(x)$. Here:
$\mathcal{L}_P(w, b, \alpha, \xi, r) = \frac{1}{2}\|w\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\left[y^{(i)}\left(w^T x^{(i)} + b\right) - 1 + \xi_i\right] - \sum_i r_i \xi_i$
Expanding and grouping terms:
$\min_{w,b} \mathcal{L}_P = \underbrace{\frac{1}{2}\|w\|^2 + C\sum_{i=1}^m \xi_i}_{\text{minimization function}} + \underbrace{\sum_{i=1}^m \alpha_i - \sum_{i=1}^m \alpha_i\left[y^{(i)}\left(w^T x^{(i)} + b\right) + \xi_i\right]}_{\text{margin constraint}} - \underbrace{\sum_{i=1}^m r_i \xi_i}_{\text{error constraint}}$
The problem with the Primal
$\mathcal{L}_P = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^m \xi_i + \sum_{i=1}^m \alpha_i - \sum_{i=1}^m \alpha_i\left[y^{(i)}\left(w^T x^{(i)} + b\right) + \xi_i\right] - \sum_{i=1}^m r_i \xi_i$
We could just optimize the primal equation and be finished with SVM optimization… BUT you would miss out on one of the most important aspects of the SVM: the Kernel Trick…
We need the Lagrangian Dual!
http://stats.stackexchange.com/questions/19181/why-bother-with-the-dual-problem-when-fitting-svm
Finding the Dual
$\mathcal{L}_P = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^m \xi_i + \sum_{i=1}^m \alpha_i - \sum_{i=1}^m \alpha_i\left[y^{(i)}\left(w^T x^{(i)} + b\right) + \xi_i\right] - \sum_{i=1}^m r_i \xi_i$
First, compute the minimizing w, b and ξ by setting the derivatives to zero:
$\nabla_w \mathcal{L}(w, b, \alpha) = w - \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)} = 0 \;\Rightarrow\; w = \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)}$ — can solve for w! Substitute back in.
$\frac{\partial}{\partial b}\mathcal{L}(w, b, \alpha) = \sum_{i=1}^m \alpha_i y^{(i)} = 0$ — new constraint.
$\frac{\partial}{\partial \xi_i}\mathcal{L}(w, b, \alpha) = C - \alpha_i - r_i = 0 \;\Rightarrow\; \alpha_i = C - r_i \;\Rightarrow\; 0 \le \alpha_i \le C$ — new constraint.
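The first stationarity condition is also what an implementation uses to recover the primal parameters once the dual has been solved. A minimal sketch (assuming alphas comes from a solved dual, with data X, y, penalty C and a tolerance tol):

% w = sum_i alpha_i * y^(i) * x^(i)   (stationarity condition above)
w = X' * (alphas .* y);                          % n-by-1 weight vector
% b from the margin support vectors (0 < alpha_i < C), where y^(i)(w'x^(i)+b) = 1
onMargin = alphas > tol & alphas < C - tol;
b = mean(y(onMargin) - X(onMargin, :) * w);      % average for numerical stability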
Lagrangian Dual
Substitute $w = \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)}$ back into
$\mathcal{L}_P = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^m \xi_i + \sum_{i=1}^m \alpha_i - \sum_{i=1}^m \alpha_i\left[y^{(i)}\left(w^T x^{(i)} + b\right) + \xi_i\right] - \sum_{i=1}^m r_i \xi_i$
to obtain the dual:
$\mathcal{L}_D = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \alpha_i \alpha_j y^{(i)} y^{(j)}\, x^{(i)} \cdot x^{(j)}$
s.t. $\forall i:\; \sum_{i=1}^m \alpha_i y^{(i)} = 0$, $\xi_i \ge 0$ and $0 \le \alpha_i \le C$ (from KKT and prev. derivation)
Lagrangian Dual
$\max_\alpha \mathcal{L}_D = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \alpha_i \alpha_j y^{(i)} y^{(j)}\, x^{(i)} \cdot x^{(j)}$
• Subject to: $\forall i:\; \sum_{i=1}^m \alpha_i y^{(i)} = 0$, $\xi_i \ge 0$, $0 \le \alpha_i \le C$ and the KKT conditions
• Compute the inner product of x(i) and x(j)
• No longer rely on w or b
• Can replace the inner product with K(x, z) for kernels
• Can solve for α explicitly
• All values not on the functional margin have α = 0 — more efficient!
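For small problems the dual above is just a box-constrained quadratic program, so before looking at SMO it can be sanity-checked with a generic QP solver. A minimal sketch using MATLAB's quadprog with a linear kernel (an illustration only, not the solver used in the examples later):

% Dual: max sum(alpha) - 0.5 * alpha' * H * alpha, with H(i,j) = y_i * y_j * <x_i, x_j>
% quadprog minimizes, so the objective is negated.
K   = X * X';                          % linear-kernel Gram matrix (m x m)
H   = (y * y') .* K;
f   = -ones(length(y), 1);
Aeq = y';  beq = 0;                    % equality constraint: sum_i alpha_i * y_i = 0
lb  = zeros(length(y), 1);
ub  = C * ones(length(y), 1);          % box constraint: 0 <= alpha_i <= C
alphas = quadprog(H, f, [], [], Aeq, beq, lb, ub);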
Outline
• Main goal: To understand how support vector machines (SVMs) perform optimal classification on labelled data sets, with a quick look at the implementation level.
1. What is a support vector machine?
2. Formulating the SVM problem
3. Optimization using Sequential Minimal Optimization
4. Use of Kernels for non-linear classification
5. Implementation of SVM in MATLAB environment
6. Stochastic Gradient Descent for SVMs (extra)
Sequential Minimal Optimization
$\max_\alpha \mathcal{L}_D = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \alpha_i \alpha_j y^{(i)} y^{(j)}\, x^{(i)} \cdot x^{(j)}$
• Ideally, hold all α but one αⱼ constant and optimize, but the constraint $\sum_{i=1}^m \alpha_i y^{(i)} = 0$ forces $\alpha_1 = -y^{(1)}\sum_{i=2}^m \alpha_i y^{(i)}$, so no single α can change on its own.
• Solution: change two α at a time!
Platt 1998
Sequential Minimal Optimization Algorithm
Initialize all α to 0
Repeat {
1. Select an αⱼ that violates the KKT conditions and an additional αₖ to update
2. Re-optimize $\mathcal{L}_D(\alpha)$ with respect to the two α's while holding all other α constant, and compute w, b
}
In order to keep $\sum_{i=1}^m \alpha_i y^{(i)} = 0$ true: $\alpha_j y^{(j)} + \alpha_k y^{(k)} = \zeta$ — even after the update, the linear combination must remain constant!
Platt 1998
Sequential Minimal Optimization
• $0 \le \alpha \le C$ from the box constraint
• The pair lies on a line given by the linear constraint $\alpha_i y^{(i)} + \alpha_j y^{(j)} = \zeta$
• Clip values if they exceed C, H or L
[Figure: the $(\alpha_1, \alpha_2)$ box $[0, C] \times [0, C]$ with the constraint line and its feasible endpoints L and H]
If $y^{(i)} \ne y^{(j)}$ then $\alpha_i - \alpha_j = k$; if $y^{(i)} = y^{(j)}$ then $\alpha_i + \alpha_j = k$
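The endpoints L and H follow directly from the two cases above; a short sketch of how they are computed before clipping, in the style of Platt (1998) (alpha_i, alpha_j, y_i, y_j and C are assumed to hold the current pair's values):

% Feasible segment [L, H] for alpha_j along the constraint line inside the [0, C] box
if y_i ~= y_j                          % alpha_i - alpha_j is fixed
    L = max(0, alpha_j - alpha_i);
    H = min(C, C + alpha_j - alpha_i);
else                                   % alpha_i + alpha_j is fixed
    L = max(0, alpha_i + alpha_j - C);
    H = min(C, alpha_i + alpha_j);
end
% after the unconstrained update: alpha_j = min(max(alpha_j, L), H);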
Sequential Minimal Optimization
Can analytically solve for the new α, w and b at each update step (closed-form computation):
$\alpha_2^{new} = \alpha_2^{old} + \frac{y^{(2)}\left(E_1^{old} - E_2^{old}\right)}{\eta}$, where η is the second derivative of the dual with respect to $\alpha_2$
$\alpha_1^{new} = \alpha_1^{old} + y^{(1)} y^{(2)}\left(\alpha_2^{old} - \alpha_2^{new,clipped}\right)$
$b_1 = E_1 + y^{(1)}\left(\alpha_1^{new} - \alpha_1^{old}\right) K\!\left(x^{(1)}, x^{(1)}\right) + y^{(2)}\left(\alpha_2^{new,clipped} - \alpha_2^{old}\right) K\!\left(x^{(1)}, x^{(2)}\right)$
$b_2 = E_2 + y^{(1)}\left(\alpha_1^{new} - \alpha_1^{old}\right) K\!\left(x^{(1)}, x^{(2)}\right) + y^{(2)}\left(\alpha_2^{new,clipped} - \alpha_2^{old}\right) K\!\left(x^{(2)}, x^{(2)}\right)$
$b = \frac{b_1 + b_2}{2}$
$w = \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)}$
Platt 1998
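Putting the closed-form step together, one pair update can be written as a small MATLAB function. This is a sketch, not the course code: the error convention E_k = f(x^(k)) - y^(k) and the sign convention in the threshold update are assumptions consistent with Platt (1998), and L, H are the clipping bounds from the previous slide.

function [alpha1, alpha2, b] = smoPairUpdate(alpha1, alpha2, y1, y2, E1, E2, ...
                                             K11, K12, K22, L, H, b)
% One closed-form SMO pair update (sketch)
eta = K11 + K22 - 2*K12;               % curvature of the dual along the pair direction
if eta <= 0
    return;                            % degenerate pair: the full algorithm treats the endpoints
end
a1old = alpha1;  a2old = alpha2;
alpha2 = alpha2 + y2 * (E1 - E2) / eta;          % unconstrained optimum for alpha2
alpha2 = min(max(alpha2, L), H);                 % clip to the feasible segment [L, H]
alpha1 = alpha1 + y1 * y2 * (a2old - alpha2);    % keep sum(alpha .* y) constant
% Threshold update, averaging the two candidate thresholds
b1 = b - E1 - y1*(alpha1 - a1old)*K11 - y2*(alpha2 - a2old)*K12;
b2 = b - E2 - y1*(alpha1 - a1old)*K12 - y2*(alpha2 - a2old)*K22;
b  = (b1 + b2) / 2;
end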
Outline
• Main goal: To understand how support vector machines (SVMs) perform optimal classification on labelled data sets, with a quick look at the implementation level.
1. What is a support vector machine?
2. Formulating the SVM problem
3. Optimization using Sequential Minimal Optimization
4. Use of Kernels for non-linear classification
5. Implementation of SVM in MATLAB environment
6. Stochastic Gradient Descent for SVMs (extra)
Kernels
Recall: $\max_\alpha \mathcal{L}_D = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \alpha_i \alpha_j y^{(i)} y^{(j)}\, x^{(i)} \cdot x^{(j)}$
We can generalize this using a kernel function $K\!\left(x^{(i)}, x^{(j)}\right)$:
$K\!\left(x^{(i)}, x^{(j)}\right) = \phi\!\left(x^{(i)}\right)^T \phi\!\left(x^{(j)}\right)$
where φ is a feature mapping function.
It turns out that we don't need to explicitly compute φ(x), thanks to the Kernel Trick!
Kernel Trick
Suppose $K(x, z) = \left(x^T z\right)^2$. Then
$K(x, z) = \left(\sum_{i=1}^n x_i z_i\right)\left(\sum_{j=1}^n x_j z_j\right) = \sum_{i=1}^n \sum_{j=1}^n x_i z_i x_j z_j = \sum_{i,j=1}^n \left(x_i x_j\right)\left(z_i z_j\right)$
Representing each feature explicitly: O(n²)
Kernel trick: O(n)
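A quick numeric check of the identity above (the two vectors are arbitrary): building the n² explicit feature products gives exactly the same number as squaring the single O(n) inner product.

% Verify (x'z)^2 equals the inner product of the explicit degree-2 feature maps
x = [1; 2; 3];  z = [4; 0; -1];        % arbitrary example vectors
kTrick    = (x' * z)^2;                % O(n): one inner product, then square it
phiX      = reshape(x * x', [], 1);    % explicit features: all products x_i * x_j (O(n^2))
phiZ      = reshape(z * z', [], 1);
kExplicit = phiX' * phiZ;
fprintf('kernel trick: %g, explicit feature map: %g\n', kTrick, kExplicit);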
Types of Kernels
Linear kernel: $K(x, z) = x^T z$
Gaussian kernel (Radial Basis Function): $K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$
Polynomial kernel: $K(x, z) = \left(x^T z + c\right)^d$
$\max_\alpha \mathcal{L}_D = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \alpha_i \alpha_j y^{(i)} y^{(j)} K\!\left(x^{(i)}, x^{(j)}\right)$
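Each of these kernels is a one-liner in MATLAB; the sketch below uses anonymous functions on column vectors (the function names and hyperparameter values are mine, not from the course code):

% Kernel functions for two column vectors x and z
linearK   = @(x, z)        x' * z;
gaussianK = @(x, z, sigma) exp(-((x - z)' * (x - z)) / (2 * sigma^2));
polyK     = @(x, z, c, d)  (x' * z + c)^d;
% Example evaluation on a pair of points
x = [1; 2];  z = [2; 0];
fprintf('linear: %g  gaussian(sigma=1): %g  poly(c=1, d=3): %g\n', ...
        linearK(x, z), gaussianK(x, z, 1), polyK(x, z, 1, 3));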
Outline
• Main goal: To understand how support vector machines (SVMs) perform optimal classification on labelled data sets, with a quick look at the implementation level.
1. What is a support vector machine?
2. Formulating the SVM problem
3. Optimization using Sequential Minimal Optimization
4. Use of Kernels for non-linear classification
5. Implementation of SVM in MATLAB environment
6. Stochastic Gradient Descent for SVMs (extra)
Example 1: Linear Kernel
In MATLAB, I used svmTrain:
model = svmTrain(X, y, C, @linearKernel, 1e-3, 500);
• 500 max iterations
• C = 1000
• KKT tol = 1e-3
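The @linearKernel handle passed above is not shown on the slide; a definition consistent with that call would simply return the inner product of the two inputs (this is an assumed helper, not code from the presentation):

function sim = linearKernel(x1, x2)
% Linear kernel: similarity is the inner product of the two column vectors
sim = x1' * x2;
end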
Example 2: Gaussian Kernel (RBF)
In MATLAB, Gaussian kernel:
gSim = exp(-1*((x1-x2)'*(x1-x2))/(2*sigma^2));
model = svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma), 0.1, 300);
• 300 max iterations
• C = 1
• KKT tol = 0.1
Example 2: Gaussian Kernel (RBF)
In MATLAB, Gaussian kernel:
gSim = exp(-1*((x1-x2)'*(x1-x2))/(2*sigma^2));
model = svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma), 0.001, 300);
• 300 max iterations
• C = 1
• KKT tol = 0.001
Example 2: Gaussian Kernel (RBF)
In MATLAB, Gaussian kernel:
gSim = exp(-1*((x1-x2)'*(x1-x2))/(2*sigma^2));
model = svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma), 0.001, 300);
• 300 max iterations
• C = 10
• KKT tol = 0.001
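Rather than hand-tuning C and sigma as in the examples above, a common approach is a small grid search against a held-out validation set. A sketch (svmTrain and gaussianKernel as used above; predictFn stands in for whatever prediction helper accompanies svmTrain, and Xval, yval are an assumed validation split):

% Hypothetical hyperparameter search over C and sigma
bestErr = inf;
for C = [0.01 0.1 1 10 100]
    for sigma = [0.01 0.1 0.3 1 3]
        model = svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma), 1e-3, 300);
        err   = mean(predictFn(model, Xval) ~= yval);   % validation error rate
        if err < bestErr
            bestErr = err;  bestC = C;  bestSigma = sigma;
        end
    end
end
fprintf('best C = %g, best sigma = %g (validation error %.3f)\n', bestC, bestSigma, bestErr);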
Resources
• http://research.microsoft.com/pubs/69644/tr-98-14.pdf
• ftp://www.ai.mit.edu/pub/users/tlp/projects/svm/svm-smo/smo.pdf
• http://www.cs.toronto.edu/~urtasun/courses/CSC2515/09svm-2515.pdf
• http://www.pstat.ucsb.edu/student%20seminar%20doc/svm2.pdf
• http://www.ics.uci.edu/~dramanan/teaching/ics273a_winter08/lectures/lecture11.pdf
• http://stats.stackexchange.com/questions/19181/why-bother-with-the-dual-problem-when-fitting-svm
• http://cs229.stanford.edu/notes/cs229-notes3.pdf
• http://www.svms.org/tutorials/Berwick2003.pdf
• http://www.med.nyu.edu/chibi/sites/default/files/chibi/Final.pdf
• https://share.coursera.org/wiki/index.php/ML:Main
• https://www.youtube.com/watch?v=vqoVIchkM7I
Outline
• Main goal: To understand how support vector machines (SVMs) perform optimal classification on labelled data sets, with a quick look at the implementation level.
1. What is a support vector machine?
2. Formulating the SVM problem
3. Optimization using Sequential Minimal Optimization
4. Use of Kernels for non-linear classification
5. Implementation of SVM in MATLAB environment
6. Stochastic Gradient Descent for SVMs (extra)
Hinge-Loss Primal SVM
• Recall the original problem: minimize $\frac{1}{2}w^T w + C\sum_{i=1}^m \xi_i$ subject to $y^{(i)}\left(w^T x^{(i)} + b\right) - 1 + \xi_i \ge 0$
• Re-write in terms of the hinge-loss function:
$\mathcal{L}_P = \underbrace{\frac{\lambda}{2}w^T w}_{\text{regularization}} + \underbrace{\frac{1}{m}\sum_{i=1}^m \max\left(0,\; 1 - y^{(i)}\left(w^T x^{(i)} + b\right)\right)}_{\text{hinge-loss cost function}}$
Hinge Loss Function
[Figure: hinge loss $\max\left(0, 1 - y^{(i)} f(x^{(i)})\right)$ plotted against $y^{(i)} f(x^{(i)})$ over the range −3 to 3]
Penalize when the signs of $f(x^{(i)})$ and $y^{(i)}$ don't match!
Hinge-Loss Primal SVM
Hinge-loss primal:
$\mathcal{L}_P = \frac{\lambda}{2}w^T w + \frac{1}{m}\sum_{i=1}^m \max\left(0,\; 1 - y^{(i)}\left(w^T x^{(i)} + b\right)\right)$
Compute the gradient (sub-gradient):
$\nabla_w = \lambda w - \frac{1}{m}\sum_{i=1}^m y^{(i)} x^{(i)}\, L\!\left[y^{(i)}\left(w^T x^{(i)} + b\right) < 1\right]$
Indicator function: $L\!\left[y^{(i)} f(x)\right] = 0$ if classified correctly; $L\!\left[y^{(i)} f(x)\right]$ contributes the hinge slope if classified incorrectly.
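The batch sub-gradient translates almost line for line into MATLAB; a minimal sketch (lambda, the data X, y, and the current w, b are assumed to be in scope, and b is held fixed for brevity):

% Sub-gradient of the hinge-loss primal with respect to w
margins = y .* (X * w + b);                    % y^(i) * (w' * x^(i) + b)
active  = margins < 1;                         % indicator: inside the margin or misclassified
gradW   = lambda * w - (1/length(y)) * (X(active, :)' * y(active));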
Stochastic Gradient Descent
We implement random sampling here to pick out data points; the sub-gradient estimated from a single sampled example i is:
$\lambda w - y^{(i)} x^{(i)}\,\mathbb{1}\!\left[y^{(i)}\left(w^T x^{(i)} + b\right) < 1\right]$
Use this to compute the update rule:
$w_t \leftarrow w_{t-1} + \frac{1}{t}\left(y^{(i)} x^{(i)}\,\mathbb{1}\!\left[y^{(i)}\left(x^{(i)T} w_{t-1} + b\right) < 1\right] - \lambda w_{t-1}\right)$
Here i is sampled from a uniform distribution.
Shalev-Shwartz et al. 2007
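A compact sketch of the resulting training loop (Pegasos-style, after Shalev-Shwartz et al. 2007); the bias term is omitted, the 1/t step size follows the slide, and lambda, the step count T and the data X, y are assumptions:

% Stochastic sub-gradient descent for the hinge-loss SVM (no bias term, for brevity)
lambda = 0.01;  T = 10000;                     % regularization and step count (assumed)
[m, n] = size(X);
w = zeros(n, 1);
for t = 1:T
    i = randi(m);                              % i sampled uniformly at random
    if y(i) * (X(i, :) * w) < 1                % margin violated: hinge term is active
        w = w + (1/t) * (y(i) * X(i, :)' - lambda * w);
    else                                       % hinge inactive: only shrink w
        w = w - (1/t) * lambda * w;
    end
end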