Support Vector Machine and Kernel Methods
Jiayu Zhou
Department of Computer Science and Engineering, Michigan State University, East Lansing, MI USA
February 26, 2017
Jiayu Zhou CSE 847 Machine Learning 1 / 50
Which Separator Do You Pick?
Jiayu Zhou CSE 847 Machine Learning 2 / 50
Robustness to Noisy Data
Being robust to noise (measurement error) is good (remember regularization).
Jiayu Zhou CSE 847 Machine Learning 3 / 50
Thicker Cushion Means More Robustness
We call such hyperplanes fat.
Jiayu Zhou CSE 847 Machine Learning 4 / 50
Two Crucial Questions
1. Can we efficiently find the fattest separating hyperplane?
2. Is a fatter hyperplane better than a thin one?
Jiayu Zhou CSE 847 Machine Learning 5 / 50
Pulling Out the Bias
Before: $x \in \{1\} \times \mathbb{R}^d$, $w \in \mathbb{R}^{d+1}$:
$$x = \begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_d \end{bmatrix}, \qquad w = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{bmatrix}, \qquad \text{signal} = w^T x$$
After: $x \in \mathbb{R}^d$, $b \in \mathbb{R}$, $w \in \mathbb{R}^d$:
$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}, \qquad w = \begin{bmatrix} w_1 \\ \vdots \\ w_d \end{bmatrix}, \qquad \text{bias } b, \qquad \text{signal} = w^T x + b$$
Jiayu Zhou CSE 847 Machine Learning 6 / 50
Separating The Data
Hyperplane $h = (b, w)$. Saying that $h$ separates the data means:
$$y_n(w^T x_n + b) > 0, \quad n = 1, \ldots, N$$
By rescaling the weights and bias, we can require
$$\min_{n=1,\ldots,N} y_n(w^T x_n + b) = 1$$
Jiayu Zhou CSE 847 Machine Learning 7 / 50
Distance to the Hyperplane
$w$ is normal to the hyperplane (why?): for any $x_1, x_2$ on $h$,
$$w^T(x_2 - x_1) = w^T x_2 - w^T x_1 = -b + b = 0$$
Scalar projection:
$$a^T b = \|a\|\,\|b\| \cos(a, b) \;\Rightarrow\; a^T b / \|b\| = \|a\| \cos(a, b)$$
Let $x_\perp$ be the orthogonal projection of $x$ onto $h$; the distance to the hyperplane is given by the projection of $x - x_\perp$ onto $w$ (why?):
$$\operatorname{dist}(x, h) = \frac{1}{\|w\|} \, |w^T x - w^T x_\perp| = \frac{1}{\|w\|} \, |w^T x + b|$$
Jiayu Zhou CSE 847 Machine Learning 8 / 50
Fatness of a Separating Hyperplane
For any data point $(x_n, y_n)$,
$$\operatorname{dist}(x_n, h) = \frac{1}{\|w\|} \, |w^T x_n + b| = \frac{1}{\|w\|} \, |y_n(w^T x_n + b)| = \frac{1}{\|w\|} \, y_n(w^T x_n + b)$$
Fatness = distance to the closest point:
$$\text{Fatness} = \min_n \operatorname{dist}(x_n, h) = \frac{1}{\|w\|} \min_n y_n(w^T x_n + b) = \frac{1}{\|w\|}$$
Jiayu Zhou CSE 847 Machine Learning 9 / 50
Maximizing the Margin
Formal definition of margin:
$$\text{margin: } \gamma(h) = \frac{1}{\|w\|}$$
NOTE: Bias $b$ does not appear in the margin.
The objective for maximizing the margin:
$$\min_{b, w} \; \frac{1}{2} w^T w \quad \text{subject to: } \min_{n=1,\ldots,N} y_n(w^T x_n + b) = 1$$
An equivalent objective:
$$\min_{b, w} \; \frac{1}{2} w^T w \quad \text{subject to: } y_n(w^T x_n + b) \ge 1 \text{ for } n = 1, \ldots, N$$
Jiayu Zhou CSE 847 Machine Learning 10 / 50
Example - Our Toy Data Set
$$\min_{b, w} \; \frac{1}{2} w^T w \quad \text{subject to: } y_n(w^T x_n + b) \ge 1 \text{ for } n = 1, \ldots, N$$
Training Data:
$$X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \\ 3 & 0 \end{bmatrix}, \qquad y = \begin{bmatrix} -1 \\ -1 \\ +1 \\ +1 \end{bmatrix}$$
What is the margin?
Jiayu Zhou CSE 847 Machine Learning 11 / 50
Example - Our Toy Data Set
$$\min_{b, w} \; \frac{1}{2} w^T w \quad \text{subject to: } y_n(w^T x_n + b) \ge 1 \text{ for } n = 1, \ldots, N$$
$$X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \\ 3 & 0 \end{bmatrix}, \qquad y = \begin{bmatrix} -1 \\ -1 \\ +1 \\ +1 \end{bmatrix} \quad \Rightarrow \quad
\begin{aligned} (1)&: -b \ge 1 \\ (2)&: -(2w_1 + 2w_2 + b) \ge 1 \\ (3)&: 2w_1 + b \ge 1 \\ (4)&: 3w_1 + b \ge 1 \end{aligned}$$
$$\begin{cases} (1) + (3) \;\rightarrow\; w_1 \ge 1 \\ (2) + (3) \;\rightarrow\; w_2 \le -1 \end{cases} \;\Rightarrow\; \frac{1}{2} w^T w = \frac{1}{2}(w_1^2 + w_2^2) \ge 1$$
Thus: $w_1 = 1$, $w_2 = -1$, $b = -1$.
Jiayu Zhou CSE 847 Machine Learning 12 / 50
Example - Our Toy Data Set
Given data $X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \\ 3 & 0 \end{bmatrix}$:
Optimal solution: $w^* = \begin{bmatrix} w_1 = 1 \\ w_2 = -1 \end{bmatrix}$, $b^* = -1$
Optimal hyperplane: $g(x) = \operatorname{sign}(x_1 - x_2 - 1)$
Margin: $\frac{1}{\|w\|} = \frac{1}{\sqrt{2}} \approx 0.707$
For data points (1), (2) and (3), $y_n(x_n^T w^* + b^*) = 1$: these are the Support Vectors.
Jiayu Zhou CSE 847 Machine Learning 13 / 50
Solver: Quadratic Programming
$$\min_{u \in \mathbb{R}^q} \; \frac{1}{2} u^T Q u + p^T u \quad \text{subject to: } A u \ge c$$
$$u^* \leftarrow QP(Q, p, A, c)$$
($Q = 0$ gives linear programming.)
http://cvxopt.org/examples/tutorial/qp.html
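As a concrete illustration of this QP interface, here is a minimal sketch using the cvxopt package linked above; the wrapper name solve_qp and the sign-flipping convention are my own, since cvxopt's solver expects constraints in the form $Gu \le h$.

```python
# Minimal sketch of calling a QP solver in the notation above, assuming the
# cvxopt package (linked on this slide) is installed. cvxopt solves
#   min (1/2) u^T P u + q^T u   subject to   G u <= h,
# so the constraint A u >= c is passed as -A u <= -c.
import numpy as np
from cvxopt import matrix, solvers

def solve_qp(Q, p, A, c):
    """Return u* = argmin (1/2) u'Qu + p'u  subject to  Au >= c."""
    P = matrix(Q.astype(float))
    q = matrix(p.astype(float))
    G = matrix(-A.astype(float))   # flip the sign: Au >= c  <=>  -Au <= -c
    h = matrix(-c.astype(float))
    sol = solvers.qp(P, q, G, h)
    return np.array(sol['x']).ravel()
```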
Jiayu Zhou CSE 847 Machine Learning 14 / 50
Maximum Margin Hyperplane is QP
Maximum-margin hyperplane:
$$\min_{b, w} \; \frac{1}{2} w^T w \quad \text{subject to: } y_n(w^T x_n + b) \ge 1, \;\forall n$$
Generic QP:
$$\min_{u \in \mathbb{R}^q} \; \frac{1}{2} u^T Q u + p^T u \quad \text{subject to: } A u \ge c$$
Take $u = \begin{bmatrix} b \\ w \end{bmatrix} \in \mathbb{R}^{d+1}$. Then
$$\frac{1}{2} w^T w = \frac{1}{2} [b,\, w^T] \begin{bmatrix} 0 & 0_d^T \\ 0_d & I_d \end{bmatrix} \begin{bmatrix} b \\ w \end{bmatrix} = \frac{1}{2} u^T \begin{bmatrix} 0 & 0_d^T \\ 0_d & I_d \end{bmatrix} u
\;\Rightarrow\; Q = \begin{bmatrix} 0 & 0_d^T \\ 0_d & I_d \end{bmatrix}, \quad p = 0_{d+1}$$
$$y_n(w^T x_n + b) \ge 1 \;\Leftrightarrow\; [y_n,\, y_n x_n^T]\, u \ge 1 \;\Rightarrow\; \begin{bmatrix} y_1 & y_1 x_1^T \\ \vdots & \vdots \\ y_N & y_N x_N^T \end{bmatrix} u \ge \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}
\;\Rightarrow\; A = \begin{bmatrix} y_1 & y_1 x_1^T \\ \vdots & \vdots \\ y_N & y_N x_N^T \end{bmatrix}, \quad c = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}$$
Jiayu Zhou CSE 847 Machine Learning 15 / 50
Back To Our Example
Exercise:
$$X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \\ 3 & 0 \end{bmatrix}, \qquad y = \begin{bmatrix} -1 \\ -1 \\ +1 \\ +1 \end{bmatrix} \quad \Rightarrow \quad
\begin{aligned} (1)&: -b \ge 1 \\ (2)&: -(2w_1 + 2w_2 + b) \ge 1 \\ (3)&: 2w_1 + b \ge 1 \\ (4)&: 3w_1 + b \ge 1 \end{aligned}$$
Show the corresponding $Q, p, A, c$:
$$Q = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad p = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \quad A = \begin{bmatrix} -1 & 0 & 0 \\ -1 & -2 & -2 \\ 1 & 2 & 0 \\ 1 & 3 & 0 \end{bmatrix}, \quad c = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}$$
Use your QP solver to obtain
$$u^* = [b^*, w_1^*, w_2^*]^T = [-1, 1, -1]^T$$
Jiayu Zhou CSE 847 Machine Learning 16 / 50
Primal QP algorithm for linear-SVM
1. Let $p = 0_{d+1}$ be the $(d+1)$-vector of zeros and $c = 1_N$ the $N$-vector of ones. Construct matrices $Q$ and $A$, where
$$A = \begin{bmatrix} y_1 & y_1 x_1^T \\ \vdots & \vdots \\ y_N & y_N x_N^T \end{bmatrix}, \qquad Q = \begin{bmatrix} 0 & 0_d^T \\ 0_d & I_d \end{bmatrix}$$
2. Return $\begin{bmatrix} b^* \\ w^* \end{bmatrix} = u^* \leftarrow QP(Q, p, A, c)$.
3. The final hypothesis is $g(x) = \operatorname{sign}(x^T w^* + b^*)$.
Jiayu Zhou CSE 847 Machine Learning 17 / 50
Link to Regularization
Regularization:
$$\min_{w} \; E_{in}(w) \quad \text{subject to: } w^T w \le C$$
Compared side by side:
optimal hyperplane: minimize $w^T w$, subject to $E_{in} = 0$
regularization: minimize $E_{in}$, subject to $w^T w \le C$
Jiayu Zhou CSE 847 Machine Learning 18 / 50
How to Handle Non-Separable Data?
(a) Tolerate noisy data points: soft-margin SVM.
(b) Inherent nonlinear boundary: non-linear transformation.
Jiayu Zhou CSE 847 Machine Learning 19 / 50
Non-Linear Transformation
$$\Phi_1(x) = (x_1, x_2)$$
$$\Phi_2(x) = (x_1, x_2, x_1^2, x_1 x_2, x_2^2)$$
$$\Phi_3(x) = (x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_1^2 x_2, x_1 x_2^2, x_2^3)$$
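For concreteness, a minimal sketch of these three feature maps for a two-dimensional input $x = (x_1, x_2)$ (the function names phi1/phi2/phi3 are just for illustration); each $x_n$ would be transformed this way before running the hard-margin SVM in the Z-space, as the next slide describes.

```python
import numpy as np

# Sketch of the three feature maps above for x = (x1, x2).
def phi1(x):
    x1, x2 = x
    return np.array([x1, x2])

def phi2(x):
    x1, x2 = x
    return np.array([x1, x2, x1**2, x1*x2, x2**2])

def phi3(x):
    x1, x2 = x
    return np.array([x1, x2, x1**2, x1*x2, x2**2,
                     x1**3, x1**2 * x2, x1 * x2**2, x2**3])
```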
Jiayu Zhou CSE 847 Machine Learning 20 / 50
Non-Linear Transformation
Use the optimal hyperplane together with a nonlinear transform $\Phi: \mathbb{R}^d \to \mathbb{R}^{\tilde{d}}$:
$$z_n = \Phi(x_n)$$
Solve the hard-margin SVM in the Z-space to get $(\tilde{w}^*, \tilde{b}^*)$:
$$\min_{\tilde{b}, \tilde{w}} \; \frac{1}{2} \tilde{w}^T \tilde{w} \quad \text{subject to: } y_n(\tilde{w}^T z_n + \tilde{b}) \ge 1, \;\forall n$$
Final hypothesis:
$$g(x) = \operatorname{sign}(\tilde{w}^{*T} \Phi(x) + \tilde{b}^*)$$
Jiayu Zhou CSE 847 Machine Learning 21 / 50
SVM and non-linear transformation
The margin is shaded in yellow, and the support vectors are boxed.
For $\Phi_2$, $\tilde{d}_2 = 5$, and for $\Phi_3$, $\tilde{d}_3 = 9$. Although $\tilde{d}_3$ is nearly double $\tilde{d}_2$, the resulting SVM separator is not severely overfitting with $\Phi_3$ (regularization?).
Jiayu Zhou CSE 847 Machine Learning 22 / 50
Support Vector Machine Summary
A very powerful, easy-to-use linear model that comes with automatic regularization.
Fully exploiting the SVM: the kernel.
Potential robustness to overfitting even after transforming to a much higher dimension. How about infinite-dimensional transforms? The kernel trick.
Jiayu Zhou CSE 847 Machine Learning 23 / 50
SVM Dual: Formulation
Primal and dual in optimization.
The dual view of SVM enables us to exploit the kernel trick.
In the primal SVM problem we solve for $w \in \mathbb{R}^d$ and $b$, while in the dual problem we solve for $\alpha \in \mathbb{R}^N$:
$$\max_{\alpha \in \mathbb{R}^N} \; \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{m=1}^{N} \sum_{n=1}^{N} \alpha_n \alpha_m y_n y_m x_n^T x_m$$
$$\text{subject to } \sum_{n=1}^{N} y_n \alpha_n = 0, \quad \alpha_n \ge 0, \;\forall n$$
which is also a QP problem.
Jiayu Zhou CSE 847 Machine Learning 24 / 50
SVM Dual: Prediction
We can obtain the primal solution:
$$w^* = \sum_{n=1}^{N} y_n \alpha_n^* x_n$$
where $\alpha_n^* > 0$ only for the support vectors.
The optimal hypothesis:
$$g(x) = \operatorname{sign}(w^{*T} x + b^*) = \operatorname{sign}\left( \sum_{n=1}^{N} y_n \alpha_n^* x_n^T x + b^* \right) = \operatorname{sign}\left( \sum_{\alpha_n^* > 0} y_n \alpha_n^* x_n^T x + b^* \right)$$
Jiayu Zhou CSE 847 Machine Learning 25 / 50
Dual SVM: Summary
$$\max_{\alpha \in \mathbb{R}^N} \; \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{m=1}^{N} \sum_{n=1}^{N} \alpha_n \alpha_m y_n y_m x_n^T x_m$$
$$\text{subject to } \sum_{n=1}^{N} y_n \alpha_n = 0, \quad \alpha_n \ge 0, \;\forall n$$
$$w^* = \sum_{n=1}^{N} y_n \alpha_n^* x_n$$
Jiayu Zhou CSE 847 Machine Learning 26 / 50
Common SVM Basis Functions
$z_k$ = polynomial terms of $x_k$ of degree 1 to $q$
$z_k$ = radial basis functions of $x_k$: $\; z_k(j) = \phi_j(x_k) = \exp(-\|x_k - c_j\|^2 / \sigma^2)$
$z_k$ = sigmoid functions of $x_k$
Jiayu Zhou CSE 847 Machine Learning 27 / 50
Quadratic Basis Functions
$$\Phi(x) = \left( 1, \; \sqrt{2}x_1, \ldots, \sqrt{2}x_d, \; x_1^2, \ldots, x_d^2, \; \sqrt{2}x_1 x_2, \ldots, \sqrt{2}x_1 x_d, \; \sqrt{2}x_2 x_3, \ldots, \sqrt{2}x_{d-1} x_d \right)^T$$
Including the constant term, linear terms, pure quadratic terms, and quadratic cross-terms.
The number of terms is approximately $d^2/2$.
You may be wondering what those $\sqrt{2}$s are doing. You'll find out why they're there soon.
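A minimal sketch of this quadratic feature map (the function name quadratic_features is mine); it keeps the $\sqrt{2}$ scaling on the linear and cross terms, which is exactly what makes the kernel identity two slides ahead work out.

```python
import numpy as np
from itertools import combinations

# Sketch of the quadratic basis above: constant term, sqrt(2)-scaled linear
# terms, pure quadratic terms, and sqrt(2)-scaled cross terms
# (roughly d^2/2 output dimensions).
def quadratic_features(x):
    x = np.asarray(x, dtype=float)
    d = len(x)
    linear = np.sqrt(2.0) * x
    squares = x ** 2
    cross = np.array([np.sqrt(2.0) * x[i] * x[j]
                      for i, j in combinations(range(d), 2)])
    return np.concatenate(([1.0], linear, squares, cross))
```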
Jiayu Zhou CSE 847 Machine Learning 28 / 50
Dual SVM: Non-linear Transformation
$$\max_{\alpha \in \mathbb{R}^N} \; \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{m=1}^{N} \sum_{n=1}^{N} \alpha_n \alpha_m y_n y_m \Phi(x_n)^T \Phi(x_m)$$
$$\text{subject to } \sum_{n=1}^{N} y_n \alpha_n = 0, \quad \alpha_n \ge 0, \;\forall n$$
$$w^* = \sum_{n=1}^{N} y_n \alpha_n^* \Phi(x_n)$$
We need to prepare a matrix $Q$ with $Q_{nm} = y_n y_m \Phi(x_n)^T \Phi(x_m)$.
Cost? We must do $N^2/2$ dot products to get this matrix ready. Each dot product requires $d^2/2$ additions and multiplications, so the whole thing costs $N^2 d^2 / 4$.
Jiayu Zhou CSE 847 Machine Learning 29 / 50
Quadratic Dot Products
$$\Phi(a)^T \Phi(b) = \begin{bmatrix} 1 \\ \sqrt{2}a_1 \\ \vdots \\ \sqrt{2}a_d \\ a_1^2 \\ \vdots \\ a_d^2 \\ \sqrt{2}a_1 a_2 \\ \vdots \\ \sqrt{2}a_1 a_d \\ \sqrt{2}a_2 a_3 \\ \vdots \\ \sqrt{2}a_{d-1} a_d \end{bmatrix}^T \begin{bmatrix} 1 \\ \sqrt{2}b_1 \\ \vdots \\ \sqrt{2}b_d \\ b_1^2 \\ \vdots \\ b_d^2 \\ \sqrt{2}b_1 b_2 \\ \vdots \\ \sqrt{2}b_1 b_d \\ \sqrt{2}b_2 b_3 \\ \vdots \\ \sqrt{2}b_{d-1} b_d \end{bmatrix}$$
Constant term: $1$
Linear terms: $\sum_{i=1}^{d} 2 a_i b_i$
Pure quadratic terms: $\sum_{i=1}^{d} a_i^2 b_i^2$
Quadratic cross-terms: $\sum_{i=1}^{d} \sum_{j=i+1}^{d} 2 a_i a_j b_i b_j$
Jiayu Zhou CSE 847 Machine Learning 30 / 50
Quadratic Dot Product
Does Φ(a)TΦ(b) look familiar?
$$\Phi(a)^T \Phi(b) = 1 + 2\sum_{i=1}^{d} a_i b_i + \sum_{i=1}^{d} a_i^2 b_i^2 + \sum_{i=1}^{d} \sum_{j=i+1}^{d} 2 a_i a_j b_i b_j$$
Try this: $(a^T b + 1)^2$
$$\begin{aligned} (a^T b + 1)^2 &= (a^T b)^2 + 2 a^T b + 1 \\ &= \left( \sum_{i=1}^{d} a_i b_i \right)^2 + 2 \sum_{i=1}^{d} a_i b_i + 1 \\ &= \sum_{i=1}^{d} \sum_{j=1}^{d} a_i b_i a_j b_j + 2 \sum_{i=1}^{d} a_i b_i + 1 \\ &= \sum_{i=1}^{d} a_i^2 b_i^2 + 2 \sum_{i=1}^{d} \sum_{j=i+1}^{d} a_i a_j b_i b_j + 2 \sum_{i=1}^{d} a_i b_i + 1 \end{aligned}$$
They’re the same! And this is only O(d) to compute!
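A quick numeric sanity check of this identity, assuming the quadratic_features sketch from the quadratic-basis slide above:

```python
import numpy as np

# Verify numerically that the explicit O(d^2) feature dot product agrees
# with the O(d) kernel (a^T b + 1)^2, using the quadratic_features sketch.
rng = np.random.default_rng(0)
a, b = rng.standard_normal(5), rng.standard_normal(5)

lhs = quadratic_features(a) @ quadratic_features(b)   # O(d^2) work
rhs = (a @ b + 1.0) ** 2                              # O(d) work
print(np.isclose(lhs, rhs))                           # True
```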
Jiayu Zhou CSE 847 Machine Learning 31 / 50
Dual SVM: Non-linear Transformation
$$\max_{\alpha \in \mathbb{R}^N} \; \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{m=1}^{N} \sum_{n=1}^{N} \alpha_n \alpha_m y_n y_m \Phi(x_n)^T \Phi(x_m)$$
$$\text{subject to } \sum_{n=1}^{N} y_n \alpha_n = 0, \quad \alpha_n \ge 0, \;\forall n$$
$$w^* = \sum_{n=1}^{N} y_n \alpha_n^* \Phi(x_n)$$
We need to prepare a matrix $Q$ with $Q_{nm} = y_n y_m \Phi(x_n)^T \Phi(x_m)$.
Cost? We must do $N^2/2$ dot products to get this matrix ready, but now each dot product requires only $d$ additions and multiplications.
Jiayu Zhou CSE 847 Machine Learning 32 / 50
Higher Order Polynomials
Explicit transform $\Phi(x)$ (cost, and cost when $d = 100$):
Quadratic: $d^2/2$ terms, cost $d^2 N^2 / 4$ ($2.5\mathrm{k}\,N^2$)
Cubic: $d^3/6$ terms, cost $d^3 N^2 / 12$ ($83\mathrm{k}\,N^2$)
Quartic: $d^4/24$ terms, cost $d^4 N^2 / 48$ ($1.96\mathrm{m}\,N^2$)
Kernel $\Phi(a)^T \Phi(b)$ (cost, and cost when $d = 100$):
Quadratic: $(a^T b + 1)^2$, cost $d N^2 / 2$ ($50\,N^2$)
Cubic: $(a^T b + 1)^3$, cost $d N^2 / 2$ ($50\,N^2$)
Quartic: $(a^T b + 1)^4$, cost $d N^2 / 2$ ($50\,N^2$)
Jiayu Zhou CSE 847 Machine Learning 33 / 50
Dual SVM with Quintic basis functions
$$\max_{\alpha \in \mathbb{R}^N} \; \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{m=1}^{N} \sum_{n=1}^{N} \alpha_n \alpha_m y_n y_m \underbrace{\Phi(x_n)^T \Phi(x_m)}_{(x_n^T x_m + 1)^5}$$
$$\text{subject to } \sum_{n=1}^{N} y_n \alpha_n = 0, \quad \alpha_n \ge 0, \;\forall n$$
Classification:
$$g(x) = \operatorname{sign}(w^{*T} \Phi(x) + b^*) = \operatorname{sign}\left( \sum_{\alpha_n^* > 0} y_n \alpha_n^* \Phi(x_n)^T \Phi(x) + b^* \right) = \operatorname{sign}\left( \sum_{\alpha_n^* > 0} y_n \alpha_n^* (x_n^T x + 1)^5 + b^* \right)$$
Jiayu Zhou CSE 847 Machine Learning 34 / 50
Dual SVM with general kernel functions
$$\max_{\alpha \in \mathbb{R}^N} \; \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{m=1}^{N} \sum_{n=1}^{N} \alpha_n \alpha_m y_n y_m K(x_n, x_m)$$
$$\text{subject to } \sum_{n=1}^{N} y_n \alpha_n = 0, \quad \alpha_n \ge 0, \;\forall n$$
Classification:
$$g(x) = \operatorname{sign}(w^{*T} \Phi(x) + b^*) = \operatorname{sign}\left( \sum_{\alpha_n^* > 0} y_n \alpha_n^* \Phi(x_n)^T \Phi(x) + b^* \right) = \operatorname{sign}\left( \sum_{\alpha_n^* > 0} y_n \alpha_n^* K(x_n, x) + b^* \right)$$
Jiayu Zhou CSE 847 Machine Learning 35 / 50
Kernel Tricks
Replacing dot product with a kernel function
Not all functions are kernel functions!
Need to be decomposable: $K(a, b) = \Phi(a)^T \Phi(b)$.
Could $K(a, b) = (a - b)^3$ be a kernel function?
Could $K(a, b) = (a - b)^4 - (a + b)^2$ be a kernel function?
Mercer's condition
To expand a kernel function $K(a, b)$ into a dot product, i.e., $K(a, b) = \Phi(a)^T \Phi(b)$, $K(a, b)$ has to be a positive semi-definite function; the kernel matrix $K$ is then symmetric PSD for any given $x_1, \ldots, x_N$.
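As an illustrative (not definitive) test, one can build the kernel matrix on a sample of points and check symmetry and the smallest eigenvalue; a genuinely indefinite candidate such as the plain squared distance fails the check. The helper names below are my own.

```python
import numpy as np

# Sketch: empirically check the necessary Mercer condition on a sample of
# points -- the kernel matrix should be symmetric positive semi-definite.
def kernel_matrix(kernel, X):
    return np.array([[kernel(a, b) for b in X] for a in X])

def looks_psd(K, tol=1e-8):
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

X = np.random.default_rng(1).standard_normal((20, 3))
K_good = kernel_matrix(lambda a, b: (a @ b + 1.0) ** 2, X)       # polynomial kernel
K_bad = kernel_matrix(lambda a, b: float((a - b) @ (a - b)), X)  # squared distance
print(looks_psd(K_good), looks_psd(K_bad))                       # expected: True False
```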
Jiayu Zhou CSE 847 Machine Learning 36 / 50
Kernel Design: expression kernel
mRNA expression data:
Each matrix entry is an mRNA expression measurement. Each column is an experiment. Each row corresponds to a gene.
Are two expression profiles similar or dissimilar?
Kernel:
$$K(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i x_i} \sqrt{\sum_i y_i y_i}}$$
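A minimal sketch of this expression kernel as a plain function (it is just the cosine similarity of two expression profiles):

```python
import numpy as np

# Normalized inner product (cosine similarity) between expression profiles.
def expression_kernel(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return (x @ y) / (np.sqrt(x @ x) * np.sqrt(y @ y))
```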
Jiayu Zhou CSE 847 Machine Learning 37 / 50
Kernel Design: sequence kernel
Work with non-vectorial data
Scalar product on a pair of variable-length, discrete strings?
>ICYA_MANSE
GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKYDGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVNLVPWVLATDYKNYAINYMENSHPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH
>LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI
Jiayu Zhou CSE 847 Machine Learning 38 / 50
Commonly Used SVM Kernel Functions
$K(a, b) = (\alpha \cdot a^T b + \beta)^Q$ is an example of an SVM kernel function.
Beyond polynomials, there are other very high-dimensional basis functions that can be made practical by finding the right kernel function:
Radial-basis-style (RBF) / Gaussian kernel: $K(a, b) = \exp\left(-\gamma \|a - b\|^2\right)$
Sigmoid functions
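Sketches of the polynomial and Gaussian-RBF kernels above as plain functions (the parameter names alpha, beta, Q, gamma mirror the slide's symbols):

```python
import numpy as np

# Polynomial kernel K(a, b) = (alpha * a^T b + beta)^Q
def polynomial_kernel(a, b, alpha=1.0, beta=1.0, Q=2):
    return (alpha * (a @ b) + beta) ** Q

# Gaussian / RBF kernel K(a, b) = exp(-gamma * ||a - b||^2)
def gaussian_kernel(a, b, gamma=1.0):
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return np.exp(-gamma * (diff @ diff))
```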
Jiayu Zhou CSE 847 Machine Learning 39 / 50
2nd Order Polynomial Kernel
$$K(a, b) = (\alpha \cdot a^T b + \beta)^2$$
Jiayu Zhou CSE 847 Machine Learning 40 / 50
Gaussian Kernels
$$K(a, b) = \exp\left(-\gamma \|a - b\|^2\right)$$
When $\gamma$ is large, we clearly see that even the protection of a large margin cannot suppress overfitting. However, for a reasonably small $\gamma$, the sophisticated boundary discovered by SVM with the Gaussian-RBF kernel looks quite good.
Jiayu Zhou CSE 847 Machine Learning 41 / 50
Gaussian Kernels
In (a), a noisy data set on which a linear classifier appears to work quite well; in (b), using the Gaussian-RBF kernel with the hard-margin SVM leads to overfitting.
Jiayu Zhou CSE 847 Machine Learning 42 / 50
From hard-margin to soft-margin
When there are outliers, the hard-margin SVM with a Gaussian-RBF kernel results in an unnecessarily complicated decision boundary that overfits the training noise.
Remedy: a soft formulation that allows small violations of the margin or even some classification errors.
Soft margin: introduce a margin violation $\varepsilon_n \ge 0$ for each data point $(x_n, y_n)$ and require that
$$y_n(w^T x_n + b) \ge 1 - \varepsilon_n$$
$\varepsilon_n$ captures by how much $(x_n, y_n)$ fails to be separated.
Jiayu Zhou CSE 847 Machine Learning 43 / 50
Soft-Margin SVM
We modify the hard-margin SVM to the soft-margin SVM by allowing margin violations but adding a penalty term to discourage large violations:
$$\min_{b, w, \varepsilon} \; \frac{1}{2} w^T w + C \sum_{n=1}^{N} \varepsilon_n$$
$$\text{subject to: } y_n(w^T x_n + b) \ge 1 - \varepsilon_n \;\text{ and }\; \varepsilon_n \ge 0, \quad \text{for } n = 1, \ldots, N$$
The meaning of C?
When C is large, violating the margin is penalized heavily, which gets us closer to the hard-margin SVM.
When C is small, on the other hand, we care less about violating themargin.
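A hedged sketch of this trade-off using scikit-learn's SVC (an off-the-shelf soft-margin SVM, used here as an assumption rather than anything from the slides); with a few flipped labels acting as outliers, a small C tolerates more margin violations than a large C.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data with a few flipped labels to act as outliers.
rng = np.random.default_rng(2)
X = rng.standard_normal((40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
y[:3] = -y[:3]

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Larger C penalizes violations more heavily (closer to hard margin),
    # typically leaving fewer support vectors on or inside the margin.
    print(C, clf.n_support_.sum())
```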
Jiayu Zhou CSE 847 Machine Learning 44 / 50
Soft Margin Example
Jiayu Zhou CSE 847 Machine Learning 45 / 50
Soft Margin and Hard Margin
$$\min_{b, w, \varepsilon} \; \underbrace{\frac{1}{2} w^T w}_{\text{margin}} + \underbrace{C \sum_{n=1}^{N} \varepsilon_n}_{\text{error tolerance}}$$
$$\text{subject to: } y_n(w^T x_n + b) \ge 1 - \varepsilon_n, \quad \varepsilon_n \ge 0, \;\forall n$$
Jiayu Zhou CSE 847 Machine Learning 46 / 50
The Hinge Loss
The trade-off sounds very similar, right?
We have $\varepsilon_n \ge 0$ and $y_n(w^T x_n + b) \ge 1 - \varepsilon_n \;\Rightarrow\; \varepsilon_n \ge 1 - y_n(w^T x_n + b)$.
The SVM loss (a.k.a. hinge loss) function:
$$E_{\text{SVM}}(b, w) = \frac{1}{N} \sum_{n=1}^{N} \max\left(1 - y_n(w^T x_n + b),\, 0\right)$$
The soft-margin SVM can be re-written as the following optimization problem:
$$\min_{b, w} \; E_{\text{SVM}}(b, w) + \lambda\, w^T w$$
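A minimal sketch of the hinge loss and the regularized objective above (the function names are mine); X is an N-by-d matrix of inputs and y the vector of ±1 labels.

```python
import numpy as np

# Hinge loss E_SVM(b, w) = (1/N) sum_n max(1 - y_n (w^T x_n + b), 0).
def hinge_loss(b, w, X, y):
    margins = y * (X @ w + b)
    return np.mean(np.maximum(1.0 - margins, 0.0))

# Regularized soft-margin objective: E_SVM(b, w) + lambda * w^T w.
def soft_margin_objective(b, w, X, y, lam):
    return hinge_loss(b, w, X, y) + lam * (w @ w)
```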
Jiayu Zhou CSE 847 Machine Learning 47 / 50
Dual Soft-Margin SVM
$$\max_{\alpha \in \mathbb{R}^N} \; \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{m=1}^{N} \sum_{n=1}^{N} \alpha_n \alpha_m y_n y_m x_n^T x_m$$
$$\text{subject to } \sum_{n=1}^{N} y_n \alpha_n = 0, \quad 0 \le \alpha_n \le C, \;\forall n$$
$$w^* = \sum_{n=1}^{N} y_n \alpha_n^* x_n$$
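Putting the pieces together, here is a minimal sketch (my own, assuming numpy and cvxopt) of solving this dual soft-margin problem with a kernel and returning the final hypothesis g(x); maximizing the dual is rewritten as minimizing its negation to match cvxopt's interface.

```python
import numpy as np
from cvxopt import matrix, solvers

# Dual soft-margin SVM as a QP:
#   min (1/2) a^T P a - 1^T a,   P_nm = y_n y_m K(x_n, x_m),
#   subject to  y^T a = 0  and  0 <= a_n <= C.
def dual_svm(X, y, kernel, C=1.0):
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    N = len(y)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    P = matrix(np.outer(y, y) * K)
    q = matrix(-np.ones(N))
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))    # -a <= 0  and  a <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    Aeq = matrix(y.reshape(1, N))                     # y^T a = 0
    beq = matrix(0.0)
    alpha = np.array(solvers.qp(P, q, G, h, Aeq, beq)["x"]).ravel()

    sv = alpha > 1e-6                                 # support vectors
    # Recover b* from a support vector strictly inside the box (margin = 1).
    m = np.where(sv & (alpha < C - 1e-6))[0][0]
    b_star = y[m] - np.sum(alpha[sv] * y[sv] * K[sv, m])

    def g(x):                                         # final hypothesis
        s = sum(alpha[n] * y[n] * kernel(X[n], x) for n in np.where(sv)[0])
        return np.sign(s + b_star)
    return g
```

For example, g = dual_svm(X, y, lambda a, b: (a @ b + 1.0) ** 2, C=1.0) trains with the quadratic kernel, after which g(x) classifies a new point.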
Jiayu Zhou CSE 847 Machine Learning 48 / 50
Summary of Dual SVM
Deliver a large-margin hyperplane, and in so doing it can control the effective model complexity.
Deal with high- or infinite-dimensional transforms using the kernel trick.
Express the final hypothesis g(x) using only a few support vectors, their corresponding dual variables (Lagrange multipliers), and the kernel.
Control the sensitivity to outliers and regularize the solution through setting C appropriately.
Jiayu Zhou CSE 847 Machine Learning 49 / 50
Support Vector Machine
(Summary diagram.) Robust classifier via the maximum margin; hard-margin and soft-margin (allow training error) designs; each has a primal objective (QP in d variables) and a dual objective (QP in N variables); the dual admits the kernel trick and is expressed through the support vectors; the soft margin connects to the hinge loss.
Jiayu Zhou CSE 847 Machine Learning 50 / 50