INTRODUCTION TO MACHINE LEARNING, 3rd EDITION
ETHEM ALPAYDIN
© The MIT Press, 2014
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e

Lecture Slides for CHAPTER 13: Kernel Machines
Kernel Machines

- Discriminant-based: no need to estimate densities first
- Define the discriminant in terms of support vectors
- The use of kernel functions: application-specific measures of similarity
- No need to represent instances as vectors
- Convex optimization problems with a unique solution
Optimal Separating Hyperplane

$\mathcal{X} = \{\mathbf{x}^t, r^t\}$ where $r^t = +1$ if $\mathbf{x}^t \in C_1$ and $r^t = -1$ if $\mathbf{x}^t \in C_2$.

Find $\mathbf{w}$ and $w_0$ such that

$\mathbf{w}^T \mathbf{x}^t + w_0 \geq +1$ for $r^t = +1$
$\mathbf{w}^T \mathbf{x}^t + w_0 \leq -1$ for $r^t = -1$

which can be rewritten as

$r^t (\mathbf{w}^T \mathbf{x}^t + w_0) \geq +1$

Note that we do not simply require $r^t (\mathbf{w}^T \mathbf{x}^t + w_0) \geq 0$.
Margin

Distance from the discriminant to the closest instances on either side.

Distance of $\mathbf{x}^t$ to the hyperplane is $\dfrac{|\mathbf{w}^T \mathbf{x}^t + w_0|}{\|\mathbf{w}\|}$.

We require $\dfrac{r^t (\mathbf{w}^T \mathbf{x}^t + w_0)}{\|\mathbf{w}\|} \geq \rho, \ \forall t$.

For a unique solution, fix $\rho \|\mathbf{w}\| = 1$; then, to maximize the margin:

$\min \frac{1}{2}\|\mathbf{w}\|^2$ subject to $r^t (\mathbf{w}^T \mathbf{x}^t + w_0) \geq +1, \ \forall t$
Support Vector Machines

[Figure: two candidate hyperplanes B1 and B2 with their margin boundaries b11/b12 and b21/b22.]

Find a hyperplane maximizing the margin => B1 is better than B2.
[Figure: hyperplane B1 ($\mathbf{w}^T \mathbf{x} + w_0 = 0$) with margin boundaries b11 and b12 ($\mathbf{w}^T \mathbf{x} + w_0 = +1$ and $\mathbf{w}^T \mathbf{x} + w_0 = -1$).]

$f(\mathbf{x}) = \begin{cases} +1 & \text{if } \mathbf{w}^T \mathbf{x} + w_0 \geq 1 \\ -1 & \text{if } \mathbf{w}^T \mathbf{x} + w_0 \leq -1 \end{cases}$

$\text{Margin} = \dfrac{2}{\|\mathbf{w}\|}$
Support Vector Machines

To maximize the margin:

$\min \frac{1}{2}\|\mathbf{w}\|^2$ subject to $r^t (\mathbf{w}^T \mathbf{x}^t + w_0) \geq +1, \ \forall t$

$L_p = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{t=1}^{N} \alpha^t \left[ r^t (\mathbf{w}^T \mathbf{x}^t + w_0) - 1 \right] = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_t \alpha^t r^t (\mathbf{w}^T \mathbf{x}^t + w_0) + \sum_t \alpha^t$

$\dfrac{\partial L_p}{\partial \mathbf{w}} = 0 \Rightarrow \mathbf{w} = \sum_t \alpha^t r^t \mathbf{x}^t$

$\dfrac{\partial L_p}{\partial w_0} = 0 \Rightarrow \sum_t \alpha^t r^t = 0$
Plugging these back in, we maximize the dual with respect to $\alpha^t$ only:

$L_d = \frac{1}{2}\mathbf{w}^T\mathbf{w} - \mathbf{w}^T \sum_t \alpha^t r^t \mathbf{x}^t - w_0 \sum_t \alpha^t r^t + \sum_t \alpha^t = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (\mathbf{x}^t)^T \mathbf{x}^s + \sum_t \alpha^t$

subject to $\sum_t \alpha^t r^t = 0$ and $\alpha^t \geq 0, \ \forall t$.

Most $\alpha^t$ are 0 and only a small number have $\alpha^t > 0$; these are the support vectors. For any support vector, $r^t (\mathbf{w}^T \mathbf{x}^t + w_0) = 1$, which gives $w_0 = r^t - \mathbf{w}^T \mathbf{x}^t$.
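A minimal sketch (assuming scikit-learn; the two-cluster data are synthetic) of these ideas: fit a linear SVM with a large C to approximate the hard margin, then read the support vectors and the dual solution off the fitted model.

```python
# Hard-margin-like linear SVM: only a few alpha^t end up non-zero.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)),   # class -1 cluster
               rng.normal(+2, 0.5, (20, 2))])  # class +1 cluster
r = np.array([-1] * 20 + [+1] * 20)

svm = SVC(kernel="linear", C=1e6)  # very large C approximates the hard margin
svm.fit(X, r)

print("support vectors:", len(svm.support_))   # small subset of the 40 points
# dual_coef_ stores alpha^t * r^t for the support vectors.
print("alpha^t r^t:", svm.dual_coef_)
# w = sum_t alpha^t r^t x^t, recovered from the dual solution:
w = svm.dual_coef_ @ svm.support_vectors_
print("w:", w, "w0:", svm.intercept_)
```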
Linear SVM for Non-linearly Separable Problems

What if the problem is not linearly separable? Introduce slack variables $\xi^t$. We need to minimize

$L(\mathbf{w}) = \dfrac{\|\mathbf{w}\|^2}{2} + C \sum_{t=1}^{N} \xi^t$

The first term is the inverse of the size of the margin between the hyperplanes; the second measures prediction error; the slack variable $\xi^t$ allows constraint violation to a certain degree; and the parameter C trades the two off. C is chosen using a validation set, trying to keep the margin wide while keeping the training error low. (No kernel is used yet.)
Soft Margin Hyperplane

When the data are not linearly separable, slack variables $\xi^t$ (tolerance variables) relax the constraints:

$r^t (\mathbf{w}^T \mathbf{x}^t + w_0) \geq 1 - \xi^t$

We define the soft error as $\sum_t \xi^t$; the number of misclassifications is $\#\{\xi^t > 1\}$.

The new primal is

$L_p = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_t \xi^t - \sum_t \alpha^t \left[ r^t (\mathbf{w}^T \mathbf{x}^t + w_0) - 1 + \xi^t \right] - \sum_t \mu^t \xi^t$

where $\mu^t$ are the new Lagrange parameters that guarantee the positivity of $\xi^t$.
Setting the derivatives of $L_p$ to zero:

$\dfrac{\partial L_p}{\partial \mathbf{w}} = 0 \Rightarrow \mathbf{w} = \sum_t \alpha^t r^t \mathbf{x}^t$

$\dfrac{\partial L_p}{\partial w_0} = 0 \Rightarrow \sum_t \alpha^t r^t = 0$

$\dfrac{\partial L_p}{\partial \xi^t} = 0 \Rightarrow C - \alpha^t - \mu^t = 0 \Rightarrow 0 \leq \alpha^t \leq C$

The dual is

$L_d = \sum_t \alpha^t - \frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (\mathbf{x}^t)^T \mathbf{x}^s$

subject to $\sum_t \alpha^t r^t = 0$ and $0 \leq \alpha^t \leq C, \ \forall t$.

$E_N[P(\text{error})] \leq \dfrac{E_N[\#\text{ of support vectors}]}{N}$
17
In classifying an instance, there are four possible cases: In (a), the
instance is on the correct side and far away from the margin; rt g(xt)>1,
ξ t = 0. In (b), ξ t = 0; it is on the right side and on the margin. In (c),
ξ t =1− g(xt), 0 < ξ < 1; it is on the right side but is in the margin and not
sufficiently away. In (d), ξ t =1+ g(xt) > 1; it is on the wrong side—this
is a misclassification. All cases except (a) are support vectors. In terms
of the dual variable, in (a), αt= 0; in (b), αt < C; in (c) and (d), αt = C.
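An earlier slide notes that C is chosen using a validation set. A minimal sketch (assuming scikit-learn; the data are synthetic) of that model selection, using cross-validated grid search rather than a single split:

```python
# Pick C on held-out data: wide margin vs. low training error trade-off.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
r = np.where(X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 100) > 0, 1, -1)

search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                      cv=5)  # 5-fold cross-validation as the validation scheme
search.fit(X, r)
print("best C:", search.best_params_["C"])
```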
*Hinge Loss

We define an error if the instance is on the wrong side or if its margin is less than 1. This is called the hinge loss. If $y^t = \mathbf{w}^T \mathbf{x}^t + w_0$ is the output and $r^t$ is the desired output:

$L_{\text{hinge}}(y^t, r^t) = \begin{cases} 0 & \text{if } y^t r^t \geq 1 \\ 1 - y^t r^t & \text{otherwise} \end{cases}$

Comparison of different loss functions for $r^t = 1$: 0/1 loss is 0 if the thresholded prediction equals $r^t$, 1 otherwise. Hinge loss is 0 if $y^t \geq 1$, $1 - y^t$ otherwise. Squared error is $(1 - y^t)^2$. Cross-entropy is $\log(1 + \exp(-y^t))$.
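A small numeric check (plain NumPy) of the four losses listed above for desired output $r^t = 1$, over a range of raw outputs $y^t$:

```python
import numpy as np

y = np.linspace(-2, 2, 9)              # raw outputs y^t = w^T x^t + w0
zero_one = (y < 0).astype(float)        # 0/1 loss: thresholded prediction != 1
hinge = np.maximum(0.0, 1.0 - y)        # 0 if y^t >= 1, else 1 - y^t
squared = (1.0 - y) ** 2                # squared error
cross_ent = np.log(1.0 + np.exp(-y))    # cross-entropy, -log sigmoid(y)

for row in zip(y, zero_one, hinge, squared, cross_ent):
    print("y=%+.1f  0/1=%.0f  hinge=%.2f  sq=%.2f  xent=%.2f" % row)
```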
*ν-SVM

$\min \frac{1}{2}\|\mathbf{w}\|^2 - \nu\rho + \dfrac{1}{N} \sum_t \xi^t$

subject to $r^t (\mathbf{w}^T \mathbf{x}^t + w_0) \geq \rho - \xi^t$, $\xi^t \geq 0$, $\rho \geq 0$

The dual is

$L_d = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (\mathbf{x}^t)^T \mathbf{x}^s$

subject to $\sum_t \alpha^t r^t = 0$, $0 \leq \alpha^t \leq \dfrac{1}{N}$, $\sum_t \alpha^t \geq \nu$
ν controls the fraction of support vectors
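A minimal sketch (assuming scikit-learn, whose NuSVC implements this ν formulation; the data and parameter values are illustrative) showing ν acting as a lower bound on the fraction of support vectors:

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
r = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)  # non-linear labels

for nu in (0.05, 0.2, 0.5):
    frac = NuSVC(nu=nu, kernel="rbf").fit(X, r).support_.size / len(X)
    print(f"nu={nu:.2f} -> fraction of support vectors = {frac:.2f}")
```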
Key Properties of Support Vector Machines

1. They use a single hyperplane that subdivides the space into two half-spaces, one occupied by Class 1 and the other by Class 2.
2. They maximize the margin of the decision boundary using quadratic optimization techniques that find the optimal hyperplane.
3. When used in practice, SVM approaches frequently map the examples (using a basis function φ) to a higher-dimensional space and find margin-maximal hyperplanes in the mapped space, obtaining decision boundaries that are not hyperplanes in the original space.
4. Moreover, versions of SVMs exist that can be used when linear separability cannot be accomplished.
Nonlinear Support Vector Machines

What if the decision boundary is not linear?

Alternative 1: Use a technique that employs non-linear decision boundaries (a non-linear function).

Alternative 2: Transform the data into a higher-dimensional attribute space and find linear decision boundaries in this space:
1. Transform the data into a higher-dimensional space.
2. Find the best hyperplane using the methods introduced earlier.
Nonlinear Support Vector Machines

Kernel Trick

Preprocess input $\mathbf{x}$ by basis functions: $\mathbf{z} = \boldsymbol{\phi}(\mathbf{x})$, $g(\mathbf{z}) = \mathbf{w}^T \mathbf{z}$, so $g(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$.

The SVM solution:

$\mathbf{w} = \sum_t \alpha^t r^t \mathbf{z}^t = \sum_t \alpha^t r^t \boldsymbol{\phi}(\mathbf{x}^t)$

$g(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) = \sum_t \alpha^t r^t \boldsymbol{\phi}(\mathbf{x}^t)^T \boldsymbol{\phi}(\mathbf{x}) = \sum_t \alpha^t r^t K(\mathbf{x}^t, \mathbf{x})$
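A minimal sketch (assuming scikit-learn; the data and parameters are illustrative) that rebuilds $g(\mathbf{x}) = \sum_t \alpha^t r^t K(\mathbf{x}^t, \mathbf{x}) + w_0$ from the stored support vectors and checks it against the library's own decision function:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
r = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)          # XOR-like problem

svm = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, r)

X_new = rng.normal(size=(5, 2))
K = rbf_kernel(X_new, svm.support_vectors_, gamma=1.0)  # K(x, x^t)
# dual_coef_ stores alpha^t r^t for each support vector.
g_manual = K @ svm.dual_coef_.ravel() + svm.intercept_
print(np.allclose(g_manual, svm.decision_function(X_new)))  # True
```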
Vectorial Kernels

Polynomials of degree q:

$K(\mathbf{x}^t, \mathbf{x}) = (\mathbf{x}^T \mathbf{x}^t + 1)^q$

For example, for q = 2 and two-dimensional inputs:

$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T \mathbf{y} + 1)^2 = (x_1 y_1 + x_2 y_2 + 1)^2 = 1 + 2x_1 y_1 + 2x_2 y_2 + 2x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2$

which corresponds to the inner product of the basis function

$\boldsymbol{\phi}(\mathbf{x}) = [1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_1 x_2, x_1^2, x_2^2]^T$
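A quick numeric check (plain NumPy) that the degree-2 polynomial kernel equals the inner product of this basis expansion:

```python
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, x1**2, x2**2])

x = np.array([0.5, -1.0])
y = np.array([2.0, 0.3])

k_direct = (x @ y + 1.0) ** 2    # K(x, y) = (x^T y + 1)^2
k_mapped = phi(x) @ phi(y)       # phi(x)^T phi(y)
print(k_direct, k_mapped)        # identical values (2.89, 2.89)
```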
Defining Kernels

Kernels are generally considered to be measures of similarity. Kernel "engineering" means defining good measures of similarity: string kernels, graph kernels, image kernels, ...

Given two documents D1 and D2, one possible representation is called bag of words, where we predefine M words relevant for the application; $\boldsymbol{\phi}(D_1)^T \boldsymbol{\phi}(D_2)$ then counts the number of shared words.
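A minimal sketch of this bag-of-words kernel with binary counts; the five-word vocabulary is a toy stand-in for the M predefined application-relevant words:

```python
vocabulary = ["kernel", "margin", "support", "vector", "density"]

def phi(doc):
    words = set(doc.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

d1 = "the kernel defines the margin"
d2 = "support vector machines maximize the margin"
K = sum(a * b for a, b in zip(phi(d1), phi(d2)))
print(K)  # 1: only "margin" is shared from the vocabulary
```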
Given two strings (e.g., of genes), a kernel can measure the edit distance, namely, how many operations (insertions, deletions, substitutions) it takes to convert one string into the other; this is also called alignment.

Empirical kernel map: Define a set of templates $m_i$ and a score function $s(\mathbf{x}, m_i)$, let

$\boldsymbol{\phi}(\mathbf{x}^t) = [s(\mathbf{x}^t, m_1), s(\mathbf{x}^t, m_2), \ldots, s(\mathbf{x}^t, m_M)]^T$

and define the empirical kernel map as $K(\mathbf{x}^t, \mathbf{x}^s) = \boldsymbol{\phi}(\mathbf{x}^t)^T \boldsymbol{\phi}(\mathbf{x}^s)$.
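A minimal sketch of the empirical kernel map; the templates and the Gaussian score function used here are illustrative assumptions, not prescribed by the slide:

```python
import numpy as np

templates = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])  # m_1 .. m_M

def s(x, m):
    return np.exp(-np.sum((x - m) ** 2))   # hypothetical score function

def phi(x):
    return np.array([s(x, m) for m in templates])

x_t, x_s = np.array([0.5, 0.5]), np.array([1.0, 0.0])
K = phi(x_t) @ phi(x_s)                    # K(x^t, x^s) = phi(x^t)^T phi(x^s)
print(K)
```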
*Multiple Kernel Learning

Fixed kernel combination:

$K(\mathbf{x}, \mathbf{y}) = cK_1(\mathbf{x}, \mathbf{y})$
$K(\mathbf{x}, \mathbf{y}) = K_1(\mathbf{x}, \mathbf{y}) + K_2(\mathbf{x}, \mathbf{y})$
$K(\mathbf{x}, \mathbf{y}) = K_1(\mathbf{x}, \mathbf{y}) \, K_2(\mathbf{x}, \mathbf{y})$

Adaptive kernel combination:

$K(\mathbf{x}, \mathbf{y}) = \sum_i \eta_i K_i(\mathbf{x}, \mathbf{y})$

$L_d = \sum_t \alpha^t - \frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s \sum_i \eta_i K_i(\mathbf{x}^t, \mathbf{x}^s)$

$g(\mathbf{x}) = \sum_t \alpha^t r^t \sum_i \eta_i K_i(\mathbf{x}^t, \mathbf{x})$

Localized kernel combination:

$g(\mathbf{x}) = \sum_t \alpha^t r^t \sum_i \eta_i(\mathbf{x} | \theta_i) K_i(\mathbf{x}^t, \mathbf{x})$
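A minimal sketch (assuming scikit-learn) of a fixed combination, K = K1 + K2, supplied to an SVM as a precomputed Gram matrix:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 2))
r = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

K_train = linear_kernel(X, X) + rbf_kernel(X, X, gamma=0.5)  # K1 + K2
svm = SVC(kernel="precomputed").fit(K_train, r)

# For prediction, the kernel matrix is between test and training instances.
X_new = rng.normal(size=(3, 2))
K_new = linear_kernel(X_new, X) + rbf_kernel(X_new, X, gamma=0.5)
print(svm.predict(K_new))
```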
*Multiclass Kernel Machines

- 1-vs-all
- Pairwise separation (K(K−1)/2 classifiers)
- Error-Correcting Output Codes (section 17.5)
- Single multiclass optimization:

$\min \frac{1}{2} \sum_i \|\mathbf{w}_i\|^2 + C \sum_t \sum_i \xi_i^t$

subject to $\mathbf{w}_{z^t}^T \mathbf{x}^t + w_{z^t 0} \geq \mathbf{w}_i^T \mathbf{x}^t + w_{i0} + 2 - \xi_i^t, \ \forall t, \ i \neq z^t, \ \xi_i^t \geq 0$

The one-vs.-all approach is generally preferred because it solves K separate N-variable problems, whereas the multiclass formulation uses K·N variables.
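A minimal sketch (assuming scikit-learn; the three-class data are synthetic) of the 1-vs-all approach, training K separate binary SVMs:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(5)
centers = np.array([[0, 0], [4, 0], [2, 4]])           # K = 3 classes
X = np.vstack([rng.normal(c, 0.6, (30, 2)) for c in centers])
z = np.repeat([0, 1, 2], 30)

clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X, z)
print(len(clf.estimators_))         # 3 separate binary problems
print(clf.predict([[4.2, 0.1]]))    # near the second center -> class 1
```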
*SVM for Regression

Use a linear model (possibly kernelized): $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$

Use the ε-sensitive error function

$e_\varepsilon(r^t, f(\mathbf{x}^t)) = \begin{cases} 0 & \text{if } |r^t - f(\mathbf{x}^t)| < \varepsilon \\ |r^t - f(\mathbf{x}^t)| - \varepsilon & \text{otherwise} \end{cases}$

which means that we tolerate errors up to ε and that errors beyond ε have a linear rather than a quadratic effect. This error function is therefore more tolerant of noise and thus more robust.
$\min \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_t (\xi_+^t + \xi_-^t)$

subject to

$r^t - (\mathbf{w}^T \mathbf{x}^t + w_0) \leq \varepsilon + \xi_+^t$
$(\mathbf{w}^T \mathbf{x}^t + w_0) - r^t \leq \varepsilon + \xi_-^t$
$\xi_+^t, \xi_-^t \geq 0$

We use two types of slack variables, for positive and negative deviations, to keep them positive.
The dual is

$L_d = -\frac{1}{2} \sum_t \sum_s (\alpha_+^t - \alpha_-^t)(\alpha_+^s - \alpha_-^s)(\mathbf{x}^t)^T \mathbf{x}^s - \varepsilon \sum_t (\alpha_+^t + \alpha_-^t) + \sum_t r^t (\alpha_+^t - \alpha_-^t)$

subject to $0 \leq \alpha_+^t \leq C$, $0 \leq \alpha_-^t \leq C$, and $\sum_t (\alpha_+^t - \alpha_-^t) = 0$.

Once we solve this, we see that all instances that fall inside the tube have $\alpha_+^t = \alpha_-^t = 0$; these are the instances that are fitted with enough precision. The support vectors satisfy either $\alpha_+^t > 0$ or $\alpha_-^t > 0$ and are of two types. They may be instances on the boundary of the tube (either $\alpha_+^t$ or $\alpha_-^t$ is between 0 and C), and we use these to calculate $w_0$.

We can write the fitted line as a weighted sum of the support vectors:

$f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0 = \sum_t (\alpha_+^t - \alpha_-^t)(\mathbf{x}^t)^T \mathbf{x} + w_0$

Note: $(\mathbf{x}^t)^T \mathbf{x}$ can be replaced with a kernel $K(\mathbf{x}^t, \mathbf{x})$.
The fitted regression line to data points shown as crosses and the ε-tube are shown (C = 10, ε = 0.25). There are three cases: In (a), the instance is inside the tube; in (b), the instance is on the boundary of the tube (circled instances); in (c), it is outside the tube with a positive slack, that is, $\xi_+^t > 0$ (squared instances). (b) and (c) are support vectors. In terms of the dual variable, in (a), $\alpha_+^t = \alpha_-^t = 0$; in (b), $\alpha_+^t < C$; and in (c), $\alpha_+^t = C$.
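A minimal sketch (assuming scikit-learn) mirroring the figure's settings (C = 10, ε = 0.25) on synthetic data; instances inside the ε-tube incur no loss and are not support vectors:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(6)
X = np.sort(rng.uniform(0, 4, (40, 1)), axis=0)
r = 0.5 * X.ravel() + 1.0 + rng.normal(0, 0.15, 40)   # noisy line

svr = SVR(kernel="linear", C=10.0, epsilon=0.25).fit(X, r)
inside_tube = np.abs(r - svr.predict(X)) < 0.25
print("inside the tube (alpha+ = alpha- = 0):", inside_tube.sum())
print("support vectors:", svr.support_.size)  # roughly the complement
```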
*Kernel Machines for Ranking

We require not only that the scores be in the correct order but also that they differ by a margin of at least 1. Linear case:

$\min \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_t \xi^t$

subject to $\mathbf{w}^T \mathbf{x}^u \geq \mathbf{w}^T \mathbf{x}^v + 1 - \xi^t$ for every pair $t = (u, v)$ where $\mathbf{x}^u$ should be ranked ahead of $\mathbf{x}^v$, with $\xi^t \geq 0$.

The dual is

$L_d = \sum_t \alpha^t - \frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s (\mathbf{x}^u - \mathbf{x}^v)^T (\mathbf{x}^k - \mathbf{x}^l)$

subject to $0 \leq \alpha^t \leq C$, where pair $t$ ranks $\mathbf{x}^u$ ahead of $\mathbf{x}^v$ and pair $s$ ranks $\mathbf{x}^k$ ahead of $\mathbf{x}^l$.

For a new test instance $\mathbf{x}$, the score is calculated as

$g(\mathbf{x}) = \sum_t \alpha^t (\mathbf{x}^u - \mathbf{x}^v)^T \mathbf{x}$
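Since each constraint involves a difference vector $\mathbf{x}^u - \mathbf{x}^v$, one common way to realize this is to train a binary SVM on such differences. A sketch of this reduction (assuming scikit-learn; an illustration of the idea, not the slide's exact solver):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 3))
scores = X @ np.array([1.0, -2.0, 0.5])            # hidden true ranking

pairs = [(u, v) for u in range(50) for v in range(50)
         if scores[u] > scores[v]]                 # u should rank above v
D = np.array([X[u] - X[v] for u, v in pairs])      # difference vectors
y = np.ones(len(pairs))                            # all "+1" constraints

# Flip half the pairs so the binary problem contains both classes.
D[::2], y[::2] = -D[::2], -1
w = LinearSVC(C=1.0, fit_intercept=False).fit(D, y).coef_.ravel()
print(np.corrcoef(X @ w, scores)[0, 1])            # close to 1
```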
*One-Class Kernel Machines

Consider a sphere with center $\mathbf{a}$ and radius $R$:

$\min R^2 + C \sum_t \xi^t$

subject to $\|\mathbf{x}^t - \mathbf{a}\|^2 \leq R^2 + \xi^t$, $\xi^t \geq 0$

The dual is

$L_d = \sum_t \alpha^t (\mathbf{x}^t)^T \mathbf{x}^t - \sum_t \sum_s \alpha^t \alpha^s (\mathbf{x}^t)^T \mathbf{x}^s$

subject to $0 \leq \alpha^t \leq C, \ \forall t$, and $\sum_t \alpha^t = 1$
A one-class support vector machine places the smoothest boundary (here, using a linear kernel, the circle with the smallest radius) that encloses as many of the instances as possible. There are three possible cases: In (a), the instance is a typical instance. In (b), the instance falls on the boundary with $\xi^t = 0$; such instances define R. In (c), the instance is an outlier with $\xi^t > 0$. (b) and (c) are support vectors. In terms of the dual variable, we have, in (a), $\alpha^t = 0$; in (b), $0 < \alpha^t < C$; in (c), $\alpha^t = C$.
[Figure: one-class support vector machines using a Gaussian kernel with different spreads.]
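A minimal sketch (assuming scikit-learn; note that its OneClassSVM implements the ν-parameterized hyperplane formulation of Schölkopf et al., a close relative of the sphere above, and the two coincide for kernels such as the Gaussian, where K(x, x) is constant):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(8)
X = rng.normal(0, 1, (200, 2))             # typical instances
X_out = rng.uniform(-6, 6, (10, 2))        # a few outliers

oc = OneClassSVM(kernel="rbf", gamma=0.2, nu=0.05).fit(X)
print((oc.predict(X) == 1).mean())         # most typical points inside (+1)
print(oc.predict(X_out))                   # outliers mostly flagged as -1
```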
*Large Margin Nearest Neighbor

Learns the matrix M of a Mahalanobis metric:

$D(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j)^T \mathbf{M} (\mathbf{x}_i - \mathbf{x}_j)$

For three instances i, j, and l, where i and j are of the same class and l is of a different class, we require

$D(\mathbf{x}_i, \mathbf{x}_l) > D(\mathbf{x}_i, \mathbf{x}_j) + 1$

and if this is not satisfied, we have a slack for the difference, and we learn M to minimize the sum of such slacks over all (i, j, l) triples (j and l being among the k neighbors of i, over all i).

*Learning a Distance Measure

The LMNN algorithm (Weinberger and Saul 2009) follows this approach; the LMCA algorithm (Torresani and Lee 2007) uses a similar one, where $\mathbf{M} = \mathbf{L}^T \mathbf{L}$ and L is learned. A small numerical sketch of the metric follows below.
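A small sketch (plain NumPy; the matrix L is an arbitrary stand-in for a learned projection) showing that with M = LᵀL the learned distance is just a squared Euclidean distance after the linear map L:

```python
import numpy as np

L = np.array([[2.0, 0.0],
              [0.5, 1.0]])        # stand-in for a learned projection
M = L.T @ L                       # guarantees M is positive semidefinite

def D(xi, xj):
    d = xi - xj
    return d @ M @ d              # D(xi, xj) = (xi - xj)^T M (xi - xj)

xi, xj = np.array([1.0, 2.0]), np.array([0.0, 1.0])
print(D(xi, xj), np.sum((L @ (xi - xj)) ** 2))   # equal by construction
```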
*Motivation: Kernel PCA

Remark: This approach uses kernels, but is unrelated to SVMs!

Example: we want to cluster a dataset using k-means, which will be difficult in the original coordinates; the idea is to change the coordinate system using a few new, non-linear features.

*Kernel PCA

Kernel PCA does PCA on the kernel matrix (equal to doing PCA in the mapped space, selecting some orthogonal eigenvectors in the mapped space as the new coordinate system). It is a kind of PCA using non-linear transformations of the original space; moreover, the vectors of the chosen new coordinate system are usually not orthogonal in the original space. ML/DM algorithms are then used in the Reduced Feature Space.
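A minimal sketch (assuming scikit-learn; the two-circles dataset and all parameter values are illustrative) of this recipe, kernel PCA followed by k-means in the reduced space:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.cluster import KMeans

X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=9)

# Map to a new coordinate system with non-linear features, then cluster.
Z = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=9).fit_predict(Z)
print(np.bincount(labels))   # two clusters that k-means on raw X would miss
```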
[Diagram: Original Space -> (non-linear map) -> Feature Space -> PCA -> Reduced Feature Space (fewer dimensions); the reduced features are a few linear combinations of the features in the Feature Space.]