Linear Classifiers / SVM
Soongsil University, Dept. of Industrial and Information Systems Engineering
Intelligence Systems Lab.
Linear Classifiers
[Figure slides: a sample, its features, and a training set of labeled examples]
How to Classify Them Using a Computer?
The linear classifier computes a weighted sum of the features:

$$W^{T}X = \sum_{i=1}^{n} w_i x_i \quad\text{or}\quad W X^{T} = \sum_{i=1}^{n} w_i x_i$$
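A minimal sketch of this computation (not from the slides): the weight vector, bias, and sample below are made-up illustrative values, and the class is obtained by thresholding the weighted sum at zero.

```python
import numpy as np

w = np.array([0.4, -1.2, 0.7])   # learned weights (hypothetical values)
b = 0.1                          # learned bias (hypothetical value)
x = np.array([1.0, 0.5, 2.0])    # feature vector of one sample

score = np.dot(w, x) + b         # the weighted sum W^T X + b
label = 1 if score >= 0 else -1  # threshold at zero to get a class label
print(score, label)
```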
Linear Classification
Linear Classifiers

Optimal Hyperplane
SVMs (Support Vector Machines)
[Figure: a candidate decision boundary with misclassified points]
Which Separating Hyperplane to Use?
[Figure: Var1 vs Var2 scatter; points denote classes +1 and -1, with several candidate separating hyperplanes]
Any of these would be fine...
...but which is best?

Optimal Hyperplane
SVMs (Support Vector Machines)
Support Vector Machines
Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient way): maximize the margin
2. Extend the above definition to non-linearly separable problems: add a penalty term for misclassifications
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that data is mapped implicitly to this space
Optimal Hyperplane
SVMs (Support Vector Machines)
Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a data point.

Canonical Hyperplane
SVMs (Support Vector Machines)
The maximum-margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (called a linear SVM, or LSVM).
Normal Vector
SVMs (Support Vector Machines)
[Figure: the weight vector W is normal to the separating hyperplane]

Maximizing the Margin
[Figure: Var1 vs Var2 with the margin width marked]
IDEA 1: Select the separating hyperplane that maximizes the margin!
Support Vectors
[Figure: Var1 vs Var2; the data points lying on the margin boundaries are the support vectors]
Margin Width
[Figure: decision boundary B1 with margin hyperplanes b11, b12 and two points x1, x2 on opposite margin boundaries]

For a point $X_1$ on the plus-side boundary and a point $X_2$ on the minus-side boundary:

$$W \cdot X_1 + b = 1, \qquad W \cdot X_2 + b = -1 \;\Rightarrow\; W \cdot (X_1 - X_2) = 2$$

The inner product of the vector $W$ with $X_1 - X_2$ has the following geometric meaning:

$$W \cdot (X_1 - X_2) = \lVert W \rVert \, \lVert X_1 - X_2 \rVert \cos\theta$$

Since $d = \lVert X_1 - X_2 \rVert \cos\theta$ (the distance between the two margin boundaries measured along $W$),

$$d\,\lVert W \rVert = 2 \;\Rightarrow\; \text{Margin } d = \frac{2}{\lVert W \rVert}, \qquad W \cdot X + b = 0 \text{ being the separating hyperplane.}$$
Setting Up the Optimization Problem
[Figure: Var1 vs Var2 with the hyperplanes w·x + b = 1, w·x + b = 0, w·x + b = -1]

There is a scale and unit for the data such that k = 1. Then the problem becomes:

$$\max \frac{2}{\lVert w \rVert} \quad \text{s.t.} \quad (w \cdot x + b) \ge 1,\ \forall x \text{ of class 1}; \qquad (w \cdot x + b) \le -1,\ \forall x \text{ of class 2}$$
Setting Up the Optimization Problem
SVMs (Support Vector Machines)
[Figure: Var1 vs Var2 with the hyperplanes w·x + b = k, w·x + b = 0, w·x + b = -k]

The width of the margin is:

$$\frac{2k}{\lVert w \rVert}$$

So, the problem is:

$$\max \frac{2k}{\lVert w \rVert} \quad \text{s.t.} \quad (w \cdot x + b) \ge k,\ \forall x \text{ of class 1}; \qquad (w \cdot x + b) \le -k,\ \forall x \text{ of class 2}$$
Setting Up the Optimization Problem
SVMs (Support Vector Machines)
• If class 1 corresponds to 1 and class 2 corresponds to -1, we can rewrite

$$(w \cdot x_i + b) \ge 1 \text{ for } x_i \text{ with } y_i = 1, \qquad (w \cdot x_i + b) \le -1 \text{ for } x_i \text{ with } y_i = -1$$

• as

$$y_i (w \cdot x_i + b) \ge 1,\ \forall x_i$$

• So the problem becomes:

$$\max \frac{2}{\lVert w \rVert} \ \ \text{s.t.}\ \ y_i (w \cdot x_i + b) \ge 1,\ \forall x_i
\qquad\text{or equivalently}\qquad
\min \frac{1}{2}\lVert w \rVert^2 \ \ \text{s.t.}\ \ y_i (w \cdot x_i + b) \ge 1,\ \forall x_i$$

$\lVert w \rVert^2 = w^{T} w$ is minimized.
Linear, Hard-Margin SVM Formulation
Find w, b that solve

$$\min \frac{1}{2}\lVert w \rVert^2 \quad \text{s.t.}\quad y_i (w \cdot x_i + b) \ge 1,\ \forall x_i$$

The problem is convex, so there is a unique global minimum value (when feasible)
There is also a unique minimizer, i.e. the weight vector and b value that achieve the minimum
Not solvable if the data is not linearly separable
This is a Quadratic Programming problem: very efficient computationally with modern constrained optimization engines (handles thousands of constraints and training instances).
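A minimal sketch (not from the slides), assuming scikit-learn and NumPy: the hard-margin case is approximated by a linear SVC with a very large C on a tiny made-up separable dataset.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],        # class +1 (hypothetical points)
              [-1.0, -1.0], [-2.0, -1.5], [-3.0, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e10)   # huge C ~ hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)
```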
Finding the Decision Boundary
• Let {x1, ..., xn} be our data set and let yi ∈ {1, -1} be the class label of xi
• The decision boundary should classify all points correctly, i.e. $y_i (w \cdot x_i + b) \ge 1,\ \forall i$
• The decision boundary can be found by solving the following constrained optimization problem:

$$\min \frac{1}{2}\lVert w \rVert^2 \quad \text{s.t.}\quad y_i (w \cdot x_i + b) \ge 1,\ \forall i$$

• The Lagrangian of this optimization problem is

$$L = \frac{1}{2}\lVert w \rVert^2 - \sum_i \alpha_i \bigl(y_i (w \cdot x_i + b) - 1\bigr), \qquad \alpha_i \ge 0$$

Lagrangian of SVM optimization problem
Lagrangian of SVM optimization problem
[Slides 25-29: step-by-step derivation of the Lagrangian dual; the intermediate equations are not recoverable from this transcript]

Substituting the results obtained by partial differentiation, $w = \sum_i \alpha_i y_i X_i$ and $\sum_i \alpha_i y_i = 0$, back into the Lagrangian and simplifying gives

$$Q(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j X_i^{T} X_j$$
Remember the Dual Problem!!
• Two functions based on the Lagrangian function:
  – the primal: find the x value that minimizes L(x, λ)
  – the dual: find the λ corresponding to the maximum of $\hat{L}(\lambda) = \min_x L(x, \lambda)$
The Dual Problem
• By setting the derivative of the Lagrangian to zero, the optimization problem can be written in terms of αi (the dual problem):

$$\max\ Q(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j X_i^{T} X_j
\qquad \text{subject to } \alpha_i \ge 0,\ \sum_{i=1}^{n} \alpha_i y_i = 0$$

• This is a quadratic programming (QP) problem
  – A global maximum of αi can always be found
• w can be recovered by $w = \sum_{i=1}^{n} \alpha_i y_i X_i$
The Dual Problem
• By setting the derivative of the Lagrangian to zero, the optimization problem can be written in terms of αi (the dual problem shown on the previous slide)
• This is a quadratic programming (QP) problem
  – A global maximum of αi can always be found
• w can be recovered by $w = \sum_{i=1}^{n} \alpha_i y_i X_i$
• Note: when the number of training examples is very large, SVM training can become very slow, because in the dual problem the number of parameters αi grows with the number of training examples.
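A minimal sketch (not from the slides), assuming scikit-learn: after fitting a linear SVC, the dual variables can be inspected. dual_coef_ stores $y_i \alpha_i$ for the support vectors, and w can be recovered as $\sum_i \alpha_i y_i x_i$.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -1.5]])  # toy data
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

alpha_times_y = clf.dual_coef_[0]        # y_i * alpha_i, support vectors only
sv = clf.support_vectors_

w_recovered = alpha_times_y @ sv         # w = sum_i (alpha_i y_i) x_i
print("w from dual variables:", w_recovered)
print("w reported by sklearn :", clf.coef_[0])   # should match
```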
A Geometrical Interpretation
[Figure: Class 1 and Class 2 points with their α values labelled, e.g. α1 = 0.8, α6 = 1.4, α8 = 0.6, and α = 0 for the remaining points]
Characteristics of the Solution
• The KKT conditions indicate that many of the αi are zero
  – w is a linear combination of a small number of data points
• xi with non-zero αi are called support vectors (SV)
  – The decision boundary is determined only by the SVs
  – Let tj (j = 1, ..., s) be the indices of the s support vectors. We can write $w = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} x_{t_j}$
• For testing with a new data point z
  – Compute $f(z) = w \cdot z + b = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} (x_{t_j} \cdot z) + b$ and classify z as:
    class 1 if the sum is positive,
    class 2 otherwise.
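A minimal sketch (not from the slides) of this test-time rule, with made-up toy values for the support vectors, coefficients, and bias:

```python
import numpy as np

support_vectors = np.array([[1.0, 1.0], [-1.0, -1.0]])  # x_{t_j} (hypothetical)
alpha_y = np.array([0.5, -0.5])                          # alpha_{t_j} * y_{t_j} (hypothetical)
b = 0.0                                                  # bias (hypothetical)

def decision(z):
    # sum over support vectors of (alpha_j y_j) * <x_j, z>, plus b
    return float(alpha_y @ (support_vectors @ z) + b)

z = np.array([0.5, 2.0])
f = decision(z)
print(f, "class 1" if f > 0 else "class 2")
```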
Support Vector Machines
Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient way): maximize the margin
2. Extend the above definition to non-linearly separable problems: add a penalty term for misclassifications
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that data is mapped implicitly to this space
Non-Linearly Separable Data
[Figure: Var1 vs Var2 with the hyperplanes w·x + b = ±1 and w·x + b = 0; some points fall inside the margin, each with a slack ξi]
• What if the training set is not linearly separable?
• Allow some instances to fall within the margin, but penalize them
• Introduce slack variables ξi
Formulating the Optimization Problem
[Figure: Var1 vs Var2 with the margin hyperplanes and slack variables ξi]

Constraint becomes:

$$y_i (w \cdot x_i + b) \ge 1 - \xi_i,\ \forall x_i, \qquad \xi_i \ge 0$$

Objective function penalizes misclassified instances and those within the margin:

$$\min \frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i$$

C trades off margin width and misclassifications; chosen by the user; a large C gives a higher penalty to errors.
Soft Margin Hyperplane
• By minimizing $\sum_i \xi_i$, the ξi can be obtained as $\xi_i = \max\bigl(0,\ 1 - y_i(w \cdot x_i + b)\bigr)$
• ξi are "slack variables" in the optimization; ξi = 0 if there is no error for xi, and $\sum_i \xi_i$ is an upper bound on the number of errors
• The optimization problem becomes

$$\min \frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i \quad \text{s.t.}\quad y_i (w \cdot x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0$$
Soft Margin Classification
SVMs (Support Vector Machines)
• Slack variables ξi can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.
• Need to minimize:

$$L(w) = \frac{\lVert W \rVert^{2}}{2} + C \sum_{i=1}^{N} \xi_i^{k}$$

• Subject to:

$$f(X_i) = \begin{cases} \;\;\,1 & \text{if } W \cdot X_i + b \ge 1 - \xi_i \\ -1 & \text{if } W \cdot X_i + b \le -1 + \xi_i \end{cases}$$
Linear, Soft-Margin SVMs

$$\min \frac{1}{2}\lVert w \rVert^{2} + C \sum_i \xi_i \quad \text{s.t.}\quad y_i (w \cdot x_i + b) \ge 1 - \xi_i,\ \forall x_i, \quad \xi_i \ge 0$$

The algorithm tries to keep ξi at zero while maximizing the margin
Notice: the algorithm does not minimize the number of misclassifications (an NP-complete problem) but the sum of distances from the margin hyperplanes
Other formulations use ξi² instead
As C → ∞, we get closer to the hard-margin solution
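A minimal sketch (not from the slides), assuming scikit-learn, of how C trades off margin width and slack on a tiny made-up dataset containing one noisy point:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [3, 1], [-1, -1], [-2, -2], [0.5, 0.5]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])   # last point is "noisy": it sits near the +1 cluster

for C in (0.1, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:7.1f}  margin width={2/np.linalg.norm(w):.3f}  "
          f"#support vectors={len(clf.support_vectors_)}")
# Small C: wider margin, more slack tolerated; large C: approaches the hard-margin solution.
```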
Robustness of Soft vs Hard Margin SVMs
[Figure: Var1 vs Var2 with the hyperplane w·x + b = 0, shown for a Soft Margin SVM (outlier absorbed by slack ξi) and a Hard Margin SVM]
As C → ∞ the soft-margin solution approaches the hard-margin one; as C → 0 more slack is tolerated.
Soft vs Hard Margin SVM
Soft-margin always has a solution
Soft-margin is more robust to outliers: smoother surfaces (in the non-linear case)
Hard-margin does not require guessing the cost parameter (requires no parameters at all)
Linear SVMs: Overview
SVMs (Support Vector Machines)
• The classifier is a separating hyperplane.
• The most "important" training points are the support vectors; they define the hyperplane.
• Quadratic optimization algorithms can identify which training points xi are support vectors with non-zero Lagrangian multipliers αi.
• Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:
• Find α1…αN such that

$$\max\ Q(\alpha) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j y_i y_j X_i^{T} X_j
\qquad \text{subject to } 0 \le \alpha_i \le C,\ \sum_{i=1}^{n}\alpha_i y_i = 0$$

$$f(x) = \sum_i \alpha_i y_i x_i^{T} x + b$$
Support Vector Machines
Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient way): maximize the margin
2. Extend the above definition to non-linearly separable problems: add a penalty term for misclassifications
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that data is mapped implicitly to this space
Extension to Non-linear Decision Boundary
• So far, we have only considered large-margin classifiers with a linear decision boundary; how do we generalize to the nonlinear case?
• Key idea: transform xi to a higher-dimensional space to "make life easier"
  – Input space: the space where the points xi are located
  – Feature space: the space of φ(xi) after transformation
• Why transform?
  – A linear operation in the feature space is equivalent to a non-linear operation in the input space
  – Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x1x2 makes the problem linearly separable
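A minimal sketch (not from the slides), assuming scikit-learn and NumPy, of the XOR remark above: after appending the feature x1*x2, a linear SVM separates the four XOR points.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([1, -1, -1, 1])                                     # XOR labels

X_mapped = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])    # add the x1*x2 feature

clf = SVC(kernel="linear", C=1e6).fit(X_mapped, y)
print(clf.predict(X_mapped))      # should reproduce the XOR labels [ 1 -1 -1  1]
print(clf.coef_, clf.intercept_)  # the separating plane relies on the new feature
```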
Non-linear SVMs
SVMs (Support Vector Machines)
Datasets that are linearly separable with some noise work out great:
But what are we going to do if the dataset is just too hard?
How about... mapping the data to a higher-dimensional space:
[Figure: 1-D data on the x axis, not separable on x alone, becomes separable after mapping to (x, x²)]
Disadvantages of Linear Decision Surfaces
SVMs (Support Vector Machines)
[Figure: Var1 vs Var2 data that no linear surface can separate]

Advantages of Non-Linear Surfaces
SVMs (Support Vector Machines)
[Figure: the same data separated by a non-linear surface]

Linear Classifiers in High-Dimensional Spaces
SVMs (Support Vector Machines)
[Figure: Var1 vs Var2 mapped to Constructed Feature 1 vs Constructed Feature 2, where a linear classifier separates the data]
Find a function Φ(x) to map to a different space
Transforming the Data
• Computation in the feature space can be costly because it is high-dimensional
  – The feature space is typically infinite-dimensional!
• The kernel trick comes to the rescue
[Figure: input space mapped by φ(·) into the feature space]
Mapping Data to a High-Dimensional Space
SVMs (Support Vector Machines)
• Find a function Φ(x) to map to a different space; the SVM formulation then becomes:

$$\min \frac{1}{2}\lVert w \rVert^{2} + C\sum_i \xi_i \quad \text{s.t.}\quad y_i \bigl(w \cdot \Phi(x_i) + b\bigr) \ge 1 - \xi_i,\ \forall x_i,\quad \xi_i \ge 0$$

• Data appear as Φ(x); the weights w are now weights in the new space
• Explicit mapping is expensive if Φ(x) is very high-dimensional
• Solving the problem without explicitly mapping the data is desirable!!
The Kernel Trick
Recall the SVM optimization problem
• The data points only appear as inner products
• As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly
• Many common geometric operations (angles, distances) can be expressed by inner products
• Define the kernel function K by $K(x_i, x_j) = \varphi(x_i)^{T}\varphi(x_j)$
An Example for φ(.) and K(.,.)
• Suppose φ(.) is given as follows
• An inner product in the feature space is
• So, if we define the kernel function as follows, there is no need to carry out φ(.) explicitly
• This use of a kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick
(The concrete example is worked out on the next slides.)
Kernel Example

The linear classifier relies on the dot product between vectors: $K(x_i, x_j) = x_i^{T} x_j$
If every data point is mapped into a high-dimensional space via some transformation $\Phi: x \to \varphi(x)$, the dot product becomes:

$$K(x_i, x_j) = \varphi(x_i)^{T} \varphi(x_j)$$

A kernel function is a function that corresponds to an inner product in some expanded feature space.

Example: 2-dimensional vectors $x = [x_1\ x_2]$; let $K(x_i, x_j) = (1 + x_i^{T} x_j)^2$.
Need to show that $K(x_i, x_j) = \varphi(x_i)^{T} \varphi(x_j)$:

$$\begin{aligned}
K(x_i, x_j) &= (1 + x_i^{T} x_j)^2 = \bigl(1 + x_{i1}x_{j1} + x_{i2}x_{j2}\bigr)^2 \\
&= 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2} \\
&= [1\ \ x_{i1}^2\ \ \sqrt{2}\, x_{i1} x_{i2}\ \ x_{i2}^2\ \ \sqrt{2}\, x_{i1}\ \ \sqrt{2}\, x_{i2}]^{T}\,
   [1\ \ x_{j1}^2\ \ \sqrt{2}\, x_{j1} x_{j2}\ \ x_{j2}^2\ \ \sqrt{2}\, x_{j1}\ \ \sqrt{2}\, x_{j2}] \\
&= \varphi(x_i)^{T} \varphi(x_j), \quad \text{where } \varphi(x) = [1\ \ x_1^2\ \ \sqrt{2}\, x_1 x_2\ \ x_2^2\ \ \sqrt{2}\, x_1\ \ \sqrt{2}\, x_2]
\end{aligned}$$
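A minimal sketch (not from the slides) that numerically checks the identity above, assuming NumPy:

```python
import numpy as np

def phi(x):
    # the explicit feature map from the derivation above
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

def K(xi, xj):
    # the kernel computed directly in the input space
    return (1 + xi @ xj) ** 2

xi, xj = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(K(xi, xj), phi(xi) @ phi(xj))   # the two numbers agree
```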
Kernel Example

Let $K(x, \tilde{x}) = (x \cdot \tilde{x})^2$. Then

$$K(x, \tilde{x}) = (x_1\tilde{x}_1 + x_2\tilde{x}_2)^2
= x_1^2\tilde{x}_1^2 + 2x_1\tilde{x}_1 x_2\tilde{x}_2 + x_2^2\tilde{x}_2^2
= \varphi(x)\cdot\varphi(\tilde{x})$$

where $\varphi(x) = \bigl(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\bigr)$ ... (we can do XOR!)
What Functions are Kernels?
SVMs (Support Vector Machines)

For some functions K(xi, xj), checking that $K(x_i, x_j) = \varphi(x_i)^{T}\varphi(x_j)$ can be cumbersome.

Mercer's theorem:
Every positive semi-definite symmetric function is a kernel.
Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:

$$K = \begin{bmatrix}
K(x_1,x_1) & K(x_1,x_2) & K(x_1,x_3) & \cdots & K(x_1,x_N) \\
K(x_2,x_1) & K(x_2,x_2) & K(x_2,x_3) & \cdots & K(x_2,x_N) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
K(x_N,x_1) & K(x_N,x_2) & K(x_N,x_3) & \cdots & K(x_N,x_N)
\end{bmatrix}$$
Kernel Functions
• In practical use of SVM, only the kernel function is specified (and not φ(.))
• A kernel function can be thought of as a similarity measure between the input objects
• Not every similarity measure can be used as a kernel function, however
  – Mercer's condition states that any positive semi-definite kernel K(x, y), i.e. one whose Gram matrix is positive semi-definite for every finite set of points, can be expressed as a dot product in a high-dimensional space.
Examples of Kernel Functions
SVMs (Support Vector Machines)

Linear: $K(x_i, x_j) = x_i^{T} x_j$
Polynomial of power p: $K(x_i, x_j) = (1 + x_i^{T} x_j)^{p}$
Gaussian (radial-basis function): $K(x_i, x_j) = \exp\!\left(-\dfrac{\lVert x_i - x_j \rVert^{2}}{2\sigma^{2}}\right)$
  Closely related to radial basis function neural networks
Sigmoid: $K(x_i, x_j) = \tanh(\beta_0 x_i^{T} x_j + \beta_1)$
  It does not satisfy the Mercer condition for all β0 and β1
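A minimal sketch (not from the slides) of the kernels listed above as plain NumPy functions; the values of p, sigma, beta0, and beta1 are illustrative choices.

```python
import numpy as np

def linear(xi, xj):
    return xi @ xj

def polynomial(xi, xj, p=3):
    return (1 + xi @ xj) ** p

def gaussian(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def sigmoid(xi, xj, beta0=0.5, beta1=-1.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear(xi, xj), polynomial(xi, xj), gaussian(xi, xj), sigmoid(xi, xj))
```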
Modification Due to Kernel Function
• Change all inner products to kernel functions
• For training,

Original: $\ \max\ \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j \,(x_i^{T} x_j)$
With kernel function: $\ \max\ \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j \,K(x_i, x_j)$

Modification Due to Kernel Function
• For testing, the new data z is classified as class 1 if f ≥ 0, class 2 if f < 0

Original: $\ f = \sum_{j} \alpha_{t_j} y_{t_j} (x_{t_j}^{T} z) + b$
With kernel function: $\ f = \sum_{j} \alpha_{t_j} y_{t_j} K(x_{t_j}, z) + b$
Example
Suppose we have 5 one-dimensional data points
– x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2, so y1=1, y2=1, y3=-1, y4=-1, y5=1
We use the polynomial kernel of degree 2
– K(x, y) = (xy + 1)²
– C is set to 100
We first find αi (i = 1, …, 5) by solving the dual QP: $\max \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j K(x_i, x_j)$ s.t. $0 \le \alpha_i \le 100$, $\sum_i \alpha_i y_i = 0$
Example
By using a QP solver, we get α1=0, α2=2.5, α3=0, α4=7.333, α5=4.833
– Verify (at home) that the constraints are indeed satisfied
– The support vectors are {x2=2, x4=5, x5=6}
The discriminant function is
$$f(z) = 2.5\,(1)K(2,z) + 7.333\,(-1)K(5,z) + 4.833\,(1)K(6,z) + b = 0.6667\,z^{2} - 5.333\,z + b$$
b is recovered by solving f(x2=2)=1, or f(x4=5)=-1, or f(x5=6)=1,
as x2, x5 lie on $W \cdot \varphi(z) + b = 1$ and x4 lies on $W \cdot \varphi(z) + b = -1$;
all give b = 9
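A minimal sketch (not from the slides) that reproduces this worked example with scikit-learn; the kernel (xy+1)² corresponds to kernel="poly" with degree=2, gamma=1, coef0=1, and the fitted values should match the slide up to ordering and rounding.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.], [2.], [4.], [5.], [6.]])
y = np.array([1, 1, -1, -1, 1])

clf = SVC(kernel="poly", degree=2, gamma=1, coef0=1, C=100).fit(X, y)

print("support vectors:", clf.support_vectors_.ravel())  # expected {2, 5, 6}, order may differ
print("alpha_i * y_i  :", clf.dual_coef_[0])             # expected ~{2.5, -7.333, 4.833} up to ordering
print("b              :", clf.intercept_[0])             # expected to be about 9
```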
Homework
Solve the classical XOR problem, i.e. find the non-linear discriminant function!!
– Dataset: Class 1: x1 = (-1,-1), x4 = (+1,+1); Class 2: x2 = (-1,+1), x3 = (+1,-1)
– Kernel function: polynomial of order 2: $K(x, x') = (x^{T} x' + 1)^{2}$
– To achieve linear separability, we will use C = ∞
Choosing the Kernel Function
Probably the trickiest part of using SVM.
The kernel function is important because it creates the kernel matrix, which summarizes all the data
Many principles have been proposed (diffusion kernel, Fisher kernel, string kernel, …)
In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try for most applications.
SVM with an RBF kernel is closely related to RBF neural networks, with the centers of the radial basis functions chosen automatically by the SVM
Strengths of SVM
• Strengths
  – Training is relatively easy
    • No local optima, unlike in neural networks
  – It scales relatively well to high-dimensional data
  – The tradeoff between classifier complexity and error can be controlled explicitly
  – Non-traditional data like strings and trees can be used as input to SVM, instead of feature vectors
  – Performing logistic regression (sigmoid fitting) on the SVM outputs of a set of data can map SVM outputs to probabilities
Weaknesses of SVM
• Need to choose a "good" kernel function.
• It is sensitive to noise
  – A relatively small number of mislabeled examples can dramatically decrease the performance
• It only considers two classes
  – How to do multi-class classification with SVM?
  – Answer: 1) With output arity m, learn m SVMs (see the sketch after this list)
    – SVM 1 learns "Output == 1" vs "Output != 1"
    – SVM 2 learns "Output == 2" vs "Output != 2"
    – ...
    – SVM m learns "Output == m" vs "Output != m"
  2) To predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
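A minimal sketch (not from the slides) of this one-vs-rest scheme, assuming scikit-learn; the dataset is made up, and sklearn also provides this behaviour directly.

```python
import numpy as np
from sklearn.svm import SVC

def ovr_fit(X, y):
    classes = np.unique(y)
    # one binary SVM per class: "Output == c" vs "Output != c"
    return classes, [SVC(kernel="linear").fit(X, (y == c).astype(int)) for c in classes]

def ovr_predict(classes, models, X):
    # pick the class whose SVM pushes the point furthest into the positive region
    scores = np.column_stack([m.decision_function(X) for m in models])
    return classes[np.argmax(scores, axis=1)]

X = np.array([[0, 0], [0, 1], [3, 3], [3, 4], [6, 0], [6, 1]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])
classes, models = ovr_fit(X, y)
print(ovr_predict(classes, models, X))   # expected [0 0 1 1 2 2]
```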
Summary: Steps for Classification
1. Prepare the pattern matrix {(xi, yi)}
2. Select a kernel function
3. Select the error parameter C: you can use the values suggested by the SVM software, or set apart a validation set to determine the value of the parameter
4. Execute the training algorithm (to find all αi)
5. New data can be classified using the αi and the support vectors
(A minimal end-to-end sketch of these steps follows.)
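A minimal sketch (not from the slides) of the five steps above, assuming scikit-learn; the generated dataset and the candidate C values are illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# 1. Prepare the pattern matrix {(xi, yi)}  (made-up 2-D data, two classes)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# 2. Select a kernel function (RBF here)   3. Select C on a validation set
best_C, best_acc = None, -1.0
for C in (0.1, 1, 10, 100):
    acc = SVC(kernel="rbf", C=C).fit(X_tr, y_tr).score(X_val, y_val)
    if acc > best_acc:
        best_C, best_acc = C, acc

# 4. Execute the training algorithm (finds the alpha_i internally)
clf = SVC(kernel="rbf", C=best_C).fit(X_tr, y_tr)

# 5. New data is classified using the alpha_i and the support vectors
print(best_C, clf.predict([[1.5, 1.5]]))
```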
The Dual of the SVM Formulation

Original SVM formulation:

$$\min_{w,b}\ \frac{1}{2}\lVert w \rVert^{2} + C\sum_i \xi_i \quad \text{s.t.}\quad y_i\bigl(w \cdot \Phi(x_i) + b\bigr) \ge 1 - \xi_i,\ \forall x_i,\quad \xi_i \ge 0$$

– n inequality constraints
– n positivity constraints
– n variables

The (Wolfe) dual of this problem:

$$\min_{\alpha}\ \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j \bigl(\Phi(x_i)\cdot\Phi(x_j)\bigr) - \sum_i \alpha_i \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \forall x_i,\quad \sum_i \alpha_i y_i = 0$$

– one equality constraint
– n positivity constraints
– n variables (Lagrange multipliers)
– Objective function more complicated
NOTICE: data only appear as Φ(xi)·Φ(xj)
Nonlinear SVM – Overview
SVMs (Support Vector Machines)
SVM locates a separating hyperplane in the feature space and classifies points in that space
It does not need to represent the space explicitly; it simply defines a kernel function
The kernel function plays the role of the dot product in the feature space.
SVM Applications
• SVM has been used successfully in many real-world problems
  - text (and hypertext) categorization
  - image classification
  - ranking (e.g., Google searches)
  - bioinformatics (protein classification, cancer classification)
  - hand-written character recognition
Handwritten digit recognition
[Figure slide]
Comparison with Neural Networks

Neural Networks
- Hidden layers map to lower-dimensional spaces
- Search space has multiple local minima
- Training is expensive
- Classification extremely efficient
- Requires choosing the number of hidden units and layers
- Very good accuracy in typical domains

SVMs
- Kernel maps to a very-high-dimensional space
- Search space has a unique minimum
- Training is extremely efficient
- Classification extremely efficient
- Kernel and cost are the two parameters to select
- Very good accuracy in typical domains
- Extremely robust
Conclusions
SVMs express learning as a mathematical program, taking advantage of the rich theory of optimization
SVMs use the kernel trick to map indirectly to extremely high-dimensional spaces
SVMs are extremely successful, robust, efficient, and versatile, and there are good theoretical indications as to why they generalize well
Suggested Further Reading
- http://www.kernel-machines.org/tutorial.html
- C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
- P.-H. Chen, C.-J. Lin, and B. Schölkopf. A tutorial on nu-support vector machines. 2003.
- N. Cristianini. ICML'01 tutorial, 2001.
- K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181-201, May 2001.
- B. Schölkopf. SVM and kernel methods, 2001. Tutorial given at the NIPS Conference.
- Hastie, Tibshirani, Friedman. The Elements of Statistical Learning. Springer, 2001.
References
- Burges, C. "A Tutorial on Support Vector Machines for Pattern Recognition." Bell Labs, 1998.
- Law, Martin. "A Simple Introduction to Support Vector Machines." Michigan State University, 2006.
- Prabhakar, K. "An Introduction to Support Vector Machines."