Linear Classifiers / SVM
Soongsil University, Dept. of Industrial and Information Systems Engineering
Intelligence Systems Lab.
Linear Classifiers
[Figure slides: a sample, its features, and a training set of labeled examples]
How to Classify Them Using a Computer?
The linear classifier computes a weighted sum of the features:

$$W^{T}X = \sum_{i=1}^{n} w_i x_i \quad\text{or}\quad W X^{T} = \sum_{i=1}^{n} w_i x_i$$
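A minimal sketch of this computation (not from the slides): the weight vector, bias, and sample below are made-up illustrative values, and the class is obtained by thresholding the weighted sum at zero.

```python
import numpy as np

w = np.array([0.4, -1.2, 0.7])   # learned weights (hypothetical values)
b = 0.1                          # learned bias (hypothetical value)
x = np.array([1.0, 0.5, 2.0])    # feature vector of one sample

score = np.dot(w, x) + b         # the weighted sum W^T X + b
label = 1 if score >= 0 else -1  # threshold at zero to get a class label
print(score, label)
```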
Linear Classification
Linear Classifiers

Optimal Hyperplane
SVMs (Support Vector Machines)
[Figure: a candidate decision boundary with misclassified points]
Which Separating Hyperplane to Use?
[Figure: Var1 vs Var2 scatter; points denote classes +1 and -1, with several candidate separating hyperplanes]
Any of these would be fine...
...but which is best?

Optimal Hyperplane
SVMs (Support Vector Machines)
Support Vector Machines
Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient way): maximize the margin
2. Extend the above definition to non-linearly separable problems: add a penalty term for misclassifications
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that data is mapped implicitly to this space
Optimal Hyperplane
SVMs (Support Vector Machines)
Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a data point.

Canonical Hyperplane
SVMs (Support Vector Machines)
The maximum-margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (called a linear SVM, or LSVM).
Normal Vector
SVMs (Support Vector Machines)
[Figure: the weight vector W is normal to the separating hyperplane]

Maximizing the Margin
[Figure: Var1 vs Var2 with the margin width marked]
IDEA 1: Select the separating hyperplane that maximizes the margin!
Support Vectors
[Figure: Var1 vs Var2; the data points lying on the margin boundaries are the support vectors]
Margin Width
[Figure: decision boundary B1 with margin hyperplanes b11, b12 and two points x1, x2 on opposite margin boundaries]

For a point $X_1$ on the plus-side boundary and a point $X_2$ on the minus-side boundary:

$$W \cdot X_1 + b = 1, \qquad W \cdot X_2 + b = -1 \;\Rightarrow\; W \cdot (X_1 - X_2) = 2$$

The inner product of the vector $W$ with $X_1 - X_2$ has the following geometric meaning:

$$W \cdot (X_1 - X_2) = \lVert W \rVert \, \lVert X_1 - X_2 \rVert \cos\theta$$

Since $d = \lVert X_1 - X_2 \rVert \cos\theta$ (the distance between the two margin boundaries measured along $W$),

$$d\,\lVert W \rVert = 2 \;\Rightarrow\; \text{Margin } d = \frac{2}{\lVert W \rVert}, \qquad W \cdot X + b = 0 \text{ being the separating hyperplane.}$$
Setting Up the Optimization Problem
[Figure: Var1 vs Var2 with the hyperplanes w·x + b = 1, w·x + b = 0, w·x + b = -1]

There is a scale and unit for the data such that k = 1. Then the problem becomes:

$$\max \frac{2}{\lVert w \rVert} \quad \text{s.t.} \quad (w \cdot x + b) \ge 1,\ \forall x \text{ of class 1}; \qquad (w \cdot x + b) \le -1,\ \forall x \text{ of class 2}$$
Setting Up the Optimization Problem
SVMs (Support Vector Machines)
[Figure: Var1 vs Var2 with the hyperplanes w·x + b = k, w·x + b = 0, w·x + b = -k]

The width of the margin is:

$$\frac{2k}{\lVert w \rVert}$$

So, the problem is:

$$\max \frac{2k}{\lVert w \rVert} \quad \text{s.t.} \quad (w \cdot x + b) \ge k,\ \forall x \text{ of class 1}; \qquad (w \cdot x + b) \le -k,\ \forall x \text{ of class 2}$$
Setting Up the Optimization Problem
SVMs (Support Vector Machines)
• If class 1 corresponds to 1 and class 2 corresponds to -1, we can rewrite

$$(w \cdot x_i + b) \ge 1 \text{ for } x_i \text{ with } y_i = 1, \qquad (w \cdot x_i + b) \le -1 \text{ for } x_i \text{ with } y_i = -1$$

• as

$$y_i (w \cdot x_i + b) \ge 1,\ \forall x_i$$

• So the problem becomes:

$$\max \frac{2}{\lVert w \rVert} \ \ \text{s.t.}\ \ y_i (w \cdot x_i + b) \ge 1,\ \forall x_i
\qquad\text{or equivalently}\qquad
\min \frac{1}{2}\lVert w \rVert^2 \ \ \text{s.t.}\ \ y_i (w \cdot x_i + b) \ge 1,\ \forall x_i$$

$\lVert w \rVert^2 = w^{T} w$ is minimized.
Linear, Hard-Margin SVM Formulation
Find w, b that solve

$$\min \frac{1}{2}\lVert w \rVert^2 \quad \text{s.t.}\quad y_i (w \cdot x_i + b) \ge 1,\ \forall x_i$$

The problem is convex, so there is a unique global minimum value (when feasible)
There is also a unique minimizer, i.e. the weight vector and b value that achieve the minimum
Not solvable if the data is not linearly separable
This is a Quadratic Programming problem: very efficient computationally with modern constrained optimization engines (handles thousands of constraints and training instances).
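A minimal sketch (not from the slides), assuming scikit-learn and NumPy: the hard-margin case is approximated by a linear SVC with a very large C on a tiny made-up separable dataset.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],        # class +1 (hypothetical points)
              [-1.0, -1.0], [-2.0, -1.5], [-3.0, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e10)   # huge C ~ hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)
```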
Finding the Decision Boundary
• Let {x1, ..., xn} be our data set and let yi ∈ {1, -1} be the class label of xi
• The decision boundary should classify all points correctly, i.e. $y_i (w \cdot x_i + b) \ge 1,\ \forall i$
• The decision boundary can be found by solving the following constrained optimization problem:

$$\min \frac{1}{2}\lVert w \rVert^2 \quad \text{s.t.}\quad y_i (w \cdot x_i + b) \ge 1,\ \forall i$$

• The Lagrangian of this optimization problem is

$$L = \frac{1}{2}\lVert w \rVert^2 - \sum_i \alpha_i \bigl(y_i (w \cdot x_i + b) - 1\bigr), \qquad \alpha_i \ge 0$$

Lagrangian of SVM optimization problem
Lagrangian of SVM optimization problem
[Slides 25-29: step-by-step derivation of the Lagrangian dual; the intermediate equations are not recoverable from this transcript]

Substituting the results obtained by partial differentiation, $w = \sum_i \alpha_i y_i X_i$ and $\sum_i \alpha_i y_i = 0$, back into the Lagrangian and simplifying gives

$$Q(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j X_i^{T} X_j$$
Remember the Dual Problem!!
• Two functions based on the Lagrangian function:
  – the primal: find the x value that minimizes L(x, λ)
  – the dual: find the λ corresponding to the maximum of $\hat{L}(\lambda) = \min_x L(x, \lambda)$
The Dual Problem
• By setting the derivative of the Lagrangian to zero, the optimization problem can be written in terms of αi (the dual problem):

$$\max\ Q(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j X_i^{T} X_j
\qquad \text{subject to } \alpha_i \ge 0,\ \sum_{i=1}^{n} \alpha_i y_i = 0$$

• This is a quadratic programming (QP) problem
  – A global maximum of αi can always be found
• w can be recovered by $w = \sum_{i=1}^{n} \alpha_i y_i X_i$
The Dual Problem
• By setting the derivative of the Lagrangian to zero, the optimization problem can be written in terms of αi (the dual problem shown on the previous slide)
• This is a quadratic programming (QP) problem
  – A global maximum of αi can always be found
• w can be recovered by $w = \sum_{i=1}^{n} \alpha_i y_i X_i$
• Note: when the number of training examples is very large, SVM training can become very slow, because in the dual problem the number of parameters αi grows with the number of training examples.
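A minimal sketch (not from the slides), assuming scikit-learn: after fitting a linear SVC, the dual variables can be inspected. dual_coef_ stores $y_i \alpha_i$ for the support vectors, and w can be recovered as $\sum_i \alpha_i y_i x_i$.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -1.5]])  # toy data
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

alpha_times_y = clf.dual_coef_[0]        # y_i * alpha_i, support vectors only
sv = clf.support_vectors_

w_recovered = alpha_times_y @ sv         # w = sum_i (alpha_i y_i) x_i
print("w from dual variables:", w_recovered)
print("w reported by sklearn :", clf.coef_[0])   # should match
```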
A Geometrical Interpretation
[Figure: Class 1 and Class 2 points with their α values labelled, e.g. α1 = 0.8, α6 = 1.4, α8 = 0.6, and α = 0 for the remaining points]
Characteristics of the Solution
• The KKT conditions indicate that many of the αi are zero
  – w is a linear combination of a small number of data points
• xi with non-zero αi are called support vectors (SV)
  – The decision boundary is determined only by the SVs
  – Let tj (j = 1, ..., s) be the indices of the s support vectors. We can write $w = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} x_{t_j}$
• For testing with a new data point z
  – Compute $f(z) = w \cdot z + b = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} (x_{t_j} \cdot z) + b$ and classify z as:
    class 1 if the sum is positive,
    class 2 otherwise.
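A minimal sketch (not from the slides) of this test-time rule, with made-up toy values for the support vectors, coefficients, and bias:

```python
import numpy as np

support_vectors = np.array([[1.0, 1.0], [-1.0, -1.0]])  # x_{t_j} (hypothetical)
alpha_y = np.array([0.5, -0.5])                          # alpha_{t_j} * y_{t_j} (hypothetical)
b = 0.0                                                  # bias (hypothetical)

def decision(z):
    # sum over support vectors of (alpha_j y_j) * <x_j, z>, plus b
    return float(alpha_y @ (support_vectors @ z) + b)

z = np.array([0.5, 2.0])
f = decision(z)
print(f, "class 1" if f > 0 else "class 2")
```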
Support Vector Machines
Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient way): maximize the margin
2. Extend the above definition to non-linearly separable problems: add a penalty term for misclassifications
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that data is mapped implicitly to this space
Non-Linearly Separable Data
[Figure: Var1 vs Var2 with the hyperplanes w·x + b = ±1 and w·x + b = 0; some points fall inside the margin, each with a slack ξi]
• What if the training set is not linearly separable?
• Allow some instances to fall within the margin, but penalize them
• Introduce slack variables ξi
Formulating the Optimization Problem
[Figure: Var1 vs Var2 with the margin hyperplanes and slack variables ξi]

Constraint becomes:

$$y_i (w \cdot x_i + b) \ge 1 - \xi_i,\ \forall x_i, \qquad \xi_i \ge 0$$

Objective function penalizes misclassified instances and those within the margin:

$$\min \frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i$$

C trades off margin width and misclassifications; chosen by the user; a large C gives a higher penalty to errors.
Soft Margin Hyperplane
• By minimizing $\sum_i \xi_i$, the ξi can be obtained as $\xi_i = \max\bigl(0,\ 1 - y_i(w \cdot x_i + b)\bigr)$
• ξi are "slack variables" in the optimization; ξi = 0 if there is no error for xi, and $\sum_i \xi_i$ is an upper bound on the number of errors
• The optimization problem becomes

$$\min \frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i \quad \text{s.t.}\quad y_i (w \cdot x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0$$
Soft Margin Classification
SVMs (Support Vector Machines)
• Slack variables ξi can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.
• Need to minimize:

$$L(w) = \frac{\lVert W \rVert^{2}}{2} + C \sum_{i=1}^{N} \xi_i^{k}$$

• Subject to:

$$f(X_i) = \begin{cases} \;\;\,1 & \text{if } W \cdot X_i + b \ge 1 - \xi_i \\ -1 & \text{if } W \cdot X_i + b \le -1 + \xi_i \end{cases}$$
Linear, Soft-Margin SVMs

$$\min \frac{1}{2}\lVert w \rVert^{2} + C \sum_i \xi_i \quad \text{s.t.}\quad y_i (w \cdot x_i + b) \ge 1 - \xi_i,\ \forall x_i, \quad \xi_i \ge 0$$

The algorithm tries to keep ξi at zero while maximizing the margin
Notice: the algorithm does not minimize the number of misclassifications (an NP-complete problem) but the sum of distances from the margin hyperplanes
Other formulations use ξi² instead
As C → ∞, we get closer to the hard-margin solution
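A minimal sketch (not from the slides), assuming scikit-learn, of how C trades off margin width and slack on a tiny made-up dataset containing one noisy point:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [3, 1], [-1, -1], [-2, -2], [0.5, 0.5]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])   # last point is "noisy": it sits near the +1 cluster

for C in (0.1, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:7.1f}  margin width={2/np.linalg.norm(w):.3f}  "
          f"#support vectors={len(clf.support_vectors_)}")
# Small C: wider margin, more slack tolerated; large C: approaches the hard-margin solution.
```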
Robustness of Soft vs Hard Margin SVMs
[Figure: Var1 vs Var2 with the hyperplane w·x + b = 0, shown for a Soft Margin SVM (outlier absorbed by slack ξi) and a Hard Margin SVM]
As C → ∞ the soft-margin solution approaches the hard-margin one; as C → 0 more slack is tolerated.
Soft vs Hard Margin SVM
Soft-margin always has a solution
Soft-margin is more robust to outliers: smoother surfaces (in the non-linear case)
Hard-margin does not require guessing the cost parameter (requires no parameters at all)
Linear SVMs: Overview
SVMs (Support Vector Machines)
• The classifier is a separating hyperplane.
• The most "important" training points are the support vectors; they define the hyperplane.
• Quadratic optimization algorithms can identify which training points xi are support vectors with non-zero Lagrangian multipliers αi.
• Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:
• Find α1…αN such that

$$\max\ Q(\alpha) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j y_i y_j X_i^{T} X_j
\qquad \text{subject to } 0 \le \alpha_i \le C,\ \sum_{i=1}^{n}\alpha_i y_i = 0$$

$$f(x) = \sum_i \alpha_i y_i x_i^{T} x + b$$
Support Vector Machines
Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient way): maximize the margin
2. Extend the above definition to non-linearly separable problems: add a penalty term for misclassifications
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that data is mapped implicitly to this space
Extension to Non-linear Decision Boundary
• So far, we have only considered large-margin classifiers with a linear decision boundary; how do we generalize to the nonlinear case?
• Key idea: transform xi to a higher-dimensional space to "make life easier"
  – Input space: the space where the points xi are located
  – Feature space: the space of φ(xi) after transformation
• Why transform?
  – A linear operation in the feature space is equivalent to a non-linear operation in the input space
  – Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x1x2 makes the problem linearly separable
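A minimal sketch (not from the slides), assuming scikit-learn and NumPy, of the XOR remark above: after appending the feature x1*x2, a linear SVM separates the four XOR points.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([1, -1, -1, 1])                                     # XOR labels

X_mapped = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])    # add the x1*x2 feature

clf = SVC(kernel="linear", C=1e6).fit(X_mapped, y)
print(clf.predict(X_mapped))      # should reproduce the XOR labels [ 1 -1 -1  1]
print(clf.coef_, clf.intercept_)  # the separating plane relies on the new feature
```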
Non-linear SVMs
SVMs (Support Vector Machines)
Datasets that are linearly separable with some noise work out great:
But what are we going to do if the dataset is just too hard?
How about... mapping the data to a higher-dimensional space:
[Figure: 1-D data on the x axis, not separable on x alone, becomes separable after mapping to (x, x²)]
Disadvantages of Linear Decision Surfaces
SVMs (Support Vector Machines)
[Figure: Var1 vs Var2 data that no linear surface can separate]

Advantages of Non-Linear Surfaces
SVMs (Support Vector Machines)
[Figure: the same data separated by a non-linear surface]

Linear Classifiers in High-Dimensional Spaces
SVMs (Support Vector Machines)
[Figure: Var1 vs Var2 mapped to Constructed Feature 1 vs Constructed Feature 2, where a linear classifier separates the data]
Find a function Φ(x) to map to a different space
Transforming the Data
• Computation in the feature space can be costly because it is high-dimensional
  – The feature space is typically infinite-dimensional!
• The kernel trick comes to the rescue
[Figure: input space mapped by φ(·) into the feature space]
Mapping Data to a High-Dimensional Space
SVMs (Support Vector Machines)
• Find a function Φ(x) to map to a different space; the SVM formulation then becomes:

$$\min \frac{1}{2}\lVert w \rVert^{2} + C\sum_i \xi_i \quad \text{s.t.}\quad y_i \bigl(w \cdot \Phi(x_i) + b\bigr) \ge 1 - \xi_i,\ \forall x_i,\quad \xi_i \ge 0$$

• Data appear as Φ(x); the weights w are now weights in the new space
• Explicit mapping is expensive if Φ(x) is very high-dimensional
• Solving the problem without explicitly mapping the data is desirable!!
The Kernel Trick
Recall the SVM optimization problem
• The data points only appear as inner products
• As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly
• Many common geometric operations (angles, distances) can be expressed by inner products
• Define the kernel function K by $K(x_i, x_j) = \varphi(x_i)^{T}\varphi(x_j)$
An Example for φ(.) and K(.,.)
• Suppose φ(.) is given as follows
• An inner product in the feature space is
• So, if we define the kernel function as follows, there is no need to carry out φ(.) explicitly
• This use of a kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick
(The concrete example is worked out on the next slides.)
Kernel Example

The linear classifier relies on the dot product between vectors: $K(x_i, x_j) = x_i^{T} x_j$
If every data point is mapped into a high-dimensional space via some transformation $\Phi: x \to \varphi(x)$, the dot product becomes:

$$K(x_i, x_j) = \varphi(x_i)^{T} \varphi(x_j)$$

A kernel function is a function that corresponds to an inner product in some expanded feature space.

Example: 2-dimensional vectors $x = [x_1\ x_2]$; let $K(x_i, x_j) = (1 + x_i^{T} x_j)^2$.
Need to show that $K(x_i, x_j) = \varphi(x_i)^{T} \varphi(x_j)$:

$$\begin{aligned}
K(x_i, x_j) &= (1 + x_i^{T} x_j)^2 = \bigl(1 + x_{i1}x_{j1} + x_{i2}x_{j2}\bigr)^2 \\
&= 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2} \\
&= [1\ \ x_{i1}^2\ \ \sqrt{2}\, x_{i1} x_{i2}\ \ x_{i2}^2\ \ \sqrt{2}\, x_{i1}\ \ \sqrt{2}\, x_{i2}]^{T}\,
   [1\ \ x_{j1}^2\ \ \sqrt{2}\, x_{j1} x_{j2}\ \ x_{j2}^2\ \ \sqrt{2}\, x_{j1}\ \ \sqrt{2}\, x_{j2}] \\
&= \varphi(x_i)^{T} \varphi(x_j), \quad \text{where } \varphi(x) = [1\ \ x_1^2\ \ \sqrt{2}\, x_1 x_2\ \ x_2^2\ \ \sqrt{2}\, x_1\ \ \sqrt{2}\, x_2]
\end{aligned}$$
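A minimal sketch (not from the slides) that numerically checks the identity above, assuming NumPy:

```python
import numpy as np

def phi(x):
    # the explicit feature map from the derivation above
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

def K(xi, xj):
    # the kernel computed directly in the input space
    return (1 + xi @ xj) ** 2

xi, xj = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(K(xi, xj), phi(xi) @ phi(xj))   # the two numbers agree
```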
Kernel Example

Let $K(x, \tilde{x}) = (x \cdot \tilde{x})^2$. Then

$$K(x, \tilde{x}) = (x_1\tilde{x}_1 + x_2\tilde{x}_2)^2
= x_1^2\tilde{x}_1^2 + 2x_1\tilde{x}_1 x_2\tilde{x}_2 + x_2^2\tilde{x}_2^2
= \varphi(x)\cdot\varphi(\tilde{x})$$

where $\varphi(x) = \bigl(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\bigr)$ ... (we can do XOR!)
What Functions are Kernels?
SVMs (Support Vector Machines)

For some functions K(xi, xj), checking that $K(x_i, x_j) = \varphi(x_i)^{T}\varphi(x_j)$ can be cumbersome.

Mercer's theorem:
Every positive semi-definite symmetric function is a kernel.
Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:

$$K = \begin{bmatrix}
K(x_1,x_1) & K(x_1,x_2) & K(x_1,x_3) & \cdots & K(x_1,x_N) \\
K(x_2,x_1) & K(x_2,x_2) & K(x_2,x_3) & \cdots & K(x_2,x_N) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
K(x_N,x_1) & K(x_N,x_2) & K(x_N,x_3) & \cdots & K(x_N,x_N)
\end{bmatrix}$$
Kernel Functions
• In practical use of SVM, only the kernel function is specified (and not φ(.))
• A kernel function can be thought of as a similarity measure between the input objects
• Not every similarity measure can be used as a kernel function, however
  – Mercer's condition states that any positive semi-definite kernel K(x, y), i.e. one whose Gram matrix is positive semi-definite for every finite set of points, can be expressed as a dot product in a high-dimensional space.
Examples of Kernel Functions
SVMs (Support Vector Machines)

Linear: $K(x_i, x_j) = x_i^{T} x_j$
Polynomial of power p: $K(x_i, x_j) = (1 + x_i^{T} x_j)^{p}$
Gaussian (radial-basis function): $K(x_i, x_j) = \exp\!\left(-\dfrac{\lVert x_i - x_j \rVert^{2}}{2\sigma^{2}}\right)$
  Closely related to radial basis function neural networks
Sigmoid: $K(x_i, x_j) = \tanh(\beta_0 x_i^{T} x_j + \beta_1)$
  It does not satisfy the Mercer condition for all β0 and β1
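A minimal sketch (not from the slides) of the kernels listed above as plain NumPy functions; the values of p, sigma, beta0, and beta1 are illustrative choices.

```python
import numpy as np

def linear(xi, xj):
    return xi @ xj

def polynomial(xi, xj, p=3):
    return (1 + xi @ xj) ** p

def gaussian(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def sigmoid(xi, xj, beta0=0.5, beta1=-1.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear(xi, xj), polynomial(xi, xj), gaussian(xi, xj), sigmoid(xi, xj))
```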
Modification Due to Kernel Function
• Change all inner products to kernel functions
• For training,

Original: $\ \max\ \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j \,(x_i^{T} x_j)$
With kernel function: $\ \max\ \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j \,K(x_i, x_j)$

Modification Due to Kernel Function
• For testing, the new data z is classified as class 1 if f ≥ 0, class 2 if f < 0

Original: $\ f = \sum_{j} \alpha_{t_j} y_{t_j} (x_{t_j}^{T} z) + b$
With kernel function: $\ f = \sum_{j} \alpha_{t_j} y_{t_j} K(x_{t_j}, z) + b$
Example
Suppose we have 5 one-dimensional data points
– x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2, so y1=1, y2=1, y3=-1, y4=-1, y5=1
We use the polynomial kernel of degree 2
– K(x, y) = (xy + 1)²
– C is set to 100
We first find αi (i = 1, …, 5) by solving the dual QP: $\max \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j K(x_i, x_j)$ s.t. $0 \le \alpha_i \le 100$, $\sum_i \alpha_i y_i = 0$
Example
By using a QP solver, we get α1=0, α2=2.5, α3=0, α4=7.333, α5=4.833
– Verify (at home) that the constraints are indeed satisfied
– The support vectors are {x2=2, x4=5, x5=6}
The discriminant function is
$$f(z) = 2.5\,(1)K(2,z) + 7.333\,(-1)K(5,z) + 4.833\,(1)K(6,z) + b = 0.6667\,z^{2} - 5.333\,z + b$$
b is recovered by solving f(x2=2)=1, or f(x4=5)=-1, or f(x5=6)=1,
as x2, x5 lie on $W \cdot \varphi(z) + b = 1$ and x4 lies on $W \cdot \varphi(z) + b = -1$;
all give b = 9
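A minimal sketch (not from the slides) that reproduces this worked example with scikit-learn; the kernel (xy+1)² corresponds to kernel="poly" with degree=2, gamma=1, coef0=1, and the fitted values should match the slide up to ordering and rounding.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.], [2.], [4.], [5.], [6.]])
y = np.array([1, 1, -1, -1, 1])

clf = SVC(kernel="poly", degree=2, gamma=1, coef0=1, C=100).fit(X, y)

print("support vectors:", clf.support_vectors_.ravel())  # expected {2, 5, 6}, order may differ
print("alpha_i * y_i  :", clf.dual_coef_[0])             # expected ~{2.5, -7.333, 4.833} up to ordering
print("b              :", clf.intercept_[0])             # expected to be about 9
```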
Homework
Solve the classical XOR problem, i.e. find the non-linear discriminant function!!
– Dataset: Class 1: x1 = (-1,-1), x4 = (+1,+1); Class 2: x2 = (-1,+1), x3 = (+1,-1)
– Kernel function: polynomial of order 2: $K(x, x') = (x^{T} x' + 1)^{2}$
– To achieve linear separability, we will use C = ∞
Choosing the Kernel Function
Probably the trickiest part of using SVM.
The kernel function is important because it creates the kernel matrix, which summarizes all the data
Many principles have been proposed (diffusion kernel, Fisher kernel, string kernel, …)
In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try for most applications.
SVM with an RBF kernel is closely related to RBF neural networks, with the centers of the radial basis functions chosen automatically by the SVM
Strengths of SVM
• Strengths
  – Training is relatively easy
    • No local optima, unlike in neural networks
  – It scales relatively well to high-dimensional data
  – The tradeoff between classifier complexity and error can be controlled explicitly
  – Non-traditional data like strings and trees can be used as input to SVM, instead of feature vectors
  – Performing logistic regression (sigmoid fitting) on the SVM outputs of a set of data can map SVM outputs to probabilities
Weaknesses of SVM
• Need to choose a "good" kernel function.
• It is sensitive to noise
  – A relatively small number of mislabeled examples can dramatically decrease the performance
• It only considers two classes
  – How to do multi-class classification with SVM?
  – Answer: 1) With output arity m, learn m SVMs (see the sketch after this list)
    – SVM 1 learns "Output == 1" vs "Output != 1"
    – SVM 2 learns "Output == 2" vs "Output != 2"
    – ...
    – SVM m learns "Output == m" vs "Output != m"
  2) To predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
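A minimal sketch (not from the slides) of this one-vs-rest scheme, assuming scikit-learn; the dataset is made up, and sklearn also provides this behaviour directly.

```python
import numpy as np
from sklearn.svm import SVC

def ovr_fit(X, y):
    classes = np.unique(y)
    # one binary SVM per class: "Output == c" vs "Output != c"
    return classes, [SVC(kernel="linear").fit(X, (y == c).astype(int)) for c in classes]

def ovr_predict(classes, models, X):
    # pick the class whose SVM pushes the point furthest into the positive region
    scores = np.column_stack([m.decision_function(X) for m in models])
    return classes[np.argmax(scores, axis=1)]

X = np.array([[0, 0], [0, 1], [3, 3], [3, 4], [6, 0], [6, 1]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])
classes, models = ovr_fit(X, y)
print(ovr_predict(classes, models, X))   # expected [0 0 1 1 2 2]
```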
Summary: Steps for Classification
1. Prepare the pattern matrix {(xi, yi)}
2. Select a kernel function
3. Select the error parameter C: you can use the values suggested by the SVM software, or set apart a validation set to determine the value of the parameter
4. Execute the training algorithm (to find all αi)
5. New data can be classified using the αi and the support vectors
(A minimal end-to-end sketch of these steps follows.)
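A minimal sketch (not from the slides) of the five steps above, assuming scikit-learn; the generated dataset and the candidate C values are illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# 1. Prepare the pattern matrix {(xi, yi)}  (made-up 2-D data, two classes)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# 2. Select a kernel function (RBF here)   3. Select C on a validation set
best_C, best_acc = None, -1.0
for C in (0.1, 1, 10, 100):
    acc = SVC(kernel="rbf", C=C).fit(X_tr, y_tr).score(X_val, y_val)
    if acc > best_acc:
        best_C, best_acc = C, acc

# 4. Execute the training algorithm (finds the alpha_i internally)
clf = SVC(kernel="rbf", C=best_C).fit(X_tr, y_tr)

# 5. New data is classified using the alpha_i and the support vectors
print(best_C, clf.predict([[1.5, 1.5]]))
```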
The Dual of the SVM Formulation

Original SVM formulation:

$$\min_{w,b}\ \frac{1}{2}\lVert w \rVert^{2} + C\sum_i \xi_i \quad \text{s.t.}\quad y_i\bigl(w \cdot \Phi(x_i) + b\bigr) \ge 1 - \xi_i,\ \forall x_i,\quad \xi_i \ge 0$$

– n inequality constraints
– n positivity constraints
– n variables

The (Wolfe) dual of this problem:

$$\min_{\alpha}\ \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j \bigl(\Phi(x_i)\cdot\Phi(x_j)\bigr) - \sum_i \alpha_i \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \forall x_i,\quad \sum_i \alpha_i y_i = 0$$

– one equality constraint
– n positivity constraints
– n variables (Lagrange multipliers)
– Objective function more complicated
NOTICE: data only appear as Φ(xi)·Φ(xj)
Nonlinear SVM – Overview
SVMs (Support Vector Machines)
SVM locates a separating hyperplane in the feature space and classifies points in that space
It does not need to represent the space explicitly; it simply defines a kernel function
The kernel function plays the role of the dot product in the feature space.
SVM Applications
• SVM has been used successfully in many real-world problems
  - text (and hypertext) categorization
  - image classification
  - ranking (e.g., Google searches)
  - bioinformatics (protein classification, cancer classification)
  - hand-written character recognition
Handwritten digit recognition
[Figure slide]
Comparison with Neural Networks

Neural Networks
- Hidden layers map to lower-dimensional spaces
- Search space has multiple local minima
- Training is expensive
- Classification extremely efficient
- Requires choosing the number of hidden units and layers
- Very good accuracy in typical domains

SVMs
- Kernel maps to a very-high-dimensional space
- Search space has a unique minimum
- Training is extremely efficient
- Classification extremely efficient
- Kernel and cost are the two parameters to select
- Very good accuracy in typical domains
- Extremely robust
Conclusions
SVMs express learning as a mathematical program, taking advantage of the rich theory of optimization
SVMs use the kernel trick to map indirectly to extremely high-dimensional spaces
SVMs are extremely successful, robust, efficient, and versatile, and there are good theoretical indications as to why they generalize well
Suggested Further Reading
- http://www.kernel-machines.org/tutorial.html
- C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
- P.-H. Chen, C.-J. Lin, and B. Schölkopf. A tutorial on nu-support vector machines. 2003.
- N. Cristianini. ICML'01 tutorial, 2001.
- K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181-201, May 2001.
- B. Schölkopf. SVM and kernel methods, 2001. Tutorial given at the NIPS Conference.
- Hastie, Tibshirani, Friedman. The Elements of Statistical Learning. Springer, 2001.
References
- Burges, C. "A Tutorial on Support Vector Machines for Pattern Recognition." Bell Labs, 1998.
- Law, Martin. "A Simple Introduction to Support Vector Machines." Michigan State University, 2006.
- Prabhakar, K. "An Introduction to Support Vector Machines."