
Page 1: Ch. Eick : Support Vector Machines: The Main Ideas

Ch. Eick: Support Vector Machines: The Main Ideas

Reading Material on Support Vector Machines:

1. Textbook
2. First 3 columns of the Smola/Schölkopf article on SV Regression
3. http://en.wikipedia.org/wiki/Kernel_trick

Page 2: Ch. Eick : Support Vector Machines: The Main Ideas


Likelihood- vs. Discriminant-based Classification

Likelihood-based: Assume a model for p(x|Ci), use Bayes’ rule to calculate P(Ci|x):

gi(x) = log P(Ci|x)

Discriminant-based: Assume a model for gi(x|Φi); no density estimation.

Prototype-based: Make classification decisions based on nearest prototypes without constructing decision boundaries (kNN, kMeans approach).

Estimating the boundaries is enough; there is no need to accurately estimate the densities/probabilities inside the boundaries. We are just interested in learning decision boundaries (lines along which the densities of the two classes are the same), and many popular classification techniques learn decision boundaries without explicitly constructing density functions.
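To make the contrast concrete, here is a minimal, hypothetical sketch (Python/NumPy) of a prototype-based classifier of the kind mentioned above: each class is summarized by a kMeans-style prototype and a new point is assigned to the class of the nearest prototype, so no densities p(x|Ci) are ever estimated.

```python
import numpy as np

def fit_prototypes(X, y):
    """Compute one prototype (the class mean) per class label."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(prototypes, x):
    """Assign x to the class whose prototype is closest (Euclidean distance)."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

# Toy data: two small blobs labeled -1 and +1 (made-up values).
X = np.array([[0.0, 0.0], [0.5, 0.2], [3.0, 3.0], [3.2, 2.8]])
y = np.array([-1, -1, 1, 1])

protos = fit_prototypes(X, y)
print(predict(protos, np.array([2.9, 3.1])))  # -> 1 (nearest prototype wins)
```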


Page 3: Ch. Eick : Support Vector Machines: The Main Ideas

Support Vector Machines

SVMs use a single hyperplane. One possible solution: hyperplane B1.


http://en.wikipedia.org/wiki/Hyperplane

Page 4: Ch. Eick : Support Vector Machines: The Main Ideas

Support Vector Machines

Another possible solution: hyperplane B2.


Page 5: Ch. Eick : Support Vector Machines: The Main Ideas

Support Vector Machines

Other possible solutions exist as well (B2 among them).


Page 6: Ch. Eick : Support Vector Machines: The Main Ideas

Support Vector Machines

Which one is better, B1 or B2? How do you define “better”?


Page 7: Ch. Eick : Support Vector Machines: The Main Ideas

Support Vector Machines

Find a hyperplane maximizing the margin => B1 is better than B2.

(Figure: B1 with its margin hyperplanes b11 and b12, B2 with its margin hyperplanes b21 and b22; the margin of B1 is indicated.)
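As a purely illustrative way to “define better”, the sketch below computes the margin of two made-up separating hyperplanes on toy data; the one with the larger margin plays the role of B1. All numbers are hypothetical.

```python
import numpy as np

# For a separating hyperplane w·x + b = 0, the margin on a dataset is
# 2 * min_i |w·x_i + b| / ||w|| (twice the distance to the closest point).

def margin(w, b, X):
    return 2.0 * np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

X = np.array([[1.0, 1.0], [2.0, 2.5], [4.0, 4.5], [5.0, 5.5]])  # toy points

# Two candidate separating hyperplanes (roughly B1 and B2 in the figure):
print(margin(np.array([1.0, 1.0]), -7.0, X))   # ~2.12 -> wider margin, preferred
print(margin(np.array([1.0, 0.0]), -3.0, X))   # 2.0   -> narrower margin
```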


Page 8: Ch. Eick : Support Vector Machines: The Main Ideas


Key Properties of Support Vector Machines

1. They use a single hyperplane which subdivides the space into two half-spaces, one occupied by Class1 and the other by Class2.

2. They maximize the margin of the decision boundary, using quadratic optimization techniques which find the optimal hyperplane.

3. When used in practice, SVM approaches frequently map the examples (using a function Φ) to a higher-dimensional space and find margin-maximal hyperplanes in the mapped space, obtaining decision boundaries which are not hyperplanes in the original space.

4. Moreover, versions of SVMs exist that can be used when linear separability cannot be accomplished.


Example of such a mapping $\Phi$ and the corresponding kernel function K:

$\Phi(\mathbf{x}) = \Phi(x_1, x_2) = \left(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; 1\right)$

$K(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x})^T \Phi(\mathbf{y}) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 + 2 x_1 y_1 + 2 x_2 y_2 + 1 = \left(\mathbf{x}^T \mathbf{y} + 1\right)^2$
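A quick numerical check of this identity (a minimal Python sketch; the test vectors are arbitrary):

```python
import numpy as np

# The kernel K(x, y) = (x·y + 1)^2 on R^2 equals an ordinary dot product
# after the mapping Phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2, sqrt(2)*x1, sqrt(2)*x2, 1).

def phi(x):
    return np.array([x[0]**2, x[1]**2,
                     np.sqrt(2) * x[0] * x[1],
                     np.sqrt(2) * x[0],
                     np.sqrt(2) * x[1],
                     1.0])

def K(x, y):
    return (np.dot(x, y) + 1.0) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(K(x, y))                   # 4.0
print(np.dot(phi(x), phi(y)))    # 4.0 (same value, computed in the mapped space)
```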

Page 9: Ch. Eick : Support Vector Machines: The Main Ideas

Support Vector Machines

Decision boundary B1: $\mathbf{w} \cdot \mathbf{x} + b = 0$

Margin hyperplanes b11 and b12: $\mathbf{w} \cdot \mathbf{x} + b = 1$ and $\mathbf{w} \cdot \mathbf{x} + b = -1$

$f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \ge 1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \le -1 \end{cases}$

$\text{Margin} = \dfrac{2}{\|\mathbf{w}\|}$

Examples are $(x_1, \dots, x_n, y)$ with $y \in \{-1, 1\}$.


L2 Norm: http://en.wikipedia.org/wiki/L2_norm#Euclidean_norm

Dot-Product: http://en.wikipedia.org/wiki/Dot_product
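A minimal sketch of this decision rule and of the margin formula; the weight vector w and offset b below are illustrative placeholders, not values derived from data.

```python
import numpy as np

# Decision rule from the slide: f(x) = 1 if w·x + b >= 1, f(x) = -1 if w·x + b <= -1.
w = np.array([1.0, -1.0])
b = -0.5

def f(x):
    s = np.dot(w, x) + b
    if s >= 1:
        return 1
    if s <= -1:
        return -1
    return 0  # falls inside the margin band; not covered by the rule above

print(f(np.array([3.0, 0.0])))   # 1
print(f(np.array([0.0, 3.0])))   # -1
print(2.0 / np.linalg.norm(w))   # Margin = 2/||w|| ≈ 1.41
```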

Page 10: Ch. Eick : Support Vector Machines: The Main Ideas

Support Vector Machines

We want to maximize: $\text{Margin} = \dfrac{2}{\|\mathbf{w}\|}$

which is equivalent to minimizing: $L(\mathbf{w}) = \dfrac{\|\mathbf{w}\|^2}{2}$

subject to the following N constraints: $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \quad i = 1, \dots, N$

This is a constrained convex quadratic optimization problem that can be solved in polynomial time.

Numerical approaches to solve it (e.g., quadratic programming) exist.

The function to be optimized has only a single minimum, so there is no local-minimum problem.


Dot-Product: http://en.wikipedia.org/wiki/Dot_product
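As an illustration (not the solver used in practice), the sketch below feeds exactly this constrained problem to a general-purpose optimizer, SciPy's SLSQP, on a tiny linearly separable toy dataset; a dedicated quadratic-programming solver would be the usual choice.

```python
import numpy as np
from scipy.optimize import minimize

# Hard-margin SVM:  minimize ||w||^2 / 2  subject to  y_i (w·x_i + b) >= 1.
X = np.array([[1.0, 1.0], [2.0, 0.5], [4.0, 4.0], [5.0, 3.5]])
y = np.array([-1, -1, 1, 1])

def objective(theta):                 # theta = [w1, w2, b]
    w = theta[:2]
    return 0.5 * np.dot(w, w)

constraints = [
    {"type": "ineq",
     "fun": (lambda theta, i=i: y[i] * (np.dot(theta[:2], X[i]) + theta[2]) - 1)}
    for i in range(len(X))
]

res = minimize(objective, x0=np.array([1.0, 1.0, -5.0]),
               method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w, b, 2.0 / np.linalg.norm(w))  # weights, offset, resulting margin
```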

Page 11: Ch. Eick : Support Vector Machines: The Main Ideas

Support Vector Machines

What if the problem is not linearly separable?


Page 12: Ch. Eick : Support Vector Machines: The Main Ideas

Linear SVM for Non-linearly Separable Problems

What if the problem is not linearly separable? Introduce slack variables $\xi_i$. We need to minimize:

$L(\mathbf{w}) = \dfrac{\|\mathbf{w}\|^2}{2} + C \sum_{i=1}^{N} \xi_i^k$

subject to (i = 1, .., N):

(1) $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i$
(2) $\xi_i \ge 0$

C is a parameter chosen using a validation set, trying to keep the margins wide while keeping the training error low.

In the objective, the first term is the inverse size of the margin between the hyperplanes and the second term measures the prediction error; the slack variables $\xi_i$ allow constraint violation to a certain degree.


Remark: No kernel is used here; this is still a linear SVM.
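A sketch of this recipe using scikit-learn's linear-kernel SVC (assumed to be available), which implements a soft-margin SVM of this form: C is picked on a held-out validation set, trading margin width against training error. The data and the candidate C values below are made up.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 2, rng.randn(50, 2) + 2])   # two overlapping blobs
y = np.array([-1] * 50 + [1] * 50)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_C, best_acc = None, -1.0
for C in [0.01, 0.1, 1, 10, 100]:
    clf = SVC(kernel="linear", C=C).fit(X_train, y_train)   # larger C = less slack allowed
    acc = clf.score(X_val, y_val)
    if acc > best_acc:
        best_C, best_acc = C, acc

print(best_C, best_acc)   # C chosen on the validation set
```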

Page 13: Ch. Eick : Support Vector Machines: The Main Ideas

Nonlinear Support Vector Machines

What if the decision boundary is not linear?

Alternative 1: Use a technique that employs non-linear decision boundaries (a non-linear function).


Page 14: Ch. Eick : Support Vector Machines: The Main Ideas

Nonlinear Support Vector Machines

Alternative 2: Transform into a higher-dimensional attribute space and find linear decision boundaries in this space:

1. Transform the data into a higher-dimensional space.
2. Find the best hyperplane using the methods introduced earlier.
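A sketch of Alternative 2 under stated assumptions: a hand-chosen mapping adds the squared coordinates as extra attributes, after which an ordinary linear SVM (scikit-learn's LinearSVC, assumed available) separates a circular toy problem that is not linearly separable in the original 2-D space.

```python
import numpy as np
from sklearn.svm import LinearSVC

def transform(X):
    # (x1, x2) -> (x1, x2, x1^2, x2^2): a simple hand-chosen higher-dimensional mapping
    return np.hstack([X, X ** 2])

rng = np.random.RandomState(1)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.5, 1, -1)   # inside vs. outside a circle

clf = LinearSVC(C=1.0, max_iter=10000).fit(transform(X), y)
print(clf.score(transform(X), y))   # close to 1.0: linearly separable after the mapping
```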


Page 15: Ch. Eick : Support Vector Machines: The Main Ideas

Nonlinear Support Vector Machines

1. Choose a non-linear function $\Phi$ to transform the data into a different, usually higher-dimensional, attribute space.

2. Minimize

$L(\mathbf{w}) = \dfrac{\|\mathbf{w}\|^2}{2}$

subject to the following N constraints:

$y_i(\mathbf{w} \cdot \Phi(\mathbf{x}_i) + b) \ge 1, \quad i = 1, \dots, N$

That is, find a good hyperplane in the transformed space.


Remark: The Soft Margin SVM can be generalized similarly.

Page 16: Ch. Eick : Support Vector Machines: The Main Ideas

Example: Polynomial Kernel Function

Polynomial kernel function:

$\Phi(x_1, x_2) = \left(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; 1\right)$

$K(\mathbf{u}, \mathbf{v}) = \Phi(\mathbf{u}) \cdot \Phi(\mathbf{v}) = (\mathbf{u} \cdot \mathbf{v} + 1)^2$

A support vector machine with this polynomial kernel function classifies a new example z as follows:

$\operatorname{sign}\left(\sum_i \lambda_i y_i\, \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{z}) + b\right) = \operatorname{sign}\left(\sum_i \lambda_i y_i\, (\mathbf{x}_i \cdot \mathbf{z} + 1)^2 + b\right)$

Remark: the $\lambda_i$ and b are determined using the methods for linear SVMs that were discussed earlier.

Kernel function trick: perform the computations in the original space, although we solve an optimization problem in the transformed space, which is more efficient; more details in Topic 14.
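The sketch below illustrates this classification rule. The coefficients $\lambda_i y_i$ and b are taken from a fitted scikit-learn SVC (whose polynomial kernel with degree=2, gamma=1, coef0=1 is exactly $(\mathbf{u}\cdot\mathbf{v}+1)^2$) rather than derived as in these slides; the point is only that the kernelized rule, evaluated entirely in the original space, reproduces the library's own predictions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(2)
X = rng.uniform(-2, 2, size=(100, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.5, 1, -1)

clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)

def K(u, v):
    return (np.dot(u, v) + 1.0) ** 2            # kernel evaluated in the original space

def predict_manual(z):
    # sign( sum_i (lambda_i * y_i) * K(x_i, z) + b ), summing over the support vectors
    s = sum(coef * K(sv, z)
            for coef, sv in zip(clf.dual_coef_[0], clf.support_vectors_))
    return int(np.sign(s + clf.intercept_[0]))

z = np.array([0.3, -0.4])
print(predict_manual(z), clf.predict([z])[0])   # the two predictions agree
```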

Page 17: Ch. Eick : Support Vector Machines: The Main Ideas

Other Material on SVMs

http://www.youtube.com/watch?v=27RQRUR7Ubc
Support Vector Machines in Rapid Miner
http://stackoverflow.com/questions/1072097/pointers-to-some-good-svm-tutorial
LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ and http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html
Adaboost/SVM relationship lecture: http://videolectures.net/mlss05us_rudin_da/


Page 18: Ch. Eick : Support Vector Machines: The Main Ideas

Summary: Support Vector Machines

Support vector machines learn hyperplanes that separate two classes, maximizing the margin between them (the empty space between the instances of the two classes).

Support vector machines introduce slack variables, in the case that the classes are not linearly separable, trying to maximize margins while keeping the training error low.

The most popular versions of SVMs use non-linear kernel functions and map the attribute space into a higher-dimensional space to facilitate finding “good” linear decision boundaries in the modified space.

Support vector machines find “margin optimal” hyperplanes by solving a convex quadratic optimization problem. However, this optimization process is quite slow, and support vector machines tend to fail if the number of examples goes beyond 500/5000/50000…

In general, support vector machines accomplish quite high accuracies compared to other techniques.

In the last 10 years, support vector machines have been generalized for other tasks such as regression, PCA, outlier detection,…


Page 19: Ch. Eick : Support Vector Machines: The Main Ideas


Kernels: What can they do for you?

Some machine learning/statistical problems only depend on the dot products of the objects in the dataset O = {x1, .., xn} and not on other characteristics of the objects; in other words, those techniques only depend on the Gram matrix of O, which stores x1·x1, x1·x2, …, xn·xn (http://en.wikipedia.org/wiki/Gramian_matrix).

These techniques can be generalized by mapping the dataset into a higher-dimensional space, as long as the non-linear mapping $\Phi$ can be kernelized; that is, a kernel function K can be found such that K(u,v) = $\Phi$(u)·$\Phi$(v).

In this case the results in the mapped space are computed based on K(x1,x1), K(x1,x2), …, K(xn,xn), which is called the kernel trick: http://en.wikipedia.org/wiki/Kernel_trick

Kernels have been successfully used to generalize PCA, k-means, support vector machines, and many other techniques, allowing them to use non-linear coordinate systems, more complex decision boundaries, or more complex cluster boundaries.
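A small sketch of the idea: compute the ordinary Gram matrix of a toy dataset and, alongside it, a kernel matrix (here a Gaussian/RBF kernel, chosen only as an example); any technique that consumes the first can be kernelized by handing it the second instead.

```python
import numpy as np

rng = np.random.RandomState(3)
X = rng.randn(5, 2)                       # a small dataset O = {x_1, ..., x_5}

gram = X @ X.T                            # Gram matrix: G[i, j] = x_i · x_j

def rbf(u, v, gamma=0.5):                 # example kernel K(u, v) = exp(-gamma ||u - v||^2)
    return np.exp(-gamma * np.sum((u - v) ** 2))

# Kernel matrix: K[i, j] = K(x_i, x_j) = Phi(x_i) · Phi(x_j), without ever forming Phi.
kernel_matrix = np.array([[rbf(xi, xj) for xj in X] for xi in X])

print(gram.shape, kernel_matrix.shape)    # both 5 x 5; same role, different geometry
```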

We will revisit kernels later discussing transparencies 13-25, 30-35 of the Vasconcelos lecture.
