
Page 1: Ch. Eick : Support Vector Machines: The Main Ideas

Ch. Eick: Support Vector Machines: The Main Ideas

Reading Material on Support Vector Machines:

1. Textbook
2. First 3 columns of the Smola/Schölkopf article on SV Regression
3. http://en.wikipedia.org/wiki/Kernel_trick

Page 2: Ch. Eick : Support Vector Machines: The Main Ideas


Likelihood- vs. Discriminant-based Classification

Likelihood-based: Assume a model for p(x|Ci), use Bayes’ rule to calculate P(Ci|x):

gi(x) = log P(Ci|x)

Discriminant-based: Assume a model for gi(x|Φi); no density estimation.

Prototype-based: Make classification decisions based on nearest prototypes without constructing decision boundaries (kNN, kMeans approach).

Estimating the boundaries is enough; there is no need to accurately estimate the densities/probabilities inside the boundaries. We are just interested in learning decision boundaries (lines along which the densities of the two classes are the same), and many popular classification techniques learn decision boundaries without explicitly constructing density functions.
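To make the contrast concrete, here is a minimal, hypothetical sketch (Python/NumPy) of a prototype-based classifier of the kind mentioned above: each class is summarized by a kMeans-style prototype and a new point is assigned to the class of the nearest prototype, so no densities p(x|Ci) are ever estimated.

```python
import numpy as np

def fit_prototypes(X, y):
    """Compute one prototype (the class mean) per class label."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(prototypes, x):
    """Assign x to the class whose prototype is closest (Euclidean distance)."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

# Toy data: two small blobs labeled -1 and +1 (made-up values).
X = np.array([[0.0, 0.0], [0.5, 0.2], [3.0, 3.0], [3.2, 2.8]])
y = np.array([-1, -1, 1, 1])

protos = fit_prototypes(X, y)
print(predict(protos, np.array([2.9, 3.1])))  # -> 1 (nearest prototype wins)
```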


Page 3: Ch. Eick : Support Vector Machines: The Main Ideas

Support Vector Machines

SVMs use a single hyperplane. One possible solution: hyperplane B1.


http://en.wikipedia.org/wiki/Hyperplane

Page 4: Ch. Eick : Support Vector Machines: The Main Ideas

Support Vector Machines

Another possible solution: hyperplane B2.


Page 5: Ch. Eick : Support Vector Machines: The Main Ideas

Support Vector Machines

Other possible solutions exist as well (B2 among them).


Page 6: Ch. Eick : Support Vector Machines: The Main Ideas

Support Vector Machines

Which one is better, B1 or B2? How do you define “better”?


Page 7: Ch. Eick : Support Vector Machines: The Main Ideas

Support Vector Machines

Find a hyperplane maximizing the margin => B1 is better than B2.

(Figure: B1 with its margin hyperplanes b11 and b12, B2 with its margin hyperplanes b21 and b22; the margin of B1 is indicated.)
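As a purely illustrative way to “define better”, the sketch below computes the margin of two made-up separating hyperplanes on toy data; the one with the larger margin plays the role of B1. All numbers are hypothetical.

```python
import numpy as np

# For a separating hyperplane w·x + b = 0, the margin on a dataset is
# 2 * min_i |w·x_i + b| / ||w|| (twice the distance to the closest point).

def margin(w, b, X):
    return 2.0 * np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

X = np.array([[1.0, 1.0], [2.0, 2.5], [4.0, 4.5], [5.0, 5.5]])  # toy points

# Two candidate separating hyperplanes (roughly B1 and B2 in the figure):
print(margin(np.array([1.0, 1.0]), -7.0, X))   # ~2.12 -> wider margin, preferred
print(margin(np.array([1.0, 0.0]), -3.0, X))   # 2.0   -> narrower margin
```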


Page 8: Ch. Eick : Support Vector Machines: The Main Ideas


Key Properties of Support Vector Machines

1. They use a single hyperplane which subdivides the space into two half-spaces, one occupied by Class1 and the other by Class2.

2. They maximize the margin of the decision boundary, using quadratic optimization techniques which find the optimal hyperplane.

3. When used in practice, SVM approaches frequently map the examples (using a function Φ) to a higher-dimensional space and find margin-maximal hyperplanes in the mapped space, obtaining decision boundaries which are not hyperplanes in the original space.

4. Moreover, versions of SVMs exist that can be used when linear separability cannot be accomplished.


Example of such a mapping $\Phi$ and the corresponding kernel function K:

$\Phi(\mathbf{x}) = \Phi(x_1, x_2) = \left(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; 1\right)$

$K(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x})^T \Phi(\mathbf{y}) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 + 2 x_1 y_1 + 2 x_2 y_2 + 1 = \left(\mathbf{x}^T \mathbf{y} + 1\right)^2$
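A quick numerical check of this identity (a minimal Python sketch; the test vectors are arbitrary):

```python
import numpy as np

# The kernel K(x, y) = (x·y + 1)^2 on R^2 equals an ordinary dot product
# after the mapping Phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2, sqrt(2)*x1, sqrt(2)*x2, 1).

def phi(x):
    return np.array([x[0]**2, x[1]**2,
                     np.sqrt(2) * x[0] * x[1],
                     np.sqrt(2) * x[0],
                     np.sqrt(2) * x[1],
                     1.0])

def K(x, y):
    return (np.dot(x, y) + 1.0) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(K(x, y))                   # 4.0
print(np.dot(phi(x), phi(y)))    # 4.0 (same value, computed in the mapped space)
```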

Page 9: Ch. Eick : Support Vector Machines: The Main Ideas

Support Vector Machines

Decision boundary B1: $\mathbf{w} \cdot \mathbf{x} + b = 0$

Margin hyperplanes b11 and b12: $\mathbf{w} \cdot \mathbf{x} + b = 1$ and $\mathbf{w} \cdot \mathbf{x} + b = -1$

$f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \ge 1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \le -1 \end{cases}$

$\text{Margin} = \dfrac{2}{\|\mathbf{w}\|}$

Examples are $(x_1, \dots, x_n, y)$ with $y \in \{-1, 1\}$.


L2 Norm: http://en.wikipedia.org/wiki/L2_norm#Euclidean_norm

Dot-Product: http://en.wikipedia.org/wiki/Dot_product
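A minimal sketch of this decision rule and of the margin formula; the weight vector w and offset b below are illustrative placeholders, not values derived from data.

```python
import numpy as np

# Decision rule from the slide: f(x) = 1 if w·x + b >= 1, f(x) = -1 if w·x + b <= -1.
w = np.array([1.0, -1.0])
b = -0.5

def f(x):
    s = np.dot(w, x) + b
    if s >= 1:
        return 1
    if s <= -1:
        return -1
    return 0  # falls inside the margin band; not covered by the rule above

print(f(np.array([3.0, 0.0])))   # 1
print(f(np.array([0.0, 3.0])))   # -1
print(2.0 / np.linalg.norm(w))   # Margin = 2/||w|| ≈ 1.41
```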

Page 10: Ch. Eick : Support Vector Machines: The Main Ideas

Support Vector Machines

We want to maximize: $\text{Margin} = \dfrac{2}{\|\mathbf{w}\|}$

which is equivalent to minimizing: $L(\mathbf{w}) = \dfrac{\|\mathbf{w}\|^2}{2}$

subject to the following N constraints: $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \quad i = 1, \dots, N$

This is a constrained convex quadratic optimization problem that can be solved in polynomial time.

Numerical approaches to solve it (e.g., quadratic programming) exist.

The function to be optimized has only a single minimum, so there is no local-minimum problem.


Dot-Product: http://en.wikipedia.org/wiki/Dot_product
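As an illustration (not the solver used in practice), the sketch below feeds exactly this constrained problem to a general-purpose optimizer, SciPy's SLSQP, on a tiny linearly separable toy dataset; a dedicated quadratic-programming solver would be the usual choice.

```python
import numpy as np
from scipy.optimize import minimize

# Hard-margin SVM:  minimize ||w||^2 / 2  subject to  y_i (w·x_i + b) >= 1.
X = np.array([[1.0, 1.0], [2.0, 0.5], [4.0, 4.0], [5.0, 3.5]])
y = np.array([-1, -1, 1, 1])

def objective(theta):                 # theta = [w1, w2, b]
    w = theta[:2]
    return 0.5 * np.dot(w, w)

constraints = [
    {"type": "ineq",
     "fun": (lambda theta, i=i: y[i] * (np.dot(theta[:2], X[i]) + theta[2]) - 1)}
    for i in range(len(X))
]

res = minimize(objective, x0=np.array([1.0, 1.0, -5.0]),
               method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w, b, 2.0 / np.linalg.norm(w))  # weights, offset, resulting margin
```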

Page 11: Ch. Eick : Support Vector Machines: The Main Ideas

Support Vector Machines

What if the problem is not linearly separable?


Page 12: Ch. Eick : Support Vector Machines: The Main Ideas

Linear SVM for Non-linearly Separable Problems

What if the problem is not linearly separable? Introduce slack variables $\xi_i$. We need to minimize:

$L(\mathbf{w}) = \dfrac{\|\mathbf{w}\|^2}{2} + C \sum_{i=1}^{N} \xi_i^k$

subject to (i = 1, .., N):

(1) $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i$
(2) $\xi_i \ge 0$

C is a parameter chosen using a validation set, trying to keep the margins wide while keeping the training error low.

In the objective, the first term is the inverse size of the margin between the hyperplanes and the second term measures the prediction error; the slack variables $\xi_i$ allow constraint violation to a certain degree.


Remark: No kernel is used here; this is still a linear SVM.
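A sketch of this recipe using scikit-learn's linear-kernel SVC (assumed to be available), which implements a soft-margin SVM of this form: C is picked on a held-out validation set, trading margin width against training error. The data and the candidate C values below are made up.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 2, rng.randn(50, 2) + 2])   # two overlapping blobs
y = np.array([-1] * 50 + [1] * 50)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_C, best_acc = None, -1.0
for C in [0.01, 0.1, 1, 10, 100]:
    clf = SVC(kernel="linear", C=C).fit(X_train, y_train)   # larger C = less slack allowed
    acc = clf.score(X_val, y_val)
    if acc > best_acc:
        best_C, best_acc = C, acc

print(best_C, best_acc)   # C chosen on the validation set
```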

Page 13: Ch. Eick : Support Vector Machines: The Main Ideas

Nonlinear Support Vector Machines

What if the decision boundary is not linear?

Alternative 1: Use a technique that employs non-linear decision boundaries (a non-linear function).


Page 14: Ch. Eick : Support Vector Machines: The Main Ideas

Nonlinear Support Vector Machines

Alternative 2: Transform into a higher-dimensional attribute space and find linear decision boundaries in this space:

1. Transform the data into a higher-dimensional space.
2. Find the best hyperplane using the methods introduced earlier.
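A sketch of Alternative 2 under stated assumptions: a hand-chosen mapping adds the squared coordinates as extra attributes, after which an ordinary linear SVM (scikit-learn's LinearSVC, assumed available) separates a circular toy problem that is not linearly separable in the original 2-D space.

```python
import numpy as np
from sklearn.svm import LinearSVC

def transform(X):
    # (x1, x2) -> (x1, x2, x1^2, x2^2): a simple hand-chosen higher-dimensional mapping
    return np.hstack([X, X ** 2])

rng = np.random.RandomState(1)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.5, 1, -1)   # inside vs. outside a circle

clf = LinearSVC(C=1.0, max_iter=10000).fit(transform(X), y)
print(clf.score(transform(X), y))   # close to 1.0: linearly separable after the mapping
```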


Page 15: Ch. Eick : Support Vector Machines: The Main Ideas

Nonlinear Support Vector Machines

1. Choose a non-linear function $\Phi$ to transform the data into a different, usually higher-dimensional, attribute space.

2. Minimize

$L(\mathbf{w}) = \dfrac{\|\mathbf{w}\|^2}{2}$

subject to the following N constraints:

$y_i(\mathbf{w} \cdot \Phi(\mathbf{x}_i) + b) \ge 1, \quad i = 1, \dots, N$

That is, find a good hyperplane in the transformed space.


Remark: The Soft Margin SVM can be generalized similarly.

Page 16: Ch. Eick : Support Vector Machines: The Main Ideas

Example: Polynomial Kernel Function

Polynomial kernel function:

$\Phi(x_1, x_2) = \left(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; 1\right)$

$K(\mathbf{u}, \mathbf{v}) = \Phi(\mathbf{u}) \cdot \Phi(\mathbf{v}) = (\mathbf{u} \cdot \mathbf{v} + 1)^2$

A support vector machine with this polynomial kernel function classifies a new example z as follows:

$\operatorname{sign}\left(\sum_i \lambda_i y_i\, \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{z}) + b\right) = \operatorname{sign}\left(\sum_i \lambda_i y_i\, (\mathbf{x}_i \cdot \mathbf{z} + 1)^2 + b\right)$

Remark: the $\lambda_i$ and b are determined using the methods for linear SVMs that were discussed earlier.

Kernel function trick: perform the computations in the original space, although we solve an optimization problem in the transformed space, which is more efficient; more details in Topic 14.
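The sketch below illustrates this classification rule. The coefficients $\lambda_i y_i$ and b are taken from a fitted scikit-learn SVC (whose polynomial kernel with degree=2, gamma=1, coef0=1 is exactly $(\mathbf{u}\cdot\mathbf{v}+1)^2$) rather than derived as in these slides; the point is only that the kernelized rule, evaluated entirely in the original space, reproduces the library's own predictions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(2)
X = rng.uniform(-2, 2, size=(100, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.5, 1, -1)

clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)

def K(u, v):
    return (np.dot(u, v) + 1.0) ** 2            # kernel evaluated in the original space

def predict_manual(z):
    # sign( sum_i (lambda_i * y_i) * K(x_i, z) + b ), summing over the support vectors
    s = sum(coef * K(sv, z)
            for coef, sv in zip(clf.dual_coef_[0], clf.support_vectors_))
    return int(np.sign(s + clf.intercept_[0]))

z = np.array([0.3, -0.4])
print(predict_manual(z), clf.predict([z])[0])   # the two predictions agree
```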

Page 17: Ch. Eick : Support Vector Machines: The Main Ideas

Other Material on SVMs

http://www.youtube.com/watch?v=27RQRUR7Ubc
Support Vector Machines in Rapid Miner
http://stackoverflow.com/questions/1072097/pointers-to-some-good-svm-tutorial
LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ and http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html
Adaboost/SVM relationship lecture: http://videolectures.net/mlss05us_rudin_da/


Page 18: Ch. Eick : Support Vector Machines: The Main Ideas

Summary: Support Vector Machines

Support vector machines learn hyperplanes that separate two classes, maximizing the margin between them (the empty space between the instances of the two classes).

Support vector machines introduce slack variables, in the case that the classes are not linearly separable, trying to maximize margins while keeping the training error low.

The most popular versions of SVMs use non-linear kernel functions and map the attribute space into a higher-dimensional space to facilitate finding “good” linear decision boundaries in the modified space.

Support vector machines find “margin optimal” hyperplanes by solving a convex quadratic optimization problem. However, this optimization process is quite slow, and support vector machines tend to fail if the number of examples goes beyond 500/5000/50000…

In general, support vector machines accomplish quite high accuracies compared to other techniques.

In the last 10 years, support vector machines have been generalized for other tasks such as regression, PCA, outlier detection,…


Page 19: Ch. Eick : Support Vector Machines: The Main Ideas


Kernels: What can they do for you?

Some machine learning/statistical problems only depend on the dot products of the objects in the dataset O = {x1, .., xn} and not on other characteristics of the objects; in other words, those techniques only depend on the Gram matrix of O, which stores x1·x1, x1·x2, …, xn·xn (http://en.wikipedia.org/wiki/Gramian_matrix).

These techniques can be generalized by mapping the dataset into a higher-dimensional space, as long as the non-linear mapping $\Phi$ can be kernelized; that is, a kernel function K can be found such that K(u,v) = $\Phi$(u)·$\Phi$(v).

In this case the results in the mapped space are computed based on K(x1,x1), K(x1,x2), …, K(xn,xn), which is called the kernel trick: http://en.wikipedia.org/wiki/Kernel_trick

Kernels have been successfully used to generalize PCA, k-means, support vector machines, and many other techniques, allowing them to use non-linear coordinate systems, more complex decision boundaries, or more complex cluster boundaries.
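A small sketch of the idea: compute the ordinary Gram matrix of a toy dataset and, alongside it, a kernel matrix (here a Gaussian/RBF kernel, chosen only as an example); any technique that consumes the first can be kernelized by handing it the second instead.

```python
import numpy as np

rng = np.random.RandomState(3)
X = rng.randn(5, 2)                       # a small dataset O = {x_1, ..., x_5}

gram = X @ X.T                            # Gram matrix: G[i, j] = x_i · x_j

def rbf(u, v, gamma=0.5):                 # example kernel K(u, v) = exp(-gamma ||u - v||^2)
    return np.exp(-gamma * np.sum((u - v) ** 2))

# Kernel matrix: K[i, j] = K(x_i, x_j) = Phi(x_i) · Phi(x_j), without ever forming Phi.
kernel_matrix = np.array([[rbf(xi, xj) for xj in X] for xi in X])

print(gram.shape, kernel_matrix.shape)    # both 5 x 5; same role, different geometry
```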

We will revisit kernels later discussing transparencies 13-25, 30-35 of the Vasconcelos lecture.
