
CS 8751 ML & KDD Support Vector Machines 1

Support Vector Machines (SVMs)
• Learning mechanism based on linear programming
• Chooses a separating plane based on maximizing the notion of a margin
  – Based on PAC learning
• Has mechanisms for
  – Noise
  – Non-linear separating surfaces (kernel functions)
• Notes based on those of Prof. Jude Shavlik


CS 8751 ML & KDD Support Vector Machines 2

Support Vector Machines

[Figure: clusters of positive (A+) and negative (A-) examples with several candidate separating planes]

Find the best separating plane in feature space - many possibilities to choose from - which is the best choice?


CS 8751 ML & KDD Support Vector Machines 3

SVMs – The General Idea
• How to pick the best separating plane?
• Idea:
  – Define a set of inequalities we want to satisfy
  – Use advanced optimization methods (e.g., linear programming) to find satisfying solutions
• Key issues:
  – Dealing with noise
  – What if there is no good linear separating surface?


CS 8751 ML & KDD Support Vector Machines 4

Linear Programming
• Subset of Math Programming
• Problem has the following form:
  a function f(x1, x2, x3, …, xn) to be maximized
  subject to a set of constraints of the form:
    g(x1, x2, x3, …, xn) ≥ b
• Math programming - find a set of values for the variables x1, x2, x3, …, xn that meets all of the constraints and maximizes the function f
• Linear programming - solving math programs where the constraint functions and the function to be maximized use linear combinations of the variables
  – Generally easier than the general Math Programming problem
  – Well-studied problem
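As an illustration (not from the slides), a tiny linear program of this form can be handed to an off-the-shelf solver; the objective and constraint values below are made up for the example. SciPy's linprog minimizes, so the objective is negated to maximize.

```python
# Hypothetical example: maximize f(x1, x2) = 3*x1 + 2*x2
# subject to  x1 + x2 <= 4,  x1 <= 3,  and  x1, x2 >= 0.
from scipy.optimize import linprog

res = linprog(c=[-3, -2],                     # negated objective (linprog minimizes)
              A_ub=[[1, 1], [1, 0]],          # left-hand sides of the <= constraints
              b_ub=[4, 3],                    # right-hand sides
              bounds=[(0, None), (0, None)])  # x1, x2 >= 0
print(res.x, -res.fun)                        # optimal point and the maximal value of f
```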


CS 8751 ML & KDD Support Vector Machines 5

Maximizing the Margin

[Figure: A+ and A- example clusters with the decision boundary between them]

The margin between categories - want this distance to be maximal - (we’ll assume linearly separable for now)


CS 8751 ML & KDD Support Vector Machines 6

PAC Learning
• PAC – Probably Approximately Correct learning
• Theorems that can be used to define bounds for the risk (error) of a family of learning functions
• Basic formula, with probability (1 - η):

  R(α) ≤ R_emp(α) + sqrt( [ h (log(2N/h) + 1) - log(η/4) ] / N )

• R – risk function, α is the parameters chosen by the learner, N is the number of data points, and h is the VC dimension (something like an estimate of the complexity of the class of functions)
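To make the bound concrete, here is a small sketch (the numbers are made up) that evaluates the confidence term added to the empirical risk for a given sample size N, VC dimension h, and confidence parameter η.

```python
import math

def vc_confidence_term(N, h, eta):
    """Second term of the bound: sqrt((h*(log(2N/h) + 1) - log(eta/4)) / N)."""
    return math.sqrt((h * (math.log(2 * N / h) + 1) - math.log(eta / 4)) / N)

# Hypothetical values: 1000 examples, VC dimension 10, confidence 95% (eta = 0.05)
print(vc_confidence_term(N=1000, h=10, eta=0.05))  # roughly 0.26
```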


CS 8751 ML & KDD Support Vector Machines 7

Margins and PAC Learning
• Theorems connect PAC theory to the size of the margin
• Basically, the larger the margin, the better the expected accuracy
• See, for example, Chapter 4 of Support Vector Machines by Cristianini and Shawe-Taylor, Cambridge University Press, 2002


CS 8751 ML & KDD Support Vector Machines 8

Some Equations

Separating plane:   w·x = γ
  w - weights,  x - input features,  γ - threshold

For all positive examples:   w·x ≥ γ + 1
For all negative examples:   w·x ≤ γ - 1
  (the 1s result from dividing through by a constant, for convenience)

Distance between the red and blue planes (the margin):
  margin = 2 / ‖w‖₂
  where ‖w‖₂ is the Euclidean length (“2 norm”) of the weight vector
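A quick numeric check of the margin formula (the weight vector is made up): the distance between the planes w·x = γ + 1 and w·x = γ - 1 is 2 / ‖w‖₂.

```python
import numpy as np

w = np.array([3.0, 4.0])           # hypothetical weight vector
margin = 2.0 / np.linalg.norm(w)   # 2 / ||w||_2  (Euclidean "2 norm")
print(margin)                      # 2 / 5 = 0.4
```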


CS 8751 ML & KDD Support Vector Machines 9

What the Equations Mean

[Figure: A+ and A- examples; the examples lying on the planes x·w = γ + 1 and x·w = γ - 1 are the support vectors, and the margin between these two planes is 2 / ‖w‖₂]


CS 8751 ML & KDD Support Vector Machines 10

Choosing a Separating Plane

[Figure: A+ and A- examples with a candidate separating plane - is this the right choice?]


CS 8751 ML & KDD Support Vector Machines 11

Our “Mathematical Program” (so far)

  min_{w,γ}  ‖w‖₂
  such that
    w·x ≥ γ + 1   (for pos examples)
    w·x ≤ γ - 1   (for neg examples)

Note: w, γ are our adjustable parameters (we could, of course, use the ANN “trick” and move γ to the left side of our inequalities)

We can now use existing math programming optimization software to find a solution to the above (a global optimal solution)

For technical reasons it is easier to optimize the square, ‖w‖₂², which makes this a “quadratic program”
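A minimal sketch of this quadratic program using the CVXPY modeling library (not part of the slides; the toy data is made up and assumed linearly separable, so the hard-margin constraints can be satisfied).

```python
import numpy as np
import cvxpy as cp

# Hypothetical, linearly separable 2-D data
X_pos = np.array([[2.0, 2.0], [3.0, 3.0]])
X_neg = np.array([[0.0, 0.0], [0.0, 1.0]])

w = cp.Variable(2)      # weights, one per input feature
gamma = cp.Variable()   # threshold

constraints = [X_pos @ w >= gamma + 1,   # w.x >= gamma + 1 for positive examples
               X_neg @ w <= gamma - 1]   # w.x <= gamma - 1 for negative examples

# Minimize the squared 2-norm of w (the "quadratic program" form)
prob = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
prob.solve()
print(w.value, gamma.value)
```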


CS 8751 ML & KDD Support Vector Machines 12

Dealing with Non-Separable Data

We can add what is called a “slack” variable to each example

This variable can be viewed as:
  0 if the example is correctly separated
  else y, the “distance” we need to move the example to make it correct (i.e., the distance from its surface)


CS 8751 ML & KDD Support Vector Machines 13

“Slack” Variables

[Figure: A+ and A- examples; one example lies on the wrong side of its plane, a slack “distance” y away; the examples on or beyond the planes are the support vectors]


CS 8751 ML & KDD Support Vector Machines 14

The Math Program with Slack Variables

  min_{w,γ,s}  ‖w‖₂ + C ‖s‖₁
  such that
    w·x_i ≥ γ + 1 - s_i   (for pos examples i)
    w·x_j ≤ γ - 1 + s_j   (for neg examples j)
    s_k ≥ 0

  w - one component for each input feature
  s - one component for each example
  C - scaling constant
  ‖s‖₁ - “one norm” - sum of components (all positive)

This is the “traditional” Support Vector Machine
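In practice, a soft-margin program of this kind is what standard SVM libraries solve. A sketch with scikit-learn (the toy data is made up; scikit-learn's C plays the role of the scaling constant on the slack penalty, though its objective uses the squared 2-norm of w rather than the norm shown above).

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, slightly overlapping 2-D data with labels in {+1, -1}
X = np.array([[2, 2], [3, 3], [2.5, 2.5], [0, 0], [0, 1], [2.2, 2.1]])
y = np.array([+1, +1, +1, -1, -1, -1])

# C scales the penalty on the slack variables; a smaller C tolerates more slack
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)   # learned weights and threshold
print(clf.support_)                # indices of the support vectors
```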


CS 8751 ML & KDD Support Vector Machines 15

Why the word “Support”?
• All those examples on or on the wrong side of the two separating planes are the support vectors
  – We’d get the same answer if we deleted all the non-support vectors!
  – i.e., the “support vectors [examples]” support the solution


CS 8751 ML & KDD Support Vector Machines 16

PAC and the Number of Support Vectors
• The fewer the support vectors, the better the generalization will be
• Recall, non-support vectors are
  – Correctly classified
  – Don’t change the learned model if left out of the training set
• So

  leave-one-out error rate ≤ (# support vectors) / (# training examples)
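Continuing the scikit-learn sketch above (purely illustrative; it assumes the clf and X from that sketch), the bound can be estimated from a fitted model by counting its support vectors.

```python
# Assumes `clf` and `X` from the previous scikit-learn sketch
n_support = len(clf.support_)      # number of support vectors
loo_bound = n_support / len(X)     # leave-one-out error rate is at most this
print(n_support, loo_bound)
```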


CS 8751 ML & KDD Support Vector Machines 17

Finding Non-Linear Separating Surfaces
• Map inputs into a new space

  Example:  features  x1  x2
                       5   4

  Example:  features  x1  x2  x1²  x2²  x1·x2
                       5   4   25   16    20

• Solve the SVM program in this new space
  – Computationally complex if many features
  – But a clever trick exists
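The mapping on this slide can be written out directly (a small sketch; the helper name is made up):

```python
import numpy as np

def expand_features(x):
    """Map (x1, x2) to (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

print(expand_features([5, 4]))   # [ 5  4 25 16 20]
```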


CS 8751 ML & KDD Support Vector Machines 18

The Kernel Trick
• Optimization problems often/always have a “primal” and a “dual” representation
  – So far we’ve looked at the primal formulation
  – The dual formulation is better for the case of a non-linear separating surface


CS 8751 ML & KDD Support Vector Machines 19

Perceptrons Re-Visited

In perceptrons, if T = 1 and F = -1, the update is

  w_{k+1} = w_k + y_i x_i   if example x_i is currently misclassified

So

  w_final = w_initial + Σ_i α_i y_i x_i   (sum over the # examples)

where α_i is some number of times we get example i wrong and change the weights

This assumes w_initial = 0 (all zero)


CS 8751 ML & KDD Support Vector Machines 20

Dual Form of the Perceptron Learning Rule

Output of the perceptron:

  h(x) = sgn(w · x)

  sgn(z) = 1 if z ≥ 0, -1 otherwise

So

  h(x_i) = sgn( Σ_{j=1}^{#examples} α_j y_j x_j · x_i )

New (i.e., dual) perceptron algorithm:

  For each example i:
    if y_i Σ_{j=1}^{#examples} α_j y_j x_j · x_i ≤ 0   (i.e., predicted ≠ teacher)
    then α_i ← α_i + 1   (counts errors)
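A small sketch of the dual perceptron rule in NumPy (the toy data is made up); α_i simply counts how many times example i is misclassified.

```python
import numpy as np

# Hypothetical linearly separable data with labels in {+1, -1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])

alpha = np.zeros(len(X))   # one alpha per training example
G = X @ X.T                # Gram matrix of dot products x_j . x_i

for _ in range(100):       # a few passes over the data
    for i in range(len(X)):
        # dual prediction: sum_j alpha_j y_j (x_j . x_i); update if wrong
        if y[i] * np.sum(alpha * y * G[:, i]) <= 0:
            alpha[i] += 1  # count the error for example i

print(alpha)
```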


CS 8751 ML & KDD Support Vector Machines 21

Primal versus Dual Space
• Primal – “weight space”
  – Weight features to make the output decision

    h(x_new) = sgn(w · x_new)

• Dual – “training-examples space”
  – Weight distances (which are based on the features) to the training examples

    h(x_new) = sgn( Σ_{j=1}^{#examples} α_j y_j x_j · x_new )
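Continuing the dual perceptron sketch above (it assumes the X, y, and alpha from that sketch), the primal and dual predictions agree, since w = Σ_j α_j y_j x_j.

```python
import numpy as np

# Assumes X, y, alpha from the dual perceptron sketch
x_new = np.array([2.5, 1.5])                     # hypothetical new example
w = (alpha * y) @ X                              # primal weights recovered from the alphas
primal = np.sign(w @ x_new)                      # sgn(w . x_new)
dual = np.sign(np.sum(alpha * y * (X @ x_new)))  # sgn(sum_j alpha_j y_j x_j . x_new)
print(primal, dual)                              # identical
```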


CS 8751 ML & KDD Support Vector Machines 22

The Dual SVM

Let

  w = Σ_{i=1}^{n} α_i y_i x_i        (n ≡ # training examples)

  min_α   ½ Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i·x_j  -  Σ_{i=1}^{n} α_i
  such that
    Σ_{i=1}^{n} y_i α_i = 0
    α_i ≥ 0

Can convert back to the primal form: minimizing ‖w‖₂ there corresponds to maximizing the margin 2/‖w‖₂ here
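With scikit-learn the dual solution is exposed directly: for a linear kernel, dual_coef_ holds the α_i y_i of the support vectors, and combining them with the support vectors recovers the primal w (a sketch with made-up toy data).

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2, 2], [3, 3], [0, 0], [0, 1]], dtype=float)
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)

# dual_coef_ = alpha_i * y_i for each support vector
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(w_from_dual, clf.coef_)    # the two should match
```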


CS 8751 ML & KDD Support Vector Machines 23

Non-Zero αi’s

Those examples with α_i > 0 are the support vectors

Recall

  w = Σ_{i=1}^{n} α_i y_i x_i

  - only the support vectors (i.e., those with α_i > 0) contribute to the weights


CS 8751 ML & KDD Support Vector Machines 24

Generalizing the Dot Product

We can generalize

  Dot_Product(x_i, x_j)

to other “kernel functions”, e.g., K(x_i, x_j)

An acceptable kernel maps the original features into a new space implicitly (usually non-linearly)
  - in this new space we’re computing a dot product
  - we don’t need to explicitly know the features in the new space
  - usually more efficient than directly converting to the new space


CS 8751 ML & KDD Support Vector Machines 25

The New Space for a Sample Kernel

Let K(x, z) = (x·z)²  and let #features = 2

  (x·z)² = (x₁z₁ + x₂z₂)²
         = x₁x₁z₁z₁ + x₁x₂z₁z₂ + x₂x₁z₂z₁ + x₂x₂z₂z₂
         = (x₁x₁, x₁x₂, x₂x₁, x₂x₂) · (z₁z₁, z₁z₂, z₂z₁, z₂z₂)

Our new feature space (with 4 dimensions) - we’re doing a dot product in it
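A quick numeric check (the vectors are made up) that this kernel equals a dot product in the 4-dimensional derived space:

```python
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

def phi(v):
    """Explicit map into the derived space for K(x, z) = (x.z)^2."""
    return np.array([v[0]*v[0], v[0]*v[1], v[1]*v[0], v[1]*v[1]])

print((x @ z) ** 2)      # kernel value: (1*3 + 2*4)^2 = 121
print(phi(x) @ phi(z))   # dot product in the derived space: also 121
```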


CS 8751 ML & KDD Support Vector Machines 26

Visualizing the Kernel

[Figure: the original (input) space with + and - examples and the separating plane (non-linear here, but linear in the derived space); the derived (new) feature space with the mapped examples g(+) and g(-)]

g() is the feature transformation function

The process is similar to what hidden units do in ANNs, but the kernel is user chosen


CS 8751 ML & KDD Support Vector Machines 27

More Sample Kernels

  1)  K(x, z) = (x·z + const)^d

  2)  K(x, z) = e^( -‖x - z‖² / (2σ²) )
      - Gaussian kernel, leads to RBF network

  3)  K(x, z) = tanh(x·z - const)
      - related to the sigmoid of ANNs

  - plus many more, including many designed for specific tasks (text, DNA, etc.)
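These correspond, for example, to scikit-learn's built-in kernel options (a sketch; the parameter values are made up): kernel="poly" for 1), "rbf" for 2), and "sigmoid" for 3).

```python
from sklearn.svm import SVC

# 1) polynomial kernel: (gamma * x.z + coef0)^degree
poly_svm = SVC(kernel="poly", degree=3, coef0=1.0)

# 2) Gaussian / RBF kernel: exp(-gamma * ||x - z||^2)
rbf_svm = SVC(kernel="rbf", gamma=0.5)

# 3) sigmoid kernel: tanh(gamma * x.z + coef0)
sig_svm = SVC(kernel="sigmoid", coef0=-1.0)
```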


CS 8751 ML & KDD Support Vector Machines 28

What Makes a Kernel

Mercer’s theorem characterizes when a function f(x, z) is a kernel

If K₁ and K₂ are kernels, then so are

  1) K₁() + K₂()
  2) c * K₁(), where c is a constant
  3) K₁() * K₂()
  4) f(x)·f(z), where f() returns a real
  ...
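One practical consequence of Mercer's theorem: the Gram (kernel) matrix on any data set must be symmetric positive semidefinite. A small sketch (the data are made up) that checks this numerically:

```python
import numpy as np

X = np.random.randn(20, 3)                   # hypothetical data

def K(x, z):
    return (x @ z + 1.0) ** 2                # a valid polynomial kernel

# Gram matrix of all pairwise kernel values
G = np.array([[K(a, b) for b in X] for a in X])

eigvals = np.linalg.eigvalsh(G)              # eigenvalues of the symmetric matrix
print(eigvals.min() >= -1e-9)                # no (significantly) negative eigenvalues
```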


CS 8751 ML & KDD Support Vector Machines 29

Key SVM Ideas
• Maximize the margin between positive and negative examples (connects to PAC theory)
• Penalize errors in the non-separable case
• Only the support vectors contribute to the solution
• Kernels map examples into a new, usually non-linear space
  – We implicitly do dot products in this new space (in the “dual” form of the SVM program)