
INTRODUCTION TO MACHINE LEARNING, 3rd EDITION
ETHEM ALPAYDIN
© The MIT Press, 2014
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e

Lecture Slides for CHAPTER 13: Kernel Machines

Kernel Machines (slide 2)

Discriminant-based: no need to estimate densities first
Define the discriminant in terms of support vectors
The use of kernel functions: application-specific measures of similarity
No need to represent instances as vectors
Convex optimization problems with a unique solution

Optimal Separating Hyperplane (slide 3)

X = {x^t, r^t}, where r^t = +1 if x^t ∈ C1 and r^t = −1 if x^t ∈ C2.

Find w and w0 such that
  w^T x^t + w0 ≥ +1 for r^t = +1
  w^T x^t + w0 ≤ −1 for r^t = −1
which can be rewritten as
  r^t (w^T x^t + w0) ≥ +1

Note that we do not simply require r^t (w^T x^t + w0) ≥ 0.

Margin (slide 4)

The margin is the distance from the discriminant to the closest instances on either side.

The distance of x^t to the hyperplane is |w^T x^t + w0| / ||w||.

We require r^t (w^T x^t + w0) / ||w|| ≥ ρ, for all t.

For a unique solution, fix ρ||w|| = 1; then, to maximize the margin,
  min (1/2)||w||²  subject to  r^t (w^T x^t + w0) ≥ +1, ∀t

Margin (slide 5)

[figure: the optimal separating hyperplane and its margin]

Support Vector Machines (slides 6-10)

SVMs use a single hyperplane. One possible solution is B1; another possible solution is B2; many other solutions exist. Which one is better, B1 or B2, and how do you define "better"? Find the hyperplane maximizing the margin: by this criterion, B1 is better than B2.

[figures: candidate hyperplanes B1 and B2 with their margin boundaries b11, b12 and b21, b22]

Support Vector Machines (slide 11)

The decision boundary is w^T x + w0 = 0, with margin hyperplanes w^T x + w0 = +1 and w^T x + w0 = −1:

  f(x) = +1 if w^T x + w0 ≥ +1
  f(x) = −1 if w^T x + w0 ≤ −1

  Margin = 2 / ||w||
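As a quick illustration of these two quantities, here is a minimal numpy sketch (my own, with arbitrary example values for w and w0, not part of the slides) that computes the geometric distance of a point to the hyperplane and the margin 2/||w||.

```python
# Minimal sketch: distance of a point to w^T x + w0 = 0, and the margin 2/||w||.
import numpy as np

w = np.array([2.0, 1.0])   # example weight vector (arbitrary choice)
w0 = -3.0                  # example bias (arbitrary choice)

def distance_to_hyperplane(x, w, w0):
    """|w^T x + w0| / ||w||: geometric distance of x to the decision boundary."""
    return abs(w @ x + w0) / np.linalg.norm(w)

x = np.array([1.0, 2.0])
print("distance of x to the hyperplane:", distance_to_hyperplane(x, w, w0))
print("margin between the +1 and -1 hyperplanes:", 2.0 / np.linalg.norm(w))
```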

Support Vector Machines (slide 12)

To maximize the margin we minimize (1/2)||w||² subject to r^t (w^T x^t + w0) ≥ +1, ∀t. The (primal) Lagrangian is

  Lp = (1/2)||w||² − Σ_t α^t [r^t (w^T x^t + w0) − 1]
     = (1/2)||w||² − Σ_t α^t r^t (w^T x^t + w0) + Σ_t α^t

Setting the derivatives to zero:
  ∂Lp/∂w = 0  ⇒  w = Σ_t α^t r^t x^t
  ∂Lp/∂w0 = 0 ⇒  Σ_t α^t r^t = 0

Support Vector Machines (slide 13)

Plugging these back in, we get the dual, to be maximized with respect to α^t only:

  Ld = (1/2) w^T w − w^T Σ_t α^t r^t x^t − w0 Σ_t α^t r^t + Σ_t α^t
     = −(1/2) Σ_t Σ_s α^t α^s r^t r^s (x^t)^T x^s + Σ_t α^t

  subject to Σ_t α^t r^t = 0 and α^t ≥ 0, ∀t

Most α^t are 0 and only a small number have α^t > 0; they are the support vectors. For a support vector, r^t (w^T x^t + w0) = 1, which gives w0 = r^t − w^T x^t.
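The following hedged sketch (my own, assuming scikit-learn and a toy dataset from make_blobs) fits a linear SVM and checks numerically that w = Σ_t α^t r^t x^t, built from the support vectors alone, matches the learned weight vector.

```python
# Sketch: recover w from the dual coefficients of a linear SVM (scikit-learn).
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=60, centers=2, random_state=0)  # toy 2-class data
clf = SVC(kernel="linear", C=1e3).fit(X, y)                 # large C ~ hard margin

# dual_coef_ stores alpha^t * r^t for the support vectors only (r^t in {-1, +1})
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print("number of support vectors:", len(clf.support_))
print("w (coef_)           :", clf.coef_.ravel())
print("w (dual reconstruct):", w_from_dual.ravel())
print("margin 2/||w||      :", 2.0 / np.linalg.norm(clf.coef_))
```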

Linear SVM for Non-linearly Separable Problems (slide 14)

What if the problem is not linearly separable? Introduce slack variables ξ^t, which allow constraint violation to a certain degree. We then need to minimize

  L(w) = ||w||² / 2 + C Σ_t ξ^t

The first term is the inverse of the margin size between the hyperplanes; the second term measures the prediction error. The parameter C is chosen using a validation set, trying to keep the margin wide while keeping the training error low. No kernel is used yet.
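A hedged sketch of that model-selection step (my own, assuming scikit-learn and a synthetic dataset): try several values of C and compare training and validation accuracy, as the slide suggests.

```python
# Sketch: choose C on a held-out validation set.
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

for C in [0.01, 0.1, 1, 10, 100]:
    clf = SVC(kernel="linear", C=C).fit(X_tr, y_tr)
    print(f"C={C:>6}: train acc={clf.score(X_tr, y_tr):.3f}, "
          f"val acc={clf.score(X_val, y_val):.3f}, #SV={len(clf.support_)}")
```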

Soft Margin Hyperplane (slide 15)

If the data are not linearly separable, we define slack variables ξ^t ≥ 0 (tolerance variables) and relax the constraints to

  r^t (w^T x^t + w0) ≥ 1 − ξ^t

The number of misclassifications is #{ξ^t > 1}. We define the soft error as Σ_t ξ^t, and the new primal is

  Lp = (1/2)||w||² + C Σ_t ξ^t − Σ_t α^t [r^t (w^T x^t + w0) − 1 + ξ^t] − Σ_t μ^t ξ^t

where μ^t are the new Lagrange parameters that guarantee the positivity of ξ^t.

Soft Margin Hyperplane (slide 16)

Setting the derivatives of Lp to zero:
  ∂Lp/∂w = 0  ⇒  w = Σ_t α^t r^t x^t
  ∂Lp/∂w0 = 0 ⇒  Σ_t α^t r^t = 0
  ∂Lp/∂ξ^t = 0 ⇒  C − α^t − μ^t = 0  ⇒  0 ≤ α^t ≤ C

The dual is

  Ld = Σ_t α^t − (1/2) Σ_t Σ_s α^t α^s r^t r^s (x^t)^T x^s
  subject to Σ_t α^t r^t = 0 and 0 ≤ α^t ≤ C, ∀t

The expected test error is bounded by the fraction of support vectors:

  E_N[P(error)] ≤ E[# of support vectors] / N

Soft Margin Hyperplane (slide 17)

[figure caption] In classifying an instance, there are four possible cases: In (a), the instance is on the correct side and far away from the margin; r^t g(x^t) > 1, ξ^t = 0. In (b), ξ^t = 0; it is on the right side and on the margin. In (c), ξ^t = 1 − g(x^t), 0 < ξ^t < 1; it is on the right side but is in the margin and not sufficiently away. In (d), ξ^t = 1 + g(x^t) > 1; it is on the wrong side; this is a misclassification. All cases except (a) are support vectors. In terms of the dual variable, in (a), α^t = 0; in (b), α^t < C; in (c) and (d), α^t = C.

* Hinge Loss (slide 18)

We define an error if the instance is on the wrong side or if the margin is less than 1; this is called the hinge loss. If y^t = w^T x^t + w0 is the output and r^t is the desired output:

  L_hinge(y^t, r^t) = 0 if y^t r^t ≥ 1, and 1 − y^t r^t otherwise

Comparison of different loss functions for r^t = 1: 0/1 loss is 0 if y^t > 0, 1 otherwise. Hinge loss is 0 if y^t ≥ 1, 1 − y^t otherwise. Squared error is (1 − y^t)². Cross-entropy is log(1 + exp(−y^t)).
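A small numpy sketch (my own illustration, not from the slides) that tabulates the four losses listed above for r^t = 1 as a function of the output y^t:

```python
# Sketch: compare 0/1, hinge, squared, and cross-entropy losses for r^t = 1.
import numpy as np

y = np.linspace(-2, 2, 9)                  # candidate outputs y^t
zero_one = (y <= 0).astype(float)          # 0/1 loss: 0 if y > 0, 1 otherwise
hinge = np.maximum(0.0, 1.0 - y)           # hinge loss: 0 if y >= 1, 1 - y otherwise
squared = (1.0 - y) ** 2                   # squared error
cross_entropy = np.log(1.0 + np.exp(-y))   # cross-entropy (logistic) loss

for row in zip(y, zero_one, hinge, squared, cross_entropy):
    print("y=%5.2f  0/1=%.0f  hinge=%.2f  sq=%.2f  ce=%.2f" % row)
```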

* ν-SVM (slide 19)

  min (1/2)||w||² − νρ + (1/N) Σ_t ξ^t
  subject to r^t (w^T x^t + w0) ≥ ρ − ξ^t, ξ^t ≥ 0, ρ ≥ 0

The dual is

  Ld = −(1/2) Σ_t Σ_s α^t α^s r^t r^s (x^t)^T x^s
  subject to Σ_t α^t r^t = 0, 0 ≤ α^t ≤ 1/N, Σ_t α^t ≥ ν

ν controls the fraction of support vectors.
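A hedged sketch of this effect (my own, assuming scikit-learn's NuSVC, whose nu parameter plays this role): larger nu tends to give a larger fraction of support vectors.

```python
# Sketch: nu roughly controls the fraction of support vectors (NuSVC).
from sklearn.svm import NuSVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
for nu in [0.1, 0.3, 0.5]:
    clf = NuSVC(nu=nu, kernel="rbf").fit(X, y)
    print(f"nu={nu}: fraction of support vectors = {len(clf.support_) / len(X):.2f}")
```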

Key Properties of Support Vector Machines (slide 20)

1. They use a single hyperplane which subdivides the space into two half-spaces, one occupied by Class 1 and the other by Class 2.

2. They maximize the margin of the decision boundary using quadratic optimization techniques which find the optimal hyperplane.

3. When used in practice, SVM approaches frequently map the examples (using a nonlinear mapping φ) to a higher-dimensional space and find margin-maximal hyperplanes in the mapped space, obtaining decision boundaries which are not hyperplanes in the original space.

4. Moreover, versions of SVMs exist that can be used when linear separability cannot be accomplished.

Nonlinear Support Vector Machines (slides 21-22)

What if the decision boundary is not linear?

Alternative 1: Use a technique that employs non-linear decision boundaries (a non-linear function).

Alternative 2: Transform the data into a higher-dimensional attribute space and find linear decision boundaries in that space:
1. Transform the data into a higher-dimensional space.
2. Find the best hyperplane using the methods introduced earlier.

Kernel Trick (slide 23)

Preprocess the input x by basis functions: z = φ(x), g(z) = w^T z, so g(x) = w^T φ(x).

The SVM solution:
  w = Σ_t α^t r^t z^t = Σ_t α^t r^t φ(x^t)
  g(x) = w^T φ(x) = Σ_t α^t r^t φ(x^t)^T φ(x)
  g(x) = Σ_t α^t r^t K(x^t, x)
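A hedged check of the last identity (my own, assuming scikit-learn and an RBF kernel on a toy dataset): rebuild g(x) = Σ_t α^t r^t K(x^t, x) + w0 from the support vectors and compare it with the library's decision function.

```python
# Sketch: reconstruct the kernelized discriminant from the dual coefficients.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=100, noise=0.1, random_state=0)
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

K = rbf_kernel(X, clf.support_vectors_, gamma=0.5)      # K(x, x^t) for all x
g_manual = K @ clf.dual_coef_.ravel() + clf.intercept_  # sum_t alpha^t r^t K(x^t, x) + w0
print(np.allclose(g_manual, clf.decision_function(X)))  # expected: True
```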

24

Vectorial Kernels

Polynomials of degree q:

, 1 q

t T tK x x x x

2

2

1 1 2 2

2 2 2 2

1 1 2 2 1 2 1 2 1 1 2 2

2 2

1 2 1 2 1 2

, 1

1

1 2 2 2

corresponds to the inner product of the basis function

1, 2 , 2 , 2 , ,

T

T

K

x y x y

x y x y x x y y x y x y

x x x x x x

x y x y

x
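A quick numeric check of this identity (my own sketch, arbitrary 2-d vectors): the kernel value equals the inner product of the explicit expansion φ.

```python
# Sketch: verify (x^T y + 1)^2 == phi(x)^T phi(y) for the expansion above.
import numpy as np

def phi(x):
    # phi(x) = [1, sqrt(2)x1, sqrt(2)x2, sqrt(2)x1x2, x1^2, x2^2]
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, s * x1 * x2, x1 ** 2, x2 ** 2])

x = np.array([0.7, -1.2])
y = np.array([2.0, 0.5])
print((x @ y + 1.0) ** 2)   # kernel value
print(phi(x) @ phi(y))      # same value via the explicit basis expansion
```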

Vectorial Kernels (slide 25)

Radial-basis functions:

  K(x^t, x) = exp[ −||x^t − x||² / (2 s²) ]

Defining Kernels (slide 26)

Kernels are generally considered to be measures of similarity. Kernel "engineering" means defining good measures of similarity: string kernels, graph kernels, image kernels, and so on.

Given two documents D1 and D2, one possible representation is called bag of words, where we predefine M words relevant for the application; φ(D1)^T φ(D2) then counts the number of shared words.

Defining Kernels (slide 27)

Given two strings (of genes), a kernel measures the edit distance, namely, how many operations (insertions, deletions, substitutions) it takes to convert one string into the other; this is also called alignment.

Empirical kernel map: define a set of templates m_i and a score function s(x, m_i), let

  φ(x^t) = [s(x^t, m_1), s(x^t, m_2), ..., s(x^t, m_M)]^T

and define the empirical kernel map as K(x^t, x^s) = φ(x^t)^T φ(x^s).
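A minimal sketch of the empirical kernel map (my own toy choices of templates and score function, not from the slides):

```python
# Sketch: empirical kernel map K(x, z) = phi(x)^T phi(z) with template scores.
import numpy as np

templates = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])   # m_1, ..., m_M

def score(x, m):
    """Toy score function s(x, m): Gaussian similarity to the template."""
    return np.exp(-np.sum((x - m) ** 2))

def phi(x):
    """phi(x) = [s(x, m_1), ..., s(x, m_M)]^T"""
    return np.array([score(x, m) for m in templates])

def K(x, z):
    """Empirical kernel map."""
    return phi(x) @ phi(z)

print(K(np.array([0.5, 0.5]), np.array([1.5, 0.5])))
```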

* Multiple Kernel Learning (slide 28)

Fixed kernel combination:
  K(x, y) = c K1(x, y)
  K(x, y) = K1(x, y) + K2(x, y)
  K(x, y) = K1(x, y) · K2(x, y)

Adaptive kernel combination:
  K(x, y) = Σ_i η_i K_i(x, y)
  Ld = Σ_t α^t − (1/2) Σ_t Σ_s α^t α^s r^t r^s Σ_i η_i K_i(x^t, x^s)
  g(x) = Σ_t α^t r^t Σ_i η_i K_i(x^t, x)

Localized kernel combination:
  g(x) = Σ_t α^t r^t Σ_i η_i(x | θ_i) K_i(x^t, x)
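A hedged sketch of a fixed kernel combination (my own, assuming scikit-learn): sum a linear and an RBF kernel and pass the result to an SVM as a precomputed Gram matrix.

```python
# Sketch: fixed combination K = K_linear + K_rbf via a precomputed kernel.
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

K_tr = linear_kernel(X_tr, X_tr) + rbf_kernel(X_tr, X_tr, gamma=0.2)
K_te = linear_kernel(X_te, X_tr) + rbf_kernel(X_te, X_tr, gamma=0.2)

clf = SVC(kernel="precomputed", C=1.0).fit(K_tr, y_tr)
print("test accuracy with the summed kernel:", clf.score(K_te, y_te))
```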

* Multiclass Kernel Machines (slide 29)

1-vs-all
Pairwise separation (K(K−1)/2 classifiers)
Error-Correcting Output Codes (section 17.5)
Single multiclass optimization:

  min (1/2) Σ_i ||w_i||² + C Σ_i Σ_t ξ_i^t
  subject to w_{z^t}^T x^t + w_{z^t 0} ≥ w_i^T x^t + w_{i0} + 2 − ξ_i^t, ∀i ≠ z^t, ξ_i^t ≥ 0

where z^t is the index of the class that x^t belongs to.

The one-vs-all approach is generally preferred because it solves K separate N-variable problems, whereas the multiclass formulation uses K·N variables.
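A hedged sketch of the 1-vs-all strategy (my own, assuming scikit-learn and the iris dataset): K separate binary SVMs, one per class against the rest.

```python
# Sketch: one-vs-all multiclass SVM with K binary classifiers.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ovr = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X_tr, y_tr)
print("number of binary classifiers:", len(ovr.estimators_))   # K = 3
print("test accuracy:", ovr.score(X_te, y_te))
```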

* SVM for Regression (slide 30)

Use a linear model (possibly kernelized): f(x) = w^T x + w0.

Use the ε-sensitive error function

  e_ε(r^t, f(x^t)) = 0 if |r^t − f(x^t)| < ε, and |r^t − f(x^t)| − ε otherwise

which means that we tolerate errors up to ε and that errors beyond ε have a linear, not a quadratic, effect. This error function is therefore more tolerant to noise and thus more robust.

* SVM for Regression (slide 31)

  min (1/2)||w||² + C Σ_t (ξ+^t + ξ−^t)
  subject to
    r^t − (w^T x^t + w0) ≤ ε + ξ+^t
    (w^T x^t + w0) − r^t ≤ ε + ξ−^t
    ξ+^t, ξ−^t ≥ 0

We use two types of slack variables, for positive and negative deviations, to keep them positive.

* SVM for Regression (slide 32)

The dual is maximized with respect to α+^t and α−^t, subject to 0 ≤ α+^t ≤ C, 0 ≤ α−^t ≤ C, and Σ_t (α+^t − α−^t) = 0. Once we solve this, we see that all instances that fall inside the tube have α+^t = α−^t = 0; these are the instances that are fitted with enough precision. The support vectors satisfy either α+^t > 0 or α−^t > 0 and are of two types. They may be instances that are on the boundary of the tube (either α+^t or α−^t is between 0 and C), and we use these to calculate w0; instances that fall outside the tube have the corresponding α equal to C.

We can write the fitted line as a weighted sum of the support vectors:

  f(x) = w^T x + w0 = Σ_t (α+^t − α−^t) (x^t)^T x + w0

Note: (x^t)^T x can be replaced with a kernel K(x^t, x).
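A hedged sketch of ε-insensitive regression (my own, assuming scikit-learn's SVR on noisy sine data): points that fall inside the ε-tube do not become support vectors.

```python
# Sketch: epsilon-insensitive support vector regression.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 60))[:, None]
y = np.sin(X).ravel() + 0.1 * rng.randn(60)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.25).fit(X, y)
print("support vectors:", len(reg.support_), "of", len(X))
print("prediction at x=2.5:", reg.predict([[2.5]])[0])
```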

* SVM for Regression (slide 33)

[figure caption] The fitted regression line to data points shown as crosses and the ε-tube are shown (C = 10, ε = 0.25). There are three cases: In (a), the instance is in the tube; in (b), the instance is on the boundary of the tube (circled instances); in (c), it is outside the tube with a positive slack, that is, ξ+^t > 0 (squared instances). (b) and (c) are support vectors. In terms of the dual variables, in (a), α+^t = 0 and α−^t = 0; in (b), α+^t < C; and in (c), α+^t = C.

* Kernel Regression (slide 34)

[figure: kernelized regression fits with a polynomial kernel and a Gaussian kernel]

* Kernel Machines for Ranking (slide 35)

We require not only that the scores be in the correct order but that there be at least a +1 unit margin between them.

Linear case:

  min (1/2)||w||² + C Σ_t ξ^t
  subject to w^T x^u ≥ w^T x^v + 1 − ξ^t for every pair t = (u, v) where x^u should be ranked before x^v, with ξ^t ≥ 0

The dual is

  Ld = Σ_t α^t − (1/2) Σ_t Σ_s α^t α^s (x^{u_t} − x^{v_t})^T (x^{u_s} − x^{v_s})
  subject to 0 ≤ α^t ≤ C, ∀t

For a new test instance x, the score is calculated as g(x) = w^T x = Σ_t α^t (x^{u_t} − x^{v_t})^T x.
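A hedged sketch of the pairwise idea above (my own toy construction, not the authors' code): for every pair (u, v) where u should score higher than v, form the difference vector x^u − x^v and train a linear SVM without a bias term to give it a positive score; the learned w then ranks new instances by w^T x.

```python
# Sketch: ranking via a linear SVM on pairwise difference vectors.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(30, 4)
true_w = np.array([1.0, -2.0, 0.5, 0.0])
scores = X @ true_w                      # hidden relevance used only to define the order

pairs, labels = [], []
for u in range(len(X)):
    for v in range(len(X)):
        if scores[u] > scores[v] + 0.5:  # u should be ranked above v
            sign = 1 if (u + v) % 2 == 0 else -1   # flip half so both classes appear
            pairs.append(sign * (X[u] - X[v]))
            labels.append(sign)

rank_svm = LinearSVC(fit_intercept=False, C=1.0, max_iter=10_000)
rank_svm.fit(np.array(pairs), labels)
w = rank_svm.coef_.ravel()
print("learned ranking direction (up to scale):", w / np.linalg.norm(w))
```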

* One-Class Kernel Machines (slide 36)

Consider a sphere with center a and radius R:

  min R² + C Σ_t ξ^t
  subject to ||x^t − a||² ≤ R² + ξ^t, ξ^t ≥ 0

The dual is

  Ld = Σ_t α^t (x^t)^T x^t − Σ_t Σ_s α^t α^s (x^t)^T x^s
  subject to 0 ≤ α^t ≤ C, Σ_t α^t = 1

* One-Class Kernel Machines (slide 37)

[figure caption] A one-class support vector machine places the smoothest boundary (here using a linear kernel, the circle with the smallest radius) that encloses as much of the instances as possible. There are three possible cases: In (a), the instance is a typical instance. In (b), the instance falls on the boundary with ξ^t = 0; such instances define R. In (c), the instance is an outlier with ξ^t > 0. (b) and (c) are support vectors. In terms of the dual variable, we have, in (a), α^t = 0; in (b), 0 < α^t < C; in (c), α^t = C.

[figure: one-class support vector machine using a Gaussian kernel with different spreads]
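A hedged sketch (my own, assuming scikit-learn's OneClassSVM, which uses a ν-style parameter in place of C to bound the fraction of outliers): fit on typical data and flag far-away points.

```python
# Sketch: one-class SVM with a Gaussian (RBF) kernel for outlier detection.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(100, 2)                     # "typical" instances around the origin
X_outliers = rng.uniform(-6, 6, (10, 2))  # a few far-away points

oc = OneClassSVM(kernel="rbf", gamma=0.2, nu=0.05).fit(X)
print("fraction of inliers predicted as +1 :", np.mean(oc.predict(X) == 1))
print("fraction of outliers predicted as -1:", np.mean(oc.predict(X_outliers) == -1))
```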

* Large Margin Nearest Neighbor (slide 38)

LMNN learns the matrix M of a Mahalanobis metric

  D(x_i, x_j) = (x_i − x_j)^T M (x_i − x_j)

For three instances i, j, and l, where i and j are of the same class and l is of a different class, we require

  D(x_i, x_l) > D(x_i, x_j) + 1

and if this is not satisfied, we have a slack for the difference; we learn M to minimize the sum of such slacks over all (i, j, l) triples (j and l being among the k neighbors of i, over all i).

LMNN algorithm (Weinberger and Saul 2009). The LMCA algorithm (Torresani and Lee 2007) uses a similar approach where M = L^T L and learns L.
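A minimal sketch of the quantities involved (my own illustration with an arbitrary L, not the LMNN/LMCA optimization itself): the Mahalanobis distance with M = L^T L and the slack incurred when the margin constraint is violated.

```python
# Sketch: Mahalanobis distance and the LMNN-style margin constraint slack.
import numpy as np

L = np.array([[1.5, 0.0], [0.3, 0.8]])   # arbitrary L; M = L^T L is positive semidefinite
M = L.T @ L

def mahalanobis(xi, xj, M):
    d = xi - xj
    return d @ M @ d                     # (x_i - x_j)^T M (x_i - x_j)

x_i = np.array([0.0, 0.0])   # target point
x_j = np.array([0.4, 0.1])   # same-class neighbor
x_l = np.array([2.0, 1.5])   # point from a different class

slack = max(0.0, mahalanobis(x_i, x_j, M) + 1.0 - mahalanobis(x_i, x_l, M))
print("slack for this triple (0 means the margin constraint holds):", slack)
```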

* Learning a Distance Measure (slide 39)

* Motivation for Kernel PCA (slide 40)

Remark: this approach uses kernels, but is unrelated to SVMs!

Example: we want to cluster a dataset using k-means, which will be difficult in the original coordinates; the idea is to change the coordinate system using a few new, non-linear features.

* Kernel PCA (slide 41)

Kernel PCA does PCA on the kernel matrix (equal to doing PCA in the mapped space, selecting some orthogonal eigenvectors in the mapped space as the new coordinate system).

It is a kind of PCA using non-linear transformations of the original space; moreover, the vectors of the chosen new coordinate system are usually not orthogonal in the original space.

ML/DM algorithms are then used in the reduced feature space.

Original Space → (φ) → Feature Space → (PCA) → Reduced Feature Space (fewer dimensions); the new features are a few linear combinations of the features in the Feature Space.

* Kernel Dimensionality Reduction (slide 42)

Kernel PCA does PCA on the kernel matrix (equal to canonical PCA with a linear kernel).

Kernel LDA, Kernel CCA.
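A hedged sketch of the motivating example above (my own, assuming scikit-learn and the two-circles dataset, with parameters chosen by me): kernel PCA with an RBF kernel typically pulls the two rings apart along the first component, whereas linear PCA cannot.

```python
# Sketch: linear PCA vs. kernel PCA (RBF) on two concentric circles.
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

lin = PCA(n_components=2).fit_transform(X)
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)

# Per-class means along the first component: similar for linear PCA,
# typically well separated in the kernel PCA coordinates.
print("linear PCA, mean PC1 per class :", lin[y == 0, 0].mean(), lin[y == 1, 0].mean())
print("kernel PCA, mean PC1 per class :", kpca[y == 0, 0].mean(), kpca[y == 1, 0].mean())
```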