
Support Vector Machines


Page 1: Support Vector Machines

23/4/11 Chap8 SVM Zhongzhi Shi 1

Advanced Computing Seminar Data Mining and Its Industrial

Applications

— Chapter 8 —

Support Vector Machines

Zhongzhi Shi, Markus Stumptner, Yalei Hao, Gerald Quirchmayr Knowledge and Software Engineering Lab

Advanced Computing Research Centre, School of Computer and Information Science

University of South Australia

Page 2: Support Vector Machines


Outline

Introduction

Support Vector Machine

Non-linear Classification

SVM and PAC

Applications

Summary

Page 3: Support Vector Machines


History

SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis.

SVMs were introduced by Boser, Guyon, and Vapnik in COLT-92.

Initially popularized in the NIPS community, SVMs are now an important and active field of all machine learning research, with special issues of the Machine Learning Journal and the Journal of Machine Learning Research.

Page 4: Support Vector Machines


What is SVM?

SVMs are learning systems that:

use a hypothesis space of linear functions in a high-dimensional feature space (kernel functions)

are trained with a learning algorithm from optimization theory (Lagrange)

implement a learning bias derived from statistical learning theory (generalisation)

Page 5: Support Vector Machines


Linear Classifiers

A linear classifier f takes an input x and produces an estimate y_est:

f(x, w, b) = sign(⟨w · x⟩ − b)

(In the figure, one class of points is denoted +1 and the other −1.)

How would you classify this data?

Copyright © 2001, 2003, Andrew W. Moore


Page 10: Support Vector Machines


Maximum Margin

f(x, w, b) = sign(⟨w · x⟩ − b)

The maximum margin linear classifier is the linear classifier with the maximum margin.

This is the simplest kind of SVM, called a linear SVM (LSVM).

Page 11: Support Vector Machines


Model of Linear Classification

Binary classification is frequently performed by using a real-valued hypothesis function:

f(x) = ⟨w · x⟩ + b = Σ_{i=1}^{n} w_i x_i + b

The input x is assigned to the positive class if f(x) ≥ 0, and otherwise to the negative class.
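The decision rule above can be sketched in a few lines; a minimal example, with made-up weights w and bias b:

```python
import numpy as np

def predict(w, b, x):
    """Linear decision rule: sign(<w, x> + b), with ties assigned to the positive class."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Hypothetical 2-D classifier: w = (1, 1), b = -3.
w, b = np.array([1.0, 1.0]), -3.0
print(predict(w, b, np.array([2.0, 2.0])))  # 2 + 2 - 3 = 1 >= 0, so +1
print(predict(w, b, np.array([0.0, 1.0])))  # 0 + 1 - 3 = -2 < 0, so -1
```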

Page 12: Support Vector Machines


The Concept of Hyperplane

For a binary linearly separable training set, we can find at least one hyperplane (w, b) which divides the space into two half-spaces.

The hyperplane itself is defined by f(x) = 0; the two half-spaces are f(x) > 0 and f(x) < 0.

Page 13: Support Vector Machines


Tuning the Hyperplane (w, b)

The perceptron algorithm, proposed by Frank Rosenblatt in 1956.

Preliminary definition: the functional margin of an example (x_i, y_i) is

γ_i = y_i (⟨w · x_i⟩ + b)

γ_i > 0 implies correct classification of (x_i, y_i).

Page 14: Support Vector Machines


The Perceptron Algorithm

On a separable training set, the number of mistakes is at most (2R/γ)², where R bounds the norm of the training examples and γ is the margin.
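The mistake-driven update rule can be sketched directly from the functional-margin definition; a minimal perceptron on a tiny made-up data set:

```python
import numpy as np

def perceptron(X, y, eta=1.0, epochs=100):
    """Rosenblatt's perceptron: update (w, b) on every mistake, stop when none remain."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # functional margin not positive: a mistake
                w, b = w + eta * yi * xi, b + eta * yi
                mistakes += 1
        if mistakes == 0:                        # converged: every functional margin positive
            break
    return w, b

# Tiny linearly separable set (hypothetical data).
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(all(yi * (np.dot(w, xi) + b) > 0 for xi, yi in zip(X, y)))  # True
```

Note the order-dependence mentioned a few slides later: processing the examples in a different order can yield a different (w, b), all of them consistent.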

Page 15: Support Vector Machines


The Geometric Margin

The Euclidean distance of an example (x_i, y_i) from the decision boundary:

γ_i = y_i ( ⟨w/‖w‖ · x_i⟩ + b/‖w‖ )

Page 16: Support Vector Machines


The Geometric Margin

The margin of a training set S:

γ = min_{1 ≤ i ≤ l} γ_i

Maximal margin hyperplane: a hyperplane realising the maximum geometric margin.

The optimal linear classifier is the one that forms the maximal margin hyperplane.
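The training-set margin is just the minimum of the per-example distances; a minimal sketch, with a hypothetical hyperplane and points:

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Margin of a training set: min over examples of y_i(<w, x_i> + b) / ||w||."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Hypothetical separating hyperplane and labelled points.
w, b = np.array([1.0, 1.0]), -3.0
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 1.0]])
y = np.array([1, 1, -1])
print(geometric_margin(w, b, X, y))  # 1/sqrt(2): the closest point has functional margin 1
```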

Page 17: Support Vector Machines


How to Find the Optimal Solution?

The drawback of the perceptron algorithm: it may give a different solution depending on the order in which the examples are processed.

The superiority of SVMs: this kind of learning machine tunes the solution based on optimization theory.

Page 18: Support Vector Machines


The Maximal Margin Classifier

The simplest model of SVM: it finds the maximal margin hyperplane in a chosen kernel-induced feature space.

A convex optimization problem: minimizing a quadratic function under linear inequality constraints.

Page 19: Support Vector Machines


Support Vector Classifiers

Support vector machines (Cortes and Vapnik, 1995) are well suited for high-dimensional data and binary classification.

Training set D = {(x_i, y_i), i = 1, …, n}, with x_i ∈ R^m and y_i ∈ {−1, 1}.

Linear discriminant classifier with separating hyperplane

{ x : g(x) = wᵀx + w₀ = 0 }

and model parameters w ∈ R^m and w₀ ∈ R.

Page 20: Support Vector Machines


Formalizing the Geometric Margin

Assume that x⁺, x⁻ ∈ S satisfy

⟨w · x⁺⟩ + b = +1,   ⟨w · x⁻⟩ + b = −1

The geometric margin is then

γ = ½ ( ⟨w/‖w‖ · x⁺⟩ − ⟨w/‖w‖ · x⁻⟩ ) = 1/‖w‖

In order to find the maximum γ, we must find the minimum ‖w‖.

Page 21: Support Vector Machines


Minimizing the Norm

Because ‖w‖² = ⟨w · w⟩, we can re-formalize the optimization problem as:

minimize ½⟨w · w⟩ subject to y_i(⟨w · x_i⟩ + b) ≥ 1, i = 1, …, l

Page 22: Support Vector Machines


Minimizing the Norm

Use the Lagrangian function

L(w, b, α) = ½⟨w · w⟩ − Σ_{i=1}^{l} α_i [ y_i(⟨w · x_i⟩ + b) − 1 ]

Setting its derivatives with respect to w and b to zero gives

w = Σ_{i=1}^{l} α_i y_i x_i,   Σ_{i=1}^{l} α_i y_i = 0

Re-substituting into the primal, we obtain the dual objective

W(α) = Σ_{i=1}^{l} α_i − ½ Σ_{i,j=1}^{l} α_i α_j y_i y_j ⟨x_i · x_j⟩

Page 23: Support Vector Machines


Minimizing the Norm

Finding the minimum of ⟨w · w⟩ is equivalent to finding the maximum of W(α).

Strategies for minimizing the differentiable function: decomposition; Sequential Minimal Optimization (SMO).

Page 24: Support Vector Machines


The Support Vector

The optimality (Karush-Kuhn-Tucker) condition of the optimization problem states that

α_i* [ y_i(⟨w* · x_i⟩ + b*) − 1 ] = 0

This implies that α_i* > 0 only for inputs x_i whose functional margin is one, i.e. those that lie closest to the hyperplane. These are the support vectors.

Page 25: Support Vector Machines


The Optimal Hypothesis (w, b)

The two parameters can be obtained from

w* = Σ_{i=1}^{l} α_i* y_i x_i

b* = − ( max_{y_i = −1} ⟨w* · x_i⟩ + min_{y_i = +1} ⟨w* · x_i⟩ ) / 2

The hypothesis is

f(x, α*, b*) = sgn( Σ_{i ∈ SV} α_i* y_i ⟨x_i · x⟩ + b* )
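The relation w* = Σ α_i* y_i x_i can be checked numerically with an off-the-shelf solver; the sketch below uses scikit-learn's SVC (a stand-in, not the chapter's own algorithm) on made-up data, recovering w* from the dual coefficients of the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates a hard margin

# dual_coef_ holds alpha_i* y_i for the support vectors, so this is w* = sum alpha_i* y_i x_i.
w_star = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_star, clf.coef_))  # the solver's primal weights agree
```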

Page 26: Support Vector Machines


Soft Margin Optimization

The main problem with the maximal margin classifier is that it always produces a perfectly consistent hypothesis, i.e. a hypothesis with no training error.

The solution is to relax the boundary, allowing some examples to violate the margin.

Page 27: Support Vector Machines


Non-linear Classification

The problem: the maximal margin classifier is an important concept, but it cannot be used in many real-world problems, since in general there is no linear separation in the input space.

The solution: map the data into another space in which it can be separated linearly.

Page 28: Support Vector Machines


A Learning Machine

A learning machine f takes an input x and transforms it, somehow using weights α, into a predicted output y_est = ±1, where α is some vector of adjustable parameters.

Page 29: Support Vector Machines


Some definitions

Given some machine f, and under the assumptions that all training points (x_k, y_k) were drawn i.i.d. from some distribution and that future test points will be drawn from the same distribution, define (official terminology):

R(α) = TESTERR(α) = E[ ½ |y − f(x, α)| ] = probability of misclassification

Page 30: Support Vector Machines


Some definitions

Under the same assumptions, also define:

R_emp(α) = TRAINERR(α) = (1/R) Σ_{k=1}^{R} ½ |y_k − f(x_k, α)| = fraction of the training set misclassified

R = # training set data points

Page 31: Support Vector Machines


Vapnik-Chervonenkis Dimension

Given some machine f, let h be its VC dimension. h is a measure of f's power (h does not depend on the choice of training set). Vapnik showed that, with probability 1 − η,

TESTERR(α) ≤ TRAINERR(α) + sqrt( ( h (log(2R/h) + 1) − log(η/4) ) / R )

This gives us a way to estimate the error on future data based only on the training error and the VC dimension of f.
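Once h, R, and η are fixed, the bound is simple arithmetic; a minimal sketch (all the numbers below are hypothetical):

```python
import math

def vc_bound(train_err, h, R, eta=0.05):
    """Vapnik's bound: TRAINERR + sqrt((h(log(2R/h) + 1) - log(eta/4)) / R),
    holding with probability 1 - eta."""
    return train_err + math.sqrt((h * (math.log(2 * R / h) + 1) - math.log(eta / 4)) / R)

# Hypothetical numbers: 2% training error, VC dimension 10, 10,000 examples.
print(vc_bound(0.02, 10, 10000))
```

Note the behaviour the formula promises: the bound shrinks as the sample size R grows and inflates as the capacity h grows.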

Page 32: Support Vector Machines


Structural Risk Minimization

Let φ(f) = the set of functions representable by f. Suppose

φ(f₁) ⊆ φ(f₂) ⊆ … ⊆ φ(f_n)

Then

h(f₁) ≤ h(f₂) ≤ … ≤ h(f_n)

We're trying to decide which machine to use. We train each machine and make a table:

i | f_i | TRAINERR | VC-Conf | Probable upper bound on TESTERR | Choice
1 | f₁  |          |         |                                 |
2 | f₂  |          |         |                                 |
3 | f₃  |          |         |                                 |
4 | f₄  |          |         |                                 |
5 | f₅  |          |         |                                 |
6 | f₆  |          |         |                                 |

TESTERR(α) ≤ TRAINERR(α) + sqrt( ( h (log(2R/h) + 1) − log(η/4) ) / R )

Page 33: Support Vector Machines


Kernel-Induced Feature Space

Mapping the data from space X into space F:

x = (x₁, …, x_n) ↦ φ(x) = (φ₁(x), …, φ_N(x))

Page 34: Support Vector Machines


Implicit Mapping into Feature Space

For a non-linearly separable data set, we can modify the hypothesis to map the data implicitly into another feature space:

f(x) = Σ_{i=1}^{l} w_i φ_i(x) + b

f(x) = sgn( Σ_{i ∈ SV} α_i* y_i ⟨φ(x_i) · φ(x)⟩ + b* )

Page 35: Support Vector Machines


Kernel Function

A kernel is a function K such that for all x, z ∈ X,

K(x, z) = ⟨φ(x) · φ(z)⟩

The benefit: kernels solve the computational problem of working with many dimensions.

Page 36: Support Vector Machines


Kernel Functions

polynomial:             k(x_i, x_j) = ( ⟨x_i · x_j⟩ + 1 )^d
radial basis functions: k(x_i, x_j) = exp( −‖x_i − x_j‖² / (2σ²) )
sigmoid:                k(x_i, x_j) = tanh( κ⟨x_i · x_j⟩ − δ )
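All three kernels are one-line computations; a minimal sketch with NumPy (the parameter values d, σ, κ, δ below are arbitrary choices for illustration):

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """Polynomial kernel (<x, z> + 1)^d."""
    return (np.dot(x, z) + 1) ** d

def rbf_kernel(x, z, sigma=1.0):
    """Radial basis function kernel exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(x, z, kappa=1.0, delta=0.0):
    """Sigmoid kernel tanh(kappa <x, z> - delta)."""
    return np.tanh(kappa * np.dot(x, z) - delta)

x, z = np.array([1.0, 2.0]), np.array([2.0, 1.0])
print(poly_kernel(x, z))   # (2 + 2 + 1)^2 = 25.0
print(rbf_kernel(x, z))    # exp(-2/2) = exp(-1)
```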

Page 37: Support Vector Machines


The Polynomial Kernel

K(x, y) = ⟨x · y⟩^d

This kind of kernel represents the inner product of two vectors (points) in a feature space of dimension C(n + d − 1, d).
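For n = 2, d = 2 the claim can be checked numerically. The explicit feature map φ(x) = (x₁², x₂², √2·x₁x₂) is a standard choice (not given on the slide); it reproduces ⟨x, z⟩² as an ordinary inner product in the C(n + d − 1, d) = 3 dimensional feature space:

```python
import numpy as np
from math import comb

def phi(x):
    """Explicit feature map for the homogeneous quadratic kernel <x, z>^2 with n = 2."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, z = np.array([1.0, 3.0]), np.array([2.0, 1.0])
print(np.dot(x, z) ** 2)        # kernel value, computed without any mapping: 25.0
print(np.dot(phi(x), phi(z)))   # the same value, via the explicit 3-D feature space
print(comb(2 + 2 - 1, 2))       # feature-space dimension C(n + d - 1, d) = 3
```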


Page 40: Support Vector Machines


Text Categorization

Inductive learning. Input: x = (x₁, x₂, …, x_n). Output: f(x) = confidence(class).

In the case of text classification, the attributes are words in the document, and the classes are the categories.

Page 41: Support Vector Machines


PROPERTIES OF TEXT-CLASSIFICATION TASKS

High-Dimensional Feature Space. Sparse Document Vectors. High Level of Redundancy.

Page 42: Support Vector Machines


Text representation and feature selection

Binary features; term frequency; inverse document frequency:

IDF(w) = log( n / DF(w) )

where n is the total number of documents and DF(w) is the number of documents the word w occurs in.
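The document-frequency computation can be sketched in a few lines (the three documents below are made up):

```python
import math

docs = [
    "interest rate rises",
    "prime rate unchanged",
    "world group meeting",
]

def idf(word, docs):
    """IDF(w) = log(n / DF(w)): n documents, DF(w) of them contain w."""
    df = sum(1 for d in docs if word in d.split())
    return math.log(len(docs) / df)

print(idf("rate", docs))   # appears in 2 of 3 documents: log(3/2)
print(idf("world", docs))  # appears in 1 of 3 documents: log(3), a higher weight
```

Rare words get larger IDF weights, which is exactly why common stopwords carry almost no signal.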


Page 44: Support Vector Machines


Learning SVMs

To learn the vector of feature weights w: linear SVMs; polynomial classifiers; radial basis functions.

Page 45: Support Vector Machines


Processing

Text files are processed to produce a vector of words

Select the 300 words with the highest mutual information with each category (after removing stopwords).

A separate classifier is learned for each category.

Page 46: Support Vector Machines


An example - Reuters (trends & controversies)

Category: interest

Weight vector w:

large positive weights: prime (.70), rate (.67), interest (.63), rates (.60), and discount (.46)

large negative weights: group (−.24), year (−.25), sees (−.33), world (−.35), and dlrs (−.71)


Page 48: Support Vector Machines


Text Categorization Results

Dumais et al. (1998)

Page 49: Support Vector Machines


Apply to the Linear Classifier

Substitute the kernel into the hypothesis:

f(x) = sgn( Σ_{i ∈ SV} α_i* y_i K(x_i, x) + b* )

The same substitution is made in the margin optimization.

Page 50: Support Vector Machines


SVMs and PAC Learning

Theorems connect PAC theory to the size of the margin

Basically, the larger the margin, the better the expected accuracy

See, for example, Chapter 4 of An Introduction to Support Vector Machines by Cristianini and Shawe-Taylor, Cambridge University Press, 2000

Page 51: Support Vector Machines


PAC and the Number of Support Vectors

The fewer the support vectors, the better the generalization will be.

Recall that non-support vectors are correctly classified and don't change the learned model if left out of the training set. So:

leave-one-out error rate ≤ # support vectors / # training examples
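As a sanity check, the bound is a single division; the counts below are hypothetical:

```python
def loo_bound(n_support_vectors, n_training_examples):
    """Upper bound on the expected leave-one-out error rate of a trained SVM."""
    return n_support_vectors / n_training_examples

# Hypothetical run: 34 support vectors out of 1,000 training examples.
print(loo_bound(34, 1000))  # 0.034
```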

Page 52: Support Vector Machines


VC-dimension of an SVM

Very loosely speaking, there is some theory which, under some different assumptions, puts an upper bound on the VC dimension:

(Diameter / Margin)²

where Diameter is the diameter of the smallest sphere that can enclose all the high-dimensional term-vectors derived from the training set, and Margin is the smallest margin we'll let the SVM use.

This can be used in SRM (Structural Risk Minimization) for choosing the polynomial degree, the RBF σ, etc. But most people just use cross-validation.


Page 53: Support Vector Machines


Finding Non-Linear Separating Surfaces

Map inputs into a new space. Example: the features (x₁, x₂) = (5, 4) are mapped to (x₁, x₂, x₁², x₂², x₁·x₂) = (5, 4, 25, 16, 20).

Solve the SVM program in this new space. This is computationally complex if there are many features, but a clever trick exists.
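The slide's mapping is easy to verify directly; a minimal sketch:

```python
def lift(x1, x2):
    """Map a 2-D input into the 5-D space (x1, x2, x1^2, x2^2, x1*x2) from the slide."""
    return (x1, x2, x1 ** 2, x2 ** 2, x1 * x2)

print(lift(5, 4))  # (5, 4, 25, 16, 20)
```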

Page 54: Support Vector Machines


Summary

Maximize the margin between positive and negative examples (connects to PAC theory).

Non-linear classification: the support vectors contribute to the solution, and kernels map examples into a new space through a usually non-linear mapping.

Page 55: Support Vector Machines


References

Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

Andrew W. Moore. cmsc726: SVMs. http://www.cs.cmu.edu/~awm/tutorials

C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html

Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.

Thorsten Joachims. A Statistical Learning Model of Text Classification for Support Vector Machines.

Page 56: Support Vector Machines


www.intsci.ac.cn/shizz/

Questions?!