Classifying and clustering using Support Vector Machine

2nd PhD report
PhD title: Data mining in unstructured data
Daniel I. MORARIU, MSc
PhD Supervisor: Lucian N. VINŢAN
Sibiu, 2005
Contents
- Classification (clustering) steps
- Reuters Database processing
- Feature extraction and selection
  - Information Gain
  - Support Vector Machine
- Support Vector Machine
  - Binary classification
  - Multiclass classification
  - Clustering
  - Sequential Minimal Optimization (SMO)
  - Probabilistic outputs
- Experiments & results
  - Binary classification: aspects and results
  - Feature subset selection: a comparative approach
  - Multiclass classification: quantitative aspects
  - Clustering: quantitative aspects
- Conclusions and further work
Classifying (clustering) steps
1. Text mining – feature extraction
2. Feature selection
3. Classifying or clustering
4. Testing results
Reuters Database Processing
- 806,791 total documents; 126 topics, 366 regions, 870 industry codes
- Industry category selection: "system software"
- 7,083 documents: 4,722 training samples and 2,361 testing samples
- 19,038 attributes (features), 68 classes (topics)
- Binary classification: topic "c152" (only 2,096 of the 7,083 documents)
Feature extraction
- Frequency vector
  - term frequency
  - stopword removal
  - stemming
  - threshold
- Result: a large frequency vector

Feature selection
- Information Gain
- SVM feature selection: linear kernel – weight vector
Information Gain:

$Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$

$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)$
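As a concrete reading of these formulas, here is a minimal Python sketch computing the Information Gain of one discrete feature over a labeled document set; the array names are illustrative stand-ins, not taken from the report's implementation.

```python
import numpy as np

def entropy(labels):
    # Entropy(S) = -sum_i p_i * log2(p_i), over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    # Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)
    gain = entropy(labels)
    for v in np.unique(feature):            # v ranges over Values(A)
        S_v = labels[feature == v]
        gain -= len(S_v) / len(labels) * entropy(S_v)
    return gain
```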
Support Vector Machine – Binary classification
- Optimal hyperplane
- Higher-dimensional feature space
- Primal optimization problem
- Dual optimization problem – Lagrange multipliers
- Karush-Kuhn-Tucker conditions
- Support vectors
- Kernel trick
- Decision function
Optimal Hyperplane

$f(x) = \mathrm{sgn}(\langle w, x \rangle + b)$

[Figure: two classes ($y_i = +1$, $y_i = -1$) in the $(X_1, X_2)$ plane, separated by the hyperplane $\{x \mid \langle w, x \rangle + b = 0\}$; the margin is bounded by $\{x \mid \langle w, x \rangle + b = +1\}$ and $\{x \mid \langle w, x \rangle + b = -1\}$]
Higher-dimensional feature space

[Figure: input samples x mapped into a higher-dimensional feature space where a linear separation becomes possible]
Primal optimization problem

minimize over $w \in H$, $b$:  $\tau(w) = \frac{1}{2}\|w\|^2$
subject to  $y_i(\langle w, x_i \rangle + b) \ge 1$,  $i = 1, \ldots, m$

Lagrange formulation:
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left( y_i(\langle w, x_i \rangle + b) - 1 \right)$
with $\alpha_i \ge 0$, $i = 1, \ldots, l$, and $\sum_{i=1}^{l} \alpha_i y_i = 0$
Dual optimization problem (obtained from the Lagrange formulation)

maximize  $W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
subject to  $0 \le \alpha_i \le C$  and  $\sum_{i=1}^{m} \alpha_i y_i = 0$

with the weight vector recovered as  $w = \sum_{i=1}^{m} \alpha_i y_i x_i$
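To make the dual concrete, here is a hedged sketch that solves it with a generic QP solver (cvxopt assumed available): maximizing $W(\alpha)$ equals minimizing $\frac{1}{2}\alpha^T P \alpha - \mathbf{1}^T \alpha$ with $P_{ij} = y_i y_j \langle x_i, x_j \rangle$. This is an illustrative linear-kernel sketch, not the report's SMO-based solver.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y, C=1.0):
    m = X.shape[0]
    K = X @ X.T                                     # linear-kernel Gram matrix <x_i, x_j>
    P = matrix((np.outer(y, y) * K).astype(np.double))
    q = matrix(-np.ones(m))                         # minimize 1/2 a'Pa - 1'a
    G = matrix(np.vstack([-np.eye(m), np.eye(m)]))  # 0 <= alpha_i <= C as G a <= h
    h = matrix(np.hstack([np.zeros(m), C * np.ones(m)]))
    A = matrix(y.reshape(1, -1).astype(np.double))  # equality: sum_i alpha_i y_i = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    return alpha                                    # then w = sum_i alpha_i y_i x_i
```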
SVM – characteristics
- Karush-Kuhn-Tucker (KKT) conditions: only the Lagrange multipliers that are non-zero at the saddle point matter
- Support vectors: the patterns $x_i$ for which $\alpha_i > 0$
- Kernel trick: a positive definite kernel $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$
- Decision function: $f(x) = \mathrm{sgn}\left( \sum_{i=1}^{m} \alpha_i y_i \langle x, x_i \rangle + b \right)$
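A minimal sketch of this decision function in Python, written with a generic kernel k (which reduces to the inner product in the linear case); by the KKT conditions only the support vectors ($\alpha_i > 0$) contribute, so the sum runs over them alone. The names (sv, sv_y, alpha, kernel) are illustrative.

```python
import numpy as np

def svm_decision(x, sv, sv_y, alpha, b, kernel):
    # f(x) = sgn( sum_i alpha_i * y_i * k(x, x_i) + b ), summed over support vectors
    return np.sign(sum(a * y * kernel(x, s) for a, y, s in zip(alpha, sv_y, sv)) + b)
```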
Multi-class classification

Separate one class versus the rest:
$f(x) = \mathrm{argmax}_{j=1,\ldots,M}\, g_j(x)$,  where  $g_j(x) = \sum_{i=1}^{m} y_i \alpha_i^j k(x, x_i) + b^j$
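A sketch of the one-versus-the-rest rule: each of the M binary SVMs yields a real-valued output $g_j(x)$, and the predicted class is the argmax over them. The classifier list is a hypothetical stand-in.

```python
import numpy as np

def predict_one_vs_rest(x, g_list):
    # g_list[j] computes g_j(x) = sum_i y_i * alpha_i^j * k(x, x_i) + b^j
    scores = [g(x) for g in g_list]
    return int(np.argmax(scores))        # class index j maximizing g_j(x)
```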
Clustering – characteristics
- data mapped into a higher-dimensional space
- search for the minimal enclosing sphere

Primal optimization problem:  $\|\Phi(x_j) - a\|^2 \le R^2 + \xi_j$,  $j = 1, \ldots, m$

Lagrange formulation (leading to the dual optimization problem):
$L = R^2 - \sum_j \left( R^2 + \xi_j - \|\Phi(x_j) - a\|^2 \right) \beta_j - \sum_j \xi_j \mu_j + C \sum_j \xi_j$

Karush-Kuhn-Tucker conditions:  $\sum_j \beta_j = 1$,  $a = \sum_j \beta_j \Phi(x_j)$,  $\beta_j = C - \mu_j$
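As an executable stand-in for the sphere formulation, here is a sketch using scikit-learn's OneClassSVM; its ν-formulation with an RBF kernel is closely related to the minimal-enclosing-sphere (SVDD) problem, though it is not the report's own implementation. Data and parameter values are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

X = np.random.rand(100, 5)                      # toy vectors standing in for documents
model = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X)
inside = model.predict(X) == 1                  # +1 = inside the sphere, -1 = outside
print(len(model.support_), "support vectors;", inside.sum(), "points inside")
```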
SMO characteristics
- Only two parameters are updated at each step (minimal size of update), preserving the constraint $\sum_{i=1}^{m} \alpha_i y_i = 0$.
- Benefits:
  - needs no extra matrix storage
  - needs no numerical QP optimization step
  - needs more iterations to converge, but only a few operations per step, which leads to an overall speed-up
- Components (detailed below):
  - an analytic method to solve the problem for two Lagrange multipliers
  - heuristics for choosing the points
SMO – components

Analytic method:
$y_1 \alpha_1 + y_2 \alpha_2 = y_1 \alpha_1^{old} + y_2 \alpha_2^{old}$,  $0 \le \alpha_1, \alpha_2 \le C$
$\alpha_2^{new} = \alpha_2^{old} + \frac{y_2 (E_1 - E_2)}{\eta}$,  where  $\eta = k(x_1, x_1) + k(x_2, x_2) - 2 k(x_1, x_2)$  and  $E_i = f(x_i) - y_i$

Heuristics for choosing the points:
- Choice of the 1st point ($x_1 / \alpha_1$): find KKT violations
- Choice of the 2nd point ($x_2 / \alpha_2$): update the pair $\alpha_1, \alpha_2$ that causes a large change, which in turn results in a large increase of the dual objective; maximize the quantity $|E_1 - E_2|$
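A hedged sketch of the analytic two-multiplier update in the slide's notation: η is the curvature along the constraint line, the new α₂ is clipped to the feasible segment [L, H] implied by the box constraints, and α₁ moves so that y₁α₁ + y₂α₂ stays constant.

```python
def smo_step(a1, a2, y1, y2, E1, E2, K11, K22, K12, C):
    # eta = k(x1,x1) + k(x2,x2) - 2*k(x1,x2): curvature along the constraint line
    eta = K11 + K22 - 2.0 * K12
    if eta <= 0:
        return a1, a2                          # degenerate pair: skip it
    a2_new = a2 + y2 * (E1 - E2) / eta         # unconstrained optimum for alpha_2
    if y1 == y2:                               # feasible segment from 0 <= alpha <= C
        L, H = max(0.0, a1 + a2 - C), min(C, a1 + a2)
    else:
        L, H = max(0.0, a2 - a1), min(C, C + a2 - a1)
    a2_new = min(max(a2_new, L), H)            # clip alpha_2 to [L, H]
    a1_new = a1 + y1 * y2 * (a2 - a2_new)      # keep y1*a1 + y2*a2 constant
    return a1_new, a2_new
```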
Probabilistic outputs

$P(\text{class} \mid \text{input}) = P(y = 1 \mid x) = p(x) = \frac{1}{1 + \exp(-f(x))}$

$P(y = 1 \mid f) = \frac{1}{1 + \exp(A f + B)}$
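Read as code, the sigmoid mapping is a one-liner; the parameters A and B would be fitted on held-out data, and the default values below are placeholders, not fitted values.

```python
import math

def prob_positive(f, A=-1.0, B=0.0):
    # P(y = 1 | f) = 1 / (1 + exp(A*f + B)); A = -1, B = 0 recovers 1 / (1 + exp(-f))
    return 1.0 / (1.0 + math.exp(A * f + B))
```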
Feature selection using SVM
- Linear kernel:  $k(x, x') = 2\langle x, x' \rangle$
- Primal optimization form; decision function  $f(x) = \mathrm{sgn}(\langle w, x \rangle + b)$
- Keep only the features whose weight in the learned vector $w$ is greater than a threshold:
  keep all $w_i$ with $|w_i| >$ threshold,  $i = 1, \ldots, m$
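A minimal sketch of this selection scheme, using scikit-learn's LinearSVC as a stand-in for the report's linear-kernel SVM; the threshold value is illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def select_features(X, y, threshold=0.1):
    w = LinearSVC(C=1.0).fit(X, y).coef_.ravel()   # learned weight vector w
    return np.where(np.abs(w) > threshold)[0]      # keep features with |w_i| > threshold
```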
Kernels used

Polynomial kernel:  $k(x, x') = (2\langle x, x' \rangle + d)^d$
Gaussian kernel:  $k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{n \cdot C} \right)$
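The two kernels transcribed directly into Python, with d the kernel degree and n the number of features; this is a sketch of the formulas above, not the report's code.

```python
import numpy as np

def polynomial_kernel(x, xp, d):
    # k(x, x') = (2<x, x'> + d)^d
    return (2.0 * np.dot(x, xp) + d) ** d

def gaussian_kernel(x, xp, C):
    # k(x, x') = exp(-||x - x'||^2 / (n * C)), n = number of features
    n = x.shape[0]
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (n * C))
```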
Data representation
- Binary: using values "0" and "1"
- Nominal:  $TF(d, t) = \frac{n(d, t)}{\max_{\tau} n(d, \tau)}$
- Connell SMART:  $TF(d, t) = 0$ if $n(d, t) = 0$, and $1 + \log(1 + \log(n(d, t)))$ otherwise
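The three representations as Python functions over a raw term-count vector n_d for one document, a direct transcription of the formulas above:

```python
import numpy as np

def binary_rep(n_d):
    return (n_d > 0).astype(float)                 # 1 if the term occurs, else 0

def nominal_rep(n_d):
    return n_d / max(n_d.max(), 1)                 # n(d,t) / max_tau n(d,tau)

def connell_smart_rep(n_d):
    out = np.zeros(len(n_d))
    nz = n_d > 0
    out[nz] = 1.0 + np.log(1.0 + np.log(n_d[nz]))  # 0 when n(d,t) = 0
    return out
```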
Binary classification – 63 features

[Figure: accuracy (%) vs. polynomial kernel degree for the Binary, Nominal and Connell SMART representations; values as in the table below]

d – kernel degree    1      2      3      4      5      6      7      10
Binary               40.13  64.78  66.54  27.23  46.54  71.62  56.95  55.19
Nominal              38.96  62.65  67.93  82.03  16.62  11.95  83.99  64.08
Connell SMART        40.24  63.32  62.41  14.41   7.78  49.72  68.27  49.65
Binary classification – 7999 features

[Figure: accuracy (%) vs. polynomial kernel degree for the Binary, Nominal and Connell SMART representations; values as in the table below]

d – kernel degree    1      2      3      4      5      6      7      10
Binary               35.77  41.74  61.88  77.64  69.21  81.87  10.95  35.77
Nominal              56.69  26.83  28.06  28.27  29.14  41.38  36.19  34.05
Connell SMART        50.44  35.28  41.17  59.28  79.82  81.81  82.32  17.85
Influence of vector size – polynomial kernel

[Figure: accuracy (%) vs. kernel degree (1–10) for vector sizes of 63, 1309, 2488 and 7999 features]
Influence of vector size – Gaussian kernel

[Figure: accuracy (%) vs. parameter C (0.01–2.1) for vector sizes of 41, 63, 1309, 2488 and 7999 features]
IG versus SVM feature selection – 427 features, polynomial kernel

[Figure: accuracy (%) vs. kernel degree (1–10) for IG- and SVM-selected features, each with the Binary, Nominal and Connell SMART representations]
IG versus SVM feature selection – 427 features, Gaussian kernel

[Figure: accuracy (%) vs. parameter C (0.01–2.7) for IG- and SVM-selected features, each with the Binary, Nominal and Connell SMART representations]
LibSvm versus UseSvm – 2493 features, polynomial kernel

UseSvm kernel:  $k(x, x') = (2\langle x, x' \rangle + d)^d$
LibSVM kernel:  $k(x, x') = (gamma \cdot \langle x, x' \rangle + coef0)^d$

[Figure: accuracy (%) vs. kernel degree (1–10) for LibSVM, LibSVM+coef0 and UseSVM]
LibSvm versus UseSvm – 2493 features, Gaussian kernel

UseSvm kernel:  $k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{n \cdot C} \right)$
LibSVM kernel:  $k(x, x') = \exp(-gamma \cdot \|x - x'\|^2)$

[Figure: accuracy (%) vs. parameter C (0.01–2.7) for LibSVM, LibSVM+gamma and UseSVM]
Multiclass classification – polynomial kernel, 2488 features

[Figure: accuracy (%) vs. kernel degree (2–5) for the Binary, Nominal and Connell SMART representations]
Multiclass classification – Gaussian kernel, 2488 features

[Figure: accuracy (%) vs. parameter C (0.05–2.7) for the Binary, Nominal and Connell SMART representations]
Clustering using SVM

[Figure: accuracy vs. parameter υ (0.01, 0.1, 0.5) for vector sizes of 41, 63, 1309 and 2111 features]

υ \ #features    41      63      1309    2111
0.01             0.6%    0.6%    0.7%    0.6%
0.1              0.5%    0.5%    0.5%    0.5%
0.5              25.2%   25.1%   25.1%   25.1%
Conclusions – best results
- Polynomial kernel and nominal representation (degree 5 and 6)
- Gaussian kernel and Connell SMART representation (C = 2.7)
- Reduced number of support vectors for the polynomial kernel in comparison with the Gaussian kernel (24.41% versus 37.78%)
- Number of features between 6% (1309) and 10% (2488) of the full set
- Multiclass results follow the binary classification results
- Clustering has a smaller number of support vectors
- Clustering follows binary classification
Further work
- Feature extraction and selection
  - Association rules between words (Mutual Information)
  - Synonymy and polysemy problem
  - Using families of words (WordNet)
- Better implementation of SVM with linear kernel
- SVM with kernel degree greater than 1
- Classification and clustering
  - Using classification and clustering together