Classifying and clustering using Support Vector Machine

2nd PhD report
PhD title: Data mining in unstructured data
Daniel I. MORARIU, MSc
PhD Supervisor: Lucian N. VINŢAN
Sibiu, 2005
Contents
- Classification (clustering) steps
- Reuters Database processing
- Feature extraction and selection
  - Information Gain
  - Support Vector Machine
- Support Vector Machine
  - Binary classification
  - Multiclass classification
  - Clustering
  - Sequential Minimal Optimization (SMO)
  - Probabilistic outputs
- Experiments & results
  - Binary classification: aspects and results
  - Feature subset selection: a comparative approach
  - Multiclass classification: quantitative aspects
  - Clustering: quantitative aspects
- Conclusions and further work
Classifying (clustering) steps
1. Text mining – feature extraction
2. Feature selection
3. Classifying or clustering
4. Testing results
Reuters Database Processing
- 806,791 total documents; 126 topics, 366 regions, 870 industry codes
- Industry category selection: "system software"
- 7,083 documents: 4,722 training samples and 2,361 testing samples
- 19,038 attributes (features), 68 classes (topics)
- Binary classification: topic "c152" (only 2,096 of the 7,083 documents)
Feature extraction
- Frequency vector
  - term frequency
  - stopword removal
  - stemming
  - threshold
- Result: a large frequency vector

Feature selection
- Information Gain
- SVM feature selection: linear kernel – weight vector
Information Gain:

$Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$

$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)$
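As a concrete reading of these formulas, here is a minimal Python sketch computing the Information Gain of one discrete feature over a labeled document set; the array names are illustrative stand-ins, not taken from the report's implementation.

```python
import numpy as np

def entropy(labels):
    # Entropy(S) = -sum_i p_i * log2(p_i), over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    # Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)
    gain = entropy(labels)
    for v in np.unique(feature):            # v ranges over Values(A)
        S_v = labels[feature == v]
        gain -= len(S_v) / len(labels) * entropy(S_v)
    return gain
```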
Support Vector Machine – Binary classification
- Optimal hyperplane
- Higher-dimensional feature space
- Primal optimization problem
- Dual optimization problem – Lagrange multipliers
- Karush-Kuhn-Tucker conditions
- Support vectors
- Kernel trick
- Decision function
Optimal Hyperplane

$f(x) = \mathrm{sgn}(\langle w, x \rangle + b)$

[Figure: two classes ($y_i = +1$, $y_i = -1$) in the $(X_1, X_2)$ plane, separated by the hyperplane $\{x \mid \langle w, x \rangle + b = 0\}$; the margin is bounded by $\{x \mid \langle w, x \rangle + b = +1\}$ and $\{x \mid \langle w, x \rangle + b = -1\}$]
Higher-dimensional feature space

[Figure: input samples x mapped into a higher-dimensional feature space where a linear separation becomes possible]
Primal optimization problem

minimize over $w \in H$, $b$:  $\tau(w) = \frac{1}{2}\|w\|^2$
subject to  $y_i(\langle w, x_i \rangle + b) \ge 1$,  $i = 1, \ldots, m$

Lagrange formulation:
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left( y_i(\langle w, x_i \rangle + b) - 1 \right)$
with $\alpha_i \ge 0$, $i = 1, \ldots, l$, and $\sum_{i=1}^{l} \alpha_i y_i = 0$
Dual optimization problem (obtained from the Lagrange formulation)

maximize  $W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
subject to  $0 \le \alpha_i \le C$  and  $\sum_{i=1}^{m} \alpha_i y_i = 0$

with the weight vector recovered as  $w = \sum_{i=1}^{m} \alpha_i y_i x_i$
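To make the dual concrete, here is a hedged sketch that solves it with a generic QP solver (cvxopt assumed available): maximizing $W(\alpha)$ equals minimizing $\frac{1}{2}\alpha^T P \alpha - \mathbf{1}^T \alpha$ with $P_{ij} = y_i y_j \langle x_i, x_j \rangle$. This is an illustrative linear-kernel sketch, not the report's SMO-based solver.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y, C=1.0):
    m = X.shape[0]
    K = X @ X.T                                     # linear-kernel Gram matrix <x_i, x_j>
    P = matrix((np.outer(y, y) * K).astype(np.double))
    q = matrix(-np.ones(m))                         # minimize 1/2 a'Pa - 1'a
    G = matrix(np.vstack([-np.eye(m), np.eye(m)]))  # 0 <= alpha_i <= C as G a <= h
    h = matrix(np.hstack([np.zeros(m), C * np.ones(m)]))
    A = matrix(y.reshape(1, -1).astype(np.double))  # equality: sum_i alpha_i y_i = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    return alpha                                    # then w = sum_i alpha_i y_i x_i
```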
SVM – characteristics
- Karush-Kuhn-Tucker (KKT) conditions: only the Lagrange multipliers that are non-zero at the saddle point matter
- Support vectors: the patterns $x_i$ for which $\alpha_i > 0$
- Kernel trick: a positive definite kernel $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$
- Decision function: $f(x) = \mathrm{sgn}\left( \sum_{i=1}^{m} \alpha_i y_i \langle x, x_i \rangle + b \right)$
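A minimal sketch of this decision function in Python, written with a generic kernel k (which reduces to the inner product in the linear case); by the KKT conditions only the support vectors ($\alpha_i > 0$) contribute, so the sum runs over them alone. The names (sv, sv_y, alpha, kernel) are illustrative.

```python
import numpy as np

def svm_decision(x, sv, sv_y, alpha, b, kernel):
    # f(x) = sgn( sum_i alpha_i * y_i * k(x, x_i) + b ), summed over support vectors
    return np.sign(sum(a * y * kernel(x, s) for a, y, s in zip(alpha, sv_y, sv)) + b)
```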
Multi-class classification

Separate one class versus the rest:
$f(x) = \mathrm{argmax}_{j=1,\ldots,M}\, g_j(x)$,  where  $g_j(x) = \sum_{i=1}^{m} y_i \alpha_i^j k(x, x_i) + b^j$
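A sketch of the one-versus-the-rest rule: each of the M binary SVMs yields a real-valued output $g_j(x)$, and the predicted class is the argmax over them. The classifier list is a hypothetical stand-in.

```python
import numpy as np

def predict_one_vs_rest(x, g_list):
    # g_list[j] computes g_j(x) = sum_i y_i * alpha_i^j * k(x, x_i) + b^j
    scores = [g(x) for g in g_list]
    return int(np.argmax(scores))        # class index j maximizing g_j(x)
```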
Clustering – characteristics
- data mapped into a higher-dimensional space
- search for the minimal enclosing sphere

Primal optimization problem:  $\|\Phi(x_j) - a\|^2 \le R^2 + \xi_j$,  $j = 1, \ldots, m$

Lagrange formulation (leading to the dual optimization problem):
$L = R^2 - \sum_j \left( R^2 + \xi_j - \|\Phi(x_j) - a\|^2 \right) \beta_j - \sum_j \xi_j \mu_j + C \sum_j \xi_j$

Karush-Kuhn-Tucker conditions:  $\sum_j \beta_j = 1$,  $a = \sum_j \beta_j \Phi(x_j)$,  $\beta_j = C - \mu_j$
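As an executable stand-in for the sphere formulation, here is a sketch using scikit-learn's OneClassSVM; its ν-formulation with an RBF kernel is closely related to the minimal-enclosing-sphere (SVDD) problem, though it is not the report's own implementation. Data and parameter values are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

X = np.random.rand(100, 5)                      # toy vectors standing in for documents
model = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X)
inside = model.predict(X) == 1                  # +1 = inside the sphere, -1 = outside
print(len(model.support_), "support vectors;", inside.sum(), "points inside")
```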
SMO characteristics
- Only two parameters are updated at each step (minimal size of update), preserving the constraint $\sum_{i=1}^{m} \alpha_i y_i = 0$.
- Benefits:
  - needs no extra matrix storage
  - needs no numerical QP optimization step
  - needs more iterations to converge, but only a few operations per step, which leads to an overall speed-up
- Components (detailed below):
  - an analytic method to solve the problem for two Lagrange multipliers
  - heuristics for choosing the points
SMO – components

Analytic method:
$y_1 \alpha_1 + y_2 \alpha_2 = y_1 \alpha_1^{old} + y_2 \alpha_2^{old}$,  $0 \le \alpha_1, \alpha_2 \le C$
$\alpha_2^{new} = \alpha_2^{old} + \frac{y_2 (E_1 - E_2)}{\eta}$,  where  $\eta = k(x_1, x_1) + k(x_2, x_2) - 2 k(x_1, x_2)$  and  $E_i = f(x_i) - y_i$

Heuristics for choosing the points:
- Choice of the 1st point ($x_1 / \alpha_1$): find KKT violations
- Choice of the 2nd point ($x_2 / \alpha_2$): update the pair $\alpha_1, \alpha_2$ that causes a large change, which in turn results in a large increase of the dual objective; maximize the quantity $|E_1 - E_2|$
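A hedged sketch of the analytic two-multiplier update in the slide's notation: η is the curvature along the constraint line, the new α₂ is clipped to the feasible segment [L, H] implied by the box constraints, and α₁ moves so that y₁α₁ + y₂α₂ stays constant.

```python
def smo_step(a1, a2, y1, y2, E1, E2, K11, K22, K12, C):
    # eta = k(x1,x1) + k(x2,x2) - 2*k(x1,x2): curvature along the constraint line
    eta = K11 + K22 - 2.0 * K12
    if eta <= 0:
        return a1, a2                          # degenerate pair: skip it
    a2_new = a2 + y2 * (E1 - E2) / eta         # unconstrained optimum for alpha_2
    if y1 == y2:                               # feasible segment from 0 <= alpha <= C
        L, H = max(0.0, a1 + a2 - C), min(C, a1 + a2)
    else:
        L, H = max(0.0, a2 - a1), min(C, C + a2 - a1)
    a2_new = min(max(a2_new, L), H)            # clip alpha_2 to [L, H]
    a1_new = a1 + y1 * y2 * (a2 - a2_new)      # keep y1*a1 + y2*a2 constant
    return a1_new, a2_new
```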
Probabilistic outputs

$P(\text{class} \mid \text{input}) = P(y = 1 \mid x) = p(x) = \frac{1}{1 + \exp(-f(x))}$

$P(y = 1 \mid f) = \frac{1}{1 + \exp(A f + B)}$
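Read as code, the sigmoid mapping is a one-liner; the parameters A and B would be fitted on held-out data, and the default values below are placeholders, not fitted values.

```python
import math

def prob_positive(f, A=-1.0, B=0.0):
    # P(y = 1 | f) = 1 / (1 + exp(A*f + B)); A = -1, B = 0 recovers 1 / (1 + exp(-f))
    return 1.0 / (1.0 + math.exp(A * f + B))
```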
Feature selection using SVM
- Linear kernel:  $k(x, x') = 2\langle x, x' \rangle$
- Primal optimization form; decision function  $f(x) = \mathrm{sgn}(\langle w, x \rangle + b)$
- Keep only the features whose weight in the learned vector $w$ is greater than a threshold:
  keep all $w_i$ with $|w_i| >$ threshold,  $i = 1, \ldots, m$
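A minimal sketch of this selection scheme, using scikit-learn's LinearSVC as a stand-in for the report's linear-kernel SVM; the threshold value is illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def select_features(X, y, threshold=0.1):
    w = LinearSVC(C=1.0).fit(X, y).coef_.ravel()   # learned weight vector w
    return np.where(np.abs(w) > threshold)[0]      # keep features with |w_i| > threshold
```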
Kernels used

Polynomial kernel:  $k(x, x') = (2\langle x, x' \rangle + d)^d$
Gaussian kernel:  $k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{n \cdot C} \right)$
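The two kernels transcribed directly into Python, with d the kernel degree and n the number of features; this is a sketch of the formulas above, not the report's code.

```python
import numpy as np

def polynomial_kernel(x, xp, d):
    # k(x, x') = (2<x, x'> + d)^d
    return (2.0 * np.dot(x, xp) + d) ** d

def gaussian_kernel(x, xp, C):
    # k(x, x') = exp(-||x - x'||^2 / (n * C)), n = number of features
    n = x.shape[0]
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (n * C))
```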
Data representation
- Binary: using values "0" and "1"
- Nominal:  $TF(d, t) = \frac{n(d, t)}{\max_{\tau} n(d, \tau)}$
- Connell SMART:  $TF(d, t) = 0$ if $n(d, t) = 0$, and $1 + \log(1 + \log(n(d, t)))$ otherwise
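The three representations as Python functions over a raw term-count vector n_d for one document, a direct transcription of the formulas above:

```python
import numpy as np

def binary_rep(n_d):
    return (n_d > 0).astype(float)                 # 1 if the term occurs, else 0

def nominal_rep(n_d):
    return n_d / max(n_d.max(), 1)                 # n(d,t) / max_tau n(d,tau)

def connell_smart_rep(n_d):
    out = np.zeros(len(n_d))
    nz = n_d > 0
    out[nz] = 1.0 + np.log(1.0 + np.log(n_d[nz]))  # 0 when n(d,t) = 0
    return out
```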
Binary classification – 63 features

[Figure: accuracy (%) vs. polynomial kernel degree for the Binary, Nominal and Connell SMART representations; values as in the table below]

d – kernel degree    1      2      3      4      5      6      7      10
Binary               40.13  64.78  66.54  27.23  46.54  71.62  56.95  55.19
Nominal              38.96  62.65  67.93  82.03  16.62  11.95  83.99  64.08
Connell SMART        40.24  63.32  62.41  14.41   7.78  49.72  68.27  49.65
Binary classification – 7999 features

[Figure: accuracy (%) vs. polynomial kernel degree for the Binary, Nominal and Connell SMART representations; values as in the table below]

d – kernel degree    1      2      3      4      5      6      7      10
Binary               35.77  41.74  61.88  77.64  69.21  81.87  10.95  35.77
Nominal              56.69  26.83  28.06  28.27  29.14  41.38  36.19  34.05
Connell SMART        50.44  35.28  41.17  59.28  79.82  81.81  82.32  17.85
Influence of vector size – polynomial kernel

[Figure: accuracy (%) vs. kernel degree (1–10) for vector sizes of 63, 1309, 2488 and 7999 features]
Influence of vector size – Gaussian kernel

[Figure: accuracy (%) vs. parameter C (0.01–2.1) for vector sizes of 41, 63, 1309, 2488 and 7999 features]
IG versus SVM feature selection – 427 features, polynomial kernel

[Figure: accuracy (%) vs. kernel degree (1–10) for IG- and SVM-selected features, each with the Binary, Nominal and Connell SMART representations]
IG versus SVM feature selection – 427 features, Gaussian kernel

[Figure: accuracy (%) vs. parameter C (0.01–2.7) for IG- and SVM-selected features, each with the Binary, Nominal and Connell SMART representations]
LibSvm versus UseSvm – 2493 features, polynomial kernel

UseSvm kernel:  $k(x, x') = (2\langle x, x' \rangle + d)^d$
LibSVM kernel:  $k(x, x') = (gamma \cdot \langle x, x' \rangle + coef0)^d$

[Figure: accuracy (%) vs. kernel degree (1–10) for LibSVM, LibSVM+coef0 and UseSVM]
LibSvm versus UseSvm – 2493 features, Gaussian kernel

UseSvm kernel:  $k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{n \cdot C} \right)$
LibSVM kernel:  $k(x, x') = \exp(-gamma \cdot \|x - x'\|^2)$

[Figure: accuracy (%) vs. parameter C (0.01–2.7) for LibSVM, LibSVM+gamma and UseSVM]
Multiclass classification – polynomial kernel, 2488 features

[Figure: accuracy (%) vs. kernel degree (2–5) for the Binary, Nominal and Connell SMART representations]
Multiclass classification – Gaussian kernel, 2488 features

[Figure: accuracy (%) vs. parameter C (0.05–2.7) for the Binary, Nominal and Connell SMART representations]
Clustering using SVM

[Figure: accuracy vs. parameter υ (0.01, 0.1, 0.5) for vector sizes of 41, 63, 1309 and 2111 features]

υ \ #features    41      63      1309    2111
0.01             0.6%    0.6%    0.7%    0.6%
0.1              0.5%    0.5%    0.5%    0.5%
0.5              25.2%   25.1%   25.1%   25.1%
Conclusions – best results
- Polynomial kernel and nominal representation (degree 5 and 6)
- Gaussian kernel and Connell SMART representation (C = 2.7)
- Reduced number of support vectors for the polynomial kernel in comparison with the Gaussian kernel (24.41% versus 37.78%)
- Number of features between 6% (1309) and 10% (2488) of the full set
- Multiclass results follow the binary classification results
- Clustering has a smaller number of support vectors
- Clustering follows binary classification
Further work
- Feature extraction and selection
  - Association rules between words (Mutual Information)
  - Synonymy and polysemy problem
  - Using families of words (WordNet)
- Better implementation of SVM with linear kernel
- SVM with kernel degree greater than 1
- Classification and clustering
  - Using classification and clustering together