.................................................................................................................................................3
1. ....................................................4
1.1 .........................................................................................................................................4 1.2 ..........................................................................6
2. ...............................................................................9
2.1 .........................................................................................................................................9 2.2 ............................................................................................ 10 2.3 ............................................................................................... 12
2.3.1 ................................................. 12 2.3.2 ..................................... 14 2.3.3 ID3 ......................................................................... 16 2.3.4 C4.5 ..................................................... 18
2.3.4.1 ..............................................19 2.3.4.2 .....................................................................................19 2.3.4.3 .............................................................20 2.3.4.4 .....................................................................................20 2.3.4.5 .................................................................................................................21
2.4 .................................................................................... 22 2.4.1 .......................... 24 2.4.2 ........................................................................................... 26 2.4.3 .................................................................. 27 2.4.4 Incremental Reduced Error Pruning (IREP)........................................................................ 28 2.4.5 IREP RIPPER .............................................. 30
3. ........................ 33
3.1 ........................................................................ 33 3.2 ............................................................................... 36
3.2.1 .................................................................................................. 36 3.2.2 ............................................................................................................. 37
3.3 PRIORI .................................................................................................................. 38 3.3.1 priori .......................................................................................... 38 3.3.2 ..................................................................................................... 39
3.3.2.1 Hash ................................................................................................................................40 3.3.2.2 ........................................................................................................................40
4. .......................................................................................................... 42
4.1 ............................................................................................................................. 42 4.2 .................................. 42 4.3 ..................................... 44 4.4 ........................... 45
5. ........................................................................................ 47
5.1 WEKA WAIKATO ENVIRONMENT FOR KNOWLEDGE ANALYSIS ................................................. 47 5.2 WEKA .......................................................................... 48
5.2.1 weka.core ................................................................................................................ 48 5.2.2 weka.classifiers ........................................................................................................ 49 5.2.3 WEKA ................................................................................... 50
5.3 APRIORI ................................................................................. 50 5.3.1 ItemSet..................................................................................................................... 50 5.3.2 FrequentItemsets...................................................................................................... 51
5.4 .............................. 54 5.4.1 NaiveBayes .............................................................................................................. 54
5.5 .................................. 56 5.5.1 48........................................................................................................................... 57 5.5.2 ClassifierTree .......................................................................................................... 58 5.5.3 ModelSelection C45ModelSelection ........................................ 59 5.5.4 ClassifierSplitModel C45Split................................................ 60 5.5.5 Distribution ............................................................................................................. 63 5.5.6 InfoGainSplitCrit GainRatioSplitCrit..................................................................... 66
5.6 ........................... 67 5.6.1 JRip ......................................................................................................................... 67 5.6.2 Antd NominalAntd.................................................................. 69 5.6.3 Rule RipperRule ..................................................................... 71
6. ..................................................... 72
6.1 ................................................................................. 72 6.2 ..................................................................................................................................... 74
6.2.1 .................................................................................... 74 6.2.2 NaiveBayes......................................................................................................................... 75 6.2.3 J48 ..................................................................................................................................... 76 6.2.4 JRIP................................................................................................................................... 78
....................................................................................................................................... 80
......................................................................................................... 81
,
.
,
.
,
.
.
.
.
.
(frequent itemsets) .
. 1
. 2
.
3 .
4, 5 6 , ,
.
1.
1.1
, ,
.
1.1 (Knowledge
Discovery in Databases KDD) ,
.1
1.2 (Data Mining) e KDD
(
)
.
.
,
(frequently occurring patterns), .
KDD :
:
. ,
.
:
.
.
, ,
.
1 U. Fayyad, G. Piatetsky-Shapiro and P. Smyth: From data mining to
knowledge discovery: An overview. Advances in Knowledge Discovery and Data Mining, MIT Press, Cambridge, MA, 1991
:
.
.
.
: ,
.
/:
.
.
1-1 KDD ,
.
1-1. KDD
.
.
.
() ().
(pattern)
. () ,
.
: (equations); (trees); ,
(rules).
.
() () ( )
. ,
.
1.2.
.
1-2
.
.
, ,
. ,
.
. ,
. , ,
.
1.2
. (supervised
learning), .
.
.
,
.
.
.
(, .)
.
.
.
. . ,
. ,
.
.
.
,
() .
, .
.
.
.
.
.
,
.
. .
.
.
.
.
. .
, .
2.
2.1
Definition 2.1. Given a database D = {t1, t2, ..., tn} and a set of classes C = {C1, ..., Cm}, the classification problem is to define a mapping f : D → C that assigns each tuple t_i to a class C_j. A class C_j contains exactly the tuples mapped to it:
C_j = { t_i | f(t_i) = C_j, 1 ≤ i ≤ n, t_i ∈ D }.
. ,
.
.
.
:
1.
(training data). ,
.
.
2. 1
.
:
.
, .
. jC ,
For each class C_j, the likelihood P(t_i | C_j) of a tuple t_i given the class can be estimated from the training data, as can the prior probability P(C_j). The product P(C_j) P(t_i | C_j) is proportional to the posterior probability that t_i belongs to class C_j, written P(C_j | t_i); the tuple t_i is assigned to the class that maximizes this posterior.
2.2
.
(2.1)
P(H | X) = P(X | H) · P(H) / P(X)   (2.1)
where P(X) is the prior probability of the evidence X and P(X | H) is the probability of observing X given the hypothesis H.
.
Let X = (x1, ..., xn) be a tuple whose component x_i is the value of attribute A_i, i = 1..n. There are m hypotheses, one per class: C_j : C = c_j, j = 1..m, where c_j is a class value. The classifier assigns X the class c_k with maximal posterior P(C_k | X), i.e. P(C_k | X) ≥ P(C_j | X) for all j = 1..m. By Bayes' theorem,
P(C_j | X) = P(X | C_j) · P(C_j) / P(X).
Since P(X) is the same for every class, it suffices to maximize the product P(X | C_j) · P(C_j).
Under the naive conditional-independence assumption, the posterior P(C_j | X) can be computed from the individual conditionals P(A_i = x_i | C_j):
P(X | C_j) = Π_i P(A_i = x_i | C_j)
so the quantities that must be estimated are P(C_j) and each P(A_i = x_i | C_j).
The prior P(C_j) = P(C = c_j) and the conditionals P(A_i = v_ik | C_j) are estimated by counting over the training data, where v_ik denotes the k-th value of attribute A_i. With n_j = N(C = c_j) examples of class c_j among n training examples,
P(C_j) = n_j / n.
With N(A_i = v_ik ∧ C = c_j) denoting the number of examples having both A_i = v_ik and class c_j,
P(A_i = v_ik | C_j) = N(A_i = v_ik ∧ C = c_j) / N(C = c_j).
,
.
m .
The Laplace estimate (2.2) is:
P(C_j | A_i) = ( N(C_j, A_i) + 1 ) / ( N(A_i) + k )   (2.2)
where k is the number of classes.
m
m. ,
. (2.3)
P(C_j | A_i) = ( N(C_j, A_i) + m · P(C_j) ) / ( N(A_i) + m )   (2.3)
The class priors themselves are smoothed in the same way (2.4):
P(C_j) = ( N(C_j) + 1 ) / ( N + k ),  j = 1..k   (2.4)
.
.
. ,
. ,
.
.
.
,
. ,
.
.
.
.
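The smoothed estimates (2.2)-(2.4) amount to a few lines of arithmetic. The sketch below is illustrative only; the counts fed to it in main are hypothetical, not values from the thesis:

```java
// Laplace and m-estimate corrections for Naive Bayes probabilities,
// following equations (2.2)-(2.4); all counts are hypothetical.
public class Estimates {

    // (2.2): P(Cj | Ai) = (N(Cj, Ai) + 1) / (N(Ai) + k), k = number of classes
    static double laplace(int nClassAndValue, int nValue, int k) {
        return (nClassAndValue + 1.0) / (nValue + k);
    }

    // (2.3): P(Cj | Ai) = (N(Cj, Ai) + m * P(Cj)) / (N(Ai) + m)
    static double mEstimate(int nClassAndValue, int nValue, double m, double prior) {
        return (nClassAndValue + m * prior) / (nValue + m);
    }

    // (2.4): prior P(Cj) = (N(Cj) + 1) / (N + k)
    static double prior(int nClass, int n, int k) {
        return (nClass + 1.0) / (n + k);
    }

    public static void main(String[] args) {
        // a value never seen with a class still gets a non-zero probability
        System.out.println(laplace(0, 50, 2));
        System.out.println(mEstimate(0, 50, 2.0, prior(30, 100, 2)));
    }
}
```

The point of both corrections is visible in the example: a zero count no longer forces the whole product of conditionals to zero.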
2.3
2.3.1
Definition 2.2. Given a database D = {t1, t2, ..., tn}, where each tuple t_i = (t_i1, ..., t_ih) has values for the attributes {A1, A2, ..., Ah}, and a set of classes C = {C1, ..., Cm}, a decision tree (DT) is a tree associated with D in which:
each internal node is labeled with an attribute A_i;
each arc is labeled with a predicate applicable to the attribute of its parent node;
each leaf node is labeled with a class C_j.
:
1. :
.
2. it D ,
.
2.1 ,
.
.
.
.
.
.
, .
.
. ,
.
.
. ,
.
.
(tree pruning).
.
.
.
2.1 : D // : // T=0; ; = ; = ; { D= D; = ; =(D); T= ; }
2.3.2
.
:
:
.
.
.
:
,
.
:
. , ,
.
,
.
:
,
.
.
.
:
.
.
.
(overfitting) .
:
.
,
.
,
.
: ,
.
.
.
.
. ,
.
. ,
.
.
,
.
,
, .
q,
h . ,
. ,
, .
( )loghq q .
n
. ( )log q ,
( )logn q .
.
2.3.3 ID3
. , ,
. ID3 [Quinlan 79]
. ,
.
ID3
( ),
.
ID3 :
The training set is D = PE ∪ NE, where PE is the set of positive and NE the set of negative examples, with p = |PE| and n = |NE|. A random tuple t belongs to PE with probability
P(t ∈ PE) = p / (p + n)
and to NE with probability
P(t ∈ NE) = n / (p + n).
,
,,PE ,,NE,
The expected information needed to decide whether a tuple belongs to PE or NE is given by (2.5):
I(p, n) = - (p/(p+n)) · log2(p/(p+n)) - (n/(p+n)) · log2(n/(p+n)),  for p ≠ 0 and n ≠ 0; I(p, n) = 0 otherwise.   (2.5)
An attribute A with values {v1, ..., vN} partitions D into subsets {D1, ..., DN}, where D_i contains the tuples of D with value v_i for A. Let p_i and n_i be the numbers of PE and NE examples in D_i; the entropy of D_i is I(p_i, n_i). The expected information of the split on A, EI(A), is the weighted sum (2.6):
EI(A) = Σ_{i=1..N} ( (p_i + n_i) / (p + n) ) · I(p_i, n_i)   (2.6)
where the weight of branch i is the fraction of the tuples of D that fall into D_i.
The information gain of branching on A, G(A), is then (2.7):
G(A) = I(p, n) - EI(A)   (2.7)
ID3
( )G A , ,
{ }1,..., ND D . iD
iD , ,,yes ;
,,no ;
.
At each node the work to evaluate one attribute is proportional to a(p + n), where a is the number of candidate attributes; with b denoting the number of values per attribute, the cost per node is O(b·a(p + n)). Over the whole tree this gives ID3 a total complexity of O(b·a(p + n) · a(p + n)) = O(b·a²(p + n)²). ID3
.
. ,
, ID3
C4.5 .
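Equations (2.5)-(2.7) translate directly into code. The following is an illustrative sketch; the split counts used in main are hypothetical:

```java
// Entropy I(p, n) and information gain G(A) from equations (2.5)-(2.7);
// the split counts used below are hypothetical.
public class InfoGain {

    // (2.5): I(p, n), with I = 0 when p or n is 0
    static double entropy(int p, int n) {
        if (p == 0 || n == 0) return 0.0;
        double pp = (double) p / (p + n), pn = (double) n / (p + n);
        return -pp * log2(pp) - pn * log2(pn);
    }

    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // (2.6) + (2.7): G(A) = I(p, n) - sum_i ((pi + ni) / (p + n)) * I(pi, ni)
    static double gain(int p, int n, int[] pi, int[] ni) {
        double ei = 0.0;
        for (int i = 0; i < pi.length; i++)
            ei += (double) (pi[i] + ni[i]) / (p + n) * entropy(pi[i], ni[i]);
        return entropy(p, n) - ei;
    }

    public static void main(String[] args) {
        // a perfect split of 5 positive / 5 negative examples gains I(5,5) = 1 bit
        System.out.println(gain(5, 5, new int[]{5, 0}, new int[]{0, 5}));
    }
}
```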
2.3.4 C4.5
C4.5 ID3
,
() .
2.3.4.1
ID3
. ,
gain ratio criterion ( )G X
.
the gain ratio G(X) / IV(X), where the split information IV(X) is
IV(X) = - Σ_{i=1..N} ( (p_i + n_i) / (p + n) ) · log2( (p_i + n_i) / (p + n) )
with N, p_i and n_i defined as in (2.6); the degenerate case IV(X) = 0 must be handled separately.
.
2.3.4.2
ID3 ,
(datasets) .
.
.
.
,
:
.
.
2.3.4.3
ID3
.
:
;
.
,
, .
.
,
.
.
2.3.4.4
C4.5 : -
(postpruning) - (prepruning).
-. -
.
-.
- :
(subtree replacement) (subtree raising).
,
().
.
(accuracy)
.
.
.
.
.
2.3.4.5
(
)
.
.
.
.
(reduced error pruning).
.
.
A leaf covers N training examples of which E are misclassified, and the observed error rate f = E/N is treated as an estimate of the unknown true error probability q. Given a confidence level c (C4.5 uses 0.25 by default), a value z is chosen so that
P( (f - q) / sqrt( q(1 - q) / N ) > z ) = c.
Solving for q gives the pessimistic error estimate used when comparing a subtree with its replacement:
e = ( f + z²/(2N) + z · sqrt( f/N - f²/N + z²/(4N²) ) ) / ( 1 + z²/N )
For the default c = 0.25, the corresponding z is 0.69.
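The pessimistic estimate is a one-line computation. This sketch is illustrative; the leaf counts passed in main are hypothetical:

```java
// C4.5's pessimistic error estimate e for a leaf covering N examples with
// E observed errors, using the formula above (z = 0.69 corresponds to c = 0.25).
public class PessimisticError {

    static double estimate(int errors, int n, double z) {
        double f = (double) errors / n;   // observed error rate f = E/N
        double num = f + z * z / (2.0 * n)
                   + z * Math.sqrt(f / n - f * f / n + z * z / (4.0 * n * n));
        return num / (1.0 + z * z / n);
    }

    public static void main(String[] args) {
        // even a leaf with zero observed errors gets a non-zero estimate,
        // which is what makes the pruning decision pessimistic
        System.out.println(estimate(0, 14, 0.69));
    }
}
```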
2.4
.
(antecedents)
(consequent).
IF THEN
, () ()
.
IF A_1 op_1 v_1 AND A_2 op_2 v_2 AND ... AND A_m op_m v_m THEN Class = classX
.
.
.
.
.
. .
,
.
.
,
(pruning)
.
.
,
. ,
.
,
. ,
.
.
.
2.4.1
(divide and conquer).
,
.
.
.
(covering approach)
.
(accuracy).
.
.
.
,
-
.
.
t , p
, t-p .
.
p/t.
.
Algorithm 2.2 (covering):
For each class C:
  initialize E to the full set of training instances
  while E contains instances of class C:
    create a rule R with an empty condition that predicts class C
    until R is perfect (or no attributes remain) do:
      for every attribute A not mentioned in R, and every value v,
        consider adding the condition A = v to R
      select the A and v that maximize the accuracy p/t
        (break ties by preferring the condition with larger p)
      add A = v to R
    remove the instances covered by R from E
p/t. 100%
.
.
.
.
(
)
.
.
(
),
.
.
2.4.2
,
. ,
p/t, t
p .
.
p · ( log(p/t) - log(P/T) )
where p and t are the positive and total counts covered after adding the new condition, and P and T are the corresponding counts before it.
.
. ,
, .
.
,
.
,
.
.
2.4.3
.
(overfitting)
,
.
:
Pre-Pruning:
,
;
Post-Pruning:
.
.
.
overfit-and-simplify (
) .
.
.
reduced error pruning (REP)
. REP
.
REP ,
: a (growing set)
(pruning set).
,
.
;
.
,
.
.
REP
.
REP
Incremental Reduced Error Pruning (IREP)2. ,
RIPPER.
2.4.4 Incremental Reduced Error Pruning (IREP)
IREP (REP)
.
Algorithm 2.3:

IREP(Pos, Neg)
begin
  Ruleset := {}
  while Pos ≠ {} do
    /* grow and prune a new rule */
    split (Pos, Neg) into (GrowPos, GrowNeg) and (PrunePos, PruneNeg)
    Rule := GrowRule(GrowPos, GrowNeg)
    Rule := PruneRule(Rule, PrunePos, PruneNeg)
    if the error rate of Rule on (PrunePos, PruneNeg) exceeds 50% then
      return Ruleset
    else
      add Rule to Ruleset
      remove examples covered by Rule from (Pos, Neg)
    endif
  endwhile
  return Ruleset
end

2 J. Fürnkranz and G. Widmer: Incremental reduced error pruning. In Machine Learning: Proceedings of the Eighth International Workshop on Machine Learning, Ithaca, New York, 1991. Morgan Kaufmann.
2.3 IREP .
, IREP
, . ,
.
, .
, IREP . ,
,
(growing set) (pruning set).
2/3 .
.
Conditions take the form A_n = v for a nominal attribute A_n and value v, or A_c ≤ q and A_c ≥ q for a continuous attribute A_c and a threshold q occurring in the training data. GrowRule repeatedly adds the condition that maximizes an information-gain criterion3 until the rule covers no negative examples of the growing set. After growing, the rule is pruned.
Pruning deletes final conditions of the rule so as to maximize the function (2.8):
v(Rule, prunePos, pruneNeg) = ( p + (N - n) ) / ( P + N )   (2.8)
where P (resp. N) is the total number of examples in prunePos (resp. pruneNeg), and p (resp. n) is the number of examples in prunePos (resp. pruneNeg) covered by Rule. Conditions are deleted as long as v does not decrease.
IREP .
.
3 The criterion is that of FOIL: Quinlan and Cameron-Jones. FOIL: a midterm report. Machine Learning: ECML-93,
Vienna, Austria, 1993
For a multi-class problem the classes C1, ..., Ck are ordered by prevalence, from the least frequent C1 to the most frequent Ck. IREP is first run to separate C1 from the remaining classes; the examples covered by the resulting rules are removed, and IREP is then run for C2, and so on. The most frequent class Ck is left as the default class.
.
.
IREP Cohen,
,
. ,
,
50%.
2.4.5 IREP RIPPER
:
,
REP
.
IREP
. ;
For example, on a pruning set with P positive and N negative examples, consider a rule R1 covering p1 = 2000 positive and n1 = 1000 negative examples, and a rule R2 covering p2 = 1000 positive and only n2 = 1 negative example. Under (2.8) R1 scores higher than R2 (since 2000 - 1000 = 1000 exceeds 1000 - 1 = 999), although R2 is clearly the better rule.
RIPPER therefore replaces (2.8) with (2.9):
v*(Rule, prunePos, pruneNeg) = ( p - n ) / ( p + n )   (2.9)
IREP
50%. ,
.
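The difference between (2.8) and (2.9) can be checked numerically on the R1/R2 example above. In this sketch the pruning-set totals P = 3000 and N = 1500 are hypothetical:

```java
// The rule-value metrics v (2.8) and v* (2.9) on the R1/R2 example;
// the pruning-set totals P = 3000 and N = 1500 are hypothetical.
public class RuleMetrics {

    // (2.8): v = (p + (N - n)) / (P + N)
    static double v(int p, int n, int P, int N) {
        return (p + (N - n)) / (double) (P + N);
    }

    // (2.9): v* = (p - n) / (p + n)
    static double vStar(int p, int n) {
        return (p - n) / (double) (p + n);
    }

    public static void main(String[] args) {
        int P = 3000, N = 1500;                    // hypothetical totals
        System.out.println(v(2000, 1000, P, N));   // R1 under (2.8)
        System.out.println(v(1000, 1, P, N));      // R2 under (2.8): slightly lower
        System.out.println(vStar(2000, 1000));     // R1 under (2.9)
        System.out.println(vStar(1000, 1));        // R2 under (2.9): much higher
    }
}
```

Under (2.8) the imprecise rule R1 wins by a hair; under (2.9) the precise rule R2 wins decisively, which is the behavior RIPPER wants.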
MDL .4
(description length)
. IREP
d
.
64.
(
)
.
IREP
REP .
IREP
REP.
1,..., kR R .
. iR
: (replacement) (revision). iR
'iR
'1,..., ,...,i kR R R
4 DL Minimum Description Length
.
. iR
.
,
MDL .
RIPPER (Repeated Incremental Pruning to Produce Error Reduction).
3.
3.1
(frequent sets)
(patterns) , ,
, , , .
.
, .
.
.
, ,
.
.
.
Definition 3.1. Let R be a set. A binary database r over R is a collection of subsets of R. The elements of R are called items, and the elements of r are called rows (transactions); each row t ∈ r satisfies t ⊆ R. |r| denotes the number of rows in r and |t| the number of items in a row t.
.
A set X ⊆ R of items is called an itemset.
.
Definition 3.2. Let R be a set, r a binary database over R, and X ⊆ R an itemset. X matches a row t ∈ r if X ⊆ t. The set of rows of r matched by X, the cover of X, is denoted M(X, r):
M(X, r) = { t ∈ r | X ⊆ t }
The frequency of X in r is fr(X, r), defined by (3.1):
fr(X, r) = |M(X, r)| / |r|   (3.1)
When r is clear from context we write simply M(X) and fr(X). Given a threshold min_fr ∈ [0, 1], the itemset X is frequent5 if fr(X, r) ≥ min_fr.
t1   {A, B, C, D, G}
t2   {A, B, E, F}
t3   {B, I, K}
t4   {A, B, H}
t5   {E, G, J}
Table 3-1. An example binary database r over R = {A, B, ..., K}
Example 3.1. Consider the database r over R = {A, B, ..., K} shown in Table 3-1. For the itemset {A, B}, the cover is M({A, B}, r) = {t1, t2, t4}, so
fr({A, B}, r) = 3/5 = 0.6.
For the singleton itemsets {A}, {B}, ..., {K} we usually write simply A, B, ..., K. Table 3-2 shows the same database as a binary matrix.
5 Frequent sets are also called large sets or covering sets; frequency is also called support.
A B C D E F G H I J K
1t 1 1 1 1 0 0 1 0 0 0 0
2t 1 1 0 0 1 1 0 0 0 0 0
3t 0 1 0 0 0 0 0 0 1 0 1
4t 1 1 0 0 0 0 0 1 0 0 0
5t 0 0 0 0 1 0 1 0 0 1 0
Table 3-2. The database r of Table 3-1 in binary-matrix form
min_ fr
.
Definition 3.3. Let R be a set, r a binary database over R, and min_fr a frequency threshold. The collection of frequent sets of r with respect to min_fr is F(r, min_fr), given by (3.2):
F(r, min_fr) = { X ⊆ R | fr(X, r) ≥ min_fr }   (3.2)
abbreviated F(r) when the threshold is clear. The subcollection of frequent sets of size l is given by (3.3):
F_l(r) = { X ∈ F(r) | |X| = l }   (3.3)
Example 3.2. With threshold 0.3 and the database r of Table 3-1,
F(r, 0.3) = { {A}, {B}, {E}, {G}, {A, B} }.
.
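The definitions can be checked directly on the database of Table 3-1. The following sketch (illustrative only, not part of the thesis code) computes covers and frequencies:

```java
import java.util.List;
import java.util.Set;

// Computes the cover M(X, r) and frequency fr(X, r) = |M(X, r)| / |r|
// for the example database r of Table 3-1.
public class Frequency {

    static List<Set<String>> r = List.of(
        Set.of("A", "B", "C", "D", "G"),   // t1
        Set.of("A", "B", "E", "F"),        // t2
        Set.of("B", "I", "K"),             // t3
        Set.of("A", "B", "H"),             // t4
        Set.of("E", "G", "J"));            // t5

    // number of rows of r that contain every item of X
    static int cover(Set<String> x) {
        int m = 0;
        for (Set<String> t : r)
            if (t.containsAll(x)) m++;
        return m;
    }

    static double fr(Set<String> x) {
        return (double) cover(x) / r.size();
    }

    public static void main(String[] args) {
        System.out.println(fr(Set.of("A", "B")));  // 0.6, as in Example 3.1
    }
}
```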
3.2
.
,
. ,
.
.
3.2.1
2 R R.
, .
. ,
. sparse
, dense
.
,
.
.
,
.
,
.
.
Property 3.1. For itemsets X, Y ⊆ R: if X ⊆ Y, then fr(X) ≥ fr(Y).
Proof: every row matched by Y is also matched by X, so M(Y) ⊆ M(X) whenever X ⊆ Y.
,
(supersets) .
3.2.2
(itemsets),
.
, .
. ,
I.
.
.
(cover).
X
,
T X T .
.
I/O. .
.
.
.
3.3 priori
AIS Agrawal,
.
Apriori
.
3.3.1 priori
3.1.
Algorithm 3.1 Apriori
Input: r, minimum support s
Output: F(r, s)

C1 := { {i} | i ∈ R };
k := 1;
while Ck ≠ {} do
  // compute the frequencies of all candidate sets
  for all transactions (tid, I) ∈ r do
    for all candidate sets X ∈ Ck do
      if X ⊆ I then increment X.frequency by 1;
  // extract all frequent sets
  Fk := { X ∈ Ck | X.frequency ≥ s };
  // generate new candidate sets
  Ck+1 := {};
  for all X, Y ∈ Fk with X[i] = Y[i] for 1 ≤ i ≤ k - 1 and X[k] < Y[k] do
    I := X ∪ { Y[k] };
    if for all J ⊂ I with |J| = k : J ∈ Fk then add I to Ck+1;
  increment k by 1;

The search proceeds breadth-first: Ck+1 holds the candidate sets of size k + 1, starting from k = 0, where C1 consists of all singletons over R. In each pass the candidates of size k are counted against the database and the frequent ones are collected into Fk.
Candidates are generated by joining two sets X, Y ∈ Fk that share their first k - 1 items into X ∪ Y; the resulting set of size k + 1 is kept only if every k-subset of it belongs to Fk.
k ,
,
. kF .
k+1
, .
. ,
k+2, k+1
k+1.
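The join-and-prune step of Algorithm 3.1 can be sketched with itemsets represented as sorted integer arrays. This is an illustrative sketch, not WEKA's implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// The Apriori candidate-generation step: two frequent k-sets that agree on
// their first k-1 items are merged into a (k+1)-candidate, which is kept
// only if all of its k-subsets are frequent.
public class AprioriJoin {

    static List<int[]> generate(List<int[]> fk) {
        List<int[]> candidates = new ArrayList<>();
        int k = fk.get(0).length;
        for (int a = 0; a < fk.size(); a++)
            for (int b = a + 1; b < fk.size(); b++) {
                int[] x = fk.get(a), y = fk.get(b);
                if (samePrefix(x, y, k - 1) && x[k - 1] < y[k - 1]) {
                    int[] cand = Arrays.copyOf(x, k + 1);
                    cand[k] = y[k - 1];
                    if (allSubsetsFrequent(cand, fk)) candidates.add(cand);
                }
            }
        return candidates;
    }

    static boolean samePrefix(int[] x, int[] y, int len) {
        for (int i = 0; i < len; i++) if (x[i] != y[i]) return false;
        return true;
    }

    // prune step: every k-subset of cand must appear in fk
    static boolean allSubsetsFrequent(int[] cand, List<int[]> fk) {
        for (int skip = 0; skip < cand.length; skip++) {
            int[] sub = new int[cand.length - 1];
            for (int i = 0, j = 0; i < cand.length; i++)
                if (i != skip) sub[j++] = cand[i];
            boolean found = false;
            for (int[] f : fk) if (Arrays.equals(f, sub)) found = true;
            if (!found) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // F2 = {{1,2},{1,3},{2,3},{2,4}}: {2,3,4} is pruned ({3,4} not in F2),
        // so only {1,2,3} survives
        List<int[]> f2 = List.of(new int[]{1, 2}, new int[]{1, 3},
                                 new int[]{2, 3}, new int[]{2, 4});
        for (int[] c : generate(f2)) System.out.println(Arrays.toString(c));
    }
}
```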
3.3.2
,
.
3.3.2.1 Hash
k
, kF hash .
hash . hash
() hash ( ).
hash .
hash 1.
d d+1.
.
k X
,
. d,
hash
[ ]X d . .
d ,
, k d> .
, . ,
.
hash
i , hash i
.
hash .
3.3.2.2
(prefix tree,
trie). , k-
k-1 . . 1-
. k- k-1 .
,
.
hash ,
.
k, k k
.
, .
, (1)
o
(2)
. .
,
k k-1
( k-1 ).
k-1 X ,
X .
4.
4.1
,
.
,
, (dataset)
.
.
,
(
).
,
.
4.2
2.2
the conditional probabilities P(A_i = v_ik | C_j) are estimated as
P(A_i = v_ik | C_j) = N(A_i = v_ik ∧ C = c_j) / N(C = c_j)
where N(A_i = v_ik ∧ C = c_j) is the number of examples with A_i = v_ik and class c_j, and N(C = c_j) is the number of examples of class c_j. Both counts are supports of itemsets: the numerator corresponds to the itemset {A_i = v_ik, C = c_j} and the denominator to {C = c_j}. In terms of frequencies:
P(A_i = v_ik | C_j) = fr(A_i = v_ik ∧ C = c_j) / fr(C = c_j)   (4.1)
(itemsets)
.
Apriori (itemsets)
() .
,
,
.
(a non-frequent itemset). For such an itemset the exact frequency is unknown; all that is known is that 0 ≤ fr < min_fr. The Naive Bayes computation therefore substitutes min_fr / 2, the midpoint of the possible range, for the missing frequency.
,
.
,
.
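Evaluating (4.1) from a table of mined itemset frequencies, with the min_fr / 2 fallback for non-frequent itemsets, can be sketched as follows. The stored frequency values and itemset keys are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Computing the Naive Bayes conditional (4.1) from a table of itemset
// frequencies, substituting min_fr / 2 when an itemset was not frequent.
// All frequency values below are hypothetical.
public class FreqNaiveBayes {

    static double minFr = 0.1;
    static Map<String, Double> freq = new HashMap<>();

    // frequency lookup with the min_fr / 2 fallback for non-frequent sets
    static double fr(String itemset) {
        return freq.getOrDefault(itemset, minFr / 2.0);
    }

    // (4.1): P(Ai = v | C = c) = fr(Ai = v & C = c) / fr(C = c)
    static double conditional(String attrValueAndClass, String classOnly) {
        return fr(attrValueAndClass) / fr(classOnly);
    }

    public static void main(String[] args) {
        freq.put("C=democrat", 0.6);             // hypothetical
        freq.put("A1=y & C=democrat", 0.3);      // hypothetical
        System.out.println(conditional("A1=y & C=democrat", "C=democrat"));
        // "A2=y & C=democrat" was not mined: falls back to (0.1 / 2) / 0.6
        System.out.println(conditional("A2=y & C=democrat", "C=democrat"));
    }
}
```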
4.3
2.3
.
,
.
,
.
a conjunction of attribute-value conditions: A_i = a_ij & ... & A_l = a_lm. For a class value c_k the required quantities are the frequencies fr(A_i = a_ij & ... & A_l = a_lm) and fr(A_i = a_ij & ... & A_l = a_lm & C = c_k), giving:
P(C = c_k | A_i = a_ij & ... & A_l = a_lm) = fr(A_i = a_ij & ... & A_l = a_lm & C = c_k) / fr(A_i = a_ij & ... & A_l = a_lm)   (4.2)
If the conjunction contains l attribute-value pairs, then fr(A_i = a_ij & ... & A_l = a_lm) is the frequency of an itemset of size l, while fr(A_i = a_ij & ... & A_l = a_lm & C = c_k) is the frequency of an itemset of size l + 1.
,
. (Naive
Bayes),
.
min fr .
. , ,
.
,
.
. , ,
, . 1
min2
fr , 2 min2*2
fr
min2*
frg
min fr , g
.
(
0)
(
) 0 1, .
.
.
4.4
2.4
.
(accuracy)
,
,
.
,
.
:
IF A_i = a_ij & ... & A_l = a_lm THEN C = c_k
Its accuracy is computed from the frequencies fr(A_i = a_ij & ... & A_l = a_lm) and fr(A_i = a_ij & ... & A_l = a_lm & C = c_k):
Accuracy = fr(A_i = a_ij & ... & A_l = a_lm & C = c_k) / fr(A_i = a_ij & ... & A_l = a_lm)   (4.3)
If the rule body contains l attribute-value pairs, the denominator is the frequency of an itemset of size l and the numerator of an itemset of size l + 1.
.
5.
5.1 WEKA Waikato Environment for Knowledge Analysis
WEKA was developed at the University of Waikato.
The name WEKA stands for Waikato Environment for Knowledge
Analysis.
WEKA JAVA,
Windows, Linux
Macintosh. JAVA
,
.
WEKA . ,
.
,
.
WEKA
.
. .
.
.
, WEKA
.
.
WEKA
.
WEKA,
.
,
WEKA
.
WEKA
.
5.2 WEKA
WEKA .
WEKA
.
.
5.2.1 weka.core
core WEKA .
.
core : Attribute, Instance Instances.
Attribute . ,
.
Instance
.
Instances
.
5.2.2 weka.classifiers
classifiers
. Classifier,
.
5-1.
5-1 UML Classifier
: buildClassifier(), classifyInstance()
distributionForInstance().
Classifier
.
.
JAVA .
.
5.2.3 WEKA
WEKA : weka.associations, weka.clusterers,
weka.estimators, weka.filters, weka.attributeSelection .
weka.associations
priori
.
.
5.3 Apriori
weka.associations
Apriori.
Apriori,
.
FrequentItemsets
weka.classifiers.freqitemsets.
5.3.1 ItemSet ItemSet ()
. 5-2.
,
.
m_items
.
, i
i .
,
-1.
.
WEKA 0
1.
5-2 UML ItemSet
ItemSet .
,
, Apriori
.
5.3.2 FrequentItemsets
FrequentItemsets Apriori
. 5-3.
5-3 UML FrequentItemsets
Apriori
m_Ls FastVector weka.core.
m_minSuppot,
m_upperBoundMinSupport .
FrequentItemsets
FrequentItemsets()
.
buildAssociations() .
findLargeItemSets() Apriori .
findLargeItemSets() 5.4.
/**
 * Method that finds all large itemsets for the given set of instances.
 *
 * @param instances the instances to be used
 * @exception Exception if an attribute is numeric
 */
private void findLargeItemSets(Instances instances) throws Exception {
  FastVector kMinusOneSets, kSets;
  Hashtable hashtable;
  int necSupport, necMaxSupport, i = 0;

  m_instances = instances;

  // Find large itemsets

  // minimum support
  necSupport = (int) Math.floor(m_minSupport * (double) instances.numInstances());
  necMaxSupport = (int) Math.floor(m_upperBoundMinSupport * (double) instances.numInstances());

  kSets = ItemSet.singletons(instances);
  ItemSet.upDateCounters(kSets, instances);
  kSets = ItemSet.deleteItemSets(kSets, necSupport, necMaxSupport);
  if (kSets.size() == 0)
    return;
  do {
    m_Ls.addElement(kSets);
    kMinusOneSets = kSets;
    kSets = ItemSet.mergeAllItemSets(kMinusOneSets, i, instances.numInstances());
    hashtable = ItemSet.getHashtable(kMinusOneSets, kMinusOneSets.size());
    m_hashtables.addElement(hashtable);
    kSets = ItemSet.pruneItemSets(kSets, hashtable);
    ItemSet.upDateCounters(kSets, instances);
    kSets = ItemSet.deleteItemSets(kSets, necSupport, necMaxSupport);
    i++;
  } while (kSets.size() > 0);
}
5-4 priori
singletons()
().
kSets FastVector.
.
-1
.
hash
.
. m_Ls
i6
.
6 () (itemset).
5.4
() WEKA
NaiveBayes.
.
5.4.1 NaiveBayes
NaiveBayes
, 2.2.
5-5.
,
. m_Counts [klasa] [atribut]
[vrednost na atribut]
, m_Devs [klasa] [atribut] m_Means [klasa] [atribut]
m_Priors [klasa]
. ,
NaiveBayes
Classifiers.
5-5 UML NaiveBayes
buildClassifier()
.
, .
FrequentItemsets
. , ,
1 2.
calculateProbs()
. 5-6
.
// for the class probabilities
for (int i = 0; i
if(help[j]!=-1) { m_Counts[help[help.length-1]][j][help[j]]=count; } } }
5-6 NaiveBayes
,
1
,
m_Priors [klasa]
.
m_Counts [klasa] [atribut] [vrednost na atribut]
2.
, min2
fr .
,
distributionForInstance()
.
.
.
5.5
WEKA
. .
,
:
, ,
, ,
.
,
.
5.5.1 J48. The class J48 implements the C4.5 algorithm7
described in section 2.3.
5-6.
Figure 5-7. UML diagram of J48
7 : Ross
Quinlan (1993). C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA.
Like all Classifiers, J48 implements the methods
buildClassifier(), classifyInstance() and
distributionForInstance().
5.5.2 ClassifierTree
ClassifierTree
. 5-8.
5-8 UML ClassifierTree
.
:
(ModelSelection m_toSelectModel), (ClassifierSplitModel
m_localModel), (ClassifierTree [] m_sons),
(boolean m_isLeaf)
,
(int []m_usedAttr).
Distribution
.
buildTree()
selectModel()
m_toSelectModel. ClassifierSplitModel
.
,
ClassifierTree ()
.
.
.
WEKA ClassifierTree
.
C45PrunableClassifierTree
2.3.4.4. PrunableClassifierTree
IREP (Incremental Reduced Error Pruning) .
, .
J48.
5.5.3 ModelSelection C45ModelSelection
ModelSelection selectModel()
.
ClassifierSplitModel
. C45ModelSelection BinC45ModelSelection
ModelSelection. BinC45ModelSelection
, . ,
C45ModelSelection (
).
C45ModelSelection 5-9.
5-9 UML C45ModelSelection
selectModel() .
:
. NoSplit.
C45Split
. ,
gain ratio 2.3.4.1.
.
selectModel() BinC45ModelSelection
, .
5.5.4 ClassifierSplitModel C45Split
ClassifierSplitModel
.
5-10.
5-10 UML ClassifierSplitModel
: m_attIndex
, m_distribution
m_numSubsets
.
:
C45Split, Bin C45Split NoSplit.
C45Split C45 .
5-11.
selectModel() .
m_infoGain m_gainRatio,
InfoGainSplitCrit GainRatioSplitCrit.
m_splitPoint
.
5-11 UML C45Split
buildClassifier()
.
handleEnumeratedAttribute()
. Distribution
.
InfoGainSplitCrit GainRatioSplitCrit
.
handleNumericAttribute()
.
m_splitPoint,
.
Bin C45Split
.
.
5.5.5 Distribution
Distribution
.
5-12.
5-12 UML Distribution
: m_perClassPerBag[klasa][vrednost
na atribut]
, m_perBag[vrednost na atribut]
, m_perClass[klasa]
total
.
J48.
.
populateDistribution() 5-13.
public final void populateDistribution(int attIndex,int[] usedAtt,FrequentItemsets itemsets){ int bags=this.numBags(); int[] usedAttr=usedAtt; int numClasses=this.numClasses(); int numOfUsedAttr=0; for(int i=0;i
} } } } }//end if
for(int i=0;i
m_perClassPerBag[][]
m_perBag[] m_perClass[].
5.5.6 InfoGainSplitCrit GainRatioSplitCrit
InfoGainSplitCrit
GainRatioSplitCrit. 5-14
5-15 .
5-14 UML InfoGainSplitCrit
5-15 UML GainRatioSplitCrit
EntropyBasedSplitCrit.
.
.
5.6
The RIPPER algorithm is implemented in WEKA by the class
JRip. JRip relies on several helper classes:
Antd, with its subclasses
NumericAntd and NominalAntd, which represent a single antecedent (condition), and
Rule with its subclass RipperRule, which represent a whole rule.
RuleStats
.
5.6.1 JRip
JRip Classifiers
buildClassifier(), classifyInstance() distributionForInstance(),
. 5-16.
JRip :
FastVector m_Ruleset, Attribute m_Class,
FastVector m_RulesetStats
.
buildClassifier()
. :
.
,
.
rulesetForOneClass(),
.
5-16 UML Jrip
M rulesetForOneClass() RIPPER .
.
.
RipperRule
.
.
.
5.6.2 Antd NominalAntd
Antd .
.
5-17.
5-17 UML Antd
: att, value,
maxInfoGain,
accuRate, cover
accu. splitData()
,
NominalAntd Antd
. 5-18.
5-18 UML NominalAntd
NominalAntd accurate[] coverage[]
.
splitData()
.
.
.
NominalAntd .
,
- .
.
5.6.3 Rule RipperRule
Rule
.
RipperRule Rule.
5-19.
5-19 UML RipperRule
m_Antds
m_Consequent.
: grow() prune(),
.
grow() .
.
.
.
-
. .
prune()
.
6.
6.1
.
,
,
.
, .
,
.
:
.
6-1.
6-1
:
?
.
.
. ,
holdout .
.
.
( 10 ).
.
holdout ,
.
- .
k . k-
,
k-1 .
k .
k 10,
.
-
k
t- .
:
recall, ROC .
recall are defined as:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
TP denotes the number of true positives, FP the number of false positives, TN the number of true negatives and FN the number of false negatives.
,
.
ROC analysis is based on the TP and FP rates, defined as:
TP rate = TP / (TP + FN)
FP rate = FP / (FP + TN)
.
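All four measures are simple ratios over the confusion-matrix counts; the counts used in main below are hypothetical:

```java
// Precision, recall and the ROC rates from a 2x2 confusion matrix;
// the counts used in main are hypothetical.
public class ConfusionMetrics {

    static double precision(int tp, int fp) { return tp / (double) (tp + fp); }
    static double recall(int tp, int fn)    { return tp / (double) (tp + fn); }
    static double tpRate(int tp, int fn)    { return tp / (double) (tp + fn); }
    static double fpRate(int fp, int tn)    { return fp / (double) (fp + tn); }

    public static void main(String[] args) {
        int tp = 238, fp = 14, fn = 29, tn = 154;  // hypothetical counts
        System.out.println(precision(tp, fp));
        System.out.println(recall(tp, fn));
        System.out.println(fpRate(fp, tn));
    }
}
```

Note that recall and the TP rate are the same quantity; the FP rate replaces precision when plotting ROC curves.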
6.2
UCI .
.
APRIORI
,
.
6.2.1
Vote
1984 . 435
267 168 . 16
. democrat
republican. y
n. 6-2 . Attribute Information:
% 1. Class Name: 2 (democrat, republican)
% 2. handicapped-infants: 2 (y,n)
% 3. water-project-cost-sharing: 2 (y,n)
% 4. adoption-of-the-budget-resolution: 2 (y,n)
% 5. physician-fee-freeze: 2 (y,n)
% 6. el-salvador-aid: 2 (y,n)
% 7. religious-groups-in-schools: 2 (y,n)
% 8. anti-satellite-test-ban: 2 (y,n)
% 9. aid-to-nicaraguan-contras: 2 (y,n)
% 10. mx-missile: 2 (y,n)
% 11. immigration: 2 (y,n)
% 12. synfuels-corporation-cutback: 2 (y,n)
% 13. education-spending: 2 (y,n)
% 14. superfund-right-to-sue: 2 (y,n)
% 15. crime: 2 (y,n)
% 16. duty-free-exports: 2 (y,n)
% 17. export-administration-act-south-africa: 2 (y,n)
6-2 Vote
6.2.2 NaiveBayes
NaiveBayes
, 6-1.
.
Min support | Time (s) | Correct | Incorrect | Kappa | MAE | RMSE | RAE (%) | RRSE (%) | Instances | Correct (%) | Incorrect (%)
none | 0.06 | 392 | 43 | 0.7949 | 0.0995 | 0.2977 | 20.9815 | 61.1406 | 435 | 90.1149 | 9.8851
0.01 | 1.26 | 391 | 44 | 0.7904 | 0.0996 | 0.2977 | 21.0110 | 61.1482 | 435 | 89.8851 | 10.1149
0.05 | 1.08 | 388 | 47 | 0.7763 | 0.1059 | 0.3072 | 22.3390 | 63.1038 | 435 | 89.1954 | 10.8046
0.10 | 0.95 | 388 | 47 | 0.7763 | 0.1093 | 0.3116 | 23.0409 | 63.9927 | 435 | 89.1954 | 10.8046
0.15 | 0.72 | 388 | 47 | 0.7787 | 0.1151 | 0.3168 | 24.2646 | 65.0684 | 435 | 89.1954 | 10.8046
0.20 | 0.56 | 384 | 56 | 0.7614 | 0.1187 | 0.3196 | 25.0286 | 65.6502 | 435 | 88.2759 | 12.8736
0.25 | 0.48 | 385 | 50 | 0.7668 | 0.1228 | 0.3224 | 25.8928 | 66.2183 | 435 | 88.5057 | 11.4943
0.30 | 0.31 | 383 | 52 | 0.7585 | 0.1314 | 0.3284 | 27.7174 | 67.4427 | 435 | 88.0460 | 11.9540
0.40 | 0.31 | 383 | 52 | 0.7585 | 0.1314 | 0.3284 | 27.7174 | 67.4427 | 435 | 88.0460 | 11.9540
0.50 | 0.31 | 383 | 52 | 0.7585 | 0.1314 | 0.3284 | 27.7174 | 67.4427 | 435 | 88.0460 | 11.9540
Table 6-1. Results for NaiveBayes (MAE = mean absolute error, RMSE = root mean squared error, RAE = relative absolute error, RRSE = root relative squared error)
The best result in Table 6-1 is the accuracy of 90.1149%, obtained without a minimum-support bound
.
.
0,01 0,15
,
.
.
.
.
.
6-2
.
Class=republican 0,0000 0,1000 0,1500 0,2000 0,2500 0,3000 0,4000 0,5000
TP Rate 0,8910 0,8800 0,8610 0,8430 0,8390 0,8280 0,8500 0,8840 FP Rate 0,0830 0,0890 0,0600 0,0540 0,0420 0,0360 0,0600 0,0830 Precision 0,9440 0,9400 0,9580 0,9620 0,9700 0,9740 0,9580 0,9440 Recall 0,8910 0,8800 0,8610 0,8430 0,8390 0,8280 0,8500 0,8840 F-Measure 0,9170 0,9090 0,9070 0,8980 0,9000 0,8950 0,9010 0,9130
Class=democrat 0,0000 0,1000 0,1500 0,2000 0,2500 0,3000 0,4000 0,5000
TP Rate 0,9170 0,9110 0,9400 0,9460 0,9580 0,9640 0,9400 0,9170 FP Rate 0,1090 0,1200 0,1390 0,1570 0,1610 0,1720 0,1500 0,1160 Precision 0,8420 0,8270 0,8100 0,7910 0,7890 0,7790 0,7980 0,8320 Recall 0,9170 0,9110 0,9400 0,9460 0,9580 0,9640 0,9400 0,9170 F-Measure 0,8770 0,8670 0,8710 0,8620 0,8660 0,8620 0,8630 0,8730
6-2 Naive Bayes
6.2.3 J48
48
6-3. .
.
Min support | Time (s) | Correct | Incorrect | Kappa | MAE | RMSE | RAE (%) | RRSE (%) | Instances | Correct (%) | Incorrect (%) | Leaves | Tree size
none | 0.23 | 419 | 16 | 0.922 | 0.0552 | 0.1748 | 11.6378 | 35.892 | 435 | 96.3218 | 3.6782 | 19 | 37
0.10 | 107.08 | 416 | 19 | 0.909 | 0.0799 | 0.1993 | 16.8547 | 40.928 | 435 | 95.6322 | 4.3678 | 25 | 49
0.15 | 13.64 | 415 | 20 | 0.904 | 0.0802 | 0.1971 | 16.9015 | 40.484 | 435 | 95.4023 | 4.5977 | 22 | 43
0.20 | 5.25 | 394 | 41 | 0.794 | 0.1355 | 0.2596 | 28.5643 | 53.308 | 435 | 90.5747 | 9.4253 | 19 | 37
0.25 | 1.74 | 405 | 30 | 0.856 | 0.1272 | 0.241 | 26.8127 | 49.507 | 435 | 93.1034 | 6.8966 | 9 | 17
0.30 | 0.94 | 368 | 67 | 0.682 | 0.2407 | 0.3441 | 50.7518 | 70.68 | 435 | 84.5977 | 15.4023 | 4 | 7
0.40 | 0.19 | 357 | 78 | 0.625 | 0.2809 | 0.3695 | 59.2332 | 75.882 | 435 | 82.0690 | 17.9310 | 2 | 3
0.50 | 0.09 | 267 | 168 | 0 | 0.4741 | 0.4869 | 99.9723 | 100 | 435 | 61.3793 | 38.6207 | 1 | 1
Table 6-3. Results for J48
As Table 6-3 shows, the best accuracy of 96.3218% is obtained without a minimum-support bound, with a tree of 19 leaves and 37 nodes built in 0.23 seconds. With a minimum support of 0.1 the accuracy is slightly lower, while the tree grows to 25 leaves and 49 nodes.
.
.
, . 0,5
.
.
.
.
6-4
.
Class=republican 0,0000 0,1000 0,1500 0,2000 0,2500 0,3000 0,4000 0,5000
TP Rate 0,97 0,948 0,948 0,981 0,933 0,839 0,839 1 FP Rate 0,048 0,03 0,036 0,214 0,071 0,143 0,208 1 Precision 0,97 0,981 0,977 0,879 0,954 0,903 0,865 0,614 Recall 0,97 0,948 0,948 0,981 0,933 0,839 0,839 1 F-Measure 0,97 0,964 0,962 0,927 0,943 0,87 0,852 0,761 Class=democrat 0,0000 0,1000 0,1500 0,2000 0,2500 0,3000 0,4000 0,5000
TP Rate 0,952 0,97 0,946 0,786 0,929 0,857 0,792 0 FP Rate 0,03 0,052 0,052 0,019 0,067 0,161 0,161 0 Precision 0,952 0,921 0,92 0,964 0,897 0,77 0,756 0 Recall 0,952 0,97 0,964 0,786 0,929 0,857 0,792 0 F-Measure 0,952 0,945 0,942 0,866 0,912 0,811 0,773 0
6-4 J48
6.2.4 JRIP
Rip
6-5.
.
.
6-5 JRIP
As Table 6-5 shows, the best accuracy of 95.4023% is obtained without a minimum-support bound, using 10 rules and 0.13 seconds of training time. With a minimum support of 0.1
Min support | Time (s) | Correct | Incorrect | Kappa | MAE | RMSE | RAE (%) | RRSE (%) | Instances | Correct (%) | Incorrect (%) | Rules
none | 0.13 | 415 | 20 | 0.9028 | 0.0608 | 0.209 | 12.8143 | 42.9263 | 435 | 95.4023 | 4.5977 | 10
0.10 | 267.17 | 401 | 34 | 0.8295 | 0.1389 | 0.2636 | 29.2980 | 54.1344 | 435 | 92.1839 | 7.8161 | 4
0.15 | 37.53 | 380 | 55 | 0.7181 | 0.2073 | 0.3261 | 43.7051 | 66.9844 | 435 | 87.3563 | 12.6437 | 3
0.20 | 8.64 | 371 | 64 | 0.6661 | 0.2374 | 0.3445 | 50.0522 | 70.7564 | 435 | 85.2874 | 14.7126 | 2
0.25 | 6.53 | 383 | 52 | 0.7484 | 0.1803 | 0.3044 | 38.0096 | 62.5219 | 435 | 88.0460 | 11.9540 | 3
0.30 | 3.38 | 390 | 45 | 0.7858 | 0.1523 | 0.277 | 32.1116 | 56.8911 | 435 | 89.6552 | 10.3448 | 3
0.40 | 0.66 | 368 | 67 | 0.6711 | 0.2462 | 0.3525 | 51.9209 | 72.4033 | 435 | 84.5977 | 15.4023 | 2
0.50 | 0.14 | 247 | 168 | 0 | 0.4741 | 0.4869 | 99.9750 | 99.9999 | 435 | 56.7816 | 38.6207 | 1
the accuracy drops to 92.1839%, with only 4 rules.
RIPPER
.
. 4 ,
.
0,5
.
6-5
.
Class=republican 0,0000 0,1000 0,1500 0,2000 0,2500 0,3000 0,4000 0,5000
TP Rate 0,966 0,993 0,989 0,981 0,899 0,884 0,895 1 FP Rate 0,065 0,19 0,31 0,339 0,149 0,083 0,232 1 Precision 0,959 0,892 0,835 0,821 0,906 0,944 0,86 0,614 Recall 0,966 0,993 0,989 0,981 0,899 0,884 0,895 1 F-Measure 0,963 0,94 0,906 0,894 0,902 0,913 0,877 0,761 Class=democrat 0,0000 0,1000 0,1500 0,2000 0,2500 0,3000 0,4000 0,5000
TP Rate 0,935 0,81 0,69 0,661 0,851 0,917 0,768 0 FP Rate 0,034 0,007 0,011 0,019 0,101 0,116 0,105 0 Precision 0,946 0,986 0,975 0,957 0,841 0,832 0,822 0 Recall 0,935 0,81 0,69 0,661 0,851 0,917 0,768 0 F-Measure 0,94 0,889 0,808 0,782 0,846 0,873 0,794 0
6-6 JRIP
.
.
.
.
.
.
. freesets
. freesets
.
. ,
,
.
1. Ian H. Witten and Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers, San Francisco, California, 1999
2. Margaret H. Dunham: Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003
3. Sašo Džeroski: Data Mining in a Nutshell
4. J. R. Quinlan: Induction of Decision Trees, Machine Learning, no. 1, pp. 81-106, 1986
5. William W. Cohen: Fast Effective Rule Induction, Proceedings of the Twelfth International Conference on Machine Learning, 1995
6. Johannes Fürnkranz and Gerhard Widmer: Incremental Reduced Error Pruning, Proceedings of the Eleventh International Conference on Machine Learning, 1994
7. Rakesh Agrawal and Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules, Proceedings of the 20th International Conference on Very Large Data Bases, 1994
8. Rakesh Agrawal, Tomasz Imielinski and Arun Swami: Mining Association Rules between Sets of Items in Large Databases, Proceedings of the 1993 ACM SIGMOD Conference, Washington DC, May 1993
9. Bart Goethals: Course notes in Knowledge Discovery in Databases: Search for Frequent Patterns, http://www.cs.helsinki.fi/u/goethals/dmcourse/chap12.pdf
10. J. F. Boulicaut, A. Bykowski and C. Rigotti: Approximation of Frequency Queries by Means of Free-Sets, PKDD 2000