CONTENTS

INTRODUCTION ........................................................................ 3
1. BASIC CONCEPTS OF DATA MINING .................................................... 4
   1.1 Introduction ................................................................. 4
   1.2 Basic tasks in data mining ................................................... 6
2. LEARNING PREDICTIVE MODELS ........................................................ 9
   2.1 Introduction ................................................................. 9
   2.2 Learning probabilistic models ............................................... 10
   2.3 Learning decision trees ..................................................... 12
       2.3.1 Basic algorithm for decision tree induction .......................... 12
       2.3.2 Performance of decision tree learning algorithms ..................... 14
       2.3.3 Decision tree induction with ID3 ..................................... 16
       2.3.4 Decision tree induction with the C4.5 algorithm ...................... 18
             2.3.4.1 A new criterion for attribute selection during tree induction  19
             2.3.4.2 Handling continuous attributes ............................... 19
             2.3.4.3 Handling missing attribute values ............................ 20
             2.3.4.4 Pruning decision trees ....................................... 20
             2.3.4.5 Error estimation ............................................. 21
   2.4 Learning classification rules ............................................... 22
       2.4.1 A covering algorithm for generating classification rules ............. 24
       2.4.2 Criterion for selecting a condition .................................. 26
       2.4.3 Pruning classification rules ......................................... 27
       2.4.4 Incremental Reduced Error Pruning (IREP) ............................. 28
       2.4.5 Improving the IREP algorithm: the RIPPER algorithm ................... 30
3. FREQUENT ITEMSETS AND ALGORITHMS FOR GENERATING THEM ............................ 33
   3.1 Basic concepts of frequent sets ............................................. 33
   3.2 Generating frequent sets .................................................... 36
       3.2.1 The search space ...................................................... 36
       3.2.2 The database .......................................................... 37
   3.3 The Apriori algorithm ....................................................... 38
       3.3.1 Description of the Apriori algorithm ................................. 38
       3.3.2 Data structures ....................................................... 39
             3.3.2.1 Hash tree ..................................................... 40
             3.3.2.2 Prefix tree ................................................... 40
4. MODIFICATION OF THE ALGORITHMS FOR LEARNING PREDICTIVE MODELS FROM SUMMARY DATA  42
   4.1 Introduction ................................................................ 42
   4.2 Learning Bayesian models from frequent itemsets ............................ 42
   4.3 Learning decision trees from frequent itemsets ............................. 44
   4.4 Learning classification rules from frequent itemsets ....................... 45
5. IMPLEMENTATION .................................................................. 47
   5.1 WEKA – Waikato Environment for Knowledge Analysis .......................... 47
   5.2 The structure of WEKA ....................................................... 48
       5.2.1 The weka.core package ................................................ 48
       5.2.2 The weka.classifiers package ......................................... 49
       5.2.3 Other WEKA packages .................................................. 50
   5.3 Implementation of the Apriori algorithm .................................... 50
       5.3.1 The ItemSet class ..................................................... 50
       5.3.2 The FrequentItemsets class ........................................... 51
   5.4 Implementation of the Naive Bayes classifier ............................... 54
       5.4.1 The NaiveBayes class ................................................. 54
   5.5 Implementation of decision tree learning ................................... 56
       5.5.1 The J48 class ......................................................... 57
       5.5.2 The ClassifierTree class ............................................. 58
       5.5.3 The ModelSelection and C45ModelSelection classes ..................... 59
       5.5.4 The ClassifierSplitModel and C45Split classes ........................ 60
       5.5.5 The Distribution class ............................................... 63
       5.5.6 The InfoGainSplitCrit and GainRatioSplitCrit classes ................. 66
   5.6 Implementation of classification rule learning ............................. 67
       5.6.1 The JRip class ........................................................ 67
       5.6.2 The Antd and NominalAntd classes ..................................... 69
       5.6.3 The Rule and RipperRule classes ...................................... 71
6. EXPERIMENTAL RESULTS ............................................................ 72
   6.1 Evaluation of classification models ........................................ 72
   6.2 Results ..................................................................... 74
       6.2.1 The Vote dataset ...................................................... 74
       6.2.2 NaiveBayes ............................................................ 75
       6.2.3 J48 ................................................................... 76
       6.2.4 JRIP .................................................................. 78
CONCLUSION ......................................................................... 80
REFERENCES ......................................................................... 81

Diploma Thesis



  • - 3 -

INTRODUCTION

    .

    ,

    .

    ,

    .

    .

    .

    .

    .

    (frequent itemsets) .

    . 1

    . 2

    .

    3 .

    4, 5 6 , ,

    .

  • - 4 -

1. BASIC CONCEPTS OF DATA MINING

1.1 Introduction

    , ,

    .

    1.1 (Knowledge

    Discovery in Databases KDD) ,

    .1

    1.2 (Data Mining) e KDD

    (

    )

    .

    .

    ,

    (frequently occuring patterns), .

    KDD :

    :

    . ,

    .

    :

    .

    .

    , ,

    .

¹ The definition is taken from: U. Fayyad, G. Piatetsky-Shapiro and P. Smyth: From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Cambridge, MA, 1996.

  • - 5 -

    :

    .

    .

    .

    : ,

    .

    /:

    .

    .

    1-1 KDD ,

    .

    1-1. KDD

    .

    .

    .

    () ().

    (pattern)

    . () ,

    .

    : (equations); (trees); ,

    (rules).

    .

  • - 6 -

    () () ( )

    . ,

    .

    1.2.

    .

    1-2

    .

    .

    , ,

    . ,

    .

    . ,

    . , ,

    .

1.2 Basic tasks in data mining

    . (supervised

    learning), .

  • - 7 -

    .

    .

    ,

    .

    .

    .

    (, .)

    .

    .

    .

    . . ,

    . ,

    .

    .

    .

    ,

    () .

    , .

    .

    .

    .

    .

    .

    ,

    .

    . .

  • - 8 -

    .

    .

    .

    .

    . .

    , .

  • - 9 -

2. LEARNING PREDICTIVE MODELS

2.1 Introduction

Definition 2.1. Let D = {t₁, t₂, ..., tₙ} be a database of examples and C = {C₁, ..., Cₘ} a set of classes. The classification problem is to define a mapping f : D → C that assigns each example tᵢ to exactly one class. A class Cⱼ then contains precisely the examples mapped to it:

Cⱼ = { tᵢ | f(tᵢ) = Cⱼ, 1 ≤ i ≤ n, tᵢ ∈ D }.

    . ,

    .

    .

    .

    :

    1.

    (training data). ,

    .

    .

    2. 1

    .

    :

    .

    , .

    . jC ,

    ( )ji CtP it . ( )jCP ,

  • - 10 -

    ( ) ( )jij CtPCP it jC .

    .

    it , it

    jC . ( )ij tCP .

    it

    .

2.2 Learning probabilistic models

    .

    (2.1)

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)} \qquad (2.1)$$

where $P(X)$ is the prior probability of the observed data $X$ and $P(X \mid H)$ is the probability of observing $X$ when the hypothesis $H$ holds.

    .

Let $X = (x_1, \ldots, x_n)$ be an example described by the values $A_i = x_i$ of its $n$ attributes $A_i$, and let there be $m$ classes. The hypotheses $H$ are the statements $C = c_j$, $j = 1, \ldots, m$, where $c_j$ is one of the class values. The example is assigned to the class $c_k$ with the largest posterior probability $P(C_k \mid X)$, i.e. $P(C_k \mid X) \ge P(C_j \mid X)$ for all $j = 1, \ldots, m$. By Bayes' theorem, $P(C_j \mid X) = \frac{P(X \mid C_j)\,P(C_j)}{P(X)}$. Since $P(X)$ is the same for every class, it suffices to maximise $P(X \mid C_j)\,P(C_j)$.

  • - 11 -

The naive assumption is that the attributes are conditionally independent given the class, so that $P(X \mid C_j)$ is the product of the individual probabilities $P(A_i = x_i \mid C_j)$. What remains is to estimate these conditional probabilities and the priors $P(C_j)$ from the training data.

Let $P(C_j) = P(C = c_j)$ denote the prior probability of class $c_j$ and $P(A_i = v_{ik} \mid C_j)$ the probability of value $v_{ik}$ of attribute $A_i$ within that class. $P(C_j)$ is estimated from the number $n_j = N(c_j)$ of training examples of class $c_j$: $P(C_j) = \frac{n_j}{n}$, where $n$ is the total number of training examples. $P(A_i = v_{ik} \mid C_j)$ is estimated from the count $N(A_i = v_{ik} \wedge C = c_j)$ of examples that have both the attribute value and the class, and the count $N(C = c_j)$:

$$P(A_i = v_{ik} \mid C_j) = \frac{N(A_i = v_{ik} \wedge C = c_j)}{N(C = c_j)}$$

    ,

    .

    m .

$$P(C_j \mid A_i) = \frac{N(C_j, A_i) + 1}{N(A_i) + k} \qquad (2.2)$$

where $k$ is the number of classes.

    m

    m. ,

    . (2.3)

  • - 12 -

$$P(C_j \mid A_i) = \frac{N(C_j, A_i) + m \cdot P(C_j)}{N(A_i) + m} \qquad (2.3)$$

and the class prior is correspondingly estimated by (2.4):

$$P(C_j) = \frac{N(C_j) + 1}{N + k}, \qquad j = 1, \ldots, k \qquad (2.4)$$
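To make these estimates concrete, here is a minimal naive Bayes sketch in plain Java (it is not the WEKA implementation discussed in Chapter 5): attributes are assumed to be nominal and encoded as 0-based integer indices, the class prior uses the correction (2.4), and the conditional probabilities use ordinary add-one smoothing over each attribute's values, which is an adaptation of (2.2) rather than a literal transcription of it.

// Minimal naive Bayes for nominal attributes; names and data layout are illustrative.
public class SimpleNaiveBayes {
    private final int numClasses;
    private final int[] numValues;          // number of values per attribute
    private final int[] classCount;         // N(c_j)
    private final int[][][] jointCount;     // N(A_i = v and C = c_j)
    private int n;                          // total number of training examples

    public SimpleNaiveBayes(int numClasses, int[] numValues) {
        this.numClasses = numClasses;
        this.numValues = numValues;
        this.classCount = new int[numClasses];
        this.jointCount = new int[numValues.length][][];
        for (int i = 0; i < numValues.length; i++)
            jointCount[i] = new int[numValues[i]][numClasses];
    }

    /** Count one training example with attribute values x and class label c. */
    public void add(int[] x, int c) {
        n++;
        classCount[c]++;
        for (int i = 0; i < x.length; i++)
            jointCount[i][x[i]][c]++;
    }

    /** P(C = c) with the correction (2.4). */
    public double classPrior(int c) {
        return (classCount[c] + 1.0) / (n + numClasses);
    }

    /** P(A_i = v | C = c) with add-one smoothing over the values of A_i. */
    public double conditional(int i, int v, int c) {
        return (jointCount[i][v][c] + 1.0) / (classCount[c] + numValues[i]);
    }

    /** Class maximising P(C = c) * prod_i P(A_i = x_i | C = c). */
    public int classify(int[] x) {
        int best = 0;
        double bestScore = -1.0;
        for (int c = 0; c < numClasses; c++) {
            double score = classPrior(c);
            for (int i = 0; i < x.length; i++)
                score *= conditional(i, x[i], c);
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy data: two binary attributes, two classes.
        SimpleNaiveBayes nb = new SimpleNaiveBayes(2, new int[] {2, 2});
        nb.add(new int[] {0, 0}, 0);
        nb.add(new int[] {0, 1}, 0);
        nb.add(new int[] {1, 1}, 1);
        System.out.println(nb.classify(new int[] {0, 0}));   // most likely class 0
    }
}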

    .

    .

    . ,

    . ,

    .

    .

    .

    ,

    . ,

    .

    .

    .

    .

2.3 Learning decision trees

2.3.1 Basic algorithm for decision tree induction

Definition 2.2. Let D = {t₁, t₂, ..., tₙ} be a database in which every example tᵢ = (tᵢ₁, ..., tᵢ_h) is described by the values of the attributes {A₁, A₂, ..., A_h}, and let C = {C₁, ..., Cₘ} be the set of classes. A decision tree (DT) for D is a tree in which:
- every internal node is labelled with an attribute Aᵢ;
- every branch is labelled with a predicate over the values of the attribute in the parent node;
- every leaf is labelled with a class Cⱼ.

    :

    1. :

    .

    2. it D ,

    .

    2.1 ,

    .

    .

    .

    .

    .

    .

    , .

    .

    . ,

    .

    .

    . ,

    .

    .

    (tree pruning).

    .

  • - 14 -

    .

    .

    2.1 : D // : // T=0; ; = ; = ; { D= D; = ; =(D); T= ; }

    2.3.2

    .

    :

    :

    .

    .

    .

    :

    ,

  • - 15 -

    .

    :

    . , ,

    .

    ,

    .

    :

    ,

    .

    .

    .

    :

    .

    .

    .

    (overfitting) .

    :

    .

    ,

    .

    ,

    .

    : ,

    .

    .

    .

  • - 16 -

    .

    . ,

    .

    . ,

    .

    .

    ,

    .

    ,

    , .

    q,

    h . ,

    . ,

    , .

    ( )loghq q .

    n

    . ( )log q ,

    ( )logn q .

    .

2.3.3 Decision tree induction with ID3

    . , ,

    . ID3 [Quinlan 79]

  • - 17 -

    . ,

    .

    ID3

    ( ),

    .

    ID3 :

Assume that the training set D consists of examples of two classes, D = PE ∪ NE, where PE contains the positive and NE the negative examples, with p = |PE| and n = |NE|. An arbitrary example t belongs to PE with probability P(t ∈ PE) = p/(p + n) and to NE with probability P(t ∈ NE) = n/(p + n).

The expected information needed to decide whether an example belongs to PE or to NE is given by (2.5):

$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n} \qquad (2.5)$$

for p > 0 and n > 0, and I(p, n) = 0 otherwise.

If attribute A has N distinct values {v₁, ..., v_N}, the set D is partitioned into subsets {D₁, ..., D_N}, where Dᵢ contains the examples whose value of A is vᵢ. Let pᵢ and nᵢ denote the numbers of examples from PE and NE that fall into Dᵢ; the expected information for the subtree of Dᵢ is I(pᵢ, nᵢ). The expected information of the tree with A at the root is then the weighted average (2.6):

$$EI(A) = \sum_{i=1}^{N} \frac{p_i + n_i}{p + n}\, I(p_i, n_i) \qquad (2.6)$$

where the weight of the i-th branch is the proportion of the examples of D that belong to Dᵢ. The information gained by branching on A is given by (2.7):

$$G(A) = I(p, n) - EI(A) \qquad (2.7)$$
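A self-contained Java sketch of (2.5) to (2.7) for a two-class problem follows; the class name and the counts in main() are an invented toy example, not data from the thesis.

// Computing I(p,n), EI(A) and the gain G(A) from formulas (2.5)-(2.7).
// counts[i][0] / counts[i][1] hold p_i / n_i for the i-th value of attribute A.
public final class InfoGain {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    /** Expected information I(p, n), formula (2.5); 0 when p or n is 0. */
    static double info(double p, double n) {
        if (p == 0 || n == 0) return 0;
        double pp = p / (p + n), pn = n / (p + n);
        return -pp * log2(pp) - pn * log2(pn);
    }

    /** Expected information after splitting on A, formula (2.6). */
    static double expectedInfo(int[][] counts, double p, double n) {
        double ei = 0;
        for (int[] c : counts)
            ei += (c[0] + c[1]) / (p + n) * info(c[0], c[1]);
        return ei;
    }

    /** Information gain G(A) = I(p, n) - EI(A), formula (2.7). */
    static double gain(int[][] counts, int p, int n) {
        return info(p, n) - expectedInfo(counts, p, n);
    }

    public static void main(String[] args) {
        // Toy split: a two-valued attribute; value 1 covers 3 PE / 1 NE,
        // value 2 covers 1 PE / 3 NE, out of p = 4 and n = 4 examples in total.
        int[][] counts = { {3, 1}, {1, 3} };
        System.out.println("G(A) = " + gain(counts, 4, 4));
    }
}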

    ID3

    ( )G A , ,

    { }1,..., ND D . iD

    iD , ,,yes ;

    ,,no ;

    .

    p n+ (

    ) ,

    .

    ( )a p n+ . ID3

    ( )G A .

    ( )( )O ba p n+ b .

    .

    ID3 ( ) ( )( ) ( )( )22O ba p n a p n O ba p n+ + = + . ID3

    .

    . ,

    , ID3

    C4.5 .

2.3.4 Decision tree induction with the C4.5 algorithm

    C4.5 ID3

    ,

    () .

  • - 19 -

2.3.4.1 A new criterion for attribute selection during tree induction

The information gain used by ID3 favours attributes with many values. C4.5 therefore uses the gain ratio criterion, which normalises the gain G(X) of an attribute X by the split information IV(X):

$$\mathrm{gain\ ratio}(X) = \frac{G(X)}{IV(X)}, \qquad IV(X) = -\sum_{i=1}^{N} \frac{p_i + n_i}{p + n}\log_2\frac{p_i + n_i}{p + n},$$

with N, pᵢ, nᵢ, p and n defined as in the previous section. Care is needed for attributes whose IV(X) is close to 0.
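A corresponding Java sketch of the split information and the gain ratio (names are illustrative; returning 0 when IV(X) = 0 is simply a convention chosen for this example):

// Gain ratio: the gain G(X) normalised by the split information IV(X).
// subsetSizes[i] = p_i + n_i, the number of examples with the i-th value of X.
public final class GainRatio {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    /** IV(X) = -sum_i (p_i+n_i)/(p+n) * log2((p_i+n_i)/(p+n)). */
    static double splitInfo(int[] subsetSizes, int total) {
        double iv = 0;
        for (int s : subsetSizes)
            if (s > 0) {
                double f = (double) s / total;
                iv -= f * log2(f);
            }
        return iv;
    }

    /** Gain ratio = G(X) / IV(X); treated as 0 when IV(X) = 0. */
    static double gainRatio(double gain, int[] subsetSizes, int total) {
        double iv = splitInfo(subsetSizes, total);
        return iv == 0 ? 0 : gain / iv;
    }
}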

    .

    2.3.4.2

    ID3 ,

    (datasets) .

    .

    .

    .

    ,

    :

    .

    .

  • - 20 -

    2.3.4.3

    ID3

    .

    :

    ;

    .

    ,

    , .

    .

    ,

    .

    .

    2.3.4.4

    C4.5 : -

    (postpruning) - (prepruning).

    -. -

    .

    -.

    - :

    (subtree replacement) (subtree raising).

  • - 21 -

    ,

    ().

    .

    (accuracy)

    .

    .

    .

    .

    .

2.3.4.5 Error estimation

    (

    )

    .

    .

    .

    .

    (reduced error pruning).

    .

    .

    . N .

    q , N-

    q ,

  • - 22 -

    . :

The estimate is based on a confidence level c (C4.5 uses 0.25 by default). If a node covers N training examples and misclassifies E of them, the observed error rate is f = E/N, while q denotes the unknown true error probability. The value z is chosen so that

$$P\left[\frac{f - q}{\sqrt{q(1-q)/N}} > z\right] = c,$$

and solving for q gives the upper bound that is used as the error estimate:

$$e = \frac{f + \dfrac{z^2}{2N} + z\sqrt{\dfrac{f}{N} - \dfrac{f^2}{N} + \dfrac{z^2}{4N^2}}}{1 + \dfrac{z^2}{N}}$$

where z is the number of standard deviations corresponding to the confidence c; for c = 0.25, z = 0.69.
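The same estimate written as a small Java method (a sketch; the leaf counts in main() are illustrative only):

// Upper bound e on the true error at a node, with confidence c = 0.25 (z = 0.69).
// f = E / N is the observed error rate on the N training examples at the node.
public final class PessimisticError {
    static double estimatedError(double f, double N, double z) {
        return (f + z * z / (2 * N)
                  + z * Math.sqrt(f / N - f * f / N + z * z / (4 * N * N)))
               / (1 + z * z / N);
    }

    public static void main(String[] args) {
        // Example: a leaf with N = 14 examples, 2 of them misclassified.
        double f = 2.0 / 14;
        System.out.println("e = " + estimatedError(f, 14, 0.69));
    }
}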

2.4 Learning classification rules

    .

Each rule consists of a conjunction of conditions (antecedents) and a conclusion (consequent), i.e. it has the form IF conditions THEN conclusion, where the conclusion assigns one of the classes. In general a classification rule looks like:

IF attribute₁ op₁ value₁ AND attribute₂ op₂ value₂ AND ... AND attributeₘ opₘ valueₘ THEN Class = classX

  • - 23 -

    .

    .

    .

    .

    .

    . .

    ,

    .

    .

    ,

    (pruning)

    .

    .

    ,

    . ,

    .

    ,

    . ,

    .

    .

    .

  • - 24 -

2.4.1 A covering algorithm for generating classification rules

    (divide and conquer).

    ,

    .

    .

    .

    (covering aproach)

    .

    (accuracy).

    .

    .

    .

    ,

    -

    .

    .

    t , p

    , t-p .

    .

    p/t.

    .

  • - 25 -

Algorithm 2.2 (covering algorithm)

For each class C:
    initialise E to the full set of training examples;
    while E contains examples of class C do
        create a rule R with an empty left-hand side that predicts class C;
        until R is perfect (or no more attributes remain) do
            for each attribute A not mentioned in R and each value v,
                consider adding the condition A = v to R;
            select the A and v that maximise the accuracy p/t
                (ties are broken in favour of the condition with the largest p);
            add A = v to R;
        remove the examples covered by R from E;

Here t is the number of examples covered by the candidate rule and p the number of those that belong to class C, so p/t is the rule's accuracy on the training data; conditions are added until the rule reaches 100% accuracy or cannot be improved further.
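The heart of one iteration, the choice of the best condition by p/t, might look as follows in Java; the data layout (integer-coded nominal attributes) and all names are assumptions made for the example, not the thesis's implementation.

import java.util.*;

// One step of the covering algorithm: among all conditions "attribute = value" that
// can still be added, pick the one with the highest accuracy p/t on the examples
// currently covered, breaking ties in favour of larger p.
// examples[e][a] is the value of attribute a for example e; labels[e] is its class.
public final class BestCondition {
    static int[] best(int[][] examples, int[] labels, int targetClass,
                      boolean[] covered, Set<Integer> usedAttributes) {
        int bestAttr = -1, bestValue = -1, bestP = -1;
        double bestAcc = -1;
        int numAttrs = examples[0].length;
        for (int a = 0; a < numAttrs; a++) {
            if (usedAttributes.contains(a)) continue;
            // candidate values = values of attribute a seen among the covered examples
            Map<Integer, int[]> stats = new HashMap<>();   // value -> {t, p}
            for (int e = 0; e < examples.length; e++) {
                if (!covered[e]) continue;
                int[] tp = stats.computeIfAbsent(examples[e][a], k -> new int[2]);
                tp[0]++;                                   // t: covered by A = v
                if (labels[e] == targetClass) tp[1]++;     // p: covered and correct
            }
            for (Map.Entry<Integer, int[]> s : stats.entrySet()) {
                int t = s.getValue()[0], p = s.getValue()[1];
                double acc = (double) p / t;
                if (acc > bestAcc || (acc == bestAcc && p > bestP)) {
                    bestAcc = acc; bestP = p; bestAttr = a; bestValue = s.getKey();
                }
            }
        }
        return new int[] { bestAttr, bestValue };          // the condition A = v to add
    }
}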

    .

    .

    .

    .

    (

    )

    .

    .

    (

    ),

    .

    .

  • - 26 -

2.4.2 Criterion for selecting a condition

    ,

    . ,

A natural choice is the accuracy p/t, where t is the total number of examples covered by the rule and p the number of those that are positive. An alternative, information-based criterion evaluates the addition of a condition by

$$p\left(\log\frac{p}{t} - \log\frac{P}{T}\right),$$

where p and t refer to the rule after the new condition is added and P and T to the rule before it.

    ,

    .

    .

    . ,

    , .

    .

    ,

    .

    ,

    .

    .

  • - 27 -

    2.4.3

    .

    (overfitting)

    ,

    .

    :

    Pre-Pruning:

    ,

    ;

    Post-Pruning:

    .

    .

    .

    overfit-and-simplify (

    ) .

    .

    .

    reduced error pruning (REP)

    . REP

    .

    REP ,

    : a (growing set)

    (pruning set).

    ,

    .

    ;

    .

    ,

  • - 28 -

    .

    .

    REP

    .

    REP

    Incremental Reduced Error Pruning (IREP)2. ,

    RIPPER.

    2.4.4 Incremental Reduced Error Pruning (IREP)

    IREP (REP)

    .

Algorithm 2.3 (IREP)

procedure IREP(Pos, Neg)
begin
    Ruleset := ∅;
    while Pos ≠ ∅ do
        /* grow and prune a new rule */
        split (Pos, Neg) into (GrowPos, GrowNeg) and (PrunePos, PruneNeg);
        Rule := GrowRule(GrowPos, GrowNeg);
        Rule := PruneRule(Rule, PrunePos, PruneNeg);
        if the error rate of Rule on (PrunePos, PruneNeg) exceeds 50% then
            return Ruleset;
        else
            add Rule to Ruleset;
            remove examples covered by Rule from (Pos, Neg);
        endif
    endwhile
    return Ruleset;
end

Algorithm 2.3: The IREP algorithm

² The algorithm is taken from: J. Fürnkranz and G. Widmer: Incremental Reduced Error Pruning. In Proceedings of the Eleventh International Conference on Machine Learning, Morgan Kaufmann, 1994.

    , IREP

    , . ,

    .

    , .

    , IREP . ,

    ,

    (growing set) (pruning set).

    2/3 .

    .

Conditions have the form Aₙ = v, A_c ≤ θ or A_c ≥ θ, where Aₙ is a nominal attribute and v one of its values, and A_c is a continuous attribute and θ a threshold. GrowRule greedily adds to the rule the condition that maximises a FOIL-style information gain criterion³.

    , .

$$v(Rule, prunePos, pruneNeg) = \frac{p + (N - n)}{P + N} \qquad (2.8)$$

where P and N are the total numbers of examples in prunePos and pruneNeg respectively, and p and n are the numbers of examples in prunePos and pruneNeg covered by Rule.

    v .

    IREP .

    .

³ The growing heuristic is based on FOIL; see: J. R. Quinlan and R. M. Cameron-Jones: FOIL: A Midterm Report. In Machine Learning: ECML-93, Vienna, Austria, 1993.

  • - 30 -

    1,..., kC C , 1C kC

    IREP

    1C . ,

    IREP

    2C .

    kC (default) .

    .

    .

Analysing IREP, Cohen identified two weaknesses that limit its accuracy: the rule-value metric used during pruning, and the stopping condition under which a single rule with an error rate above 50% terminates the learning of further rules.

2.4.5 Improving the IREP algorithm: the RIPPER algorithm

    :

    ,

    REP

    .

    IREP

    . ;

Consider, for example, a rule R₁ that covers p₁ = 2000 positive and n₁ = 1000 negative examples from the pruning data, and a rule R₂ that covers p₂ = 1000 positive and only n₂ = 1 negative example. The metric (2.8) assigns the higher value to R₁, although R₂ is clearly the better rule.

RIPPER therefore uses the metric (2.9) instead:

$$v^{*}(Rule, prunePos, pruneNeg) = \frac{p - n}{p + n} \qquad (2.9)$$
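The following small Java program evaluates both metrics on the two rules above; the totals P = N = 3000 are an arbitrary choice made for the illustration and are not taken from the text.

// The two pruning metrics side by side: v from (2.8) and v* from (2.9).
public final class RuleValue {
    static double irepValue(double p, double n, double P, double N) {
        return (p + (N - n)) / (P + N);              // formula (2.8)
    }
    static double ripperValue(double p, double n) {
        return (p - n) / (p + n);                    // formula (2.9)
    }
    public static void main(String[] args) {
        double P = 3000, N = 3000;                   // assumed totals for the example
        System.out.printf("R1: v=%.4f  v*=%.4f%n",
                irepValue(2000, 1000, P, N), ripperValue(2000, 1000));
        System.out.printf("R2: v=%.4f  v*=%.4f%n",
                irepValue(1000, 1, P, N), ripperValue(1000, 1));
        // v prefers R1, while v* strongly prefers R2.
    }
}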

    IREP

    50%. ,

    .

RIPPER instead uses a heuristic based on the minimum description length (MDL) principle⁴: after each new rule is added, the total description length of the rule set together with the examples is computed, and rule learning stops when this description length is more than d bits larger than the smallest description length obtained so far. Cohen uses d = 64.

    (

    )

    .

    IREP

    REP .

    IREP

    REP.

    1,..., kR R .

    . iR

    : (replacement) (revision). iR

    'iR

    '1,..., ,...,i kR R R

⁴ MDL stands for Minimum Description Length.

  • - 32 -

    . iR

    .

    ,

    MDL .

    RIPPER (Repeated Incremental Pruning to Produce Error Reduction).

  • - 33 -

3. FREQUENT ITEMSETS AND ALGORITHMS FOR GENERATING THEM

3.1 Basic concepts of frequent sets

    (frequent sets)

    (patterns) , ,

    , , , .

    .

    , .

    .

    .

    , ,

    .

    .

    .

    3.1. R. r R

    R. R (items),

    r (rows). r

    r r t r

    r t

    = t . ,

    .

    X R (items)

    () .

    .

  • - 34 -

Definition 3.2. Let R be a set of items, r a database over R, and X ⊆ R an itemset. X matches a row t ∈ r if X ⊆ t. The set of rows in r matched by X is denoted M(X, r):

M(X, r) = { t ∈ r | X ⊆ t }.

The frequency of X in r is defined by (3.1):

fr(X, r) = |M(X, r)| / |r|    (3.1)

When the database r is clear from the context we simply write M(X) and fr(X). Given a frequency threshold min_fr ∈ [0, 1], the itemset X is called frequent⁵ if fr(X, r) ≥ min_fr.

t1   {A, B, C, D, G}
t2   {A, B, E, F}
t3   {B, I, K}
t4   {A, B, H}
t5   {E, G, J}

Table 3-1: An example database r over R = {A, B, ..., K}

Example 3.1. Consider the database r over R = {A, B, ..., K} shown in Table 3-1. The itemset {A, B} matches the rows M({A, B}, r) = {t1, t2, t4}, so its frequency is fr({A, B}, r) = 3/5 = 0.6.

Rows over {A, B, ..., K} can equivalently be viewed as binary vectors with one position per item A, ..., K; the database of Table 3-1 is shown in this form in Table 3-2.

⁵ Frequent sets are also called large sets or covering sets, and the frequency is also called support.

  • - 35 -

      A  B  C  D  E  F  G  H  I  J  K
t1    1  1  1  1  0  0  1  0  0  0  0
t2    1  1  0  0  1  1  0  0  0  0  0
t3    0  1  0  0  0  0  0  0  1  0  1
t4    1  1  0  0  0  0  0  1  0  0  0
t5    0  0  0  0  1  0  1  0  0  1  0

Table 3-2: The database r represented as a binary matrix

    min_ fr

    .

Definition 3.3. Let R be a set of items, r a database over R, and min_fr a frequency threshold. The collection of frequent sets in r with respect to min_fr is denoted F(r, min_fr), defined by (3.2):

F(r, min_fr) = { X ⊆ R | fr(X, r) ≥ min_fr },    (3.2)

or simply F(r) when the threshold is understood. The frequent sets of size l are given by (3.3):

F_l(r) = { X ∈ F(r) : |X| = l }.    (3.3)

Example 3.2. Let min_fr = 0.3. For the database r of Table 3-1, the collection of frequent sets is F(r, 0.3) = { {A}, {B}, {E}, {G}, {A, B} }.
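The definitions can be checked directly on the toy database of Table 3-1 with a brute-force search, as in the Java sketch below (Apriori, described in Section 3.3, does the same job far more efficiently); all names are chosen for the illustration.

import java.util.*;

// Brute-force computation of fr(X, r) and F(r, min_fr) for the database of Table 3-1.
public final class FrequentSets {
    static final List<Set<Character>> r = List.of(
            Set.of('A','B','C','D','G'),   // t1
            Set.of('A','B','E','F'),       // t2
            Set.of('B','I','K'),           // t3
            Set.of('A','B','H'),           // t4
            Set.of('E','G','J'));          // t5

    /** fr(X, r) = |M(X, r)| / |r|, formula (3.1). */
    static double fr(Set<Character> X) {
        long matches = r.stream().filter(t -> t.containsAll(X)).count();
        return (double) matches / r.size();
    }

    public static void main(String[] args) {
        List<Character> items = "ABCDEFGHIJK".chars().mapToObj(c -> (char) c).toList();
        double minFr = 0.3;
        // enumerate all non-empty subsets of the 11 items
        for (int mask = 1; mask < (1 << items.size()); mask++) {
            Set<Character> X = new TreeSet<>();
            for (int i = 0; i < items.size(); i++)
                if ((mask & (1 << i)) != 0) X.add(items.get(i));
            if (fr(X) >= minFr) System.out.println(X + "  fr = " + fr(X));
        }
        // prints {A}, {B}, {E}, {G} and {A, B}, i.e. F(r, 0.3) from Example 3.2
    }
}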

    .

  • - 36 -

3.2 Generating frequent sets

    .

    ,

    . ,

    .

    .

    3.2.1

    2 R R.

    , .

    . ,

    . sparse

    , dense

    .

    ,

    .

    .

    ,

    .

    ,

    .

  • - 37 -

    .

Property 3.1. For all itemsets X, Y ⊆ R, if X ⊆ Y then fr(X) ≥ fr(Y).
The property follows from the fact that M(Y) ⊆ M(X) whenever X ⊆ Y.

Consequently, a superset of an infrequent set can never be frequent, which allows large parts of the search space to be pruned.

    3.2.2

    (itemsets),

    .

    , .

    . ,

    I.

    .

    .

    (cover).

    X

    ,

    T X T .

    .

    I/O. .

    .

  • - 38 -

    .

    .

3.3 The Apriori algorithm

    AIS Agrawal,

    .

    Apriori

    .

3.3.1 Description of the Apriori algorithm

    3.1.

Algorithm 3.1 (Apriori)
Input: a database r and a minimum support s
Output: F(r, s)

C1 := { {i} | i ∈ R };
k := 1;
while Ck ≠ ∅ do
    // compute the frequencies of all candidate sets
    for all transactions (tid, I) ∈ r do
        for all candidate sets X ∈ Ck do
            if X ⊆ I then increment X.frequency by 1;
    // extract all frequent sets
    Fk := { X ∈ Ck : X.frequency ≥ s };
    // generate new candidate sets
    Ck+1 := ∅;
    for all X, Y ∈ Fk with X[i] = Y[i] for 1 ≤ i ≤ k−1 and X[k] < Y[k] do
        I := X ∪ { Y[k] };
        if ∀ J ⊆ I with |J| = k : J ∈ Fk then add I to Ck+1;
    k := k + 1;

The algorithm searches breadth-first: the candidates Ck+1 of size k+1 are generated only after all frequent sets of size k are known, starting from the singleton candidates C1.
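A compact Java sketch of the candidate-generation step (the join followed by the prune test); itemsets are represented as sorted integer arrays, and the class and method names are chosen for the example only.

import java.util.*;

// Candidate generation in Apriori: two frequent k-sets that agree on their first
// k-1 items are joined into a (k+1)-set, which is kept only if all of its
// k-subsets are themselves frequent (the prune step).
public final class AprioriGen {

    static List<int[]> generateCandidates(List<int[]> Fk) {
        List<int[]> candidates = new ArrayList<>();
        Set<String> frequent = new HashSet<>();
        for (int[] x : Fk) frequent.add(Arrays.toString(x));
        for (int a = 0; a < Fk.size(); a++)
            for (int b = a + 1; b < Fk.size(); b++) {
                int[] X = Fk.get(a), Y = Fk.get(b);
                int k = X.length;
                // join step: identical prefixes of length k-1, different last items
                if (!Arrays.equals(Arrays.copyOf(X, k - 1), Arrays.copyOf(Y, k - 1))
                        || X[k - 1] == Y[k - 1]) continue;
                int[] I = Arrays.copyOf(X, k + 1);
                I[k - 1] = Math.min(X[k - 1], Y[k - 1]);
                I[k]     = Math.max(X[k - 1], Y[k - 1]);
                if (allSubsetsFrequent(I, frequent)) candidates.add(I);
            }
        return candidates;
    }

    /** Prune step: every k-subset of the (k+1)-item candidate must be frequent. */
    static boolean allSubsetsFrequent(int[] I, Set<String> frequent) {
        for (int skip = 0; skip < I.length; skip++) {
            int[] subset = new int[I.length - 1];
            for (int i = 0, j = 0; i < I.length; i++)
                if (i != skip) subset[j++] = I[i];
            if (!frequent.contains(Arrays.toString(subset))) return false;
        }
        return true;
    }
}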

    . , 1C

    R k k+1.

    . kF .

    , kX Y F k-1 X YU .

    kC X YU , k

    kF .

    k ,

    ,

    . kF .

    k+1

    , .

    . ,

    k+2, k+1

    k+1.

    3.3.2

    ,

    .

  • - 40 -

3.3.2.1 Hash tree

    k

    , kF hash .

    hash . hash

    () hash ( ).

    hash .

    hash 1.

    d d+1.

    .

    k X

    ,

    . d,

    hash

    [ ]X d . .

    d ,

    , k d> .

    , . ,

    .

    hash

    i , hash i

    .

    hash .

3.3.2.2 Prefix tree

    (prefix tree,

    trie). , k-

    k-1 . . 1-

  • - 41 -

    . k- k-1 .

    ,

    .

    hash ,

    .

    k, k k

    .

    , .

    , (1)

    o

    (2)

    . .

    ,

    k k-1

    ( k-1 ).

    k-1 X ,

    X .

  • - 42 -

4. MODIFICATION OF THE ALGORITHMS FOR LEARNING PREDICTIVE MODELS FROM SUMMARY DATA

4.1 Introduction

    ,

    .

    ,

    , (dataset)

    .

    .

    ,

    (

    ).

    ,

    .

4.2 Learning Bayesian models from frequent itemsets

As described in Section 2.2, the Naive Bayes classifier requires the conditional probabilities $P(A_i = v_{ik} \mid C_j)$, which are estimated from counts as

$$P(A_i = v_{ik} \mid C_j) = \frac{N(A_i = v_{ik} \wedge C = c_j)}{N(C = c_j)},$$

where $N(A_i = v_{ik} \wedge C = c_j)$ is the number of training examples with $A_i = v_{ik}$ and class $c_j$, and $N(C = c_j)$ is the number of examples of class $c_j$. Exactly the same quantities can be obtained from the frequencies of the corresponding itemsets, namely the itemset $\{A_i = v_{ik}, C = c_j\}$ and the itemset $\{C = c_j\}$. The estimate therefore becomes (4.1):

$$P(A_i = v_{ik} \mid C_j) = \frac{fr(A_i = v_{ik} \wedge C = c_j)}{fr(C = c_j)} \qquad (4.1)$$

    (itemsets)

    .

    Apriori (itemsets)

    () .

    ,

    ,

    .

A complication arises when one of the needed itemsets is not frequent (a non-frequent itemset): its exact frequency is then unknown. Since such a frequency lies somewhere between 0 and min_fr, the modified Naive Bayes approximates it with min_fr / 2.
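A small Java sketch of estimate (4.1) with the min_fr/2 fallback for itemsets that were not found frequent; the map-based representation of the mined itemsets, the string encoding of items, and all names are assumptions made for the illustration.

import java.util.*;

// Estimating P(A_i = v | C = c) from mined itemset frequencies, formula (4.1).
// When a needed itemset was not frequent, its unknown frequency is replaced by minFr / 2.
public final class FrequencyBasedEstimates {
    private final Map<Set<String>, Double> frequency;   // mined itemset -> fr(X, r)
    private final double minFr;

    public FrequencyBasedEstimates(Map<Set<String>, Double> frequency, double minFr) {
        this.frequency = frequency;
        this.minFr = minFr;
    }

    /** fr(X, r) if X was mined as frequent, otherwise the minFr / 2 approximation. */
    double fr(Set<String> itemset) {
        return frequency.getOrDefault(itemset, minFr / 2);
    }

    /** P(A_i = v | C = c) = fr({A_i=v, C=c}) / fr({C=c}). */
    double conditional(String attributeValue, String classValue) {
        return fr(Set.of(attributeValue, classValue)) / fr(Set.of(classValue));
    }

    public static void main(String[] args) {
        // Illustrative frequencies with generic item names, not values from the thesis.
        Map<Set<String>, Double> fr = Map.of(
                Set.of("C=c1"), 0.6,
                Set.of("A1=v", "C=c1"), 0.2);
        FrequencyBasedEstimates est = new FrequencyBasedEstimates(fr, 0.1);
        System.out.println(est.conditional("A1=v", "C=c1"));   // 0.2 / 0.6
    }
}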

    . ,

    ,

    .

    ,

    .

  • - 44 -

4.3 Learning decision trees from frequent itemsets

    2.3

    .

    ,

    .

    ,

    .

Consider a node of the tree that is reached by satisfying the conditions $A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm}$. The class distribution in that node, for every class $C_k$, can be computed from the frequencies $fr(A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm})$ and $fr(A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm}\,\&\,C = c_k)$, according to (4.2):

$$P(C = c_k \mid A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm}) = \frac{fr(A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm}\,\&\,C = c_k)}{fr(A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm})} \qquad (4.2)$$

If the conjunction contains $l$ conditions, the denominator is the frequency of an itemset of size $l$, while the numerator is the frequency of an itemset of size $l + 1$.

    ,

    . (Naive

    Bayes),

    .

    min fr .

  • - 45 -

    . , ,

    .

    ,

    .

    . , ,

    , . 1

    min2

    fr , 2 min2*2

    fr

    min2*

    frg

    min fr , g

    .

    (

    0)

    (

    ) 0 1, .

    .

    .

4.4 Learning classification rules from frequent itemsets

    2.4

    .

    (accuracy)

    ,

    ,

    .

    ,

    .

  • - 46 -

For a classification rule of the form

IF $A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm}$ THEN $C = c_k$

the accuracy can be computed from the frequencies $fr(A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm})$ and $fr(A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm}\,\&\,C = c_k)$, according to (4.3):

$$\mathrm{Accuracy} = \frac{fr(A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm}\,\&\,C = c_k)}{fr(A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm})} \qquad (4.3)$$

As before, if the rule body contains $l$ conditions, the denominator is the frequency of an itemset of size $l$ and the numerator of an itemset of size $l + 1$.

    .
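The computation itself is a single division; the sketch below (plain Java, illustrative names) also shows how the covered-example counts p and t used in Chapter 2 can be recovered from the frequencies when the number of rows in the database is known.

// Rule accuracy according to (4.3), computed from itemset frequencies.
public final class RuleAccuracy {
    /** Accuracy = fr(body AND class) / fr(body). */
    static double accuracy(double frBodyAndClass, double frBody) {
        return frBodyAndClass / frBody;
    }

    /** Covered-example counts from Chapter 2: p = fr(body AND class)*|r|, t = fr(body)*|r|. */
    static long[] counts(double frBodyAndClass, double frBody, long numRows) {
        return new long[] { Math.round(frBodyAndClass * numRows),
                            Math.round(frBody * numRows) };
    }
}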

  • - 47 -

5. IMPLEMENTATION

5.1 WEKA – Waikato Environment for Knowledge Analysis

The implementation is based on WEKA, a machine learning workbench developed at the University of Waikato in New Zealand. The name WEKA is an acronym for Waikato Environment for Knowledge Analysis.

    WEKA JAVA,

    Windows, Linux

    Macintosh. JAVA

    ,

    .

    WEKA . ,

    .

    ,

    .

    WEKA

    .

    . .

    .

    .

    , WEKA

    .

    .

  • - 48 -

    WEKA

    .

    WEKA,

    .

    ,

    WEKA

    .

    WEKA

    .

    5.2 WEKA

    WEKA .

    WEKA

    .

    .

    5.2.1 weka.core

    core WEKA .

    .

    core : Attribute, Instance Instances.

    Attribute . ,

    .

    Instance

    .

    Instances

    .

  • - 49 -

    5.2.2 weka.classifiers

    classifiers

    . Classifier,

    .

    5-1.

    5-1 UML Classifier

    : buildClassifier(), classifyInstance()

    distributionForInstance().

    Classifier

    .

    .

    JAVA .

    .

  • - 50 -

    5.2.3 WEKA

    WEKA : weka.associations, weka.clusterers,

    weka.estimators, weka.filters, weka.attributeSelection .

    weka.associations

    priori

    .

    .

    5.3 Apriori

    weka.associations

    Apriori.

    Apriori,

    .

    FrequentItemsets

    weka.classifiers.freqitemsets.

    5.3.1 ItemSet ItemSet ()

    . 5-2.

    ,

    .

    m_items

    .

    , i

    i .

    ,

    -1.

  • - 51 -

    .

    WEKA 0

    1.

    5-2 UML ItemSet

    ItemSet .

    ,

    , Apriori

    .

    5.3.2 FrequentItemsets

    FrequentItemsets Apriori

    . 5-3.

  • - 52 -

    5-3 UML FrequentItemsets

    Apriori

    m_Ls FastVector weka.core.

    m_minSuppot,

    m_upperBoundMinSupport .

    FrequentItemsets

    FrequentItemsets()

    .

    buildAssociations() .

    findLargeItemSets() Apriori .

    findLargeItemSets() 5.4.

/**
 * Method that finds all large itemsets for the given set of instances.
 *
 * @param instances the instances to be used
 * @exception Exception if an attribute is numeric
 */
private void findLargeItemSets(Instances instances) throws Exception {
  FastVector kMinusOneSets, kSets;
  Hashtable hashtable;
  int necSupport, necMaxSupport, i = 0;

  m_instances = instances;

  // Find large itemsets
  // minimum support
  necSupport = (int) Math.floor((m_minSupport * (double) instances.numInstances()));
  necMaxSupport = (int) Math.floor((m_upperBoundMinSupport * (double) instances.numInstances()));

  kSets = ItemSet.singletons(instances);
  ItemSet.upDateCounters(kSets, instances);
  kSets = ItemSet.deleteItemSets(kSets, necSupport, necMaxSupport);
  if (kSets.size() == 0)
    return;
  do {
    m_Ls.addElement(kSets);
    kMinusOneSets = kSets;
    kSets = ItemSet.mergeAllItemSets(kMinusOneSets, i, instances.numInstances());
    hashtable = ItemSet.getHashtable(kMinusOneSets, kMinusOneSets.size());
    m_hashtables.addElement(hashtable);
    kSets = ItemSet.pruneItemSets(kSets, hashtable);
    ItemSet.upDateCounters(kSets, instances);
    kSets = ItemSet.deleteItemSets(kSets, necSupport, necMaxSupport);
    i++;
  } while (kSets.size() > 0);
}

Figure 5-4: The findLargeItemSets() method, which implements the Apriori algorithm

    singletons()

    ().

    kSets FastVector.

    .

    -1

    .

    hash

    .

    . m_Ls

    i6

    .

    6 () (itemset).

  • - 54 -

    5.4

    () WEKA

    NaiveBayes.

    .

    5.4.1 NaiveBayes

    NaiveBayes

    , 2.2.

    5-5.

    ,

    . m_Counts [klasa] [atribut]

    [vrednost na atribut]

    , m_Devs [klasa] [atribut] m_Means [klasa] [atribut]

    m_Priors [klasa]

    . ,

    NaiveBayes

    Classifiers.

  • - 55 -

    5-5 UML NaiveBayes

    buildClassifier()

    .

    , .

    FrequentItemsets

    . , ,

    1 2.

    calculateProbs()

    . 5-6

    .

    //za klasna verojatnost for(int i=0;i

  • - 56 -

    if(help[j]!=-1) { m_Counts[help[help.length-1]][j][help[j]]=count; } } }

    5-6 NaiveBayes

    ,

    1

    ,

    m_Priors [klasa]

    .

    m_Counts [klasa] [atribut] [vrednost na atribut]

    2.

    , min2

    fr .

    ,

    distributionForInstance()

    .

    .

    .

    5.5

    WEKA

    . .

    ,

    :

  • - 57 -

    , ,

    , ,

    .

    ,

    .

5.5.1 The J48 class

J48 is WEKA's implementation of the C4.5 decision tree learner⁷, described in Section 2.3.

    5-6.

    5-7 UML 48

    7 : Ross

    Quinlan (1993). C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA.

  • - 58 -

    . 48 Classifiers

    buildClassifier(), classifyInstance()

    distributionForInstance(), .

    5.5.2 ClassifierTree

    ClassifierTree

    . 5-8.

    5-8 UML ClassifierTree

    .

    :

    (ModelSelection m_toSelectModel), (ClassifierSplitModel

    m_localModel), (ClassifierTree [] m_sons),

    (boolean m_isLeaf)

  • - 59 -

    ,

    (int []m_usedAttr).

    Distribution

    .

    buildTree()

    selectModel()

    m_toSelectModel. ClassifierSplitModel

    .

    ,

    ClassifierTree ()

    .

    .

    .

    WEKA ClassifierTree

    .

    C45PrunableClassifierTree

    2.3.4.4. PrunableClassifierTree

    IREP(Increased Reduced Error Pruning) .

    , .

    48.

    5.5.3 ModelSelection C45ModelSelection

    ModelSelection selectModel()

    .

    ClassifierSplitModel

    . C45ModelSelection BinC45ModelSelection

    ModelSelection. BinC45ModelSelection

    , . ,

    C45ModelSelection (

  • - 60 -

    ).

    C45ModelSelection 5-9.

    5-9 UML C45ModelSelection

    selectModel() .

    :

    . NoSplit.

    C45Split

    . ,

    gain ratio 2.3.4.1.

    .

    selectModel() BinC45ModelSelection

    , .

    5.5.4 ClassifierSplitModel C45Split

    ClassifierSplitModel

    .

    5-10.

  • - 61 -

    5-10 UML ClassifierSplitModel

    : m_attIndex

    , m_distribution

    m_numSubsets

    .

    :

    C45Split, Bin C45Split NoSplit.

    C45Split C45 .

    5-11.

    selectModel() .

    m_infoGain m_gainRatio,

    InfoGainSplitCrit GainRatioSplitCrit.

    m_splitPoint

    .

  • - 62 -

    5-11 UML C45Split

    buildClassifier()

    .

    handleEnumeratedAttribute()

    . Distribution

    .

    InfoGainSplitCrit GainRatioSplitCrit

    .

    handleNumericAttribute()

    .

    m_splitPoint,

    .

    Bin C45Split

    .

    .

  • - 63 -

    5.5.5 Distribution

    Distribution

    .

    5-12.

    5-12 UML Distribution

  • - 64 -

    : m_perClassPerBag[klasa][vrednost

    na atribut]

    , m_perBag[vrednost na atribut]

    , m_perClass[klasa]

    total

    .

    48.

    .

    populateDistribution() 5-13.

    public final void populateDistribution(int attIndex,int[] usedAtt,FrequentItemsets itemsets){ int bags=this.numBags(); int[] usedAttr=usedAtt; int numClasses=this.numClasses(); int numOfUsedAttr=0; for(int i=0;i

  • - 65 -

    } } } } }//end if

    for(int i=0;i

  • - 66 -

    m_perClassPerBag[][]

    m_perBag[] m_perClass[].

    5.5.6 InfoGainSplitCrit GainRatioSplitCrit

    InfoGainSplitCrit

    GainRatioSplitCrit. 5-14

    5-15 .

    5-14 UML InfoGainSplitCrit

    5-15 UML GainRatioSplitCrit

    EntropyBasedSplitCrit.

  • - 67 -

    .

    .

    5.6

    RIPPER WEKA

    JRIP. JRI

    : Antd ,

    NumericAntd NominalAntd Antd

    RipperRule .

    RuleStats

    .

    5.6.1 JRip

    JRip Classifiers

    buildClassifier(), classifyInstance() distributionForInstance(),

    . 5-16.

    JRip :

    FastVector m_Ruleset, Attribute m_Class,

    FastVector m_RulesetStats

    .

    buildClassifier()

    . :

    .

    ,

    .

    rulesetForOneClass(),

    .

  • - 68 -

    5-16 UML Jrip

    M rulesetForOneClass() RIPPER .

    .

    .

    RipperRule

    .

    .

    .

  • - 69 -

    5.6.2 Antd NominalAntd

    Antd .

    .

    5-17.

    5-17 UML Antd

    : att, value,

    maxInfoGain,

    accuRate, cover

    accu. splitData()

    ,

    NominalAntd Antd

    . 5-18.

  • - 70 -

    5-18 UML NominalAntd

    NominalAntd accurate[] coverage[]

    .

    splitData()

    .

    .

    .

    NominalAntd .

    ,

    - .

    .

  • - 71 -

    5.6.3 Rule RipperRule

    Rule

    .

    RipperRule Rule.

    5-19.

    5-19 UML RipperRule

    m_Antds

    m_Consequent.

    : grow() prune(),

    .

    grow() .

    .

    .

    .

    -

    . .

    prune()

    .

  • - 72 -

6. EXPERIMENTAL RESULTS

6.1 Evaluation of classification models

    .

    ,

    ,

    .

    , .

    ,

    .

    :

    .

    6-1.

    6-1

    :

  • - 73 -

    ?

    .

    .

    . ,

    holdout .

    .

    .

    ( 10 ).

    .

    holdout ,

    .

A better estimate is obtained with k-fold cross-validation: the data set is split into k parts (folds) of roughly equal size. Each fold is used once as the test set while the remaining k−1 folds are used for training, and the procedure is repeated k times, so that every example is used for testing exactly once. The usual choice is k = 10.
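A plain (non-stratified) k-fold split of example indices could be produced as in the following Java sketch; the class and method names are chosen for the illustration.

import java.util.*;

// A plain k-fold split: fold i is used once as the test set, the other k-1 folds for training.
public final class KFold {
    static List<List<Integer>> folds(int numExamples, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numExamples; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < indices.size(); i++)
            folds.get(i % k).add(indices.get(i));
        return folds;
    }

    public static void main(String[] args) {
        // 10-fold split of 435 examples: ten test folds of 43 or 44 examples each.
        for (List<Integer> fold : folds(435, 10, 1))
            System.out.println("test-fold size = " + fold.size());
    }
}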

    -

    k

    t- .

    :

    recall, ROC .

Precision and recall are defined as:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

  • - 74 -

where TP is the number of true positives, FP the number of false positives, TN the number of true negatives and FN the number of false negatives. Precision measures the proportion of predicted positives that are truly positive, while recall measures the proportion of actual positives that are recognised.

    ,

    .

ROC analysis is based on the TP rate and the FP rate, defined as:

$$\mathrm{TP\ rate} = \frac{TP}{TP + FN}, \qquad \mathrm{FP\ rate} = \frac{FP}{FP + TN}$$
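The four measures as a small Java helper; the counts in main() are illustrative only and are not taken from the tables below.

// Precision, recall, TP rate and FP rate from the entries of a binary confusion matrix.
public final class BinaryMeasures {
    final double tp, fp, tn, fn;

    BinaryMeasures(double tp, double fp, double tn, double fn) {
        this.tp = tp; this.fp = fp; this.tn = tn; this.fn = fn;
    }

    double precision() { return tp / (tp + fp); }
    double recall()    { return tp / (tp + fn); }   // identical to the TP rate
    double tpRate()    { return tp / (tp + fn); }
    double fpRate()    { return fp / (fp + tn); }

    public static void main(String[] args) {
        BinaryMeasures m = new BinaryMeasures(238, 14, 154, 29);   // invented counts
        System.out.printf("precision=%.3f recall=%.3f fpRate=%.3f%n",
                          m.precision(), m.recall(), m.fpRate());
    }
}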

    .

6.2 Results

The experiments were carried out on a dataset from the UCI machine learning repository. For every experiment the frequent itemsets were first generated with the Apriori implementation from Chapter 5, for a range of minimum-support values, and the modified classifiers were then built from those itemsets and evaluated.

6.2.1 The Vote dataset

The experiments use the Vote dataset, which contains the 1984 United States congressional voting records. It consists of 435 instances, 267 of class democrat and 168 of class republican, described by 16 attributes. The class attribute has the two values democrat and republican, and every other attribute takes the values y and n. Figure 6-2 lists the attributes.

Attribute Information:

  • - 75 -

% 1. Class Name: 2 (democrat, republican)
% 2. handicapped-infants: 2 (y,n)
% 3. water-project-cost-sharing: 2 (y,n)
% 4. adoption-of-the-budget-resolution: 2 (y,n)
% 5. physician-fee-freeze: 2 (y,n)
% 6. el-salvador-aid: 2 (y,n)
% 7. religious-groups-in-schools: 2 (y,n)
% 8. anti-satellite-test-ban: 2 (y,n)
% 9. aid-to-nicaraguan-contras: 2 (y,n)
% 10. mx-missile: 2 (y,n)
% 11. immigration: 2 (y,n)
% 12. synfuels-corporation-cutback: 2 (y,n)
% 13. education-spending: 2 (y,n)
% 14. superfund-right-to-sue: 2 (y,n)
% 15. crime: 2 (y,n)
% 16. duty-free-exports: 2 (y,n)
% 17. export-administration-act-south-africa: 2 (y,n)

Figure 6-2: Attributes of the Vote dataset

    6.2.2 NaiveBayes

The results of the modified NaiveBayes classifier for different values of the minimum support are shown in Table 6-1.

Min      Time   Correctly   Incorrectly          Mean abs.  Root mean   Rel. abs.  Root rel. sq.  Total      Correct  Incorrect
support  (s)    classified  classified   Kappa   error      sq. error   error (%)  error (%)      instances  (%)      (%)
–        0,06   392         43           0,7949  0,0995     0,2977      20,9815    61,1406        435        90,1149  9,8851
0,01     1,26   391         44           0,7904  0,0996     0,2977      21,011     61,1482        435        89,8851  10,1149
0,05     1,08   388         47           0,7763  0,1059     0,3072      22,339     63,1038        435        89,1954  10,8046
0,1      0,95   388         47           0,7763  0,1093     0,3116      23,0409    63,9927        435        89,1954  10,8046
0,15     0,72   388         47           0,7787  0,1151     0,3168      24,2646    65,0684        435        89,1954  10,8046
0,2      0,56   384         56           0,7614  0,1187     0,3196      25,0286    65,6502        435        88,2759  12,8736
0,25     0,48   385         50           0,7668  0,1228     0,3224      25,8928    66,2183        435        88,5057  11,4943
0,3      0,31   383         52           0,7585  0,1314     0,3284      27,7174    67,4427        435        88,0460  11,9540
0,4      0,31   383         52           0,7585  0,1314     0,3284      27,7174    67,4427        435        88,0460  11,9540
0,5      0,31   383         52           0,7585  0,1314     0,3284      27,7174    67,4427        435        88,0460  11,9540

Table 6-1: Results of the modified NaiveBayes classifier for different minimum-support values

As Table 6-1 shows, the highest accuracy, 90.1149%, is achieved without a minimum-support threshold, i.e. by the original NaiveBayes classifier.

  • - 76 -

    .

    .

    0,01 0,15

    ,

    .

    .

    .

    .

    .

    6-2

    .

Min support:   0,0000  0,1000  0,1500  0,2000  0,2500  0,3000  0,4000  0,5000

Class = republican
TP Rate        0,8910  0,8800  0,8610  0,8430  0,8390  0,8280  0,8500  0,8840
FP Rate        0,0830  0,0890  0,0600  0,0540  0,0420  0,0360  0,0600  0,0830
Precision      0,9440  0,9400  0,9580  0,9620  0,9700  0,9740  0,9580  0,9440
Recall         0,8910  0,8800  0,8610  0,8430  0,8390  0,8280  0,8500  0,8840
F-Measure      0,9170  0,9090  0,9070  0,8980  0,9000  0,8950  0,9010  0,9130

Class = democrat
TP Rate        0,9170  0,9110  0,9400  0,9460  0,9580  0,9640  0,9400  0,9170
FP Rate        0,1090  0,1200  0,1390  0,1570  0,1610  0,1720  0,1500  0,1160
Precision      0,8420  0,8270  0,8100  0,7910  0,7890  0,7790  0,7980  0,8320
Recall         0,9170  0,9110  0,9400  0,9460  0,9580  0,9640  0,9400  0,9170
F-Measure      0,8770  0,8670  0,8710  0,8620  0,8660  0,8620  0,8630  0,8730

Table 6-2: Per-class measures for the modified NaiveBayes classifier

    6.2.3 J48

The results of the modified J48 classifier for different values of the minimum support are shown in Table 6-3.

  • - 77 -

Min      Time    Correctly   Incorrectly          Mean abs.  Root mean  Rel. abs.  Root rel. sq.  Total      Correct  Incorrect          Tree
support  (s)     classified  classified   Kappa   error      sq. error  error (%)  error (%)      instances  (%)      (%)        Leaves  size
–        0,23    419         16           0,922   0,0552     0,1748     11,6378    35,892         435        96,3218  3,6782     19      37
0,1      107,08  416         19           0,909   0,0799     0,1993     16,8547    40,928         435        95,6322  4,3678     25      49
0,15     13,64   415         20           0,904   0,0802     0,1971     16,9015    40,484         435        95,4023  4,5977     22      43
0,2      5,25    394         41           0,794   0,1355     0,2596     28,5643    53,308         435        90,5747  9,4253     19      37
0,25     1,74    405         30           0,856   0,1272     0,241      26,8127    49,507         435        93,1034  6,8966     9       17
0,3      0,94    368         67           0,682   0,2407     0,3441     50,7518    70,68          435        84,5977  15,4023    4       7
0,4      0,19    357         78           0,625   0,2809     0,3695     59,2332    75,882         435        82,0690  17,9310    2       3
0,5      0,09    267         168          0       0,4741     0,4869     99,9723    100            435        61,3793  38,6207    1       1

Table 6-3: Results of the modified J48 classifier for different minimum-support values

As Table 6-3 shows, the best accuracy, 96.3218%, is achieved without a minimum-support threshold, with a tree of 19 leaves and 37 nodes built in 0.23 seconds. With a minimum support of 0.1 the accuracy is only slightly lower, but the tree grows to 25 leaves and 49 nodes.

    .

    .

    , . 0,5

    .

    .

    .

    .

    6-4

    .

  • - 78 -

Min support:   0,0000  0,1000  0,1500  0,2000  0,2500  0,3000  0,4000  0,5000

Class = republican
TP Rate        0,97    0,948   0,948   0,981   0,933   0,839   0,839   1
FP Rate        0,048   0,03    0,036   0,214   0,071   0,143   0,208   1
Precision      0,97    0,981   0,977   0,879   0,954   0,903   0,865   0,614
Recall         0,97    0,948   0,948   0,981   0,933   0,839   0,839   1
F-Measure      0,97    0,964   0,962   0,927   0,943   0,87    0,852   0,761

Class = democrat
TP Rate        0,952   0,97    0,946   0,786   0,929   0,857   0,792   0
FP Rate        0,03    0,052   0,052   0,019   0,067   0,161   0,161   0
Precision      0,952   0,921   0,92    0,964   0,897   0,77    0,756   0
Recall         0,952   0,97    0,964   0,786   0,929   0,857   0,792   0
F-Measure      0,952   0,945   0,942   0,866   0,912   0,811   0,773   0

Table 6-4: Per-class measures for the modified J48 classifier

    6.2.4 JRIP

The results of the modified JRip classifier for different values of the minimum support are shown in Table 6-5.

    .

    .

Table 6-5: Results of the modified JRip classifier for different minimum-support values

As Table 6-5 shows, the best accuracy, 95.4023%, is achieved without a minimum-support threshold, with a rule set of 10 rules built in 0.13 seconds.

Min      Time    Correctly   Incorrectly          Mean abs.  Root mean  Rel. abs.  Root rel. sq.  Total      Correct  Incorrect  Number
support  (s)     classified  classified   Kappa   error      sq. error  error (%)  error (%)      instances  (%)      (%)        of rules
–        0,13    415         20           0,9028  0,0608     0,209      12,8143    42,9263        435        95,4023  4,5977     10
0,1      267,17  401         34           0,8295  0,1389     0,2636     29,2980    54,1344        435        92,1839  7,8161     4
0,15     37,53   380         55           0,7181  0,2073     0,3261     43,7051    66,9844        435        87,3563  12,6437    3
0,2      8,64    371         64           0,6661  0,2374     0,3445     50,0522    70,7564        435        85,2874  14,7126    2
0,25     6,53    383         52           0,7484  0,1803     0,3044     38,0096    62,5219        435        88,0460  11,9540    3
0,3      3,38    390         45           0,7858  0,1523     0,277      32,1116    56,8911        435        89,6552  10,3448    3
0,4      0,66    368         67           0,6711  0,2462     0,3525     51,9209    72,4033        435        84,5977  15,4023    2
0,5      0,14    247         168          0       0,4741     0,4869     99,9750    99,9999        435        56,7816  38,6207    1

  • - 79 -

With a minimum support of 0.1 the accuracy drops to 92.1839% and only 4 rules are produced.

    RIPPER

    .

    . 4 ,

    .

    0,5

    .

    6-5

    .

Min support:   0,0000  0,1000  0,1500  0,2000  0,2500  0,3000  0,4000  0,5000

Class = republican
TP Rate        0,966   0,993   0,989   0,981   0,899   0,884   0,895   1
FP Rate        0,065   0,19    0,31    0,339   0,149   0,083   0,232   1
Precision      0,959   0,892   0,835   0,821   0,906   0,944   0,86    0,614
Recall         0,966   0,993   0,989   0,981   0,899   0,884   0,895   1
F-Measure      0,963   0,94    0,906   0,894   0,902   0,913   0,877   0,761

Class = democrat
TP Rate        0,935   0,81    0,69    0,661   0,851   0,917   0,768   0
FP Rate        0,034   0,007   0,011   0,019   0,101   0,116   0,105   0
Precision      0,946   0,986   0,975   0,957   0,841   0,832   0,822   0
Recall         0,935   0,81    0,69    0,661   0,851   0,917   0,768   0
F-Measure      0,94    0,889   0,808   0,782   0,846   0,873   0,794   0

Table 6-6: Per-class measures for the modified JRip classifier

  • - 80 -

CONCLUSION

    .

    .

    .

    .

    .

    . freesets

    . freesets

    .

    . ,

    ,

    .

  • - 81 -

REFERENCES

1. Ian H. Witten and Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, San Francisco, California, 1999.
2. Margaret H. Dunham: Data Mining: Introductory and Advanced Topics. Prentice Hall, 2003.
3. Sašo Džeroski: Data Mining in a Nutshell.
4. J. R. Quinlan: Induction of Decision Trees. Machine Learning, 1:81-106, 1986.
5. William W. Cohen: Fast Effective Rule Induction. Proceedings of the Twelfth International Conference on Machine Learning, 1995.
6. Johannes Fürnkranz and Gerhard Widmer: Incremental Reduced Error Pruning. Proceedings of the Eleventh International Conference on Machine Learning, 1994.
7. Rakesh Agrawal and Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules. Proceedings of the 20th International Conference on Very Large Data Bases, 1994.
8. Rakesh Agrawal, Tomasz Imielinski and Arun Swami: Mining Association Rules between Sets of Items in Large Databases. Proceedings of the 1993 ACM SIGMOD Conference, Washington DC, May 1993.
9. Bart Goethals: Course Notes in Knowledge Discovery in Databases: Search for Frequent Patterns. http://www.cs.helsinki.fi/u/goethals/dmcourse/chap12.pdf
10. J. F. Boulicaut, A. Bykowski and C. Rigotti: Approximation of Frequency Queries by Means of Free-Sets. PKDD 2000.
