CONTENTS

INTRODUCTION ........................................................................ 3
1. BASIC CONCEPTS OF DATA MINING .................................................... 4
   1.1 Introduction ................................................................. 4
   1.2 Basic tasks in data mining ................................................... 6
2. LEARNING PREDICTIVE MODELS ........................................................ 9
   2.1 Introduction ................................................................. 9
   2.2 Learning probabilistic models ............................................... 10
   2.3 Learning decision trees ..................................................... 12
       2.3.1 Basic algorithm for decision tree induction .......................... 12
       2.3.2 Performance of decision tree learning algorithms ..................... 14
       2.3.3 Decision tree induction with ID3 ..................................... 16
       2.3.4 Decision tree induction with the C4.5 algorithm ...................... 18
             2.3.4.1 A new criterion for attribute selection during tree induction  19
             2.3.4.2 Handling continuous attributes ............................... 19
             2.3.4.3 Handling missing attribute values ............................ 20
             2.3.4.4 Pruning decision trees ....................................... 20
             2.3.4.5 Error estimation ............................................. 21
   2.4 Learning classification rules ............................................... 22
       2.4.1 A covering algorithm for generating classification rules ............. 24
       2.4.2 Criterion for selecting a condition .................................. 26
       2.4.3 Pruning classification rules ......................................... 27
       2.4.4 Incremental Reduced Error Pruning (IREP) ............................. 28
       2.4.5 Improving the IREP algorithm: the RIPPER algorithm ................... 30
3. FREQUENT ITEMSETS AND ALGORITHMS FOR GENERATING THEM ............................ 33
   3.1 Basic concepts of frequent sets ............................................. 33
   3.2 Generating frequent sets .................................................... 36
       3.2.1 The search space ...................................................... 36
       3.2.2 The database .......................................................... 37
   3.3 The Apriori algorithm ....................................................... 38
       3.3.1 Description of the Apriori algorithm ................................. 38
       3.3.2 Data structures ....................................................... 39
             3.3.2.1 Hash tree ..................................................... 40
             3.3.2.2 Prefix tree ................................................... 40
4. MODIFICATION OF THE ALGORITHMS FOR LEARNING PREDICTIVE MODELS FROM SUMMARY DATA  42
   4.1 Introduction ................................................................ 42
   4.2 Learning Bayesian models from frequent itemsets ............................ 42
   4.3 Learning decision trees from frequent itemsets ............................. 44
   4.4 Learning classification rules from frequent itemsets ....................... 45
5. IMPLEMENTATION .................................................................. 47
   5.1 WEKA – Waikato Environment for Knowledge Analysis .......................... 47
   5.2 The structure of WEKA ....................................................... 48
       5.2.1 The weka.core package ................................................ 48
       5.2.2 The weka.classifiers package ......................................... 49
       5.2.3 Other WEKA packages .................................................. 50
   5.3 Implementation of the Apriori algorithm .................................... 50
       5.3.1 The ItemSet class ..................................................... 50
       5.3.2 The FrequentItemsets class ........................................... 51
   5.4 Implementation of the Naive Bayes classifier ............................... 54
       5.4.1 The NaiveBayes class ................................................. 54
   5.5 Implementation of decision tree learning ................................... 56
       5.5.1 The J48 class ......................................................... 57
       5.5.2 The ClassifierTree class ............................................. 58
       5.5.3 The ModelSelection and C45ModelSelection classes ..................... 59
       5.5.4 The ClassifierSplitModel and C45Split classes ........................ 60
       5.5.5 The Distribution class ............................................... 63
       5.5.6 The InfoGainSplitCrit and GainRatioSplitCrit classes ................. 66
   5.6 Implementation of classification rule learning ............................. 67
       5.6.1 The JRip class ........................................................ 67
       5.6.2 The Antd and NominalAntd classes ..................................... 69
       5.6.3 The Rule and RipperRule classes ...................................... 71
6. EXPERIMENTAL RESULTS ............................................................ 72
   6.1 Evaluation of classification models ........................................ 72
   6.2 Results ..................................................................... 74
       6.2.1 The Vote dataset ...................................................... 74
       6.2.2 NaiveBayes ............................................................ 75
       6.2.3 J48 ................................................................... 76
       6.2.4 JRIP .................................................................. 78
CONCLUSION ......................................................................... 80
REFERENCES ......................................................................... 81

Diploma Thesis



  • - 3 -

INTRODUCTION

    .

    ,

    .

    ,

    .

    .

    .

    .

    .

    (frequent itemsets) .

    . 1

    . 2

    .

    3 .

    4, 5 6 , ,

    .

  • - 4 -

1. BASIC CONCEPTS OF DATA MINING

1.1 Introduction

    , ,

    .

    1.1 (Knowledge

    Discovery in Databases KDD) ,

    .1

    1.2 (Data Mining) e KDD

    (

    )

    .

    .

    ,

    (frequently occuring patterns), .

    KDD :

    :

    . ,

    .

    :

    .

    .

    , ,

    .

¹ The definition is taken from: U. Fayyad, G. Piatetsky-Shapiro and P. Smyth: From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Cambridge, MA, 1996.

  • - 5 -

    :

    .

    .

    .

    : ,

    .

    /:

    .

    .

    1-1 KDD ,

    .

    1-1. KDD

    .

    .

    .

    () ().

    (pattern)

    . () ,

    .

    : (equations); (trees); ,

    (rules).

    .

  • - 6 -

    () () ( )

    . ,

    .

    1.2.

    .

    1-2

    .

    .

    , ,

    . ,

    .

    . ,

    . , ,

    .

1.2 Basic tasks in data mining

    . (supervised

    learning), .

  • - 7 -

    .

    .

    ,

    .

    .

    .

    (, .)

    .

    .

    .

    . . ,

    . ,

    .

    .

    .

    ,

    () .

    , .

    .

    .

    .

    .

    .

    ,

    .

    . .

  • - 8 -

    .

    .

    .

    .

    . .

    , .

  • - 9 -

2. LEARNING PREDICTIVE MODELS

2.1 Introduction

Definition 2.1. Let D = {t₁, t₂, ..., tₙ} be a database of examples and C = {C₁, ..., Cₘ} a set of classes. The classification problem is to define a mapping f : D → C that assigns each example tᵢ to exactly one class. A class Cⱼ then contains precisely the examples mapped to it:

Cⱼ = { tᵢ | f(tᵢ) = Cⱼ, 1 ≤ i ≤ n, tᵢ ∈ D }.

    . ,

    .

    .

    .

    :

    1.

    (training data). ,

    .

    .

    2. 1

    .

    :

    .

    , .

    . jC ,

    ( )ji CtP it . ( )jCP ,

  • - 10 -

    ( ) ( )jij CtPCP it jC .

    .

    it , it

    jC . ( )ij tCP .

    it

    .

2.2 Learning probabilistic models

    .

    (2.1)

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)} \qquad (2.1)$$

where $P(X)$ is the prior probability of the observed data $X$ and $P(X \mid H)$ is the probability of observing $X$ when the hypothesis $H$ holds.

    .

Let $X = (x_1, \ldots, x_n)$ be an example described by the values $A_i = x_i$ of its $n$ attributes $A_i$, and let there be $m$ classes. The hypotheses $H$ are the statements $C = c_j$, $j = 1, \ldots, m$, where $c_j$ is one of the class values. The example is assigned to the class $c_k$ with the largest posterior probability $P(C_k \mid X)$, i.e. $P(C_k \mid X) \ge P(C_j \mid X)$ for all $j = 1, \ldots, m$. By Bayes' theorem, $P(C_j \mid X) = \frac{P(X \mid C_j)\,P(C_j)}{P(X)}$. Since $P(X)$ is the same for every class, it suffices to maximise $P(X \mid C_j)\,P(C_j)$.

  • - 11 -

The naive assumption is that the attributes are conditionally independent given the class, so that $P(X \mid C_j)$ is the product of the individual probabilities $P(A_i = x_i \mid C_j)$. What remains is to estimate these conditional probabilities and the priors $P(C_j)$ from the training data.

Let $P(C_j) = P(C = c_j)$ denote the prior probability of class $c_j$ and $P(A_i = v_{ik} \mid C_j)$ the probability of value $v_{ik}$ of attribute $A_i$ within that class. $P(C_j)$ is estimated from the number $n_j = N(c_j)$ of training examples of class $c_j$: $P(C_j) = \frac{n_j}{n}$, where $n$ is the total number of training examples. $P(A_i = v_{ik} \mid C_j)$ is estimated from the count $N(A_i = v_{ik} \wedge C = c_j)$ of examples that have both the attribute value and the class, and the count $N(C = c_j)$:

$$P(A_i = v_{ik} \mid C_j) = \frac{N(A_i = v_{ik} \wedge C = c_j)}{N(C = c_j)}$$

    ,

    .

    m .

$$P(C_j \mid A_i) = \frac{N(C_j, A_i) + 1}{N(A_i) + k} \qquad (2.2)$$

where $k$ is the number of classes.

    m

    m. ,

    . (2.3)

  • - 12 -

$$P(C_j \mid A_i) = \frac{N(C_j, A_i) + m \cdot P(C_j)}{N(A_i) + m} \qquad (2.3)$$

and the class prior is correspondingly estimated by (2.4):

$$P(C_j) = \frac{N(C_j) + 1}{N + k}, \qquad j = 1, \ldots, k \qquad (2.4)$$
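To make these estimates concrete, here is a minimal naive Bayes sketch in plain Java (it is not the WEKA implementation discussed in Chapter 5): attributes are assumed to be nominal and encoded as 0-based integer indices, the class prior uses the correction (2.4), and the conditional probabilities use ordinary add-one smoothing over each attribute's values, which is an adaptation of (2.2) rather than a literal transcription of it.

// Minimal naive Bayes for nominal attributes; names and data layout are illustrative.
public class SimpleNaiveBayes {
    private final int numClasses;
    private final int[] numValues;          // number of values per attribute
    private final int[] classCount;         // N(c_j)
    private final int[][][] jointCount;     // N(A_i = v and C = c_j)
    private int n;                          // total number of training examples

    public SimpleNaiveBayes(int numClasses, int[] numValues) {
        this.numClasses = numClasses;
        this.numValues = numValues;
        this.classCount = new int[numClasses];
        this.jointCount = new int[numValues.length][][];
        for (int i = 0; i < numValues.length; i++)
            jointCount[i] = new int[numValues[i]][numClasses];
    }

    /** Count one training example with attribute values x and class label c. */
    public void add(int[] x, int c) {
        n++;
        classCount[c]++;
        for (int i = 0; i < x.length; i++)
            jointCount[i][x[i]][c]++;
    }

    /** P(C = c) with the correction (2.4). */
    public double classPrior(int c) {
        return (classCount[c] + 1.0) / (n + numClasses);
    }

    /** P(A_i = v | C = c) with add-one smoothing over the values of A_i. */
    public double conditional(int i, int v, int c) {
        return (jointCount[i][v][c] + 1.0) / (classCount[c] + numValues[i]);
    }

    /** Class maximising P(C = c) * prod_i P(A_i = x_i | C = c). */
    public int classify(int[] x) {
        int best = 0;
        double bestScore = -1.0;
        for (int c = 0; c < numClasses; c++) {
            double score = classPrior(c);
            for (int i = 0; i < x.length; i++)
                score *= conditional(i, x[i], c);
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy data: two binary attributes, two classes.
        SimpleNaiveBayes nb = new SimpleNaiveBayes(2, new int[] {2, 2});
        nb.add(new int[] {0, 0}, 0);
        nb.add(new int[] {0, 1}, 0);
        nb.add(new int[] {1, 1}, 1);
        System.out.println(nb.classify(new int[] {0, 0}));   // most likely class 0
    }
}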

    .

    .

    . ,

    . ,

    .

    .

    .

    ,

    . ,

    .

    .

    .

    .

2.3 Learning decision trees

2.3.1 Basic algorithm for decision tree induction

Definition 2.2. Let D = {t₁, t₂, ..., tₙ} be a database in which every example tᵢ = (tᵢ₁, ..., tᵢ_h) is described by the values of the attributes {A₁, A₂, ..., A_h}, and let C = {C₁, ..., Cₘ} be the set of classes. A decision tree (DT) for D is a tree in which:
- every internal node is labelled with an attribute Aᵢ;
- every branch is labelled with a predicate over the values of the attribute in the parent node;
- every leaf is labelled with a class Cⱼ.

    :

    1. :

    .

    2. it D ,

    .

    2.1 ,

    .

    .

    .

    .

    .

    .

    , .

    .

    . ,

    .

    .

    . ,

    .

    .

    (tree pruning).

    .

  • - 14 -

    .

    .

    2.1 : D // : // T=0; ; = ; = ; { D= D; = ; =(D); T= ; }

    2.3.2

    .

    :

    :

    .

    .

    .

    :

    ,

  • - 15 -

    .

    :

    . , ,

    .

    ,

    .

    :

    ,

    .

    .

    .

    :

    .

    .

    .

    (overfitting) .

    :

    .

    ,

    .

    ,

    .

    : ,

    .

    .

    .

  • - 16 -

    .

    . ,

    .

    . ,

    .

    .

    ,

    .

    ,

    , .

    q,

    h . ,

    . ,

    , .

    ( )loghq q .

    n

    . ( )log q ,

    ( )logn q .

    .

2.3.3 Decision tree induction with ID3

    . , ,

    . ID3 [Quinlan 79]

  • - 17 -

    . ,

    .

    ID3

    ( ),

    .

    ID3 :

Assume that the training set D consists of examples of two classes, D = PE ∪ NE, where PE contains the positive and NE the negative examples, with p = |PE| and n = |NE|. An arbitrary example t belongs to PE with probability P(t ∈ PE) = p/(p + n) and to NE with probability P(t ∈ NE) = n/(p + n).

The expected information needed to decide whether an example belongs to PE or to NE is given by (2.5):

$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n} \qquad (2.5)$$

for p > 0 and n > 0, and I(p, n) = 0 otherwise.

If attribute A has N distinct values {v₁, ..., v_N}, the set D is partitioned into subsets {D₁, ..., D_N}, where Dᵢ contains the examples whose value of A is vᵢ. Let pᵢ and nᵢ denote the numbers of examples from PE and NE that fall into Dᵢ; the expected information for the subtree of Dᵢ is I(pᵢ, nᵢ). The expected information of the tree with A at the root is then the weighted average (2.6):

$$EI(A) = \sum_{i=1}^{N} \frac{p_i + n_i}{p + n}\, I(p_i, n_i) \qquad (2.6)$$

where the weight of the i-th branch is the proportion of the examples of D that belong to Dᵢ. The information gained by branching on A is given by (2.7):

$$G(A) = I(p, n) - EI(A) \qquad (2.7)$$
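A self-contained Java sketch of (2.5) to (2.7) for a two-class problem follows; the class name and the counts in main() are an invented toy example, not data from the thesis.

// Computing I(p,n), EI(A) and the gain G(A) from formulas (2.5)-(2.7).
// counts[i][0] / counts[i][1] hold p_i / n_i for the i-th value of attribute A.
public final class InfoGain {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    /** Expected information I(p, n), formula (2.5); 0 when p or n is 0. */
    static double info(double p, double n) {
        if (p == 0 || n == 0) return 0;
        double pp = p / (p + n), pn = n / (p + n);
        return -pp * log2(pp) - pn * log2(pn);
    }

    /** Expected information after splitting on A, formula (2.6). */
    static double expectedInfo(int[][] counts, double p, double n) {
        double ei = 0;
        for (int[] c : counts)
            ei += (c[0] + c[1]) / (p + n) * info(c[0], c[1]);
        return ei;
    }

    /** Information gain G(A) = I(p, n) - EI(A), formula (2.7). */
    static double gain(int[][] counts, int p, int n) {
        return info(p, n) - expectedInfo(counts, p, n);
    }

    public static void main(String[] args) {
        // Toy split: a two-valued attribute; value 1 covers 3 PE / 1 NE,
        // value 2 covers 1 PE / 3 NE, out of p = 4 and n = 4 examples in total.
        int[][] counts = { {3, 1}, {1, 3} };
        System.out.println("G(A) = " + gain(counts, 4, 4));
    }
}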

    ID3

    ( )G A , ,

    { }1,..., ND D . iD

    iD , ,,yes ;

    ,,no ;

    .

    p n+ (

    ) ,

    .

    ( )a p n+ . ID3

    ( )G A .

    ( )( )O ba p n+ b .

    .

    ID3 ( ) ( )( ) ( )( )22O ba p n a p n O ba p n+ + = + . ID3

    .

    . ,

    , ID3

    C4.5 .

2.3.4 Decision tree induction with the C4.5 algorithm

    C4.5 ID3

    ,

    () .

  • - 19 -

2.3.4.1 A new criterion for attribute selection during tree induction

The information gain used by ID3 favours attributes with many values. C4.5 therefore uses the gain ratio criterion, which normalises the gain G(X) of an attribute X by the split information IV(X):

$$\mathrm{gain\ ratio}(X) = \frac{G(X)}{IV(X)}, \qquad IV(X) = -\sum_{i=1}^{N} \frac{p_i + n_i}{p + n}\log_2\frac{p_i + n_i}{p + n},$$

with N, pᵢ, nᵢ, p and n defined as in the previous section. Care is needed for attributes whose IV(X) is close to 0.
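A corresponding Java sketch of the split information and the gain ratio (names are illustrative; returning 0 when IV(X) = 0 is simply a convention chosen for this example):

// Gain ratio: the gain G(X) normalised by the split information IV(X).
// subsetSizes[i] = p_i + n_i, the number of examples with the i-th value of X.
public final class GainRatio {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    /** IV(X) = -sum_i (p_i+n_i)/(p+n) * log2((p_i+n_i)/(p+n)). */
    static double splitInfo(int[] subsetSizes, int total) {
        double iv = 0;
        for (int s : subsetSizes)
            if (s > 0) {
                double f = (double) s / total;
                iv -= f * log2(f);
            }
        return iv;
    }

    /** Gain ratio = G(X) / IV(X); treated as 0 when IV(X) = 0. */
    static double gainRatio(double gain, int[] subsetSizes, int total) {
        double iv = splitInfo(subsetSizes, total);
        return iv == 0 ? 0 : gain / iv;
    }
}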

    .

    2.3.4.2

    ID3 ,

    (datasets) .

    .

    .

    .

    ,

    :

    .

    .

  • - 20 -

    2.3.4.3

    ID3

    .

    :

    ;

    .

    ,

    , .

    .

    ,

    .

    .

    2.3.4.4

    C4.5 : -

    (postpruning) - (prepruning).

    -. -

    .

    -.

    - :

    (subtree replacement) (subtree raising).

  • - 21 -

    ,

    ().

    .

    (accuracy)

    .

    .

    .

    .

    .

2.3.4.5 Error estimation

    (

    )

    .

    .

    .

    .

    (reduced error pruning).

    .

    .

    . N .

    q , N-

    q ,

  • - 22 -

    . :

The estimate is based on a confidence level c (C4.5 uses 0.25 by default). If a node covers N training examples and misclassifies E of them, the observed error rate is f = E/N, while q denotes the unknown true error probability. The value z is chosen so that

$$P\left[\frac{f - q}{\sqrt{q(1-q)/N}} > z\right] = c,$$

and solving for q gives the upper bound that is used as the error estimate:

$$e = \frac{f + \dfrac{z^2}{2N} + z\sqrt{\dfrac{f}{N} - \dfrac{f^2}{N} + \dfrac{z^2}{4N^2}}}{1 + \dfrac{z^2}{N}}$$

where z is the number of standard deviations corresponding to the confidence c; for c = 0.25, z = 0.69.
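The same estimate written as a small Java method (a sketch; the leaf counts in main() are illustrative only):

// Upper bound e on the true error at a node, with confidence c = 0.25 (z = 0.69).
// f = E / N is the observed error rate on the N training examples at the node.
public final class PessimisticError {
    static double estimatedError(double f, double N, double z) {
        return (f + z * z / (2 * N)
                  + z * Math.sqrt(f / N - f * f / N + z * z / (4 * N * N)))
               / (1 + z * z / N);
    }

    public static void main(String[] args) {
        // Example: a leaf with N = 14 examples, 2 of them misclassified.
        double f = 2.0 / 14;
        System.out.println("e = " + estimatedError(f, 14, 0.69));
    }
}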

2.4 Learning classification rules

    .

Each rule consists of a conjunction of conditions (antecedents) and a conclusion (consequent), i.e. it has the form IF conditions THEN conclusion, where the conclusion assigns one of the classes. In general a classification rule looks like:

IF attribute₁ op₁ value₁ AND attribute₂ op₂ value₂ AND ... AND attributeₘ opₘ valueₘ THEN Class = classX

  • - 23 -

    .

    .

    .

    .

    .

    . .

    ,

    .

    .

    ,

    (pruning)

    .

    .

    ,

    . ,

    .

    ,

    . ,

    .

    .

    .

  • - 24 -

2.4.1 A covering algorithm for generating classification rules

    (divide and conquer).

    ,

    .

    .

    .

    (covering aproach)

    .

    (accuracy).

    .

    .

    .

    ,

    -

    .

    .

    t , p

    , t-p .

    .

    p/t.

    .

  • - 25 -

Algorithm 2.2 (covering algorithm)

For each class C:
    initialise E to the full set of training examples;
    while E contains examples of class C do
        create a rule R with an empty left-hand side that predicts class C;
        until R is perfect (or no more attributes remain) do
            for each attribute A not mentioned in R and each value v,
                consider adding the condition A = v to R;
            select the A and v that maximise the accuracy p/t
                (ties are broken in favour of the condition with the largest p);
            add A = v to R;
        remove the examples covered by R from E;

Here t is the number of examples covered by the candidate rule and p the number of those that belong to class C, so p/t is the rule's accuracy on the training data; conditions are added until the rule reaches 100% accuracy or cannot be improved further.
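The heart of one iteration, the choice of the best condition by p/t, might look as follows in Java; the data layout (integer-coded nominal attributes) and all names are assumptions made for the example, not the thesis's implementation.

import java.util.*;

// One step of the covering algorithm: among all conditions "attribute = value" that
// can still be added, pick the one with the highest accuracy p/t on the examples
// currently covered, breaking ties in favour of larger p.
// examples[e][a] is the value of attribute a for example e; labels[e] is its class.
public final class BestCondition {
    static int[] best(int[][] examples, int[] labels, int targetClass,
                      boolean[] covered, Set<Integer> usedAttributes) {
        int bestAttr = -1, bestValue = -1, bestP = -1;
        double bestAcc = -1;
        int numAttrs = examples[0].length;
        for (int a = 0; a < numAttrs; a++) {
            if (usedAttributes.contains(a)) continue;
            // candidate values = values of attribute a seen among the covered examples
            Map<Integer, int[]> stats = new HashMap<>();   // value -> {t, p}
            for (int e = 0; e < examples.length; e++) {
                if (!covered[e]) continue;
                int[] tp = stats.computeIfAbsent(examples[e][a], k -> new int[2]);
                tp[0]++;                                   // t: covered by A = v
                if (labels[e] == targetClass) tp[1]++;     // p: covered and correct
            }
            for (Map.Entry<Integer, int[]> s : stats.entrySet()) {
                int t = s.getValue()[0], p = s.getValue()[1];
                double acc = (double) p / t;
                if (acc > bestAcc || (acc == bestAcc && p > bestP)) {
                    bestAcc = acc; bestP = p; bestAttr = a; bestValue = s.getKey();
                }
            }
        }
        return new int[] { bestAttr, bestValue };          // the condition A = v to add
    }
}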

    .

    .

    .

    .

    (

    )

    .

    .

    (

    ),

    .

    .

  • - 26 -

2.4.2 Criterion for selecting a condition

    ,

    . ,

A natural choice is the accuracy p/t, where t is the total number of examples covered by the rule and p the number of those that are positive. An alternative, information-based criterion evaluates the addition of a condition by

$$p\left(\log\frac{p}{t} - \log\frac{P}{T}\right),$$

where p and t refer to the rule after the new condition is added and P and T to the rule before it.

    ,

    .

    .

    . ,

    , .

    .

    ,

    .

    ,

    .

    .

  • - 27 -

    2.4.3

    .

    (overfitting)

    ,

    .

    :

    Pre-Pruning:

    ,

    ;

    Post-Pruning:

    .

    .

    .

    overfit-and-simplify (

    ) .

    .

    .

    reduced error pruning (REP)

    . REP

    .

    REP ,

    : a (growing set)

    (pruning set).

    ,

    .

    ;

    .

    ,

  • - 28 -

    .

    .

    REP

    .

    REP

    Incremental Reduced Error Pruning (IREP)2. ,

    RIPPER.

    2.4.4 Incremental Reduced Error Pruning (IREP)

    IREP (REP)

    .

Algorithm 2.3 (IREP)

procedure IREP(Pos, Neg)
begin
    Ruleset := ∅;
    while Pos ≠ ∅ do
        /* grow and prune a new rule */
        split (Pos, Neg) into (GrowPos, GrowNeg) and (PrunePos, PruneNeg);
        Rule := GrowRule(GrowPos, GrowNeg);
        Rule := PruneRule(Rule, PrunePos, PruneNeg);
        if the error rate of Rule on (PrunePos, PruneNeg) exceeds 50% then
            return Ruleset;
        else
            add Rule to Ruleset;
            remove examples covered by Rule from (Pos, Neg);
        endif
    endwhile
    return Ruleset;
end

Algorithm 2.3: The IREP algorithm

² The algorithm is taken from: J. Fürnkranz and G. Widmer: Incremental Reduced Error Pruning. In Proceedings of the Eleventh International Conference on Machine Learning, Morgan Kaufmann, 1994.

    , IREP

    , . ,

    .

    , .

    , IREP . ,

    ,

    (growing set) (pruning set).

    2/3 .

    .

Conditions have the form Aₙ = v, A_c ≤ θ or A_c ≥ θ, where Aₙ is a nominal attribute and v one of its values, and A_c is a continuous attribute and θ a threshold. GrowRule greedily adds to the rule the condition that maximises a FOIL-style information gain criterion³.

    , .

$$v(Rule, prunePos, pruneNeg) = \frac{p + (N - n)}{P + N} \qquad (2.8)$$

where P and N are the total numbers of examples in prunePos and pruneNeg respectively, and p and n are the numbers of examples in prunePos and pruneNeg covered by Rule.

    v .

    IREP .

    .

³ The growing heuristic is based on FOIL; see: J. R. Quinlan and R. M. Cameron-Jones: FOIL: A Midterm Report. In Machine Learning: ECML-93, Vienna, Austria, 1993.

  • - 30 -

    1,..., kC C , 1C kC

    IREP

    1C . ,

    IREP

    2C .

    kC (default) .

    .

    .

Analysing IREP, Cohen identified two weaknesses that limit its accuracy: the rule-value metric used during pruning, and the stopping condition under which a single rule with an error rate above 50% terminates the learning of further rules.

2.4.5 Improving the IREP algorithm: the RIPPER algorithm

    :

    ,

    REP

    .

    IREP

    . ;

Consider, for example, a rule R₁ that covers p₁ = 2000 positive and n₁ = 1000 negative examples from the pruning data, and a rule R₂ that covers p₂ = 1000 positive and only n₂ = 1 negative example. The metric (2.8) assigns the higher value to R₁, although R₂ is clearly the better rule.

RIPPER therefore uses the metric (2.9) instead:

$$v^{*}(Rule, prunePos, pruneNeg) = \frac{p - n}{p + n} \qquad (2.9)$$
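The following small Java program evaluates both metrics on the two rules above; the totals P = N = 3000 are an arbitrary choice made for the illustration and are not taken from the text.

// The two pruning metrics side by side: v from (2.8) and v* from (2.9).
public final class RuleValue {
    static double irepValue(double p, double n, double P, double N) {
        return (p + (N - n)) / (P + N);              // formula (2.8)
    }
    static double ripperValue(double p, double n) {
        return (p - n) / (p + n);                    // formula (2.9)
    }
    public static void main(String[] args) {
        double P = 3000, N = 3000;                   // assumed totals for the example
        System.out.printf("R1: v=%.4f  v*=%.4f%n",
                irepValue(2000, 1000, P, N), ripperValue(2000, 1000));
        System.out.printf("R2: v=%.4f  v*=%.4f%n",
                irepValue(1000, 1, P, N), ripperValue(1000, 1));
        // v prefers R1, while v* strongly prefers R2.
    }
}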

    IREP

    50%. ,

    .

RIPPER instead uses a heuristic based on the minimum description length (MDL) principle⁴: after each new rule is added, the total description length of the rule set together with the examples is computed, and rule learning stops when this description length is more than d bits larger than the smallest description length obtained so far. Cohen uses d = 64.

    (

    )

    .

    IREP

    REP .

    IREP

    REP.

    1,..., kR R .

    . iR

    : (replacement) (revision). iR

    'iR

    '1,..., ,...,i kR R R

⁴ MDL stands for Minimum Description Length.

  • - 32 -

    . iR

    .

    ,

    MDL .

    RIPPER (Repeated Incremental Pruning to Produce Error Reduction).

  • - 33 -

3. FREQUENT ITEMSETS AND ALGORITHMS FOR GENERATING THEM

3.1 Basic concepts of frequent sets

    (frequent sets)

    (patterns) , ,

    , , , .

    .

    , .

    .

    .

    , ,

    .

    .

    .

    3.1. R. r R

    R. R (items),

    r (rows). r

    r r t r

    r t

    = t . ,

    .

    X R (items)

    () .

    .

  • - 34 -

Definition 3.2. Let R be a set of items, r a database over R, and X ⊆ R an itemset. X matches a row t ∈ r if X ⊆ t. The set of rows in r matched by X is denoted M(X, r):

M(X, r) = { t ∈ r | X ⊆ t }.

The frequency of X in r is defined by (3.1):

fr(X, r) = |M(X, r)| / |r|    (3.1)

When the database r is clear from the context we simply write M(X) and fr(X). Given a frequency threshold min_fr ∈ [0, 1], the itemset X is called frequent⁵ if fr(X, r) ≥ min_fr.

t1   {A, B, C, D, G}
t2   {A, B, E, F}
t3   {B, I, K}
t4   {A, B, H}
t5   {E, G, J}

Table 3-1: An example database r over R = {A, B, ..., K}

Example 3.1. Consider the database r over R = {A, B, ..., K} shown in Table 3-1. The itemset {A, B} matches the rows M({A, B}, r) = {t1, t2, t4}, so its frequency is fr({A, B}, r) = 3/5 = 0.6.

Rows over {A, B, ..., K} can equivalently be viewed as binary vectors with one position per item A, ..., K; the database of Table 3-1 is shown in this form in Table 3-2.

⁵ Frequent sets are also called large sets or covering sets, and the frequency is also called support.

  • - 35 -

      A  B  C  D  E  F  G  H  I  J  K
t1    1  1  1  1  0  0  1  0  0  0  0
t2    1  1  0  0  1  1  0  0  0  0  0
t3    0  1  0  0  0  0  0  0  1  0  1
t4    1  1  0  0  0  0  0  1  0  0  0
t5    0  0  0  0  1  0  1  0  0  1  0

Table 3-2: The database r represented as a binary matrix

    min_ fr

    .

Definition 3.3. Let R be a set of items, r a database over R, and min_fr a frequency threshold. The collection of frequent sets in r with respect to min_fr is denoted F(r, min_fr), defined by (3.2):

F(r, min_fr) = { X ⊆ R | fr(X, r) ≥ min_fr },    (3.2)

or simply F(r) when the threshold is understood. The frequent sets of size l are given by (3.3):

F_l(r) = { X ∈ F(r) : |X| = l }.    (3.3)

Example 3.2. Let min_fr = 0.3. For the database r of Table 3-1, the collection of frequent sets is F(r, 0.3) = { {A}, {B}, {E}, {G}, {A, B} }.
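The definitions can be checked directly on the toy database of Table 3-1 with a brute-force search, as in the Java sketch below (Apriori, described in Section 3.3, does the same job far more efficiently); all names are chosen for the illustration.

import java.util.*;

// Brute-force computation of fr(X, r) and F(r, min_fr) for the database of Table 3-1.
public final class FrequentSets {
    static final List<Set<Character>> r = List.of(
            Set.of('A','B','C','D','G'),   // t1
            Set.of('A','B','E','F'),       // t2
            Set.of('B','I','K'),           // t3
            Set.of('A','B','H'),           // t4
            Set.of('E','G','J'));          // t5

    /** fr(X, r) = |M(X, r)| / |r|, formula (3.1). */
    static double fr(Set<Character> X) {
        long matches = r.stream().filter(t -> t.containsAll(X)).count();
        return (double) matches / r.size();
    }

    public static void main(String[] args) {
        List<Character> items = "ABCDEFGHIJK".chars().mapToObj(c -> (char) c).toList();
        double minFr = 0.3;
        // enumerate all non-empty subsets of the 11 items
        for (int mask = 1; mask < (1 << items.size()); mask++) {
            Set<Character> X = new TreeSet<>();
            for (int i = 0; i < items.size(); i++)
                if ((mask & (1 << i)) != 0) X.add(items.get(i));
            if (fr(X) >= minFr) System.out.println(X + "  fr = " + fr(X));
        }
        // prints {A}, {B}, {E}, {G} and {A, B}, i.e. F(r, 0.3) from Example 3.2
    }
}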

    .

  • - 36 -

3.2 Generating frequent sets

    .

    ,

    . ,

    .

    .

    3.2.1

    2 R R.

    , .

    . ,

    . sparse

    , dense

    .

    ,

    .

    .

    ,

    .

    ,

    .

  • - 37 -

    .

Property 3.1. For all itemsets X, Y ⊆ R, if X ⊆ Y then fr(X) ≥ fr(Y).
The property follows from the fact that M(Y) ⊆ M(X) whenever X ⊆ Y.

Consequently, a superset of an infrequent set can never be frequent, which allows large parts of the search space to be pruned.

    3.2.2

    (itemsets),

    .

    , .

    . ,

    I.

    .

    .

    (cover).

    X

    ,

    T X T .

    .

    I/O. .

    .

  • - 38 -

    .

    .

3.3 The Apriori algorithm

    AIS Agrawal,

    .

    Apriori

    .

3.3.1 Description of the Apriori algorithm

    3.1.

Algorithm 3.1 (Apriori)
Input: a database r and a minimum support s
Output: F(r, s)

C1 := { {i} | i ∈ R };
k := 1;
while Ck ≠ ∅ do
    // compute the frequencies of all candidate sets
    for all transactions (tid, I) ∈ r do
        for all candidate sets X ∈ Ck do
            if X ⊆ I then increment X.frequency by 1;
    // extract all frequent sets
    Fk := { X ∈ Ck : X.frequency ≥ s };
    // generate new candidate sets
    Ck+1 := ∅;
    for all X, Y ∈ Fk with X[i] = Y[i] for 1 ≤ i ≤ k−1 and X[k] < Y[k] do
        I := X ∪ { Y[k] };
        if ∀ J ⊆ I with |J| = k : J ∈ Fk then add I to Ck+1;
    k := k + 1;

The algorithm searches breadth-first: the candidates Ck+1 of size k+1 are generated only after all frequent sets of size k are known, starting from the singleton candidates C1.
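A compact Java sketch of the candidate-generation step (the join followed by the prune test); itemsets are represented as sorted integer arrays, and the class and method names are chosen for the example only.

import java.util.*;

// Candidate generation in Apriori: two frequent k-sets that agree on their first
// k-1 items are joined into a (k+1)-set, which is kept only if all of its
// k-subsets are themselves frequent (the prune step).
public final class AprioriGen {

    static List<int[]> generateCandidates(List<int[]> Fk) {
        List<int[]> candidates = new ArrayList<>();
        Set<String> frequent = new HashSet<>();
        for (int[] x : Fk) frequent.add(Arrays.toString(x));
        for (int a = 0; a < Fk.size(); a++)
            for (int b = a + 1; b < Fk.size(); b++) {
                int[] X = Fk.get(a), Y = Fk.get(b);
                int k = X.length;
                // join step: identical prefixes of length k-1, different last items
                if (!Arrays.equals(Arrays.copyOf(X, k - 1), Arrays.copyOf(Y, k - 1))
                        || X[k - 1] == Y[k - 1]) continue;
                int[] I = Arrays.copyOf(X, k + 1);
                I[k - 1] = Math.min(X[k - 1], Y[k - 1]);
                I[k]     = Math.max(X[k - 1], Y[k - 1]);
                if (allSubsetsFrequent(I, frequent)) candidates.add(I);
            }
        return candidates;
    }

    /** Prune step: every k-subset of the (k+1)-item candidate must be frequent. */
    static boolean allSubsetsFrequent(int[] I, Set<String> frequent) {
        for (int skip = 0; skip < I.length; skip++) {
            int[] subset = new int[I.length - 1];
            for (int i = 0, j = 0; i < I.length; i++)
                if (i != skip) subset[j++] = I[i];
            if (!frequent.contains(Arrays.toString(subset))) return false;
        }
        return true;
    }
}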

    . , 1C

    R k k+1.

    . kF .

    , kX Y F k-1 X YU .

    kC X YU , k

    kF .

    k ,

    ,

    . kF .

    k+1

    , .

    . ,

    k+2, k+1

    k+1.

    3.3.2

    ,

    .

  • - 40 -

3.3.2.1 Hash tree

    k

    , kF hash .

    hash . hash

    () hash ( ).

    hash .

    hash 1.

    d d+1.

    .

    k X

    ,

    . d,

    hash

    [ ]X d . .

    d ,

    , k d> .

    , . ,

    .

    hash

    i , hash i

    .

    hash .

3.3.2.2 Prefix tree

    (prefix tree,

    trie). , k-

    k-1 . . 1-

  • - 41 -

    . k- k-1 .

    ,

    .

    hash ,

    .

    k, k k

    .

    , .

    , (1)

    o

    (2)

    . .

    ,

    k k-1

    ( k-1 ).

    k-1 X ,

    X .

  • - 42 -

4. MODIFICATION OF THE ALGORITHMS FOR LEARNING PREDICTIVE MODELS FROM SUMMARY DATA

4.1 Introduction

    ,

    .

    ,

    , (dataset)

    .

    .

    ,

    (

    ).

    ,

    .

4.2 Learning Bayesian models from frequent itemsets

As described in Section 2.2, the Naive Bayes classifier requires the conditional probabilities $P(A_i = v_{ik} \mid C_j)$, which are estimated from counts as

$$P(A_i = v_{ik} \mid C_j) = \frac{N(A_i = v_{ik} \wedge C = c_j)}{N(C = c_j)},$$

where $N(A_i = v_{ik} \wedge C = c_j)$ is the number of training examples with $A_i = v_{ik}$ and class $c_j$, and $N(C = c_j)$ is the number of examples of class $c_j$. Exactly the same quantities can be obtained from the frequencies of the corresponding itemsets, namely the itemset $\{A_i = v_{ik}, C = c_j\}$ and the itemset $\{C = c_j\}$. The estimate therefore becomes (4.1):

$$P(A_i = v_{ik} \mid C_j) = \frac{fr(A_i = v_{ik} \wedge C = c_j)}{fr(C = c_j)} \qquad (4.1)$$

    (itemsets)

    .

    Apriori (itemsets)

    () .

    ,

    ,

    .

A complication arises when one of the needed itemsets is not frequent (a non-frequent itemset): its exact frequency is then unknown. Since such a frequency lies somewhere between 0 and min_fr, the modified Naive Bayes approximates it with min_fr / 2.
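A small Java sketch of estimate (4.1) with the min_fr/2 fallback for itemsets that were not found frequent; the map-based representation of the mined itemsets, the string encoding of items, and all names are assumptions made for the illustration.

import java.util.*;

// Estimating P(A_i = v | C = c) from mined itemset frequencies, formula (4.1).
// When a needed itemset was not frequent, its unknown frequency is replaced by minFr / 2.
public final class FrequencyBasedEstimates {
    private final Map<Set<String>, Double> frequency;   // mined itemset -> fr(X, r)
    private final double minFr;

    public FrequencyBasedEstimates(Map<Set<String>, Double> frequency, double minFr) {
        this.frequency = frequency;
        this.minFr = minFr;
    }

    /** fr(X, r) if X was mined as frequent, otherwise the minFr / 2 approximation. */
    double fr(Set<String> itemset) {
        return frequency.getOrDefault(itemset, minFr / 2);
    }

    /** P(A_i = v | C = c) = fr({A_i=v, C=c}) / fr({C=c}). */
    double conditional(String attributeValue, String classValue) {
        return fr(Set.of(attributeValue, classValue)) / fr(Set.of(classValue));
    }

    public static void main(String[] args) {
        // Illustrative frequencies with generic item names, not values from the thesis.
        Map<Set<String>, Double> fr = Map.of(
                Set.of("C=c1"), 0.6,
                Set.of("A1=v", "C=c1"), 0.2);
        FrequencyBasedEstimates est = new FrequencyBasedEstimates(fr, 0.1);
        System.out.println(est.conditional("A1=v", "C=c1"));   // 0.2 / 0.6
    }
}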

    . ,

    ,

    .

    ,

    .

  • - 44 -

4.3 Learning decision trees from frequent itemsets

    2.3

    .

    ,

    .

    ,

    .

Consider a node of the tree that is reached by satisfying the conditions $A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm}$. The class distribution in that node, for every class $C_k$, can be computed from the frequencies $fr(A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm})$ and $fr(A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm}\,\&\,C = c_k)$, according to (4.2):

$$P(C = c_k \mid A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm}) = \frac{fr(A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm}\,\&\,C = c_k)}{fr(A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm})} \qquad (4.2)$$

If the conjunction contains $l$ conditions, the denominator is the frequency of an itemset of size $l$, while the numerator is the frequency of an itemset of size $l + 1$.

    ,

    . (Naive

    Bayes),

    .

    min fr .

  • - 45 -

    . , ,

    .

    ,

    .

    . , ,

    , . 1

    min2

    fr , 2 min2*2

    fr

    min2*

    frg

    min fr , g

    .

    (

    0)

    (

    ) 0 1, .

    .

    .

4.4 Learning classification rules from frequent itemsets

    2.4

    .

    (accuracy)

    ,

    ,

    .

    ,

    .

  • - 46 -

For a classification rule of the form

IF $A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm}$ THEN $C = c_k$

the accuracy can be computed from the frequencies $fr(A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm})$ and $fr(A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm}\,\&\,C = c_k)$, according to (4.3):

$$\mathrm{Accuracy} = \frac{fr(A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm}\,\&\,C = c_k)}{fr(A_i = a_{ij}\,\&\,\ldots\,\&\,A_l = a_{lm})} \qquad (4.3)$$

As before, if the rule body contains $l$ conditions, the denominator is the frequency of an itemset of size $l$ and the numerator of an itemset of size $l + 1$.

    .
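The computation itself is a single division; the sketch below (plain Java, illustrative names) also shows how the covered-example counts p and t used in Chapter 2 can be recovered from the frequencies when the number of rows in the database is known.

// Rule accuracy according to (4.3), computed from itemset frequencies.
public final class RuleAccuracy {
    /** Accuracy = fr(body AND class) / fr(body). */
    static double accuracy(double frBodyAndClass, double frBody) {
        return frBodyAndClass / frBody;
    }

    /** Covered-example counts from Chapter 2: p = fr(body AND class)*|r|, t = fr(body)*|r|. */
    static long[] counts(double frBodyAndClass, double frBody, long numRows) {
        return new long[] { Math.round(frBodyAndClass * numRows),
                            Math.round(frBody * numRows) };
    }
}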

  • - 47 -

5. IMPLEMENTATION

5.1 WEKA – Waikato Environment for Knowledge Analysis

The implementation is based on WEKA, a machine learning workbench developed at the University of Waikato in New Zealand. The name WEKA is an acronym for Waikato Environment for Knowledge Analysis.

    WEKA JAVA,

    Windows, Linux

    Macintosh. JAVA

    ,

    .

    WEKA . ,

    .

    ,

    .

    WEKA

    .

    . .

    .

    .

    , WEKA

    .

    .

  • - 48 -

    WEKA

    .

    WEKA,

    .

    ,

    WEKA

    .

    WEKA

    .

    5.2 WEKA

    WEKA .

    WEKA

    .

    .

    5.2.1 weka.core

    core WEKA .

    .

    core : Attribute, Instance Instances.

    Attribute . ,

    .

    Instance

    .

    Instances

    .

  • - 49 -

    5.2.2 weka.classifiers

    classifiers

    . Classifier,

    .

    5-1.

    5-1 UML Classifier

    : buildClassifier(), classifyInstance()

    distributionForInstance().

    Classifier

    .

    .

    JAVA .

    .

  • - 50 -

    5.2.3 WEKA

    WEKA : weka.associations, weka.clusterers,

    weka.estimators, weka.filters, weka.attributeSelection .

    weka.associations

    priori

    .

    .

    5.3 Apriori

    weka.associations

    Apriori.

    Apriori,

    .

    FrequentItemsets

    weka.classifiers.freqitemsets.

    5.3.1 ItemSet ItemSet ()

    . 5-2.

    ,

    .

    m_items

    .

    , i

    i .

    ,

    -1.

  • - 51 -

    .

    WEKA 0

    1.

    5-2 UML ItemSet

    ItemSet .

    ,

    , Apriori

    .

    5.3.2 FrequentItemsets

    FrequentItemsets Apriori

    . 5-3.

  • - 52 -

    5-3 UML FrequentItemsets

    Apriori

    m_Ls FastVector weka.core.

    m_minSuppot,

    m_upperBoundMinSupport .

    FrequentItemsets

    FrequentItemsets()

    .

    buildAssociations() .

    findLargeItemSets() Apriori .

    findLargeItemSets() 5.4.

/**
 * Method that finds all large itemsets for the given set of instances.
 *
 * @param instances the instances to be used
 * @exception Exception if an attribute is numeric
 */
private void findLargeItemSets(Instances instances) throws Exception {
  FastVector kMinusOneSets, kSets;
  Hashtable hashtable;
  int necSupport, necMaxSupport, i = 0;

  m_instances = instances;

  // Find large itemsets
  // minimum support
  necSupport = (int) Math.floor((m_minSupport * (double) instances.numInstances()));
  necMaxSupport = (int) Math.floor((m_upperBoundMinSupport * (double) instances.numInstances()));

  kSets = ItemSet.singletons(instances);
  ItemSet.upDateCounters(kSets, instances);
  kSets = ItemSet.deleteItemSets(kSets, necSupport, necMaxSupport);
  if (kSets.size() == 0)
    return;
  do {
    m_Ls.addElement(kSets);
    kMinusOneSets = kSets;
    kSets = ItemSet.mergeAllItemSets(kMinusOneSets, i, instances.numInstances());
    hashtable = ItemSet.getHashtable(kMinusOneSets, kMinusOneSets.size());
    m_hashtables.addElement(hashtable);
    kSets = ItemSet.pruneItemSets(kSets, hashtable);
    ItemSet.upDateCounters(kSets, instances);
    kSets = ItemSet.deleteItemSets(kSets, necSupport, necMaxSupport);
    i++;
  } while (kSets.size() > 0);
}

Figure 5-4: The findLargeItemSets() method, which implements the Apriori algorithm

    singletons()

    ().

    kSets FastVector.

    .

    -1

    .

    hash

    .

    . m_Ls

    i6

    .

    6 () (itemset).

  • - 54 -

    5.4

    () WEKA

    NaiveBayes.

    .

    5.4.1 NaiveBayes

    NaiveBayes

    , 2.2.

    5-5.

    ,

    . m_Counts [klasa] [atribut]

    [vrednost na atribut]

    , m_Devs [klasa] [atribut] m_Means [klasa] [atribut]

    m_Priors [klasa]

    . ,

    NaiveBayes

    Classifiers.

  • - 55 -

    5-5 UML NaiveBayes

    buildClassifier()

    .

    , .

    FrequentItemsets

    . , ,

    1 2.

    calculateProbs()

    . 5-6

    .

    //za klasna verojatnost for(int i=0;i

  • - 56 -

    if(help[j]!=-1) { m_Counts[help[help.length-1]][j][help[j]]=count; } } }

    5-6 NaiveBayes

    ,

    1

    ,

    m_Priors [klasa]

    .

    m_Counts [klasa] [atribut] [vrednost na atribut]

    2.

    , min2

    fr .

    ,

    distributionForInstance()

    .

    .

    .

    5.5

    WEKA

    . .

    ,

    :

  • - 57 -

    , ,

    , ,

    .

    ,

    .

5.5.1 The J48 class

J48 is WEKA's implementation of the C4.5 decision tree learner⁷, described in Section 2.3.

    5-6.

    5-7 UML 48

    7 : Ross

    Quinlan (1993). C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA.

  • - 58 -

    . 48 Classifiers

    buildClassifier(), classifyInstance()

    distributionForInstance(), .

    5.5.2 ClassifierTree

    ClassifierTree

    . 5-8.

    5-8 UML ClassifierTree

    .

    :

    (ModelSelection m_toSelectModel), (ClassifierSplitModel

    m_localModel), (ClassifierTree [] m_sons),

    (boolean m_isLeaf)

  • - 59 -

    ,

    (int []m_usedAttr).

    Distribution

    .

    buildTree()

    selectModel()

    m_toSelectModel. ClassifierSplitModel

    .

    ,

    ClassifierTree ()

    .

    .

    .

    WEKA ClassifierTree

    .

    C45PrunableClassifierTree

    2.3.4.4. PrunableClassifierTree

    IREP(Increased Reduced Error Pruning) .

    , .

    48.

    5.5.3 ModelSelection C45ModelSelection

    ModelSelection selectModel()

    .

    ClassifierSplitModel

    . C45ModelSelection BinC45ModelSelection

    ModelSelection. BinC45ModelSelection

    , . ,

    C45ModelSelection (

  • - 60 -

    ).

    C45ModelSelection 5-9.

    5-9 UML C45ModelSelection

    selectModel() .

    :

    . NoSplit.

    C45Split

    . ,

    gain ratio 2.3.4.1.

    .

    selectModel() BinC45ModelSelection

    , .

    5.5.4 ClassifierSplitModel C45Split

    ClassifierSplitModel

    .

    5-10.

  • - 61 -

    5-10 UML ClassifierSplitModel

    : m_attIndex

    , m_distribution

    m_numSubsets

    .

    :

    C45Split, Bin C45Split NoSplit.

    C45Split C45 .

    5-11.

    selectModel() .

    m_infoGain m_gainRatio,

    InfoGainSplitCrit GainRatioSplitCrit.

    m_splitPoint

    .

  • - 62 -

    5-11 UML C45Split

    buildClassifier()

    .

    handleEnumeratedAttribute()

    . Distribution

    .

    InfoGainSplitCrit GainRatioSplitCrit

    .

    handleNumericAttribute()

    .

    m_splitPoint,

    .

    Bin C45Split

    .

    .

  • - 63 -

    5.5.5 Distribution

    Distribution

    .

    5-12.

    5-12 UML Distribution

  • - 64 -

    : m_perClassPerBag[klasa][vrednost

    na atribut]

    , m_perBag[vrednost na atribut]

    , m_perClass[klasa]

    total

    .

    48.

    .

    populateDistribution() 5-13.

    public final void populateDistribution(int attIndex,int[] usedAtt,FrequentItemsets itemsets){ int bags=this.numBags(); int[] usedAttr=usedAtt; int numClasses=this.numClasses(); int numOfUsedAttr=0; for(int i=0;i

  • - 65 -

    } } } } }//end if

    for(int i=0;i

  • - 66 -

    m_perClassPerBag[][]

    m_perBag[] m_perClass[].

    5.5.6 InfoGainSplitCrit GainRatioSplitCrit

    InfoGainSplitCrit

    GainRatioSplitCrit. 5-14

    5-15 .

    5-14 UML InfoGainSplitCrit

    5-15 UML GainRatioSplitCrit

    EntropyBasedSplitCrit.

  • - 67 -

    .

    .

    5.6

    RIPPER WEKA

    JRIP. JRI

    : Antd ,

    NumericAntd NominalAntd Antd

    RipperRule .

    RuleStats

    .

    5.6.1 JRip

    JRip Classifiers

    buildClassifier(), classifyInstance() distributionForInstance(),

    . 5-16.

    JRip :

    FastVector m_Ruleset, Attribute m_Class,

    FastVector m_RulesetStats

    .

    buildClassifier()

    . :

    .

    ,

    .

    rulesetForOneClass(),

    .

  • - 68 -

    5-16 UML Jrip

    M rulesetForOneClass() RIPPER .

    .

    .

    RipperRule

    .

    .

    .

  • - 69 -

    5.6.2 Antd NominalAntd

    Antd .

    .

    5-17.

    5-17 UML Antd

    : att, value,

    maxInfoGain,

    accuRate, cover

    accu. splitData()

    ,

    NominalAntd Antd

    . 5-18.

  • - 70 -

    5-18 UML NominalAntd

    NominalAntd accurate[] coverage[]

    .

    splitData()

    .

    .

    .

    NominalAntd .

    ,

    - .

    .

  • - 71 -

    5.6.3 Rule RipperRule

    Rule

    .

    RipperRule Rule.

    5-19.

    5-19 UML RipperRule

    m_Antds

    m_Consequent.

    : grow() prune(),

    .

    grow() .

    .

    .

    .

    -

    . .

    prune()

    .

  • - 72 -

6. EXPERIMENTAL RESULTS

6.1 Evaluation of classification models

    .

    ,

    ,

    .

    , .

    ,

    .

    :

    .

    6-1.

    6-1

    :

  • - 73 -

    ?

    .

    .

    . ,

    holdout .

    .

    .

    ( 10 ).

    .

    holdout ,

    .

A better estimate is obtained with k-fold cross-validation: the data set is split into k parts (folds) of roughly equal size. Each fold is used once as the test set while the remaining k−1 folds are used for training, and the procedure is repeated k times, so that every example is used for testing exactly once. The usual choice is k = 10.
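A plain (non-stratified) k-fold split of example indices could be produced as in the following Java sketch; the class and method names are chosen for the illustration.

import java.util.*;

// A plain k-fold split: fold i is used once as the test set, the other k-1 folds for training.
public final class KFold {
    static List<List<Integer>> folds(int numExamples, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numExamples; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < indices.size(); i++)
            folds.get(i % k).add(indices.get(i));
        return folds;
    }

    public static void main(String[] args) {
        // 10-fold split of 435 examples: ten test folds of 43 or 44 examples each.
        for (List<Integer> fold : folds(435, 10, 1))
            System.out.println("test-fold size = " + fold.size());
    }
}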

    -

    k

    t- .

    :

    recall, ROC .

Precision and recall are defined as:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

  • - 74 -

where TP is the number of true positives, FP the number of false positives, TN the number of true negatives and FN the number of false negatives. Precision measures the proportion of predicted positives that are truly positive, while recall measures the proportion of actual positives that are recognised.

    ,

    .

ROC analysis is based on the TP rate and the FP rate, defined as:

$$\mathrm{TP\ rate} = \frac{TP}{TP + FN}, \qquad \mathrm{FP\ rate} = \frac{FP}{FP + TN}$$
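The four measures as a small Java helper; the counts in main() are illustrative only and are not taken from the tables below.

// Precision, recall, TP rate and FP rate from the entries of a binary confusion matrix.
public final class BinaryMeasures {
    final double tp, fp, tn, fn;

    BinaryMeasures(double tp, double fp, double tn, double fn) {
        this.tp = tp; this.fp = fp; this.tn = tn; this.fn = fn;
    }

    double precision() { return tp / (tp + fp); }
    double recall()    { return tp / (tp + fn); }   // identical to the TP rate
    double tpRate()    { return tp / (tp + fn); }
    double fpRate()    { return fp / (fp + tn); }

    public static void main(String[] args) {
        BinaryMeasures m = new BinaryMeasures(238, 14, 154, 29);   // invented counts
        System.out.printf("precision=%.3f recall=%.3f fpRate=%.3f%n",
                          m.precision(), m.recall(), m.fpRate());
    }
}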

    .

6.2 Results

The experiments were carried out on a dataset from the UCI machine learning repository. For every experiment the frequent itemsets were first generated with the Apriori implementation from Chapter 5, for a range of minimum-support values, and the modified classifiers were then built from those itemsets and evaluated.

6.2.1 The Vote dataset

The experiments use the Vote dataset, which contains the 1984 United States congressional voting records. It consists of 435 instances, 267 of class democrat and 168 of class republican, described by 16 attributes. The class attribute has the two values democrat and republican, and every other attribute takes the values y and n. Figure 6-2 lists the attributes.

Attribute Information:

  • - 75 -

% 1. Class Name: 2 (democrat, republican)
% 2. handicapped-infants: 2 (y,n)
% 3. water-project-cost-sharing: 2 (y,n)
% 4. adoption-of-the-budget-resolution: 2 (y,n)
% 5. physician-fee-freeze: 2 (y,n)
% 6. el-salvador-aid: 2 (y,n)
% 7. religious-groups-in-schools: 2 (y,n)
% 8. anti-satellite-test-ban: 2 (y,n)
% 9. aid-to-nicaraguan-contras: 2 (y,n)
% 10. mx-missile: 2 (y,n)
% 11. immigration: 2 (y,n)
% 12. synfuels-corporation-cutback: 2 (y,n)
% 13. education-spending: 2 (y,n)
% 14. superfund-right-to-sue: 2 (y,n)
% 15. crime: 2 (y,n)
% 16. duty-free-exports: 2 (y,n)
% 17. export-administration-act-south-africa: 2 (y,n)

Figure 6-2: Attributes of the Vote dataset

    6.2.2 NaiveBayes

The results of the modified NaiveBayes classifier for different values of the minimum support are shown in Table 6-1.

Min      Time   Correctly   Incorrectly          Mean abs.  Root mean   Rel. abs.  Root rel. sq.  Total      Correct  Incorrect
support  (s)    classified  classified   Kappa   error      sq. error   error (%)  error (%)      instances  (%)      (%)
–        0,06   392         43           0,7949  0,0995     0,2977      20,9815    61,1406        435        90,1149  9,8851
0,01     1,26   391         44           0,7904  0,0996     0,2977      21,011     61,1482        435        89,8851  10,1149
0,05     1,08   388         47           0,7763  0,1059     0,3072      22,339     63,1038        435        89,1954  10,8046
0,1      0,95   388         47           0,7763  0,1093     0,3116      23,0409    63,9927        435        89,1954  10,8046
0,15     0,72   388         47           0,7787  0,1151     0,3168      24,2646    65,0684        435        89,1954  10,8046
0,2      0,56   384         56           0,7614  0,1187     0,3196      25,0286    65,6502        435        88,2759  12,8736
0,25     0,48   385         50           0,7668  0,1228     0,3224      25,8928    66,2183        435        88,5057  11,4943
0,3      0,31   383         52           0,7585  0,1314     0,3284      27,7174    67,4427        435        88,0460  11,9540
0,4      0,31   383         52           0,7585  0,1314     0,3284      27,7174    67,4427        435        88,0460  11,9540
0,5      0,31   383         52           0,7585  0,1314     0,3284      27,7174    67,4427        435        88,0460  11,9540

Table 6-1: Results of the modified NaiveBayes classifier for different minimum-support values

As Table 6-1 shows, the highest accuracy, 90.1149%, is achieved without a minimum-support threshold, i.e. by the original NaiveBayes classifier.

  • - 76 -

    .

    .

    0,01 0,15

    ,

    .

    .

    .

    .

    .

    6-2

    .

Min support:   0,0000  0,1000  0,1500  0,2000  0,2500  0,3000  0,4000  0,5000

Class = republican
TP Rate        0,8910  0,8800  0,8610  0,8430  0,8390  0,8280  0,8500  0,8840
FP Rate        0,0830  0,0890  0,0600  0,0540  0,0420  0,0360  0,0600  0,0830
Precision      0,9440  0,9400  0,9580  0,9620  0,9700  0,9740  0,9580  0,9440
Recall         0,8910  0,8800  0,8610  0,8430  0,8390  0,8280  0,8500  0,8840
F-Measure      0,9170  0,9090  0,9070  0,8980  0,9000  0,8950  0,9010  0,9130

Class = democrat
TP Rate        0,9170  0,9110  0,9400  0,9460  0,9580  0,9640  0,9400  0,9170
FP Rate        0,1090  0,1200  0,1390  0,1570  0,1610  0,1720  0,1500  0,1160
Precision      0,8420  0,8270  0,8100  0,7910  0,7890  0,7790  0,7980  0,8320
Recall         0,9170  0,9110  0,9400  0,9460  0,9580  0,9640  0,9400  0,9170
F-Measure      0,8770  0,8670  0,8710  0,8620  0,8660  0,8620  0,8630  0,8730

Table 6-2: Per-class measures for the modified NaiveBayes classifier

    6.2.3 J48

The results of the modified J48 classifier for different values of the minimum support are shown in Table 6-3.

  • - 77 -

Min      Time    Correctly   Incorrectly          Mean abs.  Root mean  Rel. abs.  Root rel. sq.  Total      Correct  Incorrect          Tree
support  (s)     classified  classified   Kappa   error      sq. error  error (%)  error (%)      instances  (%)      (%)        Leaves  size
–        0,23    419         16           0,922   0,0552     0,1748     11,6378    35,892         435        96,3218  3,6782     19      37
0,1      107,08  416         19           0,909   0,0799     0,1993     16,8547    40,928         435        95,6322  4,3678     25      49
0,15     13,64   415         20           0,904   0,0802     0,1971     16,9015    40,484         435        95,4023  4,5977     22      43
0,2      5,25    394         41           0,794   0,1355     0,2596     28,5643    53,308         435        90,5747  9,4253     19      37
0,25     1,74    405         30           0,856   0,1272     0,241      26,8127    49,507         435        93,1034  6,8966     9       17
0,3      0,94    368         67           0,682   0,2407     0,3441     50,7518    70,68          435        84,5977  15,4023    4       7
0,4      0,19    357         78           0,625   0,2809     0,3695     59,2332    75,882         435        82,0690  17,9310    2       3
0,5      0,09    267         168          0       0,4741     0,4869     99,9723    100            435        61,3793  38,6207    1       1

Table 6-3: Results of the modified J48 classifier for different minimum-support values

As Table 6-3 shows, the best accuracy, 96.3218%, is achieved without a minimum-support threshold, with a tree of 19 leaves and 37 nodes built in 0.23 seconds. With a minimum support of 0.1 the accuracy is only slightly lower, but the tree grows to 25 leaves and 49 nodes.

    .

    .

    , . 0,5

    .

    .

    .

    .

    6-4

    .

  • - 78 -

Min support:   0,0000  0,1000  0,1500  0,2000  0,2500  0,3000  0,4000  0,5000

Class = republican
TP Rate        0,97    0,948   0,948   0,981   0,933   0,839   0,839   1
FP Rate        0,048   0,03    0,036   0,214   0,071   0,143   0,208   1
Precision      0,97    0,981   0,977   0,879   0,954   0,903   0,865   0,614
Recall         0,97    0,948   0,948   0,981   0,933   0,839   0,839   1
F-Measure      0,97    0,964   0,962   0,927   0,943   0,87    0,852   0,761

Class = democrat
TP Rate        0,952   0,97    0,946   0,786   0,929   0,857   0,792   0
FP Rate        0,03    0,052   0,052   0,019   0,067   0,161   0,161   0
Precision      0,952   0,921   0,92    0,964   0,897   0,77    0,756   0
Recall         0,952   0,97    0,964   0,786   0,929   0,857   0,792   0
F-Measure      0,952   0,945   0,942   0,866   0,912   0,811   0,773   0

Table 6-4: Per-class measures for the modified J48 classifier

    6.2.4 JRIP

The results of the modified JRip classifier for different values of the minimum support are shown in Table 6-5.

    .

    .

Table 6-5: Results of the modified JRip classifier for different minimum-support values

As Table 6-5 shows, the best accuracy, 95.4023%, is achieved without a minimum-support threshold, with a rule set of 10 rules built in 0.13 seconds.

Min      Time    Correctly   Incorrectly          Mean abs.  Root mean  Rel. abs.  Root rel. sq.  Total      Correct  Incorrect  Number
support  (s)     classified  classified   Kappa   error      sq. error  error (%)  error (%)      instances  (%)      (%)        of rules
–        0,13    415         20           0,9028  0,0608     0,209      12,8143    42,9263        435        95,4023  4,5977     10
0,1      267,17  401         34           0,8295  0,1389     0,2636     29,2980    54,1344        435        92,1839  7,8161     4
0,15     37,53   380         55           0,7181  0,2073     0,3261     43,7051    66,9844        435        87,3563  12,6437    3
0,2      8,64    371         64           0,6661  0,2374     0,3445     50,0522    70,7564        435        85,2874  14,7126    2
0,25     6,53    383         52           0,7484  0,1803     0,3044     38,0096    62,5219        435        88,0460  11,9540    3
0,3      3,38    390         45           0,7858  0,1523     0,277      32,1116    56,8911        435        89,6552  10,3448    3
0,4      0,66    368         67           0,6711  0,2462     0,3525     51,9209    72,4033        435        84,5977  15,4023    2
0,5      0,14    247         168          0       0,4741     0,4869     99,9750    99,9999        435        56,7816  38,6207    1

  • - 79 -

With a minimum support of 0.1 the accuracy drops to 92.1839% and only 4 rules are produced.

    RIPPER

    .

    . 4 ,

    .

    0,5

    .

    6-5

    .

Min support:   0,0000  0,1000  0,1500  0,2000  0,2500  0,3000  0,4000  0,5000

Class = republican
TP Rate        0,966   0,993   0,989   0,981   0,899   0,884   0,895   1
FP Rate        0,065   0,19    0,31    0,339   0,149   0,083   0,232   1
Precision      0,959   0,892   0,835   0,821   0,906   0,944   0,86    0,614
Recall         0,966   0,993   0,989   0,981   0,899   0,884   0,895   1
F-Measure      0,963   0,94    0,906   0,894   0,902   0,913   0,877   0,761

Class = democrat
TP Rate        0,935   0,81    0,69    0,661   0,851   0,917   0,768   0
FP Rate        0,034   0,007   0,011   0,019   0,101   0,116   0,105   0
Precision      0,946   0,986   0,975   0,957   0,841   0,832   0,822   0
Recall         0,935   0,81    0,69    0,661   0,851   0,917   0,768   0
F-Measure      0,94    0,889   0,808   0,782   0,846   0,873   0,794   0

Table 6-6: Per-class measures for the modified JRip classifier

  • - 80 -

CONCLUSION

    .

    .

    .

    .

    .

    . freesets

    . freesets

    .

    . ,

    ,

    .

  • - 81 -

REFERENCES

1. Ian H. Witten and Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, San Francisco, California, 1999.
2. Margaret H. Dunham: Data Mining: Introductory and Advanced Topics. Prentice Hall, 2003.
3. Sašo Džeroski: Data Mining in a Nutshell.
4. J. R. Quinlan: Induction of Decision Trees. Machine Learning, 1:81-106, 1986.
5. William W. Cohen: Fast Effective Rule Induction. Proceedings of the Twelfth International Conference on Machine Learning, 1995.
6. Johannes Fürnkranz and Gerhard Widmer: Incremental Reduced Error Pruning. Proceedings of the Eleventh International Conference on Machine Learning, 1994.
7. Rakesh Agrawal and Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules. Proceedings of the 20th International Conference on Very Large Data Bases, 1994.
8. Rakesh Agrawal, Tomasz Imielinski and Arun Swami: Mining Association Rules between Sets of Items in Large Databases. Proceedings of the 1993 ACM SIGMOD Conference, Washington DC, May 1993.
9. Bart Goethals: Course Notes in Knowledge Discovery in Databases: Search for Frequent Patterns. http://www.cs.helsinki.fi/u/goethals/dmcourse/chap12.pdf
10. J. F. Boulicaut, A. Bykowski and C. Rigotti: Approximation of Frequency Queries by Means of Free-Sets. PKDD 2000.
