Modifying Logic of Discovery for Dealing with Domain Knowledge in Data Mining Jan Rauch University of Economics, Prague Czech Republic


Page 2

Modifying Logic of Discovery for Dealing with Domain Knowledge in Data Mining

Presented: an idea of a theoretical approach

There are software tools for partial steps

o Logic of discovery

o Modifications

o 4ft-Discoverer

Page 3

Logic of Discovery

Can computers formulate and verify scientific hypotheses?

Can computers in a rational way analyze empirical data and produce reasonable reflection of the observed empirical world? Can it be done using mathematical logic and statistics?

1978

Page 4

Logic of Discovery (simplified)

Diagram labels: Data matrix M; State dependent structure; Theoretical statements, Theoretical calculi; Observational statements, Observational calculi; 1:1; Statistical hypothesis tests

Page 5

Association rules – observational statements

Example: $A_1(1,3) \wedge A_2(6) \approx A_K(2,4,5)$

4ft-table of $\varphi$ and $\psi$ in M:

M | $\psi$ | $\neg\psi$
$\varphi$ | a | b
$\neg\varphi$ | c | d

$\Rightarrow_{p,Base}$: $\frac{a}{a+b} \ge p \wedge a \ge Base$

$\sim^{+}_{p,Base}$: $\frac{a}{a+b} \ge (1+p)\frac{a+c}{a+b+c+d} \wedge a \ge Base$

…, statistical hypothesis tests

$Val(\varphi \approx \psi, M) \in \{0,1\}$
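As a minimal sketch of how an association rule is evaluated on a data matrix, the following Python code counts the four frequencies a, b, c, d for a rule such as A1(1,3) ∧ A2(6) ≈ AK(2,4,5) and then applies the founded implication condition a/(a+b) ≥ p ∧ a ≥ Base. The toy data matrix, the function names and the representation of a literal as a pair (attribute, allowed values) are assumptions made for illustration only; they are not taken from the talk or from any existing tool.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, int]
# A literal such as A1(1,3) is modelled as (attribute, set of allowed values).
Literal = Tuple[str, Set[int]]


def holds(row: Row, literals: List[Literal]) -> bool:
    """True if the row satisfies the conjunction of all literals."""
    return all(row[attr] in values for attr, values in literals)


def four_fold_table(rows: List[Row],
                    antecedent: List[Literal],
                    succedent: List[Literal]) -> Tuple[int, int, int, int]:
    """Count the frequencies (a, b, c, d) of the 4ft-table."""
    a = b = c = d = 0
    for row in rows:
        phi, psi = holds(row, antecedent), holds(row, succedent)
        if phi and psi:
            a += 1
        elif phi:
            b += 1
        elif psi:
            c += 1
        else:
            d += 1
    return a, b, c, d


def founded_implication(a: int, b: int, p: float, base: int) -> bool:
    """Val for the founded implication: a/(a+b) >= p and a >= Base."""
    return a + b > 0 and a / (a + b) >= p and a >= base


# Hypothetical toy data matrix (not from the talk).
rows = [
    {"A1": 1, "A2": 6, "AK": 2},
    {"A1": 3, "A2": 6, "AK": 4},
    {"A1": 3, "A2": 6, "AK": 7},
    {"A1": 2, "A2": 6, "AK": 5},
    {"A1": 1, "A2": 5, "AK": 1},
]
antecedent = [("A1", {1, 3}), ("A2", {6})]
succedent = [("AK", {2, 4, 5})]

table = four_fold_table(rows, antecedent, succedent)
print(table)                                              # (2, 1, 1, 1)
print(founded_implication(table[0], table[1], 0.6, 2))    # True
```

With p = 0.9 the same rule would be false on this toy matrix, because a/(a+b) = 2/3.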

Page 6

GUHA Procedure ASSOC – a tool for finding a set of interesting association rules

Input: data matrix M together with the definition of the relevant antecedents $\varphi$, succedents $\psi$ and the 4ft-quantifier $\approx$

Output: all prime $\varphi \approx \psi$, evaluated on the 4ft-table (a, b, c, d) of M, with $Val(\varphi \approx \psi, M) = 1$

$\varphi \approx \psi$ is prime: it is true in M and does not logically follow from another, simpler rule

Page 7

Deduction rules in logic of association rules

Examples:

$\frac{A_1(1,3) \Rightarrow_{0.9,10} A_2(2)}{A_1(1,3) \Rightarrow_{0.9,10} A_2(2,3,4)}$, $\frac{A_1(1,3) \wedge A_3(4) \Rightarrow_{0.9,10} A_2(2)}{A_1(1,3) \Rightarrow_{0.9,10} \neg A_3(4) \vee A_2(2) \vee A_4(8)}$

Theorem for $\Rightarrow_{p,Base}$: $\frac{\varphi \Rightarrow_{p,Base} \psi}{\varphi' \Rightarrow_{p,Base} \psi'}$ is correct if and only if (1) or (2):

(1) $\Phi_{1A}$ and $\Phi_{1B}$ are tautologies of propositional calculus

(2) $\Phi_2$ is a tautology

$\Phi_{1A}$, $\Phi_{1B}$, $\Phi_2$ are created from $\varphi$, $\psi$, $\varphi'$, $\psi'$

Theorems for additional 4ft-quantifiers: criteria for when $\frac{\varphi \approx \psi}{\varphi' \approx \psi'}$ is correct

Applications: prime rules + dealing with knowledge in data mining
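To make the idea of a correct deduction rule concrete, here is a small Python sketch that checks a sufficient condition for soundness with implicational quantifiers. It does not implement the tautology criterion of the theorem. Instead it relies on the truth preservation condition a' ≥ a ∧ b' ≤ b: if every object counted in a is also counted in a', and every object counted in b' is also counted in b, then the inequalities hold in every data matrix. The attribute domains and all function names are assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical attribute domains, chosen only for this illustration.
DOMAINS = {"A1": range(1, 5), "A2": range(1, 5)}


def conj(literals):
    """Predicate for a conjunction of literals (attribute, allowed values)."""
    return lambda obj: all(obj[attr] in vals for attr, vals in literals)


def follows_by_tpc(phi, psi, phi2, psi2, domains=DOMAINS):
    """Sufficient check that (phi => psi) / (phi2 => psi2) preserves truth.

    Enumerates every possible object and verifies that
    phi & psi implies phi2 & psi2 (so a' >= a) and that
    phi2 & not psi2 implies phi & not psi (so b' <= b).
    """
    attrs = list(domains)
    for values in product(*(domains[a] for a in attrs)):
        obj = dict(zip(attrs, values))
        if phi(obj) and psi(obj) and not (phi2(obj) and psi2(obj)):
            return False
        if phi2(obj) and not psi2(obj) and not (phi(obj) and not psi(obj)):
            return False
    return True


phi = conj([("A1", {1, 3})])
psi = conj([("A2", {2})])
psi_wide = conj([("A2", {2, 3, 4})])

# From A1(1,3) => A2(2) infer A1(1,3) => A2(2,3,4).
print(follows_by_tpc(phi, psi, phi, psi_wide))   # True
# The converse direction is rejected by the check.
print(follows_by_tpc(phi, psi_wide, phi, psi))   # False
```

Because the check is only sufficient, a rule it rejects may still be correct by condition (2) of the theorem.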

Page 8

Data mining – CRISP-DM http://www.crisp-dm.org/

Diagram labels: M; all prime $\varphi \approx \psi$; Beer ↑↑ BMI; Wine region, sportsmen, …; Analytical report

Logical calculus …

Page 9

Data mining – CRISP-DM http://www.crisp-dm.org/

Diagram labels: M; all prime $\varphi \approx \psi$; Beer ↑↑ BMI; Wine region, sportsmen, …; Analytical report

Logical calculus ??

Page 10

Modifying Logic of Discovery

Logic of discovery

Theoretical statements

Logical calculus of association rules

Logic of association rules mining

Logical calculus of association rules

A1 A2, A3 A4 … ; Cons(A1 A2 ), …

Statements on data matrices $\varphi \approx \psi$, evaluation $Val(\varphi \approx \psi, M)$, Cons

Page 11

Logic of association rules mining (simplified)

patient | BMI | Beer | Education | Sex | Status | … | AK
o1 | 17 | 2 | U | F | D | … | 1
o2 | 34 | 4 | B | M | M | … | 6
o3 | 27 | 3 | S | M | W | … | 2
… | … | … | … | … | … | … | …
on | 28 | 7 | B | F | S | … | 4

o Type of M: number of columns + possible values

o Association rules $\varphi \approx \psi$

o $Val(\varphi \approx \psi, M)$

o Items of domain knowledge: Beer ↑↑ BMI, …

o Consequences of domain knowledge: Cons(Beer ↑↑ BMI), …

Example: $Beer(8\text{–}10) \Rightarrow_{0.9,50} BMI(>30) \wedge Status(W)$, evaluated on the 4ft-table (a, b, c, d) of M as $\frac{a}{a+b} \ge 0.9 \wedge a \ge 50$

LCAR: Logical Calculus of Association Rules

DK → AR

Page 12

Atomic consequences of Beer ↑↑ BMI (simplified)

patient | BMI | Beer | Education | Sex | Status | … | AK
o1 | 17 | 2 | U | F | D | … | 1
o2 | 34 | 4 | B | M | M | … | 6
o3 | 27 | 3 | S | M | W | … | 2
… | … | … | … | … | … | … | …
on | 28 | 7 | B | F | S | … | 4

Beer: 0, 1, 2, …, 15
Low: intervals with bounds in 0, …, 5
High: intervals with bounds in 10, …, 15

BMI: 15, …, 35
Low: intervals with bounds in 15, …, 22
High: intervals with bounds in 28, …, 35

Cons(Beer ↑↑ BMI):

$Beer(low) \Rightarrow_{0.9,50} BMI(low)$, for example $Beer(0\text{–}3) \Rightarrow_{0.9,50} BMI(15\text{–}18)$ and $Beer(2\text{–}4) \Rightarrow_{0.9,50} BMI(17\text{–}22)$

$Beer(high) \Rightarrow_{0.9,50} BMI(high)$, for example $Beer(11\text{–}13) \Rightarrow_{0.9,50} BMI(29\text{–}31)$ and $Beer(14\text{–}15) \Rightarrow_{0.9,50} BMI(30\text{–}35)$
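The set of atomic consequences can be enumerated mechanically. The sketch below assumes one possible reading of the slide: a "low" (or "high") coefficient is any integer interval whose bounds lie in the low (or high) range of the attribute, and each pair of a low Beer interval with a low BMI interval (and likewise for high) yields one rule with p = 0.9 and Base = 50. The function names and the tuple representation of a rule are hypothetical.

```python
from itertools import product


def intervals(lo, hi):
    """All integer intervals (start, end) with lo <= start <= end <= hi."""
    return [(s, e) for s in range(lo, hi + 1) for e in range(s, hi + 1)]


# Ranges for low and high values as given on the slide.
BEER_LOW, BEER_HIGH = (0, 5), (10, 15)
BMI_LOW, BMI_HIGH = (15, 22), (28, 35)


def atomic_consequences(p=0.9, base=50):
    """Rules Beer(interval) => BMI(interval) as (beer, bmi, p, base) tuples."""
    rules = []
    for beer_rng, bmi_rng in ((BEER_LOW, BMI_LOW), (BEER_HIGH, BMI_HIGH)):
        for beer, bmi in product(intervals(*beer_rng), intervals(*bmi_rng)):
            rules.append((beer, bmi, p, base))
    return rules


rules = atomic_consequences()
print(len(rules))                                # 1512
print(((0, 3), (15, 18), 0.9, 50) in rules)      # True: Beer(0-3) => BMI(15-18)
print(((11, 13), (29, 31), 0.9, 50) in rules)    # True: Beer(11-13) => BMI(29-31)
```

Under this reading there are 21 × 36 low pairs and 21 × 36 high pairs; a stricter definition of the admissible intervals would reduce that number.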

Page 13

4ft-Discoverer

4ftD = ⟨LCAR, DK → AR, 4ft-Miner, 4ft-Filter, 4ft-Synt⟩

4ft-Miner: input M, output all prime $\varphi \approx \psi$

Under implementation, based on Cons(Beer ↑↑ BMI) and deduction rules $\frac{\varphi \approx \psi}{\varphi' \approx \psi'}$

Page 14

Applying 4ft-Discoverer

Is there new knowledge, not following from Beer ↑↑ BMI, that is true in the given data M?

o 4ft-Miner: from M, all prime $\varphi \approx \psi$

o 4ft-Filter: uses the consequences of Beer ↑↑ BMI and keeps the rules not following from Beer ↑↑ BMI

o 4ft-Synt: new knowledge C ↑↑ D, E ↑↑ F

o Particular interesting rules

Page 15

4ft-Filter

o 4ft-Miner: from M, the set of all prime $\varphi \Rightarrow_{p,Base} \psi$

o Cons(Beer ↑↑ BMI): the set of $Beer(\alpha) \Rightarrow_{0.9,50} BMI(\beta)$

For each $\varphi \Rightarrow_{p,Base} \psi$: is there a $Beer(\alpha) \Rightarrow_{0.9,50} BMI(\beta)$ such that $\frac{Beer(\alpha) \Rightarrow_{p,Base} BMI(\beta)}{\varphi \Rightarrow_{p,Base} \psi}$ is correct?

If yes (+): filter out $\varphi \Rightarrow_{p,Base} \psi$
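The filtering step can be sketched as follows. This is a simplified illustration, not the 4ft-Filter implementation: a mined rule is treated as following from a consequence only when both rules share the same antecedent and the consequence's succedent values are a subset of the mined rule's succedent values (the succedent is widened, which keeps a' ≥ a and b' ≤ b). The `Rule` class, the helper `rng` and the sample rules are hypothetical, and both rules are assumed to use the same p and Base.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Rule:
    """A rule ant_attr(ant_values) => suc_attr(suc_values); hypothetical model."""
    ant_attr: str
    ant_values: FrozenSet[int]
    suc_attr: str
    suc_values: FrozenSet[int]


def rng(lo: int, hi: int) -> FrozenSet[int]:
    """Integer interval lo..hi as a frozenset."""
    return frozenset(range(lo, hi + 1))


def follows_from(mined: Rule, known: Rule) -> bool:
    """Simplified check: same antecedent, succedent of `known` widened."""
    return (mined.ant_attr == known.ant_attr
            and mined.ant_values == known.ant_values
            and mined.suc_attr == known.suc_attr
            and known.suc_values <= mined.suc_values)


def four_ft_filter(mined: List[Rule], consequences: List[Rule]) -> List[Rule]:
    """Keep only the mined rules that follow from no known consequence."""
    return [r for r in mined
            if not any(follows_from(r, k) for k in consequences)]


consequences = [Rule("Beer", rng(0, 3), "BMI", rng(15, 18))]
mined = [
    Rule("Beer", rng(0, 3), "BMI", rng(15, 20)),    # follows, filtered out
    Rule("Beer", rng(8, 10), "BMI", rng(31, 35)),   # kept
]
kept = four_ft_filter(mined, consequences)
print(len(kept))  # 1
```

A fuller version would replace `follows_from` with the correctness criterion for deduction rules of the quantifier in use.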

Page 16

4ft-Synt

o 4ft-Miner: from M, the set of all prime $\varphi \Rightarrow_{p,Base} \psi$

o Cons(C ↑↑ D): the set of $C(\alpha) \Rightarrow_{0.9,50} D(\beta)$

Are there enough $\varphi \Rightarrow_{p,Base} \psi$ and $C(\alpha) \Rightarrow_{0.9,50} D(\beta)$ such that $\frac{C(\alpha) \Rightarrow_{p,Base} D(\beta)}{\varphi \Rightarrow_{p,Base} \psi}$ is correct?

If yes (+): consider C ↑↑ D as a candidate of new knowledge

Page 17

Conclusions

http://sewebar.vse.cz/RuleML_demo/final/final.html

o Rich association rules, e.g. $A_1(1,3) \wedge A_2(6) \approx A_K(2,4,5)$, with 4ft-quantifiers such as $\Rightarrow_{p,Base}$ and $\sim^{+}_{p,Base}$

o Criteria of correctness for deduction rules

o Formal language for domain knowledge: Beer ↑↑ BMI, …

o Atomic consequences $Beer(low) \Rightarrow_{p,Base} BMI(low)$, …, $Beer(\alpha) \Rightarrow_{p,Base} BMI(\beta)$

o Conversion of Beer ↑↑ BMI to $Beer(\alpha) \Rightarrow_{p,Base} BMI(\beta)$ via deduction rules $\frac{\varphi \approx \psi}{\varphi' \approx \psi'}$

o Partially implemented: http://lispminer.vse.cz/, http://sewebar.vse.cz/

Page 18

Thank you

Page 19

Examples of 4ft-quantifiers – statistical hypothesis tests

Lower critical implication $\Rightarrow^{!}_{p,\alpha,Base}$ for $0 < p \le 1$, $0 < \alpha < 0.5$:

The rule $\varphi \Rightarrow^{!}_{p,\alpha,Base} \psi$ corresponds to the statistical test (on the level $\alpha$) of the null hypothesis $H_0: P(\psi | \varphi) \le p$ against the alternative one $H_1: P(\psi | \varphi) > p$. Here $P(\psi | \varphi)$ is the conditional probability of the validity of $\psi$ under the condition $\varphi$.

Fisher's quantifier $\sim_{\alpha,Base}$ for $0 < \alpha < 0.5$:

The rule $\varphi \sim_{\alpha,Base} \psi$ corresponds to the statistical test (on the level $\alpha$) of the null hypothesis of independence of $\varphi$ and $\psi$ against the alternative one of the positive dependence.
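The defining formulas of these two quantifiers did not survive in the slide text, so the following Python sketch is an assumption: it uses the standard one-sided binomial test for H0: P(ψ|φ) ≤ p and the standard one-sided Fisher exact test for positive dependence, each combined with the condition a ≥ Base. The function names and signatures are hypothetical and only the standard library is used.

```python
from math import comb


def lower_critical_implication(a, b, p, alpha, base):
    """One-sided binomial test: reject H0: P(psi|phi) <= p at level alpha.

    The tail is the probability of observing at least `a` successes
    in a + b trials when the success probability equals p.
    """
    n = a + b
    tail = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(a, n + 1))
    return tail <= alpha and a >= base


def fisher_quantifier(a, b, c, d, alpha, base):
    """One-sided Fisher exact test for positive dependence of phi and psi."""
    n, r, k = a + b + c + d, a + b, a + c
    # Hypergeometric tail: probability of a cell value >= a with fixed margins.
    tail = sum(comb(k, i) * comb(n - k, r - i)
               for i in range(a, min(r, k) + 1)) / comb(n, r)
    return a * d > b * c and tail <= alpha and a >= base


print(lower_critical_implication(10, 0, p=0.5, alpha=0.05, base=5))  # True
print(lower_critical_implication(5, 5, p=0.5, alpha=0.05, base=5))   # False
print(fisher_quantifier(5, 0, 0, 5, alpha=0.05, base=5))             # True
print(fisher_quantifier(3, 2, 2, 3, alpha=0.05, base=3))             # False
```

For large frequencies an exact-arithmetic or library-based implementation may be preferable to summing floating-point terms.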

Page 20

4ft-Miner, important simple 4ft-quantifiers

4ft-table of $\varphi$ and $\psi$ in M:

M | $\psi$ | $\neg\psi$
$\varphi$ | a | b
$\neg\varphi$ | c | d

Founded implication $\Rightarrow_{p,Base}$: $\frac{a}{a+b} \ge p \wedge a \ge Base$

Double founded implication $\Leftrightarrow_{p,Base}$: $\frac{a}{a+b+c} \ge p \wedge a \ge Base$

Founded equivalence $\equiv_{p,Base}$: $\frac{a+d}{a+b+c+d} \ge p \wedge a \ge Base$

Above Average $\sim^{+}_{p,Base}$: $\frac{a}{a+b} \ge (1+p)\frac{a+c}{a+b+c+d} \wedge a \ge Base$

"Classical" $\rightarrow_{C,S}$: $\frac{a}{a+b} \ge C \wedge \frac{a}{a+b+c+d} \ge S$
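The simple 4ft-quantifiers are conditions on the frequencies a, b, c, d, so each can be written as a short predicate. The sketch below assumes the usual definitions: founded implication a/(a+b) ≥ p, double founded implication a/(a+b+c) ≥ p, founded equivalence (a+d)/(a+b+c+d) ≥ p, above average a/(a+b) ≥ (1+p)(a+c)/(a+b+c+d), each with a ≥ Base, and the "classical" quantifier with confidence C and support S. Function names are hypothetical, and treating a ratio with an empty denominator as 0 is a convention chosen here.

```python
def _ratio(num, den):
    """num / den, with 0.0 for an empty denominator (convention chosen here)."""
    return num / den if den else 0.0


def founded_implication(a, b, c, d, p, base):
    return _ratio(a, a + b) >= p and a >= base


def double_founded_implication(a, b, c, d, p, base):
    return _ratio(a, a + b + c) >= p and a >= base


def founded_equivalence(a, b, c, d, p, base):
    return _ratio(a + d, a + b + c + d) >= p and a >= base


def above_average(a, b, c, d, p, base):
    # Relative frequency of psi among phi is at least (1 + p) times
    # the relative frequency of psi in the whole matrix.
    return (_ratio(a, a + b) >= (1 + p) * _ratio(a + c, a + b + c + d)
            and a >= base)


def classical(a, b, c, d, conf, supp):
    return _ratio(a, a + b) >= conf and _ratio(a, a + b + c + d) >= supp


table = (4, 1, 1, 2)  # hypothetical 4ft-table (a, b, c, d)
print(founded_implication(*table, p=0.75, base=4))        # True
print(double_founded_implication(*table, p=0.6, base=1))  # True
print(founded_equivalence(*table, p=0.75, base=1))        # True
print(above_average(*table, p=0.05, base=1))              # True
print(classical(*table, conf=0.75, supp=0.5))             # True
```

All five predicates take the full table so that they can be swapped freely when evaluating a rule.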

Page 21

Associational and implicational quantifiers

The generalized quantifier $\approx$ is associational if it satisfies:

If $\approx(a, b, c, d) = 1$ and $a' \ge a \wedge b' \le b \wedge c' \le c \wedge d' \ge d$ then also $\approx(a', b', c', d') = 1$

Examples: $\Rightarrow_{p,Base}$, $\Rightarrow^{!}_{p,\alpha,Base}$, $\sim_{\alpha,Base}$

The generalized quantifier $\Rightarrow^{*}$ is implicational if it satisfies:

If $\Rightarrow^{*}(a, b, c, d) = 1$ and $a' \ge a \wedge b' \le b$ then also $\Rightarrow^{*}(a', b', c', d') = 1$

Examples: $\Rightarrow_{p,Base}$, $\Rightarrow^{!}_{p,\alpha,Base}$

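Page 22

Despecifying-dereducing deduction rule SpRd

The rule $\frac{\varphi \Rightarrow^{*} \psi}{\varphi' \Rightarrow^{*} \psi'}$, where $\Rightarrow^{*}$ is implicational, is sound if there is an intermediate rule $\chi$ such that $\varphi \Rightarrow^{*} \psi$ despecifies to $\chi$ and $\chi$ dereduces to $\varphi' \Rightarrow^{*} \psi'$

An example:

$P_1(x) \wedge P_2(x) \Rightarrow^{*} P_3(x)$ despecifies to $P_1(x) \Rightarrow^{*} \neg P_2(x) \vee P_3(x)$

$P_1(x) \Rightarrow^{*} \neg P_2(x) \vee P_3(x)$ dereduces to $P_1(x) \Rightarrow^{*} \neg P_2(x) \vee P_3(x) \vee P_4(x)$

SpRd rule: $\frac{P_1(x) \wedge P_2(x) \Rightarrow^{*} P_3(x)}{P_1(x) \Rightarrow^{*} \neg P_2(x) \vee P_3(x) \vee P_4(x)}$

Notation: $P_1(x) \wedge P_2(x) \Rightarrow^{*} P_3(x)$ is written instead of $(\Rightarrow^{*} x)(P_1(x) \wedge P_2(x), P_3(x))$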

Page 23

Deduction rules and implicational quantifiers (1)

The 4ft-quantifier $\Rightarrow^{*}$ is implicational if it satisfies:

If $\Rightarrow^{*}(a,b,c,d) = 1$ and $a' \ge a \wedge b' \le b$ then also $\Rightarrow^{*}(a',b',c',d') = 1$

$TPC_{\Rightarrow}$ = $a' \ge a \wedge b' \le b$ is the Truth Preservation Condition for implicational quantifiers

o $\Rightarrow^{*}$ is a-dependent if there are a, a', b, c, d such that $\Rightarrow^{*}(a,b,c,d) \ne \Rightarrow^{*}(a',b,c,d)$

o b-dependent, …

o If $\Rightarrow^{*}$ is implicational then $\Rightarrow^{*}(a,b,c,d) = \Rightarrow^{*}(a,b,c',d')$ for all c, c', d, d'

o If $\Rightarrow^{*}$ is implicational then we use only $\Rightarrow^{*}(a,b)$ instead of $\Rightarrow^{*}(a,b,c,d)$

Page 24

Deduction rules and implicational quantifiers (2)

Definition: The implicational 4ft-quantifier $\Rightarrow^{*}$ is interesting implicational if $\Rightarrow^{*}$ is both a-dependent and b-dependent and $\Rightarrow^{*}(0,0) = 0$

$\Rightarrow_{p,Base}$ and $\Rightarrow^{!}_{p,\alpha,Base}$ are examples of interesting implicational 4ft-quantifiers

Theorem:

If $\Rightarrow^{*}$ is an interesting implicational 4ft-quantifier and R = $\frac{\varphi \Rightarrow^{*} \psi}{\varphi' \Rightarrow^{*} \psi'}$ is a deduction rule

then there are propositional formulas $\Phi_{1A}$, $\Phi_{1B}$, $\Phi_2$ derived from $\varphi$, $\psi$, $\varphi'$, $\psi'$ such that R is sound iff at least one of the conditions i), ii) is satisfied:

i) both $\Phi_{1A}$ and $\Phi_{1B}$ are tautologies

ii) $\Phi_2$ is a tautology

Page 25

Overview of classes of 4ft-quantifiers

Class of 4ft-quantifiers | Truth Preservation Condition | Criterion for deduction rules
implicational | $a' \ge a \wedge b' \le b$ | known
double implicational | $a' \ge a \wedge b' \le b \wedge c' \le c$ | -
Σ-double implicational | $a' \ge a \wedge b' + c' \le b + c$ | known
equivalency (associational) | $a' \ge a \wedge b' \le b \wedge c' \le c \wedge d' \ge d$ | -
Σ-equivalency | $a' + d' \ge a + d \wedge b' + c' \le b + c$ | known
with F-property | if $\approx(a,b,c,d) = 1$ and $b \ge c - 1 \ge 0$ then $\approx(a,b+1,c-1,d) = 1$; if $\approx(a,b,c,d) = 1$ and $c \ge b - 1 \ge 0$ then $\approx(a,b-1,c+1,d) = 1$ | known

Additional results:

o Dealing with missing information

o Tables of critical frequencies

o Definability in classical predicate calculi

o Interesting subclasses

Page 26

Association rules and the ASSOC procedure (1)

basket | items
b1 | A, B, D, E, F, J
b2 | A, C, D, G, H
b3 | A, B, C, E, F, G
b4 | E, F, G, J
b5 | A, B, C, E, G
b6 | A, B, E, F, G, J
b7 | C, G, K, L
b8 | A, B, C, E, F, J

{ A, B } → { E, F }

Page 27

Association rules and the ASSOC procedure (2)

{ A, B } → { E, F }

 | $E \wedge F$ | $\neg(E \wedge F)$
$A \wedge B$ | a | b
$\neg(A \wedge B)$ | c | d

Conf({ A, B } → { E, F }) = $\frac{a}{a+b}$

Supp({ A, B } → { E, F }) = $\frac{a}{a+b+c+d}$
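The link between the basket view and the 4ft-table can be checked directly on the eight baskets from the previous slide. The Python sketch below counts a, b, c, d for { A, B } → { E, F } and derives confidence and support from them; the function name and the dictionary layout are choices made for this illustration.

```python
# The eight baskets listed on the previous slide.
BASKETS = {
    "b1": {"A", "B", "D", "E", "F", "J"},
    "b2": {"A", "C", "D", "G", "H"},
    "b3": {"A", "B", "C", "E", "F", "G"},
    "b4": {"E", "F", "G", "J"},
    "b5": {"A", "B", "C", "E", "G"},
    "b6": {"A", "B", "E", "F", "G", "J"},
    "b7": {"C", "G", "K", "L"},
    "b8": {"A", "B", "C", "E", "F", "J"},
}


def four_fold_table(baskets, antecedent, succedent):
    """Count (a, b, c, d) for the rule antecedent -> succedent."""
    a = b = c = d = 0
    for items in baskets.values():
        has_ant = antecedent <= items   # all antecedent items are present
        has_suc = succedent <= items    # all succedent items are present
        if has_ant and has_suc:
            a += 1
        elif has_ant:
            b += 1
        elif has_suc:
            c += 1
        else:
            d += 1
    return a, b, c, d


a, b, c, d = four_fold_table(BASKETS, {"A", "B"}, {"E", "F"})
conf = a / (a + b)
supp = a / (a + b + c + d)
print((a, b, c, d))  # (4, 1, 1, 2)
print(conf, supp)    # 0.8 0.5
```

Here b1, b3, b6 and b8 contribute to a, b5 to b, b4 to c, and b2 and b7 to d.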

Page 28

GUHA and association rules

http://en.wikipedia.org/wiki/Association_rule_learning#cite_note-pospaper-7

History: The concept of association rules was popularised particularly due to the 1993 article of Agrawal [2], which has acquired more than 6000 citations according to Google Scholar, as of March 2008, and is thus one of the most cited papers in the Data Mining field.

However, it is possible that what is now called "association rules" is similar to what appears in the 1966 paper [7] on GUHA, a general data mining method developed by Petr Hájek et al. [8].