Modifying Logic of Discovery for Dealing with Domain Knowledge in Data Mining
Jan Rauch, University of Economics, Prague, Czech Republic
Presents an idea for a theoretical approach
Software tools exist for some of the partial steps
o Logic of discovery
o Modifications
o 4ft-Discoverer
Logic of Discovery
Can computers formulate and verify scientific hypotheses?
Can computers analyze empirical data in a rational way and produce a reasonable reflection of the observed empirical world? Can it be done using mathematical logic and statistics?
1978
Logic of Discovery (simplified)
Data matrix M
State dependent structure
Theoretical statements – theoretical calculi
Observational statements – observational calculi
1:1
Statistical hypothesis tests
Association rules – observational statements
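$A_1(1,3) \wedge A_2(6) \approx A_K(2,4,5)$
Founded implication $\Rightarrow_{p,Base}$: $\frac{a}{a+b} \geq p \wedge a \geq Base$
Above average $\sim^{+}_{p,Base}$: $\frac{a}{a+b} \geq (1+p)\frac{a+c}{a+b+c+d} \wedge a \geq Base$
… hypothesis tests

4ft-table of $\varphi$ and $\psi$ in M:
M             | $\psi$ | $\neg\psi$
$\varphi$     | a      | b
$\neg\varphi$ | c      | d

$Val(\varphi \approx \psi, M) \in \{0,1\}$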
GUHA Procedure ASSOC – a tool for finding a set of interesting association rules
M , ,
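All prime $\varphi \approx \psi$
4ft-table of $\varphi$ and $\psi$ in M: a, b, c, d
$Val(\varphi \approx \psi, M) = 1$
$\varphi \approx \psi$ is prime: it is true in M and does not logically follow from another, simpler true rule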
Deduction rules in logic of association rules
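Deduction rule: $\frac{\varphi \approx \psi}{\varphi' \approx \psi'}$
Examples:
$\frac{A_1(1,3) \Rightarrow_{0.9,10} A_2(2)}{A_1(1,3) \Rightarrow_{0.9,10} A_2(2,3,4)}$
$\frac{A_1(1,3) \wedge A_3(4) \Rightarrow_{0.9,10} A_2(2)}{A_1(1,3) \Rightarrow_{0.9,10} \neg A_3(4) \vee A_2(2) \vee A_4(8)}$
Theorem for $\Rightarrow_{p,Base}$: $\frac{\varphi \Rightarrow_{p,Base} \psi}{\varphi' \Rightarrow_{p,Base} \psi'}$ is correct if and only if (1) or (2) holds:
(1) 1A and 1B are tautologies of propositional calculus
(2) 2 is a tautology
The formulas 1A, 1B and 2 are created from $\varphi$, $\psi$, $\varphi'$, $\psi'$.
Theorems for additional 4ft-quantifiers: $\frac{\varphi \approx \psi}{\varphi' \approx \psi'}$ is correct iff …
Applications: prime rules + dealing with knowledge in data mining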
Data mining – CRISP-DM http://www.crisp-dm.org/
M , ,
All prime $\varphi \approx \psi$
Beer ↑↑ BMI, Wine region, sportsmen, …
Analytical report
Logical calculus …
Logical calculus ??
Modifying Logic of Discovery
Logic of discovery: theoretical statements
Logical calculus of association rules
Logic of association rules mining
Logical calculus of association rules
A1 ↑↑ A2, A3 ↑↑ A4, … ; Cons(A1 ↑↑ A2), …
Statements on data matrices $\varphi \approx \psi$, evaluation $Val(\varphi \approx \psi, M)$, Cons
Logic of association rules mining (simplified)
patient | BMI | Beer | Education | Sex | Status | … | AK
o1      | 17  | 2    | U         | F   | D      | … | 1
o2      | 34  | 4    | B         | M   | M      | … | 6
o3      | 27  | 3    | S         | M   | W      | … | 2
…       | …   | …    | …         | …   | …      | … | …
on      | 28  | 7    | B         | F   | S      | … | 4
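o Type of M: number of columns + possible values
o $\varphi \approx \psi$
o $Val(\varphi \approx \psi, M)$
o Items of domain knowledge: Beer ↑↑ BMI, …
o Consequences of domain knowledge: Cons(Beer ↑↑ BMI), …
Example: Beer(8–10) $\Rightarrow_{0.9,50}$ BMI(>30) $\wedge$ Status(W), with 4ft-table a, b, c, d in M, is true iff $\frac{a}{a+b} \geq 0.9 \wedge a \geq 50$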
LCAR Logical Calculus of Association Rules
DK AR
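A minimal Python sketch of how Val(φ ≈ ψ, M) could be evaluated on a data matrix like the one on this slide. It assumes the example rule reads Beer(8–10) ⇒0.9,50 BMI(>30) ∧ Status(W) with the founded implication quantifier. The names `ROWS`, `four_fold_table` and `founded_implication` are hypothetical and are not taken from LISp-Miner. Only the four rows printed on the slide are used, so the result illustrates the mechanics, not a finding.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]

# The four sample rows shown on the slide (the "…" columns and AK are omitted).
ROWS: List[Row] = [
    {"patient": "o1", "BMI": 17, "Beer": 2, "Education": "U", "Sex": "F", "Status": "D"},
    {"patient": "o2", "BMI": 34, "Beer": 4, "Education": "B", "Sex": "M", "Status": "M"},
    {"patient": "o3", "BMI": 27, "Beer": 3, "Education": "S", "Sex": "M", "Status": "W"},
    {"patient": "on", "BMI": 28, "Beer": 7, "Education": "B", "Sex": "F", "Status": "S"},
]


def four_fold_table(rows: List[Row],
                    phi: Callable[[Row], bool],
                    psi: Callable[[Row], bool]) -> Tuple[int, int, int, int]:
    """Count rows into the 4ft-table (a, b, c, d) of phi and psi."""
    counts = [0, 0, 0, 0]
    for row in rows:
        # a: phi and psi, b: phi and not psi, c: not phi and psi, d: neither
        index = (0 if phi(row) else 2) + (0 if psi(row) else 1)
        counts[index] += 1
    return counts[0], counts[1], counts[2], counts[3]


def founded_implication(table: Tuple[int, int, int, int], p: float, base: int) -> bool:
    """a/(a+b) >= p and a >= Base; false when no row satisfies phi."""
    a, b, _, _ = table
    return a + b > 0 and a / (a + b) >= p and a >= base


# The rule from the slide, evaluated on the four sample rows only.
slide_table = four_fold_table(
    ROWS,
    lambda r: 8 <= r["Beer"] <= 10,
    lambda r: r["BMI"] > 30 and r["Status"] == "W",
)
slide_value = founded_implication(slide_table, p=0.9, base=50)

# A toy rule (not from the slides) that the tiny sample can satisfy.
toy_table = four_fold_table(ROWS, lambda r: 2 <= r["Beer"] <= 4, lambda r: r["Sex"] == "M")
toy_value = founded_implication(toy_table, p=0.6, base=2)
```

With only four rows, no patient falls into Beer(8–10), so the slide rule cannot reach Base = 50. A real run would use the full data matrix M.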
Atomic consequences of Beer ↑↑ BMI (simplified)
Cons(Beer ↑↑ BMI):
Beer(low) $\Rightarrow_{0.9,50}$ BMI(low)
Beer(high) $\Rightarrow_{0.9,50}$ BMI(high)
Beer: 0, 1, 2, …, 15; low: intervals within 0, …, 5; high: intervals within 10, …, 15
BMI: 15, 16, 17, …, 35; low: intervals within 15, …, 22; high: intervals within 28, …, 35
Beer(0 – 3) $\Rightarrow_{0.9,50}$ BMI(15 – 18)
Beer(2 – 4) $\Rightarrow_{0.9,50}$ BMI(17 – 22)
Beer(11 – 13) $\Rightarrow_{0.9,50}$ BMI(29 – 31)
Beer(14 – 15) $\Rightarrow_{0.9,50}$ BMI(30 – 35)
…
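The set Cons(Beer ↑↑ BMI) can be enumerated mechanically. The sketch below assumes that "low" and "high" mean every integer interval inside the ranges given on the slide (Beer 0–5 and 10–15, BMI 15–22 and 28–35), and that low pairs only with low and high only with high. The helper names are hypothetical, and the slide itself is marked as simplified, so the real definition may differ.

```python
from itertools import product
from typing import List, Tuple

Interval = Tuple[int, int]


def intervals(lo: int, hi: int) -> List[Interval]:
    """All integer intervals (start, end) with lo <= start <= end <= hi."""
    return [(s, e) for s in range(lo, hi + 1) for e in range(s, hi + 1)]


BEER_LOW, BEER_HIGH = intervals(0, 5), intervals(10, 15)
BMI_LOW, BMI_HIGH = intervals(15, 22), intervals(28, 35)


def atomic_consequences() -> List[Tuple[Interval, Interval]]:
    """Pairs (Beer interval, BMI interval): low with low, high with high."""
    return list(product(BEER_LOW, BMI_LOW)) + list(product(BEER_HIGH, BMI_HIGH))


def show(pair: Tuple[Interval, Interval], p: float = 0.9, base: int = 50) -> str:
    """Render one atomic consequence in the notation of the slide."""
    (b1, b2), (m1, m2) = pair
    return f"Beer({b1} – {b2}) =>[{p},{base}] BMI({m1} – {m2})"


CONS = atomic_consequences()
```

Under these assumptions `CONS` contains the four rules listed on the slide, for example `((0, 3), (15, 18))`, and no mixed pair such as a low Beer interval with a high BMI interval.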
4ft-Discoverer
4ftD = LCAR, DK AR, 4ft-Miner, 4ft-Filter, 4ft-Synt
M , ,
All prime $\varphi \approx \psi$
Under implementation, based on Cons(Beer ↑↑ BMI) and $\frac{\varphi \approx \psi}{\varphi' \approx \psi'}$
Applying 4ft-Discoverer
Is there new knowledge, not following from Beer ↑↑ BMI, that is true in the given data M?
M , ,
All prime
4ft-Miner
4ft-Filter
Consequences of Beer ↑↑ BMI
Rules not following from Beer ↑↑ BMI
4ft-Synt
New knowledge C ↑↑ D, E ↑↑ F
Particular interesting rules
4ft-Filter
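4ft-Miner applied to M gives all prime rules: the set of $\varphi \Rightarrow_{p,Base} \psi$
Cons(Beer ↑↑ BMI): the set of Beer($\alpha$) $\Rightarrow_{0.9,50}$ BMI($\beta$)
For each $\varphi \Rightarrow_{p,Base} \psi$: is there a Beer($\alpha$) $\Rightarrow_{0.9,50}$ BMI($\beta$) such that $\frac{Beer(\alpha) \Rightarrow_{p,Base} BMI(\beta)}{\varphi \Rightarrow_{p,Base} \psi}$ is correct?
If yes (+): filter out $\varphi \Rightarrow_{p,Base} \psi$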
4ft-Synt
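4ft-Miner applied to M gives all prime rules: the set of $\varphi \Rightarrow_{p,Base} \psi$
Cons(C ↑↑ D): the set of C($\alpha$) $\Rightarrow_{0.9,50}$ D($\beta$)
Are there enough $\varphi \Rightarrow_{p,Base} \psi$ and C($\alpha$) $\Rightarrow_{0.9,50}$ D($\beta$) such that $\frac{C(\alpha) \Rightarrow_{p,Base} D(\beta)}{\varphi \Rightarrow_{p,Base} \psi}$ is correct?
If yes (+): consider C ↑↑ D as a candidate for new knowledge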
Conclusions
http://sewebar.vse.cz/RuleML_demo/final/final.html
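o Rich association rules $\varphi \approx \psi$, e.g. $A_1(1,3) \wedge A_2(6) \approx A_K(2,4,5)$, with 4ft-quantifiers such as $\Rightarrow_{p,Base}$
o Criteria of correctness for deduction rules $\frac{\varphi \approx \psi}{\varphi' \approx \psi'}$
o Formal language for domain knowledge: Beer ↑↑ BMI, …
o Atomic consequences: Beer(low) $\Rightarrow_{p,Base}$ BMI(low), …, Beer($\alpha$) $\Rightarrow_{p,Base}$ BMI($\beta$)
o Conversion of Beer ↑↑ BMI via Beer($\alpha$) $\Rightarrow_{p,Base}$ BMI($\beta$)
o Partially implemented: http://lispminer.vse.cz/, http://sewebar.vse.cz/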
Thank you
Examples of 4ft-quantifiers – statistical hypothesis tests
Lower critical implication $\Rightarrow^{!}_{p,\alpha,Base}$ for $0 < p \leq 1$, $0 < \alpha < 0.5$:
The rule $\varphi \Rightarrow^{!}_{p;\alpha} \psi$ corresponds to the statistical test (on the level $\alpha$) of the null hypothesis $H_0: P(\psi \mid \varphi) \leq p$ against the alternative $H_1: P(\psi \mid \varphi) > p$. Here $P(\psi \mid \varphi)$ is the conditional probability of the validity of $\psi$ under the condition $\varphi$.
Fisher's quantifier $\approx_{\alpha,Base}$ for $0 < \alpha < 0.5$:
The rule $\varphi \approx_{\alpha,Base} \psi$ corresponds to the statistical test (on the level $\alpha$) of the null hypothesis of independence of $\varphi$ and $\psi$ against the alternative of positive dependence.
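The defining formulas of these two quantifiers did not survive on the slide. The sketch below uses the forms usually given in the GUHA literature, which is an assumption here: a binomial upper tail for the lower critical implication, and a one-sided Fisher test (hypergeometric upper tail, plus ad > bc) for Fisher's quantifier. Function names are hypothetical. It needs Python 3.8 or later for `math.comb`.

```python
from math import comb


def lower_critical_implication(a: int, b: int, p: float, alpha: float, base: int) -> bool:
    """Assumed form: sum_{i=a}^{a+b} C(a+b, i) p^i (1-p)^(a+b-i) <= alpha and a >= Base."""
    n = a + b
    tail = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(a, n + 1))
    return tail <= alpha and a >= base


def fisher_quantifier(a: int, b: int, c: int, d: int, alpha: float, base: int) -> bool:
    """Assumed form: ad > bc, hypergeometric upper tail <= alpha, and a >= Base."""
    r, k, n = a + b, a + c, a + b + c + d
    # P(X >= a) for X hypergeometric with row total r and column total k
    tail = sum(comb(k, i) * comb(n - k, r - i) for i in range(a, min(r, k) + 1)) / comb(n, r)
    return a * d > b * c and tail <= alpha and a >= base
```

For example, with a = 9, b = 1 and p = 0.5 the binomial tail is 11/1024, so the rule passes at α = 0.05. With p = 0.9 the same table does not.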
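4ft-Miner, important simple 4ft-quantifiers
4ft-table of $\varphi$ and $\psi$ in M: a, b, c, d
Founded implication $\Rightarrow_{p,Base}$: $\frac{a}{a+b} \geq p \wedge a \geq Base$
Double founded implication $\Leftrightarrow_{p,Base}$: $\frac{a}{a+b+c} \geq p \wedge a \geq Base$
Founded equivalence $\equiv_{p,Base}$: $\frac{a+d}{a+b+c+d} \geq p \wedge a \geq Base$
Above Average $\sim^{+}_{p,Base}$: $\frac{a}{a+b} \geq (1+p)\frac{a+c}{a+b+c+d} \wedge a \geq Base$
"Classical" $\rightarrow_{C,S}$: $\frac{a}{a+b} \geq C \wedge \frac{a}{a+b+c+d} \geq S$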
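The five quantifiers on this slide are simple arithmetic conditions on the 4ft-table. A minimal Python sketch, assuming the conditions read as a/(a+b) ≥ p, a/(a+b+c) ≥ p, (a+d)/(a+b+c+d) ≥ p, a/(a+b) ≥ (1+p)(a+c)/(a+b+c+d), and confidence/support thresholds C and S. Function names are hypothetical, and tables with an empty denominator are treated as false, which is a choice made here.

```python
def founded_implication(a, b, c, d, p, base):
    """a/(a+b) >= p and a >= Base."""
    return a + b > 0 and a / (a + b) >= p and a >= base


def double_founded_implication(a, b, c, d, p, base):
    """a/(a+b+c) >= p and a >= Base."""
    return a + b + c > 0 and a / (a + b + c) >= p and a >= base


def founded_equivalence(a, b, c, d, p, base):
    """(a+d)/(a+b+c+d) >= p and a >= Base."""
    n = a + b + c + d
    return n > 0 and (a + d) / n >= p and a >= base


def above_average(a, b, c, d, p, base):
    """a/(a+b) >= (1+p)(a+c)/(a+b+c+d) and a >= Base."""
    n = a + b + c + d
    return a + b > 0 and a / (a + b) >= (1 + p) * (a + c) / n and a >= base


def classical(a, b, c, d, conf, supp):
    """Confidence a/(a+b) >= C and support a/(a+b+c+d) >= S."""
    n = a + b + c + d
    return a + b > 0 and a / (a + b) >= conf and a / n >= supp


# Illustrative table, chosen here: a = 4, b = 1, c = 1, d = 2.
TABLE = (4, 1, 1, 2)
```

On `TABLE`, the founded implication holds for p = 0.75 and Base = 4 since a/(a+b) = 0.8, and fails for p = 0.9.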
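Associational and implicational quantifiers
The generalized quantifier $\approx$ is associational if it satisfies:
If $\approx(a, b, c, d) = 1$ and $a' \geq a \wedge b' \leq b \wedge c' \leq c \wedge d' \geq d$ then also $\approx(a', b', c', d') = 1$
Examples: $\Rightarrow_{p,Base}$, $\Rightarrow^{!}_{p,\alpha,Base}$, $\approx_{\alpha,Base}$
The generalized quantifier $\Rightarrow^{*}$ is implicational if it satisfies:
If $\Rightarrow^{*}(a, b, c, d) = 1$ and $a' \geq a \wedge b' \leq b$ then also $\Rightarrow^{*}(a', b', c, d) = 1$
Examples: $\Rightarrow_{p,Base}$, $\Rightarrow^{!}_{p,\alpha,Base}$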
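Both definitions can be probed by exhaustive search over small 4ft-tables. The sketch below is a bounded check with hypothetical names: it can refute a property by finding a counterexample, but passing it for a small limit is not a proof. It assumes the conditions a' ≥ a, b' ≤ b (implicational) and additionally c' ≤ c, d' ≥ d (associational).

```python
from itertools import product


def founded_implication(a, b, c, d, p=0.5, base=1):
    return a + b > 0 and a / (a + b) >= p and a >= base


def above_average(a, b, c, d, p=1.0, base=1):
    n = a + b + c + d
    return a + b > 0 and a / (a + b) >= (1 + p) * (a + c) / n and a >= base


def is_implicational(q, limit=3):
    """True if no counterexample with all frequencies in 0..limit is found."""
    for a, b, c, d in product(range(limit + 1), repeat=4):
        if not q(a, b, c, d):
            continue
        for a2, b2 in product(range(a, limit + 1), range(b + 1)):
            if not q(a2, b2, c, d):
                return False
    return True


def is_associational(q, limit=3):
    """Same bounded search, also varying c' <= c and d' >= d."""
    for a, b, c, d in product(range(limit + 1), repeat=4):
        if not q(a, b, c, d):
            continue
        for a2, b2, c2, d2 in product(range(a, limit + 1), range(b + 1),
                                      range(c + 1), range(d, limit + 1)):
            if not q(a2, b2, c2, d2):
                return False
    return True
```

With the default parameters the search finds no counterexample for the founded implication, while the above average quantifier fails: the table (1, 1, 0, 2) satisfies it, but raising a to 3 does not.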
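Despecifying-dereducing deduction rule SpRd
$\frac{\varphi_1 \Rightarrow^{*} \psi_1}{\varphi_2 \Rightarrow^{*} \psi_2}$, where $\Rightarrow^{*}$ is implicational, is sound if there is a $\psi_3$ such that
$\varphi_1 \Rightarrow^{*} \psi_1$ despecifies to $\varphi_2 \Rightarrow^{*} \psi_3$ and $\varphi_2 \Rightarrow^{*} \psi_3$ dereduces to $\varphi_2 \Rightarrow^{*} \psi_2$
An example:
$P_1(x) \wedge P_2(x) \Rightarrow^{*} P_3(x)$ despecifies to $P_1(x) \Rightarrow^{*} \neg P_2(x) \vee P_3(x)$
$P_1(x) \Rightarrow^{*} \neg P_2(x) \vee P_3(x)$ dereduces to $P_1(x) \Rightarrow^{*} \neg P_2(x) \vee P_3(x) \vee P_4(x)$
$\frac{P_1(x) \wedge P_2(x) \Rightarrow^{*} P_3(x)}{P_1(x) \Rightarrow^{*} \neg P_2(x) \vee P_3(x) \vee P_4(x)}$
Notation: $P_1(x) \wedge P_2(x) \Rightarrow^{*} P_3(x)$ instead of $(\Rightarrow^{*} x)(P_1(x) \wedge P_2(x), P_3(x))$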
Deduction rules and implicational quantifiers (1)
The 4ft-quantifier $\approx$ is implicational if it satisfies:
If $\approx(a,b,c,d) = 1$ and $a' \geq a \wedge b' \leq b$ then also $\approx(a',b',c,d) = 1$
$TPC_{\Rightarrow} = a' \geq a \wedge b' \leq b$ is the Truth Preservation Condition for implicational quantifiers
o $\approx$ is a-dependent if there are a, a', b, c, d such that $\approx(a,b,c,d) \neq \approx(a',b,c,d)$,
o b-dependent, …
o If $\approx$ is implicational then $\approx(a,b,c,d) = \approx(a,b,c',d')$ for all c, c', d, d'
o If $\Rightarrow^{*}$ is implicational then we use only $\Rightarrow^{*}(a,b)$ instead of $\Rightarrow^{*}(a,b,c,d)$
Deduction rules and implicational quantifiers (2)
Definition: The implicational 4ft-quantifier $\Rightarrow^{*}$ is interesting implicational if $\Rightarrow^{*}$ is both a-dependent and b-dependent and $\Rightarrow^{*}(0,0) = 0$
$\Rightarrow_{p,Base}$ and $\Rightarrow^{!}_{p,\alpha,Base}$ are examples of interesting implicational 4ft-quantifiers
Theorem:
If $\Rightarrow^{*}$ is an interesting implicational 4ft-quantifier and R = $\frac{\varphi \Rightarrow^{*} \psi}{\varphi' \Rightarrow^{*} \psi'}$ is a deduction rule,
then there are propositional formulas 1A, 1B, 2 derived from $\varphi$, $\psi$, $\varphi'$, $\psi'$ such that R is sound iff at least one of the conditions i), ii) is satisfied:
i) both 1A and 1B are tautologies
ii) 2 is a tautology
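The slide does not spell out the formulas 1A, 1B and 2. The sketch below assumes the form in which this criterion is usually stated for implicational quantifiers: 1A is φ ∧ ψ → φ' ∧ ψ', 1B is φ' ∧ ¬ψ' → φ ∧ ¬ψ, and 2 is φ → ¬ψ. Treat these three formulas, the helper names, and the attribute domains as assumptions. Tautologies are checked by brute force over all value combinations of the attributes.

```python
from itertools import product
from typing import Callable, Dict, List

Row = Dict[str, int]
Formula = Callable[[Row], bool]


def is_tautology(formula: Formula, domains: Dict[str, List[int]]) -> bool:
    """True if the formula holds for every combination of attribute values."""
    names = list(domains)
    return all(formula(dict(zip(names, values)))
               for values in product(*(domains[name] for name in names)))


def rule_is_sound(phi: Formula, psi: Formula, phi2: Formula, psi2: Formula,
                  domains: Dict[str, List[int]]) -> bool:
    """Assumed criterion for (phi =>* psi) / (phi2 =>* psi2)."""
    f1a = lambda r: not (phi(r) and psi(r)) or (phi2(r) and psi2(r))
    f1b = lambda r: not (phi2(r) and not psi2(r)) or (phi(r) and not psi(r))
    f2 = lambda r: not phi(r) or not psi(r)
    return (is_tautology(f1a, domains) and is_tautology(f1b, domains)) \
        or is_tautology(f2, domains)


# Hypothetical domains: A1 takes values 1..4, A2 takes values 1..8.
DOMAINS = {"A1": [1, 2, 3, 4], "A2": [1, 2, 3, 4, 5, 6, 7, 8]}
a1_13 = lambda r: r["A1"] in {1, 3}      # A1(1,3)
a2_2 = lambda r: r["A2"] in {2}          # A2(2)
a2_234 = lambda r: r["A2"] in {2, 3, 4}  # A2(2,3,4)

# A1(1,3) =>* A2(2)  /  A1(1,3) =>* A2(2,3,4)
widening_is_sound = rule_is_sound(a1_13, a2_2, a1_13, a2_234, DOMAINS)
# The reverse direction, narrowing the succedent
narrowing_is_sound = rule_is_sound(a1_13, a2_234, a1_13, a2_2, DOMAINS)
```

Under the assumed criterion, widening the succedent from A2(2) to A2(2,3,4) is sound and the reverse is not, which matches the intuition behind implicational quantifiers.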
Overview of classes of 4ft-quantifiers
Class of 4ft-quantifiers | Truth Preservation Condition | criterion for $\frac{\varphi_1 \approx \psi_1}{\varphi_2 \approx \psi_2}$
implicational | $a' \geq a \wedge b' \leq b$ | known
double implicational | $a' \geq a \wedge b' \leq b \wedge c' \leq c$ |
$\Sigma$-double implicational | $a' \geq a \wedge b' + c' \leq b + c$ | known
equivalency (associational) | $a' \geq a \wedge b' \leq b \wedge c' \leq c \wedge d' \geq d$ |
$\Sigma$-equivalency | $a' + d' \geq a + d \wedge b' + c' \leq b + c$ | known
with F-property | if $\approx(a,b,c,d) = 1$ and $b \geq c - 1 \geq 0$ then $\approx(a,b+1,c-1,d) = 1$; if $\approx(a,b,c,d) = 1$ and $c \geq b - 1 \geq 0$ then $\approx(a,b-1,c+1,d) = 1$ | known
Additional results:
o Dealing with missing information
o Tables of critical frequencies
o Definability in classical predicate calculi
o Interesting subclasses
Association rules and the ASSOC procedure (1)
basket | items
b1     | A, B, D, E, F, J
b2     | A, C, D, G, H
b3     | A, B, C, E, F, G
b4     | E, F, G, J
b5     | A, B, C, E, G
b6     | A, B, E, F, G, J
b7     | C, G, K, L
b8     | A, B, C, E, F, J
{A, B} → {E, F}
Association rules and the ASSOC procedure (2)
{A, B} → {E, F}
Conf({A, B} → {E, F}) = $\frac{a}{a+b}$
Supp({A, B} → {E, F}) = $\frac{a}{a+b+c+d}$

         | E ∧ F | ¬(E ∧ F)
A ∧ B    | a     | b
¬(A ∧ B) | c     | d
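The confidence and support of {A, B} → {E, F} can be computed directly from the eight baskets listed on the previous slide. A short Python sketch, with hypothetical variable names, that builds the 4ft-table and applies the two formulas:

```python
# The eight baskets from the slide.
BASKETS = {
    "b1": {"A", "B", "D", "E", "F", "J"},
    "b2": {"A", "C", "D", "G", "H"},
    "b3": {"A", "B", "C", "E", "F", "G"},
    "b4": {"E", "F", "G", "J"},
    "b5": {"A", "B", "C", "E", "G"},
    "b6": {"A", "B", "E", "F", "G", "J"},
    "b7": {"C", "G", "K", "L"},
    "b8": {"A", "B", "C", "E", "F", "J"},
}


def four_fold_table(baskets, antecedent, succedent):
    """Return (a, b, c, d) for the rule antecedent -> succedent."""
    a = b = c = d = 0
    for items in baskets.values():
        has_ant = antecedent <= items  # all antecedent items are in the basket
        has_suc = succedent <= items
        if has_ant and has_suc:
            a += 1
        elif has_ant:
            b += 1
        elif has_suc:
            c += 1
        else:
            d += 1
    return a, b, c, d


a, b, c, d = four_fold_table(BASKETS, {"A", "B"}, {"E", "F"})
conf = a / (a + b)
supp = a / (a + b + c + d)
```

For these baskets the table is a = 4, b = 1, c = 1, d = 2, so the confidence is 4/5 and the support is 4/8.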
GUHA and association rules
http://en.wikipedia.org/wiki/Association_rule_learning#cite_note-pospaper-7
History: The concept of association rules was popularised particularly due to the 1993 article of Agrawal [2], which has acquired more than 6000 citations according to Google Scholar, as of March 2008, and is thus one of the most cited papers in the Data Mining field.
However, it is possible that what is now called "association rules" is similar to what appears in the 1966 paper [7] on GUHA, a general data mining method developed by Petr Hájek et al. [8].