Fast Algorithms for Mining Frequent Itemsets
(挖掘頻繁項目集合之快速演算法研究)

PhD Dissertation Proposal
Advisor: Prof. 張真誠
Student: 李育強
Dept. of Computer Science and Information Engineering, National Chung Cheng University
Date: January 20, 2005
Outline
- Introduction
- Background and Related Work
- A New FP-Tree for Mining Frequent Itemsets
- Efficient Algorithms for Mining Share-Frequent Itemsets
- Conclusions
Introduction
- Data mining techniques have been developed to find a small set of precious nuggets in reams of data
- Mining association rules constitutes one of the most important data mining problems
- Two sub-problems:
  - Identifying all frequent itemsets
  - Using these frequent itemsets to generate association rules
- The first sub-problem plays an essential role in mining association rules
- This proposal addresses mining frequent itemsets and mining share-frequent itemsets
Background and Related Work
- Support-Confidence Framework
  - Each item is a binary variable denoting whether the item was purchased
  - Apriori (Agrawal & Srikant, 1994) and Apriori-like algorithms
  - Pattern-growth algorithms (Han et al., 2000; Han et al., 2004)
- Share-Confidence Framework (Carter et al., 1997)
  - The support-confidence framework does not analyze the exact number of products purchased
  - The support count does not measure the profit or cost of an itemset
  - Exhaustive search algorithms; fast algorithms (but with errors)
Support-Confidence Framework (1/3)
- Apriori algorithm (Agrawal and Srikant, 1994): minSup = 40%
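The level-wise idea behind Apriori can be sketched as follows. This Python toy is my illustration, not the deck's VC++ implementation; the six transactions reuse the FP-growth example from the next slide, with minSup = 40%.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori: a (k+1)-candidate is kept only if all of its
    k-subsets are frequent (the downward-closure property)."""
    n = len(transactions)
    # Count frequent 1-itemsets in one scan
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    freq = {i for i, c in counts.items() if c / n >= min_sup}
    all_freq = set(freq)
    k = 1
    while freq:
        # Join step: unite pairs of frequent k-itemsets into (k+1)-candidates
        candidates = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        # Prune step: drop any candidate with an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k))}
        # Count supports with one database scan
        freq = {c for c in candidates
                if sum(c <= set(t) for t in transactions) / n >= min_sup}
        all_freq |= freq
        k += 1
    return all_freq

db = [{'C','A','B','D'}, {'C','A'}, {'C','A'},
      {'C','B','D'}, {'A','B','D'}, {'C','B','D'}]
# Frequent itemsets at minSup = 40%: A, B, C, D, AC, BC, BD, CD, BCD
print(sorted(''.join(sorted(s)) for s in apriori(db, 0.4)))
```

With minSup = 40% of 6 transactions, an itemset needs support count ≥ 3; e.g. {A, B} (count 2) is pruned, while {B, C, D} (count 3) survives.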
Support-Confidence Framework (2/3)
- FP-growth algorithm (Han et al., 2000; Han et al., 2004)

TID | Frequent 1-itemsets (sorted)
001 | C A B D
002 | C A
003 | C A
004 | C B D
005 | A B D
006 | C B D

[Figure: the FP-tree built transaction by transaction. Final tree: root → C(5) → {A(3) → B(1) → D(1); B(2) → D(2)} and root → A(1) → B(1) → D(1), with a header table linking the nodes of C, A, B, and D.]
Support-Confidence Framework (3/3)

[Figure: mining the FP-tree via conditional pattern bases. Conditional FP-tree of "D": root → B(4) → C(3). Conditional FP-tree of "BD": root → C(3).]
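The FP-tree construction that the figure walks through can be sketched as follows. This is a minimal illustrative Python version (node class, counts, header table), not the authors' code; conditional-tree mining is omitted for brevity.

```python
class Node:
    """One FP-tree node: an item, its count, and links to parent/children."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fp_tree(sorted_transactions):
    """Insert each transaction (items pre-sorted by global frequency)
    into a prefix tree; the header table links all nodes of each item."""
    root = Node(None, None)
    header = {}  # item -> list of tree nodes holding that item
    for trans in sorted_transactions:
        node = root
        for item in trans:
            if item in node.children:
                node.children[item].count += 1
            else:
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            node = node.children[item]
    return root, header

# The deck's six transactions, already sorted into C, A, B, D order
db = [list('CABD'), list('CA'), list('CA'),
      list('CBD'), list('ABD'), list('CBD')]
root, header = build_fp_tree(db)
print(root.children['C'].count)  # 5: the C-prefixed path is shared by five transactions
```

The resulting tree matches the slide's final snapshot: C(5) with sub-branches A(3)→B(1)→D(1) and B(2)→D(2), plus a separate A(1)→B(1)→D(1) branch for transaction 005.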
Share-Confidence Framework (1/6)
- Measure value mv(ip, Tq): the quantity of item ip in transaction Tq; e.g. mv({D}, T01) = 1, mv({C}, T03) = 3
- Transaction measure value: tmv(Tq) = Σ_{ip ∈ Tq} mv(ip, Tq); e.g. tmv(T02) = 9
- Total measure value: Tmv(DB) = Σ_{Tq ∈ DB} tmv(Tq); e.g. Tmv(DB) = 44
- Itemset measure value: imv(X, Tq) = Σ_{ip ∈ X} mv(ip, Tq); e.g. imv({A, E}, T02) = 4
- Local measure value: lmv(X) = Σ_{Tq ∈ dbX} imv(X, Tq); e.g. lmv({BC}) = 2 + 4 + 5 = 11
Share-Confidence Framework (2/6)
- Itemset share: SH(X) = lmv(X) / Tmv(DB); e.g. SH({BC}) = 11/44 = 25%
- minShare = 30%
- SH-frequent: if SH(X) ≥ minShare, X is a share-frequent (SH-frequent) itemset
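These measure-value definitions translate directly into code. A minimal sketch, assuming a hypothetical 3-transaction DB whose quantities are invented to reproduce lmv({BC}) = 2 + 4 + 5 = 11 (its Tmv is therefore 15, not the deck's 44, so the share differs from the slide's 25%):

```python
def tmv(tq):
    """Transaction measure value: total quantity purchased in one transaction."""
    return sum(tq.values())

def lmv(X, db):
    """Local measure value: sum of imv(X, Tq) over transactions containing X."""
    return sum(sum(t[i] for i in X) for t in db if X <= t.keys())

def SH(X, db):
    """Itemset share: lmv(X) divided by the total measure value of the DB."""
    return lmv(X, db) / sum(tmv(t) for t in db)

# Hypothetical transactions: item -> purchased quantity
db = [{'B': 1, 'C': 1, 'D': 1},   # imv({B,C}) = 2
      {'B': 2, 'C': 2, 'A': 1},   # imv({B,C}) = 4
      {'B': 2, 'C': 3, 'E': 2}]   # imv({B,C}) = 5
print(lmv({'B', 'C'}, db))  # 11
```

X is SH-frequent when SH(X, db) ≥ minShare.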
Share-Confidence Framework (3/6)
- ZP (Zero Pruning) and ZSP (Zero Subset Pruning): variants of exhaustive search that prune the candidate itemsets whose local measure values are exactly zero
- SIP (Share Infrequent Pruning): like Apriori, but with errors
- CAC (Combine All Counted) and PCAC (Parametric CAC): derived from ZSP, using a prediction function, with errors
- IAB (Item Add-Back) and PIAB (Parametric IAB): join each share-frequent itemset with each 1-itemset, with errors
- Existing algorithms are either inefficient or do not discover the complete set of share-frequent (SH-frequent) itemsets
Share-Confidence Framework (4/6)

[Figure: itemset lattice with local measure values, as traversed by the ZP, SIP, and IAB algorithms]
A:10 B:8 C:10 D:6 E:4 H:1 ...
AB:6 AC:14 AD:7 AE:10 BC:11 BD:14 BE:0 CD:8 CE:10 DE:0
ABC:3 ABD:9 ABE:0 ACD:3 ACE:16 ADE:0 BCD:15 BCE:0 BDE:0 CDE:0
ABCD:4 ABCE:0 ABDE:0 ACDE:0 BCDE:0
ABCDE:0
Share-Confidence Framework (5/6)

[Figure: itemsets counted by the ZSP algorithm]
A:10 B:8 C:10 D:6 E:4 H:1 ...
AB:6 AC:14 AD:7 AE:10 BC:11 BD:14 BE:0 CD:8 CE:10 DE:0
ABC:3 ABD:9 ABE:0 ACD:3 ACE:16 ADE:0 BCD:15 CDE:0
ABCD:4 ACDE:0
Share-Confidence Framework (6/6)

[Figure: itemsets counted by the CAC algorithm]
A:10 B:8 C:10 D:6 E:4 H:1 ...
AB:6 AC:14 AD:7 AE:10 BC:11 BD:14 BE:0 CD:8 CE:10 DE:0
ABC:3 ABD:9 ACD:3

CAC predicted share:
PSH(XY) = SH(X) + SH(Y) × |dbX| / |DB|, if |dbX| < |dbY| ... (1)
PSH(XY) = SH(Y) + SH(X) × |dbY| / |DB|, if |dbY| < |dbX| ... (2)
PSH(XY) = ((1) + (2)) / 2, if |dbX| = |dbY|
Ex. PSH(AB) = (22.7% + 18.2% × 4/6 + 18.2% + 22.7% × 4/6) / 2 = 34.1%
Ex. PSH(AE) = 9.1% + 22.7% × (2/6) = 16.7% < 30%
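The three PSH cases can be sketched as one function; the percentages and transaction counts below are the deck's own example values.

```python
def psh(sh_x, sh_y, db_x, db_y, db_size):
    """CAC predicted share of the joined itemset XY.
    sh_x, sh_y: shares of X and Y; db_x, db_y: number of transactions
    containing X and Y; db_size: |DB|."""
    est_from_x = sh_x + sh_y * db_x / db_size   # Eq. (1): used when |dbX| < |dbY|
    est_from_y = sh_y + sh_x * db_y / db_size   # Eq. (2): used when |dbY| < |dbX|
    if db_x < db_y:
        return est_from_x
    if db_y < db_x:
        return est_from_y
    return (est_from_x + est_from_y) / 2        # tie: average of (1) and (2)

# Deck's examples: PSH(AB) with SH(A)=22.7%, SH(B)=18.2%, |dbA|=|dbB|=4, |DB|=6
print(round(psh(0.227, 0.182, 4, 4, 6) * 100, 1))  # 34.1
# PSH(AE) with SH(E)=9.1%, SH(A)=22.7%, |dbE|=2 < |dbA|=4
print(round(psh(0.091, 0.227, 2, 4, 6) * 100, 1))  # 16.7
```

Since PSH is only a prediction, CAC can misclassify itemsets, which is why the deck labels it "with errors".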
A New FP-Tree for Mining Frequent Itemsets (1/3)
- NFP-growth algorithm
- NFP-tree construction
A New FP-Tree for Mining Frequent Itemsets (2/3)

TID | Frequent 1-itemsets (sorted)
001 | C A B D
002 | C A
003 | C A
004 | C B D
005 | A B D
006 | C B D

[Figure: the FP-tree (left) compared with step-by-step NFP-tree construction (right). Each NFP-tree node carries a pair of counters; the final NFP-tree is root → C(5,5) → {A(3,4) → B(1,2) → D(1,2); B(2,2) → D(2,2)}, which absorbs transaction 005 (A B D) into existing branches through the second counters instead of growing a separate branch, so the NFP-tree needs fewer nodes than the FP-tree.]
A New FP-Tree for Mining Frequent Itemsets (3/3)

[Figure: deriving the conditional NFP-tree of "D(3,4)" from the NFP-tree; the resulting conditional tree is root → B(3,4).]
Experimental Results (1/4)
- PC: Pentium IV 1.5 GHz, 1 GB SDRAM, running Windows 2000 Professional
- All algorithms were coded in VC++ 6.0
- Datasets:
  - Real: BMS-WebView-1, BMS-WebView-2, Connect-4
  - Artificial: generated by the IBM synthetic data generator
    |D|: number of transactions in DB
    |T|: mean size of the transactions
    |I|: mean size of the maximal potentially frequent itemsets
    |L|: number of maximal potentially frequent itemsets
    N: number of items
Experimental Results (2/4)

[Figure: running time (sec) of FP vs. NFP on BMS-WebView-1 (minimum support 0.056%-0.066%) and on BMS-WebView-2 (minimum support 0.008%-0.032%).]
Experimental Results (3/4)

[Figure: running time (sec) of FP vs. NFP on Connect-4 (minimum support 57%-75%).]
Experimental Results (4/4)

[Figure: running time (sec) of FP vs. NFP on T10.I6.D500k.L10k and on T10.I6.D500k.L50 (minimum support 0.01%-0.09%).]
A Fast Algorithm for Mining Share-Frequent Itemsets
- FSM: Fast Share Measure algorithm
- ML: maximum transaction length in DB
- MV: maximum measure value in DB
- min_lmv = minShare × Tmv(DB)
- Level Closure Property: given a minShare and a k-itemset X,
  - Theorem 1. If lmv(X) + (lmv(X)/k) × MV < min_lmv, all supersets of X with length k+1 are infrequent
  - Theorem 2. If lmv(X) + (lmv(X)/k) × MV × k' < min_lmv, all supersets of X with length k+k' are infrequent
  - Corollary 1. If lmv(X) + (lmv(X)/k) × MV × (ML − k) < min_lmv, all supersets of X are infrequent
FSM: Fast Share Measure Algorithm

[Figure: itemset lattice pruned by FSM]
A:10 B:8 C:10 D:6 E:4 H:1 ...
AB:6 AC:14 AD:7 AE:10 BC:11 BD:14 BE:0 CD:8 CE:10 DE:0
ABC:3 ABD:9 ABE:0 ACD:3 ACE:16 ADE:0 BCD:15 CDE:0

- minShare = 30%
- Let CF(X) = lmv(X) + (lmv(X)/k) × MV × (ML − k)
- Prune X if CF(X) < min_lmv
- Ex. CF({ABC}) = 3 + (3/3) × 3 × (6 − 3) = 12 < 14 = min_lmv
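Corollary 1's pruning test is a one-line formula; a minimal sketch using the deck's {ABC} example values (lmv = 3, k = 3, MV = 3, ML = 6, min_lmv = 14):

```python
def critical_function(lmv_x, k, MV, ML):
    """FSM critical function CF(X) = lmv(X) + (lmv(X)/k) * MV * (ML - k):
    an upper bound on the local measure value any superset of X can reach."""
    return lmv_x + (lmv_x / k) * MV * (ML - k)

def fsm_prune(lmv_x, k, MV, ML, min_lmv):
    """Prune X (and thereby all of its supersets) when CF(X) < min_lmv."""
    return critical_function(lmv_x, k, MV, ML) < min_lmv

print(critical_function(3, 3, 3, 6))  # 12.0
print(fsm_prune(3, 3, 3, 6, 14))      # True: {ABC} and all supersets are pruned
```

Because CF over-estimates with the loose bound MV × (ML − k), FSM never discards a truly SH-frequent itemset; it only counts more candidates than strictly necessary.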
Experimental Results (1/2)
T4.I2.D100k.N50.S10, minShare = 0.8%, ML = 14

Pass (k)        ZSP       FSM(1)  FSM(2)  FSM(3)  FSM(ML-k)
k=1    Ck       50        50      50      50      50
       RCk      50        49      49      49      50
       Fk       32        32      32      32      32
k=2    Ck       1225      1176    1176    1176    1225
       RCk      1219      570     754     845     1085
       Fk       119       119     119     119     119
k=3    Ck       19327     4256    7062    8865    14886
       RCk      17217     868     1685    2410    5951
       Fk       65        65      65      65      65
k=4    Ck       165077    1725    3233    5568    24243
       RCk      107397    232     644     1236    6117
       Fk       9         9       9       9       9
k=5    Ck       406374    81      258     717     6309
       RCk      266776    5       40      109     1199
       Fk       0         0       0       0       0
k=6    Ck       369341    0       1       4       287
       RCk      310096    0       0       0       37
       Fk       0         0       0       0       0
k>=7   Ck       365975    0       0       0       0
       RCk      359471    0       0       0       0
       Fk       0         0       0       0       0
Time (sec)      10349.9   2.30    2.98    3.31    11.24
Experimental Results (2/2)

[Figure: running time (sec, log scale) of ZSP, FSM(ML-1), FSM(3), FSM(2), and FSM(1) on T4.I2.Dz.N50.S10 as the number of transactions grows to 1000k, and on T4.I2.D100k.N50.S10 as minShare varies from 0.2% to 2%.]
Efficient Algorithms for Mining Share-Frequent Itemsets
- EFSM (Enhanced FSM): instead of joining two arbitrary itemsets in RCk-1, EFSM joins each itemset of RCk-1 with a single item in RC1 to generate Ck efficiently
- Reduces the time complexity of candidate generation from O(n^(2k-2)) to O(n^k)
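The EFSM join step can be sketched as follows; the RC1/RC2 contents here are hypothetical and chosen only to show the shape of the join.

```python
def efsm_candidates(rc_prev, rc1):
    """EFSM candidate generation: extend each itemset in RCk-1 by one
    single item from RC1, instead of joining arbitrary pairs of RCk-1."""
    out = set()
    for itemset in rc_prev:
        for (item,) in rc1:          # each RC1 member is a 1-itemset
            if item not in itemset:
                out.add(frozenset(itemset | {item}))
    return out

rc1 = [frozenset({i}) for i in 'ABCD']
rc2 = [frozenset('AB'), frozenset('AC')]
print(sorted(''.join(sorted(c)) for c in efsm_candidates(rc2, rc1)))
# ['ABC', 'ABD', 'ACD']
```

With |RCk-1| bounded by n^(k-1) and |RC1| by n, this loop does O(n^k) joins, versus O(n^(2k-2)) for pairing RCk-1 with itself.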
Efficient Algorithms for Mining Share-Frequent Itemsets
- Xk+1: an arbitrary superset of X with length k+1 in DB
- S(Xk+1): the set containing all Xk+1 in DB
- dbS(Xk+1): the set of transactions each of which contains at least one Xk+1
- SuFSM and ShFSM: derived from EFSM; they prune the candidates more efficiently than FSM
- SuFSM (Support-counted FSM):
  - Theorem 3. If lmv(X) + Sup(S(Xk+1)) × MV × (ML − k) < min_lmv, all supersets of X are infrequent
SuFSM (Support-counted FSM)
- lmv(X)/k ≥ Sup(X) ≥ Sup(S(Xk+1)) ≥ maxSup(Xk+1)
- Ex. lmv({BCD})/k = 15/3 = 5, Sup({BCD}) = 3, Sup(S({BCD}k+1)) = 2, maxSup(Xk+1) = 1
- If no superset of X is an SH-frequent itemset, then the following four inequalities hold:
  - lmv(X) + (lmv(X)/k) × MV × (ML − k) < min_lmv
  - lmv(X) + Sup(X) × MV × (ML − k) < min_lmv
  - lmv(X) + Sup(S(Xk+1)) × MV × (ML − k) < min_lmv
  - lmv(X) + maxSup(Xk+1) × MV × (ML − k) < min_lmv
ShFSM (Share-counted FSM)
- Theorem 4. If Tmv(dbS(Xk+1)) < min_lmv, all supersets of X are infrequent
- Pruning conditions compared:
  - FSM: lmv(X) + (lmv(X)/k) × MV × (ML − k) < min_lmv
  - SuFSM: lmv(X) + Sup(S(Xk+1)) × MV × (ML − k) < min_lmv
  - ShFSM: Tmv(dbS(Xk+1)) < min_lmv
ShFSM (Share-counted FSM)

[Figure: itemsets counted by ShFSM]
A:10 B:8 C:10 D:6 E:4 H:1 ...
AB:6 AC:14 AD:7 AE:10 BC:11 BD:14 BE:0 CD:8 CE:10 DE:0
ACE:16 BCD:15 CDE:0

Ex. X = {AB}: Tmv(dbS(Xk+1)) = tmv(T01) + tmv(T05) = 6 + 6 = 12 < 14 = min_lmv
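Theorem 4's test can be sketched as follows. The per-item quantities below are hypothetical, chosen so that tmv(T01) = tmv(T05) = 6 as in the deck's {AB} example; the deck does not show the full quantity table.

```python
def shfsm_prune(X, db, min_lmv):
    """ShFSM (Theorem 4): prune X if Tmv(dbS(Xk+1)), the total measure
    value of transactions containing at least one (k+1)-superset of X,
    falls below min_lmv.  A transaction holds such a superset exactly
    when it contains X plus at least one additional item."""
    total = 0
    for t in db:
        if X <= t.keys() and len(t) > len(X):
            total += sum(t.values())          # add tmv(Tq)
    return total < min_lmv

# Hypothetical quantities reproducing tmv(T01) = tmv(T05) = 6:
db = [{'A': 1, 'B': 1, 'C': 3, 'D': 1},   # T01, tmv = 6
      {'C': 5},                            # T02 (contains neither A nor B)
      {'A': 2, 'B': 2, 'D': 2}]            # T05, tmv = 6
print(shfsm_prune({'A', 'B'}, db, 14))    # True: 12 < 14, {AB}'s supersets pruned
```

Because Tmv(dbS(Xk+1)) bounds the lmv of every superset of X directly, this single scan-based quantity replaces FSM's looser MV × (ML − k) estimate.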
Experimental Results (1/4)

[Figure: running time (sec, log scale) vs. minShare (0-1.2%) of FSM, EFSM, SuFSM, and ShFSM on T6.I4.D100k.N200.S10, and of ZSP, EZSP, FSM, EFSM, SuFSM, and ShFSM on T4.I2.D100k.N50.S10.]
Experimental Results (2/4)

[Figure: running time (sec, log scale) of FSM, EFSM, SuFSM, and ShFSM on T6.I4.Dz.N200.S10 as the number of transactions grows to 1000k (minShare = 0.3%), and on T10.I6.D100k.N500.S20 as minShare varies from 0 to 1.2%.]
Experimental Results (3/4)

[Figure: running time (sec, log scale) of FSM, EFSM, SuFSM, and ShFSM on T6.I4.D100k.N200.Sm as S varies from 0 to 60 (minShare = 0.3%).]
Experimental Results (4/4)
T6.I4.D100k.N200.S10, minShare = 0.1%, ML = 20

Pass (k)     FSM       EFSM      SuFSM   ShFSM   Fk
k=1   Ck     200       200       200     200     159
      RCk    200       200       199     197
k=2   Ck     19900     19900     19701   19306   1844
      RCk    16214     16214     13312   7199
k=3   Ck     829547    829547    564324  190607  101
      RCk    251877    251877    99765   9792
k=4   Ck     3290296   3290296   793042  20913   0
      RCk    332877    332877    41057   1420
k=5   Ck     393833    393833    25003   1050    5
      RCk    71420     71420     19720   959
k=6   Ck     26137     26137     11582   518     8
      RCk    25562     25562     11045   506
k=7   Ck     11141     11141     5940    204     7
      RCk    11099     11099     5827    196
k=8   Ck     4426      4426      2797    58      1
      RCk    4423      4423      2750    54
k>=9  Ck     2036      2036      1567    12      0
      RCk    2030      2030      1513    10
Time (sec)   13610.4   71.55     29.67   10.95
Conclusions
- Support measure (NFP-growth):
  - Uses two counters per tree node to reduce the number of tree nodes
  - Applies a smaller tree and header table to discover frequent itemsets efficiently
  - Future work: develop superior data structures and extend the pattern-growth approach
- Share measure:
  - The proposed algorithms efficiently decrease the number of candidates to be counted
  - ShFSM performs best
  - Future work: develop superior algorithms to accelerate the identification of all SH-frequent itemsets
[Backup slide: ShFSM pruning example showing Tmv(dbS(Xk+1)) values for candidate extensions over the lattice A:10 B:8 C:10 D:6 E:4 H:1 ...; prune when Tmv(dbS(Xk+1)) < min_lmv.]
Thank You!