
Fast Algorithms for Mining Frequent Itemsets


A Study of Fast Algorithms for Mining Frequent Itemsets (挖掘頻繁項目集合之快速演算法研究). Doctoral dissertation proposal. Advisor: Prof. 張真誠. Graduate student: 李育強. Dept. of Computer Science and Information Engineering, National Chung Cheng University. Date: January 20, 2005. Outline: Introduction, Background and Related Work, ...


Page 1: Fast Algorithms for Mining Frequent Itemsets

Fast Algorithms for Mining Frequent Itemsets

Doctoral Dissertation Proposal

Advisor: Prof. 張真誠
Graduate student: 李育強
Dept. of Computer Science and Information Engineering, National Chung Cheng University

Date: January 20, 2005

Page 2: Fast Algorithms for Mining Frequent Itemsets

Outline

Introduction
Background and Related Work
A New FP-Tree for Mining Frequent Itemsets
Efficient Algorithms for Mining Share-Frequent Itemsets
Conclusions

Page 3: Fast Algorithms for Mining Frequent Itemsets

Introduction

Data mining techniques have been developed to find a small set of precious nuggets in reams of data.
Mining association rules is one of the most important data mining problems. It consists of two sub-problems:
  Identifying all frequent itemsets
  Using these frequent itemsets to generate association rules
The first sub-problem plays an essential role in mining association rules.
Two topics are covered: mining frequent itemsets and mining share-frequent itemsets.

Page 4: Fast Algorithms for Mining Frequent Itemsets

Background and Related Work

Support-Confidence Framework
  Each item is a binary variable denoting whether the item was purchased
  Apriori (Agrawal & Srikant, 1994) and Apriori-like algorithms
  Pattern-growth algorithms (Han et al., 2000; Han et al., 2004)
Share-Confidence Framework (Carter et al., 1997)
  The support-confidence framework does not analyze the exact number of products purchased
  The support count method does not measure the profit or cost of an itemset
  Exhaustive search algorithms
  Fast algorithms (but with errors)

Page 5: Fast Algorithms for Mining Frequent Itemsets

Support-Confidence Framework (1/3)

Apriori algorithm (Agrawal and Srikant, 1994): minSup = 40%
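
The level-wise search that Apriori performs can be sketched as follows. This is a minimal illustration, not the implementation used in the talk: the function name `apriori` is hypothetical, and the six transactions are an assumption borrowed from the FP-growth example in this deck (they are the already filtered item lists, since the original table for this slide did not survive).

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support count} for itemsets whose support fraction >= min_sup."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c / n >= min_sup}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unite pairs of frequent (k-1)-itemsets that form a k-itemset.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count step: one scan of the database per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c / n >= min_sup}
        result.update(frequent)
        k += 1
    return result

# Assumed example data (taken from the FP-growth slide of this deck).
DB = ["CABD", "CA", "CA", "CBD", "ABD", "CBD"]
frequent_itemsets = apriori(DB, 0.40)
```

With minSup = 40% and six transactions, an itemset needs a support count of at least 3 to be kept.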

Page 6: Fast Algorithms for Mining Frequent Itemsets

Support-Confidence Framework (2/3)

FP-growth algorithm (Han et al., 2000; Han et al., 2004)

TID   Frequent 1-itemsets (sorted)
001   C A B D
002   C A
003   C A
004   C B D
005   A B D
006   C B D

[Figure: step-by-step construction of the FP-tree, one transaction at a time, with header table C, A, B, D. Final tree: root → C(5) → A(3) → B(1) → D(1); C(5) → B(2) → D(2); root → A(1) → B(1) → D(1).]
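
The construction shown on this slide can be sketched in a few lines. This is an illustrative sketch rather than the code used in the talk; the names `FPNode`, `build_fp_tree`, and `show` are hypothetical, and the transactions are assumed to be already filtered and sorted as in the table above.

```python
class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> FPNode

def build_fp_tree(sorted_transactions):
    """Insert each transaction as a path from the root, sharing common prefixes."""
    root = FPNode(None)
    header = {}  # item -> list of nodes carrying that item (the node-links)
    for transaction in sorted_transactions:
        node = root
        for item in transaction:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, header

def show(node, depth=0):
    """Return the tree as indented 'item(count)' lines, depth first."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item}({child.count})")
        lines.extend(show(child, depth + 1))
    return lines

TRANSACTIONS = ["CABD", "CA", "CA", "CBD", "ABD", "CBD"]
root, header = build_fp_tree(TRANSACTIONS)
tree_lines = show(root)
```

Summing the counts along the node-links of an item gives its support, for example 4 for D.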

Page 7: Fast Algorithms for Mining Frequent Itemsets

Support-Confidence Framework (3/3)

[Figure: the FP-tree of the previous slide with the paths that contain D highlighted, followed by two conditional trees. Conditional FP-tree of "D": header table C, B; nodes B(4) and C(3) under the root. Conditional FP-tree of "BD": header table C; node C(3) under the root.]
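
The counts in the two conditional trees can be checked directly against the transactions. The sketch below is a simplification: it projects the database instead of walking the node-links of the tree, which yields the same item counts. The function name `conditional_counts` is hypothetical, and the minimum count of 3 is an assumption derived from minSup = 40% over six transactions.

```python
from collections import Counter

def conditional_counts(transactions, suffix, min_count):
    """Count items co-occurring with all items of `suffix`; keep those reaching min_count."""
    suffix = set(suffix)
    counts = Counter()
    for t in transactions:
        if suffix <= set(t):
            counts.update(set(t) - suffix)
    return {item: c for item, c in counts.items() if c >= min_count}

TRANSACTIONS = ["CABD", "CA", "CA", "CBD", "ABD", "CBD"]
cond_d = conditional_counts(TRANSACTIONS, "D", 3)    # items for the conditional tree of "D"
cond_bd = conditional_counts(TRANSACTIONS, "BD", 3)  # items for the conditional tree of "BD"
```

Item A co-occurs with D only twice, so it is dropped from both conditional trees.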

Page 8: Fast Algorithms for Mining Frequent Itemsets

Share-Confidence Framework (1/6)

Measure value: $mv(i_p, T_q)$
  Examples: mv({D}, T01) = 1, mv({C}, T03) = 3
Transaction measure value: $tmv(T_q) = \sum_{i_p \in T_q} mv(i_p, T_q)$
  Example: tmv(T02) = 9
Total measure value: $Tmv(DB) = \sum_{T_q \in DB} \sum_{i_p \in T_q} mv(i_p, T_q)$
  Example: Tmv(DB) = 44
Itemset measure value: $imv(X, T_q) = \sum_{i_p \in X,\, X \subseteq T_q} mv(i_p, T_q)$
  Example: imv({A, E}, T02) = 4
Local measure value: $lmv(X) = \sum_{T_q \in db_X} imv(X, T_q)$
  Example: lmv({BC}) = 2 + 4 + 5 = 11
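
The definitions above translate almost one-to-one into code. The sketch below uses a small made-up database, not the one from the slide (whose table did not survive extraction); the item quantities and the function names are assumptions for illustration only.

```python
# Hypothetical database: transaction id -> {item: measure value (e.g. quantity purchased)}.
DB = {
    "T01": {"A": 1, "B": 2, "C": 1},
    "T02": {"B": 1, "C": 3},
    "T03": {"A": 2, "D": 1},
}

def tmv(transaction):
    """Transaction measure value: sum of the measure values of all items in the transaction."""
    return sum(transaction.values())

def total_mv(db):
    """Total measure value Tmv(DB): sum of tmv over all transactions."""
    return sum(tmv(t) for t in db.values())

def imv(itemset, transaction):
    """Itemset measure value: sum over the items of X, or 0 if X is not contained in T."""
    if not all(item in transaction for item in itemset):
        return 0
    return sum(transaction[item] for item in itemset)

def lmv(itemset, db):
    """Local measure value: sum of imv over the transactions that contain X."""
    return sum(imv(itemset, t) for t in db.values())

tmv_db = total_mv(DB)          # 4 + 4 + 3
lmv_bc = lmv({"B", "C"}, DB)   # (2 + 1) + (1 + 3)
```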

Page 9: Fast Algorithms for Mining Frequent Itemsets

Share-Confidence Framework (2/6)

Itemset share: $SH(X) = lmv(X) / Tmv(DB)$
  Example: SH({BC}) = 11/44 = 25%
SH-frequent: if SH(X) >= minShare, X is a share-frequent (SH-frequent) itemset
  Example threshold: minShare = 30%
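
As a worked check of the definition, the following compares two itemsets against minShare = 30%. The value lmv({AC}) = 14 is taken from the lattice figures later in this deck; the comparison itself is only an illustration.

```latex
\begin{align*}
SH(\{BC\}) &= \frac{lmv(\{BC\})}{Tmv(DB)} = \frac{11}{44} = 25\% < 30\%
  && \text{(not SH-frequent)} \\
SH(\{AC\}) &= \frac{lmv(\{AC\})}{Tmv(DB)} = \frac{14}{44} \approx 31.8\% \geq 30\%
  && \text{(SH-frequent)}
\end{align*}
```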

Page 10: Fast Algorithms for Mining Frequent Itemsets

Share-Confidence Framework (3/6)

ZP (Zero Pruning), ZSP (Zero Subset Pruning)
  Variants of exhaustive search
  Prune the candidate itemsets whose local measure values are exactly zero
SIP (Share Infrequent Pruning)
  Like Apriori, with errors
CAC (Combine All Counted), PCAC (Parametric CAC)
  Derived from ZSP, using a prediction function, with errors
IAB (Item Add-Back), PIAB (Parametric IAB)
  Join each share-frequent itemset with each 1-itemset, with errors
Existing algorithms are either inefficient or do not discover the complete set of share-frequent (SH-frequent) itemsets.

Page 11: Fast Algorithms for Mining Frequent Itemsets

Share-Confidence Framework (4/6)

Itemset lattice with local measure values:
  1-itemsets: A:10 B:8 C:10 D:6 E:4 H:1 ...
  2-itemsets: AB:6 AC:14 AD:7 AE:10 BC:11 BD:14 BE:0 CD:8 CE:10 DE:0
  3-itemsets: ABC:3 ABD:9 ABE:0 ACD:3 ACE:16 ADE:0 BCD:15 BCE:0 BDE:0 CDE:0
  4-itemsets: ABCD:4 ABCE:0 ABDE:0 ACDE:0 BCDE:0
  5-itemsets: ABCDE:0

[Figure labels: ZP Algorithm; SIP & IAB Algorithms]

Page 12: Fast Algorithms for Mining Frequent Itemsets

Share-Confidence Framework (5/6)

ZSP Algorithm

Itemsets counted, with local measure values:
  1-itemsets: A:10 B:8 C:10 D:6 E:4 H:1 ...
  2-itemsets: AB:6 AC:14 AD:7 AE:10 BC:11 BD:14 BE:0 CD:8 CE:10 DE:0
  3-itemsets: ABC:3 ABD:9 ABE:0 ACD:3 ACE:16 ADE:0 BCD:15 CDE:0
  4-itemsets: ABCD:4 ACDE:0

Page 13: Fast Algorithms for Mining Frequent Itemsets

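Share-Confidence Framework (6/6)

CAC Algorithm

Itemsets counted, with local measure values:
  1-itemsets: A:10 B:8 C:10 D:6 E:4 H:1 ...
  2-itemsets: AB:6 AC:14 AD:7 AE:10 BC:11 BD:14 BE:0 CD:8 CE:10 DE:0
  3-itemsets: ABC:3 ABD:9 ACD:3

Predicted share:
  $PSH(XY) = SH(X) + SH(Y) \times |db_X| / |DB|$, if $|db_X| < |db_Y|$ ... (1)
  $PSH(XY) = SH(Y) + SH(X) \times |db_Y| / |DB|$, if $|db_Y| < |db_X|$ ... (2)
  $PSH(XY) = ((1) + (2)) / 2$, if $|db_X| = |db_Y|$

Examples:
  PSH(AB) = (22.7% + 18.2% × 4/6 + 18.2% + 22.7% × 4/6) / 2 = 34.1%
  PSH(AE) = 9.1% + 22.7% × (2/6) = 16.7% < 30%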
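
The two examples can be reproduced with a direct transcription of the three cases. The function name `predicted_share` and its parameter names are hypothetical; the inputs (lmv values 10, 8, 4, Tmv = 44, and the transaction counts 4, 4, 2 out of 6) are read off the examples on this slide.

```python
def predicted_share(sh_x, sh_y, n_x, n_y, n_db):
    """Predicted share of XY from the shares of X and Y.

    n_x and n_y are |db_X| and |db_Y| (transactions containing X or Y),
    n_db is |DB|.
    """
    estimate_1 = sh_x + sh_y * n_x / n_db  # equation (1)
    estimate_2 = sh_y + sh_x * n_y / n_db  # equation (2)
    if n_x < n_y:
        return estimate_1
    if n_y < n_x:
        return estimate_2
    return (estimate_1 + estimate_2) / 2   # equal sizes: average of (1) and (2)

TMV = 44
psh_ab = predicted_share(10 / TMV, 8 / TMV, 4, 4, 6)  # about 0.341
psh_ae = predicted_share(10 / TMV, 4 / TMV, 4, 2, 6)  # about 0.167
```

Since the true share of AB is 6/44 (about 13.6%), the prediction of 34.1% illustrates why the deck describes this family of algorithms as working "with errors".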

Page 14: Fast Algorithms for Mining Frequent Itemsets

A New FP-Tree for Mining Frequent Itemsets (1/3)

NFP-growth Algorithm
  NFP-tree construction

Page 15: Fast Algorithms for Mining Frequent Itemsets

A New FP-Tree for Mining Frequent Itemsets (2/3)

TID   Frequent 1-itemsets (sorted)
001   C A B D
002   C A
003   C A
004   C B D
005   A B D
006   C B D

[Figure: the FP-tree of the earlier example (header table C, A, B, D) shown next to the step-by-step construction of the NFP-tree, whose nodes carry two counters. In the final NFP-tree the header table lists A, B, D and the nodes are C(5,5), A(3,4), B(1,2), D(1,2), B(2,2), D(2,2).]

Page 16: Fast Algorithms for Mining Frequent Itemsets

A New FP-Tree for Mining Frequent Itemsets (3/3)

[Figure: the NFP-tree of the previous slide (header table A, B, D; nodes C(5,5), A(3,4), B(1,2), D(1,2), B(2,2), D(2,2)), the paths that contain D, and the conditional NFP-tree of "D(3,4)" with header table B and node B(3,4) under the root.]

Page 17: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (1/4)

PC: Pentium IV 1.5 GHz, 1 GB SDRAM, running Windows 2000 Professional
All algorithms were coded in VC++ 6.0
Datasets:
  Real: BMS-WebView-1, BMS-WebView-2, Connect-4
  Artificial: generated by the IBM synthetic data generator

Parameters of the synthetic datasets:
|D|   Number of transactions in DB
|T|   Mean size of the transactions
|I|   Mean size of the maximal potentially frequent itemsets
|L|   Number of maximal potentially frequent itemsets
N     Number of items

Page 18: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (2/4)

[Figure: running time (sec) versus minimum support (%) for FP and NFP. Panels: BMS-WebView-1 (minimum support 0.056% to 0.066%) and BMS-WebView-2 (minimum support 0.008% to 0.032%).]

Page 19: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (3/4)

[Figure: running time (sec) versus minimum support (%) for FP and NFP on Connect-4 (minimum support 57% to 75%).]

Page 20: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (4/4)

[Figure: running time (sec) versus minimum support (%) for FP and NFP. Panels: T10.I6.D500k.L10k and T10.I6.D500k.L50 (minimum support 0.010% to 0.090% in both).]

Page 21: Fast Algorithms for Mining Frequent Itemsets

A Fast Algorithm for Mining Share-Frequent Itemsets

FSM: Fast Share Measure algorithm
  ML: maximum transaction length in DB
  MV: maximum measure value in DB
  min_lmv = minShare × Tmv
Level Closure Property: given a minShare and a k-itemset X
  Theorem 1. If lmv(X) + (lmv(X)/k) × MV < min_lmv, all supersets of X with length k+1 are infrequent
  Theorem 2. If lmv(X) + (lmv(X)/k) × MV × k' < min_lmv, all supersets of X with length k+k' are infrequent
  Corollary 1. If lmv(X) + (lmv(X)/k) × MV × (ML-k) < min_lmv, all supersets of X are infrequent

Page 22: Fast Algorithms for Mining Frequent Itemsets

FSM: Fast Share Measure algorithm

Itemsets counted, with local measure values:
  1-itemsets: A:10 B:8 C:10 D:6 E:4 H:1 ...
  2-itemsets: AB:6 AC:14 AD:7 AE:10 BC:11 BD:14 BE:0 CD:8 CE:10 DE:0
  3-itemsets: ABC:3 ABD:9 ABE:0 ACD:3 ACE:16 ADE:0 BCD:15 CDE:0

minShare = 30%
Let CF(X) = lmv(X) + (lmv(X)/k) × MV × (ML-k)
Prune X if CF(X) < min_lmv
Example: CF({ABC}) = 3 + (3/3) × 3 × (6-3) = 12 < 14 = min_lmv
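
The pruning test on this slide is a one-line function. In the sketch below the name `critical_function` is hypothetical; MV = 3, ML = 6, and min_lmv = 14 are the values used in the CF({ABC}) example, and the lmv values come from the lattice above.

```python
def critical_function(lmv_x, k, mv_max, ml):
    """CF(X) = lmv(X) + (lmv(X)/k) * MV * (ML - k): an upper bound on the lmv of any superset of X."""
    return lmv_x + (lmv_x / k) * mv_max * (ml - k)

MV, ML, MIN_LMV = 3, 6, 14

cf_abc = critical_function(3, 3, MV, ML)    # 3 + 1 * 3 * 3
cf_bcd = critical_function(15, 3, MV, ML)   # 15 + 5 * 3 * 3

can_prune_abc = cf_abc < MIN_LMV   # ABC and all its supersets can be discarded
can_prune_bcd = cf_bcd < MIN_LMV   # BCD must be kept as a candidate
```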

Page 23: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (1/2)

T4.I2.D100k.N50.S10, minShare = 0.8%, ML = 14

Pass (k)    Set   ZSP       FSM(1)   FSM(2)   FSM(3)   FSM(ML-k)
k=1         Ck    50        50       50       50       50
            RCk   50        49       49       49       50
            Fk    32        32       32       32       32
k=2         Ck    1225      1176     1176     1176     1225
            RCk   1219      570      754      845      1085
            Fk    119       119      119      119      119
k=3         Ck    19327     4256     7062     8865     14886
            RCk   17217     868      1685     2410     5951
            Fk    65        65       65       65       65
k=4         Ck    165077    1725     3233     5568     24243
            RCk   107397    232      644      1236     6117
            Fk    9         9        9        9        9
k=5         Ck    406374    81       258      717      6309
            RCk   266776    5        40       109      1199
            Fk    0         0        0        0        0
k=6         Ck    369341    0        1        4        287
            RCk   310096    0        0        0        37
            Fk    0         0        0        0        0
k>=7        Ck    365975    0        0        0        0
            RCk   359471    0        0        0        0
            Fk    0         0        0        0        0
Time (sec)        10349.9   2.30     2.98     3.31     11.24

Page 24: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (2/2)

[Figure: running time (sec, logarithmic scale from 1 to 100000) for ZSP, FSM(ML-1), FSM(3), FSM(2), and FSM(1). Panels: T4.I2.Dz.N50.S10 versus number of transactions (0 to 1000 k) and T4.I2.D100k.N50.S10 versus minShare (0.2% to 2%).]

Page 25: Fast Algorithms for Mining Frequent Itemsets

Efficient Algorithms for Mining Share-Frequent Itemsets

EFSM (Enhanced FSM): instead of joining two arbitrary itemsets in RCk-1, EFSM joins each itemset of RCk-1 with a single item in RC1 to generate Ck efficiently

Reduces the time complexity from O(n^(2k-2)) to O(n^k)
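
The join described above can be sketched as follows. This is an illustration of the idea only, under the assumption that RC1 is given as a set of single items and RCk-1 as a set of frozensets; the function name `efsm_candidates` is hypothetical and the example sets are made up.

```python
def efsm_candidates(rc_prev, rc_1_items):
    """Generate k-candidates by extending each (k-1)-itemset with one remaining item.

    rc_prev: set of frozensets of length k-1 (the remaining candidates RC_{k-1}).
    rc_1_items: set of single items that survived pass 1 (RC_1).
    """
    candidates = set()
    for itemset in rc_prev:
        for item in rc_1_items:
            if item not in itemset:
                # The set removes duplicates such as AB+C and BC+A both giving ABC.
                candidates.add(itemset | {item})
    return candidates

# Hypothetical input: two remaining 2-itemsets and four remaining items.
rc_2 = {frozenset("AB"), frozenset("BC")}
rc_1 = {"A", "B", "C", "D"}
c_3 = efsm_candidates(rc_2, rc_1)
```

Each itemset in RCk-1 is paired with at most n items, which is where the saving over a join of RCk-1 with itself comes from.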

Page 26: Fast Algorithms for Mining Frequent Itemsets

Efficient Algorithms for Mining Share-Frequent Itemsets

Xk+1: an arbitrary superset of X with length k+1 in DB
S(Xk+1): the set that contains all Xk+1 in DB
dbS(Xk+1): the set of transactions each of which contains at least one Xk+1
SuFSM and ShFSM are derived from EFSM and prune the candidates more efficiently than FSM
SuFSM (Support-counted FSM):
  Theorem 3. If lmv(X) + Sup(S(Xk+1)) × MV × (ML-k) < min_lmv, all supersets of X are infrequent

Page 27: Fast Algorithms for Mining Frequent Itemsets

SuFSM (Support-counted FSM)

lmv(X)/k ≥ Sup(X) ≥ Sup(S(Xk+1)) ≥ maxSup(Xk+1)
  Ex. lmv({BCD})/k = 15/3 = 5, Sup({BCD}) = 3, Sup(S({BCD}k+1)) = 2, maxSup(Xk+1) = 1
If no superset of X is an SH-frequent itemset, then the following four inequalities hold:
  lmv(X) + (lmv(X)/k) × MV × (ML-k) < min_lmv
  lmv(X) + Sup(X) × MV × (ML-k) < min_lmv
  lmv(X) + Sup(S(Xk+1)) × MV × (ML-k) < min_lmv
  lmv(X) + maxSup(Xk+1) × MV × (ML-k) < min_lmv
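
To see how much the choice of factor tightens the bound, the sketch below evaluates the four left-hand sides for X = {BCD} with the factors from the example above. MV = 3 and ML = 6 are an assumption carried over from the CF({ABC}) example earlier in the deck; the function name `upper_bound` is hypothetical.

```python
def upper_bound(lmv_x, factor, mv_max, ml, k):
    """Left-hand side shared by the four inequalities: lmv(X) + factor * MV * (ML - k)."""
    return lmv_x + factor * mv_max * (ml - k)

LMV_BCD, K = 15, 3
MV, ML = 3, 6  # assumed: values used in the CF({ABC}) example

factors = {
    "lmv(X)/k": LMV_BCD / K,   # 5, used by FSM
    "Sup(X)": 3,
    "Sup(S(Xk+1))": 2,         # used by SuFSM
    "maxSup(Xk+1)": 1,
}
bounds = {name: upper_bound(LMV_BCD, f, MV, ML, K) for name, f in factors.items()}
```

The smaller the factor, the smaller the left-hand side, so the corresponding test falls below min_lmv for more itemsets.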

Page 28: Fast Algorithms for Mining Frequent Itemsets

ShFSM (Share-counted FSM)

Theorem 4. If Tmv(dbS(Xk+1)) < min_lmv, all supersets of X are infrequent

Pruning conditions compared:
  FSM:   lmv(X) + (lmv(X)/k) × MV × (ML-k) < min_lmv
  SuFSM: lmv(X) + Sup(S(Xk+1)) × MV × (ML-k) < min_lmv
  ShFSM: Tmv(dbS(Xk+1)) < min_lmv

Page 29: Fast Algorithms for Mining Frequent Itemsets

ShFSM (Share-counted FSM)

Itemsets counted, with local measure values:
  1-itemsets: A:10 B:8 C:10 D:6 E:4 H:1 ...
  2-itemsets: AB:6 AC:14 AD:7 AE:10 BC:11 BD:14 BE:0 CD:8 CE:10 DE:0
  3-itemsets: ACE:16 BCD:15 CDE:0

Ex. X = {AB}: Tmv(dbS(Xk+1)) = tmv(T01) + tmv(T05) = 6 + 6 = 12 < 14 = min_lmv
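
The quantity Tmv(dbS(Xk+1)) can be computed in one scan: add up tmv over the transactions that contain X together with at least one further item. The sketch below uses a made-up database, not the one from the slide; the function name `tmv_of_superset_db` and the threshold value are assumptions for illustration.

```python
# Hypothetical database: transaction id -> {item: measure value}.
DB = {
    "T01": {"A": 1, "B": 2, "C": 1},  # contains a 3-item superset of {A, B}; tmv = 4
    "T02": {"A": 1, "B": 1},          # contains only {A, B} itself; not counted
    "T03": {"A": 2, "B": 1, "D": 3},  # contains a 3-item superset of {A, B}; tmv = 6
    "T04": {"C": 2, "D": 1},          # does not contain {A, B}
}

def tmv_of_superset_db(itemset, db):
    """Sum of tmv over transactions that contain X plus at least one other item."""
    x = set(itemset)
    # `x < set(t)` is a proper-subset test: the transaction holds some superset of length k+1.
    return sum(sum(t.values()) for t in db.values() if x < set(t))

MIN_LMV = 12  # hypothetical threshold
bound_ab = tmv_of_superset_db("AB", DB)
prune_supersets_of_ab = bound_ab < MIN_LMV
```

No superset of X can have an lmv larger than this sum, because its lmv is collected only from transactions in dbS(Xk+1).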

Page 30: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (1/4)

[Figure: running time (sec, logarithmic scale from 1 to 100000) versus minShare (0% to 1.2%). Panels: T6.I4.D100k.N200.S10 (FSM, EFSM, SuFSM, ShFSM) and T4.I2.D100k.N50.S10 (ZSP, EZSP, FSM, EFSM, SuFSM, ShFSM).]

Page 31: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (2/4)

[Figure: running time (sec, logarithmic scale) for FSM, EFSM, SuFSM, and ShFSM. Panels: T6.I4.Dz.N200.S10 versus number of transactions (0 to 1000 k) and T10.I6.D100k.N500.S20 versus minShare (0% to 1.2%). Annotation on the slide: minShare = 0.3%.]

Page 32: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (3/4)

[Figure: running time (sec, logarithmic scale from 1 to 1000000) versus S (0 to 60) for FSM, EFSM, SuFSM, and ShFSM on T6.I4.D100k.N200.Sm, with minShare = 0.3%.]

Page 33: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (4/4)

T6.I4.D100k.N200.S10, minShare = 0.1%, ML = 20

Pass (k)    Set   FSM       EFSM      SuFSM     ShFSM    Fk
k=1         Ck    200       200       200       200      159
            RCk   200       200       199       197
k=2         Ck    19900     19900     19701     19306    1844
            RCk   16214     16214     13312     7199
k=3         Ck    829547    829547    564324    190607   101
            RCk   251877    251877    99765     9792
k=4         Ck    3290296   3290296   793042    20913    0
            RCk   332877    332877    41057     1420
k=5         Ck    393833    393833    25003     1050     5
            RCk   71420     71420     19720     959
k=6         Ck    26137     26137     11582     518      8
            RCk   25562     25562     11045     506
k=7         Ck    11141     11141     5940      204      7
            RCk   11099     11099     5827      196
k=8         Ck    4426      4426      2797      58       1
            RCk   4423      4423      2750      54
k>=9        Ck    2036      2036      1567      12       0
            RCk   2030      2030      1513      10
Time (sec)        13610.4   71.55     29.67     10.95

Page 34: Fast Algorithms for Mining Frequent Itemsets

Conclusions

Support measure
  Uses two counters per tree node to reduce the number of tree nodes
  Applies a smaller tree and header table to discover frequent itemsets efficiently
  Future work: develop superior data structures and extend the pattern-growth approach

Page 35: Fast Algorithms for Mining Frequent Itemsets

Share measure
  The proposed algorithms efficiently decrease the number of candidates to be counted
  ShFSM performs best
  Future work: develop superior algorithms to accelerate the process of identifying all SH-frequent itemsets

Page 36: Fast Algorithms for Mining Frequent Itemsets

[Figure: ShFSM example. Itemset lattice with local measure values (A:10 B:8 C:10 D:6 E:4 H:1 ...; AC:14 AE:10 BC:11 BD:14 CD:8 CE:10; ACE:16 BCD:15), annotated with per-itemset arrays of values indexed by item, for example "B C D E ...: 12 24 12 18 ...", "C D E ...: 20 26 0 ...", and "D E ...: 20 18 ...".]

ShFSM: Tmv(dbS(Xk+1)) < min_lmv

Page 37: Fast Algorithms for Mining Frequent Itemsets

Thank You!