Mining Frequent Patterns Without Candidate Generation
Jiawei Han, Jian Pei and Yiwen Yin, School of Computer Science, Simon Fraser University
Presented by Afsoon Yousefi, CS:332, March 24th, 2014. Inspired by Song Wang's slides.
Outline
• Problem of mining frequent patterns
• Review of Apriori
• Frequent Pattern Tree
  • An example
  • Design and construction
  • Properties
• Mining Frequent Patterns Using FP-tree
  • An example
  • Design and construction
  • Properties
• Algorithm efficiency properties
• Performance study
• Future works
• Conclusion
• Selected questions
Problem of Mining Frequent Patterns

Frequent pattern mining plays an essential role in mining associations. Most previous studies adopt an Apriori-like approach, which achieves good performance but suffers from two costs:

• It is costly to handle a huge number of candidate sets. Apriori extends frequent 1-itemsets to length-2 candidates, accumulates and tests them, and so on; finding a single length-100 frequent pattern requires generating on the order of 2^100 candidates in total.
• It is tedious to repeatedly scan the database and check a large set of candidates against it.
Review of Apriori

Given a minimum support threshold ξ:
1. Use the frequent (k-1)-itemsets to generate candidate k-itemsets.
2. Scan the database and count the occurrences of each candidate.
3. Keep the candidates that meet the threshold; these are the frequent k-itemsets.

Example transaction database:

TID | Items bought
100 | f, a, c, d, g, i, m, p
200 | a, b, c, f, l, m, o
300 | b, f, h, j, o
400 | b, c, k, s, p
500 | a, f, c, e, l, p, m, n

Apriori itemsets (ξ = 3):
• All items: f, a, c, d, g, i, m, p, l, o, h, j, k, s, b, e, n
• Frequent 1-itemsets: f, a, c, m, b, p
• Candidate 2-itemsets: fa, fc, fm, fp, ac, am, … bp
• Frequent 2-itemsets: fa, fc, fm, … …
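The generate-and-test loop above can be sketched in code. The following is a minimal, illustrative Apriori in Python (the function and variable names are mine, not from the paper), run on the example database with ξ = 3:

```python
from itertools import combinations

# Transaction database from the slides; minimum support threshold xi = 3.
DB = [
    set("facdgimp"),
    set("abcflmo"),
    set("bfhjo"),
    set("bcksp"),
    set("afcelpmn"),
]
XI = 3

def apriori(db, xi):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    # Frequent 1-itemsets.
    items = {i for t in db for i in t}
    freq = {frozenset([i]): c for i in items
            if (c := sum(i in t for t in db)) >= xi}
    result = dict(freq)
    k = 2
    while freq:
        # Join frequent (k-1)-itemsets to generate candidate k-itemsets.
        prev = list(freq)
        cands = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset.
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # Scan the database and count each surviving candidate.
        freq = {c: n for c in cands
                if (n := sum(c <= t for t in db)) >= xi}
        result.update(freq)
        k += 1
    return result
```

Note that every iteration of the loop re-scans the whole database to count candidates; that repeated scanning is exactly the cost the FP-tree approach is designed to avoid.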
The bottleneck of the Apriori-like method lies in candidate-set generation and testing. How can we avoid generating a huge set of candidates?

• A novel compact data structure, called the FP-tree
• An FP-tree-based pattern-fragment-growth mining method
• A divide-and-conquer search method for frequent itemset combinations
Frequent Pattern Tree: An Example

Given the minimum support threshold ξ = 3:

1. One scan of the DB identifies the set of frequent items.
   • Items are ordered in frequency-descending order.
   • For convenience, the frequent items of each transaction are listed in this order.

TID | Items bought
100 | f, a, c, d, g, i, m, p
200 | a, b, c, f, l, m, o
300 | b, f, h, j, o
400 | b, c, k, s, p
500 | a, f, c, e, l, p, m, n

Frequent items: <(f:4), (c:4), (a:3), (b:3), (m:3), (p:3)>
TID | Items bought              | Ordered frequent items
100 | f, a, c, d, g, i, m, p    | f, c, a, m, p
200 | a, b, c, f, l, m, o       | f, c, a, b, m
300 | b, f, h, j, o             | f, b
400 | b, c, k, s, p             | c, b, p
500 | a, f, c, e, l, p, m, n    | f, c, a, m, p
2. Store the set of frequent items of each transaction in a tree:
   1. Create a "null" root.
   2. Scan the DB a second time.
   3. Insert each transaction's ordered frequent items as a path from the root.
   4. Share the existing path as long as the items match.
   5. When a different item comes up, branch and create a sub-path.

The resulting FP-tree:

root
 +- f:4
 |   +- c:3
 |   |   +- a:3
 |   |       +- m:2
 |   |       |   +- p:2
 |   |       +- b:1
 |   |           +- m:1
 |   +- b:1
 +- c:1
     +- b:1
         +- p:1
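The two-scan construction can be sketched as follows; a minimal Python sketch under the slides' assumptions (class and function names are mine). One caveat: ties among equally frequent items are broken alphabetically here, which roots the tree at c rather than f; the tree is otherwise the same shape and size as the slides' figure.

```python
class Node:
    """An FP-tree node: item name, count, parent, children, node-link."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}   # item -> Node
        self.link = None     # next node carrying the same item name

def build_fptree(transactions, xi):
    """Two DB scans: collect frequent items, then insert ordered paths."""
    # Scan 1: count supports and keep items meeting the threshold xi.
    counts = {}
    for t in transactions:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    freq = {i: c for i, c in counts.items() if c >= xi}
    # Frequency-descending order; ties broken alphabetically here.
    rank = {i: r for r, i in
            enumerate(sorted(freq, key=lambda i: (-freq[i], i)))}
    root = Node(None, None)
    header = {i: None for i in rank}   # item -> head of node-link chain
    # Scan 2: insert each transaction's frequent items as a (shared) path.
    for t in transactions:
        node = root
        for i in sorted((i for i in t if i in freq), key=rank.get):
            if i in node.children:
                node.children[i].count += 1
            else:
                child = Node(i, node)
                child.link, header[i] = header[i], child  # thread node-link
                node.children[i] = child
            node = node.children[i]
    return root, header

# The example database from the slides.
DB = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
```

Because each transaction shares its prefix with previously inserted ones, the five transactions collapse into eleven nodes here, versus 25 frequent-item occurrences in the raw database.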
3. To facilitate tree traversal, build an item header table:
   • Each entry holds an item name and the head of that item's node-link chain.
   • All nodes carrying the same item name are linked through these node-links.

Item | Head of node-links
f    | f:4
c    | c:3 → c:1
a    | a:3
b    | b:1 → b:1 → b:1
m    | m:2 → m:1
p    | p:2 → p:1
Frequent Pattern Tree: Design and Construction

1. The tree consists of:
   • one root,
   • a set of item-prefix subtrees as the children of the root,
   • a frequent-item header table.
2. Each node in the tree has three fields: item-name, count, and node-link.
3. Each entry in the frequent-item header table consists of: item-name and head of node-link.
Frequent Pattern Tree: Properties

1. Construction cost: building the FP-tree needs exactly two scans of the DB, the first to collect the set of frequent items and the second to construct the tree. The cost of inserting a transaction Trans is O(|freq(Trans)|), where freq(Trans) is the set of frequent items in Trans.
2. Completeness: the FP-tree contains all the information related to mining frequent patterns under the given minimum support threshold.
3. Compactness: the size of the tree is bounded by the total occurrences of frequent items, and the height of the tree is bounded by the maximal number of frequent items in any transaction.
Frequent Pattern Tree: Properties

Listing each transaction's frequent items in frequency-descending order is what makes the tree compact. As an example of what happens without this ordering, list the items in ascending order instead:

TID | Frequent items (ascending order)
100 | p, m, a, c, f
200 | m, b, a, c, f
300 | b, f
400 | p, b, c
500 | p, m, a, c, f

[Figure: the FP-tree built from this ascending ordering has more nodes and more branches than the descending-order tree built earlier, because the rarer items near the root prevent paths from being shared.]
Mining Frequent Patterns Using FP-tree

Examine the mining process starting from the bottom of the header table. For each item a_i, collect all the patterns in which a node labeled a_i participates, by starting from a_i's head in the header table and following a_i's node-links.
Mining Frequent Patterns Using FP-tree: An Example

Node p (p:3):
• FP-tree paths containing p: <f:4, c:3, a:3, m:2, p:2> and <c:1, b:1, p:1>
• p's conditional pattern base: {(f:2, c:2, a:2, m:2), (c:1, b:1)}
• Construct p's conditional FP-tree on this base, keeping only the frequent items.
Frequent items in p's conditional base: <(c:3)>
Frequent itemsets containing p: <p:3, cp:3>
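The counts in p's conditional pattern base can be checked in a few lines (a sketch; the paths and threshold are read off the example above):

```python
# p's prefix paths from the FP-tree, each with p's count on that path.
paths = [(("f", "c", "a", "m"), 2), (("c", "b"), 1)]
xi = 3

# Accumulate each item's support within the conditional pattern base.
counts = {}
for items, n in paths:
    for i in items:
        counts[i] = counts.get(i, 0) + n

# Only items meeting the threshold survive into p's conditional FP-tree:
# here only c (2 + 1 = 3), so the p-patterns are p and cp alone.
freq = {i: c for i, c in counts.items() if c >= xi}
```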
Node m (m:3):
• FP-tree paths containing m: <f:4, c:3, a:3, m:2> and <f:4, c:3, a:3, b:1, m:1>
• m's conditional pattern base: {(f:2, c:2, a:2), (f:1, c:1, a:1, b:1)}
• Construct m's conditional FP-tree on this base, keeping only the frequent items.
Frequent items in m's conditional base: <(f:3), (c:3), (a:3)>
Frequent itemsets containing m: <m:3, am:3, cm:3, fm:3, cam:3, fam:3, fcm:3, fcam:3>
Node b (b:3):
• FP-tree paths containing b: <f:4, c:3, a:3, b:1>, <f:4, b:1>, and <c:1, b:1>
• b's conditional pattern base: {(f:1, c:1, a:1), (f:1), (c:1)}
• Construct b's conditional FP-tree: no item in the base is frequent.
Frequent items in b's conditional base: none
Frequent itemsets containing b: <b:3>
Node a (a:3):
• FP-tree paths containing a: <f:4, c:3, a:3>
• a's conditional pattern base: {(f:3, c:3)}
• Construct a's conditional FP-tree on this base, keeping only the frequent items.
Frequent items in a's conditional base: <(f:3), (c:3)>
Frequent itemsets containing a: <a:3, fa:3, ca:3, fca:3>
Node c (c:4):
• FP-tree paths containing c: <f:4, c:3> and <c:1>
• c's conditional pattern base: {(f:3)}
• Construct c's conditional FP-tree on this base, keeping only the frequent items.
Frequent items in c's conditional base: <(f:3)>
Frequent itemsets containing c: <c:4, fc:3>
Node f (f:4):
• FP-tree paths containing f: <f:4>
• f's conditional pattern base: {} (empty)
• No conditional FP-tree needs to be built.
Frequent items in f's conditional base: none
Frequent itemsets containing f: <f:4>
Mining Frequent Patterns Using FP-tree: Design and construction
• FP-tree• Minimum support threshold
Input
• The complete set of frequent patterns
Output
• If Tree contains a single path • Then for each combination of the nodes () do
• Generate pattern • Support = min support in
• Else for each in the header table • Generate pattern with support = support• Construct ’s FP-tree call it • If • Then call FP-growth(
FP-growth(, )
29
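The recursion above can be sketched end to end. This illustrative Python version (the names are mine, not the paper's) represents each conditional database as (path, count) pairs and rebuilds a small conditional tree at every level; it omits the single-path shortcut, which is an optimization rather than a correctness requirement:

```python
from collections import defaultdict

def fp_growth(transactions, xi):
    """Mine all frequent itemsets; returns {frozenset(itemset): support}."""
    out = {}
    _mine([(tuple(t), 1) for t in transactions], frozenset(), xi, out)
    return out

def _mine(cond_db, suffix, xi, out):
    # Count item supports in this (conditional) database.
    counts = defaultdict(int)
    for path, n in cond_db:
        for item in set(path):
            counts[item] += n
    freq = {i: c for i, c in counts.items() if c >= xi}
    if not freq:
        return
    # Frequency-descending order (ties broken alphabetically).
    rank = {i: r for r, i in
            enumerate(sorted(freq, key=lambda i: (-freq[i], i)))}
    # Build the conditional FP-tree as nested dicts: item -> [count, children].
    root = {}
    for path, n in cond_db:
        node = root
        for item in sorted((i for i in set(path) if i in freq), key=rank.get):
            entry = node.setdefault(item, [0, {}])
            entry[0] += n
            node = entry[1]
    # Emit a pattern per frequent item, then recurse on its conditional base.
    for item, support in freq.items():
        out[suffix | {item}] = support
        base = []
        _prefix_paths(root, item, [], base)
        _mine(base, suffix | {item}, xi, out)

def _prefix_paths(node, target, prefix, base):
    # Collect (ancestor path, count) for every node labeled `target`.
    for item, (count, kids) in node.items():
        if item == target:
            if prefix:
                base.append((tuple(prefix), count))
        else:
            _prefix_paths(kids, target, prefix + [item], base)
```

Run on the example database with ξ = 3, this recursion reproduces the per-item results of the preceding slides, including the eight itemsets containing m, without ever generating a candidate set.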
Mining Frequent Patterns Using FP-tree: Properties

1. To calculate the frequent patterns containing a_i in a path P, only the prefix sub-path of node a_i in P needs to be considered, and the frequency count of every node in that sub-path is the same as the count of node a_i.
2. Suppose an FP-tree consists of a single path P. The complete set of frequent patterns can then be generated by enumerating all combinations of the items on P; the support of each combination equals the minimum support of the items it contains.
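Property 2 can be illustrated directly. In the running example, m's conditional FP-tree is the single path <f:3, c:3, a:3>; enumerating its combinations yields the seven suffixes that, joined with m, give the eight m-patterns (a sketch, with names of my choosing):

```python
from itertools import combinations

# m's conditional FP-tree from the example: the single path <f:3, c:3, a:3>.
path = [("f", 3), ("c", 3), ("a", 3)]

def single_path_patterns(path):
    """All non-empty item combinations on a single path; each pattern's
    support is the minimum count among its items."""
    out = {}
    for k in range(1, len(path) + 1):
        for combo in combinations(path, k):
            out[frozenset(i for i, _ in combo)] = min(c for _, c in combo)
    return out
```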
Algorithm Efficiency Properties

1. The FP-tree is usually much smaller than the DB.
2. The conditional FP-trees constructed during FP-growth are never bigger than the sub-paths they are built from.
3. The mining operations consist mainly of prefix-count adjustment, counting, and pattern-fragment concatenation. This is much less costly than generating a very large number of candidate patterns and testing each of them.
Performance Study

Comparison of FP-growth with Apriori:
• Performed on a 450 MHz Pentium PC with 128 MB of main memory, running Microsoft Windows/NT.
• Both programs were written in Microsoft Visual C++ 6.0.
• Run time was measured as the interval between input and output.

Two datasets:

Dataset | Items | Avg. transaction size | Avg. maximal frequent itemset size | Transactions
D1      | 1K    | 25                    | 10                                 | 10K
D2      | 10K   | 25                    | 20                                 | 100K
[Figures: run-time comparison charts for FP-growth versus Apriori on D1 and D2.]
Future Works

Construction of FP-trees for projected databases:
• When the database is large, the FP-tree cannot be constructed in main memory.
• Partition the database into a set of projected databases.
• Construct an FP-tree for, and mine, each projected database.
Construction of a disk-resident FP-tree:
• Use a B+-tree structure to index the FP-tree.
• Split the tree based on the common prefix paths.

Materialization of an FP-tree:
• Constructing an FP-tree needs two scans of the database, so an FP-tree could be materialized once and reused for frequent pattern mining.
• How should a good minimum support threshold ξ be selected? Use a low ξ?
Conclusion

FP-growth:
• constructs a highly compact FP-tree, usually substantially smaller than the original database;
• applies a pattern-growth method that avoids costly candidate generation and testing;
• applies a partitioning-based divide-and-conquer method that dramatically reduces the size of the subsequent conditional FP-trees;
• mines both short and long patterns efficiently in large databases.
Selected Questions

Q1. What are the components of an FP-tree?
A: One root, a set of item-prefix subtrees as the children of the root, and a frequent-item header table.

Q2. How do we calculate the frequent patterns containing a_i in a path P?
A: Only consider the prefix sub-path of node a_i in P; the frequency count of every node in that sub-path is the same as the count of node a_i; then find all the combinations.

Q3. Compare the efficiency of the mining operations in FP-growth with Apriori.
A: FP-growth's mining operations consist mainly of prefix-count adjustment, counting, and pattern-fragment concatenation. This is much less costly than Apriori's generation and testing of a very large number of candidate patterns.