GenMax

From: “Efficiently Mining Maximal Frequent Itemsets”

By: Karam Gouda & Mohammed J. Zaki

Zeev Dvir – [email protected]


The Problem

• Given a large database of item transactions, find all frequent itemsets.

• A frequent itemset is a set of items that occurs in at least a user-specified percentage of the transactions in the database.

• We call this threshold min_sup (for minimum support).


• A Maximal Frequent Itemset is a frequent itemset that has no frequent proper superset.

• FI := frequent itemsets

MFI := maximal frequent itemsets

• Fact: MFI ⊆ FI, and typically

|MFI| << |FI|

GenMax is an algorithm to find the exact MFI.


Example

Item/Tid (x = the transaction contains the item):

Tid  A  B  C  D
1    x  x  x
2          x  x
3    x  x  x
4    x  x  x  x
5    x
6       x     x
7          x

Itemset lattice:

ABCD

ABC ABD ACD BCD

AB AC AD BC BD CD

A B C D

Min_sup = 3
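To make the definitions concrete, here is a minimal brute-force sketch that enumerates the whole lattice for the example database and keeps the frequent and maximal frequent itemsets. It is not GenMax, only a reference to check results against. The names `DB`, `support`, `frequent_itemsets` and `maximal` are illustrative, and min_sup is treated as an absolute transaction count, as in the example.

```python
from itertools import combinations

# Example database from the slide: transaction id -> items it contains.
DB = {
    1: {"A", "B", "C"},
    2: {"C", "D"},
    3: {"A", "B", "C"},
    4: {"A", "B", "C", "D"},
    5: {"A"},
    6: {"B", "D"},
    7: {"C"},
}
MIN_SUP = 3  # assumption: an absolute count, as in the example


def support(itemset, db=DB):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in db.values() if itemset <= items)


def frequent_itemsets(db=DB, min_sup=MIN_SUP):
    """Brute force: test every non-empty subset of the item universe."""
    items = sorted(set().union(*db.values()))
    return [
        frozenset(c)
        for size in range(1, len(items) + 1)
        for c in combinations(items, size)
        if support(frozenset(c), db) >= min_sup
    ]


def maximal(fi):
    """Keep only the itemsets that have no frequent proper superset."""
    return [s for s in fi if not any(s < t for t in fi)]


FI = frequent_itemsets()
MFI = maximal(FI)
```

On this database the sketch finds 8 frequent itemsets (A, B, C, D, AB, AC, BC, ABC) and 2 maximal ones (ABC and D).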


Some Useful Definitions

• The combine-set of an itemset I is the set of items that can each be added to I to create a frequent itemset.

• For instance, in the previous example the combine-set of the itemset {A} is {B, C}.

• The combine-set of the empty itemset is called F1; it is the set of frequent items (frequent itemsets of size 1).
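The combine-set definition can be checked directly on the example database. The sketch below is a naive illustration; `DB`, `support` and `combine_set` are hypothetical names, and min_sup is again an absolute count.

```python
# Example database as a list of transactions (tids 1..7 in order).
DB = [
    {"A", "B", "C"}, {"C", "D"}, {"A", "B", "C"}, {"A", "B", "C", "D"},
    {"A"}, {"B", "D"}, {"C"},
]


def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)


def combine_set(itemset, db, min_sup):
    """Items y outside `itemset` such that itemset + {y} is still frequent."""
    all_items = set().union(*db)
    return {
        y for y in all_items - itemset
        if support(itemset | {y}, db) >= min_sup
    }


F1 = combine_set(set(), DB, 3)     # combine-set of the empty itemset
C_A = combine_set({"A"}, DB, 3)    # combine-set of {A}
```

Here `C_A` is {B, C}, matching the slide, and `F1` contains all four items because each of them has support of at least 3.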


// Invocation: MFI-backtrack(∅, F1, 0)

MFI-backtrack(I_k, C_k, k)

1. for each x ∈ C_k
2.     I_{k+1} = I_k ∪ {x}
3.     P_{k+1} = {y : y ∈ C_k and y > x}
4.     if I_{k+1} ∪ P_{k+1} has a superset in MFI
5.         return
6.     C_{k+1} = FI-combine(I_{k+1}, P_{k+1})
7.     if C_{k+1} is empty
8.         if I_{k+1} has no superset in MFI
9.             MFI = MFI ∪ I_{k+1}
10.    else
11.        MFI-backtrack(I_{k+1}, C_{k+1}, k+1)


FI-combine(I_{k+1}, P_{k+1})

1. C = ∅
2. for each y ∈ P_{k+1}
3.     if I_{k+1} ∪ {y} is frequent
4.         C = C ∪ {y}
5. return C
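The backtracking search and its combine step can be sketched in Python as follows. This is a direct, unoptimized transcription run on the example database with min_sup = 3. The function names, the list-based MFI and the lexicographic item order are assumptions made for illustration.

```python
DB = [
    {"A", "B", "C"}, {"C", "D"}, {"A", "B", "C"}, {"A", "B", "C", "D"},
    {"A"}, {"B", "D"}, {"C"},
]
MIN_SUP = 3


def support(itemset):
    """Horizontal scan: count transactions containing `itemset`."""
    return sum(1 for t in DB if itemset <= t)


def fi_combine(itemset, possible):
    """Keep the items of `possible` that extend `itemset` to a frequent set."""
    return [y for y in possible if support(itemset | {y}) >= MIN_SUP]


def mfi_backtrack(itemset, combine, mfi):
    """Depth-first search over extensions of `itemset`; fills the list `mfi`."""
    for i, x in enumerate(combine):
        new_itemset = itemset | {x}
        possible = combine[i + 1:]          # items ordered after x
        # Prune: the largest set reachable from here is already covered.
        if any(new_itemset | set(possible) <= m for m in mfi):
            return
        new_combine = fi_combine(new_itemset, possible)
        if not new_combine:
            # Leaf: maximal only if no known maximal set contains it.
            if not any(new_itemset <= m for m in mfi):
                mfi.append(new_itemset)
        else:
            mfi_backtrack(new_itemset, new_combine, mfi)


F1 = [x for x in "ABCD" if support({x}) >= MIN_SUP]
MFI = []
mfi_backtrack(frozenset(), F1, MFI)
```

On the example, the search adds ABC first, prunes the AC and BC branches because ABC already covers them, and finally adds D.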


Improvement

• At each level, sort the combine-set (C) in increasing order of support

• An itemset with low support has a smaller chance of producing a large combine-set in the next level

• The sooner we prune the tree, the more work we save

• This heuristic was first used in MaxMiner
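A small sketch of the reordering step: the items of a combine-set are sorted by the support of the itemset they would produce. The helper name `order_by_support` and the tie-break on item name are assumptions, and the database is the example one.

```python
DB = [
    {"A", "B", "C"}, {"C", "D"}, {"A", "B", "C"}, {"A", "B", "C", "D"},
    {"A"}, {"B", "D"}, {"C"},
]


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in DB if itemset <= t)


def order_by_support(itemset, combine):
    """Sort combine-set items by increasing support of itemset + {y}."""
    return sorted(combine, key=lambda y: (support(itemset | {y}), y))


ordered_f1 = order_by_support(frozenset(), ["A", "B", "C", "D"])
```

For the example, D (support 3) moves to the front and C (support 5) to the back, so the least promising branch is explored first.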


Bottlenecks

1. Superset checking:

The best algorithms for superset checking give an amortized bound of O(√s log s) per operation.

That’s bad if we have many itemsets in the MFI.

2. Frequency testing:

How can we make frequency testing faster?


Optimizing Superset Checking

• A technique called “Progressive Focusing” is used to narrow down the group of potential supersets, as the recursive calls are made

• LMFI := Local MFI

• Before each recursive call, we construct the LMFI for the next call, based on the current LMFI and the new item added.


LMFI Example

FG …

FGH FGI …

FGHI FGHJ …

I_k = FG, I_{k+1} = FGH

LMFI_k = {AFGI, ABFGH, AWFG}

LMFI_{k+1} = {ABFGH}


LMFI-backtrack(I_k, C_k, LMFI_k, k)

1. for each x ∈ C_k
2.     I_{k+1} = I_k ∪ {x}
3.     P_{k+1} = {y : y ∈ C_k and y > x}
4.     if I_{k+1} ∪ P_{k+1} has a superset in LMFI_k
5.         return
6.     LMFI_{k+1} = ∅
7.     C_{k+1} = FI-combine(I_{k+1}, P_{k+1})
8.     if C_{k+1} is empty
9.         if I_{k+1} has no superset in LMFI_k
10.            LMFI_k = LMFI_k ∪ I_{k+1}
11.    else
12.        LMFI_{k+1} = {M ∈ LMFI_k : x ∈ M}
13.        LMFI-backtrack(I_{k+1}, C_{k+1}, LMFI_{k+1}, k+1)
14.    LMFI_k = LMFI_k ∪ LMFI_{k+1}
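Progressive focusing can be sketched as a variant of the backtracking search in which each recursive call only sees the maximal sets that contain the newly added item. The code below is an illustrative sketch on the example database (min_sup = 3). The names `project` and `lmfi_backtrack`, and the use of Python lists for LMFI, are assumptions.

```python
DB = [
    {"A", "B", "C"}, {"C", "D"}, {"A", "B", "C"}, {"A", "B", "C", "D"},
    {"A"}, {"B", "D"}, {"C"},
]
MIN_SUP = 3


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in DB if itemset <= t)


def fi_combine(itemset, possible):
    """Keep the items of `possible` that extend `itemset` to a frequent set."""
    return [y for y in possible if support(itemset | {y}) >= MIN_SUP]


def project(lmfi, x):
    """Local MFI for the next level: members of `lmfi` that contain x."""
    return [m for m in lmfi if x in m]


def lmfi_backtrack(itemset, combine, lmfi):
    """Backtracking search that checks supersets only against `lmfi`."""
    for i, x in enumerate(combine):
        new_itemset = itemset | {x}
        possible = combine[i + 1:]
        if any(new_itemset | set(possible) <= m for m in lmfi):
            return
        new_lmfi = []
        new_combine = fi_combine(new_itemset, possible)
        if not new_combine:
            if not any(new_itemset <= m for m in lmfi):
                lmfi.append(new_itemset)
        else:
            new_lmfi = project(lmfi, x)
            lmfi_backtrack(new_itemset, new_combine, new_lmfi)
        # Merge maximal sets discovered below back into this level.
        for m in new_lmfi:
            if m not in lmfi:
                lmfi.append(m)


MFI = []
lmfi_backtrack(frozenset(), ["A", "B", "C", "D"], MFI)

# The slide's LMFI example: adding item H keeps only ABFGH.
lmfi_k = [frozenset("AFGI"), frozenset("ABFGH"), frozenset("AWFG")]
lmfi_k1 = project(lmfi_k, "H")
```

The result on the example database is the same MFI as the plain backtracking search, {ABC, D}; the difference is that deeper calls compare against shorter lists.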


Frequency Testing Optimization

• GenMax uses a “vertical database format”:

• For each item, we have a set of all the transactions containing this item.

• This set is called a tidset (Transaction ID Set).

• This method makes support computations easier, because we don’t have to go over the entire database.


Vertical Database

Item/Tid (horizontal format):

Tid  A  B  C  D
1    x  x  x
2          x  x
3    x  x  x
4    x  x  x  x
5    x
6       x     x
7          x

Vertical format (one tidset per item):

A {1, 3, 4, 5}

B {1, 3, 4, 6}

C {1, 2, 3, 4, 7}

D {2, 4, 6}

t(A) = {1, 3, 4, 5}

t(AC) = {1, 3, 4}

supp(I) = |t(I)|
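A short sketch of the conversion from the horizontal table to tidsets, and of computing t(I) by intersection. The names `to_vertical` and `tidset` are illustrative, and `tidset` assumes a non-empty itemset.

```python
from collections import defaultdict

# Horizontal format: transaction id -> items.
DB = {
    1: {"A", "B", "C"},
    2: {"C", "D"},
    3: {"A", "B", "C"},
    4: {"A", "B", "C", "D"},
    5: {"A"},
    6: {"B", "D"},
    7: {"C"},
}


def to_vertical(db):
    """Build item -> tidset from transaction -> items."""
    tidsets = defaultdict(set)
    for tid, items in db.items():
        for item in items:
            tidsets[item].add(tid)
    return dict(tidsets)


def tidset(itemset, vertical):
    """t(I): intersection of the tidsets of the items in a non-empty I."""
    return set.intersection(*(vertical[i] for i in itemset))


VERTICAL = to_vertical(DB)
t_AC = tidset({"A", "C"}, VERTICAL)
supp_AC = len(t_AC)  # supp(I) = |t(I)|
```

With the example data this gives t(AC) = {1, 3, 4} and supp(AC) = 3, without scanning the transactions again.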


AB …

ABC ABD ABE

I_k = AB, C_k = {C, E}; the items C and E store t(ABC) and t(ABE)

FI-tidset-combine(I_{k+1}, P_{k+1})

1. C = ∅
2. for each y ∈ P_{k+1}
3.     y' = y
4.     t(y') = t(I_{k+1}) ∩ t(y)
5.     if |t(y')| ≥ min_sup
6.         C = C ∪ {y'}
7. return C

Each item y in the combine-set C_k actually represents the itemset I_k ∪ {y}, and stores the tidset associated with it.
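The tidset-based combine step can be sketched as below: each candidate item carries a tidset, and frequency testing becomes one set intersection per candidate. The `(item, tidset)` pair representation and the function name are assumptions. The data is the example vertical database with min_sup = 3.

```python
VERTICAL = {
    "A": {1, 3, 4, 5},
    "B": {1, 3, 4, 6},
    "C": {1, 2, 3, 4, 7},
    "D": {2, 4, 6},
}
MIN_SUP = 3


def fi_tidset_combine(itemset_tids, possible):
    """Return the (item, tidset) pairs from `possible` that stay frequent.

    `itemset_tids` is t(I) for the current itemset I; each returned
    pair (y, tids) stores tids = t(I) & t(y), the tidset of I + {y}.
    """
    combine = []
    for y, y_tids in possible:
        new_tids = itemset_tids & y_tids
        if len(new_tids) >= MIN_SUP:
            combine.append((y, new_tids))
    return combine


# Extend I = {A} with the items ordered after A.
possible_A = [(y, VERTICAL[y]) for y in "BCD"]
combine_A = fi_tidset_combine(VERTICAL["A"], possible_A)

# Extend I = {A, B}: t(AB) and the remaining candidates come from combine_A.
t_AB = combine_A[0][1]
combine_AB = fi_tidset_combine(t_AB, combine_A[1:])
```

Because tidsets only shrink as items are added, intersections deeper in the search work on ever smaller sets.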


Additional Optimization

• Diffsets: don’t store the entire tidsets, only the differences between tidsets (described in “Fast Vertical Mining Using Diffsets”)
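A minimal illustration of the diffset idea on the example database: instead of storing t(Ay), store the tids that are in t(A) but not in t(y), and recover the support by subtraction. The helper name `diffset` is illustrative, and only the first level (differences taken from plain tidsets) is shown.

```python
VERTICAL = {
    "A": {1, 3, 4, 5},
    "B": {1, 3, 4, 6},
    "C": {1, 2, 3, 4, 7},
    "D": {2, 4, 6},
}


def diffset(prefix_tids, item_tids):
    """Tids that contain the prefix but not the added item."""
    return prefix_tids - item_tids


supp_A = len(VERTICAL["A"])

d_AC = diffset(VERTICAL["A"], VERTICAL["C"])   # tids of A that lack C
supp_AC = supp_A - len(d_AC)                   # support without storing t(AC)

d_AD = diffset(VERTICAL["A"], VERTICAL["D"])
supp_AD = supp_A - len(d_AD)
```

Here d(AC) = {5} holds one tid, whereas t(AC) = {1, 3, 4} holds three, which is the kind of saving diffsets aim for when tidsets are long and dense.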


Experimental Results

• GenMax is compared with: MaxMiner, MAFIA, MAFIA-PP

• MaxMiner & MAFIA-PP give the exact MFI, while MAFIA gives a superset of the MFI

• The databases used in the experiments are grouped according to the MFI length distribution


Type I Datasets


Type II Datasets


Type III Datasets


Type IV Datasets

