Frequent Itemset Mining Methods

Slide 2
The Apriori algorithm: finding frequent itemsets using candidate generation.
• Seminal algorithm proposed by R. Agrawal and R. Srikant in 1994.
• Uses an iterative approach known as a level-wise search, where frequent k-itemsets are used to explore (k+1)-itemsets.
• Apriori property, used to reduce the search space: all nonempty subsets of a frequent itemset must also be frequent. That is, if P(I) < min_sup then I is not frequent, and for any item A, P(I ∪ A) < min_sup as well, so I ∪ A is not frequent either.
• This is an antimonotone property: if a set cannot pass a test, all of its supersets will fail the same test as well.

Slide 3
Using the Apriori property in the algorithm: let us look at how Lk−1 is used to find Lk, for k >= 2. Two steps:
• Join: to find Lk, a set of candidate k-itemsets Ck is generated by joining Lk−1 with itself. The items within a transaction or itemset are sorted in lexicographic order; two (k−1)-itemsets li and lj are joined if li[1] = lj[1], ..., li[k−2] = lj[k−2], and li[k−1] < lj[k−1].
• Prune: by the Apriori property, any candidate in Ck that has a (k−1)-subset not in Lk−1 cannot be frequent and is removed.

Generating association rules from frequent itemsets:
• confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A)
• support_count(A ∪ B): the number of transactions containing the itemset A ∪ B
• support_count(A): the number of transactions containing the itemset A

Slide 8
• For every nonempty subset s of l, output the rule s => (l − s) if support_count(l) / support_count(s) >= min_conf.
• Example: let l = {I1, I2, I5}. The nonempty proper subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}. Generating association rules:
  I1 ∧ I2 => I5, conf = 2/4 = 50%
  I1 ∧ I5 => I2, conf = 2/2 = 100%
  I2 ∧ I5 => I1, conf = 2/2 = 100%
  I1 => I2 ∧ I5, conf = 2/6 = 33%
  I2 => I1 ∧ I5, conf = 2/7 = 29%
  I5 => I1 ∧ I2, conf = 2/2 = 100%
• If min_conf is 70%, then only the second, third, and last rules above are output.

Slide 9
Improving the efficiency of Apriori:
• Transaction reduction: a transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets, so it can be skipped in later scans.
• Hash-based technique: to reduce the size of the candidate k-itemsets Ck, for k > 1, generate all of the 2-itemsets in each transaction and hash them into the buckets of a hash table, e.g. H(x, y) = ((order of x) × 10 + (order of y)) mod 7.
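The hash-based technique above can be sketched in a few lines of Python. This is illustrative only: the three sample transactions are invented, and mapping an item such as "I3" to its order 3 is an assumed encoding; only the bucket function H(x, y) and the table size 7 come from the slide.

```python
from itertools import combinations

# hypothetical sample transactions (not from the slides)
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
]

def order(item):
    return int(item[1:])          # "I3" -> 3 (assumed encoding)

def H(x, y):
    # the slide's bucket function: ((order of x)*10 + (order of y)) mod 7
    return (order(x) * 10 + order(y)) % 7

buckets = [0] * 7                 # per-bucket counts
for t in transactions:
    for x, y in combinations(sorted(t), 2):
        buckets[H(x, y)] += 1

# A 2-itemset hashed to a bucket whose total count is below min_sup
# cannot be frequent, so it can be removed from the candidate set C2.
```

Because several different 2-itemsets can collide in one bucket, a low bucket count proves infrequency, while a high count only keeps the candidate alive for the real support count in the next scan.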
• Partitioning: partition the data to find candidate itemsets locally within each partition.
• Sampling: mine a subset S of the given data instead of D, searching for frequent itemsets in S with a lowered support threshold.
• Dynamic itemset counting: add candidate itemsets at different points during a scan.

Slide 10
Mining frequent itemsets without candidate generation.
• The candidate generate-and-test method reduces the size of candidate sets and achieves good performance, but it may need to generate a huge number of candidate sets and to repeatedly scan the database, checking a large set of candidates by pattern matching.
• The frequent-pattern growth method (FP-growth) avoids candidate generation by using a frequent-pattern tree (FP-tree).

Slide 11
Example.

Slide 12
• I5 occurs in the paths (I2, I1, I5: 1) and (I2, I1, I3, I5: 1). With I5 as the suffix, the two prefix paths are (I2, I1: 1) and (I2, I1, I3: 1).
• I5's conditional FP-tree contains (I2: 2, I1: 2); I3 is removed because its support count of 1 is below the minimum support count of 2.
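The conditional-pattern-base step for the suffix I5 can be sketched as follows. This is a minimal illustrative sketch, assuming min_sup = 2 (consistent with the counts in the example); the prefix paths are the two quoted above.

```python
from collections import Counter

MIN_SUP = 2  # assumed minimum support count

# I5's conditional pattern base: prefix paths with the count each carries
conditional_pattern_base = [
    (["I2", "I1"], 1),
    (["I2", "I1", "I3"], 1),
]

# accumulate each item's support within the conditional pattern base
counts = Counter()
for path, n in conditional_pattern_base:
    for item in path:
        counts[item] += n

# keep only items meeting min_sup; I3 (count 1) is dropped
kept = {item: n for item, n in counts.items() if n >= MIN_SUP}
print(kept)  # {'I2': 2, 'I1': 2}
```

Appending the suffix I5 to combinations of the surviving items then yields the frequent patterns {I2, I5: 2}, {I1, I5: 2}, and {I2, I1, I5: 2}.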