Sampling Large Databases for Association Rules (Toivonen’s Approach, 1996) Farzaneh Mirzazadeh Fall 2007

Outline

• Introduction
• Preliminaries
  – Definitions and Problem Statement
  – Two General Approaches
• Sampling Method for Mining Association Rules
  – The Algorithm
  – Analysis
• Experimental Results

Introduction

• Problem: Discovery of Association Rules
• Domain: Very Large Databases
• Bottleneck: Time
  o Main Memory Processes: Negligible
  o Disk I/O: An Influential Factor
• Suggestion: Minimize the Number of Scans of the Database
  o Only One Full Pass Over the Database

Introduction (Cont’d): Overview of Toivonen’s Method

Main Steps:

• Pick a random sample from the database.
• Use the sample to determine all probable association rules.
• Verify the results with the rest of the database, i.e., eliminate incorrectly detected association rules and add missing association rules.

The Main Contribution: To show that all exact frequencies can be found efficiently by first analyzing a random sample and then the whole database with the proposed method.

Preliminaries

• Items
  – $I = \{I_1, I_2, \ldots, I_m\}$
• Transactions
  – $r = \{t_1, t_2, \ldots, t_n\}$, $t_j \subseteq I$
• Support of an itemset
  – Percentage of transactions that contain that itemset.
• Frequent Itemsets
  – $F(r, min\_fr) = \{X \subseteq I \mid fr(X, r) \ge min\_fr\}$
• Association Rules
• Strong Association Rules
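The definitions above can be made concrete with a small sketch. It is only an illustration under assumptions: the names `frequency` and `frequent_sets` and the toy rows are hypothetical, the empty itemset is left out, and the brute-force enumeration of all subsets is practical only for a handful of items.

```python
from itertools import combinations

def frequency(itemset, rows):
    """fr(X, r): fraction of rows in r that contain every item of X."""
    itemset = frozenset(itemset)
    return sum(1 for t in rows if itemset <= t) / len(rows)

def frequent_sets(items, rows, min_fr):
    """F(r, min_fr) by brute force, restricted to non-empty itemsets."""
    items = sorted(items)
    return {
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if frequency(c, rows) >= min_fr
    }

# Hypothetical toy database of four transactions over items A..D.
rows = [frozenset("ABC"), frozenset("AC"), frozenset("AD"), frozenset("BC")]
print(frequency("AC", rows))  # 0.5
print(sorted("".join(sorted(x)) for x in frequent_sets("ABCD", rows, 0.5)))
# ['A', 'AC', 'B', 'BC', 'C']
```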

Preliminaries

• Association Rule: an implication $X \Rightarrow Y$, where $X, Y \subseteq I$ and $X \cap Y = \emptyset$.
• Support of Association Rule $X \Rightarrow Y$: Percentage of transactions that contain $X \cup Y$.
• Confidence of Association Rule $X \Rightarrow Y$: Ratio of the number of transactions that contain $X \cup Y$ to the number that contain $X$.
• Problem: Find all strong association rules over a given item set $I$ with respect to the frequency threshold min_fr and the confidence threshold min_conf.
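As a hedged illustration of support and confidence, the sketch below evaluates a single rule on a toy database. The function names and the data are hypothetical, and "strong" is taken here to mean that the rule meets both min_fr and min_conf.

```python
def support(itemset, rows):
    """Fraction of rows that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in rows if itemset <= t) / len(rows)

def confidence(x, y, rows):
    """support(X u Y) / support(X); defined as 0.0 if X never occurs."""
    x, y = frozenset(x), frozenset(y)
    if x & y:
        raise ValueError("X and Y must be disjoint")
    sx = support(x, rows)
    return support(x | y, rows) / sx if sx > 0 else 0.0

def is_strong(x, y, rows, min_fr, min_conf):
    """True if the rule X => Y meets both thresholds (assumed meaning of 'strong')."""
    x, y = frozenset(x), frozenset(y)
    return support(x | y, rows) >= min_fr and confidence(x, y, rows) >= min_conf

# Hypothetical toy database of four transactions.
rows = [frozenset("ABC"), frozenset("AC"), frozenset("AD"), frozenset("BC")]
print(support("AC", rows))                  # 0.5
print(is_strong("A", "C", rows, 0.5, 0.6))  # True
```

Here A occurs in three of the four rows and A together with C in two, so the confidence of A => C is 2/3.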

Algorithms for Mining Association Rules

• Level-wise Algorithms
  – Idea: If a set is not frequent, then its supersets cannot be frequent.
  – On level k, candidate itemsets X of size k are generated such that all proper subsets of X are frequent.
• Partition Algorithm
  – Idea: Partition the data into sections small enough to be handled in main memory.
  – First Pass: Find the locally frequent itemsets.
  – Second Pass: Take the union of the locally frequent itemsets as candidates and count them over the whole database.
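The level-wise idea can be sketched as follows. This is a minimal in-memory illustration, not an optimized implementation: the name `levelwise` and the toy rows are hypothetical, and every candidate is counted with a separate scan of the list for simplicity.

```python
def levelwise(rows, min_fr):
    """Return all non-empty frequent itemsets, one level (itemset size) at a time."""
    rows = [frozenset(t) for t in rows]
    n = len(rows)

    def fr(x):
        return sum(1 for t in rows if x <= t) / n

    items = sorted(set().union(*rows))
    level = {frozenset([i]) for i in items if fr(frozenset([i])) >= min_fr}
    frequent = set(level)
    k = 2
    while level:
        # Join: combine frequent (k-1)-sets into candidate k-sets.
        cands = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: keep a candidate only if all of its (k-1)-subsets are frequent.
        cands = {c for c in cands if all(c - {i} in level for i in c)}
        level = {c for c in cands if fr(c) >= min_fr}
        frequent |= level
        k += 1
    return frequent

# Hypothetical toy database of four transactions.
rows = [frozenset("ABC"), frozenset("AC"), frozenset("AD"), frozenset("BC")]
print(sorted("".join(sorted(x)) for x in levelwise(rows, 0.5)))
# ['A', 'AC', 'B', 'BC', 'C']
```

On this data {A, B, C} is never counted: its subset {A, B} is not frequent, so the prune step discards it.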

Sampling for Frequent Sets

• Major Steps
  o Random sampling
  o Finding the frequent itemsets of the sample
  o Finding other probable candidates using the concept of the Negative Border
  o Using the rest of the database to check the candidates

Negative Border

• All sets that are not in our collection of frequent itemsets, but all of whose proper subsets are.
• $Bd^-(S)$: the minimal itemsets not in $S$, where $S$ is the collection of frequent itemsets.
• Example:
  – $S = \{\{A\}, \{B\}, \{C\}, \{F\}, \{A,B\}, \{A,C\}, \{A,F\}, \{C,F\}, \{A,C,F\}\}$
  – $Bd^-(S) = \{\{B,C\}, \{B,F\}, \{D\}, \{E\}\}$
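The example can be checked mechanically. The sketch below is a hedged illustration that assumes the items are A through F and that S is closed under subsets, so testing the subsets obtained by dropping one item is enough; the name `negative_border` is hypothetical.

```python
from itertools import combinations

def negative_border(frequent, items):
    """Minimal itemsets over `items` that are not in `frequent`.

    Assumes `frequent` is closed under subsets (as a collection of
    frequent itemsets is), so only immediate subsets need checking.
    """
    closed = {frozenset(x) for x in frequent}
    closed.add(frozenset())  # the empty set counts as frequent
    max_size = max(len(x) for x in closed) + 1
    border = set()
    for k in range(1, max_size + 1):
        for cand in map(frozenset, combinations(sorted(items), k)):
            if cand not in closed and all(cand - {i} in closed for i in cand):
                border.add(cand)
    return border

S = ["A", "B", "C", "F", "AB", "AC", "AF", "CF", "ACF"]
print(sorted("".join(sorted(x)) for x in negative_border(S, "ABCDEF")))
# ['BC', 'BF', 'D', 'E']
```

{D} and {E} are in the border because their only proper subset is the empty set; {A, B, C} is not, because its subset {B, C} is missing from S.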

Frequent Set Discovery

• Intuition: Given a collection S of sets that are frequent, the negative border contains the closest itemsets that could be frequent too.
• After finding the collection of frequent itemsets, S, we check the negative border of S:
  o Success: If no frequent itemsets are found in the negative border => we can conclude that all frequent sets have already been found. (Why?)
  o Failure: If at least one frequent itemset is found in the negative border => we can conclude that some of its supersets may be frequent. (Why?)
• Decrease the minimum support used on the sample to increase the chance of success.
• In the case of failure, we can either report failure and stop, or scan the database again and check the supersets to find the exact result.

Toivonen’s Algorithm
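The steps from the preceding slides can be sketched in a few lines. This is a minimal in-memory illustration under assumptions, not the pseudocode from [1]: all function names and the `low_fr` parameter (the lowered threshold used on the sample) are hypothetical, the sample is drawn without replacement, and the "database" is a Python list rather than a file on disk.

```python
import random
from itertools import combinations

def fr(itemset, rows):
    """Fraction of rows that contain `itemset`."""
    return sum(1 for t in rows if itemset <= t) / len(rows)

def frequent_in(rows, items, threshold):
    """Level-wise search for non-empty itemsets with frequency >= threshold."""
    level = {frozenset([i]) for i in items if fr(frozenset([i]), rows) >= threshold}
    found = set()
    while level:
        found |= level
        k = len(next(iter(level))) + 1
        cands = {a | b for a in level for b in level if len(a | b) == k}
        level = {c for c in cands
                 if all(c - {i} in found for i in c) and fr(c, rows) >= threshold}
    return found

def negative_border(S, items):
    """Minimal itemsets not in S (S is assumed closed under subsets)."""
    closed = set(S) | {frozenset()}
    top = max(len(x) for x in closed) + 1
    return {c for k in range(1, top + 1)
            for c in map(frozenset, combinations(items, k))
            if c not in closed and all(c - {i} in closed for i in c)}

def sample_then_verify(rows, min_fr, low_fr, sample_size, seed=0):
    """Return (frequent itemsets among the candidates, possible-failure flag)."""
    rows = [frozenset(t) for t in rows]
    items = sorted(set().union(*rows))
    sample = random.Random(seed).sample(rows, sample_size)
    S = frequent_in(sample, items, low_fr)       # frequent in the sample
    candidates = S | negative_border(S, items)   # plus the negative border
    counts = dict.fromkeys(candidates, 0)
    for t in rows:                               # the single full pass
        for x in candidates:
            if x <= t:
                counts[x] += 1
    frequent = {x for x, c in counts.items() if c / len(rows) >= min_fr}
    failure = any(x not in S for x in frequent)  # a border set is frequent
    return frequent, failure

rows = [frozenset("ABC"), frozenset("AC"), frozenset("AD"), frozenset("BC")]
frequent, failure = sample_then_verify(rows, min_fr=0.5, low_fr=0.4, sample_size=2)
print(len(frequent), failure)
```

When `failure` is False, the returned collection is exactly the set of frequent itemsets of the whole database; when it is True, some supersets of the border sets may have been missed and a further pass is needed.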

Failure Handling

• In the fraction of cases where a possible failure is reported, all frequent sets can be found by making a second pass over the database.
• The algorithm simply computes the collection of all sets that could possibly be frequent.

Analysis of Sampling

• Sample Size and Probability of Failure
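The slide's own content on this topic did not survive, so the following is only a hedged illustration rather than the analysis from [1]. Assuming Hoeffding's inequality for a sample s drawn with replacement, P(|fr(X, s) - fr(X, r)| > eps) <= 2 exp(-2 eps^2 |s|), a sample of size |s| >= ln(2 / delta) / (2 eps^2) keeps the probability that the frequency of a single itemset X is off by more than eps at or below delta. The name `sample_size` is hypothetical.

```python
import math

def sample_size(eps, delta):
    """Smallest |s| with 2 * exp(-2 * eps**2 * |s|) <= delta (Hoeffding bound)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(sample_size(0.01, 0.01))  # 26492
print(sample_size(0.1, 0.05))   # 185
```

The bound does not depend on the size of the database, only on the tolerated error eps and the failure probability delta for one itemset.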

Experimental Results

Conclusion

• Advantages: Reduced failure probability, while keeping the candidate count low enough for memory

• Disadvantages: Potentially large number of candidates in the second pass

References

[1] H. Toivonen, Sampling Large Databases for Association Rules, Proc. of VLDB Conference, India, 1996.

Questions

?

Thank you