32
1 Verifying and Mining Frequent Patterns from Large Windows ICDE2008 Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo Date: 2008/9/25 Speaker: Li, HueiJyun Advisor: Dr. Koh, JiaLing

Verifying and Mining Frequent Patterns from Large Windows

  • Upload
    randi

  • View
    36

  • Download
    3

Embed Size (px)

DESCRIPTION

Verifying and Mining Frequent Patterns from Large Windows. ICDE2008 Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo Date: 2008/9/25 Speaker: Li, HueiJyun Advisor: Dr. Koh, JiaLing. Outline. Introduction Mining Large Sliding Windows Verification Double-Tree Verifier (DTV) - PowerPoint PPT Presentation

Citation preview

Page 1: Verifying and Mining Frequent Patterns from Large Windows

1

Verifying and Mining Frequent Patterns from

Large Windows

ICDE2008Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo

Date: 2008/9/25Speaker: Li, HueiJyun

Advisor: Dr. Koh, JiaLing

Page 2: Verifying and Mining Frequent Patterns from Large Windows

2

Outline

Introduction Mining Large Sliding Windows Verification

Double-Tree Verifier (DTV) Depth-First Verifier (DFV) Hybrid Version of Our Verifiers

Experiments Conclusion

Page 3: Verifying and Mining Frequent Patterns from Large Windows

3

Introduction

Normally, finding new rules requires both machines and domain experts

Delays by the mining algorithms in detecting new frequent itemsets are acceptable

Page 4: Verifying and Mining Frequent Patterns from Large Windows

4

Introduction

Propose an algorithm for incremental mining of frequent itemsets that compares favorably with existing algorithms when real-time response is required

The performance of the proposed algorithm improves when small delays are acceptable

Page 5: Verifying and Mining Frequent Patterns from Large Windows

5

Introduction

The on-line verification of old rules is highly desirable in most application scenarios

Propose verifiers for verifying the frequency of previously frequent itemsets over new arriving windows

Page 6: Verifying and Mining Frequent Patterns from Large Windows

6

Mining Large Sliding Windows

Problem Statement and Notations D : the dataset to be mined (a window), contains

several transactions Count(p, D) : frequency of an itemset p sup(p, D) : the support of p Minimum support threshold α σα(D) : the set of frequent itemsets in D n = |W| / |S| : the number of slides in each

window

Page 7: Verifying and Mining Frequent Patterns from Large Windows

7

Mining Large Sliding Windows * The SWIM Algorithm

Sliding Window Incremental Miner (SWIM) always maintains a union of the frequent patterns of all slides in the current window W, called Pattern Tree (PT), which is guaranteed to be a superset of the frequent patterns over W

Page 8: Verifying and Mining Frequent Patterns from Large Windows

Mining Large Sliding Windows * The SWIM Algorithm

Mine the new slide and add its frequent patterns to PT

Uses an auxiliary array, aux_array, stores frequency of a pattern for each window, for which the frequency is not known This counting can either be done eagerly (i.e.,

immediately) or lazily

8

Page 9: Verifying and Mining Frequent Patterns from Large Windows

Mining Large Sliding Windows * The SWIM Algorithm

At the end of each slides, SWIM outputs all patterns in PT whose frequency at that time is ≥α•n•|S|

We may miss a few patterns due to lack of knowledge at the time of output, but we will report them as delayed when other slides expires

Page 10: Verifying and Mining Frequent Patterns from Large Windows

Mining Large Sliding Windows * The SWIM Algorithm

Example 1: assume that our input stream is partitioned into slides S1, S2, … and we have 3 slides in each window

Consider a pattern p which shows up as frequent in S4 for the first time

p.fi: the frequency of p in the ith slide p.freq: p’s cumulative frequency in the current

window

Page 11: Verifying and Mining Frequent Patterns from Large Windows

Mining Large Sliding Windows * The SWIM Algorithm

W4={S2, S3, S4} p.freq=p.f4

p.aux_array=<p.f4, p.f4> W5={S3, S4, S5}

p.freq=p.f4+p.f5

p.aux_array=<p.f2+p.f4, p.f4+p.f5> W6={S4, S5, S6}

p.freq=p.f4+p.f5+p.f6

p.aux_array=<p.f2+p.f3+p.f4, p.f4+p.f5+p.f6>

Page 12: Verifying and Mining Frequent Patterns from Large Windows

Mining Large Sliding Windows * The SWIM Algorithm

12

Page 13: Verifying and Mining Frequent Patterns from Large Windows

Mining Large Sliding Windows * SWIM with Adjusted Delay Bound

SWIM can be easily modified to only allow a given maximum delay of L slides (0 ≤ L ≤ n-1) Choosing L = 0 guarantees that all frequent

patterns are reported immediately once they become frequent in W

Choosing L = n-1 leads to the laziest approach, we wait until a slide expires and then compute the frequency of new patterns and update aux_arrays accordingly

Page 14: Verifying and Mining Frequent Patterns from Large Windows

Verification

Definition 1: D: be a transactional database P: be a given set of arbitrary patterns min_freq: a given minimum frequency A function f is called verifier if it takes D, P and

min_freq as input and for each p ϵ P returns one of the following:1. p’s true frequency in D if it has occurred at least

min_freq times or otherwise

2. Reports that it has occurred less than min_freq times

Page 15: Verifying and Mining Frequent Patterns from Large Windows

Verification

In the special case of min_freq = 0, a verifier simply counts the frequency of all p ϵ P

In general if min_freq > 0, the verifier can skip any pattern whose frequency will be less than min_freq

Verification simply verifies counts for a given set of patterns, i.e., verification does not discover additional patterns

15

Page 16: Verifying and Mining Frequent Patterns from Large Windows

Verification * Background

16

Page 17: Verifying and Mining Frequent Patterns from Large Windows

Verification * Background

Use the following notation to describe the verifiers: u.item: the item represented by node u u.freq: the frequency of u, or NIL when unknown u.children: the set of u’s children head(c): the set of all nodes holding item c

17

Page 18: Verifying and Mining Frequent Patterns from Large Windows

Verification * Background

Use another data structure called pattern tree A fp-tree Each node represents a unique pattern

A verifier algorithm computes the frequency of all patterns in a given pattern tree Initially the frequency of each node in the pattern

tree is 0, but after verification it would contain the true frequency of the pattern it represents

18

Page 19: Verifying and Mining Frequent Patterns from Large Windows

Verification * Double Tree Verifier (DTV)

19

Page 20: Verifying and Mining Frequent Patterns from Large Windows

Verification * Double Tree Verifier (DTV)

Theoretically, the running time of DTV could be exponential in the worst case due to exponential nature of frequent pattern mining

In practice, number of frequent patterns is small when min_freq is not very small

20

Page 21: Verifying and Mining Frequent Patterns from Large Windows

Verification * Double Tree Verifier (DTV)

The number of different paths in the execution tree of DTV is bounded by the total number of different patterns in the pattern tree

The advantage of DTV increases when the minimum support decreases

21

Page 22: Verifying and Mining Frequent Patterns from Large Windows

Verification * Depth-First Verifier (DFV)

DFV exploits the following optimizations: Ancestor Failure:

If a path in the fp-tree already proved to not contain a prefix of the pattern p, then we know that it does not contain p itself either

22

Page 23: Verifying and Mining Frequent Patterns from Large Windows

Verification * Depth-First Verifier (DFV)

Smaller Sibiling Equivalence: If a path in the fp-tree has already been marked to

(or not to) contain a smaller sibiling of the pattern p, then we know that it does (or does not) contain p itself too

Parent Success: If a path in the fp-tree has already been marked to

contain the parent pattern of p, then we know that it also contains p

23

Page 24: Verifying and Mining Frequent Patterns from Large Windows

24

Page 25: Verifying and Mining Frequent Patterns from Large Windows

Verification * Depth-First Verifier (DFV)

Definition 2 (smallest decisive ancestor): for a given pattern node u and an fp-tree’s

node s, smallest decisive ancestor of s is its lowest ancestor t for which either t.item < u.item or t.item ≠ NIL

25

Page 26: Verifying and Mining Frequent Patterns from Large Windows

Verification * Hybrid Version of Our Verifier

DTV is faster than DFV when there are many transactions in the fp-tree and many patterns in the pattern tree

When our tree is small, DFV is more efficient because conditionalization overhead is high

Start with DTV until the conditionalizad tree are “small enough” and after that point switch to DFV

26

Page 27: Verifying and Mining Frequent Patterns from Large Windows

Experiments

P4 machine running Linux, with 1GB of RAM Algorithm is implemented in C dataset:

IBM QUEST data generator Kosarak real-world dataset

27

Page 28: Verifying and Mining Frequent Patterns from Large Windows

Experiments * Efficiency of Verification Algorithms

28

Page 29: Verifying and Mining Frequent Patterns from Large Windows

Experiments * Efficiency of Verification Algorithms

* T20I5D50K

29

Page 30: Verifying and Mining Frequent Patterns from Large Windows

Experiments * Efficiency of SWIM Algorithms

T20I5D1000K support = 1%

30

Page 31: Verifying and Mining Frequent Patterns from Large Windows

Experiments * Efficiency of SWIM Algorithms

31

Page 32: Verifying and Mining Frequent Patterns from Large Windows

Conclusion

The introduction of a very fast algorithm to verify the frequency of a given set of patterns

Further improves the performance by simply allowing a small reporting delay

32