IncSpan: Incremental Mining of Sequential Patterns in Large Databases

Hong Cheng, Xifeng Yan, Jiawei Han

University of Illinois at Urbana-Champaign

Sequence Database Is Growing!

Sequential pattern mining is an important problem with broad applications:
  Customer shopping sequences
  Medical treatment sequences
  Web log mining

Many real-life sequence databases grow incrementally:
  A customer continues shopping
  A patient has new treatment records
  A web log grows with subsequent visits

Incremental Mining Is Challenging

It is undesirable to mine from scratch each time a small fraction of the sequences grows.

It is nontrivial to mine sequential patterns incrementally because:
  Database growth brings in new patterns
  Growing subsequences interact with the original ones

IncSpan: major new techniques
  Buffering semi-frequent patterns
  Reverse pattern matching

Major Challenge: Appending to Existing Sequences

Two kinds of sequence database growth:
  Insert new sequences
  Append new transactions to existing sequences (more challenging; our focus)

Example: Minimum Support = 10%

Semi-Frequent: A Buffer In Between

Given min_sup and μ ≤ 1, a sequence a is:
  frequent if sup(a) ≥ min_sup
  semi-frequent if μ·min_sup ≤ sup(a) < min_sup
  infrequent if sup(a) < μ·min_sup
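The three-way definition above can be written as a small function. This is a minimal sketch, not the authors' code: the names `Status` and `classify` are hypothetical, and it assumes 0 < μ ≤ 1 (the slide only states μ ≤ 1) and that supports and min_sup are given in the same unit.

```python
from enum import Enum


class Status(Enum):
    FREQUENT = "frequent"
    SEMI_FREQUENT = "semi-frequent"
    INFREQUENT = "infrequent"


def classify(support: float, min_sup: float, mu: float) -> Status:
    """Classify a pattern by its support; mu is the buffer ratio."""
    # Assumption: mu must be positive, otherwise every pattern is buffered.
    if not 0 < mu <= 1:
        raise ValueError("mu must be in (0, 1]")
    if support >= min_sup:
        return Status.FREQUENT
    if support >= mu * min_sup:
        return Status.SEMI_FREQUENT
    return Status.INFREQUENT
```

With μ = 1 the semi-frequent band is empty, so a smaller μ buys a larger buffer at the cost of more patterns to maintain.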

Incremental sequential pattern mining:
  Given a sequence database D, a min_sup threshold, the set of frequent subsequences FS in D, and an appended sequence database D’ of D
  Mine the set of frequent subsequences FS’ in D’ based on FS, instead of mining D’ from scratch
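To make the setting concrete, the sketch below counts the support of a pattern before and after transactions are appended. It is an illustration under assumptions, not the IncSpan implementation: a database is modelled as a dict from a sequence id to a list of itemsets, and the helper names `contains`, `support` and `append_transactions` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence

Itemset = FrozenSet[str]


def contains(seq: Sequence[Itemset], pattern: Sequence[Itemset]) -> bool:
    """True if pattern is a subsequence of seq: each pattern itemset is a
    subset of a distinct itemset of seq, in the same order."""
    matched = 0
    for itemset in seq:
        if matched < len(pattern) and pattern[matched] <= itemset:
            matched += 1
    return matched == len(pattern)


def support(db: Dict[int, List[Itemset]], pattern: Sequence[Itemset]) -> int:
    """Number of sequences in db that contain pattern."""
    return sum(1 for seq in db.values() if contains(seq, pattern))


def append_transactions(db, appended):
    """Build D' by appending new itemsets to the end of existing sequences.
    An unknown sequence id is treated as an inserted sequence."""
    grown = {sid: list(seq) for sid, seq in db.items()}
    for sid, tail in appended.items():
        grown.setdefault(sid, []).extend(tail)
    return grown


D = {1: [frozenset("a"), frozenset("b")], 2: [frozenset("a")], 3: [frozenset("b")]}
D_prime = append_transactions(D, {2: [frozenset("bc")]})
pattern = [frozenset("a"), frozenset("b")]
```

Here sequence 2 supports the pattern only after the append, which is the kind of change an incremental miner has to detect without rescanning all of D’.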

Semi-Frequent Sequence Buffering and Maintenance

Keep some additional information about the original database for incremental mining.

Buffer the semi-frequent subsequences (SFS) of the original database:
  SFS are “almost frequent”, so they are likely to become frequent in the growing database
  SFS is a boundary between frequent and infrequent sequences

Keep both FS and SFS of the original database.

Possible State Transitions After Appending

Status in D      Status in D’     Comment
Frequent         Frequent         Easy
Semi-frequent    Frequent         Easy
Semi-frequent    Semi-frequent    Easy
Does not appear  Appears          Have no information on infrequent patterns or new items
Infrequent       Frequent
Infrequent       Semi-frequent
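The “easy” transitions only need the supports of buffered patterns to be updated and re-thresholded. The sketch below shows that step under stated assumptions: patterns are encoded as tuples of single items, min_sup is an absolute count (appending does not change the number of sequences), and `update_buffers` together with its `increase` argument are hypothetical names.

```python
from typing import Dict, Tuple

Pattern = Tuple[str, ...]  # hypothetical encoding: a tuple of single items


def update_buffers(
    buffered: Dict[Pattern, int],
    increase: Dict[Pattern, int],
    min_sup: float,
    mu: float,
) -> Tuple[Dict[Pattern, int], Dict[Pattern, int]]:
    """Re-threshold the patterns kept in FS and SFS after an append.

    buffered: support in D of every pattern in FS or SFS.
    increase: support gained from the appended sequences (0 if absent).
    Returns the buffered patterns split into (frequent, semi-frequent) in D'.
    """
    fs: Dict[Pattern, int] = {}
    sfs: Dict[Pattern, int] = {}
    for pattern, old_sup in buffered.items():
        new_sup = old_sup + increase.get(pattern, 0)
        if new_sup >= min_sup:
            fs[pattern] = new_sup
        elif new_sup >= mu * min_sup:
            sfs[pattern] = new_sup
    return fs, sfs


fs, sfs = update_buffers(
    {("a",): 4, ("a", "b"): 3, ("d",): 2},
    {("a", "b"): 1},
    min_sup=4,
    mu=0.5,
)
```

Appending never removes an occurrence, so supports only grow and a buffered pattern cannot fall below its previous status. Patterns that were infrequent in D are not in `buffered` at all, which is why the last rows of the table need a separate treatment.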

Buffering Technique (I)

Handle the “infrequent-to-frequent” case.

If an infrequent pattern p’ in D becomes frequent in D’, then at least one of its prefix subsequences p is in FS.

[Figure: prefix tree rooted at <>, with nodes <a> s:4, <b> s:3, <d> s:4, <d> s:3 and <e> s:1 -> 3]

Solution: Start from its frequent prefix p and construct the p-projected database to discover p’.

Theorem (used for search space pruning): For a frequent pattern p, if its support in db satisfies the condition sup_db(p) < (1 − μ)·min_sup, then there is no sequence p’ having p as prefix that changes from infrequent in D to frequent in D’.
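The pruning test can be sketched as a one-line check. Two things here are assumptions, because the formula on the slide is garbled: the condition is read as sup_db(p) < (1 − μ)·min_sup, and db is taken to be the set of sequences that received appended items. Under that reading the argument is short: an extension p’ that is infrequent in D has sup_D(p’) < μ·min_sup, appending can add at most sup_db(p’) ≤ sup_db(p) to it, so its support in D’ stays below μ·min_sup + (1 − μ)·min_sup = min_sup. The function name is hypothetical.

```python
def may_yield_new_frequent(sup_db_p: float, min_sup: float, mu: float) -> bool:
    """Return False when the pruning condition rules out any extension of p
    moving from infrequent in D to frequent in D'.

    Assumption: sup_db_p is the support of p among the appended sequences,
    and the condition is sup_db(p) < (1 - mu) * min_sup.
    """
    return sup_db_p >= (1 - mu) * min_sup
```

When the function returns False, the p-projected database does not need to be built for the infrequent-to-frequent case.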

Buffering Technique (II)

Handle the “infrequent-to-semi-frequent” case.

If an infrequent pattern p’ in D becomes semi-frequent in D’, then at least one of its prefix subsequences p is either in FS or in SFS.

[Figure: prefix tree rooted at <>, with nodes <a> s:4, <b> s:3, <d> s:4, <d> s:3 and <e> s:1 -> 2]

Solution: Start from its frequent or semi-frequent prefix p and construct the p-projected database to discover p’.
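The projection step can be sketched as follows. This is a simplified illustration in the style of PrefixSpan, not the authors' code: it assumes every transaction holds a single item, so a sequence is a plain list of items, and the names `suffix_after`, `projected_db` and `extension_supports` are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence


def suffix_after(seq: List[str], prefix: Sequence[str]) -> Optional[List[str]]:
    """Return what follows the earliest occurrence of prefix in seq,
    or None if seq does not contain prefix."""
    pos = 0
    for item in prefix:
        try:
            pos = seq.index(item, pos) + 1
        except ValueError:
            return None
    return seq[pos:]


def projected_db(db: List[List[str]], prefix: Sequence[str]) -> List[List[str]]:
    """The p-projected database: one suffix per sequence containing p."""
    suffixes = (suffix_after(seq, prefix) for seq in db)
    return [s for s in suffixes if s is not None]


def extension_supports(db: List[List[str]], prefix: Sequence[str]) -> Counter:
    """Support of each one-item extension p' = p + [item]."""
    counts: Counter = Counter()
    for suffix in projected_db(db, prefix):
        counts.update(set(suffix))  # count each item once per sequence
    return counts


db = [list("adbe"), list("abde"), list("de"), list("ade")]
supports = extension_supports(db, ["a", "d"])
```

Each extension whose count reaches min_sup, or μ·min_sup for the semi-frequent case, becomes a new pattern p’ and can be extended again in the same way.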

Reverse Pattern Matching

An optimization technique: match a pattern against a sequence from the end towards the front.
  Since the itemsets are appended at the end, reverse matching can save some computation
  If the last item of pattern p does not appear in Sa, then appending Sa to S will not increase sup(p)
  So, just scan Sa for the last item in p and prune the search if the above condition holds
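The check on the last item and the backward match can be sketched like this. It is a minimal illustration under assumptions: sequences and patterns are non-empty lists of single items rather than itemsets, and the names `reverse_contains` and `gains_support` are hypothetical.

```python
from typing import List


def reverse_contains(seq: List[str], pattern: List[str]) -> bool:
    """Match pattern against seq from the end towards the front."""
    j = len(pattern) - 1
    for item in reversed(seq):
        if j >= 0 and item == pattern[j]:
            j -= 1
    return j < 0


def gains_support(s: List[str], sa: List[str], pattern: List[str]) -> bool:
    """True if appending sa to s makes the sequence newly contain pattern."""
    # Pruning rule: a new occurrence must end inside sa, so the last item
    # of the pattern has to appear there.
    if pattern[-1] not in sa:
        return False
    # The sequence was already counted in D, so sup(p) does not change.
    if reverse_contains(s, pattern):
        return False
    return reverse_contains(s + sa, pattern)
```

The first test is a cheap scan of Sa only, so most appended sequences can be dismissed for a given pattern without touching the original part S.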

Performance Study

Compare with:
  The ISM algorithm [Parthasarathy, Zaki, Ogihara and Dwarkadas, CIKM’99]
  PrefixSpan, the mining-from-scratch approach, to see how much we can save

Compare CPU time and memory usage.

[Figure 1. Memory usage under varied min_sup]

Performance Study (II)

[Figure 2. Varying min_sup]
[Figure 3. Varying percentage of updated sequences]

Discussion and Conclusion

Buffering semi-frequent patterns is effective:
  The user can control the size of SFS through μ
  Patterns in SFS are within (1 − μ)·min_sup of being frequent, so they are likely to become frequent with database growth

When only a small portion (5%) of the database is appended, IncSpan is more efficient than mining from scratch.

IncSpan can be easily extended to handle inserting sequences into or deleting sequences from the database.

Handling incremental mining in stream data? No: it still needs more than one scan of the database.
