IncSpan: Incremental Mining of Sequential Patterns in Large Databases
Hong Cheng, Xifeng Yan, Jiawei Han
University of Illinois at Urbana-Champaign
Sequence Database Is Growing!
Sequential pattern mining is an important problem with broad applications: customer shopping sequences, medical treatment sequences, and Web log mining.
Many real-life sequence databases grow incrementally: a customer continues shopping, a patient has new treatment records, and a Web log grows with subsequent visits.
Incremental Mining Is Challenging
It is undesirable to mine from scratch each time a small fraction of the sequences grows.
It is nontrivial to mine sequential patterns incrementally, because database growth brings in new patterns and growing subsequences interact with the original ones.
IncSpan introduces two major new techniques: buffering semi-frequent patterns and reverse pattern matching.
Major Challenge: Appending to Existing Sequences
There are two kinds of sequence database growth: inserting new sequences, and appending new transactions to existing sequences. The second is more challenging and is our focus.
Example: Minimum Support = 10%
Semi-Frequent: A Buffer In Between
Given min_sup and μ ≤ 1, a sequence a is
frequent if sup(a) ≥ min_sup,
semi-frequent if μ·min_sup ≤ sup(a) < min_sup,
infrequent if sup(a) < μ·min_sup.
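The three-way split above can be sketched as a small function. This is an illustrative sketch only: the name `classify` is hypothetical, supports are assumed to be absolute counts, and the lower bound 0 ≤ μ is an assumption (the slide states only μ ≤ 1).

```python
def classify(support: int, min_sup: int, mu: float) -> str:
    """Label a pattern as frequent, semi-frequent, or infrequent.

    Assumes `support` and `min_sup` are absolute counts and 0 <= mu <= 1
    (the lower bound on mu is an assumption, not stated in the slides).
    """
    if not 0 <= mu <= 1:
        raise ValueError("mu must lie in [0, 1]")
    if support >= min_sup:
        return "frequent"
    if support >= mu * min_sup:
        # Inside the buffer band [mu * min_sup, min_sup).
        return "semi-frequent"
    return "infrequent"
```

With min_sup = 10 and μ = 0.5, a pattern with support 5 lands in the buffer, while one with support 4 is dropped as infrequent.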
Incremental sequential pattern mining
Given a sequence database D, a min_sup threshold, the set of frequent subsequences FS in D, and an appended sequence database D’ of D, mine the set of frequent subsequences FS’ in D’ based on FS instead of mining D’ from scratch.
Semi-Frequent Sequence Buffering and Maintenance
Keep some additional information about the original database for incremental mining.
Buffer the semi-frequent subsequences (SFS) of the original database. SFS are “almost frequent”, so they are likely to become frequent in the growing database. SFS is a boundary between frequent and infrequent sequences.
Keep both FS and SFS of the original database.
Possible State Transitions After Appending
Status in D      Status in D’     Comment
Frequent         Frequent         Easy
Semi-frequent    Frequent         Easy
Semi-frequent    Semi-frequent    Easy
Not appear       Appear           Have no information of infrequent pattern or new items
Infrequent       Frequent
Infrequent       Semi-frequent
Buffering Technique (I)
Handle the “infrequent-to-frequent” case.
If an infrequent pattern p’ in D becomes frequent in D’, then at least one of its prefix subsequences p is in FS
[Figure: pattern tree with root <>; nodes shown include <a> s:4, <b> s:3, <d> s:4, and <d> s:3; the support of node <e> changes from 1 to 3.]
Solution: Start from its frequent prefix p and construct p-projected database to discover p’
Theorem (used for search space pruning): For a frequent pattern p, if its support in db satisfies the condition sup_db(p) < (1 − μ)·min_sup, then there is no sequence p’ having p as prefix that changes from infrequent in D to frequent in D’.
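A short derivation shows why a pruning condition of the form sup_db(p) < (1 − μ)·min_sup is enough. It is a sketch under stated assumptions: db is taken to be the set of sequences in D’ that received appended items, supports are absolute counts, and p’ is any pattern with prefix p that is infrequent in D, so that sup_D(p’) < μ·min_sup.

```latex
\begin{align*}
\mathrm{sup}_{D'}(p')
  &\le \mathrm{sup}_{D}(p') + \mathrm{sup}_{db}(p')
     && \text{new occurrences can arise only in appended sequences} \\
  &\le \mathrm{sup}_{D}(p') + \mathrm{sup}_{db}(p)
     && \text{$p$ is a prefix of $p'$} \\
  &<   \mu \cdot \mathit{min\_sup} + (1-\mu)\cdot \mathit{min\_sup}
     && \text{$p'$ infrequent in $D$, plus the pruning condition} \\
  &=   \mathit{min\_sup}.
\end{align*}
```

Under these assumptions p’ stays below min_sup in D’, so the p-projected database need not be searched for such patterns.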
Buffering Technique (II)
Handle the “infrequent-to-semi-frequent” case.
If an infrequent pattern p’ in D becomes semi-frequent in D’, then at least one of its prefix subsequences p is either in FS or in SFS.
[Figure: pattern tree with root <>; nodes shown include <a> s:4, <b> s:3, <d> s:4, and <d> s:3; the support of node <e> changes from 1 to 2.]
Solution: Start from its frequent or semi-frequent prefix p and construct p-projected database to discover p’
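Building a p-projected database and counting the items that can extend p might look like the sketch below. It is a simplification, not the IncSpan implementation: each sequence is assumed to be a flat list of single items rather than a list of itemsets, and the names `project` and `extension_supports` are hypothetical.

```python
from collections import Counter


def project(database, prefix):
    """Return the suffix that follows the leftmost match of `prefix` in each
    sequence; sequences that do not contain `prefix` are skipped."""
    projected = []
    for seq in database:
        pos = 0
        for item in prefix:
            try:
                pos = seq.index(item, pos) + 1  # leftmost match at or after pos
            except ValueError:
                break  # prefix is not contained in this sequence
        else:
            projected.append(seq[pos:])
    return projected


def extension_supports(database, prefix):
    """Count, for each item, how many projected suffixes contain it, which is
    the support of `prefix` extended by that item."""
    counts = Counter()
    for suffix in project(database, prefix):
        counts.update(set(suffix))  # count each item once per sequence
    return counts
```

For example, with `db = [list("abd"), list("ade"), list("bde"), list("ad")]`, calling `extension_supports(db, ["a", "d"])` counts `e` once; after `e` is appended to the first and last sequences, the same call counts it three times.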
Reverse Pattern Matching
An optimization technique: match a pattern against a sequence from the end towards the front. Since the item sets are appended at the end, reverse matching can save some computation.
If the last item of pattern p does not appear in Sa, then appending Sa to S will not increase sup(p).
So, just scan Sa for the last item of p and prune the search if the condition above holds.
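The last-item filter and an end-to-front match can be sketched as follows. This is an illustration under simplifying assumptions: sequences and patterns are flat lists of single items rather than itemsets, and the function names are hypothetical.

```python
def may_gain_support(pattern, appended):
    """Cheap filter: appending `appended` (Sa) can raise sup(pattern) only if
    the last item of the pattern occurs somewhere in the appended part."""
    return pattern[-1] in appended


def contains_reverse(sequence, pattern):
    """Check whether `pattern` is a subsequence of `sequence`, scanning both
    from the end towards the front."""
    j = len(pattern) - 1  # index of the next pattern item to match
    for item in reversed(sequence):
        if j < 0:
            break  # every pattern item has been matched
        if item == pattern[j]:
            j -= 1
    return j < 0
```

A caller would run `may_gain_support` first and fall back to `contains_reverse` on the grown sequence only when the filter passes.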
Performance Study
Compare with the ISM algorithm [Parthasarathy, Zaki, Ogihara and Dwarkadas, CIKM’99] and with PrefixSpan, a mining-from-scratch approach, to see how much we can save.
Compare CPU time and memory usage
Figure 1. Memory Usage under varied minsup
Performance Study (II)
Figure 2. Varying minsup
Figure 3. Varying percentage of updated sequences
Discussion and Conclusion
Buffering semi-frequent patterns is effective. The user can control the size of SFS through μ. Patterns in SFS are within (1 − μ)·min_sup of being frequent, so they are likely to become frequent as the database grows.
When only a small portion (5%) of the database is appended, IncSpan is more efficient than mining from scratch.
IncSpan can be easily extended to handle inserting sequences into or deleting sequences from the database.
Handling incremental mining in stream data? No: it still needs more than one scan of the database.