Efficient Mining of Closed Sequential Patterns on Stream Sliding Window
Chuancong Gao, Jianyong Wang, Qingyan Yang
Database Laboratory, Department of Computer Science and Technology
Tsinghua University, Beijing, China
C. Gao, etc. (Tsinghua Univ.) Efficient Mining of Closed Sequential Patterns on Stream Sliding Window 1 / 13
What is Closed Sequence
Sequence: A sequence $s$ is an ordered list of items, where each item can appear multiple times, denoted by $s = e_1 e_2 \ldots e_m$. If $s_a$ is contained in $s_b$, this is denoted as $s_a \sqsubseteq s_b$.
A Sequence $s$'s Support (Frequency):
- $db_s$ denotes the subset of the input sequence database $db$ consisting of every sequence $s' \in db$ that contains $s$.
- $|db_s|$ is defined as the absolute support of $s$ in $db$, denoted by $sup^{db}_s$ (or $sup_s$ when clear).
- Given a specified minimum support threshold $sup_{min}$, a sequence $s$ is said to be frequent iff $sup_s \ge sup_{min}$.
Closed Sequence: A non-empty subsequence $s_a$ is said to be closed iff $\nexists s_b$ such that $sup_{s_a} = sup_{s_b}$ and $s_a \sqsubset s_b$.
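The definitions above can be checked directly on a small database. The following is a minimal brute-force sketch, not the authors' implementation. It assumes sequences are Python strings of single-character items, and the names `contains`, `support`, `subsequences` and `is_closed` are made up for illustration. The four sequences are those of Window #1 in the deck's example dataset.

```python
from itertools import combinations

def contains(sb, sa):
    """True if sa is contained in sb: same order, gaps allowed."""
    it = iter(sb)
    return all(item in it for item in sa)  # each lookup consumes the iterator

def support(db, s):
    """Absolute support: number of sequences in db that contain s."""
    return sum(contains(seq, s) for seq in db)

def subsequences(seq):
    """All distinct non-empty subsequences of seq (exponential; toy sizes only)."""
    return {"".join(c) for r in range(1, len(seq) + 1)
            for c in combinations(seq, r)}

def is_closed(db, sa):
    """Closed iff no proper supersequence has the same support.
    Assumes support(db, sa) >= 1, so candidates can be drawn from db itself."""
    sup = support(db, sa)
    candidates = set().union(*(subsequences(seq) for seq in db))
    return not any(len(sb) > len(sa) and contains(sb, sa)
                   and support(db, sb) == sup for sb in candidates)

db = ["ACABC", "CABC", "BCACB", "CCABB"]
print(support(db, "CB"), is_closed(db, "CB"))    # 4 False (CAB also has support 4)
print(support(db, "CAB"), is_closed(db, "CAB"))  # 4 True
```

The exhaustive candidate enumeration is only meant to mirror the definition; it is exponential in the sequence length.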
Why Mine Closed Sequences
Compared to the set of all frequent sequences, the set of frequent closed sequences has the following advantages:
- It contains far fewer patterns. (Hundreds of times fewer.)
- Its mining process is much more efficient, as unpromising search spaces can be pruned. (Hundreds to thousands of times faster on real datasets.)
- It is a concise representation of all frequent sequences.
Our Goal
We want to mine the frequent closed sequences in the current sliding window over sequence streams. The current sliding window is updated when a batch of sequences arrives or leaves.
Time   ID   Sequence   Windows
6      6    CBBBA      #3
5      5    BACB       #2, #3
4      4    CCABB      #1, #2, #3
3      3    BCACB      #1, #2, #3
2      2    CABC       #1, #2
1      1    ACABC      #1
Example Dataset: Window Size = 4, Update Batch Size = 1
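The window movement in the example dataset can be sketched with a double-ended queue. This is only an illustration of the arrive/leave bookkeeping under the stated window size and batch size; the function name `sliding_windows` and its return shape are assumptions, not part of the original work.

```python
from collections import deque

def sliding_windows(stream, window_size, batch_size):
    """Yield (arrived, left, window) after each update once the window is full."""
    window = deque()
    for start in range(0, len(stream), batch_size):
        batch = stream[start:start + batch_size]
        window.extend(batch)                  # a batch arrives (ADD)
        left = []
        while len(window) > window_size:      # the oldest sequences leave (REMOVE)
            left.append(window.popleft())
        if len(window) == window_size:
            yield list(batch), left, list(window)

stream = ["ACABC", "CABC", "BCACB", "CCABB", "BACB", "CBBBA"]  # IDs 1..6
for number, (arrived, left, window) in enumerate(
        sliding_windows(stream, window_size=4, batch_size=1), start=1):
    print(f"Window #{number}: arrived={arrived} left={left} window={window}")
```

With these parameters the generator produces three windows, covering IDs 1 to 4, 2 to 5, and 3 to 6.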
ADD and REMOVE Operations
An enumeration tree is used to maintain all frequent closed sequences in the current sliding window. Each tree node represents a frequent closed subsequence. All nodes can be organized lexicographically as follows:
Ø
ACB : 2   BB : 2   BC : 3   CC : 4   CAB : 4   CAC : 3
CCB : 2   CABC : 2
Enumeration Tree of Window #1, $sup_{min} = 2$
- When a batch of sequences arrives (call ADD) or leaves (call REMOVE):
  - Update the enumeration tree by mining the updated database incrementally.
  - Get the frequent closed sequences in the current window by traversing the enumeration tree.
Notation used later:
- $db$: window before the update
- $db'$: window after the update
- $db^*$: batch of updating data (arriving or leaving)
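As a reference point for the incremental approach, the node set of the Window #1 tree can be reproduced by re-mining the window from scratch. The sketch below is a naive baseline rather than the incremental algorithm of this work. It assumes string sequences with single-character items, and `closed_frequent` is a hypothetical helper name.

```python
from itertools import combinations

def contains(sb, sa):
    """True if sa is contained in sb: same order, gaps allowed."""
    it = iter(sb)
    return all(item in it for item in sa)

def support(db, s):
    return sum(contains(seq, s) for seq in db)

def closed_frequent(db, sup_min):
    """Brute force: enumerate every subsequence, keep the frequent closed ones."""
    candidates = {"".join(c) for seq in db
                  for r in range(1, len(seq) + 1)
                  for c in combinations(seq, r)}
    frequent = {p: support(db, p) for p in candidates}
    frequent = {p: s for p, s in frequent.items() if s >= sup_min}
    # A supersequence with equal support is itself frequent,
    # so the closure test only needs to look inside `frequent`.
    return {p: s for p, s in frequent.items()
            if not any(len(q) > len(p) and t == s and contains(q, p)
                       for q, t in frequent.items())}

window_1 = ["ACABC", "CABC", "BCACB", "CCABB"]
for pattern, sup in sorted(closed_frequent(window_1, sup_min=2).items()):
    print(f"{pattern} : {sup}")
```

Running it on Window #1 with $sup_{min} = 2$ yields the same eight pattern and support pairs as the tree nodes above. Re-mining each window this way costs time exponential in the sequence length, which is the cost the incremental ADD and REMOVE operations are designed to avoid.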
ADD Operation
General Framework:
- Given a sequence $p$ in $db'$ (with $sup^{db'}_p \ge sup_{min}$ and $sup^{db^*}_p \ge 1$):
  - If $p$ is already in a node $n$ of the enumeration tree (was closed in $db$): update $p$'s support in $n$ (it can be proved that $p$ is also closed in $db'$).
  - Else if $p$ was frequent (non-closed) in $db$ (only $sup^{db}_p \ge sup_{min}$ holds):
    - Check whether $p$ can be pruned (incremental pruning checking).
    - If $p$ cannot be pruned: check whether $p$ is closed (incremental closure checking).
  - Else if $p$ was not frequent in $db$:
    - Check whether $p$ can be pruned (basic pruning checking).
    - If $p$ cannot be pruned: check whether $p$ is closed (basic closure checking).
- If $p$ cannot be pruned:
  - If $p$ is closed: create a new node $n$ for $p$ in the enumeration tree.
  - For each extended sequence $p'$ of $p$ (obtained by appending each locally frequent item to $p$), go to the first step.
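The three-way case split at the top of the framework can be sketched as a small dispatcher. This is an illustrative sketch only: it assumes the tree is available as a dictionary from closed pattern to support, that $db' = db \cup db^*$ for an ADD, and that sequences are strings of single-character items. The function name `add_case` and the returned labels are made up here.

```python
def contains(sb, sa):
    """True if sa is contained in sb: same order, gaps allowed."""
    it = iter(sb)
    return all(item in it for item in sa)

def support(db, s):
    return sum(contains(seq, s) for seq in db)

def add_case(p, tree, db, batch, sup_min):
    """Return which branch of the ADD framework applies to pattern p."""
    # Preconditions: p occurs in the batch and is frequent in db' (assumed db + batch).
    if support(batch, p) < 1 or support(db + batch, p) < sup_min:
        return None
    if p in tree:                       # p was closed in db
        return "update-node"
    if support(db, p) >= sup_min:       # p was frequent but non-closed in db
        return "incremental-checks"
    return "basic-checks"               # p was not frequent in db

# Window #1 and its enumeration tree, followed by the arrival of sequence 5.
db = ["ACABC", "CABC", "BCACB", "CCABB"]
tree = {"ACB": 2, "BB": 2, "BC": 3, "CC": 4,
        "CAB": 4, "CAC": 3, "CCB": 2, "CABC": 2}
batch = ["BACB"]
for p in ["ACB", "CB", "BAC", "CAB"]:
    print(p, add_case(p, tree, db, batch, sup_min=2))
```

In this example `CAB` yields `None` because it does not occur in the arriving sequence `BACB`, so its tree node needs no visit.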
Function insertable
Function insertable(pattern, db, endPos [, pred]) returns a set of $(pos, item)$ pairs where $1 \le pos \le endPos \le |pattern| + 1$, with:
- $cPattern = pattern[1, pos - 1] + \langle item \rangle + pattern[pos, |pattern|]$.
- $sup^{db}_{pattern} = sup^{db}_{cPattern}$.
- The optional predicate function pred(pattern, pos, item), if given, must be true for each returned $(pos, item)$ pair.
- If $endPos \le |pattern|$, the matching positions of the last item $pattern[endPos]$ in $db$ must be the same for both pattern and cPattern.
Function insertable gives the set of items that can each be inserted into the given sequence to obtain a new sequence with the same support. The predicate function pred imposes additional constraints.
Example: insertable(CB, db, |CB|) on db = {ACABC, CABC, BCACB, CCABB} returns (2, A).
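A brute-force reading of this specification is sketched below. It is not the authors' implementation. In particular, the "matching position" of an item is assumed here to be its position in the leftmost (greedy) match of the pattern in each sequence. Sequences are modelled as strings of single-character items, and `match_positions` is a hypothetical helper.

```python
def match_positions(seq, pattern):
    """0-based positions of the leftmost match of pattern in seq, or None."""
    positions, start = [], 0
    for item in pattern:
        idx = seq.find(item, start)
        if idx < 0:
            return None
        positions.append(idx)
        start = idx + 1
    return positions

def insertable(pattern, db, end_pos, pred=None):
    """Return {(pos, item)} with 1 <= pos <= end_pos <= len(pattern) + 1."""
    base = [match_positions(seq, pattern) for seq in db]
    items = sorted(set("".join(db)))
    check_position = end_pos <= len(pattern)
    results = set()
    for pos in range(1, end_pos + 1):
        for item in items:
            if pred is not None and not pred(pattern, pos, item):
                continue
            c_pattern = pattern[:pos - 1] + item + pattern[pos - 1:]
            ok = True
            for seq, old in zip(db, base):
                new = match_positions(seq, c_pattern)
                if (old is None) != (new is None):
                    ok = False          # supports of pattern and cPattern differ
                    break
                # pattern[endPos] sits one slot later in cPattern (0-based index end_pos).
                if old is not None and check_position and old[end_pos - 1] != new[end_pos]:
                    ok = False          # matching position of pattern[endPos] moved
                    break
            if ok:
                results.add((pos, item))
    return results

db = ["ACABC", "CABC", "BCACB", "CCABB"]
print(insertable("CB", db, len("CB")))  # {(2, 'A')}
```

On the deck's example the sketch returns the single pair (2, A), matching the slide: inserting A before B gives CAB, which has the same support as CB and leaves the position of B unchanged in every sequence.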
Basic Pattern Closure & Pruning Checking
When a sequence $p$ was not frequent in $db$, but is now frequent in $db'$, we have:
Basic Pattern Closure Checking
Pattern $p$ is non-closed if insertable$(p, db', |p| + 1)$ returns at least one result.
This checks whether we can get another sequence with the same support by inserting an item into $p$.
Basic Pattern Pruning Checking
Pattern $p$ and all its extended patterns can be safely pruned if insertable$(p, db', |p|)$ returns at least one result.
This checks whether we can get another sequence with the same support by inserting an item into $p$ before $p$'s last item, whose matching positions remain the same.
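The two basic checks differ only in the `endPos` argument passed to insertable. The sketch below illustrates this on Window #1, treated as $db'$ purely for illustration. It uses a compact brute-force `insertable` and assumes that matching positions come from the leftmost match. The names `is_non_closed` and `can_prune` are hypothetical.

```python
def match_positions(seq, pattern):
    """0-based positions of the leftmost match of pattern in seq, or None."""
    positions, start = [], 0
    for item in pattern:
        idx = seq.find(item, start)
        if idx < 0:
            return None
        positions.append(idx)
        start = idx + 1
    return positions

def insertable(pattern, db, end_pos):
    """Compact brute-force version without the optional predicate."""
    items = sorted(set("".join(db)))
    results = set()
    for pos in range(1, end_pos + 1):
        for item in items:
            c_pattern = pattern[:pos - 1] + item + pattern[pos - 1:]
            pairs = [(match_positions(s, pattern), match_positions(s, c_pattern))
                     for s in db]
            same_support = all((old is None) == (new is None) for old, new in pairs)
            same_position = end_pos > len(pattern) or all(
                old[end_pos - 1] == new[end_pos]
                for old, new in pairs if old is not None and new is not None)
            if same_support and same_position:
                results.add((pos, item))
    return results

def is_non_closed(p, db):
    """Basic closure check: any insertion, including appending after p."""
    return bool(insertable(p, db, len(p) + 1))

def can_prune(p, db):
    """Basic pruning check: insertion before p's last item, position unchanged."""
    return bool(insertable(p, db, len(p)))

db = ["ACABC", "CABC", "BCACB", "CCABB"]
for p in ["CA", "CB", "CAB"]:
    print(p, "non-closed:", is_non_closed(p, db), "prunable:", can_prune(p, db))
```

On this data `CA` is non-closed but not prunable, because it must still be extended to reach `CAB`. `CB` is both non-closed and prunable, and `CAB` is closed.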
Incremental Pattern Closure Checking
When a sequence $p$ was frequent in $db$, we check whether it is closed in $db'$ incrementally by calling insertable$(p, db^*, |p| + 1, prevClosed)$, with:
- Predicate prevClosed$(p, pos, item)$ checks, for each $(pos, item)$ pair, whether $p$ was non-closed in $db$ with the same $(pos, item)$ pair, by checking whether $p \sqsubset p' \sqsubseteq s$ with:
  - $p' = p[1, pos - 1] + \langle item \rangle + p[pos, |p|]$.
  - $s \in \{s \mid p \sqsubset s,\ s \text{ is closed, and } sup^{db}_p = sup^{db}_s\}$ (by using an index).
- $p$ is non-closed if at least one result is found.
This checks whether $p$ is non-closed not only in $db^*$ (via function insertable) by inserting an item into $p$, but also in $db$ (via predicate prevClosed) with the same item and position, and is therefore non-closed in $db'$.
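The incremental closure check can be sketched for the arrival of sequence 5 (`BACB`) into Window #1. This is a hedged illustration, not the authors' code. It assumes the index of closed sequences is a plain dictionary from pattern to support in $db$, that supports in $db$ are recomputed by scanning rather than cached, and that $db' = db \cup db^*$. The names `insertable_anywhere` and `make_prev_closed` are made up. Because $endPos = |p| + 1$, no matching-position constraint applies.

```python
def contains(sb, sa):
    """True if sa is contained in sb: same order, gaps allowed."""
    it = iter(sb)
    return all(item in it for item in sa)

def support(db, s):
    return sum(contains(seq, s) for seq in db)

def insertable_anywhere(pattern, db, pred=None):
    """insertable(pattern, db, |pattern| + 1 [, pred]) by brute force."""
    items = sorted(set("".join(db)))
    sup = support(db, pattern)
    results = set()
    for pos in range(1, len(pattern) + 2):
        for item in items:
            c_pattern = pattern[:pos - 1] + item + pattern[pos - 1:]
            if support(db, c_pattern) == sup and (
                    pred is None or pred(pattern, pos, item)):
                results.add((pos, item))
    return results

def make_prev_closed(closed_index, old_db):
    """Build prevClosed: true if p' fits inside a closed s with sup_s == sup_p in db."""
    def prev_closed(p, pos, item):
        p_new = p[:pos - 1] + item + p[pos - 1:]
        sup_p = support(old_db, p)
        return any(sup_s == sup_p and contains(s, p_new)
                   for s, sup_s in closed_index.items())
    return prev_closed

db = ["ACABC", "CABC", "BCACB", "CCABB"]                # Window #1
closed_index = {"ACB": 2, "BB": 2, "BC": 3, "CC": 4,
                "CAB": 4, "CAC": 3, "CCB": 2, "CABC": 2}
batch = ["BACB"]                                        # db*: sequence 5 arrives
prev_closed = make_prev_closed(closed_index, db)

print(insertable_anywhere("B", batch, prev_closed))     # non-empty: B stays non-closed
print(insertable_anywhere("CB", batch, prev_closed))    # set(): CB becomes closed
```

`CB` was non-closed in $db$ only because of `CAB`, which does not occur in `BACB`. No insertion therefore survives both tests, and `CB` is closed in $db'$ with support 5. `B` remains non-closed because `CB` and `AB` both occur in the batch and fit inside the closed sequence `CAB`.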
Incremental Pattern Pruning Checking
When a sequence p was frequent in db, we check whether it can be pruned in db′ incrementally by calling insertable(p, db∗, |p|, prevPruned):
• We use a global hash structure, called nonPruned, to store all the frequent sequences that could not be pruned in db.
• Predicate prevPruned(p, pos, item) checks, for each (pos, item) pair, whether p was pruned in db with the same (pos, item) pair.
  • This is done by running insertable(p, tdb, |p|) with $tdb = \{ s \mid p \sqsubset s \text{ and } s \in nonPruned \}$, and checking whether the (pos, item) pair being verified is among the returned results.
• p can be pruned if at least one result is found.
This checks whether p can be pruned not only in db∗ (by function insertable, by inserting an item into p before p's last item with the matching positions unchanged) but also in db (by predicate prevPruned, with the same item and position), and can therefore also be pruned in db′.
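The structure of prevPruned can be sketched as follows. This is only an outline under stated assumptions: `insertable` is passed in as a caller-supplied function with the three-argument form used on the slide and is assumed to return a collection of (pos, item) pairs; `non_pruned` stands in for the global hash structure nonPruned and is modelled here as a Python set of tuples.

```python
def is_proper_subsequence(a, b):
    """True if a is contained in b and is strictly shorter than b."""
    it = iter(b)
    return len(a) < len(b) and all(x in it for x in a)


def prev_pruned(p, pos, item, non_pruned, insertable):
    """Check whether p was pruned in db with the given (pos, item) pair.

    non_pruned: set of frequent sequences that could not be pruned in db.
    insertable: assumed callable insertable(p, tdb, max_pos) returning
                (pos, item) pairs; its internals are not modelled here.
    """
    # tdb = {s | p is a proper subsequence of s and s is in nonPruned}
    tdb = [s for s in non_pruned if is_proper_subsequence(p, s)]
    # The pair being verified must be one of the returned results.
    return (pos, item) in insertable(p, tdb, len(p))
```

Keeping nonPruned in a hash structure means tdb is built only from sequences already known to survive pruning in db, rather than from db itself.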
REMOVE Operation
We can prove that during the REMOVE operation, a non-closed sequence never becomes closed.
Hence we only need to consider the closed sequences on the enumeration tree, since no other sequence can become closed.
The REMOVE operation removes subtrees whose supports are smaller than $sup_{min}$ after the update.
Nodes whose patterns are no longer closed are removed, by checking whether $\exists p'$ in the enumeration tree with $p \sqsubset p'$ and $sup^{db'}_p = sup^{db'}_{p'}$ (with the index used before in incremental closure checking).
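The two REMOVE steps can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the enumeration tree is replaced by a flat dict `supports` mapping each stored pattern (a tuple) to its support in db′, and the index is modelled as a hypothetical dict `by_support` grouping patterns by support value.

```python
def is_proper_subsequence(a, b):
    """True if a is contained in b and is strictly shorter than b."""
    it = iter(b)
    return len(a) < len(b) and all(x in it for x in a)


def remove_update(supports, sup_min):
    """Return the patterns that stay after a REMOVE update.

    supports: pattern (tuple) -> support in db' (already recounted).
    """
    # Step 1: drop patterns whose support fell below the threshold.
    frequent = {p: s for p, s in supports.items() if s >= sup_min}

    # Hypothetical index: support value -> patterns with that support.
    by_support = {}
    for p, s in frequent.items():
        by_support.setdefault(s, []).append(p)

    # Step 2: drop p if some stored p' properly contains p with equal support.
    return {
        p: s
        for p, s in frequent.items()
        if not any(is_proper_subsequence(p, q) for q in by_support[s])
    }
```

Grouping by support means each pattern is compared only against candidates that could possibly witness its non-closedness.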
Experiments
The performance of our algorithm, StreamCloSeq, is compared with that of another algorithm, CISpan (D. Yuan et al., SDM 2008).
Table: Dataset characteristics
Dataset   # Item   # Seq.    Max. Len.  Avg. Len.
Gazelle   1,423    29,369    651        2.98
TCAS      105      1,578     96         60.3
Kosarak   41,270   990,002   2,498      8.1
MSNBC     17       989,818   14,795     4.74
• For the smaller datasets, Gazelle and TCAS, incremental mining is evaluated: only the ADD operation is executed, which is denoted by a window size of ∞.
• For the larger datasets, Kosarak and MSNBC, sliding-window mining is evaluated: the running time of the first 100 full-sized windows (windows at the maximal window size) is collected.
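The two evaluation modes can be pictured with a small driver sketch. It is not the authors' harness: the generator name `sliding_windows` and its parameters are assumptions, and `window_size=None` is used here to model a window size of ∞, in which case nothing is ever removed and only ADD batches are produced.

```python
from collections import deque
from itertools import islice


def sliding_windows(stream, window_size, batch_size):
    """Yield (added, removed, current_size) for each update batch.

    window_size=None models an unbounded window (ADD only).
    """
    window = deque()
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        window.extend(batch)          # ADD the new batch
        removed = []
        if window_size is not None:   # REMOVE the oldest sequences if over capacity
            while len(window) > window_size:
                removed.append(window.popleft())
        yield batch, removed, len(window)
```

A timing run would then measure the mining update for each yielded step and, for the sliding-window setting, keep only the steps where the window has reached its maximal size.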
Experiments – Runtime
[Figure: runtime (in seconds) of StreamCloSeq vs. CISpan.
Results with varying minimum support threshold: Gazelle (window size = ∞, batch size = 100), TCAS (window size = ∞, batch size = 10), Kosarak (window size = 50K, batch size = 1K), MSNBC (window size = 10K, batch size = 100).
Results with varying update batch size: Gazelle (window size = ∞, minimum support = 5), Kosarak (window size = 50K, minimum support = 100).
Results with varying sliding window size (in 1,000s): Kosarak (batch size = 1K, minimum support = 250), MSNBC (batch size = 100, minimum support = 10).]
The End
Thank you!
Questions or Comments?