Online Frequent Episode Mining
Xiang Ao¹, Ping Luo¹, Chengkai Li², Fuzhen Zhuang¹ and Qing He¹
23/4/19 X. Ao et al. Online Frequent Episode Mining 1
Agenda
Introduction
• Problem Formulation
• Solution Framework
• Experimental Results
• Conclusions
Introduction
• Frequent episode mining (FEM) techniques are widely used to analyze data sequences in many domains, e.g., manufacturing, telecommunication, finance, biology, news analysis, and system log analysis.
[Figure: an event sequence, with events plotted against time stamps.]
• An episode (specifically, a serial episode in this paper) is a totally ordered set of events.
• E.g., D → A is an episode.
Introduction
• FEM aims at identifying all the frequent episodes whose frequencies are larger than a user-specified threshold.
Introduction
• Usually, FEM algorithms are time-consuming:
1. The anti-monotonicity property may fail to hold for episode frequency [Achar, 2012].
2. Testing whether an episode occurs in a sequence is an NP-complete problem [Tatti, 2011].
[Achar, 2012] A. Achar, S. Laxman, and P. Sastry, “A unified view of the apriori-based algorithms for frequent episode discovery,” KAIS, 2012.
[Tatti, 2011] N. Tatti and B. Cule, “Mining closed episodes with simultaneous events,” in KDD, 2011.
Introduction
[Figure: the running example, an event sequence with one event set per time stamp (time stamps 1 to 7).]
• Previous studies on FEM mostly process data offline in a batch mode.
[Figure: historical data are fed to an FEM algorithm, which outputs the frequent episodes. Once the event set at time stamp 8 arrives, the updated data yield updated frequent episodes that differ from the earlier output.]
Introduction
• In this paper, we consider the online frequent episode mining (OFEM) problem.
[Figure: the running example keeps growing as new event sets arrive at time stamps 8, 9, 10, ...]
• Newly emerging episodes may become valuable.
• Old episodes may become obsolete.
• Time-critical applications need efficient methods to find recent and frequent episodes.
Introduction
• Examples of motivating applications: predictive maintenance and high-frequency trading.
• Fast-growing data
• Recency effect
• Time-critical analysis
Introduction
Challenges of OFEM algorithms:
• Events that are infrequent at the current moment may become frequent in the future.
• Intensive computation generates a large number of episode occurrences.
• Efficiently mining all occurrences of episodes over the growing sequence is also a major challenge.
Introduction
Contributions of this paper:
• Propose an algorithm, MESELO (Mining frEquent Serial Episode via Last Occurrence), for online frequent episode mining.
• Design a data structure, the episode trie, to compactly store all minimal occurrences of episodes.
• Introduce the concept of last episode occurrence.
• Compare our method with state-of-the-art batch-mode FEM methods based on minimal occurrences.
Agenda
• Introduction
Problem Formulation
• Solution Framework
• Experimental Results
• Conclusions
Problem Formulation
[Figure: the running example with the valid sequence marked as a window of size Δ over the most recent time stamps.]
• Frequent episodes may change as the sequence continues growing.
• Δ is the window size of the valid sequence.
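The valid sequence can be pictured as a bounded sliding window. The following is a minimal Python sketch under that reading; the class name `ValidSequence` and its methods are hypothetical and do not come from the paper.

```python
from collections import deque


class ValidSequence:
    """Keeps only the event sets of the most recent `window_size` time stamps."""

    def __init__(self, window_size):
        # A deque with maxlen silently drops the oldest entry when full.
        self.window = deque(maxlen=window_size)

    def append(self, timestamp, events):
        self.window.append((timestamp, frozenset(events)))

    def timestamps(self):
        return [t for t, _ in self.window]


# Hypothetical usage with window size 3 (the value 3 is an assumption).
valid = ValidSequence(window_size=3)
for t, events in [(5, {"A"}), (6, {"A", "B"}), (7, {"B"}), (8, {"B"})]:
    valid.append(t, events)
```

After time stamp 8 arrives, the event set at time stamp 5 falls out of the window, which is why episodes that were frequent may stop being so.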
Agenda
• Introduction
• Problem Formulation
Solution Framework
• Experimental Results
• Conclusions
Solution Framework
• A minimal occurrence is an occurrence of an episode whose time window contains no other occurrence of the same episode.
• A → B is a serial episode in the example.
• Consider another episode D → D in the example.
• A minimal episode occurrence is also bounded by a user-specified parameter, the maximal occurrence window δ.
• The support of A → B is 2 in the example.
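To make the definition concrete, here is a small brute-force Python sketch that lists the minimal occurrences of one serial episode. It is not the MESELO algorithm. The function name is hypothetical, the event sets for time stamps 5 to 7 are the ones implied by the running example, and the reading of δ as "end − start < δ" is an assumption.

```python
def minimal_occurrences(seq, episode, delta):
    """seq maps time stamp -> set of events; episode is a list of events.

    Returns the windows (start, end) of the minimal occurrences whose
    width satisfies end - start < delta (assumed meaning of delta).
    """
    times = sorted(seq)
    windows = []
    for idx, start in enumerate(times):
        if episode[0] not in seq[start]:
            continue
        # Greedily match the remaining events at strictly later time stamps,
        # which gives the earliest possible end for this start.
        pos, end = 1, start
        for t in times[idx + 1:]:
            if pos == len(episode):
                break
            if episode[pos] in seq[t]:
                pos, end = pos + 1, t
        if pos == len(episode):
            windows.append((start, end))
    # A window is minimal if it contains no other occurrence window.
    minimal = [
        w for w in windows
        if not any(o != w and w[0] <= o[0] and o[1] <= w[1] for o in windows)
    ]
    return [w for w in minimal if w[1] - w[0] < delta]


seq = {5: {"A"}, 6: {"A", "B"}, 7: {"B"}}
occ = minimal_occurrences(seq, ["A", "B"], delta=4)  # delta=4 is an assumption
support = len(occ)
```

On this fragment the two minimal occurrences of A → B are [5, 6] and [6, 7]; the wider window [5, 7] is discarded because it contains both.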
[Figure: the running example with the valid sequence and the local time window marked; the frequent episodes become updated frequent episodes when the event set at time stamp 8 arrives.]
Solution Framework
• The concept of local time window
[Figure: the running example with the valid sequence marked; the local time window spans δ − 1 time stamps.]
Solution Framework
• The concept of last episode occurrence
[Figure: the running example with the valid sequence and the local time window marked. Legend: last occurrence of A → B in the local time window; minimal but not last occurrence of A → B in the local time window; last minimal occurrence of A → B in the local time window.]
• In MESELO, only last minimal episode occurrences can be further expanded into new minimal episode occurrences.
Solution Framework
[Figure: the running example with the valid sequence marked.]
• The concept of minimal occurrence starting at $t_i$ and ending no later than $t_j$.
• Definition (Minimal episode occurrence starting at $t_i$ and ending no later than $t_j$). Given a time window $[t_i, t_j]$, we use $M_i^j$ to denote the set of all minimal episode occurrences whose start time is equal to $t_i$ and whose end time is no later than $t_j$.
• In the running example, $M_5^7$ = {(A, [5, 5]), (A → A, [5, 6]), (A → B, [5, 6]), (A → B → B, [5, 7]), (A → A → B, [5, 7])}.
Solution Framework
[Figure: when the sequence grows from k to k+1, the valid sequence (window size Δ) and the local time window (δ − 1 time stamps) both slide forward by one time stamp.]
[Figure: an episode trie with root:5, written parent → child: root:5 → A:5; A:5 → A:6 and B:6; A:6 → B:7; B:6 → B:7. Legend: a non-last occurrence node denotes a minimal but not last occurrence; a last occurrence node denotes a last minimal occurrence.]
Solution Framework
• Use an episode trie, denoted $T_i^j$, to store $M_i^j$.
• Each node p = p.event:p.time consists of two fields, p.event and p.time (e.g., B:6).
• p.event registers which event this node represents.
• p.time registers the occurrence time stamp.
• The event field of the root is associated with the empty string (labeled as “root”), and the time field of the root is equal to $t_i$ (e.g., root:5).
• The event sequence along the path from the root to p denotes a minimal episode occurrence, and its occurrence window is [$t_i$, p.time]. E.g., (A → A, [5, 6]).
• The figure on this slide shows the episode trie $T_5^7$.
• In fact, $T_i^j \Leftrightarrow M_i^j$: the trie and the set encode the same minimal occurrences.
• In MESELO, only last occurrence nodes can be further expanded into new minimal episode occurrences.
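The node layout described above can be sketched in Python as follows. The class and function names are hypothetical; the trie is built by hand from the $T_5^7$ figure rather than mined, so this only illustrates how paths in the trie map back to $M_5^7$.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    event: str                      # "root" for the root node
    time: int                       # occurrence time stamp (t_i for the root)
    children: list = field(default_factory=list)

    def add(self, event, time):
        child = TrieNode(event, time)
        self.children.append(child)
        return child


def occurrences(root):
    """Yield (episode, (t_i, p.time)) for every non-root node p."""
    stack = [(child, [child.event]) for child in root.children]
    while stack:
        node, path = stack.pop()
        yield (" -> ".join(path), (root.time, node.time))
        stack.extend((c, path + [c.event]) for c in node.children)


# Hand-built copy of the trie T_5^7 from the slides.
root = TrieNode("root", 5)
a5 = root.add("A", 5)
a6 = a5.add("A", 6)
b6 = a5.add("B", 6)
a6.add("B", 7)
b6.add("B", 7)

m_5_7 = set(occurrences(root))
```

Shared prefixes such as A:5 are stored once, which is where the compactness of the trie comes from.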
Solution Framework
• MESELO Algorithm
Basically,
• Step 1: create a new $T_{k+1}^{k+1}$ and update the superscript of each $T_i^k$ that can still change from k to k+1.
• Step 2: transfer the episode trie $T_{k-\delta+2}^{k+1}$ out of the main memory.
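The bookkeeping of the two steps can be sketched in Python as below. This is only an outline under stated assumptions: each trie is reduced to a `[start, end]` pair standing in for $T_{start}^{end}$, the actual expansion of last occurrence nodes is omitted, and the function name `advance` and the value δ = 4 are hypothetical.

```python
from collections import deque


def advance(tries, new_time, delta):
    """Process the arrival of time stamp `new_time` (= k+1).

    `tries` is a deque of [start, end] pairs, oldest first.
    Returns the (start, end) pair of the trie moved out of memory, or None.
    """
    # Step 1: every in-memory trie now ends at k+1 (node expansion omitted),
    # and a new trie rooted at k+1 is created.
    for trie in tries:
        trie[1] = new_time
    tries.append([new_time, new_time])
    # Step 2: once delta tries are held, the oldest one can no longer grow
    # within the maximal occurrence window, so it leaves main memory.
    if len(tries) >= delta:
        return tuple(tries.popleft())
    return None


# Mirrors the slides' example: tries rooted at 5, 6, 7 before time stamp 8.
tries = deque([[5, 7], [6, 7], [7, 7]])
flushed = advance(tries, new_time=8, delta=4)
```

In this run the trie rooted at time stamp 5 is the one transferred out, leaving the tries rooted at 6, 7 and 8 in memory.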
[Figure: MESELO processing the event set $E_{k+1} = E_8 = \{B\}$ over the latest δ time stamps of the valid sequence. Tries are written parent → child.
Before processing: (a) root:5 → A:5 → {A:6 → B:7, B:6 → B:7}; (b) root:6 → {A:6 → B:7, B:6 → B:7}; (c) root:7 → B:7.
After processing: (d) root:5 → A:5 → {A:6 → B:7 → B:8, B:6 → B:7 → B:8}; (e) root:6 → {A:6 → B:7 → B:8, B:6 → B:7 → B:8}; (f) root:7 → B:7 → B:8; (g) root:8 → B:8.]
For more details, including the proof of soundness and completeness of the algorithm and the complexity analysis, please refer to the paper.
Agenda
• Introduction
• Motivation
• Problem Formulation
• Solution Framework
Experimental Results
• Conclusions
Experimental Results
Data sets
• Online mode
• Batch mode
Environments
• Mining server: 2.00 GHz Intel Xeon E5-2620, 32 GB memory, Windows 2008
• Database server: 2.00 GHz Intel Xeon E5-2620, 16 GB memory, Linux CentOS
• 100MB connection
Baselines
• Online mode: BRUTE
• Online mode: MESELO-BS (a degraded variant of the MESELO algorithm)
• Batch mode: PPS [ICDM’04]
• Batch mode: MINEPI+ [Info. Sys.’08]
• Batch mode: UP-Span [KDD’13]
• Batch mode: DFS [DKE’13]
Experimental Results (1)
• Online mode data preparation
Table 1. Details of online mode data sets
Industry Name          # of Stocks   Dataset Name
Pharmaceuticals        1             Stock-1
Security               2             Stock-2
Electricity Power      4             Stock-3
Iron and Steel         6             Stock-4
Nonferrous-material    8             Stock-5
Estate                 10            Stock-6
• Data from the China Stock Exchange Daily Trading list (denoted as Stock-1 to Stock-6) over 2,509 trading days from January 1st, 2004 to May 9th, 2014.
• We always select the leading stocks from each industry.
• Build stock events from the daily closing price:
1. Calculate the increase ratio r of the price between two consecutive trading days.
2. Discretize the value of r into 4 levels: UH (r ≥ 3.5%), UL (0% ≤ r < 3.5%), DL (−3.5% < r < 0%), DH (r ≤ −3.5%).
3. Each stock therefore generates exactly one of the four events on every trading day.
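The three steps above can be sketched in Python as follows. The function names and the sample prices are hypothetical, and the handling of the exact boundaries (±3.5% assigned to UH and DH) is an assumption, since the slide's intervals overlap at those points.

```python
def discretize(r):
    """Map an increase ratio r (e.g. 0.04 for +4%) to one of four events."""
    if r >= 0.035:
        return "UH"   # up, high
    if r >= 0.0:
        return "UL"   # up, low
    if r > -0.035:
        return "DL"   # down, low
    return "DH"       # down, high


def stock_events(closing_prices):
    """One event per pair of consecutive trading days."""
    events = []
    for prev, cur in zip(closing_prices, closing_prices[1:]):
        r = (cur - prev) / prev
        events.append(discretize(r))
    return events


# Made-up closing prices, for illustration only.
events = stock_events([100.0, 104.0, 105.0, 104.0, 100.0])
```

With several stocks per data set, the events of all stocks on the same trading day would form the event set of that time stamp.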
Experimental Results (2)
• Online mode experimental results
• Comparison method:
• Sequentially read every event set of the coming time stamp, and perform online frequent episode mining.
• Record the execution time at each time stamp and use their average value as the measure for the comparison.
Note: the average time over all time stamps is only related to δ.
Experimental Results (4)
• Batch mode data preparation
Table 2. Details of batch mode data sets
Dataset Name   Data Type
Retail         Market basket data from stores.
ChainStore     Market basket data from stores.
Kosarak        Click-stream data from web sites.
BMS            Click-stream data from web sites.
Note: The four datasets are originally for sequential pattern mining. We follow the processing steps in [1].
[1] C.-W. Wu, Y.-F. Lin, S. Y. Philip, and V. S. Tseng, “Mining high utility episodes in complex event sequences,” in KDD, 2013.
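Sequential pattern mining data form (horizontal):
Tid   Events
1     A, B, D
2     B, E
3     A, F
…     …
Episode mining data form (vertical), obtained by treating each transaction as the event set of one time stamp:
Time stamp   1         2      3      ...
Events       A, B, D   B, E   A, F   ...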
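The conversion can be sketched in Python as below, assuming each transaction id is used directly as the time stamp. The function name and the input format (a list of `(tid, "A, B, D")` pairs) are hypothetical; the exact preprocessing steps are those of [1] and are not reproduced here.

```python
def to_event_sequence(transactions):
    """Turn (tid, "A, B, D") rows into a time-ordered list of event sets."""
    sequence = []
    for tid, items in sorted(transactions):
        events = frozenset(
            item.strip() for item in items.split(",") if item.strip()
        )
        sequence.append((tid, events))
    return sequence


# The three rows shown in the slide's table.
transactions = [(1, "A, B, D"), (2, "B, E"), (3, "A, F")]
sequence = to_event_sequence(transactions)
```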
Experimental Results (5)
• Batch mode performance evaluations
• Comparison method: min_sup and δ variations
• 1. Fix δ and vary min_sup. (See Fig. 8)
• 2. Fix min_sup and vary δ. (See Fig. 9)
• BMS has a shorter sequence and, most importantly, fewer events per time stamp than the other datasets.
Agenda
• Introduction
• Motivation
• Problem Formulation
• Solution Framework
• Experimental Results
Conclusions
Conclusions
New problem: online frequent episode mining.
• Especially useful to time-critical applications with growing sequences.
Efficient online algorithm (i.e. MESELO).
• Experiments on real data sets show that MESELO is at least one order of magnitude faster than the other baselines.
New concept of last episode occurrence and episode trie.
• Detecting the minimal episode occurrences efficiently.
• All minimal episode occurrences are stored in a compact way.
Thanks! Q&A
http://mldm.ict.ac.cn/MLDM/~aox