This paper proposes a novel tree structure, DSTree, which can handle stream data. The experiments show comparable performance in terms of accuracy and efficiency.
DSTree: A Tree Structure for the Mining of Frequent Sets from Data Streams
Presenter / Meng-Lun Wu
Source / ICDM’06, IEEE
Author / Carson Kai-Sang Leung, Quamrul I. Khan

Outline
• Introduction
• Related Work
• DSTree
• Discussion
• Experimental Results
• Conclusions

Introduction
• With advances in technology, a flood of data can be produced in many applications.
  • E.g., sensor networks and Web click streams.
• This calls for efficient techniques for extracting useful information from streams of data.
• Mining from data streams is more challenging due to two properties:
  • Property 1: Data streams are continuous and unbounded.
  • Property 2: Data in the streams are not necessarily uniformly distributed; their distributions usually change with time.

Introduction (cont.)
• Existing stream mining algorithms can be broadly categorized into two classes.
  • Exact algorithms
    • Find truly frequent itemsets, especially maximal, closed, or “short” itemsets.
    • I.e., itemsets with frequency ≥ the user-defined minimum support threshold minsup.
  • Approximate algorithms
    • Find “frequent” itemsets by using approximate procedures, which may lead to some false positives or false negatives.
• The key contribution of this work is the DSTree (Data Stream Tree), which is designed for exact stream mining of regular frequent itemsets.
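To make "exact" concrete, the following is a minimal brute-force sketch of what an exact miner must return: every itemset whose frequency in the window is at least minsup, with no false positives or negatives. It is not the DSTree algorithm; the function name `frequent_itemsets` and the toy transactions are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, minsup):
    """Return {itemset: count} for every itemset with count >= minsup.

    Brute force: enumerates every non-empty subset of every transaction,
    so it is only suitable for tiny illustrative inputs.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                counts[subset] += 1
    return {itemset: c for itemset, c in counts.items() if c >= minsup}


# Hypothetical toy window of four transactions (not taken from the paper).
window = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
result = frequent_itemsets(window, minsup=2)
```

Here {b,c} and {a,b,c} each occur only once, so they are excluded when minsup is 2.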

Related Work
• In this section, we briefly discuss some data structures that are relevant to our work.
• CanTree (proposed in ICDM’05) vs. DSTree
  • CanTree is designed for incremental mining, whereas DSTree is designed for stream mining.
  • Each node in the CanTree keeps just one frequency count, whereas each node in the DSTree keeps a list of frequency counts.
  • When transactions are deleted from the CanTree, the frequency counts of the affected nodes get decremented, whereas in the DSTree the list of frequency counts at each affected node just shifts.
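The difference in bookkeeping can be sketched with two simplified node types. This is an illustration under assumptions, not code from either paper: the class names `SingleCountNode` and `BatchCountNode` are hypothetical, and the convention that the newest batch sits in the last slot is a choice made here.

```python
from collections import deque


class SingleCountNode:
    """CanTree-style node (simplified): a single frequency count."""

    def __init__(self):
        self.count = 0

    def add(self, n=1):
        self.count += n

    def delete(self, n=1):
        # Deleted transactions have to be found and decremented explicitly.
        self.count -= n


class BatchCountNode:
    """DSTree-style node (simplified): one count per batch in the window."""

    def __init__(self, window_size):
        # Oldest batch on the left, newest batch on the right.
        self.counts = deque([0] * window_size, maxlen=window_size)

    def slide(self):
        # Appending to a full deque drops the oldest batch's count.
        self.counts.append(0)

    def add(self, n=1):
        self.counts[-1] += n

    @property
    def total(self):
        return sum(self.counts)


# Three batches contribute 3, 2 and 1 occurrences; the window holds 2 batches.
can_node = SingleCountNode()
ds_node = BatchCountNode(window_size=2)
for occurrences in (3, 2, 1):
    can_node.add(occurrences)
    ds_node.slide()
    ds_node.add(occurrences)

can_node.delete(3)  # the single count needs an explicit decrement for batch 1
```

After the loop, the list-based node already reflects only the last two batches, while the single-count node needs the explicit `delete` call to reach the same total.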

Related Work (cont.)
• FP-streaming (FP-tree and FP-stream) vs. DSTree

                     FP-streaming                      DSTree
Algorithm type       Approximate stream mining         Exact stream mining
Constructing tree    One FP-tree is built per batch    One DSTree is built from several batches of transactions
Store counts         Keeps just one frequency count    Keeps a list of frequency counts
Support threshold    Yes                               No
Each path            Represents an itemset             Represents a transaction in the current window
Window               Tilted-time window                Sliding window

DSTree
• Due to Property 1 of data streams,
  • The DSTree is designed for (exact) stream mining.
  • The construction of the DSTree requires only one scan of the streaming data.
  • The tree captures the contents of transactions in each batch of streaming data.
• Due to the dynamic nature and Property 2 of data streams,
  • We arrange transaction items according to some canonical order.
    • E.g., lexicographic order or alphabetical order.
  • The frequency of a node in the DSTree is at least as high as the sum of frequencies of its children.
  • The ordering of items is unaffected by the continuous changes in item frequencies.
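The properties above can be sketched as a small prefix tree. This is a simplified illustration under assumptions, not the authors' implementation: the names `DSTreeNode`, `DSTree`, `add_batch` and `support` are hypothetical, items are assumed to be mutually comparable so that `sorted` gives the canonical order, and `support` only counts one non-empty itemset at a time rather than running a full mining algorithm.

```python
class DSTreeNode:
    def __init__(self, item, window_size):
        self.item = item
        self.counts = [0] * window_size  # one count per batch, oldest first
        self.children = {}


class DSTree:
    """Simplified DSTree sketch: prefix tree with per-batch count lists."""

    def __init__(self, window_size):
        self.w = window_size
        self.root = DSTreeNode(None, window_size)

    def add_batch(self, batch):
        # Slide first: shift every count list and prune expired subtrees.
        self._slide(self.root)
        for transaction in batch:
            node = self.root
            for item in sorted(set(transaction)):  # canonical order
                child = node.children.get(item)
                if child is None:
                    child = node.children[item] = DSTreeNode(item, self.w)
                child.counts[-1] += 1  # count goes to the newest batch slot
                node = child

    def _slide(self, node):
        for item, child in list(node.children.items()):
            child.counts.pop(0)     # drop the oldest batch
            child.counts.append(0)  # open a slot for the new batch
            if sum(child.counts) == 0:
                # A node's count bounds its descendants' counts, so the
                # whole subtree has left the window.
                del node.children[item]
            else:
                self._slide(child)

    def support(self, itemset):
        """Frequency of a non-empty itemset in the current window."""
        return self._support(self.root, sorted(set(itemset)), 0)

    def _support(self, node, target, matched):
        total = 0
        for item, child in node.children.items():
            if item == target[matched]:
                if matched + 1 == len(target):
                    total += sum(child.counts)
                else:
                    total += self._support(child, target, matched + 1)
            elif item < target[matched]:
                total += self._support(child, target, matched)
            # item > target[matched]: the canonical order guarantees that
            # target[matched] cannot occur further down this path.
        return total


# Hypothetical stream of three batches with a window of two batches.
tree = DSTree(window_size=2)
tree.add_batch([{"a", "c"}, {"a", "b", "c"}])
tree.add_batch([{"a", "b"}, {"b", "c"}])
before = (tree.support({"a"}), tree.support({"a", "c"}), tree.support({"c"}))
tree.add_batch([{"a", "c"}, {"c"}])  # the first batch leaves the window
after = (tree.support({"a"}), tree.support({"a", "c"}), tree.support({"c"}))
```

Because items are inserted in a fixed canonical order, no path ever has to be reordered when item frequencies change; sliding the window only shifts the count lists and removes nodes whose counts have all expired.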

DSTree Example
• Let minsup be 3 and let the window size w be 2 batches (indicating that only two batches of transactions are kept).
• If we call the mining process at time T’, we get the frequent itemsets {a}:4, {a,c}:3, {a,d}:3, {b}:4, {b,d}:4, {c}:3 and {d}:5.
Discussion
(a) Applicability for finding other patterns:
• DSTree also provides users with such functionalities as stream mining of maximal, closed, and constrained itemsets.
  • A frequent itemset is maximal if none of its proper supersets is frequent.
  • A frequent itemset X is closed if none of the proper supersets of X has the same frequency as X.
• DSTree can provide additional functionality to the algorithms that mine these patterns.
  • These algorithms can use the DSTree and arrange tree items according to some canonical order.
  • Example constraints: a succinct constraint (Csucc) on max(S.Price) with bound 30, and a convertible constraint (Cconv) on avg(S.Price) with bound 7.
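The two definitions can be checked on the frequent itemsets listed on the DSTree Example slide. This is a small sketch that filters an already-mined result; the function names `maximal_itemsets` and `closed_itemsets` are hypothetical and it does not mine from a DSTree directly.

```python
def maximal_itemsets(frequent):
    """Keep frequent itemsets that have no frequent proper superset."""
    return {x: f for x, f in frequent.items()
            if not any(x < y for y in frequent)}


def closed_itemsets(frequent):
    """Keep frequent itemsets with no proper superset of equal frequency.

    Only frequent supersets need checking: a superset with the same
    frequency as a frequent itemset is itself frequent.
    """
    return {x: f for x, f in frequent.items()
            if not any(x < y and f == g for y, g in frequent.items())}


# Frequent itemsets from the DSTree Example slide (minsup = 3).
frequent = {
    frozenset("a"): 4, frozenset("ac"): 3, frozenset("ad"): 3,
    frozenset("b"): 4, frozenset("bd"): 4, frozenset("c"): 3,
    frozenset("d"): 5,
}
maximal = maximal_itemsets(frequent)
closed = closed_itemsets(frequent)
```

On this input the maximal itemsets are {a,c}, {a,d} and {b,d}, and the closed itemsets are {a}, {a,c}, {a,d}, {b,d} and {d}. For instance, {b} is not closed because its superset {b,d} has the same frequency of 4.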

Discussion (cont.)
(b) Extensions: different windowing techniques
• It is important to note that the DSTree is not confined to the sliding window.
  • Tilted-time window: more weight can be put on recent data and less weight on older data.
(c) Efficiency and memory issues:
• The DSTree does not need to keep any extra tree structures, unlike the FP-streaming algorithm, where space is required for both the FP-tree and the FP-stream structure.

Experimental Results & First Experiment
• Experiment data
  • IBM Almaden Research Center: 1 M records with an average transaction length of 10 items and a domain of 1,000 items.
  • Each batch contains 0.1 M transactions, and the window size is set to w=5 batches.
• The experiments mainly evaluated the accuracy and efficiency of the DSTree.
• First experiment
  • Accuracy: compared the frequent itemsets returned by mining directly from these transactions with those returned by mining from the DSTree.
  • Accuracy: 100%

Second Experiment
• We compared the runtime of mining from the DSTree with that of using the FP-streaming mining algorithm.

Third Experiment
• We compared the DSTree with its relevant structures (e.g., CanTree, FP-tree and FP-stream structure).
  • When minsup=0.05%, the size of the DSTree is about 1.25X that of the FP-stream; when minsup=0.01%, the size of the DSTree is less than 0.90X that of the FP-stream.
• The results also confirmed that the size of the DSTree did not depend on minsup, whereas that of the FP-stream did.
• Of the two structures, the former kept transactions while the latter kept itemsets.
• Thus, mining from the DSTree gave exact results, but mining with FP-streaming gave approximate results.

Fourth Experiment
• The results show that the sizes of both the DSTree and the CanTree were unaffected by changes in minsup.
• The size of the DSTree is smaller than that of the CanTree because the latter keeps all transactions whereas the former only keeps transactions in the current window.
  • The size of the DSTree is about 0.5X that of the CanTree.
• The DSTree required a lower maintenance cost than the CanTree.
  • Whenever transactions were deleted from the CanTree, it needed to either decrement the frequency counts of nodes or remove the nodes corresponding to the deleted transactions.
  • In contrast, the DSTree did not require expensive deletion of transactions; it just shifted the frequency lists.

Fifth Experiment
• The paper also ran the usual experiments (e.g., the effect of minsup).
• As expected, when minsup increased, the runtime decreased.

Sixth Experiment
• This experiment tested scalability with the number of transactions.
• The results show that mining with the DSTree had linear scalability.

Conclusions
• A key contribution of this paper is the novel structure of the DSTree (Data Stream Tree).
• This tree captures the contents of transactions in a window, and arranges tree nodes according to some canonical order.
• By exploiting its nice properties, the DSTree can be easily maintained when the window slides.
• It can also be used for efficient stream mining of maximal itemsets, closed itemsets, as well as constrained itemsets.