View
231
Download
1
Category
Preview:
Citation preview
Data Warehousing Mining & BIData Warehousing Mining & BI
Data Streams MiningData Streams Mining
DWMBIDWMBI 11
Mining Data StreamsMining Data Streams
What is stream data? Why Stream Data Systems?What is stream data? Why Stream Data Systems?
Stream data management systems: Issues and solutionsStream data management systems: Issues and solutions
Stream data cube and multidimensional OLAP analysisStream data cube and multidimensional OLAP analysis
Stream frequent pattern analysisStream frequent pattern analysis
Stream classificationStream classification
Stream cluster analysisStream cluster analysis
Research issuesResearch issues
DWMBIDWMBI 22
Characteristics of Data StreamsCharacteristics of Data Streams Data StreamsData Streams
Data Data streamsstreams——continuous, ordered, changing, fast, huge amount continuous, ordered, changing, fast, huge amount
Traditional Traditional DBMSDBMS——data stored in finite, persistent data sets data stored in finite, persistent data sets
CharacteristicsCharacteristics Huge volumes of continuous data, possibly infiniteHuge volumes of continuous data, possibly infinite Fast changing and requires fast, real-time responseFast changing and requires fast, real-time response Data stream captures nicely our data processing needs of todayData stream captures nicely our data processing needs of today Random access is expensiveRandom access is expensive——single scan algorithm (single scan algorithm (can only have can only have
one lookone look)) Store only the summary of the data seen thus farStore only the summary of the data seen thus far Most stream data are at pretty low-level or multi-dimensional in Most stream data are at pretty low-level or multi-dimensional in
nature, needs multi-level and multi-dimensional processingnature, needs multi-level and multi-dimensional processingDWMBIDWMBI 33
Stream Data ApplicationsStream Data Applications
Telecommunication calling recordsTelecommunication calling records Business: credit card transaction flowsBusiness: credit card transaction flows Network monitoring and traffic engineeringNetwork monitoring and traffic engineering Financial market: stock exchangeFinancial market: stock exchange Engineering & industrial processes: power supply & Engineering & industrial processes: power supply &
manufacturingmanufacturing Sensor, monitoring & surveillance: video streams, RFIDsSensor, monitoring & surveillance: video streams, RFIDs Security monitoringSecurity monitoring Web logs and Web page click streamsWeb logs and Web page click streams Massive Business data setsMassive Business data sets
DWMBIDWMBI 44
Challenges of Stream Data ProcessingChallenges of Stream Data Processing
Multiple, continuous, rapid, time-varying, orderedMultiple, continuous, rapid, time-varying, ordered streams streams
Main memoryMain memory computations computations
Queries are often Queries are often continuouscontinuous Evaluated continuously as stream data arrivesEvaluated continuously as stream data arrives
Answer updated over timeAnswer updated over time
Queries are often Queries are often complexcomplex Beyond element-at-a-time processingBeyond element-at-a-time processing
Beyond stream-at-a-time processingBeyond stream-at-a-time processing
Beyond relational queries (scientific, data mining, OLAP)Beyond relational queries (scientific, data mining, OLAP)
Multi-level/multi-dimensional Multi-level/multi-dimensional processing and data miningprocessing and data mining Most stream data are at low-level or multi-dimensional in natureMost stream data are at low-level or multi-dimensional in nature
DWMBIDWMBI 55
Architecture: Stream Query ProcessingArchitecture: Stream Query Processing
DWMBIDWMBI 66
Scratch SpaceScratch Space(Main memory and/or Disk)(Main memory and/or Disk)
User/ApplicationUser/ApplicationUser/ApplicationUser/Application
Stream QueryStream QueryProcessorProcessor
ResultsResultsMultiple streamsMultiple streams
SDMS (Stream Data Management System)
Stream Data Processing Methods Stream Data Processing Methods Random sampling (but without knowing the total length in advance)Random sampling (but without knowing the total length in advance)
Reservoir samplingReservoir sampling: maintain a set of : maintain a set of ss candidates in the reservoir, which candidates in the reservoir, which form a true random sample of the element seen so far in the stream. As the form a true random sample of the element seen so far in the stream. As the data stream flow, every new element has a certain probability (data stream flow, every new element has a certain probability (ss/N) of /N) of replacing an old element in the reservoir.replacing an old element in the reservoir.
Sliding windowsSliding windows Make decisions based only on Make decisions based only on recentrecent data data of sliding window size of sliding window size ww An element arriving at time An element arriving at time tt expires at time expires at time t t + + ww
HistogramsHistograms Approximate the frequency distribution of element values in a streamApproximate the frequency distribution of element values in a stream Partition data into a set of contiguous bucketsPartition data into a set of contiguous buckets Equal-width (equal value range for buckets) vs. V-optimal (minimizing Equal-width (equal value range for buckets) vs. V-optimal (minimizing
frequency variance within each bucket)frequency variance within each bucket) Multi-resolution modelsMulti-resolution models
Popular models: balanced binary trees, micro-clusters, and waveletsPopular models: balanced binary trees, micro-clusters, and waveletsDWMBIDWMBI 77
Processing Stream QueriesProcessing Stream Queries Query typesQuery types
One-time query vs. One-time query vs. continuous querycontinuous query (being evaluated (being evaluated continuously as stream continues to arrive)continuously as stream continues to arrive)
Predefined queryPredefined query vs. ad-hoc query (issued on-line) vs. ad-hoc query (issued on-line) Unbounded memory requirementsUnbounded memory requirements
For real-time response, For real-time response, main memory algorithmmain memory algorithm should be should be usedused
Memory requirement is unbounded if one will join future tuples Memory requirement is unbounded if one will join future tuples Approximate query answeringApproximate query answering
With bounded memory, it is not always possible to produce With bounded memory, it is not always possible to produce exact answersexact answers
High-quality approximate answersHigh-quality approximate answers are desired are desired Data reduction and synopsis construction methodsData reduction and synopsis construction methods
Sketches, random sampling, histograms, wavelets, etc.Sketches, random sampling, histograms, wavelets, etc.
DWMBIDWMBI 88
Approximate Query Answering in StreamsApproximate Query Answering in Streams
Sliding windowsSliding windows Only over sliding windows of Only over sliding windows of recentrecent stream data stream data Approximation but often more desirable in applicationsApproximation but often more desirable in applications
Batched processing, sampling and synopsesBatched processing, sampling and synopses BatchedBatched if update is fast but computing is slow if update is fast but computing is slow
Compute periodically, not very timelyCompute periodically, not very timely SamplingSampling if update is slow but computing is fast if update is slow but computing is fast
Compute using sample data, but not good for joins, etc.Compute using sample data, but not good for joins, etc. SynopsisSynopsis data structures data structures
Maintain a small Maintain a small synopsissynopsis or or sketchsketch of data of data Good for querying historical dataGood for querying historical data
Blocking operators, e.g., sorting, avg, min, etc.Blocking operators, e.g., sorting, avg, min, etc. BlockingBlocking if unable to produce the first output until seeing the entire if unable to produce the first output until seeing the entire
inputinput DWMBIDWMBI 99
Stream Data Mining vs. Stream QueryingStream Data Mining vs. Stream Querying Stream miningStream mining——A more challenging task in many casesA more challenging task in many cases
It shares most of the difficulties with stream queryingIt shares most of the difficulties with stream querying But often requires less “precision”, e.g., no join, grouping, But often requires less “precision”, e.g., no join, grouping,
sortingsorting
Patterns are hidden and more general than queryingPatterns are hidden and more general than querying It may require exploratory analysisIt may require exploratory analysis
Not necessarily continuous queriesNot necessarily continuous queries Stream data mining tasksStream data mining tasks
Multi-dimensional on-line analysis of streamsMulti-dimensional on-line analysis of streams Mining outliers and unusual patterns in stream dataMining outliers and unusual patterns in stream data Clustering data streams Clustering data streams Classification of stream dataClassification of stream data
DWMBIDWMBI 1010
Challenges for Mining Dynamics in Data StreamsChallenges for Mining Dynamics in Data Streams
Most stream data are at pretty low-level or multi-Most stream data are at pretty low-level or multi-
dimensional in nature: needs ML/MD processingdimensional in nature: needs ML/MD processing
Analysis requirementsAnalysis requirements
Multi-dimensional trends and unusual patternsMulti-dimensional trends and unusual patterns
Capturing important changes at multi-dimensions/levels Capturing important changes at multi-dimensions/levels
Fast, real-time detection and responseFast, real-time detection and response
Comparing with data cube: Similarity and differencesComparing with data cube: Similarity and differences
Stream (data) cube or stream OLAP: Is this feasible?Stream (data) cube or stream OLAP: Is this feasible?
Can we implement it efficiently?Can we implement it efficiently?
DWMBIDWMBI 1111
Multi-Dimensional Stream Analysis: ExamplesMulti-Dimensional Stream Analysis: Examples
Analysis of Analysis of Web click streamsWeb click streams Raw data at low levels: seconds, web page addresses, user IP Raw data at low levels: seconds, web page addresses, user IP
addresses, …addresses, … Analysts want: changes, trends, unusual patterns, at reasonable Analysts want: changes, trends, unusual patterns, at reasonable
levels of detailslevels of details E.g., E.g., AAverageverage clicking traffic in clicking traffic in India India on on sportssports in the in the last 15 last 15
minutesminutes is 40% higher than that in the is 40% higher than that in the last 24 hourslast 24 hours.”.”
Analysis of Analysis of power consumption streamspower consumption streams Raw data: power consumption flow for every household, every Raw data: power consumption flow for every household, every
minute minute Patterns one may find: Patterns one may find: averageaverage hourly power consumption hourly power consumption
surges up 30%surges up 30% for for housing housing in in DelhiDelhi in the in the last 2 hours today last 2 hours today than than that of the same that of the same day a week ago day a week ago
DWMBIDWMBI 1212
A Stream Cube ArchitectureA Stream Cube Architecture
AA tilted time frametilted time frame Different time granularitiesDifferent time granularities
second, minute, quarter, hour, day, week, …second, minute, quarter, hour, day, week, …
Critical layersCritical layers Minimum interest layerMinimum interest layer (m-layer) (m-layer) Observation layerObservation layer (o-layer) (o-layer) User: watches at o-layer and occasionally needs to drill-down down User: watches at o-layer and occasionally needs to drill-down down
to m-layerto m-layer
Partial materialization of stream cubesPartial materialization of stream cubes Full materialization: too space and time consumingFull materialization: too space and time consuming No materialization: slow response at query timeNo materialization: slow response at query time Partial materialization: what do we mean “partial”?Partial materialization: what do we mean “partial”?
DWMBIDWMBI 1313
A Titled Time ModelA Titled Time Model
Tim et8 t 4 t 2 t t1 6 t3 2 t6 4 t
DWMBIDWMBI
1414
Natural Natural tilted time frame:tilted time frame: Example: Minimal: quarter, then 4 quarters Example: Minimal: quarter, then 4 quarters 1 hour, 24 hours 1 hour, 24 hours
day, … day, …
LogarithmicLogarithmic tilted time frame: tilted time frame: Example: Minimal: 1 minute, then 1, 2, 4, 8, 16, 32, …Example: Minimal: 1 minute, then 1, 2, 4, 8, 16, 32, …
4 q tr s2 4 h o u r s3 1 d ay s1 2 m o n th stim e
Tim et8 t 4 t 2 t t1 6 t3 2 t6 4 t
Two Critical Layers in the Stream CubeTwo Critical Layers in the Stream Cube
DWMBIDWMBI 1515
(*, theme, quarter)
(user-group, URL-group, minute)
m-layer (minimal interest)(individual-user, URL, second)
(primitive) stream data layer
o-layer (observation)
Frequent Patterns for Stream DataFrequent Patterns for Stream Data Frequent pattern mining is valuable in stream applicationsFrequent pattern mining is valuable in stream applications
e.g., network intrusion mininge.g., network intrusion mining Mining Mining preciseprecise freq. patterns in stream data: unrealistic freq. patterns in stream data: unrealistic
store them in a compressed form, such as FPtreestore them in a compressed form, such as FPtree How to mine frequent patterns with good approximation?How to mine frequent patterns with good approximation?
Approximate frequent patterns Approximate frequent patterns Keep only current frequent patterns? No changes can Keep only current frequent patterns? No changes can
be detectedbe detected Mining evolution freq. patternsMining evolution freq. patterns
Use tilted time window frame Use tilted time window frame Mining evolution and dramatic changes of frequent Mining evolution and dramatic changes of frequent
patternspatterns Space-saving computation of frequent and top-k elementsSpace-saving computation of frequent and top-k elements
DWMBIDWMBI 1616
Mining Approximate Frequent PatternsMining Approximate Frequent Patterns
Mining Mining preciseprecise freq. patterns in stream data: freq. patterns in stream data: unrealisticunrealistic
Even store them in a compressed form, such as FPtreeEven store them in a compressed form, such as FPtree
Approximate answersApproximate answers are often sufficient (e.g., trend/pattern analysis) are often sufficient (e.g., trend/pattern analysis)
Example: a router is interested in all flows:Example: a router is interested in all flows:
whose whose frequencyfrequency is at least is at least 1% (1% (σσ)) of the entire traffic stream of the entire traffic stream
seen so far seen so far
and feels that and feels that 1/10 of 1/10 of σσ ( (εε = 0.1%) error= 0.1%) error is comfortable is comfortable
How to mine frequent patterns with How to mine frequent patterns with good approximationgood approximation??
Lossy Counting Algorithm) Lossy Counting Algorithm)
Major ideas: not tracing items until it becomes frequentMajor ideas: not tracing items until it becomes frequent
Adv: guaranteed error boundAdv: guaranteed error bound
Disadv: keep a large set of tracesDisadv: keep a large set of tracesDWMBIDWMBI 1717
Lossy Counting for Lossy Counting for Frequent ItemsFrequent Items
DWMBIDWMBI 1818
Bucket 1 Bucket 2 Bucket 3
Divide Stream into ‘Buckets’ (bucket size is 1/ ε = 1000)
First Bucket of StreamFirst Bucket of Stream
DWMBIDWMBI 1919
Empty(summary) +
At bucket boundary, decrease all counters by 1
Next Bucket of StreamNext Bucket of Stream
DWMBIDWMBI 2020
+
At bucket boundary, decrease all counters by 1
Approximation GuaranteeApproximation Guarantee Given: (1) support threshold: Given: (1) support threshold: σσ, (2) error threshold: , (2) error threshold: εε, and , and
(3)(3) stream length N stream length N Output: Output: items with frequency counts exceeding (items with frequency counts exceeding (σσ –– εε) N) N How much do we undercount?How much do we undercount?
If stream length seen so far = N and bucket-size = If stream length seen so far = N and bucket-size = 1/1/εε then then frequency count error frequency count error #buckets = #buckets = εεNN
Approximation guaranteeApproximation guarantee No false negativesNo false negatives
False positives have frequency at least (False positives have frequency at least (σσ––εε)N)N
Frequency count underestimated by at most Frequency count underestimated by at most εεNN
DWMBIDWMBI 2121
Lossy Counting For Frequent ItemsetsLossy Counting For Frequent Itemsets
DWMBIDWMBI 2222
Divide Stream into ‘Buckets’ as for frequent itemsBut fill as many buckets as possible in main memory
one time
Bucket 1 Bucket 2 Bucket 3
If we put 3 buckets of data into main memory one time,Then decrease each frequency count by 3
DWMBIDWMBI 2323
Update of Summary Data Structure
2
2
1
2
11
1
summary data 3 bucket datain memory
4
4
10
22
0
+
3
3
9
summary data
Itemset ( ) is deleted.That’s why we choose a large number of buckets – delete more
Pruning Itemsets – Apriori RulePruning Itemsets – Apriori Rule
DWMBIDWMBI 2424
If we find itemset ( ) is not frequent itemset,Then we needn’t consider its superset
3 bucket datain memory
1
+
summary data
2
2
1
1
Summary of Lossy CountingSummary of Lossy Counting
StrengthStrength A A simplesimple idea idea Can be extended to Can be extended to frequent itemsetsfrequent itemsets
Weakness:Weakness: Space BoundSpace Bound is not good is not good For frequent itemsets, they do For frequent itemsets, they do scan each record scan each record
many timesmany times The output is based on The output is based on all previous dataall previous data. But . But
sometimes, we are only interested in sometimes, we are only interested in recent datarecent data A space-saving method for stream frequent item A space-saving method for stream frequent item
miningmining
DWMBIDWMBI 2525
Mining Evolution of Frequent Patterns for Mining Evolution of Frequent Patterns for Stream DataStream Data
Approximate frequent patternsApproximate frequent patterns Keep only current frequent patterns—No changes Keep only current frequent patterns—No changes
can be detectedcan be detected Mining evolution and dramatic changes of frequent Mining evolution and dramatic changes of frequent
patternspatterns Use tilted time window frame Use tilted time window frame Use compressed form to store significant Use compressed form to store significant
(approximate) frequent patterns and their time-(approximate) frequent patterns and their time-dependent tracesdependent traces
Note: To mine precise counts, one has to trace/keep a Note: To mine precise counts, one has to trace/keep a fixed (and small) set of itemsfixed (and small) set of items
DWMBIDWMBI 2626
Classification for Dynamic Data StreamsClassification for Dynamic Data Streams Decision tree induction for stream data classificationDecision tree induction for stream data classification
VFDT (Very Fast Decision Tree)/ CVFDTVFDT (Very Fast Decision Tree)/ CVFDT Is decision-tree good for modeling fast changing data, e.g., Is decision-tree good for modeling fast changing data, e.g.,
stock market analysis?stock market analysis? Other stream classification methodsOther stream classification methods
Instead of decision-trees, consider other models Instead of decision-trees, consider other models Naïve BayesianNaïve BayesianEnsemble Ensemble K-nearest neighborsK-nearest neighbors
Tilted time framework, incremental updating, dynamic Tilted time framework, incremental updating, dynamic maintenance, and model constructionmaintenance, and model construction
Comparing of models to find changesComparing of models to find changesDWMBIDWMBI 2727
Network traffic
Classification model
Attack traffic
Firewall
Block and quarantine
Benign traffic
Server
Model
update
Expert analysis and labeling
Uses past Uses past labeled data labeled data to build classification modelto build classification model
Predicts the labels of future instances using the Predicts the labels of future instances using the modelmodel
Helps decision making Helps decision making
Applications
SecuritySecurity Monitoring Monitoring
Network monitoring and traffic engineering.Network monitoring and traffic engineering.
Business : credit card transaction flows.Business : credit card transaction flows.
Telecommunication calling records.Telecommunication calling records.
Web logs and web page click streams.Web logs and web page click streams.
Financial market: stock exchange.Financial market: stock exchange.
Hoeffding Tree AlgorithmHoeffding Tree Algorithm Hoeffding Tree InputHoeffding Tree Input
S: sequence of examplesS: sequence of examplesX: attributesX: attributesG( ): evaluation functionG( ): evaluation functiond: desired accuracyd: desired accuracy
Hoeffding Tree AlgorithmHoeffding Tree Algorithmfor each example in Sfor each example in S
retrieve G(Xretrieve G(Xaa) and G(X) and G(Xbb) //two highest G(X) //two highest G(Xii))
if ( G(Xif ( G(Xaa) – G(X) – G(Xbb) > ) > εε ) )
split on Xsplit on Xaa
recurse to next noderecurse to next nodebreakbreak
DWMBIDWMBI 3030
Decision-Tree Induction with Data Decision-Tree Induction with Data StreamsStreams
DWMBIDWMBI 3131
yes no
Packets > 10
Protocol = http
Protocol = ftp
yes
yes no
Packets > 10
Bytes > 60K
Protocol = http
Data Stream
Data Stream
Hoeffding Tree: Strengths and WeaknessesHoeffding Tree: Strengths and Weaknesses
Strengths Strengths Scales better than traditional methodsScales better than traditional methods
Sublinear with samplingSublinear with samplingVery small memory utilizationVery small memory utilization
IncrementalIncremental
Make class predictions in parallelMake class predictions in parallelNew examples are added as they comeNew examples are added as they come
WeaknessWeakness Could spend a lot of time with tiesCould spend a lot of time with ties Memory used with tree expansionMemory used with tree expansion Number of candidate attributesNumber of candidate attributes
DWMBIDWMBI 3232
VFDT (Very Fast Decision Tree)VFDT (Very Fast Decision Tree) Modifications to Hoeffding TreeModifications to Hoeffding Tree
Near-ties broken more aggressivelyNear-ties broken more aggressively G computed every nG computed every nminmin
Deactivates certain leaves to save memoryDeactivates certain leaves to save memory Poor attributes droppedPoor attributes dropped Initialize with traditional learnerInitialize with traditional learner
Compare to Hoeffding Tree: Better time and memoryCompare to Hoeffding Tree: Better time and memory Compare to traditional decision treeCompare to traditional decision tree
Similar accuracySimilar accuracy Better runtime with 1.61 million examplesBetter runtime with 1.61 million examples
21 minutes for VFDT ; 24 hours for C4.521 minutes for VFDT ; 24 hours for C4.5 Still does not handle concept driftStill does not handle concept drift
DWMBIDWMBI 3333
CVFDT (Concept-adapting VFDT)CVFDT (Concept-adapting VFDT) Concept Drift Concept Drift
Time-changing data streamsTime-changing data streams Incorporate new and eliminate oldIncorporate new and eliminate old
CVFDTCVFDT Increments count with new exampleIncrements count with new example Decrement old exampleDecrement old example
Sliding windowSliding windowNodes assigned monotonically increasing IDsNodes assigned monotonically increasing IDs
Grows alternate subtreesGrows alternate subtrees When alternate more accurate => replace oldWhen alternate more accurate => replace old O(w) better runtime than VFDT-windowO(w) better runtime than VFDT-window
DWMBIDWMBI 3434
Ensemble of Classifiers AlgorithmEnsemble of Classifiers Algorithm Method (derived from the ensemble idea in Method (derived from the ensemble idea in
classification)classification) train K classifiers from K chunkstrain K classifiers from K chunks
for each subsequent chunkfor each subsequent chunk
train a new classifiertrain a new classifier
test other classifiers against the chunktest other classifiers against the chunk
assign weight to each classifierassign weight to each classifier
select top K classifiersselect top K classifiers
DWMBIDWMBI 3535
DWMBIDWMBI 3636
Clustering Data Streams Base on the k-median method
Data stream points from metric space Find k clusters in the stream s.t. the sum of distances
from data points to their closest center is minimized Constant factor approximation algorithm
In small space, a simple two step algorithm:
1. For each set of M records, Si, find O(k) centers in S1, …, Sl
Local clustering: Assign each point in Si to its closest center
2. Let S’ be centers for S1, …, Sl with each center weighted by number of points assigned to it
Cluster S’ to find k centers
DWMBIDWMBI 3737
Hierarchical Clustering Tree
data points
level-i medians
level-(i+1) medians
Hierarchical Tree and DrawbacksHierarchical Tree and Drawbacks Method:Method:
maintain at most m level-i mediansmaintain at most m level-i medians On seeing m of them, generate O(k) level-(i+1) On seeing m of them, generate O(k) level-(i+1)
medians of weight equal to the sum of the weights of medians of weight equal to the sum of the weights of the intermediate medians assigned to themthe intermediate medians assigned to them
Drawbacks:Drawbacks:
Low quality for evolving data streams (register only k Low quality for evolving data streams (register only k centers)centers)
Limited functionality in discovering and exploring Limited functionality in discovering and exploring clusters over different portions of the stream over timeclusters over different portions of the stream over time
DWMBIDWMBI 3838
Clustering for Mining Stream DynamicsClustering for Mining Stream Dynamics
Network intrusion detection: one example Network intrusion detection: one example
Detect bursts of activities or abrupt changes in real timeDetect bursts of activities or abrupt changes in real time——by on-by on-
line clustering line clustering
Our methodology Our methodology
Tilted time frame work: o.w. dynamic changes cannot be foundTilted time frame work: o.w. dynamic changes cannot be found
Micro-clustering: better quality than k-means/k-median Micro-clustering: better quality than k-means/k-median
incremental, online processing and maintenance)incremental, online processing and maintenance)
Two stages: micro-clustering and macro-clusteringTwo stages: micro-clustering and macro-clustering
With limited “overhead” to achieve high efficiency, scalability, With limited “overhead” to achieve high efficiency, scalability,
quality of results and power of evolution/change detectionquality of results and power of evolution/change detection
DWMBIDWMBI 3939
DWMBIDWMBI 4040
CluStream: A Framework for Clustering Evolving Data Streams
Design goal High quality for clustering evolving data streams
with greater functionality While keep the stream mining requirement in
mind One-pass over the original stream data Limited space usage and high efficiency
CluStream: A framework for clustering evolving data streams Divide the clustering process into online and
offline components Online component: periodically stores
summary statistics about the stream data Offline component: answers various user
questions based on the stored summary statistics
DWMBIDWMBI 4141
CluStream: Clustering On-line Streams
Online micro-cluster maintenance Initial creation of q micro-clusters
q is usually significantly larger than the number of natural clusters
Online incremental update of micro-clusters If new point is within max-boundary, insert into the
micro-cluster O.w., create a new cluster May delete obsolete micro-cluster or merge two
closest ones
Query-based macro-clustering Based on a user-specified time-horizon h and the
number of macro-clusters K, compute macroclusters using the k-means algorithm
Stream Data Mining: Research IssuesStream Data Mining: Research Issues
Mining sequential patterns in data streamsMining sequential patterns in data streams
Mining partial periodicity in data streamsMining partial periodicity in data streams
Mining notable gradients in data streamsMining notable gradients in data streams
Mining outliers and unusual patterns in data streamsMining outliers and unusual patterns in data streams
Stream clusteringStream clustering
Multi-dimensional clustering analysis?Multi-dimensional clustering analysis?
Cluster not confined to 2-D metric space, how to incorporate Cluster not confined to 2-D metric space, how to incorporate
other features, especially non-numerical propertiesother features, especially non-numerical properties
Stream clustering with other clustering approaches?Stream clustering with other clustering approaches?
Constraint-based cluster analysis with data streams?Constraint-based cluster analysis with data streams?
DWMBIDWMBI 4242
Recommended