A Framework for Clustering Evolving Data Streams

Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu

Presented by: Di Yang

Charudatta Wad

Outline

Background of ClusteringMotivation for Clustering over Streaming

Data.Overall SolutionMicro ClustersPyramid Time FrameMacro ClusterCluster Maintenance

Background of Clustering

Definition of Clustering For a given set of data points, partitioning them

into one or more groups of similar objects. “Similarity” is often defined with the use of some

distance measure.

Difference between “group by” queries and clustering.

Background of Clustering

Some of the most popular clustering algorithms: K- Means, BIRCH, CURE, Density Based

Clustering.

Clustering has many applications in data bases, information visualization, data mining.

What are Oultiers?

Motivation

Challenge in Streaming Environment: Clustering is an expensive process. Resource constraints. Infinite streams.

Can simply extending one pass algorithms for static databases to stream processing suffice?

Motivation

Requirements of clustering for stream processing: Statistical summary information storage. Efficient update process. Ability to cluster for a specific time horizon,

Overall Solution of the Paper

Divide the clustering process to two phases

Online Component:

periodically stores detailed summary statistics Offline Component uses only the summary statistics to do clustering

Micro-Clusters

What is a Micro-Cluster A Micro-Cluster is a set of individual data points that are

close to each other and will be treated as a single unit in further offline Macro-clustering.

View of Micro-Cluster View of Macro-Cluster

Micro-Clusters

What to Store in a Micro-Cluster

Key idea: Additivity Property

Pyramidal Time Frame

The snapshots follow a pyramidal pattern

… …

When should we make the snapshot?

The micro-clusters are stored at snapshots.

Snapshot

Pyramidal Time Frame

Snapshots are classified into different orders which can vary from 1 to log α(T). For example, T is 55, α=2, then we have orders 0 with interval 2^0=1, order 1 with interval 2^1=2, order 2 with interval 2^2=4, order 3 with interval 2^3=8, order 4 with interval 2^4=16, order 5 with interval 2^5=32.

For a data stream the maximum number of snapshots maintained at T time units since the beginning of the stream mining process is

(α + 1) log α(T). (α + 1 for each order)

Why Pyramidal Pattern?

For any user-specified time window of h, at least one stored snapshot can be found within 2 h units of the current time.

Please Note: Only Approximate Answers!!!

Micro Cluster Creation

It is assumed that a total of q micro-clusters are maintained at any moment by the algorithm.

This is done using an offline process (k-means) at the very beginning of the data stream computation process.

Online Micro Cluster Maintenance

How to deal with a new coming point?

1. Join one of the old cluster

2. Create a new cluster by its own

How to deal with the old clusters 1. Delete them (based on relevance stamp)

2. Merge them (merge the closest two)

A merged cluster will have all the IDs its components have

Macro-Cluster Creation

Based on the Additivity Property of cluster feature vector

Current Time T, the window size is h. That means the user want to find the clusters formed in (T-h, T).

Approach: 1. 1st step: Find the snapshot for T, get the micro-cluster set S(T).

2. 2nd step: Find the snapshot for T-h, get the micro-cluster set S(T-h).

3. Use S(T)-S(T-h)

Specifically, we have a merged cluster with Id list (C1, C2, C3) in S(T)

and a cluster with Id C1 in S(T-h). Then the we use

CFT(C1,C2,C3)-CFT(C1)=CFT(C2,C3), because C1 are formed before

T-h, thus should not contribute to the micro-cluster formed in (T-h,T)

Example

C_ID: [C1]

Time: T-h

C_ID: [C1, C2, C3]

Time: T

C_ID: [C2, C3]

Result: T-h

Run K-means on Micro-Clusters

How do you feel about this paper?

My feeling:

Quite Fuzzy Results:

Approximation is every where.

Nothing New:

Micro-Clusters, K-means, Cluster Feature Vectors, Pyramidal Time Frame are all old stuffs.

Counter Example

C_ID: [C2] C_ID: [C1, C2, C3]

Time: T

C_ID: [C1, C3]

Time: T-hResult

Di and Charu’s project deals with:

1. Deterministic Clusters

2. Clusters with Arbitrary Shapes

3. Real Expirations

4. Disk Version

5. Outlier Detection by Free

A Framework for Clustering Evolving Data Streams

Documents

On Clustering Graph Streams

1 More Clustering CURE Algorithm Clustering Streams

Lecture 6. Evolving data streamsgavalda/DataStreamSeminar/files/...Lecture 6. Evolving data streams Ricard Gavalda` MIRI Seminar on Data Streams, Spring 2015 1/39 Contents 1 Time in

Predictive Analytics on Evolving Data Streams · Predictive Analytics on Evolving Data Streams: ... • MOA: Massive online analysis, a framework for stream classification and

Using HDDT to avoid instances propagation in unbalanced and evolving data streams

Time Series Epenthesis: Clustering Time Series Streams Requires Ignoring Some Data

Improving Adaptive Bagging Methods for Evolving Data Streamsgavalda/papers/acml09slides.pdf · Realtime analytics: from Databases to Dataﬂows Data streams Data streams are ordered

Online Clustering of Parallel Data Streamsusers.monash.edu/~mgaber/DRAFT.pdfKey words: data mining, clustering, data streams, fuzzy sets 1 Introduction In recent years, so-called data

Aggregating and Querying Geo-Streams : From static sensors to Evolving Phenomena

A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting

Adaptive XML Tree Mining on Evolving Data Streams

Density-based Projected Clustering over High Dimensional ...disi.unitn.it/~themis/publications/sdm12.pdf · Density-based Projected Clustering over High Dimensional Data Streams Irene

A framework for mining evolving trends in Web data streams

Clustering over Multiple Evolving Streams by Events and Correlations

Clustering over Multiple Evolving Streams by Events and Correlations Mi-Yen Yeh, Bi-Ru Dai, Ming-Syan Chen Electrical Engineering, National Taiwan University

Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

LNAI 5212 - Clustering Distributed Sensor Data Streams · Clustering Distributed Sensor Data Streams 283 processing it as a multivariate stream. In this work we study the problem

Scalable Distributed Real-Time Clustering for Big Data Streams

Anytime Concurrent Clustering of Multiple Streams with an Indexing

Model-based Clustering of Short Text Streams · Short text streams like microblog posts are popular on the Internet ... and text summarization [1, 24]. The short text stream clustering