C-Cube: Elastic Continuous Clustering in the Cloud Speaker: LIN Qian http://www.comp.nus.edu.sg/~linqian


Page 1: C-Cube: Elastic Continuous Clustering in the Cloud

C-Cube: Elastic Continuous Clustering in the Cloud

Speaker: LIN Qian http://www.comp.nus.edu.sg/~linqian

Page 2: C-Cube: Elastic Continuous Clustering in the Cloud


Problem & Objective

• Existing solutions for continuous clustering are not elastic
– Central server
– Distributed setting with a fixed number of dedicated servers

• Objective
– An elastic algorithm for real-time, continuous clustering analysis

C-Cube is somewhat tricky on this point: it instead maintains a fixed number of VMs.

Page 3: C-Cube: Elastic Continuous Clustering in the Cloud


Clustering

• Divide a set of unlabeled objects into groups that are not pre-defined
– Objects in the same group are similar
– Objects in different groups are dissimilar

• C-Cube’s elastic solution
– Dynamically adjust the amount of computational resources based on the current workload

Actually, C-Cube is doing workload balancing.

Page 4: C-Cube: Elastic Continuous Clustering in the Cloud


C-Cube

• A general and elastic streaming framework to support a variety of clustering algorithms.

Provided by Storm

Only distance-based clustering algorithms are discussed.

Page 5: C-Cube: Elastic Continuous Clustering in the Cloud


Elastic Operator

Achieve elasticity by dynamically adjusting the number of processing units

Mapper / Spout

Reducer / Last Bolt

Worker nodes / Intermediate Bolts
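
The mapper / workers / reducer shape of an elastic operator can be illustrated with a minimal Python sketch. This is not Storm code and not the authors' implementation: `elastic_operator`, `map_fn`, `work_fn`, and `reduce_fn` are hypothetical names, and the workers run sequentially here for simplicity. The point is only that the result does not depend on how many workers are active.

```python
from collections import defaultdict

def elastic_operator(items, map_fn, work_fn, reduce_fn, num_workers):
    """Mapper -> num_workers workers -> reducer; num_workers may change per call."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Mapper (spout): route each item to one of the currently active workers.
    partitions = defaultdict(list)
    for item in items:
        partitions[map_fn(item) % num_workers].append(item)
    # Workers (intermediate bolts): each processes its own partition.
    partials = [work_fn(partitions[w]) for w in range(num_workers)]
    # Reducer (last bolt): combine the partial results into one output.
    return reduce_fn(partials)

data = list(range(10))
total_2 = elastic_operator(data, hash, sum, sum, num_workers=2)
total_5 = elastic_operator(data, hash, sum, sum, num_workers=5)
```

Changing `num_workers` between calls only changes how the items are partitioned; the reducer still sees one partial result per active worker.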

Page 6: C-Cube: Elastic Continuous Clustering in the Cloud

Verification-Reclustering

• Scheme
– Verify the clustering results computed at a previous timestamp, and
– only re-run the clustering algorithm when the verifier module determines that the previous results no longer fit the current data distribution

• Verification module
– Performed by an elastic operator

• Distance-based clustering criteria
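
The verify-then-recluster scheme can be sketched as follows. The slides do not give the verification test, so this sketch assumes one: the previous centers are kept while their cost on the current data stays within a factor `beta` of the cost measured right after the last re-clustering. `VerifyingClusterer`, `cluster_fn`, and the 1-D squared-distance `cost` are hypothetical and chosen only to keep the example short.

```python
def cost(points, centers):
    """Sum of squared distances from each 1-D point to its nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)

class VerifyingClusterer:
    def __init__(self, cluster_fn, beta):
        self.cluster_fn = cluster_fn  # hypothetical: points -> list of centers
        self.beta = beta              # assumed tolerance factor, beta >= 1
        self.centers = None
        self.baseline = None          # cost right after the last re-clustering
        self.reclusterings = 0

    def update(self, points):
        # Verify: keep the old centers while their cost is within beta * baseline.
        if (self.centers is not None
                and cost(points, self.centers) <= self.beta * self.baseline):
            return self.centers
        # Otherwise re-run the clustering algorithm on the current data.
        self.centers = self.cluster_fn(points)
        self.baseline = cost(points, self.centers)
        self.reclusterings += 1
        return self.centers

# Toy clustering function (assumption): use the extremes as two centers.
vc = VerifyingClusterer(lambda pts: [min(pts), max(pts)], beta=2.0)
first = vc.update([0.0, 1.0, 10.0, 11.0])     # initial clustering
second = vc.update([0.0, 1.5, 10.0, 11.0])    # small drift: verification passes
third = vc.update([50.0, 51.0, 60.0, 61.0])   # large shift: re-cluster
```

Verification only evaluates the cost of existing centers, which is typically much cheaper than running the clustering algorithm again.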

Page 7: C-Cube: Elastic Continuous Clustering in the Cloud


Distance-based Clustering

• Goal
– Partition the objects into clusters to minimize the sum of distances from all objects in a cluster to the cluster center

• Distance functions
– K-Means
– K-Median
– and their approximations
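
The goal above can be written as a single objective. The notation is an assumption, not taken from the slides: $X$ is the set of objects, $C$ the set of $k$ cluster centers, and $d$ a distance function (typically Euclidean). The standard k-means objective uses squared distances and the standard k-median objective uses plain distances.

```latex
\[
\operatorname{cost}(C) = \sum_{x \in X} \min_{c \in C} d(x, c)^{p},
\qquad |C| = k,
\qquad p =
\begin{cases}
2 & \text{(k-means)} \\
1 & \text{(k-median)}
\end{cases}
\]
```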

Page 8: C-Cube: Elastic Continuous Clustering in the Cloud


C-Cube Architecture

Page 9: C-Cube: Elastic Continuous Clustering in the Cloud

Implementation

• 9 PCs
– 2 GB memory, 1.8 GHz CPU (2 cores)
– Ubuntu 10.0.4

• Storm 0.6.2
– Zookeeper (1 PC)
– Nimbus node (1 PC)
– Kestrel message queue server (1 PC)
– Supervisor nodes (6 PCs)

Page 10: C-Cube: Elastic Continuous Clustering in the Cloud


Scaling Strategy

• Start a maximal number of virtual machines at the beginning

• Use only a fraction of the virtual machines and keep the others idle

• Activate the virtual machines on demand according to the workload

This is still a limitation.
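
The strategy on this slide amounts to a pre-started pool with an upper bound, which a minimal Python sketch makes concrete. `VmPool`, `capacity_per_vm`, and the numbers used are hypothetical; the slides do not say how the workload is measured or how the number of active machines is chosen.

```python
import math

class VmPool:
    def __init__(self, max_vms, capacity_per_vm):
        self.max_vms = max_vms                  # all VMs are started up front
        self.capacity_per_vm = capacity_per_vm  # assumed: items/sec one VM can handle
        self.active = 1                         # the remaining VMs stay idle

    def rescale(self, workload):
        """Set the number of active VMs for the given workload (items/sec)."""
        needed = max(1, math.ceil(workload / self.capacity_per_vm))
        # The pool cannot grow beyond the VMs started at the beginning.
        self.active = min(needed, self.max_vms)
        return self.active

    @property
    def idle(self):
        return self.max_vms - self.active

pool = VmPool(max_vms=6, capacity_per_vm=100)
moderate = pool.rescale(250)    # activates 3 VMs, 3 stay idle
idle_after_moderate = pool.idle
overload = pool.rescale(10_000) # capped at the 6 pre-started VMs
```

The `min(needed, self.max_vms)` cap is where the limitation shows: once every pre-started VM is active, additional workload cannot be absorbed.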

Page 11: C-Cube: Elastic Continuous Clustering in the Cloud


System Performance

• Number of clusters 𝑘
• Approximation factor 𝛽
• Number of verifiers used in C-Cube
• Workload change rate
• Number of machines in the cluster