C-Cube: Elastic Continuous Clustering in the Cloud
Speaker: LIN Qian
http://www.comp.nus.edu.sg/~linqian
Problem & Objective
• Existing solutions for continuous clustering are not elastic
  – Central server
  – Distributed setting with a fixed number of dedicated servers
• Objective
  – An elastic algorithm for real-time, continuous clustering analysis
C-Cube is somewhat tricky on this point. It instead maintains a fixed number of VMs.
Clustering
• Divide a set of unlabeled objects into groups that are not pre-defined
  – Objects in the same group are similar
  – Objects in different groups are dissimilar
• C-Cube’s elastic solution
  – Dynamically adjust the amount of computational resources based on the current workload
Actually, C-Cube is doing workload balancing.
C-Cube
• A general and elastic streaming framework to support a variety of clustering algorithms.
Provided by Storm
Only the distance-based clustering algorithm is discussed.
Elastic Operator
Achieve elasticity by dynamically adjusting the number of processing units
Mapper / Spout
Reducer / Last Bolt
Worker nodes / Intermediate Bolts
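A minimal sketch of the elastic-operator idea, assuming a mapper that splits the input round-robin, a variable number of workers, and a reducer that merges partial results. The function names and the toy workload (a sum of squares) are hypothetical, and the stages run sequentially here, whereas in C-Cube they would be Storm spouts and bolts. The point is only that the result does not depend on how many workers are used.

```python
def mapper(items, num_workers):
    """Split the input round-robin across the current number of workers."""
    partitions = [[] for _ in range(num_workers)]
    for i, item in enumerate(items):
        partitions[i % num_workers].append(item)
    return partitions


def worker(partition):
    """Toy per-worker computation (hypothetical): a partial sum of squares."""
    return sum(x * x for x in partition)


def reducer(partials):
    """Merge the partial results from all workers."""
    return sum(partials)


def run_operator(items, num_workers):
    # Sequential stand-in for the mapper -> workers -> reducer pipeline.
    return reducer(worker(p) for p in mapper(items, num_workers))


items = list(range(10))
# The same input processed with 1, 3, and 5 workers.
results = [run_operator(items, n) for n in (1, 3, 5)]
```

Because the reducer only needs partial results that can be merged, the number of workers can change between runs without affecting the output.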
Verification-Reclustering
• Scheme
  – Verify the clustering results computed at a previous timestamp, and
  – only re-run the clustering algorithm when the verifier module determines that the previous results no longer fit the current data distribution
• Verification module
  – Performed by an elastic operator
• Distance-based clustering criteria
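The verify-then-recluster loop can be sketched as follows. This is an illustration under stated assumptions, not C-Cube's actual verifier: the acceptance rule (current cost within `tolerance` times the cost recorded at the last reclustering), the `tolerance` value, and the `toy_recluster` stand-in are all hypothetical.

```python
import math


def cost(points, centers):
    """Sum of distances from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)


def verify(points, centers, reference_cost, tolerance):
    """Hypothetical rule: old centers still fit if the cost has not grown too much."""
    return cost(points, centers) <= tolerance * reference_cost


def toy_recluster(points):
    """Stand-in for a real clustering algorithm: split on x, take two centroids."""
    ordered = sorted(points, key=lambda p: p[0])
    mid = len(ordered) // 2
    halves = (ordered[:mid], ordered[mid:])
    return [tuple(sum(c) / len(h) for c in zip(*h)) for h in halves]


def process_timestamp(points, state, tolerance=1.5):
    """Return (state, reclustered). Recluster only when verification fails."""
    if state is not None and verify(
        points, state["centers"], state["reference_cost"], tolerance
    ):
        return state, False
    centers = toy_recluster(points)
    return {"centers": centers, "reference_cost": cost(points, centers)}, True


stream = [
    [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)],  # first batch
    [(0.0, 0.0), (0.0, 2.5), (10.0, 0.0), (10.0, 2.0)],  # small drift
    [(0.0, 0.0), (0.0, 2.0), (30.0, 0.0), (30.0, 2.0)],  # distribution shift
]
state, flags = None, []
for batch in stream:
    state, reclustered = process_timestamp(batch, state)
    flags.append(reclustered)
```

In this toy stream the small drift passes verification, so the expensive clustering step runs only on the first batch and after the distribution shift.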
Distance-based Clustering
• Goal
  – Partition the objects into clusters to minimize the sum of distances from all objects in a cluster to the cluster center
• Distance functions
  – K-Means
  – K-Median
  and their approximations
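As a small illustration of the objective above, the sketch below computes the clustering cost for a given set of centers. It assumes Euclidean distance; with `squared=False` it gives the k-median cost (sum of distances) and with `squared=True` the k-means cost (sum of squared distances). The function and variable names are illustrative only.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def clustering_cost(points, centers, squared=False):
    """Sum over all points of the (optionally squared) distance to the nearest center."""
    total = 0.0
    for p in points:
        d = min(euclidean(p, c) for c in centers)
        total += d * d if squared else d
    return total


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]

k_median_cost = clustering_cost(points, centers)               # sum of distances
k_means_cost = clustering_cost(points, centers, squared=True)  # sum of squared distances
```

Every point in this example lies exactly one unit from its nearest center, so both costs come out the same; they differ as soon as distances are not all equal to one.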
C-Cube Architecture
Implementation
• 9 PCs
  – 2 GB memory, 1.8 GHz CPU (2 cores)
  – Ubuntu 10.0.4
• Storm 0.6.2
  – Zookeeper (1 PC)
  – Nimbus node (1 PC)
  – Kestrel message queue server (1 PC)
  – Supervisor nodes (6 PCs)
Scaling Strategy
• Start a maximal number of virtual machines at the beginning
• Only use a fraction of the virtual machines and keep the others idle
• Activate the virtual machines on demand according to the workload
This is still a limitation.
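The scaling strategy can be sketched as a pre-started pool in which only some machines are active. This is a minimal illustration, not C-Cube's implementation: the `VMPool` class, the `capacity_per_vm` parameter, and the rule for choosing the number of active machines are all assumptions. It also shows the limitation: demand beyond the pool size is simply capped.

```python
import math


class VMPool:
    """Hypothetical pool: all VMs are started up front, only some are active."""

    def __init__(self, max_vms, capacity_per_vm):
        self.max_vms = max_vms                  # fixed when the pool is created
        self.capacity_per_vm = capacity_per_vm  # assumed workload units per VM
        self.active = 1

    def rescale(self, workload):
        """Activate just enough VMs for the workload, capped at the pool size."""
        needed = max(1, math.ceil(workload / self.capacity_per_vm))
        self.active = min(self.max_vms, needed)
        return self.active

    @property
    def idle(self):
        """Number of started VMs that are currently not in use."""
        return self.max_vms - self.active


pool = VMPool(max_vms=6, capacity_per_vm=100)
active_history = [pool.rescale(w) for w in (250, 900, 50)]
```

With a capacity of 100 units per machine, a workload of 900 would need 9 machines, but the pool can activate at most 6.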
System Performance
• Number of clusters 𝑘
• Approximation factor 𝛽
• Number of verifiers used in C-Cube
• Workload change rate
• Number of machines in the cluster