Scalable Monitoring & Autonomous Management of Cloud Environments
Idit Keidar, Technion
Idit Keidar, April 2009
Executive Summary
Goal: Scalable Monitoring and Autonomous Management of Cloud Environments
Approach: Distributed Local Computations
– Combine theory and experimental work
Task 1: Robust aggregation
Task 2: Overcome (& understand impact of) loss, failures in gossip-based membership
Task 3 (long term): Local and adaptive self-organization
Autonomous Self* Clouds
Complex autonomous decision making
Collaboratively computing functions
[Figure callouts: "The Haifa data center is too hot!", "Let's reduce power", "They're going to turn on the sprinklers - need to back up"]
Centralized Solutions Don't Cut It
Load
Communication costs
Delays
Fault-tolerance
Classical Dist. Solutions Don't Cut It
Global agreement before any output
Repeated invocations to adapt to changes
High latency, high load
By the time synchronization is done, input may have changed … the result is irrelevant
Frequent changes -> inconsistent snapshots
Synchronization typically relies on leader
– difficult and costly to maintain
Locality to the Rescue!
Nodes make local decisions based on communication with some proximate nodes
– rather than the entire network
Infinitely scalable
Fast, low overhead, low power
What is Locality?
Worst case view
– Interesting problems have (a few) inherently global instances
Average case view
– Requires an a priori distribution of the inputs
Our approach: be "as local as possible"
– E.g., Veracity Radius of distributed aggregation [BKLSW'06]: how far does a node need to look in order to know the globally correct result?
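
As a toy illustration of the question above (not the formal Veracity Radius definition from [BKLSW'06]), the sketch below assumes nodes on a line holding binary votes. For each node it asks how large a neighborhood the node must inspect before every larger neighborhood agrees with the global majority. The line topology and the names `majority` and `look_radius` are assumptions made for illustration.

```python
def majority(bits):
    """Return True when strictly more than half of the bits are 1."""
    return 2 * sum(bits) > len(bits)


def look_radius(votes, node):
    """Smallest r such that every window of radius >= r around `node`
    yields the same majority as the whole input (toy definition)."""
    n = len(votes)
    target = majority(votes)
    radius = 0
    for r in range(n):
        # Window of radius r around the node, clipped at the line's ends.
        window = votes[max(0, node - r): node + r + 1]
        if majority(window) != target:
            radius = r + 1  # this radius is still misleading; look further
    return radius


# Hypothetical input: seven 1-votes followed by a pocket of three 0-votes.
votes = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
radii = [look_radius(votes, i) for i in range(len(votes))]
```

In this input, a node at the left end is surrounded by representative votes and needs no lookahead, while a node inside the pocket of 0-votes must look well past the pocket. That instance dependence is what "as local as possible" is meant to capture.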
Task 1: Distributed Clustering for Robust Aggregation
Years 1-2
With Ittay Eyal and Raphael Rom
Clouds Need Monitoring
Load balancing storage/computation
– Need to know load distributions
Ensuring a certain replication level
– Need to know number of failures per object
Discovering problems – detecting anomalies
– Isolated outliers (malfunctioning node)
– Anomalous clusters
    All nodes running some OS version are overloaded due to attack
    Overheating area
Aggregation Needs
Robustness to data errors
– Ignore erroneous reports (outliers)
– See Amazon S3's recent crash caused by corrupt data being gossiped
Data is multi-dimensional
– Physical location × Heat: Where is there a fire?
– Cluster group × Load: Overloaded clusters?
– Software version × Performance: What software are perturbed nodes running?
Solution Requirements
Decentralized, tolerating crashes
Scalable, low cost
– Clouds run 100,000s of machines
– Machines are busy doing real work
Dynamic: deal with churn, value changes
All nodes learn the outcome
– Data used for self-configuring/self-managing systems, so all nodes need to know the outcome in order to take appropriate actions
Proposed Approach
Gossip-based diffusion
– Crash robust, scalable
Constant size synopses represent data distribution as set of Gaussian clusters
[Figure: probability density plot showing the samples taken, Gaussian 1, Gaussian 2, and the estimated distribution]
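
To make gossip-based diffusion concrete, here is a minimal sketch that swaps in plain pairwise averaging for the Gaussian-cluster synopses: each exchange picks two random nodes, and both adopt the mean of their values. The function name `gossip_average`, the fully connected topology, and the sample loads are assumptions for illustration, not the proposed protocol.

```python
import random


def gossip_average(values, rounds, seed=0):
    """Simulate pairwise gossip averaging on a fully connected network.

    Each round performs len(values) exchanges; in every exchange a random
    node and a distinct random peer both adopt the mean of their values.
    """
    rng = random.Random(seed)
    state = list(values)
    n = len(state)
    for _ in range(rounds * n):
        i = rng.randrange(n)
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1  # shift so that the peer always differs from node i
        mean = (state[i] + state[j]) / 2.0
        state[i] = state[j] = mean
    return state


# Hypothetical per-node load readings.
loads = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
estimates = gossip_average(loads, rounds=30)
spread = max(estimates) - min(estimates)
```

Every exchange preserves the sum of the values, so all nodes drift toward the global average without any coordinator. The state each node keeps is a single number, which is the constant-size property the synopses generalize to whole distributions.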
Merging Synopses
Gossiping nodes exchange synopses, merge them to improve accuracy
[Figure: two synopses combined (+) into a single synopsis by merge]
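
One standard way to merge two weighted Gaussian summaries is moment matching: add the weights, take the weighted mean, and recover the variance from the combined second moment. The sketch below assumes one-dimensional data and a single cluster per side. The `Gaussian` class and the `summarize` and `merge` helpers are hypothetical names, and the slides do not specify how a full set of clusters is merged.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    weight: float  # number of samples the cluster stands for
    mean: float
    var: float     # population variance


def summarize(samples):
    """Collapse raw samples into one constant-size Gaussian summary."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return Gaussian(float(n), mean, var)


def merge(a, b):
    """Moment-matching merge of two weighted Gaussian summaries."""
    weight = a.weight + b.weight
    mean = (a.weight * a.mean + b.weight * b.mean) / weight
    # E[x^2] of the mixture, from each side's variance and mean.
    second = (a.weight * (a.var + a.mean ** 2)
              + b.weight * (b.var + b.mean ** 2)) / weight
    return Gaussian(weight, mean, second - mean ** 2)


left = summarize([1.0, 2.0, 3.0])
right = summarize([4.0, 5.0, 6.0, 7.0])
merged = merge(left, right)
direct = summarize([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
```

Merging the two summaries gives the same weight, mean, and variance as summarizing all seven samples directly, so nodes can exchange fixed-size summaries instead of raw data.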
Preliminary Results - Robustness
[Figure: average error vs. iteration for robust aggregation and regular aggregation, with and without crashes; additional panel: sample distribution]
Estimating Distributions - Pareto
[Figure: CDF plot]
Estimating Distributions - Uniform
[Figure: CDF plot]
Multi-Dimensional Distributions
[Figure: two panels, Samples Taken and Aggregated Synopsis]
Key Challenges
Test with real data
Analyze convergence properties
Understand locality
Deal with changing inputs
Task 2: Fault- & Loss-Tolerant Gossip-Based Membership: Formal Analysis
Years 1-2
With Maxim Gurevich
Why Membership?
Each node needs to know some live nodes
– In a dynamically changing system (churn)
Gossip partners
– Random choices make gossip protocols work
Unstructured overlay networks
– E.g., among super-peers
– Random links provide robustness, expansion
Gathering statistics
– Probe random nodes
Desirable Properties
Each node has a local view (set of node ids)
1. Small views, e.g., logarithmic
2. Load balance of representation in views
3. Uniform sample: In every node's view, all other nodes appear with equal probability
4. Spatial independence: No correlation among views of different nodes
5. Temporal independence: fast decay of correlation with past views
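
The first two properties are easy to measure on a snapshot of the system. The sketch below assumes views are stored as a dictionary mapping each node id to a list of ids. The helpers `in_degrees`, `degree_imbalance`, and `max_view_size` are hypothetical names, and the imbalance metric (largest minus smallest in-degree) is only one possible choice.

```python
from collections import Counter


def in_degrees(views):
    """Count how often each node id appears across all views."""
    counts = Counter({node: 0 for node in views})  # keep never-seen nodes at 0
    for view in views.values():
        counts.update(view)
    return counts


def degree_imbalance(views):
    """Gap between the most and least represented node (0 = balanced)."""
    degrees = in_degrees(views).values()
    return max(degrees) - min(degrees)


def max_view_size(views):
    """Largest view held by any node (property 1: views should be small)."""
    return max(len(view) for view in views.values())


balanced = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
skewed = {"a": ["b"], "b": ["a"], "c": ["a"]}
```

In `skewed`, node `c` appears in no view, so no other node would ever pick it as a gossip partner. Load balance of representation is meant to rule out that situation.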
Existing Work
Many protocols studied only empirically
– Achieve good load balance
– Induce spatial dependence
– No bound on temporal dependence
A few analyzed theoretically
– Uniformity, load balance, spatial indep.
– Unrealistic assumptions
    Atomic actions with bi-directional communication
    No churn, failures, or message loss
– No bounds on temporal dependence
Our Goal
Bridge "Theory" and "Practice"
A practical protocol
– Working despite message loss, churn, failures
– No complex bookkeeping for atomic actions
Formally prove the 5 desirable properties
– Should perfectly hold in good circumstances
– Quantify how much they degrade due to adverse conditions: message loss, churn, etc.
Send & Forget Membership
No bi-directional communication
– Overcomes message loss
Simple
– Amenable to formal analysis
[Figure: views of nodes u, v, w in four states: before, after u -> v, after loss, after dup]
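
The four diagram states can be mimicked with a toy step function. This is one assumed reading of the diagram, not the protocol's specification: node u picks two ids v and w from its view, sends [u, w] to v, and forgets both unless its view is at or below a threshold, in which case it keeps them (the "dup" case). The message contents, the duplication rule, and the parameters `loss_prob`, `min_degree`, and `max_degree` are all illustrative assumptions.

```python
import random


def send_and_forget_step(views, u, rng, loss_prob=0.0, min_degree=0, max_degree=8):
    """One assumed step by node u; no acknowledgement is ever awaited.

    Returns 'skipped', 'lost', 'dropped', or 'delivered'.
    """
    view = views[u]
    if len(view) < 2:
        return "skipped"
    v, w = rng.sample(view, 2)
    if len(view) > min_degree:
        # Forget: u removes both ids without knowing if the message arrives.
        view.remove(v)
        view.remove(w)
    # Otherwise u keeps both ids, so delivery duplicates them ("dup").
    if rng.random() < loss_prob:
        return "lost"
    if len(views[v]) + 2 > max_degree:
        return "dropped"  # receiver has no room for two more ids
    views[v].extend([u, w])
    return "delivered"


def total_ids(views):
    """Total number of ids stored across all views."""
    return sum(len(view) for view in views.values())


def fresh():
    return {"u": ["v", "w"], "v": [], "w": []}


rng = random.Random(1)
normal, lossy, dup = fresh(), fresh(), fresh()
send_and_forget_step(normal, "u", rng)
send_and_forget_step(lossy, "u", rng, loss_prob=1.0)
send_and_forget_step(dup, "u", rng, min_degree=2)
```

Starting from two stored ids, the total stays at 2 after a normal step, drops to 0 when the message is lost, and grows to 4 after a duplicating step. Under this reading, occasional duplication is what replenishes the ids that message loss drains from the system.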
Challenges
Setting parameters
– View size, how often to dup?
Proving all 5 desirable properties w/out loss
– Markov Analysis 1: In-degree distribution
– Markov Analysis 2: Markov Chain of all reachable global states
    stationary probability, mixing, membership properties
Quantify impact of loss, churn, failures
– Bound dependencies, degree imbalance
Task 3: Local and Adaptive Self-Organization and Topology Maintenance
Years 2-3
Decisions, Decisions,
Making autonomous decisions based on some function computation
– E.g., optimization function for topology maintenance
Devise local distributed computations for these
Challenge 1: Prove instance-based locality
Challenge 2: Test with real data
Summary (Repeated)
Goal: Scalable Monitoring and Autonomous Management of Cloud Environments
Approach: Distributed Local Computations
– Combine theory and experimental work
Task 1: Robust aggregation
Task 2: Overcome (& understand impact of) loss, failures in gossip-based membership
Task 3 (long term): Local and adaptive self-organization