Scalable Monitoring & Autonomous Management of Cloud Environments


Idit Keidar, Technion

Idit Keidar, April 2009

Executive Summary
Goal: Scalable Monitoring and Autonomous Management of Cloud Environments
Approach: Distributed Local Computations
  – Combine theory and experimental work
Task 1: Robust aggregation
Task 2: Overcome (& understand impact of) loss, failures in gossip-based membership
Task 3 (long term): Local and adaptive self-organization


Autonomous Self* Clouds

Complex autonomous decision making
Collaboratively computing functions

The Haifa data center is too hot!

Let’s reduce power
They’re going to turn on the sprinklers - need to back up

Centralized Solutions Don’t Cut It
Load
Communication costs
Delays
Fault-tolerance


Classical Dist. Solutions Don’t Cut It
Global agreement before any output
Repeated invocations to adapt to changes
High latency, high load
By the time synchronization is done, input may have changed … the result is irrelevant
Frequent changes -> inconsistent snapshots
Synchronization typically relies on leader
  – difficult and costly to maintain


Locality to the Rescue!
Nodes make local decisions based on communication with some proximate nodes
  – rather than the entire network
Infinitely scalable
Fast, low overhead, low power


What is Locality?
Worst case view
  – Interesting problems have (a few) inherently global instances
Average case view
  – Requires an a priori distribution of the inputs
Our approach: be “as local as possible”
  – E.g., Veracity Radius of distributed aggregation [BKLSW’06]: how far does a node need to look in order to know the globally correct result?


Task 1: Distributed Clustering for Robust Aggregation

Years 1-2
With Ittay Eyal and Raphael Rom


Clouds Need Monitoring
Load balancing storage/computation
  – Need to know load distributions
Ensuring a certain replication level
  – Need to know number of failures per object
Discovering problems – detecting anomalies
  – Isolated outliers (malfunctioning node)
  – Anomalous clusters
      All nodes running some OS version are overloaded due to attack
      Overheating area


Aggregation Needs
Robustness to data errors
  – Ignore erroneous reports (outliers)
  – See Amazon S3’s recent crash caused by corrupt data being gossiped
Data is multi-dimensional
  – Physical location × Heat: Where is there a fire?
  – Cluster group × Load: Overloaded clusters?
  – Software version × Performance: What software are perturbed nodes running?


Solution Requirements
Decentralized, tolerating crashes
Scalable, low cost
  – Clouds run 100,000s of machines
  – Machines are busy doing real work
Dynamic: deal with churn, value changes
All nodes learn the outcome
  – Data used for self-configuring/self-managing systems, so all nodes need to know the outcome in order to take appropriate actions


Proposed Approach
Gossip-based diffusion
  – Crash robust, scalable
Constant size synopses represent data distribution as set of Gaussian clusters

[Figure: probability density of the samples taken, with Gaussian 1, Gaussian 2, and the estimated distribution]
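The gossip-based diffusion pattern can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: each node holds a single scalar (say, its load) instead of a synopsis of Gaussian clusters, nodes act in synchronous rounds, and there are no crashes or message loss. The function name `gossip_average` and its parameters are hypothetical and do not come from the slides.

```python
import random

def gossip_average(values, rounds, seed=0):
    """Pairwise gossip averaging: in each round, every node averages
    its value with one peer chosen uniformly at random."""
    rng = random.Random(seed)
    vals = list(values)
    n = len(vals)
    for _ in range(rounds):
        for i in range(n):
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1  # shift so that the peer is never node i itself
            avg = (vals[i] + vals[j]) / 2.0
            vals[i] = vals[j] = avg  # both partners adopt the average
    return vals

# 100 nodes with loads 0..99; the true average is 49.5.
loads = [float(i) for i in range(100)]
estimates = gossip_average(loads, rounds=50)
print(min(estimates), max(estimates))
```

Every exchange preserves the sum of all values, so the nodes converge toward the global average without any coordinator. Plain averaging like this has no protection against outliers, which is the gap the cluster-based synopses are meant to address.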


Merging Synopses
Gossiping nodes exchange synopses, merge them to improve accuracy

[Figure: two synopses merged into one]
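One way to picture a merge is the one-dimensional sketch below. It assumes a synopsis is a list of weighted Gaussians, combines two Gaussians by matching their weight, mean, and variance, and keeps the synopsis at a constant size `k` by repeatedly merging the two clusters with the closest means. The names `Cluster`, `merge`, and `merge_synopses`, and the closest-means rule, are illustrative assumptions rather than the method used in the project.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    weight: float  # number (or mass) of samples the cluster summarizes
    mean: float
    var: float

def merge(a, b):
    """Moment-matching merge of two weighted 1-D Gaussians."""
    w = a.weight + b.weight
    mean = (a.weight * a.mean + b.weight * b.mean) / w
    # Combine second moments E[x^2] = var + mean^2, then subtract the new mean^2.
    second = (a.weight * (a.var + a.mean ** 2)
              + b.weight * (b.var + b.mean ** 2)) / w
    return Cluster(w, mean, second - mean ** 2)

def merge_synopses(s1, s2, k):
    """Union two synopses, then reduce back to at most k clusters."""
    clusters = sorted(s1 + s2, key=lambda c: c.mean)
    while len(clusters) > k:
        # Index of the adjacent pair whose means are closest.
        i = min(range(len(clusters) - 1),
                key=lambda idx: clusters[idx + 1].mean - clusters[idx].mean)
        clusters[i:i + 2] = [merge(clusters[i], clusters[i + 1])]
    return clusters

s1 = [Cluster(1.0, 0.0, 1.0), Cluster(1.0, 10.0, 1.0)]
s2 = [Cluster(1.0, 2.0, 1.0), Cluster(1.0, 10.0, 1.0)]
merged = merge_synopses(s1, s2, k=2)
print(merged)
```

In this example the two clusters at mean 10 collapse into one cluster of weight 2, and the clusters at 0 and 2 collapse into a cluster with mean 1 and variance 2. A real gossip exchange would also have to avoid counting the same samples twice, for instance by splitting weights before sending; the sketch leaves that out.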


Preliminary Results - Robustness

[Figure: average error vs. iteration for robust aggregation and regular aggregation, with and without crashes; inset: sample distribution]


Estimating Distributions - Pareto

[Figures: PDF and CDF]


Estimating Distributions - Uniform

[Figures: PDF and CDF]


Multi-Dimensional Distributions

[Figures: samples taken; aggregated synopsis]


Key Challenges
Test with real data
Analyze convergence properties
Understand locality
Deal with changing inputs


Task 2: Fault- & Loss-Tolerant Gossip-Based Membership: Formal Analysis

Years 1-2
With Maxim Gurevich


Why Membership?
Each node needs to know some live nodes
  – In a dynamically changing system (churn)
Gossip partners
  – Random choices make gossip protocols work
Unstructured overlay networks
  – E.g., among super-peers
  – Random links provide robustness, expansion
Gathering statistics
  – Probe random nodes


Desirable Properties
Each node has a local view (set of node ids)
1. Small views, e.g., logarithmic
2. Load balance of representation in views
3. Uniform sample: In every node’s view, all other nodes appear with equal probability
4. Spatial independence: No correlation among views of different nodes
5. Temporal independence: fast decay of correlation with past views
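Property 2 (load balance) can be checked on a snapshot of the system by counting how often each node id appears in other nodes' views, i.e. its in-degree. The sketch below assumes views are given as a dictionary from node id to a list of ids; the helper names `in_degrees` and `imbalance` are hypothetical.

```python
from collections import Counter

def in_degrees(views):
    """Count how many times each node id appears across all views."""
    counts = Counter({node: 0 for node in views})  # include unreferenced nodes
    for view in views.values():
        for peer in view:
            counts[peer] += 1
    return dict(counts)

def imbalance(views):
    """Gap between the most and least represented node (0 = perfectly balanced)."""
    degrees = in_degrees(views).values()
    return max(degrees) - min(degrees)

balanced = {"u": ["v", "w"], "v": ["w", "u"], "w": ["u", "v"]}
skewed = {"u": ["v", "w"], "v": ["u", "w"], "w": ["u", "u"]}
print(in_degrees(balanced), imbalance(balanced))
print(in_degrees(skewed), imbalance(skewed))
```

A single snapshot says nothing about properties 3 to 5, which concern probabilities and correlations; those need either many samples over time or the formal analysis.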

Existing Work
Many protocols studied only empirically
  – Achieve good load balance
  – Induce spatial dependence
  – No bound on temporal dependence
A few analyzed theoretically
  – Uniformity, load balance, spatial indep.
  – Unrealistic assumptions
      Atomic actions with bi-directional communication
      No churn, failures, or message loss
  – No bounds on temporal dependence

Our Goal
Bridge “Theory” and “Practice”
A practical protocol
  – Working despite message loss, churn, failures
  – No complex bookkeeping for atomic actions
Formally prove the 5 desirable properties
  – Should perfectly hold in good circumstances
  – Quantify how much they degrade due to adverse conditions – message loss, churn, etc.


Send & Forget Membership
No bi-directional communication
  – Overcomes message loss
Simple
  – Amenable to formal analysis

[Figure: views of nodes u, v, w in four panels: before; after u -> v; after loss; after dup]
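The four panels (before, after u -> v, after loss, after dup) suggest a one-way exchange along the lines of the rough sketch below. The details are assumptions made for illustration, not taken from the slides: u picks two ids v and w from its view, sends its own id and w to v, and forgets both entries; when u's view is at or below a threshold it keeps the entries instead (duplication); a lost message is never detected or retried. The function name `send_and_forget_step` and the parameters `loss_prob`, `dup_threshold`, and `max_view` are hypothetical.

```python
import random

def send_and_forget_step(views, u, rng, loss_prob=0.0, dup_threshold=0, max_view=8):
    """One action by node u. views maps node id -> list of ids, mutated in place."""
    view = views[u]
    if len(view) < 2:
        return
    v, w = rng.sample(view, 2)      # two entries from u's view
    if len(view) > dup_threshold:   # normal case: forget both entries
        view.remove(v)
        view.remove(w)
    # otherwise duplicate: u keeps v and w to make up for earlier losses
    if rng.random() < loss_prob:
        return                      # message lost; u never learns of it
    for node_id in (u, w):          # v learns about u and w
        if len(views[v]) < max_view:
            views[v].append(node_id)

def total_ids(views):
    """Total number of view entries in the system."""
    return sum(len(view) for view in views.values())

rng = random.Random(1)

normal = {"u": ["v", "w"], "v": [], "w": []}
send_and_forget_step(normal, "u", rng)

lost = {"u": ["v", "w"], "v": [], "w": []}
send_and_forget_step(lost, "u", rng, loss_prob=1.0)

dup = {"u": ["v", "w"], "v": [], "w": []}
send_and_forget_step(dup, "u", rng, dup_threshold=2)

print(total_ids(normal), total_ids(lost), total_ids(dup))
```

In this sketch a loss-free step conserves the number of view entries, a lost message destroys two of them, and a duplicating step creates two. That trade-off is why setting the view size and the duplication rate becomes a parameter question.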


Challenges
Setting parameters
  – View size, how often to dup?
Proving all 5 desirable properties w/out loss
  – Markov Analysis 1: In-degree distribution
  – Markov Analysis 2: Markov Chain of all reachable global states
      stationary probability, mixing, membership properties
Quantify impact of loss, churn, failures
  – Bound dependencies, degree imbalance
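To make "stationary probability" concrete, the sketch below computes the stationary distribution of a small Markov chain by power iteration. The three-state transition matrix is a made-up toy (think of states as in-degrees 0, 1, 2), not the chain analyzed in this work, and the function name `stationary` is hypothetical.

```python
def stationary(P, iters=200):
    """Approximate the stationary distribution pi (pi = pi P) by power iteration.
    P is a row-stochastic matrix given as a list of rows."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Toy birth-death chain: each row sums to 1.
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
pi = stationary(P)
print(pi)
```

For this toy chain the result is (0.25, 0.5, 0.25), which can be verified by hand from the balance equations. How quickly the iteration converges is the chain's mixing behavior, which relates to how fast correlation with past views decays.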



Task 3: Local and Adaptive Self-Organization and Topology Maintenance

Years 2-3


Decisions, Decisions
Making autonomous decisions based on some function computation
  – E.g., optimization function for topology maintenance
Devise local distributed computations for these
Challenge 1: Prove instance-based locality
Challenge 2: Test with real data


Summary (Repeated)
Goal: Scalable Monitoring and Autonomous Management of Cloud Environments
Approach: Distributed Local Computations
  – Combine theory and experimental work
Task 1: Robust aggregation
Task 2: Overcome (& understand impact of) loss, failures in gossip-based membership
Task 3 (long term): Local and adaptive self-organization

