Scalable Monitoring & Autonomous Management of Cloud Environments. Idit Keidar, Technion, April 2009


DESCRIPTION

Scalable Monitoring & Autonomous Management of Cloud Environments. Idit Keidar, Technion. Executive Summary. Goal: Scalable Monitoring and Autonomous Management of Cloud Environments. Approach: Distributed Local Computations. Combine theory and experimental work.


Page 1: Scalable Monitoring & Autonomous Management of Cloud Environments

Idit Keidar, Technion

Idit Keidar, April 2009

Page 2: Executive Summary

• Goal: Scalable Monitoring and Autonomous Management of Cloud Environments
• Approach: Distributed Local Computations
  – Combine theory and experimental work
• Task 1: Robust aggregation
• Task 2: Overcome (& understand impact of) loss, failures in gossip-based membership
• Task 3 (long term): Local and adaptive self-organization

Page 3: Autonomous Self* Clouds

• Complex autonomous decision making
• Collaboratively computing functions

Callouts in the figure:
  – "The Haifa data center is too hot!"
  – "Let's reduce power"
  – "They're going to turn on the sprinklers - need to back up"

Page 4: Centralized Solutions Don't Cut It

• Load
• Communication costs
• Delays
• Fault-tolerance

Page 5: Classical Dist. Solutions Don't Cut It

• Global agreement before any output
• Repeated invocations to adapt to changes
• High latency, high load
• By the time synchronization is done, input may have changed … the result is irrelevant
• Frequent changes -> inconsistent snapshots
• Synchronization typically relies on leader
  – difficult and costly to maintain

Page 6: Locality to the Rescue!

• Nodes make local decisions based on communication with some proximate nodes
  – rather than the entire network
• Infinitely scalable
• Fast, low overhead, low power

Page 7: What is Locality?

• Worst case view
  – Interesting problems have (a few) inherently global instances
• Average case view
  – Requires an a priori distribution of the inputs
• Our approach: be "as local as possible"
  – E.g., Veracity Radius of distributed aggregation [BKLSW'06]: how far does a node need to look in order to know the globally correct result?

Page 8: Task 1: Distributed Clustering for Robust Aggregation

Years 1-2
With Ittay Eyal and Raphael Rom

Page 9: Clouds Need Monitoring

• Load balancing storage/computation
  – Need to know load distributions
• Ensuring a certain replication level
  – Need to know number of failures per object
• Discovering problems – detecting anomalies
  – Isolated outliers (malfunctioning node)
  – Anomalous clusters
      • All nodes running some OS version are overloaded due to attack
      • Overheating area

Page 10: Aggregation Needs

• Robustness to data errors
  – Ignore erroneous reports (outliers)
  – See Amazon S3's recent crash caused by corrupt data being gossiped
• Data is multi-dimensional
  – Physical location × Heat: Where is there a fire?
  – Cluster group × Load: Overloaded clusters?
  – Software version × Performance: What software are perturbed nodes running?

Page 11: Solution Requirements

• Decentralized, tolerating crashes
• Scalable, low cost
  – Clouds run 100,000s of machines
  – Machines are busy doing real work
• Dynamic: deal with churn, value changes
• All nodes learn the outcome
  – Data used for self-configuring/self-managing systems, so all nodes need to know the outcome in order to take appropriate actions

Page 12: Proposed Approach

• Gossip-based diffusion
  – Crash robust, scalable
• Constant size synopses represent data distribution as set of Gaussian clusters

[Figure: Probability Density (axis 0 to 0.4) of the samples taken, showing Gaussian 1, Gaussian 2, and the Estimated Distribution]
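
To make "gossip-based diffusion" concrete, here is a minimal sketch of generic pairwise-averaging gossip. It is not the clustering protocol from the talk: it only estimates a single average, and the function name `gossip_average`, the load values, and the round count are assumptions made up for illustration.

```python
import random

def gossip_average(values, rounds, seed=0):
    """Pairwise averaging: in each round, every node averages with one random peer."""
    rng = random.Random(seed)
    est = list(values)          # each node starts from its own local reading
    n = len(est)
    for _ in range(rounds):
        for i in range(n):
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1          # shift so the peer is never node i itself
            mean = (est[i] + est[j]) / 2.0
            est[i] = est[j] = mean   # both parties keep the pair's average
    return est

# Hypothetical per-node load readings.
loads = [10.0, 0.0, 4.0, 6.0, 30.0, 2.0, 8.0, 20.0]
estimates = gossip_average(loads, rounds=30)
```

Each exchange preserves the sum of the estimates, so every node drifts toward the global mean without any node seeing the whole network. A plain average like this is exactly what a single corrupt report can skew, which is the motivation for the cluster-based synopses.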

Page 13: Merging Synopses

• Gossiping nodes exchange synopses, merge them to improve accuracy

[Figure: synopsis + synopsis = merged synopsis (merge)]
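
The slide does not say how two synopses are merged. One plausible sketch, assuming one-dimensional clusters summarized by weight, mean, and variance, is to pool both synopses and repeatedly fuse the two clusters with the closest means until at most `k` remain. The names `Cluster`, `merge_pair`, and `merge_synopses`, and the closest-means rule, are assumptions for illustration, not the authors' algorithm.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    weight: float   # mass (e.g. number of samples) summarized by this cluster
    mean: float
    var: float      # variance of the summarized samples

def merge_pair(a, b):
    """Moment-matching merge: keeps total weight, mean, and variance exact."""
    w = a.weight + b.weight
    mean = (a.weight * a.mean + b.weight * b.mean) / w
    # Law of total variance: within-cluster spread plus spread of the means.
    var = (a.weight * (a.var + (a.mean - mean) ** 2)
           + b.weight * (b.var + (b.mean - mean) ** 2)) / w
    return Cluster(w, mean, var)

def merge_synopses(s1, s2, k):
    """Pool two synopses, then fuse neighbours until at most k clusters remain."""
    clusters = sorted(s1 + s2, key=lambda c: c.mean)
    while len(clusters) > k:
        # Index of the adjacent pair whose means are closest.
        i = min(range(len(clusters) - 1),
                key=lambda idx: clusters[idx + 1].mean - clusters[idx].mean)
        clusters[i:i + 2] = [merge_pair(clusters[i], clusters[i + 1])]
    return clusters

a = [Cluster(3, 1.0, 0.5), Cluster(1, 9.0, 0.2)]
b = [Cluster(2, 1.5, 0.4), Cluster(4, 10.0, 0.3)]
merged = merge_synopses(a, b, k=2)
```

Because the result never holds more than `k` clusters, the synopsis stays constant in size no matter how many nodes have contributed to it.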

Page 14: Preliminary Results - Robustness

[Figure: Average Error vs. Iteration. Series: Robust Aggregation, Regular Aggregation; No crashes, With crashes. Panel: Sample Distribution]

Page 15: Estimating Distributions - Pareto

[Figure panels: PDF, CDF]

Page 16: Estimating Distributions - Uniform

[Figure panels: PDF, CDF]

Page 17: Multi-Dimensional Distributions

[Figure panels: Samples Taken, Aggregated Synopsis]

Page 18: Key Challenges

• Test with real data
• Analyze convergence properties
• Understand locality
• Deal with changing inputs

Page 19: Task 2: Fault- & Loss-Tolerant Gossip-Based Membership: Formal Analysis

Years 1-2
With Maxim Gurevich

Page 20: Why Membership?

• Each node needs to know some live nodes
  – In a dynamically changing system (churn)
• Gossip partners
  – Random choices make gossip protocols work
• Unstructured overlay networks
  – E.g., among super-peers
  – Random links provide robustness, expansion
• Gathering statistics
  – Probe random nodes

Page 21: Desirable Properties

Each node has a local view (set of node ids)
1. Small views, e.g., logarithmic
2. Load balance of representation in views
3. Uniform sample: In every node's view, all other nodes appear with equal probability
4. Spatial independence: No correlation among views of different nodes
5. Temporal independence: fast decay of correlation with past views
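
Property 2 can be checked directly on a snapshot of the views by counting how often each node id appears in other nodes' views (its in-degree). The sketch below is only an illustrative measurement; the helper name `in_degrees` and the three-node example are assumptions.

```python
from collections import Counter

def in_degrees(views):
    """views maps each node id to the list of node ids in its local view."""
    counts = Counter({node: 0 for node in views})   # nodes nobody knows count as 0
    for view in views.values():
        counts.update(view)                          # one count per appearance
    return counts

# Hypothetical snapshot: three nodes, each holding the other two.
views = {"u": ["v", "w"], "v": ["w", "u"], "w": ["u", "v"]}
deg = in_degrees(views)
imbalance = max(deg.values()) - min(deg.values())
```

An imbalance of zero means every node is represented equally often. A growing gap means some nodes are over-represented and will receive a larger share of gossip traffic.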

Page 22: Existing Work

• Many protocols studied only empirically
  – Achieve good load balance
  – Induce spatial dependence
  – No bound on temporal dependence
• A few analyzed theoretically
  – Uniformity, load balance, spatial indep.
  – Unrealistic assumptions
      • Atomic actions with bi-directional communication
      • No churn, failures, or message loss
  – No bounds on temporal dependence

Page 23: Our Goal: Bridge "Theory" and "Practice"

• A practical protocol
  – Working despite message loss, churn, failures
  – No complex bookkeeping for atomic actions
• Formally prove the 5 desirable properties
  – Should perfectly hold in good circumstances
  – Quantify how much they degrade due to adverse conditions – message loss, churn, etc.

Page 24: Send & Forget Membership

• No bi-directional communication
  – Overcomes message loss
• Simple
  – Amenable to formal analysis

[Figure: nodes u, v, w in four panels: before; after u -> v; after loss; after dup]
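
The slide names the four panels but does not spell out the rule. The sketch below is one assumed reading: node u picks two ids v and w from its view, sends [u, w] to v in a single one-way message, and forgets both ids unless its view is already small, in which case it keeps them (dup). The function `sf_step`, the `min_size` threshold, and the list-based views are hypothetical choices for illustration, not the protocol as specified by the authors.

```python
import random

def sf_step(views, u, rng, loss_rate=0.0, min_size=2):
    """One Send & Forget-style action by node u (assumed rule, not the authors' spec)."""
    view = views[u]
    if len(view) < 2:
        return                      # not enough ids to form a message
    v, w = rng.sample(view, 2)      # v is the target, w travels in the message
    delivered = rng.random() >= loss_rate
    if len(view) > min_size:
        # Forget: drop both ids immediately, without waiting for any reply.
        view.remove(v)
        view.remove(w)
    # Otherwise dup: keep the ids so a small view does not shrink further.
    if delivered:
        views[v].extend([u, w])     # receiver learns about the sender and about w

# Hypothetical four-node system with a 10% message loss rate.
rng = random.Random(7)
views = {"u": ["v", "w", "x"], "v": ["w", "x", "u"],
         "w": ["x", "u", "v"], "x": ["u", "v", "w"]}
for _ in range(20):
    sf_step(views, rng.choice(sorted(views)), rng, loss_rate=0.1)
```

In this reading a lost message removes two ids from the system and a dup adds two, so the threshold is what balances loss against duplication. Since the sender never waits for an acknowledgement, no atomic two-way exchange is needed.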

Page 25: Challenges

• Setting parameters
  – View size, how often to dup?
• Proving all 5 desirable properties w/out loss
  – Markov Analysis 1: In-degree distribution
  – Markov Analysis 2: Markov Chain of all reachable global states
      • stationary probability, mixing, membership properties
• Quantify impact of loss, churn, failures
  – Bound dependencies, degree imbalance
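
For readers unfamiliar with the terms, the stationary probability of a Markov chain can be approximated by applying the transition matrix repeatedly to any starting distribution. The two-state matrix below is a made-up toy, not the chain of global membership states analyzed in the talk, and `stationary` is a hypothetical helper.

```python
def stationary(P, iters=1000):
    """Approximate the stationary distribution of row-stochastic matrix P."""
    n = len(P)
    pi = [1.0 / n] * n                      # start from the uniform distribution
    for _ in range(iters):
        # One step of the chain: pi_next[j] = sum_i pi[i] * P[i][j]
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Toy chain: state 0 tends to stay put, state 1 leaves half the time.
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary(P)   # balance gives pi[0] * 0.1 == pi[1] * 0.5, i.e. (5/6, 1/6)
```

How quickly the iterates settle is the chain's mixing behaviour, which is what relates to temporal independence among the desirable properties: a fast-mixing chain forgets its past views quickly.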

Page 26: Task 3: Local and Adaptive Self-Organization and Topology Maintenance

Years 2-3

Page 27: Decisions, Decisions,

• Making autonomous decisions based on some function computation
  – E.g., optimization function for topology maintenance
• Devise local distributed computations for these
• Challenge 1: Prove instance-based locality
• Challenge 2: Test with real data

Page 28: Summary (Repeated)

• Goal: Scalable Monitoring and Autonomous Management of Cloud Environments
• Approach: Distributed Local Computations
  – Combine theory and experimental work
• Task 1: Robust aggregation
• Task 2: Overcome (& understand impact of) loss, failures in gossip-based membership
• Task 3 (long term): Local and adaptive self-organization