Quality of Service
Toward “Effective Applications of Cloud Computing”
Cheng-Zhong Xu, Wayne State University, [email protected]
FCST, Dec. 18, 2009
Quality of Service
• Performance: performance isolation and differentiation
• Availability: how do servers behave under stress conditions? Resilience to failure
• Security: prevention such as ID/AC is not enough for flash-crowd-like attacks. Admit good, block bad, and contain suspicious ones. How to contain?
[Figure: QoS triangle of Performance, Availability, and Security]
SLA Example: Amazon EC2
• 99.95% availability for the EC2 service on a yearly basis (at most 4 hours and 23 minutes of outage per year)
• Unavailability measured over 5-minute periods; 10% service credit if an SLA violation is proved
• Revenue loss per down hour:
  – Amazon outage on June 6, 2008 for 2 hours; cnet.com estimated a loss of $16,000/minute, about $2M in total
  – eBay search engine down for 1.5 hours on Aug 16, 2008
  – Google Gmail down for 2+ hours on Aug 11, 2008
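The "4 hours and 23 minutes" figure follows directly from the 99.95% target; a quick sketch of the arithmetic (the helper function is for illustration only):

```python
def allowed_downtime_hours(availability: float, period_hours: float = 365 * 24) -> float:
    """Maximum outage time permitted by an availability target over a period."""
    return (1.0 - availability) * period_hours

# EC2's 99.95% yearly target permits (1 - 0.9995) * 8760 = 4.38 hours,
# i.e. roughly 4 hours and 23 minutes of outage per year.
print(round(allowed_downtime_hours(0.9995), 2))
```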
Challenges of Scale
Microscopic View of Systems Reliability
Challenges of Scale (cont’)
System          #CPUs    Reliability (MTBF/I)
ASCI Q (LANL)    8,192   MTBF: 6.5 hrs (114 unplanned outages/mo)
ASCI White       8,192   MTBF: 5/40 hrs (2001/2003)
PSC Lemieux      3,016   MTBI: 9.7 hrs
Google          15,000   20 reboots/day

Source: D. Reed, High-end computing: The challenge of scale, May 2004
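The ASCI Q numbers in the table are mutually consistent: an outage rate per month implies a mean time between failures. A quick sanity-check sketch (the helper is illustrative, not from the slides):

```python
def mtbf_hours(outages_per_month: float, hours_per_month: float = 30 * 24) -> float:
    """Mean time between failures implied by a monthly outage rate."""
    return hours_per_month / outages_per_month

# 114 unplanned outages/month gives 720 / 114 ≈ 6.3 hrs,
# in line with the ~6.5 hrs MTBF quoted for ASCI Q.
print(round(mtbf_hours(114), 1))
```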
ACM @ Wayne State: Autonomic Cloud Management
• Rapid deployment and management of clouds
  – Adaptive to workload changes, client requirements, resource supplies, system failures, power caps/energy budgets, etc.
• Machine learning, optimization, and control
  – Reinforcement learning for VM auto-configuration
  – Machine learning to characterize system uncertainty and predict anomalies (e.g., overload, failure, SLA violations)
  – Feedback control for assurance and adaptation
MLOC for QoS
[Figure: feedback control loop. The controller compares a reference input against measured QoS, and the resulting error drives a control knob on the VM cloud serving the clients; a model predictor learns f(param) = S from the observed system state.]
• Control, machine learning, and stochastic optimization to deal with system uncertainty, predict anomalies, and scale resources in real time
• Multi-agent coordination for tradeoffs
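The loop in the diagram can be sketched with a minimal integral controller: it adjusts a resource knob (here, a VM's CPU share) until measured QoS tracks the reference input. The plant model and all constants below are invented stand-ins, not the talk's actual controller:

```python
def make_integral_controller(reference: float, gain: float, knob: float):
    """Return a step function: feed it measured QoS, get an updated knob value."""
    state = {"knob": knob}

    def step(measured_qos: float) -> float:
        error = measured_qos - reference   # positive error: response too slow
        state["knob"] += gain * error      # raise CPU share when too slow
        state["knob"] = min(max(state["knob"], 0.1), 1.0)  # actuator limits
        return state["knob"]

    return step

def plant(cpu_share: float) -> float:
    """Toy plant: response time falls as CPU share rises (an assumption)."""
    return 0.2 / cpu_share

ctl = make_integral_controller(reference=0.5, gain=0.5, knob=0.2)
share = 0.2
for _ in range(50):
    share = ctl(plant(share))
# The knob settles near 0.4, where plant(0.4) = 0.5 meets the reference.
```

The actual work uses a model-free, self-tuning fuzzy controller (next slide); this fixed-gain loop only illustrates the reference/error/knob structure of the diagram.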
Model-Free, Self-Tuning Fuzzy Control
See [TC’05, Computer’08] for stability analysis and other details
Transient Behavior on PlanetLab
[Figures: transient behavior on PlanetLab under the World Cup trace and under Surge]
• Statistical guarantee of the target time
Robustness
• Self-adaptive to load changes
• Self-adaptive to network conditions
Dealing with Failures
• Preventive measures
  – Keep enhancing the reliability of system components
  – Simplify systems design
  – Disable components that are prone to failures
    • BlueGene/L at LLNL disabled the L1 cache in each node when jobs longer than a few hours were running
• Checkpoint-based recovery
  – Frequent checkpointing is costly and conservative
• Proactive management to handle failures before they occur
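The cost tradeoff in checkpointing can be made concrete with Young's classic first-order approximation for the optimal checkpoint interval, T_opt ≈ sqrt(2 · C · MTBF), where C is the time to write one checkpoint. A sketch with illustrative numbers (not from the slides):

```python
import math

def optimal_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation: T_opt ≈ sqrt(2 * C * MTBF), all in seconds."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# With a 5-minute checkpoint cost and an ASCI-Q-like MTBF of 6.5 hours,
# checkpoints should be taken roughly every 62 minutes; a shorter MTBF
# forces more frequent checkpoints and more lost useful work.
t = optimal_checkpoint_interval(300, 6.5 * 3600)
print(round(t / 60))
```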
Proactive Failure Management
• The key is failure prediction/detection
  – Predict failure occurrences in the near future based on statistics of observed failures and their dependence on performance states
• Opportunities
  – Failure occurrences display uneven inter-arrival times
  – They are correlated in the time and space domains
  – But there is no simple, general model for failure dynamics on production systems
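"Uneven inter-arrival times" can be quantified by the coefficient of variation (CV) of the gaps between failures: CV = 1 for a Poisson process, CV > 1 for burstier-than-Poisson behavior. A sketch on a made-up failure trace (timestamps are invented for illustration):

```python
import statistics

def interarrival_cv(failure_times_h):
    """Coefficient of variation of failure inter-arrival times (hours).
    CV = 1 for a Poisson process; CV > 1 indicates bursty failures."""
    gaps = [b - a for a, b in zip(failure_times_h, failure_times_h[1:])]
    return statistics.pstdev(gaps) / statistics.mean(gaps)

# Made-up trace: two bursts of failures separated by long quiet periods,
# the pattern the slide describes; its CV comes out well above 1.
trace = [0, 1, 2, 3, 100, 101, 102, 200]
print(round(interarrival_cv(trace), 2))
```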
Multi-time Scale Predictor
[Figure: three-layer prediction architecture.
Layer 1 (node-wide): each node runs an event sensor feeding a failure predictor, alongside the node scheduler, resource manager, and failure manager, behind a firewall facing Internet-side applications; failure signatures flow upward.
Tier 2 (cluster-wide): the master node of each department cluster adds a task dispatcher, failure collector, and administrative tools over its worker nodes.
Layer 3 (system-wide): across the interconnected cluster system, system management aggregates failure events from master nodes for failure prediction.]
• Multiscale spherical covariance for temporal locality; aggregate model for spatial locality
• See SC'07 and SRDS'07 for details
[Figure: user utilization (%) and frame utilization (%) over a 600-minute window, with software failures, hardware failures, power outages, and network breakdowns marked]
Failure Prediction on LANL Trace
• 20 failures of Node 1
[Figure: Bayesian network model relating failure type (fType), packet count (pktCount), user utilization (UsrUtil), system utilization (SysUtil), and frame utilization (frmUtil) to failure and time, with edge weights ranging from 10.9% to 42.4%]
• Predicted a failure for 3:50 am on 9/7/2004
• Actual: a scheduler software failure at 08:20 am on 9/7/2004
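A predictor of this kind combines conditional probabilities of the observed features under "failure" vs. "no failure". A minimal naive-Bayes stand-in using the slide's feature names, where every probability value is invented for illustration (the real model is a learned Bayesian network, not naive Bayes):

```python
import math

# Invented conditional probabilities P(feature bucket | class).
P_GIVEN_FAIL = {"UsrUtil=high": 0.7, "SysUtil=high": 0.6,
                "frmUtil=high": 0.8, "pktCount=high": 0.5}
P_GIVEN_OK   = {"UsrUtil=high": 0.3, "SysUtil=high": 0.2,
                "frmUtil=high": 0.4, "pktCount=high": 0.5}
P_FAIL = 0.05  # assumed prior probability of failure in the next window

def failure_posterior(evidence):
    """P(failure | evidence) under the naive independence assumption."""
    log_fail = math.log(P_FAIL)
    log_ok = math.log(1 - P_FAIL)
    for e in evidence:
        log_fail += math.log(P_GIVEN_FAIL[e])
        log_ok += math.log(P_GIVEN_OK[e])
    return 1.0 / (1.0 + math.exp(log_ok - log_fail))

# All three utilization signals high pushes the posterior well above the prior,
# which is when a proactive failure manager would act.
p = failure_posterior(["UsrUtil=high", "SysUtil=high", "frmUtil=high"])
```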
Online Failure Prediction
• Online prediction from 5/12/2006 to 4/2/2007 on the Wayne State Grid
• 40 high-end compute servers in 3 clusters: ISC (16), CIT (16), CHM (8)
QoS toward Effective Cloud Computing
• Performance: performance isolation and differentiation
• Availability: graceful performance degradation under stress conditions; resilience to failure
• Security: ABC admission policy: Admit good, Block bad, and Contain suspicious ones
[Figure: QoS triangle of Performance, Availability, and Security]