Quality of Service
Toward “Effective Applications of Cloud Computing”
Cheng-Zhong Xu, Wayne State University, [email protected]
FCST, Dec. 18, 2009
Quality of Service
• Performance: performance isolation and differentiation
• Availability: how do servers behave under stress conditions? Resilience to failure
• Security: prevention such as ID/AC is not enough for flash-crowd-like attacks. Admit good, block bad, and contain suspicious ones. How to contain?
[Figure: QoS triangle of Performance, Availability, and Security]
SLA Example: Amazon EC2
• 99.95% availability for the EC2 service on a yearly basis (at most 4 hours and 23 minutes of outage per year)
• Unavailability measured over 5-minute periods; 10% service credit if an SLA violation is proved
• Revenue loss per down hour:
  – Amazon outage on June 6, 2008 for 2 hours; cnet.com estimated a loss of $16,000/minute, about $2M in total
  – eBay search engine down for 1.5 hours on Aug 16, 2008
  – Google Gmail down for 2+ hours on Aug 11, 2008
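The "4 hours and 23 minutes" figure follows directly from the 99.95% target; a quick sketch of the arithmetic (the helper function is for illustration only):

```python
def allowed_downtime_hours(availability: float, period_hours: float = 365 * 24) -> float:
    """Maximum outage time permitted by an availability target over a period."""
    return (1.0 - availability) * period_hours

# EC2's 99.95% yearly target permits (1 - 0.9995) * 8760 = 4.38 hours,
# i.e. roughly 4 hours and 23 minutes of outage per year.
print(round(allowed_downtime_hours(0.9995), 2))
```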
Challenges of Scale
Microscopic View of Systems Reliability
Challenges of Scale (cont’)
System          #CPUs    Reliability (MTBF/I)
ASCI Q (LANL)    8,192   MTBF: 6.5 hrs (114 unplanned outages/mo)
ASCI White       8,192   MTBF: 5/40 hrs (2001/2003)
PSC Lemieux      3,016   MTBI: 9.7 hrs
Google          15,000   20 reboots/day

Source: D. Reed, High-end computing: The challenge of scale, May 2004
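The ASCI Q numbers in the table are mutually consistent: an outage rate per month implies a mean time between failures. A quick sanity-check sketch (the helper is illustrative, not from the slides):

```python
def mtbf_hours(outages_per_month: float, hours_per_month: float = 30 * 24) -> float:
    """Mean time between failures implied by a monthly outage rate."""
    return hours_per_month / outages_per_month

# 114 unplanned outages/month gives 720 / 114 ≈ 6.3 hrs,
# in line with the ~6.5 hrs MTBF quoted for ASCI Q.
print(round(mtbf_hours(114), 1))
```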
ACM @ Wayne State: Autonomic Cloud Management
• Rapid deployment and management of clouds
  – Adaptive to workload changes, client requirements, resource supplies, system failures, power caps/energy budgets, etc.
• Machine learning, optimization, and control
  – Reinforcement learning for VM auto-configuration
  – Machine learning to characterize system uncertainty and predict anomalies (e.g., overload, failure, SLA violations)
  – Feedback control for assurance and adaptation
MLOC for QoS
[Figure: feedback control loop. The controller compares a reference input against measured QoS, and the resulting error drives a control knob on the VM cloud serving the clients; a model predictor learns f(param) = S from the observed system state.]
• Control, machine learning, and stochastic optimization to deal with system uncertainty, predict anomalies, and scale resources in real time
• Multi-agent coordination for tradeoffs
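The loop in the diagram can be sketched with a minimal integral controller: it adjusts a resource knob (here, a VM's CPU share) until measured QoS tracks the reference input. The plant model and all constants below are invented stand-ins, not the talk's actual controller:

```python
def make_integral_controller(reference: float, gain: float, knob: float):
    """Return a step function: feed it measured QoS, get an updated knob value."""
    state = {"knob": knob}

    def step(measured_qos: float) -> float:
        error = measured_qos - reference   # positive error: response too slow
        state["knob"] += gain * error      # raise CPU share when too slow
        state["knob"] = min(max(state["knob"], 0.1), 1.0)  # actuator limits
        return state["knob"]

    return step

def plant(cpu_share: float) -> float:
    """Toy plant: response time falls as CPU share rises (an assumption)."""
    return 0.2 / cpu_share

ctl = make_integral_controller(reference=0.5, gain=0.5, knob=0.2)
share = 0.2
for _ in range(50):
    share = ctl(plant(share))
# The knob settles near 0.4, where plant(0.4) = 0.5 meets the reference.
```

The actual work uses a model-free, self-tuning fuzzy controller (next slide); this fixed-gain loop only illustrates the reference/error/knob structure of the diagram.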
Model-Free, Self-Tuning Fuzzy Control
See [TC’05, Computer’08] for stability analysis and other details
Transient Behavior on PlanetLab
[Figures: transient behavior on PlanetLab under the World Cup trace and under Surge]
• Statistical guarantee of the target time
Robustness
• Self-adaptive to load changes
• Self-adaptive to network conditions
Dealing with Failures
• Preventive measures
  – Keep enhancing the reliability of system components
  – Simplify systems design
  – Disable components that are prone to failures
    • BlueGene/L at LLNL disabled the L1 cache in each node when jobs longer than a few hours were running
• Checkpoint-based recovery
  – Frequent checkpointing is costly and conservative
• Proactive management to handle failures before they occur
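The cost tradeoff in checkpointing can be made concrete with Young's classic first-order approximation for the optimal checkpoint interval, T_opt ≈ sqrt(2 · C · MTBF), where C is the time to write one checkpoint. A sketch with illustrative numbers (not from the slides):

```python
import math

def optimal_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation: T_opt ≈ sqrt(2 * C * MTBF), all in seconds."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# With a 5-minute checkpoint cost and an ASCI-Q-like MTBF of 6.5 hours,
# checkpoints should be taken roughly every 62 minutes; a shorter MTBF
# forces more frequent checkpoints and more lost useful work.
t = optimal_checkpoint_interval(300, 6.5 * 3600)
print(round(t / 60))
```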
Proactive Failure Management
• The key is failure prediction/detection
  – Predict failure occurrences in the near future based on statistics of observed failures and their dependence on performance states
• Opportunities
  – Failure occurrences display uneven inter-arrival times
  – They are correlated in the time and space domains
  – But there is no simple, general model for failure dynamics on production systems
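"Uneven inter-arrival times" can be quantified by the coefficient of variation (CV) of the gaps between failures: CV = 1 for a Poisson process, CV > 1 for burstier-than-Poisson behavior. A sketch on a made-up failure trace (timestamps are invented for illustration):

```python
import statistics

def interarrival_cv(failure_times_h):
    """Coefficient of variation of failure inter-arrival times (hours).
    CV = 1 for a Poisson process; CV > 1 indicates bursty failures."""
    gaps = [b - a for a, b in zip(failure_times_h, failure_times_h[1:])]
    return statistics.pstdev(gaps) / statistics.mean(gaps)

# Made-up trace: two bursts of failures separated by long quiet periods,
# the pattern the slide describes; its CV comes out well above 1.
trace = [0, 1, 2, 3, 100, 101, 102, 200]
print(round(interarrival_cv(trace), 2))
```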
Multi-time Scale Predictor
[Figure: three-layer prediction architecture.
Layer 1 (node-wide): each node runs an event sensor feeding a failure predictor, alongside the node scheduler, resource manager, and failure manager, behind a firewall facing Internet-side applications; failure signatures flow upward.
Tier 2 (cluster-wide): the master node of each department cluster adds a task dispatcher, failure collector, and administrative tools over its worker nodes.
Layer 3 (system-wide): across the interconnected cluster system, system management aggregates failure events from master nodes for failure prediction.]
• Multiscale spherical covariance for temporal locality; aggregate model for spatial locality
• See SC'07 and SRDS'07 for details
[Figure: user utilization (%) and frame utilization (%) over a 600-minute window, with software failures, hardware failures, power outages, and network breakdowns marked]
Failure Prediction on LANL Trace
• 20 failures of Node 1
[Figure: Bayesian network model relating failure type (fType), packet count (pktCount), user utilization (UsrUtil), system utilization (SysUtil), and frame utilization (frmUtil) to failure and time, with edge weights ranging from 10.9% to 42.4%]
• Predicted a failure for 3:50 am on 9/7/2004
• Actual: a scheduler software failure at 08:20 am on 9/7/2004
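A predictor of this kind combines conditional probabilities of the observed features under "failure" vs. "no failure". A minimal naive-Bayes stand-in using the slide's feature names, where every probability value is invented for illustration (the real model is a learned Bayesian network, not naive Bayes):

```python
import math

# Invented conditional probabilities P(feature bucket | class).
P_GIVEN_FAIL = {"UsrUtil=high": 0.7, "SysUtil=high": 0.6,
                "frmUtil=high": 0.8, "pktCount=high": 0.5}
P_GIVEN_OK   = {"UsrUtil=high": 0.3, "SysUtil=high": 0.2,
                "frmUtil=high": 0.4, "pktCount=high": 0.5}
P_FAIL = 0.05  # assumed prior probability of failure in the next window

def failure_posterior(evidence):
    """P(failure | evidence) under the naive independence assumption."""
    log_fail = math.log(P_FAIL)
    log_ok = math.log(1 - P_FAIL)
    for e in evidence:
        log_fail += math.log(P_GIVEN_FAIL[e])
        log_ok += math.log(P_GIVEN_OK[e])
    return 1.0 / (1.0 + math.exp(log_ok - log_fail))

# All three utilization signals high pushes the posterior well above the prior,
# which is when a proactive failure manager would act.
p = failure_posterior(["UsrUtil=high", "SysUtil=high", "frmUtil=high"])
```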
Online Failure Prediction
• Online prediction from 5/12/2006 to 4/2/2007 on the Wayne State Grid
• 40 high-end compute servers in 3 clusters: ISC (16), CIT (16), CHM (8)
QoS toward Effective Cloud Computing
• Performance: performance isolation and differentiation
• Availability: graceful performance degradation under stress conditions; resilience to failure
• Security: ABC admission policy: Admit good, Block bad, and Contain suspicious ones
[Figure: QoS triangle of Performance, Availability, and Security]