Automated Multi-Tier System Design for Service Availability
John Janakiraman, Jose Renato Santos, Yoshio Turner
Hewlett Packard Laboratories
page 2, 06/23/2003, DSMS workshop - DSN 2003
Motivation: New Enterprise/Internet Computing Model
• Utility Computing: HP (Adaptive Enterprise), IBM (Autonomic Computing), SUN (N1)
  – Shared resources allocated to services on demand
    • Virtual resources hide details of physical environment
  – Self-managing system
    • Service life-cycle (creation, change, deletion)
    • Adaptation (load changes, component faults, etc.)
  – High level service requirements specification
    • E.g. desired performance and availability instead of detailed design
• Automated System Design/Configuration
  – Required for utility computing
  – Our focus: design for service availability
HP Utility Data Center (UDC)
• HP hardware and software solution that enables provision of computing resources to applications on demand.
[Figure: a logical resources specification (internet, access tier, web tier, application tier, database tier) is deployed onto allocated physical resources: a utility controller with network/server virtualization and storage virtualization over server, firewall, load balancer, switching, and storage pools, connected to the internet and intranet]
Utility Computing Environment
• Higher level service specification
  – Functional specification
  – Performance requirement
  – Availability requirement
• Design/Configuration automation
  – System determines the required computing resources for service
    • type of resources
    • amount of resources
    • topology
    • application and OS configurations
  – Dynamically redesign and reconfigure due to changes (load, faults, etc.)
[Figure: layered diagram with labels Service Specification, Automatic Design and Configuration, Virtual Resources, Resource Allocation, Physical Resources]
Automated System Design for Availability
Goal: Automatically explore the space of infrastructure designs and availability mechanisms and select a design that meets availability requirements with minimum cost.
– Numerous availability mechanisms to select/configure (cost/benefit tradeoffs)
  • host failover, NIC failover, standby/active spare, database checkpointing, application state checkpointing (on peer, on file, on database), data replication, software rejuvenation, etc.
  • Different mechanisms and operating points have different cost, performance overhead and availability characteristics
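The goal stated above amounts to a constrained search: enumerate candidate designs, discard those that exceed the downtime limit, and keep the cheapest survivor. The following is a minimal Python sketch of that idea, not AVED's actual search. The names `Design`, `cheapest_feasible`, `toy_cost`, and `toy_downtime`, and every number in it, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Design:
    repair_option: str  # e.g. "bronze", "silver", "gold"
    spares: int         # number of extra machines


def cheapest_feasible(
    designs: Iterable[Design],
    cost: Callable[[Design], float],
    downtime: Callable[[Design], float],
    max_downtime: float,
) -> Optional[Design]:
    """Return the lowest-cost design whose downtime meets the limit, or None."""
    feasible = [d for d in designs if downtime(d) <= max_downtime]
    return min(feasible, key=cost, default=None)


# Toy numbers, invented for illustration only.
REPAIR_COST = {"bronze": 380, "silver": 580, "gold": 750}
BASE_DOWNTIME_MIN = {"bronze": 1000.0, "silver": 400.0, "gold": 200.0}
SPARE_COST = 2400


def toy_cost(d: Design) -> float:
    return REPAIR_COST[d.repair_option] + SPARE_COST * d.spares


def toy_downtime(d: Design) -> float:
    # Assumption: each spare cuts annual downtime by a factor of 10.
    return BASE_DOWNTIME_MIN[d.repair_option] / (10 ** d.spares)


# Exhaustive enumeration is fine for a design space this small.
DESIGN_SPACE = [Design(r, s) for r, s in product(REPAIR_COST, range(3))]

best = cheapest_feasible(DESIGN_SPACE, toy_cost, toy_downtime, max_downtime=100.0)
print(best)
```

Exhaustive enumeration only works for very small design spaces; a real tool would presumably need pruning or a smarter search once more mechanisms and tiers are included.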
AVED: Proof of Concept Tool
• Initial prototype for stand-alone environment
• Current Scope:
  – Application type: Multi-tier services (e.g. 3-tier e-commerce)
  – Availability requirement: Maximum Service Downtime per year
  – Design space (limited set):
    • Choice of server hw and sw components (no network, no storage)
    • Number of servers
    • State of servers (active or spare)
    • Repair strategies for component failures
• Key challenges:
  – How to model relevant properties of service and compute infrastructure
  – How to reason about alternative designs
[Figure: AVED block diagram. Labels: Service Model, Service Requirements, Infrastructure Model, Available, Design]
Modeling Approach
Infrastructure Model
• Component types: Basic elements that can fail
  – Cost model
  – A set of failure modes
    • A set of repair options for each failure mode (cost/benefit)
  Example: lp2000r (2-way x86 server), rp8400 (8-way PA-RISC server), Linux, WebLogic, etc.
• Resource types: Unit of provisioning
  – Set of components
  Example: App server = lp2000r + Linux + WebLogic
Service Model
• Set of tiers
  – Set of valid resource alternatives for each tier
  – Performance model for each tier
Service Requirements
• Expected load (peak)
• Maximum Annual Downtime
  – System is assumed down when resources cannot sustain expected peak load
[Figure: hierarchy of Components, Resources, Tiers, and Service]
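The downtime definition above (the system is down whenever the working resources cannot sustain the expected peak load) can be illustrated with a k-out-of-n calculation. This is a rough Python sketch under assumptions that are not from the slides: all nodes in a tier are identical and active, node failures are independent, and each node has a known steady-state availability. The function names and the 0.999 availability are hypothetical.

```python
from math import ceil, comb

MINUTES_PER_YEAR = 365 * 24 * 60


def nodes_needed(peak_load: float, node_capacity: float) -> int:
    """Minimum number of working nodes that can sustain the peak load."""
    return ceil(peak_load / node_capacity)


def prob_tier_down(n_active: int, k_needed: int, node_availability: float) -> float:
    """P(fewer than k_needed of n_active nodes are up), assuming independence."""
    a = node_availability
    # Sum the binomial probabilities of every "too few nodes up" outcome.
    return sum(
        comb(n_active, up) * a**up * (1 - a) ** (n_active - up)
        for up in range(min(k_needed, n_active + 1))
    )


def annual_downtime_minutes(
    peak_load: float, node_capacity: float, n_active: int, node_availability: float
) -> float:
    k = nodes_needed(peak_load, node_capacity)
    return MINUTES_PER_YEAR * prob_tier_down(n_active, k, node_availability)


# 1000 load units on 200-unit nodes needs 5 working nodes.
no_spare = annual_downtime_minutes(1000, 200, n_active=5, node_availability=0.999)
one_spare = annual_downtime_minutes(1000, 200, n_active=6, node_availability=0.999)
print(f"{no_spare:.1f} min/yr without a spare, {one_spare:.2f} min/yr with one active spare")
```

Cold spares, failover times, and correlated failures are deliberately left out; they would need a richer model than this binomial approximation.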
AVED architecture
• Two availability evaluation engines supported:
  – AVANTO (HP availability simulation tool)
  – Simple Markov model
[Figure: AVED architecture. Labels: GUI, Repository, Service Model, Infrastructure Model, Service Requirement, Design Generator, Translator, availability model, Availability Evaluation Engine, availability estimate, candidate design, final design, Output]
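The slides do not describe the "simple Markov model" engine in detail. As a hedged illustration of the general technique only, the simplest such model is a two-state (up/down) chain whose steady-state availability is MTBF / (MTBF + MTTR). The Python sketch below assumes exponential failure and repair times; the function names and the example component are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def steady_state_availability(mtbf_min: float, mttr_min: float) -> float:
    """Long-run fraction of time spent in the 'up' state of a two-state chain.

    The failure rate is 1/MTBF and the repair rate is 1/MTTR. Balancing the
    flow between the two states gives repair / (failure + repair), which is
    the same as MTBF / (MTBF + MTTR).
    """
    failure_rate = 1.0 / mtbf_min
    repair_rate = 1.0 / mttr_min
    return repair_rate / (failure_rate + repair_rate)


def annual_downtime_minutes(mtbf_min: float, mttr_min: float) -> float:
    """Expected minutes per year spent in the 'down' state."""
    return MINUTES_PER_YEAR * (1.0 - steady_state_availability(mtbf_min, mttr_min))


# Hypothetical component: fails every 60 days on average, recovers in 2 minutes.
mtbf = 60 * 24 * 60
print(f"{annual_downtime_minutes(mtbf, 2.0):.2f} minutes of downtime per year")
```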
Illustrating the use of AVED
• Application tier scenario
• Published component failure rates (hardware)
• Costs based on vendor published prices
• Reasonable assumptions for unavailable information
  – E.g. SW failure rates
AVED GUI
[Screenshots: Infrastructure Model (Component Types), Infrastructure Model (Resource Types), Service Model, Service Requirements, Output Design]
Example of AVED use – Results
• Identifies optimal solution for a range of service requirements: load and maximum annual downtime
  – Design 9 optimal for requirement (1000 load units, 100 minutes max downtime)
  – Design 10 optimal for requirement (2000 load units, 100 minutes max downtime)
[Figure: optimal design for each requirement. x-axis: Units of load (service specific), 0 to 5000; y-axis: Annual Downtime Limit [mins], log scale, 0.1 to 10000]
Designs:
1 - lp2000r/linux/AS-A, bronze, no spare
2 - lp2000r/linux/AS-A, silver, no spare
3 - lp2000r/linux/AS-A, gold, no spare
4 - lp2000r/linux/AS-B, gold, no spare
5 - lp2000r/linux/AS-A, platinum, no spare
6 - lp2000r/linux/AS-A, bronze, 1 cold spare
7 - lp2000r/linux/AS-B, bronze, 1 cold spare
8 - lp2000r/linux/AS-B, silver, 1 cold spare
9 - lp2000r/linux/AS-A, bronze, 1 active spare
10 - lp2000r/linux/AS-A, silver, 1 active spare
11 - lp2000r/linux/AS-A, gold, 1 active spare
12 - lp2000r/linux/AS-B, gold, 1 active spare
13 - lp2000r/linux/AS-A, platinum, 1 active spare
14 - lp2000r/linux/AS-B, platinum, 1 active spare
15 - lp2000r/linux/AS-A, bronze, 2 active spare
16 - lp2000r/linux/AS-A, silver, 2 active spare
17 - lp2000r/linux/AS-A, bronze, 3 active spare
Example of AVED use – Results
• Availability/Cost tradeoff
  – e.g., relaxing downtime requirement from 1.5 min/yr to 2.5 min/yr in a design for 800 load units reduces cost from $9500 to $6600
[Figure: Additional Annual Cost for Availability (0 to 14000) vs. Downtime (minutes, log scale, 0.1 to 10000); one curve each for Load = 400, 800, 1,600, and 3,200 units]
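One way to read the availability/cost tradeoff above is as a Pareto front: a design is worth considering only if no cheaper design achieves equal or lower downtime. The Python sketch below is an illustration of that filter, not part of AVED. The first two design points use the cost and downtime figures quoted on the slide for 800 load units; the third design and all names are hypothetical.

```python
from typing import List, Tuple

# (name, additional annual cost in $, annual downtime in minutes)
DesignPoint = Tuple[str, float, float]


def pareto_front(designs: List[DesignPoint]) -> List[DesignPoint]:
    """Keep designs that are not dominated in both cost and downtime."""
    front: List[DesignPoint] = []
    # Walk from cheapest to most expensive; a pricier design earns its place
    # only by achieving strictly lower downtime than everything kept so far.
    for name, cost, downtime in sorted(designs, key=lambda d: (d[1], d[2])):
        if not front or downtime < front[-1][2]:
            front.append((name, cost, downtime))
    return front


candidates = [
    ("strict", 9500.0, 1.5),       # figures quoted on the slide
    ("relaxed", 6600.0, 2.5),      # figures quoted on the slide
    ("dominated", 10000.0, 2.0),   # hypothetical: costs more than "strict", worse downtime
]

for name, cost, downtime in pareto_front(candidates):
    print(f"{name}: ${cost:.0f} for {downtime} min/yr")
```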
Automated Design in Self-Managing System
[Figure: self-managing system. Labels: service specification, service descriptions, automated design engine, resource allocation engine, infrastructure repository, automated deployment (deploy/configure), managed infrastructure, monitoring infrastructure (monitor), automated runtime mgmt]
Integrating AVED w/ automatic deployment
• Automatic OS deployment using network boot (PXE)
• Automatic application deployment/configuration
• Automatic configuration of failure detection, repair and failover mechanisms
• Extensions for closed-loop adaptive operation
  – Integrate monitoring mechanisms (load, performance, recovery time, failure rates)
  – Adapt AVED for incremental design/configuration changes
Conclusion
• AVED: proof of concept tool that automates the design and configuration of availability mechanisms
  – Important building block in utility computing environment
  – Also useful in stand-alone mode
• Future work
  – Extended cost models
  – More complex performance models
    • Experimental models based on monitoring
  – Factor business cost of downtime
  – Alternative availability metrics
    • Outage duration, degraded performance levels, etc.
  – Support for storage and network resources
  – Failure dependencies/correlations
Example of AVED use – Results
• Downtime is sensitive to repair parameters, e.g., the OS boot time.
  – Sensitivity analysis necessary for parameters with unreliable values
  – Can select designs with lower sensitivity to such parameters
[Figure: Annual Downtime (minutes), 0 to 2000, vs. OS boot time (minutes), 0 to 8; one curve each for Load = 400, 800, 1,600, and 3,200 units]
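A sensitivity analysis of the kind described above can be sketched as a parameter sweep: hold everything else fixed, vary the OS boot time, and watch how the downtime estimate moves. The Python example below uses a deliberately simple single-node model that is an assumption of this sketch, not AVED's evaluation engine. The default MTBF, the extra recovery time, and all function names are hypothetical.

```python
from typing import Iterable, List, Tuple

MINUTES_PER_YEAR = 365 * 24 * 60


def annual_downtime(
    os_boot_min: float, mtbf_days: float = 60.0, other_recovery_min: float = 1.0
) -> float:
    """Annual downtime of one node whose repair is a reboot (toy model)."""
    mttr = os_boot_min + other_recovery_min
    mtbf = mtbf_days * 24 * 60
    return MINUTES_PER_YEAR * mttr / (mtbf + mttr)


def sweep(boot_times: Iterable[float]) -> List[Tuple[float, float]]:
    """Evaluate the model at each boot time, keeping other parameters fixed."""
    return [(t, annual_downtime(t)) for t in boot_times]


def slope(points: List[Tuple[float, float]]) -> float:
    """Average change in downtime per extra minute of boot time."""
    (t0, d0), (t1, d1) = points[0], points[-1]
    return (d1 - d0) / (t1 - t0)


points = sweep(range(0, 9))  # boot times of 0 to 8 minutes
for boot, downtime in points:
    print(f"boot {boot} min -> {downtime:.1f} min/yr")
print(f"about {slope(points):.1f} extra min/yr per minute of boot time")
```

Running the same sweep for several candidate designs and comparing their slopes is one way to prefer designs that are less sensitive to a poorly known parameter.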
Parameters used in example
• design choices: type of machine/OS/application server resource, number of extra machines, state of extra machines (cold or active), repair option

Inputs: Component Behaviors & Costs
Component | Cost Cold | Cost Active | Failure Type | MTBF | Repair Option | MTTR | Repair Cost | Failover Time
Machine A lp2000r | $2400 | $2640 | Transient | 75 days | Reset | 30 sec | $0 | N/A
 | | | Permanent | 650 days | Bronze | 38 hrs | $380/node | 2 min
 | | | | | Silver | 15 hrs | $580/node |
 | | | | | Gold | 8 hrs | $750/node |
 | | | | | Platinum | 6 hrs | $1500/node |
Linux | $0 | $0 | Crash | 60 days | Reboot | 2 min | $0 | N/A
Unix | $200 | $0 | Crash | 365 days | Reboot | 4 min | $0 | N/A
App Server AS-A | $1700 | $0 | Crash | 30 days | Reboot | 2 min | $0 | N/A
App Server AS-B | $2000 | $0 | Crash | 90 days | Reboot | 30 sec | $0 | N/A
Machine B M-B | $85000 | $93500 | Transient | 150 days | Reset | 30 sec | $0 | N/A
 | | | Permanent | 1300 days | Bronze | 38 hrs | $380/node | 2 min
 | | | | | Silver | 15 hrs | $580/node |
 | | | | | Gold | 8 hrs | $750/node |
 | | | | | Platinum | 6 hrs | $1500/node |

Inputs: Service Characteristics
Resource | Performance Model: Node capacity | Performance Model: Max Nodes | Cluster?
lp2000r/Linux/AS-A | 200 units | 25 | true
M-B/Unix/AS-A | 1600 units | 25 | true
lp2000r/Linux/AS-B | 200 units | 25 | true
M-B/Unix/AS-B | 1600 units | 25 | true
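To show how parameters like these might feed an availability estimate, the Python sketch below combines the MTBF and MTTR values listed above for the lp2000r/Linux/AS-A resource. It rests on assumptions that are not from the slides: failures are independent, a failure of any component takes the whole resource down, each failure mode behaves as a two-state up/down model, and permanent hardware failures are ignored. The function names are hypothetical.

```python
from math import prod
from typing import Dict, Iterable, Tuple

MINUTES_PER_YEAR = 365 * 24 * 60
DAY = 24 * 60  # minutes in a day

# (MTBF in minutes, MTTR in minutes) for the transient/crash failure modes,
# using the values listed in the component parameters for this example.
COMPONENTS: Dict[str, Tuple[float, float]] = {
    "lp2000r": (75 * DAY, 0.5),  # transient failure, 30 sec reset
    "Linux": (60 * DAY, 2.0),    # crash, 2 min reboot
    "AS-A": (30 * DAY, 2.0),     # crash, 2 min reboot
}


def availability(mtbf_min: float, mttr_min: float) -> float:
    """Steady-state availability of a single two-state component."""
    return mtbf_min / (mtbf_min + mttr_min)


def resource_availability(names: Iterable[str]) -> float:
    """Series composition: the resource is up only if every component is up."""
    return prod(availability(*COMPONENTS[name]) for name in names)


node = resource_availability(["lp2000r", "Linux", "AS-A"])
downtime = MINUTES_PER_YEAR * (1.0 - node)
print(f"single-node availability {node:.6f}, about {downtime:.1f} min/yr down")
```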
Relative cost
[Figure: Additional Annual Cost for Availability (%), 0 to 100, vs. Downtime (minutes, log scale, 0.1 to 10000); one curve each for Load = 400, 800, 1,600, and 3,200 units]