Investigating Survivability Strategies for Ultra-Large Scale (ULS) Systems

Vanderbilt University, Nashville, Tennessee
Institute for Software Integrated Systems

Jaiganesh Balasubramanian, www.dre.vanderbilt.edu/~jai
Dr. Aniruddha Gokhale, www.dre.vanderbilt.edu/~gokhale
Dr. Douglas C. Schmidt, www.dre.vanderbilt.edu/~schmidt
Dr. Sherif Abdelwahed, www.isis.vanderbilt.edu/~sherif




Motivating Scenario for ULS

• Impact of Service-Oriented Architectures on enterprise distributed real-time & embedded (DRE) ULS systems

  • Applications composed of an “operational string” of services

  • A service is an assembly of components

  • Dynamic (re)deployment of services into operational strings is necessary

• Performability = performance + survivability requirements

• Key challenges

  • Regulating & adapting to (dis)continuous changes in runtime environments

    • e.g., online prognostics, dependable upgrades

  • Satisfying tradeoffs between multiple (often conflicting) QoS demands

    • e.g., secure, real-time, reliable, etc.

  • Satisfying QoS demands in face of fluctuating and/or insufficient resources

    • e.g., mobile ad hoc networks (MANETs)


Some Performability Challenges for ULS Systems

• Performability challenges in dynamic provisioning of operational strings & services

• Service workloads & resource capacity issues – service placement depends on workloads & available resources

• Service accessibility patterns – service survivability depends on its sharing degree

• Differentiated levels of QoS – affects resource provisioning & survivability strategies

• Operational string & service failover – different failover units are possible, e.g., a whole operational string, part of an operational string, or one service at a time

• No one-size-fits-all dependability strategy – a single survivability strategy cannot be imposed on all services & operational strings

Application performability addressed by resolving service placement & survivability problems


Model of Approach

Model addresses various concerns:

• Per-service concern: Choice of implementation

• Depends on resources, compatibility with other components in assembly

• Coupling concern: Choice of invocation & communication mechanism used

• Sharing concern: Shared services need proactive survivability, since the failure of a shared service affects several services simultaneously

• Failure recovery concern: What is the unit of failover?

• Availability concerns: What is the degree of redundancy? What replication styles to use? Does it apply to whole assembly?

• Deployment concerns: How to select resources? How much sharing?

• Assembly concerns: What components to assemble dynamically? Configurations & optimizations for end-to-end performability?

Service placement & service survivability strategies address these concerns
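
The sharing concern can be made concrete with a small sketch. It assumes that a service's "sharing degree" is the number of operational strings that use it, and that services at or above a threshold are the candidates for proactive survivability. Both the definition and the names `sharing_degree`, `proactive_candidates`, and `threshold` are illustrative assumptions, not part of the model described in the slides.

```python
from collections import Counter

def sharing_degree(strings):
    """Count how many operational strings use each service.

    `strings` maps an operational-string name to its list of service names.
    Assumption: sharing degree = number of distinct strings using a service.
    """
    counts = Counter()
    for services in strings.values():
        counts.update(set(services))  # count each service once per string
    return counts

def proactive_candidates(strings, threshold=2):
    """Services shared by at least `threshold` strings (hypothetical policy)."""
    degrees = sharing_degree(strings)
    return sorted(name for name, n in degrees.items() if n >= threshold)

# Hypothetical operational strings for illustration only.
example_strings = {
    "track": ["sensor", "fusion", "display"],
    "alert": ["sensor", "planner"],
}
shared = proactive_candidates(example_strings)
```

In this example only `sensor` appears in both strings, so it alone would be flagged for a proactive strategy.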


Addressing the Service Placement Problem

Service placement algorithms must consider tradeoffs between providing performance to applications & providing survivability to applications, allocating resources either to primaries or replicas

• Service placement problem must consider:

  • Set of computation nodes attributed by:

    • Processing index or capacity

    • Memory index or capacity

    • Survivability index

  • Set of communication links attributed by:

    • Bandwidth index

    • Survivability index

  • Set of components attributed by:

    • Different implementations offering performance tradeoffs across quality dimensions

    • Different implementations consuming various amounts of resources

    • Constraints on being deployed as an assembly to offer a complete service

• Replica placement issues involve:

  • Different availability requirements for different assemblies of components:

    • Multiple replicas needed, tolerate non-availability of replicas based on importance of assemblies

    • Replica resource provisioning depending on replication schemes used

    • Load balancing of replicas if resources available, but this introduces run-time consistency problems
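
The inputs listed above can be sketched as a small data model with a greedy placement heuristic. This is a minimal illustration, not the authors' algorithm: the scoring rule (implementation quality times node survivability), the rule that a primary and its replicas go on distinct nodes, and all names (`Node`, `Impl`, `place`) are assumptions. Communication links and assembly constraints are omitted for brevity.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu: float            # remaining processing capacity (hypothetical units)
    mem: float            # remaining memory capacity (hypothetical units)
    survivability: float  # survivability index in [0, 1]; higher is better

@dataclass
class Impl:
    name: str
    cpu: float      # processing demand of this implementation
    mem: float      # memory demand of this implementation
    quality: float  # relative performance offered by this implementation

def place(components, nodes, replicas=1):
    """Greedily place one primary plus `replicas` replicas per component.

    `components` maps a component name to its candidate implementations.
    Each copy takes the (implementation, node) pair with the highest
    quality * survivability score that still fits; copies of the same
    component are kept on distinct nodes.
    """
    placement = {}
    for comp, impls in components.items():
        used = set()
        for _ in range(1 + replicas):
            best = None
            for impl in impls:
                for node in nodes:
                    if node.name in used:
                        continue
                    if node.cpu < impl.cpu or node.mem < impl.mem:
                        continue
                    score = impl.quality * node.survivability
                    if best is None or score > best[0]:
                        best = (score, impl, node)
            if best is None:
                raise RuntimeError(f"insufficient resources for {comp}")
            _, impl, node = best
            node.cpu -= impl.cpu  # reserve resources on the chosen node
            node.mem -= impl.mem
            used.add(node.name)
            placement.setdefault(comp, []).append((impl.name, node.name))
    return placement

# Hypothetical nodes and implementations for illustration only.
nodes = [Node("n1", 4, 4, 0.9), Node("n2", 2, 2, 0.99)]
components = {"sensor": [Impl("fast", 3, 3, 1.0), Impl("lite", 1, 1, 0.6)]}
result = place(components, nodes, replicas=1)
```

Here the primary gets the `fast` implementation on `n1`, and the replica falls back to the cheaper `lite` implementation on `n2` because `fast` no longer fits anywhere else. This mirrors the tradeoff between giving resources to primaries and giving them to replicas.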


Addressing the Survivability Problem

• A configurable approach to survivability including micro- (infrastructure) & macro- (assembly & operational string) level strategies

• Micro-level strategies monitor infrastructure state to make proactive decisions at

• Component level (swapping & migration)

• Middleware level (configurations)

• Component Server Level (process resource allocations)

• Node level (multiple components)

• Macro-level strategies monitor assembly health to make failover decisions

• Failover based on type of failover unit

• Affects service placement decisions

• May involve load balancing

• State synchronization issues

• Replication styles (hidden by FT strategies)

• Initial prototype developed using Component-Integrated ACE ORB (CIAO) & Deployment & Configuration Engine (DAnCE) (www.dre.vanderbilt.edu)

• Future work on Data Distribution Service (DDS) & Distributed Real-time Specification for Java (DRTSJ)
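
How the type of failover unit changes a macro-level failover decision can be sketched as follows. The sketch assumes each service in an operational string has one primary host and one pre-provisioned backup host, and it covers only two of the possible units: a single service and the whole string. The names `FailoverUnit` and `failover`, and the dictionary layout, are hypothetical and are not taken from the CIAO/DAnCE prototype.

```python
from enum import Enum

class FailoverUnit(Enum):
    SERVICE = "service"  # fail over only the services hit by the failure
    STRING = "string"    # fail over the whole operational string together

def failover(string, failed_node, unit):
    """Return the service -> host mapping after `failed_node` fails.

    `string` is a list of dicts with keys "name", "primary", "backup".
    """
    affected = [s["name"] for s in string if s["primary"] == failed_node]
    if not affected:
        # The failure does not touch this operational string.
        return {s["name"]: s["primary"] for s in string}
    if unit is FailoverUnit.STRING:
        # Move every service, keeping the string co-located on backups.
        return {s["name"]: s["backup"] for s in string}
    # Per-service failover: move only the services on the failed node.
    return {
        s["name"]: s["backup"] if s["primary"] == failed_node else s["primary"]
        for s in string
    }

# Hypothetical operational string for illustration only.
op_string = [
    {"name": "sensor", "primary": "n1", "backup": "n4"},
    {"name": "fusion", "primary": "n2", "backup": "n5"},
    {"name": "display", "primary": "n3", "backup": "n6"},
]
per_service = failover(op_string, "n2", FailoverUnit.SERVICE)
whole_string = failover(op_string, "n2", FailoverUnit.STRING)
```

Per-service failover moves only `fusion`, while whole-string failover moves all three services. The second option needs backup resources for the entire string, which is one way the failover unit affects service placement decisions.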