Why Recovery Should Be Free, And Often Can Be Armando Fox, Stanford University June 2003 ROC Retreat

Why Recovery Should Be Free,Why Recovery Should Be Free,And Often Can BeAnd Often Can Be

Armando FoxArmando Fox, Stanford University, Stanford University

June 2003 ROC RetreatJune 2003 ROC Retreat

© 2003 Armando Fox

Recovery Should Be Free, and Can Recovery Should Be Free, and Can BeBe

Already espouse arguments about lowering MTTR:Already espouse arguments about lowering MTTR: Mitigates impact on service as a whole [Fox & Patterson, Mitigates impact on service as a whole [Fox & Patterson,

2002]2002]

Results in higher end-user-perceived availability, given Results in higher end-user-perceived availability, given same overall availability [Xie et al. 2002]same overall availability [Xie et al. 2002]

etcetc

Tim Chou, Oracle: maybe more important to make Tim Chou, Oracle: maybe more important to make recovery recovery predictable predictable (so can plan provisioning, anticipate (so can plan provisioning, anticipate impact of outage, etc.)...if we understand it, we can impact of outage, etc.)...if we understand it, we can optimize its speedoptimize its speed

© 2003 Armando Fox

Real win: Recovery management is Real win: Recovery management is hardhard

Determining when to recover is hardDetermining when to recover is hard How to detect that something’s wrong?How to detect that something’s wrong?

How do you know when recovery is really necessary? (fail-How do you know when recovery is really necessary? (fail-stutter, etc.)stutter, etc.)

Will recovery make things worse? (cascading recovery)Will recovery make things worse? (cascading recovery)

Knowing what happens when you recover is hardKnowing what happens when you recover is hard Will a particular recovery technique work? (the machinery Will a particular recovery technique work? (the machinery

needed to perform the recovery may also be broken)needed to perform the recovery may also be broken)

What is the effect on online performance? (recovery can be What is the effect on online performance? (recovery can be expensive)expensive)

What if you needlessly “over-recover”? (cost of making a What if you needlessly “over-recover”? (cost of making a mistake is high)mistake is high)

If recovery were predictable and fast, it would simplify both If recovery were predictable and fast, it would simplify both failure detection failure detection and and recovery management.recovery management.

© 2003 Armando Fox

Simplifying Recovery Management: Crash-Only Simplifying Recovery Management: Crash-Only SoftwareSoftware

Goal: enforce simple invariants on Goal: enforce simple invariants on recovery recovery behavior, behavior, from from outside outside the component(s) being recoveredthe component(s) being recovered

Crash-only component provides PWR switch: Crash-only component provides PWR switch: stop = stop = crashcrash::

clean shutdown = loss of power = kernel panic = ...clean shutdown = loss of power = kernel panic = ...

One way to go down One way to go down one way to come up: one way to come up: start = start = recoverrecover

Power switch is Power switch is externalexternal uniform behavioruniform behavior killkill -9, -9, “turning off” (process kill) a VM, pull power cord“turning off” (process kill) a VM, pull power cord Intuition: the “infrastructure” supporting the power switch is Intuition: the “infrastructure” supporting the power switch is

usually usually simpler simpler than the applications using it, and common than the applications using it, and common across all those applicationsacross all those applications

Can crash-only software actually be built, and if so, how?Can crash-only software actually be built, and if so, how? (a) provide building blocks(a) provide building blocks

(b) formalize C/O definition and provide developer (b) formalize C/O definition and provide developer

© 2003 Armando Fox

Crash-only Building BlocksCrash-only Building Blocks

JAGR/ROC-2, a self-recovering J2EE app server [Candea et al., JAGR/ROC-2, a self-recovering J2EE app server [Candea et al., WIAPP 2003]WIAPP 2003] Micro-reboots used for recovery, application-generic failure-path Micro-reboots used for recovery, application-generic failure-path

inference used for determining recovery strategyinference used for determining recovery strategy Significantly improves performability relative to whole-app redeploySignificantly improves performability relative to whole-app redeploy

SSM: a CO session state manager [Ling, Fox, AMS 2003]SSM: a CO session state manager [Ling, Fox, AMS 2003]

DStore: a CO persistent single-key state manager [Huang, Fox, DStore: a CO persistent single-key state manager [Huang, Fox, submitted to SRDS 2003]submitted to SRDS 2003] Similar in spirit to HP Labs FAB [Frolund, Saito et al., 2003]Similar in spirit to HP Labs FAB [Frolund, Saito et al., 2003]

Common features of both SSM and DStore:Common features of both SSM and DStore: Redundancy used for persistenceRedundancy used for persistence Workload semantics exploited to simplify consistency model & Workload semantics exploited to simplify consistency model &

recoveryrecovery Recovery=restart, safe to reboot any node at any timeRecovery=restart, safe to reboot any node at any time Safe to coerce any failure to a crash (fail-stop) at any timeSafe to coerce any failure to a crash (fail-stop) at any time

© 2003 Armando Fox

Building blocks, cont.Building blocks, cont.

Pinpoint, statistical-anomaly-based failure detectionPinpoint, statistical-anomaly-based failure detection Standard tension: accuracy vs. precision (false positives Standard tension: accuracy vs. precision (false positives

problem)problem)

Different clustering techniques seem to be good at Different clustering techniques seem to be good at detecting different kinds of problemsdetecting different kinds of problems Surprising result from a CS241 project: character-frequency Surprising result from a CS241 project: character-frequency

histograms are a good app-generic way to detect end-user-histograms are a good app-generic way to detect end-user-visible failuresvisible failures

Mostly integrated with JAGR and SSMMostly integrated with JAGR and SSM

On burner: discussions with BEA Systems for integrating into On burner: discussions with BEA Systems for integrating into WebLogic ServerWebLogic Server

Insight: if cost of “over-recovering” is low, aggressive Insight: if cost of “over-recovering” is low, aggressive statistics-based failure detection becomes more appealingstatistics-based failure detection becomes more appealing

© 2003 Armando Fox

Toward a crash-only formalismToward a crash-only formalism

Component frameworks force you into certain app-writing Component frameworks force you into certain app-writing patternspatterns Inter-EJB calls through runtime-managed level of indirectionInter-EJB calls through runtime-managed level of indirection

Restrictions on how persistent state mgt can be expressedRestrictions on how persistent state mgt can be expressed

Restrictions on state sharing: difficult to do without using Restrictions on state sharing: difficult to do without using explicit external storeexplicit external store

Hypothesis: these are the elements that allow C/O to workHypothesis: these are the elements that allow C/O to work

Ongoing work: formalize crash-only SWOngoing work: formalize crash-only SW One possibility: One possibility: observational equivalenceobservational equivalence with respect to a with respect to a

request streamrequest stream

Can be expressed using a Can be expressed using a design pattern design pattern or or denotational denotational semanticssemantics

Ideally, will lead to a tool (“co-lint”) telling you whether your Ideally, will lead to a tool (“co-lint”) telling you whether your component is crash-onlycomponent is crash-only

© 2003 Armando Fox

Summary: Toward a Crash-only Summary: Toward a Crash-only WorldWorld

Goal: simplify Goal: simplify recovery managementrecovery management diagnosisdiagnosis: statistical methods even more appealing if the cost of : statistical methods even more appealing if the cost of

making a mistake is lowmaking a mistake is low

recoveryrecovery: crash-only enforces invariants about what happens : crash-only enforces invariants about what happens when recovery is attemptedwhen recovery is attempted

allows aggressive use of fault model enforcement [Martin et al allows aggressive use of fault model enforcement [Martin et al 2002]2002]

Good progress on providing building blocks for app writersGood progress on providing building blocks for app writers JAGR: J2EE app server that allows fast recovery via micro-reboots JAGR: J2EE app server that allows fast recovery via micro-reboots

and application-generic fault injectionand application-generic fault injection

SSM: a crash-only session state store (in process of integrating SSM: a crash-only session state store (in process of integrating with JAGR)with JAGR)

DStore: a crash-only persistent single-key storeDStore: a crash-only persistent single-key store

PinPoint: statistics-based failure detection (integrated with JAGR, PinPoint: statistics-based failure detection (integrated with JAGR, mostly integrated with SSM)mostly integrated with SSM)

© 2003 Armando Fox

Xie et al: MTTR and End-User Xie et al: MTTR and End-User AvailabilityAvailability

Let ALet AUU=user-perceived unavailability, A=user-perceived unavailability, ASS=system unavailability=system unavailability

Hypothesis: if users retry failed requests, and retry succeeds Hypothesis: if users retry failed requests, and retry succeeds because system had fast recovery, they will perceive higher because system had fast recovery, they will perceive higher availabilityavailability When retry rate is sufficiently frequent, AWhen retry rate is sufficiently frequent, AUU approaches A approaches ASS (for A (for ASS

=99.3%, this threshold is 200-300 sec)=99.3%, this threshold is 200-300 sec)

Method: model user retry behavior and system failure/recovery Method: model user retry behavior and system failure/recovery using Markov models; solve using numerical methodsusing Markov models; solve using numerical methods

Finding: Given 2 systems with same AFinding: Given 2 systems with same ASS, the one with shorter , the one with shorter

MTTR (MTTR (even though it also has lower MTTF)even though it also has lower MTTF) appears better to appears better to the user.the user.

Goal of this project: validate that result empirically (Jeff Goal of this project: validate that result empirically (Jeff Raymakers, Yee-Jiun Song, Wendy Tobagus)Raymakers, Yee-Jiun Song, Wendy Tobagus)

© 2003 Armando Fox

User perceived unavailability vs retry User perceived unavailability vs retry raterate

“sweet spot” Higher user retry rates yields little improvement in perceived availability.

© 2003 Armando Fox

“sweet spot”At low MTTR, lowering MTTR and MTTF at the same time results in worse user perceived unavailability!Variable MTTR, but fixed system

availability (low MTTR -> low MTTF)

Surprise! MTTF eventually catches up with Surprise! MTTF eventually catches up with youyou

© 2003 Armando Fox

Optimization ChoicesOptimization Choices

Fixed MTTF

Fixed MTTR

System Unavailability

User Perceived Unavailability

© 2003 Armando Fox

Results SummaryResults Summary

We can find a “sweet spot” (for a given system We can find a “sweet spot” (for a given system availability) beyond which higher user retry rates availability) beyond which higher user retry rates yield little benefit.yield little benefit.

For two systems of a given availability, the one For two systems of a given availability, the one with lower MTTR does not always yield better user with lower MTTR does not always yield better user perceived availability.perceived availability.

For a given system, we can determine whether For a given system, we can determine whether improving MTTR or MTTF will yield more user-improving MTTR or MTTF will yield more user-visible benefits.visible benefits.

© 2003 Armando Fox

““Clean” shutdown vs. restart?Clean” shutdown vs. restart? Impractical to guarantee zero crashes Impractical to guarantee zero crashes robust robust

systems must be crash-safe anywaysystems must be crash-safe anyway In that case, why support any other kind of shutdown? In that case, why support any other kind of shutdown? Historically, for Historically, for performanceperformance (avoid synchronous writes, (avoid synchronous writes,

do buffering/caching, etc) - leads to replicated/mirrored do buffering/caching, etc) - leads to replicated/mirrored state, more code, special recovery code paths... state, more code, special recovery code paths...

Crash-only software must:(a) be crash-safe & (b) recover quickly

Total recovery time may be shorter even if crash is forced WinXP can be

(mostly) crash-rebooted for upgrades

VMS sysadmins would sometimes crash the system rather than shut it down (if no users were logged on)

© 2003 Armando Fox

Why Crash-Only Simplifies Why Crash-Only Simplifies RecoveryRecovery

““Hardware works, software doesn’t”Hardware works, software doesn’t” Hardware interlocks, timers, etc. have small state spaces Hardware interlocks, timers, etc. have small state spaces

of behavior, hence high confidence they will work as of behavior, hence high confidence they will work as designeddesigned

Crash-only PWR switch is a way to approach that same Crash-only PWR switch is a way to approach that same property for softwareproperty for software

Crash-only makes recovery policies easier to Crash-only makes recovery policies easier to reason aboutreason about Opportunity to aggressively apply SW rejuvenationOpportunity to aggressively apply SW rejuvenation ““Recovery” code exercised on every restart; no exotic-but-Recovery” code exercised on every restart; no exotic-but-

rarely-used code pathsrarely-used code paths ““Over-recovery” may be OK from performability Over-recovery” may be OK from performability

standpoint: if recovery is free (performance & standpoint: if recovery is free (performance & correctness), you stop thinking about it as correctness), you stop thinking about it as recovery recovery and and start thinking about it as start thinking about it as normal aspect of operationnormal aspect of operation

© 2003 Armando Fox

Towards a Crash-Only WorldTowards a Crash-Only World

Existing software that is crash-only or near-crash-onlyExisting software that is crash-only or near-crash-only Stateless apps: most Web serversStateless apps: most Web servers Most RDBMS’s: crash-safe, but long recoveryMost RDBMS’s: crash-safe, but long recovery Postgres, BerkeleyDB/Sleepycat: “recovery” codepath is the main Postgres, BerkeleyDB/Sleepycat: “recovery” codepath is the main

codepathcodepath Some appliance storage devices: separate but pretty fast recovery Some appliance storage devices: separate but pretty fast recovery

pathpath

Our goals...Our goals... Focus on Internet (“3 tier”) applications; already “crash-mostly” Focus on Internet (“3 tier”) applications; already “crash-mostly”

except for persistence tier(s)except for persistence tier(s) Make the app server, middle-tier persistence, and back-end tier (to Make the app server, middle-tier persistence, and back-end tier (to

the extent possible) truly crash-onlythe extent possible) truly crash-only Deploy application-generic failure detection techniques (which may Deploy application-generic failure detection techniques (which may

over-recover, but the goal is to make that OK)over-recover, but the goal is to make that OK) Quantify improvement (we hope!) in performability resulting from Quantify improvement (we hope!) in performability resulting from

these changesthese changes By doing it in the middleware, any app on that middleware can benefitBy doing it in the middleware, any app on that middleware can benefit

Documents

Why Recovery Should Be Free, And Often Can Be Armando Fox, Stanford University June 2003 ROC Retreat