
Designing Services for Resilience Experiments: Lessons from Netflix

Nora Jones, Senior Chaos Engineer
@nora_js

So, how can teams design services for resilience testing?

● Failure Injection Enabled
● RPC enabled
● Fallback Paths
  ○ And ways to discover them
● Proper monitoring
  ○ Key business metrics to look for
● Proper timeouts
  ○ And ways to discover them

Known Ways to Increase Confidence in Resilience

● Unit Tests
● Integration Tests

New Ways to Increase Confidence in Resilience

● Chaos Experiments

SPS (Stream Starts per Second): Key Business Metric

Chaos Engineering: Netflix’s ChAP

[Diagram: the API gateway sends calls through to the Personalization service. ChAP siphons off 1% of traffic to a control cluster and 1% to an experiment cluster where failure is injected; the remaining 98% takes the normal production path.]
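A minimal sketch of that routing idea, not ChAP’s actual implementation: the class, enum, and fractions below are illustrative only.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative ChAP-style traffic split: 1% control, 1% experiment, 98% production. */
enum Route { CONTROL, EXPERIMENT, PRODUCTION }

final class ChaosTrafficSplitter {
    private final double controlFraction;
    private final double experimentFraction;

    ChaosTrafficSplitter(double controlFraction, double experimentFraction) {
        this.controlFraction = controlFraction;
        this.experimentFraction = experimentFraction;
    }

    /** Assign each request to a route; experiment traffic gets failure injected downstream. */
    Route route() {
        double r = ThreadLocalRandom.current().nextDouble();
        if (r < controlFraction) return Route.CONTROL;                          // baseline for comparison
        if (r < controlFraction + experimentFraction) return Route.EXPERIMENT;  // failure injected
        return Route.PRODUCTION;                                                // untouched majority
    }

    public static void main(String[] args) {
        ChaosTrafficSplitter splitter = new ChaosTrafficSplitter(0.01, 0.01);
        System.out.println("This request goes to: " + splitter.route());
    }
}
```

Comparing the control and experiment clusters against each other (rather than against all of production) keeps the comparison apples-to-apples while limiting blast radius.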

Monitoring

[Dashboard screenshot: an experiment marked SHORTED, i.e. automatically stopped when the key business metric deviated.]

1. Have Failure Injection Testing Enabled.

Sample Failure Injection Library

https://github.com/norajones/FailureInjectionLibrary
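The linked repository shows one concrete approach; below is a much-simplified, hedged sketch of the idea ("failure injection enabled" means there is a hook in the call path that can add latency or throw on demand). The class and method names here are invented for illustration and are not from the linked library.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

/** Simplified failure-injection hook (illustrative only; see the linked library for a real example). */
final class FailureInjector {
    enum Mode { NONE, LATENCY, EXCEPTION }

    private volatile Mode mode = Mode.NONE;   // toggled per experiment, e.g. via a property or request header
    private volatile long latencyMs = 0;

    void configure(Mode mode, long latencyMs) {
        this.mode = mode;
        this.latencyMs = latencyMs;
    }

    /** Wrap an outbound call; inject the configured failure before delegating. */
    <T> T aroundCall(Callable<T> call) throws Exception {
        switch (mode) {
            case LATENCY:
                TimeUnit.MILLISECONDS.sleep(latencyMs);          // simulate a slow dependency
                break;
            case EXCEPTION:
                throw new RuntimeException("injected failure");  // simulate a failing dependency
            case NONE:
            default:
                break;
        }
        return call.call();
    }
}
```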

Types of Chaos Failures

Criteria & API

Automating Creation of Chaos Experiments

2. Have Good Monitoring in Place for Configuration Changes.

Have Good Monitoring in Place

● RPC Enabled
  ○ Associated Hystrix Commands
    ■ Associated Fallbacks
● Timeouts
● Retries
● All in One Place!
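One way to read "All in One Place!": collect, per RPC client, its Hystrix commands, fallbacks, timeouts, and retries into a single view that monitoring and experiment tooling can consume. This is only a hypothetical sketch; the type and field names are illustrative.

```java
import java.util.List;

/** Illustrative per-dependency view: everything an experiment designer needs, in one place (Java 16+ record). */
record DependencyConfig(
        String rpcClientName,          // RPC/Ribbon client
        List<String> hystrixCommands,  // associated Hystrix commands
        List<String> fallbacks,        // associated fallbacks (if any)
        long timeoutMs,                // effective timeout
        int retries                    // configured retries
) {
    boolean hasFallback() {
        return !fallbacks.isEmpty();
    }
}
```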

RPC/Ribbon

● Java library managing REST clients to/from different services
● Fast failing/fallback capability

RPC/Ribbon Timeouts

At what point does the service give up?
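Ribbon clients answer that question through per-client timeout and retry properties. A hedged sketch using Archaius-style configuration; the client name "personalization" and the values are made up for illustration.

```java
import com.netflix.config.ConfigurationManager;

/** Illustrative Ribbon timeout settings for a hypothetical "personalization" client. */
public class RibbonTimeoutConfig {
    public static void main(String[] args) {
        // How long to wait to establish a connection, then to read a response.
        ConfigurationManager.getConfigInstance()
                .setProperty("personalization.ribbon.ConnectTimeout", 500);   // ms
        ConfigurationManager.getConfigInstance()
                .setProperty("personalization.ribbon.ReadTimeout", 2000);     // ms
        // Retries on the same server, and on other servers, after a failure.
        ConfigurationManager.getConfigInstance()
                .setProperty("personalization.ribbon.MaxAutoRetries", 0);
        ConfigurationManager.getConfigInstance()
                .setProperty("personalization.ribbon.MaxAutoRetriesNextServer", 1);
    }
}
```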

Retries

● Immediately retrying after a failed operation is usually not a great idea.
● Understand the interplay between your timeouts and your retries.
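Immediate retries pile load onto a dependency that is already struggling. A hedged sketch of bounded retries with exponential backoff and jitter; the limits and helper names are illustrative.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

/** Illustrative bounded retry with exponential backoff and jitter. */
final class Retrier {
    /** maxAttempts must be >= 1. */
    static <T> T callWithRetry(Callable<T> call, int maxAttempts, long baseDelayMs) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) break;
                // Back off exponentially with jitter instead of retrying immediately.
                long backoff = baseDelayMs * (1L << (attempt - 1));
                long jitter = ThreadLocalRandom.current().nextLong(baseDelayMs);
                TimeUnit.MILLISECONDS.sleep(backoff + jitter);
            }
        }
        throw last;
    }
}
```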

Circuit Breakers/Fallback Paths

Hystrix Commands/Fallback Paths

If your service is non-critical, ensure that there are fallback paths in place.
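For a non-critical call, a Hystrix command can return a degraded result instead of failing the request. A minimal sketch; the command name, dependency, and fallback value are made up for illustration.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

/** Illustrative Hystrix command for a non-critical dependency, with a static fallback. */
public class RecommendationsCommand extends HystrixCommand<String> {

    public RecommendationsCommand() {
        super(HystrixCommandGroupKey.Factory.asKey("Recommendations"));
    }

    @Override
    protected String run() {
        // Call the real (hypothetical) dependency here; failing triggers the fallback.
        throw new RuntimeException("dependency unavailable");
    }

    @Override
    protected String getFallback() {
        // Degraded but acceptable result for a non-critical feature.
        return "popular-titles-default-list";
    }

    public static void main(String[] args) {
        System.out.println(new RecommendationsCommand().execute()); // prints the fallback
    }
}
```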

Fallback Strategies

● Static Content
● Cache
● Fallback Service

Know what your fallback strategy is and how to get that information.

3. Ensure Synergy between Hystrix Timeouts, RPC Timeouts, and Retry Logic.
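A hedged worked example of that synergy (all numbers are illustrative): the Hystrix timeout should cover the worst case the RPC layer can take once connect/read timeouts and both retry dimensions are accounted for; otherwise Hystrix gives up while Ribbon is still retrying.

```java
/** Illustrative check that a Hystrix timeout covers Ribbon's worst-case duration. */
public class TimeoutSynergy {
    public static void main(String[] args) {
        long connectTimeoutMs = 500;
        long readTimeoutMs = 2000;
        int maxAutoRetries = 0;           // retries on the same server
        int maxAutoRetriesNextServer = 1; // retries on other servers

        // Worst case: every attempt burns the full connect + read budget.
        long worstCaseRpcMs = (connectTimeoutMs + readTimeoutMs)
                * (maxAutoRetries + 1)
                * (maxAutoRetriesNextServer + 1);   // = 5000 ms here

        long hystrixTimeoutMs = 4000;
        if (hystrixTimeoutMs < worstCaseRpcMs) {
            System.out.println("Antipattern: Hystrix gives up at " + hystrixTimeoutMs
                    + " ms while the RPC layer may still be retrying until " + worstCaseRpcMs + " ms.");
        }
    }
}
```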

ChAP’s Monocle

There isn’t always money in microservices

Criticality Score

RPS Stats Range bucket * number of retries * number of Hystrix Commands = Criticality Score
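A hedged sketch of that formula as code. The slide only states the product; the RPS-to-bucket mapping below is an assumption made for illustration.

```java
/** Illustrative criticality score: RPS stats range bucket * retries * Hystrix commands. */
public class CriticalityScore {

    // Hypothetical bucketing of requests-per-second into a small range score.
    static int rpsBucket(double requestsPerSecond) {
        if (requestsPerSecond < 10) return 1;
        if (requestsPerSecond < 100) return 2;
        if (requestsPerSecond < 1000) return 3;
        return 4;
    }

    static long score(double requestsPerSecond, int numRetries, int numHystrixCommands) {
        return (long) rpsBucket(requestsPerSecond) * numRetries * numHystrixCommands;
    }

    public static void main(String[] args) {
        // A dependency called at ~250 RPS, with 2 retries and 3 Hystrix commands.
        System.out.println(score(250, 2, 3)); // 3 * 2 * 3 = 18
    }
}
```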

Chaos Success Stories

“We ran a chaos experiment which verifies that our fallback path works, and it successfully caught an issue in the fallback path; the issue was resolved before it resulted in any availability incident!”

“While [failing calls] we discovered an increase in license requests for the experiment cluster even though fallbacks were all successful. ...This likely means that whoever was consuming the fallback was retrying the call, causing an increase in license requests.”

Don’t lose sight of your company’s customers.

Takeaways

● Designing for resiliency testability is a shared responsibility.

● Configuration changes can cause outages.
● Have explicit monitoring in place for antipatterns in configuration changes.


Questions? @nora_js
