
Designing Services for Resilience Experiments: Lessons from Netflix

Nora Jones, Senior Chaos Engineer
@nora_js

So, how can teams design services for resilience testing?

● Failure Injection Enabled
● RPC enabled
● Fallback Paths
  ○ And ways to discover them
● Proper monitoring
  ○ Key business metrics to look for
● Proper timeouts
  ○ And ways to discover them

Known Ways to Increase Confidence in Resilience

● Unit Tests
● Integration Tests

New Ways to Increase Confidence in Resilience

● Chaos Experiments

SPS (Stream Starts per Second): Key Business Metric

Chaos Engineering: Netflix’s ChAP

[Diagram: the API gateway sends calls through to the Personalization service. ChAP siphons off 1% of traffic to a control cluster and 1% to an experiment cluster where failure is injected; the remaining 98% takes the normal production path.]
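A minimal sketch of that routing idea, not ChAP’s actual implementation: the class, enum, and fractions below are illustrative only.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative ChAP-style traffic split: 1% control, 1% experiment, 98% production. */
enum Route { CONTROL, EXPERIMENT, PRODUCTION }

final class ChaosTrafficSplitter {
    private final double controlFraction;
    private final double experimentFraction;

    ChaosTrafficSplitter(double controlFraction, double experimentFraction) {
        this.controlFraction = controlFraction;
        this.experimentFraction = experimentFraction;
    }

    /** Assign each request to a route; experiment traffic gets failure injected downstream. */
    Route route() {
        double r = ThreadLocalRandom.current().nextDouble();
        if (r < controlFraction) return Route.CONTROL;                          // baseline for comparison
        if (r < controlFraction + experimentFraction) return Route.EXPERIMENT;  // failure injected
        return Route.PRODUCTION;                                                // untouched majority
    }

    public static void main(String[] args) {
        ChaosTrafficSplitter splitter = new ChaosTrafficSplitter(0.01, 0.01);
        System.out.println("This request goes to: " + splitter.route());
    }
}
```

Comparing the control and experiment clusters against each other (rather than against all of production) keeps the comparison apples-to-apples while limiting blast radius.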

Monitoring

[Dashboard screenshot: an experiment marked SHORTED, i.e. automatically stopped when the key business metric deviated.]

1. Have Failure Injection Testing Enabled.

Sample Failure Injection Library

https://github.com/norajones/FailureInjectionLibrary
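The linked repository shows one concrete approach; below is a much-simplified, hedged sketch of the idea ("failure injection enabled" means there is a hook in the call path that can add latency or throw on demand). The class and method names here are invented for illustration and are not from the linked library.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

/** Simplified failure-injection hook (illustrative only; see the linked library for a real example). */
final class FailureInjector {
    enum Mode { NONE, LATENCY, EXCEPTION }

    private volatile Mode mode = Mode.NONE;   // toggled per experiment, e.g. via a property or request header
    private volatile long latencyMs = 0;

    void configure(Mode mode, long latencyMs) {
        this.mode = mode;
        this.latencyMs = latencyMs;
    }

    /** Wrap an outbound call; inject the configured failure before delegating. */
    <T> T aroundCall(Callable<T> call) throws Exception {
        switch (mode) {
            case LATENCY:
                TimeUnit.MILLISECONDS.sleep(latencyMs);          // simulate a slow dependency
                break;
            case EXCEPTION:
                throw new RuntimeException("injected failure");  // simulate a failing dependency
            case NONE:
            default:
                break;
        }
        return call.call();
    }
}
```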

Types of Chaos Failures

Criteria & API

Automating Creation of Chaos Experiments

2. Have Good Monitoring in Place for Configuration Changes.

Have Good Monitoring in Place

● RPC Enabled
  ○ Associated Hystrix Commands
    ■ Associated Fallbacks
● Timeouts
● Retries
● All in One Place!
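One way to read "All in One Place!": collect, per RPC client, its Hystrix commands, fallbacks, timeouts, and retries into a single view that monitoring and experiment tooling can consume. This is only a hypothetical sketch; the type and field names are illustrative.

```java
import java.util.List;

/** Illustrative per-dependency view: everything an experiment designer needs, in one place (Java 16+ record). */
record DependencyConfig(
        String rpcClientName,          // RPC/Ribbon client
        List<String> hystrixCommands,  // associated Hystrix commands
        List<String> fallbacks,        // associated fallbacks (if any)
        long timeoutMs,                // effective timeout
        int retries                    // configured retries
) {
    boolean hasFallback() {
        return !fallbacks.isEmpty();
    }
}
```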

RPC/Ribbon

● Java library managing REST clients to/from different services
● Fast failing/fallback capability

RPC/Ribbon Timeouts

At what point does the service give up?
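Ribbon clients answer that question through per-client timeout and retry properties. A hedged sketch using Archaius-style configuration; the client name "personalization" and the values are made up for illustration.

```java
import com.netflix.config.ConfigurationManager;

/** Illustrative Ribbon timeout settings for a hypothetical "personalization" client. */
public class RibbonTimeoutConfig {
    public static void main(String[] args) {
        // How long to wait to establish a connection, then to read a response.
        ConfigurationManager.getConfigInstance()
                .setProperty("personalization.ribbon.ConnectTimeout", 500);   // ms
        ConfigurationManager.getConfigInstance()
                .setProperty("personalization.ribbon.ReadTimeout", 2000);     // ms
        // Retries on the same server, and on other servers, after a failure.
        ConfigurationManager.getConfigInstance()
                .setProperty("personalization.ribbon.MaxAutoRetries", 0);
        ConfigurationManager.getConfigInstance()
                .setProperty("personalization.ribbon.MaxAutoRetriesNextServer", 1);
    }
}
```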

Retries

● Immediately retrying after a failed operation is usually not a great idea.
● Understand the interplay between your timeouts and your retries.
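Immediate retries pile load onto a dependency that is already struggling. A hedged sketch of bounded retries with exponential backoff and jitter; the limits and helper names are illustrative.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

/** Illustrative bounded retry with exponential backoff and jitter. */
final class Retrier {
    /** maxAttempts must be >= 1. */
    static <T> T callWithRetry(Callable<T> call, int maxAttempts, long baseDelayMs) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) break;
                // Back off exponentially with jitter instead of retrying immediately.
                long backoff = baseDelayMs * (1L << (attempt - 1));
                long jitter = ThreadLocalRandom.current().nextLong(baseDelayMs);
                TimeUnit.MILLISECONDS.sleep(backoff + jitter);
            }
        }
        throw last;
    }
}
```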

Circuit Breakers/Fallback Paths

Hystrix Commands/Fallback Paths

If your service is non-critical, ensure that there are fallback paths in place.
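For a non-critical call, a Hystrix command can return a degraded result instead of failing the request. A minimal sketch; the command name, dependency, and fallback value are made up for illustration.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

/** Illustrative Hystrix command for a non-critical dependency, with a static fallback. */
public class RecommendationsCommand extends HystrixCommand<String> {

    public RecommendationsCommand() {
        super(HystrixCommandGroupKey.Factory.asKey("Recommendations"));
    }

    @Override
    protected String run() {
        // Call the real (hypothetical) dependency here; failing triggers the fallback.
        throw new RuntimeException("dependency unavailable");
    }

    @Override
    protected String getFallback() {
        // Degraded but acceptable result for a non-critical feature.
        return "popular-titles-default-list";
    }

    public static void main(String[] args) {
        System.out.println(new RecommendationsCommand().execute()); // prints the fallback
    }
}
```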

Fallback Strategies

● Static Content
● Cache
● Fallback Service

Know what your fallback strategy is and how to get that information.

3. Ensure Synergy between Hystrix Timeouts, RPC Timeouts, and Retry Logic.
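A hedged worked example of that synergy (all numbers are illustrative): the Hystrix timeout should cover the worst case the RPC layer can take once connect/read timeouts and both retry dimensions are accounted for; otherwise Hystrix gives up while Ribbon is still retrying.

```java
/** Illustrative check that a Hystrix timeout covers Ribbon's worst-case duration. */
public class TimeoutSynergy {
    public static void main(String[] args) {
        long connectTimeoutMs = 500;
        long readTimeoutMs = 2000;
        int maxAutoRetries = 0;           // retries on the same server
        int maxAutoRetriesNextServer = 1; // retries on other servers

        // Worst case: every attempt burns the full connect + read budget.
        long worstCaseRpcMs = (connectTimeoutMs + readTimeoutMs)
                * (maxAutoRetries + 1)
                * (maxAutoRetriesNextServer + 1);   // = 5000 ms here

        long hystrixTimeoutMs = 4000;
        if (hystrixTimeoutMs < worstCaseRpcMs) {
            System.out.println("Antipattern: Hystrix gives up at " + hystrixTimeoutMs
                    + " ms while the RPC layer may still be retrying until " + worstCaseRpcMs + " ms.");
        }
    }
}
```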

ChAP’s Monocle

There isn’t always money in microservices

Criticality Score

RPS Stats Range bucket * number of retries * number of Hystrix Commands = Criticality Score
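A hedged sketch of that formula as code. The slide only states the product; the RPS-to-bucket mapping below is an assumption made for illustration.

```java
/** Illustrative criticality score: RPS stats range bucket * retries * Hystrix commands. */
public class CriticalityScore {

    // Hypothetical bucketing of requests-per-second into a small range score.
    static int rpsBucket(double requestsPerSecond) {
        if (requestsPerSecond < 10) return 1;
        if (requestsPerSecond < 100) return 2;
        if (requestsPerSecond < 1000) return 3;
        return 4;
    }

    static long score(double requestsPerSecond, int numRetries, int numHystrixCommands) {
        return (long) rpsBucket(requestsPerSecond) * numRetries * numHystrixCommands;
    }

    public static void main(String[] args) {
        // A dependency called at ~250 RPS, with 2 retries and 3 Hystrix commands.
        System.out.println(score(250, 2, 3)); // 3 * 2 * 3 = 18
    }
}
```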

Chaos Success Stories

“We ran a chaos experiment which verifies that our fallback path works, and it successfully caught an issue in the fallback path; the issue was resolved before it resulted in any availability incident!”

“While [failing calls] we discovered an increase in license requests for the experiment cluster even though fallbacks were all successful. ...This likely means that whoever was consuming the fallback was retrying the call, causing an increase in license requests.”

Don’t lose sight of your company’s customers.

Takeaways

● Designing for resiliency testability is a shared responsibility.

● Configuration changes can cause outages.
● Have explicit monitoring in place for antipatterns in configuration changes.


Questions? @nora_js
