Designing Services for Resilience Experiments: Lessons from Netflix
Nora Jones, Senior Chaos Engineer @nora_js


So, how can teams design services for resilience testing?

● Failure Injection Enabled
● RPC enabled
● Fallback Paths
  ○ And ways to discover them
● Proper monitoring
  ○ Key business metrics to look for
● Proper timeouts
  ○ And ways to discover them

Known Ways to Increase Confidence in Resilience

● Unit Tests
● Integration Tests

New Ways to Increase Confidence in Resilience

● Chaos Experiments

SPS (Stream Starts per Second): Key Business Metric


Chaos Engineering: Netflix’s ChAP

[Diagram: Gateway → API → Personalization. Normally 100% of traffic goes to API. During an experiment, 98% goes to API, 1% to API Control, and 1% to API Exp.]


Monitoring

[Figure: monitoring view of an experiment, labeled SHORTED]
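Assuming SHORTED marks an experiment that was stopped early because a key business metric such as SPS deviated (an interpretation; the slide shows only the label), the decision could be sketched as a comparison between the control and experiment groups. The function name `should_short` and the 5% threshold are hypothetical, chosen only for illustration.

```python
def should_short(control_sps, experiment_sps, max_relative_drop=0.05):
    """Return True when the experiment group's mean SPS falls more than
    max_relative_drop below the control group's mean SPS.

    Both arguments are non-empty lists of SPS samples taken over the
    same time window.
    """
    control_mean = sum(control_sps) / len(control_sps)
    experiment_mean = sum(experiment_sps) / len(experiment_sps)
    if control_mean == 0:
        return False  # nothing to compare against
    drop = (control_mean - experiment_mean) / control_mean
    return drop > max_relative_drop
```

A real system would likely use a statistical test rather than a plain mean comparison; the point here is only that the experiment group is judged against a same-sized control group, not against all production traffic.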


1. Have Failure Injection Testing Enabled.


Sample Failure Injection Library

https://github.com/norajones/FailureInjectionLibrary
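To make "failure injection enabled" concrete, here is a minimal sketch of an injection point wrapped around an outbound call. It is not the API of the library linked above; `FailureConfig`, `inject_failures`, and `fetch_recommendations` are hypothetical names for this example. It supports two kinds of failure: added latency and a raised error.

```python
import time
from dataclasses import dataclass
from functools import wraps


class InjectedFailure(Exception):
    """Raised when an experiment forces a call to fail."""


@dataclass
class FailureConfig:
    # Hypothetical experiment settings; a real system would scope these
    # to a small set of requests rather than one shared object.
    enabled: bool = False
    fail: bool = False          # raise instead of calling the dependency
    delay_seconds: float = 0.0  # latency added before the call


def inject_failures(config):
    """Decorator that applies the configured failure before the real call."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if config.enabled:
                if config.delay_seconds > 0:
                    time.sleep(config.delay_seconds)
                if config.fail:
                    raise InjectedFailure(func.__name__)
            return func(*args, **kwargs)
        return wrapper
    return decorator


config = FailureConfig()


@inject_failures(config)
def fetch_recommendations(user_id):
    # Stand-in for a remote call to a downstream service.
    return ["title-1", "title-2"]
```

With `config.enabled` left at `False` the wrapper is a pass-through, so the injection point can stay in production code and be switched on only for an experiment.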


Types of Chaos Failures


Criteria & API


Automating Creation of Chaos Experiments


2. Have Good Monitoring in Place for Configuration Changes.

Have Good Monitoring in Place

● RPC Enabled
  ○ Associated Hystrix Commands
    ■ Associated Fallbacks
● Timeouts
● Retries
● All in One Place!

RPC/Ribbon

● Java library managing REST clients to/from different services
● Fast failing/fallback capability


RPC/Ribbon Timeouts


At what point does the service give up?
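The question can be made concrete with a client-side timeout: the caller decides how long it is willing to wait and stops waiting after that. The sketch below uses Python's standard `concurrent.futures`, not Ribbon; `call_with_timeout` is a hypothetical helper name.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def call_with_timeout(func, timeout_seconds):
    """Run func in a worker thread and give up after timeout_seconds.

    Returns (ok, value). The worker is not cancelled; the caller simply
    stops waiting, which is what a client-side timeout amounts to.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return True, future.result(timeout=timeout_seconds)
    except TimeoutError:
        return False, None
    finally:
        # Do not block on a worker that may still be running.
        executor.shutdown(wait=False)
```

Note that giving up does not free the downstream service from finishing the work, which is one reason timeouts and retries need to be reasoned about together.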

Retries

Immediately retrying an operation after a failure is usually not a great idea.

Understand the logic between your timeouts and your retries.
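One common alternative to immediate retries is exponential backoff with jitter. The slides do not name a specific strategy, so this is an illustrative sketch; `backoff_delays`, `call_with_retries`, and the default values are assumptions.

```python
import random
import time


def backoff_delays(max_retries, base_seconds=0.1, cap_seconds=2.0, rng=random):
    """Yield one randomized delay per retry (exponential backoff, full jitter)."""
    for attempt in range(max_retries):
        ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
        yield rng.uniform(0, ceiling)


def call_with_retries(func, max_retries=3, sleep=time.sleep, **backoff):
    """Call func once, then retry up to max_retries times with backoff."""
    delays = backoff_delays(max_retries, **backoff)
    while True:
        try:
            return func()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted; surface the last error
            sleep(delay)
```

The total time a caller may spend is the sum of every attempt's timeout plus these delays, which is the link between retry settings and timeout settings.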


Circuit Breakers/Fallback Paths


Hystrix Commands/Fallback Paths

If your service is non-critical, ensure that there are fallback paths in place.


Fallback Strategies

● Static Content
● Cache
● Fallback Service

Know what your fallback strategy is and how to get that information.
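A fallback strategy is easier to know, and to report on, when the code records which path served each response. The sketch below is a plain-Python illustration rather than Hystrix code; `with_fallbacks` and the sample functions and data are hypothetical.

```python
def with_fallbacks(primary, fallbacks):
    """Try primary, then each fallback in order.

    Returns (source_name, value) so callers and dashboards can tell
    which path actually served the response.
    """
    for name, func in [("primary", primary)] + list(fallbacks):
        try:
            return name, func()
        except Exception:
            continue  # fall through to the next strategy
    raise RuntimeError("primary and all fallbacks failed")


# Hypothetical data for the example.
cache = {"user-1": ["cached-title"]}
STATIC_ROWS = ["popular-title"]


def personalized_rows():
    raise ConnectionError("personalization service unavailable")


def cached_rows():
    return cache["user-1"]


def static_rows():
    return STATIC_ROWS


source, rows = with_fallbacks(
    personalized_rows,
    [("cache", cached_rows), ("static", static_rows)],
)
```

Here the primary call fails, so the cached rows are served and `source` is `"cache"`. Emitting `source` as a metric is one way to get the information the slide asks for.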


3. Ensure Synergy between Hystrix Timeouts, RPC timeouts, and retry logic.
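One way to check this synergy mechanically is to compare the outer (Hystrix-style) timeout with the worst-case time of the wrapped RPC call and its retries. The sketch assumes retries run sequentially, that each attempt is bounded by the RPC timeout, and that backoff delays are ignored; the function names and the millisecond values used with them are hypothetical.

```python
def worst_case_rpc_ms(rpc_timeout_ms, max_retries):
    """Worst-case time for one call plus its retries, ignoring backoff."""
    return rpc_timeout_ms * (1 + max_retries)


def check_timeout_synergy(wrapper_timeout_ms, rpc_timeout_ms, max_retries):
    """Return a list of human-readable warnings; empty means consistent."""
    warnings = []
    if wrapper_timeout_ms < rpc_timeout_ms:
        warnings.append(
            "wrapper timeout is shorter than a single RPC attempt; "
            "the RPC timeout can never fire"
        )
    worst = worst_case_rpc_ms(rpc_timeout_ms, max_retries)
    if max_retries > 0 and wrapper_timeout_ms < worst:
        warnings.append(
            f"wrapper timeout {wrapper_timeout_ms} ms is below the "
            f"worst-case retry budget of {worst} ms; "
            "later retries may be cut off"
        )
    return warnings
```

For example, a 1000 ms wrapper timeout around a 400 ms RPC timeout with 2 retries yields one warning, because the worst case is 1200 ms. Running a check like this whenever configuration changes is one form of monitoring for configuration antipatterns.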


ChAP’s Monocle


There isn’t always money in microservices


Criticality Score

RPS Stats Range bucket * number of retries * number of Hystrix Commands = Criticality Score
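The formula can be sketched directly as a product. The slides do not give the RPS bucket boundaries, so the ones below are placeholders, and `rps_bucket` and `criticality_score` are hypothetical names.

```python
import bisect

# Hypothetical bucket boundaries (requests per second); the source does not
# give the actual ranges.
RPS_BUCKET_BOUNDS = [10, 100, 1000, 10000]


def rps_bucket(rps):
    """Map a requests-per-second figure to a 1-based bucket number."""
    return bisect.bisect_right(RPS_BUCKET_BOUNDS, rps) + 1


def criticality_score(rps, retries, hystrix_commands):
    """RPS bucket * number of retries * number of Hystrix commands."""
    return rps_bucket(rps) * retries * hystrix_commands
```

With these placeholder bounds, a dependency at 500 RPS with 2 retries and 4 Hystrix commands scores 3 * 2 * 4 = 24. Because the formula is a product, a dependency configured with zero retries scores zero; the slides do not say whether that case is handled separately.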


Chaos Success Stories


“We ran a chaos experiment which verifies that our fallback path works and it successfully caught an issue in the fallback path and the issue was resolved before it resulted in any availability incident!”

“While [failing calls] we discovered an increase in license requests for the experiment cluster even though fallbacks were all successful. ... This likely means that whoever was consuming the fallback was retrying the call, causing an increase in license requests.”


Don’t lose sight of your company’s customers.


Takeaways

● Designing for resiliency testability is a shared responsibility.
● Configuration changes can cause outages.
● Have explicit monitoring in place for antipatterns in configuration changes.

Questions? @nora_js