221
Using Hystrix to Build Resilient Distributed Systems Matt Jacobs Netflix

Using Hystrix to Build Resilient Distributed Systems

Embed Size (px)

Citation preview

Page 1: Using Hystrix to Build Resilient Distributed Systems

Using Hystrix to Build Resilient Distributed Systems

Matt JacobsNetflix

Page 2: Using Hystrix to Build Resilient Distributed Systems

Who am I?• Edge Platform team at Netflix• In a 24/7 Support Role for Netflix production

• (Not 365/24/7)• Maintainer of Hystrix OSS

• github.com/Netflix/Hystrix• I get a lot of internal questions too

Page 3: Using Hystrix to Build Resilient Distributed Systems

• Video on Demand• 1000s of device types• 1000s of microservices• Control Plane completely on AWS• Competing against products which have high availability

(network / cable TV)

Page 4: Using Hystrix to Build Resilient Distributed Systems

• 2013 – now• from 40M to 75M+ paying customers• from 40 to 190+ countries (#netflixeverywhere)• In original programming, from House of Cards to hundreds of hours

a year, in documentaries / movies / comedy specials/ TV series

Page 5: Using Hystrix to Build Resilient Distributed Systems

Edge Platform @ Netflix• Goal: Maximize availability of Netflix API

• API used by all customer devices

Page 6: Using Hystrix to Build Resilient Distributed Systems

Edge Platform @ Netflix• Goal: Maximize availability of Netflix API

• API used by all customer devices• How?

• Fault-tolerant by design• Operational visibility• Limit manual intervention• Understand failure modes before they happen

Page 7: Using Hystrix to Build Resilient Distributed Systems

Hystrix @ Netflix• Fault-tolerance pattern as a library• Provides operational insights in real-time• Automatic load-shedding under pressure• Initial design/implementation by Ben Christensen

Page 8: Using Hystrix to Build Resilient Distributed Systems

Scope of this talk• User-facing request-response system that uses blocking I/O• NOT batch system• NOT nonblocking I/O• (Hystrix can be used in those, though…)

Page 9: Using Hystrix to Build Resilient Distributed Systems

Netflix API

Page 10: Using Hystrix to Build Resilient Distributed Systems

Goals of this talk• Philosophical Motivation

• Why are distributed systems hard?• Practical Motivation

• Why do I keep getting paged?• Solving those problems with Hystrix

• How does it work?• How do I use it in my system?• How should a system behave if I use Hystrix?• What? Netflix was down for me – what happened there?

Page 11: Using Hystrix to Build Resilient Distributed Systems

Goals of this talk• Philosophical Motivation

• Why are distributed systems hard?• Practical Motivation

• Why do I keep getting paged?• Solving those problems with Hystrix

• How does it work?• How do I use it in my system?• How should a system behave if I use Hystrix?• What? Netflix was down for me – what happened there?

Page 12: Using Hystrix to Build Resilient Distributed Systems

Un-Distributed Systems• Write an application and deploy!

Page 13: Using Hystrix to Build Resilient Distributed Systems

Distributed Systems• Write an application and deploy!

• Add a redundant system for resiliency (no SPOF)• Break out the state• Break out the function (microservices)• Add more machines to scale horizontally

Page 14: Using Hystrix to Build Resilient Distributed Systems

Source: https://blogs.oracle.com/jag/resource/Fallacies.html

Page 15: Using Hystrix to Build Resilient Distributed Systems

Fallacies of Distributed Computing• Network is reliable• Latency is zero• Bandwidth is infinite• Network is secure• Topology doesn’t change• There is 1 administrator• Transport cost is 0• Network is homogenous

Page 16: Using Hystrix to Build Resilient Distributed Systems

Fallacies of Distributed Computing• Network is reliable• Latency is zero• Bandwidth is infinite• Network is secure• Topology doesn’t change• There is 1 administrator• Transport cost is 0• Network is homogenous

Page 17: Using Hystrix to Build Resilient Distributed Systems

Fallacies of Distributed Computing• In the cloud, other services are reliable• In the cloud, latency is zero• In the cloud, bandwidth is infinite• Network is secure• Topology doesn’t change• In the cloud, there is 1 administrator• Transport cost is 0• Network is homogenous

Page 18: Using Hystrix to Build Resilient Distributed Systems

Other Services are Reliable getCustomer() fails at 0% when on-box

Page 19: Using Hystrix to Build Resilient Distributed Systems

Other Services are Reliable getCustomer() used to fail at 0%

Page 20: Using Hystrix to Build Resilient Distributed Systems

Latency is zero• search(term) returns in μs or ns

Page 21: Using Hystrix to Build Resilient Distributed Systems

Latency is zero• search(term) used to return in μs or ns

Page 22: Using Hystrix to Build Resilient Distributed Systems

Bandwidth is infinite• Making n calls to getRating() is fine

Page 23: Using Hystrix to Build Resilient Distributed Systems

Bandwidth is infinite• Making n calls to getRating() results in multiple

network calls

Page 24: Using Hystrix to Build Resilient Distributed Systems

There is 1 administrator• Deployments are in 1 person’s head

Page 25: Using Hystrix to Build Resilient Distributed Systems

There is 1 administrator• Deployments are distributed

Page 26: Using Hystrix to Build Resilient Distributed Systems

There is 1 administrator

Page 27: Using Hystrix to Build Resilient Distributed Systems

There is 1 administrator

Page 28: Using Hystrix to Build Resilient Distributed Systems

There is 1 administrator

Source: http://research.microsoft.com/en-us/um/people/lamport/pubs/distributed-system.txt

Page 29: Using Hystrix to Build Resilient Distributed Systems

Goals of this talk• Philosophical Motivation

• Why are distributed systems hard?• Practical Motivation

• Why do I keep getting paged?• Solving those problems with Hystrix

• How does it work?• How do I use it in my system?• How should a system behave if I use Hystrix?• What? Netflix was down for me – what happened there?

Page 30: Using Hystrix to Build Resilient Distributed Systems

Operating a Distributed System• How can I stop going to incident reviews?

• Which occur every time we cause customer pain

Page 31: Using Hystrix to Build Resilient Distributed Systems

Operating a Distributed System• How can I stop going to incident reviews?

• Which occur every time we cause customer pain• We must assume every other system can fail

Page 32: Using Hystrix to Build Resilient Distributed Systems

Operating a Distributed System• How can I stop going to incident reviews?

• Which occur every time we cause customer pain• We must assume every other system can fail• We must understand those failure modes and prevent

them from cascading

Page 33: Using Hystrix to Build Resilient Distributed Systems

Operating a Distributed System (bad)

Page 34: Using Hystrix to Build Resilient Distributed Systems

Operating a Distributed System (bad)

Page 35: Using Hystrix to Build Resilient Distributed Systems

Operating a Distributed System (bad)

Page 36: Using Hystrix to Build Resilient Distributed Systems

Operating a Distributed System (good)

Page 37: Using Hystrix to Build Resilient Distributed Systems

Failure Modes (in the small)• Errors• Latency

Page 38: Using Hystrix to Build Resilient Distributed Systems

If you don’t plan for errors

Page 39: Using Hystrix to Build Resilient Distributed Systems

How to handle Errors

return getCustomer(id);

Page 40: Using Hystrix to Build Resilient Distributed Systems

How to handle Errors

try { return getCustomer(id);} catch (Exception ex) { //TODO What to return here?}

Page 41: Using Hystrix to Build Resilient Distributed Systems

How to handle Errors

try { return getCustomer(id);} catch (Exception ex) { //static value return null;}

Page 42: Using Hystrix to Build Resilient Distributed Systems

How to handle Errors

try { return getCustomer(id);} catch (Exception ex) { //value from memory return anonymousCustomer;}

Page 43: Using Hystrix to Build Resilient Distributed Systems

How to handle Errors

try { return getCustomer(id);} catch (Exception ex) { //value from cache return cache.getCustomer(id);}

Page 44: Using Hystrix to Build Resilient Distributed Systems

How to handle Errors

try { return getCustomer(id);} catch (Exception ex) { //value from service return customerViaSecondSvc(id);}

Page 45: Using Hystrix to Build Resilient Distributed Systems

How to handle Errors

try { return getCustomer(id);} catch (Exception ex) { //explicitly rethrow throw ex;}

Page 46: Using Hystrix to Build Resilient Distributed Systems

Handle Errors with Fallbacks• Some options for fallbacks

• Static value• Value from in-memory• Value from cache• Value from network• Throw

• Make error-handling explicit• Applications have to work in the presence of either

fallbacks or rethrown exceptions

Page 47: Using Hystrix to Build Resilient Distributed Systems

Handle Errors with Fallbacks

Page 48: Using Hystrix to Build Resilient Distributed Systems

Exposure to failures• As your app grows, your set of dependencies is much

more likely to get bigger, not smaller• Overall uptime = (Dep uptime)^(num deps)

99.99% 99.9% 99%

5 99.95% 99.5% 95%

10 99.9% 99% 90.4%

25 99.75% 97.5% 77.8%

Dependency uptime

Dependencies

Page 49: Using Hystrix to Build Resilient Distributed Systems

If you don’t plan for Latency

• First-order effect: outbound latency increase• Second-order effect: increase resource usage

Page 50: Using Hystrix to Build Resilient Distributed Systems

If you don’t plan for Latency

• First-order effect: outbound latency increase

Page 51: Using Hystrix to Build Resilient Distributed Systems

If you don’t plan for Latency

• First-order effect: outbound latency increase

Page 52: Using Hystrix to Build Resilient Distributed Systems

Queuing theory• Say that your application is making 100RPS of a network

call and mean latency is 10ms

Page 53: Using Hystrix to Build Resilient Distributed Systems

Queuing theory• Say that your application is making 100RPS of a network

call and mean latency is 10ms• The utilization of I/O threads has a distribution over time –

according to Little’s Law, the mean is 100 RPS * (1/100 s) = 1

Page 54: Using Hystrix to Build Resilient Distributed Systems

Queuing theory• Say that your application is making 100RPS of a network

call and mean latency is 10ms• The utilization of I/O threads has a distribution over time –

according to Little’s Law, the mean is 100 RPS * (1/100 s) = 1

• If mean latency increased to 100ms, mean utilization increases to 10

Page 55: Using Hystrix to Build Resilient Distributed Systems

Queuing theory• Say that your application is making 100RPS of a network

call and mean latency is 10ms• The utilization of I/O threads has a distribution over time –

according to Little’s Law, the mean is 100 RPS * (1/100 s) = 1

• If mean latency increased to 100ms, mean utilization increases to 10

• If mean latency increased to 1000ms, mean utilization increases to 100

Page 56: Using Hystrix to Build Resilient Distributed Systems

If you don’t plan for Latency

Page 57: Using Hystrix to Build Resilient Distributed Systems

Handle Latency with Timeouts• Bound the worst-case latency

Page 58: Using Hystrix to Build Resilient Distributed Systems

Handle Latency with Timeouts• Bound the worst-case latency • What should we return upon timeout?

• We’ve already designed a fallback

Page 59: Using Hystrix to Build Resilient Distributed Systems

Bound resource utilization• Don’t take on too much work globally

Page 60: Using Hystrix to Build Resilient Distributed Systems

Bound resource utilization• Don’t take on too much work globally• Don’t let any single dependency take too many resources

Page 61: Using Hystrix to Build Resilient Distributed Systems

Bound resource utilization• Don’t take on too much work globally• Don’t let any single dependency take too many resources• Bound the concurrency of each dependency

Page 62: Using Hystrix to Build Resilient Distributed Systems

Bound resource utilization• Don’t take on too much work globally• Don’t let any single dependency take too many resources• Bound the concurrency of each dependency• What should we do if we’re at that threshold?

• We’ve already designed a fallback.

Page 63: Using Hystrix to Build Resilient Distributed Systems

Goals of this talk• Philosophical Motivation

• Why are distributed systems hard?• Practical Motivation

• Why do I keep getting paged?• Solving those problems with Hystrix

• How does it work?• How do I use it in my system?• How should a system behave if I use Hystrix?• What? Netflix was down for me – what happened there?

Page 64: Using Hystrix to Build Resilient Distributed Systems

Hystrix Goals• Handle Errors with Fallbacks• Handle Latency with Timeouts• Bound Resource Utilization

Page 65: Using Hystrix to Build Resilient Distributed Systems

Handle Errors with Fallback• We need something to execute• We need a fallback

Page 66: Using Hystrix to Build Resilient Distributed Systems

Handle Errors with Fallbackclass CustLookup extends HystrixCommand<Customer> { @Override public Customer run() { return svc.getOverHttp(customerId); }}

Page 67: Using Hystrix to Build Resilient Distributed Systems

Handle Errors with Fallbackclass CustLookup extends HystrixCommand<Customer> { @Override public Customer run() { return svc.getOverHttp(customerId); }

@Override public Customer getFallback() { return Customer.anonymousCustomer(); }}

Page 68: Using Hystrix to Build Resilient Distributed Systems

Handle Errors with Fallbackclass CustLookup extends HystrixCommand<Customer> { private final CustomersService svc; private final long customerId;

public CustLookup(CustomersService svc, long id) { this.svc = svc; this.customerId = id; }

//run() : has references to svc and customerId //getFallback()}

Page 69: Using Hystrix to Build Resilient Distributed Systems

Handle Latency with Timeouts• We need a timeout value • And a timeout mechanism

Page 70: Using Hystrix to Build Resilient Distributed Systems

Handle Latency with Timeoutsclass CustLookup extends HystrixCommand<Customer> { private final CustomersService svc; private final long customerId;

public CustLookup(CustomersService svc, long id) { super(timeout = 100); this.svc = svc; this.customerId = id; }

//run(): has references to svc and customerId //getFallback()}

Page 71: Using Hystrix to Build Resilient Distributed Systems

Handle Latency with Timeouts• What if run() does not respect your timeout?

Page 72: Using Hystrix to Build Resilient Distributed Systems

Handle Latency with Timeoutsclass CustLookup extends HystrixCommand<Customer> {

@Override public Customer run() { return svc.getOverHttp(customerId); }

@Override public Customer getFallback() { return Customer.anonymousCustomer(); }}

Page 73: Using Hystrix to Build Resilient Distributed Systems

Handle Latency with Timeoutsclass CustLookup extends HystrixCommand<Customer> {

@Override public Customer run() { return svc.nowThisThreadIsMine(customerId); }

@Override public Customer getFallback() { return Customer.anonymousCustomer(); }}

Page 74: Using Hystrix to Build Resilient Distributed Systems

Handle Latency with Timeouts• What if run() does not respect your timeout?• We can’t control the I/O library• We can put it on a different thread to control its impact

Page 75: Using Hystrix to Build Resilient Distributed Systems

Hystrix execution model

Page 76: Using Hystrix to Build Resilient Distributed Systems

Bound Resource Utilization• Since we’re using separate thread pool, we should give it

a fixed size

Page 77: Using Hystrix to Build Resilient Distributed Systems

Bound Resource Utilizationclass CustLookup extends HystrixCommand<Customer> {

public CustLookup(CustomersService svc, long id) { super(timeout = 100, threadPool = “CUSTOMER”, threadPoolSize = 8); this.svc = svc; this.customerId = id; }

//run() : has references to svc and customerId //getFallback()}

Page 78: Using Hystrix to Build Resilient Distributed Systems

Thread pool as a bulkhead

Image Credit: http://www.researcheratlarge.com/Ships/DD586/DD586ForwardRepair.html

Page 79: Using Hystrix to Build Resilient Distributed Systems

Thread pool as a bulkhead

Page 80: Using Hystrix to Build Resilient Distributed Systems

Circuit-Breaker• Popularized by Michael Nygard in “Release It!”

Page 81: Using Hystrix to Build Resilient Distributed Systems

Circuit-Breaker• We can build feedback loop

Page 82: Using Hystrix to Build Resilient Distributed Systems

Circuit-Breaker• We can build feedback loop• If command has high error rate in last 10 seconds, it’s

more likely to fail right now

Page 83: Using Hystrix to Build Resilient Distributed Systems

Circuit-Breaker• We can build feedback loop• If command has high error rate in last 10 seconds, it’s

more likely to fail right now• Fail fast now – don’t spend my resources

Page 84: Using Hystrix to Build Resilient Distributed Systems

Circuit-Breaker• We can build feedback loop• If command has high error rate in last 10 seconds, it’s

more likely to fail right now• Fail fast now – don’t spend my resources• Give downstream system some breathing room

Page 85: Using Hystrix to Build Resilient Distributed Systems
Page 86: Using Hystrix to Build Resilient Distributed Systems

Hystrix as a Decision Tree• Compose all of the above behaviors

Page 87: Using Hystrix to Build Resilient Distributed Systems

Hystrix as a Decision Tree• Compose all of the above behaviors• Now you’ve got a library!

Page 88: Using Hystrix to Build Resilient Distributed Systems

SUCCESSclass CustLookup extends HystrixCommand<Customer> { @Override public Customer run() { return svc.getOverHttp(customerId); //succeeds }

@Override public Customer getFallback() { return Customer.anonymousCustomer(); }}

Page 89: Using Hystrix to Build Resilient Distributed Systems

SUCCESS

Page 90: Using Hystrix to Build Resilient Distributed Systems

SUCCESS

Page 91: Using Hystrix to Build Resilient Distributed Systems

FAILUREclass CustLookup extends HystrixCommand<Customer> { @Override public Customer run() { return svc.getOverHttp(customerId); //throws }

@Override public Customer getFallback() { return Customer.anonymousCustomer(); }}

Page 92: Using Hystrix to Build Resilient Distributed Systems

FAILURE

Page 93: Using Hystrix to Build Resilient Distributed Systems

FAILURE

Page 94: Using Hystrix to Build Resilient Distributed Systems

FAILURE

Page 95: Using Hystrix to Build Resilient Distributed Systems

TIMEOUTclass CustLookup extends HystrixCommand<Customer> { @Override public Customer run() { //doesn’t return in time return svc.getOverHttp(customerId); }

@Override public Customer getFallback() { return Customer.anonymousCustomer(); }}

Page 96: Using Hystrix to Build Resilient Distributed Systems

TIMEOUT

Page 97: Using Hystrix to Build Resilient Distributed Systems

TIMEOUT

Page 98: Using Hystrix to Build Resilient Distributed Systems

THREADPOOL-REJECTEDclass CustLookup extends HystrixCommand<Customer> { @Override public Customer run() { //never runs – threadpool is full return svc.getOverHttp(customerId); }

@Override public Customer getFallback() { return Customer.anonymousCustomer(); }}

Page 99: Using Hystrix to Build Resilient Distributed Systems

THREADPOOL-REJECTED

Page 100: Using Hystrix to Build Resilient Distributed Systems

THREADPOOL-REJECTED

Page 101: Using Hystrix to Build Resilient Distributed Systems

SHORT-CIRCUITEDclass CustLookup extends HystrixCommand<Customer> { @Override public Customer run() { //never runs – circuit is open return svc.getOverHttp(customerId); }

@Override public Customer getFallback() { return Customer.anonymousCustomer(); }}

Page 102: Using Hystrix to Build Resilient Distributed Systems

SHORT-CIRCUITED

Page 103: Using Hystrix to Build Resilient Distributed Systems

SHORT-CIRCUITED

Page 104: Using Hystrix to Build Resilient Distributed Systems

Fallback Handling• What are the basic error modes?

• Failure• Latency

Page 105: Using Hystrix to Build Resilient Distributed Systems

Fallback Handling• What are the basic error modes?

• Failure• Latency

• We need to be aware of failures in fallback• No automatic recovery provided – single fallback is enough

Page 106: Using Hystrix to Build Resilient Distributed Systems

Fallback Handling• What are the basic error modes?

• Failure• Latency

• We need to be aware of failures in fallback• We need to protect ourselves from latency in fallback

• Add a semaphore to fallback to bound concurrency

Page 107: Using Hystrix to Build Resilient Distributed Systems

FALLBACK SUCCESS

Page 108: Using Hystrix to Build Resilient Distributed Systems

FALLBACK SUCCESS

Page 109: Using Hystrix to Build Resilient Distributed Systems

FALLBACK FAILURE

Page 110: Using Hystrix to Build Resilient Distributed Systems

FALLBACK FAILURE

Page 111: Using Hystrix to Build Resilient Distributed Systems

FALLBACK REJECTION

Page 112: Using Hystrix to Build Resilient Distributed Systems

FALLBACK REJECTION

Page 113: Using Hystrix to Build Resilient Distributed Systems

FALLBACK MISSING

Page 114: Using Hystrix to Build Resilient Distributed Systems

FALLBACK MISSING

Page 115: Using Hystrix to Build Resilient Distributed Systems
Page 116: Using Hystrix to Build Resilient Distributed Systems

Hystrix as a building block• Now we have bounded latency and concurrency, along

with a consistent fallback mechanism

Page 117: Using Hystrix to Build Resilient Distributed Systems

Hystrix as a building block• Now we have bounded latency and concurrency, along

with a consistent fallback mechanism• In practice within the Netflix API, we have:

• ~250 commands• ~90 thread pools• 10s of billions of command executions / day

Page 118: Using Hystrix to Build Resilient Distributed Systems

Goals of this talk• Philosophical Motivation

• Why are distributed systems hard?• Practical Motivation

• Why do I keep getting paged?• Solving those problems with Hystrix

• How does it work?• How do I use it in my system?• How should a system behave if I use Hystrix?• What? Netflix was down for me – what happened there?

Page 119: Using Hystrix to Build Resilient Distributed Systems

Example

Page 120: Using Hystrix to Build Resilient Distributed Systems

CustomerCommand• Takes id as arg• Makes HTTP call to service• Uses anonymous user as fallback

• Customer has no personalization

Page 121: Using Hystrix to Build Resilient Distributed Systems

RatingsCommand• Takes Customer as argument• Makes HTTP call to service• Uses empty list for fallback

• Customer has no ratings

Page 122: Using Hystrix to Build Resilient Distributed Systems

RecommendationsCommand• Takes Customer as argument• Makes HTTP call to service• Uses precomputed list for fallback

• Customer gets unpersonalized recommendations

Page 123: Using Hystrix to Build Resilient Distributed Systems

Fallbacks in this example• Fallbacks vary in importance

• Customer has no personalization• Customer gets unpersonalized recommendations• Customer has no ratings

Page 124: Using Hystrix to Build Resilient Distributed Systems
Page 125: Using Hystrix to Build Resilient Distributed Systems

HystrixCommand.execute()• Blocks application thread on Hystrix thread • Returns T

Page 126: Using Hystrix to Build Resilient Distributed Systems
Page 127: Using Hystrix to Build Resilient Distributed Systems
Page 128: Using Hystrix to Build Resilient Distributed Systems
Page 129: Using Hystrix to Build Resilient Distributed Systems
Page 130: Using Hystrix to Build Resilient Distributed Systems

HystrixCommand.queue()• Does not block application thread

• Application thread can launch multiple HystrixCommands• Returns java.util.concurrent.Future to application thread

Page 131: Using Hystrix to Build Resilient Distributed Systems
Page 132: Using Hystrix to Build Resilient Distributed Systems
Page 133: Using Hystrix to Build Resilient Distributed Systems
Page 134: Using Hystrix to Build Resilient Distributed Systems
Page 135: Using Hystrix to Build Resilient Distributed Systems
Page 136: Using Hystrix to Build Resilient Distributed Systems
Page 137: Using Hystrix to Build Resilient Distributed Systems

HystrixCommand.observe()• Does not block application thread• Returns rx.Observable to application thread• RxJava – reactive stream

• reactivex.io

Page 138: Using Hystrix to Build Resilient Distributed Systems
Page 139: Using Hystrix to Build Resilient Distributed Systems
Page 140: Using Hystrix to Build Resilient Distributed Systems
Page 141: Using Hystrix to Build Resilient Distributed Systems
Page 142: Using Hystrix to Build Resilient Distributed Systems
Page 143: Using Hystrix to Build Resilient Distributed Systems
Page 144: Using Hystrix to Build Resilient Distributed Systems
Page 145: Using Hystrix to Build Resilient Distributed Systems
Page 146: Using Hystrix to Build Resilient Distributed Systems
Page 147: Using Hystrix to Build Resilient Distributed Systems
Page 148: Using Hystrix to Build Resilient Distributed Systems

Goals of this talk• Philosophical Motivation

• Why are distributed systems hard?• Practical Motivation

• Why do I keep getting paged?• Solving those problems with Hystrix

• How does it work?• How do I use it in my system?• How should a system behave if I use Hystrix?• What? Netflix was down for me – what happened there?

Page 149: Using Hystrix to Build Resilient Distributed Systems

Hystrix Metrics• Hystrix provides a nice semantic layer to model

• Abstracts details of I/O mechanism• HTTP?• UDP?• Postgres?• Cassandra?• Memcached?• …

Page 150: Using Hystrix to Build Resilient Distributed Systems

Hystrix Metrics• Hystrix provides a nice semantic layer to model

• Abstracts details of I/O mechanism• Can capture event counts and event latencies

Page 152: Using Hystrix to Build Resilient Distributed Systems

Hystrix Metrics• Can be aggregated for historical time-series

• hystrix-servo-metrics-publisher

Page 153: Using Hystrix to Build Resilient Distributed Systems

Hystrix Metrics• Can be streamed off-box

• hystrix-metrics-event-stream

Page 154: Using Hystrix to Build Resilient Distributed Systems

Hystrix Metrics• Stream can be consumed by UI

• Hystrix-dashboard

Page 155: Using Hystrix to Build Resilient Distributed Systems

Hystrix metrics• Can be grouped by request

• hystrix-core:HystrixRequestLog

Page 156: Using Hystrix to Build Resilient Distributed Systems

Hystrix metrics• <DEMO>

Page 157: Using Hystrix to Build Resilient Distributed Systems

Taking Action on Hystrix Metrics• Circuit Breaker opens/closes

Page 158: Using Hystrix to Build Resilient Distributed Systems

Taking Action on Hystrix Metrics• Circuit Breaker opens/closes• Alert/Page on Error %

Page 159: Using Hystrix to Build Resilient Distributed Systems

Taking Action on Hystrix Metrics• Circuit Breaker opens/closes• Alert/Page on Error %• Alert/Page on Latency

Page 160: Using Hystrix to Build Resilient Distributed Systems

Taking Action on Hystrix Metrics• Circuit Breaker opens/closes• Alert/Page on Error %• Alert/Page on Latency• Use in canary analysis

Page 161: Using Hystrix to Build Resilient Distributed Systems

How the system should behave

~100 RPS Success~0 RPS Fallback

Page 162: Using Hystrix to Build Resilient Distributed Systems

How the system should behave

~90 RPS Success~10 RPS Fallback

Page 163: Using Hystrix to Build Resilient Distributed Systems

How the system should behave

~5 RPS Success~95 RPS Fallback

Page 164: Using Hystrix to Build Resilient Distributed Systems

How the system should behave

~100 RPS Success~0 RPS Fallback

Page 165: Using Hystrix to Build Resilient Distributed Systems

How the system should behave• What happens if mean latency goes from 10ms -> 30ms?

Before Latency

Page 166: Using Hystrix to Build Resilient Distributed Systems

How the system should behave• What happens if mean latency goes from 10ms -> 30ms?

Before Latency After Latency

Page 167: Using Hystrix to Build Resilient Distributed Systems

How the system should behave• What happens if mean latency goes from 10ms -> 30ms?

• Increased timeouts

Before Latency After Latency

Page 168: Using Hystrix to Build Resilient Distributed Systems

How the system should behave• What happens if mean latency goes from 10ms -> 30ms

at 100 RPS?

Page 169: Using Hystrix to Build Resilient Distributed Systems

How the system should behave• What happens if mean latency goes from 10ms -> 30ms

at 100 RPS?• By Little’s Law, mean utilization: 1 -> 3

->

Before Thread PoolUtilization

Page 170: Using Hystrix to Build Resilient Distributed Systems

How the system should behave• What happens if mean latency goes from 10ms -> 30ms

at 100 RPS?• By Little’s Law, mean utilization: 1 -> 3

->

Before Thread PoolUtilization

After Thread PoolUtilization

Page 171: Using Hystrix to Build Resilient Distributed Systems

How the system should behave• What happens if mean latency goes from 10ms -> 30ms

at 100 RPS?• By Little’s Law, mean utilization: 1 -> 3• Increased thread pool rejections

->

Before Thread PoolUtilization

After Thread PoolUtilization

Page 172: Using Hystrix to Build Resilient Distributed Systems

How the system should behave

~84 RPS Success~16 RPS Fallback

Page 173: Using Hystrix to Build Resilient Distributed Systems

How the system should behave

~48 RPS Success~52 RPS Fallback

Page 174: Using Hystrix to Build Resilient Distributed Systems

Lessons learned• Everything we measure has a distribution (once you add

time)• Queuing theory can teach us a lot

Page 175: Using Hystrix to Build Resilient Distributed Systems

Lessons learned• Everything we measure has a distribution (once you add

time)• Queuing theory can teach us a lot

• Errors are more frequent, but latency is more consequential

Page 176: Using Hystrix to Build Resilient Distributed Systems

Lessons learned• Everything we measure has a distribution (once you add

time)• Queuing theory can teach us a lot

• Errors are more frequent, but latency is more consequential

• 100% success rates are not what you want!

Page 177: Using Hystrix to Build Resilient Distributed Systems

Lessons learned• Everything we measure has a distribution (once you add

time)• Queuing theory can teach us a lot

• Errors are more frequent, but latency is more consequential

• 100% success rates are not what you want!• Don’t know if your fallbacks work• You have outliers, and they’re not being failed

Page 178: Using Hystrix to Build Resilient Distributed Systems

Lessons learned• Everything we measure has a distribution (once you add

time)• Queueing theory can teach us a lot

• Errors are more frequent, but latency is more consequential

• 100% success rates are not what you want!• Don’t know if your fallbacks work• You have outliers, and they’re not being failed

• Visualization (especially realtime) goes a long way

Page 179: Using Hystrix to Build Resilient Distributed Systems

Goals of this talk• Philosophical Motivation

• Why are distributed systems hard?• Practical Motivation

• Why do I keep getting paged?• Solving those problems with Hystrix

• How does it work?• How do I use it in my system?• How should a system behave if I use Hystrix?• What? Netflix was down for me – what happened there?

Page 180: Using Hystrix to Build Resilient Distributed Systems

Areas Hystrix doesn’t cover• Traffic to the system• Functional bugs• Data issues• AWS• Service Discovery• Etc…

Page 181: Using Hystrix to Build Resilient Distributed Systems

Hystrix doesn’t help if...• Your fallbacks fail

Page 182: Using Hystrix to Build Resilient Distributed Systems

Hystrix doesn’t help if...• Your fallbacks fail• Upstream systems do not tolerate fallbacks

Page 183: Using Hystrix to Build Resilient Distributed Systems

Hystrix doesn’t help if...• Your fallbacks fail• Upstream systems do not tolerate fallbacks• Resource bounds are too loose

Page 184: Using Hystrix to Build Resilient Distributed Systems

Hystrix doesn’t help if...• Your fallbacks fail• Upstream systems do not tolerate fallbacks• Resource bounds are too loose• I/O calls don’t use Hystrix

Page 185: Using Hystrix to Build Resilient Distributed Systems

Failing fallbacks• By definition, fallbacks are exercised less

Page 186: Using Hystrix to Build Resilient Distributed Systems

Failing fallbacks• By definition, fallbacks are exercised less• Likely less tested

Page 187: Using Hystrix to Build Resilient Distributed Systems

Failing fallbacks• By definition, fallbacks are exercised less• Likely less tested • If fallback fails, then you have cascaded failure

Page 188: Using Hystrix to Build Resilient Distributed Systems

Unusable fallbacks• Even if fallback succeeds, upstream systems/UIs need to

handle them

Page 189: Using Hystrix to Build Resilient Distributed Systems

Unusable fallbacks• Even if fallback succeeds, upstream systems/UIs need to

handle them• What happens when UI receives anonymous customer?

Page 190: Using Hystrix to Build Resilient Distributed Systems

Unusable fallbacks• Even if fallback succeeds, upstream systems/UIs need to

handle them• What happens when UI receives anonymous customer?

• Need integration tests that expect fallback data

Page 191: Using Hystrix to Build Resilient Distributed Systems

Unusable fallbacks• Even if fallback succeeds, upstream systems/UIs need to

handle them• What happens when UI receives anonymous customer?

Page 192: Using Hystrix to Build Resilient Distributed Systems

Loose Resource Bounding• “Kids, get ready for school!”

Page 193: Using Hystrix to Build Resilient Distributed Systems

Loose Resource Bounding• “Kids, get ready for school!”

• “You’ve got 8 hours to get ready”

Page 194: Using Hystrix to Build Resilient Distributed Systems

Loose Resource Bounding• “Kids, get ready for school!”

• “You’ve got 8 hours to get ready”• Speed limit is 1000mph

Page 195: Using Hystrix to Build Resilient Distributed Systems

Loose Resource Bounding

Page 196: Using Hystrix to Build Resilient Distributed Systems

Loose Resource Bounding

Page 197: Using Hystrix to Build Resilient Distributed Systems

Unwrapped I/O calls• Any place where user traffic triggers I/O is dangerous

Page 198: Using Hystrix to Build Resilient Distributed Systems

Unwrapped I/O calls• Any place where user traffic triggers I/O is dangerous• Hystrix-network-auditor-agent finds all stack traces that:

• Are on Tomcat thread• Do Network work• Don’t use Hystrix

Page 199: Using Hystrix to Build Resilient Distributed Systems

How to Prove our Resilience“Trust, but verify”

Page 200: Using Hystrix to Build Resilient Distributed Systems

How to Prove our Resilience• Get lucky and have a real outage

Page 201: Using Hystrix to Build Resilient Distributed Systems

How to Prove our Resilience• Get lucky and have a real outage

• Hopefully our system reacts the way we expect

Page 202: Using Hystrix to Build Resilient Distributed Systems

How to Prove our Resilience• Get lucky and have a real outage

• Hopefully our system reacts the way we expect• Cause the outage we want to protect against

Page 203: Using Hystrix to Build Resilient Distributed Systems

How to Prove our Resilience• Get lucky and have a real outage

• Hopefully our system reacts the way we expect• Cause the outage we want to protect against

Page 204: Using Hystrix to Build Resilient Distributed Systems

Failure Injection Testing

Page 205: Using Hystrix to Build Resilient Distributed Systems

Failure Injection Testing• Inject failures at any layer

• Hystrix is a good one

Page 206: Using Hystrix to Build Resilient Distributed Systems

Failure Injection Testing• Inject failures at any layer

• Hystrix is a good one• Scope (region / % traffic / user / device)

Page 207: Using Hystrix to Build Resilient Distributed Systems

Failure Injection Testing• Inject failures at any layer

• Hystrix is a good one• Scope (region / % traffic / user / device)• Scenario (fail?, add latency?)

Page 208: Using Hystrix to Build Resilient Distributed Systems

Failure Injection Testing• Inject error and observe fallback successes

Page 209: Using Hystrix to Build Resilient Distributed Systems

Failure Injection Testing• Inject error and observe fallback successes• Inject error and observe device handling fallbacks

Page 210: Using Hystrix to Build Resilient Distributed Systems

Failure Injection Testing• Inject error and observe fallback successes• Inject error and observe device handling fallbacks• Inject latency and observe thread-pool rejections and

timeouts

Page 211: Using Hystrix to Build Resilient Distributed Systems

Failure Injection Testing• Inject error and observe fallback successes• Inject error and observe device handling fallbacks• Inject latency and observe thread-pool rejections and

timeouts• We have an off-switch in case this goes poorly!

Page 212: Using Hystrix to Build Resilient Distributed Systems

FIT in Action• <DEMO>

Page 213: Using Hystrix to Build Resilient Distributed Systems

Operational Summary• Downstream transient errors cause fallbacks and a

degraded but not broken customer experience

Page 214: Using Hystrix to Build Resilient Distributed Systems

Operational Summary• Downstream transient errors cause fallbacks and a

degraded but not broken customer experience• Fallbacks go away once errors subside

Page 215: Using Hystrix to Build Resilient Distributed Systems

Operational Summary• Downstream transient errors cause fallbacks and a

degraded but not broken customer experience• Fallbacks go away once errors subside• Downstream latency increases load/latency, but not to a

dangerous level

Page 216: Using Hystrix to Build Resilient Distributed Systems

Operational Summary• Downstream transient errors cause fallbacks and a

degraded but not broken customer experience• Fallbacks go away once errors subside• Downstream latency increases load/latency, but not to a

dangerous level• We have near realtime visibility into our system and all of

our downstream systems

Page 218: Using Hystrix to Build Resilient Distributed Systems

Coordinates• github.com/Netflix/Hystrix• @HystrixOSS• @NetflixOSS• @mattrjacobs• [email protected]• jobs.netflix.com (We’re hiring!)

Page 219: Using Hystrix to Build Resilient Distributed Systems

Bonus: Hystrix Config

Page 220: Using Hystrix to Build Resilient Distributed Systems

Bonus: Hystrix 1.5 Metrics

Page 221: Using Hystrix to Build Resilient Distributed Systems

Bonus: Horror Stories