68
Resilient Functional Service Design The usually forgotten parts of resilient software design Uwe Friedrichsen – codecentric AG – 2015-2017

Resilient Functional Service Design

Embed Size (px)

Citation preview

Page 1: Resilient Functional Service Design

Resilient Functional Service Design The usually forgotten parts of resilient software design

Uwe Friedrichsen – codecentric AG – 2015-2017

Page 2: Resilient Functional Service Design

@ufried Uwe Friedrichsen | [email protected] | http://slideshare.net/ufried | http://ufried.tumblr.com

Page 3: Resilient Functional Service Design

What’s that “resilience” thing?

Page 4: Resilient Functional Service Design

Business

Production

Availability

Page 5: Resilient Functional Service Design

(Almost) every system is a distributed system

Chas Emerick

http://www.infoq.com/presentations/problems-distributed-systems

Page 6: Resilient Functional Service Design

A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.

Leslie Lamport

Page 7: Resilient Functional Service Design

Failures in todays complex, distributed and interconnected systems are not the exception. •  They are the normal case

•  They are not predictable

Page 8: Resilient Functional Service Design

… and it’s getting “worse”

•  Cloud-based systems

•  Microservices

•  Zero Downtime

•  Mobile & IoT

•  Social Web

à Ever-increasing complexity and connectivity

Page 9: Resilient Functional Service Design

Do not try to avoid failures. Embrace them.

Page 10: Resilient Functional Service Design

resilience (IT) the ability of a system to handle unexpected situations

-  without the user noticing it (best case) -  with a graceful degradation of service (worst case)

Page 11: Resilient Functional Service Design

Beware of the “100% available” trap!

Page 12: Resilient Functional Service Design

Designing for resilience The pitfall

Page 13: Resilient Functional Service Design

First, you learn about resilience …

Page 14: Resilient Functional Service Design

Complement

Core

Detect

Prevent

Recover

Mitigate

Treat

Page 15: Resilient Functional Service Design

Core

Detect Treat

Prevent

Recover

Mitigate Complement

Supporting patterns

Redundancy

Stateless

Idempotency

Escalation

Zero downtime deployment

Location transparency

Relaxed temporal

constraints

Fallback

Shed load Share load

Marked data Queue for resources

Bounded queue

Finish work in progress

Fresh work before stale

Deferrable work Communication paradigm

Isolation

Bulkhead System level

Monitor

Watchdog

Heartbeat

Acknowledgement

Either level

Voting

Synthetic transaction

Leaky bucket Routine

checks

Health check

Fail fast

Let sleeping dogs lie

Small releases

Hotdeployments

Routine maintenance

Backuprequest

Anti-fragility

Diversity Jitter

Error injection

Spread the news

Anti-entropy

Backpressure

Retry

Limit retries

Rollback Roll-forward

Checkpoint Safe point

Failover

Read repair

Error handler

Reset Restart

Reconnect

Fail silently

Default value

Node level

Timeout

Circuit breaker

Complete parameter checking

Checksum

Statically

Dynamically

Confinement

Page 16: Resilient Functional Service Design

... then, you digest the stuff just learned

Page 17: Resilient Functional Service Design

Confinement

Core

Detect Treat

Prevent

Recover

Mitigate Complement

Supporting patterns

Redundancy

Stateless

Idempotency

Escalation

Zero downtime deployment

Location transparency

Relaxed temporal

constraints

Fallback

Shed load Share load

Marked data Queue for resources

Bounded queue

Finish work in progress

Fresh work before stale

Deferrable work Communication paradigm

Isolation

Bulkhead System level

Monitor

Watchdog

Heartbeat

Acknowledgement

Either level

Voting

Synthetic transaction

Leaky bucket Routine

checks

Health check

Fail fast

Let sleeping dogs lie

Small releases

Hotdeployments

Routine maintenance

Backuprequest

Anti-fragility

Diversity Jitter

Error injection

Spread the news

Anti-entropy

Backpressure

Retry

Limit retries

Rollback Roll-forward

Checkpoint Safe point

Failover

Read repair

Error handler

Reset Restart

Reconnect

Fail silently

Default value

Node level

Timeout

Circuit breaker

Complete parameter checking

Checksum

Statically

Dynamically

Oh, my! Theoretical blah!

Uncool!

Know that anyway for eons. So, let’s

move on to the cool parts …

Ah, now we’re talkin’! Here’s the cool stuff!

That‘s practical, applicable. Don‘t you have more code examples? Or even better: Can‘t we turn that all into a live hacking session?

Offline activities?

Hmm, let‘s

focus on the other stuff.

Uh, sounds like one-off, tough

stuff …

Better start with the easier stuff, best

with library support

Yeah, more cool stuff!

Aren‘t there more libs like Hystrix that we can drag into our projects

with a line of configuration?

Well, neat …

I’ll come back to that stuff whenever

I really need it

Page 18: Resilient Functional Service Design

Core

Detect

Recover

Mitigate

Treat

Prevent

Complement

Developer priority

Relevance for application robustness

Ye be warned!

If you don’t get this part right, nothing else matters

Here be dragons!

This is extremely hard and poorly understood

Page 19: Resilient Functional Service Design

Let’s recap …

Page 20: Resilient Functional Service Design

The core parts are

•  extremely important

•  poorly understood

•  massively underestimated

Page 21: Resilient Functional Service Design

Houston, we have a problem!

Page 22: Resilient Functional Service Design

Let’s have a closer look at the core parts

Page 23: Resilient Functional Service Design

Complement

Core

Detect

Prevent

Recover

Mitigate

Treat

Page 24: Resilient Functional Service Design

Core

Detect Treat

Prevent

Recover

Mitigate Complement

Isolation

Page 25: Resilient Functional Service Design

Isolation

•  System must not fail as a whole

•  Split system in parts and isolate parts against each other

•  Avoid cascading failures

•  Foundation of resilient software design

Page 26: Resilient Functional Service Design

Core

Detect Treat

Prevent

Recover

Mitigate Complement

Isolation

Bulkhead

Page 27: Resilient Functional Service Design

Bulkheads are not about thread pools!

Page 28: Resilient Functional Service Design

Bulkheads

•  Core isolation pattern (a.k.a. “failure units” or “units of mitigation”)

•  Diverse implementation choices available, e.g., µservice, actor, scs, ...

•  Implementation choice impacts system and resilience design a lot

•  Shaping good bulkheads is extremely hard (pure design issue)

Page 29: Resilient Functional Service Design

Sounds easy. Where is the problem?

Page 30: Resilient Functional Service Design

Service A Service B Request

Due to functional design, Service A always needs backing from Service B to be able to answer a client request,

i.e. the isolation is broken by design

How do we avoid this …

Page 31: Resilient Functional Service Design

Service

Request

Due to functional design we need to call a lot of services to be able

to answer a client request,

i.e. availability is broken by design

... and this ...

Service

Service

Service Service

Service

Service

Service

Service

Service

Service

Service

Service

Page 32: Resilient Functional Service Design

Mothership Service

(a.k.a. Monolith) Request

By trying to avoid the aforementioned issues we ended up with cramming all

required functionality in one big service

i.e. the isolation is broken by design

... without ending up with this?

Page 33: Resilient Functional Service Design

Let’s use the well-known best practices

•  Divide & conquer a.k.a. functional decomposition

•  DRY (Don’t Repeat Yourself )

•  Design for reusability

•  Layered architecture

•  …

Page 34: Resilient Functional Service Design

Unfortunately, …

Page 35: Resilient Functional Service Design

Service A Service B Request

Due to functional design, Service A always needs backing from Service B to be able to answer a client request,

i.e. the isolation is broken by design

... this usually leads to this ...

Page 36: Resilient Functional Service Design

Service

Request

Due to functional design we need to call a lot of services to be able

to answer a client request,

i.e. availability is broken by design

... and this ...

Service

Service

Service Service

Service

Service

Service

Service

Service

Service

Service

Service

Page 37: Resilient Functional Service Design

Mothership Service

(a.k.a. Monolith) Request

By trying to avoid the aforementioned issues we ended up with cramming all

required functionality in one big service

i.e. the isolation is broken by design

... and in the end also often to this.

Page 38: Resilient Functional Service Design

Welcome to distributed hell!

Page 39: Resilient Functional Service Design

Caches to the rescue!

Page 40: Resilient Functional Service Design

Service A Service B Request

Due to functional design, Service A always needs backing from Service B to be able to answer a client request,

i.e. the isolation is broken by design

Cach

e of

B

Break tight service coupling by caching data/responses

of downstream service

Page 41: Resilient Functional Service Design

Caches to the rescue?

Page 42: Resilient Functional Service Design

Do you really thinkthat copying stale data all over your system

is a suitable measure to fix an inherently broken design?

Page 43: Resilient Functional Service Design

We have to re-learn design for distributed systems!

Page 44: Resilient Functional Service Design

A works-out-of-the-box-in-all-contexts, just-add-water-and-stir,

three-bullet-point panacea for designing perfect bulkheads

Page 45: Resilient Functional Service Design

You need lots of those …

Page 46: Resilient Functional Service Design

... maybe some of those

Page 47: Resilient Functional Service Design

Then it is a lot of hard work …

Page 48: Resilient Functional Service Design

... and there is no silver bullet

Page 49: Resilient Functional Service Design

Yet, a few guiding thoughts about bulkhead design …

Page 50: Resilient Functional Service Design

Foundations of design •  High cohesion, low coupling

•  Separation of concerns

•  Crucial across process boundaries

•  Still poorly understood issue

•  Start with •  Understanding organizational boundaries

•  Understanding use cases and flows

•  Identifying functional domains (à DDD)

•  Finding areas that change independently

•  Do not start with a data model!

Page 51: Resilient Functional Service Design

Short activation paths

•  Long activation paths affect availability

•  Increase latency and likelihood of failures

•  Minimize remote calls per request

•  Need to balance opposing forces

•  Avoid monolith à clear separation of concerns

•  Minimize requests à cluster functionality & data

•  Caches sometimes help, but stale data as trade-off

Page 52: Resilient Functional Service Design

Dismiss reusability

•  Reusability increases coupling

•  Reusability leads to bad service design

•  Reusability compromises availability

•  Reusability rarely pays

•  Do not strive for reuse

•  Strive for replaceability instead

Page 53: Resilient Functional Service Design

Broadening the options ...

Page 54: Resilient Functional Service Design

Core

Detect Treat

Prevent

Recover

Mitigate Complement

Isolation

Communication paradigm

Bulkhead

Page 55: Resilient Functional Service Design

Communication paradigm

•  Request-response <-> messaging <-> events <-> …

•  Heavily influences resilience patterns to be used

•  Also heavily influences functional bulkhead design

•  Very fundamental decision which is often underestimated

Page 56: Resilient Functional Service Design

µS

Request/Response : Horizontal slicing

Flow / Process

µS µS

µS µS µS

µS

Event-driven : Vertical slicing

µS µS

µS

µS µS

Flow / Process

Page 57: Resilient Functional Service Design

Synchronous R/R vs. asynchronous events

•  Decomposition •  Vertically divide-and-conquer vs. horizontally go-with-the-flow

•  Coordination •  Coordination logic/services and orchestration vs. event chains and choreography

•  Transactions •  Built-in transaction handling vs. external supervision

•  Error handling •  Built into service vs. escalation/supervision strategy

•  Separation of concerns •  Multiple responsibilities service vs. single responsibility services

•  Encapsulation •  Domain logic distributed across services vs. domain logic in one place •  Reusability vs. Replaceability

•  Complexity •  A draw …

Page 58: Resilient Functional Service Design

The communication paradigm influences the functional service design a lot

and also the resilience patterns to be used

Page 59: Resilient Functional Service Design

Example: order fulfillment •  Simple order, credit card, non-digital items

•  Add coupons incl. validation •  Add promotions incl. usage notification •  Add bonus card incl. purchase notification

•  Customer accounts as payment type •  PayPal as payment type

•  Integrate digital music library •  Integrate digital video library •  Integrate e-book library

Page 60: Resilient Functional Service Design

Design exercise – Part 1 Create a bulkhead design for the case study •  Use one communication paradigm

•  Synchronous request/response (e.g., REST) •  Asynchronous messaging (e.g., Akka) •  Asynchronous events (e.g., Pub/Sub)

•  Assume incremental requirements •  How many services do you need to touch •  What about the functional isolation of the services •  How big/maintainable are the resulting services •  Take a few notes

Page 61: Resilient Functional Service Design

Online Shop Checkout

Credit Card Provider

Warehouse System

Coupon Management

Campaign Management

PayPal

Loyalty Management

Accounts Receivables Music Library

E-Book Library Video Library

E-Mail Server

Customer pressed

“Buy now”

?

Page 62: Resilient Functional Service Design

Order Fulfillment Service

Online Shop

Payment Service

Credit Card Provider

Shipment Service

Warehouse System

<Foreign Service> <Own Service>

Coupon Management

Promotion Campaign

Management Loyalty

Account Service

Payment Provider

PayPal

Loyalty Management

Accounts Receivables

Music Library

E-Book Library

Video Library

E-Mail Server

Coupon

Credit Card

Coordinate

Warehouse

Coordinate

Assets

Notify Cust.

PayPal

Coordinate

Page 63: Resilient Functional Service Design

Order confirmed

Online Shop

Credit Card Provider

Warehouse System

<Foreign Service>

<Own Service>

Coupon Management

Campaign Management

Account service

Credit Card Service

Loyalty Management

Accounts Receivables

Music Library

E-Book Library

Video Library E-Mail Server

PayPal

PayPal Service

Warehouse Service

Promotion Service

Bonus Card Service

Coupon Service

Music Library Service

Video Library Service

E-Book Library Service

Notification Service

Payment authorized Digital asset provisioned

Payment failed

<Event>

Order fulfillment supervisor

Track flow of events Reschedule events in case of failure

Services are responsible to eventually succeed or fail for good, usually incorporating a supervision/escalation hierarchy for that

Page 64: Resilient Functional Service Design

Do not limit your design options upfront without an important reason

Page 65: Resilient Functional Service Design

Wrap-up

•  Today’s systems are distributed •  Failures are not avoidable, nor predictable •  Resilient software design needed

•  Bulkhead design is •  crucial for application robustness •  poorly understood •  massively underrated •  different from traditional design best practices

•  Communication paradigms broaden your bulkhead design options

Page 66: Resilient Functional Service Design

We have to re-learn design for distributed systems

Page 67: Resilient Functional Service Design

@ufried Uwe Friedrichsen | [email protected] | http://slideshare.net/ufried | http://ufried.tumblr.com

Page 68: Resilient Functional Service Design