Chaos Engineering - Limiting Damage During Chaos Experiments

Nils Meder | Computer Scientist @ Adobe



© 2016 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Agenda

• Doing Chaos In Your Production System

• Building A Context Around Your Experiment

• Protect Your Infrastructure

• Example: Kill Random Instances

• Protect Your Application

• Resilience Patterns

• Wrap-Up & Discussion


Doing Chaos In Your Production System

• Testing in Production is The Ultimate Goal

• But, It is Not The First Step

• There are Always Differences Between Staging and Production

• Scale, Networking, Datasets, …

• Start In Staging Environment

• Make Sure The Experiment Doesn’t Bring Down The Whole Service

• “Know Your Enemy” - Have A Clear View of Your Environment

• Iterate Over Your Experiments

• Be Brave - Having Just Some Basic Tests Running in Production is Better Than None


Building A Context Around Your Experiments

• Chaos Testing Is Not Just Pulling The Plug

• Focus On Business Critical Scenarios/Components First

• Have A Clear Goal, e.g. What Happens When The Network Fails?

• Focus - Run One Experiment At a Time

• Monitor Your Experiments

• Define Fallbacks And Defaults


Protect Your Infrastructure

• Target Infrastructure Components

• Think About Recovery

• Take Snapshots

• Limit The Damage To Single Instances

• Limit The Damage To Groups of Instances

• Of The Same Kind

• Within The Same Workflow

• Limit Percentage Of Impact

• Limit What Chaos Tests Are Allowed To Do
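The limits listed above (one group of instances, a capped percentage of impact) can be sketched as a small target-selection helper. This is only an illustration under stated assumptions: the `Instance` record, its `group` field, and the `pick_targets` function are hypothetical names and do not come from the talk.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    # Hypothetical inventory record; the field names are assumptions.
    instance_id: str
    group: str  # e.g. the kind of service or the workflow the instance belongs to


def pick_targets(instances, group, max_fraction, rng=None):
    """Pick random victims from one group, capped at max_fraction of that group."""
    if not 0 <= max_fraction <= 1:
        raise ValueError("max_fraction must be between 0 and 1")
    rng = rng or random.Random()
    # Limit the damage to instances of the same kind.
    candidates = [i for i in instances if i.group == group]
    # Round down so the cap is never exceeded; small groups may yield no targets.
    limit = math.floor(len(candidates) * max_fraction)
    return rng.sample(candidates, limit)


fleet = [Instance(f"web-{n}", "web") for n in range(10)]
fleet += [Instance(f"db-{n}", "db") for n in range(3)]

# At most 25% of the ten web instances, i.e. two of them, are selected.
targets = pick_targets(fleet, group="web", max_fraction=0.25, rng=random.Random(7))
```

Rounding down is a deliberately cautious choice here: a group that is too small for the configured percentage is simply left alone.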


Example: Kill Random Instances

• Terminate Random EC2 Instances

• Focus:

• What Happens If A Number Of My Servers Die?

• Does Autoscaling Work?

• Is The Web API Still Serving Requests?

• The Test is Only Allowed To Terminate Instances

• Simulate The Experiment Beforehand

• Take An Environment Snapshot

• Run The Test
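A minimal sketch of the two safeguards named in this example: the test may only terminate instances, and it is simulated before it runs for real. The `ChaosTest` class, the `kill_random_instances` function, and the `terminate` callable are hypothetical; in a real setup `terminate` would wrap the cloud provider's terminate call.

```python
import random


class ActionNotAllowed(Exception):
    """Raised when a test tries something outside its permitted actions."""


class ChaosTest:
    def __init__(self, allowed_actions, dry_run=True):
        self.allowed_actions = frozenset(allowed_actions)
        self.dry_run = dry_run  # simulate by default; opt in to real damage
        self.log = []

    def perform(self, action, target, do_it):
        # Refuse anything the test was not explicitly allowed to do.
        if action not in self.allowed_actions:
            raise ActionNotAllowed(f"{action} is not permitted for this test")
        if self.dry_run:
            self.log.append(("simulated", action, target))
            return
        do_it(target)
        self.log.append(("executed", action, target))


def kill_random_instances(test, instance_ids, count, terminate, rng=None):
    """Terminate up to `count` randomly chosen instances through the guarded test."""
    rng = rng or random.Random()
    ids = list(instance_ids)
    for instance_id in rng.sample(ids, min(count, len(ids))):
        test.perform("terminate", instance_id, terminate)


terminated = []  # stands in for the real side effect
instance_ids = ["i-1", "i-2", "i-3", "i-4"]

# First pass: simulate only, nothing is terminated.
simulation = ChaosTest({"terminate"}, dry_run=True)
kill_random_instances(simulation, instance_ids, 2, terminated.append)

# Second pass: run the test for real.
real_run = ChaosTest({"terminate"}, dry_run=False)
kill_random_instances(real_run, instance_ids, 2, terminated.append)
```

The log of the simulated run shows which instances would have been hit, which can be reviewed before the real run is started.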


[Figure: diagram with a Chaos Test, a Client, and application instances App1, App2, App3, Appx]


Protect Your Application

• Plan For Chaos in Your Application

• Fail Fast, But Keep The Streams Flowing

• Build Your Application Isolated

• Apply Loose Coupling

• Introduce Latency Control

• Real-Time Data and Diagnostics
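One way to read "fail fast" and "latency control" together is to bound every call to a dependency with a timeout and answer with a default when the bound is exceeded. The sketch below assumes a thread pool is acceptable for this; `call_with_timeout` and `slow_dependency` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared worker pool; the size is an arbitrary assumption.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_s, fallback):
    """Run func in a worker thread; return fallback if it is too slow or fails."""
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The caller stops waiting; the worker thread itself is not interrupted.
        return fallback
    except Exception:
        # Any error raised by the dependency also degrades to the default.
        return fallback


def slow_dependency():
    time.sleep(0.2)  # simulated latency injected by a chaos experiment
    return "fresh data"


result = call_with_timeout(slow_dependency, timeout_s=0.05, fallback="cached default")
```

The caller keeps responding within the time budget even while the dependency is slow, which is what keeps the streams flowing during an experiment.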


Resilience Patterns

• Bulkheads

• Building Failure Units

• Protect App Against Cross-Failures

• Event-Driven & Stateless

• Embrace Loose Coupling

• Circuit Breaker

• Timeouts

• Fallbacks

• Healthchecks
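As an illustration of the circuit breaker and fallback patterns from the list above, here is a minimal sketch. It is not the implementation from the talk or from any particular library; the class name, the thresholds, and the injectable `clock` are assumptions made for the example.

```python
import time


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable so the behaviour can be tested without waiting
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: fail fast without touching the dependency
            # Reset period elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call re-opens the circuit; otherwise open at the threshold.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


now = [0.0]  # fake clock for the demonstration
attempts = []


def broken_dependency():
    attempts.append(1)
    raise RuntimeError("dependency is down")


breaker = CircuitBreaker(max_failures=2, reset_after_s=10.0, clock=lambda: now[0])
for _ in range(3):
    answer = breaker.call(broken_dependency, lambda: "default")
```

After two failures the breaker opens, so the third call returns the default without reaching the broken dependency at all.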


Resilience Patterns

• “Release It!” - Michael Nygard

• More On Resilience Patterns, Anti-Patterns and Case Studies

• ISBN-13: 978-0978739218


Wrap-Up & Discussion

• Expect The Unexpected

• Failures Are The Normal Case & Not Predictable

• Do Not Try To Avoid Failures. Embrace Them.

• Chaos Engineering Helps To Discover Weak Points

• Apply Resilience Patterns


References

• Resilience Patterns: http://de.slideshare.net/ufried/patterns-of-resilience

• Bulkheads: http://skife.org/architecture/fault-tolerance/2009/12/31/bulkheads.html

• Making APIs More Resilient: http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html

• “Release It!” - Michael Nygard
