Flight training for DevOps

Jorge Salamero Sanz

CfgMgmtCamp 1 Feb 2016

War Games - Flight training for DevOps

How to Monitor MySQL

The Cost of Uptime

$ 3.55bn 2015 Q4

$ 1.21bn 2015 Q2

$ 4.1bn 2015 Q1

How much do you spend?

● Infrastructure automation

● Configuration automation

● Continuous testing

● Continuous deployment / delivery

● Monitoring

● Logs, error handling

● Feedback

● Human Ops

DevOps lifecycle

● Prepare

● Respond

● Postmortem

Expect downtime

● Power failure to half of our servers

● Automated failover unavailable (known failure condition)

● Manual DNS switch required

● Expected impact: 20 min

● Actual impact: 43 min

Incident example

● Unfamiliarity with the process

● Pressure of time-sensitive event (panic effect)

● Escalation introduces delays

The Human Factor

● Extensive use of checklists

● Not to be followed blindly: use knowledge and experience

● Independent system

● Searchable

● List of known issues and documented workarounds/fixes

Documented procedures
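
A searchable list of known issues can start very small. The sketch below is one possible shape, not a description of any particular tool: the `KnownIssue` fields, the `search` helper and the sample entries are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class KnownIssue:
    title: str
    symptoms: str
    workaround: str


def search(issues, query):
    """Return issues whose title or symptoms contain every word in the query."""
    words = query.lower().split()
    return [
        issue for issue in issues
        if all(w in (issue.title + " " + issue.symptoms).lower() for w in words)
    ]


# Hypothetical entries, for illustration only.
ISSUES = [
    KnownIssue("Automated failover unavailable",
               "power failure in one data centre, failover does not trigger",
               "switch DNS manually to the secondary site"),
    KnownIssue("Disk full on log volume",
               "writes fail, services report no space left",
               "rotate and compress old logs"),
]

if __name__ == "__main__":
    for issue in search(ISSUES, "failover power"):
        print(issue.title, "->", issue.workaround)
```

Keeping this data on a system independent of production matters more than the search logic itself: it has to be reachable while production is down.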

● Realistic incident simulation

● Practice general response process

● Practice specific incident response

● Deficiencies: practice and improve the process

Practiced procedures

● First responder, acknowledge alert

● Load incident response checklist

● Log into #ops-war-room in Slack

● Log incident into JIRA

● Begin investigation

General response process
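
The five steps above could be encoded as a checklist that also records when each step was completed. This is a minimal sketch under assumptions: `ChecklistRun` and its methods are hypothetical names, and nothing here talks to Slack or JIRA; it only tracks progress and timestamps.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

GENERAL_RESPONSE_CHECKLIST = [
    "Acknowledge alert (first responder)",
    "Load incident response checklist",
    "Log into #ops-war-room in Slack",
    "Log incident into JIRA",
    "Begin investigation",
]


@dataclass
class ChecklistRun:
    steps: list
    completed: list = field(default_factory=list)  # (step, UTC timestamp) pairs

    def next_step(self):
        """Return the next pending step, or None when the checklist is done."""
        if len(self.completed) < len(self.steps):
            return self.steps[len(self.completed)]
        return None

    def complete_next(self, now=None):
        """Mark the next pending step as done and record when it happened."""
        step = self.next_step()
        if step is None:
            raise RuntimeError("checklist already complete")
        self.completed.append((step, now or datetime.now(timezone.utc)))
        return step


if __name__ == "__main__":
    run = ChecklistRun(GENERAL_RESPONSE_CHECKLIST)
    while run.next_step() is not None:
        run.complete_next()
    for step, when in run.completed:
        print(when.isoformat(timespec="seconds"), step)
```

The recorded timestamps give the postmortem a timeline without anyone having to reconstruct it from memory.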

● The “limits of human memory and attention”

○ Complexity

○ Stress and fatigue

○ Ego

● Pilots, doctors, divers: “Bruce Willis Ruins All Films” (BCD, weights, releases, air, final)

Pre-flight checklists

● Increase confidence

● Reduce panic

● Better coordination

● Trust relationships

● Improve time to resolution

Humans

● Replica environment

● or mock command line

● Record actions and timing

● Multiple failures

● Unexpected results

Realistic scenarios
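
When a replica environment is not available, a mock command line can stand in for it. The sketch below shows one way to return canned output while recording actions and timing. `MockShell`, the `SCENARIO` data and the host names `db1`/`db2` are hypothetical, chosen only for illustration.

```python
import time


class MockShell:
    """Mock command line: returns canned output and records what was typed and when."""

    def __init__(self, canned_outputs, clock=time.monotonic):
        self.canned_outputs = canned_outputs
        self.clock = clock
        self.started = clock()
        self.log = []  # (seconds since start, command) pairs

    def run(self, command):
        self.log.append((self.clock() - self.started, command))
        # Unknown commands give an unexpected result, as a real incident might.
        return self.canned_outputs.get(command, "command not found: " + command)


# Hypothetical scenario data, for illustration only.
SCENARIO = {
    "ping db1": "Request timeout",
    "ping db2": "64 bytes from db2: time=0.4 ms",
}

if __name__ == "__main__":
    shell = MockShell(SCENARIO)
    for cmd in ["ping db1", "ping db2", "restart db1"]:
        print("$", cmd)
        print(shell.run(cmd))
    for elapsed, cmd in shell.log:
        print(f"{elapsed:6.2f}s  {cmd}")
```

The `clock` parameter is injectable so that the recording itself can be tested with a fake clock.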

● Team and individual test of response

● Run real commands

● Training the people

● Training the procedures

● Training the tools

Simulation goals

● Objective review

● Suggestions for improvements

● Do it again

● Scenario evolves

● People forget

loop(): review and repeat

● Failure sucks

● Fearless, blameless

● Significant learning

● Restores confidence

● Increases credibility

Postmortem

● Short regular updates

● Even “we’re still looking into it”

● ~1 week to publish full version

○ follow-up incidents

○ check with 3rd party providers

○ timeline for required changes

Postmortem Timing

● Root cause

● Turn of events that led to the failure

● Steps to identify & isolate the cause

● Services affected

● How we fixed it

● What we have learned and changed

Postmortem Content
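
The content list above maps naturally onto a template that flags unfinished sections before publishing. This is a sketch only: the `Postmortem` class and its field names are assumptions, and the sample values are borrowed from the incident example for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class Postmortem:
    root_cause: str
    turn_of_events: str
    identification_steps: str
    services_affected: str
    fix: str
    lessons_and_changes: str

    def missing_sections(self):
        """Names of sections that are still empty, to check before publishing."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


if __name__ == "__main__":
    # Illustrative draft; the empty sections are the ones still to be written.
    draft = Postmortem(
        root_cause="Power failure to half of our servers",
        turn_of_events="Automated failover unavailable, manual DNS switch required",
        identification_steps="",
        services_affected="",
        fix="Manual DNS switch",
        lessons_and_changes="",
    )
    print("Still to write:", ", ".join(draft.missing_sections()))
```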

Jorge Salamero Sanz

Chief Developer Advocate

@bencerillo

@serverdensity

our DevOps stories, no product spam

blog.serverdensity.com
