Jorge Salamero Sanz
CfgMgmtCamp 1 Feb 2016
War Games - Flight training for DevOps
● Infrastructure automation
● Configuration automation
● Continuous testing
● Continuous deployment / delivery
● Monitoring
● Logs, error handling
● Feedback
● Human Ops
DevOps lifecycle
● Power failure to half of our servers
● Automated failover unavailable (known failure condition)
● Manual DNS switch required
● Expected impact: 20 min
● Actual impact: 43 min
Incident example
● Unfamiliarity with the process
● Pressure of time-sensitive event (panic effect)
● Escalation introduces delays
The Human Factor
● Extended use of checklists
● Not to be followed blindly: use knowledge and experience
● Independent system
● Searchable
● List of known issues and documented workarounds/fixes
Documented procedures
● Realistic incident simulation
● Practice general response process
● Practice specific incident response
● Deficiencies: practice and improve the process
Practiced procedures
● First responder, acknowledge alert
● Load incident response checklist
● Log into #ops-war-room in Slack
● Log incident into JIRA
● Begin investigation
General response process
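The general response process above can be sketched as an ordered checklist that records when each step was completed. This is only an illustration: the `Step` and `Checklist` names and the timestamping are assumptions, not something shown in the talk.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Step:
    description: str
    done_at: Optional[datetime] = None  # set when the step is completed


@dataclass
class Checklist:
    """Hypothetical ordered checklist; steps are completed strictly in order."""
    steps: List[Step] = field(default_factory=list)

    def next_step(self) -> Optional[Step]:
        # First step that has not been completed yet, or None when finished.
        for step in self.steps:
            if step.done_at is None:
                return step
        return None

    def complete_next(self) -> Optional[Step]:
        # Mark the next open step as done and timestamp it (UTC).
        step = self.next_step()
        if step is not None:
            step.done_at = datetime.now(timezone.utc)
        return step


def general_response_checklist() -> Checklist:
    # Step wording follows the slide.
    return Checklist([Step(d) for d in (
        "First responder acknowledges the alert",
        "Load the incident response checklist",
        "Log into #ops-war-room in Slack",
        "Log the incident in JIRA",
        "Begin investigation",
    )])
```

Keeping a timestamp per step means a drill or a real incident leaves a timeline behind that can be reviewed afterwards.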
● The “limits of human memory and attention”
  ○ Complexity
  ○ Stress and fatigue
  ○ Ego
● Pilots, doctors, divers: “Bruce Willis Ruins All Films” (BCD, weights, releases, air, final)
Pre-flight checklists
● Increase confidence
● Reduce panic
● Better coordination
● Trust relationships
● Improves time to resolution
Humans
● Replica environment
● or mock command line
● Record actions and timing
● Multiple failures
● Unexpected results
Realistic scenarios
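A mock command line that records actions and timing could look roughly like the sketch below. `MockShell`, its canned responses, and the injectable clock are all hypothetical; the talk does not describe how its simulations were implemented.

```python
import time
from typing import Callable, Dict, List, Tuple


class MockShell:
    """Hypothetical mock command line for drills.

    Commands are answered from canned responses instead of being executed,
    and every command is logged with the seconds elapsed since the drill began.
    """

    def __init__(self, responses: Dict[str, str],
                 clock: Callable[[], float] = time.monotonic):
        self.responses = responses
        self.clock = clock
        self.started = clock()
        self.log: List[Tuple[float, str]] = []

    def run(self, command: str) -> str:
        elapsed = self.clock() - self.started
        self.log.append((elapsed, command))
        # Unknown commands get a generic error, like a real shell would give.
        return self.responses.get(command, "command not found: " + command)
```

The canned responses are where multiple failures or unexpected results can be scripted, and the log gives the review a record of what each person typed and when.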
● Team and individual test of response
● Run real commands
● Training the people
● Training the procedures
● Training the tools
Simulation goals
● Objective review
● Suggestions for improvements
● Do it again
● Scenario evolves
● People forget
loop(): review and repeat
● Failure sucks
● Fearless, blameless
● Significant learning
● Restores confidence
● Increases credibility
Postmortem
● Short regular updates
● Even “we’re still looking into it”
● ~1 week to publish full version
  ○ follow-up incidents
  ○ check with 3rd party providers
  ○ timeline for required changes
Postmortem Timing
● Root cause
● Turn of events that led to the failure
● Steps to identify & isolate the cause
● Services affected
● How we fixed it
● What we have learned and changed
Postmortem Content
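The postmortem sections listed above could be captured as a simple record so that an incomplete draft is easy to spot before publishing. The `Postmortem` class and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Postmortem:
    """Hypothetical record with one field per section named on the slide."""
    root_cause: str
    turn_of_events: str
    identification_steps: List[str]
    services_affected: List[str]
    fix: str
    lessons_and_changes: List[str]

    def missing_sections(self) -> List[str]:
        # A section counts as missing when it is an empty string or empty list.
        return [name for name, value in vars(self).items() if not value]
```

A draft is ready to publish once `missing_sections()` returns an empty list.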