victorops
Reducing Alert Fatigue
Mike Meredith @vo_mike Senior Dir. of IT
Jason Hand @jasonhand DevOps Evangelist
Alert Fatigue: Defined
Exposure to a high volume of frequent alerts, causing desensitization to critical issues, leading to …
People are becoming numb to alerts,
making alerts less effective
image credit: blog.newrelic.com
• Longer response time
• Anxiety
• Sleep deprivation
• Physical effects
• Employee dissatisfaction
Identifying Alert Fatigue
• Take a pulse of your company culture
• ChatOps
• Office interactions
Are there signs of dissatisfaction?
• 1-on-1s
Why does it exist?
• Computing is getting cheaper and automation is easier
• DevOps encourages us to “Monitor All the Things”.
Dealing with existing fatigue
It’s part of being on-call, but doesn’t have to be disruptive
“Your personal life will be interrupted. This is not the ideal situation but it will happen. You have been warned.”
- Matthew Green
… now what?
Reducing fatigue
• Don’t be afraid to ask for help
• Overrides for the future
• “Take on-call” for others
Hold a Weekly Handoff Meeting
• It’s a chance for the whole team to catch up on platform trends
• It promotes continual awareness of platform operations
• People from different disciplines can add context to the previous week’s events
• If the sprint process needs to be diverted to respond to scale events or bugs, this is a great place to figure that out
At the Weekly Handoff Meeting
• Review your alert history and reports, and look for:
• Non-actionable events
• Thresholds that can be adjusted
• Opportunities to use “time periods”
• Redundant alerts
• Review post-mortem reports for major events during the week
• Take a moment to let the whole team voice any operational concerns
• You might discover gaps in your monitoring
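The alert-history review above can be partly automated before the meeting. The following is a minimal sketch, assuming the alert history can be exported as records with a name, source, host, and a flag saying whether the responder had to act. The `Alert` schema, the `review_alerts` function, and the `noisy_threshold` value are all hypothetical and not part of any particular tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """One row of exported alert history (hypothetical schema)."""
    name: str       # alert or check name
    source: str     # monitoring tool that raised it
    host: str       # affected host or service
    actioned: bool  # True if the responder actually had to do something


def review_alerts(history, noisy_threshold=3):
    """Summarise a week of alerts into talking points for the handoff meeting."""
    total = Counter(a.name for a in history)
    actioned = Counter(a.name for a in history if a.actioned)

    # Alerts that never required action are candidates for removal.
    non_actionable = sorted(n for n in total if actioned[n] == 0)

    # Frequent alerts that were acted on less than half the time
    # may need a threshold adjustment (assumed heuristic).
    threshold_candidates = sorted(
        n for n, count in total.items()
        if count >= noisy_threshold and 0 < actioned[n] < count / 2
    )

    # The same alert on the same host from more than one source is redundant.
    sources = {}
    for a in history:
        sources.setdefault((a.name, a.host), set()).add(a.source)
    redundant = sorted(key for key, srcs in sources.items() if len(srcs) > 1)

    return {
        "non_actionable": non_actionable,
        "threshold_candidates": threshold_candidates,
        "redundant": redundant,
    }


# Made-up sample data for illustration only.
history = (
    [Alert("disk_usage", "tool_a", "db1", False)] * 3
    + [Alert("cpu_high", "tool_a", "web1", True)]
    + [Alert("cpu_high", "tool_a", "web1", False)] * 3
    + [Alert("site_down", "tool_a", "web1", True),
       Alert("site_down", "tool_b", "web1", True)]
)
report = review_alerts(history)
```

A report like this is only a starting point for discussion: the team still decides whether an alert that looks non-actionable should be deleted, re-thresholded, or limited to a time period.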
On-Call Rotation Management
What it means:
• Period and duration of on-call shifts
• Processes for managing handoff
• Decisions about coverage
On-Call Rotation Management
Large teams with lots of alerts:
• Daily or follow-the-sun rotation
• Still hold a weekly meeting
• Avoid rotation changes on the weekend
On-Call Rotation Management
Smaller teams with fewer alerts:
• Weekly or bi-weekly rotation
• Hand off in the middle of the week
• Be aware of the impact of platform scale
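A weekly rotation with a midweek handoff can be generated with a few lines of date arithmetic. This is a minimal sketch, assuming a simple round-robin order and a Wednesday handoff. The function name `weekly_rotation`, the member names, and the choice of Wednesday are illustrative assumptions.

```python
from datetime import date, timedelta


def weekly_rotation(members, start, weeks, handoff_weekday=2):
    """Return (handoff_date, member) pairs for a round-robin weekly rotation.

    handoff_weekday follows date.weekday(): Monday is 0, Wednesday is 2.
    """
    # Move forward to the first handoff day on or after the start date.
    offset = (handoff_weekday - start.weekday()) % 7
    first = start + timedelta(days=offset)
    return [
        (first + timedelta(weeks=i), members[i % len(members)])
        for i in range(weeks)
    ]


# Hypothetical three-person team, starting from Monday 1 January 2024.
schedule = weekly_rotation(["ana", "ben", "chris"], date(2024, 1, 1), weeks=4)
for handoff, person in schedule:
    print(handoff.isoformat(), person)
```

Handing off midweek keeps the change away from the weekend and places it on a day when both the outgoing and incoming person are normally at work.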
On-Call Rotation Management
Whatever the team size, carefully consider:
• The real time commitment of an on-call shift:
• How many total alerts a person will receive on average during a shift
• The mean time-to-resolution (MTTR) of your alerts
• How much personal recovery time a rotation provides between on-call shifts
• One week at the very least
• More is better, to a point
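The considerations above reduce to simple arithmetic. The sketch below assumes every alert takes roughly the MTTR to resolve and that the team rotates in a plain round-robin. Both function names and the sample figures are hypothetical.

```python
def shift_load_hours(alerts_per_shift, mttr_minutes):
    """Rough hands-on time per shift: average alert count times MTTR."""
    return alerts_per_shift * mttr_minutes / 60


def recovery_weeks(team_size, shift_weeks=1):
    """Weeks off between shifts when team_size people rotate round-robin."""
    # Each person sits out while every other team member takes one shift.
    return (team_size - 1) * shift_weeks


# Made-up example: 40 alerts per shift at a 30-minute MTTR, four-person team.
print(shift_load_hours(40, 30))  # 20.0 hours of alert handling per shift
print(recovery_weeks(4))         # 3 weeks of recovery between weekly shifts
```

Under these assumptions, a two-person team on weekly shifts only just meets the one-week minimum, which is one reason to expand the on-call pool.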
“The road to Hell is paved with good intentions”
Rogue IT and Over-Monitoring
• Establish a process for adding new alerts and alert types
• Make sure alert sources are documented, identifiable, and reachable
• Apply multiple-source monitoring only where it really makes sense
• Differentiate between performance and functional monitoring
Alert Routing
Large teams with several departments:
• Have an on-call rotation for each department
• Use rules to route alerts straight to the right department
• Combine with a larger “triage” rotation
• Use runbooks!
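Rule-based routing with a triage fallback might look like the following. This is a minimal sketch, assuming alerts arrive as dictionaries with `service` and `message` fields. The `Rule` type, the rule list, and the team names are hypothetical and do not reflect any specific product's routing syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Route to `team` when `contains` appears in the alert's `field`."""
    field: str
    contains: str
    team: str


# Hypothetical rules; the first match wins, so order them most specific first.
RULES = [
    Rule("service", "mysql", "database"),
    Rule("service", "nginx", "web"),
    Rule("message", "latency", "network"),
]


def route(alert, rules=RULES, fallback="triage"):
    """Return the on-call rotation that should receive this alert."""
    for rule in rules:
        if rule.contains in alert.get(rule.field, "").lower():
            return rule.team
    # Nothing matched: hand the alert to the wider triage rotation.
    return fallback


print(route({"service": "mysql-primary", "message": "replication lag"}))
print(route({"service": "billing-api", "message": "High latency"}))
print(route({"service": "cron", "message": "job failed"}))
```

Sending unmatched alerts to triage rather than dropping them means a missing rule shows up as extra triage load, which the weekly review can then turn into a new rule.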
Alert Routing
Smaller multidisciplinary teams:
• Make sure you have effective contact methods
• One-person teams can make sense
• Be careful of disturbing specialists unnecessarily
Expanding the On-Call Pool
Put developers on-call
• It promotes empathy among teams and better product understanding
• It promotes the idea that code quality means “runs well in production”
• It reinforces the value of keeping the alert signal-to-noise ratio low
• Use runbooks to ease the spread of process knowledge
Expanding the On-Call Pool
Incentivizing on-call
• Some orgs pay a small bonus to the on-call team member
• It reinforces the value the company puts on uptime
• It won’t prevent fatigue, but it may help ease recovery
Expanding the On-Call Pool
Who else could go on-call?
• Anyone who has a stake in operations and the technical aptitude
• Operations support, QA, even management
• It all depends on the organization