Reducing Alert Fatigue

Mike Meredith @vo_mike Senior Dir. of IT

Jason Hand @jasonhand DevOps Evangelist


Alert Fatigue: Defined

Exposure to a high volume of frequent alerts, causing desensitization to critical issues, leading to …

People are becoming numb to alerts, making alerts less effective

image credit: blog.newrelic.com

• Longer response time

• Anxiety

• Sleep deprivation

• Physical effects

• Employee dissatisfaction

Identifying Alert Fatigue

Reporting

• MTTA & MTTR trends

• Incident trends
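
As a rough illustration of the MTTA and MTTR trend report, the sketch below groups incidents by ISO week and averages time-to-acknowledge and time-to-resolve. The `INCIDENTS` data, the `weekly_trends` name, and the choice to measure both times from the moment the alert fired are assumptions made for this example, not something the slides specify.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (triggered, acknowledged, resolved).
INCIDENTS = [
    (datetime(2015, 3, 2, 1, 0), datetime(2015, 3, 2, 1, 4), datetime(2015, 3, 2, 1, 30)),
    (datetime(2015, 3, 3, 14, 0), datetime(2015, 3, 3, 14, 2), datetime(2015, 3, 3, 14, 20)),
    (datetime(2015, 3, 10, 3, 0), datetime(2015, 3, 10, 3, 12), datetime(2015, 3, 10, 4, 0)),
    (datetime(2015, 3, 12, 23, 0), datetime(2015, 3, 12, 23, 18), datetime(2015, 3, 13, 0, 10)),
]


def weekly_trends(incidents):
    """Return {(iso_year, iso_week): (mtta_minutes, mttr_minutes)}."""
    by_week = {}
    for triggered, acked, resolved in incidents:
        iso = triggered.isocalendar()
        key = (iso[0], iso[1])
        # Assumption: both durations are measured from the trigger time.
        ack_min = (acked - triggered).total_seconds() / 60
        res_min = (resolved - triggered).total_seconds() / 60
        by_week.setdefault(key, []).append((ack_min, res_min))
    return {
        week: (mean(a for a, _ in rows), mean(r for _, r in rows))
        for week, rows in sorted(by_week.items())
    }


if __name__ == "__main__":
    for week, (mtta, mttr) in weekly_trends(INCIDENTS).items():
        print(week, f"MTTA={mtta:.1f} min", f"MTTR={mttr:.1f} min")
```

A week-over-week rise in either number, with no matching rise in incident severity, is one possible sign that responders are tuning alerts out.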

Identifying Alert Fatigue

• Take a pulse of your company culture

• ChatOps

• Office interactions

Are there signs of dissatisfaction?

• 1-on-1s

Why does it exist?

• Computing is getting cheaper and automation is easier

• DevOps encourages us to “Monitor All the Things”.

Dealing with existing fatigue

It’s part of being on-call, but doesn’t have to be disruptive

“Your personal life will be interrupted. This is not the ideal situation but it will happen. You have been warned.”

- Matthew Green

… now what?

Reducing fatigue

• Don’t be afraid to ask for help

• Overrides for the future

• “Take on-call” for others

Hold a Weekly Handoff Meeting

• It’s a chance for the whole team to catch up on platform trends

• It promotes continual awareness of platform operations

• People from different disciplines can add context to the previous week’s events

• If the sprint process needs to be diverted to respond to scale events or bugs, this is a great place to figure that out

At the Weekly Handoff Meeting

• Review your alert history and reports, and look for:

• Non-actionable events

• Thresholds that can be adjusted

• Opportunities to use “time periods”

• Redundant alerts

• Review post-mortem reports for major events during the week

• Take a moment to let the whole team voice any operational concerns

• You might discover gaps in your monitoring
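
One way to prepare the alert-history review is sketched below: it flags checks that fired repeatedly without anyone acting on them (candidates for "non-actionable") and checks reported by two different sources within a few minutes of each other (candidates for "redundant"). The record layout, the source names, and the thresholds `min_count=3` and `window=5` are all hypothetical choices for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical alert history: (check_name, source, minute_fired, action_taken).
HISTORY = [
    ("disk_full", "monitor_a", 10, True),
    ("disk_full", "monitor_b", 11, False),
    ("cpu_high", "monitor_a", 50, False),
    ("cpu_high", "monitor_a", 200, False),
    ("cpu_high", "monitor_a", 400, False),
    ("api_down", "monitor_c", 600, True),
]


def non_actionable(history, min_count=3):
    """Checks that fired at least min_count times and never led to action."""
    fired = Counter(name for name, _, _, _ in history)
    acted = Counter(name for name, _, _, acted_on in history if acted_on)
    return sorted(n for n, c in fired.items() if c >= min_count and acted[n] == 0)


def redundant(history, window=5):
    """Checks reported by two different sources within `window` minutes."""
    by_check = defaultdict(list)
    for name, source, minute, _ in history:
        by_check[name].append((minute, source))
    found = set()
    for name, events in by_check.items():
        events.sort()
        # Compare each event with the next one in time order.
        for (m1, s1), (m2, s2) in zip(events, events[1:]):
            if s1 != s2 and m2 - m1 <= window:
                found.add(name)
    return sorted(found)


if __name__ == "__main__":
    print("non-actionable candidates:", non_actionable(HISTORY))
    print("redundant candidates:", redundant(HISTORY))
```

The output is only a list of candidates for the meeting to discuss. Whether an alert should be removed, re-thresholded, or limited to certain time periods remains a team decision.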

On-Call Rotation Management

What it means:

• Period and duration of on-call shifts

• Processes for managing handoff

• Decisions about coverage

On-Call Rotation Management

Large teams with lots of alerts:

• Daily or follow-the-sun rotation

• Still hold a weekly meeting

• Avoid rotation changes on the weekend

On-Call Rotation Management

Smaller teams with fewer alerts:

• Weekly or bi-weekly rotation

• Hand off in the middle of the week

• Be aware of the impact of platform scale
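
A weekly rotation with a mid-week handoff could be generated along these lines. This is a minimal sketch: the function name `weekly_rotation`, the example names, and the default of Wednesday as the handoff day are assumptions for illustration only.

```python
from datetime import date, timedelta
from itertools import cycle


def weekly_rotation(people, start, shifts, handoff_weekday=2):
    """Build (person, shift_start, shift_end) tuples for weekly shifts.

    handoff_weekday uses Python's convention: Monday=0, so 2 is Wednesday.
    """
    # Move forward to the first handoff day on or after `start`.
    first = start + timedelta(days=(handoff_weekday - start.weekday()) % 7)
    schedule = []
    for i, person in zip(range(shifts), cycle(people)):
        begin = first + timedelta(weeks=i)
        schedule.append((person, begin, begin + timedelta(weeks=1)))
    return schedule


if __name__ == "__main__":
    for person, begin, end in weekly_rotation(["ana", "ben", "chi"], date(2015, 3, 2), 4):
        print(f"{person}: {begin} to {end}")
```

Handing off mid-week means the outgoing and incoming engineers are both at work on the day of the change, which fits the advice to avoid weekend rotation changes.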

On-Call Rotation Management

Whatever the team size, carefully consider:

• The real time commitment of an on-call shift:

• How many total alerts a person will receive on average during a shift

• The mean time-to-resolution (MTTR) of your alerts

• How much personal recovery time a rotation provides between on-call shifts

• One week at the very least

• More is better, to a point
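
The two considerations above reduce to simple arithmetic, sketched here. The function names and sample numbers are hypothetical. The load estimate assumes every alert takes the mean time-to-resolution, and the recovery estimate assumes a simple round-robin in which everyone takes equal shifts.

```python
def shift_load_hours(alerts_per_shift, mttr_minutes):
    """Rough hands-on time per shift: alert count times mean resolution time."""
    return alerts_per_shift * mttr_minutes / 60


def recovery_weeks(team_size, shift_weeks=1):
    """Weeks off between shifts in an equal round-robin rotation."""
    return (team_size - 1) * shift_weeks


if __name__ == "__main__":
    # Example: 42 alerts in a week-long shift, 20 minutes MTTR each.
    print("hands-on hours per shift:", shift_load_hours(42, 20))
    # Example: a four-person weekly rotation.
    print("recovery weeks between shifts:", recovery_weeks(4))
```

Under these assumptions a two-person weekly rotation gives exactly one week of recovery, the minimum the slide recommends.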

“The road to Hell is paved with good intentions”

Rogue IT and Over-Monitoring

• Establish a process for adding new alerts and alert types

• Make sure alert sources are documented, identifiable, and reachable

• Apply multiple-source monitoring only where it really makes sense

• Differentiate between performance and functional monitoring

Alert Routing

Large teams with several departments:

• Have an on-call rotation for each department

• Use rules to route alerts straight to the right department

• Combine with a larger “triage” rotation

• Use runbooks!
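
Rule-based routing with a triage fallback might look like the sketch below. The rule format, field names, and rotation names are invented for this example and do not reflect any particular alerting product's configuration.

```python
# Hypothetical routing rules: (alert field, expected value, rotation name).
# The first matching rule wins.
RULES = [
    ("service", "postgres", "database-oncall"),
    ("service", "checkout-api", "payments-oncall"),
    ("team", "network", "network-oncall"),
]

# Rotation that catches anything no rule claims.
TRIAGE = "triage-oncall"


def route(alert, rules=RULES, fallback=TRIAGE):
    """Return the rotation that should receive this alert."""
    for field, value, rotation in rules:
        if alert.get(field) == value:
            return rotation
    return fallback


if __name__ == "__main__":
    print(route({"service": "postgres", "message": "replication lag"}))
    print(route({"team": "network", "message": "packet loss"}))
    print(route({"service": "unknown-thing", "message": "???"}))
```

Alerts that fall through to triage are worth reviewing later: each one suggests either a missing rule or an undocumented alert source.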

Alert Routing

Smaller multidisciplinary teams:

• Make sure you have effective contact methods

• One-person teams can make sense

• Be careful of disturbing specialists unnecessarily

Expanding the On-Call Pool

Put developers on-call

• It promotes empathy among teams and better product understanding

• It promotes the idea that code quality means “runs well in production”

• It reinforces the value of keeping the alert signal-to-noise ratio high

• Use runbooks to ease the spread of process knowledge

Expanding the On-Call Pool

Incentivizing on-call

• Some orgs pay a small bonus to the on-call team member

• It reinforces the value the company puts on uptime

• It won’t prevent fatigue, but it may help ease recovery

Expanding the On-Call Pool

Who else could go on-call?

• Anyone who has a stake in operations and the technical aptitude

• Operations support, QA, even management

• It all depends on the organization

Preventing Alert Fatigue

• Hold on-call handoff meetings

• Carefully manage on-call rotation

• Contain rogue IT and avoid over-monitoring

• Route alerts to the right people

• Expand the on-call pool