MANAGING YOUR HEROES The People Aspect of Monitoring Alex Solomon [email protected] (a.k.a. Dealing with Outages and Failures)

Nagios Conference 2012 - Alex Solomon - Managing Your Heros


DESCRIPTION

Alex Solomon's presentation on the people aspect of monitoring. The presentation was given at the Nagios World Conference North America, held Sept 25-28, 2012 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna


Page 1: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

MANAGING YOUR HEROES

The People Aspect of Monitoring

(a.k.a. Dealing with Outages and Failures)

Alex Solomon

[email protected]

Page 2: Nagios Conference 2012 - Alex Solomon - Managing Your Heros


WHO AM I?

Alex Solomon

• Founder / CEO of PagerDuty

• Intersect Inc.

• Amazon.com

Page 3: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

DEFINITIONS


Page 4: Nagios Conference 2012 - Alex Solomon - Managing Your Heros


Service Level Agreement (SLA)

Mean Time To Response

Mean Time To Resolution (MTTR)

Mean Time Between Failures (MTBF)
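A minimal sketch of how two of these metrics could be computed from an incident log. The incident timestamps are made up for illustration, and MTBF is taken here as the mean gap between the end of one failure and the start of the next, which is one common convention rather than something the slides specify.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, resolved_at).
incidents = [
    (datetime(2012, 9, 1, 2, 0), datetime(2012, 9, 1, 3, 30)),
    (datetime(2012, 9, 11, 14, 0), datetime(2012, 9, 11, 14, 30)),
    (datetime(2012, 9, 21, 8, 0), datetime(2012, 9, 21, 9, 0)),
]


def mean(deltas):
    """Average a non-empty list of timedelta values."""
    return sum(deltas, timedelta()) / len(deltas)


def mttr(incidents):
    """Mean Time To Resolution: average of (resolved - detected)."""
    return mean([resolved - detected for detected, resolved in incidents])


def mtbf(incidents):
    """Mean Time Between Failures: average gap between the end of one
    incident and the detection of the next (assumed convention)."""
    ordered = sorted(incidents)
    gaps = [nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])]
    return mean(gaps)


print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents))
```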

Page 5: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

OUTAGES


Page 6: Nagios Conference 2012 - Alex Solomon - Managing Your Heros


Can we prevent them?

Page 7: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

PREVENTING OUTAGES


Single Points of Failure (SPOFs)

Complex, monolithic systems

Redundant systems

Service-oriented architecture

Page 8: Nagios Conference 2012 - Alex Solomon - Managing Your Heros


Netflix distributed SOA system

Page 9: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

PREVENTING OUTAGES

Change

(not much you can do about this one)

Page 10: Nagios Conference 2012 - Alex Solomon - Managing Your Heros


OUTAGES

Page 11: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

FAILURE LIFECYCLE


Page 12: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

[Figure: failure lifecycle diagram with the stages Monitoring (detect failure), Alert, Investigate, Fix, and Root-cause Analysis]

Page 13: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

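Critical Incident Timeline

Issue is detected ➞ Engineer starts working on issue ➞ Issue is fixed

Phases: Alert ➞ Investigate ➞ Fix

RESPONSE TIME: from "Issue is detected" to "Engineer starts working on issue" (the Alert phase)

RESOLUTION TIME: from "Issue is detected" to "Issue is fixed" (Alert, Investigate and Fix)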

Page 14: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

MONITOR


Page 15: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

MONITOR EVERYTHING!

• Data center

• Network

• Servers

• Database

• Application

• Website

• Business Metrics


All levels of the stack

Page 16: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

WHY MONITOR EVERYTHING?


Metrics!

Metrics!

Metrics!

Page 17: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

TOOLS


• Internal monitoring (behind the firewall):

• External monitoring (SaaS-based):

• Metrics:

• Graphite or

Page 18: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

ALERT


Page 19: Nagios Conference 2012 - Alex Solomon - Managing Your Heros


Best Practice: Categorize alerts by severity.

Page 20: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

SEVERITIES

Define severities based on business impact:

2 critical severities:

• sev1 - large scale business loss

• sev2 - small to medium business loss

2 non-critical severities:

• sev3 - no immediate business loss, customers may be impacted

• sev4 - no business loss, no customers impacted

Page 21: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

Each severity level should have its own standard operating procedure (SOP):

• Who

• How

• Response time

Page 22: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

• Sev1: Major outage, all hands on deck

  • Notify the entire team via phone and SMS

  • Response time: 5 min

• Sev2: Critical issue

  • Notify the on-call person via phone and SMS

  • Response time: 15 min

• Sev3: Non-critical issue

  • Notify the on-call person via email

  • Response time: next day during business hours
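The per-severity procedures above can be captured as plain data. This is only a sketch: the `Sop` class, the `SOPS` table and `should_page` are hypothetical names, and sev4 is left out because the slides give no procedure for it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Sop:
    who: str                          # who gets notified
    how: Tuple[str, ...]              # notification channels
    response_minutes: Optional[int]   # None = next day during business hours


# Values taken from the slide; the structure itself is an assumption.
SOPS = {
    1: Sop("entire team", ("phone", "sms"), 5),
    2: Sop("on-call person", ("phone", "sms"), 15),
    3: Sop("on-call person", ("email",), None),
}


def should_page(severity: int) -> bool:
    """Page (phone/SMS) only for the two critical severities."""
    return severity in (1, 2)


for sev, sop in sorted(SOPS.items()):
    print(f"sev{sev}: notify {sop.who} via {', '.join(sop.how)}; page={should_page(sev)}")
```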


Page 23: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

• Sev1 incidents

  • Rare

  • Rarely auto-generated

  • Frequently start as sev2 incidents that are upgraded to sev1

Page 24: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

• Sev2 incidents

  • More common

  • Mostly auto-generated

Page 25: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

• Sev3 incidents

  • Non-critical incidents

  • Can be auto-generated

  • Can also be manually generated

Page 26: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

• Severities can be downgraded or upgraded

  • ex. sev2 ➞ sev1 (problem got worse)

  • ex. sev1 ➞ sev2 (problem was partially fixed)

  • ex. sev2 ➞ sev3 (critical problem was fixed but we still need to investigate root cause)
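One way to model upgrades and downgrades is to keep the current severity on the incident and record every change with its reason. The `Incident` class below is a hypothetical sketch, not a description of any particular tool.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Incident:
    title: str
    severity: int  # 1 (worst) to 4
    history: List[Tuple[int, int, str]] = field(default_factory=list)

    def change_severity(self, new_severity: int, reason: str) -> None:
        """Upgrade or downgrade the incident, keeping an audit trail."""
        if not 1 <= new_severity <= 4:
            raise ValueError("severity must be between 1 and 4")
        self.history.append((self.severity, new_severity, reason))
        self.severity = new_severity


incident = Incident("Checkout errors", severity=2)
incident.change_severity(1, "problem got worse")
incident.change_severity(2, "problem was partially fixed")
incident.change_severity(3, "critical problem fixed, root cause still open")
print(incident.severity, incident.history)
```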


Page 27: Nagios Conference 2012 - Alex Solomon - Managing Your Heros


One more best-practice:

Alert before your systems fail completely
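In Nagios terms this usually means setting a WARNING threshold well below the CRITICAL one. A minimal sketch, using the standard Nagios plugin return codes (0 = OK, 1 = WARNING, 2 = CRITICAL); the 80% and 95% disk thresholds and the function names are assumptions chosen for illustration.

```python
import shutil

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(percent_used: float, warn: float = 80.0, crit: float = 95.0) -> int:
    """Map a usage percentage to a Nagios state (thresholds are hypothetical)."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def disk_percent_used(path: str = "/") -> float:
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


state = classify(disk_percent_used())
print(("OK", "WARNING", "CRITICAL")[state])
```

A real plugin would also exit with the state as its process exit code so that Nagios can pick it up.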

Page 28: Nagios Conference 2012 - Alex Solomon - Managing Your Heros


Main benefit of severities

Only page on critical issues (sev1 or 2)

Page 29: Nagios Conference 2012 - Alex Solomon - Managing Your Heros


Preserve sanity

Page 30: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

Avoid “Peter and the wolf” scenarios

Page 31: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

ON-CALL BEST PRACTICES

Person Level

Team Level

Page 32: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

ON-CALL AT THE PERSON LEVEL


Cellphone

Page 33: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

Cellphone AND smart phone (not OR)

Page 34: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

4G / 3G internet

(don’t forget your laptop)

• 4G hotspot

• 4G USB modem

• 3G/4G tethering

Page 35: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

Page multiple times until you respond

• Time zero: email and SMS

• 1 min later: phone-call on cell

• 5 min later: phone-call on cell

• 5 min later: phone-call on landline

• 5 min later: phone-call to girlfriend
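The notification steps on this slide are relative delays, so a small sketch can turn them into a schedule measured from the moment the alert fires. The names `STEPS`, `schedule` and `due_actions` are hypothetical; the delays and actions are the ones listed on the slide.

```python
from itertools import accumulate

# (minutes after the previous step, action), as listed on the slide.
STEPS = [
    (0, "email and SMS"),
    (1, "phone call on cell"),
    (5, "phone call on cell"),
    (5, "phone call on landline"),
    (5, "phone call to girlfriend"),
]


def schedule(steps):
    """Return (minutes since the alert fired, action) pairs."""
    offsets = accumulate(delay for delay, _ in steps)
    return [(offset, action) for offset, (_, action) in zip(offsets, steps)]


def due_actions(steps, minutes_elapsed, acknowledged):
    """Actions that should have fired by now; none once the page is acknowledged."""
    if acknowledged:
        return []
    return [action for offset, action in schedule(steps) if offset <= minutes_elapsed]


for offset, action in schedule(STEPS):
    print(f"t+{offset:>2} min: {action}")
```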

Page 36: Nagios Conference 2012 - Alex Solomon - Managing Your Heros


Bonus: vibrating bluetooth bracelet

Page 37: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

ON-CALL AT THE TEAM LEVEL

Do not send alerts to the entire team (or do so only rarely)

• sev1: OK

• sev2: NO

Page 38: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

On-call schedules:

• Simple rotation-based schedule

  • ex. weekly - everyone is on-call for a week at a time

• Set up a follow-the-sun schedule

  • people in multiple timezones

  • no night-shifts
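Both schedule types reduce to a lookup. The sketch below assumes a hypothetical three-person team for the weekly rotation and three hypothetical sites, each covering an 8-hour window in UTC that falls in local daytime, for follow-the-sun.

```python
from datetime import date, datetime, timezone


def weekly_on_call(team, day, rotation_start):
    """Simple rotation: each person is on call for one week, in list order."""
    weeks_elapsed = (day - rotation_start).days // 7
    return team[weeks_elapsed % len(team)]


# Hypothetical follow-the-sun coverage: (start hour UTC, end hour UTC, site).
FOLLOW_THE_SUN = [
    (0, 8, "Sydney"),
    (8, 16, "London"),
    (16, 24, "San Francisco"),
]


def site_on_call(now_utc):
    """Return the site whose daytime window contains the given UTC time."""
    for start_hour, end_hour, site in FOLLOW_THE_SUN:
        if start_hour <= now_utc.hour < end_hour:
            return site
    raise ValueError("coverage windows do not span the full day")


team = ["alice", "bob", "carol"]
start = date(2012, 9, 24)
print(weekly_on_call(team, date(2012, 10, 3), start))
print(site_on_call(datetime(2012, 9, 25, 20, 0, tzinfo=timezone.utc)))
```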

Page 39: Nagios Conference 2012 - Alex Solomon - Managing Your Heros


What happens if the on-call person doesn’t respond at all?

Page 40: Nagios Conference 2012 - Alex Solomon - Managing Your Heros


If you care about uptime, you need redundancy in your on-call.

Page 41: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

Set up multiple on-call levels with automatic escalation between them:

Level 1: Primary on-call

➞ Escalate after 15 min

Level 2: Secondary on-call

➞ Escalate after 20 min

Level 3: Team on-call (alert entire team)

Page 42: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

Best Practice: Put management in the on-call chain

Level 1: Primary on-call

➞ Escalate after 15 min

Level 2: Secondary on-call

➞ Escalate after 20 min

Level 3: Team on-call (alert entire team)

➞ Escalate after 20 min

Level 4: Manager / Director
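A sketch of the four-level chain from this slide as data plus a lookup. It assumes each timeout starts when its own level is alerted, so the levels are reached at 0, 15, 35 and 55 minutes without a response; the slide does not say whether timeouts are cumulative, and `CHAIN` and `notified_levels` are hypothetical names.

```python
# (level name, minutes to wait before escalating; None for the last level).
CHAIN = [
    ("Primary on-call", 15),
    ("Secondary on-call", 20),
    ("Team on-call (entire team)", 20),
    ("Manager / Director", None),
]


def notified_levels(chain, minutes_unacknowledged):
    """Levels alerted so far for an incident nobody has responded to."""
    notified = []
    threshold = 0  # minutes at which the current level gets alerted
    for name, wait in chain:
        if minutes_unacknowledged < threshold:
            break
        notified.append(name)
        if wait is None:
            break
        threshold += wait
    return notified


for minutes in (0, 15, 35, 55):
    print(minutes, notified_levels(CHAIN, minutes))
```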

Page 43: Nagios Conference 2012 - Alex Solomon - Managing Your Heros


Best Practice: put software engineers in the on-call chain

• Devops model

• Devs need to own the systems they write

• Getting paged provides a strong incentive to engineer better systems

Page 44: Nagios Conference 2012 - Alex Solomon - Managing Your Heros


Best Practice: measure on-call performance

• Measure: mean-time-to-response

• Measure: % of issues that were escalated

• Set up policies to encourage good performance

  • Put managers in on-call chain

  • Pay people extra to do on-call

“You can’t improve what you don’t measure.”
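Both measurements can be derived from an alert log. A minimal sketch with made-up log entries; the field names `fired`, `responded` and `escalated` are assumptions about what such a log might contain.

```python
from datetime import datetime, timedelta

# Hypothetical alert log: when the alert fired, when an engineer responded,
# and whether the alert had to be escalated past the primary on-call.
alerts = [
    {"fired": datetime(2012, 9, 25, 2, 0), "responded": datetime(2012, 9, 25, 2, 4), "escalated": False},
    {"fired": datetime(2012, 9, 26, 3, 0), "responded": datetime(2012, 9, 26, 3, 20), "escalated": True},
    {"fired": datetime(2012, 9, 27, 14, 0), "responded": datetime(2012, 9, 27, 14, 6), "escalated": False},
    {"fired": datetime(2012, 9, 28, 9, 0), "responded": datetime(2012, 9, 28, 9, 2), "escalated": False},
]


def mean_time_to_response(alerts):
    """Average of (responded - fired) across all alerts."""
    total = sum((a["responded"] - a["fired"] for a in alerts), timedelta())
    return total / len(alerts)


def escalation_rate(alerts):
    """Percentage of alerts that were escalated."""
    return 100.0 * sum(a["escalated"] for a in alerts) / len(alerts)


print("Mean time to response:", mean_time_to_response(alerts))
print("Escalated: %.1f%%" % escalation_rate(alerts))
```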

Page 45: Nagios Conference 2012 - Alex Solomon - Managing Your Heros


Network Operations Center

Page 46: Nagios Conference 2012 - Alex Solomon - Managing Your Heros


NOC with lots of Nagios goodness

Page 47: Nagios Conference 2012 - Alex Solomon - Managing Your Heros


NOCs:

• Reduce the mean-time-to-response drastically

• Expensive (staffed 24x7 with multiple people)

• Train NOC staff to fix a good percentage of issues

• As you scale your org, you may want a hybrid on-call approach (where NOC handles some issues, teams handle other issues directly)

Page 48: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

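Critical Incident Timeline

Issue is detected ➞ Engineer starts working on issue ➞ Issue is fixed

Phases: Alert ➞ Investigate ➞ Fix

RESPONSE TIME: from "Issue is detected" to "Engineer starts working on issue" (the Alert phase)

RESOLUTION TIME: from "Issue is detected" to "Issue is fixed" (Alert, Investigate and Fix)

Breakdown of the Alert phase:

Issue is detected ➞ Alerting system gets ahold of somebody ➞ Engineer is aware of issue ➞ Engineer gets to a computer, connects to internet ➞ Engineer starts working on issue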

Page 49: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

Alert

Issue is detected ➞ Alerting system gets ahold of somebody ➞ Engineer is aware of issue ➞ Engineer gets to a computer, connects to internet ➞ Engineer starts working on issue

How to minimize the time it takes the alerting system to get ahold of somebody:

• Alert via phone & SMS

• Alert multiple times via multiple channels

• Failing that, escalate!

• Failing that, escalate to manager!

How to minimize the time it takes the engineer to get to a computer and connect:

• Carry 4G internet device + laptop at all times

• Set loud ringtone at night

Page 50: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

RESEARCH & FIX


Page 51: Nagios Conference 2012 - Alex Solomon - Managing Your Heros


Investigate Fix

How do we reduce the amount of time needed to investigate and fix?

Page 52: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

Set up an Emergency Ops Guide:

• When you encounter a new failure, document it in the Guide

• Document symptoms, research steps, fixes

• Use a wiki

Page 53: Nagios Conference 2012 - Alex Solomon - Managing Your Heros


Page 54: Nagios Conference 2012 - Alex Solomon - Managing Your Heros


Page 55: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

Automate fixes

or

Add more fault tolerance

Page 56: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

You need the right tools:

• Tools to help you diagnose problems faster

  • Comprehensive monitoring, metrics and dashboards

  • Tools that help search for problems in log files quickly (e.g. Splunk)

• Tools to help your team communicate efficiently

  • Voice: Conference bridge, Skype, Google Hangout

  • Chat: HipChat, Campfire

Page 57: Nagios Conference 2012 - Alex Solomon - Managing Your Heros


Best Practice: Incident Commander

Page 58: Nagios Conference 2012 - Alex Solomon - Managing Your Heros

Incident Commander:

• Essential for dealing with sev1 issues

• In charge of the situation

  • Provides leadership, prevents analysis paralysis

  • He/she directs people to do things

  • Helps save time making decisions