Leveling Up Monitoring:

A Decade of Automating and Scaling Nagios

Katherine Daniels and Laurie Denness

@beerops - @lozzd Velocity 2016

Katherine Daniels @beerops
Senior Operations Engineer, Etsy
Co-Author of Effective DevOps

Laurie Denness @lozzd
Staff Operations Engineer, Etsy
Official Graph Enthusiast

Agenda

1. In The Beginning...
2. Automation
3. Deployinator
4. Scaling + Tooling

About Etsy

25M Active Buyers
1.6M Active Sellers
$2.39B 2015 Annual GMS
(As of March 31, 2016)

Monitoring!

bit.ly/yaynagios

https://kartar.net/2015/08/monitoring-survey-2015---tools/

In The Beginning

Sometimes your statement needs emphasis with a black background.

LESSONS LEARNED:

Templates are awesome.

define service {
    use                  generic-service
    hostgroups           Linux_hosts,!email-only-servers
    service_description  SSH
    check_command        check_ssh
}

define service {
    use             disk-space-service
    hostgroup_name  email-only-servers
    contact_groups  ops_nonurgent
}
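
The service definitions on these slides select hosts with hostgroup expressions, where a leading `!` excludes a group. The following is a minimal Python sketch of how such an expression could resolve to a host list. It is a simplification for illustration, not Nagios's actual resolution code, and the hostgroup membership in `HOSTGROUPS` is hypothetical.

```python
def resolve_hostgroups(expression, hostgroups):
    """Expand an expression like 'groupA,!groupB' into a sorted host list."""
    included, excluded = set(), set()
    for term in expression.split(","):
        term = term.strip()
        if not term:
            continue
        if term.startswith("!"):
            # A leading "!" removes that group's members from the result.
            excluded |= set(hostgroups[term[1:]])
        else:
            included |= set(hostgroups[term])
    return sorted(included - excluded)


# Hypothetical membership, for illustration only.
HOSTGROUPS = {
    "Linux_hosts": ["db01", "mail01", "web01"],
    "email-only-servers": ["mail01"],
}

ssh_hosts = resolve_hostgroups("Linux_hosts,!email-only-servers", HOSTGROUPS)
print(ssh_hosts)
```

With this membership, the SSH check applies to every Linux host except `mail01`, which is instead covered by the definition that targets `email-only-servers` with its own contact group.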

LESSONS LEARNED:

Start small.

Nagios and Chef

LESSONS LEARNED:

Automation is awesome!

HA HA JUST KIDDING

LESSONS LEARNED:

Trust but verify.

How Many Repos?

LESSONS LEARNED:

?!?!?!?!??!?!

LESSONS LEARNED:

Try, fail, learn, and try again.

Problems

• Four git repos, inconsistent mess, duplication

• Broken semi-useful automation - need to regain trust

• Some shared config, some unique

• Gain confidence in changes

• Stop editing on the production box

Nagios and Chef and Deployinator!

Solution 1:

Merge everything: find and remove duplication, shared configs

Thanks Murphy!

Super Secret Option!!!

Solution 2:

Using Jenkins CI to test changes before production
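
One way a CI job could test a change before it reaches production is to run Nagios's built-in pre-flight check, `nagios -v <main config>`, against the checked-out config and fail the build on a non-zero exit. The sketch below assumes that approach. The slides do not show the actual Jenkins job, so the function name, the `nagios_bin` default, and the injectable `run` parameter are illustrative assumptions.

```python
import subprocess


def verify_config(main_cfg, nagios_bin="nagios", run=subprocess.run):
    """Run the Nagios pre-flight check and return (ok, combined_output).

    `run` is injectable so the function can be exercised without a
    Nagios binary installed (an assumption made for testability).
    """
    result = run([nagios_bin, "-v", main_cfg], capture_output=True, text=True)
    return result.returncode == 0, (result.stdout or "") + (result.stderr or "")


# Example use inside a CI step (path is hypothetical):
#   ok, output = verify_config("checkout/nagios.cfg")
#   if not ok:
#       raise SystemExit(output)
```

Keeping the check in a small function makes it easy to reuse the same validation later in the deploy path, not only in CI.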

Solution 3:

Use Deployinator to run Chef recipe to generate automated configs

Solution 4:

Use Deployinator to rsync config to all boxes

• git pull repo on deploy host

• Run Chef recipe to add automated pieces

• Re-run the try-nagios script against that

• rsync copy from deploy box to Nagios hosts

• Create symlink for nagios.cfg

• Restart Nagios
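
The steps above can be sketched as an ordered, dry-run command plan. This is an illustration under stated assumptions, not Deployinator's implementation: all paths, hostnames, and the Chef recipe name are hypothetical, the try-nagios step is represented by `nagios -v` because the script's interface is not shown in the slides, and running the last three steps host by host is also an assumption.

```python
import shlex


def deploy_plan(repo_dir, release_dir, nagios_hosts,
                live_cfg="/etc/nagios/nagios.cfg",
                recipe="recipe[nagios_config]"):
    """Return the ordered commands for one deploy, as argv lists."""
    plan = [
        # 1. Update the config repo on the deploy host.
        ["git", "-C", repo_dir, "pull", "--ff-only"],
        # 2. Run the Chef recipe that adds the automated pieces
        #    (recipe name is hypothetical).
        ["chef-client", "--override-runlist", recipe],
        # 3. Re-validate the merged result (stand-in for try-nagios).
        ["nagios", "-v", f"{repo_dir}/nagios.cfg"],
    ]
    for host in nagios_hosts:
        plan += [
            # 4. Copy the config from the deploy box to the Nagios host.
            ["rsync", "-a", "--delete", f"{repo_dir}/", f"{host}:{release_dir}/"],
            # 5. Point the live nagios.cfg at the new copy.
            ["ssh", host, "ln", "-sfn", f"{release_dir}/nagios.cfg", live_cfg],
            # 6. Restart Nagios to pick up the change.
            ["ssh", host, "sudo", "service", "nagios", "restart"],
        ]
    return plan


plan = deploy_plan("/srv/deploy/nagios", "/etc/nagios/release",
                   ["nagios01.example.com", "nagios02.example.com"])
for argv in plan:
    print(shlex.join(argv))
```

Building the plan as data first means it can be printed for review or logged before anything is executed.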

LESSONS LEARNED:

Use the tools you have.

Scaling things up!

Core Workers

LESSONS LEARNED:

If at first you don’t succeed, rub some webscale on it.

Iterating and Iterating

LESSONS LEARNED:

Iterate

Iterate

Iterate

To Infinity and Beyond

http://github.com/etsy/opsweekly

Final Lessons Learned

• Templates are awesome

• Start small

• Automation is awesome

• Trust but verify

• Learn from (y)our mistakes

• Iterate on the tools you have

Open Source Summary

• http://github.com/etsy/deployinator

• http://github.com/etsy/pushbot

• http://github.com/etsy/trylib

• http://github.com/etsy/opsweekly

• http://github.com/etsy/nagios-herald

• http://github.com/RJ/irccat

THANK YOU!