As the Yelp infrastructure and engineering team grew, so did the pain of managing Nagios. Problems like splitting alerting across multiple teams, providing high availability, and managing Nagios systems in multiple environments had become pressing. As we moved towards a service-oriented architecture and pushed some services out into the cloud, we quickly needed more automated monitoring configuration. An evolutionary solution wasn’t going to solve all of our problems; we needed to revolutionize our monitoring. Sensu is built from the ground up to solve many of our issues and to be easy to extend. This talk covers our Puppet ‘monitoring_check’ API (which sets up monitoring for our services within Puppet), how and why we deploy Sensu, our custom handlers and escalations, how we provide automatic ‘self service’ monitoring for dynamic services, and how we deal with the challenges posed by the more ephemeral nature of cloud architectures.
Sensu and Sensibility
Tomas Doran @bobtfish 2014-09-23
Cycle of failure and disappointment
• Manually edited and deployed monitoring
• Changes require two teams
• Low developer visibility about production
• Escalation of issues is hard
• Ops ignore alerts from services
• Postmortems
• High friction, low trust, low visibility.
“Normality”
Source: http://gunshowcomic.com/648
This is dysfunctional
Sensibility
“51 % viewed their ERP implementation as unsuccessful”
The Robbins-Gioia Survey (2001)
“40 % of the projects failed to achieve their business case within one year of going live”
The Conference Board Survey (2001)
• “17 percent of large IT projects go so badly that they can threaten the very existence of the company”
• “On average, large IT projects run 45 percent over budget and 7 percent over time, while delivering 56 percent less value than predicted”
McKinsey & Company in conjunction with the University of Oxford (2012)
Failure is an option
Source: blog.parasoft.com/single-greatest-barrier-with-sw-delivery
Why Sensu?
• Designed to be pluggable / extensible
• Arbitrary check metadata
• Simple model
• Components do exactly one thing
• Ruby
• Not afraid to extend (or fork!)
‘industry standard’ ‘enterprise class’
Cheap shot
[Diagram labels: status.dat, cmd.dat]
Centralized
How we use Sensu
• Don’t use all of this!
• ‘Standalone’ checks only
• Default in the puppet module
Sensu data flow
• Sensu client runs checks on each machine
• Pushes results to RabbitMQ
• Clustered, clients/messages will fail over.
• Sensu server (multiple, HA)
• Processes check results, invokes handlers
• Writes state to Redis
• Redis + Sentinel
• Read by API (2 instances)
• All layers behind HAProxy
Quis custodiet ipsos custodes?
“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”
Mutually assured monitoring
• Multiple independent Sensu installs (per-datacenter)
• Monitor each other!
Machine readable config
• /etc/sensu/conf.d/checks/check_name.json
• Extensible with arbitrary metadata
• Hash merge
• Never edit by hand!
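As a rough illustration of the "hash merge" bullet, the sketch below deep-merges several JSON fragments into one configuration. The function names, the sample fragments, and the rule that a later fragment wins for scalars and lists are assumptions made for illustration; Sensu's own merge behaviour may differ in detail.

```python
import json


def deep_merge(base, override):
    """Return a new dict with `override` merged recursively into `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            # Both sides are hashes: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Assumption: for scalars and lists the later fragment wins.
            merged[key] = value
    return merged


def load_fragments(fragments):
    """Merge an ordered list of JSON strings (stand-ins for conf.d files)."""
    config = {}
    for text in fragments:
        config = deep_merge(config, json.loads(text))
    return config


# Two hypothetical fragments, as if read from /etc/sensu/conf.d/checks/*.json
FRAGMENT_A = '{"checks": {"check_disk": {"command": "check_disk -w 20%", "interval": 60}}}'
FRAGMENT_B = '{"checks": {"check_disk": {"team": "operations"}, "check_load": {"interval": 300}}}'
```

Calling `load_fragments([FRAGMENT_A, FRAGMENT_B])` yields a single `checks` hash in which `check_disk` carries both the original command and the extra `team` metadata.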
monitoring_check
monitoring_check { 'systems-apache-external':
  page          => true,
  command       => "/usr/lib/nagios/plugins/check_tcp -H ${external_ip_address} -p 443",
  check_every   => '5m',
  alert_after   => '30m',
  realert_every => 10,
  runbook       => 'y/apache',
}
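The example above gives `check_every` and `alert_after` as human-readable strings such as '5m' and '30m', while the generated check JSON uses plain numbers for `interval`. The slides do not show how that conversion is done, so the following is only a sketch of one plausible approach; the function name `to_seconds` and the accepted suffixes (s, m, h, d) are assumptions.

```python
import re

# Assumed unit suffixes; the slides only show minutes ('5m', '30m').
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(value):
    """Convert '5m', '30m', '2h', '45' or a plain integer to seconds."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"(\d+)([smhd]?)", value.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    number, unit = match.groups()
    # A missing suffix is treated as seconds.
    return int(number) * _UNITS.get(unit, 1)
```

Under these assumptions, `to_seconds('5m')` gives 300 and `to_seconds('30m')` gives 1800.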
sensu::check
• monitoring_check wraps this
• Writes a JSON file for each check
• Comment safe
"disk_ro_mounts": {
  "standalone": true,
  "handlers": ["default"],
  "subscribers": [],
  "command": "/usr/lib/nagios/plugins/yelp/check_ro_mounts",
  "interval": 60,
  "alert_after": 0,
  "realert_every": "-1",
  "dependencies": [],
  "runbook": "http://lmgtfy.com/?q=linux+read+only+disk",
  "annotation": "https://gitweb.yelpcorp.com/?p=puppet.git;a=blob;f=modules/profile/manifests/server.pp#l80",
  "team": "operations",
  "irc_channels": "operations-notifications",
  "notification_email": "undef",
  "ticket": true,
  "project": "OPS",
  "page": false,
  "tip": false
}
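The `alert_after` and `realert_every` fields suggest that a handler decides, per event, whether a notification is due. The slides do not spell out that logic, so the sketch below is an assumption throughout: it treats `alert_after` as a grace period in seconds, `realert_every` as "notify on every Nth failure after the grace period", and takes `occurrences` to be the count of consecutive failing results. It deliberately does not cover the `"-1"` value shown above, whose meaning is not explained in the slides.

```python
def should_notify(occurrences, interval, alert_after, realert_every):
    """Decide whether this failing result should produce a notification.

    occurrences   -- consecutive failing results so far (1 for the first)
    interval      -- seconds between check runs
    alert_after   -- assumed grace period in seconds before the first alert
    realert_every -- assumed: notify on every Nth failure past the grace period
    """
    if realert_every < 1:
        raise ValueError("this sketch only covers realert_every >= 1")
    # Failing runs that fit inside the grace period are swallowed.
    grace_runs = alert_after // interval if interval > 0 else 0
    failures_past_grace = occurrences - grace_runs
    if failures_past_grace < 1:
        return False
    return (failures_past_grace - 1) % realert_every == 0
```

With a 300-second interval, an 1800-second `alert_after`, and `realert_every` of 10, this sketch stays quiet for the first six failures, notifies on the seventh, and then again on the seventeenth.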
Check scripts
• Same as nagios checks
• Simple (text) output
• Exit code
• Result sent to server, along with check definition
• Including all the custom metadata
• Our handlers use the extra data.
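To make the check contract concrete, here is a minimal sketch of a check in the Nagios plugin style: one line of text output plus an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The disk-usage subject, the thresholds, and the function names are illustrative assumptions and are not taken from the talk.

```python
import shutil

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(used_percent, warn=80.0, crit=90.0):
    """Map a disk usage percentage to (exit_code, one-line message)."""
    if used_percent >= crit:
        return CRITICAL, f"CRITICAL: disk {used_percent:.1f}% used"
    if used_percent >= warn:
        return WARNING, f"WARNING: disk {used_percent:.1f}% used"
    return OK, f"OK: disk {used_percent:.1f}% used"


def main(path="/"):
    """Print the status line and return the exit code for `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"UNKNOWN: cannot inspect {path}: {exc}")
        return UNKNOWN
    if usage.total == 0:
        print(f"UNKNOWN: {path} reports zero capacity")
        return UNKNOWN
    status, message = evaluate(100.0 * usage.used / usage.total)
    print(message)
    return status


# In a real check script the last line would be: sys.exit(main())
```

Because the contract is only text plus an exit code, existing Nagios plugins can be reused unchanged, as the slide notes.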
Handlers
• base
• JIRA
• email
• irc
• pagerduty
• awsprune
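The following sketch shows how a handler might use the custom check metadata (`page`, `ticket`, `project`, `irc_channels`, `notification_email`, `team`) to pick notification targets. The field names come from the check JSON shown earlier; the routing rules, the function names, and the returned tuples are assumptions for illustration and are not the actual Yelp handlers, which are written in Ruby.

```python
import json


def route(event):
    """Return a list of (destination, target) pairs for a Sensu-style event."""
    check = event.get("check", {})
    actions = []
    if check.get("page"):
        actions.append(("pagerduty", check.get("team")))
    if check.get("ticket"):
        actions.append(("jira", check.get("project")))
    channels = check.get("irc_channels")
    if channels:
        actions.append(("irc", channels))
    email = check.get("notification_email")
    # The sample check carries the literal string "undef" when no address is set.
    if email and email != "undef":
        actions.append(("email", email))
    return actions


def handle(raw_event):
    """Parse the event JSON (e.g. as read from a handler's stdin) and route it."""
    return route(json.loads(raw_event))
```

For the `disk_ro_mounts` check above, this sketch would select a JIRA ticket in project OPS and a message to the operations-notifications IRC channel, with no page and no email.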
How do checks get run?
• Every machine runs the client.
• Client managed by Puppet
• Client has a TCP socket you can send JSON to
• Custom checks + pysensu-yelp
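As a sketch of sending a result to the client's TCP socket, the code below builds a JSON result and writes it to a local port using only the standard library. The default address 127.0.0.1:3030 is the usual Sensu client socket default, but treat it, the function names, and the choice of extra metadata as assumptions to check against your own setup. The pysensu-yelp library wraps this kind of call; its API is not shown here.

```python
import json
import socket


def build_result(name, status, output, **metadata):
    """Build a check result; extra keyword arguments become custom metadata."""
    if status not in (0, 1, 2, 3):
        raise ValueError("status must be 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN)")
    result = dict(metadata)
    # Core fields are set last so that metadata cannot overwrite them.
    result.update({"name": name, "status": status, "output": output})
    return result


def send_result(result, host="127.0.0.1", port=3030, timeout=5.0):
    """Send one JSON-encoded result to the local client socket."""
    payload = json.dumps(result).encode("utf-8")
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
```

A service could then report its own health with something like `send_result(build_result("my_service_queue", 1, "WARNING: queue depth 1200", team="operations"))`, where the check name and the `team` value are hypothetical.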
Situational awareness
Single source of truth
• DNS is canonical for Sensu servers
• Configure things in one place!
Automatic monitoring
• E.g. cron jobs - check successful recently!
• cron::d
Generate monitoring_check
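One way to check that a cron job "succeeded recently" is a freshness check on a marker file that the job touches when it finishes successfully. The slides do not say how success is recorded, so the marker file, the function name, and the parameters below are assumptions for illustration.

```python
import os
import time


def check_recent_success(stamp_path, max_age_seconds, now=None):
    """Return (exit_code, message) based on the age of a success marker file."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(stamp_path)
    except OSError:
        return 2, f"CRITICAL: {stamp_path} not found; no recorded success"
    if age > max_age_seconds:
        return 2, f"CRITICAL: last success {int(age)}s ago (limit {max_age_seconds}s)"
    return 0, f"OK: last success {int(age)}s ago"
```

A generated check for a job that runs every hour might call this with a limit somewhat above 3600 seconds, so that a single late run does not alert.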
User specified monitoring
• Data lives in the service config
• Next to the code to emit metrics
• Next to metadata about SLAs and LB timeouts
• Simple checks for free!
• Developers can push without Ops
Cluster checks
• We’re working on this currently
• Assert some % of machines are healthy.
• Use to reduce alert noise.
• If a service becomes fully unavailable to clients, you want to page someone.
• If one machine goes belly up, you don’t (make a JIRA ticket to handle it later!)
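The idea of asserting that some percentage of machines is healthy can be sketched as a small aggregation over per-host statuses. Since the talk describes cluster checks as still in progress, everything here (the function name, the input shape, and the default 50% threshold) is an assumption for illustration.

```python
def cluster_status(host_statuses, min_healthy_fraction=0.5):
    """Aggregate per-host exit codes into one cluster-level (code, message).

    host_statuses maps host name -> latest exit code (0 means healthy).
    """
    if not host_statuses:
        return 3, "UNKNOWN: no hosts reported"
    total = len(host_statuses)
    healthy = sum(1 for status in host_statuses.values() if status == 0)
    summary = f"{healthy}/{total} hosts healthy"
    if healthy / total < min_healthy_fraction:
        return 2, "CRITICAL: " + summary
    return 0, "OK: " + summary
```

With this shape, one failing machine out of three leaves the cluster check OK (a ticket-level problem), while two failing machines out of three turn it CRITICAL (a paging-level problem).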
WIP
• This is all still a work in progress.
• We have not yet fully migrated off Nagios.
• We are open sourcing the pieces.
Thanks!
• Slides will be online shortly:
• slideshare.net/bobtfish
• @bobtfish
• Some (most?) of our code is open source:
• https://github.com/Yelp/sensu/commit/aa5c43c2fdfde5e8739952c0b8082000934f3ad2
• https://github.com/Yelp/puppet-monitoring_check
• https://github.com/Yelp/puppet-netstdlib
• https://github.com/Yelp/sensu_handlers
• https://github.com/Yelp/pysensu-yelp