Nagios Conference 2013 - James Clark - Nagios On-Call Rotation

DESCRIPTION

James Clark's presentation on Nagios On-Call Rotation. The presentation was given during the Nagios World Conference North America held Sept 20-Oct 2nd, 2013 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna

Nagios On-Call Rotation
James Clark

Topics Discussed

About Me / My Monitoring History

Monitoring History at Current Company

Prerequisites

Current Company Setup

Scripts

Nagios Configuration

About Me

Have been in the IT industry since 1988

In 2004 became server group manager

Have been using Nagios since ~2003

Switched to XI ~2010 (And loved every part of it)

Changed jobs in August 2012 and quickly convinced new company to purchase XI

About Me

Private web page is http://www.bandits-home-on-the-web.com

On that page you will find some of the Nagios modifications I have done

History of Monitoring and Alerting at new job

Many monitoring applications spread throughout the IT department:

  CCSS for iSeries
  Foglight for DB
  SCOM for Windows
  Three separate Nagios Core servers
  IBM NetCool

Many departments had no monitoring

All of the applications forward to NetCool, which then forwards alerts to AlarmPoint (xMatters)

AlarmPoint holds the on-call schedule for the many different groups in the IT department

History of Monitoring and Alerting at new job

CCSS for iSeries: Partial conversion to XI started

Foglight for DB: Hoping for a complete conversion to XI

SCOM for Windows: Will either develop a custom script to communicate back and forth or use WMI, which is currently being tested and is the preferred option

Three separate Nagios Core servers: Converted to a single XI server

History of Monitoring and Alerting at new job

IBM NetCool: Being removed from the company

AlarmPoint: More than likely being removed from the company

One primary XI server currently with 3 mod_gearman workers

One XI server for monitoring primary XI and a few other devices

One XI server in our DR web data center

History of Monitoring and Alerting at new job

Besides AlarmPoint, the on-call schedule is kept in a separate MS SharePoint site that DC Operations uses.

There is no full-time administrator for either NetCool or AlarmPoint.

Once everything has been switched to Nagios XI, significant savings will be realized.

One of the main hurdles to the switch is on-call rotation for alerting.

On Call Data - Prerequisites

On-call information stored in some application

On-call information able to be exported from the application in a specific format

A job scheduler to run the jobs

On Call Data – Our Setup

SharePoint site to store on-call schedule

SharePoint admin created an application to export the data needed and send the files to an FTP server.

Two files are sent, one for primary and one for secondary.

We use Control-M to schedule the above program and the two Linux scripts.

The job is run daily at 8am. Our on-call changes Mondays at 8am.

If changes are made to the on-call schedule that need to take effect immediately, the job is run manually. Otherwise, it can wait until the next day at 8am.

On Call Data – Our Setup

Added ID to contacts table.

Added short name to On-Call Groups table.

Set the SharePoint site to alert me when any changes are made to those two tables so they can be mirrored in Nagios.

The scripts do handle blanks. This will be shown in a later slide.

On Call Data – Example files

Networking,network,smithj
System p Administration,aix_admins,doej
AE Direct,aed_infra,user1
Database,dba,clarks
System i Administration,system_i_admin,walenciejs
Wintel Administration,wintel_admins,hilderbrandr
System i Applications,system_i_apps,brownr
Client Server Applications,client_server,yatesp
DataWarehouse/Enterprise Rpts,datawarehouse,connerys
Store Applications,store_apps,probstj

The first field is what is displayed on the SharePoint site and is the alias assigned in Nagios. The second field is the name given to the contact groups. The third field is of course the ID of the user.
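As a rough illustration of the three-field layout described above, the sketch below splits one exported line and rejects records with a blank field. The function name `parse_oncall_line` and the `|`-separated output are hypothetical and are not part of the presented scripts.

```shell
# Hypothetical helper: split one export line into its three fields.
# Prints "display name|contact group short name|user ID",
# or returns 1 if any of the three fields is blank.
parse_oncall_line() {
    local display short id
    IFS=',' read -r display short id <<EOF
$1
EOF
    if [ -z "$display" ] || [ -z "$short" ] || [ -z "$id" ]; then
        return 1
    fi
    printf '%s|%s|%s\n' "$display" "$short" "$id"
}

parse_oncall_line "Networking,network,smithj"
```

A line such as `DataWarehouse/Enterprise Rpts,datawarehouse,` (no user ID) would make the function return 1 and print nothing.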

Scripts:

On Call Data – FTP Script

#!/bin/bash
HOST=xxxxxxx    # FTP server's hostname or IP address
USER=xxxxxxx    # FTP user that has access to the server
PASS=xxxxxxx    # Password for the FTP user

# -i: no interactive prompting, -n: no auto-login, -v: verbose
ftp -inv "$HOST" << EOF
user $USER $PASS
cd /nagiosftp
get primaryOnCall.txt
get secondaryOnCall.txt
delete primaryOnCall.txt
delete secondaryOnCall.txt
bye
EOF
exit 0
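The FTP script always exits 0, so a failed transfer would not be noticed until the next step. One possible safeguard, sketched here as an assumption rather than something shown in the presentation, is to confirm that both files arrived and are non-empty before continuing. The function name `files_ready` is hypothetical.

```shell
# Hypothetical pre-check: succeed only if every named file exists
# and is non-empty; report the first file that fails the check.
files_ready() {
    local f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
    return 0
}

# Possible use after the FTP transfer:
#   files_ready primaryOnCall.txt secondaryOnCall.txt || exit 1
```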

On Call Data – Data Manipulation Script

#!/usr/bin/perl

# Remove old config files (files named xi*, esc*, or aed_* are kept)
system ("find /usr/local/nagios/etc/static -type f -not -name 'xi*' -not -name 'esc*' -not -name 'aed_*' | xargs rm -f");

# Process primary on-call file
open (INFILE, 'primaryOnCall.txt') or die $!;
while (<INFILE>) {
    chomp;
    ($group, $alias, $id) = split(",");
    # Skip records that have a blank field
    if (($alias ne '') && ($group ne '') && ($id ne '')) {
        open (OUTFILE, '>/usr/local/nagios/etc/static/' . $alias . '_oncall_pri.cfg') or die $!;
        print OUTFILE "define contactgroup{\n";
        print OUTFILE "contactgroup_name $alias" . "_oncall_pri\n";
        print OUTFILE "alias $group\n";
        print OUTFILE "members $id\n";
        print OUTFILE "}";
        close (OUTFILE);
    }
}
close (INFILE);

On Call Data – Data Manipulation Script (cont…)

# Process secondary on-call file
open (INFILE, 'secondaryOnCall.txt') or die $!;
while (<INFILE>) {
    chomp;
    ($group, $alias, $id) = split(",");
    # Skip records that have a blank field
    if (($alias ne '') && ($group ne '') && ($id ne '')) {
        open (OUTFILE, '>/usr/local/nagios/etc/static/' . $alias . '_oncall_sec.cfg') or die $!;
        print OUTFILE "define contactgroup{\n";
        print OUTFILE "contactgroup_name $alias" . "_oncall_sec\n";
        print OUTFILE "alias $group\n";
        print OUTFILE "members $id\n";
        print OUTFILE "}";
        close (OUTFILE);
    }
}
close (INFILE);

On Call Data – Data Manipulation Script (cont…)

# Change ownership and permissions of config files
system ("sudo /bin/chown apache:nagios /usr/local/nagios/etc/static/*.cfg");
system ("sudo /bin/chmod 777 /usr/local/nagios/etc/static/*.cfg");

# Delete data files
system ("rm primaryOnCall.txt");
system ("rm secondaryOnCall.txt");

# Restart Nagios
system ("sudo su -l nagios -c 'cd /usr/local/nagiosxi/scripts/ && ./reconfigure_nagios.sh'");

# Exit clean
exit 0;

On Call Data – List of Files Created

Due to a blank for secondary on-call in the file, only the primary file for datawarehouse exists.

On Call Data – Files Created – Example Content
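The slide's own example content is not preserved in this transcript. The sketch below derives what one generated file would contain from the `print` statements in the data manipulation script, using the `Networking,network,smithj` record from the example file. The function name `render_contactgroup` is hypothetical, and unlike the Perl script it ends the definition with a newline.

```shell
# Hypothetical helper mirroring the Perl script's output for one record.
# $1 = display name (becomes the alias)
# $2 = contact group short name
# $3 = user ID
# $4 = suffix, "pri" or "sec"
render_contactgroup() {
    printf 'define contactgroup{\n'
    printf 'contactgroup_name %s_oncall_%s\n' "$2" "$4"
    printf 'alias %s\n' "$1"
    printf 'members %s\n' "$3"
    printf '}\n'
}

render_contactgroup "Networking" "network" "smithj" "pri"
```

Under those assumptions, the output corresponds to the file `network_oncall_pri.cfg`, with contact group name `network_oncall_pri`, alias `Networking`, and the single member `smithj`.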

Nagios Configuration:

NagiosXI Configuration

No contacts or contact groups are assigned to the hosts or services, unless someone should always receive alerts (e.g. someone who needs to be alerted but is not a member of the specific on-call group).

Users receive permission to see hosts and services by having an escalation for them.

Escalations must be created for both hosts and services. Services do not inherit escalations the way they inherit notifications.

NagiosXI Configuration (cont…)

Escalations are created as static config files; otherwise Nagios would error on the empty contact groups.

All members of a group go into an ALL group. This is used to give users permissions.

The group manager goes into a BOSS group. This is used for alerting the manager after the on-call individuals fail to acknowledge an issue.

Static Configuration Example - Hosts

# network_oncall_pri: created by script
define hostescalation{
    hostgroup_name          network_oncall
    contact_groups          network_oncall_pri
    first_notification      1
    last_notification       0
    notification_interval   15
    }
# network_oncall_sec: created by script
define hostescalation{
    hostgroup_name          network_oncall
    contact_groups          network_oncall_sec
    first_notification      2
    last_notification       0
    notification_interval   15
    }
# network_boss: created in XI and manager of group assigned as member
define hostescalation{
    hostgroup_name          network_oncall
    contact_groups          network_boss
    first_notification      4
    last_notification       0
    notification_interval   15
    }
# network_all: created in XI and all members of group assigned as members
define hostescalation{
    hostgroup_name          network_oncall
    contact_groups          network_all
    first_notification      3
    last_notification       0
    notification_interval   15
    }
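To make the escalation order concrete, the sketch below estimates when each tier starts being notified. It assumes the problem stays unacknowledged, that the default `interval_length` of 60 seconds applies (so `notification_interval 15` means 15 minutes), and that notification N goes out roughly (N - 1) intervals after the first. The function name `tier_start_minutes` is hypothetical. Because `last_notification` is 0, each tier keeps receiving notifications once it has started.

```shell
# Approximate minutes after the first notification at which a tier
# starts being notified.
# $1 = first_notification, $2 = notification_interval in minutes
tier_start_minutes() {
    local first_notification=$1 interval=$2
    echo $(( (first_notification - 1) * interval ))
}

# Tiers and first_notification values from the host escalation example.
while read -r group first; do
    printf '%s starts at about %s min\n' "$group" "$(tier_start_minutes "$first" 15)"
done <<EOF
network_oncall_pri 1
network_oncall_sec 2
network_all 3
network_boss 4
EOF
```

Under these assumptions the primary is notified immediately, the secondary after about 15 minutes, the whole group after about 30 minutes, and the manager after about 45 minutes.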

Static Configuration Example - Services

define serviceescalation{
    hostgroup_name          network_oncall
    service_description     *
    contact_groups          network_oncall_pri
    first_notification      1
    last_notification       0
    notification_interval   15
    }
define serviceescalation{
    hostgroup_name          network_oncall
    service_description     *
    contact_groups          network_oncall_sec
    first_notification      2
    last_notification       0
    notification_interval   15
    }
define serviceescalation{
    hostgroup_name          network_oncall
    service_description     *
    contact_groups          network_all
    first_notification      3
    last_notification       0
    notification_interval   15
    }
define serviceescalation{
    hostgroup_name          network_oncall
    service_description     *
    contact_groups          network_boss
    first_notification      4
    last_notification       0
    notification_interval   15
    }

The way we set it up, the service escalations use the same hostgroup as the host escalations and a wildcard for the service description, so that all services are included.

This could get very complicated if different groups/individuals were needed on different services on the same host.
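Since every tier needs a matching host and service escalation, the pairs could in principle be generated rather than typed by hand. The sketch below is one way to do that and is an assumption, not something shown in the presentation. The function name `emit_escalation_pair` is hypothetical, and the interval is fixed at 15 to match the examples.

```shell
# Hypothetical generator: print a matching host and service escalation
# for one tier.
# $1 = hostgroup, $2 = contact group, $3 = first_notification
emit_escalation_pair() {
    cat <<EOF
define hostescalation{
    hostgroup_name          $1
    contact_groups          $2
    first_notification      $3
    last_notification       0
    notification_interval   15
    }
define serviceescalation{
    hostgroup_name          $1
    service_description     *
    contact_groups          $2
    first_notification      $3
    last_notification       0
    notification_interval   15
    }
EOF
}

emit_escalation_pair network_oncall network_oncall_pri 1
```

Calling it once per tier (primary, secondary, all, boss) would reproduce the host and service examples for a single group.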

Static Configuration Example - Services

define serviceescalation{
    host_name               *
    servicegroup_name       dba_oncall
    contact_groups          dba_oncall_pri,dba_oncall_sec
    first_notification      1
    last_notification       0
    notification_interval   15
    }
define serviceescalation{
    host_name               *
    servicegroup_name       dba_oncall
    contact_groups          dba
    first_notification      500
    last_notification       0
    notification_interval   15
    }

Static Configuration Example - Services

escalations_aed_serv.cfg

The service escalations can be as simple as on the last slide, or as complex as you can imagine. The attached file is a great example of the complexity that is possible.

Questions?

James Clark
Systems Monitoring Administrator
