OV203 course notes

© Opsera 2010 Commercial in Confidence

OV203: Advanced Opsview Configuration and Management
Ton Voon, Opsera
Web training
March 2010


Introduction

Who am I?

Who are you?
• Name
• What you do
• Experience with Opsview
• What you are most interested in

What will we be learning?

Ton Voon is the Product Architect for Opsview and is the main person in charge of the design and scope of Opsview. He has been involved in the development of Opsview since 2005.

The main documentation site is http://docs.opsview.org/doku.php?id=opsview-community


Aims of training course

Understand the advanced concepts of how Opsview monitors

Understand how a distributed Opsview system works

Understand how to add new custom plugins

Understand what backups are required for Opsview

Theory: Understanding the advanced concepts that will appear in Opsview

Distributed Opsview: Understand how it works and what limitations exist

Plugins: How to extend Opsview to monitor specific characteristics


Agenda

Advanced monitoring concepts

Distributed architecture

Modules

Custom plugins

ODW

API

Configuration files

Backup and recovery

Troubleshooting



Advanced concepts


Checks: active versus passive

Active: Run on a periodic basis

Passive: Result that arrives on demand
• Manually reset the state of a failure
• or can set a freshness interval to auto change state after a period of time

Active checks are "polling checks". Prefer active checks where possible, because once a problem is fixed the service automatically changes status at the next polling interval.

Examples of passive checks: a backup start/finish message; a link up or link down from an interface; or entries in a log file.

Passive checks need to be reset, otherwise the state stays the same. You can:
- submit a result via the contextual menu for the service. This sends the result to the slave (if appropriate), which processes it and sends the result back up
- configure the service check so that, after a defined interval, it automatically resets the state

You can also submit a result to test failure scenarios.


State changes

A Soft state is an initial failure state

A failure state will change to Hard if it is still failing after a certain number of checks

Based on Max Check Attempts

Can retry at different intervals

Aim: cater for temporary “glitches”

When a service transitions from an OK state to a failed state, the check attempt will be 1. This increments for each subsequent failure. When the max check attempts is reached, the state is considered a hard state.

Hostgroup Hierarchy always shows soft states, so it always reflects the most recent state information.

Beware of ignoring soft states - there may be a transient problem that needs resolving (usually load related).
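The attempt counting described above can be sketched in a few lines of Python. This is an illustration only, not Nagios or Opsview code: the function name classify_states and the boolean input format are assumptions made for the example.

```python
def classify_states(results, max_check_attempts):
    """Label each check result as OK, SOFT or HARD.

    results: sequence of booleans, True meaning the check passed.
    The attempt counter starts at 1 on the first failure and the state
    becomes HARD once it reaches max_check_attempts.
    """
    labels = []
    attempt = 0  # consecutive failures seen so far
    for passed in results:
        if passed:
            attempt = 0
            labels.append("OK")
            continue
        attempt = min(attempt + 1, max_check_attempts)
        labels.append("HARD" if attempt >= max_check_attempts else "SOFT")
    return labels


print(classify_states([True, False, False, False, True], 3))
```

With max check attempts set to 1, the first failure is already labelled HARD, which is why that setting gives immediate notifications.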


Notifications

Will send notifications on hard state changes

Can configure to receive emails, SMS or RSS/Atom feeds
• Emails: slave, RSS: master, SMS: configurable

Extra state: recovery

Checks continue to be run - will re-notify unless acknowledged or marked as downtime

Limitation: Notification from a slave will not necessarily have all information that the master has, since the parent/child topology will be different

Notifications are only sent on hard state changes, which means a delay is introduced. You can set max check attempts to 1 if you want to be notified straight away.

Opsview by default supports 3 different notification methods:
• Emails (this requires setting up a mail system on the Opsview server)
• SMS (needs either an SMS gateway or a GSM modem attached)
• RSS

See http://docs.opsview.org/doku.php?id=opsview-community:notificationmethods for more information.

If you want to overcome the limitations of slaves, you can set notification from the master.

A "recovery" state is when a host or service transitions from a hard failure back to OK.


Flapping

If a host or service changes state too frequently

Disables notifications temporarily

The flapping value is calculated from the number of state changes that have occurred in the last 21 states for a service.

There is a high flapping threshold (when something goes into a flapping state) and a low flapping threshold (when something comes out of a flapping state). We use the default values of high: 30% and low: 20%.

You can disable flap detection for each servicecheck. It is enabled by default.
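As a rough sketch of the calculation in these notes, the following Python counts state changes across the last 21 states and applies the two thresholds. It is a simplification under stated assumptions: every state change counts equally, the comparisons are strict, and the function names are invented for the example. The real Nagios calculation may differ in detail.

```python
HIGH_THRESHOLD = 30.0  # percent, default quoted in the notes
LOW_THRESHOLD = 20.0   # percent, default quoted in the notes
HISTORY = 21           # number of states considered


def percent_state_change(states):
    """Percentage of the 20 possible transitions that were state changes."""
    recent = list(states)[-HISTORY:]
    if len(recent) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(recent, recent[1:]) if a != b)
    return 100.0 * changes / (HISTORY - 1)


def update_flapping(is_flapping, states):
    """Apply the high/low thresholds to decide the new flapping flag."""
    pct = percent_state_change(states)
    if not is_flapping and pct > HIGH_THRESHOLD:
        return True   # start flapping: notifications are suppressed
    if is_flapping and pct < LOW_THRESHOLD:
        return False  # stop flapping: notifications resume
    return is_flapping


history = ["OK"] * 14 + ["CRIT", "OK"] * 3 + ["CRIT"]
print(percent_state_change(history), update_flapping(False, history))
```

Using two thresholds rather than one means a service hovering around a single value does not repeatedly enter and leave the flapping state.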


Parent/child relationships

Host can have a parent to denote a dependency

You can assign multiple parents for a host

Defines a network topology, acting as a dependency for determining network reachability


If a host is marked as down, then its parents are checked to determine the network reachability.


Performance graphing

A way of storing numeric data over time

Automatically created from plugin’s performance data

RRD based, averaged

5 minute intervals

Service checks must run at least once an hour, otherwise “gaps” appear

Gauge or counter data points

Map file for changing non-compliant plugins

RRD stands for Round Robin Database. It is a fixed size database and very fast, but it loses resolution over time.

If your plugin returns valid performance data, the database will be automatically updated.

There is a default resolution for all RRDs (5 minutes is the smallest resolution):
• 50 hours averaged over 5 mins
• 14 days averaged over 30 mins
• 2 months averaged over 2 hours
• 2 years averaged over 1 day

This produces an RRD file which is 24K in size.

The service check must run at least once an hour, otherwise RRD will record no data for that period.

Gauge is the automatic default. To set counter values, suffix the value with "c".
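To see why the file size is fixed, it helps to count the rows each archive holds. The sketch below does that arithmetic in Python. Several values are assumptions not stated in the notes: 2 months is taken as 60 days, 2 years as 730 days, and each stored value as 8 bytes.

```python
# (archive step, step in minutes, total span in minutes)
ARCHIVES = [
    ("5 minutes", 5, 50 * 60),
    ("30 minutes", 30, 14 * 24 * 60),
    ("2 hours", 120, 60 * 24 * 60),   # assumption: 2 months = 60 days
    ("1 day", 1440, 730 * 24 * 60),   # assumption: 2 years = 730 days
]
BYTES_PER_VALUE = 8  # assumption: one double-precision value per row


def rows_per_archive(archives=ARCHIVES):
    """Number of fixed slots each archive needs to cover its time span."""
    return {name: span // step for name, step, span in archives}


rows = rows_per_archive()
total_rows = sum(rows.values())
approx_bytes = total_rows * BYTES_PER_VALUE
print(rows, total_rows, approx_bytes)
```

Under these assumptions the archives hold 2722 rows, or roughly 21K of values, which is of the same order as the 24K quoted above. File headers and bookkeeping are not modelled here.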


Checkpoint

What are the two types of checks?

What are the two state change types?

A process graph shows there are 3.2 processes. How come I get a non-integer value?

What happens with service defined with max check attempts of 1? When is this useful?

An active service is checked every 3 minutes, with a retry check interval of 1 minute and a max check attempts of 4. How long will it be between the last OK and the time you are notified?

Opsview has 4 types of service checks: active, SNMP polling, passive and SNMP traps. The first two are active checks and the last two are passive.

You can get fractional values for data points that are only ever integers because of the constant averaging in the RRD.

Max check attempts of 1 is useful for passive checks, and also for services you want to be alerted on immediately.
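A tiny illustration of the averaging effect, using made-up sample values: five process counts that are all integers still average to a fraction.

```python
def consolidate(samples):
    """Average a window of samples, as an averaging archive would."""
    return sum(samples) / len(samples)


# Hypothetical process counts collected within one averaging window.
print(consolidate([3, 3, 3, 4, 3]))
```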


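Answer!

6 minutes since last OK

[Timeline figure: successive checks return OK, OK, CRIT, CRIT, CRIT, CRIT. The first three CRIT results are Soft and the fourth is Hard. The intervals between consecutive checks are 3, 3, 1, 1 and 1 minutes.]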
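The 6 minute answer follows from one normal check interval (to reach the first failure) plus one retry interval for each remaining attempt. A minimal sketch, with a function name chosen for this example:

```python
def minutes_until_hard(check_interval, retry_interval, max_check_attempts):
    """Minutes from the last OK check to the check that makes the state hard."""
    # First failing check arrives one normal interval after the last OK;
    # each further attempt is one retry interval later.
    return check_interval + retry_interval * (max_check_attempts - 1)


print(minutes_until_hard(3, 1, 4))
```

With max check attempts of 1 the same formula gives 3 minutes, that is, the notification goes out at the first failing check.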


Distributed architecture


Distributed slaves

Slaves provide monitoring from a different location
• Reduces bandwidth
• Spreads load
• Independent monitoring system
• Simplifies firewall configuration

Slaves can consist of clustered nodes
• Balances workload
• Redundancy and automatic failover


Distributed slaves, part 2

Runs as an independent monitoring server, reporting to master (using NSCA)

Stores MRTG and NMIS data locally

Can enable web interface on slave (standard Nagios CGIs)

Managed from master server

Each host is assigned to a slave

Results from slave are marked as “stale” if no results arrive

Slave-node services automatically created

One of the main features of Opsview is the handling of distributed systems. Opsview handles:
* installing the Opsview software on the slave
* upgrading Opsview software when master is upgraded
* synchronising /usr/local/nagios/libexec
* generating configuration for master and slave
* single point of control from the master web UI

The technology used to send results back to the master is Nagios Service Check Acceptor (NSCA).

You can enable the web interface on the slave server, so that you can see standard Nagios screens from the slave.

"Freshness checking", a Nagios concept of expecting results within a certain timeframe, has an additional 30 minute window before it will start to mark services into a different state.

There is a "Slave-node: {hostname}" service that will be automatically created that monitors the slave.

http://docs.opsview.org/doku.php?id=opsview-community:slavesetup


Distributed slaves, part 3

Limitation: Same OS and architecture as master

Limitation: Plugin output on slave only sends the first 511 bytes or the 1st line to the master

Limitation: Loses results if connection drops

Limitation: Acknowledgements and downtime not synchronised

Limitation: Time must be synchronised, but time zone can be different

Can have them clustered
• Failover
• Load balancing

Slaves need to be the same architecture as the master because the Opsview master sends all files in /usr/local/nagios/bin, including the nagios executable, to the slaves. This will include architecture specific files.

Plugin output from a slave to the master is limited to the first 511 bytes or the first line, whichever comes first. This is due to the transport mechanism used.

Opsview uses NSCA for sending results from the slaves to the master. If it fails to send the check results, it will drop those results, so results can be lost.

Time must be synchronised between master and slave. The check_opsview_slave_node check, automatically created, verifies that the two clocks are within 5 seconds of each other.

The timezone does not have to match, but we recommend it is set the same as the master.

Software level clustering for slaves is provided in Opsview. You can use your own virtual machines to provide failover and redundancy if desired.
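To check in advance what the master would see of a plugin's output, the truncation rule can be approximated as below. This is a sketch of the rule as stated in these notes (first line or first 511 bytes, whichever comes first), not the transport code itself. The function name and the choice of UTF-8 are assumptions.

```python
MAX_OUTPUT_BYTES = 511  # limit quoted in the notes


def output_sent_to_master(plugin_output: str) -> str:
    """Keep only the first line, then at most 511 bytes of it."""
    first_line = plugin_output.split("\n", 1)[0]
    encoded = first_line.encode("utf-8")[:MAX_OUTPUT_BYTES]
    # errors="ignore" drops a multi-byte character cut in half by the limit
    return encoded.decode("utf-8", errors="ignore")


print(output_sent_to_master("OK - all fine\nextra detail lost on the master"))
```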


Synchronisation of states

Principle: single point of control from Opsview master

Changes made through user interface will be propagated to slaves (5 second delay)

• So acknowledgements and downtimes on master are replicated on slaves

At Opsview reload, states for hosts and services on slave will be synchronised with the master if the last updated time is older on the slave

A limitation of Opsview prior to 3.5.2 was that if you moved a host or service that was acknowledged from one slave to another, the new slave would not know about the acknowledgement and thus notifications from the slave would be sent.

From Opsview 3.5.2 onwards, the state of the slave is synchronised with the master as long as the last_updated field on the slave is older than the master's view. This occurs at Opsview reload time and at cluster takeover time.
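The "older on the slave" rule can be pictured as a simple merge. The sketch below is hypothetical: the dictionary layout, the acknowledged field and the function name are invented for illustration, and only the last_updated comparison comes from the notes. Copying entries the slave has never seen is also an assumption, made to cover the moved host case.

```python
def sync_slave_state(master, slave):
    """Return the state a slave would hold after synchronisation."""
    merged = dict(slave)
    for name, master_entry in master.items():
        slave_entry = slave.get(name)
        # Master wins only when the slave has no entry or an older one.
        if slave_entry is None or slave_entry["last_updated"] < master_entry["last_updated"]:
            merged[name] = dict(master_entry)
    return merged


master = {
    "web1/HTTP": {"last_updated": 200, "acknowledged": True},
    "db1/MySQL": {"last_updated": 100, "acknowledged": False},
}
slave = {
    "web1/HTTP": {"last_updated": 150, "acknowledged": False},
    "db1/MySQL": {"last_updated": 120, "acknowledged": True},
}
print(sync_slave_state(master, slave))
```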


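Distributed Architecture Diagram

[Diagram: the Opsview Master connects over ssh (port 22) to Slave A and to the cluster nodes Slave B1, Slave B2 and Slave B3, with nsca used for results. Web clients reach the master over HTTP port 80, and optionally the slaves over HTTP 80. Slaves monitor hosts using NRPE, SNMP or check_by_ssh. The systems are spread across Datacenter 1 and Datacenter 2.]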

All communication between master and slave is over port 22 (SSH). This is usually initiated from the master to the slave. It is possible to do a "reverse SSH", where the slave initiates an SSH tunnel to the master, which the master then uses to connect to the slave.

Setup of a slave requires exchanging SSH keys.

Cluster nodes require SSH key exchange between each other so that they can share their state information.

You can have a different number of slave cluster nodes for each slave system.


Clustered slaves

Can have an arbitrary number of clustered nodes, usually 2 or 3

At reload time, the hosts for a slave are split across cluster nodes

On each node, a service (“Cluster-node: {nodename}”) is automatically generated to monitor every other node

An event handler is setup to take over on failure

Requires an ssh connection between the slave cluster nodes

For more information about slave clusters, see: http://docs.opsview.org/doku.php?id=opsview-community:slaveclusters

Example slave cluster system

[Diagram: the Master monitors hosts 1, 2, 3, 4, 5, 6 through a slave with three nodes, and runs the services Slave-node: A, Slave-node: B and Slave-node: C.]

• Node A monitors hosts 1, 2. Services: Cluster-node: B (takeover 3), Cluster-node: C (takeover 5)
• Node B monitors hosts 3, 4. Services: Cluster-node: A (takeover 1), Cluster-node: C (takeover 6)
• Node C monitors hosts 5, 6. Services: Cluster-node: A (takeover 2), Cluster-node: B (takeover 4)

The Opsview Master has 6 hosts which are being monitored by the slave system. There are 3 nodes in the slave.

At Opsview reload time, the 6 hosts are split across all the nodes (based on an algorithm called Set::Cluster on CPAN - http://search.cpan.org/dist/Set-Cluster/ ).

The "Cluster-node" services are automatically generated to look at every other node in that slave system. This requires SSH.

Each cluster node service has a list of hosts it will take over if that specific node fails - this is calculated at reload time (and not dynamically).
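The idea of precomputing takeover lists at reload time can be sketched as follows. This is not the Set::Cluster algorithm: it is a plain round-robin stand-in, with function names invented for the example, that happens to reproduce the split and takeover lists of the 3 node, 6 host example.

```python
def split_hosts(hosts, nodes):
    """Split hosts across nodes in contiguous, near-equal chunks."""
    base, extra = divmod(len(hosts), len(nodes))
    assignment, start = {}, 0
    for index, node in enumerate(nodes):
        size = base + (1 if index < extra else 0)
        assignment[node] = hosts[start:start + size]
        start += size
    return assignment


def takeover_plan(assignment):
    """For each node, list the hosts it takes over if another node fails."""
    plan = {node: {} for node in assignment}
    for failed, hosts in assignment.items():
        survivors = [node for node in assignment if node != failed]
        if not survivors:
            continue  # a single node has nobody to fail over to
        for node in survivors:
            plan[node][failed] = []
        # Deal the failed node's hosts out to the survivors in turn.
        for index, host in enumerate(hosts):
            plan[survivors[index % len(survivors)]][failed].append(host)
    return plan


assignment = split_hosts([1, 2, 3, 4, 5, 6], ["A", "B", "C"])
print(assignment)
print(takeover_plan(assignment))
```

Because the plan is a static lookup, a node that sees its "Cluster-node" check fail knows immediately which hosts to start monitoring, but the plan only covers one node failing at a time.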


Clustered slaves, part 2

MRTG data stored locally: the first node in the cluster will poll devices

NMIS data stored locally: will rsync with other cluster nodes

Every 15 minutes, the status information for each node is sent to every other node (for synchronisation at takeover)

Limitation: Single node failure only

Limitation: Slave clusters should be in the same network segment


The NMIS data is rsyncʼd between cluster nodes every hour (based on nagios userʼs crontab)


Checkpoint

Why would you use slaves?

How many slaves can you run?

If there are 2 cluster nodes in a slave, how many Cluster-node checks are automatically created?

What if there are 4 cluster nodes?

What would happen if 2 nodes in a cluster failed?

There is no software limit to the number of slaves, but there is a cost as reload times increase.

For 2 cluster nodes, 2 checks are automatically set up, each monitoring the other. For 3 cluster nodes, 6 checks would be set up. For 4, 12 checks are set up. The formula is N x (N-1), where N is the number of nodes in a slave.

If two nodes failed, some of the hosts/services would be marked as stale.
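The N x (N-1) count is simply the number of ordered (monitoring node, monitored node) pairs. A short check in Python, with names chosen for this example:

```python
from itertools import permutations


def cluster_node_checks(node_names):
    """Every node watches every other node: ordered pairs of distinct nodes."""
    return [
        (watcher, f"Cluster-node: {watched}")
        for watcher, watched in permutations(node_names, 2)
    ]


for names in (["A", "B"], ["A", "B", "C"], ["A", "B", "C", "D"]):
    print(len(names), "nodes ->", len(cluster_node_checks(names)), "checks")
```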


Distributed slaves setup


Setup

Setup users and groups on slaves

Prerequisite software needs to be installed. Use opsview-slave package

Doesn’t install actual software - will be sent from master

Upgrades will update slaves as part of the upgrade process

Can enable a web interface on the slave


The procedure for setting up a slave web interface is documented at http://docs.opsview.org/doku.php?id=opsview-community:slavesetup#slave_web_interface


Operation

Slave runs its active checks

Every result is written to /usr/local/nagios/var:
• cache_host.log
• cache_service.log

Every 5 seconds, /usr/local/nagios/bin/process-cache-data is called to send host and service results to the master

Output from these calls is saved to cache.log and the return code is saved to /usr/local/nagios/var/ocsp.status

The Opsview master will show the service as passive, though it is actively run on slave

process-cache-data uses send_nsca to transport results back to the master. This has a limit of 511 characters in the output.


Troubleshooting

Check ocsp.status for last return code on slave

Check cache.log for errors

Check:
• echo "" | /usr/local/nagios/bin/send_nsca -H 127.0.0.1 -c /usr/local/nagios/etc/send_nsca.cfg
• 0 data packet(s) sent to host successfully.

Check ssh on master

Check netstat -an | grep 5667 on master



Modules

We use the term Modules for functionality that is “loosely coupled” to Opsview but we still provide integration with it.

Opsview Core comes with:
* Nagvis
* MRTG
* NMIS


Nagvis

Nagvis provides a visual representation of the status of various objects

Maps are the grouping of objects together with a background image

Automap is a replacement network map view

Technology:
• PHP5

Integration:
• Apache configuration
• Authentication
• Host group hierarchy when choosing host groups

Because Nagvis is PHP5 based, Opsview delegates the PHP5 page rendering to Apache, by disabling proxying through to the Opsview Web application.

When configuring Apache, you can use the auth ticket method for authenticating. This means that users of Opsview can access Nagvis using this authentication ticket seamlessly. If a user tries to access nagvis without this ticket, they will be redirected back to the Opsview login screen.

For more information about Nagvis: http://docs.opsview.org/doku.php?id=opsview-community:nagvis


Nagvis: Using

Initial page at /nagvis

Access to maps

Adding a new map

Adding a background

Adding a new state object (host groups can be based on the hierarchy)

Limitation: No fine-grained access controls within Nagvis

Limitation: Times displayed are in UTC


Beware of Nagvis access controls! Maps can be assigned to EVERYONE for view and EVERYONE for edit. This means any authenticated user could edit your maps. Also, since the maps can be edited, this means it is possible to get a drop down list of all the hosts and services on your system! You can overcome this by making sure you only have named users for edit.


MRTG and NMIS

Both used for interface statistics

Use SNMP to collect information from hosts

Uses RRD files to store its data


More information about MRTG and NMIS in the OV204 course


Custom Plugins


Plugin specifications

A plugin must provide:
• A return code on completion
• Output to stdout (preferably 1 line, less than 511 bytes)

A plugin should provide:
• Help output when run with -h (written to stdout)

A plugin can provide:
• Performance data. Everything after the pipe symbol ("|") is considered performance data (preferably on that 1st line)

Full plugin guidelines:
• http://nagiosplug.sourceforge.net/developer-guidelines.html


Performance data format

Performance data will be automatically graphed if it is of the correct format

label=value[uom][;warn][;critical]

Can have multiple sections of these - use space to separate

Can change the order and the number of the sections and the insert routine will update the data appropriately

Be aware of averaging!

The warning and critical levels are optional.

The full performance format includes maximum and minimum values, but these are ignored in Opsview.

If the label changes, then it will be considered a new performance plot.

ODW will save the raw value. However, threshold information is not retained.

For counter values, use performance data like: inputbytes=119c
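To make the format concrete, here is a small parser sketch for label=value[uom][;warn][;critical]. It is an illustration, not the Opsview insert routine. It assumes labels contain no spaces, ignores any trailing minimum and maximum fields, and treats a "c" unit as a counter, as described above.

```python
import re

PERF_ITEM = re.compile(
    r"^(?P<label>[^=]+)="
    r"(?P<value>-?\d+(?:\.\d+)?)"
    r"(?P<uom>[^;\d]*)"
    r"(?:;(?P<warn>[^;]*))?"
    r"(?:;(?P<crit>[^;]*))?"
)


def parse_perfdata(plugin_output):
    """Return one dict per performance data section after the pipe."""
    _, _, perf = plugin_output.partition("|")
    metrics = []
    for item in perf.split():  # sections are separated by spaces
        match = PERF_ITEM.match(item)
        if match is None:
            continue  # skip anything that is not label=value...
        uom = match.group("uom")
        metrics.append({
            "label": match.group("label"),
            "value": float(match.group("value")),
            "uom": uom,
            "kind": "counter" if uom == "c" else "gauge",
            "warn": match.group("warn") or None,
            "crit": match.group("crit") or None,
        })
    return metrics


for metric in parse_perfdata("OK - 3 users | users=3;5;10 inputbytes=119c"):
    print(metric)
```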


Custom checks

Write it

Run it on command line

Have -h option to print out help text

Drop onto /usr/local/nagios/libexec

Will be automatically available in Opsview servicecheck page

Recommendation: Plugins return a short amount of data on 1 line

You can create a plugin in any language as long as it is executable by the nagios user.

We will create an example plugin that just returns OK.

You may not need to write your own plugin if an existing one can handle the checking for you. For instance, instead of a dedicated virus update plugin, just use check_file_age to test that the virus definition file is up to date.

You could also use a "proxy" method for getting results. For instance, query a database to get results for a test (say, number of sessions in your web application, or the balance of a test account).

FOSDEM example plugin: http://nagiosplugins.org/fosdem
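Since any language will do, a plugin that just returns OK might look like the Python sketch below. The plugin name check_always_ok, the perfdata label and the exit code used for -h are assumptions for this example. The block prints the output line but leaves the final exit as a comment so it can be run and inspected safely.

```python
import sys

# Standard plugin return codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

HELP_TEXT = (
    "check_always_ok: example plugin that always returns OK\n"
    "Usage: check_always_ok [-h]"
)


def run(argv):
    """Return (exit_code, output) for the given command-line arguments."""
    if "-h" in argv:
        # Assumption: help exits with UNKNOWN so it is never read as a pass.
        return UNKNOWN, HELP_TEXT
    # One short line on stdout, with performance data after the pipe.
    return OK, "OK - example plugin ran | example=1"


code, line = run(sys.argv[1:])
print(line)
# A real plugin would finish with: sys.exit(code)
```

Run it on the command line first, then drop it into /usr/local/nagios/libexec with the executable bit set for the nagios user.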


Creating plugins easily

Use Nagios::Plugin

Distributed with Opsview Agents, installed in /usr/local/nagios/perl/lib

Start plugin with:
    use FindBin;
    use lib "$FindBin::Bin/../perl/lib";
    use Nagios::Plugin;

More documentation:
• http://search.cpan.org/dist/Nagios-Plugin/lib/Nagios/Plugin.pm


ODW


History

Nagios is very good at “what is happening now”

Nagios’ reporting uses nagios.log file to get status

Lots of logic to work out status changes held in report code

Need to move to database driven



NDOutils and Runtime

NDOutils (Nagios Data Objects) is a project to put Nagios status data into a mysql database

Some limitations in how NDOutils saves its data:
• No configuration information over time
• Some tables are too big

Nagios has started to move data into a database structure - this is the project called NDOutils (Nagios Data Objects).

Opsview uses NDOutils, and Opsera have been actively promoting and updating the software, but we recognise there are some limitations in how the data is represented in NDOutils. For instance, recording a result takes about 1K per row in NDOutils, whereas in ODW it takes about 250 bytes.


Opsview Data Warehouse

ODW designed to be a data warehouse

Denormalised data, for easier searching

Long term storage tables

Raw results - performance data is exactly as received

Summary tables for quick queries

Schema diagram and further documentation
• http://docs.opsview.org/doku.php?id=opsview-community:odw


ODW: Operation

Need to opt-in to the import process from System Preference

Cron job called import_runtime runs at 4 minutes past the hour to collect data and summarise

Plugin: check_odw_hostgroup_availability

If you change the crontab entry, be aware that an upgrade will revert the crontab back again.

Be aware that the tables used by ODW are MyISAM tables, which do table level locking. If you have a long running query, it may block the import process, but the import will continue once the tables are unlocked again.

If you do a lot of reporting, you may want to consider setting up replication of the ODW database onto a different MySQL server where you can run the reports without affecting the main Opsview instance.


Architecture

Only the import_runtime script is used to add data into ODW.

The import_runtime script is also aware of multiple Opsview masters, so it is possible to have a shared ODW across multiple Opsview masters. This allows comparisons between hosts and services on different systems.

The main limitation is that the host name must be unique across all Opsview masters: http://docs.opsview.org/doku.php?id=opsview-community:sharedodw


Reports

Opsview comes with some PDF reports

Phasing out to use Opsview Enterprise Reports Module instead

• Based on Jasper Reports technology
• Available to Enterprise subscribers
• Documentation at http://docs.opsview.org/doku.php?id=reportingmodule
• Some predefined reports, such as Weekly Availability by Keyword and Weekly Performance by Keyword

Uses ODW to gather all data metrics for availability and performance

The PDF reports are retained in Opsview Core, but are being phased out due to complexity of code and poor functionality.

The new reports infrastructure uses Jasper Reports for its base technology, and includes several report types


API


Reasons for API

Automated configuration changes

Currently supports:
• creating / cloning / deleting hosts
• scheduling downtime
• reloads

For further information about the API, see: http://docs.opsview.org/doku.php?id=opsview-community:api


<opsview>
  <authentication>
    <username>admin</username>
    <password>initial</password>
  </authentication>
  <host action="create">
    <name>host</name>
    <ip>10.10.10.10</ip>
    <check_command><name>ping</name></check_command>
    <hostgroup><id>2</id></hostgroup>
    <icon><name>LOGO - Opsview</name></icon>
  </host>
</opsview>

This is an example XML file to push to Opsview.


Example invocations

curl -H 'Content-Type: text/xml' -d @file.xml http://opsviewserver/api

opsview_api -f file.xml
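For automated changes you would normally generate the XML rather than write it by hand. The sketch below builds a request shaped like the example XML file using Python's standard library. The function name and its parameters are invented for this example, and the icon element is left out for brevity. It only produces the document, which can then be saved as file.xml and pushed with one of the invocations above.

```python
import xml.etree.ElementTree as ET


def build_create_host_request(username, password, name, ip, hostgroup_id):
    """Build an XML document in the shape of the create-host example."""
    root = ET.Element("opsview")
    auth = ET.SubElement(root, "authentication")
    ET.SubElement(auth, "username").text = username
    ET.SubElement(auth, "password").text = password

    host = ET.SubElement(root, "host", action="create")
    ET.SubElement(host, "name").text = name
    ET.SubElement(host, "ip").text = ip
    check = ET.SubElement(host, "check_command")
    ET.SubElement(check, "name").text = "ping"
    group = ET.SubElement(host, "hostgroup")
    ET.SubElement(group, "id").text = str(hostgroup_id)
    return ET.tostring(root, encoding="unicode")


print(build_create_host_request("admin", "initial", "host", "10.10.10.10", 2))
```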



Configuration files


Configuration files

/usr/local/nagios/etc
• opsview.conf and opsview.defaults
• map.local and map

/usr/local/opsview-web
• opsview_web_local.yml and opsview_web.yml

/usr/local/nagios/share/stylesheets
• custom.css

/usr/local/nagios/nagvis/etc
• nagvis.ini.php

/usr/local/nagios/nmis/conf
• nmis.conf

opsview.defaults is a shipped file and is subject to change. If there are variables that you need to override, copy the variable into opsview.conf and amend it there. opsview.conf will not usually be changed over an upgrade.

The same applies to the map and map.local files. However, you shouldn't need to use the map file, as correctly formatted performance data will be automatically graphed.

The Nagios configuration files are regenerated every time.


Backup and recovery


Backup scope

A cronjob is run around 3am which invokes a backup

The backups save configuration data and some key Runtime data

The scope is to be able to recover a system quickly

Key variables in opsview.conf:
• $backup_dir - which directory to store the backups
• $backup_retention_days - number of days worth of backups to keep

You will have to design your own backup strategy for long term archival

Long term archival is not handled as part of the backup script because of the amount of data that could be in ODW and Runtime. For more information about backups: http://docs.opsview.org/doku.php?id=opsview-community:backups


Data to consider backing up

ODW

Runtime

MRTG

NMIS

Slaves
• MRTG
• NMIS
• Nagios logs


Restore process

Assumes you are restoring to the same server as the original Opsview server

Stop Opsview and Opsview Web

Restore files

Restore Opsview, Runtime (subset), Reports database

Restart Opsview and Opsview Web

Reload Opsview

For the latest restore process, see: http://docs.opsview.org/doku.php?id=opsview-community:backups

If you are looking to migrate Opsview onto different hardware, see: http://docs.opsview.org/doku.php?id=opsview-community:migratinghardware


Troubleshooting


Reload process

Creates Nagios configuration for master

Runs MRTG configuration generation in background

For each slave:
• Creates Nagios configuration
• Transfers files

Validates all configuration

Reloads Nagios simultaneously


Opsview uses Parallel::Forker for the workflow management. There is a limit set to only run 4 concurrent jobs at once - you can increase this if you have more CPUs on your master server.
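The effect of a concurrency limit can be illustrated with Python's standard library. This is an analogy only, not how Parallel::Forker is used inside Opsview: the job function and slave names are placeholders, and threads stand in for forked processes.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_JOBS = 4  # the limit quoted in the notes


def generate_config(slave):
    """Placeholder for the per-slave work done during a reload."""
    return f"configuration generated for {slave}"


def reload_all(slaves, max_jobs=MAX_CONCURRENT_JOBS):
    """Run the per-slave jobs with at most max_jobs in flight at once."""
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return list(pool.map(generate_config, slaves))


print(reload_all(["slave-a", "slave-b", "slave-c", "slave-d", "slave-e"]))
```

With five slaves and a limit of four, the fifth job waits until one of the first four finishes, which is why reload times grow as slaves are added.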


Reload troubleshooting

/admin/reload - will show common errors, including Nagios validation errors

/usr/local/nagios/var/rw/config_output
• Last full debug output of all configuration generation

/usr/local/nagios/var/log/create_and_send_configs.debug
• Last reload process workflow


File locations

/usr/local/nagios - Nagios and main Opsview core

/usr/local/opsview-web - Opsview web application

/var/log/opsview - logs for master server
• opsviewd.log - For main opsviewd daemon and other master jobs
• opsview-web.log - For web application

/usr/local/nagios/var - logs and status files for Nagios (on slaves too)
• var/log/opsview-slave.log - For Opsview specific slave jobs

Performance RRD data
• /usr/local/nagios/var/rrd/{hostname}/{servicename}/{metric}/value.rrd
• thresholds.rrd

Performance graphs are stored in /usr/local/nagios/var/rrd/{hostname}/{servicename}/{metric}/value.rrd.

Thresholds are stored in thresholds.rrd, with information about the warning and critical levels.

Opsview uses Log4perl for logging and you can set location and rotation information. However, the Log4perl configuration file is overwritten as part of an upgrade.


Opsview not running

Restarting Opsview: /etc/init.d/opsview restart

Restarting Opsview Web: /etc/init.d/opsview-web restart

Do these either as root or switch to nagios user
• su - nagios

You must use su - nagios. The dash starts a login shell, which picks up environment variables required by Nagios.


Errors relating to browser

The Opsview Web Server is not running
• If an upgrade is in progress

Error retrieving update from Opsview. Will continue to retry
• AJAX update problem. Will continue to repoll. Problems with web service?


Errors relating to Opsview

If a service fails which is for the Opsview master or slaves, escalate to Opsera


Opsera provide commercial support for Opsview. See http://opsera.com/jsp/opsera_product/Opsview%20product.jsp for details


Summary

Understand the advanced concepts of how Opsview monitors

Understand how a distributed Opsview system works

Understand how to add new custom plugins

Understand what backups are required for Opsview
