Nagios Conference 2013 - Troy Lea - Leveraging and Understanding Performance Data and Graphs

Leveraging and Understanding Performance Data and Graphs

Troy Lea

Twitter: @Box293

http://exchange.nagios.org/directory/Owner/Box293/1

About Me

IT Consultant

Nagios Developer

Love tinkering with Nagios

Why Nagios XI?

It’s a virtual appliance - ready to go

About This Presentation

Understanding how performance data is stored in the back end and how Nagios accesses it

Goal is to give you key pieces of information

A good reference for understanding concepts

This presentation is centered around Nagios XI

Valid for other Nagios implementations

Basic Concepts - Part 1


./check_nt -H SERVER -s "" -p 12489 -v USEDDISKSPACE -l C -w 80 -c 95

C:\ - total: 39.99 Gb - used: 25.28 Gb (63%) - free 14.71 Gb (37%) | 'C:\ Used Space'=25.28Gb;32.00;38.00;0.00;39.99


Service check command is executed by the monitoring engine

Monitoring engine receives the result of the check

Data received has performance data

Performance data is anything after the | (pipe)

The performance data is inserted into an RRD file

When viewing the performance graph, PNP4Nagios retrieves the performance data from the RRD file and generates a pretty graph

Every time the service check receives performance data, it inserts this performance data into the RRD file which allows you to look at trends over time

Plugins

The power of Nagios is in the plugins!

Monitor what you want, how you want!

Resources are available that clearly define the guidelines for creating plugins

Nagios Plug-in Developer Guidelines

http://nagiosplug.sourceforge.net/developer-guidelines.html

PNP Documentation

http://docs.pnp4nagios.org/pnp-0.4/doc_complete

Plugin Output Explained - Part 1

Plugins produce data divided into two parts

The pipe symbol “|” is used as a delimiter

Example check_icmp

OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;;

Data to the left of the pipe symbol is processed by the monitoring engine

Data to the right of the pipe symbol is used for inserting into RRD and XML files
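
The split at the pipe symbol can be sketched in a few lines of shell. This is only an illustration of the idea, not how the monitoring engine itself is written; the function names status_text and perf_data are made up for this example, and the sample line is the check_icmp output shown above.

```shell
# Split one line of plugin output at the first pipe symbol.
# The sample line is the check_icmp output from the slide.
output="OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;;"

status_text() {
    # Everything before the first "|", with trailing spaces trimmed
    printf '%s\n' "${1%%"|"*}" | sed 's/[[:space:]]*$//'
}

perf_data() {
    # Everything after the first "|", or an empty line if there is no pipe
    case "$1" in
        *"|"*) printf '%s\n' "${1#*"|"}" | sed 's/^[[:space:]]*//' ;;
        *)     printf '\n' ;;
    esac
}

status_text "$output"
perf_data "$output"
```

The first call prints the part the monitoring engine displays as status text; the second prints the part that ends up in the RRD and XML files.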


The exit code Nagios receives from the plugin determines the state of the service

0 = OK

1 = WARNING

2 = CRITICAL

3 = UNKNOWN

The exit code is not “visible” when running a check from the command line or looking at the output returned from the plugin
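
Although the exit code is not printed, the shell keeps it in $? after a command finishes. The sketch below shows one way to look at it; fake_plugin is a hypothetical stand-in for a real plugin and state_name is a helper invented for this example.

```shell
# Map a plugin exit code to the service state listed above.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "not one of the four codes above" ;;
    esac
}

# Hypothetical stand-in for a real plugin: prints a status line, exits with 2.
fake_plugin() {
    echo "CRITICAL - something is wrong"
    return 2
}

# Capture the exit code; "|| code=\$?" keeps the script going under "set -e"
code=0
fake_plugin || code=$?
echo "Exit code: $code ($(state_name "$code"))"
```

With a real plugin, running the check and then `echo $?` on the next line gives the same information.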


No performance data = no pretty graphs

You can create a plugin using whatever language and tools are available

All that matters is the end result which is returned back to Nagios when the plugin has finished running


Examples:

Shell script

Something you might want to check on the Nagios host itself

Perl script

Remotely checking a device using SNMP OR using third party APIs like the VMware vSphere SDK to remotely access virtual environments

Visual Basic script

Using NSClient on a Windows host to perform a check (like RDP usage)

Performance Data Specifics - Part 1

Asterisk (*) fields are required; everything else is optional

In this instance, rta is the FIRST DS, or DS 1

Performance Data Specifics - Part 2

Multiple DS

Each DS is separated by a space

rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;;

The label can contain spaces; however, the label MUST then be enclosed in single quotes

'Round Trip Average'=2.687ms;3000.000;5000.000;0; 'Packet Loss'=0%;80;100;;
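
The Nagios plugin guidelines describe each DS as 'label'=value[UOM];[warn];[crit];[min];[max]. As a rough sketch of how those fields line up, the function below takes ONE DS at a time and prints its parts. The name parse_ds is invented for this example, and it does not try to split a whole performance data string, since labels may themselves contain spaces.

```shell
# Print the fields of a single performance data item (one DS).
# Expected shape: 'label'=value[UOM];[warn];[crit];[min];[max]
parse_ds() {
    item=$1
    label=${item%%=*}                            # text before the first "="
    label=$(printf '%s' "$label" | tr -d "'")    # drop the single quotes
    rest=${item#*=}                              # value[UOM];warn;crit;min;max
    # Split the remainder on ";" (awk keeps empty fields)
    fields=$(printf '%s\n' "$rest" | LC_ALL=C awk -F';' '{
        value = $1; uom = $1
        sub(/[^0-9.+-].*$/, "", value)           # leading numeric part
        sub(/^[0-9.+-]*/, "", uom)               # whatever follows the number
        printf "value=%s uom=%s warn=%s crit=%s min=%s max=%s",
               value, uom, $2, $3, $4, $5
    }')
    printf 'label=%s %s\n' "$label" "$fields"
}

parse_ds "rta=2.687ms;3000.000;5000.000;0;"
parse_ds "'Packet Loss'=0%;80;100;;"
```

Empty fields simply print as empty, which matches the idea that everything except the label and value is optional.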

Basic Plugin - Part 1

Example shell script demonstrating how a plugin outputs performance data

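#!/bin/bash

# Pick two random numbers: 1-100 and 1-1000
NUMBER1=$(( (RANDOM % 100) + 1 ))

NUMBER2=$(( (RANDOM % 1000) + 1 ))

# Status text goes before the pipe, performance data after it
echo "OK - Number 1: $NUMBER1 Number 2: $NUMBER2 | 'Number 1'=$NUMBER1;;;; 'Number 2'=$NUMBER2;;;;"

exit 0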


Here is the output each time it is run:

OK - Number 1: 4 Number 2: 74 | 'Number 1'=4;;;; 'Number 2'=74;;;;






Performance data displayed as a pretty graph

Demonstration of how you can generate performance data in a plugin


Now let's add warning and critical thresholds to the performance data string

Number1

WARNING @ 50

CRITICAL @ 75

Number2

WARNING @ 500

CRITICAL @ 750

echo "OK - Number 1: $NUMBER1 Number 2: $NUMBER2 | 'Number 1'=$NUMBER1;50;75;; 'Number 2'=$NUMBER2;500;750;;"


Here is the output each time it is run:

OK - Number 1: 4 Number 2: 74 | 'Number 1'=4;50;75;; 'Number 2'=74;500;750;;






This demonstrates how the performance data does not have any effect on the state of the service

Warning and Critical thresholds are inside the .xml file
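
If the thresholds should change the state of the service, the plugin has to compare the value itself and return the matching exit code. Here is a minimal sketch of that, assuming whole numbers and assuming the state changes as soon as the value reaches the threshold; state_for and check_number are names made up for this example.

```shell
# Return the state code for a value given warning and critical levels.
# Assumption for this sketch: reaching the threshold is enough to trigger it.
state_for() {
    value=$1; warn=$2; crit=$3
    if [ "$value" -ge "$crit" ]; then
        echo 2
    elif [ "$value" -ge "$warn" ]; then
        echo 1
    else
        echo 0
    fi
}

# Print a full plugin output line for one value and return its state code.
check_number() {
    value=$1; warn=$2; crit=$3
    code=$(state_for "$value" "$warn" "$crit")
    case "$code" in
        0) state="OK" ;;
        1) state="WARNING" ;;
        2) state="CRITICAL" ;;
    esac
    echo "$state - Number 1: $value | 'Number 1'=$value;$warn;$crit;;"
    return "$code"
}

# 60 is above the warning level of 50 but below the critical level of 75
check_number 60 50 75 || echo "exit code: $?"
```

In a real plugin the last step would be `exit "$code"` rather than a function return, so that Nagios receives the exit code.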

.rrd and .xml files

Used for recording the results from Nagios checks

Useful for observing daily trends of your environment

Invaluable for helping resolve performance issues

RRD = Round Robin Database

XML = Information about the Nagios check

PNP4Nagios uses the RRD and XML files to generate pretty graphs

Location of .rrd and .xml files

When a service check returns performance data, Nagios dumps this into:

/usr/local/nagios/var/spool/perfdata

A background process detects the spooled data and creates / updates the relevant .rrd and .xml

The Performance Data files live in:

/usr/local/nagios/share/perfdata/<host>

Extract .rrd data

You can extract data from an .rrd file

Example (from the CLI): rrdtool fetch /usr/local/nagios/share/perfdata/localhost/_HOST_.rrd MAX -r 900 -s -1h
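
The fetch output is plain text, one "timestamp: value" row per step, so it is easy to post-process. The sketch below picks the largest value from the first DS column and skips rows where the value is unknown (nan). The fetch_max name and the sample rows are made up for illustration; real output comes from the rrdtool command above.

```shell
# Read "rrdtool fetch" style output on stdin and print the largest value of
# the first DS column, ignoring the header and any nan (unknown) rows.
fetch_max() {
    LC_ALL=C awk '
        /^[0-9]+:/ && tolower($2) !~ /nan/ {
            v = $2 + 0
            if (!seen || v > max) { max = v; seen = 1 }
        }
        END { if (seen) printf "%.3f\n", max; else print "no data" }
    '
}

# Made-up sample in the general shape that rrdtool fetch prints
sample='                          1

1380000000: 2.6870000000e+00
1380000900: 3.1020000000e+00
1380001800: nan
1380002700: 2.9100000000e+00'

printf '%s\n' "$sample" | fetch_max

# Against a real file (path from the slide; needs rrdtool installed):
# rrdtool fetch /usr/local/nagios/share/perfdata/localhost/_HOST_.rrd MAX -r 900 -s -1h | fetch_max
```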

.rrd and .xml Gotcha - Part 1

The .xml file can contain sensitive data

<NAGIOS_SERVICECHECKCOMMAND>check_emc_clariion!$HOSTADDRESS$!-u readonly!-p Str0ngPassw0rd!-t sp_cbt_busy!--sp A!--warn 70!--crit 90!</NAGIOS_SERVICECHECKCOMMAND>


Perhaps use a central credential file

<NAGIOS_SERVICECHECKCOMMAND>check_vmware_host!check_vmware_config_vcenter01!cpu!90!95!!!!</NAGIOS_SERVICECHECKCOMMAND>
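
To find out whether any existing .xml files hold credentials, one option is to list the stored check command of every file and review the arguments by eye. This is only a sketch; show_check_commands is a name invented here, and the directory you pass in would be the perfdata folder mentioned earlier.

```shell
# Print "file: check command" for every .xml file below a directory so the
# arguments can be reviewed for passwords.
show_check_commands() {
    find "$1" -type f -name '*.xml' | while IFS= read -r file; do
        cmd=$(sed -n 's:.*<NAGIOS_SERVICECHECKCOMMAND>\(.*\)</NAGIOS_SERVICECHECKCOMMAND>.*:\1:p' "$file")
        if [ -n "$cmd" ]; then
            printf '%s: %s\n' "$file" "$cmd"
        fi
    done
}

# Example call, using the perfdata location from the earlier slide:
# show_check_commands /usr/local/nagios/share/perfdata
```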


RRD Data is averaged out over time

Looking at performance graphs for the past day / week / month / year will show results with less spiky data

This generally only occurs with data that has lots of peaks and troughs

Constant data like disk space used will generally not average out that much

It all depends on your environment!

When reviewing RRD data you need to take these factors into consideration; it's all relative!
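
A small worked example makes the averaging effect visible. The sketch below averages every N consecutive samples, which is roughly what happens when older RRD data is consolidated with AVERAGE. The consolidate name and the sample values are made up for this illustration.

```shell
# Average every N consecutive samples read from stdin (one value per line).
# Samples left over at the end that do not fill a group are dropped.
consolidate() {
    LC_ALL=C awk -v n="$1" '
        { sum += $1; count++ }
        count == n { printf "%.1f\n", sum / n; sum = 0; count = 0 }
    '
}

# Six made-up samples with a single spike to 100, averaged in groups of three
printf '%s\n' 10 10 100 10 10 10 | consolidate 3
```

The spike of 100 turns into 40.0 in the first averaged point, while the flat second group stays at 10.0. Constant data barely changes; spiky data gets smoothed.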

Graphs - How Templates Are Used - Part 1

http://docs.pnp4nagios.org/pnp-0.4/tpl


PNP4Nagios queries the XML file for the <TEMPLATE> tag

Each datasource has its own <TEMPLATE> tag

<TEMPLATE>check-host-alive</TEMPLATE>

The template name can also be given as a trailing string in the performance data (good for distributed monitoring)

OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;; [check_icmp]


From the example graphs:

<TEMPLATE>check-host-alive</TEMPLATE>

<TEMPLATE>check_local_load_alt</TEMPLATE>

PNP4Nagios looks for a php file with this name in the following folders:

/usr/local/nagios/share/pnp/templates.dist

/usr/local/nagios/share/pnp/templates


check-host-alive

/usr/local/nagios/share/pnp/templates.dist/check-host-alive.php

This PHP file generates the performance graph

check_local_load_alt

check_local_load_alt.php does NOT exist

Default template is used:

/usr/local/nagios/share/pnp/templates.dist/default.php
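
The lookup can be sketched as a short search over candidate files. The order used here (your own templates folder first, then templates.dist, then default.php) is an assumption for this sketch, and find_template is a made-up helper, not part of PNP4Nagios.

```shell
# Print the template file that would be picked for a given <TEMPLATE> name.
# Assumption: a file in templates/ takes priority over templates.dist/,
# and default.php is the fallback when no matching file exists.
find_template() {
    name=$1
    base=${2:-/usr/local/nagios/share/pnp}
    for candidate in \
        "$base/templates/$name.php" \
        "$base/templates.dist/$name.php" \
        "$base/templates/default.php" \
        "$base/templates.dist/default.php"
    do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

find_template check-host-alive || echo "no template found"
```

The optional second argument lets you point the search at a different base folder, which is handy for trying it out away from a live system.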

Graphs - Creating Your Own Template - Part 1

The check_command name is what gets inserted into the <TEMPLATE> tag in the XML file (this is how PNP determines which template to use)

So for this example I have created a copy of an existing command

check_xi_service_nsclient_alt


The service definition using the new command


The graph currently being generated

Default Template being used

Check Command being used

.rrd and .xml files currently contain valid data


Copy the file:

/usr/local/nagios/share/pnp/templates.dist/default.php

To the following location with the name:

/usr/local/nagios/share/pnp/templates/check_xi_service_nsclient_alt.php

Edit check_xi_service_nsclient_alt.php
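
From the CLI, the copy step might look like the sketch below. new_template is a name invented for this example; it refuses to overwrite an existing template so that earlier edits are not lost.

```shell
# Copy the default template to a new file named after the check command.
# The second argument (PNP base folder) defaults to the path on the slides.
new_template() {
    command_name=$1
    base=${2:-/usr/local/nagios/share/pnp}
    target="$base/templates/$command_name.php"
    if [ -e "$target" ]; then
        echo "Refusing to overwrite $target" >&2
        return 1
    fi
    cp "$base/templates.dist/default.php" "$target" && printf '%s\n' "$target"
}

# Example call for the command used in this presentation:
# new_template check_xi_service_nsclient_alt
```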


In the graph we are removing the bottom two lines

Default Template

Check Command command name

Which are lines 62 and 63

$def[$i] .= 'COMMENT:"Default Template\r" ';

$def[$i] .= 'COMMENT:"Check Command ' . $TEMPLATE[$i] . '\r" ';

Save check_xi_service_nsclient_alt.php


How easy was that!

Updated graph

Template Name and Check Command removed

PNP Templates In Detail - Part 1

Let's get into specifics

Template we just modified

It’s not that complicated! (LOL)


.rrd files can have multiple datasources (DS)

Round Trip Time and Packet Loss for example


Example of .rrd file with five DS

Two graphs generated using these DS


Default Template creates one graph per DS

This is a simple PHP foreach loop

The code within the loop references the relevant DS by the $i variable


This section of the template uses three DS

One graph will be generated using three DS

$opt[1] and $def[1] refer to the first graph being generated


Number formatting

Our modified template and the relevant code

The relevant information:

%3.4lf


The three DS template and the relevant code

The relevant information:

%4.0lf


Numbers are displayed with four decimal places

%3.4lf

Numbers are displayed as whole numbers

%4.0lf


PNP documentation defines the number formatting using the printf standard defined here

http://en.wikipedia.org/wiki/Printf

The number one (1) and the lower case letter "L" (l) look alike

%3.4lf contains a lower case "L"

The syntax is

%[parameter][flags][width][.precision][length]type


width

The number is allocated at least this many characters when it is drawn on the graph; this helps you align numbers in columns

precision

Determines whether the number is displayed as a whole number or with a specific number of digits after the decimal point


%3.4lf

width = 3

precision = .4

hence the displayed number is 25.3800

%4.0lf

width = 4

precision = .0

hence the displayed number is 14

Because the precision is 0, NO decimal place is used
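
The shell's own printf follows the same width and precision rules, so the two formats can be tried from the CLI. This is just a quick experiment, not template code: the "l" length modifier is left out, the show helper is invented for this example, and the brackets are only there to make the padding visible.

```shell
# Use "." as the decimal separator regardless of the system locale
LC_ALL=C
export LC_ALL

# Print a value in brackets so the padding added by the width is visible
show() {
    printf "[$1]\n" "$2"
}

show '%3.4f' 25.38   # width 3, precision 4
show '%4.0f' 14      # width 4, precision 0
show '%8.2f' 14      # a wider field, to make the padding obvious
```

The width is only a minimum: 25.3800 needs seven characters, so a width of 3 adds no padding, whereas 14 in a width of 4 is padded with two leading spaces.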

MRTG - Part 1

MRTG = Multi Router Traffic Grapher

Nagios Addon that is useful for monitoring network switch and router bandwidth using SNMP

The configuration can be complicated to understand

MRTG - Part 2

Nagios XI Wizard called “Network Switch / Router” automates the configuration of MRTG

MRTG configuration file

/etc/mrtg/mrtg.cfg

MRTG runs as a cron job every five minutes

cron comes from the Greek word for time, χρόνος [chronos]

Hence cron is a time-based job scheduler on Linux

In the Windows world it's the Task Scheduler

MRTG - Part 3

When MRTG runs, it gathers data from the devices defined in the mrtg.cfg file

It dumps this data into the folder

/var/lib/mrtg

For every port monitored, an .rrd file is created (no .xml file created at this point)

Another background process will then take the data in /var/lib/mrtg and put it into the correct location

/usr/local/nagios/share/perfdata/<host>

MRTG Gotcha - Part 1

When the Wizard populates the mrtg.cfg file it will add ALL ports on the switch to the config file

Even if you only selected to monitor 10 ports on the switch

The Nagios XI Service Configuration will only have 10 ports defined as service definitions

Every time the MRTG cron job runs, it will collect data from all ports on the switch (as defined in the mrtg.cfg file)

Extra CPU cycles, extra disk space


On a 48 port switch this might not concern you

But a stack of two 48 port switches has 96 ports, plus other internal ports like link aggregation ports (another 32 ports perhaps)

So these 128 ports add 8700+ configuration lines to the mrtg.cfg file

128 ports consume about 24 MB of .rrd disk space

In my past environment, the mrtg.cfg file was 59,000 lines long!
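
Dividing the figures above gives roughly 68 configuration lines (8700 / 128) and about 192 KB of .rrd data (24 MB / 128) per port. To see how many ports your own mrtg.cfg is polling, one quick check is to count its Target[...] lines. The sketch below does that against a tiny made-up config; count_targets is a name invented for this example and the addresses are placeholders.

```shell
# Count the interfaces MRTG polls by counting Target[...] lines in a config.
count_targets() {
    # grep -c exits non-zero when the count is 0, so ignore its exit status
    grep -c '^[[:space:]]*Target\[' "$1" || true
}

# Tiny made-up config with two polled interfaces
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
Target[192.0.2.10_1]: 1:public@192.0.2.10:
MaxBytes[192.0.2.10_1]: 125000000
Target[192.0.2.10_2]: 2:public@192.0.2.10:
MaxBytes[192.0.2.10_2]: 125000000
EOF

echo "Targets: $(count_targets "$cfg")"
rm -f "$cfg"

# On a live system (path from the earlier slide):
# count_targets /etc/mrtg/mrtg.cfg
```

Comparing that count with the number of port services defined in Nagios XI shows how many ports are being polled without being monitored.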


Suggestion

Clean up the mrtg.cfg file

Remove the ports you do not wish to gather data on

Can this cause Problems?

Yes!

Problem 1

Monitoring additional ports later using the wizard will not work

The wizard will NOT re-add the ports to the mrtg.cfg file

Wizard detects switch / router is already in the mrtg.cfg file


Problem 2 - Adding a switch (or module) to an existing switch

Monitoring additional ports later using the wizard will not work

The wizard will NOT add newly detected ports to the mrtg.cfg file

Wizard detects switch / router is already in the mrtg.cfg file

Very similar behaviour to Problem 1

Only relevant when the new switch / module is managed through the existing IP Address / FQDN

Common with stacked switches, adding another switch to the stack


Solutions to Problems 1 & 2

cfgmaker

This is how the Wizard configures mrtg.cfg

The wizard updates the existing mrtg.cfg using a PHP function (not available from the CLI)

Run cfgmaker from the CLI to generate a config file

Add the contents of the config file to the existing mrtg.cfg

cfgmaker --noreversedns "[email protected]" --output=output.txt
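
Adding the generated file to the existing configuration can be as simple as appending it, after taking a backup. The sketch below assumes the generated file contains only entries you actually want; it is worth reviewing it first, for example for repeated global settings. append_cfg is a name invented for this example.

```shell
# Append a cfgmaker-generated file to an existing mrtg.cfg, keeping a backup.
# The second argument defaults to the mrtg.cfg location given earlier.
append_cfg() {
    new=$1
    existing=${2:-/etc/mrtg/mrtg.cfg}
    # Keep a copy of the current file so the change can be undone
    cp "$existing" "$existing.bak" || return 1
    cat "$new" >> "$existing"
}

# Example call, using the output file name from the cfgmaker command above:
# append_cfg output.txt
```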


Problem 3 - With a frequently changing environment, keep mrtg.cfg clean

Monitoring WAN links for remote routers?

WAN link no longer exists?

Disable / Delete service definition(s) in Core Configuration Manager (CCM)

You will NEED to remove the device from mrtg.cfg

Why?

MRTG will still try to collect data from WAN links that are no longer accessible

This causes delays and can make MRTG run past the default 5 minute schedule, which can cause graph anomalies


Problem 4 - Firmware Upgrade causes port numbering to change

Major firmware revision applied to switch / router

New data collected for ports no longer follows the same pattern

Internal port numbering has changed

mrtg.cfg queries specific port numbers, does not use port names or descriptions

Example

Old Firmware: WAN = Port 1 LAN = Port 2

New Firmware: WAN = Port 0 LAN = Port 1

Have seen this behaviour on SonicWALL Firewalls

Questions

Questions?

Discount Offer

But wait, there's more ...

When visiting the Nagios XI website, use my affiliate link

http://www.nagios.com/#ref=3oHG00


