Leveraging and Understanding Performance Data and Graphs Troy Lea [email protected] Twitter: @Box293 http://exchange.nagios.org/directory/Owner/Box293/1

Nagios Conference 2013 - Troy Lea - Leveraging and Understanding Performance Data and Graphs

Troy Lea's presentation on Leveraging and Understanding Performance Data and Graphs. The presentation was given during the Nagios World Conference North America held Sept 20-Oct 2nd, 2013 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna

About Me

IT Consultant

Nagios Developer

Love tinkering with Nagios

Why Nagios XI?

It’s a virtual appliance - ready to go

About This Presentation

Understanding how performance data is stored in the back end and how Nagios accesses it

Goal is to give you key pieces of information

A good reference for understanding concepts

This presentation is centered around Nagios XI

Valid for other Nagios implementations

Basic Concepts - Part 1

Basic Concepts - Part 2

./check_nt -H SERVER -s "" -p 12489 -v USEDDISKSPACE -l C -w 80 -c 95

C:\ - total: 39.99 Gb - used: 25.28 Gb (63%) - free 14.71 Gb (37%) | 'C:\ Used Space'=25.28Gb;32.00;38.00;0.00;39.99

Basic Concepts - Part 3

Service check command is executed by the monitoring engine

Monitoring engine receives the result of the check

Data received has performance data

Performance data is anything after the | (pipe)

The performance data is inserted into an RRD file

When viewing the performance graph, PNP4Nagios retrieves the performance data from the RRD file and generates a pretty graph

Every time the service check receives performance data, it inserts this performance data into the RRD file which allows you to look at trends over time

Plugins

The power of Nagios is in the plugins!

Monitor what you want, how you want!

Resources available that clearly define the guidelines around creating plugins

Nagios Plug-in Developer Guidelines

http://nagiosplug.sourceforge.net/developer-guidelines.html

PNP Documentation

http://docs.pnp4nagios.org/pnp-0.4/doc_complete

Plugin Output Explained - Part 1

Plugins produce data divided into two parts

The pipe symbol “|” is used as a delimiter

Example check_icmp

OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;;

Data to the left of the pipe symbol is processed by the monitoring engine

Data to the right of the pipe symbol is used for inserting into RRD and XML files
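The split at the pipe can be sketched in a few lines of shell. This is only an illustration, assuming a POSIX shell; the function names plugin_text and plugin_perfdata are made up here and are not part of Nagios.

```shell
#!/bin/sh
# plugin_text: everything left of the first pipe (the status text).
plugin_text() {
    printf '%s\n' "$1" | sed -e 's/[[:space:]]*|.*$//'
}

# plugin_perfdata: everything right of the first pipe (the performance data).
# Prints an empty line when the output has no pipe at all.
plugin_perfdata() {
    case "$1" in
        *"|"*) printf '%s\n' "${1#*"|"}" | sed -e 's/^[[:space:]]*//' ;;
        *)     printf '\n' ;;
    esac
}

# The check_icmp output from the example above.
output='OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;;'
plugin_text "$output"
plugin_perfdata "$output"
```

The first call prints the text the monitoring engine works with, the second prints the string that ends up in the RRD and XML files.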

Plugin Output Explained - Part 2

The exit code Nagios receives from the plugin determines the state of the service

0 = OK

1 = WARNING

2 = CRITICAL

3 = UNKNOWN

The exit code is not “visible” when running a check from the command line or looking at the output returned from the plugin
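On the command line the exit code of the last command is held in the shell variable $?. The sketch below shows one way to make it visible. fake_plugin is a hypothetical stand-in for a real plugin, and state_name is an illustrative helper, not a Nagios command.

```shell
#!/bin/sh
# Map a plugin exit code to the state name listed above.
# In this sketch any code other than 0, 1 or 2 is reported as UNKNOWN.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Hypothetical stand-in for a real plugin: prints a line and returns code 2.
fake_plugin() {
    echo "CRITICAL - disk almost full"
    return 2
}

# Capture the exit code straight after the plugin finishes.
fake_plugin && code=0 || code=$?
echo "exit code: $code ($(state_name "$code"))"
```

With a real plugin, running it and then typing echo $? gives the same information.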

Plugin Output Explained - Part 3

No performance data = no pretty graphs

You can create a plugin using whatever language and tools are available

All that matters is the end result which is returned back to Nagios when the plugin has finished running

Plugin Output Explained - Part 4

Examples:

Shell script

Something you might want to check on the Nagios host itself

Perl script

Remotely checking a device using SNMP, or using third party APIs like the VMware vSphere SDK to remotely access virtual environments

Visual Basic script

Using NSClient on a Windows host to perform a check (like RDP usage)

Performance Data Specifics - Part 1

Asterisk (*) fields are required, everything else is optional

In this instance, rta is the FIRST DS, or DS 1

Performance Data Specifics - Part 2

Multiple DS

Each DS is separated by a space

rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;;

The label can contain spaces, but it MUST then be enclosed in single quotes

'Round Trip Average'=2.687ms;3000.000;5000.000;0; 'Packet Loss'=0%;80;100;;
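Each DS follows the layout from the plug-in developer guidelines: 'label'=value[UOM];warn;crit;min;max. The sketch below pulls one DS apart into those fields. It is an illustration only: parse_ds is a made-up name, it handles a single DS at a time, and it assumes plain decimal values.

```shell
#!/bin/sh
# Break one DS string into its fields and print them on one line.
parse_ds() {
    ds=$1
    label=${ds%%=*}     # text before the first '='
    rest=${ds#*=}       # text after the first '='
    # Drop the single quotes that enclose labels containing spaces.
    label=$(printf '%s' "$label" | sed -e "s/^'//" -e "s/'\$//")
    # Split value;warn;crit;min;max on semicolons (globbing switched off).
    set -f
    old_ifs=$IFS
    IFS=';'
    set -- $rest
    IFS=$old_ifs
    set +f
    first=${1-}
    # The value is the leading numeric part, the unit of measure is the rest.
    value=$(printf '%s' "$first" | sed -e 's/[^0-9.+-].*$//')
    uom=${first#"$value"}
    printf 'label=%s value=%s uom=%s warn=%s crit=%s min=%s max=%s\n' \
        "$label" "$value" "$uom" "${2-}" "${3-}" "${4-}" "${5-}"
}

parse_ds "'Round Trip Average'=2.687ms;3000.000;5000.000;0;"
parse_ds "'Packet Loss'=0%;80;100;;"
```

Fields left empty in the DS (here min and max for Packet Loss) simply come out empty.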

Basic Plugin - Part 1

Example shell script demonstrating how a plugin outputs performance data

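#!/bin/bash
# Pick two random numbers, 1-100 and 1-1000 ($RANDOM is a bash feature)
NUMBER1=$(( (RANDOM % 100) + 1 ))
NUMBER2=$(( (RANDOM % 1000) + 1 ))
# Status text goes left of the pipe, performance data right of it
echo "OK - Number 1: $NUMBER1 Number 2: $NUMBER2 | 'Number 1'=$NUMBER1;;;; 'Number 2'=$NUMBER2;;;;"
# Exit code 0 reports the service as OK
exit 0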

Basic Plugin - Part 2

Here is the output each time it is run:

OK - Number 1: 4 Number 2: 74 | 'Number 1'=4;;;; 'Number 2'=74;;;;

OK - Number 1: 52 Number 2: 758 | 'Number 1'=52;;;; 'Number 2'=758;;;;

OK - Number 1: 73 Number 2: 60 | 'Number 1'=73;;;; 'Number 2'=60;;;;

OK - Number 1: 29 Number 2: 338 | 'Number 1'=29;;;; 'Number 2'=338;;;;

OK - Number 1: 87 Number 2: 612 | 'Number 1'=87;;;; 'Number 2'=612;;;;

Basic Plugin - Part 3

Performance data displayed as a pretty graph

Demonstration of how you can generate performance data in a plugin

Basic Plugin - Part 4

Now let's add warning and critical thresholds to the performance data string

Number1

WARNING @ 50

CRITICAL @ 75

Number2

WARNING @ 500

CRITICAL @ 750

echo "OK - Number 1: $NUMBER1 Number 2: $NUMBER2 | 'Number 1'=$NUMBER1;50;75;; 'Number 2'=$NUMBER2;500;750;;"

Basic Plugin - Part 5

Here is the output each time it is run:

OK - Number 1: 4 Number 2: 74 | 'Number 1'=4;50;75;; 'Number 2'=74;500;750;;

OK - Number 1: 52 Number 2: 758 | 'Number 1'=52;50;75;; 'Number 2'=758;500;750;;

OK - Number 1: 73 Number 2: 60 | 'Number 1'=73;50;75;; 'Number 2'=60;500;750;;

OK - Number 1: 29 Number 2: 338 | 'Number 1'=29;50;75;; 'Number 2'=338;500;750;;

OK - Number 1: 87 Number 2: 612 | 'Number 1'=87;50;75;; 'Number 2'=612;500;750;;

Basic Plugin - Part 6

This demonstrates how the performance data does not have any effect on the state of the service

Warning and Critical thresholds are inside the .xml file
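For the thresholds to change the service state, the plugin itself has to compare the value and set its exit code. A minimal sketch of that comparison follows. It assumes integer values and "alert at or above the threshold" semantics, and threshold_state is a made-up helper name.

```shell
#!/bin/sh
# Return the state code for a value given warning and critical thresholds.
# Assumption for this sketch: integers only, alert when value >= threshold.
threshold_state() {
    value=$1 warn=$2 crit=$3
    if [ "$value" -ge "$crit" ]; then
        return 2    # CRITICAL
    elif [ "$value" -ge "$warn" ]; then
        return 1    # WARNING
    fi
    return 0        # OK
}

# One of the sample values from the output above.
NUMBER1=52
threshold_state "$NUMBER1" 50 75 && state=0 || state=$?
case "$state" in
    0) text="OK" ;;
    1) text="WARNING" ;;
    2) text="CRITICAL" ;;
esac
echo "$text - Number 1: $NUMBER1 | 'Number 1'=$NUMBER1;50;75;;"
# A real plugin would finish with: exit "$state"
```

Here 52 is at or above the warning threshold of 50, so the status text and the exit code both report WARNING.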

.rrd and .xml files

Used for recording the results from Nagios checks

Useful for observing daily trends of your environment

Invaluable for helping resolve performance issues

RRD = Round Robin Database

XML = Information about the Nagios check

PNP4Nagios uses the RRD and XML files to generate pretty graphs

Location of .rrd and .xml files

When a service check returns performance data, Nagios dumps this into:

/usr/local/nagios/var/spool/perfdata

A background process detects the spooled data and creates / updates the relevant .rrd and .xml

The Performance Data files live in:

/usr/local/nagios/share/perfdata/<host>

Extract .rrd data

You can extract data from an .rrd file

Example (from the CLI): rrdtool fetch /usr/local/nagios/share/perfdata/localhost/_HOST_.rrd MAX -r 900 -s -1h

.rrd and .xml Gotchya - Part 1

The .xml file can contain sensitive data

<NAGIOS_SERVICECHECKCOMMAND>check_emc_clariion!$HOSTADDRESS$!-u readonly!-p Str0ngPassw0rd!-t sp_cbt_busy!--sp A!--warn 70!--crit 90!</NAGIOS_SERVICECHECKCOMMAND>

.rrd and .xml Gotchya - Part 2

Perhaps use a central credential file

<NAGIOS_SERVICECHECKCOMMAND>check_vmware_host!check_vmware_config_vcenter01!cpu!90!95!!!!</NAGIOS_SERVICECHECKCOMMAND>
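One way to find .xml files that expose a password is to search the stored check commands. The sketch below is an illustration only: find_exposed_passwords is a made-up name, and it only looks for the "-p" argument seen in the first example, while other plugins use other flags. It runs against throwaway copies of the two check commands above rather than the live perfdata folder.

```shell
#!/bin/sh
# Print the .xml files under a folder whose stored check command
# appears to pass a password with "-p" (an assumption, see above).
find_exposed_passwords() {
    find "$1" -type f -name '*.xml' \
        -exec grep -lE '<NAGIOS_SERVICECHECKCOMMAND>.*!-p [^!]+' {} + || true
}

# Demonstration on temporary files.
demo_dir=$(mktemp -d)
printf '%s\n' '<NAGIOS_SERVICECHECKCOMMAND>check_emc_clariion!$HOSTADDRESS$!-u readonly!-p Str0ngPassw0rd!-t sp_cbt_busy!--sp A!--warn 70!--crit 90!</NAGIOS_SERVICECHECKCOMMAND>' > "$demo_dir/san.xml"
printf '%s\n' '<NAGIOS_SERVICECHECKCOMMAND>check_vmware_host!check_vmware_config_vcenter01!cpu!90!95!!!!</NAGIOS_SERVICECHECKCOMMAND>' > "$demo_dir/esx.xml"
found=$(find_exposed_passwords "$demo_dir")
echo "$found"
rm -rf "$demo_dir"
```

Only san.xml is listed, because the second command refers to a credential file instead of carrying the password itself.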

.rrd and .xml Gotchya - Part 3

RRD Data is averaged out over time

Performance graphs for the past day / week / month / year show progressively less spiky data

This generally only occurs with data that has lots of peaks and troughs

Constant data like disk space used will generally not average out that much

It all depends on your environment!

When reviewing RRD data you need to take into consideration these factors, it’s all relative!
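The effect of averaging can be shown with a little arithmetic. The sketch below averages every four samples into one, which is a simplification of what an RRD does (the real behaviour depends on how the archives in the .rrd file are defined). The sample values are hypothetical and consolidate_average is a made-up name.

```shell
#!/bin/sh
# Average every N consecutive samples into a single value.
consolidate_average() {
    n=$1; shift
    printf '%s\n' "$@" | LC_ALL=C awk -v n="$n" '
        { sum += $1; count++ }
        count == n { printf "%.1f\n", sum / n; sum = 0; count = 0 }
    '
}

# Spiky data (hypothetical CPU percentages): the peak of 95 becomes 30.0.
consolidate_average 4  5 10 95 10  5 5 10 20
# Steady data (hypothetical disk usage): averaging barely changes it.
consolidate_average 4  63 63 63 63  63 63 64 64
```

The first call prints 30.0 and 10.0, the second prints 63.0 and 63.5, which is why spiky checks look flatter on long-range graphs while steady ones look much the same.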

Graphs - How Templates Are Used - Part 1

http://docs.pnp4nagios.org/pnp-0.4/tpl

Graphs - How Templates Are Used - Part 2

PNP4Nagios queries the XML file for the <TEMPLATE> tag

Each datasource has its own <TEMPLATE> tag

<TEMPLATE>check-host-alive</TEMPLATE>

The template can also be given as a trailing string in square brackets after the performance data (good for distributed monitoring)

OK - 127.0.0.1: rta 2.687ms, lost 0% | rta=2.687ms;3000.000;5000.000;0; pl=0%;80;100;; [check_icmp]

Graphs - How Templates Are Used - Part 3

From the example graphs:

<TEMPLATE>check-host-alive</TEMPLATE>

<TEMPLATE>check_local_load_alt</TEMPLATE>

PNP4Nagios looks for a php file with this name in the following folders:

/usr/local/nagios/share/pnp/templates.dist

/usr/local/nagios/share/pnp/templates

Graphs - How Templates Are Used - Part 4

check-host-alive

/usr/local/nagios/share/pnp/templates.dist/check-host-alive.php

This PHP file generates the performance graph

 

check_local_load_alt

check_local_load_alt.php does NOT exist

Default template is used:

/usr/local/nagios/share/pnp/templates.dist/default.php
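The lookup can be sketched as a small shell function. This is an illustration, not PNP4Nagios code: find_template and its optional second argument are made up, and the search order (custom templates folder first, then templates.dist, then default.php) is an assumption.

```shell
#!/bin/sh
# Resolve the PHP template file for a given <TEMPLATE> value.
# Assumption: "templates" is checked before "templates.dist",
# and default.php is used when neither has a matching file.
find_template() {
    name=$1
    base=${2:-/usr/local/nagios/share/pnp}
    for dir in "$base/templates" "$base/templates.dist"; do
        if [ -f "$dir/$name.php" ]; then
            echo "$dir/$name.php"
            return 0
        fi
    done
    echo "$base/templates.dist/default.php"
}

find_template check-host-alive
find_template check_local_load_alt
```

On a system laid out as in the slides, the first call would resolve to check-host-alive.php and the second would fall back to default.php.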

Graphs - Creating Your Own Template - Part 1

The check_command name is what Nagios uses to insert into the <TEMPLATE> tag in the XML file (how PNP determines which template to use)

So for this example I have created a copy of an existing command

check_xi_service_nsclient_alt

Graphs - Creating Your Own Template - Part 2

The service definition using the new command

Graphs - Creating Your Own Template - Part 3

The graph currently being generated

Default Template being used

Check Command being used

.rrd and .xml files currently contain valid data

Graphs - Creating Your Own Template - Part 4

Copy the file:

/usr/local/nagios/share/pnp/templates.dist/default.php

To the following location with the name:

/usr/local/nagios/share/pnp/templates/check_xi_service_nsclient_alt.php

Edit check_xi_service_nsclient_alt.php

Graphs - Creating Your Own Template - Part 5

In the graph we are removing the bottom two lines

Default Template

Check Command command name

Which are lines 62 and 63

$def[$i] .= 'COMMENT:"Default Template\r" ';

$def[$i] .= 'COMMENT:"Check Command ' . $TEMPLATE[$i] . '\r" ';

Save check_xi_service_nsclient_alt.php

Graphs - Creating Your Own Template - Part 6

How easy was that!

Updated graph

Template Name and Check Command removed

PNP Templates In Detail - Part 1

Let's get into specifics

Template we just modified

It’s not that complicated! (LOL)

PNP Templates In Detail - Part 2

.rrd files can have multiple datasources (DS)

Round Trip Time and Packet Loss for example

PNP Templates In Detail - Part 3

Example of .rrd file with five DS

Two graphs generated using these DS

PNP Templates In Detail - Part 4

Default Template creates one graph per DS

This is a simple PHP foreach loop

The code within the loop references the relevant DS by the $i variable

PNP Templates In Detail - Part 5

This section of the template uses three DS

One graph will be generated using three DS

$opt[1] and $def[1] is a reference for the first graph being generated

PNP Templates In Detail - Part 6

Number formatting

Our modified template and the relevant code

The relevant information:

%3.4lf

PNP Templates In Detail - Part 7

The three DS template and the relevant code

The relevant information:

%4.0lf

PNP Templates In Detail - Part 8

Numbers are displayed with four decimal places

%3.4lf

Numbers are displayed as whole numbers

%4.0lf

PNP Templates In Detail - Part 9

PNP documentation defines the number formatting using the printf standard defined here

http://en.wikipedia.org/wiki/Printf

The number one (1) and the lower case letter "L" (l) look alike

%3.4lf contains a lower case "L"

The syntax is

%[parameter][flags][width][.precision][length]type

PNP Templates In Detail - Part 10

width

Sets the minimum number of characters reserved for the number on the graph, which helps you align numbers in columns

precision

Determines whether the number is displayed as a whole number, or with a specific number of digits after the decimal point

PNP Templates In Detail - Part 11

%3.4lf

width = 3

precision = .4

hence the displayed number is 25.3800

%4.0lf

width = 4

precision = .0

hence the displayed number is 14

Because the precision is 0, NO decimal place is used
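The same width and precision rules can be tried out with the shell's printf. This is a sketch for experimenting, not template code: format_number is a made-up helper, the "l" length modifier is left out because the shell does not need it, and 14.2 is a hypothetical input value.

```shell
#!/bin/sh
# Format a number with a printf-style width and precision.
# The C locale is forced so the decimal separator is always a dot.
format_number() {
    ( LC_ALL=C; export LC_ALL; printf "$1\n" "$2" )
}

# Width 3, precision 4: the result is longer than 3 characters,
# because the width is only a minimum.
printf '[%s]\n' "$(format_number '%3.4f' 25.38)"
# Width 4, precision 0: no decimal point, padded on the left to 4 characters.
printf '[%s]\n' "$(format_number '%4.0f' 14.2)"
```

The brackets make the padding visible: the first line prints [25.3800] and the second prints [  14].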

MRTG - Part 1

MRTG = Multi Router Traffic Grapher

Nagios Addon that is useful for monitoring network switch and router bandwidth using SNMP

The configuration can be complicated to understand

MRTG - Part 2

Nagios XI Wizard called “Network Switch / Router” automates the configuration of MRTG

MRTG configuration file

/etc/mrtg/mrtg.cfg

MRTG runs as a cron job every five minutes

cron comes from the Greek word for time, χρόνος [chronos]

Hence cron is the time-based job scheduler on Linux

In the Windows world it's the Task Scheduler

MRTG - Part 3

When MRTG runs, it gathers data from the devices defined in the mrtg.cfg file

It dumps this data into the folder

/var/lib/mrtg

For every port monitored, an .rrd file is created (no .xml file created at this point)

Another background process will then take the data in /var/lib/mrtg and put it into the correct location

/usr/local/nagios/share/perfdata/<host>

MRTG Gotchya - Part 1

When the Wizard populates the mrtg.cfg file it will add ALL ports on the switch to the config file

Even if you only selected to monitor 10 ports on the switch

The Nagios XI Service Configuration will only have 10 ports defined as service definitions

Every time the MRTG cron job runs, it will collect data from all ports on the switch (as defined in the mrtg.cfg file)

Extra CPU cycles, extra disk space

MRTG Gotchya - Part 2

On a 48 port switch this might not concern you

But a stack of two 48 port switches has 96 ports, plus other internal ports like link aggregation ports (another 32 ports perhaps)

So these 128 ports have now added 8700+ configuration lines to the mrtg.cfg file

128 ports consume about 24 MB of .rrd disk space

In my past environment, the mrtg.cfg file was 59,000 lines long!
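The figures above give a rough per-port cost, which can be used to estimate the overhead of ports that are collected but never monitored. A back-of-the-envelope sketch, treating 8700 lines and 24 MB for 128 ports as exact; estimate_overhead is a made-up name, and 38 stands for a 48 port switch with only 10 ports monitored.

```shell
#!/bin/sh
# Per-port cost derived from the figures above (integer division).
lines_per_port=$((8700 / 128))
kb_per_port=$((24 * 1024 / 128))

# Rough overhead for a given number of ports in mrtg.cfg.
estimate_overhead() {
    ports=$1
    echo "$ports ports: about $((ports * lines_per_port)) config lines, $((ports * kb_per_port)) KB of .rrd files"
}

estimate_overhead 38
```

That works out to roughly 67 configuration lines and 192 KB of .rrd data per port.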

MRTG Gotchya - Part 3

Suggestion

Clean up the mrtg.cfg file

Remove the ports you do not wish to gather data on

Can this cause Problems?

Yes!

Problem 1

Monitoring additional ports later using the wizard will not work

The wizard will NOT re-add the ports to the mrtg.cfg file

Wizard detects switch / router is already in the mrtg.cfg file

MRTG Gotchya - Part 4

Problem 2 - Adding a switch (or module) to an existing switch

Monitoring additional ports later using the wizard will not work

The wizard will NOT add newly detected ports to the mrtg.cfg file

Wizard detects switch / router is already in the mrtg.cfg file

Very similar behaviour to Problem 1

Only relevant when the new switch / module is managed through the existing IP Address / FQDN

Common with stacked switches, adding another switch to the stack

MRTG Gotchya - Part 5

Solutions to Problems 1 & 2

cfgmaker

This is how the Wizard configures mrtg.cfg

The wizard updates the existing mrtg.cfg using a PHP function (not available from the CLI)

Run cfgmaker at the CLI to generate a config file

Add the contents of the config file to the existing mrtg.cfg

cfgmaker --noreversedns "[email protected]" --output=output.txt

MRTG Gotchya - Part 6

Problem 3 - With a frequently changing environment, keep mrtg.cfg clean

Monitoring WAN links for remote routers?

WAN link no longer exists?

Disable / Delete service definition(s) in Core Configuration Manager (CCM)

You will NEED to remove device from mrtg.cfg

Why?

MRTG will still try and collect data from WAN links no longer accessible

Causes delays and can make MRTG run past the default 5 minute schedule ... can cause graph anomalies

MRTG Gotchya - Part 7

Problem 4 - Firmware Upgrade causes port numbering to change

Major firmware revision applied to switch / router

New data collected for ports is no longer the same pattern

Internal port numbering has changed

mrtg.cfg queries specific port numbers, does not use port names or descriptions

Example

Old Firmware: WAN = Port 1 LAN = Port 2

New Firmware: WAN = Port 0 LAN = Port 1

Have seen this behaviour on SonicWALL Firewalls

Questions

Questions ?

Discount Offer

But wait, there's more ...

When visiting the Nagios XI website, use my affiliate link

http://www.nagios.com/#ref=3oHG00