Sasag Nagios Scaling

Scaling Nagios® to monitor large heterogeneous environments

Dave Blunt, February 21, 2008

Seattle Area System Administrators Guild

© 2008 GroundWork Open Source, Inc.

February 2008

SASAG – Scaling Nagios ®

What is Nagios?

– “An Open Source host, service and network monitoring program.”
– Started as Netsaint in 1999 and became Nagios in 2002.
– www.nagios.org
– Availability and performance monitoring: is it up, is it down? How much load/memory/disk is in use?




What is Nagios?

[Diagram: Nagios parent PID and Nagios child PIDs; Nagios configuration files; Nagios status log; plugins, notification, event handler; CGIs]




What can Nagios suffer from?

– Configuration file maintenance issues
– CPU and disk I/O bottlenecks
– Blocking host checks
– File based performance bottlenecks




What can Nagios suffer from?

Configuration file maintenance issues

– Use a web based configuration tool
    Monarch (sourceforge.net/projects/monarch)
    Fruity (sourceforge.net/projects/fruity)
– Facilitates monitoring across multiple Windows domains, SNMP communities, and other security zones.

[Diagram: Configuration; CGI; Nagios instances 1, 2, 3 ... n, each with its own Nagios configuration files]




What can Nagios suffer from?

CPU and disk I/O bottlenecks

– Optimize Nagios: nagios.sourceforge.net/docs/2_0/tuning.html
– Use a database to store config and status information
    NDOUtils (www.nagios.org/downloads)
    Foundation (sourceforge.net/projects/gwfoundation)
– Placing the database on a separate server will greatly improve performance, and both NDOUtils and Foundation support it.

[Diagram: Nagios parent PID; Nagios child PID; CGIs; Configuration, Status and Events]




What can Nagios suffer from?

Blocking host checks

– Passive host updates
    Fping (fping.sourceforge.net)
– Huge increase in host check capacity (8,000+ checks a minute) if pings are parallelized.
– Downside of passive host updates is the possibility of some extra service alarms.

[Diagram: Nagios parent PID; Nagios child PID; CGIs; Status and Events; Fping Feeder; Host A, Host B ... Host n; Configuration]
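
The passive host update idea can be sketched in a few lines. This is an illustration only, not the Fping Feeder from the slide: it assumes fping's default per-host result lines of the form "<host> is alive" or "<host> is unreachable", and it formats them as Nagios PROCESS_HOST_CHECK_RESULT external commands. The function names and the sample host names are hypothetical.

```python
import time

# Nagios host status codes accepted by PROCESS_HOST_CHECK_RESULT.
HOST_UP, HOST_DOWN = 0, 1


def parse_fping_line(line):
    """Return (host, status_code, text) for one fping result line, or None."""
    parts = line.strip().split(" is ", 1)
    if len(parts) != 2:
        return None  # not a "<host> is <state>" line, ignore it
    host, state = parts
    status = HOST_UP if state == "alive" else HOST_DOWN
    return host, status, "fping: " + state


def host_check_commands(fping_output, now=None):
    """Turn a block of fping output into Nagios external command lines."""
    stamp = int(time.time() if now is None else now)
    commands = []
    for line in fping_output.splitlines():
        parsed = parse_fping_line(line)
        if parsed is None:
            continue
        host, status, text = parsed
        commands.append(
            "[%d] PROCESS_HOST_CHECK_RESULT;%s;%d;%s" % (stamp, host, status, text)
        )
    return commands


# Sample input standing in for the output of one parallel fping run.
sample = "hosta is alive\nhostb is unreachable\n"
for command in host_check_commands(sample, now=1203552000):
    print(command)
```

In a real feeder the resulting lines would be written to the Nagios external command file, whose location is site-specific. Because one fping run covers many hosts at once, the Nagios process itself no longer blocks on each ping.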




What can Nagios suffer from?

File based performance bottlenecks

– Remove the Nagios pipe file bottleneck with Event Brokers
    Bronx (archive.groundworkopensource.com/groundwork-opensource/trunk/bronx/)
    – Feed data into Bronx as a replacement for NSCA, and also have Bronx send data to Foundation
    DNX (dnx.sourceforge.net)
    – Specifically tied to distributed monitoring.




Typical scaling limits in Nagios*

Typical mix of Hosts/Services    Active service checks/min**    Passive service checks/min**
700/7,000                        770                            -

*With dual 3GHz Xeon, 4GB RAM, 10k RPM disk, RHEL4 ES 32-bit OS.

**Based on a Service being checked once every 10 minutes, and 1% of Services and Hosts being in transition between OK and non-OK states. Retry interval for non-OK states is 1 minute.




So, how could you scale up?




How can I drive my primary Nagios with Passive service checks?

Add Nagios instances and forward the data to Bronx or NSCA, or set up DNX
– At some point you end up having too much monitoring infrastructure!

Passive agents, e.g. GroundWork Distributed Monitoring Agent, NT_Scheduler, Cron
– Monarch supports creation of configuration files for passive agents

Different tier one monitoring tools, e.g. syslog, SNMP traps, Ganglia, Cacti
– Feed data from these up to your primary Nagios server: install the right agent on that server, process the results, and then submit them to Nagios.
– Syslog-ng (www.balabit.com/network-security/syslog-ng/)
– Snmptt (www.snmptt.org)
– Ganglia (ganglia.sourceforge.net)
– Cacti (www.cacti.net)
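
Whatever the source of the data, the final "submit to Nagios" step can look roughly like the sketch below. It formats a result as a PROCESS_SERVICE_CHECK_RESULT external command and writes it to the Nagios external command file. The helper names, the host and service names in the usage note, and the command file path are assumptions for illustration; the path depends on how Nagios was installed.

```python
import time


def passive_service_result(host, service, return_code, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    if return_code not in (0, 1, 2, 3):  # OK, WARNING, CRITICAL, UNKNOWN
        raise ValueError("return_code must be 0, 1, 2 or 3")
    stamp = int(time.time() if now is None else now)
    # An external command is a single line, so fold multi-line output.
    text = " ".join(output.splitlines())
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % (
        stamp, host, service, return_code, text)


def submit(command_line, command_file):
    """Write one command to the Nagios external command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(command_line + "\n")
```

A feeder for syslog or SNMP trap data might then call, for example, submit(passive_service_result("web01", "Syslog errors", 2, "3 errors in last 5 min"), "/usr/local/nagios/var/rw/nagios.cmd"), where the host, service, and path are placeholders. The host and service must already be defined in the Nagios configuration for the result to be accepted.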




One Nagios Instance




Many Nagios Instances




How do I implement many Nagios instances?

Use a web based configuration tool
– Monarch (sourceforge.net/projects/monarch)

Enable configuration data transfer between instances
– SSH

Enable check result data transfer between instances
– NSCA (Nagios Service Check Acceptor – www.nagios.org/downloads), or Bronx

Optimize each Nagios instance for its purpose
– Turn off active checking on parent
– Set command_check_interval=-1 on parent
– Turn off performance data, event handler, and notification processing on children

Alternative approach
– DNX? ‘beta’, but significant maintenance advantages
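
As a rough illustration of "optimize each instance for its purpose", the sketch below renders nagios.cfg override lines for a parent and a child role. Only command_check_interval=-1 is taken directly from the slide. The other directive names are standard nagios.cfg options chosen here to match the bullets, and the accept_passive_* lines are an added assumption that a passive parent needs them. The dictionary and function names are hypothetical.

```python
# Parent: receives passive results only (assumed mapping of the bullets above).
PARENT_OVERRIDES = {
    "execute_service_checks": 0,         # no active service checks
    "execute_host_checks": 0,            # no active host checks
    "accept_passive_service_checks": 1,  # assumption: needed to take child results
    "accept_passive_host_checks": 1,     # assumption: needed to take child results
    "command_check_interval": -1,        # check the command file as often as possible
}

# Children: run the checks, leave reporting work to the parent.
CHILD_OVERRIDES = {
    "process_performance_data": 0,
    "enable_event_handlers": 0,
    "enable_notifications": 0,
}


def render(overrides):
    """Render overrides as 'name=value' lines in nagios.cfg syntax."""
    return "".join("%s=%s\n" % (name, value)
                   for name, value in sorted(overrides.items()))


print(render(PARENT_OVERRIDES))
print(render(CHILD_OVERRIDES))
```

Keeping the role differences in one place like this is one way to avoid hand-editing each instance; a tool such as Monarch serves the same purpose in the setup the slides describe.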




Typical scaling limits in Nagios*

Typical mix of Hosts/Services    Active service checks/min**    Passive service checks/min**
700/7,000                        770                            -
2,700/27,000                     -                              2,970
8,000/80,000***                  -                              10,000***

*With dual 3GHz Xeon, 4GB RAM, 10k RPM disk, RHEL4 ES 32-bit OS.

**Based on a Service being checked once every 10 minutes, and 1% of Services and Hosts being in transition between OK and non-OK states. Retry interval for non-OK states is 1 minute.

***Using Bronx Event Broker and assumptions listed for note (**)
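
The checks-per-minute figures can be reproduced from the assumptions in note (**). The sketch below uses one plausible reading of that note, counting services only: every service is checked once per interval, and the share in transition adds one retry per retry interval. The function name and parameter names are hypothetical.

```python
def checks_per_minute(services, interval_min=10, retry_min=1, transition_pct=1):
    """Estimate service checks per minute: regular schedule plus retries."""
    regular = services / interval_min
    retries = (services * transition_pct / 100) / retry_min
    return round(regular + retries)


for services in (7000, 27000, 80000):
    print(services, checks_per_minute(services))
# 7000 770
# 27000 2970
# 80000 8800
```

The first two results match the 770 and 2,970 in the table. For 80,000 services the same assumptions give 8,800 checks a minute, which is within the 10,000 listed for the Bronx row.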




Heterogeneous environments

Mix of Operating Systems, Network security zones, Applications, and Administrators!

Approaches to the problem:
– Same agent type on every system
    Consistent
    Limited coverage
– Mix of methods
    Flexible
    More difficult to maintain
    Must normalize data




Methods

UNIX
– SNMP / SNMP traps
– SSH with plugins (www.nagios.org/downloads)
– NRPE with plugins (www.nagios.org/downloads)
– Cron with plugins
– Port-based checks
– Syslog (aka traps)

Windows (http://www.crn.com/software/206801053)
– SNMP / SNMP traps
– NRPE_NT with plugins (www.nagiosexchange.org/Windows_NRPE.66.0.html?&tx_netnagext_pi1[p_view]=235)
– WMI (with proxy)
– NT_Scheduler with plugins
– Port-based checks
– Event logs (aka traps) (www.intersectalliance.com/projects/SnareWindows/ or www.steveshipway.org/software/f_nagios.html)

Network
– SNMP / SNMP traps
– Syslog
– Port-based checks

Special devices
– SNMP / SNMP traps
– Syslog
– Port-based checks
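
Port-based checks appear in every column above because they need no agent on the target. A minimal sketch of such a check follows, using the Nagios plugin convention of a return code (0 for OK, 2 for CRITICAL) plus one line of output. It is an illustration, not one of the standard plugins; the function name and the output wording are hypothetical.

```python
import socket

OK, CRITICAL = 0, 2  # Nagios plugin return codes


def check_tcp_port(host, port, timeout=5.0):
    """Return (return_code, one-line plugin output) for a TCP connect test."""
    try:
        # A successful connect is enough; the socket is closed immediately.
        with socket.create_connection((host, port), timeout=timeout):
            return OK, "TCP OK - %s:%d accepts connections" % (host, port)
    except OSError as exc:  # refused, timed out, unresolvable host, ...
        return CRITICAL, "TCP CRITICAL - %s:%d: %s" % (host, port, exc)
```

A plugin wrapper would print the text and exit with the return code, so the same function works whether it is run by Nagios directly, through NRPE or SSH, or from cron as a passive check.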




GroundWork Open Source, Inc.
139 Townsend Street, Suite 100
San Francisco, CA 94107

phone: (415) 992-4500

www.groundworkopensource.com

[email protected]
