Scaling Nagios® to monitor large heterogeneous environments

Dave Blunt
February 21, 2008
Seattle Area System Administrators Guild

© 2008 GroundWork Open Source, Inc.

February 2008

SASAG – Scaling Nagios ®

What is Nagios?
– “An Open Source host, service and network monitoring program.”
– Started as NetSaint in 1999 and became Nagios in 2002.
– www.nagios.org
– Availability and performance monitoring: is it up, is it down? How much load/memory/disk is in use?


What is Nagios?

[Diagram: Nagios parent PID; Nagios child PIDs; Nagios configuration files; Nagios status log; plugins; notification; event handler; CGIs]


What can Nagios suffer from?
– Configuration file maintenance issues
– CPU and disk I/O bottlenecks
– Blocking host checks
– File based performance bottlenecks


What can Nagios suffer from? Configuration file maintenance issues
– Use a web based configuration tool
    – Monarch (sourceforge.net/projects/monarch)
    – Fruity (sourceforge.net/projects/fruity)
– Managing several Nagios instances from one configuration tool facilitates monitoring across multiple Windows domains, SNMP communities, and other security zones.

[Diagram: Configuration; CGI; Nagios instances 1, 2, 3, ... n, each with its own Nagios configuration files]


What can Nagios suffer from? CPU and disk I/O bottlenecks
– Optimize Nagios
    – nagios.sourceforge.net/docs/2_0/tuning.html
– Use a database to store config and status information
    – NDOUtils (www.nagios.org/downloads)
    – Foundation (sourceforge.net/projects/gwfoundation)
– Placing the database on a separate server greatly improves performance, and both NDOUtils and Foundation support it.

[Diagram: Nagios parent PID; Nagios child PIDs; plugins; notification; event handler; CGIs; Configuration; Status and Events]


What can Nagios suffer from? Blocking host checks
– Passive host updates
    – Fping (fping.sourceforge.net)
– Huge increase in host check capacity (8,000+ checks a minute) if pings are parallelized.
– Downside of passive host updates is the possibility of some extra service alarms.
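
A feeder of this kind could be sketched roughly as follows. This is a minimal illustration, not the feeder used in the talk: it assumes fping's default "host is alive" / "host is unreachable" output lines and turns them into Nagios PROCESS_HOST_CHECK_RESULT external commands. The function name and the plugin output text are hypothetical, and writing the resulting lines to the Nagios command file is left to the caller.

```python
import time

# Nagios host state codes accepted by PROCESS_HOST_CHECK_RESULT.
HOST_UP, HOST_DOWN = 0, 1


def fping_to_host_results(fping_output, now=None):
    """Convert default-format fping lines into Nagios external commands.

    Hypothetical helper: the name and the output text are illustrative.
    """
    timestamp = int(time.time() if now is None else now)
    commands = []
    for line in fping_output.splitlines():
        host, sep, status = line.strip().partition(" is ")
        if not sep:
            continue  # skip blank lines and anything that is not "<host> is <status>"
        state = HOST_UP if status == "alive" else HOST_DOWN
        commands.append(
            f"[{timestamp}] PROCESS_HOST_CHECK_RESULT;{host};{state};"
            f"fping: {host} is {status}"
        )
    return commands


if __name__ == "__main__":
    sample = "hostA is alive\nhostB is unreachable\n"
    for command in fping_to_host_results(sample):
        print(command)
```

Because one fping run covers many hosts at once, a single pass yields a whole batch of passive host results, and Nagios must be configured to accept passive host checks for them to take effect.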

[Diagram: Nagios parent PID; Nagios child PIDs; plugins; notification; event handler; CGIs; Configuration; Status and Events; Fping Feeder; Host A, Host B, ... Host n]


What can Nagios suffer from? File based performance bottlenecks
– Remove the Nagios pipe file bottleneck with Event Brokers
    – Bronx (archive.groundworkopensource.com/groundwork-opensource/trunk/bronx/)
        – Feed data into Bronx as a replacement for NSCA, and also have Bronx send data to Foundation
    – DNX (dnx.sourceforge.net)
        – Specifically tied to distributed monitoring.


Typical scaling limits in Nagios*

Typical mix of Hosts/Services   Active service checks/min**   Passive service checks/min**
700/7,000                       770                           -

*With dual 3GHz Xeon, 4GB RAM, 10k RPM disk, RHEL4 ES 32-bit OS.

**Based on a Service being checked once every 10 minutes, and 1% of Services and Hosts being in transition between OK and non-OK states. Retry interval for non-OK states is 1 minute.
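
The 770 checks/min figure can be reproduced from the footnoted assumptions. The sketch below is one reading of that footnote: it assumes the total is the regular check rate plus the retry rate for the 1% of services in transition. The function and parameter names are illustrative and do not come from the slides.

```python
def checks_per_minute(services, check_interval_min=10,
                      transition_percent=1, retry_interval_min=1):
    """Estimate the service check rate per minute under the footnoted assumptions."""
    # Every service is checked once per regular interval.
    regular = services / check_interval_min
    # Services in transition are additionally re-checked at the retry interval.
    retries = (services * transition_percent / 100) / retry_interval_min
    return regular + retries


print(checks_per_minute(7_000))  # 770.0
```

For 7,000 services this gives 700 regular checks plus 70 retries per minute.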


So, how could you scale up?


How can I drive my primary Nagios with Passive service checks?
– Additional Nagios instances that forward their data to Bronx or NSCA, or set up DNX
    – At some point you end up having too much monitoring infrastructure!
– Passive agents, e.g. GroundWork Distributed Monitoring Agent, NT_Scheduler, Cron
    – Monarch supports creation of configuration files for passive agents
– Different tier one monitoring tools, e.g. syslog, SNMP traps, Ganglia, Cacti
    – Feed data from these up to your primary Nagios server by installing the right agent on that server, processing the results, and then submitting them to Nagios.
    – Syslog-ng (www.balabit.com/network-security/syslog-ng/)
    – Snmptt (www.snmptt.org)
    – Ganglia (ganglia.sourceforge.net)
    – Cacti (www.cacti.net)
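
The final "submit to Nagios" step might look something like the sketch below when the agent runs on the Nagios server itself. It formats a PROCESS_SERVICE_CHECK_RESULT external command and writes it to the Nagios command file. The helper names and the example host and service are hypothetical, and the command file path is site-specific, so it is passed in rather than assumed.

```python
import time


def passive_service_result(host, service, return_code, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0 (OK), 1 (WARNING), "
                         "2 (CRITICAL) or 3 (UNKNOWN)")
    timestamp = int(time.time() if now is None else now)
    # A newline would end the command early, so flatten the output to one line.
    flat_output = output.replace("\n", " ")
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{flat_output}")


def submit(command, command_file):
    """Write one command to the Nagios external command file (path is site-specific)."""
    with open(command_file, "w") as pipe:
        pipe.write(command + "\n")


if __name__ == "__main__":
    # Hypothetical event derived from a syslog message.
    print(passive_service_result("web01", "syslog_errors", 2, "disk failure on sda"))
```

The service named in the command must already be defined in the Nagios configuration, with passive checks accepted, or the result is discarded.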


One Nagios Instance


Many Nagios Instances


How do I implement many Nagios instances?
– Use a web based configuration tool
    – Monarch (sourceforge.net/projects/monarch)
– Enable configuration data transfer between instances
    – SSH
– Enable check result data transfer between instances
    – NSCA (Nagios Service Check Acceptor – www.nagios.org/downloads), or Bronx
– Optimize each Nagios instance for its purpose
    – Turn off active checking on parent
    – Set command_check_interval=-1 on parent
    – Turn off performance data, event handler, and notification processing on children
– Alternative approach
    – DNX? ‘beta’, but significant maintenance advantages
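
The per-role tuning above could be captured as a small set of nagios.cfg overrides. The sketch below renders them as text. The directive names are standard Nagios main-configuration options, but the choice of which ones to set for each role is this sketch's reading of the slide, and the dictionary and function names are hypothetical.

```python
# Main-config (nagios.cfg) overrides per role. The grouping is an assumption
# based on the bullets above, not a configuration taken from the talk.
PARENT_OVERRIDES = {
    "execute_service_checks": 0,         # turn off active checking on the parent
    "execute_host_checks": 0,
    "accept_passive_service_checks": 1,  # results arrive from the children
    "accept_passive_host_checks": 1,
    "command_check_interval": -1,        # check for external commands as often as possible
}

CHILD_OVERRIDES = {
    "process_performance_data": 0,  # leave performance data to the parent
    "enable_event_handlers": 0,
    "enable_notifications": 0,
}


def render_overrides(overrides):
    """Render directives as 'name=value' lines in nagios.cfg syntax."""
    return "".join(f"{name}={value}\n" for name, value in overrides.items())


if __name__ == "__main__":
    print("# parent\n" + render_overrides(PARENT_OVERRIDES))
    print("# child\n" + render_overrides(CHILD_OVERRIDES))
```

Keeping the overrides as data makes it straightforward to generate the parent and child configurations from one source and push them out over SSH.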


Typical scaling limits in Nagios*

Typical mix of Hosts/Services   Active service checks/min**   Passive service checks/min**
700/7,000                       770                           -
2,700/27,000                    -                             2,970
8,000/80,000***                 -                             10,000***

*With dual 3GHz Xeon, 4GB RAM, 10k RPM disk, RHEL4 ES 32-bit OS.

**Based on a Service being checked once every 10 minutes, and 1% of Services and Hosts being in transition between OK and non-OK states. Retry interval for non-OK states is 1 minute.

***Using Bronx Event Broker and assumptions listed for note (**)


Heterogeneous environments
– Mix of Operating Systems, Network security zones, Applications, and Administrators!
– Approaches to the problem:
    – Same agent type on every system
        – Consistent
        – Limited coverage
    – Mix of methods
        – Flexible
        – More difficult to maintain
        – Must normalize data
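
To illustrate what "must normalize data" can mean in practice, the sketch below maps results from two different methods onto one common record that uses the Nagios state codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The record type, the function names, and the syslog severity thresholds are all assumptions made for illustration.

```python
from dataclasses import dataclass

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


@dataclass(frozen=True)
class CheckResult:
    """Common record that every collection method is normalized into."""
    host: str
    service: str
    state: int
    output: str


def from_plugin(host, service, exit_code, output):
    """Plugin exit codes (NRPE, SSH, cron) already follow the Nagios convention."""
    state = exit_code if exit_code in (OK, WARNING, CRITICAL) else UNKNOWN
    return CheckResult(host, service, state, output)


def from_syslog(host, severity, message, service="syslog"):
    """Map a syslog severity (0=emerg .. 7=debug) onto a Nagios state.

    The thresholds are illustrative, not a standard mapping.
    """
    if severity <= 3:       # emerg, alert, crit, err
        state = CRITICAL
    elif severity == 4:     # warning
        state = WARNING
    else:                   # notice, info, debug
        state = OK
    return CheckResult(host, service, state, message)


if __name__ == "__main__":
    print(from_plugin("db01", "load", 1, "load average: 6.2"))
    print(from_syslog("sw-core", 3, "fan failure"))
```

Once every method produces the same record, the downstream submission and alerting logic only has to be written once.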


Methods
– UNIX
    – SNMP / SNMP traps
    – SSH with plugins (www.nagios.org/downloads)
    – NRPE with plugins (www.nagios.org/downloads)
    – Cron with plugins
    – Port-based checks
    – Syslog (aka traps)
– Windows (http://www.crn.com/software/206801053)
    – SNMP / SNMP traps
    – NRPE_NT with plugins (www.nagiosexchange.org/Windows_NRPE.66.0.html?&tx_netnagext_pi1[p_view]=235)
    – WMI (with proxy)
    – NT_Scheduler with plugins
    – Port-based checks
    – Event logs (aka traps) (www.intersectalliance.com/projects/SnareWindows/ or www.steveshipway.org/software/f_nagios.html)
– Network
    – SNMP / SNMP traps
    – Syslog
    – Port-based checks
– Special devices
    – SNMP / SNMP traps
    – Syslog
    – Port-based checks


GroundWork Open Source, Inc.
139 Townsend Street, Suite 100
San Francisco, CA 94107

phone: (415) 992-4500

www.groundworkopensource.com
