Scaling Nagios® to monitor large heterogeneous environments
Dave Blunt
February 21, 2008
Seattle Area System Administrators Guild
© 2008 GroundWork Open Source, Inc.
February 2008
SASAG – Scaling Nagios ®
What is Nagios?
“an Open Source host, service and network monitoring program.”
Started as Netsaint in 1999 and became Nagios in 2002.
www.nagios.org
Availability and performance monitoring – is it up, is it down? How much load/memory/disk is in use?
What is Nagios?
[Diagram: Nagios parent PID and Nagios child PIDs; plugins, notifications, and event handlers; Nagios configuration files; Nagios status log; CGIs]
What can Nagios suffer from?
Configuration file maintenance issues
CPU and disk I/O bottlenecks
Blocking host checks
File based performance bottlenecks
What can Nagios suffer from?
Configuration file maintenance issues
– Use a web based configuration tool
    Monarch (sourceforge.net/projects/monarch)
    Fruity (sourceforge.net/projects/fruity)
– Facilitates monitoring across multiple Windows domains, SNMP communities, and other security zones.
[Diagram: Configuration, CGI, and Nagios instances 1, 2, 3, ..., n, each with its own Nagios configuration files]
What can Nagios suffer from?
CPU and disk I/O bottlenecks
– Optimize Nagios
    nagios.sourceforge.net/docs/2_0/tuning.html
– Use a database to store config and status information
    NDOUtils (www.nagios.org/downloads)
    Foundation (sourceforge.net/projects/gwfoundation)
– Placing the database on a separate server greatly improves performance; both NDOUtils and Foundation support it.
[Diagram: Nagios parent PID and Nagios child PIDs; plugins, notifications, and event handlers; CGIs; Configuration; Status and Events]
What can Nagios suffer from?
Blocking host checks
– Passive host updates
    Fping (fping.sourceforge.net)
– Huge increase in host check capacity (8,000+ checks a minute) if pings are parallelized.
– Downside of passive host updates is the possibility of some extra service alarms.
[Diagram: Fping Feeder with Host A, Host B, ..., Host n; Nagios parent PID and Nagios child PIDs; plugins, notifications, and event handlers; CGIs; Configuration; Status and Events]
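The passive host update idea can be sketched as a small feeder that turns fping results into Nagios external commands. This is only an illustration, not the feeder used in the talk: it assumes fping's default "<target> is alive" / "<target> is unreachable" output lines, assumes each target name equals the Nagios host_name, and maps "unreachable" to host state 1 (DOWN). The function names and the command file path passed to write_commands are hypothetical.

```python
import time


def fping_to_host_commands(fping_output, now=None):
    """Convert default fping output lines into PROCESS_HOST_CHECK_RESULT commands."""
    timestamp = int(time.time() if now is None else now)
    commands = []
    for line in fping_output.splitlines():
        parts = line.split()
        # Expected shape (assumption): "<target> is alive" or "<target> is unreachable".
        if len(parts) < 3 or parts[1] != "is":
            continue
        host, state = parts[0], parts[2]
        if state == "alive":
            code = 0  # UP
        elif state == "unreachable":
            code = 1  # DOWN (assumed mapping)
        else:
            continue
        commands.append(
            f"[{timestamp}] PROCESS_HOST_CHECK_RESULT;{host};{code};fping: host is {state}"
        )
    return commands


def write_commands(command_file, commands):
    """Append commands to the Nagios external command file (path is site-specific)."""
    with open(command_file, "a") as pipe:
        for command in commands:
            pipe.write(command + "\n")
```

A feeder built this way would run fping against the whole host list in one pass and submit all results together, which is where the parallelism comes from.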
What can Nagios suffer from?
File based performance bottlenecks
– Remove Nagios pipe file bottleneck with Event Brokers
    Bronx (archive.groundworkopensource.com/groundwork-opensource/trunk/bronx/)
    – Feed data into Bronx as replacement for NSCA and also have Bronx send data to Foundation
    DNX (dnx.sourceforge.net)
    – Specifically tied to distributed monitoring.
Typical scaling limits in Nagios*

Typical mix of Hosts/Services   Active service checks/min**   Passive service checks/min**
700/7,000                       770                           -

*With dual 3GHz Xeon, 4GB RAM, 10k RPM disk, RHEL4 ES 32-bit OS.
**Based on a Service being checked once every 10 minutes, and 1% of Services and Hosts being in transition between OK and non-OK states. Retry interval for non-OK states is 1 minute.
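One reading of footnote (**) that reproduces the 770 checks/min figure for 7,000 services is: every service is checked once per 10 minutes, and on top of that 1% of services are retried once per minute. The sketch below encodes that reading; the function and parameter names are illustrative, and the formula is an interpretation of the footnote rather than something stated on the slide.

```python
def checks_per_minute(services, normal_interval_min=10,
                      retry_interval_min=1, percent_in_transition=1):
    """Estimate service checks per minute under the footnote's assumptions."""
    # Steady-state load: every service checked once per normal interval.
    steady = services / normal_interval_min
    # Extra load: services in transition are rechecked at the retry interval.
    retries = services * percent_in_transition / 100 / retry_interval_min
    return steady + retries


print(checks_per_minute(7000))  # 770.0
```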
So, how could you scale up?
How can I drive my primary Nagios with Passive service checks?
Add Nagios instances and forward their data to Bronx or NSCA, or set up DNX
– At some point you end up having too much monitoring infrastructure!
Passive agents, e.g. GroundWork Distributed Monitoring Agent, NT_Scheduler, Cron
– Monarch supports creation of configuration files for passive agents
Different tier one monitoring tools, e.g. syslog, SNMP traps, Ganglia, Cacti
– Feed data from these up to your primary Nagios server by installing the right agent on that server, processing the results, and then submitting them to Nagios.
– Syslog-ng (www.balabit.com/network-security/syslog-ng/)
– Snmptt (www.snmptt.org)
– Ganglia (ganglia.sourceforge.net)
– Cacti (www.cacti.net)
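As a rough illustration of submitting a passive service result, the sketch below formats one result in the tab-separated form that send_nsca reads on standard input (host, service description, return code, plugin output). The function name and the example host and service are hypothetical; how the line is transported (NSCA, Bronx, or something else) is left out.

```python
def nsca_service_line(host, service, return_code, output):
    """Format one passive service check result for send_nsca's standard input."""
    if return_code not in (0, 1, 2, 3):  # OK, WARNING, CRITICAL, UNKNOWN
        raise ValueError(f"invalid Nagios return code: {return_code}")
    # Tabs and newlines are field and record separators, so strip them from the text.
    clean = output.replace("\t", " ").replace("\n", " ")
    return f"{host}\t{service}\t{return_code}\t{clean}\n"


# Hypothetical example: a cron job reporting disk usage for host "web01".
line = nsca_service_line("web01", "Disk /var", 1, "WARNING - 85% used")
```

The host and service names have to match objects already defined on the primary Nagios server, otherwise the result is discarded there.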
One Nagios Instance
Many Nagios Instances
How do I implement many Nagios instances?
Use a web based configuration tool
– Monarch (sourceforge.net/projects/monarch)
Enable configuration data transfer between instances
– SSH
Enable check result data transfer between instances
– NSCA (Nagios Service Check Acceptor – www.nagios.org/downloads), or Bronx
Optimize each Nagios instance for its purpose
– Turn off active checking on parent
– Set command_check_interval=-1 on parent
– Turn off performance data, event handler, and notification processing on children
Alternative approach
– DNX? ‘beta’, but significant maintenance advantages
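The per-role tuning above might translate into main configuration overrides along the following lines. This is a sketch only: command_check_interval=-1 comes from the slide, while the other directive names are the standard nagios.cfg options that appear to match each bullet and should be checked against the Nagios version in use. The role names and the render_overrides helper are hypothetical.

```python
# Parent: receives passive results only, so active checking is switched off.
PARENT_OVERRIDES = {
    "execute_service_checks": "0",
    "execute_host_checks": "0",
    "command_check_interval": "-1",  # check the external command file as often as possible
}

# Children: run the checks, but leave reporting and reactions to the parent.
CHILD_OVERRIDES = {
    "process_performance_data": "0",
    "enable_event_handlers": "0",
    "enable_notifications": "0",
}


def render_overrides(role):
    """Return nagios.cfg-style 'name=value' lines for the given role."""
    table = {"parent": PARENT_OVERRIDES, "child": CHILD_OVERRIDES}
    if role not in table:
        raise ValueError(f"unknown role: {role}")
    return "".join(f"{name}={value}\n" for name, value in table[role].items())
```

Generating these lines from one place, rather than editing each instance by hand, is the same maintenance argument the slides make for a web based configuration tool.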
Typical scaling limits in Nagios*

Typical mix of Hosts/Services   Active service checks/min**   Passive service checks/min**
700/7,000                       770                           -
2,700/27,000                    -                             2,970
8,000/80,000***                 -                             10,000***

*With dual 3GHz Xeon, 4GB RAM, 10k RPM disk, RHEL4 ES 32-bit OS.
**Based on a Service being checked once every 10 minutes, and 1% of Services and Hosts being in transition between OK and non-OK states. Retry interval for non-OK states is 1 minute.
***Using Bronx Event Broker and assumptions listed for note (**)
Heterogeneous environments
Mix of Operating Systems, Network security zones, Applications, and Administrators!
Approaches to the problem:
– Same agent type on every system
    Consistent
    Limited coverage
– Mix of methods
    Flexible
    More difficult to maintain
    Must normalize data
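To make "must normalize data" concrete, here is a minimal sketch that maps results from two different sources onto the four Nagios service states. The record type, function names, and the syslog severity thresholds (0-3 critical, 4 warning, 5-7 OK) are assumptions chosen for illustration; a real deployment would pick its own mapping.

```python
from typing import NamedTuple

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


class CheckResult(NamedTuple):
    host: str
    service: str
    state: int
    message: str


def syslog_severity_to_state(severity: int) -> int:
    """Map a syslog severity (0 = emergency ... 7 = debug) to a Nagios state."""
    if not 0 <= severity <= 7:
        return UNKNOWN
    if severity <= 3:  # emergency, alert, critical, error
        return CRITICAL
    if severity == 4:  # warning
        return WARNING
    return OK          # notice, informational, debug


def from_syslog(host, severity, message, service="syslog"):
    """Normalize a syslog event into the common record."""
    return CheckResult(host, service, syslog_severity_to_state(severity), message)


def from_plugin_exit(host, service, exit_code, message):
    """Normalize a plugin run; anything outside 0-2 is treated as UNKNOWN."""
    state = exit_code if exit_code in (OK, WARNING, CRITICAL) else UNKNOWN
    return CheckResult(host, service, state, message)
```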
Methods
UNIX
– SNMP / SNMP traps
– SSH with plugins (www.nagios.org/downloads)
– NRPE with plugins (www.nagios.org/downloads)
– Cron with plugins
– Port-based checks
– Syslog (aka traps)
Windows (http://www.crn.com/software/206801053)
– SNMP / SNMP traps
– NRPE_NT with plugins (www.nagiosexchange.org/Windows_NRPE.66.0.html?&tx_netnagext_pi1[p_view]=235)
– WMI (with proxy)
– NT_Scheduler with plugins
– Port-based checks
– Event logs (aka traps) (www.intersectalliance.com/projects/SnareWindows/ or www.steveshipway.org/software/f_nagios.html)
Network
– SNMP / SNMP traps
– Syslog
– Port-based checks
Special devices
– SNMP / SNMP traps
– Syslog
– Port-based checks
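Port-based checks appear under every platform above because they need no agent on the target. A minimal sketch of one is shown here: it attempts a TCP connection and returns a Nagios-style state code (0 = OK, 2 = CRITICAL) with a one-line message. The function name and message wording are illustrative, and this is not a replacement for the standard plugins.

```python
import socket


def check_tcp_port(host, port, timeout=5.0):
    """Try to open a TCP connection; return (nagios_state, message)."""
    try:
        # create_connection resolves the name and connects within the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return 0, f"TCP OK - {host}:{port} accepted the connection"
    except OSError as exc:
        # Covers refused connections, timeouts, and name resolution failures.
        return 2, f"TCP CRITICAL - {host}:{port}: {exc}"
```

Wrapped in a small script that prints the message and exits with the state code, this behaves like any other plugin and can be run actively by Nagios or from cron as a passive check.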
GroundWork Open Source, Inc.
139 Townsend Street, Suite 100
San Francisco, CA 94107
phone: (415) 992-4500
www.groundworkopensource.com