
Monitoring LHCb with Nagios (cds.cern.ch/record/1215280/files/Poster-2009-136.pdf)


Monitoring the LHCb Experiment Computing Infrastructure with NAGIOS

Enrico Bonaccorsi, Niko Neufeld

CERN

LHCb has a large and complex infrastructure consisting of thousands of servers and embedded computers, hundreds of network devices and many common infrastructure services such as shared storage, login and time services, databases and many others. All aspects that are operationally critical are integrated into the standard Experiment Control System (ECS) based on PVSSII. [...]ture, whose monitoring is based on NAGIOS. We present the design and implementation of the fabric management based on NAGIOS. Care has been taken to complement rather than duplicate functionality available in the Experiment Control System.

1044 Servers
8421 Services

Systems under monitoring

[Figure: overview of the monitored systems, grouped as Servers, Network, Services and Embedded / Special Systems. Items shown: Storage Controllers, UPS, Maraton Boards, Network devices, CPU Load, Quattor, Network interface errors, Fiber Channel, Disk Space, Swap / Pagefile usage, CPU usage, Network interfaces, Trunk Status, Jumbo Frames, Error logs, Backups, Web services, Terminal services, Domain Controllers, Gateways, Farm, Databases, Castor reachability, WAN status.]

Architecture

Nagios is a popular open source computer system and network monitoring software application. It watches hosts and services, alerting users when things go wrong and again when they get better.

When incidents are detected, the monitoring system sends out notifications to the administrative contacts (email, instant message, SMS, etc.).

Hosts, Services, Checks & Event Handlers

The simplest setup has a central server running the Nagios daemon, which runs local check scripts that query the status of services on that host and on remote hosts.

A host is a computer on the network which runs one or more services to be checked.

A service is anything on the host that you want checked. Its state can be one of: OK, Warning, Critical or Unknown.

A check is a script run on the server whose exit status determines the state of the service: 0 (OK), 1 (Warning), 2 (Critical) or 3 (Unknown).

Event handlers are optional commands that are executed whenever a host or service state change occurs. An obvious use for event handlers (especially with ser[...]
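To make the notion of a check concrete, here is a minimal sketch of a check script in Python. It is not one of the plugins used at LHCb: the monitored quantity (free disk space), the thresholds and the function names are assumptions chosen for illustration. It follows the conventional plugin exit codes 0, 1, 2 and 3 for OK, Warning, Critical and Unknown.

```python
#!/usr/bin/env python3
"""Illustrative Nagios-style check: free disk space on one mount point."""
import shutil
import sys

# Conventional plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(free_percent, warn=20.0, crit=10.0):
    """Map a free-space percentage to a state (thresholds are hypothetical)."""
    if free_percent < crit:
        return CRITICAL
    if free_percent < warn:
        return WARNING
    return OK


def check_disk(path="/"):
    """Return (state, message) for the filesystem that holds `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot inspect {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    free_percent = 100.0 * usage.free / usage.total
    state = classify(free_percent)
    return state, f"DISK {LABELS[state]} - {free_percent:.1f}% free on {path}"


if __name__ == "__main__":
    state, message = check_disk(sys.argv[1] if len(sys.argv) > 1 else "/")
    print(message)   # one line of human-readable status text
    sys.exit(state)  # the exit status is what determines the service state
```

The decision logic lives in `classify` so that it can be exercised without touching a real filesystem; only the last two lines deal with the plugin contract itself.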

[Figure: architecture diagram. Labels: Nagios; Checks; Oracle Databases; Farm Servers; Control PCs; Notification (SMS / Mail).]

NAGIOS runs on a server as a daemon and periodically runs plugins to determine the status of hosts and services. Its main tasks are scheduling plugin executions, calculating dependencies and alert[...]

Due to the modularity of the software, every action is del[...] and binaries for plugins).

[...] software can be programmed to execute, when a service or host status change occurs, any kind of external [...]tions by mail, instant messaging, SMS or by a phone call using Voice over IP and a PBX like Asterisk.

[Figure: LHCb Online Network Topology (SX85). Labels include the switches sw-sx-01, sw-ux-01, sw-mgmt-sx-01, sw-d2c05-m1, sw-storecluster-c1, sw-storecluster-d1, sw-sx-s1b04-02, sw-sx-s1b04-03, sw-sx-s1d01-01, sw-sx-s1d02-01, sw-sx-s1d02-02, sw-2870-01 and sw-2889-R008-01; GW02 and netctrl01; the SX and UX areas and the control room; the NETWORK-MGMT (net-mgmt) links; and the connections to CERN GPN, CERN LCG and CERN TN. Port labels: Gi 4/3, Gi 4/6 (stream), Gi 4/23, Gi 4/34, Gi 4/36 (ECS), Gi 4/45, Gi 4/46 (online wifi), Gi 4/47 (gas control), Po 2, Po 11 (Gi 4/43,44: LHCB_DB), Te 0/0, Te 0/1, Te 0/2, Te 0/3, Mgmt_eth0,1, VLAN extension.]

Network Devices

Every network device in LHCb implements the Simple Network Management Protocol (SNMP), allowing us to monitor not only whether a certain device is up but also CPU utilisation, port and trunk status, uplinks and event logs. WAN links are also monitored, and additionally we check the ability to process jumbo frames on the network devices that provide the network infrastructure from the TELL1 boards to the High Level Trigger (HLT).
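The poster does not say how the jumbo frame check is implemented. One common approach, sketched below purely as an assumption, is to send an ICMP echo that fills a full jumbo frame with fragmentation prohibited, using the Linux iputils `ping` options `-M do` and `-s`. The 9000 byte MTU, the function names and the choice of ping are all hypothetical and not taken from the LHCb setup.

```python
import subprocess

IP_HEADER = 20   # bytes, IPv4 header without options
ICMP_HEADER = 8  # bytes, ICMP echo header


def jumbo_payload(mtu=9000):
    """Largest ICMP payload that fits in one unfragmented frame of size `mtu`."""
    return mtu - IP_HEADER - ICMP_HEADER


def ping_command(host, mtu=9000, count=1):
    """Build a Linux iputils ping command that forbids fragmentation."""
    return ["ping", "-c", str(count), "-M", "do",
            "-s", str(jumbo_payload(mtu)), host]


def check_jumbo(host, mtu=9000, timeout=5):
    """Return (state, message) using the 0/2/3 plugin exit code convention."""
    try:
        result = subprocess.run(ping_command(host, mtu),
                                capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return 3, f"JUMBO UNKNOWN - could not run ping: {exc}"
    if result.returncode == 0:
        return 0, f"JUMBO OK - {host} passes {mtu} byte frames"
    return 2, f"JUMBO CRITICAL - {host} dropped a {mtu} byte frame"
```

With `-M do` the packet is sent with the don't-fragment bit set, so a device on the path that cannot forward the full frame causes the ping to fail instead of silently fragmenting.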

Web Interface & NAGVIS

The web interface is reachable worldwide through a reverse proxy at the following URL: https://lbnagios.cern.ch. A complementary interface that displays the LHCb network map has been implemented using NagVis. This map is displayed in the control room, giving an instant overview of the infrastructure's status.

Servers

By combining local and remote execution of plugins, our system monitors CPU usage, load average, local disks, the LDAP servers, the status of the NFS mounts, the SSH daemon, uptime and the Quattor daemons for the Linux-based servers, and all [...] servers. In case of failure the system will try to restart the services automatically.
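Automatic restarts of this kind are typically done with an event handler. The sketch below is an assumption about how such a handler could look, not the one used at LHCb: it expects Nagios to pass the standard macros `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$SERVICEATTEMPT$` as arguments, and the restart command is a hypothetical placeholder.

```python
#!/usr/bin/env python3
"""Illustrative event handler: restart a service once it is clearly down."""
import subprocess
import sys

# Hypothetical restart command; replace with whatever suits the service.
RESTART_COMMAND = ["/sbin/service", "sshd", "restart"]


def should_restart(state, state_type, attempt, soft_attempt=3):
    """Decide whether to act on this state change.

    Restart on the `soft_attempt`-th SOFT critical result (before anyone
    is notified) and again if the problem reaches a HARD critical state.
    """
    if state != "CRITICAL":
        return False
    if state_type == "HARD":
        return True
    return state_type == "SOFT" and attempt == soft_attempt


def main(argv):
    state, state_type, attempt = argv[1], argv[2], int(argv[3])
    if should_restart(state, state_type, attempt):
        subprocess.run(RESTART_COMMAND, check=False)
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Waiting for the third soft failure avoids restarting a service because of a single transient check failure.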

DNS and DHCP

Every 10 minutes NAGIOS queries the status of the DNS and DHCP services by simulating real requests.
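As a rough illustration of what simulating a real request can mean for DNS, the following sketch resolves a name and times the lookup. It covers DNS only; the function name, the one second warning threshold and the injectable `resolver` parameter are assumptions made for this example.

```python
import socket
import time


def check_dns(name, resolver=socket.gethostbyname, warn_seconds=1.0):
    """Resolve `name` and return (state, message).

    `resolver` is injectable so the logic can be tested without a network;
    by default the system resolver is used.
    """
    start = time.monotonic()
    try:
        address = resolver(name)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return 2, f"DNS CRITICAL - cannot resolve {name}: {exc}"
    elapsed = time.monotonic() - start
    if elapsed > warn_seconds:
        return 1, f"DNS WARNING - {name} -> {address} took {elapsed:.3f}s"
    return 0, f"DNS OK - {name} -> {address} in {elapsed:.3f}s"
```

A real lookup exercises the whole resolution path, which a simple test of whether the DNS daemon process is alive would not.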

Uninterruptible power supplies

For each UPS we monitor the battery charge, output current, input and output frequency, current load, status, temperature, the date and result of the last test, and input and output voltage.

Backups

LHCb uses Amanda as its backup solution. Every night NAGIOS checks the consistency of the backup.

Web Services

All web services are monitored. Our implementation checks both the status of the TCP ports and the dynamic creation of web pages.
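The two levels of web checking can be sketched as follows. This is a minimal illustration under assumptions, not the LHCb implementation: the function names are invented, and "dynamic creation" is approximated by looking for a marker string that only appears when the page was actually generated.

```python
import socket
import urllib.error
import urllib.request


def tcp_port_open(host, port, timeout=5.0):
    """First level: can a TCP connection be opened at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def page_state(status, body, marker):
    """Second level: was the page generated and does it look right?"""
    if status != 200:
        return 2  # CRITICAL: the server answered with an error
    if marker not in body:
        return 1  # WARNING: page served, expected content missing
    return 0      # OK


def check_web(url, marker, timeout=10.0):
    """Fetch `url` and classify the result with page_state."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return page_state(response.status, body, marker)
    except (urllib.error.URLError, OSError):
        return 2  # CRITICAL: unreachable or HTTP error status
```

An open port only shows that something is listening; fetching a page and inspecting its content also catches a web server that is up while the application behind it is broken.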

Databases

The experiment itself, and in particular the data acquisition part, depends on a number of Oracle databases. The listener and the database consistency are checked constantly.

Storage

Storage controllers are constantly monitored via SNMP for hardware disk failures. The storage is connected via optical [...] publish the disks using the NFS and CIFS protocols. The status of the daemons and the free space on the storage are monitored.

We deployed our NAGIOS monitoring infrastructure six months ago and have run it successfully since. The software provides us with essential insight into our network and server availability and gives us the possibility to solve problems, ideally even before anyone notices [...]

[...] critical events are forwarded to PVSSII, where they are seen by non-expert operators. We have successfully implemented a lightweight, independent system to monitor the key infrastructure of the LHCb ECS itself. This gives us an important second level of monitoring in case of serious [...]

The monitoring system keeps the system managers informed of the problems it discovers by mail or by short message to their mobile phones. When a service which has dependencies fails, NAGIOS [...] until the problem with the hierarchically highest service is solved (for example in case of problems on the main routers). This avoids a flood of mail or SMS and informs us only about the main problem. In case of failure of the main switch connected to our monitoring server, NAGIOS is able to enable a backup network interface directly connected to the CERN net[...]

[Figure labels: Routers, UPS, Network Devices]
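The idea of reporting only the hierarchically highest problem can be sketched in a few lines. This illustrates the principle rather than Nagios's internal logic: the `parents` mapping and the host names in the usage note (apart from sw-sx-01, which appears in the topology figure) are hypothetical.

```python
def root_causes(failed, parents):
    """Return the failed items that have no failed ancestor.

    `failed` is an iterable of item names that are currently down.
    `parents` maps each item to the item it depends on (or None).
    Only the returned items would trigger a notification.
    """
    failed = set(failed)
    roots = set()
    for item in failed:
        ancestor = parents.get(item)
        seen = set()
        # Walk up the hierarchy; stop early if an ancestor is also down.
        while ancestor is not None and ancestor not in seen:
            if ancestor in failed:
                break
            seen.add(ancestor)
            ancestor = parents.get(ancestor)
        else:
            # No failed ancestor found: this item is a root cause.
            roots.add(item)
    return sorted(roots)
```

For example, with `parents = {"farm-node-01": "sw-sx-01", "sw-sx-01": "main-router", "main-router": None}`, a failure of all three items yields only `["main-router"]`, so a single message is sent instead of three.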