1
CCR GRID 2010 (Catania) Daniele Gregori, Stefano Antonelli, Donato De Girolamo, Luca dell’Agnello, Andrea Ferraro, Guido Guizzunti, Pierpaolo Ricci, Felice Rosso, Vladimir Sapunenko, Riccardo Veraldi, Paolo Veronesi, Cristina Vistoli, Giulia Vita Finzi, Stefano Zani INFN CNAF Monitor and Control system The Italian National Institute of Nuclear Physics (INFN) hosts the Tier1 for LHC Experiments at CNAF located in Bologna. CNAF’s TIER1 is the main INFN computing facility in Italy. It provides computing and storage resources to LHC experiments. Due to the huge complexity of a Tier-1 center, the use of control systems is fundamental for the management and the operation of the center.At INFN-CNAF, numerous solutions have been adopted, from commercial to open source products till fully INFN-CNAF developed systems. Adopted open source solutions have been strongly adapted to specific needs; in particular, for monitoring systems like LeMon, MRTG, sFlow analyzer and for alarm system NAGIOS. A monitoring system, Red Eye, has been developed for the Farming division to allow a suitable control of the worker nodes without overloading the cpu of the server machine. Finally, a dashboard has been developed, to whom described control systems send critical alarms (sent via sms to an operator as well). The dashboard can be exploited to get an historical view of the Tier-1 and national services' state and to allow a quick web control. DASHBOARD Dashboard is a tool which summarizes and updates information about the state of all the INFN CNAF divisions (Infrastructure, Network, Farming, Storage, Grid Operation and National Services), useful for people on shift. Each division has defined critical situations related to machines or services exploiting different systems below defined. These information are sent to a MySQL database interfaced to a web page via Python and PHP scripts. Web page shows one box for each division and three colors tied to three states: green=OK orange=WARNING red=CRITICAL Time evolution of each service can be accessed from a 'search' form in the web page. For Dashboard interface, a Python based gateway middleware named “dashgw” has been developed to interact with each commercial monitoring tool able to send only emails with a user- definable format. RED EYE Red Eye is an INFN CNAF developed software useful for monitoring, accounting, analyzing, reporting, alarming and self- acting used by the Farming division. It uses a pull technology with one single server needed and up to about 1000 computers monitored. Every five minutes Red Eye monitors data and takes one of the following actions: report the value on an autorefresh web Page and if there is an allarm condition send an e-mail, send a SMS, report to INFN Tier-1 Dashboard. Red Eye analyzes syslog messages (all Worker Nodes (WN) and farm servers log all messages on a single server) taking actions in case of errors and sending a notification. For drawing plots GNUPlot is used. Clicking on the violet icon the pop up will appea showing the network status icking on the blue icon the pop up will appear showing the job submission status via Red Eye NAGIOS Clicking on the warning or critical message in the Storage, National, Network or GRID boxes the related Nagios page will appear. Nagios is the "de facto" open source industrial standard for monitor and control system. It is used by some of the INFN Tier- 1 divisions: Storage, Network and Grid Operation. It comes with default plugins and particular plugins for local needs have been developed (General Parallel File System (GPFS), Storage Resource Manager (SRM), Tivoli Storage Manager (TSM)). At INFN Tier-1, advanced feature of Nagios, called cluster service, has been exploited to control state of services defining downgrade of the service if any of the servers is unavailable but the service is still guaranteed and critical state if the service is unavailable at all. It is configured to send alarm messages (warning or critical) via mail, sms and web interface to the dashboard. Monitoring system at INFN CNAF is based on a finite state machine (FSM). It is configured in two layers: a low level layer based on customized scripts under Nagios, Lemon, MRTG, sFlow, Red Eye and a commercial software (named TAC) and an high level based on a customized dashboard. LEMON, MRTG, sFlow Lemon (LHC Era Monitoring) is a CERN developed open source software used at INFN Tier-1 for monitoring hosts or groups of hosts (clusters). It's main components are a client, a server and a web interface. To monitor a particular metric you have to write a so called sensor; an agent on each monitored node gets metric informations at regular time steps from sensors and sends them to the server, in our case Oracle based. Informations are stored as rrd (Round Robin Database) files which are exploited by the web interface for visualization of plots. Beside default sensors, at INFN Tier-1 we have developed sensors for particular requests. MRTG is a tool which uses SNMP to read the traffic counters of any SNMP-Managed network device and draw pretty pictures showing how much traffic has passed through each interface. sFlow® is an industry standard technology for monitoring high speed switched networks. It gives complete visibility into the use of networks in terms of hosts conversation or network flows. In other words it answers at question:"Who and how is using the available bandwidth of my network links?”. TAC Clicking on the warning or critical message in the infrastructure box the TAC page will appear showing all sensors state.. A set of dedicated software tools (Business Development Manager) tackles the cooling and electrical power issues of a high density datacenter as the Tier1. Such tools were implemented and customized for the Tier1 datacenter requirements imposed to provide redundant cooling and power to the hosted IT equipment. They allow to monitor all components involved and alert faults to the dashboard. The consumption data are used to optimize the efficiency and reduce costs.

CCR GRID 2010 (Catania) Daniele Gregori, Stefano Antonelli, Donato De Girolamo, Luca dell’Agnello, Andrea Ferraro, Guido Guizzunti, Pierpaolo Ricci, Felice

Embed Size (px)

Citation preview

Page 1: CCR GRID 2010 (Catania) Daniele Gregori, Stefano Antonelli, Donato De Girolamo, Luca dell’Agnello, Andrea Ferraro, Guido Guizzunti, Pierpaolo Ricci, Felice

CCR GRID 2010 (Catania) Daniele Gregori, Stefano Antonelli, Donato De Girolamo, Luca dell’Agnello, Andrea Ferraro, Guido Guizzunti, Pierpaolo Ricci,

Felice Rosso, Vladimir Sapunenko, Riccardo Veraldi, Paolo Veronesi, Cristina Vistoli, Giulia Vita Finzi, Stefano Zani

INFN CNAF Monitor and Control system

The Italian National Institute of Nuclear Physics (INFN) hosts the Tier1 for LHC Experiments at CNAF located in Bologna. CNAF’s TIER1 is the main INFN computing facility in Italy. It provides computing and storage resources to LHC experiments. Due to the huge complexity of a Tier-1 center, the use of control systems is fundamental for the management and the operation of the center.At INFN-CNAF, numerous solutions have been adopted, from commercial to open source products till fully INFN-CNAF developed systems. Adopted open source solutions have been strongly adapted to specific needs; in particular, for monitoring systems like LeMon, MRTG, sFlow analyzer and for alarm system NAGIOS. A monitoring system, Red Eye, has been developed for the Farming division to allow a suitable control of the worker nodes without overloading the cpu of the server machine. Finally, a dashboard has been developed, to whom described control systems send critical alarms (sent via sms to an operator as well). The dashboard can be exploited to get an historical view of the Tier-1 and national services' state and to allow a quick web control.

DASHBOARDDashboard is a tool which summarizes and updates information about the state of all the INFN CNAF divisions (Infrastructure, Network, Farming, Storage, Grid Operation and National Services), useful for people on shift.Each division has defined critical situations related to machines or services exploiting different systems below defined. These information are sent to a MySQL database interfaced to a web page via Python and PHP scripts.Web page shows one box for each division and three colors tied to three states: green=OK orange=WARNINGred=CRITICAL Time evolution of each service can be accessed from a 'search' form in the web page. For Dashboard interface, a Python based gateway middleware named “dashgw” has been developed to interact with each commercial monitoring tool able to send only emails with a user-definable format.

DASHBOARDDashboard is a tool which summarizes and updates information about the state of all the INFN CNAF divisions (Infrastructure, Network, Farming, Storage, Grid Operation and National Services), useful for people on shift.Each division has defined critical situations related to machines or services exploiting different systems below defined. These information are sent to a MySQL database interfaced to a web page via Python and PHP scripts.Web page shows one box for each division and three colors tied to three states: green=OK orange=WARNINGred=CRITICAL Time evolution of each service can be accessed from a 'search' form in the web page. For Dashboard interface, a Python based gateway middleware named “dashgw” has been developed to interact with each commercial monitoring tool able to send only emails with a user-definable format.

RED EYERed Eye is an INFN CNAF developed software useful for monitoring,accounting, analyzing, reporting, alarming and self-acting usedby the Farming division.It uses a pull technology with one single server needed and upto about 1000 computers monitored.Every five minutes Red Eye monitors data and takes one of thefollowing actions: report the value on an autorefresh webPage and if there is an allarm condition send an e-mail, send a SMS, report to INFN Tier-1 Dashboard.Red Eye analyzes syslog messages (all Worker Nodes (WN) and farm servers log all messages on a single server) taking actions in case of errors and sending a notification. For drawing plots GNUPlot is used.

RED EYERed Eye is an INFN CNAF developed software useful for monitoring,accounting, analyzing, reporting, alarming and self-acting usedby the Farming division.It uses a pull technology with one single server needed and upto about 1000 computers monitored.Every five minutes Red Eye monitors data and takes one of thefollowing actions: report the value on an autorefresh webPage and if there is an allarm condition send an e-mail, send a SMS, report to INFN Tier-1 Dashboard.Red Eye analyzes syslog messages (all Worker Nodes (WN) and farm servers log all messages on a single server) taking actions in case of errors and sending a notification. For drawing plots GNUPlot is used.

Clicking on the violet icon the pop up will appear showing the network status

Clicking on the violet icon the pop up will appear showing the network status

Clicking on the blue icon the pop up will appear showing the job submission status via Red Eye

Clicking on the blue icon the pop up will appear showing the job submission status via Red Eye

NAGIOSClicking on the warning or critical message in the Storage, National, Network or GRID boxes the related Nagios page will appear. Nagios is the "de facto" open source industrial standard for monitor and control system. It is used by some of the INFN Tier-1 divisions:Storage, Network and Grid Operation.It comes with default plugins and particular plugins for local needs have been developed (General Parallel File System (GPFS), Storage Resource Manager (SRM), Tivoli Storage Manager (TSM)).At INFN Tier-1, advanced feature of Nagios, called cluster service, has been exploited to control state of services defining downgrade of the service if any of the servers is unavailable but the service is still guaranteed and critical state if the service is unavailable at all.It is configured to send alarm messages (warning or critical) via mail, sms and web interface to the dashboard.

NAGIOSClicking on the warning or critical message in the Storage, National, Network or GRID boxes the related Nagios page will appear. Nagios is the "de facto" open source industrial standard for monitor and control system. It is used by some of the INFN Tier-1 divisions:Storage, Network and Grid Operation.It comes with default plugins and particular plugins for local needs have been developed (General Parallel File System (GPFS), Storage Resource Manager (SRM), Tivoli Storage Manager (TSM)).At INFN Tier-1, advanced feature of Nagios, called cluster service, has been exploited to control state of services defining downgrade of the service if any of the servers is unavailable but the service is still guaranteed and critical state if the service is unavailable at all.It is configured to send alarm messages (warning or critical) via mail, sms and web interface to the dashboard.

Monitoring system at INFN CNAF is based on a finite state machine (FSM). It is configured in two layers: a low level layer based on customized scripts under Nagios, Lemon, MRTG, sFlow, Red Eye and a commercial software (named TAC) and an high level

based on a customized dashboard.

Monitoring system at INFN CNAF is based on a finite state machine (FSM). It is configured in two layers: a low level layer based on customized scripts under Nagios, Lemon, MRTG, sFlow, Red Eye and a commercial software (named TAC) and an high level

based on a customized dashboard.

LEMON, MRTG, sFlowLemon (LHC Era Monitoring) is a CERN developed open source software used at INFN Tier-1 for monitoring hosts or groups of hosts (clusters). It's main components are a client, a server and a web interface. To monitor a particular metric you have to write a so called sensor; an agent on each monitored node gets metric informations at regular time steps from sensors and sends them to the server, in our case Oracle based. Informations are stored as rrd (Round Robin Database) files which are exploited by the web interface for visualization of plots. Beside default sensors, at INFN Tier-1 we have developed sensors for particular requests.MRTG is a tool which uses SNMP to read the traffic counters of any SNMP-Managed network device and draw pretty pictures showing how much traffic has passed through each interface.sFlow® is an industry standard technology for monitoring high speed switched networks. It gives complete visibility into the use of networks in terms of hosts conversation or network flows. In other words it answers at question:"Who and how is using the available bandwidth of my network links?”.

LEMON, MRTG, sFlowLemon (LHC Era Monitoring) is a CERN developed open source software used at INFN Tier-1 for monitoring hosts or groups of hosts (clusters). It's main components are a client, a server and a web interface. To monitor a particular metric you have to write a so called sensor; an agent on each monitored node gets metric informations at regular time steps from sensors and sends them to the server, in our case Oracle based. Informations are stored as rrd (Round Robin Database) files which are exploited by the web interface for visualization of plots. Beside default sensors, at INFN Tier-1 we have developed sensors for particular requests.MRTG is a tool which uses SNMP to read the traffic counters of any SNMP-Managed network device and draw pretty pictures showing how much traffic has passed through each interface.sFlow® is an industry standard technology for monitoring high speed switched networks. It gives complete visibility into the use of networks in terms of hosts conversation or network flows. In other words it answers at question:"Who and how is using the available bandwidth of my network links?”.

TACClicking on the warning or critical message in the infrastructure box the TAC page will appear showing all sensors state..A set of dedicated software tools (Business Development Manager) tackles the cooling and electrical power issues of a high density datacenter as the Tier1. Such tools were implemented and customized for the Tier1 datacenter requirements imposed to provide redundant cooling and power to the hosted IT equipment. They allow to monitor all components involved and alert faults to the dashboard. The consumption data are used to optimize the efficiency and reduce costs.

TACClicking on the warning or critical message in the infrastructure box the TAC page will appear showing all sensors state..A set of dedicated software tools (Business Development Manager) tackles the cooling and electrical power issues of a high density datacenter as the Tier1. Such tools were implemented and customized for the Tier1 datacenter requirements imposed to provide redundant cooling and power to the hosted IT equipment. They allow to monitor all components involved and alert faults to the dashboard. The consumption data are used to optimize the efficiency and reduce costs.