EGEE is a project funded by the European Union. CMS-CPT Week (CERN), 11 May 2004
A monitoring tool for a Grid Operation Center by EGEE-SA1
Sergio Fantinel, INFN LNL/PD
http://grid.infn.it/gridice
OUTLINE
• Architecture overview
• CMS DC04 Experience
• Next Steps
• Validation
CMS DC04 GridICE basic layout
• Low-level collection*: we use LEMON (formerly FMON) to collect host-related metrics, and we extended the standard metrics with our own (e.g. host services info). It relies on sensors on the host side and on a client/server paradigm for the collection.
• Publishing service*: on a collector node visible from the Internet (the standard choice is the LCG-SE), a service publishes the info collected by the LEMON server to an EX GRIS (running on port 2136).
• Discovery and high-level collection: on top, a service discovers new resources from the BDIIs and accordingly fires queries at the GRISes to acquire the monitoring info of those resources; the info is stored in an RDBMS for historical and analysis purposes.
*needed only to publish extended info
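The discovery and high-level collection step can be sketched roughly as follows. This is a minimal illustration, not the GridICE implementation: the LDAP queries against the BDII and the GRISes are replaced by in-memory stand-ins, and the names `discover_gris_endpoints`, `collect`, `metric_history`, the host names and the metric values are all hypothetical.

```python
import sqlite3
from typing import Callable, Dict, Iterable, List

def discover_gris_endpoints(bdii_entries: Iterable[Dict[str, str]]) -> List[str]:
    """Return the unique GRIS endpoints (host:port) found in BDII entries.

    A real service would obtain these entries with an LDAP query; here they
    are plain dictionaries (an assumption made for the sketch).
    """
    endpoints = (f"{entry['host']}:{entry['port']}" for entry in bdii_entries)
    return list(dict.fromkeys(endpoints))  # de-duplicate, keep first-seen order

def collect(endpoints: List[str],
            query_gris: Callable[[str], Dict[str, float]],
            db: sqlite3.Connection,
            timestamp: int) -> int:
    """Query every GRIS and append its metrics to a history table."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS metric_history "
        "(ts INTEGER, endpoint TEXT, name TEXT, value REAL)"
    )
    stored = 0
    for endpoint in endpoints:
        for name, value in query_gris(endpoint).items():
            db.execute(
                "INSERT INTO metric_history VALUES (?, ?, ?, ?)",
                (timestamp, endpoint, name, value),
            )
            stored += 1
    db.commit()
    return stored

# Made-up data standing in for BDII and GRIS answers.
BDII_ENTRIES = [
    {"host": "ce.example.org", "port": "2135"},
    {"host": "collector.example.org", "port": "2136"},
    {"host": "ce.example.org", "port": "2135"},  # duplicate entry
]
FAKE_GRIS_DATA = {
    "ce.example.org:2135": {"jobs_running": 12.0, "jobs_waiting": 3.0},
    "collector.example.org:2136": {"disk_free_gb": 40.5},
}

database = sqlite3.connect(":memory:")
found = discover_gris_endpoints(BDII_ENTRIES)
rows_stored = collect(found, lambda ep: FAKE_GRIS_DATA[ep], database, timestamp=0)
```

Keeping every sample in an append-only table is what makes the historical views possible; a production schema would likely be richer than this single table.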
Data Collection Framework
[Diagram. Components: cluster worker nodes (LEMON monitoring agent, sensors, /proc filesystem); cluster head node (LEMON server, monitoring archive, information providers farm, GRIS with GLUE+ schema); information index (BDII/GIIS, GLUE schema); monitoring server (central monitoring database with GridICE schema, web interface). Arrow labels: run, read, write, metric output, LDIF output, LDAP query (first discovery phase; continuous discovery and collection).]
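To illustrate the sensor side of the diagram, the sketch below parses text in the format of the Linux `/proc/meminfo` file and derives one metric from it. It is only an example of what a host sensor could do; it is not LEMON code, and the function names and sample values are assumptions.

```python
from typing import Dict

def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: numeric value}.

    Most fields are reported in kB; lines without a numeric value are skipped.
    """
    metrics: Dict[str, int] = {}
    for line in text.splitlines():
        name, separator, rest = line.partition(":")
        if not separator:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            metrics[name.strip()] = int(parts[0])
    return metrics

def memory_used_fraction(metrics: Dict[str, int]) -> float:
    """Fraction of physical memory in use, from MemTotal and MemFree."""
    total = metrics["MemTotal"]
    return (total - metrics["MemFree"]) / total

# Sample text with made-up values; on a Linux host one would instead read
# the file with open("/proc/meminfo").read().
SAMPLE = "MemTotal:  1024000 kB\nMemFree:    256000 kB\nSwapTotal: 2048000 kB\n"
sample_metrics = parse_meminfo(SAMPLE)
used = memory_used_fraction(sample_metrics)
```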
Current Deployment Layout
Info Sources & metrics
GridICE Server
EX GRIS (port 2136) (GridICE collector node)
Std. GRIS (port 2135) (CE, SE)
Basic info:
• Number of queues
• Jobs running/waiting
• Storage Areas info
Extended info:
• Disk partitions space
• Network Adapters activity
• Role based (CE, SE, RB, RLS, WN,…) user defined services (daemons, agents,…)
• More… (MEM, CPU, swap, context switches, interrupts, reg. open files, sockets, procs, INodes, host power,…)
GRIS status info:
• GRIS Service Online/Offline
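A GRIS online/offline status can be approximated with a plain TCP reachability check on the ports named above. This is a simplified sketch under that assumption: it only shows that the port accepts connections, not that the LDAP server behind it answers queries, and `gris_is_online` is a hypothetical helper, not part of GridICE.

```python
import socket

STD_GRIS_PORT = 2135  # standard GRIS (CE, SE), as listed on the slide
EX_GRIS_PORT = 2136   # EX GRIS on the GridICE collector node

def gris_is_online(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False

# Example call (host name is a placeholder):
#   gris_is_online("ce.example.org", STD_GRIS_PORT)
```

Running such a check at a fixed interval yields a status series at that interval's resolution.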
CMS DC04 experience
• 11 monitored sites from LCG-CMS/CMS merged BDIIs; 6 sites publish extended information (CE, SE, RB); 3 sites publish complete info
• Monitored resources: 42 GRISes (status with 5 min resolution), 10 RBs, 13 CEs, 8 SEs, 402 WNs (all with extended info)
• Most of the difficulties encountered came from the following:
• at the ramp-up of CMS DC04, the monitoring requirements and the environment were not well known
• high utilization of proprietary/non-Grid resources
• slow responses from people due to DC stress
CMS DC04 experience
• The GridICE team put most of its effort into the following areas during DC04:
• produced instructions (manual and LCFGng) for installing the GridICE agent on WNs at sites installed with LCG-2, which has no WN monitoring support
• produced instructions for installing the GridICE agent on any host (UI, non-Grid/LCG, …)
• support to users
• resolved a compatibility issue with LEMON preinstalled on hosts (hosts managed by IT/CERN for CMS DC04)
IT/CERN machines integration
• We were in direct contact with the CERN IT staff to ensure the compatibility of GridICE with the hosts managed by that division, which provided and managed most of the CERN hosts involved in CMS DC04:
• Export Buffers (ClassicSE, SRM, SRB)
• key machines running Agents (i.e. lxgate04.cern.ch for CMS DC04)
• Although compatibility and integration were proven, the installation never reached the production hosts, because the DC was in its final phase and the people involved lacked the time.
CMS DC04 experience: notification
• We gained experience with the GridICE notification service, a new feature introduced specifically for CMS DC04, at 3 main sites: LNL, CERN, PIC
• LNL: it helped us in many situations when services crashed (e.g. the sbatchd LSF daemon on CE and WNs, nfsd on the LCFGng server) or hosts disappeared from the GIS. Sometimes GridICE correctly reported hosts as down while the local monitoring (Ganglia) had not caught the anomaly.
• PIC: it correctly notified us of RB service restarts for maintenance performed by PIC staff.
• CERN: unavailability of RB services
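The kind of event reported above (a service crashing, a host disappearing from the information system) can be detected by comparing two consecutive status snapshots. The sketch below is an assumption about how such a check could look, not the GridICE notification code; the host names, service names and message wording are invented.

```python
from typing import Dict, List, Tuple

# (host, service) -> True if the service was seen running in that snapshot
Status = Dict[Tuple[str, str], bool]

def detect_transitions(previous: Status, current: Status) -> List[str]:
    """Return one message per service that changed state or vanished."""
    messages: List[str] = []
    for key in sorted(current):
        host, service = key
        was_up = previous.get(key)
        is_up = current[key]
        if was_up is None or was_up == is_up:
            continue  # newly discovered or unchanged: nothing to notify
        state = "UP again" if is_up else "DOWN"
        messages.append(f"{service} on {host} is {state}")
    # Entries present before but missing now: the host is no longer published.
    for host, service in sorted(set(previous) - set(current)):
        messages.append(f"{service} on {host} disappeared from the information system")
    return messages

previous_snapshot: Status = {
    ("ce01.example.org", "sbatchd"): True,
    ("lcfg.example.org", "nfsd"): True,
    ("rb01.example.org", "rb"): True,
}
current_snapshot: Status = {
    ("ce01.example.org", "sbatchd"): False,
    ("lcfg.example.org", "nfsd"): True,
}
notifications = detect_transitions(previous_snapshot, current_snapshot)
```

Notifying only on transitions, rather than on every failed check, avoids repeating the same alarm at each polling cycle.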
Next steps
• Job monitoring per VO: effective job monitoring per (VO, queue) and job statistics per user (user certificate), to produce detailed reports on resource utilization and availability.
• Notification: in the future we expect to have a flexible system in which authorized users can set up, via a GUI, the notifications they would like to receive
• Analysis: a generic interface for graph generation
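The planned per-(VO, queue) and per-user job statistics amount to grouping job records by different keys. The following sketch shows that grouping on invented data; the `Job` record, its fields and the sample values are hypothetical and do not reflect the GridICE schema.

```python
from collections import Counter
from typing import Iterable, NamedTuple

class Job(NamedTuple):
    vo: str        # virtual organization, e.g. "cms"
    queue: str     # batch queue name
    user_dn: str   # subject of the user certificate
    state: str     # "running" or "waiting"

def jobs_per_vo_queue(jobs: Iterable[Job]) -> Counter:
    """Count jobs for each (VO, queue, state) combination."""
    return Counter((job.vo, job.queue, job.state) for job in jobs)

def jobs_per_user(jobs: Iterable[Job]) -> Counter:
    """Count jobs for each user certificate subject."""
    return Counter(job.user_dn for job in jobs)

# Invented sample records.
SAMPLE_JOBS = [
    Job("cms", "long", "/C=IT/O=Example/CN=Alice", "running"),
    Job("cms", "long", "/C=IT/O=Example/CN=Alice", "waiting"),
    Job("cms", "short", "/C=ES/O=Example/CN=Bob", "running"),
    Job("cms", "long", "/C=ES/O=Example/CN=Bob", "running"),
]
by_vo_queue = jobs_per_vo_queue(SAMPLE_JOBS)
by_user = jobs_per_user(SAMPLE_JOBS)
```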
Validation/experiences: LCG-0
First large deployment in the CMS-LCG0 testbed
graph and analysis provided by: M. Maggi et al. – INFN Bari CMS group