A monitoring tool for a Grid Operation Center by EGEE-SA1 Sergio Fantinel, INFN LNL/PD

EGEE is a project funded by the European Union. CMS-CPT Week (CERN), 11 May 2004

http://grid.infn.it/gridice

OUTLINE

• Architecture overview

• CMS DC04 Experience

• Next Steps

• Validation

CMS DC04 GridICE basic layout

• Low-level collection*: we use LEMON (formerly FMON) to collect host-related metrics, and we extended the standard metrics with our own (e.g. host services info). It relies on sensors running on the hosts and on a client/server model for the collection

• Publishing service*: on a collector node visible from the Internet (the standard choice is the LCG-SE), a service publishes the info collected by the LEMON server to an EX GRIS (running on port 2136)

• Discovery and high-level collection: on top, a service discovers new resources from the BDIIs and accordingly issues queries to the GRISes to acquire the monitoring info of those resources; the info is stored in an RDBMS for historical and analysis purposes

*needed only to publish extended info
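The discovery and high-level collection layer can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not GridICE code: `discover_gris_endpoints` and `query_gris` are hypothetical stand-ins for the LDAP queries against a BDII and a GRIS, the host names and metric names are invented, and an in-memory SQLite database stands in for the RDBMS.

```python
import sqlite3
import time


def discover_gris_endpoints(bdii):
    """Hypothetical stand-in for an LDAP query against a BDII.

    Returns the (host, port) pairs of the GRISes the index knows about.
    """
    return bdii["gris_endpoints"]


def query_gris(endpoint, fake_grid):
    """Hypothetical stand-in for an LDAP query against one GRIS.

    Returns a mapping of metric name -> value for that endpoint.
    """
    return fake_grid.get(endpoint, {})


def open_archive():
    """Create the (in-memory) historical store, one row per sample."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE samples "
        "(ts REAL, host TEXT, port INTEGER, metric TEXT, value REAL)"
    )
    return db


def collect_once(db, bdii, fake_grid, now=None):
    """Run one discovery-and-collection pass; return the rows stored."""
    now = time.time() if now is None else now
    stored = 0
    for endpoint in discover_gris_endpoints(bdii):
        host, port = endpoint
        for metric, value in query_gris(endpoint, fake_grid).items():
            db.execute(
                "INSERT INTO samples (ts, host, port, metric, value) "
                "VALUES (?, ?, ?, ?, ?)",
                (now, host, port, metric, value),
            )
            stored += 1
    db.commit()
    return stored


# Invented example data: one standard GRIS and one EX GRIS.
EXAMPLE_BDII = {
    "gris_endpoints": [("ce.example.org", 2135), ("collector.example.org", 2136)]
}
EXAMPLE_GRID = {
    ("ce.example.org", 2135): {"jobs_running": 12, "jobs_waiting": 3},
    ("collector.example.org", 2136): {"disk_free_gb": 41.5},
}
```

Calling `collect_once` repeatedly with increasing timestamps builds up the history that the analysis side would later read back.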

Data Collection Framework

[Diagram; labels recovered from the slide]

• Farm: a cluster head node and cluster worker nodes (two shown)

• Cluster worker node: a LEMON monitoring agent runs sensors, which read the /proc filesystem and produce metric output

• Cluster head node: LEMON server, monitoring archive (write/read), information providers (run, LDIF output), GRIS (GLUE+ schema)

• Information index: BDII/GIIS (GLUE schema)

• Monitoring server: first discovery phase (LDAP query), continuous discovery & collection (LDAP query)

• Central monitoring database, web interface

• GridICE schema
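As an illustration of what a sensor reading the /proc filesystem does, here is a minimal sketch that parses the Linux `/proc/loadavg` line into named metrics. It is not a LEMON sensor: the metric names and the `host name value` output format are assumptions made up for this example.

```python
def parse_loadavg(text):
    """Parse the single line of /proc/loadavg into named metrics.

    The line has five fields: 1, 5 and 15 minute load averages,
    running/total scheduling entities, and the most recent PID.
    """
    load1, load5, load15, procs, last_pid = text.split()
    running, total = procs.split("/")
    return {
        "load_1min": float(load1),
        "load_5min": float(load5),
        "load_15min": float(load15),
        "procs_running": int(running),
        "procs_total": int(total),
        "last_pid": int(last_pid),
    }


def format_metric_output(host, metrics):
    """Render metrics as 'host name value' lines (format is hypothetical)."""
    return [f"{host} {name} {metrics[name]}" for name in sorted(metrics)]


def read_loadavg(path="/proc/loadavg"):
    """Read and parse the live file; only works on a Linux host."""
    with open(path) as handle:
        return parse_loadavg(handle.read())
```

Keeping the parsing separate from the file access makes the sensor logic checkable on any machine, including ones without a /proc filesystem.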

Current Deployment Layout

Info Sources & metrics

GridICE Server

EX GRIS (port 2136) (GridICE collector node)

Std. GRIS (port 2135) (CE, SE)

Basic info:

• Number of queues

• Jobs running/waiting

• Storage Areas info

Extended info:

• Disk partitions space

• Network Adapters activity

• Role based (CE, SE, RB, RLS, WN,…) user defined services (daemons, agents,…)

• More… (MEM, CPU, swap, context switches, interrupts, reg. open files, sockets, procs, INodes, host power,…)

GRIS status info:

• GRIS Service Online/Offline
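A GRIS Online/Offline status could be approximated as below. This is a sketch under a simplifying assumption: it only tests whether the TCP port accepts a connection, whereas a real check would presumably issue an LDAP query. The function name and the timeout default are hypothetical; the two port numbers are the ones given on the slide.

```python
import socket

STD_GRIS_PORT = 2135  # standard GRIS (CE, SE), as given on the slide
EX_GRIS_PORT = 2136   # EX GRIS on the GridICE collector node, as given on the slide


def gris_status(host, port, timeout=3.0):
    """Return 'Online' if a TCP connection to host:port succeeds, else 'Offline'.

    TCP reachability is only a proxy for service health (an assumption of
    this sketch): a process can accept connections and still answer badly.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "Online"
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return "Offline"
```

For example, `gris_status("ce.example.org", STD_GRIS_PORT)` would probe a standard GRIS; the host name here is invented.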

CMS DC04 experience

• 11 monitored sites from the merged LCG-CMS/CMS BDIIs; 6 sites publish extended information (CE, SE, RB); 3 sites publish complete info

• 42 GRISes (status with 5 min resolution), 10 RBs, 13 CEs, 8 SEs, 402 WNs (all extended info)

• Most of the difficulties encountered came from the following:

• at the ramp-up of CMS DC04, the monitoring requirements and the environment were not well known

• heavy use of proprietary/non-grid resources

• slow response from people, due to the pressure of the DC

CMS DC04 experience

• The GridICE team put most of its effort during DC04 into the following areas:

• produced instructions for installing the GridICE agent on WNs at sites installed with LCG-2, which has no WN monitoring support (manual & LCFGng)

• produced instructions for installing the GridICE agent on any host (UI, non-Grid/LCG, …)

• user support

• resolved a compatibility issue with LEMON preinstalled on hosts (hosts managed by IT/CERN for CMS DC04)

IT/CERN machines integration

• We were in direct contact with the CERN IT people to ensure the compatibility of GridICE with the hosts managed by that division: they provided and managed most of the CERN hosts involved in CMS DC04

• Export Buffers (ClassicSE, SRM, SRB)

• key machines running agents (i.e. lxgate04.cern.ch for CMS DC04)

• Although compatibility and integration were proven, the installation never reached the production hosts, because the DC was in its final phase and the people involved lacked time.

CMS DC04 experience: notification

• We gained experience with the GridICE notification service, a new feature introduced just for CMS DC04, at 3 main sites: LNL, CERN, PIC

• LNL: it helped us in many situations when services crashed (e.g. the sbatchd LSF daemon on CE & WNs, nfsd on the LCFGng server) or hosts disappeared from the GIS. Sometimes GridICE correctly reported hosts as down while the local monitoring (Ganglia) had not caught the anomaly.

• PIC: it correctly notified us of RB service restarts performed by PIC people for maintenance.

• CERN: RB services unavailability
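The kind of check behind such notifications can be sketched as a comparison between two consecutive monitoring snapshots. This is an illustrative assumption about how state changes might be detected, not the GridICE notification service itself; the snapshot layout (`{host: {service: is_up}}`) and the message wording are invented.

```python
def detect_events(previous, current):
    """Compare two snapshots {host: {service: is_up}} and list the changes.

    Reports hosts that vanished between polls and services whose up/down
    state flipped. A service missing from the current snapshot counts as down.
    """
    events = []
    for host in sorted(previous):
        if host not in current:
            events.append(f"{host}: host disappeared")
            continue
        for service in sorted(previous[host]):
            was_up = previous[host][service]
            is_up = current[host].get(service, False)
            if was_up and not is_up:
                events.append(f"{host}: {service} went down")
            elif not was_up and is_up:
                events.append(f"{host}: {service} came back up")
    return events
```

Reporting only transitions, rather than every poll where a service is down, keeps a long outage from producing one message per polling interval.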

Next steps

• Job monitoring per VO: effective (VO, queue) job monitoring and per-user (user certificate) job statistics, so as to produce detailed reports of resource utilization and resource availability.

• Notification: in the future we expect to have a flexible system in which authorized users can set up, via a GUI, the notifications they would like to receive

• Analysis: a generic interface for graph generation
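The planned per-VO and per-user job statistics amount to grouping job records by a few keys. A minimal sketch, assuming each job record is a dictionary with hypothetical `vo`, `queue`, `user_dn` and `state` fields (none of these names come from GridICE):

```python
from collections import Counter


def summarize_jobs(jobs):
    """Count jobs per (VO, queue, state) and per (user certificate DN, state)."""
    by_vo_queue = Counter()
    by_user = Counter()
    for job in jobs:
        by_vo_queue[(job["vo"], job["queue"], job["state"])] += 1
        by_user[(job["user_dn"], job["state"])] += 1
    return by_vo_queue, by_user


# Invented example records.
EXAMPLE_JOBS = [
    {"vo": "cms", "queue": "long", "user_dn": "/CN=alice", "state": "running"},
    {"vo": "cms", "queue": "long", "user_dn": "/CN=alice", "state": "waiting"},
    {"vo": "cms", "queue": "long", "user_dn": "/CN=bob", "state": "running"},
    {"vo": "atlas", "queue": "short", "user_dn": "/CN=carol", "state": "running"},
]
```

`Counter` returns zero for keys it has never seen, so a report can ask about any (VO, queue, state) combination without special-casing empty queues.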

Validation/experiences: LCG-0

First large deployment in the CMS-LCG0 testbed

graph and analysis provided by: M. Maggi et al. – INFN Bari CMS group