Performance Monitoring. Gidon Moont, e-Science, HEP, Imperial College London. Talk to JRA1


Performance Monitoring

Gidon Moont

e-Science, HEP, Imperial College London

Talk to JRA1 All-Hands Meeting @ CERN

24 March 2006

Introduction

• How we gather data.
• How we release the information.
– Real Time Monitor
– LCG Load Monitor
– Daily Reports
– XML files and ROOT analysis

• Interesting metrics


How we gather data

• The data comes from direct queries of the MySQL databases of the Resource Brokers (RBs).

• Around 30 Resource Brokers currently monitored.
• Queries once a minute.
– find all jobs that had an event in the last minute
– retrieve status and CE/WN information
– write a complete (XML) description of all jobs
– remove jobs that have finished status after 2 hours (or if Cleared)
– As a job is removed, query all events and write a summary file

• Multithreaded (one thread per RB) Java program.
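The polling cycle above could be sketched roughly as follows. This is a minimal Python illustration and not the monitor's actual Java code: the `JobTracker` class, the `fetch_events` callable standing in for the database query, and the set of "finished" status names are all assumptions made for the example.

```python
import threading
import time

FINISHED = {"Done", "Aborted", "Cancelled"}  # assumed names for finished statuses
RETENTION_SECONDS = 2 * 60 * 60              # finished jobs are kept for 2 hours


class JobTracker:
    """Tracks the jobs seen on one Resource Broker (hypothetical helper)."""

    def __init__(self):
        # job id -> {"status": str, "finished_at": float or None}
        self.jobs = {}

    def update(self, events, now):
        """Apply the (job_id, status) events returned by one polling query."""
        for job_id, status in events:
            job = self.jobs.setdefault(job_id, {"status": None, "finished_at": None})
            job["status"] = status
            if status in FINISHED and job["finished_at"] is None:
                job["finished_at"] = now

    def expire(self, now):
        """Remove Cleared jobs and finished jobs older than the retention time."""
        removed = []
        for job_id, job in list(self.jobs.items()):
            cleared = job["status"] == "Cleared"
            stale = (job["finished_at"] is not None
                     and now - job["finished_at"] >= RETENTION_SECONDS)
            if cleared or stale:
                removed.append(job_id)  # a summary file would be written at this point
                del self.jobs[job_id]
        return removed


def poll_broker(tracker, fetch_events, stop, interval=60.0):
    """Poll one broker until `stop` is set; `fetch_events` stands in for the DB query."""
    while not stop.is_set():
        now = time.time()
        tracker.update(fetch_events(), now)
        tracker.expire(now)
        stop.wait(interval)  # sleep for one interval, or wake early on shutdown


def start_monitor(brokers, fetch_factory, stop):
    """Start one polling thread per broker (one thread per RB)."""
    threads = []
    for host in brokers:
        thread = threading.Thread(
            target=poll_broker,
            args=(JobTracker(), fetch_factory(host), stop),
            daemon=True,
        )
        thread.start()
        threads.append(thread)
    return threads
```

Keeping the expiry logic in `JobTracker` separate from the threading makes it possible to exercise the 2-hour rule with fixed timestamps, without a database or a running thread.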


Current RB List

gdrb01.cern.ch
gdrb02.cern.ch
gdrb03.cern.ch
gdrb04.cern.ch
gdrb06.cern.ch
gdrb07.cern.ch
gdrb08.cern.ch
gdrb09.cern.ch
gdrb10.cern.ch
gdrb11.cern.ch
lcgrb01.gridpp.rl.ac.uk
gfe01.hep.ph.ic.ac.uk
egee-rb-01.cnaf.infn.it
egee-rb-02.cnaf.infn.it
egee-rb-03.cnaf.infn.it
gridit-rb-01.cnaf.infn.it
a01-004-127.gridka.de
grid-rb0.desy.de
grid-rb2.desy.de
lcg00124.grid.sinica.edu.tw
rb01.pic.es
rb-egee.bifi.unizar.es
grid09.lal.in2p3.fr
node04.datagrid.cea.fr
mu3.matrix.sara.nl
rb.isabella.grnet.gr
rb101.grid.ucy.ac.cy
grid151.kfki.hu
lcg16.sinp.msu.ru
rb.phy.bg.ac.yu
ui.ulakbim.gov.tr


Real Time Monitor

• The Real Time Monitor has developed from a demo to show real-time usage of the LCG

• Further development will include sortable tables of RB/CE info

• Java applet - does not require extra libraries


LCG Load Monitor

• Requested as a tool to monitor London Tier 2

• Java Application
• Can monitor RBs, CEs, and groups of CEs (e.g. a T2)
• Jobs colour coded by VO (stacked)
• Sortable table of all current jobs


Daily Reports

• PDF documents created automatically at 3am
• Provide counts and metrics for all jobs that left the RTM in a 24-hour period
• Analysis split by
– Resource Brokers
– Virtual Organisation
– Computing Element
• Metrics can identify problems
• Data used to generate reports is available as a tab-delimited plain text file on request
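As an illustration of the per-RB, per-VO and per-CE split, a tab-delimited file could be counted along these lines. This is only a sketch: the column names (`rb`, `vo`, `ce`, `status`) and the sample rows are invented for the example and may not match the real report data.

```python
import csv
import io
from collections import Counter


def count_jobs(tsv_text, column):
    """Count jobs per distinct value of `column` in tab-delimited text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row[column] for row in reader)


# Made-up sample rows; the layout of the real file is an assumption.
sample = (
    "rb\tvo\tce\tstatus\n"
    "rb-a\tatlas\tce-1\tDone\n"
    "rb-a\tcms\tce-2\tAborted\n"
    "rb-b\tatlas\tce-1\tDone\n"
)

for column in ("rb", "vo", "ce"):
    print(column, dict(count_jobs(sample, column)))
```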


XML Files and ROOT

• Information from each RB is presented as an XML file

• For efficiency reasons the RTM and LCG Load programs use a single plain text file

• To see long-term trends, the data is imported into ROOT. Graphs can then be made with larger data sets, and time-dependent trends can be shown.

• We currently have data for half a year (September 2005 to now)

• ROOT file available on request


Interesting Metrics

• We can identify RB problems by looking at the match time for jobs. We have established that all RBs slow down when more than 10 jobs/second are submitted.

• We can show VO behaviour by average job lengths and success rates, as well as the usage of LCG components (RBs/CEs used) and the number of users (unique DNs).

• We can measure CE/VO efficiency both by the fraction of successful jobs and by “Useful Time”: the computational WN time spent on jobs that ended in a Done (Success) state, measured against the total time of all jobs (including those that failed).
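The two efficiency measures can be computed side by side, as in the sketch below. It assumes each job is reduced to a (status, WN seconds) pair and that "Useful Time" is expressed as a fraction of total time; the function name and the sample values are made up for illustration.

```python
def efficiency(jobs):
    """Return (success fraction, useful-time fraction) for (status, wn_seconds) pairs."""
    total_jobs = len(jobs)
    total_time = sum(seconds for _, seconds in jobs)
    # WN time of the jobs that ended in Done (Success)
    good = [seconds for status, seconds in jobs if status == "Done (Success)"]
    success_fraction = len(good) / total_jobs if total_jobs else 0.0
    useful_fraction = sum(good) / total_time if total_time else 0.0
    return success_fraction, useful_fraction


# Made-up jobs: two successes and two failures, one of which never ran.
jobs = [
    ("Done (Success)", 3600),
    ("Aborted", 1800),
    ("Done (Success)", 600),
    ("Aborted", 0),
]
success, useful = efficiency(jobs)
print(f"success fraction {success:.2f}, useful time fraction {useful:.2f}")
```

The two numbers can differ noticeably: a failure that dies immediately lowers the success fraction but costs almost no WN time, while a long job that fails at the end barely changes the success fraction but wastes a lot of time.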


RB Match Times

Job scheduling (Match Time) versus load (mean number of jobs/sec during the matching)
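A plot of this kind can be prepared by grouping jobs into load bins and averaging the match time in each bin. The sketch below assumes each job has already been reduced to a (jobs/sec, match time in seconds) pair; the bin width and the sample values are invented and are not measurements from the monitor.

```python
from collections import defaultdict


def mean_match_time_by_load(samples, bin_width=2.0):
    """Average match time per load bin for (jobs_per_sec, match_seconds) samples."""
    bins = defaultdict(list)
    for load, match_time in samples:
        lower_edge = int(load // bin_width) * bin_width  # lower edge of the bin
        bins[lower_edge].append(match_time)
    return {low: sum(values) / len(values) for low, values in sorted(bins.items())}


# Made-up samples for illustration only.
samples = [(1.0, 20), (1.5, 30), (3.0, 40), (11.5, 300)]
for low, mean in mean_match_time_by_load(samples).items():
    print(f"{low:>5.1f} jobs/sec and up: mean match time {mean:.1f} s")
```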


DNs over time / VO

We can see weekends, as well as relative users per VO
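Counting users in this way amounts to counting distinct DNs per day and per VO. A minimal sketch, assuming the job records have been reduced to (day, VO, DN) tuples; the DNs and VO names below are placeholders, not real users.

```python
from collections import defaultdict
from datetime import date


def unique_users(records):
    """Count distinct DNs per (day, VO) from (day, vo, dn) records."""
    seen = defaultdict(set)
    for day, vo, dn in records:
        seen[(day, vo)].add(dn)  # a set ignores repeated jobs from the same user
    return {key: len(dns) for key, dns in seen.items()}


# Placeholder records for illustration only.
records = [
    (date(2006, 3, 17), "atlas", "/DN=alice"),
    (date(2006, 3, 17), "atlas", "/DN=alice"),
    (date(2006, 3, 17), "atlas", "/DN=bob"),
    (date(2006, 3, 17), "cms", "/DN=carol"),
    (date(2006, 3, 18), "atlas", "/DN=alice"),
]
for (day, vo), users in sorted(unique_users(records).items()):
    print(day.isoformat(), vo, users)
```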


Useful Time

Useful time for those CEs that had more than 30000 jobs submitted from September 2005 - February 2006 inclusive.


URLs etc.

http://gridportal.hep.ph.ic.ac.uk/rtm/

[email protected]

http://gridportal.hep.ph.ic.ac.uk/


Documents
