Performance Monitoring
Gidon [email protected]
e-Science, HEP, Imperial College London
Talk to JRA1 All-Hands Meeting @ CERN
24 March 2006
Introduction
• How we gather data.
• How we release the information.
– Real Time Monitor
– LCG Load Monitor
– Daily Reports
– XML files and ROOT analysis
• Interesting metrics
How we gather data
• The data comes from direct queries of the MySQL databases of the Resource Brokers (RBs).
• Around 30 Resource Brokers are currently monitored.
• Each RB is queried once a minute.
– find all jobs that had an event in the last minute
– retrieve status and CE/WN information
– write a complete (XML) description of all jobs
– remove jobs 2 hours after they reach a finished status (or once they are Cleared)
– as a job is removed, query all its events and write a summary file
• Multithreaded (one thread per RB) Java program.
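The once-a-minute cycle above can be sketched roughly as follows. This is a minimal Python illustration, not the Java program itself. The `Job` record, the `fetch_recent_events` callback standing in for the MySQL query, and every status name other than Cleared are assumptions made for the example.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Tuple

RETENTION_SECONDS = 2 * 60 * 60  # finished jobs stay visible for 2 hours
# Assumed names for "finished" statuses; the slides only name Cleared.
FINISHED = {"Done", "Aborted", "Cancelled"}

# One event row: (job_id, status, ce, wn). Hypothetical layout.
Event = Tuple[str, str, str, str]


@dataclass
class Job:
    job_id: str
    status: str
    ce: str = ""                         # Computing Element, once known
    wn: str = ""                         # Worker Node, once known
    finished_at: Optional[float] = None  # when a finished status was first seen


def update_jobs(jobs: Dict[str, Job], events: Iterable[Event], now: float) -> List[Job]:
    """Apply one polling cycle and return the jobs dropped from the cache."""
    for job_id, status, ce, wn in events:
        job = jobs.setdefault(job_id, Job(job_id, status))
        job.status = status
        job.ce = ce or job.ce
        job.wn = wn or job.wn
        if status in FINISHED and job.finished_at is None:
            job.finished_at = now
    removed = []
    for job_id, job in list(jobs.items()):
        expired = (job.finished_at is not None
                   and now - job.finished_at >= RETENTION_SECONDS)
        if job.status == "Cleared" or expired:
            # This is the point where a summary file would be written.
            removed.append(jobs.pop(job_id))
    return removed


def monitor_rb(fetch_recent_events: Callable[[float, float], Iterable[Event]],
               stop: threading.Event, interval: float = 60.0) -> None:
    """Poll a single RB until `stop` is set."""
    jobs: Dict[str, Job] = {}
    while not stop.is_set():
        now = time.time()
        update_jobs(jobs, fetch_recent_events(now - interval, now), now)
        stop.wait(interval)


def start_monitors(fetchers: Dict[str, Callable[[float, float], Iterable[Event]]],
                   stop: threading.Event) -> List[threading.Thread]:
    """Start one polling thread per RB, keyed by RB host name."""
    threads = [threading.Thread(target=monitor_rb, args=(fetch, stop),
                                name=rb, daemon=True)
               for rb, fetch in fetchers.items()]
    for thread in threads:
        thread.start()
    return threads
```

Keeping the cache update in `update_jobs`, separate from the timing loop, means the retention rule can be exercised without a database or a running thread.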
Current RB List
gdrb01.cern.ch
gdrb02.cern.ch
gdrb03.cern.ch
gdrb04.cern.ch
gdrb06.cern.ch
gdrb07.cern.ch
gdrb08.cern.ch
gdrb09.cern.ch
gdrb10.cern.ch
gdrb11.cern.ch
lcgrb01.gridpp.rl.ac.uk
gfe01.hep.ph.ic.ac.uk
egee-rb-01.cnaf.infn.it
egee-rb-02.cnaf.infn.it
egee-rb-03.cnaf.infn.it
gridit-rb-01.cnaf.infn.it
a01-004-127.gridka.de
grid-rb0.desy.de
grid-rb2.desy.de
lcg00124.grid.sinica.edu.tw
rb01.pic.es
rb-egee.bifi.unizar.es
grid09.lal.in2p3.fr
node04.datagrid.cea.fr
mu3.matrix.sara.nl
rb.isabella.grnet.gr
rb101.grid.ucy.ac.cy
grid151.kfki.hu
lcg16.sinp.msu.ru
rb.phy.bg.ac.yu
ui.ulakbim.gov.tr
Real Time Monitor
• The Real Time Monitor has developed from a demo to show real-time usage of the LCG
• Further development will include sortable tables of RB/CE info
• Java applet - does not require extra libraries
LCG Load Monitor
• Requested as a tool to monitor London Tier 2
• Java Application
• Can monitor RBs, CEs, and groups of CEs (e.g. a T2)
• Jobs colour coded by VO (stacked)
• Sortable table of all current jobs
Daily Reports
• PDF documents are created automatically at 3 am
• They provide counts and metrics for all jobs that left the RTM in a 24-hour period
• Analysis split by
– Resource Broker
– Virtual Organisation
– Computing Element
• Metrics can identify problems
• The data used to generate the reports is available as a tab-delimited plain text file on request
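The per-category counts behind such a report could be produced along these lines. This is only a sketch. The field names (`rb`, `vo`, `ce`, `removed_at`) and the two-column output are hypothetical, since the slides do not describe the actual layout of the tab-delimited file.

```python
import csv
import io
from collections import Counter
from typing import Dict, Iterable

DAY_SECONDS = 24 * 60 * 60


def daily_counts(jobs: Iterable[Dict], day_start: float, key: str) -> Counter:
    """Count jobs that left the monitor in [day_start, day_start + 24 h), grouped by `key`."""
    day_end = day_start + DAY_SECONDS
    return Counter(job[key] for job in jobs
                   if day_start <= job["removed_at"] < day_end)


def to_tsv(counts: Counter, key: str) -> str:
    """Render the counts as tab-delimited text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([key, "jobs"])
    for name, n in sorted(counts.items()):
        writer.writerow([name, n])
    return out.getvalue()


# Illustrative records only; not real monitoring data.
example_jobs = [
    {"rb": "rbA", "vo": "vo1", "ce": "ce1", "removed_at": 10},
    {"rb": "rbA", "vo": "vo2", "ce": "ce2", "removed_at": 20},
    {"rb": "rbB", "vo": "vo1", "ce": "ce1", "removed_at": DAY_SECONDS + 5},
]
report = to_tsv(daily_counts(example_jobs, 0, "vo"), "vo")
```

Calling `daily_counts` once each with `"rb"`, `"vo"` and `"ce"` gives the three splits listed above.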
XML Files and ROOT
• Information from each RB is presented as an XML file
• For efficiency reasons the RTM and LCG Load programs use a single plain text file
• To see long term trends, the data is imported into ROOT. Graphs can then be made with larger data sets, and time dependent trends can be shown.
• We currently have data for half a year (from September 2005 - now)
• ROOT file available on request
Interesting Metrics
• We can identify RB problems by looking at the match time for jobs. We have established that all RBs slow down when more than 10 jobs/second are being submitted.
• We can show VO behaviour by average job lengths and success rates, as well as the usage of LCG components (RBs/CEs used) and the number of users (unique DNs).
• We can measure CE/VO efficiency by both the fraction of successful jobs AND by the amount of computational WN time that resulted in a Done (Success) state against the total time of all jobs (including those that failed) - labeled as “Useful Time”.
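The two efficiency measures and the user count can be written out explicitly. The sketch below assumes a hypothetical `JobSummary` record with one entry per job. Only the "Done (Success)" state and the definition of Useful Time are taken from the slide.

```python
from dataclasses import dataclass
from typing import Iterable

SUCCESS = "Done (Success)"


@dataclass
class JobSummary:
    vo: str             # Virtual Organisation
    ce: str             # Computing Element
    dn: str             # user's Distinguished Name
    status: str         # final job state
    wn_seconds: float   # time spent on the worker node


def success_fraction(jobs: Iterable[JobSummary]) -> float:
    """Fraction of jobs that ended in Done (Success)."""
    jobs = list(jobs)
    if not jobs:
        return 0.0
    return sum(job.status == SUCCESS for job in jobs) / len(jobs)


def useful_time(jobs: Iterable[JobSummary]) -> float:
    """WN time of successful jobs divided by WN time of all jobs, failures included."""
    jobs = list(jobs)
    total = sum(job.wn_seconds for job in jobs)
    if total == 0:
        return 0.0
    return sum(job.wn_seconds for job in jobs if job.status == SUCCESS) / total


def unique_users(jobs: Iterable[JobSummary]) -> int:
    """Number of distinct DNs among the jobs."""
    return len({job.dn for job in jobs})


# Illustrative records only; not real monitoring data.
sample = [
    JobSummary("vo1", "ce1", "dn-a", SUCCESS, 300.0),
    JobSummary("vo1", "ce1", "dn-a", SUCCESS, 100.0),
    JobSummary("vo1", "ce1", "dn-b", "Aborted", 600.0),
]
```

For `sample`, two of the three jobs succeed, but they account for only 400 of the 1000 WN seconds, so the Useful Time (0.4) is lower than the success fraction (about 0.67). The two measures differ in this way whenever failed jobs run for a long time.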
RB Match Times
Job scheduling (Match Time) versus load (mean number of jobs/sec during the matching)
DNs over time / VO
We can see weekends, as well as relative users per VO
Useful Time
Useful time for those CEs that had more than 30000 jobs submitted from September 2005 - February 2006 inclusive.
URLs etc.
http://gridportal.hep.ph.ic.ac.uk/rtm/