Performance Monitoring Gidon Moont [email protected] e-Science, HEP, Imperial College London Talk to JRA1 All-Hands Meeting @ CERN

Performance Monitoring

Gidon Moont [email protected]

e-Science, HEP, Imperial College London

Talk to JRA1 All-Hands Meeting @ CERN

24 March 2006 Performance Monitoring

Introduction

• How we gather data.
• How we release the information.
  – Real Time Monitor
  – LCG Load Monitor
  – Daily Reports
  – XML files and ROOT

• Interesting metrics

How we gather data

• The data comes from direct queries of the MySQL databases of Resource Brokers.
• Around 30 Resource Brokers currently monitored.
• Queries once a minute.
  – find all jobs that had an event in the last minute
  – retrieve status and CE/WN information
  – write a complete (XML) description of all jobs
  – remove jobs that have had a finished status for 2 hours (or if Cleared)
  – as a job is removed, query all its events and write a summary file

• Multithreaded (one thread per RB) Java program.
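
The removal rule in the per-minute cycle above can be sketched roughly as follows. This is an illustrative Python sketch, not the Java program itself: the `Job` record, the `prune` function and the `FINISHED` status set are hypothetical names, and "after 2 hours" is assumed to mean two hours since the job's last recorded event.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed set of "finished" statuses; the slides name only Done and Cleared.
FINISHED = {"Done", "Aborted", "Cancelled"}
RETENTION = timedelta(hours=2)  # from the slide: finished jobs go after 2 hours


@dataclass
class Job:
    job_id: str
    status: str
    last_event: datetime  # time of the most recent event seen for this job


def prune(jobs, now):
    """Split the tracked jobs into (kept, removed) for one polling cycle."""
    kept, removed = {}, []
    for job_id, job in jobs.items():
        expired = job.status in FINISHED and now - job.last_event >= RETENTION
        if job.status == "Cleared" or expired:
            # In the monitor described above, this is the point at which
            # all events are queried and a summary file is written.
            removed.append(job)
        else:
            kept[job_id] = job
    return kept, removed
```

With one such job table per Resource Broker, each polling thread would call `prune` once per minute after updating statuses.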

Current RB List

gdrb01.cern.ch
gdrb02.cern.ch
gdrb03.cern.ch
gdrb04.cern.ch
gdrb06.cern.ch
gdrb07.cern.ch
gdrb08.cern.ch
gdrb09.cern.ch
gdrb10.cern.ch
gdrb11.cern.ch
lcgrb01.gridpp.rl.ac.uk
gfe01.hep.ph.ic.ac.uk
egee-rb-01.cnaf.infn.it
egee-rb-02.cnaf.infn.it
egee-rb-03.cnaf.infn.it
gridit-rb-01.cnaf.infn.it
a01-004-127.gridka.de
grid-rb0.desy.de
grid-rb2.desy.de
lcg00124.grid.sinica.edu.tw
rb01.pic.es
rb-egee.bifi.unizar.es
grid09.lal.in2p3.fr
node04.datagrid.cea.fr
mu3.matrix.sara.nl
rb.isabella.grnet.gr
rb101.grid.ucy.ac.cy
grid151.kfki.hu
lcg16.sinp.msu.ru
rb.phy.bg.ac.yu
ui.ulakbim.gov.tr

Real Time Monitor

• The Real Time Monitor has developed from a demo to show real-time usage of the LCG
• Further development will include sortable tables of RB/CE info
• Java applet - does not require extra libraries

LCG Load Monitor

• Requested as a tool to monitor London Tier 2

• Java Application
• Can monitor RBs, CEs, and groups of CEs (e.g. a T2)
• Jobs colour coded by VO (stacked)
• Sortable table of all current jobs

Daily Reports

• PDF documents created automatically at 3am
• Provides counts and metrics for all jobs that left the RTM in a 24-hour period
• Analysis split by
  – Resource Brokers
  – Virtual Organisation
  – Computing Element
• Metrics can identify problems
• Data used to generate reports is available as a tab-delimited plain text file on request
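
As a rough illustration of how the per-RB, per-VO and per-CE counts could be derived from such a tab-delimited file, here is a minimal Python sketch. The column names (`rb`, `vo`, `ce`) and the function `count_by` are assumptions made for the example; the actual layout of the file is not given in the slides.

```python
import csv
import io
from collections import Counter


def count_by(text, column):
    """Count jobs per distinct value of one column in tab-delimited text.

    The first line of `text` is assumed to be a header row naming the columns.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return Counter(row[column] for row in reader)
```

Calling `count_by(report_text, "vo")` would give the number of jobs per Virtual Organisation, and likewise for the other two splits.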

XML Files and ROOT

• Information from each RB is presented as an XML file

• For efficiency reasons the RTM and LCG Load programs use a single plain text file

• To see long term trends, the data is imported into ROOT. Graphs can then be made with larger data sets, and time dependent trends can be shown.

• We currently have data for half a year (from September 2005 - now)

• ROOT file available on request

Interesting Metrics

• We can identify RB problems by looking at the match time for jobs. We have established that all RBs slow down when more than 10 jobs/second are being submitted.

• We can show VO behaviour by average job lengths and success rates, as well as the usage of LCG components (RBs/CEs used) and the number of users (unique DNs).

• We can measure CE/VO efficiency both by the fraction of successful jobs and by “Useful Time”: the computational WN time of jobs that ended in a Done (Success) state, as a fraction of the total WN time of all jobs (including those that failed).
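
The two efficiency measures can be written down in a few lines. The following Python sketch is only an illustration under assumptions: the function name `efficiency` and the input format (one `(final_status, wn_seconds)` pair per job) are hypothetical, and only the status string "Done (Success)" is taken from the slides.

```python
def efficiency(jobs):
    """Return (success_fraction, useful_time) for one CE or VO.

    jobs: iterable of (final_status, wn_seconds) pairs, one per job.
    """
    jobs = list(jobs)
    total_jobs = len(jobs)
    total_time = sum(seconds for _, seconds in jobs)
    good = [(s, seconds) for s, seconds in jobs if s == "Done (Success)"]
    # Fraction of jobs that succeeded.
    success_fraction = len(good) / total_jobs if total_jobs else 0.0
    # WN time spent on successful jobs over WN time spent on all jobs.
    useful_time = sum(sec for _, sec in good) / total_time if total_time else 0.0
    return success_fraction, useful_time
```

The two numbers can differ considerably: if half of the jobs fail but the failures are short, the success fraction is 0.5 while the Useful Time stays close to 1.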

RB Match Times

Job scheduling (Match Time) versus load (mean number of jobs/sec during the matching)

DNs over time / VO

We can see weekends, as well as relative users per VO
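
Counting users in this way amounts to counting distinct DNs per day and per VO. A minimal Python sketch, assuming the job records are available as `(day, vo, dn)` tuples (a hypothetical format chosen for the example):

```python
from collections import defaultdict


def users_per_day(records):
    """Count unique DNs per (day, vo).

    records: iterable of (day, vo, dn) tuples, one per submitted job.
    """
    seen = defaultdict(set)
    for day, vo, dn in records:
        seen[(day, vo)].add(dn)  # a set ignores repeat jobs from the same DN
    return {key: len(dns) for key, dns in seen.items()}
```

Plotting these counts against the day gives one curve per VO, on which weekend dips would be visible.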

Useful Time

Useful time for those CEs that had more than 30000 jobs submitted from September 2005 - February 2006 inclusive.

URLs etc.

http://gridportal.hep.ph.ic.ac.uk/rtm/

[email protected]