Monitoring and Accounting in EGEE/LCG
Dave Kant
GridPP 15
RAL
Operations Workshop, Sept 2005 - 2
Overview
• Overview
• Monitoring
  Service Availability Monitoring
  • Service Availability Monitoring Environment
  • The Sensors
  • Schema
• Accounting
  Status of Batch Support in APEL
  • Condor and SGE
  LCG-RUS
Service Availability Monitoring
• Grid Operations Activity (CERN lead)
… with contributions from anyone who wants to participate
• Work started at the 4th EGEE conference in Pisa (October)
• Implementation of sensors, metrics and alarms for services in the EGEE/LCG infrastructure to ensure smooth grid operations
  Good sensors
  Meaningful metrics
  Controllable alarms
• How to contribute: http://goc.grid.sinica.edu.tw/gocwiki/Service_Availability_Sensors
Contributions to Sensors
• Substantial Metrics document in circulation which defines 50+ metrics
• Home Page not yet available.
• Section 6 concerns the services
• Does GridPP have a strong presence here?
Service    Region             Contributors
BDII       Asia Pacific       Min Tsai
Catalogue  CERN               James Casey
CE         CERN               Piotr Nyczyk
FTS        CERN               Gavin McCance
MyProxy    SEE-Greece / CERN  ? / Maarten Litmaath
RGMA       UKI and CERN       Laurence Field, Antony Wilson
RB         UKI and Italy      Dave Kant, Sergio Andreozzi
SRM        UKI                Dave Kant, Jens Jensen, Greg Cowan
VOMS       Italy              Valerio Venturi
Architecture
• All sensors publish into RGMA using a common schema
• Publish frequency depends on the sensor: SFT every 2 hours; RB every 30 mins; SRM once a day
• Alarms are generated according to thresholds, e.g. an RB alarm if the matchmaking time exceeds 90 seconds
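The threshold rule can be sketched in Python as follows. This is only an illustration, not the monitoring framework's code: the `THRESHOLDS` table, the metric name `matchmake_seconds` and the function `check_alarm` are hypothetical; only the 90-second RB matchmaking limit comes from the slide.

```python
# Hypothetical threshold table: (service, metric) -> upper limit.
# Only the 90 s RB matchmaking limit is taken from the slide.
THRESHOLDS = {("RB", "matchmake_seconds"): 90.0}


def check_alarm(service, metric, value):
    """Return an alarm message if value exceeds the threshold, else None."""
    limit = THRESHOLDS.get((service, metric))
    if limit is not None and value > limit:
        return f"ALARM {service}: {metric}={value} exceeds {limit}"
    return None


print(check_alarm("RB", "matchmake_seconds", 120.0))  # alarm message
print(check_alarm("RB", "matchmake_seconds", 45.0))   # None
```

Keeping the limits in a table rather than in the sensor code is one way to make alarms "controllable": a threshold can be changed without touching the sensor.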
TimeLine
Preliminary Releases Expected
• Sensors: Feb 2006
• Summary Generator: Indian Team?
• Metric Generator: Re-use Lemon Components?
• Displays: Feb 2006, based on SFT
• Alarm System: March 2006
  Sure/Lemon (Piotr)
  RSS (Dave)
  Integration with CIC portal (Lyon?)
• Work in progress
• Community working together
Sensors for Service Monitoring
RB Active Monitoring
• Track a test job through the Grid, from UI to Worker Node
• Functional test: can RBs match jobs to the resources requested?
• Frequent job submission: sample functionality every 30 minutes
• Tools on the UI (edg-get-job-info)
• RGMA publishers on RB and WN
• Screenshots: Job Summaries, RB Summaries, Metrics
Example: RB Service Monitoring
Shows Results of the latest round of jobs sent to RBs
View details of individual tests
Track a test job through the Grid; from UI to Worker Node
UI edg-job-output
RB L&B Info
RB Publisher
WN Publisher
Recent History for a RB
Derive Metric Data
• Capture time to matchmake the job
• Capture availability in a 24 hr period:
  Availability = (Number of jobs to reach DONE) / (Total number of jobs submitted)
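The availability ratio can be computed as in the sketch below. The list of job states is made-up sample data and the function name `availability` is hypothetical; only the ratio itself (jobs reaching DONE over jobs submitted) comes from the slide.

```python
def availability(job_states):
    """Fraction of submitted jobs that reached DONE; None if no jobs were sent."""
    if not job_states:
        return None
    done = sum(1 for state in job_states if state == "DONE")
    return done / len(job_states)


# Made-up final states of the test jobs sent to one RB in a 24 hr period.
states = ["DONE", "DONE", "ABORTED", "DONE"]
print(availability(states))  # 0.75
```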
Passive Monitoring
Passive Monitoring (Italy: Sergio Andreozzi)
Processing of log files: http://www.cnaf.infn.it/~andreozzi/wiki/Work/WMS
• Workload Manager Component: WaitingRequests, InputFileListSize
• Job Controller: WaitingRequests, InputFileListSize
• Network Server: submissionRate (requests/600s)
• WM Proxy: ServerPoolSize
• Whole System: InJobs in last 10 mins, OutJobs in last 10 mins
• Hosting Environment: load (1, 5, 15), memory (used, free, total, real, virtual)
Can GridPP Contribute here?
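Rate metrics such as submissionRate (requests/600s) or InJobs in the last 10 minutes amount to counting log events inside a time window. A minimal sketch, assuming the event timestamps have already been extracted from the log files as seconds; the function name `count_in_window` and the sample values are hypothetical.

```python
def count_in_window(timestamps, now, window=600):
    """Count events with now - window < t <= now (all times in seconds)."""
    return sum(1 for t in timestamps if now - window < t <= now)


# Made-up request times, in seconds since some reference point.
requests = [0, 100, 500, 700, 1000]
print(count_in_window(requests, now=1000))  # 3
```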
APEL, Job Accounting Flow Diagram
[1] Build Job Accounting Records at site.
[2] Send Job Records to a central repository
[3] Data Aggregation
Accounting for Grid Jobs
• Build Job Records at Site
• APEL maps grid users to the resource usage on local farms
[Diagram labels: RGMA, MON, GOC (Consolidation of Data), Home Page (User queries, Graphs)]
• Job records in via RGMA: 1 record per grid job (millions of records expected)
• SQL query to Accounting Server: 1 query per hour
• Summary data refreshed every hour (max records about 100K per year)
• On-demand accounting pages based on SQL queries to summary data
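Mapping grid users to local resource usage is essentially a join on the local batch job id. The sketch below illustrates the idea only and is not APEL's code: the function `build_job_records`, the field names and all sample values (including the DN) are hypothetical.

```python
def build_job_records(batch_usage, grid_mapping):
    """Join batch usage with grid identities on the local job id."""
    records = []
    for job_id, usage in batch_usage.items():
        owner = grid_mapping.get(job_id)
        if owner is None:
            continue  # no grid identity known: treated here as a local job
        records.append({"job_id": job_id, **owner, **usage})
    return records


# Made-up inputs: usage parsed from batch logs, identities from grid-side logs.
batch_usage = {
    "101.pbs": {"cpu_s": 3600, "wall_s": 4000},
    "102.pbs": {"cpu_s": 120, "wall_s": 300},
}
grid_mapping = {
    "101.pbs": {"dn": "/C=UK/O=eScience/CN=example user", "vo": "atlas"},
}

records = build_job_records(batch_usage, grid_mapping)
print(records)  # one record: job 101.pbs, owned by the atlas user
```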
APEL Status
APEL has been in production for 1 year
• 156 Sites , 5.4 Million Job Records
• 100K job records per week: a linear rise (cf. exponential) that continues despite growth in CEs. More sites doesn’t mean more jobs or more users.
Demos of Accounting Aggregation
Global views of resource consumption.
• LHC View: http://goc.grid-support.ac.uk/gridsite/accounting/tree/treeview.php
  Data aggregation across countries
• EGEE View: http://www2.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php
  Data aggregation across EGEE ROCs
  Based on LHC View and data mining
  Displays official EGEE VOs (12) and regional VOs
  Tables to show which GOCDB sites haven’t published recently
  … and which ones publish but are not listed in GOCDB
• GridPP View: http://goc.grid-support.ac.uk/gridsite/accounting/tree/gridppview.php
  Specific view for GridPP accounting summaries for Tier-2s
  Comments from Peter Gronbech, Fraser Spiers
Aggregation of Data for GridPP
Aggregation of Data for Tier2
Data Aggregation at Site Level
Breakdown of data per VO per month showing Njobs, CPUt, WCT, record history
Total CPU Usage per VO
Gantt Chart. NB: Gaps across all VOs are consistent with scheduled downtimes in GOCDB
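The per-VO, per-month breakdown (Njobs, CPU time, wall-clock time) can be sketched as a simple group-and-sum over job records. This is an illustration only: the function `summarise`, the record field names and the sample jobs are hypothetical, not the schema used by the accounting pages.

```python
from collections import defaultdict


def summarise(records):
    """Sum Njobs, CPU time and wall-clock time per (VO, month)."""
    summary = defaultdict(lambda: {"njobs": 0, "cpu_s": 0, "wall_s": 0})
    for r in records:
        key = (r["vo"], r["end_date"][:7])  # "YYYY-MM-DD" -> "YYYY-MM"
        summary[key]["njobs"] += 1
        summary[key]["cpu_s"] += r["cpu_s"]
        summary[key]["wall_s"] += r["wall_s"]
    return dict(summary)


# Made-up job records.
jobs = [
    {"vo": "atlas", "end_date": "2005-08-03", "cpu_s": 100, "wall_s": 150},
    {"vo": "atlas", "end_date": "2005-08-20", "cpu_s": 200, "wall_s": 250},
    {"vo": "lhcb", "end_date": "2005-09-01", "cpu_s": 50, "wall_s": 60},
]
monthly = summarise(jobs)
print(monthly[("atlas", "2005-08")])  # {'njobs': 2, 'cpu_s': 300, 'wall_s': 400}
```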
Batch Support in APEL
Currently Available in LCG 2.6
• OpenPBS, Torque, PBSPro and Vanilla PBS: ~80% of sites in LCG/EGEE
• Load Sharing Facility (Versions 5 and 6): CERN, Italy
In Development
• Condor (http://goc.grid-support.ac.uk/gridsite/accounting/condor.html )
  Requested by Canada, UK
  Was due for release in Nov/December but delayed
  Need to deal with multiple parses of large batch files: Condor does not self-manage its logs, so they grow to > 2GB in size, and multiple parses via APEL are inefficient.
• Sun Grid Engine (http://goc.grid-support.ac.uk/gridsite/accounting/sge.html )
  Requested by UK (Imperial College)
  Format of log records unclear to us: missing information in message logs
  LCG-SGE job manager format is not LCG compliant (PBS, LSF and Condor all are!). Substantial changes to APEL required unless this is addressed more carefully.
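One common way to avoid re-parsing a large, ever-growing log is to remember the byte offset reached on the previous run and read only what has been appended since. The sketch below shows that general technique; it is an assumption about how the problem could be approached, not a description of how APEL handles Condor logs. The function `read_new_lines` and the file name are hypothetical.

```python
import os
import tempfile


def read_new_lines(log_path, offset):
    """Return (complete lines appended after byte `offset`, new offset)."""
    lines = []
    with open(log_path, "rb") as f:
        f.seek(offset)
        for raw in f:
            if not raw.endswith(b"\n"):
                break  # partial line still being written; pick it up next run
            lines.append(raw.decode("utf-8", errors="replace").rstrip("\n"))
            offset += len(raw)
    return lines, offset


# Demo with a temporary file standing in for a batch history log.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "history.log")
    with open(path, "wb") as f:
        f.write(b"job1\njob2\n")
    first, pos = read_new_lines(path, 0)
    with open(path, "ab") as f:
        f.write(b"job3\n")
    second, pos = read_new_lines(path, pos)

print(first, second)  # ['job1', 'job2'] ['job3']
```

The offset would need to be stored between runs (and reset if the log is rotated or truncated), which this sketch does not cover.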
APEL/RGMA Issues
• Publishing Missing Records
  Options available to users are limited: <publish>all</publish>
  Republish means republish everything: this exceeds internal memory limits in Java, causing APEL to crash.
• RGMA Archiver is growing in size
  It takes longer to traverse the database
  About 2 minutes to run the summary generator
  Benefit to move to Oracle
• Batch Support is still limited (!)
  Condor and SGE should be seen by the community to be important extensions to the application.
• APEL and gLite (!)
  Will APEL work in this environment?
• Data Privacy and Security
  Sites don’t publish User DN … it’s private data
  Restrict access to private data via RGMA client
  Data needs to be shifted from producers to consumers in a secure way
• Restricted to Fixed Schema in RGMA?
  Cannot easily add new fields to the database
  Unable to capture information about jobs in batch logs, e.g. exit status, time in queue, etc. (Steve Fisher comments: new fields can be added)
Wider Issues
• How important is accounting?
  Compute resource viewed as a grid currency
  Need a guarantee that the data has not been tampered with in an unfair way
  How does normalisation fit into this? The concept of a raw usage record has no meaning if internal scaling is applied to heterogeneous farms.
• Recognise that accounting isn’t just about “job usage”; it’s about resource usage, which encompasses many things:
  CPU usage
  Also storage and network usage
  Treated differently? CPU is consumed; storage is occupied and can be recycled
• Getting Data from All Participants
  Hasn’t been easy to get all sites in EGEE to send data to us.
  Many reasons: some technical, some political
  How do we account for usage in wider communities which span grid projects, e.g. LHC?
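The normalisation question can be made concrete with a small sketch. It assumes each farm has a benchmark rating and that raw CPU time is scaled linearly to a common reference; the function `normalise_cpu`, the ratings and the reference value of 1000 are illustrative assumptions, not figures from the talk.

```python
def normalise_cpu(raw_cpu_hours, site_rating, reference_rating=1000.0):
    """Scale raw CPU hours to a reference machine (linear scaling assumed)."""
    return raw_cpu_hours * site_rating / reference_rating


# Two made-up farms: the slower one needs twice the raw hours for the same work.
slow_farm = normalise_cpu(100.0, site_rating=500.0)
fast_farm = normalise_cpu(50.0, site_rating=1000.0)
print(slow_farm, fast_farm)  # 50.0 50.0
```

If a site has already applied its own internal scaling before publishing, the "raw" input here is no longer raw, which is the difficulty the slide points at.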
Challenges Ahead
• Usage Reporting at what Level?
  Anonymous level: how much resource has been provided to each VO
  Aggregation across: VOs, Countries, Regions, Grids, Organisations
  Granularity: summed over units of Hours, Days, Weeks, Months?
• User Level Reporting?
  If 10,000 CPU hours were consumed by Atlas VO, who are the users that submitted the work?
  Data privacy laws
  A Grid “DN” is personal information which could be used to target an individual.
  Who has access to this data and how do you get it?
  Can CA policies change to support anonymous DNs and reverse DN mappings?
  What are the consequences? Are there any lawyers in the audience?
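To illustrate what an anonymous DN with a reverse mapping could look like, the sketch below derives a pseudonym from a DN with a keyed hash and keeps a private lookup table for authorised reverse mapping. This is purely an assumption for discussion, not a CA policy or an existing EGEE mechanism; the function `pseudonym`, the key and the DN are made up.

```python
import hashlib
import hmac


def pseudonym(dn, secret):
    """Derive a stable pseudonym for a DN using a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret, dn.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


secret = b"held-by-the-mapping-authority"  # hypothetical key
dn = "/C=UK/O=eScience/CN=example user"    # made-up DN

alias = pseudonym(dn, secret)
reverse_map = {alias: dn}  # kept private; only authorised parties may look up

print(alias != dn, reverse_map[alias] == dn)  # True True
```

Usage records could then carry the alias instead of the DN, with who holds the key and the reverse table left as the policy question raised above.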
World Wide Accounting Service for LCG
• Project involves combining results from all three peer infrastructures and presenting an aggregated view of resource usage for LHC VOs to the RRB
Peer Infrastructures in LCG
• Open Science Grid + Others (Ruth Pordes, Philippe Canal, Matteo Melani)
• Nordugrid (Per Oster, Thomas Sandholm)
• LCG/EGEE (Kors Bos, Dave Kant)
Resource Usage Service
• Based on emerging GGF standards and Web Services (GGF UR, OGSI)
• An implementation exists in “Market for Computational Science” – UK e-Science project
• Use case might be: a user invokes the query service through a web browser, using SSL for client authentication, to ensure that usage information at user level belongs to the user. Servlet sends query to RUS web service and gets user data.
[Diagram: Web Service Container hosting the RUS WS Application, with Service Interface, ACL and DB]
Work started with Akram Khan and Xiaoyu Chen at Brunel
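The use case (a user may only see usage records that belong to them) can be sketched as an access check applied to the query result. This illustrates the idea only and is not the RUS interface: the function `query_usage`, the `admins` set and the sample DNs are hypothetical, and the client DN is assumed to have been established by SSL client authentication before the call.

```python
def query_usage(records, client_dn, admins=frozenset()):
    """Return the usage records the authenticated client is allowed to see."""
    if client_dn in admins:
        return list(records)  # ACL grants full access
    return [r for r in records if r["dn"] == client_dn]


# Made-up usage records.
usage = [
    {"dn": "/CN=alice", "cpu_s": 100},
    {"dn": "/CN=bob", "cpu_s": 200},
]
print(query_usage(usage, "/CN=alice"))  # only alice's record
print(len(query_usage(usage, "/CN=admin", admins={"/CN=admin"})))  # 2
```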
Conclusions
• Very Busy Year Ahead
• SGE and Condor support needs to be completed
• Improve some features of APEL that cause difficulties
• Investigate LCG-RUS
• Service Metrics Activity – very important - beginning to consume effort.