BQS integration in gLite-CE
TCG meeting, CERN, 01/11/2006
Sylvain Reynaud, Fabio Hernandez
BQS Integration to gLite CE 2
Context
We have been running a BQS-backed computing element since the early days of DataGrid
– BQS Information Provider
  • Maps BQS information data to Glue Schema (LDIF)
– bqs-jobmanager
  • Maps Globus commands to BQS commands
  • Maps job queues to “BQS classes”, requests AFS tokens for jobs needing them, archives job information, logs job information for accounting purposes, creates the BQS job wrapper, caches job status information…
Currently trying to integrate BQS to gLite-CE
– STEP 1: develop a “BLAH-to-Globus jobmanager” adapter
  • So that we can reuse the bqs-jobmanager currently in production with LCG-CE
– STEP 2: develop a grid-neutral front-end to BQS and use it with several CEs (e.g. gLite-CE, CREAM, GT4 WS-GRAM)
[Slide callout: “We are here”]
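As an illustration of the kind of output an information provider produces, the sketch below renders a few batch-system counters as one LDIF entry using Glue 1.x attribute names. The input dictionary, the CE identifier and the DN suffix are hypothetical values chosen for the example; the actual BQS Information Provider is not shown in these slides.

```python
def to_glue_ldif(ce_id, stats):
    """Render batch-system counters as one LDIF entry (Glue 1.x attribute names)."""
    lines = [
        # The DN suffix below is an illustrative assumption.
        f"dn: GlueCEUniqueID={ce_id},mds-vo-name=local,o=grid",
        f"GlueCEUniqueID: {ce_id}",
        f"GlueCEStateRunningJobs: {stats['running']}",
        f"GlueCEStateWaitingJobs: {stats['waiting']}",
        f"GlueCEStateTotalJobs: {stats['running'] + stats['waiting']}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical CE identifier and counters, for illustration only.
print(to_glue_ldif("ce.example.org:2119/jobmanager-bqs-long",
                   {"running": 12, "waiting": 3}))
```

A real provider would read the counters from the batch system rather than from a hard-coded dictionary.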

BQS integration in LCG-CE
[Diagram. Components shown: UI, RB, Gatekeeper, BQS job-manager, BDII, BQS Information Provider, CE, local batch system (BQS). Action: submit job. Legend: Provided, CC-IN2P3, To be done.]

BQS integration in gLite-CE (STEP 1)
[Diagram. Components shown: UI, WMS, Gatekeeper, BDII, Condor-C, Blahpd, fork job-manager, BLAH to Globus, BQS job-manager, BQS Information Provider, CE, local batch system (BQS). Actions: submit job, launch Condor-C. Legend: Provided, CC-IN2P3, To be done.]

Purpose of this presentation
Provide feedback on the difficulties of integrating a new LRMS into gLite-CE
– These difficulties are not specific to BQS
– It is not impossible to do
– …but it cannot be done efficiently!

Overview
Difficulties
– gLite-CE installation
– Plug-in development
– Plug-in testing
BQS integration in CREAM
Discussion

gLite-CE installation
On a standard Scientific Linux 3.0.5
– gLite 3.0.0 and 3.0.1: solutions to most bugs were found in mailing-list archives
– gLite 3.0.2 update 6: almost no installation bugs left
On our site-customized Scientific Linux 3.0.5
– Customization related to
  • different releases of language interpreters (perl, python)
  • modified environment variables
– Sensitive to modifications of the execution environment
  • About 2/3 of the problems found were specific to this customization
– Problems of this kind were not observed with other software packages (e.g. GT4)
– Some problems were hard to resolve (e.g. Globus fork-jobmanager script modified to set a specific and non-trivial order of directories in $PATH)
  • It seems to work now (with PBS), but there may be some remaining problems with untested features
– Not yet re-tested with gLite 3.0.2 update 6

Plug-in development
BLAH expects 5 commands for interacting with the underlying LRMS
– One per action (submit, status, cancel, hold, resume)
– In the case of PBS and LSF, these commands are implemented as shell scripts
Lack of complete documentation is not a big issue
– The plug-ins provided for PBS and LSF are a good starting point
– Following the job lifecycle through testing is also instructive for understanding the system
  • But testing is the hard part (more on the next slides)
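The five-command contract described on this slide can be sketched as a small completeness check for a new plug-in. The sketch assumes that each action is provided by an executable script named `<lrms>_<action>.sh` in a single plug-in directory; this naming scheme and the `missing_plugin_scripts` helper are assumptions made for illustration, not taken from the BLAH documentation.

```python
import os
import tempfile

# The five actions BLAH needs from an LRMS plug-in (as listed on the slide).
BLAH_ACTIONS = ("submit", "status", "cancel", "hold", "resume")


def missing_plugin_scripts(plugin_dir, lrms):
    """Return the actions that have no executable script in plugin_dir.

    Assumes scripts are named '<lrms>_<action>.sh' (hypothetical scheme).
    """
    missing = []
    for action in BLAH_ACTIONS:
        path = os.path.join(plugin_dir, f"{lrms}_{action}.sh")
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(action)
    return missing


if __name__ == "__main__":
    # Build a throwaway plug-in directory that implements only two actions.
    with tempfile.TemporaryDirectory() as plugin_dir:
        for action in ("submit", "status"):
            script = os.path.join(plugin_dir, f"bqs_{action}.sh")
            with open(script, "w") as handle:
                handle.write("#!/bin/sh\nexit 0\n")
            os.chmod(script, 0o755)
        print(missing_plugin_scripts(plugin_dir, "bqs"))
```

A check of this kind only confirms that the scripts exist; it says nothing about whether their output is what BLAH expects.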

Plug-in testing (1/4)
CANNOT TEST EFFICIENTLY BECAUSE…
The CE cannot be tested in standalone mode (without WMS)
– This adds complexity and many opportunities for job failures
– We had to deploy a WMS locally
  • The WMS instances deployed on PPS were not stable enough (before summer)
  • We needed to understand where and why jobs fail
Each job submission test takes too long to complete
– Around 4’30” to execute a “hello-world” job on unloaded machines connected to the same LAN
– 15’ for an abnormally ended job
=> No test can be done in less than 5 minutes!
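To put these durations in perspective, the following back-of-the-envelope sketch counts how many strictly sequential test runs fit in one working session. The 8-hour session length is an assumption chosen for illustration; the two durations are the ones quoted on this slide.

```python
SESSION_S = 8 * 3600          # assumed session length: 8 hours
HELLO_WORLD_S = 4 * 60 + 30   # 4'30" for a successful "hello-world" job
FAILED_JOB_S = 15 * 60        # 15' for an abnormally ended job


def max_sequential_runs(session_s, run_s):
    """Number of complete runs that fit in the session, one after another."""
    return session_s // run_s


print(max_sequential_runs(SESSION_S, HELLO_WORLD_S))  # successful runs
print(max_sequential_runs(SESSION_S, FAILED_JOB_S))   # failed runs
```

Under that assumption the upper bound is 106 successful runs per session, or 32 if every run ends abnormally, before counting any time spent reading logs or changing the configuration.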

Plug-in testing (2/4)
Some services sometimes fail to start, start incorrectly or stop working (WMS, CE)
– (these are NOT security-related problems such as time synchronization or CRL & gridmap file updates)
– This occurs after a configuration change or a simple service restart
  => we restart the relevant services several times, in different orders
– Sometimes we are unable to get back to a working configuration (even by resetting the original values)
  => reinstalling is the fastest solution
We have not been able to deactivate the automatic retry of jobs
– (setting RetryCount/ShallowRetryCount to 0 in the JDL does not do it)
– The lifecycle of failed jobs takes longer to complete
– Previously failed jobs continue to pollute the CE log files
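For reference, the attempt described on this slide corresponds to a job description along the following lines. The sketch only builds the JDL text; the executable, arguments and file names are illustrative choices, and, as the slide notes, setting both counters to 0 did not disable the retries.

```python
def hello_world_jdl(retry_count=0, shallow_retry_count=0):
    """Build a minimal JDL job description with explicit retry counters."""
    attributes = [
        ("Executable", '"/bin/echo"'),
        ("Arguments", '"hello world"'),
        ("StdOutput", '"std.out"'),
        ("StdError", '"std.err"'),
        ("OutputSandbox", '{"std.out", "std.err"}'),
        ("RetryCount", str(retry_count)),
        ("ShallowRetryCount", str(shallow_retry_count)),
    ]
    # One 'Name = value;' statement per line.
    return "".join(f"{name} = {value};\n" for name, value in attributes)


print(hello_world_jdl())
```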

Plug-in testing (3/4)
Job cancellation often does not work
– The glite-job-cancel command always returns “request has been successfully submitted”, but often has no effect on the job
– We do not know how to get the WMS & CE back to a “clean” state
The first submitted job almost always fails
– No longer systematic with the latest release, but still very frequent
– We often face this situation because the development phase implies frequent configuration changes, which often require restarting the gLite services

Plug-in testing (4/4)
Hard to find the cause of failures
– Many silent failures or useless messages
  "The PeriodicHold expression 'Matched =!= TRUE && CurrentTime > QDate + 900' evaluated to TRUE".
– The command “glite-job-logging-info -v 2” often does not help to understand why the job has been retried for 900 seconds
– We need to follow the life of the job by looking at the log files, but they are scattered, and some are ephemeral (they disappear too quickly)
  • Several log files per component: Globus gatekeeper, Globus job-manager, Condor-C (ephemeral logs), BLAH (ephemeral logs), GridManager, …
  • Several directories contain logs: /var/log, $HOME, /tmp, …
– No error detection when the LRMS-specific BLAH scripts return unexpected output
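The PeriodicHold message quoted on this slide is easier to read once the expression is spelled out. In the Condor ClassAd language `=!=` means "is not identical to", so `Matched =!= TRUE` also holds when `Matched` is undefined. The sketch below mirrors that logic in Python, using `None` for an undefined attribute; the function name and the sample timestamps are illustrative.

```python
def periodic_hold(matched, current_time, qdate, limit=900):
    """Mirror of: Matched =!= TRUE && CurrentTime > QDate + 900.

    matched is True, False, or None (attribute undefined).
    Times are in seconds since the epoch.
    """
    # '=!= TRUE' is satisfied by anything that is not exactly TRUE,
    # including an undefined attribute.
    not_matched = matched is not True
    return not_matched and current_time > qdate + limit


qdate = 1_162_000_000  # illustrative submission time
print(periodic_hold(None, qdate + 901, qdate))   # unmatched for more than 900 s
print(periodic_hold(None, qdate + 900, qdate))   # exactly 900 s after queueing
print(periodic_hold(True, qdate + 5000, qdate))  # job already matched
```

Read this way, the message says that the job stayed unmatched for more than 900 seconds after it was queued; it does not say why no match was found, which is the information missing from the logs.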

BQS integration in CREAM
Currently exploring the integration of BQS to CREAM
– Have just started installing CREAM with PBS (27/10/2006)
CREAM installation (ongoing)
– Not yet automated, but not sensitive to modifications of the execution environment
Plug-in development (not started yet)
– STEP 1: Implementing a “BLAH Log Parser” is required
  => reusing code developed for LCG-CE may require modifications
– STEP 2: Develop a CREAM connector for BQS
Plug-in testing (not started yet)
– Seems to have none of the previously mentioned difficulties
Thanks to Massimo Sgaravatto for providing early access to CREAM for gLite 3.1

BQS integration in CREAM (STEP 1)
[Diagram. Components shown: ICE, CREAM, CEMon, Blahpd, BLAH connector, BLAH to Globus, BQS job-manager, BQS BLAH Log Parser, BQS Information Provider, CE, local batch system (BQS), plus one element marked “???”. Action: submit job. Legend: Provided, CC-IN2P3, To be done.]

BQS integration in CREAM (STEP 2)
[Diagram. Components shown: ICE, CREAM, CEMon, BLAH connector, BQS connector, BQS Information Provider, BQS grid-neutral front-end, CE, local batch system (BQS). Action: submit job. Legend: Provided, CC-IN2P3, To be done.]

References
gLite
– http://glite.web.cern.ch/glite/documentation/
BLAH
– http://egee-jra1-wm.mi.infn.it/egee-jra1-wm/ce_blahp.shtml
CREAM
– http://grid.pd.infn.it/cream/field.php?n=Main.HomePage

Discussion
Are there tips for working more efficiently with WMS and gLite-CE components?
– How can WMS/gLite-CE be configured to reduce the time jobs take to complete?
– How can the automatic retry of jobs be deactivated?
What is the recommended way to proceed?
– Will the next releases of gLite-CE provide answers to the problems reported in this talk?
– Should we instead concentrate on the BQS integration to CREAM? (our preferred way)
  • Will WMS support CREAM before support for LCG-CE is dropped?
– As a site, will we have to support both gLite-CE and CREAM?
Is there any plan to drop support for LCG-CE in the near future?