Report from USA
Massimo Sgaravatto, INFN Padova
Introduction
Workload management system for productions: Monte Carlo productions, data reconstructions, and production analyses
Scheduled activities
Goal: optimization of overall throughput
Possible architecture
[Diagram: a Personal Condor with a Master submits jobs (condor_submit, Globus Universe, globusrun) to GRAM interfaces at Site1 (Condor), Site2 (LSF), and Site3 (PBS). The Master performs resource discovery through the GIS, which publishes information on characteristics and status of the local resources; jobs are managed with condor_q, condor_rm, …]
Overview
GRAM as uniform interface to different local resource management systems
Personal Condor able to provide robustness and reliability: the user can submit his 10,000 jobs and be sure that they will be completed (even if there are problems in the submitting machine, in the executing machines, in the network, …) without human intervention
Usage of the Condor interface and tools to “manage” the jobs: “robust” tools with all the required capabilities (monitoring, logging, …)
Master smart enough to decide to which Globus resources the jobs must be submitted
The Master uses the information on characteristics and status of resources published in the GIS
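For reference, a Condor-G submit description file for the Globus universe might look like this (a sketch based on classic Condor-G syntax; the executable and file names are hypothetical, and the resource contact is the one used in the later examples):

```
# Submit one job through Condor-G to a remote GRAM resource (sketch)
universe        = globus
globusscheduler = lxpd.pd.infn.it/jobmanager-lsf
executable      = startcmsim.sh
output          = cmsim.out
error           = cmsim.err
log             = cmsim.log
queue
```

The job is then submitted with condor_submit and managed with condor_q / condor_rm, as described above.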
Globus GRAM
Fixed problems:
I/O with vanilla Condor jobs
globus-job-status with LSF and Condor
Publishing of Globus LSF and Condor jobs in the GIS
Open problems:
Submission of multiple instances of the same job to an LSF cluster (necessary to modify the Globus LSF scripts)
Scalability
Fault tolerance
Globus GRAM Architecture
[Diagram: a Client invokes globusrun against a Globus front-end machine; the Jobmanager on the front end passes the Job to the local resource management system (LSF/Condor/PBS/…).]

%globusrun -b -r lxpd.pd.infn.it/jobmanager-lsf -f file.rsl

file.rsl:
&(executable=$(CMS)/startcmsim.sh)
 (stdin=$(CMS)/Pythia/inp)
 (stdout=$(CMS)/Cmsim/out)
 (count=1)
 (queue=cmsprod)
Scalability
One jobmanager for each globusrun
If I want to submit 1000 jobs??? 1000 globusrun, 1000 jobmanagers running in the front-end machine!!!
The alternative, a single submission with count=1000:

%globusrun -b -r lxpd.pd.infn.it/jobmanager-lsf -f file.rsl

file.rsl:
&(executable=$(CMS)/startcmsim.sh)
 (stdin=$(CMS)/Pythia/inp)
 (stdout=$(CMS)/Cmsim/out)
 (count=1000)
 (queue=cmsprod)
Problems with LSF
It is not possible to specify in the RSL file 1000 different input files and 1000 different output files …
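Since a single RSL file cannot name 1000 distinct input and output files, an obvious workaround is to generate one RSL file per job and submit each separately. The sketch below (not from the slides; file names and paths are hypothetical, modeled on the $(CMS) examples above) only generates the files, and the globusrun call is left commented out since it needs a live Globus installation. Note that, as the Scalability slide points out, this still means one jobmanager per globusrun.

```shell
#!/bin/sh
# Generate one RSL file per job, each with its own stdin/stdout
# (hypothetical paths modeled on the $(CMS) examples).
NJOBS=3   # would be 1000 in the scenario from the slides
i=1
while [ "$i" -le "$NJOBS" ]; do
    cat > "job$i.rsl" <<EOF
&(executable=\$(CMS)/startcmsim.sh)
 (stdin=\$(CMS)/Pythia/inp$i)
 (stdout=\$(CMS)/Cmsim/out$i)
 (count=1)
 (queue=cmsprod)
EOF
    # globusrun -b -r lxpd.pd.infn.it/jobmanager-lsf -f "job$i.rsl"
    i=$((i + 1))
done
```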
Fault tolerance
The jobmanager is not persistent: if the jobmanager can't be contacted, Globus assumes that the job(s) have been completed
Example: submission of n jobs on an LSF cluster, then reboot of the front-end machine. The jobmanager(s) don't run anymore, so the jobs are orphaned and Globus assumes they have been successfully completed
Globus is not able to tell whether a job exited normally, or whether it stopped running because of a problem (e.g. crash of the executing machine) and therefore must be re-submitted
Globus Universe
Condor-G tested with:
Workstation using the fork system call
LSF cluster
Condor pool
Submission (condor_submit), monitoring (condor_q), and removal (condor_rm) seem to work fine, but…
Globus Universe: problems
It is not possible to have the input/output/error files on the submitting machine
Very difficult to diagnose errors
Condor-G is not able to provide fault tolerance and robustness (because Globus doesn't provide these features); fault tolerance only on the submitting side
Condor-G Architecture
[Diagram: the Personal Condor (Globus client) runs condor_submit, which invokes globusrun against the Globus front-end machine; the Jobmanager passes the jobs to the local resource management system (LSF/Condor/PBS/…), and job status is tracked by polling with globus_job_status.]
Possible solutions
Some improvements foreseen with Condor 6.3 (but they will not solve all the problems)
Persistent Globus jobmanager???
Direct interaction between Condor and the local resource management systems (LSF): necessary to modify the Condor startd
GlideIn: the only “ready-to-use” solution if robustness is considered a fundamental requirement
GlideIn
Condor daemons run on Globus resources; the local resource management systems are used only to run the Condor daemons
Robustness and fault tolerance
Use of the Condor matchmaking system: a viable solution if the goal is just to find idle CPUs, but what if we have to take into account other parameters (e.g. location of input files)???
Various changes have been necessary in the condor_glidein script
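Deploying the daemons is done with the condor_glidein script mentioned above; an invocation might look like this (a sketch from memory of the old condor_glidein interface: the -count option and the contact-string argument are assumptions and may differ between Condor versions):

```
% condor_glidein -count 10 lxpd.pd.infn.it/jobmanager-lsf
```

Once the daemons start, the Globus resource appears in the pool as an ordinary Condor machine, and jobs are matched to it through the normal matchmaking system.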
GlideIn
GlideIn tested with:
Workstation using the fork system call as job manager: seems to work
Condor pool: seems to work; Condor flocking is a better solution if authentication is not required
LSF cluster: problems (because Globus assumes SMP machines managed by LSF, while there are some problems with clusters); necessary to modify the Globus LSF scripts
Conclusions
Major problems related to scalability and fault tolerance with Globus: necessary to re-implement the GRAM service
The foreseen architecture doesn't work
Personal Condor able to provide robustness only on the submitting side