Report from USA
Massimo Sgaravatto, INFN Padova
Introduction
Workload management system for productions: Monte Carlo productions, data reconstructions, and production analyses
Scheduled activities
Goal: optimization of overall throughput
Possible architecture
[Diagram: a Personal Condor with a Master submits jobs (condor_submit, Globus Universe, globusrun) to GRAM interfaces at Site1 (Condor), Site2 (LSF), and Site3 (PBS). The Master performs resource discovery through the GIS, which publishes information on characteristics and status of the local resources; jobs are managed with condor_q, condor_rm, …]
Overview
GRAM as uniform interface to different local resource management systems
Personal Condor able to provide robustness and reliability: the user can submit his 10,000 jobs and be sure that they will be completed (even if there are problems in the submitting machine, in the executing machines, in the network, …) without human intervention
Usage of the Condor interface and tools to “manage” the jobs: “robust” tools with all the required capabilities (monitoring, logging, …)
Master smart enough to decide to which Globus resources the jobs must be submitted
The Master uses the information on characteristics and status of resources published in the GIS
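For reference, a Condor-G submit description file for the Globus universe might look like this (a sketch based on classic Condor-G syntax; the executable and file names are hypothetical, and the resource contact is the one used in the later examples):

```
# Submit one job through Condor-G to a remote GRAM resource (sketch)
universe        = globus
globusscheduler = lxpd.pd.infn.it/jobmanager-lsf
executable      = startcmsim.sh
output          = cmsim.out
error           = cmsim.err
log             = cmsim.log
queue
```

The job is then submitted with condor_submit and managed with condor_q / condor_rm, as described above.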
Globus GRAM
Fixed problems:
I/O with vanilla Condor jobs
globus-job-status with LSF and Condor
Publishing of Globus LSF and Condor jobs in the GIS
Open problems:
Submission of multiple instances of the same job to an LSF cluster (necessary to modify the Globus LSF scripts)
Scalability
Fault tolerance
Globus GRAM Architecture
[Diagram: a Client invokes globusrun against a Globus front-end machine; the Jobmanager on the front end passes the Job to the local resource management system (LSF/Condor/PBS/…).]

%globusrun -b -r lxpd.pd.infn.it/jobmanager-lsf -f file.rsl

file.rsl:
&(executable=$(CMS)/startcmsim.sh)
 (stdin=$(CMS)/Pythia/inp)
 (stdout=$(CMS)/Cmsim/out)
 (count=1)
 (queue=cmsprod)
Scalability
One jobmanager for each globusrun
If I want to submit 1000 jobs??? 1000 globusrun, 1000 jobmanagers running in the front-end machine!!!
The alternative, a single submission with count=1000:

%globusrun -b -r lxpd.pd.infn.it/jobmanager-lsf -f file.rsl

file.rsl:
&(executable=$(CMS)/startcmsim.sh)
 (stdin=$(CMS)/Pythia/inp)
 (stdout=$(CMS)/Cmsim/out)
 (count=1000)
 (queue=cmsprod)
Problems with LSF
It is not possible to specify in the RSL file 1000 different input files and 1000 different output files …
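Since a single RSL file cannot name 1000 distinct input and output files, an obvious workaround is to generate one RSL file per job and submit each separately. The sketch below (not from the slides; file names and paths are hypothetical, modeled on the $(CMS) examples above) only generates the files, and the globusrun call is left commented out since it needs a live Globus installation. Note that, as the Scalability slide points out, this still means one jobmanager per globusrun.

```shell
#!/bin/sh
# Generate one RSL file per job, each with its own stdin/stdout
# (hypothetical paths modeled on the $(CMS) examples).
NJOBS=3   # would be 1000 in the scenario from the slides
i=1
while [ "$i" -le "$NJOBS" ]; do
    cat > "job$i.rsl" <<EOF
&(executable=\$(CMS)/startcmsim.sh)
 (stdin=\$(CMS)/Pythia/inp$i)
 (stdout=\$(CMS)/Cmsim/out$i)
 (count=1)
 (queue=cmsprod)
EOF
    # globusrun -b -r lxpd.pd.infn.it/jobmanager-lsf -f "job$i.rsl"
    i=$((i + 1))
done
```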
Fault tolerance
The jobmanager is not persistent: if the jobmanager can't be contacted, Globus assumes that the job(s) have been completed
Example: submission of n jobs on an LSF cluster, then reboot of the front-end machine. The jobmanager(s) don't run anymore, so the jobs are orphaned and Globus assumes they have been successfully completed
Globus is not able to tell whether a job exited normally, or whether it stopped running because of a problem (e.g. crash of the executing machine) and therefore must be re-submitted
Globus Universe
Condor-G tested with:
Workstation using the fork system call
LSF cluster
Condor pool
Submission (condor_submit), monitoring (condor_q), and removal (condor_rm) seem to work fine, but…
Globus Universe: problems
It is not possible to have the input/output/error files on the submitting machine
Very difficult to diagnose errors
Condor-G is not able to provide fault tolerance and robustness (because Globus doesn't provide these features); fault tolerance only on the submitting side
Condor-G Architecture
[Diagram: the Personal Condor (Globus client) runs condor_submit, which invokes globusrun against the Globus front-end machine; the Jobmanager passes the jobs to the local resource management system (LSF/Condor/PBS/…), and job status is tracked by polling with globus_job_status.]
Possible solutions
Some improvements foreseen with Condor 6.3 (but they will not solve all the problems)
Persistent Globus jobmanager???
Direct interaction between Condor and the local resource management systems (LSF): necessary to modify the Condor startd
GlideIn: the only “ready-to-use” solution if robustness is considered a fundamental requirement
GlideIn
Condor daemons run on Globus resources; the local resource management systems are used only to run the Condor daemons
Robustness and fault tolerance
Use of the Condor matchmaking system: a viable solution if the goal is just to find idle CPUs, but what if we have to take into account other parameters (e.g. location of input files)???
Various changes have been necessary in the condor_glidein script
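Deploying the daemons is done with the condor_glidein script mentioned above; an invocation might look like this (a sketch from memory of the old condor_glidein interface: the -count option and the contact-string argument are assumptions and may differ between Condor versions):

```
% condor_glidein -count 10 lxpd.pd.infn.it/jobmanager-lsf
```

Once the daemons start, the Globus resource appears in the pool as an ordinary Condor machine, and jobs are matched to it through the normal matchmaking system.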
GlideIn
GlideIn tested with:
Workstation using the fork system call as job manager: seems to work
Condor pool: seems to work; Condor flocking is a better solution if authentication is not required
LSF cluster: problems (because Globus assumes SMP machines managed by LSF, while there are some problems with clusters); necessary to modify the Globus LSF scripts
Conclusions
Major problems related to scalability and fault tolerance with Globus: necessary to re-implement the GRAM service
The foreseen architecture doesn't work
Personal Condor able to provide robustness only on the submitting side