Oxana Smirnova, LCG/ATLAS/[email protected], 19, 2002, RHUL, ATLAS Software Workshop
ATLAS-EDG Task Force report
2002-09-03 [email protected] 2
Outline

- EDG overview
- ATLAS-EDG Task Force
- Use case
- Problems & solutions
- Summary

NB! Things are changing (improving) very rapidly; this report may become outdated tomorrow.
EU DataGrid project

- Started on January 1, 2001, to deliver by the end of 2003
- Aim: to develop Grid middleware suitable for High Energy Physics, Earth Observation and biology applications
- Development is based on existing tools, e.g. Globus, LCFG, GDMP etc.
- The core testbed consists of the central site at CERN and a few facilities across Western Europe; many more sites are foreseen to join later
- By now it has reached a stability level sufficient to execute production-scale tasks
EDG Testbed

- EDG is committed to creating a stable testbed to be used by applications for real tasks
- This started to materialize in mid-August… and coincided with the ATLAS DC1; ATLAS was given first priority
- Most sites are installed from scratch using the EDG tools (RedHat 6.2 based); Lyon has an installation on top of an existing farm; a lightweight EDG installation is also available
- Central element: the Resource Broker (RB), which distributes jobs between the resources
- Currently only one RB (at CERN) is available for applications; in the future there may be an RB per Virtual Organization (VO)
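From the user's point of view, the RB is driven with the EDG command-line tools on the UI machine. A minimal session might look like the sketch below (command names as in the EDG 1.x user interface; the job identifier placeholder and the comments are illustrative, not from the talk):

```shell
# Hypothetical EDG 1.x session on the UI machine
dg-job-submit myjob.jdl       # hand the JDL-wrapped job to the Resource Broker
dg-job-status <jobId>         # poll the state of the job id printed by dg-job-submit
dg-job-get-output <jobId>     # retrieve the output sandbox once the job is Done
```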
EDG functionality as of today
[Diagram, borrowed from Guido Negri's slides: EDG job and data flow between the UI, the Resource Broker (lxshare0393.cern.ch), Computing Elements (lxshare033.cern.ch, testbed010.cern.ch), the Replica Catalog (LDAP) and CASTOR via the CERN SE (lxshare0399.cern.ch). The job is described in JDL and submitted to the CEs as RSL; input is staged from CASTOR with rfcp and replicated between Storage Elements with GDMP or the Replica Manager (NFS-mounted storage); output is stored and registered the same way.]
ATLAS-EDG Task Force

- ATLAS is eager to use Grid tools for the Data Challenges
- ATLAS Data Challenges are already on the Grid (NorduGrid, USA)
- DC1/phase 2 (to start in October) is expected to be done using Grid tools to a larger extent
- The ATLAS-EDG Task Force was put together in August with the aims:
  - to assess the usability of the EDG testbed for the immediate production tasks
  - to introduce Grid awareness to the ATLAS collaboration
- The Task Force has representatives from both ATLAS and EDG: 40 members (!) on the mailing list, ca 10 of them working nearly full-time
- The initial task: to process 5 input partitions of Dataset 2000 on the EDG Testbed plus one non-EDG site (Karlsruhe); if this works, continue with other datasets
Task description (dataset 2000)

- Input: a set of generated events as ROOT files (each input partition ca 1.8 GB, 100,000 events); master copies are stored in CERN CASTOR
- Processing: ATLAS detector simulation using pre-installed software release 3.2.1
  - Each input partition is processed by 20 jobs (5000 events each)
  - Full simulation is applied only to filtered events, ca 450 per job
  - A full event simulation takes ca 150 seconds per event on a 1 GHz PIII processor
- Output: simulated events are stored in ZEBRA files (ca 1 GB per output partition); an HBOOK histogram file and a log file (stdout+stderr) are also produced
- Total: 9 GB of input, 2000 CPU-hours of processing, 100 GB of output
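The totals follow from the per-job numbers above; a quick back-of-the-envelope check (a Python sketch, not part of the original talk — it simply multiplies out the figures quoted on this slide):

```python
# Back-of-the-envelope check of the dataset 2000 totals quoted above.
partitions = 5            # input partitions
part_size_gb = 1.8        # GB per input partition
jobs_per_part = 20        # jobs per partition (5000 events each)
filtered_per_job = 450    # fully simulated (filtered) events per job
sec_per_event = 150       # seconds per event on a 1 GHz PIII
out_size_gb = 1.0         # GB per output partition

n_jobs = partitions * jobs_per_part                         # 100 jobs in total
input_gb = partitions * part_size_gb                        # total input volume
cpu_hours = n_jobs * filtered_per_job * sec_per_event / 3600
output_gb = n_jobs * out_size_gb                            # total output volume

print(round(input_gb, 1), round(cpu_hours), round(output_gb, 1))  # -> 9.0 1875 100.0
```

The exact figure is ca 1875 CPU-hours (ca 19 hours per job), consistent with the rounded "2000 CPU-hours" on the slide.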
Execution of jobs

- It was expected that we could make full use of the Resource Broker functionality: data-driven job steering, best available resources otherwise
- Input files are pre-staged once (copied from CASTOR and replicated elsewhere)
- A job consists of the standard DC1 shell script, very much the way it is done on a conventional cluster
- A Job Definition Language (JDL) is used to wrap up the job, specifying:
  - the executable file (script)
  - input data
  - files to be retrieved manually by the user
  - optionally, other attributes (MaxCPU, Rank etc.)
- Storage and registration of output files is part of the job
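A JDL wrapper of the kind described above might look like the following sketch. Only the attribute kinds (executable, input data, retrieved files, Rank) are named in the talk; every file name and value here is illustrative, not from the actual DC1 scripts:

```
// Hypothetical JDL sketch for one DC1 simulation job (values illustrative)
Executable    = "dc1.simulation.sh";        // the standard DC1 shell script
InputSandbox  = {"dc1.simulation.sh"};      // shipped with the job
StdOutput     = "dc1.log";
StdError      = "dc1.log";
OutputSandbox = {"dc1.log", "histo.hbook"}; // retrieved manually by the user
InputData     = {"LF:dc1.002000.input.0001.root"};  // drives data-based matching
Rank          = other.FreeCPUs;             // prefer sites with free CPUs
```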
Encountered obstacles

- EDG cannot replicate files directly from CASTOR and cannot register them in the Replica Catalog
  - Replication was done via the CERN SE; EDG is working on a better (though temporary) solution. The CASTOR team is writing a GridFTP interface, which will help a lot
- Big file transfers break off after 1.2–1.3 GB
  - A known Globus API problem, temporarily fixed by using plain Globus instead of the EDG tools
- Jobs were "lost" by the system after 20 minutes of execution
  - A known problem of the Globus software, temporarily fixed at the expense of frequent job submission
- Static information system: if a site goes down, it has to be removed manually from the index
  - Attempts are under way to switch to the dynamic hierarchical MDS; not yet stable due to Globus bugs
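The "plain Globus" workaround for the large-file problem amounts to bypassing the EDG replication tools and calling the Globus transfer client directly. A sketch of what such a transfer could look like (the `globus-url-copy` command is real Globus Toolkit 2; the hostname and paths are made up for illustration):

```shell
# Hypothetical direct GridFTP transfer of one ~1.8 GB input partition,
# bypassing the EDG replication tools (hostname and paths illustrative)
globus-url-copy -vb \
  gsiftp://some-se.example.org/flatfiles/dc1/input.0001.root \
  file:///scratch/dc1/input.0001.root
```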
Other minor problems

- Installation of ATLAS software: cyclic dependencies; external dependencies, especially on system software
- Authentication & authorization, users and services:
  - EDG cannot instantly accept a dozen new country Certificate Authorities; a possible (quick) solution: an ATLAS CA?
  - The default proxy lives only 12 hours; users keep forgetting to request longer ones to accommodate long jobs
- Documentation is abundant but not very user-oriented; things are improving as more users arrive
- Information system: very difficult to browse/search and to retrieve relevant info
- Data management: information about existing file collections is not easy to find; management of output data is mostly manual (cannot be done via JDL)
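Since a full DC1 job runs for roughly 19 hours, the 12-hour default proxy is not enough; requesting a longer-lived proxy up front avoids the credential expiring mid-job. A sketch with the standard Globus Toolkit 2 proxy commands (the chosen lifetime is illustrative; check the flags against your Globus version):

```shell
# Request a proxy that outlives a ~19-hour simulation job
grid-proxy-init -hours 48     # instead of the 12-hour default
grid-proxy-info -timeleft     # remaining validity, in seconds
```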
Achievements:

- A team of hard-working people across Europe
- ATLAS software (release 3.2.1) is packed into relocatable RPMs, distributed and validated elsewhere
- The DC1 production script is "gridified"; a submission script is produced
- A user-friendly testbed status monitor is deployed
- 5 Dataset 2000 input files are replicated to 5 sites (2 at each)
- After fixing the "long jobs" problem, 50% of the planned challenge was performed (5 researchers × 10 jobs); unfortunately, only the CERN testbed was fully available
- With the rest of the testbed being fixed, jobs are getting scheduled and executed elsewhere
- Second test: 4 input files (ca 400 MB each) replicated to 4 sites; 250 jobs submitted, adjusted to run ca 4 hours each. The jobs were distributed across the whole testbed by the Resource Broker
Summary

Advantages of the Grid:
- Possibility to execute tasks and move files over a distributed computing infrastructure using one single personal certificate (no need to memorize dozens of passwords)
- Possibility to distribute the workload adequately and automatically, without logging in explicitly to each remote system
- Possibility to do worldwide production in a perfectly coordinated way, using identical software (RPMs), scripts and databases

Where we are now:
- Several Grid toolkits are on the market
- EDG is probably the most elaborate, but still in development
- This development goes much faster with the help of users running real applications
- The common efforts of the ATLAS-EDG Task Force proved that it is already possible to execute real tasks on the EDG Testbed
- Thanks to all the members for the efforts so far, but there's more to be done!
Ingo Augustin
Vandy Berten
Jean-Jacques Blaising
Frederic Brochu
Stephen Burke
Serban Constantinescu
Francois Etienne
Michael Gardner
Luc Goossens
Marcus Hardt
Frank Harris
Fabio Hernandez
Bob Jones
Roger Jones
Christos Kanellopoulos
Andrey Kiryanov
Peter Kunszt
Emanuele Leonardi
Cal Loomis
Fairouz Malek-Ohlsson
Gonzalo Merino
Armin Nairz
Guido Negri
Steve O'Neale
Laura Perini
Gilbert Poulard
Alois Putzer
Di Qing
Mario Reale
David Rebatto
Zhongliang Ren
Silvia Resconi
Alessandro De Salvo
Markus Schulz
Oxana Smirnova
Chun Lik Tan
Jeff Templon
Stan Thompson
Luca Vaccarossa
Peter Watkins
No animals were harmed in the production tests
MMII