ARDA Prototypes
Julia Andreeva/CERN
On behalf of the ARDA team
CERN
LHCC Comprehensive Review 22.11.2004
Julia Andreeva , CERN 2
Overview
• Main directions of the ARDA activities
• Experience with the gLite middleware
• ARDA prototypes, status and plans
• Conclusions
ARDA and HEP experiments
ARDA sits between the experiment frameworks and the Grid middleware (EGEE gLite, LCG2):
• LHCb: Ganga, Dirac, Gaudi, DaVinci…
• ALICE: ROOT, AliRoot, Proof…
• CMS: Cobra, Orca, OCTOPUS…
• ATLAS: Dial, Ganga, Athena, Don Quijote
ARDA is an LCG project whose main task is to enable LHC analysis on the GRID
Middleware Prototype
• Available to us since May 18th
– In the first month, many problems connected with the stability of the services and procedures
– At that point just a few worker nodes were available
– Most important services are available: file catalog, authentication module, job queue, metadata catalog, package manager, Grid access service
– A second site (Madison) has been available since the end of June
– CASTOR access to the actual data store
• Currently 34 worker nodes are available at CERN:
– 10 nodes (RH7.3, PBS)
– 20 nodes (low end, SLC, LSF)
– 4 nodes (high end, SLC, LSF)
• 1 node is available in Wisconsin
• Number of CPUs will increase
• Number of sites will increase
Authentication and authorization
• gLite uses Globus 2.4 Grid certificates (X.509) to authenticate and authorize; the session is not encrypted
• VOMS is used for VO management
Unfortunately, getting access to gLite is still often painful for a new user due to registration problems: it takes a minimum of one day, but can take up to two weeks!
Accessing gLite
• Access through the gLite shell
– User-friendly shell implemented in Perl
– Provides a set of Unix-like commands and a set of gLite-specific commands
• Perl API
– No API to compile against, but the Perl API is sufficient for tests, though it is poorly documented
[Diagram: the Perl shell talks GSI to the Grid Access Service (GAS), which fronts the file catalogue, storage manager and compute element]
Workload Management System
• ARDA has been evaluating two WMSs
• WMS derived from AliEn – Task Queue (available since April)
– pull model
– integrated with the gLite shell, file catalog and package manager
• WMS derived from EDG (available since the middle of October)
– currently push model (pull model not yet possible but foreseen)
– not yet integrated with other gLite components (file catalogue, package manager, gLite shell)
WMS observations
• Integration of the WMS with the file catalog is very useful
– Definition of input and output data
– Specification of input data by file name and by metadata queries
– Job splitting driven by data in the file catalog
• Integration of the WMS with package management is very useful
– The service character of package management provides on-demand installation
• Full access to debugging information is very important
– Stdout/stderr of executing jobs
– System information
• Lightweight deployment of the client interface
– Client has to be easy to install
– Client should work behind firewalls and NAT routers
• Worker nodes should be shared between the deployed WMSs
• As long as several WMSs are deployed, their usage has to be transparent for the user (same JDL syntax, worker nodes accessible through both systems, the same functionality, and integration with the other gLite services)
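The difference between the two evaluated WMSs is the scheduling model. A minimal sketch of the pull model used by the AliEn-derived Task Queue (class and field names here are illustrative, not the real gLite interfaces): jobs wait in a central queue until a worker node asks for work, instead of a broker pushing jobs to nodes.

```python
from collections import deque

class TaskQueue:
    """Central queue in the pull model: jobs wait until a worker asks for one."""
    def __init__(self):
        self._jobs = deque()

    def submit(self, jdl):
        self._jobs.append(jdl)

    def pull(self, worker_id):
        # Worker-initiated matching: hand out the next waiting job, or None.
        return self._jobs.popleft() if self._jobs else None

queue = TaskQueue()
queue.submit({"Executable": "analysis.sh", "InputData": "lfn:/cms/run1"})
queue.submit({"Executable": "analysis.sh", "InputData": "lfn:/cms/run2"})

# Two worker nodes drain the queue independently of any central scheduler.
assigned = {w: queue.pull(w) for w in ("wn01", "wn02")}
```

In the push model the broker would instead select a node per job up front; in the pull model a slow or firewalled node simply stops asking, which is one reason the client-side deployment constraints above matter.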
Job submission
Steps required for submitting a user job to gLite:
- Register the executable in the user bin directory
- Create a JDL file defining the executable, required packages, input and output files, and possibly some additional requirements
- Run the submit command providing the JDL file as input
Straightforward; we did not experience any problems (apart from system stability)
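For illustration, a JDL file for the steps above might look roughly like this (AliEn-style syntax; the attribute values and the package/file names are purely hypothetical):

```
# Illustrative JDL only - values are invented for the example
Executable = "myAnalysis.sh";
Packages   = { "DaVinci::v12r2" };
InputData  = { "lfn:/lhcb/data/run001/file01.dst" };
OutputFile = { "histos.root" };
Split      = "file";
```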
Advanced features for job submission tested by ARDA
• Job splitting implemented by gLite is based on the gLite file catalogue LFN hierarchy
– This functionality is widely used in the ARDA prototypes
– A different job-splitting policy (on the file, directory, or SE level) can be chosen by the user (file catalog snapshots from CMS and LHCb were used)
• An additional advantage is the use of only one master job ID for tracing the processing of all sub-jobs belonging to the same master job. Output files of all sub-jobs are collected in the master job's "proc" directory.
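The directory-level splitting policy and the single master job ID can be sketched as follows (a simplified model; the function names and the "/proc/<master>" layout are illustrative stand-ins for the gLite behaviour described above):

```python
import os
from collections import defaultdict

def split_by_directory(lfns):
    """Group logical file names by their catalogue directory (one sub-job each)."""
    groups = defaultdict(list)
    for lfn in lfns:
        groups[os.path.dirname(lfn)].append(lfn)
    return groups

def make_subjobs(master_id, lfns):
    """One master job ID traces all sub-jobs; outputs land under its proc dir."""
    subjobs = []
    for i, (directory, files) in enumerate(sorted(split_by_directory(lfns).items())):
        subjobs.append({
            "id": f"{master_id}-{i}",
            "input": files,
            "output_dir": f"/proc/{master_id}/{i}",  # collected per master job
        })
    return subjobs

lfns = ["/cms/run1/a.root", "/cms/run1/b.root", "/cms/run2/c.root"]
jobs = make_subjobs("42", lfns)
```

Splitting at the SE level would group by the storage element hosting each replica instead of by the LFN directory; the bookkeeping stays the same.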
Job Submission: Stability
Job queues monitored at CERN every hour: 80% success rate (the probe jobs don't do anything real)
In recent weeks general instability was observed; testbed support cannot remain the responsibility of essentially a single person
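The hourly probe-job monitoring boils down to two numbers: the overall success rate and how long the bad stretches are. A small sketch of that bookkeeping (the probe results here are made-up data, not the actual CERN measurements):

```python
def success_rate(results):
    """Fraction of probe jobs that finished OK."""
    done = sum(1 for r in results if r == "done")
    return done / len(results) if results else 0.0

def longest_outage(results):
    """Longest run of consecutive failed probes - a rough instability measure."""
    worst = run = 0
    for r in results:
        run = run + 1 if r != "done" else 0
        worst = max(worst, run)
    return worst

# Example day: 8 good probes, then 2 failures - an 80% success rate.
hourly = ["done"] * 8 + ["failed"] * 2
rate = success_rate(hourly)
```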
Data Management
ARDA has been evaluating two DMSs
• gLite File Catalog (derived from AliEn) (deployed in April)
– Allowed access to experiment data from CERN CASTOR and – with low efficiency – from the Wisconsin installation
– Mainly using RFIO
– LFN name space is organized as a very intuitive hierarchical structure
– MySQL backend
• Fireman File Catalogue (deployed in November)
– Just delivered to us
– gLiteIO
– Oracle backend
File catalogue performance tests
[Plots: time to completion [s] (with errors) vs number of clients (0–100) for selecting 2.5k files out of 10k; file-catalog insertion and metadata-attach rates [files/s] vs number of files (1k–100k)]
Good performance due to streaming. Finding the 2500 matching entries in a 10000-entry directory with 80 concurrent queries: 0.35 s/query, 2.6 s startup time.
• Fireman performance tests are currently ongoing
• The gLite catalogue performs well
gLiteIO
• We started to study gLiteIO as soon as it became available to us
– ARDA contributed to gLiteIO development (support of AIOD integration)
• Some key requirements: gLiteIO has to be rock solid!
• High performance!
• Graceful error recovery!
• No data corruption, even under high load and high concurrency!!!
• Tests are currently ongoing
Package management
• Multiple approaches exist for handling the experiment software and users' private packages on the Grid. Two extremes:
– "Static": pre-installation of the experiment software is done by a site manager, who then publishes the installed software. The installation resides "forever" in the shared area and can be removed only by the site manager. A job can run only on a site where the required package is preinstalled.
– "Dynamic": installation is done on demand at the worker node before the job assigned to that node starts execution. The installation can be removed as soon as the job execution is over.
• The current gLite package management implementation can handle "lightweight" installations, close to the second approach. The gLite package manager was tested by the ARDA team for this kind of installation.
• Clearly more work has to be done to satisfy the different use cases
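The "dynamic" extreme can be sketched as a small worker-node cache: install a package on demand before the job runs, remove it when the job is over. This is an illustrative model only (class, method and package names are invented), not the gLite package manager API.

```python
import os
import shutil
import tempfile

class WorkerNodeCache:
    """'Dynamic' policy sketch: on-demand install before the job starts,
    removal once the job is over (the opposite extreme of the site-managed
    'static' pre-installation in a shared area)."""
    def __init__(self):
        self.area = tempfile.mkdtemp(prefix="pkg-")
        self.installed = set()

    def ensure(self, package):
        """Install the package into the node-local area if not yet present."""
        if package not in self.installed:
            os.makedirs(os.path.join(self.area, package), exist_ok=True)
            self.installed.add(package)

    def cleanup(self):
        """Free the worker node again after the job finishes."""
        shutil.rmtree(self.area)
        self.installed.clear()

cache = WorkerNodeCache()
cache.ensure("DaVinci-v12r2")   # hypothetical experiment package
ran_with = sorted(cache.installed)
cache.cleanup()
```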
gLite related ARDA activities: Metadata
• Modern file systems have metadata attached to files/directories
• gLite has provided a prototype interface and implementation, mainly for the Biomed community
• The gLite file catalog has some metadata functionality (tested by ARDA)
– Information describing file properties (file metadata attributes) can be defined in a tag attached to a directory in the file catalog. Any number of tag tables can be attached to the corresponding directory table.
– Access to the metadata attributes is via the gLite shell or the Perl API
– Knowledge of the schema is required
– No schema evolution
• Can these limitations be overcome?
[Diagram: a Perl client process talks via GAS/GSI to the server; text SQL commands go through the metadata interface to the MySQL backend holding the file catalog and metadata, with results returned on stdout]
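The directory-tag model described above can be sketched in a few lines: a schema (tag) is attached to a directory, and files in that directory carry values only for attributes the tag declares, which is why schema knowledge is required up front. This is an in-memory illustration, not the gLite catalogue interface.

```python
class Catalogue:
    """Sketch of the gLite directory-tag metadata model (illustrative only)."""
    def __init__(self):
        self.tags = {}     # directory -> list of attribute names (the schema)
        self.values = {}   # (directory, file name) -> {attribute: value}

    def add_tag(self, directory, attributes):
        """Attach a tag (attribute schema) to a directory."""
        self.tags[directory] = attributes

    def set_md(self, directory, name, **attrs):
        """Set metadata on a file; only attributes declared in the tag are allowed."""
        unknown = set(attrs) - set(self.tags[directory])
        if unknown:  # no schema evolution: undeclared attributes are rejected
            raise KeyError(f"attributes not in tag: {unknown}")
        self.values[(directory, name)] = attrs

    def query(self, directory, attr, value):
        """Return file names in the directory whose attribute matches the value."""
        return [f for (d, f), md in self.values.items()
                if d == directory and md.get(attr) == value]

cat = Catalogue()
cat.add_tag("/cms/run1", ["energy", "trigger"])
cat.set_md("/cms/run1", "a.root", energy=7000)
cat.set_md("/cms/run1", "b.root", energy=900)
hits = cat.query("/cms/run1", "energy", 7000)
```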
gLite related ARDA activities: Metadata studies
• ARDA preparatory work
– Stress testing of the existing experiment metadata catalogues was performed
– The existing implementations turned out to share similar problems
• ARDA technology investigation
– Usage of extended file attributes in modern file systems (NTFS, NFS, EXT2/3 (SLC3), ReiserFS, JFS, XFS) was analyzed: a sound POSIX standard exists!
– Presentation in LCG-GAG and discussion with gLite
– As a result of the metadata studies, a prototype metadata catalogue was developed
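The extended-attribute interface that the study points to is small: set, get and list named attributes per path. A minimal in-memory mirror of those POSIX calls (real file systems expose the same operations via setxattr/getxattr/listxattr; this class is just an illustration of the interface shape):

```python
class XattrStore:
    """In-memory mirror of the POSIX extended-attribute calls the metadata
    study identified as a sound standard (illustrative model only)."""
    def __init__(self):
        self._attrs = {}

    def setxattr(self, path, name, value):
        self._attrs.setdefault(path, {})[name] = value

    def getxattr(self, path, name):
        return self._attrs[path][name]

    def listxattr(self, path):
        return sorted(self._attrs.get(path, {}))

fs = XattrStore()
# By convention, user-defined attributes live in the "user." namespace.
fs.setxattr("/data/run1.root", "user.energy", b"7000")
fs.setxattr("/data/run1.root", "user.trigger", b"mu")
names = fs.listxattr("/data/run1.root")
```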
Metadata prototype performance tests
[Plots comparing the ARDA prototype with the gLite(AliEn) catalogue: time to completion [s] vs number of clients (20–100) for selecting 2.5k files out of 10k, with a "Crashes" marker on one of the curves; metadata-attach rate [files/s] vs files per directory (1k–100k)]
Comparing performance of the metadata catalogue prototype and the gLite catalogue.
Tested operations:
- query the catalogue by meta attributes
- attach meta attributes to files
Prototypes overview

LHC Experiment | Main focus                                 | Basic prototype component       | Experiment analysis application framework | Middleware prototype
LHCb           | GUI to Grid                                | GANGA                           | DaVinci                                   | gLite
ALICE          | Interactive analysis                       | PROOF, ROOT                     | AliROOT                                   | gLite
ATLAS          | High level service                         | DIAL                            | Athena                                    | gLite
CMS            | Use of maximum native gLite functionality  | Aligned with the APROM activity | ORCA                                      | gLite
LHCb
Basic component of the prototype defined by the experiment:
GANGA – Gaudi/Athena aNd Grid Alliance
[Diagram: the GANGA GUI prepares job options and algorithms for a GAUDI program, uses collective & resource Grid services and the experiment book-keeping DB, and handles submitting jobs, monitoring and retrieving results – a framework for job creating, submitting and monitoring]
• ARDA contributions:
– GANGA release management and software process
• CVS, Savannah, …
– GANGA: participating in the development driven by the GANGA team
– GANGA-gLite: integrating GANGA with gLite
• Enabling job submission through GANGA to gLite
• Job splitting and merging
• Retrieving results
– GANGA-gLite-DaVinci: enabling real analysis jobs (DaVinci) to run on gLite using the GANGA framework
• Running DaVinci jobs on gLite
• Installing and managing LHCb software on gLite using the gLite package manager
LHCb
Current Status
• A GANGA job submission handler for gLite has been developed
• DaVinci jobs submitted through GANGA run on gLite
• Submission of user jobs is working
• A command line interface (CLI) prototype for GANGA has been developed
• Jobs can be submitted using the gLite job-splitter
A demonstration of the LHCb end-to-end analysis prototype was given at the 19th LHCb Software Week (two weeks ago)
LHCb
• Related activities:
– GANGA-DIRAC (LHCb production system)
• Convergence with GANGA (components/experience)
• Submitting jobs to DIRAC using GANGA
– GANGA-Condor
• Enabling submission of jobs through GANGA to Condor
– Metadata catalog (Bookkeeping)
• Performance tests
• Collaboration going on
• Interest in our prototype
• Short term plans
– Involve people from the LHCb physics community (a limited number) in testing, to get feedback from the user side
• One person (a PhD student) is already involved
– Integrate LHCb software releases with the gLite package manager
ALICE
[Diagram: a user session connects to a PROOF master server, which coordinates PROOF slaves at sites A, B and C]
ALICE/ARDA is evolving the ALICE analysis system
Basic components of the prototype defined by the experiment: ROOT and PROOF
Analysis approach:
– The ALICE experiment provides the UI and the analysis application (AliROOT)
– The GRID middleware gLite provides all the rest
ALICE
The interactive analysis session was presented at Super Computing 2004
gLite related activities: C/C++ API
• The lack of a C/C++ API represents a problem for the experiment prototypes: a C++ access library for gLite and a C library for POSIX-like IO are being developed by ARDA
• Idea: create an interface sending text commands to a server:
– UUEncode strings
– Send strings via gSOAP
– Authentication via GSI (Globus TK3)
– Encrypt with SSL (cache credentials on the service level, providing a stateful authenticated channel)
• Large performance increase compared to SOAP calls with structures (multithreaded server with cached authentication)
• Protocol quite proprietary…
• Essential for the ALICE prototype (but generic enough to be interesting for anybody)
[Diagram: the client application calls the C API (POSIX); a GSI/SSL security wrapper UUEncodes the text commands and ships them via gSOAP to the server's security wrapper and on to the server application/service]
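The "text commands over SOAP" idea above is easy to sketch: serialize the command and its arguments into one encoded text token on the client, decode it on the server. Base64 stands in here for the UUEncoding used by the real protocol, and the framing (NUL-separated fields) is an invented detail for the illustration.

```python
import base64

def encode_command(cmd, args):
    """Client side: serialize a command as one text token.
    (The real protocol UUEncodes the strings before shipping them via gSOAP;
    base64 plays that role in this sketch.)"""
    plain = "\0".join([cmd, *args])
    return base64.b64encode(plain.encode()).decode("ascii")

def decode_command(token):
    """Server side: recover the command and its arguments from the token."""
    cmd, *args = base64.b64decode(token).decode().split("\0")
    return cmd, args

token = encode_command("ls", ["/alice/data"])
cmd, args = decode_command(token)
```

Sending one opaque string per call is what avoids the per-field marshalling cost of SOAP calls with structures, at the price of a proprietary wire format.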
Current status
• Developed the gLite C++ API and API Service (providing a generic interface to any GRID service)
• The C++ API is integrated into ROOT (will be added to the next ROOT release). As a result, job submission and job status queries for batch analysis can be done from inside ROOT.
• A Bash interface for gLite commands with catalogue expansion has been developed
• The first version of the interactive analysis prototype is ready
• The batch analysis model is improved:
– submission and status query are integrated into ROOT
– job splitting is based on XML query files
– the application (AliRoot) reads files using xrootd without prestaging
Short term plans
• Create a generic API service accessible to all ALICE users for batch analysis using the bash CLI, for the ALICE data challenge phase III
• Make the interactive prototype available to ALICE users
• Create default XML datasets and default JDLs for analysis
ALICE
ATLAS
Basic component of the prototype
• DIAL – Distributed Analysis of Large datasets
ARDA contribution:
• Integrating DIAL with gLite (the main strategic line in ATLAS distributed analysis)
• Enabling ATLAS analysis jobs (Athena application) submitted through DIAL to run on gLite
• Integrating gLite with ATLAS data management based on Don Quijote
• Tests on AMI (metadata catalogue)
• Contribution to the combined test beam
• Improvements to AtCom, a GUI for job definition (AMI), submission and monitoring
ATLAS
Don Quijote and gLite
[Diagram: a DQ client talks to DQ servers, each sitting in front of the RLS and SE of one Grid flavour – GRID3, Nordugrid, LCG and gLite]
• Current status:
• The DIAL server has been adapted to the CERN environment and installed at CERN
• A first implementation of the gLite scheduler for DIAL is available
• Still depends on a shared file system for inter-job communication
• ATHENA jobs submitted through DIAL run on the gLite middleware
• Integration of gLite with ATLAS file management based on Don Quijote is in progress; a first prototype is ready
• Realistic ATHENA jobs were executed on the gLite prototype by non-ARDA users (physicists). See next transparency.
Future plans: evolve the ATLAS prototype to work directly with the gLite middleware: authentication and seamless data access
ATLAS Combined Test Beam
Example: ATLAS TRT data analysis done by PNPI St. Petersburg
[Plot: number of straw hits per layer]
Real data processed on gLite (standard RecExTB):
- data from CASTOR
- processed on a gLite worker node
CMS
• Ongoing development of the first end-to-end prototype for enabling CMS analysis jobs on gLite
• The main strategy is to use as much native middleware functionality as gLite can provide, and only for very CMS-specific tasks to develop something on top of the existing middleware
[Diagram: a workflow planner with a gLite back-end and command line UI. The dataset and owner name defining a CMS data collection are looked up in RefDB, which points to the PubDB where the POOL catalog for that data collection is published. The POOL catalog and a set of COBRA META files are registered in the gLite catalog; the planner creates and submits jobs to gLite, queries their status and retrieves the output.]
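The planner's resolution chain (dataset/owner → RefDB → PubDB → files → jobs) can be sketched as follows. The dictionaries, the dataset/owner names and the per-file job granularity are invented stand-ins for the real CMS databases and the planner's actual policy:

```python
# Hypothetical stand-ins for the CMS databases:
REFDB = {("ttbar", "prod_owner"): "pubdb.site.example"}   # collection -> PubDB
PUBDB = {"pubdb.site.example":                            # PubDB -> published files
         ["/store/ttbar/f1.root", "/store/ttbar/f2.root"]}

def plan(dataset, owner):
    """Workflow-planner sketch: dataset + owner identify the CMS data
    collection, RefDB points to the PubDB publishing its POOL catalogue,
    and jobs are created for submission to gLite (one per file here)."""
    pubdb = REFDB[(dataset, owner)]
    files = PUBDB[pubdb]
    return [{"file": f, "status": "submitted"} for f in files]

jobs = plan("ttbar", "prod_owner")
```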
CMS – Using MonALISA for user job monitoring
Demonstrated at Super Computing 2004
CMS
PhySh (Physicists' Shell) should provide an entry point for CMS analysis and the handling of physics data. The idea behind PhySh is to combine information from different CMS DBs (RefDB, PubDB, SCRAM, PhEDEx) into a single virtual file system and to provide a file-handling-like interface to the user.
Related activities
- Develop a job submission service for PhySh
- Integrate PhySh with gLite middleware components such as the file and metadata catalogues
Data management is a vital task for CMS
- The evolution of PubDB from the experience of RefDB is of high interest for ARDA because it provides effective access to the data not only for a production system but for individual users
- Participating in the development of PubDB (Publication DB), distributed databases for publishing information about available data collections CMS-wide
- Participating in the redesign of RefDB (Reference DB), the CMS metadata catalog and production book-keeping database
Current status:
• ORCA analysis jobs (real user code) generated by the CMS end-to-end prototype, using the gLite job-splitting functionality and instrumented for MonALISA monitoring, ran successfully on the gLite testbed
• Work to enable merging of the output files produced by the child sub-jobs belonging to the same parent master job is under way
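The merging step under development amounts to collecting the per-sub-job outputs from the master job's "proc" directory and combining them. A minimal sketch (the real outputs are ORCA/ROOT files; plain lists and a dict stand in for the proc directory here):

```python
def merge_outputs(subjob_outputs):
    """Concatenate the per-sub-job records collected under the master job's
    'proc' directory into one result, in sub-job order."""
    merged = []
    for sub_id in sorted(subjob_outputs):
        merged.extend(subjob_outputs[sub_id])
    return merged

# Outputs of two sub-jobs of master job 42, as collected in its proc directory.
proc_dir = {"proc/42/0": ["evt1", "evt2"], "proc/42/1": ["evt3"]}
result = merge_outputs(proc_dir)
```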
Future plans:
• Give a demonstration of the first working version of the CMS prototype at the next CMS week at the beginning of December
• Involve CMS users (a limited number) in testing the first version of the prototype
• Use the new version of the gLite package manager (as soon as it is available) for handling the "heavy" CMS software distributions on gLite
• Depending on the CMS decision, either evolve this prototype according to user feedback, or integrate it with the tool(s) CMS chooses for the ARDA prototype
CMS
Conclusions and outlook
• ARDA uses all components made available on the gLite prototype
– Experience and feedback
• First versions of the analysis systems are being demonstrated
– We look forward to having users!
BACKUP Transparencies: Metadata catalogue prototype
BACKUP Transparencies: ATLAS
Basic component of the prototype
• DIAL – Distributed Analysis of Large datasets
[Diagram: a user analysis framework (ROOT, JAS, SEAL) defines a Task (application + code) over Datasets of event data, summary data and tuples; the Scheduler does the splitting into Job1 and Job2 (Athena, dialpaw, ROOT) against Dataset1/Dataset2 and collects Result1/Result2 into the overall Result]