AliEn development
Miguel Martinez Pedreira
2+Contents
Basics and concepts
CVMFS migration
Recent changes
ToDos / Ideas
AliEn development - Miguel Martinez Pedreira
3+What is AliEn?
AliEn (ALICE Environment) is a lightweight Open Source Grid Framework built around other Open Source components, using a combination of a Web Service and a Distributed Agent Model, designed to comply with the offline world of a HEP experiment
  Massive amounts of data imply distributing their storage and processing
It started within the ALICE Off-line Project at CERN and constitutes the production environment for simulation, reconstruction, and analysis of physics data of the ALICE Experiment
The current status of the ALICE grid operation can be found on the MonALISA Grid Monitoring pages
4+Virtual Organisations
Users (job submission for data analysis) + Central management (the “brain” of the GRID) + Sites (the “muscle” of the GRID)
ALICE numbers: ~35K running jobs on average (~200K/day), ~80 sites, ~60 SEs, ~1000M entries in the catalogue
5+Distributed Analysis
6+ALICE Grid
7+AliEn summary
3-layer system that leverages the deployed resources of the underlying WLCG infrastructures and services
Interfaces to AliRoot via a ROOT plugin (TAlien) that implements the AliEn API
Complex workflows, including distributed analysis, built on top of the AliEn API
Used by ALICE and PANDA
8+CVMFS
9+CVMFS migration
Started in the second half of August 2013
First sites: test CE at CERN (local) and BITP (grid)
  Running production-like JDLs (alitrain/aliprod)
  CVMFS setup then tested on a ‘real’ site
Some combinations of packages showed problems with the environment that was being set for the process
  First fix
  IP list from APISCONFIG
  Paths
Wiki created with the steps to follow before starting the migration
10+CVMFS migration
Then started with some other sites
  Mainly big sites, one by one; overall quite smooth
Decided to push everyone
  Spam time (sorry)
  I haven’t counted myself, but by 7 November ~800 mails had been sent (according to Maarten)
Reactions
  Many sites reinstalling to WLCG SL6
  Sites waiting for CVMFS to be installed for quite a while...
  Several interface updates: CONDOR, ARC
  CREAM was also updated to add robustness and deal with specific setups
11+CVMFS migration
Deadline modified after October push: end 2013
[Figure: chart by month, July to January; vertical axis from 0 to 70]
12+CVMFS migration
PackMan service no longer needed
Issues
  Fix for the ROOT/AliRoot paths when a local ROOT is installed (conflict)
  Missing libraries: libtcl, gfortran, libtermcap...
    Some sites don’t use HEP_OS_libs
  ‘Warm-up’ time with ERROR_E?
    I/O Error
    squid and cache misconfigurations? Not easy to debug...
  Efficiencies?
Thanks to admins ;-) !
13+CVMFS migration
14+CVMFS migration
15+CVMFS migration
16+Recent changes: /tmp issue
Detected some jobs that were analyzing the wrong data
  Taken from the file ‘wn.xml’
Job tokens
  Not unique
  Now using a new service that creates unique tokens
Other ideas led to a more robust JobAgent
  Check sandbox use and creation
  Check chdir
  Print more info
  Single open/read of the XML file
The problem turned out not to be in AliEn
  Jobs waiting for the same CVMFS content were writing concurrently
  The investigation helped to create a model of the JobAgent flow: to be ‘digitized’
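The unique-token point can be illustrated with a minimal Python sketch. This is not the AliEn token service; the function name `new_job_token` and the in-memory `issued` set are hypothetical stand-ins, assuming only that a token must never be handed out twice.

```python
import secrets
from typing import Set


def new_job_token(issued: Set[str], nbytes: int = 16) -> str:
    """Return a random hex token that has not been issued before.

    `issued` is a hypothetical in-memory registry; a real service would
    keep this state in a database with a uniqueness constraint.
    """
    while True:
        token = secrets.token_hex(nbytes)  # 2 * nbytes hex characters
        if token not in issued:
            issued.add(token)
            return token
```

With 16 random bytes a collision is already extremely unlikely; the explicit check simply makes uniqueness a guarantee rather than a probability.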
17+Recent changes
Investigating the errors
  ERROR_E
    TTL
    Memory
    Idle
    Couldn’t get Catalogue...
Still the winner is ERROR_V
Focusing on what we don’t understand, or on the problems that most affect the system overall
18+Recent changes: ERROR_E
Many jobs failing for running over the TTL
  ProxyTTL=“1”
    Job runtime set to the proxy timeleft on the job minus 10 minutes
  Some still continue to fail...
    Proxy timeleft unavailable
Memory issues (or other)
  Saving the output to check logs
    OutputErrorE, same as Output
    then registerOutput <jobId>
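The ProxyTTL rule is simple arithmetic, sketched here in Python. The function name and the fallback to the job's own TTL when the proxy timeleft is unavailable are assumptions made for illustration; the slide only states the "timeleft minus 10 minutes" rule.

```python
from typing import Optional

SAFETY_MARGIN_S = 10 * 60  # the 10 minutes mentioned on the slide


def allowed_runtime(job_ttl_s: int, proxy_timeleft_s: Optional[int]) -> int:
    """Runtime (seconds) a job may use when ProxyTTL is requested.

    Assumption: if the proxy timeleft cannot be determined, fall back to
    the TTL declared by the job.
    """
    if proxy_timeleft_s is None:
        return job_ttl_s
    # Never return a negative runtime if the proxy is about to expire.
    return max(0, proxy_timeleft_s - SAFETY_MARGIN_S)
```

For example, a proxy with one hour left gives `allowed_runtime(86400, 3600) == 3000` seconds.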
19+Recent changes: ERROR_E
Some jobs fail to get a catalogue instance
  Found a race condition
    Jobs WAITING for a while, then moved to ZOMBIE just after ASSIGNED
    Optimizer was using a timestamp-based DB field that wasn’t properly updated...
  A small portion still fails
    Added a full trace of the catalogue creation in the JobAgent
    “Bad hostname” or “Host undefined”
    Stuck after getting the JDL
    Under investigation
20+Recent changes: INSERTION
[Diagram: job insertion flow, before the change]
Submit
Splitting (Job Manager): check JDL, inserting; analyse split fields
Splitting Optimizer: create baskets (getting all files; `whereis` per file; if maxsize, `ls` per file); submit subjobs
Split -> Inserting
Inserting Optimizer: copyInput (`whereis` per file); sizes check (InputDownload, Workdirectorysize); add SE requirements; check SE-CE compatibility; sizes check
Waiting: insert jobagent
21+Recent changes: INSERTION
[Diagram: job insertion flow, after the change]
Submit
Splitting (Job Manager): check JDL, inserting; analyse split fields
Splitting Optimizer: create baskets (getting all files; `whereis` per file, and cached!; sizes check, no InputDownload or InputBox); subjobs to WAITING
Split -> Waiting: add/check SE-CE compatibility
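The "`whereis` per file (and cached!)" step amounts to memoizing the replica lookup. A minimal Python sketch follows; the `FAKE_CATALOGUE` dictionary, the file names and the SE names are hypothetical placeholders for the real catalogue query.

```python
from functools import lru_cache
from typing import Tuple

# Hypothetical stand-in for the file catalogue: LFN -> SEs holding a replica.
FAKE_CATALOGUE = {
    "/alice/sim/file1.root": ["SE_B", "SE_A"],
    "/alice/sim/file2.root": ["SE_A"],
}

# Counts how many times the (expensive) catalogue is really queried.
LOOKUPS = {"count": 0}


@lru_cache(maxsize=None)
def whereis(lfn: str) -> Tuple[str, ...]:
    """Return the SEs holding `lfn`, querying the catalogue once per file."""
    LOOKUPS["count"] += 1
    return tuple(sorted(FAKE_CATALOGUE.get(lfn, ())))
```

Repeated calls for the same file are answered from the cache, so later steps that need the same replica list do not query the catalogue again.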
22+Recent changes: INSERTION
23+Recent changes: INSERTION
24+Recent changes: baskets
Some months ago, the file/job distribution was found to be very ugly
  Issue in the catalogue: entries with duplicated pfns
    Cleanup? A fix in the optimizer deals with it
Improved the basket creation
  Step 1: file transfers to have the data in the same SEs (Markus/Jan)
    Not so good for the grid balance
  Step 2: creating big collections for several runs (Costin)
    Balanced grid
FileBroker not needed anymore...
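As a rough illustration of basket creation, the Python sketch below groups input files by the set of SEs holding their replicas and then cuts each group into baskets of bounded size. This is an assumed simplification, not the AliEn optimizer logic; `make_baskets` and its parameters are hypothetical names. Using a set for the SEs also makes duplicated replica entries harmless.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple

Basket = Tuple[FrozenSet[str], List[str]]


def make_baskets(
    replicas: Iterable[Tuple[str, Iterable[str]]], max_files: int
) -> List[Basket]:
    """Group files that share the same SEs, at most `max_files` per basket."""
    if max_files < 1:
        raise ValueError("max_files must be at least 1")
    groups: Dict[FrozenSet[str], List[str]] = defaultdict(list)
    for lfn, ses in replicas:
        # frozenset drops duplicated SE entries and ignores their order.
        groups[frozenset(ses)].append(lfn)
    baskets: List[Basket] = []
    for ses, lfns in groups.items():
        for start in range(0, len(lfns), max_files):
            baskets.append((ses, lfns[start:start + max_files]))
    return baskets
```

Each resulting basket can then be matched to a site close to one of its SEs, which is why having the data of a collection on the same SEs matters.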
25+File Catalogue
26+File Access Monitoring Service
FAMoS provides a facility to monitor the attributes of file accesses and to record the attribute values in a database in an organized manner
It counts accesses not only to individual files but also to sets of AOD and ESD files of the LHC periods, called categories (e.g. LHC10f6a_ESD, LHC10h_AOD)
It also provides information on the categories accessed by individual users (such as alidaq, aliprod, alitrain)
Information gathered from the Authen and API servers since August
Web interface under development: http://aligrid.yerphi.am/famos/monitoring
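The two kinds of counts described for FAMoS (per category, and per user and category) can be sketched in Python as follows. The record layout `(user, category)` and the function name are assumptions for illustration; the actual FAMoS schema is not shown in the slides.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_accesses(
    records: Iterable[Tuple[str, str]]
) -> Tuple[Counter, Counter]:
    """Count accesses per category and per (user, category) pair.

    Each record is a hypothetical (user, category) tuple, e.g.
    ("alitrain", "LHC10h_AOD").
    """
    per_category: Counter = Counter()
    per_user_category: Counter = Counter()
    for user, category in records:
        per_category[category] += 1
        per_user_category[(user, category)] += 1
    return per_category, per_user_category
```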
27+File Access Monitoring Service
28+File Access Monitoring Service
29+File Access Monitoring Service
30+ToDos / Ideas
Unifying AliEn versions
  v2-19, v2-20, v2-21, trunk, central, API
  Also making installations/tests work
JDL optimization
  Millions of jobs, big JDL text
  Done in v2-21
    Storing compressed JDL and only diff tags in resultsJdl
  We could also create the subjobs’ JDL from the father’s
Optimizers
  investigations
    periodicity?
    jobs not expiring (e.g. SPLIT without pending subjobs)
    queries failing?
Catalogue cleanups/modifications
  orphan entries, duplicated pfns
  unused tables
  new solutions (EOS?), apply (v2-20?) improvements
  more caching
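The JDL optimization idea (store the JDL compressed, and keep only the tags that differ in the result JDL) can be sketched in Python. This is an illustration under assumptions: the JDL is treated as a plain string for compression and as a flat tag dictionary for the diff, and the function names are hypothetical; the v2-21 implementation may differ.

```python
import zlib
from typing import Dict

_MISSING = object()  # sentinel: tag absent from the original JDL


def compress_jdl(text: str) -> bytes:
    """Compress the JDL text for storage."""
    return zlib.compress(text.encode("utf-8"))


def decompress_jdl(blob: bytes) -> str:
    """Recover the JDL text from its stored form."""
    return zlib.decompress(blob).decode("utf-8")


def diff_tags(original: Dict[str, str], result: Dict[str, str]) -> Dict[str, str]:
    """Keep only the tags of `result` that are new or changed."""
    return {
        tag: value
        for tag, value in result.items()
        if original.get(tag, _MISSING) != value
    }
```

JDL text is highly repetitive, so it compresses well, and storing only the changed tags avoids keeping a second near-identical copy per job.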
31+ToDos / Ideas
zip64
  currently using standard zip, which can’t deal with >4GB
IPv6 readiness
  this summer
Broker queries
  packages matching queries, long list
JSON services
  SOAP used in production, but JSON is ready since v2-20 and is being used by PANDA
  + performance
  - backward incompatibility (in critical parts...)
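For reference, this is how ZIP64 archives can be written with Python's standard `zipfile` module. It is only an illustration of the format option; AliEn itself is not written in Python, and the helper names are hypothetical. `force_zip64=True` makes each entry use the ZIP64 extensions even when it is small, which is what an entry of unknown, possibly >4GB size needs.

```python
import io
import zipfile
from typing import Dict


def write_zip64(entries: Dict[str, bytes]) -> bytes:
    """Write `entries` (name -> content) into an in-memory ZIP64 archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(
        buf, "w", compression=zipfile.ZIP_DEFLATED, allowZip64=True
    ) as archive:
        for name, data in entries.items():
            # force_zip64 enables the ZIP64 header for this entry up front.
            with archive.open(name, "w", force_zip64=True) as member:
                member.write(data)
    return buf.getvalue()


def read_member(blob: bytes, name: str) -> bytes:
    """Read one entry back from an archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        return archive.read(name)
```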
32+ToDos / Ideas
find
  new flag to sort files according to the position of the request, for better reading
Long-desired command fixes
  ps, top, masterjob...
glExec
  utility allowing user separation in multi-user pilot jobs
  let each user task run under a corresponding account
  PayLoad + user certs?
Supercomputing...?
  interfacing with PanDA
33+ToDos / Ideas
Machine/Job features Task Force
  aiming to adapt to virtualized environments (including cloud)
  also related to multi-core slot queues, dynamic CPU availability...
  concerns about overloading the meta-service when communicating features via “magic-IP”
Expand to more IaaS systems
34+jAlien
35+jAlien
36+jAlien
37+Sorry if I bored you ;-)
Any questions?