Page 1: AliEn  development

AliEn development

Miguel Martinez Pedreira

Page 2: AliEn  development

Contents

  • Basics and concepts
  • CVMFS migration
  • Recent changes
  • ToDos / Ideas

AliEn development - Miguel Martinez Pedreira

Page 3: AliEn  development

What is AliEn?

AliEn (ALICE Environment) is a lightweight open-source Grid framework built around other open-source components. It combines a Web Service model with a Distributed Agent model and is designed to meet the offline computing needs of a HEP experiment.

  • Massive amounts of data imply distributed storage and processing

It started within the ALICE Off-line Project at CERN and constitutes the production environment for simulation, reconstruction, and analysis of physics data of the ALICE experiment.

The current status of the ALICE grid operation can be found at the MonALISA Grid Monitoring.

Page 4: AliEn  development

Virtual Organisations

  • Users (job submission for data analysis)
  • Central management (the “brain” of the GRID)
  • Sites (the “muscle” of the GRID)

ALICE numbers: ~35K running jobs on average (~200K/day), ~80 sites, ~60 SEs, ~1000M entries in the catalogue

Page 5: AliEn  development

Distributed Analysis

Page 6: AliEn  development

ALICE Grid

Page 7: AliEn  development

AliEn summary

  • 3-layer system that leverages the deployed resources of the underlying WLCG infrastructures and services
  • Interfaces to AliRoot via a ROOT plugin (TAlien) that implements the AliEn API
  • Complex workflows, including distributed analysis, built on top of the AliEn API
  • Used by ALICE and PANDA

Page 8: AliEn  development

CVMFS

Page 9: AliEn  development

CVMFS migration

  • Started in the second half of August 2013
      - First sites: test CE at CERN (local) and BITP (grid)
      - Running production-like JDLs (alitrain/aliprod)
      - CVMFS setup then tested on a ‘real’ site
  • Some combinations of packages showed problems with the environment being set for the process
      - First fixes: IP list from APISCONFIG, paths
  • Wiki created with the steps to follow before starting the migration

Page 10: AliEn  development

CVMFS migration

  • Then started with some other sites
      - Mainly big sites, one by one; overall quite smooth
  • Decided to push everyone
      - Spam time (sorry)
      - I haven’t counted myself, but by 7 November ~800 mails had been sent (according to Maarten)
  • Reactions
      - Many sites reinstalling to WLCG SL6
      - Sites waiting for CVMFS to be installed for quite a while...
      - Several interface updates: CONDOR, ARC
      - CREAM was also updated to add robustness and deal with specific setups

Page 11: AliEn  development

CVMFS migration

  • Deadline modified after the October push: end of 2013

[Chart: horizontal axis in months from July to January, vertical axis from 0 to 70]

Page 12: AliEn  development

CVMFS migration

  • PackMan service not needed anymore
  • Issues
      - Fix for the ROOT/AliRoot paths when a local ROOT is installed (conflict)
      - Missing libraries: libtcl, gfortran, libtermcap... Some sites don’t use HEP_OS_libs
      - ‘Warm-up’ time with ERROR_E?
      - I/O errors: squid and cache misconfigurations? Not easy to debug...
      - Efficiencies?
  • Thanks to the admins ;-) !

Page 13: AliEn  development

CVMFS migration

Page 14: AliEn  development

CVMFS migration

Page 15: AliEn  development

CVMFS migration

Page 16: AliEn  development

Recent changes: /tmp issue

  • Detected some jobs that were analysing the wrong data
      - Taken from the file ‘wn.xml’
  • Job tokens
      - Not unique
      - Now using a new service that creates unique tokens
  • Other ideas led to a more robust JobAgent
      - Check sandbox use and creation
      - Check chdir
      - Print more info
      - Single open/read of the XML file
  • The problem turned out not to be in AliEn
      - Jobs waiting for the same CVMFS content were writing concurrently
      - It helped to create a model of the JobAgent flow: to be ‘digitalized’
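
As an illustration of the "unique tokens" point, here is a minimal Python sketch of a central issuer that never hands out the same token twice. It is not the AliEn service, whose implementation the slides do not describe; the class name `TokenService` and its methods are hypothetical.

```python
import secrets
from typing import Dict, Optional


class TokenService:
    """Hypothetical central issuer (not the AliEn service itself)."""

    def __init__(self) -> None:
        self._issued: Dict[str, int] = {}  # token -> job id

    def issue(self, job_id: int) -> str:
        # 128 random bits make a collision extremely unlikely; the loop
        # additionally guarantees uniqueness within this issuer.
        while True:
            token = secrets.token_hex(16)
            if token not in self._issued:
                self._issued[token] = job_id
                return token

    def job_for(self, token: str) -> Optional[int]:
        # Returns the job id the token was issued for, or None if unknown.
        return self._issued.get(token)
```

Issuing tokens from a single place makes uniqueness a property of the service rather than something each job has to get right on its own.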

Page 17: AliEn  development

Recent changes

  • Investigating the errors
      - ERROR_E: TTL, memory, idle, couldn’t get catalogue...
      - Still, the winner is ERROR_V
  • Focusing on what we don’t understand, or on the problems that most affect the system overall

Page 18: AliEn  development

Recent changes: ERROR_E

  • Many jobs failing for running over the TTL
      - ProxyTTL=“1”: job runtime set to the proxy time left on the job minus 10 minutes
      - Some still keep failing... (proxy time left unavailable)
  • Memory issues (or other)
      - Saving the output to check the logs
      - OutputErrorE, same as output; then registerOutput <jobId>
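
The ProxyTTL rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic on the slide, not AliEn code: the function name and the fallback to the JDL TTL when the proxy time left is unavailable are assumptions.

```python
from typing import Optional

SAFETY_MARGIN_S = 10 * 60  # the 10 minutes mentioned on the slide


def effective_runtime(ttl_s: int,
                      proxy_timeleft_s: Optional[int],
                      proxy_ttl: bool = True) -> int:
    """Return the runtime (in seconds) a job is allowed to use."""
    # Assumption: without ProxyTTL, or when the proxy time left cannot be
    # determined, fall back to the TTL requested in the JDL.
    if not proxy_ttl or proxy_timeleft_s is None:
        return ttl_s
    # Proxy time left minus the safety margin, never negative.
    return max(0, proxy_timeleft_s - SAFETY_MARGIN_S)
```

For example, a proxy with two hours left gives `effective_runtime(3600, 7200)`, which is 6600 seconds.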

Page 19: AliEn  development

Recent changes: ERROR_E

  • Some jobs fail to get a catalogue instance
  • Found a race condition
      - Jobs WAITING for a while, and being moved to ZOMBIE just after ASSIGNED
      - Optimizer using a DB field based on a timestamp that wasn’t properly updated...
  • A small portion still fails
      - Added a full trace of the catalogue creation in the JobAgent
      - “Bad hostname” or “Host undefined”
      - Stuck after getting the JDL
      - Under investigation

Page 20: AliEn  development

Recent changes: INSERTION

[Diagram: job insertion flow]

  • States: Submit, Splitting, Split, Inserting, Waiting
  • Job Manager: check JDL, inserting; analyse split fields
  • Splitting optimizer: create baskets, getting all files; `whereis` per file; if maxsize, `ls` per file; submit subjobs
  • Inserting optimizer: copyInput with `whereis` per file; sizes check (InputDownload, Workdirectorysize); add SE requirements; check SE-CE compatibility; sizes check
  • Waiting: insert jobagent

Page 21: AliEn  development

Recent changes: INSERTION

[Diagram: job insertion flow]

  • States: Submit, Splitting, Split, Waiting
  • Job Manager: check JDL, inserting; analyse split fields
  • Splitting optimizer: create baskets, getting all files; `whereis` per file (and cached!); sizes check (no InputDownload or InputBox); subjobs go to WAITING
  • Split to Waiting: add/check SE-CE compatibility
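
The "`whereis` per file (and cached!)" step can be illustrated with a small Python sketch. It assumes a hypothetical in-memory replica table standing in for the catalogue; the file names, SE names and helper functions are made up for the example and are not AliEn APIs.

```python
from functools import lru_cache
from typing import Tuple

# Hypothetical replica table standing in for the real catalogue.
_REPLICAS = {
    "/alice/data/file1.root": ("SE_A", "SE_B"),
    "/alice/data/file2.root": ("SE_B",),
}


def _catalogue_lookup(lfn: str) -> Tuple[str, ...]:
    # Stand-in for the expensive catalogue query.
    return _REPLICAS.get(lfn, ())


@lru_cache(maxsize=None)
def whereis(lfn: str) -> Tuple[str, ...]:
    """Return the SEs holding a file; repeated lookups hit the cache."""
    return _catalogue_lookup(lfn)
```

When many subjobs share input files, each file is then resolved once instead of once per subjob. `whereis.cache_info()` reports how many lookups were served from the cache.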

Page 22: AliEn  development

Recent changes: INSERTION

Page 23: AliEn  development

Recent changes: INSERTION

Page 24: AliEn  development

Recent changes: baskets

  • Some months ago, the file/job distribution was found to be very ugly
  • Issue in the catalogue: entries with duplicated pfns
      - Cleanup? A fix in the optimizer deals with it
  • Improved the basket creation
      - Step 1: file transfers to have the data in the same SEs (Markus/Jan); not so good for the grid balance
      - Step 2: creating big collections for several runs (Costin); balanced grid
  • FileBroker not needed anymore...
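
To make the basket idea concrete, here is a rough Python sketch that groups files by the set of SEs holding them and skips duplicated pfn entries. The slides do not describe the optimizer's actual algorithm; grouping by SE set, the `(lfn, se, pfn)` row layout and the function name are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple


def make_baskets(
    replicas: Iterable[Tuple[str, str, str]]
) -> Dict[FrozenSet[str], List[str]]:
    """Group lfns by the set of SEs that hold a replica.

    replicas: hypothetical (lfn, se, pfn) rows as they might come out
    of a catalogue query.
    """
    seen_pfns = set()
    ses_per_file = defaultdict(set)
    for lfn, se, pfn in replicas:
        if pfn in seen_pfns:  # ignore duplicated pfn entries
            continue
        seen_pfns.add(pfn)
        ses_per_file[lfn].add(se)

    baskets = defaultdict(list)
    for lfn, ses in ses_per_file.items():
        baskets[frozenset(ses)].append(lfn)
    return {ses: sorted(lfns) for ses, lfns in baskets.items()}
```

Files that live on the same SEs end up in the same basket, so a subjob built from one basket can read all of its input from storage close to where it runs.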

Page 25: AliEn  development

File Catalogue

Page 26: AliEn  development

File Access Monitoring Service

  • FAMoS provides a facility to monitor the attributes of file accesses and to record their values in a database in an organized manner
  • It counts accesses not only to individual files but also to sets of AOD and ESD files of the LHC periods, called categories (e.g. LHC10f6a_ESD, LHC10h_AOD)
  • It also provides information on the categories accessed by individual users (such as alidaq, aliprod, alitrain)
  • Information has been gathered from the Authen and API servers since August
  • Web interface under development: http://aligrid.yerphi.am/famos/monitoring
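
The per-category and per-user counting described above could look roughly like this in Python. It is a sketch, not FAMoS code: the `(user, category)` record layout is an assumption, and how FAMoS derives a category from a file access is not covered by the slides.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize(
    accesses: Iterable[Tuple[str, str]]
) -> Tuple[Counter, Dict[str, Counter]]:
    """Count accesses per category and per (user, category).

    accesses: hypothetical (user, category) records, one per file access.
    """
    per_category: Counter = Counter()
    per_user: Dict[str, Counter] = {}
    for user, category in accesses:
        per_category[category] += 1
        per_user.setdefault(user, Counter())[category] += 1
    return per_category, per_user
```

Calling `summarize` on a list of access records yields the total number of accesses to each category and, for each user, which categories that user read and how often.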

Page 27: AliEn  development

File Access Monitoring Service

Page 28: AliEn  development

File Access Monitoring Service

Page 29: AliEn  development

File Access Monitoring Service

Page 30: AliEn  development

ToDos / Ideas

  • Unifying AliEn versions
      - v2-19, v2-20, v2-21, trunk, central, API
      - Also making installations/tests work
  • JDL optimization
      - Millions of jobs, big JDL text
      - Done in v2-21: storing compressed JDL and only diff tags in resultsJdl
      - We could also create the subjobs’ JDL from the father’s
  • Optimizers investigations
      - Periodicity?
      - Jobs not expiring (e.g. SPLIT without pending subjobs)
      - Queries failing?
  • Catalogue cleanups/modifications
      - Orphan entries, duplicated pfns
      - Unused tables
      - New solutions (EOS?), apply (v2-20?) improvements
      - More caching
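
The JDL optimization (compressed JDL plus only the changed tags in the result) can be sketched in Python as follows. This is an illustration of the idea only: zlib as the compressor, the dictionary representation of JDL tags and the function names are assumptions, not what v2-21 actually does.

```python
import zlib
from typing import Dict


def compress_jdl(jdl_text: str) -> bytes:
    """Compress the JDL text before storing it."""
    return zlib.compress(jdl_text.encode("utf-8"))


def decompress_jdl(blob: bytes) -> str:
    """Restore the original JDL text."""
    return zlib.decompress(blob).decode("utf-8")


def diff_tags(original: Dict[str, str], result: Dict[str, str]) -> Dict[str, str]:
    """Keep only the tags that are new or changed in the result JDL."""
    return {tag: value for tag, value in result.items()
            if original.get(tag) != value}
```

With millions of jobs whose JDLs are long and largely repetitive, storing the compressed text once and only the differing tags per result keeps the job tables much smaller.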

Page 31: AliEn  development

ToDos / Ideas

  • zip64
      - Currently using standard zip, which can’t deal with >4 GB
  • IPv6
      - Readiness this summer
  • Broker queries
      - Package-matching queries, long list
  • JSON services
      - SOAP is used in production, but JSON has been ready since v2-20 and is being used by PANDA
      - (+) performance
      - (-) backward incompatibility (in critical parts...)
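
For background on the zip64 item: the classic zip format stores sizes in 32-bit fields, which is where the 4 GB limit comes from, while the ZIP64 extension uses 64-bit fields. The sketch below shows how this looks with Python's standard `zipfile` module, purely as an illustration; it says nothing about how AliEn itself would adopt zip64, and the helper names are made up.

```python
import zipfile
from typing import Dict


def write_archive(path: str, members: Dict[str, bytes]) -> None:
    """Write an archive that may grow past the classic zip limits."""
    # allowZip64=True lets zipfile switch to ZIP64 records when needed.
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED,
                         allowZip64=True) as zf:
        for name, data in members.items():
            zf.writestr(name, data)


def read_archive(path: str) -> Dict[str, bytes]:
    """Read every member back into memory (fine for small examples)."""
    with zipfile.ZipFile(path) as zf:
        return {name: zf.read(name) for name in zf.namelist()}
```

Archives that stay below the limits remain readable by tools without ZIP64 support, because the extended records are only written when an entry or the archive actually needs them.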

Page 32: AliEn  development

ToDos / Ideas

  • find
      - New flag to sort files according to the position of the request, for better reading
  • Long-desired command fixes
      - ps, top, masterjob...
  • glExec
      - Utility allowing user separation in multi-user pilot jobs
      - Lets each user task run under a corresponding account
      - Payload + user certs?
  • Supercomputing...?
      - Interfacing with PanDA

Page 33: AliEn  development

ToDos / Ideas

  • Machine/Job features Task Force
      - Aiming to adapt to virtualized environments (including cloud)
      - Also related to multi-core slot queues, dynamic CPU availability...
      - Concerns about overloading the meta-service when communicating features via “magic-IP”
  • Expand to more IaaS systems

Page 34: AliEn  development

jAlien

Page 35: AliEn  development

jAlien

Page 36: AliEn  development

jAlien

Page 37: AliEn  development

Sorry if I bored you ;-)

Any questions?