UCL workshop – 4-5 March 2004 – HEP Assessment of EDG – n° 1 HEP Applications Evaluation of the EDG Testbed and Middleware Stephen Burke (EDG HEP Applications – WP8) [email protected]


UCL workshop – 4-5 March 2004 – HEP Assessment of EDG – n° 1

HEP Applications Evaluation of the

EDG Testbed and Middleware

Stephen Burke (EDG HEP Applications – WP8)

[email protected]


Introduction

Updated from the CHEP talk ~ 1 year ago
Some things have changed, some not!

Based on D8.4 report (EDG only here, 2.0/2.1 releases)

Achievements of WP8

Updated use case analysis mapping HEPCAL to EDG

Lessons learnt


OBJECTIVES AND ACHIEVEMENTS

Objective: Evaluate EDG Application Testbed, and integrate into experiment tests as appropriate.
Achievement: Further successful evaluation of 1.4.n throughout the summer.
Achievement: Evaluation of EDG 2.0 on the EDG Application Testbed since October, and of EDG 2.1 since December.

Objective: Liaise with LCG regarding EDG/LCG integration and the development of the LCG service.
Achievement: EIPs (Loose Cannons) helped testing of EDG components on the LCG Cert TB prior to LCG-1 start in September.
Achievement: Performed stress tests on LCG-1.

Objective: Continue work with experiments on data challenges throughout the year.
Achievement: All 6 experiments have conducted data challenges of different scales throughout 2003 on EDG App TB or LCG/Grid.it.


OBJECTIVES AND ACHIEVEMENTS (continued)

Objective: Continued work in Architectural Task Force (ATF)
Achievement: Walkthroughs of HEP use cases helped to clarify interfacing problems.

Objective: Reactivation of the Application Working Group (AWG)
Achievement: Extension of HEPCAL use cases covering key areas in Biomedicine and Earth Sciences.
Achievement: Basis of first proposal for common application work in EGEE.

Objective: Work with LCG/GAG (Grid Applications Group) in further refinement of HEP requirements
Achievement: HEPCAL-2 requirements document for the use of grid by thousands of individual users.
Achievement: In addition, further refined the original HEPCAL document.

Objective: Development of tutorials and documentation for the user community
Achievement: WP8 has played a substantial role in course design, implementation and delivery.


Use Case Analysis

EDG release 2.0 has been evaluated against the HEPCAL Use Cases

Of the 43 Use Cases:
13 (was 10) are fully implemented
4 (was 8) are largely satisfied, but with some restrictions or complications
11 (was 8) are partially implemented, but have significant missing features
15 (was 17) are not implemented

Missing functionality is mainly in:
Virtual data (not considered by EDG)
Metadata catalogues and file collections (still needs more work)
Authorisation, job control and optimisation (partly delivered but not integrated)


Lessons Learnt - General

Having real users on an operating testbed on a fairly large scale is vital – many problems emerged which had not been seen in local testing.

Problems with configuration are at least as important as bugs - integrating the middleware into a working system takes as long as writing it!

Grids need different ways of thinking by users and system managers. A job must run anywhere it lands. Sites are not uniform so jobs should make as few demands as possible.


Job Submission

Limitations seen in 1.4 are largely gone
Efficiency over 90% in stress tests (1600 jobs)

Failures are ~ 1% in normal use (after resubmission)

Most failures are now at the Globus/site level, not the broker

Can still be sensitive to poor or incorrect information from Information Providers

Info providers have improved, configuration generally better

No “black hole” sites lately (but still possible)

Still hard to diagnose errors (“invalid script response”???)

Advanced features (checkpointing, DAGMAN, interactivity, accounting, …) largely untested, some not integrated
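
The efficiency and failure figures above can be related with a rough back-of-the-envelope model. The sketch below assumes, purely for illustration, that every submission attempt fails independently with the same probability. Real failures (for example at a misconfigured site) are often correlated, so this is an optimistic estimate and not a description of the testbed. The function name and the 10% single-attempt failure rate are assumptions, the latter chosen to match the "over 90%" stress-test efficiency.

```python
def residual_failure_rate(single_attempt_failure, resubmissions):
    """Probability that a job still fails after the given number of resubmissions.

    Assumes (hypothetically) that every attempt fails independently
    with the same probability.
    """
    if not 0.0 <= single_attempt_failure <= 1.0:
        raise ValueError("single_attempt_failure must be between 0 and 1")
    if resubmissions < 0:
        raise ValueError("resubmissions must not be negative")
    # The job is lost only if the first attempt and every resubmission fail.
    return single_attempt_failure ** (resubmissions + 1)


# Assumed 10% failure per attempt, one automatic resubmission.
print(f"{residual_failure_rate(0.10, 1):.1%}")
```

Under these assumptions a single resubmission is enough to bring a 10% per-attempt failure rate down to about 1%, which is the order of magnitude quoted for normal use.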


Information Systems

R-GMA is a big improvement on MDS
Tables, SQL queries, much easier to publish, …

Largely a personal view, experiments have mostly not used it yet

Took a very long time to become stable – during the D8.4 evaluation R-GMA availability was O(75%)

Latest version installed for the EU review looks much better – total end-to-end efficiency now > 95%, R-GMA is ~100% (but testbed is now lightly loaded)

NO SECURITY! And no Registry/schema replication

Need to check published information for accuracy (or at least sanity!)

GLUE schema is not in EDG/LCG control, and has proved very hard to change
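
The kind of sanity check asked for above could look something like the following minimal sketch. The record layout and the key names (total_cpus, free_cpus, estimated_response_time) are hypothetical and are not GLUE attribute names; a real check would have to be written against whatever the information providers actually publish.

```python
def sanity_problems(record):
    """Return a list of human-readable problems found in one published record.

    The keys used here are illustrative placeholders, not real schema attributes.
    """
    problems = []
    for key in ("total_cpus", "free_cpus", "estimated_response_time"):
        value = record.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            problems.append(f"{key}: missing or not an integer")
        elif value < 0:
            problems.append(f"{key}: negative value {value}")
    # Cross-field check, only meaningful if the individual fields are usable.
    if not problems and record["free_cpus"] > record["total_cpus"]:
        problems.append("free_cpus exceeds total_cpus")
    return problems


suspect = {"total_cpus": 10, "free_cpus": 50, "estimated_response_time": 0}
print(sanity_problems(suspect))
```

A consumer of the information could skip or flag any record for which the list is not empty, instead of trusting it blindly.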


Replica Management

Now mostly “just works”
Command line tools are fairly intuitive
Sometimes processes can hang
Orphan processes sometimes left behind when job ends
Some inconsistencies found when used with POOL

Interaction with SE schema is still unclear
Works, but gives artificial restrictions on NFS access

Bulk operations, mirroring and client-server architecture lost with GDMP

Java command-line tools are very slow (tens of seconds)

Fault tolerance is important: error conditions should leave things in a consistent state, failures should be re-tried where possible
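
The fault-tolerance point above can be illustrated with a generic retry wrapper. This is a sketch under stated assumptions, not the behaviour of any EDG tool: run_with_retries, operation and cleanup are hypothetical names, and the caller is assumed to supply a cleanup function that undoes any partial work (for example a half-copied file or a dangling catalogue entry).

```python
import time


def run_with_retries(operation, cleanup, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call operation(); on failure run cleanup() and try again.

    cleanup() is called after every failed attempt so that the system is
    left in a consistent state whether or not a later attempt succeeds.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:
            last_error = error
            cleanup()  # undo partial work before retrying or giving up
            if attempt < attempts:
                sleep(delay_seconds * attempt)  # wait a little longer each time
    raise RuntimeError(f"operation failed after {attempts} attempts") from last_error
```

The final exception keeps the last underlying error attached as its cause, which also helps with the error-diagnosis problems noted elsewhere in this talk.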


Replica Catalogues

Oracle/MySQL catalogues are much better than LDAP in 1.4

Tested up to O(100k) entries, no degradation seen
But need to cope with millions

At 10 seconds per file it would take ~ 4 months to register a million files!

Queries can be very slow due to inefficient transport of data
30 minutes to return 45k entries

Java runs out of memory on bigger queries

Distributed LRC + RLI not deployed

NO SECURITY! (Integrated but not deployed)

Still no consistency checking against SE content
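
The two timing figures on this slide can be checked with simple arithmetic. The sketch below uses only the numbers quoted above (10 seconds per file, a million files, 45k entries in 30 minutes); the function names are illustrative, and sequential registration with no parallelism is assumed.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def registration_days(files, seconds_per_file):
    """Days needed to register the files one after another."""
    return files * seconds_per_file / SECONDS_PER_DAY


def entries_per_second(entries, minutes):
    """Average rate at which a query returned its entries."""
    return entries / (minutes * 60)


print(round(registration_days(1_000_000, 10), 1))  # about 115.7 days
print(entries_per_second(45_000, 30))              # 25.0 entries per second
```

Roughly 116 days is a little under four months, consistent with the estimate above, and 25 entries per second points at transport overhead and not at the database itself.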


Mass Storage

Always the most problematic area, and still not solved

LCG2 still using “classic SE”, but only a stop-gap

SRM should be the solution (?), WP5 SE is the EDG version

Works, but many rough edges, really still a prototype
No disk space management

Error reporting is poor, not fault-tolerant

Too much logging, not helpful for a system manager

Configuration is complex and fragile

Also dCache, CASTOR SRM, Enstore SRM …
But still not production-quality?

What is the way forward?


VO Management

Current LDAP-based system works fairly well, but has many limitations

VO servers are a single point of failure

VOMS looks good, but not yet deployed or fully integrated
Or documented!

Middleware groups seem to have a different security model to VOMS designers
E.g. they usually assume one and only one VO
VO defines service (Replica Catalogue, SE namespace) and not authorisation

Experiments will need to gain experience about how a VO should be run


User View of the Testbed

Site configuration is very complex, there is usually one way to get it right and many ways to be wrong

LCFG is a big help in ensuring uniform configuration
Middleware should be self-configuring (and self-checking) as far as possible

Need well-defined certification procedures, checked on an ongoing basis (sites decay with a half-life of ~ a few weeks)

Services should fail gracefully when they hit resource limits
The grid must be robust against failures and misconfiguration. Large grids will ~ always be broken, so errors are not exceptional!

Many HEP experiments require outbound IP connectivity from worker nodes

Still no solution, discussion is needed

Scalability? Still only ~ 20 sites – 1 job/minute!
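
The half-life remark above implies simple exponential decay, which makes the case for continuous certification concrete. The sketch below is illustrative only: the talk says "a few weeks", so the 3-week half-life used in the example is an assumed value, and the function name is hypothetical.

```python
def fraction_still_working(weeks_elapsed, half_life_weeks):
    """Fraction of initially working sites still working after weeks_elapsed,
    assuming exponential decay with the given half-life and no re-certification."""
    if weeks_elapsed < 0 or half_life_weeks <= 0:
        raise ValueError("weeks_elapsed must be >= 0 and half_life_weeks > 0")
    return 0.5 ** (weeks_elapsed / half_life_weeks)


# Assumed half-life of 3 weeks: after 6 weeks only a quarter of sites remain good.
print(fraction_still_working(6, 3))  # 0.25
```

With ~ 20 sites and that assumed half-life, only about five would still be correctly configured six weeks after a one-off certification.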


Gaps

Disk space management on worker nodes
Some discussion, nothing appeared

Analysis of scheduling algorithms
EstimatedResponseTime is not optimal

Pre-replication by the broker

Information about networking at the LAN level
Where are the network bottlenecks?

Distribution of experiment software (now being tackled in LCG)

Enforcement of quotas (whose job is this?)

Documentation