gLite Status
Stephen Burke, RAL
GridPP 13 - Durham
July 6th 2005 gLite Status
Overview
• gLite releases
• gLite deployment
• WMS
• DMS
• R-GMA
• VOMS
• Outstanding issues
• E&OE!
Releases
gLite releases so far
• Release 1.0 on April 5th
  – Released to meet deadline
  – WMS + CE + Fireman + gLite i/o + R-GMA + VOMS
  – AliEn, GAS and package manager gone
  – Several things missing or not working well
    • No SE in gLite
  – Documentation is reasonable
• Release 1.1 on May 12th
  – First versions of File Transfer Service (FTS), metadata catalogue
  – Secure file catalogues
  – Bug fixes
Future releases
• Release 1.2 should have been on June 1st
  – Delayed to end of June, now expected late July
    • Was expected to be in LCG July release
    • Have gLite R-GMA and VOMS as LCG upgrades
• “Final” gLite release (2.0) for EGEE 1 by end of the year
  – Updated architecture/design/workplan documents
  – Code freeze October (?)
  – Maybe a 1.3 release (August?), time is tight
Timelines
[Timeline figure spanning June 2005 to March 2006. Month labels: June 2005, October 2005, November 2005, December 2005, March 2006. Milestone labels: TODAY, Release 1.2, Func. freeze (twice, the second marked "?"), Integrated 2.0, Release 2.0, Mid Dec., Xmas vacation, Review, Final Report, End of EGEE 1.]
Consequences
• ~ 2.5 months of development left
• probably only 1 or 2 releases between 1.2 and 2.0
• Focus on consolidation of 1.2 and little improvements as requested from applications
• Very careful in introducing new services
Release priorities
• Driven by service challenges
  – Especially data management
  – LCG Baseline Services document
• No time to change anything for EGEE 1
• EGEE PTF disbanded
  – Not seen as effective
  – Who collects requirements?
  – Do non-LCG VOs have influence?
Deployment
gLite deployments – JRA1
• gLite “prototype” system
  – Used by ARDA team, biomed, some others
  – Very small, basically just CERN
  – Not properly maintained
• JRA1 testing testbed
  – Was CERN, RAL and NIKHEF
  – Two sites + manpower added at Imperial
    • One person subtracted at CERN
  – Still small and under-resourced
  – Releases are not sufficiently tested
    • 928 open bugs in savannah, 84 critical
    • 281 “ready for test”, but no time to test!
gLite deployments - LCG
• Pre-production system now being installed
  – ~8 sites so far, more coming
    • None in UK?
  – Currently a “pure” gLite system
    • Role seems to change from week to week!
  – Partly working but many problems
  – Some users allowed in soon (now?)
• Production system
  – Various plans considered
  – LCG 2.6 has R-GMA and VOMS
  – Next steps unclear (to me at least!)
Status as of release 1.1
Workload management
• Broker is a development of the EDG/LCG RB
  – Seems to be largely backward-compatible
  – Main new feature is DAGMAN (composite jobs)
  – Push and pull job submission
  – No web services
• Hybrid info system (CEMON + BDII)
  – Static configuration of WMS-CE relationships
  – Should change to R-GMA (?)
• Condor-C replaces Globus gatekeeper on CE
  – Several security problems
  – Current performance is poor
    • Submissions often fail
    • Cryptic error messages
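The composite jobs mentioned above are directed acyclic graphs (DAGs) of sub-jobs: a sub-job can only start once everything it depends on has finished. As a rough illustration only, the ordering constraint can be sketched in Python. This is not gLite, DAGMan or JDL syntax; the sub-job names and the `run_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical composite job: each key is a sub-job, each value is the set
# of sub-jobs that must finish before it can start.
dependencies = {
    "simulate_a": {"generate"},
    "simulate_b": {"generate"},
    "merge": {"simulate_a", "simulate_b"},
}


def run_order(deps):
    """Return batches of sub-jobs; jobs in the same batch could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sub-jobs whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished so dependants become ready
    return batches


for batch in run_order(dependencies):
    print(batch)
```

A real broker would submit each batch to the Grid and wait for completion before releasing the next one; here the batches are simply printed in order.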
Data Management
• First version of metadata catalogue
  – No command-line clients yet, MySQL only
• Fireman file catalogue
  – Competes with new LCG File Catalogue
  – Various experiment-specific solutions
• gLite i/o
  – Security model still under debate (delegation, file ownership)
  – Doesn’t yet work with dCache or DPM SRMs, only Castor!
• FTS, developed for service challenges
  – Point-to-point reliable file transfer
  – No interaction with Fireman catalogue
• No File Placement Service (FPS) yet, hence no replication!
• No Data Scheduler
• Interaction with WMS still under discussion
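"Reliable" point-to-point transfer generally means that a failed copy is retried rather than reported straight back to the user. The sketch below shows that general idea only, under stated assumptions: it is not the FTS interface, and `transfer_with_retry`, its parameters and the `copy` callable are all hypothetical names introduced for illustration.

```python
import time


def transfer_with_retry(copy, source, dest, attempts=3, delay=1.0, sleep=time.sleep):
    """Call copy(source, dest) until it succeeds or the attempts run out.

    Returns the number of the attempt that succeeded. `copy` is any callable
    that raises OSError on failure (a stand-in for a real transfer tool).
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            copy(source, dest)
            return attempt
        except OSError as err:
            last_error = err
            if attempt < attempts:
                # Exponential back-off: delay, 2*delay, 4*delay, ...
                sleep(delay * 2 ** (attempt - 1))
    raise RuntimeError(
        f"transfer {source} -> {dest} failed after {attempts} attempts"
    ) from last_error
```

Passing `sleep` as a parameter keeps the function testable without real waiting. Note that this covers only the point-to-point copy; registering the new replica in a catalogue is a separate step.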
R-GMA
• Should be an information system
  – But both LCG and gLite still use BDII
• New Service Discovery API
  – Still discussing service types and names
• LCG now making substantial use of R-GMA for monitoring, accounting etc
  – Lots of pressure to fix bugs!
  – Some stability problems, needs more testing
    • Not ideal to test in production, but …
  – Seems generally in a good state
Security
• gLite VOMS server now used by LCG
  – Some problems with gLite installation scripts
• WMS and DMS have limited support for VOMS
  – SRM, Condor-C and R-GMA don’t yet
• Many test VOMS servers exist, but still not in production
  – Will probably need a long learning period to get the best use of VOMS
  – Not a panacea!
• Security requirements mostly still not being addressed
  – Most date back to the start of EDG
• Many known security vulnerabilities
Outstanding Issues
General
• Error messages, logging and fault-tolerance
  – Still very poor
    • Proposal on common error handling by Steve Fisher
• Configuration
  – gLite has a common config tool (python/XML)
  – Underlying config not unified
  – Still complex, fragile and error prone
  – Not clear if LCG will switch
    • May get many layers: YAIM -> XML -> m/w specific config files?
• Monitoring
  – Getting better, but all from LCG, not in gLite
• Single points of failure
  – Still have many, but some positive movement
WMS
• Job submission rate too slow
  – Not tested (?), but probably no change
• Failover (RB goes down -> jobs lost)
  – No change so far
• Bulk job submission
  – Partial support via DAGs
  – Parameterised jobs coming
• Space management on WNs
  – Not being addressed
• Access to output from running jobs
  – Not yet
• Advance reservation
  – Some work, but not yet available
• Interaction with data management (pre-staging)
  – Discussion but nothing yet
• CPU speed, memory etc requirements not passed to batch system
  – May appear in future
• Job distribution is poor (ERT etc)
  – Partly addressed by new Glue schema
  – Still no direct support in broker
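For the bulk submission item, a parameterised job is a single job description expanded into many near-identical jobs that differ only in one parameter value. The following is a generic sketch of that expansion, not the gLite mechanism: the `${PARAM}` placeholder, the `expand_jobs` helper and the command line are all hypothetical.

```python
from string import Template


def expand_jobs(template, values):
    """Expand one job template into one concrete job string per parameter value."""
    tmpl = Template(template)
    # substitute() raises KeyError if the template uses an unknown placeholder
    return [tmpl.substitute(PARAM=value) for value in values]


# Hypothetical command line; three jobs differing only in the run number.
jobs = expand_jobs(
    "run_analysis --input run_${PARAM}.dat --output hist_${PARAM}.root",
    range(3),
)
for job in jobs:
    print(job)
```

The point of doing the expansion on the broker side rather than in a user script is that the whole set can be submitted in one request instead of one request per job.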
DMS
• Need a metadata solution
  – Much discussion, seems to be converging
• File catalogue performance, bulk operations
  – Partly addressed by Fireman, LFC
  – LFC seems to have better performance but no bulk operations
• Catalogue replication
  – Oracle replication by LCG
  – gLite working towards local catalogues
• Small files
  – Not being addressed
• Reliable file replication
  – Partly addressed by FTS, need FPS as well
• File pinning
  – Not yet in SRMs or FTS
• Posix file access
  – May be addressed by gLite i/o
  – Security model unclear
• High level data management
  – Not yet (wait for Data Scheduler in 2.0)
Information systems
• Not many issues!
• Glue schema not ideal
  – Minor update just released
  – Maybe new major version in ~ 1 year?
• Stability, scalability
  – Need to test in production, test systems too small
Security
• VO management, groups and roles
  – Should come with VOMS
• VO policies for CEs
  – Some tools (LCAS, LCMAPS)
  – Needs experience
• ACLs on files
  – Should come with gLite File Access Service (FAS)
  – Not ready yet
  – Need to check security model satisfies sites
  – No support in SRM yet
• No outbound IP access
  – Some discussion, nothing yet
• Secure file management
  – Not needed for HEP, but strong need for biomed
  – Some work, not there yet
• Quotas
  – Some work on measurement
  – Enforcement?
• Vulnerabilities
  – Many known, little work
  – New group (Linda Cornwall)
Summary
• First gLite releases are out, but are buggy and incomplete
• Next release is late, not much time to the end of EGEE 1
• Many long-standing issues not addressed
  – Developers tend to follow their own interests rather than user/sysadmin needs
  – Functionality is less than at the end of EDG!
• Probably still >~ 1 year to get production quality
  – OK for EGEE if EGEE 2 is approved
  – Mismatch with LCG timescale
• LHC experiments are building their own Grids
  – How much of gLite do they need?
• Who decides requirements and priorities?