Grid Computing for UK Particle Physics Jeremy Coles GridPP Production Manager J.Coles@rl.ac.uk...

Grid Computing for UK Particle Physics

Jeremy ColesGridPP Production ManagerJ.Coles@rl.ac.uk

Monday 3rd SeptemberCHEP 2007, Victoria, Canada

Overview

2 Current resource status

3 Tier-1 developments

4 Tier-2 reviews

5 Future plans

6 Summary

1 Background

Acknowledgements: Material in this talk comes from various sources across GridPP including the web-site (http://www.gridpp.ac.uk/), Blogs (http://planet.gridpp.ac.uk/) and meetings such as a GridPP collaboration meeting last week (http://www.gridpp.ac.uk/gridpp19/)

Background

GridPP is a collaboration of particle physicists and computer scientists from the UK and CERN. The collaboration is building a distributed computing Grid across the UK for particle physicists. At the moment there is a working particle physics Grid across 17 UK institutions. A primary driver of this work is meeting the needs of WLCG.

http://www.gridpp.ac.uk/pmb/People_and_Roles.htm

Our user HEP community is wider than the LHC

• Esr Earth Science research• Magic Gamma ray telescope• Planck – satellite for mapping cosmic m/w bg• Cdf• D0• H1• Ilc - International Linear Collider project (future electron-positron linear

collider studies) • MINOS - Main Injector Neutrino Oscillation Search, is an experiment at

Fermilab designed to study the phenomena of neutrino oscillations • NA48• Supernemo• ZEUS• CEDAR - Combined e-Science Data Analysis Resource for high-energy physics • Mice - A neutrino factory experiments • T2k - (http://neutrino.kek.jp/jhfnu/) Next Generation Long Baseline Neutrino

Oscillation Experiment• SNO

The GridPP2 project map shows the many areas of recent involvement

phenoGRID is a VO dedicated to developing and integrating (for the experiments) some of the phenomenological tools necessary to interpret the events produced by the LHC. HERWIG; DISENT; JETRAD and DYRAD.

http://www.phenogrid.dur.ac.uk/

Job submission framework

http://gridportal.hep.ph.ic.ac.uk/rtm/

Real Time Monitor

However, provision of computing for the LHC experiments dominates

activities

The reality of the usage is much more erraticthan the target fairshare requests

ATLASBaBarCMS

ATLASBaBar

CMSLHCb

Tier-2s are delivering a lot of the CPU time

So where are we with delivery of CPU resources for WLCG?

Oxford

Glasgo

RAL PPD

Birming

Sheffie

Brunel

Durham

Bristo

Cambridge

Edinbu

Pledged Actual

Overall WLCG pledge has been metbut it is not sufficiently used

New machine room being built

Shared resources

CPU at the Tier-2 sites level

… storage is not quite so good

Oxford

Glasgo

RAL PPD

Birming

Sheffie

Brunel

Durham

Bristo

Cambridge

Edinbu

Pledged Actual More storage at site but not included in dCache

Shared resource

Additional procurement underway

Storage at the Tier-2 site level

and so far usage is not great.

Oxford

Glasgo

RAL PPD

Birming

Sheffie

Brunel

Durham

Bristo

Cambridge

Edinbu

Pledged Actual Used

ATLAS & Babar

CMS & Babar

• Experiments target sites with largest amounts of available disk• Local interests influence involvement in experiment testing/challenges

London Tier-2 (available) London Tier-2 (used)

Growing Steady

Tier-1 storage problems ...aaahhhh… we now face a dCache-CASTOR2

migration

CastorCMS

dCacheLHCb dCacheCastorATLAS

The migration to Castorcontinues to be a challenge!

At the GridPP User Board meeting on 20 June it was

agreed that 6 month notice be given for dCache

termination.

Experiments have to fund storage costs past March

2008 for ADS/vtp tape service.

Castor

Slide provided by Glenn Patrick (GridPP UB chair)

At least we can start the migration now!

Separate Instances for LHC ExperimentsATLAS Instance - Version 2.1.3 in production.CMS Instance - Version 2.1.3 in production.LHCb Instance - Version 2.1.3 in testing.

Known “challenges” remaining:• Tape migration rates• Monitoring at RAL (server load)• SRM development (v2.2 timescale)• disk1tape0 capability• Bulk file deletion• Repack

• The experiment data challenges

Problems faced (sample):• ID servers in which pools• Tape access speed• Address-in-use errors• Disk-disk copies not working• Changing storage class• Sub-request pile up• Network tuning -> server crash• Slow scheduling• Reserve space pile up• Small files • …

New RAL machine room (artists’ impression)

New RAL machine room - specs

• Shared room• 800m2 can accommodate 300 racks + 5 robots• 2.3MW Power/Cooling capacity• September 2008• In GridPP3 extra fault management/hardware staff planned as size of farm increases

• Several Tier-2 sites are also building/installing new facilities.

2008 2009 2010 2011

Storage Capacity (TiB)

2008 2009 2010 2011

April T

iB Tape

GridPP is keeping a close watch on site availability/stability

Site A

Site B

UK average(reveals non-site

specific problems)

Power outage

SE problems

Algorithm issue

We reviewed UK sites to determine their readiness for LHC startup

Questionnaire sent followed by team review:• Management and Context• Hardware and Network• Middleware and Operating Systems• Service Levels and Staff• Finance and Funding• Users

• Tier-2 sites working well together – technical cooperation• Most Tier-2 storage is RAID5. RAM 1GB-8GB per node. • Some large shared (HPC) resources yet to join. HEP influence large• Widespread install: Kickstart – local scripts - YAIM + tarball WNs. Cfengine use increasing.• Concern about lack of full testing prior to use in production – PPS role• Monitoring (& metric accuracy) needs to improve• Still issues with site-user communication• Missing functionality – VOMS, storage admin tool….

… and much more.

Some examples of resulting discussions

Generally agreed that regular (user) testing has helped improve

the infrastructure

} Sites in scheduled maintenance

Observation: It is very useful to work closely with a VO in resolving problems (nb. many of the problems solved here would have impacted others) as site admins are not able to test their site from a user’s perspective. However, GridPP sites have stronger ties with certain VOs and intensive support can not scale to all VOs…

Improved monitoring (with alarms) is a key ingredient for future progress

MonAMI is being used by several sites. It is a "generic" monitoring agent. It supports the monitoring of multiple services (DPM, Torque..) and supports reporting to multiple monitoring systems (Ganglia, Nagios…).

A UK monitoring workshop/tutorial is being arranged for October.

But you still have to understand what is going on and fix it!

http://monami.sourceforge.net/

Examples of recent problems caught by better monitoring

Phenomenologist and Biomed “DOS” attacks CPU ends up with too many gatekeeper processes active.

Excessive resource consumption seen at some DPM sites. Turns out that "hung" dpm.gsiftp connections from ATLAS transfers.

Removing inefficient jobs

Some (example) problems sites face

• Incomplete software installation• Slow user response• Disks fail or SE decommissioned – how to contact users• /tmp full with >50GB log files of an ATLAS user. Crippled worker node.• Jobs fill Storage (no quotas) -> site fails Site Availability (rm). Site blacklisted. • 1790 gatekeeper processes running as the user alice001 - but we have no ALICE jobs running. • Confusion over queue setups for prioritisation• Job connections left open with considerable consumption of cpu and network resources. • ATLAS ACL change caused a lot of confusion. Dates and requirements changed and the script for sites (made available without the source) had bugs which caused concern. • Very hard for sites to know if jobs are running successfully. There is a massive amount of wasted CPU time with job resubmission (often automated)• Knowing when to worry if no jobs are seen• Lack of sufficient information in tickets created by users (leads to problems assigning tickets and increased time resolving them)• Slow or no response to site follow up questions• Problem raised in multiple ways possible confusion about whether something is still a problem

CPU storm – gatekeeper processes stall - no jobs submitted. User banned.

GridPP future

2006 20082007

GridPP3 Proposal

Submitted July 13

GridPP2 GridPP2+

GridPP3

End of GridPP2 (31 August

Start of GridPP3 (1 April 2008)

£25.9M

Summary

2 Resource deployment ok – still a concern about utilisation

3 Some major problems for Tier-1 storage have eased

4 Availability and monitoring are now top priorities

6 GridPP now funded until 2011

1 GridPP has involvement with many HEP areas but WLCG dominates

5 Sites face “new” challenges and need more monitoring tools

Grid Computing for UK Particle Physics Jeremy Coles GridPP Production Manager J.Coles@rl.ac.uk...

Documents

A Grid For Particle Physics From testbed to production Jeremy Coles J.Coles@rl.ac.uk 3 rd September 2004 All Hands Meeting – Nottingham, UK

GridPP Deployment Status GridPP15 Jeremy Coles J.Coles@rl.ac.uk 11 th January 2006

GridPP From Prototype to Production David Britton 21/Sep/06 1.Context – Introduction to GridPP 2.Performance of the GridPP/EGEE/wLCG Grid 3.Some Successes

31 January 2003 GridPP Collaboration Meeting 1 CLRC e-Science Centre David Boyd Deputy Director, CLRC e-Science Centre d.r.s.boyd@rl.ac.uk

GridPP Deployment Status

GridPP Deployment Status Steve Traylen s.traylen@rl.ac.uk 28th October 2004 GOSC Face to Face, NESC, UK

Slide David Britton, University of Glasgow IET, Oct 09 1 Prof. David Britton GridPP Project leader University of Glasgow GridPP Oversight Committee Meeting

GridPP: Running a Production Grid

Tony Doyle GridPP Oversight Committee 15 May 2002

S.L.LloydGrid Steering Committee 8 March 2002 Slide 1 Status of GridPP Overview Financial Summary Recruitment Status EU DataGrid UK Grid Status GridPP

GridPP Deployment Status GridPP14 Jeremy Coles J.Coles@rl.ac.uk 6 th September 2005

GridPP: status report

J.coles@rl.ac.uk Service Resilience GridPP22 – UCL 1 st April 2009 GridPP22 – UCL 1 st April 2009 Introduction & overview

GRIDPP Collaboration Meeting

Alison Pamment J.A.Pamment@rl.ac.uk CF Standard Names Status and Development Alison Pamment J.A.Pamment@rl.ac.uk

Deployment Summary GridPP11 Jeremy Coles J.Coles@rl.ac.uk 15th September 2004

Steve Traylen Particle Physics Department EDG and LCG Status s.traylen@rl.ac.uk 9 th December 2003 s.traylen@rl.ac.uk

GridPP Steve Lloyd, Chair of the GridPP Collaboration Board

POOL Project Status GridPP 10 th Collaboration Meeting Radovan Chytracek CERN IT/DB, GridPP, LCG AA

GridPP: the UK's contribution to the international collaboration building a worldwide Grid, the LHC Computing Grid GridPP – is the system usable? Tony