
Grid Computing for UK Particle Physics

Jeremy Coles, GridPP Production Manager, J.Coles@rl.ac.uk

Monday 3rd September, CHEP 2007, Victoria, Canada

2

Overview

1 Background

2 Current resource status

3 Tier-1 developments

4 Tier-2 reviews

5 Future plans

6 Summary

Acknowledgements: Material in this talk comes from various sources across GridPP, including the website (http://www.gridpp.ac.uk/), blogs (http://planet.gridpp.ac.uk/) and meetings such as the GridPP collaboration meeting held last week (http://www.gridpp.ac.uk/gridpp19/).

3

Background

GridPP is a collaboration of particle physicists and computer scientists from the UK and CERN. The collaboration is building a distributed computing Grid across the UK for particle physicists. At the moment there is a working particle physics Grid across 17 UK institutions. A primary driver of this work is meeting the needs of WLCG.

http://www.gridpp.ac.uk/pmb/People_and_Roles.htm

4

Our HEP user community is wider than the LHC

• ESR - Earth Science Research
• MAGIC - gamma-ray telescope
• Planck - satellite for mapping the cosmic microwave background
• CDF
• D0
• H1
• ILC - International Linear Collider project (future electron-positron linear collider studies)
• MINOS - Main Injector Neutrino Oscillation Search, an experiment at Fermilab designed to study the phenomena of neutrino oscillations
• NA48
• SuperNEMO
• ZEUS
• CEDAR - Combined e-Science Data Analysis Resource for high-energy physics
• MICE - a neutrino factory experiment
• T2K - next-generation long-baseline neutrino oscillation experiment (http://neutrino.kek.jp/jhfnu/)
• SNO

5

The GridPP2 project map shows the many areas of recent involvement

phenoGRID is a VO dedicated to developing and integrating (for the experiments) some of the phenomenological tools necessary to interpret the events produced by the LHC, including HERWIG, DISENT, JETRAD and DYRAD.

http://www.phenogrid.dur.ac.uk/

Job submission framework

Real Time Monitor: http://gridportal.hep.ph.ic.ac.uk/rtm/

6

However, provision of computing for the LHC experiments dominates activities

The reality of the usage is much more erratic than the target fairshare requests.

[Charts: target fairshare vs. actual CPU usage shares for ATLAS, BaBar, CMS and LHCb]

Tier-2s are delivering a lot of the CPU time.

7

So where are we with delivery of CPU resources for WLCG?

[Bar chart - CPU at the Tier-2 site level: pledged vs. actual KSI2K for Manchester, Imperial, QMUL, Liverpool, Lancaster, Oxford, Glasgow**, RHUL, RAL PPD, Birmingham, Sheffield, Brunel, Durham, *UCL, Bristol, Cambridge and Edinburgh; callouts: "New machine room being built", "Shared resources"]

The overall WLCG pledge has been met but it is not sufficiently used.

8

… storage is not quite so good

[Bar chart - storage at the Tier-2 site level: pledged vs. actual TB for Manchester, Imperial, QMUL, Liverpool, Lancaster, Oxford, Glasgow**, RHUL, RAL PPD, Birmingham, Sheffield, Brunel, Durham, *UCL, Bristol, Cambridge and Edinburgh; callouts: "More storage at site but not included in dCache", "Shared resource", "Additional procurement underway"]

9

… and so far usage is not great

[Bar chart - storage usage at the Tier-2 site level: pledged vs. actual vs. used TB for the same sites, with the dominant user experiments (ATLAS, CMS, BaBar) marked per site]

• Experiments target sites with the largest amounts of available disk
• Local interests influence involvement in experiment testing/challenges

[Plot: London Tier-2 storage over time - available (growing) vs. used (steady)]

10

Tier-1 storage problems ...aaahhhh… we now face a dCache-CASTOR2 migration

[Diagram: Castor (CMS), dCache (LHCb), dCache and Castor (ATLAS)]

The migration to Castor continues to be a challenge!

At the GridPP User Board meeting on 20 June it was agreed that 6 months' notice be given for dCache termination.

Experiments have to fund storage costs past March 2008 for the ADS/vtp tape service.


Slide provided by Glenn Patrick (GridPP UB chair)

11

At least we can start the migration now!

Separate instances for the LHC experiments:
• ATLAS instance - version 2.1.3 in production
• CMS instance - version 2.1.3 in production
• LHCb instance - version 2.1.3 in testing

Known "challenges" remaining:
• Tape migration rates
• Monitoring at RAL (server load)
• SRM development (v2.2 timescale)
• disk1tape0 capability
• Bulk file deletion
• Repack
• The experiment data challenges

Problems faced (sample):
• Identifying which servers are in which pools
• Tape access speed
• Address-in-use errors
• Disk-disk copies not working
• Changing storage class
• Sub-request pile-up
• Network tuning -> server crash
• Slow scheduling
• Reserve space pile-up
• Small files
• …

Panic over?

12

New RAL machine room (artist's impression)

13

New RAL machine room - specs

• Shared room
• 800 m² can accommodate 300 racks + 5 robots
• 2.3 MW power/cooling capacity
• September 2008
• In GridPP3, extra fault management/hardware staff are planned as the size of the farm increases

• Several Tier-2 sites are also building/installing new facilities.

[Charts: projected Tier-1 capacity for April 2008 to 2011 - CPU (KSI2K) and storage capacity (TiB, disk and tape)]

14

GridPP is keeping a close watch on site availability/stability

[Availability plots: Site A, Site B and the UK average (reveals non-site-specific problems), with annotated dips for a power outage, SE problems and an algorithm issue]
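To make the plotted quantity concrete: availability is essentially the fraction of scheduled tests a site passes over a period. Below is a minimal sketch of that calculation, assuming a simple (site, passed) record format rather than the actual test-framework feed (e.g. SAM); it is illustrative only, not a GridPP tool.

# Illustrative sketch: compute per-site availability as the fraction of test
# runs that passed. The (site, passed) tuple format is an assumption; the real
# data would come from the grid test framework.
def availability(results):
    """results: iterable of (site, passed_bool). Returns {site: fraction passed}."""
    totals = {}
    passed = {}
    for site, ok in results:
        totals[site] = totals.get(site, 0) + 1
        passed[site] = passed.get(site, 0) + (1 if ok else 0)
    return {site: passed[site] / float(totals[site]) for site in totals}

if __name__ == "__main__":
    sample = [("Site A", True), ("Site A", False), ("Site B", True)]
    print(availability(sample))  # {'Site A': 0.5, 'Site B': 1.0}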

15

We reviewed UK sites to determine their readiness for LHC startup

Questionnaire sent, followed by team review:
• Management and Context
• Hardware and Network
• Middleware and Operating Systems
• Service Levels and Staff
• Finance and Funding
• Users

• Tier-2 sites working well together - technical cooperation
• Most Tier-2 storage is RAID5. RAM 1 GB-8 GB per node.
• Some large shared (HPC) resources yet to join. HEP influence large.
• Widespread install: Kickstart - local scripts - YAIM + tarball WNs. Cfengine use increasing.
• Concern about lack of full testing prior to use in production - PPS role
• Monitoring (& metric accuracy) needs to improve
• Still issues with site-user communication
• Missing functionality - VOMS, storage admin tool…

… and much more.

Some examples of resulting discussions

16

Generally agreed that regular (user) testing has helped improve the infrastructure

[Chart annotation: sites in scheduled maintenance]

Observation: It is very useful to work closely with a VO in resolving problems (n.b. many of the problems solved here would have impacted others), as site admins are not able to test their site from a user's perspective. However, GridPP sites have stronger ties with certain VOs and intensive support cannot scale to all VOs…

17

Improved monitoring (with alarms) is a key ingredient for future progress

MonAMI is being used by several sites. It is a "generic" monitoring agent: it can monitor multiple services (DPM, Torque…) and report to multiple monitoring systems (Ganglia, Nagios…). A minimal sketch of this sample-and-publish pattern follows below.

A UK monitoring workshop/tutorial is being arranged for October.

But you still have to understand what is going on and fix it!

http://monami.sourceforge.net/
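As an illustration only of the sample-and-publish pattern mentioned above: MonAMI itself is configured through its own plugin files (see the URL above), so the metric name, the port-2811 heuristic and the use of Ganglia's gmetric command-line tool here are assumptions made for the example, not MonAMI or GridPP specifics.

#!/usr/bin/env python
# Illustrative sketch only: sample one local quantity (established GridFTP
# control connections on port 2811) and publish it to Ganglia via gmetric.
# The metric name and the netstat heuristic are assumptions for this example.
import subprocess

def count_gsiftp_connections():
    """Count ESTABLISHED TCP connections on the GridFTP control port (2811)."""
    out = subprocess.run(["netstat", "-tn"], capture_output=True,
                         text=True, check=True).stdout
    return sum(1 for line in out.splitlines()
               if ":2811" in line and "ESTABLISHED" in line)

def push_to_ganglia(name, value):
    """Publish a single integer metric with Ganglia's gmetric tool."""
    subprocess.run(["gmetric", "--name", name, "--value", str(value),
                    "--type", "uint32", "--units", "connections"], check=True)

if __name__ == "__main__":
    push_to_ganglia("gsiftp_connections", count_gsiftp_connections())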

18

Examples of recent problems caught by better monitoring

Phenomenologist and Biomed "DoS" attacks: the CPU ends up with too many gatekeeper processes active.

Excessive resource consumption seen at some DPM sites. This turned out to be "hung" dpm.gsiftp connections from ATLAS transfers.

Removing inefficient jobs
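As a hedged illustration of the sort of alarm that catches the gatekeeper problem above (the threshold and the script itself are assumptions, not an actual GridPP tool), a site could count gatekeeper processes per account and warn on runaways:

#!/usr/bin/env python
# Illustrative check: count globus-gatekeeper processes per Unix account and
# warn when a single account exceeds a (site-chosen) threshold, as in the
# 1790-process alice001 incident mentioned on the next slide.
import collections
import subprocess
import sys

THRESHOLD = 200  # hypothetical per-account limit; tune per site

def gatekeeper_counts():
    """Map username -> number of running globus-gatekeeper processes."""
    out = subprocess.run(["ps", "-eo", "user=,args="], capture_output=True,
                         text=True, check=True).stdout
    counts = collections.Counter()
    for line in out.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and "globus-gatekeeper" in parts[1]:
            counts[parts[0]] += 1
    return counts

def main():
    status = 0
    for user, n in sorted(gatekeeper_counts().items()):
        if n > THRESHOLD:
            print("WARNING: %d gatekeeper processes for %s" % (n, user))
            status = 1
    return status

if __name__ == "__main__":
    sys.exit(main())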

19

Some (example) problems sites face

• Incomplete software installation
• Slow user response
• Disks fail or SE decommissioned - how to contact users?
• /tmp full with >50GB log files of an ATLAS user. Crippled worker node. (See the detection sketch below.)
• Jobs fill storage (no quotas) -> site fails Site Availability (rm). Site blacklisted.
• 1790 gatekeeper processes running as the user alice001 - but we have no ALICE jobs running.
• Confusion over queue setups for prioritisation
• Job connections left open with considerable consumption of CPU and network resources.
• ATLAS ACL change caused a lot of confusion. Dates and requirements changed, and the script for sites (made available without the source) had bugs which caused concern.
• Very hard for sites to know if jobs are running successfully. There is a massive amount of wasted CPU time with job resubmission (often automated).
• Knowing when to worry if no jobs are seen
• Lack of sufficient information in tickets created by users (leads to problems assigning tickets and increased time resolving them)
• Slow or no response to site follow-up questions
• Problems raised in multiple ways - possible confusion about whether something is still a problem

CPU storm – gatekeeper processes stall - no jobs submitted. User banned.
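As a hedged illustration for the /tmp item in the list above (the 5 GiB per-file limit and the script itself are assumptions, not an existing GridPP procedure), a worker node could be scanned for oversized files and their owners:

#!/usr/bin/env python
# Illustrative sketch for the "/tmp full with >50GB log files" problem above:
# walk /tmp and report files larger than a hypothetical per-file limit,
# together with their owners, so the site can contact the responsible user.
import os
import pwd

LIMIT_BYTES = 5 * 1024 ** 3  # assumed 5 GiB limit; not a GridPP policy

def oversized_files(root="/tmp", limit=LIMIT_BYTES):
    """Yield (path, owner, size_in_bytes) for files larger than limit."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable
            if st.st_size > limit:
                try:
                    owner = pwd.getpwuid(st.st_uid).pw_name
                except KeyError:
                    owner = str(st.st_uid)
                yield path, owner, st.st_size

if __name__ == "__main__":
    for path, owner, size in oversized_files():
        print("%s owned by %s: %.1f GiB" % (path, owner, size / 1024.0 ** 3))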

20

GridPP future

[Timeline 2006-2008: GridPP2 (ends 31 August 2007), then GridPP2+, then GridPP3 (starts 1 April 2008); GridPP3 proposal (£25.9M) submitted July 13]

http://www.ngs.ac.uk/access.html

21

Summary

1 GridPP has involvement with many HEP areas but WLCG dominates

2 Resource deployment OK - still a concern about utilisation

3 Some major problems for Tier-1 storage have eased

4 Availability and monitoring are now top priorities

5 Sites face "new" challenges and need more monitoring tools

6 GridPP now funded until 2011
