Grid Computing for UK Particle Physics
Jeremy Coles, GridPP Production Manager, [email protected]
Monday 3rd September, CHEP 2007, Victoria, Canada

Page 1:

Grid Computing for UK Particle Physics

Jeremy Coles, GridPP Production Manager, [email protected]

Monday 3rd September, CHEP 2007, Victoria, Canada

Page 2:

Overview

1 Background

2 Current resource status

3 Tier-1 developments

4 Tier-2 reviews

5 Future plans

6 Summary

Acknowledgements: material in this talk comes from various sources across GridPP, including the website (http://www.gridpp.ac.uk/), blogs (http://planet.gridpp.ac.uk/) and meetings such as last week's GridPP collaboration meeting (http://www.gridpp.ac.uk/gridpp19/).

Page 3:

Background

GridPP is a collaboration of particle physicists and computer scientists from the UK and CERN. The collaboration is building a distributed computing Grid across the UK for particle physicists. At the moment there is a working particle physics Grid across 17 UK institutions. A primary driver of this work is meeting the needs of WLCG.

http://www.gridpp.ac.uk/pmb/People_and_Roles.htm

Page 4:

Our HEP user community is wider than the LHC

• ESR - Earth Science Research
• MAGIC - gamma-ray telescope
• Planck - satellite for mapping the cosmic microwave background
• CDF
• D0
• H1
• ILC - International Linear Collider project (future electron-positron linear collider studies)
• MINOS - Main Injector Neutrino Oscillation Search, an experiment at Fermilab designed to study neutrino oscillations
• NA48
• SuperNEMO
• ZEUS
• CEDAR - Combined e-Science Data Analysis Resource for high-energy physics
• MICE - a neutrino factory experiment
• T2K (http://neutrino.kek.jp/jhfnu/) - next-generation long-baseline neutrino oscillation experiment
• SNO

Page 5:

The GridPP2 project map shows the many areas of recent involvement

phenoGRID is a VO dedicated to developing and integrating (for the experiments) some of the phenomenological tools needed to interpret the events produced by the LHC, including HERWIG, DISENT, JETRAD and DYRAD.

http://www.phenogrid.dur.ac.uk/

Job submission framework

http://gridportal.hep.ph.ic.ac.uk/rtm/

Real Time Monitor

Page 6:

However, provision of computing for the LHC experiments dominates activities

The reality of the usage is much more erratic than the target fairshare requests.

[Charts: target vs actual usage shares for ATLAS, BaBar, CMS and LHCb]

Tier-2s are delivering a lot of the CPU time

Page 7:

So where are we with delivery of CPU resources for WLCG?

[Bar chart: pledged vs actual CPU (KSI2K) at the Tier-2 site level - Manchester, Imperial, QMUL, Liverpool, Lancaster, Oxford, Glasgow**, RHUL, RAL PPD, Birmingham, Sheffield, Brunel, Durham, *UCL, Bristol, Cambridge, Edinburgh. Annotations: new machine room being built; shared resources.]

The overall WLCG pledge has been met, but the capacity is not sufficiently used.

Page 8:

… storage is not quite so good

[Bar chart: pledged vs actual storage (TB) at the Tier-2 site level, same sites as the CPU chart above. Annotations: more storage at site but not included in dCache; shared resource; additional procurement underway.]

Page 9:

… and so far usage is not great.

[Bar chart: pledged, actual and used storage (TB) at the Tier-2 site level, same sites as above. Callouts: ATLAS; CMS; ATLAS & BaBar; CMS & BaBar.]

• Experiments target sites with largest amounts of available disk
• Local interests influence involvement in experiment testing/challenges

[Inset plots: London Tier-2 available storage (growing) and used storage (steady)]

Page 10:

Tier-1 storage problems ...aaahhhh… we now face a dCache-CASTOR2 migration

[Diagram: CMS data on Castor; LHCb data on dCache; ATLAS data split between dCache and Castor]

The migration to Castor continues to be a challenge!

At the GridPP User Board meeting on 20 June it was agreed that six months' notice would be given for dCache termination.

Experiments have to fund storage costs past March 2008 for the ADS/vtp tape service.


Slide provided by Glenn Patrick (GridPP UB chair)

Page 11:

At least we can start the migration now!

Separate instances for the LHC experiments:
• ATLAS instance - version 2.1.3 in production
• CMS instance - version 2.1.3 in production
• LHCb instance - version 2.1.3 in testing

Known “challenges” remaining:
• Tape migration rates
• Monitoring at RAL (server load)
• SRM development (v2.2 timescale)
• disk1tape0 capability
• Bulk file deletion
• Repack

• The experiment data challenges

Problems faced (sample):
• Identifying which servers are in which pools
• Tape access speed
• Address-in-use errors
• Disk-disk copies not working
• Changing storage class
• Sub-request pile-up
• Network tuning -> server crash
• Slow scheduling
• Reserved-space pile-up
• Small files
• …

PANIC over?

Page 12:

New RAL machine room (artist's impression)

Page 13:

New RAL machine room - specs

• Shared room
• 800 m² can accommodate 300 racks + 5 robots
• 2.3 MW power/cooling capacity (see the rough per-rack figure below)
• September 2008
• In GridPP3, extra fault-management/hardware staff are planned as the size of the farm increases
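A rough figure these specs imply (a back-of-the-envelope check only, assuming the full 2.3 MW is available to the 300 racks):

```latex
\frac{2.3\ \text{MW}}{300\ \text{racks}} \approx 7.7\ \text{kW per rack}
```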

• Several Tier-2 sites are also building/installing new facilities.

[Charts: planned Tier-1 capacity by April of each year 2008-2011 - CPU (KSI2K) and storage capacity (TiB), the latter split into disk and tape]

Page 14:

GridPP is keeping a close watch on site availability/stability

[Availability plots for Site A, Site B and the UK average (the average reveals non-site-specific problems). Annotated incidents: power outage, SE problems, algorithm issue.]
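The UK average in these plots is worth spelling out: a dip that appears at many sites simultaneously points to a central or middleware problem rather than a site fault. A minimal sketch of the idea in Python, using made-up per-site availability numbers (site names and values are illustrative, not GridPP data):

```python
# Sketch: per-site availability and the cross-site (UK) average.
# A dip in the average that is shared across sites usually indicates a
# non-site-specific (central/middleware) problem.

from statistics import mean

# Hypothetical fraction of tests passed per site per day.
availability = {
    "SiteA": [0.98, 0.97, 0.55, 0.99],   # day 3: site-specific outage
    "SiteB": [0.95, 0.60, 0.96, 0.97],   # day 2: dip shared with SiteC
    "SiteC": [0.99, 0.62, 0.98, 0.98],
}

days = len(next(iter(availability.values())))
uk_average = [mean(site[d] for site in availability.values()) for d in range(days)]

for d, avg in enumerate(uk_average, start=1):
    if avg < 0.8:
        print(f"day {d}: UK average {avg:.2f} - suspect a non-site-specific problem")
    else:
        print(f"day {d}: UK average {avg:.2f}")
```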

Page 15:

We reviewed UK sites to determine their readiness for LHC startup

Questionnaire sent, followed by team review covering:
• Management and Context
• Hardware and Network
• Middleware and Operating Systems
• Service Levels and Staff
• Finance and Funding
• Users

• Tier-2 sites are working well together - technical cooperation
• Most Tier-2 storage is RAID5. RAM ranges from 1 GB to 8 GB per node.
• Some large shared (HPC) resources are yet to join. HEP influence is large.
• Widespread install methods: Kickstart, local scripts, YAIM + tarball WNs. Cfengine use is increasing.
• Concern about lack of full testing prior to use in production - PPS role
• Monitoring (and metric accuracy) needs to improve
• Still issues with site-user communication
• Missing functionality - VOMS, storage admin tool, …

… and much more.

Some examples of resulting discussions

Page 16:

Generally agreed that regular (user) testing has helped improve the infrastructure

[Chart annotation: sites in scheduled maintenance]

Observation: it is very useful to work closely with a VO in resolving problems (n.b. many of the problems solved here would have impacted others), since site admins are not able to test their site from a user's perspective. However, GridPP sites have stronger ties with certain VOs, and intensive support cannot scale to all VOs…

Page 17:

Improved monitoring (with alarms) is a key ingredient for future progress

MonAMI, a "generic" monitoring agent, is being used by several sites. It supports monitoring of multiple services (DPM, Torque, …) and reporting to multiple monitoring systems (Ganglia, Nagios, …).
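To give a flavour of what such an agent does, here is a much-reduced sketch in the same spirit: sample one Torque metric and publish it to Ganglia. This is not MonAMI's own plugin API or configuration format, and the qstat output handling and gmetric invocation are assumptions about a typical Torque + Ganglia installation:

```python
# Sketch: sample a batch-system metric and publish it to Ganglia,
# loosely in the spirit of what a MonAMI Torque plugin provides.
import subprocess

def count_batch_jobs() -> int:
    """Count job lines reported by Torque's qstat (assumed to be on PATH;
    header-line filtering is a simplification of qstat's default output)."""
    out = subprocess.run(["qstat"], capture_output=True, text=True, check=False).stdout
    lines = [l for l in out.splitlines() if l and not l.startswith(("Job", "---"))]
    return len(lines)

def publish(name: str, value: int) -> None:
    """Push a gauge to Ganglia via the gmetric command-line tool (assumed installed)."""
    subprocess.run(["gmetric", "--name", name, "--value", str(value),
                    "--type", "uint32", "--units", "jobs"], check=False)

if __name__ == "__main__":
    publish("torque_jobs", count_batch_jobs())
```

In practice the point of an agent like MonAMI is to avoid hand-rolling dozens of small scripts like this one for each service and each monitoring target.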

A UK monitoring workshop/tutorial is being arranged for October.

But you still have to understand what is going on and fix it!

http://monami.sourceforge.net/

Page 18:

Examples of recent problems caught by better monitoring

Phenomenologist and Biomed "DoS" attacks: the CPU ends up with too many gatekeeper processes active.

Excessive resource consumption was seen at some DPM sites; it turned out to be caused by "hung" dpm.gsiftp connections from ATLAS transfers.

Removing inefficient jobs
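Both cases above are ones where an alarm is worth more than a dashboard. A minimal Nagios-style check for the gatekeeper storm might look like the sketch below; the process name, the per-user thresholds and the use of ps are assumptions, while the exit-code convention (0 OK, 1 warning, 2 critical) is standard for Nagios checks:

```python
# Sketch: Nagios-style check for a "gatekeeper storm" - too many
# gatekeeper processes owned by a single pool account.
import collections
import subprocess
import sys

WARN, CRIT = 200, 500            # assumed thresholds, tune per site
PROC_NAME = "globus-gatekeeper"  # assumed gatekeeper process name

def gatekeepers_per_user() -> dict:
    """Count matching processes per owning user from `ps -eo user,args`."""
    out = subprocess.run(["ps", "-eo", "user,args"],
                         capture_output=True, text=True, check=False).stdout
    counts = collections.Counter()
    for line in out.splitlines()[1:]:          # skip the ps header line
        parts = line.split(None, 1)
        if len(parts) == 2 and PROC_NAME in parts[1]:
            counts[parts[0]] += 1
    return counts

def main() -> int:
    worst_user, worst = max(gatekeepers_per_user().items(),
                            key=lambda kv: kv[1], default=("none", 0))
    if worst >= CRIT:
        print(f"CRITICAL: {worst} {PROC_NAME} processes for {worst_user}")
        return 2
    if worst >= WARN:
        print(f"WARNING: {worst} {PROC_NAME} processes for {worst_user}")
        return 1
    print(f"OK: at most {worst} {PROC_NAME} processes per user")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```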

Page 19:

Some (example) problems sites face

• Incomplete software installation
• Slow user response
• Disks fail or an SE is decommissioned - how to contact users?
• /tmp filled with >50 GB of log files from an ATLAS user, crippling the worker node
• Jobs fill storage (no quotas) -> site fails Site Availability (rm); site blacklisted
• 1790 gatekeeper processes running as the user alice001 - but no ALICE jobs running
• Confusion over queue setups for prioritisation
• Job connections left open, with considerable consumption of CPU and network resources
• An ATLAS ACL change caused a lot of confusion: dates and requirements changed, and the script for sites (made available without the source) had bugs which caused concern
• Very hard for sites to know if jobs are running successfully; there is a massive amount of wasted CPU time with job resubmission (often automated) - see the sketch below
• Knowing when to worry if no jobs are seen
• Lack of sufficient information in tickets created by users (leads to problems assigning tickets and increased time resolving them)
• Slow or no response to site follow-up questions
• A problem raised in multiple ways - possible confusion about whether something is still a problem

CPU storm – gatekeeper processes stall - no jobs submitted. User banned.
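Several of the items above ("removing inefficient jobs", "massive amount of wasted CPU time") reduce to watching CPU efficiency, i.e. CPU time divided by wall-clock time, per job. A small illustrative sketch over hypothetical accounting records (the field layout and the 10% threshold are assumptions, not taken from any particular batch system):

```python
# Sketch: flag inefficient jobs from batch accounting records.
# Efficiency = CPU time / wall time; very low values usually mean a job
# is stalled (e.g. waiting on a hung connection) and just occupying a slot.

# Hypothetical accounting records: (job_id, owner, cpu_seconds, wall_seconds)
records = [
    ("1001.ce", "atlas001", 35500, 36000),
    ("1002.ce", "alice001",    40, 36000),   # ~0.1% efficient: stalled
    ("1003.ce", "lhcb005",  17800, 18000),
]

THRESHOLD = 0.10  # flag jobs using less than 10% of their wall time as CPU

for job_id, owner, cpu, wall in records:
    efficiency = cpu / wall if wall else 0.0
    if efficiency < THRESHOLD:
        print(f"candidate for removal: {job_id} ({owner}), efficiency {efficiency:.1%}")
```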

Page 20:

GridPP future

[Timeline, 2006-2008: GridPP2 ends 31 August 2007; GridPP2+ covers the period until GridPP3 starts on 1 April 2008. GridPP3 proposal (£25.9M) submitted 13 July.]

http://www.ngs.ac.uk/access.html

Page 21:

Summary

1 GridPP has involvement with many HEP areas but WLCG dominates

2 Resource deployment ok - still a concern about utilisation

3 Some major problems for Tier-1 storage have eased

4 Availability and monitoring are now top priorities

5 Sites face "new" challenges and need more monitoring tools

6 GridPP now funded until 2011