Grid Computing for UK Particle Physics
Jeremy Coles, GridPP Production Manager, J.Coles@rl.ac.uk
Monday 3rd September, CHEP 2007, Victoria, Canada
2
Overview
1 Background
2 Current resource status
3 Tier-1 developments
4 Tier-2 reviews
5 Future plans
6 Summary
Acknowledgements: Material in this talk comes from various sources across GridPP, including the website (http://www.gridpp.ac.uk/), blogs (http://planet.gridpp.ac.uk/) and meetings such as last week's GridPP collaboration meeting (http://www.gridpp.ac.uk/gridpp19/)
3
Background
GridPP is a collaboration of particle physicists and computer scientists from the UK and CERN. The collaboration is building a distributed computing Grid across the UK for particle physicists; at present a working particle physics Grid spans 17 UK institutions. A primary driver of this work is meeting the needs of the WLCG.
http://www.gridpp.ac.uk/pmb/People_and_Roles.htm
4
Our HEP user community is wider than the LHC
• ESR – Earth Science Research
• MAGIC – gamma-ray telescope
• Planck – satellite for mapping the cosmic microwave background
• CDF
• D0
• H1
• ILC – International Linear Collider project (future electron-positron linear collider studies)
• MINOS – Main Injector Neutrino Oscillation Search, an experiment at Fermilab designed to study the phenomenon of neutrino oscillations
• NA48
• SuperNEMO
• ZEUS
• CEDAR – Combined e-Science Data Analysis Resource for high-energy physics
• MICE – a neutrino factory experiment
• T2K – next-generation long-baseline neutrino oscillation experiment (http://neutrino.kek.jp/jhfnu/)
• SNO
5
The GridPP2 project map shows the many areas of recent involvement
phenoGRID is a VO dedicated to developing and integrating (for the experiments) some of the phenomenological tools necessary to interpret the events produced by the LHC, including HERWIG, DISENT, JETRAD and DYRAD.
http://www.phenogrid.dur.ac.uk/
Job submission framework
Real Time Monitor: http://gridportal.hep.ph.ic.ac.uk/rtm/
6
However, provision of computing for the LHC experiments dominates activities
The reality of the usage is much more erratic than the target fairshare requests (quantified in the sketch below).
[Pie charts: target fairshare vs actual CPU usage shares for ATLAS, BaBar, CMS and LHCb]
Tier-2s are delivering a lot of the CPU time
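To make the fairshare point concrete: the deviation is just each VO's delivered CPU fraction minus its target share. A minimal Python sketch of that comparison follows; the function name and the example share table are illustrative assumptions, not GridPP's actual accounting interface or 2007 numbers.

```python
def share_deviation(delivered_hours, targets):
    """Compare delivered CPU fractions against fairshare targets.

    delivered_hours: {vo: CPU hours delivered over the period}
    targets:         {vo: target fraction, summing to 1.0}
    Returns {vo: delivered_fraction - target_fraction}.
    """
    total = sum(delivered_hours.values())
    return {vo: delivered_hours.get(vo, 0.0) / total - share
            for vo, share in targets.items()}

# Hypothetical numbers for illustration only.
targets = {"atlas": 0.40, "cms": 0.30, "lhcb": 0.20, "babar": 0.10}
delivered = {"atlas": 5200.0, "cms": 1100.0, "lhcb": 2900.0, "babar": 800.0}
for vo, dev in share_deviation(delivered, targets).items():
    print("%s: %+.1f%% relative to target" % (vo, 100 * dev))
```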
7
So where are we with delivery of CPU resources for WLCG?
[Bar chart: CPU at the Tier-2 site level – pledged vs actual capacity in kSI2K for Manchester, Imperial, QMUL, Liverpool, Lancaster, Oxford, Glasgow**, RHUL, RAL PPD, Birmingham, Sheffield, Brunel, Durham, *UCL, Bristol, Cambridge and Edinburgh; annotations: new machine room being built, shared resources]
The overall WLCG pledge has been met, but it is not sufficiently used.
8
… storage is not quite so good
[Bar chart: storage at the Tier-2 site level – pledged vs actual capacity in TB for the same seventeen sites; annotations: more storage at some sites but not included in dCache, shared resource, additional procurement underway]
9
… and so far usage is not great.
[Bar chart: storage at the Tier-2 site level – pledged vs actual vs used capacity in TB for the same seventeen sites, with the dominant user experiments (ATLAS, CMS, BaBar) marked per site]
• Experiments target sites with the largest amounts of available disk
• Local interests influence involvement in experiment testing/challenges
[Pie charts: London Tier-2 storage – available (growing) vs used (steady)]
10
Tier-1 storage problems… aaahhhh… we now face a dCache-CASTOR2 migration
[Diagram: experiment data flows during the migration – CMS on Castor, LHCb on dCache, ATLAS split between dCache and Castor]
The migration to Castor continues to be a challenge!
At the GridPP User Board meeting on 20 June it was agreed that six months' notice be given for dCache termination.
Experiments have to fund storage costs past March 2008 for the ADS/vtp tape service.
Slide provided by Glenn Patrick (GridPP UB chair)
11
At least we can start the migration now!
Separate instances for the LHC experiments:
• ATLAS instance – version 2.1.3 in production
• CMS instance – version 2.1.3 in production
• LHCb instance – version 2.1.3 in testing
Known “challenges” remaining:
• Tape migration rates
• Monitoring at RAL (server load)
• SRM development (v2.2 timescale)
• disk1tape0 capability
• Bulk file deletion
• Repack
• The experiment data challenges
Problems faced (sample):
• Identifying which servers are in which pools
• Tape access speed
• Address-in-use errors (illustrated below)
• Disk-disk copies not working
• Changing storage class
• Sub-request pile-up
• Network tuning -> server crash
• Slow scheduling
• Reserve space pile-up
• Small files
• …
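One item in the list above is generic enough to illustrate: “address-in-use” errors typically appear when a daemon is restarted while its old listening socket is still in TIME_WAIT, and the usual remedy is to set SO_REUSEADDR before binding. A minimal, generic Python illustration (plain socket code, not CASTOR's implementation; the port number is hypothetical):

```python
import socket

# Without SO_REUSEADDR, re-binding a port whose previous socket is still
# in TIME_WAIT fails with EADDRINUSE ("address already in use").
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("0.0.0.0", 5015))  # hypothetical service port
listener.listen(16)
```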
PANIC over?
12
New RAL machine room (artist's impression)
13
New RAL machine room - specs
• Shared room
• 800 m² can accommodate 300 racks + 5 robots
• 2.3 MW power/cooling capacity
• Ready September 2008
• In GridPP3, extra fault management/hardware staff are planned as the size of the farm increases
• Several Tier-2 sites are also building/installing new facilities.
[Charts: planned capacity each April, 2008-2011 – CPU in kSI2K (axis to 16,000) and storage in TiB (axis to 20,000), storage split between tape and disk]
14
GridPP is keeping a close watch on site availability/stability
[Availability plots for Site A and Site B against the UK average, which reveals non-site-specific problems; annotated incidents: a power outage, SE problems and an algorithm issue]
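The value of the UK-average trace is that it separates local faults from grid-wide ones: a dip at one site while the average stays high is a site problem, while a shared dip points at central services or middleware. A minimal Python sketch of that classification (the data layout and the 0.8 threshold are assumptions):

```python
def classify_dips(site_avail, uk_avg, threshold=0.8):
    """Label each low-availability day as site-specific or grid-wide.

    site_avail: daily availability fractions for one site
    uk_avg:     daily availability averaged over all UK sites
    """
    report = []
    for day, (site, uk) in enumerate(zip(site_avail, uk_avg)):
        if site < threshold <= uk:
            report.append((day, "site-specific problem"))
        elif site < threshold and uk < threshold:
            report.append((day, "non-site-specific (grid-wide) problem"))
    return report

# Hypothetical week: day 2 is a local SE failure, day 5 is grid-wide.
site = [0.95, 0.97, 0.40, 0.93, 0.96, 0.55, 0.94]
uk   = [0.92, 0.94, 0.91, 0.90, 0.93, 0.50, 0.92]
print(classify_dips(site, uk))
```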
15
We reviewed UK sites to determine their readiness for LHC startup
Questionnaire sent, followed by team review covering:
• Management and Context
• Hardware and Network
• Middleware and Operating Systems
• Service Levels and Staff
• Finance and Funding
• Users
• Tier-2 sites working well together – technical cooperation
• Most Tier-2 storage is RAID5; RAM is 1GB-8GB per node
• Some large shared (HPC) resources have yet to join; HEP influence is large
• Widespread install methods: Kickstart, local scripts, YAIM + tarball WNs; Cfengine use is increasing
• Concern about lack of full testing prior to use in production – a role for the PPS
• Monitoring (and metric accuracy) needs to improve
• Still issues with site-user communication
• Missing functionality – VOMS, storage admin tool, …
… and much more.
Some examples of resulting discussions
16
Generally agreed that regular (user) testing has helped improve the infrastructure
[Chart annotation: sites in scheduled maintenance]
Observation: it is very useful to work closely with a VO in resolving problems (n.b. many of the problems solved here would have impacted others), as site admins are not able to test their site from a user's perspective. However, GridPP sites have stronger ties with certain VOs, and intensive support cannot scale to all VOs…
17
Improved monitoring (with alarms) is a key ingredient for future progress
MonAMI is being used by several sites. It is a "generic" monitoring agent: it supports monitoring of multiple services (DPM, Torque, …) and reporting to multiple monitoring systems (Ganglia, Nagios, …). A Nagios-style probe is sketched below.
A UK monitoring workshop/tutorial is being arranged for October.
But you still have to understand what is going on and fix it!
http://monami.sourceforge.net/
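For the Nagios side of the picture, the standard plugin convention is simple: print one status line and exit 0 (OK), 1 (WARNING) or 2 (CRITICAL). A minimal sketch of such a probe for free space on an SE partition follows; the mount point and thresholds are assumptions, and this is a generic Nagios-style check, not a MonAMI plugin.

```python
#!/usr/bin/env python
"""Minimal Nagios-style check for free space on a storage partition.

Standard Nagios plugin exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL.
The mount point and thresholds are illustrative assumptions.
"""
import os
import sys

PATH = "/storage"   # hypothetical SE mount point
WARN_FREE = 0.15    # warn below 15% free
CRIT_FREE = 0.05    # critical below 5% free

st = os.statvfs(PATH)
free = float(st.f_bavail) / st.f_blocks

if free < CRIT_FREE:
    print("CRITICAL: %.1f%% free on %s" % (100 * free, PATH))
    sys.exit(2)
if free < WARN_FREE:
    print("WARNING: %.1f%% free on %s" % (100 * free, PATH))
    sys.exit(1)
print("OK: %.1f%% free on %s" % (100 * free, PATH))
sys.exit(0)
```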
18
Examples of recent problems caught by better monitoring
Phenomenologist and Biomed “DoS” attacks: the CE ends up with too many gatekeeper processes active.
Excessive resource consumption seen at some DPM sites turned out to be caused by “hung” dpm.gsiftp connections from ATLAS transfers. (A counting watchdog is sketched below.)
Removing inefficient jobs
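Both incidents were caught by counting things a watchdog can count. A sketch of a per-user gatekeeper process counter (the process name and the threshold are assumptions about a gLite-era CE):

```python
import subprocess
from collections import Counter

def process_counts(pattern="globus-gatekeeper"):
    """Count processes whose command matches `pattern`, grouped by user."""
    out = subprocess.check_output(["ps", "-eo", "user,comm"], text=True)
    counts = Counter()
    for line in out.splitlines()[1:]:      # skip the header row
        parts = line.split(None, 1)
        if len(parts) == 2 and pattern in parts[1]:
            counts[parts[0]] += 1
    return counts

LIMIT = 200  # hypothetical per-user alarm threshold
for user, n in process_counts().items():
    if n > LIMIT:
        print("ALARM: %d gatekeeper processes owned by %s" % (n, user))
```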
19
Some (example) problems sites face
• Incomplete software installation
• Slow user response
• Disks fail or an SE is decommissioned – how to contact users?
• /tmp filled with >50GB of log files from one ATLAS user, crippling a worker node (a preventive guard is sketched after this list)
• Jobs fill storage (no quotas) -> site fails the Site Availability test (rm) and gets blacklisted
• 1790 gatekeeper processes running as the user alice001 – but we have no ALICE jobs running
• Confusion over queue setups for prioritisation
• Job connections left open, with considerable consumption of CPU and network resources
• An ATLAS ACL change caused a lot of confusion: dates and requirements changed, and the script for sites (made available without the source) had bugs which caused concern
• Very hard for sites to know if jobs are running successfully; there is a massive amount of wasted CPU time from (often automated) job resubmission
• Knowing when to worry if no jobs are seen
• Lack of sufficient information in tickets created by users (leads to problems assigning tickets and increased time resolving them)
• Slow or no response to site follow-up questions
• Problems raised in multiple ways – possible confusion about whether something is still a problem
CPU storm – gatekeeper processes stall, no jobs submitted; user banned.
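Some of these (a full /tmp, jobs filling storage with no quotas) are preventable with a node-level guard, as sketched below. The threshold is an assumption; `pbsnodes -o` is Torque's command for marking a node offline, and the `-N` note flag is assumed available in the installed version.

```python
import shutil
import socket
import subprocess

THRESHOLD = 0.90  # hypothetical: act when /tmp is 90% full

usage = shutil.disk_usage("/tmp")
fill = float(usage.used) / usage.total

if fill > THRESHOLD:
    node = socket.gethostname()
    # Mark this worker node offline in Torque so no new jobs land here;
    # running jobs finish, and an operator can clean up and re-enable it.
    subprocess.run(["pbsnodes", "-o", node,
                    "-N", "tmp %.0f%% full" % (100 * fill)], check=False)
```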
20
GridPP future
[Timeline, 2006-2008: GridPP2 ends 31 August 2007, bridged by GridPP2+; GridPP3 (£25.9M; proposal submitted 13 July) starts 1 April 2008]
http://www.ngs.ac.uk/access.html
21
Summary
1 GridPP has involvement with many HEP areas but WLCG dominates
2 Resource deployment ok – still a concern about utilisation
3 Some major problems for Tier-1 storage have eased
4 Availability and monitoring are now top priorities
5 Sites face “new” challenges and need more monitoring tools
6 GridPP now funded until 2011