Ian Bird, LCG Project Leader: WLCG Status Report, 8th June 2009, Overview Board


Ian Bird
LCG Project Leader

WLCG Status Report

8th June 2009
Overview Board


Agenda

• General status & Milestones
• STEP’09
• Resource planning – post-RRB
• EGI
– Progress
– EGI & WLCG (+CERN)
– Jamie: Preparations for HEP SSC
– Steven: gLite consortium and later
• Discussion
– How WLCG should interact with EGI + NGIs in future

Christoph Eck – CERN-IT-2


CERN

WLCG MoU Signature Status

All anticipated signatures have now been received, including Brazil (April).

Sue Foffano – CERN-IT-3

Today we have 49 MoU signatories, representing 34 countries:

Australia, Austria, Belgium, Brazil, Canada, China, Czech Rep, Denmark, Estonia, Finland, France, Germany, Hungary, Italy, India, Israel, Japan, Rep. Korea, Netherlands, Norway, Pakistan, Poland, Portugal, Romania, Russia, Slovenia, Spain, Sweden, Switzerland, Taipei, Turkey, UK, Ukraine, USA.


CERN + Tier 1 accounting - 2008

[Figure: four monthly panels for 2008 (Jan to Dec): CPU Time Delivered (MSI2K-days), Disk Storage Used (TeraBytes), Tape Storage Used (TeraBytes), Ratio of CPU : Wall_clock Times (%).]


Accounting - 2009

• Ramp up in 2009 delayed until September
– But several Tier 1s have started
• CPU usage is significantly increased wrt 2008
• All missing resources from 2008 now installed, except NL-T1, where delayed until >>July, and ASGC, where fire delayed CPU installation


• Reliabilities now regularly reported for all experiments in addition to OPS (next slides)

• Only T2 federation still not reporting is Ukraine


Experiment-specific reliabilities


Service issues

Problems requiring a “Service Incident Report” (dates are day/month):

• 8/1: CERN – many jobs killed due to memory problems
• 17/1: CERN – FTS transfer problems for ATLAS
• 23/1: CERN – FTS/SRM/Castor problems for ATLAS
• 24/1: FZK – FTS & LFC down for 3 days
• 26/1: Backward-incompatible change on SRM
• 21/2: CNAF – network outage to Tier 2s and some Tier 1s
• 25/2: ASGC – fire affecting entire site, services relocated
• 27/2: CERN – accidental deletion of RAID volumes
• 4/3: CERN – general Castor outage for 3 hours
• 14/3: CERN – ATLAS Castor outage for 12 hours
• 24/3: RAL – site down after power glitches, knock-on effects for several days
• 2/4: IN2P3 – tape robotics failure
• 11/4: TRIUMF – cooling failure
• 3/5: IN2P3 – cooling down for 44 hours (still in degraded mode until new cooling added in June)
• 4/5: SARA – MSS tape backend down
• 14/5: PIC – cooling down for 5 hours
• 19/5: Geant routing problem cut off CERN from all Geant customers (not OPN)
• 20/5: NL-T1 – dCache upgrade problems

- Not all sites (yet) reporting consistently, but improving

- Power/cooling issues continue at ~1/month
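The ~1/month figure can be cross-checked against the incident list with a minimal Python sketch. Which incidents count as power- or cooling-related is a classification made here for illustration only (the ASGC fire is included as an infrastructure incident), and the dates are read as day/month in 2009.

```python
from datetime import date

# Incidents from the list above that this sketch classifies as
# power/cooling/infrastructure related (an assumption, not the author's tally).
POWER_COOLING = {
    "ASGC fire": date(2009, 2, 25),
    "RAL power glitches": date(2009, 3, 24),
    "TRIUMF cooling failure": date(2009, 4, 11),
    "IN2P3 cooling down": date(2009, 5, 3),
    "PIC cooling down": date(2009, 5, 14),
}

def incidents_per_month(incidents, start, end):
    """Average number of incidents per 30-day month between start and end."""
    months = (end - start).days / 30.0
    return len(incidents) / months

# Reporting window assumed to run from 1 January to 31 May 2009.
rate = incidents_per_month(POWER_COOLING, date(2009, 1, 1), date(2009, 5, 31))
print(f"{len(POWER_COOLING)} incidents, about {rate:.1f} per month")
```

Under these assumptions the tally is five incidents over five months, consistent with the slide's estimate.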



Milestones

Added milestones for:
• 2009 procurements
• SL5 deployment
• SCAS/gLexec deployment
• Updates to accounting (Tier 2 report, reporting installed capacity, user-level reporting)
• STEP’09 specifics
• CREAM CE rollout
• MSS Metrics
• CPU benchmark transition


Milestone table...


MSS Metrics for Tier 1s

• Metrics gathered for Tier 1s – large set of metrics
• Most sites agree that they can provide almost all (maybe not by VO at the level of tape access)
• Published by each site via XML
• Displayed in SLS
• Data available (automatically) now for:
– CERN, TRIUMF, CNAF, BNL, ASGC
– Others available soon


STEP’09

• The LHCC mini review recommended a 2009 readiness exercise, specifically to address the issues of:
– Data recall from tape at Tier 1s, for more than 1 experiment
– Analysis activities
• At the WLCG workshop prior to CHEP it was agreed that we would have such an exercise, despite the difficulties in co-scheduling this between the experiments
– “Scale Testing for the Experimental Programme – 2009” (STEP’09)
– Implication that each year we foresee increased scaling tests
• Timescale: May (preparations), June
– Essentially started last week (ATLAS, CMS, ALICE), this week (LHCb)
• BUT:
– IN2P3 had scheduled MSS upgrade (hw+sw) June 1-4, degraded performance until finished (agreed with experiments)
– FZK had problem with tape backend hardware just before start of STEP
– ASGC put in huge effort to prepare for STEP according to ATLAS requests after fire

(Following summaries thanks to Julia Andreeva)


STEP’09: ATLAS

Goals:
• Parallel test of all main tasks at nominal data taking rate
– Export from Tier 0
– Reprocessing + reconstruction at Tier 1s; tape reading/writing
– Export of processed data to other Tier 1s and Tier 2s
– Simulation at Tier 2
– Analysis at Tier 2 using 50% of T2 CPU, 25% pilot submission, 25% via WMS

Progress:
• Started June 1
• Simulation running at full rate
• Load generator for data transfers reached 100% on 2nd June
• Reprocessing running in 7 ATLAS clouds
• Analysis in progress – using Hammercloud
– 10-20k jobs concurrently between WMS, Panda, ARC; 130k jobs submitted so far (June 4 – less than 1 day of activity)
– All clouds receive jobs from both WMS and Panda
– ATLAS measures efficiency and read performance at each site


STEP’09: CMS

Goals:
• Tier 0 data recording in parallel with other experiments
– Plan 48 hours run: 10-11 June & 17-18 June
– Ideally we would like longer run (5 days) but for CMS this would interfere with weekly cosmics run
• Tier 1 focus on tape archiving and prestaging (2-21 June)
• Data transfer goals:
– Tier 0 – Tier 1 (2-16/6): latency between CERN MSS and Tier 1
– Tier 1 – Tier 1 (1-16/6): replicate 50 TB between all Tier 1s
– Tier 1 – Tier 2 (4-9, 11-16/6): stress Tier 1 tapes, latency from Tier 1 MSS to Tier 2
• Analysis at Tier 2
– Demonstrate able to use 50% of pledged resources with analysis jobs, overlaps with MC work. Throughout June.

Progress:
• Started June 2
• CRUZET (Cosmics at 0 T) at CERN with export to Tier 1 last week (so no major Tier 0 activities)
• First STEP’09 work at Tier 0 foreseen June 6
• Reprocessing at Tier 1s started June 3
• T0→T1 transfers started June 3
• T1→T1 transfers started June 4
• Analysis: job preparation under way (June 4)


STEP’09: ALICE+LHCb

ALICE goals:
• Tier 0 – Tier 1 data replication at 100 MB/s
• Reprocessing with data recall from tape at Tier 1s

ALICE status:
• Started June 1
• 15k concurrent jobs running
• FTS transfers to start this week

LHCb goals:
• Data injection into HLT
• Data distribution to Tier 1s
• Reconstruction at Tier 1s

LHCb status:
• Will join STEP’09 this week


Resource planning – post RRB

Next slides are requirements as shown at the RRB at the end of April:

Summary      2009 req   2009 pledge   2010 req   Old 2010 req   2010 pledge
CERN CPU     164.9      131.9         254.7      238.7          213.6
CERN disk    8.78       10.07         15.67      15.03          13.4
CERN tape    22.2       25.1          34.2       44.7           43.1
T1 CPU       217.3      245.7         497.4      494.36         406.1
T1 disk      37.6       34.9          65.1       72             60.3
T1 tape      29         40.12         50.9       71.46          65.9
T2 CPU       228.1      305.3         570.4      693.5          475.8
T2 disk      22.72      22.79         48.52      36.72          35.2

Key (cell shading): Requirement >10% more/less than pledge
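The ">10% more/less than pledge" rule can be sketched in a few lines of Python. This is an illustration only: it assumes the difference is measured relative to the pledge (the slide does not say which value is the denominator), and it uses the 2009 requirement and pledge columns from the summary table.

```python
# 2009 requirement and pledge per resource, copied from the summary table.
SUMMARY_2009 = {
    "CERN CPU": (164.9, 131.9),
    "CERN disk": (8.78, 10.07),
    "CERN tape": (22.2, 25.1),
    "T1 CPU": (217.3, 245.7),
    "T1 disk": (37.6, 34.9),
    "T1 tape": (29.0, 40.12),
    "T2 CPU": (228.1, 305.3),
    "T2 disk": (22.72, 22.79),
}

def flag(requirement, pledge, threshold=0.10):
    """Classify a requirement as 'over', 'under' or 'ok' relative to the pledge.

    Assumption: the relative difference is taken with respect to the pledge.
    """
    ratio = (requirement - pledge) / pledge
    if ratio > threshold:
        return "over"
    if ratio < -threshold:
        return "under"
    return "ok"

flags = {name: flag(req, pledge) for name, (req, pledge) in SUMMARY_2009.items()}
for name, status in flags.items():
    print(f"{name:10s} {status}")
```

With these inputs only the CERN CPU requirement exceeds its 2009 pledge by more than 10%; T1 disk and T2 disk fall within the band.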


ATLAS

ATLAS        2009 req   2009 pledge   2010 req   Old 2010 req
CERN CPU     57         26.5          67         68
CERN disk    3.7        2.075         5.1        5.25
CERN tape    7.8        6.21          9.9        14.6
T1 CPU       90         120.9         227        234
T1 disk      24         19.86         36.7       41.3
T1 tape      11.3       14.72         14.8       22.7
T2 CPU       108        114           240        242
T2 disk      13.3       11.2          24.8       24.8

• Cosmic ray data in Q3 2009 will produce 1.2 PB (same as Aug-Nov 08)
• In 6x10^6 sec will collect 1.2x10^9 events: 2 PB raw
• Raw stored on disk at T1s for a few weeks
• Plan for 990M full sim events and 2200M fast sim events
• CERN request was updated last Aug and was seen by RSG

• Generally new requirements <= old requirements (except at CERN)
• Provide resource needs profile by quarter (see document)
• NB. The August 2008 request for 2009, while agreed by the RSG, has never been validated by LHCC

Requirement >10% more/less than pledge/requirement
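The ATLAS data-taking figures imply an average event rate and raw event size that can be worked out directly. The short Python sketch below does the arithmetic; it assumes decimal units (1 PB = 10^15 bytes, 1 MB = 10^6 bytes), which the slide does not specify.

```python
LIVE_SECONDS = 6e6   # live time quoted on the slide
EVENTS = 1.2e9       # events to be collected
RAW_BYTES = 2e15     # 2 PB raw, assuming 1 PB = 10^15 bytes

rate_hz = EVENTS / LIVE_SECONDS                 # average trigger rate
mb_per_event = RAW_BYTES / EVENTS / 1e6         # average raw event size
mb_per_second = RAW_BYTES / LIVE_SECONDS / 1e6  # average raw data rate

print(f"rate: {rate_hz:.0f} Hz")
print(f"raw event size: {mb_per_event:.2f} MB")
print(f"raw data rate: {mb_per_second:.0f} MB/s")
```

Under these assumptions the numbers correspond to about 200 Hz, roughly 1.7 MB per raw event, and an average of about 330 MB/s of raw data during live time.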


CMS

CMS          2009 req   2009 pledge   2010 req   Old 2010 req
CERN CPU     48.1       54.8          112.9      115.2
CERN disk    1.9        2.5           4.6        3.8
CERN tape    9.5        9.3           15.3       14.3
T1 CPU       53.5       63.7          119        139
T1 disk      6.5        8.4           14.1       15.4
T1 tape      10.5       16            21.6       23.2
T2 CPU       54.1       116           209.6      306
T2 disk      5          8.4           11.3       7.6

• Model foresees 300 Hz data taking rate ...
• ... and CPU times assume higher lumi in ’10
– recCPU: 100→200 HS06.s
– simCPU: 360→540 HS06.s

Changes:
• 3 re-reconstr in each of ’09, ’10
• 40% overlap in PD datasets
• Added storage needs for ’09 cosmics

Tier 1:
• Finish ’09 re-reco in 1 month (was spread over full year)

Tier 2:
• Require 1.5x more MC events than raw: sw changes and bug fixes
• MC events produced in 8 months (can only start after Aug ’09)

Tier 0:
• Added 1 re-reco in each year
• Capacity for express stream
• Reco to finish in 2x runtime in ’09
• Monitoring + commissioning is now 25% of total (was 10%)

Requirement >10% more/less than pledge/requirement


ALICE

ALICE        2009 req   2009 pledge   2010 req   Old 2010 req
CERN CPU     42.8       46.4          46.8       49.4
CERN disk    2.4        4.5           4.5        4.7
CERN tape    3.7        7.3           6.7        11.6
T1 CPU       42.8       40.9          102.4      94
T1 disk      4.3        3.9           9.9        12
T1 tape      5.9        6.2           11.6       19.7
T2 CPU       36         39.9          80.8       100
T2 disk      4.4        2.82          12.4       4.3

• Requests are within (or close to) existing ’09 pledges, except for Tier 2 disk
• For 2010 – don’t know actual pledge for ALICE, but generally pledges are significantly lower than requirement (so final column should be mostly pink for T1+T2!)

• Will collect p-p data at ~maximum rate: 1.5x10^9 events at 300 Hz
• Initial running will give luminosity required without special machine tuning – cleaner data for many physics topics
• First pp run energy is important in interpolating results to full Pb-Pb energy
• Thus plan to collect large statistics pp in 2009-10
• Assume 1 month Pb-Pb at end of 2010

Requirement >10% more/less than pledge/requirement


LHCb

LHCb         2009 req   2009 pledge   2010 req   Old 2010 req
CERN CPU     17         4.2           28         6.12
CERN disk    0.78       0.99          1.47       1.28
CERN tape    1.2        2.27          2.3        4.2
T1 CPU       31         20.2          49         27.36
T1 disk      2.8        2.7           4.4        3.25
T1 tape      1.3        3.2           2.9        5.86
T2 CPU       30         35.4          40         45.5
T2 disk      0.02       0.37          0.02       0.02

• CERN increase due to need for fast feedback to detector of alignment/calibration + anticipation of local analysis use
• T1 CPU increase in 2010 due to more reprocessing
• T2 requirements decrease as less overall simulation needed
• Uncertainty in running mode (pile up) adds contingency on event sizes and simulation time

• 2009: simulation with assumed running conditions
– Early data with loose trigger cuts and many reprocessing passes – alignment/calib + early physics
• 2010: several reprocessing passes and many stripping passes
– Simulation over full period

NB. Previously LHCb had presented integrated CPU needs; what is shown here is the total capacity needed in each period, as for the other experiments

Requirement >10% more/less than pledge/requirement


Resource Planning - RRB

• The Scrutiny Group also reported at the RRB:
– Essential message was that they thought that the resources pledged for 2008/2009 should be sufficient for the data taking during 2009/2010
– Discussion followed ...
• Conclusion was that the scrutiny group and the experiments, together with their LHCC referees, should discuss and come back with clarifications before the summer
• LHCC will have a mini-review on computing 9-11 July (including LCG and LHCC referees of LCG)


GDB topics – security challenge

Security service challenge (from report at April GDB):
• 2nd challenge run (following last year’s)
– Site was asked to trace a job back from WN, through CE and WMS, to the submitting UI; to ban a particular user; and to trace certain storage operations
• 9 Tier 1s tested (NIKHEF & SARA; not OSG sites and ASGC), plus Prague Tier 2, which volunteered
– 6 of 9 sites equalled or exceeded max score (bonus points were possible). One of the others and the Prague T2 scored >90%.
– Improvement for the sites previously poor to middling was considerable.
– Exception was the INFN T1 at CNAF. It took three attempts before there was any response at all, and even that was poorer than last year. The EGEE Security Officer will discuss in detail with CNAF, but the MB should be concerned at the apparent inability of this T1 to react to standard procedures.
• This test is currently being run against Tier 2s as well. UK and South-East Europe have completed, Asia Pacific and Benelux are in progress, and NDGF and OSG are preparing. The aim is to have completed all regions in time to report to the EGEE09 conference in September.


EGEE III PMB 13.5.2009 29

EGI.eu

• Location: Amsterdam, Science Park, Matrix Building IV
– Decision taken by PB in Catania
• Organizational Task Force
– Members from NCF, NIKHEF and EGI_DS
– Chair: Arjen van Rijn (NIKHEF)
– Preparation of Convention and Statutes


Memorandum of Understanding

• Identify parties ready to commit manpower and financial resources
– NGIs and EIROforum organizations
– Common Fund Administrator to deal with the financial contributions
• Defines EGI Collaboration
– An interim step towards EGI Council
– Body with authority to deal with EGI project(s) preparation
– Body with authority to assign interim EGI.eu Director and other personnel
• Released last Wednesday (6th May)
– Comment till end of May
– First round of signatures end of June
• A minimal quorum of 10 parties and 150 k€ for MoU to come into force
– First financial contributions 1st October


Letter of Intent

• To identify parties interested to sign MoU
• Released together with MoU
• Deadline for signatures 25th May
• Mostly informational, to collect preliminary interest
– However, will play role in MoU endorsement


LoI Signatures


EGI.eu Convention and Statutes

• Using MoU as the input
• Extending it to define a legal body, EGI.eu
• EGI.eu will be a Foundation under Dutch Law
– Open path towards ERIC in the future


EGI Project(s) preparation

• Leaders of Editorial Board nominated
– Laura Perini for EGI proper: EGI.eu establishment, EGI operations, …
– Cal Loomis for EGI application support
   • The idea is to encourage one project covering several scientific areas (SSCs) and their generic support
– Steven Newhouse for interim EGI.eu director
• Middleware development outside EGI Blueprint
– Discussion within the UMD task force


Schedule and Milestones

• 25th May: LoI signed
• 29th May: Next PB meeting
– MoU discussion
– Draft EGI.eu Convention and Statutes published
• 30th May: Deadline for MoU comments
• Early to mid June: Final MoU published
• 30th June: Deadline for MoU signature (first round)
• Early July: Interim EGI Council convened
– Endorsement of steps already taken
– Endorsement of Editorial Board
– Endorsement/election of EGI.eu director
– New version of EGI.eu Convention and Statutes
• 30th July: EC Call open
• October/November: EGI.eu established
• 1st October: Financial contributions to EGI Collaboration due
• 5th December: EC Call closed


WLCG and EGI

• In previous meetings we have discussed planning for the EGEE to EGI transition period (or for the case where EGI is not in place)
• Updated document (attached here):
– Includes status of NGI planning for WLCG countries (presented on May 12) in response to a number of questions posed to them (most Tier 1s + a few Tier 2-only countries)
– Also includes list of services and responsibilities (present and anticipated) and list of middleware components with responsibilities


Tier 1s were asked:

• Which services do you currently provide for WLCG (via EGEE) that you will commit to continue to support (see attached slide)? What is the level of effort you currently provide for these (separated into operation, maintenance, and development)?
• Which services will you not be able to continue to support, or where may the level of effort be significantly decreased in a way that may slow developments, bug fixes, etc.?
• What is the state of the planning for the NGI:
– Will it be in place (and fully operational!) by the end of EGEE-III?
– What is the management structure of the NGI, and how do the Tier 1 and Tier 2s fit into that structure?
– How will the effort that today is part of the ROCs (e.g. COD, TPM, etc.) for supporting the WLCG operations evolve? How will daily operations support be provided?
– Does the country intend to sign the Letter of Intent and MoU expressing the intention to be a full member of EGI?
• Which additional services could the Tier 1 offer if other Tier 1s are unable to provide them?
• Other issues particular to the country, or general problems to be addressed.
• What are the plans to maintain the WLCG service if the NGI is not in place by May 2010, or if EGI.org is not in place?
• For ASGC and Triumf it would be useful to hear their plans in the absence of EGEE ROC support, i.e. do they have plans to continue or build local support centres?
• For BNL and FNAL it is assumed that nothing will really change on the timescale of the next year.


Responses (May 12)

• UK, France, Italy, Nordic, NL
– Structures in place – expect to continue to provide existing services
• Germany
– Situation under discussion (Gauss alliance), but WLCG commitments clear
• Spain
– Structure for NGI not yet in place, but intend to fulfil Tier 1+Tier 2 service commitments
• See document for details


Outside Europe

• CERN ROC will close at end of EGEE-III
– Today supports several countries/sites outside of Europe
• Latin America
– Brazil, Mexico, Colombia propose to fund a LA-ROC to support Latin American LHC collaborators
– Supported by many other LA countries (list...)
– Will send people to CERN for training
• Asia-Pacific
– A-P ROC in Taipei will remain in much the same way as now
• Canada
– Will be self-supporting, but is also offering to set up a ROC potentially in support of other sites if necessary


CERN’s roles in EGI

CERN will participate in all aspects of the EGI:
• EGI.eu
– Hopefully as a full member with voting rights (but still some doubts)
• Specialised Support Centres (SSC)
– CERN will lead the formation of an SSC for HEP (+astroparticle?) together with other partners
– Will be of direct benefit to WLCG
• Middleware
– The gLite consortium must urgently be put in place – Letters of Intent have been signed by (almost) all key partners
– This is a minimum solution for ongoing support of software in production
– Hopefully a collaboration between gLite and ARC can eventually participate in a project proposal