WLCG Service Report
[email protected], [email protected], [email protected]
~~~
WLCG Management Board, 26th October 2010
Covers the 2-week period from 11th October, but includes the GGUS summary by Maria Dimou from 27 September (the report was skipped at the last MB)
Introduction
• Generally smooth operation on experiment and service side.
• Some site availability issues - commented on directly in the availability plots.
• 4-day technical stop advanced from November to 19 October.
  • Some complaints about the short notice and intervention rescheduling, but the agreement is not to strictly coordinate scheduled interventions with technical stops.
• Service Incident Reports received:
  • CNAF CMS storage down
  • CERN LSF degradation
• No new SIRs awaited.
GGUS summary (4 weeks)

VO       User   Team   Alarm   Total
ALICE       5      0       1       6
ATLAS      28    215       6     249
CMS         6      1       1       8
LHCb        2     38       1      41
Totals     41    254       9     304
Support-related events since last MB
• We need WLCG shifters, alarmers and management to give us meaningful values for the GGUS 'Problem Type' field, in order for periodic reporting to better show weak areas in support.
• GGUS:61440 (CNAF-BNL network problem) re-opened by ATLAS until the network problem is fully understood.
• EMI insists on changing the GGUS supporters' privileges, such that assignment to middleware-related Support Units (SUs) be only possible by the EGI DMSU (Deployed Middleware SU). Although this matches the 'Service Desk' spirit, it might slow things down. As we no longer have the USAG, we need the WLCG community's input offline a.s.a.p.
• There were 9 ALARM tickets since the Sept 28th MB (4 weeks), 5 of which were real, all submitted by ATLAS. No ALARMs since the Oct 12th MB (where the WLCG report was not given). Details follow…
ATLAS ALARM->CERN-CNAF TRANSFERS
•https://gus.fzk.de/ws/ticket_info.php?ticket=62761
Time (UTC)         What happened
2010/10/05 09:13   GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_Italy.
2010/10/05 10:23   Site acknowledges ticket and finds a StoRM backend problem.
2010/10/05 12:03   Service restored. Site puts the ticket to 'solved' and refers to GGUS:62745 for details.
2010/10/11 09:48   Submitter of ticket GGUS:62745 sets status 'verified'. No explanation in either ticket of what the problem, diagnosis or solution actually was…
ATLAS ALARM->TRANSFERS TO .FR CLOUD
•https://gus.fzk.de/ws/ticket_info.php?ticket=62871
Time (UTC)         What happened
2010/10/08 05:56   GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to NGI_France.
2010/10/08 06:31   Site acknowledges ticket and finds a network problem preventing all DB server access.
2010/10/08 07:29   Service restored.
2010/10/08 10:41   Site puts ticket to status 'solved'.
2010/10/14 08:39   Submitter sets the ticket to status 'verified'.
ATLAS ALARM-> CERN SLOW LSF
•https://gus.fzk.de/ws/ticket_info.php?ticket=62467
Time (UTC)         What happened
2010/09/27 15:34   GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN.
2010/09/27 16:01   Operator acknowledges ticket and contacts the expert.
2010/09/27 16:37   Expert's first diagnosis: too many queries.
2010/09/27 20:10   Service manager kills a home-made robot from another experiment that was launching very large numbers of bjobs queries, and puts the ticket to status 'solved' (see the sketch below).
2010/09/28 12:21   Submitter sets ticket to status 'verified'.
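As background on why mass bjobs polling hurts: every bjobs invocation is a separate query against the central mbatchd daemon, so per-job polling scales with the number of jobs. A purely illustrative shell sketch of the gentler bulk-polling pattern (the script, cache path and interval are assumptions, not taken from the ticket):

    #!/bin/sh
    # Hypothetical monitoring loop: one bulk bjobs query per cycle,
    # written to a shared cache, instead of one query per job.
    while true; do
        bjobs -u all -a > /tmp/bjobs.cache.$$ 2>/dev/null
        mv /tmp/bjobs.cache.$$ /tmp/bjobs.cache    # readers parse the cache
        sleep 300                                  # poll every 5 minutes
    done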
ATLAS ALARM-> CERN SLOW AFS
•https://gus.fzk.de/ws/ticket_info.php?ticket=62662
Time (UTC)         What happened
2010/10/01 07:13   GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN.
2010/10/01 07:33   Operator acknowledges ticket and contacts the expert.
2010/10/01 09:37   IT service manager re-classifies the ticket in CERN Remedy PRMS.
2010/10/11 15:33   Still 'in progress'. Reminder sent during this drill.
2010/10/25 15:56   Still 'in progress'. No reaction to the Oct 11th reminder.
ATLAS ALARM-> CERN CASTOR
•https://gus.fzk.de/ws/ticket_info.php?ticket=62688
Time (UTC)         What happened
2010/10/01 16:24   GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN.
2010/10/01 16:41   Operator acknowledges ticket and contacts the expert.
2010/10/01 16:42   Expert starts investigation.
2010/10/01 17:23   Solved. The PUT DONE in SRM had not been propagated to CASTOR; it was done by hand.
2010/10/01 17:45   Submitter sets the ticket to 'verified'. Shifter added a cross-reference to GGUS:62705.
Analysis of the availability plots (week of Oct 11)
ATLAS
1.1 IN2P3: SRM lcg-cp (time out after 600 seconds) and lcg-del (internal communication error) test failures. On the 14th of October, the SRMv2-ATLAS-lcg-cp test was timing out after 600 seconds.
1.2 SARA: CE-sft-job failures: 'Job failed, no reason given by GRAM server'. Downtime announced only for the CREAM CE, not for the CE.
ALICE
2.1 FZK: Occasional VOBOX User Proxy Registration test failures.
2.2 NIKHEF (green box): CE tests were failing. SAM CE tests should be ignored as ALICE is using the CREAM CE. There are no CREAM CE direct submission tests for ALICE yet (work in progress).
CMS
NTR (nothing to report).
LHCb
4.1 IN2P3: CE sft-vo-swdir test was failing (took more than 150 seconds while 60 seconds are expected). Software installation tests failed. User jobs were also reported as having failed due to issues with the software area.
4.2 RAL: SRM test failures: SRM lcg-cr (connection timeout, GGUS:62893, related to GGUS:62829, a CASTOR DB performance issue) and Dirac (invalid argument) test failures.
4.3 SARA: CE-LHCb-availability test was failing on Monday the 11th. CE-sft-job failures: 'Job failed, no reason given by GRAM server'. Downtime announced only for the CREAM CE, not for the CE. Ron Trompert at SARA was notified and forwarded this issue to the appropriate person. Update from the MB meeting: there was in fact an independent PBS problem that was affecting the CE.
Main observations for the week of Oct 11 (1/2)
• CNAF-BNL network issue
  • Slow progress on GGUS:61440; transfer rates increased without full understanding of whether the limitations are in hardware or inherent, after the GEANT capacity increase and Force10 router 'shaping'. Suggestion to take the path out of service for rigorous testing.
  • Latest: rerouting the Amsterdam to Vienna link gave a dramatic performance benefit.
  • New ticket GGUS:63134 opened Oct 15: files > 2 GB hit transfer timeouts.
• CERN xrootd errors for ATLASHOTDISK and CMS t0export
  • Both related to load/timeout issues; fixes have been deployed.
• IN2P3 storage instabilities, 133k ATLAS files inaccessible
  • Files not properly migrated - stuck in storage. Finally closed on the 19th.
• CNAF: tail of CMS storage unavailability due to a GPFS bug
  • Incident from 6 to 10 October - SIR provided.
• LHCb Conditions DB access from any WN at any T1
  • Port opening issue - to be further discussed during the 3D workshop, Nov 16-17.
• OPN: who should handle issues with Tier-1 links to CERN?
  • https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#Responsibilities
  • CERN experts will kindly inform the Tier-1 when they happen to notice issues.
Main observations for the week of Oct 11 (2/2)
• Issues with the new GOCDB
  • AuthZ problem for downtimes, fixed.
  • Downtime summary no longer available; RFE accepted.
• CASTOR
  • Upgrades to 2.1.9-9 planned for ALICE and CMS in preparation for the HI run - done on Oct 19.
  • 1.2 PB added for the CMS HI run (ALICE done already).
  • ATLAS users circumvented ACLs via castor-public; route closed Oct 18.
• CMS downtime calendar: some downtimes not seen
  • Possibly after having been modified - CIC Portal out of sync?
• CMS again see a high rate of Maradona errors for jobs at CERN
  • LSF hotfix requested from Platform - will be rolled out this week.
• ALICE user jobs killing WNs at various sites
  • User job memory limits will be included in the upcoming AliEn update.
Analysis of the availability plots (week of Oct 18)
ATLAS: NTR.
ALICE: NTR.
CMS: NTR.
LHCb
4.1 GRIDKA: SRM DiracUnitTest failures on the 23rd: 'No space left on the device'. The disk space was full, but the test threshold was a bit unrealistic; it has been properly adjusted by Roberto.
4.2 IN2P3: Problem during software installation - CE tests were failing: install, job-Boole, job-DaVinci, job-Brunel, job-Gauss and sft-vo-swdir (took more than 150 seconds while 60 seconds are expected).
4.3 PIC: CE-sft-vo-swdir (took more than 150 seconds while 60 seconds are expected) and SRM-lhcb-FileAccess (error in a ROOT macro opening/reading a file) test failures from time to time.
Main observations for the week of Oct 18 (1/2)
• Technical stop advanced from November week 1 to 19-22 October, mainly to fix a problem with beam 1 injection. The very short notice caused some grumbles.
• Better availability statistics while no raw data was flowing.
• CERN annual power tests also moved forward from 1 November.
• PIC BDII stopped publishing overnight on 18/19 October. Such sites are not shown within the SAM framework. To be followed up.
• ATLAS found 4 UK Tier-2 sites intermittently failing gLite 3.2 FTS transfers. A complicated investigation first concluded there was a proxy delegation bug, before leading to the real problem: the sites were not accepting credentials issued by the BNL VOMS server. Some 11 other Tier-2 sites were subsequently found to be affected.
• Two new 'easy to use' SLC5 root exploits discovered - both fixable without service interruption.
Main observations for the week of Oct 18 (2/2)
• Two operational problems due to end-users:
  • RAL-LCG2 SRM problems due to an ATLAS user pulling data from RAL to an NDGF cluster.
  • ALICE user jobs filling memory on RAL worker nodes.
• ATLAS wish to escalate GGUS:61106, a middleware issue: the binaries from the gLite 3.2.7 WN tarball fail immediately on SUSE 10 with a floating point exception. They saw the same thing in the ATLAS SW externals area a while ago, and Stefan Roiser fixed it by compiling the libraries with an option to ensure they used SysV hashes (a linker sketch follows at the end of this slide). This stops ATLAS from using the German MPP and LRZ Tier-2 sites. The initial response from the developers is that SUSE is not a supported OS. First raised in August.
• LHCb wish to escalate GGUS:63234, an issue with IN2P3: shared software area problem, with jobs timing out while setting up the environment.
• The ALICE and CMS simulated Tier-0 HI data taking on Thursday/Friday caused no operational problems.
• New issue this week: CMS heavily use an unsupported ccrc08 application that has lost its interactivity. See GGUS:63450.
  • GridMap: http://servicemap.cern.ch/ccrc08/servicemap1.html
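For reference, a minimal sketch of the kind of fix described above for GGUS:61106. The ticket only says 'some option', so naming the GNU ld flag --hash-style=sysv here is an assumption, and libexample/example.c are hypothetical names:

    # Hypothetical example: build a shared library carrying the classic
    # SysV .hash section, which old dynamic linkers (e.g. on SUSE 10)
    # require; gnu-hash-only binaries can die there with an FPE.
    gcc -shared -fPIC -Wl,--hash-style=sysv -o libexample.so example.c

    # Sanity check: the library should now contain a .hash section,
    # not only .gnu.hash.
    readelf -S libexample.so | grep -i hash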
CNAF CMS STORAGE DOWN SIR
• CMS has a single T1D0 filesystem served by 8 servers, 2 gridftp servers and 2 metadata servers in a dedicated GPFS cluster. October 1-3 the CMS filesystem was moved from server DDN-3 to DDN-4 & DDN-5.
• Oct 6 13:00 CET: started deletion of the old metadata disks (i.e. on DDN-3) from the filesystem with a 'wrong' (not useful but perfectly safe) mmdeldisk option.
• This triggered a GPFS bug: GPFS crashed on the CMS cluster and the CMS filesystem hung.
• Oct 6 19:00: IBM suggests upgrading the whole cluster to 3.2.1-23 ('known bug in previous release'), but the problem is still there…
• Oct 8 morning: further diagnostic files sent to IBM support.
  • Not convinced by IBM support's analysis: other investigations throughout the day.
• Oct 8 evening: decision to follow IBM support's suggestion - mmdeldisk still fails.
• Oct 10 morning: new strategy: re-add the old metadata disks and force remigration (a command sketch follows below).
  • Oct 10 16:00: subsequent disk deletion successful.
• Oct 10 23:49: recovery of the deleted files from tape completed.
• Follow up:
  • We are now waiting for a complete analysis from IBM, and also for an 'ifix' to correct the bug.
  • The long recovery time could have been minimized by simply scratching the disk (all data are on tape) and recreating the filesystem, but in that case we would not have understood the problem.
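For readers unfamiliar with GPFS administration, a rough sketch of the recovery strategy described above. The filesystem and disk names are hypothetical and the disk descriptor syntax is simplified; the SIR does not give the exact commands or options used:

    # Hypothetical names: filesystem 'cmsfs', old DDN-3 metadata disks 'md3a'/'md3b'.
    # Step 1 (Oct 10 morning): re-add the old metadata disks to the filesystem
    # (real invocations need full disk descriptors / stanzas, omitted here).
    mmadddisk cmsfs "md3a;md3b"

    # Step 2: force remigration of data and metadata by restriping the filesystem.
    mmrestripefs cmsfs -r

    # Step 3 (Oct 10 16:00): retry the deletion, which now succeeds;
    # mmdeldisk itself migrates remaining data off the disks being removed.
    mmdeldisk cmsfs "md3a;md3b"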
Severely degraded response from CERN Batch Service
• From 05:26 until 10:00 on 20 October the batch service was severely degraded: the mbatchd daemon was crashing every 10-30 minutes, resulting in poor response times for (or rejection of) job submissions and job status queries. Job dispatch to, and status updates from, the worker nodes were also impacted, which slowed the rate at which jobs started on the worker nodes. No jobs were lost.
• 09:36 Priority 1 ticket opened with the vendor to investigate the problem [1-100387964]
• 09:40 Vendor starts to investigate
• 09:59 Last daemon crash
• 10:33 Vendor provides fix; deployed
• The details of the problem remain to be understood with the vendor. The fix appears to have addressed the problem.
• There were no changes on the production system that correlate with this problem.
• Follow up:
  • Reported to the WLCG operations meeting on 20 Oct 2010.
  • Understand more details of the root cause [in Platform ticket 1-100387964].
  • 3 pending fixes for other issues (the CMS Maradona problem, misc scaling issues) will be combined and delivered as an official hotfix for deployment on the batch service. trac:182