WLCG Service Report [email protected] [email protected] ~~~ WLCG Management Board, 18th September 2012


WLCG Service Report

[email protected]@cern.ch~~~

WLCG Management Board, 18th September 2012


Introduction

• 3 relatively quiet weeks since the last MB report on August 28th.
• Smooth LHC operations, including the proton-ion test run. Now in technical stop.
• No Service Incident Reports received. One SIR expected:
  • Accidental deletion on EOSCMS of 1.6M files (1 PB) by an (unprivileged) CMS user. Several group-writeable areas deleted; only a minor fraction could be recovered. Permissions tightened, other preventive measures being reviewed.
• 3 real GGUS ALARMs, all at CERN:
  • 1 for CMS (SRM down), 2 for ATLAS (slow LSF; slow migration to tape).
• Many other issues reported at the daily meetings, most notably:
  • Ongoing issues with the Alcatel audioconf system. On average, one remote user per day has been unable to connect to the meeting for the last two weeks. Under investigation (INC:158097); seems (at least partly) browser-related.
  • Oracle security patch installations. Also CASTOR upgrades and NAS migration.
  • Constant rate of aborted LHCb pilots for one week due to CERN batch issues.
  • SRM overload for ATLAS at PIC, related to the ATLAS deletion policy.
  • Bug in CVMFS stratum ones at CERN, affecting mainly LHCb.
  • Storage issues in Denmark due to power supply problems.


GGUS summary (3 weeks)

VO       User  Team  Alarm  Total
ALICE       5     0      0      5
ATLAS      14   103      2    119
CMS         7     4      1     12
LHCb        5    31      0     36
Totals     31   138      3    172
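
As a cross-check of the summary table, the row and column totals can be recomputed from the per-VO counts. The following is only a minimal sketch: the counts are copied from the table, while the dictionary layout and the helper names (`row_totals`, `column_totals`) are illustrative assumptions, not part of any GGUS tooling.

```python
# Ticket counts per VO, copied from the GGUS summary table.
# Column order: (user, team, alarm). This layout is an illustrative assumption.
tickets = {
    "ALICE": (5, 0, 0),
    "ATLAS": (14, 103, 2),
    "CMS": (7, 4, 1),
    "LHCb": (5, 31, 0),
}


def row_totals(counts):
    """Total number of tickets per VO (sum across ticket types)."""
    return {vo: sum(row) for vo, row in counts.items()}


def column_totals(counts):
    """Total number of tickets per type (sum across VOs)."""
    return tuple(sum(col) for col in zip(*counts.values()))


per_vo = row_totals(tickets)
user, team, alarm = column_totals(tickets)
grand_total = sum(per_vo.values())

print(per_vo)
print(user, team, alarm, grand_total)
```

The recomputed values agree with the "Total" column and the "Totals" row of the table.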


Support-related events since last MB

• There have been 3 real ALARMs since the 2012/08/28 MB.
• They were submitted by ATLAS (2) and CMS (1).
• The site for all tickets was CERN.
• There have been no GGUS releases since the last MB due to summer holidays. The next one is planned for 2012/09/26.


CMS ALARM->CERN SRM DOWN GGUS:85530

What time UTC     What happened
2012/08/27 14:07  GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Storage Systems.
2012/08/27 14:12  Operator records that the CASTOR piquet was called.
2012/08/27 14:16  Expert records in the ticket that there is an overload of 400 pending transfers in the c2cms/t1transfer queue.
2012/08/30 12:38  10 comments exchanged between shifters, service experts and IT/ES CMS supporters. Excessive PhEDEx activity to Caltech was temporarily thought to be the cause of the problem, but this was not the case.
2012/08/31 08:38  Ticket set to ‘solved’ with the conclusion that test transfers had caused the overload. Automatic ticket closing took place after 3 working days.


ATLAS ALARM->CERN SLOW LSF GGUS:85556

What time UTC     What happened
2012/08/28 11:37  GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Local Batch Systems.
2012/08/28 11:49  The operator records in the ticket that the it-dep-pes-pes-sms e-group was informed.
2012/08/28 11:51  Service expert starts investigating. Heavy queries were found that slow down job submission.
2012/08/30 06:56  Ticket set to ‘solved’ & ‘verified’ after an exchange of 7 comments between supporters and shifters. Over 1.5 days of supervision, the monitoring plots showed occasional spikes from bursts of job submission, without any real problem.


ATLAS ALARM->CERN TAPE MIGRATION PROBLEM GGUS:85704

What time UTC     What happened
2012/08/31 23:12  (SATURDAY here already.) GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Storage Systems.
2012/08/31 23:26  The operator records in the ticket that the CASTOR piquet was informed.
2012/09/04 09:32  The service manager and multiple shifters exchanged 40 comments throughout the night, the whole of Saturday and Monday. The problem was that the inaccessible files were on a broken server that required vendor intervention, followed by a long time required to mount the disk.
2012/09/04 09:36  Service expert sets the ticket to ‘solved’ after reconfiguration of the tape system. Ten hours later, the shifter of the day set it to status ‘verified’.
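
For comparison across the three ALARM tickets, the elapsed time between ticket opening and the ‘solved’ status can be derived from the UTC timestamps in the timelines. This is a rough sketch only: the timestamps are copied from the slides, the helper name `time_to_solve` is a made-up illustration, and the result is ticket wall-clock time, which does not necessarily equal the duration of the service degradation.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M"

# (opened, solved) UTC timestamps as listed in the three ALARM timelines.
alarms = {
    "GGUS:85530": ("2012/08/27 14:07", "2012/08/31 08:38"),
    "GGUS:85556": ("2012/08/28 11:37", "2012/08/30 06:56"),
    "GGUS:85704": ("2012/08/31 23:12", "2012/09/04 09:36"),
}


def time_to_solve(opened, solved):
    """Elapsed wall-clock time between ticket opening and 'solved' status."""
    return datetime.strptime(solved, FMT) - datetime.strptime(opened, FMT)


durations = {ticket: time_to_solve(*stamps) for ticket, stamps in alarms.items()}

for ticket, delta in durations.items():
    print(ticket, delta)
```

All three tickets were acknowledged by the operator within about a quarter of an hour, so the multi-day spans mostly reflect investigation and follow-up rather than response time.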


[Figure: reliability plots, annotated with the labels 2.1, 2.2 and 3.1]


Analysis of the reliability plots: Week of 27/08/2012

• ATLAS:
  • 2.1 BNL: Some transfer errors from T0 to BNL. A few percent of the transfers timed out.
  • 2.2 SARA-MATRIX: Some inconsistency between dCache and BDII was noticed, [SE][StatusOfPutRequest][SRM_NO_FREE_SPACE].
• CMS:
  • 3.1 IN2P3: The site was busy and the MC tests didn’t run due to their lower priority compared to the production activity.


[Figure: reliability plots, annotated with the labels 1.1, 3.1 and 3.2]


Analysis of the reliability plots: Week of 03/09/2012

• ALICE:
  • 1.1 RAL 04/09 [Green]: Not a site problem – bug detected in SAM test. See GGUS #85794.
• CMS:
  • 3.1 IN2P3 05/09-08/09: CREAMCE-JobSubmit tests intermittently failing against cccreamceli05 & 06 with timeouts. No downtime registered; no relevant Savannah tickets found.
  • 3.2 CNAF 09/09: Site problem with STORM storage element. See GGUS #85953 and Savannah #131937.


[Figure: reliability plots, annotated with the labels 0.1, 3.1, 3.2 and 3.3]


Analysis of the reliability plots: Week of 10/09/2012

• Common:
  • 0.1 SARA: CREAMCE - JobSubmit test was failing due to timeouts and errors while loading glite libraries.
• CMS:
  • 3.1 ASGC: CREAMCE - failures of the Software-Installed test.
  • 3.2 ASGC: SRM - failures of the VOPut test.
  • 3.3 CNAF: SRM - intermittent failures of the VOPut test due to data being moved from disk, which causes delays; see GGUS & SAV tickets.


Conclusions

• Business as usual – relatively quiet period
• Ongoing Alcatel issue preventing users from connecting
