GGUS Slides for the 2012/07/24 MB

• Drills cover the period of 2012/06/18 (Monday) until 2012/07/12 given my holiday starting the following weekend.

• Remove this slide and let me know if drills are missing and should be prepared for a future MB.

Thank You! MariaDZ

1 04/21/23 WLCG MB Report WLCG Service Report

GGUS summary (5 weeks)

VO | User | Team | Alarm | Total
ALICE | | | |
ATLAS | | | |
CMS | | | |
LHCb | | | |
Totals | | | |

To calculate the totals for this slide and copy/paste the usual graph for the 2012/07/24 MB, please:

1. Take the summary from the tables at https://ggus.eu/download/wlcg_metrics/html/20120716_escalationreport_wlcg.html and https://ggus.eu/download/wlcg_metrics/html/20120723_escalationreport_wlcg.html

2. Make a local copy of the file https://twiki.cern.ch/twiki/pub/LCG/WLCGOperationsMeetings/ggus-tickets.xls

3. Add 2 more lines to it from the escalation reports above. Add up the last 5 weeks, i.e. starting from the 25-Jun line, and put the totals in this table.

4. Copy/paste the updated graph from the .xls file of step 2 here, in place of these instructions.
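
The addition in step 3 is a column-wise sum over the last five weekly rows. A minimal Python sketch of that arithmetic, assuming the weekly rows for one VO have already been copied out of the spreadsheet into a list; the field names and all numbers below are placeholders for illustration, not figures from the escalation reports:

```python
from typing import Dict, List

# Hypothetical field names for the three ticket categories on the slide.
COLUMNS = ("user", "team", "alarm")


def sum_last_weeks(rows: List[Dict[str, int]], weeks: int = 5) -> Dict[str, int]:
    """Add up each ticket column over the last `weeks` weekly rows."""
    if len(rows) < weeks:
        raise ValueError(f"need at least {weeks} weekly rows, got {len(rows)}")
    recent = rows[-weeks:]
    totals = {col: sum(row[col] for row in recent) for col in COLUMNS}
    totals["total"] = sum(totals[col] for col in COLUMNS)
    return totals


# Placeholder weekly rows for one VO, oldest first (NOT real GGUS data).
example_rows = [
    {"user": 9, "team": 9, "alarm": 9},  # older week, outside the 5-week window
    {"user": 1, "team": 2, "alarm": 0},
    {"user": 0, "team": 3, "alarm": 1},
    {"user": 2, "team": 1, "alarm": 0},
    {"user": 1, "team": 0, "alarm": 1},
    {"user": 0, "team": 4, "alarm": 0},
]

print(sum_last_weeks(example_rows))
```

Running the same function once per VO, and once more over the per-VO results, would fill the rows of the summary table, including the Totals line.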

Support-related events since last MB

• There have been 12+ real ALARMs since the 2012/06/19 MB.

• All were submitted by ATLAS, CMS & LHCb.

• Sites for all these tickets were CERN, IN2P3, FZK, PIC, SARA.

• There have been 2 GGUS Releases since the last MB:

  • On 2012/06/25: specifically on new Reporting Tools.

  • On 2012/07/09: all other dev. items.

ATLAS ALARM->CERN CASTOR PROBLEM GGUS:83360

Time (UTC) | What happened
2012/06/18 15:42 | GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Storage Systems.
2012/06/18 15:42 | Expert records work started.
2012/06/18 15:48 | Operator records that the expert is already working.
2012/06/18 15:50 | Expert records there was a configuration error. ITSSB is updated and fixing started.
2012/06/18 16:27 | Ticket set to ‘solved’ after configuration change and propagation. 4 more comments were exchanged because the problem persisted for some nodes that appeared to be under maintenance in the CASTOR monitor and had not received the new config. Problem really solved at 18:05 hrs.
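
The gap between the ‘solved’ status and the real fix in this drill can be put in numbers. A small Python sketch using only the timestamps recorded above for GGUS:83360; the helper name `minutes_between` is made up for this illustration:

```python
from datetime import datetime

# Timestamp layout used in the drill tables, e.g. "2012/06/18 15:42".
FMT = "%Y/%m/%d %H:%M"


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two drill timestamps (both UTC)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


opened = "2012/06/18 15:42"         # ALARM ticket opened
set_solved = "2012/06/18 16:27"     # ticket status set to 'solved'
really_solved = "2012/06/18 18:05"  # problem really solved

print(minutes_between(opened, set_solved))     # 45
print(minutes_between(opened, really_solved))  # 143
```

The ticket status reached ‘solved’ 45 minutes after opening, while the service was fully restored after 2 hours 23 minutes.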

ATLAS ALARM->CERN LSF SCHEDULING GGUS:83362

Time (UTC) | What happened
2012/06/18 15:50 | GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Local Batch Systems.
2012/06/18 16:02 | Operator’s acknowledgment and email to …pes-sms…
2012/06/18 16:19 | Service mgr starts work.
2012/06/18 16:38 | The ticket is ‘solved’ because the LSF problem was a side-effect of the CASTOR problem of the previous slide.

ATLAS ALARM-> FZK FTS TRANSFER ERRORS GGUS:83367

Time (UTC) | What happened
2012/06/18 23:16 | GGUS TEAM ticket opened, automatic email notification to [email protected] AND automatic assignment to NGI_DE. Type of Problem: File Transfer.
2012/06/19 01:18 | Increased to “Top Priority”, followed by ticket conversion to ALARM 10 mins later as the transfer failure rate increases.
2012/06/19 05:46 | A CMS comment! They have the same problem!
2012/06/19 09:02 | The ticket is ‘solved’ after finding a disk issue that needed a log partition cleanup on an FTS host. Both experiments agree the problem is gone.

ATLAS ALARM->CERN LSF SLOW RESPONSE GGUS:83375

Time (UTC) | What happened
2012/06/19 07:43 | GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Local Batch Systems.
2012/06/19 07:51 | Operator’s acknowledgment and email to …pes-sms…
2012/06/19 07:55 | Service mgr starts work.
2012/06/20 15:42 | The ticket is ‘solved’ because the problem went away. Platform was supposed to get back with a diagnostic, but once the ticket was set to ‘verified’ no further update was possible, so we never learned the cause of the problem.

ATLAS ALARM->IN2P3 SW SRC PROBLEM VIA CVMFS GGUS:83517

Time (UTC) | What happened
2012/06/24 08:30 (Sunday) | GGUS TEAM ticket opened, automatic email notification to [email protected] AND automatic assignment to NGI_FRANCE. Type of Problem: Middleware.
2012/06/25 07:37 | Ticket upgraded to ALARM after 2 comments with all WNs where 100% of the jobs failed. Email sent to [email protected]. Automatic acknowledgment recorded immediately afterwards.
2012/06/25 08:21 | Sys. admins investigate (cvmfs cache problem).
2012/06/25 11:16 | The ticket is ‘solved’ after changing the logrotate policy to reduce the logs. As the ticket was then set to ‘verified’, no further update was possible, so we never learned why the sharp increase in connections led to such fast growth of the log files.

ATLAS ALARM-> SARA SRM CONTACT PROBLEM GGUS:83523

Time (UTC) | What happened
2012/06/24 19:57 (Sunday) | GGUS TEAM ticket opened, automatic email notification to [email protected] AND automatic assignment to NGI_NL. Type of Problem: Storage Systems.
2012/06/24 20:21 | Ticket upgraded to ALARM as the SRM layer appeared broken. Email sent to [email protected]. Automatic acknowledgment recorded immediately afterwards.
2012/06/25 05:54 | Service mgr restarted srm.
2012/06/27 14:47 | The ticket is ‘solved’ after exchanging 16 comments to understand the cause, which seemed to be the recent dCache upgrade to v.2.2.1. Moving the srm to new hardware didn’t help but re-indexing the DB did.

ATLAS ALARM->CERN VOATLAS SERVERS DOWN GGUS:83705

Time (UTC) | What happened
2012/06/29 06:33 | GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Other.
2012/06/29 06:34 | Grid services’ expert informs the submitter that there is a power cut in the CC, published on the ITSSB.
2012/06/29 06:40 | Operator also records that there are many problems due to the power cut.
2012/06/29 12:01 | The ticket is set to ‘verified’ after the services got back at 08:26 and the solution was recorded at 11:55.

LHCB ALARM->CERN MISSING DATA ON DISK GGUS:83713

Time (UTC) | What happened
2012/06/29 11:37 | GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Storage Systems.
2012/06/29 11:38 | Storage expert informs the submitter that after the power cut in the CC earlier that day, not all servers have recovered yet.
2012/06/29 11:45 | Operator also records that the CASTOR piquet was called.
2012/06/29 16:20 | Ticket set to ‘solved’ at 16:20 when all servers came back to production. SLS was showing all was fine even though this was only partially true. The reason was that the monitoring process checks the availability of only a necessary and sufficient subset of nodes.
2012/07/04 07:45 | The ticket was ‘re-opened’ and eventually re-‘solved’ & ‘verified’ following experiment complaints when files were found missing. The reason was that a machine was still unreachable. It came back after a vendor call.
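
The SLS observation in this drill, a green status while one machine was still unreachable, is what a check over a subset of nodes can produce. A purely illustrative Python sketch, assuming a hypothetical monitor that probes a fixed sample of nodes; the node names and the function are invented here and do not describe how SLS is implemented:

```python
from typing import Dict, Iterable


def sampled_availability(node_up: Dict[str, bool], probed: Iterable[str]) -> float:
    """Fraction of the probed nodes that respond (hypothetical monitor)."""
    probed = list(probed)
    up = sum(1 for name in probed if node_up[name])
    return up / len(probed)


# Invented node states: one machine is still down after the power cut.
node_up = {"node01": True, "node02": True, "node03": False, "node04": True}

# A monitor that probes only two nodes reports full availability...
print(sampled_availability(node_up, ["node01", "node02"]))  # 1.0

# ...while a check over every node shows the missing machine.
print(sampled_availability(node_up, node_up))  # 0.75
```

Files held only on the unprobed, unreachable machine stay missing even though the sampled status looks fine, which matches the experiment complaints that led to the ticket being re-opened.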

CMS ALARM->CERN VOCMS203 WEB SERVICE PROBLEM GGUS:83726

Time (UTC) | What happened
2012/06/30 07:41 (Saturday) | GGUS TEAM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Other.
2012/06/30 10:37 | Escalated and soon afterwards upgraded to ALARM.
2012/06/30 10:52 | Operator records that the problem is known and the piquet has already sent mail suggesting copying the data because the disk is scheduled for replacement.
2012/07/02 10:04 | Various CMS ALARMers submitted 6 comments in the ticket trying to get any news on progress of this.
2012/07/03 09:09 | Ticket set to ‘solved’ after fixing the hardware problem.

ATLAS ALARM->PIC TRANSFERS FROM CERN FAIL GGUS:83923

Time (UTC) | What happened
2012/07/06 09:31 | GGUS TEAM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: File Transfer.
2012/07/06 09:36 | Site mgrs record in the ticket a known network problem in the dCache pools. LHCb opened a similar ticket on the matter.
2012/07/06 11:25 | Transfer failure rate keeps increasing. Ticket upgraded to ALARM. Email sent to [email protected].
2012/07/06 16:26 | The ticket is set to ‘solved’ after reducing the timeout and increasing the queue size. Supporters and submitters observed the service recovering for 2 days before ‘verify’ing the ticket.

ATLAS ALARM->CERN SLOW LSF GGUS:83947

Time (UTC) | What happened
2012/07/07 07:27 (Saturday) | GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Local Batch Systems.
2012/07/07 11:22 | The same problem was reported by CMS via ALARM GGUS:83948. No operator acknowledgment was recorded in these 2 tickets, due to the invalid email addresses used [email protected]. Submitters provided debug info about jobs appearing to ‘run’ on lost-and-found machines. Service mgr applied recently received hot fixes. 7 comments exchanged.
2012/07/07 20:12 | The ticket is set to ‘solved’ and ‘verified’ the next day. Similar process for the CMS ALARM on this issue.