WLCG Service Report
[email protected], [email protected], [email protected]
~~~
WLCG Management Board, 26th October 2010
Covers the 2-week period from 11th October, but includes the GGUS summary by Maria Dimou from 27 September (the report was skipped at the last MB)
Introduction
• Generally smooth operation on experiment and service side.
• Some site availability issues - commented on directly in the availability plots.
• 4-day technical stop advanced from November to 19 October.
  • Some complaints about the short notice and intervention rescheduling, but the agreement is not to strictly coordinate scheduled interventions with technical stops.
• Service Incident Reports received:
  • CNAF CMS storage down
  • CERN LSF degradation
• No new SIRs awaited.
GGUS summary (4 weeks)

VO       User   Team   Alarm   Total
ALICE       5      0       1       6
ATLAS      28    215       6     249
CMS         6      1       1       8
LHCb        2     38       1      41
Totals     41    254       9     304
Support-related events since last MB
• We need WLCG shifters, alarmers and management to give us meaningful values for the GGUS 'Problem Type' field, in order for periodic reporting to better show weak areas in support.
• GGUS:61440 (CNAF-BNL network problem) re-opened by ATLAS until the network problem is fully understood.
• EMI insists on changing the GGUS supporters' privileges, such that assignment to middleware-related Support Units (SUs) be only possible by the EGI DMSU (Deployed Middleware SU). Although this matches the 'Service Desk' spirit, it might slow things down. As we no longer have the USAG, we need the WLCG community's input offline a.s.a.p.
• There were 9 ALARM tickets since the Sept 28th MB (4 weeks), 5 of which were real, all submitted by ATLAS. No ALARMs since the Oct 12th MB (where the WLCG report was not given). Details follow…
ATLAS ALARM->CERN-CNAF TRANSFERS
•https://gus.fzk.de/ws/ticket_info.php?ticket=62761
Time (UTC)         What happened
2010/10/05 09:13   GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_Italy.
2010/10/05 10:23   Site acknowledges ticket and finds a StoRM backend problem.
2010/10/05 12:03   Service restored. Site puts the ticket to 'solved' and refers to GGUS:62745 for details.
2010/10/11 09:48   Submitter of ticket GGUS:62745 sets status 'verified'. No explanation in either ticket of what the problem, diagnosis or solution actually was…
ATLAS ALARM->TRANSFERS TO .FR CLOUD
•https://gus.fzk.de/ws/ticket_info.php?ticket=62871
Time (UTC)         What happened
2010/10/08 05:56   GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to NGI_France.
2010/10/08 06:31   Site acknowledges ticket and finds a network problem preventing all DB server access.
2010/10/08 07:29   Service restored.
2010/10/08 10:41   Site puts ticket to status 'solved'.
2010/10/14 08:39   Submitter sets the ticket to status 'verified'.
ATLAS ALARM-> CERN SLOW LSF
•https://gus.fzk.de/ws/ticket_info.php?ticket=62467
Time (UTC)         What happened
2010/09/27 15:34   GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN.
2010/09/27 16:01   Operator acknowledges ticket and contacts the expert.
2010/09/27 16:37   Expert's first diagnosis: too many queries.
2010/09/27 20:10   Service manager kills a home-made robot from another experiment that was launching very large numbers of bjobs queries, and puts the ticket to status 'solved' (see the sketch below).
2010/09/28 12:21   Submitter sets ticket to status 'verified'.
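As background on why mass bjobs polling hurts: every bjobs invocation is a separate query against the central mbatchd daemon, so per-job polling scales with the number of jobs. A purely illustrative shell sketch of the gentler bulk-polling pattern (the script, cache path and interval are assumptions, not taken from the ticket):

    #!/bin/sh
    # Hypothetical monitoring loop: one bulk bjobs query per cycle,
    # written to a shared cache, instead of one query per job.
    while true; do
        bjobs -u all -a > /tmp/bjobs.cache.$$ 2>/dev/null
        mv /tmp/bjobs.cache.$$ /tmp/bjobs.cache    # readers parse the cache
        sleep 300                                  # poll every 5 minutes
    done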
ATLAS ALARM-> CERN SLOW AFS
•https://gus.fzk.de/ws/ticket_info.php?ticket=62662
Time (UTC)         What happened
2010/10/01 07:13   GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN.
2010/10/01 07:33   Operator acknowledges ticket and contacts the expert.
2010/10/01 09:37   IT service manager re-classifies the ticket in CERN Remedy PRMS.
2010/10/11 15:33   Still 'in progress'. Reminder sent during this drill.
2010/10/25 15:56   Still 'in progress'. No reaction to the Oct 11th reminder.
ATLAS ALARM-> CERN CASTOR
•https://gus.fzk.de/ws/ticket_info.php?ticket=62688
Time (UTC)         What happened
2010/10/01 16:24   GGUS ALARM ticket opened, automatic email notification to [email protected] AND automatic assignment to ROC_CERN.
2010/10/01 16:41   Operator acknowledges ticket and contacts the expert.
2010/10/01 16:42   Expert starts investigation.
2010/10/01 17:23   Solved. The PUT DONE in SRM had not been propagated to CASTOR; it was done by hand.
2010/10/01 17:45   Submitter sets the ticket to 'verified'. Shifter added a cross-reference to GGUS:62705.
Analysis of the availability plots (week of Oct 11)
ATLAS
1.1 IN2P3: SRM lcg-cp (time out after 600 seconds) and lcg-del (internal communication error) test failures. On the 14th of October, the SRMv2-ATLAS-lcg-cp test was timing out after 600 seconds.
1.2 SARA: CE-sft-job failures: 'Job failed, no reason given by GRAM server'. Downtime announced only for the CREAM CE, not for the CE.
ALICE
2.1 FZK: Occasional VOBOX User Proxy Registration test failures.
2.2 NIKHEF (green box): CE tests were failing. SAM CE tests should be ignored as ALICE is using the CREAM CE. There are no CREAM CE direct submission tests for ALICE yet (work in progress).
CMS
NTR (nothing to report).
LHCb
4.1 IN2P3: CE sft-vo-swdir test was failing (took more than 150 seconds while 60 seconds are expected). Software installation tests failed. User jobs were also reported as having failed due to issues with the software area.
4.2 RAL: SRM test failures: SRM lcg-cr (connection timeout, GGUS:62893, related to GGUS:62829, a CASTOR DB performance issue) and Dirac (invalid argument) test failures.
4.3 SARA: CE-LHCb-availability test was failing on Monday the 11th. CE-sft-job failures: 'Job failed, no reason given by GRAM server'. Downtime announced only for the CREAM CE, not for the CE. Ron Trompert at SARA was notified and forwarded this issue to the appropriate person. Update from the MB meeting: there was in fact an independent PBS problem that was affecting the CE.
Main observations for the week of Oct 11 (1/2)
• CNAF-BNL network issue
  • Slow progress on GGUS:61440; transfer rates increased without full understanding of whether the limitations are in hardware or inherent, after the GEANT capacity increase and Force10 router 'shaping'. Suggestion to take the path out of service for rigorous testing.
  • Latest: rerouting the Amsterdam to Vienna link gave a dramatic performance benefit.
  • New ticket GGUS:63134 opened Oct 15: files > 2 GB hit transfer timeouts.
• CERN xrootd errors for ATLASHOTDISK and CMS t0export
  • Both related to load/timeout issues; fixes have been deployed.
• IN2P3 storage instabilities, 133k ATLAS files inaccessible
  • Files not properly migrated - stuck in storage. Finally closed on the 19th.
• CNAF: tail of CMS storage unavailability due to a GPFS bug
  • Incident from 6 to 10 October - SIR provided.
• LHCb Conditions DB access from any WN at any T1
  • Port opening issue - to be further discussed during the 3D workshop, Nov 16-17.
• OPN: who should handle issues with Tier-1 links to CERN?
  • https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#Responsibilities
  • CERN experts will kindly inform the Tier-1 when they happen to notice issues.
Main observations for the week of Oct 11 (2/2)
• Issues with the new GOCDB
  • AuthZ problem for downtimes, fixed.
  • Downtime summary no longer available; RFE accepted.
• CASTOR
  • Upgrades to 2.1.9-9 planned for ALICE and CMS in preparation for the HI run - done on Oct 19.
  • 1.2 PB added for the CMS HI run (ALICE done already).
  • ATLAS users circumvented ACLs via castor-public; route closed Oct 18.
• CMS downtime calendar: some downtimes not seen
  • Possibly after having been modified - CIC Portal out of sync?
• CMS again see a high rate of Maradona errors for jobs at CERN
  • LSF hotfix requested from Platform - will be rolled out this week.
• ALICE user jobs killing WNs at various sites
  • User job memory limits will be included in the upcoming AliEn update.
Analysis of the availability plots (week of Oct 18)
ATLAS: NTR.
ALICE: NTR.
CMS: NTR.
LHCb
4.1 GRIDKA: SRM DiracUnitTest failures on the 23rd: 'No space left on the device'. The disk space was full, but the test threshold was a bit unrealistic; it has been properly adjusted by Roberto.
4.2 IN2P3: Problem during software installation - CE tests were failing: install, job-Boole, job-DaVinci, job-Brunel, job-Gauss and sft-vo-swdir (took more than 150 seconds while 60 seconds are expected).
4.3 PIC: CE-sft-vo-swdir (took more than 150 seconds while 60 seconds are expected) and SRM-lhcb-FileAccess (error in a ROOT macro opening/reading a file) test failures from time to time.
Main observations for the week of Oct 18 (1/2)
• Technical stop advanced from November week 1 to 19-22 October, mainly to fix a problem with beam 1 injection. The very short notice caused some grumbles.
• Better availability statistics while no raw data was flowing.
• CERN annual power tests also moved forward from 1 November.
• PIC BDII stopped publishing overnight on 18/19 October. Such sites are not shown within the SAM framework. To be followed up.
• ATLAS found 4 UK Tier-2 sites intermittently failing gLite 3.2 FTS transfers. A complicated investigation first concluded there was a proxy delegation bug, before leading to the real problem: the sites were not accepting credentials issued by the BNL VOMS server. Some 11 other Tier-2 sites were subsequently found to be affected.
• Two new 'easy to use' SLC5 root exploits discovered - both fixable without service interruption.
Main observations for the week of Oct 18 (2/2)
• Two operational problems due to end-users:
  • RAL-LCG2 SRM problems due to an ATLAS user pulling data from RAL to an NDGF cluster.
  • ALICE user jobs filling memory on RAL worker nodes.
• ATLAS wish to escalate GGUS:61106, a middleware issue: the binaries from the gLite 3.2.7 WN tarball fail immediately on SUSE 10 with a floating point exception. They saw the same thing in the ATLAS SW externals area a while ago, and Stefan Roiser fixed it by compiling the libraries with an option to ensure they used SysV hashes (a linker sketch follows at the end of this slide). This stops ATLAS from using the German MPP and LRZ Tier-2 sites. The initial response from the developers is that SUSE is not a supported OS. First raised in August.
• LHCb wish to escalate GGUS:63234, an issue with IN2P3: shared software area problem, with jobs timing out while setting up the environment.
• The ALICE and CMS simulated Tier-0 HI data taking on Thursday/Friday caused no operational problems.
• New issue this week: CMS heavily use an unsupported ccrc08 application that has lost its interactivity. See GGUS:63450.
  • GridMap: http://servicemap.cern.ch/ccrc08/servicemap1.html
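For reference, a minimal sketch of the kind of fix described above for GGUS:61106. The ticket only says 'some option', so naming the GNU ld flag --hash-style=sysv here is an assumption, and libexample/example.c are hypothetical names:

    # Hypothetical example: build a shared library carrying the classic
    # SysV .hash section, which old dynamic linkers (e.g. on SUSE 10)
    # require; gnu-hash-only binaries can die there with an FPE.
    gcc -shared -fPIC -Wl,--hash-style=sysv -o libexample.so example.c

    # Sanity check: the library should now contain a .hash section,
    # not only .gnu.hash.
    readelf -S libexample.so | grep -i hash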
CNAF CMS STORAGE DOWN SIR
• CMS has a single T1D0 filesystem served by 8 servers, 2 gridftp servers and 2 metadata servers in a dedicated GPFS cluster. October 1-3 the CMS filesystem was moved from server DDN-3 to DDN-4 & DDN-5.
• Oct 6 13:00 CET: started deletion of the old metadata disks (i.e. on DDN-3) from the filesystem with a 'wrong' (not useful but perfectly safe) mmdeldisk option.
• This triggered a GPFS bug: GPFS crashed on the CMS cluster and the CMS filesystem hung.
• Oct 6 19:00: IBM suggests upgrading the whole cluster to 3.2.1-23 ('known bug in previous release'), but the problem is still there…
• Oct 8 morning: further diagnostic files sent to IBM support.
  • Not convinced by IBM support's analysis: other investigations throughout the day.
• Oct 8 evening: decision to follow IBM support's suggestion - mmdeldisk still fails.
• Oct 10 morning: new strategy: re-add the old metadata disks and force remigration (a command sketch follows below).
  • Oct 10 16:00: subsequent disk deletion successful.
• Oct 10 23:49: recovery of the deleted files from tape completed.
• Follow up:
  • We are now waiting for a complete analysis from IBM, and also for an 'ifix' to correct the bug.
  • The long recovery time could have been minimized by simply scratching the disk (all data are on tape) and recreating the filesystem, but in that case we would not have understood the problem.
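For readers unfamiliar with GPFS administration, a rough sketch of the recovery strategy described above. The filesystem and disk names are hypothetical and the disk descriptor syntax is simplified; the SIR does not give the exact commands or options used:

    # Hypothetical names: filesystem 'cmsfs', old DDN-3 metadata disks 'md3a'/'md3b'.
    # Step 1 (Oct 10 morning): re-add the old metadata disks to the filesystem
    # (real invocations need full disk descriptors / stanzas, omitted here).
    mmadddisk cmsfs "md3a;md3b"

    # Step 2: force remigration of data and metadata by restriping the filesystem.
    mmrestripefs cmsfs -r

    # Step 3 (Oct 10 16:00): retry the deletion, which now succeeds;
    # mmdeldisk itself migrates remaining data off the disks being removed.
    mmdeldisk cmsfs "md3a;md3b"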
Severely degraded response from CERN Batch Service
• From 05:26 until 10:00 on 20 October the batch service was severely degraded: the mbatchd daemon was crashing every 10-30 minutes, resulting in poor response times for (or rejection of) job submissions and job status queries. Job dispatch to, and status updates from, the worker nodes were also impacted, which slowed the rate at which jobs started on the worker nodes. No jobs were lost.
• 09:36 Priority 1 ticket opened with the vendor to investigate the problem [1-100387964]
• 09:40 Vendor starts to investigate
• 09:59 Last daemon crash
• 10:33 Vendor provides fix; deployed
• The details of the problem remain to be understood with the vendor. The fix appears to have addressed the problem.
• There were no changes on the production system that correlate with this problem.
• Follow up:
  • Reported to the WLCG operations meeting on 20 Oct 2010.
  • Understand more details of the root cause [in Platform ticket 1-100387964].
  • 3 pending fixes for other issues (the CMS Maradona problem, misc scaling issues) will be combined and delivered as an official hotfix for deployment on the batch service. trac:182