CMS Stress Test Report
Marco Verlato (INFN-Padova)
INFN-GRID Testbed Meeting, 17 January 2003
Motivations and goals
Purpose of the "stress test":
- Verify how well the EDG middleware supports CMS Production
- Verify the portability of the CMS Production environment to a grid environment
- Produce a reasonable amount of the PRS-requested events

Goals:
- Aim for 1 million events (FZ files only, no Objectivity)
- Measure performance, efficiency and the reasons for job failures
- Try to make the system stable

Organization:
- Operations started November 30th and ended at Xmas (~3 weeks)
- The joint effort involved CMS, EDG and LCG people (~50 people, 17 from INFN)
- Mailing list: <[email protected]>
Software and middleware
CMS software:
- The software used is the official production one
- CMKIN and CMSIM: installed as RPMs on all sites

EDG middleware releases:
- 1.3.4 (before 9/12)
- 1.4.0 (after 9/12)

Tools used (on the EDG "User Interface"):
- Modified IMPALA/BOSS system to allow for grid submission of jobs (see the sketch after the diagram below)
- Scripts and ad-hoc tools to:
  - replicate files
  - collect monitoring information from EDG and from the jobs
[Diagram: production components. UI running IMPALA and the BOSS DB, with the JobExecuter/dbUpdator turning parameters into JDL; grid services (RefDB, RC); CEs with the CMS sw installed; WNs writing data to SEs; data registration, job output filtering and runtime monitoring.]
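By way of illustration only (this is not the actual IMPALA/BOSS code), a minimal sketch of the UI-side submission step in the diagram could look as follows. The JDL attributes are the standard EDG ones; the wrapper script, cards file and output-parsing logic are assumptions.

```python
# Minimal sketch of the UI-side submission step (NOT the real IMPALA/BOSS code).
# Assumes the standard EDG UI command edg-job-submit is available on the PATH;
# cmsim_wrapper.sh and the cards file are hypothetical sandbox contents.
import subprocess
import tempfile

def submit_production_job(run_number: int) -> str:
    """Write a minimal JDL for one job and submit it with edg-job-submit."""
    jdl = f'''
Executable    = "cmsim_wrapper.sh";
Arguments     = "{run_number}";
StdOutput     = "std.out";
StdError      = "std.err";
InputSandbox  = {{"cmsim_wrapper.sh", "cards_{run_number}.txt"}};
OutputSandbox = {{"std.out", "std.err"}};
'''
    with tempfile.NamedTemporaryFile("w", suffix=".jdl", delete=False) as f:
        f.write(jdl)
        jdl_path = f.name
    result = subprocess.run(["edg-job-submit", jdl_path],
                            capture_output=True, text=True, check=True)
    # edg-job-submit prints the assigned job identifier (an https://... string);
    # the parsing below is a naive assumption about its output format.
    for line in result.stdout.splitlines():
        if line.strip().startswith("https://"):
            return line.strip()
    raise RuntimeError("no job identifier found in edg-job-submit output")
```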
Resources
The production is managed from 4 UIs:
- Bologna / CNAF
- École Polytechnique
- Imperial College
- Padova
→ reduces the bottleneck due to the BOSS DB

Several RBs seeing the same Computing and Storage Elements:
- CERN (dedicated to CMS) (EP UI)
- CERN (common to all applications) (backup!)
- CNAF (common to all applications) (Padova UI)
- CNAF (dedicated to CMS) (CNAF UI)
- Imperial College (dedicated to CMS and BaBar) (IC UI)
→ reduces the bottleneck due to the intensive use of the RB and the 512-owner limit in Condor-G
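How each UI is pinned to "its" RB is a configuration detail; the sketch below shows one plausible arrangement, assuming edg-job-submit accepts a configuration file that selects the broker (the option name and the file paths are assumptions, not verified EDG 1.4 syntax).

```python
# Sketch of pinning each UI to its own Resource Broker, so the four submission
# streams do not all load the same RB (and the same Condor-G scheduler, with
# its 512-owner limit).  Paths and the --config option are assumptions.
import subprocess

UI_TO_RB_CONFIG = {
    "cnaf":    "/opt/edg/etc/rb_cnaf_cms.conf",      # CNAF RB dedicated to CMS
    "padova":  "/opt/edg/etc/rb_cnaf_common.conf",   # CNAF RB common to all VOs
    "ep":      "/opt/edg/etc/rb_cern_cms.conf",      # CERN RB dedicated to CMS
    "ic":      "/opt/edg/etc/rb_ic_cms_babar.conf",  # IC RB for CMS and BaBar
}

def submit_from(ui_name: str, jdl_path: str) -> None:
    cfg = UI_TO_RB_CONFIG[ui_name]
    subprocess.run(["edg-job-submit", "--config", cfg, jdl_path], check=True)
```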
Resources
Site         CE           No. of CPUs   SE                        Disk space (GB)
CERN         lxshare0227  122           lxshare0393, lxshare0384  100 + 1000 (=100×10)
CNAF         testbed008   40            grid007g                  1300
RAL          gppce05      16            gppse05                   330
NIKHEF       tbn09        22            tbn03                     430
Lyon         ccgridli03   120           ccgridli07                200
Legnaro      cmsgrid001   50            cmsgrid002                500
Padova       grid001      12            grid005                   670
École Pol.   polgrid1     4             polgrid2                  200
Total                     386                                     4730
Data management
Two practical approaches:
- Bologna, Padova: FZ files (~230 MB each) are stored directly at CNAF and Legnaro
- EP, IC: FZ files are stored where they were produced and later replicated to a dedicated SE at CERN. Goal: test the creation of file replicas (see the sketch below)

All sites use disk for file storage, but:
- CASTOR at CERN: FZ files replicated to CERN are also automatically copied into CASTOR (thanks to a new staging daemon from WP2)
- HPSS in Lyon: FZ files stored in Lyon are automatically copied into HPSS
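For the EP/IC approach, the replicate-to-CERN step could look roughly like this. globus-url-copy is the real Globus transfer tool; the directory layout and the replica-catalog registration command are placeholders, standing in for whatever WP2 tooling the sites actually used.

```python
# Rough sketch of the "store locally, replicate to a CERN SE" flow used by the
# EP and IC streams.  globus-url-copy is the real Globus transfer command; the
# directory layout and the registration command below are placeholders.
import subprocess

CERN_SE = "lxshare0393"   # dedicated CERN SE from the resources table

def replicate_to_cern(local_se: str, fz_name: str) -> None:
    src = f"gsiftp://{local_se}/flatfiles/cms/{fz_name}"   # assumed layout
    dst = f"gsiftp://{CERN_SE}/flatfiles/cms/{fz_name}"
    subprocess.run(["globus-url-copy", src, dst], check=True)
    # Register the new replica in the Replica Catalog: placeholder command,
    # not the verified name of the WP2 replica-manager CLI.
    subprocess.run(["edg-replica-manager-registerEntry", fz_name, dst], check=True)

replicate_to_cern("polgrid2", "run01234.fz")   # SE name from the resources table
```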
Online Monitoring (MDS-based)
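Since MDS is LDAP-based, the underlying queries behind such monitoring can be issued with a plain ldapsearch. A minimal sketch, assuming the usual Globus MDS 2.x port (2135) and base DN, with an illustrative host taken from the resources table:

```python
# Minimal sketch of an MDS query of the kind an online-monitoring script can
# build on.  Port 2135 and the "mds-vo-name=local, o=grid" base are the usual
# Globus MDS 2.x defaults; the attribute filter shown is illustrative.
import subprocess

def query_mds(gris_host: str, search_filter: str = "(objectclass=*)") -> str:
    cmd = [
        "ldapsearch", "-x", "-LLL",
        "-H", f"ldap://{gris_host}:2135",
        "-b", "mds-vo-name=local, o=grid",
        search_filter,
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

print(query_mds("lxshare0227"))  # CE hostname taken from the resources table
```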
[Plot: Events vs. time (CMKIN)]

[Plot: Events vs. time (CMSIM); ~7 sec/event on average, ~2.5 sec/event at peak (12-14 Dec)]
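For scale, these aggregate rates correspond roughly to the following daily throughput (the average figure is consistent with the ~270k CMSIM events produced over the ~3 weeks of operations):

\[
\frac{86\,400\ \text{s/day}}{7\ \text{s/event}} \approx 12\,300\ \text{events/day (average)},
\qquad
\frac{86\,400\ \text{s/day}}{2.5\ \text{s/event}} \approx 34\,600\ \text{events/day (peak)}
\]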
Final results (preliminary!)
UI       #CMKIN evts   %    #CMSIM evts   %
CNAF     253625        43   130375        48
IC       73125         12   23375          9
IN2P3    114250        19   32125         12
Padova   151875        26   82750         31
Total    592875             268625

UI       #CMKIN jobs   #success (%)   #CMSIM jobs   #success (%)
CNAF     2430          2029 (83)      1412          1043 (74)
IC       647           585 (90)       290           187 (64)
IN2P3    1327          914 (69)       474           253 (53)
Padova   1358          1215 (89)      1188          662 (56)
Total    5762          4743 (82)      3364          2145 (63)
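The overall success rates can be cross-checked directly from the job counts in the table (the integer percentages quoted there are truncations):

```python
# Cross-check of the overall success percentages in the job table above.
totals = {"CMKIN": (5762, 4743), "CMSIM": (3364, 2145)}  # (#jobs, #success)
for step, (submitted, succeeded) in totals.items():
    print(f"{step}: {succeeded}/{submitted} = {succeeded / submitted:.1%}")
# CMKIN: 4743/5762 = 82.3%
# CMSIM: 2145/3364 = 63.8%
```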
Main issues
Symptom: no matching resources
Cause: II stuck by too many accesses
Solution: "fake" dbII used since 1.4.0; slower job submission rate
Frequency: very high before 1.4.0; low since 1.4.0 + slow submission

Symptom: standard output of the job wrapper does not contain useful data
Cause: 1) home dir not available on the WN; 2) exhausted resources on the CE; 3) race conditions for file updates between WN and CE; 4) glitches in the gass_transfer; 5) ….
Solution: GRAM-PBS script patch and JSS-Maradona patch under test after Xmas, since 1.4.2
Frequency: very high, especially for "long jobs" (~12 hours)

Symptom: Condor failure
Cause: Condor scheduler crashes (file-max parameter too low)
Solution: increase the file-max parameter (see the sketch after this table)
Frequency: low

Symptom: cannot connect to the RC server
Cause: LDAP server overloaded
Solution: create new RCs and new collections; restart the LDAP server
Frequency: high

Symptom: edg-job-* commands hang
Cause: MDS not responding due to a local GRIS
Solution: remove the offending GRIS from MDS
Frequency: low

Symptom: Globus down / failed submission
Cause: gatekeeper unreachable
Solution: ?
Frequency: low

Symptom: cannot download the InputSandbox
Cause: globus-url-copy problem between WN and RB (security, gridftp, etc.)
Solution: ?
Frequency: low
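The "increase the file-max parameter" fix in the Condor row refers to the Linux kernel's limit on open file handles, exposed at /proc/sys/fs/file-max. A minimal sketch follows; the new value is illustrative, and writing it requires root on the node running the Condor-G scheduler.

```python
# The "file-max parameter" is the Linux kernel limit on open file handles,
# exposed at /proc/sys/fs/file-max.  The new value below is illustrative,
# not the one actually used on the testbed.
FILE_MAX = "/proc/sys/fs/file-max"

with open(FILE_MAX) as f:
    print("current fs.file-max:", f.read().strip())

NEW_LIMIT = 65536  # illustrative value
with open(FILE_MAX, "w") as f:  # requires root
    f.write(str(NEW_LIMIT))
```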
Chronology
29/11 – 2/12: reasonably smooth
3/12 – 5/12: “inefficiency” due to CMS week
6/12: RC problems begin; new collections created; Nagios monitoring online
7/12 – 8/12: II in very bad shape
9/12 – 10/12: deployment of 1.4.0; still problems with RC; CNAF and Legnaro resources not available; problems with CNAF RB
11/12: Top level MDS stuck because of a CE in Lyon
14/12 – 15/12: II stuck, most submitted jobs aborted
16/12: failure in grid-mapfile update due to NIKHEF VO ldap server not reachable
Conclusions
Job failures are dominated by:
- Standard output of the job wrapper does not contain useful data:
  - many different causes
  - affects mainly "long jobs"
  - some patches with possible solutions implemented
- Replica Catalog stops responding: no real solution yet, but we will soon use RLS
- Information System (GRIS, GIIS, dbII): hopefully R-GMA will solve these problems
- Lots of smaller problems (Globus, Condor-G, machine configuration, defective disks, etc.)

Short-term actions:
- EDG 1.4.3 released on 14/1 and deployed on the PRODUCTION testbed
- The test goes on in "no-stress" mode: it runs in parallel with the review preparation (the testbed will remain stable) and will measure the effect of the new GRAM-PBS script and JSS-Maradona patches