CMS Stress Test Report
Marco Verlato (INFN-Padova)
INFN-GRID Testbed Meeting, 17 January 2003
Motivations and goals
Purpose of the "stress test":
- Verify how well the EDG middleware supports CMS Production
- Verify the portability of the CMS Production environment to a grid environment
- Produce a reasonable amount of the PRS-requested events

Goals:
- Aim for 1 million events (FZ files only, no Objectivity)
- Measure performance, efficiency and the reasons for job failures
- Try to make the system stable

Organization:
- Operations started November 30th and ended at Xmas (~3 weeks)
- The joint effort involved CMS, EDG and LCG people (~50 people, 17 from INFN)
- Mailing list: <[email protected]>
Software and middleware
CMS software:
- The software used is the official production one
- CMKIN and CMSIM: installed as RPMs on all sites

EDG middleware releases:
- 1.3.4 (before 9/12)
- 1.4.0 (after 9/12)

Tools used (on the EDG "User Interface"):
- Modified IMPALA/BOSS system to allow for grid submission of jobs (see the sketch after the diagram below)
- Scripts and ad-hoc tools to:
  - replicate files
  - collect monitoring information from EDG and from the jobs
[Diagram: production components. UI running IMPALA and the BOSS DB, with the JobExecuter/dbUpdator turning parameters into JDL; grid services (RefDB, RC); CEs with the CMS sw installed; WNs writing data to SEs; data registration, job output filtering and runtime monitoring.]
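By way of illustration only (this is not the actual IMPALA/BOSS code), a minimal sketch of the UI-side submission step in the diagram could look as follows. The JDL attributes are the standard EDG ones; the wrapper script, cards file and output-parsing logic are assumptions.

```python
# Minimal sketch of the UI-side submission step (NOT the real IMPALA/BOSS code).
# Assumes the standard EDG UI command edg-job-submit is available on the PATH;
# cmsim_wrapper.sh and the cards file are hypothetical sandbox contents.
import subprocess
import tempfile

def submit_production_job(run_number: int) -> str:
    """Write a minimal JDL for one job and submit it with edg-job-submit."""
    jdl = f'''
Executable    = "cmsim_wrapper.sh";
Arguments     = "{run_number}";
StdOutput     = "std.out";
StdError      = "std.err";
InputSandbox  = {{"cmsim_wrapper.sh", "cards_{run_number}.txt"}};
OutputSandbox = {{"std.out", "std.err"}};
'''
    with tempfile.NamedTemporaryFile("w", suffix=".jdl", delete=False) as f:
        f.write(jdl)
        jdl_path = f.name
    result = subprocess.run(["edg-job-submit", jdl_path],
                            capture_output=True, text=True, check=True)
    # edg-job-submit prints the assigned job identifier (an https://... string);
    # the parsing below is a naive assumption about its output format.
    for line in result.stdout.splitlines():
        if line.strip().startswith("https://"):
            return line.strip()
    raise RuntimeError("no job identifier found in edg-job-submit output")
```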
Resources
The production is managed from 4 UIs:
- Bologna / CNAF
- École Polytechnique
- Imperial College
- Padova
→ reduces the bottleneck due to the BOSS DB

Several RBs seeing the same Computing and Storage Elements:
- CERN (dedicated to CMS) (EP UI)
- CERN (common to all applications) (backup!)
- CNAF (common to all applications) (Padova UI)
- CNAF (dedicated to CMS) (CNAF UI)
- Imperial College (dedicated to CMS and BaBar) (IC UI)
→ reduces the bottleneck due to the intensive use of the RB and the 512-owner limit in Condor-G
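How each UI is pinned to "its" RB is a configuration detail; the sketch below shows one plausible arrangement, assuming edg-job-submit accepts a configuration file that selects the broker (the option name and the file paths are assumptions, not verified EDG 1.4 syntax).

```python
# Sketch of pinning each UI to its own Resource Broker, so the four submission
# streams do not all load the same RB (and the same Condor-G scheduler, with
# its 512-owner limit).  Paths and the --config option are assumptions.
import subprocess

UI_TO_RB_CONFIG = {
    "cnaf":    "/opt/edg/etc/rb_cnaf_cms.conf",      # CNAF RB dedicated to CMS
    "padova":  "/opt/edg/etc/rb_cnaf_common.conf",   # CNAF RB common to all VOs
    "ep":      "/opt/edg/etc/rb_cern_cms.conf",      # CERN RB dedicated to CMS
    "ic":      "/opt/edg/etc/rb_ic_cms_babar.conf",  # IC RB for CMS and BaBar
}

def submit_from(ui_name: str, jdl_path: str) -> None:
    cfg = UI_TO_RB_CONFIG[ui_name]
    subprocess.run(["edg-job-submit", "--config", cfg, jdl_path], check=True)
```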
Resources
Site         CE           No. of CPUs   SE                        Disk space (GB)
CERN         lxshare0227  122           lxshare0393, lxshare0384  100 + 1000 (=100×10)
CNAF         testbed008   40            grid007g                  1300
RAL          gppce05      16            gppse05                   330
NIKHEF       tbn09        22            tbn03                     430
Lyon         ccgridli03   120           ccgridli07                200
Legnaro      cmsgrid001   50            cmsgrid002                500
Padova       grid001      12            grid005                   670
École Pol.   polgrid1     4             polgrid2                  200
Total                     386                                     4730
Data management
Two practical approaches:
- Bologna, Padova: FZ files (~230 MB each) are stored directly at CNAF and Legnaro
- EP, IC: FZ files are stored where they were produced and later replicated to a dedicated SE at CERN. Goal: test the creation of file replicas (see the sketch below)

All sites use disk for file storage, but:
- CASTOR at CERN: FZ files replicated to CERN are also automatically copied into CASTOR (thanks to a new staging daemon from WP2)
- HPSS in Lyon: FZ files stored in Lyon are automatically copied into HPSS
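For the EP/IC approach, the replicate-to-CERN step could look roughly like this. globus-url-copy is the real Globus transfer tool; the directory layout and the replica-catalog registration command are placeholders, standing in for whatever WP2 tooling the sites actually used.

```python
# Rough sketch of the "store locally, replicate to a CERN SE" flow used by the
# EP and IC streams.  globus-url-copy is the real Globus transfer command; the
# directory layout and the registration command below are placeholders.
import subprocess

CERN_SE = "lxshare0393"   # dedicated CERN SE from the resources table

def replicate_to_cern(local_se: str, fz_name: str) -> None:
    src = f"gsiftp://{local_se}/flatfiles/cms/{fz_name}"   # assumed layout
    dst = f"gsiftp://{CERN_SE}/flatfiles/cms/{fz_name}"
    subprocess.run(["globus-url-copy", src, dst], check=True)
    # Register the new replica in the Replica Catalog: placeholder command,
    # not the verified name of the WP2 replica-manager CLI.
    subprocess.run(["edg-replica-manager-registerEntry", fz_name, dst], check=True)

replicate_to_cern("polgrid2", "run01234.fz")   # SE name from the resources table
```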
Online Monitoring (MDS-based)
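Since MDS is LDAP-based, the underlying queries behind such monitoring can be issued with a plain ldapsearch. A minimal sketch, assuming the usual Globus MDS 2.x port (2135) and base DN, with an illustrative host taken from the resources table:

```python
# Minimal sketch of an MDS query of the kind an online-monitoring script can
# build on.  Port 2135 and the "mds-vo-name=local, o=grid" base are the usual
# Globus MDS 2.x defaults; the attribute filter shown is illustrative.
import subprocess

def query_mds(gris_host: str, search_filter: str = "(objectclass=*)") -> str:
    cmd = [
        "ldapsearch", "-x", "-LLL",
        "-H", f"ldap://{gris_host}:2135",
        "-b", "mds-vo-name=local, o=grid",
        search_filter,
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

print(query_mds("lxshare0227"))  # CE hostname taken from the resources table
```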
[Plot: Events vs. time (CMKIN)]

[Plot: Events vs. time (CMSIM); ~7 sec/event on average, ~2.5 sec/event at peak (12-14 Dec)]
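For scale, these aggregate rates correspond roughly to the following daily throughput (the average figure is consistent with the ~270k CMSIM events produced over the ~3 weeks of operations):

\[
\frac{86\,400\ \text{s/day}}{7\ \text{s/event}} \approx 12\,300\ \text{events/day (average)},
\qquad
\frac{86\,400\ \text{s/day}}{2.5\ \text{s/event}} \approx 34\,600\ \text{events/day (peak)}
\]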
Final results (preliminary!)
UI       #CMKIN evts   %    #CMSIM evts   %
CNAF     253625        43   130375        48
IC       73125         12   23375          9
IN2P3    114250        19   32125         12
Padova   151875        26   82750         31
Total    592875             268625

UI       #CMKIN jobs   #success (%)   #CMSIM jobs   #success (%)
CNAF     2430          2029 (83)      1412          1043 (74)
IC       647           585 (90)       290           187 (64)
IN2P3    1327          914 (69)       474           253 (53)
Padova   1358          1215 (89)      1188          662 (56)
Total    5762          4743 (82)      3364          2145 (63)
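The overall success rates can be cross-checked directly from the job counts in the table (the integer percentages quoted there are truncations):

```python
# Cross-check of the overall success percentages in the job table above.
totals = {"CMKIN": (5762, 4743), "CMSIM": (3364, 2145)}  # (#jobs, #success)
for step, (submitted, succeeded) in totals.items():
    print(f"{step}: {succeeded}/{submitted} = {succeeded / submitted:.1%}")
# CMKIN: 4743/5762 = 82.3%
# CMSIM: 2145/3364 = 63.8%
```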
Main issues
Symptom: no matching resources
Cause: II stuck by too many accesses
Solution: "fake" dbII used since 1.4.0; slower job submission rate
Frequency: very high before 1.4.0; low since 1.4.0 + slow submission

Symptom: standard output of the job wrapper does not contain useful data
Cause: 1) home dir not available on the WN; 2) exhausted resources on the CE; 3) race conditions for file updates between WN and CE; 4) glitches in the gass_transfer; 5) ….
Solution: GRAM-PBS script patch and JSS-Maradona patch under test after Xmas, since 1.4.2
Frequency: very high, especially for "long jobs" (~12 hours)

Symptom: Condor failure
Cause: Condor scheduler crashes (file-max parameter too low)
Solution: increase the file-max parameter (see the sketch after this table)
Frequency: low

Symptom: cannot connect to the RC server
Cause: LDAP server overloaded
Solution: create new RCs and new collections; restart the LDAP server
Frequency: high

Symptom: edg-job-* commands hang
Cause: MDS not responding due to a local GRIS
Solution: remove the offending GRIS from MDS
Frequency: low

Symptom: Globus down / failed submission
Cause: gatekeeper unreachable
Solution: ?
Frequency: low

Symptom: cannot download the InputSandbox
Cause: globus-url-copy problem between WN and RB (security, gridftp, etc.)
Solution: ?
Frequency: low
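The "increase the file-max parameter" fix in the Condor row refers to the Linux kernel's limit on open file handles, exposed at /proc/sys/fs/file-max. A minimal sketch follows; the new value is illustrative, and writing it requires root on the node running the Condor-G scheduler.

```python
# The "file-max parameter" is the Linux kernel limit on open file handles,
# exposed at /proc/sys/fs/file-max.  The new value below is illustrative,
# not the one actually used on the testbed.
FILE_MAX = "/proc/sys/fs/file-max"

with open(FILE_MAX) as f:
    print("current fs.file-max:", f.read().strip())

NEW_LIMIT = 65536  # illustrative value
with open(FILE_MAX, "w") as f:  # requires root
    f.write(str(NEW_LIMIT))
```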
Chronology
29/11 – 2/12: reasonably smooth
3/12 – 5/12: “inefficiency” due to CMS week
6/12: RC problems begin; new collections created; Nagios monitoring online
7/12 – 8/12: II in very bad shape
9/12 – 10/12: deployment of 1.4.0; still problems with RC; CNAF and Legnaro resources not available; problems with CNAF RB
11/12: Top level MDS stuck because of a CE in Lyon
14/12 – 15/12: II stuck, most submitted jobs aborted
16/12: failure in grid-mapfile update due to NIKHEF VO ldap server not reachable
Conclusions
Job failures are dominated by:
- Standard output of the job wrapper does not contain useful data:
  - many different causes
  - affects mainly "long jobs"
  - some patches with possible solutions implemented
- Replica Catalog stops responding: no real solution yet, but we will soon use RLS
- Information System (GRIS, GIIS, dbII): hopefully R-GMA will solve these problems
- Lots of smaller problems (Globus, Condor-G, machine configuration, defective disks, etc.)

Short-term actions:
- EDG 1.4.3 released on 14/1 and deployed on the PRODUCTION testbed
- The test goes on in "no-stress" mode: it runs in parallel with the review preparation (the testbed will remain stable) and will measure the effect of the new GRAM-PBS script and JSS-Maradona patches