CMS Experience on the GRID

Peter Kreuzer
for the CMS Computing project
RWTH Aachen / CERN

NEC2009 Conference, Varna / Bulgaria, Sep 10, 2009
No need to provide the scientific motivation here, since Sergio Cittolin and Livio Mapelli did this very well in the
introductory session of this conference!
Goal of the CMS Computing project
• Provide resources and services to store/serve O(10) PB of data per year
• Provide access to the most interesting physics events to O(1500) CMS collaborators located in 200 institutions around the world
• Minimize constraints due to user localisation and resource variety
• Decentralize control and costs of the computing infrastructure
• Share resources with other LHC experiments
Find the answer on the Worldwide LCG GRID
The way CMS uses the GRID

[Data-flow diagram on the WLCG Computing Grid Infrastructure. Recoverable labels : event display with up to 200 tracks/event, after reconstruction/selection; CMS detector / CAF 450 MB/s (300 Hz); Tier-0 → Tier-1 30-300 MB/s (aggregate 800 MB/s); Tier-1 → Tier-2 50-500 MB/s; Tier-2 → Tier-1 10-20 MB/s; a 100 MB/s link near the Tier-3s; job rates ~150k jobs/day and ~50k jobs/day.]

• Tier-0 : Prompt Reconstruction; archival of RAW and first RECO data; calibration streams (CAF); data distribution to the Tier-1s
• 7+1 Tier-1s : Re-Reconstruction; skimming; custodial copy of RAW and RECO; served copy of RECO; archival of simulation; data distribution to the Tier-2s
• ~50 Tier-2s : primary resources for physics analysis and detector studies by users; MC simulation, uploaded to the Tier-1s
• Tier-3s attached to the Tier-2s

CMS in total : 1 Tier-0 at CERN (Geneva), 7+1 Tier-1s on 3 continents, ~50 Tier-2s on 4 continents
Computing Resources: Setting the scale
    Data Tier        RAW   RECO   SIMRAW   SIMRECO   AOD
    <Size> [MB]      1.5   0.5    2.0      0.5       0.1
    CPU [HS06-sec]    -    100    1000
• Run 2009-10 (Oct-Mar / Apr-Sep) : 300 Hz x 6 x 10^6 sec, i.e. 2.2 x 10^9 events
• Size & CPU per event : see table above
• CMS datasets :
  - O(10) Primary Datasets (RAW, RECO, AOD)
  - O(30) Secondary Datasets (RECO, AOD)
  - 1.5 times more simulated data (SIMRAW, SIMRECO, AOD)
• During run 2009-10, CMS plans 5 full re-reconstructions, needing :
  - 400 kHS06 CPU
  - 26 PB disk
  - 38 PB tape
• Resource ratio CERN / (T1+T2) : CPU 25%, disk 15%

[Chart: disk [PB] and CPU [kHS06] per tier]

HEP-Spec2006 (HS06) :
• Modern CPU ~ 8 HS06 / core
• 100 HS06-sec ~ 12.5 sec/event
• 100 kHS06 ~ 12,500 cores
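As a back-of-the-envelope cross-check of these numbers, the short Python sketch below recomputes them from the per-event figures in the table; the suggestion that the quoted 2.2 x 10^9 events include overlap between primary datasets is an assumption of mine, not the slide's.

    # Cross-check of the 2009-10 resource numbers (sketch only).
    HS06_PER_CORE = 8.0                      # "Modern CPU ~ 8 HS06 / core"

    rate_hz  = 300                           # HLT output rate
    live_sec = 6.0e6                         # live seconds in the run plan
    events   = rate_hz * live_sec            # 1.8e9 recorded events
    # (the quoted 2.2e9 is larger, presumably from primary-dataset overlap)

    raw_mb, reco_mb, aod_mb = 1.5, 0.5, 0.1  # <Size>/event from the table
    raw_pb  = events * raw_mb / 1e9          # ~2.7 PB of RAW
    reco_pb = events * reco_mb / 1e9         # ~0.9 PB per reconstruction pass

    sec_per_event = 100.0 / HS06_PER_CORE    # 100 HS06-sec -> ~12.5 s/event
    cores         = 400e3 / HS06_PER_CORE    # 400 kHS06 -> ~50,000 cores

    print(f"{events:.1e} events, {raw_pb:.1f} PB RAW, "
          f"{sec_per_event:.1f} s/event, {cores:,.0f} cores")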
And not to forget the obvious... GRID Middleware Services :
- Storage Elements
- Computing Elements
- Workload Management System
- Local File Catalog
- Information System
- Virtual Organisation Management Service
- Inter-operability between GRIDs (EGEE, OSG, NorduGrid, ...)
Networking
Site Specificities, e.g. Storage/Batch systems at the CMS Tier-1s :

    Site      FNAL             RAL           CCIN2P3       PIC              ASGC          INFN           FZK
    Storage   dCache/Enstore   Castor        dCache/HPSS   dCache/Enstore   Castor        Castor+StoRM   dCache/TSM
    Batch     Condor           Torque/Maui   BQS           Torque/Maui      Torque/Maui   LSF            PBSPro
Russian/Ukrainian CMS T2s, CERN-T1
• 2006 : agreement between the WLCG and CMS managements on the association of the RU/UA T2 sites with a partial T1 at CERN
• 2008 : 30 TB disk buffer in front of 450 TB of tape space deployed at CERN, to serve AODs to and archive MC data from the RU/UA T2s
• RU/UA T2s provide WLCG/CMS service work, e.g. transfer-link commissioning and Data Management support
• CMS has established a strong collaboration and weekly operations meetings with the RU/UA T2 community; a lot of progress over the past year!
• Active RU/UA CMS sites as of today :
  T2_RU_IHEP (Protvino)             T2_RU_PNPI (Petersburg Nucl. Ph. Inst.)
  T2_RU_RRC_KI (Kurchatov Inst.)    T2_RU_INR (Inst. for Nucl. Res. RAS)
  T2_RU_JINR (Dubna)                T2_RU_ITEP (R.F. Inst. for Th. and Exp. Ph.)
  T2_RU_SINP (MSU Develop. site)    T2_UA_KIPT (NSC Kharkov Inst. of Ph. & Tech.)
• And many participating physics communities, like Bulgaria!
In the rest of this talk I will cover :
• Snapshots of achieved CMS computing milestones :
  - Data Distribution
  - Data Processing
  - Distributed Analysis
• A few more aspects of CMS Computing Operations on the GRID :
  - Site Availability & Readiness monitoring
  - SW deployment
  - Computing shift procedures
Acronyms of CMS Computing activities

    Challenge          What is it ?                                        Type of Data   Parallel with other LHC VOs ?
    DC04               Data Challenge 2004                                 MC             No
    SC2,3,4            Service Challenges 2004,05,06                       MC             No
    CSA06,07,08        Computing, Software & Analysis Challenges 2006-08   MC             No
    Summer07,09 Prod   MC Production 2007,09                               MC             No
    CCRC08             Common Computing Readiness Challenge 2008           MC             Yes
    CRUZET08,09        Cosmic Run at 0 Tesla 2008,09                       Cosmic         No
    CRAFT08,09         Cosmic Run at 4 Tesla 2008,09                       Cosmic         No
    STEP09             Scale Testing for the Experiment Program in 2009    MC             Yes
    MWGR               Mid-Week Global Run                                 Cosmic         No
Data Transfers
• PhEDEx : a distributed set of Perl agents on top of a large ORACLE DB
• The CMS data distribution strategy relies on full-mesh Data Management :
  - links are individually "commissioned"
  - target rates depend on the size of the sites
  - minimum requirements per site (policy sketched below) :
    - T1 : 4 links to other T1s / 20 downlinks to T2s
    - T2 : 2 uplinks to / 4 downlinks from T1s
• CMS has >600 commissioned links!
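The minimum-mesh requirement can be stated as a simple policy check; the sketch below is illustrative (the link-class names are invented, and this is not how the commissioning machinery is actually implemented):

    # Sketch of the minimum link-commissioning policy stated above.
    MIN_LINKS = {
        "T1": {"links_to_T1": 4, "downlinks_to_T2": 20},
        "T2": {"uplinks_to_T1": 2, "downlinks_from_T1": 4},
    }

    def meets_policy(tier, commissioned):
        """commissioned: dict mapping link class -> number of good links."""
        return all(commissioned.get(cls, 0) >= n
                   for cls, n in MIN_LINKS[tier].items())

    # Example: a Tier-2 with 3 commissioned uplinks and 5 downlinks.
    print(meets_policy("T2", {"uplinks_to_T1": 3, "downlinks_from_T1": 5}))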
• Historical view : transfer volume has grown by 3 orders of magnitude since 2004
• Main milestone (CCRC08) achieved in parallel with the other LHC VOs
• Reached the burst level expected for the first year of LHC running

[Plot: daily transfer volume, reaching 100-200 TB/day]
CMS Data Transfer team
Data Transfer challenges
• T0 → T1
  - CMS regularly above the design rate in the CCRC08 multi-VO challenge
  - STEP09 included a tape-writing step at the T1s : the observed T0 → T1 transfer latency was impacted by the state of the T1 tape systems (busy, overloaded, ...)
• T1 → T1
  - STEP09 : simultaneous transfer of 50 TB of AOD from 1 T1 to all other T1s; average 970 MB/s over 3 days, no big problems encountered
• T1 → T2
  - see "Tier-1 Data Serving tests in STEP09" below

[Plot: Tier-0 → Tier-1 throughput (MB/s), multi-VO, 05/01/2008 - 06/02/2008]
Data Transfers from/to RU/UA T2s
• T2 → T1 transfers of produced MC data : 32.7 TB over the past year, from 6 different T2s to their custodial T1s
• Transfer quality to/from the RU/UA T2s has improved lately
• T1 → T2 downloads : 73 TB to the RU/UA T2s over the past year
Production of simulated data
• A good measure of computing performance on the GRID (a CMS activity since 2004)
• ProdAgent : modular Python daemons on top of a MySQL DB, interacting with the CMS Data Management systems (pattern sketched below)
• CMS has a centralized production strategy handled by the Data Operations teams :
  - 2006-07 : 6 "regional teams" + a Production Manager
  - 2008-09 : 1 team (6 people) managing submissions in defined "T1 regions"
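ProdAgent's "modular daemons around a shared database" pattern can be condensed into a few lines; the schema, state names and polling interval below are invented for illustration (and sqlite stands in for MySQL):

    # Illustrative sketch of a ProdAgent-style polling daemon.
    import sqlite3, time

    def submit_to_grid(dataset):
        print(f"submitting production jobs for {dataset}")  # hand-off to the WMS

    def poll_and_submit(db):
        rows = db.execute("SELECT id, dataset FROM requests "
                          "WHERE state = 'queued'").fetchall()
        for req_id, dataset in rows:
            submit_to_grid(dataset)
            db.execute("UPDATE requests SET state = 'submitted' "
                       "WHERE id = ?", (req_id,))
        db.commit()

    db = sqlite3.connect("prodagent.db")
    db.execute("CREATE TABLE IF NOT EXISTS requests "
               "(id INTEGER PRIMARY KEY, dataset TEXT, state TEXT)")
    while True:                 # each daemon owns one step of the workflow
        poll_and_submit(db)
        time.sleep(60)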
CMS DataOps Production Team
Resource Utilization for Production
• Summer 2007 :
  - <4,700 slots> utilized on average
  - including 32 T1s and T2s; 75% of the total then available
• Summer 2009 :
  - <9-10,000 slots> utilized on average
  - including 53 T2s and T3s; 60% of the total T2 capacity then available, the rest of the T2 resources being used by analysis

Note : besides the scale, the other important performance metric is to reach sustained site occupancy.

[Plot: production job slots vs time, rising from ~5,000 to ~10,000 slots]
Summer 09 Production at RU/UA T2s
• So far around 33M events have been produced at the RU/UA Tier-2 sites
• This was possible thanks to a strong effort by the site coordinators (V. Gavrilov, E. Tikhonenko, O. Kodolova) and the local site admins to satisfy the CMS Site Availability and Readiness requirements
• Note : the total RU/UA job-slot pledge for CMS is 807 (June 09)
Primary Reconstruction at Tier-0

[Plot: Tier-0 jobs (running, pending); CRAFT08 : 270 M events, CRAFT09 : 320 M events]

• The CMS T0 is mainly operated from the CMS Centre (CERN) and the FNAL ROC
• Despite the lack of collision data, CMS was able to commission the T0 workflows, based on cosmic-ray data taking at 300 Hz (the chaining is sketched below) :
  - Repacking : reformatting raw data and splitting it into primary datasets
  - Prompt Reconstruction : first pass, with a few days' turnaround
  - Automated subscriptions to the Data Management system
  - Alignment and calibration data skimming as input to the CAF
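How these steps chain together can be shown as a toy pipeline; all function names and the event handling below are illustrative, not the actual Tier-0 components:

    # Toy sketch of the Tier-0 workflow chain described above.
    def repack(raw_events):
        """Reformat raw data and split it into primary datasets."""
        return {"Cosmics": raw_events[::2], "MinimumBias": raw_events[1::2]}

    def prompt_reco(events):
        """First-pass reconstruction, a few days behind data taking."""
        return [f"RECO({e})" for e in events]

    def alca_skim(reco):
        """Select alignment/calibration events as input to the CAF."""
        return reco[:1]

    raw = [f"evt{i}" for i in range(6)]
    for name, pd in repack(raw).items():
        reco = prompt_reco(pd)
        print(name, "->", len(reco), "RECO events, CAF skim:", alca_skim(reco))
        # ...followed by an automated subscription of the dataset to DM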
Primary Data Archival at Tier-0
• Transfer RAW data from the detector and archive it on tape at CERN
• Routinely held archival performance above the nominal rate of 500 MB/s
• Until LHC startup, the T0 performance will be tested with simulated collision-like data

[Plot: archival rate during CCRC08 Phase 1 and STEP09, peaking at 1.4 GB/s]
CMS Tier-0 Operations team
Data Reprocessing, Data Serving at T1s
The CMS Data Distribution Model puts a heavy load on the Tier-1s :
• Disk and tape storage capacity
  - custodial copy of recorded data (a fraction) & archival of simulated data
  - full set of Analysis Object Data - AOD (a subset suffices for 90% of analyses)
  - non-custodial copies of RECO/AOD encouraged
• Processing capacity
  - Re-Reconstruction
  - skimming of data to reduce sample sizes
• Tape I/O bandwidth
  - reading many times, to serve data to the T2s for analysis
  - writing for storage / reading for Re-Reconstruction if not on disk
• Full-mesh (T1 - T2) transfer strategy
• Pre-staging strategy?
• Other strong requirements
  - 24/7 coverage and high (98%) availability (WLCG)
• CMS Data Operations at Tier-1s
  - central operations and specialized workflows only, no user access
Routine Re-Reconstruction operations
• CMS routinely reprocesses simulated data (CSA08, ...) and cosmic data (CRAFT08, 09, ...) at all Tier-1s
• CPU performance in line with the CMS pledges
• Using both gLite/WMS (bulk) and glidein submissions
• Regular data "backfill" jobs to test site reliability
• Example below : transfer / Re-Reconstruction of CRAFT08 data

[Plot: CRAFT08 transfer and reprocessing volume, ~500 TB]
CMS DataOps reprocessing team
Tier-1 Pre-Staging tests in STEP09
• Rolling Re-Reconstruction exercise : pre-stage / process / purge data at all T1s over 2 weeks (loop sketched below)
• Processing paused at 1 T1 due to a staging backlog
• The tape systems at almost all T1s demonstrated pre-staging capability above the required rate; to be re-done in Fall 09 where needed
• Re-Reconstruction CPU efficiency is significantly better with pre-staging; CMS is considering an integrated pre-staging solution
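The rolling exercise is essentially a pipelined loop over data blocks, pre-staging one block ahead of the one being processed; the sketch below shows the idea (block names and the one-block look-ahead are illustrative):

    # Sketch of the rolling pre-stage / process / purge exercise.
    from collections import deque

    blocks = deque(f"block_{i}" for i in range(5))

    def prestage(b): print(f"pre-staging {b} from tape to disk")
    def process(b):  print(f"re-reconstructing {b}")
    def purge(b):    print(f"purging {b} from the disk buffer")

    staged = None
    while blocks or staged:
        nxt = blocks.popleft() if blocks else None
        if nxt:
            prestage(nxt)      # overlaps with processing of the staged block
        if staged:
            process(staged)    # CPUs never wait on tape recalls
            purge(staged)
        staged = nxt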
Tier-1 Data Serving tests in STEP09
• T1 → T2 data-serving exercise : transfer from T1 tapes to the T2s, i.e. put load on T1 tape recall, not on the WAN rate to the T2s
• Latency performance : overall result OK
• Isolated weaknesses at some T1s during the test, to be investigated before LHC startup
Analysis Model in CMS

[Diagram: Tier-0 / CERN CAF, Tier-1s, Tier-2s and Tier-3s in a globally distributed hierarchy]

• Analysis in CMS is performed on a globally distributed collection of computing facilities
• Several Tier-1s have separately accounted analysis resources
• Tier-2 computing facilities are devoted half to simulation and half to user analysis; they are the primary resource for analysis
• Tier-3 computing facilities are entirely controlled by the providing institution and used for analysis
Analysis tools and users on the GRID

[Diagram: CRAB workflow. The user sends job specifications to a CRAB server; CRAB accesses official data at the global Tier-2s and user data at the local Tier-2. Job status and small data products are returned to user space; output eventually goes to temporary space, or via remote stage-out to files registered in Data Management / the local dCache. Annotated pain points : continued work on scaling and submission; Grid failures; work on data integrity; problems with reliability and scale of stage-out.]
CMS Remote Analysis Builder (CRAB)
• Since 2008 : >1,000 distinct CMS users
• 100,000 analysis jobs/day within reach
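For reference, a CRAB (v2-era) analysis task is steered by a small crab.cfg file; the sketch below is indicative only: the dataset path, CMSSW configuration and site name are placeholders, and individual keys vary between CRAB versions.

    [CRAB]
    jobtype   = cmssw
    scheduler = glite                 # or a glidein-based scheduler

    [CMSSW]
    datasetpath            = /SomePrimaryDataset/SomeEra-RECO/RECO   # placeholder
    pset                   = analysis_cfg.py   # the user's CMSSW configuration
    total_number_of_events = -1                # -1 = run over the whole dataset
    events_per_job         = 50000

    [USER]
    return_data     = 0               # only small products come back with the job
    copy_data       = 1               # large output is staged out to a site
    storage_element = T2_XX_Example   # placeholder site name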
CMS Tier-2 Disk Space management
In CMS, jobs go to the data, so data are distributed broadly. CMS shares the management of space across analysis groups :
• this ensures the people doing the work have some control

200 TB of disk space at a nominal Tier-2 (summed below) :
• 20 x 1 TB for storing locally produced user files and making them grid-accessible
• 30 TB for use by the local group
• 2-3 x 30 TB reserved for the CMS PH analysis groups
• 30 TB for centrally managed Analysis Operations; expected to be able to host most RECO data in the first year
• 20 TB for DataOps as an MC staging buffer
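Summing these allocations shows how the nominal 200 TB is carved up (a quick check; the "2-3" group allocations are taken at 3):

    # Quick check of the nominal Tier-2 disk partitioning above.
    user_space   = 20 * 1     # 20 users x 1 TB, grid-accessible
    local_group  = 30         # local community
    ph_groups    = 3 * 30     # 2-3 physics-group allocations of 30 TB
    analysis_ops = 30         # centrally managed Analysis Operations
    mc_staging   = 20         # DataOps MC staging buffer
    total = user_space + local_group + ph_groups + analysis_ops + mc_staging
    print(total)              # 190 TB of the nominal 200 TB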
Analysis Group associations to T2s
- Currently 17 physics-analysis / detector-performance groups
- The association has improved communication with the sites and user involvement
• Example : current associations of the RU/UA T2s to PH groups : exotica, heavy ion, jetmet, muon
Group Space, Private Skims, Priority Access @ Tier-2s
• Many groups do not yet use the space assigned to them
• CMS analysis groups and users are producing "private" skims
  - data written to the private /store/user area
• New tool to "promote" private skims to the full collaboration
  - validate skims and "migrate" them to the public /store/results area
• Priority queues
  - 2 people per analysis group get access to 25% of the pledged CPU at the associated T2s

[Plot: % of used Group Space vs Group Space pledged at T2s]
The CMS Analysis Operation team
Job Slot utilization for Analysis
• Current CMS total CPU pledge at the T2s : 17k job slots
• Analysis pledge : 50%
• Utilization in August was reasonable, but CMS needs to move into a sustained analysis mode

[Plot: T2 job-slot usage, Analysis vs Production]
Analysis Success Rates
• Need to understand the differences in success rates
  - in particular the application failures
• And convince the majority of CMS physicists to favor distributed analysis

[Plot: analysis job failure breakdown. Annotations : mainly read failures, some Grid failures; application failures (stage-out, ...) not well understood]
CMS Site Commissioning
• Objectives
  - test all functionality CMS requires of each site, in a continuous mode
  - determine whether the site is usable and stable
• What is tested?
  - job submission
  - local site configuration and CMS software installation
  - data access and data stage-out from batch node to storage
  - "fake" analysis jobs
  - quality of data transfers across sites
• What is measured?
  - site availability : fraction of time all functional tests at a site succeed
  - Job Robot efficiency : fraction of successful "fake" analysis jobs
  - link quality : number of data-transfer links of acceptable quality
• What is calculated?
  - a global estimator expressing how good and stable a site is (see the sketch below)
[Screenshots: site summary table, statistics and plots, site history, site ranking]
The CMS Site Commissioning team
How can this be used?
• To measure global trends in the evolution of site reliability
  - impressive results over the last year
• Weekly reviews of site readiness
• Production teams can better plan where to run productions
• Automatic mapping into the production and analysis tools?
CMS Software Deployment
• Basic strategy : use RPM (with apt-get) in the shared CMS SW area (sketched below)
• Deployment of CMS SW to 90% of the sites within 3 hours

The CMSSW Deployment teams
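In practice this amounts to bootstrapping an apt/RPM tree inside the shared software area and installing releases from it; the Python sketch below drives such a sequence, but the script name, flags, architecture string and release are era-based assumptions, not verified deployment commands:

    # Sketch of an RPM/apt-get based CMSSW deployment (commands illustrative).
    import subprocess

    SW_AREA = "/opt/cms"              # the site's $VO_CMS_SW_DIR
    ARCH    = "slc4_ia32_gcc345"      # example architecture string
    RELEASE = "CMSSW_3_1_2"           # example release

    def run(cmd):
        print("+", cmd)
        subprocess.run(cmd, shell=True, check=True)

    run(f"sh bootstrap.sh setup -path {SW_AREA} -arch {ARCH}")  # one-time bootstrap
    run("apt-get update")                                       # refresh metadata
    run(f"apt-get -y install cms+cmssw+{RELEASE}")              # install release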
CMS Centers and Computing Shifts
• CMS Centre at CERN : monitoring, computing operations, analysis
• CMS Experiment Control Room
• CMS Remote Operations Centre at Fermilab
• CMS runs computing shifts 24/7, and encourages remote shifts
• Main task : monitor, and alarm CMS sites & computing experts
Conclusions
• In the last 5 years CMS has gained very large expertise in GRID computing operations and is in a good state of readiness
• Dedicated data challenges and cosmic data taking were very valuable, but are not quite "the real thing", which will bring :
  - sustained data processing
  - stronger demands on site readiness
  - higher demands on data accessibility by physicists
• Remaining program until LHC startup :
  - Tier-0 : repeat scale tests using simulated collision-like events
  - Tier-1 : repeat the STEP09 tape and processing exercises where needed, and add automation ("WMBS") to T1 processing
  - Tier-2 : support and improve distributed-analysis efficiency
  - review Critical Services coverage
  - fine-tune computing shift procedures
  - make sure the pledged site resources for 2010 become available
Backup
Scientific Motivation
• 20 collisions per bunch crossing (~1 PB/s)
• L1 Trigger output : 100 kHz
• After the High-Level Trigger : O(300 Hz)
• Storage rate : O(500 MB/s)