CMS Experience on the GRID

Peter Kreuzer
for the CMS Computing project
RWTH Aachen / CERN

NEC2009 Conference, Varna / Bulgaria, Sep 10, 2009
No need to provide the scientific motivation here, since Sergio Cittolin and Livio Mapelli did this very well in the
introductory session of this conference!
Goal of the CMS Computing project
• Provide resources and services to store/serve O(10) PB of data per year
• Provide access to the most interesting physics events to O(1500) CMS collaborators located in 200 institutions around the world
• Minimize constraints due to user localisation and resource variety
• Decentralize control and costs of the computing infrastructure
• Share resources with other LHC experiments
Find the answer on the Worldwide LCG GRID
The way CMS uses the GRID

[Data-flow diagram on the WLCG Computing Grid Infrastructure. Recoverable labels : event display with up to 200 tracks/event, after reconstruction/selection; CMS detector / CAF 450 MB/s (300 Hz); Tier-0 → Tier-1 30-300 MB/s (aggregate 800 MB/s); Tier-1 → Tier-2 50-500 MB/s; Tier-2 → Tier-1 10-20 MB/s; a 100 MB/s link near the Tier-3s; job rates ~150k jobs/day and ~50k jobs/day.]

• Tier-0 : Prompt Reconstruction; archival of RAW and first RECO data; calibration streams (CAF); data distribution to the Tier-1s
• 7+1 Tier-1s : Re-Reconstruction; skimming; custodial copy of RAW and RECO; served copy of RECO; archival of simulation; data distribution to the Tier-2s
• ~50 Tier-2s : primary resources for physics analysis and detector studies by users; MC simulation, uploaded to the Tier-1s
• Tier-3s attached to the Tier-2s

CMS in total : 1 Tier-0 at CERN (Geneva), 7+1 Tier-1s on 3 continents, ~50 Tier-2s on 4 continents
Computing Resources: Setting the scale
    Data Tier        RAW   RECO   SIMRAW   SIMRECO   AOD
    <Size> [MB]      1.5   0.5    2.0      0.5       0.1
    CPU [HS06-sec]    -    100    1000
• Run 2009-10 (Oct-Mar / Apr-Sep) : 300 Hz x 6 x 10^6 sec, i.e. 2.2 x 10^9 events
• Size & CPU per event : see table above
• CMS datasets :
  - O(10) Primary Datasets (RAW, RECO, AOD)
  - O(30) Secondary Datasets (RECO, AOD)
  - 1.5 times more simulated data (SIMRAW, SIMRECO, AOD)
• During run 2009-10, CMS plans 5 full re-reconstructions, needing :
  - 400 kHS06 CPU
  - 26 PB disk
  - 38 PB tape
• Resource ratio CERN / (T1+T2) : CPU 25%, disk 15%

[Chart: disk [PB] and CPU [kHS06] per tier]

HEP-Spec2006 (HS06) :
• Modern CPU ~ 8 HS06 / core
• 100 HS06-sec ~ 12.5 sec/event
• 100 kHS06 ~ 12,500 cores
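As a back-of-the-envelope cross-check of these numbers, the short Python sketch below recomputes them from the per-event figures in the table; the suggestion that the quoted 2.2 x 10^9 events include overlap between primary datasets is an assumption of mine, not the slide's.

    # Cross-check of the 2009-10 resource numbers (sketch only).
    HS06_PER_CORE = 8.0                      # "Modern CPU ~ 8 HS06 / core"

    rate_hz  = 300                           # HLT output rate
    live_sec = 6.0e6                         # live seconds in the run plan
    events   = rate_hz * live_sec            # 1.8e9 recorded events
    # (the quoted 2.2e9 is larger, presumably from primary-dataset overlap)

    raw_mb, reco_mb, aod_mb = 1.5, 0.5, 0.1  # <Size>/event from the table
    raw_pb  = events * raw_mb / 1e9          # ~2.7 PB of RAW
    reco_pb = events * reco_mb / 1e9         # ~0.9 PB per reconstruction pass

    sec_per_event = 100.0 / HS06_PER_CORE    # 100 HS06-sec -> ~12.5 s/event
    cores         = 400e3 / HS06_PER_CORE    # 400 kHS06 -> ~50,000 cores

    print(f"{events:.1e} events, {raw_pb:.1f} PB RAW, "
          f"{sec_per_event:.1f} s/event, {cores:,.0f} cores")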
And not to forget the obvious... GRID Middleware Services :
- Storage Elements
- Computing Elements
- Workload Management System
- Local File Catalog
- Information System
- Virtual Organisation Management Service
- Inter-operability between GRIDs (EGEE, OSG, NorduGrid, ...)
Networking
Site Specificities, e.g. Storage/Batch systems at the CMS Tier-1s :

    Site      FNAL             RAL           CCIN2P3       PIC              ASGC          INFN           FZK
    Storage   dCache/Enstore   Castor        dCache/HPSS   dCache/Enstore   Castor        Castor+StoRM   dCache/TSM
    Batch     Condor           Torque/Maui   BQS           Torque/Maui      Torque/Maui   LSF            PBSPro
Russian/Ukrainian CMS T2s, CERN-T1
• 2006 : agreement between the WLCG and CMS managements on the association of the RU/UA T2 sites with a partial T1 at CERN
• 2008 : 30 TB disk buffer in front of 450 TB of tape space deployed at CERN, to serve AODs to and archive MC data from the RU/UA T2s
• RU/UA T2s provide WLCG/CMS service work, e.g. transfer-link commissioning and Data Management support
• CMS has established a strong collaboration and weekly operations meetings with the RU/UA T2 community; a lot of progress over the past year!
• Active RU/UA CMS sites as of today :
  T2_RU_IHEP (Protvino)             T2_RU_PNPI (Petersburg Nucl. Ph. Inst.)
  T2_RU_RRC_KI (Kurchatov Inst.)    T2_RU_INR (Inst. for Nucl. Res. RAS)
  T2_RU_JINR (Dubna)                T2_RU_ITEP (R.F. Inst. for Th. and Exp. Ph.)
  T2_RU_SINP (MSU Develop. site)    T2_UA_KIPT (NSC Kharkov Inst. of Ph. & Tech.)
• And many participating physics communities, like Bulgaria!
In the rest of this talk I will cover :
• Snapshots of achieved CMS computing milestones :
  - Data Distribution
  - Data Processing
  - Distributed Analysis
• A few more aspects of CMS Computing Operations on the GRID :
  - Site Availability & Readiness monitoring
  - SW deployment
  - Computing shift procedures
Acronyms of CMS Computing activities

    Challenge          What is it ?                                        Type of Data   Parallel with other LHC VOs ?
    DC04               Data Challenge 2004                                 MC             No
    SC2,3,4            Service Challenges 2004,05,06                       MC             No
    CSA06,07,08        Computing, Software & Analysis Challenges 2006-08   MC             No
    Summer07,09 Prod   MC Production 2007,09                               MC             No
    CCRC08             Common Computing Readiness Challenge 2008           MC             Yes
    CRUZET08,09        Cosmic Run at 0 Tesla 2008,09                       Cosmic         No
    CRAFT08,09         Cosmic Run at 4 Tesla 2008,09                       Cosmic         No
    STEP09             Scale Testing for the Experiment Program in 2009    MC             Yes
    MWGR               Mid-Week Global Run                                 Cosmic         No
Data Transfers
• PhEDEx : a distributed set of Perl agents on top of a large ORACLE DB
• The CMS data distribution strategy relies on full-mesh Data Management :
  - links are individually "commissioned"
  - target rates depend on the size of the sites
  - minimum requirements per site (policy sketched below) :
    - T1 : 4 links to other T1s / 20 downlinks to T2s
    - T2 : 2 uplinks to / 4 downlinks from T1s
• CMS has >600 commissioned links!
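The minimum-mesh requirement can be stated as a simple policy check; the sketch below is illustrative (the link-class names are invented, and this is not how the commissioning machinery is actually implemented):

    # Sketch of the minimum link-commissioning policy stated above.
    MIN_LINKS = {
        "T1": {"links_to_T1": 4, "downlinks_to_T2": 20},
        "T2": {"uplinks_to_T1": 2, "downlinks_from_T1": 4},
    }

    def meets_policy(tier, commissioned):
        """commissioned: dict mapping link class -> number of good links."""
        return all(commissioned.get(cls, 0) >= n
                   for cls, n in MIN_LINKS[tier].items())

    # Example: a Tier-2 with 3 commissioned uplinks and 5 downlinks.
    print(meets_policy("T2", {"uplinks_to_T1": 3, "downlinks_from_T1": 5}))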
• Historical view : transfer volume has grown by 3 orders of magnitude since 2004
• Main milestone (CCRC08) achieved in parallel with the other LHC VOs
• Reached the burst level expected for the first year of LHC running

[Plot: daily transfer volume, reaching 100-200 TB/day]
CMS Data Transfer team
Data Transfer challenges
• T0 → T1
  - CMS regularly above the design rate in the CCRC08 multi-VO challenge
  - STEP09 included a tape-writing step at the T1s : the observed T0 → T1 transfer latency was impacted by the state of the T1 tape systems (busy, overloaded, ...)
• T1 → T1
  - STEP09 : simultaneous transfer of 50 TB of AOD from 1 T1 to all other T1s; average 970 MB/s over 3 days, no big problems encountered
• T1 → T2
  - see "Tier-1 Data Serving tests in STEP09" below

[Plot: Tier-0 → Tier-1 throughput (MB/s), multi-VO, 05/01/2008 - 06/02/2008]
Data Transfers from/to RU/UA T2s
• T2 → T1 transfers of produced MC data : 32.7 TB over the past year, from 6 different T2s to their custodial T1s
• Transfer quality to/from the RU/UA T2s has improved lately
• T1 → T2 downloads : 73 TB to the RU/UA T2s over the past year
Production of simulated data
• A good measure of computing performance on the GRID (a CMS activity since 2004)
• ProdAgent : modular Python daemons on top of a MySQL DB, interacting with the CMS Data Management systems (pattern sketched below)
• CMS has a centralized production strategy handled by the Data Operations teams :
  - 2006-07 : 6 "regional teams" + a Production Manager
  - 2008-09 : 1 team (6 people) managing submissions in defined "T1 regions"
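ProdAgent's "modular daemons around a shared database" pattern can be condensed into a few lines; the schema, state names and polling interval below are invented for illustration (and sqlite stands in for MySQL):

    # Illustrative sketch of a ProdAgent-style polling daemon.
    import sqlite3, time

    def submit_to_grid(dataset):
        print(f"submitting production jobs for {dataset}")  # hand-off to the WMS

    def poll_and_submit(db):
        rows = db.execute("SELECT id, dataset FROM requests "
                          "WHERE state = 'queued'").fetchall()
        for req_id, dataset in rows:
            submit_to_grid(dataset)
            db.execute("UPDATE requests SET state = 'submitted' "
                       "WHERE id = ?", (req_id,))
        db.commit()

    db = sqlite3.connect("prodagent.db")
    db.execute("CREATE TABLE IF NOT EXISTS requests "
               "(id INTEGER PRIMARY KEY, dataset TEXT, state TEXT)")
    while True:                 # each daemon owns one step of the workflow
        poll_and_submit(db)
        time.sleep(60)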
CMS DataOps Production Team
Resource Utilization for Production
• Summer 2007 :
  - <4,700 slots> utilized on average
  - including 32 T1s and T2s; 75% of the total then available
• Summer 2009 :
  - <9-10,000 slots> utilized on average
  - including 53 T2s and T3s; 60% of the total T2 capacity then available, the rest of the T2 resources being used by analysis

Note : besides the scale, the other important performance metric is to reach sustained site occupancy.

[Plot: production job slots vs time, rising from ~5,000 to ~10,000 slots]
Summer 09 Production at RU/UA T2s
• So far around 33M events have been produced at the RU/UA Tier-2 sites
• This was possible thanks to a strong effort by the site coordinators (V. Gavrilov, E. Tikhonenko, O. Kodolova) and the local site admins to satisfy the CMS Site Availability and Readiness requirements
• Note : the total RU/UA job-slot pledge for CMS is 807 (June 09)
Primary Reconstruction at Tier-0

[Plot: Tier-0 jobs (running, pending); CRAFT08 : 270 M events, CRAFT09 : 320 M events]

• The CMS T0 is mainly operated from the CMS Centre (CERN) and the FNAL ROC
• Despite the lack of collision data, CMS was able to commission the T0 workflows, based on cosmic-ray data taking at 300 Hz (the chaining is sketched below) :
  - Repacking : reformatting raw data and splitting it into primary datasets
  - Prompt Reconstruction : first pass, with a few days' turnaround
  - Automated subscriptions to the Data Management system
  - Alignment and calibration data skimming as input to the CAF
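How these steps chain together can be shown as a toy pipeline; all function names and the event handling below are illustrative, not the actual Tier-0 components:

    # Toy sketch of the Tier-0 workflow chain described above.
    def repack(raw_events):
        """Reformat raw data and split it into primary datasets."""
        return {"Cosmics": raw_events[::2], "MinimumBias": raw_events[1::2]}

    def prompt_reco(events):
        """First-pass reconstruction, a few days behind data taking."""
        return [f"RECO({e})" for e in events]

    def alca_skim(reco):
        """Select alignment/calibration events as input to the CAF."""
        return reco[:1]

    raw = [f"evt{i}" for i in range(6)]
    for name, pd in repack(raw).items():
        reco = prompt_reco(pd)
        print(name, "->", len(reco), "RECO events, CAF skim:", alca_skim(reco))
        # ...followed by an automated subscription of the dataset to DM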
Primary Data Archival at Tier-0
• Transfer RAW data from the detector and archive it on tape at CERN
• Routinely held archival performance above the nominal rate of 500 MB/s
• Until LHC startup, the T0 performance will be tested with simulated collision-like data

[Plot: archival rate during CCRC08 Phase 1 and STEP09, peaking at 1.4 GB/s]
CMS Tier-0 Operations team
Data Reprocessing, Data Serving at T1s
The CMS Data Distribution Model puts a heavy load on the Tier-1s :
• Disk and tape storage capacity
  - custodial copy of recorded data (a fraction) & archival of simulated data
  - full set of Analysis Object Data - AOD (a subset suffices for 90% of analyses)
  - non-custodial copies of RECO/AOD encouraged
• Processing capacity
  - Re-Reconstruction
  - skimming of data to reduce sample sizes
• Tape I/O bandwidth
  - reading many times, to serve data to the T2s for analysis
  - writing for storage / reading for Re-Reconstruction if not on disk
• Full-mesh (T1 - T2) transfer strategy
• Pre-staging strategy?
• Other strong requirements
  - 24/7 coverage and high (98%) availability (WLCG)
• CMS Data Operations at Tier-1s
  - central operations and specialized workflows only, no user access
Routine Re-Reconstruction operations
• CMS routinely reprocesses simulated data (CSA08, ...) and cosmic data (CRAFT08, 09, ...) at all Tier-1s
• CPU performance in line with the CMS pledges
• Using both gLite/WMS (bulk) and glidein submissions
• Regular data "backfill" jobs to test site reliability
• Example below : transfer / Re-Reconstruction of CRAFT08 data

[Plot: CRAFT08 transfer and reprocessing volume, ~500 TB]
CMS DataOps reprocessing team
Tier-1 Pre-Staging tests in STEP09
• Rolling Re-Reconstruction exercise : pre-stage / process / purge data at all T1s over 2 weeks (loop sketched below)
• Processing paused at 1 T1 due to a staging backlog
• The tape systems at almost all T1s demonstrated pre-staging capability above the required rate; to be re-done in Fall 09 where needed
• Re-Reconstruction CPU efficiency is significantly better with pre-staging; CMS is considering an integrated pre-staging solution
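The rolling exercise is essentially a pipelined loop over data blocks, pre-staging one block ahead of the one being processed; the sketch below shows the idea (block names and the one-block look-ahead are illustrative):

    # Sketch of the rolling pre-stage / process / purge exercise.
    from collections import deque

    blocks = deque(f"block_{i}" for i in range(5))

    def prestage(b): print(f"pre-staging {b} from tape to disk")
    def process(b):  print(f"re-reconstructing {b}")
    def purge(b):    print(f"purging {b} from the disk buffer")

    staged = None
    while blocks or staged:
        nxt = blocks.popleft() if blocks else None
        if nxt:
            prestage(nxt)      # overlaps with processing of the staged block
        if staged:
            process(staged)    # CPUs never wait on tape recalls
            purge(staged)
        staged = nxt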
Tier-1 Data Serving tests in STEP09
• T1 → T2 data-serving exercise : transfer from T1 tapes to the T2s, i.e. put load on T1 tape recall, not on the WAN rate to the T2s
• Latency performance : overall result OK
• Isolated weaknesses at some T1s during the test, to be investigated before LHC startup
Analysis Model in CMS

[Diagram: Tier-0 / CERN CAF, Tier-1s, Tier-2s and Tier-3s in a globally distributed hierarchy]

• Analysis in CMS is performed on a globally distributed collection of computing facilities
• Several Tier-1s have separately accounted analysis resources
• Tier-2 computing facilities are devoted half to simulation and half to user analysis; they are the primary resource for analysis
• Tier-3 computing facilities are entirely controlled by the providing institution and used for analysis
Analysis tools and users on the GRID

[Diagram: CRAB workflow. The user sends job specifications to a CRAB server; CRAB accesses official data at the global Tier-2s and user data at the local Tier-2. Job status and small data products are returned to user space; output eventually goes to temporary space, or via remote stage-out to files registered in Data Management / the local dCache. Annotated pain points : continued work on scaling and submission; Grid failures; work on data integrity; problems with reliability and scale of stage-out.]
CMS Remote Analysis Builder (CRAB)
• Since 2008 : >1,000 distinct CMS users
• 100,000 analysis jobs/day within reach
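For reference, a CRAB (v2-era) analysis task is steered by a small crab.cfg file; the sketch below is indicative only: the dataset path, CMSSW configuration and site name are placeholders, and individual keys vary between CRAB versions.

    [CRAB]
    jobtype   = cmssw
    scheduler = glite                 # or a glidein-based scheduler

    [CMSSW]
    datasetpath            = /SomePrimaryDataset/SomeEra-RECO/RECO   # placeholder
    pset                   = analysis_cfg.py   # the user's CMSSW configuration
    total_number_of_events = -1                # -1 = run over the whole dataset
    events_per_job         = 50000

    [USER]
    return_data     = 0               # only small products come back with the job
    copy_data       = 1               # large output is staged out to a site
    storage_element = T2_XX_Example   # placeholder site name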
CMS Tier-2 Disk Space management
In CMS, jobs go to the data, so data are distributed broadly. CMS shares the management of space across analysis groups :
• this ensures the people doing the work have some control

200 TB of disk space at a nominal Tier-2 (summed below) :
• 20 x 1 TB for storing locally produced user files and making them grid-accessible
• 30 TB for use by the local group
• 2-3 x 30 TB reserved for the CMS PH analysis groups
• 30 TB for centrally managed Analysis Operations; expected to be able to host most RECO data in the first year
• 20 TB for DataOps as an MC staging buffer
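Summing these allocations shows how the nominal 200 TB is carved up (a quick check; the "2-3" group allocations are taken at 3):

    # Quick check of the nominal Tier-2 disk partitioning above.
    user_space   = 20 * 1     # 20 users x 1 TB, grid-accessible
    local_group  = 30         # local community
    ph_groups    = 3 * 30     # 2-3 physics-group allocations of 30 TB
    analysis_ops = 30         # centrally managed Analysis Operations
    mc_staging   = 20         # DataOps MC staging buffer
    total = user_space + local_group + ph_groups + analysis_ops + mc_staging
    print(total)              # 190 TB of the nominal 200 TB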
Analysis Group associations to T2s
- Currently 17 physics-analysis / detector-performance groups
- The association has improved communication with the sites and user involvement
• Example : current associations of the RU/UA T2s to PH groups : exotica, heavy ion, jetmet, muon
Group Space, Private Skims, Priority Access @ Tier-2s
• Many groups do not yet use the space assigned to them
• CMS analysis groups and users are producing "private" skims
  - data written to the private /store/user area
• New tool to "promote" private skims to the full collaboration
  - validate skims and "migrate" them to the public /store/results area
• Priority queues
  - 2 people per analysis group get access to 25% of the pledged CPU at the associated T2s

[Plot: % of used Group Space vs Group Space pledged at T2s]
The CMS Analysis Operation team
Job Slot utilization for Analysis
• Current CMS total CPU pledge at the T2s : 17k job slots
• Analysis pledge : 50%
• Utilization in August was reasonable, but CMS needs to move into a sustained analysis mode

[Plot: T2 job-slot usage, Analysis vs Production]
Analysis Success Rates
• Need to understand the differences in success rates
  - in particular the application failures
• And convince the majority of CMS physicists to favor distributed analysis

[Plot: analysis job failure breakdown. Annotations : mainly read failures, some Grid failures; application failures (stage-out, ...) not well understood]
CMS Site Commissioning
• Objectives
  - test all functionality CMS requires of each site, in a continuous mode
  - determine whether the site is usable and stable
• What is tested?
  - job submission
  - local site configuration and CMS software installation
  - data access and data stage-out from batch node to storage
  - "fake" analysis jobs
  - quality of data transfers across sites
• What is measured?
  - site availability : fraction of time all functional tests at a site succeed
  - Job Robot efficiency : fraction of successful "fake" analysis jobs
  - link quality : number of data-transfer links of acceptable quality
• What is calculated?
  - a global estimator expressing how good and stable a site is (see the sketch below)
[Screenshots: site summary table, statistics and plots, site history, site ranking]
The CMS Site Commissioning team
How can this be used?
• To measure global trends in the evolution of site reliability
  - impressive results over the last year
• Weekly reviews of site readiness
• Production teams can better plan where to run productions
• Automatic mapping into the production and analysis tools?
CMS Software Deployment
• Basic strategy : use RPM (with apt-get) in the shared CMS SW area (sketched below)
• Deployment of CMS SW to 90% of the sites within 3 hours

The CMSSW Deployment teams
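In practice this amounts to bootstrapping an apt/RPM tree inside the shared software area and installing releases from it; the Python sketch below drives such a sequence, but the script name, flags, architecture string and release are era-based assumptions, not verified deployment commands:

    # Sketch of an RPM/apt-get based CMSSW deployment (commands illustrative).
    import subprocess

    SW_AREA = "/opt/cms"              # the site's $VO_CMS_SW_DIR
    ARCH    = "slc4_ia32_gcc345"      # example architecture string
    RELEASE = "CMSSW_3_1_2"           # example release

    def run(cmd):
        print("+", cmd)
        subprocess.run(cmd, shell=True, check=True)

    run(f"sh bootstrap.sh setup -path {SW_AREA} -arch {ARCH}")  # one-time bootstrap
    run("apt-get update")                                       # refresh metadata
    run(f"apt-get -y install cms+cmssw+{RELEASE}")              # install release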
CMS Centers and Computing Shifts
• CMS Centre at CERN : monitoring, computing operations, analysis
• CMS Experiment Control Room
• CMS Remote Operations Centre at Fermilab
• CMS runs computing shifts 24/7, and encourages remote shifts
• Main task : monitor, and alarm CMS sites & computing experts
Conclusions
• In the last 5 years CMS has gained very large expertise in GRID computing operations and is in a good state of readiness
• Dedicated data challenges and cosmic data taking were very valuable, but are not quite "the real thing", which will bring :
  - sustained data processing
  - stronger demands on site readiness
  - higher demands on data accessibility by physicists
• Remaining program until LHC startup :
  - Tier-0 : repeat scale tests using simulated collision-like events
  - Tier-1 : repeat the STEP09 tape and processing exercises where needed, and add automation ("WMBS") to T1 processing
  - Tier-2 : support and improve distributed-analysis efficiency
  - review Critical Services coverage
  - fine-tune computing shift procedures
  - make sure the pledged site resources for 2010 become available
Backup
Scientific Motivation
• 20 collisions per bunch crossing (~1 PB/s)
• L1 Trigger output : 100 kHz
• After the High-Level Trigger : O(300 Hz)
• Storage rate : O(500 MB/s)