Large scale data flow in local and GRID environment
Viktor Kolosov (ITEP Moscow)
Ivan Korolko (ITEP Moscow)
Research objectives

Plans: Large scale data flow simulation in local and GRID environment.

Done: Large scale data flow optimization in a realistic DC environment (ALICE and LHCb)
• more interesting
• more useful (hopefully)
[diagram: main components]
ITEP LHC computer farm (1)
64 Pentium IV PC modules (01.01.2004)
A. Selivanov (ITEP-ALICE), head of the ITEP-LHC farm
BATCH nodes
ITEP LHC computer farm (2)
CPU: 64 x PIV 2.4 GHz (hyperthreading)
RAM: 1 GB per node
Disks: 80 GB per node
Mass storage: disk servers, 6 x 1.6 TB + 1 x 1.0 TB + 1 x 0.5 TB
Local network: 100 Mbit/s
Link to CERN: 2-3 Mbit/s
CPU allocation: 20 (LCG test) + 44 (DCs)
Monitoring available at http://egee.itep.ru
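A quick tally of the aggregate capacity these numbers imply (the figures are taken from the slide above; the arithmetic is only illustrative):

```python
# Back-of-the-envelope tally of the ITEP LHC farm capacity quoted above.
mass_storage_tb = 6 * 1.6 + 1 * 1.0 + 1 * 0.5  # disk servers
local_disk_tb = 64 * 80 / 1000                 # 80 GB on each of the 64 nodes
job_slots = 64 * 2                             # hyperthreading: 2 job slots per CPU

print(f"Mass storage: {mass_storage_tb:.1f} TB")  # 11.1 TB
print(f"Local disks:  {local_disk_tb:.1f} TB")    # ~5.1 TB
print(f"Job slots:    {job_slots}")               # 128
```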
ITEP LHC FARM usage in 2004
Main ITEP players in 2004 – ALICE and LHCb
ALICE DC

Goals
• Determine readiness of the off-line framework for data processing
• Validate the distributed computing model
• PDC’2004: 10% test of the final capacity
• PDC’04 physics: hard probes (jets, heavy flavours) & pp physics
Strategy
• Part 1: underlying (background) events (March-July)
  – distributed simulation
  – data transfer to CERN
• Part 2: signal events & test of CERN as data source (July-November)
  – distributed simulation, reconstruction, generation of ESD
• Part 3: distributed analysis
Tools
• AliEn – ALICE Environment for distributed computing
• AliEn–LCG interface
LHCb DC

Physics goals (170M events)
1. HLT studies
2. S/B studies: consolidate background estimates, background properties
Gather information for the LHCb computing TDR
● Robustness test of the LHCb software and production system
● Test of the LHCb distributed computing model
● Incorporation of the LCG application software
● Use of LCG as a substantial fraction of the production capacity
Strategy:
1. MC production (April-September)
2. Stripping (event preselection) – still going on
3. Analysis
Details

ALICE (AliEn):
• 1 job – 1 event
• Raw event size: 2 GB
• ESD size: 0.5-50 MB
• CPU time: 5-20 hours
• RAM usage: huge
• Store local copies
• Backup sent to CERN
• Massive data exchange with disk servers

LHCb (DIRAC):
• 1 job – 500 events
• Raw event size: ~1.3 MB
• DST size: 0.3-0.5 MB
• CPU time: 28-32 hours
• RAM usage: moderate
• Store local copies of DSTs
• DSTs and LOGs sent to CERN
• Frequent communication with central services
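To see what these job parameters mean for the 2-3 Mbit/s CERN link, here is a rough estimate of per-job output and time on the wire (sizes and link speed are from the slides above, mid-range values assumed; the calculation is only a sketch):

```python
# Rough estimate of per-job output volume and transfer time to CERN.
link_mbit_s = 2.5                   # assumed mid-range of the 2-3 Mbit/s link

# LHCb: 500 events/job, 0.3-0.5 MB per DST (0.4 MB assumed)
lhcb_job_mb = 500 * 0.4             # ~200 MB of DSTs per job
lhcb_hours = lhcb_job_mb * 8 / link_mbit_s / 3600

# ALICE: one 2 GB raw event per job, backed up to CERN
alice_hours = 2 * 8000 / link_mbit_s / 3600

print(f"LHCb:  ~{lhcb_job_mb:.0f} MB/job, ~{lhcb_hours:.1f} h on the wire")  # ~0.2 h
print(f"ALICE: ~2 GB/job, ~{alice_hours:.1f} h on the wire")                 # ~1.8 h
```

Even these modest per-job volumes add up across dozens of concurrently finishing jobs, which foreshadows the slow-transfer problem discussed below.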
Optimization

April – start of massive LHCb DC
• 1 job/CPU – everything OK
• switched on hyperthreading, 2 jobs/CPU – efficiency increased by 30-40% (see the toy model below)

May – start of massive ALICE DC
• bad interference with LHCb jobs, frequent crashes of NFS
• restricted the ALICE queue to 10 simultaneous jobs, optimized communication with the disk server

June-September – smooth running
• shared resources: LHCb – June/July, ALICE – August/September
• careful online monitoring of jobs (on top of the usual monitoring from the collaborations)
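A toy model of the hyperthreading gain mentioned above: if a single job keeps the CPU busy only part of the time, a second job can fill the I/O-wait gaps. The 73% utilisation below is an assumed figure for illustration, not a measurement:

```python
# Toy model: why 2 jobs per hyperthreaded CPU raise throughput.
cpu_busy = 0.73                        # assumed fraction of wall time one job computes

one_job = cpu_busy                     # relative throughput with 1 job/CPU
two_jobs = min(1.0, 2 * cpu_busy)      # the second job hides the I/O waits
print(f"gain: {two_jobs / one_job - 1:.0%}")  # ~37%, in line with the observed 30-40%
```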
Monitoring

Frequent power cuts in summer (4-5 times) – ~5% of jobs lost, all intermediate steps are lost (…)
→ provide a reserve power line and a more powerful UPS
Stalled jobs (~10%) – infinite loops in GEANT4 (LHCb), crashes of central services
→ wrote a simple check script to kill such jobs (a bug report is not sent…); a sketch follows below
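The check script itself is not shown in the slides; below is a minimal sketch of the idea, assuming a stalled job can be recognised by its log file no longer growing. The directory layout, PID-file convention and sampling interval are hypothetical:

```python
#!/usr/bin/env python
# Sketch of a stalled-job killer: a job whose log stops growing is assumed stuck
# (infinite loop in GEANT4, or hanging on a dead central service).
import os
import signal
import time

JOB_DIR = "/var/batch/jobs"   # hypothetical: one subdirectory per running job
INTERVAL = 1800               # re-sample after 30 minutes

def log_size(job):
    return os.path.getsize(os.path.join(JOB_DIR, job, "job.log"))

before = {job: log_size(job) for job in os.listdir(JOB_DIR)}
time.sleep(INTERVAL)
for job, size0 in before.items():
    try:
        if log_size(job) == size0:                       # no output for 30 min: stalled
            pid = int(open(os.path.join(JOB_DIR, job, "pid")).read())
            os.kill(pid, signal.SIGKILL)                 # killed; no bug report is sent
    except (OSError, ValueError):
        pass                                             # job finished or was cleaned up
```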
Slow data transfer to CERN – poor and restricted link to CERN, problems with CASTOR
→ automatic retry (sketched below)
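The automatic retry can be as simple as re-running the copy with exponential back-off until it succeeds; a sketch (the scp command is a placeholder, not necessarily the tool actually used):

```python
# Sketch: retry a file transfer to CERN until it succeeds, with exponential back-off.
import subprocess
import time

def transfer_with_retry(src, dest, max_attempts=10):
    delay = 60                                   # start with a one-minute pause
    for _ in range(max_attempts):
        copy_cmd = ["scp", src, dest]            # placeholder transfer command
        if subprocess.call(copy_cmd) == 0:
            return True                          # transfer succeeded
        time.sleep(delay)                        # link or CASTOR busy: wait and retry
        delay = min(delay * 2, 3600)             # back off, capped at one hour
    return False
```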
ALICE Statistics
LHCb Statistics
Site             Total Jobs   CPU Time (h)      Events   O.Data (GB)   % of Events
USA                      56           1408       32500            13         0.02%
Israel                   77           2493       64600            21         0.03%
Brasil                  247           4489      231355            83         0.12%
Switzerland             813          19826      726750           235         0.39%
Taiwan                  595           8332      757200           216         0.41%
Canada                 1148          21286     1204200           348         0.65%
Poland                 1418          24058     1224500           403         0.66%
Hungary                1817          31103     1999200           592         1.08%
France                 5888         135632     4997156          1967         2.69%
Netherlands            6408         131273     7811900          2246         4.21%
Russia                10059         255324     8999750          3388         4.85%
Spain                 13378         304433    13687450          4189         7.38%
Germany               17101         275037    17732655          6235         9.56%
Italy                 25626         618359    24836950          7763        13.39%
United Kingdom        46580         917874    47535055         14567        25.62%
CERN                  52940         960470    53708405         18948        28.95%
All Sites            184151        3711397   185549626         61214       100.00%
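The last column is each site's share of the total event count and can be recomputed directly from the Events column, e.g. for Russia:

```python
# Recompute one site's share of events from the table above (Russia as an example).
russia, total = 8999750, 185549626
print(f"Russia: {100 * russia / total:.2f}% of all events")   # 4.85%
```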
Summary
Quite visible participation in the ALICE and LHCb DCs:
• ALICE → ~5% contribution (ITEP part ~70%)
• LHCb → ~5% contribution (ITEP part ~70%)
• achieved with only 44 CPUs

Problems reported to colleagues in the collaborations

More attention to LCG now

Distributed analysis – a very different pattern of work load