Large scale data flow in local and GRID environment

Viktor Kolosov (ITEP Moscow), Ivan Korolko (ITEP Moscow)


Page 1: Large scale data flow in local and GRID environment

Large scale data flow in local and GRID environment

Viktor Kolosov (ITEP Moscow)

Ivan Korolko (ITEP Moscow)

Page 2: Large scale data flow in local and GRID environment

Research objectives

Plans: Large scale data flow simulation in local and GRID environment.

Done: Large scale data flow optimization in a realistic DC environment (ALICE and LHCb)
→ more interesting
→ more useful (hopefully)

Page 3: Large scale data flow in local and GRID environment

ITEP LHC computer farm (1)

Main components: 64 Pentium IV PC modules (01.01.2004)

A. Selivanov (ITEP-ALICE), head of the ITEP-LHC farm

Page 4: Large scale data flow in local and GRID environment

ITEP LHC computer farm (2)

BATCH nodes: 20 (LCG test) + 44 (DCs)
CPU: 64 × PIV 2.4 GHz (hyperthreading)
RAM: 1 GB per node
Disks: 80 GB per node

Mass storage
Disk servers: 6 × 1.6 TB + 1 × 1.0 TB + 1 × 0.5 TB

Network: 100 Mbit/s local, 2-3 Mbit/s link to CERN
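For scale, a back-of-envelope estimate (not from the slides) links this hardware to the LHCb job parameters quoted on the "Details" slide below, assuming the 44 DC nodes run continuously with two hyperthreaded job slots each:

```python
# Rough DC throughput estimate from the numbers quoted on these slides.
# Assumes full occupancy of the DC nodes; purely illustrative.

dc_nodes = 44           # batch nodes reserved for Data Challenges
slots_per_node = 2      # 2 jobs/CPU with hyperthreading (see "Optimization" slide)
events_per_job = 500    # LHCb DC job size (see "Details" slide)
hours_per_job = 30.0    # mid-range of the quoted 28-32 h CPU time

jobs_per_day = dc_nodes * slots_per_node * 24.0 / hours_per_job
events_per_day = jobs_per_day * events_per_job
print(f"~{jobs_per_day:.0f} jobs/day, ~{events_per_day:,.0f} events/day")
# -> roughly 70 jobs/day, i.e. ~35,000 simulated LHCb events/day at full load
```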

Page 5: Large scale data flow in local and GRID environment

Monitoring available at http://egee.itep.ru

ITEP LHC FARM usage in 2004

Main ITEP players in 2004 – ALICE and LHCb

Page 6: Large scale data flow in local and GRID environment

ALICE DC

Goals
• Determine readiness of the off-line framework for data processing
• Validate the distributed computing model
• PDC’2004: 10% test of the final capacity
• PDC’04 physics: hard probes (jets, heavy flavours) & pp physics

Strategy
• Part 1: underlying (background) events (March-July)
  – distributed simulation
  – data transfer to CERN
• Part 2: signal events & test of CERN as data source (July-November)
  – distributed simulation, reconstruction, generation of ESD
• Part 3: distributed analysis

Tools
• AliEn – the ALICE Environment for distributed computing
• AliEn – LCG interface

Page 7: Large scale data flow in local and GRID environment

LHCb DC

Physics goals (170M events)
1. HLT studies
2. S/B studies, consolidate background estimates, background properties

Gather information for the LHCb computing TDR
● Robustness test of the LHCb software and production system
● Test of the LHCb distributed computing model
● Incorporation of the LCG application software
● Use of LCG as a substantial fraction of the production capacity

Strategy
1. MC production (April-September)
2. Stripping (event preselection) – still going on
3. Analysis

Page 8: Large scale data flow in local and GRID environment

Details

                      ALICE (AliEn)                   LHCb (DIRAC)
Job granularity       1 job – 1 event                 1 job – 500 events
Raw event size        2 GB                            ~1.3 MB
Output size           ESD: 0.5-50 MB                  DST: 0.3-0.5 MB
CPU time per job      5-20 hours                      28-32 hours
RAM usage             huge                            moderate
Local storage         store local copies              store local copies of DSTs
Sent to CERN          backup                          DSTs and LOGs
Data flow pattern     massive data exchange with      frequent communication with
                      the disk servers                central services

Page 9: Large scale data flow in local and GRID environment

Optimization

April – start of massive LHCb DC: 1 job/CPU – everything OK
→ use hyperthreading, 2 jobs/CPU – increases efficiency by 30-40%

May – start of massive ALICE DC: bad interference with LHCb jobs, frequent NFS crashes
→ restrict the ALICE queue to 10 simultaneous jobs, optimize communication with the disk server (see the sketch below)

June – September: smooth running; resources shared, LHCb – June/July, ALICE – August/September
→ careful online monitoring of jobs (on top of the usual monitoring from the collaborations)
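The actual fix above was a batch-queue limit on running jobs; as an illustration of the same throttling idea, here is a minimal Python sketch that caps the number of simultaneous copies to the NFS-mounted disk server. The limit of 10 mirrors the ALICE queue restriction; the paths and the copy routine are hypothetical.

```python
# Illustrative only: limit how many transfers hit the NFS disk server at once,
# so a burst of finishing jobs cannot overload it.

import shutil
import threading

MAX_CONCURRENT_COPIES = 10                 # mirrors the 10-job limit on the ALICE queue
nfs_slots = threading.BoundedSemaphore(MAX_CONCURRENT_COPIES)

def copy_to_disk_server(src: str, dst: str) -> None:
    """Copy one output file, never exceeding the agreed number of
    simultaneous transfers to the disk server."""
    with nfs_slots:                        # blocks while 10 copies are already in flight
        shutil.copy(src, dst)

# Hypothetical usage: many jobs finishing at once, copies throttled to 10 at a time.
# threads = [threading.Thread(target=copy_to_disk_server,
#                             args=(f"/local/job{i}.root", f"/nfs/ds1/job{i}.root"))
#            for i in range(100)]
# for t in threads: t.start()
# for t in threads: t.join()
```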

Page 10: Large scale data flow in local and GRID environment

Monitoring

Frequent power cuts in summer (4-5 times), ~5% of jobs affected: all intermediate steps are lost (…)
→ provide a reserve power line and a more powerful UPS

Stalled jobs (~10%): infinite loops in GEANT4 (LHCb), crashes of central services
→ write a simple check script and kill such jobs (a bug report is not sent…); a sketch follows below
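The slides do not show the actual check script; the following is a minimal sketch of one way it could work, assuming a job is declared stalled once its wall-clock age exceeds a generous limit (normal DC jobs finish within ~32 hours), which would catch both looping and hung jobs. The executable names used to pick out job processes are placeholders.

```python
# Minimal "check and kill" watchdog sketch for stalled batch jobs on a Linux node.

import os
import signal
import time

MAX_JOB_AGE_HOURS = 48.0                  # anything older than this is assumed stuck
HERTZ = os.sysconf("SC_CLK_TCK")          # kernel clock ticks per second

def uptime_seconds() -> float:
    with open("/proc/uptime") as f:
        return float(f.read().split()[0])

def job_age_hours(pid: int) -> float:
    """Wall-clock age of a process, from the starttime field of /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # fields after the ")" closing the command name; starttime (field 22 of the
    # file) is index 19 of this remainder
    start_ticks = int(stat.rsplit(")", 1)[1].split()[19])
    return (uptime_seconds() - start_ticks / HERTZ) / 3600.0

def watched_pids() -> list[int]:
    """PIDs whose command name matches the (placeholder) simulation executables."""
    pids = []
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            try:
                with open(f"/proc/{entry}/comm") as f:
                    if f.read().strip() in ("Gauss.exe", "aliroot"):
                        pids.append(int(entry))
            except OSError:
                pass                       # process already exited, ignore
    return pids

while True:
    for pid in watched_pids():
        if job_age_hours(pid) > MAX_JOB_AGE_HOURS:
            print(f"killing stalled job, pid {pid}")
            os.kill(pid, signal.SIGKILL)   # no bug report is sent, as noted above
    time.sleep(1800)                       # re-check every 30 minutes
```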

Slow data transfer to CERN: poor and restricted link to CERN, problems with CASTOR
→ automatic retry (sketch below)
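A minimal sketch of such an automatic retry, assuming the transfer is done by an external copy command; the slide does not name the tool, so the rfcp-to-CASTOR example in the usage comment is only illustrative.

```python
# Retry a flaky transfer command with a fixed pause between attempts.

import subprocess
import time

def transfer_with_retry(cmd: list[str], attempts: int = 10, pause_s: int = 600) -> bool:
    """Run a copy command, retrying until it succeeds or the attempt budget runs out."""
    for attempt in range(1, attempts + 1):
        if subprocess.run(cmd).returncode == 0:
            return True
        print(f"transfer failed (attempt {attempt}/{attempts}), retrying in {pause_s}s")
        time.sleep(pause_s)
    return False

# Hypothetical usage, e.g. copying a DST to CASTOR at CERN:
# ok = transfer_with_retry(["rfcp", "dst_000123.root", "/castor/cern.ch/user/..."])
```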

Page 11: Large scale data flow in local and GRID environment

ALICE Statistics

Page 12: Large scale data flow in local and GRID environment

LHCb Statistics

Site            Total Jobs  CPU Time (h)      Events  O.Data (GB)  Events (%)
USA                     56          1408       32500           13       0.02%
Israel                  77          2493       64600           21       0.03%
Brasil                 247          4489      231355           83       0.12%
Switzerland            813         19826      726750          235       0.39%
Taiwan                 595          8332      757200          216       0.41%
Canada                1148         21286     1204200          348       0.65%
Poland                1418         24058     1224500          403       0.66%
Hungary               1817         31103     1999200          592       1.08%
France                5888        135632     4997156         1967       2.69%
Netherlands           6408        131273     7811900         2246       4.21%
Russia               10059        255324     8999750         3388       4.85%
Spain                13378        304433    13687450         4189       7.38%
Germany              17101        275037    17732655         6235       9.56%
Italy                25626        618359    24836950         7763      13.39%
United Kingdom       46580        917874    47535055        14567      25.62%
CERN                 52940        960470    53708405        18948      28.95%
All Sites           184151       3711397   185549626        61214     100.00%

Page 13: Large scale data flow in local and GRID environment

Summary

Quite visible participation in ALICE and LHCb DCs

ALICE → ~5% contribution (ITEP part ~70%)

LHCb → ~5% contribution (ITEP part ~70%)

With only 44 CPUs

Problems reported to colleagues in the collaborations

More attention to LCG now

Distributed analysis – a very different workload pattern